Abstract

Many speech recognition systems use mel-frequency cepstral coefficient (MFCC) feature extraction as a front end. In the algorithm, a speech spectrum passes through a bank of mel-spaced triangular filters, and the filter output energies are log-compressed and transformed to the cepstral domain by the discrete cosine transform (DCT). The spacing of the filter bank center frequencies mimics the known warped-frequency characteristics of the human auditory system, yet the bandwidths of these filters are not chosen through biological inspiration. Instead, they are set by aligning the endpoints of each triangle with the center frequencies of the neighboring filters, and the triangle is itself an arbitrary shape. It is surprising that, for such a popular speech recognition front end, proper analysis or optimization of the filter bandwidths has not been performed. Complex cochlear models use realistic filter shapes that more closely approximate critical bands. Compared to the filters used in MFCC, these filters are considerably wider and overlap more with neighboring filters. We have extended this filter characteristic to the MFCC algorithm and found that the increased filter bandwidth improves recognition performance in clean speech and provides added noise robustness as well.
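The pipeline described in the abstract (mel-spaced triangular filters, log compression, DCT) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function names, the `width_scale` parameter, the common 2595·log10(1 + f/700) mel formula, and the choice to build symmetric triangles directly on the mel axis are all assumptions made for the example. Setting `width_scale=1.0` places each triangle's endpoints at the neighboring center frequencies, and larger values widen the filters so that they overlap more. The abstract does not say how much wider the filters in the study were.

```python
import math

def hz_to_mel(f_hz):
    # A commonly used mel-scale formula (assumed here; the abstract does not specify one).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate, width_scale=1.0):
    """Triangular filters with mel-spaced centers over n_bins bins from 0 Hz to Nyquist.

    width_scale is a hypothetical knob: 1.0 puts each triangle's endpoints at the
    neighboring centers, values above 1.0 widen every triangle on the mel axis.
    """
    nyquist = sample_rate / 2.0
    step = hz_to_mel(nyquist) / (n_filters + 1)   # center spacing in mels
    half_width = width_scale * step               # half the triangle base in mels
    bin_mels = [hz_to_mel(k * nyquist / (n_bins - 1)) for k in range(n_bins)]
    bank = []
    for i in range(1, n_filters + 1):
        center = i * step
        # Weight falls linearly from 1 at the center to 0 at center +/- half_width.
        bank.append([max(0.0, 1.0 - abs(m - center) / half_width) for m in bin_mels])
    return bank

def mfcc(power_spectrum, bank, n_ceps):
    """Filter-bank energies -> natural log -> unnormalized DCT-II."""
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
    log_e = [math.log(max(e, 1e-10)) for e in energies]   # floor avoids log(0)
    n = len(log_e)
    return [
        sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / n) for j in range(n))
        for c in range(n_ceps)
    ]

# Toy input: a flat power spectrum with 129 bins at an 8 kHz sample rate.
spectrum = [1.0] * 129
narrow = mel_filterbank(20, 129, 8000, width_scale=1.0)
wide = mel_filterbank(20, 129, 8000, width_scale=2.0)
coeffs_narrow = mfcc(spectrum, narrow, 13)
coeffs_wide = mfcc(spectrum, wide, 13)
```

Comparing `narrow` and `wide` row by row shows the only difference between the two front ends in this sketch: the centers stay fixed while each filter covers more spectrum bins, which is the kind of bandwidth change the abstract discusses.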

Department(s)

Electrical and Computer Engineering

International Standard Serial Number (ISSN)

1520-6149

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.

Publication Date

01 Jan 2002
