US6701291B2 · Expired · Utility

Automatic speech recognition with psychoacoustically-based feature extraction, using easily-tunable single-shape filters along logarithmic-frequency axis

Assignee: LUCENT TECHNOLOGIES INC · Priority: Oct 13, 2000 · Filed: Apr 2, 2001 · Granted: Mar 2, 2004
Est. expiry: Oct 13, 2020 (expired) · nominal 20-yr term from priority
Inventors: LI QI P · SIOHAN OLIVIER · SOONG FRANK KAO-PING
Classifications: G10L 15/02 · G10L 19/0212
PatentIndex Score: 83 · Cited by: 13 · References: 12 · Claims: 60

Abstract

A method and apparatus for extracting speech features from a speech signal in which the linear frequency spectrum data, as generated, for example, by a conventional frequency transform, is first converted to logarithmic frequency spectrum data having frequency data distributed on a substantially logarithmic (rather than linear) frequency scale. Then, a plurality of digital auditory filters is applied to the resultant logarithmic frequency spectrum data, each of these filters having a substantially similar shape, but centered at different points on the logarithmic frequency scale. Because each of the filters has a similar shape, the feature extraction approach of the present invention advantageously can be easily modified or tuned by adjusting each of the filters in a coordinated manner, with the adjustment of only a handful of filter parameters.
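
As an illustration only, the two central steps of the abstract (warping a linear-frequency spectrum onto a logarithmic-like axis, then applying one shared filter shape at several centres) might be sketched as follows. This is not the patent's implementation. The Traunmueller approximation of the Bark scale, the linear interpolation, the triangular shape, and every function and parameter name here are assumptions chosen for brevity.

```python
import math

def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def warp_to_bark(power_spectrum, sample_rate, num_points):
    """Resample a linear-frequency spectrum at equally spaced Bark points."""
    n_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    z_lo, z_hi = hz_to_bark(0.0), hz_to_bark(nyquist)
    warped = []
    for i in range(num_points):
        z = z_lo + (z_hi - z_lo) * i / (num_points - 1)
        pos = bark_to_hz(z) / nyquist * (n_bins - 1)  # fractional bin index
        k = min(int(pos), n_bins - 2)
        frac = pos - k
        # Linear interpolation between the two neighbouring linear bins.
        warped.append((1 - frac) * power_spectrum[k] + frac * power_spectrum[k + 1])
    return warped

def filter_shape(offset, half_width):
    """The single shared shape (a triangle here, purely as a placeholder)."""
    return max(0.0, 1.0 - abs(offset) / half_width)

def apply_filters(warped, num_filters, half_width):
    """Slide the same shape to equally spaced centres on the warped axis."""
    n = len(warped)
    outputs = []
    for m in range(num_filters):
        centre = (m + 1) * (n - 1) / (num_filters + 1)
        outputs.append(
            sum(filter_shape(j - centre, half_width) * warped[j] for j in range(n))
        )
    return outputs
```

Because every filter is the same function of the offset from its centre, changing `half_width` (or swapping the body of `filter_shape`) retunes the whole bank at once, which is the tuning convenience the abstract describes.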

Claims

exact text as granted — not AI-modified
We claim:  
     
       1. A method of extracting speech features from a speech signal for use in performing automatic speech recognition, the method comprising the steps of: 
       performing a time-to-frequency domain transformation on at least a portion of said speech signal to produce a linear frequency spectrum thereof, wherein said linear frequency spectrum comprises frequency data distributed on a substantially linear frequency scale;  
       converting said linear frequency spectrum of said speech signal portion to a logarithmic frequency spectrum thereof, wherein said logarithmic frequency spectrum comprises said frequency data distributed on a substantially logarithmic frequency scale;  
       filtering said logarithmic frequency spectrum of said speech signal portion with a plurality of filters, each of said filters having a substantially similar mathematical shape defined for a plurality of frequencies, and centered at different points on said substantially logarithmic frequency scale; and  
       generating one or more speech features based on one or more outputs of said plurality of filters.  
     
     
       2. The method of  claim 1  wherein said time-to-frequency domain transformation comprises a Fast Fourier Transform. 
     
     
       3. The method of  claim 1  wherein said substantially logarithmic frequency scale comprises a mel scale. 
     
     
       4. The method of  claim 1  wherein said substantially logarithmic frequency scale comprises a Bark scale. 
     
     
       5. The method of  claim 4  wherein said plurality of filters are centered at equal distances along the Bark scale. 
     
     
       6. The method of  claim 1  further comprising the step of applying to said linear frequency spectrum of said speech signal an outer and middle ear transfer function which approximates a human's outer and middle ear signal processing of an incoming speech signal. 
     
     
       7. The method of  claim 1  wherein said step of generating said one or more speech features comprises the steps of 
       performing a discrete cosine transform based on said one or more outputs of said plurality of filters to generate a set of DCT coefficients, and  
       generating said one or more speech features based on said set of DCT coefficients.  
     
     
       8. The method of  claim 7  wherein said step of generating said one or more speech features further comprises the step of modifying said one or more outputs of said plurality of filters by applying a nonlinearity to each one of said outputs, and wherein said discrete cosine transform is applied to said modified outputs. 
     
     
       9. The method of  claim 7  wherein said one or more speech features comprises each of said DCT coefficients and first and second order derivatives thereof. 
     
     
       10. The method of  claim 9  wherein said one or more speech features further comprises a measure of short-term energy of said speech signal. 
     
     
       11. A method of performing automatic speech recognition of a speech signal, the method comprising the steps of: 
       performing a time-to-frequency domain transformation on at least a portion of said speech signal to produce a linear frequency spectrum thereof, wherein said linear frequency spectrum comprises frequency data distributed on a substantially linear frequency scale;  
       converting said linear frequency spectrum of said speech signal portion to a logarithmic frequency spectrum thereof, wherein said logarithmic frequency spectrum comprises said frequency data distributed on a substantially logarithmic frequency scale;  
       filtering said logarithmic frequency spectrum of said speech signal portion with a plurality of filters, each of said filters having a substantially similar mathematical shape defined for a plurality of frequencies, and centered at different points on said substantially logarithmic frequency scale;  
       generating one or more speech features based on one or more outputs of said plurality of filters; and  
       performing speech recognition of said speech signal based on said one or more speech features.  
     
     
       12. The method of  claim 11  wherein said time-to-frequency domain transformation comprises a Fast Fourier Transform. 
     
     
       13. The method of  claim 11  wherein said substantially logarithmic frequency scale comprises a mel scale. 
     
     
       14. The method of  claim 11  wherein said substantially logarithmic frequency scale comprises a Bark scale. 
     
     
       15. The method of  claim 14  wherein said plurality of filters are centered at equal distances along the Bark scale. 
     
     
       16. The method of  claim 11  further comprising the step of applying to said linear frequency spectrum of said speech signal an outer and middle ear transfer function which approximates a human's outer and middle ear signal processing of an incoming speech signal. 
     
     
       17. The method of  claim 11  wherein said step of generating said one or more speech features comprises the steps of 
       performing a discrete cosine transform based on said one or more outputs of said plurality of filters to generate a set of DCT coefficients, and  
       generating said one or more speech features based on said set of DCT coefficients.  
     
     
       18. The method of  claim 17  wherein said step of generating said one or more speech features further comprises the step of modifying said one or more outputs of said plurality of filters by applying a nonlinearity to each one of said outputs, and wherein said discrete cosine transform is applied to said modified outputs. 
     
     
       19. The method of  claim 17  wherein said one or more speech features comprises each of said DCT coefficients and first and second order derivatives thereof. 
     
     
       20. The method of  claim 19  wherein said one or more speech features further comprises a measure of short-term energy of said speech signal. 
     
     
       21. An apparatus for extracting speech features from a speech signal for use in performing automatic speech recognition, the apparatus comprising: 
       a time-to-frequency domain transform applied to at least a portion of said speech signal to produce a linear frequency spectrum thereof, wherein said linear frequency spectrum comprises frequency data distributed on a substantially linear frequency scale;  
       a linear-to-logarithmic frequency spectrum converter applied to said linear frequency spectrum of said speech signal portion to produce a logarithmic frequency spectrum thereof, wherein said logarithmic frequency spectrum comprises said frequency data distributed on a substantially logarithmic frequency scale;  
       a plurality of filters applied to said logarithmic frequency spectrum of said speech signal portion, each of said filters having a substantially similar mathematical shape defined for a plurality of frequencies, and centered at different points on said substantially logarithmic frequency scale; and  
       a speech feature generator which generates one or more speech features based on one or more outputs of said plurality of filters.  
     
     
       22. The apparatus of  claim 21  wherein said time-to-frequency domain transform comprises a Fast Fourier Transform. 
     
     
       23. The apparatus of  claim 21  wherein said substantially logarithmic frequency scale comprises a mel scale. 
     
     
       24. The apparatus of  claim 21  wherein said substantially logarithmic frequency scale comprises a Bark scale. 
     
     
       25. The apparatus of  claim 24  wherein said plurality of filters are centered at equal distances along the Bark scale. 
     
     
       26. The apparatus of  claim 21  further comprising an outer and middle ear transfer function applied to said linear frequency spectrum of said speech signal, wherein said outer and middle ear transfer function approximates a human's outer and middle ear signal processing of an incoming speech signal. 
     
     
       27. The apparatus of  claim 21  wherein said speech feature generator comprises a discrete cosine transform applied to said one or more outputs of said plurality of filters to generate a set of DCT coefficients, and wherein said one or more speech features are generated based on said set of DCT coefficients. 
     
     
       28. The apparatus of  claim 27  wherein said speech feature generator further comprises a nonlinearity module applied to said one or more outputs of said plurality of filters thereby generating one or more modified outputs, and wherein said discrete cosine transform is applied to said modified outputs. 
     
     
       29. The apparatus of  claim 27  wherein said one or more speech features comprises each of said DCT coefficients and first and second order derivatives thereof. 
     
     
       30. The apparatus of  claim 29  wherein said one or more speech features further comprises a measure of short-term energy of said speech signal. 
     
     
       31. An apparatus for performing automatic speech recognition of a speech signal, the apparatus comprising: 
       a time-to-frequency domain transform applied to at least a portion of said speech signal to produce a linear frequency spectrum thereof, wherein said linear frequency spectrum comprises frequency data distributed on a substantially linear frequency scale;  
       a linear-to-logarithmic frequency spectrum converted applied to said linear frequency spectrum of said speech signal portion to produce a logarithmic frequency spectrum thereof, wherein said logarithmic frequency spectrum comprises said frequency data distributed on a substantially logarithmic frequency scale;  
       a plurality of filters applied to said logarithmic frequency spectrum of said speech signal portion, each of said filters having a substantially similar mathematical shape defined for a plurality of frequencies, and centered at different points on said substantially logarithmic frequency scale;  
       a speech feature generator which generates one or more speech features based on one or more outputs of said plurality of filters; and  
       a speech recognizer which recognizes said speech signal based on said one or more speech features.  
     
     
       32. The apparatus of  claim 31  wherein said time-to-frequency domain transform comprises a Fast Fourier Transform. 
     
     
       33. The apparatus of  claim 31  wherein said substantially logarithmic frequency scale comprises a mel scale. 
     
     
       34. The apparatus of  claim 31  wherein said substantially logarithmic frequency scale comprises a Bark scale. 
     
     
       35. The apparatus of  claim 34  wherein said plurality of filters are centered at equal distances along the Bark scale. 
     
     
       36. The apparatus of  claim 31  further comprising an outer and middle inner ear transfer function applied to said linear frequency spectrum of said speech signal, wherein said outer and middle ear transfer function approximates a human's outer and middle ear signal processing of an incoming speech signal. 
     
     
       37. The apparatus of  claim 31  wherein said speech feature generator comprises a discrete cosine transform applied to said one or more outputs of said plurality of filters to generate a set of DCT coefficients, and wherein said one or more speech features are generated based on said set of DCT coefficients. 
     
     
       38. The apparatus of  claim 37  wherein said speech feature generator further comprises a nonlinearity module applied to said one or more outputs of said plurality of filters thereby generating one or more modified outputs, and wherein said discrete cosine transform is applied to said modified outputs. 
     
     
       39. The apparatus of  claim 37  wherein said one or more speech features comprises each of said DCT coefficients and first and second order derivatives thereof. 
     
     
       40. The apparatus of  claim 39  wherein said one or more speech features further comprises a measure of short-term energy of said speech signal. 
     
     
       41. An apparatus for extracting speech features from a speech signal for use in performing automatic speech recognition, the apparatus comprising: 
       means for performing a time-to-frequency domain transformation on at least a portion of said speech signal to produce a linear frequency spectrum thereof, wherein said linear frequency spectrum comprises frequency data distributed on a substantially linear frequency scale;  
       means for converting said linear frequency spectrum of said speech signal portion to a logarithmic frequency spectrum thereof, wherein said logarithmic frequency spectrum comprises said frequency data distributed on a substantially logarithmic frequency scale;  
       means for filtering said logarithmic frequency spectrum of said speech signal portion with a plurality of filters, each of said filters having a substantially similar mathematical shape and centered at different points on said substantially logarithmic frequency scale; and  
       means for generating one or more speech features based on one or more outputs of said plurality of filters.  
     
     
       42. The apparatus of  claim 41  wherein said time-to-frequency domain transformation comprises a Fast Fourier Transform. 
     
     
       43. The apparatus of  claim 41  wherein said substantially logarithmic frequency scale comprises a mel scale. 
     
     
       44. The apparatus of  claim 41  wherein said substantially logarithmic frequency scale comprises a Bark scale. 
     
     
       45. The apparatus of  claim 44  wherein said plurality of filters are centered at equal distances along the Bark scale. 
     
     
       46. The apparatus of  claim 41  further comprising means for applying to said linear frequency spectrum of said speech signal an outer and middle ear transfer function which approximates a human's outer and middle ear signal processing of an incoming speech signal. 
     
     
       47. The apparatus of  claim 41  wherein said means for generating said one or more speech features comprises 
       means for performing a discrete cosine transform based on said one or more outputs of said plurality of filters to generate a set of DCT coefficients, and  
       means for generating said one or more speech features based on said set of DCT coefficients.  
     
     
       48. The apparatus of  claim 47  wherein said means for generating said one or more speech features further comprises means for modifying said one or more outputs of said plurality of filters by applying a nonlinearity to each one of said outputs, and wherein said discrete cosine transform is applied to said modified outputs. 
     
     
       49. The apparatus of  claim 47  wherein said one or more speech features comprises each of said DCT coefficients and first and second order derivatives thereof. 
     
     
       50. The apparatus of  claim 49  wherein said one or more speech features further comprises a measure of short-term energy of said speech signal. 
     
     
       51. An apparatus for performing automatic speech recognition of a speech signal, the apparatus comprising: 
       means for performing a time-to-frequency domain transformation on at least a portion of said speech signal to produce a linear frequency spectrum thereof, wherein said linear frequency spectrum comprises frequency data distributed on a substantially linear frequency scale;  
       means for converting said linear frequency spectrum of said speech signal portion to a logarithmic frequency spectrum thereof, wherein said logarithmic frequency spectrum comprises said frequency data distributed on a substantially logarithmic frequency scale;  
       means for filtering said logarithmic frequency spectrum of said speech signal portion with a plurality of filters, each of said filters having a substantially similar mathematical shape and centered at different points on said substantially logarithmic frequency scale;  
       means for generating one or more speech features based on one or more outputs of said plurality of filters; and  
       means for performing speech recognition of said speech signal based on said one or more speech features.  
     
     
       52. The apparatus of  claim 51  wherein said time-to-frequency domain transformation comprises a Fast Fourier Transform. 
     
     
       53. The apparatus of  claim 51  wherein said substantially logarithmic frequency scale comprises a mel scale. 
     
     
       54. The apparatus of  claim 51  wherein said substantially logarithmic frequency scale comprises a Bark scale. 
     
     
       55. The apparatus of  claim 54  wherein said plurality of filters are centered at equal distances along the Bark scale. 
     
     
       56. The apparatus of  claim 51  further comprising means for applying to said linear frequency spectrum of said speech signal an outer and middle ear transfer function which approximates a human's outer and middle ear signal processing of an incoming speech signal. 
     
     
       57. The apparatus of  claim 51  wherein said means for generating said one or more speech features comprises 
       means for performing a discrete cosine transform based on said one or more outputs of said plurality of filters to generate a set of DCT coefficients, and  
       means for generating said one or more speech features based on said set of DCT coefficients.  
     
     
       58. The apparatus of  claim 57  wherein said step of generating said one or more speech features further comprises means for modifying said one or more outputs of said plurality of filters by applying a nonlinearity to each one of said outputs, and wherein said discrete cosine transform is applied to said modified outputs. 
     
     
       59. The apparatus of  claim 57  wherein said one or more speech features comprises each of said DCT coefficients and first and second order derivatives thereof. 
     
     
       60. The apparatus of  claim 59  wherein said one or more speech features further comprises a measure of short-term energy of said speech signal.
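
For readers who want a concrete picture of the feature-generation steps recited in the dependent claims (a nonlinearity on the filter outputs, a discrete cosine transform, first and second order derivatives, and a short-term energy term), a minimal sketch follows. It is an illustration under stated assumptions, not the patented implementation: the logarithm as the nonlinearity, the central-difference derivative, the default of 12 coefficients, and all function names are hypothetical choices.

```python
import math

def dct_coefficients(filter_outputs, num_coeffs):
    """DCT-II of compressed filter outputs (log is an assumed nonlinearity)."""
    compressed = [math.log(max(v, 1e-10)) for v in filter_outputs]
    n = len(compressed)
    return [
        sum(c * math.cos(math.pi * k * (j + 0.5) / n) for j, c in enumerate(compressed))
        for k in range(num_coeffs)
    ]

def deltas(frames):
    """Time derivative per coefficient by central difference, edges clamped."""
    last = len(frames) - 1
    return [
        [(b - a) / 2.0 for a, b in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
        for t in range(len(frames))
    ]

def feature_vectors(filter_outputs_per_frame, energies, num_coeffs=12):
    """Per frame: DCT coefficients, their first and second derivatives, energy."""
    cep = [dct_coefficients(f, num_coeffs) for f in filter_outputs_per_frame]
    d1 = deltas(cep)
    d2 = deltas(d1)
    return [c + a + b + [e] for c, a, b, e in zip(cep, d1, d2, energies)]
```

With `num_coeffs=12` each frame yields 37 values: 12 coefficients, 12 first derivatives, 12 second derivatives, and one energy term supplied by the caller.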
