Fast speaker-adaptive training for large-vocabulary speech recognition
Publisher:
  • Northeastern University
  • Boston, MA
  • United States
Order Number: AAI9007722
Pages: 153
Abstract

The vast acoustic variability among speakers makes large-vocabulary automatic speech recognition difficult. Speaker-dependent training and speaker-independent training are the two typical methods for dealing with this inter-speaker variation. However, both require a lengthy speech enrollment process before a user can start using the system for large-vocabulary recognition tasks. There is therefore great interest in a training method that requires only a reasonable amount of speech collection before users can start using the system. This dissertation presents an efficient speaker-adaptive training method that requires a large amount of speech from only one prototype speaker and a small amount of speech from each additional target speaker. The method can quickly learn and adapt to the voice of an individual speaker, as well as to the acoustic environment.

In the proposed speaker-adaptive training, we use a speaker normalization procedure that transforms the well-trained phonetic hidden Markov models (HMMs) derived for a prototype speaker into the HMMs for a new target speaker by using a set of transformation matrices. Each matrix represents a probabilistic spectral mapping for a phoneme between the prototype speaker and the target speaker. Several algorithms are investigated in this thesis to estimate reliable transformation matrices that optimize recognition performance. Experiments were performed on a standard 1000-word continuous-speech database with a word-pair grammar of perplexity 60 to evaluate the proposed algorithms. The experimental results show that the recognition performance of speaker-adaptive training which uses only two minutes of target speech approaches the performance of speaker-dependent training with 18 minutes of target speech.
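The transformation step described above can be illustrated with a minimal sketch. It assumes discrete-density HMMs whose states hold probability distributions over quantized spectra, which the abstract does not state explicitly. The function names, the three-entry codebook, and the co-occurrence counts are all hypothetical, and simple row normalization stands in for the estimation algorithms the thesis actually investigates.

```python
def normalize_rows(matrix):
    """Scale each row so it sums to 1 (a row-stochastic matrix)."""
    result = []
    for row in matrix:
        total = sum(row)
        if total <= 0:
            raise ValueError("each row needs positive mass")
        result.append([value / total for value in row])
    return result


def transform_distribution(proto_probs, mapping):
    """Map a prototype-speaker distribution onto target-speaker spectra.

    proto_probs[j] : P(prototype spectrum j | HMM state)
    mapping[j][k]  : P(target spectrum k | prototype spectrum j), one phoneme
    returns        : P(target spectrum k | HMM state)
    """
    n_target = len(mapping[0])
    return [
        sum(proto_probs[j] * mapping[j][k] for j in range(len(proto_probs)))
        for k in range(n_target)
    ]


# Hypothetical toy data: 3 quantized spectra, one phoneme.
proto_state = [0.7, 0.2, 0.1]                # prototype-speaker state distribution
counts = [[8, 2, 0], [1, 6, 3], [0, 1, 9]]   # made-up co-occurrence counts
mapping = normalize_rows(counts)             # assumed estimate of the mapping
target_state = transform_distribution(proto_state, mapping)
```

Because each row of the mapping sums to 1, the transformed vector is again a valid probability distribution, so under these assumptions it could replace the prototype distribution in the target speaker's HMM state.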

We also examined two properties of the proposed speaker-adaptive training method: session effect and reduced training. Experiments show that training and testing in the same session yields a word error rate 10% lower than training and testing in different sessions. Reducing the amount of training speech from 2 minutes to 30 seconds doubles the word error rate. Our results indicate that the speaker-adaptive training method needs at least 1 to 2 minutes of training speech to achieve satisfactory performance.

Contributors
  • Northeastern University
  • BBN Technologies
