




























































































This dissertation is submitted for the degree of Doctor of Philosophy to the University of Cambridge.
This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It has not been submitted in whole or in part for a degree at any other university. Some of the work has been published previously in conference proceedings [93, 94, 95] and technical reports [90, 91, 92]. The length of this thesis including appendices, references, footnotes, tables and equations is approximately 53,000 words and contains 38 tables and 41 figures.
First and foremost, I would like to thank my supervisor Mark Gales for his always insightful suggestions and expert guidance. His unwavering commitment to his students and constant demand for excellence really helped me bring this work to fruition. It has been a privilege and memorable experience to work with Mark.

Secondly, I would like to express my gratitude to Toshiba Research Europe, and in particular Drs. Masami Akamine and Kate Knill, for providing the generous funding that made this research possible.

I am grateful to Steve Young and Phil Woodland for providing the excellent research facilities here in the Machine Intelligence Laboratory at Cambridge University. To the many people who have contributed to developing and maintaining HTK, I am indebted to you for providing such useful software for conducting this work. I am most obliged to Anna Langley and Patrick Gosling for their excellent work in managing the computing facilities here in the lab and quickly dealing with spontaneous shutdowns, overheating processors and the incessant demands on space, bandwidth and memory.

I would like to thank Mitch Weintraub and Brian Strope for their early encouragement in the field of speech recognition and noise robustness. I thank Matt Stuttle for his help in preparing the RM corpus in the initial stages of my research. I would also like to thank James Nealand for his diligent assistance in setting up the Toshiba corpus. I appreciate the discussions on my work with those at various conferences, in particular Jasha Droppo, Michael Picheny, Li Deng, and Dan Povey; their views gave me different perspectives on my work. I also appreciate Sharmaine and Vidura's kind help at the last stages of writing up this thesis.

I acknowledge Sarah Airey, Rogier van Dalen, Darren Green, Andrew Liu, Chris Longworth and Kai Yu for going over various sections in this thesis; special thanks to Mark and Catherine for proofreading and providing useful feedback on large portions of this work. I will indeed miss the supervisions with Mark, discussions about MBR with Catherine, adaptation and notation with Kai, learning about decoders from Andrew, and numerous chats about kernels with Chris.

Thanks to my friends and acquaintances from college, the MCR social circle, volleyball, hockey, the CCC, the CHC and HRM for your support and making my time in Blighty most enjoyable. To my friends back at home and in the Bay Area, thanks for always making it feel like I'd never left, despite my short visits and poor attempts at keeping in touch. And lastly, but most of all, thanks to my family for their unconditional love and support.
ASR Automatic Speech Recognition
BN Broadcast News
CMLLR Constrained MLLR
CMN Cepstral Mean Normalisation
CVN Cepstral Variance Normalisation
DBN Dynamic Bayesian Network
DCT Discrete Cosine Transform
DPMC Data-driven PMC
EM Expectation Maximisation
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
HMM Hidden Markov Model
HTK HMM Toolkit
IDCT Inverse DCT
IPMC Iterative PMC
JUD Joint Uncertainty Decoding
LVCSR Large Vocabulary Continuous Speech Recognition
MAP Maximum A Posteriori
MFCC Mel-Frequency Cepstral Coefficients
ML Maximum Likelihood
MLLR Maximum Likelihood Linear Regression
MMSE Minimum Mean Squared Error
PDF Probability Density Function
PMC Parallel Model Combination
POF Probabilistic Optimal Filtering
RM Resource Management
SAT Speaker Adaptive Training
SNR Signal-to-Noise Ratio
SPLICE Stereo Piece-wise Linear Compensation for Environments
STC Semi-tied Covariance
UD Uncertainty Decoding
VTS Vector Taylor Series
WER Word Error Rate
WSJ Wall Street Journal
A^-1 inverse of matrix A
ai column vector that is the ith column of A
āi row vector that is the ith row of A
aij scalar value that is the element in row i and column j of A
I identity matrix
1 column vector of 1's
∆ij all-zero matrix, except for a 1 in row i and column j
b column vector
a ◦ b element-wise product of a and b, yielding a column vector
a • b dot product of a and b, yielding a scalar value
Observations
T number of frames in a sequence of observations
t time frame index
D number of dimensions of the full feature vector
Ds number of dimensions of the static, delta, or delta-delta components of the feature vector—therefore 3 × Ds = D
d dimension index
S sequence of clean speech vectors [s1 s2 · · · sT]
st complete clean speech vector, comprised of static, delta and delta-delta clean speech vectors—that is st = [xt^T ∆xt^T ∆²xt^T]^T
O sequence of noise-corrupted speech vectors [o1 o2 · · · oT]
ot complete noise-corrupted speech vector, comprised of static, delta and delta-delta noise-corrupted speech vectors—that is ot = [yt^T ∆yt^T ∆²yt^T]^T
nt complete additive noise vector, comprised of static, delta and delta-delta additive noise vectors—that is nt = [zt^T ∆zt^T ∆²zt^T]^T
h convolutional noise vector
C discrete cosine transform matrix
C^-1 inverse discrete cosine transform matrix
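The delta and delta-delta components referred to above are commonly computed by a linear regression over neighbouring static frames and then appended to the statics, giving D = 3 × Ds. The sketch below illustrates this; the regression window, the edge padding and the helper names are assumptions made for illustration only, not necessarily the configuration used in this work.

```python
import numpy as np

def deltas(feats, window=2):
    """Delta coefficients via the common regression formula:
    d_t = sum_th th * (c_{t+th} - c_{t-th}) / (2 * sum_th th^2),
    with edge frames handled by repeating the first/last frame."""
    num_frames = feats.shape[0]
    denom = 2.0 * sum(th * th for th in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    out = np.zeros_like(feats, dtype=float)
    for th in range(1, window + 1):
        out += th * (padded[window + th:window + th + num_frames]
                     - padded[window - th:window - th + num_frames])
    return out / denom

def stack_full_vector(static):
    """Assemble ot = [yt^T ∆yt^T ∆²yt^T]^T for every frame: static features
    with delta and delta-delta appended, so the full dimension is 3 * Ds."""
    d = deltas(static)
    dd = deltas(d)
    return np.hstack([static, d, dd])

# Example: 100 frames of 13-dimensional static features -> 39-dimensional vectors.
static_feats = np.random.randn(100, 13)
full = stack_full_vector(static_feats)
assert full.shape == (100, 39)
```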
Probability and Distributions
P(·) probability mass function
p(·) probability density function
p(x, y) joint probability density function—that is, the probability density of having both x and y
p(x|y) conditional probability density function of having x given y
N(μ, Σ) multivariate Gaussian distribution with mean vector μ and covariance matrix Σ
N(x; μ, Σ) probability of vector x given a multivariate Gaussian distribution
δ(x) Dirac delta function, which has a value of 0 for x ≠ 0 and integrates to 1
δij Kronecker delta symbol, which equals 1 when i = j and is 0 otherwise
Γ(·) Gamma function
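For reference, N(x; μ, Σ) above denotes the usual multivariate Gaussian density, with D the dimensionality of x:

```latex
\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu},\boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
    \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}
    \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```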
HMM Parameters
M set of clean speech acoustic model parameters
M̂ set of estimated corrupted speech acoustic model parameters
M̌ set of front-end model parameters
Mn set of noise model parameters
Θ set of all possible state sequences θ for a transcription Wr
θ sequence of discrete clean speech states [θ1 θ2 · · · θT]
θn sequence of discrete noise speech states [θn1 θn2 · · · θnT]
M set of all possible component sequences m for a transcription Wr
K number of GMM components in the front-end model
M number of GMM components in the full acoustic model
R number of regression classes—that is, the number of clusters of acoustic model components
rm regression class for component m
ǎ(k) parameter a is associated with front-end component k
a(m) parameter a is associated with acoustic model component m
a(rm), a(r) parameter a is associated with regression class rm or just class r
č(k) component prior associated with front-end component k
c(m) component prior associated with acoustic model component m
μx(m), Σx(m) static clean speech mean and variance of component m
μ∆x(m), Σ∆x(m) delta clean speech mean and variance of component m
μ∆²x(m), Σ∆²x(m) delta-delta clean speech mean and variance of component m
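The component priors and Gaussian parameters above combine into the usual mixture density, written out here for reference (whether the sum runs over the components of a single state or of the whole model depends on the context):

```latex
p(\mathbf{x}_t) = \sum_{m=1}^{M} c^{(m)}\,
  \mathcal{N}\!\left(\mathbf{x}_t;\,\boldsymbol{\mu}_x^{(m)},\,\boldsymbol{\Sigma}_x^{(m)}\right)
```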
LIST OF TABLES
5.1 Number of free parameters to estimate for diagonal forms of various noise compensation schemes
5.2 Computational cost for diagonal forms of different noise compensation schemes
8.1 WER (%) for 256-component front-end GMM schemes compensating clean models on Aurora2 test set A averaged across N1-N4
8.2 WER (%) for 256-component front-end UD schemes using noisy GMM and compensating clean models, varying parameter flooring, on Aurora2 test set A averaged across N1-N4
8.3 Number of insertions, % of total errors in parentheses, for 256-component FE-Joint compensation, varying ρ flooring, on Aurora2 N1 subway noise
8.4 WER (%) for diagonal and full matrix JUD compensation of clean models on Aurora2 test set A averaged across N1-N4
8.5 WER (%) for various noise robustness techniques compensating clean models on Aurora2 test set A averaged across N1-N4
8.6 WER (%) for a variety of techniques compensating clean models on Operations Room corrupted RM task at 20 dB SNR (EDA)
8.7 WER (%) for feature-based techniques compensating clean models on Operations Room corrupted RM task at 20 dB SNR (EDA)
8.8 WER (%) for model-based techniques compensating clean models on Operations Room corrupted RM task at 20 dB SNR (EDA)
8.9 WER (%) and average number of active models when compensating clean acoustic models on Operations Room corrupted RM task at 20 dB SNR (EDA)
8.10 WER (%) and log-likelihood for VTS compensation of clean models on Operations Room corrupted RM task at 20 dB SNR (0DA) varying dimensions compensated and noise model estimation
8.11 WER (%) for VTS compensation of clean models on Operations Room corrupted RM task at 20 dB SNR (0DA) varying estimation level, noise model and hypothesis
8.12 WER (%) for 16-diagonal M-Joint compensation of clean and multistyle models, comparing noise estimation type, on Operations Room corrupted RM task at 20 dB SNR (0DA)
8.13 WER (%) and log-likelihood for 16-diagonal M-Joint and VTS compensation of clean models, varying number of EM iterations and updating hypothesis, on Operations Room corrupted RM task at 20 dB SNR (0DA)
8.14 WER (%) for VTS compensation of clean models, varying noise estimation speech models, on Operations Room corrupted RM task at 20 dB SNR (0DA)
8.15 WER (%) for model-based compensation of multistyle models, comparing noise estimation speech model and amount of adaptation data, on Operations Room corrupted RM task at 14 dB SNR (0DA)
8.16 WER (%) for 16-diagonal M-Joint compensation of clean, multistyle and JAT acoustic models, on clean and corrupted RM task (0DA)
8.17 WER (%) for JAT, NAT-CMLLR and SAT-CMLLR systems on Operations Room corrupted RM task (0DA)
8.18 WER (%) for block-diagonal semi-tied transform combined with 16 diagonal M-Joint transforms with clean and multistyle acoustic models on Operations Room corrupted RM task (0DA)
8.19 WER (%) for 16-diagonal M-Joint with 2-full CMLLR compensation of multistyle and JAT models on Operations Room corrupted RM task (0DA)
9.1 SNR and number of utterances for focus conditions in test set bneval98
9.2 WER (%) for 256 diagonal M-Joint transform and VTS compensation of multistyle models on bneval98 and bndev03
9.3 WER (%) for 256 diagonal M-Joint transform and VTS compensation of multistyle models on bneval98 broken down by focus condition
9.4 WER (%) for 256 diagonal M-Joint transform compensation of multistyle and JAT models on bneval98 and bndev03
9.5 Average SNR level of TREL-CRL04 test set conditions
9.6 Utterance length mean and standard deviation in TREL-CRL04 test sets
9.7 Summary of multistyle training data for TREL-CRL04 system, SNR in dB
9.8 WER (%) for CMLLR, 16-diagonal M-Joint and VTS compensation of clean models on TREL-CRL04 digits task
9.9 WER (%) for CMN+CVN and 4-component Gaussianisation with multistyle models on TREL-CRL04 digits task
9.10 WER (%) for CMLLR, 16-diagonal M-Joint and VTS compensation of multistyle models on TREL-CRL04 digits task
9.11 WER (%) for VTS compensation of multistyle models, varying supervision mode and number of EM iterations, on TREL-CRL04 digits task
9.12 WER (%) for 16-diagonal M-Joint and VTS compensation of multistyle models, varying the noise estimation type, on TREL-CRL04 digits task
9.13 WER (%) for CMLLR, PCMLLR or M-Joint compensation of multistyle models on TREL-CRL04 digits task comparing estimation with all utterances to only one utterance per speaker
9.14 WER (%) for 16-diagonal M-Joint combined with 2 full CMLLR transforms compensating multistyle and JAT models on TREL-CRL04 digits task
9.15 WER (%) for 16-diagonal M-Joint compensation on TREL-CRL04 digits task comparing estimation with all utterances to only one per speaker
9.16 WER (%) for 16-diagonal M-Joint compensation on TREL-CRL04 digits comparing HMM or GMM speech model for noise model estimation
9.17 WER (%) for 16-diagonal M-Joint compensation of multistyle or JAT models on TREL-CRL04 city names task
LIST OF FIGURES
5.1 Joint distribution of clean xt^l and corrupted speech yt^l with an additive noise source N(3, 1) in the log-spectral domain
5.2 Front-end uncertainty decoding
5.3 Plot of log energy dimension from Aurora2 digit string 8-6-zero-1-1-6-2, showing 16-component GMM FE-Joint estimate a(k*)ot + b(k*), uncertainty bias σb(k*), and a(k*)
5.4 Plot of log energy dimension from Aurora2 digit string 8-6-zero-1-1-6-2, showing 16-component GMM FE-Joint estimate a(k*)ot + b(k*), and uncertainty bias σb(k*), with correlation flooring ρ = 0.1
5.5 Model-based joint uncertainty decoding
5.6 Estimating model-based joint uncertainty decoding transforms
5.7 Comparing Monte Carlo and VTS generated corrupted speech yt^l distributions and cross-covariance between clean and corrupted speech in the log-spectral domain
5.8 Corrupted speech conditional distribution with additive noise zt^l ∼ N(4, 1) in the log-spectral domain. Various distributions are fitted to the simulated data
5.9 Corrupted speech conditional distribution with clean speech xt = 7, additive noise zt^l ∼ N(4, 1). Single Gaussian and 2-component GMM fitted (components dotted)
6.1 EM-based ML noise model estimation procedure
6.2 Noise model estimation back-off procedure
6.3 Noise model estimation back-off example
7.1 Joint adaptive training
8.1 Clean spectrum (left) compared with corrupting Operations Room noise at 8 dB SNR (right) for the utterance "Clear all windows"
8.2 Graph of auxiliary function value during ML VTS noise model estimation
9.1 Broadcast News transcription system architecture
Automatic speech recognition (ASR) has improved markedly over the last decade such that it can be used to transcribe speech in a variety of domains such as consumer goods^1, call centre applications^2 and desktop personal computer software^3. However, recognition accuracy is still far from human levels. Humans make mistakes at a rate of less than one hundredth of a percent [97] when recognising strings of digits, while the best machine error rates have only advanced from 0.72% to 0.55% over the last decade [155]. For more difficult tasks the difference narrows: for example, on telephone conversation transcription [56] the human word error rate is about 4%, while the error rates of state-of-the-art automatic transcription systems are still over three times worse [18, 32]. The difference between human and machine performance has been attributed to a variety of causes, including the immense variability of speech [105], poor modelling of spontaneous speech [97, 112], and fundamental limitations in conventional speech feature extraction [107] and the statistical framework [13].

Despite this "performance gap", basic ASR technology has advanced to a level where it may be applied in a variety of commercial applications. However, a major problem is robustness to noise. Despite decades of research on noise robustness, leading researchers in the field such as Nelson Morgan and Sadaoki Furui have called for a serious effort to improve recogniser performance in noise [38]. The reason for poor accuracy in noise is a mismatch between the original conditions of the data used to train the recogniser and the actual noisy environment it is used in.
^1 For example, voice-dialling in mobile phones or controlling toys such as the robotic dog AIBO (1999), or the interactive doll Amazing Amanda (2005). Visit the Saras Institute (http://www.sarasinstitute.org/) for an extensive history.
^2 Examples include Charles Schwab's stock trading and lookup system (Nuance), Cineworld's Movieline for movie information and booking (Telephonetics), or Verizon's 411 directory assistance (Microsoft/Tellme).
^3 Such as Dragon NaturallySpeaking, IBM ViaVoice or Microsoft's Whisper ASR Engine for its Windows operating system.
Following this introduction, a brief overview of automatic speech recognition using hidden Markov models (HMMs) is given in chapter 2, along with semi-tied covariance modelling, adaptation and adaptive training, which will all be evaluated in this work. A model of the noisy acoustic environment is presented, along with the effects that noise has on ASR, in chapter 3. Chapter 4 reviews some relevant noise robustness techniques. In particular, observation uncertainty, SPLICE with and without uncertainty, and VTS compensation are discussed because they provide interesting theoretical and practical comparisons for uncertainty decoding. Joint uncertainty decoding is formally presented in chapter 5. A method for estimating a model of the noise to predict JUD transforms is given in chapter 6. These Joint transforms are applied in an adaptive training framework described in chapter 7. In chapter 8, experimental results on artificially corrupted corpora, Aurora2 and Resource Management, are presented. Chapter 9 evaluates the various techniques on speech recorded in noisy conditions, such as Broadcast News and Toshiba Research Europe's internal collection of in-car speech data. Finally, conclusions and future research directions are presented in chapter 10.
Automatic speech recognition is a classic pattern recognition problem where the goal is to automatically produce a text transcription of spoken words. The major concerns are finding a compact set of classification features and determining a suitable means of recognising words from these features. The features should be a compact representation of the audio signal that is optimal for discrimination. The majority of recognisers use HMMs as models of speech, although how they are trained can vary. Finding the actual transcription, also known as decoding, should be efficient as well as accurate. This chapter describes this standard approach to automatic speech recognition in detail.
The main components of a generic speech recognition system, or recogniser, and how they interact are shown in figure 2.1. The input speech, captured by some transducer, is processed by the front-end to provide a compact and effective set of features for recognition. The front-end may perform some speech detection, also known as endpointing, to remove background silence, or noise reduction, before passing feature vectors to the decoder. Given the features provided by the front-end, the goal is to classify or "recognise" the speech uttered. This amounts to "decoding" the most likely word sequence Wh, i.e. the hypothesis, given the observation sequence S and a set of model parameters M̄:

Wh = argmax_W P(W|S; M̄)
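As a toy illustration of this decision rule, the sketch below scores a handful of candidate word sequences with an acoustic log-likelihood plus a language model log-prior and returns the argmax. The scoring functions and hypothesis list are made-up placeholders, not part of this thesis, and a real decoder searches this space implicitly (for example with the Viterbi algorithm) rather than enumerating hypotheses.

```python
import math

def decode(observations, hypotheses, acoustic_loglik, lm_logprob):
    """Pick Wh = argmax_W p(S|W) P(W): score every candidate word sequence
    and return the highest-scoring one."""
    best, best_score = None, -math.inf
    for words in hypotheses:
        score = acoustic_loglik(observations, words) + lm_logprob(words)
        if score > best_score:
            best, best_score = words, score
    return best

# Hypothetical usage with placeholder models (assumptions for illustration only).
hyps = [("clear", "all", "windows"), ("clear", "windows")]
acoustic = lambda obs, w: -float(len(w))      # stand-in acoustic model score
lm = lambda w: math.log(1.0 / len(hyps))      # uniform language model prior
print(decode(observations=None, hypotheses=hyps,
             acoustic_loglik=acoustic, lm_logprob=lm))
```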