Title: Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
1. Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
University of Illinois
- Mark Hasegawa-Johnson
- Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, Zhen Li, and Thomas Huang
- including also the research results of Laehoon Kim and Harsh Sharma
2. Motivation
- Applications in a multilingual society
  - News Hound: find all TV news segments, in any language, mentioning Barack Obama
  - Language Learner: transcribe a learner's accented speech; tell him which words sound accented
  - Broadcaster/Podcaster: automatically transcribe man-on-the-street interviews in a multilingual city (LA, Singapore)
- Problems
  - Physical variability: noise, echo, talker
  - Imprecise categories: dependent on context
  - Content variability: language, topic, dialect, style
3. Method: Transform and Infer (the ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)
- Signal transforms
- Classifier transforms
- Likelihood vector: b_i = p(observation_t | state_t = i)
- Inference algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)
- Best label sequence: argmax p(label_1, ..., label_T | observation_1, ..., observation_T)
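The pipeline above can be sketched end-to-end with a toy two-state model: per-frame likelihoods stand in for the classifier transforms, a prior plus transition model stands in for the inference stage, and the best label sequence is found by brute-force enumeration. All probabilities below are invented for illustration.

```python
from itertools import product

# Toy "transform and infer": b[state][t] plays the role of
# b_i = p(obs_t | state_t = i); trans and prior form the parametric model.
b = {"sil": [0.9, 0.2, 0.1], "speech": [0.1, 0.8, 0.9]}
trans = {("sil", "sil"): 0.8, ("sil", "speech"): 0.2,
         ("speech", "speech"): 0.8, ("speech", "sil"): 0.2}
prior = {"sil": 0.7, "speech": 0.3}

def joint(labels):
    """p(labels, observations) under the toy model."""
    p = prior[labels[0]] * b[labels[0]][0]
    for t in range(1, len(labels)):
        p *= trans[(labels[t - 1], labels[t])] * b[labels[t]][t]
    return p

# argmax over label sequences of p(labels | observations): since
# p(observations) is constant, maximizing the joint suffices.
# Brute force is fine for 3 frames and 2 states.
best = max(product(["sil", "speech"], repeat=3), key=joint)  # ('sil', 'speech', 'speech')
```

Real systems replace the brute-force search with dynamic programming (Viterbi), since the number of label sequences grows exponentially with T.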
4. Signal Transforms: transforms determined by a physical model of the signal
- A good signal model tells you a lot
  - Reverberation model: y[n] = v[n] + Σ_m h[m] x[n−m]
  - x[n] is produced by a human vocal tract, designed for efficient processing by a human auditory system
- A good signal transform improves the accuracy of all classifiers
  - Denoising: correct for additive noise
  - Dereverberation: correct for convolutional noise
  - Perceptual frequency warping: hear what humans hear
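The reverberation model on this slide can be sketched directly: the clean signal x is convolved with a room impulse response h (convolutional noise) and corrupted by additive noise v. The signal values and impulse response below are invented for illustration.

```python
def reverberate(x, h, v):
    """Return y[n] = v[n] + sum_m h[m] * x[n - m] (the slide's reverberation model)."""
    y = []
    for n in range(len(x)):
        echo = sum(h[m] * x[n - m] for m in range(len(h)) if 0 <= n - m < len(x))
        y.append(v[n] + echo)
    return y

x = [1.0, 0.0, 0.0, 0.0]   # clean speech (a unit impulse, for clarity)
h = [1.0, 0.5, 0.25]       # room impulse response: direct path plus decaying echoes
v = [0.1, 0.1, 0.1, 0.1]   # additive noise floor

y = reverberate(x, h, v)   # approximately [1.1, 0.6, 0.35, 0.1]
```

Denoising and dereverberation are the inverse problems: estimate x given y, using models of h and v.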
5. Denoising Example (Kim et al., 2006)
6. Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
- Robust machine learning:
  - From a limited amount of training data,
  - learn parameterized probability models as precise as possible,
  - ...with a known upper bound on generalization error
- Methods that trade off precision and generalization
  - Decorrelate the signal measurements: PCA, DCT
  - Select the most informative features from an inventory: AdaBoost
  - Train a linear or nonlinear function z_t = f(y_t) that
    - discriminates among the training examples from different classes
    - has known upper bounds on generalization error (SVM, ANN)
  - Train another nonlinear function p(z_t | state_t) with the same properties
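The decorrelation step can be illustrated with the type-II DCT, which plays the same role PCA or the cepstral transform plays in an ASR front end. This is a sketch, not an optimized FFT-based DCT.

```python
import math

def dct2(frame):
    """Type-II DCT of a feature frame: decorrelates correlated measurements,
    e.g., adjacent filterbank energies."""
    N = len(frame)
    return [sum(frame[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(N)]

# A flat (fully correlated) frame concentrates all of its energy
# in coefficient 0; the higher coefficients vanish.
coeffs = dct2([1.0, 1.0, 1.0, 1.0])  # approximately [4.0, 0.0, 0.0, 0.0]
```

Decorrelated features let the Gaussian mixtures on later slides use diagonal covariances, which need far fewer parameters per Gaussian.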
7. Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
8. Inference: integrate information to choose the best global label set
- Labels: variables that matter globally
  - Speech recognition: what words were spoken?
  - Information retrieval: which segment best matches the query?
  - Language learning: where's the error?
- States: variables that can be classified locally
  - May be scalar, e.g., q_t = sub-phoneme
  - May be vector, e.g., q_t = vector of articulatory states
- Inference algorithm: parametric model of p(states, labels)
  - Scalar states: hidden Markov model, finite-state transducer
  - Vector states: dynamic Bayesian network, conditional random field
9. Inference: integrate information to choose the best global label set
10. Example: Language-Independent Phone Recognition (Huang et al., in preparation)
- Signal transforms: voice activity detection, perceptual frequency warping
- Classifier transforms: Gaussian mixtures
- Likelihood vector: b_i = p(observation_t | state_t = i)
- Inference algorithm: hidden Markov model with token passing, p(state_1, ..., state_T, phone_1, ..., phone_T)
- Best label sequence: argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)
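The token-passing hidden Markov model named on this slide can be sketched minimally: each state holds one "token" (best path score and history), and at every frame tokens are passed along transitions and pruned to the best arrival. Model numbers are invented for illustration.

```python
def viterbi(obs_liks, trans, prior):
    """Token-passing Viterbi: return the best state sequence."""
    states = list(prior)
    T = len(next(iter(obs_liks.values())))
    # frame 0: one token per state, (path score, path history)
    tokens = {s: (prior[s] * obs_liks[s][0], [s]) for s in states}
    for t in range(1, T):
        new = {}
        for s in states:
            # pass tokens into s; keep only the best arriving token
            prev = max(states, key=lambda r: tokens[r][0] * trans[(r, s)])
            score = tokens[prev][0] * trans[(prev, s)] * obs_liks[s][t]
            new[s] = (score, tokens[prev][1] + [s])
        tokens = new
    return max(tokens.values(), key=lambda tok: tok[0])[1]

obs_liks = {"sil": [0.9, 0.2, 0.1], "speech": [0.1, 0.8, 0.9]}  # b_i = p(obs_t | state_t = i)
trans = {("sil", "sil"): 0.8, ("sil", "speech"): 0.2,
         ("speech", "speech"): 0.8, ("speech", "sil"): 0.2}
path = viterbi(obs_liks, trans, {"sil": 0.7, "speech": 0.3})  # ['sil', 'speech', 'speech']
```

A production decoder additionally stores word-boundary information in each token and propagates many tokens per state to build the lattices used on later slides; this sketch keeps only the single best path.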
11. A Language-Independent Phone Set (Consonants)
Plus secondary articulations (glottis, pharynx, palate, lips), sequences, and syllabics
12. A Language-Independent Phone Set (Vowels)
13. Training Data
- 10 languages, 11 corpora
  - Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu
- 95 hours of speech
  - Sampled from a larger set of corpora
- Mixed styles of speech: broadcast, read, and spontaneous
14. Summary of Corpora
15. Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)
- Orthographic transcriptions (Urdu: no vowels!)
- Is a diacriticized version available on the web? (yes/no branch)
  - Ruleset 1: [letter] → q, [letter] → k, [letter] → g, ...
  - Ruleset 2: [letter] → A, [letter] → ligature, [letter] → u, ...
- Phonetic transcriptions: /sAhSVbSV/, /sA!iq?/
[Urdu-script examples lost in extraction]
16. Context-Dependent Phones
- Triphones: when is a /t/ not a /t/?
  - writer: the /t/ is unusual; call it /aI-t3r/
  - a tree: the /t/ is unusual; call it /-tr/
  - that soup: the /t/ is unusual; call it /ae-ts/
- Lexical stress
  - The /i/ in "reek" is longer than in "recover"
  - Call them /r-ik'/ vs. /r-ik/
- Punctuation, an easy-to-transcribe proxy for prosody
  - The /n/ in "I'm done." is 2X as long as the /n/ in "Done yet?"
  - Call them /-nPERIOD/ vs. /-nj/
- Language, dialect, style
  - /o/ in "atone": call it /t-oneng/
  - /o/ in ??? (Japanese word, garbled in extraction): call it /t-onjap/
- Gender is handled differently (speaker adaptation)
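The context-dependent relabeling above can be sketched as a mapping from a phone string to left/right context-dependent labels. The HTK-style "left-center+right" notation and the '#' boundary marker are assumptions for illustration; the slide's own labels (e.g., /aI-t3r/ for the /t/ in "writer") use a compressed variant of the same idea.

```python
def triphones(phones):
    """Relabel each phone with its left and right context, 'left-center+right' style.
    '#' marks a missing context at an utterance boundary."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i < len(phones) - 1 else "#"
        out.append(f"{left}-{p}+{right}")
    return out

labels = triphones(["r", "aI", "t", "3r"])  # 'writer'
# labels[2] == 'aI-t+3r': the flapped /t/ of 'writer' gets its own model
```

Stress, punctuation, and language tags extend the label in the same way, which is why the triphone inventory grows so quickly (140K on a later slide) and why decision-tree tying is needed.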
17. Decision Tree State Tying
- Categories for decision tree questions
  - Distinctive phone features (manner/place of articulation) of the right or left context
  - Language identity
  - Dialect identity (L1 vs. L2)
  - Lexical stress
  - Punctuation mark
- Each leaf node contains at least 3.5 seconds of training data
18. Phone Recognition Experiment (Huang et al., in preparation)
- Language-independent triphone bigram language model
- Standard classifier transforms (PLP with delta features, CDHMM, 11-17 Gaussians)
- Vocabulary size: the top 60K most frequent triphones (since 140K is too many!)
- The remaining infrequent triphones are mapped back to their center monophones
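The 60K-triphone vocabulary with monophone backoff can be sketched as follows. The counts, the helper name `build_vocab`, and the "left-center+right" label format are assumptions for illustration.

```python
from collections import Counter

def build_vocab(triphone_counts, top_k):
    """Keep the top_k most frequent triphones; back all others off
    to their center monophone."""
    keep = {t for t, _ in Counter(triphone_counts).most_common(top_k)}
    def map_unit(tri):
        if tri in keep:
            return tri
        return tri.split("-")[1].split("+")[0]   # e.g. 'aI-t+3r' -> 't'
    return map_unit

counts = {"aI-t+3r": 500, "r-aI+t": 120, "t-3r+#": 2}   # toy counts
map_unit = build_vocab(counts, top_k=2)
map_unit("aI-t+3r")   # frequent: kept as 'aI-t+3r'
map_unit("t-3r+#")    # infrequent: backs off to '3r'
```

Backing off to the monophone keeps every unit trainable while still letting frequent contexts earn dedicated models.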
19. Recognition Results (Huang et al., in preparation)
- Test set: 50 sentences per corpus
20. Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)
- Signal transforms: voice activity detection, perceptual frequency warping
- Classifier transforms: Gaussian mixtures
- Likelihood vector: b_i = p(observation_t | state_t = i)
- Inference algorithm: finite-state transducer built from ASR lattices, E(count(query) | observations)
- Retrieval ranking: E(count(query) | segment observations)
21. Information Retrieval: Standard Methods
- Task description: given a query, find the most relevant segments in a database
- Published algorithms
  - EXACT MATCH: segment = argmin d(query, segment)
    - Fast
  - SUMMARY STATISTICS: segment = argmax p(query | segment), no concept of word order
    - Good for text, e.g., Google, Yahoo, etc.
  - TRANSFORM AND INFER: segment = argmax p(query | segment) or E(count(query) | segment); word order matters
    - Flexible, but slow...
22. Language-Independent IR: The Star Challenge
- A multi-language, multi-media broadcast news retrieval competition, sponsored by A*STAR
- Elimination rounds, June-August 2008
  - Three rounds, each of 48 hours' duration
  - 56 teams entered from around the world
  - 5 teams selected for the Grand Finals
- Grand Finals: 10/23/2008, Singapore
23. Star Challenge Tasks
- VT1, VT2: given an image category (e.g., crowd, sports, keyboard), find examples
- AT1: given an IPA phoneme sequence (example: /?ogut?A/), find audio segments
- AT2: given a waveform containing a word or word sequence in any language, find audio segments containing the same word
- AT1+VT2: find a specified video class whose speech contains a given IPA sequence (e.g., man monologue + /gro??/)
24. Star Challenge Simplified Results
- Rounds 1 and 3: 48,000 CPU hours
  - Round 1: English, 20 queries
  - Round 3: English and Mandarin, 3 queries each
- Grand Final: 6 CPU hours
  - English, Mandarin, Malay, and Tamil, 2 queries each
25. Open Research Areas
- When does Transform and Infer help?
  - ROUND 3 (1,000 CPUs, 48 hours): the best algorithms were transform-and-infer
  - GRAND FINAL (3 CPUs, 2 hours): the best algorithms were exact match
- Open research area 1: complexity
  - Inference algorithm + user constraints → simplified classifier
  - Improved transforms and improved classifiers allow the use of a less-constrained user interface
- Open research area 2: accuracy
26. Existence Proof: ASR Can Beat Human Listeners (Sharma et al., in preparation)
- The task: speech of talkers with gross motor disability (cerebral palsy)
- Familiar listeners in familiar situations understand most of what they say... and ASR can also be talker-dependent and vocabulary-constrained
27. Open Research Areas
- Remove the constraints!
  - ASR can beat a human listener if the ASR knows more than the human
  - (e.g., knows the talker and the vocabulary)
- Better knowledge:
  - better signal models
  - better classifiers
  - better inference
28. Thank You! Questions?
29. Decision Tree State Tying (Odell, Woodland and Young, 1994)
- Divide each IPA phone into three temporally sequential states
  - /i/ → /i/onset, /i/center, /i/offset
- Start with one model for each state; create a statistical model p(acoustics | state) using training data
- Ask yes/no questions about context variables
  - Left phone, right phone, lexical stress, language ID
- If p(acoustics | state, yes) ≠ p(acoustics | state, no), split the training data into two groups
  - The yes examples vs. the no examples
- If many such questions exist, choose the best
- Repeat this process as long as each group contains enough training data examples
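A minimal sketch of the splitting loop described above, assuming 1-D Gaussian state models and a hand-made set of questions. The function names, the minimum-count threshold, and the toy data are all illustrative; a real system uses multivariate Gaussians and a likelihood-gain stopping threshold as well.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under their own ML Gaussian (variance floored)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-4)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(examples, questions, min_count=3):
    """examples: (context, acoustic_value) pairs; questions: name -> predicate
    on the context. Return the question with the largest likelihood gain,
    or None if no question leaves enough data on both sides."""
    base = gauss_loglik([x for _, x in examples])
    best, best_gain = None, 0.0
    for name, q in questions.items():
        yes = [x for c, x in examples if q(c)]
        no = [x for c, x in examples if not q(c)]
        if len(yes) < min_count or len(no) < min_count:
            continue   # a leaf would have too little training data
        gain = gauss_loglik(yes) + gauss_loglik(no) - base
        if gain > best_gain:
            best, best_gain = name, gain
    return best

examples = [({"left": l}, x) for l, x in
            [("a", 1.0), ("e", 1.1), ("i", 0.9), ("t", 5.0), ("k", 5.1), ("d", 4.9)]]
questions = {"left-is-vowel": lambda c: c["left"] in "aeiou",
             "left-is-t": lambda c: c["left"] == "t"}
chosen = best_split(examples, questions)   # 'left-is-vowel'
```

Here "left-is-vowel" cleanly separates the two acoustic clusters, while "left-is-t" is rejected because it would leave only one example on the yes side; applying `best_split` recursively to each resulting group yields the tied-state tree.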