Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
1
Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
University of Illinois
  • Mark Hasegawa-Johnson
  • Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting
    Huang, Xi Zhou, Zhen Li, and Thomas Huang
  • also including the research results of
  • Laehoon Kim and Harsh Sharma

2
Motivation
  • Applications in a Multilingual Society
  • News Hound: Find all TV news segments, in any language, mentioning "Barack Obama"
  • Language Learner: Transcribe a learner's accented speech; tell them which words sound accented
  • Broadcaster/Podcaster: Automatically transcribe man-on-the-street interviews in a multilingual city (LA, Singapore)
  • Problems
  • Physical variability: noise, echo, talker
  • Imprecise categories: dependent on context
  • Content variability: language, topic, dialect, style

3
Method: Transform and Infer (a ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)
Signal transforms
Classifier transforms
Likelihood Vector: b_i = p(observation_t | state_t = i)
Inference Algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)
Best label sequence: argmax p(label_1, ..., label_T | observation_1, ..., observation_T)
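As a concrete illustration of the likelihood-vector step, here is a minimal sketch (assuming diagonal-covariance Gaussian mixture state models; the function names are illustrative, not the system's actual code) that computes b_i = p(observation_t | state_t = i) for each state:

```python
import numpy as np

def gmm_loglike(obs, means, variances, weights):
    """Log-likelihood of one observation under a diagonal-covariance GMM.

    obs: (D,) feature vector; means/variances: (M, D); weights: (M,).
    """
    diff = obs - means                                        # (M, D)
    exponent = -0.5 * np.sum(diff ** 2 / variances, axis=1)   # per-mixture
    lognorm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    # log-sum-exp over mixture components
    return np.logaddexp.reduce(np.log(weights) + lognorm + exponent)

def likelihood_vector(obs, state_gmms):
    """Likelihood vector b_i = log p(obs_t | state_t = i) over all states.

    state_gmms: list of (means, variances, weights) tuples, one per state.
    """
    return np.array([gmm_loglike(obs, *gmm) for gmm in state_gmms])
```

This vector is what the inference algorithm consumes at each frame.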
4
Signal Transforms: transforms determined by a physical model of the signal
  • A good signal model tells you a lot
  • Reverberation model: y[n] = v[n] + Σ_m h[m] x[n−m]
  • x[n] is produced by a human vocal tract, designed for efficient processing by a human auditory system
  • A good signal transform improves the accuracy of all classifiers
  • Denoising: correct for additive noise
  • Dereverberation: correct for convolutional noise
  • Perceptual frequency warping: hear what humans hear
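The reverberation model above is easy to simulate; this sketch (the function name is illustrative) applies the convolutional and additive terms with NumPy:

```python
import numpy as np

def reverberate(x, h, v):
    """Apply the slide's signal model: y[n] = v[n] + sum_m h[m] * x[n-m].

    x: clean speech, h: room impulse response, v: additive noise.
    """
    y = np.convolve(x, h)[:len(x)]   # convolutional (room/channel) distortion
    return y + v[:len(y)]            # additive noise

# Dereverberation and denoising aim to invert this model before classification.
```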

5
Denoising Example (Kim et al., 2006)
6
Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
  • Robust Machine Learning
  • From a limited amount of training data,
  • Learn parameterized probability models as precise as possible,
  • ...with a known upper bound on generalization error
  • Methods that trade off precision and generalization
  • Decorrelate the signal measurements: PCA, DCT
  • Select the most informative features from an inventory: AdaBoost
  • Train a linear or nonlinear function z_t = f(y_t) that:
  • Discriminates among the training examples from different classes
  • Has known upper bounds on generalization error (SVM, ANN)
  • Train another nonlinear function p(z_t | state_t) with the same properties
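The DCT decorrelation step can be sketched as follows, assuming log filterbank energies as input and a type-II DCT basis (the transform used to produce cepstral features); names and shapes are illustrative:

```python
import numpy as np

def dct_decorrelate(frames, n_coeffs):
    """Roughly decorrelate filterbank features with a type-II DCT.

    frames: (T, D) array of log filterbank energies.
    Returns a (T, n_coeffs) array of decorrelated coefficients.
    """
    T, D = frames.shape
    n = np.arange(D)                       # filterbank channel index
    k = np.arange(n_coeffs)[:, None]       # cepstral coefficient index
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * D))  # (n_coeffs, D)
    return frames @ basis.T
```

Because neighboring filterbank channels are highly correlated, projecting onto this cosine basis concentrates the information in a few coefficients, which suits diagonal-covariance Gaussian models.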

7
Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
8
Inference: integrate information to choose the best global label set
  • Labels: variables that matter globally
  • Speech Recognition: what words were spoken?
  • Information Retrieval: which segment best matches the query?
  • Language Learning: where's the error?
  • States: variables that can be classified locally
  • May be scalar, e.g., q_t = sub-phoneme
  • May be vector, e.g., q_t = vector of articulatory states
  • Inference algorithm: a parametric model of p(states, labels)
  • Scalar states: Hidden Markov Model, Finite State Transducer
  • Vector states: Dynamic Bayesian Network, Conditional Random Field
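For the scalar-state case, HMM inference is the Viterbi algorithm. Here is a minimal log-space sketch that takes the per-frame likelihood vectors as input (the interface is illustrative, not the system's actual decoder):

```python
import numpy as np

def viterbi(log_b, log_A, log_pi):
    """Most likely state sequence under an HMM.

    log_b: (T, N) log-likelihoods log p(obs_t | state = i) (the b_i vectors).
    log_A: (N, N) log transition probabilities; log_pi: (N,) log initial probs.
    Returns the argmax state sequence as a list of ints.
    """
    T, N = log_b.shape
    delta = log_pi + log_b[0]               # best partial-path scores
    back = np.zeros((T, N), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A     # (prev_state, cur_state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_b[t]
    # trace back from the best final state
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```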

9
Inference: integrate information to choose the best global label set
10
Example: Language-Independent Phone Recognition (Huang et al., in preparation)
Voice activity detection; perceptual frequency warping; Gaussian mixtures
Likelihood Vector: b_i = p(observation_t | state_t = i)
Inference Algorithm: Hidden Markov Model with Token Passing, p(state_1, ..., state_T, phone_1, ..., phone_T)
Best label sequence: argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)
11
A Language-Independent Phone Set (Consonants)
Plus secondary articulations (glottis, pharynx,
palate, lips), sequences, and syllabics
12
A Language-Independent Phone Set (Vowels)
13
Training Data
  • 10 languages, 11 corpora
  • Arabic, Croatian, English, Japanese, Mandarin,
    Portuguese, Russian, Spanish, Turkish, Urdu
  • 95 hours of speech
  • Sampled from a larger set of corpora
  • Mixed styles of speech: broadcast, read, and spontaneous

14
Summary of Corpora
15
Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)
[Flowchart: orthographic transcriptions are converted to phonetic transcriptions. Depending on whether a diacriticized version is available on the web (yes/no branch), letter-to-phone Ruleset 1 (mapping letters to /q/, /k/, /g/, ...) or Ruleset 2 (mapping letters to /A/, ligatures, /u/, ...) applies. Urdu is a hard case: the orthography writes no vowels. The Urdu-script example words did not survive transcription.]
Phonetic Transcriptions: /sAhSVbSV/, /sA!iq?/
16
Context-Dependent Phones
  • Triphones: when is a /t/ not a /t/?
  • "writer": /t/ is unusual; call it /aI-t+3r/
  • "a tree": /t/ is unusual; call it /-t+r/
  • "that soup": /t/ is unusual; call it /ae-t+s/
  • Lexical stress
  • /i/ in "reek" is longer than in "recover"
  • Call them /r-i+k'/ vs. /r-i+k/
  • Punctuation, an easy-to-transcribe proxy for prosody
  • /n/ in "I'm done." is 2X as long as /n/ in "Done yet?"
  • Call them /-nPERIOD/ vs. /-nj/
  • Language, Dialect, Style
  • /o/ in "atone": call it /t-o+n_eng/
  • /o/ in the corresponding Japanese word (script lost in transcription): call it /t-o+n_jap/
  • Gender is handled differently (speaker adaptation)
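A toy label builder in the spirit of these examples might look like the following; the exact notation (apostrophe for stress, `_eng`/`_jap` language suffixes) is illustrative, not necessarily the system's actual convention:

```python
def cd_label(left, phone, right, stress=False, punct=None, lang=None):
    """Build a context-dependent phone label like /r-i+k'_eng/.

    left/right: neighboring phones; stress: lexical stress flag;
    punct: punctuation proxy for prosody (e.g. "PERIOD");
    lang: language/dialect tag (e.g. "eng" vs "jap").
    """
    label = f"{left}-{phone}+{right}"   # core triphone
    if stress:
        label += "'"                    # lexical stress marker
    if punct:
        label += punct                  # e.g. phrase-final lengthening
    if lang:
        label += f"_{lang}"             # language/dialect tag
    return label
```

Each added factor multiplies the label inventory, which is why the decision-tree tying on the next slide is needed.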

17
Decision Tree State Tying
  • Categories for decision tree questions:
  • Distinctive phone features (manner/place of articulation) of the right or left context
  • Language identity
  • Dialect identity (L1 vs. L2)?
  • Lexical stress
  • Punctuation mark

Each leaf node contains at least 3.5 seconds of
training data
18
Phone Recognition Experiment (Huang et al., in preparation)
  • Language-independent triphone bigram language model
  • Standard classifier transforms (PLP with deltas and double-deltas, CDHMM, 11-17 Gaussians)
  • Vocabulary size: the top 60K most frequent triphones (since 140K is too many!)
  • The remaining infrequent triphones are mapped back to their center monophones
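The vocabulary truncation and monophone backoff might be sketched like this (a hypothetical helper, assuming triphones arrive as (left, center, right) triples):

```python
from collections import Counter

def triphone_vocab(token_stream, top_k=60000):
    """Keep the top_k most frequent triphones; back the rest off to monophones.

    token_stream: iterable of (left, center, right) triples from training data.
    The 60K cutoff mirrors the slide (140K distinct triphones were too many).
    Returns a mapping function from a triple to its model label.
    """
    counts = Counter(token_stream)
    keep = {tri for tri, _ in counts.most_common(top_k)}

    def map_token(tri):
        # frequent triphone -> context-dependent label; rare -> center monophone
        return f"{tri[0]}-{tri[1]}+{tri[2]}" if tri in keep else tri[1]

    return map_token
```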

19
Recognition Results (Huang et al., in preparation)
  • Test set: 50 sentences per corpus

20
Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)
Voice activity detection; perceptual frequency warping; Gaussian mixtures
Likelihood Vector: b_i = p(observation_t | state_t = i)
Inference Algorithm: Finite State Transducer built from ASR lattices, E(count(query) | observations)
Retrieval Ranking: E(count(query) | segment observations)
21
Information Retrieval: Standard Methods
  • Task description: given a query, find the most relevant segments in a database
  • Published algorithms:
  • EXACT MATCH: segment = argmin d(query, segment)
  • Fast
  • SUMMARY STATISTICS: segment = argmax p(query | segment), with no concept of word order
  • Good for text, e.g., Google, Yahoo, etc.
  • TRANSFORM AND INFER: segment = argmax p(query | segment) or E(count(query) | segment); word order matters
  • Flexible, but slow...
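A simplified version of the expected-count ranking, using a posterior-weighted n-best list as a stand-in for the full lattice/FST computation (names are illustrative):

```python
def expected_count(query, hypotheses):
    """Score a segment by E[count(query) | segment observations].

    query: list of tokens (e.g., phones) whose occurrences we count.
    hypotheses: list of (posterior, token_list) pairs; a simplified
    n-best substitute for summing over all lattice paths.
    """
    q = len(query)
    total = 0.0
    for posterior, tokens in hypotheses:
        # count (possibly overlapping) occurrences of the query in this path
        matches = sum(
            1 for i in range(len(tokens) - q + 1)
            if tokens[i:i + q] == query
        )
        total += posterior * matches   # expectation over paths
    return total
```

Unlike the summary-statistics approach, the sliding-window match makes word (or phone) order matter, at the cost of more computation per segment.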

22
Language-Independent IR: The Star Challenge
  • A Multi-Language, Multi-Media Broadcast News Retrieval Competition, sponsored by A*STAR
  • Elimination rounds, June-August 2008
  • Three rounds, each of 48 hours' duration
  • 56 teams entered from around the world
  • 5 teams selected for the Grand Finals
  • Grand Finals: 10/23/2008, Singapore

23
Star Challenge Tasks
  • VT1, VT2: Given an image category (e.g., crowd, sports, keyboard), find examples
  • AT1: Given an IPA phoneme sequence (example: /?ogut?A/), find audio segments
  • AT2: Given a waveform containing a word or word sequence in any language, find audio segments containing the same word
  • AT1+VT2: Find a specified video class whose speech contains a given IPA sequence (e.g., man monologue + /gro??/)

24
Star Challenge: Simplified Results
  • Rounds 1 and 3: 48,000 CPU hours
  • Round 1: English, 20 queries
  • Round 3: English and Mandarin, 3 queries each
  • Grand Final: 6 CPU hours
  • English, Mandarin, Malay, and Tamil, 2 queries each

25
Open Research Areas
  • When does Transform and Infer help?
  • ROUND 3 (1000 CPUs, 48 hours): the best algorithms were transform-and-infer
  • GRAND FINAL (3 CPUs, 2 hours): the best algorithms were exact match
  • Open research area 1: complexity
  • Inference algorithm + user constraints → simplified classifier
  • Improved transforms and improved classifiers allow the use of a less-constrained user interface
  • Open research area 2: accuracy

26
Existence Proof: ASR can beat Human Listeners (Sharma et al., in preparation)
  • The task: speech of talkers with gross motor disability (Cerebral Palsy)
  • Familiar listeners in familiar situations understand most of what they say... ASR can also be talker-dependent and vocabulary-constrained

27
Open Research Areas
  • Remove the Constraints!
  • ASR can beat a human listener if the ASR knows more than the human
  • (e.g., knows the talker and the vocabulary)
  • Better knowledge:
  • better signal models
  • better classifiers
  • better inference

28
Thank You! Questions?
29
Decision Tree State Tying (Odell, Woodland and Young, 1994)
  • Divide each IPA phone into three temporally sequential states:
  • /i/ → /i/-onset, /i/-center, /i/-offset
  • Start with one model for each state. Create a statistical model p(acoustics | state) using training data
  • Ask yes-no questions about context variables:
  • Left phone, right phone, lexical stress, language ID
  • If p(acoustics | state, yes) ≠ p(acoustics | state, no), split the training data into two groups:
  • The "yes" examples vs. the "no" examples
  • If many such questions exist, choose the best
  • Repeat this process as long as each group contains enough training-data examples
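The splitting loop above can be sketched as a greedy recursion. This is a toy sketch: the gain function here is a simple variance-reduction stand-in for the actual acoustic-likelihood criterion, and the 3.5-second leaf constraint comes from the earlier state-tying slide:

```python
import statistics

def split_gain(yes, no):
    """Toy gain: reduction in within-group variance of a scalar stat."""
    def scaled_var(group):
        vals = [stat for _, _, stat in group]
        return statistics.pvariance(vals) * len(vals) if len(vals) > 1 else 0.0
    return scaled_var(yes + no) - scaled_var(yes) - scaled_var(no)

def grow_tree(examples, questions, min_sec=3.5):
    """Greedy state tying: split while both children keep >= min_sec of data.

    examples: list of (context, duration_sec, acoustic_stat) tuples.
    questions: list of (name, predicate) pairs over contexts, e.g.
    language identity, left/right phone features, lexical stress.
    """
    def seconds(group):
        return sum(dur for _, dur, _ in group)

    best = None
    for name, pred in questions:
        yes = [e for e in examples if pred(e[0])]
        no = [e for e in examples if not pred(e[0])]
        # reject splits that starve either child of training data
        if seconds(yes) < min_sec or seconds(no) < min_sec:
            continue
        gain = split_gain(yes, no)
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None:
        return {"leaf": examples}     # no admissible question: tie states here
    _, name, yes, no = best
    return {"question": name,
            "yes": grow_tree(yes, questions, min_sec),
            "no": grow_tree(no, questions, min_sec)}
```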