Title: Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
1. Acoustic Modeling for Multi-Language, Multi-Style, Multi-Channel Automatic Speech Recognition
University of Illinois
- Mark Hasegawa-Johnson
- Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, Zhen Li, and Thomas Huang
- including also the research results of Laehoon Kim and Harsh Sharma
2. Motivation
- Applications in a multilingual society
  - News Hound: find all TV news segments, in any language, mentioning Barack Obama
  - Language Learner: transcribe a learner's accented speech; tell him which words sound accented
  - Broadcaster/Podcaster: automatically transcribe man-on-the-street interviews in a multilingual city (LA, Singapore)
- Problems
  - Physical variability: noise, echo, talker
  - Imprecise categories: dependent on context
  - Content variability: language, topic, dialect, style
3. Method: Transform and Infer (the ubiquitous methodology in ASR; see, e.g., Jelinek, 1976)
- Signal transforms
- Classifier transforms
- Likelihood vector: b_i = p(observation_t | state_t = i)
- Inference algorithm: a parametric model of p(state_1, ..., state_T, label_1, ..., label_T)
- Best label sequence: argmax p(label_1, ..., label_T | observation_1, ..., observation_T)
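The pipeline above can be sketched end-to-end with a toy two-state model: per-frame likelihoods stand in for the classifier transforms, a prior plus transition model stands in for the inference stage, and the best label sequence is found by brute-force enumeration. All probabilities below are invented for illustration.

```python
from itertools import product

# Toy "transform and infer": b[state][t] plays the role of
# b_i = p(obs_t | state_t = i); trans and prior form the parametric model.
b = {"sil": [0.9, 0.2, 0.1], "speech": [0.1, 0.8, 0.9]}
trans = {("sil", "sil"): 0.8, ("sil", "speech"): 0.2,
         ("speech", "speech"): 0.8, ("speech", "sil"): 0.2}
prior = {"sil": 0.7, "speech": 0.3}

def joint(labels):
    """p(labels, observations) under the toy model."""
    p = prior[labels[0]] * b[labels[0]][0]
    for t in range(1, len(labels)):
        p *= trans[(labels[t - 1], labels[t])] * b[labels[t]][t]
    return p

# argmax over label sequences of p(labels | observations): since
# p(observations) is constant, maximizing the joint suffices.
# Brute force is fine for 3 frames and 2 states.
best = max(product(["sil", "speech"], repeat=3), key=joint)  # ('sil', 'speech', 'speech')
```

Real systems replace the brute-force search with dynamic programming (Viterbi), since the number of label sequences grows exponentially with T.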
4. Signal Transforms: transforms determined by a physical model of the signal
- A good signal model tells you a lot
  - Reverberation model: y[n] = v[n] + Σ_m h[m] x[n−m]
  - x[n] is produced by a human vocal tract, designed for efficient processing by a human auditory system
- A good signal transform improves the accuracy of all classifiers
  - Denoising: correct for additive noise
  - Dereverberation: correct for convolutional noise
  - Perceptual frequency warping: hear what humans hear
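The reverberation model on this slide can be sketched directly: the clean signal x is convolved with a room impulse response h (convolutional noise) and corrupted by additive noise v. The signal values and impulse response below are invented for illustration.

```python
def reverberate(x, h, v):
    """Return y[n] = v[n] + sum_m h[m] * x[n - m] (the slide's reverberation model)."""
    y = []
    for n in range(len(x)):
        echo = sum(h[m] * x[n - m] for m in range(len(h)) if 0 <= n - m < len(x))
        y.append(v[n] + echo)
    return y

x = [1.0, 0.0, 0.0, 0.0]   # clean speech (a unit impulse, for clarity)
h = [1.0, 0.5, 0.25]       # room impulse response: direct path plus decaying echoes
v = [0.1, 0.1, 0.1, 0.1]   # additive noise floor

y = reverberate(x, h, v)   # approximately [1.1, 0.6, 0.35, 0.1]
```

Denoising and dereverberation are the inverse problems: estimate x given y, using models of h and v.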
5. Denoising Example (Kim et al., 2006)
6. Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
- Robust machine learning:
  - From a limited amount of training data,
  - learn parameterized probability models as precise as possible,
  - ...with a known upper bound on generalization error
- Methods that trade off precision and generalization
  - Decorrelate the signal measurements: PCA, DCT
  - Select the most informative features from an inventory: AdaBoost
  - Train a linear or nonlinear function z_t = f(y_t) that
    - discriminates among the training examples from different classes
    - has known upper bounds on generalization error (SVM, ANN)
  - Train another nonlinear function p(z_t | state_t) with the same properties
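The decorrelation step can be illustrated with the type-II DCT, which plays the same role PCA or the cepstral transform plays in an ASR front end. This is a sketch, not an optimized FFT-based DCT.

```python
import math

def dct2(frame):
    """Type-II DCT of a feature frame: decorrelates correlated measurements,
    e.g., adjacent filterbank energies."""
    N = len(frame)
    return [sum(frame[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(N)]

# A flat (fully correlated) frame concentrates all of its energy
# in coefficient 0; the higher coefficients vanish.
coeffs = dct2([1.0, 1.0, 1.0, 1.0])  # approximately [4.0, 0.0, 0.0, 0.0]
```

Decorrelated features let the Gaussian mixtures on later slides use diagonal covariances, which need far fewer parameters per Gaussian.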
7. Classifier Transforms: compute a precise and accurate estimate of p(obs_t | state_t)
8. Inference: integrate information to choose the best global label set
- Labels: variables that matter globally
  - Speech recognition: what words were spoken?
  - Information retrieval: which segment best matches the query?
  - Language learning: where's the error?
- States: variables that can be classified locally
  - May be scalar, e.g., q_t = sub-phoneme
  - May be vector, e.g., q_t = vector of articulatory states
- Inference algorithm: parametric model of p(states, labels)
  - Scalar states: hidden Markov model, finite-state transducer
  - Vector states: dynamic Bayesian network, conditional random field
9. Inference: integrate information to choose the best global label set
10. Example: Language-Independent Phone Recognition (Huang et al., in preparation)
- Signal transforms: voice activity detection, perceptual frequency warping
- Classifier transforms: Gaussian mixtures
- Likelihood vector: b_i = p(observation_t | state_t = i)
- Inference algorithm: hidden Markov model with token passing, p(state_1, ..., state_T, phone_1, ..., phone_T)
- Best label sequence: argmax p(phone_1, ..., phone_T | observation_1, ..., observation_T)
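The token-passing hidden Markov model named on this slide can be sketched minimally: each state holds one "token" (best path score and history), and at every frame tokens are passed along transitions and pruned to the best arrival. Model numbers are invented for illustration.

```python
def viterbi(obs_liks, trans, prior):
    """Token-passing Viterbi: return the best state sequence."""
    states = list(prior)
    T = len(next(iter(obs_liks.values())))
    # frame 0: one token per state, (path score, path history)
    tokens = {s: (prior[s] * obs_liks[s][0], [s]) for s in states}
    for t in range(1, T):
        new = {}
        for s in states:
            # pass tokens into s; keep only the best arriving token
            prev = max(states, key=lambda r: tokens[r][0] * trans[(r, s)])
            score = tokens[prev][0] * trans[(prev, s)] * obs_liks[s][t]
            new[s] = (score, tokens[prev][1] + [s])
        tokens = new
    return max(tokens.values(), key=lambda tok: tok[0])[1]

obs_liks = {"sil": [0.9, 0.2, 0.1], "speech": [0.1, 0.8, 0.9]}  # b_i = p(obs_t | state_t = i)
trans = {("sil", "sil"): 0.8, ("sil", "speech"): 0.2,
         ("speech", "speech"): 0.8, ("speech", "sil"): 0.2}
path = viterbi(obs_liks, trans, {"sil": 0.7, "speech": 0.3})  # ['sil', 'speech', 'speech']
```

A production decoder additionally stores word-boundary information in each token and propagates many tokens per state to build the lattices used on later slides; this sketch keeps only the single best path.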
11. A Language-Independent Phone Set (Consonants)
Plus secondary articulations (glottis, pharynx, palate, lips), sequences, and syllabics
12. A Language-Independent Phone Set (Vowels)
13. Training Data
- 10 languages, 11 corpora
  - Arabic, Croatian, English, Japanese, Mandarin, Portuguese, Russian, Spanish, Turkish, Urdu
- 95 hours of speech
  - Sampled from a larger set of corpora
- Mixed styles of speech: broadcast, read, and spontaneous
14. Summary of Corpora
15. Dictionaries (Hasegawa-Johnson and Fleck, http://www.isle.uiuc.edu/dict/)
- Orthographic transcriptions (Urdu: no vowels!)
- Is a diacriticized version available on the web? (yes/no branch)
  - Ruleset 1: [letter] → q, [letter] → k, [letter] → g, ...
  - Ruleset 2: [letter] → A, [letter] → ligature, [letter] → u, ...
- Phonetic transcriptions: /sAhSVbSV/, /sA!iq?/
[Urdu-script examples lost in extraction]
16. Context-Dependent Phones
- Triphones: when is a /t/ not a /t/?
  - writer: the /t/ is unusual; call it /aI-t3r/
  - a tree: the /t/ is unusual; call it /-tr/
  - that soup: the /t/ is unusual; call it /ae-ts/
- Lexical stress
  - The /i/ in "reek" is longer than in "recover"
  - Call them /r-ik'/ vs. /r-ik/
- Punctuation, an easy-to-transcribe proxy for prosody
  - The /n/ in "I'm done." is 2X as long as the /n/ in "Done yet?"
  - Call them /-nPERIOD/ vs. /-nj/
- Language, dialect, style
  - /o/ in "atone": call it /t-oneng/
  - /o/ in ??? (Japanese word, garbled in extraction): call it /t-onjap/
- Gender is handled differently (speaker adaptation)
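The context-dependent relabeling above can be sketched as a mapping from a phone string to left/right context-dependent labels. The HTK-style "left-center+right" notation and the '#' boundary marker are assumptions for illustration; the slide's own labels (e.g., /aI-t3r/ for the /t/ in "writer") use a compressed variant of the same idea.

```python
def triphones(phones):
    """Relabel each phone with its left and right context, 'left-center+right' style.
    '#' marks a missing context at an utterance boundary."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i < len(phones) - 1 else "#"
        out.append(f"{left}-{p}+{right}")
    return out

labels = triphones(["r", "aI", "t", "3r"])  # 'writer'
# labels[2] == 'aI-t+3r': the flapped /t/ of 'writer' gets its own model
```

Stress, punctuation, and language tags extend the label in the same way, which is why the triphone inventory grows so quickly (140K on a later slide) and why decision-tree tying is needed.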
17. Decision Tree State Tying
- Categories for decision tree questions
  - Distinctive phone features (manner/place of articulation) of the right or left context
  - Language identity
  - Dialect identity (L1 vs. L2)
  - Lexical stress
  - Punctuation mark
- Each leaf node contains at least 3.5 seconds of training data
18. Phone Recognition Experiment (Huang et al., in preparation)
- Language-independent triphone bigram language model
- Standard classifier transforms (PLP with delta features, CDHMM, 11-17 Gaussians)
- Vocabulary size: the top 60K most frequent triphones (since 140K is too many!)
- The remaining infrequent triphones are mapped back to their center monophones
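The 60K-triphone vocabulary with monophone backoff can be sketched as follows. The counts, the helper name `build_vocab`, and the "left-center+right" label format are assumptions for illustration.

```python
from collections import Counter

def build_vocab(triphone_counts, top_k):
    """Keep the top_k most frequent triphones; back all others off
    to their center monophone."""
    keep = {t for t, _ in Counter(triphone_counts).most_common(top_k)}
    def map_unit(tri):
        if tri in keep:
            return tri
        return tri.split("-")[1].split("+")[0]   # e.g. 'aI-t+3r' -> 't'
    return map_unit

counts = {"aI-t+3r": 500, "r-aI+t": 120, "t-3r+#": 2}   # toy counts
map_unit = build_vocab(counts, top_k=2)
map_unit("aI-t+3r")   # frequent: kept as 'aI-t+3r'
map_unit("t-3r+#")    # infrequent: backs off to '3r'
```

Backing off to the monophone keeps every unit trainable while still letting frequent contexts earn dedicated models.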
19. Recognition Results (Huang et al., in preparation)
- Test set: 50 sentences per corpus
20. Example: Language-Independent Speech Information Retrieval (Zhuang et al., in preparation)
- Signal transforms: voice activity detection, perceptual frequency warping
- Classifier transforms: Gaussian mixtures
- Likelihood vector: b_i = p(observation_t | state_t = i)
- Inference algorithm: finite-state transducer built from ASR lattices, E(count(query) | observations)
- Retrieval ranking: E(count(query) | segment observations)
21. Information Retrieval: Standard Methods
- Task description: given a query, find the most relevant segments in a database
- Published algorithms
  - EXACT MATCH: segment = argmin d(query, segment)
    - Fast
  - SUMMARY STATISTICS: segment = argmax p(query | segment), no concept of word order
    - Good for text, e.g., Google, Yahoo, etc.
  - TRANSFORM AND INFER: segment = argmax p(query | segment) or E(count(query) | segment); word order matters
    - Flexible, but slow...
22. Language-Independent IR: The Star Challenge
- A multi-language, multi-media broadcast news retrieval competition, sponsored by A*STAR
- Elimination rounds, June-August 2008
  - Three rounds, each of 48 hours' duration
  - 56 teams entered from around the world
  - 5 teams selected for the Grand Finals
- Grand Finals: 10/23/2008, Singapore
23. Star Challenge Tasks
- VT1, VT2: given an image category (e.g., crowd, sports, keyboard), find examples
- AT1: given an IPA phoneme sequence (example: /?ogut?A/), find audio segments
- AT2: given a waveform containing a word or word sequence in any language, find audio segments containing the same word
- AT1+VT2: find a specified video class whose speech contains a given IPA sequence (e.g., man monologue + /gro??/)
24. Star Challenge Simplified Results
- Rounds 1 and 3: 48,000 CPU hours
  - Round 1: English, 20 queries
  - Round 3: English and Mandarin, 3 queries each
- Grand Final: 6 CPU hours
  - English, Mandarin, Malay, and Tamil, 2 queries each
25. Open Research Areas
- When does Transform and Infer help?
  - ROUND 3 (1,000 CPUs, 48 hours): the best algorithms were transform-and-infer
  - GRAND FINAL (3 CPUs, 2 hours): the best algorithms were exact match
- Open research area 1: complexity
  - Inference algorithm + user constraints → simplified classifier
  - Improved transforms and improved classifiers allow the use of a less-constrained user interface
- Open research area 2: accuracy
26. Existence Proof: ASR Can Beat Human Listeners (Sharma et al., in preparation)
- The task: speech of talkers with gross motor disability (cerebral palsy)
- Familiar listeners in familiar situations understand most of what they say... and ASR can also be talker-dependent and vocabulary-constrained
27. Open Research Areas
- Remove the constraints!
  - ASR can beat a human listener if the ASR knows more than the human
  - (e.g., knows the talker and the vocabulary)
- Better knowledge:
  - better signal models
  - better classifiers
  - better inference
28. Thank You! Questions?
29. Decision Tree State Tying (Odell, Woodland and Young, 1994)
- Divide each IPA phone into three temporally sequential states
  - /i/ → /i/onset, /i/center, /i/offset
- Start with one model for each state; create a statistical model p(acoustics | state) using training data
- Ask yes/no questions about context variables
  - Left phone, right phone, lexical stress, language ID
- If p(acoustics | state, yes) ≠ p(acoustics | state, no), split the training data into two groups
  - The yes examples vs. the no examples
- If many such questions exist, choose the best
- Repeat this process as long as each group contains enough training data examples
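A minimal sketch of the splitting loop described above, assuming 1-D Gaussian state models and a hand-made set of questions. The function names, the minimum-count threshold, and the toy data are all illustrative; a real system uses multivariate Gaussians and a likelihood-gain stopping threshold as well.

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of samples under their own ML Gaussian (variance floored)."""
    n = len(xs)
    mu = sum(xs) / n
    var = max(sum((x - mu) ** 2 for x in xs) / n, 1e-4)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(examples, questions, min_count=3):
    """examples: (context, acoustic_value) pairs; questions: name -> predicate
    on the context. Return the question with the largest likelihood gain,
    or None if no question leaves enough data on both sides."""
    base = gauss_loglik([x for _, x in examples])
    best, best_gain = None, 0.0
    for name, q in questions.items():
        yes = [x for c, x in examples if q(c)]
        no = [x for c, x in examples if not q(c)]
        if len(yes) < min_count or len(no) < min_count:
            continue   # a leaf would have too little training data
        gain = gauss_loglik(yes) + gauss_loglik(no) - base
        if gain > best_gain:
            best, best_gain = name, gain
    return best

examples = [({"left": l}, x) for l, x in
            [("a", 1.0), ("e", 1.1), ("i", 0.9), ("t", 5.0), ("k", 5.1), ("d", 4.9)]]
questions = {"left-is-vowel": lambda c: c["left"] in "aeiou",
             "left-is-t": lambda c: c["left"] == "t"}
chosen = best_split(examples, questions)   # 'left-is-vowel'
```

Here "left-is-vowel" cleanly separates the two acoustic clusters, while "left-is-t" is rejected because it would leave only one example on the yes side; applying `best_split` recursively to each resulting group yields the tied-state tree.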