Title: Crosslingual Speech Recognition
1. Crosslingual Speech Recognition
Firbush Presentation, Partha Lal
Some slides were swiped from Karen Livescu's WS06 slides
2. Multilingual Speech Recognition
- Speech recognisers need large amounts of labelled training data
- What about languages for which little labelled data exists?
- Speech data in one language could be used for another language
- (the focus here is on acoustic modelling)
Image swiped from Joseph Picone's WS99 slides 
3. Hidden Markov Models
- HMMs are traditionally used here
- Hidden variable Q represents discrete sub-phone units
- Observation obs represents continuous acoustic observations
- Parameters to be estimated:
  - Transition probabilities P(q_i | q_{i-1})
  - Emission probabilities P(obs_i | q_i) (GMM)
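The two parameter sets above are enough to score an observation sequence. The following is a minimal sketch of the forward algorithm, assuming a single 1-D Gaussian per state in place of a full GMM; the two-state model and all of its numbers are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian; a stand-in for a full GMM emission model."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def forward_likelihood(obs, initial, transition, means, variances):
    """Return the likelihood of obs, summed over all hidden state sequences."""
    n_states = len(initial)
    # alpha[q] = P(obs_1..obs_i, q_i = q), initialised for the first frame
    alpha = [initial[q] * gaussian_pdf(obs[0], means[q], variances[q])
             for q in range(n_states)]
    for x in obs[1:]:
        # Propagate through P(q_i | q_{i-1}), then weight by P(obs_i | q_i)
        alpha = [
            sum(alpha[p] * transition[p][q] for p in range(n_states))
            * gaussian_pdf(x, means[q], variances[q])
            for q in range(n_states)
        ]
    return sum(alpha)

# Hypothetical left-to-right model with two sub-phone states.
initial = [1.0, 0.0]
transition = [[0.7, 0.3],   # P(q_i | q_{i-1} = 0)
              [0.0, 1.0]]   # P(q_i | q_{i-1} = 1)
means = [0.0, 3.0]
variances = [1.0, 1.0]

likelihood = forward_likelihood([0.1, 0.2, 2.9], initial, transition, means, variances)
print(likelihood)
```

A practical recogniser would work in the log domain to avoid underflow on long utterances; plain probabilities are used here only to keep the sketch short.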
4. Hybrid HMMs
- Emission probabilities are usually estimated with Gaussian mixture models (GMMs)
- Instead we could use a neural network (MLP) to estimate P(q_i | obs_i)
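An MLP outputs a posterior P(q_i | obs_i), whereas the HMM emission term is a likelihood P(obs_i | q_i). One common way to bridge the two in hybrid systems is to divide each posterior by the state prior, which by Bayes' rule gives a quantity proportional to the likelihood. The slides do not say which conversion was used, so the sketch below is an assumption, and the posterior and prior values are made up.

```python
def scaled_likelihoods(posteriors, priors):
    """Divide P(q | obs) by P(q); the result is proportional to P(obs | q)."""
    return [post / prior for post, prior in zip(posteriors, priors)]

# Hypothetical MLP output for one frame over three states,
# and hypothetical state priors (e.g. relative state frequencies in training data).
posteriors = [0.70, 0.20, 0.10]
priors = [0.50, 0.25, 0.25]

scaled = scaled_likelihoods(posteriors, priors)
print(scaled)
```

The missing factor P(obs_i) is the same for every state within a frame, so it does not change which state sequence scores best.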
5. MLP training targets
- Could use phones but...
  - Number of units in output layer: about 30-45
  - Not all phones occur in all languages
  - Phones sound different in different languages
[Figure: MLP with input obs_i and output q_i]
6. MLP training targets
- So instead try articulatory features
  - Smaller output layer
  - Language independent
[Figure: articulatory feature targets for q_i: voicing, manner, place, ...]
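To illustrate how phone labels could be turned into articulatory feature targets, here is a small sketch. The five-phone table is a hypothetical, heavily simplified inventory and is not the feature set used in the work described; it only shows that each feature needs far fewer output units than a phone classifier.

```python
# Hypothetical, simplified phone-to-feature table (illustration only).
PHONE_FEATURES = {
    "p": {"voicing": "voiceless", "manner": "stop", "place": "bilabial"},
    "b": {"voicing": "voiced", "manner": "stop", "place": "bilabial"},
    "s": {"voicing": "voiceless", "manner": "fricative", "place": "alveolar"},
    "z": {"voicing": "voiced", "manner": "fricative", "place": "alveolar"},
    "m": {"voicing": "voiced", "manner": "nasal", "place": "bilabial"},
}

def feature_targets(phone_labels):
    """Turn frame-level phone labels into one target sequence per feature."""
    targets = {}
    for phone in phone_labels:
        for feature, value in PHONE_FEATURES[phone].items():
            targets.setdefault(feature, []).append(value)
    return targets

def output_layer_sizes():
    """Number of distinct values per feature, i.e. output units per feature MLP."""
    return {
        feature: len({feats[feature] for feats in PHONE_FEATURES.values()})
        for feature in ("voicing", "manner", "place")
    }

targets = feature_targets(["b", "s", "m"])
sizes = output_layer_sizes()
print(targets)
print(sizes)
```

Because the targets are feature values rather than phones, frames from any language whose phones can be described in the same feature terms could in principle share one set of MLPs.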
7. MLP training data
- Now that the MLPs classify the acoustic signal into more language-universal classes, data can be shared more easily between languages
- e.g. English data can help train a Mandarin recogniser
8. Conclusions
- Speech data in resource-rich languages can be used to train recognisers for resource-poor languages
- Neural networks that detect articulatory feature values may be useful for transferring knowledge between languages
9. Thank you!
Questions? Comments?
10. Bayesian networks (BNs)
- Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X_1, X_2, ..., X_N
- Node X_i with parents pa(X_i) has a local probability function p(X_i | pa(X_i))
- Joint probability = product of local probabilities: p(x_1, ..., x_N) = ∏_{i=1}^{N} p(x_i | pa(x_i))
- Example: p(a, b, c, d) = p(a) p(b | a) p(c | b) p(d | b, c)
[Figure: example network over a, b, c, d with local probabilities p(a), p(b | a), p(c | b), p(d | b, c)]
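The factorisation in the four-variable example can be checked numerically. The sketch below assumes binary variables and uses made-up conditional probability tables; it computes the joint as the product of local probabilities and confirms that it sums to one over all assignments.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables a, b, c, d.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # indexed [a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}   # indexed [b][c]
p_d_given_bc = {                                            # indexed [(b, c)][d]
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.6, 1: 0.4},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}

def joint(a, b, c, d):
    """p(a, b, c, d) = p(a) p(b | a) p(c | b) p(d | b, c)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_bc[(b, c)][d]

# Summing the joint over every assignment should give 1.
total = sum(joint(*values) for values in product((0, 1), repeat=4))
print(total)
```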
11. Dynamic Bayesian networks (DBNs)
- BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times
- Useful for modeling time series (e.g. speech!)
12. Notation: Representations of HMMs as DBNs
13. A phone HMM-based recognizer
- Standard phone HMM-based recognizer with bigram language model
[Figure: DBN unrolled over frame 0, frame i, and the last frame, with each variable's name and values labelled]
14. Inference
- Definition:
  - Computation of the probability of one subset of the variables given another subset
- Inference is a subroutine of:
  - Viterbi decoding
    - argmax p(word, subWordState, phoneState, ... | obs)
  - Maximum-likelihood parameter estimation
    - θ* = argmax_θ p(obs | θ)
- For WS06, all models were implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes 02]
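As a rough illustration of the Viterbi decoding step, the sketch below finds the most probable hidden state sequence for a small discrete HMM in the log domain. It is a generic textbook-style implementation, not GMTK's inference code, and the two-state model with symbols "x" and "y" is hypothetical.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the most probable state sequence and its log joint probability."""
    # best[q] = best log score of any path ending in state q at the current frame
    best = {q: log_init[q] + log_emit[q][obs[0]] for q in states}
    backpointers = []
    for x in obs[1:]:
        new_best, pointers = {}, {}
        for q in states:
            # Best predecessor of q under the transition model
            prev = max(states, key=lambda p: best[p] + log_trans[p][q])
            new_best[q] = best[prev] + log_trans[prev][q] + log_emit[q][x]
            pointers[q] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda q: best[q])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}

# Hypothetical model: two states, two observation symbols.
states = ["A", "B"]
log_init = logs({"A": 0.6, "B": 0.4})
log_trans = {"A": logs({"A": 0.7, "B": 0.3}), "B": logs({"A": 0.4, "B": 0.6})}
log_emit = {"A": logs({"x": 0.9, "y": 0.1}), "B": logs({"x": 0.2, "y": 0.8})}

path, log_prob = viterbi(["x", "x", "y", "y"], states, log_init, log_trans, log_emit)
print(path, log_prob)
```

In the recogniser described above the hidden state is a tuple of variables (word, sub-word state, phone state, ...) rather than a single symbol, but the max-over-predecessors recursion is the same idea.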
15. Feature set for observation modeling
16. Hybrid models: MLP overall accuracies
- Frame-level accuracies
- MLPs trained on Fisher
- Accuracy computed with respect to SVB test set
- Silence frames excluded from this calculation
17. Tandem Processing Steps
- MLP posteriors are processed to make them Gaussian-like
- There are 8 articulatory MLPs; their outputs are concatenated into a single 64-dimensional input
- PCA reduces dimensionality to 26 (95% of the total variance)
- Use this 26-dimensional vector as acoustic observations in an HMM or some other model
- The tandem features are usually used in combination with a standard feature, e.g. PLP
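The first three steps can be sketched as follows. Taking the logarithm is one common way to make posteriors more Gaussian-like, but the slides do not state which transform was used, so that choice is an assumption. The PCA itself is not implemented here; the second function only shows how a variance threshold such as 95% determines the number of retained dimensions, given eigenvalues of the feature covariance matrix. All input values are hypothetical.

```python
import math

def log_concat(posterior_vectors, floor=1e-10):
    """Take logs of each MLP's posteriors for one frame and concatenate them."""
    frame = []
    for posteriors in posterior_vectors:
        # Floor the posteriors so that log(0) cannot occur
        frame.extend(math.log(max(p, floor)) for p in posteriors)
    return frame

def components_for_variance(eigenvalues, fraction=0.95):
    """Smallest number of leading components covering `fraction` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return k
    return len(ordered)

# Hypothetical frame with two small MLPs (the slide describes 8 MLPs, 64 dims).
frame = log_concat([[0.9, 0.1], [0.2, 0.5, 0.3]])

# Hypothetical covariance eigenvalues for a 5-dimensional feature vector.
k = components_for_variance([5.0, 3.0, 1.0, 0.6, 0.4])
print(len(frame), k)
```

In the setup described on the slide, the same threshold applied to the 64-dimensional concatenated vector yields 26 dimensions.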
18. Articulatory vs. Phone Tandems
- Monophones on the 500-word vocabulary task, without alignments; feature-concatenated PLP/tandem models
- All tandem systems are significantly better than PLP alone
- Articulatory tandems are as good as phone tandems
- Articulatory tandems from MLPs trained on Fisher (1776 hrs) outperform those from MLPs trained on SVB (3 hrs)