1. Feature Selection, Acoustic Modeling and Adaptation: SDSG Review of Recent Work
- Technical University of Crete
- Speech Processing and Dialog Systems Group
- Presenter: Alex Potamianos
2. Outline
- Prior Work
- Adaptation
- Acoustic Modeling
- Robust Feature Selection
- Bridge over to HIWIRE work-plan
- Robust Features, Acoustic Modeling, Adaptation
- New areas: audio-visual ASR, microphone arrays
3. Adaptation
- Transformation-based adaptation
- MAP Adaptation (Bayesian learning approximation)
- Speaker Clustering / Speaker space models.
- Robust Feature Selection
- Combinations
4. Acoustic Model Adaptation: SDSG Selected Work
- Constrained Estimation Adaptation
- Maximum Likelihood Stochastic Transformations
- Combined Transformation-MAP adaptation
- MLST Basis Vectors
- Incremental Adaptation
- Dependency modeling of biases
- Vocal Tract Norm. with Linear Transformation
5. Constrained Estimation Adaptation (Digalakis 1995)
- Hypothesize a sequence of feature-space linear transformations x → Ax + b.
- Adapted models then have means Aμ + b and covariances AΣAᵀ, with A diagonal.
- Adaptation is equivalent to estimating the state-dependent transformation parameters (A_s, b_s); a sketch follows.
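A minimal sketch of this constrained update in Python (numpy only; the function name and toy values are illustrative assumptions, not the paper's code):

```python
import numpy as np

def constrained_adapt(mu, Sigma, A, b):
    """Constrained-estimation adaptation: the same feature-space
    transform x -> A x + b moves both the Gaussian mean and the
    covariance, so mu' = A mu + b and Sigma' = A Sigma A^T."""
    return A @ mu + b, A @ Sigma @ A.T

# Toy usage with a diagonal A, as the slide assumes.
mu = np.zeros(3)
Sigma = np.eye(3)
A = np.diag([1.1, 0.9, 1.0])
b = np.array([0.5, -0.2, 0.0])
mu_a, Sigma_a = constrained_adapt(mu, Sigma, A, b)
```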
6. Compared to MLLR (Leggetter 1996)
- Both were published at about the same time.
- MLLR is model adaptation only.
- MLLR transforms only the model means.
- The transformation matrix in MLLR is block diagonal.
- Constrained estimation is more generic.
7. Limitations of the Linear Assumption
- The linear assumption may be too restrictive in modeling the training-testing dependency.
- Goal: try a more complex transformation.
- All Gaussians in a class are restricted to be transformed identically, using the same transformation.
- Goal: let each Gaussian in a class decide on its own transformation.
- Which transformation applies to each Gaussian is predefined.
- Goal: let the system automatically choose the transformation-Gaussian pairs.
8. ML Stochastic Transformations (MLST) (Diakoloukas and Digalakis 1997)
- Hypothesize a sequence of feature-space stochastic transformations of the form x → A_j x + b_j, where component j is applied with probability λ_j (a probabilistic mixture of linear transformations).
9. MLST Model-Space
- Use a set of MLSTs instead of linear transformations.
- Adapted observation densities become mixtures of linearly transformed Gaussians (sketched below).
- MLST-Method I: the transformation matrices A_sj are diagonal.
- MLST-Method II: the transformation matrices A_sj are block diagonal.
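A minimal sketch of evaluating such an adapted density, assuming the model-space form b'(x) = Σ_i w_i Σ_j λ_j N(x; A_j μ_i + b_j, A_j Σ_i A_jᵀ); all names are hypothetical:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mlst_density(x, weights, mus, Sigmas, lams, As, bs):
    """Evaluate an MLST-adapted mixture: every SI Gaussian
    (w_i, mu_i, Sigma_i) is expanded into one component per basis
    transform (A_j, b_j), weighted by the transform probability lam_j."""
    p = 0.0
    for w, mu, Sigma in zip(weights, mus, Sigmas):
        for lam, A, b in zip(lams, As, bs):
            p += w * lam * multivariate_normal.pdf(
                x, mean=A @ mu + b, cov=A @ Sigma @ A.T)
    return p
```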
10. MLST: Reducing the Number of Mixture Components
- The adapted mixture densities contain many more Gaussians (each SI Gaussian is expanded by every component transform).
- Reduce the Gaussians back to their SI number (two of the rules are sketched below):
- HPT: apply the component transformation with the highest probability to each Gaussian.
- LCT: apply a linear combination of all component transforms.
- MTG: merge the transformed Gaussians.
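Minimal sketches of the HPT and MTG rules, assuming per-Gaussian transform posteriors `gammas` estimated from the adaptation data (hypothetical names, illustrative only):

```python
import numpy as np

def hpt(mu, gammas, As, bs):
    """HPT: keep one Gaussian per SI Gaussian by applying only the
    component transform with the highest posterior probability."""
    j = int(np.argmax(gammas))
    return As[j] @ mu + bs[j]

def mtg(gammas, mus_t, Sigmas_t):
    """MTG: merge the transformed Gaussians into a single Gaussian by
    moment matching (weighted mean, plus within- and between-component
    covariance)."""
    g = np.asarray(gammas, dtype=float)
    g = g / g.sum()
    mu = sum(gi * m for gi, m in zip(g, mus_t))
    Sigma = sum(gi * (S + np.outer(m - mu, m - mu))
                for gi, m, S in zip(g, mus_t, Sigmas_t))
    return mu, Sigma
```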
11. Schematic Representation of MLST Adaptation
12. MLST Properties
- A_sj and b_sj are shared at a state or state-cluster level.
- Transformation weights λ_j are estimated at the Gaussian level.
- MLST combines transformed Gaussians.
- MLST is flexible in how a transformation is selected for each Gaussian.
- MLST allows an arbitrary number of transformations per class.
13. MLST Compared to ML Linear Transforms
- Hard versus soft decision:
- The linear component is chosen based on the training samples.
- Adaptation resolution:
- Linear components are common to a transformation class.
- The transformation is chosen at the Gaussian level.
- Result: increased adaptation resolution together with robust estimation.
14. MLST Basis Transforms (Boulis, Diakoloukas and Digalakis 2000)
- Algorithm steps:
- Cluster the training-speaker space into classes.
- Train MLST component transforms using data from each training-speaker class.
- Use the adaptation data to estimate only the transformation weights (sketched below).
- This is like bringing a-priori knowledge into the estimation process.
- Results in rapid speaker adaptation.
- Significant gains for medium and small adaptation sets.
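A sketch of the weight-estimation step, assuming the basis transforms are fixed and only the weights are learned from adaptation data; shown for a single Gaussian with an EM-style update (names and structure are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_weights(X, mu, Sigma, As, bs, iters=10):
    """EM-style re-estimation of MLST transformation weights with the
    basis transforms (As, bs) held fixed; only the weights lam are
    learned from the adaptation frames X (single-Gaussian case)."""
    J = len(As)
    lam = np.full(J, 1.0 / J)
    comps = [(A @ mu + b, A @ Sigma @ A.T) for A, b in zip(As, bs)]
    for _ in range(iters):
        # E-step: posterior of each basis transform for every frame.
        lik = np.stack([lam[j] * multivariate_normal.pdf(X, m, S)
                        for j, (m, S) in enumerate(comps)], axis=1)
        gamma = lik / lik.sum(axis=1, keepdims=True)
        # M-step: the new weights are the average responsibilities.
        lam = gamma.mean(axis=0)
    return lam
```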
15. Combined Transformation-Bayesian Adaptation (Digalakis and Neumeyer 1996)
- MAP estimation can be expressed as an interpolation between a prior estimate and the ML estimate from the adaptation data; using the transform-adapted parameters as the prior combines the two methods (sketched below).
- Retains the asymptotic properties of MAP.
- Retains the fast adaptation rates of transformations.
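A minimal sketch of the combination for a Gaussian mean, assuming the transform-adapted mean serves as the prior mean in a standard MAP update (the weight tau and all names are illustrative assumptions):

```python
import numpy as np

def combined_map_mean(mu_si, A, b, X, tau=10.0):
    """Combined transformation + MAP adaptation of a Gaussian mean:
    first move the SI mean with the estimated transform (fast, works
    with sparse data), then shrink toward the ML mean of the adaptation
    data as the frame count N grows (MAP's asymptotic behavior)."""
    mu_prior = A @ mu_si + b          # transformation-based prior
    N = len(X)
    mu_ml = X.mean(axis=0)            # ML estimate from adaptation data
    return (tau * mu_prior + N * mu_ml) / (tau + N)
```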
16. Rapid Speech Recognizer Adaptation (Digalakis et al. 2000)
- Dependence models for the bias components of cascaded transforms. Techniques:
- Gaussian multiscale processes
- Hierarchical tree-structured priors
- Explicit correlation models
- Markov random fields
17. VTN with Linear Transformation (Potamianos and Rose 1997; Potamianos and Narayanan 1998)
- Vocal tract normalization (VTN):
- Select the optimal warping factor a according to
- â = argmax_a P(X^a | a, λ, H),
- where H is the transcription, λ the acoustic model, and X^a the observation sequence frequency-warped by factor a (a search sketch follows).
- VTN with linear transformation:
- (â, θ̂) = argmax_{a,θ} P(h_θ(X^a) | a, θ, λ, H),
- where h_θ(·) is a parametric linear transformation with parameter θ.
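A minimal sketch of the warping-factor search, assuming a small grid of factors and a black-box `score(a)` callback that warps the features and returns the transcription-constrained log-likelihood (both the grid and the callback are illustrative assumptions):

```python
import numpy as np

def select_warp_factor(warp_grid, score):
    """VTN warping-factor search: pick the factor a whose frequency-
    warped features X^a score highest against the model given the
    transcription, i.e. a_hat = argmax_a P(X^a | a, lambda, H)."""
    scores = [score(a) for a in warp_grid]
    return warp_grid[int(np.argmax(scores))]

# Typical usage: a grid of roughly +/-12% linear warping factors.
# a_hat = select_warp_factor(np.linspace(0.88, 1.12, 13), score)
```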
18. Acoustic Modeling: SDSG Selected Work
- Genones: a generalized Gaussian mixture tying scheme
- Stochastic Segment Models (SSMs)
19. Genones: Generalized Mixture Tying (Digalakis, Monaco and Murveit 1996)
- Algorithm steps:
- Clustering: group HMM states based on the similarity of their distributions (see the sketch below).
- Splitting: construct seed codebooks for each state cluster,
- either by identifying the most likely mixture-component subset,
- or by clustering down the original codebook.
- Re-estimation of the parameters using Baum-Welch.
- Yields a better trade-off between modeling resolution and robustness.
- Genones are used in Decipher and Nuance.
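A toy sketch of the clustering step, in which each cluster of states will share one Gaussian codebook (a genone). Here k-means over fixed-size state summary vectors stands in, purely for illustration, for the paper's likelihood-based clustering:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_states_to_genones(state_vectors, n_genones):
    """Group HMM states so that each cluster shares one Gaussian
    codebook (a genone). States are summarized here by fixed-size
    vectors (e.g., concatenated mixture means); the original work
    clusters on distribution similarity rather than k-means."""
    km = KMeans(n_clusters=n_genones, n_init=10).fit(np.asarray(state_vectors))
    return km.labels_  # genone index for every state
```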
20. Segment Models
- HMM limitations:
- Weak duration modeling
- The conditional-independence-of-observations assumption
- Restrictions on feature extraction imposed by frame-based observations
- Segment-model motivation:
- Larger number of degrees of freedom in the model
- Use of segmental features
- Modeling of the correlation between frame-based features
- Powerful modeling of transitions and longer-range speech dynamics
- Less distortion for segmental coding → segmental recognition is more efficient
21. General Stochastic Segment Models
- A segment s in an utterance of N frames is s = (t_a, t_b), with 1 ≤ t_a ≤ t_b ≤ N.
- Segment model density: the joint density p(x_{t_a}, ..., x_{t_b} | s, α) of the frames given the segment and its label α.
- Segment models generate a variable-length sequence of frames (a toy instance follows).
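A toy instance of such a variable-length segment density, assuming the common fixed-region construction (a linear time warp onto M region-dependent Gaussians); one simple choice among many, not the general model itself:

```python
import numpy as np
from scipy.stats import multivariate_normal

def segment_log_density(X, region_means, region_covs):
    """Toy stochastic-segment-model density: a variable-length segment
    X (L x d frames) is mapped onto M fixed regions by a linear time
    warp, and each frame is scored by its region's Gaussian."""
    L, M = len(X), len(region_means)
    regions = np.minimum((np.arange(L) * M) // L, M - 1)
    return sum(multivariate_normal.logpdf(x, region_means[r], region_covs[r])
               for x, r in zip(X, regions))
```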
22. Stochastic Segment Model (Ostendorf and Digalakis 1992)
- Problem: model the time correlation within a segment.
- Solution: Gaussian model variations based on assumptions about the form of statistical dependency:
- Gauss-Markov model (sketched below)
- Dynamical system model
- Target-state model
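A minimal sketch of the first of these, assuming a first-order Gauss-Markov dependency x_t = F x_{t-1} + w_t within the segment (names and shapes are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_markov_loglik(X, mu0, Sigma0, F, Q):
    """First-order Gauss-Markov segment model: the first frame is
    Gaussian, and each later frame depends linearly on its predecessor
    (x_t = F x_{t-1} + w_t, w_t ~ N(0, Q)), capturing within-segment
    time correlation that frame-independent HMM states ignore."""
    ll = multivariate_normal.logpdf(X[0], mu0, Sigma0)
    for prev, cur in zip(X[:-1], X[1:]):
        ll += multivariate_normal.logpdf(cur, F @ prev, Q)
    return ll
```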
23. SSM Viterbi Decoding (Ostendorf, Digalakis and Kimball 1996)
- HMM Viterbi recognition: map the best state sequence to a word sequence.
- SSM analogous solution: map the best segment-label sequence to the appropriate word sequence, searching over segmentations as well (sketched below).
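A minimal sketch of the segmental dynamic program, assuming a black-box `seg_score(a, ta, tb)` returning the log segment density plus any transition score (names, the duration cap, and the scoring callback are illustrative assumptions):

```python
import numpy as np

def segmental_viterbi(N, labels, seg_score, max_len=40):
    """Segmental Viterbi: D[t] is the best score over segmentations of
    frames 0..t-1; each step extends the hypothesis by one whole
    labeled segment rather than one frame, as in HMM Viterbi."""
    D = np.full(N + 1, -np.inf)
    D[0], back = 0.0, [None] * (N + 1)
    for t in range(1, N + 1):
        for ta in range(max(0, t - max_len), t):
            for a in labels:
                s = D[ta] + seg_score(a, ta, t)
                if s > D[t]:
                    D[t], back[t] = s, (ta, a)
    # Trace back the best segment-label sequence.
    segs, t = [], N
    while t > 0:
        ta, a = back[t]
        segs.append((a, ta, t))
        t = ta
    return segs[::-1], D[N]
```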
24. From HMMs to Segment Models (Ostendorf and Digalakis 1996)
- A unified view of stochastic modeling:
- A general stochastic model that encompasses most segment-model variants
- Similarities in terms of correlation and parameter-tying assumptions
- Analogies between segment models and HMMs
25. Robust Feature Selection
- Time-frequency representations for ASR (Potamianos and Maragos 1999)
- Confidence-measure estimation for ASR features sent over wireless channels (missing features) (Potamianos and Weerackody 2001)
- AM-FM model-based features (Dimitriadis et al. 2002)
26. Other Work
- Multiple source separation using microphone
arrays (Sidiropoulos et al. 2001)
27. Prior Work Overview
- Constrained Estimation Adaptation
- MLST
- Combinations
- MAP (Bayesian) Adaptation
- VTLN
- Genones
- Segment Models
- Robust Features
28. HIWIRE Work Proposal
- Adaptation: Bayes optimal classification
- Acoustic modeling: segment models
- Feature selection: AM-FM features
- Microphone arrays: speech/noise separation
- Audio-visual ASR: baseline experiments
29. Bayes Optimal Classification (HIWIRE proposal)
- Classifier decision for a test data vector x_test:
- Choose the class c with the highest predictive value P(x_test | D_c) = ∫ P(x_test | θ) p(θ | D_c) dθ, where D_c is the training data of class c.
30. Bayes Optimal versus MAP
- Assumption: the posterior p(θ | D) is sufficiently peaked around the most probable point.
- MAP approximation: P(x_test | D) ≈ P(x_test | θ_MAP),
- where θ_MAP is the set of parameters that maximizes the posterior p(θ | D) (both rules are sketched below).
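A minimal sketch of the two decision rules, assuming posterior parameter samples are available (e.g., from MCMC) and a Gaussian class likelihood; all names are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_optimal_score(x, theta_samples):
    """Bayes optimal: average the likelihood of x over posterior
    parameter samples, a Monte Carlo approximation of the integral
    of P(x | theta) p(theta | D) d(theta)."""
    return np.mean([multivariate_normal.pdf(x, mu, Sigma)
                    for mu, Sigma in theta_samples])

def map_score(x, theta_map):
    """MAP approximation: a single point estimate stands in for the
    whole posterior; valid when the posterior is sharply peaked."""
    mu, Sigma = theta_map
    return multivariate_normal.pdf(x, mu, Sigma)

# Decision: pick the class with the highest predictive score, e.g.
# c_hat = max(classes, key=lambda c: bayes_optimal_score(x, samples[c]))
```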
31. Why Bayes Optimal Classification?
- It is the optimal classification criterion.
- The predictions of all parameter hypotheses are combined:
- Better discrimination
- Less training data required
- Faster asymptotic convergence to the ML estimate
- However:
- It is computationally more expensive.
- It is difficult to find analytical solutions,
- hence some approximations must still be considered.
32. Segment Models
- Phone Transition modeling
- New features
- Combine with HMMs
- Parametric modeling of feature trajectories
33. AM-FM Features
34. Audio-Visual ASR
35. Microphone Array
- Speech/noise source separation algorithms