Title: OSU ASAT Status Report
1. OSU ASAT Status Report
- Jeremy Morris
- Yu Wang
- Ilana Bromberg
- Eric Fosler-Lussier
- Keith Johnson
- 13 October 2006
2. Personnel changes
- Jeremy and Yu are not currently on the project
- Jeremy is being funded on AFRL/DAGSI project
- Lexicon learning from orthography
- However, he is continuing to help in spare time
- Yu is currently in transition
- New student (to some!) Ilana Bromberg
- Technically funded as of 10/1, but did some experiments for an ICASSP paper in September
- Still sorting out a project for this year
3. Future potential changes
- May transition in another student in WI 06
- Carry on further with some of Jeremy's experiments
4. What's new?
- First pass on the parsing framework
- Last time talked about different models
- Naïve Bayes, Dirichlet modeling, MaxEnt models
- This time we settled on the Conditional Random Fields framework
- Monophone CRF phone recognition is beating triphone HTK recognition using attribute detectors
- Ready for your inputs!
- More boundary work
- Small improvements seen in integrating boundary information into HMM recognition
- Still to be seen if it helps CRFs
5. Parsing
- Desired ability to combine the output of multiple, correlated attribute detectors to produce
- Phone sequences
- Word sequences
- Handle both semi-static and dynamic events
- Traditional phonological features
- Landmarks, boundaries, etc.
- CRFs are a good bet for this
6. Conditional Random Fields
- A form of discriminative modelling
- Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
- Processes evidence bottom-up
- Combines multiple features of the data
- Builds the probability P(sequence | data)
- Computes the joint probability of the whole label sequence, conditioned on the data
- Minimal assumptions about the input
- Inputs don't need to be decorrelated
- cf. diagonal-covariance HMMs
7. Conditional Random Fields
[Figure: a linear-chain CRF over the label sequence /k/ /k/ /iy/ /iy/ /iy/, with each label node connected to its observation X]
- CRFs are based on the idea of Markov Random Fields
- Modelled as an undirected graph connecting labels with observations
- Observations in a CRF are not modelled as random variables
8. Conditional Random Fields
- The Hammersley-Clifford theorem states that a random field is an MRF iff it can be described in the above (exponential) form
- The exponential is the sum of the clique potentials of the undirected graph
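For reference, the exponential form in question (the Gibbs form of an MRF, and its linear-chain CRF instance over phone sequences) can be written in standard notation as:

```latex
% Gibbs / exponential form of an MRF over the label field y:
% the sum runs over the cliques C of the undirected graph,
% V_C are the clique potentials, Z is the normalizing constant.
P(\mathbf{y}) = \frac{1}{Z}\exp\!\Big(\sum_{C} V_C(\mathbf{y}_C)\Big)

% Linear-chain CRF: the conditional instance used for label sequences,
% with feature functions f_i and learned weights \lambda_i.
P(\mathbf{y}\mid\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\,
  \exp\!\Big(\sum_{t}\sum_{i} \lambda_i\, f_i(y_{t-1}, y_t, \mathbf{x}, t)\Big)
```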
9. Conditional Random Fields
- Conceptual overview
- Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
- A positive value if the attribute appears in the data
- A zero value if the attribute is not in the data
- Each feature function carries a weight that gives the strength of that feature function for the proposed label
- High positive weights indicate a good association between the feature and the proposed label
- High negative weights indicate a negative association between the feature and the proposed label
- Weights close to zero indicate the feature has little or no impact on the identity of the label
10. Experimental Setup
- Attribute Detectors
- ICSI QuickNet Neural Networks
- Two different types of attributes
- Phonological feature detectors
- Place, Manner, Voicing, Vowel Height, Backness, etc.
- Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
- Phone detectors
- Neural network outputs based on the phone labels, one output per label
- Classifiers were applied to 2960 utterances from the TIMIT training set
11. Experimental Setup
- Outputs from the neural nets are themselves treated as feature functions for the observed sequence: each attribute/label combination gives us a value for one feature function
- Note that this makes the feature functions non-binary features
- Different from most NLP uses of CRFs
- Along the lines of Gaussian-based CRFs (e.g., Microsoft)
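To make this concrete, here is a toy sketch of neural-net posteriors used as real-valued state feature functions, scored against per-label weights. All attribute names, posterior values, and weights are invented for illustration; they are not the trained ASAT values.

```python
# Toy sketch: net posteriors as non-binary CRF state feature functions.
# Each (attribute, label) pair is one feature function whose value is
# the net's output for that attribute at the current frame.

# Net outputs for one frame of speech (hypothetical numbers)
frame = {"vocalic": 0.9, "stop": 0.05, "voiced": 0.8}

# Learned weights lambda[(attribute, label)], one per feature function
# (signs follow the slide: positive = good association, negative = bad)
weights = {
    ("vocalic", "/iy/"): 2.0, ("stop", "/iy/"): -1.5, ("voiced", "/iy/"): 0.7,
    ("vocalic", "/k/"): -1.8, ("stop", "/k/"): 2.2, ("voiced", "/k/"): -0.9,
}

def score(label):
    """Unnormalized log-score: sum of weight * feature value for this label."""
    return sum(weights[(attr, label)] * value for attr, value in frame.items())

# The strongly vocalic, voiced frame scores higher as /iy/ than as /k/
best = max(["/iy/", "/k/"], key=score)
```

Normalizing such scores over all label sequences (the Z(x) term) is what turns them into the CRF's conditional probability.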
12. Experiment 1
- Goal: implement a Conditional Random Field model on ASAT-style phonological feature data
- Perform phone recognition
- Compare results to those obtained via a Tandem HMM system
13. Experiment 1 - Results

Model              Phone Accuracy (%)   Phone Correct (%)
Tandem monophone   61.48                63.50
Tandem triphone    66.69                72.52
CRF monophone      65.29                66.81
- The CRF system trained on monophones with these features achieves accuracy superior to the HMM on monophones
- The CRF comes close to achieving HMM triphone accuracy
- The CRF uses many, many fewer parameters
14. Experiment 2
- Goals
- Apply CRF model to phone classifier data
- Apply the CRF model to combined phonological feature classifier data and phone classifier data
- Perform phone recognition
- Compare results to those obtained via a Tandem HMM system
15. Experiment 2 - Results

Model                       Phone Acc (%)   Phone Correct (%)
Tandem mono (phones)        60.48           63.30
Tandem tri (phones)         67.32           73.81
CRF mono (phones)           66.89           68.49
Tandem mono (phones/feas)   61.78           63.68
Tandem tri (phones/feas)    67.96           73.40
CRF mono (phones/feas)      68.00           69.58

Note that the Tandem HMM result is the best result with only the top 39 features following a principal components analysis.
16. Experiment 3

- Goal
- Previous CRF experiments used phone posteriors for the CRF, and linear outputs transformed via a Karhunen-Loeve (KL) transform for the HMM system
- This transformation is needed to improve HMM performance through decorrelation of the inputs
- Using the same linear outputs as the HMM system, do our results change?
17. Experiment 3 - Results

Model                           Phone Accuracy (%)   Phone Correct (%)
CRF (phones) posteriors         67.27                68.77
CRF (phones) linear KL          66.60                68.25
CRF (phones) post. + linear     68.18                69.87
CRF (features) posteriors       65.25                66.65
CRF (features) linear KL        66.32                67.95
CRF (features) post. + linear   66.89                68.48
CRF (features) linear (no KL)   65.89                68.46

Also shown: adding both feature sets together and giving the system supposedly redundant information leads to a gain in accuracy.
18. Experiment 4

- Goal
- Previous CRF experiments did not allow for realignment of the training labels
- Boundaries for labels provided by the TIMIT hand transcribers were used throughout training
- HMM systems are allowed to shift boundaries during EM learning
- If we allow for realignment in our training process, can we improve the CRF results?
19. Experiment 4 - Results

Model                       Phone Accuracy (%)   Phone Correct (%)
Tandem tri (phones)         67.32                73.81
CRF (phones) no realign     67.27                68.77
CRF (phones) realign        69.63                72.40
Tandem tri (features)       66.69                72.52
CRF (features) no realign   65.25                66.65
CRF (features) realign      67.52                70.13

Allowing realignment gives accuracy results for a monophone-trained CRF that are superior to a triphone-trained HMM, with fewer parameters.
20. Code status

- Current version is Java-based and multithreaded
- TIMIT training takes a few days on an 8-processor machine
- At test time, the CRF generates an AT&T FSM lattice
- Use the AT&T FSM tools to decode
- Will (hopefully) make it easier to decode words
- Code is stable enough to try different kinds of experiments quickly
- Ilana joined the group and ran an experiment within a month
21. Joint models of attributes

- Monica's work showed that modeling attribute detection with joint detectors worked better
- e.g., modeling manner/place jointly works better
- cf. Chang et al.: hierarchical detectors work better
- This study: can we improve phonetic attribute-based detection by using phone classifiers and summing?
- Phone classifier = the ultimate joint modeling
22. Independent vs. Joint Feature Modeling

- Baseline 1
- 61 phone posteriors (joint modeling)
- Baseline 2
- 44 feature posteriors (independent modeling)
- Experiment
- Feature posteriors derived from the 61 phone posteriors
- In each frame, the weight for each feature is the summed weight of each phone exhibiting that feature
- e.g., P(stop) = P(/p/) + P(/t/) + P(/k/) + P(/b/) + P(/d/) + P(/g/)
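The summing step can be sketched as follows. The posterior values here are invented for illustration; the stop set is the standard six English stops.

```python
# Sketch of deriving a feature posterior from phone posteriors by
# summing over the phones that exhibit the feature (per frame).
phone_post = {"/p/": 0.05, "/t/": 0.30, "/k/": 0.10, "/b/": 0.02,
              "/d/": 0.03, "/g/": 0.01, "/iy/": 0.49}

STOPS = {"/p/", "/t/", "/k/", "/b/", "/d/", "/g/"}

def feature_posterior(feature_phones, posteriors):
    """P(feature) = sum of P(phone) over the phones carrying that feature."""
    return sum(posteriors[p] for p in feature_phones)

p_stop = feature_posterior(STOPS, phone_post)
```

Repeating this for each of the 44 features turns the 61 joint phone posteriors into an independent-style feature representation.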
23. Results: Joint vs. Independent Modeling

Posterior Type                   Number of Weights   Accuracy (%)   Correct (%)
Phonemes                         61                  67.27          68.77
Features                         44                  65.25          66.65
Features derived from Phonemes   44                  66.45          67.94
24. Removal of Feature Classes

25. Results of Feature Class Removal
26. Continued work on phone boundary detection

- Basic idea: eventually we want to use these as transition functions in the CRF
- The CRF was still under development when this study was done
- Added features corresponding to P(boundary | data) to the HMM
27. Phone Boundary Detection

- Evaluation and results
- Using phonological features as the input representation was modestly better than the phone posterior estimates themselves
- Phonological feature representations also seemed to edge out direct acoustic representations
- Phonological feature MLPs are more complex to train
- The nonlinear representations learned by the MLP were better for boundary detection than metric-based methods
- Input features: phonological features, acoustic features (PLP), and phone classifier outputs
- Classification methods: MLP vs. a metric-based method
28. Proposed Five-state HMM model

- Experiments
- For simplicity, the linear outputs from the PLP MLP detector were used as the phone boundary features, instead of the ones from the feature MLP detector. Several experiments were conducted:
- 0) Baseline system: standard 39 MFCCs
- 1) MFCCs + phone boundary features (no KLT)
- 2) MFCCs + phone boundary features, which were decorrelated using the Karhunen-Loeve transformation (KLT)
- 3) MFCCs + phone boundary features, with a KL transformation over all features
- 4) MFCCs (KLTed), to show the effect of the KL transformation on the MFCCs
- Training and recognition were conducted with the HTK toolkit on the TIMIT data set. When reaching the 4-mixture stage, some experiments failed due to data sparsity. We adopted a hybrid 2/4-mixture strategy, promoting triphones to 4 mixtures when the data was sufficient.
How to incorporate phone boundaries, estimated by a multi-layer perceptron (MLP), into an HMM system: we use a five-state HMM phone model to capture boundary information. To integrate phone boundary information into speech recognition, the phone boundary estimates were concatenated to the MFCCs as additional input features. We explicitly modeled the entering and exiting states of a phone as separate, one-frame distributions. The proposed 5-state HMM phone model is introduced below.

The two additional boundary states were intended to catch phone-boundary transitions, while the three self-looped states in the center model phone-internal information. Escape arcs were also included to bypass the boundary states for short phones.
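The topology described above might be written as an HTK-style transition matrix along the following lines. This is an illustrative sketch, not the trained model: the probabilities are invented, and HTK counts the non-emitting entry and exit states, so 5 emitting states give NumStates 7. States are 1 = entry, 2 = boundary-in (one frame, no self-loop), 3-5 = self-looped center states, 6 = boundary-out (one frame, no self-loop), 7 = exit.

```
~h "aa"
<BeginHMM>
  <NumStates> 7
  (output distributions for emitting states 2-6 omitted)
  <TransP> 7
    0.0 0.9 0.1 0.0 0.0 0.0 0.0
    0.0 0.0 1.0 0.0 0.0 0.0 0.0
    0.0 0.0 0.6 0.4 0.0 0.0 0.0
    0.0 0.0 0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.0 0.0 0.6 0.3 0.1
    0.0 0.0 0.0 0.0 0.0 0.0 1.0
    0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

The 0.1 entries in rows 1 and 5 are the escape arcs that bypass the boundary-in and boundary-out states, letting very short phones skip them.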
29. Results and Conclusion

Conclusion: Phonological features perform better as inputs to phone boundary classifiers than acoustic features. The results suggest that pattern changes in the phonological feature space may lead to robust boundary detection. By exploring the potential space of representations of boundaries, we argue that phonetic transitions are very important for automatic speech recognition. HMMs can be attuned to the transitions at phone boundaries by explicitly modeling phone transition states. Also, the combined strategy of binary boundary features, KLT, and the 5-state representation gives almost a 2% absolute improvement in phone recognition. Considering that the boundary information we integrated is one of the simplest representations, the result is rather encouraging. In future work, we hope to integrate phone boundary information as additional features to the CRF.
- Results
- The proposed 5-state HMM models performed better than their conventional 3-state counterparts on all training datasets.
- Decorrelation improved the accuracy of recognition on binary boundaries.
- Including the MFCCs in the decorrelation improved recognition further.
- For comparison, several experiments were also conducted on a 5-state HMM with a traditional, left-to-right, all-self-loops transition matrix. The results showed vastly increased deletions, indicating a bias against short-duration phones, whereas the proposed model is balanced between insertions and deletions.
- Recently, I modified the decision tree questions in the tied-state triphone step, and pushed the model to 16-mixture Gaussians. Part of the results are also shown in the table.