Title: OSU ASAT Status Report
1. OSU ASAT Status Report
- Jeremy Morris
- Yu Wang
- Ilana Bromberg
- Eric Fosler-Lussier
- Keith Johnson
- 13 October 2006
2. Personnel changes
- Jeremy and Yu are not currently on the project
- Jeremy is being funded on AFRL/DAGSI project
- Lexicon learning from orthography
- However, he is continuing to help in spare time
- Yu is currently in transition
- New student (to some!) Ilana Bromberg
- Technically funded as of 10/1, but did some experiments for an ICASSP paper in September
- Still sorting out a project for this year
3. Future potential changes
- May transition in another student in WI 06
- Carry on further with some of Jeremy's experiments
4. What's new?
- First pass on the parsing framework
- Last time talked about different models
- Naïve Bayes, Dirichlet modeling, MaxEnt models
- This time we settled on the Conditional Random Fields framework
- Monophone CRF phone recognition is beating triphone HTK recognition using attribute detectors
- Ready for your inputs!
- More boundary work
- Small improvements seen in integrating boundary information into HMM recognition
- Still to be seen if it helps CRFs
5. Parsing
- Desired ability to combine the output of multiple, correlated attribute detectors to produce
- Phone sequences
- Word sequences
- Handle both semi-static and dynamic events
- Traditional phonological features
- Landmarks, boundaries, etc.
- CRFs are a good bet for this
6. Conditional Random Fields
- A form of discriminative modelling
- Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
- Processes evidence bottom-up
- Combines multiple features of the data
- Builds the probability P(sequence | data)
- Computes the joint probability of the whole label sequence, conditioned on the data
- Minimal assumptions about the input
- Inputs don't need to be decorrelated
- cf. diagonal-covariance HMMs
7. Conditional Random Fields
[Figure: a linear-chain CRF over the label sequence /k/ /k/ /iy/ /iy/ /iy/, with each label node connected to its observation X]
- CRFs are based on the idea of Markov Random Fields
- Modelled as an undirected graph connecting labels with observations
- Observations in a CRF are not modelled as random variables
8. Conditional Random Fields
- The Hammersley-Clifford theorem states that a random field is an MRF iff it can be described in the above (exponential) form
- The exponential is the sum of the clique potentials of the undirected graph
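For reference, the exponential form in question (the Gibbs form of an MRF, and its linear-chain CRF instance over phone sequences) can be written in standard notation as:

```latex
% Gibbs / exponential form of an MRF over the label field y:
% the sum runs over the cliques C of the undirected graph,
% V_C are the clique potentials, Z is the normalizing constant.
P(\mathbf{y}) = \frac{1}{Z}\exp\!\Big(\sum_{C} V_C(\mathbf{y}_C)\Big)

% Linear-chain CRF: the conditional instance used for label sequences,
% with feature functions f_i and learned weights \lambda_i.
P(\mathbf{y}\mid\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\,
  \exp\!\Big(\sum_{t}\sum_{i} \lambda_i\, f_i(y_{t-1}, y_t, \mathbf{x}, t)\Big)
```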
9. Conditional Random Fields
- Conceptual overview
- Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
- A positive value if the attribute appears in the data
- A zero value if the attribute is not in the data
- Each feature function carries a weight that gives the strength of that feature function for the proposed label
- High positive weights indicate a good association between the feature and the proposed label
- High negative weights indicate a negative association between the feature and the proposed label
- Weights close to zero indicate the feature has little or no impact on the identity of the label
10. Experimental Setup
- Attribute Detectors
- ICSI QuickNet Neural Networks
- Two different types of attributes
- Phonological feature detectors
- Place, Manner, Voicing, Vowel Height, Backness, etc.
- Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
- Phone detectors
- Neural network outputs based on the phone labels, one output per label
- Classifiers were applied to 2960 utterances from the TIMIT training set
11. Experimental Setup
- Outputs from the neural nets are themselves treated as feature functions for the observed sequence: each attribute/label combination gives us a value for one feature function
- Note that this makes the feature functions non-binary features
- Different from most NLP uses of CRFs
- Along the lines of Gaussian-based CRFs (e.g., Microsoft)
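To make this concrete, here is a toy sketch of neural-net posteriors used as real-valued state feature functions, scored against per-label weights. All attribute names, posterior values, and weights are invented for illustration; they are not the trained ASAT values.

```python
# Toy sketch: net posteriors as non-binary CRF state feature functions.
# Each (attribute, label) pair is one feature function whose value is
# the net's output for that attribute at the current frame.

# Net outputs for one frame of speech (hypothetical numbers)
frame = {"vocalic": 0.9, "stop": 0.05, "voiced": 0.8}

# Learned weights lambda[(attribute, label)], one per feature function
# (signs follow the slide: positive = good association, negative = bad)
weights = {
    ("vocalic", "/iy/"): 2.0, ("stop", "/iy/"): -1.5, ("voiced", "/iy/"): 0.7,
    ("vocalic", "/k/"): -1.8, ("stop", "/k/"): 2.2, ("voiced", "/k/"): -0.9,
}

def score(label):
    """Unnormalized log-score: sum of weight * feature value for this label."""
    return sum(weights[(attr, label)] * value for attr, value in frame.items())

# The strongly vocalic, voiced frame scores higher as /iy/ than as /k/
best = max(["/iy/", "/k/"], key=score)
```

Normalizing such scores over all label sequences (the Z(x) term) is what turns them into the CRF's conditional probability.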
12. Experiment 1
- Goal: implement a Conditional Random Field model on ASAT-style phonological feature data
- Perform phone recognition
- Compare results to those obtained via a Tandem HMM system
13. Experiment 1 - Results

Model              Phone Accuracy (%)   Phone Correct (%)
Tandem monophone   61.48                63.50
Tandem triphone    66.69                72.52
CRF monophone      65.29                66.81
- The CRF system trained on monophones with these features achieves accuracy superior to the HMM on monophones
- The CRF comes close to achieving HMM triphone accuracy
- The CRF uses many, many fewer parameters
14. Experiment 2
- Goals
- Apply CRF model to phone classifier data
- Apply the CRF model to combined phonological feature classifier data and phone classifier data
- Perform phone recognition
- Compare results to those obtained via a Tandem HMM system
15. Experiment 2 - Results

Model                       Phone Acc (%)   Phone Correct (%)
Tandem mono (phones)        60.48           63.30
Tandem tri (phones)         67.32           73.81
CRF mono (phones)           66.89           68.49
Tandem mono (phones/feas)   61.78           63.68
Tandem tri (phones/feas)    67.96           73.40
CRF mono (phones/feas)      68.00           69.58

Note that the Tandem HMM result is the best result with only the top 39 features following a principal components analysis.
16. Experiment 3

- Goal
- Previous CRF experiments used phone posteriors for the CRF, and linear outputs transformed via a Karhunen-Loeve (KL) transform for the HMM system
- This transformation is needed to improve HMM performance through decorrelation of the inputs
- Using the same linear outputs as the HMM system, do our results change?
17. Experiment 3 - Results

Model                           Phone Accuracy (%)   Phone Correct (%)
CRF (phones) posteriors         67.27                68.77
CRF (phones) linear KL          66.60                68.25
CRF (phones) post. + linear     68.18                69.87
CRF (features) posteriors       65.25                66.65
CRF (features) linear KL        66.32                67.95
CRF (features) post. + linear   66.89                68.48
CRF (features) linear (no KL)   65.89                68.46

Also shown: adding both feature sets together and giving the system supposedly redundant information leads to a gain in accuracy.
18. Experiment 4

- Goal
- Previous CRF experiments did not allow for realignment of the training labels
- Boundaries for labels provided by the TIMIT hand transcribers were used throughout training
- HMM systems are allowed to shift boundaries during EM learning
- If we allow for realignment in our training process, can we improve the CRF results?
19. Experiment 4 - Results

Model                       Phone Accuracy (%)   Phone Correct (%)
Tandem tri (phones)         67.32                73.81
CRF (phones) no realign     67.27                68.77
CRF (phones) realign        69.63                72.40
Tandem tri (features)       66.69                72.52
CRF (features) no realign   65.25                66.65
CRF (features) realign      67.52                70.13

Allowing realignment gives accuracy results for a monophone-trained CRF that are superior to a triphone-trained HMM, with fewer parameters.
20. Code status

- Current version is Java-based and multithreaded
- TIMIT training takes a few days on an 8-processor machine
- At test time, the CRF generates an AT&T FSM lattice
- Use the AT&T FSM tools to decode
- Will (hopefully) make it easier to decode words
- Code is stable enough to try different kinds of experiments quickly
- Ilana joined the group and ran an experiment within a month
21. Joint models of attributes

- Monica's work showed that modeling attribute detection with joint detectors worked better
- e.g., modeling manner/place jointly works better
- cf. Chang et al.: hierarchical detectors work better
- This study: can we improve phonetic attribute-based detection by using phone classifiers and summing?
- Phone classifier = the ultimate joint modeling
22. Independent vs. Joint Feature Modeling

- Baseline 1
- 61 phone posteriors (joint modeling)
- Baseline 2
- 44 feature posteriors (independent modeling)
- Experiment
- Feature posteriors derived from the 61 phone posteriors
- In each frame, the weight for each feature is the summed weight of each phone exhibiting that feature
- e.g., P(stop) = P(/p/) + P(/t/) + P(/k/) + P(/b/) + P(/d/) + P(/g/)
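The summing step can be sketched as follows. The posterior values here are invented for illustration; the stop set is the standard six English stops.

```python
# Sketch of deriving a feature posterior from phone posteriors by
# summing over the phones that exhibit the feature (per frame).
phone_post = {"/p/": 0.05, "/t/": 0.30, "/k/": 0.10, "/b/": 0.02,
              "/d/": 0.03, "/g/": 0.01, "/iy/": 0.49}

STOPS = {"/p/", "/t/", "/k/", "/b/", "/d/", "/g/"}

def feature_posterior(feature_phones, posteriors):
    """P(feature) = sum of P(phone) over the phones carrying that feature."""
    return sum(posteriors[p] for p in feature_phones)

p_stop = feature_posterior(STOPS, phone_post)
```

Repeating this for each of the 44 features turns the 61 joint phone posteriors into an independent-style feature representation.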
23. Results: Joint vs. Independent Modeling

Posterior Type                   Number of Weights   Accuracy (%)   Correct (%)
Phonemes                         61                  67.27          68.77
Features                         44                  65.25          66.65
Features derived from Phonemes   44                  66.45          67.94
24. Removal of Feature Classes

25. Results of Feature Class Removal
26. Continued work on phone boundary detection

- Basic idea: eventually we want to use these as transition functions in the CRF
- The CRF was still under development when this study was done
- Added features corresponding to P(boundary | data) to the HMM
27. Phone Boundary Detection

- Evaluation and results
- Using phonological features as the input representation was modestly better than the phone posterior estimates themselves
- Phonological feature representations also seemed to edge out direct acoustic representations
- Phonological feature MLPs are more complex to train
- The nonlinear representations learned by the MLP were better for boundary detection than metric-based methods
- Input features: phonological features, acoustic features (PLP), and phone classifier outputs
- Classification methods: MLP vs. a metric-based method
28. Proposed Five-state HMM model

- Experiments
- For simplicity, the linear outputs from the PLP MLP detector were used as the phone boundary features, instead of the ones from the feature MLP detector. Several experiments were conducted:
- 0) Baseline system: standard 39 MFCCs
- 1) MFCCs + phone boundary features (no KLT)
- 2) MFCCs + phone boundary features, which were decorrelated using the Karhunen-Loeve transformation (KLT)
- 3) MFCCs + phone boundary features, with a KL transformation over all features
- 4) MFCCs (KLTed), to show the effect of the KL transformation on the MFCCs
- Training and recognition were conducted with the HTK toolkit on the TIMIT data set. When reaching the 4-mixture stage, some experiments failed due to data sparsity. We adopted a hybrid 2/4-mixture strategy, promoting triphones to 4 mixtures when the data was sufficient.
How to incorporate phone boundaries, estimated by a multi-layer perceptron (MLP), into an HMM system: we use a five-state HMM phone model to capture boundary information. To integrate phone boundary information into speech recognition, the phone boundary estimates were concatenated to the MFCCs as additional input features. We explicitly modeled the entering and exiting states of a phone as separate, one-frame distributions. The proposed 5-state HMM phone model is introduced below.

The two additional boundary states were intended to catch phone-boundary transitions, while the three self-looped states in the center model phone-internal information. Escape arcs were also included to bypass the boundary states for short phones.
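The topology described above might be written as an HTK-style transition matrix along the following lines. This is an illustrative sketch, not the trained model: the probabilities are invented, and HTK counts the non-emitting entry and exit states, so 5 emitting states give NumStates 7. States are 1 = entry, 2 = boundary-in (one frame, no self-loop), 3-5 = self-looped center states, 6 = boundary-out (one frame, no self-loop), 7 = exit.

```
~h "aa"
<BeginHMM>
  <NumStates> 7
  (output distributions for emitting states 2-6 omitted)
  <TransP> 7
    0.0 0.9 0.1 0.0 0.0 0.0 0.0
    0.0 0.0 1.0 0.0 0.0 0.0 0.0
    0.0 0.0 0.6 0.4 0.0 0.0 0.0
    0.0 0.0 0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.0 0.0 0.6 0.3 0.1
    0.0 0.0 0.0 0.0 0.0 0.0 1.0
    0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

The 0.1 entries in rows 1 and 5 are the escape arcs that bypass the boundary-in and boundary-out states, letting very short phones skip them.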
29. Results and Conclusion

Conclusion: Phonological features perform better as inputs to phone boundary classifiers than acoustic features. The results suggest that pattern changes in the phonological feature space may lead to robust boundary detection. By exploring the potential space of representations of boundaries, we argue that phonetic transitions are very important for automatic speech recognition. HMMs can be attuned to the transitions at phone boundaries by explicitly modeling phone transition states. Also, the combined strategy of binary boundary features, KLT, and the 5-state representation gives almost a 2% absolute improvement in phone recognition. Considering that the boundary information we integrated is one of the simplest representations, the result is rather encouraging. In future work, we hope to integrate phone boundary information as additional features to the CRF.
- Results
- The proposed 5-state HMM models performed better than their conventional 3-state counterparts on all training datasets.
- Decorrelation improved the accuracy of recognition on binary boundaries.
- Including the MFCCs in the decorrelation improved recognition further.
- For comparison, several experiments were also conducted on a 5-state HMM with a traditional, left-to-right, all-self-loops transition matrix. The results showed vastly increased deletions, indicating a bias against short-duration phones, whereas the proposed model is balanced between insertions and deletions.
- Recently, I modified the decision tree questions in the tied-state triphone step, and pushed the model to 16-mixture Gaussians. Part of the results are also shown in the table.