OSU ASAT Status Report
1
OSU ASAT Status Report
  • Jeremy Morris
  • Yu Wang
  • Ilana Bromberg
  • Eric Fosler-Lussier
  • Keith Johnson
  • 13 October 2006

2
Personnel changes
  • Jeremy and Yu are not currently on the project
  • Jeremy is being funded on an AFRL/DAGSI project
  • Lexicon learning from orthography
  • However, he is continuing to help in his spare
    time
  • Yu is currently in transition
  • New student (to some!): Ilana Bromberg
  • Technically funded as of 10/1, but did some
    experiments for an ICASSP paper in September
  • Still sorting out a project for this year

3
Future potential changes
  • May transition in another student in WI 06
  • Carry on further with some of Jeremy's
    experiments

4
What's new?
  • First pass on the parsing framework
  • Last time we talked about different models
  • Naïve Bayes, Dirichlet modeling, MaxEnt models
  • This time we have settled on a Conditional
    Random Fields framework
  • Monophone CRF phone recognition is beating
    triphone HTK recognition using attribute
    detectors
  • Ready for your inputs!
  • More boundary work
  • Small improvements seen from integrating
    boundary information into HMM recognition
  • Still to be seen whether it helps CRFs

5
Parsing
  • Desired ability to combine the output of
    multiple, correlated attribute detectors to
    produce
  • Phone sequences
  • Word sequences
  • Handle both semi-static and dynamic events
  • Traditional phonological features
  • Landmarks, boundaries, etc.
  • CRFs are a good bet for this

6
Conditional Random Fields
  • A form of discriminative modelling
  • Has been used successfully in various domains
    such as part of speech tagging and other Natural
    Language Processing tasks
  • Processes evidence bottom-up
  • Combines multiple features of the data
  • Builds the probability P(sequence | data)
  • Computes the joint probability of the whole
    label sequence given the data
  • Minimal assumptions about input
  • Inputs don't need to be decorrelated
  • cf. Diagonal Covariance HMMs

7
Conditional Random Fields
[Diagram: linear-chain CRF drawn as an undirected
graph, with the label sequence /k/ /k/ /iy/ /iy/
/iy/ connected to observations X at each frame]
  • CRFs are based on the idea of Markov Random
    Fields
  • Modelled as an undirected graph connecting labels
    with observations
  • Observations in a CRF are not modelled as random
    variables

8
Conditional Random Fields
  • The Hammersley-Clifford theorem states that a
    random field is an MRF iff its distribution can
    be written in exponential (Gibbs) form
  • The exponential is the sum of the clique
    potentials of the undirected graph
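The exponential form in question, written out in the standard linear-chain CRF notation (this is the textbook form, not the slide's own rendering), is:

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t} \sum_{i} \lambda_i \, f_i(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\Big( \sum_{t} \sum_{i} \lambda_i \, f_i(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Here the f_i are the feature functions (clique potentials on adjacent labels and the observations), the λ_i are their learned weights, and Z(x) normalizes over all possible label sequences.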

9
Conditional Random Fields
  • Conceptual Overview
  • Each attribute of the data we are trying to model
    fits into a feature function that associates the
    attribute and a possible label
  • A positive value if the attribute appears in the
    data
  • A zero value if the attribute is not in the data
  • Each feature function carries a weight that gives
    the strength of that feature function for the
    proposed label
  • High positive weights indicate a good association
    between the feature and the proposed label
  • High negative weights indicate a negative
    association between the feature and the proposed
    label
  • Weights close to zero indicate the feature has
    little or no impact on the identity of the label
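A minimal sketch of this idea (the attribute names, labels, and weight values below are illustrative, not taken from the talk): each (attribute, label) pair gets a feature function and a weight, and a label's unnormalized score is the weighted sum of the features that fire.

```python
# Sketch of CRF-style binary feature functions and weights.
# Attributes, labels, and weights are illustrative placeholders.

def feature_fires(attribute, observed_attributes):
    """Binary feature: 1.0 if the attribute appears in the data, else 0.0."""
    return 1.0 if attribute in observed_attributes else 0.0

# Learned weights for (attribute, label) pairs: high positive = good
# association, high negative = bad association, near zero = little impact.
weights = {
    ("voiced", "/b/"): 2.0,
    ("voiced", "/p/"): -1.5,
    ("stop", "/b/"): 1.0,
    ("stop", "/p/"): 1.0,
}

def score(label, observed_attributes):
    """Unnormalized log-score for a proposed label given the observed data."""
    return sum(w * feature_fires(attr, observed_attributes)
               for (attr, lab), w in weights.items() if lab == label)

observed = {"voiced", "stop"}
print(score("/b/", observed))  # 3.0: both features fire with positive weight
print(score("/p/", observed))  # -0.5: the voicing feature counts against /p/
```

Normalizing these scores over all labels (and over whole sequences, via the transition features) is what turns them into the probability P(sequence | data).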

10
Experimental Setup
  • Attribute Detectors
  • ICSI QuickNet Neural Networks
  • Two different types of attributes
  • Phonological feature detectors
  • Place, Manner, Voicing, Vowel Height, Backness,
    etc.
  • Features are grouped into eight classes, with
    each class having a variable number of possible
    values based on the IPA phonetic chart
  • Phone detectors
  • Neural network outputs based on the phone
    labels, one output per label
  • Classifiers were applied to 2960 utterances from
    the TIMIT training set

11
Experimental Setup
  • Outputs from the neural nets are themselves
    treated as feature functions for the observed
    sequence; each attribute/label combination gives
    us a value for one feature function
  • Note that this makes the feature functions
    non-binary features
  • Different from most NLP uses of CRFs
  • Along lines of Gaussian-based CRFs (e.g.,
    Microsoft)
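A sketch of the difference (detector names and numbers are made up for illustration): instead of a binary indicator, each attribute/label feature function returns the detector's real-valued output for the frame, and the weighted sum ranges over those values.

```python
# Sketch: neural-net detector outputs used directly as real-valued
# feature-function values, one per attribute/label combination.
# Detector names, weights, and outputs are illustrative placeholders.

nn_outputs = {"voiced": 0.9, "stop": 0.7, "nasal": 0.1}  # one frame

weights = {  # lambda_{label, attribute}, learned during CRF training
    ("/b/", "voiced"): 1.2,
    ("/b/", "stop"): 0.8,
    ("/b/", "nasal"): -0.5,
}

def state_score(label, outputs, weights):
    """Real-valued analogue of the binary case:
    sum over attributes a of lambda_{label, a} * NN_a(x)."""
    return sum(weights[(label, a)] * v for a, v in outputs.items()
               if (label, a) in weights)

print(round(state_score("/b/", nn_outputs, weights), 3))  # 1.59
```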

12
Experiment 1
  • Goal: Implement a Conditional Random Field
    model on ASAT-style phonological feature data
  • Perform phone recognition
  • Compare results to those obtained via a Tandem
    HMM system

13
Experiment 1 - Results
Model              Phone Accuracy (%)  Phone Correct (%)
Tandem monophone   61.48               63.50
Tandem triphone    66.69               72.52
CRF monophone      65.29               66.81
  • CRF system trained on monophones with these
    features achieves accuracy superior to HMM on
    monophones
  • CRF comes close to achieving HMM triphone
    accuracy
  • CRF uses far fewer parameters

14
Experiment 2
  • Goals
  • Apply CRF model to phone classifier data
  • Apply CRF model to combined phonological feature
    classifier data and phone classifier data
  • Perform phone recognition
  • Compare results to those obtained via a Tandem
    HMM system

15
Experiment 2 - Results
Model                      Phone Acc (%)  Phone Correct (%)
Tandem mono (phones)       60.48          63.30
Tandem tri (phones)        67.32          73.81
CRF mono (phones)          66.89          68.49
Tandem mono (phones/feas)  61.78          63.68
Tandem tri (phones/feas)   67.96          73.40
CRF mono (phones/feas)     68.00          69.58
Note that the Tandem HMM result is its best
result, using only the top 39 features after a
principal components analysis.
16
Experiment 3
  • Goal
  • Previous CRF experiments used phone posteriors
    for the CRF, and linear outputs transformed via
    a Karhunen-Loeve (KL) transform for the HMM
    system
  • This transformation is needed to improve HMM
    performance through decorrelation of the inputs
  • Using the same linear outputs as the HMM
    system, do our results change?

17
Experiment 3 - Results
Model                                Phone Accuracy (%)  Phone Correct (%)
CRF (phones) posteriors              67.27               68.77
CRF (phones) linear KL               66.60               68.25
CRF (phones) posteriors + linear     68.18               69.87
CRF (features) posteriors            65.25               66.65
CRF (features) linear KL             66.32               67.95
CRF (features) posteriors + linear   66.89               68.48
CRF (features) linear (no KL)        65.89               68.46
Also shown: adding both feature sets together,
giving the system supposedly redundant
information, leads to a gain in accuracy.
18
Experiment 4
  • Goal
  • Previous CRF experiments did not allow for
    realignment of the training labels
  • Boundaries for labels provided by TIMIT hand
    transcribers used throughout training
  • HMM systems allowed to shift boundaries during EM
    learning
  • If we allow for realignment in our training
    process, can we improve the CRF results?
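One way to realign during training (a hedged sketch, not the project's actual code): hold the phone sequence fixed and use dynamic programming to move the boundaries so that the summed per-frame scores are maximized.

```python
def realign(frame_scores, num_phones):
    """frame_scores[t][i]: score of frame t under the i-th phone of the
    (fixed) phone sequence. Returns one label index per frame, chosen by
    dynamic programming so that labels are monotone non-decreasing and
    each phone keeps at least one frame. Hypothetical helper for
    illustration, not the project's implementation."""
    T = len(frame_scores)
    NEG = float("-inf")
    # best[t][i]: best total score of frames 0..t with frame t in phone i
    best = [[NEG] * num_phones for _ in range(T)]
    back = [[0] * num_phones for _ in range(T)]
    best[0][0] = frame_scores[0][0]
    for t in range(1, T):
        for i in range(num_phones):
            stay = best[t - 1][i]                      # remain in phone i
            advance = best[t - 1][i - 1] if i > 0 else NEG  # new boundary
            if advance > stay:
                best[t][i], back[t][i] = advance, i - 1
            else:
                best[t][i], back[t][i] = stay, i
            if best[t][i] > NEG:
                best[t][i] += frame_scores[t][i]
    # Trace back from the last phone at the last frame.
    labels = [num_phones - 1]
    for t in range(T - 1, 0, -1):
        labels.append(back[t][labels[-1]])
    return labels[::-1]

# Two phones, four frames: the best boundary falls after frame 1.
scores = [[5, 0], [4, 1], [1, 6], [0, 5]]
print(realign(scores, 2))  # [0, 0, 1, 1]
```

The new frame labels would then replace the hand-transcribed boundaries for the next training iteration, analogous to boundary shifting in HMM EM training.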

19
Experiment 4 - Results
Model                      Phone Accuracy (%)  Phone Correct (%)
Tandem tri (phones)        67.32               73.81
CRF (phones) no realign    67.27               68.77
CRF (phones) realign       69.63               72.40
Tandem tri (features)      66.69               72.52
CRF (features) no realign  65.25               66.65
CRF (features) realign     67.52               70.13
Allowing realignment gives accuracy results for a
monophone-trained CRF that are superior to a
triphone-trained HMM, with fewer parameters.
20
Code status
  • Current version: Java-based, multithreaded
  • TIMIT training takes a few days on an
    8-processor machine
  • At test time, the CRF generates an AT&T FSM
    lattice
  • Use the AT&T FSM tools to decode
  • Will (hopefully) make it easier to decode words
  • Code is stable enough to try different kinds of
    experiments quickly
  • Ilana joined the group and ran an experiment
    within a month

21
Joint models of attributes
  • Monica's work showed that modeling attribute
    detection with joint detectors worked better
  • e.g., modeling manner/place jointly works
    better
  • cf. Chang et al.: hierarchical detectors work
    better
  • This study: can we improve phonetic
    attribute-based detection by using phone
    classifiers and summing?
  • Phone classifier = the ultimate joint modeling

22
Independent vs Joint Feature Modeling
  • Baseline 1: 61 phone posteriors (joint
    modeling)
  • Baseline 2: 44 feature posteriors (independent
    modeling)
  • Experiment: feature posteriors derived from the
    61 phone posteriors
  • In each frame, the weight for each feature =
    the summed weight of each phone exhibiting that
    feature
  • e.g., P(stop) = P(/p/) + P(/t/) + P(/k/) +
    P(/b/) + P(/d/) + P(/g/)
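The per-frame summation can be sketched as follows (the phone set and posterior values are a small illustrative subset, not the full 61-phone TIMIT inventory):

```python
# Derive a feature posterior by summing, per frame, the posteriors of
# every phone that exhibits the feature. The phone inventory and
# posterior values below are illustrative, not the real 61-phone set.

phone_posteriors = {"/p/": 0.10, "/t/": 0.05, "/k/": 0.30,
                    "/b/": 0.15, "/d/": 0.05, "/g/": 0.05, "/iy/": 0.30}

phones_with_feature = {
    "stop": ["/p/", "/t/", "/k/", "/b/", "/d/", "/g/"],
    "voiced": ["/b/", "/d/", "/g/", "/iy/"],
}

def feature_posterior(feature, posteriors):
    """P(feature) = sum of P(phone) over phones exhibiting the feature."""
    return sum(posteriors[p] for p in phones_with_feature[feature])

print(round(feature_posterior("stop", phone_posteriors), 2))    # 0.7
print(round(feature_posterior("voiced", phone_posteriors), 2))  # 0.55
```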

23
Results: Joint vs. Independent Modeling

Posterior Type                  Number of Weights  Accuracy (%)  Correct (%)
Phonemes                        61                 67.27         68.77
Features                        44                 65.25         66.65
Features derived from Phonemes  44                 66.45         67.94
24
Removal of Feature Classes
25
Results of Feature Class Removal
26
Continued work on phone boundary detection
  • Basic idea: eventually we want to use these as
    transition functions in the CRF
  • The CRF was still under development when this
    study was done
  • Added features corresponding to
    P(boundary | data) to the HMM

27
Phone Boundary Detection
  • Evaluation and Results
  • Using phonological features as an input
    representation was modestly better than the phone
    posterior estimates themselves.
  • Phonological feature representations also seemed
    to edge out direct acoustic representations
  • Phonological feature MLPs are more complex to
    train.
  • The nonlinear representations learned by the MLP
    were better for boundary detection than
    metric-based methods.

Input features: phonological features, acoustic
features (PLP), and phone classifier outputs.
Classification methods: MLP and a metric-based
method.
28
Proposed Five-state HMM model
  • Experiments
  • For simplicity, the linear outputs from the PLP
    MLP detector were used as the phone boundary
    features, instead of the ones from the feature
    MLP detector. Several experiments were
    conducted:
  • 0) Baseline system: standard 39 MFCCs
  • 1) MFCCs + phone boundary features (no KLT)
  • 2) MFCCs + phone boundary features, decorrelated
    using the Karhunen-Loeve transformation (KLT)
  • 3) MFCCs + phone boundary features, with a
    KL transformation over all features
  • 4) MFCCs (KLTed), to show the effect of the KL
    transformation on MFCCs
  • Training and recognition were conducted with
    the HTK toolkit on the TIMIT data set. When
    reaching the 4-mixture stage, some experiments
    failed due to data sparsity. We adopted a hybrid
    2/4-mixture strategy, promoting triphones to 4
    mixtures when the data was sufficient.

How to incorporate phone boundaries, estimated by
a multi-layer perceptron (MLP), into an HMM
system: a five-state HMM phone model to capture
boundary information. To integrate phone boundary
information into speech recognition, phone
boundary information was concatenated to the
MFCCs as additional input features. We explicitly
modeled the entering and exiting states of a
phone as separate, one-frame distributions. The
proposed 5-state HMM phone model is introduced
below.
The two additional boundary states were intended
to catch phone-boundary transitions, while the
three self-looped states in the center model
phone-internal information. Escape arcs were also
included to bypass the boundary states for short
phones.
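A sketch of the intended topology (the transition probabilities below are made-up placeholders, not trained values): a one-frame entry boundary state, three self-looped internal states, a one-frame exit boundary state, and escape arcs that let short phones bypass the boundary states.

```python
# Illustrative topology for the proposed 5-state phone model.
# States: 0 = entry boundary (one frame, no self-loop),
#         1-3 = self-looped internal states,
#         4 = exit boundary (one frame, no self-loop).
# All probabilities are placeholders, not trained values.

pi = [0.7, 0.3, 0.0, 0.0, 0.0]  # escape arc: may enter at state 1,
                                #  skipping the entry boundary state
P = [
    # to:  0    1    2    3    4   exit
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # 0: entry boundary, exactly one frame
    [0.0, 0.6, 0.4, 0.0, 0.0, 0.0],  # 1: internal, self-loop
    [0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # 2: internal, self-loop
    [0.0, 0.0, 0.0, 0.6, 0.3, 0.1],  # 3: internal; escape arc straight
                                     #    to exit, skipping the boundary
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # 4: exit boundary, exactly one frame
]

# Sanity check: every state's outgoing probabilities sum to 1.
for row in P:
    assert abs(sum(row) - 1.0) < 1e-9
assert abs(sum(pi) - 1.0) < 1e-9
print("topology OK")
```

The zero self-loop probabilities on states 0 and 4 are what force the boundary states to occupy exactly one frame, matching the "separate, one-frame distribution" described above.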
29
Results and Conclusion
Conclusion: Phonological features perform better
as inputs to phone boundary classifiers than
acoustic features. The results suggest that
pattern changes in the phonological feature space
may lead to robust boundary detection. By
exploring the potential space of representations
of boundaries, we argue that phonetic transitions
are very important for automatic speech
recognition. HMMs can be attuned to the
transitions at phone boundaries by explicitly
modeling phone transition states. Also, the
combined strategy of binary boundary features,
KLT, and the 5-state representation gives almost
a 2% absolute improvement in phone recognition.
Considering that the boundary information we
integrated is one of the simplest
representations, the result is rather
encouraging. In future work, we hope to
integrate phone boundary information as
additional features to the CRF.
  • Results
  • The proposed 5-state HMM models performed better
    than their conventional 3-state counterparts on
    all training datasets.
  • Decorrelation improved the accuracy of
    recognition on binary boundaries.
  • Including MFCCs in the decorrelation improved
    recognition further.
  • For comparison, several experiments were also
    conducted on a 5-state HMM with a traditional,
    left-to-right all-self-loops transition matrix.
    The results showed vastly increased deletions,
    indicating a bias against short duration phones,
    whereas the proposed model is balanced between
    insertions and deletions.
  • Recently, I modified the decision-tree
    questions in the tied-state triphone step and
    pushed the model to 16-mixture Gaussians. Some
    of these results are also shown in the table.