Combining Phonetic Attributes Using Conditional Random Fields
Jeremy Morris and Eric Fosler-Lussier
Department of Computer Science and Engineering
A Conditional Random Field is a mathematical model for sequences that is similar in many ways to a Hidden Markov Model, but is discriminative rather than generative in nature. Here we explore the application of the CRF model to ASR processing by building a system that performs first-pass phonetic recognition using discriminatively trained phonetic attributes. This system achieves an accuracy level in a phone recognition task superior to that of an HMM that has been comparably trained.
Phonetic Attributes
  • Phonetic attributes are defined via linguistic properties per the International Phonetic Association (IPA) phonetic chart
  • Consonants defined by their sonority, voicing,
    manner, and place of articulation
  • Vowels defined by their sonority, voicing,
    height, frontness, roundness and tenseness
  • Additional features for silence
  • Phonetic attributes are extracted by multi-layer perceptron (MLP) neural net classifiers (see the sketch following this list)
  • Classifiers are trained on 12th-order PLP cepstral coefficients and delta coefficients derived from the speech data
  • Speech data is broken up into frames of 25ms,
    with overlapping frames starting every 10ms
  • Input is a vector of PLP and delta coefficients
    for a nine frame window, centered on the current
    frame, with four frames of context on either side
  • Each classifier outputs a series of posterior
    probabilities representing the probability of the
    attribute given the data
  • One probability is output for each possible
    attribute for that classifier for any given frame
    of the data
  • These posteriors sum to one for a given
    classifier MLP
  • Classifiers were trained using a phonetically transcribed corpus
  • Phonetic attribute labels were derived by using
    the attributes provided by the IPA description of
    the transcribed phone (See Figure 1)
  • For our purposes, all phones are assumed to have
    their canonical values for training, and
    attribute boundaries occur at phonetic boundaries
Conditional Random Fields
  • A discriminative model of a sequence that attempts to model the posterior probability of a label sequence given a set of observed data (Lafferty et al., 2001)
  • A CRF can be described by the following equation:
    P(y|x) = (1/Z(x)) · exp( Σ_i [ Σ_j λ_j s_j(y_i, x, i) + Σ_k μ_k t_k(y_{i-1}, y_i, x, i) ] )
  • Where each s is a state feature function, each t is a transition feature function, each λ and μ is the trained weight associated with its feature function, and Z(x) is a normalizing term summed over all label sequences
  • State feature functions associate observations in
    the data at a particular time segment with the
    label at that time segment
  • Described as s(y, x, i), where y is the label, x
    is the observed data, and i is the time frame.
  • Takes a non-zero value when the current label at
    frame i is the same as y and some observation in
    x holds for the frame i
  • Otherwise, the value is zero
  • Transition feature functions associate
    observations in the data at a particular time
    segment with the transition from the previous
    label into the current label
  • Described as t(y', y, x, i), where y is the current label, y' is the previous label, x is the observed data, and i is the time frame
  • Takes a non-zero value when the current label at frame i is the same as y, the previous label is the same as y', and some observation in x holds for the frame i
  • For our model, a state feature function is a single output from our MLP phonetic attribute classifiers associated with a single label
  • Example: s_j(y, x, i) = MLP_stop(x_i) · δ(y_i = /t/)
  • The state feature function above has the value of the output of our MLP classifier for the STOP attribute if the label at time i is /t/. Otherwise, it takes the value of zero. (A code sketch of these feature functions follows this list.)
  • Currently, transition feature functions do not
    use the output of the MLP neural networks
  • The value of the function is 1 if the label pair
    matches the pair defined for the function, 0 if
    it does not.
  • Each feature function has an associated weight
    value
  • This weight value is high when a non-zero feature function value is strongly associated with a particular label, contributing a high value to the computed probability of that label
  • Weights are trained by maximizing the log
    likelihood of the training set with respect to
    the model
  • The strength of the CRF model is in its ability
    to use arbitrary features as input
  • In traditional HMMs, dependencies among features can lead to computationally difficult models, so features are usually required to be independent
  • In a CRF, no independence assumption on the
    features is made. Features can have arbitrary
    dependencies.

FIGURE 2 Graphical model for a CRF phone
labelling of the word /she/. The nodes labelled
Xi indicate time frame observations for time
frame i, while the nodes with phone labels
indicate the phonetic labels for that time frame.
Arcs between time frame observations and phone
labels indicate dependencies between the
observations and the identity of the phone
(modelled using state feature functions and
corresponding weights) while arcs between the
phone label nodes indicate dependencies between
neighboring phones (modelled using transition
feature functions and corresponding weights).
TABLE 1 PHONETIC ATTRIBUTES
Attribute Possible output values
SONORITY vowel, obstruent, sonorant, syllabic, silence
VOICE voiced, unvoiced, n/a
MANNER fricative, stop, flap, nasal, approximant, nasalflap, n/a
PLACE labial, dental, alveolar, palatal, velar, glottal, lateral, rhotic, n/a
HEIGHT high, mid, low, lowhigh, midhigh, n/a
FRONT front, back, central, backfront, n/a
ROUND round, nonround, roundnonround, nonroundround, n/a
TENSE tense, lax, n/a
  • Discussion
  • The CRF system trained on monophones has accuracy results that fall between those of the monophone-trained Tandem and triphone-trained Tandem systems
  • The CRF system makes many fewer insertions (extra
    hypothesized phones) than the Tandem systems
  • The CRF system also makes many more deletions (missed phones where one should be hypothesized) than the Tandem systems
  • The CRF system makes fewer hypotheses overall
    than either Tandem system
  • The precision measurement shows how often a
    hypothesis is a correct hypothesis
  • When the CRF system makes a hypothesis, it is
    correct more often than the Tandem systems
  • These results suggest some means to improve the
    performance of the CRF system
  • Addition of new extracted attributes (such as a
    boundary detector) to incorporate into transition
    feature functions
  • Addition of a penalty factor on transition
    weights to generate more transitions
  • Addition of more contextual attributes into the
    state features to attempt to gain some level of
    triphonic context
  • Results
  • Phone-level accuracies of the CRF system were compared to a baseline Tandem system (Hermansky et al., 2000)
  • A Tandem system uses the output of the neural
    networks as inputs to a Hidden Markov Model
    system
  • Tandem system was trained with both triphone
    label contexts and monophone label contexts
  • Triphone labels give a single left and right
    context phone to the label, allowing a finer
    level of context to be used when labels are
    assigned
  • In other words, the context for the phone /ae/ in
    the string of phones /k ae t/ is different from
    that in the string /k ae p/ since the right
    context is different
  • Monophone labels are a single phone label with no surrounding phone context
  • CRF system results are only for monophone labels
  • TABLE 2 (below) breaks down the results into
    three categories
  • Phone Correctness: Was the correct phone hypothesized?
  • Number of correct labels divided by the number of true labels
  • Phone Accuracy: Correctness penalized for overgeneration
  • Number of correct labels, penalized by the number of spuriously hypothesized labels (insertions), divided by the number of true labels
  • Phone Precision: When a phone is hypothesized, how often is it right?
  • Number of correct labels divided by the number of hypothesized labels (see the sketch following this list)
  • References
  • J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings of the 18th International Conference on Machine Learning, 2001.
  • H. Hermansky, D. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proceedings of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2000.
  • M. Rajamanohar and E. Fosler-Lussier, "An evaluation of hierarchical articulatory feature detectors," in IEEE Automatic Speech Recognition and Understanding Workshop, 2005.
  • S. Sarawagi, CRF package for Java, http://crf.sourceforge.net
  • D. Johnson et al., ICSI QuickNet software, http://www.icsi.berkeley.edu/Speech/qn.html
  • S. Young et al., HTK HMM software, http://htk.eng.cam.ac.uk/

TABLE 2 Phone Recognition Comparisons
Model Phone Correctness (%) Phone Accuracy (%) Phone Precision (%)
Tandem (mono) 63.34 61.40 73.29
Tandem (tri) 72.42 66.85 73.62
CRF (mono) 65.45 63.84 76.61