Combining Speech Attributes for Speech Recognition - PowerPoint PPT Presentation

About This Presentation

Title: Combining Speech Attributes for Speech Recognition

Slides: 27
Provided by: jeremyj
Learn more at: https://cse.osu.edu

Transcript and Presenter's Notes

Title: Combining Speech Attributes for Speech Recognition


1
Combining Speech Attributes for Speech Recognition
  • Jeremy Morris
  • November 9, 2006

2
Overview
  • Problem Statement (Motivation)
  • Conditional Random Fields
  • Experiments and Results
  • Future Work

3
Problem Statement
  • Developed as part of the ASAT Project
  • Automatic Speech Attribute Transcription
  • Project to build tools to extract and parse
    speech attributes from a speech signal
  • Goal: Develop a system for bottom-up speech
    recognition using 'speech attributes'

4
Speech Attributes?
  • Any information that could be useful for
    recognizing the spoken language
  • Phonetic attributes
  • Consonants have manner, place of articulation,
    voicing
  • Vowels have height, frontness, roundness,
    tenseness
  • Speaker attributes (gender, age, etc.)
  • Any other useful attributes that could be used
    for speech recognition

/d/   manner: stop   place of artic: dental   voicing: voiced
/iy/  height: high   frontness: front   roundness: nonround   tenseness: tense
/ae/  height: low    frontness: front   roundness: nonround   tenseness: tense
/t/   manner: stop   place of artic: dental   voicing: unvoiced
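The attribute sets above can be sketched as a small lookup table. This is our own illustrative Python (the dict layout and function are assumptions, not part of the ASAT tools); the attribute names and values follow the slide:

```python
# Hypothetical encoding of the slide's phone-attribute table.
PHONE_ATTRIBUTES = {
    "/d/":  {"manner": "stop", "place": "dental", "voicing": "voiced"},
    "/iy/": {"height": "high", "frontness": "front",
             "roundness": "nonround", "tenseness": "tense"},
    "/ae/": {"height": "low", "frontness": "front",
             "roundness": "nonround", "tenseness": "tense"},
    "/t/":  {"manner": "stop", "place": "dental", "voicing": "unvoiced"},
}

def shared_attributes(p1, p2):
    """Return the attribute/value pairs two phones have in common."""
    a1, a2 = PHONE_ATTRIBUTES[p1], PHONE_ATTRIBUTES[p2]
    return {k: v for k, v in a1.items() if a2.get(k) == v}
```

For example, /d/ and /t/ differ only in voicing, so they share manner and place.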
5
(No Transcript)
6
Feature Combination
  • Our piece of this project is to find ways to
    combine speech attributes together and use them
    to recognize language
  • Other groups are working on finding features to
    extract and methods of extracting them
  • Note that there is no guarantee that attributes
    will be independent of each other
  • In fact, many attributes will be strongly
    correlated or dependent on other attributes
  • e.g. voicing for vowels

7
Evidence Combination
  • Two basic ways to build hypotheses

8
Top Down
  • Traditional Automatic Speech Recognition (ASR)
    systems use a top-down approach
  • Hypothesis is the phone we are predicting
  • Data is some encoding of the acoustic speech
    signal
  • A likelihood of the signal given the phone label
    is learned from data
  • A prior probability for the phone label is
    learned from the data
  • These are combined through Bayes Rule to give us
    the posterior probability P(label | data)

P(/iy/)
P(X | /iy/)
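The top-down combination via Bayes' rule can be sketched in a few lines; the likelihood and prior numbers below are invented for illustration:

```python
# P(label | X) = P(X | label) * P(label) / P(X), per Bayes' rule.
def posterior(likelihoods, priors):
    """Combine likelihoods P(X|label) and priors P(label) into posteriors."""
    joint = {lab: likelihoods[lab] * priors[lab] for lab in priors}
    z = sum(joint.values())              # P(X), the normalizer
    return {lab: p / z for lab, p in joint.items()}

# Made-up numbers: the signal is more likely under /iy/ than /ae/.
post = posterior({"/iy/": 0.30, "/ae/": 0.10}, {"/iy/": 0.5, "/ae/": 0.5})
```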
9
Bottom Up
  • Bottom-up models have the same high-level goal:
    determine the label from the observation
  • But instead of a likelihood, the posterior
    probability P(label | data) is learned directly
    from the data
  • Neural Networks can be used to learn
    probabilities in this manner

P(/iy/ | X)
10
Speech is a Sequence
  • Speech is not a single, independent event
  • It is a combination of multiple events over time
  • A model to recognize spoken language should take
    into account dependencies across time

11
Speech is a Sequence
  • A top down model can be extended into a time
    sequence as a Hidden Markov Model (HMM)
  • Now our likelihood of the data is over the entire
    sequence instead of a single phone
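The sequence extension can be illustrated with the textbook HMM forward recursion, which accumulates the likelihood of the data over the whole sequence; the states, transitions, and emissions below are toy values, not the system described here:

```python
# Standard HMM forward algorithm: alpha_t(s) = emit(s, x_t) *
# sum over s' of alpha_{t-1}(s') * trans(s', s).  Toy parameters.
def forward(obs, states, start, trans, emit):
    """Return P(obs) under the HMM by summing the final forward variables."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        alpha = {s: emit[s][x] * sum(alpha[sp] * trans[sp][s] for sp in states)
                 for s in states}
    return sum(alpha.values())

states = ["A", "B"]
p = forward(["x", "y"], states,
            start={"A": 0.6, "B": 0.4},
            trans={"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}},
            emit={"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}})
```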

12
Conditional Random Fields
  • A form of discriminative modelling
  • Has been used successfully in various domains
    such as part of speech tagging and other Natural
    Language Processing tasks
  • Processes evidence bottom-up
  • Combines multiple features of the data
  • Builds the probability P(sequence | data)

13
Conditional Random Fields
  • Conceptual Overview
  • Each attribute of the data we are trying to model
    fits into a feature function that associates the
    attribute and a possible label
  • A positive value if the attribute appears in the
    data
  • A zero value if the attribute is not in the data
  • Each feature function carries a weight that gives
    the strength of that feature function for the
    proposed label
  • High positive weights indicate a good association
    between the feature and the proposed label
  • High negative weights indicate a negative
    association between the feature and the proposed
    label
  • Weights close to zero indicate the feature has
    little or no impact on the identity of the label
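The feature-function-plus-weight idea above can be sketched as follows; the attribute names and weight values are invented for illustration:

```python
# Each (attribute, label) pair is one state feature function; its weight
# encodes how strongly the attribute argues for (or against) the label.
def state_score(attributes, label, weights):
    """Sum the weights of the feature functions that fire for this label."""
    return sum(weights.get((a, label), 0.0) for a in attributes)

weights = {
    ("voicing=unvoiced", "/t/"):  1.8,   # strong positive association
    ("voicing=unvoiced", "/d/"): -2.1,   # strong negative association
    ("manner=stop",      "/t/"):  0.9,   # weaker positive association
}
score_t = state_score({"voicing=unvoiced", "manner=stop"}, "/t/", weights)
```

An unvoiced stop thus scores well for /t/ and poorly for /d/, exactly the positive/negative weight behavior described above.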

14
Conditional Random Fields
/k/
/k/
/iy/
/iy/
/iy/
X
X
X
X
X
  • CRFs have transition feature functions and state
    feature functions
  • Transition functions add associations between
    transitions from one label to another
  • State functions help determine the identity of
    the state
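Putting state and transition functions together, the (unnormalized) score of a candidate label sequence can be sketched like this; all names and weights are illustrative:

```python
# State functions tie each frame's observed attributes to its label;
# transition functions tie adjacent label pairs.
def sequence_score(frames, labels, state_w, trans_w):
    """Unnormalized CRF score for one labeling of a frame sequence."""
    score = sum(state_w.get((a, lab), 0.0)
                for attrs, lab in zip(frames, labels) for a in attrs)
    score += sum(trans_w.get((l1, l2), 0.0)
                 for l1, l2 in zip(labels, labels[1:]))
    return score

s = sequence_score([{"stop"}, {"high"}], ["/k/", "/iy/"],
                   state_w={("stop", "/k/"): 1.0, ("high", "/iy/"): 1.5},
                   trans_w={("/k/", "/iy/"): 0.5})
```

Normalizing such scores over all label sequences gives the CRF's P(sequence | data).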

15
Conditional Random Fields
16
Experiments
  • Goal: Implement a Conditional Random Field Model
    on speech attribute data
  • Perform phone recognition
  • Compare results to those obtained via a Tandem
    system
  • Experimental Data
  • TIMIT read speech corpus
  • Moderate-sized corpus of clean, prompted speech,
    complete with phonetic-level transcriptions

17
Attribute Selection
  • Attribute Detectors
  • Built using ICSI QuickNet Neural Network software
  • Two different types of attributes
  • Phonological feature detectors
  • Place, Manner, Voicing, Vowel Height, Backness,
    etc.
  • Features are grouped into eight classes, with
    each class having a variable number of possible
    values based on the IPA phonetic chart
  • Phone detectors
  • Neural network outputs based on the phone labels,
    one output per label
  • Classifiers were trained on 2960 utterances from
    the TIMIT training set
  • Uses extracted 12th order PLP coefficients (i.e.
    frequency coefficients) in a 9 frame window as
    inputs to the neural networks

18
(No Transcript)
19
Experimental Setup
  • Code built on the Java CRF toolkit on Sourceforge
  • http://crf.sourceforge.net
  • Performs training to maximize the log-likelihood
    of the training set with respect to the model
  • Does this via gradient descent: find the point
    where the gradient of the log-likelihood function
    goes to zero
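The idea of following the gradient until it vanishes can be illustrated on a toy concave objective; this is not a real CRF likelihood, just the same stopping criterion in miniature:

```python
# Toy gradient-following loop: step along the gradient of L(w) until the
# gradient is (near) zero, i.e. L is maximized.
def gradient_ascent(grad, w, lr=0.1, tol=1e-8, max_iter=10000):
    for _ in range(max_iter):
        g = grad(w)
        if abs(g) < tol:        # the gradient has gone to zero: stop
            break
        w += lr * g
    return w

# L(w) = -(w - 3)^2 has gradient -2 * (w - 3), maximized at w = 3.
w_star = gradient_ascent(lambda w: -2.0 * (w - 3.0), w=0.0)
```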

20
Experimental Setup
  • Output from the Neural Nets are themselves
    treated as feature functions for the observed
    sequence
  • Each attribute/label combination gives us a value
    for one feature function
  • We also use a bias feature for each label
  • Currently, all combinations of features and
    labels are used as feature functions
  • e.g. f(P(stop),/t/), f(P(stop),/ae/), etc.
  • Phone class features are used in the same manner
  • e.g. f(P(/t/), /t/), f(P(/t/), /ae/), etc.
  • Transition features use only a 0/1 bias feature
  • 1 if the transition occurs at that timeframe in
    the training set
  • 0 if the transition does not occur at that
    timeframe in the training set
  • For comparison purposes, we compare to a baseline
    HMM-trained system that uses decorrelated
    features as inputs
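The feature construction described above, where every combination of a network output and a label becomes one feature function plus a per-label bias, can be sketched as follows; the names are illustrative:

```python
# Map NN posteriors for one frame into CRF state features: one feature per
# (attribute, label) combination, e.g. f(P(stop), /t/), plus a bias per label.
def make_state_features(nn_outputs, labels):
    """Return {(feature name, label): value} for one observation frame."""
    feats = {}
    for lab in labels:
        feats[("bias", lab)] = 1.0               # bias feature for each label
        for attr, val in nn_outputs.items():
            feats[(attr, lab)] = val             # f(P(attr), label) = P(attr)
    return feats

feats = make_state_features({"P(stop)": 0.8, "P(voiced)": 0.1},
                            ["/t/", "/ae/"])
```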

21
Initial Results
Model                        Label Space   Phone Recog. Accuracy
HMM (phones)                 triphones     67.32%
CRF (phones)                 monophones    67.27%
HMM (features)               triphones     66.69%
CRF (features)               monophones    65.25%
HMM (phones/feas) (top 39)   triphones     67.96%
CRF (phones/feas)            monophones    68.00%
22
Experimental Setup
  • Initial CRF experiments show results comparable
    to triphone HMM results with only monophone
    labelling
  • No decorrelation of features needed
  • No assumptions about feature independence
  • The comparison to the HMM is handicapped in one way
  • HMM training allowed for shifting of phone
    boundaries during training
  • CRF training used set phone boundaries for all
    training
  • Another experiment: train the CRF, realign
    training labels, then retrain on realigned labels

23
Realignment Results
Model                    Label Space   Phone Recog. Accuracy
HMM (phones)             triphones     67.32%
CRF (phones) base        monophones    67.27%
CRF (phones) realign     monophones    69.63%
HMM (features)           triphones     66.69%
CRF (features) base      monophones    65.25%
CRF (features) realign   monophones    67.52%
24
Experimental Setup
  • CRFs can also make use of features on the
    transitions
  • For the initial experiments, transition feature
    functions only used bias features (e.g. 1 or 0
    based on label in the training corpus)
  • What if the phone classifications were used as
    the state features, and the feature classes were
    used as transition features?
  • Linguistic observation feature spreading from
    phone to phone

25
Realignment Results
Model                     Label Space   Phone Recog. Accuracy
CRF (phones) base         monophones    67.27%
CRF (phones) realign      monophones    69.63%
CRF (features) base       monophones    65.25%
CRF (features) realign    monophones    67.52%
CRF (pf) base             monophones    68.00%
CRF (p trans f) base      monophones    69.49%
CRF (p trans f) align     monophones    70.86%
26
Discussion / Future Work
  • This seems to be a good model for the type of
    feature combination we want to perform
  • Makes use of arbitrary, possibly correlated
    features
  • Results on phone recognition task comparable or
    superior to the alternative sequence model (HMM)
  • Future Work
  • New features
  • What kinds of features can we add to improve our
    transitions?
  • We hope to get more from the other research
    groups
  • New training methods
  • Faster algorithms than the gradient descent
    method exist and need to be tested
  • Word recognition
  • We are thinking about how to model word
    recognition in this framework
  • Larger corpora
  • TIMIT is a comparatively small corpus; we are
    looking to move to something bigger