Title: Combining Speech Attributes for Speech Recognition
1. Combining Speech Attributes for Speech Recognition
- Jeremy Morris
- November 9, 2006
2. Overview
- Problem Statement (Motivation)
- Conditional Random Fields
- Experimental Results
- Future Work
3. Problem Statement
- Developed as part of the ASAT Project
- Automatic Speech Attribute Transcription
- A project to build tools to extract and parse speech attributes from a speech signal
- Goal: Develop a system for bottom-up speech recognition using 'speech attributes'
4. Speech Attributes?
- Any information that could be useful for recognizing the spoken language
- Phonetic attributes
- Consonants have manner, place of articulation, voicing
- Vowels have height, frontness, roundness, tenseness
- Speaker attributes (gender, age, etc.)
- Any other useful attributes that could be used for speech recognition
  /d/   manner: stop   place of artic: dental   voicing: voiced
  /iy/  height: high   frontness: front   roundness: nonround   tenseness: tense
  /ae/  height: low    frontness: front   roundness: nonround   tenseness: tense
  /t/   manner: stop   place of artic: dental   voicing: unvoiced
5. (No Transcript)
6. Feature Combination
- Our piece of this project is to find ways to combine speech attributes together and use them to recognize language
- Other groups are working on finding features to extract and methods of extracting them
- Note that there is no guarantee that attributes will be independent of each other
- In fact, many attributes will be strongly correlated with, or dependent on, other attributes
- e.g. voicing for vowels
7. Evidence Combination
- Two basic ways to build hypotheses: top-down and bottom-up
8. Top Down
- Traditional Automatic Speech Recognition (ASR) systems use a top-down approach
- The hypothesis is the phone we are predicting
- The data is some encoding of the acoustic speech signal
- A likelihood of the signal given the phone label is learned from the data
- A prior probability for the phone label is learned from the data
- These are combined through Bayes' Rule to give us the posterior probability P(label | data)
  P(/iy/)   P(X | /iy/)
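The top-down combination above can be sketched in a few lines; the likelihood and prior values below are invented for illustration, not taken from any trained model.

```python
# Hypothetical sketch: combining a learned likelihood P(X | label) and a
# learned prior P(label) via Bayes' Rule to score each phone label.
# All numbers are illustrative.
likelihood = {"/iy/": 0.30, "/ae/": 0.10}   # P(X | label), learned from data
prior      = {"/iy/": 0.60, "/ae/": 0.40}   # P(label), learned from data

# P(label | X) = P(X | label) * P(label) / P(X), where P(X) normalizes
evidence = sum(likelihood[l] * prior[l] for l in likelihood)
posterior = {l: likelihood[l] * prior[l] / evidence for l in likelihood}

best = max(posterior, key=posterior.get)    # the highest-posterior label
```

With these toy numbers the posterior favors /iy/, since both its likelihood and its prior are larger.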
9. Bottom Up
- Bottom-up models have the same high-level goal: determine the label from the observation
- But instead of a likelihood, the posterior probability P(label | data) is learned directly from the data
- Neural Networks can be used to learn probabilities in this manner
  P(/iy/ | X)
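A minimal sketch of the bottom-up idea, assuming a discriminative model (for example, the output layer of a neural network) that produces P(label | X) directly as a softmax over scores; the scores here are made up.

```python
import math

# Hypothetical sketch: unnormalized network outputs for one frame
scores = {"/iy/": 2.0, "/ae/": 0.5}

# Softmax turns the scores directly into a posterior P(label | X),
# with no separate likelihood or prior
z = sum(math.exp(s) for s in scores.values())
posterior = {l: math.exp(s) / z for l, s in scores.items()}
```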
10. Speech is a Sequence
- Speech is not a single, independent event
- It is a combination of multiple events over time
- A model to recognize spoken language should take into account dependencies across time
11. Speech is a Sequence
- A top-down model can be extended into a time sequence as a Hidden Markov Model (HMM)
- Now our likelihood of the data is over the entire sequence instead of a single phone
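The sequence likelihood an HMM computes can be sketched with the forward algorithm; the states, transition probabilities, and observations below are invented for illustration.

```python
# Hypothetical two-state HMM over discrete observations "x1"/"x2"
states = ["/iy/", "/ae/"]
start  = {"/iy/": 0.5, "/ae/": 0.5}
trans  = {"/iy/": {"/iy/": 0.7, "/ae/": 0.3},
          "/ae/": {"/iy/": 0.4, "/ae/": 0.6}}
emit   = {"/iy/": {"x1": 0.8, "x2": 0.2},
          "/ae/": {"x1": 0.3, "x2": 0.7}}

def sequence_likelihood(obs):
    """Forward algorithm: P(observation sequence), summed over state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        # old alpha is read on the right-hand side before reassignment
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())
```

The key point from the slide: the likelihood is now a function of the whole sequence, not of a single frame in isolation.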
12. Conditional Random Fields
- A form of discriminative modelling
- Has been used successfully in various domains such as part-of-speech tagging and other Natural Language Processing tasks
- Processes evidence bottom-up
- Combines multiple features of the data
- Builds the probability P(sequence | data)
13. Conditional Random Fields
- Conceptual Overview
- Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label
- A positive value if the attribute appears in the data
- A zero value if the attribute is not in the data
- Each feature function carries a weight that gives the strength of that feature function for the proposed label
- High positive weights indicate a good association between the feature and the proposed label
- High negative weights indicate a negative association between the feature and the proposed label
- Weights close to zero indicate the feature has little or no impact on the identity of the label
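The feature-function idea above can be sketched concretely; the attribute names and the weight value are invented for illustration.

```python
# Hypothetical CRF feature function: fires (returns a positive value) when
# the "stop" attribute appears in the data AND the proposed label is /t/,
# and returns zero otherwise.
def f_stop_t(attributes, label):
    return 1.0 if "stop" in attributes and label == "/t/" else 0.0

# A learned weight gives the strength of this feature function for the
# proposed label; a high positive weight means a good association.
weights = {f_stop_t: 2.3}

# Contribution of all (one) weighted feature functions to the /t/ score
score = sum(w * f(["stop", "unvoiced"], "/t/") for f, w in weights.items())
```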
14. Conditional Random Fields
  (Diagram: label sequence /k/ /k/ /iy/ /iy/ /iy/ over observations X X X X X)
- CRFs have transition feature functions and state feature functions
- Transition functions add associations between transitions from one label to another
- State functions help determine the identity of the state
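The two kinds of feature function can be sketched together as a sequence score; the weight tables and attribute names here are invented for illustration.

```python
# Hypothetical learned weights:
#   state features pair an observed attribute with the current label,
#   transition features pair the previous label with the current label.
state_w = {("stop", "/t/"): 2.0, ("high", "/iy/"): 1.5}
trans_w = {("/t/", "/iy/"): 0.8}

def sequence_score(obs_attrs, labels):
    """Unnormalized CRF score of one label sequence for one observation sequence."""
    score = 0.0
    for t, (attrs, lab) in enumerate(zip(obs_attrs, labels)):
        score += sum(state_w.get((a, lab), 0.0) for a in attrs)  # state features
        if t > 0:
            score += trans_w.get((labels[t - 1], lab), 0.0)      # transition features
    return score
```

In a full CRF these scores are exponentiated and normalized over all label sequences to give P(sequence | data); the sketch stops at the scoring step.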
15. Conditional Random Fields
16. Experiments
- Goal: Implement a Conditional Random Field model on speech attribute data
- Perform phone recognition
- Compare results to those obtained via a Tandem system
- Experimental Data
- TIMIT read speech corpus
- A moderate-sized corpus of clean, prompted speech, complete with phonetic-level transcriptions
17. Attribute Selection
- Attribute Detectors
- Built using the ICSI QuickNet Neural Network software
- Two different types of attributes
- Phonological feature detectors
- Place, Manner, Voicing, Vowel Height, Backness, etc.
- Features are grouped into eight classes, with each class having a variable number of possible values based on the IPA phonetic chart
- Phone detectors
- Neural network outputs based on the phone labels, one output per label
- Classifiers were trained on 2960 utterances from the TIMIT training set
- Uses extracted 12th-order PLP coefficients (i.e. frequency coefficients) in a 9-frame window as inputs to the neural networks
18. (No Transcript)
19. Experimental Setup
- Code built on the Java CRF toolkit on SourceForge
- http://crf.sourceforge.net
- Performs training to maximize the log-likelihood of the training set with respect to the model
- Does this via gradient descent: find the place where the gradient of the log-likelihood function goes to zero
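The training idea can be illustrated on a toy model (written here as ascent on the log-likelihood, which is equivalent to descent on its negative); the one-weight model and the data are invented and much simpler than a real CRF.

```python
import math

# Toy model: a single Bernoulli probability p = sigmoid(w), fit to toy labels
labels = [1, 1, 1, 0]

def grad(w):
    """Gradient of the training-set log-likelihood with respect to w."""
    p = 1.0 / (1.0 + math.exp(-w))
    return sum(y - p for y in labels)

w = 0.0
for _ in range(500):
    w += 0.1 * grad(w)   # step uphill until the gradient goes to zero

# At the optimum, sigmoid(w) = 3/4, i.e. w = log(3)
```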
20. Experimental Setup
- Outputs from the Neural Nets are themselves treated as feature functions for the observed sequence
- Each attribute/label combination gives us a value for one feature function
- We also use a bias feature for each label
- Currently, all combinations of features and labels are used as feature functions
- e.g. f(P(stop), /t/), f(P(stop), /ae/), etc.
- Phone class features are used in the same manner
- e.g. f(P(/t/), /t/), f(P(/t/), /ae/), etc.
- Transition features use only a 0/1 bias feature
- 1 if the transition occurs at that timeframe in the training set
- 0 if the transition does not occur at that timeframe in the training set
- For comparison purposes, we compare to a baseline HMM-trained system that uses decorrelated features as inputs
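A sketch of the feature construction described above, pairing every detector posterior with a label and adding a per-label bias feature; the attribute names and posterior values are illustrative, and `frame_features` is a hypothetical helper, not part of the toolkit.

```python
# Hypothetical detector outputs for one frame: P(attribute | frame)
detector_outputs = {"stop": 0.9, "voiced": 0.2}

def frame_features(outputs, label):
    """Feature values for one (frame, proposed label) pair."""
    feats = {("bias", label): 1.0}    # bias feature for the label
    for attr, p in outputs.items():
        feats[(attr, label)] = p      # f(P(attr), label)
    return feats

feats_t = frame_features(detector_outputs, "/t/")   # features proposing /t/
```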
21. Initial Results
  Model                        Label Space   Phone Recog. Accuracy (%)
  HMM (phones)                 triphones     67.32
  CRF (phones)                 monophones    67.27
  HMM (features)               triphones     66.69
  CRF (features)               monophones    65.25
  HMM (phones/feas) (top 39)   triphones     67.96
  CRF (phones/feas)            monophones    68.00
22. Experimental Setup
- Initial CRF experiments show results comparable to triphone HMM results with only monophone labelling
- No decorrelation of features needed
- No assumptions about feature independence
- Comparison to HMM crippled in one way
- HMM training allowed for shifting of phone boundaries during training
- CRF training used set phone boundaries for all training
- Another experiment: train the CRF, realign training labels, then retrain on the realigned labels
23. Realignment Results
  Model                    Label Space   Phone Recog. Accuracy (%)
  HMM (phones)             triphones     67.32
  CRF (phones) base        monophones    67.27
  CRF (phones) realign     monophones    69.63
  HMM (features)           triphones     66.69
  CRF (features) base      monophones    65.25
  CRF (features) realign   monophones    67.52
24. Experimental Setup
- CRFs can also make use of features on the transitions
- For the initial experiments, transition feature functions only used bias features (e.g. 1 or 0 based on the label in the training corpus)
- What if the phone classifications were used as the state features, and the feature classes were used as transition features?
- Linguistic observation: feature spreading from phone to phone
25. Realignment Results
  Model                    Label Space   Phone Recog. Accuracy (%)
  CRF (phones) base        monophones    67.27
  CRF (phones) realign     monophones    69.63
  CRF (features) base      monophones    65.25
  CRF (features) realign   monophones    67.52
  CRF (pf) base            monophones    68.00
  CRF (p trans f) base     monophones    69.49
  CRF (p trans f) align    monophones    70.86
26. Discussion / Future Work
- This seems to be a good model for the type of feature combination we want to perform
- Makes use of arbitrary, possibly correlated features
- Results on the phone recognition task comparable or superior to the alternative sequence model (HMM)
- Future Work
- New features
- What kinds of features can we add to improve our transitions?
- We hope to get more from the other research groups
- New training methods
- Faster algorithms than the gradient descent method exist and need to be tested
- Word recognition
- We are thinking about how to model word recognition in this framework
- Larger corpora
- TIMIT is a comparatively small corpus; we are looking to move to something bigger