Potential team members to date:

About This Presentation

Slides: 11

Provided by: kliv8

Learn more at: http://people.csail.mit.edu

Transcript and Presenter's Notes

1
Articulatory Feature-based Speech Recognition: A Proposal for the 2006 JHU Summer Workshop on Language Engineering
November 11, 2005

Potential team members to date:
Karen Livescu (presenter)
Simon King
Florian Metze
Jeff Bilmes
Mark Hasegawa-Johnson
Ozgur Cetin
Kate Saenko
2
Motivations

Why articulatory feature-based ASR?
  Improved modeling of co-articulatory pronunciation phenomena
  Take advantage of human perception and production knowledge
  Application to audio-visual modeling
  Application to multilingual ASR
  Evidence of improved ASR performance with feature-based models
    In noise (Kirchhoff et al. 2002)
    For hyperarticulated speech (Soltau et al. 2002)
Why this workshop project?
  Growing number of sites investigating complementary aspects of this idea; a non-exhaustive list:
    U. Edinburgh (King et al.)
    UIUC (Hasegawa-Johnson et al.)
    MIT (Livescu, Glass, Saenko)
  Recently developed tools (e.g. graphical models) for systematic exploration of the model space

3
Approach: Main Ideas

Many ways to use articulatory features in ASR
Approach for this project: multiple streams of hidden articulatory states that can desynchronize and stray from target values
Inspired by linguistic theories, but simplified and cast in a probabilistic setting

Figure: example alignment over 12 frames (figure label: baseform dictionary)
ind GLOT      0 0 0 0 1 1 1 2 2 2 2 2
ind VEL       0 0 0 0 2 0 0 0 1 0 0 0
ind LIP-OPEN  0 0 0 0 1 1 1 2 2 2 2 1
U LIP-OPEN    W W W W C C C C W W W W
S LIP-OPEN    W W N N N C C C W W W W
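
The two LIP-OPEN rows in the figure above can be compared frame by frame. The following is a minimal Python sketch, not the proposal's implementation: it lists the frames where the S row departs from the U row, and it measures how far two index streams drift apart. The function names and the index sequences `IND_A` and `IND_B` are hypothetical and chosen only for illustration; the U and S rows are copied from the figure.

```python
# U and S LIP-OPEN rows, copied from the figure (12 frames each).
U_LIP_OPEN = list("WWWWCCCCWWWW")
S_LIP_OPEN = list("WWNNNCCCWWWW")


def differing_frames(u_row, s_row):
    """Return the frame indices at which the S value differs from the U value."""
    return [t for t, (u, s) in enumerate(zip(u_row, s_row)) if u != s]


def max_asynchrony(ind_a, ind_b):
    """Return the largest per-frame gap between two streams' index variables."""
    return max(abs(a - b) for a, b in zip(ind_a, ind_b))


# Hypothetical index sequences for two feature streams (not taken from the slide).
IND_A = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
IND_B = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2]

if __name__ == "__main__":
    print("frames where S differs from U:", differing_frames(U_LIP_OPEN, S_LIP_OPEN))
    print("max asynchrony between streams:", max_asynchrony(IND_A, IND_B))
```

In this toy setting a gap of 0 at every frame would mean the two streams move through the word in lockstep; larger gaps correspond to the desynchronization described on this slide.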
4
Dynamic Bayesian network implementation: The context-independent case
Example DBN with 3 features
5
Recent related work

Product observation models combining phones and features, $p(\mathrm{obs} \mid s) = p(\mathrm{obs} \mid ph_s) \prod_i p(\mathrm{obs} \mid f_i)$, improve ASR in some conditions (Kirchhoff et al. 2002, Metze et al. 2002, Stueker et al. 2002)
Lexical access from manual transcriptions of Switchboard words using the DBN model above (Livescu & Glass 2004, 2005)
  Improves over phone-based pronunciation models (50% → 25% error)
  Preliminary result: articulatory phonology features preferable to IPA-style (place/manner) features
JHU WS04 project (Hasegawa-Johnson et al. 2004)
  Can combine landmarks + IPA-style features at acoustic level with articulatory phonology features at pronunciation level
Articulatory recognition using DBN and ANN/DBN models (Wester et al. 2004, Frankel et al. 2005)
  Modeling inter-feature dependencies useful, asynchrony may also be useful
Lipreading using multistream DBN model + SVM feature detectors
  Improves over viseme-based models in medium-vocabulary word ranking and realistic small-vocabulary task (Saenko et al. 2005)
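
The product observation model in the first item of this slide can be sketched in a few lines. This is a hedged illustration only: it assumes the product form reads p(obs | s) = p(obs | ph_s) × Π_i p(obs | f_i), works in the log domain, and omits any stream weighting that a real system might apply. The function name and the likelihood values are made up for the example.

```python
import math


def combined_log_likelihood(log_p_phone, log_p_features):
    """Combine phone and feature observation scores as a product of likelihoods.

    In the log domain the product becomes a sum:
    log p(obs|s) = log p(obs|ph_s) + sum_i log p(obs|f_i).
    """
    return log_p_phone + sum(log_p_features)


# Made-up likelihoods for one frame: one phone model and two feature models.
phone_likelihood = 0.2
feature_likelihoods = [0.5, 0.4]

log_p = combined_log_likelihood(
    math.log(phone_likelihood),
    [math.log(p) for p in feature_likelihoods],
)

if __name__ == "__main__":
    print("combined likelihood:", math.exp(log_p))
```

Summing log scores avoids numerical underflow when many small likelihoods are multiplied across frames and streams.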

6
Ongoing work: Audio-visual ASR
7
Plan for 2006 Workshop

Goals
  To build complete articulatory feature-based ASR systems using multistream DBN structures
  To develop a thorough understanding of the design issues involved
Questions to be addressed
  What are appropriate ways to combine models of articulation with observations?
  Are discriminative feature classifiers preferable to generative observation models?
  What asynchrony constraints can account for co-articulation while permitting efficient implementations?
  How does context affect the modeling of articulatory feature streams?
  Must the features modeled at the observation level be the same as the hidden state streams?
  How can such models be applied to audio-visual ASR?
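
One way to see why the asynchrony question bears on efficiency: the number of joint index configurations a multistream model must consider grows quickly as streams are allowed to drift apart. The sketch below is an illustration under simplifying assumptions, not a method from the proposal. It counts configurations by brute force, treats the constraint as a hard limit on the spread between the most and least advanced stream, and uses hypothetical names and sizes throughout.

```python
from itertools import product


def count_joint_states(num_streams, num_positions, max_async):
    """Count joint index configurations whose spread (max - min) is <= max_async."""
    return sum(
        1
        for idx in product(range(num_positions), repeat=num_streams)
        if max(idx) - min(idx) <= max_async
    )


if __name__ == "__main__":
    # Hypothetical setup: 3 feature streams, 5 positions in the word.
    for limit in range(5):
        print("max asynchrony", limit, "->", count_joint_states(3, 5, limit), "joint states")
```

With a limit of 0 the streams are fully synchronous and the count equals the number of positions; with no effective limit it is the full product of the per-stream ranges.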
A possible work plan
  Prior to workshop
    Selection of feature sets to be considered
    Baseline feature-based and phone-based models on selected data
  Workshop, first half
    Exploration of feature sets and classifiers

8
Potential participants and contributors

Local participants
  Karen Livescu, MIT
    Feature-based ASR structures, graphical models, GMTK
  Mark Hasegawa-Johnson, U. Illinois at Urbana-Champaign
    Discriminative feature classification, JHU WS04
  Simon King, U. Edinburgh
    Articulatory feature recognition, ANN/DBN structures
  Ozgur Cetin, ICSI Berkeley
    Multistream/multirate modeling, graphical models, GMTK
  Florian Metze
    Articulatory features in HMM framework
  Jeff Bilmes, U. Washington
    Graphical models, GMTK
  Kate Saenko, MIT
    Visual feature classification, AVSR
  Others?
Satellite/advisory contributors
  Jim Glass, MIT

9
Resources

Tools
  GMTK
  HTK
  Intel AVCSR toolkit
Data
  Audio-only
    SVitchboard (CSTR Edinburgh): small-vocab, continuous, conversational
    PhoneBook: medium-vocab, isolated-word, read
    (Switchboard rescoring? LVCSR)
  Audio-visual
    AVTIMIT (MIT): medium-vocab, continuous, read, added noise
    Digit strings database (MIT): continuous, read, naturalistic setting (noise and video background)
  Articulatory measurements
    X-ray microbeam database (U. Wisconsin): many speakers, large-vocab, isolated-word and continuous
    MOCHA (QMUC, Edinburgh): few speakers, medium-vocab, continuous
    Others?
  Manual transcriptions: ICSI Berkeley Switchboard transcription project