Title: CSE 551/651
1. CSE 551/651 Structure of Spoken Language
Lecture 17: Automatic Speech Recognition (ASR) Technology
John-Paul Hosom, Fall 2005
2. ASR Technology: Segment-Based Approaches
- SUMMIT system
- developed at MIT by Zue, Glass, et al.
- based on knowledge of human spectrogram reading
- competitive performance at phoneme classification
- segment-based recognition (sketched below):
  (a) segment the speech at possible phonetic boundaries
  (b) create a network of (sub-)phonetic segments
  (c) classify each segment
  (d) search the segment probabilities for the most likely sequence of phonetic segments
- complicated
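Below is a minimal sketch of steps (a)-(d), purely to illustrate the segment-based idea; the boundary-proposal function, segment scorer, and duration limits are hypothetical placeholders, not the actual SUMMIT implementation.

```python
# Minimal sketch of segment-based recognition, steps (a)-(d) above.
# propose_boundaries, score_segment, and phone_set are hypothetical inputs.
import itertools

def segment_based_recognize(frames, propose_boundaries, score_segment, phone_set):
    """frames: list of per-frame acoustic feature vectors."""
    # (a) hypothesize possible phonetic boundaries (sorted frame indices)
    boundaries = propose_boundaries(frames)

    # (b) build a network of candidate segments between boundary pairs
    segments = [(s, e) for s, e in itertools.combinations(boundaries, 2)
                if 2 <= e - s <= 40]          # crude duration limits (illustrative)

    # (c) score each candidate segment against every phone
    seg_scores = {(s, e): {p: score_segment(frames[s:e], p) for p in phone_set}
                  for (s, e) in segments}

    # (d) dynamic-programming search for the best-scoring segment sequence
    best = {boundaries[0]: (0.0, [])}         # boundary -> (total score, phone path)
    for b in boundaries[1:]:
        candidates = []
        for (s, e), scores in seg_scores.items():
            if e == b and s in best:
                phone, sc = max(scores.items(), key=lambda kv: kv[1])
                candidates.append((best[s][0] + sc, best[s][1] + [phone]))
        if candidates:
            best[b] = max(candidates)
    return best.get(boundaries[-1], (float('-inf'), []))[1]
```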
3. ASR Technology: Segment-Based Approaches
[Figure: segmentation based on spectral change; a sketch of this approach follows below]
- the segment network can also be created using segmentation by recognition
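One common way to hypothesize boundaries from spectral change is to look for peaks in the frame-to-frame spectral distance; the sketch below illustrates that idea only, with an illustrative distance measure and threshold rather than SUMMIT's actual method.

```python
import numpy as np

def boundaries_from_spectral_change(spectra, threshold=1.0):
    """Hypothesize segment boundaries where frame-to-frame spectral change peaks.

    spectra: 2-D array, one log-magnitude spectrum (or cepstrum) per row.
    Returns candidate boundary frame indices, including the utterance edges.
    """
    # Euclidean distance between consecutive frames as a crude change measure
    change = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
    # keep local maxima that exceed the threshold
    peaks = [i + 1 for i in range(1, len(change) - 1)
             if change[i] > threshold
             and change[i] >= change[i - 1]
             and change[i] >= change[i + 1]]
    return [0] + peaks + [len(spectra)]
```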
4. ASR Technology: Segment-Based Approaches
- Feature system
- developed at CMU by Cole, Stern, et al.
- also based on spectrogram-reading techniques
- competitive performance at alphabet classification
- segment-based recognition of a single letter (sketched below):
  (a) extract information about the signal, including spectral properties, F0, and energy in frequency bands
  (b) locate 4 points in the utterance: beginning of utterance, beginning of vowel, offset of vowel, end of utterance
  (c) extract 50 features (from step (a) at the 4 locations)
  (d) use a decision tree to determine the probability of each letter
- fragile: errors in feature extraction or segmentation cannot be recovered from
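A hedged sketch of steps (a)-(d) for single-letter recognition is shown below; the landmark finder, the measurements_at accessor, and the tree settings are hypothetical stand-ins, not the CMU system's actual features or classifier.

```python
# Hypothetical sketch of steps (a)-(d) for single-letter recognition.
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def letter_features(signal_info, find_landmarks):
    # (b) locate the four landmarks in the utterance
    utt_start, vowel_start, vowel_end, utt_end = find_landmarks(signal_info)
    feats = []
    # (c) sample measurements (spectral shape, F0, band energies, ...) at each landmark
    for t in (utt_start, vowel_start, vowel_end, utt_end):
        feats.extend(signal_info.measurements_at(t))   # hypothetical accessor
    return np.array(feats)            # a fixed-length (e.g. ~50-dim) feature vector

# (d) a decision tree maps the feature vector to letter probabilities
tree = DecisionTreeClassifier(max_depth=10)
# tree.fit(training_feature_vectors, training_letter_labels)
# probs = tree.predict_proba(letter_features(test_info, find_landmarks).reshape(1, -1))
```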
5. ASR Technology: Frame-Based Approaches
- Stochastic Approach
- includes HMMs and HMM/ANN hybrids
6. ASR Technology: Frame-Based Approaches
7. ASR Technology: Frame-Based Approaches
- HMM-based system characteristics
- The system is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs.
- At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or ANN. The classifier uses a fixed time window, usually extending no more than 60 msec. Each frame is typically classified into each phoneme in a particular left and right context (e.g. /eh/ preceded by /y/ and followed by /s/, as in "yes"), and as the left, middle, or right region of that context-dependent phoneme (3 states per phoneme).
- The probability of transferring from one state to the next is independent of the observed (test) speech utterance, being computed over the entire training corpus.
- The Viterbi search (sketched below) determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
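The following is a minimal sketch of the Viterbi search over per-frame state log-likelihoods (as produced by a GMM or ANN) and state-transition log-probabilities; the word-level constraints, pronunciation lexicon, and language model that a full recognizer adds are omitted here.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence given per-frame state log-likelihoods.

    log_obs:   (T, N) log P(frame_t | state_n), e.g. from a GMM or ANN
    log_trans: (N, N) log P(state_j at t+1 | state_i at t)
    log_init:  (N,)   log P(state_n at t=0)
    """
    T, N = log_obs.shape
    delta = np.full((T, N), -np.inf)          # best path score ending in each state
    back = np.zeros((T, N), dtype=int)        # backpointers for traceback
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_obs[t]
    # trace back the best state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```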
8. ASR Technology: Frame-Based Approaches
- Issues with HMMs
- Independence is assumed between frames
- The implicit duration model for phonemes is geometric, whereas phoneme durations actually follow Gamma distributions (illustrated below)
- Independence is required between features within one frame for GMM classification (not so for ANN classification)
- All frames of speech contribute equally to the final result
- Duration is not used in phoneme classification
- Duration is modeled using a priori averages over the entire training set
- The language model uses the probability of word N given words N-1, N-2, etc. (bigram, trigram, etc. language model); infrequently occurring word combinations are poorly recognized (e.g. "Black Monday", a stock-market crash in 1987)
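The geometric-versus-Gamma point can be made concrete in a few lines: with self-transition probability a_ii, the probability of staying in a state for exactly d frames is a_ii^(d-1)(1-a_ii), which is largest at d = 1 and decays monotonically. The Gamma parameters below are illustrative values, not fitted to real phoneme durations.

```python
import numpy as np
from scipy.stats import gamma

a_ii = 0.8                          # self-transition probability of one HMM state
d = np.arange(1, 31)                # duration in frames

# Implicit HMM duration model: P(exactly d frames) = a_ii**(d-1) * (1 - a_ii),
# a geometric distribution that peaks at d = 1 and decays from there.
p_geometric = a_ii ** (d - 1) * (1 - a_ii)

# Measured phoneme durations look more like a Gamma distribution: small at very
# short durations, peaking near a typical duration, then falling off.
p_gamma = gamma.pdf(d, a=3.0, scale=2.5)

print(p_geometric[:5])    # 0.2, 0.16, 0.128, ... (monotone decrease)
print(p_gamma[:5])        # rises toward a peak around d = 5
```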
9. ASR Technology: Frame-Based Approaches
- Why is the HMM the dominant technique for ASR?
- well-defined mathematical structure
- does not require expert knowledge about the speech signal (more people study statistics than study speech)
- errors in analysis don't propagate and accumulate
- does not require prior segmentation
- does not require a large number of templates
- results are usually the best or among the best
10. Issues in Developing ASR Systems
- Type of Channel
- A microphone signal is different from a telephone signal; a land-line telephone signal is different from a cellular signal.
- Channel characteristics: pick-up pattern (omni-directional, unidirectional, etc.), frequency response, sensitivity, noise, etc.
- Typical channels:
  desktop boom mic: unidirectional, 100 to 16000 Hz
  hand-held mic: super-cardioid, 40 to 20000 Hz
  telephone: unidirectional, 300 to 8000 Hz
- Training on data from one type of channel automatically learns that channel's characteristics; switching channels degrades performance.
11. Issues in Developing ASR Systems
- Speaker Characteristics
- Because of differences in vocal tract length, male, female, and children's speech have different characteristics.
- Regional accents are expressed as differences in resonant frequencies, durations, and pitch.
- Individuals have resonant frequency patterns and duration patterns that are unique (which allows us to identify the speaker).
- Training on data from one type of speaker automatically learns that group's or person's characteristics.
- Training on data from all types of speakers results in lower overall performance.
12. Issues in Developing ASR Systems
- Speaking Rate
- Even the same speaker may vary the rate of speech.
- Most ASR systems require a fixed window of input speech.
- Formant dynamics change with different speaking rates and speaking styles (e.g. frustrated speech).
- ASR performance is best when the test speech has the same rate as the training data.
- Training on a wide variation in speaking rate results in lower overall performance.
13. Issues in Developing ASR Systems
- Noise
- two types of noise: additive and convolutional
- additive: white noise (random values added to the waveform)
- convolutional: a filter (additive values in the log spectrum)
- techniques for removing noise: RASTA, Cepstral Mean Subtraction (CMS, sketched below)
- (nearly) impossible to remove all noise while preserving all speech
- stochastic training learns the noise as well as the speech; if the noise changes, performance degrades.
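A minimal sketch of Cepstral Mean Subtraction, assuming cepstral feature vectors have already been computed: a fixed convolutional channel adds a constant offset in the cepstral domain, so subtracting the per-utterance mean cepstrum removes it (along with any constant component of the speech itself).

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove a fixed convolutional channel from a sequence of cepstral vectors.

    cepstra: (T, D) array, one cepstral vector per frame.
    A time-invariant channel filter contributes the same additive offset to
    every frame in the log-spectral / cepstral domain, so subtracting the
    utterance mean cancels that offset.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```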
14. Issues in Developing ASR Systems
- Vocabulary
- The vocabulary must be specified in advance (the system can't recognize new words); a toy example of a fixed lexicon follows below.
- The pronunciation of each word must be specified exactly (phonetic substitutions may degrade performance)
- The grammar is either very simple or very structured
- Reasons:
  - phonetic recognition is so poor that confidence in each recognized phoneme is usually very low
  - humans often speak ungrammatically or disfluently
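A toy illustration of the fixed-vocabulary point: every recognizable word and its exact pronunciation must be listed in advance. The words, ARPAbet-style pronunciations, and one-word grammar below are hypothetical examples, not any particular system's lexicon.

```python
# Toy pronunciation lexicon: anything not in this table cannot be recognized,
# and a mismatch between the listed and spoken pronunciation degrades
# performance. (Pronunciations are illustrative ARPAbet.)
lexicon = {
    "yes":  ["y", "eh", "s"],
    "no":   ["n", "ow"],
    "stop": ["s", "t", "aa", "p"],
}

# A very simple grammar: accept any single word from the vocabulary.
def in_grammar(word_sequence):
    return len(word_sequence) == 1 and word_sequence[0] in lexicon
```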
15. Issues in Developing ASR Systems
[Figure: "Error Rates on Increasingly Difficult Problems" - word error rate by year (1988-2003) for benchmark tasks of increasing difficulty: Read Speech (1k, 5k, and 20k vocabularies, including noisy and varied-microphone conditions), Structured Speech, Broadcast Speech, Spontaneous Speech (2-3k), and Conversational Speech. Human recognition of Broadcast Speech (about 0.9% WER) is shown for reference.]
16. ASR Technology vs. Spectrogram Reading
- HMM-Based ASR
- frame based: no identification of landmarks in the speech signal
- duration of phonemes not identified until the end of processing
- all frames are equally important
- cues are completely unspecified, learned by training
- coarticulation model: context-dependent phoneme models
- Spectrogram Reading
- first identify landmarks in the signal
  (Where's the vowel? Is that change in energy a plosive?)
- identify change over the duration of a phoneme, and relative durations
  (Is that formant movement a diphthong or coarticulation?)
- identify activity at phoneme boundaries
  (F2 goes to 1800 Hz at the onset of voicing; voicing continues into the frication, so it's a voiced fricative)
- specific cues to phoneme identity
  (F2 near 1800 Hz implies alveolar; F3 near 2000 Hz implies retroflex; a toy encoding of such cues follows below)
- coarticulation model: tends toward locus theory
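Purely to contrast explicit, hand-written cues with parameters learned by HMM training, the sketch below encodes the two place-of-articulation cues mentioned above as rules of thumb; the thresholds and tolerances are illustrative assumptions, not calibrated values.

```python
# Hypothetical encoding of two explicit spectrogram-reading cues.
def place_cues(f2_at_voicing_onset_hz, f3_hz):
    cues = []
    if abs(f2_at_voicing_onset_hz - 1800) < 200:   # F2 near the alveolar locus
        cues.append("alveolar")
    if f3_hz <= 2000:                              # unusually low F3
        cues.append("retroflex")
    return cues

print(place_cues(1750, 2600))   # ['alveolar']
print(place_cues(1200, 1900))   # ['retroflex']
```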
17. ASR Technology vs. Spectrogram Reading
- HMM-Based ASR
- frame based: no identification of landmarks in the speech signal
- duration of phonemes not identified until the end of processing
- all frames are equally important
- cues are completely unspecified, learned by training
- Spectrogram Reading and Human Speech Recognition
- first identify landmarks in the signal
  (humans are thought to have landmark detectors, e.g. for plosives)
- identify change over the duration of a phoneme, and relative durations
  (humans are very sensitive to small changes, especially at vowel/consonant boundaries)
- identify activity at phoneme boundaries
  (the transition into the vowel is the most important region for human speech perception)
- specific cues to phoneme identity
  (humans use a large set of specific cues, e.g. voice onset time, VOT)
18. The Structure of Spoken Language
- Final Points
- Speech is complex! It is not as simple as a sequence of phonemes
- There is structure in speech, related to broad phonetic categories
- Identifying formant locations and movement is important
- Duration is important, even for phoneme identity
- Phoneme boundaries are important
- There are numerous cues to phoneme identity
- Little is understood about how humans process speech
- Current ASR technology is incapable of accounting for all of the information that humans use in reading spectrograms, and what is known about human speech processing is often not used; this implies (but does not prove) that current technology may be incapable of reaching human levels of performance.
- Speech is complex!