Title: CSE 551/651
1. CSE 551/651 Structure of Spoken Language
Lecture 17: Automatic Speech Recognition (ASR) Technology
John-Paul Hosom, Fall 2005
2. ASR Technology: Segment-Based Approaches
- SUMMIT system
- developed at MIT by Zue, Glass, et al.
- based on knowledge of human spectrogram reading
- competitive performance at phoneme classification
- segment-based recognition (sketched below):
  (a) segment the speech at possible phonetic boundaries
  (b) create a network of (sub-)phonetic segments
  (c) classify each segment
  (d) search the segment probabilities for the most likely sequence of phonetic segments
- complicated
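Below is a minimal sketch of steps (a)-(d), purely to illustrate the segment-based idea; the boundary-proposal function, segment scorer, and duration limits are hypothetical placeholders, not the actual SUMMIT implementation.

```python
# Minimal sketch of segment-based recognition, steps (a)-(d) above.
# propose_boundaries, score_segment, and phone_set are hypothetical inputs.
import itertools

def segment_based_recognize(frames, propose_boundaries, score_segment, phone_set):
    """frames: list of per-frame acoustic feature vectors."""
    # (a) hypothesize possible phonetic boundaries (sorted frame indices)
    boundaries = propose_boundaries(frames)

    # (b) build a network of candidate segments between boundary pairs
    segments = [(s, e) for s, e in itertools.combinations(boundaries, 2)
                if 2 <= e - s <= 40]          # crude duration limits (illustrative)

    # (c) score each candidate segment against every phone
    seg_scores = {(s, e): {p: score_segment(frames[s:e], p) for p in phone_set}
                  for (s, e) in segments}

    # (d) dynamic-programming search for the best-scoring segment sequence
    best = {boundaries[0]: (0.0, [])}         # boundary -> (total score, phone path)
    for b in boundaries[1:]:
        candidates = []
        for (s, e), scores in seg_scores.items():
            if e == b and s in best:
                phone, sc = max(scores.items(), key=lambda kv: kv[1])
                candidates.append((best[s][0] + sc, best[s][1] + [phone]))
        if candidates:
            best[b] = max(candidates)
    return best.get(boundaries[-1], (float('-inf'), []))[1]
```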
3. ASR Technology: Segment-Based Approaches
[Figure: segmentation based on spectral change; a sketch of this approach follows below]
- the segment network can also be created using segmentation by recognition
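One common way to hypothesize boundaries from spectral change is to look for peaks in the frame-to-frame spectral distance; the sketch below illustrates that idea only, with an illustrative distance measure and threshold rather than SUMMIT's actual method.

```python
import numpy as np

def boundaries_from_spectral_change(spectra, threshold=1.0):
    """Hypothesize segment boundaries where frame-to-frame spectral change peaks.

    spectra: 2-D array, one log-magnitude spectrum (or cepstrum) per row.
    Returns candidate boundary frame indices, including the utterance edges.
    """
    # Euclidean distance between consecutive frames as a crude change measure
    change = np.linalg.norm(np.diff(spectra, axis=0), axis=1)
    # keep local maxima that exceed the threshold
    peaks = [i + 1 for i in range(1, len(change) - 1)
             if change[i] > threshold
             and change[i] >= change[i - 1]
             and change[i] >= change[i + 1]]
    return [0] + peaks + [len(spectra)]
```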
4. ASR Technology: Segment-Based Approaches
- Feature system
- developed at CMU by Cole, Stern, et al.
- also based on spectrogram-reading techniques
- competitive performance at alphabet classification
- segment-based recognition of a single letter (sketched below):
  (a) extract information about the signal, including spectral properties, F0, and energy in frequency bands
  (b) locate 4 points in the utterance: beginning of utterance, beginning of vowel, offset of vowel, end of utterance
  (c) extract 50 features (from step (a) at the 4 locations)
  (d) use a decision tree to determine the probability of each letter
- fragile: errors in feature extraction or segmentation cannot be recovered from
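A hedged sketch of steps (a)-(d) for single-letter recognition is shown below; the landmark finder, the measurements_at accessor, and the tree settings are hypothetical stand-ins, not the CMU system's actual features or classifier.

```python
# Hypothetical sketch of steps (a)-(d) for single-letter recognition.
from sklearn.tree import DecisionTreeClassifier
import numpy as np

def letter_features(signal_info, find_landmarks):
    # (b) locate the four landmarks in the utterance
    utt_start, vowel_start, vowel_end, utt_end = find_landmarks(signal_info)
    feats = []
    # (c) sample measurements (spectral shape, F0, band energies, ...) at each landmark
    for t in (utt_start, vowel_start, vowel_end, utt_end):
        feats.extend(signal_info.measurements_at(t))   # hypothetical accessor
    return np.array(feats)            # a fixed-length (e.g. ~50-dim) feature vector

# (d) a decision tree maps the feature vector to letter probabilities
tree = DecisionTreeClassifier(max_depth=10)
# tree.fit(training_feature_vectors, training_letter_labels)
# probs = tree.predict_proba(letter_features(test_info, find_landmarks).reshape(1, -1))
```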
5. ASR Technology: Frame-Based Approaches
- Stochastic Approach
- includes HMMs and HMM/ANN hybrids
6. ASR Technology: Frame-Based Approaches
7. ASR Technology: Frame-Based Approaches
- HMM-based system characteristics
- The system is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs.
- At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or ANN. The classifier uses a fixed time window, usually extending no more than 60 msec. Each frame is typically classified into each phoneme in a particular left and right context (e.g. /eh/ preceded by /y/ and followed by /s/, as in "yes"), and as the left, middle, or right region of that context-dependent phoneme (3 states per phoneme).
- The probability of transferring from one state to the next is independent of the observed (test) speech utterance, being computed over the entire training corpus.
- The Viterbi search (sketched below) determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
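The following is a minimal sketch of the Viterbi search over per-frame state log-likelihoods (as produced by a GMM or ANN) and state-transition log-probabilities; the word-level constraints, pronunciation lexicon, and language model that a full recognizer adds are omitted here.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence given per-frame state log-likelihoods.

    log_obs:   (T, N) log P(frame_t | state_n), e.g. from a GMM or ANN
    log_trans: (N, N) log P(state_j at t+1 | state_i at t)
    log_init:  (N,)   log P(state_n at t=0)
    """
    T, N = log_obs.shape
    delta = np.full((T, N), -np.inf)          # best path score ending in each state
    back = np.zeros((T, N), dtype=int)        # backpointers for traceback
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(N)] + log_obs[t]
    # trace back the best state sequence
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```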
8. ASR Technology: Frame-Based Approaches
- Issues with HMMs
- Independence is assumed between frames
- The implicit duration model for phonemes is geometric, whereas phoneme durations actually follow Gamma distributions (illustrated below)
- Independence is required between features within one frame for GMM classification (not so for ANN classification)
- All frames of speech contribute equally to the final result
- Duration is not used in phoneme classification
- Duration is modeled using a priori averages over the entire training set
- The language model uses the probability of word N given words N-1, N-2, etc. (bigram, trigram, etc. language model); infrequently occurring word combinations are poorly recognized (e.g. "Black Monday", a stock-market crash in 1987)
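The geometric-versus-Gamma point can be made concrete in a few lines: with self-transition probability a_ii, the probability of staying in a state for exactly d frames is a_ii^(d-1)(1-a_ii), which is largest at d = 1 and decays monotonically. The Gamma parameters below are illustrative values, not fitted to real phoneme durations.

```python
import numpy as np
from scipy.stats import gamma

a_ii = 0.8                          # self-transition probability of one HMM state
d = np.arange(1, 31)                # duration in frames

# Implicit HMM duration model: P(exactly d frames) = a_ii**(d-1) * (1 - a_ii),
# a geometric distribution that peaks at d = 1 and decays from there.
p_geometric = a_ii ** (d - 1) * (1 - a_ii)

# Measured phoneme durations look more like a Gamma distribution: small at very
# short durations, peaking near a typical duration, then falling off.
p_gamma = gamma.pdf(d, a=3.0, scale=2.5)

print(p_geometric[:5])    # 0.2, 0.16, 0.128, ... (monotone decrease)
print(p_gamma[:5])        # rises toward a peak around d = 5
```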
9. ASR Technology: Frame-Based Approaches
- Why is the HMM the dominant technique for ASR?
- well-defined mathematical structure
- does not require expert knowledge about the speech signal (more people study statistics than study speech)
- errors in analysis don't propagate and accumulate
- does not require prior segmentation
- does not require a large number of templates
- results are usually the best or among the best
10. Issues in Developing ASR Systems
- Type of Channel
- A microphone signal is different from a telephone signal; a land-line telephone signal is different from a cellular signal.
- Channel characteristics: pick-up pattern (omni-directional, unidirectional, etc.), frequency response, sensitivity, noise, etc.
- Typical channels:
  desktop boom mic: unidirectional, 100 to 16000 Hz
  hand-held mic: super-cardioid, 40 to 20000 Hz
  telephone: unidirectional, 300 to 8000 Hz
- Training on data from one type of channel automatically learns that channel's characteristics; switching channels degrades performance.
11. Issues in Developing ASR Systems
- Speaker Characteristics
- Because of differences in vocal tract length, male, female, and children's speech have different characteristics.
- Regional accents are expressed as differences in resonant frequencies, durations, and pitch.
- Individuals have resonant frequency patterns and duration patterns that are unique (which allows us to identify the speaker).
- Training on data from one type of speaker automatically learns that group's or person's characteristics.
- Training on data from all types of speakers results in lower overall performance.
12. Issues in Developing ASR Systems
- Speaking Rate
- Even the same speaker may vary the rate of speech.
- Most ASR systems require a fixed window of input speech.
- Formant dynamics change with different speaking rates and speaking styles (e.g. frustrated speech).
- ASR performance is best when the test speech has the same rate as the training data.
- Training on a wide variation in speaking rate results in lower overall performance.
13. Issues in Developing ASR Systems
- Noise
- two types of noise: additive and convolutional
- additive: white noise (random values added to the waveform)
- convolutional: a filter (additive values in the log spectrum)
- techniques for removing noise: RASTA, Cepstral Mean Subtraction (CMS, sketched below)
- (nearly) impossible to remove all noise while preserving all speech
- stochastic training learns the noise as well as the speech; if the noise changes, performance degrades.
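A minimal sketch of Cepstral Mean Subtraction, assuming cepstral feature vectors have already been computed: a fixed convolutional channel adds a constant offset in the cepstral domain, so subtracting the per-utterance mean cepstrum removes it (along with any constant component of the speech itself).

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove a fixed convolutional channel from a sequence of cepstral vectors.

    cepstra: (T, D) array, one cepstral vector per frame.
    A time-invariant channel filter contributes the same additive offset to
    every frame in the log-spectral / cepstral domain, so subtracting the
    utterance mean cancels that offset.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```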
14. Issues in Developing ASR Systems
- Vocabulary
- The vocabulary must be specified in advance (the system can't recognize new words); a toy example of a fixed lexicon follows below.
- The pronunciation of each word must be specified exactly (phonetic substitutions may degrade performance)
- The grammar is either very simple or very structured
- Reasons:
  - phonetic recognition is so poor that confidence in each recognized phoneme is usually very low
  - humans often speak ungrammatically or disfluently
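A toy illustration of the fixed-vocabulary point: every recognizable word and its exact pronunciation must be listed in advance. The words, ARPAbet-style pronunciations, and one-word grammar below are hypothetical examples, not any particular system's lexicon.

```python
# Toy pronunciation lexicon: anything not in this table cannot be recognized,
# and a mismatch between the listed and spoken pronunciation degrades
# performance. (Pronunciations are illustrative ARPAbet.)
lexicon = {
    "yes":  ["y", "eh", "s"],
    "no":   ["n", "ow"],
    "stop": ["s", "t", "aa", "p"],
}

# A very simple grammar: accept any single word from the vocabulary.
def in_grammar(word_sequence):
    return len(word_sequence) == 1 and word_sequence[0] in lexicon
```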
15. Issues in Developing ASR Systems
[Figure: "Error Rates on Increasingly Difficult Problems" - word error rate by year (1988-2003) for benchmark tasks of increasing difficulty: Read Speech (1k, 5k, and 20k vocabularies, including noisy and varied-microphone conditions), Structured Speech, Broadcast Speech, Spontaneous Speech (2-3k), and Conversational Speech. Human recognition of Broadcast Speech (about 0.9% WER) is shown for reference.]
16. ASR Technology vs. Spectrogram Reading
- HMM-Based ASR
- frame based: no identification of landmarks in the speech signal
- duration of phonemes not identified until the end of processing
- all frames are equally important
- cues are completely unspecified, learned by training
- coarticulation model: context-dependent phoneme models
- Spectrogram Reading
- first identify landmarks in the signal
  (Where's the vowel? Is that change in energy a plosive?)
- identify change over the duration of a phoneme, and relative durations
  (Is that formant movement a diphthong or coarticulation?)
- identify activity at phoneme boundaries
  (F2 goes to 1800 Hz at the onset of voicing; voicing continues into the frication, so it's a voiced fricative)
- specific cues to phoneme identity
  (F2 near 1800 Hz implies alveolar; F3 near 2000 Hz implies retroflex; a toy encoding of such cues follows below)
- coarticulation model: tends toward locus theory
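Purely to contrast explicit, hand-written cues with parameters learned by HMM training, the sketch below encodes the two place-of-articulation cues mentioned above as rules of thumb; the thresholds and tolerances are illustrative assumptions, not calibrated values.

```python
# Hypothetical encoding of two explicit spectrogram-reading cues.
def place_cues(f2_at_voicing_onset_hz, f3_hz):
    cues = []
    if abs(f2_at_voicing_onset_hz - 1800) < 200:   # F2 near the alveolar locus
        cues.append("alveolar")
    if f3_hz <= 2000:                              # unusually low F3
        cues.append("retroflex")
    return cues

print(place_cues(1750, 2600))   # ['alveolar']
print(place_cues(1200, 1900))   # ['retroflex']
```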
17. ASR Technology vs. Spectrogram Reading
- HMM-Based ASR
- frame based: no identification of landmarks in the speech signal
- duration of phonemes not identified until the end of processing
- all frames are equally important
- cues are completely unspecified, learned by training
- Spectrogram Reading and Human Speech Recognition
- first identify landmarks in the signal
  (humans are thought to have landmark detectors, e.g. for plosives)
- identify change over the duration of a phoneme, and relative durations
  (humans are very sensitive to small changes, especially at vowel/consonant boundaries)
- identify activity at phoneme boundaries
  (the transition into the vowel is the most important region for human speech perception)
- specific cues to phoneme identity
  (humans use a large set of specific cues, e.g. voice onset time, VOT)
18. The Structure of Spoken Language
- Final Points
- Speech is complex! It is not as simple as a sequence of phonemes
- There is structure in speech, related to broad phonetic categories
- Identifying formant locations and movement is important
- Duration is important, even for phoneme identity
- Phoneme boundaries are important
- There are numerous cues to phoneme identity
- Little is understood about how humans process speech
- Current ASR technology is incapable of accounting for all of the information that humans use in reading spectrograms, and what is known about human speech processing is often not used; this implies (but does not prove) that current technology may be incapable of reaching human levels of performance.
- Speech is complex!