1
CSE 551/651: Structure of Spoken Language
Lecture 17: Automatic Speech Recognition (ASR) Technology
John-Paul Hosom
Fall 2005
2
ASR Technology: Segment-Based Approaches
  • SUMMIT System
  • developed at MIT by Zue, Glass, et al.
  • based on knowledge of human spectrogram reading
  • competitive performance at phoneme
    classification
  • segment-based recognition:
    (a) segment the speech at possible phonetic
        boundaries
    (b) create a network of (sub-)phonetic segments
    (c) classify each segment
    (d) search segment probabilities for the most
        likely sequence of phonetic segments
  • complicated
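
A minimal sketch of step (a). The frame features, distance measure, and threshold here are all hypothetical toy choices; SUMMIT hypothesizes boundaries from measures of spectral change (as in its dendrogram), not from this exact recipe.

```python
# Hypothesize phonetic boundaries wherever frame-to-frame spectral
# change peaks. Frames are toy 2-D "spectra"; a real system would use
# MFCCs or similar acoustic features.

def spectral_change(frames):
    """Euclidean distance between consecutive feature frames."""
    return [
        sum((a - b) ** 2 for a, b in zip(frames[i], frames[i + 1])) ** 0.5
        for i in range(len(frames) - 1)
    ]

def hypothesize_boundaries(frames, threshold=1.0):
    """Return frame indices whose spectral change is a local maximum
    above the threshold -- candidate phonetic boundaries."""
    d = spectral_change(frames)
    return [
        i + 1
        for i in range(1, len(d) - 1)
        if d[i] > threshold and d[i] >= d[i - 1] and d[i] >= d[i + 1]
    ]

# Toy signal: steady region, abrupt change, steady region
frames = [(0.0, 0.0)] * 5 + [(3.0, 3.0)] * 5
print(hypothesize_boundaries(frames))  # one boundary, at frame 5
```

Steps (b)-(d) would then build a network of alternative segmentations from these candidates and search it for the best-scoring phonetic sequence.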

3
ASR Technology: Segment-Based Approaches
  • SUMMIT System: Dendrogram

  [Figure: dendrogram of hypothesized segments;
   vertical axis shows spectral change]
  • segment network can also be created using
    segmentation by recognition

4
ASR Technology: Segment-Based Approaches
  • Feature System
  • developed at CMU by Cole, Stern, et al.
  • also based on spectrogram-reading techniques
  • competitive performance at alphabet
    classification
  • segment-based recognition of a single letter:
    (a) extract information about the signal,
        including spectral properties, F0, and
        energy in frequency bands
    (b) locate 4 points in the utterance: beginning
        of utterance, beginning of vowel, offset of
        vowel, end of utterance
    (c) extract 50 features (from step (a) at the
        4 locations)
    (d) use a decision tree to determine the
        probabilities of each letter
  • fragile: errors in feature extraction or
    segmentation cannot be recovered from
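
A minimal sketch of how steps (b)-(c) fit together: per-frame measurements are sampled at the four located time points to build a fixed-length feature vector for the decision tree. The measurement names, landmark labels, and values are all hypothetical, and a real system would use many more than two measurements.

```python
# Sample per-frame measurements at the four landmark points to build
# the fixed-length vector that the decision tree classifies.

def features_at_landmarks(per_frame, landmarks):
    """per_frame: dict mapping measurement name -> list of frame values.
    landmarks: dict mapping the four time points to frame indices.
    Returns a flat, fixed-length feature vector."""
    order = ["utt_begin", "vowel_begin", "vowel_offset", "utt_end"]
    vector = []
    for name in sorted(per_frame):          # fixed measurement order
        for point in order:                 # fixed landmark order
            vector.append(per_frame[name][landmarks[point]])
    return vector

per_frame = {"energy": [0.1, 0.9, 0.8, 0.2], "f0": [0, 120, 115, 0]}
landmarks = {"utt_begin": 0, "vowel_begin": 1,
             "vowel_offset": 2, "utt_end": 3}
print(features_at_landmarks(per_frame, landmarks))
```

The fragility noted above is visible here: if a landmark index is wrong, every feature sampled at that point is wrong, and nothing downstream can correct it.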

5
ASR Technology: Frame-Based Approaches
  • Stochastic Approach
  • includes HMMs and HMM/ANN hybrids

6
ASR Technology: Frame-Based Approaches
7
ASR Technology: Frame-Based Approaches
  • HMM-Based System Characteristics
  • The system is in only one state at each time t;
    at time t+1, it transfers to one of the states
    indicated by the arcs.
  • At each time t, the likelihood of each phoneme
    is estimated using a Gaussian mixture model
    (GMM) or an ANN. The classifier uses a fixed time window
    usually extending no more than 60 msec. Each
    frame is typically classified into each phoneme
    in a particular left and right context, e.g.
    /y-ehs/, and as the left, middle, or right
    region of that context-dependent phoneme (3
    states per phoneme).
  • The probability of transferring from one state
    to the next is independent of the observed
    (test) speech utterance, being computed over
    the entire training corpus.
  • The Viterbi search determines the most likely
    word sequence given the phoneme and
    state-transition probabilities and the list
    of possible vocabulary words.
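
The Viterbi search described above can be sketched as follows. This is a generic Viterbi implementation over a toy left-to-right, 3-state model; all probabilities are made-up illustrative numbers, and a real recognizer works in log probabilities over phoneme and word networks.

```python
# Viterbi search: find the most likely state sequence given per-frame
# state likelihoods (from a GMM or ANN) and fixed transition
# probabilities estimated over the training corpus.

def viterbi(obs_probs, trans, initial):
    """obs_probs[t][s]: likelihood of state s at frame t.
    trans[p][s]: P(p -> s); initial[s]: P(start in s).
    Returns the most likely state sequence."""
    n_states = len(initial)
    score = [initial[s] * obs_probs[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_probs)):
        new_score, ptr = [], []
        for s in range(n_states):
            # best predecessor of state s at time t
            best = max(range(n_states), key=lambda p: score[p] * trans[p][s])
            ptr.append(best)
            new_score.append(score[best] * trans[best][s] * obs_probs[t][s])
        score = new_score
        back.append(ptr)
    # backtrace from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Left-to-right model: states 0 -> 1 -> 2, each with a self-loop
trans = [[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]
obs = [[0.9, 0.1, 0.1], [0.8, 0.2, 0.1],
       [0.1, 0.9, 0.2], [0.1, 0.2, 0.9]]
print(viterbi(obs, trans, initial=[1.0, 0.0, 0.0]))  # [0, 0, 1, 2]
```

Note the slide's point: `trans` does not depend on the test utterance at all, only `obs_probs` does.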

8
ASR Technology: Frame-Based Approaches
  • Issues with HMMs
  • Independence is assumed between frames
  • Implicit duration model for phonemes is
    geometric, whereas real phoneme durations are
    gamma-distributed
  • Independence is required between features within
    one frame for GMM classification (not so for
    ANN classification)
  • All frames of speech contribute equally to final
    result
  • Duration is not used in phoneme classification
  • Duration is modeled using a priori averages over
    the entire training set
  • Language model uses the probability of word N
    given words N-1, N-2, etc. (bigram, trigram,
    etc. language model); infrequently occurring
    word combinations are poorly recognized (e.g.
    "Black Monday," a stock-market crash in 1987)
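
The duration-model issue can be made concrete. A single HMM state with self-transition probability p implies a geometric duration distribution P(d) = p^(d-1)(1-p), whose mode is always d = 1 frame, unlike the gamma-shaped durations of real phonemes. The value p = 0.8 below is an arbitrary illustrative choice.

```python
# Duration distribution implied by an HMM state's self-loop:
# stay d-1 times (prob p each), then leave (prob 1-p).

def geometric_duration(p, max_d):
    """P(spending exactly d frames in a state with self-loop prob p),
    for d = 1 .. max_d."""
    return [(p ** (d - 1)) * (1 - p) for d in range(1, max_d + 1)]

probs = geometric_duration(0.8, 10)
# The probability decays monotonically: the shortest duration is
# always the most likely, which is wrong for real phonemes.
assert all(probs[i] >= probs[i + 1] for i in range(len(probs) - 1))
print([round(x, 3) for x in probs[:4]])  # [0.2, 0.16, 0.128, 0.102]
```

Using three states per phoneme softens this somewhat (a sum of geometrics), but the underlying model is still determined by the self-loop probabilities, not by an explicit duration distribution.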

9
ASR Technology: Frame-Based Approaches
  • Why is the HMM the Dominant Technique for ASR?
  • well-defined mathematical structure
  • does not require expert knowledge about speech
    signal (more people study statistics than
    study speech)
  • errors in analysis don't propagate and
    accumulate
  • does not require prior segmentation
  • does not require a large number of templates
  • results are usually the best or among the best

10
Issues in Developing ASR Systems
  • Type of Channel
  • Microphone signal different from telephone
    signal, land-line telephone signal different
    from cellular signal.
  • Channel characteristics: pick-up pattern
    (omni-directional, unidirectional, etc.),
    frequency response, sensitivity, noise, etc.
  • Typical channels:
    desktop boom mic: unidirectional, 100 to 16000 Hz
    hand-held mic: super-cardioid, 40 to 20000 Hz
    telephone: unidirectional, 300 to 8000 Hz
  • Training on data from one type of channel
    automatically learns that channel's
    characteristics; switching channels degrades
    performance.

11
Issues in Developing ASR Systems
  • Speaker Characteristics
  • Because of differences in vocal tract length,
    male, female, and children's speech have
    different characteristics.
  • Regional accents are expressed as differences in
    resonant frequencies, durations, and pitch.
  • Individuals have resonant frequency patterns and
    duration patterns that are unique (allowing us
    to identify the speaker).
  • Training on data from one type of speaker
    automatically learns that group's or person's
    characteristics.
  • Training on data from all types of speakers
    results in lower overall performance.

12
Issues in Developing ASR Systems
  • Speaking Rate
  • Even the same speaker may vary the rate of
    speech.
  • Most ASR systems require a fixed window of input
    speech.
  • Formant dynamics change with different speaking
    rates and speaking styles (e.g. frustrated
    speech).
  • ASR performance is best when tested on same rate
    of speech as training data.
  • Training on a wide variation in speaking rate
    results in lower overall performance.

13
Issues in Developing ASR Systems
  • Noise
  • two types of noise: additive, convolutional
  • additive: white noise (random values added to
    waveform)
  • convolutional: filter (additive values in log
    spectrum)
  • techniques for removing noise: RASTA, Cepstral
    Mean Subtraction (CMS)
  • (nearly) impossible to remove all noise while
    preserving all speech
  • stochastic training learns noise as well as
    speech; if the noise changes, performance
    degrades.
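
Cepstral Mean Subtraction can be sketched directly: a fixed (convolutional) channel adds a constant offset to each cepstral coefficient, so subtracting the per-utterance mean of each coefficient removes it. The 2-D "cepstra" and channel offsets below are toy values, not real MFCCs.

```python
# CMS: remove the per-coefficient mean over the utterance. Because a
# time-invariant channel is additive in the (log-spectral) cepstral
# domain, the channel offset is removed along with the speech mean.

def cepstral_mean_subtraction(frames):
    """frames: list of cepstral vectors (lists of floats).
    Returns frames with each coefficient's utterance mean removed."""
    n = len(frames)
    means = [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]
    return [[c - m for c, m in zip(f, means)] for f in frames]

clean = [[1.0, -1.0], [3.0, 1.0]]   # "true" speech cepstra
channel = [0.5, 2.0]                # constant channel offset
noisy = [[c + o for c, o in zip(f, channel)] for f in clean]

# After CMS, the channel-corrupted cepstra match the clean ones:
print(cepstral_mean_subtraction(noisy) == cepstral_mean_subtraction(clean))
```

This also shows the "(nearly) impossible" point above: CMS removes the speech's own long-term average along with the channel, and it does nothing for additive noise, which is not constant in the cepstral domain.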

14
Issues in Developing ASR Systems
  • Vocabulary
  • Vocabulary must be specified in advance
    (can't recognize new words)
  • Pronunciation of each word must be specified
    exactly (phonetic substitutions may degrade
    performance)
  • Grammar is either very simple or very structured
  • Reasons:
  • phonetic recognition is so poor that confidence
    in each recognized phoneme is usually very low.
  • humans often speak ungrammatically or
    disfluently.

15
Issues in Developing ASR Systems
  • How Well Does ASR Do?

[Figure: NIST benchmark word error rates (1% to
100% WER), 1988-2003, on increasingly difficult
tasks: read speech (1k, 5k, and 20k-word
vocabularies), broadcast speech, varied
microphones, noisy speech, spontaneous speech
(2-3k words), structured speech, and
conversational speech; human speech recognition
of broadcast speech shown at 0.9% WER.
Caption: Error Rates on Increasingly Difficult
Problems]
16
ASR Technology vs. Spectrogram Reading
  • HMM-Based ASR
  • frame based - no identification of landmarks in
    speech signal
  • duration of phonemes not identified until end of
    processing
  • all frames are equally important
  • cues are completely unspecified, learned by
    training
  • coarticulation model: context-dependent phoneme
    models
  • Spectrogram Reading
  • first identify landmarks in the signal
    → Where's the vowel? Is that change in energy
    a plosive?
  • identify change over duration of a phoneme,
    relative durations
    → Is that formant movement a diphthong or
    coarticulation?
  • identify activity at phoneme boundaries
    → F2 goes to 1800 Hz at onset of voicing,
    → voicing continues into frication, so it's a
    voiced fricative
  • specific cues to phoneme identity
    → 1800 Hz implies alveolar; F3 ≈ 2000 Hz implies
    retroflex
  • coarticulation model tends toward locus theory

17
ASR Technology vs. Spectrogram Reading
  • HMM-Based ASR
  • frame based - no identification of landmarks in
    speech signal
  • duration of phonemes not identified until end of
    processing
  • all frames are equally important
  • cues are completely unspecified, learned by
    training
  • Spectrogram Reading and Human Speech Recognition
  • first identify landmarks in the signal
    → Humans thought to have landmark (e.g. plosive)
    detectors
  • identify change over duration of a phoneme,
    relative durations
    → Humans very sensitive to small changes,
    especially at vowel/consonant boundaries
  • identify activity at phoneme boundaries
    → Transition into the vowel is the most
    important region for human speech perception
  • specific cues to phoneme identity
    → Humans use a (large) set of specific cues,
    e.g. VOT

18
The Structure of Spoken Language
  • Final Points
  • Speech is complex! Not as simple as a sequence
    of phonemes
  • There is structure in speech, related to broad
    phonetic categories
  • Identifying formant locations and movement is
    important
  • Duration is important even for phoneme identity
  • Phoneme boundaries are important
  • There are numerous cues to phoneme identity
  • Little is understood about how humans process
    speech
  • Current ASR technology cannot account for all
    the information that humans use in reading
    spectrograms, and what is known about human
    speech processing is often not used; this
    implies (but does not prove) that current
    technology may be incapable of reaching human
    levels of performance.
  • Speech is complex!