Speech Recognition - PowerPoint PPT Presentation

Provided by: mitchel4

1
Speech Recognition
  • Mitch Marcus
  • CIS 530
  • Introduction to Natural Language Processing

2
A sample of speech recognition
  • The general problem of automatic transcription of
    speech by any speaker in any environment is still
    far from solved. But recent years have seen ASR
    technology matured (mature) to the point where
    (it) is viable in certain limited domains. One
    major application area is inhuman- (in human-)
    computer interaction. While many tasks are
    better solved with visual or pointing interfaces,
    speech has the potential to be a better interface
    than the keyboard for tasks were (where) full
    natural language communication is useful, or for
    which keyboards are not appropriate. This
    includes hands-busy or eyes-busy applications,
    such as where the user has objects to manipulate
    or equipment to control.
  • This was dictated one (on) April 16th, 2007,
    using Dragon NaturallySpeaking 9.1. The text is
    from Speech and Language Processing, draft of the
    second edition, by giraffe ski (Jurafsky) and
    Martin.
  • 140 words, 6 errors
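
The error count above corresponds to a word error rate. A minimal sketch of that arithmetic follows, together with a generic word-level edit-distance WER. The function name is hypothetical, and the slide's count of 6 is taken as given: a strict word-level alignment may count some of these errors differently (for example, "giraffe ski" for "Jurafsky" is a substitution plus an insertion).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# The slide's own figures: 6 errors in 140 words.
slide_rate = 6 / 140
print(f"{slide_rate:.1%}")  # prints 4.3%

# One substituted word out of three reference words.
example_rate = word_error_rate("on April 16th", "one April 16th")
```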

3
I. Why is Speech Recognition Hard?
4
A Speech Spectrogram
[Figure: spectrogram; vertical axis: frequency, horizontal axis: time]
  • Represents the varying short-term amplitude
    spectra of the speech waveform
  • Darkness represents amplitude at that time and
    frequency.
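
As a rough illustration of "short term amplitude spectra", the sketch below cuts a signal into overlapping windowed frames and takes the magnitude of each frame's discrete Fourier transform. The frame length, hop size, Hann window, and test tone are all assumptions chosen for the example, not values from the slides.

```python
import cmath
import math

def amplitude_spectrum(frame):
    """Magnitude of the DFT of one frame, non-negative frequencies only."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_len=64, hop=32):
    """List of short-term amplitude spectra, one per windowed frame."""
    # A Hann window tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append(amplitude_spectrum(frame))
    return frames

# Hypothetical test input: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(512)]
spec = spectrogram(tone)

# Strongest frequency bin in the first frame, converted to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Plotting each frame as a column, with darker cells for larger values, gives the picture described above. For the pure tone, the darkest band sits at the tone's frequency in every frame.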

5
A trained person can read a spectrogram
Therefore, the spectrogram contains all the
information a machine needs as well.
Prof. Victor Zue, MIT
6
Vowels are determined by their formants
[Figure: spectrograms of "bee", "baa", and "boo" with formants F1, F2, and F3 marked]
  • The frequencies of F1, F2, and F3, the first
    three resonances of the vocal tract, largely
    determine the perceived vowel.
7
Consonants are determined by
  • burst spectra,
  • length of silence
  • formant motion
  • ...

8
Coarticulation
  • The same abstract phoneme can be realized very
    differently in different phonetic contexts; this
    is coarticulation.
  • F2 in the vowel /u/, crucial to its
    identification, varies significantly due to
    surrounding consonants in the syllables "moom",
    "kook", and "toot".

9
Speech Information is not local
  • The identity of speech units, phones, cannot be
    determined independently of context.
  • Sometimes two phones can best be distinguished by
    examining properties of neighboring phones

[Figure: spectrograms of "d o s" and "d o z"]
10
Speech Information is not local
  • /s/ and /z/ are often acoustically identical
  • They are differentiated by the length of the
    preceding vowel

[Figure: spectrograms of "d o s" and "d o z"]
11
Words are constant, but utterances aren't
  • Spectrograms of similar words pronounced by
    the same speaker may be more alike than
    spectrograms of the same word pronounced by
    different speakers.

[Figure: spectrograms of "wait" by MM (m), "wait" by JH (f), and "wait" whispered by MM]
12
II. HMMs for Speech Recognition
  • (Illustrations in Part II are from draft
    Chapter 9 of Jurafsky & Martin)

13
Speech Recognition Architecture
14
Schematic HMM for the word "six"
  • Simple one-state-per-phone model
  • Left-to-right topology with self-loops and no
    skips
  • Start and End states with no emissions
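
The topology described above can be sketched as a transition table. This is an illustration only: the phone sequence s ih k s is an assumed ARPAbet-style transcription of "six", and the self-loop probability of 0.5 is a placeholder, since real values are learned from training data.

```python
def left_to_right_hmm(phones, self_loop=0.5):
    """Transition table for a one-state-per-phone HMM with no skips."""
    # Index the phone states so repeated phones (the two s's) stay distinct.
    names = ["start"] + [f"{p}_{i}" for i, p in enumerate(phones, 1)] + ["end"]
    n = len(names)
    trans = [[0.0] * n for _ in range(n)]
    trans[0][1] = 1.0  # non-emitting start state enters the first phone
    for i in range(1, n - 1):
        trans[i][i] = self_loop            # stay in the same phone
        trans[i][i + 1] = 1.0 - self_loop  # advance to the next phone (or end)
    return names, trans

names, trans = left_to_right_hmm(["s", "ih", "k", "s"])
for name, row in zip(names, trans):
    print(f"{name:>6}", row)
```

Every entry more than one step to the right of the diagonal is zero, which is what "no skips" means here. The self-loops let a phone last for any number of frames.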

15
Review: Phones have dynamic structure
  • The name "Ike", pronounced [ay k]
  • The formants of the diphthong [ay] move
    continually
  • [k] consists of (a) a silence and (b) a burst

16
A 3-state HMM phone model
  • Three emitting states
  • Two non-emitting states
  • Usually includes skip states
  • The word "six" (siks) using 3-state HMM phone
    models
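
A minimal sketch of how a word model might be assembled from 3-state phone models: each phone is expanded into three emitting sub-states and the results are concatenated. The sub-state labels (b, m, e for beginning, middle, end), the phone sequence s ih k s, and the function name are assumptions for illustration.

```python
def expand_to_subphones(phones, parts=("b", "m", "e")):
    """Replace each phone with its three emitting sub-states, in order."""
    return [f"{phone}_{part}" for phone in phones for part in parts]

word_states = expand_to_subphones(["s", "ih", "k", "s"])
print(word_states)

# If no skip transitions are used, every emitting state consumes at least
# one acoustic frame, so the word needs at least this many frames.
min_frames = len(word_states)
```

Compared with the one-state-per-phone model, the extra sub-states give each phone room to model the kind of internal change seen in "Ike" (a silence followed by a burst).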

17
A simple full HMM for digit recognition
18
III. Speech Dialogue Understanding
19
Multiple knowledge sources provide redundancy
  • Grammatical, semantic and pragmatic information
    can be used to make recognition robust.
  • A first experiment: AT&T Bell Labs airline
    reservation system (1977)

20
Multiple knowledge sources provide redundancy
22
Speech Recognition Task Dimensions I
  • Continuous speech vs. isolated word
  • Speaker dependent, speaker independent, speaker
    adaptive
      • Speaker dependent: system trained for current
        speaker
      • Speaker independent: no modification per
        speaker
      • Speaker adaptive: initially speaker
        independent, then adapts to speaker while
        functioning
  • Vocabulary size
      • Small: 10-50 words
      • Large: 1,000-64,000 words
      • Unlimited: system can handle out-of-vocabulary
        words

23
Speech Recognition Task Dimensions II
  • Perplexity level
      • Low perplexity: average expected branching
        factor of grammar < 10-20
      • High perplexity: average expected branching
        factor of grammar > 100
  • Read vs. dictation style vs. conversational
    speech
  • Quiet conditions vs. various noise conditions
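
To make "average expected branching factor" concrete, here is a small sketch using the standard definition of perplexity: 2 raised to the average negative log2 probability per word. The probabilities below are made up for illustration. When every word is an equally likely choice among b alternatives, the perplexity is b.

```python
import math

def perplexity(word_probs):
    """2 ** (average negative log2 probability per word)."""
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))

# Five words, each one of 10 equally likely choices: perplexity is about 10.
uniform = perplexity([1 / 10] * 5)

# Uneven choices: one word is 1 of 2, the next is 1 of 8.
# Average log2 cost is (1 + 3) / 2 = 2 bits, so perplexity is 4.
mixed = perplexity([0.5, 0.125])
```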

24
Perplexity: Why it matters
  • Experiment (1992): read speech, three tasks
  • Mammography transcription (perplexity 60)
      • "There are scattered calcifications with the
        right breast"
      • "These too have increased very slightly"
  • General radiology (perplexity 140)
      • "This is somewhat diffuse in nature"
      • "There is no evidence of esophageal or gastric
        perforation"
  • Encyclopedia dictation (perplexity 430)
      • "Czechoslovakia is known internationally in
        music and film"
      • "Many large sulphur deposits are found at or
        near the earth's surface"

25
Real Speech is Difficult: The air travel domain
  • Fragments
      • show me flights from boston to new york
      • to philadelphia
  • Ungrammatical utterances
      • what type of ground transportation from the
        airport to denver
  • Restarts and self-corrections
      • I'd like to see show me flights leaving before
        noon
  • And finally...
      • from uh sss from the philadelphia airport um at
        ooh the airline is united airlines and it is
        flight number one ninety four once that one
        lands I need ground transportation to uh broad
        street in philadelphia what can you arrange for
        that

26
Conversational Speech Transcription
  • Automatically transcribe conversational speech,
    not necessarily intended for speech recognition
  • Best results (3/06)
      • English: word error rate 17.2%
      • Arabic: word error rate 15.5%