Speech Recognition - PowerPoint PPT Presentation

Title: Speech Recognition

1
Speech Recognition

Mitch Marcus
CIS 530
Introduction to Natural Language Processing

2
A sample of speech recognition

The general problem of automatic transcription of
speech by any speaker in any environment is still
far from solved. But recent years have seen ASR
technology matured (mature) to the point where
(it) is viable in certain limited domains. One
major application area is inhuman- (in human-)
computer interaction. While many tasks are
better solved with visual or pointing interfaces,
speech has the potential to be a better interface
than the keyboard for tasks were (where) full
natural language communication is useful, or for
which keyboards are not appropriate. This
includes hands-busy or eyes-busy applications,
such as where the user has objects to manipulate
or equipment to control.
This was dictated one (on) April 16th, 2007,
using Dragon NaturallySpeaking 9.1. The text is
from Speech and Language Processing, draft of the
second edition, by giraffe ski (Jurafsky) and
Martin.
140 words, 6 errors
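
Six errors in 140 words is 6/140, or about 4.3%. The usual way to score a recognizer automatically is the word error rate: the word-level edit distance between the reference text and the recognizer output, divided by the number of reference words. The Python sketch below is one minimal way to compute it; the function name and the example strings are illustrative only. Automatic counts can differ from a hand tally like the one above, because a single misrecognition such as "giraffe ski" for "Jurafsky" costs two word edits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from output
            insertion = curr[j - 1] + 1   # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word ("were" for "where") out of eight reference words.
print(word_error_rate(
    "tasks where full natural language communication is useful",
    "tasks were full natural language communication is useful",
))  # 0.125
```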

3
I. Why is Speech Recognition Hard?
4
A Speech Spectrogram
(Spectrogram figure: frequency on the vertical axis, time on the horizontal axis)

Represents the varying short term amplitude
spectra of the speech waveform
Darkness represents amplitude at that time and
frequency.
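
The idea of "short term amplitude spectra" can be sketched in a few lines of Python: cut the waveform into overlapping frames, window each frame, and take the magnitude of its discrete Fourier transform. Each frame becomes one column of the spectrogram. The frame length, hop size, and Hann window below are assumptions chosen for illustration, not values from the slides, and the direct DFT is used for clarity rather than speed.

```python
import cmath
import math


def short_term_spectra(samples, frame_len=64, hop=32):
    """Return one amplitude spectrum (list of magnitudes) per frame."""
    # Hann window: tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        # Direct DFT; bins 0 .. frame_len/2 cover the non-negative frequencies.
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        spectra.append(spectrum)
    return spectra


# Test signal: a pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spectra = short_term_spectra(tone)
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectra]
print(len(spectra), peak_bins)
```

For the pure tone, every frame has its largest amplitude in bin 8, which a spectrogram would draw as one dark horizontal band.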

5
A trained person can 'read' a spectrogram.
Therefore, the spectrogram contains all the
information a machine needs as well.
Prof. Victor Zue, MIT
6
Vowels are determined by their formants
(Spectrogram figures: "bee", "baa", and "boo", with formants F1, F2, and F3 marked)
The frequencies of F1, F2, and F3, the first three
resonances of the vocal tract, largely determine
the perceived vowel.
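
As a rough illustration of how formants separate vowels, the Python sketch below labels a measured (F1, F2) pair with the nearest reference vowel. The reference values are approximate, commonly cited averages for adult male speakers, included only for illustration; they are not from the slides, and pairing "baa" with the vowel of "bat" is an assumption. Real recognizers do not classify vowels this way.

```python
import math

# Approximate average (F1, F2) in Hz; illustrative values, not from the slides.
REFERENCE_FORMANTS = {
    "bee": (270, 2290),   # high front vowel: low F1, high F2
    "baa": (660, 1720),   # assumed to be the vowel of "bat"
    "boo": (300, 870),    # high back vowel: low F1, low F2
}


def nearest_vowel(f1: float, f2: float) -> str:
    """Return the reference word whose (F1, F2) is closest in Hz."""
    return min(
        REFERENCE_FORMANTS,
        key=lambda word: math.dist((f1, f2), REFERENCE_FORMANTS[word]),
    )


print(nearest_vowel(300, 2200))  # bee
print(nearest_vowel(320, 900))   # boo
```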
7
Consonants are determined by

burst spectra,
length of silence
formant motion
...

8
Coarticulation

The same abstract phoneme can be realized very
differently in different phonetic contexts. This
is coarticulation.
F2 in the vowel /u/, crucial to its
identification, varies significantly due to
surrounding consonants in the syllables
"moom", "kook", and "toot".
9
Speech Information is not local

The identity of speech units, phones, cannot be
determined independently of context.
Sometimes two phones can best be distinguished by
examining properties of neighboring phones

(Spectrograms of the phone sequences d o s and d o z)
10
Speech Information is not local

/s/ and /z/ are often acoustically identical
They are differentiated by the length of the
preceding vowel

(Spectrograms of the phone sequences d o s and d o z)
11
Words are constant, but utterances aren't

Spectrograms of similar words pronounced by
the same speaker
may be more alike than
spectrograms of the same word pronounced by
different speakers.

(Spectrograms: "wait" spoken by MM (m), "wait" spoken by JH (f), and "wait" whispered by MM)
12
II. HMMs for Speech Recognition

(Illustrations in II from draft Chapter 9,
Jurafsky & Martin)

13
Speech Recognition Architecture
14
Schematic HMM for the word six

Simple one state per phone model
Left to right topology with self loops and no
skips
Start and End states with no emissions
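
The topology described above can be written down as a transition matrix. The Python sketch below builds one for a phone sequence, using s ih k s for "six"; the 0.5 self-loop probability and the function name are placeholders for illustration, since a real system estimates transition probabilities from training data.

```python
def left_to_right_hmm(phones, self_loop=0.5):
    """Build states and a transition matrix for a one-state-per-phone HMM.

    Topology: non-emitting start state, one emitting state per phone with a
    self loop and a single forward arc (no skips), non-emitting end state.
    """
    states = ["<start>"] + list(phones) + ["<end>"]
    n = len(states)
    trans = [[0.0] * n for _ in range(n)]
    trans[0][1] = 1.0                        # start always enters the first phone
    for i in range(1, n - 1):
        trans[i][i] = self_loop              # stay in the same phone
        trans[i][i + 1] = 1.0 - self_loop    # advance to the next state only
    return states, trans


states, trans = left_to_right_hmm(["s", "ih", "k", "s"])
for name, row in zip(states, trans):
    print(f"{name:8s}", row)
```

Every row has at most two non-zero entries, the self loop and the arc to the next state, which is what "left to right with no skips" means in matrix form.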

15
Review: Phones have dynamic structure

The name "Ike", pronounced ay k
The formants of the diphthong ay move continually
k consists of (a) a silence, (b) a burst

16
A 3-state HMM phone model

Three emitting states
Two non-emitting states
Usually includes skip states

The word "six" (siks) using 3-state HMM phone models
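
A word model of this kind is obtained by expanding each phone into its three emitting states and concatenating the results. The Python sketch below lists the emitting states for "six", assuming the phone sequence s ih k s; the part names "beg", "mid", and "end" are illustrative labels, and the non-emitting entry and exit states and the transition probabilities are left out.

```python
SUBPHONE_PARTS = ("beg", "mid", "end")  # illustrative names for the 3 states


def expand_word(phones):
    """Return the emitting states of a word built from 3-state phone models.

    Each state is (phone position, phone, part), so the two /s/ phones in
    "six" keep distinct states.
    """
    return [(position, phone, part)
            for position, phone in enumerate(phones)
            for part in SUBPHONE_PARTS]


six_states = expand_word(["s", "ih", "k", "s"])
print(len(six_states))   # 12
print(six_states[:3])    # [(0, 's', 'beg'), (0, 's', 'mid'), (0, 's', 'end')]
```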

17
A simple full HMM for digit recognition
18
III. Speech Dialogue Understanding
19
Multiple knowledge sources provide redundancy

Grammatical, semantic and pragmatic information
can be used to make recognition robust.
A first experiment: AT&T Bell Labs airline
reservation system (1977)

20
Multiple knowledge sources provide redundancy
21
(No Transcript)
22
Speech Recognition Task Dimensions I

Continuous speech vs. isolated word
Speaker dependent, speaker independent, speaker
adaptive
Speaker dependent: System trained for current
speaker
Speaker independent: No modification per speaker
Speaker adaptive: Initially speaker independent,
then adapts to speaker while functioning
Vocabulary size
Small: 10-50 words
Large: 1,000-64,000 words
Unlimited: System can handle out-of-vocabulary
words

23
Speech Recognition Task Dimensions I

Perplexity level
Low perplexity: Average expected branching factor
of grammar < 10-20
High perplexity: Average expected branching
factor of grammar > 100
Read vs. dictation style vs. conversational
speech
Quiet Conditions vs. various noise conditions
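
Perplexity, used above as an average branching factor, can be computed from the probabilities a language model assigns to each word of a test text: it is 2 raised to the average negative log2 probability per word. The Python sketch below assumes those per-word probabilities are already available; the function name and inputs are illustrative.

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence given each word's model probability."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    avg_log2 = sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** (-avg_log2)


# If the grammar allows 10 equally likely words at every step, each word has
# probability 0.1 and the perplexity equals the branching factor, about 10.
print(perplexity([0.1] * 20))
```

A task with 100 equally likely choices per word would score about 100 by the same calculation, the threshold the slide gives for high perplexity.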

24
Perplexity: Why it matters

Experiment (1992): read speech, three tasks
Mammography transcription (perplexity 60):
"There are scattered calcifications with the right
breast." "These too have increased very slightly."
General radiology (perplexity 140):
"This is somewhat diffuse in nature." "There is no
evidence of esophageal or gastric perforation."
Encyclopedia dictation (perplexity 430):
"Czechoslovakia is known internationally in
music and film." "Many large sulphur deposits are
found at or near the earth's surface."

25
Real Speech is Difficult: The air travel domain

Fragments
show me flights from boston to new york
to philadelphia
Ungrammatical utterances
what type of ground transportation from the
airport to denver
Restarts and self-corrections
I'd like to see show me flights leaving before
noon
And finally...
from uh sss from the philadelphia airport um at
ooh the airline is united airlines and it is
flight number one ninety four once that one lands
I need ground transportation to uh broad street
in philadelphia what can you arrange for that

26
Conversational Speech Transcription

Automatically transcribe conversational speech,
not necessarily intended for speech recognition
Best results (3/06)
English: word error rate 17.2%
Arabic: word error rate 15.5%
