Introduction to Computer Speech Processing - PowerPoint PPT Presentation

About This Presentation

Title:

Introduction to Computer Speech Processing

Description:

Introduction to Computer Speech Processing Alex Acero Research Area Manager Microsoft Research – PowerPoint PPT presentation

Slides: 68

Provided by: BryanB57

Transcript and Presenter's Notes

Title: Introduction to Computer Speech Processing

1
Introduction to Computer Speech Processing
Alex Acero, Research Area Manager, Microsoft Research
2
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

3
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

4
User Expectations for Speech
5
The Turing Test

Imitation Game
Judge, man, and a woman
All chat via Email.
Man pretends to be a woman.
Man lies, woman tries to help judge.
Judge must identify man after 5 minutes.
Turing Test
Replace man or woman with a computer.
Fool judge 30% of the time.

Thanks to Jim Gray for material
6
What Turing Said

I believe that in about fifty years' time it
will be possible to programme computers, with a
storage capacity of about 10^9, to make them play
the imitation game so well that an average
interrogator will not have more than 70 per cent
chance of making the right identification after
five minutes of questioning. The original
question, "Can machines think?" I believe to be
too meaningless to deserve discussion.
Nevertheless I believe that at the end of the
century the use of words and general educated
opinion will have altered so much that one will
be able to speak of machines thinking without
expecting to be contradicted.

Alan M. Turing, 1950. "Computing machinery and
intelligence." Mind, Vol. LIX, 433-460
7
Prediction 59 Years Later

Turing's technology forecast was great!
Gigabyte memory is common
Computer beat world chess champion
with some help from its programming staff!
Computers help design most things today

8
Prediction 59 Years Later

Intelligence forecast was optimistic
Several internet sites offer Turing Test chatterbots.
None pass (yet): http://www.loebner.net/Prizef/loebner-prize.html
But I believe it will not be long:
less than 50 years, more than 10 years
Turing test still stands as a long-term challenge

9
Challenges Implicit in the Turing Test

Read and understand as well as a human
Think and write as well as a human
Hear as well as a native speaker
Speech Recognition (speech to text)
Speak as well as a native speaker
Speech Synthesis (text to speech)
Remember what is heard and quickly return it on
request.

10
Moore's law (1965)

Gordon Moore: "The number of transistors per chip
will double every 18 months" (100x per decade)
Progress in next 18 months = ALL previous
progress
New storage = sum of all old storage (ever)
New processing = sum of all old processing.
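
The arithmetic behind these bullets can be checked with a short sketch. It assumes idealized doubling every 18 months; the function names are illustrative and not from the slides.

```python
def growth_factor(months: float, doubling_period: float = 18.0) -> float:
    """Capacity multiplier after `months`, assuming one doubling per period."""
    return 2.0 ** (months / doubling_period)


def capacity_per_period(periods: int) -> list:
    """Capacity added in each successive doubling period, starting from 1 unit."""
    return [2 ** k for k in range(periods)]


# Ten years of 18-month doublings: roughly the "100x per decade" on the slide.
decade = growth_factor(120)
print(round(decade, 1))

# Each new period adds about as much as all earlier periods combined
# (exactly one unit more, under this idealized model).
history = capacity_per_period(10)
print(history[-1], sum(history[:-1]))
```

Under these assumptions the newest period always exceeds the sum of everything before it by one unit, which is the sense in which "progress in the next 18 months equals all previous progress".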

15 years ago
11
Making Chips Smaller

Advances in lithography: the science of "drawing"
circuits on chips
Impact of Moore's law
Short distances => smaller processing time
Smaller size => lower cost per transistor
Amount of memory is increased
But it is not a law of physics: a mere
self-fulfilling prophecy.

12
Moore's law not applicable to Machine Intelligence

Speech technology benefited from Moore's Law in
the 1990s.
In the 21st century, faster chips mean
recognition errors appear faster
New algorithmic advances needed to pass the
Turing Test
Error rate halves approx every 7 years

13
Grand Challenges
"Within 10 years speech will be in every device.
Things like speech and ink are so natural, when
they get the right quality level they will be in
everything. As technical hurdles such as
background noise and context are overcome, major
adoption of speech technology will arrive. Soon,
dictating to PCs and giving commands to cell
phones will be basic modes of interacting with
technology." Bill Gates, March 2004
14
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

15
Speech in Mobile devices
16
Speech for Students
17
Speech in cars
18
Soccer Mom in car
19
Insurance Agent driving
20
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

21
Japanese dictation
22
Telephony: Response Point
23
Directory Assistance

Automatic generation of robust grammars
Users say "Calabria" or "Calabria restaurant"
Nearby cities
"Is Calabria restaurant in Redmond or Kirkland?"
Some people say the address too
"Pizza Hut on 3rd Avenue in New York, New York"
Automatic normalization
Acronyms, compound words, homonyms, misspelled
words

24
Multimodal voice search
25
Click-Driven Automated Feedback
26
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

27
CommuteUX
28
Speech in Education
29
VerbalMath
30
Virtual Receptionist
31
Video Search (Frank Seide, MSRA)
32
Browsing a Video (Milind Mahajan, Patrick Nguyen)
33
Podcast authoring (Patrick Nguyen)
34
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

35
Role of Speech in Different Devices
[Chart: devices (PC, Tablet PC, Internet TV, PDA, Screen Phone, Phone, Car) plotted by ease of GUI (screen/pointer) against ease of text input (keyboard/pen), each axis running from Low to High]
36
A Roadmap for Speech
[Chart: Dictation, Multimodal Command/Control, and Speech-Only Telephony plotted by ease of GUI (screen/pointer) against ease of text input (keyboard/pen), each axis running from Low to High]
37
Speech Technology
38
Outline

Grand challenges in Speech and Language
Vision videos
Products today
Prototypes
The role of speech
Technology Introduction

39
Voice-enabled System Technology Components
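[Diagram: a loop of five components around shared Data and Rules. Speech in -> ASR (Automatic Speech Recognition) -> Words -> SLU (Spoken Language Understanding) -> Meaning -> DM (Dialog Management) -> Action -> SLG (Spoken Language Generation) -> Words -> TTS (Text-to-Speech Synthesis) -> Speech out]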
40
Voice-enabled System Technology Components
[Diagram: component loop around shared Data and Rules. Speech -> ASR -> Words -> SLU -> Meaning -> DM -> Action -> SLG -> Words -> TTS -> Speech]
41
Basic Formulation

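Basic equation of speech recognition is
$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)$
$X = X_1, X_2, \ldots, X_n$ is the acoustic observation
$W$ is the word sequence
$P(X \mid W)$ is the acoustic model
$P(W)$ is the language model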
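
As a minimal sketch of this decision rule, the snippet below picks the word sequence whose acoustic-model score times language-model score is largest, working in log probabilities. The candidate list and every probability value are made-up illustrations, and a real recognizer searches a far larger space rather than scoring an explicit list.

```python
import math


def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate W that maximizes log P(X|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])


# Hypothetical scores for a single utterance X (illustrative numbers only).
acoustic_logprob = {
    "recognize speech": math.log(0.30),
    "wreck a nice beach": math.log(0.35),
}
lm_logprob = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

best = decode(list(acoustic_logprob), acoustic_logprob, lm_logprob)
print(best)
```

In this toy case the acoustic model slightly prefers the wrong hypothesis, and the language model prior tips the decision the other way.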

42
Speech Recognition
[Diagram: Input Speech -> Feature Extraction -> Pattern Classification (Decoding, Search) -> Confidence Scoring -> "Hello World" (0.9) (0.8). The Acoustic Model, Word Lexicon, and Language Model feed Pattern Classification.]
43
Feature Extraction
Goal: Extract robust features (information) from
the speech that are relevant for ASR.
Method: Spectral analysis through either
a bank of filters or Linear Predictive
Coding, followed by non-linearity and
normalization.
Result: Signal compression, where 30 or so
features are extracted for each window of speech
samples (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment
(office, airport, car), devices (speakerphones,
cellphones), speakers (accents, dialect, style,
speaking defects), noise and echo.
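
The bank-of-filters route described on this slide can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the front end of any particular recognizer: the bands are equal-width rather than perceptually spaced, the DFT is a naive one for readability, and the frame length, hop, and band count are arbitrary choices.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, advancing by hop."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Hamming-window a frame and return its power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))
        spectrum.append(abs(s) ** 2)
    return spectrum


def log_band_energies(spectrum, num_bands):
    """Sum power in equal-width bands, then take the log (the non-linearity).

    Bins left over after the last full band are ignored.
    """
    width = len(spectrum) // num_bands
    return [math.log(sum(spectrum[b * width:(b + 1) * width]) + 1e-10)
            for b in range(num_bands)]


def extract_features(samples, frame_len=64, hop=32, num_bands=8):
    """One feature vector per frame, mean-normalized per band."""
    feats = [log_band_energies(power_spectrum(f), num_bands)
             for f in frame_signal(samples, frame_len, hop)]
    means = [sum(col) / len(feats) for col in zip(*feats)]
    return [[v - m for v, m in zip(row, means)] for row in feats]


# A synthetic tone stands in for speech samples.
tone = [math.sin(2 * math.pi * 0.1 * i) for i in range(256)]
features = extract_features(tone)
print(len(features), len(features[0]))
```

Each frame of 64 samples is reduced to 8 numbers here, which is the compression effect the slide refers to, at toy scale.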
44
Acoustic Modeling

Goal
Model probability of acoustic features
for each phone model, i.e. p(X | /ae/)
Method
Hidden Markov Models (HMM) trained through
maximum likelihood (EM) or discriminative methods
Challenges/variability
Background noise: Cocktail Party Effect
Dialect/accent
Speaker
Phonetic context: Italy vs. Italian
No spaces in speech: "Wreck a nice beach" vs. "Recognize speech"
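
To make the quantity p(X | phone model) concrete, here is a minimal sketch of the forward algorithm for a small HMM. It assumes discrete observation symbols for brevity, whereas practical acoustic models use continuous densities, and the two-state model with its probabilities is invented for illustration.

```python
def forward_likelihood(obs, init, trans, emit):
    """p(X | model) for a discrete-observation HMM via the forward algorithm."""
    n_states = len(init)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical 2-state left-to-right model for one phone; observations are
# quantized to the symbols 0 and 1.
init = [1.0, 0.0]
trans = [[0.6, 0.4],
         [0.0, 1.0]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]

likelihood = forward_likelihood([0, 1], init, trans, emit)
print(likelihood)
```

Training by maximum likelihood (EM) would adjust `trans` and `emit` so that this likelihood is high on recordings of the phone; that step is not shown.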
45
Word Lexicon

Goal
Map legal phone sequences into words
according to phonotactic rules
David /d/ /ey/ /v/ /ih/ /d/
Multiple Pronunciations
Several words may have multiple pronunciations
Data /d/ /ae/ /t/ /ax/
Data /d/ /ey/ /t/ /ax/
Challenges
How do you generate a word lexicon automatically?
Letter-to-sound (LTS) rules can be automatically trained with
decision trees (CART): less than 8% errors, but
proper nouns are hard!
How do you add new variant dialects and word
pronunciations?

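
A lexicon with multiple pronunciations per word can be sketched as a dictionary plus a reverse index from phone sequences to words. The three pronunciations are the ones on the slide; the helper names are illustrative, and exact-match lookup is a simplification of what a decoder does.

```python
# Pronunciations taken from the slide; a real lexicon is far larger.
LEXICON = {
    "David": [("d", "ey", "v", "ih", "d")],
    "Data": [("d", "ae", "t", "ax"), ("d", "ey", "t", "ax")],
}


def build_reverse_index(lexicon):
    """Map each phone sequence to the set of words it can spell."""
    index = {}
    for word, pronunciations in lexicon.items():
        for pron in pronunciations:
            index.setdefault(pron, set()).add(word)
    return index


def words_for(phones, index):
    """Words whose pronunciation matches the phone sequence exactly."""
    return sorted(index.get(tuple(phones), set()))


index = build_reverse_index(LEXICON)
print(words_for(["d", "ey", "t", "ax"], index))
```

Adding a dialect variant amounts to appending another tuple to a word's list, which is why the hard part is generating those variants rather than storing them.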
46
Pattern Classification

Goal
Find optimal word sequence
Combine information (probabilities) from
Acoustic model
Word lexicon
Language model
Method
Decoder searches through all possible recognition
choices using a Viterbi decoding algorithm
Challenge
Searching a large network space is
computationally expensive for large vocabulary
ASR; beam search and WFSTs keep it efficient

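
The Viterbi search with beam pruning mentioned on this slide can be sketched on a toy model as below. The states, probabilities, and beam width are invented for illustration; a production decoder works over a much larger network and combines acoustic, lexicon, and language model scores.

```python
import math


def viterbi(obs, states, log_init, log_trans, log_emit, beam=math.inf):
    """Return (best state path, its log score).

    At each step, hypotheses scoring more than `beam` below the current
    best are pruned before being extended.
    """
    scores = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        best = max(scores.values())
        active = {s: v for s, v in scores.items() if v >= best - beam}
        new_scores, ptr = {}, {}
        for s in states:
            prev = max(active, key=lambda p: active[p] + log_trans[p][s])
            new_scores[s] = active[prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        scores = new_scores
        backptrs.append(ptr)
    last = max(scores, key=scores.get)
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path)), scores[last]


# Hypothetical two-state model with observation symbols 0 and 1.
states = ["s1", "s2"]
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = {
    "s1": {"s1": math.log(0.7), "s2": math.log(0.3)},
    "s2": {"s1": math.log(0.4), "s2": math.log(0.6)},
}
log_emit = {
    "s1": {0: math.log(0.9), 1: math.log(0.1)},
    "s2": {0: math.log(0.2), 1: math.log(0.8)},
}

path, score = viterbi([0, 0, 1], states, log_init, log_trans, log_emit)
print(path)
```

A narrow beam saves work but can discard the hypothesis that would have won later, so the beam width trades speed against search errors.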
47
Confidence Scoring
Goal: Identify possible recognition errors and
out-of-vocabulary events. Potentially improves
the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis
likelihood ratio test is associated with each
recognized word.
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Command-and-control: false rejection and false
acceptance => ROC curves
Challenges: Rejection of extraneous
acoustic events (noise, background speech, door
slams) without rejection of valid user input
speech.
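
The trade-off between false rejection and false acceptance can be sketched by sweeping a threshold over word confidences. The first two scored words follow the slide's example ("credit" correct at 0.9, "fees" wrong at 0.3); the other two are hypothetical, and the function names are illustrative.

```python
def rejection_rates(scored, threshold):
    """scored: list of (confidence, is_correct) pairs.

    Words with confidence below the threshold are rejected.
    Returns (false_rejection_rate, false_acceptance_rate); assumes the list
    contains at least one correct and one incorrect word.
    """
    correct = [c for c, ok in scored if ok]
    wrong = [c for c, ok in scored if not ok]
    false_rej = sum(c < threshold for c in correct) / len(correct)
    false_acc = sum(c >= threshold for c in wrong) / len(wrong)
    return false_rej, false_acc


def roc_points(scored, thresholds):
    """One (threshold, false_rejection, false_acceptance) triple per threshold."""
    return [(t,) + rejection_rates(scored, t) for t in thresholds]


# (0.9, True) and (0.3, False) mirror the slide; the rest are made up.
scored = [(0.9, True), (0.3, False), (0.6, True), (0.7, False)]
for point in roc_points(scored, [0.2, 0.5, 0.8]):
    print(point)
```

Raising the threshold lowers false acceptance and raises false rejection; plotting the pairs gives the ROC curve used to pick an operating point.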
48
Voice-enabled System Technology Components
[Diagram: component loop around shared Data and Rules. Speech -> ASR -> Words -> SLU -> Meaning -> DM -> Action -> SLG -> Words -> TTS -> Speech]
49
Text-to-Speech Systems

TTS Engine
Raw text or tagged text
-> Text Analysis: Document Structure Detection, Text Normalization, Linguistic Analysis
-> tagged text
-> Phonetic Analysis: Homograph Disambiguation, Grapheme-to-Phoneme Conversion
-> tagged phones
-> Prosodic Analysis: Pitch and Duration Attachment
-> controls
-> Speech Synthesis: Voice Rendering
-> Speech Audio Out
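
As a small illustration of the first stage only, text normalization, the sketch below expands a few abbreviations and spells out digits so that later stages see plain words. The abbreviation table and the digit-by-digit reading are hypothetical simplifications; real systems use context to choose, for example, between "street" and "saint".

```python
import re

# Hypothetical, tiny tables for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text):
    """Lowercase, expand known abbreviations, and spell out digit strings."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"[0-9]+", token):
            words.extend(DIGITS[int(d)] for d in token)
        else:
            # Strip punctuation, keeping letters and apostrophes.
            words.append(re.sub(r"[^a-z']", "", token))
    return [w for w in words if w]


print(normalize("Dr. Smith lives at 42 Main St."))
```

The output word list is what a phonetic analysis stage would then convert to phones.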
50
Multimedia Customer Care (Courtesy of AT&T)
51
Voice-enabled System Technology Components
[Diagram: component loop around shared Data and Rules. Speech -> ASR -> Words -> SLU -> Meaning -> DM -> Action -> SLG -> Words -> TTS -> Speech]
52
Language Understanding

Application Schema (XML for semantic entities)
defines the application status
A Semantic Context Free Grammar (CFG) parses an
English sentence and fills in slots of the
application schema.

53
Application Schema
<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
</itinerary>
54
Semantic CFG

<rule name="itinerary">
  Show me flights from <ruleref name="origin"/>
  to <ruleref name="destination"/>
</rule>
<rule name="origin">
  <ruleref name="city"/>
</rule>
<rule name="destination">
  <ruleref name="city"/>
</rule>
<rule name="city">
  Seattle | San Francisco | New York
</rule>

55
An example sentence

"Show me flights from Seattle to New York"
would populate the application schema as
<itinerary>
  <origin>
    <city>Seattle</city>
    <state></state>
  </origin>
  <destination>
    <city>New York</city>
    <state></state>
  </destination>
  <date></date>
</itinerary>
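
The slot filling shown here can be sketched in a few lines. Because this particular grammar has no recursion, a regular expression stands in for a CFG parser; that substitution, and the function name, are assumptions made for the example rather than how the original system works.

```python
import re
import xml.etree.ElementTree as ET

# The three cities listed in the grammar's city rule.
CITIES = ["Seattle", "San Francisco", "New York"]
CITY = "|".join(re.escape(c) for c in CITIES)
ITINERARY = re.compile(
    rf"Show me flights from (?P<origin>{CITY}) to (?P<destination>{CITY})$"
)


def parse_itinerary(sentence):
    """Return an <itinerary> element with city slots filled, or None."""
    match = ITINERARY.match(sentence.strip())
    if match is None:
        return None
    root = ET.Element("itinerary")
    for slot in ("origin", "destination"):
        node = ET.SubElement(root, slot)
        ET.SubElement(node, "city").text = match.group(slot)
        ET.SubElement(node, "state")  # left empty: the sentence gives no state
    ET.SubElement(root, "date")  # left empty: the sentence gives no date
    return root


itinerary = parse_itinerary("Show me flights from Seattle to New York")
print(ET.tostring(itinerary, encoding="unicode"))
```

Slots the sentence does not mention stay empty, which is what lets a dialog manager notice that it still needs to ask for a date.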

56
Voice-enabled System Technology Components
[Diagram: component loop around shared Data and Rules. Speech -> ASR -> Words -> SLU -> Meaning -> DM -> Action -> SLG -> Words -> TTS -> Speech]
57
Who manages the Dialog?

Directed Dialog
"Who would you like to contact?"
Finite State Machine
Simple CFG
MSConnect

Initiative

User Initiative Dialog
"What can I do for you?"
N-grams
Windows Airlines

Reservations
Flight Status
Baggage Claim
Special Announcements
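
A directed dialog driven by a finite state machine can be sketched as a table of states, each with a prompt and a transition rule. Only the first prompt comes from the slide; the confirmation state, the state names, and the scripted replies are hypothetical.

```python
# Hypothetical states; each maps a user reply to the next state.
DIALOG = {
    "ask_contact": {
        "prompt": "Who would you like to contact?",
        "next": lambda reply: "confirm" if reply else "ask_contact",
    },
    "confirm": {
        "prompt": "Did you say {contact}?",
        "next": lambda reply: "done" if reply == "yes" else "ask_contact",
    },
}


def run_dialog(replies, start="ask_contact"):
    """Feed scripted replies through the state machine.

    Returns (prompts spoken, confirmed contact or None).
    """
    state, contact, prompts = start, None, []
    for reply in replies:
        if state == "done":
            break
        prompts.append(DIALOG[state]["prompt"].format(contact=contact))
        if state == "ask_contact" and reply:
            contact = reply
        state = DIALOG[state]["next"](reply)
    return prompts, (contact if state == "done" else None)


# An empty string models silence; "no" sends the caller back to the start.
prompts, contact = run_dialog(["", "Alice", "no", "Bob", "yes"])
print(contact)
```

The system keeps the initiative throughout: every turn is a narrow question, so each state needs only a simple grammar for the expected answers.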
58
Problems with directed dialogs
59
User-initiative dialogs

Pros
Can result in a shorter call
Can feel more natural
Useful when too many choices
Cons
Requires expensive expertise
Could lead to user frustration: system appears
human but caller can't use full natural language

60
NLU Dialog Module

Drag-and-drop Dialog Flow Designer
Developer specifies
Destination branches
Example sentences per branch
Prompts (initial, mumble, no speech, etc.)
Module generates SLM and classifier
It handles confirmation, reprompt, etc.

61
Natural Language
62
Multimodal System Technology Components
[Diagram: component loop around shared Data and Rules (Speech -> ASR -> Words -> SLU -> Meaning -> DM -> Action -> SLG -> Words -> TTS -> Speech), with Pen Gesture and Visual added alongside Speech as modalities]
63
MIPad