Title: Introduction to Computer Speech Processing
1Introduction to Computer Speech Processing
Alex Acero, Research Area Manager, Microsoft Research
2Outline
- Grand challenges in Speech and Language
- Vision videos
- Products today
- Prototypes
- The role of speech
- Technology Introduction
4User Expectations for Speech
5The Turing Test
- Imitation Game
- Judge, man, and a woman
- All chat via Email.
- Man pretends to be a woman.
- Man lies, woman tries to help judge.
- Judge must identify man after 5 minutes.
- Turing Test
- Replace man or woman with a computer.
- Fool judge 30% of the time.
Thanks to Jim Gray for material
6What Turing Said
- I believe that in about fifty years' time it
will be possible, to programme computers, with a
storage capacity of about 10^9, to make them play
the imitation game so well that an average
interrogator will not have more than 70 per cent
chance of making the right identification after
five minutes of questioning. The original
question, "Can machines think?" I believe to be
too meaningless to deserve discussion.
Nevertheless I believe that at the end of the
century the use of words and general educated
opinion will have altered so much that one will
be able to speak of machines thinking without
expecting to be contradicted.
Alan M. Turing, 1950. "Computing Machinery and
Intelligence." Mind, Vol. LIX, 433-460
7Prediction 59 Years Later
- Turing's technology forecast was great!
- Gigabyte memory is common
- Computer beat world chess champion
- with some help from its programming staff!
- Computers help design most things today
8Prediction 59 Years Later
- Intelligence forecast was optimistic
- Several internet sites offer Turing Test
chatterbots.
- None pass (yet): http://www.loebner.net/Prizef/loebner-prize.html
- But I believe it will not be long
- less than 50 years, more than 10 years
- Turing test still stands as a long-term challenge
9Challenges Implicit in the Turing Test
- Read and understand as well as a human
- Think and write as well as a human
- Hear as well as a native speaker
- Speech Recognition (speech to text)
- Speak as well as a native speaker
- Speech Synthesis (text to speech)
- Remember what is heard and quickly return it on
request.
10Moore's law (1965)
- Gordon Moore: the number of transistors per chip
will double every 18 months, about 100x per decade
- Progress in next 18 months = ALL previous
progress
- New storage = sum of all old storage (ever)
- New processing = sum of all old processing.
15 years ago
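The two arithmetic claims on this slide can be checked with a short sketch. It assumes idealized, exact doubling every 18 months and an arbitrary starting capacity of 1 unit; the function name `growth_factor` is made up for illustration.

```python
def growth_factor(months, doubling_months=18.0):
    """Multiplicative growth after `months` if capacity doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)

# Growth over one decade (120 months) of 18-month doublings.
per_decade = growth_factor(120)

# Capacity added in each successive 18-month period, starting from 1 unit.
added = [2 ** k for k in range(10)]
latest = added[-1]                  # what the newest period contributes
previous_total = sum(added[:-1])    # everything contributed before it

print(f"growth per decade: about {per_decade:.0f}x")
print(f"latest period: {latest}, all earlier periods combined: {previous_total}")
```

Under these assumptions the newest period always contributes one unit more than all earlier periods combined, which is the sense in which "new storage = sum of all old storage".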
11Making Chips Smaller
- Advances in lithography: the science of "drawing"
circuits on chips
- Impact of Moore's law
- Shorter distances => smaller processing time
- Smaller size => lower cost per transistor
- Amount of memory is increased
- But it is not a law of physics, merely a
self-fulfilling prophecy.
12Moore's law not applicable to Machine Intelligence
- Speech technology benefited from Moore's law in
the 1990s.
- In the 21st century, faster chips only mean that
recognition errors appear faster.
- New algorithmic advances needed to pass the
Turing Test
- Error rate halves approximately every 7 years
13Grand Challenges
"Within 10 years speech will be in every device.
Things like speech and ink are so natural, when
they get the right quality level they will be in
everything. As technical hurdles such as
background noise and context are overcome, major
adoption of speech technology will arrive. Soon,
dictating to PCs and giving commands to cell
phones will be basic modes of interacting with
technology."
Bill Gates, March 2004
14Outline
- Grand challenges in Speech and Language
- Vision videos
- Products today
- Prototypes
- The role of speech
- Technology Introduction
15Speech in Mobile devices
16Speech for Students
17Speech in cars
18Soccer Mom in car
19Insurance Agent driving
20Outline
- Grand challenges in Speech and Language
- Vision videos
- Products today
- Prototypes
- The role of speech
- Technology Introduction
21Japanese dictation
22Telephony Response point
23Directory Assistance
- Automatic generation of robust grammars
- Users say "Calabria" or "Calabria restaurant"
- Nearby cities
- "Is Calabria restaurant in Redmond or Kirkland?"
- Some people say the address too
- "Pizza Hut on 3rd Avenue in New York, New York"
- Automatic normalization
- Acronyms, compound words, homonyms, misspelled
words
24Multimodal voice search
25Click-Driven Automated Feedback
26Outline
- Grand challenges in Speech and Language
- Vision videos
- Products today
- Prototypes
- The role of speech
- Technology Introduction
27CommuteUX
28Speech in Education
29VerbalMath
30Virtual Receptionist
31Video Search (Frank Seide, MSRA)
32Browsing a Video (Milind Mahajan and Patrick Nguyen)
33Podcast authoring (Patrick Nguyen)
34Outline
- Grand challenges in Speech and Language
- Vision videos
- Products today
- Prototypes
- The role of speech
- Technology Introduction
35Role of Speech in Different Devices
[Chart: devices placed on two axes, each running from Low to High: ease of GUI (screen/pointer) and ease of text input (keyboard/pen). Devices shown: PC, Tablet PC, Internet TV, PDA, Screen Phone, Phone, Car.]
36A Roadmap for Speech
[Chart: Dictation, Multimodal Command/Control, and Speech-Only Telephony placed on two axes, each running from Low to High: ease of GUI (screen/pointer) and ease of text input (keyboard/pen).]
37Speech Technology
38Outline
- Grand challenges in Speech and Language
- Vision videos
- Products today
- Prototypes
- The role of speech
- Technology Introduction
39 Voice-enabled System Technology Components
[Figure: the voice-enabled system loop. Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, around shared Data and Rules.]
41Basic Formulation
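- Basic equation of speech recognition is
  \hat{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W) \, P(W)
- X = X_1, X_2, \ldots, X_n is the acoustic observation
- W = w_1, w_2, \ldots, w_m is the word sequence
- P(X \mid W) is the acoustic model
- P(W) is the language model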
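As a rough illustration of how the acoustic model and the language model combine, the sketch below picks between two candidate transcriptions of the same audio. The probabilities are made-up numbers chosen only to show the effect, and `decode` is a hypothetical helper, not part of any recognizer.

```python
import math

# Hypothetical scores for two competing word sequences W given the same audio X.
# "acoustic" is log P(X | W); "language" is log P(W). Values are illustrative only.
HYPOTHESES = {
    "recognize speech":   {"acoustic": math.log(0.010), "language": math.log(0.0020)},
    "wreck a nice beach": {"acoustic": math.log(0.012), "language": math.log(0.0001)},
}

def decode(hypotheses):
    """Return the W that maximizes log P(X | W) + log P(W)."""
    return max(
        hypotheses,
        key=lambda w: hypotheses[w]["acoustic"] + hypotheses[w]["language"],
    )

best = decode(HYPOTHESES)
acoustic_only = max(HYPOTHESES, key=lambda w: HYPOTHESES[w]["acoustic"])
print("acoustic model alone prefers:", acoustic_only)
print("acoustic + language model prefers:", best)
```

With these numbers the acoustic score alone slightly favors the wrong transcription, and the language model prior tips the decision the other way.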
42Speech Recognition
[Figure: ASR block diagram. Input Speech → Feature Extraction → Pattern Classification (Decoding, Search) → Confidence Scoring → "Hello World" with confidences (0.9) (0.8). The Acoustic Model, Word Lexicon, and Language Model feed Pattern Classification.]
43Feature Extraction
Goal: Extract robust features (information) from
the speech that are relevant for ASR.
Method: Spectral analysis through either
a bank of filters or Linear Predictive
Coding, followed by non-linearity and
normalization.
Result: Signal compression; for each window of
speech samples, 30 or so features are
extracted (64,000 b/s -> 5,200 b/s).
Challenges: Robustness to environment
(office, airport, car), devices (speakerphones,
cellphones), speakers (accents, dialect, style,
speaking defects), noise and echo.
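The bank-of-filters route (spectral analysis, then a non-linearity, then normalization) can be sketched in a few lines of pure Python. This is a simplified illustration: it assumes equal-width frequency bands rather than a perceptual scale, a naive DFT instead of an FFT, and made-up frame sizes; the function names are hypothetical.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split the signal into overlapping windows of frame_len samples."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

def power_spectrum(frame):
    """Hamming-window the frame and return |DFT|^2 for bins 0..n/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def log_filterbank(samples, frame_len=64, hop=32, num_bands=8):
    """One vector of log band energies per frame (filter bank + log non-linearity)."""
    features = []
    for frame in frame_signal(samples, frame_len, hop):
        spec = power_spectrum(frame)
        width = len(spec) // num_bands   # equal-width bands; leftover top bin is ignored
        energies = [sum(spec[b * width:(b + 1) * width]) for b in range(num_bands)]
        features.append([math.log(e + 1e-10) for e in energies])
    return features

def mean_normalize(features):
    """Subtract each band's mean over time, a simple channel normalization."""
    bands = len(features[0])
    means = [sum(f[b] for f in features) / len(features) for b in range(bands)]
    return [[f[b] - means[b] for b in range(bands)] for f in features]

# A 1 kHz tone sampled at 8 kHz: its energy should land in a single band.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
feats = log_filterbank(tone)
normalized = mean_normalize(feats)
```

Each 64-sample window is reduced to 8 numbers here, which is the same kind of compression the slide describes, only on a much smaller scale.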
44Acoustic Modeling
- Goal
- Model probability of acoustic features
for each phone model, i.e. p(X | /ae/)
- Method
- Hidden Markov Models (HMM) trained through
maximum likelihood (EM) or discriminative methods
- Challenges/variability
- Background noise: cocktail party effect
- Dialect/accent
- Speaker
- Phonetic context: Italy vs. Italian
- No spaces in speech: "recognize speech" vs.
"wreck a nice beach"
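To make a quantity like p(X | /ae/) concrete, here is a minimal sketch of the forward algorithm on a toy discrete HMM. Real acoustic models use continuous feature vectors and trained parameters; the two-state model, the two-symbol alphabet, and every number below are assumptions for illustration.

```python
# Toy 2-state left-to-right HMM standing in for one phone model.
START = [1.0, 0.0]            # always begin in state 0
TRANS = [[0.6, 0.4],          # state 0: stay, or advance to state 1
         [0.0, 1.0]]          # state 1: stay
EMIT = [[0.8, 0.2],           # P(symbol | state 0)
        [0.3, 0.7]]           # P(symbol | state 1)

def forward_likelihood(obs, start, trans, emit):
    """P(obs | model), summed over all state paths (forward algorithm)."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)

likelihood = forward_likelihood([0, 0, 1], START, TRANS, EMIT)
print(likelihood)
```

A recognizer would compute such a score for each competing phone model and compare them; training (EM or discriminative) is what sets the numbers that are hard-coded here.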
45Word Lexicon
- Goal
- Map legal phone sequences into words
according to phonotactic rules
- David: /d/ /ey/ /v/ /ih/ /d/
- Multiple Pronunciations
- Several words may have multiple pronunciations
- Data: /d/ /ae/ /t/ /ax/
- Data: /d/ /ey/ /t/ /ax/
- Challenges
- How do you generate a word lexicon automatically?
- Letter-to-sound (LTS) rules can be automatically
trained with decision trees (CART) with less than
8% errors, but proper nouns are hard!
- How do you add new variant dialects and word
pronunciations?
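A lexicon with multiple pronunciations per word can be held in a plain dictionary. The sketch below uses only the two entries from this slide; the reverse index and the helper names are illustrative assumptions, not a description of any particular recognizer.

```python
from collections import defaultdict

# Word -> list of pronunciations, each a tuple of phones (entries from the slide).
LEXICON = {
    "david": [("d", "ey", "v", "ih", "d")],
    "data": [("d", "ae", "t", "ax"), ("d", "ey", "t", "ax")],
}

def build_reverse_index(lexicon):
    """Map each phone sequence to the set of words it can spell."""
    index = defaultdict(set)
    for word, pronunciations in lexicon.items():
        for phones in pronunciations:
            index[phones].add(word)
    return index

PHONES_TO_WORDS = build_reverse_index(LEXICON)

def words_for(phones):
    """Return the words matching a phone sequence, or an empty list."""
    return sorted(PHONES_TO_WORDS.get(tuple(phones), set()))

print(words_for(["d", "ey", "t", "ax"]))
print(words_for(["d", "ae", "t", "ax"]))
```

Adding a new dialect variant is then just another tuple appended to a word's list; generating those tuples automatically is the hard part the slide points to.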
46Pattern Classification
- Goal
- Find optimal word sequence
- Combine information (probabilities) from
- Acoustic model
- Word lexicon
- Language model
- Method
- Decoder searches through all possible recognition
choices using a Viterbi decoding algorithm
- Challenge
- Search through a large network space is
computationally expensive for large-vocabulary
ASR; techniques for efficient search include
beam search and weighted finite-state
transducers (WFST)
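A minimal sketch of Viterbi decoding on a toy two-state HMM follows. It keeps every state alive at every step, so it shows the exact search only; the model parameters are made-up numbers and the states are abstract, not real phones or words.

```python
# Toy HMM: 2 hidden states, 3 observation symbols. All numbers are illustrative.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3],
         [0.4, 0.6]]
EMIT = [[0.5, 0.4, 0.1],
        [0.1, 0.3, 0.6]]

def viterbi(obs, start, trans, emit):
    """Return (probability, state path) of the single most likely path."""
    n = len(start)
    score = [start[s] * emit[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in range(n):
            # Best predecessor for state s at this time step.
            best_prev = max(range(n), key=lambda p: score[p] * trans[p][s])
            new_score.append(score[best_prev] * trans[best_prev][s] * emit[s][o])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return best, path

best_prob, best_path = viterbi([0, 1, 2], START, TRANS, EMIT)
print(best_prob, best_path)
```

Beam search would differ by discarding low-scoring entries of `score` at each step, trading exactness for speed on large networks.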
47Confidence Scoring
Goal: Identify possible recognition errors and
out-of-vocabulary events. Potentially improves
the performance of ASR, SLU and DM.
Method: A confidence score based on a hypothesis
likelihood ratio test is associated with each
recognized word.
Label: credit please
Recognized: credit fees
Confidence: (0.9) (0.3)
Command-and-control: false rejection and false
acceptance => ROC curves
Challenges: Rejection of extraneous
acoustic events (noise, background speech, door
slams) without rejection of valid user input
speech.
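The trade-off between false acceptance and false rejection can be sketched by sweeping a threshold over confidence scores. The scored words below are made-up data, and `error_rates` is a hypothetical helper; the likelihood ratio test that would produce the scores is not modeled.

```python
# (confidence score, was the recognized word actually correct?) Illustrative data.
SCORED_WORDS = [
    (0.9, True), (0.8, True), (0.6, False),
    (0.5, True), (0.3, False), (0.2, False),
]

def error_rates(scored, threshold):
    """Accept words scoring >= threshold.

    Returns (false_accept_rate, false_reject_rate): the fraction of wrong
    words accepted and the fraction of correct words rejected.
    """
    correct = [score for score, ok in scored if ok]
    wrong = [score for score, ok in scored if not ok]
    false_accepts = sum(1 for score in wrong if score >= threshold)
    false_rejects = sum(1 for score in correct if score < threshold)
    return false_accepts / len(wrong), false_rejects / len(correct)

# Each threshold gives one operating point; together they trace an ROC-style curve.
for threshold in (0.25, 0.55, 0.85):
    fa, fr = error_rates(SCORED_WORDS, threshold)
    print(f"threshold={threshold:.2f} false_accept={fa:.2f} false_reject={fr:.2f}")
```

Raising the threshold rejects more recognition errors but also more valid input, which is the challenge named on the slide.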
48 Voice-enabled System Technology Components
[Figure: the voice-enabled system loop. Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, around shared Data and Rules.]
49Text-to-Speech Systems
TTS Engine
- Input: raw text or tagged text
- Text Analysis: document structure detection, text
normalization, linguistic analysis (output: tagged text)
- Phonetic Analysis: homograph disambiguation,
grapheme-to-phoneme conversion (output: tagged phones)
- Prosodic Analysis: pitch and duration attachment
(output: controls)
- Speech Synthesis: voice rendering (output: speech
audio)
50Multimedia Customer Care (Courtesy of AT&T)
51 Voice-enabled System Technology Components
[Figure: the voice-enabled system loop. Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, around shared Data and Rules.]
52Language Understanding
- Application Schema (XML for semantic entities)
defines the application status
- A Semantic Context Free Grammar (CFG) parses an
English sentence and fills in slots of the
application schema.
53Application Schema
<itinerary>
  <origin>
    <city></city>
    <state></state>
  </origin>
  <destination>
    <city></city>
    <state></state>
  </destination>
  <date></date>
</itinerary>
54Semantic CFG
<rule name="itinerary">
  Show me flights from <ruleref name="origin"/>
  to <ruleref name="destination"/>
</rule>
<rule name="origin">
  <ruleref name="city"/>
</rule>
<rule name="destination">
  <ruleref name="city"/>
</rule>
<rule name="city">
  Seattle | San Francisco | New York
</rule>
55An example sentence
- "Show me flights from Seattle to New York"
would populate the application schema as
<itinerary>
  <origin>
    <city>Seattle</city>
    <state></state>
  </origin>
  <destination>
    <city>New York</city>
    <state></state>
  </destination>
  <date></date>
</itinerary>
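The slot filling in this example can be sketched as follows. Because this toy grammar is small and non-recursive, a regular expression stands in for a real CFG parser; that substitution, the city list as a Python constant, and the function name `parse_itinerary` are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The "city" rule: a fixed list of alternatives.
CITIES = ["Seattle", "San Francisco", "New York"]
CITY = "|".join(re.escape(city) for city in CITIES)

# The "itinerary" rule, with origin and destination both expanding to a city.
ITINERARY = re.compile(
    rf"^Show me flights from (?P<origin>{CITY}) to (?P<destination>{CITY})$"
)

def parse_itinerary(sentence):
    """Fill the itinerary schema from a sentence, or return None if it does not parse."""
    match = ITINERARY.match(sentence.strip())
    if match is None:
        return None
    itinerary = ET.Element("itinerary")
    for slot in ("origin", "destination"):
        node = ET.SubElement(itinerary, slot)
        ET.SubElement(node, "city").text = match.group(slot)
        ET.SubElement(node, "state")      # left empty: the sentence gives no state
    ET.SubElement(itinerary, "date")      # left empty: the sentence gives no date
    return itinerary

filled = parse_itinerary("Show me flights from Seattle to New York")
print(ET.tostring(filled, encoding="unicode"))
```

Slots the sentence does not mention stay empty, which is what lets a dialog manager see what it still needs to ask for.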
56 Voice-enabled System Technology Components
[Figure: the voice-enabled system loop. Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, around shared Data and Rules.]
57Who manages the Dialog?
- Directed Dialog
- "Who would you like to contact?"
- Finite State Machine
- Simple CFG
- MSConnect
- User Initiative Dialog
- "What can I do for you?"
- N-grams
- Windows Airlines: Reservations, Flight Status,
Baggage Claim, Special Announcements
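A directed dialog driven by a finite state machine can be sketched as a table of states, prompts, and allowed answers. Only the first prompt comes from the slide; the state names, the second prompt, and the accepted answers are hypothetical.

```python
# Each state has a system prompt and the answers it accepts (hypothetical values).
DIALOG = {
    "ask_contact": {
        "prompt": "Who would you like to contact?",
        "next": {"alice": "confirm", "bob": "confirm"},
    },
    "confirm": {
        "prompt": "Shall I place the call?",
        "next": {"yes": "done", "no": "ask_contact"},
    },
}

def run_dialog(user_turns, start="ask_contact"):
    """Feed user answers through the state machine.

    Returns (final_state, prompts_spoken). The system keeps the initiative:
    it prompts first and only accepts answers listed for the current state.
    """
    state, prompts = start, []
    for turn in user_turns:
        if state == "done":
            break
        prompts.append(DIALOG[state]["prompt"])
        # An unrecognized answer leaves the state unchanged, so the prompt repeats.
        state = DIALOG[state]["next"].get(turn.lower(), state)
    return state, prompts

final_state, spoken = run_dialog(["Alice", "yes"])
print(final_state, spoken)
```

The small set of answers per state is what allows a simple grammar at each step, at the cost of a longer and more rigid call.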
58Problems with directed dialogs
59User-initiative dialogs
- Pros
- Can result in a shorter call
- Can feel more natural
- Useful when too many choices
- Cons
- Requires expensive expertise
- Could lead to user frustration: the system appears
human, but the caller can't use full natural language
60NLU Dialog Module
- Drag-and-drop Dialog Flow Designer
- Developer specifies
- Destination branches
- Example sentences per branch
- Prompts (initial, mumble, no speech, etc)
- Module generates a statistical language model (SLM) and a classifier
- It handles confirmation, reprompt, etc.
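To illustrate how example sentences per branch can yield a classifier, here is a minimal sketch using a bag-of-words naive Bayes model with add-one smoothing. The slide does not say which classifier the module builds, so this choice is an assumption, as are the branch names and example sentences.

```python
import math
from collections import Counter

# Hypothetical destination branches with developer-supplied example sentences.
EXAMPLES = {
    "reservations": ["i want to book a flight", "make a reservation to seattle"],
    "flight_status": ["is my flight on time", "what is the status of flight ten"],
    "baggage_claim": ["i lost my bag", "where is my luggage"],
}

def train(examples):
    """Count words per branch and collect the shared vocabulary."""
    counts = {
        branch: Counter(word for sentence in sentences for word in sentence.split())
        for branch, sentences in examples.items()
    }
    vocab = {word for counter in counts.values() for word in counter}
    return counts, vocab

def classify(sentence, counts, vocab):
    """Pick the branch with the highest smoothed log-likelihood (uniform priors)."""
    words = [w for w in sentence.lower().split() if w in vocab]

    def log_score(branch):
        counter = counts[branch]
        total = sum(counter.values()) + len(vocab)   # add-one smoothing
        return sum(math.log((counter[w] + 1) / total) for w in words)

    return max(counts, key=log_score)

COUNTS, VOCAB = train(EXAMPLES)
print(classify("what is the status of my flight", COUNTS, VOCAB))
```

Words outside the training vocabulary are ignored here; a production module would also need the confirmation and reprompt handling listed on the slide.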
61Natural Language
62Multimodal System Technology Components
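[Figure: multimodal system loop. Inputs: speech (via ASR) and pen gesture; outputs: speech (via TTS) and visual. Speech → ASR (Automatic Speech Recognition) → Words → SLU (Spoken Language Understanding) → Meaning → DM (Dialog Management) → Action → SLG (Spoken Language Generation) → Words → TTS (Text-to-Speech Synthesis) → Speech, around shared Data and Rules.]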
63MiPad
- Multimodal Interactive Pad (MiPad)
- Tap and Talk combines speech and pen
- Use context to simplify recognition
- Dictation allows complex command entry
- Usability studies show double throughput for
English
- Speech is mostly useful in cases with lots of
alternatives
64Speech-centric Multimodal
65Multimodality Benefits
- Compared to speech-only
- User sees system response more quickly
- User sees what system understood
- User can know what system expects
- Compared to GUI-only
- Faster entry
- Better use of small screen
67But general language understanding is hard