Title: Speech recognition, understanding and conversational interfaces
1 Speech recognition, understanding and conversational interfaces
- Alexander Rudnicky
- School of Computer Science
- http://www.cs.cmu.edu/air
- Speech
- Types of speech interfaces
- Speech systems and their structure
- Designing speech interfaces
- Some applications
- SpeechWear
- Communicator
3 Speech as a signal
- The difference between speech and sound
- CD quality vs. intelligible quality
- high-quality audio is sampled at 44.1 or 48 kHz
- desirable speech bandwidth: 0-8 kHz, 16 bits/sample
- at 16 kHz sampling and 16 bits/sample: 256 kbps (tethered mic)
- telephone: 64 kbps (and lower)
- Compression
- MPEG: 64 kbps/channel and up (but not speech-optimal)
- CELP: 16 kbps down to 2.4 kbps (optimized for speech)
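The bit rates quoted on this slide follow from sample rate times bits per sample times channels. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the telephone and CD parameters (8 kHz at 8 bits, and 44.1 kHz at 16 bits stereo) are the conventional values, assumed here rather than stated on the slide.

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kbit/s (1 kbit = 1000 bits)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# A 0-8 kHz speech bandwidth needs a 16 kHz sampling rate (Nyquist).
wideband = pcm_bit_rate_kbps(16_000, 16)

# Telephone speech, assumed to be 8 kHz sampling at 8 bits per sample.
telephone = pcm_bit_rate_kbps(8_000, 8)

# CD audio, assumed to be 44.1 kHz, 16 bits, two channels.
cd = pcm_bit_rate_kbps(44_100, 16, channels=2)

print(wideband, telephone, cd)
```

Under these assumptions the wideband and telephone figures match the 256 kbps and 64 kbps on the slide, which is why speech-specific coders such as CELP are attractive at 16 kbps and below.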
4 Speech for communication
- The difference between speech and language
- Speech recognition and speech understanding
5 Computers and speech
- Transcription
- dictation, information retrieval
- Command and control
- data entry, device control, navigation
- Information access
- airline schedules, stock quotes
- Problem solving
- travel planning, logistics
6 Speech system architecture
7 Varieties of speech systems
8 A generic speech system
9 Decoding speech
Acoustic models
Language models
Corpus-based statistical models
10 Creating models for recognition
Speech data
Acoustic models
Text data
Language models
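As an illustration of how text data becomes a corpus-based statistical language model, here is a minimal bigram sketch with add-one smoothing. The toy corpus, the function names, and the choice of add-one smoothing are all assumptions made for the example; the slides do not say which estimation method the recognizer uses.

```python
from collections import Counter


def train_bigram(sentences):
    """Count word histories and bigrams over a list of sentences."""
    histories, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        padded = ["<s>"] + words + ["</s>"]
        histories.update(padded[:-1])            # every word that has a successor
        bigrams.update(zip(padded, padded[1:]))  # (previous word, word) pairs
    return histories, bigrams, vocab


def bigram_prob(prev, word, histories, bigrams, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


# Hypothetical text data in the style of an air travel domain.
corpus = [
    "show flights to boston",
    "show flights to denver",
    "list fares to boston",
]
histories, bigrams, vocab = train_bigram(corpus)

p_seen = bigram_prob("to", "boston", histories, bigrams, vocab)
p_unseen = bigram_prob("to", "fares", histories, bigrams, vocab)
print(p_seen, p_unseen)
```

The seen word pair gets a higher probability than the unseen one, which is the information the decoder uses to prefer plausible word sequences.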
11 Understanding speech
Ontology design, language acquisition
- Extract semantic content from utterance
Post parser
- Introduce context and world knowledge into
Domain Agents
Grounding, knowledge engineering
12 Interacting with the user
Task schemas
Task analysis
Dialog manager
- Guide interaction through task
- Map user inputs and system state into actions
Domain agent
- Interact with back-end(s)
- Interpret information using domain knowledge
Live data (e.g. Web)
Domain expert
Knowledge engineering
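The dialog manager's role of mapping user inputs and system state into actions, with a domain agent handling the back-end, can be sketched as below. Everything here is hypothetical: the task schema, the slot names, and the stand-in `flight_agent` are illustrative assumptions, not the Communicator implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical task schema: the items the task needs, in a default order.
TASK_SCHEMA = ["origin", "destination", "date"]


@dataclass
class DialogState:
    slots: dict = field(default_factory=dict)
    results: Optional[list] = None


def flight_agent(slots):
    """Stand-in domain agent; a real one would query a back-end."""
    return [f"{slots['origin']} to {slots['destination']} on {slots['date']}"]


def next_action(state, user_input):
    """Merge parsed user input into the state, then choose the next action."""
    if user_input:
        state.slots.update(user_input)
        state.results = None  # earlier results may no longer be valid
    for slot in TASK_SCHEMA:
        if slot not in state.slots:
            return ("ask", slot)
    if state.results is None:
        state.results = flight_agent(state.slots)
    return ("present", state.results)


state = DialogState()
print(next_action(state, {"destination": "Boston"}))
print(next_action(state, {"origin": "Pittsburgh"}))
print(next_action(state, {"date": "Friday"}))
```

The user may supply items in any order; the manager asks only for what is still missing and calls the domain agent once the schema is complete.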
13 Communicating with the user
Language Generator
- Decide what to say to user (and how to phrase it)
Speech synthesizer
Display Generator
Action Generator
14 Speech recognition and understanding
- Sphinx system
- speaker-independent
- continuous speech
- large vocabulary
- ATIS system
- air travel information retrieval
- context management
- film clip
15 Command and control systems
- Small vocabularies, fixed syntax
- OPEN WINDOW <window_id>
- MOVE OBJECT <object_id> to <position>
- Applications
- data entry (e.g., zip codes), process control (e.g., electron microscope, darkroom equipment)
- Large vocabulary, fixed syntax
- Web browsing (?)
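A fixed-syntax command language like the OPEN WINDOW and MOVE OBJECT examples can be matched with a handful of patterns once the recognizer has produced text. The following is a minimal sketch; the command names, the use of regular expressions, and the assumption that identifiers are single words are illustrative choices, not taken from the slides.

```python
import re

# One pattern per command; named groups play the role of the <...> slots.
COMMANDS = {
    "open_window": re.compile(r"^OPEN WINDOW (?P<window_id>\w+)$", re.IGNORECASE),
    "move_object": re.compile(
        r"^MOVE OBJECT (?P<object_id>\w+) TO (?P<position>\w+)$", re.IGNORECASE
    ),
}


def parse_command(utterance):
    """Return (command name, slot values), or None if nothing matches."""
    text = " ".join(utterance.split())  # normalize whitespace
    for name, pattern in COMMANDS.items():
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    return None


print(parse_command("open window w2"))
print(parse_command("MOVE OBJECT box3 to corner"))
print(parse_command("please open something"))
```

Anything outside the fixed syntax is rejected outright, which is what keeps such systems simple and also what limits them.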
- Vehicle inspection task
- USMC mechanics, fixed inspection form
- Wearable computer (COTS components)
- html-based task representation
- film clip
17 Information access
- Moderate to very large vocabulary
- IVR and frame based systems
- Commercial systems
- Nuance http://www.nuance.com/demo/index.html
- SpeechWorks http://www.speechworks.com/demos/demos.htm
- lots of others...
18 IVR and frame-based systems
- Interactive voice response (IVR)
- interactions specified by a graph (typically a tree)
- Frame systems
- ergodic graphs
- states defined by multi-item forms
19 Graph-based systems
Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ...
What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ...
. . . .
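The Bank ABC interaction can be represented as a tree in which each node holds a prompt and each recognized keyword selects a child. A minimal sketch, assuming exact keyword matches and re-prompting on anything unrecognized; the bracketed leaf prompts are hypothetical placeholders.

```python
def leaf(text):
    """A terminal node with no further choices."""
    return {"prompt": text, "children": {}}


IVR_TREE = {
    "prompt": "Welcome to Bank ABC! Please say one of the following: "
              "Balance, Hours, Loan",
    "children": {
        "balance": leaf("[balance information]"),
        "hours": leaf("[opening hours]"),
        "loan": {
            "prompt": "What type of loan are you interested in? "
                      "Please say one of the following: Mortgage, Car, Personal",
            "children": {
                "mortgage": leaf("[mortgage information]"),
                "car": leaf("[car loan information]"),
                "personal": leaf("[personal loan information]"),
            },
        },
    },
}


def run_ivr(tree, replies):
    """Walk the tree with a list of recognized replies; return prompts played."""
    node = tree
    prompts = [node["prompt"]]
    for reply in replies:
        child = node["children"].get(reply.strip().lower())
        if child is None:
            prompts.append(node["prompt"])  # unrecognized: repeat the prompt
            continue
        node = child
        prompts.append(node["prompt"])
    return prompts


for line in run_ivr(IVR_TREE, ["Loan", "Car"]):
    print(line)
```

The caller can only move along edges the designer drew, so the approach suits well-structured tasks and little else.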
20 Frame-based systems
- I would like to fly to Boston
- I'd like to go to Boston on Friday
- When would you like to fly?
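The exchange above can be read as filling a multi-item form: each utterance fills whichever slots it mentions, and the system prompts for the first slot still empty. A minimal sketch follows; the slot names, the keyword patterns, and the destination prompt are assumptions for illustration, and a real system would use a parser rather than regular expressions.

```python
import re

# Hypothetical keyword patterns for the two slots in the example.
SLOT_PATTERNS = {
    "destination": re.compile(r"\bto ([A-Z][a-z]+)"),
    "date": re.compile(
        r"\bon (Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
    ),
}

# Prompt to play when a slot is still empty, in asking order.
PROMPTS = {
    "destination": "Where would you like to fly to?",
    "date": "When would you like to fly?",
}


def fill_frame(frame, utterance):
    """Fill every slot whose pattern appears in the utterance."""
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            frame[slot] = match.group(1)
    return frame


def next_prompt(frame):
    """Return the prompt for the first empty slot, or None if the form is full."""
    for slot, prompt in PROMPTS.items():
        if slot not in frame:
            return prompt
    return None


partial = fill_frame({}, "I would like to fly to Boston")
print(partial, next_prompt(partial))

complete = fill_frame({}, "I'd like to go to Boston on Friday")
print(complete, next_prompt(complete))
```

The first utterance leaves the date empty and so triggers the question about when to fly; the second fills both slots at once and needs no follow-up.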
21 Frame-based systems
[Figure: several multi-item forms with blank slots, linked to one another; transition on keyword or phrase]
22 Some problems
- IVR systems work well, but only for well-structured (shallow) tasks
- Frame systems are good for tasks that correspond to a single form leading to an action
- Neither approach does well with more complex problem-solving activities
23 Dialog Systems
- Problem-solving activity: a complex task
- Order of progression through the task depends on user goals (which can change) and system state (e.g., a back-end retrieval), and is not predictable
- Track progress and help task along
- mixed-initiative dialog
- Discourse phenomena
- Users expect to converse with the system
24 Carnegie Mellon Communicator
- A dialog system that supports complex problem solving in a travel planning domain
- create an itinerary using air schedule, hotel and car information
- 186 U.S. airports (>140k enplanements/yr)
- currently >500 world airports
- Web-based data resources
- Live and cached flight information
- Airport, airline, etc. information
25 Value schema/handlers
Domain Agent
26 Compound schema
e.g. SQL query
Domain Agent
27 Schema ordering
Schema i
Value i
Schema j
Value j
Schema k
Value k
28 Carnegie Mellon Communicator
- CMU Communicator
- Call 268-5144
- the information is accurate; you can use it for your own travel planning...
29 User-aware speech interfaces
- Predictable behavior on the system's part
- Users communicate at different levels
- http://www.speech.cs.cmu.edu/air/papers/InterfaceC
30 User-aware speech interfaces
- Content: task-centric utterances
- Possibility: What can I do?
- Orientation: Where are we?
- Navigation: moving through the task space
- Control: verbose/terse, listen!
- Customization: define this word
31 Speech interface guidelines
- Speech recognition is errorful
- System state is often opaque to the user
- http://www.speech.cs.cmu.edu/air/papers/SpInGuidel
32 Interface guidelines
- State transparency
- Input control
- Error recovery
- Error detection
- Error correction
- Log performance
- Application integration
- Speech and language communication
- Dialog structure
- Interface design