Title: Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment
1. Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment
Sadaoki Furui
- Tokyo Institute of Technology
- Department of Computer Science
- 2-12-1, O-okayama, Meguro-ku, Tokyo 152-8552, Japan
- Tel/Fax: +81-3-5734-3480
- furui_at_cs.titech.ac.jp
- http://www.furui.cs.titech.ac.jp/
2. Outline
- Speech recognition applications
- Speech recognition in the ubiquitous/wearable computing environment
- Hands-free communication problems
- Speaker recognition technology
- Audio indexing
- Multimodal human-machine communication
3. Outline
4. Major speech recognition applications
- Conversational systems for accessing information services
- Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment
- Multimodal speech recognition technology
- Systems for transcribing, understanding and summarizing ubiquitous speech documents such as broadcast news, meetings, lectures, presentations and voicemails
5. Two-pass search structure used in the Japanese broadcast-news transcription system (figure)
6. Model of human-computer interaction (figure)
7. Typical structure for task-specific voice control and dialogue systems (figure)
8. AT&T Communicator architecture (figure)
9. Voice portal environment (figure)
10. Voice portal components (figure)
11. What is VoiceXML?
- VoiceXML is to the voice Web what HTML is to the visual Web
- A collection of XML-based markup languages for implementing speech applications
12. VoiceXML system architecture (diagram): a Web server holds HTML scripts, VoiceXML scripts, grammars, multimedia and audio files; a Web browser renders the HTML, while a voice browser in the gateway captures voice, performs ASR, accepts DTMF input, replays audio and synthesizes speech (TTS), driven by VoiceXML; a database server (MS SQL Server) holds the DB and metadata for information retrieval.
13. Why VoiceXML?
- Simplifies application development
- Minimizes Internet traffic
- Separates user interaction code from application logic
- Provides portability
- Supports rapid prototyping and iterative refinement
14. Forms - Typical sequential dialog

Dialog:
  System: Welcome to the electronic payment system.
  System: Whom do you want to pay?
  User:   Ajax Drugstore
  System: How much do you want to pay?
  User:   36.95
  System: Do you want to pay 36.95 to Ajax Drugstore?
  User:   Yes

VoiceXML form:
  <form>
    <prompt>Welcome to the electronic payment system.</prompt>
    <field name="recipient">
      <prompt>Whom do you want to pay?</prompt>
      <grammar src="www.valid_recipients.vxml"/>
    </field>
    <field name="amount">
      <prompt>How much do you want to pay?</prompt>
      <grammar src="www.valid_money.vxml"/>
    </field>
    <field name="validation" type="binary">
      <prompt>Do you want to pay <value expr="amount"/>
        to <value expr="recipient"/>?</prompt>
    </field>
  </form>
15. W3C voice interface framework (VoiceXML 1.0, diagram): a dialog manager connects the World Wide Web to the user over the telephone system, with input components (ASR, DTMF tone recognizer, language understanding, context interpretation) and output components (media planning, language generation, TTS, prerecorded audio player).
16. Taxonomy of system-level evaluation techniques (figure)
17. Outline
18. The Major Trends in Computing (figure; http://www.ubiq.com/hypertext/weiser/NomadicInteractive/Sld003.htm)
19. MIT wearable computing people (http://www.media.mit.edu/wearables/)
20. Features provided by Ubicomp vs. Wearables (figure; http://rhodes.www.media.mit.edu/people/rhodes/papers/wearhive.html)
21. Speech recognition in the ubiquitous/wearable computing environment (figure: ubiquitous computing environment)
22. Distributed speech recognition (DSR) (figure)
23. Meeting synopsizing system using collaborative speech recognizers (figure)
24. Outline
25. Hands-free communications (figure)
26. Hands-free communication problems
- Noise problem
  - Additive background noise of various kinds, including unwanted speech from multiple sources
- Colorization problem
  - Change in the amplitude spectrum caused by the (short) impulse response of the room and/or the transducer
- Reverberation problem
  - Reverberation caused by the long impulse response of the room, and its interaction with the quasi-stationarity of speech
- Duplex system mode problem
  - Echo cancellation for full-duplex communication (e.g. barge-in)
  - Speech activity detection, verification, attention and spatialization for real-time human-machine interaction
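The additive-noise problem is often attacked with spectral subtraction, a standard enhancement technique (not one the slides prescribe): an estimate of the noise magnitude spectrum is subtracted from each observed frame, with a floor to keep magnitudes positive. A minimal sketch with hypothetical spectra:

```python
import numpy as np

def spectral_subtract(noisy, noise_est, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame,
    flooring the result to avoid negative magnitudes (musical noise)."""
    clean = noisy - noise_est
    return np.maximum(clean, floor * noisy)

# Toy example: one magnitude-spectrum frame with stationary additive noise.
noise = np.full(4, 0.5)                  # estimated noise magnitude per bin
speech = np.array([2.0, 0.1, 1.5, 0.0])  # underlying speech magnitudes
observed = speech + noise
enhanced = spectral_subtract(observed, noise)
```

The floor is what keeps the method usable in practice; without it, over-subtraction in low-energy bins produces negative "magnitudes".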
27. Outline
28. Information present in a speech signal (figure)
29. Speaker recognition (figure: voice key)
30. Principal structure of speaker recognition systems (figure)
31. Basic structure of speaker recognition systems: (a) speaker identification (figure)
32. Basic structure of speaker recognition systems: (b) speaker verification (figure)
33. Example of typical intraspeaker and interspeaker distance distributions (figure)
34. Four conditional probabilities in speaker verification (figure)
35. Relationship between error rate and decision criterion (threshold) in speaker verification (figure)
36. Receiver operating characteristic (ROC) curves: performance examples of three speaker verification systems A, B, and D (figure)
37. Recognition error rates as a function of population size in speaker identification and verification (figure)
38. Text-dependent vs. text-independent methods

Text-dependent methods are usually based on template-matching techniques, so the structure of these systems is rather simple. Since such methods can directly exploit the voice individuality associated with each phoneme or syllable, they generally achieve higher recognition performance than text-independent methods. Text-independent methods, however, can be used in applications in which predetermined key words cannot be used. Another advantage is that verification can be performed sequentially, continuing until a desired significance level is reached, without the annoyance of having to repeat the key words again and again.
39. Basic structure of DTW/HMM-based text-dependent speaker recognition methods (figure)
40. Block diagram of the principal operation of a speaker recognition method using the time series of cepstral coefficients and their orthogonal polynomial coefficients (figure)
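The orthogonal polynomial coefficients of a cepstral time series are the standard delta (regression) coefficients, fitted over a short window around each frame. A minimal sketch, assuming the cepstra arrive as a NumPy array of frames (data hypothetical):

```python
import numpy as np

def delta(cep, K=2):
    """First-order regression (orthogonal-polynomial) coefficients over a
    2K+1-frame window -- the standard delta-cepstrum computation."""
    T = len(cep)
    padded = np.vstack([cep[:1]] * K + [cep] + [cep[-1:]] * K)  # edge padding
    num = sum(k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
              for k in range(1, K + 1))
    return num / (2 * sum(k * k for k in range(1, K + 1)))

# Toy trajectory: one cepstral coefficient rising by 1.0 per frame.
cep = np.arange(8.0).reshape(-1, 1)
d = delta(cep)   # interior frames recover the local slope of 1.0
```

The delta trajectory captures spectral dynamics, which carry speaker individuality that the static cepstra alone miss.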
41. Basic structures of text-independent speaker recognition methods (figure)
42. Variation of the long-time averaged spectrum over four sessions spanning eight months, and the corresponding spectral envelopes derived from cepstrum coefficients weighted by the square root of the inverse variances (figure)
43. Vector quantization (VQ)-based text-independent speaker recognition (figure)
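The VQ-based approach trains a codebook per enrolled speaker and scores a test utterance by its average quantization distortion against each codebook, choosing the speaker whose codebook fits best. A toy sketch with a hand-rolled k-means (all data and names hypothetical):

```python
import numpy as np

def train_codebook(vectors, size=2, iters=20, seed=0):
    """Toy k-means clustering: one codebook per enrolled speaker."""
    rng = np.random.default_rng(seed)
    code = vectors[rng.choice(len(vectors), size, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((vectors[:, None] - code) ** 2).sum(-1), axis=1)
        for j in range(size):
            if np.any(labels == j):
                code[j] = vectors[labels == j].mean(axis=0)
    return code

def distortion(vectors, code):
    """Average squared distance of test vectors to their nearest codeword."""
    return np.min(((vectors[:, None] - code) ** 2).sum(-1), axis=1).mean()

# Hypothetical 2-D "cepstral" training data for two speakers.
spk_a = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
spk_b = np.array([[4.0, 4.0], [4.1, 4.0], [5.0, 5.0], [5.1, 5.0]])
books = {"A": train_codebook(spk_a), "B": train_codebook(spk_b)}

test = np.array([[0.05, 0.0], [1.05, 1.0]])  # utterance from speaker A
best = min(books, key=lambda s: distortion(test, books[s]))
```

No temporal modeling is involved, which is exactly why the method is text-independent: only the long-term distribution of spectral vectors matters.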
44. A five-state ergodic HMM for text-independent speaker verification (figure)
45. Basic structures of text-independent speaker recognition methods (cont.) (figure)
46. Text-prompted speaker recognition method

This method uses speaker-specific phoneme models as its basic acoustic units. The recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that the registered speaker has uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance what sentence they will be asked to repeat, so a pre-recorded voice can easily be rejected. One of the major issues with this method is how to properly create the speaker-specific phoneme models from the limited number of training utterances available for each speaker.
47. Block diagram of the text-prompted speaker recognition method (figure)
48. Sound spectrograms for word utterances by several speakers (figure)
- Speaker S, /kogeN/
- Same speaker, /kogeN/ (2 years later)
- Speaker S, /baNgo/
- Speaker M, /kogeN/
- Speaker F, /kogeN/
- Speaker U, /kogeN/
49. Intersession variability (variability over time)
- Speakers
- Recording and transmission conditions
- Noise
- Normalization
  - Parameter domain
  - Distance/similarity domain
50. Parameter-domain normalization
- Cepstral mean normalization (subtraction) (CMN, CMS)
  - Linear channel effects
  - Long-term spectral variation
- Delta-cepstral coefficients
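CMN subtracts the utterance-level cepstral mean from every frame; because a fixed linear channel adds a constant offset in the cepstral domain, the subtraction cancels it. A minimal sketch with hypothetical data:

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the utterance-level mean from
    each frame, removing stationary linear-channel effects."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A fixed channel adds a constant offset to every cepstral frame;
# after CMN the two channel conditions become identical.
frames = np.array([[1.0, -0.5], [2.0, 0.5], [3.0, 0.0]])
channel_offset = np.array([0.7, -0.3])
assert np.allclose(cmn(frames), cmn(frames + channel_offset))
```

This is why CMN helps with intersession variability: telephone lines and microphones that differ between enrollment and test largely disappear from the features.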
51. Distance/similarity-domain normalization (figure)
52. Cohort speakers (figure)
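Distance/similarity-domain normalization with a cohort works as a likelihood ratio: the claimed speaker's log-likelihood is normalized by that of a cohort of similar speakers, so score shifts common to all models (e.g. degraded recording conditions) cancel. A toy sketch with hypothetical log-likelihoods:

```python
import numpy as np

def normalized_score(claimed_ll, cohort_lls):
    """Likelihood-ratio style normalization: claimed-speaker log-likelihood
    minus the mean log-likelihood of a cohort of similar speakers."""
    return claimed_ll - np.mean(cohort_lls)

# Hypothetical average log-likelihoods for the same utterance under two
# recording conditions; the second session is globally degraded by -15.
clean = normalized_score(-40.0, [-48.0, -50.0, -46.0])
noisy = normalized_score(-55.0, [-63.0, -65.0, -61.0])
```

The two normalized scores coincide even though the raw scores differ by 15, which is why a single decision threshold can survive intersession variability.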
53. Conventional speaker verification system (figure)
54. Speaker verification system including verbal information verification (VIV) (diagram)

Automatic enrollment: pass-phrases from the first few accesses ("Open sesame") are checked by verbal information verification; the verified pass-phrases are saved for training, and HMM training produces a speaker-dependent HMM stored in the database.
Speaker verification: given an identity claim and a test pass-phrase ("Open sesame"), the speaker verifier scores the utterance against the speaker-dependent HMM.
55. Outline
56. BBN's Rough'n'Ready audio indexing system
- Processes recorded audio from broadcast news, meetings, etc.
- Produces partial transcripts
- Identifies entities spoken about (people, companies, etc.)
- Indexes words, concepts, and speakers
- Locates segments where each person is speaking
- Assigns multiple, ranked topics to segments
- Automatically creates structural summaries and stores them
- Quickly and easily retrieves relevant information
- Allows users to skim for topics without listening to the recording
57. Architecture of the Rough'n'Ready audio indexing system (diagram): an audio server with an audio compressor feeds, over a WAN, LAN or local bus, a processing chain of speaker segmentation, speech recognition, speaker identification, clustering, name spotting, classification, story segmentation and story indexing; the resulting metadata (XML corpus and XML index) is loaded by a database uploader into an MS SQL server and an IR index server, and browsed via MS Internet Explorer for information retrieval.
58. The SCANMail architecture (figure)
59. Outline
60. Multimodal human-machine communication (HMC) (figure)
61. Modality-oriented classification of multimodal systems (figure)
62. Multimedia contents technology (figure)
63. Information extraction and retrieval of spoken language content (spoken document retrieval, information indexing, story segmentation, topic tracking, topic detection, etc.) (figure)
64. Architecture of multimodal human/computer interaction (figure)
65. Multi-modal dialogue system structure (figure)
66. Dialogue system specifications
- Task: shop/business information retrieval (names, addresses, phone numbers, specialties, etc.)
- Keywords: district/station names, business names, special requests and shop names
- Acoustic model: task-independent triphone HMMs trained on phonetically-balanced sentence utterances and dialogue utterances by many speakers
- Language model: FSN (finite state network) with fillers, or class bigrams/trigrams; each word/phrase has a DAWG structure in the FSN LM case
67. Audio-visual speech recognition system using optical-flow analysis (figure)
68. Audio-only and audio-visual connected digit recognition results (λ: optimum audio-weighting factor) (figure)
69. Multi-stream HMM and product HMM for audio-visual recognition (figure: (a) audio and video streams; (b) composite states)
- (a) Multi-stream HMM consisting of three audio and three video HMM states.
- (b) The corresponding product (composite) HMM, which limits the possible asynchrony between the audio and visual state sequences to one state.
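The composite-state construction in (b) can be sketched by enumerating audio/video state pairs and keeping only those within one state of asynchrony:

```python
from itertools import product

# Composite (product) HMM states for 3 audio x 3 video states, keeping only
# pairs whose audio and video state indices differ by at most one -- the
# asynchrony constraint described in (b).
audio_states = range(3)
video_states = range(3)
composite = [(a, v) for a, v in product(audio_states, video_states)
             if abs(a - v) <= 1]
# The unconstrained product would have 9 states; the constraint leaves 7.
```

The constraint keeps decoding tractable while still letting the lips lead or lag the audio by one state, which is the usual compromise between fully synchronous and fully independent streams.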
70. Late integration model (figure)
71. Overview of an audio-visual speech system (figure)
72. General flow chart of a talking head system (figure)
73. Summary
- Speech and speaker recognition technology has many potential applications.
- Multimodal human-computer communication and information extraction in the ubiquitous/wearable computing environment have a bright future.
- Using such systems, everyone will be able to access information services anytime, anywhere, and these services are expected to augment various human intellectual activities.
- Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment will be crucial.