Title: Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment
1. Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment
Sadaoki Furui
- Tokyo Institute of Technology
- Department of Computer Science
- 2-12-1, O-okayama, Meguro-ku, Tokyo 152-8552, Japan
- Tel/Fax: +81-3-5734-3480
- furui_at_cs.titech.ac.jp
- http://www.furui.cs.titech.ac.jp/
2. Outline
- Speech recognition applications
- Speech recognition in the ubiquitous/wearable computing environment
- Hands-free communication problems
- Speaker recognition technology
- Audio indexing
- Multimodal human-machine communication
3. Outline
4. Major speech recognition applications
- Conversational systems for accessing information services
- Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment
- Multimodal speech recognition technology
- Systems for transcribing, understanding and summarizing ubiquitous speech documents such as broadcast news, meetings, lectures, presentations and voicemails
5. Two-pass search structure used in the Japanese broadcast-news transcription system (figure)
6. Model of human-computer interaction (figure)
7. Typical structure for task-specific voice control and dialogue systems (figure)
8. AT&T Communicator architecture (figure)
9. Voice portal environment (figure)
10. Voice portal components (figure)
11. What is VoiceXML?
- VoiceXML is to the voice Web what HTML is to the visual Web
- A collection of XML-based markup languages for implementing speech applications
12. VoiceXML system architecture (diagram): a Web server holds HTML scripts, VoiceXML scripts, grammars, multimedia and audio files; a Web browser renders the HTML, while a voice browser in the gateway captures voice, performs ASR, accepts DTMF input, replays audio and synthesizes speech (TTS), driven by VoiceXML; a database server (MS SQL Server) holds the DB and metadata for information retrieval.
13. Why VoiceXML?
- Simplifies application development
- Minimizes Internet traffic
- Separates user interaction code from application logic
- Provides portability
- Supports rapid prototyping and iterative refinement
14. Forms - Typical sequential dialog

Dialog:
  System: Welcome to the electronic payment system.
  System: Whom do you want to pay?
  User:   Ajax Drugstore
  System: How much do you want to pay?
  User:   36.95
  System: Do you want to pay 36.95 to Ajax Drugstore?
  User:   Yes

VoiceXML form:
  <form>
    <prompt>Welcome to the electronic payment system.</prompt>
    <field name="recipient">
      <prompt>Whom do you want to pay?</prompt>
      <grammar src="www.valid_recipients.vxml"/>
    </field>
    <field name="amount">
      <prompt>How much do you want to pay?</prompt>
      <grammar src="www.valid_money.vxml"/>
    </field>
    <field name="validation" type="binary">
      <prompt>Do you want to pay <value expr="amount"/>
        to <value expr="recipient"/>?</prompt>
    </field>
  </form>
15. W3C voice interface framework (VoiceXML 1.0, diagram): a dialog manager connects the World Wide Web to the user over the telephone system, with input components (ASR, DTMF tone recognizer, language understanding, context interpretation) and output components (media planning, language generation, TTS, prerecorded audio player).
16. Taxonomy of system-level evaluation techniques (figure)
17. Outline
18. The Major Trends in Computing (figure; http://www.ubiq.com/hypertext/weiser/NomadicInteractive/Sld003.htm)
19. MIT wearable computing people (http://www.media.mit.edu/wearables/)
20. Features provided by Ubicomp vs. Wearables (figure; http://rhodes.www.media.mit.edu/people/rhodes/papers/wearhive.html)
21. Speech recognition in the ubiquitous/wearable computing environment (figure: ubiquitous computing environment)
22. Distributed speech recognition (DSR) (figure)
23. Meeting synopsizing system using collaborative speech recognizers (figure)
24. Outline
25. Hands-free communications (figure)
26. Hands-free communication problems
- Noise problem
  - Additive background noise of various kinds, including unwanted speech from multiple sources
- Colorization problem
  - Change in the amplitude spectrum caused by the (short) impulse response of the room and/or the transducer
- Reverberation problem
  - Reverberation caused by the long impulse response of the room, and its interaction with the quasi-stationarity of speech
- Duplex system mode problem
  - Echo cancellation for full-duplex communication (e.g. barge-in)
  - Speech activity detection, verification, attention and spatialization for real-time human-machine interaction
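The additive-noise problem is often attacked with spectral subtraction, a standard enhancement technique (not one the slides prescribe): an estimate of the noise magnitude spectrum is subtracted from each observed frame, with a floor to keep magnitudes positive. A minimal sketch with hypothetical spectra:

```python
import numpy as np

def spectral_subtract(noisy, noise_est, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame,
    flooring the result to avoid negative magnitudes (musical noise)."""
    clean = noisy - noise_est
    return np.maximum(clean, floor * noisy)

# Toy example: one magnitude-spectrum frame with stationary additive noise.
noise = np.full(4, 0.5)                  # estimated noise magnitude per bin
speech = np.array([2.0, 0.1, 1.5, 0.0])  # underlying speech magnitudes
observed = speech + noise
enhanced = spectral_subtract(observed, noise)
```

The floor is what keeps the method usable in practice; without it, over-subtraction in low-energy bins produces negative "magnitudes".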
27. Outline
28. Information present in a speech signal (figure)
29. Speaker recognition (figure: voice key)
30. Principal structure of speaker recognition systems (figure)
31. Basic structure of speaker recognition systems: (a) speaker identification (figure)
32. Basic structure of speaker recognition systems: (b) speaker verification (figure)
33. Example of typical intraspeaker and interspeaker distance distributions (figure)
34. Four conditional probabilities in speaker verification (figure)
35. Relationship between error rate and decision criterion (threshold) in speaker verification (figure)
36. Receiver operating characteristic (ROC) curves: performance examples of three speaker verification systems A, B, and D (figure)
37. Recognition error rates as a function of population size in speaker identification and verification (figure)
38. Text-dependent vs. text-independent methods

Text-dependent methods are usually based on template-matching techniques, so the structure of these systems is rather simple. Since such methods can directly exploit the voice individuality associated with each phoneme or syllable, they generally achieve higher recognition performance than text-independent methods. Text-independent methods, however, can be used in applications in which predetermined key words cannot be used. Another advantage is that verification can be performed sequentially, continuing until a desired significance level is reached, without the annoyance of having to repeat the key words again and again.
39. Basic structure of DTW/HMM-based text-dependent speaker recognition methods (figure)
40. Block diagram of the principal operation of a speaker recognition method using the time series of cepstral coefficients and their orthogonal polynomial coefficients (figure)
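The orthogonal polynomial coefficients of a cepstral time series are the standard delta (regression) coefficients, fitted over a short window around each frame. A minimal sketch, assuming the cepstra arrive as a NumPy array of frames (data hypothetical):

```python
import numpy as np

def delta(cep, K=2):
    """First-order regression (orthogonal-polynomial) coefficients over a
    2K+1-frame window -- the standard delta-cepstrum computation."""
    T = len(cep)
    padded = np.vstack([cep[:1]] * K + [cep] + [cep[-1:]] * K)  # edge padding
    num = sum(k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
              for k in range(1, K + 1))
    return num / (2 * sum(k * k for k in range(1, K + 1)))

# Toy trajectory: one cepstral coefficient rising by 1.0 per frame.
cep = np.arange(8.0).reshape(-1, 1)
d = delta(cep)   # interior frames recover the local slope of 1.0
```

The delta trajectory captures spectral dynamics, which carry speaker individuality that the static cepstra alone miss.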
41. Basic structures of text-independent speaker recognition methods (figure)
42. Variation of the long-time averaged spectrum over four sessions spanning eight months, and the corresponding spectral envelopes derived from cepstrum coefficients weighted by the square root of the inverse variances (figure)
43. Vector quantization (VQ)-based text-independent speaker recognition (figure)
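The VQ-based approach trains a codebook per enrolled speaker and scores a test utterance by its average quantization distortion against each codebook, choosing the speaker whose codebook fits best. A toy sketch with a hand-rolled k-means (all data and names hypothetical):

```python
import numpy as np

def train_codebook(vectors, size=2, iters=20, seed=0):
    """Toy k-means clustering: one codebook per enrolled speaker."""
    rng = np.random.default_rng(seed)
    code = vectors[rng.choice(len(vectors), size, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((vectors[:, None] - code) ** 2).sum(-1), axis=1)
        for j in range(size):
            if np.any(labels == j):
                code[j] = vectors[labels == j].mean(axis=0)
    return code

def distortion(vectors, code):
    """Average squared distance of test vectors to their nearest codeword."""
    return np.min(((vectors[:, None] - code) ** 2).sum(-1), axis=1).mean()

# Hypothetical 2-D "cepstral" training data for two speakers.
spk_a = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
spk_b = np.array([[4.0, 4.0], [4.1, 4.0], [5.0, 5.0], [5.1, 5.0]])
books = {"A": train_codebook(spk_a), "B": train_codebook(spk_b)}

test = np.array([[0.05, 0.0], [1.05, 1.0]])  # utterance from speaker A
best = min(books, key=lambda s: distortion(test, books[s]))
```

No temporal modeling is involved, which is exactly why the method is text-independent: only the long-term distribution of spectral vectors matters.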
44. A five-state ergodic HMM for text-independent speaker verification (figure)
45. Basic structures of text-independent speaker recognition methods (cont.) (figure)
46. Text-prompted speaker recognition method

This method uses speaker-specific phoneme models as its basic acoustic units. The recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that the registered speaker has uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance what sentence they will be asked to repeat, so a pre-recorded voice can easily be rejected. One of the major issues with this method is how to properly create the speaker-specific phoneme models from the limited number of training utterances available for each speaker.
47. Block diagram of the text-prompted speaker recognition method (figure)
48. Sound spectrograms for word utterances by several speakers (figure)
- Speaker S, /kogeN/
- Same speaker, /kogeN/ (2 years later)
- Speaker S, /baNgo/
- Speaker M, /kogeN/
- Speaker F, /kogeN/
- Speaker U, /kogeN/
49. Intersession variability (variability over time)
- Speakers
- Recording and transmission conditions
- Noise
- Normalization
  - Parameter domain
  - Distance/similarity domain
50. Parameter-domain normalization
- Cepstral mean normalization (subtraction) (CMN, CMS)
  - Linear channel effects
  - Long-term spectral variation
- Delta-cepstral coefficients
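CMN subtracts the utterance-level cepstral mean from every frame; because a fixed linear channel adds a constant offset in the cepstral domain, the subtraction cancels it. A minimal sketch with hypothetical data:

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization: subtract the utterance-level mean from
    each frame, removing stationary linear-channel effects."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A fixed channel adds a constant offset to every cepstral frame;
# after CMN the two channel conditions become identical.
frames = np.array([[1.0, -0.5], [2.0, 0.5], [3.0, 0.0]])
channel_offset = np.array([0.7, -0.3])
assert np.allclose(cmn(frames), cmn(frames + channel_offset))
```

This is why CMN helps with intersession variability: telephone lines and microphones that differ between enrollment and test largely disappear from the features.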
51. Distance/similarity-domain normalization (figure)
52. Cohort speakers (figure)
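Distance/similarity-domain normalization with a cohort works as a likelihood ratio: the claimed speaker's log-likelihood is normalized by that of a cohort of similar speakers, so score shifts common to all models (e.g. degraded recording conditions) cancel. A toy sketch with hypothetical log-likelihoods:

```python
import numpy as np

def normalized_score(claimed_ll, cohort_lls):
    """Likelihood-ratio style normalization: claimed-speaker log-likelihood
    minus the mean log-likelihood of a cohort of similar speakers."""
    return claimed_ll - np.mean(cohort_lls)

# Hypothetical average log-likelihoods for the same utterance under two
# recording conditions; the second session is globally degraded by -15.
clean = normalized_score(-40.0, [-48.0, -50.0, -46.0])
noisy = normalized_score(-55.0, [-63.0, -65.0, -61.0])
```

The two normalized scores coincide even though the raw scores differ by 15, which is why a single decision threshold can survive intersession variability.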
53. Conventional speaker verification system (figure)
54. Speaker verification system including verbal information verification (VIV) (diagram)

Automatic enrollment: pass-phrases from the first few accesses ("Open sesame") are checked by verbal information verification; the verified pass-phrases are saved for training, and HMM training produces a speaker-dependent HMM stored in the database.
Speaker verification: given an identity claim and a test pass-phrase ("Open sesame"), the speaker verifier scores the utterance against the speaker-dependent HMM.
55. Outline
56. BBN's Rough'n'Ready audio indexing system
- Processes recorded audio from broadcast news, meetings, etc.
- Produces partial transcripts
- Identifies entities spoken about (people, companies, etc.)
- Indexes words, concepts, and speakers
- Locates segments where each person is speaking
- Assigns multiple, ranked topics to segments
- Automatically creates structural summaries and stores them
- Quickly and easily retrieves relevant information
- Allows users to skim for topics without listening to the recording
57. Architecture of the Rough'n'Ready audio indexing system (diagram): an audio server with an audio compressor feeds, over a WAN, LAN or local bus, a processing chain of speaker segmentation, speech recognition, speaker identification, clustering, name spotting, classification, story segmentation and story indexing; the resulting metadata (XML corpus and XML index) is loaded by a database uploader into an MS SQL server and an IR index server, and browsed via MS Internet Explorer for information retrieval.
58. The SCANMail architecture (figure)
59. Outline
60. Multimodal human-machine communication (HMC) (figure)
61. Modality-oriented classification of multimodal systems (figure)
62. Multimedia contents technology (figure)
63. Information extraction and retrieval of spoken language content (spoken document retrieval, information indexing, story segmentation, topic tracking, topic detection, etc.) (figure)
64. Architecture of multimodal human/computer interaction (figure)
65. Multi-modal dialogue system structure (figure)
66. Dialogue system specifications
- Task: shop/business information retrieval (names, addresses, phone numbers, specialties, etc.)
- Keywords: district/station names, business names, special requests and shop names
- Acoustic model: task-independent triphone HMMs trained on phonetically-balanced sentence utterances and dialogue utterances by many speakers
- Language model: FSN (finite state network) with fillers, or class bigrams/trigrams; each word/phrase has a DAWG structure in the FSN LM case
67. Audio-visual speech recognition system using optical-flow analysis (figure)
68. Audio-only and audio-visual connected digit recognition results (λ: optimum audio-weighting factor) (figure)
69. Multi-stream HMM and product HMM for audio-visual recognition (figure: (a) audio and video streams; (b) composite states)
- (a) Multi-stream HMM consisting of three audio and three video HMM states.
- (b) The corresponding product (composite) HMM, which limits the possible asynchrony between the audio and visual state sequences to one state.
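The composite-state construction in (b) can be sketched by enumerating audio/video state pairs and keeping only those within one state of asynchrony:

```python
from itertools import product

# Composite (product) HMM states for 3 audio x 3 video states, keeping only
# pairs whose audio and video state indices differ by at most one -- the
# asynchrony constraint described in (b).
audio_states = range(3)
video_states = range(3)
composite = [(a, v) for a, v in product(audio_states, video_states)
             if abs(a - v) <= 1]
# The unconstrained product would have 9 states; the constraint leaves 7.
```

The constraint keeps decoding tractable while still letting the lips lead or lag the audio by one state, which is the usual compromise between fully synchronous and fully independent streams.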
70. Late integration model (figure)
71. Overview of an audio-visual speech system (figure)
72. General flow chart of a talking head system (figure)
73. Summary
- Speech and speaker recognition technology has many potential applications.
- Multimodal human-computer communication and information extraction in the ubiquitous/wearable computing environment have a bright future.
- Using such systems, everyone will be able to access information services anytime, anywhere, and these services are expected to augment various human intellectual activities.
- Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment will be crucial.