Title: Module u1: Speech in the Interface 3: Speech input and output technology
1. Module u1: Speech in the Interface 3: Speech input and output technology
Jacques Terken
2. Contents
- Speech input technology
  - Speech recognition
  - Language understanding
  - Consequences for design
- Speech output technology
  - Language generation
  - Speech synthesis
  - Consequences for design
- Project
3. Components of conversational interfaces
Application
Speech Synthesis
Language Generation
Dialogue Manager
Natural Language Analysis
Speech recognition
Noise suppression
4. Speech recognition
- Advances come both from progress in speech and language engineering and from increases in computer technology (CPU power)
5. Developments
6. State of the art
7. Why is generic speech recognition so difficult?
- Variability of the input due to many different sources
- Understanding requires vast amounts of world knowledge and common-sense reasoning for the generation and pruning of hypotheses
- Dealing with variability, and with the storage of and access to world knowledge, exceeds the possibilities of current technology
8. Sources of variation
9. No generic speech recognizer
- The idea of a generic speech recognizer has been given up (for the time being)
- Automatic speech recognition is possible by virtue of self-imposed limitations:
  - vocabulary size
  - multiple vs. single speaker
  - real-time vs. offline
  - recognition vs. understanding
10. Speech recognition systems
- Relevant dimensions:
  - speaker-dependent vs. speaker-independent
  - vocabulary size
  - grammar: fixed grammar vs. probabilistic language model
- Trade-off between the different dimensions in terms of performance: the choice of technology is determined by the application requirements
11. Command and control
- Examples: controlling the functionality of a PC or PDA, controlling consumer appliances (stereo, TV, etc.)
- Individual words and multi-word expressions
  - "File", "Edit", "Save as webpage", "Columns to the left"
- Speaker-independent: no training needed before use
- Limited vocabulary gives high recognition performance
- Fixed-format expressions (defined by a grammar)
- Real-time
- But: the user needs to know which items are in the vocabulary and what expressions can be used
- But: (usually) not customizable
12. Information services
- Examples: train travel information, integrated trip planning
- Continuous speech
- Speaker-independent: multiple users
- Mid-size vocabulary, typically less than 5000 words
- Flexibility of input: an extensive grammar that can handle the expected user inputs
- Requires interpretation
- Real-time
13. Dictation systems
- Continuous speech
- Speaker-dependent: requires training by the user
- (Almost) unrestricted input
- Large vocabulary: > 200,000 words
- Probabilistic language model instead of a fixed grammar
- No understanding, just recognition
- Off-line (but near-online performance is possible, depending on system properties)
14. State-of-the-art ASR: the statistical approach
- Two phases:
  - training: creating an inventory of acoustic models and computing transition probabilities
  - testing (classification): mapping the input onto the inventory
15. Speech
- Writing vs. speech
  - writing: see, eat, break, lake
  - speaking: /si/, /it/, /brek/, /lek/
- Alphabetic languages: approx. 25 signs
- Average language: approximately 40 sounds
- Phonetic alphabet
  - (1:1 mapping between character and sound)
16. Speech and sounds
Waveform and spectrogram of "How are you": speech is made up of non-discrete events
17. Representation of the speech signal
- Sounds are coded as successions of states (one state each 10-30 ms)
- States are represented by acoustic vectors
(Figure: two spectrogram-style plots, frequency vs. time)
18. Acoustic models
- Inventory of elementary probabilistic models of basic linguistic units, e.g. phonemes
- Words are stored as networks of elementary models
19. Training of acoustic models
- Compute acoustic vectors and transition probabilities from large corpora
- Each state holds statistics concerning parameter values and parameter variation
- The larger the amount of training data, the better the estimates of parameter values and variation
20. Language model
- Defined by a grammar
- Grammar:
  - rules for combining words into sentences (defining the admissible strings in the language)
  - the basic unit of analysis is the utterance/sentence
  - a sentence is composed of words representing word classes, e.g.
    - determiner: the
    - noun: boy
    - verb: eat
- noun: boy; verb: eat; determiner: the
- rule 1: noun_phrase → det n
- rule 2: sentence → noun_phrase verb
- Morphology: base forms vs. derived forms
  - eat: stem, 1st person singular
  - stem + s: 3rd person singular
  - stem + en: past participle
  - stem + er: substantive (noun)
- the boy eats (grammatical)
- the eats (ungrammatical)
- boy eats (ungrammatical)
- eats the boy (ungrammatical)
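The two rewrite rules above can be turned into a toy recognizer. A minimal sketch in Python (the lexicon and function name are invented for the example), accepting exactly the strings the grammar generates:

```python
# Tiny lexicon mapping words to the word classes from the slide.
LEXICON = {"the": "det", "boy": "noun", "eats": "verb"}

def is_sentence(words):
    """Accept exactly: sentence -> noun_phrase verb, noun_phrase -> det n."""
    tags = [LEXICON.get(w) for w in words]   # unknown words map to None
    return tags == ["det", "noun", "verb"]

print(is_sentence("the boy eats".split()))   # grammatical
print(is_sentence("eats the boy".split()))   # ungrammatical
```

This hard-codes the single expansion the two rules allow; a real grammar would be applied by a parser.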
- Statistical language model
  - probabilities for words and transition probabilities for word sequences in a corpus
  - unigram: probability of individual words
  - bigram: probability of a word given the preceding word
  - trigram: probability of a word given the two preceding words
- Training materials
  - language corpora (journal articles, application-specific material)
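As an illustration of how unigram counts and bigram probabilities are estimated from a corpus, here is a minimal sketch (the toy corpus is invented; real models are trained on millions of words and smoothed):

```python
from collections import Counter

# Toy training corpus, tokenized into words.
corpus = "the boy eats the apple the boy sleeps".split()

unigrams = Counter(corpus)                    # word frequencies
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent word-pair frequencies

def p_bigram(word, prev):
    """P(word | prev), estimated by relative frequency (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("boy", "the"))   # 2 of the 3 occurrences of "the" precede "boy"
```

A trigram model is built the same way from triples `zip(corpus, corpus[1:], corpus[2:])`.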
23. Recognition / classification
- Compute the probability of a sequence of states, given the probabilities of the states, the probabilities of transitions between states, and the language model
- This gives the best path
- Usually not the single best path but an n-best list, for further processing
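The best-path computation can be sketched with a toy Viterbi search over a two-state model. All probabilities below are invented for illustration; a real recognizer works with thousands of states, continuous acoustic vectors, and log-probabilities:

```python
# Toy HMM with two states and two observation symbols (illustrative numbers only).
START = [0.5, 0.5]
TRANS = [[0.7, 0.3],
         [0.4, 0.6]]   # TRANS[i][j]: probability of moving from state i to j
EMIT  = [[0.9, 0.1],
         [0.2, 0.8]]   # EMIT[i][o]: probability of state i emitting symbol o

def viterbi(obs):
    """Return the most probable state sequence (best path) for obs."""
    v = [START[s] * EMIT[s][obs[0]] for s in range(2)]
    back = []
    for o in obs[1:]:
        # scores[j][i]: probability of the best path ending in i, then moving to j
        scores = [[v[i] * TRANS[i][j] for i in range(2)] for j in range(2)]
        back.append([max(range(2), key=lambda i: scores[j][i]) for j in range(2)])
        v = [max(scores[j]) * EMIT[j][o] for j in range(2)]
    # trace the best path backwards through the stored predecessors
    path = [max(range(2), key=lambda s: v[s])]
    for b in reversed(back):
        path.append(b[path[-1]])
    return path[::-1]

print(viterbi([0, 0, 1]))  # best state sequence for the observations 0, 0, 1
```

Keeping all intermediate scores instead of only the maximum yields the n-best list mentioned above.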
25. Caveats
- The properties of the acoustic models are strongly determined by the recording conditions
  - recognition performance depends on the match between recording conditions and run-time conditions
- Use of a language model induces a word bias: for words outside the vocabulary, the best-matching in-vocabulary word is selected
  - solution: use a garbage model
26. Advances
- Confidence measures for recognition results
  - based on acoustic similarity
  - or based on actual confusions for a database
  - or taking into consideration the acoustic properties of the input signal
- Dynamic (state-dependent) loading of the language model
- Parallel recognizers
  - e.g. In-Vehicle Information Systems (IVIS): separate recognizers for the navigation system, entertainment systems, mobile phone, and general purpose; the choice is made on the basis of confidence scores
- Further developments
  - parallel recognizer for hyper-articulate speech
27. State-of-the-art performance
- 98-99.8% correct for small-vocabulary, speaker-independent recognition
- 92-98% correct for speaker-dependent, large-vocabulary recognition
- 50-70% correct for speaker-independent, mid-size vocabulary
28. Recognition of prosody
- Observable manifestations: pitch, temporal properties, silence
- Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion, etc.
- Relevant to understanding/interpretation, e.g.
  - Mary knows many languages you know
  - Mary knows many languages, you know
- Influence on the realisation of phonemes: prosody used to be considered noise, but it contains relevant information
29. Contents
- Speech input technology
  - Speech recognition
  - Language understanding
  - Consequences for design
- Speech output technology
  - Consequences for design
- Project
30. Natural language processing
- Full parse or keyword spotting (concept spotting)
- Keyword spotting:
  - <any> keyword <any>
  - e.g. <any> DEPARTURE <any> DESTINATION <any>
  - can handle:
    - Boston New York
    - I want to go from Boston to New York
    - I want a flight leaving at Boston and arriving at New York
- Semantics (mapping onto functionality) can be specified in the grammar
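A concept-spotting grammar of the form <any> DEPARTURE <any> DESTINATION <any> can be approximated with a regular expression. A minimal sketch, with an invented place list:

```python
import re

# Hypothetical place vocabulary for the illustrative flight-information domain.
PLACE = r"(Boston|New York|Chicago)"

# <any> DEPARTURE <any> DESTINATION <any> as a (lazy) regular expression.
PATTERN = re.compile(r".*?\b" + PLACE + r"\b.*?\b" + PLACE + r"\b.*")

def spot(utterance):
    """Concept spotting: pick out departure and destination, ignore the rest."""
    m = PATTERN.fullmatch(utterance)
    return {"departure": m.group(1), "destination": m.group(2)} if m else None

print(spot("I want to go from Boston to New York"))
print(spot("Boston New York"))
```

All three example utterances from the slide yield the same two concepts; anything without two known places is rejected.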
31. Contents
- Speech input technology
  - Speech recognition
  - Language understanding
  - Consequences for design
- Speech output technology
  - Consequences for design
- Project
32. Coping with technological shortcomings of ASR
- Shortcomings:
  - reliability/robustness
  - architectural complexity of an always-open system
  - lack of transparency in case of input limitations
- Task for the design of speech interfaces:
  - induce the user to modify their behaviour to fit the requirements (restrictions) of the technology
33. Solutions
- "Always open" is the ideal
  - push-to-talk button: recognition window
  - spoke-too-soon problem
- Barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
- Make the training conditions (properties of the training corpus) similar to the test conditions
  - e.g. special corpora for the car environment
- Good prompt design to give clues about the required input
35. Contents
- Speech input technology
  - Consequences for design
- Speech output technology
  - Technology
  - Human factors in speech understanding
  - Consequences for design
- Project
36. Components of conversational interfaces
Application
Speech Synthesis
Language Generation
Dialogue Manager
Natural Language Analysis
Speech recognition
37. Demos
- http://www.ims.uni-stuttgart.de/moehler/synthspeech/examples.html
- http://www.research.att.com/ttsweb/tts/demo.php
- http://www.acapela-group.com/text-to-speech-interactive-demo.html
- http://cslu.cse.ogi.edu/tts/
- Audiovisual speech synthesis
  - http://www.speech.kth.se/multimodal/
  - http://mambo.ucsc.edu/demos.html
- Emotional synthesis (Janet Cahn)
  - http://xenia.media.mit.edu/~cahn/emot-speech.html
38. Applications
- Information access by phone
  - news / weather, timetables (OVR), reverse directory, name dialling
  - spoken e-mail, etc.
- Customer ordering by phone (call centers)
  - IVR: ASR replaces tedious touch-tone actions
- Car driver information by voice
  - navigation, car traffic info (RDS/TMC), command & control (VODIS)
- Interfaces for the disabled
  - MIT/DECtalk (Stephen Hawking)
- In the office and at home (near future?)
  - command & control, navigation for home entertainment
40. Output technology
Language Generation
Speech Synthesis
Dialogue Manager
Application (e.g. e-mail)
Application (information service)
41. Language generation
- Eindhoven → Amsterdam CS

  Departure (Vertrektijd):  08:32  08:47  09:02  09:17  09:32
  Arrival (Aankomsttijd):   09:52  10:10  10:22  10:40  10:52
  Transfers (Overstappen):  0      1      0      1      0
- If nr_of_records > 1:
  - "I have found <n> connections"
  - "The first connection leaves at <time_dep> from <departure> and arrives at <time_arr> at <destination>"
  - "The second connection leaves at <time_dep> from <departure> and arrives at <time_arr> at <destination>"
  - ...
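The template scheme above can be sketched as follows (the record format, function name, and sample data are invented for the example):

```python
# Illustrative connection records, as a dialogue manager might pass them on.
connections = [
    {"dep": "08:32", "arr": "09:52", "transfers": 0},
    {"dep": "08:47", "arr": "10:10", "transfers": 1},
]

ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def generate(conns, origin, dest):
    """Fill the fixed sentence templates with the slot values per connection."""
    lines = [f"I have found {len(conns)} connections."]
    for i, c in enumerate(conns):
        lines.append(
            f"The {ORDINALS[i]} connection leaves at {c['dep']} from "
            f"{origin} and arrives at {c['arr']} at {dest}."
        )
    return " ".join(lines)

print(generate(connections, "Eindhoven", "Amsterdam CS"))
```

Reporting transfers as well would require either extra templates or composing the message from smaller template elements, as the next slide notes.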
- If the user also wants information about whether there are transfers, either other templates have to be used, or templates might be composed from template elements
44. Speech output technologies
- canned (pre-recorded) speech
- Suited for call centers, IVR
- fixed messages/announcements
45. Concatenation of pre-recorded phrases
- Suited for bank account information, database enquiry systems with structured data, and the like
- Template-based, e.g.
  - your account is <account>
  - the flight from <departure> to <destination> leaves at <date> at <time> from <gate>
  - the number of <customer> is <telephone_number>
- Requirements: a database of phrases to be concatenated
- Some knowledge of speech science is required:
  - words are pronounced differently depending on
    - emphasis
    - position in the utterance
    - type of utterance
  - the differences concern both pitch and temporal properties (prosody)
- Compare the different realisations of "Amsterdam" in:
  - Do you want to go to Amsterdam? (emphasis, question, utterance-final)
  - I want to go to Amsterdam (emphasis, statement, utterance-final)
  - Are there two stations in Amsterdam? (no emphasis, question, utterance-final)
  - There are two stations in Amsterdam (no emphasis, statement, utterance-final)
  - Do you want to go to Amsterdam Central Station? (no emphasis, statement, utterance-medial)
- Solution:
  - have the words pronounced in context to obtain different tokens
  - apply clever splicing techniques for smooth concatenation
47. Text-to-speech conversion (TTS)
- Suited for unrestricted text input: all kinds of text
  - reading e-mail, fax (in combination with optical character recognition)
  - information retrieval for unstructured data (preferably in combination with automatic summarisation)
- Utterances are made up by concatenation of small units plus post-processing for prosody, or by concatenation of variable units
48. TTS technology
- Distinction between:
  - linguistic pre-processing, and
  - synthesis
- Linguistic pre-processing:
  - grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
  - prosodic structure (emphasis, boundaries including pauses)
49. TTS linguistic pre-processing: grapheme-phoneme conversion
- To determine how a word is pronounced:
  - consult a lexicon, containing
    - a phoneme transcription
    - syllable boundaries
    - word accent(s)
  - and/or develop pronunciation rules
- Output:
  - Enschede: .En-sx@-de.
  - Kerkrade: .kErk-ra-d@.
  - 's-Hertogenbosch: .sEr-to-x@n-bOs.
- Pros and cons of a lexicon:
  - phoneme transcriptions are accurate
  - (high) risk of out-of-vocabulary words, because the lexicon
    - often contains only stems, no inflections or compounds
    - is never up to date / complete
  - but usually the application includes a user lexicon
- Pros and cons of pronunciation rules:
  - no out-of-vocabulary words
  - transcription results are often wrong for
    - (longer) combinations of words / morphemes
    - exceptions and loan words from other languages
- The best solution is a combination of the two methods:
  - develop a list of words incorrectly transcribed by the rules and put these words in an exception lexicon
  - words not occurring in the exception list are then transcribed by rule
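The combined method (exception lexicon first, rules otherwise) can be sketched like this. The exception entry and the letter-to-sound rules below are deliberately crude, invented stand-ins, not a real rule set:

```python
# Hypothetical exception lexicon: words the rules would transcribe wrongly.
EXCEPTIONS = {"enschede": "En-sx@-de"}

# Crude, ordered letter-to-sound substitution rules (illustration only).
RULES = [("sch", "sx"), ("ch", "x"), ("e", "@")]

def transcribe(word):
    """Exception lexicon first; fall back to rule-based transcription."""
    word = word.lower()
    if word in EXCEPTIONS:                  # 1. look up the exception lexicon
        return EXCEPTIONS[word]
    for grapheme, phoneme in RULES:         # 2. otherwise apply the rules in order
        word = word.replace(grapheme, phoneme)
    return word

print(transcribe("Enschede"))   # served from the exception lexicon
print(transcribe("kerkrade"))   # transcribed by rule
```

A real system uses hundreds of context-sensitive rules (or a trained model) plus syllabification and stress assignment; the control flow, however, is exactly this.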
- Complications:
  - words with the same written form but different pronunciations and different meanings: record (noun) vs. record (verb)
    - requires parsing or a statistical approach
  - proper names and other specialized vocabularies, acronyms/abbreviations (small announcements in journals!)
    - need to be included in the (user) lexicon
  - different kinds of numbers (telephone numbers, amounts, credit card numbers, etc.)
    - require number grammars
53. TTS linguistic pre-processing: prosody
- Emphasis, boundaries (including pauses), sentence type
- Observable manifestations: pitch, temporal properties, silence
- Requires analysis of linguistic structure (parsing) and (ideally) discourse-level information (cf. the earlier Amsterdam example)
54. TTS synthesis
- Concatenation from whole words and phrases is practically impossible:
  - the database becomes too large (especially if you need several versions of each word), and
  - there is no full coverage (out-of-vocabulary words)
- Approaches:
  - sub-word units
  - data-oriented approach
55. Synthesis by sub-word units
- Common approach: diphone synthesis
  - linking together pre-recorded diphones, i.e. short segments (transitions between two successive phonemes) extracted from natural speech
  - 's-Hertogenbosch:
    - phonemes: . s E r t o x @ n b O s .
    - diphones: .s sE Er rt to ox x@ @n nb bO Os s.
- In all, about 1600 transitions per language (40 × 40)
- Synthesis:
  - concatenate the diphones in the correct order
  - perform some (intensity) smoothing at the diphone borders
  - adjust phoneme duration and pitch course according to prosody rules
57. Data-oriented approach
- Generalization of the diphone approach
- Store a large database of speech (running text)
- At run-time:
  - generate a structure representing the phoneme sequence and the prosodic properties needed
- Search algorithm:
  - find the largest possible fragments containing the required properties in the database
- frei-burg
- nürn-berg
- frei-berg
- /fr/ also serves for fr-iedrichshafen
- items in the database are re-usable
- Concatenate the fragments as they are, without post-processing for pitch and duration
- → in this way, not only the phoneme parameters and transitions are taken from the data, but also the pitch and temporal properties
- Advantage: the natural speech quality is preserved (but this may not always be desirable: maybe it should be made clear to people that they are talking to a system)
- Disadvantage: no explicit control of voice characteristics and prosodic characteristics such as pitch and speaking rate (which you might want to manipulate for synthesis of emotional speech, or for conveying a certain personality)
- Difficult or impossible to modify speaker characteristics:
  - another speaker: a new database is required
  - another speaking style: a new database is required
- Research: post-processing of the result with preservation of speech quality
- Hybrid synthesis
  - combination of phrase concatenation and TTS
  - suited for template-based synthesis with a fixed message structure and variable slots
    - the flight from <departure> to <destination> leaves at <date> at <time> from <gate>
  - → in dialogue systems the system has knowledge of the message structure and can select the proper tokens from the database on the basis of this knowledge
62. Future: markup languages
- Structured text
  - current TTS systems strip text annotations (plain ASCII standard)
  - draft proposal for an XML format for synthesis; SALT
63. Contents
- Speech input technology
  - Consequences for design
- Speech output technology
  - Technology
  - Human factors in speech understanding
  - Consequences for design
- Project
64. Issues in comprehension
- Speech quality:
  - reduced quality slows down feature extraction, and mapping the input onto feature vectors will increase the number of matching vectors
  - → this requires compensation by top-down processing, taking more time, effort and practice
- Text-to-speech: written text is often difficult to understand when read aloud, due to complex structures, high information density, etc.
65. Application to synthetic speech
- The substandard quality of synthetic speech requires compensation by (resource-limited) top-down processing
  - → potential overload of the system due to time constraints
- Slowing down the speaking rate is a very effective way to give the listener more processing time
66. Case study: picking up information from speech
- Study on auditory exploration of lists (Pitt & Edwards, 1997)
  - recall of a list of 48 file names
  - presented in groups
  - group size varied (2, 3 or 4)
  - presentation of the groups was listener-paced
  - recall immediately after each group
68. Adjustments
- Analysis of list speaking style
  - → grouping principles:
    - always try to group
    - grouping by filenames and extensions
    - large groups first
    - mnemonic links between groups
  - → prosodic structuring
- Evaluation:
  - a directory with four subdirectories, each containing files with four different names, corresponding to the modules of a programming project, and with three different extensions
  - task: find the most recent version of the four files containing the source code for the modules and copy them into a new directory
  - measures: objective (task completion) and subjective
  - results for task completion: new algorithm 10.39 min, old version 24.12 min
70. Contents
- Speech input technology
- Consequences for design
- Speech output technology
- Technology
- Human factors in speech understanding
- Consequences for design
- Project
71. Design implications
- The choice of technology can be made dependent on the needs of the application
- For restricted domains, very high quality can be achieved through canned speech or phrase concatenation with multiple tokens
- For concatenation with unit selection there is a relation between quality and the size of the database; for good systems there is usually no problem with intelligibility, even with inexperienced listeners
- High quality for diphone speech, needed for uncommon forms such as proper names or company names that are unlikely to be available from a corpus, still requires much effort
- Importance of learning effects
- A general finding is that acceptance of synthetic speech depends strongly on voice quality
- If the trade-off between quality and added value is negative, the prospects for acceptance of the speech interface are poor
73. Speech as output modality: speech vs. text/graphics
- Text/graphics:
  - an image may be worth a thousand words
  - an image/written text is persistent
  - an image is (at least) two-dimensional: temporal and spatial organisation
  - visual expression of hierarchical structure
  - receiver-paced
  - but non-adaptive (until recently)
    - now: adaptive hypertext
- Speech:
  - one-dimensional: extends only in time
    - → spatial issues are better dealt with in another modality
  - sender-paced
  - a poor medium, yet popular:
    - a large amount of speech-based communication serves primarily a social function
    - no need for supporting aids such as paper and pen
    - no special motoric abilities needed
    - speaking is fast
75. Heuristics: speech output preferred when
- message is simple
- message is short
- message need not be referred to later
- message deals with events in time
- message requires an immediate response
- visual channels are overloaded
- the environment is brightly lit, poorly lit, subject to severe vibration, or otherwise adverse to the transmission of visual information
- the user must be free to move around
- (from Michaelis & Wiggins)
- pronunciation is the subject of interaction
76. But: speech output preferably not used when
- message is complex or uses unfamiliar terms
- message is long
- message needs to be referred to later
- message deals with spatial topics
- message is not urgent
- auditory channels are overloaded
- environment is too noisy
- user has easy access to screen
- system output consists of many different kinds of information, which must be available simultaneously and be monitored and acted upon by the user
- (from Michaelis & Wiggins)
- → environmental variation and mixed interaction call for multimodal interfaces
77. Contents
- Speech input technology
  - Consequences for design
- Speech output technology
- Main points
- Project
78. Main points
- The database approach, requiring large databases for individual languages and speaking styles, is dominant both for speech input and for output
  - input: databases for training the acoustic models and the language model
  - output: concatenation of segments and phrases taken from a database
- Large differences in the performance of speech recognition and the quality of output for different languages and target groups (e.g. recognition for children)
- Speech input: three major classes of applications: command & control, information services, dictation systems
  - major parameters: speaker-dependent/independent, vocabulary size (small, medium, large), rigid vs. free-format input
- Dialogue management:
  - finite-state or frame-based approach for task-oriented dialogue acts
  - verification strategies and repair mechanisms for dialogue control
- Pragmatic approaches to language understanding and language generation:
  - input directly mapped onto application functionality
  - output: template-based approaches
- Not covered: speech monitoring, speech data mining applications and technology
- Exercises with the CSLU toolkit and other demonstrators
  - try out your name, telephone numbers, dates, e-mail addresses, abbreviations, etc.
- Project:
  - protocol development
  - dialogue structure
  - strategies and prompts
- Tomorrow:
  - Wizard of Oz test