Title: Module u1: Speech in the Interface 3: Speech input and output technology
1. Module u1: Speech in the Interface 3: Speech input and output technology
Jacques Terken
2. Contents
- Speech input technology
  - Speech recognition
  - Language understanding
  - Consequences for design
- Speech output technology
  - Language generation
  - Speech synthesis
  - Consequences for design
- Project
3. Components of conversational interfaces
Application
Speech Synthesis
Language Generation
Dialogue Manager
Natural Language Analysis
Speech recognition
Noise suppression
4. Speech recognition
- Advances come both from progress in speech and language engineering and from increases in computer technology (CPU power)
5. Developments
6. State of the art
7. Why is generic speech recognition so difficult?
- Variability of the input due to many different sources
- Understanding requires vast amounts of world knowledge and common-sense reasoning for the generation and pruning of hypotheses
- Dealing with variability, and with the storage of and access to world knowledge, exceeds the possibilities of current technology
8. Sources of variation
9. No generic speech recognizer
- The idea of a generic speech recognizer has been given up (for the time being)
- Automatic speech recognition is possible by virtue of self-imposed limitations:
  - vocabulary size
  - multiple vs. single speaker
  - real-time vs. offline
  - recognition vs. understanding
10. Speech recognition systems
- Relevant dimensions:
  - speaker-dependent vs. speaker-independent
  - vocabulary size
  - grammar: fixed grammar vs. probabilistic language model
- Trade-off between the different dimensions in terms of performance: the choice of technology is determined by the application requirements
11. Command and control
- Examples: controlling the functionality of a PC or PDA, controlling consumer appliances (stereo, TV, etc.)
- Individual words and multi-word expressions
  - "File", "Edit", "Save as webpage", "Columns to the left"
- Speaker-independent: no training needed before use
- Limited vocabulary gives high recognition performance
- Fixed-format expressions (defined by a grammar)
- Real-time
- But: the user needs to know which items are in the vocabulary and what expressions can be used
- But: (usually) not customizable
12. Information services
- Examples: train travel information, integrated trip planning
- Continuous speech
- Speaker-independent: multiple users
- Mid-size vocabulary, typically less than 5000 words
- Flexibility of input: an extensive grammar that can handle the expected user inputs
- Requires interpretation
- Real-time
13. Dictation systems
- Continuous speech
- Speaker-dependent: requires training by the user
- (Almost) unrestricted input
- Large vocabulary: > 200,000 words
- Probabilistic language model instead of a fixed grammar
- No understanding, just recognition
- Off-line (but near-online performance is possible, depending on system properties)
14. State-of-the-art ASR: the statistical approach
- Two phases:
  - training: creating an inventory of acoustic models and computing transition probabilities
  - testing (classification): mapping the input onto the inventory
15. Speech
- Writing vs. speech
  - writing: see, eat, break, lake
  - speaking: /si/, /it/, /brek/, /lek/
- Alphabetic languages: approx. 25 signs
- Average language: approximately 40 sounds
- Phonetic alphabet
  - (1:1 mapping between character and sound)
16. Speech and sounds
Waveform and spectrogram of "How are you": speech is made up of non-discrete events
17. Representation of the speech signal
- Sounds are coded as successions of states (one state each 10-30 ms)
- States are represented by acoustic vectors
(Figure: two spectrogram-style plots, frequency vs. time)
18. Acoustic models
- Inventory of elementary probabilistic models of basic linguistic units, e.g. phonemes
- Words are stored as networks of elementary models
19. Training of acoustic models
- Compute acoustic vectors and transition probabilities from large corpora
- Each state holds statistics concerning parameter values and parameter variation
- The larger the amount of training data, the better the estimates of parameter values and variation
20. Language model
- Defined by a grammar
- Grammar:
  - rules for combining words into sentences (defining the admissible strings in the language)
  - the basic unit of analysis is the utterance/sentence
  - a sentence is composed of words representing word classes, e.g.
    - determiner: the
    - noun: boy
    - verb: eat
- noun: boy; verb: eat; determiner: the
- rule 1: noun_phrase → det n
- rule 2: sentence → noun_phrase verb
- Morphology: base forms vs. derived forms
  - eat: stem, 1st person singular
  - stem + s: 3rd person singular
  - stem + en: past participle
  - stem + er: substantive (noun)
- the boy eats (grammatical)
- the eats (ungrammatical)
- boy eats (ungrammatical)
- eats the boy (ungrammatical)
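The two rewrite rules above can be turned into a toy recognizer. A minimal sketch in Python (the lexicon and function name are invented for the example), accepting exactly the strings the grammar generates:

```python
# Tiny lexicon mapping words to the word classes from the slide.
LEXICON = {"the": "det", "boy": "noun", "eats": "verb"}

def is_sentence(words):
    """Accept exactly: sentence -> noun_phrase verb, noun_phrase -> det n."""
    tags = [LEXICON.get(w) for w in words]   # unknown words map to None
    return tags == ["det", "noun", "verb"]

print(is_sentence("the boy eats".split()))   # grammatical
print(is_sentence("eats the boy".split()))   # ungrammatical
```

This hard-codes the single expansion the two rules allow; a real grammar would be applied by a parser.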
- Statistical language model
  - probabilities for words and transition probabilities for word sequences in a corpus
  - unigram: probability of individual words
  - bigram: probability of a word given the preceding word
  - trigram: probability of a word given the two preceding words
- Training materials
  - language corpora (journal articles, application-specific material)
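As an illustration of how unigram counts and bigram probabilities are estimated from a corpus, here is a minimal sketch (the toy corpus is invented; real models are trained on millions of words and smoothed):

```python
from collections import Counter

# Toy training corpus, tokenized into words.
corpus = "the boy eats the apple the boy sleeps".split()

unigrams = Counter(corpus)                    # word frequencies
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent word-pair frequencies

def p_bigram(word, prev):
    """P(word | prev), estimated by relative frequency (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("boy", "the"))   # 2 of the 3 occurrences of "the" precede "boy"
```

A trigram model is built the same way from triples `zip(corpus, corpus[1:], corpus[2:])`.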
23. Recognition / classification
- Compute the probability of a sequence of states, given the probabilities of the states, the probabilities of transitions between states, and the language model
- This gives the best path
- Usually not the single best path but an n-best list, for further processing
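The best-path computation can be sketched with a toy Viterbi search over a two-state model. All probabilities below are invented for illustration; a real recognizer works with thousands of states, continuous acoustic vectors, and log-probabilities:

```python
# Toy HMM with two states and two observation symbols (illustrative numbers only).
START = [0.5, 0.5]
TRANS = [[0.7, 0.3],
         [0.4, 0.6]]   # TRANS[i][j]: probability of moving from state i to j
EMIT  = [[0.9, 0.1],
         [0.2, 0.8]]   # EMIT[i][o]: probability of state i emitting symbol o

def viterbi(obs):
    """Return the most probable state sequence (best path) for obs."""
    v = [START[s] * EMIT[s][obs[0]] for s in range(2)]
    back = []
    for o in obs[1:]:
        # scores[j][i]: probability of the best path ending in i, then moving to j
        scores = [[v[i] * TRANS[i][j] for i in range(2)] for j in range(2)]
        back.append([max(range(2), key=lambda i: scores[j][i]) for j in range(2)])
        v = [max(scores[j]) * EMIT[j][o] for j in range(2)]
    # trace the best path backwards through the stored predecessors
    path = [max(range(2), key=lambda s: v[s])]
    for b in reversed(back):
        path.append(b[path[-1]])
    return path[::-1]

print(viterbi([0, 0, 1]))  # best state sequence for the observations 0, 0, 1
```

Keeping all intermediate scores instead of only the maximum yields the n-best list mentioned above.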
25. Caveats
- The properties of the acoustic models are strongly determined by the recording conditions
  - recognition performance depends on the match between recording conditions and run-time conditions
- Use of a language model induces a word bias: for words outside the vocabulary, the best-matching in-vocabulary word is selected
  - solution: use a garbage model
26. Advances
- Confidence measures for recognition results
  - based on acoustic similarity
  - or based on actual confusions for a database
  - or taking into consideration the acoustic properties of the input signal
- Dynamic (state-dependent) loading of the language model
- Parallel recognizers
  - e.g. In-Vehicle Information Systems (IVIS): separate recognizers for the navigation system, entertainment systems, mobile phone, and general purpose; the choice is made on the basis of confidence scores
- Further developments
  - parallel recognizer for hyper-articulate speech
27. State-of-the-art performance
- 98-99.8% correct for small-vocabulary, speaker-independent recognition
- 92-98% correct for speaker-dependent, large-vocabulary recognition
- 50-70% correct for speaker-independent, mid-size vocabulary
28. Recognition of prosody
- Observable manifestations: pitch, temporal properties, silence
- Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion, etc.
- Relevant to understanding/interpretation, e.g.
  - Mary knows many languages you know
  - Mary knows many languages, you know
- Influence on the realisation of phonemes: prosody used to be considered noise, but it contains relevant information
29. Contents
- Speech input technology
  - Speech recognition
  - Language understanding
  - Consequences for design
- Speech output technology
  - Consequences for design
- Project
30. Natural language processing
- Full parse or keyword spotting (concept spotting)
- Keyword spotting:
  - <any> keyword <any>
  - e.g. <any> DEPARTURE <any> DESTINATION <any>
  - can handle:
    - Boston New York
    - I want to go from Boston to New York
    - I want a flight leaving at Boston and arriving at New York
- Semantics (mapping onto functionality) can be specified in the grammar
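A concept-spotting grammar of the form <any> DEPARTURE <any> DESTINATION <any> can be approximated with a regular expression. A minimal sketch, with an invented place list:

```python
import re

# Hypothetical place vocabulary for the illustrative flight-information domain.
PLACE = r"(Boston|New York|Chicago)"

# <any> DEPARTURE <any> DESTINATION <any> as a (lazy) regular expression.
PATTERN = re.compile(r".*?\b" + PLACE + r"\b.*?\b" + PLACE + r"\b.*")

def spot(utterance):
    """Concept spotting: pick out departure and destination, ignore the rest."""
    m = PATTERN.fullmatch(utterance)
    return {"departure": m.group(1), "destination": m.group(2)} if m else None

print(spot("I want to go from Boston to New York"))
print(spot("Boston New York"))
```

All three example utterances from the slide yield the same two concepts; anything without two known places is rejected.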
31. Contents
- Speech input technology
  - Speech recognition
  - Language understanding
  - Consequences for design
- Speech output technology
  - Consequences for design
- Project
32. Coping with technological shortcomings of ASR
- Shortcomings:
  - reliability/robustness
  - architectural complexity of an always-open system
  - lack of transparency in case of input limitations
- Task for the design of speech interfaces:
  - induce the user to modify their behaviour to fit the requirements (restrictions) of the technology
33. Solutions
- "Always open" is the ideal
  - push-to-talk button: recognition window
  - spoke-too-soon problem
- Barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
- Make the training conditions (properties of the training corpus) similar to the test conditions
  - e.g. special corpora for the car environment
- Good prompt design to give clues about the required input
35. Contents
- Speech input technology
  - Consequences for design
- Speech output technology
  - Technology
  - Human factors in speech understanding
  - Consequences for design
- Project
36. Components of conversational interfaces
Application
Speech Synthesis
Language Generation
Dialogue Manager
Natural Language Analysis
Speech recognition
37. Demos
- http://www.ims.uni-stuttgart.de/moehler/synthspeech/examples.html
- http://www.research.att.com/ttsweb/tts/demo.php
- http://www.acapela-group.com/text-to-speech-interactive-demo.html
- http://cslu.cse.ogi.edu/tts/
- Audiovisual speech synthesis
  - http://www.speech.kth.se/multimodal/
  - http://mambo.ucsc.edu/demos.html
- Emotional synthesis (Janet Cahn)
  - http://xenia.media.mit.edu/~cahn/emot-speech.html
38. Applications
- Information access by phone
  - news / weather, timetables (OVR), reverse directory, name dialling
  - spoken e-mail, etc.
- Customer ordering by phone (call centers)
  - IVR: ASR replaces tedious touch-tone actions
- Car driver information by voice
  - navigation, car traffic info (RDS/TMC), command & control (VODIS)
- Interfaces for the disabled
  - MIT/DECtalk (Stephen Hawking)
- In the office and at home (near future?)
  - command & control, navigation for home entertainment
40. Output technology
Language Generation
Speech Synthesis
Dialogue Manager
Application (e.g. e-mail)
Application (information service)
41. Language generation
- Eindhoven → Amsterdam CS

  Departure (Vertrektijd):  08:32  08:47  09:02  09:17  09:32
  Arrival (Aankomsttijd):   09:52  10:10  10:22  10:40  10:52
  Transfers (Overstappen):  0      1      0      1      0
- If nr_of_records > 1:
  - "I have found <n> connections"
  - "The first connection leaves at <time_dep> from <departure> and arrives at <time_arr> at <destination>"
  - "The second connection leaves at <time_dep> from <departure> and arrives at <time_arr> at <destination>"
  - ...
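The template scheme above can be sketched as follows (the record format, function name, and sample data are invented for the example):

```python
# Illustrative connection records, as a dialogue manager might pass them on.
connections = [
    {"dep": "08:32", "arr": "09:52", "transfers": 0},
    {"dep": "08:47", "arr": "10:10", "transfers": 1},
]

ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def generate(conns, origin, dest):
    """Fill the fixed sentence templates with the slot values per connection."""
    lines = [f"I have found {len(conns)} connections."]
    for i, c in enumerate(conns):
        lines.append(
            f"The {ORDINALS[i]} connection leaves at {c['dep']} from "
            f"{origin} and arrives at {c['arr']} at {dest}."
        )
    return " ".join(lines)

print(generate(connections, "Eindhoven", "Amsterdam CS"))
```

Reporting transfers as well would require either extra templates or composing the message from smaller template elements, as the next slide notes.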
- If the user also wants information about whether there are transfers, either other templates have to be used, or templates might be composed from template elements
44. Speech output technologies
- canned (pre-recorded) speech
- Suited for call centers, IVR
- fixed messages/announcements
45. Concatenation of pre-recorded phrases
- Suited for bank account information, database enquiry systems with structured data, and the like
- Template-based, e.g.
  - your account is <account>
  - the flight from <departure> to <destination> leaves at <date> at <time> from <gate>
  - the number of <customer> is <telephone_number>
- Requirements: a database of phrases to be concatenated
- Some knowledge of speech science is required:
  - words are pronounced differently depending on
    - emphasis
    - position in the utterance
    - type of utterance
  - the differences concern both pitch and temporal properties (prosody)
- Compare the different realisations of "Amsterdam" in:
  - Do you want to go to Amsterdam? (emphasis, question, utterance-final)
  - I want to go to Amsterdam (emphasis, statement, utterance-final)
  - Are there two stations in Amsterdam? (no emphasis, question, utterance-final)
  - There are two stations in Amsterdam (no emphasis, statement, utterance-final)
  - Do you want to go to Amsterdam Central Station? (no emphasis, statement, utterance-medial)
- Solution:
  - have the words pronounced in context to obtain different tokens
  - apply clever splicing techniques for smooth concatenation
47. Text-to-speech conversion (TTS)
- Suited for unrestricted text input: all kinds of text
  - reading e-mail, fax (in combination with optical character recognition)
  - information retrieval for unstructured data (preferably in combination with automatic summarisation)
- Utterances are made up by concatenation of small units plus post-processing for prosody, or by concatenation of variable units
48. TTS technology
- Distinction between:
  - linguistic pre-processing, and
  - synthesis
- Linguistic pre-processing:
  - grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
  - prosodic structure (emphasis, boundaries including pauses)
49. TTS linguistic pre-processing: grapheme-phoneme conversion
- To determine how a word is pronounced:
  - consult a lexicon, containing
    - a phoneme transcription
    - syllable boundaries
    - word accent(s)
  - and/or develop pronunciation rules
- Output:
  - Enschede: .En-sx@-de.
  - Kerkrade: .kErk-ra-d@.
  - 's-Hertogenbosch: .sEr-to-x@n-bOs.
- Pros and cons of a lexicon:
  - phoneme transcriptions are accurate
  - (high) risk of out-of-vocabulary words, because the lexicon
    - often contains only stems, no inflections or compounds
    - is never up to date / complete
  - but usually the application includes a user lexicon
- Pros and cons of pronunciation rules:
  - no out-of-vocabulary words
  - transcription results are often wrong for
    - (longer) combinations of words / morphemes
    - exceptions and loan words from other languages
- The best solution is a combination of the two methods:
  - develop a list of words incorrectly transcribed by the rules and put these words in an exception lexicon
  - words not occurring in the exception list are then transcribed by rule
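The combined method (exception lexicon first, rules otherwise) can be sketched like this. The exception entry and the letter-to-sound rules below are deliberately crude, invented stand-ins, not a real rule set:

```python
# Hypothetical exception lexicon: words the rules would transcribe wrongly.
EXCEPTIONS = {"enschede": "En-sx@-de"}

# Crude, ordered letter-to-sound substitution rules (illustration only).
RULES = [("sch", "sx"), ("ch", "x"), ("e", "@")]

def transcribe(word):
    """Exception lexicon first; fall back to rule-based transcription."""
    word = word.lower()
    if word in EXCEPTIONS:                  # 1. look up the exception lexicon
        return EXCEPTIONS[word]
    for grapheme, phoneme in RULES:         # 2. otherwise apply the rules in order
        word = word.replace(grapheme, phoneme)
    return word

print(transcribe("Enschede"))   # served from the exception lexicon
print(transcribe("kerkrade"))   # transcribed by rule
```

A real system uses hundreds of context-sensitive rules (or a trained model) plus syllabification and stress assignment; the control flow, however, is exactly this.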
- Complications:
  - words with the same written form but different pronunciations and different meanings: record (noun) vs. record (verb)
    - requires parsing or a statistical approach
  - proper names and other specialized vocabularies, acronyms/abbreviations (small announcements in journals!)
    - need to be included in the (user) lexicon
  - different kinds of numbers (telephone numbers, amounts, credit card numbers, etc.)
    - require number grammars
53. TTS linguistic pre-processing: prosody
- Emphasis, boundaries (including pauses), sentence type
- Observable manifestations: pitch, temporal properties, silence
- Requires analysis of linguistic structure (parsing) and (ideally) discourse-level information (cf. the earlier Amsterdam example)
54. TTS synthesis
- Concatenation from whole words and phrases is practically impossible:
  - the database becomes too large (especially if you need several versions of each word), and
  - there is no full coverage (out-of-vocabulary words)
- Approaches:
  - sub-word units
  - data-oriented approach
55. Synthesis by sub-word units
- Common approach: diphone synthesis
  - linking together pre-recorded diphones, i.e. short segments (transitions between two successive phonemes) extracted from natural speech
  - 's-Hertogenbosch:
    - phonemes: . s E r t o x @ n b O s .
    - diphones: .s sE Er rt to ox x@ @n nb bO Os s.
- In all, about 1600 transitions per language (40 × 40)
- Synthesis:
  - concatenate the diphones in the correct order
  - perform some (intensity) smoothing at the diphone borders
  - adjust phoneme duration and pitch course according to prosody rules
57. Data-oriented approach
- Generalization of the diphone approach
- Store a large database of speech (running text)
- At run-time:
  - generate a structure representing the phoneme sequence and the prosodic properties needed
- Search algorithm:
  - find the largest possible fragments containing the required properties in the database
- frei-burg
- nürn-berg
- frei-berg
- /fr/ also serves for fr-iedrichshafen
- items in the database are re-usable
- Concatenate the fragments as they are, without post-processing for pitch and duration
- → in this way, not only the phoneme parameters and transitions are taken from the data, but also the pitch and temporal properties
- Advantage: the natural speech quality is preserved (but this may not always be desirable: maybe it should be made clear to people that they are talking to a system)
- Disadvantage: no explicit control of voice characteristics and prosodic characteristics such as pitch and speaking rate (which you might want to manipulate for synthesis of emotional speech, or for conveying a certain personality)
- Difficult or impossible to modify speaker characteristics:
  - another speaker: a new database is required
  - another speaking style: a new database is required
- Research: post-processing of the result with preservation of speech quality
- Hybrid synthesis
  - combination of phrase concatenation and TTS
  - suited for template-based synthesis with a fixed message structure and variable slots
    - the flight from <departure> to <destination> leaves at <date> at <time> from <gate>
  - → in dialogue systems the system has knowledge of the message structure and can select the proper tokens from the database on the basis of this knowledge
62. Future: markup languages
- Structured text
  - current TTS systems strip text annotations (plain ASCII standard)
  - draft proposal for an XML format for synthesis; SALT
63. Contents
- Speech input technology
  - Consequences for design
- Speech output technology
  - Technology
  - Human factors in speech understanding
  - Consequences for design
- Project
64. Issues in comprehension
- Speech quality:
  - reduced quality slows down feature extraction, and mapping the input onto feature vectors will increase the number of matching vectors
  - → this requires compensation by top-down processing, taking more time, effort and practice
- Text-to-speech: written text is often difficult to understand when read aloud, due to complex structures, high information density, etc.
65. Application to synthetic speech
- The substandard quality of synthetic speech requires compensation by (resource-limited) top-down processing
  - → potential overload of the system due to time constraints
- Slowing down the speaking rate is a very effective way to give the listener more processing time
66. Case study: picking up information from speech
- Study on auditory exploration of lists (Pitt & Edwards, 1997)
  - recall of a list of 48 file names
  - presented in groups
  - group size varied (2, 3 or 4)
  - presentation of the groups was listener-paced
  - recall immediately after each group
68. Adjustments
- Analysis of list speaking style
  - → grouping principles:
    - always try to group
    - grouping by filenames and extensions
    - large groups first
    - mnemonic links between groups
  - → prosodic structuring
- Evaluation:
  - a directory with four subdirectories, each containing files with four different names, corresponding to the modules of a programming project, and with three different extensions
  - task: find the most recent version of the four files containing the source code for the modules and copy them into a new directory
  - measures: objective (task completion) and subjective
  - results for task completion: new algorithm 10.39 min, old version 24.12 min
70. Contents
- Speech input technology
- Consequences for design
- Speech output technology
- Technology
- Human factors in speech understanding
- Consequences for design
- Project
71. Design implications
- The choice of technology can be made dependent on the needs of the application
- For restricted domains, very high quality can be achieved through canned speech or phrase concatenation with multiple tokens
- For concatenation with unit selection there is a relation between quality and the size of the database; for good systems there is usually no problem with intelligibility, even with inexperienced listeners
- High quality for diphone speech, needed for uncommon forms such as proper names or company names that are unlikely to be available from a corpus, still requires much effort
- Importance of learning effects
- A general finding is that acceptance of synthetic speech depends strongly on voice quality
- If the trade-off between quality and added value is negative, the prospects for acceptance of the speech interface are poor
73. Speech as output modality: speech vs. text/graphics
- Text/graphics:
  - an image may be worth a thousand words
  - an image/written text is persistent
  - an image is (at least) two-dimensional: temporal and spatial organisation
  - visual expression of hierarchical structure
  - receiver-paced
  - but non-adaptive (until recently)
    - now: adaptive hypertext
- Speech:
  - one-dimensional: extends only in time
    - → spatial issues are better dealt with in another modality
  - sender-paced
  - a poor medium, yet popular:
    - a large amount of speech-based communication serves primarily a social function
    - no need for supporting aids such as paper and pen
    - no special motoric abilities needed
    - speaking is fast
75. Heuristics: speech output preferred when
- message is simple
- message is short
- message need not be referred to later
- message deals with events in time
- message requires an immediate response
- visual channels are overloaded
- the environment is brightly lit, poorly lit, subject to severe vibration, or otherwise adverse to the transmission of visual information
- the user must be free to move around
- (from Michaelis & Wiggins)
- pronunciation is the subject of interaction
76. But: speech output preferably not used when
- message is complex or uses unfamiliar terms
- message is long
- message needs to be referred to later
- message deals with spatial topics
- message is not urgent
- auditory channels are overloaded
- environment is too noisy
- user has easy access to screen
- system output consists of many different kinds of information, which must be available simultaneously and be monitored and acted upon by the user
- (from Michaelis & Wiggins)
- → environmental variation and mixed interaction call for multimodal interfaces
77. Contents
- Speech input technology
  - Consequences for design
- Speech output technology
- Main points
- Project
78. Main points
- The database approach, requiring large databases for individual languages and speaking styles, is dominant both for speech input and for output
  - input: databases for training the acoustic models and the language model
  - output: concatenation of segments and phrases taken from a database
- Large differences in the performance of speech recognition and the quality of output for different languages and target groups (e.g. recognition for children)
- Speech input: three major classes of applications: command & control, information services, dictation systems
  - major parameters: speaker-dependent/independent, vocabulary size (small, medium, large), rigid vs. free-format input
- Dialogue management:
  - finite-state or frame-based approach for task-oriented dialogue acts
  - verification strategies and repair mechanisms for dialogue control
- Pragmatic approaches to language understanding and language generation:
  - input directly mapped onto application functionality
  - output: template-based approaches
- Not covered: speech monitoring, speech data mining applications and technology
- Exercises with the CSLU toolkit and other demonstrators
  - try out your name, telephone numbers, dates, e-mail addresses, abbreviations, etc.
- Project:
  - protocol development
  - dialogue structure
  - strategies and prompts
- Tomorrow:
  - Wizard of Oz test