Module u1: Speech in the Interface 3: Speech input and output technology - PowerPoint PPT Presentation

1
Module u1: Speech in the Interface 3: Speech input and output technology
Jacques Terken
2
contents
  • Speech input technology
  • Speech recognition
  • Language understanding
  • Consequences for design
  • Speech output technology
  • Language generation
  • Speech synthesis
  • Consequences for design
  • Project

3
Components of conversational interfaces
Application
Speech Synthesis
Language Generation
Dialogue Manager
Natural Language Analysis
Speech recognition
Noise suppression
4
Speech recognition
  • Advances both through progress in speech and
    language engineering and in computer technology
    (increases in CPU power)

5
Developments
6
State of the art
7
Why is generic speech recognition so difficult?
  • Variability of input due to many different sources
  • Understanding requires vast amounts of world knowledge and common-sense reasoning for generation and pruning of hypotheses
  • Dealing with variability and with storage of / access to world knowledge exceeds the possibilities of current technology

8
Sources of variation
9
No generic speech recognizer
  • The idea of a generic speech recognizer has been given up (for the time being)
  • automatic speech recognition is possible by virtue of self-imposed limitations:
  • vocabulary size
  • multiple vs single speaker
  • real-time vs offline
  • recognition vs understanding

10
Speech recognition systems
  • Relevant dimensions
  • Speaker-dependent vs speaker-independent
  • Vocabulary size
  • Grammar: fixed grammar vs probabilistic language model
  • Trade-off between the different dimensions in terms of performance; the choice of technology is determined by application requirements

11
Command and control
  • Examples: controlling functionality of a PC or PDA, controlling consumer appliances (stereo, TV etc.)
  • Individual words and multi-word expressions
  • "File", "Edit", "Save as webpage", "Columns to the left"
  • Speaker-independent: no training needed before use
  • Limited vocabulary gives high recognition performance
  • Fixed-format expressions (defined by grammar)
  • Real-time
  • – User needs to know which items are in the vocabulary and what expressions can be used
  • – (Usually) not customizable

12
Information services
  • Examples: train travel information, integrated trip planning
  • Continuous speech
  • Speaker-independent: multiple users
  • Mid-size vocabulary, typically less than 5000 words
  • Flexibility of input: extensive grammar that can handle expected user inputs
  • Requires interpretation
  • Real-time

13
Dictation systems
  • Continuous speech
  • Speaker-dependent: requires training by the user
  • (Almost) unrestricted input
  • Large vocabulary: > 200,000 words
  • Probabilistic language model instead of fixed grammar
  • No understanding, just recognition
  • Off-line (but near-online performance possible depending on system properties)

14
State of the art ASR: statistical approach
  • Two phases
  • Training: creating an inventory of acoustic models and computing transition probabilities
  • Testing (classification): mapping input onto the inventory

15
Speech
  • Writing vs speech
  • Writing: see, eat, break, lake
  • Speaking: /si/, /it/, /brek/, /lek/
  • Alphabetic languages: approx. 25 signs
  • Average language: approximately 40 sounds
  • Phonetic alphabet
  • (1:1 mapping character-sound)

16
Speech and sounds: waveform and spectrogram of "How are you". Speech is made up of non-discrete events

17
Representation of the speech signal
  • Sounds coded as successions of states (one state every 10-30 ms)
  • States represented by acoustic vectors






[Figure: two spectrogram panels, frequency vs. time]
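The framing idea above (one state every 10-30 ms, each represented by an acoustic vector) can be sketched as follows. The frame/shift lengths and the two toy features (energy, zero crossings) are illustrative assumptions, not the features a real recognizer would use:

```python
# Toy illustration of framing: speech is cut into short overlapping
# windows and each window is summarized by a feature vector.

def frame_signal(samples, rate, frame_ms=25, shift_ms=10):
    """Split a list of samples into overlapping frames."""
    frame_len = int(rate * frame_ms / 1000)
    shift = int(rate * shift_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]

def acoustic_vector(frame):
    """Crude 2-dim feature vector: energy and zero-crossing count."""
    energy = sum(s * s for s in frame)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return (energy, zero_crossings)

# 100 ms of a toy signal at 8 kHz -> 800 samples -> 8 frames
signal = [((-1) ** i) * (i % 7) for i in range(800)]
frames = frame_signal(signal, rate=8000)
vectors = [acoustic_vector(f) for f in frames]
```

Real systems compute spectral features (e.g. MFCCs) per frame, but the windowing logic is the same.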
18
Acoustic models
  • Inventory of elementary probabilistic models of
    basic linguistic units, e.g. phonemes
  • Words stored as networks of elementary models

19
Training of acoustic models
  • Compute acoustic vectors and transition
    probabilities from large corpora
  • each state holds statistics concerning parameter
    values and parameter variation
  • The larger the amount of training data, the
    better the estimates of parameter values and
    variation

20
Language model
  • Defined by grammar
  • Grammar
  • Rules for combining words into sentences
    (defining the admissible strings in that
    language)
  • Basic unit of analysis is utterance/sentence
  • Sentence composed of words representing word classes, e.g.
  • determiner: the
  • noun: boy
  • verb: eat

21
  • noun: boy; verb: eat; determiner: the
  • rule 1: noun_phrase → det n
  • rule 2: sentence → noun_phrase verb
  • Morphology: base forms vs derived forms
  • eat: stem, 1st person singular
  • stem + s: 3rd person singular
  • stem + en: past participle
  • stem + er: substantive (noun)
  • the boy eats (admissible)
  • the eats (not admissible)
  • boy eats (not admissible)
  • eats the boy (not admissible)
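The two rules above admit exactly the pattern det n verb. A minimal sketch of checking admissibility against them, with the three-word lexicon from the slide:

```python
# Two-rule grammar from the slide:
#   rule 1: noun_phrase -> det n
#   rule 2: sentence    -> noun_phrase verb
# so an admissible sentence is exactly: det n verb.
LEXICON = {"the": "det", "boy": "n", "eats": "verb"}

def admissible(sentence):
    """True iff the word string matches the pattern det n verb."""
    tags = [LEXICON.get(w) for w in sentence.split()]
    return tags == ["det", "n", "verb"]
```

With this check, only "the boy eats" is admissible; "the eats", "boy eats" and "eats the boy" are rejected, as on the slide.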

22
  • Statistical language model
  • Probabilities for words and transition probabilities for word sequences in a corpus
  • unigram: probability of individual words
  • bigram: probability of a word given the preceding word
  • trigram: probability of a word given the two preceding words
  • Training materials
  • language corpora (journal articles, application-specific corpora)
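A bigram model as described above can be estimated by simple counting, P(w2 | w1) = count(w1 w2) / count(w1). The toy corpus here is invented for illustration:

```python
# Bigram language model estimated from a toy corpus by counting.
from collections import Counter

corpus = "the boy eats the apple the boy sleeps".split()
unigrams = Counter(corpus)                    # count(w1)
bigrams = Counter(zip(corpus, corpus[1:]))    # count(w1 w2)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]
```

Real systems smooth these counts (unseen bigrams would otherwise get probability zero), but the estimation principle is the same.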

23
Recognition / classification
24
  • Compute the probability of a sequence of states given the probabilities for the states, the probabilities for transitions between states, and the language model
  • Gives the best path
  • Usually not the single best path but an n-best list is passed on for further processing
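Finding the best path through the states is typically done with the Viterbi algorithm. A minimal sketch, with an invented two-state model (the deck does not give concrete numbers):

```python
# Minimal Viterbi: most probable state path for an observation sequence.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, best state path)."""
    paths = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_paths = {}
        for s in states:
            # best predecessor for state s at this step
            prob, prev = max(
                (paths[p][0] * trans_p[p][s] * emit_p[s][o], p)
                for p in states)
            new_paths[s] = (prob, paths[prev][1] + [s])
        paths = new_paths
    best = max(states, key=lambda s: paths[s][0])
    return paths[best]

states = ("A", "B")
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
prob, path = viterbi(["x", "y", "y"], states, start_p, trans_p, emit_p)
```

An n-best variant would keep the top n partial paths per state instead of only the best one.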

25
Caveats
  • Properties of acoustic models strongly determined
    by recording conditions
  • recognition performance dependent on match
    between recording conditions and run-time
    conditions
  • Use of a language model induces a word bias: for words outside the vocabulary, the best-matching word is selected
  • Solution: use a garbage model

26
Advances
  • Confidence measures for recognition results
  • Based on acoustic similarity
  • Or based on actual confusions for a database
  • Or taking into consideration the acoustic properties of the input signal
  • Dynamic (state-dependent) loading of the language model
  • Parallel recognizers
  • e.g. In-Vehicle Information Systems (IVIS): separate recognizers for the navigation system, entertainment systems, mobile phone, general purpose
  • choice on the basis of confidence scores
  • Further developments
  • Parallel recognizer for hyper-articulate speech

27
State of the art performance
  • 98-99.8% correct for small-vocabulary speaker-independent recognition
  • 92-98% correct for speaker-dependent large-vocabulary recognition
  • 50-70% correct for speaker-independent mid-size vocabulary

28
Recognition of prosody
  • Observable manifestations: pitch, temporal properties, silence
  • Function: emphasis, phrasing (e.g. through pauses), sentence type (question/statement), emotion etc.
  • Relevant to understanding/interpretation, e.g.
  • Mary knows many languages you know
  • Mary knows many languages, you know
  • Influence on realisation of phonemes: used to be considered noise, but contains relevant information

29
contents
  • Speech input technology
  • Speech recognition
  • Language understanding
  • Consequences for design
  • Speech output technology
  • Consequences for design
  • project

30
Natural language processing
  • Full parse or keyword spotting (concept spotting)
  • Keyword spotting
  • <any> keyword <any>
  • e.g. <any> DEPARTURE <any> DESTINATION <any>
  • can handle
  • Boston New York
  • I want to go from Boston to New York
  • I want a flight leaving at Boston and arriving at New York
  • Semantics (mapping onto functionality) can be specified in the grammar
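The <any> DEPARTURE <any> DESTINATION <any> pattern can be sketched with a regular expression over the known station names; the two-city list is illustrative:

```python
# Concept spotting: ignore filler words, extract the two city slots.
import re

CITIES = "Boston|New York"
PATTERN = re.compile(rf".*?\b({CITIES})\b.*?\b({CITIES})\b.*", re.I)

def spot(utterance):
    """Return (departure, destination) or None if no match."""
    m = PATTERN.match(utterance)
    return (m.group(1), m.group(2)) if m else None
```

This handles all three example utterances from the slide, from the terse "Boston New York" to the full sentence, because everything between the keywords is absorbed by `.*?`.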

31
contents
  • Speech input technology
  • Speech recognition
  • Language understanding
  • Consequences for design
  • Speech output technology
  • Consequences for design
  • project

32
coping with technological shortcomings of ASR
  • Shortcomings
  • Reliability/robustness
  • Architectural complexity of an always-open system
  • Lack of transparency in case of input limitations
  • Task for design of speech interfaces
  • induce the user to modify behaviour to fit the requirements (restrictions) of the technology

33
Solutions
  • "Always open" is the ideal
  • push-to-talk button / recognition window
  • spoke-too-soon problem
  • Barge-in (requires echo cancellation, which may be complicated depending on the reverberation properties of the environment)
  • Make training conditions (properties of the training corpus) similar to test conditions
  • e.g. special corpora for the car environment

34
  • Good prompt design to give clues about required
    input

35
contents
  • Speech input technology
  • consequences for design
  • Speech output technology
  • Technology
  • Human factors in speech understanding
  • Consequences for design
  • project

36
Components of conversational interfaces
Application
Speech Synthesis
Language Generation
Dialogue Manager
Natural Language Analysis
Speech recognition
37
demos
  • http://www.ims.uni-stuttgart.de/moehler/synthspeech/examples.html
  • http://www.research.att.com/ttsweb/tts/demo.php
  • http://www.acapela-group.com/text-to-speech-interactive-demo.html
  • http://cslu.cse.ogi.edu/tts/
  • Audiovisual speech synthesis
  • http://www.speech.kth.se/multimodal/
  • http://mambo.ucsc.edu/demos.html
  • Emotional synthesis (Janet Cahn)
  • http://xenia.media.mit.edu/~cahn/emot-speech.html

38
Applications
  • Information Access by phone
  • news/weather, timetables (OVR), reverse directory, name dialling,
  • spoken e-mail etc.
  • Customer Ordering by phone (call centers)
  • IVR: ASR replaces tedious touch-tone actions
  • Car Driver Information by voice
  • navigation, car traffic info (RDS/TMC), Command & Control (VODIS)

39
  • Interfaces for the Disabled
  • MIT/DECtalk (Stephen Hawking)
  • In the office and at home (near future?)
  • Command & Control, navigation for home entertainment

40
Output technology
Language Generation
Speech Synthesis
Dialogue Manager
Application (e.g. E-mail)
Application (Information service)
41
Language generation
  • Eindhoven → Amsterdam CS
  • Departure time:  08:32  08:47  09:02  09:17  09:32
  • Arrival time:    09:52  10:10  10:22  10:40  10:52
  • Transfers:       0      1      0      1      0

42
  • If nr_of_records > 1:
  • "I have found <n> connections"
  • "The first connection leaves at <time_dep> from <departure> and arrives at <time_arr> at <destination>"
  • "The second connection leaves at <time_dep> from <departure> and arrives at <time_arr> at <destination>"

43
  • If the user also wants information about whether
    there are transfers, either other templates have
    to be used, or templates might be composed from
    template elements
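The template filling described above can be sketched as follows; the slot names follow the slides, while the ordinal list and function names are illustrative assumptions:

```python
# Template-based generation for the train-information example.
TEMPLATE = ("The {ordinal} connection leaves at {time_dep} from "
            "{departure} and arrives at {time_arr} at {destination}.")
ORDINALS = ["first", "second", "third", "fourth", "fifth"]

def generate(records, departure, destination):
    """records: list of (time_dep, time_arr) pairs."""
    lines = [f"I have found {len(records)} connections."]
    for i, (time_dep, time_arr) in enumerate(records):
        lines.append(TEMPLATE.format(
            ordinal=ORDINALS[i], time_dep=time_dep, time_arr=time_arr,
            departure=departure, destination=destination))
    return lines

msgs = generate([("08:32", "09:52"), ("08:47", "10:10")],
                "Eindhoven", "Amsterdam CS")
```

Adding transfer information would mean either a second template or, as the slide suggests, composing templates from smaller template elements.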

44
speech output technologies
  • canned (pre-recorded) speech
  • Suited for call centers, IVR
  • fixed messages/announcements

45
Concatenation of pre-recorded phrases
  • Suited for bank account information, database enquiry systems with structured data and the like
  • Template-based, e.g.
  • "your account is <account>"
  • "the flight from <departure> to <destination> leaves at <date> at <time> from <gate>"
  • "the number of the customer is <telephone_number>"
  • Requirements: a database of phrases to be concatenated
  • Some knowledge of speech science required:
  • words are pronounced differently depending on
  • emphasis
  • position in utterance
  • type of utterance
  • differences concern both pitch and temporal properties (prosody)

46
  • Compare different realisations of "Amsterdam" in
  • do you want to go to Amsterdam? (emphasis, question, utterance-final)
  • I want to go to Amsterdam (emphasis, statement, utterance-final)
  • are there two stations in Amsterdam? (no emphasis, question, utterance-final)
  • there are two stations in Amsterdam (no emphasis, statement, utterance-final)
  • do you want to go to Amsterdam Central Station? (no emphasis, question, utterance-medial)
  • Solution
  • have words pronounced in context to obtain different tokens
  • apply clever splicing techniques for smooth concatenation

47
text-to-speech conversion (TTS)
  • Suited for unrestricted text input all kinds of
    text
  • reading e-mail, fax (in combination with optical
    character recognition)
  • information retrieval for unstructured data
    (preferably in combination with automatic
    summarisation)
  • Utterances made up by concatenation of small
    units and post-processing for prosody, or by
    concatenation of variable units

48
TtS technology
  • Distinction between
  • linguistic pre-processing and
  • synthesis
  • Linguistic pre-processing
  • Grapheme-phoneme conversion: mapping written text onto a phonemic representation, including word stress
  • Prosodic structure (emphasis, boundaries including pauses)

49
TtS linguistic pre-processing: grapheme-phoneme conversion
  • To determine how a word is pronounced
  • consult a lexicon, containing
  • a phoneme transcription
  • syllable boundaries
  • word accent(s)
  • and/or develop pronunciation rules
  • Output
  • Enschede → . En-sx@-de .
  • Kerkrade → . kErk-ra-d@ .
  • 's-Hertogenbosch → . sEr-to-x@n-bOs .

50
  • Pros and cons of lexicon
  • phoneme transcriptions are accurate
  • (high) risk of out-of-vocabulary words because
    the lexicon
  • often contains only stems, no inflections, nor
    compounds
  • is never up to date / complete
  • but usually the application includes a user
    lexicon

51
  • Pros and cons of pronunciation rules
  • no out-of-vocabulary words
  • transcription results are often wrong for
  • (longer) combinations of words / morphemes
  • exceptions and loan-words from other languages
  • Best solution is a combination of the two
    methods
  • develop a list of words incorrectly transcribed
    by the rules and put these words in an exception
    lexicon
  • words not occurring in the exception list are
    then transcribed by rule
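The combined strategy above can be sketched as follows: look a word up in the exception lexicon first, and fall back to letter-to-sound rules otherwise. The lexicon entry and the trivial rules are invented for illustration, not a real Dutch rule set:

```python
# Grapheme-phoneme conversion: exception lexicon first, rules second.

EXCEPTIONS = {"enschede": "En-sx@-de"}       # hand-corrected transcription

RULES = {"sch": "sx", "ch": "x", "e": "@"}   # toy letter-to-sound rules

def transcribe(word):
    word = word.lower()
    if word in EXCEPTIONS:                   # exception lexicon wins
        return EXCEPTIONS[word]
    out, i = "", 0
    while i < len(word):                     # longest-match rule application
        for pattern in sorted(RULES, key=len, reverse=True):
            if word.startswith(pattern, i):
                out += RULES[pattern]
                i += len(pattern)
                break
        else:                                # no rule: copy letter unchanged
            out += word[i]
            i += 1
    return out
```

Words that the rules get wrong are simply added to `EXCEPTIONS`; everything else is transcribed by rule, exactly the division of labour the slide proposes.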

52
  • Complications
  • Words with the same written form but different pronunciations and different meanings: record (noun) vs record (verb)
  • requires parsing or a statistical approach
  • Proper names and other specialized vocabularies, acronyms/abbreviations (small announcements in journals!)
  • Need to be included in the (user) lexicon
  • Different kinds of numbers (telephone numbers, amounts, credit card numbers etc.)
  • Require number grammars

53
TtS linguistic pre-processing: prosody
  • Emphasis, boundaries (including pauses), sentence type
  • Observable manifestations: pitch, temporal properties, silence
  • Requires analysis of linguistic structure (parsing) and (ideally) discourse-level information (cf. the earlier Amsterdam example)

54
TtS synthesis
  • Concatenation from words and phrases is practically impossible:
  • database too large (especially if you need several versions of each word) and
  • no full coverage (out-of-vocabulary words)
  • Approaches
  • sub-word units
  • data-oriented approach

55
Synthesis by subword units
  • Common approach: diphone synthesis
  • linking together pre-recorded diphones, i.e. short segments (transitions between two successive phonemes), extracted from natural speech
  • 's-Hertogenbosch
  • phonemes:  . s E r t o x @ n b O s .
  • diphones:  .s sE Er rt to ox x@ @n nb bO Os s.
  • In all, 1600 transitions per language (40 × 40)
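The diphone decomposition on this slide is simply each pair of successive phonemes ("." marks utterance-boundary silence), which can be sketched directly:

```python
# Diphone decomposition: each diphone spans the transition between
# two successive phonemes, including the boundary silences ".".

def to_diphones(phonemes):
    return [a + b for a, b in zip(phonemes, phonemes[1:])]

# 's-Hertogenbosch, as transcribed on the slide
phonemes = [".", "s", "E", "r", "t", "o", "x", "@", "n", "b", "O", "s", "."]
diphones = to_diphones(phonemes)
```

A sequence of n phoneme symbols yields n-1 diphones; with roughly 40 sounds per language this gives the 40 × 40 = 1600 possible transitions mentioned above.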

56
  • Synthesis
  • concatenate diphones in the correct order
  • perform some (intensity) smoothing at the
    diphone borders
  • adjust phoneme duration and pitch course,
    according to prosody rules

57
Data-oriented approach
  • Generalization of diphone approach
  • Store a large database of speech (running text)
  • Run-time
  • generate structure representing phoneme sequence
    and prosodic properties needed
  • Search algorithm
  • find the largest possible fragments containing
    the required properties in the database

58
  • Frei-burg
  • Nürn-berg
  • → Frei-berg
  • /fr/ also usable for Fr-iedrichshafen
  • items in the database are re-usable
  • Concatenate the fragments as they are, without post-processing for pitch and duration
  • → in this way, not only phoneme parameters and transitions are taken from the data, but also pitch and temporal properties
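The search for the largest re-usable fragments can be sketched with a greedy longest-match over a toy database (ASCII spellings stand in for phoneme strings; a real unit selector optimizes join costs rather than matching greedily):

```python
# Greedy unit-selection sketch: cover the target string with the
# longest fragments found in a toy database of recorded utterances.

DATABASE = ["freiburg", "nurnberg"]

def longest_fragment(target, db):
    """Longest prefix of target occurring in any database utterance."""
    for end in range(len(target), 0, -1):
        if any(target[:end] in utt for utt in db):
            return target[:end]
    return None

def cover(target, db):
    fragments = []
    while target:
        frag = longest_fragment(target, db)
        if frag is None:
            return None          # out-of-database material
        fragments.append(frag)
        target = target[len(frag):]
    return fragments

parts = cover("freiberg", DATABASE)   # reuses pieces of both recordings
```

"freiberg" is synthesized from a fragment of "freiburg" plus a fragment of "nurnberg", illustrating how items in the database are re-used.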

59
  • Advantage: natural speech quality preserved (but this may not always be desirable; maybe it should be made clear to people that they are talking to a system)
  • Disadvantage: no explicit control of voice characteristics and prosodic characteristics such as pitch and speaking rate (which you might want to manipulate for synthesis of emotional speech or for conveying a certain personality)

60
  • Difficult or impossible to modify speaker characteristics
  • Other speaker: new database required
  • Other speaking style: new database required
  • Research: post-processing of the result with preservation of speech quality

61
  • Hybrid synthesis
  • Combination of phrase concatenation and TTS
  • suited for template-based synthesis with a fixed message structure and variable slots
  • "the flight from <departure> to <destination> leaves at <date> at <time> from <gate>"
  • → in dialogue systems the system has knowledge of the message structure and can select the proper tokens from the database on the basis of this knowledge

62
Future: markup languages
  • structured text
  • current TTS systems strip text annotations (plain ASCII standard)
  • draft XML proposals for synthesis, e.g. SALT

63
Contents
  • Speech input technology
  • consequences for design
  • Speech output technology
  • Technology
  • Human factors in speech understanding
  • Consequences for design
  • project

64
issues in comprehension
  • Speech quality
  • reduced quality slows down feature extraction, and mapping the input onto feature vectors will increase the number of matching vectors
  • → requires compensation by top-down processing, taking more time, effort and practice
  • Text-to-speech: written text is often difficult to understand when read aloud, due to complex structures, high information density etc.

65
application to synthetic speech
  • substandard quality of synthetic speech requires compensation by (resource-limited) top-down processing
  • → potential overload of the system due to time constraints
  • slowing down the speaking rate is a very effective way to give the listener more processing time

66
Case study picking up information from speech
  • study on auditory exploration of lists (Pitt & Edwards, 1997)
  • recall of a list of 48 file names
  • presented in groups
  • group size varied (2, 3 or 4)
  • presentation of groups listener-paced
  • recall immediately after each group

67
(No Transcript)
68
adjustments
  • analysis of list speaking style
  • → grouping principles
  • always try to group
  • grouping by filenames and extensions
  • large groups first
  • mnemonic links between groups
  • → prosodic structuring

69
  • evaluation
  • directory with four subdirectories, each containing files with four different names, corresponding to the modules of a programming project, and with three different extensions
  • task: find the most recent version of the four files containing the source code for the modules and copy them into a new directory
  • measures: objective (task completion) and subjective
  • results for task completion: new algorithm 10.39 min, old version 24.12 min

70
Contents
  • Speech input technology
  • Consequences for design
  • Speech output technology
  • Technology
  • Human factors in speech understanding
  • Consequences for design
  • Project

71
Design implications
  • choice of technology can be made dependent on the needs of the application
  • for restricted domains, very high quality can be achieved through canned speech or phrase concatenation with multiple tokens
  • for concatenation with unit selection there is a relation between quality and the size of the database; for good systems there is usually no problem with intelligibility, even with inexperienced listeners

72
  • high quality for diphone speech, needed for uncommon forms such as proper names or company names that are unlikely to be available from a corpus, still requires much effort
  • importance of learning effects
  • a general finding is that acceptance of synthetic speech depends strongly on voice quality
  • if the trade-off between quality and added value is negative, prospects for acceptance of the speech interface are poor

73
speech as output modality speech vs text/graphics
  • Text/graphics
  • an image may be worth a thousand words
  • image/written text is persistent
  • image is (at least) two-dimensional: temporal and spatial organisation
  • visual expression of hierarchical structure
  • receiver-paced
  • but non-adaptive (until recently)
  • now: adaptive hypertext

74
  • Speech
  • one-dimensional: extends only in time
  • → spatial issues better dealt with in another modality
  • sender-paced
  • a poor medium, yet popular:
  • a large amount of speech-based communication serves primarily a social function
  • no need for supporting aids such as paper and pen
  • no special motoric abilities needed
  • speaking is fast

75
heuristics: speech output preferred when
  • message is simple
  • message is short
  • message need not be referred to later
  • message deals with events in time
  • message requires an immediate response
  • visual channels are overloaded
  • environment is brightly lit, poorly lit, subject to severe vibration or otherwise adverse to transmission of visual information
  • user must be free to move around
  • (from Michaelis & Wiggins)
  • pronunciation is the subject of interaction

76
but speech output preferably not used when
  • message is complex or uses unfamiliar terms
  • message is long
  • message needs to be referred to later
  • message deals with spatial topics
  • message is not urgent
  • auditory channels are overloaded
  • environment is too noisy
  • user has easy access to a screen
  • system output consists of many different kinds of information which must be available simultaneously and be monitored and acted upon by the user
  • (from Michaelis & Wiggins)
  • → environmental variation and mixed interaction call for multimodal interfaces

77
contents
  • Speech input technology
  • consequences for design
  • Speech output technology
  • Main points
  • Project

78
Main points
  • Database approach, requiring large databases for individual languages and speaking styles, dominant both for speech input and output
  • Input: databases for training acoustic models and language model
  • Output: concatenation of segments and phrases taken from a database
  • Large differences concerning performance of speech recognition and quality of output for different languages and target groups (e.g. recognition for children)
  • Speech input: three major classes of applications: command & control, information services, dictation systems
  • Major parameters: speaker-dependent/independent, vocabulary size (small, medium, large), rigid vs. free-format input

79
  • Dialogue management
  • Finite-state or frame-based approach for task-oriented dialogue acts
  • Verification strategies and repair mechanisms for dialogue control
  • Pragmatic approaches to language understanding and language generation
  • Input: directly mapped onto application functionality
  • Output: template-based approaches
  • Not covered: speech monitoring, speech data mining applications and technology

80
  • Exercises with CSLU toolkit and other
    demonstrators
  • Try out your name, telephone numbers, dates,
    e-mail addresses, abbreviations etc.
  • Project
  • Protocol development
  • Dialogue structure
  • Strategies and prompts
  • Tomorrow
  • Wizard of Oz test