Title: ECE 16:332:527 Digital Speech Processing Lecture 18
1ECE 16:332:527 Digital Speech Processing Lecture 18
- Text-to-Speech (TTS) Synthesis Systems
2Text-to-Speech (TTS) Synthesis
- GOAL: convert arbitrary textual messages to intelligible and natural-sounding synthetic speech so as to transmit information from a machine to a person
3Text Analysis Components
Raw English Input Text
→ Basic Text Processing: Document Structure Detection, Text Normalization, Linguistic Analysis (with Dictionary)
→ Tagged Text
→ Phonetic Analysis: Homograph Disambiguation, Grapheme-to-Phoneme Conversion
→ Tagged Phones
→ Prosodic Analysis: Pitch and Duration Rules, Stress and Pause Assignment
→ Synthesis Controls (Sequence of Sounds, Durations, Pitch)
4Document Structure
- end of sentence marked by . ? ! is not infallible
- The car is 72.5 in. long
- e-mail and web pages need special processing
- Larry: Sure. I'll try to do it before Thursday :-) Ed
- multiple languages
- insertion of foreign words, unusual accent and diacritical marks, etc.
5Text Normalization
- abbreviations and acronyms
- Dr. is pronounced either as Doctor or Drive depending on context (Dr. Smith lives on Smith Dr.)
- St. is pronounced either as Street or Saint depending on context (I live on Bourbon St. in St. Louis)
- DC is either direct current or District of Columbia
- MIT is pronounced as either M I T or Massachusetts Institute of Technology but never as mitt
- DEC is pronounced as either deck or Digital Equipment Company but never as D E C
- numbers
- 370-1111 can be either three seven oh or three seventy-model 1111 for the IBM 370 computer
- 1920 is either nineteen-twenty or one thousand, nine hundred, twenty
- dates, times, currency, account numbers, ordinals, cardinals, math
- Feb. 15, 1983 needs to convert to February fifteenth, nineteen eighty-three
- $10.50 is pronounced as ten dollars and fifty cents
- part #10-50 needs to be pronounced as part number 10 dash fifty rather than part pound sign ten to fifty
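The Dr./St. examples above can be sketched as a context rule. This is a minimal illustration, not the method of any particular TTS system: the table `ABBREVIATIONS`, the function `expand_abbreviations`, and the capitalization rule itself are all assumptions made for this example.

```python
# Hypothetical table: abbreviation -> (reading before a name, reading after a name).
ABBREVIATIONS = {
    "Dr.": ("Doctor", "Drive"),
    "St.": ("Saint", "Street"),
}


def expand_abbreviations(text):
    """Expand Dr./St. by looking at the neighboring tokens."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if tok not in ABBREVIATIONS:
            out.append(tok)
            continue
        before_name, after_name = ABBREVIATIONS[tok]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        prev = tokens[i - 1] if i > 0 else ""
        # Assumed rule: a capitalized word right after the abbreviation means
        # a title (Doctor, Saint); otherwise a capitalized word right before
        # it means a street name (Drive, Street).
        if nxt[:1].isupper():
            out.append(before_name)
        elif prev[:1].isupper():
            out.append(after_name)
        else:
            out.append(tok)  # no usable context: leave unexpanded
    return " ".join(out)


print(expand_abbreviations("Dr. Smith lives on Smith Dr."))
print(expand_abbreviations("I live on Bourbon St. in St. Louis"))
```

The rule fails when a street abbreviation ends one sentence and a capitalized word starts the next, which is one reason document structure detection has to run first.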
6Text Normalization
- proper names
- Rudy Prpch is pronounced as Rudy Perpich
- Sorin Ducan is pronounced as Sorin Duchan
- part of speech
- read is pronounced as reed or red
- record is pronounced as rec-ard or ri-cord
- word decomposition
- need to decompose complex words into base forms
(morphemes) to determine pronunciation
(indivisibility needs to be decomposed into
in-di-visible-ity to determine pronunciation)
7Text Normalization
- proper handling of special symbols in text
- punctuation, e.g., . , - -- _ ( ) @ ! < > ? / \
- string resolution
- 10:20 can be pronounced as either twenty after ten (as a time) or ten to twenty (as a sequence)
8Linguistic Analysis
- part-of-speech (POS)
- word sense
- phrases
- anaphora
- emphasis
- style
- a conventional parser could be used, but
typically a simple shallow analysis is done for
speed (parsers are not real-time!)
9Homograph Disambiguation
- an absent boy versus do you choose to absent yourself?
- they will abuse him versus they won't take abuse
- an overnight bag versus are you staying overnight?
- he is a learned man versus he learned to play piano
- El Camino Real road versus real world
10Letter-to-Sound (LTS) Conversion
- CART (Classification and Regression Tree) analysis
- conventional dictionary search with letter-to-sound rules
Flow of the dictionary-based approach:
Input Word
→ Whole Word Dictionary Probe (yes: done; no: continue)
→ Affix Stripping
→ Root Dictionary Probe (yes: go to Affix Reattachment; no: continue)
→ Letter-to-Sound and Stress Rules
→ Affix Reattachment
→ phonemes, stress, parts-of-speech
This is only ONE way of doing LTS. FSMs are another.
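The dictionary-probe flow can be sketched in a few lines. This is a toy illustration under stated assumptions: the dictionaries, the suffix list, the one-letter-per-phone fallback `letter_to_sound`, and the function name `pronounce` are all hypothetical. Stress and part-of-speech outputs are omitted, and the rule fallback is applied to the whole word instead of the stripped root.

```python
# Toy data; every entry is an illustrative assumption.
WORD_DICT = {"cat": ["k", "ae", "t"]}
ROOT_DICT = {"walk": ["w", "ao", "k"]}
SUFFIXES = {"ing": ["ih", "ng"], "s": ["z"]}
VOWEL_RULES = {"a": "ae", "e": "eh", "i": "ih", "o": "aa", "u": "ah"}


def letter_to_sound(word):
    """Naive fallback: one phone per letter, consonants map to themselves."""
    return [VOWEL_RULES.get(ch, ch) for ch in word]


def pronounce(word):
    word = word.lower()
    # 1. Whole-word dictionary probe.
    if word in WORD_DICT:
        return list(WORD_DICT[word])
    # 2. Affix stripping followed by a root dictionary probe.
    for suffix, suffix_phones in SUFFIXES.items():
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in ROOT_DICT:
            # 3. Affix reattachment: root phones plus suffix phones.
            return list(ROOT_DICT[root]) + list(suffix_phones)
    # 4. Nothing found: fall back to letter-to-sound rules.
    return letter_to_sound(word)


print(pronounce("cat"), pronounce("walking"), pronounce("zib"))
```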
11Prosody
- pauses
- to indicate phrases and to avoid running out of breath
- pitch
- fundamental frequency (F0) as a function of time
- rate/relative duration
- phoneme durations, timing, and rhythm
- loudness
- relative amplitude/volume
12Symbolic and Phonetic Prosody
parsed text and phone string
→ Symbolic Prosody: Pauses, Prosodic Phrases, Accent, Tone, Tune (input: Speaking Style)
→ Prosody Attributes: Pitch Range, Prominence, Declination
→ F0 Contour Generation
→ F0 Contour
13ToBI tones
- pitch accent tones
- intermediate phrasal tones (L-, H-)
- boundary tones (L-L%, L-H%, H-H%, H-L%, %H)
14Marianna Made the Marmalade
15F0 Contour Generation
- anchor points from accent, tone, pitch range, prominence and declination
- F0 contour is obtained by interpolation
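A minimal sketch of the interpolation step, assuming the anchor points are already available as (time, F0) pairs and that linear interpolation is acceptable (the slide does not say which interpolation is used). The function name `f0_contour`, the 100 frames/s rate, and the anchor values are assumptions for illustration.

```python
import numpy as np


def f0_contour(anchor_times, anchor_f0, frame_rate=100.0):
    """Linearly interpolate F0 anchor points onto a uniform frame grid."""
    anchor_times = np.asarray(anchor_times, dtype=float)
    anchor_f0 = np.asarray(anchor_f0, dtype=float)
    span = anchor_times[-1] - anchor_times[0]
    n_frames = int(round(span * frame_rate)) + 1
    t = anchor_times[0] + np.arange(n_frames) / frame_rate
    return t, np.interp(t, anchor_times, anchor_f0)


# Hypothetical anchors: start at 120 Hz, accent peak at 180 Hz, final fall to 100 Hz.
t, contour = f0_contour([0.0, 0.5, 1.0], [120.0, 180.0, 100.0])
print(len(contour), contour[25], contour[50], contour[100])
```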
16Full TTS System
Blocks: Text, Text/Language, Text Analysis, Parsing, Letter-to-Sound Rules, Prosody, Synthesis Backend, Speech, Face Sync
Example input: Dr. Smith lives at 23 Lakeshore Dr.
Synthesis model impacts TTS, as well as unit representation.
- choice of units: words, phones, diphones, dyads, syllables
- choice of parameters: LPC, formants, waveform templates, articulatory parameters, sinusoidal parameters
- method of computation: rules, concatenation
Text Analysis
- numerical expansion (dates, times, ...)
- abbreviations, acronyms
- proper name id
Parsing
- sentences, phrase and breath groups
- semantic and syntactic accent
- stress of compounds
- intonation types
- parts of speech
Letter-to-Sound Rules and Prosody
- text and name dictionaries
- timing and duration: rate, stress (position, context)
- intonation/pitch (type of phrase: list, statement, ??)
- phonetic substitution (vowel reduction, flapping rules)
- loudness/amplitude (phrasal amp., local reduction)
17Word Concatenation Synthesis
- words in sentences are much shorter than in isolation (up to 50% shorter) (see next page)
- words cannot preserve sentence-level stress, rhythm or intonation patterns
- too many words to store (1.7 million surnames), extended words using prefixes and suffixes
18Word Concatenation
19Proper Name Statistics
20Statistics of Proper Name Coverage
name pronunciation based on etymology
21Concatenative Word Synthesis
several examples of concatenated words and full
sentences--SBR
word-based synthesis does not work!
22Speech Synthesis Methods
- 1939: the VODER (Voice Operated DEmonstratoR), Homer Dudley
- based on a simple model of speech sound production
- select voicing source (with foot pedal control of pitch) or noise source
- ten filters shaped the source to produce vocal or noise-excited sounds, controlled by finger motions
- separate keys for stop sounds
- wrist bar control of signal energy
23 The VODER
24Articulatory Synthesis
- in theory can create more natural and more realistic motions of the articulators (rather than formant parameters), thereby leading to more natural sounding synthetic speech
- utilize physical constraints of articulator movements
- use X-ray data to characterize individual speech sounds
- model how articulatory parameters move smoothly between sounds
- direct method: solve wave equation for sound pressure at lips
- indirect method: convert to formants or LPC parameters for final synthesis in order to utilize existing synthesizers
- use highly constrained motions of articulatory parameters
25Articulatory Model
- Articulatory Parameters of Model
- lip opening: W
- lip protrusion length: L
- tongue body height and length: Y, X
- velar closure: K
- tongue tip height and length: B, R
- jaw raising (dependent parameter)
- velum opening: N
26Articulatory Models
27Articulatory Synthesis Using Formant Synthesizer
Backend
Cecil Coker--teaching computers to talk
Articulatory Synthesis of SpeechCecil Coker
28A DIRECT Approach Analysis of the Vocal Tract in
the Frequency Domain
Chain (ABCD) matrix relating Pin, Uin (input) to Pout, Uout (output)
The VT transfer function is written in terms of the chain-matrix elements and the impedance at the lips
Matrix elements of a lossless section (length l and cross-section A = const.) depend on c (speed of sound) and ρ (density of air)
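A minimal numerical sketch of the chain-matrix idea, in one common textbook convention that is assumed here and may differ in sign or direction from the slide's own equations: for a lossless section, Pin = cos(kl) Pout + j Z0 sin(kl) Uout and Uin = j sin(kl)/Z0 Pout + cos(kl) Uout, with k = ω/c and Z0 = ρc/A. With a lip impedance ZL = Pout/Uout, the volume-velocity transfer function is Uout/Uin = 1/(C ZL + D). The constants, function names, and the ideal open-lip default ZL = 0 are assumptions.

```python
import numpy as np

C_SOUND = 350.0  # speed of sound in m/s (assumed value)
RHO = 1.2        # density of air in kg/m^3 (assumed value)


def section_matrix(freq_hz, length_m, area_m2):
    """Chain matrix of one lossless uniform tube section."""
    kl = 2.0 * np.pi * freq_hz / C_SOUND * length_m
    z0 = RHO * C_SOUND / area_m2  # characteristic impedance of the section
    return np.array([[np.cos(kl), 1j * z0 * np.sin(kl)],
                     [1j * np.sin(kl) / z0, np.cos(kl)]])


def transfer_function(freq_hz, sections, z_lips=0.0):
    """Uout/Uin for sections listed from glottis to lips as (length, area)."""
    m = np.eye(2, dtype=complex)
    for length_m, area_m2 in sections:
        m = m @ section_matrix(freq_hz, length_m, area_m2)
    c_elem, d_elem = m[1, 0], m[1, 1]
    return 1.0 / (c_elem * z_lips + d_elem)


# Uniform 17.5 cm tube split into five sections: resonances near 500, 1500, ... Hz.
tube = [(0.035, 5e-4)] * 5
for f in (0.0, 250.0, 490.0):
    print(f, abs(transfer_function(f, tube)))
```

For a uniform tube with an ideal open end, the result reduces to 1/cos(ωL/c), so the gain grows sharply as the frequency approaches c/(4L) = 500 Hz.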
29Vocal Tract Analysis
30Articulatory Synthesis by Copying Measured Vocal
Tract Data
- fully automatic closed-loop optimization
- initialized from articulatory codebooks, neural nets
- Schroeter and Sondhi, 1987
- One example original re-synthesis
31Articulatory Synthesis Issues
- requires highly accurate models of glottis and vocal tract
- requires rules for dynamics of the articulators
32Vocal Tract to LPC
33LPC Implementation
34Serial Synthesis from LPC
Note: H(z) has unity gain at DC (ω = 0, z = 1)
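One way to obtain unity gain at DC in a serial (cascade) synthesizer is to scale each second-order resonator by the value of its denominator at z = 1. The sketch below assumes the usual two-pole resonator with r = exp(-πB/fs) and θ = 2πF/fs; the exact H(z) on the slide is not reproduced here, and the formant frequencies and bandwidths are illustrative values only.

```python
import numpy as np


def resonator_coeffs(formant_hz, bandwidth_hz, fs):
    """Two-pole resonator g / (1 + a1 z^-1 + a2 z^-2) with unity gain at DC."""
    r = np.exp(-np.pi * bandwidth_hz / fs)
    theta = 2.0 * np.pi * formant_hz / fs
    a1 = -2.0 * r * np.cos(theta)
    a2 = r * r
    gain = 1.0 + a1 + a2  # denominator evaluated at z = 1
    return gain, a1, a2


def cascade_response(freq_hz, formants, fs):
    """Frequency response of the cascade at one frequency."""
    z_inv = np.exp(-2j * np.pi * freq_hz / fs)
    h = 1.0 + 0j
    for f, b in formants:
        g, a1, a2 = resonator_coeffs(f, b, fs)
        h *= g / (1.0 + a1 * z_inv + a2 * z_inv ** 2)
    return h


def cascade_filter(excitation, formants, fs):
    """Run an excitation through the resonators one after another."""
    y = np.asarray(excitation, dtype=float)
    for f, b in formants:
        g, a1, a2 = resonator_coeffs(f, b, fs)
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = g * y[n]
            if n >= 1:
                out[n] -= a1 * out[n - 1]
            if n >= 2:
                out[n] -= a2 * out[n - 2]
        y = out
    return y


formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]  # assumed values
print(abs(cascade_response(0.0, formants, 10000.0)))
print(abs(cascade_response(500.0, formants, 10000.0)))
```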
35Source-Filter Synthesis Models
- cascade/serial (formant) synthesis model
36Serial/Formant Synthesis Model
37Serial/Formant Synthesis Model
- flaws in the serial/formant synthesis model
- can't handle voiced fricatives
- no zeros for nasal sounds
- no precise control for stop consonants
- pitch pulse shape fixed, independent of pitch
- spectral compensation is inadequate
To Be -Bell Labs
Daisy-Daisy with music
SPASS synthesis
JSRU Synthesis
OVE 1--Fant
We Wish You
38Parallel Synthesis Model
A serial synthesizer is a good approach for open,
non-nasal vocal tracts (vowels, liquids). For
obstruents and nasals, we need to control the
amplitudes of each resonance, and to introduce
zeros in addition to the poles.
A parallel synthesizer provides more flexibility in matching spectrum levels at formant frequencies (via gain controls); however, zeros are introduced into the spectrum.
39Parallel Synthesis
- issues
- need individual resonance amplitudes (A1, ..., A4); if resonances are close, this is a messy calculation
- phasing of resonances neglected (the Bk z^-1 terms)
- synthetic speech has both resonances and zeros (at frequencies between the resonances) that may be perceptible
- better reproduction of complex consonants
Parallel synthesis from BYU
Parallel synthesis-Holmes
40More Advanced Synthesizer
41More Advanced Synthesizer
- synthesizer features
- glottal pulse modeled directly using several tunable parameters
- breathiness component added to glottal source
- aspiration source included in voiced sound loop to enable voiced fricative production
- pole-zero model for voiced speech
- radiation modeled separately for both voiced and unvoiced speech
42More Versatile Synthesizer (Serial-Parallel)
43Voiced Fricative Synthesis
Klatt Talk (MIT, 1986)
44Continuing Evolution (1959-1987)
- Haskins, 1959
- KTH Stockholm, 1962
- Bell Labs, 1973
- MIT, 1976
- MIT-talk, 1979
- Speak N spell, 1980
- BELL Labs, 1985
- Dec talk, 1987
45Text-to-Speech Synthesis (TTS) Evolution
Formant Synthesis: Poor Intelligibility, Poor Naturalness
  Bell Labs, Joint Speech Research Unit, MIT (DEC-Talk), Haskins Lab
LPC-Based Diphone/Dyad Synthesis: Good Intelligibility, Poor Naturalness
  Bell Labs, CNET, Bellcore, Berkeley Speech Technology
Unit Selection Synthesis: Good Intelligibility, Customer Quality Naturalness (Limited Context)
  ATR in Japan, CSTR in Scotland, BT in England, AT&T Labs (1998), L&H in Belgium
Time axis (Year): 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997
46Speech Synthesis: the 90s
- what changed?
- TTS was highly intelligible but extremely unnatural sounding
- a decade of work had not changed the naturalness substantially
- computation and memory grew with Moore's law, enabling highly complex concatenative systems to be created, implemented and perfected
- concatenative systems showed themselves capable of producing (in some cases) extremely natural sounding synthetic speech
47Concatenation TTS Systems
- key idea: use segments of recorded speech for synthesis
- data driven approach → more segments give better synthesis → using an infinite number of segments leads to perfect synthesis
- key issues
- what units to use
- how to select units from natural speech
- how to label and extract consistent units from a large database
- what signal representation should be used for spectrally smoothing units (at junctures) and for prosody modification (pitch, duration, amplitude)
48Concatenation Units
- choice of units
- Words: there are an infinite number of them
- Syllables: there are about 10K in English
- Phonemes: there are about 45 in English
- Demi-syllables: there are about 2500 in English
- Diphones: there are about 1500-2500 in English
49Choice of Units
Unit               Units (English)
Allophone          60-80
Diphone            < 40^2-65^2
Triphone           < 40^3-65^3
Demisyllable       2K
Syllable           11K
VCV, 2-syllable    < (11K)^2
Word               100K-1.5M
Phrase, Sentence   ∞
From the top of the table to the bottom: Length goes from Short to Long, Quality from Low to High, and Rules / Necessary Unit Modifications from Many to Few.
50Corpus Coverage by Unit Type
[Plot: corpus coverage (up to 1.0) versus Top N Surnames (rank), from 1 to 2 M; Units (1 token/unit); annotated slopes .83, .23, .15, .11, .02, .003]
NOTE: depends on domain (here SURNAMES)
51Concatenation Units
- Words
- no complete coverage for broad domains → words have to be supplemented with smaller units
- limited ability to modify pitch, amplitude and duration without losing naturalness and intelligibility
- need huge database to extract multiple versions of each word
- Sub-Words
- hard to isolate in context due to co-articulation
- need allophonic variations to characterize units in all contexts
- puts large burden on signal processing to smooth at unit join points
52Concatenation Unit Representation
- LPC
- simple, easy to concatenate units, efficient for modification of pitch (since it is inherently separated from vocal tract spectra)
- doesn't work well for nasals (lack of nasal zeros)
- glottal excitation not correct (assumed pitch pulses)
- doesn't work well for mixed excitation (basic LPC assumptions)
- buzzy
- TD-PSOLA: Time Domain, Pitch Synchronous Overlap Add Synthesis
- efficient prosody modification (pitch synchronous)
- no smoothing at join points
53Speech Waveform Models
- time-domain source-filter models (LPC)
- filter represents the vocal tract
- synthetic glottal pulse excites the filter
- filter produces synthetic speech
Filter
54Speech Waveform Models
- time-domain modification (e.g., PSOLA)
- window and shift pitch pulses
- pitch marks critical
[Figure: amplitude versus time; zeros added to extend period]
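The window-and-shift idea can be sketched as follows. This is a simplified illustration, not a full TD-PSOLA implementation: it assumes nearly constant pitch, known pitch marks, and no amplitude normalization after overlap-add. The function name `td_psola` and its parameters are assumptions.

```python
import numpy as np


def td_psola(signal, pitch_marks, pitch_scale):
    """Change pitch by re-spacing Hann-windowed, two-period pitch pulses."""
    signal = np.asarray(signal, dtype=float)
    marks = np.asarray(pitch_marks, dtype=int)
    period = int(np.median(np.diff(marks)))        # assumed near-constant
    new_period = int(round(period / pitch_scale))  # spacing of output pulses
    window = np.hanning(2 * period + 1)            # centered on a pitch mark
    out = np.zeros(len(signal))
    t = int(marks[0])
    while t <= marks[-1]:
        # Reuse the analysis pulse whose mark is closest to the output mark,
        # so duration is kept while pulses are repeated or skipped.
        k = int(marks[np.argmin(np.abs(marks - t))])
        lo, hi = k - period, k + period + 1
        if lo >= 0 and hi <= len(signal) and t - period >= 0 and t + period + 1 <= len(out):
            out[t - period:t + period + 1] += window * signal[lo:hi]
        t += new_period
    return out


# Impulse train with a 100-sample period, raised in pitch by a factor of 1.25.
sig = np.zeros(1001)
marks = np.arange(100, 901, 100)
sig[marks] = 1.0
out = td_psola(sig, marks, 1.25)
print(np.flatnonzero(out > 0.5))
```

Accurate pitch marks matter because each window is centered on one: a misplaced mark shifts the pulse inside its window and smears the output.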
55TD-PSOLA Synthesis
time domain modifications lead to spectral
distortions
56Concatenation Mismatches of TD-PSOLA
- phase mismatch: different relative position of OLA windows within left and right segments (LS and RS)
- pitch mismatch: different F0 in LS and RS
- timbre mismatch: different spectral envelopes in LS and RS
- lacking smoothness across concatenation points
Need to painfully optimize the segment database to get best segmental quality
57Temporal Envelope Mismatch in TD-PSOLA
- speech waveform changes abruptly from the left segment to the right segment
- concatenation point is detectable
- even when applied on LPC-residual and with smoothing spectral envelopes, audible glitches remain
58MBROLA (T. Dutoit)
- time-domain synthesis that combines the advantages of PSOLA (low computational cost, good prosody modification) with an off-line hybrid time/frequency algorithm for smoothing transitions.
- speech in database has constant pitch (100 Hz)
- amplitude and some general spectral smoothing
http://tcts.fpms.ac.be/synthesis/
59Speech Waveform Models
- sinusoidal models
- model signal as a sequence of time-varying
sinusoids
60HNM (Harmonic Noise Model) (Y. Stylianou)
- harmonic (low band) and noise (high band) classification of speech
- harmonic part modeled by a comb of sinusoids
- noise part modeled by a parametric time-domain envelope and spectrally shaped by an AR-model
- analysis/synthesis done pitch-synchronously without explicit pitch markers
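A rough sketch of the harmonic/noise band split for one frame. It is not Stylianou's algorithm: an FFT mask stands in for the AR spectral shaping, a Hann window stands in for the parametric time-domain envelope, and the function name `hnm_frame` and all parameter values are assumptions for illustration.

```python
import numpy as np


def hnm_frame(f0_hz, harmonic_amps, noise_gain, max_voiced_hz, fs, n_samples, seed=0):
    """Synthesize one frame as low-band harmonics plus high-band noise."""
    t = np.arange(n_samples) / fs
    # Harmonic part: comb of sinusoids at multiples of F0, kept below the
    # maximum voiced frequency.
    harmonic = np.zeros(n_samples)
    for k, amp in enumerate(harmonic_amps, start=1):
        if k * f0_hz <= max_voiced_hz:
            harmonic += amp * np.cos(2.0 * np.pi * k * f0_hz * t)
    # Noise part: white noise restricted to the band above the maximum
    # voiced frequency (FFT mask used here in place of an AR filter).
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectrum[freqs < max_voiced_hz] = 0.0
    noise = np.fft.irfft(spectrum, n=n_samples)
    envelope = np.hanning(n_samples)  # assumed time-domain envelope
    return harmonic + noise_gain * envelope * noise


frame = hnm_frame(100.0, [1.0, 0.5, 0.25], 0.1, 4000.0, 16000.0, 1600)
print(frame.shape)
```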
61Concatenation Unit Representation
- Hybrid Time-Frequency Representation
- easy to modify prosody
- segments easily smoothed at concatenation points
- can easily modify spectral envelope
62Hybrid Representation
63Practical Implementation of Hybrid Synthesis
[Block diagram; recoverable signal labels: Sr(ω), ai, fi, ep(n), hp(n,m), sp(n), er(n), hr(n,m), sr(n), s(n)]
64Speech Waveform Modification
- can't cover all possible combinations of feature variables in a database
- waveform modification is of vital importance for concatenative synthesis based on diphone units
- some attributes can be modified easily with signal processing techniques
65Powerful Voice Alteration
original modified
- prosody modification capabilities
- voice alteration female to child
- voice alteration child to adult male
HNM
PSOLA
66Hybrid Methods, Final Thoughts
- 10-20 times more complex than TD-PSOLA
- complexity issue can be addressed by using computationally expensive hybrid methods only offline during database generation while applying low-complexity time-domain-only synthesis online in the TTS system
- e.g., MBROLA
- for highest possible quality prosodic modifications, hybrid methods need to be applied online
- e.g., HNM
67Block Diagram of a Concatenative TTS System
Message Text (Alphabetic Characters)
→ Text Analysis, Letter-to-Sound, Prosody (uses Dictionary and Rules)
→ Phonetic Symbols, Prosody Targets
→ Assemble Units that Match Input Targets (uses Store of Sound Units)
→ Speech Waveform Modification and Synthesis
→ Speech
Female Male
68Concatenative Synthesis Unit Definition and
Extraction
/eh-s/ diphone
- Waveform
- Spectrogram
- Symbolic Representation
- word labels
- tone labels
- syllable and stress labels
- phone labels
- break indices
/s/ phone
69Issues with Unit Type
- sub-word units
- usually cut from over-articulated sentences read in an almost monotone style
- neglects effects of neighboring units (coarticulation)
- must include several allophonic variations
- neglects variations due to speaking style (news, announcements) and pitch
- puts a large burden on signal processing (smoothing, prosody modifications)
- small (30 minutes / 650 sentences) database, easy to label
70Procedure for Concatenative Synthesis
- off-line inventory preparation
- record speech corpus and process with coding method of choice
- determine location of speech units and store units in inventory
- on-line synthesis from text
- normalize input text (expand abbreviations, etc.)
- letter-to-sound (pronunciation dictionary and rules)
- prosody (melody/pitch, durations, stress patterns/amplitudes, ...)
- select appropriate sequence of units from inventory
- modify units (smooth at boundaries, match desired prosody)
- synthesize and output speech signal
71Unit Selection Synthesis
- need to optimally match units at boundaries, e.g., fundamental frequency (pitch), and spectrum
- need to automatically and efficiently select optimal sequence of units from database
- issues in Unit Selection Synthesis
- several examples in each unit category (from 10 to 10^6)
- waveform modification used sparingly (leads to perceived distortions)
- high intelligibility must be maintained
- customer quality attained with reasonable training set (1-10 hours)
- natural quality attained with large training set (10s of hours)
- unit selection algorithm must run in a fraction of real time on a state-of-the-art processor
72Why is Online Unit Selection Necessary?
Single Feature Distribution (within same category, here /ow/), e.g., pitch, duration, emphasis, spectral tilt, ...
Assumption: no labeling and feature extraction errors!
- Impossible to capture broad range of naturally occurring features with just one or two examples
73Unit Selection
- given target features, automatically find
sequence of units in the database that most
closely match these features
Trained perceptual distance metric
[Figure: database of candidate units labeled hh, eh, l, ow, ...]
74Unit Selection
- additionally, find sequence of units that best join each other
- find optimal path using Viterbi Search (Dynamic Programming)
[Figure: database of candidate units labeled hh, eh, l, ow, ..., with the selected path marked]
75(On-line) Unit Selection Viterbi Search
[Figure: Viterbi lattice of candidate units u-, -a, and a-u, with numbered instances a-u (1) to a-u (4), -a (1) to -a (3), and u- (1) to u- (3)]
- transitional (concatenation) costs are based on acoustic distances
- node (target) costs are based on linguistic identity of unit
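A minimal dynamic-programming sketch of the search, assuming each target position has a short list of candidate units and that the two cost functions are supplied by the caller. The unit representation (database index, F0) and the toy cost functions are assumptions for illustration; a real system would use trained perceptual costs.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search: minimize the sum of target and concatenation costs."""
    cost = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row_cost, row_back = [], []
        for c in candidates[i]:
            # Cheapest way to reach candidate c from any previous candidate.
            totals = [cost[i - 1][p] + join_cost(prev, c)
                      for p, prev in enumerate(candidates[i - 1])]
            best_prev = min(range(len(totals)), key=totals.__getitem__)
            row_cost.append(totals[best_prev] + target_cost(targets[i], c))
            row_back.append(best_prev)
        cost.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy example: a unit is (database index, F0 in Hz); targets are desired F0 values.
def toy_target_cost(target_f0, unit):
    return abs(target_f0 - unit[1])


def toy_join_cost(left, right):
    # Units that were adjacent in the recording join for free.
    return 0.0 if right[0] == left[0] + 1 else 10.0


targets = [100, 110, 120]
candidates = [[(0, 100), (10, 130)], [(1, 115), (20, 110)], [(2, 118), (30, 120)]]
print(select_units(targets, candidates, toy_target_cost, toy_join_cost))
```

In the toy data the search prefers three units that were recorded consecutively, even though other candidates match the target F0 more closely, because their joins cost nothing.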
76Unit Selection Measures
- USD (Unit Segmental Distortion): differences between desired spectral pattern of target and that of candidate unit, throughout whole unit
- UCD (Unit Concatenative Distortion): spectral discontinuity across boundaries of the concatenated units
- Example: source context want /w ah n t/
- target context cart /k ah r t/
- USD: (ah-n versus ah-r)
77Concatenative Synthesis
- concatenate recording chunks (senone, half-phone, diphone, phone, demisyllable, syllable, word, phrase, sentence)
- adjacent units have zero concatenation cost.
[Figure: target units t_j aligned with selected units θ_j, θ_j+1; Unit Cost (USD) links each target to its selected unit, Transition Cost (UCD) links consecutive selected units]
78Concatenation Cost
79Target Cost
80(Off-line) Weight Training
81Acoustic Target Cost
82Modern TTS Systems (Natural Voices from AT&T)
- Soliloquy from Hamlet
- Gettysburg Address
- Bob Story
- German female
- UK British female
- Spanish female
- Korean female
- French male
83Modern COMMERCIAL Systems
- Lucent
- AcuVoice
- Festival
- L&H RealSpeak
- SpeechWorks female
- SpeechWorks male
- Cselt (Actor) - Italian
84TTS Future Needs
- TTS needs to know how things should be said
- context-sensitive pronunciations of words
- prosody prediction: emphasis
- "I gave the book to John" with emphasis on John (not someone else)
- "I gave the book to John" with emphasis on book (not the photos)
- "I gave the book to John" with emphasis on I (I did it, not someone else)
- unit selection process: target cost captures mismatch between predicted unit specification (phoneme name, duration, pitch, spectral properties) and actual features of a candidate recorded unit; need better spectral distance measures that incorporate human perception
- better signal processing: compress units for small footprint devices
85Visual TTS
- Applications: talking assistants/avatars, intelligent agents, video mail, email reading ...
- Advantages of using a talking head
- higher intelligibility and perceived quality of (audio) TTS!
- enhanced user experience through multimedia communication
- Approaches
- sample-based image synthesis using library of video snippets
- 3D head model including models of tongue and lips (Baldi)
- in the future: sample-based image synthesis using Baldi for alignment (i.e., use best of each of the two technologies)
86Visual Text-to-Speech Synthesis
- personalized friendly agents provide an entertaining and effective user experience.
- subjective tests confirm
- Agents are preferred over text and audio interfaces
- Agents are more trusted than text and audio interfaces
- applications
- Personal Assistant
- Customer Service
- Newscaster (Ananova)
- E-commerce
- Games
87Talking Heads
- Sample-based Talking Heads
- look like a real person
- require recording of real people
- limited in pose that can be shown
- 3D Model-based Talking Heads
- flexible, easy to show in any pose
- faces look cartoon-like
883D-Head Model (Baldi Family)
Baldi
Andrew
Katherine
Caesar
Cybatt
Cybel
89VTTS Process
[Block diagram. Blocks: Text, Conversation module, Text to Speech Synthesizer, Coarticulation Model/Library, Movements/Emotions Model/Library, Rendering (3D model or sample-based model). Signals: phonemes, stress, emotions, lip shapes, movements (FAPs)]
90Two Rendering Techniques
- Synthetic 3D models: parametrized shapes
- keeps correct appearance under full range of views
- hard to reproduce minute skin details like wrinkles, that look absolutely natural
- Sample-based: parametrized textures
- reproduces photo realistic appearances fast
- range of views limited by planar approximation of parts
91Sample-Based Model
- concatenate snippets of video to synthesize talking heads
- reduce the number of samples to store by decomposing recorded head into sub-parts.
- use a background image of the whole head onto which parts are warped.
- feathering (transparency gradient at border) helps smooth blending
- smooth transitions of each object (e.g., mouth shape) across unit boundaries by using advanced morphing techniques
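Feathering can be sketched as alpha blending with a mask that ramps from 0 at the border of a part to 1 in its interior. The sketch assumes grayscale images stored as 2-D arrays and a linear ramp; the function names and the 4-pixel border are assumptions, and warping of the parts is not shown.

```python
import numpy as np


def feather_mask(height, width, border):
    """Alpha mask: 0 on the outer edge, rising linearly to 1 after `border` pixels."""
    rows = np.minimum(np.arange(height), np.arange(height)[::-1])
    cols = np.minimum(np.arange(width), np.arange(width)[::-1])
    dist = np.minimum.outer(rows, cols)  # pixel distance to the nearest edge
    return np.clip(dist / border, 0.0, 1.0)


def blend_part(background, part, top, left, border=4):
    """Paste `part` onto `background` with a feathered (soft) border."""
    out = background.astype(float).copy()
    h, w = part.shape
    alpha = feather_mask(h, w, border)
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * part + (1.0 - alpha) * region
    return out


head = np.zeros((20, 20))   # stand-in for the background head image
mouth = np.ones((10, 10))   # stand-in for a mouth snippet
blended = blend_part(head, mouth, top=5, left=5)
print(blended[5, 5], blended[6, 10], blended[10, 10])
```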
92Model-Independent Animation
Fear
Disgust
Anger
Surprise
Sadness
Joy
93Giving Machines High Quality Voices and Faces
U.S. English Female U.S. English Male Spanish
Female
Natural Speech
94Customer Care Scenario
95Visual TTS Demos
E-Mail Messages
Virtual Secretary
Au Claire de la Lune
96Travel Domain Scenario
97Business Drivers of TTS
- cost reduction
- TTS as a dialog component for customer care
- TTS for delivering messages
- TTS to replace expensive recorded IVR prompts
- new products and services
- location-based services
- providing information in cars (e.g., driving directions, traffic reports)
- unified Messaging (reading e-mail, fax)
- voice Portals (enterprise, home, phone access to Web-based services)
- e-commerce (automatic information agents)
- customized News, Stock Reports, Sports Scores
- devices
98Reading Email
From: Marilyn Walker <walker@research.att.com>
To: David Ross <davidross@home.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
--------------------------------------------------
4:30 is fine for me. See you at the meeting.
Marilyn
-----Original Message-----
From: David Ross <davidross@home.com>
To: Marilyn Walker <walker@research.att.com>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at davidross@home.com.
Thanks, david ross
99Reading Email (final)
From: Marilyn Walker <walker@research.att.com>
To: David Ross <davidross@home.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
--------------------------------------------------
4:30 is fine for me. See you at the meeting.
Marilyn Walker
-----Original Message-----
From: David Ross <davidross@home.com>
To: Marilyn Walker <walker@research.att.com>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at davidross@home.com.
Thanks, david ross
100TTS Application Categories
Devices
- PDAs, cellphones, gaming, talking appliances
- driving directions, city and restaurant guides, location services (e.g., Macy's has a sale!)
- voice control of cell phones, VCRs, TVs
- home information access over telephone (Home Voice Portals)
- information access over the phone such as sales information, HR, internal phonebook, messaging
- E-commerce, customer care (e.g., friendly automated talking web agents, FAQs, product information)
- next-gen HMIHY automated operator services
Categories: Devices, Automotive Connectivity, Consumer Communications, Enterprise Communications, Voice-assisted E-Commerce, Call Center Automation