Title: LING 270 Language, Technology and Society Unit 8
1LING 270 Language, Technology and Society Unit 8
- Richard Sproat
- URL: http://catarina.ai.uiuc.edu/L270/
2Overview
- Some more history of Speech Synthesis
- The problem of text-to-speech conversion
- Approaches to selected problems
- Future work: what's left to be done?
3Electronic Approaches through the 1950s
- Gunnar Fant's OVE Synthesizer
- Walter Lawrence's Parametric Artificial Talker (1953)
4Types of Synthesis
- Articulatory Synthesis: Attempt to model human articulation. (Never made it out of the laboratory.)
- Formant Synthesis: Bypass modeling of human articulation, and model acoustics directly. (Basis of several commercial systems, including DECtalk.)
- Concatenative Synthesis: Synthesize from stored units of actual speech. (The only commercial approach taken today.)
5Types of Synthesis Early Work
- The earliest articulatory synthesizers include (Nakata and Mitsuoka, 1965; Henke, 1967; Coker, 1968).
- Earliest formant synthesizer was (Kelly and Gerstman, 1961).
- The first working diphone concatenative synthesizer was (Dixon and Maxey, 1968).
- First full text-to-speech system was (Umeda et al., 1968).
http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/appa.html
6A Present-Day Commercial System
- An example of a current commercial system: a unit-selection TTS system, AT&T's Natural Voices
- http://www.naturalvoices.att.com/demos/
- Unit-selection approaches look for variable-length units in an annotated database of speech, and select them on the basis of various features, including the desired phoneme sequence and prosody.
7Components of a TTS System
8Linguistic Analysis
- Preprocessing
- Sentence segmentation
- Dr. Smith lives on Smith Dr. He is
- Abbreviation expansion
- Doctor Smith lives on Smith Drive. He is
- Word segmentation
- ???????????????
- ??????????????? Islamabad is a city in Pakistan
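The Dr. example above can be sketched in a few lines of code. This is a minimal illustration, not a real preprocessor: the function name `expand_dr` and its two capitalization heuristics are assumptions made for this example, and they cover only the Doctor/Drive ambiguity shown on the slide.

```python
def expand_dr(tokens):
    """Expand 'Dr.' as 'Doctor' or 'Drive' using neighboring capitalization.

    Assumed heuristics (illustrative only):
      - 'Dr.' right after a capitalized word is a street suffix ('Drive').
      - Otherwise it is a title ('Doctor').
      - A 'Drive' followed by a capitalized word (or nothing) also ends
        the sentence, so the period is kept.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok != "Dr.":
            out.append(tok)
            continue
        prev_cap = i > 0 and tokens[i - 1][0].isupper()
        next_cap = i + 1 < len(tokens) and tokens[i + 1][0].isupper()
        if prev_cap:
            sentence_end = next_cap or i + 1 == len(tokens)
            out.append("Drive." if sentence_end else "Drive")
        else:
            out.append("Doctor")
    return out


text = "Dr. Smith lives on Smith Dr. He is"
expanded = " ".join(expand_dr(text.split()))
print(expanded)
```

Note that the same decision settles two questions at once: which word the abbreviation stands for, and whether its period also closes the sentence.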
9Linguistic Analysis (contd.)
- Preprocessing continued
- Acronym pronunciation: NATO, CIA
- Number expansion: 123
- 123 goats
- 123 King St.
- Lotus 123
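The three readings of 123 above can be illustrated with a small context-sensitive expander. This is only a sketch under stated assumptions: the `STREET_SUFFIXES` and `PRODUCT_NAMES` sets, the two-token lookahead, and all function names are hypothetical choices for this example, and only numbers from 0 to 999 are handled.

```python
ONES = "zero one two three four five six seven eight nine".split()
TEENS = ("ten eleven twelve thirteen fourteen fifteen sixteen "
         "seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()

STREET_SUFFIXES = {"St.", "Ave.", "Rd.", "Dr."}   # assumed, illustrative
PRODUCT_NAMES = {"Lotus"}                          # assumed, illustrative


def below_100(n):
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Ordinary count reading for 0..999: 'one hundred twenty-three'."""
    if n < 100:
        return below_100(n)
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + below_100(rest) if rest else "")


def as_address(n):
    """House-number reading for 0..999: 'one twenty-three'."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0 or rest == 0:
        return cardinal(n)
    if rest < 10:
        return ONES[hundreds] + " oh " + ONES[rest]
    return ONES[hundreds] + " " + below_100(rest)


def as_digits(token):
    """Digit-by-digit reading: 'one two three'."""
    return " ".join(ONES[int(ch)] for ch in token)


def expand_number(tokens, i):
    token = tokens[i]
    if i > 0 and tokens[i - 1] in PRODUCT_NAMES:
        return as_digits(token)
    if any(t in STREET_SUFFIXES for t in tokens[i + 1:i + 3]):
        return as_address(int(token))
    return cardinal(int(token))


def expand(text):
    tokens = text.split()
    return " ".join(expand_number(tokens, i) if t.isdigit() else t
                    for i, t in enumerate(tokens))


for example in ["123 goats", "123 King St.", "Lotus 123"]:
    print(expand(example))
```

Enumerating the candidate readings is the easy part; the context rules that choose among them are where a real system needs much more care.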
10Linguistic Analysis (contd.)
- Word pronunciation
- How to pronounce ordinary words: through, though, tough, bough
- How to pronounce names: Abate (versus abate), Begin (versus begin)
- Sense disambiguation
- Is bass a fish or a musical range?
- Is IV four(th) or intravenous?
- Is Arabic قبل qabla "before" or qabila "received"?
11Linguistic Analysis (contd.)
- Accenting/prominence prediction
- This is a test of the emergency broadcasting system
- Intonation (inflection), i.e. assignment of pitch (better: fundamental frequency)
- Declarative sentences tend to have falling intonations in lots of languages
- Interrogative sentences tend to have rising intonations in lots of languages
- But there is a lot more going on than just that
12Linguistic Analysis (contd.)
- Segmental durations
- Every sound has to have some time assigned to it
- Other things being equal
- Vowels tend to be longer than consonants
- Stressed segments tend to be longer than unstressed segments
- Accented segments tend to be longer than unaccented segments
- Final segments tend to be longer than non-final segments
- Segments have different inherent durations: ee in keep is generally longer than i in kip
13Text-to-Speech Blocks
ASCII Text (input)
Text Normalization: Dr. Smith lives at 111 Smith Dr.
Syntactic/Semantic Parser: Lives (verb) vs. Lives (noun)
Dictionary: Pronunciation dictionary, morphemic decomposition, rhyming
Letter-to-Sound Rules: Rules used where dictionary derivation fails
Phonemes (30-40)
Prosody Rules: Intonation and duration
Speech Synthesis
Speech output
14A Summary so Far
- From linguistic analysis we have
- A set of sounds to be produced
- Associated durations
- Associated fundamental frequency information
- Possibly other things
- Amplitude
- Properties of the vocal production
- Now we are ready to synthesize speech
15Resources for Synthesis
- Text corpora
- Raw
- Tagged
- Part of speech
- Semantic tags
- Syntactically parsed (treebanks)
- Electronic dictionaries (e.g. CMU Dictionary)
- Annotated speech corpora
- Orthographic transcriptions only
- Phonetically transcribed (e.g. TIMIT)
- Prosodically transcribed (e.g. Boston Radio News)
- All of these are available (in greater or lesser quantities) for English
- For other languages, coverage is highly variable
- Many of these resources are available from the Linguistic Data Consortium (www.ldc.upenn.edu)
- Finally, there is the open-source Festival TTS system
16Human Vocal Apparatus
(After Boite and Kunt, 1987)
17Synthesis
- Articulatory synthesizers will produce a set of instructions to articulators (larynx, velum, tongue body, tongue tip, lips, jaw)
- This will produce a sequence of articulatory configurations
- From acoustic theory one derives the acoustics of each configuration
- Articulatory synthesis is very hard
- We do not fully understand how the articulators move
- We do not fully understand how to model the acoustics
18Synthesis
- Formant synthesizers attempt to model the acoustics directly by means of rules that capture the change of acoustic parameters over time.
- This is easier than articulatory synthesis but is still hard
- The only really high-quality system that has been produced was the American English system produced by Dennis Klatt
19Time-Domain Representation of Speech
20Frequency Domain: Spectrum
21Time-Frequency Representation: Spectrogram
22The Klatt Synthesizer
23Single Digital Resonator
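A single digital resonator of the kind used as a building block in the Klatt synthesizer is commonly written as a second-order difference equation, y[n] = A x[n] + B y[n-1] + C y[n-2], with the coefficients set from a formant frequency F and bandwidth BW. The sketch below follows that standard formulation; the function names and the example values (500 Hz, 60 Hz, 10 kHz) are assumptions chosen for illustration.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients A, B, C of y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c          # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Run the input samples through one resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0            # the two previous output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Impulse response of a resonator tuned to an assumed first formant.
impulse = [1.0] + [0.0] * 199
response = resonate(impulse, freq_hz=500.0, bandwidth_hz=60.0,
                    sample_rate_hz=10000.0)
print(response[:5])
```

A formant synthesizer chains or sums several such resonators, one per formant, and changes F and BW over time according to its rules.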
24Concatenative Systems
- Record real speech from a single talker
- Segment the speech so that we know where the individual sounds are
- Either
- Preselect a database of units: diphone, polyphone synthesis
- Select the best unit at runtime: unit-selection synthesis
- Coding of database
- Frequency domain, e.g. linear predictive coding (LPC)
- Time domain, e.g. Pitch-Synchronous Overlap and Add (PSOLA)
25Concatenative Approach: Diphones
26Linear Prediction Speech Production Model
27Units
- In a polyphone synthesizer, units are stored in coded form
- At synthesis time, appropriate units are selected from the database and concatenated
- Some smoothing between units is generally necessary
- Units need to be stretched or compressed to fit within the specified duration
- Intonation and amplitude information is added, and the result is sent for synthesis.
- E.g. in an LPC-type synthesizer, intonation will be modeled by the impulse-train generator plus a glottal waveform model, and the vocal tract parameters will be taken from the concatenated units.
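The division of labor in an LPC-type synthesizer can be sketched as a source plus a filter: an impulse train whose spacing sets the fundamental frequency, passed through an all-pole filter whose coefficients stand in for the vocal tract. The sketch below is a simplified illustration only: it omits the glottal waveform model, and the function names and the predictor coefficients are made-up assumptions rather than values taken from real speech.

```python
def impulse_train(f0_hz, sample_rate_hz, n_samples):
    """Excitation: one unit impulse per pitch period."""
    period = round(sample_rate_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, coeffs, gain=1.0):
    """s[n] = gain * e[n] + sum_k coeffs[k-1] * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        s = gain * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                s += a * out[n - k]
        out.append(s)
    return out


# Intonation comes from the source: changing f0 changes impulse spacing.
low_pitch = impulse_train(f0_hz=100.0, sample_rate_hz=8000.0, n_samples=161)
high_pitch = impulse_train(f0_hz=200.0, sample_rate_hz=8000.0, n_samples=161)

# Vocal tract shape comes from the filter (hypothetical coefficients).
vocal_tract = [1.2, -0.8]
speech_like = all_pole_filter(low_pitch, vocal_tract)
print(sum(low_pitch), sum(high_pitch), len(speech_like))
```

In a concatenative system the filter coefficients would change frame by frame, read from the stored units, while the impulse spacing follows the intonation assigned by the prosody component.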
28Unit Selection Methods
- Record a (very) large database (>1 hour) from a single talker
- Annotate the database with
- Aligned phonetic transcription
- Prosodic information (e.g. accent type, duration)
- At synthesis time find the best available sequence of units that maximizes
- The match between the desired sequence and the individual units
- The goodness of join between the units
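The search described above is usually cast as a shortest-path problem: maximizing match and join quality is the same as minimizing a target cost plus a join cost, which dynamic programming (a Viterbi search) solves. The sketch below is a toy under stated assumptions: the `Unit` fields, both cost functions, and the tiny database are hypothetical, and candidates are assumed to be pre-filtered so that each already has the right phoneme.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    phone: str
    f0: float   # pitch of the recorded unit, in Hz
    utt: int    # recorded utterance the unit came from
    pos: int    # position of the unit within that utterance


def target_cost(target, unit):
    """How far the unit is from the desired pitch (assumed measure)."""
    _, wanted_f0 = target
    return abs(wanted_f0 - unit.f0) / 10.0


def join_cost(left, right):
    """Units that were adjacent in the recording join for free."""
    adjacent = left.utt == right.utt and left.pos + 1 == right.pos
    return 0.0 if adjacent else 1.0


def select_units(targets, candidates):
    """Return (best unit sequence, total cost) by dynamic programming."""
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j],
    #               index of the previous unit on that path)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


TARGETS = [("k", 120.0), ("ae", 120.0), ("t", 110.0)]
CANDIDATES = [
    [Unit("k", 118.0, 1, 0), Unit("k", 121.0, 2, 4)],
    [Unit("ae", 125.0, 1, 1), Unit("ae", 120.0, 3, 7)],
    [Unit("t", 111.0, 1, 2), Unit("t", 110.0, 3, 8)],
]

path, total = select_units(TARGETS, CANDIDATES)
print([(u.phone, u.utt, u.pos) for u in path], total)
```

In this toy database the search prefers three units that were recorded contiguously in utterance 1, even though other candidates match the target pitch slightly better, because every join between non-adjacent units is penalized.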
29Importance of Good Joins
http://catarina.ai.uiuc.edu/orig.wav
Shannon says the party ended the US job.
30Intonation Types
- English declarative
- Final fall (H L- L)
- 9478-1509-7091
- English question
- Final rise (L H- H)
- 9478-1509-7091
31Linguistic Use of Intonation
- Lexical information
- stress, accent, tone
- Intonation type
- declarative, question
- Paralinguistic information
- mood, emotion
- Discourse functions
32Discourse Functions
- Topic initialization
- Discourse structure
- Phrasing
- Emphasis
- New vs. old information
- Other communicative goals
33ToBI Intonation Grammar
- (ToBI: Tones and Break Indices)
- Binary tonal distinction
- H (high) and L (low)
- Combination in an intonation phrase
- Accent, Phrase tone, Boundary tone
- HL L- L
34 Accent Types
H L- L
L H- H
LH L- L
HL H- H
LH L- H
35Lexical Information
- Stress languages
- English, Russian
- Stress location is part of the lexical entry, but the pitch contour (accent type) on the stressed syllable may vary.
- Accentual languages
- Japanese, Korean, Swedish
- The location of the accent is lexically marked. Accent type in a word is typically fixed.
- Tone languages
- Chinese, Navajo, Igbo
- Lexically determined tone on every syllable or every word.
36Stress: English
- Fixed stress location: professor
37Stress: English
- Fixed stress location: professor
- Different accent type in questions.
38Stress: Russian
- Fixed stress location: professor
- May have a different accent type than English, even in declarative sentences
39Chinese Lexical Tones
Tone shapes differentiate lexical meaning.
Ma1 "mother", Ma2 "hemp", Ma3 "horse", Ma4 "to scold"
40Chinese Sentences
Ma1-ma0 ma4 ma3. Mother scolds the horse.
Ma3 ma4 ma1-ma0. The horse scolds mother.
41Russian Intonation Types
- Declarative
- This is Neva.
42Russian Intonation Types
- Declarative
- This is Lena.
43Chinese Intonation Types
- Chinese declarative
- Li3bai4wu3 Luo2yan4 yao4 mai3 lu4.
- On Friday Luoyan wants to buy a deer.
- Chinese question
- Li3bai4wu3 Luo2yan4 yao4 mai3 lu4?
- On Friday Luoyan wants to buy a deer?
44Phrasing
- High f0 marks the beginning
- Low f0 marks the end.
45Prosody of Emotion
- Excitement
- Fast, very high pitch, loud
- Hot anger
- Fast, high pitch, strong, falling accent, loud
- Fear
- Jitter
- Sarcasm
- Prolonged accent, late peak
- Sad
- Slow, low pitch
46Excitement
Marilyn won nine million dollars!
Dirty rats are the best, aren't they?
47Sad
Marilyn won nine million dollars.
Dirty rats are the best, aren't they?
48Chinese Lexical Tones
Tone shapes differentiate lexical meaning.
Ma1 "mother", Ma2 "hemp", Ma3 "horse", Ma4 "to scold"
49Tonal Variations
50The Challenge Tonal Distortion
51A Heretical Claim
- Or: why am I spending so much time on intonation?
- The main determinant of naturalness in speech
synthesis is not voice quality, but
natural-sounding prosody (intonation and duration)
52Segmental Duration Factors
- Phonetic segment identity.
- For example, holding all else constant, /ai/ as in like is twice as long as /i/ in lick.
- Identities of surrounding segments.
- For example, vowels are much longer when followed by voiced fricatives than when followed by voiceless stops.
- Syllabic stress.
- Vowels in stressed syllables and consonants heading stressed syllables are longer than vowels and consonants in unstressed syllables.
- Word importance
- Location of the syllable in the word.
- Word-final syllables are longer than word-initial syllables
- Location of the syllable in the phrase.
- Phrase-final syllables can be twice as long as the same syllable in other locations in a phrase.
- How do these factors interact?
53Approaches to Duration Modeling
- Hand-built rules
- E.g., the Klatt rules
- Semi-supervised statistical models
- Machine-learning approaches
- Classification and Regression Trees
- Neural nets
-
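One way the hand-built approach handles interacting factors is the duration formula associated with the Klatt rules: each segment has an inherent duration and a minimum duration, and every applicable rule scales a percentage that stretches or shrinks only the part above the minimum, DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100. The sketch below implements just that formula; the segment durations and rule factors are illustrative assumptions, not values taken from the published rule set.

```python
def klatt_duration(inherent_ms, minimum_ms, factors):
    """DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100.

    PRCNT starts at 100 and is multiplied by the factor of every rule
    that applies (e.g. 1.4 for lengthening, 0.7 for shortening).
    """
    prcnt = 100.0
    for factor in factors:
        prcnt *= factor
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0


# Hypothetical vowel: 200 ms inherent duration, 80 ms minimum.
plain = klatt_duration(200.0, 80.0, [])
shortened = klatt_duration(200.0, 80.0, [0.5])          # e.g. unstressed
lengthened = klatt_duration(200.0, 80.0, [1.5])         # e.g. phrase-final
combined = klatt_duration(200.0, 80.0, [0.5, 1.5])      # both rules apply
print(plain, shortened, lengthened, combined)
```

Because shortening acts only on the portion above the minimum, no combination of shortening rules can compress a segment below its minimum duration, which is one simple answer to how the factors interact.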
54Accent/Prominence Assignment
- Content versus non-content words
- The keeshond is from Holland.
- My dog bit someone
- Noun-phrase structure/lexical specifications
- Our dog needs a rabies shot, so I may be late for my regular office hours on Wednesday.
- Tomorrow we have a dog training lesson.
- Our dog eats rawhide bones.
- Given/new information
- We've always wanted a pet, so we finally bought a weimaraner
- My son wants a labrador, but I'm allergic to dogs
- Contrast
- That's not an Australian sheepdog, it's an Australian cattlehound.
55Cross-Linguistic Variation
- The investigations are helping to put back in order things that have gone out of order
- Italian: Le inchieste servono a mettere a posto cose andate fuori posto
- Greek divers have found the wreck of the British liner Britannic, sister ship of the Titanic.
- Italian
- Moglie quarantenne, marito cinquantenne.
- Forty-year-old wife, fifty-year-old husband
56Accent Modeling Factors
- Part-of-speech information
- Given/New information
- Lexical information
- Cue phrases: And then,
- Noun compound analysis
- Contrast
- Information Content
- E.g. log(P(w))
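As a rough illustration of the information-content feature: a word's probability can be estimated from corpus counts, and its information content is conventionally the negative log probability, which is the slide's log(P(w)) up to sign. The sketch below assumes a simple unigram estimate over a tiny made-up corpus; the function name is hypothetical.

```python
import math
from collections import Counter


def information_content(words):
    """Map each word to -log2 P(w), with P(w) estimated from counts."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}


# Toy corpus (illustrative only): 8 tokens.
corpus = "the the the the dog dog keeshond bit".split()
bits = information_content(corpus)
for word in ["the", "dog", "keeshond"]:
    print(word, bits[word])
```

Rare words such as keeshond carry more bits than frequent function words such as the, which matches the intuition that high-information words are more likely to be accented.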
57Accent Modeling Methods
- Hand-built rules (on the basis of corpus analysis)
- Machine-learning methods
- Hidden Markov Models
- Machine learning packages such as CART or Ripper
- Accuracies in the range of 80-90% are typical
58Word Sense Disambiguation
- Is bass a fish or a musical range?
- Is IV four(th) or intravenous?
- Is Arabic قبل qabla "before" or qabila "received"?
- Is 1/2 a date or a fraction?
- Is LAX lax or L. A. X. (Los Angeles International Airport)?
59Methods for Word-Sense Disambiguation
- Decision lists
- bass → FISH if lake within k-word window
- bass → MUSIC if word to right is player
- Decision lists can be trained on the basis of tagged data
- Other machine-learning methods
- N-gram language models as in automatic speech recognition
- Sometimes you have to infer from the data what needs to be treated and how
- 57 ST E/1st 2nd Ave Huge drmn 1 BR 750 sf, lots of sun clsts. Sundeck lndry facils. Askg 187K, maint 868, utils incld. Call Bkr Peter 914-428-9054
- Accuracies for WSD can be as high as 99%
- But this is HIGHLY variable, depending upon the case(s) being considered
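The two bass rules above can be written as a small decision list: an ordered list of tests in which the first one that fires decides the sense. The sketch below is hand-written for illustration; the helper names, the window size, the rule order, and the `UNKNOWN` default are assumptions, whereas a trained decision list would learn its rules and their order from tagged data.

```python
def right_neighbor(word):
    """Test: the word immediately to the right is `word`."""
    def test(tokens, i):
        return i + 1 < len(tokens) and tokens[i + 1] == word
    return test


def near(word, k):
    """Test: `word` occurs within k tokens on either side."""
    def test(tokens, i):
        window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        return word in window
    return test


# Ordered rules: the first matching test wins.
BASS_RULES = [
    (right_neighbor("player"), "MUSIC"),
    (near("lake", k=3), "FISH"),
]


def classify(tokens, i, rules, default="UNKNOWN"):
    for test, label in rules:
        if test(tokens, i):
            return label
    return default


fish = "he caught a bass in the lake".split()
music = "the bass player arrived".split()
print(classify(fish, fish.index("bass"), BASS_RULES))
print(classify(music, music.index("bass"), BASS_RULES))
```

The chosen sense then selects the pronunciation, since the fish and the musical range are pronounced differently.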
60Word Pronunciation Methods
- Dictionaries
- Morphological analysis
- If I know grammatical and the suffix -ity, I can predict the pronunciation of grammaticality
- Letter-to-sound rules
- Reasonable for some languages (e.g. Spanish)
- Hard for others (e.g. English, Arabic, Urdu, Japanese)
- tough, bough, through, though
- Machine-learning methods
- CART
- Analogical reasoning
- Exemplar-based methods
- Special attention often needs to be paid to classes such as proper names
- Overall accuracies in the 95% range (by word) are common (for English)
- Interesting side-note: pronunciation of proper names is one of the few areas in speech and language processing where machines can outperform people.
61Preprocessing
- Expansion of numbers, abbreviations, and decisions on word and sentence boundaries are fairly algorithmic
- at least when it comes to enumerating the possible expansions
- The real difficulty is in disambiguation
62Disambiguation in Word Segmentation
I couldn't forget where Liberation Avenue is.
Methods have included language-modeling techniques or even full syntactic parsing
F-scores of around 95% are reported on some corpora
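A language-modeling approach to segmentation can be illustrated with the simplest possible model: score every way of splitting the character string into known words by the product of unigram word probabilities, and keep the best split using dynamic programming. The sketch below uses unspaced English as a stand-in for a language written without word spaces; the vocabulary, its probabilities, and the function name are made-up assumptions.

```python
import math

# Hypothetical unigram probabilities (illustrative only).
WORD_PROB = {
    "the": 0.05, "table": 0.001, "down": 0.002, "there": 0.003,
    "theta": 0.00001, "bled": 0.00001, "own": 0.0005,
}
WORD_LOGPROB = {w: math.log(p) for w, p in WORD_PROB.items()}


def segment(text, word_logprob, max_len=10):
    """Return the most probable word sequence, or None if none exists."""
    # best[end] = (best log probability of text[:end], start of last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_logprob and best[start][0] > -math.inf:
                score = best[start][0] + word_logprob[word]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("thetabledownthere", WORD_LOGPROB))
```

Both "the table down there" and "theta bled own there" cover the string with dictionary words; the model picks between them only because one sequence of words is far more probable than the other, which is the disambiguation problem in miniature.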
63Evaluation of Synthesizers
- Favorite method is the Mean Opinion Score (MOS)
- Take two systems A and B, have listeners rate each (MOS ratings are typically on a 1-5 scale), and see which one people like better
- Simple to do, but doesn't tell you anything about why one system is preferred over another
- Design tests that focus on particular features.
- For example, play a sentence and ask people to point (in a text transcription of the sentence) to the most problematic portion
- Ask them to describe what's wrong from a fixed list
- choppy, bad letters, wrong emphasis
- Can help one pinpoint particular problems, such as bad units
- A problem in comparing systems has been the lack of common databases, especially for unit selection
- The recent 2005 Blizzard Challenge (http://festvox.org/blizzard/) addresses this by providing a common training database (the CMU Arctic Database)
64Applications of TTS
- Any situation where you need to get information, but you can't access it visually
- Access to information for the blind
- Access to email, news, stock quotes over the phone
- Directions to drivers
- Spoken dialog systems where it is not practical to prerecord everything
- Informational content, e.g. NOAA Weather Radio, where it would be expensive to have a human read all the announcements.
65Further Work is Needed
- Intonation is still not well understood
- How to predict appropriate prosody from text?
- How to predict appropriate affect from text?
- Linguistic analysis of text still needs further work
- Humans still outperform machines at disambiguation
- Correct rendering often depends upon discourse-level information, which is not well modeled
- Cuantas casas grandes hay en la ciudad? 100 ? doscientas
- Unit-selection methods are hampered by a simple fact: you can't cover all the factors
- Need methods for modification of speech databases