Title: CSE 551/651
1 CSE 551/651 Structure of Spoken Language
Lecture 16: Text-to-Speech (TTS) Technology
John-Paul Hosom, Fall 2005
2 Text-to-Speech (TTS) Synthesis
- Having looked at theories of human speech production and speech perception, we'll now look at structures and algorithms currently used to implement these technologies.
- Text-to-Speech (TTS) has three main approaches: (1) formant-based, (2) concatenative, (3) articulatory
- All TTS approaches must address:
  (a) text analysis: from text, predicting phonemes, stress, and phrase boundaries
  (b) prosody: from text-analysis output, predicting pitch contour, energy contour, and duration of each phoneme
  (c) signal processing: given phoneme symbols and timing, generate a speech waveform
3 Text-to-Speech (TTS) Synthesis
- From a linguistic perspective, there may be many
more things to consider
(from Klatt 1987)
4 Text-to-Speech (TTS) Synthesis
- Text Analysis
- First, convert words to phonemes:
  hello -> h eh l ow
  read -> r iy dcl d, or r eh dcl d
  Send 2.50 to 1024 Clough Dr. ASAP. -> ??
- Also, syllabification:
  aeon -> ae/on
  aortae -> a/or/tae
  tasty -> tas/ty, tast/y, or ta/sty
- Plus, stress: CONduct (noun) vs. conDUCT (verb)
- And, accent: I want to go now, not later!
- Tones: not in English (whew!)
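The example sentence above mixes ordinary words with a decimal number, an integer, and abbreviations, each of which needs a different kind of expansion before phonemes can be predicted. The following is a minimal sketch of that first classification step only. The `ABBREVIATIONS` table and the label names are hypothetical, and a real system would also need context (for instance, "Dr." can mean "Drive" or "Doctor").

```python
import re

# Hypothetical expansion table (an assumption for illustration only).
ABBREVIATIONS = {"Dr.": "Drive", "ASAP": "as soon as possible"}

def classify(token):
    """Label a token by the kind of expansion it would need."""
    if token in ABBREVIATIONS:
        return "abbreviation"
    if re.fullmatch(r"\d+\.\d+", token):
        return "decimal"
    if re.fullmatch(r"\d+", token):
        return "integer"
    return "word"

def analyze(sentence):
    """Split on whitespace and label each token."""
    tokens = sentence.split()
    # Strip the sentence-final period unless it belongs to a known abbreviation.
    if tokens and tokens[-1] not in ABBREVIATIONS:
        tokens[-1] = tokens[-1].rstrip(".")
    return [(tok, classify(tok)) for tok in tokens]

for token, label in analyze("Send 2.50 to 1024 Clough Dr. ASAP."):
    print(token, label)
```

Even this toy version shows why "??" is a fair summary: the labels say what kind of token each one is, but not how "1024" in an address or "Dr." after a street name should actually be read.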
5 Text-to-Speech (TTS) Synthesis
- Text Analysis
- Three basic techniques/approaches:
- (1) Dictionaries: including word pronunciation, accent marks, syllabification, etc.; but many words are not in the dictionary, and some words have multiple pronunciations depending on context
- (2) Rules: such as morphological analysis, e.g. parallelization -> parallel + ize + ation
- (3) CART (Classification and Regression Trees): utilize context to make binary decisions leading to a final decision
[Figure: example CART for "conduct", with questions such as "noun?", "sent_begin?", and "prev_word = the?"]
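The idea of using context to make a chain of binary decisions can be sketched for the word "conduct". This is a hand-written stand-in, not a trained CART: the question order, the determiner list, and the outcome at each leaf are assumptions chosen only to mirror the kinds of questions named in the figure.

```python
# Hypothetical determiner list used for the "prev_word = the?" style question.
DETERMINERS = ("the", "a", "his", "her", "their")

def is_noun(prev_word, sent_begin):
    """Chain of binary questions; a real CART would learn these from data."""
    if sent_begin:
        # Assumption: sentence-initial "conduct" is treated as an imperative verb.
        return False
    if prev_word in DETERMINERS:
        return True
    return False

def pronounce_conduct(prev_word=None):
    """Return the stress pattern; prev_word=None means sentence-initial."""
    sent_begin = prev_word is None
    return "CONduct" if is_noun(prev_word, sent_begin) else "conDUCT"

print(pronounce_conduct("the"))
print(pronounce_conduct("to"))
print(pronounce_conduct())
```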
6 Text-to-Speech (TTS) Synthesis
- Prosody: Pitch, Duration, Energy, Allophones
- Pitch modeling: superposition of phrase and accent components (Fujisaki model)
(from van Santen, 2000)
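A minimal sketch of the superposition idea, using the commonly cited form of the Fujisaki model: log F0 is a baseline plus phrase components (impulse responses) plus accent components (clipped step responses). The constants `ALPHA`, `BETA`, and `GAMMA` and the command values in the usage lines are assumed illustrative values, not taken from the slide.

```python
import math

ALPHA = 2.0   # assumed phrase-control rate (1/s)
BETA = 20.0   # assumed accent-control rate (1/s)
GAMMA = 0.9   # assumed ceiling of the accent response

def phrase_response(t):
    """Response to a phrase command (an impulse) applied at t = 0."""
    return ALPHA * ALPHA * t * math.exp(-ALPHA * t) if t >= 0 else 0.0

def accent_response(t):
    """Response to an accent command onset (a step) at t = 0, clipped at GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)

def fujisaki_f0(t, base_hz, phrases, accents):
    """F0 in Hz at time t (seconds).

    phrases: list of (onset_time, amplitude)
    accents: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(base_hz)
    for t0, amp in phrases:
        log_f0 += amp * phrase_response(t - t0)
    for t1, t2, amp in accents:
        log_f0 += amp * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(log_f0)

# One phrase command at 0 s and one accent from 0.2 s to 0.5 s (made-up values).
for step in range(0, 11):
    t = step * 0.1
    print(round(t, 1), round(fujisaki_f0(t, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.4)]), 1))
```

Because the components add in the log domain, the accent raises F0 by a multiplicative factor on top of the slowly decaying phrase curve.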
7 Text-to-Speech (TTS) Synthesis
- Pitch contours
- However, more detailed approaches are available (van Santen), and pitch contour modeling is not a solved problem
(from Klatt 1987)
8 Text-to-Speech (TTS) Synthesis
- Factors affecting duration include:
- Current phoneme, Previous phoneme, Next phoneme,
- Word stress, Phrase accent, Degree of emphasis,
- Position in syllable (onset vs. coda, open vs. closed), Position in word, Position in phrase, Position in foot
- Models of Duration:
- Hand-Tuned Rules (Klatt)
- PRCNT based on 11 rules: pause insertion rule, clause-final lengthening, phrase-final lengthening, non-word-final shortening, polysyllabic shortening, non-initial-consonant shortening, unstressed shortening, lengthening for emphasis, postvocalic context of vowels, shortening in clusters, lengthening due to plosive aspiration
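A minimal sketch of how a PRCNT value can turn into a duration, assuming the form usually given for Klatt's rule system: each applicable rule scales PRCNT, and the result stretches or compresses only the part of the segment above its minimum duration. The inherent and minimum durations and the rule percentages in the usage lines are made-up illustrative numbers.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_percents):
    """Duration in ms after applying a sequence of percentage rules.

    PRCNT starts at 100 and is multiplied by each rule's percentage;
    only the compressible part (inherent - minimum) is scaled.
    """
    prcnt = 100.0
    for percent in rule_percents:
        prcnt = prcnt * percent / 100.0
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0

# Hypothetical vowel: 200 ms inherent, 80 ms minimum.
print(klatt_duration(200.0, 80.0, []))           # no rules apply
print(klatt_duration(200.0, 80.0, [140.0]))      # one lengthening rule
print(klatt_duration(200.0, 80.0, [140.0, 80.0]))  # lengthening, then shortening
```

The minimum duration acts as a floor: however many shortening rules fire, the segment never drops below it.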
9 Text-to-Speech (TTS) Synthesis
- Duration Modeling using CART
- Given a large corpus annotated with duration, phoneme, stress, and phrase information, train a CART classifier
- Doesn't generalize well to data not seen in training
- Sums-of-Products Approach (van Santen)
- Dur(/p/, c2, c3, c4, ...) = A11(/p/) + A21(/p/) × A22(c2) + A33(c3) × A34(c4), where cn is a particular context
- duration is a combined function of contexts
- statistical, not rule-based; simple to train; generalizes well
[Figure: example CART for duration, with questions such as "plosive?", "vowel?", and "word-initial?", further rules below them, and leaf values such as 0.98 and 1.13]
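A minimal sketch of a sums-of-products duration model, assuming the slide's expression is read as one additive term plus two product terms. All table names follow the A-subscripts on the slide, but every numeric value and context label below is invented for illustration; in practice the tables would be estimated from a corpus.

```python
def sop_duration(phoneme, c2, c3, c4, tables):
    """Duration as a sum of products of per-factor parameters."""
    return (tables["A11"][phoneme]
            + tables["A21"][phoneme] * tables["A22"][c2]
            + tables["A33"][c3] * tables["A34"][c4])

# Hypothetical parameter tables (values in ms or unitless scale factors).
TABLES = {
    "A11": {"p": 40.0},
    "A21": {"p": 30.0},
    "A22": {"stressed": 1.2, "unstressed": 0.8},
    "A33": {"word-initial": 10.0, "other": 5.0},
    "A34": {"phrase-final": 2.0, "other": 1.0},
}

print(sop_duration("p", "stressed", "word-initial", "phrase-final", TABLES))
print(sop_duration("p", "unstressed", "other", "other", TABLES))
```

Because each factor has its own small table, a context combination never seen in training still gets a duration, which is one way to read the "generalizes well" point.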
10 Text-to-Speech (TTS) Synthesis
- Generating a Waveform: Articulatory Synthesis
- The vocal tract is divided into a large number of short tubes, as in the electrical transmission line analog (Lecture 11), which are then combined and the resonant frequencies calculated.
(from Sinder, 1999; thesis work with Flanagan, Rutgers)
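A minimal sketch of two pieces of the tube picture, under simplifying assumptions that are not from the slide: a lossless tube, a speed of sound of 35000 cm/s, and one common sign convention for the reflection coefficient at a junction between adjacent tube sections. The first function covers only the special case of a single uniform tube closed at the glottis and open at the lips.

```python
def uniform_tube_resonances(length_cm, count=3, speed_cm_per_s=35000.0):
    """Quarter-wavelength resonances of a uniform tube closed at one end."""
    return [(2 * k - 1) * speed_cm_per_s / (4.0 * length_cm)
            for k in range(1, count + 1)]

def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    areas: cross-sectional areas ordered from glottis to lips.
    Equal neighbouring areas give 0 (no reflection at that junction).
    """
    return [(a2 - a1) / (a2 + a1) for a1, a2 in zip(areas, areas[1:])]

print(uniform_tube_resonances(17.5))
print(reflection_coefficients([1.0, 1.0, 3.0, 3.0]))
```

A full articulatory synthesizer would combine many such junctions (plus losses and sources) to obtain the resonances of an arbitrary vocal-tract shape; the uniform tube is only the neutral-vowel reference point.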
11 Text-to-Speech (TTS) Synthesis
- Generating a Waveform: Articulatory Synthesis
- Vocal-tract sources include noise and a buzz source for voiced sounds
- Articulatory synthesis is important for validating the Motor Theory of Speech Perception
- Demos from 1976 and circa 1992 (Haskins Labs)
12 Text-to-Speech (TTS) Synthesis
- Generating a Waveform: Formant Synthesis
- Instead of specifying mouth shapes, formant synthesis specifies frequencies and bandwidths of resonators, which are used to filter a source waveform.
- Formant frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances, but in a "buzzy" quality most likely due to the glottal source model.
- Formant synthesis can sound identical to a natural utterance if details of the glottal source and formants are well modeled.
[Audio demo: NATURAL SPEECH vs. SYNTHETIC SPEECH (John Holmes, 1973)]
13 Text-to-Speech (TTS) Synthesis
- Formant TTS Synthesis Architecture
- Formant-synthesis systems contain a number of sound sources, which are passed to filters connected either in parallel or in cascade (series). Each filter corresponds to one formant (resonance) or anti-resonance.
(From Yamaguchi, 1993)
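A minimal sketch of the cascade arrangement: an impulse-train "buzz" source is passed through second-order digital resonators one after another, each set by a formant frequency and bandwidth. The coefficient formulas are the standard two-pole resonator with unity gain at 0 Hz; the sampling rate, pitch, and formant values are assumed illustrative numbers, and the impulse train is a deliberately crude glottal source.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def cascade_synthesize(formants, f0_hz=100.0, n_samples=500, fs=10000):
    """Impulse-train source filtered by each (frequency, bandwidth) in series."""
    period = int(fs / f0_hz)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal

# Hypothetical neutral-vowel formants: (frequency Hz, bandwidth Hz).
samples = cascade_synthesize([(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)])
print(len(samples), max(samples), min(samples))
```

In a parallel arrangement each resonator would instead filter the source separately, with its own amplitude, and the outputs would be summed.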
14 Text-to-Speech (TTS) Synthesis
- Formant systems: Rule-Based Synthesis
- For synthesis of arbitrary text, formants and bandwidths for each phoneme are determined by analyzing the speech of a single person.
- The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source models over time.
- The formant frequencies for each phoneme are combined over time using a model of coarticulation, such as Klatt's modified locus theory.
- Duration, pitch, and energy rules are applied.
- Result: something like this
15 Text-to-Speech (TTS) Synthesis
- Despite great success in copy synthesis, synthesis by rule using formants has severely degraded quality. It's not clear why: Problem with the glottal source? Problem with coarticulation and formant transitions? Problem with prosody?
- Formant synthesis was the main TTS technique until the early or mid 1990s, when increasing memory size and CPU speed allowed concatenative synthesis to become a viable approach.
- Concatenative synthesis uses recordings of small units of speech (typically the region from the middle of one phoneme to the middle of the next phoneme, i.e. a diphone unit), and glues these units together to form words and sentences.
- Concatenative synthesis means that you don't have to worry about glottal source models or coarticulation, since the synthesis is just a concatenation of different waveforms containing natural glottal source and coarticulation.
16 Text-to-Speech (TTS) Synthesis
- Concatenative Synthesis: Units
- The basic unit for concatenative synthesis is the diphone
- More recent TTS research is on using larger units. Issues include: (a) how to decide what units will be used? (b) how to select the best unit from a very large database?
- With increasing size and variety of units, there is an exponential growth in the database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete. There is a very large number of infrequent events in speech.
Example diphone sequence: sil-jh jh-aa aa-n n-sil
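The diphone sequence above can be derived mechanically from a phoneme string by padding with silence and pairing neighbours. A minimal sketch follows; the function name and the use of "sil" as the padding symbol are assumptions that simply follow the notation of the example.

```python
def to_diphones(phonemes):
    """Pad with silence and pair each phoneme with its right neighbour."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(" ".join(to_diphones(["jh", "aa", "n"])))
```

With an inventory of N phonemes (plus silence), there are on the order of N squared possible diphones, which gives a feel for why databases of larger units grow so quickly.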
17 Text-to-Speech (TTS) Synthesis
- Concatenative Synthesis: Signal Processing
- Waveform-based: Pitch-Synchronous Overlap Add (PSOLA)
- Perform pitch modification by changing the spacing of pitch-synchronous units
- Or, use Line Spectral Frequencies (LSFs), which are conceptually the harmonics in a spectrogram
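A minimal sketch of pitch modification by re-spacing pitch-synchronous units, in the spirit of time-domain PSOLA. It makes strong simplifying assumptions that are not from the slide: the input has a constant, known pitch period with pitch marks at multiples of that period, the grains are two periods long with a Hann window, and no amplitude normalization is applied (so raising the pitch also raises the level).

```python
import math

def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / length) for i in range(length)]

def psola_pitch_shift(signal, period, factor):
    """Re-space two-period grains so the output period is period / factor.

    factor > 1 raises pitch, factor < 1 lowers it; duration is unchanged.
    """
    new_period = max(1, int(round(period / factor)))
    window = hann(2 * period)
    out = [0.0] * len(signal)
    # Assumed analysis pitch marks: every `period` samples.
    marks = list(range(period, len(signal) - period, period))
    if not marks:
        return out
    synth_mark = period
    while synth_mark < len(signal) - period:
        # Take the grain centred on the nearest analysis pitch mark.
        src = min(marks, key=lambda m: abs(m - synth_mark))
        for i in range(2 * period):
            out[synth_mark - period + i] += signal[src - period + i] * window[i]
        synth_mark += new_period
    return out

# Toy periodic input: 1000 samples with a 50-sample period.
tone = [math.sin(2.0 * math.pi * i / 50.0) for i in range(1000)]
same = psola_pitch_shift(tone, 50, 1.0)     # grains land where they came from
higher = psola_pitch_shift(tone, 50, 2.0)   # grains placed every 25 samples
print(len(same), len(higher))
```

With `factor` equal to 1 the overlapping windows sum to one and the interior of the signal is reproduced; other factors change only the spacing of the grains, which is the core of the method.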
18 Text-to-Speech (TTS) Synthesis
- DEMOS
- Klatt's DECtalk
- sample 1
- AT&T
- sample 1 sample 2
- Bell Labs
- sample 1 sample 2
- OGI
- sample 1 sample 2