Title: Fundamental Frequency Contour Synthesis for Turkish Text to Speech
1Fundamental Frequency Contour Synthesis for
Turkish Text to Speech
2Content
- TTS systems and prosody
- Turkish Intonation, Stress
- Observations on Collected Data
- Methodology
- Improvements on Methodology
- Discussion
- Conclusion
3Introduction to Text to Speech (TTS) Systems
- Text -> speech signal
- Widespread applications
- Message to speech generation
- Man-machine dialogue
- Multimedia applications
- Talking aids for handicapped
CHALLENGE: Machine Accent -> Natural Speech
SOLUTION: Prosody Generation Modules
4What is Prosody?
- Properties of speech that cannot be derived from
the phoneme sequence
- Modulation of voice pitch
- Rhythm, changes in durations
- Fluctuations of loudness
- Related to domains larger than one phoneme
- (supra-segmental properties)
5Basic Acoustic Parameters
- Fundamental Frequency F0 (pitch)
- Duration
- Intensity
Prosodic Phenomena
- Modulate the basic acoustic parameters
- Modulation of fundamental frequency
- Intonation
- Stress (accent)
6Intonation
- Ensemble of pitch variations
- Perceived as speech melody
Stress
- Modulate all the basic acoustic parameters
- Increase in F0 and intensity (loudness)
- Lengthening in duration
- Three types
- Word stress
- Phrase stress
- Sentence stress
- Stress on a single syllable
- Phrase and sentence stress coincide with word
stress
7Prosody Generation Modules in TTS
- Prosodic description
- Prosodic phrasing -> phrase boundaries
- Accent labeling -> accents on syllables
- Prosodic labels -> F0 contour
PROBLEMS
- Complex linguistic processing units (morphology,
syntax, semantics)
- Speaker-dependence
- Articulation-related problems: microprosody vs.
macroprosody
8Basic Intonation Models
- Tone Sequence Models: pitch contour as a
sequence of fluctuations generated by local
accents
- Pierrehumbert: a sequence of independent H and L
tones (orthography)
- Pitch accent -> pitch movements on stressed
syllables
- Boundary tone -> at phrase boundaries
- Phrase accent -> between stressed syllable and
phrase boundary
- Superposition Models: pitch contour as the
superposition of several components with
different domains: syllables, words, phrases,
sentences, paragraphs, whole text
- Fujisaki: purely mathematical model -> parametric
- A basic F0
- A phrase component (response of a critically damped
second-order system to an impulse)
- An accent component (response of a critically damped
second-order system to a rectangular input)
- Optimization of parameter values w.r.t. F0 (Analysis
by Synthesis)
- Möbius -> Fujisaki + linguistics -> German
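For reference, the Fujisaki model is usually written in the form below. This is a sketch in the notation common in the literature; the symbols (baseline F_b, command amplitudes A_p and A_a, time constants alpha and beta, ceiling gamma, and the onset and offset times T_0, T_1, T_2) do not appear on the slide and are given here as assumptions about the intended formulation.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{a,j}\, \bigl[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \bigr]

G_p(t) = \begin{cases} \alpha^{2}\, t\, e^{-\alpha t}, & t \ge 0 \\ 0, & t < 0 \end{cases}
\qquad
G_a(t) = \begin{cases} \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\ 0, & t < 0 \end{cases}
```

G_p is the impulse response used for the phrase component and G_a the step response used for the accent component; the difference of two shifted step responses gives the response to a rectangular accent command.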
9Approaches
- Perform an analysis on a speech corpus
- Transcribe the corpus
- Define F0 labels (rise, fall, peak, etc.) and
boundary labels (minor, major, etc.)
- Labeling
- By hand
- Examination -> rules -> automatic
- Automatic learning of labels -> F0 values (or
parametrized)
- Neural Networks
- Stochastic methods
- Intonation pattern dictionary (from natural
speech)
- Store pitch values in ST and key information
(labels) for each pattern
- For the patterns in input sentence -> compare key
info -> find closest pattern from dictionary ->
apply pitch
10Approaches
- For integration into TTS (labeling input sentence
from text)
- Complex linguistic processing units
- Morphology
- Syntax
- Semantics
- Stochastic methods
- Syntax -> most probable label sequence
11Sentence Intonation Types
- Terminal intonation
- pitch decreases at the end -> message completed
- Interrogative intonation
- pitch slightly increases on the last syllable ->
waiting for response
- Progressive intonation
- pitch either increases slightly or does not show
any lowering at the end -> message not completed
yet
12Turkish Intonation
- Classification of sentences
- Type
- Declaratives(?)
- wh-questions(?)
- yes-no questions(?)
- Structure
- Simple
- Compound (?) at the end of subordinate
- Mesgul oldugundan(?) bizimle sinemaya gelemedi(?).
13Turkish Intonation
- Tone groups (phrase or segment)
- Division into tone groups
- / Oraya varinca beni arayin. /
- / Oraya varinca / beni arayin. /
- Focus (new information) in each tone group
- / Oraya varinca beni arayin. /
- / Oraya varinca beni arayin. /
- / Oraya varinca beni arayin. /
- / Oraya varinca beni arayin. /
- Pitch variations on focus
14Turkish Intonation
- Four levels of pitch: low (1), mid (2), high (3),
extra high (4)
- gi2di3yoru1m
- sa2hi4 mi1
- Speech melody <-> musical melody (Nash)
- Hierarchy of intonation units (phrase -> text)
- Each intonation unit -> melody
- Successive intonation units related by motifs ->
melody of the upper level
- Music: reiteration of motifs -> musical melody
15Turkish Stress
Word Stress
- Fixed (bound) stress vs. free stress (Turkish)
- Stress on a single syllable of a word in Turkish
- Effect of suffixes on stress
- Stress on final syllable of root, stressable
suffix
- yolcu + -lar -> yolcular
- Stress on final syllable of root, unstressable
suffix involved
- oku + -yor -> okuyor + -lar ->
okuyorlar
- Stress on non-final syllable of root
- karinca + -lar -> karincalar
- May disappear in sentence
16Turkish Stress
Sentence Stress
- Signals the prominence of the most
information-bearing element in a sentence
- Types
- Unmarked (preverbal position)
- Yarin Istanbula gidiyorlar.
- marked (any position)
- Yarin Istanbula gidiyorlar.
- Focusing elements
- Precede focus: sadece, daha
- Mehmet daha bugün ödevine baslayabildi.
- Follow focus: -mi, da, bile
- Ayla mi bugün Ankaradan dönüyor?
17Turkish Stress
Phrase Stress
- Phrase modifier or complement and head
- Phrase stress on modifier in Turkish
- Types
- Phrases used as nouns
- telefon ahizesi
- güzel çiçekler
- Phrases used as verbs
- hizli kos
- severek yasa
- Others
- senin için
- yarindan sonra
- Preserved in the sentence
18Motivation
- Nevin bugün menemen yemeli. (template)
- N Z F V
- Nevin menemen yemeli.
- N F V
- Bizim Nevin domatesli menemen yemeli.
- P N A F V
- Nalan yarin ayna aliyor.
- N Z F V
- Nalan ayna aliyor.
- N F V
- Kardesim Nalan yeni ayna aliyor.
- N N A F V
19Nevin bugün menemen yemeli.
Nevin menemen yemeli.
20Nevin bugün menemen yemeli.
Bizim Nevin domatesli menemen yemeli.
21Nevin bugün menemen yemeli.
Nalan yarin ayna aliyor.
22Nevin bugün menemen yemeli.
Nalan ayna aliyor.
23Nevin bugün menemen yemeli.
Kardesim Nalan yeni ayna aliyor.
24Sentences
- 19 close test sentences (add/remove categories)
- 18 random test sentences
- Syllable-based hand labeling
- Pitch extraction
25Observations
Declaratives
- Pitch decrease at the end (terminal intonation)
- Division into phrases
- Pitch increase on the phrase-final syllable
(progressive intonation)
Nevin/bugün/menemen yemeli.
26Observations
Declaratives
- Pitch decrease at the end (terminal intonation)
- Division into phrases
- Pitch increase on the phrase-final syllable
(progressive intonation)
Evvelki gün/ikimiz de/kuyumcu Aliye ugradik.
27Observations
Wh-questions
- Pitch increase on the last syllable
(interrogative intonation)
- Evident pitch increase on the stressed syllable
of the wh-word
- No division into phrases
- Word stress often disappears
Dün neden zamanimi aldin?
28Observations
Wh-questions
- Pitch increase on the last syllable
(interrogative intonation)
- Evident pitch increase on the stressed syllable
of the wh-word
- No division into phrases
- Word stress often disappears
Kimler yarin sinif gezisine katilacaklar?
29Observations
Yes-no questions
- Pitch decrease at the end
- Evident pitch increase on the stressed syllable
of the word before -mi
- No division into phrases
- Word stress often disappears
Oralari yine eskisi gibi güzel mi?
30Observations
Yes-no questions
- Pitch decrease at the end
- Evident pitch increase on the stressed syllable
of the word before -mi
- No division into phrases
- Word stress often disappears
Mudanyada bu sene de çok yagmur yagiyor mu?
31Observations
Conditionals
- Pitch decrease at the end (terminal intonation)
- Division into phrases
- Pitch increase on the phrase-final syllable
(progressive intonation)
- -se always a phrase-final syllable
Insan azimliyse herseyi basarabilir.
32Observations
Conditionals
- Pitch decrease at the end (terminal intonation)
- Division into phrases
- Pitch increase on the phrase-final syllable
(progressive intonation)
- -se always a phrase-final syllable
Babam keyifsizse ona konuyu bu aksam anlatamam.
33Observations
Imperatives
- Pitch decrease at the end (terminal intonation)
- Division into phrases
- Pitch increase on the phrase-final syllable
(progressive intonation)
Aksam yemegi için çarsidan birseyler alsinlar.
34Observations
Imperatives
- Pitch decrease at the end (terminal intonation)
- Division into phrases
- Pitch increase on the phrase-final syllable
(progressive intonation)
Sevgiyi ve mutlulugu yarinlara erteleme.
35Observations
Exclamations
- Diverse
- Pitch decrease at the end (terminal intonation)
- Evident pitch increase on the stressed syllable
of interjection or of another word
Aman büyüklerine bir saygisizlik yapma!
36Observations
Exclamations
- Diverse
- Pitch decrease at the end (terminal intonation)
- Evident pitch increase on the stressed syllable
of interjection or of another word
Haydi bugün hep birlikte piknige gidelim!
37Local Observations
- At most a single stressed syllable, excluding
phrase-final increase
- Stress within the sentence coincides with the
word stress
- Phrase stress preserved
Ekonomik kriz / her kesimden insani / olumsuz
etkiledi.
38Local Observations
- At most a single stressed syllable, excluding
phrase-final increase
- Stress within the sentence coincides with the
word stress
- Phrase stress preserved
Evvelki gün / ikimiz de / kuyumcu Aliye ugradik.
39Local Observations
- Word stress may disappear
Beden sagligimiz için aksamlari erken yatmaliyiz.
Mehmet daha bugün ödevine baslayabildi.
40Local Observations
- Word stress disappears at the end of positives
(terminal intonation)
Nevin bugün menemen yemeli.
Merve evine zamaninda dönemez.
41Local Observations
- Sentence stress (stress on focus)
Nevin bugün menemen yemeli.
Mehmet daha bugün ödevine baslayabildi.
42Local Observations
- Effects on neighbour syllables
- Unstressed stressed (nevin)
- Stressed stressed
- nevinbugün
Nevin bugün menemen yemeli.
43Local Observations
- Effects on neighbour syllables
- Stressed stressed (Partiyegelmeyecegim)
Ben aksam partiye gelmeyecegim.
44Local Observations
- Effects on neighbour syllables
- Stressed unstressed (Gecerüyasinda)
Kardesim beni dün gece rüyasinda görmüs.
45Local Observations
- Effects on neighbour syllables
- Stressed unstressed (neyle)
Bu geç vakitte sizin eve neyle dönecegiz?
46Local Observations
- Effects on neighbour syllables
- Stressed unstressed (last syllable, terminal
intonation) (degildi)
Aksamki yemek pek güzel degildi.
47Local Observations
- Effects on neighbour syllables
- Stressed unstressed (last syllable, terminal
intonation) - (güzelmi)
Oralari yine eskisi gibi güzel mi?
48Methodology
Overview
- Choose best sentence from a sentence database
- Apply its pitch to the matching regions of input
sentence
- Compression / Stretching
- Interpolation
- Fit data to remaining regions using interpolation
49Methodology
Read Files
- Input information used for sentences
- Sentence type (declarative, wh-question, yes-no
question, conditional, imperative, exclamation)
- Sentence state (positive or negative)
- Categories of each word
- Number of syllables of each word
- The index of the syllable bearing word stress,
for each word (stress in sentence coincides with
word stress)
50Methodology
Read Files
- Word categories rely mainly on part-of-speech
(POS) categories
51Methodology
Choose Best Sentence
- Search in database to find the best sentence
- Search the template sentences with the same
- Type
- State
- as the input sentence
- Two different approaches for
- Sentences other than questions
- Question sentences
52Sentences other than Questions
- Calculate sentence resemblance scores based on
word resemblance scores (WRS)
- Choose the template sentence having the maximum
sentence resemblance score
Word Resemblance Score (WRS)
- Measure of resemblance of two words
- Consists of
- Regional resemblance score (RRS) -> word stress
information
- Category match score (CMS) -> word categories
- WRS = RRS + CMS
53Regional Resemblance Score (RRS)
- Makes use of the four regions defined for every
word
- Region before the stressed syllable
- Stressed syllable
- Region after the stressed syllable
- Phrase-final syllable
- Measure of resemblance of any two words in terms
of these regions
- Based on number of syllables in each region
- Consists of
- Score of existing regions (ERS)
- Score of lacking regions (LRS)
- RRS = 0.9 x ERS + 0.1 x LRS
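The four regions can be derived from the per-word inputs listed earlier (number of syllables, index of the stressed syllable, and whether the last syllable carries a phrase-final rise). The following is a minimal sketch under stated assumptions: the function name `word_regions`, the 1-based `stress_index`, and the rule that a stressed last syllable is counted as the stressed region rather than as the phrase-final region are all hypothetical choices, not taken from the slides.

```python
from typing import NamedTuple


class Regions(NamedTuple):
    before: int        # syllables before the stressed syllable
    stressed: int      # always 1 here: the stressed syllable itself
    after: int         # syllables after the stressed syllable
    phrase_final: int  # 1 if the last syllable carries a phrase-final rise


def word_regions(n_syllables, stress_index, phrase_final_rise=False):
    """Count syllables per region; stress_index is 1-based (assumption)."""
    if not 1 <= stress_index <= n_syllables:
        raise ValueError("stress_index must lie within the word")
    # Assumption: if the stressed syllable is also the last one, it stays
    # in the stressed region and no separate phrase-final region is counted.
    final = 1 if phrase_final_rise and stress_index < n_syllables else 0
    before = stress_index - 1
    after = n_syllables - stress_index - final
    return Regions(before, 1, after, final)


istanbul = word_regions(3, 2)  # stress on the second syllable
ankara = word_regions(3, 1)    # stress on the first syllable
```

With these inputs `istanbul` has one syllable in each of the first three regions and `ankara` has no region before the stressed syllable and two syllables after it, which matches the region counts implied by the Istanbul/Ankara example on the slides.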
54Calculation of ERS and LRS
score = ERS = LRS = 0 (initialization)
for all regions
    if the region exists in both words
        score = min(1, NSRW1 / NSRW2)
        ERS = ERS + score
    else
        if the region is lacking in both words
            LRS = LRS + 1
        else
            LRS = LRS - 1
        endif
    endif
endfor
ERS: score of existing regions
LRS: score of lacking regions
NSRW1: number of syllables in the related region for the first word
NSRW2: number of syllables in the related region for the second word
55 Category Match Score (CMS)
- Category match -> CMS
- CMS = 3.7 (maximum possible value of RRS)
Example: calculation of WRS for the words Istanbul
and Ankara
ERS = 1/1 + 1/2 = 3/2
LRS = -1 + 1 = 0
RRS = 0.9 x 3/2 + 0.1 x 0 = 1.35
CMS = 3.7
WRS = 1.35 + 3.7 = 5.05
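The score definitions and the Istanbul/Ankara example can be checked with a short sketch. The function names and the tuple layout (before, stressed, after, phrase-final syllable counts) are illustrative; setting CMS to 0 when the categories differ is an assumption, since the slides only give the value for a match.

```python
CMS_MATCH = 3.7  # value given on the slide for a category match


def regional_resemblance(word1, word2):
    """RRS of two words given as per-region syllable counts."""
    ers = 0.0  # score of existing regions
    lrs = 0    # score of lacking regions
    for n1, n2 in zip(word1, word2):
        if n1 > 0 and n2 > 0:      # region exists in both words
            ers += min(1.0, n1 / n2)
        elif n1 == 0 and n2 == 0:  # region lacking in both words
            lrs += 1
        else:                      # region exists in only one word
            lrs -= 1
    return 0.9 * ers + 0.1 * lrs


def word_resemblance(word1, word2, same_category):
    """WRS = RRS + CMS; CMS = 0 without a match is an assumption."""
    cms = CMS_MATCH if same_category else 0.0
    return regional_resemblance(word1, word2) + cms


# (before, stressed, after, phrase_final) syllable counts
istanbul = (1, 1, 1, 0)
ankara = (0, 1, 2, 0)
rrs = regional_resemblance(istanbul, ankara)    # about 1.35
wrs = word_resemblance(istanbul, ankara, True)  # about 5.05
```

Because the ratio NSRW1 / NSRW2 is capped at 1, the score is not symmetric in its two arguments, so the order in which the words are passed matters.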
56Sentence Resemblance Score
- I1, I2, ..., IN: words of the input sentence
- D1, D2, ..., DM: words of the template sentence
- MxN score matrix S with entries Si,j, where Si,j = WRS
of the pair (Di, Ij)
- Path: (Da, Ib), (Dc, Id), ..., (De, If)
- with 1 <= a < c < ... < e <= M and 1 <= b < d < ... < f
<= N
- Score of the path = sum of WRSs of its pairs
- TASK: Find the path with the maximum score
(maximum score path)
- score of maximum score path = sentence
resemblance score
- optimum combination of word pairings preserving
order
57EXAMPLE
TEMPLATE: Geçen aksam hepimiz müzigin büyüsüne kapilmistik.
INPUT: Büyük dayimiz Kadiköydeki evinde senelerdir yalniz oturuyor.
(aksam, Büyük), (müzigin, dayimiz), (kapilmistik, evinde): valid
(hepimiz, dayimiz), (geçen, evinde), (büyüsüne, yalniz): invalid
(aksam, evinde), (müzigin, dayimiz), (kapilmistik, oturuyor): invalid
(geçen, dayimiz), (hepimiz, dayimiz), (kapilmistik, oturuyor): invalid
58Procedure
- MxN matrix MPS: maximum path scores
- MxNx2 matrix CMPS: coordinates of the maximum path scores
- MPSi,j contains the score of the maximum score
path beginning with the pair (Di, Ij)
- CMPSi,j,k contains the indices of the next pair
in the same path (for example, if the max score
path of (Di, Ij) is (Di, Ij), (Dm, In), ..., (Dp,
Iq), then CMPSi,j,1 = m and CMPSi,j,2 = n)
- Recursive generation of MPS from itself and S
- CMPS generated from MPS
59Procedure
for i = M, M-1, ..., 1
    for j = N, N-1, ..., 1
        if (i = M) or (j = N)
            MPSi,j = Si,j
            CMPSi,j,1 = CMPSi,j,2 = EMPTY
        else
            MPSi,j = Si,j + value of the max element of MPSp,q,
                     i+1 <= p <= M and j+1 <= q <= N
            CMPSi,j,1 = first index of the max element of MPSp,q,
                        i+1 <= p <= M and j+1 <= q <= N
            CMPSi,j,2 = second index of the max element of MPSp,q,
                        i+1 <= p <= M and j+1 <= q <= N
        endif
    endfor
endfor
61Finding the maximum score path from MPS and CMPS
- Sentence resemblance score = max over i,j of MPSi,j
(= MPSa,b, for example)
- MPSa,b -> max score path begins with (Da, Ib)
- Look up CMPSa,b,1 and CMPSa,b,2 to obtain the
second pair of the path
- If, for example, CMPSa,b,1 = c and CMPSa,b,2 = d ->
(Dc, Id) is the second pair
- Similarly, look up CMPSc,d,1 and CMPSc,d,2 to
obtain the third pair of the path, etc.
- Entire path is obtained
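The generation of MPS and CMPS and the recovery of the path can be sketched as follows. This is an illustrative implementation of the recursion described on the slides, not the original code: indices are 0-based here (the slides use 1-based indices), `None` stands for EMPTY, and on ties the first maximum found in row-major order is used, which is an assumption.

```python
def max_score_path(S):
    """S[i][j] is the WRS of (template word i, input word j).

    Returns (sentence resemblance score, list of (i, j) pairs).
    """
    M, N = len(S), len(S[0])
    MPS = [[0.0] * N for _ in range(M)]    # maximum path scores
    CMPS = [[None] * N for _ in range(M)]  # next pair of each path
    for i in range(M - 1, -1, -1):
        for j in range(N - 1, -1, -1):
            MPS[i][j] = S[i][j]
            if i == M - 1 or j == N - 1:
                continue  # last row or column: the path ends here
            # Best continuation strictly below and to the right.
            best = (i + 1, j + 1)
            for p in range(i + 1, M):
                for q in range(j + 1, N):
                    if MPS[p][q] > MPS[best[0]][best[1]]:
                        best = (p, q)
            MPS[i][j] += MPS[best[0]][best[1]]
            CMPS[i][j] = best
    # The path starts at the largest entry of MPS.
    cells = [(i, j) for i in range(M) for j in range(N)]
    start = max(cells, key=lambda ij: MPS[ij[0]][ij[1]])
    path, pos = [], start
    while pos is not None:
        path.append(pos)
        pos = CMPS[pos[0]][pos[1]]
    return MPS[start[0]][start[1]], path


S = [[5, 1, 0],
     [0, 2, 4],
     [2, 0, 3]]
score, path = max_score_path(S)  # score 10, path [(0, 0), (1, 1), (2, 2)]
```

The sketch follows the recursion literally, so a path is always extended while rows and columns remain, even when the best continuation has a negative score.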
62We obtained answers to the following questions
- What is the max resemblance capacity of the
template sentence to the input sentence?
- Answer: sentence resemblance score (score of the
max score path)
- How to arrive at this max capacity, i.e. how to
match the words and choose the pairs?
- Answer: as in the max score path
63Question Sentences
- Pitch curve of a question <-> pitch curve of a
word
- Whole question regarded as a word
- Use the same regions defined for words
- Region before the stressed syllable
- Stressed syllable (stressed syllable of the
wh-word or of the word before the question suffix)
- Region after the stressed syllable
- Phrase-final syllable (exists for wh-questions)
- Use the same procedure that assigns RRS to words to
assign a sentence resemblance score to the questions
64EXAMPLE
Sentences:
Ayse bugün evde hangi yemegi yapti?
Bu su sesi yukaridan mi geliyor?
Regions
65Methodology
Generate Regional Durations
- Region -> one or more syllables
- Inputs (related to input and template sentences)
- The label files
- The number of syllables for each word
- The index of the syllable bearing word stress,
for each word
- Whether the last syllable shows a
pitch rise or not, for each word (conditional,
wh-question)
- Assumes a perfect duration analysis for the input
sentence (label file of input sentence)
- Determines the duration of each region (its
onset and end) for each word in both sentences
66Methodology
Apply Pitch
- Inputs
- Regional durations generated by the previos block
- Pitch contour of the template sentence
- The max score path pertaining to the input and
template sentences - For all pairs of the path, the pitch of the
template sentence is applied to the input
sentence, for the regions existing in both
elements of a pair - Usage of spline interpolation
- Stretching / compression in time
- Data fitting for nonexisting regions
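Stretching or compressing a template region in time amounts to resampling its pitch samples onto the number of samples the input region needs. The sketch below uses plain linear interpolation as a simple stand-in for the spline interpolation named on the slide; the function name `resample_linear` and the sample-count interface are hypothetical.

```python
def resample_linear(values, new_length):
    """Stretch or compress a pitch segment to new_length samples.

    Linear interpolation is used here for brevity; the slides use
    spline interpolation for this step.
    """
    if not values or new_length < 1:
        raise ValueError("need at least one sample and a positive length")
    if len(values) == 1 or new_length == 1:
        return [float(values[0])] * new_length
    # Map output sample k onto a fractional position in the input.
    scale = (len(values) - 1) / (new_length - 1)
    out = []
    for k in range(new_length):
        pos = k * scale
        lo = min(int(pos), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


stretched = resample_linear([100, 200], 3)                  # longer region
compressed = resample_linear([100, 120, 140, 160, 180], 3)  # shorter region
```

The first and last samples are always preserved, so the contour shape of the template region is kept while its duration changes.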
67Improvements
Discarding Unvoiced Regions
- Problem
- unvoiced regions of template sentence + spline
-> distortions
- Example
- Input: Yildizlar dünyadan gündüz görülmez
- Template: Zamanimi televizyonun karsisinda bos
yere harcayamam
- Path: (zamanimi, yildizlar), (karsisinda,
dünyadan), (yere, gündüz), (harcayamam, görülmez)
- Problematic pairs: (karsisinda, dünyadan) and
(yere, gündüz)
- unvoiced regions in karsisinda (/k/, /s/ and /s/)
and yere
- Solution: discard zero samples (unvoiced) and
then apply
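Discarding the zero samples before interpolating can be sketched as below. It assumes the pitch tracker reports unvoiced frames as 0, as the slide implies; the helper names are hypothetical, and piecewise-linear interpolation again stands in for the spline.

```python
from bisect import bisect_right


def drop_unvoiced(times, pitch):
    """Keep only the voiced samples (pitch > 0) and their time stamps."""
    kept = [(t, f) for t, f in zip(times, pitch) if f > 0]
    return [t for t, _ in kept], [f for _, f in kept]


def interpolate_linear(times, values, t):
    """Piecewise-linear value at t, for t within [times[0], times[-1]]."""
    if len(times) < 2 or not times[0] <= t <= times[-1]:
        raise ValueError("t outside the data points")
    k = min(bisect_right(times, t), len(times) - 1)
    t0, t1 = times[k - 1], times[k]
    v0, v1 = values[k - 1], values[k]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


times = [0, 1, 2, 3, 4]
pitch = [100, 0, 0, 130, 140]  # two unvoiced frames in the middle
v_times, v_pitch = drop_unvoiced(times, pitch)
bridged = interpolate_linear(v_times, v_pitch, 1.5)  # 115.0
```

With the zeros removed, the interpolated contour bridges the unvoiced gap between the neighbouring voiced samples instead of being pulled towards zero.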
68Yildizlar dünyadan gündüz görülmez.
69Improvements
- Problem: poor performance of spline outside the
borders of data points to be interpolated
- Example
- Input: Didem her aksam odasinda günlük gazeteleri
okur
- Template: Annem bize her zaman çok lezzetli
yemekler pisirir
- Problematic pairs: (annem, didem) and (pisirir,
okur)
- Solution: applying the value of the outermost
data point to the whole region, if the region
goes beyond this data point
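Holding the value of the outermost data point can be expressed as a thin wrapper around any interpolating function. In this sketch `clamp_outside` is a hypothetical helper, and a cubic polynomial stands in for a spline to show how quickly an unconstrained curve can run away outside its data range.

```python
def clamp_outside(interpolate, t_first, t_last):
    """Wrap an interpolant so that it is constant outside [t_first, t_last]."""
    def evaluate(t):
        if t <= t_first:
            return interpolate(t_first)  # hold the first data point
        if t >= t_last:
            return interpolate(t_last)   # hold the last data point
        return interpolate(t)
    return evaluate


# Stand-in for a spline fitted to data points between t = 1 and t = 2.
def curve(t):
    return t ** 3


clamped = clamp_outside(curve, 1.0, 2.0)
inside = clamped(1.5)  # 3.375, unchanged within the data range
left = clamped(0.0)    # 1.0 instead of 0.0
right = clamped(3.0)   # 8.0 instead of 27.0
```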
70Didem her aksam odasinda günlük gazeteleri okur.
71Improvements
- Problem: spline sometimes yields unsatisfactory
results within the data points
- Example
- Input: Çocuklar yazin günesin altinda fazla
kalmamali.
- Problematic region: /zin/ of yazin generated by
spline
Çocuklar yazin günesin altinda fazla kalmamali.
72Improvements
- Solution: check spline; spline ->
linear interpolation when necessary
- Spline check: linear regression line, upper
threshold and lower threshold lines for the pitch
of template sentence
- If spline exceeds the threshold lines: spline ->
linear
Linear regression and the two threshold lines.
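The spline check can be sketched as a least-squares line through the template pitch with two parallel threshold lines. The slides do not say how far the threshold lines lie from the regression line, so the `margin` parameter below is a hypothetical choice, as are the function names.

```python
def fit_line(xs, ys):
    """Least-squares regression line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def within_thresholds(times, template_pitch, curve_times, curve_values, margin):
    """True if every curve value lies between the two threshold lines.

    The threshold lines are assumed to be the regression line of the
    template pitch shifted up and down by `margin` (hypothetical).
    """
    slope, intercept = fit_line(times, template_pitch)
    return all(abs(v - (slope * t + intercept)) <= margin
               for t, v in zip(curve_times, curve_values))


def choose_curve(times, template_pitch, curve_times, spline_values,
                 linear_values, margin):
    """Keep the spline output unless it leaves the threshold band."""
    if within_thresholds(times, template_pitch, curve_times,
                         spline_values, margin):
        return spline_values
    return linear_values


times = [0, 1, 2, 3]
template_pitch = [100, 110, 120, 130]
good = choose_curve(times, template_pitch, [0.5, 1.5], [105, 115], [104, 116], 5)
bad = choose_curve(times, template_pitch, [0.5, 1.5], [105, 140], [104, 116], 5)
```

In the second call the spline value 140 lies 25 Hz above the regression line, so the linear values are returned instead.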
73Çocuklar yazin günesin altinda fazla kalmamali.
74Discussion
Performance at sentence ends
- good -> choosing from same type and state ->
expected
- microprosody degrades performance (unvoiced
regions of input sentence unknown)
Kuzenim Nalan Oyaya yarin aliyor.
75Discussion
Performance at sentence ends
- good -> choosing from same type and state ->
expected
- microprosody degrades performance (unvoiced
regions of input sentence unknown)
Marsta hayat var midir?
76Discussion
Performance at sentence ends
- erroneous endings (increase instead of decrease)
due to template pitch
78Discussion
Performance at movements (rises and falls)
- limited since
- the method is confined to
- the capacity of the database (same type, state)
- the capacity of the template sentence
- prosodic boundaries (yazin) and accented
syllables unknown
Çocuklar yazin günesin altinda fazla kalmamali.
79Discussion
Performance at movements (rises and falls)
- limited since
- the slope of the rise or fall may differ in input
and template sentences (bizim)
Bizim Nevin domatesli menemen yemeli.
80Discussion
Performance at movements (rises and falls)
- limited since
- there may be an absolute difference between pitch
values of both sentences (gündüz)
Yildizlar genellikle gündüz görülmez.
81Discussion
Performance at movements (rises and falls)
- limited since
- microprosodic effects (kardesim)
Kardesim Nalan yeni ayna aliyor.
82Discussion
Performance at movements (rises and falls)
- limited since
- effects of rises and falls on neighbouring
syllables are handled partially (only within
words)
- Example
- Input: Merve bu sefer zamaninda dönemez
- Template: Aksamki yemek pek güzel degildi
- Merve from yemek (/ye/ of yemek affected by /ki/
of aksamki)
83Aksamki yemek pek güzel degildi.
Merve bu sefer zamaninda dönemez.
84Discussion
Performance at questions
- High success due to their simple nature
Niçin sorularima cevap vermiyorsun?
85Discussion
Performance at questions
- High success due to their simple nature
Önce nereye bilgi verilmeli?
86Discussion
Performance at questions
- High success due to their simple nature
Ona bu güzel kolyeyi satin almayacak misin?
87Discussion
Objective Evaluation
- Pitch -> speech melody, human perception -> ST
scale
- distance d in ST between two frequencies f1 and
f2 is given as
- d = 12 x log2 (f1 / f2)
- metrics
- mean squared distance between original and
synthesized in ST
- proportion of samples with < 2 ST distance
- compare with baseline solution constructed as
- 6 types x 2 states -> 12 groups of DB sentences
- for each sentence -> median of nonzero pitch
- average of the medians of the sentences in each group ->
12 baselines
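The two metrics can be computed directly from the semitone formula above. This is a minimal sketch: the function names are illustrative, and skipping frames where either contour is 0 (unvoiced) is an assumption, motivated by the baseline being built from nonzero pitch only.

```python
import math


def semitone_distance(f1, f2):
    """Distance in semitones (ST) between two frequencies in Hz."""
    return 12.0 * math.log2(f1 / f2)


def evaluate(original, synthesized, limit_st=2.0):
    """Return (mean squared ST distance, proportion of frames below limit_st).

    Frames where either contour is 0 are skipped (assumption).
    """
    pairs = [(o, s) for o, s in zip(original, synthesized) if o > 0 and s > 0]
    if not pairs:
        raise ValueError("no voiced frames to compare")
    distances = [semitone_distance(s, o) for o, s in pairs]
    mean_squared = sum(d * d for d in distances) / len(distances)
    proportion = sum(abs(d) < limit_st for d in distances) / len(distances)
    return mean_squared, proportion


octave = semitone_distance(440.0, 220.0)  # 12.0: one octave is 12 ST
msd, close = evaluate([220.0, 220.0, 0.0], [440.0, 220.0, 200.0])
```

In the usage example one voiced frame is an octave off and one is identical, so the mean squared distance is 72 and half of the voiced frames lie within 2 ST.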
90Discussion
Objective Evaluation
Results
- ANOVA (analysis of variance)
- p: the probability that the means belonging to
each method are equal
- p < 0.10 or 0.05 or 0.01 -> difference between the averages
is statistically significant
- Method better than baseline in general
- Performance at close test sentences > performance
at random test sentences
- best results in questions
- similar results in both metrics
91Conclusion
- Intonation and stress -> fundamental frequency
- Analysis of pitch contours
- Method based on syntactic structure in terms of
word categories and word stress information
- Automatic generation of these inputs from text is
relatively easy.
- Makes use of
- a sentence database (corpus of natural speech)
- interpolation
- Recordings of a single speaker
92Future Work
- Inclusion of other speakers
- A further categorization of words instead of POS
categories -> subcategories -> more complex
syntactic structures -> larger database for
efficiency
- Other inputs
- prosodic boundaries
- accented syllables
- and their automatic generation from input text
(prosodic description)
- Handling microprosody