Title: Modelling Prosody for Speech Synthesis: example from Polish
1 Modelling Prosody for Speech Synthesis: example from Polish
- Dominika Oliver
- IGK Colloquium
- 22 July 2004
2 Outline
- Goal
- prosodic modelling for TTS
- Review of past studies
- intonational investigations
- Current state
- latest modelling results
3 TTS Cycle
- Text input (raw or annotated)
- Text processing: text normalisation (names, abbreviations, numbers) and linguistic analysis (morphology, syntax, semantics)
- Phonetic analysis: grapheme-to-phoneme conversion (rules, dictionary)
- Prosodic analysis: pitch, phrasing and duration modelling
- Speech synthesis: voice rendering
4 TTS Cycle
- Prosodic analysis/modelling
- Prosodic components (focus, stress, duration etc.)
- Prosodic phrasing
- Intonation: accent types, pitch contour
5 Overview
- Procedure
- Resources
- Modelling techniques
- Modelling prosody
- Problems & solutions
- Suggested improvements
6 Procedure
- Prosodic modelling shopping list
- Language-specific intonation description
- Accent type and placement prediction
- F0 generation methods
- Research and evaluation tool (Festival)
7 Language-specific intonation description
- Quantitative analysis of Polish intonation (accent types)
- Standard description of Polish intonation (Jassem, 1961, 1984; Demenko, 1999)
- Falling: HL, HM, ML, xL
- Rising: LM, MH, LH
- Level: MM
- Rise-fall: LHL
- Broad-narrow focus/peak alignment study (Andreeva and Oliver, 2003)
8 Accent types
9 Accent types
11 Resources
- Speech corpora: PoInt (Polish Intonation Database) (Karpinski, 2001)
- 350 MB, multi-speaker (40)
- read and (semi-)spontaneous speech
- Transcribed
- Syllable-based IPA segmental transcription
- Syllable-based prosodic annotation
12 Resources
- PoInt prosodic transcription
- Tone heights: xH, H, M, L, xL
- Phrase boundary indication
13 Resources
14 Resources
15 Resources
- Festival TTS (Black & Taylor, 1998)
- a general multi-lingual speech synthesis system
- offers a full text-to-speech system
- environment for development and research of speech synthesis techniques
17 Modelling techniques
- Default prosodic assignment from simple text analysis
- Hand-built rule-based systems: hard to modify and adapt to new domains
- Corpus-based approaches (Sproat et al., 1992)
- Train prosodic variation on large labelled corpora using machine learning techniques
18 Modelling techniques - accent type/placement prediction
- Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone, 1984, 1993)
- In speech synthesis widely used to model
- segment durations (e.g. Riley, 1992)
- accent prediction (Syrdal, Hirschberg, McGory & Beckman, 2001)
- pitch contour generation (Dusterhoff, 1997; Dusterhoff, Black & Taylor, 1999)
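To make the CART idea concrete, here is a minimal sketch of a classification tree in plain Python. It is not the tool used in this work: the feature names, the toy data and the restriction to equality tests on categorical features with Gini impurity are all assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature index, value) of the equality test that lowers the
    weighted Gini impurity the most, or None if no test improves it."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for v in sorted(set(r[f] for r in rows)):
            yes = [l for r, l in zip(rows, labels) if r[f] == v]
            no = [l for r, l in zip(rows, labels) if r[f] != v]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score:
                best, best_score = (f, v), score
    return best


def build_tree(rows, labels):
    """Grow the tree recursively; leaves are majority class labels."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, v = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[f] == v]
    no = [(r, l) for r, l in zip(rows, labels) if r[f] != v]
    return (f, v,
            build_tree([r for r, _ in yes], [l for _, l in yes]),
            build_tree([r for r, _ in no], [l for _, l in no]))


def classify(tree, row):
    """Follow the equality tests from the root down to a leaf label."""
    while isinstance(tree, tuple):
        f, v, yes, no = tree
        tree = yes if row[f] == v else no
    return tree


# Invented toy data: (stress, part of speech) -> accent label
rows = [("stressed", "noun"), ("stressed", "verb"), ("unstressed", "noun"),
        ("unstressed", "verb"), ("stressed", "func"), ("unstressed", "func")]
labels = ["H", "H", "none", "none", "L", "none"]
tree = build_tree(rows, labels)
```

On this toy set the tree first tests stress and then, for stressed syllables only, whether the word is a function word.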
19 Modelling techniques - F0 prediction
- Linear regression (Black & Hunt, 1996), used e.g. for F0 contour prediction/generation
- find the appropriate F0 target per syllable based on available features trained from data
- predicted variable ($p$) can be modelled as a sum of a set of weighted real-valued factors
- $p = w_0 + w_1 f_1 + w_2 f_2 + \dots + w_n f_n$
- factors ($f_i$): parameterised properties of the data
- weights ($w_i$): usually trained using a stepwise least squares technique
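The weighted sum above can be sketched in a few lines of plain Python. This is an illustration only: it fits the weights with ordinary least squares via the normal equations rather than the stepwise procedure mentioned on the slide, and the two binary features and the F0 targets are invented.

```python
def predict(weights, features):
    """p = w0 + w1*f1 + ... + wn*fn, where weights[0] is the intercept w0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # constant 1 carries the intercept w0
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


# Invented toy data: two binary features per syllable and an F0 target in Hz
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
targets = [100.0, 110.0, 120.0, 130.0]
weights = fit_least_squares(rows, targets)
f0_target = predict(weights, (1, 1))
```

Because the toy targets are an exact linear function of the features, the fitted weights recover the intercept and the two coefficients; real data would leave a residual.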
20 Prerequisite
- F0 normalisation (Ladd, 1995; Clark, 2003)
- (PoInt: 40 speakers, mixed sex)
- normalisation uses the F0 mean and the F0 standard deviation of the utterance
- the rescaling uses the standard deviation and mean F0 of the database
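The exact normalisation formula is not preserved in the text above. One common reading, assumed here, is a z-score per utterance followed by rescaling to the database statistics; the function name and the use of the population standard deviation are choices made for this sketch.

```python
from statistics import mean, pstdev


def normalise_f0(utt_f0, db_mean, db_sd):
    """Z-score each F0 value against its own utterance, then rescale
    to the mean and standard deviation of the whole database."""
    mu = mean(utt_f0)
    sigma = pstdev(utt_f0)  # population SD; an assumption of this sketch
    if sigma == 0:
        raise ValueError("flat F0 contour cannot be z-scored")
    return [(f - mu) / sigma * db_sd + db_mean for f in utt_f0]


# Invented values: one utterance in Hz and hypothetical database statistics
utterance = [100.0, 120.0, 140.0]
rescaled = normalise_f0(utterance, db_mean=200.0, db_sd=30.0)
```

After rescaling, every utterance has the database mean and standard deviation, so male and female speakers can be pooled in one model.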
22 Modelling
- Steps
- Building the utterance structure of the database speech files
- Incorporating database intonation labelling
- Extracting features for accent prediction and F0 generation
- Building CART model
- PoInt intonation labels
- Building LR model
- 3 points per syllable
- Incorporating model parameters into voice description
23 Modelling - accent type/placement prediction
- Model based on PoInt
- multiple speakers (male, female)
- Accent inventory (L, H, M)
- Accent prediction method: CART
- Features (31)
- POS window
- Position of candidate syllable in word and sentence
- Stress information window etc.
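The window features listed above could be assembled roughly as follows. This is a sketch under assumptions: the dictionary keys, the window size of one syllable either side and the `None` padding are hypothetical, and the actual set of 31 features is not specified here.

```python
def window(values, i, size=1, pad=None):
    """Values at positions i-size .. i+size, padded beyond the edges."""
    return [values[j] if 0 <= j < len(values) else pad
            for j in range(i - size, i + size + 1)]


def syllable_features(syllables, i):
    """Feature dict for syllable i. Each syllable is a dict with the
    hypothetical keys 'pos', 'stress' and 'pos_in_word'."""
    pos = [s["pos"] for s in syllables]
    stress = [s["stress"] for s in syllables]
    feats = {"pos_in_word": syllables[i]["pos_in_word"],
             "pos_in_sentence": i}
    pairs = zip(window(pos, i), window(stress, i))
    for offset, (p, s) in enumerate(pairs, start=-1):
        feats[f"pos[{offset:+d}]"] = p
        feats[f"stress[{offset:+d}]"] = s
    return feats


# Invented three-syllable example
syllables = [
    {"pos": "noun", "stress": 1, "pos_in_word": 0},
    {"pos": "noun", "stress": 0, "pos_in_word": 1},
    {"pos": "verb", "stress": 1, "pos_in_word": 0},
]
first = syllable_features(syllables, 0)
```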
24 Results - accent prediction
- train set: 963 total, 897 correct (93.146%)
- test set: 1070 total, 996 correct (93.084%)
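The percentages follow directly from the counts on the slide; a short check (the helper name is arbitrary):

```python
def accuracy(correct, total):
    """Percentage of correctly predicted items."""
    return 100.0 * correct / total


train_acc = accuracy(897, 963)
test_acc = accuracy(996, 1070)
```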
25 Modelling - F0 prediction/generation
- F0 generation: linear regression
- Features
- accent type
- POS window
- Position of candidate syllable in word and sentence
- Stress information window etc.
26 Results - F0 shape prediction
28 Potential problems
- Data
- not enough tokens to learn from
- Annotation inconsistencies (noisy data, messy accent class assignment)
- Inappropriate technique / suboptimal feature set
29 Potential data problems
30 Potential data problems
31 Potential data problems
32 PoInt Analysis
33 Addressing data issues
- F0 tracking errors
- Identifying outliers / annotation inconsistencies
- Re-classifying accent types
34 When everything else fails, blame it on the data
- Labelling errors
- Unmarked disfluencies/wrong reading
- Phonemic labelling
- Missing phrasing
- No indication of sentence mode in annotation
- Inconsistent labelling
- Misleading transcription description
- No independent labellers
35 Data fixes
- Automatically identifying outliers / annotation inconsistencies
- Statistical analysis of acoustic parameters
- Manual data inspection
- Insertion of phrase boundaries
- Marking of disfluencies
- Aligning speech with text
- Deriving Gold Standard (hard)
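One simple way to flag candidate outliers automatically is a per-class z-score on an acoustic parameter. The sketch below assumes this approach; the threshold of 2 standard deviations, the parameter and the toy values are invented and are not taken from the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_outliers(tokens, threshold=2.0):
    """Return indices of tokens whose value lies more than `threshold`
    standard deviations from the mean of their own accent class.
    `tokens` is a list of (label, value) pairs."""
    by_label = defaultdict(list)
    for label, value in tokens:
        by_label[label].append(value)
    flagged = []
    for i, (label, value) in enumerate(tokens):
        values = by_label[label]
        sd = pstdev(values)
        if sd > 0 and abs(value - mean(values)) / sd > threshold:
            flagged.append(i)
    return flagged


# Invented F0 movements in Hz: one "H" token falls instead of rising
tokens = [("H", 30), ("H", 32), ("H", 28), ("H", 31), ("H", 29), ("H", 30),
          ("H", -25), ("L", -20), ("L", -22), ("L", -18), ("L", -21)]
suspects = flag_outliers(tokens)
```

Flagged tokens are only candidates; whether they are labelling errors or genuine variation still needs the manual inspection listed above.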
36 Accent classification studies
- Hierarchical clustering (Klabbers & van Santen, 2004)
- Linear regression (Keller & Zellner Keller, 2003)
- EM, bagging, boosting (Sun, 2002)
- HMMs
- (Kumpf & King, 2004)
- (Blackburn, Vonwiller & King, 1993)
- (Batliner et al., 1999, 2001)
- (Maragoudakis, 2003; Zervas, 2004)
- (Chan, Feng, Heinen & Niederjohn, 1994)
37 Accent type re-classification
- Two-stage procedure
- Self-organising maps (Kohonen, 1982, 1995; Kaski, 1997; Vesanto & Alhoniemi, 2000)
- create a set of prototype vectors representative of the data
- projection of prototypes onto a low-dimensional space
- Hierarchical agglomerative clustering (HAC)
- method for finding good candidates for map unit clusters: cut the dendrogram where there is a large distance between two clusters
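The two-stage procedure can be sketched in plain Python as below. This is a toy illustration, not the setup used in the study: the 1-D map of six units, the decay schedules, single-linkage merging and the "largest jump in merge distance" cut rule are all assumptions, and the data points are invented.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_som(data, n_units=6, epochs=50, seed=0):
    """Stage 1: train a 1-D self-organising map; return its prototypes."""
    rng = random.Random(seed)
    protos = [list(rng.choice(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                      # learning rate decays
        radius = max(1.0, n_units / 2 * frac)  # neighbourhood shrinks
        for x in rng.sample(data, len(data)):
            bmu = min(range(n_units), key=lambda u: dist(protos[u], x))
            for u in range(n_units):
                h = math.exp(-((u - bmu) ** 2) / (2 * radius ** 2))
                protos[u] = [p + rate * h * (xi - p)
                             for p, xi in zip(protos[u], x)]
    return protos


def cluster_prototypes(protos):
    """Stage 2: single-linkage agglomerative clustering of the prototypes,
    cut just before the largest jump in merge distance.
    Needs at least three prototypes."""
    clusters = [[i] for i in range(len(protos))]
    history = []  # (merge distance, partition before that merge)
    while len(clusters) > 1:
        d, a, b = min(
            (min(dist(protos[i], protos[j])
                 for i in clusters[a] for j in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters)))
        history.append((d, [c[:] for c in clusters]))
        merged = clusters.pop(b)
        clusters[a].extend(merged)
    cut = max(range(len(history) - 1),
              key=lambda i: history[i + 1][0] - history[i][0])
    return history[cut + 1][1]


# Invented 2-D data with two well-separated groups
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
protos = train_som(data)
groups = cluster_prototypes(protos)
```

Clustering the prototypes rather than the raw tokens keeps the second stage cheap and less sensitive to individual noisy tokens.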
38 Acoustic data parameterisation
- Accent type classification
- (Demenko, 1999)
- Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant)
- Difference between F0 extreme value and end point F0
- Difference between F0 max and F0 min
- Difference between utterance mean F0 and mean F0 for all utterances by the same voice
- Difference between utterance min F0 and global mean min F0 for the same voice
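The five parameters can be computed from F0 samples roughly as follows. Several details are assumptions of this sketch: the "extreme value" is taken as the sample farthest from the start value, the sign conventions are chosen arbitrarily, and the function and key names are hypothetical.

```python
from statistics import mean


def accent_parameters(accent_f0, utt_f0, speaker_utts):
    """Five F0 difference parameters for one accent.

    accent_f0    -- F0 samples over the accent, in time order
    utt_f0       -- F0 samples of the whole utterance
    speaker_utts -- F0 sample lists for all utterances by the same voice
    """
    start, end = accent_f0[0], accent_f0[-1]
    # Assumption: the extreme is the sample farthest from the start value
    extreme = max(accent_f0, key=lambda f: abs(f - start))
    return {
        "start_to_extreme": extreme - start,
        "extreme_to_end": end - extreme,
        "range": max(accent_f0) - min(accent_f0),
        "mean_vs_speaker": mean(utt_f0) - mean(mean(u) for u in speaker_utts),
        "min_vs_speaker": min(utt_f0) - mean(min(u) for u in speaker_utts),
    }


# Invented values in Hz: a rise-fall accent inside a four-sample utterance
utt = [100, 120, 150, 110]
params = accent_parameters([120, 150, 110], utt, [utt, [90, 110, 130]])
```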
39 Accent type re-classification
40 Accent type re-classification
41 Accent type re-classification
42 New results - accent placement prediction
43 New results - accent type prediction
44 Evaluation
- self-organising maps: a potential method for categorisation
- the results are relatively successful and consistent
- data pre-processing is the most critical phase
- the automatic training phase requires solid and consistent (manual) preparation
46 Need for better data
- Based on problems encountered
- Further analysis of clusters
- A large amount of data from a single speaker (primary need)
- A large amount of prosodic variation
- A balanced set of pitch events
- Clear speech which can be easily tracked
- Complex prosodic structure
47 Suggested improvements
- Model modification
- More data, e.g. Peak Alignment study
- Separate models for different sentence types (Y/N questions, statements)
- Re-estimation of parameters based on new intonationally rich data
48 Next
- Closer inspection of automatically assigned accent classes (clusters)
- Evaluation: perception experiments