Modelling Prosody for Speech Synthesis: example from Polish


1
Modelling Prosody for Speech Synthesis: example from Polish
  • Dominika Oliver
  • IGK Colloquium
  • 22 July 2004

2
Outline
  • Goal
  • prosodic modelling for TTS
  • Review of past studies
  • intonational investigations
  • Current state
  • latest modelling results

3
TTS Cycle
  • Text Input (raw or annotated)
  • Text Processing: Text Normalisation (names, abbreviations, numbers);
    Linguistic Analysis (morphology, syntax, semantics)
  • Phonetic Analysis: Grapheme-to-Phoneme Conversion (rules, dictionary)
  • Prosodic Analysis: Pitch, Phrasing and Duration Modelling
  • Speech Synthesis: Voice Rendering

4
TTS Cycle
  • Prosodic analysis/modelling
  • Prosodic components (focus, stress, duration
    etc.)
  • Prosodic phrasing
  • Intonation: accent types, pitch contour

5
Overview
  • Procedure
  • Resources
  • Modelling techniques
  • Modelling prosody
  • Problems and solutions
  • Suggested improvements

6
Procedure
  • Prosodic modelling shopping list
  • Language specific intonation description
  • Accent type and placement prediction
  • F0 generation methods
  • Research and evaluation tool (Festival)

7
Language specific intonation description
  • Quantitative analysis of Polish intonation
    (accent types)
  • Standard description of Polish intonation
    (Jassem, 1961, 1984, Demenko, 1999)
  • Falling: HL, HM, ML, xL
  • Rising: LM, MH, LH
  • Level: MM
  • Rise-fall: LHL
  • Broad-Narrow Focus/Peak alignment study (Andreeva
    and Oliver, 2003)

8
Accent types
  • Falling

9
Accent types
  • Rising

10
Overview
  • Procedure
  • Resources
  • Modelling techniques
  • Modelling prosody
  • Problems and solutions
  • Suggested improvements

11
Resources
  • Speech corpus: PoInt (Polish Intonation
    Database) (Karpinski, 2001)
  • 350 MB, multi-speaker (40 speakers)
  • read and (semi-)spontaneous speech
  • Transcribed
  • Syllable-based segmental transcription (IPA)
  • Syllable-based prosodic annotation

12
Resources
  • PoInt prosodic transcription
  • Tone heights: xH, H, M, L, xL
  • Phrase boundary indication

13
Resources
  • Falling

14
Resources
  • Rising

15
Resources
  • Festival TTS (Black and Taylor, 1998)
  • a general multi-lingual speech synthesis system
  • offers a full text to speech system
  • environment for development and research of
    speech synthesis techniques

16
Overview
  • Procedure
  • Resources
  • Modelling techniques
  • Modelling prosody
  • Problems and solutions
  • Suggested improvements

17
Modelling techniques
  • Default prosodic assignment from simple text
    analysis
  • Hand-built rule-based systems: hard to modify and
    adapt to new domains
  • Corpus-based approaches (Sproat et al., 1992)
  • Train prosodic variation on large labelled corpora
    using machine learning techniques

18
Modelling techniques: accent type/placement prediction
  • Classification and regression trees (CART)
    (Breiman, Friedman, Olshen and Stone 1984, 1993)
  • Widely used in speech synthesis to model:
  • segment durations (e.g. Riley 1992)
  • accent prediction (Syrdal, Hirschberg, McGory and
    Beckman 2001)
  • pitch contour generation (Dusterhoff 1997;
    Dusterhoff, Black and Taylor 1999)
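
The core step of growing a CART, choosing the binary split that leaves the purest child nodes, can be sketched as follows. This is only an illustration and not the tool used in the talk: the Gini impurity criterion, the two toy features (a stress flag and a relative position in the word) and the labels "H" and "none" are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) for the best
    binary split of the form row[feature_index] <= threshold."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send data to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical syllables: (stressed flag, relative position in word).
toy_rows = [(1, 0.9), (1, 0.2), (0, 0.5), (0, 0.8), (1, 0.6), (0, 0.1)]
toy_labels = ["H", "H", "none", "none", "H", "none"]

split = best_split(toy_rows, toy_labels)
print(split)
```

A full tree repeats this search recursively on each child node until a stopping criterion is met; the sketch shows a single level only.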

19
Modelling techniques: F0 prediction
  • Linear regression (Black and Hunt, 1996), used e.g.
    for F0 contour prediction/generation
  • find the appropriate F0 target per syllable based
    on available features trained from data
  • predicted variable ($p$) can be modelled as a sum
    of a set of weighted real-valued factors
  • $p = w_0 + w_1 f_1 + w_2 f_2 + \dots + w_n f_n$
  • factors ($f_i$): parameterised properties of the
    data
  • weights ($w_i$): usually trained using a stepwise
    least squares technique
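
The weighted-sum model can be sketched in a few lines. Note the assumptions: the weights here are fitted with ordinary least squares via the normal equations, not the stepwise procedure named on the slide, and the two features (an accent flag and a relative syllable position) and the target values are invented for illustration.

```python
def predict(weights, features):
    """p = w0 + w1*f1 + ... + wn*fn, with weights[0] as the intercept w0."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, targets):
    """Ordinary least squares: solve (X^T X) w = X^T y."""
    xs = [[1.0] + list(row) for row in rows]  # constant 1 carries w0
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical per-syllable data: (accented flag, relative position)
# with normalised F0 targets generated from w = (0.2, 1.0, -0.5).
rows = [(1, 0.0), (0, 1.0), (1, 1.0), (0, 0.0), (1, 0.5)]
targets = [1.2, -0.3, 0.7, 0.2, 0.95]

weights = fit_least_squares(rows, targets)
print(weights, predict(weights, (1, 0.25)))
```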

20
Prerequisite
  • F0 normalisation (Ladd, 1995; Clark, 2003)
  • (PoInt: 40 speakers, mixed sex)
  • $z = (F0 - \mu) / \sigma$, where $\mu$ is the F0 mean and
    $\sigma$ is the F0 standard deviation of the utterance
  • the rescaling uses the standard deviation and
    mean F0 of the database
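
A minimal sketch of this normalisation, assuming it is a z-score per utterance followed by rescaling to database-level statistics. The function name, the use of the population standard deviation and all F0 values are assumptions for illustration; the slide does not specify them.

```python
from statistics import mean, pstdev


def normalise_f0(utterance_f0, db_mean, db_sd):
    """Z-score each F0 value within its utterance, then rescale with the
    database mean and standard deviation."""
    mu = mean(utterance_f0)
    sigma = pstdev(utterance_f0)
    if sigma == 0:
        # A completely flat contour carries no shape; map it to the mean.
        return [db_mean for _ in utterance_f0]
    return [db_mean + db_sd * (f - mu) / sigma for f in utterance_f0]


# Hypothetical contours (Hz): same shape, different speaker ranges.
female = [210.0, 250.0, 230.0, 190.0]
male = [105.0, 125.0, 115.0, 95.0]

norm_female = normalise_f0(female, db_mean=150.0, db_sd=20.0)
norm_male = normalise_f0(male, db_mean=150.0, db_sd=20.0)
print(norm_female)
print(norm_male)
```

In this toy case both contours map to the same normalised values, which is the point of the step for a mixed-sex, multi-speaker database.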

21
Overview
  • Procedure
  • Resources
  • Modelling techniques
  • Modelling prosody
  • Problems and solutions
  • Suggested improvements

22
Modelling
  • Steps
  • Building the utterance structure of the database
    speech files
  • Incorporating database intonation labelling
  • Extracting features for accent prediction and f0
    generation
  • Building CART model (PoInt intonation labels)
  • Building LR model (3 points per syllable)
  • Incorporating model parameters into voice
    description

23
Modelling: accent type/placement prediction
  • Model based on PoInt
  • multiple speakers (male, female)
  • Accent inventory (L, H, M)
  • Accent prediction method: CART
  • Features (31)
  • POS window
  • Position of candidate syllable in word and
    sentence
  • Stress information window etc.

24
Results: accent prediction
  • train set: 897 of 963 correct (93.146%)
  • test set: 996 of 1070 correct (93.084%)
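
The percentages above are simply correct/total. A short check of the arithmetic (the helper name is hypothetical):

```python
def accuracy_percent(correct, total):
    """Percentage of correctly predicted items."""
    return 100.0 * correct / total


train_acc = accuracy_percent(897, 963)
test_acc = accuracy_percent(996, 1070)
print(f"train: {train_acc:.3f}%  test: {test_acc:.3f}%")
```

The two figures differ by about 0.06 percentage points.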

25
Modelling: F0 prediction/generation
  • F0 generation: linear regression
  • Features
  • accent type
  • POS window
  • Position of candidate syllable in word and
    sentence
  • Stress information window etc.

26
Results: F0 shape prediction
27
Overview
  • Procedure
  • Resources
  • Modelling techniques
  • Modelling prosody
  • Problems and solutions
  • Suggested improvements

28
Potential problems
  • Data
  • not enough tokens to learn from
  • Annotation inconsistencies (noisy data, messy
    accent class assignment)
  • Inappropriate technique / suboptimal feature set

29
Potential data problems
30
Potential data problems
31
Potential data problems
32
PoInt Analysis
  • Peak alignment

33
Addressing data issues
  • F0 tracking errors
  • Identifying outliers / annotation inconsistencies
  • Re-classifying accent types

34
When everything else fails, blame it on the data
  • Labelling errors
  • Unmarked disfluencies/wrong reading
  • Phonemic labelling
  • Missing phrasing
  • No indication of sentence mode in annotation
  • Inconsistent labelling
  • Misleading transcription description
  • No independent labellers

35
Data fixes
  • Automatically identifying outliers/annotation
    inconsistencies
  • Statistical analysis of acoustic parameters
  • Manual data inspection
  • Insertion of phrase boundaries
  • Marking of disfluencies
  • Aligning speech with text
  • Deriving Gold Standard (hard)

36
Accent classification studies
  • Hierarchical clustering (Klabbers and van Santen
    2004)
  • Linear regression (Keller and Zellner Keller, 2003)
  • EM bagging boosting (Sun, 2002)
  • HMMs
  • (Kumpf and King 2004)
  • (Blackburn, Vonwiller and King, 1993)
  • (Batliner et al 1999, 2001)
  • (Maragoudakis 2003, Zervas 2004)
  • (Chan, Feng, Heinen, and Niederjohn 1994)

37
Accent type re-classification
  • Two stage procedure
  • Self-organising maps (Kohonen 1982, 1995; Kaski,
    1997; Vesanto and Alhoniemi, 2000)
  • create a set of prototype vectors representative
    of the data
  • project the prototypes onto a low-dimensional
    space
  • Hierarchical agglomerative clustering (HAC)
  • to find good candidates for map unit clusters,
    cut the dendrogram where there is a large
    distance between two clusters
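
The second stage can be sketched as below, assuming the self-organising map has already produced its prototype vectors (the SOM training itself is not shown). Average linkage, Euclidean distance, the "largest jump between successive merge distances" reading of the cut rule, and the toy prototypes are all assumptions made for this example.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hac(points):
    """Agglomerative clustering. Returns one entry per merge:
    (merge_distance, clusters_as_they_were_before_that_merge)."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((d, [c[:] for c in clusters]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


def cut_at_largest_gap(points):
    """Cut the dendrogram just below the largest jump in merge distance.
    Needs at least three points, i.e. at least two merges."""
    history = hac(points)
    gaps = [history[k + 1][0] - history[k][0] for k in range(len(history) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return history[k + 1][1]  # clusters present before the costly merge


# Hypothetical 2-D prototype vectors forming two well-separated groups.
prototypes = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

clusters = cut_at_largest_gap(prototypes)
print(sorted(sorted(c) for c in clusters))
```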

38
Acoustic data parameterisation
  • Accent type classification
  • (Demenko, 1999)
  • Difference between start F0 (first vowel) and F0
    extreme value (on a vowel or consonant)
  • Difference between F0 extreme value and end point
    F0
  • Difference between F0 max and F0 min
  • Difference between utterance mean F0 and mean F0
    for all utterances by the same voice
  • Difference between utterance min F0 and global
    mean min F0 for the same voice
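
The five difference parameters listed above can be computed from an F0 contour roughly as follows. This is a sketch under several assumptions not stated on the slide: the accent contour is given as a list of F0 samples starting at the first vowel, the "extreme value" is taken to be the sample that deviates most from the start value, the sign conventions follow the order in which the terms are listed, and all names and values are hypothetical.

```python
from statistics import mean


def accent_features(accent_f0, utterance_f0, speaker_mean_f0, speaker_mean_min_f0):
    """Five difference-based parameters for one accent (values in Hz)."""
    start, end = accent_f0[0], accent_f0[-1]
    # Assumption: the extreme is the sample farthest from the start value.
    extreme = max(accent_f0, key=lambda f: abs(f - start))
    return {
        "start_minus_extreme": start - extreme,
        "extreme_minus_end": extreme - end,
        "max_minus_min": max(accent_f0) - min(accent_f0),
        "utt_mean_minus_speaker_mean": mean(utterance_f0) - speaker_mean_f0,
        "utt_min_minus_speaker_mean_min": min(utterance_f0) - speaker_mean_min_f0,
    }


# Hypothetical rise-fall accent inside a longer utterance.
accent = [180, 200, 220, 190, 160]
utterance = [150, 180, 200, 220, 190, 160, 140]

feats = accent_features(accent, utterance,
                        speaker_mean_f0=170, speaker_mean_min_f0=130)
for name, value in feats.items():
    print(name, round(value, 2))
```

Vectors of this kind would then serve as the input to the clustering stage.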

39
Accent type re-classification
  • Clusters description




40
Accent type re-classification
  • Clusters characteristics

41
Accent type re-classification
42
New results: accent placement prediction
  • train data
  • test data

43
New results: accent type prediction
  • train data
  • test data

44
Evaluation
  • self-organising maps: a potential method for
    categorisation
  • the results are relatively successful and consistent
  • data pre-processing is the most critical phase
  • the automatic training phase requires solid and
    consistent (manual) preparation

45
Overview
  • Procedure
  • Resources
  • Modelling techniques
  • Modelling prosody
  • Problems and solutions
  • Suggested improvements

46
Need for better data
  • Based on problems encountered
  • Further analysis of clusters
  • A large amount of data from a single speaker
    (primary need)
  • A large amount of prosodic variation
  • A balanced set of pitch events
  • Clear speech which can be easily tracked
  • Complex prosodic structure

47
Suggested improvements
  • Model modification
  • More data, e.g. from the peak alignment study
  • Separate models for different sentence types
    (yes/no questions vs. statements)
  • Re-estimation of parameters based on new
    intonationally rich data

48
Next
  • Closer inspection of automatically assigned
    accent classes (clusters)
  • Evaluation: perception experiments

49
  • The End