Text-to-Speech Part II - PowerPoint PPT Presentation

Slides: 63
Provided by: stardust

Transcript and Presenter's Notes

Title: Text-to-Speech Part II


1
Text-to-Speech Part II
2
Previous Lecture Summary
  • Previous lecture presented
      • Text and Phonetic Analysis
      • Prosody-I
          • General Prosody
          • Speaking Style
          • Symbolic Prosody
  • This lecture continues
      • Prosody-II
          • Duration Assignment
          • Pitch Generation
          • Prosody Markup Languages
          • Prosody Evaluation

3
Duration Assignment
  • Pitch and duration are not entirely independent,
    and many of the higher-order semantic factors
    that determine pitch contours may also influence
    durational effects.
  • Most systems nevertheless treat duration and pitch
    independently because of practical considerations
    (van Santen, 1994).
  • Numerous factors, including semantic and
    pragmatic conditions, might ultimately influence
    phoneme durations. Some factors that are
    typically neglected include:
      • The issue of speech rate relative to speaker
        intent, mood, and emotion.
      • The use of duration and rhythm to possibly signal
        document structure above the level of phrase or
        sentence (e.g., paragraph).
      • The lack of a consistent and coherent practical
        definition of the phone such that boundaries can
        be clearly located for measurement.

4
Duration Assignment
  • Rule-Based Methods
  • Allen (1987) identified a number of first-order
    perceptually significant effects that have
    largely been verified by subsequent research.

Perceptually significant effects for duration (Allen, 1987):
  • Lengthening of the final vowel and following consonants in prepausal syllables.
  • Shortening of all syllabic segments (vowels and syllabic consonants) in nonprepausal position.
  • Shortening of syllabic segments if not in a word-final syllable.
  • Consonants in non-word-initial position are shortened.
  • Unstressed and secondary-stressed phones are shortened.
  • Emphasized vowels are lengthened.
  • Vowels may be shortened or lengthened according to phonetic features of their context.
  • Consonants may be shortened in clusters.
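
The effects listed above can be sketched as a small rule cascade. The following is a minimal illustration, not the method described in the slides: it assumes a multiplicative form commonly attributed to Klatt-style duration models, DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100, and every numeric value (inherent durations, minimum durations, percentages) is a hypothetical placeholder. Only some of the listed effects are covered.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    phone: str
    is_vowel: bool
    stressed: bool = False
    prepausal: bool = False      # caller marks the final vowel and following consonants
    emphasized: bool = False
    word_initial: bool = False


# Hypothetical values in milliseconds; a real system would use per-phone tables.
INHERENT_MS = {"vowel": 160.0, "consonant": 80.0}
MINIMUM_MS = {"vowel": 60.0, "consonant": 40.0}


def duration_ms(seg: Segment) -> float:
    """Apply a cascade of lengthening/shortening rules to one segment."""
    kind = "vowel" if seg.is_vowel else "consonant"
    inherent, minimum = INHERENT_MS[kind], MINIMUM_MS[kind]
    percent = 100.0
    if seg.prepausal:
        percent *= 1.4           # prepausal lengthening (assumed factor)
    elif seg.is_vowel:
        percent *= 0.6           # syllabic segments shorten in nonprepausal position
    if not seg.is_vowel and not seg.word_initial:
        percent *= 0.85          # non-word-initial consonants shorten
    if not seg.stressed:
        percent *= 0.7           # unstressed phones shorten
    if seg.emphasized and seg.is_vowel:
        percent *= 1.4           # emphasized vowels lengthen
    # Only the portion above the minimum duration is stretched or compressed.
    return minimum + (inherent - minimum) * percent / 100.0


final_vowel = Segment("aa", is_vowel=True, stressed=True, prepausal=True)
medial_vowel = Segment("ih", is_vowel=True)
print(duration_ms(final_vowel), duration_ms(medial_vowel))
```

Because the percentage acts only on the stretchable part of the segment, no combination of shortening rules can push a duration below its assumed minimum.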
5
Duration Assignment
  • CART-based Durations
  • A number of generic machine-learning methods have
    been applied to the duration assignment problem,
    including CART and linear regression (Plumpe et
    al., 1998).
  • Features used to predict duration include:
      • Phone identity
      • Primary lexical stress (binary feature)
      • Left phone context (1 phone)
      • Right phone context (1 phone)
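
To make the CART idea concrete, here is a minimal pure-Python regression tree over the four features listed above. It is a sketch under stated assumptions, not the system of Plumpe et al.: the training rows are made-up toy values, the split criterion is plain sum-of-squared-error reduction on "feature equals value" questions, and the names `grow`, `predict`, and `min_leaf` are hypothetical.

```python
from statistics import mean

# Made-up toy data: (phone, stress, left phone, right phone) -> duration in ms.
DATA = [
    (("aa", 1, "k", "t"), 140.0),
    (("aa", 0, "k", "t"), 95.0),
    (("ih", 1, "s", "t"), 90.0),
    (("ih", 0, "s", "t"), 60.0),
    (("s", 0, "ih", "t"), 100.0),
    (("t", 0, "s", "aa"), 55.0),
    (("aa", 1, "t", "s"), 150.0),
    (("t", 0, "aa", "s"), 60.0),
]
N_FEATURES = 4


def sse(rows):
    """Sum of squared errors around the mean duration of the rows."""
    if not rows:
        return 0.0
    m = mean(d for _, d in rows)
    return sum((d - m) ** 2 for _, d in rows)


def best_split(rows, min_leaf):
    """Find the 'feature i == value' question with the lowest total SSE."""
    best = None
    for i in range(N_FEATURES):
        for value in {x[i] for x, _ in rows}:
            yes = [r for r in rows if r[0][i] == value]
            no = [r for r in rows if r[0][i] != value]
            if len(yes) < min_leaf or len(no) < min_leaf:
                continue
            cost = sse(yes) + sse(no)
            if best is None or cost < best[0]:
                best = (cost, i, value, yes, no)
    return best


def grow(rows, min_leaf=1):
    """Return a leaf (mean duration) or a node (i, value, yes_tree, no_tree)."""
    split = best_split(rows, min_leaf) if sse(rows) > 0 else None
    if split is None:
        return mean(d for _, d in rows)
    _, i, value, yes, no = split
    return (i, value, grow(yes, min_leaf), grow(no, min_leaf))


def predict(tree, x):
    """Walk from the root to a leaf, answering one question per node."""
    while isinstance(tree, tuple):
        i, value, yes, no = tree
        tree = yes if x[i] == value else no
    return tree


tree = grow(DATA)
print(predict(tree, ("aa", 1, "k", "t")))
```

With `min_leaf=1` the tree memorizes the toy data; raising `min_leaf` forces larger leaves and smoother predictions, which is the usual way to limit overfitting on sparse contexts.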

6
Pitch Generation
  • Pitch, or F0, is probably the most characteristic
    of all the prosody dimensions.
  • The quality of a prosody module is dominated by
    the quality of its pitch-generation component.
  • Since generating pitch contours is an incredibly
    complicated problem, pitch generation is often
    divided into two levels, with the first level
    computing the so-called symbolic prosody and the
    second level generating pitch contours from this
    symbolic prosody.
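
The two-level split can be illustrated with a toy second level that turns symbolic events into an F0 contour. This is only a sketch: the ToBI-like labels, the target fractions, the pitch floor and range, the declination rate, and the use of linear interpolation are all assumptions chosen for illustration, not values from the lecture.

```python
# Level 1 (assumed given): symbolic prosody as (time in seconds, label) events.
# Level 2 (sketched here): map labels to F0 targets, then interpolate.

# Hypothetical fraction of the speaker's pitch range for each symbolic label.
TARGET_FRACTION = {"H*": 0.8, "L*": 0.2, "H%": 0.9, "L%": 0.0}


def f0_targets(events, floor_hz=80.0, range_hz=100.0, declination_hz_per_s=10.0):
    """Convert symbolic events to (time, Hz) targets with a shrinking range."""
    targets = []
    for t, label in events:
        # The available range narrows over the utterance (simple declination).
        current_range = max(range_hz - declination_hz_per_s * t, 0.0)
        targets.append((t, floor_hz + TARGET_FRACTION[label] * current_range))
    return targets


def f0_at(targets, t):
    """Linearly interpolate between neighbouring targets; hold at the edges."""
    if t <= targets[0][0]:
        return targets[0][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    return targets[-1][1]


events = [(0.0, "H*"), (1.0, "L%")]      # a high accent falling to a low boundary
targets = f0_targets(events)
contour = [f0_at(targets, i * 0.25) for i in range(5)]
print(contour)
```

Keeping the symbolic layer separate means the same events can be rendered for a different speaker by changing only `floor_hz` and `range_hz`.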

7
Pitch Generation
  • Parametric F0 generation
  • To realize all the prosodic effects, some systems
    make almost direct use of a real speaker