Bootstrapping a Language-Independent Synthesizer

1
Bootstrapping a Language-Independent Synthesizer
  • Craig Olinsky
  • Media Lab Europe / University College Dublin
  • 15 January 2002

2
Introducing the Problem
  • Given a set of recordings and transcriptions in
    an arbitrary language, can we quickly and easily
    build a speech synthesizer?
  • YES, if we know something about the language.
  • However, for the majority of languages, such
    resources don't exist.

3
Starting from Sample
  • PROS
  • The existing synthesizer provides a store of
    linguistic knowledge we can start from.
  • Analogous to speaker adaptation in Speech
    Recognition systems.
  • Overall, quality should be better.
  • CONS
  • Difficulty related to degree of difference between
    sample and target language.
  • Best as a gradual process: accent/dialect, not
    language

4
Starting from Scratch
  • PROS
  • Difficulty directly proportional to complexity of
    the language.
  • Common procedure based upon machine learning
    from recordings and transcripts.
  • CONS
  • Don't have a great deal of relevant knowledge to
    apply to the task.
  • If not using a principled phone set, necessary to
    segment / label recordings cleanly

5
The Obvious Compromise
  • Take what we do know from building speech
    synthesis, and generalize it to an existing
    framework.
  • -- we're not specifically learning from scratch
  • -- at the same time, we're not making linguistic
    assumptions pre-coded into the source voices

6
Generic Synthesis Framework/Toolkit
  • Set of scripts, utilities, and definition files
    to help automate the creation of
    reasonable speech synthesis voices from an
    arbitrary language without the need for
    linguistic or language-specific information.
  • Built on top of the Festival Speech Synthesis
    System and FestVox toolkit (for waveform
    synthesis; most of the text processing and
    pronunciation handling is externalized to
    locally-developed tools)

7
Language-Dependent Synthesis Components
  • Phone set
  • Word pronunciation (lexicon and/or
    letter-to-sound rules)
  • Token processing rules (numbers etc)
  • Durations
  • Intonation (accents and F0 contour)
  • Prosodic phrasing method

8
Phoneme Sets
  • If we rely on a pre-existing set of pronunciation
    rules, lexicon, etc., we are automatically
    limited to using the phone-set used in those
    resources (or something which they can be mapped
    to), most likely something language-dependent.
  • IPA, SAMPA: something language-universal?
  • We need to generate pronunciations: how do we
    create the relationship between our training
    database / phonetic representation / orthography?
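
The idea of a phone set "which they can be mapped to" can be sketched as a lookup from a resource-specific phone set into a shared inventory. This is only an illustration: the table below is a small hypothetical excerpt pairing ARPAbet-style labels with SAMPA symbols, and `map_phones` is a name invented for this sketch, not part of Festival or FestVox.

```python
# Hypothetical excerpt of a mapping from ARPAbet-style labels to SAMPA symbols.
# A real table would need to cover the full phone set of the source resource.
ARPABET_TO_SAMPA = {
    "k": "k",
    "ae": "{",   # vowel in "cat"
    "t": "t",
    "sh": "S",   # first sound in "ship"
    "ng": "N",   # last sound in "sing"
}


def map_phones(phones, table):
    """Map each phone through the table; collect phones with no known mapping."""
    mapped, unmapped = [], []
    for phone in phones:
        if phone in table:
            mapped.append(table[phone])
        else:
            unmapped.append(phone)
    return mapped, unmapped


# "cat" maps cleanly; "zh" is not in this excerpt, so it is reported as unmapped.
print(map_phones(["k", "ae", "t"], ARPABET_TO_SAMPA))
print(map_phones(["zh"], ARPABET_TO_SAMPA))
```

Reporting unmapped phones separately, rather than dropping them, makes it visible where the existing resource and the target inventory do not line up.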

9
Multilingual Phoneme Sets: IPA, SAMPA
  • We don't want to be stuck with a set of phonemes
    targeted for a specific language, so we instead
    use a phoneme definition designed to be inclusive
    of all
  • But this still assumes we know the relationship
    between the phone set and orthography of the
    language, i.e. for any given text we can generate
    a pronunciation.
  • This approach still assumes linguistic knowledge!

10
Orthography as Pronunciation
  • cf. R. Singh, B. Raj and R.M. Stern, "Automatic
    Generation of Phone Sets and Lexical
    Transcriptions ..."
  • Suppose we begin with the orthography of the
    written language.
  • e.g. CAT -> c a t, DOG -> d o g
  • This implies:
  • A relation between number of characters in a
    spelling and the length of the pronunciation
  • The orthography of a language is consistent /
    efficient
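
A minimal sketch of the letters-as-phones idea, assuming the simplest possible reading of the CAT / DOG example: every letter of the spelling becomes one unit of the pronunciation, and the phone set is whatever letters occur in the word list. The function names are invented for illustration and are not taken from the toolkit described in these slides.

```python
def orthographic_pronunciation(word):
    """Treat each letter of the spelling as one unit of the 'pronunciation'."""
    return [ch for ch in word.lower() if ch.isalpha()]


def build_lexicon(words):
    """Map each word to its letter-based pronunciation and collect the implied phone set."""
    lexicon = {word: orthographic_pronunciation(word) for word in words}
    phone_set = sorted({phone for pron in lexicon.values() for phone in pron})
    return lexicon, phone_set


lexicon, phone_set = build_lexicon(["CAT", "DOG"])
print(lexicon["CAT"])   # the three letters of the spelling, lowercased
print(phone_set)        # every distinct letter seen in the word list
```

By construction the pronunciation is exactly as long as the spelling, which is the first implication listed above; whether that is a reasonable approximation depends on how consistent the orthography of the language is.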

11
Orthography as Pronunciation
12
Implications for Data Labeling and Training
13
Non-Roman Orthography: Questions of Transcription
14
Difficulties in Machine Learning of Pronunciation
  • "But there is a much more fundamental
    problem in that it crucially assumes that
    letter-to-phoneme correspondences can in general
    be determined on the basis of information local
    to a particular portion of the letter string.
    While this is clearly true in some languages
    (e.g. Spanish), it is simply false for others."
  • "It is unreasonable to expect that good
    results will be obtained from a system trained
    with no guidance of this kind, or with data
    that is simply insufficient to the task."
  • Sproat et al., Multilingual Text-to-Speech
    Synthesis: The Bell Labs Approach, pp. 76-77
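
The locality assumption in the quotation can be made concrete with a small sketch. It assumes a learner that sees each letter only through a fixed-width window of neighbouring letters; the function name and the English words "mat" / "mate" are illustrative choices, not taken from the slides or from the quoted book.

```python
def letter_windows(word, width):
    """Return one context window per letter: `width` letters on each side, padded with '#'."""
    padded = "#" * width + word.lower() + "#" * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(word))]


# The letter "a" is pronounced differently in "mat" and "mate",
# because of a final "e" two letters to its right.
narrow_mat = letter_windows("mat", 1)[1]
narrow_mate = letter_windows("mate", 1)[1]
wide_mat = letter_windows("mat", 2)[1]
wide_mate = letter_windows("mate", 2)[1]

print(narrow_mat == narrow_mate)  # True: one letter of context cannot tell the two apart
print(wide_mat == wide_mate)      # False: two letters of context can
```

Widening the window fixes this particular case, but any fixed width fails whenever the deciding information lies further away or is not in the letter string at all, which is the situation the quotation describes.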

15
Lexicon / Letter-to-Sound Rules
16
Token Processing
17
Duration and Stress Modeling
18
Intonation and Phrasing
19
Unit Selection and Waveform Synthesis
20
Overview: Adaptation for Accent and Dialect
21
Final Points