LING 270 Language, Technology and Society Unit 8

1
LING 270: Language, Technology and Society, Unit 8
  • Richard Sproat
  • URL: http://catarina.ai.uiuc.edu/L270/

2
Overview
  • Some more history of Speech Synthesis
  • The problem of text-to-speech conversion
  • Approaches to selected problems
  • Future work: what's left to be done?

3
Electronic Approaches through the 1950s
  • Gunnar Fant's OVE Synthesizer
  • Walter Lawrence's Parametric Artificial Talker
    (1953)

4
Types of Synthesis
  • Articulatory Synthesis: Attempt to model human
    articulation. Never made it out of the laboratory.
  • Formant Synthesis: Bypass modeling of human
    articulation, and model acoustics directly. Basis of
    several commercial systems, including DECtalk.
  • Concatenative Synthesis: Synthesize from stored
    units of actual speech. The only commercial approach
    taken today.

5
Types of Synthesis: Early Work
  • The earliest articulatory synthesizers include
    (Nakata and Mitsuoka, 1965; Henke, 1967; Coker,
    1968).
  • The earliest formant synthesizer was (Kelly and
    Gerstman, 1961).
  • The first working diphone concatenative
    synthesizer was (Dixon and Maxey, 1968).
  • The first full text-to-speech system was (Umeda et
    al., 1968).

http://www.acoustics.hut.fi/publications/files/theses/lemmetty_mst/appa.html
6
A Present-Day Commercial System
  • An example of a current commercial system: a
    unit-selection TTS system, AT&T's Natural Voices
  • http://www.naturalvoices.att.com/demos/
  • Unit-selection approaches look for
    variable-length units in an annotated database of
    speech, and select them on the basis of various
    features including desired phoneme sequence and
    prosody.

7
Components of a TTS System
8
Linguistic Analysis
  • Preprocessing
  • Sentence segmentation
  • Dr. Smith lives on Smith Dr. He is
  • Abbreviation expansion
  • Doctor Smith lives on Smith Drive. He is
  • Word segmentation
  • (example text in a non-Latin script, not recoverable)

Gloss: Islamabad is a city in Pakistan
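
The "Dr. Smith lives on Smith Dr." example can be sketched in code. This is a minimal illustration, assuming a hypothetical two-entry abbreviation table and a crude capitalization heuristic (neither comes from the slides): an abbreviation that follows a capitalized word takes its street reading, and it also ends the sentence if the next word is capitalized.

```python
# Hypothetical abbreviation table: (reading before a name, reading after a name).
ABBREVIATIONS = {"Dr.": ("Doctor", "Drive"), "St.": ("Saint", "Street")}


def expand_and_split(text):
    """Expand ambiguous abbreviations and split the text into sentences."""
    tokens = text.split()
    sentences, current = [], []

    def flush():
        sentences.append(" ".join(current))
        current.clear()

    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok in ABBREVIATIONS:
            before_name, after_name = ABBREVIATIONS[tok]
            if prev[:1].isupper():
                # Follows a capitalized word: assume the street reading.
                if nxt == "" or nxt[:1].isupper():
                    # Next word starts a new sentence: keep the period.
                    current.append(after_name + ".")
                    flush()
                else:
                    current.append(after_name)
            else:
                # Otherwise assume the title reading ("Doctor", "Saint").
                current.append(before_name)
        else:
            current.append(tok)
            if tok.endswith((".", "?", "!")):
                flush()
    if current:
        flush()
    return sentences
```

For the slide's example (completed with "here." for illustration), this returns "Doctor Smith lives on Smith Drive." and "He is here." as two sentences. The heuristic fails easily, for instance on "Then Dr. Jones", which is the point of the slide: enumerating the expansions is simple, choosing among them is not.
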
9
Linguistic Analysis (contd.)
  • Preprocessing continued
  • Acronym pronunciation: NATO, CIA
  • Number expansion: 123
  • 123 goats
  • 123 King St.
  • Lotus 123
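
The three readings of "123" can be sketched as follows. This is only an illustration: the street-type and product-name lists are hypothetical stand-ins for whatever context model a real system would use, and only numbers up to three digits are handled.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical context cues, for illustration only.
STREET_TYPES = {"St.", "Ave.", "Rd.", "Dr."}
PRODUCT_NAMES = {"Lotus"}


def two_digits(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n):
    """Spell out 0..999 as a quantity: 'one hundred twenty-three'."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0:
        return two_digits(rest)
    return ONES[hundreds] + " hundred" + (" " + two_digits(rest) if rest else "")


def address(n):
    """Spell out 0..999 as a street number: 'one twenty-three'."""
    hundreds, rest = divmod(n, 100)
    if hundreds == 0 or rest == 0:
        return cardinal(n)
    if rest < 10:
        return ONES[hundreds] + " oh " + ONES[rest]
    return ONES[hundreds] + " " + two_digits(rest)


def digit_string(digits):
    """Read digit by digit: 'one two three'."""
    return " ".join(ONES[int(d)] for d in digits)


def expand_numbers(text):
    """Expand numbers of up to three digits according to their context."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if not tok.isdigit() or len(tok) > 3:
            out.append(tok)
            continue
        prev = tokens[i - 1] if i > 0 else ""
        following = tokens[i + 1:i + 3]
        if prev in PRODUCT_NAMES:
            out.append(digit_string(tok))
        elif any(t in STREET_TYPES for t in following):
            out.append(address(int(tok)))
        else:
            out.append(cardinal(int(tok)))
    return " ".join(out)
```

With these assumptions, "123 goats" becomes "one hundred twenty-three goats", "123 King St." becomes "one twenty-three King St.", and "Lotus 123" becomes "Lotus one two three".
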

10
Linguistic Analysis (contd.)
  • Word pronunciation
  • How to pronounce ordinary words: through, though,
    tough, bough
  • How to pronounce names: Abate (versus abate),
    Begin (versus begin)
  • Sense disambiguation
  • Is "bass" a fish or a musical range?
  • Is "IV" four(th) or intravenous?
  • Is Arabic قبل qabla "before" or qabila
    "received"?

11
Linguistic Analysis (contd.)
  • Accenting/prominence prediction
  • This is a test of the emergency broadcasting
    system
  • Intonation (inflection), i.e. assignment of
    pitch (better: fundamental frequency)
  • Declarative sentences tend to have falling
    intonations in lots of languages
  • Interrogative sentences tend to have rising
    intonations in lots of languages
  • But there is a lot more going on than just that

12
Linguistic Analysis (contd.)
  • Segmental durations
  • Every sound has to have some time assigned to it
  • Other things being equal:
  • Vowels tend to be longer than consonants
  • Stressed segments tend to be longer than
    unstressed segments
  • Accented segments tend to be longer than
    unaccented segments
  • Final segments tend to be longer than non-final
    segments
  • Segments have different inherent durations: "ee"
    in "keep" is generally longer than "i" in "kip"

13
Text-to-Speech Blocks
  • Input: ASCII text
  • Text Normalization: Dr. Smith lives at 111 Smith Dr.
  • Syntactic/Semantic Parser: lives (verb) vs. lives
    (noun)
  • Dictionary: pronunciation dictionary, morphemic
    decomposition, rhyming
  • Letter-to-Sound Rules: rules used where dictionary
    derivation fails
  • Phonemes (30-40)
  • Prosody Rules: intonation and duration
  • Speech Synthesis: speech output

14
A Summary so Far
  • From linguistic analysis we have:
  • A set of sounds to be produced
  • Associated durations
  • Associated fundamental frequency information
  • Possibly other things:
  • Amplitude
  • Properties of the vocal production
  • Now we are ready to synthesize speech

15
Resources for Synthesis
  • Text corpora
  • Raw
  • Tagged
  • Part of speech
  • Semantic tags
  • Syntactically parsed (treebanks)
  • Electronic dictionaries (e.g. CMU Dictionary)
  • Annotated speech corpora
  • Orthographic transcriptions only
  • Phonetically transcribed (e.g. TIMIT)
  • Prosodically transcribed (e.g. Boston Radio News)
  • All of these are available (in greater or lesser
    quantities) for English
  • For other languages, coverage is highly variable
  • Many of these resources are available from the
    Linguistic Data Consortium (www.ldc.upenn.edu)
  • Finally, there is the open-source Festival TTS
    system

16
Human Vocal Apparatus
(After Boite and Kunt, 1987)
17
Synthesis
  • Articulatory synthesizers will produce a set of
    instructions to articulators (larynx, velum,
    tongue body, tongue tip, lips, jaw)
  • This will produce a sequence of articulatory
    configurations
  • From acoustic theory one derives the acoustics of
    each configuration
  • Articulatory synthesis is very hard
  • We do not fully understand how the articulators
    move
  • We do not fully understand how to model the
    acoustics

18
Synthesis
  • Formant synthesizers attempt to model the
    acoustics directly by means of rules that capture
    the change of acoustic parameters over time.
  • This is easier than articulatory synthesis but is
    still hard
  • The only really high-quality system that has been
    produced was the American English system produced
    by Dennis Klatt

19
Time-Domain Representation of Speech
20
Frequency Domain: Spectrum
21
Time-Frequency Representation: Spectrogram
22
The Klatt Synthesizer
23
Single Digital Resonator
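
A single digital resonator of the kind used in Klatt-style formant synthesizers can be sketched as a second-order difference equation. The sketch assumes the standard form y[n] = A x[n] + B y[n-1] + C y[n-2], with coefficients derived from a formant frequency F and bandwidth BW; the function names and the example values are illustrative and do not come from the slide.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Return (A, B, C) for a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter samples through y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out
```

Feeding an impulse (a 1.0 followed by zeros) through `resonate` with, say, F = 500 Hz and BW = 60 Hz yields a decaying oscillation at roughly the formant frequency. A formant synthesizer chains or sums several such resonators, one per formant.
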
24
Concatenative Systems
  • Record real speech from a single talker
  • Segment the speech so that we know where the
    individual sounds are
  • Either:
  • Preselect a database of units: diphone, polyphone
    synthesis
  • Select the best unit at runtime: unit-selection
    synthesis
  • Coding of database:
  • Frequency domain, e.g. linear predictive coding
    (LPC)
  • Time domain, e.g. Pitch-Synchronous Overlap and
    Add (PSOLA)

25
Concatenative Approach: Diphones
26
Linear Prediction Speech Production Model
27
Units
  • In a polyphone synthesizer, units are stored in
    coded form
  • At synthesis time, appropriate units are selected
    from the database and concatenated
  • Some smoothing between units is generally
    necessary
  • Units need to be stretched or compressed to fit
    within the specified duration
  • Intonation and amplitude information is added,
    and the result is sent for synthesis.
  • E.g. in an LPC-type synthesizer, intonation will
    be modeled by the impulse-train generator plus a
    glottal waveform model, and the vocal tract
    parameters will be taken from the concatenated
    units.
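
The LPC-type arrangement described above can be sketched as an impulse train at the desired fundamental frequency driving an all-pole filter. This is a simplified illustration: it omits the glottal waveform model, and the filter coefficients are assumed to come from the stored units (the values used in any demo are hypothetical).

```python
def impulse_train(f0_hz, duration_s, sample_rate_hz):
    """Voiced excitation: one unit impulse per pitch period."""
    n_samples = int(duration_s * sample_rate_hz)
    period = max(1, round(sample_rate_hz / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def all_pole_filter(excitation, lpc_coeffs, gain=1.0):
    """Vocal-tract filter: s[n] = gain*e[n] + sum_k a_k * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        s = gain * e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                s += a * out[n - k]
        out.append(s)
    return out
```

Intonation is controlled entirely by `f0_hz` in the excitation, while the identity of the sound is carried by `lpc_coeffs`. That separation is what lets a concatenative LPC system impose a new pitch contour on recorded units.
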

28
Unit Selection Methods
  • Record a (very) large database (>1 hour) from a
    single talker
  • Annotate the database with:
  • Aligned phonetic transcription
  • Prosodic information (e.g. accent type, duration)
  • At synthesis time find the best available
    sequence of units that maximizes:
  • The match between the desired sequence and the
    individual units
  • The goodness of join between the units
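
The search described above is commonly framed as a shortest-path problem: maximizing match and join quality is treated as minimizing a target cost plus a join cost, solved with dynamic programming. The sketch below assumes that framing; the two toy cost functions and the unit fields ("f0", "pos") are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return the candidate sequence with the lowest total cost.

    targets:    list of target specifications, one per position
    candidates: list (same length) of lists of database units
    """
    # best[i][j] = (cost of best path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]


def f0_target_cost(target, unit):
    """Toy target cost: mismatch in fundamental frequency."""
    return abs(target["f0"] - unit["f0"]) / 100.0


def adjacency_join_cost(prev, unit):
    """Toy join cost: free if the units were adjacent in the recording."""
    return 0.0 if unit["pos"] == prev["pos"] + 1 else 1.0
```

With these toy costs, a unit whose pitch is somewhat off can still win if it was recorded right after its predecessor, since a perfect join outweighs a modest target mismatch.
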

29
Importance of Good Joins
http://catarina.ai.uiuc.edu/orig.wav
Shannon says the party ended the US job.
30
Intonation Types
  • English declarative
  • Final fall (H* L- L%)
  • 9478-1509-7091
  • English question
  • Final rise (L* H- H%)
  • 9478-1509-7091

31
Linguistic Use of Intonation
  • Lexical information
  • stress, accent, tone
  • Intonation type
  • declarative, question
  • Paralinguistic information
  • mood, emotion
  • Discourse functions

32
Discourse Functions
  • Topic initialization
  • Discourse structure
  • Phrasing
  • Emphasis
  • New vs. old information
  • Other communicative goals

33
ToBI Intonation Grammar
  • (ToBI: Tones and Break Indices)
  • Binary tonal distinction
  • H (high) and L (low)
  • Combination in an intonation phrase
  • Accent, phrase tone, boundary tone
  • HL L- L

34
Accent Types
H L- L
L H- H
LH L- L
HL H- H
LH L- H
35
Lexical Information
  • Stress languages
  • English, Russian
  • Stress location is part of the lexical entry, but
    the pitch contour (accent type) on the stressed
    syllable may vary.
  • Accentual languages
  • Japanese, Korean, Swedish
  • The location of the accent is lexically marked.
    Accent type in a word is typically fixed.
  • Tone languages
  • Chinese, Navajo, Igbo
  • Lexically determined tone on every syllable or
    every word.

36
Stress: English
  • Fixed stress location: professor

37
Stress: English
  • Fixed stress location: professor
  • Different accent type in questions.

38
Stress: Russian
  • Fixed stress location: professor
  • May have different accent type than English even
    in declarative sentences

39
Chinese Lexical Tones
Tone shapes differentiate lexical meaning.
ma1 "mother", ma2 "hemp", ma3 "horse", ma4 "to scold"
40
Chinese Sentences
Ma1-ma0 ma4 ma3. "Mother scolds the horse."
Ma3 ma4 ma1-ma0. "The horse scolds mother."
41
Russian Intonation Types
  • Declarative
  • This is Neva.
  • Question
  • Is this Neva?

42
Russian Intonation Types
  • Declarative
  • This is Lena.
  • Question
  • Is this Lena?

43
Chinese Intonation Types
  • Chinese declarative
  • Li3bai4wu3 Luo2yan4 yao4 mai3 lu4.
  • On Friday Luoyan wants to buy a deer.
  • Chinese question
  • Li3bai4wu3 Luo2yan4 yao4 mai3 lu4?
  • On Friday Luoyan wants to buy a deer?

44
Phrasing
  • High f0 marks the beginning
  • Low f0 marks the end.

45
Prosody of Emotion
  • Excitement
  • Fast, very high pitch, loud
  • Hot anger
  • Fast, high pitch, strong, falling accent, loud
  • Fear
  • Jitter
  • Sarcasm
  • Prolonged accent, late peak
  • Sad
  • Slow, low pitch

46
Excitement
Marilyn won nine million dollars!
Dirty rats are the best, aren't they?
47
Sad
Marilyn won nine million dollars.
Dirty rats are the best, aren't they?
48
Chinese Lexical Tones
Tone shapes differentiate lexical meaning.
ma1 "mother", ma2 "hemp", ma3 "horse", ma4 "to scold"
49
Tonal Variations
50
The Challenge Tonal Distortion
51
A Heretical Claim
  • Or: why am I spending so much time on intonation?
  • The main determinant of naturalness in speech
    synthesis is not voice quality, but
    natural-sounding prosody (intonation and duration)

52
Segmental Duration Factors
  • Phonetic segment identity.
  • For example, holding all else constant, /ai/ as in
    "like" is twice as long as /i/ in "lick".
  • Identities of surrounding segments.
  • For example, vowels are much longer when followed
    by voiced fricatives than when followed by
    voiceless stops.
  • Syllabic stress.
  • Vowels in stressed syllables and consonants
    heading stressed syllables are longer than vowels
    and consonants in unstressed syllables.
  • Word importance
  • Location of the syllable in the word.
  • Word-final syllables are longer than word-initial
    syllables
  • Location of the syllable in the phrase.
  • Phrase-final syllables can be twice as long as
    the same syllable in other locations in a phrase.
  • How do these factors interact?

53
Approaches to Duration Modeling
  • Hand-built rules
  • E.g., the Klatt rules
  • Semi-supervised statistical models
  • Machine-learning approaches
  • Classification and Regression Trees
  • Neural nets
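
As an illustration of the hand-built approach, the Klatt duration rules are usually summarized as DUR = MINDUR + (INHDUR - MINDUR) × PRCNT / 100, where each applicable rule scales PRCNT in turn. The sketch below assumes that formulation; the segment durations and rule factors used with it are hypothetical values chosen only to show the arithmetic.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_factors):
    """Duration of a segment after applying multiplicative rule factors.

    inherent_ms:  inherent duration of the segment
    minimum_ms:   duration below which the segment cannot be compressed
    rule_factors: one multiplier per applicable rule, e.g. a value above
                  1.0 for phrase-final lengthening and below 1.0 for an
                  unstressed syllable (values here are hypothetical)
    """
    percent = 100.0
    for factor in rule_factors:
        percent *= factor
    return minimum_ms + (inherent_ms - minimum_ms) * percent / 100.0
```

For a vowel with an inherent duration of 160 ms and a minimum of 60 ms, a single lengthening factor of 1.4 gives 60 + 100 × 1.4 = 200 ms, while a shortening factor of 0.5 gives 110 ms. Because only the portion above the minimum is scaled, stacked shortening rules can never push a segment below its floor, which is one hand-built answer to how the factors interact.
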

54
Accent/Prominence Assignment
  • Content versus non-content words
  • The keeshond is from Holland.
  • My dog bit someone
  • Noun-phrase structure/lexical specifications
  • Our dog needs a rabies shot, so I may be late for
    my regular office hours on Wednesday.
  • Tomorrow we have a dog training lesson.
  • Our dog eats rawhide bones.
  • Given/new information
  • We've always wanted a pet, so we finally bought a
    weimaraner
  • My son wants a labrador, but I'm allergic to dogs
  • Contrast
  • That's not an Australian sheepdog, it's an
    Australian cattlehound.

55
Cross-Linguistic Variation
  • The investigations are helping to put back in
    order things that have gone out of order
  • Italian: Le inchieste servono a mettere a posto
    cose andate fuori posto
  • Greek divers have found the wreck of the British
    liner Britannic, sister ship of the Titanic.
  • Italian: Moglie quarantenne, marito cinquantenne.
    ("Forty-year-old wife, fifty-year-old husband")

56
Accent Modeling Factors
  • Part-of-speech information
  • Given/New information
  • Lexical information
  • Cue phrases: And then,
  • Noun compound analysis
  • Contrast
  • Information Content
  • E.g. -log(P(w))
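
Information content can be estimated from corpus counts. A minimal sketch, taking information content as the negative log probability of a word and estimating P(w) by relative frequency (a real system would use a large corpus and smoothing):

```python
import math
from collections import Counter


def information_content(words):
    """Map each word to -log2 of its relative frequency in the list."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: -math.log2(c / total) for w, c in counts.items()}
```

Rare words receive high values and frequent words low ones, so the measure can serve as a graded stand-in for the content/non-content distinction when predicting accent.
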

57
Accent Modeling Methods
  • Hand-built rules (on the basis of corpus
    analysis)
  • Machine-learning methods
  • Hidden Markov Models
  • Machine learning packages such as CART or Ripper
  • Accuracies in the range of 80-90% are typical
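
A hand-built rule set of the simplest kind can be sketched as follows: accent a word if it is a content word and has not been mentioned before. The function-word list is a hypothetical stand-in for part-of-speech tagging, and "given" is reduced to exact repetition, so inferences such as labrador → dogs are out of reach.

```python
# Hypothetical list of non-content words, for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "from", "my", "our", "we",
                  "so", "to", "but", "of", "in", "on", "and", "someone"}


def assign_accents(sentences):
    """Label each word with True (accented) or False (unaccented)."""
    seen = set()  # content words already mentioned in the discourse
    result = []
    for sentence in sentences:
        labelled = []
        for word in sentence.split():
            key = word.strip(".,?!").lower()
            accented = key not in FUNCTION_WORDS and key not in seen
            if key not in FUNCTION_WORDS:
                seen.add(key)
            labelled.append((word, accented))
        result.append(labelled)
    return result
```

Given "My dog bit someone." followed by "The dog is from Holland.", the sketch accents "dog" and "bit" in the first sentence, but only "Holland" in the second, because "dog" is by then given information.
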

58
Word Sense Disambiguation
  • Is "bass" a fish or a musical range?
  • Is "IV" four(th) or intravenous?
  • Is Arabic قبل qabla "before" or qabila
    "received"?
  • Is "1/2" a date or a fraction?
  • Is "LAX" lax or L. A. X. (Los Angeles
    International Airport)?

59
Methods for Word-Sense Disambiguation
  • Decision lists
  • bass → FISH if "lake" within k-word window
  • bass → MUSIC if word to right is "player"
  • Decision lists can be trained on the basis of
    tagged data
  • Other machine-learning methods
  • N-gram language models as in automatic speech
    recognition
  • Sometimes you have to infer from the data what
    needs to be treated and how
  • 57 ST E/1st 2nd Ave Huge drmn 1 BR 750 sf,
    lots of sun clsts. Sundeck lndry facils. Askg
    187K, maint 868, utils incld. Call Bkr Peter
    914-428-9054
  • Accuracies for WSD can be as high as 99%
  • But this is HIGHLY variable, depending upon the
    case(s) being considered
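
The two decision-list rules for "bass" can be sketched directly. In a trained decision list the rules are ordered by how reliable they proved on tagged data; here the order, the window size, and the fallback label are assumptions made for illustration.

```python
def classify_bass(tokens, index, k=5, default="UNKNOWN"):
    """Return the sense of tokens[index] using the first matching rule.

    tokens are assumed to be lowercased words; index points at "bass".
    """
    rules = [
        # bass -> MUSIC if the word to the right is "player"
        (lambda t, i: i + 1 < len(t) and t[i + 1] == "player", "MUSIC"),
        # bass -> FISH if "lake" occurs within a k-word window
        (lambda t, i: "lake" in t[max(0, i - k):i + k + 1], "FISH"),
    ]
    for matches, sense in rules:
        if matches(tokens, index):
            return sense
    return default
```

For "he caught a bass in the lake" the second rule fires and yields FISH; for "the bass player left" the first rule yields MUSIC. Only the first matching rule is consulted, which is what distinguishes a decision list from a weighted vote.
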

60
Word Pronunciation Methods
  • Dictionaries
  • Morphological analysis
  • If I know "grammatical" and the suffix "-ity", I can
    predict the pronunciation of "grammaticality"
  • Letter-to-sound rules
  • Reasonable for some languages (e.g. Spanish)
  • Hard for others (e.g. English, Arabic, Urdu,
    Japanese)
  • tough, bough, through, though
  • Machine-learning methods
  • CART
  • Analogical reasoning
  • Exemplar-based methods
  • Special attention often needs to be paid to
    classes such as proper names
  • Overall accuracies in the 95% range (by word) are
    common (for English)
  • Interesting side-note: pronunciation of proper
    names is one of the few areas in Speech and
    Language processing where machines can outperform
    people.
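
The first three methods are often arranged as a cascade: dictionary lookup, then morphological analysis, then letter-to-sound rules. The sketch below assumes that ordering. The one-entry lexicon, the toy phoneme strings (no stress marks), the "-ity" rule, and the one-letter-one-sound table are all hypothetical.

```python
# Toy lexicon with simplified transcriptions.
LEXICON = {
    "grammatical": ["G", "R", "AH", "M", "AE", "T", "IH", "K", "AH", "L"],
}


def add_ity(stem_phones):
    """Toy suffix rule: '-al' + '-ity' changes the final vowel quality."""
    phones = list(stem_phones)
    if phones[-2:] == ["AH", "L"]:
        phones[-2:] = ["AE", "L"]
    return phones + ["IH", "T", "IY"]


SUFFIX_RULES = {"ity": add_ity}

# Crude one-letter-to-one-symbol table, adequate only as a last resort.
LETTER_TO_SOUND = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AA B K D EH F G HH IY JH K L M N OW P K R S T UW V W KS Y Z".split(),
))


def pronounce(word):
    """Return (phonemes, method) using the first method that applies."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word], "dictionary"
    for suffix, rule in SUFFIX_RULES.items():
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in LEXICON:
            return rule(LEXICON[stem]), "morphology"
    phones = [LETTER_TO_SOUND[c] for c in word if c in LETTER_TO_SOUND]
    return phones, "letter-to-sound"
```

"grammaticality" is resolved by the suffix rule from the stored stem, while "tough" falls through to the letter table and comes out as T OW UW G HH. That is nowhere near the real pronunciation, which illustrates why letter-to-sound rules alone are a poor fit for English.
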

61
Preprocessing
  • Expansion of numbers, abbreviations, and
    decisions on word and sentence boundaries are
    fairly algorithmic
  • at least when it comes to enumerating the
    possible expansions
  • The real difficulty is in disambiguation

62
Disambiguation in Word Segmentation
"I couldn't forget where Liberation Avenue is."
Methods have included language-modeling
techniques or even full syntactic parsing.
F-scores of around 95% are reported on some
corpora.
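
The language-modeling approach to segmentation can be sketched with a unigram model and dynamic programming: among all ways of cutting the character string into known words, pick the one with the highest product of word probabilities. The probabilities passed in are assumed to come from a corpus; the ones in the usage note are made up, and English letters stand in for an unsegmented script.

```python
import math


def segment(text, word_probs, max_len=10):
    """Return the most probable segmentation of text, or None if none exists."""
    # best[i] = (log probability of best segmentation of text[:i],
    #            start index of the last word in that segmentation)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None
    # Follow the stored start indices back to the beginning.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With probabilities such as the = 0.3, them = 0.1, menu = 0.1, men = 0.1, u = 0.001, the string "themenu" is segmented as "the menu": the competing start "them" leads nowhere, and "the men u" is penalized by the improbable final word.
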
63
Evaluation of Synthesizers
  • Favorite method is the Mean Opinion Score (MOS)
  • Take two systems A and B, and see which one
    people like better
  • Simple to do, but doesn't tell you anything about
    why one system is preferred over another
  • Design tests that focus on particular features.
  • For example, play a sentence and ask people to
    point (in a text transcription of the sentence) to
    the most problematic portion
  • Ask them to describe what's wrong from a fixed
    list
  • choppy, bad letters, wrong emphasis
  • Can help one pinpoint particular problems, such
    as bad units
  • A problem in comparing systems has been the lack of
    common databases, especially for unit selection
  • The recent 2005 Blizzard Challenge
    (http://festvox.org/blizzard/) addresses this by
    providing a common training database (the CMU
    Arctic Database)
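
Both kinds of listening-test result are simple to compute. The sketch assumes the usual five-point rating scale for a Mean Opinion Score and a forced choice between systems "A" and "B" for the preference test; the function names are illustrative.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings on a 1 (bad) to 5 (excellent) scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be a non-empty list of values in 1..5")
    return mean(ratings)


def preference_rate(choices, system="A"):
    """Fraction of listeners who preferred the given system."""
    if not choices:
        raise ValueError("choices must be non-empty")
    return choices.count(system) / len(choices)
```

Ratings of 4, 5, 3, 4 give a score of 4, and three votes for A out of four give a preference rate of 0.75. Neither number says why listeners reacted as they did, which is the motivation for the more focused tests listed above.
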

64
Applications of TTS
  • Any situation where you need to get information,
    but you can't access it visually
  • Access to information for the blind
  • Access to email, news, stock quotes over the
    phone
  • Directions to drivers
  • Spoken dialog systems where it is not practical
    to prerecord everything
  • Informational content, e.g. NOAA Weather Radio,
    where it would be expensive to have a human read
    all the announcements.

65
Further Work is Needed
  • Intonation is still not well understood
  • How to predict appropriate prosody from text?
  • How to predict appropriate affect from text?
  • Linguistic analysis of text still needs further
    work
  • Humans still outperform machines at
    disambiguation
  • Correct rendering often depends upon
    discourse-level information, which is not well
    modeled
  • ¿Cuántas casas grandes hay en la ciudad? 200 →
    doscientas ("How many big houses are there in the
    city? Two hundred", with feminine agreement)
  • Unit-selection methods are hampered by a simple
    fact: you can't cover all the factors
  • Need methods for modification of speech databases