Title: Text-to-Speech Introduction
1. Text-to-Speech Introduction
- Heng Ji
- hengji_at_cs.qc.cuny.edu
- Sept 18, 2008
Acknowledgement: some slides from Dan Jurafsky
2. Outline
- Questions about Assignment 4 and Assignment 5?
- Remember to send me presentation slides by Sunday, March 29, 11:59pm
- Syllabus
- Text-to-Speech Introduction
3. Applications of Speech Synthesis / Text-to-Speech (TTS)
- Games
- Telephone-based information (directions, air travel, banking, etc.)
- Eyes-free applications (in car)
- Education (Reading tutors, L2)
- Services for the hearing impaired
- Reading email aloud
4. History: The 1936 UK Speaking Clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
5. The UK Speaking Clock
- July 24, 1936
- Photographic storage on 4 glass disks
- 2 disks for minutes, 1 for hours, 1 for seconds
- Other words in the sentence distributed across the 4 disks, so all 4 are used at once
- Voice of Miss J. Cain
6. A technician adjusts the amplifiers of the first speaking clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
7. TTS Demos (all are unit-selection)
- AT&T
- http://www.research.att.com/ttsweb/tts/demo.php
- IBM
- http://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml
- Cepstral
- http://www.cepstral.com/cgi-bin/demos/general
- Rhetorical (Scansoft)
- http://www.rhetorical.com/cgi-bin/demo.cgi
- Festival
- http://www-2.cs.cmu.edu/~awb/festival_demos/index.html
8. ARPAbet Vowels
2009-11-17. Speech and Language Processing, Jurafsky and Martin.
9. Brief Historical Interlude
- Pictures and some text from Hartmut Traunmüller's web site
- http://www.ling.su.se/staff/hartmut/kemplne.htm
- Von Kempelen 1780 (b. Bratislava 1734, d. Vienna 1804)
- Leather resonator manipulated by the operator to copy the vocal tract configuration during sonorants (vowels, glides, nasals)
- Bellows provided the air stream; a counterweight provided inhalation
- Vibrating reed produced the periodic pressure wave
10. Von Kempelen
- Small whistles controlled consonants
- Rubber mouth and nose; the nose had to be covered with two fingers for non-nasals
- Unvoiced sounds: mouth covered, auxiliary bellows driven by a string provided the puff of air
From Traunmüller's web site
11. Modern TTS Systems
- 1960s: first full TTS: Umeda et al. (1968)
- 1970s
- Joe Olive 1977: concatenation of linear-prediction diphones
- Speak and Spell
- 1980s
- 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
- 1990s-present
- Diphone synthesis
- Unit selection synthesis
12. Overview of TTS: Architectures of Modern Synthesis
- Articulatory Synthesis
- Model movements of the articulators and acoustics of the vocal tract
- Formant Synthesis
- Start with acoustics; create rules/filters to create each formant
- Concatenative Synthesis
- Use databases of stored speech to assemble new utterances.
Text from Richard Sproat slides
13. Fundamental Components
[Block diagram of a TTS system: words -> Text Pre-processing -> Prosody -> Concatenation]
14. Development Tools
- FreeTTS
- http://freetts.sourceforge.net/docs/index.php
- Festival
- http://festvox.org/festival/
15. Festival
- Open source speech synthesis system
- Designed for development and runtime use
- Used in many commercial and academic systems
- Distributed with RedHat 9.x
- Hundreds of thousands of users
- Multilingual
- No built-in language
- Designed to allow addition of new languages
- Additional tools for rapid voice development
- Statistical learning tools
- Scripts for building models
Text from Richard Sproat
16. Festival as Software
- http://festvox.org/festival/
- General system for multilingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules
- Lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection, signal processing
- General tools
- Intonation analysis (F0, Tilt), signal processing, CART building, N-gram, SCFG, WFST
Text from Richard Sproat
17. Festival as Software
- http://festvox.org/festival/
- No fixed theories
- New languages without new C code
- Multiplatform (Unix/Windows)
- Full sources in distribution
- Free software
Text from Richard Sproat
18. Getting Help
- Online manual
- http://festvox.org/docs/manual-1.4.3
- Alt-h (or Esc-h) on current symbol: short help
- Alt-s (or Esc-s) to speak help
- Alt-m: go to man page
- Use TAB key for completion
19. Converting from Words to Phones
- Two methods
- Dictionary-based
- Rule-based (letter-to-sound, LTS)
- Early systems: all LTS
- MITalk was radical in having a huge 10K-word dictionary
- Modern systems use a combination
- CMU dictionary: 127K words
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict
20. Two Steps
- Example: PG&E will file schedules on April 20.
- TEXT ANALYSIS: text into an intermediate representation
- WAVEFORM SYNTHESIS: from the intermediate representation into a waveform
21. The Hourglass
22. Rules for End-of-Utterance Detection
- A dot after one or two letters marks an abbreviation
- A dot after 3 capital letters marks an abbreviation
- An abbreviation followed by 2 spaces and a capital letter is an end-of-utterance
- A non-abbreviation followed by a capitalized word is a break
- This fails for:
- Cog. Sci. Newsletter
- Lots of cases at end of line.
- Badly spaced/capitalized sentences
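These heuristics can be sketched directly in code. This is a minimal illustration, not a production sentence splitter; `ABBREVS` is a hypothetical toy abbreviation list, and `spaces_after` carries the inter-token spacing that the abbreviation rule depends on.

```python
# Toy abbreviation list (illustrative only; real systems use a large one).
ABBREVS = {"mr.", "dr.", "st.", "etc.", "cog.", "sci."}

def is_sentence_end(token, spaces_after, next_token):
    """Heuristic end-of-utterance test following the rules above."""
    if not token.endswith("."):
        return False
    core = token[:-1]
    is_abbrev = (len(core) <= 2                          # dot after 1-2 letters
                 or (len(core) == 3 and core.isupper())  # 3 capitals + dot
                 or token.lower() in ABBREVS)
    if is_abbrev:
        # abbreviation + two spaces + capital letter -> end of utterance
        return spaces_after >= 2 and bool(next_token) and next_token[:1].isupper()
    # non-abbreviation followed by a capitalized word -> break
    return next_token[:1].isupper() if next_token else True
```

As the slide notes, such rules misfire on cases like "Cog. Sci. Newsletter", which is one motivation for learning a decision tree instead.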
From Alan Black lecture notes
23. Decision Tree: is a word end-of-utterance?
24. Learning Decision Trees
- DTs are rarely built by hand
- Hand-building is only possible for very simple features and domains
- Many algorithms exist for DT induction
25. Next Step: Identify Types of Tokens, and Convert Tokens to Words
- Pronunciation of numbers often depends on type
- 1776 as a date: seventeen seventy-six
- 1776 as a phone number: one seven seven six
- 1776 as a quantifier: one thousand seven hundred (and) seventy-six
- 25 as a day: twenty-fifth
26. Classify Each Token into 1 of ~20 Types
- EXPN: abbreviations, contractions (adv, N.Y., mph, gov't)
- LSEQ: letter sequence (CIA, D.C., CDs)
- ASWD: read as word, e.g. CAT, proper names
- MSPL: misspelling
- NUM: number (cardinal) (12, 45, 1/2, 0.6)
- NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates II
- NTEL: telephone (or part), e.g. 212-555-4523
- NDIG: number as digits, e.g. Room 101
- NIDE: identifier, e.g. 747, 386, I5, PC110
- NADDR: number as street address, e.g. 5000 Pennsylvania
- NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
- SLNT: not spoken (KENT*REALTY)
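A few of these classes can be approximated with ordered regular-expression tests. The sketch below covers only a handful of the ~20 types, and the patterns are illustrative rather than those used in any real system:

```python
import re

def classify_token(tok):
    """Assign a coarse token type (a small subset of the classes above).
    Rules are tried in order; the first match wins."""
    if re.fullmatch(r"\d{3}-\d{3}-\d{4}", tok):
        return "NTEL"                          # telephone, e.g. 212-555-4523
    if re.fullmatch(r"(?:[A-Z]\.)+|[A-Z]{2,5}", tok):
        return "LSEQ"                          # letter sequence, e.g. CIA, D.C.
    if re.fullmatch(r"\d+(?:st|nd|rd|th)", tok):
        return "NORD"                          # ordinal, e.g. 3rd
    if re.fullmatch(r"\d+(?:[.,/]\d+)?", tok):
        return "NUM"                           # cardinal, e.g. 12, 1/2, 0.6
    return "ASWD"                              # default: read as a word
```

In practice this classification is usually learned (e.g. with a decision tree over token features) rather than hand-written.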
27. Dictionaries Aren't Always Sufficient
- Unknown words
- Grow roughly linearly with the number of words in unseen text
- Mostly person, company, and product names
- But also foreign words, etc.
- So commercial systems have a 3-part system:
- Big dictionary
- Special code for handling names
- Machine learned LTS system for other unknown words
28. Improvements
- Take names out of the training data
- And acronyms
- Detect both of these separately
- And build special-purpose tools to do LTS for names and acronyms
- Names
- Can do morphology (Walters -> Walter, Lucasville)
- Can write stress-shifting rules (Jordan -> Jordanian)
- Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
- Liberman and Church: for the 250K most common names, got 212K (85%) from these modified-dictionary methods, used LTS for the rest
29. Text Pre-Processing (Block Diagram)
[Block diagram: Word Segmenter -> Acronym Converter -> Number Converter -> Word to Diphone Translator (Phonetization) -> MLDS, with a Diphone Dictionary feeding the translator]
30. Text Normalization
- Analysis of raw text into pronounceable words
- Sample problems:
- He stole $100 million from the bank
- It's 13 St. Andrews St.
- The home page is http://www.cs.qc.cuny.edu
- yes, see you the following tues, that's 09/23/08
- Steps
- Identify tokens in text
- Chunk tokens into reasonably sized sections
- Map tokens to words
- Identify types for words
31. Number Converter
- Replace numerals with their textual versions
- 100 -> one hundred
- Handle fractional and decimal numbers
- 0.25 -> point two five
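A minimal expansion routine for cardinals and decimals like the examples above. It handles integers up to the thousands only; a real normalizer also covers dates, ordinals, currency, and so on.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(text):
    """Expand an integer or decimal string, e.g. '100' -> 'one hundred'."""
    if "." in text:                                  # decimal: read digits
        whole, frac = text.split(".")
        digits = " ".join(ONES[int(d)] for d in frac)
        if whole in ("", "0"):
            return "point " + digits
        return number_to_words(whole) + " point " + digits
    n = int(text)
    if n < 20:
        return ONES[n]
    if n < 100:
        t = TENS[n // 10]
        return t if n % 10 == 0 else t + " " + ONES[n % 10]
    if n < 1000:
        head, rest = ONES[n // 100] + " hundred", n % 100
        return head if rest == 0 else head + " " + number_to_words(str(rest))
    # thousands (sufficient for examples like 1776)
    head, rest = number_to_words(str(n // 1000)) + " thousand", n % 1000
    return head if rest == 0 else head + " " + number_to_words(str(rest))
```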
32. Acronym Converter
- Replace acronyms with their single-letter components
- A.B.C. -> A B C
- Change abbreviations to full textual form
- Mr. -> Mister
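A sketch of both conversions; `ABBREV_EXPANSIONS` is a hypothetical toy table, whereas a real system would use a large abbreviation dictionary.

```python
import re

# Toy abbreviation table (illustrative entries only).
ABBREV_EXPANSIONS = {"Mr.": "Mister", "Dr.": "Doctor", "St.": "Street"}

def expand_acronym(token):
    """A.B.C. -> 'A B C'; known abbreviations -> full words."""
    if token in ABBREV_EXPANSIONS:
        return ABBREV_EXPANSIONS[token]
    if re.fullmatch(r"(?:[A-Z]\.)+", token):       # dotted acronym
        return " ".join(token.replace(".", ""))    # spell out the letters
    return token
```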
33. Word Segmenter
- Divide the sentence into word segments
- Special delimiter to separate segments
- Segments can be:
- A single word
- An acronym
- A numeral
- Identify punctuation marks
34. Homograph Disambiguation
The 19 most frequent homographs, from Liberman and Church:
- use 319
- increase 230
- close 215
- record 195
- house 150
- contract 143
- lead 131
- live 130
- lives 105
- protest 94
- survey 91
- project 90
- separate 87
- present 80
- read 72
- subject 68
- rebel 48
- finance 46
- estimate 46
Not a huge problem, but still important
35. POS Tagging for Homograph Disambiguation
- Many homographs can be distinguished by POS
- use: y uw s vs. y uw z
- close: k l ow s vs. k l ow z
- house: h aw s vs. h aw z
- live: l ay v vs. l ih v
- REcord vs. reCORD
- INsult vs. inSULT
- OBject vs. obJECT
- OVERflow vs. overFLOW
- DIScount vs. disCOUNT
- CONtent vs. conTENT
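Once a POS tagger has run, disambiguation can be a table lookup keyed on (word, POS). The ARPAbet strings below come from the slide; the tag names and the fallback policy are illustrative assumptions.

```python
# Homograph pronunciations keyed by (word, coarse POS tag).
PRON = {
    ("use", "NOUN"): "y uw s",
    ("use", "VERB"): "y uw z",
    ("house", "NOUN"): "h aw s",
    ("house", "VERB"): "h aw z",
    ("live", "ADJ"): "l ay v",
    ("live", "VERB"): "l ih v",
}

def pronounce(word, pos):
    """Pick a pronunciation for a homograph given its POS tag.
    Falls back to the noun reading, then to the spelling itself."""
    return PRON.get((word, pos), PRON.get((word, "NOUN"), word))
```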
36. Letter-to-Sound: Getting from Words to Phones
- Two methods
- Dictionary-based
- Rule-based (letter-to-sound, LTS)
- Early systems: all LTS
- MITalk was radical in having a huge 10K-word dictionary
- Modern systems use a combination
37. Names
- A big problem area is names
- Names are common
- 20% of tokens in typical newswire text will be names
- The 1987 Donnelly list (72 million households) contains about 1.5 million names
- Personal names: McArthur, D'Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, Nguyen
- Company/brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe
38. Names
- Methods
- Can do morphology (Walters -> Walter, Lucasville)
- Can write stress-shifting rules (Jordan -> Jordanian)
- Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
- Liberman and Church: for the 250K most common names, got 212K (85%) from these modified-dictionary methods, used LTS for the rest
- Can do automatic country detection (from letter trigrams) and then apply country-specific rules
- Can train a g2p system specifically on names
- Or specifically on types of names (brand names, Russian names, etc.)
39. Acronyms
- We saw above:
- Use machine learning to detect acronyms
- EXPN
- ASWORD
- LETTERS
- Use an acronym dictionary, with hand-written rules to augment it
40. Letter-to-Sound Rules
- Earliest algorithms: hand-written Chomsky/Halle-style rules
- Festival version of such LTS rules:
- (LEFTCONTEXT ITEMS RIGHTCONTEXT NEWITEMS)
- Example:
- ( # [ c h ] C k )
- ( # [ c h ] ch )
- # denotes beginning of word
- C means all consonants
- Rules apply in order
- christmas pronounced with k
- But a word with ch followed by a non-consonant is pronounced ch
- E.g., choice
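The ordered-rule idea can be sketched outside Festival. Below is a Python approximation (Festival's real engine is Scheme/C++ and its rule format is richer): rules are tried in order at each position, and the first rule whose contexts match consumes its letters. The rule encoding and the context matcher are simplifications for illustration.

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxz")

# Each rule: (left_context, items, right_context, output phones).
# '#' = word boundary, 'C' = any consonant, '' = empty context.
RULES = [
    ("#", "ch", "C", ["k"]),    # word-initial ch before a consonant -> /k/
    ("",  "ch", "",  ["ch"]),   # otherwise ch -> /ch/
]

def match_ctx(ctx, text, pos, left):
    if ctx == "":
        return True
    if ctx == "#":
        return pos == 0 if left else pos == len(text)
    if ctx == "C":   # right context only, in this sketch
        return pos < len(text) and text[pos] in CONSONANTS
    return False

def apply_lts(word):
    """Greedy left-to-right application of ordered LTS rules (sketch)."""
    phones, i = [], 0
    while i < len(word):
        for left, items, right, out in RULES:
            if word.startswith(items, i) and \
               match_ctx(left, word, i, True) and \
               match_ctx(right, word, i + len(items), False):
                phones += out
                i += len(items)
                break
        else:
            phones.append(word[i])   # fall back to the letter itself
            i += 1
    return phones
```

With only these two rules, "christmas" gets /k/ and "choice" gets /ch/, matching the example above.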
41. Stress Rules in Hand-Written LTS
- English: a famously evil one from Allen et al. 1987
- Where X must contain all prefixes:
- Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g. difficult)
- Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g. oregano)
- etc.
42. Modern Method: Learning LTS Rules Automatically
- Induce LTS from a dictionary of the language
- Black et al. 1998
- Applied to English, German, French
- Two steps
- alignment
- (CART-based) rule-induction
43. Alignment
- Letters: c h e c k e d
- Phones: ch _ eh _ k _ t
- Black et al., Method 1:
- First scatter epsilons in all possible ways to cause letters and phones to align
- Then collect stats for P(phone | letter) and select the best alignments to generate new stats
- This is iterated a number of times until it settles (5-6 iterations)
- This is an EM (expectation maximization) algorithm
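A toy version of this procedure, written as hard (Viterbi-style) EM: enumerate all epsilon scatterings, keep the best-scoring alignment per word under the current P(phone | letter) estimates, and re-estimate from the chosen alignments. Black et al.'s actual implementation is more careful (allowed-pair constraints, multi-letter handling); this sketch assumes each word has at least as many letters as phones.

```python
from collections import defaultdict
from itertools import combinations

def alignments(letters, phones):
    """All ways to pad the phone string with epsilons ('_') up to the
    length of the letter string (phone order is preserved)."""
    n_eps = len(letters) - len(phones)
    for gaps in combinations(range(len(letters)), n_eps):
        out, it = [], iter(phones)
        for i in range(len(letters)):
            out.append("_" if i in gaps else next(it))
        yield list(zip(letters, out))

def em_align(pairs, iters=5):
    """Estimate P(phone | letter) pair scores by iterating select/count."""
    prob = defaultdict(lambda: 1.0)              # uniform start
    for _ in range(iters):
        counts = defaultdict(float)
        for letters, phones in pairs:
            # E-step (hard): pick the single best alignment for this word
            best = max(alignments(letters, phones),
                       key=lambda a: sum(prob[lp] for lp in a))
            for lp in best:
                counts[lp] += 1
            # M-step: renormalize the letter-phone pair scores
        total = sum(counts.values())
        prob = defaultdict(float, {k: v / total for k, v in counts.items()})
    return prob
```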
44. Alignment
- Some alignments will turn out to be really bad
- These are just the cases where the pronunciation doesn't match the letters
- Dept: d ih p aa r t m ah n t
- CMU: s iy eh m y uw
- Lieutenant: l eh f t eh n ax n t (British)
- Also foreign words
- These can just be removed from alignment training
45. TTS Prosody
[Flowchart: MLDS -> Diphone Retrieval (from the Diphone Database) -> Concatenation -> Acoustic Manipulation, with yes/no branches looping until done]
46. Prosody: from words/phones to boundaries, accent, F0, duration
- Prosodic phrasing
- Need to break utterances into phrases
- Punctuation is useful, but not sufficient
- Accents
- Prediction of accents: which syllables should be accented
- Realization of F0 contour: given accents/tones, generate the F0 contour
- Duration
- Predicting the duration of each phone
47. Defining Intonation
- Ladd (1996), Intonational Phonology
- The use of suprasegmental phonetic features
- Suprasegmental: above and beyond the segment/phone
- F0
- Intensity (energy)
- Duration
- to convey sentence-level pragmatic meanings
- i.e. meanings that apply to phrases or utterances as a whole; not lexical stress, not lexical tone
48. Three Aspects of Prosody
- Prominence: some syllables/words are more prominent than others
- Structure/boundaries: sentences have prosodic structure
- Some words group naturally together
- Others have a noticeable break or disjuncture between them
- Tune: the intonational melody of an utterance
From Ladd (1996)
49. Prosodic Prominence: Pitch Accents
- A: What types of foods are a good source of vitamins?
- B1: Legumes are a good source of VITAMINS.
- B2: LEGUMES are a good source of vitamins.
- Prominent syllables are:
- Louder
- Longer
- Have higher F0 and/or sharper changes in F0 (higher F0 velocity)
Slide from Jennifer Venditti
50. Stress vs. Accent (2)
- The speaker decides to make the word vitamin more prominent by accenting it.
- Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin.
51. Which Word Receives an Accent?
- It depends on the context. For example, the new information in the answer to a question is often accented, while the old information usually is not.
- Q1: What types of foods are a good source of vitamins?
- A1: LEGUMES are a good source of vitamins.
- Q2: Are legumes a source of vitamins?
- A2: Legumes are a GOOD source of vitamins.
- Q3: I've heard that legumes are healthy, but what are they a good source of?
- A3: Legumes are a good source of VITAMINS.
Slide from Jennifer Venditti
52. Factors in Accent Prediction
- Part of speech:
- Content words are usually accented
- Function words are rarely accented
- of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, am, are, was, were, etc.
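This part-of-speech factor gives the standard baseline accent predictor: accent every content word, deaccent every function word. The stop list below is taken from the slide (deduplicated); a real system would combine this with the other factors discussed on the following slides.

```python
# Function words from the slide; these are rarely accented.
FUNCTION_WORDS = {"of", "for", "in", "on", "that", "the", "a", "an", "no",
                  "to", "and", "but", "or", "will", "may", "would", "can",
                  "her", "is", "their", "its", "our", "there", "am", "are",
                  "was", "were"}

def predict_accents(words):
    """Mark each word (word, accented?) using the content/function split."""
    return [(w, w.lower() not in FUNCTION_WORDS) for w in words]
```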
53. Complex Noun Phrase Structure
- Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.
- Proper names: stress on the right-most word
- New York CITY; Paris, FRANCE
- Adjective-noun combinations: stress on the noun
- Large HOUSE, red PEN, new NOTEBOOK
- Noun-noun compounds: stress the left noun
- HOTdog (food) versus HOT DOG (overheated animal)
- WHITE house (place) versus WHITE HOUSE (made of stucco)
- Examples
- MEDICAL Building, APPLE cake, cherry PIE
- What about Madison Avenue, Park Street?
- Some rules
- Furniture+Room -> RIGHT (e.g., kitchen TABLE)
- Proper-name+Street -> LEFT (e.g., PARK street)
54. Levels of Prominence
- Most phrases have more than one accent
- The last accent in a phrase is perceived as more prominent
- Called the nuclear accent
- Emphatic accents like the nuclear accent are often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus
- The kind of thing you represent via *asterisks* in IM, or capitalized letters
- "I know SOMETHING interesting is sure to happen," she said to herself
- Can also have words that are less prominent than usual
- Reduced words, especially function words
- Often use 4 classes of prominence:
- emphatic accent
- pitch accent
- unaccented
- reduced
55. Yes-No Question
"Are legumes a good source of VITAMINS?"
Rise from the main accent to the end of the sentence.
Slide from Jennifer Venditti
56. Surprise-Redundancy Tune
"How many times do I have to tell you ... legumes are a good source of vitamins."
Low beginning followed by a gradual rise to a high at the end.
Slide from Jennifer Venditti
57. Contradiction Tune
"I've heard that linguini is a good source of vitamins."
"Linguini isn't a good source of vitamins ... how could you think that?"
Sharp fall at the beginning, flat and low, then rising at the end.
Slide from Jennifer Venditti
58. Duration
- Simplest:
- fixed duration for all phones (100 ms)
- Next simplest:
- average duration for that phone (from training data); samples from Switchboard, in ms:
    aa 118    b  68
    ax  59    d  68
    ay 138    dh 44
    eh  87    f  90
    ih  77    g  66
- Next next simplest:
- add in phrase-final and initial lengthening, plus stress
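The second and third models can be sketched together: look up the phone's mean duration, then apply multiplicative lengthening. The mean values come from the table above; the stress and final-lengthening factors are illustrative placeholders, not values from the slide.

```python
# Mean phone durations in ms (the Switchboard samples listed above).
MEAN_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
           "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

def phone_duration(phone, stressed=False, phrase_final=False,
                   stress_factor=1.2, final_factor=1.4, default_ms=100):
    """Mean duration plus multiplicative stress / final lengthening.
    The two factors are illustrative assumptions."""
    d = MEAN_MS.get(phone, default_ms)   # unknown phone -> flat 100 ms model
    if stressed:
        d *= stress_factor
    if phrase_final:
        d *= final_factor
    return d
```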
59. Intermediate Representation (using Festival)
- Do you really want to see all of it?
60. Waveform Synthesis
- Given:
- String of phones
- Prosody
- Desired F0 for the entire utterance
- Duration for each phone
- Stress value for each phone, possibly an accent value
- Generate:
- Waveforms
61. Diphone TTS Architecture
- Training:
- Choose units (kinds of diphones)
- Record 1 speaker saying 1 example of each diphone
- Mark the boundaries of each diphone
- Cut each diphone out and create a diphone database
- Synthesizing an utterance:
- Grab the relevant sequence of diphones from the database
- Concatenate the diphones, doing slight signal processing at the boundaries
- Use signal processing to change the prosody (F0, energy, duration) of the selected sequence of diphones
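The concatenation step ("slight signal processing at boundaries") can be illustrated with a simple linear cross-fade between adjacent diphone waveforms. Real diphone synthesizers do this pitch-synchronously (e.g. TD-PSOLA) and also warp F0 and duration; this sketch only smooths the joins.

```python
def concatenate_diphones(units, fade=32):
    """Join diphone waveforms (lists of samples) with a linear cross-fade
    of `fade` samples at each boundary to soften concatenation artifacts."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(fade, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)     # ramp the new unit in, the old one out
            j = len(out) - n + i
            out[j] = (1 - w) * out[j] + w * unit[i]
        out.extend(unit[n:])
    return out
```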
62. Recent Work
- Problems with unit selection synthesis:
- Can't modify the signal
- (mixing modified and unmodified audio sounds bad)
- But the database often doesn't have exactly what you want
- Solution: HMM (Hidden Markov Model) synthesis
- Won a recent TTS bakeoff
- Sounds less natural to researchers
- But naïve subjects preferred it
- Has the potential to improve on both diphone and unit selection
63. Summary
- ARPAbet
- TTS Architectures
- TTS Components
- Text Analysis
- Text Normalization
- Homograph Disambiguation
- Grapheme-to-Phoneme (Letter-to-Sound)
- Intonation
- Waveform Generation
- Diphones
- Unit Selection
- HMM