Title: EE2F1 Multimedia 1: Speech
1. EE2F1 Multimedia (1): Speech & Audio Technology
Lecture 7: Speech Synthesis (1)
Martin Russell
Electronic, Electrical & Computer Engineering
School of Engineering
The University of Birmingham
2. Stages in text-to-speech synthesis
- Text normalisation
- Text-to-phone conversion
- Linguistic analysis
- Semantic analysis
- Conversion of phone sequence to sequence of synthesiser control parameters
- Synthesis of acoustic speech signal
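The first two stages above can be sketched in a few lines. This is a minimal illustration only: the number table, the tiny lexicon and the phone symbols are made up for the example, and a real system would use a full pronunciation dictionary plus letter-to-sound rules.

```python
import re

# Hypothetical lookup tables, invented for illustration only.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}
LEXICON = {
    "platform": ["p", "l", "a", "t", "f", "o", "m"],
    "two": ["t", "u"],
}

def normalise(text):
    """Text normalisation: lower-case, drop punctuation, spell out single digits."""
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    return [NUMBER_WORDS.get(tok, tok) for tok in tokens]

def to_phones(words):
    """Text-to-phone conversion by dictionary lookup."""
    phones = []
    for word in words:
        phones.extend(LEXICON[word])  # raises KeyError for out-of-vocabulary words
    return phones

words = normalise("Platform 2")
phones = to_phones(words)
print(words)
print(phones)
```

The later stages (linguistic and semantic analysis, control parameters, signal synthesis) are not modelled here.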
3. Approaches to synthesis
- Final stage is to convert phone or word sequence into a sequence of synthesiser control parameters
- Two main approaches
  - Waveform concatenation
  - Model-based speech synthesis (includes articulatory synthesis)
4. Waveform concatenation
- Join together, or concatenate, stored sections of real speech
- Sections may correspond to whole words or sub-word units
- Early systems were based on whole words
  - e.g., the speaking clock in the UK telephone system, 1936
- Storage and access are major issues
- Good speech quality requires data rates of 16,000 to 32,000 bits per second (bps)
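As a rough check on the storage issue, the quoted data rates can be turned into storage per minute of speech. The sketch assumes the speech is stored uncompressed at exactly the quoted bit rate; the function name is chosen for this example.

```python
def storage_bytes(duration_s, bit_rate_bps):
    """Bytes needed to store duration_s seconds of speech at bit_rate_bps."""
    return duration_s * bit_rate_bps / 8  # 8 bits per byte

for rate in (16_000, 32_000):
    kilobytes = storage_bytes(60, rate) / 1000
    print(f"{rate} bps -> {kilobytes:.0f} kB per minute of speech")
```

At these rates one minute of stored speech needs 120 kB to 240 kB.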
5. 1936 Speaking Clock
Figure from John Holmes, Speech Synthesis and Recognition, courtesy of British Telecommunications plc
6. Whole word concatenation (1)
- Whole word concatenation can give good quality speech (as in the speaking clock), but has many disadvantages
  - pronunciation of a word is influenced by neighbouring words (co-articulation)
  - prosodic effects like intonation, rate of speaking and amplitude are also influenced by context
  - interpretation of a sentence is strongly influenced by details of the individual words used, e.g., "Mary didn't buy Sam a pizza" changes meaning depending on which word is stressed
7. Whole word concatenation (2)
- Disadvantages (continued)
  - words must be extracted from the right sort of sentence
  - most suitable for applications where the structure of the sentence is constrained, e.g., announcements, lists
  - may need to record more than one example of each word, e.g., raised pitch at end of a list, pre-pause lengthening
8. Example: original recording
"The next train to arrive at platform 2 will call at Bromsgrove, Droitwich Spa, Worcester Foregate Street and Malvern Link"
9. Example: trivial concatenative synthesis
"The next train to arrive at platform 2 will call at Malvern Link, Worcester Foregate Street, Droitwich Spa and Bromsgrove"
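One way to see why the reordered announcement sounds wrong is to track the list position in which each station name was originally spoken. The sketch below is illustrative only: it reduces prosodic context to a two-way label ("final" or "non-final"), and the function names are invented for this example.

```python
# Station list in the order spoken in the original recording.
ORIGINAL_ORDER = [
    "Bromsgrove",
    "Droitwich Spa",
    "Worcester Foregate Street",
    "Malvern Link",
]

def list_position(index, length):
    """Label an item as list-final or non-final."""
    return "final" if index == length - 1 else "non-final"

# Prosodic context in which each name was originally recorded.
RECORDED_AS = {
    name: list_position(i, len(ORIGINAL_ORDER))
    for i, name in enumerate(ORIGINAL_ORDER)
}

def context_mismatches(new_order):
    """Names whose stored recording comes from the wrong list position."""
    n = len(new_order)
    return [
        name
        for i, name in enumerate(new_order)
        if RECORDED_AS[name] != list_position(i, n)
    ]

reordered = list(reversed(ORIGINAL_ORDER))
print(context_mismatches(reordered))
```

For the reversed list, "Malvern Link" carries list-final prosody at the start of the list and "Bromsgrove" carries non-final prosody at the end, which is the kind of mismatch that recording several examples of each word is meant to avoid.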
10. Example repeated
- Original recording
- Concatenative synthesis
11. Whole word concatenation (3)
- Disadvantages (continued)
  - to add new words the original speaker must be found, or all words must be re-recorded
  - even with specialist facilities, selection and extraction of suitable words is labour intensive and time consuming
12. Sub-word concatenation (1)
- Limitations of word-based methods suggest concatenative speech synthesis based on sub-word units
- Need a well-annotated, phonetically balanced corpus of speech recordings
- Extract fragments from waveforms in the corpus which represent basic units of speech, and can be concatenated and used for speech synthesis
13. Sub-word concatenation (2)
- Difficulties include
  - identification of a set of suitable units
  - careful annotation of large amounts of data
  - derivation of a good method for concatenation
14. Sub-word concatenation (3)
- Sub-word concatenation overcomes difficulties with adding new words to the application vocabulary
- But other problems are exacerbated
- In particular, coarticulation and pitch continuity problems occur within, as well as between, words
- Necessary to use several examples of each phone (corresponding roughly to different allophones)
15. Sub-word concatenation (4)
- Natural to select fragments that characterise the
phone target values, but modelling transitions
between these targets is a significant problem
16. Example: sub-word concatenation
- "stack" (original)
- "task" (sub-word concatenative synthesis)
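The "stack" to "task" example amounts to cutting a recording at phone boundaries and re-ordering the pieces. A minimal sketch follows, assuming a hypothetical phone-level annotation. The boundaries and samples are invented, and each placeholder sample equals its own index so the re-ordering is visible in the output.

```python
# Hypothetical annotation of a recording of "stack":
# (phone label, start sample, end sample). Boundaries are invented.
stack_segments = [("s", 0, 4), ("t", 4, 6), ("a", 6, 11), ("k", 11, 14)]

# Placeholder "waveform": each sample is its own index.
stack_wave = list(range(14))

def synthesise(target_phones, segments, wave):
    """Concatenate the stored fragment for each phone in the target sequence."""
    index = {label: (start, end) for label, start, end in segments}
    out = []
    for phone in target_phones:
        start, end = index[phone]
        out.extend(wave[start:end])
    return out

task_wave = synthesise(["t", "a", "s", "k"], stack_segments, stack_wave)
print(task_wave)
```

Each fragment is used exactly as recorded, so the joins ignore coarticulation. This is the weakness that the transitional units on the following slides address.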
17. Transitional units (1)
- Central regions of many speech sounds are approximately stationary and less susceptible to coarticulation effects
- Hence select fragments which characterise transitions between phones, rather than phone targets
  - e.g., diphone: the transition between two phones
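Turning a phone sequence into the diphones needed to synthesise it is a simple pairing of neighbours. The sketch below assumes a silence symbol "sil" at each end of the utterance; the symbol and the "a-b" naming are conventions chosen for this example.

```python
def to_diphones(phones, silence="sil"):
    """Pair each phone with its successor, padding the utterance with silence."""
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

print(to_diphones(["t", "a", "s", "k"]))
```

Each unit runs from the centre of one phone to the centre of the next, so the joins fall in the approximately stationary regions. The price is a larger inventory, which grows roughly with the square of the number of phones.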
18. Transitional units (2)
- There are contextually induced differences between instantiations of the central region of a phone, which cause discontinuities if they are not attended to
- Possible solutions are
  - use several different examples of each diphone
  - store short transition regions, and interpolate between end values
19. Transitional units (3)
- Coping with coarticulation effects by modelling transitions and
  - (a) using multiple examples to cope with variation in the instantiation of the phone centres, and
  - (b) interpolating between short transition regions
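Option (b) can be illustrated with linear interpolation between the end values of two stored transition regions. This is a sketch under stated assumptions: each frame is a short list of synthesiser parameters, the values below are hypothetical formant frequencies in Hz, and linear interpolation is only one possible choice.

```python
def interpolate_join(left_end, right_start, n_frames):
    """Return n_frames parameter frames that bridge left_end to right_start.

    The two end frames themselves are not repeated in the output.
    """
    frames = []
    for k in range(1, n_frames + 1):
        w = k / (n_frames + 1)  # weight moves from near 0 to near 1
        frames.append([(1 - w) * a + w * b for a, b in zip(left_end, right_start)])
    return frames

# Hypothetical end values (first two formants, in Hz) on either side of a join.
left_end = [700.0, 1200.0]
right_start = [500.0, 1500.0]

bridge = interpolate_join(left_end, right_start, 3)
for frame in bridge:
    print(frame)
```

The bridging frames move in equal steps from one end value to the other, which removes the discontinuity at the phone centre.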
20. More on prosody
- Discontinuity in the fundamental frequency is exacerbated for sub-word methods
- Can use a source-filter model to separate the excitation signal from the vocal-tract shape
- Vocal-tract shape descriptions can then be concatenated, and an appropriately smooth fundamental frequency pattern can be added separately
21. PSOLA: Pitch Synchronous Overlap and Add
- PSOLA (Charpentier, 1986)
- Most successful current approach to concatenative synthesis
- In PSOLA, the end regions of windowed waveform samples are overlapped pitch-synchronously and added
- BT's Laureate is an example
22. PSOLA
Figure from John Holmes and Wendy Holmes, Speech Synthesis and Recognition, Taylor & Francis, 2001
23. Speech modification using PSOLA
- In addition to speech synthesis from segments, there are two other common applications of PSOLA
  - Pitch modification
  - Duration modification
24. Increasing pitch using PSOLA
Figure from John Holmes and Wendy Holmes, Speech Synthesis and Recognition, Taylor & Francis, 2001
25. Decreasing pitch using PSOLA
Figure from John Holmes and Wendy Holmes, Speech Synthesis and Recognition, Taylor & Francis, 2001
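Pitch modification by overlap-add can be sketched as follows. This is a simplified illustration, not the Laureate implementation. It assumes the pitch marks are already known, the pitch period is constant, each synthesis position reuses the nearest analysis segment, and amplitude is corrected by dividing by the summed window weights. All function and variable names are chosen for this example.

```python
import math

def hann(n):
    """Symmetric Hann window of n samples (zero at both ends)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def psola_pitch_shift(signal, marks, period, factor):
    """Raise (factor > 1) or lower (factor < 1) pitch, keeping duration.

    signal: list of samples; marks: pitch-mark positions in samples;
    period: pitch period in samples (assumed constant); factor: pitch ratio.
    """
    window = hann(2 * period + 1)          # two pitch periods, centred on a mark
    out = [0.0] * len(signal)
    norm = [0.0] * len(signal)             # summed window weights per sample
    new_period = period / factor
    t = float(marks[0])
    while t <= marks[-1]:
        centre = round(t)
        # Reuse the analysis segment whose mark is closest to this position.
        src = min(marks, key=lambda m: abs(m - t))
        for k in range(-period, period + 1):
            i, j = src + k, centre + k
            if 0 <= i < len(signal) and 0 <= j < len(out):
                out[j] += signal[i] * window[k + period]
                norm[j] += window[k + period]
        t += new_period
    # Compensate for the changed amount of window overlap.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]

# Toy input: one unit pulse per pitch period of 100 samples.
period = 100
signal = [0.0] * 2000
marks = list(range(period, 2000, period))
for m in marks:
    signal[m] = 1.0

raised = psola_pitch_shift(signal, marks, period, 2.0)
lowered = psola_pitch_shift(signal, marks, period, 0.5)

def count_pulses(x):
    return sum(1 for v in x if v > 0.1)

print(count_pulses(signal), count_pulses(raised), count_pulses(lowered))
```

On this toy pulse train the 19 input pulses become 37 when the pitch is doubled and 10 when it is halved, while the signal length is unchanged. Duration modification uses the same machinery, but repeats or drops segments while keeping the original spacing.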
26. The Laureate System
- The BT Laureate system is a modern, PSOLA-based synthesiser
- See Edington et al. (1996a); also look at the web site
- Demonstration
27. PSOLA strengths and weaknesses
- Strengths
  - Produces good quality speech
- Weaknesses
  - Large, annotated corpus needed for each voice
  - Requires accurate pitch peak detection
  - Inflexible: new voices can only be produced by recording and labelling significant speech corpora from new speakers
- Corpora can be annotated automatically using techniques from speech recognition
28. Summary
- Concatenative speech synthesis
- Whole word concatenation
- Importance of prosody
- Sub-word concatenation
- Choice of sub-word units
- PSOLA