Title: Speech Synthesis
- A user-friendly machine must have complete voice communication abilities
- Voice communication involves
- Speech synthesis
- Speech recognition
Elements of Speech
- Some popular electronic speech synthesizers are modeled directly on the human vocal tract
- Other techniques work by putting together the various sounds that are required to produce speech
- Therefore, it is important to learn the characteristics of speech sounds
- The vocal system can be broken into
- Lungs
- Larynx
- Vocal cavity
- The lungs provide power to the system by forcing air up through the larynx and into the vocal cavity
- The vocal cords, which are made up of layers of skin, create sound as they flap or vibrate when air passes through
- The vibrating action generates several resonant frequencies within the vocal cavity (which has several harmonic frequencies)
- Different sounds are created by changing the shape of the vocal cavity with the throat, tongue, teeth, and lips
- Engineers use a frequency analyzer called a sound spectrograph to study speech
- Spectrographs have shown that the range of most human speech is from 150 to 3600 Hz, which represents a frequency bandwidth of 3450 Hz
- A bass singer or soprano can span a bandwidth of 15 kHz (from 10 Hz to 15 kHz)
- The volume ratio for human speech is about 16,000 to 1, from a shout to a whisper
- However, we do not need to design a speech synthesizer that generates frequencies from 10 Hz to 15 kHz with a volume ratio of 16,000 to 1
- We only need to produce intelligible speech. E.g. the telephone system has a bandwidth of 3000 Hz (from 300 to 3300 Hz) with a 1000:1 volume ratio
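To put these volume ratios in more familiar terms, here is a short Python sketch converting them to decibels. It assumes the ratios above are amplitude ratios (hence the factor 20); for power ratios the factor would be 10:

```python
import math

def ratio_to_db(ratio: float) -> float:
    """Convert an amplitude (volume) ratio to decibels."""
    return 20 * math.log10(ratio)

print(f"Shout to whisper (16,000:1): {ratio_to_db(16_000):.1f} dB")  # ~84.1 dB
print(f"Telephone (1,000:1):         {ratio_to_db(1_000):.1f} dB")   # 60.0 dB
```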
Electronic Speech Synthesis (ESS)
- Two techniques commonly used in ESS are
- Natural speech analysis/synthesis
- Artificial constructive/synthesis
Natural speech analysis/synthesis
- Involves recording and subsequent playback of human speech
- Can be analog or digital
- The best choice for producing limited speech, as in vending machines, appliances, and automobiles
- Involves an analysis phase and a synthesis phase
- Analysis phase: human speech is analyzed, coded in digital form, and stored
- Synthesis phase: the digitized speech is recalled from memory and converted back to analog to re-create the original speech waveform
- The digital analysis/synthesis method provides more flexibility than the analog method, since the stored words or phrases can be randomly accessed from computer memory
- However, the vocabulary size is limited by the amount of memory available
- For this reason, several different encoding techniques are used to analyze and compress the speech waveform; they attempt to discard unimportant parts so that fewer bits are required
- Two types of digital analysis/synthesis
- Time-domain analysis/synthesis
- The speech waveform is digitized in the time domain
- Analog-to-digital conversion (using an ADC); the stored samples are then passed through a DAC to reproduce the speech
- E.g. telephone directory assistance
- Frequency-domain analysis/synthesis
- The frequency spectrum of the analog waveform is analyzed and coded
- The synthesis operation attempts to emulate the human vocal tract electronically, using stored frequency parameters obtained from the analysis
Artificial Constructive/Synthesis
- Speech is created artificially by putting together the various sounds that are required to produce a given speech segment
- The most popular technique is called phoneme speech synthesis
- Phoneme: the smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English
- Phonetic: representing the sounds of speech with a set of distinct symbols, each designating a single sound, as in phonetic spelling
- E.g. f-ne-tik sim-bl for "phonetic symbol"
- Allophone: a predictable phonetic variant of a phoneme. For example, the aspirated t of top, the unaspirated t of stop, and the tt (pronounced as a flap) of batter are allophones of the English phoneme /t/
- Phoneme and allophone sounds are coded and stored in memory
- A software algorithm then connects the phonemes to produce a given word; words are strung together to produce phrases
- There is also software consisting of a set of production rules used to translate written text into the appropriate allophone codes: text-to-speech
- With the phoneme technique, a computer can produce an unlimited vocabulary using a minimal amount of memory (a toy sketch of the idea follows)
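A toy Python sketch of phoneme concatenation, purely for illustration. The phoneme table, its sound codes, and the one-letter-per-phoneme rule below are all invented; real systems store digitized sound fragments per phoneme/allophone and use far richer production rules:

```python
# Hypothetical table mapping phonemes to stored sound codes.
PHONEME_SOUNDS = {
    "m": b"\x01",
    "a": b"\x02",
    "t": b"\x03",
    "b": b"\x04",
}

def text_to_phonemes(word: str) -> list[str]:
    # Grossly simplified production rule: one letter = one phoneme.
    return [ch for ch in word.lower() if ch in PHONEME_SOUNDS]

def synthesize(word: str) -> bytes:
    # Concatenate the stored sound code for each phoneme in the word.
    return b"".join(PHONEME_SOUNDS[p] for p in text_to_phonemes(word))

print(synthesize("mat"))  # b'\x01\x02\x03'
print(synthesize("bat"))  # b'\x04\x02\x03'
```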
Summary of the various methods used for electronic speech synthesis (ESS)
Time-Domain Analysis/Synthesis and Waveform Digitization
- Any method that converts the amplitude variations of speech to digital code for subsequent playback can be considered time-domain speech analysis/synthesis
- Time-domain speech analysis/synthesis involves two operations (a minimal round-trip sketch follows this list)
- Encoding: the human speech waveform is digitized and stored using an analog-to-digital converter (ADC)
- Decoding: the digitized speech is converted back to analog form for playback using a digital-to-analog converter (DAC)
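A minimal numpy sketch of this encode/decode round trip, assuming an ideal 8-bit ADC and DAC over a fixed ±1 range; the sampling rate and test tone are arbitrary stand-ins for real speech:

```python
import numpy as np

FS = 6_000      # samples per second
BITS = 8        # ADC/DAC resolution
LEVELS = 2 ** BITS

# Stand-in "speech": a 300 Hz tone with amplitude in [-1, 1]
t = np.arange(FS) / FS
speech = 0.8 * np.sin(2 * np.pi * 300 * t)

# Encode: quantize each sample to an 8-bit code (0..255)
codes = np.clip(((speech + 1) / 2 * (LEVELS - 1)).round(),
                0, LEVELS - 1).astype(np.uint8)

# Decode: map codes back to analog levels; a low-pass filter would
# then smooth the staircase output in a real system
reconstructed = codes.astype(float) / (LEVELS - 1) * 2 - 1

print("max quantization error:", np.max(np.abs(speech - reconstructed)))
```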
- A low-pass filter is connected to the DAC output to smooth out the steps in the synthesized waveform
- Time-domain encoding attempts to reduce the amount of memory required to store digitized speech. Some examples are
- Simple Pulse-Code Modulation (SPCM)
- Delta Modulation (DM)
- Differential Pulse-Code Modulation (DPCM)
- Adaptive Differential Pulse-Code Modulation (ADPCM)
Simple Pulse-Code Modulation (SPCM)
- Direct waveform digitization using an ADC
- Two things control the quality of the digitized speech
- Sampling rate: a higher sampling rate creates higher-quality output
- ADC resolution: a higher-bit converter creates higher-quality output
- To catch all the subtleties of the waveform, it must be sampled about 30,000 times per second
- If each sample were converted to an 8-bit digital code, one second of speech would require 8 × 30,000 = 240,000 bits of memory (a 240,000 bit-per-second data rate: not practical)
- To reduce the data rate, the sampling rate must be reduced
- Experimentation has shown that acceptable speech can be created using a sampling rate of at least two times the highest frequency component in the speech waveform
- Therefore 6000 (i.e. 3000 × 2) conversions per second is the minimum sampling rate required to produce acceptable speech (most speech falls in the 300 to 3000 Hz range)
- The ADC resolution determines the smallest analog increment that will be detected by the system. Acceptable speech can be synthesized using an 8-bit ADC, giving a data rate of 8 × 6,000 = 48,000 bps. Ten seconds of speech therefore needs 60,000 bytes of memory, or roughly 58.6 kB (see the sketch below)
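A small Python sketch reproducing these figures; the function name is my own:

```python
def pcm_memory_bytes(sample_rate: int, bits_per_sample: int, seconds: float) -> float:
    """Memory needed to store simple PCM speech, in bytes."""
    return sample_rate * bits_per_sample * seconds / 8

# Figures from the text: 8-bit samples at 6,000 samples/s for 10 s
b = pcm_memory_bytes(6_000, 8, 10)
print(f"{b:,.0f} bytes = {b / 1024:.1f} kB")             # 60,000 bytes = 58.6 kB

# The impractical full-detail case: 30,000 samples/s at 8 bits
print(f"{pcm_memory_bytes(30_000, 8, 1) * 8:,.0f} bps")  # 240,000 bps
```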
- Problem: SPCM requires too much memory to produce acceptable speech for any length of time
- The answer is to use one of
- Delta Modulation (DM)
- Differential Pulse-Code Modulation (DPCM)
- Adaptive Differential Pulse-Code Modulation (ADPCM)
Delta Modulation (DM)
- Only a single bit is stored for each sample of the speech waveform, rather than 8 or 12 bits
- The ADC still converts each sample to an 8- or 12-bit value, but that value is then compared to the previous sample value
- If the present value is greater than the last value, the computer stores a logic 1 bit; if less, a logic 0 is stored. Therefore, a single bit is used to represent each sample
- An integrator is used on the circuit output to convert the serial bit stream back to an analog waveform (a minimal sketch follows)
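A minimal numpy sketch of the DM idea, with an arbitrary fixed step size; this fixed step is exactly what causes the slope-overload problem described under the disadvantages below:

```python
import numpy as np

def dm_encode(samples: np.ndarray, step: float) -> np.ndarray:
    """Delta modulation: store 1 bit per sample (1 = up, 0 = down)."""
    bits = np.zeros(len(samples), dtype=np.uint8)
    estimate = 0.0
    for i, x in enumerate(samples):
        bits[i] = 1 if x > estimate else 0
        estimate += step if bits[i] else -step  # encoder tracks the input
    return bits

def dm_decode(bits: np.ndarray, step: float) -> np.ndarray:
    """The integrator: accumulate +/- one step per bit."""
    return np.cumsum(np.where(bits == 1, step, -step))

t = np.arange(32_000) / 32_000              # 32,000 samples/s for 1 second
speech = 0.5 * np.sin(2 * np.pi * 300 * t)  # stand-in speech tone
out = dm_decode(dm_encode(speech, step=0.05), step=0.05)
print("max tracking error:", np.max(np.abs(speech - out)))
```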
- Disadvantages
- The sampling rate must be high to catch all the details of the speech signal. E.g. a typical sampling rate of 32,000 samples per second translates to a 32,000 bps data rate, so 10 seconds of speech would require 39 kB of memory (roughly a one-third data reduction from 8-bit SPCM)
- Compliance error / slope overload
- Results when the speech waveform changes too rapidly for a given sampling rate. The resulting digitization does not truly represent the analog waveform and produces audible distortion in the output
- Can be solved by increasing the sampling rate (however, this results in an increased data rate and more memory)
Differential Pulse-Code Modulation (DPCM)
- Same as DM, but several bits are used to represent the actual difference between two successive samples, rather than a single bit
- Since the speech waveform contains many duplicated sounds and pauses, the change in amplitude from one sample to the next is relatively small compared to the absolute amplitude of a sample. As a result, fewer bits are required to store the difference value than the absolute sample value
- The difference between two successive samples can be represented with a 6- or 7-bit value: 1 sign bit (representing the slope of the input waveform) plus 5 or 6 bits for the difference value
- E.g. a 7-bit DPCM system using a sampling rate of 6,000 samples per second would require about 51 kB of memory for 10 seconds of speech
- 7 × 6,000 × 10 s = 420,000 bits
- 420,000 bits / 8 = 52,500 bytes
- 52,500 / 1024 ≈ 51.3 kB
- A bipolar DAC is used for playback to convert the successive difference values to a continuous analog waveform
- DPCM has the same slope-overload problem as DM, which can be overcome by increasing the sampling rate and bit rate (a minimal sketch follows)
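A minimal numpy sketch of the DPCM idea, assuming a 7-bit difference code (1 sign bit plus 6 magnitude bits) and an arbitrary quantization step chosen for illustration:

```python
import numpy as np

STEP = 1 / 64   # illustrative step; 6 magnitude bits -> diffs in [-63, +63]

def dpcm_encode(samples: np.ndarray) -> np.ndarray:
    """Store the quantized difference between successive samples."""
    diffs = np.empty(len(samples), dtype=np.int8)
    estimate = 0.0
    for i, x in enumerate(samples):
        d = int(round((x - estimate) / STEP))
        d = max(-63, min(63, d))   # clip to the 7-bit signed range
        diffs[i] = d
        estimate += d * STEP       # track what the decoder will reconstruct
    return diffs

def dpcm_decode(diffs: np.ndarray) -> np.ndarray:
    """The bipolar DAC accumulates the differences back into a waveform."""
    return np.cumsum(diffs.astype(float)) * STEP

t = np.arange(6_000) / 6_000
speech = 0.5 * np.sin(2 * np.pi * 300 * t)   # stand-in speech tone
decoded = dpcm_decode(dpcm_encode(speech))
print("max error:", np.max(np.abs(speech - decoded)))
```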
Adaptive Differential Pulse-Code Modulation (ADPCM)
- A variation of DPCM that eliminates the slope-overload problem
- Only 3 or 4 bits are required to represent each sample
- The waveform is sampled at 6000 samples per second with an 8- or 12-bit ADC. The computer then subtracts the current sample value from the previous one to get a differential value, as in DPCM. However, the differential value is then adjusted to compensate for slope using a quantization factor
- The quantization factor adjusts the differential value dynamically, according to the rate of change, or slope, of the input waveform. The adjusted differential value can then be represented using only 3 or 4 bits (a minimal sketch follows)
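A minimal numpy sketch of the adaptive idea. The adaptation rule here (widen the step when the code saturates, narrow it otherwise) and all constants are invented for illustration; real ADPCM codecs use standardized step-size tables:

```python
import numpy as np

BITS = 3
LIM = 2 ** (BITS - 1) - 1   # 3-bit codes span [-3, +3]

def adapt(step: float, code: int) -> float:
    """Widen the step on steep slopes, narrow it on shallow ones."""
    step *= 1.5 if abs(code) == LIM else 0.9
    return max(step, 1e-4)

def adpcm_encode(samples: np.ndarray) -> np.ndarray:
    codes = np.empty(len(samples), dtype=np.int8)
    estimate, step = 0.0, 0.01
    for i, x in enumerate(samples):
        c = max(-LIM, min(LIM, int(round((x - estimate) / step))))
        codes[i] = c
        estimate += c * step
        step = adapt(step, c)   # decoder applies the same rule in lockstep
    return codes

def adpcm_decode(codes: np.ndarray) -> np.ndarray:
    out = np.empty(len(codes))
    estimate, step = 0.0, 0.01
    for i, c in enumerate(codes):
        estimate += int(c) * step
        out[i] = estimate
        step = adapt(step, int(c))
    return out
```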
- In addition to requiring fewer bits, the sampling rate of ADPCM can be reduced (to 4000 samples per second), since slope overload is minimized
- E.g. 4000 samples per second with a 3-bit code results in a data rate of 3 × 4000 = 12,000 bps. Therefore 10 s of speech requires about 15 kB of memory (i.e. (12,000 / 8) × 10 s = 15,000 bytes)
- However, 8000 samples per second and a 4-bit code are more common, giving 39 kB of memory for 10 s of speech
- Disadvantage: needs sophisticated software