Title: Speech Synthesis
- A user-friendly machine must have complete voice communication abilities
- Voice communication involves
- Speech synthesis
- Speech recognition
Elements of Speech
- Some popular electronic speech synthesizers are modeled directly on the human vocal tract
- Other techniques work by putting together the various sounds that are required to produce speech
- Therefore, it is important to learn the characteristics of speech sounds
- The vocal system can be broken into
- Lungs
- Larynx
- Vocal cavity
- The lungs provide power to the system by forcing air up through the larynx and into the vocal cavity
- The vocal cords, which are made up of layers of skin, create sound as they flap or vibrate when air passes through
- The vibrating action generates several resonant frequencies within the vocal cavity (which has several harmonic frequencies)
- Different sounds are created by changing the shape of the vocal cavity with the throat, tongue, teeth, and lips
- Engineers use a frequency analyzer called a sound spectrograph to study speech
- Spectrographs have shown that the range of most human speech is from 150 to 3600 Hz, which represents a frequency bandwidth of 3450 Hz
- A bass singer or soprano can span a bandwidth of 15 kHz (from 10 Hz to 15 kHz)
- The volume ratio for human speech is about 16,000 to 1, from a shout to a whisper
- However, we do not need to design a speech synthesizer that generates frequencies from 10 Hz to 15 kHz with a volume ratio of 16,000 to 1
- We only need to produce intelligible speech. E.g. the telephone system has a bandwidth of 3000 Hz (from 300 to 3300 Hz) with a 1000:1 volume ratio
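To put these volume ratios in more familiar terms, here is a short Python sketch converting them to decibels. It assumes the ratios above are amplitude ratios (hence the factor 20); for power ratios the factor would be 10:

```python
import math

def ratio_to_db(ratio: float) -> float:
    """Convert an amplitude (volume) ratio to decibels."""
    return 20 * math.log10(ratio)

print(f"Shout to whisper (16,000:1): {ratio_to_db(16_000):.1f} dB")  # ~84.1 dB
print(f"Telephone (1,000:1):         {ratio_to_db(1_000):.1f} dB")   # 60.0 dB
```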
Electronic Speech Synthesis (ESS)
- Two techniques commonly used in ESS are
- Natural speech analysis/synthesis
- Artificial constructive/synthesis
Natural speech analysis/synthesis
- Involves recording and subsequent playback of human speech
- Can be analog or digital
- The best choice for producing limited speech, as in vending machines, appliances, and automobiles
- Involves an analysis phase and a synthesis phase
- Analysis phase: human speech is analyzed, coded in digital form, and stored
- Synthesis phase: the digitized speech is recalled from memory and converted back to analog to re-create the original speech waveform
- The digital analysis/synthesis method provides more flexibility than the analog method, since the stored words or phrases can be randomly accessed from computer memory
- However, the vocabulary size is limited by the amount of memory available
- For this reason, several different encoding techniques are used to analyze and compress the speech waveform; they attempt to discard unimportant parts so that fewer bits are required
- Two types of digital analysis/synthesis
- Time-domain analysis/synthesis
- The speech waveform is digitized in the time domain
- Analog-to-digital conversion (using an ADC); the stored samples are then passed through a DAC to reproduce the speech
- E.g. telephone directory assistance
- Frequency-domain analysis/synthesis
- The frequency spectrum of the analog waveform is analyzed and coded
- The synthesis operation attempts to emulate the human vocal tract electronically, using stored frequency parameters obtained from the analysis
Artificial Constructive/Synthesis
- Speech is created artificially by putting together the various sounds that are required to produce a given speech segment
- The most popular technique is called phoneme speech synthesis
- Phoneme: the smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English
- Phonetic: representing the sounds of speech with a set of distinct symbols, each designating a single sound, as in phonetic spelling
- E.g. f-ne-tik sim-bl for "phonetic symbol"
- Allophone: a predictable phonetic variant of a phoneme. For example, the aspirated t of top, the unaspirated t of stop, and the tt (pronounced as a flap) of batter are allophones of the English phoneme /t/
- Phoneme and allophone sounds are coded and stored in memory
- A software algorithm then connects the phonemes to produce a given word; words are strung together to produce phrases
- There is also software consisting of a set of production rules used to translate written text into the appropriate allophone codes: text-to-speech
- With the phoneme technique, a computer can produce an unlimited vocabulary using a minimal amount of memory (a toy sketch of the idea follows)
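A toy Python sketch of phoneme concatenation, purely for illustration. The phoneme table, its sound codes, and the one-letter-per-phoneme rule below are all invented; real systems store digitized sound fragments per phoneme/allophone and use far richer production rules:

```python
# Hypothetical table mapping phonemes to stored sound codes.
PHONEME_SOUNDS = {
    "m": b"\x01",
    "a": b"\x02",
    "t": b"\x03",
    "b": b"\x04",
}

def text_to_phonemes(word: str) -> list[str]:
    # Grossly simplified production rule: one letter = one phoneme.
    return [ch for ch in word.lower() if ch in PHONEME_SOUNDS]

def synthesize(word: str) -> bytes:
    # Concatenate the stored sound code for each phoneme in the word.
    return b"".join(PHONEME_SOUNDS[p] for p in text_to_phonemes(word))

print(synthesize("mat"))  # b'\x01\x02\x03'
print(synthesize("bat"))  # b'\x04\x02\x03'
```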
Summary of the various methods used for electronic speech synthesis (ESS)
Time-Domain Analysis/Synthesis and Waveform Digitization
- Any method that converts the amplitude variations of speech to digital code for subsequent playback can be considered time-domain speech analysis/synthesis
- Time-domain speech analysis/synthesis involves two operations (a minimal round-trip sketch follows this list)
- Encoding: the human speech waveform is digitized and stored using an analog-to-digital converter (ADC)
- Decoding: the digitized speech is converted back to analog form for playback using a digital-to-analog converter (DAC)
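A minimal numpy sketch of this encode/decode round trip, assuming an ideal 8-bit ADC and DAC over a fixed ±1 range; the sampling rate and test tone are arbitrary stand-ins for real speech:

```python
import numpy as np

FS = 6_000      # samples per second
BITS = 8        # ADC/DAC resolution
LEVELS = 2 ** BITS

# Stand-in "speech": a 300 Hz tone with amplitude in [-1, 1]
t = np.arange(FS) / FS
speech = 0.8 * np.sin(2 * np.pi * 300 * t)

# Encode: quantize each sample to an 8-bit code (0..255)
codes = np.clip(((speech + 1) / 2 * (LEVELS - 1)).round(),
                0, LEVELS - 1).astype(np.uint8)

# Decode: map codes back to analog levels; a low-pass filter would
# then smooth the staircase output in a real system
reconstructed = codes.astype(float) / (LEVELS - 1) * 2 - 1

print("max quantization error:", np.max(np.abs(speech - reconstructed)))
```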
- A low-pass filter is connected to the DAC output to smooth out the steps in the synthesized waveform
- Time-domain encoding attempts to reduce the amount of memory required to store digitized speech. Some examples are
- Simple Pulse-Code Modulation (SPCM)
- Delta Modulation (DM)
- Differential Pulse-Code Modulation (DPCM)
- Adaptive Differential Pulse-Code Modulation (ADPCM)
Simple Pulse-Code Modulation (SPCM)
- Direct waveform digitization using an ADC
- Two things control the quality of the digitized speech
- Sampling rate: a higher sampling rate creates higher-quality output
- ADC resolution: a higher-bit converter creates higher-quality output
- To catch all the subtleties of the waveform, it must be sampled about 30,000 times per second
- If each sample were converted to an 8-bit digital code, one second of speech would require 8 × 30,000 = 240,000 bits of memory (a 240,000 bit-per-second data rate: not practical)
- To reduce the data rate, the sampling rate must be reduced
- Experimentation has shown that acceptable speech can be created using a sampling rate of at least two times the highest frequency component in the speech waveform
- Therefore 6000 (i.e. 3000 × 2) conversions per second is the minimum sampling rate required to produce acceptable speech (most speech falls in the 300 to 3000 Hz range)
- The ADC resolution determines the smallest analog increment that will be detected by the system. Acceptable speech can be synthesized using an 8-bit ADC, giving a data rate of 8 × 6,000 = 48,000 bps. Ten seconds of speech therefore needs 60,000 bytes of memory, or roughly 58.6 kB (see the sketch below)
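A small Python sketch reproducing these figures; the function name is my own:

```python
def pcm_memory_bytes(sample_rate: int, bits_per_sample: int, seconds: float) -> float:
    """Memory needed to store simple PCM speech, in bytes."""
    return sample_rate * bits_per_sample * seconds / 8

# Figures from the text: 8-bit samples at 6,000 samples/s for 10 s
b = pcm_memory_bytes(6_000, 8, 10)
print(f"{b:,.0f} bytes = {b / 1024:.1f} kB")             # 60,000 bytes = 58.6 kB

# The impractical full-detail case: 30,000 samples/s at 8 bits
print(f"{pcm_memory_bytes(30_000, 8, 1) * 8:,.0f} bps")  # 240,000 bps
```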
- Problem: SPCM requires too much memory to produce acceptable speech for any length of time
- The answer is to use one of
- Delta Modulation (DM)
- Differential Pulse-Code Modulation (DPCM)
- Adaptive Differential Pulse-Code Modulation (ADPCM)
Delta Modulation (DM)
- Only a single bit is stored for each sample of the speech waveform, rather than 8 or 12 bits
- The ADC still converts each sample to an 8- or 12-bit value, but that value is then compared to the previous sample value
- If the present value is greater than the last value, the computer stores a logic 1 bit; if less, a logic 0 is stored. Therefore, a single bit is used to represent each sample
- An integrator is used on the circuit output to convert the serial bit stream back to an analog waveform (a minimal sketch follows)
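A minimal numpy sketch of the DM idea, with an arbitrary fixed step size; this fixed step is exactly what causes the slope-overload problem described under the disadvantages below:

```python
import numpy as np

def dm_encode(samples: np.ndarray, step: float) -> np.ndarray:
    """Delta modulation: store 1 bit per sample (1 = up, 0 = down)."""
    bits = np.zeros(len(samples), dtype=np.uint8)
    estimate = 0.0
    for i, x in enumerate(samples):
        bits[i] = 1 if x > estimate else 0
        estimate += step if bits[i] else -step  # encoder tracks the input
    return bits

def dm_decode(bits: np.ndarray, step: float) -> np.ndarray:
    """The integrator: accumulate +/- one step per bit."""
    return np.cumsum(np.where(bits == 1, step, -step))

t = np.arange(32_000) / 32_000              # 32,000 samples/s for 1 second
speech = 0.5 * np.sin(2 * np.pi * 300 * t)  # stand-in speech tone
out = dm_decode(dm_encode(speech, step=0.05), step=0.05)
print("max tracking error:", np.max(np.abs(speech - out)))
```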
- Disadvantages
- The sampling rate must be high to catch all the details of the speech signal. E.g. a typical sampling rate of 32,000 samples per second translates to a 32,000 bps data rate, so 10 seconds of speech would require 39 kB of memory (roughly a one-third data reduction from 8-bit SPCM)
- Compliance error / slope overload
- Results when the speech waveform changes too rapidly for a given sampling rate. The resulting digitization does not truly represent the analog waveform and produces audible distortion in the output
- Can be solved by increasing the sampling rate (however, this results in an increased data rate and more memory)
Differential Pulse-Code Modulation (DPCM)
- Same as DM, but several bits are used to represent the actual difference between two successive samples, rather than a single bit
- Since the speech waveform contains many duplicated sounds and pauses, the change in amplitude from one sample to the next is relatively small compared to the absolute amplitude of a sample. As a result, fewer bits are required to store the difference value than the absolute sample value
- The difference between two successive samples can be represented with a 6- or 7-bit value: 1 sign bit (representing the slope of the input waveform) plus 5 or 6 bits for the difference value
- E.g. a 7-bit DPCM system using a sampling rate of 6,000 samples per second would require about 51 kB of memory for 10 seconds of speech
- 7 × 6,000 × 10 s = 420,000 bits
- 420,000 bits / 8 = 52,500 bytes
- 52,500 / 1024 ≈ 51.3 kB
- A bipolar DAC is used for playback to convert the successive difference values to a continuous analog waveform
- DPCM has the same slope-overload problem as DM, which can be overcome by increasing the sampling rate and bit rate (a minimal sketch follows)
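A minimal numpy sketch of the DPCM idea, assuming a 7-bit difference code (1 sign bit plus 6 magnitude bits) and an arbitrary quantization step chosen for illustration:

```python
import numpy as np

STEP = 1 / 64   # illustrative step; 6 magnitude bits -> diffs in [-63, +63]

def dpcm_encode(samples: np.ndarray) -> np.ndarray:
    """Store the quantized difference between successive samples."""
    diffs = np.empty(len(samples), dtype=np.int8)
    estimate = 0.0
    for i, x in enumerate(samples):
        d = int(round((x - estimate) / STEP))
        d = max(-63, min(63, d))   # clip to the 7-bit signed range
        diffs[i] = d
        estimate += d * STEP       # track what the decoder will reconstruct
    return diffs

def dpcm_decode(diffs: np.ndarray) -> np.ndarray:
    """The bipolar DAC accumulates the differences back into a waveform."""
    return np.cumsum(diffs.astype(float)) * STEP

t = np.arange(6_000) / 6_000
speech = 0.5 * np.sin(2 * np.pi * 300 * t)   # stand-in speech tone
decoded = dpcm_decode(dpcm_encode(speech))
print("max error:", np.max(np.abs(speech - decoded)))
```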
Adaptive Differential Pulse-Code Modulation (ADPCM)
- A variation of DPCM that eliminates the slope-overload problem
- Only 3 or 4 bits are required to represent each sample
- The waveform is sampled at 6000 samples per second with an 8- or 12-bit ADC. The computer then subtracts the current sample value from the previous one to get a differential value, as in DPCM. However, the differential value is then adjusted to compensate for slope using a quantization factor
- The quantization factor adjusts the differential value dynamically, according to the rate of change, or slope, of the input waveform. The adjusted differential value can then be represented using only 3 or 4 bits (a minimal sketch follows)
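A minimal numpy sketch of the adaptive idea. The adaptation rule here (widen the step when the code saturates, narrow it otherwise) and all constants are invented for illustration; real ADPCM codecs use standardized step-size tables:

```python
import numpy as np

BITS = 3
LIM = 2 ** (BITS - 1) - 1   # 3-bit codes span [-3, +3]

def adapt(step: float, code: int) -> float:
    """Widen the step on steep slopes, narrow it on shallow ones."""
    step *= 1.5 if abs(code) == LIM else 0.9
    return max(step, 1e-4)

def adpcm_encode(samples: np.ndarray) -> np.ndarray:
    codes = np.empty(len(samples), dtype=np.int8)
    estimate, step = 0.0, 0.01
    for i, x in enumerate(samples):
        c = max(-LIM, min(LIM, int(round((x - estimate) / step))))
        codes[i] = c
        estimate += c * step
        step = adapt(step, c)   # decoder applies the same rule in lockstep
    return codes

def adpcm_decode(codes: np.ndarray) -> np.ndarray:
    out = np.empty(len(codes))
    estimate, step = 0.0, 0.01
    for i, c in enumerate(codes):
        estimate += int(c) * step
        out[i] = estimate
        step = adapt(step, int(c))
    return out
```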
- In addition to requiring fewer bits, the sampling rate of ADPCM can be reduced (to 4000 samples per second), since slope overload is minimized
- E.g. 4000 samples per second with a 3-bit code results in a data rate of 3 × 4000 = 12,000 bps. Therefore 10 s of speech requires about 15 kB of memory (i.e. (12,000 / 8) × 10 s = 15,000 bytes)
- However, 8000 samples per second and a 4-bit code are more common, giving 39 kB of memory for 10 s of speech
- Disadvantage: needs sophisticated software