1
Speech Synthesis
  • A user-friendly machine must have complete
    voice-communication abilities
  • Voice communication involves
  • Speech synthesis
  • Speech recognition

2
Elements of Speech
  • Some popular electronic speech synthesizers are
    modeled directly on the human vocal tract
  • Other techniques work by putting together the
    various sounds required to produce speech
  • Therefore, it is important to understand the
    characteristics of speech sounds

3
Elements of Speech (cont)
4
Elements of Speech (cont)
  • The vocal system can be broken into
  • Lungs
  • Larynx
  • Vocal cavity
  • The lungs provide power to the system by forcing
    air up through the larynx and into the vocal
    cavity
  • The vocal cords, which are made up of layers of
    skin, create sound by flapping or vibrating as
    air passes through them
  • The vibrating action generates several resonant
    frequencies within the vocal cavity (which has
    several harmonic frequencies)

5
Elements of Speech (cont)
  • Different sounds are created by changing the
    shape of the vocal cavity with the throat,
    tongue, teeth and lips.
  • Engineers use a frequency analyzer called a sound
    spectrograph to study speech.
  • Spectrographs have shown that the range of most
    human speech is from 150 to 3600 Hz. This
    represents a frequency bandwidth of 3450 Hz.
  • A bass singer or soprano covers a bandwidth of
    about 15 kHz (from 10 Hz to 15 kHz)
  • The volume ratio for human speech is about 16,000
    to 1, from a shout to a whisper

6
Elements of Speech (cont)
  • However, we do not need to design a speech
    synthesizer that generates frequencies from 10 Hz
    to 15 kHz with a volume ratio of 16,000 to 1
  • It only needs to produce intelligible speech.
    E.g., the telephone system has a bandwidth of
    3000 Hz (from 300 to 3300 Hz) with a 1000:1
    volume ratio.

7
Electronic Speech Synthesis (ESS)
  • 2 techniques commonly used in ESS are
  • Natural speech analysis/synthesis
  • Artificial constructive/synthesis

Natural speech analysis/synthesis
  • Involves recording and subsequent playback of
    human speech
  • Can be analog or digital
  • Best choice for producing limited speech, as in
    vending machines, appliances and automobiles.

8
Electronic Speech Synthesis (ESS) (cont)
Natural speech analysis/synthesis (cont)
  • Involves two phases, analysis and synthesis
  • Analysis phase: human speech is analyzed, coded
    in digital form and stored
  • Synthesis phase: the digitized speech is recalled
    from memory and converted back to analog to
    re-create the original speech waveform

9
Electronic Speech Synthesis (ESS) (cont)
Natural speech analysis/synthesis (cont)
  • The digital analysis/synthesis method provides
    more flexibility than the analog method, since
    the stored words or phrases can be randomly
    accessed from computer memory.
  • However, the vocabulary size is limited by the
    amount of memory available.
  • For this reason, several different encoding
    techniques are used to analyze and compress the
    speech waveform; they attempt to discard
    unimportant parts so that fewer bits are
    required.

10
Electronic Speech Synthesis (ESS) (cont)
Natural speech analysis/synthesis (cont)
  • 2 types of digital analysis/synthesis
  • Time-domain analysis/synthesis
  • Speech waveform is digitized in the time domain
  • Analog → digital (using an ADC); the stored
    samples are then passed through a DAC to
    reproduce the speech.
  • E.g. Telephone directory assistance
  • Frequency-domain analysis/synthesis
  • Frequency spectrum of the analog waveform is
    analyzed and coded
  • Synthesis operation attempts to emulate the human
    vocal tract electronically by using stored
    frequency parameters obtained from the analysis
    (see the sketch after this list)
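
As an illustration of the frequency-domain idea, here is a
minimal Python sketch (an illustrative assumption, not the
slides' specific method): the analysis step extracts the
dominant frequency components of a short waveform with an
FFT, and the synthesis step rebuilds an approximation from
those stored parameters alone.

    import numpy as np

    fs = 8000                       # assumed sampling rate (Hz)
    t = np.arange(0, 0.05, 1 / fs)  # 50 ms analysis frame

    # Toy "speech" frame: two resonances, loosely imitating formants
    frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)

    # Analysis: take the spectrum and keep only the strongest components
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    top = np.argsort(np.abs(spectrum))[-2:]      # two dominant peaks
    params = [(freqs[i], 2 * np.abs(spectrum[i]) / len(frame)) for i in top]
    print("stored frequency parameters:", params)

    # Synthesis: rebuild an approximation from the stored parameters alone
    rebuilt = sum(a * np.sin(2 * np.pi * f * t) for f, a in params)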

11
Electronic Speech Synthesis (ESS) (cont)
Artificial Constructive/Synthesis
  • Created artificially by putting together the
    various sounds that are required to produce a
    given speech segment
  • The most popular technique is called phoneme
    speech synthesis
  • Phoneme: the smallest phonetic unit in a language
    that is capable of conveying a distinction in
    meaning, as the m of mat and the b of bat in
    English
  • Phonetic: representing the sounds of speech with
    a set of distinct symbols, each designating a
    single sound, as in phonetic spelling
  • E.g. fə-nĕ-tĭk sĭm-bəl = phonetic symbol

12
Electronic Speech Synthesis (ESS) (cont)
Artificial Constructive/Synthesis (cont)
  • Allophone: a predictable phonetic variant of a
    phoneme. For example, the aspirated t of top, the
    unaspirated t of stop, and the tt (pronounced as
    a flap) of batter are allophones of the English
    phoneme /t/.

13
Electronic Speech Synthesis (ESS) (cont)
Artificial Constructive/Synthesis (cont)
  • Phoneme and allophone sounds are coded and
    stored in memory
  • A software algorithm then connects the phonemes
    to produce a given word; words are strung
    together to produce phrases

14
Electronic Speech Synthesis (ESS) (cont)
Artificial Constructive/Synthesis (cont)
  • There is also software consisting of a set of
    production rules used to translate written text
    into the appropriate allophone codes: text to
    speech
  • With the phoneme technique, a computer can
    produce an unlimited vocabulary using a minimal
    amount of memory (see the sketch after this
    list)
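
A minimal sketch of phoneme concatenation in Python. The
lexicon and sound units here are hypothetical stand-ins for
illustration (a real system stores coded phoneme/allophone
waveforms and smooths the joins between them):

    # Hypothetical lexicon: word -> phoneme codes (for illustration)
    LEXICON = {
        "mat": ["m", "ae", "t"],
        "bat": ["b", "ae", "t"],
    }

    # Stand-ins for stored phoneme sounds; real systems keep coded audio
    SOUNDS = {"m": b"m-audio", "b": b"b-audio",
              "ae": b"ae-audio", "t": b"t-audio"}

    def synthesize(phrase: str) -> bytes:
        """Concatenate stored phoneme units for each word in the phrase."""
        units = []
        for word in phrase.split():
            for phoneme in LEXICON[word]:   # text -> phonemes
                units.append(SOUNDS[phoneme])
        return b"".join(units)              # string sounds into speech

    print(synthesize("mat bat"))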

15
Electronic Speech Synthesis (ESS) (cont)
Summary of the various methods that are used for
electronic speech synthesis (ESS)
16
Time-Domain Analysis/Synthesis and Waveform
Digitization
17
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
  • Any method that converts the amplitude variations
    of speech to digital code for subsequent playback
    can be considered time-domain speech
    analysis/synthesis
  • Time-domain speech analysis/synthesis involves 2
    operations
  • Encoding: the human speech waveform is digitized
    and stored using an analog-to-digital converter
    (ADC)
  • Decoding: the digitized speech is converted back
    to analog form for playback using a
    digital-to-analog converter (DAC); see the
    sketch after this list
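
A minimal sketch of the two operations in Python, simulating
the ADC/DAC pair with simple 8-bit quantization (real systems
use converter hardware; the signal here is a stand-in):

    import numpy as np

    fs = 8000                                   # assumed sampling rate (Hz)
    t = np.arange(0, 0.01, 1 / fs)
    speech = 0.8 * np.sin(2 * np.pi * 440 * t)  # stand-in speech waveform

    # Encoding: the "ADC" quantizes each sample to an 8-bit code (0..255)
    codes = np.round((speech + 1.0) * 127.5).astype(np.uint8)

    # Decoding: the "DAC" maps stored codes back to analog-style levels
    playback = codes.astype(float) / 127.5 - 1.0

    print("max reconstruction error:", np.max(np.abs(playback - speech)))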

18
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
  • A low-pass filter is connected to the DAC output
    to smooth out the steps in the synthesized
    waveform
  • Time-domain encodings attempt to reduce the
    amount of memory required to store digitized
    speech. Some examples are
  • Simple Pulse-Code Modulation (PCM)
  • Delta Modulation (DM)
  • Differential Pulse-Code Modulation (DPCM)
  • Adaptive Differential Pulse-Code Modulation
    (ADPCM)

19
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Simple Pulse-Code Modulation (PCM)
  • Direct waveform digitization using an ADC
  • Things that control the quality of the digitized
    speech
  • Sampling rate: a higher sampling rate creates
    higher-quality output
  • ADC resolution: a higher-bit converter creates
    higher-quality output
  • To catch all the subtleties of the waveform, it
    must be sampled about 30,000 times per second
  • If each sample were converted to an 8-bit digital
    code, the data rate would be 8 × 30,000 =
    240,000 bits per second, which is not practical

20
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Simple Pulse-Code Modulation (SPCM) (cont)
  • To reduce the data rate, the sampling rate must
    be reduced
  • Experimentation has shown that acceptable speech
    can be created using a sampling rate of at least
    two times the highest frequency component in the
    speech waveform.
  • Therefore 6,000 (2 × 3,000) conversions per
    second is the minimum sampling rate required to
    produce acceptable speech (most speech falls in
    the 300 to 3000 Hz range).
  • The ADC resolution determines the smallest analog
    increment that will be detected by the system.
    Acceptable speech can be synthesized using an
    8-bit ADC (8 × 6,000 = 48,000 bps data rate).
    Therefore 10 seconds of speech needs 60,000
    bytes of memory, roughly 58.6 kB (see the worked
    numbers after this list)
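
The memory arithmetic from this slide, worked through as a
short Python sketch:

    sample_rate = 6_000     # samples per second (2 x 3,000 Hz)
    bits_per_sample = 8     # 8-bit ADC
    seconds = 10

    data_rate = sample_rate * bits_per_sample     # 48,000 bps
    total_bytes = data_rate * seconds // 8        # 60,000 bytes
    print(data_rate, total_bytes, total_bytes / 1024)   # ~58.6 kB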

21
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
22
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Simple Pulse-Code Modulation (SPCM) (cont)
  • Problem: requires too much memory to produce
    acceptable speech for any length of time.
  • The answer is to use
  • Delta Modulation (DM)
  • Differential Pulse Code Modulation (DPCM)
  • Adaptive Differential Pulse Code Modulation
    (ADPCM)

23
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Delta Modulation (DM)
  • Only a single bit is stored for each sample of
    the speech waveform, rather than 8 or 12 bits.
  • The ADC still converts each sample to an 8- or
    12-bit value, but that value is then compared
    with the last sample value.
  • If the present value is greater than the last
    value, the computer stores a logic 1; if less, a
    logic 0 is stored. Therefore, a single bit is
    used to represent each sample
  • An integrator is used on the circuit output to
    convert the serial bit stream back to an analog
    waveform (see the sketch after this list)
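
A minimal delta-modulation sketch in Python, using the common
textbook formulation that compares each sample against a
running estimate and steps up or down by a fixed amount (the
step size and test signal are illustrative assumptions):

    import numpy as np

    def dm_encode(samples, step=0.05):
        """Store 1 bit per sample: up/down versus the running estimate."""
        bits, estimate = [], 0.0
        for s in samples:
            bit = 1 if s > estimate else 0
            estimate += step if bit else -step   # track waveform in steps
            bits.append(bit)
        return bits

    def dm_decode(bits, step=0.05):
        """The 'integrator': accumulate fixed steps to rebuild the wave."""
        out, estimate = [], 0.0
        for bit in bits:
            estimate += step if bit else -step
            out.append(estimate)
        return np.array(out)

    t = np.arange(0, 0.01, 1 / 32_000)   # high sampling rate, per the slide
    speech = 0.5 * np.sin(2 * np.pi * 300 * t)
    rebuilt = dm_decode(dm_encode(speech))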

24
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Delta Modulation (DM) (cont)
  • Disadvantages
  • The sampling rate must be high to catch all the
    details of the speech signal. E.g. a typical
    sampling rate of 32,000 samples per second
    translates to a 32,000 bps data rate; thus 10
    seconds of speech would require about 39 kB of
    memory (down from 58.6 kB for 8-bit SPCM, roughly
    a one-third data reduction)
  • Compliance Error/Slope Overload
  • Results when the speech waveform changes too
    rapidly for the given sampling rate. The
    resulting digitization does not truly represent
    the analog waveform and produces audible
    distortion in the output.
  • Can be solved by increasing the sampling rate
    (however, this will result in an increased data
    rate and more memory)

25
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Delta Modulation (DM) (cont)
26
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Differential Pulse-Code Modulation (DPCM)
  • Same as DM, but several bits are used to
    represent the actual difference between two
    successive samples rather than a single bit.
  • Since the speech waveform contains many repeated
    sounds and pauses, the change in amplitude from
    one sample to the next is relatively small
    compared with the actual amplitude of a sample.
    As a result, fewer bits are required to store
    the difference value than the absolute sample
    value
  • The difference between two successive samples can
    be represented with a 6- or 7-bit value: 1 sign
    bit (representing the slope of the input
    waveform) plus 5 or 6 bits for the difference
    value (see the sketch after this list)
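
A minimal DPCM sketch in Python (the bit width and test
signal are illustrative assumptions; a real decoder would
drive a bipolar DAC with the difference values):

    import numpy as np

    def dpcm_encode(samples, bits=7):
        """Store the clamped difference between successive samples."""
        max_diff = 2 ** (bits - 1) - 1    # 1 sign bit + difference bits
        codes, previous = [], 0
        for s in samples:
            diff = int(np.clip(s - previous, -max_diff, max_diff))
            codes.append(diff)            # signed difference, few bits
            previous += diff              # decoder-side running value
        return codes

    def dpcm_decode(codes):
        """Accumulate the differences back into sample values."""
        out, value = [], 0
        for diff in codes:
            value += diff
            out.append(value)
        return np.array(out)

    t = np.arange(0, 0.01, 1 / 6_000)
    speech = np.round(100 * np.sin(2 * np.pi * 300 * t)).astype(int)
    rebuilt = dpcm_decode(dpcm_encode(speech))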

27
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Differential Pulse-Code Modulation (DPCM) (cont)
  • E.g. a 7-bit DPCM system using a sampling rate of
    6,000 samples per second would require about 51
    kB of memory for 10 seconds of speech
  • 7 × 6,000 × 10 s = 420,000 bits
  • 420,000 bits / 8 = 52,500 bytes
  • 52,500 / 1024 ≈ 51.3 kB
  • A bipolar DAC is used for playback to convert the
    successive difference values to a continuous
    analog waveform
  • Has the same slope-overload problem as DM, which
    is overcome by increasing the sampling rate and
    bit rate.

28
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Differential Pulse-Code Modulation (DPCM) (cont)
29
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Adaptive Differential Pulse-Code Modulation
(ADPCM)
  • A variation of DPCM that eliminates the
    slope-overload problem
  • Only 3 or 4 bits are required to represent each
    sample
  • The waveform is sampled at 6,000 samples per
    second with an 8- or 12-bit ADC. The computer
    then subtracts the previous sample value from
    the current one to get a differential value, as
    in DPCM. However, the differential value is then
    adjusted to compensate for slope using a
    quantization factor.
  • The quantization factor adjusts the differential
    value dynamically, according to the rate of
    change or slope of the input waveform. The
    adjusted differential value can then be
    represented using only 3 or 4 bits (see the
    sketch after this list).
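
A minimal ADPCM-style sketch in Python. The step-adaptation
rule here (grow the quantization factor when the code
saturates, shrink it otherwise) is a simple illustrative
choice, not a particular standard:

    import numpy as np

    def adpcm_encode(samples, bits=3):
        """Quantize each difference with an adaptive step size."""
        levels = 2 ** (bits - 1) - 1        # 3 bits -> codes -3..+3
        codes, previous, step = [], 0.0, 1.0
        for s in samples:
            code = int(np.clip(round((s - previous) / step),
                               -levels, levels))
            codes.append(code)
            previous += code * step         # decoder-side reconstruction
            # Adapt the quantization factor to the waveform's slope:
            step = step * 1.5 if abs(code) == levels else max(step / 1.5, 1.0)
        return codes

    def adpcm_decode(codes, bits=3):
        """Mirror the encoder's step adaptation while accumulating."""
        levels = 2 ** (bits - 1) - 1
        out, value, step = [], 0.0, 1.0
        for code in codes:
            value += code * step
            out.append(value)
            step = step * 1.5 if abs(code) == levels else max(step / 1.5, 1.0)
        return np.array(out)

    t = np.arange(0, 0.01, 1 / 6_000)
    speech = 100 * np.sin(2 * np.pi * 300 * t)
    rebuilt = adpcm_decode(adpcm_encode(speech))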

30
Time-Domain Analysis/Synthesis and Waveform
Digitization (cont)
Adaptive Differential Pulse-Code Modulation
(ADPCM) (cont)
  • In addition to requiring fewer bits, the sampling
    rate of ADPCM can be reduced (to 4,000 samples
    per second), since slope overload is minimized
  • E.g. 4,000 samples per second with a 3-bit code
    results in a data rate of 12,000 bps (3 × 4,000).
    Therefore 10 s of speech requires about 15 kB of
    memory: (12,000 / 8) × 10 s = 15,000 bytes
  • However, 8,000 samples per second with a 4-bit
    code is more common → about 39 kB of memory for
    10 s of speech
  • Disadvantage: needs sophisticated software (a
    summary sketch follows)
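
To close, a short Python sketch tabulating the 10-second
memory figures quoted across these slides:

    # Memory for 10 s of speech, using the slides' own figures
    SCHEMES = {                      # (samples/s, bits per sample)
        "SPCM (8-bit)":  (6_000, 8),
        "DM":            (32_000, 1),
        "DPCM (7-bit)":  (6_000, 7),
        "ADPCM (4-bit)": (8_000, 4),
    }

    for name, (rate, bits) in SCHEMES.items():
        kb = rate * bits * 10 / 8 / 1024
        print(f"{name:14s} {rate * bits:6d} bps  ~{kb:5.1f} kB")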