Speech in Multimedia - PowerPoint PPT Presentation

Slides: 26

Provided by: Hao65

1
Speech in Multimedia

Hao Jiang
Computer Science Department
Boston College
Oct. 9, 2007

2
Outline

Introduction
Topics in speech processing
Speech coding
Speech recognition
Speech synthesis
Speaker verification/recognition
Conclusion

3
Introduction

Speech is our basic communication tool.
We have been hoping to be able to communicate
with machines using speech.

C3PO and R2D2
4
Speech Production Model
Anatomy Structure
Mechanical Model
5
Characteristics of Digital Speech
Waveform
Speech
Spectrogram
6
Voiced and Unvoiced Speech
Silence
unvoiced
voiced
7
Short-time Parameters
Short time power
Waveform envelope
8
Zero crossing rate
Pitch period
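
The short-time parameters named on these two slides can be sketched in a few lines. This is only an illustration: the 8 kHz sampling rate, the frame length, and the toy signals are assumptions, not values taken from the slides.

```python
import math

def frames(signal, frame_len, hop):
    """Split a list of samples into frames of frame_len, advancing by hop."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy signals (assumed): a 100 Hz sine stands in for voiced speech,
# a low-amplitude alternating sequence stands in for unvoiced speech.
fs = 8000
voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(160)]
unvoiced = [0.1 * (-1) ** n for n in range(160)]

for name, sig in (("voiced", voiced), ("unvoiced", unvoiced)):
    print(name, short_time_power(sig), zero_crossing_rate(sig))
```

In this toy case the voiced-like signal has high power and a low zero crossing rate, and the unvoiced-like signal has the opposite pattern, which is why the two parameters together are useful for voiced/unvoiced/silence decisions.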
9
Speech Coding

Like images, speech can be compressed to make it
smaller and easier to store and transmit.
General compression methods such as DPCM can also
be used.
More compression can be achieved by taking
advantage of the speech production model.
There are two classes of speech coders:
Waveform coder
Vocoder
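
As an illustration of the general-purpose approach mentioned above, here is a minimal DPCM sketch. It assumes the simplest possible predictor (the previously reconstructed sample) and a uniform quantizer; the step size and sample values are made up for the example.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the prediction."""
    codes = []
    prediction = 0
    for s in samples:
        code = round((s - prediction) / step)  # quantized prediction error
        codes.append(code)
        prediction += code * step              # mirror the decoder's state
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out = []
    prediction = 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 12, 14, 13, 9, 4]   # hypothetical sample values
codes = dpcm_encode(samples, step=2)
decoded = dpcm_decode(codes, step=2)
print(codes)
print(decoded)
```

The encoder predicts from the reconstructed signal rather than the original one, so the quantization error stays within half a step and does not accumulate.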

10
LPC Speech Coder
Vocal tract Parameter
Quantizer
speech
Pitch
Speech buffer
Speech Analysis
Code generation
Code stream
Voiced/ unvoiced
Energy Parameter
Frame n+1
Frame n
11
LPC and Vocal Tract

Mathematically, speech can be modeled by the
following generation model:

$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$

$a_1, a_2, \ldots, a_k$ are called the Linear Prediction
Coefficients (LPC), which can be used to model
the shape of the vocal tract.
$e(n)$ is the excitation that generates the speech.
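
A small sketch of this model in both directions: running the generation model forward, and estimating the coefficients from a signal. The estimation uses the autocorrelation method with the Levinson-Durbin recursion, which is a common choice but an assumption here, since the slides do not say which method is used. The coefficient values and the Gaussian excitation are made up.

```python
import random

def synthesize(a, excitation):
    """Run the generation model x(n) = sum_p a_p x(n-p) + e(n)."""
    x = []
    for n, e in enumerate(excitation):
        past = sum(a[p - 1] * x[n - p]
                   for p in range(1, len(a) + 1) if n - p >= 0)
        x.append(past + e)
    return x

def autocorrelation(x, max_lag):
    """r[lag] = sum_n x(n) x(n - lag) for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a_1..a_order from autocorrelations r[0..order]."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                       # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

random.seed(0)
true_a = [1.3, -0.6]                 # hypothetical stable 2nd-order model
excitation = [random.gauss(0.0, 1.0) for _ in range(5000)]
x = synthesize(true_a, excitation)
est_a, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
print(est_a)                         # should be close to true_a
```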
12
Decoding and Speech Synthesis
Pitch Period
Impulse Train Generator
Glottal Pulse Generator
Gain
Vocal Tract Model
Radiation Model
speech
Random Noise Generator
U/V
13
An Example for Synthesizing Speech
Glottal Pulse
Go through vocal tract filter with gain control
Blending region
Go through radiation filter
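
The decoder in the last two slides can be sketched roughly as follows. This is a simplified illustration: it keeps the impulse train, the random noise generator, the voiced/unvoiced switch, the gain, and an all-pole vocal tract filter, but leaves out the glottal pulse and radiation models. All parameter values are hypothetical.

```python
import random

def impulse_train(n_samples, pitch_period):
    """One unit impulse every pitch_period samples (voiced excitation)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, rng):
    """Uniform random noise (unvoiced excitation)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a, gain):
    """Vocal tract model: y(n) = gain * e(n) + sum_p a_p y(n-p)."""
    out = []
    for n, e in enumerate(excitation):
        past = sum(a[p - 1] * out[n - p]
                   for p in range(1, len(a) + 1) if n - p >= 0)
        out.append(gain * e + past)
    return out

def decode_frame(voiced, pitch_period, a, gain, n_samples, rng):
    """Pick the excitation by the voiced/unvoiced flag, then filter it."""
    if voiced:
        excitation = impulse_train(n_samples, pitch_period)
    else:
        excitation = white_noise(n_samples, rng)
    return all_pole_filter(excitation, a, gain)

rng = random.Random(0)
a = [1.3, -0.6]   # hypothetical LPC coefficients for one frame
voiced_frame = decode_frame(True, 80, a, 0.5, 180, rng)
unvoiced_frame = decode_frame(False, 80, a, 0.5, 180, rng)
print(voiced_frame[:3])
```

A real decoder would also carry the filter state from one frame to the next; this sketch restarts the filter for every frame to stay short.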
14
LPC10 (FS1015)

LPC10 was the DoD speech coding standard for
voice communication at 2.4 kbps.
LPC10 works on speech sampled at 8 kHz, using a
22.5 ms frame and 10 LPC coefficients.
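
The frame arithmetic behind these numbers is easy to check, assuming the input is sampled at 8 kHz (the usual telephone-band rate). The 64 kbps PCM figure used for comparison (8 bits per sample at 8 kHz) is an assumption added for illustration, not something stated on the slide.

```python
sample_rate_hz = 8000     # assumed telephone-band sampling rate
frame_ms = 22.5           # LPC10 frame length
bit_rate_bps = 2400       # LPC10 bit rate

samples_per_frame = sample_rate_hz * frame_ms / 1000
bits_per_frame = bit_rate_bps * frame_ms / 1000
frames_per_second = 1000 / frame_ms

# Comparison point (assumption): 8-bit PCM at the same sampling rate.
pcm_bit_rate_bps = sample_rate_hz * 8
compression_ratio = pcm_bit_rate_bps / bit_rate_bps

print(samples_per_frame, bits_per_frame)
print(round(frames_per_second, 2), round(compression_ratio, 1))
```

Each 180-sample frame is therefore described by only 54 bits, which must cover the 10 LPC coefficients, the pitch, the voiced/unvoiced decision, and the energy.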

Original Speech
LPC Decoded Speech
15
Mixed Excitation LP

For real speech, the excitation is usually not
a pure pulse train or pure noise but a mixture.
The new 2.4 kbps standard (MELP) addresses this
problem.

Gain
Bandpass filter
w
pulses
Vocal Tract Model
Radiation Model
speech

Bandpass filter
noise
1-w
Original Speech
MELP Decoded Speech
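
The mixing idea in the diagram (weight w on the pulses, 1-w on the noise) can be sketched as below. This is a deliberately reduced illustration, not the MELP algorithm: it uses a single mixing weight for the whole band and omits the bandpass filters shown on the slide. The function name and parameter values are hypothetical.

```python
import random

def mixed_excitation(n_samples, pitch_period, w, rng):
    """Blend a pulse train and noise: w * pulses + (1 - w) * noise."""
    out = []
    for n in range(n_samples):
        pulse = 1.0 if n % pitch_period == 0 else 0.0
        noise = rng.uniform(-1.0, 1.0)
        out.append(w * pulse + (1.0 - w) * noise)
    return out

rng = random.Random(0)
pure_pulses = mixed_excitation(10, 5, 1.0, rng)   # w = 1: pulses only
pure_noise = mixed_excitation(10, 5, 0.0, rng)    # w = 0: noise only
mixture = mixed_excitation(10, 5, 0.7, rng)       # mostly voiced
print(pure_pulses)
```

Applying a separate weight in each frequency band, as the bandpass filters in the diagram suggest, lets the low band stay mostly periodic while the high band is mostly noisy.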
16
Hybrid Speech Codecs

For higher bit rates, hybrid speech codecs have
an advantage over vocoders.
FS1016: CELP (Code Excited Linear Prediction).
G.723.1: a dual-rate codec (5.3 kbps and
6.3 kbps) for multimedia communication over the
Internet.
G.729: a CELP-based codec at 8 kbps.

code
speech
perceptual comparison
Model parameter generation
Speech synthesis
Analysis by Synthesis
Sound at 5.3kbps
Sound at 6.3kbps
Sound at 8kbps
17
Speech Recognition

Speech recognition is the foundation of
human-computer interaction using speech.
Speech recognition operates in different contexts:
Speaker-dependent or speaker-independent.
Discrete words or continuous speech.
Small vocabulary or large vocabulary.
Quiet environment or noisy environment.

Reference patterns
speech
Comparison and decision algorithm
Parameter analyzer
Words
Language model
18
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for
example, the power distribution).
19
Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we match the word when there are timing
and other variations?
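
To make the timing problem concrete, the per-frame labels above can be collapsed back to the phoneme sequence together with how many frames each phoneme lasted. This only illustrates the variation in duration; it assumes the frame labels are already known, which is exactly what a recognizer does not have.

```python
from itertools import groupby

frame_labels = "g g r ey ey ey ey w ey ey l l z".split()

# Collapse runs of identical labels into one phoneme each.
phonemes = [label for label, _ in groupby(frame_labels)]
# Number of frames spent in each phoneme.
durations = [len(list(run)) for _, run in groupby(frame_labels)]

print(phonemes)
print(durations)
```

Another utterance of the same words would give the same phoneme list but different durations, so a matcher has to be insensitive to how long each phoneme is held.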
20
Hidden Markov Model
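[Diagram: a three-state HMM with states S1, S2, S3; transitions are labeled with probabilities such as P12, and each state emits symbols a, b, c, ...]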
21
Dynamic Programming in Decoding
time
states
We can find the path through the states that
corresponds to the most probable phoneme sequence
for generating the observed feature sequence
(one feature vector extracted from each speech frame).
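
The dynamic programming search described here is commonly implemented as the Viterbi algorithm. Below is a minimal sketch for a discrete HMM. The three states, the symbols a, b, c, and all probabilities are made-up illustration values; a real recognizer would use phoneme states, continuous feature vectors, and log probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable state sequence."""
    # best[t][s] = (prob. of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r)
                             for r in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for t in range(len(best) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return prob, path

states = ["S1", "S2", "S3"]
start_p = {"S1": 1.0, "S2": 0.0, "S3": 0.0}
trans_p = {   # left-to-right model: stay or move to the next state
    "S1": {"S1": 0.5, "S2": 0.5, "S3": 0.0},
    "S2": {"S1": 0.0, "S2": 0.5, "S3": 0.5},
    "S3": {"S1": 0.0, "S2": 0.0, "S3": 1.0},
}
emit_p = {
    "S1": {"a": 0.8, "b": 0.1, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.8, "c": 0.1},
    "S3": {"a": 0.1, "b": 0.1, "c": 0.8},
}

prob, path = viterbi(["a", "a", "b", "c", "c"],
                     states, start_p, trans_p, emit_p)
print(path, prob)
```

Each column of the table corresponds to one time step and each entry to one state, matching the time/states grid on the slide; the cost is proportional to the number of frames times the square of the number of states.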
22
HMM for a Unigram Language Model
[Diagram: from a start state s0, transitions with probabilities p1, p2, p3 lead to the word models HMM1 (word1), HMM2 (word2), and HMM3 (wordn).]
23
Speech Synthesis

Speech synthesis generates (arbitrary)
speech with desired properties (pitch, speed,
loudness, articulation mode, etc.).
Speech synthesis is widely used in
text-to-speech systems and various telephone
services.
The easiest and most often used speech synthesis
method is waveform concatenation.
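
A minimal sketch of waveform concatenation, assuming the recorded units are joined with a short linear crossfade so the seam is not audible as a click. The crossfade, the function name, and the sample values are assumptions for illustration; the slides do not specify how units are joined.

```python
def concatenate_with_crossfade(first, second, overlap):
    """Join two waveforms, linearly blending `overlap` samples at the seam."""
    if overlap == 0:
        return first + second
    if overlap > min(len(first), len(second)):
        raise ValueError("overlap is longer than one of the waveforms")
    head = first[:-overlap]
    blend = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)          # weight moves from first to second
        blend.append((1.0 - w) * first[len(first) - overlap + i]
                     + w * second[i])
    return head + blend + second[overlap:]

unit_a = [1.0] * 5    # hypothetical recorded unit
unit_b = [0.0] * 5    # hypothetical recorded unit
joined = concatenate_with_crossfade(unit_a, unit_b, overlap=3)
print(joined)
```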

Increase the pitch without changing the speed
24
Speaker Recognition

Identifying or verifying the identity of a
speaker is an application where computers can
outperform humans.
Vocal tract parameters can be used as features
for speaker recognition.

Speaker one
Speaker two
LPC covariance feature
25
Applications
Speech recognition
Call routing
Document input
Operator Services
Voice Commands
Directory Assistance
Speaker recognition
Speech Coding
Voice over Internet
Fraud Control
Wireless Telephone
Document Correction
Personalized service
Speech Interface
Text-to-Speech synthesis
