Title: Speech in Multimedia
1. Speech in Multimedia
- Hao Jiang
- Computer Science Department
- Boston College
- Oct. 9, 2007
2. Outline
- Introduction
- Topics in speech processing
- Speech coding
- Speech recognition
- Speech synthesis
- Speaker verification/recognition
- Conclusion
3. Introduction
- Speech is our basic communication tool.
- We have been hoping to be able to communicate
with machines using speech.
C3PO and R2D2
4. Speech Production Model
Anatomical Structure
Mechanical Model
5. Characteristics of Digital Speech
Waveform
Speech
Spectrogram
6. Voiced and Unvoiced Speech
Silence
unvoiced
voiced
7. Short-time Parameters
Short-time power
Waveform envelope
8. Zero crossing rate
Pitch period
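The short-time parameters named on these slides can be computed frame by frame. The following is a minimal sketch, assuming 20 ms frames (160 samples at 8 kHz) with 50% overlap; the function names and the synthetic tone and noise signals are illustrative stand-ins, not taken from the slides.

```python
import math
import random

def frames(signal, frame_len, hop):
    """Split a sequence into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

fs = 8000  # assumed sampling rate in Hz
# Stand-in for voiced speech: a 100 Hz tone. Stand-in for unvoiced speech: weak noise.
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 0.1) for _ in range(800)]

tone_power = [short_time_power(f) for f in frames(tone, 160, 80)]
tone_zcr = [zero_crossing_rate(f) for f in frames(tone, 160, 80)]
noise_power = [short_time_power(f) for f in frames(noise, 160, 80)]
noise_zcr = [zero_crossing_rate(f) for f in frames(noise, 160, 80)]
```

In this toy setup the tone has high power and a low zero crossing rate, while the noise has low power and a high zero crossing rate. That contrast is the kind of cue a voiced/unvoiced decision can use.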
9. Speech Coding
- As with images, we can compress speech to make it smaller and easier to store and transmit.
- General compression methods such as DPCM can also be used.
- More compression can be achieved by taking advantage of the speech production model.
- There are two classes of speech coders:
- Waveform coder
- Vocoder
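To illustrate the general-purpose route mentioned above, here is a minimal DPCM sketch. It assumes the simplest possible predictor (the previous reconstructed sample) and a uniform quantizer with a hypothetical step size; real coders use better predictors and adaptive quantizers.

```python
import math

def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes, prediction = [], 0.0
    for s in samples:
        code = round((s - prediction) / step)
        codes.append(code)
        prediction += code * step  # mirror what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

# Illustrative input: a 100 Hz tone with amplitude 1000, sampled at 8 kHz.
signal = [1000.0 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(400)]
step = 16.0
codes = dpcm_encode(signal, step)
decoded = dpcm_decode(codes, step)
```

Because neighbouring samples are similar, the difference codes stay much smaller than the samples themselves and so need fewer bits, while the reconstruction error never exceeds half a quantizer step.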
10. LPC Speech Coder
Vocal tract parameter
Quantizer
speech
Pitch
Speech buffer
Speech Analysis
Code generation
Code stream
Voiced/ unvoiced
Energy Parameter
Frame n+1
Frame n
11. LPC and Vocal Tract
- Mathematically, speech can be modeled by the following generation model:
$$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$
- $a_1, a_2, \ldots, a_k$ are called Linear Prediction Coefficients (LPC); they model the shape of the vocal tract.
- $e(n)$ is the excitation that generates the speech.
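The slides do not say how the coefficients are estimated. One standard approach, used here as an assumption, is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below checks itself on a synthetic signal generated with known coefficients; all names are illustrative.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag), for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a_1..a_order in x(n) ~ sum_p a_p x(n-p). Assumes r[0] > 0."""
    a = [0.0] * (order + 1)  # a[0] is unused so indices match the formula
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)  # remaining prediction error energy
    return a[1:], error

# Synthetic check: x(n) = 1.3 x(n-1) - 0.6 x(n-2) + e(n), with e(n) white noise.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
coeffs, residual_energy = levinson_durbin(autocorrelation(x, 2), 2)
```

On this synthetic signal the estimated coefficients should land close to 1.3 and -0.6. For real speech the analysis would be run per frame, typically after windowing.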
12. Decoding and Speech Synthesis
Pitch Period
Impulse Train Generator
Glottal Pulse Generator
Gain
Vocal Tract Model
Radiation Model
speech
Random Noise Generator
U/V
13. An Example for Synthesizing Speech
Glottal Pulse
Go through vocal tract filter with gain control
Blending region
Go through radiation filter
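The decoder structure above can be sketched in a few lines. This is a simplified illustration, not the full chain: it assumes a bare impulse train for voiced frames and white noise for unvoiced frames, and it leaves out the glottal pulse shaping, the radiation filter, and the blending between frames. Function names and parameter values are hypothetical.

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(e, a, gain):
    """Vocal tract model: x(n) = sum_p a[p-1] * x(n-p) + gain * e(n)."""
    x = []
    for n, en in enumerate(e):
        past = sum(a[p - 1] * x[n - p]
                   for p in range(1, len(a) + 1) if n - p >= 0)
        x.append(past + gain * en)
    return x

rng = random.Random(0)
# Toy parameters: one coefficient, pitch period of 4 samples, gain of 2.
voiced_frame = all_pole_filter(excitation(8, True, 4, rng), [0.5], 2.0)
unvoiced_frame = all_pole_filter(excitation(8, False, 4, rng), [0.5], 2.0)
```

With these toy values each pitch pulse starts a decaying response (2, 1, 0.5, 0.25), and the next pulse adds on top of what is left of the previous one.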
14. LPC10 (FS1015)
- LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps.
- LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients.
Original Speech
LPC Decoded Speech
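The LPC10 figures imply a tight bit budget per frame. A short arithmetic check, taking the sampling rate as 8 kHz:

```python
SAMPLE_RATE_HZ = 8000  # sampling rate of the input speech
FRAME_MS = 22.5        # frame length
BIT_RATE_BPS = 2400    # 2.4 kbps

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS / 1000
bits_per_frame = BIT_RATE_BPS * FRAME_MS / 1000
frames_per_second = 1000 / FRAME_MS

print(samples_per_frame, bits_per_frame, round(frames_per_second, 1))
# prints: 180.0 54.0 44.4
```

So each frame of 180 samples is described by only 54 bits, which must hold the 10 LPC coefficients together with the pitch, voicing, and energy information.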
15. Mixed Excitation LP
- For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two.
- The newer 2.4 kbps standard (MELP) addresses this problem.
Gain
Bandpass filter
w
pulses
Vocal Tract Model
Radiation Model
speech
Bandpass filter
noise
1-w
Original Speech
MELP Decoded Speech
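The mixing idea in the diagram (weight w on the pulses, 1-w on the noise) can be sketched as follows. This is a single-band simplification: the bandpass filters shown in the diagram are omitted, and the function name and parameter values are hypothetical.

```python
import random

def mixed_excitation(n_samples, pitch_period, w, rng):
    """Blend a pulse train (weight w) with white noise (weight 1 - w)."""
    pulses = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return [w * p + (1.0 - w) * z for p, z in zip(pulses, noise)]

rng = random.Random(0)
mostly_voiced = mixed_excitation(160, 40, 0.8, rng)  # strong pulses, a little noise
pure_pulses = mixed_excitation(160, 40, 1.0, rng)    # w = 1 gives the plain pulse train
```

Setting w to 1 or 0 recovers the two pure excitations of the basic LPC decoder; values in between give the mixture.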
16. Hybrid Speech Codecs
- At higher bit rates, hybrid speech codecs have advantages over vocoders.
- FS1016: CELP (Code Excited Linear Prediction).
- G.723.1: a dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet.
- G.729: a CELP-based codec at 8 kbps.
code
speech
perceptual comparison
Model parameter generation
Speech synthesis
Analysis by Synthesis
Sound at 5.3kbps
Sound at 6.3kbps
Sound at 8kbps
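The analysis-by-synthesis loop in the diagram can be illustrated with a toy codebook search: each candidate code is synthesized through the vocal tract filter and compared against the target speech, and the best match is kept. This sketch makes two simplifying assumptions: plain squared error stands in for the perceptual comparison, and the codebook and filter coefficient are invented for the example.

```python
def synthesize(excitation, a):
    """All-pole synthesis: x(n) = sum_p a[p-1] * x(n-p) + excitation(n)."""
    x = []
    for n, en in enumerate(excitation):
        past = sum(a[p - 1] * x[n - p]
                   for p in range(1, len(a) + 1) if n - p >= 0)
        x.append(past + en)
    return x

def search_codebook(target, codebook, a):
    """Return (index, gain) of the code whose scaled synthesis is closest to target."""
    best_index, best_gain, best_err = None, 0.0, float("inf")
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero candidate cannot match anything
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain

# Toy example: the target is code 2 synthesized and scaled by 3.
a = [0.5]
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
target = [3.0 * v for v in synthesize(codebook[2], a)]
index, gain = search_codebook(target, codebook, a)
```

Only the winning index and gain need to be transmitted, because the decoder holds the same codebook and can rerun the synthesis.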
17. Speech Recognition
- Speech recognition is the foundation of human-computer interaction using speech.
- Speech recognition operates in different contexts:
- Speaker dependent or speaker independent.
- Discrete words or continuous speech.
- Small vocabulary or large vocabulary.
- Quiet environment or noisy environment.
Reference patterns
speech
Comparison and decision algorithm
Parameter analyzer
Words
Language model
18. How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, the power distribution).
19. Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we match the word when there are time
and other variations?
20. Hidden Markov Model
P12
S1
S2
a,b,c,
a,b,c,
S3
a,b,c,
21. Dynamic Programming in Decoding
time
states
We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed feature sequence (one feature vector extracted from each speech frame).
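The dynamic programming search described above is commonly implemented as the Viterbi algorithm. The sketch below assumes a small left-to-right HMM with discrete observation symbols; the state names S1 to S3 and the symbols a, b, c echo the HMM slide, while all probabilities are made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] * trans_p[r][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

states = ["S1", "S2", "S3"]
start_p = {"S1": 1.0, "S2": 0.0, "S3": 0.0}
trans_p = {  # left-to-right: stay in a state or move to the next one
    "S1": {"S1": 0.5, "S2": 0.5, "S3": 0.0},
    "S2": {"S1": 0.0, "S2": 0.5, "S3": 0.5},
    "S3": {"S1": 0.0, "S2": 0.0, "S3": 1.0},
}
emit_p = {
    "S1": {"a": 0.8, "b": 0.1, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.8, "c": 0.1},
    "S3": {"a": 0.1, "b": 0.1, "c": 0.8},
}
path, prob = viterbi(["a", "a", "b", "c", "c"], states, start_p, trans_p, emit_p)
print(path)  # ['S1', 'S1', 'S2', 'S3', 'S3']
```

The repeated states in the decoded path show how the model absorbs timing variation: a phoneme that lasts longer simply keeps the path in the same state for more frames. A practical decoder would work with log probabilities to avoid underflow on long utterances.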
22. HMM for a Unigram Language Model
HMM1 (word1)
p1
HMM2 (word2)
s0
p2
p3
HMM3 (wordn)
23. Speech Synthesis
- Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.).
- Speech synthesis has been widely used in text-to-speech systems and various telephone services.
- The easiest and most often used speech synthesis method is waveform concatenation.
Increase the pitch without changing the speed
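Waveform concatenation needs some way to avoid audible clicks where two recorded units meet. The slides do not specify how the joins are smoothed; the sketch below assumes a short linear cross-fade over an overlap region, with a hypothetical function name and toy data.

```python
def concatenate(a, b, overlap):
    """Join two waveforms, cross-fading linearly over `overlap` samples.

    Assumes 0 <= overlap <= min(len(a), len(b)).
    """
    if overlap == 0:
        return list(a) + list(b)
    # Fade weights rise from just above 0 to just below 1 across the overlap.
    fade = [(i + 1) / (overlap + 1) for i in range(overlap)]
    blended = [(1.0 - w) * x + w * y
               for w, x, y in zip(fade, a[-overlap:], b[:overlap])]
    return list(a[:-overlap]) + blended + list(b[overlap:])

# Toy units: a constant 1.0 segment followed by a constant 0.0 segment.
joined = concatenate([1.0] * 4, [0.0] * 4, 3)
print(joined)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

The output is shorter than the two inputs combined by the length of the overlap, and the amplitude moves smoothly from the first unit to the second.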
24. Speaker Recognition
- Identifying or verifying the identity of a speaker is an application where computers outperform humans.
- Vocal tract parameters can be used as features for speaker recognition.
Speaker one
Speaker two
LPC covariance feature
25. Applications
Speech recognition
Call routing
Document input
Operator Services
Voice Commands
Directory Assistance
Speaker recognition
Speech Coding
Voice over Internet
Fraud Control
Wireless Telephone
Document Correction
Personalized service
Speech Interface
Text-to-Speech synthesis