1
Speech in Multimedia
  • Hao Jiang
  • Computer Science Department
  • Boston College
  • Oct. 9, 2007

2
Outline
  • Introduction
  • Topics in speech processing
  • Speech coding
  • Speech recognition
  • Speech synthesis
  • Speaker verification/recognition
  • Conclusion

3
Introduction
  • Speech is our basic communication tool.
  • We have long hoped to communicate with machines
    using speech.

C3PO and R2D2
4
Speech Production Model
Anatomical Structure
Mechanical Model
5
Characteristics of Digital Speech
Waveform
Speech
Spectrogram
6
Voiced and Unvoiced Speech
Silence
unvoiced
voiced
7
Short-time Parameters
Short-time power
Waveform envelope
8
Zero crossing rate
Pitch period
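
A minimal sketch of two of the short-time parameters named above, short-time power and zero-crossing rate, computed frame by frame. The frame length (240 samples), the 8 kHz rate, and the two toy signals are assumptions chosen for illustration, not values from the slides; pitch period estimation is not covered here.

```python
import math

def split_frames(signal, frame_len, hop):
    """Cut a list of samples into frames; a tail shorter than frame_len is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

FS = 8000  # assumed sampling rate in Hz
# Toy stand-ins: a strong 100 Hz tone (voiced-like) and a weak,
# rapidly alternating signal (unvoiced-like).
voiced_like = [math.sin(2 * math.pi * 100 * n / FS) for n in range(480)]
unvoiced_like = [0.1 * (1 if n % 2 == 0 else -1) for n in range(480)]

for name, signal in (("voiced-like", voiced_like),
                     ("unvoiced-like", unvoiced_like)):
    for frame in split_frames(signal, 240, 240):
        print(name,
              round(short_time_power(frame), 3),
              round(zero_crossing_rate(frame), 3))
```

In this toy case the voiced-like frames show high power and a low zero-crossing rate, and the unvoiced-like frames show the opposite, which is the usual basis for a simple voiced/unvoiced decision.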
9
Speech Coding
  • As with images, speech can be compressed to make
    it smaller and easier to store and transmit.
  • General compression methods such as DPCM can also
    be used.
  • More compression can be achieved by taking
    advantage of the speech production model.
  • There are two classes of speech coders:
      • Waveform coders
      • Vocoders
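
As an illustration of the DPCM idea mentioned above, here is a minimal sketch that codes the quantized difference between each sample and the previous reconstructed sample. The first-order predictor and the uniform step size are assumptions made for brevity; practical coders use better predictors and adaptive quantizers.

```python
def dpcm_encode(samples, step):
    """Quantize the error between each sample and the decoder's prediction."""
    codes = []
    prediction = 0
    for s in samples:
        code = round((s - prediction) / step)  # quantized prediction error
        codes.append(code)
        prediction += code * step              # mirror the decoder's state
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized errors."""
    out = []
    prediction = 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 8, 6, 2]
codes = dpcm_encode(samples, step=2)
print(codes)
print(dpcm_decode(codes, step=2))
```

Because the encoder tracks the decoder's reconstruction rather than the true previous sample, the quantization error does not accumulate; each reconstructed sample stays within half a step of the original.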

10
LPC Speech Coder
[Block diagram: speech -> speech buffer (frame n, frame n+1) -> speech analysis (vocal tract parameters, pitch, voiced/unvoiced decision, energy parameter) -> quantizer -> code generation -> code stream]
11
LPC and Vocal Tract
  • Mathematically, speech can be modeled with the
    following generation model:

    $$x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$$

  • $a_1, a_2, \ldots, a_k$ are called Linear Prediction
    Coefficients (LPC), which can be used to model
    the shape of the vocal tract.
  • $e(n)$ is the excitation that generates the speech.
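
One common way to estimate the coefficients a_1 ... a_k of the model x(n) = sum_p a_p x(n-p) + e(n) from a frame of samples is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the slides do not say which estimator was used, and the test signal is synthetic.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n + lag], for lag = 0..max_lag."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]

def lpc(x, order):
    """Return ([a_1, ..., a_k], residual_energy) via Levinson-Durbin."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)        # a[0] is unused
    err = r[0]                     # prediction error energy so far
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err              # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Synthetic check: generate x(n) = 1.3 x(n-1) - 0.4 x(n-2) + e(n)
# and see whether the estimate lands near (1.3, -0.4).
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.3 * x[-1] - 0.4 * x[-2] + rng.gauss(0.0, 1.0))
coeffs, residual = lpc(x, 2)
print([round(c, 2) for c in coeffs])
```

The function assumes a non-silent frame; an all-zero input would make `r[0]` zero and the division undefined.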
12
Decoding and Speech Synthesis
[Block diagram: pitch period -> impulse train generator -> glottal pulse generator (voiced branch); random noise generator (unvoiced branch); U/V switch selects a branch -> gain -> vocal tract model -> radiation model -> speech]
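
A simplified sketch of the decoder above: the excitation is an impulse train for voiced frames or random noise for unvoiced frames, scaled by a gain and passed through the all-pole vocal tract filter. The glottal pulse shaping and the radiation model are left out, and the coefficient values, pitch period, and gain are made-up illustration values.

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0
                for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(exc, coeffs, gain):
    """All-pole filter: x(n) = sum_p a_p * x(n - p) + gain * e(n)."""
    out = []
    for n, e in enumerate(exc):
        past = sum(a * out[n - p]
                   for p, a in enumerate(coeffs, start=1)
                   if n - p >= 0)
        out.append(past + gain * e)
    return out

rng = random.Random(1)
lpc_coeffs = [1.3, -0.4]   # hypothetical vocal tract coefficients
voiced_frame = synthesize(excitation(180, True, 80, rng), lpc_coeffs, 0.5)
unvoiced_frame = synthesize(excitation(180, False, 80, rng), lpc_coeffs, 0.1)
print(len(voiced_frame), len(unvoiced_frame))
```

A real decoder would also carry the filter state across frame boundaries instead of restarting from silence in each frame.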
13
An Example for Synthesizing Speech
Glottal Pulse
Go through vocal tract filter with gain control
Blending region
Go through radiation filter
14
LPC10 (FS1015)
  • LPC10 was the DoD speech coding standard for
    voice communication at 2.4 kbps.
  • LPC10 works on speech sampled at 8 kHz, using
    22.5 ms frames and 10 LPC coefficients.
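
The frame arithmetic behind these figures can be checked directly. The sketch assumes the 8 kHz sampling rate that LPC10 uses; the helper name is hypothetical.

```python
def frame_budget(sample_rate_hz, frame_ms, bit_rate_bps):
    """Return (samples per frame, bits available per frame)."""
    samples_per_frame = sample_rate_hz * frame_ms / 1000
    bits_per_frame = bit_rate_bps * frame_ms / 1000
    return samples_per_frame, bits_per_frame

samples, bits = frame_budget(8000, 22.5, 2400)
print(samples, bits)  # 180.0 samples and 54.0 bits per frame
```

So each 180-sample frame has to be described in 54 bits, which is why the coder sends model parameters (LPC coefficients, pitch, voicing, energy) rather than the waveform itself.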

Original Speech
LPC Decoded Speech
15
Mixed Excitation LP
  • For real speech, the excitation is usually not
    pure pulse or noise but a mixture.
  • The new 2.4kbps standard (MELP) addresses this
    problem.

[Block diagram: pulses -> bandpass filter (weight w) and noise -> bandpass filter (weight 1-w); the two branches are summed -> gain -> vocal tract model -> radiation model -> speech]
[Audio: Original Speech, MELP Decoded Speech]
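
The mixing step can be sketched as a weighted sum of a pulse excitation and a noise excitation with weights w and 1-w. This is a full-band simplification: it omits the bandpass filters, and the sample values and weight are illustrative assumptions.

```python
def mixed_excitation(pulses, noise, w):
    """Blend two excitations sample by sample: w * pulses + (1 - w) * noise."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [w * p + (1.0 - w) * q for p, q in zip(pulses, noise)]

pulses = [1.0, 0.0, 0.0, 1.0]
noise = [0.5, -0.5, 0.5, -0.5]
print(mixed_excitation(pulses, noise, 0.5))
```

Setting w to 1 or 0 recovers the pure pulse or pure noise excitation of the basic LPC decoder.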
16
Hybrid Speech Codecs
  • At higher bit rates, hybrid speech codecs have
    advantages over vocoders.
  • FS1016: CELP (Code Excited Linear Prediction).
  • G.723.1: a dual-rate codec (5.3 kbps and
    6.3 kbps) for multimedia communication over the
    Internet.
  • G.729: a CELP-based codec at 8 kbps.

[Diagram: Analysis by Synthesis. Labels: speech, model parameter generation, code, speech synthesis, perceptual comparison]
[Audio: sound at 5.3 kbps, sound at 6.3 kbps, sound at 8 kbps]
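
A toy sketch of the analysis-by-synthesis loop: every candidate excitation in a small codebook is run through the synthesis filter, and the encoder keeps the index and gain whose output is closest to the target speech. Plain squared error stands in for the perceptual comparison, and the codebook and filter coefficient are made-up values; none of this reflects the actual FS1016, G.723.1, or G.729 bitstreams.

```python
def all_pole(exc, coeffs):
    """Synthesis filter: x(n) = e(n) + sum_p a_p * x(n - p)."""
    out = []
    for n, e in enumerate(exc):
        out.append(e + sum(a * out[n - p]
                           for p, a in enumerate(coeffs, start=1)
                           if n - p >= 0))
    return out

def search_codebook(target, codebook, coeffs):
    """Return (index, gain, error) of the best-matching code vector."""
    best = None
    for idx, code in enumerate(codebook):
        synth = all_pole(code, coeffs)
        energy = sum(s * s for s in synth)
        if energy == 0:
            continue                      # silent candidate cannot match
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        err = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or err < best[2]:
            best = (idx, gain, err)
    return best

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 0, 0]]   # hypothetical
coeffs = [0.5]                                           # hypothetical
target = [2 * s for s in all_pole(codebook[1], coeffs)]
print(search_codebook(target, codebook, coeffs))
```

Only the winning index and gain need to be transmitted, since the decoder holds the same codebook and can rerun the synthesis.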
17
Speech Recognition
  • Speech recognition is the foundation of
    human-computer interaction using speech.
  • Speech recognition operates in different contexts:
      • Speaker-dependent or speaker-independent.
      • Discrete words or continuous speech.
      • Small vocabulary or large vocabulary.
      • Quiet environment or noisy environment.

[Block diagram: speech -> parameter analyzer -> comparison and decision algorithm -> words; the comparison draws on reference patterns and a language model]
18
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for
example, the power distribution).
19
Speech Recognition
g g r ey ey ey ey w ey ey l l z
How do we match the word when there are time
and other variations?
20
Hidden Markov Model
[Diagram: HMM with states S1, S2, S3; transition probabilities labeled as in P12; each state emits symbols a, b, c, ...]
21
Dynamic Programming in Decoding
[Trellis diagram: states versus time]
We can find the path through the states that
corresponds to the most probable phoneme sequence
for generating the observed feature sequence (one
feature extracted from each speech frame).
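
The dynamic programming search described above is usually done with the Viterbi algorithm. The sketch below assumes discrete observation symbols and a small left-to-right model; the state names, symbols, and all probabilities are invented for illustration and are not taken from the slides.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-domain scores)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[t][s]: best log-probability of any path ending in s at time t
    score = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(obs[0], 0.0))
              for s in states}]
    back = []  # back[t - 1][s]: best predecessor of s at time t
    for t in range(1, len(obs)):
        column, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: score[-1][r] + log(trans_p[r].get(s, 0.0)))
            column[s] = (score[-1][prev]
                         + log(trans_p[prev].get(s, 0.0))
                         + log(emit_p[s].get(obs[t], 0.0)))
            pointers[s] = prev
        score.append(column)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

STATES = ["S1", "S2", "S3"]
START = {"S1": 1.0}
TRANS = {"S1": {"S1": 0.5, "S2": 0.5},
         "S2": {"S2": 0.5, "S3": 0.5},
         "S3": {"S3": 1.0}}
EMIT = {"S1": {"a": 0.8, "b": 0.1, "c": 0.1},
        "S2": {"a": 0.1, "b": 0.8, "c": 0.1},
        "S3": {"a": 0.1, "b": 0.1, "c": 0.8}}

print(viterbi(list("aabcc"), STATES, START, TRANS, EMIT))
```

Each column of scores depends only on the previous column, so the cost grows linearly with the number of frames instead of exponentially with the number of possible paths. Repeated states in the path absorb the time variation: a phoneme that lasts longer simply stays in its state for more frames.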
22
HMM for a Unigram Language Model
[Diagram: start state s0 branches with probabilities p1, p2, p3 to the word models HMM1 (word1), HMM2 (word2), HMM3 (wordn)]
23
Speech Synthesis
  • Speech synthesis generates (arbitrary) speech
    with desired properties (pitch, speed, loudness,
    articulation mode, etc.).
  • Speech synthesis has been widely used for
    text-to-speech systems and various telephone
    services.
  • The simplest and most widely used speech
    synthesis method is waveform concatenation.

Increase the pitch without changing the speed
24
Speaker Recognition
  • Identifying or verifying the identity of a
    speaker is an application where computers can
    outperform humans.
  • Vocal tract parameters can be used as features
    for speaker recognition.

Speaker one
Speaker two
LPC covariance feature
25
Applications
Speech recognition
Call routing
Document input
Operator Services
Voice Commands
Directory Assistance
Speaker recognition
Speech Coding
Voice over Internet
Fraud Control
Wireless Telephone
Document Correction
Personalized service
Speech Interface
Text-to-Speech synthesis