Title: SIGNAL PROCESSING FOR SPEECH APPLICATIONS
1. SIGNAL PROCESSING FOR SPEECH APPLICATIONS
- Richard M. Stern
- Department of Electrical and Computer Engineering
- and School of Computer Science
- Carnegie Mellon University
- Pittsburgh, Pennsylvania 15213
- Telephone (412) 268-2535
- FAX (412) 268-3890
- INTERNET rms@cs.cmu.edu
- 18-491
- May 1, 2006
2. SIGNAL PROCESSING FOR SPEECH APPLICATIONS
- Major speech technologies
- Speech coding
- Speech synthesis
- Speech recognition
- Other research areas
- Speaker identification and verification
- Word spotting
- Language identification
- Machine translation
3. GOALS OF THIS LECTURE
- Review underlying scientific basis for core signal processing in speech
- Review signal processing techniques used in major application areas
- Speech coding
- Speech synthesis
- Speech recognition
4. SPEECH AND LANGUAGE RESEARCH AT CARNEGIE MELLON
- Some facets of CMU's ongoing core research
- Large-vocabulary speech recognition
- Spoken language understanding
- Conversational systems
- Machine translation
- Multi-modal integration
5. SPEECH AND LANGUAGE RESEARCH AT CARNEGIE MELLON
- Some application-focused efforts
- LISTEN group (Jack Mostow)
- Literacy training using speech input
- FLUENCY group (Maxine Eskenazi)
- Foreign language training using speech input
- Informedia group (Howard Wactlar)
- Video on demand
- Wearable computer group (Dan Siewiorek)
- Automotive applications (GM initiative)
6. CONVERSATIONAL SYSTEMS: THE CMU COMMUNICATOR
- Users interact with computers to perform useful tasks
- Conversational systems include
- Speech recognition
- Semantic interpretation
- Speech generation
- Domain knowledge (travel planning in our case)
- Current research includes
- Mixed-initiative interaction
- User and task modeling
- Dialog scripting
7. The CMU Communicator System
8. OPEN SOURCE RELEASE OF SPHINX-II
- SPHINX-II is now available in Open Source form
- http://www.speech.cs.cmu.edu/speech/sphinx/
- Initial release contains decoder, language tools, and primitive acoustic models
- Later releases include better models, trainer, and SPHINX-3
- More than 6000 downloads in the first week (!), with many groups currently using it
- SPHINX-4 developed in Java by a collaboration between CMU, Sun Microsystems, MIT, Mitsubishi Labs, and HP
9. GOALS OF SPEECH REPRESENTATIONS
- Capture important phonetic information in speech
- Computational efficiency
- Robustness in noise, etc.
- Efficiency in storage requirements
- Optimize generalization
10. ANATOMY OF THE VOCAL TRACT
- Vocal tract can be modeled as a tube of varying width
11. THE SOURCE-FILTER MODEL FOR SPEECH
- A useful model for representing the generation of speech sounds
[Figure: source-filter block diagram; an excitation p[n] drives the vocal-tract filter to produce speech]
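To make the model concrete, here is a minimal synthesis sketch in Python (not from the lecture; the all-pole coefficients are illustrative stand-ins for a fitted vocal-tract filter): a periodic impulse train p[n] drives an all-pole filter.

```python
# Minimal source-filter sketch: impulse-train excitation through an
# all-pole "vocal tract". Coefficient values are hypothetical.
import numpy as np
from scipy.signal import lfilter

fs = 16000                    # sampling rate (Hz)
f0 = 120                      # excitation fundamental (Hz)
n = np.arange(fs // 2)        # half a second of samples

p = np.zeros(len(n))          # excitation p[n]: one pulse per pitch period
p[::fs // f0] = 1.0

# One stable resonance standing in for a formant (hypothetical values)
a = [1.0, -1.3, 0.8]          # A(z) = 1 - 1.3 z^-1 + 0.8 z^-2
speech = lfilter([1.0], a, p) # excitation filtered by 1/A(z)
```

Changing f0 changes the pitch while the filter (the "vowel") stays the same, which mirrors the excitation-swapping demonstration on the next slide.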
12. Speech coding: separating the vocal-tract excitation and filter
- Original speech
- Speech with 75-Hz excitation
- Speech with 150-Hz excitation
- Speech with noise excitation
13. SOME SPEECH CODING TECHNOLOGIES
- Adaptive differential pulse code modulation (ADPCM, 32 kbits/sec)
- Code and send differences in ongoing waveform from prediction (sketched below)
- Sub-band vocoding (16 kbits/sec)
- Code high- and low-frequency bands separately
- Linear predictive coding (LPC, 2.4-9.6 kbits/sec)
- Approximate vocal-tract filter function by all-pole model
- Code-excited linear prediction (CELP, 4.8-16 kbits/sec)
- LPC plus coding of source waveform
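To illustrate the "code differences from prediction" idea behind ADPCM, here is a hedged sketch of plain first-order DPCM; real ADPCM also adapts the quantizer step size and uses a better predictor, both omitted here.

```python
import numpy as np

def dpcm_encode(x, step=0.05):
    """First-order DPCM sketch: the predictor is simply the previous
    reconstructed sample; only the quantized difference is transmitted."""
    codes, pred = [], 0.0
    for s in x:
        q = int(round((s - pred) / step))  # quantized prediction error
        pred += q * step                   # reconstruction (decoder state)
        codes.append(q)
    return np.array(codes)

def dpcm_decode(codes, step=0.05):
    return np.cumsum(codes) * step         # accumulate decoded differences
```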
14. EXAMPLES OF SPEECH CODING SCHEMES
- 64 kb/s G.711 μ-law PCM (companding curve sketched below)
- Toll network standard for North America and Japan
- Has the range of a linear 13-bit coder using only 8 bits
- 32 kb/s G.721 ADPCM
- CCITT standard adopted in the 1980s
- Works with μ-law or A-law PCM input
- 16 kb/s LD-CELP (Low-Delay CELP)
- Low delay (2 ms)
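As a sketch of how G.711 obtains 13-bit range from 8 bits, here is the continuous μ-law companding curve with μ = 255; note that the actual standard implements a piecewise-linear (segmented) approximation of this curve.

```python
import numpy as np

MU = 255.0  # mu-law parameter used by G.711

def mulaw_compress(x):
    """Logarithmic companding of x in [-1, 1]: small amplitudes are
    expanded before uniform quantization, large ones compressed."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# 8-bit quantization in the companded domain, then expansion
x = np.linspace(-1.0, 1.0, 9)
xhat = mulaw_expand(np.round(mulaw_compress(x) * 127) / 127)
```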
15. EXAMPLES OF SPEECH CODING SCHEMES
- 8 kb/s CELP
- Submitted by AT&T as the standard for digital cell phones in North America
- Higher delay, about 80 ms
- 4.8 kb/s CELP
- Proposed for standardization by the U.S. government for secure terminals
- 2.4 kb/s LPC-10E
- Standard for secure telephones used by the U.S. military and NATO
16. OVERVIEW OF SPEECH RECOGNITION
[Figure: recognition block diagram; feature extraction yields speech features, and a decision-making procedure yields phoneme hypotheses]
- Major functional components
- Signal processing to extract features from speech waveforms
- Comparison of features to pre-stored templates
- Important design choices
- Choice of features
- Specific method of comparing features to stored templates
17. WHY PERFORM SIGNAL PROCESSING?
- A look at the time-domain waveform of "six"
- It's hard to infer much from the time-domain waveform
18. WHY PERFORM SIGNAL PROCESSING IN THE FREQUENCY DOMAIN?
- Human hearing is based on frequency analysis
- Use of frequency analysis often simplifies signal processing
- Use of frequency analysis often facilitates understanding
19. SHORT-TIME FOURIER ANALYSIS
- Problem: Conventional Fourier analysis does not capture the time-varying nature of speech signals
- Solution: Multiply signals by a finite-duration window function, then compute the DTFT (see the sketch below)
- Side effect: Windowing causes spectral blurring
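A minimal short-time analysis sketch in Python; the 25-ms Hamming window and 10-ms hop are typical speech choices, not values prescribed by the slide.

```python
import numpy as np
from scipy.signal import stft, get_window

fs = 16000
x = np.random.randn(fs)                 # stand-in for one second of speech

win = get_window("hamming", int(0.025 * fs))   # 25-ms window
hop = int(0.010 * fs)                          # 10-ms frame advance
f, t, X = stft(x, fs=fs, window=win,
               nperseg=len(win), noverlap=len(win) - hop)
log_mag = 20.0 * np.log10(np.abs(X) + 1e-10)   # spectrogram in dB
```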
20. THE SPEECH SPECTROGRAM
21. EFFECT OF WINDOW DURATION
- Short-duration window: N = 64
- Long-duration window: N = 512
22. LINEAR PREDICTION OF SPEECH
- Find the best all-pole approximation to the DTFT of a segment of speech
- All-pole model is reasonable for most speech
- Very efficient in terms of data storage
- Coefficients a_k can be computed efficiently (see the sketch below)
- Phase information not preserved (not a problem for us)
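A sketch of the autocorrelation method for computing the a_k; scipy's solve_toeplitz solves the normal equations via the Levinson recursion. The order of 12 is an illustrative choice.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    """All-pole coefficients a_k such that
    frame[n] ~ sum_k a[k-1] * frame[n-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

# The LPC spectral envelope is G / |A(e^jw)| with A(z) = 1 - sum a_k z^-k:
# w, H = scipy.signal.freqz([1.0], np.concatenate(([1.0], -a)))
```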
23. LINEAR PREDICTION EXAMPLE
- Spectra from the /ih/ in "six"
- Comment: LPC spectrum follows the peaks well
24. FEATURES FOR SPEECH RECOGNITION: CEPSTRAL COEFFICIENTS
- The cepstrum is the inverse transform of the log of the magnitude of the spectrum (sketched in code below)
- Useful for separating convolved signals (like the source and filter in the speech production model)
- Can be thought of as the Fourier series expansion of the log of the magnitude of the Fourier transform
- Generally provides more efficient and robust coding of speech information than LPC coefficients
- Most common basic feature for speech recognition
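Following the definition directly, a minimal real-cepstrum sketch (the Hamming window and the small floor inside the log are implementation choices, not part of the definition):

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse transform of the log magnitude spectrum. Low quefrencies
    capture the smooth vocal-tract envelope; a voiced frame also shows
    a peak at the quefrency of the pitch period."""
    log_mag = np.log(np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-10)
    return np.fft.irfft(log_mag)
```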
25. TWO WAYS OF DERIVING CEPSTRAL COEFFICIENTS
- LPC-derived cepstral coefficients (LPCC)
- Compute traditional LPC coefficients
- Convert to cepstra using linear transformation
- Warp cepstra using bilinear transform
- Mel-frequency cepstral coefficients (MFCC)
- Multiply magnitude of DFT of windowed signal by triangular mel weighting functions
- Compute log of the weighted channel magnitudes
- Compute inverse discrete cosine transform
26. COMPUTING CEPSTRAL COEFFICIENTS
- Comments
- MFCC is currently the most popular representation
- Typical systems include a combination of
- MFCC coefficients
- Delta MFCC coefficients (computed as sketched below)
- Delta-delta MFCC coefficients
- Power and delta power coefficients
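Delta coefficients are typically computed by linear regression over a few surrounding frames; a sketch assuming the common HTK-style formula with a +/-2 frame window (details vary across systems):

```python
import numpy as np

def deltas(C, N=2):
    """Regression deltas for a (frames x coeffs) feature array C:
    d[t] = sum_n n * (C[t+n] - C[t-n]) / (2 * sum_n n^2)."""
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    Cp = np.pad(C, ((N, N), (0, 0)), mode="edge")   # repeat edge frames
    T = len(C)
    return sum(n * (Cp[N + n:T + N + n] - Cp[N - n:T + N - n])
               for n in range(1, N + 1)) / denom

# Delta-delta coefficients are simply deltas of the deltas:
# dd = deltas(deltas(mfcc_matrix))
```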
27. COMPUTING LPC CEPSTRAL COEFFICIENTS
- Procedure used in standard OGI package (I think) for LPC-derived cepstra (LPCC)
- A/D conversion at 8-kHz sampling rate
- Apply Hamming window, duration 200 samples (25 ms), every 10 ms (100-Hz frame rate)
- Pre-emphasize to boost high-frequency components
- Compute first 12 autocorrelation coefficients
- Perform Levinson-Durbin recursion to obtain 12 LPC coefficients
- Convert LPC coefficients to cepstral coefficients
- Perform frequency warping to spread low frequencies
- Compute delta and delta-delta coefficients
28. An example: the vowel in "welcome"
- The original time function
29. THE TIME FUNCTION AFTER WINDOWING
30. THE RAW SPECTRUM
31. PRE-EMPHASIZING THE SIGNAL
- Typical pre-emphasis filter
- Its frequency response
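The filter itself appears only as a figure; a common choice (assumed here, with an illustrative coefficient of 0.97) is H(z) = 1 - 0.97 z^-1, a mild high-pass that boosts high frequencies by roughly 6 dB per octave:

```python
from scipy.signal import lfilter

def preemphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha near 1 boosts high frequencies."""
    return lfilter([1.0, -alpha], [1.0], x)
```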
32. THE SPECTRUM OF THE PRE-EMPHASIZED SIGNAL
33. THE LPC SPECTRUM
34. THE TRANSFORM OF THE CEPSTRAL COEFFICIENTS
35. CONVERTING FROM LPC COEFFICIENTS TO CEPSTRAL COEFFICIENTS
- Recursive conversion formula
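The standard recursion for a minimum-phase all-pole model H(z) = G / (1 - sum_{k=1}^{p} a_k z^{-k}) is:

c_0 = ln G
c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},   for 1 <= n <= p
c_n = sum_{k=n-p}^{n-1} (k/n) c_k a_{n-k},       for n > p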
36. THE BIG PICTURE: THE ORIGINAL SPECTROGRAM
37. EFFECTS OF LPC PROCESSING
38. COMPARING REPRESENTATIONS
[Figure: spectrographic comparison of original speech, LPCC, and unwarped cepstra]
39. FREQUENCY RESPONSE OF THE AUDITORY SYSTEM: TUNING CURVES
- Threshold level for auditory-nerve response to tones
40. MEL FREQUENCY WARPING OF THE SPECTRUM
- Easy shortcut to characterizing increasing peripheral filter bandwidths (Stevens)
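A widely used analytic form of the warping (assumed here; the lecture's exact curve may differ) is mel(f) = 2595 log10(1 + f/700), which is approximately linear below 1 kHz and logarithmic above.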
41. COMPUTING MEL FREQUENCY CEPSTRAL COEFFICIENTS
- Segment incoming waveform into Hamming-windowed frames of 20-25 ms duration
- Compute frequency response for each frame using DFTs
- Group magnitude of frequency response into 25-40 channels using triangular weighting functions
- Compute log of weighted magnitudes for each channel
- Take inverse DFT (or DCT) of the log channel outputs, producing 12-14 cepstral coefficients for each frame
- Calculate delta and delta-delta coefficients (the full pipeline is sketched below)
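A compact single-frame sketch of the steps above; the 30 filters, 13 cepstra, and 16-kHz sampling rate are illustrative choices:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=16000, n_filt=30, n_ceps=13):
    """Windowed DFT -> triangular mel filterbank -> log -> DCT."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    # Triangular filters centered at uniformly spaced mel frequencies
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filt + 2))
    fbank = np.zeros((n_filt, len(freqs)))
    for i in range(n_filt):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                      (hi - freqs) / (hi - c)), 0.0, None)

    log_e = np.log(fbank @ mag + 1e-10)     # log channel outputs
    return dct(log_e, type=2, norm="ortho")[:n_ceps]
```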
42. AN EXAMPLE: DERIVING MFCC COEFFICIENTS
43. WEIGHTING THE FREQUENCY RESPONSE
44. THE ACTUAL MEL WEIGHTING FUNCTIONS (for CMU)
45. THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
46. THE CEPSTRAL COEFFICIENTS
47. LOG SPECTRA RECOVERED FROM CEPSTRA
48. COMPARING SPECTRAL REPRESENTATIONS
[Figure: spectrographic comparison of original speech, mel log magnitudes, and mel log magnitudes after cepstral processing]
49. Tracking speech sounds via fundamental frequency
- Given good pitch estimates
- How well can we separate signals from noise?
- How much will this separation help in speech recognition?
- To what extent can pitch be used to separate speech signals from one another?
50. The CMU ARCTIC database
- Collected by John Kominek and Alan Black as a resource for speech synthesis
- Contains phonetically balanced recordings with simultaneously recorded EGG (laryngograph) measurements
- Available at http://www.festvox.org/cmu_arctic
51. The CMU ARCTIC database
- Original speech
- Laryngograph recording
52. Typical pitch estimates obtained from ARCTIC
- Comment: not all outliers were successfully removed
53. Isolating speech by pitch
- Method 1
- Estimate amplitudes of partials by synchronous heterodyne analysis
- Resynthesize as sums of sines or cosines
- Unvoiced segments are problematic
- Method 2
- Pass speech through a comb filter that tracks harmonic frequencies
- Unvoiced segments are still problematic
54. Resynthesizing speech using heterodyne analysis
- Original speech samples
- Reconstructed speech
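A sketch of the heterodyne idea under simplifying assumptions (one fixed f0 instead of a frame-by-frame pitch track): each presumed harmonic k*f0 is demodulated to DC and low-pass filtered to recover its amplitude envelope, from which sines can be resynthesized.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def partial_envelopes(x, fs, f0, n_partials=20, cutoff=50.0):
    """Synchronous heterodyne analysis with a fixed f0 (simplified)."""
    t = np.arange(len(x)) / fs
    b, a = butter(2, cutoff / (fs / 2.0))        # low-pass around DC
    envs = []
    for k in range(1, n_partials + 1):
        z = x * np.exp(-2j * np.pi * k * f0 * t)       # shift k*f0 to DC
        lp = filtfilt(b, a, z.real) + 1j * filtfilt(b, a, z.imag)
        envs.append(2.0 * np.abs(lp))                  # partial amplitude
    return np.array(envs)          # shape (n_partials, len(x))
```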
55. Recovering speech through comb filtering
- Pitch-tracking comb filter
- Its frequency response
- Original speech samples
- Reconstructed speech
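A sketch of the comb idea with a fixed pitch period (the lecture's filter updates the period from the pitch track, and the exact structure behind the slide's frequency-response plot is an assumption here):

```python
import numpy as np

def comb_filter(x, fs, f0):
    """Feedforward comb y[n] = 0.5 * (x[n] + x[n - P]) with P ~ fs/f0:
    reinforces components at multiples of f0, attenuates energy between."""
    P = int(round(fs / f0))
    y = x.astype(float).copy()
    y[P:] = 0.5 * (x[P:] + x[:-P])
    return y
```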
56. Separating speech signals by heterodyning and comb filtering
- Combined speech signals
- Speech separated by heterodyne filters
- Speech separated by comb filters
- Comment: male speech masks female speech more because upper male harmonics are more likely to impinge on lower female harmonics
57. Some comments on pitch tracking
- Results are somewhat encouraging and could improve with a more careful implementation
- Unvoiced segments and multiple speakers will be a problem for pitch tracks
- Hard to predict the real impact on speech recognition accuracy as of yet
58. Speech separation by source location
- Sources arriving from different azimuths produce interaural time delays (ITDs) and interaural intensity differences (IIDs) as they arrive at the two ears
- While results of some experiments indicate that these cues may not be primary in human source separation and segregation
- They are still useful and usable in computational implementations
59. ASR results using reconstruction based on ITD
- 1-sample delay at 16 kHz, extracting ITD based on zero crossings and cross-correlation; speech masking at 0 dB SNR
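A sketch of the cross-correlation part of ITD extraction (the system's zero-crossing-based, per-band processing is not reproduced here; max_lag is an illustrative bound):

```python
import numpy as np

def itd_samples(left, right, max_lag=16):
    """Return the lag (in samples) that maximizes the cross-correlation
    between the ear signals; assumes len(left) > 2 * max_lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    mid_l = left[max_lag:len(left) - max_lag]
    xc = [np.dot(mid_l, right[max_lag + l:len(right) - max_lag + l])
          for l in lags]
    return lags[int(np.argmax(xc))]   # positive: right channel lags
```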
60. SUMMARY
- We outlined some of the relevant issues associated with the representation of speech for automatic recognition, synthesis, and coding
- The source-filter model of speech production
- Applications to speech coding
- Representations used for speech recognition
- Linear prediction/linear predictive coding (LPCC)
- Mel frequency cepstral coefficients (MFCC)
- Applications to sound source separation