Title: SIGNAL PROCESSING FOR SPEECH APPLICATIONS
1. SIGNAL PROCESSING FOR SPEECH APPLICATIONS
- Richard M. Stern
- Department of Electrical and Computer Engineering
- and School of Computer Science
- Carnegie Mellon University
- Pittsburgh, Pennsylvania 15213
- Telephone (412) 268-2535
- FAX (412) 268-3890
- INTERNET rms@cs.cmu.edu
- 18-491
- May 1, 2006
2. SIGNAL PROCESSING FOR SPEECH APPLICATIONS
- Major speech technologies
- Speech coding
- Speech synthesis
- Speech recognition
- Other research areas
- Speaker identification and verification
- Word spotting
- Language identification
- Machine translation
3. GOALS OF THIS LECTURE
- Review underlying scientific basis for core signal processing in speech
- Review signal processing techniques used in major application areas
- Speech coding
- Speech synthesis
- Speech recognition
4. SPEECH AND LANGUAGE RESEARCH AT CARNEGIE MELLON
- Some facets of CMU's ongoing core research
- Large-vocabulary speech recognition
- Spoken language understanding
- Conversational systems
- Machine translation
- Multi-modal integration
5. SPEECH AND LANGUAGE RESEARCH AT CARNEGIE MELLON
- Some application-focused efforts
- LISTEN group (Jack Mostow)
- Literacy training using speech input
- FLUENCY group (Maxine Eskenazi)
- Foreign language training using speech input
- Informedia group (Howard Wactlar)
- Video on demand
- Wearable computer group (Dan Siewiorek)
- Automotive applications (GM initiative)
6. CONVERSATIONAL SYSTEMS: THE CMU COMMUNICATOR
- Users interact with computers to perform useful tasks
- Conversational systems include
- Speech recognition
- Semantic interpretation
- Speech generation
- Domain knowledge (travel planning in our case)
- Current research includes
- Mixed-initiative interaction
- User and task modeling
- Dialog scripting
7. The CMU Communicator System
8. OPEN SOURCE RELEASE OF SPHINX-II
- SPHINX-II is now available in Open Source form
- http://www.speech.cs.cmu.edu/speech/sphinx/
- Initial release contains decoder, language tools, and primitive acoustic models
- Later releases include better models, trainer, and SPHINX-3
- More than 6000 downloads in the first week (!), with many groups currently using it
- SPHINX-4 developed in Java by a collaboration between CMU, Sun Microsystems, MIT, Mitsubishi Labs, and HP
9. GOALS OF SPEECH REPRESENTATIONS
- Capture important phonetic information in speech
- Computational efficiency
- Robustness in noise, etc.
- Efficiency in storage requirements
- Optimize generalization
10. ANATOMY OF THE VOCAL TRACT
- Vocal tract can be modeled as a tube of varying width
11. THE SOURCE-FILTER MODEL FOR SPEECH
- A useful model for representing the generation of speech sounds
[Figure: source-filter block diagram; an excitation p[n] drives the vocal-tract filter to produce speech]
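To make the model concrete, here is a minimal synthesis sketch in Python (not from the lecture; the all-pole coefficients are illustrative stand-ins for a fitted vocal-tract filter): a periodic impulse train p[n] drives an all-pole filter.

```python
# Minimal source-filter sketch: impulse-train excitation through an
# all-pole "vocal tract". Coefficient values are hypothetical.
import numpy as np
from scipy.signal import lfilter

fs = 16000                    # sampling rate (Hz)
f0 = 120                      # excitation fundamental (Hz)
n = np.arange(fs // 2)        # half a second of samples

p = np.zeros(len(n))          # excitation p[n]: one pulse per pitch period
p[::fs // f0] = 1.0

# One stable resonance standing in for a formant (hypothetical values)
a = [1.0, -1.3, 0.8]          # A(z) = 1 - 1.3 z^-1 + 0.8 z^-2
speech = lfilter([1.0], a, p) # excitation filtered by 1/A(z)
```

Changing f0 changes the pitch while the filter (the "vowel") stays the same, which mirrors the excitation-swapping demonstration on the next slide.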
12. Speech coding: separating the vocal-tract excitation and filter
- Original speech
- Speech with 75-Hz excitation
- Speech with 150-Hz excitation
- Speech with noise excitation
13. SOME SPEECH CODING TECHNOLOGIES
- Adaptive differential pulse code modulation (ADPCM, 32 kbits/sec)
- Code and send differences in ongoing waveform from prediction (sketched below)
- Sub-band vocoding (16 kbits/sec)
- Code high- and low-frequency bands separately
- Linear predictive coding (LPC, 2.4-9.6 kbits/sec)
- Approximate vocal-tract filter function by all-pole model
- Code-excited linear prediction (CELP, 4.8-16 kbits/sec)
- LPC plus coding of source waveform
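To illustrate the "code differences from prediction" idea behind ADPCM, here is a hedged sketch of plain first-order DPCM; real ADPCM also adapts the quantizer step size and uses a better predictor, both omitted here.

```python
import numpy as np

def dpcm_encode(x, step=0.05):
    """First-order DPCM sketch: the predictor is simply the previous
    reconstructed sample; only the quantized difference is transmitted."""
    codes, pred = [], 0.0
    for s in x:
        q = int(round((s - pred) / step))  # quantized prediction error
        pred += q * step                   # reconstruction (decoder state)
        codes.append(q)
    return np.array(codes)

def dpcm_decode(codes, step=0.05):
    return np.cumsum(codes) * step         # accumulate decoded differences
```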
14. EXAMPLES OF SPEECH CODING SCHEMES
- 64 kb/s G.711 μ-law PCM (companding curve sketched below)
- Toll network standard for North America and Japan
- Has the range of a linear 13-bit coder using only 8 bits
- 32 kb/s G.721 ADPCM
- CCITT standard adopted in the 1980s
- Works with μ-law or A-law PCM input
- 16 kb/s LD-CELP (Low-Delay CELP)
- Low delay (2 ms)
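As a sketch of how G.711 obtains 13-bit range from 8 bits, here is the continuous μ-law companding curve with μ = 255; note that the actual standard implements a piecewise-linear (segmented) approximation of this curve.

```python
import numpy as np

MU = 255.0  # mu-law parameter used by G.711

def mulaw_compress(x):
    """Logarithmic companding of x in [-1, 1]: small amplitudes are
    expanded before uniform quantization, large ones compressed."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# 8-bit quantization in the companded domain, then expansion
x = np.linspace(-1.0, 1.0, 9)
xhat = mulaw_expand(np.round(mulaw_compress(x) * 127) / 127)
```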
15. EXAMPLES OF SPEECH CODING SCHEMES
- 8 kb/s CELP
- Submitted by AT&T as the standard for digital cell phones in North America
- Higher delay, about 80 ms
- 4.8 kb/s CELP
- Proposed for standardization by the U.S. government for secure terminals
- 2.4 kb/s LPC-10E
- Standard for secure telephones used by the U.S. military and NATO
16. OVERVIEW OF SPEECH RECOGNITION
[Figure: recognition block diagram; feature extraction yields speech features, and a decision-making procedure yields phoneme hypotheses]
- Major functional components
- Signal processing to extract features from speech waveforms
- Comparison of features to pre-stored templates
- Important design choices
- Choice of features
- Specific method of comparing features to stored templates
17. WHY PERFORM SIGNAL PROCESSING?
- A look at the time-domain waveform of "six"
- It's hard to infer much from the time-domain waveform
18. WHY PERFORM SIGNAL PROCESSING IN THE FREQUENCY DOMAIN?
- Human hearing is based on frequency analysis
- Use of frequency analysis often simplifies signal processing
- Use of frequency analysis often facilitates understanding
19. SHORT-TIME FOURIER ANALYSIS
- Problem: Conventional Fourier analysis does not capture the time-varying nature of speech signals
- Solution: Multiply signals by a finite-duration window function, then compute the DTFT (see the sketch below)
- Side effect: Windowing causes spectral blurring
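A minimal short-time analysis sketch in Python; the 25-ms Hamming window and 10-ms hop are typical speech choices, not values prescribed by the slide.

```python
import numpy as np
from scipy.signal import stft, get_window

fs = 16000
x = np.random.randn(fs)                 # stand-in for one second of speech

win = get_window("hamming", int(0.025 * fs))   # 25-ms window
hop = int(0.010 * fs)                          # 10-ms frame advance
f, t, X = stft(x, fs=fs, window=win,
               nperseg=len(win), noverlap=len(win) - hop)
log_mag = 20.0 * np.log10(np.abs(X) + 1e-10)   # spectrogram in dB
```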
20. THE SPEECH SPECTROGRAM
21. EFFECT OF WINDOW DURATION
- Short-duration window: N = 64
- Long-duration window: N = 512
22. LINEAR PREDICTION OF SPEECH
- Find the best all-pole approximation to the DTFT of a segment of speech
- All-pole model is reasonable for most speech
- Very efficient in terms of data storage
- Coefficients a_k can be computed efficiently (see the sketch below)
- Phase information not preserved (not a problem for us)
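A sketch of the autocorrelation method for computing the a_k; scipy's solve_toeplitz solves the normal equations via the Levinson recursion. The order of 12 is an illustrative choice.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    """All-pole coefficients a_k such that
    frame[n] ~ sum_k a[k-1] * frame[n-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return solve_toeplitz(r[:order], r[1:order + 1])

# The LPC spectral envelope is G / |A(e^jw)| with A(z) = 1 - sum a_k z^-k:
# w, H = scipy.signal.freqz([1.0], np.concatenate(([1.0], -a)))
```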
23. LINEAR PREDICTION EXAMPLE
- Spectra from the /ih/ in "six"
- Comment: LPC spectrum follows the peaks well
24. FEATURES FOR SPEECH RECOGNITION: CEPSTRAL COEFFICIENTS
- The cepstrum is the inverse transform of the log of the magnitude of the spectrum (sketched in code below)
- Useful for separating convolved signals (like the source and filter in the speech production model)
- Can be thought of as the Fourier series expansion of the log of the magnitude of the Fourier transform
- Generally provides more efficient and robust coding of speech information than LPC coefficients
- Most common basic feature for speech recognition
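Following the definition directly, a minimal real-cepstrum sketch (the Hamming window and the small floor inside the log are implementation choices, not part of the definition):

```python
import numpy as np

def real_cepstrum(frame):
    """Inverse transform of the log magnitude spectrum. Low quefrencies
    capture the smooth vocal-tract envelope; a voiced frame also shows
    a peak at the quefrency of the pitch period."""
    log_mag = np.log(np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-10)
    return np.fft.irfft(log_mag)
```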
25. TWO WAYS OF DERIVING CEPSTRAL COEFFICIENTS
- LPC-derived cepstral coefficients (LPCC)
- Compute traditional LPC coefficients
- Convert to cepstra using linear transformation
- Warp cepstra using bilinear transform
- Mel-frequency cepstral coefficients (MFCC)
- Multiply magnitude of DFT of windowed signal by triangular mel weighting functions
- Compute log of the weighted channel magnitudes
- Compute inverse discrete cosine transform
26. COMPUTING CEPSTRAL COEFFICIENTS
- Comments
- MFCC is currently the most popular representation
- Typical systems include a combination of
- MFCC coefficients
- Delta MFCC coefficients (computed as sketched below)
- Delta-delta MFCC coefficients
- Power and delta power coefficients
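Delta coefficients are typically computed by linear regression over a few surrounding frames; a sketch assuming the common HTK-style formula with a +/-2 frame window (details vary across systems):

```python
import numpy as np

def deltas(C, N=2):
    """Regression deltas for a (frames x coeffs) feature array C:
    d[t] = sum_n n * (C[t+n] - C[t-n]) / (2 * sum_n n^2)."""
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    Cp = np.pad(C, ((N, N), (0, 0)), mode="edge")   # repeat edge frames
    T = len(C)
    return sum(n * (Cp[N + n:T + N + n] - Cp[N - n:T + N - n])
               for n in range(1, N + 1)) / denom

# Delta-delta coefficients are simply deltas of the deltas:
# dd = deltas(deltas(mfcc_matrix))
```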
27. COMPUTING LPC CEPSTRAL COEFFICIENTS
- Procedure used in standard OGI package (I think) for LPC-derived cepstra (LPCC)
- A/D conversion at 8-kHz sampling rate
- Apply Hamming window, duration 200 samples (25 ms), every 10 ms (100-Hz frame rate)
- Pre-emphasize to boost high-frequency components
- Compute first 12 autocorrelation coefficients
- Perform Levinson-Durbin recursion to obtain 12 LPC coefficients
- Convert LPC coefficients to cepstral coefficients
- Perform frequency warping to spread low frequencies
- Compute delta and delta-delta coefficients
28. An example: the vowel in "welcome"
- The original time function
29. THE TIME FUNCTION AFTER WINDOWING
30. THE RAW SPECTRUM
31. PRE-EMPHASIZING THE SIGNAL
- Typical pre-emphasis filter
- Its frequency response
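The filter itself appears only as a figure; a common choice (assumed here, with an illustrative coefficient of 0.97) is H(z) = 1 - 0.97 z^-1, a mild high-pass that boosts high frequencies by roughly 6 dB per octave:

```python
from scipy.signal import lfilter

def preemphasize(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha near 1 boosts high frequencies."""
    return lfilter([1.0, -alpha], [1.0], x)
```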
32. THE SPECTRUM OF THE PRE-EMPHASIZED SIGNAL
33. THE LPC SPECTRUM
34. THE TRANSFORM OF THE CEPSTRAL COEFFICIENTS
35. CONVERTING FROM LPC COEFFICIENTS TO CEPSTRAL COEFFICIENTS
- Recursive conversion formula
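The standard recursion for a minimum-phase all-pole model H(z) = G / (1 - sum_{k=1}^{p} a_k z^{-k}) is:

c_0 = ln G
c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k},   for 1 <= n <= p
c_n = sum_{k=n-p}^{n-1} (k/n) c_k a_{n-k},       for n > p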
36. THE BIG PICTURE: THE ORIGINAL SPECTROGRAM
37. EFFECTS OF LPC PROCESSING
38. COMPARING REPRESENTATIONS
[Figure: spectrographic comparison of original speech, LPCC, and unwarped cepstra]
39. FREQUENCY RESPONSE OF THE AUDITORY SYSTEM: TUNING CURVES
- Threshold level for auditory-nerve response to tones
40. MEL FREQUENCY WARPING OF THE SPECTRUM
- Easy shortcut to characterizing increasing peripheral filter bandwidths (Stevens)
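A widely used analytic form of the warping (assumed here; the lecture's exact curve may differ) is mel(f) = 2595 log10(1 + f/700), which is approximately linear below 1 kHz and logarithmic above.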
41. COMPUTING MEL FREQUENCY CEPSTRAL COEFFICIENTS
- Segment incoming waveform into Hamming-windowed frames of 20-25 ms duration
- Compute frequency response for each frame using DFTs
- Group magnitude of frequency response into 25-40 channels using triangular weighting functions
- Compute log of weighted magnitudes for each channel
- Take inverse DFT (or DCT) of the log channel outputs, producing 12-14 cepstral coefficients for each frame
- Calculate delta and delta-delta coefficients (the full pipeline is sketched below)
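A compact single-frame sketch of the steps above; the 30 filters, 13 cepstra, and 16-kHz sampling rate are illustrative choices:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=16000, n_filt=30, n_ceps=13):
    """Windowed DFT -> triangular mel filterbank -> log -> DCT."""
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)

    # Triangular filters centered at uniformly spaced mel frequencies
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filt + 2))
    fbank = np.zeros((n_filt, len(freqs)))
    for i in range(n_filt):
        lo, c, hi = edges[i], edges[i + 1], edges[i + 2]
        fbank[i] = np.clip(np.minimum((freqs - lo) / (c - lo),
                                      (hi - freqs) / (hi - c)), 0.0, None)

    log_e = np.log(fbank @ mag + 1e-10)     # log channel outputs
    return dct(log_e, type=2, norm="ortho")[:n_ceps]
```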
42. AN EXAMPLE: DERIVING MFCC COEFFICIENTS
43. WEIGHTING THE FREQUENCY RESPONSE
44. THE ACTUAL MEL WEIGHTING FUNCTIONS (for CMU)
45. THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
46. THE CEPSTRAL COEFFICIENTS
47. LOG SPECTRA RECOVERED FROM CEPSTRA
48. COMPARING SPECTRAL REPRESENTATIONS
[Figure: spectrographic comparison of original speech, mel log magnitudes, and mel log magnitudes after cepstral processing]
49. Tracking speech sounds via fundamental frequency
- Given good pitch estimates
- How well can we separate signals from noise?
- How much will this separation help in speech recognition?
- To what extent can pitch be used to separate speech signals from one another?
50. The CMU ARCTIC database
- Collected by John Kominek and Alan Black as a resource for speech synthesis
- Contains phonetically balanced recordings with simultaneously recorded EGG (laryngograph) measurements
- Available at http://www.festvox.org/cmu_arctic
51. The CMU ARCTIC database
- Original speech
- Laryngograph recording
52. Typical pitch estimates obtained from ARCTIC
- Comment: not all outliers were successfully removed
53. Isolating speech by pitch
- Method 1
- Estimate amplitudes of partials by synchronous heterodyne analysis
- Resynthesize as sums of sines or cosines
- Unvoiced segments are problematic
- Method 2
- Pass speech through a comb filter that tracks harmonic frequencies
- Unvoiced segments are still problematic
54. Resynthesizing speech using heterodyne analysis
- Original speech samples
- Reconstructed speech
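A sketch of the heterodyne idea under simplifying assumptions (one fixed f0 instead of a frame-by-frame pitch track): each presumed harmonic k*f0 is demodulated to DC and low-pass filtered to recover its amplitude envelope, from which sines can be resynthesized.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def partial_envelopes(x, fs, f0, n_partials=20, cutoff=50.0):
    """Synchronous heterodyne analysis with a fixed f0 (simplified)."""
    t = np.arange(len(x)) / fs
    b, a = butter(2, cutoff / (fs / 2.0))        # low-pass around DC
    envs = []
    for k in range(1, n_partials + 1):
        z = x * np.exp(-2j * np.pi * k * f0 * t)       # shift k*f0 to DC
        lp = filtfilt(b, a, z.real) + 1j * filtfilt(b, a, z.imag)
        envs.append(2.0 * np.abs(lp))                  # partial amplitude
    return np.array(envs)          # shape (n_partials, len(x))
```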
55. Recovering speech through comb filtering
- Pitch-tracking comb filter
- Its frequency response
- Original speech samples
- Reconstructed speech
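A sketch of the comb idea with a fixed pitch period (the lecture's filter updates the period from the pitch track, and the exact structure behind the slide's frequency-response plot is an assumption here):

```python
import numpy as np

def comb_filter(x, fs, f0):
    """Feedforward comb y[n] = 0.5 * (x[n] + x[n - P]) with P ~ fs/f0:
    reinforces components at multiples of f0, attenuates energy between."""
    P = int(round(fs / f0))
    y = x.astype(float).copy()
    y[P:] = 0.5 * (x[P:] + x[:-P])
    return y
```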
56. Separating speech signals by heterodyning and comb filtering
- Combined speech signals
- Speech separated by heterodyne filters
- Speech separated by comb filters
- Comment: male speech masks female speech more because upper male harmonics are more likely to impinge on lower female harmonics
57. Some comments on pitch tracking
- Results are somewhat encouraging and could improve with a more careful implementation
- Unvoiced segments and multiple speakers will be a problem for pitch tracks
- Hard to predict the real impact on speech recognition accuracy as of yet
58. Speech separation by source location
- Sources arriving from different azimuths produce interaural time delays (ITDs) and interaural intensity differences (IIDs) as they arrive at the two ears
- While results of some experiments indicate that these cues may not be primary in human source separation and segregation
- They are still useful and usable in computational implementations
59. ASR results using reconstruction based on ITD
- 1-sample delay at 16 kHz, extracting ITD based on zero crossings and cross-correlation; speech masking at 0 dB SNR
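A sketch of the cross-correlation part of ITD extraction (the system's zero-crossing-based, per-band processing is not reproduced here; max_lag is an illustrative bound):

```python
import numpy as np

def itd_samples(left, right, max_lag=16):
    """Return the lag (in samples) that maximizes the cross-correlation
    between the ear signals; assumes len(left) > 2 * max_lag."""
    lags = np.arange(-max_lag, max_lag + 1)
    mid_l = left[max_lag:len(left) - max_lag]
    xc = [np.dot(mid_l, right[max_lag + l:len(right) - max_lag + l])
          for l in lags]
    return lags[int(np.argmax(xc))]   # positive: right channel lags
```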
60. SUMMARY
- We outlined some of the relevant issues associated with the representation of speech for automatic recognition, synthesis, and coding
- The source-filter model of speech production
- Applications to speech coding
- Representations used for speech recognition
- Linear prediction/linear predictive coding (LPCC)
- Mel frequency cepstral coefficients (MFCC)
- Applications to sound source separation