SIGNAL PROCESSING FOR SPEECH APPLICATIONS

Transcript and Presenter's Notes

1
SIGNAL PROCESSING FOR SPEECH APPLICATIONS
  • Richard M. Stern
  • Department of Electrical and Computer Engineering
  • and School of Computer Science
  • Carnegie Mellon University
  • Pittsburgh, Pennsylvania 15213
  • Telephone (412) 268-2535
  • FAX (412) 268-3890
  • INTERNET rms@cs.cmu.edu
  • 18-491
  • May 1, 2006

2
SIGNAL PROCESSING FOR SPEECH APPLICATIONS
  • Major speech technologies
  • Speech coding
  • Speech synthesis
  • Speech recognition
  • Other research areas
  • Speaker identification and verification
  • Word spotting
  • Language identification
  • Machine translation

3
GOALS OF THIS LECTURE
  • Review underlying scientific basis for core
    signal processing in speech
  • Review signal processing techniques used in
    major application areas
  • Speech coding
  • Speech synthesis
  • Speech recognition

4
SPEECH AND LANGUAGE RESEARCH AT CARNEGIE MELLON
  • Some facets of CMU's ongoing core research
  • Large-vocabulary speech recognition
  • Spoken language understanding
  • Conversational systems
  • Machine translation
  • Multi-modal integration

5
SPEECH AND LANGUAGE RESEARCH AT CARNEGIE MELLON
  • Some application-focused efforts
  • LISTEN group (Jack Mostow)
  • Literacy training using speech input
  • FLUENCY group (Maxine Eskenazi)
  • Foreign language training using speech input
  • Informedia group (Howard Wactlar)
  • Video on demand
  • Wearable computer group (Dan Siewiorek)
  • Automotive applications (GM initiative)

6
CONVERSATIONAL SYSTEMS: THE CMU COMMUNICATOR
  • Users interact with computers to perform useful
    tasks.
  • Conversational systems include
  • Speech recognition
  • Semantic interpretation
  • Speech generation
  • Domain knowledge (travel planning in our case)
  • Current research includes
  • Mixed-initiative interaction
  • User and task modeling
  • Dialog scripting

7
The CMU Communicator System
8
OPEN SOURCE RELEASE OF SPHINX-II
  • SPHINX-II is now available in Open Source form
  • http://www.speech.cs.cmu.edu/speech/sphinx/
  • Initial release contains decoder, language tools,
    and primitive acoustic models
  • Later releases include better models, trainer,
    and SPHINX-3
  • More than 6000 downloads in the first week (!);
    many groups are currently using it
  • SPHINX-4 developed in Java in a collaboration
    among CMU, Sun Microsystems, MIT, Mitsubishi
    Electric Research Labs, and HP

9
GOALS OF SPEECH REPRESENTATIONS
  • Capture important phonetic information in speech
  • Computational efficiency
  • Robustness in noise, etc.
  • Efficiency in storage requirements
  • Optimize generalization

10
ANATOMY OF THE VOCAL TRACT
  • Vocal tract can be modelled as a tube of varying
    width

11
THE SOURCE-FILTER MODEL FOR SPEECH
  • A useful model for representing the generation of
    speech sounds

[Figure: pulse-train excitation p[n], amplitude vs. time, driving the
vocal-tract filter]
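
A minimal Python sketch of the source-filter idea, for concreteness: a
periodic impulse train p[n] (the glottal source) is passed through an
all-pole filter standing in for the vocal tract. The sampling rate, pitch,
and formant frequencies below are illustrative assumptions, not values
taken from the slide.

import numpy as np
from scipy.signal import lfilter

fs = 8000                    # sampling rate (Hz), assumed
f0 = 100                     # source fundamental frequency (Hz), assumed
n = np.arange(fs)            # one second of samples
p = (n % (fs // f0) == 0).astype(float)   # impulse-train excitation p[n]

# Illustrative all-pole vocal-tract filter with two resonances (formants)
poles = []
for f, bw in [(500, 60), (1500, 90)]:     # center frequency, bandwidth (Hz)
    r = np.exp(-np.pi * bw / fs)
    poles += [r * np.exp(2j * np.pi * f / fs),
              r * np.exp(-2j * np.pi * f / fs)]
a = np.real(np.poly(poles))               # denominator polynomial A(z)
speech = lfilter([1.0], a, p)             # vowel-like source-filter output
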
12
Speech coding: separating the vocal-tract excitation
and filter
  • Original speech
  • Speech with 75-Hz excitation
  • Speech with 150-Hz excitation
  • Speech with noise excitation

13
SOME SPEECH CODING TECHNOLOGIES
  • Adaptive differential pulse code modulation
    (ADPCM, 32 kbits/sec)
  • Code and send the difference between the ongoing
    waveform and a prediction (see the sketch after
    this list)
  • Sub-band vocoding (16 kbits/sec)
  • Code high- and low-frequency bands separately
  • Linear predictive coding (LPC, 2.4-9.6 kbits/sec)
  • Approximate vocal-tract filter function by
    all-pole model
  • Code-excited linear prediction (CELP, 4.8-16
    kbits/sec)
  • LPC plus coding of source waveform
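
A minimal sketch of the differential idea behind ADPCM, mentioned in the
list above: transmit the quantized difference between each sample and a
prediction from the previously decoded sample. A real ADPCM coder also
adapts the quantizer step size and uses a higher-order predictor; the
fixed step here is an illustrative assumption.

import numpy as np

def dpcm_encode(x, step=0.05):
    pred, codes = 0.0, []
    for sample in x:
        q = round((sample - pred) / step)   # quantized prediction error
        codes.append(q)
        pred += q * step    # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=0.05):
    pred, out = 0.0, []
    for q in codes:
        pred += q * step
        out.append(pred)
    return np.array(out)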

14
EXAMPLES OF SPEECH CODING SCHEMES
  • 64 kb/s G.711 μ-law PCM (see the companding
    sketch after this list)
  • Toll network standard for North America and Japan
  • Has the range of a linear 13-bit coder using only
    8 bits
  • 32 kb/s G.721 ADPCM
  • CCITT standard adopted in the 1980s
  • Works with μ-law or A-law PCM input
  • 16 kb/s LD-CELP (Low Delay CELP)
  • Low delay (2 ms)
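
A sketch of μ-law companding in its continuous form,
F(x) = sgn(x) · ln(1 + μ|x|) / ln(1 + μ) with μ = 255. The actual G.711
standard uses a segmented 8-bit approximation of this curve, so this is an
idealization rather than a bit-exact implementation.

import numpy as np

MU = 255.0

def mulaw_compress(x):
    # x in [-1, 1]; compresses large amplitudes more coarsely than small
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    # exact inverse of the compression curve above
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU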

15
EXAMPLES OF SPEECH CODING SCHEMES
  • 8 kb/s CELP
  • Submitted by AT&T as the standard for digital
    cell phones in North America
  • Higher delay (about 80 ms)
  • 4.8 kb/s CELP
  • Proposed for standardization by U.S. government
    for secure terminals
  • 2.4 kb/s LPC-10E
  • Standard for secure telephones by U.S. military
    and NATO

16
OVERVIEW OF SPEECH RECOGNITION
[Block diagram: speech waveform → feature extraction → speech features →
decision-making procedure → phoneme hypotheses]
  • Major functional components
  • Signal processing to extract features from speech
    waveforms
  • Comparison of features to pre-stored templates
  • Important design choices
  • Choice of features
  • Specific method of comparing features to stored
    templates

17
WHY PERFORM SIGNAL PROCESSING?
  • A look at the time-domain waveform of the word
    "six"

It's hard to infer much from the time-domain
waveform
18
WHY PERFORM SIGNAL PROCESSING IN THE FREQUENCY
DOMAIN?
  • Human hearing is based on frequency analysis
  • Use of frequency analysis often simplifies signal
    processing
  • Use of frequency analysis often facilitates
    understanding

19
SHORT-TIME FOURIER ANALYSIS
  • Problem: Conventional Fourier analysis does not
    capture the time-varying nature of speech signals
  • Solution: Multiply the signal by a finite-duration
    window function, then compute the DTFT
  • Side effect: windowing causes spectral blurring
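
A minimal short-time Fourier analysis sketch along these lines: slide a
Hamming window along the signal and take the DFT of each windowed frame.
The frame length, hop, and FFT size are typical assumed values.

import numpy as np

def stft(x, frame_len=400, hop=160, nfft=512):
    w = np.hamming(frame_len)
    frames = [x[i:i + frame_len] * w
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.array([np.fft.rfft(f, nfft) for f in frames])  # frame per row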

20
THE SPEECH SPECTROGRAM
21
EFFECT OF WINDOW DURATION
  • Short-duration window (N = 64) vs. long-duration
    window (N = 512)

22
LINEAR PREDICTION OF SPEECH
  • Find the best all-pole approximation to the
    DTFT of a segment of speech
  • All-pole model is reasonable for most speech
  • Very efficient in terms of data storage
  • Coefficients a_k can be computed efficiently
  • Phase information not preserved (not a problem
    for us)
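
A sketch of the autocorrelation method for computing the coefficients a_k
via the Levinson-Durbin recursion; order p = 12 is a typical choice at an
8-kHz sampling rate, assumed here rather than taken from the slide.

import numpy as np

def lpc(frame, p=12):
    # Autocorrelation of the (windowed, nonsilent) frame
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                # residual prediction error
    return a, err                         # A(z) coefficients and error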

23
LINEAR PREDICTION EXAMPLE
  • Spectra from the /ih/ in "six"
  • Comment: the LPC spectrum follows the spectral
    peaks well

24
FEATURES FOR SPEECH RECOGNITION: CEPSTRAL
COEFFICIENTS
  • The cepstrum is the inverse transform of the log
    of the magnitude of the spectrum
  • Useful for separating convolved signals (like the
    source and filter in the speech production model)
  • Can be thought of as the Fourier series expansion
    of the log of the magnitude of the Fourier
    transform
  • Generally provides more efficient and robust
    coding of speech information than LPC
    coefficients
  • Most common basic feature for speech recognition
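
Following the definition above, a minimal real-cepstrum sketch: the
inverse DFT of the log magnitude of the DFT of a windowed frame. The FFT
size and the small flooring constant are illustrative assumptions.

import numpy as np

def real_cepstrum(frame, nfft=512):
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # floor avoids log(0)
    # Low quefrencies capture the vocal-tract filter; high quefrencies
    # capture the excitation (source)
    return np.fft.irfft(log_mag)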

25
TWO WAYS OF DERIVING CEPSTRAL COEFFICIENTS
  • LPC-derived cepstral coefficients (LPCC)
  • Compute traditional LPC coefficients
  • Convert to cepstra using a recursive formula
  • Warp cepstra using bilinear transform
  • Mel-frequency cepstral coefficients (MFCC)
  • Compute log magnitude of DFT of windowed signal
  • Multiply by triangular Mel weighting functions
  • Compute inverse discrete cosine transform

26
COMPUTING CEPSTRAL COEFFICIENTS
  • Comments
  • MFCC is currently the most popular
    representation.
  • Typical systems include a combination of
  • MFCC coefficients
  • Delta MFCC coefficients
  • Delta delta MFCC coefficients
  • Power and delta power coefficients
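
The delta coefficients are commonly computed as a linear regression over
neighboring frames; a sketch follows, with half-width K = 2 as an assumed
(though typical) choice. Delta-delta coefficients come from applying the
same operation to the deltas.

import numpy as np

def deltas(c, K=2):
    # c: array of shape (frames, coefficients)
    pad = np.pad(c, ((K, K), (0, 0)), mode='edge')
    num = sum(k * (pad[K + k:len(c) + K + k] - pad[K - k:len(c) + K - k])
              for k in range(1, K + 1))
    return num / (2 * sum(k * k for k in range(1, K + 1)))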

27
COMPUTING LPC CEPSTRAL COEFFICIENTS
  • Procedure used in standard OGI package (I think)
    for LPC-derived cepstra (LPCC)
  • A/D conversion at 8 kHz sampling rate
  • Apply a Hamming window of 200 samples (25 ms)
    every 10 ms (100-Hz frame rate)
  • Pre-emphasize to boost high-frequency components
  • Compute first 12 auto-correlation coefficients
  • Perform Levinson-Durbin recursion to obtain 12
    LPC coefficients
  • Convert LPC coefficients to cepstral coefficients
  • Perform frequency warping to spread low
    frequencies
  • Compute Δ (delta) and ΔΔ (delta-delta)
    coefficients

28
An example: the vowel in "welcome"
  • The original time function

29
THE TIME FUNCTION AFTER WINDOWING
30
THE RAW SPECTRUM
31
PRE-EMPHASIZING THE SIGNAL
  • Typical pre-emphasis filter
  • Its frequency response
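
A typical pre-emphasis filter is the first difference
y[n] = x[n] - a * x[n-1]; the coefficient a = 0.97 below is a common
choice, assumed here because the slide's exact value is in the figure.

import numpy as np

def preemphasize(x, a=0.97):
    # Boosts high frequencies by roughly 6 dB/octave
    return np.append(x[0], x[1:] - a * x[:-1])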

32
THE SPECTRUM OF THE PRE-EMPHASIZED SIGNAL
33
THE LPC SPECTRUM
34
THE TRANSFORM OF THE CEPSTRAL COEFFICIENTS
35
CONVERTING FROM LPC COEFFICIENTS TO CEPSTRAL
COEFFICIENTS
  • Recursive conversion formula
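
The formula itself appeared as an image in the original slide; one
standard form of the recursion, for an all-pole model
H(z) = G / (1 - Σ a_k z^(-k)), is
c_n = a_n + Σ_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n taken as 0 for
n > p (sign conventions vary with how A(z) is defined). A direct sketch:

def lpc_to_cepstrum(a, n_ceps):
    # a[0] = 1 by convention; a[1..p] are the predictor coefficients
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]          # c_1 .. c_{n_ceps}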

36
THE BIG PICTURE: THE ORIGINAL SPECTROGRAM
37
EFFECTS OF LPC PROCESSING
38
COMPARING REPRESENTATIONS
  • [Spectrograms: original speech | LPCC cepstra
    (unwarped)]

39
FREQUENCY RESPONSE OF THE AUDITORY SYSTEM: TUNING
CURVES
  • Threshold level for auditory-nerve response to
    tones

40
MEL FREQUENCY WARPING OF THE SPECTRUM
  • Easy shortcut to characterizing increasing
    peripheral filter bandwidths (Stevens)
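
A widely used analytic form of the mel scale is
mel(f) = 2595 log10(1 + f / 700). Whether the slide uses this exact
expression or a tabulated warping is not shown, so treat it as an
assumption.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)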

41
COMPUTING MEL FREQUENCY CEPSTRAL COEFFICIENTS
  • Segment incoming waveform into Hamming-windowed
    frames of 20-25 ms duration
  • Compute frequency response for each frame using
    DFTs
  • Group magnitude of frequency response into 25-40
    channels using triangular weighting functions
  • Compute log of weighted magnitudes for each
    channel
  • Take inverse DFT (or DCT) of weighted magnitudes
    for each channel, producing 12-14 cepstral
    coefficients for each frame
  • Calculate Δ and ΔΔ coefficients
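
The steps above map to code roughly as follows. The sampling rate, filter
count, FFT size, and number of cepstra are illustrative, and real systems
differ in filter-bank details.

import numpy as np

def mfcc(frame, fs=16000, n_filt=30, n_ceps=13, nfft=512):
    # Magnitude of the DFT of the Hamming-windowed frame
    mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    # Triangular weighting functions spaced evenly on the mel scale
    mel_edges = np.linspace(0.0, 2595 * np.log10(1 + (fs / 2) / 700),
                            n_filt + 2)
    hz_edges = 700 * (10 ** (mel_edges / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_edges / fs).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_e = np.log(fbank @ mag + 1e-10)       # log energy per channel
    # Inverse transform via the DCT of the log channel energies
    ch = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * ch + 1)
                 / (2 * n_filt))
    return dct @ log_e                        # cepstral coefficients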

42
AN EXAMPLE: DERIVING MFCC COEFFICIENTS
43
WEIGHTING THE FREQUENCY RESPONSE
44
THE ACTUAL MEL WEIGHTING FUNCTIONS (for CMU)
45
THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
46
THE CEPSTRAL COEFFICIENTS
47
LOG SPECTRA RECOVERED FROM CEPSTRA
48
COMPARING SPECTRAL REPRESENTATIONS
  • [Spectrograms: original speech | mel log
    magnitudes | after cepstral smoothing]

49
Tracking speech sounds via fundamental frequency
  • Given good pitch estimates
  • How well can we separate signals from noise?
  • How much will this separation help in speech
    recognition?
  • To what extent can pitch be used to separate
    speech signals from one another?

50
The CMU ARCTIC database
  • Collected by John Kominek and Alan Black as a
    resource for speech synthesis
  • Contains phonetically balanced recordings with
    simultaneously recorded EGG (laryngograph)
    measurements
  • Available at http://www.festvox.org/cmu_arctic

51
The CMU ARCTIC database
Original speech
Laryngograph recording
52
Typical pitch estimates obtained from ARCTIC
  • Comment: not all outliers were successfully
    removed

53
Isolating speech by pitch
  • Method 1
  • Estimate amplitudes of partials by synchronous
    heterodyne analysis
  • Resynthesize as sums of sines or cosines
  • Unvoiced segments are problematic
  • Method 2 (see the sketch after this list)
  • Pass speech through a comb filter that tracks the
    harmonic frequencies
  • Unvoiced segments are still problematic
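
A fixed-pitch comb-filter sketch for Method 2; a real implementation would
update the delay frame by frame to follow the pitch track. The feedback
gain g is an assumed value.

import numpy as np

def comb_filter(x, fs, f0, g=0.9):
    T = int(round(fs / f0))       # one pitch period, in samples
    y = np.zeros(len(x))
    for n in range(len(x)):
        fb = g * y[n - T] if n >= T else 0.0
        y[n] = (1 - g) * x[n] + fb    # response peaks at harmonics of fs/T
    return y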

54
Resynthesizing speech using heterodyne analysis
  • Original speech samples
  • Reconstructed speech

55
Recovering speech through comb filtering
  • Pitch-tracking comb filter
  • Its frequency response
  • Original speech samples
  • Reconstructed speech

56
Separating speech signals by heterodyning and
comb filtering
  • Combined speech signals
  • Speech separated by heterodyne filters
  • Speech separated by comb filters
  • Comment: male speech masks female speech more
    because upper male harmonics are more likely to
    impinge on the lower female harmonics

57
Some comments on pitch tracking
  • Results are somewhat encouraging and could improve
    with a more careful implementation
  • Unvoiced segments and multiple speakers will be a
    problem for pitch tracking
  • The real impact on speech recognition accuracy is
    still hard to predict

58
Speech separation by source location
  • Sources arriving from different azimuths produce
    interaural time delays (ITDs) and interaural
    intensity differences (IIDs) as they arrive at
    the two ears
  • Although some experimental results indicate that
    these cues may not be primary in human source
    separation and segregation, they are still useful
    and usable in computational implementations (see
    the sketch below)
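
A sketch of ITD estimation by cross-correlating the two ear signals over
physiologically plausible lags (about ±1 ms). This illustrates the generic
cross-correlation approach, not the specific zero-crossing-based method
reported on the next slide.

import numpy as np

def estimate_itd(left, right, fs, max_ms=1.0):
    # Assumes equal-length left/right signals
    max_lag = int(fs * max_ms / 1000)
    lags = np.arange(-max_lag, max_lag + 1)
    ref = left[max_lag:len(left) - max_lag]
    xcorr = [np.dot(ref, right[max_lag + l:len(left) - max_lag + l])
             for l in lags]
    return lags[int(np.argmax(xcorr))] / fs   # ITD in seconds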

59
ASR results using reconstruction based on ITD
  • 1-sample delay at 16 kHz; ITD extracted from zero
    crossings and cross-correlation; speech masker at
    0 dB SNR

60
SUMMARY
  • We outlined some of the relevant issues
    associated with the representation of speech for
    automatic recognition, synthesis, and coding
  • The source-filter model of speech production
  • Applications to speech coding
  • Representations used for speech recognition
  • Linear prediction/linear predictive coding (LPCC)
  • Mel frequency cepstral coefficients (MFCC)
  • Applications to sound source separation
