Fundamentals of speech production and analysis

K. Marasek, 05.07.2005

1
Fundamentals of speech production and analysis
2
Speech production and analysis: Web tutorial
  • Speech production
  • Basic speech units: phoneme, syllable, word,
    phrase, sentence, speaking turn
  • phone, subphonetic units, diphone, triphone,
    syllable as recognition units
  • types of sounds
  • manner and place (constriction of the vocal tract)
    of articulation
  • vowels and consonants
  • sonorants (vowels, diphthongs, glides, liquids,
    nasals)
  • obstruents (stops, fricatives, affricates)
  • classification of consonants by vocal tract
    configuration
  • labials, dentals, alveolars, palatals, glottals
    and pharyngeals
  • transient sounds (diphthongs, glides, stops and
    affricates) and continuant sounds
  • vowels: front, back, middle, low, high (vowel
    rectangle)
  • IPA chart
  • coarticulation
  • prosodic features: sentence intonation and word
    stress
  • voice quality and paralinguistic features
  • time and frequency features
  • formants and duration, wide- and narrow-band
    spectrograms

3
Principles of speech analysis
  • Speech detection: remove silence and noise
  • signal preprocessing and conditioning
  • pre-emphasis to enhance the speech signal at higher
    frequencies: $H(z) = 1 - a z^{-1}$, $a = 0.95$
  • high-pass filtering
  • spectral analysis
  • short-time Fourier transform (STFT):
    $S_n(e^{j\omega}) = \sum_{m} s(m)\, w(n-m)\, e^{-j\omega m}$
  • where w(n-m) is a window sequence for observation
    of the n-th time instant
  • the window is usually tapered (e.g. Hamming window)
    to limit the effect of multiplying by the window in
    the time domain, which is a convolution in the
    frequency domain
  • frequency and time resolution trade-off (FFT
    principle)
  • the output is a vector of coefficients; only the
    magnitude (log-scale power spectrum) is considered
  • side effects: spectral leakage, picket-fence
    effect, etc.; biased estimator of the PSD
  • fixed-length frames do not follow F0 fluctuations;
    pitch-synchronous analysis addresses this
  • spectrogram reading exercises
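
The preprocessing and spectral analysis steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample before the first one is taken as zero, and a direct DFT is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math


def pre_emphasis(signal, a=0.95):
    """Apply H(z) = 1 - a z^-1, i.e. y(n) = s(n) - a * s(n-1)."""
    previous = 0.0  # assumption: s(-1) = 0
    out = []
    for sample in signal:
        out.append(sample - a * previous)
        previous = sample
    return out


def hamming(length):
    """Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def log_power_spectrum(frame, floor=1e-12):
    """Log power spectrum (in dB) of one Hamming-windowed frame.

    Only bins 0 .. N/2 are returned; the small floor avoids log(0).
    """
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[m] * cmath.exp(-2j * math.pi * k * m / n)
                  for m in range(n))
        spectrum.append(10.0 * math.log10(abs(acc) ** 2 + floor))
    return spectrum
```

Sliding the frame along the signal and stacking the resulting vectors gives a spectrogram; a short frame yields the wide-band variant and a long frame the narrow-band one.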

4
Other methods of speech analysis
  • Time-frequency distributions (TFDs)
  • the kernel function $\phi(\theta, \tau)$ defines the
    smoothing properties of the TFD: Wigner-Ville,
    Rihaczek, and others
  • the spectrogram is a special case of a TFD
  • no trade-off between time and frequency
    resolution: limited only by Heisenberg's
    uncertainty principle (sampling frequency), but
    with interference between signal components
  • wavelet transform: a future analysis tool?
    Non-uniform sampling of the time-frequency plane
  • Filter bank analysis
  • the most specific cues of the signal are located
    in specific frequency bands
  • FIR filters are better (linear phase) but can be
    very long; IIR filters are shorter; usually
    filtering in the frequency domain is used
  • powerful enough for small-vocabulary applications,
    e.g. 7 bands for a 60-word DTW recognizer

Filter bank for telephony
5
Wigner-Ville distribution
  • /a/ vowel

6
Wigner-Ville distribution
  • /t/ stop
  • sig1 = wavread('d:\pjwstk\charlotte\lectures\ata2.wav')
  • plot(sig1)
  • tfrwv(sig1)
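
For readers without MATLAB, the kind of computation behind a Wigner-Ville plot can be sketched in plain Python. This is a minimal discrete version written for illustration, not the implementation behind `tfrwv`; the function name and the `n_freq` parameter are assumptions. For real recordings the analytic signal is commonly used as input, which this sketch does not compute.

```python
import cmath
import math


def wigner_ville(x, n_freq):
    """Discrete Wigner-Ville distribution of a (complex) signal x.

    Returns a list of rows, one per time index; row[k] corresponds to the
    normalized frequency k / (2 * n_freq) cycles per sample.
    """
    n = len(x)
    tfr = []
    for t in range(n):
        # Largest lag that stays inside the signal and the frequency grid.
        tau_max = min(t, n - 1 - t, n_freq // 2 - 1)
        row = []
        for k in range(n_freq):
            acc = 0j
            for tau in range(-tau_max, tau_max + 1):
                kernel = x[t + tau] * complex(x[t - tau]).conjugate()
                acc += kernel * cmath.exp(-2j * math.pi * k * tau / n_freq)
            row.append(acc.real)  # the sum is real by Hermitian symmetry
        tfr.append(row)
    return tfr
```

A complex exponential at 0.125 cycles per sample, analyzed with `n_freq = 32`, concentrates its energy in bin 8 of each interior row, since 8 / (2 * 32) = 0.125.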

7
Linear Predictive Coding (LPC)
  • Wiener (1966), Markel and Gray (1976), Makhoul
    (1973)
  • ARMA model of a process, where p and q are the
    model orders of the pole and zero filters, and a
    and b represent the sets of coefficients
  • LPC is the AR (all-pole) case: to compute the
    coefficients it is necessary to define the
    prediction error, the so-called residual signal
  • the coefficients of the filter can then be
    computed by applying the least-squares criterion to
    minimize the total squared error
  • once the predictor coefficients have been
    estimated, the e(n) signal can be used for a
    perfect signal reconstruction
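
The role of the residual can be illustrated with a small sketch. It assumes the predictor convention ŝ(n) = Σ a_k s(n-k), so that e(n) = s(n) - ŝ(n), and zero signal history before the first sample; the slide's own equations are not in the transcript, so this sign convention and the function names are assumptions.

```python
def lpc_residual(signal, coeffs):
    """Prediction error e(n) = s(n) - sum_k a_k s(n-k), with s(n) = 0 for n < 0."""
    order = len(coeffs)
    residual = []
    for n in range(len(signal)):
        prediction = sum(coeffs[k - 1] * signal[n - k]
                         for k in range(1, order + 1) if n - k >= 0)
        residual.append(signal[n] - prediction)
    return residual


def lpc_reconstruct(residual, coeffs):
    """Rebuild s(n) = e(n) + sum_k a_k s(n-k) by feeding e(n) to the all-pole filter."""
    order = len(coeffs)
    signal = []
    for n in range(len(residual)):
        prediction = sum(coeffs[k - 1] * signal[n - k]
                         for k in range(1, order + 1) if n - k >= 0)
        signal.append(residual[n] + prediction)
    return signal
```

Passing the output of `lpc_residual` back through `lpc_reconstruct` with the same coefficients returns the original samples up to rounding error, which is the perfect reconstruction property mentioned above.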

8
LPC
  • speech synthesis application
  • critical: model order, quantization of parameters
    and excitation signal
  • computation of coefficients: many methods,
    usually autocorrelation or auto-covariance
  • features of LPC
  • models the peaks of the spectrum: good for
    formant frequency and bandwidth estimation
  • smoothed spectrum: spectral envelope
  • acoustic model of a tube with p/2 cylindrical
    sections
  • model order rule of thumb: sampling frequency in
    kHz + 2
  • SVD for model order estimation
  • application in speech recognition: signal
    parametrization, but not commonly used
  • RASTA filtering for noisy signals

LPC Synthesis
9
LPC-based coefficients
  • Usually the LPC coefficients themselves are not
    used, but rather parameters derived from them
  • reflection coefficients: directly obtainable
    during LPC computation (Levinson-Durbin
    recursion)
  • $E(i)$ is the total prediction error at the i-th
    recursion step and $a_l(i)$ is the l-th coefficient.
    Let $E(0) = R(0)$, where $R(i)$ is the i-th
    autocorrelation coefficient; the recursion then
    runs for $i = 1, \dots, p$
  • where $k_i$ denotes the reflection coefficient
    (PARCOR), $|k_i| < 1$
  • acoustic tube model: let $A_i$ be the cross-section
    of the i-th segment; the areas of neighboring
    sections are then related through $k_i$
  • line spectral frequencies (LSFs): related to the
    poles of the AR filter; a concentration of two or
    more LSFs in a narrow frequency interval indicates
    the presence of a resonance in the LPC spectrum
  • LPC cepstral coefficients; a Mel-based variant is
    possible
  • perceptual linear prediction (PLP, Hermansky):
    uses hearing properties, effective for noisy data
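
The recursion formulas themselves did not survive in the transcript, so the following is a sketch of the textbook Levinson-Durbin recursion for the autocorrelation method, assuming the predictor convention ŝ(n) = Σ a_l s(n-l) and R(0) > 0. The function names are illustrative.

```python
def autocorrelation(signal, max_lag):
    """R(i) = sum_m s(m) s(m+i) for i = 0 .. max_lag."""
    n = len(signal)
    return [sum(signal[m] * signal[m + lag] for m in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (predictor coefficients a_1..a_p, reflection coefficients
    k_1..k_p, final prediction error E(p)).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    reflection = []
    error = r[0]             # E(0) = R(0), assumed positive
    for i in range(1, order + 1):
        acc = r[i] - sum(a[l] * r[i - l] for l in range(1, i))
        k = acc / error      # reflection (PARCOR) coefficient k_i
        updated = a[:]
        updated[i] = k
        for l in range(1, i):
            updated[l] = a[l] - k * a[i - l]
        a = updated
        error *= 1.0 - k * k  # E(i) = (1 - k_i^2) E(i-1)
        reflection.append(k)
    return a[1:], reflection, error
```

For the autocorrelation values 1, 0.9, 0.81 of a first-order process, the second reflection coefficient comes out as zero (up to rounding), which shows how the recursion indicates that a higher model order adds nothing.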

10
LPC
  • Vowel LPC spectrum for various model orders

11
Homomorphic cepstral analysis
  • Signal decomposition into components having
    different spectral characteristics
  • the objective is to decompose a given signal s(n)
    into source e(n) and vocal tract h(n) components:
    $s(n) = e(n) * h(n)$ ($*$ denotes convolution),
    which in the frequency domain equals
    $S(\omega) = E(\omega)\, H(\omega)$
  • taking the log one gets
    $\log|S(\omega)| = \log|E(\omega)| + \log|H(\omega)|$
  • the frequency response of the vocal tract
    $\log|H|$ is a slowly varying component and
    represents the envelope of $\log|S|$, while
    $\log|E|$ is a rapidly varying excitation component
  • the components can be separated in the log
    spectral domain by computing the IFFT and retaining
    the lowest-order coefficients to account for the
    vocal-tract transfer function
  • the inverse Fourier transform of $\log|S|$ is called
    the cepstrum (real cepstrum; a complex cepstrum
    also exists)
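
A minimal sketch of the real cepstrum and of the low-order "liftering" step follows. It uses a direct DFT from the standard library rather than an FFT, and the function names, the magnitude floor and the `n_keep` parameter are assumptions made for this illustration.

```python
import cmath
import math


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of log|S(k)| for one frame (floor avoids log(0))."""
    n = len(frame)
    log_mag = []
    for k in range(n):
        s_k = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n)
                  for m in range(n))
        log_mag.append(math.log(abs(s_k) + floor))
    # log_mag is real and even for a real frame, so the inverse DFT is real.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


def low_time_lifter(cepstrum, n_keep):
    """Keep the n_keep lowest-order coefficients (and their mirror images)."""
    n = len(cepstrum)
    return [c if (q < n_keep or q > n - n_keep) else 0.0
            for q, c in enumerate(cepstrum)]
```

Transforming the liftered cepstrum back to the frequency domain gives the cepstrally smoothed log spectrum, while a strong peak at a higher index points to a periodic component: a frame consisting of a unit sample followed by a half-amplitude echo 8 samples later shows its largest non-zero-index peak at index 8.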

Block diagram of homomorphic analysis
12
Cepstral Analysis and Auditory Models
  • Cepstrally smoothed spectrum: examples
  • widely used in pattern-matching problems, because
    the Euclidean distance between two cepstral vectors
    is a good measure for comparing log-spectra
  • Auditory Models
  • separating the message from surrounding noise
  • modeling of the output of the cochlea
  • Bark or mel scale of the frequency axis: linear up
    to ca. 1000 Hz, logarithmic above
  • Acoustic features for SR
  • static: short time interval (20-50 ms)
  • dynamic: change of parameters
  • The features describe the front end of the recognizer

13
Filter bank based coefficients
  • Reduce the dimensionality of the spectral signal
    representation
  • fundamental decisions: structure of the filter
    bank (number of filters, their response and
    spacing in frequency)
  • symmetric triangular filters are used to weight DFT
    values: a quick and dirty approximation of
    band-pass filtering
  • Example of a filter bank (24 triangular filters)
    spaced according to the Mel scale
  • Mel-based cepstral coefficients (MFCC), most
    popular in ASR: usually computed as the IFFT of the
    log-energy output of a filter bank consisting of i
    triangular filter masks
  • c0 approximates the log-energy of the signal; the
    higher coefficients represent log-energy ratios
    between bands (e.g. c1 gives the log-energy ratio
    between the intervals [0, Fs/4] and [Fs/4, Fs/2]:
    higher for sonorants, lower for fricatives), but
    for higher coefficients the interpretation is
    complicated
  • the IFFT is an orthogonal transform, i.e. the
    coefficients are (approximately) uncorrelated, so
    simplified acoustic models can be used
  • MFCC speech reconstruction (IBM, ICASSP-2000)
14
Fundamental Frequency and Formants
  • F0 estimation (Hess): determining the main period
    in a quasi-periodic waveform
  • usually using the autocorrelation function (ACF)
    and the average magnitude difference function
    (AMDF), where L is the frame length and $N_p$ is
    the number of point pairs (a peak in the ACF and a
    valley in the AMDF indicate F0)
  • usually the speech signal is first low-pass filtered
    to avoid the influence of formants
  • cepstral analysis: peak at T0
  • Formant frequency estimation
  • resonances in the vocal tract are related to the
    complex poles of the LPC model,
    $z_k = \mathrm{Re}(z_k) + j\,\mathrm{Im}(z_k)$
  • the cepstrally smoothed spectrum is also used
  • a lot of methods, but...
  • tracking of formant frequencies is a problem not
    solved yet
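
The ACF and AMDF formulas are missing from the transcript, so the sketch below uses their usual textbook forms: the ACF as a sum of products over the available point pairs, and the AMDF as the mean absolute difference over those pairs. The function names and the F0 search-range parameters are assumptions, and the frame must be longer than the largest lag searched.

```python
def acf(frame, lag):
    """Autocorrelation at one lag, summed over the available point pairs."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def amdf(frame, lag):
    """Average magnitude difference at one lag (mean over the point pairs)."""
    pairs = len(frame) - lag
    return sum(abs(frame[n] - frame[n + lag]) for n in range(pairs)) / pairs


def estimate_f0(frame, sample_rate, f0_min, f0_max, method="acf"):
    """Pick the lag with the ACF peak or the AMDF valley inside the F0 range."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    lags = range(min_lag, max_lag + 1)
    if method == "acf":
        best = max(lags, key=lambda lag: acf(frame, lag))
    else:
        best = min(lags, key=lambda lag: amdf(frame, lag))
    return sample_rate / best
```

Restricting the lag range matters: a perfectly periodic frame has equally deep AMDF valleys at every multiple of the period, so a search range that is too wide invites octave errors.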

15
Dynamic features
  • Temporal variation and contextual dependency
  • time derivative features
  • not sensitive to slow channel-dependent
    variations of static parameters
  • the first-order difference is affected by various
    types of noise, so smoothing is necessary
  • polynomial expansion of time derivatives (Furui)
  • second-order derivatives (acceleration) are also
    often used
  • Typical set of parameters: E, 12 MFCC, ΔE, ΔMFCC,
    ΔΔE, ΔΔMFCC; the observation vector consists of 39
    parameters
  • Other types of dynamic features
  • spectral variation function
  • dynamic cepstrum
  • Karhunen-Loève transformation (KLT): segmenting
    speech into subword units depending only on
    acoustic properties, without a priori defined
    units such as phonemes
  • RASTA processing: band-pass filtering
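
The 39-parameter observation vector can be illustrated with the widely used regression formula for smoothed time derivatives, d(t) = Σ k (c(t+k) - c(t-k)) / (2 Σ k²). Whether this matches the exact expansion meant on the slide is an assumption, as are the window size of 2 frames, the edge handling (repeating the first and last frame) and the function names.

```python
def delta(features, window=2):
    """Smoothed first-order time derivative of a sequence of feature vectors."""
    n_frames = len(features)
    norm = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for d in range(len(features[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = features[min(t + k, n_frames - 1)][d]   # repeat last frame
                behind = features[max(t - k, 0)][d]             # repeat first frame
                acc += k * (ahead - behind)
            vec.append(acc / norm)
        out.append(vec)
    return out


def observation_vectors(static):
    """Concatenate static, delta and delta-delta features frame by frame."""
    first = delta(static)
    second = delta(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]
```

With 13 static parameters per frame (E plus 12 MFCC), `observation_vectors` returns 39 values per frame.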
