1
ON REAL-TIME MEAN-AND-VARIANCE NORMALIZATION OF
SPEECH RECOGNITION FEATURES
  • Pere Pujol, Dušan Macho, and Climent Nadeu
  • National ICT and TALP Research Center, Universitat
    Politècnica de Catalunya, Barcelona, Spain
  • Presenter: Chen, Hung-Bin

ICASSP 2006
2
Outline
  • Introduction
  • On-line versions of mean and variance normalization (MVN)
  • Experiments
  • Conclusions

3
Introduction
  • In a real-time system, however, off-line estimation of the
    mean and variance involves a long delay that is likely
    unacceptable
  • In this paper, we report an empirical investigation of
    several on-line versions of the mean and variance
    normalization (MVN) technique and the factors affecting
    their performance
  • Segment-based updating of the mean and variance
  • Recursive updating of the mean and variance

4
ISSUES INVOLVED IN REAL-TIME MVN
  • Mean and variance normalization, defined below
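The normalization equation itself appears only as an image on
the original slide. As a reconstruction of the standard
utterance-level MVN formulation (not copied from the slide),
each feature coefficient i is normalized to zero mean and unit
variance over the T frames of the utterance:

```latex
\hat{x}_t[i] = \frac{x_t[i] - \mu[i]}{\sigma[i]},
\qquad
\mu[i] = \frac{1}{T}\sum_{t=1}^{T} x_t[i],
\qquad
\sigma^2[i] = \frac{1}{T}\sum_{t=1}^{T}\left(x_t[i] - \mu[i]\right)^2
```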

5
On-line versions: issues
  • Segment-based updating of the mean and variance
  • there is a delay of half the length of the window
  • Recursive updating of the mean and variance
  • the estimates are initialized using the first D frames of
    the current utterance and then recursively updated as new
    frames arrive (both variants are sketched below)
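As a concrete illustration of the two on-line variants, here is
a minimal single-coefficient sketch in Python. The window
length, the initialization length D, and the small variance
floor are illustrative assumptions; β = 0.992 is the value
reported later in the deck.

```python
import numpy as np

def segment_mvn(x, win=200):
    """Segment-based MVN: each frame is normalized with the mean and
    standard deviation of a fixed-length window centered on it, which
    implies a look-ahead (delay) of win/2 frames."""
    T = len(x)
    y = np.empty(T)
    for t in range(T):
        seg = x[max(0, t - win // 2):min(T, t + win // 2 + 1)]
        y[t] = (x[t] - seg.mean()) / (seg.std() + 1e-8)
    return y

def recursive_mvn(x, D=10, beta=0.992):
    """Recursive MVN: statistics are initialized on the first D frames
    of the utterance and then updated exponentially as frames arrive."""
    mu, var = x[:D].mean(), x[:D].var()
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        if t >= D:  # past the initialization segment, keep updating
            mu = beta * mu + (1 - beta) * xt
            var = beta * var + (1 - beta) * (xt - mu) ** 2
        y[t] = (xt - mu) / (np.sqrt(var) + 1e-8)
    return y
```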

6
EXPERIMENTAL SETUP AND BASELINE RESULTS
  • A subset of the office-environment recordings from the
    Spanish version of the Speecon database
  • Speech recognition experiments were carried out with digit
    strings
  • The database includes recordings with 4 microphones
  • a head-mounted close-talk (CT) microphone
  • a lavalier microphone
  • a directional microphone situated at 1 meter from the
    speaker
  • an omni-directional microphone placed at 2-3 meters from
    the speaker
  • 125 speakers were chosen for training and 75 for testing
  • both sets balanced in terms of sex and dialect

7
Baseline system results
  • The resulting parameters were used as features, along with
    their first- and second-order time derivatives (computed as
    sketched below)
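The results table is not reproduced in this transcript. As an
illustration of the derivative features the slide mentions,
below is the common regression formula for delta coefficients;
the half-width N and the edge padding are assumptions, since the
slide does not specify them.

```python
import numpy as np

def deltas(feats, N=2):
    """First-order time derivatives of a (T, D) feature matrix using
    the standard regression over +/- N neighboring frames; second-order
    derivatives are obtained by applying this function to its output."""
    T = feats.shape[0]
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')  # repeat edges
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
               for n in range(1, N + 1)) / denom
```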

8
Segment-based updating of mean and variance: results
  • a sliding fixed-length window centered on the current frame
  • there is a delay of half the length of the window

9
Recursive updating of mean and variance: results
  • Table 3 shows the results using different look-ahead values
    in the recursive MVN
  • The entire look-ahead interval is used to calculate the
    initial estimates of the mean and variance
  • in the case of 0 sec, the first 100 ms are used

10
Recursive updating of mean and variance: results
  • The initial estimates were computed as in UTT-MVN
    (utterance-level MVN)
  • This reinforces the importance of good initial estimates of
    the mean and variance
  • In our case, β was experimentally set to 0.992 (see the
    update equations below)
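The slide's own update equations appear as an image; a common
exponential-forgetting formulation consistent with the quoted β
is the following reconstruction:

```latex
\mu_t = \beta\,\mu_{t-1} + (1-\beta)\,x_t,
\qquad
\sigma_t^2 = \beta\,\sigma_{t-1}^2 + (1-\beta)\,(x_t - \mu_t)^2
```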

11
INITIAL MEAN AND VARIANCE ESTIMATED FROM PAST DATA ONLY
  • the techniques in this category do not use the current
    utterance to compute the initial estimates of the mean and
    variance of the features
  • the different data sources:
  • Current session
  • utterances recorded in the current session, with fixed
    microphones, environment, and speaker
  • utterances not included in the test set
  • Set of sessions
  • utterances from a set of sessions of the Speecon database
    instead of utterances from the same session

12
Initial estimates from past data: results
13
Conclusions
  • we observed that the usefulness of mean and variance
    updating increases when the initial estimates are not
    representative enough of a given utterance
  • in that case, a recursive MVN performs better than a
    segment-based MVN

14
GENERALIZED PERCEPTUAL FEATURES FOR VOCALIZATION ANALYSIS
ACROSS MULTIPLE SPECIES
  • Patrick J. Clemins, Marek B. Trawicki, Kuntoro
    Adi, Jidong Tao, and Michael T. Johnson
  • Marquette University
  • Department of Electrical and Computer Engineering
  • Speech and Signal Processing Lab
  • Presenter: Chen, Hung-Bin

ICASSP 2006
15
Outline
  • Introduction
  • The Greenwood Function Cepstral Coefficient (GFCC) model
  • Experiments
  • Conclusions

16
Introduction
  • This paper introduces the Greenwood Function
    Cepstral Coefficient (GFCC) and Generalized
    Perceptual Linear Prediction (GPLP) feature
    extraction models for the analysis of animal
    vocalizations across arbitrary species
  • Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual
    Linear Prediction (PLP) coefficients are well-established
    feature representations for human speech analysis and
    recognition tasks
  • To generalize the frequency warping component across
    arbitrary species, we build on the work of Greenwood

17
Greenwood Function Cepstral Coefficient (GFCC) model
  • Greenwood found that many mammals perceive frequency on a
    logarithmic scale along the cochlea
  • He modeled this relationship with an equation of the form
    f = A(10^(ax) - k), where a, A, and k are species-specific
    constants and x is the cochlea position
  • This equation can be used to define a frequency warping
    through the following equations for real frequency, f, and
    perceived frequency, fp:
    Fp(f) = (1/a) log10(f/A + k),  Fp^-1(fp) = A(10^(a*fp) - k)

18
Greenwood Function Cepstral Coefficient (GFCC) model
  • Assuming this value for k, the other two constants can be
    solved for given the approximate hearing range (fmin, fmax)
    of the species under study
  • By setting Fp(fmin) = 0 and Fp(fmax) = 1, the following
    equations for a and A are derived:
    A = fmin / (1 - k),  a = log10(fmax/A + k)
  • Thus, using the species-specific values for fmin and fmax
    and an assumed value of k = 0.88, a frequency warping
    function can be constructed (see the sketch below)
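Under the assumptions above (k = 0.88 and the warping normalized
so that Fp(fmin) = 0 and Fp(fmax) = 1), the warping and its
inverse can be sketched as follows. The elephant hearing range
in the example is an illustrative guess, not a value from the
paper.

```python
import numpy as np

def greenwood_constants(fmin, fmax, k=0.88):
    """Solve for A and a from the species hearing range [fmin, fmax]
    using Fp(fmin) = 0 and Fp(fmax) = 1."""
    A = fmin / (1.0 - k)
    a = np.log10(fmax / A + k)
    return A, a

def greenwood_warp(f, fmin, fmax, k=0.88):
    """Perceived frequency Fp(f) in [0, 1] for real frequency f in Hz."""
    A, a = greenwood_constants(fmin, fmax, k)
    return np.log10(f / A + k) / a

def greenwood_unwarp(fp, fmin, fmax, k=0.88):
    """Inverse warping: real frequency for a perceived frequency fp."""
    A, a = greenwood_constants(fmin, fmax, k)
    return A * (10.0 ** (a * fp) - k)

# Filter center frequencies spaced uniformly on the perceptual scale
# (the positioning idea behind Figure 1): 20 filters, with an assumed
# 10 Hz - 10 kHz hearing range for illustration only.
centers = greenwood_unwarp(np.linspace(0.0, 1.0, 22)[1:-1], 10.0, 10000.0)
```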

19
Greenwood Function Cepstral Coefficient (GFCC) model
  • Figure 1 shows GFCC filter positioning for
    African Elephant and Passeriformes (songbird)
    species compared to the Mel-Frequency scale

20
Greenwood Function Cepstral Coefficient (GFCC) model
  • Figure 2 shows GPLP equal loudness curves for
    African Elephant and Passeriformes (songbird)
    species compared to the human equal loudness
    curve

21
Experiments
  • Speaker independent song-type classification
    experiments were performed across the 5 most
    common song types using 50 exemplars of each
    song-type, each containing multiple song-type
    variants.

22
Experiments
  • Song-type dependent speaker identification
    experiments were performed using 25 exemplars of
    the most frequent song-type for each of the 6
    vocalizing buntings.

23
Experiments
  • Call-type dependent speaker identification experiments were
    performed using the entire Loxodonta africana (African
    elephant) dataset
  • There were a total of six elephants, with 20, 30, 14, 17,
    34, and 28 rumble exemplars per elephant

24
Conclusions
  • New feature extraction models have been
    introduced for application to analysis and
    classification tasks of animal vocalizations.