ON REAL-TIME MEAN-AND-VARIANCE NORMALIZATION OF SPEECH RECOGNITION FEATURES
- Pere Pujol, Dušan Macho, and Climent Nadeu
- National ICT / TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
- Presenter: Chen, Hung-Bin
ICASSP 2006
Outline
- Introduction
- On-line versions of mean and variance normalization (MVN)
- Experiments
- Conclusions
Introduction
- In a real-time system, however, the off-line estimation of the MVN statistics involves a long delay that is likely unacceptable
- In this paper, we report an empirical investigation of several on-line versions of the mean and variance normalization (MVN) technique and the factors affecting their performance
  - Segment-based updating of the mean and variance
  - Recursive updating of the mean and variance
ISSUES INVOLVED IN REAL-TIME MVN
- Mean and variance normalization
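The normalization formula itself does not survive in these notes; in its standard per-utterance form, MVN subtracts the mean and divides by the standard deviation of each feature component:

```latex
\hat{x}_t = \frac{x_t - \mu}{\sigma}, \qquad
\mu = \frac{1}{T}\sum_{t=1}^{T} x_t, \qquad
\sigma^2 = \frac{1}{T}\sum_{t=1}^{T} \left(x_t - \mu\right)^2
```

Here $x_t$ is one feature component at frame $t$ and $T$ is the number of frames used for the estimate (the whole utterance in off-line MVN; a shorter history in the on-line variants discussed next).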
on-line version issues
- Segment-based updating of the mean and variance
  - there is a delay of half the length of the window
- Recursive updating of the mean and variance
  - the estimates are initialized using the first D frames of the current utterance and then recursively updated as new frames arrive
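As a sketch (not the authors' exact implementation), the recursive variant can be written as: initialize the mean and variance from the first D frames, then update them with a forgetting factor β as each new frame arrives (the paper reports β = 0.992 in its experiments; the default D here is an assumption):

```python
import numpy as np

def recursive_mvn(frames, d_init=10, beta=0.992):
    """Recursive MVN sketch: initialize the mean/variance from the first
    d_init frames of the utterance, then update them recursively as new
    frames arrive. beta is the forgetting factor (0.992 in the paper);
    d_init is an assumed illustrative value for D."""
    frames = np.asarray(frames, dtype=float)
    mean = frames[:d_init].mean(axis=0)
    var = frames[:d_init].var(axis=0)
    out = np.empty_like(frames)
    for t, x in enumerate(frames):
        if t >= d_init:  # keep the initial estimates during the first D frames
            mean = beta * mean + (1.0 - beta) * x
            var = beta * var + (1.0 - beta) * (x - mean) ** 2
        out[t] = (x - mean) / np.sqrt(var + 1e-8)  # epsilon avoids division by 0
    return out
```

Because only running statistics are kept, each frame can be emitted as soon as it arrives (after the initial D frames), which is what makes this variant attractive for real-time use.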
EXPERIMENTAL SETUP AND BASELINE RESULTS
- A subset of the office-environment recordings from the Spanish version of the Speecon database
- Speech recognition experiments were carried out with digit strings
- The database includes recordings with 4 microphones:
  - a head-mounted close-talk (CT) microphone
  - a Lavalier microphone
  - a directional microphone situated at 1 meter from the speaker
  - an omni-directional microphone placed at 2-3 meters from the speaker
- 125 speakers were chosen for training and 75 for testing
  - both sets balanced in terms of sex and dialect
baseline system results
- the resulting parameters were used as features, along with their first- and second-order time derivatives
Segment-based updating of mean and variance: results
- a sliding fixed-length window centered on the current frame
- there is a delay of half the length of the window
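A minimal sketch of the segment-based variant (window length and all names are illustrative, not from the paper): each frame is normalized with the statistics of a fixed-length window centered on it, so frame t cannot be emitted until frame t + win/2 has arrived, i.e. the delay is half the window length:

```python
import numpy as np

def segment_mvn(frames, win=200):
    """Segment-based MVN sketch: normalize each frame with the mean and
    standard deviation of a fixed-length window centered on it. Emitting
    frame t requires frames up to t + win//2, hence a look-ahead (delay)
    of half the window length."""
    frames = np.asarray(frames, dtype=float)
    half = win // 2
    out = np.empty_like(frames)
    for t in range(len(frames)):
        seg = frames[max(0, t - half): t + half + 1]  # window centered on t
        out[t] = (frames[t] - seg.mean(axis=0)) / (seg.std(axis=0) + 1e-8)
    return out
```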
Recursive updating of mean and variance: results
- Table 3 shows the results using different look-ahead values in the recursive MVN
- The entire look-ahead interval is used to calculate the initial estimates of the mean and variance
  - in the case of 0 sec, the first 100 ms are used
Recursive updating of mean and variance: results (cont.)
- The initial estimates were computed as in UTT-MVN
- This confirms the benefit of good initial estimates of the mean and variance
- In our case β was experimentally set to 0.992
INITIAL MEAN AND VARIANCE ESTIMATED FROM PAST DATA ONLY
- Methods in this category do not use the current utterance to compute the initial estimates of the mean and variance of the features
- The different data sources:
  - Current session
    - utterances from the current session, with fixed microphones, environment, and speaker
    - utterances not included in the test set
  - Set of sessions
    - using utterances from a set of sessions of the Speecon database instead of utterances from the same session
results with initial estimates from past data
Conclusions
- In that case (initial estimates from past data only), a recursive MVN performs better than a segment-based MVN
- We observed that the usefulness of mean and variance updating increases when the initial estimates are not representative enough of a given utterance
GENERALIZED PERCEPTUAL FEATURES FOR VOCALIZATION ANALYSIS ACROSS MULTIPLE SPECIES
- Patrick J. Clemins, Marek B. Trawicki, Kuntoro Adi, Jidong Tao, and Michael T. Johnson
- Marquette University
- Department of Electrical and Computer Engineering
- Speech and Signal Processing Lab
- Presenter: Chen, Hung-Bin
ICASSP 2006
Outline
- Introduction
- The Greenwood Function Cepstral Coefficient (GFCC) model
- Experiments
- Conclusions
Introduction
- This paper introduces the Greenwood Function Cepstral Coefficient (GFCC) and Generalized Perceptual Linear Prediction (GPLP) feature extraction models for the analysis of animal vocalizations across arbitrary species
- Mel-Frequency Cepstral Coefficients (MFCCs) and Perceptual Linear Prediction (PLP) coefficients are well-established feature representations for human speech analysis and recognition tasks
- To generalize the frequency warping component across arbitrary species, we build on the work of Greenwood
Greenwood Function Cepstral Coefficient (GFCC) model
- Greenwood found that many mammals perceive frequency on a logarithmic scale along the cochlea
- He modeled this relationship with an equation in which a, A, and k are species-specific constants and x is the cochlea position
- This equation can be used to define a frequency warping through equations relating the real frequency, f, and the perceived frequency, fp
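The equations themselves did not survive extraction; the standard form of the Greenwood function, and the frequency warping derived from it in the GFCC literature, is (reconstructed, not copied from the slide):

```latex
f = A\left(10^{a x} - k\right)
\qquad\Longrightarrow\qquad
F_p(f) = \frac{1}{a}\,\log_{10}\!\left(\frac{f}{A} + k\right)
```

where $x$ is the position along the cochlea, $f$ the real frequency, and $F_p(f)$ the perceived frequency; $a$, $A$, and $k$ are the species-specific constants named above.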
Greenwood Function Cepstral Coefficient (GFCC) model (cont.)
- Assuming this value for k, the other two constants can be solved for, given the approximate hearing range (fmin, fmax) of the species under study
- By setting Fp(fmin) = 0 and Fp(fmax) = 1, the following equations for a and A are derived
- Thus, using the species-specific values of fmin and fmax and an assumed value of k = 0.88, a frequency warping function can be constructed
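The construction above can be sketched in a few lines: solving Fp(fmin) = 0 gives A = fmin / (1 − k), and Fp(fmax) = 1 then gives a = log10(fmax / A + k). The hearing range used below is an assumed illustrative value, not one from the paper:

```python
import math

def greenwood_warp(fmin, fmax, k=0.88):
    """Build the Greenwood frequency-warping function for a species with
    hearing range [fmin, fmax], assuming k = 0.88 as in the paper.
    A and a are solved from Fp(fmin) = 0 and Fp(fmax) = 1."""
    A = fmin / (1.0 - k)           # from Fp(fmin) = 0
    a = math.log10(fmax / A + k)   # from Fp(fmax) = 1
    def warp(f):
        # perceived frequency, mapped onto a [0, 1] scale
        return math.log10(f / A + k) / a
    return warp

# e.g. an assumed hearing range of 10 Hz - 10 kHz
warp = greenwood_warp(10.0, 10000.0)
```

By construction, `warp(fmin)` is 0 and `warp(fmax)` is 1, so the same code yields a species-appropriate filterbank placement for any animal whose hearing range is known.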
Greenwood Function Cepstral Coefficient (GFCC) model (cont.)
- Figure 1 shows the GFCC filter positioning for African elephant and Passeriformes (songbird) species compared to the Mel-frequency scale
Greenwood Function Cepstral Coefficient (GFCC) model (cont.)
- Figure 2 shows the GPLP equal-loudness curves for African elephant and Passeriformes (songbird) species compared to the human equal-loudness curve
experiments
- Speaker-independent song-type classification experiments were performed across the 5 most common song types, using 50 exemplars of each song type, each containing multiple song-type variants.
experiments (cont.)
- Song-type-dependent speaker identification experiments were performed using 25 exemplars of the most frequent song type for each of the 6 vocalizing buntings.
experiments (cont.)
- Call-type-dependent speaker identification experiments were performed using the entire Loxodonta africana dataset
- There were a total of six elephants, with 20, 30, 14, 17, 34, and 28 rumble exemplars per elephant
Conclusions
- New feature extraction models have been introduced for application to analysis and classification tasks of animal vocalizations.