1
Mixed Feelings About Using Phoneme-Level Models
in Emotion Recognition
  • Hannes Pirker
  • Austrian Research Institute for Artificial
    Intelligence (OFAI)
  • Vienna

Poster presented at ACII-2007, September 16-19, Lisbon, Portugal
2
Abstract
  • This study deals with the application of MFCC-based models both for the recognition of emotional speech and for the recognition of emotions in speech. More specifically, it investigates the performance of phone-level models. First, results from performing forced alignment for the phonetic segmentation of GEMEP, a novel multimodal corpus of acted emotional utterances, are presented; then the newly acquired segmentations are used for experiments with emotion recognition, comparing phone-level models with sentence-level models.
1. Motivation
The Geneva Multimodal Emotion Portrayals corpus (GEMEP) [1] is a novel corpus of highly controlled and uniform content, containing 2 pseudo-linguistic utterances produced with 18 different emotions and varying levels of intensity. This uniform content provides a promising basis for investigating both the acoustic correlates of emotions and timing issues in the synchronization between speech, facial expressions and body movements. In order to provide a sound basis for the investigation of fine-grained temporal issues, phonetic segmentation at the level of individual speech sounds was to be performed. Apart from this practical goal, it was also to be investigated how well standard segmentation techniques, i.e. Hidden Markov Models (HMM) with Mel Frequency Cepstral Coefficients (MFCC) [4], would cope with the amount of variability in manner and voice quality typically found in emotional speech. The third goal was to shed some light on the conflicting requirements on MFCCs: in speech recognition they are used to discriminate speech sounds and valued for their robustness against variation in intonation and voice quality, but they have also become popular in emotion recognition [3], where they should ideally be indifferent to the underlying speech sounds and sensitive to voice quality. As the identity of the speech sounds is a major influencing factor on MFCCs, we compare emotion classifiers that rely on phoneme-level models with classifiers that rely on sentence-level models.
3
  • 2. Database Description
  • The Geneva Multimodal Emotion Portrayals corpus (GEMEP) consists of audio and video recordings of 10 speakers (professional French-speaking actors), 5 of whom are female. Verbal content is restricted to only 2 different pseudo-linguistic sentences:
  • (type1) Ne kal ibam soud mol'en!
  • (type2) Koun s'e mina lod belam?
  • These were uttered with 18 different emotions, which were chosen for extensive coverage of both the activation (high-low) and the evaluation (positive-negative) dimension.
  • Utterances were also produced in a less intense, a more intense and a 'masked' manner (i.e. unsuccessfully trying to hide the actual emotion from the audience). On average, utterances were repeated 8 times per condition.
  • This results in 3815 sentences: 2739 of type 1 and 1076 of type 2.

Emotion categories in the GEMEP corpus; the subset of 6 selected emotions is set in bold.
3. Phonetic Segmentation

3.2 Workflow
Starting with 50 manually segmented sentences for training an initial set of HMMs, several cycles of training, alignment, manual correction and re-training with the enlarged set were performed. By now, 1313 manually validated sentences are available, currently split into 892 training and 421 test samples. To make use of the constrained content of the corpus, HMMs were trained for each individual sound, i.e. 4 different models for /a/ were trained. Apart from unconstrained training (Global models), specific models for male and female speakers (per Gender), for each actor (per Speaker) and for each emotion (per Emotion) were constructed and evaluated.
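
This training-alignment-correction cycle amounts to a simple bootstrap loop. The sketch below is purely illustrative: the HTK training and alignment steps and the human correction step are passed in as callables, and none of the names are actual OFAI or HTK tools.

```python
# Illustrative bootstrap loop for growing a manually validated segmentation set.
# 'train', 'align' and 'correct' are placeholders for HTK model training,
# HTK forced alignment and manual correction; they are not real tool names.
def bootstrap_segmentation(seed, pool, train, align, correct,
                           cycles=4, batch_size=200):
    validated = list(seed)                # e.g. the 50 hand-segmented sentences
    for _ in range(cycles):
        models = train(validated)         # re-train phone HMMs on the grown set
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Align the next batch automatically, then have it corrected by hand.
        validated += [correct(align(models, sentence)) for sentence in batch]
    return train(validated), validated
```
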
4
  • 3.1. Technical Procedure
  • The original audio data was downsampled to 22.05 kHz and high-pass filtered at 55 Hz. MFCCs were calculated with a frame size of 30 ms and a window shift of 2.5 ms (i.e. 400 frames per second). 12 MFCCs, energy, deltas and accelerations were included, resulting in a 39-dimensional feature vector.
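
As a rough illustration of this front-end (not the original HTK pipeline), the following sketch computes an equivalent 39-dimensional feature stream with librosa; c0 stands in for the separate energy term and the 55 Hz high-pass filter is omitted.

```python
# Sketch of the 39-dimensional MFCC front-end: 30 ms frames, 2.5 ms shift
# (400 frames/s), 12 cepstra plus an energy-like term, plus delta and
# acceleration coefficients. librosa is used as a stand-in for HTK's feature
# extraction; c0 approximates the energy feature used in the study.
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=22050)          # downsample to 22.05 kHz
    n_fft = int(0.030 * sr)                       # 30 ms frame size
    hop = int(0.0025 * sr)                        # 2.5 ms window shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)           # first derivatives
    accel = librosa.feature.delta(mfcc, order=2)  # second derivatives
    return np.vstack([mfcc, delta, accel]).T      # shape: (frames, 39)
```
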
  • For the phoneme-level models, Hidden Markov Models with 3 states and 5 Gaussian mixtures per state were employed. The Baum-Welch algorithm was used for training the HMMs, and Viterbi decoding was performed in order to retrieve the segment boundaries. Extraction of MFCC features as well as training and application of the HMMs were performed with the respective tools provided by HTK [4].
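
A minimal sketch of these phone models, using hmmlearn as a stand-in for HTK's training tools (an assumption): one 3-state HMM with 5 Gaussian mixtures per state for each phone. The left-to-right constraint and the composite forced-alignment pass over whole sentences are not reproduced here.

```python
# Sketch of the phone-level acoustic models: one 3-state HMM with 5
# diagonal-covariance Gaussian mixtures per state for each phone, trained
# with Baum-Welch. hmmlearn stands in for HTK's re-estimation tools; a
# left-to-right topology would additionally require constraining the
# transition matrix, which is omitted for brevity.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_phone_models(segments):
    """segments: dict mapping phone label -> list of (n_frames, 39) arrays."""
    models = {}
    for phone, examples in segments.items():
        X = np.concatenate(examples)             # stack all example frames
        lengths = [len(e) for e in examples]     # frame count per example
        hmm = GMMHMM(n_components=3, n_mix=5,
                     covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)                      # Baum-Welch re-estimation
        models[phone] = hmm
    return models
```
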
  • 3.2. Evaluation
  • The absolute position error at the initial phone boundaries was used for evaluation. Quantiles and error thresholds provide a meaningful measure: approx. 85% of segments are located with an error of less than 20 ms and should not require manual correction. Below, the performance for two differently sized training sets (N=503 vs. N=892) and for gender-, speaker- and emotion-specific phoneme models is illustrated.
  • The results indicate that the performance of the aligner is already levelling out, i.e. for type 1 the increase of the training data from 395 to 685 sentences no longer shows much of an effect, while the less frequent type 2 still benefits from the increase in training size. Also, further fractionating the training set by using speaker-specific models etc. no longer decreases the performance dramatically.
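
The evaluation itself reduces to absolute time differences between automatic and manual boundaries, summarised by quantiles and a 20 ms tolerance; a minimal sketch:

```python
# Sketch of the segmentation evaluation: absolute position error at phone
# boundaries, summarised by the 85% quantile and the share of boundaries
# that fall within a 20 ms tolerance (and thus need no manual correction).
import numpy as np

def boundary_error_summary(auto_bounds, manual_bounds, tolerance=0.020):
    errors = np.abs(np.asarray(auto_bounds) - np.asarray(manual_bounds))
    return {
        "85%_quantile_s": float(np.quantile(errors, 0.85)),
        "frac_within_tolerance": float(np.mean(errors <= tolerance)),
    }
```
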

5
  • This is in line with the subjective impression that normal cases were handled remarkably well by models trained on small sets, but persistently problematic classes exist, e.g. flustered speech with its weakly articulated formant structures. Soft speech, even of very low intensity, as well as loud or even shouted speech are less critical. The most problematic cases are intermingled laughter, hesitations etc., which pose severe problems to any procedure based on forced alignment.
  • 4. Emotion Classification
  • 4.1 Technical Procedure
  • In order to test the capabilities of phone-level MFCC modelling for emotion classification, virtually the same methods as used for phonetic segmentation were re-applied. For phone-level modelling, emotion-specific 3-state left-to-right HMMs with 5 Gaussian mixtures were trained. Alternatively, two sentence-level HMM topologies were tested: an 'elongated' version of the phone-level model with 22 states/5 mixtures, and several 1-state HMMs, which are equivalent to a Gaussian Mixture Model. The recognition grammar of the aligner was adapted as shown in Fig. 2. A majority vote was performed on the outcome of the Viterbi decoding.
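
A simplified sketch of this decision rule (illustrative only; in the study the per-segment decisions came from Viterbi decoding against the adapted recognition grammar): each aligned phone segment is scored against every emotion's model for that phone, and a majority vote over the per-phone winners yields the sentence label. Here emotion_models is assumed to map an emotion to its per-phone HMMs.

```python
# Sketch of phone-level emotion classification with a majority vote:
# each aligned phone segment is scored against every emotion's model for
# that phone, and the sentence label is the most frequent per-phone winner.
from collections import Counter

def classify_sentence(phone_segments, emotion_models):
    """phone_segments: list of (phone, frames); emotion_models[emo][phone]: HMM."""
    votes = []
    for phone, frames in phone_segments:
        best = max(emotion_models,
                   key=lambda emo: emotion_models[emo][phone].score(frames))
        votes.append(best)
    return Counter(votes).most_common(1)[0][0]   # majority vote
```
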
  • For the experiments with emotion recognition, a test set of 875 samples was reserved, leaving a maximum of 2940 samples for training (i.e. a 77%/23% split). Currently only the sentence-level models could make use of the whole training set, as they do not require any pre-segmented data. The phone-based models were for now trained on the smaller set of 892 samples, as manually segmented data was required.

Fig. 1: 85% error quantile (ms) per segment for the two sentences in the corpus (N_train = 892).
6
18 categories is an untypically high number; for better comparability, all experiments were also performed with a subset of 6 emotions (anger, joy, fear, sadness, pleasure, interest). For the sentence-level HMMs, different topologies were compared. On the full set of 2940 training samples, a single-state model with an exceedingly high number of Gaussian mixtures (512) performed best. This model was compared with phone-level modelling.
Fig. 2: Recognition grammar for phone-level and sentence-level modelling.
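
Because a 1-state HMM with a large Gaussian mixture is equivalent to a GMM, the sentence-level baseline can be sketched with scikit-learn as a stand-in for the HTK models used in the study (an assumption): one 512-component diagonal-covariance mixture per emotion, with classification by frame log-likelihood.

```python
# Sketch of the sentence-level baseline: one large diagonal-covariance GMM
# per emotion (512 components, as in the best-performing configuration),
# with classification by the average per-frame log-likelihood of a sentence.
from sklearn.mixture import GaussianMixture

def train_sentence_models(frames_by_emotion, n_components=512):
    """frames_by_emotion: dict mapping emotion -> (n_frames, 39) array."""
    models = {}
    for emotion, frames in frames_by_emotion.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(frames)
        models[emotion] = gmm
    return models

def classify(models, frames):
    # Pick the emotion whose GMM assigns the highest likelihood to the frames.
    return max(models, key=lambda emo: models[emo].score(frames))
```
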
  • 4.2 Evaluation
  • The table below provides a comparison between the phone-based model and a sentence-based model on the same training and test set as used in the evaluation of the phonetic alignment task above. Again, the smaller number of training samples for type 2 results in a significant drop in performance.

7
(No Transcript)
8
  • The recognition quality is likely to still benefit from an increased training size, as the results from the experiments with sentence-level models indicate.

5. Conclusion
We presented a study on phone-based MFCC models for emotion recognition which originated in the practical task of phonetic segmentation of the GEMEP corpus. The results for the automatic segmentation have probably reached a certain ceiling, but they provide a valid basis for the manual correction of further samples. Results for emotion classification are still in flux, i.e. results with differently sized training sets still show significant volatility. Ultimate conclusions on the relationship of phone-based vs. global models are difficult to obtain: because the segmental content in GEMEP is so restricted, phoneme-based models lose their expected implicit advantages in this context. On the other hand, errors in the automatic alignment have a strong influence on the classification; e.g., surprisingly, the sound /s/ provided the best classification results of all phonemes, which is probably due to the fact that fricatives are the sounds most suitable for the aligner. The study on emotion recognition was not at all aimed at achieving impressive recognition rates, which could easily be boosted by using e.g. 10-fold cross-validation, a-priori probabilities, more equally balanced training sets etc., but at testing the relative performance of different MFCC-based models.
9
  • Acknowledgements

I am very indebted to Klaus Scherer and his
group in Geneva for designing, creating and
sharing the GEMEP corpus and especially to Tanja
Baenziger for long-standing and ongoing
interaction and support. This work has been
funded by the EU Network of Excellence HUMAINE
(IST 507422) and by the Austrian Funds for
Research and Technology Promotion for Industry
(FFF 808818/2970 KA/SA). Financial support for
OFAI is provided by the Austrian Federal Ministry
of Science and Research and by the Federal
Ministry of Transport, Innovation and Technology.
  • References
  • [1] Baenziger T., Pirker H., Scherer K.: GEMEP - GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions. In: Devillers L. et al. (eds.), Proceedings of the LREC'06 Workshop on Corpora for Research on Emotion and Affect, May 23, Genoa, Italy, pp. 15-19, 2006.
  • [2] Lee C.M., Yildirim S., Bulut M., Kazemzadeh A., Busso C., Deng Z., Lee S., Narayanan S.: Emotion Recognition based on Phoneme Classes. Proceedings of ICSLP 04, Jeju, Korea, 2004.
  • [3] Schuller B., Rigoll G.: Timing Levels in Segment-Based Speech Emotion Recognition. In: Proceedings of INTERSPEECH 2006 - ICSLP, 17-21 September, Pittsburgh, PA, USA, pp. 1818-1821, 2006.
  • [4] Young S., Evermann G., Kershaw D., Moore G., Odell J., Ollason D., Povey D., Valtchev V., Woodland P.: The HTK Book (version 3.4). Cambridge University Engineering Department, Cambridge, UK, 2006.