Title: Mixed Feelings About Using Phoneme-Level Models in Emotion Recognition
Mixed Feelings About Using Phoneme-Level Models in Emotion Recognition
- Hannes Pirker
- Austrian Research Institute for Artificial Intelligence (OFAI), Vienna
Poster presented at ACII 2007, September 16-19, Lisbon, Portugal
Abstract
- This study deals with the application of MFCC-based models both for the recognition of emotional speech and for the recognition of emotions in speech. More specifically, it investigates the performance of phone-level models. First, results from performing forced alignment for the phonetic segmentation of GEMEP, a novel multimodal corpus of acted emotional utterances, are presented; then the newly acquired segmentations are used for experiments in emotion recognition comparing phone-level models with sentence-level models.
1 Motivation
The Geneva Multimodal Emotion Portrayals corpus (GEMEP) [1] is a novel corpus of highly controlled and uniform content, containing 2 pseudo-linguistic utterances produced with 18 different emotions and varying levels of intensity. This uniform content provides a promising basis for investigating both the acoustic correlates of emotions and timing issues in the synchronization between speech, facial expressions and body movements. In order to provide a sound basis for the investigation of fine-grained temporal issues, phonetic segmentation at the level of individual speech sounds was to be performed. Apart from this practical goal, it was also to be investigated how well standard segmentation techniques, i.e. Hidden Markov Models (HMM) with Mel Frequency Cepstral Coefficients (MFCC) [4], would cope with the amount of variability in manner and voice quality typically found in emotional speech. The third goal was to shed some light on the conflicting requirements on MFCCs. In speech recognition they are used for discriminating speech sounds and valued for their robustness against variation in intonation and voice quality. But they have also become popular in emotion recognition [3], where they should ideally be indifferent to the underlying speech sounds and sensitive to voice quality. As the identity of the speech sounds is a major influencing factor on MFCCs, we compare emotion classifiers that rely on phoneme-level models with sentence-level models.
2 Database Description
- The Geneva Multimodal Emotion Portrayals corpus (GEMEP) consists of audio and video recordings of 10 speakers (professional French-speaking actors), 5 of whom are female. Verbal content is restricted to only 2 different pseudo-linguistic sentences:
- (type1) Ne kal ibam soud mol'en!
- (type2) Koun s'e mina lod belam?
- These were uttered with 18 different emotions, which were chosen for extensive coverage of both the activation (high-low) and the evaluation (positive-negative) dimension.
- Utterances were also produced in a less intense, a more intense and a 'masked' manner (i.e. trying, unsuccessfully, to hide the actual emotion from the audience). On average, utterances were repeated 8 times per condition.
- This results in 3815 sentences: 2739 of type1 and 1076 of type2.
Table: Emotion categories in the GEMEP corpus (the selected subset of 6 emotions is set in bold).
3 Phonetic Segmentation
3.1 Workflow
- Starting with 50 manually segmented sentences for training an initial set of HMMs, several cycles of training, alignment, manual correction and re-training with the increased set were performed (see the sketch below). By now 1313 manually validated sentences are available, which are currently split into 892 training and 421 test samples. To make use of the constrained content of the corpus, HMMs were trained for each individual sound, i.e. 4 different models for /a/ were trained. Apart from unconstrained training (Global models), specific models for male and female speakers (per Gender), for each actor (per Speaker) and for each emotion (per Emotion) were constructed and evaluated.
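This workflow amounts to a simple bootstrapping loop. The Python sketch below only illustrates its structure; train_hmms, force_align and manually_correct are hypothetical placeholders for the HTK training run, the forced-alignment pass and the human correction step, and the batch size is arbitrary.

```python
# Sketch of the bootstrapping workflow described above.
# train_hmms(), force_align() and manually_correct() are hypothetical
# placeholders, not real HTK commands; batch_size is arbitrary.

def bootstrap_segmentation(all_sentences, seed_segmented, batch_size=200):
    """Grow the set of validated phone segmentations by alternating
    HMM training, forced alignment and manual correction."""
    validated = list(seed_segmented)           # starts with 50 hand-segmented sentences
    remaining = [s for s in all_sentences if s not in seed_segmented]
    models = None
    while remaining:
        models = train_hmms(validated)         # re-train on all validated data so far
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        aligned = [force_align(models, s) for s in batch]
        validated.extend(manually_correct(a) for a in aligned)
    return validated, models
```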
3.2 Technical Procedure
- The original audio data was downsampled to 22.05 kHz and high-pass filtered at 55 Hz. MFCCs were calculated with a frame size of 30 ms and a window shift of 2.5 ms (i.e. resulting in 400 frames per second). 12 MFCCs, energy, deltas and accelerations were included, resulting in a 39-dimensional feature vector.
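For illustration, a comparable 39-dimensional front-end can be sketched in Python with librosa. This is only an approximation of the HTK front-end actually used: the 55 Hz high-pass filter, pre-emphasis and HTK's filterbank and liftering details are omitted, and only the frame size, shift and coefficient counts are taken from the description above.

```python
import numpy as np
import librosa

def mfcc_features(path, sr=22050):
    """Approximate 39-dim front-end: 12 MFCCs + log-energy, plus deltas
    and accelerations; 30 ms window, 2.5 ms shift (~400 frames/s).
    Only a loose imitation of the HTK configuration described above."""
    y, _ = librosa.load(path, sr=sr)
    win = int(0.030 * sr)                       # 30 ms analysis window
    hop = int(0.0025 * sr)                      # 2.5 ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=win,
                                win_length=win, hop_length=hop, center=False)
    frames = librosa.util.frame(y, frame_length=win, hop_length=hop)
    log_energy = np.log(np.maximum((frames ** 2).sum(axis=0), 1e-10))
    n = min(mfcc.shape[1], log_energy.shape[0])
    static = np.vstack([mfcc[:, :n], log_energy[np.newaxis, :n]])   # 13 x n
    delta = librosa.feature.delta(static, order=1)
    accel = librosa.feature.delta(static, order=2)
    return np.vstack([static, delta, accel]).T                      # (n, 39)
```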
- For the phoneme-level models, Hidden Markov Models with 3 states and 5 Gaussian mixtures in each state were employed. The Baum-Welch algorithm was used for training the HMMs, and Viterbi decoding was performed in order to retrieve the segment boundaries. Extraction of MFCC features, training and application of HMMs were performed with the respective tools provided by HTK [4].
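Purely to illustrate the model topology (3 emitting states, 5 diagonal-covariance mixtures per state, left-to-right), a comparable per-phone model could be set up with hmmlearn as sketched below; this is a stand-in for the HTK tools actually used, and the forced-alignment step itself is not shown.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_phone_hmm():
    """3-state left-to-right HMM with 5 diagonal Gaussian mixtures per
    state, mirroring the topology described above (the real experiments
    used HTK, not hmmlearn)."""
    hmm = GMMHMM(n_components=3, n_mix=5, covariance_type="diag",
                 n_iter=20, init_params="mcw", params="mcw")
    # Fixed left-to-right topology: start in state 0, allow only
    # self-loops and forward transitions (kept fixed during training).
    hmm.startprob_ = np.array([1.0, 0.0, 0.0])
    hmm.transmat_ = np.array([[0.5, 0.5, 0.0],
                              [0.0, 0.5, 0.5],
                              [0.0, 0.0, 1.0]])
    return hmm

# Training on the 39-dim feature chunks of all occurrences of one phone:
# model = make_phone_hmm()
# model.fit(np.vstack(segments), lengths=[len(s) for s in segments])
```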
3.3 Evaluation
- The absolute position error at the initial phone boundaries was used for evaluation. Quantiles and error thresholds provide a meaningful measure: approx. 85% of segments are located with an error of less than 20 ms and should not require manual correction. Below, the performance for two differently sized training sets (N = 503 vs. N = 892) and for gender-, speaker- and emotion-specific phoneme models is illustrated.
- The results indicate that the performance of the aligner already seems to level out, i.e. for type1 the increase in the size of the training data from 395 to 685 sentences no longer shows much of an effect, while the less frequent type2 still benefits from the increase in training size. Also, further fractionating the training set by using speaker-specific models etc. no longer decreases the performance dramatically.
- This is in line with the subjective impression that normal cases were handled remarkably well by models trained on small sets, but persistently problematic classes exist, e.g. flustered speech with its weakly articulated formant structures. Soft speech, even of very low intensity, as well as loud or even shouted speech are less critical. The most problematic cases are intermingled laughter, hesitations etc., which pose severe problems to any procedure based on forced alignment.
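To make the evaluation measure above concrete, a minimal sketch is given below; it assumes two hypothetical arrays of automatic and reference boundary times in seconds.

```python
import numpy as np

def boundary_error_stats(auto_bounds, ref_bounds, tolerance=0.020):
    """Absolute position error between automatically aligned and manually
    validated phone boundaries (input times in seconds)."""
    err = np.abs(np.asarray(auto_bounds, float) - np.asarray(ref_bounds, float))
    return {
        "within_20ms": float((err < tolerance).mean()),    # share needing no correction
        "q85_ms": float(np.quantile(err, 0.85) * 1000.0),  # 85% error quantile in ms
        "median_ms": float(np.median(err) * 1000.0),
    }

# e.g. boundary_error_stats([0.102, 0.250, 0.471], [0.100, 0.230, 0.480])
```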
4 Emotion Classification
4.1 Technical Procedure
- In order to test the capabilities of phone-level MFCC modelling for emotion classification, virtually the same methods as used for phonetic segmentation were re-applied. For phone-level modelling, emotion-specific 3-state left-to-right HMMs with 5 Gaussian mixtures were trained. Alternatively, two sentence-level HMM topologies were tested: an 'elongated' version of the phone-level model with 22 states/5 mixtures, and several 1-state HMMs, which are equivalent to a Gaussian Mixture Model. The recognition grammar of the aligner was adapted as shown in Fig. 2. A majority vote was performed on the outcome of the Viterbi decoding (illustrated in the sketch below).
- For the experiments with emotion recognition, a test set of 875 samples was reserved, leaving a maximum of 2940 samples for training (i.e. a 77/23 split). Currently only sentence-level models could make use of the whole training set, as they do not require any pre-segmented data. Phone-based models were for now trained on the smaller set of 892 samples, though, as it was necessary to use manually segmented data.
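The majority vote mentioned above can be illustrated as follows; the per-segment emotion labels resulting from the Viterbi decoding are assumed as given (hypothetical input), and the tie-breaking rule is an arbitrary assumption.

```python
from collections import Counter

def majority_vote(segment_labels):
    """Combine the per-phone emotion decisions of one utterance into a
    single utterance-level label; ties fall to the label encountered
    first (an arbitrary choice)."""
    return Counter(segment_labels).most_common(1)[0][0]

# e.g. majority_vote(["anger", "anger", "fear", "anger"])  ->  "anger"
```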
Fig. 1: 85% error quantile (ms) per segment for the two sentences in the corpus (training set N = 892).
- As 18 categories is an untypically high number, for better comparability all experiments were also performed with a subset of 6 emotions (anger, joy, fear, sadness, pleasure, interest). For sentence-level HMMs, different topologies were compared. For the full training set of 2940 samples, a single-state model with an exceedingly high number of Gaussian mixtures (512) performed best. This model was compared with phone-level modelling.
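Since a single-state HMM with a large Gaussian mixture is simply a GMM over all frames of an utterance, such a sentence-level classifier can be sketched with scikit-learn as below. This is an illustrative stand-in for the HTK models actually used; the training dictionary layout and all hyperparameters other than the 512 mixtures are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_sentence_level_gmms(frames_by_emotion, n_components=512):
    """One GMM per emotion over all MFCC frames of that emotion's training
    utterances; equivalent to a single-state HMM with a large mixture.
    Diagonal covariances keep 512 components tractable."""
    models = {}
    for emotion, utterances in frames_by_emotion.items():
        X = np.vstack(utterances)               # (total_frames, 39)
        models[emotion] = GaussianMixture(n_components=n_components,
                                          covariance_type="diag",
                                          max_iter=50).fit(X)
    return models

def classify(models, utterance_frames):
    """Pick the emotion whose GMM gives the highest total frame log-likelihood."""
    scores = {e: m.score_samples(utterance_frames).sum() for e, m in models.items()}
    return max(scores, key=scores.get)
```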
Fig. 2: Recognition grammar for phone-level and sentence-level modelling.
4.2 Evaluation
- The table below provides a comparison between the phone-based model and a sentence-based model on the same training and test set as was used in the evaluation of the phonetic alignment task above. Again, the smaller number of training samples for type2 results in a significant drop in performance.
Table: Comparison of phone-based and sentence-based emotion classification models.
- The recognition quality is likely to still benefit from an increased training size, as the results from the experiments with sentence-level models indicate.
5 Conclusion
We presented a study on phone-based MFCC models for emotion recognition which originated in the practical task of phonetic segmentation of the GEMEP corpus. The results for the automatic segmentation have probably reached a certain ceiling, but they provide a valid basis for the manual correction of further samples. Results for emotion classification are still in flux, i.e. results with differently sized training sets still show significant volatility. Ultimate conclusions on the relationship between phone-based and global models are difficult to obtain. Because the segmental content in GEMEP is so restricted, phoneme-based models lose their expected implicit advantages in this context. On the other hand, errors in the automatic alignment have a strong influence on the classification; e.g., surprisingly, the sound /s/ provided the best classification results of all phonemes, which is probably due to the fact that fricatives are the sounds most suitable for the aligner. The study on emotion recognition was not at all aimed at producing impressive recognition rates, which could easily be boosted by using e.g. 10-fold cross-validation, a-priori probabilities, more equally balanced training sets etc., but at testing the relative performance of different MFCC-based models.
Acknowledgements
I am very indebted to Klaus Scherer and his group in Geneva for designing, creating and sharing the GEMEP corpus, and especially to Tanja Baenziger for long-standing and ongoing interaction and support. This work has been funded by the EU Network of Excellence HUMAINE (IST 507422) and by the Austrian Funds for Research and Technology Promotion for Industry (FFF 808818/2970 KA/SA). Financial support for OFAI is provided by the Austrian Federal Ministry of Science and Research and by the Federal Ministry of Transport, Innovation and Technology.
References
- [1] Baenziger T., Pirker H., Scherer K.: GEMEP - GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions. In: Devillers L. et al. (eds.), Proceedings of the LREC'06 Workshop on Corpora for Research on Emotion and Affect, May 23, Genoa, Italy, pp. 15-19, 2006.
- [2] Lee C.M., Yildirim S., Bulut M., Kazemzadeh A., Busso C., Deng Z., Lee S., Narayanan S.: Emotion Recognition based on Phoneme Classes. In: Proceedings of ICSLP 04, Jeju, Korea, 2004.
- [3] Schuller B., Rigoll G.: Timing Levels in Segment-Based Speech Emotion Recognition. In: Proceedings of INTERSPEECH 2006 - ICSLP, 17-21 September, Pittsburgh, PA, USA, pp. 1818-1821, 2006.
- [4] Young S., Evermann G., Kershaw D., Moore G., Odell J., Ollason D., Povey D., Valtchev V., Woodland P.: The HTK Book (version 3.4). Cambridge University Engineering Department, Cambridge, UK, 2006.