Title: Emotion Recognition from Speech: Stress Experiment
Stefan Scherer, Hansjörg Hofmann, Malte Lampmann, Martin Pfeil, Steffen Rhinow, Friedhelm Schwenker, Günther Palm
Institute of Neural Information Processing, Ulm University
stefan.scherer_at_uni-ulm.de
24.09.2007, LREC 2008
2 Motivation
- Why stress recognition from speech?
  - Safety and usability purposes
  - More efficient and natural interfaces
  - Several existing applications are based on speech only (call center applications)
- Existing problems
  - Existing databases are limited
  - Stress induced by increasing workload is missing from them
  - Choosing representative features is difficult
3 Experimental Setup
4 Experimental Setup Summary
- Task: direct planes towards their corresponding exits
- Four types of questions (personal, enumerations, general knowledge, Jeopardy)
- Difficulty levels differ in plane speed, number of planes and exit sizes
- Points are earned or lost, and the current score is color coded
- One game lasts 10 minutes
- Self-assessment of experienced stress is requested three times
5 Evaluation and Labeling of Recordings
- Everybody reacts differently to stress
- No common labels are available for the recordings
- → Second labeling experiment to obtain fuzzy labels for each of the recordings
6 Evaluation and Labeling of Recordings
Speaker  Mean  P25    P75  Self-assessment  Crashes
1        35.8  24     47   1/2/4            0/4/13
2        41.9  25     59   2/4/?            0/4/30
3        45.2  29.5   61   7/6/8            1/10/37
4        31.0  20     40   1/1/2            0/2/16
5        43.2  25     61   7/8/9            0/3/28
6        43.0  23     60   4/4/6            0/3/26
7        31.2  21     37   1/3/7            0/1/23
8        33.2  21     41   1/1/3-4          0/0/8
9        38.0  23     51   1/1-2/5          0/6/31
10       35.7  22     49   1/2/5            0/3/11
11       49.6  31.75  65   7/9/10           5/9/17
12       49.1  32     65   4/4/?            0/5/27
13       43.4  26     62   1/3/4            6/22/38
14       32.1  22     41   2/5/8            1/1/26
15       41.6  26     56   2/3/7            0/2/19
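Note: the Mean, P25 and P75 columns read as the mean and the 25th and 75th percentiles of the labels collected per speaker. A minimal sketch of how such a summary could be computed follows. The ratings are hypothetical, and the interpolation method (Python's "inclusive" quantiles) is an assumption; the slides do not say which one was used.

```python
import statistics

def summarize_labels(ratings):
    """Mean and quartiles (P25, P75) of one speaker's ratings."""
    # "inclusive" interpolation is an assumption, not taken from the slides.
    p25, _median, p75 = statistics.quantiles(ratings, n=4, method="inclusive")
    return {"mean": statistics.fmean(ratings), "p25": p25, "p75": p75}

# Hypothetical ratings for the utterances of a single speaker.
ratings = [20, 25, 30, 40, 60]
summary = summarize_labels(ratings)
print(summary)
```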
7 Evaluation and Labeling of Recordings
- Spearman correlation tests
  - Mean (M) vs. self-assessment (SA)
  - Mean (M) vs. crashes (C)
  - Self-assessment (SA) vs. crashes (C)
          ρ     p-value
M vs. SA  0.61  0.01
M vs. C   0.68  0.005
C vs. SA  0.40  0.13
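Note: Spearman's ρ is the Pearson correlation of the ranks of two variables, with tied values sharing an average rank. The sketch below implements this in plain Python. It pairs the Mean column of the speaker table with the last of the three crash counts; which crash value was actually correlated is not stated in the slides, so that choice is an assumption and the result need not match the reported 0.68. The p-value is not computed here.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Mean column of the speaker table.
means = [35.8, 41.9, 45.2, 31.0, 43.2, 43.0, 31.2, 33.2,
         38.0, 35.7, 49.6, 49.1, 43.4, 32.1, 41.6]
# Last of the three crash counts per speaker (assumed choice, see note).
crashes = [13, 30, 37, 16, 28, 26, 23, 8, 31, 11, 17, 27, 38, 26, 19]

rho = spearman_rho(means, crashes)
print(f"rho = {rho:.2f}")
```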
8 Automatic Stress Recognition
- Biologically motivated features
  - Representing the rate of change of frequency
  - Representative features
  - Robust against noisy conditions
- Echo state networks
  - Easy to train using the direct pseudo inverse method
  - Using sequential characteristics of features
  - Robust against noisy conditions
9 Utilized Features
- Motivation
  - Pitch is not always easy to extract
  - Statistics of pitch may not suffice
  - Preliminary experiments with pitch-based features showed worse performance
  - Goal: representative features that do not need to be aggregated over time
- Modulation spectrum based features
  - Representing the rate of change of frequency
  - Extracted at 25 Hz
10 Modulation Spectrum Features
- Rate of change of frequency
- Standard procedures: FFT and Mel filtering
- Most prominent energies are observed between 2 and 16 Hz
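Note: the idea behind a modulation spectrum can be sketched in two steps: compute the energy in a frequency band for each short frame, then take a second Fourier transform of that energy sequence along time. The sketch below is a simplification under stated assumptions. It uses a naive DFT instead of an FFT, a single rectangular band instead of a Mel filterbank, and a synthetic test tone; all names, frame sizes and signal parameters are hypothetical and not taken from the slides.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of bins 0..n//2 of a naive O(n^2) DFT (stand-in for an FFT)."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples)))
            for k in range(n // 2 + 1)]

def band_envelope(signal, sample_rate, frame_len, band_hz):
    """Energy inside one frequency band for each non-overlapping frame."""
    lo, hi = band_hz
    bin_hz = sample_rate / frame_len
    envelope = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        mags = dft_magnitudes(signal[start:start + frame_len])
        envelope.append(sum(m * m for k, m in enumerate(mags)
                            if lo <= k * bin_hz <= hi))
    return envelope

def modulation_spectrum(envelope, frame_rate):
    """Second DFT, taken along time: how fast the band energy fluctuates."""
    mean = sum(envelope) / len(envelope)
    mags = dft_magnitudes([e - mean for e in envelope])
    freqs = [k * frame_rate / len(envelope) for k in range(len(mags))]
    return freqs, mags

# Hypothetical test signal: a 1 kHz tone whose amplitude is modulated at 4 Hz.
SAMPLE_RATE, FRAME_LEN = 8000, 80  # 10 ms frames, i.e. 100 frames per second
signal = [(1 + 0.5 * math.sin(2 * math.pi * 4 * i / SAMPLE_RATE))
          * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE)
          for i in range(2 * SAMPLE_RATE)]

envelope = band_envelope(signal, SAMPLE_RATE, FRAME_LEN, band_hz=(800, 1200))
freqs, mags = modulation_spectrum(envelope, frame_rate=SAMPLE_RATE / FRAME_LEN)
peak_hz = freqs[mags.index(max(mags))]
print(f"strongest modulation frequency: {peak_hz} Hz")
```

The 4 Hz modulation of the test tone lies inside the 2 to 16 Hz range named on the slide, which is why that value was chosen for the example.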
11 [Figure: waveform, spectrogram and modulation spectrogram; horizontal axis: time]
12 Echo State Networks
- Recurrent artificial neural network
- Dynamic reservoir represents history → echo state property
- Only the output connections W_out need to be adapted, using the pseudo inverse method
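Note: a minimal echo state network sketch in plain Python, assuming a tanh reservoir and a linear readout. The reservoir and input weights stay fixed and random; only W_out is fitted. The pseudo-inverse solution is computed here through the normal equations, which coincides with the Moore-Penrose pseudo-inverse when the state matrix has full column rank. Reservoir size, weight scaling, the row-sum rescaling (a simple sufficient condition for fading memory) and the toy task are all assumptions for illustration, not the configuration used in the experiments.

```python
import math
import random

def make_reservoir(n_units, seed=0, row_sum=0.9):
    """Fixed random weights; each row of W is rescaled to absolute sum row_sum < 1."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(n_units)] for _ in range(n_units)]
    w = [[row_sum * v / sum(abs(a) for a in row) for v in row] for row in w]
    w_in = [rng.uniform(-1, 1) for _ in range(n_units)]
    return w, w_in

def run_reservoir(w, w_in, inputs):
    """States x(t) = tanh(W x(t-1) + W_in u(t)), each with a constant bias entry."""
    x = [0.0] * len(w)
    states = []
    for u in inputs:
        x = [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + wi * u)
             for row, wi in zip(w, w_in)]
        states.append(x + [1.0])
    return states

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def train_readout(states, targets):
    """W_out = pinv(S) y, computed as (S^T S)^-1 S^T y (full column rank assumed)."""
    d = len(states[0])
    gram = [[sum(s[i] * s[j] for s in states) for j in range(d)] for i in range(d)]
    rhs = [sum(s[i] * y for s, y in zip(states, targets)) for i in range(d)]
    return solve(gram, rhs)

def predict(states, w_out):
    """Linear readout applied to every collected state."""
    return [sum(wi * si for wi, si in zip(w_out, s)) for s in states]

# Toy task (hypothetical): recall the previous input value from the reservoir state.
rng = random.Random(1)
inputs = [rng.uniform(-1, 1) for _ in range(300)]
targets = [0.0] + inputs[:-1]

w, w_in = make_reservoir(10)
states = run_reservoir(w, w_in, inputs)
w_out = train_readout(states, targets)
preds = predict(states, w_out)

mse = sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)
mean_y = sum(targets) / len(targets)
baseline = sum((y - mean_y) ** 2 for y in targets) / len(targets)
print(f"readout MSE {mse:.4f} vs. constant-prediction MSE {baseline:.4f}")
```

Because the reservoir is never trained, fitting reduces to a single linear least-squares problem, which is what makes this kind of network easy to train.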
13 Experiments and Results
- No true label available → the mean over all labelers is used as the target for each utterance
- 10-fold cross validation
- Human labelers vs. ESN
- ESN outperforms labelers
           MSE    ME
Labeler 1  0.284  0.421
Labeler 2  0.151  0.281
Labeler 3  0.291  0.422
Labeler 4  0.241  0.384
Labeler 5  0.211  0.365
ESN        0.084  0.235
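Note: the comparison above scores every labeler (and the ESN) against the per-utterance mean of all labelers. A minimal sketch of that evaluation follows. The ratings are hypothetical, and reading ME as the mean absolute error is an assumption; the slides do not expand the abbreviation.

```python
def mean_targets(labels_by_labeler):
    """Per-utterance mean over all labelers (one inner list per labeler)."""
    n_labelers = len(labels_by_labeler)
    return [sum(column) / n_labelers for column in zip(*labels_by_labeler)]

def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def mean_abs_error(pred, target):
    """Mean absolute error (assumed meaning of ME)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

# Hypothetical ratings: three labelers, four utterances.
labels = [
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.6, 0.5, 0.3],
    [0.0, 1.0, 0.8, 0.2],
]
target = mean_targets(labels)

for idx, ratings in enumerate(labels, start=1):
    print(f"Labeler {idx}: MSE {mse(ratings, target):.3f}, "
          f"ME {mean_abs_error(ratings, target):.3f}")
```

A recognizer's predictions can be scored with the same two functions by passing them in place of one labeler's ratings.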
14 Conclusions
- Experimental setup to record speech data with different levels of stress
- Large vocabulary dataset is available (with additional video material and mouse movement data)
- Method to label the individual stressed utterances by humans
- Automatic stress recognizer based on recurrent neural networks
  - → outperforming human labelers in accuracy
15 Thank you for your attention!