Transcript and Presenter's Notes

Title: Negative Emotional State Detection from Speech


1
Negative Emotional State Detection from Speech
  • Panagiotis Zervas, Todor Ganchev, Nikolaos
    Fakotakis
  • Electrical and Computer Engineering Department
  • University of Patras, Greece
  • e-mail: {pzervas, tganchev, fakotaki}@wcl.ee.upatras.gr

2
Topics of Discussion
  • Is emotion recognition essential for Artificial
    Intelligence tasks?
  • Description of the utilized speech data
  • LDC Emotional Speech database
  • Emotion Recognition Framework
  • Feature Extraction
  • Classification methods
  • Evaluation Setup
  • Results
  • Conclusions

3
Who Cares?
  • Practical impact
  • Detecting Frustration/Anger
  • Stress/Distress
  • Help call prioritizing
  • Tutorials: Boredom/Confusion/Frustration
  • Pacing/Positive feedback
  • User acceptance
  • Users prefer a talking head that uses emotional
    speech
  • Esoteric Impact
  • Is artificial intelligence possible w/o detection
    of emotion?
  • w/o display of emotion?
  • Do we experience someone/something as
    understanding us if it can't understand our
    emotional state/experience?

4
Emotional Speech Corpora
  • Acquiring basic data is one of the hardest
    problems in emotion research. Four main types of
    source are commonly used:
  • Acted representation of emotion (which may or may
    not be generated by actors).
  • Application-driven: a growing range of databases
    is derived from specific applications (e.g. call
    centres).
  • General naturalistic: data that is representative
    of everyday life (very difficult to collect).
  • Induction: emotions of appropriate kinds are
    induced under appropriate circumstances.

5
Speech Material (1)
  • Emotional Prosody Speech and Transcripts Database
    (Linguistic Data Consortium, University of
    Pennsylvania)
  • acted emotional speech.
  • 9 hours of English speech recordings and their
    transcripts.
  • 2-channel interleaved 16-bit PCM, stored as
    SPHERE files (SPH).
  • Each file is a continuous recording of several
    acted emotions from one speaker.

6
Speech Material (2)
  • 8 actors; for the purpose of our evaluation, we
    selected data from 6 of them:
  • 3 male and 3 female.
  • Short-duration utterances of three to four words
    each.

7
Speech Material (3)
  • Distribution of the training data among speakers
    and emotional states
  • SP1, SP2, SP5 are the male speakers
  • and SP3, SP4, SP6 are the female
  • The amount of data available for training the
    different classes varied between 67 and 150
    seconds of voiced speech.
  • The lack of balance is reflected in:
  • good identification accuracy for classes with
    sufficient training data,
  • moderate identification accuracy for classes with
    less training data.

8
Feature Extraction
  • The feature vector consists of nineteen basic
    acoustic attributes:
  • F0 values in the log-frequency domain (logF0),
  • the deltas of F0 (dlogF0),
  • the first (F1), second (F2), and third (F3)
    formants of the signal,
  • the first thirteen Mel frequency cepstral
    coefficients (MFCCs),
  • energy (Enrg),
  • and the difference of energy (dEnrg) per frame.
  • Pitch contour and formant frequency estimation
    were conducted with Praat.
  • 256-sample analysis window, with an overlap of
    128 samples.
  • F0 estimated in the range of 60-320 Hz for both
    male and female voices.
  • 13 MFCC parameters were computed for each speech
    frame.
  • For both training and testing, only feature
    vectors corresponding to voiced speech frames
    were kept (illustrative extraction sketch below).
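
A minimal sketch of this kind of per-frame extraction is shown below. It is not the authors' exact pipeline: the slides state that Praat was used for pitch and formants, so the sketch reaches Praat through the parselmouth package, while MFCCs and a per-frame energy estimate come from librosa; the use of RMS as the energy measure and the mel-filter settings are assumptions made for illustration.

```python
# Illustrative per-frame feature extraction sketch (assumptions noted above).
import numpy as np
import librosa
import parselmouth
from parselmouth.praat import call

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    frame, hop = 256, 128                               # 256-sample window, 128-sample overlap

    # 13 MFCCs and a per-frame energy estimate with the same framing
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=hop, n_mels=40)
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

    # Pitch (60-320 Hz) and the first three formants via Praat
    snd = parselmouth.Sound(y, sampling_frequency=sr)
    pitch = snd.to_pitch(time_step=hop / sr, pitch_floor=60, pitch_ceiling=320)
    formant = snd.to_formant_burg(time_step=hop / sr)

    feats, prev_logf0, prev_e = [], None, None
    times = librosa.frames_to_time(np.arange(mfcc.shape[1]), sr=sr, hop_length=hop)
    for i, t in enumerate(times):
        f0 = call(pitch, "Get value at time", t, "Hertz", "Linear")
        if np.isnan(f0) or f0 <= 0:
            continue                                    # keep voiced frames only
        logf0 = np.log(f0)
        f1, f2, f3 = (formant.get_value_at_time(k, t) for k in (1, 2, 3))
        dlogf0 = 0.0 if prev_logf0 is None else logf0 - prev_logf0
        denrg = 0.0 if prev_e is None else energy[i] - prev_e
        feats.append([logf0, dlogf0, f1, f2, f3, *mfcc[:, i], energy[i], denrg])
        prev_logf0, prev_e = logf0, energy[i]
    return np.array(feats)                              # one row per voiced frame
```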

9
Classification Methods
  • Two classifiers were employed: the C4.5 decision
    tree and the Probabilistic Neural Network (PNN).
  • C4.5 is an improvement on the ID3 algorithm, able
    to handle numerical data.
  • PNNs were chosen as an alternative classifier for
    their good generalization properties and, more
    importantly, for their fast design times.
  • When limited training data are available and
    fast, consistent training is required, the PNN
    provides the best trade-off.
  • PNNs need more neurons than back-propagation
    networks, hence higher computational and memory
    requirements (see the minimal PNN sketch below).
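
A minimal NumPy sketch of a PNN (a Parzen-window classifier) follows; it illustrates why PNN design is fast (training just stores the patterns) and why memory grows with the training set (one neuron per training vector). The Gaussian width sigma is an illustrative parameter, not a value from the slides.

```python
import numpy as np

class PNN:
    """Minimal Probabilistic Neural Network (Parzen-window classifier)."""
    def __init__(self, sigma=0.5):
        self.sigma = sigma                      # kernel width, tuned on held-out data

    def fit(self, X, y):
        # "Training" only stores the patterns: the pattern layer
        self.classes_ = np.unique(y)
        self.patterns_ = [X[y == c] for c in self.classes_]
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Summation layer: average kernel response of each class's neurons
            scores = [np.mean(np.exp(-np.sum((P - x) ** 2, axis=1)
                                     / (2 * self.sigma ** 2)))
                      for P in self.patterns_]
            preds.append(self.classes_[int(np.argmax(scores))])  # decision layer
        return np.array(preds)
```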

10
Evaluation Setup (1)
  • The performance of the C4.5 and PNN classification
    algorithms for negative emotional state
    recognition was evaluated on a common
    experimental set-up.
  • The LDC recordings were divided into training and
    testing sets.
  • The test set consists of 90 files: eighteen files
    for each of the five classes.
  • The test files were selected by randomly choosing
    three recordings for each of the six speakers.
  • Each test trial (one recording) contains between
    0.5 and 1.2 seconds of voiced speech (evaluation
    sketch below).
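
The sketch below shows one way such a trial-level evaluation could be run: classify each voiced frame of a test recording and take a majority vote per trial. The slides do not state how frame decisions are fused into a per-recording decision, so the voting step is an assumption for illustration. The classifier can be the PNN sketch above or, as a rough stand-in for C4.5, scikit-learn's DecisionTreeClassifier (which implements CART rather than C4.5).

```python
from collections import Counter
import numpy as np

def evaluate(classifier, train_frames, train_labels, test_trials):
    """train_frames/train_labels: voiced-frame feature vectors and their class.
    test_trials: list of (frame_matrix, true_label), one per test recording
    (each containing 0.5-1.2 s of voiced speech)."""
    classifier.fit(train_frames, train_labels)
    y_true, y_pred = [], []
    for frames, label in test_trials:
        frame_preds = classifier.predict(frames)
        trial_pred = Counter(frame_preds).most_common(1)[0][0]  # majority vote
        y_true.append(label)
        y_pred.append(trial_pred)
    accuracy = np.mean(np.array(y_true) == np.array(y_pred))
    return accuracy, y_true, y_pred
```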

11
Classification Results (1)
  • Total classification accuracy:
  • PNN: 88.1%
  • C4.5: 62.8%
  • PNN classification accuracy was above 65% for all
    classes.
  • Classification of hot anger and neutral achieved
    an accuracy of 94.4% and 100%, respectively.
  • C4.5 achieved an accuracy of more than 85% for the
    classification of cold anger and neutral,
  • but it showed poor results on the recognition of
    contempt and disgust.

12
Classification Results (2)
In the confusion matrices, columns represent the
instances that were classified as a given class,
while rows represent the instances that actually
belong to that class (see the sketch below).
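
A short sketch of how a confusion matrix with this orientation (rows = actual class, columns = predicted class) can be computed from the trial-level predictions, e.g. with scikit-learn. The label list names the five classes mentioned in the results slide; their exact spelling in the data is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["hot anger", "cold anger", "contempt", "disgust", "neutral"]

def summarize(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)  # rows: actual, cols: predicted
    per_class_acc = cm.diagonal() / cm.sum(axis=1)        # per-class recognition rate
    total_acc = cm.diagonal().sum() / cm.sum()            # overall accuracy
    return cm, per_class_acc, total_acc
```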
13
Conclusions & Further Work
  • A negative emotional state detection framework
    was constructed and evaluated with C4.5 and PNN.
  • The PNN approach proved more effective.
  • Future work: integration into a speech-enabled
    dialogue system (such as a smart home
    environment).
  • Implement dynamic steering of the dialogue flow
    and a flexible dialogue management strategy that
    accounts for negative emotional states.
  • By including the emotion recognition component,
    we aim to bring human-machine interaction a step
    closer to human-to-human communication.

14
  • Please forward any comments, criticism, or
    questions to the corresponding address:
  • pzervas@wcl.ee.upatras.gr
  • Home Page:
  • http://www.wcl.ee.upatras.gr/ai/

Thank you for your attention