Transcript and Presenter's Notes

Title: Negative Emotional State Detection from Speech


1
Negative Emotional State Detection from Speech
  • Panagiotis Zervas, Todor Ganchev, Nikolaos
    Fakotakis
  • Electrical and Computer Engineering Department
  • University of Patras, Greece
  • e-mail: {pzervas, tganchev, fakotaki}@wcl.ee.upatras.gr

2
Topics of Discussion
  • Is emotion recognition essential for Artificial
    Intelligence tasks?
  • Description of the utilized speech data
  • LDC Emotional Speech database
  • Emotion Recognition Framework
  • Feature Extraction
  • Classification methods
  • Evaluation Setup
  • Results
  • Conclusions

3
Who Cares?
  • Practical impact
  • Detecting Frustration/Anger
  • Stress/Distress
  • Help call prioritizing
  • Tutorials: Boredom/Confusion/Frustration
  • Pacing/Positive feedback
  • User acceptance
  • Users prefer a talking head that uses emotional
    speech
  • Esoteric Impact
  • Is artificial intelligence possible w/o detection
    of emotion?
  • w/o display of emotion?
  • Do we experience someone/something as
    understanding us if it can't understand our
    emotional state/experience?

4
Emotional Speech Corpora
  • Acquiring basic data is one of the hardest
    problems in emotion research. Four main types of
    source are commonly used:
  • Acted representation of emotion (which may or may
    not be generated by actors).
  • Application-driven: a growing range of databases
    is derived from specific applications (e.g. call
    centres).
  • General naturalistic: data that is representative
    of everyday life (very difficult to collect).
  • Induction: emotions of appropriate kinds are
    induced under appropriate circumstances.

5
Speech Material (1)
  • Emotional Prosody Speech and Transcripts Database
    (Linguistic Data Consortium, University of
    Pennsylvania)
  • acted emotional speech.
  • 9 hours of English speech recordings and their
    transcripts.
  • 2-channel interleaved 16-bit PCM, stored as
    SPHERE files (SPH).
  • Each file is a continuous recording of several
    acted emotions from one speaker.

6
Speech Material (2)
  • 8 actors; for the purpose of our evaluation, we
    selected data from 6 of them:
  • 3 male and 3 female.
  • Short-duration utterances of three to four words
    each.

7
Speech Material (3)
  • Distribution of the training data among speakers
    and emotional states
  • SP1, SP2, SP5 are the male speakers
  • and SP3, SP4, SP6 are the female
  • The amount of data available for training the
    different classes varied between 67 and 150
    seconds of voiced speech.
  • The lack of balance is reflected in:
  • good identification accuracy for classes with
    sufficient training data,
  • moderate identification accuracy for classes with
    less training data.

8
Feature Extraction
  • The feature vector consists of nineteen basic
    acoustic attributes:
  • F0 values in the log-frequency domain (logF0),
  • the deltas of F0 (dlogF0),
  • the first (F1), second (F2), and third (F3)
    formants of the signal,
  • the first thirteen Mel frequency cepstral
    coefficients (MFCCs),
  • energy (Enrg),
  • and the difference of energy (dEnrg) per frame.
  • Pitch contour and formant frequency estimation
    were conducted with Praat.
  • 256-sample analysis window, with an overlap of
    128 samples.
  • F0 estimated in the range of 60-320 Hz for both
    male and female voices.
  • 13 MFCC parameters were computed for each speech
    frame.
  • For both training and testing, only feature
    vectors corresponding to voiced speech frames
    were kept (illustrative extraction sketch below).
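
A minimal sketch of this kind of per-frame extraction is shown below. It is not the authors' exact pipeline: the slides state that Praat was used for pitch and formants, so the sketch reaches Praat through the parselmouth package, while MFCCs and a per-frame energy estimate come from librosa; the use of RMS as the energy measure and the mel-filter settings are assumptions made for illustration.

```python
# Illustrative per-frame feature extraction sketch (assumptions noted above).
import numpy as np
import librosa
import parselmouth
from parselmouth.praat import call

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    frame, hop = 256, 128                               # 256-sample window, 128-sample overlap

    # 13 MFCCs and a per-frame energy estimate with the same framing
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=hop, n_mels=40)
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

    # Pitch (60-320 Hz) and the first three formants via Praat
    snd = parselmouth.Sound(y, sampling_frequency=sr)
    pitch = snd.to_pitch(time_step=hop / sr, pitch_floor=60, pitch_ceiling=320)
    formant = snd.to_formant_burg(time_step=hop / sr)

    feats, prev_logf0, prev_e = [], None, None
    times = librosa.frames_to_time(np.arange(mfcc.shape[1]), sr=sr, hop_length=hop)
    for i, t in enumerate(times):
        f0 = call(pitch, "Get value at time", t, "Hertz", "Linear")
        if np.isnan(f0) or f0 <= 0:
            continue                                    # keep voiced frames only
        logf0 = np.log(f0)
        f1, f2, f3 = (formant.get_value_at_time(k, t) for k in (1, 2, 3))
        dlogf0 = 0.0 if prev_logf0 is None else logf0 - prev_logf0
        denrg = 0.0 if prev_e is None else energy[i] - prev_e
        feats.append([logf0, dlogf0, f1, f2, f3, *mfcc[:, i], energy[i], denrg])
        prev_logf0, prev_e = logf0, energy[i]
    return np.array(feats)                              # one row per voiced frame
```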

9
Classification Methods
  • Two classifiers were employed: the C4.5 decision
    tree and the Probabilistic Neural Network (PNN).
  • C4.5 is an improvement on the ID3 algorithm, able
    to handle numerical data.
  • PNNs were chosen as an alternative classifier for
    their good generalization properties and, more
    importantly, for their fast design times.
  • When limited training data are available and
    fast, consistent training is required, the PNN
    provides the best trade-off.
  • PNNs need more neurons than back-propagation
    networks, hence higher computational and memory
    requirements (see the minimal PNN sketch below).
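
A minimal NumPy sketch of a PNN (a Parzen-window classifier) follows; it illustrates why PNN design is fast (training just stores the patterns) and why memory grows with the training set (one neuron per training vector). The Gaussian width sigma is an illustrative parameter, not a value from the slides.

```python
import numpy as np

class PNN:
    """Minimal Probabilistic Neural Network (Parzen-window classifier)."""
    def __init__(self, sigma=0.5):
        self.sigma = sigma                      # kernel width, tuned on held-out data

    def fit(self, X, y):
        # "Training" only stores the patterns: the pattern layer
        self.classes_ = np.unique(y)
        self.patterns_ = [X[y == c] for c in self.classes_]
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Summation layer: average kernel response of each class's neurons
            scores = [np.mean(np.exp(-np.sum((P - x) ** 2, axis=1)
                                     / (2 * self.sigma ** 2)))
                      for P in self.patterns_]
            preds.append(self.classes_[int(np.argmax(scores))])  # decision layer
        return np.array(preds)
```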

10
Evaluation Setup (1)
  • The performance of the C4.5 and PNN classification
    algorithms for negative emotional state
    recognition was evaluated on a common
    experimental set-up.
  • The LDC recordings were divided into training and
    testing sets.
  • The test set consists of 90 files: eighteen files
    for each of the five classes.
  • The test files were selected by randomly choosing
    three recordings for each of the six speakers.
  • Each test trial (one recording) contains between
    0.5 and 1.2 seconds of voiced speech (evaluation
    sketch below).
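
The sketch below shows one way such a trial-level evaluation could be run: classify each voiced frame of a test recording and take a majority vote per trial. The slides do not state how frame decisions are fused into a per-recording decision, so the voting step is an assumption for illustration. The classifier can be the PNN sketch above or, as a rough stand-in for C4.5, scikit-learn's DecisionTreeClassifier (which implements CART rather than C4.5).

```python
from collections import Counter
import numpy as np

def evaluate(classifier, train_frames, train_labels, test_trials):
    """train_frames/train_labels: voiced-frame feature vectors and their class.
    test_trials: list of (frame_matrix, true_label), one per test recording
    (each containing 0.5-1.2 s of voiced speech)."""
    classifier.fit(train_frames, train_labels)
    y_true, y_pred = [], []
    for frames, label in test_trials:
        frame_preds = classifier.predict(frames)
        trial_pred = Counter(frame_preds).most_common(1)[0][0]  # majority vote
        y_true.append(label)
        y_pred.append(trial_pred)
    accuracy = np.mean(np.array(y_true) == np.array(y_pred))
    return accuracy, y_true, y_pred
```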

11
Classification Results (1)
  • Total classification accuracy:
  • PNN: 88.1%
  • C4.5: 62.8%
  • PNN classification accuracy was above 65% for all
    classes.
  • Classification of hot anger and neutral achieved
    an accuracy of 94.4% and 100%, respectively.
  • C4.5 achieved an accuracy of more than 85% for the
    classification of cold anger and neutral,
  • but it showed poor results on the recognition of
    contempt and disgust.

12
Classification Results (2)
In the confusion matrices, columns represent the
instances that were classified as a given class,
while rows represent the instances that actually
belong to that class (see the sketch below).
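
A short sketch of how a confusion matrix with this orientation (rows = actual class, columns = predicted class) can be computed from the trial-level predictions, e.g. with scikit-learn. The label list names the five classes mentioned in the results slide; their exact spelling in the data is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["hot anger", "cold anger", "contempt", "disgust", "neutral"]

def summarize(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)  # rows: actual, cols: predicted
    per_class_acc = cm.diagonal() / cm.sum(axis=1)        # per-class recognition rate
    total_acc = cm.diagonal().sum() / cm.sum()            # overall accuracy
    return cm, per_class_acc, total_acc
```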
13
Conclusions & Further Work
  • A negative emotional state detection framework
    was constructed and evaluated with C4.5 and PNN.
  • The PNN approach proved more effective.
  • Future work: integration into a speech-enabled
    dialogue system (such as a smart home
    environment).
  • Implement dynamic steering of the dialogue flow
    and a flexible dialogue management strategy that
    accounts for negative emotional states.
  • By including the emotion recognition component,
    we aim to bring human-machine interaction a step
    closer to human-to-human communication.

14
  • Please forward any comments, criticism, or
    questions to the corresponding address:
  • pzervas@wcl.ee.upatras.gr
  • Home Page:
  • http://www.wcl.ee.upatras.gr/ai/

Thank you for your attention