Title: Spoken Language Processing Lab
1Spoken Language Processing Lab
- Who we are: Julia Hirschberg, Stefan Benus, Fadi Biadsy, Frank Enos, Agus Gravano, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg
- Lab: The Speech Lab, CEPSR 7LW3-A
2Prosody, Emotion and Speaker State
- A speaker's emotional state represents important and useful information
- To recognize (e.g. anger/frustration in IVR systems)
- To generate (e.g. any emotion for games)
- Many studies have shown that prosody helps to convey/identify classic emotions (anger, happiness, ...) with some accuracy
- Can prosody also signal other types of speaker state?
- In a tutoring domain (confidence vs. uncertainty)
- Charisma
- Deception
3LDC Emotional Speech Corpus
- happy
- sad
- angry
- confident
- frustrated
- friendly
- interested
- anxious
- bored
- encouraging
4Identifying Confidence vs. Uncertainty (Liscombe)
- The ITSpoke Corpus: physics tutoring, collected at U. Pittsburgh by Diane Litman and students
- 17 students, 1 tutor
- 130 human/human dialogues
- 7000 student turns (mean length 2.5 sec)
- Hand labeled for confidence, uncertainty, anger, frustration
5A Certain Example
6An Uncertain Example
7pr01_sess00_prob58
8Direct Modeling of Prosodic Features
- Automatically extracted acoustic/prosodic features
- Pitch, energy, speaking rate, unit duration (hand labeled), pausal duration within and preceding unit of analysis, filled pauses (hand labeled)
- Units
- Entire turns
- Breath groups
- Context: Same features from prior turn(s)
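As a rough illustration of turn-level features with context, the sketch below summarizes per-frame pitch and energy for one turn and then appends the prior turn's features. The function names, the frame values, and the convention that unvoiced frames carry an F0 of 0.0 are all assumptions made for this example; the slides do not describe the lab's actual extraction tools.

```python
from statistics import mean, pstdev

def turn_features(f0_hz, energy_db, n_syllables, duration_s):
    """Summarize one turn from per-frame pitch and energy values.

    Assumption: unvoiced frames are marked with an F0 of 0.0 and are skipped.
    """
    voiced = [f for f in f0_hz if f > 0]
    return {
        "f0_min": min(voiced),
        "f0_max": max(voiced),
        "f0_mean": mean(voiced),
        "f0_stdev": pstdev(voiced),
        "energy_mean": mean(energy_db),
        "speaking_rate": n_syllables / duration_s,  # syllables per second
        "duration": duration_s,
    }

def add_context(current, previous):
    """Append the prior turn's features under a 'prev_' prefix."""
    combined = dict(current)
    for name, value in previous.items():
        combined["prev_" + name] = value
    return combined

# Toy values for two consecutive turns (hypothetical, not from the corpus).
prev = turn_features([0.0, 180.0, 190.0, 200.0], [60.0, 62.0, 61.0, 59.0], 6, 2.0)
curr = turn_features([210.0, 0.0, 230.0, 250.0], [64.0, 66.0, 65.0, 63.0], 8, 2.5)
features = add_context(curr, prev)
```

The same summary function could be applied to breath groups instead of entire turns by passing the frames of each breath group separately.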
9Classifying Uncertainty
- Human-Human Corpus
- AdaBoost (C4.5), 90/10 split
- Classes: Uncertain vs. Certain vs. Neutral
- Results
Features            Accuracy (%)
Baseline            66
Acoustic-prosodic   75
contextual          76
breath-groups       77
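A minimal sketch of a 90/10 split with a baseline score follows. The slide does not say how the baseline was computed; this example assumes a majority-class baseline, and the label counts are toy values rather than the corpus distribution. It does not implement AdaBoost or C4.5.

```python
import random
from collections import Counter

def split_90_10(items, seed=0):
    """Shuffle with a fixed seed and return (train, test) in a 90/10 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]

def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label; score it on the test set."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)

# Toy label distribution (hypothetical).
labels = ["neutral"] * 60 + ["certain"] * 25 + ["uncertain"] * 15
train, test = split_90_10(labels)
majority, accuracy = majority_baseline_accuracy(train, test)
```

Any classifier trained on the acoustic-prosodic features would be compared against this kind of floor.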
10Charismatic Speech (Rosenberg, Biadsy)
- What is charisma?
- The ability to attract and retain followers by virtue of personality as opposed to tradition or laws. (Weber 47)
- E.g. JFK, Hitler, Castro, Martin Luther King
- Why study it?
- Identify new leaders early
- Help people improve their public speaking
- Produce more compelling TTS
- What makes leaders charismatic?
- Can prosody help us identify charisma?
11(No Transcript)
12Method
- Data: 45 2-10s speech segments, 5 each from 9 candidates for Democratic nomination for president
- 2 charismatic, 2 not charismatic
- Topics: greeting, reasons for running, tax cuts, postwar Iraq, healthcare
- 13 subjects rated each segment on a Likert scale (1-5) for 26 questions
- Correlation of lexical and acoustic/prosodic features with mean charisma ratings
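The correlation step can be sketched as follows: average the raters' scores per segment, then correlate one feature with those means. Pearson's r is an assumption here, since the slide does not name the correlation statistic, and the ratings and speaking rates are toy numbers.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# ratings[i] holds the Likert scores (1-5) that each rater gave segment i.
# Three raters and four segments are used here only to keep the example small.
ratings = [[4, 5, 4], [2, 1, 2], [3, 3, 4], [5, 4, 5]]
mean_ratings = [mean(r) for r in ratings]

# One hypothetical feature value per segment (syllables per second).
speaking_rate = [4.8, 3.1, 4.0, 5.2]
r = pearson(speaking_rate, mean_ratings)
```

Repeating the last line for each acoustic/prosodic or lexical feature gives one coefficient per feature.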
13Acoustic/Prosodic and Lexical Features
- Min, max, mean, stdev F0
- Raw and normalized by speaker
- Min, max, mean, stdev intensity
- Speaking rate (syls/sec)
- Mean and stdev of normalized F0 and intensity across phrases
- Duration (secs)
- Length (words, syls)
- Number of intonational, intermediate, and internal phrases
- Mean words per intermediate and intonational phrase
- Mean syllables/word
- 1st, 2nd, 3rd person pronoun density
- Function to content word ratio
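Two of the lexical features above can be sketched in a few lines. The word lists are small hypothetical stand-ins, not the lists used in the study, and "density" is assumed to mean the count divided by the total number of words.

```python
# Hypothetical, deliberately tiny word lists for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
FUNCTION_WORDS = FIRST_PERSON | {
    "the", "a", "an", "to", "of", "and", "in", "for",
    "is", "will", "this", "you", "your",
}

def lexical_features(text):
    """Compute first person pronoun density and function-to-content word ratio."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    first_person = sum(1 for w in words if w in FIRST_PERSON)
    function = sum(1 for w in words if w in FUNCTION_WORDS)
    content = len(words) - function
    return {
        "n_words": len(words),
        "first_person_density": first_person / len(words),
        "function_to_content": function / content if content else float("inf"),
    }

feats = lexical_features("I will fight for our families, and we will win.")
```

Second and third person densities would follow the same pattern with their own pronoun sets.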
14What makes speech charismatic?
- More content
- Length in secs, words, syllables, and phrases
- Use of polysyllabic words
- Lexical complexity (mean syllables per word)
- Use of more first person pronouns
- First person pronoun density
- Higher and more dynamic raw F0
- Min, max, mean, std. dev. of F0 over male speakers
- Greater intensity
- Mean intensity
15
- Higher in a speaker's pitch range
- Mean normalized F0
- Faster speaking rate
- Syllables per second
- Greater variation in F0 and intensity across phrases
- Std. dev. of normalized phrase F0 and intensity
- But...what about cultural differences?
- Next:
- Swedish ratings of American tokens
- Palestinian Arab ratings of Arabic tokens
16Acoustic/Prosodic and Lexical Cues to Deception (Enos)
- Deception evokes emotion in deceivers (Ekman 85-92)
- Fear of discovery: higher pitch, faster, louder, pauses, disfluencies, indirect speech
- Elation at successful deceiving ("duping delight"): higher pitch, faster, louder, greater elaboration
- Detecting cues to these emotions may also identify deception
- Can prosody help us identify deceptive speakers?
17Columbia/SRI/Colorado Corpus
- 15.2 hrs. of interviews; 7 hrs. of subject speech
- Lexically transcribed, automatically aligned
- Labeling conditions: Global / Local
- Segmentation (LT/LL)
- slash units (5709/3782)
- phrases (11,612/7108)
- turns (2230/1573)
- Acoustic/prosodic features extracted from ASR output; lexical and discourse features also extracted
18Sample Features
- Duration features
- Phone / Vowel / Syllable Durations
- Normalized by Phone/Vowel Means, Speaker
- Speaking rate features (vowels/time)
- Pause features (cf. Benus et al. 2006)
- Speech to pause ratio, number of long pauses
- Maximum pause length
- Energy features (RMS energy)
- Pitch features
- Pitch stylization (Sonmez et al.)
- LTM model of F0 to estimate speaker range
- Pitch ranges, slopes, locations of interest
- Spectral tilt features
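The pause features listed above can be sketched from word-level time alignments. The input format (a sorted list of start and end times in seconds) and the 0.5 s threshold for a "long" pause are assumptions for this example; the slides do not give the threshold used.

```python
def pause_features(words, long_pause_s=0.5):
    """Compute pause features from word alignments.

    words: list of (start_s, end_s) pairs, sorted by start time.
    long_pause_s: hypothetical threshold for counting a pause as long.
    """
    speech = sum(end - start for start, end in words)
    # A pause is any positive gap between the end of one word and the next start.
    pauses = [
        nxt[0] - cur[1]
        for cur, nxt in zip(words, words[1:])
        if nxt[0] > cur[1]
    ]
    total_pause = sum(pauses)
    return {
        "speech_to_pause": speech / total_pause if total_pause else float("inf"),
        "n_long_pauses": sum(1 for p in pauses if p >= long_pause_s),
        "max_pause": max(pauses, default=0.0),
    }

# Toy alignment: four half-second words with gaps of 0.0, 0.75 and 0.25 seconds.
alignment = [(0.0, 0.5), (0.5, 1.0), (1.75, 2.25), (2.5, 3.0)]
feats = pause_features(alignment)
```

With this toy alignment there are 2.0 seconds of speech and 1.0 second of pause, so the ratio is 2.0, with one long pause of 0.75 seconds.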
19(No Transcript)
20Speech summarization in Broadcast News
- Problem: How do we summarize text and speech documents together?
- Recognition Errors
- Named Entities
- Misrecognized rare terms
- Error propagation in the processing pipeline of ASR transcripts
- Ex: Sentence boundary -> Turn boundary -> Speaker Roles -> Summarization
- Solution: Combining lexical and acoustic information in one framework
21Current Approach
- Use acoustic/prosodic features to compute acoustic significance of sentences
- Remove disfluencies from ASR transcripts
- Compute ASR confidence for sentences
- Cluster text and speech transcripts together
- Use acoustic scores as additional weights
- Word or Phrase level acoustic significance
- Emphasized George Bush vs. non-emphasized George Bush
- Use Broadcast News structure in summarization
- Headlines, Soundbites, Interviews, Weather report, Sports section may be useful for certain questions: opinion, attribution, disaster
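One way to picture acoustic scores acting as additional weights is a simple sentence ranker. The scoring formula below (a linear mix of lexical and acoustic scores, scaled by ASR confidence) is purely an assumption for illustration; the slides do not give the actual combination, and all field names and values are hypothetical.

```python
def rank_sentences(sentences, acoustic_weight=0.5):
    """Rank sentences by a hypothetical combined lexical/acoustic score."""
    def score(s):
        combined = (
            (1 - acoustic_weight) * s["lexical"]
            + acoustic_weight * s["acoustic"]
        )
        # Down-weight sentences whose transcript the recognizer is unsure about.
        return combined * s["asr_confidence"]
    return sorted(sentences, key=score, reverse=True)

# Toy scores in [0, 1] for three candidate sentences.
sentences = [
    {"id": "s1", "lexical": 0.8, "acoustic": 0.2, "asr_confidence": 0.9},
    {"id": "s2", "lexical": 0.6, "acoustic": 0.9, "asr_confidence": 0.8},
    {"id": "s3", "lexical": 0.9, "acoustic": 0.7, "asr_confidence": 0.4},
]
ranking = [s["id"] for s in rank_sentences(sentences)]
text_only = [s["id"] for s in rank_sentences(sentences, acoustic_weight=0.0)]
```

In this toy case the acoustically prominent sentence s2 moves ahead of s1 once acoustic scores are included, while s3 stays last because of its low ASR confidence.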
22Spoken Dialogue Systems
- Discourse phenomena in dialogue
- Turn-taking
- Given/new information
- Cue phrases
- Entrainment
- The GAMES corpus
- 12 sessions of dialogue
- 12.2h
- Annotations: orthographic, turns, cue phrases, ToBI, question form and function
23(No Transcript)
24(No Transcript)
25Translating Prosody Mandarin/English (Rosenberg)
- Prosodic variation is the last thing we learn
- How do speakers convey suprasegmental information in different languages?
- To translate, first identify
- Automatic Identification of Prosodic Events
- Pitch Accents and Phrase Boundaries
- What are the correspondences?
- Discourse structure
- Intonational contours
- Information status
- Emotion