Prosodic Fortification in Error Resolution



1
Prosodic Fortification in Error Resolution
  • UW/Microsoft Symposium
  • Lesley Carmichael
  • January 23, 2004
  • This research was conducted at the Center for
    Human-Computer Communication, OGI School of
    Science & Engineering (Beaverton, OR)

2
Prosody & Speech Technologies
  • Prosodic information not widely used
  • No consistent framework
  • Normalization (f0 range, speech rate, etc.)
  • Bundles of prosodic features
  • Not just f0
  • How do features interact?

3
Speech Synthesizers
  • Current systems sound reasonable
  • Corpus-based generation
  • "I'm a synthetic voice, yet I know just how to
    intone my speech like a human"
  • Enhanced prosodic markup could provide some
    benefit
  • Special domain, concept-to-speech, e.g., weather
    reports

4
Speech Recognizers
  • Typically don't use prosodic information
  • Calculate probability of phones, words, phrases
  • Possible gains?
  • Classify dialog acts
  • Identify boundaries
  • Identify discourse-level structure
  • Identify emotion

5
Background
  • Hyperarticulate speech is common in error
    resolution (e.g., Oviatt et al. 1998)
  • Clearer pronunciation of segments
  • Slower speech
  • Pauses

6
Prosodic Fortification
  • Changing and/or augmenting the prosody of an
    utterance
  • Making words phrasally prominent
  • Grouping words together with boundaries
  • Changing tonal features
  • Analogous to segmental hyperarticulation
  • Adds clarification and organization to signal

7
Pitch Accents
  • Phonetic features indicating prominence
  • Relatively High or Low fundamental frequency
  • Can be simple (one tone) or complex (bitonal)
  • Associated with a particular syllable
  • Used to emphasize certain words relative to others

8
Breaks
  • Perceptible disjuncture between adjacent words
  • Can be indicated with pausing, tonal features,
    and/or segmental hyperarticulation
  • Indicate phrase boundaries
  • Used to group information

9
Present Study
  • Investigates whether speakers substantially
    fortify prosody in an attempt to resolve errors
  • Adding new prosodic events
  • Categorically changing quality of prosodic events

10
Hypotheses
  • In error resolution (as compared to original
    input), speakers will:
  • 1. Use more pitch accents
  • 2. Use more complex pitch accents
  • 3. Use more breaks
  • 4. Use more total disjuncture strength
  • 5. Change tone types

11
Speech Data
  • Children interacting with computing device to
    learn about marine biology
  • Children received occasional system recognition
    errors (Wizard of Oz)
  • In some cases, children produced a lexically
    verbatim utterance after the error

12
  • 103 lexically verbatim utterance pairs
  • Pre-error and post-error speech
  • 2-8 pairs per speaker
  • 22 speakers
  • MLU 3.2, range 1-11 words
  • Provides opportunity to examine prosodic changes
    in spoken error resolution

13
Model for Analysis
  • Tones and Break Indices (ToBI) (Silverman et
    al. 1992)
  • Analyzes intonation into series of discrete
    events
  • Small set of event types
  • Measures changes of kind rather than degree

14
Evaluation Metrics
  • Categorical changes
  • Presence vs. absence of pitch accent
  • Complexity of pitch accent
  • Presence vs. absence of breaks
  • Level of break
  • Type of base tone (H or L)
  • Discrete intonation events, not continuous

15
ToBI Pitch Accents
  • Identified by the presence and timing of
    relatively High and Low fundamental frequency
    values
  • 3 simple tones:
  • H*, !H*, L*
  • 5 complex tones:
  • L+H*, L+!H*, L*+H, L*+!H, H+!H*

16
ToBI Breaks
  • Identified by the perceived degree of disjuncture
    between words
  • 5 levels:
  • 0: clitic grouping
  • 1: standard word boundary
  • 2: slight break, no tonal marking
  • 3: intermediate phrase boundary, with tone mark
  • 4: full intonational phrase boundary, with tone
    mark

17
  • Tone values of breaks:
  • Level 2: (none)
  • Level 3: H- or L-
  • Level 4: L-L%, L-H%, H-L%, H-H%, !H-L%, !H-H%
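
The label inventory on slides 15 to 17 lends itself to simple string tests. The following is a minimal sketch, assuming standard ToBI label strings such as L+H* or L-H%. The function names are illustrative and are not part of any ToBI tool or of the study's own analysis.

```python
import re


def is_complex_accent(label: str) -> bool:
    """A bitonal (complex) pitch accent joins two tones with '+'."""
    return "+" in label


def base_tone(label: str) -> str:
    """Return 'H' or 'L' for the starred tone; downstep '!' is ignored."""
    match = re.search(r"!?([HL])\*", label)
    if match is None:
        raise ValueError(f"not a pitch accent label: {label!r}")
    return match.group(1)


def break_has_tone(level: int) -> bool:
    """Only break levels 3 and 4 carry a tone mark in the scheme above."""
    if level not in (0, 1, 2, 3, 4):
        raise ValueError(f"break index must be 0-4, got {level!r}")
    return level >= 3


for accent in ["H*", "!H*", "L*", "L+H*", "L*+H", "H+!H*"]:
    kind = "complex" if is_complex_accent(accent) else "simple"
    print(accent, kind, "base tone", base_tone(accent))
```

Treating the starred tone as the base tone is an assumption about what "type of base tone (H or L)" means on the metrics slide.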

18
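Examples
  • [Figures: Original and Repeat utterance pairs; images not preserved]

19
  • [Figure: Original and Repeat utterance pair; image not preserved]

20
  • [Figure: Original and Repeat utterance pair; image not preserved]
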
21
Results
  • Hypothesis 1: Speakers will use more pitch
    accents in repeats than original input
  • Original: 2.20 mean accents per utterance
  • Repeat: 2.56 mean accents per utterance
  • 16% more accents in repeats
  • Significant by Wilcoxon Signed Ranks Test,
    z = 3.39 (N = 16), p < .0005, one-tailed.
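
For readers unfamiliar with the test named above, here is a minimal pure-Python sketch of a Wilcoxon signed-rank test using the normal approximation. The slides do not say how the test was run or what N counts. This sketch assumes the usual procedure in which zero differences are dropped, so n is the number of non-tied pairs. The sample values are hypothetical and are not data from the study.

```python
import math


def wilcoxon_signed_rank_z(before, after):
    """Return (n, z) for paired samples via the normal approximation.

    Zero differences are dropped, tied |differences| receive average
    ranks, and no continuity correction is applied.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all pairs are tied")
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    return n, (w_plus - mean) / math.sqrt(var)


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical per-speaker mean accent counts (NOT data from the study).
original = [2.0, 1.5, 3.0, 2.5, 2.0, 1.0]
repeat = [2.5, 1.5, 3.5, 3.5, 2.25, 1.75]
n, z = wilcoxon_signed_rank_z(original, repeat)
print(f"n = {n}, z = {z:.2f}, one-tailed p = {one_tailed_p(z):.4f}")
```

With samples this small an exact test would normally be preferred; the normal approximation is used here only to show where a z value comes from.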

22
  • Hypothesis 2: Speakers will use more complex
    pitch accents on repeats than original input.
  • Original: 17.7% of accents were complex
  • Repeat: 25.3% of accents were complex
  • 43% more complex tones in repeats
  • Significant by Wilcoxon Signed Ranks Test,
    z = 2.21 (N = 19), p < .015, one-tailed.
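
The "more in repeats" figures on these slides are consistent with relative increases over the original value, which is easy to confuse with a percentage-point difference. A small sketch of that arithmetic, using only the means reported for Hypotheses 1 and 2 (the function name is illustrative):

```python
def relative_increase(original: float, repeat: float) -> float:
    """Percent change from original to repeat, relative to original."""
    return (repeat - original) / original * 100


reported = {
    "accents per utterance": (2.20, 2.56),
    "share of complex accents": (17.7, 25.3),
}
for name, (orig, rep) in reported.items():
    print(f"{name}: {relative_increase(orig, rep):.0f}% relative increase")
```

For the complex-accent share, the same data can also be described as a rise of 7.6 percentage points (25.3 minus 17.7).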

23
  • Hypothesis 3: Speakers will use more breaks on
    repeats than original input
  • Original: 1.39 mean breaks per utterance
  • Repeat: 1.70 mean breaks per utterance
  • 22% more breaks in repeats
  • Significant by Wilcoxon Signed Ranks Test,
    z = 3.18 (N = 13), p < .0005, one-tailed.

24
  • Hypothesis 4: Speakers will use more total
    disjuncture strength on repeats than original
    input
  • Original: mean break magnitude 3.98
  • Repeat: mean break magnitude 4.66
  • 17% greater total break strength in repeats
  • Significant by Wilcoxon Signed Ranks Test,
    z = 3.30 (N = 14), p < .0005, one-tailed.
  • (used scale 1-5 to represent breaks 0-4)
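
One plausible reading of "total disjuncture strength" is the sum of break strengths in an utterance after shifting ToBI levels 0-4 onto the 1-5 scale mentioned above. The sketch below implements that reading. The slides do not say which breaks were included in the total, so the function simply sums whatever labeled breaks it is given, and the example utterances are hypothetical.

```python
def total_disjuncture(break_indices):
    """Sum break strengths with ToBI levels 0-4 mapped onto a 1-5 scale."""
    for level in break_indices:
        if level not in (0, 1, 2, 3, 4):
            raise ValueError(f"break index must be 0-4, got {level!r}")
    return sum(level + 1 for level in break_indices)


# Hypothetical labels for one utterance pair (NOT data from the study):
# the repeat strengthens an internal word boundary (1) to a level 3 break.
original_breaks = [1, 4]
repeat_breaks = [3, 4]
print(total_disjuncture(original_breaks), total_disjuncture(repeat_breaks))
```

The shift by one keeps a level 0 break from contributing nothing to the total, which is presumably why a 1-5 scale was used.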

25
  • Hypothesis 5: Speakers will change the tone type
    of pitch accents and breaks on repeats as
    compared to original input
  • 19% of pitch accents changed tone type
  • 38% to 49% of level 4 breaks changed tone type
    (depending on the number of level 4 breaks in
    the utterance)

26
  • Hypothesis 5 cannot be statistically verified
  • Need rate of spontaneous tone shift as baseline

27
Discussion
  • Speakers systematically manipulate prosody to aid
    in recognition
  • Speech is fortified
  • Many strategies available: new accents, new
    breaks, accent complexity, break strength, tone
    type
  • Interactions?
  • Do some strategies complement or supplant others?

28
Quantity vs. Quality
  • Accents
  • 79% of accent tone changes occurred in utterances
    with NO new accents
  • 72% of new accents occurred in utterances with
    NO tone changes
  • Tend toward complementarity?
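
The two shares above can be tallied from per-pair counts. The following is a minimal sketch under the assumption that each utterance pair has already been reduced to two numbers: how many accents are new in the repeat, and how many retained accents changed tone type. The PairSummary type and the sample pairs are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass
class PairSummary:
    new_accents: int   # accents in the repeat that the original lacked
    tone_changes: int  # retained accents whose tone type changed


def complementarity(pairs):
    """Return (% of tone changes in pairs with no new accents,
    % of new accents in pairs with no tone changes)."""
    total_changes = sum(p.tone_changes for p in pairs)
    total_new = sum(p.new_accents for p in pairs)
    if total_changes == 0 or total_new == 0:
        raise ValueError("need at least one tone change and one new accent")
    changes_alone = sum(p.tone_changes for p in pairs if p.new_accents == 0)
    new_alone = sum(p.new_accents for p in pairs if p.tone_changes == 0)
    return (100 * changes_alone / total_changes, 100 * new_alone / total_new)


# Hypothetical pair summaries (NOT data from the study).
pairs = [
    PairSummary(new_accents=0, tone_changes=2),
    PairSummary(new_accents=1, tone_changes=0),
    PairSummary(new_accents=1, tone_changes=1),
    PairSummary(new_accents=2, tone_changes=0),
]
changes_share, new_share = complementarity(pairs)
print(f"{changes_share:.0f}% of tone changes, {new_share:.0f}% of new accents")
```

High values for both shares would point toward the complementarity the slide asks about, since speakers would then tend to use one strategy or the other within a given repeat.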

29
  • Breaks
  • 38% of level 4 break tones changed across all
    utterances
  • Utterances with:
  • More than one level 4 break: 10% changed
  • Only 1 level 4 break (but 2s, 3s ok): 42% changed
  • Only 1 level 4 break (no other breaks): 49%
    changed
  • More tone changes when fewer breaks are available
    to organize information?

30
Prosody in Automatic Speech Processing
  • Data-driven prosody modeling
  • Acoustic/phonetic
  • Application of prosody models improved results
    for
  • Structural tagging (sentence boundaries,
    disfluencies)
  • Pragmatic, paralinguistic tagging (dialog act
    classification, emotion)
  • Speaker recognition
  • Word recognition
  • (Shriberg & Stolcke, 2004)

31
Prosodic Knowledge Sources
  • 3 prosody models used to rescore N-best list for
    word recognition
  • Word duration
  • Word and pause duration/interaction
  • Other events: boundaries, disfluencies
  • Each one improved scores independently
  • All 3 combined increased improvement
  • (Vergyri et al., 2003)
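
N-best rescoring in general means re-ranking a recognizer's candidate transcriptions after adding scores from extra knowledge sources. The sketch below shows a generic weighted sum of log scores. The slides do not describe how Vergyri et al. combined their models, so the combination rule, the score names, the weights, and the hypotheses here are all assumptions made for illustration.

```python
def rescore_nbest(hypotheses, weights):
    """Re-rank hypotheses by a weighted sum of their log scores.

    Each hypothesis is a dict with "words" and a "scores" dict keyed by
    knowledge-source name; higher (less negative) scores are better.
    """
    def combined(hyp):
        return sum(weights[name] * s for name, s in hyp["scores"].items())

    return sorted(hypotheses, key=combined, reverse=True)


# Hypothetical 2-best list with a baseline score and a duration-model score.
nbest = [
    {"words": "show me the dish", "scores": {"baseline": -9.5, "duration": -6.0}},
    {"words": "show me the fish", "scores": {"baseline": -10.0, "duration": -4.0}},
]

baseline_only = rescore_nbest(nbest, {"baseline": 1.0, "duration": 0.0})
with_prosody = rescore_nbest(nbest, {"baseline": 1.0, "duration": 0.5})
print(baseline_only[0]["words"], "->", with_prosody[0]["words"])
```

In practice the weights would be tuned on held-out data; the point of the example is only that a prosodic score can change which hypothesis ranks first.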

32
Hot Spots in Meetings
  • High involvement: heated discussion, etc.
  • Auto processing of meetings
  • Summarize, find important information
  • Prosodic features automatically extracted
  • Humans judge hot spots reliably
  • F0 and energy: correlation between human judgments
    and prosodic features of hot spots
  • (Wrede & Shriberg, 2003)

33
Emotion Detection
  • Prosodic model predicts neutral vs. annoyed or
    frustrated
  • Accuracy close to human labelers
  • Duration and speaking rate had most impact
  • Hyperarticulation and pauses were not useful
  • (Ang et al., 2002)

34
Future Work
  • Investigate possible weighting of prosodic
    fortification strategies
  • Repeat analyses with adult speech
  • Segmental and prosodic hyperarticulation
  • Compare error correction and hot spot speech
  • Assess prosodic-syntactic boundary alignment in
    regular vs. error-correction speech

35
References
  • Ang, J., Dhillon, R., Krupski, A., Shriberg, E.
    and Stolcke, A. (2002), Prosody-Based Automatic
    Detection of Annoyance and Frustration in
    Human-Computer Dialog. Proc. Intl. Conf. on
    Spoken Language Processing, vol. 3, pp.
    2037-2040, Denver.
  • Carmichael, L. (submitted). Intonation
    categories and continua.
  • Oviatt, S.L., M. MacEachern, G. Levow. (1998).
    Predicting hyperarticulate speech during
    human-computer error resolution. Speech
    Communication, 24, 87-110.
  • Pierrehumbert, J., J. Hirschberg. (1990). The
    meaning of intonational contours in the
    interpretation of discourse. In Cohen, P.R.,
    Morgan, J., Pollack, M.E. (eds.), Intentions in
    Communication. Cambridge, MA: MIT Press, 271-323.
  • Pitrelli, J., M.E. Beckman, J. Hirschberg.
    (1994). Evaluation of prosodic transcription
    labeling reliability in the ToBI framework. In
    Proc. ICSLP, 123-126.
  • Shriberg, E. and Stolcke, A. (2004). Direct
    Modeling of Prosody: An Overview of Applications
    in Automatic Speech Processing. To appear in
    Proc. International Conference on Speech Prosody
    2004, Nara, Japan.
  • Silverman, K., M.E. Beckman, J. Pitrelli, M.
    Ostendorf, C. Wightman, P. Price, J.
    Pierrehumbert, J. Hirschberg. (1992). ToBI: a
    standard for labeling English prosody. In Proc.
    ICSLP, 2, 867-870.
  • Vergyri, D., Stolcke, A., Gadde, V.R.R., Ferrer,
    L. and Shriberg, E. (2003), Prosodic Knowledge
    Sources for Automatic Speech Recognition. Proc.
    IEEE Intl. Conf. on Acoustics, Speech and Signal
    Processing, Hong Kong.
  • Wrede, B. and Shriberg, E. (2003), Spotting
    "Hotspots" in Meetings: Human Judgments and
    Prosodic Cues. Proc. Eurospeech, Geneva.
  • http://www.ling.ohio-state.edu/tobi