Goals and Objectives

1
Stress-Accent and Vowel Quality in the Switchboard Corpus
Steven Greenberg and Leah Hitchcock
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng
NIST Workshop on Large Vocabulary Continuous Speech Recognition
Maritime Institute of Technology, May 4, 2001
2
Take Home Messages
  • There is an intimate relationship between vocalic
    identity, nucleic duration and stress accent in
    spontaneous dialogue (at least in the
    Switchboard corpus)
  • Stressed syllables tend to have significantly
    longer nuclei than their unstressed
    counterparts, consistent with the findings
    reported by Silipo and Greenberg in previous
    years' meetings regarding the OGI Stories
    corpus (telephone monologues)
  • Certain vocalic classes exhibit a far greater
    dynamic range in duration than others
  • Diphthongs tend to be longer than monophthongs,
    BUT ...
  • The low monophthongs (ae, aa, ay, aw,
    ao) exhibit patterns of duration and dynamic
    range under stress (accent) similar to diphthongs
  • The statistical patterns are consistent with the
    hypothesis that duration serves under many
    conditions as either a primary or secondary cue
    for vowel height (normally associated with the
    frequency of the first formant)

3
Take Home Messages
  • Moreover, the stress-accent system in spontaneous
    (American) English appears to be closely
    associated with vocalic identity
  • Low vowels are far more likely to be fully
    stressed than high vowels (with the mid vowels
    exhibiting an intermediate probability of being
    stressed)
  • Thus, the identity of a vowel cannot be
    considered independently of stress-accent
  • The two parameters are likely to be flip sides of
    the same Koine
  • Although English is not generally considered to
    be a vowel-quantity language (as is Finnish),
    given the close relationship between
    stress-accent and duration, and between
    duration and vowel quality, there is some sense
    in which English (and perhaps other
    stress-accent languages) manifests certain
    properties of a quantity system
  • Thus, vowel duration may be an important factor
    in disambiguating spoken language and therefore
    should be of interest to the speech
    recognition community

4
What is (usually) Meant by Prosodic Stress?
  • Prosody is supposed to pertain to extra-phonetic
    cues in the acoustic signal
  • The pattern of variation over a sequence of
    SYLLABLES pertaining to syllabic DURATION,
    AMPLITUDE and PITCH (f0) variation over time
    (but the plot thickens, as we shall see)

5
Why is Prosodic Stress Important?
  • It supposedly provides important information
    about:
  • Focus of the speaker's attention and emphasis for
    the listener
  • What is new and important information
  • Emotional context of the utterance - surprise,
    sarcasm, shock, delight, anger, impatience, etc.
  • Syntactic disambiguation, particularly at the
    clausal/sentential level, e.g., interrogative vs.
    declarative forms
  • Perceptual processing - parsing the utterance
    into chunks for reliable understanding
  • Prosody provides a window onto the higher levels
    of language
  • Can be useful for developing
    semantic-oriented models for speech
    understanding (Information spotting)
  • Prosody affects pronunciation (and vice versa)
  • Can be useful for modeling pronunciation
    variation in ASR
  • Phonetic properties may be correlated with
    prosodic stress -
  • THIS IS THE TOPIC FOR TODAY'S PRESENTATION

6
The Nitty Gritty (a.k.a. the Corpus Material)
  • SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS (same
    as Phoneval-2000)
  • Switchboard contains informal telephone dialogues
  • 54 minutes of material that had previously been
    phonetically transcribed (by highly trained
    phonetics students from UC Berkeley)
  • 45.5 minutes of pure speech (filled pauses,
    junctures filtered out), consisting of
  • 9,991 words, 13,446 syllables, 33,370 phonetic
    segments
  • All of this material had been hand-segmented at
    either the phonetic-segment or syllabic level
    by the transcribers
  • The syllabic-segmented material was subsequently
    segmented at the phonetic-segment level by a
    special-purpose neural network trained on
    72 minutes of hand-segmented Switchboard
    material. This automatic segmentation was
    manually verified

7
Evaluation Material Details
  • AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
  • BROAD DISTRIBUTION OF UTTERANCE DURATIONS
  • 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%
    (mean 4.75 s)
  • COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN
    SWITCHBOARD
  • A WIDE RANGE OF DISCUSSION TOPICS
  • VARIABILITY IN DIFFICULTY (VERY EASY TO VERY
    HARD)

[Figures: number of utterances by subjective difficulty and by dialect region]
8
Manual Transcription of Stress Accent
  • Two UC Berkeley linguistics students each
    transcribed the full 45 minutes of material
    (i.e., there is 100% overlap between the two)
  • Three levels of stress-accent were marked for
    each syllabic nucleus
  • Fully stressed (78% concordance between
    transcribers)
  • Completely unstressed (85% interlabeler
    agreement)
  • An intermediate level of accent (neither fully
    stressed nor completely unstressed; ca. 60%
    concordance)
  • Hence, 95% concordance in terms of some level of
    stress
  • The labels of the two transcribers were averaged
  • In those instances where there was disagreement,
    the magnitude of disparity was almost always
    (ca. 90%) one step. Usually, disagreement
    signaled a genuine ambiguity in stress accent
  • The illustrations in this presentation are based
    solely on those data in which both transcribers
    concurred (i.e., fully stressed or completely
    unstressed)
  • A table containing the complete set of data is in
    a paper submitted to Eurospeech (in the
    workshop notebook)
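
The labeling procedure above (two transcribers, three levels, averaged labels, and plots restricted to nuclei on which both concurred) can be sketched as follows. This is only an illustration: the slides do not say how the levels were coded numerically or how concordance was computed, so the 0 / 0.5 / 1 coding, the concordance definition and all function names here are assumptions.

```python
# Hypothetical numeric coding of the three stress-accent levels (not from the slides).
LEVELS = {"unstressed": 0.0, "intermediate": 0.5, "stressed": 1.0}


def average_labels(labels_a, labels_b):
    """Average the two transcribers' labels, nucleus by nucleus."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both transcribers must label the same nuclei")
    return [(LEVELS[a] + LEVELS[b]) / 2 for a, b in zip(labels_a, labels_b)]


def concordance(labels_a, labels_b, level):
    """Assumed definition: of the nuclei that either transcriber
    marked with `level`, the fraction that both marked with it."""
    pairs = list(zip(labels_a, labels_b))
    either = sum(1 for a, b in pairs if level in (a, b))
    both = sum(1 for a, b in pairs if a == level and b == level)
    return both / either if either else float("nan")


def agreed_indices(labels_a, labels_b, levels=("stressed", "unstressed")):
    """Indices of nuclei on which both transcribers chose the same
    label, restricted to the given levels."""
    return [
        i
        for i, (a, b) in enumerate(zip(labels_a, labels_b))
        if a == b and a in levels
    ]
```

With this coding, an averaged value of 0.25 or 0.75 marks a one-step disagreement and 0.5 can mean either a two-step disagreement or agreement on the intermediate level, so the raw label pairs should be kept alongside the averages.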

9
The Conventional Wisdom on Stress-Accent
  • "Pitch is widely regarded, at least in English,
    as the most salient determinant of prominence. In
    other words, when a syllable or word is perceived
    as 'stressed' or 'emphasized,' it is pitch height
    or a change in pitch, more than length or
    loudness that is likely to be mainly responsible
    (see, for example, Fry 1958, Gimson 1980, pp.
    222-226, Lehiste 1976, Fudge, 1984, ch. 1)"
  • Clark, J. and Yallop, C. (1990) An Introduction
    to Phonetics and Phonology. Oxford, Blackwell, p.
    280.
  • "In fact, although it is clear that stressed
    syllables often have greater overall acoustic
    intensity than weakly stressed ones, loudness
    seems to be the least salient and least
    consistent of the three parameters of pitch,
    duration and loudness - at least for purposes
    such as signaling stress" (ibid, p. 282)
  • Thus, according to the general consensus, the
    important parameters are (in order) - PITCH,
    DURATION, LOUDNESS
  • (the latter most closely correlated with TOTAL
    ENERGY, i.e., duration x amplitude, cf. further
    on)

10
OGI Stories - Pitch Doesn't Cut the Mustard
  • Although pitch range is the most important of the
    f0-related cues, it is not as good a predictor
    of stress as DURATION

11
Total Energy is the Best Predictor of Stress
  • Duration x Amplitude is superior to all other
    combination pairs of acoustic parameters. Pitch
    appears redundant with duration.
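
As a rough illustration of the duration x amplitude measure, the sketch below computes the three quantities for a single nucleus. The slides do not state how amplitude was measured, so the use of RMS amplitude here is an assumption, as are the function name and the input format (a list of waveform samples plus a sampling rate in Hz).

```python
import math


def nucleus_measures(samples, sample_rate):
    """Return (duration_s, rms_amplitude, total_energy) for one nucleus.

    total_energy is duration x amplitude, as on the slide; RMS is an
    assumed stand-in for the amplitude measure, which is not specified.
    """
    if not samples or sample_rate <= 0:
        raise ValueError("need at least one sample and a positive rate")
    duration = len(samples) / sample_rate
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return duration, rms, duration * rms
```

Because the product grows with either factor, a long but quiet nucleus and a short but loud one can receive similar values, which is why the measure is examined separately from duration and amplitude.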

12
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory properties
    - both related to the motion of the tongue
  • The front-back plane is most closely associated
    with the second formant frequency (or more
    precisely F2 - F1) and the volume of the
    front-cavity resonance
  • The height parameter is closely linked to the
    frequency of F1
  • In the classic vowel triangle segments are
    positioned in terms of the tongue positions
    associated with their production, as follows

13
Duration/Amplitude/Int. Energy - Which?
  • There are supposed to be large differences in the
    intrinsic amplitude and duration of vowels
  • Could such differences be compensated for in
    terms of stress?
  • Let's take a closer look!

14
Amplitude Differences - Stressed/Unstressed
  • There are very small differences in amplitude
    between stressed and unstressed nuclei
  • The lax monophthongs tend to have a slightly
    larger dynamic range than diphthongs

15
Durational Differences - Stressed/Unstressed
  • There is a large dynamic range in duration
    between stressed and unstressed nuclei
  • Diphthongs and tense, low monophthongs tend to
    have a larger range than the lax monophthongs

16
Int. Energy Differences - Stressed/Unstressed
  • There is a large dynamic range in integrated
    energy between stressed and unstressed nuclei
  • Diphthongs and tense, low monophthongs tend to
    have a larger range than the lax monophthongs
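
The stressed/unstressed comparisons on the three preceding slides could be tabulated per vowel along the following lines. The slides do not define "dynamic range" numerically; this sketch assumes it is the ratio of the mean stressed value to the mean unstressed value, and the function name and input format are hypothetical. It assumes strictly positive measurements, as durations are.

```python
from collections import defaultdict
from statistics import mean


def stress_dynamic_range(nuclei):
    """nuclei: iterable of (vowel, is_stressed, value) tuples, where
    value is a duration, an amplitude, or their product.

    Returns {vowel: mean stressed value / mean unstressed value} for
    every vowel observed in both conditions (an assumed definition).
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for vowel, is_stressed, value in nuclei:
        groups[vowel][bool(is_stressed)].append(value)
    return {
        vowel: mean(by_stress[True]) / mean(by_stress[False])
        for vowel, by_stress in groups.items()
        if by_stress[True] and by_stress[False]
    }
```

A ratio near 1 means stress barely changes the measure for that vowel; larger ratios correspond to the "large dynamic range" described for duration and integrated energy.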

17
Spatial Patterning of Duration and Amplitude
  • Let's return to the vowel triangle and see if it
    can shed light on certain patterns in the
    vocalic data
  • The duration, amplitude and their product
    (integrated energy) will be plotted on a 2-D
    grid, where the x-axis will always be in terms
    of hypothetical front-back tongue position (and
    hence remain constant throughout the plots to
    follow)
  • The y-axis will serve as the dependent measure,
    sometimes expressed in terms of duration, or
    amplitude, or their product

18
Diphthongal Amplitude and Vowel Height
All nuclei
19
Monophthongal Amplitude and Vowel Height
All nuclei
20
Amplitude - Monophthongs vs. Diphthongs
Monophthongs
Diphthongs
All nuclei
21
Diphthongal Duration and Vowel Height
All nuclei
22
Monophthongal Duration and Vowel Height
All nuclei
23
Duration - Monophthongs vs. Diphthongs
All nuclei
24
Diphthongal Int. Energy and Vowel Height
All nuclei
25
Monophthongal Int. Energy and Vowel Height
All nuclei
26
Int. Energy - Monophthongs vs. Diphthongs
All nuclei
27
Diphthongal Amplitude and Vowel Height
Stressed nuclei
28
Diphthongal Amplitude and Vowel Height
Unstressed nuclei
29
Monophthongal Amplitude and Vowel Height
Stressed nuclei
30
Monophthongal Amplitude and Vowel Height
Unstressed nuclei
31
Amplitude - Monophthongs vs. Diphthongs
Monophthongs
Diphthongs
Stressed
Unstressed
32
Diphthongal Duration and Vowel Height
Stressed nuclei
33
Diphthongal Duration and Vowel Height
Unstressed nuclei
34
Monophthongal Duration and Vowel Height
Stressed nuclei
35
Monophthongal Duration and Vowel Height
Unstressed nuclei
36
Duration - Monophthongs vs. Diphthongs
Monophthongs
Diphthongs
Stressed
Unstressed
37
Diphthongal Int. Energy and Vowel Height
Stressed nuclei
38
Diphthongal Int. Energy and Vowel Height
Unstressed nuclei
39
Monophthongal Int. Energy and Vowel Height
Stressed nuclei
40
Monophthongal Int. Energy and Vowel Height
Unstressed nuclei
41
Int. Energy - Monophthongs vs. Diphthongs
Monophthongs
Diphthongs
Stressed
Unstressed
42
Mystery Parameter
  • There is one other parameter which, when plotted
    in a vowel triangle, shows an interesting
    pattern
  • This is the proportion of stressed and unstressed
    nuclei
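
A minimal sketch of how such a proportion might be computed per vowel is given below. It assumes, in line with the earlier note that the illustrations use only fully stressed and completely unstressed nuclei, that the denominator is the count of those two categories alone; the function name and input format are hypothetical.

```python
from collections import Counter


def proportion_stressed(nuclei):
    """nuclei: iterable of (vowel, label) pairs, with label one of
    'stressed', 'unstressed' or anything else (ignored).

    Returns {vowel: stressed / (stressed + unstressed)}.
    """
    stressed, total = Counter(), Counter()
    for vowel, label in nuclei:
        if label not in ("stressed", "unstressed"):
            continue  # skip intermediate or disputed labels
        total[vowel] += 1
        if label == "stressed":
            stressed[vowel] += 1
    return {vowel: stressed[vowel] / total[vowel] for vowel in total}
```

Plotting these values against vowel height would show whether low vowels are stressed more often than high vowels.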

43
Proportion of Stress Accent and Vowel Height
44
Amplitude - Monophthongs vs. Diphthongs
Monophthongs
Diphthongs
All nuclei
45
Duration - Monophthongs vs. Diphthongs
All nuclei
46
Int. Energy - Monophthongs vs. Diphthongs
All nuclei
47
Summary and Conclusions
  • There is an intimate relationship between vocalic
    identity, nucleic duration and stress accent in
    spontaneous dialogue (at least in the
    Switchboard corpus)
  • Stressed syllables tend to have significantly
    longer nuclei than their unstressed
    counterparts, consistent with the findings
    reported by Silipo and Greenberg in previous
    years' meetings regarding the OGI Stories
    corpus (telephone monologues)
  • Certain vocalic classes exhibit a far greater
    dynamic range in duration than others
  • Diphthongs tend to be longer than monophthongs,
    BUT ...
  • The low monophthongs (ae, aa, ay, aw,
    ao) exhibit patterns of duration and dynamic
    range under stress (accent) similar to diphthongs
  • The statistical patterns are consistent with the
    hypothesis that duration serves under many
    conditions as either a primary or secondary cue
    for vowel height (normally associated with the
    frequency of the first formant)

48
Summary and Conclusions
  • Moreover, the stress-accent system in spontaneous
    (American) English appears to be closely
    associated with vocalic identity
  • Low vowels are far more likely to be fully
    stressed than high vowels (with the mid vowels
    exhibiting an intermediate probability of being
    stressed)
  • Thus, the identity of a vowel cannot be
    considered independently of stress-accent
  • Thus, vowel duration may be an important factor
    in disambiguating spoken language and therefore
    should be of interest to the speech
    recognition community