Title: Goals and Objectives
1 Stress-Accent and Vowel Quality in the Switchboard Corpus
Steven Greenberg and Leah Hitchcock
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng
NIST Workshop on Large Vocabulary Continuous Speech Recognition
Maritime Institute of Technology, May 4, 2001
2 Take Home Messages
- There is an intimate relationship between vocalic
identity, nucleic duration and stress accent in
spontaneous dialogue (at least in the
Switchboard corpus)
- Stressed syllables tend to have significantly
longer nuclei than their unstressed
counterparts, consistent with the findings
reported by Silipo and Greenberg at previous
years' meetings regarding the OGI Stories
corpus (telephone monologues)
- Certain vocalic classes exhibit a far greater
dynamic range in duration than others
- Diphthongs tend to be longer than monophthongs,
BUT ...
- The low monophthongs (ae, aa, ay, aw,
ao) exhibit patterns of duration and dynamic
range under stress (accent) similar to diphthongs
- The statistical patterns are consistent with the
hypothesis that duration serves under many
conditions as either a primary or secondary cue
for vowel height (normally associated with the
frequency of the first formant)
3 Take Home Messages
- Moreover, the stress-accent system in spontaneous
(American) English appears to be closely
associated with vocalic identity
- Low vowels are far more likely to be fully
stressed than high vowels (with the mid vowels
exhibiting an intermediate probability of being
stressed)
- Thus, the identity of a vowel cannot be
considered independently of stress-accent
- The two parameters are likely to be flip sides of
the same Koine
- Although English is not generally considered to
be a vowel-quantity language (as is Finnish),
given the close relationship between
stress-accent and duration, and between
duration and vowel quality, there is some sense
in which English (and perhaps other
stress-accent languages) manifests certain
properties of a quantity system
- Thus, vowel duration may be an important factor
in disambiguating spoken language and therefore
should be of interest to the speech
recognition community
4 What is (usually) Meant by Prosodic Stress?
- Prosody is supposed to pertain to extra-phonetic
cues in the acoustic signal
- The pattern of variation over a sequence of
SYLLABLES pertaining to syllabic DURATION,
AMPLITUDE and PITCH (fo) variation over time
(but the plot thickens, as we shall see)
5 Why is Prosodic Stress Important?
- It supposedly provides important information
about
- Focus of the speaker's attention and emphasis for
the listener
- What is new and important information
- Emotional context of the utterance - surprise,
sarcasm, shock, delight, anger, impatience, etc.
- Syntactic disambiguation, particularly at the
clausal/sentential level, e.g., interrogative,
declarative forms
- Perceptual processing - parsing the utterance
into chunks for reliable understanding
- Prosody provides a window onto the higher levels
of language
- Can be useful for developing
semantic-oriented models for speech
understanding (Information spotting)
- Prosody affects pronunciation (and vice versa)
- Can be useful for modeling pronunciation
variation in ASR
- Phonetic properties may be correlated with
prosodic stress
- THIS IS THE TOPIC FOR TODAY'S PRESENTATION
6 The Nitty Gritty (a.k.a. the Corpus Material)
- SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS (same
as Phoneval-2000)
- Switchboard contains informal telephone dialogues
- 54 minutes of material that had previously been
phonetically transcribed (by highly trained
phonetics students from UC-Berkeley)
- 45.5 minutes of pure speech (filled pauses,
junctures filtered out), consisting of
- 9,991 words, 13,446 syllables, 33,370 phonetic
segments
- All of this material had been hand-segmented at
either the phonetic-segment or syllabic level
by the transcribers
- The syllabic-segmented material was subsequently
segmented at the phonetic-segment level by a
special-purpose neural network trained on
72 minutes of hand-segmented Switchboard
material. This automatic segmentation was
manually verified
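The corpus counts above imply a few simple ratios that are handy as sanity checks. The sketch below only does the arithmetic on the figures quoted on the slide; the function name and dictionary keys are illustrative, and the mean syllable duration assumes that the 45.5 minutes of pure speech are entirely occupied by the counted syllables.

```python
# Corpus counts as quoted on the slide.
SPEECH_MINUTES = 45.5
WORDS = 9_991
SYLLABLES = 13_446
SEGMENTS = 33_370


def corpus_ratios(minutes=SPEECH_MINUTES, words=WORDS,
                  syllables=SYLLABLES, segments=SEGMENTS):
    """Return simple ratios derived from the corpus counts."""
    seconds = minutes * 60.0
    return {
        "syllables_per_word": syllables / words,
        "segments_per_syllable": segments / syllables,
        # Assumption: all of the pure speech is covered by counted syllables.
        "mean_syllable_duration_s": seconds / syllables,
        "words_per_minute": words / minutes,
    }


if __name__ == "__main__":
    for name, value in corpus_ratios().items():
        print(f"{name}: {value:.3f}")
```

On these counts the ratios come out to roughly 1.35 syllables per word, about 2.5 phonetic segments per syllable, and about 200 ms per syllable.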
7 Evaluation Material Details
- AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
- BROAD DISTRIBUTION OF UTTERANCE DURATIONS
- 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%
(mean 4.75 s)
- COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN
SWITCHBOARD
- A WIDE RANGE OF DISCUSSION TOPICS
- VARIABILITY IN DIFFICULTY (VERY EASY TO VERY
HARD)
[Figures: number of utterances by subjective difficulty and by dialect region]
8 Manual Transcription of Stress Accent
- 2 UC-Berkeley Linguistics students each
transcribed the full 45 minutes of material
(i.e., there is 100% overlap between the 2)
- Three levels of stress-accent were marked for
each syllabic nucleus
- Fully stressed (78% concordance between
transcribers)
- Completely unstressed (85% interlabeler
agreement)
- An intermediate level of accent (neither fully
stressed, nor completely unstressed; ca. 60%
concordance)
- Hence, 95% concordance in terms of some level of
stress
- The labels of the two transcribers were averaged
- In those instances where there was disagreement,
the magnitude of disparity was almost always
(ca. 90%) one step. Usually, disagreement
signaled a genuine ambiguity in stress accent
- The illustrations in this presentation are based
solely on those data in which both transcribers
concurred (i.e., fully stressed or completely
unstressed)
- A table containing the complete set of data is in
a paper submitted to Eurospeech (in the
workshop notebook)
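The label handling described above (averaging the two transcribers, checking how often disagreements are a single step, and keeping only nuclei on which both concur at an extreme) can be sketched as follows. The numeric coding of the three levels (0, 0.5, 1) and all function and key names are assumptions made for illustration; the slides do not say how the labels were encoded.

```python
# Hypothetical numeric coding of the three stress-accent levels.
UNSTRESSED, INTERMEDIATE, STRESSED = 0.0, 0.5, 1.0
STEP = 0.5  # distance between adjacent levels in this coding


def compare_transcribers(labels_a, labels_b):
    """Summarize agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both transcribers must label the same nuclei")
    if not labels_a:
        raise ValueError("at least one labeled nucleus is required")
    pairs = list(zip(labels_a, labels_b))
    # Averaged label per nucleus, as on the slide.
    averaged = [(a + b) / 2 for a, b in pairs]
    disagreements = [(a, b) for a, b in pairs if a != b]
    one_step = [p for p in disagreements if abs(p[0] - p[1]) == STEP]
    # Nuclei where both transcribers chose the same extreme label.
    concurring = [i for i, (a, b) in enumerate(pairs)
                  if a == b and a in (UNSTRESSED, STRESSED)]
    return {
        "averaged": averaged,
        "agreement": 1 - len(disagreements) / len(pairs),
        "one_step_share": (len(one_step) / len(disagreements)
                           if disagreements else None),
        "concurring_indices": concurring,
    }
```

With this coding an averaged value of 0.25 or 0.75 marks a one-step disagreement, while 0.5 can mean either agreement on the intermediate level or a two-step disagreement, so the raw pairs are still needed to tell those cases apart.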
9 The Conventional Wisdom on Stress-Accent
- "Pitch is widely regarded, at least in English,
as the most salient determinant of prominence. In
other words, when a syllable or word is perceived
as 'stressed' or 'emphasized,' it is pitch height
or a change in pitch, more than length or
loudness that is likely to be mainly responsible
(see, for example, Fry 1958, Gimson 1980, pp.
222-226, Lehiste 1976, Fudge, 1984, ch. 1)"
- Clark, J. and Yallop, C. (1990) An Introduction
to Phonetics and Phonology. Oxford, Blackwell, p.
280.
- "In fact, although it is clear that stressed
syllables often have greater overall acoustic
intensity than weakly stressed ones, loudness
seems to be the least salient and least
consistent of the three parameters of pitch,
duration and loudness - at least for purposes
such as signaling stress" (ibid, p. 282)
- Thus, according to the general consensus the
important parameters are (in order) - PITCH,
DURATION, LOUDNESS
- (the latter most closely correlated with TOTAL
ENERGY, i.e., duration x amplitude, cf. further
on)
10 OGI Stories - Pitch Doesn't Cut the Mustard
- Although pitch range is the most important of the
fo-related cues, it is not as good a predictor
of stress as DURATION
11 Total Energy is the Best Predictor of Stress
- Duration x Amplitude is superior to all other
combination pairs of acoustic parameters. Pitch
appears redundant with duration.
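As an illustration of the duration x amplitude product, here is a minimal sketch that treats a nucleus as a sequence of amplitude-envelope samples taken at a fixed frame step. The envelope representation, the frame step and the function names are assumptions; the slides do not state how amplitude was measured. The point is only that summing the envelope over time equals duration multiplied by mean amplitude.

```python
def integrated_energy(envelope, frame_step_s):
    """Approximate the integral of the amplitude envelope over the nucleus."""
    if not envelope:
        raise ValueError("envelope must contain at least one sample")
    return sum(envelope) * frame_step_s


def duration_times_amplitude(envelope, frame_step_s):
    """Compute the same quantity as nucleus duration times mean amplitude."""
    if not envelope:
        raise ValueError("envelope must contain at least one sample")
    duration_s = len(envelope) * frame_step_s
    mean_amplitude = sum(envelope) / len(envelope)
    return duration_s * mean_amplitude


if __name__ == "__main__":
    # Invented envelope values (linear amplitude, arbitrary units).
    toy_envelope = [0.2, 0.4, 0.4, 0.2]
    print(integrated_energy(toy_envelope, 0.01))
    print(duration_times_amplitude(toy_envelope, 0.01))
```

Because the product grows with either factor, a long nucleus of modest amplitude and a short, loud one can yield similar values, which is why duration and amplitude are also examined separately in the slides that follow.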
12 A Brief Primer on Vocalic Acoustics
- Vowel quality is generally thought to be a
function primarily of two articulatory properties
- both related to the motion of the tongue
- The front-back plane is most closely associated
with the second formant frequency (or more
precisely F2 - F1) and the volume of the
front-cavity resonance
- The height parameter is closely linked to the
frequency of F1
- In the classic vowel triangle, segments are
positioned in terms of the tongue positions
associated with their production, as follows
13 Duration/Amplitude/Int. Energy - Which?
- There are supposed to be large differences in the
intrinsic amplitude and duration of vowels
- Could such differences be compensated for in
terms of stress?
- Let's take a closer look!
14 Amplitude Differences - Stressed/Unstressed
- There are very small differences in amplitude
between stressed and unstressed nuclei
- The lax monophthongs tend to have a slightly
larger dynamic range than diphthongs
15 Durational Differences - Stressed/Unstressed
- There is a large dynamic range in duration
between stressed and unstressed nuclei
- Diphthongs and tense, low monophthongs tend to
have a larger range than the lax monophthongs
16 Int. Energy Differences - Stressed/Unstressed
- There is a large dynamic range in integrated
energy between stressed and unstressed nuclei
- Diphthongs and tense, low monophthongs tend to
have a larger range than the lax monophthongs
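The stressed/unstressed comparisons on the last three slides can be sketched as a per-vowel ratio of means. The slides do not define "dynamic range" numerically, so the ratio used here (mean over stressed nuclei divided by mean over unstressed nuclei) is an assumption, as are the record layout and function name. The same function applies to amplitude, duration or integrated energy.

```python
from collections import defaultdict
from statistics import mean


def dynamic_range(records):
    """Return {vowel: stressed mean / unstressed mean} for one measure.

    `records` is an iterable of (vowel, is_stressed, value) tuples.
    Vowels lacking either stressed or unstressed tokens are skipped.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for vowel, is_stressed, value in records:
        groups[vowel][bool(is_stressed)].append(value)
    ranges = {}
    for vowel, by_stress in groups.items():
        if by_stress[True] and by_stress[False]:
            ranges[vowel] = mean(by_stress[True]) / mean(by_stress[False])
    return ranges


if __name__ == "__main__":
    # Invented durations in seconds, for illustration only.
    toy = [
        ("ay", True, 0.20), ("ay", True, 0.16), ("ay", False, 0.09),
        ("ih", True, 0.06), ("ih", False, 0.05), ("ih", False, 0.03),
    ]
    print(dynamic_range(toy))
```

A ratio near 1 corresponds to the "very small differences" reported for amplitude, while larger ratios correspond to the large ranges reported for duration and integrated energy.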
17 Spatial Patterning of Duration and Amplitude
- Let's return to the vowel triangle and see if it
can shed light on certain patterns in the
vocalic data
- The duration, amplitude (and their product,
integrated energy) will be plotted on a 2-D
grid, where the x-axis will always be in terms
of hypothetical front-back tongue position (and
hence remain a constant throughout the plots to
follow)
- The y-axis will serve as the dependent measure,
sometimes expressed in terms of duration, or
amplitude, or their product
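The plotting scheme just described can be sketched as a mapping from vowels to (x, y) points, where x is a fixed front-back coordinate and y is whichever measure is being shown. The coarse three-way coordinate table below is a hypothetical stand-in for the positions used in the actual plots, and the vowel symbols are assumed to be ARPABET-style labels.

```python
# Coarse, hypothetical front-back coordinates: 0 = front, 1 = central, 2 = back.
FRONT_BACK = {
    "iy": 0, "ih": 0, "eh": 0, "ae": 0,
    "ah": 1,
    "uw": 2, "uh": 2, "ao": 2, "aa": 2,
}


def grid_points(measure_by_vowel):
    """Return sorted (x, y, vowel) tuples for vowels with a known position.

    x stays the same from plot to plot; only y (duration, amplitude or
    their product) changes with the dictionary passed in.
    """
    points = []
    for vowel, value in measure_by_vowel.items():
        if vowel in FRONT_BACK:
            points.append((FRONT_BACK[vowel], value, vowel))
    return sorted(points)


if __name__ == "__main__":
    # Invented mean durations in seconds, for illustration only.
    print(grid_points({"uw": 0.10, "iy": 0.09, "ae": 0.14}))
```

Keeping the x coordinate in a single table is what makes the plots directly comparable: swapping the dependent measure moves points vertically but never horizontally.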
18 Diphthongal Amplitude and Vowel Height
[Plot: all nuclei]
19 Monophthongal Amplitude and Vowel Height
[Plot: all nuclei]
20 Amplitude - Monophthongs vs. Diphthongs
[Plot: monophthongs and diphthongs, all nuclei]
21 Diphthongal Duration and Vowel Height
[Plot: all nuclei]
22 Monophthongal Duration and Vowel Height
[Plot: all nuclei]
23 Duration - Monophthongs vs. Diphthongs
[Plot: all nuclei]
24 Diphthongal Int. Energy and Vowel Height
[Plot: all nuclei]
25 Monophthongal Int. Energy and Vowel Height
[Plot: all nuclei]
26 Int. Energy - Monophthongs vs. Diphthongs
[Plot: all nuclei]
27 Diphthongal Amplitude and Vowel Height
[Plot: stressed nuclei]
28 Diphthongal Amplitude and Vowel Height
[Plot: unstressed nuclei]
29 Monophthongal Amplitude and Vowel Height
[Plot: stressed nuclei]
30 Monophthongal Amplitude and Vowel Height
[Plot: unstressed nuclei]
31 Amplitude - Monophthongs vs. Diphthongs
[Plot: monophthongs and diphthongs, stressed and unstressed]
32 Diphthongal Duration and Vowel Height
[Plot: stressed nuclei]
33 Diphthongal Duration and Vowel Height
[Plot: unstressed nuclei]
34 Monophthongal Duration and Vowel Height
[Plot: stressed nuclei]
35 Monophthongal Duration and Vowel Height
[Plot: unstressed nuclei]
36 Duration - Monophthongs vs. Diphthongs
[Plot: monophthongs and diphthongs, stressed and unstressed]
37 Diphthongal Int. Energy and Vowel Height
[Plot: stressed nuclei]
38 Diphthongal Int. Energy and Vowel Height
[Plot: unstressed nuclei]
39 Monophthongal Int. Energy and Vowel Height
[Plot: stressed nuclei]
40 Monophthongal Int. Energy and Vowel Height
[Plot: unstressed nuclei]
41 Int. Energy - Monophthongs vs. Diphthongs
[Plot: monophthongs and diphthongs, stressed and unstressed]
42 Mystery Parameter
- There is one other parameter which, when plotted
in a vowel triangle plot, shows an interesting
pattern
- This is the proportion of stressed and unstressed
nuclei
43 Proportion of Stress Accent and Vowel Height
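The proportion in question can be sketched as a simple count per vowel-height class. The height table below is a hypothetical grouping (the low class follows the vowels listed on the take-home slide; the high and mid classes are assumed), and the input is taken to contain only nuclei labeled fully stressed or completely unstressed, as in the other illustrations.

```python
from collections import Counter

# Hypothetical height classes for ARPABET-style vowel labels.
HEIGHT = {
    "iy": "high", "ih": "high", "uw": "high", "uh": "high",
    "ey": "mid", "eh": "mid", "ah": "mid", "ow": "mid",
    "ae": "low", "aa": "low", "ay": "low", "aw": "low", "ao": "low",
}


def stressed_proportion_by_height(nuclei):
    """Return {height: share of nuclei that are stressed}.

    `nuclei` is an iterable of (vowel, is_stressed) pairs; vowels that
    are missing from HEIGHT are ignored.
    """
    total = Counter()
    stressed = Counter()
    for vowel, is_stressed in nuclei:
        height = HEIGHT.get(vowel)
        if height is None:
            continue
        total[height] += 1
        if is_stressed:
            stressed[height] += 1
    return {height: stressed[height] / total[height] for height in total}
```

Applied to real labels, a pattern in which the low class has the largest share and the high class the smallest would match the claim that low vowels are far more likely to be fully stressed than high vowels.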
44 Amplitude - Monophthongs vs. Diphthongs
[Plot: monophthongs and diphthongs, all nuclei]
45 Duration - Monophthongs vs. Diphthongs
[Plot: all nuclei]
46 Int. Energy - Monophthongs vs. Diphthongs
[Plot: all nuclei]
47 Summary and Conclusions
- There is an intimate relationship between vocalic
identity, nucleic duration and stress accent in
spontaneous dialogue (at least in the
Switchboard corpus)
- Stressed syllables tend to have significantly
longer nuclei than their unstressed
counterparts, consistent with the findings
reported by Silipo and Greenberg at previous
years' meetings regarding the OGI Stories
corpus (telephone monologues)
- Certain vocalic classes exhibit a far greater
dynamic range in duration than others
- Diphthongs tend to be longer than monophthongs,
BUT ...
- The low monophthongs (ae, aa, ay, aw,
ao) exhibit patterns of duration and dynamic
range under stress (accent) similar to diphthongs
- The statistical patterns are consistent with the
hypothesis that duration serves under many
conditions as either a primary or secondary cue
for vowel height (normally associated with the
frequency of the first formant)
48 Summary and Conclusions
- Moreover, the stress-accent system in spontaneous
(American) English appears to be closely
associated with vocalic identity
- Low vowels are far more likely to be fully
stressed than high vowels (with the mid vowels
exhibiting an intermediate probability of being
stressed)
- Thus, the identity of a vowel cannot be
considered independently of stress-accent
- Thus, vowel duration may be an important factor
in disambiguating spoken language and therefore
should be of interest to the speech
recognition community