Title: Beyond the Phoneme
1 Beyond the Phoneme: A Juncture-Accent Model of Spoken Language
Steven Greenberg, Hannah Carvey, Leah Hitchcock and Shuangyu Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
steveng, hmcarvey, leahh, shawnc_at_icsi.berkeley.edu
2 Acknowledgements and Thanks
Research Funding: U.S. Department of Defense, U.S. National Science Foundation
3 For Further Information
Consult the web site www.icsi.berkeley.edu/steveng
4 OVERTURE The Central Challenge for Models of Speech Recognition
5 The Serial Frame Perspective on Speech
Traditional models of speech recognition assume that the identity of a phonetic segment is derived from a detailed spectral profile of the acoustic signal, computed for each time interval (frame) of speech
8 Phonemic Beads on a String Illustrated
In traditional models of speech recognition, words are conceptualized as mere sequences of phonetic segments (phones), strung together like beads on a string
No quarter is provided for stress accent or other syllabic properties
9 Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
Cat /k/ /ae/ /t/
Cat k ae t
ASR systems focus on decoding words from
sequences of phones
10 A Challenge for the Phonemic Beads on a String Approach to Speech Recognition - Pronunciation Variability
13 Pronunciation Variability of Real Speech
Pronunciation patterns encountered in everyday life are extremely diverse
There are literally dozens of ways in which common words are pronounced (as the following two slides illustrate for the word AND, based on manual phonetic annotation of a corpus comprising telephone dialogues)
14 How Many Pronunciations of "and"?
Canonical pronunciation
15 How Many Pronunciations of "and"?
17 Pronunciation Variability of Real Speech
There are literally dozens of ways in which common words are pronounced, as the following slide illustrates for the 20 most frequent words from the same corpus (Switchboard), which together account for 35% of the word tokens in the corpus
18 How Many Different Pronunciations?
The 20 most frequent words account for 35% of the tokens
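The two statistics quoted on these slides (distinct pronunciations per word, and the share of tokens covered by the most frequent words) can be tallied from any phonetically annotated corpus. A minimal sketch, assuming a hypothetical list of (word, pronunciation) pairs; the function name, the input format and the toy data are invented for illustration and are not taken from Switchboard.

```python
from collections import Counter, defaultdict

def pronunciation_stats(tokens, top_n):
    """Return (variants per word, fraction of tokens covered by the top_n words).

    tokens is a list of (word, pronunciation) pairs; this input format is an
    assumption made for the sketch, not the corpus's actual file format.
    """
    word_counts = Counter(word for word, _ in tokens)
    variants = defaultdict(set)
    for word, pron in tokens:
        variants[word].add(pron)
    # Token coverage of the top_n most frequent words.
    covered = sum(count for _, count in word_counts.most_common(top_n))
    n_variants = {word: len(prons) for word, prons in variants.items()}
    return n_variants, covered / len(tokens)

# Invented toy data (Arpabet-like symbols), NOT real corpus counts.
toy_tokens = [
    ("and", "ae n d"), ("and", "ae n"), ("and", "ix n"), ("and", "n"),
    ("the", "dh ax"), ("the", "dh iy"), ("cat", "k ae t"), ("and", "ae n"),
]
n_variants, coverage = pronunciation_stats(toy_tokens, top_n=1)
print(n_variants["and"], coverage)
```

With real annotation, `top_n=20` would give the figure the slide reports as 35%.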
19 QUESTION How do listeners decode the speech
signal given the large amount of pronunciation
variation?
20 PART ONE Anatomy of a Syllable
21 Language - A Syllable-Centric Perspective
A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between sound and meaning
Within this framework, the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic
29 The Importance of the Syllable
The analyses to follow are all linked, in some fashion, to syllable structure
In order to highlight patterns germane to variation in segmental duration, it is necessary to partition the data in terms of syllable position (as well as stress accent level)
As a consequence, we will examine the onsets, codas and nuclei of syllables separately in order to gain insight into the underlying patterns
What is an onset? What is a nucleus? What is a coda?
The following slides provide a brief (and gentle) introduction to syllable structure
36 Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents - ONSET, NUCLEUS, CODA
Virtually all syllables contain a NUCLEUS, which is VOCALIC (by definition)
Most (but not all) syllables also contain an ONSET (usually a CONSONANT)
Many syllables contain a CODA (also typically a CONSONANT)
The most common syllable form in English is Onset + Nucleus + Coda ("Nine")
Followed in popularity by Onset + Nucleus ("Two")
Onset segments often differ in significant ways from coda segments
J = JUNCTURE
OGI Numbers95 corpus
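The three-way partition described above can be sketched in a few lines. This is an illustration only: the vowel inventory is a hypothetical Arpabet-like subset, the function name is invented, and the sketch assumes each syllable has a single contiguous vocalic stretch (syllabic consonants such as the "en" of "seven" are not handled).

```python
# Hypothetical vowel inventory (Arpabet-like symbols); extend as needed.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
          "ih", "ix", "iy", "ow", "oy", "uh", "uw"}

def split_syllable(phones):
    """Split one syllable's phones into (onset, nucleus, coda).

    Assumes exactly one contiguous vocalic stretch per syllable.
    """
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_idx:
        raise ValueError("no vocalic nucleus found")
    first, last = vowel_idx[0], vowel_idx[-1]
    return phones[:first], phones[first:last + 1], phones[last + 1:]

print(split_syllable(["n", "ay", "n"]))  # "nine": onset + nucleus + coda
print(split_syllable(["t", "uw"]))       # "two": onset + nucleus, empty coda
```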
37 PART TWO Spectro-Temporal Profiles
43 The Spectro-Temporal Profile (STeP)
Certain specific (and important) properties of the syllable are not well represented in terms of the traditional 2.5-D spectrographic representation
Stress Accent and Juncture are two such properties
A different representation, based on the log, critical-band energy profile across frequency and time, can provide the requisite detail (shown in miniature below, and in expanded form on the following slides)
STePs are derived from averages of hundreds of individual instances
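The averaging step can be sketched as follows. This is not the authors' implementation: it assumes the critical-band energies have already been computed, and that all instances are time-aligned and share the same number of frames and bands. The function name, the dB scaling and the energy floor are choices made for this illustration.

```python
import math

def step_average(instances, floor=1e-10):
    """Average log band energies over instances of equal shape.

    instances[k][t][b] is the (linear) energy of instance k at frame t in
    band b. Returns a frames-by-bands matrix of mean energy in dB.
    """
    n_frames = len(instances[0])
    n_bands = len(instances[0][0])
    for inst in instances:
        if len(inst) != n_frames or any(len(f) != n_bands for f in inst):
            raise ValueError("all instances must share one shape")
    profile = []
    for t in range(n_frames):
        row = []
        for b in range(n_bands):
            # Take the log first, then average across instances.
            db = [10.0 * math.log10(max(inst[t][b], floor)) for inst in instances]
            row.append(sum(db) / len(db))
        profile.append(row)
    return profile

# Two invented instances, 2 frames x 2 bands each.
a = [[1.0, 10.0], [100.0, 1.0]]
b = [[100.0, 10.0], [100.0, 0.01]]
print(step_average([a, b]))
```

Averaging in the log domain keeps a few high-energy instances from dominating the profile, which is one reason to take the log before the mean.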
44 Spectro-Temporal Profile - DiSyllabic Word
Full-spectrum perspective
[Figure: STeP of the word "Seven" (s eh vx en), marking the accented syllable, the unaccented syllable and the juncture, with the mean duration of each segment; OGI Numbers95]
45 Spectro-Temporal Profile - DiSyllabic Word
High-frequency perspective
[Figure: STeP of the word "Seven" (s eh vx en), marking the accented syllable, the unaccented syllable and the juncture, with the mean duration of each segment; OGI Numbers95]
46 PART THREE Scientific Approach to Speech
Recognition
56 A Scientific Approach to Speech Recognition
Ascertain the contribution of
(1) phonetic segment (and feature) classification
(2) phonetic segmentation
(3) stress accent, and
(4) syllable position
to ASR performance
Using the OGI Numbers95 Corpus as a controlled (limited vocabulary) corpus
And a relatively transparent recognition engine utilizing the following variety of articulatory-based features: manner and place of articulation, voicing, vowel height, lip-rounding, spectral dynamics, segment length
That are explicitly tied to syllable position (i.e., onset, nucleus and coda) and stress-accent level
We will be comparing the baseline system (entirely automatic recognition) with an entirely fabricated set of input data (derived from hand-labeled phonetic annotation plus AutoSAL), as well as a half-way house system that is partially automatic and partially not (manually derived phonetic segmentation, as well as whether each segment is vocalic or not)
61 Numbers95 Recognition - Stress Accent Impact
Entirely Stress-Accent Dependent
Results (Word Error Rate)
Fabricated       1.3
Half-way House   2.0
Baseline         5.6
The half-way house system is much closer in performance to the fabricated data version than to the baseline system, suggesting that:
Accurate phonetic segmentation is extremely important for enhanced ASR performance, as is knowledge of the location of the syllabic nucleus
Stress-accent information is most important for the vocalic nucleus: without it, WER increases by 10-20%
It is also important for the coda: WER increases by 7-15%
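The claim that the half-way house system sits "much closer" to the fabricated condition can be quantified from the three figures on the slide. A small worked example; the function name and the "gap closed" measure are choices made here, not terms from the talk.

```python
def gap_closed(baseline, intermediate, best):
    """Fraction of the baseline-to-best WER gap recovered by `intermediate`."""
    return (baseline - intermediate) / (baseline - best)

# Word error rates as quoted on the slide.
wer = {"fabricated": 1.3, "half_way_house": 2.0, "baseline": 5.6}
share = gap_closed(wer["baseline"], wer["half_way_house"], wer["fabricated"])
print(f"{share:.0%} of the baseline-to-fabricated gap is closed")
```

On these numbers, manual segmentation plus vocalic/non-vocalic labels recover (5.6 - 2.0) / (5.6 - 1.3), or roughly 84%, of the improvement available from fully hand-derived input.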
65 Numbers95 Recognition - Pronunciation Impact
Effect of pronunciation variation as a function of syllable position, where the canonical pronunciation is potentially fixed for each syllable position separately (or All together)
Standard refers to the regular recognition system
Word Error Rate
                 Standard  Onset  Nucleus  Coda   All
Fabricated       1.29      1.33   1.61     1.63   1.76
Half-way House   1.97      2.16   2.21     2.55   2.81
Baseline         5.59      5.91   5.91     6.70   7.03
Conclusions
Onset segments are most canonical
Coda segments are least canonical
Therefore, it is important to provide for pronunciation variation in ASR systems
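The conclusions can be checked by expressing each condition as a relative increase over the Standard system. A short sketch using the table values as given; the dictionary layout and function name are illustrative.

```python
# WER values copied from the table on this slide.
WER = {
    "Fabricated":     {"Standard": 1.29, "Onset": 1.33, "Nucleus": 1.61, "Coda": 1.63, "All": 1.76},
    "Half-way House": {"Standard": 1.97, "Onset": 2.16, "Nucleus": 2.21, "Coda": 2.55, "All": 2.81},
    "Baseline":       {"Standard": 5.59, "Onset": 5.91, "Nucleus": 5.91, "Coda": 6.70, "All": 7.03},
}

def relative_increase(row, condition):
    """Relative WER increase of `condition` over the Standard system."""
    return (row[condition] - row["Standard"]) / row["Standard"]

for system, row in WER.items():
    cells = ", ".join(f"{c}: {relative_increase(row, c):+.1%}"
                      for c in ("Onset", "Nucleus", "Coda", "All"))
    print(f"{system:15s} {cells}")
```

In every row, fixing the onset to its canonical form costs the least and fixing the coda costs the most among the single positions (onset and nucleus tie in the Baseline row), which is what "onsets are most canonical, codas least" amounts to.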
68 Numbers95 Syllable Position Importance
Effect of pronunciation variation as a function of syllable position, where each syllabic constituent is neutralized with respect to lexical matching (i.e., each element is factored out of the decoding process separately)
Standard refers to the regular recognition system
Word Error Rate
                 Standard  Onset  Nucleus  Coda
Fabricated       1.29      9.70   5.95     3.92
Half-way House   1.97      11.27  13.28    6.60
Baseline         5.59      15.70  20.22    10.13
Neutralization of the onset or the nucleus exerts a greater impact on ASR performance than neutralization of the coda
Conclusion
Onsets and nuclei are most important for lexical access in an ASR system (at least for the Numbers95 corpus)
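Ranking the constituents by the error rate that results when each is neutralized makes the pattern explicit. A brief sketch over the table values; the names are illustrative.

```python
# WER with one syllabic constituent neutralized (values from the table).
NEUTRALIZED = {
    "Fabricated":     {"Onset": 9.70,  "Nucleus": 5.95,  "Coda": 3.92},
    "Half-way House": {"Onset": 11.27, "Nucleus": 13.28, "Coda": 6.60},
    "Baseline":       {"Onset": 15.70, "Nucleus": 20.22, "Coda": 10.13},
}

def rank_by_impact(row):
    """Constituents ordered from largest to smallest WER when neutralized."""
    return sorted(row, key=row.get, reverse=True)

for system, row in NEUTRALIZED.items():
    print(system, rank_by_impact(row))
```

The coda ranks last in all three systems. Whether the onset or the nucleus ranks first depends on the system: the onset leads for the fabricated input, the nucleus for the other two.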
69 PART FOUR Being Phonetically and Prosodically
Annotated
79 Phonetic Transcription of Spontaneous English
Telephone dialogues of 5-10 minutes duration, from the SWITCHBOARD corpus, have been phonetically transcribed (labeled and segmented)
Most of this material has been annotated manually
4 hours labeled at the phone level and segmented at the syllabic level
1 hour labeled and segmented at the phonetic-segment level
The remaining material segmented at the phonetic-segment level using automatic methods
45 minutes of hand-labeled stress-accent material
An additional four hours of stress-accent material automatically labeled (though unused in the current analysis)
There is a Lot of Diversity in the Material Transcribed
Spans speech of both genders (ca. 50/50), reflecting a wide range of American dialectal variation, speaking rate and voice quality
Transcription System
A variant of Arpabet, with phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd
81 Phonetic Transcription of Spontaneous English
The data are available at http://www.icsi.berkeley.edu/real/stp
91 Annotation of Stress Accent
- Forty-five minutes of the phonetically annotated portion of the Switchboard corpus was manually labeled with respect to stress accent
- Three levels of accent were distinguished
- Heavy, Light, None
- (In actuality, labelers assigned a 1 to fully accented syllables, a null to completely unaccented syllables, and a 0.5 to all others)
- An example of the annotation (attached to the vocalic nucleus) is shown below (where the accent levels could not be derived from a dictionary)
- In this example most of the syllables are unaccented, with two labeled as lightly accented (0.5) (and one other labeled as very lightly accented (0.25))
93 Annotation of Stress Accent
The data are available at http://www.icsi.berkeley.edu/steveng/prosody
94 Automatic Labeling of Stress Accent
- This forty-five minutes of hand-labeled phonetic
and prosodic annotation from the Switchboard
corpus was used as training data for development
of an Automatic Stress Accent Labeling System
(AutoSAL)
97 How Good is AutoSAL?
- There is 79% concordance between human and machine accent labels when the tolerance level is a quarter-step
- There is 97.5% concordance when the tolerance level is half a step
- This degree of concordance is as high as that exhibited by two highly trained (human) transcribers
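Concordance at a given tolerance can be read as the fraction of syllables whose human and machine labels differ by no more than that tolerance. A minimal sketch under that reading, assuming a quarter-step means 0.25 on the 0-to-1 accent scale and half a step means 0.5; the function name and the labels below are invented, not AutoSAL output.

```python
def concordance(human, machine, tolerance):
    """Fraction of syllables whose two labels differ by at most `tolerance`."""
    if len(human) != len(machine):
        raise ValueError("label sequences must have equal length")
    agree = sum(abs(h - m) <= tolerance for h, m in zip(human, machine))
    return agree / len(human)

# Invented labels on a 0 / 0.25 / 0.5 / 0.75 / 1 scale, NOT real data.
human   = [1.0, 0.0, 0.5, 0.0, 1.0, 0.5, 0.0, 0.25]
machine = [0.75, 0.0, 1.0, 0.25, 1.0, 0.5, 0.5, 0.25]
print(concordance(human, machine, 0.25), concordance(human, machine, 0.5))
```

The same function applied to two human transcribers' labels would give the inter-labeler figure that the slide uses as its yardstick.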
98 PART FIVE Stress Accent and Syllable Position
100 The Importance of Syllable Structure
Before going into the details of durational variation at the segmental level, we briefly examine some general patterns of pronunciation variation that are conditioned by syllable position and stress accent
These data serve to illustrate the sort of variation that is conditioned by position within the syllable
103 Pronunciation Variation - Syllable and Accent
Pronunciation variation is systematic at the level of the syllable
Particularly when stress accent is also taken into account
BOTH syllable structure and accent level are required for a full accounting
[Figure: Deletions, Substitutions and Insertions, with regions labeled ONSET Territory, NUCLEUS Territory and CODA Territory]
104 PART SIX Durational Properties of Pronunciation
Variation
108Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position Well begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level Under such conditions,
the durational properties associated with light
accent are generally intermediate between heavy
accent and none
112 Syllable Duration - Across Syllable Forms
- There is a broad range of syllable structures
  observed in spoken English
- The CV and CVC forms cover ca. 60% of the
  syllables
- Together, the V, VC, CV and CVC forms account for
  85% of syllables
- The CVCC and CCVC (complex syllable) forms
  account for another 10%
V = Vowel, C = Consonant
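Coverage figures of this kind can be obtained by reducing each syllable to its consonant/vowel skeleton and counting the resulting forms. The sketch below is a hedged illustration only: the vowel inventory is an assumed ARPABET-style set, the function names are hypothetical, and the four toy syllables are invented rather than drawn from the corpus described in the talk.

```python
from collections import Counter

# Assumed ARPABET-style vowel symbols (an illustrative choice, not the talk's inventory).
VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er",
    "ey", "ih", "ix", "iy", "ow", "oy", "uh", "uw",
}


def syllable_form(phones):
    """Map a syllable's phone sequence to its C/V skeleton, e.g. k-ae-t -> 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phones)


def form_coverage(syllables, forms):
    """Proportion of syllables whose skeleton is one of `forms`."""
    counts = Counter(syllable_form(s) for s in syllables)
    total = sum(counts.values())
    return sum(counts[f] for f in forms) / total


# Invented toy syllables: CVC, CV, VC and CCVC.
toy_syllables = [["k", "ae", "t"], ["dh", "ax"], ["ae", "t"], ["s", "t", "aa", "p"]]
simple_share = form_coverage(toy_syllables, {"CV", "CVC"})
```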
116 Syllable Duration - Across Syllable Forms
- It is unsurprising that syllable duration is
  largely a function of the number of segments
  within the syllable (as shown in the graph below)
- Note the systematic lengthening of the syllable
  for each form as the accent level increases from
  NONE to LIGHT to HEAVY
- This pattern is representative of accent's impact
  on duration (as we'll see)
[Graph: duration across Canonical Syllable Forms; V = Vowel, C = Consonant]
120 Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE).
The heavily accented syllables are generally
60-100% longer than their unaccented counterparts.
The disparity in duration is most pronounced for
syllable forms with one or no consonants (i.e.,
V, VC, CV).
This pattern implies that accent has the greatest
impact on vocalic duration.
[Graph: duration across Canonical Syllable Forms; V = Vowel, C = Consonant]
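A comparison such as "60-100% longer" amounts to grouping durations by syllable form and accent level, averaging each group, and expressing the heavy-accent mean relative to the unaccented mean. The following is a minimal sketch under stated assumptions: the record layout `(form, accent, duration_ms)`, the function names and the durations are all hypothetical and serve only to illustrate the arithmetic.

```python
from collections import defaultdict
from statistics import mean


def mean_duration(records):
    """Average duration for every (syllable form, accent level) pair."""
    groups = defaultdict(list)
    for form, accent, duration_ms in records:
        groups[(form, accent)].append(duration_ms)
    return {key: mean(values) for key, values in groups.items()}


def percent_lengthening(means, form, accented="heavy", unaccented="none"):
    """How much longer (in percent) the accented mean is than the unaccented mean."""
    base = means[(form, unaccented)]
    return 100.0 * (means[(form, accented)] - base) / base


# Invented durations in milliseconds, for illustration only.
toy_records = [
    ("CV", "heavy", 200), ("CV", "heavy", 220),
    ("CV", "none", 100), ("CV", "none", 110),
]
toy_means = mean_duration(toy_records)
cv_lengthening = percent_lengthening(toy_means, "CV")
```

A value of 100 means the accented syllables are twice as long as their unaccented counterparts.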
123 Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below.
Vowels in accented syllables (of all forms) are
at least twice as long as their unaccented
counterparts.
This pattern implies that the syllable nucleus
absorbs a major component of accent's impact (at
least as far as duration is concerned).
[Graph: nucleus duration across Canonical Syllable Forms]
124 PART SEVEN Stress Accent and the Vocalic
Nucleus
127 Stress Accent's Impact on the Vocalic Nucleus
- Because the pattern of stress accent's impact on
  vocalic duration is relatively uniform across
  syllable forms, it is likely that the specific
  structure of the syllable has relatively little
  impact on vocalic duration
- As a consequence, the remaining analyses
  pertaining to accent's impact on vocalic duration
  collapse the data across syllable form
- We now examine vocalic duration in somewhat
  greater detail and illustrate how duration,
  stress accent and vocalic identity interact
128 The Spatial Patterning of Duration in Vocalic
Nuclei
133 A Brief Primer on Vocalic Acoustics
- Vowel quality is generally thought to be a
  function primarily of two articulatory properties,
  both related to the motion of the tongue
- The front-back plane is most closely associated
  with the second formant frequency (or more
  precisely F2 - F1) and the volume of the
  front-cavity resonance
- The height parameter is closely linked to the
  frequency of F1
- In the classic vowel triangle, segments are
  positioned in terms of the tongue positions
  associated with their production
[Figure: the classic vowel triangle]
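The two relations in this primer can be turned into plotting coordinates: F2 - F1 as a proxy for the front-back dimension (larger values are more front) and F1 as a proxy for height (lower F1 corresponds to a higher vowel). The sketch below is illustrative only. The formant values are approximate figures for an adult male speaker, supplied here as an assumption and not taken from the talk, and the function names are hypothetical.

```python
# Approximate, illustrative (F1, F2) values in Hz; assumed, not from the talk.
FORMANTS = {
    "iy": (270, 2290),
    "uw": (300, 870),
    "aa": (730, 1090),
}


def vowel_coordinates(f1, f2):
    """Return (front_back, height) proxies: F2 - F1 and F1, both in Hz."""
    return (f2 - f1, f1)


def order_by_height(formants):
    """Vowel labels from highest to lowest tongue position (ascending F1)."""
    return sorted(formants, key=lambda v: formants[v][0])


def order_front_to_back(formants):
    """Vowel labels from most front to most back (descending F2 - F1)."""
    return sorted(formants, key=lambda v: formants[v][1] - formants[v][0], reverse=True)


high_to_low = order_by_height(FORMANTS)
front_to_back = order_front_to_back(FORMANTS)
```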
136 Spatial Patterning of Duration et al.
- In the following slides duration is plotted on a
  2-D grid, where the x-axis represents the
  (hypothetical) front-back tongue position (and
  hence remains constant throughout the plots to
  follow)
- The y-axis serves as the dependent measure,
  expressed in terms of either duration or the
  proportion of fully stressed (or unstressed)
  nuclei
141 Vocalic Duration and Vowel Height
- The spatial patterning of vocalic segments is
  systematic with respect to duration
- Low vowels, be they diphthongs or monophthongs,
  are longer (on average) than high vowels
- Thus, duration appears to be highly correlated
  with vowel height
- But the situation is a little more complicated
  than first appearances would suggest
[Graph: All nuclei]
144 Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic nuclei.
Moreover, diphthongs and tense, low monophthongs
tend to exhibit a larger dynamic range than the
lax monophthongs.
[Graph labels: Lax monophthongs; Canonical Syllable Forms]
147 Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed.
The low vowels, be they monophthongs or
diphthongs, are rarely unstressed.
The high diphthongs and high/mid, tense
monophthongs occupy an intermediate position.
151 Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed.
The low vowels, be they monophthongs or
diphthongs, are far more likely to be fully
stressed.
An intermediate degree of stress accounts for the
other vocalic instances (but will not be
addressed here).
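The proportions discussed on these slides (how often a given vowel is unstressed, or fully stressed) can be tallied directly from labeled tokens. A minimal sketch follows, assuming each token is a hypothetical `(vowel, accent_level)` pair; the function name and the toy tokens are invented for illustration and do not reproduce the talk's data.

```python
from collections import Counter


def accent_proportion(tokens, level):
    """For each vowel, the proportion of its tokens carrying accent `level`."""
    totals = Counter(vowel for vowel, _ in tokens)
    hits = Counter(vowel for vowel, accent in tokens if accent == level)
    # Counter returns 0 for vowels that never occur at this accent level.
    return {vowel: hits[vowel] / totals[vowel] for vowel in totals}


# Invented toy tokens, for illustration only.
toy_tokens = [
    ("ih", "none"), ("ih", "none"), ("ih", "heavy"), ("ih", "none"),
    ("ay", "heavy"), ("ay", "none"),
]
unstressed_share = accent_proportion(toy_tokens, "none")
stressed_share = accent_proportion(toy_tokens, "heavy")
```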
156 Vocalic Variation: Importance of Stress Accent
- The vowels of heavily accented syllables are
  (mostly) pronounced canonically
- Low vowels are largely the province of accented
  syllables, and
- High vowels the province of unaccented syllables
- Moreover, there's a lexical bias towards high
  vowels for unaccented forms
- That's reinforced in patterns of deviation from
  canonical pronunciation
[Graph panels: Canonical Pronunciations; Non-Canonical Pronunciations]
161 Vocalic Height: Deviation from Canonical
- Vowels are more likely to RISE in height than to
  descend when unaccented
- Vocalic lowering of height is rare
- Most deviations from the canonical maintain vowel
  height
- More than a single height step deviation is
  uncommon
- Virtually all 2-step height deviations occur in
  unaccented syllables
[Graph panels: Amount of Change; Direction of Change]
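The two measures named in the panels, direction and amount of change, can be derived by comparing the height of the canonical vowel with the height of the vowel actually produced. The sketch below assumes a three-level ordinal height scale (low, mid, high); that scale, the `HEIGHT` mapping and the function name are illustrative assumptions, not the coding scheme used in the talk.

```python
# Assumed ordinal height scale; one unit corresponds to one height step.
HEIGHT = {"low": 0, "mid": 1, "high": 2}


def height_change(canonical, realized):
    """Return (direction, steps) for a canonical -> realized vowel height pair."""
    delta = HEIGHT[realized] - HEIGHT[canonical]
    if delta > 0:
        direction = "raised"
    elif delta < 0:
        direction = "lowered"
    else:
        direction = "same"
    return direction, abs(delta)


two_step_rise = height_change("low", "high")
no_change = height_change("mid", "mid")
```

Tallying these pairs separately for accented and unaccented syllables would give distributions of the kind summarized in the bullets above.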
162 The Vowel Space Under (Full) Stress (Accent)
- In fully accented nuclei there is a relatively
  even distribution of segments across the vowel
  space, with a slight bias towards the front and
  central vowels
[Graph: Canonical Vowels Only]
164 The Vowel Space Without (Stress) Accent
- In unaccented syllables vowels are confined
  largely to the high-front and high-central
  sectors of the articulatory space
- The low and mid vowels get creamed
[Graph: Canonical Vowels Only]
168 The Vowel Spaces Compared
- Stress accent exerts a profound effect on the
  character of the vowel space
- High vowels are largely associated with
  unaccented syllables
- Low vowels are mostly associated with accented
  forms
- This distinction between accented and unaccented
  syllables is of profound importance for
  understanding (and modeling) pronunciation
  variation
[Graph panels: Heavily Accented; Unaccented; Canonical Vowels Only]
173 Is It Stress? Vocalic Identity? Or What?
- Duration appears to play an important (but
  certainly not exclusive) role in stress accent
  for spontaneous American English discourse
- For any given vocalic class, stressed segments
  are longer (on average)
- The durational disparity is most pronounced among
  the low vowels and the diphthongs
- Low vowels tend to be much longer in duration
  than high vowels
- This is the case even for diphthongs
174Is It Stress? Vocalic Identity? Or What?
- Duration appears to play an importan