Beyond the Phoneme

Slides: 272

Provided by: stevegr4

Transcript and Presenter's Notes

1
Beyond the Phoneme: A Juncture-Accent Model of Spoken Language
Steven Greenberg, Hannah Carvey, Leah Hitchcock and Shuangyu Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
steveng, hmcarvey, leahh, shawnc_at_icsi.berkeley.edu
2
Acknowledgements and Thanks
Research Funding:
U.S. Department of Defense
U.S. National Science Foundation
3
For Further Information
Consult the web site www.icsi.berkeley.edu/steveng
4
OVERTURE The Central Challenge for Models of
Speech Recognition
5
The Serial Frame Perspective on Speech
Traditional models of speech recognition assume
the identity of a phonetic segment is derived
from a detailed spectral profile of the acoustic
signal computed for each time interval (frame) of
speech
8
Phonemic Beads on a String Illustrated
In traditional models of speech recognition, words
are conceptualized as mere sequences of phonetic
segments (phones), strung together like
beads on a string.
No quarter is provided for stress accent or other
syllabic properties.
9
Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
Cat /k/ /ae/ /t/
Cat k ae t
ASR systems focus on decoding words from
sequences of phones
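The beads-on-a-string view can be sketched as a plain dictionary lookup. This is only an illustration, not the recognizer used in this work: the lexicon, the ARPAbet-like phone symbols and the function name are all assumptions. It shows why a single canonical phone string per word is brittle once a pronunciation departs from the canonical form.

```python
# Hypothetical toy lexicon: one canonical phone string per word.
CANONICAL_LEXICON = {
    ("k", "ae", "t"): "cat",
    ("ae", "n", "d"): "and",
}


def decode_word(phones):
    """Return the word whose canonical phone string matches exactly, else None."""
    return CANONICAL_LEXICON.get(tuple(phones))


print(decode_word(["k", "ae", "t"]))  # canonical form is found
print(decode_word(["ae", "n"]))       # reduced form of "and" is not found
```

Any phone string that is not listed verbatim is simply missed, which is the weakness the following slides examine.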
10
A Challenge for the Phonemic Beads on a String
Approach to Speech Recognition Pronunciation
Variability
13
Pronunciation Variability of Real Speech
Pronunciation patterns encountered in everyday
life are extremely diverse.
There are literally dozens of ways in which common
words are pronounced (as the following two slides
illustrate for the word AND, based on manual
phonetic annotation of a corpus comprising
telephone dialogues).
14
How Many Pronunciations of and?
Canonical pronunciation
15
How Many Pronunciations of and?
17
Pronunciation Variability of Real Speech
There are literally dozens of ways in which common
words are pronounced, as the following slide
illustrates for the 20 most frequent words from
the same corpus (Switchboard), which together
account for 35% of the word tokens in the corpus.
18
How Many Different Pronunciations?
The 20 most frequent words account for 35% of
the tokens
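A rough sketch of how such counts could be tallied from a phonetically annotated corpus follows. The input format (a list of word and phone-sequence pairs), the function name and the toy data are assumptions for illustration; they are not the Switchboard annotation format or its figures.

```python
from collections import Counter, defaultdict


def pronunciation_stats(tokens, top_n=20):
    """Count distinct pronunciations of the top_n most frequent words.

    tokens: list of (word, phone_sequence) pairs, one per spoken token.
    Returns (variants_per_word, fraction_of_tokens_covered_by_top_words).
    """
    if not tokens:
        raise ValueError("no tokens supplied")
    word_counts = Counter(word for word, _ in tokens)
    variants = defaultdict(set)
    for word, phones in tokens:
        variants[word].add(tuple(phones))
    top = word_counts.most_common(top_n)
    coverage = sum(count for _, count in top) / len(tokens)
    return {word: len(variants[word]) for word, _ in top}, coverage


# Toy data, invented purely for illustration.
toy_tokens = [
    ("and", ["ae", "n", "d"]),
    ("and", ["ae", "n"]),
    ("and", ["ix", "n"]),
    ("and", ["ae", "n"]),
    ("the", ["dh", "ax"]),
    ("cat", ["k", "ae", "t"]),
]
variant_counts, coverage = pronunciation_stats(toy_tokens, top_n=1)
print(variant_counts, round(coverage, 2))
```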
19
QUESTION How do listeners decode the speech
signal given the large amount of pronunciation
variation?
20
PART ONE Anatomy of a Syllable
21
Language - A Syllable-Centric Perspective
A more empirically grounded perspective of spoken
language focuses on the SYLLABLE as the interface
between sound and meaning.
Within this framework the relationship between
the syllable and the higher and lower tiers is
non-arbitrary and statistically systematic.
29
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure.
In order to highlight patterns germane to variation in
segmental duration it is necessary to partition
the data in terms of syllable position (as
well as stress accent level).
As a consequence, we will examine the onsets, codas
and nuclei of syllables separately in order to gain
insight into the underlying patterns.
What is an onset? What is a nucleus? What is a coda?
The following slides provide a brief (and gentle)
introduction to syllable structure.
36
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents:
ONSET, NUCLEUS, CODA.
Virtually all syllables contain a NUCLEUS, which is
VOCALIC (by definition).
Most (but not all) syllables also contain an ONSET
(usually a CONSONANT).
Many syllables contain a CODA (also typically a
CONSONANT).
The most common syllable form in English is
Onset + Nucleus + Coda ("Nine"),
followed in popularity by Onset + Nucleus ("Two").
Onset segments often differ in significant ways
from coda segments.
[Figure legend: J = JUNCTURE. Corpus: OGI Numbers95]
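The three-way partition described above can be sketched for a single syllable as follows. The vowel inventory and the phone symbols are assumptions (an ARPAbet-like subset chosen for the examples), and the function is a minimal illustration rather than a syllabifier.

```python
# Assumed (incomplete) set of vocalic symbols, for illustration only.
VOWELS = {"ay", "uw", "eh", "ae", "iy", "ih", "ow", "ah", "ax"}


def split_syllable(phones):
    """Split the phones of ONE syllable into (onset, nucleus, coda)."""
    first = next((i for i, p in enumerate(phones) if p in VOWELS), None)
    if first is None:
        raise ValueError("no vocalic nucleus found")
    end = first
    while end < len(phones) and phones[end] in VOWELS:
        end += 1
    if any(p in VOWELS for p in phones[end:]):
        raise ValueError("more than one nucleus: not a single syllable")
    return phones[:first], phones[first:end], phones[end:]


print(split_syllable(["n", "ay", "n"]))  # "nine": onset, nucleus and coda
print(split_syllable(["t", "uw"]))       # "two": onset and nucleus, empty coda
```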
37
PART TWO Spectro-Temporal Profiles
43
The Spectro-Temporal Profile (STeP)
Certain specific (and important) properties of
the syllable are not well represented in terms of
the traditional 2.5-D spectrographic
representation. Stress accent and juncture are two
such properties.
A different representation, based on the log,
critical-band energy profile across frequency and
time, can provide the requisite detail (as shown in
miniature below, and in expanded form on the
following slides).
STePs are derived from averages of hundreds of
individual instances.
44
Spectro-Temporal Profile - Disyllabic Word
Full-spectrum perspective
[Figure: STeP of the word "Seven" (segments s, eh, vx, en). Labels: accented syllable, unaccented syllable, juncture, mean duration. Corpus: OGI Numbers95]
45
Spectro-Temporal Profile - Disyllabic Word
High-frequency perspective
[Figure: STeP of the word "Seven" (segments s, eh, vx, en). Labels: accented syllable, unaccented syllable, juncture, mean duration. Corpus: OGI Numbers95]
46
PART THREE Scientific Approach to Speech
Recognition
56
A Scientific Approach to Speech Recognition
Ascertain the contribution of:
(1) phonetic segment (and feature) classification
(2) phonetic segmentation
(3) stress accent, and
(4) syllable position
to ASR performance.
Using the OGI Numbers95 Corpus as a controlled
(limited vocabulary) corpus, and a relatively
transparent recognition engine utilizing the
following variety of articulatory-based features:
manner and place of articulation, voicing, vowel
height, lip-rounding, spectral dynamics, segment
length. These are explicitly tied to syllable
position (i.e., onset, nucleus and coda) and
stress-accent level.
We will be comparing the baseline system (entirely
automatic recognition) with an entirely fabricated
set of input data (derived from hand-labeled
phonetic annotation and AutoSAL) as well as a
"half-way house" system that is partially automatic
and partially not (manually derived phonetic
segmentation, as well as whether each segment is
vocalic or not).
61
Numbers95 Recognition: Stress Accent Impact
Entirely Stress-Accent Dependent Results

System            Word Error Rate (%)
Fabricated        1.3
Half-way House    2.0
Baseline          5.6

The half-way house system is much closer in
performance to the fabricated data version than to
the baseline system, suggesting that accurate
phonetic segmentation is extremely important for
enhanced ASR performance, as is knowledge of the
location of the syllabic nucleus.
Stress-accent information is most important for the
vocalic nucleus: without it, WER increases by 10-20%.
It is also important for the coda: WER increases by 7-15%.
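For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below implements that standard definition; it is not the scoring tool used for these results, and the example strings are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("seven nine two", "seven five two"))  # one substitution
print(word_error_rate("seven nine two", "seven two"))       # one deletion
```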
65
Numbers95 Recognition: Pronunciation Impact
Effect of pronunciation variation as a function
of syllable position, where the canonical
pronunciation is potentially fixed for each
syllable position separately (or "All" together).
"Standard" refers to the regular recognition system.

Word Error Rate (%)
                  Standard   Onset   Nucleus   Coda    All
Fabricated          1.29      1.33     1.61     1.63   1.76
Half-way House      1.97      2.16     2.21     2.55   2.81
Baseline            5.59      5.91     5.91     6.70   7.03

Conclusions:
Onset segments are most canonical.
Coda segments are least canonical.
Therefore, it is important to provide for
pronunciation variation in ASR systems.
68
Numbers95: Syllable Position Importance
Effect of pronunciation variation as a function
of syllable position, where each syllabic
constituent is neutralized with respect to
lexical matching (i.e., each element is factored
out of the decoding process separately).
"Standard" refers to the regular recognition system.

Word Error Rate (%)
                  Standard   Onset   Nucleus   Coda
Fabricated          1.29      9.70     5.95     3.92
Half-way House      1.97     11.27    13.28     6.60
Baseline            5.59     15.70    20.22    10.13

Neutralization of the onset and nucleus exerts a
greater impact on ASR performance than
neutralization of the coda.
Conclusion: Onsets and nuclei are most important for
lexical access in an ASR system (at least for the
Numbers95 corpus).
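The conclusion can be checked with simple arithmetic on the table: subtract each system's Standard WER from its WER with one constituent neutralized. The sketch below does this with the figures from the slide; the dictionary layout and function name are assumptions made for illustration.

```python
# Word error rates copied from the slide (Standard vs. one constituent neutralized).
WER = {
    "Fabricated":     {"Standard": 1.29, "Onset": 9.70,  "Nucleus": 5.95,  "Coda": 3.92},
    "Half-way House": {"Standard": 1.97, "Onset": 11.27, "Nucleus": 13.28, "Coda": 6.60},
    "Baseline":       {"Standard": 5.59, "Onset": 15.70, "Nucleus": 20.22, "Coda": 10.13},
}


def wer_increase(system):
    """Absolute WER increase caused by neutralizing each syllable constituent."""
    row = WER[system]
    return {pos: round(row[pos] - row["Standard"], 2)
            for pos in ("Onset", "Nucleus", "Coda")}


for system in WER:
    increase = wer_increase(system)
    smallest = min(increase, key=increase.get)
    print(system, increase, "smallest impact:", smallest)
```

In all three systems the coda yields the smallest increase, consistent with the slide's conclusion.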
69
PART FOUR Being Phonetically and Prosodically
Annotated
79
Phonetic Transcription of Spontaneous English
Telephone dialogues of 5-10 minutes duration,
from the SWITCHBOARD corpus, have been
phonetically transcribed (labeled and segmented).
Most of this material has been annotated manually:
    4 hours labeled at the phone level and segmented
    at the syllabic level
    1 hour labeled and segmented at the
    phonetic-segment level
    The remaining material segmented at the
    phonetic-segment level using automatic methods
    45 minutes of hand-labeled stress-accent material
    An additional four hours of stress-accent material
    automatically labeled (though unused in the
    current analysis)
There is a Lot of Diversity in the Material Transcribed:
Spans speech of both genders (ca. 50/50), reflecting
a wide range of American dialectal variation,
speaking rate and voice quality.
Transcription System: A variant of Arpabet, with
phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd
81
Phonetic Transcription of Spontaneous English
The data are available at: http://www.icsi.berkeley.edu/real/stp
91
Annotation of Stress Accent

Forty-five minutes of the phonetically annotated
portion of the Switchboard corpus was manually
labeled with respect to stress accent.
Three levels of accent were distinguished:
Heavy, Light, None.
(In actuality, labelers assigned a 1 to fully
accented syllables, a null to completely
unaccented syllables, and a 0.5 to all others.)
An example of the annotation (attached to the
vocalic nucleus) is shown below (where the accent
levels could not be derived from a dictionary).
In this example most of the syllables are
unaccented, with two labeled as lightly accented
(0.5) and one other labeled as very lightly
accented (0.25).
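The labeling scheme can be sketched as a mapping from numeric labels to the three named levels. The function name and the example label sequence are invented for illustration. Treating every value strictly between 0 and 1 (including 0.25) as "light" is an assumption, since the slide does not say how such intermediate values were binned.

```python
def accent_level(label):
    """Map a numeric stress-accent label to a named level.

    None (a null label) and 0 mean unaccented, 1 means fully accented.
    Assumption: any value strictly between 0 and 1 is treated as "light".
    """
    if label is None or label == 0:
        return "none"
    if not 0 < label <= 1:
        raise ValueError("label must lie between 0 and 1")
    return "heavy" if label == 1 else "light"


# Hypothetical label sequence, one label per vocalic nucleus.
labels = [None, 0.5, None, 0.25, 1, 0.5]
levels = [accent_level(x) for x in labels]
print(levels)
```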

93
Annotation of Stress Accent
The data are available at: http://www.icsi.berkeley.edu/steveng/prosody
94
Automatic Labeling of Stress Accent

This forty-five minutes of hand-labeled phonetic
and prosodic annotation from the Switchboard
corpus was used as training data for development
of an Automatic Stress Accent Labeling System
(AutoSAL)

97
How Good is AutoSAL?

There is 79% concordance between human and
machine accent labels when the tolerance level is
a quarter-step.
There is 97.5% concordance when the tolerance
level is half a step.
This degree of concordance is as high as that
exhibited by two highly trained (human)
transcribers.
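One plausible reading of "concordance at a tolerance level" is the fraction of syllables whose human and machine labels differ by no more than the tolerance. The sketch below computes that quantity. The interpretation, the function name and the toy labels are assumptions, and null labels are represented here as 0.0.

```python
def concordance(human, machine, tolerance):
    """Fraction of syllables whose two labels differ by at most `tolerance`."""
    if len(human) != len(machine) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    agree = sum(abs(h - m) <= tolerance + 1e-9 for h, m in zip(human, machine))
    return agree / len(human)


# Toy labels, invented for illustration (not Switchboard data).
human = [1.0, 0.5, 0.0, 0.0, 0.5, 1.0]
machine = [0.75, 0.5, 0.25, 0.5, 0.0, 1.0]
print(concordance(human, machine, 0.25))  # quarter-step tolerance
print(concordance(human, machine, 0.5))   # half-step tolerance
```

A looser tolerance can only raise the score, which matches the direction of the two figures reported above.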

98
PART FIVE Stress Accent and Syllable Position
100
The Importance of Syllable Structure
Before going into the details of durational
variation at the segmental level, we briefly
examine some general patterns of pronunciation
variation that are conditioned by syllable
position and stress accent.
These data illustrate the sort of variation
that is conditioned by position within the syllable.
103
Pronunciation Variation: Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable, particularly when stress
accent is also taken into account.
BOTH syllable structure and accent level are
required for a full accounting.
[Figure: Deletions, Substitutions and Insertions for All Segments, partitioned into ONSET, NUCLEUS and CODA territory]
104
PART SIX Durational Properties of Pronunciation
Variation
105
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
106
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position. We'll begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment duration
107
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position. We'll begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration. However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level
108
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position. We'll begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration. However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level. Under such conditions,
the durational properties associated with light
accent are generally intermediate between heavy
accent and none
109
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English

110
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English
The CV and CVC forms cover ca. 60% of the
syllables

V = Vowel, C = Consonant
111
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English
The CV and CVC forms cover ca. 60% of the
syllables
Together, the V, VC, CV and CVC forms account for
85% of syllables

V = Vowel, C = Consonant
112
Syllable Duration - Across Syllable Forms

There is a broad range of syllable structures
observed in spoken English
The CV and CVC forms cover ca. 60% of the
syllables
Together, the V, VC, CV and CVC forms account for
85% of syllables
The CVCC and CCVC (complex syllable) forms
account for another 10%

V = Vowel, C = Consonant
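
Coverage figures of this kind come from reducing each syllable to its consonant-vowel skeleton and counting the forms. The following is a minimal sketch of that bookkeeping. The vowel inventory (ARPAbet-style symbols), the function names and the sample syllables are all assumptions made for illustration, not material from the corpus behind the slides.

```python
from collections import Counter

# Assumed vowel inventory (ARPAbet-style symbols); not taken from the slides.
VOWELS = {"iy", "ih", "ey", "eh", "ae", "aa", "ao", "ah", "ow", "uh",
          "uw", "ay", "aw", "oy", "er", "ax", "ix"}


def skeleton(phones):
    """Map a syllable's phone sequence to its C/V form, e.g. k-ae-t -> 'CVC'."""
    return "".join("V" if p in VOWELS else "C" for p in phones)


def form_proportions(syllables):
    """Return the share of each syllable form among the given syllables."""
    counts = Counter(skeleton(s) for s in syllables)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: n / total for form, n in counts.items()}


# Hypothetical syllables, for illustration only.
syllables = [["k", "ae", "t"], ["dh", "ax"], ["ay"], ["ih", "n"],
             ["s", "t", "aa", "p"], ["t", "uw"]]
props = form_proportions(syllables)
for form, share in sorted(props.items(), key=lambda kv: -kv[1]):
    print(f"{form}: {share:.0%}")
```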
113
Syllable Duration - Across Syllable Forms

It is unsurprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)

V = Vowel, C = Consonant
Canonical Syllable Forms
114
Syllable Duration - Across Syllable Forms

It is unsurprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
Note the systematic lengthening of the syllable
for each form as the accent level increases from
NONE to LIGHT to HEAVY

V = Vowel, C = Consonant
Canonical Syllable Forms
115
Syllable Duration - Across Syllable Forms

It is unsurprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
Note the systematic lengthening of the syllable
for each form as the accent level increases from
NONE to LIGHT to HEAVY
This pattern is representative of accent's impact
on duration

V = Vowel, C = Consonant
Canonical Syllable Forms
116
Syllable Duration - Across Syllable Forms

It is unsurprising that syllable duration is
largely a function of the number of segments
within the syllable (as shown in the graph below)
Note the systematic lengthening of the syllable
for each form as the accent level increases from
NONE to LIGHT to HEAVY
This pattern is representative of accent's impact
on duration (as we'll see)

V = Vowel, C = Consonant
Canonical Syllable Forms
117
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE)
V = Vowel, C = Consonant
Canonical Syllable Forms
118
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts
V = Vowel, C = Consonant
Canonical Syllable Forms
119
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts. The disparity
in duration is most pronounced for syllable forms
with one or no consonants (i.e., V, VC, CV)
V = Vowel, C = Consonant
Canonical Syllable Forms
120
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts. The disparity
in duration is most pronounced for syllable forms
with one or no consonants (i.e., V, VC, CV). This
pattern implies that accent has the greatest
impact on vocalic duration
V = Vowel, C = Consonant
Canonical Syllable Forms
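
The "60-100% longer" comparison is a relative lengthening: the difference between accented and unaccented duration, divided by the unaccented duration. A minimal sketch of that arithmetic follows. The durations are invented for illustration and are not the values plotted on the slide; the function name `percent_longer` is hypothetical.

```python
def percent_longer(accented_ms, unaccented_ms):
    """Relative lengthening of the accented form, as a percentage."""
    if unaccented_ms <= 0:
        raise ValueError("unaccented duration must be positive")
    return 100.0 * (accented_ms - unaccented_ms) / unaccented_ms


# Illustrative (heavy, none) mean durations in ms; NOT measured values.
durations_ms = {
    "V": (160.0, 80.0),
    "CV": (200.0, 110.0),
    "CVC": (280.0, 175.0),
}

for form, (heavy, none) in durations_ms.items():
    print(f"{form}: {percent_longer(heavy, none):.0f}% longer when heavily accented")
```

With these made-up numbers the shorter forms show the larger relative lengthening, which is the pattern the slide describes for V, VC and CV syllables.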
121
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below

Canonical Syllable Forms
122
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph
below. Vowels in accented syllables (of all forms)
are at least twice as long as their unaccented
counterparts
Canonical Syllable Forms
123
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph
below. Vowels in accented syllables (of all forms)
are at least twice as long as their unaccented
counterparts. This pattern implies that the
syllable nucleus absorbs a major component of
accent's impact (at least as far as duration is
concerned)
Canonical Syllable Forms
124
PART SEVEN Stress Accent and the Vocalic
Nucleus
125
Stress Accent's Impact on the Vocalic Nucleus

Because the pattern of stress accent's impact on
vocalic duration is relatively uniform across
syllable form, it is likely that the specific
structure of the syllable has relatively little
impact on vocalic duration

126
Stress Accent's Impact on the Vocalic Nucleus

Because the pattern of stress accent's impact on
vocalic duration is relatively uniform across
syllable form, it is likely that the specific
structure of the syllable has relatively little
impact on vocalic duration
As a consequence, the remaining analyses
pertaining to accent's impact on vocalic duration
collapse the data across syllable form

127
Stress Accent's Impact on the Vocalic Nucleus

Because the pattern of stress accent's impact on
vocalic duration is relatively uniform across
syllable form, it is likely that the specific
structure of the syllable has relatively little
impact on vocalic duration
As a consequence, the remaining analyses
pertaining to accent's impact on vocalic duration
collapse the data across syllable form
We now examine vocalic duration in somewhat
greater detail and illustrate how duration,
stress accent and vocalic identity interact

128
The Spatial Patterning of Duration in Vocalic
Nuclei
129
A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a
function primarily of two articulatory
properties both related to the motion of the
tongue

130
A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a
function primarily of two articulatory
properties both related to the motion of the
tongue
The front-back plane is most closely associated
with the second formant frequency (or more
precisely F2 - F1) and the volume of the
front-cavity resonance

131
A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a
function primarily of two articulatory
properties both related to the motion of the
tongue
The front-back plane is most closely associated
with the second formant frequency (or more
precisely F2 - F1) and the volume of the
front-cavity resonance
The height parameter is closely linked to the
frequency of F1

132
A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a
function primarily of two articulatory properties
both related to the motion of the tongue
The front-back plane is most closely associated
with the second formant frequency (or more
precisely F2 - F1) and the volume of the
front-cavity resonance
The height parameter is closely linked to the
frequency of F1
In the classic vowel triangle, segments are
positioned in terms of the tongue positions
associated with their production, as follows

133
A Brief Primer on Vocalic Acoustics

Vowel quality is generally thought to be a
function primarily of two articulatory properties
both related to the motion of the tongue
The front-back plane is most closely associated
with the second formant frequency (or more
precisely F2 - F1) and the volume of the
front-cavity resonance
The height parameter is closely linked to the
frequency of F1
In the classic vowel triangle, segments are
positioned in terms of the tongue positions
associated with their production, as follows

134
Spatial Patterning of Duration et al.

In the following slides duration is plotted on a
2-D grid, where the x-axis represents the
(hypothetical) front-back tongue position

135
Spatial Patterning of Duration et al.

In the following slides duration is plotted on a
2-D grid, where the x-axis represents the
(hypothetical) front-back tongue position
(and hence remains a constant throughout the
plots to follow)

136
Spatial Patterning of Duration et al.

In the following slides duration is plotted on a
2-D grid, where the x-axis represents the
(hypothetical) front-back tongue position
(and hence remains a constant throughout the
plots to follow)
The y-axis serves as the dependent measure,
expressed in terms of either duration or the
proportion of fully stressed (or unstressed)
nuclei

137
Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is
systematic with respect to duration

138
Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is
systematic with respect to duration
Low vowels, be they diphthongs or monophthongs,
are longer (on average) than high vowels

139
Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is
systematic with respect to duration
Low vowels, be they diphthongs or monophthongs,
are longer (on average) than high vowels

All nuclei
140
Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is
systematic with respect to duration
Low vowels, be they diphthongs or monophthongs,
are longer (on average) than high vowels
Thus, duration appears to be highly correlated
with vowel height

All nuclei
141
Vocalic Duration and Vowel Height

The spatial patterning of vocalic segments is
systematic with respect to duration
Low vowels, be they diphthongs or monophthongs,
are longer (on average) than high vowels
Thus, duration appears to be highly correlated
with vowel height
But the situation is a little more complicated
than first appearances would suggest

All nuclei
142
Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic nuclei
Canonical Syllable Forms
143
Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic
nuclei. Moreover, diphthongs and tense, low
monophthongs tend to exhibit a larger dynamic
range than the lax monophthongs
Canonical Syllable Forms
144
Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic
nuclei. Moreover, diphthongs and tense, low
monophthongs tend to exhibit a larger dynamic
range than the lax monophthongs
Lax monophthongs
Canonical Syllable Forms
145
Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed
146
Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed. The low vowels, be they monophthongs
or diphthongs, are rarely unstressed
147
Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed. The low vowels, be they monophthongs
or diphthongs, are rarely unstressed. The high
diphthongs and high/mid, tense monophthongs
occupy an intermediate position
148
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed
149
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed. The low
vowels, be they monophthongs or diphthongs, are
far more likely to be fully stressed
150
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed. The low
vowels, be they monophthongs or diphthongs, are
far more likely to be fully stressed. An
intermediate degree of stress accounts for the
other vocalic instances
151
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed. The low
vowels, be they monophthongs or diphthongs, are
far more likely to be fully stressed. An
intermediate degree of stress accounts for the
other vocalic instances (but will not be
addressed here)
152
Vocalic Variation Importance of Stress Accent

The vowels of heavily accented syllables are
(mostly) pronounced canonically

Canonical Pronunciations
Non-Canonical Pronunciations
153
Vocalic Variation Importance of Stress Accent

The vowels of heavily accented syllables are
(mostly) pronounced canonically
Low vowels are largely the province of accented
syllables

Canonical Pronunciations
Non-Canonical Pronunciations
154
Vocalic Variation Importance of Stress Accent

The vowels of heavily accented syllables are
(mostly) pronounced canonically
Low vowels are largely the province of accented
syllables, and
High vowels the province of unaccented syllables

Canonical Pronunciations
Non-Canonical Pronunciations
155
Vocalic Variation Importance of Stress Accent

The vowels of heavily accented syllables are
(mostly) pronounced canonically
Low vowels are largely the province of accented
syllables, and
High vowels the province of unaccented syllables
Moreover, there's a lexical bias towards high
vowels for unaccented forms

Canonical Pronunciations
Non-Canonical Pronunciations
156
Vocalic Variation Importance of Stress Accent

The vowels of heavily accented syllables are
(mostly) pronounced canonically
Low vowels are largely the province of accented
syllables, and
High vowels the province of unaccented syllables
Moreover, there's a lexical bias towards high
vowels for unaccented forms
That's reinforced in patterns of deviation from
canonical pronunciation

Canonical Pronunciations
Non-Canonical Pronunciations
157
Vocalic Height Deviation from Canonical

Vowels are more likely to RISE in height than to
descend when unaccented

Amount of Change
Direction of Change
158
Vocalic Height Deviation from Canonical

Vowels are more likely to RISE in height than to
descend when unaccented
Vocalic lowering of height is rare

Amount of Change
Direction of Change
159
Vocalic Height Deviation from Canonical

Vowels are more likely to RISE in height than to
descend when unaccented
Vocalic lowering of height is rare
Most deviations from the canonical maintain vowel
height

Amount of Change
Direction of Change
160
Vocalic Height Deviation from Canonical

Vowels are more likely to RISE in height than to
descend when unaccented
Vocalic lowering of height is rare
Most deviations from the canonical maintain vowel
height
More than a single height step deviation is
uncommon

Amount of Change
Direction of Change
161
Vocalic Height Deviation from Canonical

Vowels are more likely to RISE in height than to
descend when unaccented
Vocalic lowering of height is rare
Most deviations from the canonical maintain vowel
height
More than a single height step deviation is
uncommon
Virtually all 2-step height deviations occur in
unaccented syllables

Amount of Change
Direction of Change
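
The height-deviation tallies above can be sketched as a comparison between the canonical vowel and the vowel actually produced. This sketch assumes a three-level height scale (0 = low, 1 = mid, 2 = high) and an ARPAbet-style height assignment; both are illustrative assumptions rather than the classification used for the slides. The canonical/realized pairs are hypothetical.

```python
from collections import Counter

# Assumed height classes: 0 = low, 1 = mid, 2 = high (illustrative only).
HEIGHT = {
    "iy": 2, "ih": 2, "ix": 2, "uw": 2, "uh": 2,
    "ey": 1, "eh": 1, "ax": 1, "ah": 1, "ow": 1,
    "ae": 0, "aa": 0, "ay": 0, "aw": 0,
}


def height_change(canonical, realized):
    """Signed height steps: positive means the vowel was raised."""
    return HEIGHT[realized] - HEIGHT[canonical]


def summarize(pairs):
    """Count (direction, number of steps) over canonical/realized pairs."""
    counts = Counter()
    for canonical, realized in pairs:
        steps = height_change(canonical, realized)
        if steps > 0:
            direction = "raised"
        elif steps < 0:
            direction = "lowered"
        else:
            direction = "same height"
        counts[(direction, abs(steps))] += 1
    return counts


# Hypothetical (canonical, realized) vowel pairs, not corpus data.
pairs = [("ae", "eh"), ("ae", "ih"), ("eh", "ih"), ("ih", "ix"), ("iy", "eh")]
summary = summarize(pairs)
for (direction, steps), n in sorted(summary.items()):
    print(f"{direction}, {steps} step(s): {n}")
```

Splitting the same tally by accent level would give the comparison the slide draws between accented and unaccented syllables.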
162
The Vowel Space Under (Full) Stress (Accent)

In fully accented nuclei there is a relatively even
distribution of segments across the vowel space,
with a slight bias towards the front and central
vowels

Canonical Vowels Only
163
The Vowel Space Without (Stress) Accent

In unaccented syllables vowels are confined
largely to the high-front and high-central
sectors of the articulatory space

Canonical Vowels Only
164
The Vowel Space Without (Stress) Accent

In unaccented syllables vowels are confined
largely to the high-front and high-central
sectors of the articulatory space
The low and mid vowels get creamed

Canonical Vowels Only
165
The Vowel Spaces Compared

Stress accent exerts a profound effect on the
character of the vowel space

Heavily Accented
Unaccented
Canonical Vowels Only
166
The Vowel Spaces Compared

Stress accent exerts a profound effect on the
character of the vowel space
High vowels are largely associated with
unaccented syllables

Heavily Accented
Unaccented
Canonical Vowels Only
167
The Vowel Spaces Compared

Stress accent exerts a profound effect on the
character of the vowel space
High vowels are largely associated with
unaccented syllables
Low vowels are mostly associated with accented
forms

Heavily Accented
Unaccented
Canonical Vowels Only
168
The Vowel Spaces Compared

Stress accent exerts a profound effect on the
character of the vowel space
High vowels are largely associated with
unaccented syllables
Low vowels are mostly associated with accented
forms
This distinction between accented and unaccented
syllables is of profound importance for
understanding (and modeling) pronunciation
variation

Heavily Accented
Unaccented
Canonical Vowels Only
169
Is It Stress? Vocalic Identity? Or What?

Duration appears to play an important (but
certainly not exclusive) role in stress accent
for spontaneous American English discourse

170
Is It Stress? Vocalic Identity? Or What?

Duration appears to play an important (but
certainly not exclusive) role in stress accent
for spontaneous American English discourse
For any given vocalic class, stressed segments
are longer (on average)

171
Is It Stress? Vocalic Identity? Or What?

Duration appears to play an important (but
certainly not exclusive) role in stress accent
for spontaneous American English discourse
For any given vocalic class, stressed segments
are longer (on average)
The durational disparity is most pronounced among
the low vowels and the diphthongs

172
Is It Stress? Vocalic Identity? Or What?

Duration appears to play an important (but
certainly not exclusive) role in stress accent
for spontaneous American English discourse
For any given vocalic class, stressed segments
are longer (on average)
The durational disparity is most pronounced among
the low vowels and the diphthongs
Low vowels tend to be much longer in duration
than high vowels

173
Is It Stress? Vocalic Identity? Or What?

Duration appears to play an important (but
certainly not exclusive) role in stress accent
for spontaneous American English discourse
For any given vocalic class, stressed segments
are longer (on average)
The durational disparity is most pronounced among
the low vowels and the diphthongs
Low vowels tend to be much longer in duration
than high vowels
This is the case even for diphthongs
