Title: Beyond the Phoneme


1
Beyond the Phoneme: A Juncture-Accent Model of Spoken Language
Steven Greenberg, Hannah Carvey, Leah Hitchcock and Shuangyu Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
{steveng, hmcarvey, leahh, shawnc}@icsi.berkeley.edu
2
Acknowledgements and Thanks
Research Funding: U.S. Department of Defense, U.S. National Science Foundation
3
For Further Information
Consult the web site www.icsi.berkeley.edu/steveng
4
OVERTURE: The Central Challenge for Models of
Speech Recognition
5
The Serial Frame Perspective on Speech
Traditional models of speech recognition assume
that the identity of a phonetic segment is derived
from a detailed spectral profile of the acoustic
signal, computed for each time interval (frame) of
speech.
8
Phonemic Beads on a String Illustrated
In traditional models of speech recognition, words
are conceptualized as mere sequences of phonetic
segments (phones), strung together like beads on a
string. No quarter is given to stress accent or
other syllabic properties.
9
Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
Cat /k/ /ae/ /t/
Cat k ae t
ASR systems focus on decoding words from
sequences of phones
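The beads-on-a-string view can be sketched as a lookup from a fixed phone sequence to a word. This is only a toy illustration: the lexicon entries, the function name and the reduced form of "and" are assumptions made for the example, not part of any particular ASR system.

```python
# Toy "beads on a string" lexicon: each word has exactly one canonical
# phone sequence (Arpabet-style symbols; entries are illustrative only).
CANONICAL_LEXICON = {
    ("k", "ae", "t"): "cat",
    ("ae", "n", "d"): "and",
}


def lookup_word(phones):
    """Return the word whose canonical phone sequence matches exactly, else None."""
    return CANONICAL_LEXICON.get(tuple(phones))


canonical = lookup_word(["k", "ae", "t"])   # matches the dictionary form
reduced = lookup_word(["ix", "n"])          # a reduced form of "and" finds no match
```

An exact-match lexicon of this kind has no way to recover a word once its pronunciation departs from the canonical form.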
10
A Challenge for the Phonemic Beads on a String
Approach to Speech Recognition: Pronunciation
Variability
13
Pronunciation Variability of Real Speech
Pronunciation patterns encountered in everyday
life are extremely diverse. There are literally
dozens of ways in which common words are
pronounced, as the following two slides
illustrate for the word "and", based on manual
phonetic annotation of a corpus comprising
telephone dialogues.
14
How Many Pronunciations of and?
Canonical pronunciation
15
How Many Pronunciations of and?
17
Pronunciation Variability of Real Speech
There are literally dozens of ways in which common
words are pronounced, as the following slide
illustrates for the 20 most frequent words from
the same corpus (Switchboard), which together
account for 35% of the word tokens in the corpus.
18
How Many Different Pronunciations?
The 20 most frequent words account for 35% of
the tokens
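Both statistics mentioned here (the number of distinct pronunciations per word, and the share of tokens covered by the most frequent words) can be computed from a phonetically annotated token list. The following is a minimal sketch; the function name, the data layout and the five-token sample are hypothetical and are not drawn from Switchboard.

```python
from collections import Counter, defaultdict


def pronunciation_stats(tokens, top_n):
    """tokens: iterable of (word, phone_tuple) pairs, one per spoken token.

    Returns (variants_per_word, coverage), where coverage is the fraction of
    all tokens accounted for by the top_n most frequent words.
    """
    word_counts = Counter()
    variants = defaultdict(set)
    for word, phones in tokens:
        word_counts[word] += 1
        variants[word].add(tuple(phones))
    total = sum(word_counts.values())
    top_total = sum(count for _, count in word_counts.most_common(top_n))
    variants_per_word = {word: len(forms) for word, forms in variants.items()}
    coverage = top_total / total if total else 0.0
    return variants_per_word, coverage


# Hypothetical mini-sample, not taken from Switchboard.
sample = [
    ("and", ("ae", "n", "d")),
    ("and", ("ae", "n")),
    ("and", ("ix", "n")),
    ("the", ("dh", "ax")),
    ("cat", ("k", "ae", "t")),
]
variants_per_word, coverage = pronunciation_stats(sample, top_n=1)
```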
19
QUESTION: How do listeners decode the speech
signal given the large amount of pronunciation
variation?
20
PART ONE: Anatomy of a Syllable
21
Language - A Syllable-Centric Perspective
A more empirically grounded perspective on spoken
language focuses on the SYLLABLE as the interface
between sound and meaning.
Within this framework the relationship between
the syllable and the higher and lower tiers is
non-arbitrary and statistically systematic.
29
The Importance of the Syllable
The analyses to follow are all linked, in some
fashion, to syllable structure. To highlight
patterns germane to variation in segmental
duration, it is necessary to partition the data
by syllable position (as well as by stress-accent
level). As a consequence, we will examine the
onsets, nuclei and codas of syllables separately
to gain insight into the underlying patterns.
What is an onset? What is a nucleus? What is a
coda? The following slides provide a brief (and
gentle) introduction to syllable structure.
36
Syllable and Phonetic Segment Illustrated
Syllables generally consist of three constituents:
ONSET, NUCLEUS and CODA.
  • Virtually all syllables contain a NUCLEUS, which
    is VOCALIC (by definition)
  • Most (but not all) syllables also contain an
    ONSET (usually a CONSONANT)
  • Many syllables contain a CODA (also typically a
    CONSONANT)
  • The most common syllable form in English is
    Onset + Nucleus + Coda ("nine")
  • Followed in popularity by Onset + Nucleus ("two")
  • Onset segments often differ in significant ways
    from coda segments
[Figure: syllable and phonetic-segment structure illustrated with the OGI Numbers95 corpus; J = JUNCTURE]
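The three-way division can be sketched in code as follows. This is a minimal illustration that assumes the input is a single syllable written in Arpabet-like symbols and that the nucleus is the first vocalic symbol; the `VOCALIC` set and the function name are assumptions made for the example.

```python
# Assumed set of vocalic (nucleus) symbols in an Arpabet-like alphabet;
# extend it to match whatever transcription system is actually in use.
VOCALIC = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
           "ih", "ix", "iy", "ow", "oy", "uh", "uw", "en"}


def split_syllable(phones):
    """Split one syllable's phones into (onset, nucleus, coda) lists."""
    for index, phone in enumerate(phones):
        if phone in VOCALIC:
            # Everything before the nucleus is onset, everything after is coda.
            return phones[:index], [phone], phones[index + 1:]
    raise ValueError("syllable has no vocalic nucleus")


nine = split_syllable(["n", "ay", "n"])   # Onset + Nucleus + Coda
two = split_syllable(["t", "uw"])         # Onset + Nucleus, empty coda
```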
37
PART TWO: Spectro-Temporal Profiles
43
The Spectro-Temporal Profile (STeP)
Certain specific (and important) properties of
the syllable are not well represented by the
traditional 2.5-D spectrographic representation.
Stress accent and juncture are two such
properties. A different representation, based on
the log critical-band energy profile across
frequency and time, can provide the requisite
detail, as shown in miniature below (and in
expanded form on the following slides).
STePs are derived from averages of hundreds of
individual instances.
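The averaging step can be sketched as below. The slides do not say how instances are time-aligned or whether the average is taken before or after the logarithm, so this sketch assumes pre-aligned instances of identical shape and averages the log energies; the function name and the `floor` parameter are hypothetical.

```python
import math


def spectro_temporal_profile(instances, floor=1e-10):
    """Average log band energy over many instances of the same syllable or word.

    instances: list of 2-D lists indexed [frame][band] holding linear
    critical-band energies. All instances are assumed to be time-aligned
    and to share the same number of frames and bands.
    """
    n_frames = len(instances[0])
    n_bands = len(instances[0][0])
    profile = [[0.0] * n_bands for _ in range(n_frames)]
    for inst in instances:
        for t in range(n_frames):
            for b in range(n_bands):
                # The floor avoids log(0) for silent bands.
                profile[t][b] += math.log10(max(inst[t][b], floor))
    count = len(instances)
    return [[value / count for value in row] for row in profile]


# Two tiny hypothetical instances: 2 frames x 2 bands each.
step = spectro_temporal_profile([
    [[1.0, 10.0], [100.0, 1.0]],
    [[100.0, 10.0], [100.0, 100.0]],
])
```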
44
Spectro-Temporal Profile - Disyllabic Word
Full-spectrum perspective
[Figure: STeP of the word "seven" (s eh vx en) from OGI Numbers95; labels: accented syllable, unaccented syllable, juncture, mean duration]
45
Spectro-Temporal Profile - Disyllabic Word
High-frequency perspective
[Figure: STeP of the word "seven" (s eh vx en) from OGI Numbers95; labels: accented syllable, unaccented syllable, juncture, mean duration]
46
PART THREE: Scientific Approach to Speech
Recognition
56
A Scientific Approach to Speech Recognition
Ascertain the contribution of:
  • (1) phonetic segment (and feature) classification
  • (2) phonetic segmentation
  • (3) stress accent, and
  • (4) syllable position
to ASR performance.
Using the OGI Numbers95 corpus as a controlled
(limited vocabulary) corpus, and a relatively
transparent recognition engine utilizing the
following variety of articulatory-based features:
manner and place of articulation, voicing, vowel
height, lip-rounding, spectral dynamics and
segment length. These features are explicitly
tied to syllable position (i.e., onset, nucleus
and coda) and stress-accent level.
We will be comparing three systems:
  • Baseline: entirely automatic recognition
  • Fabricated: an entirely fabricated set of input
    data (derived from hand-labeled phonetic
    annotation and AutoSAL)
  • Half-way house: partially automatic and partially
    not (manually derived phonetic segmentation, as
    well as whether each segment is vocalic or not)
61
Numbers95 Recognition: Stress Accent Impact
Entirely Stress-Accent Dependent
Results: Word Error Rate (%)
  Fabricated        1.3
  Half-way House    2.0
  Baseline          5.6
The half-way house system is much closer in
performance to the fabricated-data version than
to the baseline system, suggesting that:
  • Accurate phonetic segmentation is extremely
    important for enhanced ASR performance, as is
    knowledge of the location of the syllabic nucleus
  • Stress-accent information is most important for
    the vocalic nucleus: without it, WER increases
    by 10-20%
  • It is also important for the coda: WER increases
    by 7-15%
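For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below shows that standard definition; it is not the scoring tool used for the results above, and the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("nine two seven", "nine seven")  # one deletion in three words
```

Multiplying the returned fraction by 100 gives the percentage form in which WER is usually reported.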
65
Numbers95 Recognition: Pronunciation Impact
Effect of pronunciation variation as a function
of syllable position, where the canonical
pronunciation is potentially fixed for each
syllable position separately (or "All" together).
"Standard" refers to the regular recognition system.
Word Error Rate (%)
                  Standard  Onset  Nucleus   Coda    All
  Fabricated        1.29     1.33    1.61    1.63   1.76
  Half-way House    1.97     2.16    2.21    2.55   2.81
  Baseline          5.59     5.91    5.91    6.70   7.03
Conclusions:
  • Onset segments are most canonical
  • Coda segments are least canonical
  • Therefore, it is important to provide for
    pronunciation variation in ASR systems
68
Numbers95 Syllable Position Importance
Effect of pronunciation variation as a function
of syllable position, where each syllabic
constituent is neutralized with respect to
lexical matching (i.e., each element is factored
out of the decoding process separately).
"Standard" refers to the regular recognition system.
Word Error Rate (%)
                  Standard  Onset  Nucleus   Coda
  Fabricated        1.29     9.70    5.95    3.92
  Half-way House    1.97    11.27   13.28    6.60
  Baseline          5.59    15.70   20.22   10.13
Neutralization of the onset and nucleus exerts a
greater impact on ASR performance than
neutralization of the coda.
Conclusion: Onsets and nuclei are most important
for lexical access in an ASR system (at least for
the Numbers95 corpus).
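The size of each effect can be read off by subtracting the standard-system error rate from each neutralized condition. The sketch below does that arithmetic on the figures from this slide; the dictionary layout and function name are illustrative choices.

```python
# Word error rates copied from the slide above (same units as the slide).
WER = {
    "Fabricated":     {"Standard": 1.29, "Onset": 9.70,  "Nucleus": 5.95,  "Coda": 3.92},
    "Half-way House": {"Standard": 1.97, "Onset": 11.27, "Nucleus": 13.28, "Coda": 6.60},
    "Baseline":       {"Standard": 5.59, "Onset": 15.70, "Nucleus": 20.22, "Coda": 10.13},
}


def neutralization_penalty(system):
    """Increase in WER over the standard system for each neutralized constituent."""
    row = WER[system]
    return {pos: round(row[pos] - row["Standard"], 2)
            for pos in ("Onset", "Nucleus", "Coda")}


penalties = {system: neutralization_penalty(system) for system in WER}
```

In all three systems the coda penalty is the smallest of the three, which is the pattern the conclusion summarizes.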
69
PART FOUR: Being Phonetically and Prosodically
Annotated
79
Phonetic Transcription of Spontaneous English
Telephone dialogues of 5-10 minutes duration,
from the SWITCHBOARD corpus, have been
phonetically transcribed (labeled and
segmented). Most of this material has been
annotated manually:
  • 4 hours labeled at the phone level and segmented
    at the syllabic level
  • 1 hour labeled and segmented at the
    phonetic-segment level
  • The remaining material segmented at the
    phonetic-segment level using automatic methods
  • 45 minutes of hand-labeled stress-accent material
  • An additional four hours of stress-accent material
    automatically labeled (though unused in the
    current analysis)
There is a lot of diversity in the material
transcribed: it spans speech of both genders (ca.
50/50), reflecting a wide range of American
dialectal variation, speaking rate and voice
quality.
Transcription system: a variant of Arpabet, with
phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd
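Labels in such a system can be separated into a base symbol and its diacritics before further analysis. The sketch below assumes that diacritics are appended to the base symbol with an underscore, as the notation on the slide suggests; the function name and the example label `t_gl` are hypothetical.

```python
# Diacritic suffixes listed on the slide; their phonetic meanings are not
# spelled out there, so this sketch only separates them from the base symbol.
DIACRITICS = {"gl", "cr", "fr", "n", "vl", "vd"}


def parse_phone_label(label):
    """Split a label such as 't_gl' into (base_symbol, [diacritics])."""
    base, *suffixes = label.split("_")
    unknown = [s for s in suffixes if s not in DIACRITICS]
    if unknown:
        raise ValueError(f"unrecognized diacritic(s): {unknown}")
    return base, suffixes


plain = parse_phone_label("ae")
marked = parse_phone_label("t_gl")   # hypothetical label, for illustration
```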
81
Phonetic Transcription of Spontaneous English
The data are available at http://www.icsi.berkeley.edu/real/stp
91
Annotation of Stress Accent
  • Forty-five minutes of the phonetically annotated
    portion of the Switchboard corpus were manually
    labeled with respect to stress accent
  • Three levels of accent were distinguished:
    Heavy, Light and None
  • (In actuality, labelers assigned a 1 to fully
    accented syllables, a null to completely
    unaccented syllables, and a 0.5 to all others)
  • An example of the annotation (attached to the
    vocalic nucleus) is shown below (where the accent
    levels could not be derived from a dictionary)
  • In this example most of the syllables are
    unaccented, with two labeled as lightly accented
    (0.5) and one other labeled as very lightly
    accented (0.25)

93
Annotation of Stress Accent
The data are available at http://www.icsi.berkeley.edu/steveng/prosody
94
Automatic Labeling of Stress Accent
  • These forty-five minutes of hand-labeled phonetic
    and prosodic annotation from the Switchboard
    corpus were used as training data for development
    of an Automatic Stress Accent Labeling System
    (AutoSAL)

97
How Good is AutoSAL?
  • There is 79% concordance between human and
    machine accent labels when the tolerance level is
    a quarter-step
  • There is 97.5% concordance when the tolerance
    level is half a step
  • This degree of concordance is as high as that
    exhibited by two highly trained (human)
    transcribers
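One plausible reading of concordance at a given tolerance is the fraction of syllables whose human and machine labels differ by no more than that tolerance. The sketch below implements that reading; the exact scoring procedure is not described on the slides, and the function name and label values are hypothetical.

```python
def concordance(human, machine, tolerance):
    """Fraction of syllables whose two accent labels differ by at most tolerance."""
    if len(human) != len(machine):
        raise ValueError("label sequences must be the same length")
    if not human:
        raise ValueError("no labels to compare")
    agree = sum(1 for h, m in zip(human, machine) if abs(h - m) <= tolerance)
    return agree / len(human)


# Hypothetical labels on the 0 / 0.5 / 1 scale (0 stands in for "null").
human_labels = [1.0, 0.0, 0.5, 0.0]
machine_labels = [0.75, 0.0, 1.0, 0.5]
quarter_step = concordance(human_labels, machine_labels, tolerance=0.25)
half_step = concordance(human_labels, machine_labels, tolerance=0.5)
```

Widening the tolerance can only raise the score, which matches the direction of the two figures quoted above.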

98
PART FIVE: Stress Accent and Syllable Position
100
The Importance of Syllable Structure
Before going into the details of durational
variation at the segmental level, we briefly
examine some general patterns of pronunciation
variation that are conditioned by syllable
position and stress accent. These data
illustrate the sort of variation that is
conditioned by position within the syllable.
101
Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable
Deletions
  • All Segments

CODA Territory
Substitutions
Insertions
ONSET Territory
NUCLEUS Territory
102
Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable, particularly when stress
accent is also taken into account
Deletions
  • All Segments

CODA Territory
Substitutions
Insertions
ONSET Territory
NUCLEUS Territory
103
Pronunciation Variation Syllable and Accent
Pronunciation variation is systematic at the
level of the syllable, particularly when stress
accent is also taken into account. BOTH syllable
structure and accent level are required for a
full accounting
Deletions
  • All Segments

CODA Territory
Substitutions
Insertions
ONSET Territory
NUCLEUS Territory
104
PART SIX Durational Properties of Pronunciation
Variation
105
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position
106
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position. We'll begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment duration
107
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position. We'll begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration. However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level
108
Analysis of Durational Properties of Speech
The following analyses are conditioned on stress
accent level and (for the most part) syllable
position. We'll begin with analyses illustrating
the patterns associated with three levels of
stress accent (heavy, light and none) to show the
graded nature of the durational properties
pertaining to syllable and segment
duration. However, for purposes of illustrative
clarity, many of the slides will show only two
levels of accent (heavy and none) in order to
delineate the differences in duration associated
with stress accent level. Under such conditions,
the durational properties associated with light
accent are generally intermediate between heavy
accent and none
109
Syllable Duration - Across Syllable Forms
  • There is a broad range of syllable structures
    observed in spoken English

110
Syllable Duration - Across Syllable Forms
  • There is a broad range of syllable structures
    observed in spoken English
  • The CV and CVC forms cover ca. 60% of the
    syllables

V Vowel C Consonant
111
Syllable Duration - Across Syllable Forms
  • There is a broad range of syllable structures
    observed in spoken English
  • The CV and CVC forms cover ca. 60% of the
    syllables
  • Together, the V, VC, CV and CVC forms account for
    85% of syllables

V Vowel C Consonant
112
Syllable Duration - Across Syllable Forms
  • There is a broad range of syllable structures
    observed in spoken English
  • The CV and CVC forms cover ca. 60% of the
    syllables
  • Together, the V, VC, CV and CVC forms account for
    85% of syllables
  • The CVCC and CCVC (complex syllable) forms
    account for another 10%

V Vowel C Consonant
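
A minimal sketch of how syllables could be reduced to C/V forms and tallied. The vowel inventory is an assumed ARPABET-style set and the syllables are toy examples, so the printed share is illustrative and does not reproduce the corpus figures.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Assumed ARPABET-style vowel symbols; any phone not listed counts as C.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}


def cv_form(phones: List[str]) -> str:
    """Map a syllable's phones to its C/V skeleton, e.g. k ae t -> CVC."""
    return "".join("V" if p in VOWELS else "C" for p in phones)


def form_shares(syllables: Iterable[List[str]]) -> Dict[str, float]:
    """Proportion of syllables falling into each C/V form."""
    counts = Counter(cv_form(s) for s in syllables)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: n / total for form, n in counts.items()}


# Hypothetical toy syllables, not corpus data.
toy = [["k", "ae", "t"], ["dh", "ax"], ["ay"], ["ih", "n"],
       ["s", "t", "aa", "p"], ["t", "uw"], ["d", "ao", "g"],
       ["h", "eh", "l", "p"]]

shares = form_shares(toy)
simple = sum(shares.get(f, 0.0) for f in ("V", "VC", "CV", "CVC"))
print(f"V + VC + CV + CVC share: {simple:.2f}")
```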
113
Syllable Duration - Across Syllable Forms
  • It is unsurprising that syllable duration is
    largely a function of the number of segments
    within the syllable (as shown in the graph below)

V Vowel C Consonant
Canonical Syllable Forms
114
Syllable Duration - Across Syllable Forms
  • It is unsurprising that syllable duration is
    largely a function of the number of segments
    within the syllable (as shown in the graph below)
  • Note the systematic lengthening of the syllable
    for each form as the accent level increases from
    NONE to LIGHT to HEAVY

V Vowel C Consonant
Canonical Syllable Forms
115
Syllable Duration - Across Syllable Forms
  • It is unsurprising that syllable duration is
    largely a function of the number of segments
    within the syllable (as shown in the graph below)
  • Note the systematic lengthening of the syllable
    for each form as the accent level increases from
    NONE to LIGHT to HEAVY
  • This pattern is representative of accent's impact
    on duration

V Vowel C Consonant
Canonical Syllable Forms
116
Syllable Duration - Across Syllable Forms
  • It is unsurprising that syllable duration is
    largely a function of the number of segments
    within the syllable (as shown in the graph below)
  • Note the systematic lengthening of the syllable
    for each form as the accent level increases from
    NONE to LIGHT to HEAVY
  • This pattern is representative of accent's impact
    on duration (as we'll see)

V Vowel C Consonant
Canonical Syllable Forms
117
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE)
V Vowel C Consonant
Canonical Syllable Forms
118
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts
V Vowel C Consonant
Canonical Syllable Forms
119
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts. The disparity
in duration is most pronounced for syllable forms
with one or no consonants (i.e., V, VC, CV)
V Vowel C Consonant
Canonical Syllable Forms
120
Syllable Duration - Accent Level/Syllable Form
This graph shows the same data as the previous
slides, but from the perspective of just two
accent levels (HEAVY and NONE). The heavily
accented syllables are generally 60-100% longer
than their unaccented counterparts. The disparity
in duration is most pronounced for syllable forms
with one or no consonants (i.e., V, VC, CV). This
pattern implies that accent has the greatest
impact on vocalic duration
V Vowel C Consonant
Canonical Syllable Forms
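
A minimal sketch of the comparison behind a statement such as "60-100% longer": mean duration per syllable form and accent level, then the relative lengthening of heavy over none. The record layout, function names and durations are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (syllable form, accent level, duration in ms)


def mean_durations(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Average duration for every (form, accent level) pair."""
    groups = defaultdict(list)
    for form, accent, dur_ms in records:
        groups[(form, accent)].append(dur_ms)
    return {key: mean(values) for key, values in groups.items()}


def percent_longer(means: Dict[Tuple[str, str], float], form: str,
                   accented: str = "heavy", unaccented: str = "none") -> float:
    """How much longer (in percent) the accented form is than the unaccented."""
    base = means[(form, unaccented)]
    return 100.0 * (means[(form, accented)] - base) / base


# Hypothetical durations, not measurements from the corpus.
records = [("CV", "heavy", 200.0), ("CV", "heavy", 240.0),
           ("CV", "none", 100.0), ("CV", "none", 120.0),
           ("CVC", "heavy", 300.0),
           ("CVC", "none", 180.0), ("CVC", "none", 220.0)]

means = mean_durations(records)
for form in ("CV", "CVC"):
    print(form, f"{percent_longer(means, form):.0f}% longer when heavily accented")
```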
121
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph below

Canonical Syllable Forms
122
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph
below. Vowels in accented syllables (of all forms)
are at least twice as long as their unaccented
counterparts
Canonical Syllable Forms
123
Nucleus Duration - Accent Level/Syllable Form
The hypothesis delineated on the previous slide
(that accent has the most profound impact on
vocalic duration) is confirmed in the graph
below. Vowels in accented syllables (of all forms)
are at least twice as long as their unaccented
counterparts. This pattern implies that the
syllable nucleus absorbs a major component of
accent's impact (at least as far as duration is
concerned)
Canonical Syllable Forms
124
PART SEVEN Stress Accent and the Vocalic
Nucleus
125
Stress Accent's Impact on the Vocalic Nucleus
  • Because the pattern of stress accent's impact on
    vocalic duration is relatively uniform across
    syllable form, it is likely that the specific
    structure of the syllable has relatively little
    impact on vocalic duration

126
Stress Accent's Impact on the Vocalic Nucleus
  • Because the pattern of stress accent's impact on
    vocalic duration is relatively uniform across
    syllable form, it is likely that the specific
    structure of the syllable has relatively little
    impact on vocalic duration
  • As a consequence, the remaining analyses
    pertaining to accent's impact on vocalic duration
    collapse the data across syllable form

127
Stress Accent's Impact on the Vocalic Nucleus
  • Because the pattern of stress accent's impact on
    vocalic duration is relatively uniform across
    syllable form, it is likely that the specific
    structure of the syllable has relatively little
    impact on vocalic duration
  • As a consequence, the remaining analyses
    pertaining to accent's impact on vocalic duration
    collapse the data across syllable form
  • We now examine vocalic duration in somewhat
    greater detail and illustrate how duration,
    stress accent and vocalic identity interact

128
The Spatial Patterning of Duration in Vocalic
Nuclei
129
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory
    properties both related to the motion of the
    tongue

130
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory
    properties both related to the motion of the
    tongue
  • The front-back plane is most closely associated
    with the second formant frequency (or more
    precisely F2 - F1) and the volume of the
    front-cavity resonance

131
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory
    properties both related to the motion of the
    tongue
  • The front-back plane is most closely associated
    with the second formant frequency (or more
    precisely F2 - F1) and the volume of the
    front-cavity resonance
  • The height parameter is closely linked to the
    frequency of F1

132
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory properties
    both related to the motion of the tongue
  • The front-back plane is most closely associated
    with the second formant frequency (or more
    precisely F2 - F1) and the volume of the
    front-cavity resonance
  • The height parameter is closely linked to the
    frequency of F1
  • In the classic vowel triangle, segments are
    positioned in terms of the tongue positions
    associated with their production, as follows

133
A Brief Primer on Vocalic Acoustics
  • Vowel quality is generally thought to be a
    function primarily of two articulatory properties
    both related to the motion of the tongue
  • The front-back plane is most closely associated
    with the second formant frequency (or more
    precisely F2 - F1) and the volume of the
    front-cavity resonance
  • The height parameter is closely linked to the
    frequency of F1
  • In the classic vowel triangle, segments are
    positioned in terms of the tongue positions
    associated with their production, as follows
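
A rough sketch of the formant relations described in the bullets above: F2 - F1 serves as a proxy for front-back position, and F1 (inverted, since a higher F1 goes with a lower vowel) as a proxy for height. The formant values are approximate textbook-style figures for three corner vowels, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def vowel_space_position(f1_hz: float, f2_hz: float) -> Tuple[float, float]:
    """Return (frontness, height) proxies from the first two formants.

    frontness: F2 - F1, larger for front vowels.
    height:    -F1, larger (less negative) for high vowels.
    """
    return (f2_hz - f1_hz, -f1_hz)


# Approximate (F1, F2) values in Hz, for illustration only.
FORMANTS: Dict[str, Tuple[float, float]] = {
    "iy": (270.0, 2290.0),  # high front
    "aa": (730.0, 1090.0),  # low
    "uw": (300.0, 870.0),   # high back
}

positions = {v: vowel_space_position(f1, f2) for v, (f1, f2) in FORMANTS.items()}
for vowel, (frontness, height) in positions.items():
    print(f"{vowel}: frontness={frontness:.0f} Hz, height={height:.0f}")
```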

134
Spatial Patterning of Duration et al.
  • In the following slides duration is plotted on a
    2-D grid, where the x-axis represents the
    (hypothetical) front-back tongue position

135
Spatial Patterning of Duration et al.
  • In the following slides duration is plotted on a
    2-D grid, where the x-axis represents the
    (hypothetical) front-back tongue position
  • (and hence remains a constant throughout the
    plots to follow)

136
Spatial Patterning of Duration et al.
  • In the following slides duration is plotted on a
    2-D grid, where the x-axis represents the
    (hypothetical) front-back tongue position
  • (and hence remains a constant throughout the
    plots to follow)
  • The y-axis serves as the dependent measure,
    expressed in terms of either duration or the
    proportion of fully stressed (or unstressed)
    nuclei

137
Vocalic Duration and Vowel Height
  • The spatial patterning of vocalic segments is
    systematic with respect to duration

138
Vocalic Duration and Vowel Height
  • The spatial patterning of vocalic segments is
    systematic with respect to duration
  • Low vowels, be they diphthongs or monophthongs,
    are longer (on average) than high vowels

139
Vocalic Duration and Vowel Height
  • The spatial patterning of vocalic segments is
    systematic with respect to duration
  • Low vowels, be they diphthongs or monophthongs,
    are longer (on average) than high vowels

All nuclei
140
Vocalic Duration and Vowel Height
  • The spatial patterning of vocalic segments is
    systematic with respect to duration
  • Low vowels, be they diphthongs or monophthongs,
    are longer (on average) than high vowels
  • Thus, duration appears to be highly correlated
    with vowel height

All nuclei
141
Vocalic Duration and Vowel Height
  • The spatial patterning of vocalic segments is
    systematic with respect to duration
  • Low vowels, be they diphthongs or monophthongs,
    are longer (on average) than high vowels
  • Thus, duration appears to be highly correlated
    with vowel height
  • But the situation is a little more complicated
    than first appearances would suggest

All nuclei
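
A minimal sketch of how the association between vowel height and duration could be quantified with a Pearson correlation. The height coding (0 = high, 1 = mid, 2 = low) and the durations are hypothetical, chosen only to show the calculation.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-vowel values: lowness code and mean duration in ms.
lowness = [0, 0, 1, 1, 2, 2]
duration_ms = [60.0, 70.0, 90.0, 100.0, 130.0, 140.0]

r = pearson(lowness, duration_ms)
print(f"correlation between lowness and duration: {r:.2f}")
```

A single coefficient of this kind pools accented and unaccented nuclei, so it cannot separate the effect of vowel height from the effect of accent.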
142
Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic nuclei
Canonical Syllable Forms
143
Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic
nuclei. Moreover, diphthongs and tense, low
monophthongs tend to exhibit a larger dynamic
range than the lax monophthongs
Canonical Syllable Forms
144
Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration
between accented and unaccented vocalic
nuclei. Moreover, diphthongs and tense, low
monophthongs tend to exhibit a larger dynamic
range than the lax monophthongs
Lax monophthongs
Canonical Syllable Forms
145
Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed
146
Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed. The low vowels, be they monophthongs
or diphthongs, are rarely unstressed
147
Vocalic Identity Among Unstressed Nuclei
The high, lax monophthongs are almost always
unstressed. The low vowels, be they monophthongs
or diphthongs, are rarely unstressed. The high
diphthongs and high/mid, tense monophthongs
occupy an intermediate position
148
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed
149
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed. The low
vowels, be they monophthongs or diphthongs, are
far more likely to be fully stressed
150
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed. The low
vowels, be they monophthongs or diphthongs, are
far more likely to be fully stressed. An
intermediate degree of stress accounts for the
other vocalic instances
151
Vocalic Identity Among Fully Stressed Nuclei
The high vowels are rarely fully stressed. The low
vowels, be they monophthongs or diphthongs, are
far more likely to be fully stressed. An
intermediate degree of stress accounts for the
other vocalic instances (but will not be
addressed here)
152
Vocalic Variation Importance of Stress Accent
  • The vowels of heavily accented syllables are
    (mostly) pronounced canonically

Canonical Pronunciations
Non-Canonical Pronunciations
153
Vocalic Variation Importance of Stress Accent
  • The vowels of heavily accented syllables are
    (mostly) pronounced canonically
  • Low vowels are largely the province of accented
    syllables

Canonical Pronunciations
Non-Canonical Pronunciations
154
Vocalic Variation Importance of Stress Accent
  • The vowels of heavily accented syllables are
    (mostly) pronounced canonically
  • Low vowels are largely the province of accented
    syllables, and
  • High vowels the province of unaccented syllables

Canonical Pronunciations
Non-Canonical Pronunciations
155
Vocalic Variation Importance of Stress Accent
  • The vowels of heavily accented syllables are
    (mostly) pronounced canonically
  • Low vowels are largely the province of accented
    syllables, and
  • High vowels the province of unaccented syllables
  • Moreover, there's a lexical bias towards high
    vowels for unaccented forms

Canonical Pronunciations
Non-Canonical Pronunciations
156
Vocalic Variation Importance of Stress Accent
  • The vowels of heavily accented syllables are
    (mostly) pronounced canonically
  • Low vowels are largely the province of accented
    syllables, and
  • High vowels the province of unaccented syllables
  • Moreover, there's a lexical bias towards high
    vowels for unaccented forms
  • That's reinforced in patterns of deviation from
    canonical pronunciation

Canonical Pronunciations
Non-Canonical Pronunciations
157
Vocalic Height Deviation from Canonical
  • Vowels are more likely to RISE in height than to
    descend when unaccented

Amount of Change
Direction of Change
158
Vocalic Height Deviation from Canonical
  • Vowels are more likely to RISE in height than to
    descend when unaccented
  • Vocalic lowering of height is rare

Amount of Change
Direction of Change
159
Vocalic Height Deviation from Canonical
  • Vowels are more likely to RISE in height than to
    descend when unaccented
  • Vocalic lowering of height is rare
  • Most deviations from the canonical maintain vowel
    height

Amount of Change
Direction of Change
160
Vocalic Height Deviation from Canonical
  • Vowels are more likely to RISE in height than to
    descend when unaccented
  • Vocalic lowering of height is rare
  • Most deviations from the canonical maintain vowel
    height
  • More than a single height step deviation is
    uncommon

Amount of Change
Direction of Change
161
Vocalic Height Deviation from Canonical
  • Vowels are more likely to RISE in height than to
    descend when unaccented
  • Vocalic lowering of height is rare
  • Most deviations from the canonical maintain vowel
    height
  • More than a single height step deviation is
    uncommon
  • Virtually all 2-step height deviations occur in
    unaccented syllables

Amount of Change
Direction of Change
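
A minimal sketch of how direction and amount of height change could be tallied from pairs of canonical and realized vowels. The three-level height scale and the assignment of ARPABET-style symbols to levels are assumptions made for illustration, as are the sample pairs.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed height scale: 3 = high, 2 = mid, 1 = low (illustrative mapping).
HEIGHT = {"iy": 3, "ih": 3, "uw": 3, "uh": 3,
          "ey": 2, "eh": 2, "ah": 2, "ow": 2,
          "ae": 1, "aa": 1, "ay": 1, "aw": 1}


def height_change(canonical: str, realized: str) -> int:
    """Signed number of height steps; positive means the vowel was raised."""
    return HEIGHT[realized] - HEIGHT[canonical]


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count deviations by (direction, number of steps)."""
    counts: Counter = Counter()
    for canonical, realized in pairs:
        delta = height_change(canonical, realized)
        if delta > 0:
            direction = "raised"
        elif delta < 0:
            direction = "lowered"
        else:
            direction = "same height"
        counts[(direction, abs(delta))] += 1
    return counts


# Hypothetical (canonical, realized) pairs, not corpus data.
pairs = [("eh", "ih"), ("ae", "ih"), ("iy", "ih"), ("aa", "ah"), ("ih", "eh")]
result = tally(pairs)
for (direction, steps), n in sorted(result.items()):
    print(f"{direction:12s} {steps} step(s): {n}")
```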
162
The Vowel Space Under (Full) Stress (Accent)
  • In fully accented nuclei there is a relatively even
    distribution of segments across the vowel space,
    with a slight bias towards the front and central
    vowels

Canonical Vowels Only
163
The Vowel Space Without (Stress) Accent
  • In unaccented syllables vowels are confined
    largely to the high-front and high-central
    sectors of the articulatory space

Canonical Vowels Only
164
The Vowel Space Without (Stress) Accent
  • In unaccented syllables vowels are confined
    largely to the high-front and high-central
    sectors of the articulatory space
  • The low and mid vowels get creamed

Canonical Vowels Only
165
The Vowel Spaces Compared
  • Stress accent exerts a profound effect on the
    character of the vowel space

Heavily Accented
Unaccented
Canonical Vowels Only
166
The Vowel Spaces Compared
  • Stress accent exerts a profound effect on the
    character of the vowel space
  • High vowels are largely associated with
    unaccented syllables

Heavily Accented
Unaccented
Canonical Vowels Only
167
The Vowel Spaces Compared
  • Stress accent exerts a profound effect on the
    character of the vowel space
  • High vowels are largely associated with
    unaccented syllables
  • Low vowels are mostly associated with accented
    forms

Heavily Accented
Unaccented
Canonical Vowels Only
168
The Vowel Spaces Compared
  • Stress accent exerts a profound effect on the
    character of the vowel space
  • High vowels are largely associated with
    unaccented syllables
  • Low vowels are mostly associated with accented
    forms
  • This distinction between accented and unaccented
    syllables is of profound importance for
    understanding (and modeling) pronunciation
    variation

Heavily Accented
Unaccented
Canonical Vowels Only
169
Is It Stress? Vocalic Identity? Or What?
  • Duration appears to play an important (but
    certainly not exclusive) role in stress accent
    for spontaneous American English discourse

170
Is It Stress? Vocalic Identity? Or What?
  • Duration appears to play an important (but
    certainly not exclusive) role in stress accent
    for spontaneous American English discourse
  • For any given vocalic class, stressed segments
    are longer (on average)

171
Is It Stress? Vocalic Identity? Or What?
  • Duration appears to play an important (but
    certainly not exclusive) role in stress accent
    for spontaneous American English discourse
  • For any given vocalic class, stressed segments
    are longer (on average)
  • The durational disparity is most pronounced among
    the low vowels and the diphthongs

172
Is It Stress? Vocalic Identity? Or What?
  • Duration appears to play an important (but
    certainly not exclusive) role in stress accent
    for spontaneous American English discourse
  • For any given vocalic class, stressed segments
    are longer (on average)
  • The durational disparity is most pronounced among
    the low vowels and the diphthongs
  • Low vowels tend to be much longer in duration
    than high vowels

173
Is It Stress? Vocalic Identity? Or What?
  • Duration appears to play an important (but
    certainly not exclusive) role in stress accent
    for spontaneous American English discourse
  • For any given vocalic class, stressed segments
    are longer (on average)
  • The durational disparity is most pronounced among
    the low vowels and the diphthongs
  • Low vowels tend to be much longer in duration
    than high vowels
  • This is the case even for diphthongs

174
Is It Stress? Vocalic Identity? Or What?
  • Duration appears to play an importan