Transcript and Presenter's Notes

Title: Understanding Spoken Language using Statistical and Computational Methods


1
Understanding Spoken Language using Statistical and Computational Methods
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/steveng (contains electronic versions of papers and links to data)
Patterns of Speech Sounds in Unscripted Communication - Production, Perception, Phonology
Akademie Sankelmark, October 8-11, 2000
2
OR .
3
How I Learned to Stop Worrying and Use The
Canonical Form
4
Disclaimer: I am a Phonetician - NOT! (many thanks for the invite)
5
No Scientist is an Island
IMPORTANT COLLEAGUES
PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH (SWITCHBOARD): Candace Cardinal, Rachel Coulston, Dan Ellis, Eric Fosler, Joy Hollenback, John Ohala, Colleen Richey
STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION: Eric Fosler, Leah Hitchcock, Joy Hollenback
ARTICULATORY-ACOUSTIC BASIS OF CONSONANT RECOGNITION: Leah Hitchcock, Rosaria Silipo
AUTOMATIC PHONETIC TRANSCRIPTION OF SPONTANEOUS SPEECH: Shawn Chang, Lokendra Shastri
6
Germane Publications
STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.

AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.

http://www.icsi.berkeley.edu/steveng
7
Prologue
8
Language - The Traditional Perspective
The classical view of spoken language posits a
quasi-arbitrary relation between the lower and
higher tiers of linguistic organization
Phonetic orthography
9
Language - A Syllable-Centric Perspective
A more empirical perspective of spoken language focuses on the syllable as the interface between sound and meaning
Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic
10-21
Take Home Messages
  • SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
  • Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)
  • THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
  • Onsets are pronounced in canonical (i.e., dictionary) fashion 80-90% of the time
  • Nuclei and codas are expressed canonically only 60% of the time
  • Nuclei tend to be realized as vowels different from the canonical form
  • Codas are often deleted entirely
  • Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
  • Therefore, it is important to model spoken language at the syllabic level
  • THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
  • It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material

22
Take Home Messages
  • PHONETIC PROPERTIES OF SPONTANEOUS SPEECH REFLECT INFORMATION CONTENT

Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.

23-32
Road Map
  • PHONETIC TRANSCRIPTION OF SPONTANEOUS AMERICAN ENGLISH
  • Provides the basis for the statistical analyses of spontaneous material
  • A BRIEF TOUR OF PRONUNCIATION VARIATION FROM THE PERSPECTIVE OF
  • Phonetic segments
  • Words
  • Syllables
  • Articulatory-acoustic features
  • PERCEPTUAL EVIDENCE
  • The articulatory-acoustic basis of consonant recognition
  • Not all articulatory-acoustic features are created equal - place-of-articulation cues appear to be most important for consonant recognition
  • COMPUTATIONAL METHODS
  • Automatic methods for phonetic transcription based on articulatory-acoustic features
  • This is the most likely means of generating sufficient empirical data with which to rigorously test hypotheses germane to spoken language

33
Phonetic Transcription of Spontaneous (American)
English
34
Phonetic Transcription of Spontaneous English
  • TELEPHONE DIALOGUES OF 5-10 MINUTES DURATION - SWITCHBOARD
  • AMOUNT OF MATERIAL MANUALLY TRANSCRIBED
  • 3 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods)
  • 1 hour labeled and segmented at the phonetic-segment level
  • DIVERSITY OF MATERIAL TRANSCRIBED
  • Spans speech of both genders (ca. 50/50), reflecting a wide range of American dialectal variation (6 regions plus "army brat"), speaking rate and voice quality
  • TRANSCRIBED BY WHOM?
  • 7 undergraduates and 1 graduate student, all enrolled at UC-Berkeley. Most of the corpus was transcribed by three individuals out of the original eight
  • Supervised by Steven Greenberg and John Ohala
  • TRANSCRIPTION SYSTEM
  • A variant of Arpabet, with phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd (see the parsing sketch below)
  • HOW LONG DOES TRANSCRIPTION TAKE? (Don't Ask!)
  • 388 times real time for labeling and segmentation at the phonetic-segment level
  • 150 times real time for labeling phonetic segments and segmenting syllables
  • HOW WAS LABELING AND SEGMENTATION PERFORMED?
  • Using a display of the signal waveform, spectrogram, word transcription and forced alignments (estimates of phone labels and boundaries), plus audio (listening at multiple time scales - phone, word, utterance), on Sun workstations
  • DATA AVAILABLE AT - http://www.icsi.berkeley.edu/real/stp
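The underscore-marked diacritics can be separated from the base phone with a few lines of string processing. A minimal parsing sketch, assuming labels take the form of a base Arpabet phone followed by underscore-prefixed diacritics (e.g. a hypothetical "ae_n" for a nasalized vowel); the exact label syntax of the released transcripts may differ:

```python
# Sketch only: split a transcription label into base phone + diacritics,
# assuming diacritics are appended with underscores (label syntax assumed).
DIACRITICS = {
    "gl": "glottalized", "cr": "creaky", "fr": "fricated",
    "n": "nasalized", "vl": "devoiced", "vd": "voiced",
}

def parse_label(label):
    base, *marks = label.lower().split("_")
    return base, [DIACRITICS.get(m, m) for m in marks]

print(parse_label("ae_n"))   # ('ae', ['nasalized'])
print(parse_label("t_gl"))   # ('t', ['glottalized'])
```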

35
A Brief Tour of Pronunciation Variation in Spontaneous American English
36
Cumulative Word Frequency in English
Focus on the 100 most common words
The 10 most common words account for 27% of the corpus
The 100 most common words account for 67% of the corpus
The 1000 most common words account for 92% of the corpus
Thus, most informal dialogues are composed of a relatively small number of common words. However, it is the infrequent words that typically provide the precision and detail required for complex information transfer
Computed from the Switchboard corpus (American English telephone dialogues)
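These coverage figures can be reproduced from any word-level transcript by ranking word types by frequency and accumulating their token counts. A minimal sketch with toy data (not the original analysis scripts):

```python
from collections import Counter

def cumulative_coverage(tokens, ranks=(10, 100, 1000)):
    """Fraction of all word tokens covered by the N most frequent word types."""
    counts = Counter(tokens)                          # word-type frequencies
    total = sum(counts.values())                      # number of word tokens
    sorted_counts = sorted(counts.values(), reverse=True)
    coverage, running = {}, 0
    for rank, c in enumerate(sorted_counts, start=1):
        running += c
        if rank in ranks:
            coverage[rank] = running / total
    return coverage

# Toy usage; the real figures require the full Switchboard word transcripts.
tokens = "i i you know it it was like you know the the a and and i".split()
print(cumulative_coverage(tokens, ranks=(3, 5)))
```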
37-38
How Many Pronunciations of And?
39-43
How Many Different Pronunciations?
44
English is (sort of) like Chinese ...
95% of the words contain just ONE or TWO syllables ...

81% of the word tokens are monosyllabic
Of the 100 most common words, 90% are one syllable in length
Only 22% of the words in the lexicon are one syllable long
Hence, there is a decided preference for monosyllabic words in informal discourse
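The token/type contrast on this slide (81% of word tokens versus 22% of lexicon entries) follows directly from how the percentages are computed. A small sketch of the two computations, assuming a hypothetical syllabified lexicon; the output depends entirely on the data supplied:

```python
from collections import Counter

def monosyllable_stats(tokens, syllable_counts):
    """Percent of corpus word tokens, and of lexicon entries, that are monosyllabic."""
    token_counts = Counter(tokens)
    n_tokens = sum(token_counts.values())
    mono_tokens = sum(c for w, c in token_counts.items()
                      if syllable_counts.get(w) == 1)
    mono_lexicon = sum(1 for n in syllable_counts.values() if n == 1)
    return (100.0 * mono_tokens / n_tokens,
            100.0 * mono_lexicon / len(syllable_counts))

# Toy data: common words are short, while the lexicon also holds longer, rarer words.
syllable_counts = {"i": 1, "think": 1, "so": 1, "really": 2, "basically": 4}
tokens = ["i", "really", "think", "so", "i", "think", "so"]
print(monosyllable_stats(tokens, syllable_counts))
```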
45
Syllable and Word Frequencies are Similar
Words and syllables exhibit similar distributions over the 300 most common elements, accounting for 80% of the corpus
The similarity of their distributions is a consequence of most words consisting of just a single syllable
46
Word Frequency in Spontaneous English
Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10
Word frequency is logarithmically related to rank order in the corpus (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)
Computed from the Switchboard corpus (American English telephone dialogues)
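One way to check the approximate 1/f behaviour is to fit a straight line to log frequency versus log rank; a slope near -1 corresponds to the roughly tenfold frequency drop between ranks 10 and 100 noted above. A sketch with synthetic data (not the Switchboard counts):

```python
import numpy as np
from collections import Counter

def zipf_exponent(tokens, min_rank=10):
    """Least-squares slope of log(frequency) vs. log(rank); ~ -1 for a 1/f law."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    keep = ranks >= min_rank                 # the 1/f fit is best beyond rank ~10
    slope, _ = np.polyfit(np.log(ranks[keep]), np.log(freqs[keep]), 1)
    return slope

# Synthetic check: frequencies drawn from an exact 1/rank law recover a slope near -1.
tokens = [f"w{r}" for r in range(1, 2001) for _ in range(int(round(2000.0 / r)))]
print(zipf_exponent(tokens))
```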
47
Information Affects Pronunciation
The faster the speaking rate, the more likely the pronunciation is to deviate from the canonical form
However, the effect is much more pronounced for the 100 most common words than for more infrequent words
From Fosler, Greenberg and Morgan (1999); Greenberg and Fosler (2000)
48
English Syllable Structure is (sort of) Like Japanese
Most syllables are simple in form (no consonant clusters)
87% of the pronunciations are simple syllabic forms
84% of the canonical corpus is composed of simple syllabic forms
n = 103,054
49
Complex Syllables are Important, Though
Thus, despite English's reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex
There are many complex syllable forms (consonant clusters), but all occur relatively infrequently
Complex codas are not as frequently realized in actual pronunciation as in their canonical representation
Complex onsets tend to preserve the canonical pronunciation in their actual realization
n = 17,760
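A rough sketch of how syllable tokens might be tallied as simple versus complex, taking ARPAbet-style syllable strings and treating any consonant cluster in the onset or coda as "complex"; the vowel inventory and the handling of syllabic consonants are simplifications, not the study's exact criteria:

```python
# Illustrative ARPAbet vowel set; the transcription system's inventory may differ.
ARPABET_VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
                  "ih", "iy", "ow", "oy", "uh", "uw", "ax", "ix"}

def is_simple(syllable):
    """True if the syllable has at most one consonant before and after the nucleus."""
    phones = syllable.lower().split()
    nucleus = [i for i, p in enumerate(phones) if p in ARPABET_VOWELS]
    if not nucleus:                        # e.g. a syllabic consonant
        return len(phones) <= 1
    onset, coda = phones[:nucleus[0]], phones[nucleus[-1] + 1:]
    return len(onset) <= 1 and len(coda) <= 1

syllables = ["k ae t", "s t r iy t", "ae n d", "dh ax", "t eh k s t"]
n_simple = sum(is_simple(s) for s in syllables)
print(f"{100.0 * n_simple / len(syllables):.0f}% simple syllabic forms")
```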
50
Syllable-Centric Pronunciation
Onsets are pronounced canonically far more often than nuclei or codas
Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues
Example: "cat" = [k ae t], with [k] the onset, [ae] the nucleus and [t] the coda
(Figure: percent canonically pronounced by syllable position, read sentences vs. spontaneous speech; n = 120,814)
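The percent-canonical figures by syllable position can be tallied once each realized syllable has been aligned with its canonical (dictionary) form. A simplified sketch that assumes this alignment has already been done and that each syllable is represented as onset/nucleus/coda strings:

```python
from collections import defaultdict

def canonical_rates(aligned_syllables):
    """Percent of onsets/nuclei/codas whose realization matches the canonical form.

    aligned_syllables: iterable of (canonical, realized) pairs, each a dict with
    'onset', 'nucleus' and 'coda' strings ('' for an empty position).
    """
    match, total = defaultdict(int), defaultdict(int)
    for canon, real in aligned_syllables:
        for pos in ("onset", "nucleus", "coda"):
            if canon[pos]:                  # score only positions present canonically
                total[pos] += 1
                match[pos] += (canon[pos] == real[pos])
    return {pos: 100.0 * match[pos] / total[pos] for pos in total}

# Toy example: "cat" realized once canonically, once with a reduced nucleus and a deleted coda.
pairs = [
    ({"onset": "k", "nucleus": "ae", "coda": "t"},
     {"onset": "k", "nucleus": "ae", "coda": "t"}),
    ({"onset": "k", "nucleus": "ae", "coda": "t"},
     {"onset": "k", "nucleus": "ax", "coda": ""}),
]
print(canonical_rates(pairs))   # onset 100%, nucleus 50%, coda 50%
```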
51
Complex Onsets are Highly Canonical
Complex onsets are pronounced more canonically than simple onsets, despite the greater potential for deviation from the standard pronunciation
(Figure: percent canonically pronounced by syllable onset type, read sentences vs. spontaneous speech)
52
Speaking Style Affects Codas
Codas are much more likely to be realized canonically in formal than in spontaneous speech
(Figure: percent canonically pronounced by syllable coda type)
53
Onsets (but not Codas) Affect Nuclei
The presence of a syllable onset has a
substantial impact on the realization of the
nucleus
(Figure: percent canonically pronounced)
54
Syllable-Centric Feature Analysis
  • Place of articulation deviates most in nucleus
    position
  • Manner of articulation deviates most in onset and
    coda position
  • Voicing deviates most in coda position

Phonetic deviation along a SINGLE feature
Place is VERY unstable in nucleus position
Place deviates very little from canonical form in the onset and coda. It is a STABLE articulatory feature (AF) in these positions
55
Articulatory PLACE Feature Analysis
  • Place of articulation is a dominant feature in
    nucleus position only
  • Drives the feature deviation in the nucleus for
    manner and rounding

Phonetic deviation across SEVERAL features
Place carries manner and rounding in the nucleus
56
Articulatory MANNER Feature Analysis
  • Manner of articulation is a dominant feature in
    onset and coda position
  • Drives the feature deviation in onsets and codas
    for place and voicing

Phonetic deviation across SEVERAL features
Manner drives place and voicing deviations in
the onset and coda
Manner is less stable in the coda than in the
onset
57
Articulatory VOICING Feature Analysis
  • Voicing is a subordinate feature in all syllable
    positions
  • Its deviation pattern is controlled by manner in
    onset and coda positions

Phonetic deviation across SEVERAL features
Voicing is unstable in coda position and is
dominated by manner
58
LIP-ROUNDING Feature Analysis
  • Lip-rounding is a subordinate feature
  • Its deviation pattern is driven by the place
    feature in nucleus position

Phonetic deviation across SEVERAL features
Rounding is stable everywhere except in the
nucleus where its deviation pattern is driven by
place
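The feature-level analyses on slides 54-58 rest on mapping each canonical and realized phone onto articulatory-acoustic features and counting, per syllable position, how often a given feature changes. A simplified sketch; the phone-to-feature table is illustrative rather than the study's inventory, and deleted segments are ignored:

```python
from collections import defaultdict

# Illustrative phone-to-feature table (small subset, simplified values).
FEATURES = {
    "t":  {"place": "alveolar", "manner": "stop",  "voicing": "voiceless"},
    "d":  {"place": "alveolar", "manner": "stop",  "voicing": "voiced"},
    "dx": {"place": "alveolar", "manner": "flap",  "voicing": "voiced"},
    "n":  {"place": "alveolar", "manner": "nasal", "voicing": "voiced"},
    "ae": {"place": "front",    "manner": "vowel", "voicing": "voiced"},
    "ax": {"place": "central",  "manner": "vowel", "voicing": "voiced"},
}

def feature_deviation(aligned, feature):
    """Percent of segments whose realized value of `feature` differs from the
    canonical value, broken down by syllable position. `aligned` holds
    (position, canonical_phone, realized_phone) triples."""
    diff, total = defaultdict(int), defaultdict(int)
    for pos, canon, real in aligned:
        if canon in FEATURES and real in FEATURES:
            total[pos] += 1
            diff[pos] += (FEATURES[canon][feature] != FEATURES[real][feature])
    return {pos: 100.0 * diff[pos] / total[pos] for pos in total}

aligned = [("onset", "t", "t"), ("nucleus", "ae", "ax"),
           ("coda", "t", "dx"), ("coda", "d", "t")]
print(feature_deviation(aligned, "place"))    # deviation concentrated in the nucleus
print(feature_deviation(aligned, "voicing"))  # deviation concentrated in the coda
```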
59
Perceptual Evidence for the Importance of Place
(and Manner) of Articulation Features
60
Spectral Slit Paradigm
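The "slit" stimuli of this paradigm are narrow band-pass filtered versions of the speech signal, presented singly or in combination. A rough sketch of how such stimuli could be generated; the center frequencies, bandwidth and filter order below are placeholders, not the experiment's actual parameters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def spectral_slits(signal, sr, centers_hz, bandwidth_octaves=0.33, order=4):
    """Sum of narrow band-pass filtered copies of `signal`, one per slit."""
    out = np.zeros_like(signal, dtype=float)
    for fc in centers_hz:
        half = 2 ** (bandwidth_octaves / 2.0)
        lo, hi = fc / half, min(fc * half, 0.45 * sr)
        sos = butter(order, [lo, hi], btype="bandpass", fs=sr, output="sos")
        out += sosfiltfilt(sos, signal)
    return out

# Toy usage with white noise standing in for a speech waveform.
sr = 16000
x = np.random.randn(sr)                                   # 1 s of noise
y = spectral_slits(x, sr, centers_hz=[330, 1100, 3700])   # a 3-slit stimulus
```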
61
Consonant Recognition - Single Slits
62
Consonant Recognition - 1 Slit
63
Consonant Recognition - 2 Slits
64
Consonant Recognition - 3 Slits
65
Consonant Recognition - 4 Slits
66
Consonant Recognition - 5 Slits
67-72
Consonant Recognition - 2 Slits
73-75
Consonant Recognition - 3 Slits
76
Consonant Recognition - 4 Slits
77
Consonant Recognition - 5 Slits
78
Correlation - AFs/Consonant Recognition
Consonant recognition is almost perfectly correlated with place-of-articulation performance
This correlation suggests that the place feature is based on cues distributed across the entire speech bandwidth, in contrast to other features
Manner is also highly correlated with consonant recognition; voicing and rounding less so
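The correlation itself is a straightforward computation: correlate the per-condition articulatory-feature classification scores with the per-condition consonant-recognition scores. A sketch with placeholder numbers (not the reported listener data):

```python
import numpy as np

# Hypothetical per-condition scores (one value per slit condition); the real
# analysis uses the measured listener data, not these placeholders.
consonant_recognition = np.array([22.0, 41.0, 63.0, 78.0, 88.0])
place_accuracy        = np.array([25.0, 44.0, 66.0, 80.0, 90.0])
voicing_accuracy      = np.array([70.0, 74.0, 80.0, 82.0, 85.0])

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

print("place vs. consonants  :", round(pearson_r(place_accuracy, consonant_recognition), 3))
print("voicing vs. consonants:", round(pearson_r(voicing_accuracy, consonant_recognition), 3))
```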
79
Automatic Phonetic Transcription of Spontaneous
Speech
80-87
Automatic Phonetic Transcription
  • MAINSTREAM ASR SYSTEMS USE MOSTLY AUTOMATIC ALIGNMENT DATA TO TRAIN NEW SYSTEMS
  • These materials are highly inaccurate: 35-50% incorrect labeling of phonetic segments, and segment boundaries off by an average of 32 ms (40% of the mean phone duration)
  • IT IS BOTH TIME-CONSUMING AND EXPENSIVE TO MANUALLY LABEL AND SEGMENT PHONETIC MATERIAL FOR SPONTANEOUS CORPORA
  • Manual labeling and segmentation typically requires 150-400 times real time to perform
  • WE HAVE DEVELOPED AN AUTOMATIC LABELING OF PHONETIC SEGMENTS (ALPS) SYSTEM TO PROVIDE TRAINING MATERIALS FOR ASR AND TO ENABLE RAPID DEPLOYMENT TO NEW CORPORA AND FOREIGN LANGUAGE MATERIAL
  • Such material will be extremely useful for developing pronunciation models and new algorithms for ASR
  • THE ALPS SYSTEM CURRENTLY LABELS SPONTANEOUS MATERIAL (OGI Numbers Corpus) WITH ca. 83% ACCURACY
  • The algorithms used are capable of achieving ca. 93% accuracy with only minor changes to the models

88
Phonetic Feature Classification System
89
Spectro-Temporal Profile (STeP)
  • STePs provide a simple, accurate means of delineating the acoustic properties associated with phonetic features and segments

(Figure: STeP of a vocalic segment)
90
Spectro-temporal Profile (STeP)
  • STePs incorporate information about the instantaneous modulation spectrum distributed across the (tonotopic) frequency axis and can be used for training neural networks

(Figure: STeP of a fricative segment)
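A rough sketch of one way to compute a profile of this general kind: band-pass the signal into frequency channels, extract each channel's amplitude envelope, and take the low-frequency (modulation) spectrum of each envelope. The channel spacing, frame rate and modulation range below are placeholders, not the actual STeP parameters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_profile(signal, sr, band_edges_hz, mod_max_hz=32, frame_sr=100):
    """Per-channel modulation spectra: |FFT| of each band's energy envelope."""
    profiles = []
    for lo, hi in band_edges_hz:
        sos = butter(4, [lo, min(hi, 0.45 * sr)], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, signal)
        envelope = np.abs(hilbert(band))                  # amplitude envelope
        step = int(sr / frame_sr)                         # downsample to ~frame_sr Hz
        env = envelope[::step] - np.mean(envelope[::step])
        spectrum = np.abs(np.fft.rfft(env))
        freqs = np.fft.rfftfreq(len(env), d=1.0 / frame_sr)
        profiles.append(spectrum[freqs <= mod_max_hz])    # keep low modulation rates
    return np.array(profiles)                             # shape: (channels, mod bins)

# Toy usage: 1 s of noise through four octave-wide channels.
sr = 16000
x = np.random.randn(sr)
bands = [(250, 500), (500, 1000), (1000, 2000), (2000, 4000)]
step_features = modulation_profile(x, sr, bands)
```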
91
Label Accuracy per Frame
  • Frames away from the boundary are labeled very
    accurately

92
Sample Transcription Output
  • The automatic system performs very similarly to manual transcription in terms of both labels and segmentation
  • 11 ms average concordance in segment boundaries
  • 83% concordance with respect to phonetic labels (see the sketch below)
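A simplified sketch of how the two concordance figures might be computed, assuming the automatic and manual transcriptions are already aligned segment-for-segment with millisecond time stamps; the real evaluation must also handle insertions and deletions:

```python
def concordance(auto_segments, manual_segments):
    """Percent of matching labels and mean absolute boundary offset (ms) for
    one-to-one aligned (label, start_ms, end_ms) segment lists."""
    assert len(auto_segments) == len(manual_segments)
    label_matches, offsets = 0, []
    for (a_lab, a_on, a_off), (m_lab, m_on, m_off) in zip(auto_segments, manual_segments):
        label_matches += (a_lab == m_lab)
        offsets.extend([abs(a_on - m_on), abs(a_off - m_off)])
    return (100.0 * label_matches / len(auto_segments),
            sum(offsets) / len(offsets))

auto   = [("dh", 0, 60), ("ax", 60, 130), ("k", 130, 210), ("ae", 210, 330), ("dx", 330, 370)]
manual = [("dh", 0, 55), ("ax", 55, 125), ("k", 125, 205), ("ae", 205, 340), ("t", 340, 375)]
print(concordance(auto, manual))   # (percent matching labels, mean offset in ms)
```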

93
In Conclusion .
94
Grand Summary
  • SPONTANEOUS SPEECH DIFFERS IN CERTAIN SYSTEMATIC WAYS FROM CANONICAL DESCRIPTIONS OF LANGUAGE AND MORE FORMAL SPEAKING STYLES
  • Such insights can only be obtained at present with large amounts of phonetically labeled material (in this instance, four hours of telephone dialogues recorded in the United States - the Switchboard corpus)
  • Automatic methods will eventually supply badly needed data for more complete analyses and evaluation
  • THE PHONETIC PROPERTIES OF SPOKEN LANGUAGE APPEAR TO BE ORGANIZED AT THE SYLLABIC RATHER THAN AT THE PHONE LEVEL
  • Onsets are pronounced in canonical (i.e., dictionary) fashion 85-90% of the time
  • Nuclei and codas are expressed canonically only 60% of the time
  • Nuclei tend to be realized as vowels different from the canonical form
  • Codas are often deleted entirely
  • Articulatory-acoustic features are also organized in systematic fashion with respect to syllabic position
  • Therefore, it is important to model spoken language at the syllabic level
  • THE SYLLABIC ORGANIZATION OF SPONTANEOUS SPEECH CALLS INTO QUESTION SEGMENTALLY BASED PHONETIC ORTHOGRAPHY
  • It may be unrealistic to assume that any phonetic transcription based exclusively on segments (such as the IPA) is truly capable of capturing the important phonetic detail of spontaneous material

95
That's All, Folks! Many Thanks for Your Time and Attention
96
Temporal View of Language
97
Linguistic Automatic Speech Recognition
  • CHARACTERIZE SPOKEN LANGUAGE WITH GREAT PRECISION
  • Currently, manual transcription is the only means
    by which to collect detailed data pertaining to
    spoken language. Computational methods are
    currently being developed to perform
    transcription automatically in order to provide
    an abundance of data for statistical
    characterization of spontaneous discourse.
  • USE THIS KNOWLEDGE TO DEVELOP COMPUTATIONAL
    TECHNIQUES TAILORED TO THE PROPERTIES OF THE
    SPEECH DOMAIN
  • A detailed knowledge of spoken language is
    essential for deriving a computational framework
    for ASR. The phonetic properties of speech are
    structured in different ways depending on the
    location within the syllable, word and phrase.
    Such knowledge is currently under-utilized by
    mainstream ASR.
  • FOCUS ON LOWER TIERS OF SPOKEN LANGUAGE FOR THE
    PRESENT
  • It is fashionable to emphasize the importance of
    language models (i.e., word co-occurrence
    properties) in ASR. However, most of the problems
    lie in the acoustic-phonetic front end and
    therefore this domain should be attacked first.
  • USE KNOWLEDGE OF HOW HUMAN LISTENERS UNDERSTAND
    SPOKEN LANGUAGE TO GUIDE DEVELOPMENT OF ASR
    ALGORITHMS
  • Current ASR acoustic models are not based on
    perceptual capabilities of human listeners, but
    on a distorted representation of what is
    important in hearing. It is important to perform
    intelligibility experiments to ascertain the
    identity of the truly important components of the
    speech signal and use this knowledge to develop
    robust, acoustic-front-end models for ASR.

98
Linguistic ASR Research @ ICSI
  • PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
  • Human listening experiments identifying the
    specific properties crucial for understanding
    spoken language
  • MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH
    RECOGNITION
  • Using auditory-based algorithms (linked to the
    syllable) for reliable ASR in background noise
    and reverberation
  • SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION
  • Development of a syllable-based decoder for ASR
  • STATISTICAL PROPERTIES OF SPONTANEOUS SPEECH
  • Detailed and comprehensive statistical analyses
    of the Switchboard corpus pertaining to phonetic,
    prosodic and lexical properties, used for
    developing pronunciation models (among other
    things)
  • AUTOMATIC PHONETIC LABELING AND SEGMENTATION
  • Development of (the first) automatic phonetic transcription system using articulatory-acoustic features (e.g., voicing, manner, place, etc.)
  • AUTOMATIC LABELING OF PROSODIC STRESS
  • Development of (the first) automatic system for
    labeling prosodic stress in English
  • AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC
    EVALUATION
  • Detailed and comprehensive analyses of
    Switchboard-corpus ASR systems in order to
    identify factors associated with word error

99
Linguistic ASR at ICSI
  • SENIOR PERSONNEL
  • Steven Greenberg - Linguistic ASR, Spoken
    Language Statistics, Speech Perception
  • Lokendra Shastri - Neural Network Design,
    Higher-level Language Neural Processing
  • GRADUATE STUDENTS
  • Shawn Chang - ANN-based ASR, Automatic Phonetic Transcription and Segmentation
  • Michael Shire - Temporal Multi-Stream
    Approaches to Automatic Speech Recognition
  • Mirjam Wester - Pronunciation Modeling in
    Automatic Speech Recognition
  • UNDERGRADUATE STUDENTS
  • Micah Farrer - Database Development for ASR
    Analysis
  • Leah Hitchcock - Statistics of Pronunciation and
    Prosody of Spoken Language
  • TECHNICAL STAFF
  • Joy Hollenback - Statistical Analyses, Data
    Collection and Maintenance
  • ASSOCIATES AT ICSI
  • Hynek Hermansky, Nelson Morgan, Liz Shriberg and
    Andreas Stolcke
  • ASSOCIATES AT LOCATIONS OTHER THAN ICSI
  • Takayuki Arai (Sophia University, Tokyo) - Speech
    Perception, Signal Processing
  • Les Atlas (University of Washington, Seattle) -
    Acoustic Signal Processing
  • Ken Grant (Walter Reed Army Medical Center) -
    Audio-visual Speech Processing
  • David Poeppel (University of Maryland) - Brain
    Mechanisms of Language

100
Linguistic ASR at ICSI (continued)
  • FORMER ICSI POST-DOCTORAL FELLOWS
  • Takayuki Arai - Sophia University, Tokyo
  • Dan Ellis - Columbia University (as of September
    1, 2000)
  • Eric Fosler - Bell Laboratories, Lucent
    Technologies
  • Rosaria Silipo - Nuance Communications
  • FORMER ICSI GRADUATE STUDENTS
  • Jeff Bilmes - University of Washington, Seattle
  • Eric Fosler - Bell Laboratories, Lucent
    Technologies
  • Brian Kingsbury - IBM, Yorktown Heights
  • Katrin Kirchhoff - University of Washington,
    Seattle
  • Nikki Mirghafori - Nuance Communications
  • Su-Lin Wu - Nuance Communications
  • FORMER ICSI UNDERGRADUATE STUDENTS
  • Candace Cardinal - Nuance Communications
  • Rachel Coulston - University of California, San
    Diego
  • Colleen Richey - Stanford University

101
Publications - Linguistic ASR
AUTOMATIC SPEECH RECOGNITION DIAGNOSTIC EVALUATION
Greenberg, S., Chang, S. and Hollenback, J. (2000) An introduction to the diagnostic evaluation of the Switchboard-corpus automatic speech recognition systems. Proceedings of the NIST Speech Transcription Workshop, College Park.
Greenberg, S. and Chang, S. (2000) Linguistic dissection of Switchboard-corpus automatic recognition systems. Proceedings of the ICSI Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris.

AUTOMATIC PHONETIC TRANSCRIPTION AND SEGMENTATION
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proc. Int. Conf. Spoken Lang. Proc., Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable detection and segmentation using temporal flow neural networks. Proceedings of the International Congress of Phonetic Sciences, San Francisco, pp. 1721-1724.

STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the Crest Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proc. Intern. Conf. Spoken Lang. (ICSLP), Philadelphia, pp. S24-27.

AUTOMATIC LABELING OF PROSODIC STRESS IN SPONTANEOUS SPEECH
Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the International Congress of Phonetic Sciences, San Francisco.
Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park.
102
Publications - Linguistic ASR (continued)
MODULATION-SPECTRUM-BASED AUTOMATIC SPEECH RECOGNITION
Greenberg, S. and Kingsbury, B. (1997) The modulation spectrogram: In pursuit of an invariant representation of speech. ICASSP-97, IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, pp. 1647-1650.
Kingsbury, B., Morgan, N. and Greenberg, S. (1999) The modulation-filtered spectrogram: A noise-robust speech representation. Proceedings of the Workshop on Robust Methods for Speech Recognition in Adverse Conditions, Tampere, Finland.
Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Robust speech recognition using the modulation spectrogram. Speech Communication, 25, 117-132.

SYLLABLE-BASED AUTOMATIC SPEECH RECOGNITION
Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Incorporating information from syllable-length time scales into automatic speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 721-724.
Wu, S.-L., Kingsbury, B., Morgan, N. and Greenberg, S. (1998) Performance improvements through combining phone- and syllable-length information in automatic speech recognition. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 854-857.

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY GERMANE TO ASR
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.
103
Syllable Frequency - Spontaneous English
The distribution of syllable frequency in
spontaneous speech differs markedly from that in
dictionaries
104
Word Frequency in Spontaneous English
Word frequency as a function of word rank approximates a 1/f distribution, particularly after rank-order 10
Word frequency is logarithmically related to rank order in the corpus (i.e., the 10th most common word occurs ca. 10 times more frequently than the 100th most common word, etc.)
Computed from the Switchboard corpus (American English telephone dialogues)
105
The Intricate Web of Research