Title: Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
1. Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
- Elizabeth Shriberg, Andreas Stolcke
- Speech Technology and Research Laboratory
- Dilek Hakkani-Tur, Gokhan Tur
- Department of Computer Engineering, Bilkent University
- To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio
- Presenter: Yi-Ting Chen
2. Outline
- Introduction
- Method
- Prosodic modeling
- Language modeling
- Model combination
- Data
- Results and discussion
- Summary and conclusion
3. Introduction (1/2)
- Why process audio data?
- Why automatic segmentation?
- A crucial step toward robust information extraction from speech is the automatic determination of topic, sentence, and phrase boundaries
- Why use prosody?
- In all languages, prosody is used to convey structural, semantic, and functional information
- Prosodic cues are, by their nature, relatively unaffected by word identity
- Unlike spectral features, some prosodic features are largely invariant to changes in channel characteristics
- Prosodic feature extraction can be achieved with minimal additional computational load and no additional training data
4. Introduction (2/2)
- This paper describes the prosodic modeling in detail
- Decision tree and hidden Markov modeling techniques are used to combine prosodic cues with word-based approaches, and performance is evaluated on two speech corpora
- Results are examined both for true words and for words as hypothesized by a speech recognizer
5. Method (1/6): Prosodic modeling
- Feature extraction regions
- For each inter-word boundary, prosodic features were extracted from the word immediately preceding and following the boundary, or alternatively within a window of 20 frames (200 ms) before and after the boundary
- The prosodic features reflect pause durations, phone durations, pitch information, and voice quality information
- Amplitude- and energy-based features were not used, since previous work showed them to be both less reliable than and largely redundant with duration and pitch features
6. Method (2/6): Prosodic modeling
- Features
- The features were designed to be independent of word identities
- An initial set of over 100 features was pared down to a smaller set by eliminating less useful features
- Pause features: important cues to boundaries between semantic units
- The pause model was trained as an individual phone
- When there was no pause at the boundary, the pause duration feature was output as 0
- The duration of the pause preceding the word before the boundary was also used
- Both raw pause durations and durations normalized by the particular speaker's pause duration distribution were investigated (see the sketch below)
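A minimal sketch of what these two pause features could look like, assuming word-level time alignments are available; the function and field names are illustrative, not from the paper:

```python
# Illustrative pause-feature extraction at inter-word boundaries.
# Assumes each word carries start/end times from a forced
# alignment; names here are hypothetical.

def pause_features(words, speaker_mean_pause):
    """words: time-ordered list of (token, start_sec, end_sec)."""
    feats = []
    for prev, nxt in zip(words, words[1:]):
        pause = max(0.0, nxt[1] - prev[2])  # 0 when no pause at boundary
        feats.append({
            "pause_raw": pause,
            # speaker-normalized variant, per the slide's description
            "pause_norm": (pause / speaker_mean_pause
                           if speaker_mean_pause > 0 else 0.0),
        })
    return feats
```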
7. Method (3/6): Prosodic modeling
- Features
- Phone and rhyme duration features capture a slowing down toward the ends of units, or preboundary lengthening
- Preboundary lengthening typically affects the nucleus and coda of syllables
- Duration characteristics of the last rhyme of the syllable preceding the boundary were used
- Each phone in the rhyme was normalized for inherent duration (reconstructed formula below)
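The slide's formula is cut off; a plausible reconstruction, consistent with "normalized for inherent duration," is a per-phone z-score (the exact form is my assumption):

```latex
% Reconstructed normalization: duration of phone p_i scaled by
% that phone's inherent duration statistics (mean \mu_{p_i},
% standard deviation \sigma_{p_i}, estimated over training data).
\[
  \hat{d}(p_i) \;=\; \frac{d(p_i) - \mu_{p_i}}{\sigma_{p_i}}
\]
```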
8. Method (4/6): Prosodic modeling
- Features
- F0 features
- Pitch information is typically less robust and more difficult to model than other prosodic features
- The F0 processing is designed to smooth out microintonation and tracking errors, simplify F0 feature computation, and identify speaking-range parameters for each speaker (illustrative sketch below)
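The paper's actual F0 post-processing is more elaborate than this; the sketch below is only a stand-in showing the general idea of smoothing raw tracker output and estimating per-speaker range parameters:

```python
# Stand-in F0 post-processing: median-filter the voiced frames to
# suppress microintonation and tracker errors, then take robust
# percentiles as crude speaking-range parameters.
import numpy as np
from scipy.signal import medfilt

def smooth_f0(f0, kernel=5):
    """f0: 1-D array of F0 values, with 0 for unvoiced frames."""
    voiced = f0 > 0
    smoothed = f0.copy()
    smoothed[voiced] = medfilt(f0[voiced], kernel_size=kernel)
    f0_lo, f0_hi = np.percentile(f0[voiced], [5, 95])
    return smoothed, f0_lo, f0_hi
```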
9. Method (5/6): Prosodic modeling
- Features
- F0 features
- Reset features
- Range features
- F0 slope features
- F0 continuity features
- Estimated voice quality features
- Other features
- Speaker gender
- Turn boundaries
- Time elapsed from the start of the turn, and the turn count in the conversation
10. Method (6/6): Prosodic modeling
- Decision trees
- Decision trees are probabilistic classifiers (toy illustration below)
- Given a set of features and a labeled training set, the decision tree construction algorithm repeatedly selects the single feature with the highest predictive value
- The leaves of the tree store probabilities about the class distribution of all samples falling into the corresponding region of the feature space
- Decision trees make no assumptions about the shape of feature distributions
- It is not necessary to convert feature values to some standard scale
- A feature selection algorithm was used
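As a concrete illustration of these properties (greedy single-feature splits, class distributions at the leaves, no need to rescale features), here is a minimal scikit-learn stand-in for the paper's CART-style trees, on toy data:

```python
# Toy decision-tree boundary classifier: the leaves store class
# distributions, queried via predict_proba; the features need no
# common scale.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                    # e.g. pause, duration, F0
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # toy boundary labels

tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X, y)
p_boundary = tree.predict_proba(X)[:, 1]          # P(boundary | features)
```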
11. Method (1/3): Language modeling
- The goal is to capture the information about segment boundaries contained in the word sequences
- The joint distribution of boundary types and words is modeled in a hidden Markov model (HMM)
- Boundary classifications are denoted by T and word sequences by W; together they define the structure of the HMM
- The slightly more complex forward-backward algorithm is used to maximize the posterior probability of each individual boundary classification (sketched below)
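A sketch of the forward-backward recursion that yields these per-boundary posteriors; the two states stand for boundary vs. no boundary, and the transition and emission probabilities are placeholders for the N-gram-derived quantities:

```python
# Forward-backward posteriors P(state_t | observations) for a
# two-state chain; trans is 2x2, emit is (T, 2). Sketch only.
import numpy as np

def boundary_posteriors(trans, emit):
    T = emit.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = 0.5 * emit[0]                      # uniform initial state
    for t in range(1, T):                         # forward pass
        alpha[t] = emit[t] * (alpha[t - 1] @ trans)
    for t in range(T - 2, -1, -1):                # backward pass
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```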
12. Method (2/3): Language modeling
- Sentence segmentation
- A hidden-event N-gram language model is used
- The states of the HMM consist of the end-of-sentence status of each word, plus any preceding words and possibly boundary tags to fill up the N-gram context
- Transition probabilities are given by N-gram probabilities estimated from annotated, boundary-tagged training data using Katz backoff
- Example: see the sketch below
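The slide's example is cut off; here is a minimal sketch of how boundary-tagged training text for such a hidden-event N-gram model might be prepared (the <S> token name is my placeholder):

```python
# Make the hidden sentence-boundary event explicit as a token, so
# standard N-gram estimation (Katz backoff in the paper) can learn
# probabilities such as P(<S> | preceding words).
def tag_boundaries(sentences):
    tokens = []
    for sent in sentences:
        tokens.extend(sent.split())
        tokens.append("<S>")        # boundary event made visible
    return " ".join(tokens)

print(tag_boundaries(["that is right", "did you go"]))
# -> that is right <S> did you go <S>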
13. Method (3/3): Language modeling
- Topic segmentation
- First, 100 individual unigram topic-cluster language models were constructed using the multipass k-means algorithm (on TDT data)
- Then an HMM was built in which the states are topic clusters and the observations are sentences (see the sketch below)
- In addition to the basic HMM segmenter, two states were incorporated for modeling the initial and final sentences of a topic segment
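A sketch of the observation model implied here: each HMM state holds one unigram topic-cluster LM, and a sentence observation is scored by its unigram log-likelihood under each cluster (the dictionary layout is my assumption):

```python
# Score a sentence under hypothetical unigram topic-cluster LMs;
# topic_unigrams maps cluster id -> {word: probability}.
import math

def sentence_loglik(sentence, unigram, floor=1e-7):
    return sum(math.log(unigram.get(w, floor)) for w in sentence.split())

def topic_scores(sentence, topic_unigrams):
    return {cid: sentence_loglik(sentence, lm)
            for cid, lm in topic_unigrams.items()}
```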
14. Method (1/3): Model combination
- Prosodic and lexical segmentation cues are expected to be partly complementary
- Two combination schemes: posterior probability interpolation, and integrated hidden Markov modeling
- With suitable independence assumptions, the familiar HMM techniques can be applied to compute the combined posteriors (reconstructed formulas below)
- To incorporate the prosodic information into the HMM, prosodic features are modeled as emissions from the relevant HMM states, with likelihoods P(F_i | T_i)
- A complete path through the HMM is then associated with the total probability given below
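The slide's formulas are cut off; reconstructed from the surrounding description (W = words, F = prosodic features, T = boundary classifications, lambda a tunable weight), the two combination rules are approximately:

```latex
% Posterior probability interpolation at each boundary i:
\[
  P(T_i \mid W, F) \;\approx\; \lambda\, P_{\mathrm{LM}}(T_i \mid W)
      \;+\; (1 - \lambda)\, P_{\mathrm{DT}}(T_i \mid F_i)
\]
% Integrated HMM: prosodic features are emitted by the boundary
% states, so a complete path carries the total probability
\[
  P(W, F, T) \;=\; P(W, T) \prod_i P(F_i \mid T_i)
\]
```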
15. Method (2/3): Model combination
- Prosodic and lexical segmentation cues are expected to be partly complementary
- Integrated hidden Markov modeling
- How are the likelihoods estimated?
- Note that the decision tree estimates posteriors P(T_i | F_i); these can be converted to likelihoods using Bayes' rule, P(F_i | T_i) proportional to P(T_i | F_i) / P(T_i)
- A beneficial side effect of this approach is that the decision tree models the lower-frequency events in greater detail than if presented with the raw, highly skewed class distribution
- A tunable model combination weight (MCW) was introduced (see the sketch below)
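A minimal sketch of the posterior-to-likelihood conversion with the MCW applied; treating the weight as an exponent on the prosodic likelihood is my assumption about the exact mechanics:

```python
# Convert tree posteriors P(T|F) to scaled likelihoods via
# P(F|T) proportional to P(T|F) / P(T), then apply a model
# combination weight (assumed here to act as an exponent).
import numpy as np

def prosodic_likelihoods(posteriors, priors, mcw=0.5):
    """posteriors: (N, 2) tree outputs; priors: (2,) class priors."""
    scaled = posteriors / priors      # proportional to P(F|T)
    return scaled ** mcw              # MCW down/up-weights prosody

post = np.array([[0.9, 0.1],
                 [0.4, 0.6]])
print(prosodic_likelihoods(post, np.array([0.8, 0.2])))
```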
16. Method (3/3): Model combination
- Prosodic and lexical segmentation cues are expected to be partly complementary
- HMM posteriors as decision tree features
- For practical reasons, this combination was not used in this work
- Drawback: it would overestimate the informativeness of the word-based posteriors based on automatic transcriptions
- Alternative models
- HMM: a drawback is that the independence assumptions may be inappropriate and inherently limit the performance of the model
- Decision trees
- Advantages: enhanced discrimination between the target classifications; input features can be combined easily
- Drawbacks: sensitivity to skewed class distributions; expensive to model multiple target variables
17. Method (1/2): Data
- Speech data and annotations
- Switchboard data: a subset of the corpus that had been hand-labeled for sentence boundaries by the LDC
- Broadcast News data: for topic and sentence segmentation, extracted from the LDC 1997 Broadcast News (BN) release
- Training of Broadcast News language models used an additional 130 million words of text-only transcripts from the 1996 Hub-4 language model corpus (for sentence segmentation)
- Training, tuning, and test sets
18. Method (2/2): Data
- Word recognition
- 1-best output from SRI's DECIPHER large-vocabulary speech recognizer
- Several computationally expensive or cumbersome steps (such as acoustic adaptation) were skipped
- Switchboard test set: 46.7% WER
- Broadcast News: 30.5% WER
- Evaluation metrics
- Sentence segmentation performance for true words was measured by boundary classification error
- For recognized words, a string alignment of the automatically labeled recognition hypothesis is performed first, and the error rate is then calculated
- Topic segmentation was evaluated using the metric defined by NIST for the TDT-2 evaluation
19. Results and discussion (1/10)
- Task 1: Sentence segmentation of Broadcast News data
- Prosodic feature usage
- The best-performing tree identified six features for this task, which fall into four groups
- By importance: pause > turn > F0 > rhyme duration
- The behavior of the features matches precisely what the descriptive literature predicts
20. Results and discussion (2/10)
- Task 1: Sentence segmentation of Broadcast News data
- Error reduction from prosody
- The prosodic model alone performs better than a word-based language model
- The prosodic model is somewhat more robust to recognizer output than the language model
21. Results and discussion (3/10)
- Task 1: Sentence segmentation of Broadcast News data
- Performance without F0 features
- The F0 features used are not typically extracted or computed in most ASR systems
- All F0 features were therefore removed in a contrast experiment
- This could also indicate a higher degree of correlation between true words and the prosodic features
22. Results and discussion (4/10)
- Task 2: Sentence segmentation of Switchboard data
- Prosodic feature usage
- A different distribution of features than observed for Broadcast News
- The primary feature type used here is pre-boundary duration
- Pause duration at the boundary was also useful
- Most interesting about this tree was the consistent behavior of the duration features, which gave higher probability to a sentence boundary as preboundary lengthening increased
23. Results and discussion (5/10)
- Task 2: Sentence segmentation of Switchboard data
- Error reduction from prosody
- Prosody alone is not a particularly good model here
- Combining prosody with the language model resulted in a statistically significant improvement
- All differences were statistically significant
24. Results and discussion (6/10)
- Task 3: Topic segmentation of Broadcast News data
- Prosodic feature usage
- Five feature types were most helpful for this task
- The results are similar to those seen earlier for sentence segmentation in Broadcast News
- The importance of pause duration is underestimated
25. Results and discussion (7/10)
- Task 3: Topic segmentation of Broadcast News data
- Prosodic feature usage
- The speaker-gender feature
- The women, in a sense, behave more "neatly" than the men
- One possible explanation is that men are more likely than women to produce regions of nonmodal voicing at topic boundaries
26. Results and discussion (8/10)
- Task 3: Topic segmentation of Broadcast News data
- Error reduction from prosody
- All results reflect the word-averaged, weighted error metric used in the TDT-2 evaluation
- Chance here corresponds to outputting the "no boundary" class at all locations, meaning that the false alarm rate will be 0 and the miss rate will be 1
- A weight of 0.7 is given to false alarms and 0.3 to misses (worked example below)
27. Results and discussion (9/10)
- Task 3: Topic segmentation of Broadcast News data
- Performance without F0 features
- These experiments were conducted only for true words since, as shown in Table 5, results are similar to those for recognized words
28. Results and discussion (10/10)
- Comparisons of error reduction across conditions
- Performance without F0 features
- While researchers have typically found Switchboard a difficult corpus to process, for sentence segmentation on true words it is just the opposite: atypically easy
- Previous work on automatic segmentation of Switchboard transcripts is therefore likely to overestimate success for other corpora
29. Summary and conclusion
- The use of prosodic information for sentence and topic segmentation was studied
- Results showed that on Broadcast News the prosodic model alone performed as well as purely word-based statistical language models
- Interestingly, the integrated HMM worked best on transcribed words, while the posterior interpolation approach was much more robust in the case of recognized words