Title: Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
1. Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
- Elizabeth Shriberg, Andreas Stolcke
- Speech Technology and Research Laboratory
- Dilek Hakkani-Tur, Gokhan Tur
- Department of Computer Engineering, Bilkent University
- To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio
- Presenter: Yi-Ting Chen
2. Outline
- Introduction
- Method
- Prosodic modeling
- Language modeling
- Model combination
- Data
- Results and discussion
- Summary and conclusion
3. Introduction (1/2)
- Why process audio data?
- Why automatic segmentation?
- A crucial step toward robust information extraction from speech is the automatic determination of topic, sentence, and phrase boundaries
- Why use prosody?
- In all languages, prosody is used to convey structural, semantic, and functional information
- Prosodic cues are, by their nature, relatively unaffected by word identity
- Unlike spectral features, some prosodic features are largely invariant to changes in channel characteristics
- Prosodic feature extraction can be achieved with minimal additional computational load and no additional training data
4. Introduction (2/2)
- This paper describes the prosodic modeling in detail
- Decision tree and hidden Markov modeling techniques are used to combine prosodic cues with word-based approaches, and performance is evaluated on two speech corpora
- Results are examined both for true words and for words as hypothesized by a speech recognizer
5. Method (1/6): Prosodic modeling
- Feature extraction regions
- For each inter-word boundary, prosodic features were extracted from the word immediately preceding and following the boundary, or alternatively within a window of 20 frames (200 ms) before and after the boundary
- The prosodic features reflect pause durations, phone durations, pitch information, and voice quality information
- Amplitude- and energy-based features were not used, since previous work showed them to be both less reliable than and largely redundant with duration and pitch features
6. Method (2/6): Prosodic modeling
- Features
- The features were designed to be independent of word identities
- An initial set of over 100 features was pared down to a smaller set by eliminating less useful features
- Pause features: important cues to boundaries between semantic units
- The pause model was trained as an individual phone
- When there was no pause at the boundary, the pause duration feature was output as 0
- The duration of the pause preceding the word before the boundary was also used
- Both raw pause durations and durations normalized by the particular speaker's pause duration distribution were investigated (see the sketch below)
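A minimal sketch of what these two pause features could look like, assuming word-level time alignments are available; the function and field names are illustrative, not from the paper:

```python
# Illustrative pause-feature extraction at inter-word boundaries.
# Assumes each word carries start/end times from a forced
# alignment; names here are hypothetical.

def pause_features(words, speaker_mean_pause):
    """words: time-ordered list of (token, start_sec, end_sec)."""
    feats = []
    for prev, nxt in zip(words, words[1:]):
        pause = max(0.0, nxt[1] - prev[2])  # 0 when no pause at boundary
        feats.append({
            "pause_raw": pause,
            # speaker-normalized variant, per the slide's description
            "pause_norm": (pause / speaker_mean_pause
                           if speaker_mean_pause > 0 else 0.0),
        })
    return feats
```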
7. Method (3/6): Prosodic modeling
- Features
- Phone and rhyme duration features capture a slowing down toward the ends of units, or preboundary lengthening
- Preboundary lengthening typically affects the nucleus and coda of syllables
- Duration characteristics of the last rhyme of the syllable preceding the boundary were used
- Each phone in the rhyme was normalized for inherent duration (reconstructed formula below)
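The slide's formula is cut off; a plausible reconstruction, consistent with "normalized for inherent duration," is a per-phone z-score (the exact form is my assumption):

```latex
% Reconstructed normalization: duration of phone p_i scaled by
% that phone's inherent duration statistics (mean \mu_{p_i},
% standard deviation \sigma_{p_i}, estimated over training data).
\[
  \hat{d}(p_i) \;=\; \frac{d(p_i) - \mu_{p_i}}{\sigma_{p_i}}
\]
```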
8. Method (4/6): Prosodic modeling
- Features
- F0 features
- Pitch information is typically less robust and more difficult to model than other prosodic features
- The F0 processing is designed to smooth out microintonation and tracking errors, simplify F0 feature computation, and identify speaking-range parameters for each speaker (illustrative sketch below)
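The paper's actual F0 post-processing is more elaborate than this; the sketch below is only a stand-in showing the general idea of smoothing raw tracker output and estimating per-speaker range parameters:

```python
# Stand-in F0 post-processing: median-filter the voiced frames to
# suppress microintonation and tracker errors, then take robust
# percentiles as crude speaking-range parameters.
import numpy as np
from scipy.signal import medfilt

def smooth_f0(f0, kernel=5):
    """f0: 1-D array of F0 values, with 0 for unvoiced frames."""
    voiced = f0 > 0
    smoothed = f0.copy()
    smoothed[voiced] = medfilt(f0[voiced], kernel_size=kernel)
    f0_lo, f0_hi = np.percentile(f0[voiced], [5, 95])
    return smoothed, f0_lo, f0_hi
```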
9. Method (5/6): Prosodic modeling
- Features
- F0 features
- Reset features
- Range features
- F0 slope features
- F0 continuity features
- Estimated voice quality features
- Other features
- Speaker gender
- Turn boundaries
- Time elapsed from the start of the turn, and the turn count in the conversation
10. Method (6/6): Prosodic modeling
- Decision trees
- Decision trees are probabilistic classifiers (toy illustration below)
- Given a set of features and a labeled training set, the decision tree construction algorithm repeatedly selects the single feature with the highest predictive value
- The leaves of the tree store probabilities about the class distribution of all samples falling into the corresponding region of the feature space
- Decision trees make no assumptions about the shape of feature distributions
- It is not necessary to convert feature values to some standard scale
- A feature selection algorithm was used
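As a concrete illustration of these properties (greedy single-feature splits, class distributions at the leaves, no need to rescale features), here is a minimal scikit-learn stand-in for the paper's CART-style trees, on toy data:

```python
# Toy decision-tree boundary classifier: the leaves store class
# distributions, queried via predict_proba; the features need no
# common scale.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                    # e.g. pause, duration, F0
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # toy boundary labels

tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X, y)
p_boundary = tree.predict_proba(X)[:, 1]          # P(boundary | features)
```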
11. Method (1/3): Language modeling
- The goal is to capture the information about segment boundaries contained in the word sequences
- The joint distribution of boundary types and words is modeled in a hidden Markov model (HMM)
- Boundary classifications are denoted by T and word sequences by W; together they define the structure of the HMM
- The slightly more complex forward-backward algorithm is used to maximize the posterior probability of each individual boundary classification (sketched below)
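A sketch of the forward-backward recursion that yields these per-boundary posteriors; the two states stand for boundary vs. no boundary, and the transition and emission probabilities are placeholders for the N-gram-derived quantities:

```python
# Forward-backward posteriors P(state_t | observations) for a
# two-state chain; trans is 2x2, emit is (T, 2). Sketch only.
import numpy as np

def boundary_posteriors(trans, emit):
    T = emit.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    alpha[0] = 0.5 * emit[0]                      # uniform initial state
    for t in range(1, T):                         # forward pass
        alpha[t] = emit[t] * (alpha[t - 1] @ trans)
    for t in range(T - 2, -1, -1):                # backward pass
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```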
12. Method (2/3): Language modeling
- Sentence segmentation
- A hidden-event N-gram language model is used
- The states of the HMM consist of the end-of-sentence status of each word, plus any preceding words and possibly boundary tags to fill up the N-gram context
- Transition probabilities are given by N-gram probabilities estimated from annotated, boundary-tagged training data using Katz backoff
- Example: see the sketch below
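The slide's example is cut off; here is a minimal sketch of how boundary-tagged training text for such a hidden-event N-gram model might be prepared (the <S> token name is my placeholder):

```python
# Make the hidden sentence-boundary event explicit as a token, so
# standard N-gram estimation (Katz backoff in the paper) can learn
# probabilities such as P(<S> | preceding words).
def tag_boundaries(sentences):
    tokens = []
    for sent in sentences:
        tokens.extend(sent.split())
        tokens.append("<S>")        # boundary event made visible
    return " ".join(tokens)

print(tag_boundaries(["that is right", "did you go"]))
# -> that is right <S> did you go <S>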
13. Method (3/3): Language modeling
- Topic segmentation
- First, 100 individual unigram topic-cluster language models were constructed using the multipass k-means algorithm (on TDT data)
- Then an HMM was built in which the states are topic clusters and the observations are sentences (see the sketch below)
- In addition to the basic HMM segmenter, two states were incorporated for modeling the initial and final sentences of a topic segment
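A sketch of the observation model implied here: each HMM state holds one unigram topic-cluster LM, and a sentence observation is scored by its unigram log-likelihood under each cluster (the dictionary layout is my assumption):

```python
# Score a sentence under hypothetical unigram topic-cluster LMs;
# topic_unigrams maps cluster id -> {word: probability}.
import math

def sentence_loglik(sentence, unigram, floor=1e-7):
    return sum(math.log(unigram.get(w, floor)) for w in sentence.split())

def topic_scores(sentence, topic_unigrams):
    return {cid: sentence_loglik(sentence, lm)
            for cid, lm in topic_unigrams.items()}
```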
14. Method (1/3): Model combination
- Prosodic and lexical segmentation cues are expected to be partly complementary
- Two combination schemes: posterior probability interpolation, and integrated hidden Markov modeling
- With suitable independence assumptions, the familiar HMM techniques can be applied to compute the combined posteriors (reconstructed formulas below)
- To incorporate the prosodic information into the HMM, prosodic features are modeled as emissions from the relevant HMM states, with likelihoods P(F_i | T_i)
- A complete path through the HMM is then associated with the total probability given below
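The slide's formulas are cut off; reconstructed from the surrounding description (W = words, F = prosodic features, T = boundary classifications, lambda a tunable weight), the two combination rules are approximately:

```latex
% Posterior probability interpolation at each boundary i:
\[
  P(T_i \mid W, F) \;\approx\; \lambda\, P_{\mathrm{LM}}(T_i \mid W)
      \;+\; (1 - \lambda)\, P_{\mathrm{DT}}(T_i \mid F_i)
\]
% Integrated HMM: prosodic features are emitted by the boundary
% states, so a complete path carries the total probability
\[
  P(W, F, T) \;=\; P(W, T) \prod_i P(F_i \mid T_i)
\]
```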
15. Method (2/3): Model combination
- Prosodic and lexical segmentation cues are expected to be partly complementary
- Integrated hidden Markov modeling
- How are the likelihoods estimated?
- Note that the decision tree estimates posteriors P(T_i | F_i); these can be converted to likelihoods using Bayes' rule, P(F_i | T_i) proportional to P(T_i | F_i) / P(T_i)
- A beneficial side effect of this approach is that the decision tree models the lower-frequency events in greater detail than if presented with the raw, highly skewed class distribution
- A tunable model combination weight (MCW) was introduced (see the sketch below)
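A minimal sketch of the posterior-to-likelihood conversion with the MCW applied; treating the weight as an exponent on the prosodic likelihood is my assumption about the exact mechanics:

```python
# Convert tree posteriors P(T|F) to scaled likelihoods via
# P(F|T) proportional to P(T|F) / P(T), then apply a model
# combination weight (assumed here to act as an exponent).
import numpy as np

def prosodic_likelihoods(posteriors, priors, mcw=0.5):
    """posteriors: (N, 2) tree outputs; priors: (2,) class priors."""
    scaled = posteriors / priors      # proportional to P(F|T)
    return scaled ** mcw              # MCW down/up-weights prosody

post = np.array([[0.9, 0.1],
                 [0.4, 0.6]])
print(prosodic_likelihoods(post, np.array([0.8, 0.2])))
```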
16. Method (3/3): Model combination
- Prosodic and lexical segmentation cues are expected to be partly complementary
- HMM posteriors as decision tree features
- For practical reasons, this combination was not used in this work
- Drawback: it would overestimate the informativeness of the word-based posteriors based on automatic transcriptions
- Alternative models
- HMM: a drawback is that the independence assumptions may be inappropriate and inherently limit the performance of the model
- Decision trees
- Advantages: enhanced discrimination between the target classifications; input features can be combined easily
- Drawbacks: sensitivity to skewed class distributions; expensive to model multiple target variables
17. Method (1/2): Data
- Speech data and annotations
- Switchboard data: a subset of the corpus that had been hand-labeled for sentence boundaries by the LDC
- Broadcast News data: for topic and sentence segmentation, extracted from the LDC 1997 Broadcast News (BN) release
- Training of Broadcast News language models used an additional 130 million words of text-only transcripts from the 1996 Hub-4 language model corpus (for sentence segmentation)
- Training, tuning, and test sets
18. Method (2/2): Data
- Word recognition
- 1-best output from SRI's DECIPHER large-vocabulary speech recognizer
- Several computationally expensive or cumbersome steps (such as acoustic adaptation) were skipped
- Switchboard test set: 46.7% WER
- Broadcast News: 30.5% WER
- Evaluation metrics
- Sentence segmentation performance for true words was measured by boundary classification error
- For recognized words, a string alignment of the automatically labeled recognition hypothesis is performed first, and the error rate is then calculated
- Topic segmentation was evaluated using the metric defined by NIST for the TDT-2 evaluation
19. Results and discussion (1/10)
- Task 1: Sentence segmentation of Broadcast News data
- Prosodic feature usage
- The best-performing tree identified six features for this task, which fall into four groups
- By importance: pause > turn > F0 > rhyme duration
- The behavior of the features matches precisely what the descriptive literature predicts
20. Results and discussion (2/10)
- Task 1: Sentence segmentation of Broadcast News data
- Error reduction from prosody
- The prosodic model alone performs better than a word-based language model
- The prosodic model is somewhat more robust to recognizer output than the language model
21. Results and discussion (3/10)
- Task 1: Sentence segmentation of Broadcast News data
- Performance without F0 features
- The F0 features used are not typically extracted or computed in most ASR systems
- All F0 features were therefore removed in a contrast experiment
- This could also indicate a higher degree of correlation between true words and the prosodic features
22. Results and discussion (4/10)
- Task 2: Sentence segmentation of Switchboard data
- Prosodic feature usage
- A different distribution of features than observed for Broadcast News
- The primary feature type used here is pre-boundary duration
- Pause duration at the boundary was also useful
- Most interesting about this tree was the consistent behavior of the duration features, which gave higher probability to a sentence boundary as preboundary lengthening increased
23. Results and discussion (5/10)
- Task 2: Sentence segmentation of Switchboard data
- Error reduction from prosody
- Prosody alone is not a particularly good model here
- Combining prosody with the language model resulted in a statistically significant improvement
- All differences were statistically significant
24. Results and discussion (6/10)
- Task 3: Topic segmentation of Broadcast News data
- Prosodic feature usage
- Five feature types were most helpful for this task
- The results are similar to those seen earlier for sentence segmentation in Broadcast News
- The importance of pause duration is underestimated
25. Results and discussion (7/10)
- Task 3: Topic segmentation of Broadcast News data
- Prosodic feature usage
- The speaker-gender feature
- The women, in a sense, behave more "neatly" than the men
- One possible explanation is that men are more likely than women to produce regions of nonmodal voicing at topic boundaries
26. Results and discussion (8/10)
- Task 3: Topic segmentation of Broadcast News data
- Error reduction from prosody
- All results reflect the word-averaged, weighted error metric used in the TDT-2 evaluation
- Chance here corresponds to outputting the "no boundary" class at all locations, meaning that the false alarm rate will be 0 and the miss rate will be 1
- A weight of 0.7 is given to false alarms and 0.3 to misses (worked example below)
27. Results and discussion (9/10)
- Task 3: Topic segmentation of Broadcast News data
- Performance without F0 features
- These experiments were conducted only for true words since, as shown in Table 5, results are similar to those for recognized words
28. Results and discussion (10/10)
- Comparisons of error reduction across conditions
- Performance without F0 features
- While researchers have typically found Switchboard a difficult corpus to process, for sentence segmentation on true words it is just the opposite: atypically easy
- Previous work on automatic segmentation of Switchboard transcripts is therefore likely to overestimate success for other corpora
29. Summary and conclusion
- The use of prosodic information for sentence and topic segmentation was studied
- Results showed that on Broadcast News the prosodic model alone performed as well as purely word-based statistical language models
- Interestingly, the integrated HMM worked best on transcribed words, while the posterior interpolation approach was much more robust in the case of recognized words