Title: Reference
1. Reference
- Julian Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer. In Proceedings of SIGIR '95, Seattle, WA, USA, 1995.
- Xiaodan Zhu and Gerald Penn. Evaluation of Sentence Selection for Speech Summarization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45, September 2005.
- C. D. Paice. Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26(1):171-186, 1990.
2. A Trainable Document Summarizer
- Julian Kupiec, Jan Pedersen, and Francine Chen, Xerox Palo Alto Research Center
3. Outline
- Introduction
- A Trainable Summarizer
- Experiments and Evaluation
- Discussion and Conclusions
4. Introduction
- To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original
- This paper focuses on document extracts, a particular kind of computed document summary
- Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries
- Titles, key-words, tables-of-contents and abstracts might all be considered as forms of summary
- They approach extract selection as a statistical classification problem
- This framework provides a natural evaluation criterion: the classification success rate, or precision
- It does require a training corpus of documents with labelled extracts
5. A Trainable Summarizer
- Features
- Paice groups sentence-scoring features into seven categories
- Frequency-keyword heuristics
- The title-keyword heuristic
- Location heuristics
- Indicator phrases (e.g., "this report ...")
- A related heuristic involves cue words
- Two sets of words which are positively and negatively correlated with summary sentences
- Bonus words, e.g., "greatest" and "significant"
- Stigma words, e.g., "hardly" and "impossible"
- Ref.: the frequency-keyword approach, the title-keyword method, the location method, syntactic criteria, the cue method, the indicator-phrase method, and relational criteria
6. A Trainable Summarizer
- Features (a sketch of these features in code follows this slide)
- Sentence Length Cut-off Feature
- Given a threshold (e.g., 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
- Fixed-phrase Feature
- This feature is true for sentences that contain any of 26 indicator phrases, or that follow section heads that contain specific key words
- Paragraph Feature
- Thematic Word Feature
- The most frequent content words are defined as thematic words
- This feature is binary, depending on whether a sentence is present in the set of highest-scoring sentences
- Uppercase Word Feature
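A minimal Python sketch of three of the binary features described above (sentence-length cut-off, fixed-phrase, thematic-word). The cut-off value, the tiny phrase list, the top-scoring fraction, and the lack of stop-word filtering are all illustrative simplifications, not the paper's settings.

    import re
    from collections import Counter

    LENGTH_CUTOFF = 5                                   # illustrative threshold (words)
    INDICATOR_PHRASES = ["this paper", "in conclusion", "in summary"]   # stand-ins for the 26 phrases

    def sentence_length_feature(sentence):
        # True for sentences longer than the cut-off.
        return len(sentence.split()) > LENGTH_CUTOFF

    def fixed_phrase_feature(sentence):
        # True if the sentence contains any indicator phrase.
        lowered = sentence.lower()
        return any(p in lowered for p in INDICATOR_PHRASES)

    def thematic_word_features(sentences, n_thematic=10, top_fraction=0.25):
        # Define the most frequent words as thematic words, score each sentence
        # by how many it contains, and mark the highest-scoring sentences.
        words = [w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
        thematic = {w for w, _ in Counter(words).most_common(n_thematic)}
        scores = [sum(1 for w in re.findall(r"[A-Za-z]+", s) if w.lower() in thematic)
                  for s in sentences]
        cutoff = sorted(scores, reverse=True)[max(1, int(len(scores) * top_fraction)) - 1]
        return [score >= cutoff for score in scores]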
7. A Trainable Summarizer
- Classifier
- For each sentence s, compute the probability that it will be included in a summary S given its k features F_1, ..., F_k
- This can be expressed using Bayes' rule as follows:
  P(s ∈ S | F_1, ..., F_k) = P(F_1, ..., F_k | s ∈ S) P(s ∈ S) / P(F_1, ..., F_k)
- Assuming statistical independence of the features:
  P(s ∈ S | F_1, ..., F_k) = P(s ∈ S) ∏_j P(F_j | s ∈ S) / ∏_j P(F_j)
- P(s ∈ S) is a constant, and P(F_j | s ∈ S) and P(F_j) can be estimated directly from the training set by counting occurrences (a counting sketch follows this slide)
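Under the independence assumption, a sentence's score is the prior P(s ∈ S) times a product of per-feature likelihood ratios, all obtainable by counting over the labelled training extracts. A minimal sketch, assuming binary feature vectors and using add-one smoothing for stability (the smoothing and the data layout are assumptions, not details from the paper):

    def train_summary_classifier(training_sentences):
        # training_sentences: list of (feature_tuple_of_bools, in_summary_bool)
        n = len(training_sentences)
        n_in_summary = sum(1 for _, in_s in training_sentences if in_s)
        p_summary = n_in_summary / n                    # P(s in S)

        k = len(training_sentences[0][0])
        p_feature = [0.0] * k                           # P(F_j)
        p_feature_given_summary = [0.0] * k             # P(F_j | s in S)
        for j in range(k):
            count_f = sum(1 for f, _ in training_sentences if f[j])
            count_f_and_s = sum(1 for f, in_s in training_sentences if f[j] and in_s)
            p_feature[j] = (count_f + 1) / (n + 2)                          # add-one smoothing
            p_feature_given_summary[j] = (count_f_and_s + 1) / (n_in_summary + 2)
        return p_summary, p_feature, p_feature_given_summary

    def score_sentence(features, model):
        # P(s in S | F_1..F_k), up to the independence assumption above.
        p_summary, p_feature, p_feature_given_summary = model
        score = p_summary
        for j, f in enumerate(features):
            numerator = p_feature_given_summary[j] if f else 1 - p_feature_given_summary[j]
            denominator = p_feature[j] if f else 1 - p_feature[j]
            score *= numerator / denominator
        return score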
8. Experiments and Evaluation
- The corpus
- There are 188 document/summary pairs, sampled from 21 publications in the scientific/technical domain
- The average number of sentences per document is 86
- Each document was normalized so that the first line of each file contained the document title
9. Experiments and Evaluation
- The corpus
- Sentence matching
- Direct sentence match (verbatim or with minor modification)
- Direct join (two or more sentences)
- Unmatchable
- Incomplete (some overlap: includes a sentence from the original document, but also contains other information)
- The correspondences were produced in two passes
- 79% of the summary sentences have direct matches
10. Experiments and Evaluation
11. Experiments and Evaluation
- Evaluation
- A cross-validation strategy is used for evaluation
- Unmatchable and incomplete sentences were excluded from both training and testing, yielding a total of 498 unique sentences
- Performance
- First way
- The highest performance
- A sentence produced by the summarizer is defined as correct here if it is a direct sentence match or a direct join
- Of the 568 summary sentences, 195 direct sentence matches and 6 direct joins were correctly identified, for a total of 201 correctly identified summary sentences (35%)
- Second way
- Against the 498 matchable sentences: 42%
12. Experiments and Evaluation
- Evaluation
- The best combination is (paragraph + fixed-phrase + sentence-length)
- Addition of the frequency-keyword features (the thematic and uppercase word features) results in a slight decrease in overall performance
- For a baseline, sentences are selected from the beginning of a document (considering the sentence length cut-off feature alone): 24% (121 sentences correct)
13. Experiments and Evaluation
- Figure 3 shows the performance of the summarizer (using all features) as a function of summary size
- Edmundson cites a sentence-level performance of 44%
- By analogy, 25% of the average document length (86 sentences) in the corpus is about 20 sentences
- Reference to the table indicates performance at 84%
14. Discussion and Conclusions
- The trends in these results are in agreement with those of Edmundson, who used a subjectively weighted combination of features, as opposed to training the feature weights on a corpus
- The frequency-keyword features also gave the poorest individual performance in the evaluation
- The authors have, however, retained these features in the final system for several reasons
- The first is robustness
- Secondly, as the number of sentences in a summary grows, more dispersed informative material tends to be included
15. Discussion and Conclusions
- The goal is to provide a summarization program that is of general utility
- The first issue concerns robustness
- The second issue concerns presentation and other forms of summary information
16. Reference
- Julian Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer. In Proceedings of SIGIR '95, Seattle, WA, USA, 1995.
- Xiaodan Zhu and Gerald Penn. Evaluation of Sentence Selection for Speech Summarization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45, September 2005.
17. Evaluation of Sentence Selection for Speech Summarization
- Xiaodan Zhu and Gerald Penn, Department of Computer Science, University of Toronto
18. Outline
- Introduction
- Speech Summarization by Sentence Selection
- Evaluation Metrics
- Experiments
- Conclusions
19. Introduction
- This paper considers whether ASR-inspired evaluation metrics produce different results than those taken from text summarization
- The goal of speech summarization is to distill important information from speech data
- The paper focuses on sentence-level extraction
20. Speech Summarization by Sentence Selection
- LEAD: select the first N% of sentences from the beginning of the transcript
- RAND: random selection
- Knowledge-based approach (SEM)
- To calculate semantic similarity between a given utterance and the dialogue, the noun portion of WordNet is used as a knowledge source, with semantic distance between senses computed using normalized path length (a sketch follows this slide)
- The performance of the system is reported as better than the LEAD, RAND, and TFIDF-based methods
- Rather than using manual disambiguation, Brill's POS tagger is applied to acquire the nouns
- A semantic similarity package is used
21. Speech Summarization by Sentence Selection
- MMR-based approach (MMR)
- A candidate sentence is scored on whether it is more similar to the whole dialogue
- and whether it is less similar to the sentences that have so far been selected (a greedy sketch follows this slide)
- Classification-based approaches
- Sentence selection is formulated as a binary classification problem
- The best two classifiers have consistently been SVM and logistic regression
- SVM (OSU-SVM package)
- SVM seeks an optimal separating hyperplane, where the margin is maximal
- The decision function is the standard SVM form: f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )
22. Speech Summarization by Sentence Selection
23. Speech Summarization by Sentence Selection
- Classification-based approaches
- Logistic regression (LOG)
- The posterior probabilities of the class label are modelled with linear functions (a sketch follows this slide)
- X are the feature sets and Y are the class labels
24. Evaluation Metrics
- Precision/Recall
- When evaluated on binary annotations and using precision/recall metrics, sys1 and sys2 achieve 50% and 0%
- Relative Utility
- For the above example, if relative utility is used, sys1 gets 18/19 and sys2 gets 15/19 (a small computational sketch follows this slide)
- The values obtained are higher than with P/R, but they are higher for all of the systems evaluated
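A minimal sketch of the relative-utility calculation described above, assuming each sentence's utility score has already been summed over the judges; the toy numbers are illustrative, not the paper's example.

    def relative_utility(selected_indices, utilities):
        # Utility of the system's selection divided by the best achievable
        # utility with the same number of sentences.
        k = len(selected_indices)
        achieved = sum(utilities[i] for i in selected_indices)
        best = sum(sorted(utilities, reverse=True)[:k])
        return achieved / best if best else 0.0

    # Toy usage: judge-summed utilities for four sentences, system picks two.
    print(relative_utility([0, 2], [9, 10, 9, 1]))   # 18/19 ≈ 0.947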
25. Evaluation Metrics
- Word Error Rate
- Sentence level and word level
- The sum of the insertion, substitution, and deletion errors of words, divided by the number of all these errors plus the number of correct words (a sketch follows this slide)
- Zechner's Summarization Accuracy
- The summarization accuracy is defined as the sum of the relevance scores of all the words in the automatic summary, divided by the maximum achievable relevance score with the same number of words
- ROUGE
- Measures overlapping units such as n-grams, word sequences, and word pairs
- ROUGE-N and ROUGE-L
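A minimal sketch of word error rate as defined above (errors divided by errors plus correct words), computed from a standard word-level edit-distance alignment; this illustrates the definition rather than reproducing the paper's evaluation code.

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j - 1] + sub,    # match / substitution
                               dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1)          # insertion
        # Backtrack to count correct (matched) words under one optimal alignment.
        i, j, correct = len(ref), len(hyp), 0
        while i > 0 and j > 0:
            if ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
                correct += 1
                i, j = i - 1, j - 1
            elif dp[i][j] == dp[i - 1][j - 1] + 1:
                i, j = i - 1, j - 1                       # substitution
            elif dp[i][j] == dp[i - 1][j] + 1:
                i -= 1                                    # deletion
            else:
                j -= 1                                    # insertion
        errors = dp[len(ref)][len(hyp)]
        return errors / (errors + correct) if errors + correct else 0.0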
26. Experiments
- Corpus: the SWITCHBOARD dataset (a corpus of open-domain spoken dialogue)
- 27 spoken dialogues are randomly selected from SWITCHBOARD
- Three annotators are asked to assign 0/1 labels to indicate whether a sentence is in the summary or not (they are required to select around 10% of the sentences into the summary)
- Each judge's annotation is evaluated relative to the others' (F-scores)
27. Experiments
- Precision/Recall
- One standard marks a sentence as in the summary only when all three annotators agree
- LOG and SVM have similar performance and outperform the others, with MMR following, and then SEM and LEAD
- Another standard: at least two of the three judges include the sentence in the summary
28. Experiments
- Precision/Recall
- A third standard: any of the three annotators includes the sentence
- Relative Utility
- From the three different human judges, an assignment of a number between 0 and 9 to each sentence is obtained, indicating the confidence that this sentence should be included in the summary
29. Experiments
- Relative Utility
- The performance ranks of the five summarizers are the same here as they are in the three P/R evaluations
- First, the P/R agreement among annotators is not low
- Second, the redundancy in the data is much less than in multi-document summarization tasks
- Third, the summarizers compared might tend to select the same sentences
30. Experiments
- Word Error Rate and Summarization Accuracy
31. Experiments
- Word Error Rate and Summarization Accuracy
32. Experiments
33. Conclusion
- Five summarizers were evaluated on three text-summarization-inspired metrics, precision/recall (P/R), relative utility (RU), and ROUGE, as well as on two ASR-inspired evaluation metrics, word error rate (WER) and summarization accuracy (SA)
- The preliminary conclusion is that considerably greater caution must be exercised when using ASR-based measures than has been witnessed to date in the speech summarization literature