Title: Reference
1. Reference
- Julian Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer. In Proceedings of SIGIR '95, Seattle, WA, USA, 1995.
- Xiaodan Zhu and Gerald Penn. Evaluation of Sentence Selection for Speech Summarization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45, September 2005.
- C. D. Paice. Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26(1):171-186, 1990.
2. A Trainable Document Summarizer
- Julian Kupiec, Jan Pedersen, and Francine Chen, Xerox Palo Alto Research Center
3. Outline
- Introduction
- A Trainable Summarizer
- Experiments and Evaluation
- Discussion and Conclusions
4. Introduction
- To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original
- This paper focuses on document extracts, a particular kind of computed document summary
- Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries
- Titles, key-words, tables-of-contents and abstracts might all be considered as forms of summary
- They approach extract selection as a statistical classification problem
- This framework provides a natural evaluation criterion: the classification success rate, or precision
- It does require a training corpus of documents with labelled extracts
5. A Trainable Summarizer
- Features
- Paice groups sentence-scoring features into seven categories
- Frequency-keyword heuristics
- The title-keyword heuristic
- Location heuristics
- Indicator phrases (e.g., "this report ...")
- A related heuristic involves cue words
- Two sets of words which are positively and negatively correlated with summary sentences
- Bonus words, e.g., "greatest" and "significant"
- Stigma words, e.g., "hardly" and "impossible"
- Ref.: the frequency-keyword approach, the title-keyword method, the location method, syntactic criteria, the cue method, the indicator-phrase method, and relational criteria
6. A Trainable Summarizer
- Features (a sketch of these features in code follows this slide)
- Sentence Length Cut-off Feature
- Given a threshold (e.g., 5 words), the feature is true for all sentences longer than the threshold, and false otherwise
- Fixed-phrase Feature
- This feature is true for sentences that contain any of 26 indicator phrases, or that follow section heads that contain specific key words
- Paragraph Feature
- Thematic Word Feature
- The most frequent content words are defined as thematic words
- This feature is binary, depending on whether a sentence is present in the set of highest-scoring sentences
- Uppercase Word Feature
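A minimal Python sketch of three of the binary features described above (sentence-length cut-off, fixed-phrase, thematic-word). The cut-off value, the tiny phrase list, the top-scoring fraction, and the lack of stop-word filtering are all illustrative simplifications, not the paper's settings.

    import re
    from collections import Counter

    LENGTH_CUTOFF = 5                                   # illustrative threshold (words)
    INDICATOR_PHRASES = ["this paper", "in conclusion", "in summary"]   # stand-ins for the 26 phrases

    def sentence_length_feature(sentence):
        # True for sentences longer than the cut-off.
        return len(sentence.split()) > LENGTH_CUTOFF

    def fixed_phrase_feature(sentence):
        # True if the sentence contains any indicator phrase.
        lowered = sentence.lower()
        return any(p in lowered for p in INDICATOR_PHRASES)

    def thematic_word_features(sentences, n_thematic=10, top_fraction=0.25):
        # Define the most frequent words as thematic words, score each sentence
        # by how many it contains, and mark the highest-scoring sentences.
        words = [w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
        thematic = {w for w, _ in Counter(words).most_common(n_thematic)}
        scores = [sum(1 for w in re.findall(r"[A-Za-z]+", s) if w.lower() in thematic)
                  for s in sentences]
        cutoff = sorted(scores, reverse=True)[max(1, int(len(scores) * top_fraction)) - 1]
        return [score >= cutoff for score in scores]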
7. A Trainable Summarizer
- Classifier
- For each sentence s, compute the probability that it will be included in a summary S given its k features F_1, ..., F_k
- This can be expressed using Bayes' rule as follows:
  P(s ∈ S | F_1, ..., F_k) = P(F_1, ..., F_k | s ∈ S) P(s ∈ S) / P(F_1, ..., F_k)
- Assuming statistical independence of the features:
  P(s ∈ S | F_1, ..., F_k) = P(s ∈ S) ∏_j P(F_j | s ∈ S) / ∏_j P(F_j)
- P(s ∈ S) is a constant, and P(F_j | s ∈ S) and P(F_j) can be estimated directly from the training set by counting occurrences (a counting sketch follows this slide)
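Under the independence assumption, a sentence's score is the prior P(s ∈ S) times a product of per-feature likelihood ratios, all obtainable by counting over the labelled training extracts. A minimal sketch, assuming binary feature vectors and using add-one smoothing for stability (the smoothing and the data layout are assumptions, not details from the paper):

    def train_summary_classifier(training_sentences):
        # training_sentences: list of (feature_tuple_of_bools, in_summary_bool)
        n = len(training_sentences)
        n_in_summary = sum(1 for _, in_s in training_sentences if in_s)
        p_summary = n_in_summary / n                    # P(s in S)

        k = len(training_sentences[0][0])
        p_feature = [0.0] * k                           # P(F_j)
        p_feature_given_summary = [0.0] * k             # P(F_j | s in S)
        for j in range(k):
            count_f = sum(1 for f, _ in training_sentences if f[j])
            count_f_and_s = sum(1 for f, in_s in training_sentences if f[j] and in_s)
            p_feature[j] = (count_f + 1) / (n + 2)                          # add-one smoothing
            p_feature_given_summary[j] = (count_f_and_s + 1) / (n_in_summary + 2)
        return p_summary, p_feature, p_feature_given_summary

    def score_sentence(features, model):
        # P(s in S | F_1..F_k), up to the independence assumption above.
        p_summary, p_feature, p_feature_given_summary = model
        score = p_summary
        for j, f in enumerate(features):
            numerator = p_feature_given_summary[j] if f else 1 - p_feature_given_summary[j]
            denominator = p_feature[j] if f else 1 - p_feature[j]
            score *= numerator / denominator
        return score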
8. Experiments and Evaluation
- The corpus
- There are 188 document/summary pairs, sampled from 21 publications in the scientific/technical domain
- The average number of sentences per document is 86
- Each document was normalized so that the first line of each file contained the document title
9. Experiments and Evaluation
- The corpus
- Sentence matching
- Direct sentence match (verbatim or with minor modification)
- Direct join (two or more sentences)
- Unmatchable
- Incomplete (some overlap: includes a sentence from the original document, but also contains other information)
- The correspondences were produced in two passes
- 79% of the summary sentences have direct matches
10. Experiments and Evaluation
11. Experiments and Evaluation
- Evaluation
- A cross-validation strategy is used for evaluation
- Unmatchable and incomplete sentences were excluded from both training and testing, yielding a total of 498 unique sentences
- Performance
- First way
- The highest performance
- A sentence produced by the summarizer is defined as correct here if it is a direct sentence match or a direct join
- Of the 568 summary sentences, 195 direct sentence matches and 6 direct joins were correctly identified, for a total of 201 correctly identified summary sentences (35%)
- Second way
- Against the 498 matchable sentences: 42%
12. Experiments and Evaluation
- Evaluation
- The best combination is (paragraph + fixed-phrase + sentence-length)
- Addition of the frequency-keyword features (the thematic and uppercase word features) results in a slight decrease in overall performance
- For a baseline, sentences are selected from the beginning of a document (considering the sentence length cut-off feature alone): 24% (121 sentences correct)
13. Experiments and Evaluation
- Figure 3 shows the performance of the summarizer (using all features) as a function of summary size
- Edmundson cites a sentence-level performance of 44%
- By analogy, 25% of the average document length (86 sentences) in the corpus is about 20 sentences
- Reference to the table indicates performance at 84%
14. Discussion and Conclusions
- The trends in these results are in agreement with those of Edmundson, who used a subjectively weighted combination of features, as opposed to training the feature weights on a corpus
- The frequency-keyword features also gave the poorest individual performance in the evaluation
- The authors have, however, retained these features in the final system for several reasons
- The first is robustness
- Secondly, as the number of sentences in a summary grows, more dispersed informative material tends to be included
15. Discussion and Conclusions
- The goal is to provide a summarization program that is of general utility
- The first issue concerns robustness
- The second issue concerns presentation and other forms of summary information
16. Reference
- Julian Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer. In Proceedings of SIGIR '95, Seattle, WA, USA, 1995.
- Xiaodan Zhu and Gerald Penn. Evaluation of Sentence Selection for Speech Summarization. In Proceedings of the 2nd International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria, pp. 39-45, September 2005.
17. Evaluation of Sentence Selection for Speech Summarization
- Xiaodan Zhu and Gerald Penn, Department of Computer Science, University of Toronto
18. Outline
- Introduction
- Speech Summarization by Sentence Selection
- Evaluation Metrics
- Experiments
- Conclusions
19. Introduction
- This paper considers whether ASR-inspired evaluation metrics produce different results than those taken from text summarization
- The goal of speech summarization is to distill important information from speech data
- The paper focuses on sentence-level extraction
20. Speech Summarization by Sentence Selection
- LEAD: select the first N% of sentences from the beginning of the transcript
- RAND: random selection
- Knowledge-based approach (SEM)
- To calculate semantic similarity between a given utterance and the dialogue, the noun portion of WordNet is used as a knowledge source, with semantic distance between senses computed using normalized path length (a sketch follows this slide)
- The performance of the system is reported as better than the LEAD, RAND, and TFIDF-based methods
- Rather than using manual disambiguation, Brill's POS tagger is applied to acquire the nouns
- A semantic similarity package is used
21. Speech Summarization by Sentence Selection
- MMR-based approach (MMR)
- A candidate sentence is scored on whether it is more similar to the whole dialogue
- and whether it is less similar to the sentences that have so far been selected (a greedy sketch follows this slide)
- Classification-based approaches
- Sentence selection is formulated as a binary classification problem
- The best two classifiers have consistently been SVM and logistic regression
- SVM (OSU-SVM package)
- SVM seeks an optimal separating hyperplane, where the margin is maximal
- The decision function is the standard SVM form: f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )
22. Speech Summarization by Sentence Selection
23. Speech Summarization by Sentence Selection
- Classification-based approaches
- Logistic regression (LOG)
- The posterior probabilities of the class label are modelled with linear functions (a sketch follows this slide)
- X are the feature sets and Y are the class labels
24. Evaluation Metrics
- Precision/Recall
- When evaluated on binary annotations and using precision/recall metrics, sys1 and sys2 achieve 50% and 0%
- Relative Utility
- For the above example, if relative utility is used, sys1 gets 18/19 and sys2 gets 15/19 (a small computational sketch follows this slide)
- The values obtained are higher than with P/R, but they are higher for all of the systems evaluated
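A minimal sketch of the relative-utility calculation described above, assuming each sentence's utility score has already been summed over the judges; the toy numbers are illustrative, not the paper's example.

    def relative_utility(selected_indices, utilities):
        # Utility of the system's selection divided by the best achievable
        # utility with the same number of sentences.
        k = len(selected_indices)
        achieved = sum(utilities[i] for i in selected_indices)
        best = sum(sorted(utilities, reverse=True)[:k])
        return achieved / best if best else 0.0

    # Toy usage: judge-summed utilities for four sentences, system picks two.
    print(relative_utility([0, 2], [9, 10, 9, 1]))   # 18/19 ≈ 0.947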
25. Evaluation Metrics
- Word Error Rate
- Sentence level and word level
- The sum of the insertion, substitution, and deletion errors of words, divided by the number of all these errors plus the number of correct words (a sketch follows this slide)
- Zechner's Summarization Accuracy
- The summarization accuracy is defined as the sum of the relevance scores of all the words in the automatic summary, divided by the maximum achievable relevance score with the same number of words
- ROUGE
- Measures overlapping units such as n-grams, word sequences, and word pairs
- ROUGE-N and ROUGE-L
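A minimal sketch of word error rate as defined above (errors divided by errors plus correct words), computed from a standard word-level edit-distance alignment; this illustrates the definition rather than reproducing the paper's evaluation code.

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j - 1] + sub,    # match / substitution
                               dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1)          # insertion
        # Backtrack to count correct (matched) words under one optimal alignment.
        i, j, correct = len(ref), len(hyp), 0
        while i > 0 and j > 0:
            if ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
                correct += 1
                i, j = i - 1, j - 1
            elif dp[i][j] == dp[i - 1][j - 1] + 1:
                i, j = i - 1, j - 1                       # substitution
            elif dp[i][j] == dp[i - 1][j] + 1:
                i -= 1                                    # deletion
            else:
                j -= 1                                    # insertion
        errors = dp[len(ref)][len(hyp)]
        return errors / (errors + correct) if errors + correct else 0.0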
26. Experiments
- Corpus: the SWITCHBOARD dataset (a corpus of open-domain spoken dialogue)
- 27 spoken dialogues are randomly selected from SWITCHBOARD
- Three annotators are asked to assign 0/1 labels to indicate whether a sentence is in the summary or not (they are required to select around 10% of the sentences into the summary)
- Each judge's annotation is evaluated relative to the others' (F-scores)
27. Experiments
- Precision/Recall
- One standard marks a sentence as in the summary only when all three annotators agree
- LOG and SVM have similar performance and outperform the others, with MMR following, and then SEM and LEAD
- Another standard: at least two of the three judges include the sentence in the summary
28. Experiments
- Precision/Recall
- A third standard: any of the three annotators includes the sentence
- Relative Utility
- From the three different human judges, an assignment of a number between 0 and 9 to each sentence is obtained, indicating the confidence that this sentence should be included in the summary
29. Experiments
- Relative Utility
- The performance ranks of the five summarizers are the same here as they are in the three P/R evaluations
- First, the P/R agreement among annotators is not low
- Second, the redundancy in the data is much less than in multi-document summarization tasks
- Third, the summarizers compared might tend to select the same sentences
30. Experiments
- Word Error Rate and Summarization Accuracy
31. Experiments
- Word Error Rate and Summarization Accuracy
32. Experiments
33. Conclusion
- Five summarizers were evaluated on three text-summarization-inspired metrics, precision/recall (P/R), relative utility (RU), and ROUGE, as well as on two ASR-inspired evaluation metrics, word error rate (WER) and summarization accuracy (SA)
- The preliminary conclusion is that considerably greater caution must be exercised when using ASR-based measures than has been witnessed to date in the speech summarization literature