Title: Methods for Automatic Evaluation of Sentence Extract Summaries
1. Methods for Automatic Evaluation of Sentence Extract Summaries
- G. Ravindra, N. Balakrishnan, K. R. Ramakrishnan
- Supercomputer Education Research Center
- Department of Electrical Engineering
- Indian Institute of Science
- Bangalore, India
2. Agenda
- Introduction to Text Summarization
- Need for summarization, types of summaries
- Evaluating Extract Summaries
- Challenges in manual and automatic evaluation
- Fuzzy Summary Evaluation
- Complexity Scores
3. What is Text Summarization?
- Reductive transformation of source text to summary text by content generalization and/or selection
- Loss of information
- What can be lost and what should not be lost
- How much can be lost
- What is the size of the summary
- Types of Summaries
- Extracts and Abstracts
- Influence of genre on the performance of a summarization algorithm
- Newswire stories favor sentence-position heuristics
4. Need for Summarization
- Explosive growth in the availability of digital textual data
- Books in digital libraries, mailing-list archives, on-line news portals
- Duplication of textual segments in books
- E.g. 10 introductory books on quantum physics have a number of paragraphs common to all of them (syntactically different but semantically the same)
- Hand-held devices
- Small screens and limited memory
- Low-power devices and hence limited processing capability
- E.g. Streaming a book from a digital library to a hand-held device
- Production of information is faster than consumption
5. Types of Summaries
- Extracts
- Text selection
- E.g. paragraphs from books, sentences from editorials, phrases from e-mails
- Application of statistical techniques
- Abstracts
- Text selection followed by generalization
- Need for linguistic processing
- E.g. convert a sentence to a phrase
- Generic Summaries
- Independent of genre
- Indicative Summaries
- Give a general idea of the topic discussed in the text being summarized
- Informational Summaries
- Serve as a surrogate for the original text
6. Evaluating Extract Summaries
- Manual evaluation
- Human judges score a summary on a well-defined scale based on well-defined criteria
- Subject to the judges' understanding of the subject
- Depends on the judges' opinions
- Guidelines constrain opinions
- Individual judges' scores are combined to generate the final score
- Re-evaluation might result in different scores
- Logistic problems for researchers
7. Automatic Evaluation
- Machine-based evaluation
- Consistent over multiple runs
- Fast, avoids logistic problems
- Suitable for researchers experimenting with new algorithms
- Flip side
- Not as accurate as human evaluation
- Should be used as a precursor to a detailed human evaluation
- Must handle various sentence constructs and linguistic variants algorithmically
8. Fuzzy Summary Evaluation (FuSE)
- Proposes the use of fuzzy union theory to quantify the similarity of two extract summaries
- Similarity between the reference (human-generated) summary and the candidate (machine-generated) summary is evaluated
- Each sentence is a fuzzy set
- Each sentence in the reference summary has a membership grade in every sentence of the candidate (machine-generated) summary
- The membership grade of a reference-summary sentence in the candidate summary is the union of its membership grades across all candidate-summary sentences
- The membership grades are used to compute an f-score value
- A membership grade is based on the Hamming distance between the collocations of two sentences (a sketch follows below)
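The slides name a collocation-based Hamming distance but do not spell out the exact formula, so the following is a minimal Python sketch under the assumption that collocations are word bigrams and that the distance is normalised into a similarity in [0, 1]; the function names are illustrative, not taken from the original work.

```python
from typing import Set, Tuple


def collocations(sentence: str) -> Set[Tuple[str, str]]:
    """Represent a sentence by its word bigrams (one simple notion of collocation)."""
    words = sentence.lower().split()
    return set(zip(words, words[1:]))


def membership_grade(candidate: str, reference: str) -> float:
    """Membership grade of one sentence in another, in [0, 1].

    The Hamming-style distance is taken here as the size of the symmetric
    difference of the two collocation sets, normalised by the size of their
    union and converted to a similarity (1.0 means identical collocation sets).
    """
    c, r = collocations(candidate), collocations(reference)
    if not c or not r:
        return 1.0 if c == r else 0.0
    return 1.0 - len(c ^ r) / len(c | r)
```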
9. Fuzzy F-score
- Notation: $C$ is the candidate-summary sentence set, $R$ is the reference-summary sentence set, $S$ is the union function, and $\mu(c \mid r)$ is the membership grade of a candidate sentence $c$ in a reference sentence $r$.
- In the usual fuzzy precision/recall form these combine as
  $P_f = \frac{1}{|C|} \sum_{c \in C} S_{r \in R}\, \mu(c \mid r)$,
  $R_f = \frac{1}{|R|} \sum_{r \in R} S_{c \in C}\, \mu(r \mid c)$,
  $F_f = \frac{2\, P_f R_f}{P_f + R_f}$.
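As a small illustration of how these quantities fit together, the Python sketch below turns a membership-grade matrix into fuzzy precision, recall and f-score. Plain `max` is used as a stand-in union; Frank's s-norm from the next slides could be substituted for it.

```python
from typing import Callable, List, Tuple


def fuzzy_scores(
    grades: List[List[float]],
    union: Callable[[List[float]], float] = max,  # stand-in union; Frank's s-norm can be dropped in
) -> Tuple[float, float, float]:
    """Fuzzy precision, recall and f-score from grades[i][j], the membership
    grade of candidate sentence i in reference sentence j."""
    n_cand, n_ref = len(grades), len(grades[0])
    # Precision: how strongly each candidate sentence belongs to the reference summary.
    precision = sum(union(row) for row in grades) / n_cand
    # Recall: how strongly each reference sentence is covered by the candidate summary.
    recall = sum(union([row[j] for row in grades]) for j in range(n_ref)) / n_ref
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score
```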
10. Choice of Union Operator
- Proposes the use of Frank's s-norm operator
- Allows combining partial matches non-linearly
- The membership grade of a sentence in a summary depends on the sentence's length
- Automatically builds a brevity bonus into the scheme
11. Frank's S-norm Operator
- Frank's s-norm with base (damping coefficient) $s$ combines two membership grades $a$ and $b$ as
  $S_s(a, b) = 1 - \log_s\!\left(1 + \frac{(s^{1-a} - 1)(s^{1-b} - 1)}{s - 1}\right)$.
- The damping coefficient is computed from the mean of the non-zero membership grades for a sentence, the sentence length, and the length of the longest sentence.
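A minimal Python sketch of the operator itself, assuming the base $s$ has already been chosen (the slides derive it from the mean grade and the sentence lengths, but that expression is not reproduced here):

```python
import math
from functools import reduce
from typing import Iterable


def frank_s_norm(a: float, b: float, s: float) -> float:
    """Frank's s-norm (t-conorm) with base s (s > 0, s != 1).

    Approaches the probabilistic sum a + b - a*b as s -> 1 and max(a, b) as s -> 0.
    """
    num = (s ** (1.0 - a) - 1.0) * (s ** (1.0 - b) - 1.0)
    return 1.0 - math.log(1.0 + num / (s - 1.0), s)


def combine_grades(grades: Iterable[float], s: float) -> float:
    """Fold a sentence's membership grades with Frank's s-norm.

    The fold starts at 0.0 because 0 is the identity element of any s-norm.
    """
    return reduce(lambda acc, g: frank_s_norm(acc, g, s), grades, 0.0)
```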
12. Characteristics of Frank's Base
13. Performance of FuSE for Various Sentence Lengths
14. Dictionary-enhanced Fuzzy Summary Evaluation (DeFuSE)
- FuSE does not capture sentence similarity based on synonymy and hypernymy
- Identifying synonymous words makes evaluation more accurate
- Identifying hypernymous word relationships allows more general (gross) information to be considered during evaluation
- Note: very deep hypernymy trees can result in topic drift and hence improper evaluation (see the lookup sketch below)
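The next slide points to WordNet; the snippet below sketches such a lookup using NLTK's WordNet interface (NLTK is an assumption here, not something the slides prescribe), collecting synonyms plus hypernyms only up to a shallow depth to limit the topic drift noted above.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')


def expansion_terms(word: str, max_hypernym_depth: int = 2) -> set:
    """Synonyms plus shallow hypernyms of a word, for DeFuSE-style matching.

    Depth is capped because very deep hypernym chains drift towards overly
    generic concepts and distort the evaluation.
    """
    terms = set()
    for synset in wn.synsets(word):
        terms.update(name.replace('_', ' ') for name in synset.lemma_names())
        frontier, depth = [synset], 0
        while frontier and depth < max_hypernym_depth:
            frontier = [h for s in frontier for h in s.hypernyms()]
            for hyper in frontier:
                terms.update(name.replace('_', ' ') for name in hyper.lemma_names())
            depth += 1
    return terms


# e.g. expansion_terms('hurricane') picks up shallow hypernyms such as 'cyclone'.
```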
15. Use of WordNet
16. Example: Use of Hypernymy
- HURRICANE GILBERT DEVASTATED DOMINICAN REPUBLIC AND PARTS OF CUBA
- (PHYSICAL PHENOMENON) GILBERT (DESTROY, RUIN) (REGION) AND PARTS OF (REGION)
- TROPICAL STORM GILBERT DESTROYED PARTS OF HAVANA
- TROPICAL (PHYSICAL PHENOMENON) GILBERT DESTROYED PARTS OF (REGION)
17. Complexity Score
- Attempts to quantify a summarization algorithm by the difficulty of generating a summary of a particular accuracy
- Generating a 9-sentence summary from a 10-sentence document is very easy
- An algorithm that randomly selects 9 sentences has a worst-case accuracy of about 90%
- A complicated AI/NLP-based algorithm cannot do any better
- If a 2-sentence summary is to be generated from a 10-sentence document, there are 45 possible candidates (the C(10, 2) two-sentence subsets), out of which only one is accurate
18. Computing Complexity Score
- Probability of generating a summary of length $m_1$ containing $l_1$ accurate sentences, when the human summary has $h$ sentences and the document being summarized has $n$ sentences
- For a random-selection baseline this is the hypergeometric probability
  $P(l_1) = \frac{\binom{h}{l_1}\,\binom{n-h}{m_1-l_1}}{\binom{n}{m_1}}$
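A short Python sketch of that baseline probability, using the examples from the previous slide as a check (`math.comb` requires Python 3.8+):

```python
from math import comb


def summary_probability(n: int, h: int, m1: int, l1: int) -> float:
    """Probability that a random m1-sentence extract from an n-sentence document
    contains exactly l1 of the h sentences in the human (reference) summary."""
    return comb(h, l1) * comb(n - h, m1 - l1) / comb(n, m1)


# Picking 2 sentences from a 10-sentence document when the reference summary
# has 2 sentences: 1-in-45 chance of getting both right.
print(summary_probability(n=10, h=2, m1=2, l1=2))   # 0.0222... = 1/45

# The 9-of-10 case: a random 9-sentence extract against a 9-sentence reference
# summary contains at least 8 correct sentences, hence ~90% worst-case accuracy.
print(summary_probability(n=10, h=9, m1=9, l1=9))   # 0.1 (all 9 correct)
print(summary_probability(n=10, h=9, m1=9, l1=8))   # 0.9 (exactly 8 correct)
```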
19. Complexity Score (cont.)
- To compare two summaries of equal length, the performance of one is expressed relative to this baseline
20. Complexity Score (cont.)
- The complexity of generating a 10% extract containing 12 correct sentences is higher than that of generating a 30% extract containing 12 correct sentences (a numeric illustration follows below)
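A numeric illustration of that comparison, using hypothetical sizes (a 200-sentence document and a 40-sentence reference summary) chosen purely for the example:

```python
from math import comb


def baseline_prob(n: int, h: int, m1: int, l1: int) -> float:
    """P(a random m1-sentence extract from n sentences hits exactly l1 of the h reference sentences)."""
    return comb(h, l1) * comb(n - h, m1 - l1) / comb(n, m1)


# Hypothetical sizes: 200-sentence document, 40-sentence reference summary.
p_10pct = baseline_prob(n=200, h=40, m1=20, l1=12)  # 12 correct in a 10% (20-sentence) extract
p_30pct = baseline_prob(n=200, h=40, m1=60, l1=12)  # 12 correct in a 30% (60-sentence) extract

# The shorter extract is far less likely to hit 12 correct sentences by chance,
# so achieving it reflects a more "complex" (harder) summarization task.
assert p_10pct < p_30pct
```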
21. Conclusion
- Summary evaluation is as complicated as summary generation
- Fuzzy schemes are well suited to evaluating extract summaries
- Use of synonymy and hypernymy relations improves evaluation accuracy
- The complexity score is a new way of looking at summary evaluation