Multi-document Summarization and Evaluation - PowerPoint PPT Presentation

Slides: 25

Provided by: Kathleen268

Learn more at: http://www1.cs.columbia.edu

Transcript and Presenter's Notes

1
Multi-document Summarization and Evaluation
2
Task Characteristics

Input: a set of documents on the same topic
Retrieved during an IR search
Clustered by a news browser
Problem: same topic or same event?
Output: a paragraph-length summary
Salient information across documents
Similarities between topics?
Redundancy removal is critical

3
Some Standard Approaches

Salient information = similarities
Pairwise similarity between all sentences
Cluster sentences using similarity score (themes)
Generate one sentence for each theme
Sentence extraction (one sentence per cluster)
Sentence fusion: intersect sentences within a theme and choose the repeated phrases; generate a sentence from the phrases
Salient information = important words
Important words are simply the most frequent in the document set
SumBasic simply chooses sentences with the most frequent words; Conroy expands on this
Daumé and Marcu have been the renegades
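The frequency-based approach above can be sketched in a few lines. This is a simplified illustration in the spirit of SumBasic, not the published system: the tokenizer, the `sumbasic` function name, the `max_sentences` parameter, and the choice to square the probability of already-used words (to discourage redundancy) are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sumbasic(sentences, max_sentences=2):
    """Greedy frequency-based sentence extraction (simplified sketch)."""
    tokenized = [tokenize(s) for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    # Word probability = relative frequency in the whole document set.
    prob = {w: c / total for w, c in counts.items()}

    def score(i):
        # Sentence score = average probability of its words.
        toks = tokenized[i]
        return sum(prob[w] for w in toks) / len(toks) if toks else 0.0

    chosen = []
    remaining = set(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        # Ties go to the earliest sentence because candidates are sorted.
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.remove(best)
        # Down-weight words already covered so later picks add new content.
        for w in set(tokenized[best]):
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in chosen]


docs = [
    "Pinochet was arrested in London.",
    "Pinochet was arrested at the request of a Spanish judge.",
    "The weather in London was mild.",
]
summary = sumbasic(docs, max_sentences=2)
```

A real system would also remove stop words and stop at a word budget rather than a sentence count; both are left out here to keep the sketch short.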

4
Some Variations on Task

Focused summarization: given a topic/query, generate a summary
Update summaries: given an event over time, tell us what's new
Multilingual summarization: generate an English summary of multiple documents in different languages

5
DUC: Document Understanding Conference

Established and funded by DARPA TIDES
Run by an independent evaluator, NIST
Open to the summarization community
Annual evaluations on common datasets
2001-present
Tasks
Single document summarization
Headline summarization
Multi-document summarization
Multi-lingual summarization
Focused summarization

6
DUC Evaluation

Gold Standard
Human summaries written by NIST
From 2 to 9 summaries per input set
Multiple metrics
Manual
Coverage (early years)
Pyramids (later years)
Responsiveness (later years)
Quality questions
Automatic
ROUGE (-1, -2, skip-bigrams, LCS, BE)
Granularity
Manual: sub-sentential elements
Automatic: sentences

7
Considerations Across Evaluations

Independent evaluator
Not always as knowledgeable as researchers
Impartial determination of approach
Extensive collection of resources
Determination of task
Appealing to a broad cross-section of community
Changes over time
DUC 2001-2002: single and multi-document
DUC 2003: headlines, multi-document
DUC 2004: headlines, multilingual and multi-document, focused
DUC 2005: focused summarization
DUC 2006: focused and a new task, up for discussion
How long do participants have to prepare?
When is a task dropped?
Scoring of text at the sub-sentential level

8
Potential Problems
9
Comparing Text Against Text

Which human summary makes a good gold standard?
Many summaries are good
At what granularity is the comparison made?
When can we say that two pieces of text match?

10
Variation impacts evaluation

Comparing content is hard
All kinds of judgment calls
Paraphrases
VP vs. NP
Ministers have been exchanged
Reciprocal ministerial visits
Length and constituent type
Robotics assists doctors in the medical operating theater
Surgeons started using robotic assistants

11
Nightmare: only one gold standard

System may have chosen an equally good sentence, but not one in the single gold standard
Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government
In DUC 2001 (one gold standard), the human model had a significant impact on scores (McKeown et al.)
Five human summaries needed to avoid changes in rank (Nenkova and Passonneau)
DUC 2003 data
3 topic sets: 1 highest scoring and 2 lowest scoring
10 model summaries

12
Scoring

Two main approaches used in DUC
ROUGE (Lin and Hovy)
Pyramids (Nenkova and Passonneau)
Problems
Are the results stable?
How difficult is it to do the scoring?

13
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE: n-gram co-occurrence metrics measuring content overlap
ROUGE-N = (count of n-gram overlaps between candidate and model summaries) / (total n-grams in the model summaries)
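The ratio on this slide can be written out as a short function. This is a minimal sketch of n-gram recall with clipped counts, assuming whitespace tokenization and lowercasing; the names `ngrams` and `rouge_n` are illustrative, and the official ROUGE toolkit adds options (stemming, stop-word removal, and so on) that are not reproduced here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n=1):
    """Recall of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    overlap = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Clip each match so a candidate n-gram is not credited more
        # often than it actually occurs in the candidate.
        overlap += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0


candidate = "Pinochet arrested in London"
model = "Pinochet has been arrested in London"
r1 = rouge_n(candidate, [model], n=1)  # 4 of 6 model unigrams matched
r2 = rouge_n(candidate, [model], n=2)  # 2 of 5 model bigrams matched
```

Because the denominator counts model n-grams, the measure is recall-oriented: a longer candidate can only raise the score, which is one reason DUC fixes the summary length.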
14
ROUGE

Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements
Automatic and thus easy to apply
Important to consider confidence intervals when determining differences between systems
Scores falling within the same interval are not significantly different
ROUGE scores place systems into large groups: it can be hard to say definitively that one is better than another
Sometimes results are unintuitive
Multilingual scores as high as English scores
Use in speech summarization shows no discrimination
Good for training: regardless of intervals, can see trends
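The point about confidence intervals can be illustrated with a percentile bootstrap over per-topic scores. This is a sketch under stated assumptions: the per-topic scores are invented for the example, the function names and parameters (`bootstrap_ci`, `n_resamples`, `alpha`, `seed`) are hypothetical, and the exact procedure used in the DUC evaluations may differ.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample topics with replacement and record the mean each time.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True when two (lo, hi) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up per-topic scores for two hypothetical systems.
system_x = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.10, 0.15]
system_y = [0.13, 0.14, 0.12, 0.15, 0.12, 0.15, 0.11, 0.14]
ci_x = bootstrap_ci(system_x)
ci_y = bootstrap_ci(system_y)
same_group = intervals_overlap(ci_x, ci_y)
```

When `same_group` is true, the two systems fall in the same large group and their difference in mean score should not be read as a ranking.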

15
Comparison of Scoring Methods in DUC05

Comparisons between Pyramid (original, modified), responsiveness, and ROUGE-SU4
Pyramid score computed from multiple humans
Responsiveness is just one human's judgment
ROUGE-SU4 equivalent to ROUGE-2

16
Creation of pyramids

Done at Columbia for each of 20 out of 50 sets
Primary annotator, secondary checker
Held round-table discussions of problematic constructions that occurred in this data set
Comma-separated lists
Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medical plants without deforestation.
General vs. specific
Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey

17
Characteristics of the Responses

Proportion of SCUs of Weight 1 is large
44% (D324) to 81% (D695)
Mean SCU weight: 1.9
Agreement among human responders is quite low
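To make the SCU statistics concrete, here is a small sketch of how a pyramid score is usually computed. The slides do not spell out the formula, so this follows the commonly described formulation as an assumption: an SCU's weight is the number of model summaries expressing it, and a candidate's score is the weight it collects divided by the best weight obtainable with the same number of SCUs (original score) or with the average number of SCUs in a model summary (modified score). The pyramid below is invented for the example.

```python
def pyramid_score(pyramid_weights, candidate_scus, size=None):
    """Observed SCU weight divided by the ideal weight for `size` SCUs.

    pyramid_weights: dict mapping SCU id -> weight.
    candidate_scus: SCU ids found in the candidate summary.
    size: number of SCUs in the ideal summary; defaults to the number of
          distinct candidate SCUs (original score). Pass the average SCU
          count of the model summaries for the modified score.
    """
    found = set(candidate_scus)
    observed = sum(pyramid_weights.get(scu, 0) for scu in found)
    k = len(found) if size is None else size
    # The ideal summary takes the k heaviest SCUs in the pyramid.
    ideal = sum(sorted(pyramid_weights.values(), reverse=True)[:k])
    return observed / ideal if ideal else 0.0


# Hypothetical pyramid: most SCUs have weight 1, as on the slide.
weights = {"a": 4, "b": 3, "c": 1, "d": 1, "e": 1}
mean_weight = sum(weights.values()) / len(weights)
share_weight_one = sum(1 for w in weights.values() if w == 1) / len(weights)

original = pyramid_score(weights, ["a", "c"])          # 5 / (4 + 3)
modified = pyramid_score(weights, ["a", "c"], size=3)  # 5 / (4 + 3 + 1)
```

A large share of weight-1 SCUs means most content units were mentioned by a single human, which is the low agreement the slide points to.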

18
[Figure: SCUs at each weight, plotted against SCU weight]
19
Human performance / best system

             Pyramid       Modified      Resp         ROUGE-SU4
Human        B   0.5472    B   0.4814    A   4.895    A   0.1722
Human        A   0.4969    A   0.4617    B   4.526    B   0.1552
Best system  14  0.2587    10  0.2052    4   2.85     15  0.139

Best system: 50% of human performance on manual metrics
Best system: 80% of human performance on ROUGE
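The rounded percentages on this slide can be checked directly from the scores it lists. The short calculation below divides the best system score by the best human score for each metric; taking the top-scoring human as the point of comparison is an assumption, as are the variable and function names.

```python
# Scores copied from the slide: (best human, best system) per metric.
scores = {
    "Pyramid (original)": (0.5472, 0.2587),
    "Pyramid (modified)": (0.4814, 0.2052),
    "Responsiveness": (4.895, 2.85),
    "ROUGE-SU4": (0.1722, 0.139),
}


def system_to_human_ratio(table):
    """Best system score as a fraction of the best human score."""
    return {metric: system / human for metric, (human, system) in table.items()}


ratios = system_to_human_ratio(scores)
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.0%}")
```

The manual metrics come out at roughly 43% to 58% of human performance and ROUGE-SU4 at about 81%, consistent with the rounded figures on the slide.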
20

Pyramid (original)   Pyramid (modified)   Resp           ROUGE-SU4
Sys  Score           Sys  Score           Sys  Score     Sys  Score
14   0.2587          10   0.2052          4    2.85      15   0.139
17   0.2492          17   0.1972          14   2.8       4    0.134
15   0.2423          14   0.1908          10   2.65      17   0.1346
10   0.2379          7    0.1852          15   2.6       19   0.1275
4    0.2321          15   0.1808          17   2.55      11   0.1259
7    0.2297          4    0.177           11   2.5       10   0.1278
16   0.2265          16   0.1722          28   2.45      6    0.1239
6    0.2197          11   0.1703          21   2.45      7    0.1213
32   0.2145          6    0.1671          6    2.4       14   0.1264
21   0.2127          12   0.1664          24   2.4       25   0.1188
12   0.2126          19   0.1636          19   2.4       21   0.1183
11   0.2116          21   0.1613          6    2.4       16   0.1218
26   0.2106          32   0.1601          27   2.35      24   0.118
19   0.2072          26   0.1464          12   2.35      12   0.116
28   0.2048          3    0.145           7    2.3       3    0.1198
13   0.1983          28   0.1427          25   2.2       28   0.1203
3    0.1949          13   0.1424          32   2.15      27   0.110

24
Questions

Brotzman: In "Topic-Focused Multi-document
Summarization Using an Approximate Oracle Score"
and "Bayesian Query-Focused Summarization" we
read of two methods of document summarization
that rely on a surface-level representation of
written language. They both beg the question (and
Nenkova hints at the issue by characterizing the
DUC's "coverage" as "not addressing issues such
as readability and other text qualities"), how
useful or relevant is a surface-level
representation of language, in general? The
experiments these papers conduct achieve
promising results - but is this merely because
the kinds of texts they consider are very "plain"
or fundamentally "surface-level" anyway? How do
you think the methods described could be extended
to apply to less straightforward text?
Sparck Jones: In order to develop effective
procedures it is necessary to identify and
respond to the context factors, i.e. input,
purpose and output factors, that bear on
summarising and its evaluation. (p. 1)