Multi-document Summarization and Evaluation - PowerPoint PPT Presentation

1
Multi-document Summarization and Evaluation
2
Task Characteristics
  • Input: a set of documents on the same topic
  • Retrieved during an IR search
  • Clustered by a news browser
  • Problem: same topic or same event?
  • Output: a paragraph-length summary
  • Salient information across documents
  • Similarities between topics?
  • Redundancy removal is critical (see the sketch below)

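Redundancy removal is usually handled by some form of greedy selection that penalizes overlap with already-chosen content. Below is a minimal sketch of one common pattern over bag-of-words cosine similarity; the function names, relevance scores, and 0.5 threshold are illustrative assumptions, not anything specified in this deck.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_non_redundant(sentences, relevance, max_sents=5, sim_threshold=0.5):
    """Greedily take the most relevant sentences, skipping any sentence that
    is too similar to one already chosen (thresholds are illustrative)."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    chosen = []
    for i in sorted(range(len(sentences)), key=lambda k: relevance[k], reverse=True):
        if all(cosine(vectors[i], vectors[j]) < sim_threshold for j in chosen):
            chosen.append(i)
        if len(chosen) == max_sents:
            break
    return [sentences[i] for i in sorted(chosen)]
```
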
3
Some Standard Approaches
  • Salient information = similarities
  • Pairwise similarity between all sentences
  • Cluster sentences using the similarity scores (themes)
  • Generate one sentence for each theme
  • Sentence extraction (one sentence per cluster)
  • Sentence fusion: intersect sentences within a
    theme, choose the repeated phrases, and generate a
    sentence from those phrases
  • Salient information = important words
  • Important words are simply the most frequent in
    the document set
  • SumBasic simply chooses sentences with the most
    frequent words (a rough sketch follows this list);
    Conroy expands on this
  • Daumé and Marcu have been the renegades

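The frequency idea behind SumBasic is simple enough to show directly: score sentences by the average frequency-derived probability of their words, pick the best sentence, then down-weight the words it contains so later picks cover new content. This is a rough reimplementation from the published description, not the authors' code; tokenization and the summary length are illustrative choices.

```python
from collections import Counter

def sumbasic(sentences, summary_len=3):
    """SumBasic-style extraction: word probabilities from document-set
    frequency, sentences scored by the mean probability of their words, and
    words in a chosen sentence squared (down-weighted) to limit redundancy."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    summary, remaining = [], list(range(len(sentences)))
    while remaining and len(summary) < summary_len:
        best = max(remaining, key=lambda i: sum(prob[w] for w in tokenized[i])
                                            / max(len(tokenized[i]), 1))
        summary.append(sentences[best])
        remaining.remove(best)
        for w in tokenized[best]:   # penalize words already covered
            prob[w] = prob[w] ** 2
    return summary
```
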
4
Some Variations on Task
  • Focused summarization: given a topic/query,
    generate a summary
  • Update summaries: given an event over time, tell
    us what's new
  • Multilingual summarization: generate an English
    summary of multiple documents in different
    languages

5
DUC: Document Understanding Conference
  • Established and funded by DARPA TIDES
  • Run by an independent evaluator, NIST
  • Open to summarization community
  • Annual evaluations on common datasets
  • 2001-present
  • Tasks
  • Single document summarization
  • Headline summarization
  • Multi-document summarization
  • Multi-lingual summarization
  • Focused summarization

6
DUC Evaluation
  • Gold standard
  • Human summaries written by NIST assessors
  • From 2 to 9 summaries per input set
  • Multiple metrics
  • Manual
  • Coverage (early years)
  • Pyramids (later years)
  • Responsiveness (later years)
  • Quality questions
  • Automatic
  • ROUGE (-1, -2, skip-bigrams, LCS, BE)
  • Granularity
  • Manual: sub-sentential elements
  • Automatic: sentences

7
Considerations Across Evaluations
  • Independent evaluator
  • Not always as knowledgeable as researchers
  • Impartial determination of approach
  • Extensive collection of resources
  • Determination of task
  • Appealing to a broad cross-section of the community
  • Changes over time
  • DUC 2001-2002: single- and multi-document
  • DUC 2003: headlines, multi-document
  • DUC 2004: headlines, multilingual and
    multi-document, focused
  • DUC 2005: focused summarization
  • DUC 2006: focused and a new task, up for
    discussion
  • How long do participants have to prepare?
  • When is a task dropped?
  • Scoring of text at the sub-sentential level

8
Potential Problems
9
Comparing Text Against Text
  • Which human summary makes a good gold standard?
    Many summaries are good
  • At what granularity is the comparison made?
  • When can we say that two pieces of text match?

10
Variation impacts evaluation
  • Comparing content is hard
  • All kinds of judgment calls
  • Paraphrases
  • VP vs. NP
  • Ministers have been exchanged
  • Reciprocal ministerial visits
  • Length and constituent type
  • Robotics assists doctors in the medical operating
    theater
  • Surgeons started using robotic assistants

11
Nightmare: only one gold standard
  • The system may have chosen an equally good sentence,
    just not one in the single gold standard
  • "Pinochet arrested in London on Oct 16 at a
    Spanish judge's request for atrocities against
    Spaniards in Chile."
  • "Former Chilean dictator Augusto Pinochet has
    been arrested in London at the request of the
    Spanish government."
  • In DUC 2001 (one gold standard), the choice of human
    model had a significant impact on scores (McKeown et al.)
  • Five human summaries are needed to avoid changes in
    rank (Nenkova and Passonneau)
  • DUC 2003 data
  • 3 topic sets, 1 highest scoring and 2 lowest
    scoring
  • 10 model summaries

12
Scoring
  • Two main approaches used in DUC
  • ROUGE (Lin and Hovy)
  • Pyramids (Nenkova and Passonneau)
  • Problems
  • Are the results stable?
  • How difficult is it to do the scoring?

13
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  • ROUGE: n-gram co-occurrence metrics measuring content
    overlap
  • ROUGE-n = (count of n-gram matches between the candidate
    and the model summaries) / (total number of n-grams in
    the model summaries)
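As a concrete reference point, here is a minimal sketch of ROUGE-n recall over whitespace tokens, following the ratio above. It is a simplification: the official ROUGE toolkit adds options such as stemming and stopword removal and handles multiple references differently, so treat this as illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, models, n=2):
    """ROUGE-n recall: matched n-grams / total n-grams in the model summaries.
    Whitespace tokenization only; no stemming or stopword removal."""
    cand = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for model in models:
        ref = ngrams(model.lower().split(), n)
        matched += sum(min(cnt, cand[g]) for g, cnt in ref.items())
        total += sum(ref.values())
    return matched / total if total else 0.0
```
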
14
ROUGE
  • Experimentation with different units of
    comparison: unigrams, bigrams, longest common
    subsequence, skip-bigrams, basic elements
  • Automatic and thus easy to apply
  • Important to consider confidence intervals when
    determining differences between systems (a bootstrap
    sketch follows this list)
  • Scores falling within the same interval are not
    significantly different
  • ROUGE scores place systems into large groups, so it
    can be hard to say definitively that one system is
    better than another
  • Sometimes the results are unintuitive
  • Multilingual scores as high as English scores
  • Use in speech summarization shows no
    discrimination
  • Good for training: regardless of intervals, one can
    see trends

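One simple way to obtain such intervals is a percentile bootstrap over per-topic scores; two systems whose intervals overlap would not be called significantly different. This is a generic sketch (the resample count and alpha are arbitrary illustrative values), not the exact procedure used by ROUGE or DUC.

```python
import random
import statistics

def bootstrap_ci(per_topic_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a system's mean score
    across topics (parameters are illustrative, not DUC's settings)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(per_topic_scores) for _ in per_topic_scores]
        means.append(statistics.mean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```
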
15
Comparison of Scoring Methods in DUC 2005
  • Comparisons between Pyramid (original, modified),
    responsiveness, and ROUGE-SU4
  • Pyramid scores are computed from multiple human models
    (a scoring sketch follows this list)
  • Responsiveness is just one human's judgment
  • ROUGE-SU4 treated as equivalent to ROUGE-2

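For reference, here is a minimal sketch of how the original pyramid score can be computed once SCU annotation exists, following the published description: an SCU's weight is the number of model summaries expressing it, and the peer's summed weight is divided by the best sum achievable with the same number of SCUs. The data structures here are my own assumptions, not the annotation tool's.

```python
def pyramid_score(peer_scu_weights, pyramid_weights):
    """Original pyramid score (sketch).

    peer_scu_weights: weights of the SCUs expressed in the peer summary.
    pyramid_weights:  weights of all SCUs in the pyramid.
    Score = observed weight / maximum weight obtainable with the same
    number of SCUs (picking the heaviest SCUs first)."""
    observed = sum(peer_scu_weights)
    ideal = sum(sorted(pyramid_weights, reverse=True)[:len(peer_scu_weights)])
    return observed / ideal if ideal else 0.0

# Example: a peer expressing SCUs of weight [4, 2, 1] against a pyramid whose
# three heaviest SCUs have weights [4, 4, 3] scores 7 / 11, roughly 0.64.
```
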
16
Creation of pyramids
  • Done at Columbia for each of 20 out of 50 sets
  • Primary annotator, secondary checker
  • Held round-table discussions of problematic
    constructions that occurred in this data set
  • Comma-separated lists
  • "Extractive reserves have been formed for managed
    harvesting of timber, rubber, Brazil nuts, and
    medicinal plants without deforestation."
  • General vs. specific
  • "Eastern Europe" vs. "Hungary, Poland, Lithuania,
    and Turkey"

17
Characteristics of the Responses
  • The proportion of SCUs of weight 1 is large:
    44% (D324) to 81% (D695)
  • Mean SCU weight: 1.9
  • Agreement among the human responders is quite low

18
[Chart: number of SCUs at each SCU weight]
19
Human Performance / Best System
  Rank   Pyramid     Modified    Resp       ROUGE-SU4
  1      B  0.5472   B  0.4814   A  4.895   A   0.1722
  2      A  0.4969   A  0.4617   B  4.526   B   0.1552
  3      14 0.2587   10 0.2052   4  2.85    15  0.139
  (A and B are human summarizers; numbered entries are systems.)

The best system reaches about 50% of human performance on the
manual metrics and about 80% of human performance on ROUGE.
20
System rankings under each metric (system ID followed by score):

  Pyramid (original)   Modified      Resp       ROUGE-SU4
  14  0.2587           10  0.2052    4   2.85   15  0.139
  17  0.2492           17  0.1972    14  2.8    4   0.134
  15  0.2423           14  0.1908    10  2.65   17  0.1346
  10  0.2379           7   0.1852    15  2.6    19  0.1275
  4   0.2321           15  0.1808    17  2.55   11  0.1259
  7   0.2297           4   0.177     11  2.5    10  0.1278
  16  0.2265           16  0.1722    28  2.45   6   0.1239
  6   0.2197           11  0.1703    21  2.45   7   0.1213
  32  0.2145           6   0.1671    6   2.4    14  0.1264
  21  0.2127           12  0.1664    24  2.4    25  0.1188
  12  0.2126           19  0.1636    19  2.4    21  0.1183
  11  0.2116           21  0.1613    6   2.4    16  0.1218
  26  0.2106           32  0.1601    27  2.35   24  0.118
  19  0.2072           26  0.1464    12  2.35   12  0.116
  28  0.2048           3   0.145     7   2.3    3   0.1198
  13  0.1983           28  0.1427    25  2.2    28  0.1203
  3   0.1949           13  0.1424    32  2.15   27  0.110

24
Questions
  • Brotzman: In "Topic-Focused Multi-document
    Summarization Using an Approximate Oracle Score"
    and "Bayesian Query-Focused Summarization" we
    read of two methods of document summarization
    that rely on a surface-level representation of
    written language. Both raise the question (and
    Nenkova hints at the issue by characterizing the
    DUC's "coverage" as "not addressing issues such
    as readability and other text qualities"): how
    useful or relevant is a surface-level
    representation of language in general? The
    experiments these papers conduct achieve
    promising results, but is this merely because
    the kinds of texts they consider are very "plain"
    or fundamentally "surface-level" anyway? How do
    you think the methods described could be extended
    to apply to less straightforward text?
  • Sparck Jones: "In order to develop effective
    procedures it is necessary to identify and
    respond to the context factors, i.e. input,
    purpose and output factors, that bear on
    summarising and its evaluation." (p. 1)