Title: Multidocument summarization by people and machines
1Multi-document summarization by people and
machines
- Ani Nenkova
- Department of Computer and Information Science
- University of Pennsylvania
2Why is summarization important?
3Summarizing online news
- Pan Am, bombing, Libya, suspects, Gadhafi, trial
- Libya refuses to surrender two Pan Am bombing suspects
- UK and USA ???
5People like summaries!
- User study on a report writing task
- 4 report topics
- 12 subjects with each interface
- Interface
- Newsblaster
- One line summary (Google news)
- No summary
- Main findings: multi-document summaries help
- Higher user satisfaction
- Better reports
6Some references
- Ani Nenkova, Becky Passonneau, Kathy McKeown. The pyramid method: incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, volume 4, issue 2, 2007
- Surabhi Gupta, Ani Nenkova and Dan Jurafsky. Measuring Importance and Query Relevance in Topic-focused Multi-document Summarization. ACL 2007 (short paper)
- Ani Nenkova, Lucy Vanderwende, Kathy McKeown. A compositional context-sensitive multi-document summarizer. ACM SIGIR 2006
- Kathy McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, Julia Hirschberg. Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization. ACM SIGIR 2005
- McKeown, Barzilay, Evans, Hatzivassiloglou, Klavans, Nenkova, Sable, Schiffman, Sigelman. Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster. HLT 2002
7A problem: human choice variation
- S1: Pinochet arrested in London on Oct 16 at a
Spanish judge's request for atrocities against
Spaniards in Chile.
- S2: Former Chilean dictator Augusto Pinochet has
been arrested in London at the request of the
Spanish government.
- S3: Britain caused international controversy and
Chilean turmoil by arresting former Chilean
dictator Pinochet in London.
8Why is variation a problem?
- Makes a precise task definition impossible
- Different people produce different summaries
- The same person at different times produces a
different summary
- How can an automatic system perform well?
- Evaluation of system output
- Comparison with a human model
- Switching the model leads to a different score
9Human variation: content words
- Summaries differ in vocabulary
- Differences cannot be explained by paraphrase
- 7 translations, 20 documents
- 7 summaries, 20 document sets
- Faster vocabulary growth in summarization
10Content units: better study of variation
- Semantic units
- Emerge from the analysis of several texts
- Link different surface realizations with the same
meaning
11Content unit example
- S1: Pinochet arrested in London on Oct 16 at a
Spanish judge's request for atrocities against
Spaniards in Chile.
- S2: Former Chilean dictator Augusto Pinochet has
been arrested in London at the request of the
Spanish government.
- S3: Britain caused international controversy and
Chilean turmoil by arresting former Chilean
dictator Pinochet in London.
12SCU: label, weight, contributors
- Label: London was where Pinochet was arrested
- Weight: 3
- S1: Pinochet arrested in London on Oct 16 at a
Spanish judge's request for atrocities against
Spaniards in Chile.
- S2: Former Chilean dictator Augusto Pinochet has
been arrested in London at the request of the
Spanish government.
- S3: Britain caused international controversy and
Chilean turmoil by arresting former Chilean
dictator Pinochet in London.
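The label, weight and contributors of a summary content unit (SCU) can be pictured as a small record. The sketch below is only an illustration: the class name `SCU`, the field names, and the exact contributor spans are assumptions, not the annotation format used for the corpora. The one point taken from the slide is that the weight equals the number of summaries that express the unit.

```python
from dataclasses import dataclass, field


@dataclass
class SCU:
    """Hypothetical record for one summary content unit."""
    label: str
    # Maps a summary id to the span that expresses the unit in that summary.
    # The spans used below are illustrative, not official annotations.
    contributors: dict = field(default_factory=dict)

    @property
    def weight(self) -> int:
        # Weight = number of human summaries that express this unit.
        return len(self.contributors)


scu = SCU(
    label="London was where Pinochet was arrested",
    contributors={
        "S1": "Pinochet arrested in London",
        "S2": "Pinochet has been arrested in London",
        "S3": "arresting former Chilean dictator Pinochet in London",
    },
)
print(scu.label, scu.weight)
```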
13Annotated corpora
- Document Understanding Conference
- Run by NIST
- Main forum for summarization research
- Annual evaluations on common datasets
- DUC 2006, DUC 2007
- DUC 2005
- 20 sets of 7 human summaries
- DUC 2004
- 50 sets of 4 human summaries
- DUC 2004
- 20 sets both input and 4 summaries annotated
14Importance class distribution
- Few units are expressed by everyone
- Many units are expressed by only one person
- The distributions of words and content units are
very similar
15Content pyramids
- The most important content is in top tier
- Good content is somewhere in the pyramid
16Ideally informative summary
- Does not include an SCU from a lower tier unless
all SCUs from higher tiers are included as well
22Different equally good summaries
- Pinochet arrested
- Arrest in London
- Pinochet is a former Chilean dictator
- Accused of atrocities against Spaniards
23Different equally good summaries
- Pinochet arrested
- Arrest in London
- On Spanish warrant
- Chile protests
24Diagnostic: why is a summary bad?
25Importance of content
- Can observe distribution in human summaries
- Assign relative importance
- Empirical rather than subjective
- The more people agree, the more important
26Pyramid score for evaluation
- New summary with n content units
- Estimates the percentage of information that is
maximally important
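One common way to compute such a score is to divide the total SCU weight found in the new summary by the weight of an ideally informative summary with the same number of content units, that is, one that takes its n units from the highest tiers first. The sketch below assumes this formulation; the function name `pyramid_score` and the example weights are hypothetical, and units in the summary that do not appear in the pyramid are assumed to carry weight 0.

```python
def pyramid_score(summary_scu_weights, pyramid_weights):
    """Ratio of observed SCU weight to the best weight achievable with n SCUs.

    summary_scu_weights: weight of each content unit found in the new summary
                         (0 for units that are not in the pyramid).
    pyramid_weights:     weights of all SCUs in the pyramid.
    """
    n = len(summary_scu_weights)
    if n == 0:
        return 0.0
    observed = sum(summary_scu_weights)
    # An ideally informative summary takes the n highest-weighted SCUs.
    ideal = sum(sorted(pyramid_weights, reverse=True)[:n])
    return observed / ideal if ideal else 0.0


pyramid = [4, 3, 3, 2, 2, 1, 1, 1]          # made-up pyramid
print(pyramid_score([4, 2, 1], pyramid))    # observed 7, ideal 4 + 3 + 3 = 10
```

A summary that picks the three top-tier units in this made-up pyramid would score 1.0, which matches the notion of an ideally informative summary.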
27Characteristics of human summarization
- Zipfian distribution of content units
- Non-deterministic process
- Can this process of content selection be modeled?
- Current automatic methods are completely
deterministic
- Would automatic summarizers become better if they
were based on a cognitively plausible model?
28How traditional summarizers work
- Extract representative sentences
- Diagram: Input text 1, Input text 2, Input text 3, Summary
29Frequency as feature
- Suggested in earliest research
- Never used alone in current systems
- Large scale test collections not available till
recently
- Presentation outline
- Data analysis
- human summaries
- Automatic summarizer
- Considerations after selecting a feature
30Do people include frequent content in their
summaries?
- Yes: both for words and content units
- 30 test sets (10 docs each)
- 4 human summaries (100 words each)
31Do people agree on including frequent content?
- Yes
- Very frequent words in the input tend to be those
that many people include in a summary
32Content units: better study of variation
- Semantic units
- Emerge from the analysis of several texts
- Link different surface realizations with the same
meaning
33Content unit frequency
- Frequent content units appear in human summaries
- People agree on the inclusion of content units
frequent in the input
- 0.64 correlation coefficient between weight in
the input and weight in pyramid
34Summarizer approach and features
- Compositional
- Content words are basic blocks of meaning
- Assign importance to them
- Choose a composition function
- Assign weight to sentence
- Context sensitive
- Relative importance changes after each selection
- Update weights
- Each is validated
35C2S2 algorithm
- Step 1: Estimate word weights (probabilities)
- Step 2: Estimate sentence weights
- Step 3: Choose best sentence
- Step 4: Update word weights
- Step 5: Go to 2 if desired length not reached
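The five steps can be sketched as a short greedy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: word weights are taken to be relative frequencies in the input, the sentence weight is the average word weight (one of the composition functions discussed later in the talk), and the update sets the weight of every selected word to 0, as a later slide suggests. The function name `c2s2_sketch` and the toy input are hypothetical.

```python
from collections import Counter


def c2s2_sketch(sentences, max_words):
    """Greedy selection; `sentences` is a list of content-word lists.

    Returns the indices of the chosen sentences in selection order.
    """
    # Step 1: word weights as relative frequencies in the whole input.
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    weights = {w: c / total for w, c in counts.items()}

    chosen, length = [], 0
    remaining = {i for i, s in enumerate(sentences) if s}
    while remaining and length < max_words:
        # Step 2: sentence weight = average weight of its words (assumption).
        def score(i):
            return sum(weights[w] for w in sentences[i]) / len(sentences[i])

        # Step 3: pick the best sentence (ties go to the earliest sentence).
        best = max(sorted(remaining), key=score)
        chosen.append(best)
        remaining.discard(best)
        length += len(sentences[best])

        # Step 4: context update, selected words stop contributing.
        for w in sentences[best]:
            weights[w] = 0.0
        # Step 5: loop again until the length budget is used up.
    return chosen


docs = [
    ["pinochet", "arrested", "london"],
    ["pinochet", "arrested", "london", "spanish", "warrant"],
    ["chile", "protested", "arrest"],
    ["pinochet", "arrested"],
]
print(c2s2_sketch(docs, max_words=5))
```

In this toy input the second pick is the Chile sentence rather than another Pinochet sentence, because the update in Step 4 removes the weight of words already covered. The last selected sentence may overshoot the length budget; truncation is left out of the sketch.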
36Using words as basic blocks
- What humans do
- Include frequently repeated content in their
summaries
- Agree on frequently repeated content
- Summary log-likelihood: H(umans) and S(ystems)
- Parameters estimated from the input
- Multinomial model
- Ranking between HIGH (-198) and LOW (-227): HSSSSSSSSSSHSSSHSSHHSHHHHH
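The log-likelihood of a summary under a multinomial model whose parameters come from the input can be computed as follows. This is a sketch under assumptions that the slides do not spell out: word probabilities are plain relative frequencies in the input, and summary words that never occur in the input are skipped rather than smoothed. The function name `summary_log_likelihood` is hypothetical.

```python
import math
from collections import Counter


def summary_log_likelihood(input_words, summary_words):
    """Multinomial log-likelihood of a summary given input word frequencies."""
    input_counts = Counter(input_words)
    total = sum(input_counts.values())
    # Assumption: words unseen in the input are ignored (no smoothing).
    summary_counts = Counter(w for w in summary_words if w in input_counts)
    n = sum(summary_counts.values())
    # log of the multinomial coefficient n! / (n_1! * n_2! * ...)
    log_coeff = math.lgamma(n + 1) - sum(
        math.lgamma(c + 1) for c in summary_counts.values()
    )
    # sum over words of n_i * log p_i
    log_probs = sum(
        c * math.log(input_counts[w] / total) for w, c in summary_counts.items()
    )
    return log_coeff + log_probs


input_words = ["a", "a", "a", "b"]
frequent = summary_log_likelihood(input_words, ["a", "a"])
rare = summary_log_likelihood(input_words, ["b", "b"])
print(frequent > rare)
```

A summary built from words that are frequent in the input gets a higher log-likelihood than one built from rare words.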
37Frequency in related work
- Frequency is an often used feature
- But claims that frequency used alone does not
give good results
- Why?
- The composition function matters
38Composition function CF
- Choose a composition function CF
- CF = Product
- CF = Sum
- CF = Average
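A plausible reading of the three options is that the sentence weight is the product, the sum, or the average of the probabilities of the words in the sentence. The sketch below follows that reading; the function names and the example probabilities are hypothetical. For probabilities below 1, the product shrinks with every added word and so favors short sentences, the sum grows with every added word and so favors long ones, and the average is normalized for length.

```python
import math


def cf_product(word_probs):
    # Multiplying probabilities below 1 penalizes every extra word.
    return math.prod(word_probs)


def cf_sum(word_probs):
    # Adding probabilities rewards every extra word.
    return sum(word_probs)


def cf_average(word_probs):
    # Length-normalized sentence weight.
    return sum(word_probs) / len(word_probs)


short = [0.2, 0.1]             # made-up word probabilities
long = [0.2, 0.1, 0.1, 0.1]

for cf in (cf_product, cf_sum, cf_average):
    preferred = "short" if cf(short) > cf(long) else "long"
    print(cf.__name__, "prefers the", preferred, "sentence")
```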
39Evaluation results: 50 summaries
- The choice of composition function has a big
impact
- Good (sum, average) to very bad (product)
content selection
- Number of sentences varies considerably
40Comparison with other systems
- 2004 Document Understanding Conference
- Only one system significantly better
- Out of 16 participants
- 2004 Multi-lingual summarization task
- Only one system significantly better
- Out of 10 participants
- One of the best in avoiding repetition
- 0.6 content units per summary vs. 3.4/1.4 for the
(second) best system
41Need to update: context sensitivity
- Importance is not static
- Example: Pinochet was arrested in London. Chile protested the arrest, which was on a Spanish arrest warrant.
- There are repetitive sentences in the input
- Update the word weight by setting it to 0
- Related to earlier work (MMR)
- But we conclusively demonstrate the usefulness of
the approach
- Significant increase in repetition without
update
42What have we learned?
- Frequency is a powerful feature
- We now have lots of data to test feature utility
- Composition function is very important
- Some frequency-based summarizers can be close to
the baseline
- Normalizations are important to report
- Context adjustment considerably changes
performance
- Details not always clear for many systems