Multidocument summarization by people and machines


1
Multi-document summarization by people and machines

Ani Nenkova
Department of Computer and Information Science
University of Pennsylvania

2
Why is summarization important?
3
Summarizing online news
Pan Am, bombing, Libya, suspects, Gadhafi, trial
Libya refuses to surrender two Pan Am bombing suspects
UK and USA
???
5
People like summaries!

User study on a report writing task
4 report topics
12 subjects with each interface
Interfaces compared:
Newsblaster
One line summary (Google news)
No summary
Main findings: multi-document summaries help
Higher user satisfaction
Better reports

6
Some references

Ani Nenkova, Becky Passonneau, Kathy McKeown
The pyramid method: incorporating human content selection variation in summarization evaluation
ACM Transactions on Speech and Language Processing, volume 4, issue 2, 2007

Surabhi Gupta, Ani Nenkova and Dan Jurafsky
Measuring Importance and Query Relevance in Topic-focused Multi-document Summarization
ACL 2007 (short paper)

Ani Nenkova, Lucy Vanderwende, Kathy McKeown
A compositional context-sensitive multi-document summarizer
ACM SIGIR 2006

Kathy McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, Julia Hirschberg
Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization
ACM SIGIR 2005

McKeown, Barzilay, Evans, Hatzivassiloglou, Klavans, Nenkova, Sable, Schiffman, Sigelman
Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster
HLT 2002

7
A problem: human choice variation

S1: Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
S2: Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3: Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.

8
Why is variation a problem?

Makes a precise task definition impossible
Different people produce different summaries
The same person at different times produces a
different summary
How can an automatic system perform well?
Evaluation of system output
Comparison with a human model
Switching the model leads to a different score
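
To make the last point concrete, here is a minimal sketch that scores one sentence against two different human models. The metric (recall of the model's content words) and the tiny stopword list are assumptions chosen for illustration, not the evaluation measure used in the talk. The three sentences are the S1 to S3 summaries from the previous slide.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "at", "the", "in", "on", "of", "for", "by", "and", "has", "been"}


def content_words(text):
    """Lowercase the text, split it into tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def unigram_recall(candidate, model):
    """Fraction of the model's content words that also occur in the candidate."""
    model_words = content_words(model)
    return len(model_words & content_words(candidate)) / len(model_words)


S1 = ("Pinochet arrested in London on Oct 16 at a Spanish judge's request "
      "for atrocities against Spaniards in Chile.")
S2 = ("Former Chilean dictator Augusto Pinochet has been arrested in London "
      "at the request of the Spanish government.")
S3 = ("Britain caused international controversy and Chilean turmoil by "
      "arresting former Chilean dictator Pinochet in London.")

# Treat S2 as the output to be judged; only the human model changes.
score_vs_s1 = unigram_recall(S2, S1)
score_vs_s3 = unigram_recall(S2, S3)
print(score_vs_s1, score_vs_s3)
```

The candidate text is identical in both calls, yet the two scores differ because the model summaries use different words.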

9
Human variation: content words

Summaries differ in vocabulary
Differences cannot be explained by paraphrase

7 translations, 20 documents
7 summaries, 20 document sets
Faster vocabulary growth in summarization
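
One way to measure vocabulary growth is to count the distinct words seen so far as each additional text is read. The sketch below assumes simple lowercase tokenization and uses the three Pinochet summaries as stand-in data; the translation and summary sets behind the slide are not reproduced here.

```python
import re


def vocabulary_growth(texts):
    """Return the cumulative number of distinct tokens after each text."""
    seen = set()
    growth = []
    for text in texts:
        seen.update(re.findall(r"[a-z0-9']+", text.lower()))
        growth.append(len(seen))
    return growth


# Stand-in data: three human summaries of the same event.
summaries = [
    "Pinochet arrested in London on Oct 16 at a Spanish judge's request "
    "for atrocities against Spaniards in Chile.",
    "Former Chilean dictator Augusto Pinochet has been arrested in London "
    "at the request of the Spanish government.",
    "Britain caused international controversy and Chilean turmoil by "
    "arresting former Chilean dictator Pinochet in London.",
]

growth = vocabulary_growth(summaries)
print(growth)
```

A curve that keeps rising steeply as texts are added indicates that each new author brings many new words, which is the pattern the slide reports for summaries.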

10
Content units: better study of variation

Semantic units
Emerge from the analysis of several texts
Link different surface realizations with the same
meaning

11
Content unit example

S1: Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
S2: Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3: Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.

12
SCU (summary content unit): label, weight, contributors

Label: London was where Pinochet was arrested
Weight = 3
S1: Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile.
S2: Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government.
S3: Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.
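
A content unit can be represented as a small record holding its label and its contributors, with the weight derived from how many summaries contribute. The sketch below is one possible representation, not the annotation format used for the corpora; the class names and the exact contributor spans are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Contributor:
    summary_id: str  # which human summary the span comes from
    text: str        # the span expressing the unit (illustrative guess)


@dataclass
class SCU:
    label: str
    contributors: list = field(default_factory=list)

    @property
    def weight(self):
        """Number of distinct summaries that express this unit."""
        return len({c.summary_id for c in self.contributors})


london = SCU(
    label="London was where Pinochet was arrested",
    contributors=[
        Contributor("S1", "Pinochet arrested in London"),
        Contributor("S2", "Augusto Pinochet has been arrested in London"),
        Contributor("S3", "arresting former Chilean dictator Pinochet in London"),
    ],
)
print(london.label, london.weight)
```

Deriving the weight from the contributors, rather than storing it separately, keeps the two from drifting apart when an annotation is edited.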

13
Annotated corpora

Document Understanding Conference
Run by NIST
Main forum for summarization research
Annual evaluations on common datasets
DUC 2006, DUC 2007
DUC 2005: 20 sets of 7 human summaries
DUC 2004: 50 sets of 4 human summaries
DUC 2004: 20 sets, both input and 4 summaries annotated

14
Importance class distribution
Few units are expressed by everyone
Many units are expressed by only one person
The distributions of words and content units are very similar
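
The importance class distribution is a histogram over weights: for each weight, how many content units were expressed by exactly that many people. The sketch below computes it for a small hypothetical annotation; the unit labels and their assignment to S1 to S3 are invented for illustration and are not taken from the DUC data.

```python
from collections import Counter


def weight_histogram(scu_to_summaries):
    """Map each weight to the number of content units with that weight."""
    return Counter(len(set(ids)) for ids in scu_to_summaries.values())


# Hypothetical annotation: content unit label -> summaries expressing it.
annotation = {
    "Pinochet arrested": ["S1", "S2", "S3"],
    "Arrest in London": ["S1", "S2", "S3"],
    "Pinochet is a former Chilean dictator": ["S2", "S3"],
    "Arrest at Spanish request": ["S1", "S2"],
    "Accused of atrocities against Spaniards": ["S1"],
    "Arrest took place on Oct 16": ["S1"],
    "Chilean turmoil": ["S3"],
    "International controversy": ["S3"],
}

hist = weight_histogram(annotation)
for weight in sorted(hist, reverse=True):
    print(weight, hist[weight])
```

Even in this toy case the units mentioned by a single summary outnumber those mentioned by all three, which is the shape the slide describes.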
15
Content pyramids

The most important content is in top tier
Good content is somewhere in the pyramid

16
Ideally informative summary

Does not include an SCU from a lower tier unless
all SCUs from higher tiers are included as well
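
The condition above can be checked mechanically: a selection is ideally informative if no content unit left out has a higher weight than one that was included. The sketch below is one way to express that check and to build such a selection; the function names and the weights are hypothetical.

```python
def is_ideally_informative(chosen, weights):
    """True if every unit left out weighs no more than the lightest chosen unit."""
    if not chosen:
        return True
    lowest_chosen = min(weights[label] for label in chosen)
    return all(w <= lowest_chosen
               for label, w in weights.items() if label not in chosen)


def ideal_selection(weights, n):
    """Pick n units, filling from the highest tier downwards."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:n])


# Hypothetical pyramid: unit label -> weight (tier).
weights = {"A": 3, "B": 3, "C": 2, "D": 2, "E": 1}

print(is_ideally_informative({"A", "B", "C"}, weights))  # top tier complete
print(is_ideally_informative({"A", "B", "D"}, weights))  # same tiers, other unit
print(is_ideally_informative({"A", "C"}, weights))       # skips B in the top tier
print(ideal_selection(weights, 3))
```

Note that several different selections of the same size can satisfy the condition, because units within a tier are interchangeable.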

22
Different equally good summaries

Pinochet arrested
Arrest in London
Pinochet is a former Chilean dictator
Accused of atrocities against Spaniards

23
Different equally good summaries

Pinochet arrested
Arrest in London
On Spanish warrant
Chile protests

24
Diagnostic: why is a summary bad?

Good

Less relevant summary

25
Importance of content

Can observe distribution in human summaries
Assign relative importance
Empirical rather than subjective
The more people agree, the more important

26
Pyramid score for evaluation

New summary with n content units

Estimates the percentage of information that is
maximally important
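
A minimal sketch of this score, following the definition in the pyramid method paper listed in the references: the total weight of the content units in the new summary, divided by the total weight of an ideally informative summary with the same number n of units. The pyramid weights below are hypothetical; the unit labels are borrowed from the two example summaries shown earlier.

```python
def pyramid_score(summary_units, pyramid):
    """Observed weight divided by the best weight achievable with n units.

    summary_units: labels of the content units found in the new summary;
                   units absent from the pyramid contribute weight 0.
    pyramid:       dict mapping unit label -> weight.
    """
    n = len(summary_units)
    observed = sum(pyramid.get(label, 0) for label in summary_units)
    best = sum(sorted(pyramid.values(), reverse=True)[:n])
    return observed / best if best else 0.0


# Hypothetical weights for illustration only.
pyramid = {
    "Pinochet arrested": 3,
    "Arrest in London": 3,
    "Pinochet is a former Chilean dictator": 2,
    "On Spanish warrant": 2,
    "Accused of atrocities against Spaniards": 1,
    "Chile protests": 1,
}

summary_a = ["Pinochet arrested", "Arrest in London",
             "Pinochet is a former Chilean dictator",
             "Accused of atrocities against Spaniards"]
summary_b = ["Pinochet arrested", "Arrest in London",
             "On Spanish warrant", "Chile protests"]

score_a = pyramid_score(summary_a, pyramid)
score_b = pyramid_score(summary_b, pyramid)
print(score_a, score_b)
```

Under these assumed weights the two summaries share only two units yet receive the same score, so different content choices can be judged equally good.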

27
Characteristics of human summarization

Zipfian distribution of content units
Non-deterministic process
Can this process of content selection be modeled?
Current automatic methods are completely
deterministic
Would automatic summarizers become better if they
were based on a cognitively plausible model?

28
How traditional summarizers work