Title: Evaluating Summary Content Selection
1Evaluating Summary Content Selection
- Pyramid Method Work in Progress
- Rebecca Passonneau
- Ani Nenkova
2OUTLINE
- Motivation
- Problems
- DUC Evaluations
- Pyramid Method Current Status
- Open Issues
- Conclusions
3EVALUATION GOALS
- Define parameters of the problem
- What is summarization?
- Compare systems
- Is the metric meaningful?
- Track progress
- When does output improve?
- Cost Effectiveness
- Can it be (partly) automated?
4PICTURING CONTENT OVERLAP
- Philippine Airlines (PAL) experienced a crisis in
1998. Unable to make payments on a $2.1 billion
debt, it was faced by a pilot's strike in June
and the region's currency problems which reduced
passenger numbers and inflated costs. On
September 23 PAL shut down after the ground crew
union turned down a settlement which it accepted
two . . .
- Starting in May 1998, Philippine Airlines (PAL)
laid off 5000 of its 13,000 workers. A 3-week
pilots' strike in June and a currency crisis that
reduced passenger numbers made payments on PAL's
$2 billion debt impossible. President
Estrada brokered an agreement to suspend
collective bargaining for 10 years in exchange
for 20% of PAL stock and union seats on its
board. The large ground crew union initially voted
no. After PAL shut down operations for 13 days
starting Sept. 23rd, leaving much of the country
without air service and foreign . . .
5OBSTACLES
- Humans select different content
- Humans present same content differently
- Lack clear standard of good summary
- Contrasts with translation: L1(C) → L2(C)
- Need objective method to get at subjective
notion of what a summary IS
6PREVIOUS WORK Pessimism
- Human Judgments
- Extraction
- Low Agreement (Rath, 1961; Salton et al., 1997)
- Inconsistent over time (Rath, 1961; Lin & Hovy,
2002)
- Abstraction (depends on individual's orientation;
Gerrig et al., 1991)
- Automated Evaluation
- Extraction (Pastra & Saggion, 2003, EACL)
- 3-humans multiple models inconclusive
- Abstraction (Lin & Hovy, 2002, ACL)
- Accepts inconsistent judgments as target
- Difficult to extend
7PREVIOUS WORK Optimism
- Good design methodology leads to better
understanding → areas of agreement
- High compression rate leads to high agreement
(Jing et al., 1998)
- Content variation offset by logarithmic growth in
pool of distinct content units (Halteren &
Teufel, 2003)
- Content can be reliably annotated (Beck et al.,
1991)
8HOW TO GET AT CONTENT FROM ITS EXPRESSION
- ADAPT BLEU MT EVALUATION
- Collect multiple model summaries
- Quantify ngram overlap
- IDENTIFY ABSTRACT CONTENT UNITS
- DUC
- Reading Comprehension
- A THIRD WAY
- Content unit level
- Multiple expressions of same content unit
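The "quantify ngram overlap" step of the BLEU-style option can be sketched as below. This is a minimal illustration, assuming whitespace tokenization and BLEU's clipped n-gram precision against several model summaries; the function names are hypothetical and the slides do not specify the exact variant considered.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(peer, models, n=1):
    """BLEU-style clipped n-gram precision of a peer against several models."""
    peer_counts = ngrams(peer.lower().split(), n)
    if not peer_counts:
        return 0.0
    # For each n-gram, the most times it occurs in any single model summary.
    max_ref = Counter()
    for model in models:
        for gram, count in ngrams(model.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A peer n-gram is credited at most as often as it appears in one model.
    matched = sum(min(count, max_ref[gram]) for gram, count in peer_counts.items())
    return matched / sum(peer_counts.values())
```

With the peer "pal shut down operations" and the models "pal shut down" and "pal stopped all operations", every unigram is matched but only two of the three bigrams are, which shows how surface overlap penalizes a different wording of the same content.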
9DUC THE CURRENT APPROACH
- Yearly evaluation of systems on new data sets
- NIST evaluations performed by humans
- Widely cited results
- Does it work?
- Compare current systems
- Track individual system progress
- Track community progress from year to year
- Identify specific strengths/weaknesses
- Can it eventually be automated?
10DUC SCORING METHOD
- Datasets human/machine summaries
- Designate model human summary
- (Automatically) identify content units in model
summary
- Split peer summaries into sentences
- Human judges evaluate peer against model
11COMPUTE DUC SCORES
- For each EDU
- Does peer sentence express any part?
- How much? (0, 20, 40, 60, 80, 100)
- Average EDU percent overlap scores
- Resulting score ranges from 0 to 1
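The averaging step above can be sketched as follows. The judgment values in the usage note are invented for illustration, and `duc_coverage` is a hypothetical name; the sketch assumes one judgment per EDU on the six-point scale listed above.

```python
ALLOWED = {0, 20, 40, 60, 80, 100}

def duc_coverage(edu_percents):
    """Average per-EDU overlap judgments and rescale to the 0-1 range."""
    if not edu_percents:
        raise ValueError("need at least one EDU judgment")
    for p in edu_percents:
        if p not in ALLOWED:
            raise ValueError(f"unexpected judgment: {p}")
    # Each judgment is a percentage, so divide by 100 per EDU.
    return sum(edu_percents) / (100 * len(edu_percents))
```

For a model summary with four EDUs judged 100, 40, 0 and 20, the score is 160/400 = 0.4. Every EDU counts equally here, whatever its importance.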
12DRAWBACKS TO DUC SCORES
- Very sensitive to choice of model
- All model units created equal
- Difficult to interpret scores
- Human summary scores as low as 0.1
- Scores vary for same summarizer
- Scores vary for same summary
- Systems cannot be differentiated
13DUC SCATTERPLOT
14FOUNDATION OF PYRAMID
- A few CUs appear in many summaries
- Humans can identify same/different CUs
- → Weight CUs differentially
15MULTIPLE GOOD SUMMARIES
- This pyramid predicts 6 different good summaries
consisting of 4 SCUs
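The pyramid figure itself is not reproduced here, so the sketch below assumes a shape that yields the stated count: 2 SCUs in the top tier and 4 in the tier below. Under that assumption, a best summary of 4 SCUs takes both top-tier SCUs plus any 2 of the 4 below, giving C(4,2) = 6 choices. The tier sizes, weights and SCU names are all assumptions.

```python
from itertools import combinations

def optimal_summaries(tiers, size):
    """Enumerate SCU sets of a given size with the maximum total weight.

    tiers: list of (weight, [scu names]) pairs.
    """
    weight = {scu: w for w, scus in tiers for scu in scus}
    candidates = list(combinations(sorted(weight), size))
    best = max(sum(weight[s] for s in c) for c in candidates)
    return [c for c in candidates if sum(weight[s] for s in c) == best]

# Assumed tier sizes: 2 SCUs of weight 4, 4 SCUs of weight 3.
tiers = [(4, ["a", "b"]), (3, ["c", "d", "e", "f"])]
good = optimal_summaries(tiers, 4)
```

Here `len(good)` is 6, and each of the six sets contains both top-tier SCUs.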
16SCU ANNOTATION EXAMPLE
17PAL PYRAMID TIER W=3 (N=4)
- SCU1 PAL has $2.1 billion debt
- H2 PAL's $2 billion debt [1]
- I1 and with a rising $2.1 billion debt, [1]
- J3 PAL is buried under a 2.2 billion dollar
debt [1]
- SCU2 PAL enforced a shutdown
- H5 After PAL shut down operations [2]
- I1 stopped all operations [2]
- J5 by a [2] shutdown [2]
- SCU3 PAL in crisis
- H1 Philippine Airlines [3]
- I1 Philippines Airlines (PAL), [3] devastated [3]
- J1 The fate [3] is uncertain. [3]
18PAL PYRAMID TIER W=2 (N=8)
- SCU5 PAL unable to repay debt
- H2 made payments on [5] impossible. [5]
- J3 it cannot repay [5]
- SCU6 PAL experienced pilots' strike
- H2 A [5] pilots' strike [6]
- I1 by pilot [5] strikes [6]
- SCU7 this PAL crisis occurred in 1998
- H1 1998, [7]
- I1 in 1998 [7]
- . . .
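The two tiers shown suggest that an SCU's weight is the number of distinct summaries that express it. A minimal sketch of that bookkeeping follows, using the contributor labels from the slides (summary letter plus sentence number). The function names are hypothetical, and only the SCUs listed on these slides are included.

```python
# Contributor labels copied from the two tier slides.
scu_contributors = {
    "SCU1": ["H2", "I1", "J3"],
    "SCU2": ["H5", "I1", "J5"],
    "SCU3": ["H1", "I1", "J1"],
    "SCU5": ["H2", "J3"],
    "SCU6": ["H2", "I1"],
    "SCU7": ["H1", "I1"],
}

def scu_weight(contributors):
    """Weight = number of distinct summaries (first letter of each label)."""
    return len({label[0] for label in contributors})

def build_tiers(scus):
    """Group SCU names by weight, highest tier first."""
    tiers = {}
    for name, contributors in scus.items():
        tiers.setdefault(scu_weight(contributors), []).append(name)
    return dict(sorted(tiers.items(), reverse=True))

tiers = build_tiers(scu_contributors)
```

This places SCU1 to SCU3 in the weight-3 tier and SCU5 to SCU7 in the weight-2 tier, matching the slides.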
19ANNOTATION KEEPING TRACK
- H1 Starting in May [23] 1998, [7] Philippine
Airlines [3] laid off 5000 of its 13,000 workers. [24]
- H2 A [6] 3-week [25] pilots' strike [6] in June [11]
and a currency crisis [12] that reduced passenger
numbers [13]
- H3 President Estrada brokered an agreement to
suspend collective bargaining for 10 years [17] in
exchange for 20% of PAL stock and union seats on its
board. [26]
- H4 The large ground crew union initially voted
no. [18]
- H5 After PAL shut down operations [2] for 13
days [4] starting Sept. 23rd, [8] leaving much of the
country without air service [27] and foreign carriers
flying some domestic routes, [9] 61% voted yes. [19]
- . . .
20RELIABILITY
- Two Annotators → Consensus Annotation
- Number of SCUs: 33 versus 37 → 35
- Count of Pairwise Agreements (PAs)
- SCU Label
- SCU Members
- Comparison of Annotations to Consensus
- Recall/Precision not valid
- 65/69 PAs
- Most disagreements due to membership size
- Only 2 conflicts
21ANOTHER CONSISTENCY TEST
Pyramid       A    H    C    J
Consensus    .95  .89  .85  .76
Annotation1  .97  .87  .83  .82
Annotation2  .94  .87  .84  .74
22PYRAMID SCORE PART I
- For N summaries, score each peer against a
pyramid with N-1 tiers
- Peer annotation
- Gives SCU size
- Yields a residue of SCUs not in pyramid
- Compute D (Observed distribution) where D = sum of
weights of SCUs
- E.g., Summary A (D30042), size = 20
- D = (6x3) + (6x2) + (4x1) + (4x0) = 34
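The Summary A example can be checked with a few lines of code. This sketch assumes the peer annotation is stored as a map from SCU weight to the number of the peer's SCUs at that weight, with weight 0 standing for the residue of SCUs not found in the pyramid; `observed_score` is a hypothetical name.

```python
def observed_score(scu_counts_by_weight):
    """Sum of weights over a peer's SCUs.

    scu_counts_by_weight maps weight -> number of SCUs in the peer with that
    weight; weight 0 covers SCUs that do not appear in the pyramid.
    """
    return sum(w * n for w, n in scu_counts_by_weight.items())

# Summary A: 6 SCUs of weight 3, 6 of weight 2, 4 of weight 1, 4 residue.
summary_a = {3: 6, 2: 6, 1: 4, 0: 4}
size = sum(summary_a.values())
d = observed_score(summary_a)
```

This gives `size == 20` and `d == 34`, as on the slide.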
23PYRAMID SCORE PART II
- Compute Max (Ideal) = sum of weights of SCUs, given
the summary SCU size
- Pyramid of H, I, J
- 9 SCUs in tier, w = 3
- 10 SCUs in tier, w = 2
- 12 SCUs in tier, w = 1
- Size = 20, Max = (9x3) + (10x2) + (1x1) = 48
- P = D/Max; P_A = 34/48 = .71
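The ideal score amounts to filling the summary with SCUs from the top tier downward until the SCU size is reached. A minimal sketch, assuming the pyramid is stored as a map from tier weight to tier size (`max_score` is a hypothetical name, and the observed value 34 is taken from Part I):

```python
def max_score(tier_sizes, size):
    """Best attainable weight sum for `size` SCUs, filling top tiers first."""
    total, remaining = 0, size
    for weight in sorted(tier_sizes, reverse=True):
        take = min(tier_sizes[weight], remaining)
        total += weight * take
        remaining -= take
    return total

pyramid_hij = {3: 9, 2: 10, 1: 12}   # tier weight -> number of SCUs
ideal = max_score(pyramid_hij, 20)   # 9x3 + 10x2 + 1x1
p_a = 34 / ideal                     # observed D for Summary A over Max
```

Here `ideal` is 48 and `p_a` rounds to .71. If a summary has more SCUs than the pyramid holds, the surplus contributes nothing to the maximum.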
24COMPARISON TO DUC SCORES: HUMAN SUMMARIES
25MACHINE SUMMARY EXAMPLE
African countries voted in June to ignore the
U.N. flight ban which was imposed in 1992 to try
and force Libya to hand over for trial two
suspects wanted in the 1988 bombing of an
American airliner over Lockerbie, Scotland. The
reported jailing of the three officials comes as
Gadhafi is under pressure to accept a plan to
turn over for trial two other Libyans wanted for
the 1988 bombing of Pan am flight 103 over
Lockerbie, Scotland, that led to 270 deaths. The
visit was Farrakhan's
26COMPARISON TO DUC SCORES: MACHINE SUMMARIES
SYSTEM DUC PYRAMID
Sys06 .30 .79
Sys13 .03 .24
Sys14 .25 .51
Sys16 .25 .26
Sys17 .03 .17
Sys18 .03 .20
Sys20 .10 .64
27MACHINE SUMMARIES
- System 6
- PAL, Asia's oldest airline, has been unable to
make payments on dlrs 2.1 billion debt after
being devasted by a pilots' strike and by Asia's
currency crisis. PAL earlier accepted a
preliminary investment offer from Cathay Pacific,
Ailing Philippine Airlines and prospective
investor Cathy Pacific Airways have clashed over
. . .
28MACHINE SUMMARIES
- System 16
- President Joseph Estrada on Saturday urged
militant unionists at Philippine Airlines to
accept a vote by workers approving a 10-year
no-strike deal to revive the debt-laden airline.
President Joseph Estrada said Saturday the
financially troubled airlines will resume its
international flights on Sunday by flying him to
Singapore . . .
29MACHINE SUMMARIES
- System 17
- Christmas is a sacred holiday in the Philippines,
and nowhere is that more evident than at the
headquarters of Philippine Airlines. But Ramos,
who was intent on privatizing the economy, opened
the industry to competition, licensing rivals
like Air Philippines, Cebu Pacific, and Grand
Air. PAL closed for nearly 2 weeks on Sep. 23
after . . .
30OPEN ISSUES
- Distribution of SCUs NOT an independent variable
- Ordering
- Knowledge
- Informational Goal
- Can Pyramid Scoring be Automated?
31SCU INTERDEPENDENCIES
- SCU4 presupposes SCU1
- SCU1 (w4) PAL has a debt > $2 billion
- SCU4 (w3) PAL cannot make its debt payments
- SCU7, SCU8 depend on SCU2
- SCU2 (w4) PAL shut down operations
- SCU7 (w3) shutdown began on 9/23
- SCU8 (w3) shutdown lasted 2 weeks
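One way to make these interdependencies concrete is to record them as a prerequisite map and flag any selected SCU whose prerequisite is missing. This is only an illustrative sketch; the slides raise the issue without proposing a mechanism, and `unmet_dependencies` is a hypothetical name.

```python
# Dependencies as stated on the slide: SCU4 presupposes SCU1;
# SCU7 and SCU8 depend on SCU2.
DEPENDS_ON = {"SCU4": {"SCU1"}, "SCU7": {"SCU2"}, "SCU8": {"SCU2"}}

def unmet_dependencies(selected, depends_on=DEPENDS_ON):
    """Map each selected SCU to the prerequisites missing from the selection."""
    selected = set(selected)
    missing = {}
    for scu in sorted(selected):
        gap = depends_on.get(scu, set()) - selected
        if gap:
            missing[scu] = gap
    return missing
```

A selection of SCU1, SCU7 and SCU8 reports SCU2 as missing for both SCU7 and SCU8, even though a purely weight-based score treats the three SCUs as independent.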
32SCUs and DEPENDENCY/TAG GR
- A3
- On September 23 [7]
- PAL shut down [2]
- after the ground crew union turned down a
settlement [18]
- which it accepted two weeks later. [19]
- SCU7
- 1 On IN 5 shut t0
- 2 September NNP 4 PAL t2
- 3 23 CD 4 PAL t2
33LARGE CONSTITUENTS
- 1. PAL experienced a crisis in 1998.
- 2. Unable to make payments on a 2.1 billion
debt,
- 3. it was faced by a pilot's strike in June
- 4. and the region's currency problems
- 5. which reduced passenger numbers and
inflated costs.
- 6. On September 23 pal shut down
- 7. after the ground crew union turned down a
settlement
- 8. which it accepted two weeks later.
- 9. PAL resumed domestic flights on October 7
- 10. and resumed international flights on
October 26.
- 11. Resolution of the basic financial problems
was elusive, however,
- 12. and as of December 18 pal was still 2.2
billion in debt
- 13. and pal was losing close to 1 million a
day.
34DOCSET TFIDF
- TERMS: 2, airline, billion, day, debt, pal (6 of
13 LCs)
- 1 1. Philippine Airlines (pal) experienced a
crisis in 1998.
- SCU3 w3
- 3 2. Unable to make payments on a 2.1 billion
debt,
- SCU1 w4
- 1 6. On September 23 pal shut down
- SCU2 w4, SCU7 w3
- 1 9. pal resumed domestic flights on October 7
- SCU10 w2
- 4 12. and as of December 18 pal was still 2.2
billion in debt
- NO SCU
- 1 13. and losing close to 1 million a day.
- SCU15 w2
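One reading of the number in front of each large constituent (LC) is the count of distinct docset terms it contains. The sketch below tests that reading. It assumes exact token matching with no stemming and a tokenizer that splits "2.1" into "2" and "1"; both are guesses, chosen because they reproduce the counts on the slide. `term_hits` is a hypothetical name.

```python
import re

# Top docset terms as listed on the slide.
TERMS = {"2", "airline", "billion", "day", "debt", "pal"}

def term_hits(text, terms=TERMS):
    """Count the distinct docset terms that occur as tokens in a constituent."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens & terms)

constituents = {
    1: "Philippine Airlines (pal) experienced a crisis in 1998.",
    2: "Unable to make payments on a 2.1 billion debt,",
    6: "On September 23 pal shut down",
    9: "pal resumed domestic flights on October 7",
    12: "and as of December 18 pal was still 2.2 billion in debt",
    13: "and losing close to 1 million a day.",
}
hits = {n: term_hits(text) for n, text in constituents.items()}
```

Under these assumptions the counts come out as 1, 3, 1, 1, 4 and 1 for LCs 1, 2, 6, 9, 12 and 13. LC 12 has the most term hits yet maps to no SCU, so term counts alone do not predict content-unit membership.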
35CONCLUSIONS
- Define parameters of the problem
- What is summarization?
- Compare systems and/or humans
- Is the metric meaningful?
- Track progress
- When does output improve?
- Cost Effectiveness
- Can it be (partly) automated?