Title: LexPageRank: Prestige in Multi-Document Text Summarization
1LexPageRank Prestige in Multi-Document Text
Summarization
- Gunes Erkan and Dragomir R. Radev
- Department of EECS, School of Information
- University of Michigan
- ACL 2004
2Abstract
- This paper considers an approach for computing
sentence importance based on the concept of
eigenvector centrality (prestige), called LexPageRank
- In this model, a sentence connectivity matrix is
constructed based on cosine similarity
- The experimental results on DUC 2004 data show that
this approach outperforms centroid-based
summarization and is quite successful compared to
other summarization systems
3Introduction
- Text summarization is the process of
automatically creating a compressed version of a
given text that provides useful information for
the user
- The approach taken here is to assess the
centrality of each sentence in a cluster and
include the most important ones in the summary
- Two new measures of centrality are introduced,
Degree and LexPageRank, inspired by the prestige
concept in social networks
4Sentence centrality and centroid-based
summarization
- Extractive summarization produces summaries by
choosing a subset of the sentences in the
original documents
- Centrality of a sentence is often defined in
terms of the centrality of the words that it
contains
- The centroid of a cluster is a pseudo-document
which consists of words that have frequency*IDF
scores above a predefined threshold
- In centroid-based summarization (Radev et al.,
2000), the sentences that contain more words from
the centroid of the cluster are considered
central
- Centroid-based summarization has given promising
results in the past
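A minimal sketch of centroid-based scoring may make the idea concrete. The function name `centroid_scores`, its parameters, the whitespace tokenizer, and the choice to sum centroid values per sentence are all assumptions made for illustration; they are not taken from the slides or from MEAD.

```python
import math
from collections import Counter

def centroid_scores(sentences, doc_freq, num_docs, threshold):
    """Return (centroid, scores) for a cluster given as a list of sentences.

    doc_freq maps a word to the number of documents containing it in some
    background collection of num_docs documents (hypothetical inputs).
    """
    # Term frequency of each word over the whole cluster.
    tf = Counter(word for sent in sentences for word in sent.lower().split())

    # The centroid keeps only words whose frequency * IDF exceeds the threshold.
    centroid = {}
    for word, count in tf.items():
        idf = math.log(num_docs / doc_freq.get(word, 1))
        if count * idf > threshold:
            centroid[word] = count * idf

    # Assumed scoring rule: sum the centroid values of the distinct words
    # that appear in the sentence.
    scores = [
        sum(centroid.get(word, 0.0) for word in set(sent.lower().split()))
        for sent in sentences
    ]
    return centroid, scores
```

Sentences that share no word with the centroid receive a score of zero under this sketch, which matches the intuition that they are not central to the cluster.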
5Prestige-based sentence centrality
- We hypothesize that the sentences that are
similar to many of the other sentences in a
cluster are more central (or prestigious) to the
topic
- There are two issues
  - How to define similarity between two sentences
    - Cosine
  - How to compute the overall prestige of a sentence
  given its similarity to other sentences
    - Degree centrality
    - Eigenvector centrality and LexPageRank
6Prestige-based sentence centrality
- A cluster may be represented by a cosine
similarity matrix
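As a rough illustration, the cosine similarity matrix of a cluster could be built as follows. This sketch assumes plain word-count vectors and whitespace tokenization; any IDF weighting of the vector entries is omitted, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(sentences):
    """Symmetric matrix whose (i, j) entry is the cosine of sentences i and j."""
    vectors = [Counter(sent.lower().split()) for sent in sentences]
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

The matrix is symmetric with ones on the diagonal for non-empty sentences, so it can be read directly as a weighted undirected graph over the sentences.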
7Prestige-based sentence centrality
- Most entries of the cosine similarity matrix are nonzero
8Prestige-based sentence centrality
- Degree centrality
- Since we are interested in significant
similarities in the matrix, we can eliminate some
low values by defining a threshold, so that the
cluster can be viewed as an undirected graph
- We define degree centrality as the degree of each
node in the similarity graph
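Degree centrality can be sketched in a few lines once a similarity matrix is available. The function name and the decision to ignore a sentence's similarity to itself are assumptions of this sketch.

```python
def degree_centrality(sim, threshold):
    """Count, for each sentence, the other sentences whose similarity
    to it exceeds the threshold (self-similarity is ignored here)."""
    n = len(sim)
    return [
        sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
        for i in range(n)
    ]

# Toy 3-sentence similarity matrix (made-up values).
sim = [
    [1.0, 0.5, 0.05],
    [0.5, 1.0, 0.3],
    [0.05, 0.3, 1.0],
]
low = degree_centrality(sim, 0.1)
high = degree_centrality(sim, 0.4)
```

Raising the threshold removes weak edges, so the same cluster yields a sparser graph and lower degrees.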
11Prestige-based sentence centrality
- Issue for degree centrality
  - Several unwanted sentences vote for each other and
  raise their prestige
  - This situation can be avoided by considering
  where the votes come from and taking the prestige
  of the voting node into account in weighting each
  vote
- Eigenvector centrality and LexPageRank
  - PageRank (Page et al., 1998) is a method proposed
  for assigning a prestige score to each page in
  the web independent of a specific query
  - The score of a page depends on the number of pages
  that link to that page as well as the individual
  scores of the linking pages
12Prestige-based sentence centrality
- The PageRank of page A
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
where $T_1, \ldots, T_n$ are the pages that link to page A,
$d$ is the damping factor, and $C(T_i)$ is the number of
outgoing links from page $T_i$
- This recursively defined value can be computed by
forming the binary adjacency matrix of the web,
normalizing this matrix so that row sums equal
1, and finding the principal eigenvector of the
normalized matrix
- The PageRank of the i-th page equals the i-th entry
in the eigenvector
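The recursive PageRank definition can be approximated by iterating it to a fixed point, which is a simple alternative to calling an eigenvector solver. In the sketch below the function name, the default damping factor of 0.85, the tolerance, and the assumption that every page has at least one outgoing link are all illustrative choices, not details from the slides.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links[i] lists the pages that page i links to. Every page is assumed
    to have at least one outgoing link (no dangling pages).
    """
    n = len(links)
    pr = [1.0] * n
    for _ in range(max_iter):
        new = [1.0 - d] * n
        for i, outgoing in enumerate(links):
            share = d * pr[i] / len(outgoing)  # page i splits its vote evenly
            for j in outgoing:
                new[j] += share
        if max(abs(a - b) for a, b in zip(new, pr)) < tol:
            return new
        pr = new
    return pr

# Pages 0 and 1 both link to page 2; page 2 links back to both.
scores = pagerank([[2], [2], [0, 1]])
```

In this toy graph page 2 collects the full vote of two pages and therefore ends up with the highest score, while pages 0 and 1 tie.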
13Prestige-based sentence centrality
- This method can be easily applied to the cosine
similarity graph to find the most prestigious
sentences in a document
- We call this new measure of sentence similarity
LexPageRank
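Applying the PageRank recursion to a sentence graph might look like the following sketch. It assumes a symmetric, precomputed cosine similarity matrix; the function name, the threshold of 0.1, the damping factor of 0.85, and the fixed iteration count are hypothetical choices for illustration.

```python
def lexpagerank(sim, threshold=0.1, d=0.85, iterations=100):
    """Prestige score per sentence from a symmetric similarity matrix."""
    n = len(sim)
    # Undirected graph: i and j are neighbors when their similarity
    # exceeds the threshold (self-links are left out in this sketch).
    neighbors = [
        [j for j in range(n) if j != i and sim[i][j] > threshold]
        for i in range(n)
    ]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each neighbor j passes on its score divided by its own degree.
        scores = [
            (1.0 - d) + d * sum(scores[j] / len(neighbors[j]) for j in neighbors[i])
            for i in range(n)
        ]
    return scores

# Star-shaped toy cluster: sentence 0 is similar to all the others.
sim = [
    [1.0, 0.5, 0.5, 0.5],
    [0.5, 1.0, 0.0, 0.0],
    [0.5, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 1.0],
]
scores = lexpagerank(sim)
```

Sentence 0 receives the whole vote of each of its three neighbors, so it ranks first; a sentence with no neighbors would keep only the baseline score of 1 - d.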
14Prestige-based sentence centrality
damping factor 1
15Prestige-based sentence centrality
- Advantage over Centroid
  - It accounts for information subsumption among
  sentences
  - It prevents unnaturally high IDF scores from
  boosting the score of a sentence that is
  unrelated to the topic
16Experiments on DUC 2004 data
- DUC 2004 data was used in our experiments
- Task 2 involves summarization of 50 TDT English
clusters
- Task 4 is to produce summaries of machine
translation output (in English) of 24 Arabic TDT
documents
- The recall-based measure ROUGE is adopted, and
665-byte summaries are produced for each cluster
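To give a feel for a recall-based measure, the sketch below computes a simplified unigram recall against a single reference summary and truncates a summary to a byte budget. It is not the official ROUGE toolkit: stemming, stop-word handling, and multiple references are left out, and both function names are hypothetical.

```python
from collections import Counter

def truncate_bytes(text, limit=665):
    """Cut a summary to at most `limit` bytes of UTF-8."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")

def unigram_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate
    (counts are clipped, so repeated words are not over-credited)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

recall = unigram_recall("the cat sat", "the cat sat on the mat")
```

Because the measure is recall-oriented, a fixed length budget such as 665 bytes is what keeps a system from scoring well simply by producing a longer summary.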
17Experiments on DUC 2004 data
- MEAD summarization toolkit
  - Extractive multi-document summarization
  - Consists of three components
    - Feature extractor (document -> feature vector)
      - Centroid, Position and Length
    - Combiner (feature vector -> scalar value)
    - Reranker (the scores are adjusted upward or
    downward)
      - MMR (Maximal Marginal Relevance), CSIS
      (Cross-Sentence Information Subsumption)
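The combiner and reranker stages can be pictured with a small sketch. This is not MEAD's code: the weighted-sum combiner, the greedy redundancy filter, the 0.7 redundancy threshold, and all names below are assumptions chosen to illustrate the data flow from feature vectors to a selected set of sentences.

```python
def combine(features, weights):
    """Combiner: reduce a feature dict to one scalar with a weighted sum."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(scores, sim, max_sentences, redundancy_threshold=0.7):
    """Greedy reranker: pick sentences by descending score and skip any
    sentence too similar to one that was already picked."""
    chosen = []
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    for i in order:
        if all(sim[i][j] <= redundancy_threshold for j in chosen):
            chosen.append(i)
        if len(chosen) == max_sentences:
            break
    return chosen

# Made-up feature values for three sentences.
feature_rows = [
    {"centroid": 0.9, "position": 1.0, "length": 1.0},
    {"centroid": 0.8, "position": 0.5, "length": 1.0},
    {"centroid": 0.3, "position": 0.2, "length": 1.0},
]
weights = {"centroid": 1.0, "position": 1.0, "length": 0.0}
scores = [combine(row, weights) for row in feature_rows]

# Sentences 0 and 1 are near-duplicates in this toy similarity matrix.
sim = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
selected = rerank(scores, sim, max_sentences=2)
```

Here the second-best sentence is dropped because it largely repeats the best one, so the lower-scoring but non-redundant sentence is selected instead.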
22Conclusions
- A novel approach to defining sentence centrality
based on graph-based prestige scoring of
sentences
- We have introduced two different methods, Degree
and LexPageRank, for computing prestige in
similarity graphs
- The experimental results are quite promising
- Even the simplest approach, degree centrality, is
a good enough heuristic to perform better than
lead-based and centroid-based summaries