1
Timestamped Graphs: Evolutionary Models of Text
for Multi-document Summarization
  • Ziheng Lin and Min-Yen Kan
  • Department of Computer Science, National
    University of Singapore, Singapore

2
Summarization
  • Traditionally, heuristics for extractive
    summarization
  • Cue/stigma phrases
  • Sentence position (relative to document,
    section, paragraph)
  • Sentence length
  • TFIDF, TF scores
  • Similarity (with title, context, query)
  • With the advent of machine learning, heuristic
    weights for different features are tuned by
    supervised learning
  • In the last few years, graphical representations
    of text have shed new light on the summarization
    problem

3
Prestige as sentence selection
  • One motivation for using graphical methods was to
    model the problem as finding the prestige of nodes
    in a social network
  • PageRank uses a random walk to smooth the effect
    of non-local context
  • HITS and SALSA model hubs and authorities
  • In summarization, these led to TextRank and
    LexRank
  • Contrast with previous graphical approaches
    (Salton et al. 1994)
  • Did we leave anything out of our representation
    for summarization?
  • Yes, the notion of an evolving network

4
Social networks change!
  • Naturally evolving networks (Dorogovtsev and
    Mendes, 2001)
  • Citation networks: new papers can cite old ones,
    but the old network is static
  • The Web: new pages are added, with an old page
    connecting each to the web graph; old pages may
    update their links

5
Evolutionary models for summarization
  • Writers and readers often follow conventional
    rhetorical styles - articles are not written or
    read in an arbitrary way
  • Consider the evolution of texts using a very
    simplistic model
  • Writers write from the first sentence onwards in
    a text
  • Readers read from the first sentence onwards of
    a text
  • A simple model: sentences get added
    incrementally to the graph

6
Timestamped Graph Construction
  • Approach
  • These assumptions suggest iteratively adding
    sentences into the graph in chronological order.
  • At each iteration, consider which edges to add
    to the graph.
  • For a single document, this is simple and
    straightforward: add the 1st sentence, followed
    by the 2nd, and so forth, until the last sentence
    is added
  • For multiple documents, treat them as multiple
    instances of single documents that evolve in
    parallel, i.e., add the 1st sentences of all
    documents, followed by all 2nd sentences, and so
    forth (a sketch follows below)
  • This doesn't really model chronological ordering
    between articles; we fix this later
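A minimal sketch of this construction loop, assuming each new sentence links to its most similar sentences already in the graph (helper names and the skew parameter anticipate later slides; this is illustrative, not the authors' code):

def build_tsg(documents, sim, edges_per_step=1, skew=0):
    """Iteratively build a timestamped graph.
    documents: list of documents, each a list of sentence strings
    sim: pairwise sentence similarity (e.g., cosine or Jaccard)
    edges_per_step: edges to instantiate per new node (parameter e)
    skew: timesteps to wait before starting the next document (parameter s)
    """
    nodes, edges = [], []                      # node id = (doc index, sentence index)
    max_len = max(len(d) for d in documents)
    for t in range(max_len + skew * (len(documents) - 1)):
        for j, doc in enumerate(documents):
            row = t - skew * j                 # which sentence of doc j enters at time t
            if 0 <= row < len(doc):
                u = (j, row)
                # connect u to the most similar nodes already in the graph
                ranked = sorted(nodes, reverse=True,
                                key=lambda v: sim(doc[row], documents[v[0]][v[1]]))
                edges.extend((u, v) for v in ranked[:edges_per_step])
                nodes.append(u)
    return nodes, edges

With skew=0 the 1st sentences of all documents enter together, then all 2nd sentences, and so on; with skew=1 document j starts one timestep after document j-1.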

7
Timestamped Graph Construction
  • Model
  • Documents as columns: di = document i
  • Sentences as rows: sj = the jth sentence of a
    document

8
Timestamped Graph Construction
  • A multi-document example

[Figure: a 3x3 example graph, with documents doc1-doc3 as columns and sentences sent1-sent3 as rows]
9
  • An example TSG: DUC 2007 cluster D0703A-A

10
Timestamped Graph Construction
  • This is just one instance of a TSG
  • Let's generalize and formalize
  • Def: A timestamped graph algorithm tsg(M) is a
    9-tuple (d, e, u, f, σ, t, i, s, τ) that
    specifies a resulting algorithm that takes as
    input the set of texts M and outputs a graph G

Edge properties: d, e, u, f · Node properties: σ, t, i, s · Input text transformation function: τ
11
Edge properties (d, e, u, f)
  • Edge Direction (d)
  • Forward, backward, or undirected
  • Edge Number (e)
  • number of edges to instantiate per timestep
  • Edge Weight (u)
  • weighted or unweighted edges
  • Inter-document factor (f)
  • penalty factor for links between documents in
    multi-document sets.

12
Node properties (σ, t, i, s)
  • Vertex selection function σ(u, G)
  • One strategy: among those nodes not yet
    connected to u in G, choose the one with the
    highest similarity to u (see the sketch at the
    end of this list)
  • Similarity functions: Jaccard, cosine, concept
    links (Ye et al., 2005)
  • Text unit type (t)
  • Most extractive algorithms use sentences as
    elementary units
  • Node increment factor (i)
  • How many nodes get added at each timestep
  • Skew degree (s)
  • Models how nodes in multi-document graphs are
    added
  • Skew degree: how many timesteps to wait before
    adding the 1st sentence of the next document
  • Let's illustrate this
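One such vertex selection strategy, as a minimal self-contained sketch (the string node representation and the Jaccard helper are assumptions for illustration):

def jaccard(u, v):
    # token-overlap similarity between two sentence strings
    a, b = set(u.split()), set(v.split())
    return len(a & b) / max(1, len(a | b))

def sigma(u, nodes, edges, sim=jaccard):
    # among nodes not yet connected to u, return the most similar one
    connected = {b for (a, b) in edges if a == u} | {a for (a, b) in edges if b == u}
    candidates = [v for v in nodes if v != u and v not in connected]
    return max(candidates, key=lambda v: sim(u, v), default=None)

# e.g., sigma("the cat sat", ["a cat stood up", "dogs bark"], edges=[])
# returns "a cat stood up"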

13
Skew Degree Examples
  • time(d1) < time(d2) < time(d3) < time(d4)

[Figure: timestamped graphs over documents d1-d4, skewed by 1 and skewed by 2]
Freely skewed: only add a new document when one of its nodes would be linked to by some node chosen by the vertex selection function σ
14
Input text transformation function (τ)
  • Document segmentation function (τ)
  • Problem observed in some clusters: some
    documents in a multi-document cluster are very
    long
  • It takes many timesteps to introduce all of
    their sentences, causing too many edges to be
    drawn
  • τ segments long documents into several
    sub-documents (a sketch follows the figure below)
  • This solution is rather ad hoc; we hope to
    investigate it further in current and future work

[Figure: document d5 segmented into sub-documents d5a and d5b]
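A minimal sketch of such a segmentation function, assuming a simple fixed length threshold (both the threshold and the splitting policy are illustrative; the slide leaves them open):

def tau(documents, max_len=30):
    # split any document longer than max_len sentences into sub-documents
    segmented = []
    for doc in documents:
        for i in range(0, len(doc), max_len):
            segmented.append(doc[i:i + max_len])
    return segmented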
15
Timestamped Graph Construction
  • Representations
  • We can model a number of different algorithms
    using this 9-tuple formalism
  • (d, e, u, f, σ, t, i, s, τ)
  • The given toy example:
  • (f, 1, 0, 1, max-cosine-based, sentence, 1, 0,
    null)
  • LexRank graphs:
  • (u, N, 1, 1, cosine-based, sentence, Lmax, 0,
    null)
  • N = total number of sentences in the cluster;
    Lmax = the maximum document length
  • i.e., all sentences are added into the graph in
    one timestep, each connected to all others, and
    cosine scores are assigned as edge weights
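The 9-tuple can be captured directly as a configuration record; a sketch (field names and value encodings are assumptions for illustration):

from dataclasses import dataclass

@dataclass
class TSGSpec:
    d: str        # edge direction: 'f'orward, 'b'ackward, or 'u'ndirected
    e: object     # edges per timestep (int, or 'N' for fully connected)
    u: int        # 1 = weighted edges, 0 = unweighted
    f: float      # inter-document penalty factor
    sigma: str    # vertex selection / similarity strategy
    t: str        # text unit type
    i: object     # node increment per timestep (int, or 'Lmax')
    s: int        # skew degree
    tau: object   # document segmentation function, or None

toy_example = TSGSpec('f', 1, 0, 1, 'max-cosine', 'sentence', 1, 0, None)
lexrank     = TSGSpec('u', 'N', 1, 1, 'cosine', 'sentence', 'Lmax', 0, None)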

16
TSG-based summarization
  • Methodology
  • Evaluation
  • Analysis

17
System Overview
  • Sentence splitting
  • Detect and mark sentence boundaries
  • Annotate each sentence with the doc ID and the
    sentence number
  • E.g., XIE19980304.0061 = 4 March 1998 from
    Xinhua News; XIE19980304.0061-14 = the 14th
    sentence of this document
  • Graph construction
  • Construct the TSG in this phase

18
System Overview
  • Sentence Ranking
  • Apply a topic-sensitive random walk on the graph
    to redistribute the weights of the nodes
  • Sentence extraction
  • Extract the top-ranked sentences
  • Two different modified MMR re-rankers are used,
    depending on whether it is the main or the update
    task

19
Evaluation
  • Datasets: DUC 2005, 2006 and 2007
  • Evaluation tool: ROUGE, an n-gram based
    automatic evaluation
  • Each dataset contains 45 or 50 clusters; each
    cluster contains a query and 25 documents
  • We evaluate several parameters:
  • Do different e values affect the summarization
    process?
  • How do topic-sensitivity and edge weighting
    perform in running PageRank?
  • How does skewing the graph affect the information
    flow in the graph?

20
Evaluation on number of edges (e)
  • Tried different e values
  • Optimal performance: e = 2
  • At e = 1, the graph is too loosely connected,
    not suitable for PageRank → very low performance
  • At e = N, the system reduces to LexRank

[Figure: ROUGE performance across e values, peaking at e = 2; e = N corresponds to LexRank]
21
Evaluation (other edge parameters)
  • PageRank: generic vs. topic-sensitive
  • Edge weight (u): unweighted vs. weighted
  • Optimal performance: topic-sensitive PageRank
    and weighted edges

Topic-sensitive Weighted edges ROUGE-1 ROUGE-2
No No 0.39358 0.07690
Yes No 0.39443 0.07838
No Yes 0.39823 0.08072
Yes Yes 0.39845 0.08282
22
Evaluation on skew degree (s)
  • Different skew degrees: s = 0, 1 and 2
  • Optimal performance: s = 1
  • s = 2 introduces a delay interval that is too
    large
  • Need to try freely skewed graphs

Skew degree ROUGE-1 ROUGE-2
0 0.36982 0.07580
1 0.37268 0.07682
2 0.36998 0.07489
23
Holistic Evaluation in DUC
  • We participated in DUC 2007 with an extractive
    TSG-based system
  • Main task: 12th for ROUGE-2, 10th for ROUGE-SU4
    among 32 systems
  • Update task: 3rd for ROUGE-2, 4th for ROUGE-SU4
    among 24 systems
  • Used a modified version of maximal marginal
    relevance to penalize links to previously read
    articles
  • An extension of the inter-document factor (f)
  • The TSG formalism is better tailored to deal
    with update / incremental text tasks
  • A new method that may be competitive with
    current approaches
  • Other top-scoring systems may do sentence
    compression, not just extraction

24
Conclusion
  • Proposed a timestamped graph model for text
    understanding and summarization
  • Adds sentences one at a time
  • Parameterized model with nine variables
  • Canonicalizes the representation of several
    graph-based summarization algorithms
  • Future Work
  • Freely skewed model
  • Empirical and theoretical properties of TSGs
    (e.g., in-degree distribution)

25
Backup Slides
  • 25 minute talk total
  • 26 Apr 2007, 11:50-12:15

26
Differences for main and update task processing
  • Main task
  • Construct a TSG for the input cluster
  • Run topic-sensitive PageRank on the TSG
  • Apply first modified version of MMR to extract
    sentences
  • Update task
  • Cluster A
  • Construct a TSG for cluster A
  • Run topic-sensitive PageRank on the TSG
  • Apply the second modified version of MMR to
    extract sentences
  • Cluster B
  • Construct a TSG for clusters A and B
  • Run topic-sensitive PageRank on the TSG; only
    retain sentences from B
  • Apply the second modified version of MMR to
    extract sentences
  • Cluster C
  • Construct a TSG for clusters A, B and C
  • Run topic-sensitive PageRank on the TSG; only
    retain sentences from C
  • Apply the second modified version of MMR to
    extract sentences

27
Sentence Ranking
  • Once a timestamped graph is built, we want to
    compute a prestige score for each node
  • PageRank: an iterative method that lets the
    weights of the nodes redistribute until stability
    is reached
  • Similarities as edges → weighted edges; query →
    topic-sensitive

[Equation: topic-sensitive PageRank score, combining a topic-sensitive (Q) portion with a standard random walk term]
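A reconstruction of that score in its standard topic-sensitive random-walk form (the slide's exact notation is not recoverable; here d is the damping factor, C the set of all sentences, rel(s|Q) the relevance of sentence s to the query Q, and w(v,s) the similarity edge weight):

\[
\mathrm{Score}(s) = d \cdot \underbrace{\frac{\mathrm{rel}(s \mid Q)}{\sum_{z \in C} \mathrm{rel}(z \mid Q)}}_{\text{topic-sensitive } (Q) \text{ portion}}
  + (1 - d) \cdot \underbrace{\sum_{v \in \mathrm{adj}(s)} \frac{w(v, s)}{\sum_{z \in \mathrm{adj}(v)} w(v, z)} \, \mathrm{Score}(v)}_{\text{standard random walk term}}
\]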
28
Sentence Extraction: Main task
  • Original MMR integrates a penalty based on the
    maximal similarity between the candidate and any
    one selected document
  • Ye et al. (2005) introduced a modified MMR that
    integrates a penalty based on the total
    similarity of the candidate sentence to all
    selected sentences
  • Score(s) = PageRank score of s; S = the set of
    selected sentences
  • This is used in the main task

[Equation: modified MMR; the penalty sums the similarity of the candidate with all previously selected sentences]
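A sketch of that score, reconstructed from the slide's labels (λ is an assumed penalty weight, not the authors' notation):

\[
\mathrm{MMR}(s) = \mathrm{Score}(s) - \lambda \underbrace{\sum_{t \in S} \mathrm{sim}(s, t)}_{\text{penalty: all previously selected sentences}}
\]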
29
Sentence Extraction: Update task
  • The update task assumes readers have already
    read the previous cluster(s)
  • This implies we should not select sentences that
    carry information redundant with the previous
    cluster(s)
  • We propose a modified MMR for the update task
  • It considers the total similarity of the
    candidate sentence with all selected sentences
    and with sentences in previously-read cluster(s)
  • P contains some top-ranked sentences from the
    previous cluster(s)

[Equation: update-task MMR; an additional penalty term measures overlap with the previous cluster(s)]
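A sketch of the update-task score under the same assumed notation (λ, S, and Score(s) as above; P as defined in this slide):

\[
\mathrm{MMR}_{\mathrm{upd}}(s) = \mathrm{Score}(s) - \lambda \Big( \sum_{t \in S} \mathrm{sim}(s, t) + \underbrace{\sum_{p \in P} \mathrm{sim}(s, p)}_{\text{previous cluster overlap}} \Big)
\]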
30
References
  • Günes Erkan and Dragomir R. Radev. 2004.
    LexRank: Graph-based centrality as salience in
    text summarization. Journal of Artificial
    Intelligence Research, (22).
  • Rada Mihalcea and Paul Tarau. 2004. TextRank:
    Bringing order into texts. In Proceedings of
    EMNLP 2004.
  • S.N. Dorogovtsev and J.F.F. Mendes. 2001.
    Evolution of networks. Submitted to Advances in
    Physics on 6 March 2001.
  • Sergey Brin and Lawrence Page. 1998. The anatomy
    of a large-scale hypertextual Web search engine.
    Computer Networks and ISDN Systems, 30(1-7).
  • Jon M. Kleinberg. 1999. Authoritative sources in
    a hyperlinked environment. In Proceedings of the
    ACM-SIAM Symposium on Discrete Algorithms, 1999.
  • Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen
    Kan. 2005. NUS at DUC 2005: Understanding
    documents via concept links. In Proceedings of
    DUC 2005.