Title: Summarization of XML Documents
1. Summarization of XML Documents
2. Outline
- Motivation
- System for XML Summarization
- Ranking Model and Summary Generation
- Example Summaries
- Conclusion and Future Work
3. Motivation
XML Document Collection (e.g. IMDB)
XML Document
- Types of XML Document Summaries
- Generic summary: summarizes the entire contents of the document.
- Query-biased summary: summarizes those parts of the document which are relevant to the user's query.
4.
- Aims
- We aim at summaries which are
- Generated Automatically
- Highly constrained by size
- Highly Informative
- High Coverage
- Challenges
- Structure is as important as text
- Varying text length
5. System for XML Summarization
[Diagram: XML Doc -> Info Unit Generator -> Tag Units and Text Units -> RANKING UNIT (Tag Ranker and Text Ranker, using Corpus Statistics) -> Ranked Tag units and Ranked Text units -> SUMMARY GENERATOR (given Summary Size) -> Summary]
6. Information Units of an XML Document
7. Ranking Unit
I. Tag Ranking
8. II. Text Ranking
- Two categories of text
- Entities
- Regular text
9.
- Ranking is done based on context of occurrence.
- No redundancy in tag context (e.g. actor names, genre)
- Redundancy in tag context (e.g. plots, goofs, trivia items)
Tag context
Document context
Corpus context
10. Correlated tags and text
Related tag units are often siblings of each other, e.g. Actor and Role
Inclusion Principle
Case 1
Case 2
11. Generation of Summary
Consider the following tag rank table
To generate a summary with 30 tags, 15 actor tags, 9 keyword tags and 6 trivia tags would be required.
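The 15/9/6 split is what a proportional division of the 30-tag budget would give. The following is a minimal sketch under that assumption. The rank scores 0.5, 0.3 and 0.2 are hypothetical values chosen only because they reproduce the split (the rank table itself is not reproduced here), and `allocate_tags` is an illustrative helper name, not part of the presented system.

```python
import math


def allocate_tags(tag_ranks, summary_size):
    """Split a tag budget across tag types in proportion to their rank scores."""
    total = sum(tag_ranks.values())
    # Ideal (possibly fractional) share of the budget for each tag type.
    shares = {tag: summary_size * score / total for tag, score in tag_ranks.items()}
    # Take the whole part of each share first.
    counts = {tag: math.floor(share) for tag, share in shares.items()}
    # Hand out any leftover slots to the largest fractional remainders.
    leftover = summary_size - sum(counts.values())
    by_remainder = sorted(shares, key=lambda tag: shares[tag] - counts[tag], reverse=True)
    for tag in by_remainder[:leftover]:
        counts[tag] += 1
    return counts


# Hypothetical rank scores (assumption: chosen to match the 15/9/6 example).
tag_ranks = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
print(allocate_tags(tag_ranks, 30))
```

The largest-remainder step only matters when the shares are not whole numbers; it keeps the total equal to the requested summary size.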
12. Generating the summary with 30 tags
13. A Few Example Summaries
Titanic.xml - Summaries
16. Thanks!
17. Appendix
Informativeness
18. Coverage
19. Ranking Model
I. TAG RANKER
Mixture Model of Typicality and Specialty
- Typicality: How typical is the tag in the corpus?
20.
- Specialty: How unusually frequent or infrequent is the tag in the current document compared to an average document of the corpus?
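The exact typicality and specialty formulas are not preserved in this transcript, so the sketch below only illustrates the shape of such a mixture. Everything in it is an assumption: typicality is taken as the fraction of corpus documents containing the tag, specialty as the normalized deviation of the tag's count from the corpus average, and `lam` as a hypothetical mixing weight. Documents are modeled as plain dicts mapping tag names to counts.

```python
def typicality(tag, corpus):
    """Assumed definition: fraction of corpus documents that contain the tag."""
    return sum(1 for doc in corpus if doc.get(tag, 0) > 0) / len(corpus)


def specialty(tag, doc, corpus):
    """Assumed definition: deviation of the tag's count in this document
    from the corpus average, normalized to the range [0, 1]."""
    avg = sum(d.get(tag, 0) for d in corpus) / len(corpus)
    count = doc.get(tag, 0)
    if avg == 0 and count == 0:
        return 0.0
    return abs(count - avg) / (count + avg)


def tag_score(tag, doc, corpus, lam=0.5):
    """Mixture of the two signals; lam is a hypothetical mixing weight."""
    return lam * typicality(tag, corpus) + (1 - lam) * specialty(tag, doc, corpus)


# Toy corpus: each document is a dict of tag counts (illustrative data only).
corpus = [
    {"actor": 10, "trivia": 2},
    {"actor": 12},
    {"actor": 8, "trivia": 1, "goof": 4},
]
doc = corpus[2]
for tag in sorted(doc, key=lambda t: tag_score(t, doc, corpus), reverse=True):
    print(tag, round(tag_score(tag, doc, corpus), 3))
```

Using the absolute deviation means both unusually frequent and unusually infrequent tags score high on specialty, which matches the wording of the slide.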
21.
- Text with redundancy in tag context
Sort terms by frequency and take the top m terms as the centroid query
Relevance
Similarity is calculated using Maximal Marginal Relevance (MMR)
Finally,
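The centroid query and MMR steps can be sketched as follows. This is an illustration under stated assumptions, not the formulas from the slide (which are not preserved here): whitespace tokenization, cosine similarity over raw term frequencies, and the standard MMR trade-off with a hypothetical weight `lam`. The function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid_query(texts, m):
    """Keep the m most frequent terms across all text units as the query."""
    freq = Counter(term for text in texts for term in tokenize(text))
    return Counter(dict(freq.most_common(m)))


def mmr_rank(texts, m=5, lam=0.7):
    """Order text units by MMR: relevance to the centroid query,
    penalized by similarity to units that were already selected."""
    query = centroid_query(texts, m)
    vectors = [Counter(tokenize(text)) for text in texts]
    remaining = list(range(len(texts)))
    ranked = []

    def mmr(i):
        relevance = cosine(vectors[i], query)
        redundancy = max((cosine(vectors[i], vectors[j]) for j in ranked), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while remaining:
        best = max(remaining, key=mmr)
        ranked.append(best)
        remaining.remove(best)
    return [texts[i] for i in ranked]


# Toy plot-like text units with one exact duplicate (illustrative data only).
texts = ["the ship sinks", "the ship sinks", "a love story on the ship"]
for unit in mmr_rank(texts, m=3, lam=0.5):
    print(unit)
```

With the duplicate in the input, the redundancy penalty pushes the second copy below the less relevant but novel unit, which is the behavior MMR is meant to provide for redundant tag contexts such as plots or trivia.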
22. Text without redundancy in tag context
Redundancy at tag level
No redundancy at tag level
is set empirically
23.
- A Relative Count Matrix is constructed
- Given two tags Ti and Tj, with Ti ranked higher, the relative importance of Tj with respect to Ti is calculated by dividing both P(Ti|D) and P(Tj|D) by P(Tj|D) (the resulting ratio P(Ti|D)/P(Tj|D) shows how many Ti tags are taken for each Tj tag)
- Tj is considered only after P(Ti|D)/P(Tj|D) Ti tags have been considered.
- Extending the above concept, a matrix with relative counts can be formed.
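A relative count matrix of this kind can be sketched in a few lines. The probabilities below are hypothetical values for P(T|D), and `relative_count_matrix` is an illustrative name; the sketch only shows the pairwise ratios, not how the generator consumes them.

```python
def relative_count_matrix(tag_probs):
    """Entry [ti][tj] holds P(ti|D) / P(tj|D), i.e. the number of ti tags
    to consider before one tj tag is considered."""
    # Order tags from highest to lowest probability for readability.
    tags = sorted(tag_probs, key=tag_probs.get, reverse=True)
    return {ti: {tj: tag_probs[ti] / tag_probs[tj] for tj in tags} for ti in tags}


# Hypothetical tag probabilities for one document (assumed values).
tag_probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
matrix = relative_count_matrix(tag_probs)
for ti, row in matrix.items():
    print(ti, {tj: round(ratio, 2) for tj, ratio in row.items()})
```

Reading a row for a higher-ranked tag gives the counts relative to each lower-ranked tag; with these assumed values, 2.5 actor tags correspond to one trivia tag.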
24. Oceans Eleven.xml - Summaries
25. Generating the summary with 30 tags