Title: Telugu-English Dictionary Based Cross Language Query Focused Multi-Document Summarization
1. Telugu-English Dictionary Based Cross Language Query Focused Multi-Document Summarization
Prasad Pingali, J. Jagadeesh, Vasudeva Varma
International Institute of Information Technology, Hyderabad, India
pvvpr, vv_at_iiit.ac.in
July 08, 2006
2. Cross-Language Information Systems
- CLIR and factoid-based question answering have been thoroughly researched
- TREC, CLEF and NTCIR are some of the focused workshops in this area
- Efforts are under way to make state-of-the-art research usable in applications
- Machine translation (MT) capabilities are a bottleneck for end-user applications
- Cross-language (CL) query focused summarization has not been studied
3. CL Query Focused Summarization Use Case
4. Potential Benefits of CLQ Summarization
- A bridge between CLIR and MT
- Output is a paragraph that is coherent, readable and syntactically correct
- MT systems need syntactically correct sentences for proper translation
- The summary can be tailor-made to suit the constraints of the MT system
  - For example, a summary with minimum natural language (NL) ambiguity
5. Problem Statement
- To synthesize, from a set of 25-50 documents in a language L2 that are related to a given topic, a brief, well-organized, fluent answer to a need for information given in a language L1 that cannot be met by just stating a name, date, quantity, etc.
6. Extraction Based Summary
- Extraction based summarization process
  - Identify sentences from documents
  - Score these sentences using some function
  - Choose the top scoring sentences
  - Identify and eliminate exact duplicates and near duplicates (redundancy)
  - Concatenate the sentences in some logical order (e.g., by timestamp)
- Sentence scoring function
  - Query independent, or
  - Query based, or
  - A combination
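The three scoring options above can be illustrated with a small sketch. It is not the scoring used in this work: the two scorers, the mixing weight `lam`, the function names and the whitespace tokenization are all hypothetical choices made only to show how a query-independent and a query-based score might be combined.

```python
from collections import Counter


def query_independent_score(tokens, corpus_counts):
    """Average relative frequency of the sentence's words in the document set."""
    total = sum(corpus_counts.values())
    if not tokens or total == 0:
        return 0.0
    return sum(corpus_counts[t] for t in tokens) / (total * len(tokens))


def query_based_score(tokens, query_terms):
    """Fraction of distinct query terms that appear in the sentence."""
    distinct = set(query_terms)
    if not distinct:
        return 0.0
    return len(set(tokens) & distinct) / len(distinct)


def combined_score(tokens, query_terms, corpus_counts, lam=0.5):
    """Linear mix of both scores; lam is an assumed tuning weight."""
    return (lam * query_based_score(tokens, query_terms)
            + (1.0 - lam) * query_independent_score(tokens, corpus_counts))


def rank_sentences(sentences, query_terms, lam=0.5, top_k=2):
    """Score every sentence and return the top_k highest scoring ones."""
    tokenized = [s.lower().split() for s in sentences]
    corpus_counts = Counter(t for toks in tokenized for t in toks)
    scored = sorted(
        zip(sentences, tokenized),
        key=lambda pair: combined_score(pair[1], query_terms, corpus_counts, lam),
        reverse=True,
    )
    return [sentence for sentence, _ in scored[:top_k]]
```

Setting `lam=1.0` gives a purely query-based ranking and `lam=0.0` a purely query-independent one.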
7. Sentence Scoring using RBLM
- Relevance based language modeling (RBLM) framework
- Using conditional sampling, the joint probability can be re-written as
- Term dependencies in a CL setting are calculated as
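The two expressions referred to on this slide are not reproduced in the text. Assuming the usual relevance-model derivation, in which the query terms q_1 ... q_k are sampled independently given a word w, and assuming that English terms e_j bridge the query language and the document language (as the translation and query expansion probabilities on the following slide suggest), they plausibly take the form below. This is a reconstruction under those assumptions, not the authors' own notation.

```latex
% Conditional sampling: joint probability of a word w and the query q_1 ... q_k
P(w, q_1, \ldots, q_k) = P(w) \prod_{i=1}^{k} P(q_i \mid w)

% Cross-language term dependency, marginalizing over English terms e_j
P(q_i \mid w) = \sum_{j} P(q_i \mid e_j) \, P(e_j \mid w)
```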
8. Sentence Scoring using RBLM (contd.)
- P(qi | ej) as translation probability
  - Explored in statistical machine translation and CLIR
- P(ej | w) as post-translation query expansion
  - Calculated using dictionary-based methods, term co-occurrence statistics, or pseudo-relevance feedback
9. Calculations in our Experiments (Telugu-English)
- P(qi | ej) as translation probability
  - Used a bilingual dictionary and assumed uniform probability for all possible translations
- P(ej | w) as post-translation query expansion
  - Used the Hyperspace Analogue to Language (HAL) feature to compute term dependencies
  - HAL works on skip-bigrams; the window length is a parameter
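The two estimates on this slide can be sketched as follows. Several details are assumptions, since the slide does not give them: the uniform weight is taken over the translations listed for each source word, the HAL weight for a pair at distance d inside a window of length K is K - d + 1, co-occurrences are counted symmetrically, and each HAL row is normalized to sum to one so it can be read as a probability. All function names are hypothetical.

```python
from collections import defaultdict


def uniform_translation_probs(bilingual_dict):
    """Give every listed translation of a source word the same probability."""
    probs = {}
    for source_word, translations in bilingual_dict.items():
        unique = sorted(set(translations))
        for target_word in unique:
            probs[(source_word, target_word)] = 1.0 / len(unique)
    return probs


def hal_matrix(tokens, window=5):
    """Accumulate distance-weighted co-occurrences of word pairs (skip-bigrams)."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            weight = window - distance + 1  # closer pairs weigh more (assumption)
            hal[word][tokens[j]] += weight
            hal[tokens[j]][word] += weight  # symmetric counting (assumption)
    return hal


def dependency_probs(hal):
    """Normalize each HAL row so its weights sum to one."""
    probs = {}
    for word, row in hal.items():
        total = sum(row.values())
        probs[word] = {other: value / total for other, value in row.items()}
    return probs
```

With a window of 2 over the token sequence `a b c`, the pair (a, b) receives weight 2 and the pair (a, c) weight 1, so after normalization `b` accounts for two thirds of the mass associated with `a`.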
10. Sentence Score
- The score of each word 'w' w.r.t. the query can be written as
- The score of each sentence 'S' w.r.t. the query can be written as
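The slide's own expressions are not reproduced in this text, so the following is only a sketch under stated assumptions. It assumes a word score of the form P(w) multiplied, over the query terms q_i, by the sum over English terms e_j of P(q_i | e_j) P(e_j | w), and it assumes the sentence score is the mean of its word scores. The smoothing constant `EPS`, the dictionary-based lookup tables and the function names are hypothetical.

```python
EPS = 1e-9  # assumed smoothing so an unmatched query term does not zero the product


def word_score(word, query_terms, trans_prob, dep_prob, prior):
    """Assumed form: P(w) * prod_i sum_j P(q_i | e_j) * P(e_j | w)."""
    related = dep_prob.get(word, {})  # e_j -> P(e_j | w)
    score = prior.get(word, EPS)      # P(w)
    for q in query_terms:
        p_q_given_w = sum(
            trans_prob.get((q, e), 0.0) * p_e for e, p_e in related.items()
        )
        score *= p_q_given_w + EPS
    return score


def sentence_score(tokens, query_terms, trans_prob, dep_prob, prior):
    """Assumed form: mean of the word scores of the sentence."""
    if not tokens:
        return 0.0
    total = sum(
        word_score(t, query_terms, trans_prob, dep_prob, prior) for t in tokens
    )
    return total / len(tokens)
```

Here `trans_prob` maps a (query term, English term) pair to a translation probability, `dep_prob` maps a word to its related English terms, and `prior` maps a word to P(w).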
11. Summary Generation
- Top ranking sentences are tested for redundancy
- Cosine similarity between sentences (unigram overlap) is used for calculating redundancy
- Candidate sentences that remain after redundant ones are removed are concatenated in order of the publication date of their documents
- Sentences belonging to the same document are concatenated in their order of occurrence in the document
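The redundancy check and ordering described above might look like the sketch below. The similarity threshold of 0.7, the 250-word budget, the candidate record fields (`text`, `date`, `doc_id`, `position`) and the function names are assumptions made for illustration; the slide does not specify them. Dates are assumed to be ISO strings so that they sort chronologically.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two sentences using unigram counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_summary(ranked, threshold=0.7, max_words=250):
    """Pick non-redundant sentences in rank order, then sort them for output.

    `ranked` lists candidate dicts from highest to lowest score.
    """
    selected, word_count = [], 0
    for cand in ranked:
        tokens = cand["text"].lower().split()
        if any(cosine(tokens, s["text"].lower().split()) >= threshold
               for s in selected):
            continue  # too similar to a sentence already chosen
        if word_count + len(tokens) > max_words:
            break  # assumed length budget reached
        selected.append(cand)
        word_count += len(tokens)
    # Order by publication date, then by position inside the same document.
    selected.sort(key=lambda s: (s["date"], s["doc_id"], s["position"]))
    return " ".join(s["text"] for s in selected)
```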
12. Evaluation
- Experiments in Telugu-English language pair
- Used DUC 2005 dataset
- Manually translated DUC queries into Telugu
- Used Telugu queries as input to system
- System generates English summary
- Summary evaluated against DUC model summaries using ROUGE package
- ROUGE package contains many metrics; for summaries of paragraph length, ROUGE-2 and ROUGE-SU4 correlate best with human evaluations
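To make the ROUGE-2 metric concrete, here is a simplified sketch of bigram-overlap precision, recall and f-measure against a single reference. It is not the ROUGE package: it assumes whitespace tokenization, one model summary, and no stemming or stop-word handling, and it does not cover ROUGE-SU4. The function names are hypothetical.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(candidate, reference):
    """Return (precision, recall, f_measure) from clipped bigram overlap."""
    cand = bigram_counts(candidate.lower().split())
    ref = bigram_counts(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", three of the five bigrams on each side match, so precision, recall and f-measure all come out at 0.6.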
13. An Example
14. Evaluation Results
15. Evaluation Results (per topic)
16. Conclusions
- We extended our mono-lingual summarization framework to a cross-lingual setting in the RBLM framework
- We designed a cross-lingual experimental setup using the DUC 2005 dataset
- Experiments were conducted for the Telugu-English language pair
- Comparison with the mono-lingual baseline shows about 90% of its performance in ROUGE-SU4 and about 85% in ROUGE-2 f-measures
17. DUC-2006
18. Invitation
Workshop on Cross Lingual Information Access
http://search.iiit.ac.in/CLIA2007
January 06, 2007, held at IJCAI 07 in Hyderabad, India.
19. Thank you