1
Telugu-English Dictionary Based Cross Language
Query Focused Multi-Document Summarization
  • IIIA-2006

Prasad Pingali, J. Jagadeesh, Vasudeva Varma
International Institute of Information Technology, Hyderabad, India
pvvpr, vv@iiit.ac.in
July 08, 2006
2
Cross-Language Information Systems
  • CLIR and factoid-based Question Answering have
    been thoroughly researched
  • TREC, CLEF and NTCIR are some of the focused
    workshops in this area
  • Efforts are underway to make state-of-the-art
    research usable in applications
  • Machine Translation capabilities are a bottleneck
    for end-user applications
  • Cross-language query-focused summarization has
    not been studied

3
CL Query Focused Summarization Use Case
4
Potential Benefits of CLQ Summarization
  • A bridge between CLIR and MT
  • Output is a coherent, readable and syntactically
    correct paragraph
  • MT systems need syntactically correct sentences
    for proper translation
  • The summary can be tailor-made to suit the
    constraints of the MT system
  • For example, a summary with minimal
    natural-language ambiguity

5
Problem Statement
  • To synthesize, from a set of 25-50 documents in a
    language L2 that are related to a given topic, a
    brief, well-organized, fluent answer to an
    information need given in a language L1 that
    cannot be met by just stating a name, date,
    quantity, etc.

6
Extraction Based Summary
  • Extraction-based summarization process:
    • Identify sentences from the documents
    • Score these sentences using some function
    • Choose the top-scoring sentences
    • Identify and eliminate exact duplicates and near
      duplicates (redundancy)
    • Concatenate the sentences in some logical order
      (e.g., by timestamp)
  • Sentence scoring function:
    • Query-independent, or
    • Query-based, or
    • A combination of both
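The steps above can be sketched as a minimal pipeline. This is an illustrative reconstruction, not the authors' code: `score_sentence` is a placeholder for the scoring function described on the following slides, and the Jaccard near-duplicate test stands in for whatever redundancy measure is used.

```python
def jaccard(a, b):
    """Unigram-overlap similarity used here as a simple near-duplicate test."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def summarize(documents, score_sentence, max_sentences=10, sim_threshold=0.8):
    """Extraction-based summarization: score, select, de-duplicate, order.

    Each document is a list of (timestamp, sentence) pairs.
    """
    # 1. Identify sentences from the documents
    candidates = [(ts, s) for doc in documents for (ts, s) in doc]
    # 2-3. Score sentences and rank them, best first
    ranked = sorted(candidates, key=lambda p: score_sentence(p[1]), reverse=True)
    # 4. Drop exact and near duplicates (redundancy)
    chosen = []
    for ts, s in ranked:
        if len(chosen) == max_sentences:
            break
        if all(jaccard(s, kept) < sim_threshold for _, kept in chosen):
            chosen.append((ts, s))
    # 5. Concatenate in a logical order (here: by timestamp)
    return " ".join(s for _, s in sorted(chosen))
```

A query-independent scorer such as sentence length can be plugged in for `score_sentence` to smoke-test the pipeline.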

7
Sentence Scoring using RBLM
  • Relevance-based language modeling (RBLM)
    framework
  • Using conditional sampling, the joint probability
    can be re-written as
  • Term dependencies in a cross-language setting are
    calculated as
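The equations on this slide were images in the original deck and did not survive extraction. A plausible reconstruction, assuming the standard Lavrenko-style i.i.d. sampling formulation of relevance models with the query terms q_1, ..., q_k bridged through dictionary translations e_j, is:

```latex
% Joint probability of a candidate English word w and the
% Telugu query terms, re-written via conditional sampling:
P(w, q_1, \ldots, q_k) = P(w) \prod_{i=1}^{k} P(q_i \mid w)

% Cross-language term dependency, marginalized over the
% English dictionary translations e_j of query term q_i:
P(q_i \mid w) = \sum_{j} P(q_i \mid e_j)\, P(e_j \mid w)
```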

8
Sentence Scoring using RBLM (contd.)
  • P(qi|ej) is the translation probability
    • Explored in statistical machine translation and
      CLIR
  • P(ej|w) performs post-translation query expansion
    • Can be calculated using dictionary-based methods,
      term co-occurrence statistics, or
      pseudo-relevance feedback

9
Calculations in our Experiments (Telugu-English)
  • P(qi|ej) as the translation probability
    • Used a bilingual dictionary and assumed uniform
      probability over all possible translations
  • P(ej|w) as post-translation query expansion
    • Used the Hyperspace Analogue to Language (HAL)
      model to compute term dependencies
    • HAL works on skip-bigrams; the window length
      is a parameter
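The two estimates on this slide can be sketched as follows. This is a hedged reconstruction: the distance-based weighting `window - d + 1` and the symmetric window are common HAL conventions, not details confirmed by the slides.

```python
from collections import defaultdict

def translation_prob(bilingual_dict, q, e):
    """Uniform P(q|e): equal mass over all dictionary translations of q."""
    translations = bilingual_dict.get(q, [])
    return 1.0 / len(translations) if e in translations else 0.0

def hal_counts(tokens, window=5):
    """HAL-style co-occurrence over skip-bigrams within a sliding window.

    Each pair within `window` positions is counted, weighted so that
    nearer words contribute more (weight = window - distance + 1).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                c = tokens[i + d]
                weight = window - d + 1
                counts[w][c] += weight
                counts[c][w] += weight  # symmetric window
    return counts

def p_e_given_w(counts, w):
    """Normalize the co-occurrence row of w into P(e_j | w)."""
    total = sum(counts[w].values())
    return {e: n / total for e, n in counts[w].items()} if total else {}
```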

10
Sentence Score
  • The score of each word 'w' w.r.t. the query
    can be written as
  • The score of each sentence 'S' w.r.t. the
    query can be written as
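The two formulas on this slide were images in the original deck. A hedged reconstruction, consistent with the cross-language term dependencies defined on the RBLM slides (the length normalization in the sentence score is an assumption):

```latex
% Score of a word w with respect to the query q_1, ..., q_k:
\mathrm{Score}(w) = \prod_{i=1}^{k} \sum_{j} P(q_i \mid e_j)\, P(e_j \mid w)

% Score of a sentence S, aggregating the scores of its words:
\mathrm{Score}(S) = \frac{1}{|S|} \sum_{w \in S} \mathrm{Score}(w)
```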

11
Summary Generation
  • Top-ranking sentences are tested for redundancy
  • Cosine similarity between sentences (unigram
    overlap) is used to measure redundancy
  • After redundant sentences are removed, the
    candidate sentences are concatenated in the order
    of the publication date of their documents
  • Sentences belonging to the same document are
    concatenated in their order of occurrence in the
    document
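The redundancy test described above can be sketched as a greedy filter over the ranked sentences. This is an illustrative sketch; the similarity threshold value is an assumption.

```python
import math
from collections import Counter

def cosine_sim(s1, s2):
    """Cosine similarity between two sentences over unigram counts."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def filter_redundant(ranked_sentences, threshold=0.7):
    """Greedily keep a sentence only if it is not too similar
    to any sentence already kept (highest-scoring first)."""
    kept = []
    for s in ranked_sentences:
        if all(cosine_sim(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```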

12
Evaluation
  • Experiments in the Telugu-English language pair
  • Used the DUC 2005 dataset
  • Manually translated the DUC queries into Telugu
  • Used the Telugu queries as input to the system
  • The system generates an English summary
  • The summary is evaluated against the DUC model
    summaries using the ROUGE package
  • ROUGE contains many metrics; for summaries of
    paragraph length, ROUGE-2 and ROUGE-SU4 correlate
    best with human evaluations

13
An Example
14
Evaluation Results
15
Evaluation Results (per topic)
16
Conclusions
  • We extended our mono-lingual summarization
    framework to a cross-lingual setting within the
    RBLM framework
  • We designed a cross-lingual experimental setup
    using the DUC 2005 dataset
  • Experiments were conducted for the Telugu-English
    language pair
  • Comparison with the mono-lingual baseline shows
    about 90% performance in ROUGE-SU4 and about 85%
    in ROUGE-2 F-measures

17
DUC-2006
18
Invitation
Workshop on Cross Lingual Information Access
http://search.iiit.ac.in/CLIA2007
January 06, 2007, held at IJCAI-07 in Hyderabad, India.
19
Thank you
  • Questions?