Title: Cross-lingual Event Tracking
Outline
- Task
- Two Alternative Approaches
- Our Methods
- Experimental results
- Observations
- Future work
Task
- Tracking Task
- More difficult than Information Filtering (TREC)
- No human relevance feedback
- Cross-lingual Event Tracking (CLET)
- Using source-language topics to track target-language stories
- Difficulty: bridging the language gap
Approaches
- Translating test documents (common approach)
- Translating the multilingual test documents into the preferred language and treating the problem as a monolingual tracking task
- Translating sampled training docs (true CLET)
- Translating some sampled training stories into the same language as the test documents
Goal
- Reduce the gap between these two approaches
Main Ideas
- Apply Cross-lingual Information Retrieval techniques to CLET
- Query Expansion
- Bi-gram Based Segmentation (for Chinese)
- Adaptation
- LIMSI weighted adaptation -- LWAdapt
- CMU normalized and weighted adaptation -- NWAdapt
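As an illustration of bigram-based segmentation, the sketch below splits a Chinese string into overlapping character bigrams, so no word segmenter or lexicon is needed. The function name `char_bigrams` and the handling of whitespace and very short strings are assumptions made for this example, not details taken from the system described here.

```python
def char_bigrams(text: str) -> list[str]:
    """Split text into overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Assumption: a single character is kept as its own token.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


tokens = char_bigrams("北京大学生")
print(tokens)  # ['北京', '京大', '大学', '学生']
```

Every adjacent character pair becomes an indexing term, which trades some noise (such as '京大') for independence from segmentation errors.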
CMU Event Tracking System
- Rocchio Model
- The centroid of a category, constructed from a set of positive training examples and a set of negative training examples of that class
- Fixed Weighted Adaptation (FWAdapt)
- LIMSI's Adaptation
- CMU's Adaptation
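The Rocchio model above can be sketched as follows: a topic prototype is the mean of the positive example vectors minus a weighted mean of the negative ones, and a new story is scored by cosine similarity against it. This is a minimal sketch of the textbook Rocchio formulation, not the CMU system itself; the names `mean_vector`, `rocchio_prototype`, and `cosine`, the value `gamma=0.25`, and the clipping of negative weights are all assumptions for illustration.

```python
import math


def mean_vector(docs):
    """Average a list of sparse term-weight dicts."""
    acc = {}
    for doc in docs:
        for term, weight in doc.items():
            acc[term] = acc.get(term, 0.0) + weight
    n = len(docs)
    return {t: w / n for t, w in acc.items()} if n else {}


def rocchio_prototype(positives, negatives, gamma=0.25):
    """Prototype = mean(positives) - gamma * mean(negatives)."""
    pos = mean_vector(positives)
    neg = mean_vector(negatives)
    proto = {t: pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
             for t in set(pos) | set(neg)}
    # Assumption: terms with non-positive weight are dropped.
    return {t: w for t, w in proto.items() if w > 0}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


proto = rocchio_prototype(
    positives=[{"quake": 2.0, "rescue": 1.0}],
    negatives=[{"election": 3.0, "rescue": 1.0}],
)
on_topic = cosine(proto, {"quake": 1.0})
off_topic = cosine(proto, {"election": 1.0})
```

A story would be declared on-topic when its score exceeds a decision threshold; adaptation schemes then decide how such stories are folded back into the prototype.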
Cross-Lingual Components
- Topic Expansion
- Sample translations
- Using the bilingual dictionary
- Using the CL-PRF technique
- Segmentation (for Chinese)
- Phrase-based
- Bigram-based
- Adaptation
- LIMSI's adaptation
- CMU's adaptation
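To make the dictionary-based step of topic expansion concrete, the sketch below maps a weighted source-language topic vector into the target language through a bilingual dictionary. The function `translate_topic`, the equal split of weight across alternative translations, and the dropping of out-of-vocabulary terms are assumptions for illustration; the slides do not specify how translation weights are assigned.

```python
def translate_topic(topic, dictionary):
    """Map source-language term weights to target-language terms.

    topic:      dict of source term -> weight
    dictionary: dict of source term -> list of target-language translations
    """
    translated = {}
    for term, weight in topic.items():
        targets = dictionary.get(term, [])
        if not targets:
            continue  # Assumption: terms missing from the dictionary are dropped.
        share = weight / len(targets)  # Assumption: split weight evenly.
        for target in targets:
            translated[target] = translated.get(target, 0.0) + share
    return translated


topic = {"earthquake": 2.0, "rescue": 1.0, "xyz": 5.0}
dictionary = {"earthquake": ["地震"], "rescue": ["救援", "营救"]}
print(translate_topic(topic, dictionary))
# {'地震': 2.0, '救援': 0.5, '营救': 0.5}
```

A pseudo-relevance-feedback step such as CL-PRF would then add further target-language terms drawn from top-ranked retrieved documents, which this sketch does not cover.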
Experiment Design
- Mixed language event tracking
- Demonstrate that our system is comparable to the best teams in recent TDT benchmark evaluations
- CLET based on test document translation
- Give a comparable baseline for our approach
- CLET based on translating sampled training data
- Our true cross-lingual event tracking approach
Experiments: English-Chinese Data
- Mixed language event tracking
- TDT 2001 evaluation data
- CLET based on test document translation
- TDT 1999 evaluation data
- Translation: SYSTRAN MT system (released by NIST)
- CLET based on translating sampled training data
- TDT 1999 evaluation data
- Translation: LDC dictionary
Mixed language event tracking
- LIMSI result in TDT2001: Cost 0.1332
Adaptation Method     Normalized Min Cost   Cost Reduction Ratio (%)
Without adaptation    0.1225                --
LWAdapt               0.1183                3.4
NWAdapt               0.1133                7.5
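The cost reduction ratio in the table above is consistent with the relative decrease in cost with respect to the no-adaptation baseline, expressed as a percentage. The sketch below recomputes the two adaptation rows under that reading; the function name `cost_reduction_ratio` is an assumption for this example.

```python
def cost_reduction_ratio(baseline_cost: float, cost: float) -> float:
    """Relative cost reduction versus the baseline, in percent."""
    return 100.0 * (baseline_cost - cost) / baseline_cost


print(round(cost_reduction_ratio(0.1225, 0.1183), 1))  # 3.4
print(round(cost_reduction_ratio(0.1225, 0.1133), 1))  # 7.5
```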
Two Baselines: Translating test documents
- Using SYSTRAN MT system (released by NIST): Cost 0.1336
- Using LDC dictionary (translated by CMU): Cost 0.1899
Experimental Results -- The Effects of Topic Expansion (TE) and Segmentation
Condition        English-Chinese Cost   Cost Reduction Ratio (%)
Phrase (DICT)    0.5039                 --
PhraseTE         0.2974                 41
Bigram           0.3848                 26.3
BigramTE         0.2522                 50
Experimental Results -- The Effects of Different Adaptation Approaches
Condition          English-Chinese Cost   Cost Reduction Ratio (%)
Phrase             0.5039                 --
PhraseLWAdapt      0.4258                 15.5
PhraseNWAdapt      0.4197                 16.7
PhraseTE           0.2974                 41
PhraseTELWAdapt    0.2660                 47.2
PhraseTENWAdapt    0.2617                 48
Bigram             0.3848                 26
BigramTE           0.2522                 50
BigramTELWAdapt    0.2467                 51
BigramTENWAdapt    0.2413                 52.6
Experimental Results -- Translating Test Documents vs. Sampled Training Documents (1)
Experimental Results -- Translating Test Documents vs. Sampled Training Documents (Using StatMT dictionary generated by IBM system)
Observations
- CMU's adaptation performs better than LIMSI's adaptation
- Topic expansion improved performance
- Bigram segmentation performs better than phrase-based segmentation in the Mandarin tracking task
- CLIR techniques are an effective way of bridging the language gap in true CLET
Future Work
- Apply MT to cross-lingual tracking
- Introduce named entities into the CLET task to further improve system performance
References
- Improving text categorization methods for event tracking. Yiming Yang, Tom Ault, et al.
- Learning Approaches for Detecting and Tracking News Events. Yiming Yang, Jaime Carbonell, et al.
- The BBN Crosslingual Topic Detection and Tracking System. Tim Leek, Hubert Jin, et al.
- The LIMSI topic tracking system for TDT2002. Yuen-Yee Lo and Jean-Luc Gauvain
Evaluation
- DET Curve
- Cost Reduction Ratio
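For context on how a DET curve relates to a single cost figure, the sketch below sweeps a decision threshold over tracking scores. Each threshold yields one (false alarm rate, miss rate) point on the DET curve, and the minimum normalized detection cost is the best value found along it. The constants `c_miss=1.0`, `c_fa=0.1`, and `p_target=0.02` are the values commonly cited for TDT evaluations and should be treated as assumptions here, as should the function names; the slides do not state the parameters used.

```python
def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Detection cost divided by the cost of the better trivial system."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))


def min_normalized_cost(scores, labels):
    """Minimum normalized cost over all thresholds (one DET point each).

    scores: system confidence per story
    labels: 1 if the story is on-topic, else 0 (both classes must occur)
    """
    n_target = sum(labels)
    n_non = len(labels) - n_target
    best = normalized_cost(1.0, 0.0)  # threshold above every score
    for thr in sorted(set(scores)):
        misses = sum(1 for s, y in zip(scores, labels) if y and s < thr)
        false_alarms = sum(1 for s, y in zip(scores, labels) if not y and s >= thr)
        best = min(best, normalized_cost(misses / n_target, false_alarms / n_non))
    return best


print(min_normalized_cost([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
```

Under these constants, a system that rejects every story scores exactly 1.0, so values below 1.0 indicate an improvement over doing nothing.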
CLET using sampled training data
Segmentation   Expand   Adaptation   Label
Phrase         No       No           Phrase
Phrase         No       FWAdapt      PhraseFWAdapt
Phrase         No       LWAdapt      PhraseLWAdapt
Phrase         No       NWAdapt      PhraseNWAdapt
Phrase         Yes      No           PhraseTE
Phrase         Yes      FWAdapt      PhraseTEFWAdapt
Phrase         Yes      LWAdapt      PhraseTELWAdapt
Phrase         Yes      NWAdapt      PhraseTENWAdapt
Bigram         No       No           Bigram
Bigram         No       FWAdapt      BigramFWAdapt
Bigram         No       LWAdapt      BigramLWAdapt
Bigram         No       NWAdapt      BigramNWAdapt
Bigram         Yes      No           BigramTE
Bigram         Yes      FWAdapt      BigramTEFWAdapt
Bigram         Yes      LWAdapt      BigramTELWAdapt
Bigram         Yes      NWAdapt      BigramTENWAdapt