The Atlas Project: Cross-language Event Detection and Tracking


1
The Atlas Project: Cross-language Event Detection
and Tracking
  • The Atlas Team
  • Language Technologies Institute & Computer
    Science Dept.
  • School of Computer Science
  • Carnegie Mellon University

2
Team
  • Yiming Yang (PI)
  • Bryan Kisiel (Sr. Programmer)
  • Monica Rogati (PhD candidate)
  • Jian Zhang (PhD candidate)
  • Nianli Ma (Programmer)
  • Ashwin Tengli (MS candidate)
  • Bancha Dhammarungruang (CSD undergrad)
  • Derek Leung (CSD undergrad)
  • Joe Delfino (CSD undergrad)

3
Motivations
  • Non-English documents are valuable sources of
    important information about new events
  • Manual acquisition and translation of multilingual
    documents is a costly off-line process
  • Tools for automated or semi-automated
    cross-language event detection and tracking
    (CL-EDT) are highly desirable

4
Primary Aims (in 3 years)
  • Focused web crawling for documents on
    user-specified topics (terrorist activities, for
    example)
  • Automated detection of new events for each topic
    and tracking the follow-up events
  • Automated extraction of named entities in
    relation to topics and events
  • True cross-lingual ability, including English,
    Arabic, Chinese, German, Spanish,
  • Integrated system, effective GUI

5
What is novel?
  • Topic-conditioned (hierarchical) TDT
      • as opposed to a flat (conventional) approach
  • True cross-lingual ability
      • translating a small number of training examples
        per event
      • as opposed to translating all test documents
  • Optimal training source selection
      • based on language, topic and genre of documents
      • optimizing CL-EDT effectiveness
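
As a rough illustration of the topic-conditioned idea on this slide, the sketch below keeps a separate document history per topic and flags a story as a new event only when it is dissimilar to every earlier story in the same topic. It is a minimal sketch, not the Atlas system: the class name `TopicConditionedDetector`, the raw term-frequency vectors, the cosine measure, and the threshold value are all assumptions made for illustration, and the topic label is assumed to come from some upstream classifier.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TopicConditionedDetector:
    """Hypothetical detector: novelty is judged within a topic, not globally."""

    def __init__(self, threshold=0.5):
        # threshold is an illustrative value, not a tuned parameter
        self.threshold = threshold
        self.history = defaultdict(list)  # topic -> vectors of past stories

    def is_new_event(self, topic, text):
        vec = tf_vector(text)
        past = self.history[topic]
        # Highest similarity to any earlier story in the SAME topic only
        best = max((cosine(vec, p) for p in past), default=0.0)
        past.append(vec)
        return best < self.threshold


detector = TopicConditionedDetector(threshold=0.5)
first = detector.is_new_event("terrorism", "bomb explodes in market")
follow_up = detector.is_new_event("terrorism", "bomb explodes in crowded market")
other_topic = detector.is_new_event("elections", "bomb explodes in market")
```

Here `first` is flagged as new, `follow_up` is not (it closely matches the earlier story in the same topic), and `other_topic` is flagged as new because each topic has its own history. A flat detector would compare every story against a single shared history instead.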

6
Plan for Year 1
  • Develop Arabic, Chinese and German components for
    our EDT system
  • Develop web crawling tools
  • Develop further our topic-conditioned EDT scheme
  • Create a new evaluation corpus with topic/event
    labels
  • Investigate usage of Named Entities for selected
    topics (e.g., Terrorism events)

7
Exams for Year 1
  • Midterm
      • Participated in the TDT-2002 benchmark
        evaluations
      • Winning performance in the task of multilingual
        event detection (Arabic, Chinese and English)
  • End-of-the-year
      • Topic-conditioned EDT on our new corpus
        (broadcast news, 1992-1998) with selected topics

8
Plan for Years 2-3
  • Improve our crosslingual IR techniques and
    evaluate them on benchmark collections (TREC,
    CLEF, NTCIR)
  • Develop systems for multilingual document
    classification by language, genre and topics, and
    for module selection
  • Investigate unsupervised document clustering for
    the generation of topic hierarchy in our
    topic-conditioned EDT system
  • Investigate extraction of Situated Named Entities
    using automated induction of finite state
    transducers
  • System integration and user interface development

9
Component Presentations
  • Novelty Detection (Jian Zhang)
  • Crosslingual Event Tracking (Nianli Ma)
  • German-English crosslingual retrieval (Yiming
    Yang)
  • Unsupervised word stemming (Monica Rogati)
  • Web mining for training/evaluation data (Ashwin
    Tengli)

10
Check List (Midterm of Year 1)
  • Developed Arabic and Chinese components for our
    EDT system
  • Significant progress in the approach of
    translating training examples
  • Novel method for statistical stemming, applied to
    Arabic
  • Develop web crawling tools
      • On-going
  • Further develop topic-conditioned EDT
      • Best results in TDT-2002 benchmark evaluations
  • Create a new evaluation corpus with topic/event
    labels
      • Started in the biological weapon domain
  • Investigate NE extraction for selected topics
      • Next
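
The approach of translating training examples, listed in the check list above, can be pictured with a small sketch: the few labeled stories for an event are translated into the target language once, and incoming target-language documents are then scored without being translated. This is only an illustration under stated assumptions. The toy lexicon `EN_DE`, the word-by-word translation, the summed-vector event profile, and the function names are all hypothetical and are not taken from the Atlas system.

```python
import math
from collections import Counter

# Hypothetical toy English-to-German lexicon; a real system would rely on
# machine translation or a much larger bilingual dictionary.
EN_DE = {
    "bomb": "bombe",
    "attack": "anschlag",
    "city": "stadt",
    "police": "polizei",
}


def translate(tokens, lexicon):
    """Word-by-word translation; out-of-lexicon tokens are dropped."""
    return [lexicon[t] for t in tokens if t in lexicon]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_event_profile(training_docs_en, lexicon):
    """Translate the few training stories once and sum their term counts."""
    profile = Counter()
    for doc in training_docs_en:
        profile.update(translate(doc.lower().split(), lexicon))
    return profile


def score(profile, doc_target_language):
    """Score an untranslated target-language document against the profile."""
    return cosine(profile, Counter(doc_target_language.lower().split()))


profile = build_event_profile(["bomb attack city", "police city attack"], EN_DE)
on_event = score(profile, "anschlag in der stadt")
off_event = score(profile, "wetter heute sonnig")
```

In this toy run `on_event` is well above zero while `off_event` is exactly zero, since the weather sentence shares no terms with the translated profile. The translation cost depends on the handful of training stories per event rather than on the volume of incoming documents.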

11
Publications
  • Published Papers
      • New-Event Link Detection at CMU for TDT 2002.
        J. Carbonell, Y. Yang, R. Brown, J. Zhang and
        N. Ma. Notebook for TDT-2002 Evaluations, Nov.
        2002.
      • CMU in Cross-Language Information Retrieval at
        NTCIR-3. Y. Yang and N. Ma. Proceedings of the
        NTCIR-2002 Conference.
  • Submitted Papers
      • In Cross-lingual Retrieval
          • Unsupervised Learning of Arabic Stemming
            using a Parallel Corpus. M. Rogati,
            S. McCarley and Y. Yang. ACL 2003.
      • In Topic Detection and Tracking
          • Cross-Language Event Tracking. N. Ma, Y. Yang
            and M. Rogati. Submitted to the journal
            Information Processing and Management, 2003.
      • In Adaptive Filtering
          • Margin-Based Local Regression of Adaptive
            Filtering. Y. Yang and B. Kisiel. Submitted
            to the ACM/KDD Conference, 2003.

12
Publications (cont'd)
  • Submitted Papers
      • In Text Categorization
          • A scalability analysis of classifiers in text
            categorization. Y. Yang, J. Zhang and
            B. Kisiel. ACM/SIGIR, 2003.
          • Modified logistic regression: An approximation
            to SVM and its application to large-scale
            text categorization. J. Zhang, R. Jin,
            Y. Yang and A. Hauptmann. Submitted to ICML,
            2003.
          • A Loss-function Based Analysis of
            Classification Methods in Text
            Categorization. F. Li and Y. Yang. Submitted
            to ICML, 2003.
          • Robustness of Regularized Linear
            Classification Methods in Text
            Categorization. J. Zhang and Y. Yang.
            ACM/SIGIR, 2003.