1
Information Retrieval at NLC
  • Jianfeng Gao
  • NLC Group, Microsoft Research China

2
Outline
  • People
  • Projects
  • Systems
  • Research

3
People
  • Jianfeng Gao, Microsoft Research, China
  • Guihong Cao, Tianjin University, China
  • Hongzhao He, Tianjin University, China
  • Min Zhang, Tsinghua University, China
  • Jian-Yun Nie, Université de Montréal
  • Stephen Robertson, Microsoft Research, Cambridge
  • Stephen Walker, Microsoft Research, Cambridge

4
Systems
  • SMART (Master: Hongzhao)
  • Traditional IR system: VSM, TF-IDF
  • Handles a collection of more than 500 MB
  • Runs on Linux
  • Okapi (Master: Guihong)
  • Modern IR system: probabilistic model, BM25
  • Handles a collection of more than 10 GB
  • Runs on Windows 2000
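The difference between the two systems' term weighting can be sketched as follows. These are illustrative formulas only (standard TF-IDF and Okapi BM25), not the actual SMART or Okapi code:

```python
import math

def tfidf(tf, df, N):
    """Classic TF-IDF weight as used in vector-space models (VSM):
    term frequency times inverse document frequency."""
    return tf * math.log(N / df)

def bm25(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 term weight (probabilistic model): IDF with a
    saturating, length-normalized term-frequency component."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

Unlike TF-IDF, the BM25 weight saturates as term frequency grows, which is one reason it behaves better on large, mixed collections.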

5
Projects
  • CLIR: TREC-9 (Japanese: NTCIR-3)
  • System: SMART
  • Focus:
  • Chinese indexing units [Gao et al., 00; Gao & He, 01]
  • Query translation [Gao et al., 01]
  • Web retrieval: TREC-10
  • System: Okapi
  • Focus:
  • Blind feedback [Zhang et al., 01]
  • Link-based retrieval (anchor text) [Craswell et al., 01]

6
Research
  • Best indexing unit for Chinese IR
  • Query translation
  • Using link information for web retrieval
  • Blind feedback for web retrieval
  • Improving the effectiveness of IR with clustering and fusion

7
Best indexing unit for Chinese IR
  • Motivation
  • What is the basic unit of indexing in Chinese IR: word, n-gram, or a combination?
  • Does the accuracy of word segmentation have a significant impact on IR performance?
  • Experiment 1: indexing units
  • Experiment 2: the impact of word segmentation

8
Experiment 1: settings
  • System: SMART (modified version)
  • Corpus: TREC-5&6 Chinese collection
  • Experiments:
  • Impact of the dictionary: longest matching with a small dictionary and with a large dictionary
  • Combining the first method with single characters
  • Using full segmentation
  • Using bi-grams and uni-grams (characters)
  • Combining words with bi-grams and characters
  • Unknown word detection using NLPWin
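The character-based indexing options above can be sketched as follows. This is a hypothetical helper, not the modified SMART code:

```python
def index_units(text, use_unigrams=True, use_bigrams=True):
    """Generate character-based indexing units for a Chinese string.
    Overlapping bi-grams cover most two-character Chinese words
    without requiring any word segmentation."""
    units = []
    if use_unigrams:
        units.extend(text)  # uni-grams: single characters
    if use_bigrams:
        # all overlapping two-character windows
        units.extend(text[i:i + 2] for i in range(len(text) - 1))
    return units
```

In the combined settings, units like these would be merged with dictionary-segmented words before indexing.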

9
Experiment 1: results
  • Word + character (bi-gram) + unknown words

10
Experiment 2: settings
  • Systems:
  • SMART system
  • Songrou's segmentation evaluation system
  • Corpora:
  • (1) TREC-5&6 Chinese IR collection
  • (2) Songrou's corpus:
  • 12rst.txt, 181 KB
  • 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists)
  • (3) Samples from Songrou's corpus:
  • test.txt, 20 KB (random sample from 12rst.txt)
  • standard.src, 28 KB (standard segmentation corresponding to test.txt)

11
Experiment 2: results
  • Notes A: 1 = Baseline, 2 = Disambiguation, 3 = Number, 4 = Proper noun, 5 = Suffix
  • Notes B: feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)

12
Query Translation
  • Motivation: problems of simple lexicon-based approaches
  • The lexicon is incomplete
  • It is difficult to select correct translations
  • Solution: an improved lexicon-based approach
  • Term disambiguation using co-occurrence statistics
  • Phrase detection and translation using a language model (LM)
  • Translation coverage enhancement using a translation model (TM)

13
Term disambiguation
  • Assumption: correct translation words tend to co-occur in Chinese text
  • A greedy algorithm:
  • for English terms Te = (e1, ..., en),
  • find their Chinese translations Tc = (c1, ..., cn), such that Tc = argmax SIM(c1, ..., cn)
  • Term-similarity matrix trained on a Chinese corpus

14
Phrase detection and translation
  • Multi-word phrases are detected by a base-NP detector
  • Translation patterns PATT_e, e.g.:
  • <NOUN1 NOUN2> → <NOUN1 NOUN2>
  • <NOUN1 of NOUN2> → <NOUN2 NOUN1>
  • Phrase translation:
  • Tc = argmax P(O_Tc | PATT_e) · P(Tc)
  • P(O_Tc | PATT_e): probability of the translation pattern (word ordering)
  • P(Tc): probability of the phrase under the Chinese LM
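The pattern-based scoring can be sketched as follows; `patterns` and `lm_prob` are hypothetical interfaces for the translation-pattern probabilities and the Chinese LM:

```python
def translate_phrase(patterns, lm_prob, nouns_c):
    """Choose the ordering of the translated nouns that maximizes
    P(O_Tc | PATT_e) * P(Tc). patterns is a list of
    (reorder_fn, pattern_prob) pairs for the detected English pattern;
    lm_prob scores the resulting Chinese phrase under a language model.
    All interfaces here are illustrative assumptions."""
    best_tc, best_score = None, -1.0
    for reorder, p_pattern in patterns:
        tc = tuple(reorder(nouns_c))          # candidate Chinese ordering
        score = p_pattern * lm_prob(tc)       # pattern prob. times LM prob.
        if score > best_score:
            best_tc, best_score = tc, score
    return best_tc
```

For an "N1 of N2" phrase, a strong LM preference for the reversed order can outweigh a lower pattern probability.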

15
Using a translation model (TM)
  • Enhances the coverage of the lexicon
  • Using the TM:
  • Tc = argmax P(Te | Tc) · SIM(Tc)
  • Parallel texts mined from the Web for TM training

16
Experiments on TREC-5&6
  • Monolingual
  • Simple translation: lexicon lookup
  • Best-sense translation: manually selecting the best translation
  • Improved translation (our method)
  • Machine translation using the IBM MT system

17
Summary of Experiments
18
Using link information for web retrieval
  • Motivation:
  • The effectiveness of link-based retrieval
  • Evaluation on the TREC web collection
  • Link-based web retrieval: the state of the art
  • Recommendation: high in-degree is better
  • Topic locality: connected pages are similar
  • Anchor description: a page is represented by its anchor text
  • Link-based retrieval in TREC: no good results

19
Experiments on TREC-9
  • Baseline: content-based IR
  • Anchor description:
  • Used alone: much worse than the baseline
  • Combined with content description: trivial improvement
  • Re-ranking: trivial improvement
  • Spreading: no positive effect

20
Summary of Experiments
21
Blind feedback for web retrieval
  • Motivation:
  • Web queries are short
  • The web collection is huge and highly mixed
  • Blind feedback refines web queries using:
  • The global web collection
  • A local web collection
  • Another well-organized collection, e.g. Encarta

22
Experiments on TREC-9
  • Baseline: two-stage pseudo-relevance feedback (PRF) using the global web collection
  • Local context analysis [Xu et al., 96]: two-stage PRF using the local web collection retrieved by the first stage
  • Two-stage PRF using the Encarta collection in the first stage
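The feedback step shared by these runs can be illustrated minimally as follows, with plain term frequency standing in for the Okapi / local-context-analysis term-selection formulas actually used:

```python
from collections import Counter

def expand_query(query, ranked_docs, k=10, n_expand=5):
    """Pseudo-relevance (blind) feedback sketch: assume the top-k
    documents from a first-stage retrieval are relevant, and add their
    most frequent new terms to the query for the second stage.
    Term selection here is simple frequency, for illustration only."""
    feedback = Counter()
    for doc in ranked_docs[:k]:
        # count terms from the assumed-relevant documents,
        # ignoring terms already in the query
        feedback.update(t for t in doc.split() if t not in query)
    expansion = [t for t, _ in feedback.most_common(n_expand)]
    return list(query) + expansion
```

Swapping which collection supplies `ranked_docs` (global web, local web, or Encarta) reproduces the three experimental conditions above.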

23
Summary of Experiments

24
Improving the effectiveness of IR with clustering and fusion
  • Clustering hypothesis: documents that are relevant to the same query are more similar to one another than to non-relevant documents, and can be clustered together.
  • Fusion hypothesis: different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents.
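The fusion hypothesis motivates score-combination schemes such as CombSUM; a minimal sketch, not code from the original experiments:

```python
def combsum(runs):
    """CombSUM fusion: min-max normalize each run's scores, then sum
    per document. Under the fusion hypothesis, relevant documents
    appear in many runs and so accumulate the highest totals.
    runs: list of dicts mapping doc id -> retrieval score."""
    fused = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        for doc, score in run.items():
            norm = (score - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc] = fused.get(doc, 0.0) + norm
    # final ranking: documents sorted by fused score, best first
    return sorted(fused, key=fused.get, reverse=True)
```

A document retrieved highly by both runs outranks one retrieved highly by only a single run.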

25
Thanks !