1
Information Retrieval at NLC
  • Jianfeng Gao
  • NLC Group, Microsoft Research China

2
Outline
  • People
  • Projects
  • Systems
  • Research

3
People
  • Jianfeng Gao, Microsoft Research, China
  • Guihong Cao, Tianjin University, China
  • Hongzhao He, Tianjin University, China
  • Min Zhang, Tsinghua University, China
  • Jian-Yun Nie, Université de Montréal
  • Stephen Robertson, Microsoft Research, Cambridge
  • Stephen Walker, Microsoft Research, Cambridge

4
Systems
  • SMART (Master: Hongzhao)
  • Traditional IR system: VSM, TF-IDF
  • Handles a collection of more than 500 MB
  • Runs on Linux
  • Okapi (Master: Guihong)
  • Modern IR system: probabilistic model, BM25
  • Handles a collection of more than 10 GB
  • Runs on Windows 2000
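The difference between the two systems' term weighting can be sketched as follows. These are illustrative formulas only (standard TF-IDF and Okapi BM25), not the actual SMART or Okapi code:

```python
import math

def tfidf(tf, df, N):
    """Classic TF-IDF weight as used in vector-space models (VSM):
    term frequency times inverse document frequency."""
    return tf * math.log(N / df)

def bm25(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 term weight (probabilistic model): IDF with a
    saturating, length-normalized term-frequency component."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

Unlike TF-IDF, the BM25 weight saturates as term frequency grows, which is one reason it behaves better on large, mixed collections.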

5
Projects
  • CLIR: TREC-9 (Japanese: NTCIR-3)
  • System: SMART
  • Focus:
  • Chinese indexing units [Gao et al., 00; Gao & He, 01]
  • Query translation [Gao et al., 01]
  • Web retrieval: TREC-10
  • System: Okapi
  • Focus:
  • Blind feedback [Zhang et al., 01]
  • Link-based retrieval (anchor text) [Craswell et al., 01]

6
Research
  • Best indexing unit for Chinese IR
  • Query translation
  • Using link information for web retrieval
  • Blind feedback for web retrieval
  • Improving the effectiveness of IR with clustering and fusion

7
Best indexing unit for Chinese IR
  • Motivation
  • What is the basic unit of indexing in Chinese IR: word, n-gram, or a combination?
  • Does the accuracy of word segmentation have a significant impact on IR performance?
  • Experiment 1: indexing units
  • Experiment 2: the impact of word segmentation

8
Experiment 1: settings
  • System: SMART (modified version)
  • Corpus: TREC-5&6 Chinese collection
  • Experiments:
  • Impact of the dictionary: longest matching with a small dictionary and with a large dictionary
  • Combining the first method with single characters
  • Using full segmentation
  • Using bi-grams and uni-grams (characters)
  • Combining words with bi-grams and characters
  • Unknown word detection using NLPWin
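The character-based indexing options above can be sketched as follows. This is a hypothetical helper, not the modified SMART code:

```python
def index_units(text, use_unigrams=True, use_bigrams=True):
    """Generate character-based indexing units for a Chinese string.
    Overlapping bi-grams cover most two-character Chinese words
    without requiring any word segmentation."""
    units = []
    if use_unigrams:
        units.extend(text)  # uni-grams: single characters
    if use_bigrams:
        # all overlapping two-character windows
        units.extend(text[i:i + 2] for i in range(len(text) - 1))
    return units
```

In the combined settings, units like these would be merged with dictionary-segmented words before indexing.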

9
Experiment 1: results
  • Word + character (bi-gram) + unknown words

10
Experiment 2: settings
  • Systems:
  • SMART system
  • Songrou's segmentation evaluation system
  • Corpora:
  • (1) TREC-5&6 Chinese IR collection
  • (2) Songrou's corpus:
  • 12rst.txt, 181 KB
  • 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists)
  • (3) Samples from Songrou's corpus:
  • test.txt, 20 KB (random sample from 12rst.txt)
  • standard.src, 28 KB (standard segmentation corresponding to test.txt)

11
Experiment 2: results
  • Notes A: 1 = Baseline, 2 = Disambiguation, 3 = Number, 4 = Proper noun, 5 = Suffix
  • Notes B: feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)

12
Query Translation
  • Motivation: problems of simple lexicon-based approaches
  • The lexicon is incomplete
  • It is difficult to select correct translations
  • Solution: an improved lexicon-based approach
  • Term disambiguation using co-occurrence statistics
  • Phrase detection and translation using a language model (LM)
  • Translation coverage enhancement using a translation model (TM)

13
Term disambiguation
  • Assumption: correct translation words tend to co-occur in Chinese text
  • A greedy algorithm:
  • for English terms Te = (e1, ..., en),
  • find their Chinese translations Tc = (c1, ..., cn), such that Tc = argmax SIM(c1, ..., cn)
  • Term-similarity matrix trained on a Chinese corpus

14
Phrase detection and translation
  • Multi-word phrases are detected by a base-NP detector
  • Translation patterns PATT_e, e.g.:
  • <NOUN1 NOUN2> → <NOUN1 NOUN2>
  • <NOUN1 of NOUN2> → <NOUN2 NOUN1>
  • Phrase translation:
  • Tc = argmax P(O_Tc | PATT_e) · P(Tc)
  • P(O_Tc | PATT_e): probability of the translation pattern (word ordering)
  • P(Tc): probability of the phrase under the Chinese LM
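The pattern-based scoring can be sketched as follows; `patterns` and `lm_prob` are hypothetical interfaces for the translation-pattern probabilities and the Chinese LM:

```python
def translate_phrase(patterns, lm_prob, nouns_c):
    """Choose the ordering of the translated nouns that maximizes
    P(O_Tc | PATT_e) * P(Tc). patterns is a list of
    (reorder_fn, pattern_prob) pairs for the detected English pattern;
    lm_prob scores the resulting Chinese phrase under a language model.
    All interfaces here are illustrative assumptions."""
    best_tc, best_score = None, -1.0
    for reorder, p_pattern in patterns:
        tc = tuple(reorder(nouns_c))          # candidate Chinese ordering
        score = p_pattern * lm_prob(tc)       # pattern prob. times LM prob.
        if score > best_score:
            best_tc, best_score = tc, score
    return best_tc
```

For an "N1 of N2" phrase, a strong LM preference for the reversed order can outweigh a lower pattern probability.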

15
Using a translation model (TM)
  • Enhances the coverage of the lexicon
  • Using the TM:
  • Tc = argmax P(Te | Tc) · SIM(Tc)
  • Parallel texts mined from the Web for TM training

16
Experiments on TREC-5&6
  • Monolingual
  • Simple translation: lexicon lookup
  • Best-sense translation: manually selecting the best translation
  • Improved translation (our method)
  • Machine translation using the IBM MT system

17
Summary of Experiments
18
Using link information for web retrieval
  • Motivation:
  • The effectiveness of link-based retrieval
  • Evaluation on the TREC web collection
  • Link-based web retrieval: the state of the art
  • Recommendation: high in-degree is better
  • Topic locality: connected pages are similar
  • Anchor description: a page is represented by its anchor text
  • Link-based retrieval in TREC: no good results

19
Experiments on TREC-9
  • Baseline: content-based IR
  • Anchor description:
  • Used alone: much worse than the baseline
  • Combined with content description: trivial improvement
  • Re-ranking: trivial improvement
  • Spreading: no positive effect

20
Summary of Experiments
21
Blind feedback for web retrieval
  • Motivation:
  • Web queries are short
  • The web collection is huge and highly mixed
  • Blind feedback refines web queries using:
  • The global web collection
  • A local web collection
  • Another well-organized collection, e.g. Encarta

22
Experiments on TREC-9
  • Baseline: two-stage pseudo-relevance feedback (PRF) using the global web collection
  • Local context analysis [Xu et al., 96]: two-stage PRF using the local web collection retrieved by the first stage
  • Two-stage PRF using the Encarta collection in the first stage
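The feedback step shared by these runs can be illustrated minimally as follows, with plain term frequency standing in for the Okapi / local-context-analysis term-selection formulas actually used:

```python
from collections import Counter

def expand_query(query, ranked_docs, k=10, n_expand=5):
    """Pseudo-relevance (blind) feedback sketch: assume the top-k
    documents from a first-stage retrieval are relevant, and add their
    most frequent new terms to the query for the second stage.
    Term selection here is simple frequency, for illustration only."""
    feedback = Counter()
    for doc in ranked_docs[:k]:
        # count terms from the assumed-relevant documents,
        # ignoring terms already in the query
        feedback.update(t for t in doc.split() if t not in query)
    expansion = [t for t, _ in feedback.most_common(n_expand)]
    return list(query) + expansion
```

Swapping which collection supplies `ranked_docs` (global web, local web, or Encarta) reproduces the three experimental conditions above.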

23
Summary of Experiments

24
Improving the effectiveness of IR with clustering and fusion
  • Clustering hypothesis: documents that are relevant to the same query are more similar to one another than to non-relevant documents, and can be clustered together.
  • Fusion hypothesis: different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents.
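The fusion hypothesis motivates score-combination schemes such as CombSUM; a minimal sketch, not code from the original experiments:

```python
def combsum(runs):
    """CombSUM fusion: min-max normalize each run's scores, then sum
    per document. Under the fusion hypothesis, relevant documents
    appear in many runs and so accumulate the highest totals.
    runs: list of dicts mapping doc id -> retrieval score."""
    fused = {}
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        for doc, score in run.items():
            norm = (score - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc] = fused.get(doc, 0.0) + norm
    # final ranking: documents sorted by fused score, best first
    return sorted(fused, key=fused.get, reverse=True)
```

A document retrieved highly by both runs outranks one retrieved highly by only a single run.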

25
Thanks !