WIDIT in TREC2004 Web Track - PowerPoint PPT Presentation

1
WIDIT in TREC-2004 Web Track
Dynamic Tuning for Fusion

Kiduk Yang
Yoon Lee
SLIS, Indiana University
November 2004

2
Web Track 2004 Overview

Query Classification
Task
Identification of Topic Category (TD, NP, HP)
avg. 3 words per topic (225 topics)
Challenges
Quantity & Quality of training data
Topic length
Methods
Statistical (Machine Learning)
Linguistic (Word cues)

3
Web Track 2004 Overview

Mixed Query
Task
System Optimization for mixed topics
Questions
What sources of evidence to leverage?
How to combine multiple sources of evidence (MSE)?
What is the best retrieval strategy for each topic type?
How to optimize the system for mixed topics?
Methods
Traditional IR
Term Weighting, Query Expansion
Fusion
Result Merging, MSE Fusion
Query Classification

4
WIDIT Web IR System Architecture

[Architecture diagram. Labels: Documents, Indexing Module, Sub-indexes (Body Index, Anchor Index, Header Index), Topics, Queries (Simple Queries, Expanded Queries), Retrieval Module, Search Results, Fusion Module, Static Tuning, Fusion Result, Query Classification Module, Query Types, Re-ranking Module, Dynamic Tuning, Final Result]

5
WIDIT Indexing Module

Document Indexing
Strip HTML tags
extract title, meta keywords & description, emphasized words
parse out hyperlinks (URL & anchor texts)
Create pseudo-documents
anchor texts
header texts
Create Subcollection Indexes
Stop & Stem (Simple, Combo stemmer)
compute SMART & Okapi term weights
Compute whole collection term statistics
Query Indexing
Stop & Stem
Identify nouns, phrases
Expand acronyms

6
WIDIT Retrieval Module

Parallel Searching
Multiple Document Indexes
body text (title, body)
anchor text (title, inlink anchor text)
header text (title, meta keywords & description, first heading, emphasized words)
Multiple Query formulations
Stemming (Simple, Combo)
expanded query (acronym, noun)
Multiple Subcollections
for search speed and scalability
search each subcollection using whole collection term statistics
Merge subcollection search results
merge sort by document score
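
The merge step above can be sketched in Python as follows. This is a minimal illustration, not WIDIT's actual code: the function name `merge_subcollection_results`, the `(doc_id, score)` pair layout, and the `top_k` cutoff are assumptions. It relies on every subcollection having been scored with whole-collection term statistics, so that scores are comparable across lists.

```python
import heapq
from itertools import islice


def merge_subcollection_results(result_lists, top_k=1000):
    """Merge per-subcollection rankings into one ranking by document score.

    Each input list holds (doc_id, score) pairs already sorted by
    descending score. heapq.merge performs the k-way merge lazily.
    """
    merged = heapq.merge(*result_lists, key=lambda pair: pair[1], reverse=True)
    return list(islice(merged, top_k))


# Hypothetical document IDs and scores, for illustration only.
sub_a = [("docA1", 9.0), ("docA2", 4.0)]
sub_b = [("docB1", 7.5), ("docB2", 1.0)]
top = merge_subcollection_results([sub_a, sub_b], top_k=3)
```

Because the inputs are already sorted, the merge never needs to re-sort the full result set, which fits the stated goal of search speed and scalability.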

7
WIDIT Fusion Module

Fusion Formulas
Weighted Sum
$FS_{ws} = \sum_i (w_i \cdot NS_i)$
Overlap Weighted Sum
$FS_{ows} = \sum_i (w_i \cdot NS_i) \cdot olp$
Select candidate systems to combine
Top performers in each category
e.g. best stemmer, qry expansion, doc index
Diverse systems
e.g. Content-based, Link-based
One-time brute force combinations for validation
Complementary Strength effect
Determine system weights (wi)
Static Tuning
Evaluate fusion formulas using a fixed set of values (e.g. 0.1..1.0) with training data
Select the formulas with best performance

$w_i$ = weight of system $i$ (relative contribution of each system)
$NS_i$ = normalized score of a document by system $i$ = $(S_i - S_{min}) / (S_{max} - S_{min})$
$olp$ = number of systems that retrieved a given document (overlap)
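
The two fusion formulas can be illustrated with a short Python sketch. It assumes each system's run is a dict mapping document ID to raw score, that a document missing from a run contributes nothing to the sum, and that a run whose scores are all equal normalizes to 0; the function names and the sample data are hypothetical.

```python
def min_max_normalize(run):
    """NS_i = (S_i - S_min) / (S_max - S_min) for every document in one run."""
    s_min, s_max = min(run.values()), max(run.values())
    if s_max == s_min:
        # Assumption: a run with identical scores carries no ranking signal.
        return {doc: 0.0 for doc in run}
    return {doc: (s - s_min) / (s_max - s_min) for doc, s in run.items()}


def fuse_runs(runs, weights, use_overlap=False):
    """Weighted Sum, or Overlap Weighted Sum when use_overlap is True."""
    fused, overlap = {}, {}
    for run, w in zip(runs, weights):
        for doc, ns in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * ns
            overlap[doc] = overlap.get(doc, 0) + 1  # olp: systems retrieving doc
    if use_overlap:
        fused = {doc: score * overlap[doc] for doc, score in fused.items()}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical runs from a body-text index and an anchor-text index.
body_run = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
anchor_run = {"d2": 4.0, "d4": 2.0}
ranking = fuse_runs([body_run, anchor_run], weights=[0.6, 0.4], use_overlap=True)
```

Static tuning, as described on the slide, would amount to calling `fuse_runs` over a grid of weight values on training topics and keeping the best-performing combination.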
8
WIDIT Query Classification Module

Statistical Classification (SC)
Classifiers
Naïve Bayes
SVM
Training Data
Titles of 2003 topics (50 TD, 150 HP, 150 NP)
w/ and w/o stemming (Combo stemmer)
Training Data Enrichment for TD class
Added top-level Yahoo Government category labels
Linguistic Classification (LC)
Word Cues
Create HP and NP lexicons
Ad-hoc heuristic
e.g. HP if ends in all caps, NP if contains YYYY, TD if short topic
Combination
More ad-hoc heuristic
if strong word cue, LC; else if single word, TD; else SC
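
The combination rule can be sketched in Python as follows. Several details are assumptions made for illustration: the small cue sets stand in for the HP and NP lexicons, "YYYY" is read as a four-digit year, a "strong word cue" is taken to mean any lexicon, year, or all-caps match, and the statistical classifier is passed in as a plain callable rather than a trained Naive Bayes or SVM model.

```python
import re

# Hypothetical mini-lexicons standing in for the HP and NP lexicons.
HP_CUES = {"office", "bureau", "department", "agency"}
NP_CUES = {"annual", "report", "guide", "history"}
YEAR = re.compile(r"\b(19|20)\d{2}\b")  # assumed reading of "YYYY"


def linguistic_class(topic):
    """Return 'NP' or 'HP' on a word cue, or None when no cue fires."""
    words = topic.split()
    if not words:
        return None
    lowered = {w.lower().strip(".,?!'\"") for w in words}
    if YEAR.search(topic) or lowered & NP_CUES:
        return "NP"
    if lowered & HP_CUES:
        return "HP"
    if len(words[-1]) > 1 and words[-1].isupper():  # ends in all caps
        return "HP"
    return None


def classify_topic(topic, statistical_classifier):
    """If strong word cue, LC; else if single word, TD; else SC."""
    cue = linguistic_class(topic)
    if cue is not None:
        return cue
    if len(topic.split()) == 1:
        return "TD"
    return statistical_classifier(topic)


# A stub classifier; a trained model would go here.
label = classify_topic("Vietnam war", lambda topic: "TD")
```

The order in which the cues are checked inside `linguistic_class` is a choice made for this sketch; the slides do not state how conflicting cues were resolved.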

9
WIDIT Re-ranking Module

Re-ranking Factors
Field-specific Match
Query words, acronyms, phrases in URL, title, header, anchor text
Exact Match
title, header text, anchor text
body text
Indegree & Outdegree
URL Type: root, subroot, path, file
based on URL ending and slash count
Page Type: HPP, HP, NPP, NP, ??
based on word cue heuristic
Re-ranking Strategies
Rank Boosting
potential homepage if TD or HP topic
page w/ large outdegree if TD topic, large indegree if HP topic
page w/ exact field-specific match if HP or NP topic
Dynamic Tuning
dynamic/interactive optimization of the re-ranking formula

10
WIDIT Dynamic Tuning Interface
11
Dynamic Tuning Observations

Effective re-ranking factors
HP
indegree, outdegree, exact match, URL/pagetype
minimum outdegree: 1
NP
indegree, outdegree, URLtype
1/3 impact of HP
TD
acronym, outdegree, URLtype
minimum outdegree: 10
Strength
Combines human intelligence (pattern recognition) w/ the computational power of the machine
Good for system tuning with many parameters
Facilitates failure analysis
Weakness
Over-tuning
Sensitive to initial results & re-ranking parameter selection

12
Failure Analysis Observations

Acronym Effect
Q80: Department of Labor's wage and hour division
title: DOL ESA Wage Hour Division Home Page
Q89: CDC Rabies homepage
CDC expanded to Center for Disease Control
pages about CDC ranked high
Duplicate Documents
Q215: NIH Video cast
G00-74-1477693 (relevant), G00-05-3317821 (relevant)
WIDIT eliminated pages with same URLs
Q188: Why study comets?
G00-48-1227124 (relevant), G32-10-1245341 (not relevant)
WIDIT ranked the mirrored document (G32) high due to its hyperlinks
Link Noise Effect
Q197: Vietnam war
(relevant) Johnson Administration's Foreign Relations volumes
4 links to Vietnam volumes
(top ranks) pages about Vietnam w/ many navigational links
Topic Drift

13
Results: Mixed Query Task

Run Descriptions
Best fusion run: F3
$F3 = 0.4 \cdot B(\text{anchor}) + 0.3 \cdot F1(1) + 0.3 \cdot F1(2)$
$F1(1) = 0.8 \cdot B(\text{body}) + 0.05 \cdot B(\text{anchor}) + 0.15 \cdot B(\text{header})$
Static re-ranking runs
w/ guessed query type (SR_g), w/ official query types (SR_o)
Dynamic re-ranking runs (post-submission)
w/ guessed query type (DR_g), w/ official query types (DR_o)

14
Results: Query Classification Effect

Run Descriptions
Using official query types: DR_o
Using guessed query types: DR_g
Using random query types: DR_r
Using bad query types: DR_b

15
Concluding Remarks

What worked
Fusion
Combining multiple sources of evidence
Topic-sensitive retrieval strategies
Requires query classification
Dynamic Tuning
Helps multi-parameter tuning & failure analysis
Future research
Link Analysis
PageRank, Hub/Authority scores for re-ranking
Link noise reduction based on page layout
Page type identification
Density of relevant links
Link structure
e.g. Hub = TD?, Authority = HP?
Integration of re-ranking formula into retrieval module

16
Resources

WIDIT
http://widit.slis.indiana.edu/
http://elvis.slis.indiana.edu/
Dynamic Tuning Interface (Web track)
http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/
WIDIT → projects → TREC → Web track

17
SMART

Length-Normalized Term Weights
SMART lnu weight for document terms
SMART ltc weight for query terms
where $f_{ik}$ = number of times term $k$ appears in document $i$
$idf_k$ = inverse document frequency of term $k$
$t$ = number of terms in document/query
Document Score
inner product of document and query vectors
where $q_k$ = weight of term $k$ in the query
$d_{ik}$ = weight of term $k$ in document $i$
$t$ = number of terms common to query & document
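
The weight formulas themselves did not survive in this transcript, so the following Python sketch is only one reading that is consistent with the symbol definitions above: log-scaled term frequency with cosine length normalization for documents, the same with idf for queries (the usual meaning of SMART ltc), and an inner product for scoring. The strict SMART lnu scheme uses pivoted unique normalization, so the document-side normalization here is an assumption; all function names are hypothetical.

```python
import math


def _cosine_normalize(raw):
    """Divide each weight by the Euclidean length of the weight vector."""
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {k: w / norm for k, w in raw.items()} if norm > 0 else {}


def doc_term_weights(term_freqs):
    """Assumed document weight: (log f_ik + 1), length-normalized."""
    raw = {k: math.log(f) + 1.0 for k, f in term_freqs.items() if f > 0}
    return _cosine_normalize(raw)


def query_term_weights(term_freqs, idf):
    """ltc-style query weight: (log f_k + 1) * idf_k, length-normalized."""
    raw = {k: (math.log(f) + 1.0) * idf[k] for k, f in term_freqs.items() if f > 0}
    return _cosine_normalize(raw)


def document_score(query_weights, doc_weights):
    """Inner product over the terms common to query and document."""
    return sum(q * doc_weights[k] for k, q in query_weights.items() if k in doc_weights)


# Hypothetical term counts and idf values.
d = doc_term_weights({"web": 3, "track": 1})
q = query_term_weights({"web": 1}, idf={"web": 2.0})
score = document_score(q, d)
```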

18
Okapi

Document Ranking
where $Q$ = query containing terms $T$
$K = k_1 \left( (1 - b) + b \cdot (\text{doc\_length} / \text{avg.doc\_length}) \right)$
tf = term frequency in a document
qtf = term frequency in a query
$k_1$, $b$, $k_3$ = parameters (1.2, 0.75, 7..1000)
wRS = Robertson-Sparck Jones weight
$N$ = total number of documents in the collection
$n$ = total number of documents in which the term occurs
$R$ = total number of relevant documents in the collection

Document term weight
(simplified formula)
Query term weight
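
The ranking formula is missing from this transcript. As a hedged illustration, the sketch below uses the commonly published Okapi BM25 form with the symbols defined above, and the simplified Robertson-Sparck Jones weight that applies when no relevance information is available. The function names, argument layout, and sample numbers are assumptions, and the slide's own formula may differ in detail.

```python
import math


def rsj_weight(N, n):
    """Simplified Robertson-Sparck Jones weight (no relevance information)."""
    return math.log((N - n + 0.5) / (n + 0.5))


def okapi_score(query_tf, doc_tf, doc_length, avg_doc_length, N, doc_freq,
                k1=1.2, b=0.75, k3=7.0):
    """Sum over query terms of wRS * document-term part * query-term part."""
    K = k1 * ((1 - b) + b * (doc_length / avg_doc_length))
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document adds nothing
        doc_part = (k1 + 1) * tf / (K + tf)
        query_part = (k3 + 1) * qtf / (k3 + qtf)
        score += rsj_weight(N, doc_freq[term]) * doc_part * query_part
    return score


# Hypothetical collection statistics.
s = okapi_score({"rabies": 1}, {"rabies": 2, "cdc": 5}, doc_length=120,
                avg_doc_length=100, N=1000, doc_freq={"rabies": 10})
```

With `b = 0.75`, documents longer than average get a larger `K` and therefore a smaller contribution per term occurrence.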

19
Webpage Type Identification

URL Type (Tomlinson, 2003; Kraaij et al., 2002)
Heuristic
root: slash_cnt = 0 or (HP_end & slash_cnt = 1)
subroot: HP_end & slash_cnt = 2
path: HP_end & slash_cnt >= 3
file: rest
(HP_end = 1 if URL ends w/ index.htm, default.htm, /, etc)
Page Type
Heuristic
if welcome or home in title, header, anchor text → HPP
else if YYYY in title, anchor → NPP
else if NP lexicon word → NP
else if HP lexicon word → HP
else if ends in all caps → HP
else → ??
NP lexicon
about, annual, report, guide, studies, history, new, how
HP lexicon
office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
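
Both heuristics can be sketched in Python. The sketch makes several assumptions that the slide does not spell out: slashes are counted in the URL path only (after the host name), the path rule applies when the slash count is 3 or more, the `.html` variants count as homepage endings, "YYYY" means a four-digit year, lexicon words are matched without stemming against title, header, and anchor text together, and "ends in all caps" refers to the last word of the title.

```python
import re
from urllib.parse import urlsplit

HP_ENDINGS = ("/", "index.htm", "index.html", "default.htm", "default.html")
NP_LEXICON = {"about", "annual", "report", "guide", "studies", "history", "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center", "committee",
              "agency", "administration", "council", "society", "service",
              "corporation", "commission", "board", "division", "museum", "library",
              "project", "group", "program", "laboratory", "site", "authority",
              "study", "industry"}
YEAR = re.compile(r"\b(19|20)\d{2}\b")  # assumed reading of "YYYY"


def url_type(url):
    """Classify a URL as root, subroot, path, or file."""
    path = urlsplit(url).path
    slash_cnt = path.count("/")
    hp_end = path.lower().endswith(HP_ENDINGS)
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"


def page_type(title, header="", anchor=""):
    """Classify a page as HPP, NPP, NP, HP, or ?? (unknown) from word cues."""
    words = set(re.findall(r"[a-z]+", " ".join((title, header, anchor)).lower()))
    if words & {"welcome", "home"}:
        return "HPP"
    if YEAR.search(title) or YEAR.search(anchor):
        return "NPP"
    if words & NP_LEXICON:
        return "NP"
    if words & HP_LEXICON:
        return "HP"
    title_words = title.split()
    if title_words and len(title_words[-1]) > 1 and title_words[-1].isupper():
        return "HP"
    return "??"


# Hypothetical URL and title.
kind = (url_type("http://www.example.gov/labs/index.htm"), page_type("Office of Science"))
```

URLs passed to `url_type` need a scheme such as `http://`; without one, `urlsplit` treats the host name as part of the path and the slash count shifts.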