Title: WIDIT in TREC2004 Web Track
1. WIDIT in TREC-2004 Web Track
Dynamic Tuning for Fusion
- Kiduk Yang
- Yoon Lee
- SLIS, Indiana University
- November 2004
2. Web Track 2004 Overview
- Query Classification
- Task
- Identification of Topic Category (TD, NP, HP)
- avg. 3 words per topic (225 topics)
- Challenges
- Quantity & quality of training data
- Topic length
- Methods
- Statistical (Machine Learning)
- Linguistic (Word cues)
3. Web Track 2004 Overview
- Mixed Query
- Task
- System Optimization for mixed topics
- Questions
- What sources of evidence to leverage?
- How to combine multiple sources of evidence (MSE)?
- What is the best retrieval strategy for each topic type?
- How to optimize the system for mixed topics?
- Methods
- Traditional IR
- Term Weighting, Query Expansion
- Fusion
- Result Merging, MSE Fusion
- Query Classification
4. WIDIT Web IR System Architecture
[Figure: system architecture diagram. Labeled components: Documents → Indexing Module → Body Index, Anchor Index, Header Index (each split into Sub-indexes); Topics → Simple Queries and Expanded Queries; Retrieval Module → Search Results; Fusion Module (Static Tuning) → Fusion Result; Query Classification Module → Query Types; Re-ranking Module (Dynamic Tuning) → Final Result.]
5. WIDIT Indexing Module
- Document Indexing
- Strip HTML tags
- extract title, meta keywords & description, emphasized words
- parse out hyperlinks (URL & anchor texts)
- Create pseudo-documents
- anchor texts
- header texts
- Create Subcollection Indexes
- Stop & stem (Simple, Combo stemmer)
- compute SMART & Okapi term weights
- Compute whole collection term statistics
- Query Indexing
- Stop & stem
- Identify nouns, phrases
- Expand acronyms
6. WIDIT Retrieval Module
- Parallel Searching
- Multiple Document Index
- body text (title, body)
- anchor text (title, inlink anchor text)
- header text (title, meta keywords & description, first heading, emphasized words)
- Multiple Query formulations
- Stemming (Simple, Combo)
- expanded query (acronym, noun)
- Multiple Subcollections
- for search speed and scalability
- search each subcollection using whole-collection term statistics
- Merge subcollection search results
- merge sort by document score
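The merge step above can be sketched as a k-way merge of per-subcollection result lists. This is a minimal illustration, not the WIDIT implementation: the function name `merge_results`, the `(doc_id, score)` tuple layout, and the document IDs are assumptions. It relies on the point made on the slide that scores are comparable across subcollections because each search uses whole-collection term statistics.

```python
import heapq
from itertools import islice


def merge_results(subcollection_results, top_n=1000):
    """Merge per-subcollection result lists into one ranking.

    Each input list holds (doc_id, score) pairs already sorted by
    descending score. Scores are assumed to be directly comparable
    because they were computed with whole-collection term statistics.
    """
    merged = heapq.merge(
        *subcollection_results,
        key=lambda pair: pair[1],
        reverse=True,  # inputs are sorted from highest to lowest score
    )
    return list(islice(merged, top_n))


# Hypothetical result lists from two subcollections.
sub_a = [("docA1", 0.92), ("docA2", 0.40)]
sub_b = [("docB1", 0.75), ("docB2", 0.10)]
print(merge_results([sub_a, sub_b], top_n=3))
```

A lazy merge avoids re-sorting the full concatenated list, which matters when each subcollection returns a long ranked list and only the top results are kept.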
7. WIDIT Fusion Module
- Fusion Formulas
- Weighted Sum
- $FS_{ws} = \sum_i (w_i \cdot NS_i)$
- Overlap Weighted Sum
- $FS_{ows} = \sum_i (w_i \cdot NS_i) \cdot olp$
- Select candidate systems to combine
- Top performers in each category
- e.g. best stemmer, qry expansion, doc index
- Diverse systems
- e.g. Content-based, Link-based
- One-time brute force combinations for validation
- Complementary Strength effect
- Determine system weights (wi)
- Static Tuning
- Evaluate fusion formulas using a fixed set of weight values (e.g. 0.1..1.0) with training data
- Select the formulas with best performance
- where $w_i$ = weight of system i (relative contribution of each system)
- $NS_i$ = normalized score of a document by system i = $(S_i - S_{min}) / (S_{max} - S_{min})$
- olp = number of systems that retrieved a given document (overlap)
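The two fusion formulas can be illustrated with a short sketch. The function names, the dictionary layout of a run, and the example weights are assumptions for illustration; only the min-max normalization, the weighted sum, and the overlap multiplier come from the slide. The handling of a run in which all scores are equal is also an assumption, since the slide does not cover that case.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Map raw scores to [0, 1] via (S - S_min) / (S_max - S_min)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Degenerate case not covered by the slide: treat every score as 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(runs, weights, use_overlap=False):
    """Weighted Sum (WS) or Overlap Weighted Sum (OWS) fusion.

    runs:    {system_name: {doc_id: raw_score}}
    weights: {system_name: w_i}
    Returns (doc_id, fused_score) pairs sorted by descending score.
    """
    fused = defaultdict(float)
    overlap = defaultdict(int)
    for system, scores in runs.items():
        for doc, ns in min_max_normalize(scores).items():
            fused[doc] += weights[system] * ns
            overlap[doc] += 1  # olp: number of systems retrieving this doc
    if use_overlap:
        fused = {doc: fs * overlap[doc] for doc, fs in fused.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical runs and weights.
runs = {
    "body": {"d1": 10.0, "d2": 5.0, "d3": 0.0},
    "anchor": {"d2": 4.0, "d4": 2.0},
}
weights = {"body": 0.6, "anchor": 0.4}
print(fuse(runs, weights))                    # weighted sum
print(fuse(runs, weights, use_overlap=True))  # overlap weighted sum
```

Static tuning as described above would then amount to running `fuse` over a grid of weight values on training topics and keeping the best-performing combination.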
8. WIDIT Query Classification Module
- Statistical Classification (SC)
- Classifiers
- Naïve Bayes
- SVM
- Training Data
- Titles of 2003 topics (50 TD, 150 HP, 150 NP)
- w/ and w/o stemming (Combo stemmer)
- Training Data Enrichment for TD class
- Added top-level Yahoo Government category labels
- Linguistic Classification (LC)
- Word Cues
- Create HP and NP lexicons
- Ad-hoc heuristic
- e.g. HP if ends in all caps, NP if contains YYYY, TD if short topic
- Combination
- More ad-hoc heuristic
- if strong word cue, LC; else if single word, TD; else SC
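The combination rule (word cue first, then the single-word test, then the statistical classifier) can be sketched as follows. The function names are hypothetical, the word-cue check covers only the two example cues named on the slide, and reading "YYYY" as a four-digit year starting with 19 or 20 is an assumption. The statistical classifier is passed in as a callable so that any trained model (Naive Bayes, SVM) can stand behind it.

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # assumed reading of "YYYY"


def word_cue(topic):
    """Linguistic classification (LC): return a label on a strong cue, else None."""
    words = topic.split()
    if YEAR.search(topic):
        return "NP"  # topic contains a year
    if words and words[-1].isalpha() and words[-1].isupper():
        return "HP"  # topic ends in an all-caps word
    return None


def classify_topic(topic, statistical_classifier):
    """Combine LC and SC: strong cue -> LC; single word -> TD; otherwise SC."""
    cue = word_cue(topic)
    if cue is not None:
        return cue
    if len(topic.split()) == 1:
        return "TD"
    return statistical_classifier(topic)


# Stand-in for a trained statistical classifier (hypothetical).
def always_np(topic):
    return "NP"


print(classify_topic("Vietnam", always_np))
print(classify_topic("annual report 2003", always_np))
print(classify_topic("Office of the CIO", always_np))
```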
9. WIDIT Re-ranking Module
- Re-ranking Factors
- Field-specific Match
- Query words, acronyms, phrases in URL, title, header, anchor text
- Exact Match
- title, header text, anchor text
- body text
- Indegree & Outdegree
- URL Type: root, subroot, path, file
- based on URL ending and slash count
- Page Type: HPP, HP, NPP, NP, ??
- based on word cue heuristic
- Re-ranking Strategies
- Rank Boosting
- potential homepage if TD or HP topic
- page w/ large outdegree if TD topic, large indegree if HP topic
- page w/ exact field-specific match if HP or NP topic
- Dynamic Tuning
- dynamic/interactive optimization of the re-ranking formula
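The rank-boosting strategies can be sketched as a topic-type-dependent score adjustment. The slide does not give the re-ranking formula, so everything numeric here is hypothetical: the boost weights, the "large degree" threshold, and the multiplicative form of the adjustment are placeholders of the kind that dynamic tuning would adjust. Only the conditions (which factor applies to which topic type) come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    score: float        # fusion score before re-ranking
    is_homepage: bool   # potential homepage (e.g. by URL type or page type)
    indegree: int
    outdegree: int
    exact_match: bool   # exact field-specific match with the query


# Hypothetical boost weights and threshold; not values from the slides.
BOOST = {"homepage": 0.5, "outdegree": 0.2, "indegree": 0.2, "exact": 0.3}
LARGE_DEGREE = 10


def rerank(docs, query_type):
    """Re-rank docs by boosting scores according to the topic type."""

    def boosted(doc):
        bonus = 0.0
        if query_type in ("TD", "HP") and doc.is_homepage:
            bonus += BOOST["homepage"]
        if query_type == "TD" and doc.outdegree >= LARGE_DEGREE:
            bonus += BOOST["outdegree"]
        if query_type == "HP" and doc.indegree >= LARGE_DEGREE:
            bonus += BOOST["indegree"]
        if query_type in ("HP", "NP") and doc.exact_match:
            bonus += BOOST["exact"]
        return doc.score * (1.0 + bonus)

    return sorted(docs, key=boosted, reverse=True)


docs = [
    Doc("a", 1.1, False, 0, 0, False),
    Doc("b", 0.8, True, 50, 2, True),
]
print([d.doc_id for d in rerank(docs, "HP")])
print([d.doc_id for d in rerank(docs, "NP")])
```

The same document list is ordered differently for HP and NP topics, which is why the re-ranking step depends on the output of query classification.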
10. WIDIT Dynamic Tuning Interface
11. Dynamic Tuning Observations
- Effective re-ranking factors
- HP
- indegree, outdegree, exact match, URL/pagetype
- minimum outdegree = 1
- NP
- indegree, outdegree, URLtype
- about 1/3 of the impact observed for HP
- TD
- acronym, outdegree, URLtype
- minimum outdegree = 10
- Strength
- Combines human intelligence (pattern recognition) w/ the computational power of the machine
- Good for system tuning with many parameters
- Facilitates failure analysis
- Weakness
- Over-tuning
- Sensitive to initial results & re-ranking parameter selection
12. Failure Analysis Observations
- Acronym Effect
- Q80: Department of Labor's wage and hour division
- title: DOL ESA Wage Hour Division Home Page
- Q89: CDC Rabies homepage
- CDC expanded to Center for Disease Control
- pages about CDC ranked high
- Duplicate Documents
- Q215: NIH Video cast
- G00-74-1477693 (relevant) and G00-05-3317821 (relevant)
- WIDIT eliminated pages with same URLs
- Q188: Why study comets?
- G00-48-1227124 (relevant) and G32-10-1245341 (not relevant)
- WIDIT ranked the mirrored document (G32) high due to its hyperlinks
- Link Noise Effect
- Q197: Vietnam war
- (relevant) Johnson Administration's Foreign Relations volumes
- 4 links to Vietnam volumes
- (top ranks) pages about Vietnam w/ many navigational links
- Topic Drift
13. Results: Mixed Query Task
- Run Descriptions
- Best fusion run: F3
- 0.4*B(anchor) + 0.3*F1(1) + 0.3*F1(2)
- F1(1) = 0.8*B(body) + 0.05*B(anchor) + 0.15*B(header)
- Static re-ranking runs
- w/ guessed query type (SR_g), w/ official query types (SR_o)
- Dynamic re-ranking runs (post-submission)
- w/ guessed query type (DR_g), w/ official query types (DR_o)
14. Results: Query Classification Effect
- Run Descriptions
- Using official query types: DR_o
- Using guessed query types: DR_g
- Using random query types: DR_r
- Using bad query types: DR_b
15. Concluding Remarks
- What worked
- Fusion
- Combining multiple sources of evidence
- Topic-sensitive retrieval strategies
- Requires query classification
- Dynamic Tuning
- Helps multi-parameter tuning & failure analysis
- Future research
- Link Analysis
- PageRank, Hub/Authority scores for re-ranking
- Link noise reduction based on page layout
- Page type identification
- Density of relevant links
- Link structure
- e.g. Hub for TD?, Authority for HP?
- Integration of re-ranking formula into retrieval module
16. Resources
- WIDIT
- http://widit.slis.indiana.edu/
- http://elvis.slis.indiana.edu/
- Dynamic Tuning Interface (Web track)
- http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/
- WIDIT → projects → TREC → Web track
17. SMART
- Length-Normalized Term Weights
- SMART lnu weight for document terms
- SMART ltc weight for query terms
- where $f_{ik}$ = number of times term k appears in document i
- $idf_k$ = inverse document frequency of term k
- t = number of terms in document/query
- Document Score
- inner product of document and query vectors
- where $q_k$ = weight of term k in the query
- $d_{ik}$ = weight of term k in document i
- t = number of terms common to query & document
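The weighting formulas on this slide are not present in the text. The following is a sketch of a common SMART formulation that matches the symbol definitions above, not a confirmed transcription of the slide. The cosine-style length normalization of the document weight is an assumption (Lnu is often defined with pivoted unique normalization instead), and $f_k$, the frequency of term k in the query, is a symbol introduced here for illustration.

```latex
% Document term weight: log term frequency, length-normalized (cosine form assumed)
d_{ik} = \frac{\log(f_{ik}) + 1}{\sqrt{\sum_{j=1}^{t} \left(\log(f_{ij}) + 1\right)^{2}}}

% Query term weight (ltc): log term frequency times idf, cosine-normalized
q_{k} = \frac{\left(\log(f_{k}) + 1\right) \cdot idf_{k}}{\sqrt{\sum_{j=1}^{t} \left[\left(\log(f_{j}) + 1\right) \cdot idf_{j}\right]^{2}}}

% Document score: inner product over the t terms common to query and document
score(d_i, q) = \sum_{k=1}^{t} q_{k} \, d_{ik}
```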
18. Okapi
- Document Ranking
- where Q = query containing terms T
- K = $k_1 ((1-b) + b \cdot (doc\_length / avg.\,doc\_length))$
- tf = term frequency in a document
- qtf = term frequency in a query
- $k_1$, b, $k_3$ = parameters (1.2, 0.75, 7..1000)
- $w_{RS}$ = Robertson-Sparck Jones weight
- N = total number of documents in the collection
- n = total number of documents in which the term occurs
- R = total number of relevant documents in the collection
- Document term weight
- (simplified formula)
- Query term weight
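The ranking formula itself is not present in the text. As a reference point, the standard Okapi BM25 ranking function, written with the symbols defined on this slide, is sketched below. The symbol r (number of relevant documents containing the term) does not appear in the surviving definitions and is added here as an assumption, as is the split of the simplified form into a document part and a query part.

```latex
% Okapi BM25 document ranking
score(D, Q) = \sum_{T \in Q} w_{RS} \cdot \frac{(k_1 + 1)\, tf}{K + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}

% Robertson-Sparck Jones weight (r = number of relevant documents containing the term)
w_{RS} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}

% Without relevance information (R = r = 0) the weight reduces to
w_{RS} = \log \frac{N - n + 0.5}{n + 0.5}

% One plausible reading of the simplified weights: document part and query part
w_{doc} = \log\!\left(\frac{N - n + 0.5}{n + 0.5}\right) \cdot \frac{tf}{K + tf},
\qquad
w_{query} = \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
```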
19. Webpage Type Identification
- URL Type (Tomlinson, 2003; Kraaij et al., 2002)
- Heuristic
- root: slash_cnt = 0 or (HP_end & slash_cnt = 1)
- subroot: HP_end & slash_cnt = 2
- path: HP_end & slash_cnt >= 3
- file: rest
- (HP_end = 1 if URL ends w/ index.htm, default.htm, /, etc.)
- Page Type
- Heuristic
- if welcome or home in title, header, anchor text → HPP
- else if YYYY in title, anchor → NPP
- else if NP lexicon word → NP
- else if HP lexicon word → HP
- else if ends in all caps → HP
- else → ??
- NP lexicon
- about, annual, report, guide, studies, history, new, how
- HP lexicon
- office, bureau, department, institute, center, committee, agency, administration, council, society, service, corporation, commission, board, division, museum, library, project, group, program, laboratory, site, authority, study, industry
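The URL-type and page-type heuristics on this slide can be sketched as two small functions. Several details are assumptions made for illustration: the slash count is taken over the URL with its scheme removed, `index.html` is added to the homepage endings under the slide's "etc.", "YYYY" is read as a four-digit year starting with 19 or 20, "welcome" and "home" are matched as whole words, lexicon words are looked up in the title, header, and anchor text together, and "ends in all caps" is applied to the last word of the title. The lexicons are the ones listed above.

```python
import re

HP_ENDINGS = ("/", "/index.htm", "/index.html", "/default.htm")  # partly assumed
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # assumed reading of "YYYY"

NP_LEXICON = {"about", "annual", "report", "guide", "studies", "history",
              "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center",
              "committee", "agency", "administration", "council", "society",
              "service", "corporation", "commission", "board", "division",
              "museum", "library", "project", "group", "program",
              "laboratory", "site", "authority", "study", "industry"}


def url_type(url):
    """Classify a URL as root, subroot, path, or file."""
    rest = re.sub(r"^[a-z]+://", "", url.strip(), flags=re.IGNORECASE)
    slash_cnt = rest.count("/")
    hp_end = rest.lower().endswith(HP_ENDINGS)
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"


def page_type(title, header="", anchor=""):
    """Classify a page as HPP, NPP, NP, HP, or ?? from word cues."""
    words = {w.lower() for w in
             re.findall(r"[A-Za-z]+", " ".join((title, header, anchor)))}
    if words & {"welcome", "home"}:
        return "HPP"
    if YEAR.search(f"{title} {anchor}"):
        return "NPP"
    if words & NP_LEXICON:
        return "NP"
    if words & HP_LEXICON:
        return "HP"
    title_words = re.findall(r"[A-Za-z]+", title)
    if title_words and title_words[-1].isupper():
        return "HP"
    return "??"


print(url_type("http://www.dol.gov/esa/"))
print(page_type("DOL ESA Wage Hour Division Home Page"))
```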
20. WIDIT Basic Approach
- Leverage multiple sources of evidence (MSE)
- Document Content
- Document Structure
- title, meta keywords & description, emphasized words
- Link Structure
- anchor text, in/outdegree
- Parallel search
- Multiple Index
- body text (title, body)
- anchor text (title, inlink anchor text)
- header text (title, meta keywords & description, first heading, emphasized words)
- Multiple Query formulations
- stemming, acronyms, nouns, synonyms, definitions
- Combine results (i.e. fusion)
- Weighted Sum formula
- Post-retrieval re-ranking
- Acronym, Phrase, Exact, Field-specific match
- In/Outdegree, URL information, Site Compression
- Pseudo-feedback