WIDIT in TREC2004 Web Track


1
WIDIT in TREC-2004 Web Track
Dynamic Tuning for Fusion
  • Kiduk Yang
  • Yoon Lee
  • SLIS, Indiana University
  • November 2004

2
Web Track 2004 Overview
  • Query Classification
  • Task
  • Identification of Topic Category (TD, NP, HP)
  • avg. 3 words per topic (225 topics)
  • Challenges
  • Quantity & Quality of training data
  • Topic length
  • Methods
  • Statistical (Machine Learning)
  • Linguistic (Word cues)

3
Web Track 2004 Overview
  • Mixed Query
  • Task
  • System Optimization for mixed topics
  • Questions
  • What sources of evidence to leverage?
  • How to combine multiple sources of evidence (MSE)?
  • What is the best retrieval strategy for each
    topic type?
  • How to optimize the system for mixed topics?
  • Methods
  • Traditional IR
  • Term Weighting, Query Expansion
  • Fusion
  • Result Merging, MSE Fusion
  • Query Classification

4
WIDIT Web IR System Architecture
[Figure: system architecture diagram. Labeled components: Documents, Indexing Module, Sub-indexes (Body Index, Anchor Index, Header Index), Topics, Queries (Simple Queries, Expanded Queries), Retrieval Module, Search Results, Fusion Module (Static Tuning), Fusion Result, Query Classification Module (Query Types), Re-ranking Module (Dynamic Tuning), Final Result.]

5
WIDIT Indexing Module
  • Document Indexing
  • Strip HTML tags
  • extract title, meta keywords & description,
    emphasized words
  • parse out hyperlinks (URL & anchor texts)
  • Create pseudo-documents
  • anchor texts
  • header texts
  • Create Subcollection Indexes
  • Stop & Stem (Simple, Combo stemmer)
  • compute SMART & Okapi term weights
  • Compute whole collection term statistics
  • Query Indexing
  • Stop & Stem
  • Identify nouns, phrases
  • Expand acronyms
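
The document indexing steps above (strip tags, extract title, meta keywords & description, emphasized words, and hyperlinks with anchor text) can be sketched with Python's standard `html.parser`. This is only an illustration, not WIDIT's indexing code: the class name `PageFields` and the set of tags treated as "emphasized" are assumptions.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect title, meta keywords/description, emphasized words, and links."""

    # Assumed definition of "emphasized"; the slides do not list the tags.
    EMPHASIS_TAGS = {"b", "strong", "em", "i", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.title = []
        self.meta = {}
        self.emphasized = []
        self.links = []      # (url, anchor text) pairs
        self.body = []       # tag-stripped text
        self._stack = []     # currently open tags
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""
            return  # void element: never pushed on the stack
        self._stack.append(tag)
        if tag == "a" and attrs.get("href"):
            self._href, self._anchor = attrs["href"], []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor)))
            self._href = None
        if tag in self._stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or {"script", "style"} & set(self._stack):
            return
        if "title" in self._stack:
            self.title.append(text)
            return
        self.body.append(text)
        if self._href is not None:
            self._anchor.append(text)
        if self.EMPHASIS_TAGS & set(self._stack):
            self.emphasized.append(text)
```

The collected anchor texts and header fields would then feed the anchor and header pseudo-documents; how WIDIT groups them per target page is not specified in the slides.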

6
WIDIT Retrieval Module
  • Parallel Searching
  • Multiple Document Index
  • body text (title, body)
  • anchor text (title, inlink anchor text)
  • header text (title, meta kw & desc, first
    heading, emphasized words)
  • Multiple Query formulations
  • Stemming (Simple, Combo)
  • expanded query (acronym, noun)
  • Multiple Subcollections
  • for search speed and scalability
  • search each subcollection using whole collection
    term statistics
  • Merge subcollection search results
  • merge sort by document score
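
The result-merging step can be sketched as a k-way merge of per-subcollection ranked lists. The function name, the `(doc_id, score)` tuple layout, and the `top_k` cutoff are assumptions made for illustration; the key point from the slide is that scores are comparable across subcollections because each one is searched with whole-collection term statistics.

```python
import heapq
from itertools import islice


def merge_subcollection_results(result_lists, top_k=1000):
    """Merge ranked lists from several subcollections into one ranked list.

    Each input list holds (doc_id, score) pairs already sorted by
    descending score.
    """
    merged = heapq.merge(*result_lists, key=lambda pair: pair[1], reverse=True)
    return list(islice(merged, top_k))
```

For example, with two hypothetical subcollection result lists, `merge_subcollection_results([[("a", 9.1), ("b", 4.0)], [("c", 7.5)]])` interleaves them purely by score.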

7
WIDIT Fusion Module
  • Fusion Formulas
  • Weighted Sum
  • $FS_{ws} = \sum_i (w_i \cdot NS_i)$
  • Overlap Weighted Sum
  • $FS_{ows} = \sum_i (w_i \cdot NS_i) \cdot olp$
  • Select candidate systems to combine
  • Top performers in each category
  • e.g. best stemmer, qry expansion, doc index
  • Diverse systems
  • e.g. Content-based, Link-based
  • One-time brute force combinations for validation
  • Complementary Strength effect
  • Determine system weights (wi)
  • Static Tuning
  • Evaluate fusion formulas using a fixed set of
    values (e.g. 0.1..1.0) with training data
  • Select the formulas with best performance

where
$w_i$ = weight of system i (relative contribution of each system)
$NS_i$ = normalized score of a document by system i $= (S_i - S_{min}) / (S_{max} - S_{min})$
olp = number of systems that retrieved a given document (overlap)
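
The two fusion formulas on this slide can be sketched as follows. This is a minimal illustration, not the WIDIT code: the function names are made up, a document missing from a system's list is assumed to contribute a normalized score of 0, and a result list whose scores are all equal is assumed to normalize to 0.

```python
def min_max_normalize(scores):
    """NS_i = (S_i - S_min) / (S_max - S_min) for one system's result list."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}  # assumed convention for flat lists
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(system_scores, weights, use_overlap=False):
    """Weighted Sum, or Overlap Weighted Sum when use_overlap is True.

    system_scores: one {doc_id: raw score} dict per system.
    weights:       one weight w_i per system, in the same order.
    """
    normalized = [min_max_normalize(s) for s in system_scores]
    fused = {}
    for doc in set().union(*normalized):
        total = sum(w * ns.get(doc, 0.0) for w, ns in zip(weights, normalized))
        if use_overlap:
            olp = sum(doc in ns for ns in normalized)  # systems retrieving doc
            total *= olp
        fused[doc] = total
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

Static tuning, as described above, would amount to calling `fuse` over a fixed grid of weight values on training topics and keeping the best-scoring combination.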
8
WIDIT Query Classification Module
  • Statistical Classification (SC)
  • Classifiers
  • Naïve Bayes
  • SVM
  • Training Data
  • Titles of 2003 topics (50 TD, 150 HP, 150 NP)
  • w/ and w/o stemming (Combo stemmer)
  • Training Data Enrichment for TD class
  • Added top-level Yahoo Government category labels
  • Linguistic Classification (LC)
  • Word Cues
  • Create HP and NP lexicons
  • Ad-hoc heuristic
  • e.g. HP if ends in all caps, NP if contains
    YYYY, TD if short topic
  • Combination
  • More ad-hoc heuristic
  • if strong word cue, LC; else if single word,
    TD; else SC
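
The combination rule can be sketched as below. Several details are assumptions because the slides do not specify them: the tiny cue sets (a few words taken from the lexicons on the Webpage Type Identification slide), reading "YYYY" as a four-digit year starting with 19 or 20, the order in which cues are checked, and treating any matching cue as "strong". The statistical classifier is passed in as a plain function.

```python
import re

HP_CUES = {"office", "bureau", "department"}   # illustrative subset
NP_CUES = {"report", "guide", "history"}       # illustrative subset


def linguistic_class(topic):
    """Return a class label from word cues, or None if no cue fires."""
    words = topic.split()
    lowered = {w.lower().strip("?.,'\"") for w in words}
    if re.search(r"\b(19|20)\d{2}\b", topic):
        return "NP"                      # contains a year (YYYY)
    if words and words[-1].isupper():
        return "HP"                      # ends in all caps
    if lowered & HP_CUES:
        return "HP"
    if lowered & NP_CUES:
        return "NP"
    return None


def classify(topic, statistical_classifier):
    """if strong word cue, LC; else if single word, TD; else SC."""
    cue = linguistic_class(topic)
    if cue is not None:
        return cue
    if len(topic.split()) == 1:
        return "TD"
    return statistical_classifier(topic)
```

Any trained model (for example Naïve Bayes or SVM, as listed above) can be wrapped as `statistical_classifier`.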

9
WIDIT Re-ranking Module
  • Re-ranking Factors
  • Field-specific Match
  • Query words, acronyms, phrases in URL, title,
    header, anchor text
  • Exact Match
  • title, header text, anchor text
  • body text
  • Indegree & Outdegree
  • URL Type: root, subroot, path, file
  • based on URL ending and slash count
  • Page Type: HPP, HP, NPP, NP, ??
  • based on word cue heuristic
  • Re-ranking Strategies
  • Rank Boosting
  • potential homepage if TD or HP topic
  • page w/ large outdegree if TD topic, large
    indegree if HP topic
  • page w/ exact field-specific match if HP or NP
    topic
  • Dynamic Tuning
  • dynamic/interactive optimization of the
    re-ranking formula

10
WIDIT Dynamic Tuning Interface
11
Dynamic Tuning Observations
  • Effective re-ranking factors
  • HP
  • indegree, outdegree, exact match, URL/pagetype
  • minimum outdegree = 1
  • NP
  • indegree, outdegree, URLtype
  • 1/3 impact of HP
  • TD
  • acronym, outdegree, URLtype
  • minimum outdegree = 10
  • Strength
  • Combines the human intelligence (pattern
    recognition) w/ computational power of machine
  • Good for system tuning with many parameters
  • Facilitates failure analysis
  • Weakness
  • Over-tuning
  • Sensitive to initial results & re-ranking
    parameter selection

12
Failure Analysis Observations
  • Acronym Effect
  • Q80: Department of Labor's wage and hour division
  • title: DOL ESA Wage Hour Division Home Page
  • Q89: CDC Rabies homepage
  • CDC expanded to Center for Disease Control
  • pages about CDC ranked high
  • Duplicate Documents
  • Q215: NIH Video cast
  • G00-74-1477693 (relevant), G00-05-3317821
    (relevant)
  • WIDIT eliminated pages with same URLs
  • Q188: Why study comets?
  • G00-48-1227124 (relevant), G32-10-1245341 (not
    relevant)
  • WIDIT ranked high the mirrored document (G32) due
    to its hyperlinks
  • Link Noise Effect
  • Q197: Vietnam war
  • (relevant) Johnson Administration's Foreign
    Relations volumes
  • 4 links to Vietnam volumes
  • (top ranks) pages about Vietnam w/ many
    navigational links
  • Topic Drift

13
Results: Mixed Query Task
  • Run Descriptions
  • Best fusion run: F3
  • F3 = 0.4·B(anchor) + 0.3·F1(1) + 0.3·F1(2)
  • F1(1) = 0.8·B(body) + 0.05·B(anchor) +
    0.15·B(header)
  • Static re-ranking runs
  • w/ guessed query type (SR_g), w/ official query
    types (SR_o)
  • Dynamic re-ranking runs (post-submission)
  • w/ guessed query type (DR_g), w/ official query
    types (DR_o)

14
Results: Query Classification Effect
  • Run Descriptions
  • Using official query types: DR_o
  • Using guessed query types: DR_g
  • Using random query types: DR_r
  • Using bad query types: DR_b

15
Concluding Remarks
  • What worked
  • Fusion
  • Combining multiple sources of evidence
  • Topic-sensitive retrieval strategies
  • Requires query classification
  • Dynamic Tuning
  • Helps multi-parameter tuning & failure analysis
  • Future research
  • Link Analysis
  • PageRank, Hub/Authority scores for re-ranking
  • Link noise reduction based on page layout
  • Page type identification
  • Density of relevant links
  • Link structure
  • e.g. Hub = TD?, Authority = HP?
  • Integration of re-ranking formula into retrieval
    module

16
Resources
  • WIDIT
  • http://widit.slis.indiana.edu/
  • http://elvis.slis.indiana.edu/
  • Dynamic Tuning Interface (Web track)
  • http://elvis.slis.indiana.edu/TREC/web/results/test/postsub0/
  • WIDIT → projects → TREC → Web track

17
SMART
  • Length-Normalized Term Weights
  • SMART lnu weight for document terms
  • SMART ltc weight for query terms
  • where $f_{ik}$ = number of times term k appears
    in document i
  • $idf_k$ = inverse document frequency of term k
  • t = number of terms in document/query
  • Document Score
  • inner product of document and query vectors
  • where $q_k$ = weight of term k in the query
  • $d_{ik}$ = weight of term k in document i
  • t = number of terms common to query &
    document
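
The formula images for this slide are not part of the transcript. A form consistent with the symbol definitions above is given here as an assumption, not as the slide's exact notation: a log-tf document weight and a log-tf·idf query weight, each divided by the length of its vector. The symbol $f_k$ (frequency of term k in the query) is introduced only for this sketch.

```latex
% Assumed document term weight (log tf, length-normalized over t terms)
d_{ik} = \frac{\log(f_{ik}) + 1}{\sqrt{\sum_{j=1}^{t} \left(\log(f_{ij}) + 1\right)^{2}}}

% Assumed query term weight (log tf times idf, length-normalized)
q_{k} = \frac{\left(\log(f_{k}) + 1\right) \cdot idf_{k}}
             {\sqrt{\sum_{j=1}^{t} \left[\left(\log(f_{j}) + 1\right) \cdot idf_{j}\right]^{2}}}

% Document score: inner product over the t terms common to query and document
\mathrm{score}(q, d_i) = \sum_{k=1}^{t} q_{k} \, d_{ik}
```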

18
Okapi
  • Document Ranking
  • where Q = query containing terms T
  • $K = k_1 \left((1 - b) + b \cdot \frac{\text{doc\_length}}{\text{avg.doc\_length}}\right)$
  • tf = term frequency in a document
  • qtf = term frequency in a query
  • k1, b, k3 = parameters (1.2, 0.75, 7..1000)
  • wRS = Robertson-Sparck Jones weight
  • N = total number of documents in the
    collection
  • n = total number of documents in which the
    term occurs
  • R = total number of relevant documents in
    the collection
  • Document term weight
  • (simplified formula)
  • Query term weight

19
Webpage Type Identification
  • URL Type (Tomlinson, 2003; Kraaij et al., 2002)
  • Heuristic
  • root: slash_cnt = 0 or (HP_end &
    slash_cnt = 1)
  • subroot: HP_end & slash_cnt = 2
  • path: HP_end & slash_cnt >= 3
  • file: rest
  • (HP_end = 1 if URL ends w/ index.htm,
    default.htm, /, etc)
  • Page Type
  • Heuristic
  • if welcome or home in title, header, anchor
    text → HPP
  • else if YYYY in title, anchor → NPP
  • else if NP lexicon word → NP
  • else if HP lexicon word → HP
  • else if ends in all caps → HP
  • else → ??
  • NP lexicon
  • about, annual, report, guide, studies, history,
    new, how
  • HP lexicon
  • office, bureau, department, institute, center,
    committee, agency, administration, council,
    society, service, corporation, commission, board,
    division, museum, library, project, group,
    program, laboratory, site, authority, study,
    industry
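
The two heuristics on this slide can be sketched as follows. Several choices are assumptions: slashes are counted in the URL path only (after scheme and host), the "etc" in the HP_end rule is filled with a short list of common index file names, "YYYY" is read as a year starting with 19 or 20, cue words are matched as whole words, and "ends in all caps" is applied to the title. The function and constant names are made up for this example.

```python
import re
from urllib.parse import urlparse

HP_ENDINGS = ("index.htm", "index.html", "default.htm", "default.html")
NP_LEXICON = {"about", "annual", "report", "guide", "studies", "history",
              "new", "how"}
HP_LEXICON = {"office", "bureau", "department", "institute", "center",
              "committee", "agency", "administration", "council", "society",
              "service", "corporation", "commission", "board", "division",
              "museum", "library", "project", "group", "program",
              "laboratory", "site", "authority", "study", "industry"}


def url_type(url):
    """Classify a URL as root, subroot, path, or file."""
    path = urlparse(url).path
    slash_cnt = path.count("/")
    hp_end = path.endswith("/") or path.lower().endswith(HP_ENDINGS)
    if slash_cnt == 0 or (hp_end and slash_cnt == 1):
        return "root"
    if hp_end and slash_cnt == 2:
        return "subroot"
    if hp_end and slash_cnt >= 3:
        return "path"
    return "file"


def page_type(title, header="", anchor=""):
    """Classify a page as HPP, NPP, NP, HP, or ?? from word cues."""
    words = set(re.findall(r"[a-z]+", f"{title} {header} {anchor}".lower()))
    if words & {"welcome", "home"}:
        return "HPP"
    if re.search(r"\b(19|20)\d{2}\b", f"{title} {anchor}"):
        return "NPP"
    if words & NP_LEXICON:
        return "NP"
    if words & HP_LEXICON:
        return "HP"
    if title.split() and title.split()[-1].isupper():
        return "HP"
    return "??"
```

The checks run in the order given on the slide, so an NP lexicon word takes precedence over an HP lexicon word when both appear.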

20
WIDIT Basic Approach
  • Leverage multiple sources of evidence (MSE)
  • Document Content
  • Document Structure
  • title, meta keywords & description, emphasized
    words
  • Link Structure
  • anchor text, in/outdegree
  • Parallel search
  • Multiple Index
  • body text (title, body)
  • anchor text (title, inlink anchor text)
  • header text (title, meta kw & desc, first
    heading, emphasized words)
  • Multiple Query formulations
  • stemming, acronyms, nouns, synonyms, definitions
  • Combine results (i.e. fusion)
  • Weighted Sum formula
  • Post-retrieval re-ranking
  • Acronym, Phrase, Exact, Field-specific match
  • In/Outdegree, URL information, Site Compression
  • Pseudo-feedback