1
RMIT University at TREC 2004 Terabyte Track Experiments
  • Bodo Billerbeck, Adam Cannane, Abhijit Chattaraj,
    Nicholas Lester
  • William Webber, Hugh E. Williams, John Yiannis,
    and Justin Zobel
  • Email: hugh@cs.rmit.edu.au
  • School of Computer Science and Information Technology, RMIT University, Melbourne, Australia.

2
Overview
  • Zettair
  • Features, indexing, and querying
  • Runs: all automatic, title-only
  • Run 1: Baseline
  • Run 2: Anchor text
  • Run 3: Fuzzy phrase
  • Run 4: Priority
  • Run 5: Query expansion
  • Results
  • Efficiency
  • Effectiveness
  • Final thoughts

3
Zettair
  • From zetta (10^21) and IR
  • A scalable, fast search engine server
  • Supports ranked, simple Boolean, and phrase
    queries
  • Indexes HTML, plain text, and TREC-formatted
    documents
  • Usable as a C and Python library
  • Native support for TREC experiments
  • Documented. Includes easy-to-follow examples
  • BSD license
  • Emphasis on simplicity and efficiency
  • One executable does everything
  • Under continued development
  • Ported to Mac OS X, FreeBSD, MS Windows, Linux,
    Solaris
  • Available from www.seg.rmit.edu.au/zettair

4
Zettair Indexing
  • Single-pass, sort-merge scheme
  • Document-ordered, word-position inverted indexes (toy sketch at the end of this slide)
  • Efficient, variable-byte index compression
  • Indexed the GOV2 collection (426 GB) in under 14 hours on a single AUD $2000 Intel P4 machine. Throughput: 30 GB/hour
  • Improved to just over 11 hours post-evaluation. Throughput: 38 GB/hour
  • Fast configurable parser. Handles badly-formed
    HTML
  • Validates each tag by matching < with > within a character limit; HTML comments are not indexed but are validated
  • Entity references translated
  • No support for internationalised text
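
As a rough illustration of the document-ordered, word-position postings mentioned above, here is a toy in-memory sketch (Zettair's single-pass indexer writes compressed sorted runs to disk and merges them, which is not shown):

def build_postings(docs):
    """Toy version of document-ordered, word-position postings:
    term -> list of (doc_id, [word positions]), documents in order.
    The real indexer spills sorted runs to disk and merges them."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            plist = postings.setdefault(term, [])
            if plist and plist[-1][0] == doc_id:
                plist[-1][1].append(pos)       # another occurrence in this document
            else:
                plist.append((doc_id, [pos]))  # first occurrence in this document
    return postings

# build_postings(["the cat sat on the mat", "the dog"])["the"]
# -> [(0, [0, 4]), (1, [0])]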

5
Variable-byte indexes
  • Inverted indexes store terms and term occurrences
  • Term occurrence information (the postings) consists of integers
  • We store integers in a variable number of bytes
  • 7 bits of each block store the binary value of
    integer k
  • the most (or, optionally, least) significant bit
    is set to 1 if other blocks follow, or 0 for the
    final block
  • For example, for the integer k = 5, the variable-byte representation is 0 0000101
  • For the integer k = 270, the variable-byte representation is 1 0000010 | 0 0001110 (an encode/decode sketch follows this list)
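
A minimal encode/decode sketch of the variable-byte scheme just described (an illustration only, not Zettair's C implementation), using the most-significant-group-first, flag-in-top-bit convention from the examples:

def vbyte_encode(k):
    """Encode a non-negative integer: 7 bits of payload per byte, most
    significant group first; the top bit is 1 if more bytes follow."""
    groups = []
    while True:
        groups.append(k & 0x7F)
        k >>= 7
        if k == 0:
            break
    groups.reverse()                       # most significant group first
    out = bytearray()
    for i, g in enumerate(groups):
        flag = 0x80 if i < len(groups) - 1 else 0x00   # 1 = more bytes follow
        out.append(flag | g)
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of variable-byte integers."""
    values, k = [], 0
    for b in data:
        k = (k << 7) | (b & 0x7F)
        if not (b & 0x80):                 # flag 0 marks the final byte
            values.append(k)
            k = 0
    return values

assert vbyte_encode(5)   == bytes([0b00000101])              # 0 0000101
assert vbyte_encode(270) == bytes([0b10000010, 0b00001110])  # 1 0000010 | 0 0001110
assert vbyte_decode(vbyte_encode(5) + vbyte_encode(270)) == [5, 270]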

6
Performance
  • Results for 10,000 queries on 20 GB from the TREC-7 VLC2 collection (taken from Scholer, Williams, Yiannis, and Zobel, SIGIR 2002)

7
Zettair Querying
  • B-tree vocabulary bulk-loaded at index
    construction time
  • For a 25 GB collection, average query time is 70 milliseconds (without explicit caching or other optimisations)
  • Single-threaded, blocking I/O, and relatively
    unoptimised
  • Provides query-biased summaries of documents (see
    Tombros and Sanderson, SIGIR 1998)
  • Supports pivoted cosine and Okapi BM25 metrics
  • Working on further metrics
  • Metrics can be manipulated externally
  • The BM25 formulation is given in the notebook paper (an illustrative sketch follows this list)
  • Reads TREC topic files and supports output in
    original trec_eval format
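
The precise BM25 formulation Zettair uses is in the notebook paper; purely as an illustration, this is one common Okapi BM25 term weight (the k1 and b values here are conventional defaults, not necessarily those used in the runs):

import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, N, k1=1.2, b=0.75):
    """One common Okapi BM25 weight for a query term in a document.
    tf: occurrences of the term in the document; df: number of documents
    containing the term; N: number of documents in the collection."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# A document's score for a ranked query is the sum of these weights over
# the query terms it contains.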

8
Run 1: Baseline
  • Index on full text
  • Words are terms
  • Does not index phrases, URLs, or make use of
    inlink anchortext
  • Title, metadata, etc. treated as plain text
  • Okapi BM25 metric
  • Automatic, title-only queries
  • No stopping or stemming
  • Used as basis for all other runs
  • Hopefully, a useful comparison point

9
Run 2: Combined
  • Parallel indexes
  • Index on full text, ranked using Okapi BM25
    metric
  • Index on inlink anchor text, ranked using the Hawkapi metric ("Toward better weighting of anchors", Hawking, Upstill, and Craswell, SIGIR 2004)
  • Scores combined, with anchortext weighting
    determined from training evaluations (more in a
    moment)
  • We used .GOV queries from past TRECs on the GOV2 collection and made our own relevance judgments
  • No particular significance or principles in the
    score combination
  • Combined index used in Fuzzy Phrase and Priority
    techniques (discussed later)

10
Combined example
  • Queries are evaluated separately using full text
    and anchortext indexes (with BM25 and Hawkapi
    metrics respectively)
  • First 50,000 answers returned from both
    evaluations
  • If a document appears in both result sets, the scores are combined with the anchortext score weighted at 0.2; otherwise, only the full-text score is used
  • For example, consider the query "horse racing jockey weight" and document GX246-35-4289557
  • Plain score: 50.522, ranked 14
  • Anchor score: 1.592, ranked 768
  • Combined score: 50.522 + (0.2 × 1.592) = 50.841, final rank 1 (see the sketch below)
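
A minimal sketch of the combination step described above (function and variable names are illustrative; the 50,000-answer cut-off and the 0.2 anchortext weight come from the slides):

def combine_scores(fulltext, anchortext, anchor_weight=0.2):
    """fulltext, anchortext: dicts mapping docno -> score from the two
    separate evaluations (top 50,000 answers from each index)."""
    combined = {}
    for docno, score in fulltext.items():
        if docno in anchortext:          # document appears in both result sets
            combined[docno] = score + anchor_weight * anchortext[docno]
        else:                            # full-text score only
            combined[docno] = score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# For GX246-35-4289557 above: 50.522 + 0.2 * 1.592 gives roughly 50.84.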

11
3 Fuzzy Phrase
  • We observed (Williams, Zobel, and Bahle, ACM TOIS 2004) that around 40% of ranked queries can be evaluated as phrase queries
  • For example, most home page finding queries
    appear not to explicitly use quotation marks
  • We investigated whether fuzzy phrases could aid
    ranking
  • A fuzzy phrase is a phrase in which ordering and
    adjacency are flexible
  • For example, "cat sat" (sloppiness 2) matches
  • "cat sat", "cat - sat", "cat - - sat", and "sat cat"
  • but not, for example,
  • "cat - - - sat" or "sat - cat" (a matching sketch follows this list)
  • Fuzzy phrases are added as terms to a ranked query (example next), and ranking metric scores are computed from the statistics of the fuzzy-phrase term
  • The fuzzified query is evaluated on the combined index
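
The following two-word matching rule is consistent with the "cat sat" examples above (an inferred illustration, not necessarily Zettair's exact definition): shifting the second word back to the slot immediately after the first must cost at most the sloppiness value.

def sloppy_match(pos_w1, pos_w2, sloppiness):
    """Two-word fuzzy phrase "w1 w2": the word positions match if moving
    w2 to the position right after w1 costs at most `sloppiness` moves."""
    return abs((pos_w2 - pos_w1) - 1) <= sloppiness

# "cat sat" with sloppiness 2 (arguments are the positions of cat, sat):
assert sloppy_match(0, 1, 2)        # "cat sat"
assert sloppy_match(0, 2, 2)        # "cat - sat"
assert sloppy_match(0, 3, 2)        # "cat - - sat"
assert sloppy_match(1, 0, 2)        # "sat cat"
assert not sloppy_match(0, 4, 2)    # "cat - - - sat"
assert not sloppy_match(2, 0, 2)    # "sat - cat"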

12
Fuzzy Phrase Example
  • Consider query 712, "pyramid scheme"
  • We expand this to: pyramid scheme "pyramid scheme" (sloppiness 5)
  • Evaluated on the combined index
  • Consider the Okapi BM25 metric scores from the baseline index for the above query and document GX018-07-13587191
  • pyramid: 12.708
  • scheme: 9.602
  • "pyramid scheme" (sloppy 5): 15.810
  • Total score: 12.708 + 9.602 + 15.810 = 38.120

13
Fuzzy Phrase Example
  • The anchortext scores using Hawkapi are
  • pyramid: 1.584
  • scheme: 1.445
  • "pyramid scheme" (sloppy 5): 1.995
  • Total: 1.584 + 1.445 + 1.995 = 5.024
  • For this run, we use an anchor weighting of 1.0
  • So, the combined score for the document is
  • 38.120 + 5.024 = 43.144

14
Run 4: Priority
  • Priority gives a fixed boost to document scores for each criterion a document satisfies
  • Criteria we used for this run:
  • All query terms appear in a fuzzy phrase with sloppiness 5
  • All query terms appear in the first 50 words of the document
  • The boost amount was the score achieved by the first-ranked answer after evaluation with the combined index. Therefore, any document matching N criteria is ranked higher than any document matching N - 1 criteria (see the sketch below)
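
A minimal sketch of the boosting rule (illustrative names; the criterion tests themselves are not shown):

def apply_priority(results, criteria_met):
    """results: dict docno -> score from the combined index.
    criteria_met: dict docno -> number of criteria satisfied (0, 1, or 2).
    The boost unit is the top combined-index score, so a document meeting
    N criteria always outranks one meeting N - 1."""
    boost = max(results.values())
    boosted = {d: s + criteria_met.get(d, 0) * boost for d, s in results.items()}
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)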

15
Run 5: Query Expansion
  • For the expansion run we used the local analysis method of Robertson and Walker from TREC-8
  • Evaluate the topic on full text
  • 25 terms with the lowest term selection value are
    chosen from the full text of the top 10 ranked
    documents
  • Terms are appended to the query, after adjusting their weights, and the query is re-evaluated on the full text
  • The anchor index is not used
  • Details of the formulation are in the notebook paper (a schematic sketch follows)
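
A schematic sketch of the expansion run (the callables are placeholders supplied by the engine; the actual term selection value and weight adjustment follow the Robertson and Walker formulation in the notebook paper):

def expand_and_rerun(query_terms, run_query, doc_terms, selection_value,
                     top_r=10, n_terms=25):
    """run_query(terms) -> ranked [(docno, score)] over the full text;
    doc_terms(docno) -> terms of that document;
    selection_value(term) -> Robertson/Walker term selection value."""
    initial = run_query(query_terms)                 # evaluate the topic on full text
    candidates = set()
    for docno, _ in initial[:top_r]:                 # top 10 ranked documents
        candidates.update(doc_terms(docno))
    chosen = sorted(candidates, key=selection_value)[:n_terms]  # lowest values
    # Append the chosen terms (with adjusted weights, per the notebook paper)
    # and re-evaluate on the full text; the anchor index is not used.
    return run_query(list(query_terms) + chosen)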

16
QE Example
  • Consider again query 712, "pyramid scheme"
  • The following additional terms are added to the query:
  • scam, investors, money, schemes, recruiting, fraud, scams, participants, ftc, pyramids, dollars, chain, consumer, fraudulent, investment, fortuna, selling, technojargon, claims, thousands, entice, others, 5250, mlms, consumers

17
Efficiency Results
  • GOV2 collection indexed on a single AUD $2000 Intel P4 machine
  • 24.3 million documents, 426 GB of text
  • 13 hours 34 minutes to index, at 30 GB/hour
  • 2 seconds per query to search for runs 1 to 4; 25 seconds per query for run 5
  • No stopping or stemming
  • Limited accumulators with the continue strategy
  • Interesting statistics:
  • Full-text index size, with full word positions, was 10.5% of the collection size (45 GB)
  • Longest inverted list (for "the"): 1 GB
  • Distinct terms: 87.7 million
  • Term occurrences: 20.7 billion

18
Effectiveness Results
19
MAP for all topics
20
MAP of run 5 (query expansion)
21
Run 5 (query expansion) MAP vs. run 1 (plain) MAP
22
Run 5 (query expansion) MAP vs. median
23
Run 4 (priority) MAP vs. run 1 (plain) MAP
24
Effectiveness Results
  • Surprised that no technique is on average significantly better than full text or full text + anchortext
  • Run 5: Query expansion
  • Subject of a long-term project, so
    well-understood
  • Works well for some topics (highest MAP for 4 topics, above median for 34 topics); identifying those a priori is difficult
  • Run 2: Combined anchor text
  • Further investigation of anchortext weightings
    and combination
  • Run 4: Priority
  • Prioritisation is fairly crude; we plan experiments with smaller boost amounts and flexible criteria
  • Perhaps not a good idea for PDFs, Word, and so on
  • Run 3: Fuzzy phrase
  • The result was a surprise; we previously had good results in training
  • May need to try different weightings for phrase,
    and for anchortext combination

25
Final Thoughts
  • Five very different runs, combining ideas from
    across the SEG team
  • Query expansion, phrases, anchor text, priority
  • Surprises in the results
  • Results and qrels made available last week
  • More evaluation and follow up after TREC
  • Fast indexing and querying
  • Indexing (now) at 38 GB/hour on an Intel P4, and queries in under 2 seconds without explicit caching, stopping, or optimisation
  • The Zettair search engine was used for the first time at TREC
  • Very fast, portable, easy to use, hopefully useful to others
  • Get a copy from http://www.seg.rmit.edu.au/zettair