1
RMIT University at TREC 2004 Terabyte Track Experiments
  • Bodo Billerbeck, Adam Cannane, Abhijit Chattaraj,
    Nicholas Lester
  • William Webber, Hugh E. Williams, John Yiannis,
    and Justin Zobel
  • Email: hugh@cs.rmit.edu.au
  • School of Computer Science and Information Technology, RMIT University, Melbourne, Australia.

2
Overview
  • Zettair
  • Features, indexing, and querying
  • Runs: all automatic, title-only
  • Run 1: Baseline
  • Run 2: Anchor text
  • Run 3: Fuzzy phrase
  • Run 4: Priority
  • Run 5: Query expansion
  • Results
  • Efficiency
  • Effectiveness
  • Final thoughts

3
Zettair
  • From zetta (10^21) and IR
  • A scalable, fast search engine server
  • Supports ranked, simple Boolean, and phrase
    queries
  • Indexes HTML, plain text, and TREC-formatted
    documents
  • Usable as a C and Python library
  • Native support for TREC experiments
  • Documented. Includes easy-to-follow examples
  • BSD license
  • Emphasis on simplicity and efficiency
  • One executable does everything
  • Under continued development
  • Ported to Mac OS X, FreeBSD, MS Windows, Linux,
    Solaris
  • Available from www.seg.rmit.edu.au/zettair

4
Zettair Indexing
  • Single-pass, sort-merge scheme
  • Document-ordered, word-position inverted indexes (toy sketch at the end of this slide)
  • Efficient, variable-byte index compression
  • Indexed the GOV2 collection (426 GB) in under 14 hours on a single AUD $2000 Intel P4 machine. Throughput: 30 GB/hour
  • Improved to just over 11 hours post-evaluation. Throughput: 38 GB/hour
  • Fast configurable parser. Handles badly-formed
    HTML
  • Validates each tag by matching < with > within a character limit; HTML comments are not indexed but are validated
  • Entity references translated
  • No support for internationalised text
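
As a rough illustration of the document-ordered, word-position postings mentioned above, here is a toy in-memory sketch (Zettair's single-pass indexer writes compressed sorted runs to disk and merges them, which is not shown):

def build_postings(docs):
    """Toy version of document-ordered, word-position postings:
    term -> list of (doc_id, [word positions]), documents in order.
    The real indexer spills sorted runs to disk and merges them."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            plist = postings.setdefault(term, [])
            if plist and plist[-1][0] == doc_id:
                plist[-1][1].append(pos)       # another occurrence in this document
            else:
                plist.append((doc_id, [pos]))  # first occurrence in this document
    return postings

# build_postings(["the cat sat on the mat", "the dog"])["the"]
# -> [(0, [0, 4]), (1, [0])]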

5
Variable-byte indexes
  • Inverted indexes store terms and term occurrences
  • Term occurrence information (the postings) consists of integers
  • We store integers in a variable number of bytes
  • 7 bits of each block store the binary value of
    integer k
  • the most (or, optionally, least) significant bit
    is set to 1 if other blocks follow, or 0 for the
    final block
  • For example, for the integer k = 5, the variable-byte representation is 0 0000101
  • For the integer k = 270, the variable-byte representation is 1 0000010 | 0 0001110 (an encode/decode sketch follows this list)
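
A minimal encode/decode sketch of the variable-byte scheme just described (an illustration only, not Zettair's C implementation), using the most-significant-group-first, flag-in-top-bit convention from the examples:

def vbyte_encode(k):
    """Encode a non-negative integer: 7 bits of payload per byte, most
    significant group first; the top bit is 1 if more bytes follow."""
    groups = []
    while True:
        groups.append(k & 0x7F)
        k >>= 7
        if k == 0:
            break
    groups.reverse()                       # most significant group first
    out = bytearray()
    for i, g in enumerate(groups):
        flag = 0x80 if i < len(groups) - 1 else 0x00   # 1 = more bytes follow
        out.append(flag | g)
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of variable-byte integers."""
    values, k = [], 0
    for b in data:
        k = (k << 7) | (b & 0x7F)
        if not (b & 0x80):                 # flag 0 marks the final byte
            values.append(k)
            k = 0
    return values

assert vbyte_encode(5)   == bytes([0b00000101])              # 0 0000101
assert vbyte_encode(270) == bytes([0b10000010, 0b00001110])  # 1 0000010 | 0 0001110
assert vbyte_decode(vbyte_encode(5) + vbyte_encode(270)) == [5, 270]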

6
Performance
  • Results for 10,000 queries on 20 GB from the TREC-7 VLC2 collection (taken from Scholer, Williams, Yiannis, and Zobel, SIGIR 2002)

7
Zettair Querying
  • B-tree vocabulary bulk-loaded at index
    construction time
  • For a 25 GB collection, average query time is 70 milliseconds (without explicit caching or other optimisations)
  • Single-threaded, blocking I/O, and relatively
    unoptimised
  • Provides query-biased summaries of documents (see
    Tombros and Sanderson, SIGIR 1998)
  • Supports pivoted cosine and Okapi BM25 metrics
  • Working on further metrics
  • Metrics can be manipulated externally
  • The BM25 formulation is given in the notebook paper (an illustrative sketch follows this list)
  • Reads TREC topic files and supports output in
    original trec_eval format
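
The precise BM25 formulation Zettair uses is in the notebook paper; purely as an illustration, this is one common Okapi BM25 term weight (the k1 and b values here are conventional defaults, not necessarily those used in the runs):

import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, N, k1=1.2, b=0.75):
    """One common Okapi BM25 weight for a query term in a document.
    tf: occurrences of the term in the document; df: number of documents
    containing the term; N: number of documents in the collection."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf_part = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# A document's score for a ranked query is the sum of these weights over
# the query terms it contains.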

8
Run 1: Baseline
  • Index on full text
  • Words are terms
  • Does not index phrases, URLs, or make use of
    inlink anchortext
  • Title, metadata, etc. treated as plain text
  • Okapi BM25 metric
  • Automatic, title-only queries
  • No stopping or stemming
  • Used as basis for all other runs
  • Hopefully, a useful comparison point

9
Run 2: Combined
  • Parallel indexes
  • Index on full text, ranked using Okapi BM25
    metric
  • Index on inlink anchor text, ranked using the Hawkapi metric ("Toward better weighting of anchors", Hawking, Upstill, and Craswell, SIGIR 2004)
  • Scores combined, with anchortext weighting
    determined from training evaluations (more in a
    moment)
  • We used .GOV queries from past TRECs on the GOV2 collection and made our own relevance judgments
  • No particular significance or principles in the
    score combination
  • Combined index used in Fuzzy Phrase and Priority
    techniques (discussed later)

10
Combined example
  • Queries are evaluated separately using full text
    and anchortext indexes (with BM25 and Hawkapi
    metrics respectively)
  • First 50,000 answers returned from both
    evaluations
  • If a document appears in both result sets, the scores are combined with the anchortext score weighted at 0.2; otherwise, only the full-text score is used
  • For example, consider the query "horse racing jockey weight" and document GX246-35-4289557
  • Plain score: 50.522, ranked 14
  • Anchor score: 1.592, ranked 768
  • Combined score: 50.522 + (0.2 × 1.592) = 50.841, final rank 1 (see the sketch below)
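
A minimal sketch of the combination step described above (function and variable names are illustrative; the 50,000-answer cut-off and the 0.2 anchortext weight come from the slides):

def combine_scores(fulltext, anchortext, anchor_weight=0.2):
    """fulltext, anchortext: dicts mapping docno -> score from the two
    separate evaluations (top 50,000 answers from each index)."""
    combined = {}
    for docno, score in fulltext.items():
        if docno in anchortext:          # document appears in both result sets
            combined[docno] = score + anchor_weight * anchortext[docno]
        else:                            # full-text score only
            combined[docno] = score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# For GX246-35-4289557 above: 50.522 + 0.2 * 1.592 gives roughly 50.84.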

11
3 Fuzzy Phrase
  • We observed (Williams, Zobel, and Bahle, ACM TOIS 2004) that around 40% of ranked queries can be evaluated as phrase queries
  • For example, most home page finding queries
    appear not to explicitly use quotation marks
  • We investigated whether fuzzy phrases could aid
    ranking
  • A fuzzy phrase is a phrase in which ordering and
    adjacency are flexible
  • For example, "cat sat" (sloppiness 2) matches
  • "cat sat", "cat - sat", "cat - - sat", and "sat cat"
  • but not, for example,
  • "cat - - - sat" or "sat - cat" (a matching sketch follows this list)
  • Fuzzy phrases are added as terms to a ranked query (example next), and ranking metric scores are computed from the statistics of the fuzzy-phrase term
  • The fuzzified query is evaluated on the combined index
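
The following two-word matching rule is consistent with the "cat sat" examples above (an inferred illustration, not necessarily Zettair's exact definition): shifting the second word back to the slot immediately after the first must cost at most the sloppiness value.

def sloppy_match(pos_w1, pos_w2, sloppiness):
    """Two-word fuzzy phrase "w1 w2": the word positions match if moving
    w2 to the position right after w1 costs at most `sloppiness` moves."""
    return abs((pos_w2 - pos_w1) - 1) <= sloppiness

# "cat sat" with sloppiness 2 (arguments are the positions of cat, sat):
assert sloppy_match(0, 1, 2)        # "cat sat"
assert sloppy_match(0, 2, 2)        # "cat - sat"
assert sloppy_match(0, 3, 2)        # "cat - - sat"
assert sloppy_match(1, 0, 2)        # "sat cat"
assert not sloppy_match(0, 4, 2)    # "cat - - - sat"
assert not sloppy_match(2, 0, 2)    # "sat - cat"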

12
Fuzzy Phrase Example
  • Consider query 712, "pyramid scheme"
  • We expand this to: pyramid scheme "pyramid scheme" (sloppiness 5)
  • Evaluated on the combined index
  • Consider the Okapi BM25 metric scores from the baseline index for the above query and document GX018-07-13587191
  • pyramid: 12.708
  • scheme: 9.602
  • "pyramid scheme" (sloppy 5): 15.810
  • Total score: 12.708 + 9.602 + 15.810 = 38.120

13
Fuzzy Phrase Example
  • The anchortext scores using Hawkapi are
  • pyramid: 1.584
  • scheme: 1.445
  • "pyramid scheme" (sloppy 5): 1.995
  • Total: 1.584 + 1.445 + 1.995 = 5.024
  • For this run, we use an anchor weighting of 1.0
  • So, the combined score for the document is
  • 38.120 + 5.024 = 43.144

14
Run 4: Priority
  • Priority gives a fixed boost to document scores for each criterion a document satisfies
  • Criteria we used for this run:
  • All query terms appear in a fuzzy phrase with sloppiness 5
  • All query terms appear in the first 50 words of the document
  • The boost amount was the score achieved by the first-ranked answer after evaluation with the combined index. Therefore, any document matching N criteria is ranked higher than any document matching N - 1 criteria (see the sketch below)
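
A minimal sketch of the boosting rule (illustrative names; the criterion tests themselves are not shown):

def apply_priority(results, criteria_met):
    """results: dict docno -> score from the combined index.
    criteria_met: dict docno -> number of criteria satisfied (0, 1, or 2).
    The boost unit is the top combined-index score, so a document meeting
    N criteria always outranks one meeting N - 1."""
    boost = max(results.values())
    boosted = {d: s + criteria_met.get(d, 0) * boost for d, s in results.items()}
    return sorted(boosted.items(), key=lambda item: item[1], reverse=True)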

15
Run 5: Query Expansion
  • For the expansion run we used the local analysis method of Robertson and Walker from TREC-8
  • Evaluate the topic on full text
  • 25 terms with the lowest term selection value are
    chosen from the full text of the top 10 ranked
    documents
  • Terms are appended to the query, after adjusting their weights, and the query is re-evaluated on the full text
  • The anchor index is not used
  • Details of the formulation are in the notebook paper (a schematic sketch follows)
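
A schematic sketch of the expansion run (the callables are placeholders supplied by the engine; the actual term selection value and weight adjustment follow the Robertson and Walker formulation in the notebook paper):

def expand_and_rerun(query_terms, run_query, doc_terms, selection_value,
                     top_r=10, n_terms=25):
    """run_query(terms) -> ranked [(docno, score)] over the full text;
    doc_terms(docno) -> terms of that document;
    selection_value(term) -> Robertson/Walker term selection value."""
    initial = run_query(query_terms)                 # evaluate the topic on full text
    candidates = set()
    for docno, _ in initial[:top_r]:                 # top 10 ranked documents
        candidates.update(doc_terms(docno))
    chosen = sorted(candidates, key=selection_value)[:n_terms]  # lowest values
    # Append the chosen terms (with adjusted weights, per the notebook paper)
    # and re-evaluate on the full text; the anchor index is not used.
    return run_query(list(query_terms) + chosen)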

16
QE Example
  • Consider again query 712, "pyramid scheme"
  • The following additional terms are added to the query:
  • scam, investors, money, schemes, recruiting, fraud, scams, participants, ftc, pyramids, dollars, chain, consumer, fraudulent, investment, fortuna, selling, technojargon, claims, thousands, entice, others, 5250, mlms, consumers

17
Efficiency Results
  • GOV2 collection indexed on a single AUD $2000 Intel P4 machine
  • 24.3 million documents, 426 GB of text
  • 13 hours 34 minutes to index, at 30 GB/hour
  • 2 seconds per query to search for runs 1 to 4; 25 seconds per query for run 5
  • No stopping or stemming
  • Limited accumulators with the continue strategy
  • Interesting statistics:
  • Full-text index size, with full word positions, was 10.5% of the collection size (45 GB)
  • Longest inverted list (for "the"): 1 GB
  • Distinct terms: 87.7 million
  • Term occurrences: 20.7 billion

18
Effectiveness Results
19
MAP for all topics
20
MAP of run 5 (query expansion)
21
Run 5 (query expansion) MAP vs. run 1 (plain) MAP
22
Run 5 (query expansion) MAP vs. median
23
Run 4 (priority) MAP vs. run 1 (plain) MAP
24
Effectiveness Results
  • Surprised that no technique is on average significantly better than full text or full text + anchortext
  • Run 5: Query expansion
  • Subject of a long-term project, so
    well-understood
  • Works well for some topics (highest MAP for 4 topics, above median for 34 topics); identifying those a priori is difficult
  • Run 2: Combined anchor text
  • Further investigation of anchortext weightings
    and combination
  • Run 4: Priority
  • Prioritisation is fairly crude; we plan experiments with smaller boost amounts and flexible criteria
  • Perhaps not a good idea for PDFs, Word, and so on
  • Run 3: Fuzzy phrase
  • The result was a surprise; we previously had good results in training
  • May need to try different weightings for phrase,
    and for anchortext combination

25
Final Thoughts
  • Five very different runs, combining ideas from
    across the SEG team
  • Query expansion, phrases, anchor text, priority
  • Surprises in the results
  • Results and qrels made available last week
  • More evaluation and follow up after TREC
  • Fast indexing and querying
  • Indexing (now) at 38 GB/hour on an Intel P4, and queries in under 2 seconds without explicit caching, stopping, or optimisation
  • The Zettair search engine was used for the first time at TREC
  • Very fast, portable, easy to use, hopefully useful to others
  • Get a copy from http://www.seg.rmit.edu.au/zettair