1 SPIRIT Search: Fast, Effective Retrieval of 100 million documents for TREC2004/I-SPY
- Paul Ferguson
- Peter Wilkins, Cathal Gurrin, Alan Smeaton
- Centre for Digital Video Processing
- Dublin City University, Ireland
2 Agenda
- SPIRIT Search
- The Documents
- The Search Engine
- UCD Collaboration
- I-SPY
- TREC-2004
- Terabyte task
- Integration with I-SPY
3 SPIRIT Search
- Although not as large as Google, we are searching large collections.
- The SPIRIT collection of web pages
- TREC Terabyte 2004 collection
- The SPIRIT collection of web pages is hosted at DCU
- Also in Sheffield and Glasgow
- We will be getting the TREC Terabyte collection in June
4 SPIRIT (Collection)
- Gathered almost three years ago by Univ. Waterloo, Canada
- A general web crawl
- Consists of just HTML pages, with a large domain distribution, leading to language distribution (an estimate of over 20 non-English)
- 94.5 million pages
- 1.2TB of disk space
- 500 million closed collection links
- 1.5 billion other links
- Estimated 200 million unique image URLs
6 SPIRIT (Engine)
- We have been developing a search engine capable of searching the SPIRIT collection quickly and efficiently.
- A distributed, high-performance search engine
- Access via API (over TCP)
- About 40 ms local query turn-around time
- Without caching
- Indexing time of 3.5 hours per million documents per computer
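As a rough illustration of what the quoted indexing rate implies, the sketch below multiplies 3.5 hours per million documents by one machine's share of the collection. The 24-million-document share is the figure shown on the architecture slides; treating the machines as indexing fully in parallel, with no overhead, is an assumption.

```python
HOURS_PER_MILLION_DOCS = 3.5  # per computer, as quoted on the slide


def indexing_hours(docs_millions: float) -> float:
    """Hours one computer needs to index the given number of documents (in millions)."""
    return docs_millions * HOURS_PER_MILLION_DOCS


# Assumed share: 24 million documents on one machine.
per_node = indexing_hours(24)
print(f"{per_node:.0f} hours (~{per_node / 24:.1f} days) per machine")
```

Under these assumptions each machine needs about 84 hours, or roughly three and a half days, to index its share.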
7 Basic SPIRIT Search Engine
[Diagram: four nodes (A, B, C, D), each holding 24 million documents. Other labels: Anchor, Aggregate Server, Result Wrapper, Collection Storage]
8 Basic SPIRIT Search Engine
[Diagram: nodes A, B, C, D (24 million documents each) and the Aggregate Server. Step shown: Enter Query]
9 Basic SPIRIT Search Engine
[Diagram: nodes A, B, C, D (24 million documents each) and the Aggregate Server. Step shown: Process query and send back results]
10 Basic SPIRIT Search Engine
[Diagram: nodes A, B, C, D (24 million documents each) and the Aggregate Server. Step shown: Combine Result Sets]
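The query flow on slides 8 to 10 (enter the query at the Aggregate Server, process it on each node, combine the result sets) can be sketched as a simple scatter-gather. This is a minimal sketch, not the SPIRIT implementation: the in-memory NODES table, the function names, and the assumption that scores from different nodes are directly comparable are all illustrative.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the four nodes: term -> [(doc_id, score)].
NODES = {
    "A": {"jaguar": [("a1", 0.9), ("a2", 0.4)]},
    "B": {"jaguar": [("b1", 0.7)]},
    "C": {"jaguar": [("c1", 0.8), ("c2", 0.1)]},
    "D": {},
}


def search_node(node_name, query):
    """Return one node's local results for a single-term query."""
    return NODES[node_name].get(query, [])


def aggregate_search(query, k=3):
    """Send the query to every node in parallel, then merge the top-k hits by score."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        partials = list(pool.map(lambda name: search_node(name, query), NODES))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged, key=lambda hit: hit[1])


print(aggregate_search("jaguar"))
```

Each node only ever sees its own partition, so the aggregate step is the only place where results from different partitions meet.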
11 Multi-level Cache
[Diagram: nodes A, B, C, D (24 million documents each) and the Aggregate Server]
12 Multi-level Cache
[Diagram: nodes A, B, C, D (24 million documents each) and the Aggregate Server, with a Query cache and a Query Equivalence Cache]
13 Multi-level Cache
[Diagram: nodes A, B, C, D (24 million documents each), with four Term caches, and the Aggregate Server with a Query cache and a Query Equivalence Cache]
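The slides name three cache levels but do not define them, so the following is only one plausible reading, sketched under stated assumptions: the query cache maps an exact query string to results, the query equivalence cache maps a canonical form (lower-cased, sorted terms) to results, and the term cache memoises postings lookups on a node. POSTINGS, CachedSearch, and the AND semantics are hypothetical.

```python
from functools import lru_cache

# Hypothetical postings held by one node: term -> document ids.
POSTINGS = {"bengal": ["d1", "d2"], "cat": ["d2", "d3"]}


@lru_cache(maxsize=1024)
def postings_for(term):
    """Term cache: memoise the postings lookup for a single term."""
    return tuple(POSTINGS.get(term, ()))


def node_search(terms):
    """Return documents containing every term (AND semantics, an assumption)."""
    sets = [set(postings_for(t)) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


class CachedSearch:
    """Aggregate-side caches in front of a backend search function."""

    def __init__(self, backend):
        self.backend = backend
        self.query_cache = {}        # exact query string -> results
        self.equivalence_cache = {}  # canonical term tuple -> results
        self.backend_calls = 0

    @staticmethod
    def canonical(query):
        """Queries with the same terms in any order or case are treated as equivalent."""
        return tuple(sorted(query.lower().split()))

    def search(self, query):
        if query in self.query_cache:
            return self.query_cache[query]
        key = self.canonical(query)
        if key not in self.equivalence_cache:
            self.backend_calls += 1
            self.equivalence_cache[key] = self.backend(key)
        results = self.equivalence_cache[key]
        self.query_cache[query] = results
        return results


engine = CachedSearch(node_search)
print(engine.search("Bengal cat"), engine.search("cat bengal"), engine.backend_calls)
```

In this sketch the second query never reaches the backend, because it reduces to the same canonical form as the first.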
14 SPIRIT Search API
- TCP based
- Specifies legitimate combinations of operators in the query string
- Legitimate operators
- AND
- OR
- NOT
-
- outlink/inlink
- related
- cache
- site
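The slide lists the operators but not the query syntax, so the sketch below assumes a hypothetical one: upper-case AND, OR, and NOT as Boolean operators, and a `name:value` form for outlink, inlink, related, cache, and site. It only shows how a query string might be checked against a fixed set of legitimate operators; it is not the SPIRIT API.

```python
BOOLEAN_OPS = {"AND", "OR", "NOT"}
PREFIX_OPS = {"outlink", "inlink", "related", "cache", "site"}


def tokenize(query):
    """Split a query into (kind, value) tokens, rejecting unknown prefix operators."""
    tokens = []
    for word in query.split():
        name, sep, value = word.partition(":")
        if word in BOOLEAN_OPS:
            tokens.append(("bool", word))
        elif sep and value:
            # Assumed name:value syntax for the prefix operators.
            if name not in PREFIX_OPS:
                raise ValueError(f"unknown operator: {name}")
            tokens.append((name, value))
        else:
            tokens.append(("term", word.lower()))
    return tokens


print(tokenize("site:dcu.ie jaguar AND NOT car"))
```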
15 (No Transcript)
16 SPIRIT Images
- In addition, we are going to download all images from the collection.
- Estimate of 200 million unique images that are still in existence (from 1.2 billion)
- Will take over 3 months to accomplish
- Will require 3TB of disk space
- We will annotate each image with any alt-text and text from a surrounding window to support search as in Google's image search
- Will be used to support further experiments for this (and other) projects
18 I-SPY's Relevance Metric
- The I-SPY system re-ranks pages based on a relevance metric
- The relevance of a page pj to a query qi is based on the number of clicks that page has received in the past when query qi has been entered
19 Relevance Metric Example
q1 jaguar (total hits - 5117 23)
pj www.jaguar.com
p1 www.jag-lovers.org
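A click-based relevance metric of this kind can be sketched with a hit matrix that counts clicks per (query, page) pair. The sketch assumes relevance is the page's share of all clicks recorded for the query, which is one common formulation; the slides only say the metric is based on past clicks. The class name and the click counts below are illustrative and are not the figures from the slide.

```python
from collections import defaultdict


class HitMatrix:
    """Counts of past clicks, indexed by query and then by page."""

    def __init__(self):
        self.hits = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, page, count=1):
        self.hits[query][page] += count

    def relevance(self, query, page):
        """Share of the query's recorded clicks that went to this page (assumed formula)."""
        row = self.hits.get(query)
        if not row:
            return 0.0
        total = sum(row.values())
        return row.get(page, 0) / total if total else 0.0

    def rerank(self, query, pages):
        """Order pages by descending relevance for the query."""
        return sorted(pages, key=lambda p: self.relevance(query, p), reverse=True)


matrix = HitMatrix()
matrix.record_click("jaguar", "www.jaguar.com", 3)      # illustrative counts
matrix.record_click("jaguar", "www.jag-lovers.org", 1)
print(matrix.relevance("jaguar", "www.jaguar.com"))
print(matrix.rerank("jaguar", ["www.jag-lovers.org", "www.jaguar.com"]))
```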
20 TREC-2004 Experiments
- DCU and UCD
21 TREC (Text REtrieval Conference)
- TREC is an annual conference
- Funded by DARPA, organised and run by NIST
- Provides a framework within which research groups can run experiments on identical data and then come together to share their findings.
- Experiments in summer
- Conference in November
- Emphasis has been on performance improvements (PR)
- Annual improvement in systems effectiveness.
22 TREC Terabyte task
- The track we are participating in this year.
- The first time it has been run
- Goal of the task is to support experiments into searching over large quantities of web data.
- Terabyte Collection consists of 28 million web pages
- The whole .GOV domain
- Why only 28 million?
- Terabyte evaluation process requires
- Queries
- Documents
- Relevance Judgements
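To show how queries, documents, and relevance judgements fit together in an evaluation, here is a minimal sketch of two standard ranked-retrieval measures, precision at k and average precision. The function names and the toy ranking are illustrative; the slides do not say which measures the Terabyte task uses.

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k


def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranking = ["d1", "d2", "d3", "d4"]   # system output for one query
judged_relevant = {"d1", "d3"}       # relevance judgements for that query
print(precision_at_k(ranking, judged_relevant, 4))
print(average_precision(ranking, judged_relevant))
```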
23 A typical TREC query
Number: 451
What is a Bengals cat?
Description: Provide information on the Bengal cat breed.
Narrative: Item should include any information on the Bengal cat breed, including description, origin, characteristics, breeding program, names of breeders and catteries carrying bengals. References which discuss bengal clubs only are not relevant. Discussions of bengal tigers are not relevant.
24 A typical TREC document
WTX016-B48-195
IA019-000198-B006-6
http://yarra.vicnet.net.au:80/vicnet/vicjama.htm 203.10.72.1 19970106203101 text/html 9222
HTTP/1.0 200 OK
Server: Netscape-Communications/1.12
Date: Monday, 06-Jan-97 21:30:40 GMT
Last-modified: Thursday, 27-Apr-95 13:50:34 GMT
Content-length: 9031
Content-type: text/html
What was new - Jan to March 1995
What was new - Jan to March 1995
31 March - l"INFO EXPO - display of community information databases
24 March ...
Gary Hardy garyh_at_vicnet.net.au 6699709
25 TREC-2004 AIC Run
- Our terabyte experiments this year will be a joint experiment between DCU and UCD
- DCU
- A text-only run (BM25)
- Possibly a linkage-based run
- E.g. a PageRank variant
- UCD
- I-SPY run
- Allows for another evaluation of I-SPY
- However, there is a problem here!
- I-SPY relies on human relevance judgements as the basis for the collaborative search.
- We cannot rely on users for this TREC Terabyte task.
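For readers unfamiliar with BM25, which the text-only run above names, the following is a minimal sketch of the usual Okapi BM25 scoring function. The parameter values k1=1.2 and b=0.75 are conventional defaults, and the IDF uses the common "+1 inside the log" variant to stay non-negative; neither is stated in the slides, and the toy documents are illustrative.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [["bengal", "cat", "breed"], ["bengal", "tiger"], ["cat", "food"]]
print(bm25_scores(["bengal", "cat"], docs))
```

The first document matches both query terms and therefore scores highest of the three.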
26 Collaborative search without users
- Anchors on the Web provide a measure of human judgement.
- The text of an HTML anchor is (usually) generated by a human
- So we can use the anchors of links from within the Terabyte TREC collection to generate synthetic user judgements
1997 featured the departure of both drivers from the previous year, Barrichello leaving for the newly formed Stewart Grand Prix, Brundle leaving for a career as an F1 analyst. Jordan replaced them with Italian Giancarlo Fisichella and young Ralf Schumacher, Michael's brother. Again, the team finished 5th in the standings, with Fisichella scoring two finishes on the podium. In 1998, the team made its biggest signing as former World Champion Damon Hill, a graduate of Jordan's F3000 program, replaced Fisichella. The team also replaced its Peugeots with Mugen Honda motors. At that year's Belgian Grand Prix, Jordan earned their first career win, with Damon Hill earning the last of his 20 career Grand Prix victories.
27 Using synthetic user judgements
- So the hit-matrix for I-SPY for TREC will be
populated by synthetic queries from anchor text
descriptions of web pages
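One way to populate a hit matrix from anchor text is sketched below: each anchor's text is treated as a synthetic query and its link target as the page "clicked" for that query. Using the whole lower-cased anchor text as the query, and counting one hit per anchor, are assumptions; the class and function names and the example URLs are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalise case and whitespace so equivalent anchors share a query.
            text = " ".join("".join(self._text).lower().split())
            if text:
                self.pairs.append((text, self._href))
            self._href = None


def build_hit_matrix(html_pages):
    """Return hits[query][page], counting one synthetic hit per anchor."""
    hits = defaultdict(lambda: defaultdict(int))
    for html in html_pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for query, target in collector.pairs:
            hits[query][target] += 1
    return hits


pages = [
    '<p>Jordan signed <a href="http://example.org/hill">Damon Hill</a> in 1998.</p>',
    '<a href="http://example.org/hill">damon  hill</a> <a>no target</a>',
]
hits = build_hit_matrix(pages)
print(dict(hits["damon hill"]))
```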
28 TREC Terabyte 2004 Architecture
[Diagram: five nodes (A, B, C, D, E), each with an Index and a Term cache, together with an Aggregate Server, I-SPY, and a Linkage Server]
29 Conclusion/Future Work
- Text Search Engine almost complete
- Images soon to be downloaded
- Timetable
- Complete SPIRIT Search (end June)
- Complete TREC (8th September)
- Documents (mid-June)
- Queries (early August)
- Download images (estimate end October)
30 Thank You
- http://www.cdvp.dcu.ie/
- pferguson_at_computing.dcu.ie