SPIRIT Search Fast - PowerPoint PPT Presentation

About This Presentation
Title:

SPIRIT Search Fast

Description:

SPIRIT Search. Fast & Effective Retrieval of 100 million documents for TREC2004/I-SPY ... A distributed high performance search engine. Access via API (over TCP) ... – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: SPIRIT Search Fast


1
SPIRIT Search: Fast & Effective Retrieval of 100
million documents for TREC2004/I-SPY
  • Paul Ferguson
  • Peter Wilkins, Cathal Gurrin, Alan Smeaton
  • Centre for Digital Video Processing
  • Dublin City University, Ireland

2
Agenda
  • SPIRIT Search
  • The Documents
  • The Search Engine
  • UCD Collaboration
  • I-Spy
  • TREC-2004
  • Terabyte task
  • Integration with I-Spy

3
SPIRIT Search
  • Although not on the scale of Google, we are searching
    large collections:
  • The SPIRIT collection of web pages
  • TREC Terabyte 2004 collection
  • The SPIRIT collection of web pages is hosted in
    DCU
  • Also in Sheffield and Glasgow
  • We will be getting the TREC Terabyte collection
    in June

4
SPIRIT (Collection)
  • Gathered almost three years ago by Univ.
    Waterloo, Canada
  • A general web crawl
  • Consists of HTML pages only, with a large domain
    distribution
  • This leads to a broad language distribution (an
    estimate of over 20 non-English)
  • 94.5 million pages
  • 1.2TB of disk space
  • 500 million closed-collection links
  • 1.5 billion other links
  • Estimated 200 million unique image URLs

5
  • Search Engine

6
SPIRIT (Engine)
  • We have been developing a search
    engine capable of searching the SPIRIT collection
    quickly and efficiently.
  • A distributed high performance search engine
  • Access via API (over TCP)
  • About 40 ms local query turnaround time,
    without caching
  • Indexing time of 3.5 hours/million
    documents/computer
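
As a back-of-the-envelope check, the indexing rate above can be scaled to the 94.5 million pages quoted for the SPIRIT collection. This is only a sketch: the even four-way split is an assumption taken from the four-partition diagrams, and the function name is illustrative.

```python
def indexing_hours(total_docs_millions, hours_per_million=3.5, machines=1):
    """Estimate wall-clock indexing hours for an evenly split collection."""
    # Total work in machine-hours at the quoted rate.
    machine_hours = total_docs_millions * hours_per_million
    # Assumes perfect parallelism with no coordination overhead.
    return machine_hours / machines


single_machine = indexing_hours(94.5)
four_machines = indexing_hours(94.5, machines=4)  # assumed four-way split
print(f"one machine: {single_machine:.2f} h")
print(f"four machines: {four_machines:.2f} h")
```

Under these assumptions the whole collection is roughly two weeks of work on one machine and a few days on four.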

7
Basic SPIRIT Search Engine
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents
  • Other labelled components: Anchor, Aggregate Server,
    Result Wrapper, Collection Storage
8
Basic SPIRIT Search Engine
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents, and the Aggregate Server
  • Step shown: Enter Query
9
Basic SPIRIT Search Engine
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents, and the Aggregate Server
  • Step shown: Process query and send back results
10
Basic SPIRIT Search Engine
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents, and the Aggregate Server
  • Step shown: Combine Result Sets
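
The query flow in these diagram slides (enter the query at the aggregate server, let each partition process it, then combine the result sets) can be sketched as a scatter-gather loop. This is a minimal illustration, not the SPIRIT implementation: the in-memory partitions, the term-count scoring and all function names are assumptions, and the real engine is reached over TCP rather than by threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy partitions standing in for A, B, C and D.
PARTITIONS = {
    "A": {"doc1": "jaguar cars and jaguar dealers"},
    "B": {"doc2": "bengal cat breeders"},
    "C": {"doc3": "jaguar the big cat"},
    "D": {"doc4": "formula one results"},
}


def search_partition(docs, query):
    """Score each document by how often the query terms occur (toy scoring)."""
    terms = query.lower().split()
    results = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            results.append((score, doc_id))
    return results


def aggregate_search(query, partitions=PARTITIONS, top_k=3):
    """Send the query to every partition in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(
            pool.map(lambda docs: search_partition(docs, query), partitions.values())
        )
    merged = [hit for hits in partial_results for hit in hits]
    # Highest score first; ties are broken by document id for determinism.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return merged[:top_k]


top_hits = aggregate_search("jaguar cat")
print(top_hits)
```

Merging by score only works if scores from different partitions are comparable, which a real distributed engine has to arrange (for example by sharing collection statistics).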
11
Multi-level Cache
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents, and the Aggregate Server
12
Multi-level Cache
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents, and the Aggregate Server
  • Caches shown: Query cache, Query Equivalence Cache
13
Multi-level Cache
  • [Diagram] Four partitions (A, B, C, D), each holding
    24 million documents, and the Aggregate Server
  • Caches shown: four Term cache labels, Query cache,
    Query Equivalence Cache
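
The slides name three cache levels but do not define them, so the following is only one plausible reading, with every class and method name hypothetical: a query cache keyed on the exact query string, a query equivalence cache keyed on a normalised form of the query (here, lowercased and sorted terms), and a term cache on each partition holding the documents that contain a term.

```python
class Node:
    """One partition with its own term cache (assumed per-partition)."""

    def __init__(self, docs):
        self.docs = docs
        self.term_cache = {}
        self.index_lookups = 0  # counts cache misses, for illustration

    def postings(self, term):
        if term not in self.term_cache:
            self.index_lookups += 1
            self.term_cache[term] = {
                doc_id for doc_id, text in self.docs.items() if term in text.split()
            }
        return self.term_cache[term]


class Aggregator:
    """Aggregate server with a query cache and a query equivalence cache."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.query_cache = {}
        self.equivalence_cache = {}

    @staticmethod
    def canonical(query):
        # Assumed notion of equivalence: same set of terms, any order or case.
        return " ".join(sorted(set(query.lower().split())))

    def search(self, query):
        if query in self.query_cache:
            return self.query_cache[query]
        key = self.canonical(query)
        if key not in self.equivalence_cache:
            hits = set()
            for node in self.nodes:
                per_term = [node.postings(term) for term in key.split()]
                if per_term:
                    hits |= set.intersection(*per_term)
            self.equivalence_cache[key] = sorted(hits)
        result = self.equivalence_cache[key]
        self.query_cache[query] = result
        return result


nodes = [Node({"d1": "bengal cat breed"}), Node({"d2": "bengal tiger"})]
aggregator = Aggregator(nodes)
first = aggregator.search("Bengal cat")
second = aggregator.search("cat bengal")  # answered by the equivalence cache
print(first, second, nodes[0].index_lookups)
```
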
14
SPIRIT Search API
  • TCP based
  • Specifies the legitimate combinations of operators in
    the query string
  • Legitimate operators
  • AND
  • OR
  • NOT
  • outlink/inlink
  • related
  • cache
  • site
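
The slide lists the operators but not their syntax, so the sketch below assumes a `name:value` form for outlink, inlink, related, cache and site, and upper-case keywords for AND, OR and NOT. It shows how a query string might be tokenised and checked before being sent over TCP; `parse_query` is a hypothetical helper, not part of the SPIRIT API.

```python
BOOLEAN_OPS = {"AND", "OR", "NOT"}
PREFIX_OPS = {"outlink", "inlink", "related", "cache", "site"}


def parse_query(query):
    """Split a query string into (kind, value) tokens, rejecting unknown operators."""
    tokens = []
    for raw in query.split():
        if raw in BOOLEAN_OPS:
            tokens.append(("bool", raw))
        elif ":" in raw:
            # Assumed syntax: operator name, a colon, then its argument.
            name, _, value = raw.partition(":")
            if name not in PREFIX_OPS or not value:
                raise ValueError(f"unknown or empty operator: {raw!r}")
            tokens.append((name, value))
        else:
            tokens.append(("term", raw.lower()))
    return tokens


parsed = parse_query("jaguar AND NOT site:jaguar.com")
print(parsed)
```
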

16
SPIRIT Images
  • In addition we are going to download all images
    from the collection.
  • Estimate of 200 million unique images that are
    still in existence (from 1.2 billion)
  • Will take over 3 months to accomplish
  • Will require 3TB of disk space
  • We will annotate each image with any alt-text and
    text from a surrounding window to support search,
    as in Google's image search
  • Will be used to support further experiments for
    this (and other) projects
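
The annotation step described above (alt-text plus a window of surrounding text) could look roughly like the following, using Python's standard `html.parser`. The window size, class name and output layout are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class ImageAnnotator(HTMLParser):
    """Collect alt-text and a window of nearby words for each image."""

    def __init__(self, window=5):
        super().__init__()
        self.window = window
        self.words = []   # all text words seen so far, in document order
        self.images = []  # (src, alt, position in self.words)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr = dict(attrs)
            src = attr.get("src") or ""
            alt = attr.get("alt") or ""
            self.images.append((src, alt, len(self.words)))

    def handle_data(self, data):
        self.words.extend(data.split())

    def annotations(self):
        result = {}
        for src, alt, pos in self.images:
            before = self.words[max(0, pos - self.window):pos]
            after = self.words[pos:pos + self.window]
            result[src] = {"alt": alt, "context": " ".join(before + after)}
        return result


page = (
    '<p>A photo of a Bengal cat <img src="cat.jpg" alt="Bengal cat"> '
    "sitting on a wall in Dublin</p>"
)
annotator = ImageAnnotator(window=3)
annotator.feed(page)
annotator.close()
image_text = annotator.annotations()
print(image_text)
```
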

17
  • UCD Collaboration

18
I-SPY's Relevance Metric
  • The I-SPY system re-ranks pages based on a relevance
    metric
  • The relevance of a page pj to a query qi is based
    on the number of clicks that page has received in
    the past when query qi has been entered

19
Relevance Metric Example
q1 jaguar (total hits - 5117 23)
pj www.jaguar.com
p1 www.jag-lovers.org
20
  • TREC-2004 Experiments
  • DCU & UCD

21
TREC (Text REtrieval Conference)
  • TREC is an annual conference
  • funded by DARPA, organised and run by NIST
  • Provides a framework within which research groups
    can run experiments on identical data and then
    come together to share their findings.
  • Experiments in summer
  • Conference in November
  • Emphasis has been on performance improvements
    (PR)
  • Annual improvement in systems effectiveness.

22
TREC Terabyte task
  • The track we are participating in this year.
  • The first time it has been run
  • Goal of the task is to support experiments into
    searching over large quantities of web data.
  • Terabyte Collection consists of 28 million web
    pages
  • The whole .GOV domain
  • Why only 28 million?
  • Terabyte evaluation process requires
  • Queries
  • Documents
  • Relevance Judgements

23
A typical TREC query
Number: 451
What is a Bengals cat?
Description: Provide information on the Bengal cat breed.
Narrative: Item should include any information on the Bengal cat
breed, including description, origin, characteristics, breeding
program, names of breeders and catteries carrying bengals.
References which discuss bengal clubs only are not relevant.
Discussions of bengal tigers are not relevant.
24
A typical TREC document
WTX016-B48-195
IA019-000198-B006-6
http://yarra.vicnet.net.au:80/vicnet/vicjama.htm 203.10.72.1 19970106203101 text/html 9222
HTTP/1.0 200 OK
Server: Netscape-Communications/1.12
Date: Monday, 06-Jan-97 21:30:40 GMT
Last-modified: Thursday, 27-Apr-95 13:50:34 GMT
Content-length: 9031
Content-type: text/html
What was new - Jan to March 1995
What was new - Jan to March 1995
31 March
  • INFO EXPO - display of community
    information databases
24 March
...
Gary Hardy
garyh_at_vicnet.net.au
6699709
25
TREC-2004 AIC Run
  • Our terabyte experiments this year will be a
    joint experiment between DCU and UCD
  • DCU
  • A text-only run (BM25)
  • Possibly a linkage-based run
  • E.g. a PageRank Variant
  • UCD
  • I-SPY run
  • Allows for another evaluation of I-SPY
  • However, there is a problem here!
  • I-SPY relies on human relevance judgements as the
    basis for the collaborative search.
  • We cannot rely on users for this TREC Terabyte
    task.

26
Collaborative search without users
  • Anchors on the Web provide a measure of human
    judgement.
  • The text of an HTML anchor is (usually) written by a
    human
  • So we can use the anchors of links from within
    the Terabyte TREC collection to generate
    synthetic user judgments

1997  featured the departure of both drivers from
the previous year, Barrichello leaving for the
newly formed Stewart Grand Prix, Brundle leaving
for a career as an F1 analyst. Jordan replaced
them with Italian Giancarlo Fisichella and young
Ralf Schumacher, Michael's brother. Again, the
team finished 5th in the standings, with Fisichella
scoring two finishes on the podium. In 1998, the
team made its biggest signing as former World
Champion Damon Hill, a graduate of Jordan's F3000
program, replaced Fisichella. The team also
replaced its Peugeots with Mugen Honda motors.
At that year's Belgian Grand Prix, Jordan earned
their first career win, with Damon Hill earning
the last of his 20 career Grand Prix victories.
27
Using synthetic user judgements
  • So the hit-matrix for I-SPY for TREC will be
    populated by synthetic queries from anchor text
    descriptions of web pages
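
A minimal sketch of this idea, under stated assumptions: each anchor's text is treated as a synthetic query, the link target is treated as the page "clicked" for that query, and the counts form the hit-matrix. The class and function names are hypothetical, and lowercasing the anchor text is a normalisation choice made here, not something the slides specify.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (anchor text, target URL) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Lowercase and collapse whitespace so equal anchors match.
            text = " ".join("".join(self._text).lower().split())
            if text:
                self.pairs.append((text, self._href))
            self._href = None


def synthetic_hits(pages):
    """Count how often each anchor text (as a query) points at each URL."""
    hits = Counter()
    for html in pages:
        parser = AnchorExtractor()
        parser.feed(html)
        parser.close()
        hits.update(parser.pairs)
    return hits


pages = [
    '<p>See <a href="http://www.jaguar.com">Jaguar Cars</a></p>',
    '<a href="http://www.jaguar.com">jaguar cars</a> and '
    '<a href="http://www.jag-lovers.org">Jag Lovers</a>',
]
hit_counts = synthetic_hits(pages)
print(hit_counts)
```
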

28
TREC Terabyte 2004 Architecture
  • [Diagram] Five Index partitions (A, B, C, D, E), each
    with a Term cache
  • Other labelled components: Aggregate Server, I-SPY,
    Linkage Server
29
Conclusion/Future Work
  • Text Search Engine almost complete
  • Images soon to be downloaded
  • Timetable
  • Complete Spirit Search (end June)
  • Complete TREC (8th September)
  • Documents (mid-June)
  • Queries (early August)
  • Download images (estimate end October)

30
  • Thank You
  • http://www.cdvp.dcu.ie/
  • pferguson_at_computing.dcu.ie