A Search Engine Architecture Based on Collection Selection

1
A Search Engine Architecture Based on Collection
Selection
  • Diego Puppin
  • University of Pisa, Italy
  • Supervisors: D. Laforenza, M. Vanneschi

2
Introduction
3
Motivations
  • The Web is getting bigger and bigger, and users
    are more and more picky!
  • Precise results are needed very fast
  • The index is growing, due to added pages and
    advanced indexing
  • Big IR problems for Web, book, and multimedia
    search engines

4
Motivations (2)
  • New solutions are needed, able to give
    high-quality results with reduced computing load
  • Parallel computing looks like the most natural
    choice to help algorithms face this growth
    rate [Baeza-Yates et al., 2007a]
  • Billions of pages and several TB of data are
    available; the index is still very big (about
    5X the collection size)
  • New approaches to partitioning are key to the
    next phase

5
Parallel (Distributed) IRSs
6
Term vs Doc partitioning
7
Term vs Doc partitioning
  • Term partitioning: reduced computing load, since
    only the servers holding the relevant terms are
    involved
  • But it suffers from load-balancing problems and
    heavier communication patterns
  • Document partitioning: better balancing, but all
    documents are scanned
  • How can we reduce the load with document
    partitioning?

8

9
Main contributions
  • Query-vector (QV) document model
  • More efficient for partitioning and selection
    (co-clustering and PCAP)
  • Load-driven routing
  • Better exploits the available capacity
  • Based on the effective load of the system
  • Incremental caching
  • Improves throughput AND quality

10
Acknowledgments
  • Fabrizio Silvestri
  • Raffaele Perego
  • Ricardo Baeza-Yates
  • Abdur Chowdhury, Ophir Frieder, Gerhard Weikum,
    and the various reviewers

11
Other contributions
  • A more compact collection representation
  • 1/5 the size of CORI's, and outperforming it
  • A way to select documents (50%) to move out of
    the index
  • The documents in the supplemental index
    contribute only 3% of the top results
  • A simple way to update the index in a
    document-partitioned system
  • An extended simulation
  • 6M documents, 800k test queries, real computing
    costs, several configurations tested

12
Reviewer's Requests (Frieder)
  • More detailed discussion of the co-clustering
    algorithm
  • Improved cost scheme
  • Experiments to be extended in the future

13
Reviewer's Requests (Weikum)
  • Improved description of pipelined
    term-partitioned IR systems
  • Improved description of co-clustering
  • Better definition of shingles
  • New realistic cost model
  • Deeper discussion of caching and silent documents

14
How to Improve Partitions
15
Partitioning Strategy
[Diagram: the document collection is fed to the partitioning
strategy, which splits it into partitions p1, p2, ..., pp]
16
The QV Model
[Diagram: the query-vector matrix, with queries as rows and
documents as columns; co-clustering reorders and groups the
rows into query clusters and the columns into document
clusters]
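As a rough illustration of how such a matrix could be collected, here is a minimal sketch; `search` is a hypothetical stand-in for the underlying IR core (Zettair in these experiments), returning scored top-k results for a training query, and all names are illustrative.

```python
from collections import defaultdict

def build_qv_matrix(training_queries, search, k=100):
    """Sparse query-vector matrix: qv[query_id][doc_id] = retrieval score.
    Documents never retrieved by any training query stay implicit zeros."""
    qv = defaultdict(dict)
    for qid, query in enumerate(training_queries):
        for doc_id, score in search(query, k):   # top-k docs for this query
            qv[qid][doc_id] = score
    return qv
```

Documents that no training query recalls (the "silent documents" mentioned among the reviewers' requests) get an empty column and need a separate policy, e.g. the supplemental index.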
17
Theoretical Model of Co-clustering
  • The algorithm we use [Dhillon et al., 2003] finds
    the clustering that minimizes the loss of
    information between the original matrix and the
    clustered matrix (given the number of row and
    column clusters)
  • Efficient implementation, very robust solution
  • Stable with respect to test period, number of
    clusters, training set used, and matrix model
    (scores, Boolean, repeated)
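As a reference, this is a minimal sketch of the objective being minimized, not of the optimization itself: given fixed row and column cluster assignments for a query-document contingency matrix, it computes the loss in mutual information between the original and the clustered distributions. Function and variable names are illustrative, not the thesis code.

```python
import numpy as np

def mutual_information(p):
    """I(X;Y) for a joint distribution given as a 2-D array summing to 1."""
    px = p.sum(axis=1, keepdims=True)   # row marginal
    py = p.sum(axis=0, keepdims=True)   # column marginal
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

def coclustering_loss(counts, row_labels, col_labels, n_rc, n_cc):
    """Loss I(X;Y) - I(X_hat;Y_hat) for given row/column clusterings of a
    (queries x documents) contingency matrix; labels are integer arrays."""
    p = counts / counts.sum()
    clustered = np.zeros((n_rc, n_cc))
    # Aggregate the probability mass of each (row cluster, column cluster) pair.
    np.add.at(clustered, (row_labels[:, None], col_labels[None, :]), p)
    return mutual_information(p) - mutual_information(clustered)
```

Dhillon et al.'s algorithm alternately reassigns rows and columns so as to decrease exactly this quantity.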

18
QV for Collection Selection
Partitions are ranked according to their relevance to the
query.
[Diagram: the query is matched against the query clusters,
and scores are propagated through the clustered matrix to
the document clusters.]
We called this strategy PCAP.
19
PCAP collection selection
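A minimal sketch of how a PCAP-style selection could be computed at query time, under these assumptions: `rank_query_clusters` is a hypothetical function scoring the incoming query against each query-cluster representation (e.g. the concatenated text of its queries) with any standard IR ranking, and `P[qc][dc]` holds the clustered query-vector mass between query cluster qc and document cluster dc.

```python
from collections import defaultdict

def pcap_scores(query, rank_query_clusters, P):
    """Rank document clusters by propagating query-cluster relevance
    through the clustered query-vector matrix P."""
    r = rank_query_clusters(query)        # {qc: relevance of the query to qc}
    scores = defaultdict(float)
    for qc, rel in r.items():
        for dc, mass in P.get(qc, {}).items():
            scores[dc] += rel * mass      # accumulate evidence per doc cluster
    # Highest-scoring document clusters are polled first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)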
20
Experimental Settings
  • Experiments were carried out using:
  • WBR99: 5,939,061 documents, 22 GB of uncompressed
    text
  • A snapshot of the Brazilian Web (domain .br) back
    in 1999
  • A query log from todobr.com, relative to the
    period Jan-Oct 2003
  • Zettair as the IR core
  • Training: 190,000 queries; test: 800,000 queries
  • We created 16+1 document clusters and 128 query
    clusters
  • The model was tested on the successive (fourth)
    week. Metrics used:
  • Intersection: percentage of relevant results
    returned using only k servers out of 16+1 (from
    [Puppin et al., 2006])
  • Competitive similarity: percentage of relevance
    score obtained using only k servers out of 16+1
    (adapted from [Chierichetti et al., 2007])
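A sketch of the two metrics as defined above, assuming the full system's top-n results (and their scores) are available as the gold standard; the exact normalizations in the thesis may differ slightly.

```python
def intersection(full_top, partial_top, n=20):
    """Fraction of the full system's top-n results that are still
    returned when only k servers out of 16+1 are polled."""
    return len(set(full_top[:n]) & set(partial_top[:n])) / n

def competitive_similarity(full_scores, partial_scores, n=20):
    """Fraction of the full system's relevance mass retrieved: sum of
    the top-n scores with k servers over the top-n sum with all servers."""
    return (sum(sorted(partial_scores, reverse=True)[:n])
            / sum(sorted(full_scores, reverse=True)[:n]))
```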

21
Quality Metrics
22
Very Effective Partitioning and Selection
In the case of random partitioning, CORI performs really
badly: almost equal to relevant/Nclusters, e.g.
5/17 = 0.294, i.e. about 0.3.
CORI on QV performs about 5.2 times better than CORI on
random.
PCAP on QV performs about 5.8 times better than CORI on
random.
PCAP on QV performs about 1.1 times better than CORI on QV.
23
(No Transcript)
24
Strengths
  • Popular queries are driving the distribution
  • Low-dimensional space to represent documents
  • More efficient collection representation
  • QV may be built while answering queries

25
Weaknesses
  • Dependent on the training set
  • Actually NOT!
  • Cannot manage new query terms
  • Only a very small fraction; CORI does not help
    either
  • Incremental caching can help
  • Collection selection depends on the initial
    assignment
  • But adding documents does not degrade performance

26
Issues with Load Distribution
27
Load Balancing
Load is measured as the maximum number of queries
answered by each IR core within a sliding query
window of 1000 queries.
Still, the maximum load is 25% of the maximum
capacity available at each IR core.
28
Load Balancing Strategies
  • Load-driven basic <L>
  • Servers are ranked according to their relevance,
    using a collection selection function. The first
    gets priority 1, then linearly down to 1/17.
    Every server i has to answer if L(i) < p(i) * L
  • Load-driven boost <L,T>
  • Priority is 1 for the first T servers, then
    linearly down to 1/(17-T)
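A minimal sketch of the basic <L> strategy, under these assumptions: load is tracked per server as the number of queries it answered within a sliding window of the last 1000 queries (as on the previous slide), and the broker decides which servers to poll. All names are illustrative.

```python
from collections import deque

class LoadDrivenBroker:
    """Basic load-driven routing: the server ranked first by collection
    selection gets priority 1, the others linearly down to 1/S; server i
    answers only if its windowed load L(i) stays below p(i) * L."""

    def __init__(self, n_servers, load_cap, window=1000):
        self.n = n_servers
        self.cap = load_cap                          # L: load threshold
        self.hits = [deque(maxlen=window) for _ in range(n_servers)]

    def route(self, ranked_servers):
        """ranked_servers: server ids ordered by collection-selection score."""
        polled = []
        for rank, server in enumerate(ranked_servers):
            priority = 1.0 - rank / self.n           # 1 down to 1/S linearly
            load = sum(self.hits[server])            # queries answered in window
            answers = load < priority * self.cap     # serve iff L(i) < p(i) * L
            self.hits[server].append(1 if answers else 0)
            if answers:
                polled.append(server)
        return polled
```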

29
Experimental Settings (2)
  • The broker models the load on the cores as the
    number of queries served within the last W queries
  • Assumption: cost = 1 for each query on each
    collection
  • We will change this later
  • We count the number of relevant results we can
    get by polling the servers, up to the chosen load
    threshold

30
Load Balancing Results
Intersection (% of relevant results retrieved)
Competitive Similarity (% of rank score retrieved)
31
Caching and Collection Selection
32
Interaction with a Cache
  • Result caching is commonly used in WSEs
    [Baeza-Yates et al., 2007a; Baeza-Yates et al.,
    2007b]
  • Caching has the effect of reshaping the power law
    underlying the query distribution [Baeza-Yates et
    al., 2007a]
  • We designed a novel caching strategy (incremental
    caching) integrated with collection selection

33
Incremental Caching
An incremental cache is effective both at reducing load and
at improving result quality.
[Diagram: queries go through the incremental cache on their
way to the IR cores; for each query only a subset of the
servers is polled, and their results are merged into the
cached entry.]
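A minimal sketch of the idea, with illustrative names: each cache entry remembers both the merged results and which servers have already been polled, so a repeated query can be forwarded only to still-unpolled servers (load permitting), and the entry's quality can only grow over time. Eviction is omitted.

```python
class IncrementalCache:
    """Cache entries store (results, polled_servers) per query;
    results are (doc_id, score) tuples."""

    def __init__(self):
        self.entries = {}

    def lookup(self, query):
        """Return cached results and the servers already polled."""
        return self.entries.get(query, ([], set()))

    def update(self, query, new_results, newly_polled):
        """Merge fresh results into the entry and record the polled servers."""
        results, polled = self.lookup(query)
        merged = sorted(results + new_results,
                        key=lambda r: r[1], reverse=True)   # sort by score
        self.entries[query] = (merged, polled | newly_polled)
```

On a hit, the broker asks collection selection for the server ranking as usual, skips the servers already in the entry, and forwards the query only to the best still-unpolled ones that fit under the load cap.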
34
Incremental Caching Results
Intersection (like P@N: % of relevant results
retrieved)
Competitive Similarity (% of rank score retrieved)
35
(No Transcript)
36
Refined Cost Model and Prioritization
37
Collection Prioritization
  • We reverse the load control from the broker to
    the cores
  • The broker broadcasts the query, and sends info
    about the relative rank of each core (the
    priority)
  • Each core serves the query if L(i) < p(i) * L
  • L(i) is the sum of the computational costs
    (timings) of the queries it has served
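A sketch of the core-side check under the refined cost model, with illustrative names: instead of unit costs, each core sums the measured timings of the queries it served within the recent window, and `process` stands for the local IR engine call.

```python
from collections import deque
import time

class PrioritizedCore:
    """Each core regulates itself: it serves a broadcast query only if
    its accumulated timing cost L(i) is below p(i) * L."""

    def __init__(self, load_cap, window=1000):
        self.cap = load_cap                    # L, in time units per window
        self.costs = deque(maxlen=window)      # measured per-query timings

    def maybe_serve(self, query, priority, process):
        if sum(self.costs) < priority * self.cap:    # L(i) < p(i) * L
            start = time.monotonic()
            results = process(query)                 # local IR engine call
            self.costs.append(time.monotonic() - start)
            return results
        self.costs.append(0.0)                       # skipped queries cost nothing
        return None
```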

38
(No Transcript)
39
Extended Tests
  • We actually partitioned the documents across
    different servers
  • We indexed locally, and measured the timing of
    each query
  • The actual timings are used to compute the load
    and drive the system
  • The load cap refers to the AVERAGE load
  • The peak can vary heavily!

40
(No Transcript)
41
The Bill, Please!
42
Conclusions
  • We presented an architecture for a distributed
    search engine, based on collection selection
  • The load-driven strategy and the incremental
    caching can retrieve very high quality results,
    with reduced load
  • Verified with an extensive simulation

43
Impact and Benefits
  • If a given precision is expected, we can use
    FEWER servers
  • With a given number of servers, we get HIGHER
    precision
  • Confirmed with different metrics
  • Smaller load for the IR system, with more focus
    on top results
  • Nice cost vs. quality trade-off

44
Impact and Benefits (2)
  • Load-driven routing can be used
  • to absorb query peaks
  • to offer higher/lower quality results to selected
    users
  • Consistent ranking due to local indexing
  • Incremental caching can be used to reduce the
    negative effects of selection

45
Furthermore
  • Caching posting lists is very effective on local
    indices
  • Simple way to add new documents
  • Incremental caching could help with impact-ordered
    posting lists
  • Caching could be based on cache-line value (query
    frequency, number of polled servers)

46
Future Work
  • Comparison with other results in clustering
    (k-means, link-based, P2P, LSI, SVD)
  • Test on a large-scale, real-world search engine
  • Real-world implementation at Google
  • TOIS paper to wrap up

47
References
  • [Puppin et al., 2006]
  • Diego Puppin, Fabrizio Silvestri, Domenico
    Laforenza. Query-Driven Document Partitioning
    and Collection Selection. Invited paper.
    Proceedings of INFOSCALE '06.
  • [Puppin and Silvestri, 2006]
  • Diego Puppin, Fabrizio Silvestri. The
    Query-Vector Document Model. Proceedings of CIKM
    '06.
  • [Puppin et al., 2007]
  • Diego Puppin, Ricardo Baeza-Yates, Raffaele
    Perego, Fabrizio Silvestri. Incremental Caching
    for Collection Selection Architectures.
    Proceedings of INFOSCALE '07.

48
References
  • [Baeza-Yates et al., 2007a]
  • Ricardo Baeza-Yates, Carlos Castillo, Flavio
    Junqueira, Vassilis Plachouras, Fabrizio
    Silvestri. Challenges in Distributed Information
    Retrieval. Invited paper. Proceedings of ICDE
    2007.
  • [Chierichetti et al., 2007]
  • F. Chierichetti, A. Panconesi, P. Raghavan, M.
    Sozio, A. Tiberi, E. Upfal. Finding Near
    Neighbors Through Cluster Pruning. Proceedings
    of PODS 2007.
  • [Baeza-Yates et al., 2007b]
  • Ricardo Baeza-Yates, Aristides Gionis, Flavio
    Junqueira, Vanessa Murdock, Vassilis Plachouras,
    Fabrizio Silvestri. The Impact of Caching on
    Search Engines. Proceedings of SIGIR 2007.

49
References
  • [Dhillon et al., 2003]
  • Dhillon, I. S., Mallela, S., and Modha, D. S.
    Information-Theoretic Co-Clustering. Proceedings
    of the Ninth ACM SIGKDD International Conference
    on Knowledge Discovery and Data Mining (KDD 2003).

50
Backup Slides
51
(No Transcript)
52
(No Transcript)
53
(No Transcript)
54
Adding Documents
55
Adding Documents
  • It is important to assign new documents to the
    fittest clusters
  • New versions, new pages, etc.
  • The new documents will be found along with the
    previously assigned documents
  • Hopefully, collection selection will find them
    along with similar documents

56
A Modest Proposal
  • The body of the new document is used as a query
    for PCAP selection
  • The body is compared to the query clusters
  • We compute the similarity between the document
    body and each query cluster
  • We use PCAP to rank the document collections

57
Implementation
  • The first 1,000 bytes of the (stripped) document
    body are used
  • The new doc is assigned to the document cluster
    with the top PCAP score
  • New docs are locally indexed
  • No need to re-train / re-assign
  • New docs have consistent scores and ranking
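A minimal sketch of this assignment step, reusing the hypothetical `pcap_scores` helper from the PCAP slide; names and the byte-cutoff handling are illustrative.

```python
def assign_new_document(doc_text, rank_query_clusters, P, n_bytes=1000):
    """Treat the first 1,000 bytes of the stripped body as a query and
    assign the document to the top-scoring document cluster."""
    pseudo_query = doc_text[:n_bytes]                 # stripped body prefix
    ranking = pcap_scores(pseudo_query, rank_query_clusters, P)
    return ranking[0][0] if ranking else None         # best cluster id
```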

58
Test Configurations
59
(No Transcript)
60
(No Transcript)