Title: A Search Engine Architecture Based on Collection Selection
1. A Search Engine Architecture Based on Collection Selection
- Diego Puppin
- University of Pisa, Italy
- Supervisors: D. Laforenza, M. Vanneschi
2. Introduction
3. Motivations
- The Web is getting bigger and bigger, and users are getting pickier
- Precise results are needed very fast
- The index is growing, due to added pages and advanced indexing
- Big IR problems for Web, book, and multimedia search engines
4. Motivations (2)
- There is a need for new solutions, able to give high-quality results with a reduced computing load
- Parallel computing looks like the most natural choice to help algorithms face this growth rate [Baeza-Yates et al., 2007a]
- Billions of pages and several TB of data are available; the index alone is still very big (about 5x the collection size)
- New approaches to partitioning are key to the next phase
5. Parallel (Distributed) IRSs
6. Term vs. Document Partitioning
7. Term vs. Document Partitioning
- Term partitioning reduces the computing load
- Only the servers holding the relevant terms are involved
- But it suffers from load-balancing problems and heavier communication patterns
- Document partitioning balances the load better, but all documents are scanned
- How can we reduce the load with document partitioning? (See the sketch below)
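The contrast can be made concrete with a small routing sketch. This is an illustration, not the thesis code; the broker classes and their structure are hypothetical:

```python
from typing import Dict, List, Set

class TermPartitionedBroker:
    """Each server holds full posting lists for a subset of the terms."""
    def __init__(self, term_to_server: Dict[str, int]):
        self.term_to_server = term_to_server

    def route(self, query_terms: List[str]) -> Set[int]:
        # Only servers holding a query term are contacted: less total
        # work, but popular terms skew the load, and multi-term queries
        # force servers to exchange (long) posting lists.
        return {self.term_to_server[t] for t in query_terms
                if t in self.term_to_server}

class DocPartitionedBroker:
    """Each server holds a complete local index for a subset of docs."""
    def __init__(self, num_servers: int):
        self.num_servers = num_servers

    def route(self, query_terms: List[str]) -> Set[int]:
        # Every server is polled: good balancing, but every partition
        # is scanned on every query. Collection selection (next slides)
        # prunes this set.
        return set(range(self.num_servers))
```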
9. Main Contributions
- The query-vector document model
- More efficient for partitioning and selection (co-clustering and PCAP)
- Load-driven routing
- Makes better use of the available capacity
- Based on the effective load of the system
- Incremental caching
- Improves throughput AND quality
10. Acknowledgments
- Fabrizio Silvestri
- Raffaele Perego
- Ricardo Baeza-Yates
- Abdur Chowdhury, Ophir Frieder, Gerhard Weikum, and the various reviewers
11. Other Contributions
- A more compact collection representation: 1/5 the size of CORI's, and outperforming it
- A way to select documents (50%) to move out of the main index: the documents in the supplemental index contribute to only 3% of the top results
- A simple way to update the index in a document-partitioned system
- An extended simulation: 6 M documents, 800k test queries, real computing costs, several configurations tested
12. Reviewer's Requests (Frieder)
- A more detailed discussion of the co-clustering algorithm
- An improved cost scheme
- Experiments to be extended in the future
13. Reviewer's Requests (Weikum)
- An improved description of pipelined term-partitioned IR systems
- An improved description of co-clustering
- A better definition of shingles
- A new, realistic cost model
- A deeper discussion of caching and silent documents
14. How to Improve Partitions
15. Partitioning Strategy
[Figure: a partitioning strategy maps the document collection onto partitions p1, p2, ..., pp]
16. The QV Model
[Figure: a query-document matrix, with queries 1-12 as rows and documents as columns; co-clustering reorders rows and columns into homogeneous blocks]
17. Theoretical Model of Co-clustering
- The algorithm we use [Dhillon et al., 2003] finds the clustering that minimizes the loss of mutual information between the original matrix and the clustered matrix, given the number of row and column clusters (the objective is sketched below)
- Efficient implementation, very robust solution
- Stable with respect to the test period, the number of clusters, the training set used, and the matrix model (scores, boolean, repeated)
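To make the objective concrete, here is a small sketch (not the thesis code) of the quantity that information-theoretic co-clustering minimizes for one fixed row/column clustering; the counts matrix and the cluster labels are hypothetical inputs:

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) for a joint probability matrix (rows: X, columns: Y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def information_loss(counts: np.ndarray, row_labels, col_labels,
                     n_row_clusters: int, n_col_clusters: int) -> float:
    """Loss of mutual information for one fixed (row, column) clustering;
    the co-clustering algorithm searches for labels minimizing this."""
    p = counts / counts.sum()                      # joint distribution
    clustered = np.zeros((n_row_clusters, n_col_clusters))
    for i, r in enumerate(row_labels):             # aggregate the blocks
        for j, c in enumerate(col_labels):
            clustered[r, c] += p[i, j]
    return mutual_information(p) - mutual_information(clustered)
```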
18. QV for Collection Selection
Partitions are ranked according to their relevance to the query.
[Figure: the incoming query is matched against the query clusters, which in turn score the document clusters]
We called this strategy PCAP.
19. PCAP Collection Selection
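A minimal sketch of PCAP-style ranking, assuming a precomputed query-cluster x document-cluster matrix P (aggregating the scores that the documents of each cluster earned for the training queries of each query cluster) and a simple term-weight similarity between the incoming query and each query cluster; the real system uses a richer weighting:

```python
from typing import Dict, List

def pcap_rank(query_terms: List[str],
              query_cluster_terms: List[Dict[str, float]],
              P: List[List[float]]) -> List[int]:
    """Return document-cluster ids, most relevant first."""
    num_dc = len(P[0])
    scores = [0.0] * num_dc
    for qc, terms in enumerate(query_cluster_terms):
        # term-weight similarity of the query to this query cluster
        sim = sum(terms.get(t, 0.0) for t in query_terms)
        if sim == 0.0:
            continue
        for dc in range(num_dc):
            # weight the cluster's historical contribution by the match
            scores[dc] += sim * P[qc][dc]
    return sorted(range(num_dc), key=lambda dc: -scores[dc])
```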
20. Experimental Settings
- Experiments were carried out using:
- WBR99: 5,939,061 documents, 22 GB of uncompressed text; a snapshot of the Brazilian Web (.br domain) from 1999
- A query log from todobr.com, covering January-October 2003
- Zettair as the IR core
- Training: 190,000 queries; test: 800,000 queries
- We created 16 + 1 document clusters and 128 query clusters
- The model was tested on the following (fourth) week
Metrics used:
- Intersection: the percentage of the relevant results returned using only k servers out of the 17 (from [Puppin et al., 2006])
- Competitive similarity: the percentage of the relevance score obtained using only k servers out of the 17 (adapted from [Chierichetti et al., 2007])
21. Quality Metrics
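A sketch of the two metrics as just defined; the reference run polls all 17 servers, the pruned run only k, and everything beyond that definition is illustrative:

```python
from typing import Dict, Set

def intersection(full_top: Set[str], pruned_top: Set[str]) -> float:
    """Share of the reference top results (all servers polled)
    that the pruned run (only k servers) still returns."""
    return len(full_top & pruned_top) / len(full_top)

def competitive_similarity(full_scores: Dict[str, float],
                           pruned_docs: Set[str]) -> float:
    """Share of the attainable relevance score the pruned run retains."""
    total = sum(full_scores.values())
    kept = sum(s for d, s in full_scores.items() if d in pruned_docs)
    return kept / total if total else 0.0
```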
22. Very Effective Partitioning and Selection
With random partitioning, CORI performs really badly: almost equal to relevant/N_clusters. E.g., polling 5 of the 17 clusters retrieves about 5/17 = 0.294, roughly 0.3 of the relevant results.
CORI on QV performs about 5.2 times better than CORI on random partitions.
PCAP on QV performs about 5.8 times better than CORI on random partitions.
PCAP on QV performs about 1.1 times better than CORI on QV.
24. Strengths
- Popular queries drive the distribution
- A low-dimensional space represents the documents
- A more efficient collection representation
- Query vectors may be built while answering queries
25. Weaknesses
- Dependent on the training set?
- Actually NOT!
- Cannot manage new query terms?
- Only a very small fraction of queries; CORI does not help either
- Incremental caching can help
- Collection selection depends on the initial assignment?
- But adding documents does not break performance
26. Issues with Load Distribution
27. Load Balancing
Load is measured as the maximum number of queries answered by each IR core within a sliding window of 1,000 queries.
Still, the maximum load is 25% of the maximum capacity available at each IR core.
28. Load Balancing Strategies
- Load-driven basic <L>
- Servers are ranked according to their relevance, using a collection selection function. The first gets priority 1, and priorities then decrease linearly down to 1/17. Every server i has to answer if L(i) < p(i) * L
- Load-driven boost <L,T>
- Priority is 1 for the first T servers, and then decreases linearly down to 1/(17-T)
- Both strategies are sketched below
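A sketch of the two strategies under the stated rules; the sliding-window load bookkeeping is a simplified stand-in for the broker's model:

```python
from collections import deque

N = 17  # document clusters (16 + 1)

def priority(rank: int, boost_t: int = 0) -> float:
    """Priority of the server at 1-based rank; boost_t = 0 gives the
    basic <L> strategy, boost_t = T the <L,T> boost strategy."""
    if rank <= boost_t:
        return 1.0
    return (N - rank + 1) / (N - boost_t)

class Broker:
    def __init__(self, load_cap: float, window: int = 1000):
        self.load_cap = load_cap
        self.recent = deque(maxlen=window)  # ids of recently used servers

    def route(self, ranked_servers, boost_t: int = 0):
        """Let server i answer only if L(i) < p(i) * L."""
        chosen = []
        for rank, server in enumerate(ranked_servers, start=1):
            load = sum(1 for s in self.recent if s == server)
            if load < priority(rank, boost_t) * self.load_cap:
                chosen.append(server)
                self.recent.append(server)
        return chosen
```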
29. Experimental Settings (2)
- The broker models the load on the cores as the number of queries served out of the last W queries
- Assumption: cost 1 for each query and collection
- We will change this later
- We count the number of relevant results we can get by polling the servers, up to the chosen load threshold
30. Load Balancing Results
[Charts: intersection (the percentage of relevant results retrieved) and competitive similarity (the percentage of rank score retrieved)]
31. Caching and Collection Selection
32. Interaction with a Cache
- Result caching is commonly used in Web search engines [Baeza-Yates et al., 2007a; Baeza-Yates et al., 2007b]
- Caching has the effect of reshaping the power law underlying the query distribution [Baeza-Yates et al., 2007a]
- We designed a novel caching strategy (incremental caching) integrated with collection selection
33. Incremental Caching
An incremental cache is effective both at reducing load and at improving result quality.
[Figure: queries flow through the incremental cache to IR cores 1-4; for each cached query, the cache records which servers have already been polled and merges their results]
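A sketch of the mechanism, with hypothetical poll/select callbacks standing in for the real dispatch to the IR cores: a cache entry keeps both the merged results and the set of servers already polled for that query, so on a hit any spare capacity is spent polling servers not contacted yet, and the entry's quality improves over time.

```python
class IncrementalCache:
    def __init__(self, poll, select):
        self.poll = poll        # poll(server, query) -> scored results
        self.select = select    # select(query) -> servers, best first
        self.entries = {}       # query -> (results, set of polled servers)

    def answer(self, query, budget: int):
        """Answer query, polling at most budget extra servers."""
        results, polled = self.entries.get(query, ([], set()))
        for server in self.select(query):
            if budget == 0:
                break
            if server in polled:
                continue        # already paid for: reuse cached results
            results = merge(results, self.poll(server, query))
            polled.add(server)
            budget -= 1
        self.entries[query] = (results, polled)
        return results

def merge(a, b, k: int = 10):
    """Keep the k best results; (score, docid) pairs assumed."""
    return sorted(set(a) | set(b), reverse=True)[:k]
```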
34. Incremental Caching Results
[Charts: intersection (like P@N, the percentage of relevant results retrieved) and competitive similarity (the percentage of rank score retrieved)]
36. Refined Cost Model and Prioritization
37. Collection Prioritization
- We reverse the load control, moving it from the broker to the cores
- The broker broadcasts the query, and sends information about the relative rank of each core (its priority)
- Each core serves the query if L(i) < p(i) * L
- L(i) is the sum of the computing costs (timings) of the queries served so far; see the sketch below
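A core-side sketch of this reversed control flow; the timing bookkeeping and the stub query runner are illustrative:

```python
from collections import deque

class IRCore:
    def __init__(self, load_cap: float, window: int = 1000):
        self.load_cap = load_cap
        self.recent = deque(maxlen=window)  # timings of recent queries

    def maybe_serve(self, query, priority: float):
        """Serve only if the timing-based load leaves room: L(i) < p(i) * L."""
        if sum(self.recent) >= priority * self.load_cap:
            return None                      # too loaded for this priority
        results, elapsed = self.run_query(query)
        self.recent.append(elapsed)          # L(i) sums measured timings
        return results

    def run_query(self, query):
        # Placeholder for the local index lookup; a real core returns
        # the result list together with the measured query time.
        return [], 0.01
```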
39. Extended Tests
- We actually partitioned the documents onto different servers
- We indexed locally, and we measured the timing of each query
- The actual timing is used to compute the load and drive the system
- The load cap is on the AVERAGE load
- The peak can vary heavily!
41. The bill, please!
42. Conclusions
- We presented an architecture for a distributed search engine, based on collection selection
- The load-driven strategy and incremental caching can retrieve very high-quality results with a reduced load
- Verified with an extensive simulation
43. Impact and Benefits
- If a given precision is expected, we can use FEWER servers
- With a given number of servers, we get HIGHER precision
- Confirmed with different metrics
- A smaller load for the IR system, with more focus on the top results
- A nice trade-off between cost and quality
44. Impact and Benefits (2)
- Load-driven routing can be used
- to absorb query peaks
- to offer higher- or lower-quality results to selected users
- Consistent ranking, thanks to local indexing
- Incremental caching can be used to reduce the negative effects of selection
45. Furthermore
- Caching posting lists is very effective on local indices
- A simple way to add new documents
- Incremental caching could help with impact-ordered posting lists
- Caching could be based on the value of each cache line (query frequency, number of polled servers)
46. Future Work
- Comparison with other clustering approaches (k-means, link-based, P2P, LSI, SVD)
- Tests on a large-scale, real-world search engine
- A real-world implementation at Google
- A TOIS paper to wrap up
47. References
- [Puppin et al., 2006] Diego Puppin, Fabrizio Silvestri, Domenico Laforenza. Query-Driven Document Partitioning and Collection Selection. Invited paper. Proceedings of INFOSCALE 2006.
- [Puppin and Silvestri, 2006] Diego Puppin, Fabrizio Silvestri. The Query-Vector Document Model. Proceedings of CIKM 2006.
- [Puppin et al., 2007] Diego Puppin, Ricardo Baeza-Yates, Raffaele Perego, Fabrizio Silvestri. Incremental Caching for Collection Selection Architectures. Proceedings of INFOSCALE 2007.
48. References
- [Baeza-Yates et al., 2007a] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri. Challenges in Distributed Information Retrieval. Invited paper. Proceedings of ICDE 2007.
- [Chierichetti et al., 2007] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal. Finding Near Neighbors Through Cluster Pruning. Proceedings of PODS 2007.
- [Baeza-Yates et al., 2007b] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, Fabrizio Silvestri. The Impact of Caching on Search Engines. Proceedings of SIGIR 2007.
49. References
- [Dhillon et al., 2003] I. S. Dhillon, S. Mallela, D. S. Modha. Information-Theoretic Co-Clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003).
50. Backup Slides
54. Adding Documents
55. Adding Documents
- It is important to assign new documents to the fittest clusters
- New versions, new pages, etc.
- The new documents will then be found along with the previously assigned documents
- Hopefully, collection selection will place them with similar documents
56. A Modest Proposal
- The body of the new document is used as a query for PCAP selection
- The body is compared to the query clusters
- This yields a similarity between the document body and each query cluster
- We use PCAP to rank the document collections
57. Implementation
- The first 1000 bytes of the (stripped) document body are used
- The new document is assigned to the document cluster with the top PCAP score
- New documents are indexed locally
- No need to re-train or re-assign
- New documents have consistent scores and rankings (a sketch of the assignment step follows)
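A sketch of the assignment step, reusing the hypothetical pcap_rank from the earlier PCAP sketch; the tokenization is illustrative:

```python
def assign_new_document(body: str, query_cluster_terms, P) -> int:
    """Pick the fittest document cluster for a new document."""
    # The slides use the first 1000 bytes of the stripped body;
    # a character prefix stands in for that here.
    terms = body[:1000].lower().split()
    ranked = pcap_rank(terms, query_cluster_terms, P)
    return ranked[0]
```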
58. Test Configurations