Title: A Search Engine Architecture Based on Collection Selection
1. A Search Engine Architecture Based on Collection Selection
- Diego Puppin
- University of Pisa, Italy
- Supervisors: D. Laforenza, M. Vanneschi
2. Introduction
3. Motivations
- The Web is getting bigger and bigger, and users are getting pickier
- Precise results are needed very fast
- The index is growing, due to added pages and advanced indexing
- Big IR problems for Web, book, and multimedia search engines
4. Motivations (2)
- There is a need for new solutions, able to give high-quality results with a reduced computing load
- Parallel computing looks like the most natural choice to help algorithms face this growth rate [Baeza-Yates et al., 2007a]
- Billions of pages and several TB of data are available; the index alone is still very big (about 5x the collection size)
- New approaches to partitioning are key to the next phase
5. Parallel (Distributed) IRSs
6. Term vs. Document Partitioning
7. Term vs. Document Partitioning
- Term partitioning reduces the computing load
- Only the servers holding the relevant terms are involved
- But it suffers from load-balancing problems and heavier communication patterns
- Document partitioning balances the load better, but all documents are scanned
- How can we reduce the load with document partitioning? (See the sketch below)
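The contrast can be made concrete with a small routing sketch. This is an illustration, not the thesis code; the broker classes and their structure are hypothetical:

```python
from typing import Dict, List, Set

class TermPartitionedBroker:
    """Each server holds full posting lists for a subset of the terms."""
    def __init__(self, term_to_server: Dict[str, int]):
        self.term_to_server = term_to_server

    def route(self, query_terms: List[str]) -> Set[int]:
        # Only servers holding a query term are contacted: less total
        # work, but popular terms skew the load, and multi-term queries
        # force servers to exchange (long) posting lists.
        return {self.term_to_server[t] for t in query_terms
                if t in self.term_to_server}

class DocPartitionedBroker:
    """Each server holds a complete local index for a subset of docs."""
    def __init__(self, num_servers: int):
        self.num_servers = num_servers

    def route(self, query_terms: List[str]) -> Set[int]:
        # Every server is polled: good balancing, but every partition
        # is scanned on every query. Collection selection (next slides)
        # prunes this set.
        return set(range(self.num_servers))
```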
9. Main Contributions
- The query-vector document model
- More efficient for partitioning and selection (co-clustering and PCAP)
- Load-driven routing
- Makes better use of the available capacity
- Based on the effective load of the system
- Incremental caching
- Improves throughput AND quality
10. Acknowledgments
- Fabrizio Silvestri
- Raffaele Perego
- Ricardo Baeza-Yates
- Abdur Chowdhury, Ophir Frieder, Gerhard Weikum, and the various reviewers
11. Other Contributions
- A more compact collection representation: 1/5 the size of CORI's, and outperforming it
- A way to select documents (50%) to move out of the main index: the documents in the supplemental index contribute to only 3% of the top results
- A simple way to update the index in a document-partitioned system
- An extended simulation: 6 M documents, 800k test queries, real computing costs, several configurations tested
12. Reviewer's Requests (Frieder)
- A more detailed discussion of the co-clustering algorithm
- An improved cost scheme
- Experiments to be extended in the future
13. Reviewer's Requests (Weikum)
- An improved description of pipelined term-partitioned IR systems
- An improved description of co-clustering
- A better definition of shingles
- A new, realistic cost model
- A deeper discussion of caching and silent documents
14. How to Improve Partitions
15. Partitioning Strategy
[Figure: a partitioning strategy maps the document collection onto partitions p1, p2, ..., pp]
16. The QV Model
[Figure: a query-document matrix, with queries 1-12 as rows and documents as columns; co-clustering reorders rows and columns into homogeneous blocks]
17. Theoretical Model of Co-clustering
- The algorithm we use [Dhillon et al., 2003] finds the clustering that minimizes the loss of mutual information between the original matrix and the clustered matrix, given the number of row and column clusters (the objective is sketched below)
- Efficient implementation, very robust solution
- Stable with respect to the test period, the number of clusters, the training set used, and the matrix model (scores, boolean, repeated)
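To make the objective concrete, here is a small sketch (not the thesis code) of the quantity that information-theoretic co-clustering minimizes for one fixed row/column clustering; the counts matrix and the cluster labels are hypothetical inputs:

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) for a joint probability matrix (rows: X, columns: Y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def information_loss(counts: np.ndarray, row_labels, col_labels,
                     n_row_clusters: int, n_col_clusters: int) -> float:
    """Loss of mutual information for one fixed (row, column) clustering;
    the co-clustering algorithm searches for labels minimizing this."""
    p = counts / counts.sum()                      # joint distribution
    clustered = np.zeros((n_row_clusters, n_col_clusters))
    for i, r in enumerate(row_labels):             # aggregate the blocks
        for j, c in enumerate(col_labels):
            clustered[r, c] += p[i, j]
    return mutual_information(p) - mutual_information(clustered)
```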
18. QV for Collection Selection
Partitions are ranked according to their relevance to the query.
[Figure: the incoming query is matched against the query clusters, which in turn score the document clusters]
We called this strategy PCAP.
19. PCAP Collection Selection
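A minimal sketch of PCAP-style ranking, assuming a precomputed query-cluster x document-cluster matrix P (aggregating the scores that the documents of each cluster earned for the training queries of each query cluster) and a simple term-weight similarity between the incoming query and each query cluster; the real system uses a richer weighting:

```python
from typing import Dict, List

def pcap_rank(query_terms: List[str],
              query_cluster_terms: List[Dict[str, float]],
              P: List[List[float]]) -> List[int]:
    """Return document-cluster ids, most relevant first."""
    num_dc = len(P[0])
    scores = [0.0] * num_dc
    for qc, terms in enumerate(query_cluster_terms):
        # term-weight similarity of the query to this query cluster
        sim = sum(terms.get(t, 0.0) for t in query_terms)
        if sim == 0.0:
            continue
        for dc in range(num_dc):
            # weight the cluster's historical contribution by the match
            scores[dc] += sim * P[qc][dc]
    return sorted(range(num_dc), key=lambda dc: -scores[dc])
```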
20. Experimental Settings
- Experiments were carried out using:
- WBR99: 5,939,061 documents, 22 GB of uncompressed text; a snapshot of the Brazilian Web (.br domain) from 1999
- A query log from todobr.com, covering January-October 2003
- Zettair as the IR core
- Training: 190,000 queries; test: 800,000 queries
- We created 16 + 1 document clusters and 128 query clusters
- The model was tested on the following (fourth) week
Metrics used:
- Intersection: the percentage of the relevant results returned using only k servers out of the 17 (from [Puppin et al., 2006])
- Competitive similarity: the percentage of the relevance score obtained using only k servers out of the 17 (adapted from [Chierichetti et al., 2007])
21. Quality Metrics
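A sketch of the two metrics as just defined; the reference run polls all 17 servers, the pruned run only k, and everything beyond that definition is illustrative:

```python
from typing import Dict, Set

def intersection(full_top: Set[str], pruned_top: Set[str]) -> float:
    """Share of the reference top results (all servers polled)
    that the pruned run (only k servers) still returns."""
    return len(full_top & pruned_top) / len(full_top)

def competitive_similarity(full_scores: Dict[str, float],
                           pruned_docs: Set[str]) -> float:
    """Share of the attainable relevance score the pruned run retains."""
    total = sum(full_scores.values())
    kept = sum(s for d, s in full_scores.items() if d in pruned_docs)
    return kept / total if total else 0.0
```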
22. Very Effective Partitioning and Selection
With random partitioning, CORI performs really badly: almost equal to relevant/N_clusters. E.g., polling 5 of the 17 clusters retrieves about 5/17 = 0.294, roughly 0.3 of the relevant results.
CORI on QV performs about 5.2 times better than CORI on random partitions.
PCAP on QV performs about 5.8 times better than CORI on random partitions.
PCAP on QV performs about 1.1 times better than CORI on QV.
24. Strengths
- Popular queries drive the distribution
- A low-dimensional space represents the documents
- A more efficient collection representation
- Query vectors may be built while answering queries
25. Weaknesses
- Dependent on the training set?
- Actually NOT!
- Cannot manage new query terms?
- Only a very small fraction of queries; CORI does not help either
- Incremental caching can help
- Collection selection depends on the initial assignment?
- But adding documents does not break performance
26. Issues with Load Distribution
27. Load Balancing
Load is measured as the maximum number of queries answered by each IR core within a sliding window of 1,000 queries.
Still, the maximum load is 25% of the maximum capacity available at each IR core.
28. Load Balancing Strategies
- Load-driven basic <L>
- Servers are ranked according to their relevance, using a collection selection function. The first gets priority 1, and priorities then decrease linearly down to 1/17. Every server i has to answer if L(i) < p(i) * L
- Load-driven boost <L,T>
- Priority is 1 for the first T servers, and then decreases linearly down to 1/(17-T)
- Both strategies are sketched below
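A sketch of the two strategies under the stated rules; the sliding-window load bookkeeping is a simplified stand-in for the broker's model:

```python
from collections import deque

N = 17  # document clusters (16 + 1)

def priority(rank: int, boost_t: int = 0) -> float:
    """Priority of the server at 1-based rank; boost_t = 0 gives the
    basic <L> strategy, boost_t = T the <L,T> boost strategy."""
    if rank <= boost_t:
        return 1.0
    return (N - rank + 1) / (N - boost_t)

class Broker:
    def __init__(self, load_cap: float, window: int = 1000):
        self.load_cap = load_cap
        self.recent = deque(maxlen=window)  # ids of recently used servers

    def route(self, ranked_servers, boost_t: int = 0):
        """Let server i answer only if L(i) < p(i) * L."""
        chosen = []
        for rank, server in enumerate(ranked_servers, start=1):
            load = sum(1 for s in self.recent if s == server)
            if load < priority(rank, boost_t) * self.load_cap:
                chosen.append(server)
                self.recent.append(server)
        return chosen
```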
29. Experimental Settings (2)
- The broker models the load on the cores as the number of queries served out of the last W queries
- Assumption: cost 1 for each query and collection
- We will change this later
- We count the number of relevant results we can get by polling the servers, up to the chosen load threshold
30. Load Balancing Results
[Charts: intersection (the percentage of relevant results retrieved) and competitive similarity (the percentage of rank score retrieved)]
31. Caching and Collection Selection
32. Interaction with a Cache
- Result caching is commonly used in Web search engines [Baeza-Yates et al., 2007a; Baeza-Yates et al., 2007b]
- Caching has the effect of reshaping the power law underlying the query distribution [Baeza-Yates et al., 2007a]
- We designed a novel caching strategy (incremental caching) integrated with collection selection
33. Incremental Caching
An incremental cache is effective both at reducing load and at improving result quality.
[Figure: queries flow through the incremental cache to IR cores 1-4; for each cached query, the cache records which servers have already been polled and merges their results]
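A sketch of the mechanism, with hypothetical poll/select callbacks standing in for the real dispatch to the IR cores: a cache entry keeps both the merged results and the set of servers already polled for that query, so on a hit any spare capacity is spent polling servers not contacted yet, and the entry's quality improves over time.

```python
class IncrementalCache:
    def __init__(self, poll, select):
        self.poll = poll        # poll(server, query) -> scored results
        self.select = select    # select(query) -> servers, best first
        self.entries = {}       # query -> (results, set of polled servers)

    def answer(self, query, budget: int):
        """Answer query, polling at most budget extra servers."""
        results, polled = self.entries.get(query, ([], set()))
        for server in self.select(query):
            if budget == 0:
                break
            if server in polled:
                continue        # already paid for: reuse cached results
            results = merge(results, self.poll(server, query))
            polled.add(server)
            budget -= 1
        self.entries[query] = (results, polled)
        return results

def merge(a, b, k: int = 10):
    """Keep the k best results; (score, docid) pairs assumed."""
    return sorted(set(a) | set(b), reverse=True)[:k]
```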
34. Incremental Caching Results
[Charts: intersection (like P@N, the percentage of relevant results retrieved) and competitive similarity (the percentage of rank score retrieved)]
36. Refined Cost Model and Prioritization
37. Collection Prioritization
- We reverse the load control, moving it from the broker to the cores
- The broker broadcasts the query, and sends information about the relative rank of each core (its priority)
- Each core serves the query if L(i) < p(i) * L
- L(i) is the sum of the computing costs (timings) of the queries served so far; see the sketch below
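A core-side sketch of this reversed control flow; the timing bookkeeping and the stub query runner are illustrative:

```python
from collections import deque

class IRCore:
    def __init__(self, load_cap: float, window: int = 1000):
        self.load_cap = load_cap
        self.recent = deque(maxlen=window)  # timings of recent queries

    def maybe_serve(self, query, priority: float):
        """Serve only if the timing-based load leaves room: L(i) < p(i) * L."""
        if sum(self.recent) >= priority * self.load_cap:
            return None                      # too loaded for this priority
        results, elapsed = self.run_query(query)
        self.recent.append(elapsed)          # L(i) sums measured timings
        return results

    def run_query(self, query):
        # Placeholder for the local index lookup; a real core returns
        # the result list together with the measured query time.
        return [], 0.01
```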
39. Extended Tests
- We actually partitioned the documents onto different servers
- We indexed locally, and we measured the timing of each query
- The actual timing is used to compute the load and drive the system
- The load cap is on the AVERAGE load
- The peak can vary heavily!
41. The bill, please!
42. Conclusions
- We presented an architecture for a distributed search engine, based on collection selection
- The load-driven strategy and incremental caching can retrieve very high-quality results with a reduced load
- Verified with an extensive simulation
43. Impact and Benefits
- If a given precision is expected, we can use FEWER servers
- With a given number of servers, we get HIGHER precision
- Confirmed with different metrics
- A smaller load for the IR system, with more focus on the top results
- A nice trade-off between cost and quality
44. Impact and Benefits (2)
- Load-driven routing can be used
- to absorb query peaks
- to offer higher- or lower-quality results to selected users
- Consistent ranking, thanks to local indexing
- Incremental caching can be used to reduce the negative effects of selection
45. Furthermore
- Caching posting lists is very effective on local indices
- A simple way to add new documents
- Incremental caching could help with impact-ordered posting lists
- Caching could be based on the value of each cache line (query frequency, number of polled servers)
46. Future Work
- Comparison with other clustering approaches (k-means, link-based, P2P, LSI, SVD)
- Tests on a large-scale, real-world search engine
- A real-world implementation at Google
- A TOIS paper to wrap up
47. References
- [Puppin et al., 2006] Diego Puppin, Fabrizio Silvestri, Domenico Laforenza. Query-Driven Document Partitioning and Collection Selection. Invited paper. Proceedings of INFOSCALE 2006.
- [Puppin and Silvestri, 2006] Diego Puppin, Fabrizio Silvestri. The Query-Vector Document Model. Proceedings of CIKM 2006.
- [Puppin et al., 2007] Diego Puppin, Ricardo Baeza-Yates, Raffaele Perego, Fabrizio Silvestri. Incremental Caching for Collection Selection Architectures. Proceedings of INFOSCALE 2007.
48. References
- [Baeza-Yates et al., 2007a] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri. Challenges in Distributed Information Retrieval. Invited paper. Proceedings of ICDE 2007.
- [Chierichetti et al., 2007] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal. Finding Near Neighbors Through Cluster Pruning. Proceedings of PODS 2007.
- [Baeza-Yates et al., 2007b] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, Fabrizio Silvestri. The Impact of Caching on Search Engines. Proceedings of SIGIR 2007.
49. References
- [Dhillon et al., 2003] I. S. Dhillon, S. Mallela, D. S. Modha. Information-Theoretic Co-Clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003).
50. Backup Slides
54. Adding Documents
55. Adding Documents
- It is important to assign new documents to the fittest clusters
- New versions, new pages, etc.
- The new documents will then be found along with the previously assigned documents
- Hopefully, collection selection will place them with similar documents
56. A Modest Proposal
- The body of the new document is used as a query for PCAP selection
- The body is compared to the query clusters
- This yields a similarity between the document body and each query cluster
- We use PCAP to rank the document collections
57. Implementation
- The first 1000 bytes of the (stripped) document body are used
- The new document is assigned to the document cluster with the top PCAP score
- New documents are indexed locally
- No need to re-train or re-assign
- New documents have consistent scores and rankings (a sketch of the assignment step follows)
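A sketch of the assignment step, reusing the hypothetical pcap_rank from the earlier PCAP sketch; the tokenization is illustrative:

```python
def assign_new_document(body: str, query_cluster_terms, P) -> int:
    """Pick the fittest document cluster for a new document."""
    # The slides use the first 1000 bytes of the stripped body;
    # a character prefix stands in for that here.
    terms = body[:1000].lower().split()
    ranked = pcap_rank(terms, query_cluster_terms, P)
    return ranked[0]
```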
58. Test Configurations