Title: Range and kNN Searching in P2P
1Range and kNN Searching in P2P
- Manesh Subhash
- Ni Yuan
- Sun Chong
2Outline
- Range query searching in P2P
- one-dimensional range queries
- multi-dimensional range queries
- comparison of range query searching in P2P
- kNN searching in P2P
- scalable nearest neighbor searching
- PIERSearch
- Conclusion
3Motivation
- Most P2P systems support only simple lookup queries
- DHT-based approaches such as Chord and CAN are not suitable for range queries
- More complicated queries such as range queries and kNN search are needed
4P-Tree [APJ04]
- The B-tree is widely used for efficiently evaluating range queries in centralized databases
- A distributed B-tree is not directly applicable in a P2P environment
- fully independent B-trees
- semi-independent B-trees, i.e. the P-tree
5Fully independent B-tree
(Figure: fully independent B-trees - each peer maintains its own B-tree over the full set of values 4, 8, 12, 20, 24, 25, 26, 35)
6Semi-independent B-tree
(Figure: semi-independent B-trees (P-tree) - peers P1..P8 hold the values 4, 8, 12, 20, 24, 25, 26, 35 on a ring; each peer stores only the left-most root-to-leaf path of a B-tree rooted at its own value)
7Coverage and Separation
(Figure: the coverage and separation properties of P-tree nodes, with examples of violations - overlap between adjacent subtrees, and anti-coverage, i.e. gaps that are left uncovered)
8Properties of P-tree
- Each peer's P-tree has O(log_d N) nodes, where d is the order and N the number of peers
- Total storage per peer is O(d * log_d N)
- Requires no global coordination among the peers
- The search cost for a range query that returns m results is O(log_d N + m) (see the quick check below)
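A quick sanity check of these bounds with made-up values for the order d, the number of peers N and the result size m (illustrative only, not figures from [APJ04]):

    import math

    d, N, m = 8, 10_000, 100              # hypothetical P-tree order, network size, result size

    height = math.ceil(math.log(N, d))    # O(log_d N) levels, here 5
    storage = d * height                  # O(d * log_d N) entries per peer, here 40
    search = height + m                   # O(log_d N + m) messages for the range query, here 105

    print(height, storage, search)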
9Search Algorithm: peer p1 evaluates the range query 21 < value < 29
(Figure: the same ring of peers P1..P8 holding 4, 8, 12, 20, 24, 25, 26, 35; starting from p1, the search locates the values in (21, 29), i.e. 24, 25 and 26)
10Multi-dimensional range queries
- Routing in one-dimensional routing space
- ZNet: Z-ordering + skip graphs [STZ04]
- Hilbert space-filling curve + Chord [SP03]
- SCRAP [GYG04]
- Routing in multi-dimensional routing space
- MURK [GYG04]
11Desiderata
- Locality: data elements that are nearby in the data space should be stored on the same node or on nearby nodes
- Load balance: the amount of data stored by each node should be roughly the same
- Efficient routing: the number of messages exchanged between nodes to route a query should be small
12Hilbert SFC + Chord
- SFC: maps a d-dimensional cube onto a line
- the line passes once through every point in the volume of the cube
(Figure: Hilbert space-filling curve over a 2-dimensional 4x4 grid, with each cell labelled by its 4-bit index 0000-1111)
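Generating a true Hilbert curve takes a bit more code; the sketch below uses the simpler Z-order (Morton) curve to illustrate the same bit-interleaving idea of collapsing d-dimensional coordinates into one 1-dimensional index (a stand-in, not the exact Hilbert mapping of [SP03]):

    def z_order(coords, bits):
        """Interleave the coordinate bits into one 1-d index (Z-order curve)."""
        index = 0
        for bit in range(bits - 1, -1, -1):        # most significant bit first
            for c in coords:
                index = (index << 1) | ((c >> bit) & 1)
        return index

    # 2-d grid, 2 bits per dimension: 16 cells mapped onto indices 0..15
    print(z_order((1, 0), bits=2))   # cell (x=01, y=00) -> index 0b0010 = 2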
13Hilbert SFC + Chord
- the 1-dimensional index space is mapped onto the Chord overlay network topology
(Figure: a Chord ring with nodes 0, 4, 8, 11 and 14; data elements with keys 5, 6, 7, 8 are mapped to node 8)
14Query Processing
- translate the keyword query into the relevant clusters of the SFC-based index space
- query the appropriate nodes in the overlay network for the data elements
(Figure: the query's clusters in the SFC index space and the corresponding grid cells, mapped to Chord nodes 0, 4, 8, 11 and 14)
15Query Optimization: query (010, *)
(Figure: the SFC index-space clusters matching the query - (000100), (000111, 001000), (001011), (011000, 011001), (011101, 011110))
16Query Optimization (cont.): query (010, *)
(Figure: the matching clusters embedded in a prefix tree over the SFC index space)
17Query Optimization (cont.)
(Figure: refining the query (010, *) by pruning non-matching nodes from the prefix tree)
18SCRAP [GYG04]
- Use a Z-order or Hilbert space-filling curve to map multi-dimensional data down to a single dimension
- Range-partition the one-dimensional data across the S available nodes
- Use a skip graph to route queries
19MURK: Multi-dimensional Rectangulation with KD-trees
- Basic concept
- Partition the high-dimensional data space into rectangles, each managed by one node
- Partitioning follows a kd-tree: the space is split cyclically along the dimensions, and each leaf of the kd-tree corresponds to one rectangle
20Partitioning
- When a node joins, the space is split along one dimension into two parts of equal load, maintaining load balance (see the split sketch below)
- Each node manages the data in one rectangle, thus preserving data locality
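A minimal sketch of that split step, assuming a node simply tracks its rectangle and the points it stores (hypothetical names, not code from [GYG04]):

    def split_on_join(points, rect, depth):
        """Split a rectangle into two halves of equal load, cycling through
        the dimensions kd-tree style; the joining node takes one half."""
        dim = depth % len(rect)                         # split dimension, chosen cyclically
        points = sorted(points, key=lambda p: p[dim])
        mid = len(points) // 2                          # equal-load (median) split point
        cut = points[mid][dim]
        left  = [(lo, cut) if d == dim else (lo, hi) for d, (lo, hi) in enumerate(rect)]
        right = [(cut, hi) if d == dim else (lo, hi) for d, (lo, hi) in enumerate(rect)]
        return (points[:mid], left), (points[mid:], right)

    old_half, new_half = split_on_join([(0.2, 0.7), (0.5, 0.1), (0.9, 0.4)],
                                       [(0.0, 1.0), (0.0, 1.0)], depth=0)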
21Comparison with CAN
- The kd-tree based partitioning is similar to that of CAN: both place data in a multi-dimensional space and try to keep the load balanced
- The major difference is that in CAN a new node splits the existing node's data space equally, rather than splitting the load equally
22Routing in MURK
- Each node keeps a link to every neighbouring node, i.e. the nodes managing adjacent rectangles
- Routing is greedy over these grid links; the distance between two nodes is the minimum Manhattan distance (see the sketch below)
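A rough sketch of one greedy hop, assuming each node knows a representative point (e.g. the centre) of its neighbours' rectangles (assumed helper names, not the paper's code):

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def greedy_next_hop(current, neighbours, target):
        """Forward the query to the grid neighbour closest (in Manhattan
        distance) to the target point; stop if no neighbour is closer."""
        best = min(neighbours, key=lambda n: manhattan(n, target))
        return best if manhattan(best, target) < manhattan(current, target) else None

    print(greedy_next_hop((0.1, 0.1), [(0.4, 0.1), (0.1, 0.5)], target=(0.8, 0.2)))  # (0.4, 0.1)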
23Optimization for the routing
- Grid links alone are not efficient for routing
- Maintain skip pointers at each node to speed up routing; two ways to choose them:
- Random: choose a node at random from the node set
- Space-filling skip graph: place skip pointers at exponentially increasing distances
24Discussion
- The number of routing neighbours is non-uniform, a consequence of balancing load across nodes
- A dynamic data distribution can cause the per-node load to become unbalanced
25Performance
26Performance (cont.)
27Conclusion
- For locality, MURK far outperforms SCRAP
- For routing cost, SCRAP is efficient enough; in MURK, skip pointers such as the space-filling skip graph are needed for efficient routing
- SCRAP (space-filling curve with range partitioning) is efficient in low dimensions; MURK with a space-filling skip graph performs much better, especially in high dimensions
28pSearch
- Motivation
- An enormous number of documents are spread over the Internet
- How to efficiently find the most closely related documents without returning too many of little interest?
- Problem: documents are randomly distributed with respect to their semantics
- Exhaustive search brings too much overhead
- No deterministic guarantees
29P2P IR techniques
- Unstructured P2P search
- Centralized index suffers from the bottleneck problem
- Flooding-based techniques result in too much overhead
- Heuristic-based algorithms may miss important documents
- Structured P2P search
- DHT-based CAN and Chord are suitable for keyword matching
- Traditional IR techniques
- Advanced IR ranking algorithms could be adopted for P2P search
- Two IR techniques
- Vector space model (VSM)
- Latent semantic indexing (LSI)
30pSearch
- An IR system built on P2P networks
- As efficient and scalable as a DHT
- As accurate as advanced IR algorithms
- Maps the semantic space onto nodes and conducts nearest neighbor search
- uses VSM and LSI to generate the semantic space
- uses CAN to organize the nodes
31VSM and LSI
- VSM
- Documents and queries are expressed as term vectors
- Weight of a term: term frequency x inverse document frequency (TF-IDF)
- Rank by the similarity of document and query, cos(X, Y), where X and Y are the two term vectors
- LSI
- Based on singular value decomposition: transforms high-dimensional term vectors into low-dimensional (l-dimensional) semantic vectors
- Its statistical nature mitigates the effects of synonymy and noise in documents (see the sketch below)
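A compact sketch of both models on a toy term-document matrix (plain NumPy; pSearch itself builds on a full IR engine, so this is only illustrative):

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents (raw term frequencies)
    A = np.array([[2., 0., 1.],
                  [0., 3., 1.],
                  [1., 1., 0.]])

    # VSM: TF-IDF weighting, then cosine similarity between the query and each document
    idf = np.log(A.shape[1] / np.count_nonzero(A, axis=1))
    W = A * idf[:, None]
    q = np.array([1., 0., 1.]) * idf
    cos = (W.T @ q) / (np.linalg.norm(W, axis=0) * np.linalg.norm(q))
    print("VSM ranking:", np.argsort(-cos))

    # LSI: a rank-l SVD projects term vectors into a low-dimensional semantic space
    l = 2
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    doc_sem = (np.diag(S[:l]) @ Vt[:l]).T      # l-dimensional semantic vectors of the documents
    query_sem = q @ U[:, :l]                   # fold the query into the same semantic space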
32pSearch system
(Figure: pSearch system overview - documents (DOC) and queries (QUERY) are turned into semantic vectors and routed through the overlay)
33Advantage of pSearch
- Exhaustive search within a bounded area, which can in the ideal case be fully accurate
- Communication overhead is limited to transferring the query and references to the top documents, independent of corpus size
- A good approximation of the global statistics is sufficient for pSearch
34Challenges
- Dimensionality mismatch between CAN and LSI.
- Uneven distribution of indices.
- Large search region.
35Dimensionality mismatch
- There are not enough nodes (N) in the CAN to partition all l dimensions of the LSI semantic space
- N nodes in CAN can partition only about log(N) low dimensions (the effective dimensionality), leaving the others un-partitioned
36Rolling index
- Motivation
- A small subset of the dimensions contributes most of the similarity
- the low dimensions are the most important ones
- Partition more dimensions of the semantic space by rotating the semantic vectors (see the rotation sketch below)
- Given a semantic vector V = (v0, v1, ..., vl), each rotation shifts the vector by m dimensions; rotated space i uses the vector of the i-th rotation
- Vi = (v_{i*m}, ..., vl, v0, v1, ..., v_{i*m-1})
- m is about 2.3 ln(n)
- Use the rotated vector to route the query and guide the search
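A minimal sketch of the rotation itself, with made-up values for the vector length and for n (assumed here to be the node count):

    import math

    def rotated_vector(v, i, m):
        """i-th rotation: shift the semantic vector left by i*m positions so a
        different block of dimensions becomes the partitioned low dimensions."""
        k = (i * m) % len(v)
        return v[k:] + v[:k]

    n = 10_000                          # assumed number of nodes
    m = round(2.3 * math.log(n))        # dimensions shifted per rotation (about 21 here)
    v = list(range(300))                # a dummy 300-dimensional semantic vector
    print(rotated_vector(v, 1, m)[:5])  # [21, 22, 23, 24, 25]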
37Rolling index
- Uses more storage (p times) to keep the search within a local region
- Selective rotation is expected to process the important higher dimensions efficiently
38Balance index distribution
- Content-aware node bootstrapping
- a joining node randomly selects one of the documents it publishes
- routes itself to where that document's semantic vector lands
- the node there transfers part of its load to the newcomer
- Regions with more indices thus receive more nodes; even though the selection is random, the load stays balanced for a large corpus
39Reducing search space
- Curse of dimensionality
- high-dimensional data is sparsely populated
- in high dimensions, the distance to the nearest neighbor becomes large
- Exploiting data locality, use the indices stored on nodes and recently processed queries to guide new searches
40Content-directed search
(Figure: content-directed search on a grid of CAN nodes around the query q; stored indices and recently processed queries determine which neighbouring nodes to visit next)
41Performance
42Conclusion
- pSearch is a P2P IR system that organizes content around its semantics and achieves good accuracy as the system size, corpus size and number of returned documents grow
- The rolling index resolves the dimensionality mismatch while limiting the space overhead and the number of visited nodes
- Content-aware node bootstrapping balances node load and achieves index and query locality
- Content-directed search reduces the number of nodes searched
43kNN searching in P2P Networks
- Manesh Subhash
- Ni Yuan
- Sun Chong
44Outline
- Introduction to searching in P2P
- Nearest neighbor queries
- Presentation of the ideas in the papers
- 1. A Scalable Nearest Neighbor Search in P2P Systems
- 2. Enhancing P2P File-Sharing with an Internet-Scale Query Processor
45Introduction to searching in P2P
- Exact-match queries
- single-key retrieval
- linear hashing
- CAN, Chord, Pastry, Tapestry
- Similarity-based queries
- metric-space based
- What do we search for?
- rare items, popular items, or both
46Nearest neighbor queries
- The notion of a metric space
- How similar are two objects from a given set of objects?
- Extensible to exact, range and nearest neighbor queries
- Computationally expensive
- The distance function satisfies non-negativity, reflexivity, symmetry and the triangle inequality
47Nearest neighbor queries (Cont)
- A metric space is a pair (D, d)
- D: the domain of objects
- d: the distance function
- Similarity queries (toy example below)
- Range: for a dataset F, a subset of D, a range query retrieves all objects of F whose distance to the query object q is at most r
- Nearest neighbor
- returns the object closest to q; kNN returns the k nearest objects, k <= |F|
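A toy in-memory illustration of the two query types, using Euclidean distance as the metric (any function satisfying the four properties above would do):

    import math

    objects = [(0, 0), (1, 1), (2, 2), (5, 5)]
    q, r, k = (1, 0), 1.5, 2

    range_result = [o for o in objects if math.dist(o, q) <= r]       # range query R(q, r)
    knn_result = sorted(objects, key=lambda o: math.dist(o, q))[:k]   # k nearest neighbors

    print(range_result)   # [(0, 0), (1, 1)]
    print(knn_result)     # [(0, 0), (1, 1)]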
48Scalable NN search
- Uses the GHT structure
- a distributed metric index
- supports range and k-NN queries
- The GHT architecture is composed of peers (nodes) that can insert, store and retrieve objects using similarity queries
- Assumptions: message passing, unique network identifiers, local buckets to store data, and each object is stored in exactly one bucket
49Example of the GHT Network
50Scalable NN search (3)
- Address Search Trees (AST)
- a binary search tree
- inner nodes hold routing information: two pivots plus pointers to the left and right sub-trees
- leaf nodes are pointers to data
- local data is stored in buckets and accessed through a bucket identifier (BID)
- non-local data is identified through a network node identifier (NNID)
- (every AST leaf is one of these two kinds of pointer)
51Scalable NN search (4)
- Searching the AST
- The BPATH
- represents a path through the tree as a string of n binary elements: p = (b1, b2, ..., bn)
- A traversal operator, applied to a query q with a search radius, returns a BPATH
- The operator examines each inner node using the two pivot values and decides which sub-tree(s) to follow (see the sketch below)
- A radius of zero is used for exact matches and during inserts
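A minimal sketch of the per-node pivot test used in generalized-hyperplane trees; the exact GHT rules and tie-breaking are in [ZBG04], so treat this as an approximation:

    def subtrees_to_visit(d_q_p1, d_q_p2, r):
        """Given the distances from the query q to the two pivots and the search
        radius r, decide which sub-trees the query ball can intersect."""
        visit = []
        if d_q_p1 - r <= d_q_p2 + r:     # ball may reach the side closer to pivot p1
            visit.append("left")
        if d_q_p2 - r < d_q_p1 + r:      # ball may reach the side closer to pivot p2
            visit.append("right")
        return visit

    print(subtrees_to_visit(2.0, 5.0, r=0.0))   # ['left']          : exact match follows one path
    print(subtrees_to_visit(2.0, 5.0, r=2.0))   # ['left', 'right'] : a larger radius may follow both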
52Scalable NN search (5)
- k-NN searching in GHT
- a plain range search is not suitable without intrinsic knowledge of the data and of the metric space used
- begin the search at the bucket with a high probability of containing k objects
- if k objects are found, use the distance from q to the k-th object as the radius of a similarity (range) search
- sort the result and pick the first k
- if fewer than k objects are found, we cannot determine an upper bound on the radius needed to reach the k-th neighbor
- the range radius must then be varied
53Scalable NN search (6)
- Finding the k objects using range searches (local sketch below)
- Optimistic
- minimizes distance computations and bucket accesses
- uses as the bounding distance that of the last candidate available in the first accessed bucket
- iteratively expands the radius if fewer than k objects are found
- Pessimistic
- minimizes the probability of needing another iteration
- uses the distance between the pivot values at a level of the AST as the range radius, starting from the parent of the leaf, and executes the range query
- if fewer than k objects are found, it moves up to the next level
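A purely local sketch of the optimistic idea: issue a range query and keep enlarging the radius until at least k objects are found (in GHT these range queries are distributed; the doubling rule here is an assumption):

    import math

    def knn_optimistic(objects, dist, q, k, r0=1.0):
        """Expand the range-query radius until at least k objects are found,
        then keep the k closest ones."""
        r = r0
        while True:
            candidates = [o for o in objects if dist(o, q) <= r]
            if len(candidates) >= k or len(candidates) == len(objects):
                return sorted(candidates, key=lambda o: dist(o, q))[:k]
            r *= 2                      # fewer than k found: grow the radius and retry

    pts = [(0, 0), (3, 4), (6, 8), (20, 0)]
    print(knn_optimistic(pts, math.dist, q=(0, 0), k=3))   # [(0, 0), (3, 4), (6, 8)]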
54Scalable NN search (7)
- Performance evaluation
- With increasing k
- the number of parallel distance computations remains stable
- the number of bucket accesses and the number of messages increase rapidly
- Effect of a growing dataset
- the maximum hop count increases slowly
- nearly constant parallel distance computation costs
- Comparison with range queries
- slightly slower because of the overhead of locating the first bucket
55Scalable NN search (8)
Performance of the scheme on the TXT dataset.
56Scalable NN search (9)
- Conclusion
- a first effort at distributed index structures supporting k-NN search
- GHT is a scalable solution
- future work includes handling updates to the dataset
- and other metric-space partitioning schemes
57Enhanced P2P - PIERSearch (1)
- An Internet-scale query processor
- Queried data follows a Zipfian distribution
- popular items in the head
- a long tail of rare items
- PIERSearch is DHT-based
- It is used in a hybrid system: Gnutella serves popular items, PIERSearch serves rare items
- Integrated with the PIER system
58PIERSearch (2)
- Gnutella query processing
- flooding-based
- simple and works well for popular files
- optimized using
- ultrapeers: nodes that perform query processing on behalf of their leaf nodes
- dynamic querying: re-issuing queries with larger TTLs
- The authors studied the characteristics of the live Gnutella network
59PIERSearch (3)
- Effectiveness of Gnutella
- Query recall: the percentage of the results available in the network that are returned
- Query distinct recall: the percentage of distinct results returned, which removes the effect of replicas
- Experiments show that Gnutella is effective for highly replicated content and for queries with large result sets
- It is ineffective for rare content
- Increasing the TTL does not reduce latency but can improve recall
60PIERSearch (4)
- Searching with PIERSearch
- keyword based
- publishers maintain an inverted file index over the DHT
- two tuples are generated for each item (toy sketch below)
- Item(fileId, filename, filesize, ipAddress, port)
- Inverted(keyword, fileId)
- Uses the underlying PIER system
- a DHT-based Internet-scale relational query processor
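A toy sketch of publishing and querying this inverted index, with the DHT abstracted as a local dictionary keyed by hashed terms (assumed interface, not PIER's actual API):

    import hashlib
    from collections import defaultdict

    dht = defaultdict(list)                     # stand-in for the DHT: key -> list of tuples

    def dht_key(s):
        return hashlib.sha1(s.encode()).hexdigest()

    def publish(file_id, filename, filesize, ip, port):
        dht[dht_key(file_id)].append(("Item", file_id, filename, filesize, ip, port))
        for keyword in filename.lower().split():
            dht[dht_key(keyword)].append(("Inverted", keyword, file_id))

    def keyword_query(keywords):
        """Intersect the posting lists of all query keywords."""
        postings = [{t[2] for t in dht[dht_key(w)]} for w in keywords]
        return set.intersection(*postings) if postings else set()

    publish("f1", "rare live concert", 7_340_032, "10.0.0.1", 6346)
    publish("f2", "concert highlights", 1_048_576, "10.0.0.2", 6346)
    print(keyword_query(["rare", "concert"]))   # {'f1'}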
61PIERSearch (5)
- Hybrid system
- Identification of rare items (heuristic sketch below)
- Query result size
- a result set smaller than a fixed threshold is considered rare
- Term frequency
- items with at least one term whose frequency is below a threshold are considered rare
- Term pair frequency
- less prone to skew when filenames contain popular words
- Sampling
- sample neighboring nodes and compute a lower-bound estimate of the number of replicas
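The first two heuristics reduce to simple threshold tests; a tiny sketch with made-up threshold values (the paper tunes these empirically):

    def is_rare(result_size, term_frequencies,
                size_threshold=20, term_threshold=100):
        """Flag an item/query as rare if its result set is small or if any of
        its terms occurs fewer than term_threshold times in the index."""
        return (result_size < size_threshold or
                min(term_frequencies, default=term_threshold) < term_threshold)

    print(is_rare(3, [5000, 42]))     # True  : small result set, one infrequent term
    print(is_rare(500, [5000, 900]))  # False : popular item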
62PIERSearch (6)
63PIERSearch (7)
- Conclusion
- Gnutella is highly effective for querying popular content, but ineffective for rare items
- Building a partial index over the least-replicated content can improve query recall
64References
- [APJ04] A. Crainiceanu, P. Linga, J. Gehrke and J. Shanmugasundaram. Querying Peer-to-Peer Networks Using P-Trees. In WebDB, 2004.
- [GYG04] P. Ganesan, B. Yang and H. Garcia-Molina. One Torus to Rule Them All: Multi-dimensional Queries in P2P Systems. In WebDB, 2004.
- [SP03] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In HPDC, 2003.
- [STZ04] Y. Shu, K.-L. Tan and A. Zhou. Adapting the Content Native Space for Load Balanced Indexing. In Databases, Information Systems and Peer-to-Peer Computing, 2004.
65References (cont.)
- [LHH04] B. Loo, J. Hellerstein, R. Huebsch, S. Shenker and I. Stoica. Enhancing P2P File-Sharing with an Internet-Scale Query Processor. In VLDB, 2004.
- [TXD03] C. Tang, Z. Xu and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.
- [ZBG04] P. Zezula, M. Batko and C. Gennaro. A Scalable Nearest Neighbor Search in P2P Systems. In Databases, Information Systems and Peer-to-Peer Computing, 2004.