Title: Range and kNN Searching in P2P
1Range and kNN Searching in P2P
- Manesh Subhash
- Ni Yuan
- Sun Chong
2Outline
- Range query searching in P2P
- one-dimensional range queries
- multi-dimensional range queries
- comparison of range query searching in P2P
- kNN searching in P2P
- scalable nearest neighbor searching
- PIERSearch
- Conclusion
3Motivation
- Most P2P systems support only simple lookup queries
- DHT-based approaches such as Chord and CAN are not suitable for range queries
- More complicated queries such as range queries and kNN search are needed
4P-Tree [APJ04]
- The B-tree is widely used for efficiently evaluating range queries in centralized databases
- A distributed B-tree is not directly applicable in a P2P environment
- fully independent B-trees
- semi-independent B-trees, i.e. the P-tree
5Fully independent B-tree
(Figure: fully independent B-trees - each peer maintains its own B-tree over the full set of values 4, 8, 12, 20, 24, 25, 26, 35)
6Semi-independent B-tree
(Figure: semi-independent B-trees (P-tree) - peers P1..P8 hold the values 4, 8, 12, 20, 24, 25, 26, 35 on a ring; each peer stores only the left-most root-to-leaf path of a B-tree rooted at its own value)
7Coverage and Separation
(Figure: the coverage and separation properties of P-tree nodes, with examples of violations - overlap between adjacent subtrees, and anti-coverage, i.e. gaps that are left uncovered)
8Properties of P-tree
- Each peer's P-tree has O(log_d N) nodes, where d is the order and N the number of peers
- Total storage per peer is O(d * log_d N)
- Requires no global coordination among the peers
- The search cost for a range query that returns m results is O(log_d N + m) (see the quick check below)
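A quick sanity check of these bounds with made-up values for the order d, the number of peers N and the result size m (illustrative only, not figures from [APJ04]):

    import math

    d, N, m = 8, 10_000, 100              # hypothetical P-tree order, network size, result size

    height = math.ceil(math.log(N, d))    # O(log_d N) levels, here 5
    storage = d * height                  # O(d * log_d N) entries per peer, here 40
    search = height + m                   # O(log_d N + m) messages for the range query, here 105

    print(height, storage, search)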
9Search Algorithm: peer p1 evaluates the range query 21 < value < 29
(Figure: the same ring of peers P1..P8 holding 4, 8, 12, 20, 24, 25, 26, 35; starting from p1, the search locates the values in (21, 29), i.e. 24, 25 and 26)
10Multi-dimensional range queries
- Routing in one-dimensional routing space
- ZNet: Z-ordering + skip graphs [STZ04]
- Hilbert space-filling curve + Chord [SP03]
- SCRAP [GYG04]
- Routing in multi-dimensional routing space
- MURK [GYG04]
11Desiderata
- Locality: data elements that are nearby in the data space should be stored on the same node or on nearby nodes
- Load balance: the amount of data stored by each node should be roughly the same
- Efficient routing: the number of messages exchanged between nodes to route a query should be small
12Hilbert SFC + Chord
- SFC: maps a d-dimensional cube onto a line
- the line passes once through every point in the volume of the cube
(Figure: Hilbert space-filling curve over a 2-dimensional 4x4 grid, with each cell labelled by its 4-bit index 0000-1111)
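Generating a true Hilbert curve takes a bit more code; the sketch below uses the simpler Z-order (Morton) curve to illustrate the same bit-interleaving idea of collapsing d-dimensional coordinates into one 1-dimensional index (a stand-in, not the exact Hilbert mapping of [SP03]):

    def z_order(coords, bits):
        """Interleave the coordinate bits into one 1-d index (Z-order curve)."""
        index = 0
        for bit in range(bits - 1, -1, -1):        # most significant bit first
            for c in coords:
                index = (index << 1) | ((c >> bit) & 1)
        return index

    # 2-d grid, 2 bits per dimension: 16 cells mapped onto indices 0..15
    print(z_order((1, 0), bits=2))   # cell (x=01, y=00) -> index 0b0010 = 2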
13Hilbert SFC + Chord
- the 1-dimensional index space is mapped onto the Chord overlay network topology
(Figure: a Chord ring with nodes 0, 4, 8, 11 and 14; data elements with keys 5, 6, 7, 8 are mapped to node 8)
14Query Processing
- translate the keyword query into the relevant clusters of the SFC-based index space
- query the appropriate nodes in the overlay network for the data elements
(Figure: the query's clusters in the SFC index space and the corresponding grid cells, mapped to Chord nodes 0, 4, 8, 11 and 14)
15Query Optimization: query (010, *)
(Figure: the SFC index-space clusters matching the query - (000100), (000111, 001000), (001011), (011000, 011001), (011101, 011110))
16Query Optimization (cont.): query (010, *)
(Figure: the matching clusters embedded in a prefix tree over the SFC index space)
17Query Optimization (cont.)
(Figure: refining the query (010, *) by pruning non-matching nodes from the prefix tree)
18SCRAP [GYG04]
- Use a Z-order or Hilbert space-filling curve to map multi-dimensional data down to a single dimension
- Range-partition the one-dimensional data across the S available nodes
- Use a skip graph to route queries
19MURK: Multi-dimensional Rectangulation with KD-trees
- Basic concept
- Partition the high-dimensional data space into rectangles, each managed by one node
- Partitioning follows a kd-tree: the space is split cyclically along the dimensions, and each leaf of the kd-tree corresponds to one rectangle
20Partitioning
- When a node joins, the space is split along one dimension into two parts of equal load, maintaining load balance (see the split sketch below)
- Each node manages the data in one rectangle, thus preserving data locality
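A minimal sketch of that split step, assuming a node simply tracks its rectangle and the points it stores (hypothetical names, not code from [GYG04]):

    def split_on_join(points, rect, depth):
        """Split a rectangle into two halves of equal load, cycling through
        the dimensions kd-tree style; the joining node takes one half."""
        dim = depth % len(rect)                         # split dimension, chosen cyclically
        points = sorted(points, key=lambda p: p[dim])
        mid = len(points) // 2                          # equal-load (median) split point
        cut = points[mid][dim]
        left  = [(lo, cut) if d == dim else (lo, hi) for d, (lo, hi) in enumerate(rect)]
        right = [(cut, hi) if d == dim else (lo, hi) for d, (lo, hi) in enumerate(rect)]
        return (points[:mid], left), (points[mid:], right)

    old_half, new_half = split_on_join([(0.2, 0.7), (0.5, 0.1), (0.9, 0.4)],
                                       [(0.0, 1.0), (0.0, 1.0)], depth=0)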
21Comparison with CAN
- The kd-tree based partitioning is similar to that of CAN: both place data in a multi-dimensional space and try to keep the load balanced
- The major difference is that in CAN a new node splits the existing node's data space equally, rather than splitting the load equally
22Routing in MURK
- Each node keeps a link to every neighbouring node, i.e. the nodes managing adjacent rectangles
- Routing is greedy over these grid links; the distance between two nodes is the minimum Manhattan distance (see the sketch below)
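A rough sketch of one greedy hop, assuming each node knows a representative point (e.g. the centre) of its neighbours' rectangles (assumed helper names, not the paper's code):

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def greedy_next_hop(current, neighbours, target):
        """Forward the query to the grid neighbour closest (in Manhattan
        distance) to the target point; stop if no neighbour is closer."""
        best = min(neighbours, key=lambda n: manhattan(n, target))
        return best if manhattan(best, target) < manhattan(current, target) else None

    print(greedy_next_hop((0.1, 0.1), [(0.4, 0.1), (0.1, 0.5)], target=(0.8, 0.2)))  # (0.4, 0.1)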
23Optimization for the routing
- Grid links alone are not efficient for routing
- Maintain skip pointers at each node to speed up routing; two ways to choose them:
- Random: choose a node at random from the node set
- Space-filling skip graph: place skip pointers at exponentially increasing distances
24Discussion
- The number of routing neighbours is non-uniform, a consequence of balancing load across nodes
- A dynamic data distribution can cause the per-node load to become unbalanced
25Performance
26Performance (cont.)
27Conclusion
- For locality, MURK far outperforms SCRAP
- For routing cost, SCRAP is efficient enough; in MURK, skip pointers such as the space-filling skip graph are needed for efficient routing
- SCRAP (space-filling curve with range partitioning) is efficient in low dimensions; MURK with a space-filling skip graph performs much better, especially in high dimensions
28pSearch
- Motivation
- An enormous number of documents are spread over the Internet
- How to efficiently find the most closely related documents without returning too many of little interest?
- Problem: documents are randomly distributed with respect to their semantics
- Exhaustive search brings too much overhead
- No deterministic guarantees
29P2P IR techniques
- Unstructured P2P search
- Centralized index suffers from the bottleneck problem
- Flooding-based techniques result in too much overhead
- Heuristic-based algorithms may miss important documents
- Structured P2P search
- DHT-based CAN and Chord are suitable for keyword matching
- Traditional IR techniques
- Advanced IR ranking algorithms could be adopted for P2P search
- Two IR techniques
- Vector space model (VSM)
- Latent semantic indexing (LSI)
30pSearch
- An IR system built on P2P networks
- As efficient and scalable as a DHT
- As accurate as advanced IR algorithms
- Maps the semantic space onto nodes and conducts nearest neighbor search
- uses VSM and LSI to generate the semantic space
- uses CAN to organize the nodes
31VSM and LSI
- VSM
- Documents and queries are expressed as term vectors
- Weight of a term: term frequency x inverse document frequency (TF-IDF)
- Rank by the similarity of document and query, cos(X, Y), where X and Y are the two term vectors
- LSI
- Based on singular value decomposition: transforms high-dimensional term vectors into low-dimensional (l-dimensional) semantic vectors
- Its statistical nature mitigates the effects of synonymy and noise in documents (see the sketch below)
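A compact sketch of both models on a toy term-document matrix (plain NumPy; pSearch itself builds on a full IR engine, so this is only illustrative):

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents (raw term frequencies)
    A = np.array([[2., 0., 1.],
                  [0., 3., 1.],
                  [1., 1., 0.]])

    # VSM: TF-IDF weighting, then cosine similarity between the query and each document
    idf = np.log(A.shape[1] / np.count_nonzero(A, axis=1))
    W = A * idf[:, None]
    q = np.array([1., 0., 1.]) * idf
    cos = (W.T @ q) / (np.linalg.norm(W, axis=0) * np.linalg.norm(q))
    print("VSM ranking:", np.argsort(-cos))

    # LSI: a rank-l SVD projects term vectors into a low-dimensional semantic space
    l = 2
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    doc_sem = (np.diag(S[:l]) @ Vt[:l]).T      # l-dimensional semantic vectors of the documents
    query_sem = q @ U[:, :l]                   # fold the query into the same semantic space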
32pSearch system
(Figure: pSearch system overview - documents (DOC) and queries (QUERY) are turned into semantic vectors and routed through the overlay)
33Advantage of pSearch
- Exhaustive search within a bounded area, which can in the ideal case be fully accurate
- Communication overhead is limited to transferring the query and references to the top documents, independent of corpus size
- A good approximation of the global statistics is sufficient for pSearch
34Challenges
- Dimensionality mismatch between CAN and LSI.
- Uneven distribution of indices.
- Large search region.
35Dimensionality mismatch
- There are not enough nodes (N) in the CAN to partition all l dimensions of the LSI semantic space
- N nodes in CAN can partition only about log(N) low dimensions (the effective dimensionality), leaving the others un-partitioned
36Rolling index
- Motivation
- A small subset of the dimensions contributes most of the similarity
- the low dimensions are the most important ones
- Partition more dimensions of the semantic space by rotating the semantic vectors (see the rotation sketch below)
- Given a semantic vector V = (v0, v1, ..., vl), each rotation shifts the vector by m dimensions; rotated space i uses the vector of the i-th rotation
- Vi = (v_{i*m}, ..., vl, v0, v1, ..., v_{i*m-1})
- m is about 2.3 ln(n)
- Use the rotated vector to route the query and guide the search
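A minimal sketch of the rotation itself, with made-up values for the vector length and for n (assumed here to be the node count):

    import math

    def rotated_vector(v, i, m):
        """i-th rotation: shift the semantic vector left by i*m positions so a
        different block of dimensions becomes the partitioned low dimensions."""
        k = (i * m) % len(v)
        return v[k:] + v[:k]

    n = 10_000                          # assumed number of nodes
    m = round(2.3 * math.log(n))        # dimensions shifted per rotation (about 21 here)
    v = list(range(300))                # a dummy 300-dimensional semantic vector
    print(rotated_vector(v, 1, m)[:5])  # [21, 22, 23, 24, 25]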
37Rolling index
- Uses more storage (p times) to keep the search within a local region
- Selective rotation is expected to process the important higher dimensions efficiently
38Balance index distribution
- Content-aware node bootstrapping
- a joining node randomly selects one of the documents it publishes
- routes itself to where that document's semantic vector lands
- the node there transfers part of its load to the newcomer
- Regions with more indices thus receive more nodes; even though the selection is random, the load stays balanced for a large corpus
39Reducing search space
- Curse of dimensionality
- high-dimensional data is sparsely populated
- in high dimensions, the distance to the nearest neighbor becomes large
- Exploiting data locality, use the indices stored on nodes and recently processed queries to guide new searches
40Content-directed search
(Figure: content-directed search on a grid of CAN nodes around the query q; stored indices and recently processed queries determine which neighbouring nodes to visit next)
41Performance
42Conclusion
- pSearch is a P2P IR system that organizes content around its semantics and achieves good accuracy as the system size, corpus size and number of returned documents grow
- The rolling index resolves the dimensionality mismatch while limiting the space overhead and the number of visited nodes
- Content-aware node bootstrapping balances node load and achieves index and query locality
- Content-directed search reduces the number of nodes searched
43kNN searching in P2P Networks
- Manesh Subhash
- Ni Yuan
- Sun Chong
44Outline
- Introduction to searching in P2P
- Nearest neighbor queries
- Presentation of the ideas in the papers
- 1. A Scalable Nearest Neighbor Search in P2P Systems
- 2. Enhancing P2P File-Sharing with an Internet-Scale Query Processor
45Introduction to searching in P2P
- Exact-match queries
- single-key retrieval
- linear hashing
- CAN, Chord, Pastry, Tapestry
- Similarity-based queries
- metric-space based
- What do we search for?
- rare items, popular items, or both
46Nearest neighbor queries
- The notion of a metric space
- How similar are two objects from a given set of objects?
- Extensible to exact, range and nearest neighbor queries
- Computationally expensive
- The distance function satisfies non-negativity, reflexivity, symmetry and the triangle inequality
47Nearest neighbor queries (Cont)
- A metric space is a pair (D, d)
- D: the domain of objects
- d: the distance function
- Similarity queries (toy example below)
- Range: for a dataset F, a subset of D, a range query retrieves all objects of F whose distance to the query object q is at most r
- Nearest neighbor
- returns the object closest to q; kNN returns the k nearest objects, k <= |F|
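A toy in-memory illustration of the two query types, using Euclidean distance as the metric (any function satisfying the four properties above would do):

    import math

    objects = [(0, 0), (1, 1), (2, 2), (5, 5)]
    q, r, k = (1, 0), 1.5, 2

    range_result = [o for o in objects if math.dist(o, q) <= r]       # range query R(q, r)
    knn_result = sorted(objects, key=lambda o: math.dist(o, q))[:k]   # k nearest neighbors

    print(range_result)   # [(0, 0), (1, 1)]
    print(knn_result)     # [(0, 0), (1, 1)]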
48Scalable NN search
- Uses the GHT structure
- a distributed metric index
- supports range and k-NN queries
- The GHT architecture is composed of peers (nodes) that can insert, store and retrieve objects using similarity queries
- Assumptions: message passing, unique network identifiers, local buckets to store data, and each object is stored in exactly one bucket
49Example of the GHT Network
50Scalable NN search (3)
- Address Search Trees (AST)
- a binary search tree
- inner nodes hold routing information: two pivots plus pointers to the left and right sub-trees
- leaf nodes are pointers to data
- local data is stored in buckets and accessed through a bucket identifier (BID)
- non-local data is identified through a network node identifier (NNID)
- (every AST leaf is one of these two kinds of pointer)
51Scalable NN search (4)
- Searching the AST
- The BPATH
- represents a path through the tree as a string of n binary elements: p = (b1, b2, ..., bn)
- A traversal operator, applied to a query q with a search radius, returns a BPATH
- The operator examines each inner node using the two pivot values and decides which sub-tree(s) to follow (see the sketch below)
- A radius of zero is used for exact matches and during inserts
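A minimal sketch of the per-node pivot test used in generalized-hyperplane trees; the exact GHT rules and tie-breaking are in [ZBG04], so treat this as an approximation:

    def subtrees_to_visit(d_q_p1, d_q_p2, r):
        """Given the distances from the query q to the two pivots and the search
        radius r, decide which sub-trees the query ball can intersect."""
        visit = []
        if d_q_p1 - r <= d_q_p2 + r:     # ball may reach the side closer to pivot p1
            visit.append("left")
        if d_q_p2 - r < d_q_p1 + r:      # ball may reach the side closer to pivot p2
            visit.append("right")
        return visit

    print(subtrees_to_visit(2.0, 5.0, r=0.0))   # ['left']          : exact match follows one path
    print(subtrees_to_visit(2.0, 5.0, r=2.0))   # ['left', 'right'] : a larger radius may follow both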
52Scalable NN search (5)
- k-NN searching in GHT
- a plain range search is not suitable without intrinsic knowledge of the data and of the metric space used
- begin the search at the bucket with a high probability of containing k objects
- if k objects are found, use the distance from q to the k-th object as the radius of a similarity (range) search
- sort the result and pick the first k
- if fewer than k objects are found, we cannot determine an upper bound on the radius needed to reach the k-th neighbor
- the range radius must then be varied
53Scalable NN search (6)
- Finding the k objects using range searches (local sketch below)
- Optimistic
- minimizes distance computations and bucket accesses
- uses as the bounding distance that of the last candidate available in the first accessed bucket
- iteratively expands the radius if fewer than k objects are found
- Pessimistic
- minimizes the probability of needing another iteration
- uses the distance between the pivot values at a level of the AST as the range radius, starting from the parent of the leaf, and executes the range query
- if fewer than k objects are found, it moves up to the next level
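A purely local sketch of the optimistic idea: issue a range query and keep enlarging the radius until at least k objects are found (in GHT these range queries are distributed; the doubling rule here is an assumption):

    import math

    def knn_optimistic(objects, dist, q, k, r0=1.0):
        """Expand the range-query radius until at least k objects are found,
        then keep the k closest ones."""
        r = r0
        while True:
            candidates = [o for o in objects if dist(o, q) <= r]
            if len(candidates) >= k or len(candidates) == len(objects):
                return sorted(candidates, key=lambda o: dist(o, q))[:k]
            r *= 2                      # fewer than k found: grow the radius and retry

    pts = [(0, 0), (3, 4), (6, 8), (20, 0)]
    print(knn_optimistic(pts, math.dist, q=(0, 0), k=3))   # [(0, 0), (3, 4), (6, 8)]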
54Scalable NN search (7)
- Performance evaluation
- With increasing k
- the number of parallel distance computations remains stable
- the number of bucket accesses and the number of messages increase rapidly
- Effect of a growing dataset
- the maximum hop count increases slowly
- nearly constant parallel distance computation costs
- Comparison with range queries
- slightly slower because of the overhead of locating the first bucket
55Scalable NN search (8)
Performance of the scheme on the TXT dataset.
56Scalable NN search (9)
- Conclusion
- a first effort at distributed index structures supporting k-NN search
- GHT is a scalable solution
- future work includes handling updates to the dataset
- and other metric-space partitioning schemes
57Enhanced P2P - PIERSearch (1)
- An Internet-scale query processor
- Queried data follows a Zipfian distribution
- popular items in the head
- a long tail of rare items
- PIERSearch is DHT-based
- It is used in a hybrid system: Gnutella serves popular items, PIERSearch serves rare items
- Integrated with the PIER system
58PIERSearch (2)
- Gnutella query processing
- flooding-based
- simple and works well for popular files
- optimized using
- ultrapeers: nodes that perform query processing on behalf of their leaf nodes
- dynamic querying: re-issuing queries with larger TTLs
- The authors studied the characteristics of the live Gnutella network
59PIERSearch (3)
- Effectiveness of Gnutella
- Query recall: the percentage of the results available in the network that are returned
- Query distinct recall: the percentage of distinct results returned, which removes the effect of replicas
- Experiments show that Gnutella is effective for highly replicated content and for queries with large result sets
- It is ineffective for rare content
- Increasing the TTL does not reduce latency but can improve recall
60PIERSearch (4)
- Searching with PIERSearch
- keyword based
- publishers maintain an inverted file index over the DHT
- two tuples are generated for each item (toy sketch below)
- Item(fileId, filename, filesize, ipAddress, port)
- Inverted(keyword, fileId)
- Uses the underlying PIER system
- a DHT-based Internet-scale relational query processor
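A toy sketch of publishing and querying this inverted index, with the DHT abstracted as a local dictionary keyed by hashed terms (assumed interface, not PIER's actual API):

    import hashlib
    from collections import defaultdict

    dht = defaultdict(list)                     # stand-in for the DHT: key -> list of tuples

    def dht_key(s):
        return hashlib.sha1(s.encode()).hexdigest()

    def publish(file_id, filename, filesize, ip, port):
        dht[dht_key(file_id)].append(("Item", file_id, filename, filesize, ip, port))
        for keyword in filename.lower().split():
            dht[dht_key(keyword)].append(("Inverted", keyword, file_id))

    def keyword_query(keywords):
        """Intersect the posting lists of all query keywords."""
        postings = [{t[2] for t in dht[dht_key(w)]} for w in keywords]
        return set.intersection(*postings) if postings else set()

    publish("f1", "rare live concert", 7_340_032, "10.0.0.1", 6346)
    publish("f2", "concert highlights", 1_048_576, "10.0.0.2", 6346)
    print(keyword_query(["rare", "concert"]))   # {'f1'}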
61PIERSearch (5)
- Hybrid system
- Identification of rare items (heuristic sketch below)
- Query result size
- a result set smaller than a fixed threshold is considered rare
- Term frequency
- items with at least one term whose frequency is below a threshold are considered rare
- Term pair frequency
- less prone to skew when filenames contain popular words
- Sampling
- sample neighboring nodes and compute a lower-bound estimate of the number of replicas
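The first two heuristics reduce to simple threshold tests; a tiny sketch with made-up threshold values (the paper tunes these empirically):

    def is_rare(result_size, term_frequencies,
                size_threshold=20, term_threshold=100):
        """Flag an item/query as rare if its result set is small or if any of
        its terms occurs fewer than term_threshold times in the index."""
        return (result_size < size_threshold or
                min(term_frequencies, default=term_threshold) < term_threshold)

    print(is_rare(3, [5000, 42]))     # True  : small result set, one infrequent term
    print(is_rare(500, [5000, 900]))  # False : popular item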
62PIERSearch (6)
63PIERSearch (7)
- Conclusion
- Gnutella is highly effective for querying popular content, but ineffective for rare items
- Building a partial index over the least-replicated content can improve query recall
64References
- [APJ04] A. Crainiceanu, P. Linga, J. Gehrke and J. Shanmugasundaram. Querying Peer-to-Peer Networks Using P-Trees. In WebDB, 2004.
- [GYG04] P. Ganesan, B. Yang and H. Garcia-Molina. One Torus to Rule Them All: Multi-dimensional Queries in P2P Systems. In WebDB, 2004.
- [SP03] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In HPDC, 2003.
- [STZ04] Y. Shu, K.-L. Tan and A. Zhou. Adapting the Content Native Space for Load Balanced Indexing. In Databases, Information Systems and Peer-to-Peer Computing, 2004.
65References (cont.)
- [LHH04] B. Loo, J. Hellerstein, R. Huebsch, S. Shenker and I. Stoica. Enhancing P2P File-Sharing with an Internet-Scale Query Processor. In VLDB, 2004.
- [TXD03] C. Tang, Z. Xu and S. Dwarkadas. Peer-to-Peer Information Retrieval Using Self-Organizing Semantic Overlay Networks. In SIGCOMM, 2003.
- [ZBG04] P. Zezula, M. Batko and C. Gennaro. A Scalable Nearest Neighbor Search in P2P Systems. In Databases, Information Systems and Peer-to-Peer Computing, 2004.