Title: A Local Search Mechanism for Peer-to-Peer Networks
1A Local Search Mechanism for Peer-to-Peer
Networks
- Vana Kalogeraki, Dimitrios Gunopulos
- Demetris Zeinalipour (University of California,
Riverside)
- csyiazti_at_cs.ucr.edu
CIKM 2002: Eleventh International Conference on
Information and Knowledge Management
November 4-9, McLean, VA
http://www.cs.ucr.edu/csyiazti/publications.html
2Presentation Outline
- Introduction: Information Retrieval (I.R.) in
Peer-to-Peer networks.
- Techniques for Distributed I.R.
- Breadth-First Search.
- Random Breadth-First Search.
- Intelligent Search with profiling.
- Experimental Evaluation.
- Related Work.
- Conclusions and Future Work.
3Introduction to Peer-to-Peer
- Peer-to-Peer computing, a definition:
- Sharing of computer resources and information
through direct exchange.
- Clients (downloaders) are also servers.
- Clients may join or leave the network at any time:
highly fault-tolerant, but at a cost!
- Searches are done within the virtual (overlay) network,
while the actual downloads are done offline, directly
between peers over HTTP.
4Introduction to Peer-to-Peer
- Peer-to-Peer (P2P) systems are becoming
increasingly popular.
- P2P file-sharing systems such as Gnutella,
Napster and Freenet have realized a distributed
infrastructure for sharing files.
- Traditionally, files were shared using the
Client-Server model (e.g. HTTP), which does not scale
because the services are centralized.
- P2P systems offer advantages in simplicity of use,
robustness, self-organization and scalability.
5Information Retrieval in P2P
- Problem:
- How can we efficiently retrieve information in P2P
systems where each node shares a collection of
documents?
- Documents consist of keywords.
- This resembles Information Retrieval, but the resources
are now distributed.
- Primary data structures such as global inverted
indexes can't be maintained efficiently.
6Solutions for P2P Information Retrieval
- 1) Centralized Approaches
- Centralized indexes.
- e.g. Napster, SETI@home
- 2) Purely Distributed Approaches
- Each node has only local knowledge.
- I.R. is done using brute-force mechanisms.
- e.g. Gnutella, FastTrack (KaZaA)
- 3) Hybrid Approaches
- One or more peers hold partial indexes of the
contents of others.
- e.g. LimeWire's Ultrapeers
[Figure: message flow in three diagrams.
Centralized Index: 1) Upload Index, 2) Query/QueryHit, 3) Download (offline).
Second diagram: 1) Connect, 2) Query/QueryHit, 3) Download (offline).
Third diagram: 1) Connect, 2) Intelligent Query/QueryHit, 3) Download (offline).]
7Motivation
- On June 1st we crawled the Gnutella P2P network
for 5 hours with 17 workstations.
- We analyzed 15,153,524 query messages.
- Observation: high locality of specific queries.
- Can we exploit this property for more
efficient searches?
9Techniques for Distributed I.R.
- 1. Breadth-First Search (Gnutella)
- Each Query message is propagated along all
outgoing links of a peer, bounded by a TTL
(time-to-live).
- The TTL is decremented on each forward until it
becomes 0.
- This is the I.R. technique used in P2P systems such as
Gnutella.
- Results?
- The physical network is brought to its knees.
- Long delays for search results.
[Figure: BFS in P2P Network N. Labels: Peer q, Peer d, node A; (1) QUERY, (2) QUERYHIT.]
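The flooding behaviour described above can be sketched in a few lines. This is a minimal simulation, not Gnutella's actual implementation: the function name `bfs_flood`, the toy `overlay` graph and its peer labels are all hypothetical, and the sketch assumes a peer never sends a query back to the neighbor it just received it from.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}                          # peers that already handled the query
    frontier = deque([(source, None, ttl)])  # (peer, who sent it, remaining TTL)
    messages = 0
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue                         # TTL exhausted: stop forwarding
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue                     # assumption: no echo to the sender
            messages += 1                    # one Query message per outgoing link
            if neighbor not in seen:         # duplicates are received but dropped
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

# Hypothetical five-peer overlay (adjacency lists).
overlay = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
reached, sent = bfs_flood(overlay, "q", ttl=3)
print(len(reached), "peers reached with", sent, "messages")
```

Even in this tiny overlay some messages are duplicates that the receiver simply drops, which is the overhead that makes flooding expensive on large networks.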
10Techniques for Distributed I.R.
- 2. Modified Random BFS
- Each Query message is forwarded to only a
fraction of the outgoing links (e.g. ½ of them).
- The TTL is again decremented on each forward until it
becomes 0.
- Results?
- Fewer messages, but possibly fewer results.
- This algorithm is probabilistic.
- Some segments may become
unreachable.
[Figure: Random BFS in P2P Network N. Labels: nodes A, B, C, Peer d; (1) QUERY, (2) QUERYHIT; one segment marked "unreachable".]
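A minimal sketch of the randomized variant, under the same simplifying assumptions as plain flooding (no echo to the sender, duplicates dropped). The names `random_bfs`, `fraction` and the toy `overlay` graph are hypothetical; rounding the subset size up, so that at least one link is tried, is also an assumption.

```python
import math
import random
from collections import deque

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward the query to a random `fraction` of each peer's links."""
    rng = rng or random.Random()
    seen = {source}
    frontier = deque([(source, None, ttl)])
    messages = 0
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue
        candidates = [n for n in graph[peer] if n != sender]
        k = math.ceil(len(candidates) * fraction)   # size of the random subset
        for neighbor in rng.sample(candidates, k):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, peer, remaining - 1))
    return seen, messages

overlay = {
    "q": ["a", "b"],
    "a": ["q", "c"],
    "b": ["q", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
reached, sent = random_bfs(overlay, "q", ttl=3, rng=random.Random(7))
print(sorted(reached), sent)
```

With `fraction=1.0` the sketch degenerates to plain flooding; with smaller fractions the message count drops, but whether a given peer is reached depends on the random choices made along the way.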
11Techniques for Distributed I.R.
- 3. Intelligent Search Mechanism (ISM)
- Idea: each Query message is forwarded
intelligently, based on which queries a peer
answered in the past.
- Components of ISM (for each node u):
- Profile Mechanism: a profile for each neighbor in N(u).
- Peer Ranking Mechanism: ranks peers locally
and sends a search query only to the ones
most likely to answer.
- Similarity Function: finds similar search
queries.
- Search Mechanism: propagates queries based
on the local indexes.
[Figure: ISM. Labels: node A with profiles, Peer d; (1) QUERY, (2) QUERYHIT; "?" marks the forwarding decision.]
12Techniques for Distributed I.R.
- 3. Intelligent Search Mechanism (ISM)
- a) Profile mechanism.
- Maintains a table of the past queries routed through
the host.
- Every time a QueryHit is received, the table is
updated.
- The profile manager uses a Least Recently Used (LRU)
policy to keep the most recent queries in the
repository.
- Profiles are kept for neighbors only, so the cost
of maintaining them is O(Td), where T is the
size limit per profile and d is the degree of the
node.
[Figure: profile table of size Td.]
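A minimal sketch of such a profile table, assuming each neighbor's profile simply stores the text of the queries it answered. The class and method names (`ProfileManager`, `record_query_hit`, `answered_by`) are hypothetical, and a real profile entry would likely carry more fields than the query text.

```python
from collections import OrderedDict

class ProfileManager:
    """Per-neighbor LRU tables of answered queries (at most T entries each)."""

    def __init__(self, capacity):
        self.capacity = capacity     # T: size limit per profile
        self.profiles = {}           # neighbor -> OrderedDict of queries

    def record_query_hit(self, neighbor, query):
        """Called when a QueryHit for `query` arrives through `neighbor`."""
        profile = self.profiles.setdefault(neighbor, OrderedDict())
        profile[query] = True
        profile.move_to_end(query)           # mark as most recently used
        while len(profile) > self.capacity:
            profile.popitem(last=False)      # evict the least recently used

    def answered_by(self, neighbor):
        """Queries remembered for `neighbor`, oldest first."""
        return list(self.profiles.get(neighbor, ()))

manager = ProfileManager(capacity=2)
for q in ["italy disaster", "super bowl", "italy disaster", "elections"]:
    manager.record_query_hit("n1", q)
print(manager.answered_by("n1"))
```

With d neighbors and at most T entries per neighbor, the table never holds more than T·d entries, which matches the O(Td) bound on the slide.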
13Techniques for Distributed I.R.
- 3. Intelligent Search Mechanism (ISM)
- b) Peer Ranking Mechanism.
- Before forwarding a Query message, a peer performs
an on-the-fly ranking of its peers to determine
the best paths.
- Peers are ranked by the Aggregate Similarity of peer Pi to a
query q, computed locally by peer Pk.
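One way to write the aggregate similarity is as a sum, over the past queries q_j that Pk's profile records Pi as having answered, of their similarity to the new query q. The notation below (Psim, Qsim, and an exponent α that gives extra weight to the most similar past queries) is an assumption drawn from the accompanying paper rather than from this slide:

```latex
\mathrm{Psim}_{P_k}(P_i, q) \;=\; \sum_{j \,\in\, \{\text{queries answered by } P_i\}} \mathrm{Qsim}(q_j, q)^{\alpha}
```

A peer that answered many queries similar to q therefore receives a high score and is ranked ahead of peers with unrelated profiles.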
14Techniques for Distributed I.R.
- 3. Intelligent Search Mechanism (ISM)
- c) Similarity Function: the cosine similarity.
- Assume that L is the set of all words (in the Profile
Manager),
- e.g. L = {elections, bush, clinton, super, bowl,
san, diego, ..., italy, earthquake, disaster}.
- We define an |L|-dimensional space in which each
query is a vector.
- If q = "italy disaster", then the vector of q is
[0, 0, 0, ..., 1, 0, 1].
- Recall that we have a vector for each past query qi stored
in the Profile Manager (i.e. the vector of qi).
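The cosine similarity of two query vectors is their dot product divided by the product of their lengths. A minimal sketch follows; the helper names `to_vector` and `cosine_similarity` are hypothetical, and the vocabulary keeps only the words that are spelled out in the example above (the elided ones are left out).

```python
import math

def to_vector(query, vocabulary):
    """Binary vector: 1 where a vocabulary word occurs in the query."""
    words = set(query.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); defined as 0 for an all-zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["elections", "bush", "clinton", "super", "bowl",
              "san", "diego", "italy", "earthquake", "disaster"]

q = to_vector("italy disaster", vocabulary)
q_past = to_vector("italy earthquake", vocabulary)
print(q)                              # vector of the new query
print(cosine_similarity(q, q_past))   # about 0.5: one shared word out of two
```

Identical queries score 1, queries with no word in common score 0, and partial overlaps fall in between.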
15Techniques for Distributed I.R.
- 3. Intelligent Search Mechanism (ISM)
- d) Search Mechanism
- Utilizes the Peer Ranking Mechanism to forward
queries to the nodes most likely to hold
the information we are looking for.
[Figure: ISM search. Labels: Peer d, profiles, (1) QUERY; "?" marks candidate links.]
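Putting the pieces together, the forwarding decision can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `profiles` dictionary and the exponent `alpha` are hypothetical, and the similarity is the cosine of binary word vectors computed directly on word sets. The extra random neighbor mirrors the setup used in the experimental evaluation.

```python
import math
import random

def qsim(past_query, query):
    """Cosine similarity of two queries viewed as binary word vectors."""
    a, b = set(past_query.split()), set(query.split())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def rank_neighbors(profiles, query, alpha=1.0):
    """Order neighbors by aggregate similarity of their answered queries."""
    scores = {
        peer: sum(qsim(past, query) ** alpha for past in answered)
        for peer, answered in profiles.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def choose_targets(profiles, query, m, rng=None, alpha=1.0):
    """Pick the m best-ranked neighbors plus one random other neighbor."""
    rng = rng or random.Random()
    ranked = rank_neighbors(profiles, query, alpha)
    targets, rest = ranked[:m], ranked[m:]
    if rest:
        targets.append(rng.choice(rest))   # keeps some exploration alive
    return targets

# Hypothetical profiles: neighbor -> queries it is known to have answered.
profiles = {
    "A": ["italy earthquake", "italy disaster"],
    "B": ["super bowl"],
    "C": ["elections bush"],
    "D": ["san diego earthquake"],
}
print(choose_targets(profiles, "italy disaster", m=1, rng=random.Random(1)))
```

The random pick is a hedge against stale or empty profiles: without it, a peer that has never been asked would never get the chance to build up a profile.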
17Experimental Evaluation
- We use a decentralized newspaper application
built on top of the REUTERS dataset (22,531
documents grouped by country, 84 countries in total).
- Random network of 100 peers.
- Each peer has documents from 3 countries.
- The average degree of a node is 7 (≈ log2 100)
(connected graph).
18Experimental Evaluation
- We perform 400 sequential queries with a delay of
4 sec.
- We compare Doc. Ratio (recall rate) vs. number of
messages for:
- BFS (Gnutella message flooding): forward to
all `degree` neighbors.
- Modified BFS: randomly forward to degree/2
nodes.
- Intelligent Search Mechanism:
forward to the M = 3 highest-ranked nodes + 1 random.
19Experimental Evaluation
- We measure Doc. Ratio (recall rate) vs. number of
messages with Time-to-Live (TTL) = 4.
- BFS (Gnutella) uses 763 messages w/ recall rate
100%.
- Random BFS (degree/2) uses 120 (16%) msgs w/
recall rate 42%.
- Intelligent Search uses 131 (17%) msgs w/ recall
rate 55%.
- Recall rate improves over time with Intelligent
Search, since the peer profiles accumulate knowledge.
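The figures in parentheses are each method's message count relative to BFS. A quick check of that arithmetic, using only the counts reported above (the helper name `share_of_bfs` is hypothetical):

```python
def share_of_bfs(messages, bfs_messages=763):
    """Messages used, as a rounded percentage of the BFS message count."""
    return round(100 * messages / bfs_messages)

for name, messages in [("Random BFS", 120), ("Intelligent Search", 131)]:
    print(f"{name}: {messages} msgs = {share_of_bfs(messages)}% of BFS")
```

Both methods cut traffic to roughly a sixth of flooding; at this TTL the difference between them lies in the recall they achieve for that budget.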
20Experimental Evaluation
- We again measure Doc. Ratio (recall rate) vs.
number of messages, this time increasing Time-to-Live to
TTL = 5.
- BFS (TTL = 4) uses 763 messages w/ recall rate
100%.
- Random BFS (degree/2) uses 28% of BFS msgs w/ recall
rate 72%.
- Intelligent Search uses 35% of BFS msgs w/
recall rate 90%!
- With BFS, a large number of peers receive unnecessary
messages.
- We get almost identical recall (90%) with only
35% of the msgs.
22Related Work
- Improving Search in P2P, B. Yang et al. (Stanford)
- Iterative Deepening: the search depth is increased
until Z results are returned.
- Directed BFS: based on aggregate statistics (e.g.
number of results a peer returned, shortest queue,
most data forwarded).
- Local Indexes: each node maintains an index over
the data of peers within r hops.
- Routing Indices for P2P, Crespo et al. (Stanford)
- Compound Indices: each node sends a clustered
summary of its topics to its neighbors (e.g. 100
databases, 4 theory, 10 OS).
- Might be too costly for highly dynamic P2P
systems.
23Related Work
- Freenet (Clarke et al.): search by identifiers.
- Uses SHA-1 hashes of resources; information is
retrieved based on key closeness, in a DFS
manner.
- Others, such as Chord.
- Systems that focus on scalable object location,
which becomes feasible by hashing and
distributing objects across the P2P system (searches
are by identifier).
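Search by identifier can be illustrated with a small sketch in the style of Chord's successor rule: an object name is hashed to a point on an identifier ring and stored at the first node whose identifier is equal to or follows it. The function names, the 16-bit ring and the node identifiers are hypothetical, and the sketch ignores routing entirely (no finger tables).

```python
import hashlib
from bisect import bisect_left

def identifier(name, bits=16):
    """Map a name to a point on a 2**bits identifier ring using SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)

def responsible_node(key_id, node_ids):
    """First node whose id is >= key_id, wrapping around the ring."""
    ring = sorted(node_ids)
    index = bisect_left(ring, key_id)
    return ring[index % len(ring)]

nodes = [5, 20, 40]   # hypothetical node identifiers
print(responsible_node(10, nodes))                      # falls between 5 and 20
print(responsible_node(identifier("song.mp3"), nodes))  # depends on the hash
```

Because placement depends only on the hash of the exact name, lookups are cheap, but keyword queries such as "italy disaster" cannot be answered this way, which is where the search techniques in this talk come in.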
24Conclusions
- P2P systems offer several advantages, such as
scalability, robustness and simplicity of use.
- Efficient P2P Information Retrieval is not
feasible with the current search algorithms.
- We propose an Intelligent Search Mechanism that
uses local knowledge to improve Information
Retrieval in P2P.
- Our mechanism achieves a 90% recall rate while
using only 35% of the messages that BFS requires.
25Future Work
- We plan to deploy our middleware infrastructure
on a larger P2P network with more queries.
- We want to probe different network topologies,
such as ASMap with power laws.
- We want to probe different peer-profile
maintenance policies at peers.
- We want to compare the performance of our method with
other proposed algorithms (iterative
deepening, local indexes, etc.).