Slides: 44
Provided by: demetriosz
Learn more at: http://alumni.cs.ucr.edu
1
Information Retrieval in Peer-to-Peer Systems
Dept. of Computer Science & Engineering @
University of California - Riverside
  • Demetrios Zeinalipour-Yazti

M.Sc. Thesis Defense: Monday, May 5, 2003, Surge
349, 12:00-1:00 PM
Thesis Committee: Dr. Dimitrios Gunopulos
(Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V.
Ravishankar
http://www.cs.ucr.edu/csyiazti/msc.html
2
Presentation Outline
  • Introduction & Motivation.
  • Search Techniques for P2P systems
  • The Intelligent Search Mechanism
  • PeerWare Simulation Infrastructure
  • Experimental Evaluation.
  • Conclusions & Future Work.

3
Introduction to Peer-to-Peer
  • Peer-to-Peer Computing: definition
  • Sharing of computer resources and information
    through direct exchange
  • Clients (downloaders) are also servers
  • Clients may join or leave the network at any time
    => highly fault-tolerant, but at a cost!
  • Searches are done within the virtual network
    while actual downloads are done offline (with
    HTTP).

4
Introduction to Peer-to-Peer
  • Peer-to-Peer (P2P) systems are becoming
    increasingly popular.
  • P2P file-sharing systems, such as Gnutella,
    Napster and Freenet, realized a distributed
    infrastructure for sharing files.
  • Traditionally, files were shared using the
    Client-Server model (e.g. HTTP), which does not
    scale well because the service is centralized.
  • P2P systems offer advantages in simplicity of use,
    robustness, self-organization and scalability.

5
Information Retrieval in P2P
  • Problem:
  • How to efficiently retrieve information in P2P
    systems where each node shares a collection of
    documents?
  • Documents consist of keywords.
  • Resembles Information Retrieval, but the resources
    are now distributed.
  • Primary data structures such as global inverted
    indexes cannot be maintained efficiently.

6
Solutions for P2P Information Retrieval
  • 1) Centralized Approaches
  • Centralized Indexes
  • e.g. Napster, SETI@home
  • 2) Purely Distributed Approaches
  • Each node has only local knowledge.
  • IR is done using brute-force mechanisms
  • e.g. Gnutella, FastTrack (Kazaa)
  • 3) Hybrid Approaches
  • One or more peers have partial indexes of the
    contents of others.
  • e.g. Limewire's Ultrapeers

[Figure: three architecture diagrams.
 Centralized Index: 1) Upload Index, 2) Query/QueryHit, 3) Download (offline).
 Second diagram: 1) Connect, 2) Query/QueryHit, 3) Download (offline).
 Third diagram: 1) Connect, 2) Intelligent Query/QueryHit, 3) Download (offline).]
7
Motivation
  • On 1st June we crawled the Gnutella P2P Network
    for 5 hours with 17 workstations.
  • We analyzed 15,153,524 query messages.
  • Observation: high locality of specific queries.
  • Can we exploit this property for more
    efficient searches?

8
Presentation Outline
  • Introduction & Motivation.
  • Search Techniques for P2P systems
  • The Intelligent Search Mechanism
  • PeerWare Simulation Infrastructure
  • Experimental Evaluation.
  • Conclusions & Future Work.

9
Search Techniques for P2P systems
  • Breadth-First Search (Gnutella)
  • Idea: each Query Message is propagated along all
    outgoing links of a peer, bounded by a TTL
    (time-to-live).
  • The TTL is decremented on each forward until it
    becomes 0.
  • This is the technique used for IR in P2P systems
    such as Gnutella.
  • Highlights:
  • Flooding overloads the underlying physical network.
  • Long delays for search results.

[Figure: BFS in P2P network N. Labels: peer q, node A, peer d; (1) QUERY, (2) QUERYHIT.]
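
The TTL-bounded flooding described on this slide can be sketched in Python as follows. This is a minimal illustration, not Gnutella's actual implementation: the adjacency-dict graph, the name `bfs_flood`, and the message counting are assumptions made for the example, and for simplicity a peer also sends the query back over the link it arrived on.

```python
from collections import deque

def bfs_flood(graph, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent)."""
    seen = {source}                     # peers that already handled this query
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue                    # TTL exhausted: do not forward further
        for neighbor in graph[node]:
            messages += 1               # one Query message per outgoing link
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

# Hypothetical four-peer overlay: q is the querying peer, d holds the data.
graph = {"q": ["a", "b"], "a": ["q", "d"], "b": ["q", "d"], "d": ["a", "b"]}
reached, sent = bfs_flood(graph, "q", ttl=2)
```

Even in this tiny overlay, reaching three other peers costs six messages, which is the redundancy the slide's highlights refer to.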
10
Search Techniques for P2P systems
  • Modified Random BFS
  • V. Kalogeraki, D. Gunopulos, D.
    Zeinalipour-Yazti, CIKM 2002
  • Idea: each Query Message is forwarded to only a
    fraction of the outgoing links (e.g. ½ of them).
  • The TTL is again decremented on each forward until
    it becomes 0.
  • Highlights:
  • Fewer messages, but possibly fewer results.
  • This algorithm is probabilistic.
  • Some segments may become unreachable.

[Figure: Random BFS in P2P network N. Labels: nodes A, B, C, peer d; (1) QUERY, (2) QUERYHIT; one segment marked "unreachable".]
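
A hedged Python sketch of the random-subset forwarding idea follows. The function name `random_bfs`, the `fraction` parameter, and the rule of always forwarding to at least one neighbor are assumptions for illustration; the slide only states that a fraction (e.g. half) of the outgoing links is used.

```python
import random
from collections import deque

def random_bfs(graph, source, ttl, fraction=0.5, rng=None):
    """Forward the query to a random fraction of each peer's neighbors."""
    rng = rng or random.Random()
    seen = {source}
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue
        neighbors = graph[node]
        # Assumed rule: at least one neighbor is always chosen.
        k = max(1, int(len(neighbors) * fraction)) if neighbors else 0
        for neighbor in rng.sample(neighbors, k):
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return seen, messages

graph = {"q": ["a", "b"], "a": ["q", "d"], "b": ["q", "d"], "d": ["a", "b"]}
reached, sent = random_bfs(graph, "q", ttl=2, fraction=0.5, rng=random.Random(0))
```

With `fraction=1.0` the sketch degenerates to plain flooding; with smaller fractions some peers may never see the query, which is the "unreachable segments" caveat above.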
11
Search Techniques for P2P systems
  • Searching Using Random Walkers
  • Q. Lv, P. Cao, E. Cohen, K. Li, and S.
    Shenker, ICS 2002
  • Idea: each Query Message is forwarded to only 1
    neighbor.
  • With k walkers, after T steps we reach the same
    nodes as 1 walker after kT steps. (They use 16-64
    walkers.)
  • Highlights:
  • Network traffic is reduced (compared to BFS) by 2
    orders of magnitude.
  • Increases the user-perceived delay (from 2-6 hops
    to 4-15 hops).
  • This algorithm is probabilistic, and the
    likelihood of locating the objects depends on the
    network topology.
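
The walker idea can be illustrated with a short Python sketch. The names `random_walk_search` and `has_object`, the lock-step movement of walkers, and the stop-on-first-hit rule are assumptions for the example rather than details taken from the cited paper.

```python
import random

def random_walk_search(graph, source, has_object, walkers=16, max_steps=15, rng=None):
    """Move `walkers` independent walkers one hop per step.

    Returns (step at which the object was found or None, messages sent).
    """
    rng = rng or random.Random()
    positions = [source] * walkers
    messages = 0
    for step in range(1, max_steps + 1):
        for i, node in enumerate(positions):
            nxt = rng.choice(graph[node])   # each walker picks ONE neighbor
            messages += 1
            if has_object(nxt):
                return step, messages
            positions[i] = nxt
    return None, messages

# Hypothetical directed ring, so the walk is deterministic: q -> a -> d -> q
ring = {"q": ["a"], "a": ["d"], "d": ["q"]}
found_at, sent = random_walk_search(ring, "q", lambda n: n == "d", walkers=1)
```

The message count grows with walkers times steps rather than with the number of links, which is where the traffic saving over flooding comes from.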

12
Search Techniques for P2P systems
  • 4. Using Randomized Gossiping to Replicate Global
    State. F. M. Cuenca-Acuna, Thu D. Nguyen, HPDC-12
  • Idea: PlanetP uses Bloom filters to propagate
    summary indexes of the contents of a peer.
  • Bloom filters are used for membership queries.
  • Highlights:
  • Not scalable (the technique works well
    for <10,000 nodes).
  • No data replication required.
  • False positives are a function of m, n, k
    (filter bits, inserted items, hash functions)
    and can be kept small.

[Figure: An 8-bit Bloom filter w/ 4 hash functions. For a document d1 in D = {d1, d2, ..., dn}, the hashes h1(d1) ... h4(d1) select bits in the m-bit array (positions 000-111), which are set to 1; a query "d1?" checks the same bits.]
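
A minimal Bloom filter sketch in Python, to make the membership test concrete. It is not PlanetP's code: deriving the k hash functions by salting SHA-256, the default sizes, and the helper name `false_positive_rate` are assumptions for illustration. The rate uses the standard approximation (1 - e^(-kn/m))^k.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m=64, k=4):
        self.m, self.k = m, k            # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Assumed construction: k hashes obtained by salting one digest.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))

def false_positive_rate(m, n, k):
    """Approximate false-positive probability after n insertions."""
    return (1 - math.exp(-k * n / m)) ** k

summary = BloomFilter()
summary.add("italy")
```

A peer can gossip `summary.bits` instead of its full keyword list; other peers then test `"italy" in summary` before deciding whether to contact it.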
13
Search Techniques for P2P systems
  • 5. Searching using Local Indices. Arturo Crespo
    and Hector Garcia-Molina, ICDCS 2002.
  • Idea: create indices which contain statistics
    that reveal the direction towards the
    documents.
  • Types of proposed indices:
  • Compound Routing Index (CRI): metric = number of
    documents.
  • Hop-Count Routing Index (HRI): maintain a CRI for
    k hops.
  • Exponentially Aggregated Index (ERI): apply a
    cost formula on the HRI to shrink the HRI's size.
  • Highlights:
  • Not scalable; expensive routing updates, but
    better than replicating data indexes.
  • Assumes a static environment, but no data
    replication is required.

14
Search Techniques for P2P systems
  • 6. Directed BFS and the >RES Heuristic (1/2).
    Beverly Yang and Hector Garcia-Molina, ICDCS
    2002.
  • Proposed techniques:
  • Directed BFS based on aggregate statistics (e.g.
    number of results a peer returned, shortest queue,
    forwarded the most data).
  • Iterative Deepening, until Z results are returned.
  • Local Indexes: each node maintains the actual
    index over the data of peers r hops away.
  • Their experiments deploy the Directed BFS
    techniques by attaching nodes to the Gnutella
    network.
  • The >RES Heuristic is shown to work well.

15
Search Techniques for P2P systems
  • Directed BFS and the >RES Heuristic (2/2)
  • The >RES Heuristic is optimized to find Z
    documents efficiently for some user-defined Z.
  • >RES works well because:
  • It captures stable/large network segments.
  • It favors potentially less overloaded peers.
  • >RES is a quantitative approach.
  • Drawback: >RES doesn't route queries to the most
    relevant content.
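
A small Python sketch of the neighbor selection behind this heuristic, assuming it ranks neighbors purely by how many results each returned for past queries. The function name `top_by_results` and the dictionary layout are hypothetical.

```python
def top_by_results(results_returned, fanout):
    """Pick the `fanout` neighbors that returned the most results so far."""
    ranked = sorted(results_returned, key=results_returned.get, reverse=True)
    return ranked[:fanout]

# Hypothetical per-neighbor counters kept by one peer.
history = {"a": 3, "b": 10, "c": 0, "d": 7}
targets = top_by_results(history, fanout=2)
```

Because the ranking ignores what the query is about, a neighbor with many past results is chosen even when those results were for unrelated topics, which is the drawback noted above.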

16
Search Techniques for P2P systems
  • 7. Depth-First-Search and Freenet
  • I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong,
    LNCS vol. 2009
  • Idea: objects are hashed, and the hash of a
    query is routed by key closeness in a DFS
    manner.
  • Highlights:
  • Uses caching of key/object for future requests.
  • Data replication along the QueryHit path provides
    availability.
  • Anonymity of searcher and publisher.
  • Drawbacks: i) Searches ONLY based on the object
    identifier.
  • ii) The user-perceived delay is high.

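[Figure: Freenet search. Peer q issues "Search A"; (1) the QUERY carries h(A) through nodes B and C; (2) the result returns from the holder of the original fileA, and fileA is replicated on the return path.]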
17
Search Techniques for P2P systems
  • 8. Consistent Hashing and Chord
  • Ion Stoica et al., SIGCOMM 2001
  • Idea: objects and nodes are hashed to m-bit
    identifiers and organized in a virtual ring.
    Object lookup is achieved in O(log N) messages.
  • Highlights:
  • Consistent Hashing achieves (i) good load
    balancing of keys and (ii) little object/key
    movement when a node joins or leaves.
  • Drawbacks: i) Searches ONLY based on the object
    identifier.
  • ii) Data movement may be a big overhead.
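
The ring placement can be sketched in Python as follows. This shows only the consistent-hashing rule (a key is stored at the first node clockwise from its identifier); it does not implement the finger-table routing that gives the O(log N) bound. The names `chord_id` and `successor` and the 16-bit identifier size are assumptions for the example.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier bits; assumed small here for readability

def chord_id(name, m=M):
    """Hash a node address or object name to an m-bit ring identifier."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(node_ids, key_id):
    """First node clockwise from key_id on the ring, wrapping around."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]

nodes = [10, 200, 4000]            # hypothetical node identifiers
owner = successor(nodes, 150)      # the key with id 150 lives on node 200
```

When a node joins or leaves, only the keys between it and its predecessor change owner, which is the "little key movement" property listed above.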

18
Presentation Outline
  • Introduction & Motivation.
  • Search Techniques for P2P systems
  • The Intelligent Search Mechanism
  • PeerWare Simulation Infrastructure
  • Experimental Evaluation.
  • Conclusions & Future Work.

19
Intelligent Search Mechanism ISM
  • Introduction
  • Idea: each Query Message is forwarded
    intelligently, based on which queries a peer
    answered in the past.
  • Components of ISM (for each node u):
  • Profile Mechanism, kept for each neighbor in N(u).
  • Peer Ranking Mechanism, for ranking peers locally
    and sending a search query only to the ones
    most likely to answer.
  • Similarity Function, for finding similar search
    queries.
  • Search Mechanism, for propagating queries based
    on local indexes.

[Figure: ISM. Node A consults its neighbor profiles to choose where to forward the (1) QUERY; peer d returns a (2) QUERYHIT.]
20
Intelligent Search Mechanism ISM
  • Components of ISM
  • a) Profile mechanism.
  • Maintains a list of past queries routed through
    that host.
  • Every time a QueryHit is received, the table is
    updated.
  • The profile manager uses a Least Recently Used
    policy to keep the most recent queries in the
    repository.
  • Profiles are kept for neighbors only, so the cost
    of maintaining them is O(Td), where T is the
    size limit per profile and d is the degree of the
    node.

[Figure: profile table of total size Td.]
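
One way to picture a per-neighbor profile with Least Recently Used eviction is the Python sketch below. The class name `NeighborProfile`, the hit counter per query, and the use of `OrderedDict` are assumptions for illustration, not the thesis implementation.

```python
from collections import OrderedDict

class NeighborProfile:
    """Past queries answered through one neighbor, capped at `capacity` (T)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queries = OrderedDict()     # query text -> number of QueryHits

    def record_hit(self, query, hits=1):
        # Called whenever a QueryHit for `query` arrives via this neighbor.
        self.queries[query] = self.queries.get(query, 0) + hits
        self.queries.move_to_end(query)          # mark as most recently used
        if len(self.queries) > self.capacity:
            self.queries.popitem(last=False)     # evict least recently used

profile = NeighborProfile(capacity=2)
for q in ["italy disaster", "super bowl", "italy disaster", "san diego"]:
    profile.record_hit(q)
```

A node with d neighbors keeps d such profiles of at most T entries each, matching the O(Td) space bound on the slide.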

21
Intelligent Search Mechanism ISM
  • Components of ISM
  • b) The RelevanceRank Peer Ranking Metric.
  • Before forwarding a Query Message, a peer performs
    an on-the-fly ranking of its peers to determine
    the best paths.
  • We use the Aggregate Weighted Similarity of peer
    Pi to a query q, computed by a peer Pl.

[Equation: the RelevanceRank formula was shown as an image and is not preserved in this transcript.]
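
The transcript does not preserve the ranking formula itself. As a sketch under that caveat, the Python below assumes the aggregate is the sum, over the past queries that neighbor Pi answered, of the similarity to the new query raised to a power `alpha`. The function names, the `alpha` parameter, and the word-overlap stand-in for the similarity function are all assumptions.

```python
def word_overlap(a, b):
    """Stand-in similarity (an assumption): shared words / all words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def relevance_rank(answered_queries, query, qsim=word_overlap, alpha=1.0):
    """Assumed aggregate: sum of qsim(past, query) ** alpha over Pi's profile."""
    return sum(qsim(past, query) ** alpha for past in answered_queries)

# Hypothetical profiles that peer Pl keeps for two neighbors.
rank_p1 = relevance_rank(["italy disaster", "super bowl"], "italy disaster")
rank_p2 = relevance_rank(["italy earthquake"], "italy disaster")
```

Under this assumption, a larger `alpha` gives more weight to the most similar past queries, so a neighbor with one near-exact match outranks one with many weak matches.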
22
Intelligent Search Mechanism ISM
  • Components of ISM
  • c) Similarity Function: the cosine similarity.
  • Assume that L is the set of all words (in the
    Profile Manager),
  • e.g. L = {elections, bush, clinton, super, bowl,
    san, diego, ..., italy, earthquake, disaster}.
  • We define an |L|-dimensional space where each
    query is a vector.
  • If q = "italy disaster" => q (vector of q) =
    (0,0,0,...,1,0,1).
  • Recall that we have a vector for each qi stored
    in the Profile Manager (i.e. qi).
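
The cosine similarity between two keyword queries can be computed as in this Python sketch. It uses sparse word counts instead of materializing the full |L|-dimensional vectors, which gives the same value; the function name and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def cosine_similarity(q1, q2):
    """Cosine of the angle between the word-count vectors of two queries."""
    v1, v2 = Counter(q1.split()), Counter(q2.split())
    dot = sum(v1[w] * v2[w] for w in v1)        # words missing in v2 count 0
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

same = cosine_similarity("italy disaster", "italy disaster")
partial = cosine_similarity("italy disaster", "italy earthquake")
unrelated = cosine_similarity("super bowl", "italy")
```

Identical queries score approximately 1, queries sharing one of two words score approximately 0.5, and queries with no common word score 0.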

23
Intelligent Search Mechanism ISM
  • Components of ISM
  • d) Search Mechanism
  • Utilizes the Peer Ranking Mechanism to forward
    queries to the nodes most likely to hold
    the information we are looking for.

[Figure: ISM search. A (1) QUERY is forwarded using neighbor profiles towards peer d; links not selected are marked "?".]
24
Intelligent Search Mechanism ISM
  • Breaking cycles with Random Perturbation
  • Suppose that nodes answer only conjunctions of
    the query terms.
  • Suppose that query q has no answer from A, B, C
    or D,
  • and that one of them answered a similar query in
    the past.
  • => Query q fails to explore the segment through E.
  • Random Perturbation adds one additional random
    message.
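
The forwarding rule with random perturbation can be sketched in Python as below: take the m highest-ranked neighbors and add one neighbor chosen at random from the rest. The function name `choose_targets` and the rank dictionary are hypothetical.

```python
import random

def choose_targets(ranks, m, rng=None):
    """Top-m neighbors by rank, plus one random neighbor from the remainder."""
    rng = rng or random.Random()
    ranked = sorted(ranks, key=ranks.get, reverse=True)
    best, rest = ranked[:m], ranked[m:]
    if rest:
        best.append(rng.choice(rest))   # the extra message that breaks cycles
    return best

# Hypothetical ranks at one peer; E never answered anything similar.
ranks = {"A": 0.9, "B": 0.8, "C": 0.1, "E": 0.0}
targets = choose_targets(ranks, m=2, rng=random.Random(1))
```

Without the random pick, E would never be tried while A and B keep the highest ranks; with it, the segment behind E is explored eventually.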

25
Presentation Outline
  • Introduction & Motivation.
  • Search Techniques for P2P systems
  • The Intelligent Search Mechanism
  • PeerWare Simulation Infrastructure
  • Experimental Evaluation.
  • Conclusions & Future Work.

26
PeerWare Simulation Infrastructure
  • Introduction
  • PeerWare is our distributed middleware
    infrastructure that allows us to benchmark
    various Query Routing Algorithms.
  • It is deployed on a network of 50 workstations
  • It uses Public/Private Keys and SSH to connect to
    the networked hosts.
  • It is implemented in Java and consists of
    approximately 10,000 lines of code.

27
PeerWare Simulation Infrastructure
  • Why real middleware and not simulations?
  • Many real-world effects, such as network failures
    and dropped queries, may reveal interesting and
    unknown patterns.
  • In a real middleware we are able to measure the
    actual time to satisfy queries.
  • Finally, there are no assumptions (network delays
    etc.) of the kind that are typical in simulation
    environments.
  • The Anthill Project (Univ. of Bologna) uses a
    similar approach to investigate properties of the
    Freenet algorithm.

28
PeerWare Simulation Infrastructure
  • PeerWare Components
  • dataGen: the Dataset Generator
  • graphGen: the Network Graph Generator
  • dataPeer: the Data Node
  • searchPeer: the Search Node
  • Other administrative components:
  • netLaucher: shell script that launches the network
  • netStats: shell script that provides statistics
  • graphPlot: shell script that plots graphs based
    on generated results.

29
PeerWare Simulation Infrastructure
  • 1) dataGen Component
  • dataGen is the Dataset Generator, which generates
    sets of documents about specific topics
  • (each peer can have some specialized knowledge).
  • It uses the REUTERS News Agency dataset (22,531
    documents).
  • It groups documents by various properties:
  • Date, Topics, Places, People, Orgs, Companies
  • In our experiments we use the Places attribute
    and generate document sets for 104 countries.

30
PeerWare Simulation Infrastructure
  • 2) graphGen Component
  • graphGen is the topology generator.
  • Currently it generates random topologies given
    parameters such as degree, IPs, ports.
  • It uses graphViz to generate visualizations of
    the generated topologies.

31
PeerWare Simulation Infrastructure
  • 3) dataPeer Component
  • dataPeer is a P2P client that maintains an XML
    repository of documents.
  • It uses the PDOM-XQL engine to query its
    documents.
  • It pre-establishes persistent TCP connections
    to other peers.

32
PeerWare Simulation Infrastructure
  • 4) searchPeer Component
  • searchPeer is a P2P client that connects to a
    PeerWare Network and performs unstructured
    queries.
  • Keywords are sampled from within the dataset
  • It logs statistics such as query response time,
    the number of nodes that answered a query, etc.

33
Presentation Outline
  • Introduction & Motivation.
  • Search Techniques for P2P systems
  • The Intelligent Search Mechanism
  • PeerWare Simulation Infrastructure
  • Experimental Evaluation.
  • Conclusions & Future Work.

34
Experimental Evaluation
  • Introduction
  • We create a distributed Newspaper application
  • We use a Random Network of 104 peers
  • Each peer has documents for 1 country
  • The average degree of a node is 7 ≈ log2(100)
    (connected graph)
  • We perform two series of experiments:
  • 10x10 sequential queries with a delay of 4 sec.
  • 400 random queries with a delay of 4 sec.
  • We compare Doc. Ratio (Recall Rate) vs. Num. of
    messages for:
  • BFS (Gnutella Message Flooding) (forward to
    degree nodes).
  • Random BFS (randomly forward to degree/2 nodes).
  • Intelligent Search Mechanism (forward to the
    M = (degree/2) - 1 highest-RelevanceRank nodes
    + 1 random).
  • >RES Heuristic (forward to the degree/2 nodes
    ranked highest by >RES)

35
Experimental Evaluation
  • Reducing Query Messages (10x10 Experiment)
  • Recall Rate vs. Num. of messages with TTL = 4
  • BFS uses 1050 messages w/ recall rate 100%
  • RBFS uses 220 (20%) msgs w/ recall rate 50%
  • >RES uses 400 (38%) msgs w/ recall rate 70%
  • ISM uses 400 (38%) msgs w/ recall rate 90%
  • ISM improves over time since Peer Profiles gain
    more knowledge.
  • ISM and >RES start out slow since they use RBFS
    until they populate their routing structures.

36
Experimental Evaluation
  • Digging Deeper by Increasing the TTL (10x10)
  • Recall Rate vs. Num. of messages with TTL = 5
  • BFS again uses 1050 messages w/ recall rate 100%
  • RBFS uses 450 (43%) msgs w/ recall rate 82%
  • >RES uses 570 (54%) msgs w/ recall rate 90%
  • ISM uses 570 (54%) msgs w/ recall rate 99%

37
Experimental Evaluation
  • Reducing Query Response Time (QRT) (10x10
    Experiment)
  • BFS's QRT is on the order of 6 seconds.
  • RBFS, ISM and >RES need:
  • 30-60% of BFS's QRT for TTL = 4
  • 60-80% of BFS's QRT for TTL = 5
  • BFS's unnecessary messages increase the
    user-perceived delay.
  • [Figure: the Query Response Time as a percentage
    of BFS.]

38
Experimental Evaluation
  • The Discarded Message Problem (DMP)
  • A query q is identified by a GUID.
  • To avoid cycles, a node never forwards a query it
    has already forwarded.
  • DMP occurs if a node has forwarded q with TTL1
    and then receives q again with TTL2, where
    TTL2 > TTL1.
  • In our experiments approximately 30% of queries
    were affected by the DMP.
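
The Discarded Message Problem can be reproduced with a few lines of Python. The `Node` class and its return values are hypothetical; the sketch only models the duplicate check by GUID and flags the case where the discarded copy carried a larger TTL than the one already forwarded.

```python
class Node:
    def __init__(self):
        self.seen = {}                  # GUID -> TTL of the copy we forwarded

    def receive(self, guid, ttl):
        """Return (action, dmp), where dmp marks a discarded higher-TTL copy."""
        if guid in self.seen:
            # Duplicate: dropped to avoid cycles, even if it could travel further.
            return "discard", ttl > self.seen[guid]
        self.seen[guid] = ttl
        return "forward", False

node = Node()
first = node.receive("guid-1", ttl=1)    # arrives first over a long path
second = node.receive("guid-1", ttl=2)   # arrives later over a shorter path
```

Here the second copy could have reached one hop further than the first, but it is dropped, so part of the network that was within range is never searched.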

39
Experimental Evaluation
  • Improving Recall Rate over Time (400 Experiment)
  • The 10x10 queries experiment suited ISM well.
  • In this experiment we perform 400 random queries.
  • BFS's overwhelming message load creates two major
    outbreaks.
  • ISM improves over time, achieving a
  • 96% recall rate, again using 38% of the messages.

40
Presentation Outline
  • Introduction & Motivation.
  • Search Techniques for P2P systems
  • The Intelligent Search Mechanism
  • PeerWare Simulation Infrastructure
  • Experimental Evaluation.
  • Conclusions & Future Work.

41
Conclusions
  • Efficient Information Retrieval in P2P networks
    is not feasible with the current Search
    Algorithms.
  • We propose an Intelligent Search Mechanism that
    uses local knowledge to improve Information
    Retrieval in P2P.
  • We implement PeerWare and evaluate the
    performance of various Search Techniques
  • The ISM achieves in some cases a 100% recall rate
    while using only 57% of the BFS messages.

42
Future Work
  • Probe different Network Topologies such as ASMap
    with PowerLaws.
  • Deploy larger PeerWares with more queries.
  • Probe different Peer-Profile maintenance
    policies.
  • Use Stemming/Stop Words to answer more
    accurately.
  • Compare the performance of our method with new
    proposed techniques (random gossiping, random
    walkers, etc).
  • 60% of Gnutella belongs to 20 ISPs. How can we
    exploit that to provide more efficient query
    routing schemes?

43
Information Retrieval in Peer-to-Peer Systems
Dept. of Computer Science & Engineering @
University of California - Riverside
  • Demetrios Zeinalipour-Yazti

Thank You!
M.Sc. Thesis Defense: Monday, May 5, 2003, Surge
349, 12:00-1:00 PM
Thesis Committee: Dr. Dimitrios Gunopulos
(Chairperson), Dr. Vana Kalogeraki, Dr. Chinya V.
Ravishankar
http://www.cs.ucr.edu/csyiazti/msc.html