Title: Exploiting Content Localities for Efficient Search in P2P Systems
1 Exploiting Content Localities for Efficient Search in P2P Systems
- Lei Guo (1), Song Jiang (2), Li Xiao (3), and Xiaodong Zhang (1)
- (1) College of William and Mary, USA
- (2) Los Alamos National Laboratory, USA
- (3) Michigan State University, USA
2 Peer-to-Peer Search
- Two Performance Objectives
  - Individual peer: improve the search quality
  - Internet management: minimize the search cost
"Fast, fast, fast, and the more the better!" (P2P user)
3 Existing Solutions
- Generally aim at one of the two objectives and have performance limits on the other
- Flooding
  - Most effective for user experience
  - Least efficient for network resource utilization
- Random walk
  - Traffic efficient, but
  - Long response time and a limited number of search results
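The trade-off between the two approaches can be illustrated with a minimal sketch. The six-peer graph, the TTL, and the step budget below are hypothetical toy values, not taken from the measurement; the code only counts messages and the file holders reached.

```python
import random
from collections import deque

# Hypothetical toy overlay: peer id -> list of neighbor ids (symmetric links).
GRAPH = {0: [1, 2, 3], 1: [0, 2, 4], 2: [0, 1, 5],
         3: [0, 4, 5], 4: [1, 3, 5], 5: [2, 3, 4]}

def flood(graph, source, holders, ttl):
    """TTL-limited flooding. Returns (holders reached, messages sent)."""
    seen = {source}
    frontier = deque([(source, None, ttl)])
    hits, messages = set(), 0
    while frontier:
        node, sender, remaining = frontier.popleft()
        if node in holders:
            hits.add(node)
        if remaining == 0:
            continue
        for neighbor in graph[node]:
            if neighbor == sender:
                continue  # do not send the query back where it came from
            messages += 1  # duplicate deliveries still consume bandwidth
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, node, remaining - 1))
    return hits, messages

def random_walk(graph, source, holders, max_steps, rng):
    """Single random walker. Returns (holders reached, messages sent)."""
    node, hits = source, set()
    for _ in range(max_steps):
        node = rng.choice(graph[node])  # one message per step
        if node in holders:
            hits.add(node)
    return hits, max_steps
```

With `holders = {4, 5}`, `flood(GRAPH, 0, holders, ttl=2)` reaches both holders at the cost of 9 messages, while `random_walk(GRAPH, 0, holders, 4, random.Random(0))` sends only 4 messages but may find fewer holders and needs more hops to do so.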
4 Super-Node Architecture
- Super-node
  - Index server for its leaf nodes
- Problems
  - Index-based search has limits
    - Hard for full-text search
    - Impossible for encrypted content search
    - Not responsible for the content quality of its leaf nodes
  - The structure becomes large and inefficient
    - A leaf node has to connect to multiple super-nodes to avoid a single point of failure
    - Generating an increasingly large number of super-nodes
5 Gnutella Population in One Day (2003)
[Figure: number of peers and number of super peers over one day]
One super-node connects to only 3-4 peers on average!
6 Outline
- Our Measurement Study
- CAC: Constructing Content Abundant Cluster
- SPIRP: Selectively Prefetching Indices from Responding Peers
- CAC-SPIRP: Combining CAC and SPIRP
- Performance Evaluation
- Conclusion
7 Our Measurement Study
- Existing measurement studies
  - A small percentage of popular files accounts for most shared storage and transmissions in P2P systems
  - A small number of peers contribute the majority of files in P2P systems
  - These are only indirect evidence of content locality
    - Some files may never be accessed, or are accessed rarely
- Our purpose
  - Fully understand the localities in the peer community and in individual peers
  - Get first-hand traces for our simulation study
8 Trace Collection
- Four-day crawl of the Gnutella network
  - Using the open-source code of the LimeWire Gnutella client
  - Session-based collection (for the whole lifetime of peers)
- Query-sending traces of different peers
  - 25,764 peers
  - 409,129 queries
- Content indices of different peers
  - Full indices of 18,255 peers
  - 37 free riders
9 Content Locality in the Peer Community
A small group of peers can reply to nearly all queries and provide most of the results
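One way to quantify this kind of locality from a trace is to rank peers by how many queries they can answer and measure the share of queries covered by the top fraction. The sketch below is only an illustration: the function name and the toy `answers` mapping are hypothetical and do not reproduce the measured data.

```python
def coverage_of_top_peers(answers, fraction):
    """Share of answerable queries covered by the top `fraction` of peers.

    answers: dict mapping peer id -> set of query ids that peer can answer.
    Peers are ranked by the number of queries they can answer.
    """
    ranked = sorted(answers, key=lambda p: len(answers[p]), reverse=True)
    k = max(1, int(len(ranked) * fraction))  # always keep at least one peer
    covered = set().union(*(answers[p] for p in ranked[:k]))
    all_queries = set().union(*answers.values())
    return len(covered) / len(all_queries) if all_queries else 0.0

# Hypothetical toy trace: two of the five peers are free riders.
TOY_ANSWERS = {"a": {1, 2, 3, 4}, "b": {4, 5}, "c": {6}, "d": set(), "e": set()}
```

In this toy trace, the top 20% of peers (peer `a` alone) covers four of the six answerable queries, and the top 40% covers five of six.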
10 The Localities of Search Interests of Individual Peers
[Figure: two panels, Query Contributions by Top Query Responders and Result Contributions by Top Result Providers; legend entries: top 1, top 10, top 5, top 10, top 20]
- A peer can get search results from a small number of its top query responders: they share the same search interests
- Similar to the idea in the Locality of Interest scheme, but our conclusion is based on real P2P systems
11 Reorganizing the P2P Management Structure
- Clustering the small number of content-abundant peers
- Prefetching indices from the top query responders
12 CAC: Constructing Content Abundant Cluster
- Objectives
  - Clustering the small number of content-abundant peers in the P2P overlay
  - Providing high-quality and fast service
- Content Abundant Cluster
  - An overlay on top of the P2P network
  - Self-evaluating, self-identifying, and self-organizing
  - Persistent public service for all peers in the system
  - Strongly content-based (not index-based)
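The self-evaluation step can be sketched as a peer tracking its own success response rate (the fraction of observed queries it can answer) and identifying itself as content abundant once that rate passes a threshold. The class, the substring matching rule, and the `threshold` and `min_queries` parameters are assumptions made for illustration; the slides do not specify them.

```python
class Peer:
    """A peer that tracks how often it can answer the queries it observes."""

    def __init__(self, shared_files):
        self.shared_files = list(shared_files)
        self.queries_seen = 0
        self.queries_answered = 0

    def observe_query(self, keyword):
        """Count the query, and count it as answered on a file-name match."""
        self.queries_seen += 1
        # Assumption: a query matches if the keyword occurs in a file name.
        if any(keyword.lower() in name.lower() for name in self.shared_files):
            self.queries_answered += 1

    def success_response_rate(self):
        if self.queries_seen == 0:
            return 0.0
        return self.queries_answered / self.queries_seen

    def is_content_abundant(self, threshold, min_queries):
        """Self-identify only after enough queries have been observed."""
        return (self.queries_seen >= min_queries
                and self.success_response_rate() >= threshold)
```

Requiring a minimum number of observed queries is a design choice of this sketch: it keeps a newly joined peer from joining the cluster on the strength of one lucky match.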
13 CAC System Structure
[Figure: CAC system structure, illustrating clustering, leveling, and dynamic update around the CAC]
14 CAC Search Operations
- Queries are sent to CAC first
  - Up-flowing operation
  - Flooding in CAC
- Unsatisfied queries are propagated from CAC to the whole system
  - Down-flooding operation
  - Propagated from low levels to high levels
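The two operations can be sketched on a small overlay. The sketch assumes that a peer's level is its hop distance to the nearest CAC member (CAC members are level 0) and that the overlay is connected; the function names and the toy graph are hypothetical.

```python
from collections import deque

def assign_levels(graph, cac):
    """Breadth-first search from the cluster: level = hops to the nearest CAC peer."""
    level = {peer: 0 for peer in cac}
    queue = deque(sorted(cac))
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in level:
                level[neighbor] = level[node] + 1
                queue.append(neighbor)
    return level

def up_flow(graph, level, source):
    """Path of a query from `source` to the CAC, always stepping to a lower level."""
    path = [source]
    while level[path[-1]] > 0:
        path.append(min(graph[path[-1]], key=level.get))
    return path

def down_flood_links(graph, level):
    """Directed links used by down-flooding: only from a lower to a higher level."""
    return {(u, v) for u in graph for v in graph[u] if level[v] > level[u]}

# Hypothetical toy overlay with CAC = {"A", "B"}.
TOY_GRAPH = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
             "D": ["B", "C", "F"], "E": ["C", "F"], "F": ["D", "E"]}
```

In the toy overlay, a query from `F` flows up along `F, D, B`. Down-flooding uses 4 of the 7 links, and every non-CAC peer receives the query exactly once; links between peers on the same level stay unused.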
15 Up-flowing
[Figure: a query flowing up through the overlay toward the CAC]
16 Down-flooding
[Figure: a query flooding down from the CAC; unused links are marked]
17 SPIRP: Selectively Prefetching Indices from Responding Peers
- Basic operations
  - Peer I initiates a query q
    - Query hits: displays the results
    - Misses: sends q
  - Peer R responds to query q
    - Sends the query results, and
    - piggybacks the indices of all its shared files
  - Peer I receives the response
    - Displays the search results, and
    - stores the piggybacked indices
- Indices updating
  - Active updating of indices by responding peers
  - Updating of indices demanded by requesting peers
- Replacement of file indices
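A minimal sketch of the requesting peer's side follows: it stores one index set per responding peer, answers later queries from those sets, and evicts old sets when the buffer is full. The least-recently-useful eviction policy, the capacity measured in file names, and all names are assumptions of this sketch; the slides do not state the replacement policy.

```python
from collections import OrderedDict

class IndexCache:
    """Incoming index sets kept by a requesting peer, one set per responder."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum total number of stored file names
        self.sets = OrderedDict()  # responder -> set of file names, oldest first

    def size(self):
        return sum(len(files) for files in self.sets.values())

    def store(self, responder, file_names):
        """Store piggybacked indices, evicting the least recently useful sets."""
        self.sets.pop(responder, None)  # a fresh index set replaces the old one
        files = set(file_names)
        if len(files) > self.capacity:
            return False                # this index set can never fit
        while self.sets and self.size() + len(files) > self.capacity:
            self.sets.popitem(last=False)
        self.sets[responder] = files
        return True

    def lookup(self, keyword):
        """Return (responder, file name) hits; an empty list is a cache miss."""
        hits = []
        for responder, files in list(self.sets.items()):
            matched = sorted(name for name in files if keyword in name)
            if matched:
                self.sets.move_to_end(responder)  # mark as recently useful
                hits.extend((responder, name) for name in matched)
        return hits
```

For example, with `capacity=3`, storing two classical file names from `R1` and then two pop file names from `R2` evicts the `R1` set, after which a query matching a pop file is answered locally without sending anything to the network.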
18 SPIRP Technique
[Figure: peer I issues the query "Beethoven mp3"; responding peers R1 (classic music) and R2 (pop music)]
19 SPIRP Technique
[Figure: peer I issues the query "Beetle mp3"; responding peers R1 (classic) and R2 (pop); label: NULL]
20 SPIRP Technique
[Figure: peer I issues the query "Beetle mp3"; responding peers R1 (classic) and R2 (pop)]
21 SPIRP Technique
[Figure: peer I issues the query "Beetle mp3"; responding peers R1 (classic) and R2 (pop); note: not enough space to save indices]
22 SPIRP Technique
[Figure: peer I issues the query "Beetle mp3"; responding peers R1 (classic) and R2 (pop); note: replacement complete]
23 CAC-SPIRP
- CAC: application-level infrastructure
  - Significantly reduces bandwidth consumption
  - Good response time when queries succeed in CAC
  - Long response time when queries fail in CAC
- SPIRP: client-oriented and overlay-independent
  - Significantly reduces response time
  - Little traffic when queries can be satisfied in the cache
  - Same traffic as flooding on cache misses
- CAC-SPIRP
  - Easy to combine the two techniques
  - Considers the trade-off between the two performance objectives
  - Has the merits of both search quality and search cost
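One plausible way to combine the two techniques is a three-stage lookup: the local index cache first, then the CAC, then the whole system. The stage order is inferred from the individual descriptions of SPIRP and CAC; the function and its callable parameters are hypothetical.

```python
def search(query, min_results, cache_lookup, cac_search, flood_search):
    """Try progressively more expensive stages until enough results are found.

    Each *_lookup/_search argument is a callable taking the query and
    returning a list of results. Returns (last stage used, results).
    """
    results = list(cache_lookup(query))
    if len(results) >= min_results:
        return "cache", results          # SPIRP hit: no network traffic
    results += cac_search(query)
    if len(results) >= min_results:
        return "cac", results            # satisfied inside the cluster
    results += flood_search(query)       # unsatisfied: go to the whole system
    return "flood", results
```

The `min_results` parameter stands for the user's satisfaction level, so a demanding query falls through to later stages even when the cache holds a few matches.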
24 Simulation Environment
- Content trace and query trace
  - 4-day Gnutella crawl from our measurement
- Overlay topology
  - Traces by Clip2 Distributed Search Solutions
- Session duration
  - Pareto distribution fitted from measurement results
  - $P(x) = 14.5311\, x^{-1.8598}$
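A simulator could draw session durations from this fit by inverse-transform sampling. The sketch below assumes that P(x) is the probability density of a Pareto distribution, so that the shape is alpha = 1.8598 - 1 and the scale x_min follows from the constant 14.5311 = alpha * x_min^alpha. Both this reading and the time unit are assumptions; the slides state neither.

```python
import random

C = 14.5311          # constant of the fitted curve
EXPONENT = 1.8598    # exponent of the fitted curve

# Assumption: P(x) is a Pareto density, alpha * x_min**alpha / x**(alpha + 1).
ALPHA = EXPONENT - 1.0
X_MIN = (C / ALPHA) ** (1.0 / ALPHA)

def sample_session_duration(rng):
    """Inverse-transform sample: x = x_min * u**(-1/alpha), with u in (0, 1]."""
    u = 1.0 - rng.random()  # random() is in [0, 1), so u is never zero
    return X_MIN * u ** (-1.0 / ALPHA)
```

Under this reading the shape is below 1, so the distribution is heavy tailed with no finite mean; a simulation would typically cap the sampled durations at the length of the simulated period.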
25 Evaluation Metrics
- Query success rate
  - CAC: success rate in CAC (normalized to flooding)
  - SPIRP: success rate in local cache (normalized to flooding)
- Overall network traffic
  - Accumulated communication traffic for all queries, responses, and index transfers (normalized to flooding)
- Average response time
  - Measured as the number of routing hops (normalized to flooding)
- Evaluated for different query satisfaction levels
  - 1, 10, and 50 results, representing different user demands
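Normalizing to flooding means dividing each metric by the value that plain flooding achieves on the same workload, so that 1.0 means "same as flooding". A minimal sketch, with hypothetical metric names and made-up numbers:

```python
def normalize_to_flooding(measured, flooding):
    """Divide each measured metric by the flooding baseline for that metric."""
    return {name: measured[name] / flooding[name] for name in measured}

def is_satisfied(num_results, min_results):
    """A query is satisfied once it has at least the demanded number of results."""
    return num_results >= min_results

# Made-up numbers for illustration only; not results from the evaluation.
example = normalize_to_flooding({"traffic": 30.0, "hops": 6.0},
                                {"traffic": 100.0, "hops": 4.0})
```

Here `example` reports traffic at 0.3 of flooding and response time at 1.5 times flooding, the kind of trade-off these metrics are meant to expose.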
26 Performance Evaluation for CAC
[Figure: three panels, Overall Traffic (Normalized), Success Rate in CAC (Normalized), and Avg Response Time (Normalized); series: Minimum Results 1, Minimum Results 10, Minimum Results 50]
5 top content abundant peers are good enough for cluster construction
27 CAC Member Selection
[Figure: three panels, Success Rate in CAC (Normalized), Overall Traffic (Normalized), and Avg Response Time (Normalized), each plotted against the Success Response Rate of CAC Peers (0 to 0.04); series: Minimum Results 1, Minimum Results 10, Minimum Results 50]
- Overall traffic is not sensitive to CAC member quality
  - Traffic can be significantly reduced even for randomly selected CAC members
  - CAC down-flooding is very efficient
28 CAC-SPIRP Overall Performance
[Figure: Success Rate in Local Cache, Average Response Time (Normalized), and Overall Traffic (Normalized), plotted against the Size of Incoming Index Set Buffer (in MBytes, 0 to 10)]
CAC-SPIRP reduces both the overall traffic and response time significantly
29 Conclusion
- CAC-SPIRP fundamentally addresses the P2P search problem through reorganization
- Exploiting organizational content locality
  - CAC: a content abundant cluster that provides high-quality and fast service
- Exploiting user content locality
  - SPIRP: a client-side prefetching technique that speeds up search by avoiding unnecessary queries