1
Characterizing Files in the Modern Gnutella
Network: A Measurement Study
  • Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
  • Multimedia Internetworking Research Group
    (Mirage)
  • Computer and Information Science Department
  • University of Oregon
  • http://mirage.cs.uoregon.edu

2
Introduction
  • P2P applications are very popular over the
    Internet:
  • File sharing: Gnutella, Kazaa, eDonkey
  • Content distribution: BitTorrent
  • IP telephony: Skype
  • P2P applications remain popular because of:
  • Ease of deployment, self-scaling,
    infrastructure-less
  • Significant impact on the Internet
  • Characterizing P2P applications is essential for:
  • Evaluating their performance and improving their
    designs
  • Conducting meaningful simulations and analytical
    studies
  • Examining their impact on the network
  • Characteristics of large-scale P2P applications
    are not well understood!

3
P2P Systems: An Overview (I)
  • Theme: enabling a group of peers (computers) to
    share their resources (e.g. files, bandwidth,
    storage, CPU)
  • As participating peers arbitrarily join and leave,
    they form an (application-level) overlay
    topology.
  • The overlay is inherently dynamic
  • No special support from the network (e.g.
    multicast)
  • The overlay is used for resource discovery and
    management

4
P2P Systems: An Overview (II)
  • Inherent properties:
  • Scalability: available resources grow organically
    with the number of peers
  • Churn: peers voluntarily join/leave
  • Heterogeneity: peers have different capabilities
  • Two basic architectures:
  • 1) Unstructured: peers form a randomly connected
    overlay
  • 2) Structured: peers form an overlay with
    certain properties (ring, tree)

5
Effect on the Internet
  • 60% of all Internet traffic (CacheLogic Research,
    2005)
  • Some P2P apps have millions of simultaneous
    users.
  • Geographically distributed.

[Figures: Gnutella overlay in 2002; Gnutella population (Oct 04 to Jan 06)]
6
Research on P2P Networking
  • Active area of research since 2001
  • Mostly focusing on new architectures, new
    resource discovery/management techniques
  • Evaluation is only feasible through simulation or
    small scale experiments with synthetic workloads.
  • Few empirical studies on P2P systems
  • Characteristics of widely-deployed P2P systems
    are not well understood.
  • Peer dynamics, e.g. distribution of peer uptime
  • Overlay properties, e.g. distribution of peer
    degree
  • Resource properties, e.g. popularity distribution
    of files

7
Methodology
  • Characterizing P2P applications requires
    capturing system snapshots.
  • A snapshot is a graph that represents the state of
    the system at a given point in time (peers are
    nodes, connections are edges).
  • Individual snapshots reveal instantaneous
    properties.
  • Consecutive snapshots reveal dynamics.
  • Ideally, a snapshot is captured instantaneously.
  • In practice, a snapshot is iteratively discovered
    by a P2P crawler.
  • P2P apps should provide support for crawlers,
    e.g. querying a peer for its list of neighbors
    and files.
  • It is difficult to characterize proprietary
    P2P applications.
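
The iterative discovery described above can be sketched as a breadth-first walk over the overlay. This is a minimal illustration, not the authors' crawler: `crawl_snapshot`, `query_neighbors`, and the toy overlay are hypothetical names and data, and a real crawler would speak the Gnutella protocol instead of reading a dictionary.

```python
from collections import deque


def crawl_snapshot(seeds, query_neighbors):
    """Iteratively discover an overlay snapshot starting from seed peers.

    query_neighbors(peer) is a hypothetical callback that returns the peer's
    neighbor list, or raises ConnectionError if the peer cannot be reached.
    """
    nodes = set(seeds)        # every peer discovered so far
    edges = set()             # undirected connections, stored as frozensets
    unreachable = set()       # discovered peers that did not answer
    frontier = deque(seeds)   # peers still waiting to be queried

    while frontier:
        peer = frontier.popleft()
        try:
            neighbors = query_neighbors(peer)
        except ConnectionError:
            unreachable.add(peer)
            continue
        for neighbor in neighbors:
            edges.add(frozenset((peer, neighbor)))
            if neighbor not in nodes:
                nodes.add(neighbor)
                frontier.append(neighbor)
    return nodes, edges, unreachable


# Toy overlay standing in for a real network (made-up data).
OVERLAY = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"]}


def toy_query(peer):
    if peer not in OVERLAY:   # peer "e" never answers, e.g. it is firewalled
        raise ConnectionError(peer)
    return OVERLAY[peer]


nodes, edges, unreachable = crawl_snapshot(["a"], toy_query)
print(len(nodes), "peers,", len(edges), "edges,", len(unreachable), "unreachable")
```

Because the walk takes time, the overlay can change while it is being explored, which is why a snapshot captured this way only approximates an instantaneous one.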

8
Cruiser: a Fast P2P Crawler
  • We developed a parallel crawler, called Cruiser.
  • Features:
  • Master-slave architecture: the master coordinates
    among slaves, and each slave crawls hundreds of
    peers simultaneously
  • Dynamic adaptation to bandwidth and CPU
    constraints
  • Generic crawler, accommodates plug-ins
  • Orders of magnitude faster than other P2P
    crawlers:
  • Captures one million Gnutella nodes in around 7
    minutes
  • 140K peers/min (visiting 22K peers/min) >> 2.5
    peers/min
  • Lots of important implementation issues:
  • Setting timeouts, number of file descriptors per
    process, dealing with a local NAT box
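
The idea of one coordinator handing peers to many concurrent workers can be sketched with a thread pool. This is only an illustration under stated assumptions: `parallel_crawl`, `query_neighbors`, and the ring overlay are hypothetical, the concurrency limit is fixed instead of adapting to bandwidth or CPU load, and nothing here reflects Cruiser's actual implementation.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def parallel_crawl(seeds, query_neighbors, max_in_flight=100):
    """Crawl with up to max_in_flight peers being queried at once.

    query_neighbors(peer) is a hypothetical blocking call that returns a
    neighbor list or raises ConnectionError.
    """
    discovered = set(seeds)
    edges = set()
    pending = list(seeds)   # coordinator's queue of peers still to visit
    in_flight = {}          # future -> peer currently being queried

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while pending or in_flight:
            # Hand out work until the concurrency limit is reached.
            while pending and len(in_flight) < max_in_flight:
                peer = pending.pop()
                in_flight[pool.submit(query_neighbors, peer)] = peer
            # Block until at least one outstanding query finishes.
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                peer = in_flight.pop(future)
                try:
                    neighbors = future.result()
                except ConnectionError:
                    continue  # unreachable peer: keep the node, add no edges
                for neighbor in neighbors:
                    edges.add(frozenset((peer, neighbor)))
                    if neighbor not in discovered:
                        discovered.add(neighbor)
                        pending.append(neighbor)
    return discovered, edges


# Toy ring of 50 peers (made-up data): peer i links to i-1 and i+1.
RING = {i: [(i - 1) % 50, (i + 1) % 50] for i in range(50)}
peers, links = parallel_crawl([0], RING.__getitem__, max_in_flight=8)
print(len(peers), "peers and", len(links), "links discovered")
```

Only the coordinator thread touches the shared sets, so no locking is needed; the workers just run the blocking queries.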

9
Evaluating Snapshot Accuracy
  • No reference snapshot to compare against
  • Completeness of captured snapshots: edges, nodes
  • Tradeoff between granularity and completeness
    of snapshots
  • Node distortion -> 4%
  • Edge distortion -> 15%
  • 30% of peers are unreachable:
  • 3% departed peers
  • 17% behind a firewall (NAT)
  • 10% overloaded!!
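
The slide does not define distortion. One plausible definition, assumed here purely for illustration, compares two snapshots taken back to back and measures the share of items (nodes or edges) that appear in only one of them. The function name and the example sets are hypothetical; the unreachable-peer breakdown uses the percentages from the slide.

```python
def distortion(first, second):
    """Share of items present in only one of two snapshots (0.0 to 1.0).

    Assumed definition: size of the symmetric difference over size of the union.
    """
    union = first | second
    if not union:
        return 0.0
    return len(first ^ second) / len(union)


# Made-up node sets from two consecutive crawls.
nodes_first = {"a", "b", "c", "d"}
nodes_second = {"a", "b", "c", "e"}
node_distortion = distortion(nodes_first, nodes_second)  # 2 of 5 peers differ

# Breakdown of unreachable peers from the slide, in percent of all peers.
unreachable_breakdown = {"departed": 3, "behind firewall (NAT)": 17, "overloaded": 10}
total_unreachable = sum(unreachable_breakdown.values())

print(f"node distortion: {node_distortion:.0%}")
print(f"unreachable peers: {total_unreachable}%")
```

The same function applies to edge sets if each edge is stored as a hashable pair.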

[Figure: axis label "Peers discovered (10,000)"]
10
Previous Studies
  • Captured a small population of peers:
  • Partial snapshot through a short crawl
  • Periodic probe of a fixed group of peers
  • Did not verify whether the captured population
    is representative
  • Conducted more than 3 years ago (outdated):
  • The population of these apps has grown
    significantly
  • New features (e.g. two-tier architecture) were
    incorporated

11
Measurement Methodology
  • Characterizing files requires file snapshots.
  • Obtaining the list of shared files and neighbor
    information from individual peers
  • A content crawl and a topology crawl
  • Individual snapshots enable static and
    topological analysis.
  • Consecutive snapshots enable dynamic analysis.
  • A topology crawl is much faster than a content
    crawl (minutes vs hours)
  • Other challenges: NAT, DHCP, file ID (see
    paper).
  • Minimizing the distortion in file snapshots by:
  • Capturing a complete snapshot with a high-speed
    crawler
  • Decoupling the topology crawl from the content
    crawl

[Figure: two-tier overlay (Ultrapeer, Leaf, top-level overlay) and crawl timeline: Content crawl 5.5 hours, Topology crawl 15 min (twice)]
12
Dataset
  • Captured around 50 snapshots
  • Average log size per snapshot: 10 GByte
  • Each snapshot represents:
  • 800 Terabytes of content
  • 100 million unique files
  • 0.5 million reachable peers, 20% of identified
    peers
  • Available content in Gnutella: 4,000 Terabytes
  • Reported results were consistent across multiple
    snapshots
  • Post-processing:
  • e.g. removed duplicate files reported by
    individual peers (9% of all captured files)
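
The 4,000-Terabyte figure is consistent with scaling the per-snapshot content up from the reachable 20% of peers to all identified peers. The arithmetic below is a sketch of that reading; it assumes, as the slide does not state, that unreachable peers share as much on average as reachable ones.

```python
reachable_peers = 500_000          # peers whose file lists were captured
reachable_percent = 20             # share of identified peers that were reachable
content_per_snapshot_tb = 800      # Terabytes seen at the reachable peers

# Scale both quantities from the reachable 20% to 100% of identified peers.
identified_peers = reachable_peers * 100 / reachable_percent
estimated_total_tb = content_per_snapshot_tb * 100 / reachable_percent

print(f"identified peers: {identified_peers:,.0f}")
print(f"estimated available content: {estimated_total_tb:,.0f} TB")
```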

13
Summary of Characterizations
  • 1) Static analysis: characteristics of files at a
    given point in time
  • 2) Topological analysis: correlation between file
    distribution and overlay topology
  • 3) Dynamic analysis: changes in file
    characteristics over time

14
Free Riding
Characterizing Files/Static Analysis
Free Riders
  • % of free riders reported in previous studies:
  • 66% in 2000 (Adar)
  • 25% in 2002 (Saroiu)
  • % of free riders has dropped

                    Peers   None (%)   Files
Ultra               159K    12         352
Leaf                235K    15         332
Long-lived Ultra    125K    12         349
Short-lived Ultra   34K     12         363
Long-lived Leaf     156K    16         350
Short-lived Leaf    79K     14         297
Total               394K    14         340
(June 13, 2005; rounded numbers)
15
Resource Sharing
Characterizing Files/Static Analysis
  • How many resources (files, storage) do peers
    contribute?
  • Distribution of peers contributing:
  • x files conforms to a power law
  • x MBytes conforms to a power law
  • Most peers contribute little, but a few contribute
    a lot
  • Shared files vs storage:
  • Not as strong as reported by Saroiu et al. (2002)
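
One common way to check a power-law claim is to estimate the tail exponent by maximum likelihood. The sketch below does this on synthetic data; it is not the authors' analysis, and the function name, the exponent 1.5, and the sample size are all made up for illustration.

```python
import math
import random


def estimate_power_law_exponent(samples, x_min=1.0):
    """Maximum-likelihood estimate of alpha for a continuous Pareto tail.

    Assumes P(X > x) is proportional to x ** -alpha for x >= x_min.
    """
    tail = [x for x in samples if x >= x_min]
    return len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic "files per peer" values drawn from a Pareto distribution.
rng = random.Random(42)
true_alpha = 1.5
files_per_peer = [rng.paretovariate(true_alpha) for _ in range(50_000)]

alpha_hat = estimate_power_law_exponent(files_per_peer)
print(f"estimated exponent: {alpha_hat:.2f} (samples generated with {true_alpha})")

# Heavy tail in practice: the median is small while the maximum is very large.
ordered = sorted(files_per_peer)
print(f"median: {ordered[len(ordered) // 2]:.2f}, max: {ordered[-1]:.0f}")
```

With real measurements, the choice of `x_min` matters, since a power law often holds only above some threshold.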

16
File Popularity
Characterizing Files/Static Analysis
  • Popularity represents the availability of
    individual files.
  • Follows a Zipf distribution
  • Popularity distribution remains stable over time
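
A Zipf distribution shows up as a straight line when popularity is plotted against rank on log-log axes. The sketch below counts copies per file and fits that slope by least squares; the peer file lists and the function name are hypothetical, and popularity is taken here to mean the number of peers holding a file.

```python
import math
from collections import Counter


def zipf_slope(replica_counts):
    """Least-squares slope of log(count) versus log(rank)."""
    ranked = sorted(replica_counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Popularity of a file = number of peers sharing it (made-up file lists).
peer_files = {
    "p1": ["song.mp3", "clip.avi"],
    "p2": ["song.mp3"],
    "p3": ["song.mp3", "clip.avi", "notes.txt"],
}
popularity = Counter(name for files in peer_files.values() for name in set(files))

# Ideal Zipf data: the file at rank r has 1000 / r copies, so the slope is -1.
ideal_counts = [1000 / rank for rank in range(1, 101)]
slope = zipf_slope(ideal_counts)
print(popularity.most_common(), f"slope on ideal data: {slope:.2f}")
```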

17
File Types
Characterizing Files/Static Analysis
  • In 2001, Chu et al. reported:
  • Audio: 67% of files, 79% of bytes
  • Video: 2% of files, 19% of bytes
  • mp3 files are very popular!
  • Multimedia (mm) files make up 73% of files and 93%
    of bytes
  • Non-mm: jpg, gif, htm, exe, txt
  • Video files have become more popular

Major Audio Types
Type     File (%)   Byte (%)
mp3      61         37
wma      2.7        1.3
wave     1.9        0.7
m4a      1.4        0.7
total    67         40

Major Video Types
Type     File (%)   Byte (%)
wmv      2.3        3.4
mpg      2.4        23.3
avi      0.8        24.5
asf      0.14       0.64
total    5.6        52
18
Topological Analysis
  • Is there any correlation between the locations of
    a file and the overlay topology?
  • i.e. are copies of a file topologically
    clustered?
  • File locations are affected by two factors:
  • 1) Scoped search -> topological clustering
  • 2) Churn -> random distribution
  • Which factor is dominant?
  • Examining from two angles:
  • Per-file perspective
  • Per-peer perspective

19
Topological Analysis
  • Simulate flood-based queries from 100 random
    peers:
  • Number of messages to find 5 copies
  • Files with different popularity
  • Random vs realistic file distribution
  • Average similarity of content between 100 random
    peers and their one-, two-, and three-hop
    neighbors.
  • No topological clustering exists
  • Churn is the dominant factor
  • Use a random file distribution for simulations
  • Select random peers to characterize files
    (non-trivial)
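
A flood-based query experiment of this kind can be sketched as a breadth-first flood that counts messages until enough copies are found. Everything below is an illustrative assumption: the random overlay generator, the overlay size and degree, the holder counts, and the message-counting rule (one message per link traversal) are not taken from the study, and only a random file placement is simulated.

```python
import random
from collections import deque


def flood_query(graph, holders, source, copies_needed=5):
    """Flood outward from source; return messages sent until enough copies
    are found, or None if the flood ends first."""
    visited = {source}
    frontier = deque([source])
    found = 0                  # copies at the source itself are not counted
    messages = 0
    while frontier and found < copies_needed:
        peer = frontier.popleft()
        for neighbor in graph[peer]:
            messages += 1      # one query message per link traversal
            if neighbor in visited:
                continue       # duplicate delivery, dropped by the receiver
            visited.add(neighbor)
            frontier.append(neighbor)
            if neighbor in holders:
                found += 1
                if found >= copies_needed:
                    return messages
    return messages if found >= copies_needed else None


def random_overlay(num_peers, degree, rng):
    """Random undirected overlay where every peer has at least `degree` links."""
    graph = {peer: set() for peer in range(num_peers)}
    for peer in range(num_peers):
        while len(graph[peer]) < degree:
            other = rng.randrange(num_peers)
            if other != peer:
                graph[peer].add(other)
                graph[other].add(peer)
    return graph


def average_messages(graph, holders, sources):
    costs = [flood_query(graph, holders, s) for s in sources]
    costs = [c for c in costs if c is not None]
    return sum(costs) / len(costs) if costs else None


rng = random.Random(7)
overlay = random_overlay(2000, 4, rng)
sources = rng.sample(range(2000), 100)           # 100 random querying peers
popular_holders = set(rng.sample(range(2000), 200))  # file held by 10% of peers
rare_holders = set(rng.sample(range(2000), 20))      # file held by 1% of peers

avg_popular = average_messages(overlay, popular_holders, sources)
avg_rare = average_messages(overlay, rare_holders, sources)
print(f"popular file: {avg_popular:.0f} messages, rare file: {avg_rare:.0f} messages")
```

Repeating the run with a measured file placement in place of the random one would give the comparison the slide refers to.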

20
Dynamic Analysis
  • How do various characteristics of available files
    change over different timescales?
  • Peers add/download or remove files
  • Peers join/leave the system
  • 1) Variations in shared files by individual peers
  • Dynamic IP addresses introduce error
  • 2) Variations in popularity of individual files
  • Trends in popularity changes

21
Variations of files at individual peers
Characterizing Files/Dynamic Analysis
  • Ratio of added/removed files to total files
    (degree of change)
  • 3000 random peers
  • Timescales: 2 hr, 6 hr, 1 day, 1 wk
  • More change over longer timescales (seems
    intuitive)
  • Change in popularity of 50K files over a one-day
    interval
  • More changes for more popular files
22
Change in file popularity
Characterizing Files/Dynamic Analysis
  • Change in popularity:
  • For the top 100 and top 1000 files
  • Over different timescales
  • For any timescale, more popular files exhibit
    larger changes
  • Changes occur more rapidly
  • Caching references is useful
  • These all seem intuitive, but one needs to
    quantify the rate of change

[Figures: Top 100 files; Top 1000 files]
23
Trends in Popularity Changes
Characterizing Files/Dynamic Analysis
  • Goal: can the future popularity of a file be
    predicted?
  • No major change in popularity over several days
  • Larger changes over a few months
  • The key is to quantify the rate and pattern of
    changes.
  • Significantly more snapshots are required to
    derive any reliable conclusion