Title: Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
1 Characterizing Files in the Modern Gnutella Network: A Measurement Study
- Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
- Multimedia Internetworking Research Group (Mirage)
- Computer and Information Science Department
- University of Oregon
- http://mirage.cs.uoregon.edu
2 Introduction
- P2P applications are very popular over the Internet
  - File sharing: Gnutella, Kazaa, eDonkey
  - Content distribution: BitTorrent
  - IP telephony: Skype
- P2P applications remain popular because of
  - Ease of deployment, self-scaling, infrastructure-less
- Significant impact on the Internet
- Characterizing P2P applications is essential for
  - Evaluating their performance and improving their designs
  - Conducting meaningful simulations and analytical studies
  - Examining their impact on the network
- Characteristics of large-scale P2P applications are not well understood!
3 P2P Systems: An Overview (I)
- Theme: enabling a group of peers (computers) to share their resources (e.g. files, bandwidth, storage, CPU)
- As participating peers arbitrarily join and leave, they form an (application-level) overlay topology.
  - Overlay is inherently dynamic
  - No special support from the network (e.g. multicast)
- Overlay is used for resource discovery and management
4 P2P Systems: Overview (II)
- Inherent properties
  - Scalability: available resources grow organically with the number of peers
  - Churn: peers voluntarily join/leave
  - Heterogeneity: peers have different capabilities
- Two basic architectures
  - 1) Unstructured: peers form a randomly connected overlay
  - 2) Structured: peers form an overlay with certain properties (ring, tree)
5 Effect on the Internet
- P2P accounts for 60% of all Internet traffic [CacheLogic Research 2005]
- Some P2P apps have millions of simultaneous users.
- Geographically distributed.
(Figure: Gnutella overlay in 2002)
(Figure: Gnutella population, Oct 04 to Jan 06)
6 Research on P2P Networking
- Active area of research since 2001
  - Mostly focusing on new architectures and new resource discovery/management techniques
  - Evaluation is only feasible through simulation or small-scale experiments with synthetic workloads.
- Few empirical studies on P2P systems
- Characteristics of widely deployed P2P systems are not well understood.
  - Peer dynamics, e.g. distribution of peer uptime
  - Overlay properties, e.g. distribution of peer degree
  - Resource properties, e.g. popularity distribution of files
7 Methodology
- Characterizing P2P applications requires capturing system snapshots.
- A snapshot is a graph that represents the state of the system at a given point in time (peers = nodes, connections = edges).
  - Individual snapshots reveal instantaneous properties.
  - Consecutive snapshots reveal dynamics.
- Ideally, a snapshot is captured instantaneously.
- In practice, a snapshot is iteratively discovered by a P2P crawler.
- P2P apps should provide support for the crawler, e.g. querying a peer for its list of neighbors and files.
- It is difficult to characterize proprietary P2P applications.
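The iterative discovery described above can be sketched as a breadth-first crawl that builds the snapshot graph. This is a minimal illustration, not the crawler used in the study: `crawl_snapshot` and the `query_neighbors` callback are hypothetical names, and the callback is assumed to return a peer's neighbor list, or `None` when the peer cannot be contacted.

```python
from collections import deque
from typing import Callable, Iterable, Optional, Set, Tuple

# Hypothetical interface: returns a peer's neighbor list, or None if the
# peer could not be contacted (departed, firewalled, overloaded).
NeighborQuery = Callable[[str], Optional[Iterable[str]]]


def crawl_snapshot(seeds: Iterable[str], query_neighbors: NeighborQuery):
    """Breadth-first crawl; returns (nodes, edges, unreachable)."""
    nodes: Set[str] = set(seeds)
    edges: Set[Tuple[str, str]] = set()
    unreachable: Set[str] = set()
    queue = deque(sorted(nodes))
    while queue:
        peer = queue.popleft()
        neighbors = query_neighbors(peer)
        if neighbors is None:
            # The peer is known (a node) but its own edges stay undiscovered.
            unreachable.add(peer)
            continue
        for n in neighbors:
            # Store each undirected connection once, in canonical order.
            edges.add((peer, n) if peer <= n else (n, peer))
            if n not in nodes:
                nodes.add(n)
                queue.append(n)
    return nodes, edges, unreachable


if __name__ == "__main__":
    # Toy overlay: "d" is discovered through "b" but does not answer.
    overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": None}
    print(crawl_snapshot(["a"], overlay.get))
```

Because the loop visits peers one after another, the overlay keeps changing while the crawl runs; the longer the crawl, the more the result differs from a true instantaneous snapshot.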
8 Cruiser: a Fast P2P Crawler
- We developed a parallel crawler, called Cruiser.
- Features
  - Master-slave architecture: the master coordinates among slaves, and each slave crawls hundreds of peers simultaneously
  - Dynamic adaptation to bandwidth and CPU constraints
  - Generic crawler, accommodates plug-ins
- Orders of magnitude faster than other P2P crawlers
  - Captures one million Gnutella nodes in around 7 minutes
  - 140K peers/min (visiting 22K peers/min) >> 2.5 peers/min
- Lots of important implementation issues
  - Setting timeouts, number of file descriptors per process, dealing with the local NAT box
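The idea of one coordinator handing peers to many concurrent workers can be illustrated with a thread pool. This is only a sketch under stated assumptions, not Cruiser's implementation: `parallel_crawl`, `query_neighbors`, and the `workers` parameter are hypothetical, and the adaptation to bandwidth and CPU constraints is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Set


def parallel_crawl(
    seeds: Iterable[str],
    query_neighbors: Callable[[str], Optional[Iterable[str]]],
    workers: int = 8,
) -> Set[str]:
    """Coordinator loop: each wave of new peers is contacted concurrently."""
    discovered: Set[str] = set(seeds)
    frontier: List[str] = sorted(discovered)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Workers contact the peers of this wave in parallel;
            # map() returns the replies in the same order as the inputs.
            replies = list(pool.map(query_neighbors, frontier))
            next_frontier: List[str] = []
            for neighbors in replies:
                # An unreachable peer is assumed to yield None.
                for n in neighbors or ():
                    if n not in discovered:
                        discovered.add(n)
                        next_frontier.append(n)
            frontier = next_frontier
    return discovered
```

As a sanity check on the quoted figures, one million peers in about 7 minutes is roughly 143K peers/min, in line with the 140K peers/min on the slide.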
9 Evaluating Snapshot Accuracy
Cruiser/
- No reference snapshot to compare against
- Completeness of captured snapshots: edges, nodes
- Tradeoff between granularity and completeness of snapshots
  - Node distortion => 4%
  - Edge distortion => 15%
- 30% of peers are unreachable
  - 3% departed peers
  - 17% behind firewall (NAT)
  - 10% overloaded!!
(Figure axis: Peers discovered (10,000))
10 Previous Studies
Characterizing Files/
- Captured a small population of peers
  - Partial snapshot through a short crawl
  - Periodic probe of a fixed group of peers
- Have not verified whether the captured population is representative
- Conducted more than 3 years ago (outdated)
  - Population of these apps has grown significantly
  - New features (e.g. two-tier architecture) were incorporated
11 Measurement Methodology
Characterizing Files/
- Characterizing files requires file snapshots.
- Obtaining the list of shared files and neighbor info. from individual peers: a content crawl and a topology crawl
- Individual snapshots enable static and topological analysis.
- Consecutive snapshots enable dynamic analysis.
- Topology crawl is much faster than content crawl (minutes vs hours)
- Other challenges: NAT, DHCP, file ID (see paper).
- Minimizing the distortion in file snapshots by
  - Capturing a complete snapshot with a high-speed crawler
  - Decoupling topology crawl from content crawl
(Figure labels: Ultrapeer, Top-level overlay, Leaf; Topology crawl (15 min), Content crawl (5.5 hours), Topology crawl (15 min))
12 Dataset
Characterizing Files/
- Captured around 50 snapshots
  - Average log size/snapshot: 10 GByte
- Each snapshot represents
  - 800 Terabytes of content
  - 100 million unique files
  - 0.5 million reachable peers, 20% of identified peers
- Available content in Gnutella: 4,000 Terabytes
- Reported results were consistent across multiple snapshots
- Post-processing
  - e.g. removed duplicate files reported by individual peers (9% of all captured files)
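The dataset numbers appear consistent with a simple linear extrapolation from the reachable 20% of peers to the whole population. The sketch below shows that arithmetic together with the per-peer duplicate removal; it assumes unreachable peers share, on average, as much as reachable ones, which the slides do not state, and `extrapolate_total` and `dedupe_files` are illustrative names.

```python
from typing import Iterable, List


def extrapolate_total(observed: float, coverage: float) -> float:
    """Scale an observed quantity up to the full population.

    Assumes the unobserved part behaves like the observed part.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    return observed / coverage


def dedupe_files(reported: Iterable[str]) -> List[str]:
    """Drop duplicate entries in one peer's file list, keeping first-seen order."""
    return list(dict.fromkeys(reported))


if __name__ == "__main__":
    # 0.5 million reachable peers are 20% of identified peers.
    print(extrapolate_total(0.5e6, 0.20))  # about 2.5 million identified peers
    # 800 TB observed on 20% of the peers.
    print(extrapolate_total(800, 0.20))    # about 4,000 TB
```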
13 Summary of Characterizations
Characterizing Files/
- 1) Static analysis: characteristics of files at a given point in time
- 2) Topological analysis: correlation between file distribution and overlay topology
- 3) Dynamic analysis: changes in file characteristics over time
14 Free Riding
Characterizing Files/Static Analysis
- % of free riders reported in previous studies
  - 66% in 2000 [Adar]
  - 25% in 2002 [Saroiu]
- % of free riders has dropped
Free Riders (June 13, 2005; rounded numbers)
Peer type           Peers   None (%)   Files
Ultra               159K    12         352
Leaf                235K    15         332
Long-lived Ultra    125K    12         349
Short-lived Ultra   34K     12         363
Long-lived Leaf     156K    16         350
Short-lived Leaf    79K     14         297
Total               394K    14         340
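A free rider is a peer that shares no files, so the percentage can be computed directly from per-peer file counts. The sketch below is illustrative only: the function names are hypothetical, and the check at the end assumes a particular reading of the flattened table (Ultra: 159K peers with 12% sharing nothing; Leaf: 235K peers with 15%).

```python
from typing import Iterable, Mapping, Tuple


def free_rider_percentage(files_per_peer: Mapping[str, int]) -> float:
    """Percentage of peers that share zero files."""
    if not files_per_peer:
        raise ValueError("no peers given")
    free = sum(1 for count in files_per_peer.values() if count == 0)
    return 100.0 * free / len(files_per_peer)


def combine_percentages(groups: Iterable[Tuple[int, float]]) -> float:
    """Population-weighted average of per-group percentages.

    Each group is (number_of_peers, percentage_in_that_group).
    """
    groups = list(groups)
    total = sum(size for size, _ in groups)
    if total == 0:
        raise ValueError("groups are empty")
    return sum(size * pct for size, pct in groups) / total


if __name__ == "__main__":
    print(free_rider_percentage({"p1": 0, "p2": 10, "p3": 0, "p4": 3}))
    # Assumed reading of the table: Ultra and Leaf rows combine to the total.
    print(round(combine_percentages([(159_000, 12), (235_000, 15)])))
```

Under that reading, the weighted combination of the Ultra and Leaf rows rounds to the 14% shown in the total row.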
15 Resource Sharing
Characterizing Files/Static Analysis
- How many resources (files, storage) do peers contribute?
- Distribution of peers contributing
  - x files follows a power law
  - x MBytes follows a power law
- Most peers contribute little, but a few contribute a lot
- Shared files vs storage
  - Correlation is not as strong as reported by Saroiu et al. 2002
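One rough way to check a power-law claim is to compute the complementary CDF of per-peer contributions and estimate its slope on log-log axes. This is a minimal sketch with illustrative function names; least squares on log-log points is a crude estimator and is not claimed to be the method used in the study.

```python
import math
from typing import List, Sequence, Tuple


def ccdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, P[X >= x]) for each distinct x, in ascending order."""
    n = len(values)
    xs = sorted(values)
    points = []
    i = 0
    while i < n:
        x = xs[i]
        points.append((x, (n - i) / n))
        while i < n and xs[i] == x:
            i += 1
    return points


def loglog_slope(points: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of log(y) against log(x), using points with x, y > 0."""
    logs = [(math.log(x), math.log(y)) for x, y in points if x > 0 and y > 0]
    if len(logs) < 2:
        raise ValueError("need at least two positive points")
    mean_x = sum(lx for lx, _ in logs) / len(logs)
    mean_y = sum(ly for _, ly in logs) / len(logs)
    sxx = sum((lx - mean_x) ** 2 for lx, _ in logs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in logs)
    return sxy / sxx
```

For data that follows a power law, the CCDF points fall near a straight line on log-log axes, and the slope approximates the negated tail exponent. Peers sharing zero files are skipped by the `x > 0` filter, since their logarithm is undefined.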
16 File Popularity
Characterizing Files/Static Analysis
- Represents the availability of individual files.
- Follows a Zipf distribution
- Popularity distribution remains stable over time
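Popularity here can be read as the number of peers that hold a copy of a file. The sketch below derives that count from a file snapshot and produces the rank-frequency list that a Zipf check would plot; the snapshot layout (peer ID mapped to a list of file identifiers) and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def file_popularity(shared: Dict[str, Iterable[str]]) -> Counter:
    """Count, for every file, how many peers share it."""
    counts: Counter = Counter()
    for files in shared.values():
        # A file listed twice by the same peer still counts once.
        counts.update(set(files))
    return counts


def rank_frequency(counts: Counter) -> List[Tuple[int, int]]:
    """Return (rank, popularity) pairs, with rank 1 as the most popular file."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


if __name__ == "__main__":
    snapshot = {"p1": ["a", "b", "a"], "p2": ["a"], "p3": ["a", "b", "c"]}
    print(rank_frequency(file_popularity(snapshot)))
```

Under a Zipf distribution, popularity is roughly proportional to a negative power of rank, so these pairs fall near a straight line when plotted on log-log axes.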
17 File Types
Characterizing Files/Static Analysis
- In 2001, Chu et al. reported
  - Audio: 67% of files, 79% of bytes
  - Video: 2% of files, 19% of bytes
- mp3 files are very popular!
- Multimedia (mm) files make up 73% of files, 93% of bytes
  - Non-mm: jpg, gif, htm, exe, txt
- Video files have become more popular
Major Audio Types
Type    File (%)   Byte (%)
mp3     61         37
wma     2.7        1.3
wave    1.9        0.7
m4a     1.4        0.7
total   67         40
Major Video Types
Type    File (%)   Byte (%)
wmv     2.3        3.4
mpg     2.4        23.3
avi     0.8        24.5
asf     0.14       0.64
total   5.6        52
18 Topological Analysis
Characterizing Files/
- Is there any correlation between the locations of a file and the overlay topology?
  - i.e. are copies of a file topologically clustered?
- File locations are affected by two factors
  - 1) Scoped search => topological clustering
  - 2) Churn => random distribution
- Which factor is dominant?
- Examining from two angles
  - Per-file perspective
  - Per-peer perspective
19 Topological Analysis
Characterizing Files/
- Simulate flood-based queries from 100 random peers
  - Number of messages to find 5 copies
  - Files with different popularity
  - Random vs realistic file distribution
- Average similarity of content between 100 random peers and their one/two/three-hop neighbors.
- No topological clustering exists
  - Churn is the dominant factor
  - Use a random file distribution for simulation
  - Select random peers to characterize files (non-trivial)
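The flood-based query experiment can be approximated with a breadth-first flood over a captured overlay. The sketch below makes several simplifying assumptions that are not taken from the slides: every forwarded query counts as one message, duplicates are dropped by the receiver, there is no TTL, and the querying peer's own copy does not count. `flood_messages` and `random_placement` are hypothetical names.

```python
import random
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def flood_messages(
    overlay: Dict[str, Iterable[str]],
    holders: Set[str],
    source: str,
    copies: int = 5,
) -> Optional[int]:
    """Messages sent until `copies` holders are reached, or None if never."""
    seen = {source}
    queue = deque([(source, None)])
    messages = 0
    found = 0
    while queue:
        peer, sender = queue.popleft()
        for n in overlay.get(peer, ()):
            if n == sender:
                continue  # a query is not sent back to where it came from
            messages += 1  # every forwarded query costs one message
            if n in seen:
                continue  # duplicate query, dropped by the receiver
            seen.add(n)
            if n in holders:
                found += 1
                if found >= copies:
                    return messages
            queue.append((n, peer))
    return None


def random_placement(peers: Iterable[str], n_copies: int, seed: int = 0) -> Set[str]:
    """Place n_copies of a file on uniformly random peers (the 'random' case)."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(peers), n_copies))
```

Running `flood_messages` from a set of random source peers, once with the holders observed in a snapshot and once with `random_placement` using the same number of copies, gives the random versus realistic comparison. Similar message counts in both cases would point to the absence of topological clustering.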
20 Dynamic Analysis
Characterizing Files/
- How do various characteristics of available files change over different timescales?
  - Peers add/download or remove files
  - Peers join/leave the system
- 1) Variations in shared files by individual peers
  - Dynamic IP addresses introduce error
- 2) Variations in popularity of individual files
  - Trend in popularity changes
21 Variations of files at individual peers
Characterizing Files/Dynamic Analysis
- Ratio of added/removed files to total files (degree of change)
  - 3000 random peers
  - Timescales: 2 hr, 6 hr, 1 day, 1 wk
- More change over longer timescales, which seems intuitive
- Change in popularity of 50K files over a one-day interval
  - More changes for more popular files
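The degree of change for one peer can be computed from its file lists in two snapshots. The slide does not say what "total files" means; this sketch assumes it is the union of the files seen in either snapshot, and `degree_of_change` is an illustrative name.

```python
from typing import Iterable


def degree_of_change(before: Iterable[str], after: Iterable[str]) -> float:
    """Fraction of a peer's files that were added or removed between snapshots.

    Assumption: "total files" is the union of both snapshots' file sets.
    """
    old, new = set(before), set(after)
    total = old | new
    if not total:
        return 0.0  # the peer shared nothing in either snapshot
    changed = old ^ new  # files present in exactly one snapshot
    return len(changed) / len(total)


if __name__ == "__main__":
    # One file removed ("a") and one added ("d") out of four distinct files.
    print(degree_of_change(["a", "b", "c"], ["b", "c", "d"]))
```

With this definition the value ranges from 0 (identical file lists) to 1 (no file in common), which makes peers with very different library sizes comparable across the 2 hr to 1 wk timescales.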
22 Change in file popularity
Characterizing Files/Dynamic Analysis
- Change in popularity
  - For top 100 and 1000 files
  - Over different timescales
- For any timescale, more popular files
  - exhibit larger changes
  - change more rapidly
- Caching references is useful
- These all seem intuitive, but one needs to quantify the rate of change
(Figure panels: Top 100 files; Top 1000 files)
23 Trends in Popularity Changes
Characterizing Files/Dynamic Analysis
- Goal: to predict the popularity of a file in the future?
- No major change in popularity over several days
- Larger changes over a few months
- The key is to quantify the rate and pattern of changes.
- Significantly more snapshots are required to derive any reliable conclusion