Title:
A Quantitative Analysis of the Gnutella Network Traffic
University of California, Riverside, Department of Computer Science & Engineering
- cs204 Final Project by
- Demetris Zeinalipour, Theodoros Folias
- <csyiazti@cs.ucr.edu, folias@cs.ucr.edu>
- Advisor: Michalis Faloutsos
Online Resources: http://www.cs.ucr.edu/csyiazti/cs204.html
Presentation Outline
- Motivation.
- Gnutella Protocol in a nutshell.
- Related Work.
- gnuDC: Gnutella Distributed Crawler.
- Experiments.
- Conclusions and Future Work.
1. Motivation
- P2P file-sharing systems, such as Gnutella, Napster and Freenet, realized a distributed infrastructure for sharing files.
- Traditionally, files were shared using the client-server model (e.g. HTTP, FTP), which is centralized and therefore not scalable.
- P2P systems have shown that distributed file-searching is feasible, and that they may change the way we interact on the Internet.
- Why Gnutella?
  - It is a pure P2P protocol, in contrast with e.g. Napster.
  - It is an open protocol, which allows its investigation.
  - It is a large community: 250,000 peers at any given moment.
  - It is a truly international phenomenon with a world-wide community.
  - It is still not clear what kind of traffic is traversing the network.
1. Motivation (contd)
- Questions
  - What do these systems really look like?
  - What kind of traffic are these systems carrying?
  - What is the communication overhead of P2P?
  - Where are file-searchers coming from and what are they looking for?
- Our contribution
  - We make a large-scale quantitative analysis of Gnutella network traffic (17 machines, 85 nodes, 700MB of log traces in 5 hours).
  - To our knowledge, no publication presents a measurement at this scale.
  - We describe design and implementation issues of a large-scale distributed Gnutella crawler.
2. Gnutella Protocol v0.4 (1/5)
- One of the most popular file-sharing protocols.
- Operates without a central index server (unlike Napster).
- Clients (downloaders) are also servers => servents.
- Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
- Searches are done within the virtual network, while actual downloads are done outside it (directly between peers, with HTTP).
- The core of the protocol consists of 5 descriptors (PING, PONG, QUERY, QUERYHIT and PUSH).
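The five descriptors share a common fixed-size header. As a minimal sketch, assuming the 23-byte layout and type codes given in the public Gnutella v0.4 specification (16-byte descriptor ID, one byte each for payload type, TTL and hops, and a 4-byte little-endian payload length), the header can be packed and parsed as follows. The function names `pack_header` and `parse_header` are illustrative and are not part of gnuDC.

```python
import struct

# Payload descriptor byte -> descriptor name (values from the v0.4 specification).
DESCRIPTOR_TYPES = {
    0x00: "PING",
    0x01: "PONG",
    0x40: "PUSH",
    0x80: "QUERY",
    0x81: "QUERYHIT",
}

# "<" = little-endian, no padding: 16-byte id, type, TTL, hops, payload length.
HEADER_FORMAT = "<16sBBBI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def pack_header(descriptor_id, type_byte, ttl, hops, payload_length):
    """Serialize a descriptor header into its wire format."""
    return struct.pack(HEADER_FORMAT, descriptor_id, type_byte, ttl, hops, payload_length)


def parse_header(data):
    """Decode the first HEADER_SIZE bytes of a descriptor into a dict."""
    descriptor_id, type_byte, ttl, hops, length = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    return {
        "id": descriptor_id,
        "type": DESCRIPTOR_TYPES.get(type_byte, "UNKNOWN"),
        "ttl": ttl,
        "hops": hops,
        "payload_length": length,
    }


if __name__ == "__main__":
    raw = pack_header(b"\x01" * 16, 0x80, 7, 0, 12)
    print(parse_header(raw))
```

A crawler that only needs per-type message counts can stop after the header and skip `payload_length` bytes to reach the next descriptor.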
2. Gnutella Protocol (2/5)
- It is important to understand how the protocol works in order to understand our framework.
- A peer (p) needs to connect to 1 or more other Gnutella peers in order to participate in the virtual network.
- p initially doesn't know the IPs of its fellow file-sharers.
2. Gnutella Protocol (3/5)
- a. HostCaches: the initial connection
  - p connects to a HostCache H to obtain a set of IP addresses of active peers.
  - p might alternatively probe its own cache to find peers it was connected to in the past.
[Figure: (1) p requests and receives a set of active peers from HostCache H; (2) p connects to the network.]
2. Gnutella Protocol (4/5)
- b. Ping/Pong: the communication overhead
  - Although p is already connected, it must discover new peers, since its current connections may break.
  - Thus, it periodically sends PING messages, which are broadcast (message flooding).
  - If a host, e.g. p2, is available, it will respond with a PONG (routed back only along the same path the PING came from).
  - p might use this response and attempt a connection to p2 in order to increase its degree.
[Figure: (1) p floods a PING into Gnutella network N; (2) servent p2 answers with a PONG.]
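The flooding and back-routing behaviour described on this slide can be illustrated with a small simulation. This is a sketch under stated assumptions, not the servent logic of any real client: the overlay is a hypothetical adjacency dict, messages are delivered in breadth-first order, and each peer remembers only the neighbour that first delivered the PING so that a PONG can retrace that path.

```python
from collections import deque


def flood_ping(graph, origin, ttl):
    """Flood a PING from origin; return {peer: neighbour that delivered it first}."""
    came_from = {origin: None}
    queue = deque([(origin, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: the PING is not forwarded further
        for neighbour in graph[peer]:
            if neighbour not in came_from:  # duplicate PINGs are dropped
                came_from[neighbour] = peer
                queue.append((neighbour, remaining - 1))
    return came_from


def pong_route(came_from, responder):
    """Path a PONG takes from responder back to the PING's origin."""
    path = [responder]
    while came_from[path[-1]] is not None:
        path.append(came_from[path[-1]])
    return path


if __name__ == "__main__":
    # Hypothetical five-peer overlay used only for illustration.
    overlay = {
        "p": ["a", "b"],
        "a": ["p", "p2"],
        "b": ["p"],
        "p2": ["a", "p3"],
        "p3": ["p2"],
    }
    reached = flood_ping(overlay, "p", ttl=2)
    print(sorted(reached))
    print(pong_route(reached, "p2"))
```

With a TTL of 2 the PING reaches p2 but not p3, and the PONG from p2 travels back through a, the same neighbour that forwarded the PING.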
2. Gnutella Protocol (5/5)
- c. Query/QueryHit: the utilization
  - Query descriptors contain unstructured queries, e.g. "celine dion mp3".
  - Like PINGs, they are broadcast, with a typical TTL of 7.
  - If a host, e.g. p2, matches the query, it will respond with a QueryHit descriptor.
- d. Push: enables downloads from peers that are firewalled.
  - If a peer is firewalled, we can't connect to it. Hence we ask it to establish a connection to us and send us the file.
[Figure: (1) a QUERY is flooded into Gnutella network N; (2) servent p2 answers with a QUERYHIT.]
3. Related Work (1/3)
- a. Simulating Peer-to-Peer Systems
  - Most researchers use simulation testbeds (e.g. Anthill) to validate the performance improvement they gain from new ideas (routing algorithms etc.).
  - Initial assumptions (e.g. degree of nodes, graph type: random, power-law, tree) might be wrong, though!
  - Visualizations might also not be very helpful.
  - What we need instead are real network metrics from a large P2P network such as Gnutella.
3. Related Work (2/3)
- a. Obtaining data from different physical locations
  - "Tracing a large-scale Peer to Peer System: An hour in the life of Gnutella", E. Markatos, CCGrid 2002.
  - They obtained traffic log traces from 3 different physical locations (Norway, Greece, USA).
  - The data collected at all three locations are almost identical.
  - They found that Gnutella traffic is bursty and remains bursty over several time scales.
  - The results also show high locality patterns in QUERY messages. This observation might lead to better caching policies at peers.
  - Their study also reveals a mismatch between the physical topology and the virtual Gnutella topology, since the collected data are identical among their 3 different crawlers.
3. Related Work (3/3)
- b. Obtaining real network data
  - Limewire shows that there are on average 250,000 hosts at any given moment.
  - They also show that only a small fraction of these hosts accept incoming connections.
  - GnutellaMeter.com also monitors the network by attaching itself to well-positioned (i.e. high-degree) peers in the network. They present top queries.
  - Clip2 showed that the network diameter in 2000 was 22, indicating that some regions of the network were not communicating with others.
  - Clip2 also showed that most Gnutella searchers are seeking video/audio media.
  - How have these trends changed?
4. gnuDC: Gnutella Distributed Crawler (1/6)
- gnuDC is a large-scale distributed Gnutella crawler which monitors the network by attaching itself to it with a large number of peers.
- A key difference between a WWW crawler and a P2P crawler is that the latter needs to obtain results (a snapshot) in a relatively short amount of time.
- Design issues and an architecture for a distributed P2P crawler have not been described in any other publication.
- What should be the responsibilities of a P2P crawler, and how should we design it?
4. gnuDC: Gnutella Distributed Crawler (2/6)
- Design issues of a distributed P2P crawler
- Obtain network statistics in a small interval.
  - A P2P network might be very large, which implies that sequential discovery won't return the expected results.
  - Parallelizing the discovery process might be easy: partition the hosts to be discovered among K parallel crawlers.
- Scale with the network size.
  - A few years ago Gnutella had a few thousand hosts. Today it has 250,000 at any given moment. Distributed discovery is a must.
- What is desirable?
  - Purely distributed approaches?
  - Distributed approaches with centralized indexes (e.g. SETI@home)?
  - gnuDC is based on a hybrid approach where each crawler runs in its own memory space, logs information on local disks and notifies a central index when new IPs are found.
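One simple way to partition hosts among K parallel crawlers, as suggested above, is to map each IP address deterministically to a crawler number. The slides do not say how gnuDC assigns hosts, so the modulo scheme and the names `assign_crawler` and `partition` below are assumptions made purely for illustration.

```python
import ipaddress


def assign_crawler(ip, k):
    """Map an IPv4 address to one of k crawlers using its integer value."""
    return int(ipaddress.IPv4Address(ip)) % k


def partition(hosts, k):
    """Split hosts into k buckets, one per crawler."""
    buckets = [[] for _ in range(k)]
    for ip in hosts:
        buckets[assign_crawler(ip, k)].append(ip)
    return buckets


if __name__ == "__main__":
    # Documentation-range addresses (192.0.2.0/24) used as stand-ins for peers.
    peers = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
    for crawler, bucket in enumerate(partition(peers, k=2)):
        print(crawler, bucket)
```

Because the mapping depends only on the address, every crawler can decide locally whether a newly discovered peer belongs to it, without consulting the central index first.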
4. gnuDC: Gnutella Distributed Crawler (3/6)
- Design issues of a large distributed P2P crawler (contd)
- Maintain network health.
  - The crawlers should not affect the regular operation of the network.
  - Typically a message's TTL is decreased when it traverses a crawler. This shouldn't happen!
- Platform independence.
  - Our distributed crawler is intended to run on a NOW (Network of Workstations).
  - Networks of Workstations are typically heterogeneous (Linux, SunOS, Unix).
  - Java is based on a "write once, run anywhere" philosophy.
  - It also provides a strong core for networking, threads, RMI and more.
  - This makes it an ideal language for our purpose.
4. gnuDC: Gnutella Distributed Crawler (4/6)
- gnuDC architecture.
  - It consists of an IP Index Server, several distributed gnuBricks, a Log Aggregator and a Log Analyzer.
  - Components operate asynchronously and independently.
  - The whole system is bootstrapped by a single Unix script.
4. gnuDC: Gnutella Distributed Crawler (5/6)
- IP Index Server
  - Multi-threaded engine which maintains and indexes IP addresses of active Gnutella peers.
  - Uses double buffering for flushing results to secondary storage.
  - Sustains high loads, indexing at an average rate of 2,500 IPs/sec with a peak of 5,000 IPs/sec.
  - The cost of the in-memory data structures is 300MB for 240,000 IPs.
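The double-buffering idea mentioned for the IP Index Server can be sketched as follows. This is not the gnuDC implementation (which is written in Java and not shown in the slides); the class name `DoubleBufferedIndex`, the `sink` callback and the flush threshold are all hypothetical. The point is that writers keep appending to a fresh buffer while the full one is handed off for slow disk I/O.

```python
import threading


class DoubleBufferedIndex:
    """Deduplicating IP index that flushes full buffers outside the lock."""

    def __init__(self, flush_threshold, sink):
        self._seen = set()          # every IP indexed so far
        self._active = []           # buffer currently receiving new IPs
        self._lock = threading.Lock()
        self._flush_threshold = flush_threshold
        self._sink = sink           # callable that persists a list of IPs

    def add(self, ip):
        """Record ip; return True if it had not been seen before."""
        full = None
        with self._lock:
            if ip in self._seen:
                return False
            self._seen.add(ip)
            self._active.append(ip)
            if len(self._active) >= self._flush_threshold:
                # Swap buffers: new arrivals go to an empty list right away.
                full, self._active = self._active, []
        if full is not None:
            self._sink(full)        # slow I/O happens without holding the lock
        return True

    def flush(self):
        """Persist whatever is left in the active buffer."""
        with self._lock:
            full, self._active = self._active, []
        if full:
            self._sink(full)


if __name__ == "__main__":
    written = []
    index = DoubleBufferedIndex(flush_threshold=2, sink=written.append)
    for address in ["192.0.2.1", "192.0.2.1", "192.0.2.2", "192.0.2.3"]:
        index.add(address)
    index.flush()
    print(written)
```

In a real deployment `sink` would append to a file on secondary storage; here a list stands in for the disk so the behaviour is easy to inspect.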
4. gnuDC: Gnutella Distributed Crawler (6/6)
- gnuDC Bricks
  - Configurable and self-adaptive Gnutella clients.
  - Implementation based on the Jtella API.
  - gnuDC bricks are independent of each other and run in different memory spaces.
- Log Aggregator
  - Collects and aggregates data that is dispersed on the remote disks of the gnuDC bricks.
  - Uses ssh along with bash scripts to make the harvesting process easy.
- Log Analyzer
  - Combination of bash scripts, C routines and Java programs for analyzing the harvested data based on various criteria.
  - Aggregating and analyzing takes approximately 7-10 minutes for 700MB of log traces.
5. Experiments (1/6)
- We deployed gnuDC with 85 nodes on 17 AMD Athlon 4 machines (1.4 GHz, 1GB RAM) running Mandrake Linux 8.0, interconnected by a 10/100 LAN.
- On the 1st of June 2002, we performed our first "long" crawl.
- We also performed several smaller-scale experiments to gather data on specific issues.
- Technical difficulties.
  - We crawled only during early morning hours (i.e. 1:30 a.m. - 6:30 a.m.) because during weekdays the machines were used by students.
  - Huge amounts of log traces, e.g. 700MB in 5 hours, so we had problems due to quota limitations.
  - The department's administrators blocked any remote access (i.e. establishing a TCP connection on any port of a lab machine).
    - => the crawler couldn't accept any incoming connections.
    - => the degree of a gnuBrick decreased in this way from 100 to 30 connections.
5. Experiments (2/6)
- Analysis of Gnutella Messages (ALL)
  - Our sample includes 56 million messages.
  - The communication overhead (ping/pong) of Gnutella is 63%.
  - The utilization of the network (query/queryhit) is only 37%.
  - The huge communication overhead might be due to the fact that Gnutella network connections are highly unstable.
  - The proportion of queries with queryhits is satisfactory, although we can't say whether users are satisfied by the actual query results.
  - General queries such as "mpeg video" may inflate this number.
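The overhead and utilization percentages above are shares of message counts by descriptor type. A minimal sketch of that calculation is shown below; it assumes the log has already been reduced to a sequence of type names and that PUSH descriptors are left out of the total, neither of which is stated in the slides. The sample data is invented for illustration and is not taken from the crawl.

```python
from collections import Counter

OVERHEAD = {"PING", "PONG"}
UTILIZATION = {"QUERY", "QUERYHIT"}


def traffic_shares(descriptor_types):
    """Return (overhead %, utilization %) over ping/pong/query/queryhit messages."""
    counts = Counter(descriptor_types)
    overhead = sum(counts[t] for t in OVERHEAD)
    utilization = sum(counts[t] for t in UTILIZATION)
    total = overhead + utilization
    if total == 0:
        return 0.0, 0.0
    return 100.0 * overhead / total, 100.0 * utilization / total


if __name__ == "__main__":
    # Invented sample: 1 PING, 4 PONGs, 2 QUERYs, 1 QUERYHIT.
    sample = ["PING"] + ["PONG"] * 4 + ["QUERY"] * 2 + ["QUERYHIT"]
    print(traffic_shares(sample))
```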
5. Experiments (2/6, contd)
- Analysis of Gnutella Messages (ALL)
  - We observed a correlation between the flows of Ping/Pong and Query/QueryHit pairs, although there is formally no relation between them.
  - It would be interesting to investigate this further.
  - Ironically, although a Ping message generates many Pong messages (4x), a Query message generates a QueryHit only 1/8 of the time.
5. Experiments (3/6)
- b) Analysis of Query Messages
  - We analyzed 15,153,524 query messages.
  - High locality of specific queries, which might enable better caching policies.
  - Gnutella users are looking for Video > Audio > Images > Documents.
  - We observed three classes of searchers:
    - Seasonal-Content Searchers: search patterns depend on the time of crawling.
    - Adult-Content Searchers: constant search patterns over time.
    - File-Extension Searchers: constant search patterns over time.
[Figures: a) Ranking by query. b) Ranking by file extension.]
5. Experiments (4/6)
- c) Analysis of Gnutella IP Addresses
  - We analyzed 294,000 unique IP addresses (the initial number was larger, but we filtered out IP addresses designated for private networks, i.e. 192.168.x.x, 172.16.x.x and 10.x.x.x).
  - We implemented MRDL, a Multithreaded Reverse DNS Lookup engine, which resolves 100 IPs/second in parallel.
  - MRDL resolved 244,522 IP addresses; 16.92% were not resolvable.
  - Which domains are Gnutella users coming from?
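The two steps described on this slide, dropping private addresses and then resolving the rest in parallel, can be sketched with the Python standard library. This is not MRDL itself, whose code is not shown in the slides. The sketch filters on the RFC 1918 blocks, which may differ from the exact filter the authors used, and the names `is_private`, `reverse_lookup` and `resolve_all` as well as the worker count are assumptions.

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor

# RFC 1918 private address blocks.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private(ip):
    """True if ip falls inside one of the RFC 1918 blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in PRIVATE_NETWORKS)


def reverse_lookup(ip):
    """Return the host name for ip, or None if it cannot be resolved."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def resolve_all(ips, resolver=reverse_lookup, workers=100):
    """Resolve all public IPs concurrently; returns {ip: host name or None}."""
    public = [ip for ip in ips if not is_private(ip)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        names = list(pool.map(resolver, public))
    return dict(zip(public, names))


if __name__ == "__main__":
    print(resolve_all(["10.1.2.3", "192.168.0.5", "192.0.2.7"]))
```

Threads suit this task because each lookup spends almost all of its time waiting on the DNS server. The `resolver` parameter lets the lookup function be swapped for a stub when no network is available.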
5. Experiments (5/6)
- c) Analysis of Gnutella IP Addresses (contd)
  - Which ISPs are paying the price of the Gnutella infrastructure?
  - US, German, Canadian, French and English ISPs dominate.
  - We haven't validated whether this ranking reflects the actual size of each ISP.
  - Interestingly, Asian ISPs (e.g. from Japan) rank very low, although they are technologically advanced.
5. Experiments (6/6)
- d) Analysis of Hop Count found in IP Addresses (contd)
  - Gnutella clients conform to the protocol specification.
  - Only a few queries come from farther than 7 hops.
  - The protocol thwarts excessive consumption of network resources.
  - The bar graph presents a bimodal distribution with 2 peaks, at hop counts 1 and 7. It would be interesting to investigate why so many queries come from so close (i.e. 1 hop). This is probably because connections are weak.
[Figures: hop-count distribution on a log/normal scale and on a normal/normal scale.]
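A hop-count distribution like the one in the bar graph can be derived from the Hops field of each logged query. The sketch below assumes the hop values have already been extracted into a list; the function names and the sample values are invented for illustration and do not reproduce the measured data.

```python
from collections import Counter


def hop_histogram(hops):
    """Count queries per hop value, ordered by hop count."""
    return dict(sorted(Counter(hops).items()))


def fraction_beyond(hops, max_hops=7):
    """Share of queries that travelled more than max_hops hops."""
    if not hops:
        return 0.0
    return sum(1 for h in hops if h > max_hops) / len(hops)


if __name__ == "__main__":
    # Invented sample with peaks at 1 and 7 hops.
    sample = [1, 1, 1, 2, 3, 7, 7, 7, 8]
    print(hop_histogram(sample))
    print(fraction_beyond(sample))
```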
6. Conclusions and Future Work
- Summary of main observations
  - 1. The Gnutella communication overhead is huge.
    - Ping/Pong: 63%, Query/QueryHit: 37%.
  - 2. Gnutella users can be classified into three main categories.
    - Seasonal-Content, Adult-Content and File-Extension Searchers.
  - 3. Gnutella users are mainly interested in video > audio > images > documents.
  - 4. Although Gnutella is a truly international phenomenon, its largest segment is contributed by only a few countries.
  - 5. Clients have started conforming to the protocol specification, thereby thwarting excessive consumption of network resources.
- We are interested in examining more carefully other data (e.g. User-Agents) that we have collected but haven't analyzed due to lack of time.
- Our metrics might facilitate the development of more advanced P2P protocols that take into consideration the bottlenecks and characteristics of current solutions, such as Gnutella.