1
A Quantitative Analysis of the Gnutella Network Traffic
University of California, Riverside
Department of Computer Science & Engineering

cs204 Final Project by
Demetris Zeinalipour, Theodoros Folias
csyiazti@cs.ucr.edu, folias@cs.ucr.edu
Advisor: Michalis Faloutsos

Online Resources: http://www.cs.ucr.edu/csyiazti/cs204.html
2
Presentation Outline

Motivation.
Gnutella Protocol in a nutshell.
Related Work.
gnuDC Gnutella Distributed Crawler.
Experiments.
Conclusions and Future Work.

3
1. Motivation

P2P file-sharing systems such as Gnutella, Napster and Freenet realized a distributed infrastructure for sharing files.
Traditionally, files were shared using the Client-Server model (e.g. http, ftp), which is centralized and therefore not scalable.
P2P systems have shown that distributed file-searching is feasible, and that they may change the way we interact on the Internet.
Why Gnutella?
It is a pure P2P protocol, in contrast with e.g. Napster.
It is an open protocol, which allows its investigation.
It is a large community: 250,000 peers at any given moment.
It is a truly international phenomenon with a world-wide community.
It is still not clear what kind of traffic is traversing the network.

4
1. Motivation

Questions
What do these systems really look like?
What kind of traffic are these systems carrying?
What is the communication overhead of P2P?
Where are file-searchers coming from and what are they looking for?
Our contribution
We make a quantitative analysis of the Gnutella network traffic at large scale (17 machines, 85 nodes, 700MB of log traces in 5 hours).
To our knowledge, such a large-scale measurement has not been presented in any publication.
We describe design and implementation issues of a large-scale distributed Gnutella crawler.

5
2. Gnutella Protocol v0.4 (1/5)

One of the most popular file-sharing protocols.
Operates without a central Index Server (unlike Napster).
Clients (downloaders) are also servers => servents.
Clients may join or leave the network at any time => highly fault-tolerant, but at a cost!
Searches are done within the virtual network, while actual downloads are done offline, i.e. outside the virtual network (with HTTP).
The core of the protocol consists of 5 descriptors (PING, PONG, QUERY, QUERYHIT and PUSH).
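
As an illustration of what a crawler reads off the wire, here is a minimal sketch of decoding the 23-byte header that precedes every v0.4 descriptor. The field layout and type codes follow the Gnutella v0.4 specification; the names `DescriptorHeader` and `parse_header` are made up for this example and are not taken from gnuDC.

```python
import struct
from typing import NamedTuple

# Payload descriptor codes defined by the Gnutella v0.4 specification.
DESCRIPTOR_TYPES = {
    0x00: "PING",
    0x01: "PONG",
    0x40: "PUSH",
    0x80: "QUERY",
    0x81: "QUERYHIT",
}

# 16-byte descriptor ID, 1-byte type, 1-byte TTL, 1-byte hops,
# 4-byte payload length (little-endian, no padding).
HEADER_FORMAT = "<16sBBBI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


class DescriptorHeader(NamedTuple):
    descriptor_id: bytes
    kind: str
    ttl: int
    hops: int
    payload_length: int


def parse_header(data: bytes) -> DescriptorHeader:
    """Decode the fixed-size header at the start of a descriptor."""
    if len(data) < HEADER_SIZE:
        raise ValueError("a descriptor header needs at least 23 bytes")
    descriptor_id, code, ttl, hops, length = struct.unpack(
        HEADER_FORMAT, data[:HEADER_SIZE]
    )
    kind = DESCRIPTOR_TYPES.get(code, "UNKNOWN")
    return DescriptorHeader(descriptor_id, kind, ttl, hops, length)
```

The `kind`, `ttl` and `hops` fields are the inputs that message-type and hop-count statistics like the ones later in this deck would be built from.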

6
2. Gnutella Protocol (2/5)

It is important to understand how the protocol
works in order to understand our framework.
A peer (p) needs to connect to 1 or more other Gnutella peers in order to participate in the virtual network.
p initially doesn't know the IPs of its fellow file-sharers.

7
2. Gnutella Protocol (3/5)

a. HostCaches: The initial connection
p connects to a HostCache H to obtain a set of IP addresses of active peers.
p might alternatively probe its own cache to find peers it was connected to in the past.

[Figure: (1) p requests and receives a set of active peers from HostCache H; (2) p connects to the network.]
8
2. Gnutella Protocol (4/5)

b. Ping/Pong: The communication overhead
Although p is already connected, it must discover new peers since its current connections may break.
Thus, it periodically sends PING messages, which are broadcast (message flooding).
If a host, e.g. p2, is available, it will respond with a PONG (routed back only along the same path the PING came from).
p might utilize this response and attempt a connection to p2 in order to increase its degree.
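
The flooding and reverse-path behaviour described above can be sketched on a toy overlay. This is a simplified model, not servent code: the overlay is a plain adjacency dictionary, duplicates are dropped by remembering which peers were already reached, and the names `flood_ping` and `pong_route` are invented for the example.

```python
from collections import deque


def flood_ping(graph, origin, ttl):
    """Flood a PING from origin; return {peer: (hops, previous_peer)}."""
    reached = {origin: (0, None)}
    queue = deque([(origin, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: the descriptor is not forwarded
        for neighbour in graph[peer]:
            if neighbour in reached:
                continue  # already saw this descriptor: drop the duplicate
            # Remember where the PING came from so a PONG can travel back.
            reached[neighbour] = (reached[peer][0] + 1, peer)
            queue.append((neighbour, remaining - 1))
    return reached


def pong_route(reached, responder):
    """Path a PONG takes from responder back to the PING's origin."""
    path = [responder]
    while reached[path[-1]][1] is not None:
        path.append(reached[path[-1]][1])
    return path
```

With a chain p - a - p2 - c and a TTL of 2, the PING reaches a and p2 but not c, and the PONG from p2 returns via a.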

[Figure: (1) p floods a PING into Gnutella network N; (2) servent p2 answers with a PONG.]
9
2. Gnutella Protocol (5/5)

c. Query/QueryHit: The utilization
Query descriptors contain unstructured queries, e.g. 'celine dion mp3'.
Like PINGs, they are broadcast, with a typical TTL=7.
If a host, e.g. p2, matches the query, it will respond with a QueryHit descriptor.
d. Push: Enable downloads from peers that are firewalled.
If a peer is firewalled => we can't connect to it. Hence we ask it to establish a connection to us and send us the file.

[Figure: (1) p floods a QUERY into Gnutella network N; (2) servent p2 answers with a QUERYHIT.]
10
3. Related Work (1/3)

a. Simulating Peer-to-Peer Systems
Most researchers use simulation testbeds (e.g. Anthill) to validate the performance improvement they gain from new ideas (routing algorithms etc.).
Initial assumptions (e.g. degree of nodes, graph type: random, power-law, tree) might be wrong though!
Visualizations might also not be very helpful.
What we need instead are real network metrics from a large P2P network such as Gnutella.

11
3. Related Work (2/3)

a. Obtaining data from different physical locations
"Tracing a large-scale Peer to Peer System: An hour in the life of Gnutella", E. Markatos, CCGrid 2002.
They obtained traffic log traces from 3 different physical locations (Norway, Greece, USA).
The data collected at all three locations are almost identical.
They found that Gnutella traffic is bursty and remains bursty over several time scales.
The results also show high locality patterns in QUERY messages. This observation might lead to better caching policies at peers.
Their study also reveals a mismatch between the physical topology and the virtual Gnutella topology, since the collected data are identical among their 3 different crawlers.

12
3. Related Work (3/3)

b. Obtaining real network data
LimeWire shows that there are on average 250,000 hosts at any given moment.
They also show that only a small fraction of these hosts accept incoming connections.
GnutellaMeter.com also monitors the network by attaching itself to well-positioned peers (i.e. high degree) in the network. They present the top queries.
Clip2 showed that the network diameter in 2000 was 22, indicating that some regions of the network were not communicating with others.
Clip2 also showed that most Gnutella searchers are looking for video/audio media.
How have these trends changed?

13
4. gnuDC Gnutella Distributed Crawler (1/6)

gnuDC is a large-scale distributed Gnutella crawler which monitors the network by attaching itself to it with a large number of peers.
A key difference between a WWW crawler and a P2P crawler is that the latter needs to obtain its results (a snapshot) in a relatively short amount of time.
Design issues and an architecture for a distributed P2P crawler have not been described in any other publication.
What should be the responsibilities of a P2P crawler and how should we design it?

14
4. gnuDC Gnutella Distributed Crawler (2/6)

Design Issues of a Distributed P2P Crawler
Obtain Network Statistics in a small Interval.
A P2P network might be very large, which implies that sequential discovery won't return the expected results.
Parallelizing the discovery process might be easy: partition the hosts to be discovered among K parallel crawlers.
Scale with the Network Size.
A few years ago Gnutella had a few thousand hosts. Today it has 250,000 at any given moment.
Distributed discovery is a must.
What is desirable?
Purely distributed approaches?
Distributed approaches with centralized indexes (e.g. SETI@home)?
gnuDC is based on a hybrid approach where each crawler runs in its own memory space, logs information on local disks and notifies a central index when new IPs are found.
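
One way to picture the two ideas above (partitioning hosts among K crawlers, and a central index that is told about new IPs) is the sketch below. The slides do not say how gnuDC assigns hosts to crawlers, so the hash-based assignment is an assumption made for illustration, and `crawler_for`, `partition` and `CentralIndex` are hypothetical names.

```python
import hashlib


def crawler_for(ip: str, k: int) -> int:
    """Map an IP to one of k crawlers with a stable hash (assumed scheme)."""
    digest = hashlib.sha1(ip.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % k


def partition(hosts, k):
    """Split the hosts to be discovered into k disjoint work lists."""
    buckets = [[] for _ in range(k)]
    for ip in hosts:
        buckets[crawler_for(ip, k)].append(ip)
    return buckets


class CentralIndex:
    """Stand-in for a central IP index: remembers every IP reported so far."""

    def __init__(self):
        self._seen = set()

    def report(self, ip: str) -> bool:
        """Record ip; return True only the first time it is reported."""
        if ip in self._seen:
            return False
        self._seen.add(ip)
        return True
```

A stable hash means two crawlers that learn of the same address independently agree on who should probe it, without talking to each other first.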

15
4. gnuDC Gnutella Distributed Crawler (3/6)

Design Issues of a Large Distributed P2P Crawler (cont'd)
Maintain Network Health.
The crawlers should not affect the regular operation of the network.
Typically a message's TTL is decreased when it traverses a crawler. This shouldn't happen!
Platform Independence.
Our distributed crawler is meant to run on a NOW (Network of Workstations).
Networks of Workstations are typically heterogeneous (Linux, SunOS, Unix).
Java is based on a "write once, run anywhere" philosophy.
It also provides a strong core for networking, threads, RMI and others.
This makes it an ideal language for our purpose.

16
4. gnuDC Gnutella Distributed Crawler (4/6)

gnuDC Architecture.
It consists of an IP Index Server, several
distributed gnuBricks, a Log Aggregator and a Log
Analyzer.
Components operate asynchronously and
independently.
The whole system is bootstrapped by a single Unix script.

17
4. gnuDC Gnutella Distributed Crawler (5/6)

IP Index Server
Multi-threaded engine which maintains and indexes the IP addresses of active Gnutella peers.
Uses double buffering for flushing results to secondary storage.
Sustains high loads and indexes at an average rate of 2,500 IPs/sec, with a peak of 5,000 IPs/sec.
The cost of the in-memory data structures is 300MB for 240,000 IPs.
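
The slides do not show how the Index Server implements double buffering, so the following is only a generic sketch of the idea: writers fill one buffer while a full one is handed off to be written out. The class name, the `flush` callback and the `capacity` parameter are assumptions for this example.

```python
import threading


class DoubleBufferedIndex:
    """Collect IPs in an active buffer; hand full buffers to a flush callback."""

    def __init__(self, flush, capacity=1000):
        self._flush = flush          # e.g. a function that appends to a file
        self._capacity = capacity
        self._active = []
        self._lock = threading.Lock()

    def add(self, ip):
        full = None
        with self._lock:
            self._active.append(ip)
            if len(self._active) >= self._capacity:
                # Swap in an empty buffer so other threads can keep adding.
                full, self._active = self._active, []
        if full is not None:
            # The slow write happens outside the lock, in the calling thread.
            self._flush(full)

    def close(self):
        """Flush whatever is left in the active buffer."""
        with self._lock:
            rest, self._active = self._active, []
        if rest:
            self._flush(rest)
```

Because the lock is held only for the append and the swap, a slow disk write delays the one thread that triggered it rather than every crawler reporting IPs.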

18
4. gnuDC Gnutella Distributed Crawler (6/6)

gnuDC Bricks
Configurable and self-adaptive Gnutella clients.
Implementation based on the JTella API.
gnuDC bricks are independent of each other and run in different memory spaces.
Log Aggregator
Collects and aggregates the data that is dispersed on the remote disks of the gnuDC bricks.
Uses ssh along with bash scripts to make the harvesting process easy.
Log Analyzer
Combination of bash scripts, C routines and Java programs for analyzing the harvested data based on various criteria.
Aggregating and analyzing takes approximately 7-10 minutes for 700MB of log traces.

19
5. Experiments (1/6)

We deployed gnuDC on 85 nodes running on 17 AMD Athlon 4 machines (1.4 GHz, 1GB RAM) running Mandrake Linux 8.0 and interconnected by a 10/100 LAN.
On the 1st of June 2002, we performed our first "long" crawl.
We also performed several other small-scale experiments to gather data on specific issues.
Technical Difficulties.
We were crawling only during early morning hours (i.e. 1:30 a.m. - 6:30 a.m.) because during weekdays the machines were used by students.
Huge amounts of log traces, e.g. 700MB of log traces in 5 hours, so we had problems due to quota limitations.
The department's administrators blocked any remote access (i.e. establishing a TCP connection on any port number of a lab machine).
=> The crawler couldn't accept any incoming connections.
=> The degree of a gnuBrick decreased in this way from 100 to 30 connections.

20
5. Experiments (2/6)

a) Analysis of Gnutella Messages (ALL)
Our sample includes 56 million messages.
The communication overhead (ping/pong) of Gnutella is 63%.
The utilization of the network (query/queryhit) is only 37%.
The huge communication overhead might be due to the fact that Gnutella network connections are highly unstable.
The proportion of queries with queryhits is satisfactory, although we can't say whether users are satisfied by the actual query results.
General queries such as mpeg video may increase this number.

21
5. Experiments (2/6)

a) Analysis of Gnutella Messages (ALL)
We observed a correlation between the flow of Ping/Pong and Query/QueryHit pairs, although there is formally no relation.
It would be interesting to investigate this further.
Ironically, although a Ping message generates many Pong messages (4x), a Query message generates a QueryHit only 1/8 of the time.
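
For readers who want to reproduce this kind of breakdown on their own traces, here is a small sketch of how the overhead/utilization split and the responses-per-request ratios could be computed from a list of descriptor types. The function names are invented for the example, and any input used with it here is toy data, not the deck's 56 million messages.

```python
from collections import Counter

OVERHEAD = {"PING", "PONG"}            # keeps the overlay connected
UTILIZATION = {"QUERY", "QUERYHIT"}    # actual searching


def traffic_shares(kinds):
    """Return (overhead_share, utilization_share) over the four types."""
    counts = Counter(kinds)
    total = sum(counts[k] for k in OVERHEAD | UTILIZATION)
    if total == 0:
        raise ValueError("no PING/PONG/QUERY/QUERYHIT descriptors in input")
    overhead = sum(counts[k] for k in OVERHEAD) / total
    return overhead, 1.0 - overhead


def responses_per_request(kinds, request, response):
    """Average number of responses observed per request descriptor."""
    counts = Counter(kinds)
    if counts[request] == 0:
        raise ValueError("no %s descriptors in input" % request)
    return counts[response] / counts[request]
```

PUSH descriptors are left out of both shares because the slides report only the Ping/Pong and Query/QueryHit pairs.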

22
5. Experiments (3/6)

b) Analysis of Query Messages
We analyzed 15,153,524 query messages.
High locality of specific queries. This might enable better caching policies.
Gnutella users are looking for Video > Audio > Images > Documents.
We observed three classes of searchers:
Seasonal-Content Searchers - search patterns depend on the time of crawling.
Adult-Content Searchers - constant search patterns over time.
File Extension Searchers - constant search patterns over time.

[Figures: a) Ranking by query. b) Ranking by file extension.]
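
The two rankings can be approximated with a few lines of counting. This is a sketch, not the Log Analyzer: the extension-to-media-type table is a small hypothetical sample, and a query is classified by the first token that ends in a known extension.

```python
from collections import Counter

# Hypothetical, deliberately incomplete mapping used only for illustration.
MEDIA_TYPE = {
    "mpg": "video", "mpeg": "video", "avi": "video",
    "mp3": "audio", "wav": "audio",
    "jpg": "image", "gif": "image",
    "pdf": "document", "doc": "document",
}


def rank_queries(queries, top=10):
    """Most frequent query strings, ignoring case and extra whitespace."""
    normalized = (" ".join(q.lower().split()) for q in queries)
    return Counter(q for q in normalized if q).most_common(top)


def rank_media_types(queries):
    """Count queries per media type, based on the first known extension."""
    counts = Counter()
    for query in queries:
        for token in query.lower().split():
            ext = token.rsplit(".", 1)[-1]   # "song.mp3", "*.mp3", "mp3"
            if ext in MEDIA_TYPE:
                counts[MEDIA_TYPE[ext]] += 1
                break
    return counts.most_common()
```

Normalizing case and whitespace before counting matters for locality measurements, since 'Celine Dion MP3' and 'celine dion mp3' should count as the same query.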
23
5. Experiments (4/6)

c) Analysis of Gnutella IP Addresses
We analyzed 294,000 unique IP addresses (the initial number was larger, but we filtered out IP addresses designated for private networks, i.e. 192.x.x.x, 172.16.x.x and 10.x.x.x).
We implemented MRDL (Multithreaded Reverse DNS Lookup Engine), which resolves 100 IPs/second in parallel.
MRDL resolved 244,522 IP addresses. 16.92% were not resolvable.
Which domains are Gnutella users coming from?
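
A rough sketch of the two steps on this slide, filtering private addresses and resolving the rest in parallel, is given below. It is not MRDL: the private ranges used are the three RFC 1918 blocks (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), which is an assumption and narrower than the "192.x.x.x" shorthand on the slide, and the worker count and function names are illustrative.

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor

# RFC 1918 private address blocks (assumed to be what the filter targets).
PRIVATE_NETWORKS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_private(ip: str) -> bool:
    address = ipaddress.ip_address(ip)
    return any(address in net for net in PRIVATE_NETWORKS)


def reverse_lookup(ip: str):
    """Return the host name for ip, or None if it does not resolve."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def resolve_all(ips, resolver=reverse_lookup, workers=100):
    """Resolve the public addresses in ips using a pool of threads."""
    public = [ip for ip in ips if not is_private(ip)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(public, pool.map(resolver, public)))


def unresolved_share(results) -> float:
    """Fraction of looked-up addresses that did not resolve."""
    return sum(1 for name in results.values() if name is None) / len(results)
```

Threads suit this job because each lookup spends nearly all of its time waiting on the network; passing a different `resolver` also makes the function easy to exercise without touching DNS.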

24
5. Experiments (5/6)

c) Analysis of Gnutella IP Addresses (cont'd)
Which ISPs are paying the price of the Gnutella infrastructure?
US, German, Canadian, French and English ISPs are dominating.
We haven't validated whether this rank reflects the actual size of each ISP.
Interestingly, Asian ISPs (e.g. from Japan) are listed very low in this rank, although they are technologically advanced.

25
5. Experiments (6/6)

d) Analysis of Hop Count found in IP Addresses (cont'd)
Gnutella clients are conforming to the protocol specifications.
Only a few queries are coming from farther than 7 hops.
The protocol thwarts excessive consumption of network resources.
The bar graph presents a bimodal distribution with 2 peaks, at hop counts 1 and 7. It would be interesting to investigate why so many queries are coming from so close (i.e. 1 hop). It is probably because connections are weak.

[Figures: hop-count distribution on a log/normal scale and on a normal/normal scale.]
26
6. Conclusions and Future Work

Summary of main observations
1. The Gnutella communication overhead is huge: Ping/Pong 63%, Query/QueryHit 37%.
2. Gnutella users can be classified in three main categories: Seasonal-Content, Adult-Content and File Extension Searchers.
3. Gnutella users are mainly interested in video > audio > images > documents.
4. Although Gnutella is a truly international phenomenon, its largest segment is contributed by only a few countries.
5. The clients have started conforming to the specifications of the protocol, and they thwart excessive consumption of network resources.
We are interested in examining more carefully other data (e.g. User-Agents) that we have collected but haven't analyzed due to time shortage.
Our metrics might facilitate the development of more advanced P2P protocols which take into consideration various bottlenecks and characteristics of current solutions, such as Gnutella.