Title: Discovering Communities Among File Shares
1. Discovering Communities Among File Shares
2. Outline
- Background Information
- Problem Statement
- Data Collection
- Model
- Results
3. Background Information
- Flatlan (www.flatlan.com) is used on LANs to
find shared files (a flat view of the hierarchy of
files on a LAN)
- Napster, Kazaa, and Gnutella are used to share files
across the Internet
4. Background (continued): Advantages of Flatlan over other P2P
- LANs are very fast compared to the internet as a
whole
- Different server per network, allowing very
simple and effective scaling, and high levels of
customization
- No software to download (for end user)
- Possibly more legal than other P2P models
5. Background (continued): How files are shared
- File sharing among users follows a power law
(roughly 80% of the files are shared by 20% of users)
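One way to check the 80/20 claim against a crawl is to sort users by the number of files they share and measure what fraction of all files the top 20% hold. The sketch below assumes a mapping from host to file count; the function name `top_share` and the counts are invented for illustration and are not measured data.

```python
import math

def top_share(files_per_user, top_fraction=0.2):
    """Return the fraction of all shared files held by the top users."""
    counts = sorted(files_per_user.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Number of users in the top group (at least one).
    k = max(1, math.ceil(len(counts) * top_fraction))
    return sum(counts[:k]) / total

# Hypothetical file counts per host, invented for illustration only.
files_per_user = {"host1": 800, "host2": 60, "host3": 50, "host4": 50, "host5": 40}
share = top_share(files_per_user)
```

A value of `share` near 0.8 would be consistent with the rough 80/20 split; it does not by itself establish a power-law distribution.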
6. Problem Statement
- Want to model the sharing of files as a graph and
find patterns
- Every graph is generated from one or more keyword
searches on a Flatlan database
- A node on a graph represents a computer that is
sharing at least one file that matched the search
criteria
- An edge is drawn between two nodes if they share at
least one identical file
- Identical: (FileName1, Size1) = (FileName2, Size2)
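The graph construction described above can be sketched as follows. This is a minimal illustration, assuming search results arrive as (host, file name, size in bytes) tuples; the record format, the host names, the byte sizes, and the function `build_share_graph` are assumptions, and names are compared exactly (case-sensitive), which may differ from what Flatlan does.

```python
from collections import defaultdict
from itertools import combinations

def build_share_graph(records):
    """Build (nodes, edges) from (host, file_name, size_bytes) records.

    Two hosts are joined by an edge when they share at least one file
    with the same name and the same size.
    """
    hosts_by_file = defaultdict(set)
    nodes = set()
    for host, name, size in records:
        nodes.add(host)
        hosts_by_file[(name, size)].add(host)

    edges = set()
    for hosts in hosts_by_file.values():
        # Every pair of hosts holding this identical file gets an edge.
        for a, b in combinations(sorted(hosts), 2):
            edges.add(frozenset((a, b)))
    return nodes, edges

# Hypothetical search results; hosts and byte sizes are invented.
records = [
    ("host1", "Eminem - The Real Slim Shady.mp3", 6570000),
    ("host2", "Eminem - The Real Slim Shady.mp3", 6570000),
    ("host3", "(Eminem) The Real Slim Shady.mp3", 6400000),
    ("host4", "Eminem - The Real Slim Shady.mp3", 6570123),
]
nodes, edges = build_share_graph(records)
```

Here only `host1` and `host2` are connected: `host3` has a different name and size, and `host4` has the same name but a different size, so both remain isolated nodes.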
7. Example
Carol
Eminem - The Real Slim Shady.mp3, 6.57MB
Eminem - The Real Slim Shady.mp3, 6.57MB
Bob
Dave
Eminem - The Real Slim Shady.mp3, 6.57MB
Eminem - The Marshall Mathers LP - 07 - The Way I Am.mp3, 4.42MB
Alice
(Eminem) The Real Slim Shady.mp3, 6.4MB
8. Other Design Issues
- Is //128.113.147.101/share/music/Eminem/Eminem - The Marshall Mathers/18-criminal.mp3
a valid match? (The artist name appears only in the
directory path, not in the file name.)
- Should word stemming be allowed?
- What about artists with similar names, like The Doors
vs. 3 Doors Down?
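The first and third issues above can be made concrete with a small matcher. This is only a sketch under stated assumptions: `keyword_matches` is a hypothetical helper, it matches whole lowercase tokens without stemming, and it is not how Flatlan itself searches. The two sample file names are illustrative.

```python
import re

def keyword_matches(path, keyword, filename_only=True):
    """True if `keyword` is a whole token of the file name (or the full path)."""
    target = path.rsplit("/", 1)[-1] if filename_only else path
    tokens = re.findall(r"[a-z0-9]+", target.lower())
    return keyword.lower() in tokens

path = "//128.113.147.101/share/music/Eminem/Eminem - The Marshall Mathers/18-criminal.mp3"
name_hit = keyword_matches(path, "eminem")                       # file name only
path_hit = keyword_matches(path, "eminem", filename_only=False)  # whole path

# A single keyword cannot tell these two artists apart.
doors_hit = keyword_matches("The Doors - Light My Fire.mp3", "doors")
three_doors_hit = keyword_matches("3 Doors Down - Kryptonite.mp3", "doors")
```

With file-name-only matching the path above is missed, while full-path matching finds it. In both modes the keyword "doors" matches files by either artist, so separating them would need a phrase-level rule.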
9. Problem Statement (continued)
- Study the properties of the resulting graphs.
- Study the evolution of these graphs (over a
period of three weeks).
- Study the differences between several networks.
- Try to estimate the impact of other P2P systems.
- Propose a simple model of altruism from the given
data.
10. Data Collected
- Collected data over a three-week period
- RPI, Cable Modem Network, Bryant, WNEC
- Syracuse and U. Texas for a shorter time
- Only looked at the MP3 sub-database
- Searched for popular artists in several genres
- Popular, Rap, Metal, Classic Rock, Classical
- Eminem, Metallica, Madonna, Mozart, Beatles
11. Major Observations
- All the graphs basically look the same,
regardless of music genre, date collected, or
location
- Central group of well connected nodes, with a
tree of other nodes off of them
- Some bigger graphs have more than one connected
component
- Many unconnected nodes, or nodes with few
neighbors
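The structural features listed above (connected components and unconnected nodes) can be computed with a breadth-first search. The sketch below assumes the graph is given as a set of node names and a collection of two-element edges; the function `summarize` and the toy graph are hypothetical and do not reproduce any collected graph.

```python
from collections import defaultdict, deque

def summarize(nodes, edges):
    """Return (components, isolated): components sorted largest first,
    isolated = sorted list of nodes with no neighbors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen = set()
    components = []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        comp = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(comp)

    components.sort(key=len, reverse=True)
    isolated = sorted(n for n in nodes if not adj[n])
    return components, isolated

# Toy graph: a triangle with one node hanging off it, plus two isolated nodes.
toy_nodes = {"a", "b", "c", "d", "e", "f"}
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
components, isolated = summarize(toy_nodes, toy_edges)
```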
12. Mozart at RPI
13. Observed Results
- There are a large number of small cliques (a file
is usually shared by a group of common friends).
- Popular songs make up many of the edges
- Many unconnected nodes
- People often rename tracks, and then the songs
are no longer counted as identical
- People have unique songs
- RPI students may like classical more than students
at other schools
14. Variations in Songs
eminem - 13 - drug ballad.mp3,4805277 1
eminem - 13 - drug ballad.mp3,4807453 1
eminem - 13 - drug ballad.mp3,7196672 1
eminem - 13 - drug ballad.mp3,7198502 1
eminem - 13 - superman.mp3,8408502 3
eminem - 13 - superman.mp3,8408630 15
eminem - 13 - superman.mp3,8409088 1
eminem - 14 - amityville.mp3,4075103 1
eminem - 14 - amityville.mp3,4079368 1
eminem - 14 - amityville.mp3,6104948 1
eminem - 14 - hailie's song.mp3,5179730 2
eminem - 14 - hailie's song.mp3,7696300 1
eminem - 14 - hailies song.mp3,7696428 16
eminem - 14 - hailies song.mp3,7698432 1
eminem - 14 - rock bottom.mp3,3425176 1
eminem - 14 - rock bottom.mp3,3428187 1
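The listing above can be grouped by file name to see how many size variants each name has. The sketch below copies a few rows from the listing and assumes the third column is the number of copies seen, which the slide does not state explicitly; `variants_by_name` and the apostrophe-stripping rule are hypothetical illustrations, not part of the original study.

```python
from collections import defaultdict

# A few rows copied from the listing: (file name, size in bytes, count).
# Reading the last column as "copies seen" is an assumption.
rows = [
    ("eminem - 13 - superman.mp3", 8408502, 3),
    ("eminem - 13 - superman.mp3", 8408630, 15),
    ("eminem - 13 - superman.mp3", 8409088, 1),
    ("eminem - 14 - hailie's song.mp3", 7696300, 1),
    ("eminem - 14 - hailies song.mp3", 7696428, 16),
]

def variants_by_name(rows, normalize=lambda name: name):
    """Group (size, count) pairs under each (optionally normalized) name."""
    groups = defaultdict(list)
    for name, size, count in rows:
        groups[normalize(name)].append((size, count))
    return dict(groups)

exact = variants_by_name(rows)
# Looser grouping: ignore apostrophes so the two spellings fall together.
loose = variants_by_name(rows, normalize=lambda n: n.replace("'", ""))
```

Exact names give three groups; dropping apostrophes merges the two "hailie's song" spellings into one. Under the (name, size) rule every one of these rows is still a different file.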
15. Distribution of Repeated Files
16. File Name Repeats
17. Distribution of Matched Keywords for eminem at RPI
18. Why Do People Share?
- Give to the community - altruism
- Because they can: technical superiority over
peers
- In order to trade files: if everyone is required to
share something, then everyone can benefit
(thinking for the group; Nash?)
19. Conclusion
- Live demonstration with W3Pal
- Questions and Answers