Title: Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional Data
1 Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional Data
- Christos Faloutsos
- Carnegie Mellon University
2 THANK YOU!
- Prof. Jiawei Han
- Prof. Kevin Chang
3 Overview
- Goals/motivation: find patterns in large datasets
- (A) Sensor data
- (B) network/graph data
- Solutions: self-similarity and power laws
- Discussion
4 Applications of sensors/streams
- Smart house: monitoring temperature, humidity, etc.
- Financial, sales, economic series
5 Motivation - Applications
- Medical: ECGs, blood pressure, etc. monitoring
- Scientific data: seismological; astronomical; environment / anti-pollution; meteorological
6 Motivation - Applications (contd.)
- civil/automobile infrastructure
- bridge vibrations [Oppenheim02]
- road conditions / traffic monitoring
7 Motivation - Applications (contd.)
- Computer systems
- web servers (buffering, prefetching)
- network traffic monitoring
- ...
http://repository.cs.vt.edu/lbl-conn-7.tar.Z
8 Problem definition
- Given: one or more sequences
- $x_1, x_2, \ldots, x_t, \ldots$ ($y_1, y_2, \ldots, y_t, \ldots$)
- Find
- patterns; clusters; outliers; forecasts
9 Problem 1
- Find patterns in large datasets
[Figure: bytes vs. time]
10 Problem 1
- Find patterns in large datasets
[Figure: bytes vs. time]
Poisson: indep., ident. distr.
12 Problem 1
- Find patterns in large datasets
[Figure: bytes vs. time]
Poisson: indep., ident. distr.
Q: Then, how to generate such bursty traffic?
13 Overview
- Goals/motivation: find patterns in large datasets
- (A) Sensor data
- (B) network/graph data
- Solutions: self-similarity and power laws
- Discussion
14 Problem 2 - network and graph mining
- What does the Internet look like?
- What does the web look like?
- What constitutes a normal social network?
- What is the market value of a customer?
- Which gene/species affects the others the most?
15 Problem 2
- Which node to market to / defend / immunize first?
- Are there unnatural sub-graphs (e.g., criminals' rings)?
[Figure: from Lumeta, ISPs, 6/1999]
16 Solutions
- New tools: power laws, self-similarity and fractals work where traditional assumptions fail
- Let's see the details
17 Overview
- Goals/motivation: find patterns in large datasets
- (A) Sensor data
- (B) network/graph data
- Solutions: self-similarity and power laws
- Discussion
18 What is a fractal?
- self-similar point set, e.g., Sierpinski triangle
- zero area: $(3/4)^\infty$
- infinite length: $(3/2)^\infty$
...
Q: What is its dimensionality??
19 What is a fractal?
- self-similar point set, e.g., Sierpinski triangle
- zero area: $(3/4)^\infty$
- infinite length: $(3/2)^\infty$
...
Q: What is its dimensionality?? A: $\log 3 / \log 2 = 1.58$ (!?!)
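The value log 3 / log 2 can be checked numerically. The sketch below is only an illustration: it uses box counting (a technique the slide does not name) on a right-angled, grid-aligned variant of the Sierpinski triangle, and all function names are made up for this example.

```python
import math

def sierpinski_cells(k):
    """Grid cells (x, y) with x & y == 0 in a 2^k x 2^k grid.

    Assumption: this bit trick gives a right-angled Sierpinski triangle,
    used here for convenience instead of the equilateral one.
    """
    n = 1 << k
    return [(x, y) for x in range(n) for y in range(n) if x & y == 0]

def box_count(cells, box):
    """Number of box-by-box squares that contain at least one cell."""
    return len({(x // box, y // box) for x, y in cells})

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def box_counting_dimension(k=7):
    """Slope of log(occupied boxes) vs. log(1 / box side)."""
    cells = sierpinski_cells(k)
    boxes = [1 << j for j in range(k)]  # box sides 1, 2, 4, ...
    log_inv_size = [math.log(1.0 / b) for b in boxes]
    log_counts = [math.log(box_count(cells, b)) for b in boxes]
    return fit_slope(log_inv_size, log_counts)

if __name__ == "__main__":
    print(box_counting_dimension(), math.log(3) / math.log(2))
```

Halving the box side triples the number of occupied boxes, which is where the ratio log 3 / log 2 comes from.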
20 Intrinsic (fractal) dimension
- Q: fractal dimension of a (finite set of points on a) line?
x y
5 1
4 2
3 3
2 4
21 Intrinsic (fractal) dimension
- Q: fractal dimension of a line?
- A: $nn(\le r) \propto r^1$ (nn = number of neighbors within distance r)
- (power law: $y = x^a$)
- Q: fd of a plane?
- A: $nn(\le r) \propto r^2$
- fd = slope of (log(nn) vs. log(r))
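The slope rule can be tried on toy data. This is a minimal sketch, assuming integer 2-d points and counting pairs within distance r (proportional to the average neighbor count); the radii, point sets and function names are choices made for this example, not taken from the talk.

```python
import math

def count_pairs_within(points, radii):
    """For each r in radii, count unordered pairs at distance <= r."""
    sq_dists = []
    for i in range(len(points)):
        xi, yi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            sq_dists.append((xi - xj) ** 2 + (yi - yj) ** 2)
    return [sum(1 for d in sq_dists if d <= r * r) for r in radii]

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def intrinsic_dimension(points, radii=(1, 2, 4, 8)):
    """Slope of log(pair count) vs. log(r)."""
    counts = count_pairs_within(points, radii)
    return fit_slope([math.log(r) for r in radii],
                     [math.log(c) for c in counts])

if __name__ == "__main__":
    line = [(i, 0) for i in range(400)]                       # points on a line
    plane = [(i, j) for i in range(30) for j in range(30)]    # points filling a square
    print("line :", intrinsic_dimension(line))    # expected to be near 1
    print("plane:", intrinsic_dimension(plane))   # expected somewhat below 2
```

On a finite grid the plane estimate falls a little short of 2 because points near the border have fewer neighbors.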
22 Sierpinski triangle
correlation integral = CDF of pairwise distances
23 Observations: Fractals <=> power laws
- Closely related
- fractals <=>
- self-similarity <=>
- scale-free <=>
- power laws ($y = x^a$; $F = K r^{-2}$)
- (vs. $y = e^{-ax}$ or $y = x^a + b$)
24 Outline
- Problems
- Self-similarity and power laws
- Solutions to posed problems
- Discussion
25 Solution 1: traffic
- disk traces: self-similar (also: [Leland94])
- How to generate such traffic?
26 Solution 1: traffic
- disk traces (80-20 law): multifractals
[Figure: bytes vs. time]
27 80-20 / multifractals
[Figure: 20 / 80 split]
28 80-20 / multifractals
[Figure: 20 / 80 split]
- p : (1-p) in general
- yes, there are dependencies
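One way to read the p : (1-p) rule is as a recursive split of the total volume. The sketch below is an assumption-laden illustration, not the authors' generator: at every level each interval is halved, one half receives a fraction p of the volume and the other 1-p, with the heavier side picked at random. The function name and parameters are hypothetical.

```python
import random

def split_80_20(total, levels, p=0.8, rng=None):
    """Return 2**levels values that sum to `total`.

    Assumption: the side that receives the fraction p is chosen by a fair
    coin flip at every split.
    """
    rng = rng or random.Random(0)
    series = [float(total)]
    for _ in range(levels):
        finer = []
        for volume in series:
            heavy, light = p * volume, (1 - p) * volume
            if rng.random() < 0.5:
                heavy, light = light, heavy
            finer.extend((heavy, light))
        series = finer
    return series

if __name__ == "__main__":
    trace = split_80_20(total=1_000_000, levels=10)
    mean = sum(trace) / len(trace)
    print("time ticks:", len(trace))
    print("peak / mean:", max(trace) / mean)  # far above 1: bursty
```

Neighbouring time ticks share most of their split history, which is one way to see the dependencies noted on the slide.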
29 Overview
- Goals/motivation: find patterns in large datasets
- (A) Sensor data
- (B) network/graph data
- Solutions: self-similarity and power laws
- sensor/traffic data
- network/graph data
- Discussion
30 Problem 2 - topology
- What does the Internet look like? Any rules?
31 Patterns?
- avg degree is, say, 3.3
- pick a node at random - what is the degree you expect it to have?
[Figure: count vs. degree, with "?" and avg: 3.3 marked]
32 Patterns?
- avg degree is, say, 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!
[Figure: count vs. degree, with avg: 3.3 marked]
33 Patterns?
- avg degree is, say, 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!
- A: very skewed distr.
- Corollary: the mean is meaningless!
- (and std -> infinity (!))
[Figure: count vs. degree, with avg: 3.3 marked]
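A small calculation shows why the mean says little here. This sketch assumes a synthetic degree distribution P(d) proportional to d^(-2.2) on d = 1..max_degree; the exponent 2.2 and the cut-offs are illustrative values, not measurements from the talk.

```python
def degree_stats(exponent=2.2, max_degree=1000):
    """Mode, median, mean and std of P(d) ~ d**(-exponent), d = 1..max_degree."""
    degrees = range(1, max_degree + 1)
    weights = [d ** -exponent for d in degrees]
    total = sum(weights)
    probs = [w / total for w in weights]

    mode = max(degrees, key=lambda d: probs[d - 1])

    cumulative, median = 0.0, None
    for d, p in zip(degrees, probs):
        cumulative += p
        if cumulative >= 0.5:
            median = d
            break

    mean = sum(d * p for d, p in zip(degrees, probs))
    variance = sum((d - mean) ** 2 * p for d, p in zip(degrees, probs))
    return mode, median, mean, variance ** 0.5

if __name__ == "__main__":
    for cutoff in (1000, 10000, 100000):
        mode, median, mean, std = degree_stats(max_degree=cutoff)
        print(cutoff, mode, median, round(mean, 2), round(std, 1))
```

The most likely degree stays at 1 while the standard deviation keeps growing as the cut-off is raised, in line with "std -> infinity".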
34 Solution 2: Rank exponent R
- A1: Power law in the degree distribution [SIGCOMM99]
[Figure: internet domains]
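Assuming the rank exponent R is read off as the slope of log(degree) versus log(rank), with nodes sorted by decreasing degree, it could be estimated as sketched below. The least-squares fit and the synthetic degrees are assumptions of this example; no real topology data is used.

```python
import math

def rank_exponent(degrees):
    """Least-squares slope of log(degree) vs. log(rank).

    Nodes are ranked by decreasing degree; zero-degree nodes are skipped
    because their logarithm is undefined.
    """
    ordered = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(d) for d in ordered]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

if __name__ == "__main__":
    # Synthetic degrees that follow degree = 1000 * rank**(-0.8) exactly.
    synthetic = [1000.0 * r ** -0.8 for r in range(1, 201)]
    print(rank_exponent(synthetic))  # recovers the planted exponent, about -0.8
```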
35 Solution 2: Eigen Exponent E
- A2: power law in the eigenvalues of the adjacency matrix
[Figure: eigenvalue vs. rank of decreasing eigenvalue, May 2001; exponent = slope, E = -0.48]
36 Solution 2: Hop Exponent H
- A3: neighborhood function N(h) = number of pairs within h hops or less - power law, too!
[Figure: log(pairs) vs. log(hops), internet; slope = hop exponent]
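The neighborhood function can be computed by a breadth-first search from every node. The sketch below makes two assumptions that the slide does not state: pairs are ordered, and each node counts as being within 0 hops of itself. The ring graph is a toy input chosen because its counts are easy to verify by hand.

```python
import math
from collections import deque

def neighborhood_function(adj, max_hops):
    """N[h] = number of ordered pairs (u, v) at hop distance <= h, h = 0..max_hops.

    adj maps each node to a list of its neighbours.
    """
    at_distance = [0] * (max_hops + 1)
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            at_distance[dist[u]] += 1
            if dist[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    totals, running = [], 0
    for count in at_distance:
        running += count
        totals.append(running)
    return totals

def hop_exponent(adj, hops=(1, 2, 4, 8)):
    """Least-squares slope of log N(h) vs. log h."""
    n = neighborhood_function(adj, max(hops))
    xs = [math.log(h) for h in hops]
    ys = [math.log(n[h]) for h in hops]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

if __name__ == "__main__":
    size = 100
    ring = {i: [(i - 1) % size, (i + 1) % size] for i in range(size)}
    print(neighborhood_function(ring, 3))  # each node reaches 2h + 1 nodes
    print(hop_exponent(ring))
```

All-pairs breadth-first search costs O(nodes x edges), so this exact version only suits small graphs.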
37 Power laws - discussion
- do they hold, over time?
- do they hold on other graphs/domains?
38 Power laws - discussion
- do they hold, over time?
- Yes! for multiple years [Siganos]
- do they hold on other graphs/domains?
- Yes!
- web sites and links [Tomkins], [Barabasi]
- peer-to-peer graphs (gnutella-style)
- who-trusts-whom (epinions.com)
39 Time Evolution: rank R
[Figure: domain level]
- The rank exponent has not changed! [Siganos]
40 The Peer-to-Peer Topology
- Number of immediate peers (= degree) follows a power law
[Figure: count vs. degree, from [Jovanovic]]
41 epinions.com
- who-trusts-whom [Richardson & Domingos, KDD 2001]
[Figure: count vs. (out) degree]
42 Outline
- problems
- Fractals
- Solutions
- Discussion
- what else can they solve?
- how frequent are fractals?
43 What else can they solve?
- separability [KDD02]
- forecasting [CIKM02]
- dimensionality reduction [SBBD00]
- non-linear axis scaling [KDD02]
- disk trace modeling [PEVA02]
- selectivity of spatial queries [PODS94, VLDB95, ICDE00]
- ...
44 Problem 3 - spatial d.m.
- Galaxies (Sloan Digital Sky Survey, w/ B. Nichol)
- spiral and elliptical galaxies
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
45 Solution 3: spatial d.m.
CORRELATION INTEGRAL!
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(# pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
46 Solution 3: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
- 1.8 slope
- plateau!
- repulsion!
[Figure: log(# pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
47 spatial d.m.
Heuristic for choosing the number of clusters
49 Problem 4: dim. reduction
- given attributes $x_1, \ldots, x_n$
- possibly, non-linearly correlated
- drop the useless ones
50 Problem 4: dim. reduction
- given attributes $x_1, \ldots, x_n$
- possibly, non-linearly correlated
- drop the useless ones
- (Q: why?
- A: to avoid the dimensionality curse)
- Solution: keep on dropping attributes, until the f.d. changes! [SBBD00]
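The "drop until the f.d. changes" idea can be sketched as a greedy backward elimination. Everything specific here is an assumption of this example rather than the [SBBD00] method itself: the fractal dimension is estimated as a correlation-integral slope, the tolerance of 0.5 is arbitrary, and the toy table uses a plain copy of one attribute where the slide also allows non-linear dependence.

```python
import math
from itertools import combinations

def correlation_dimension(rows, radii=(1, 2, 4)):
    """Slope of log(# pairs within r) vs. log(r); rows are numeric tuples."""
    sq = [sum((a - b) ** 2 for a, b in zip(p, q))
          for p, q in combinations(rows, 2)]
    xs, ys = [], []
    for r in radii:
        count = sum(1 for d in sq if d <= r * r)
        if count > 0:
            xs.append(math.log(r))
            ys.append(math.log(count))
    if len(xs) < 2:
        return 0.0  # not enough usable radii for a fit
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def drop_useless(table, names, tol=0.5):
    """Repeatedly drop the attribute whose removal changes the f.d. least,
    stopping once every candidate removal changes it by tol or more."""
    keep = list(range(len(names)))
    current = correlation_dimension(table)
    while len(keep) > 1:
        best = None
        for col in keep:
            rest = [c for c in keep if c != col]
            fd = correlation_dimension([tuple(row[c] for c in rest) for row in table])
            if best is None or abs(fd - current) < abs(best[1] - current):
                best = (col, fd)
        if abs(best[1] - current) >= tol:
            break
        keep.remove(best[0])
        current = best[1]
    return [names[c] for c in keep]

if __name__ == "__main__":
    # Attribute c duplicates a, so one of the two carries no extra information.
    table = [(a, b, a) for a in range(20) for b in range(20)]
    print(drop_useless(table, ["a", "b", "c"]))
```

The pairwise distance computation is quadratic in the number of rows, so a real implementation would need a faster estimator.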
51 Outline
- problems
- Fractals
- Solutions
- Discussion
- what else can they solve?
- how frequent are fractals?
52 Fractals & power laws
- appear in numerous settings
- medical
- geographical / geological
- social
- computer-system related
- <and many-many more! see [Mandelbrot]>
53 Fractals: Brain scans
54 More fractals
- periphery of malignant tumors: 1.5
- benign: 1.3
- [Burdet]
55 More fractals
- cardiovascular system: 3 (!)
- lungs: 2.9
56 Fractals & power laws
- appear in numerous settings
- medical
- geographical / geological
- social
- computer-system related
57 More fractals
[Figure: examples with fractal dimensions 1.1, 1 and 1.3]
59 GIS points
- Cross-roads of Montgomery county
- any rules?
60 GIS
- A: self-similarity
- intrinsic dim. = 1.51
[Figure: log(# pairs within <= r) vs. log(r)]
61 Examples: LB county
- Long Beach county of CA (road end-points)
[Figure: log(# pairs) vs. log(r)]
62 More power laws: areas - Korcak's law
Scandinavian lakes: Any pattern?
63 More power laws: areas - Korcak's law
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(> area)) vs. log(area)]
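A complementary cumulative count is simple to compute from a list of areas. In this sketch the count uses ">=" rather than ">" so that the largest item still has a non-zero count on a log axis (a choice made for this example), and the areas are synthetic values with a planted exponent, not lake data.

```python
import math

def ccdf(values):
    """Pairs (v, number of items with value >= v), one per distinct value,
    listed from the largest value down."""
    ordered = sorted(values, reverse=True)
    out = []
    for i, v in enumerate(ordered, start=1):
        if out and out[-1][0] == v:
            out[-1] = (v, i)  # ties: keep the count that includes all of them
        else:
            out.append((v, i))
    return out

def loglog_slope(pairs):
    """Least-squares slope of log(count) vs. log(value)."""
    xs = [math.log(v) for v, _ in pairs]
    ys = [math.log(c) for _, c in pairs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

if __name__ == "__main__":
    # Synthetic areas built so that count(>= area) = 100 * area**(-0.7).
    areas = [(100.0 / k) ** (1 / 0.7) for k in range(1, 101)]
    print(loglog_slope(ccdf(areas)))  # recovers the planted exponent, about -0.7
```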
64 More power laws: Korcak
Japan islands: area vs. cumulative count (log-log axes)
[Figure: log(count(> area)) vs. log(area)]
65 More power laws
- Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figures: energy released vs. day; log(count) vs. magnitude = log(energy)]
66 Fractals & power laws
- appear in numerous settings
- medical
- geographical / geological
- social
- computer-system related
67 A famous power law: Zipf's law
- Bible - rank vs. frequency (log-log)
[Figure: rank/frequency plot, log(freq) vs. log(rank); words "the" and "a" marked]
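A rank/frequency table like the one behind this plot takes a few lines to build. The sketch assumes a crude tokenizer (lowercase letters and apostrophes only) and uses a made-up sentence as input; plotting log(freq) against log(rank) is left out.

```python
import re
from collections import Counter

def rank_frequency(text):
    """List of (rank, word, frequency), most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, freq) for rank, (word, freq) in enumerate(counts, start=1)]

if __name__ == "__main__":
    sample = "The cat and the dog and the bird"
    for rank, word, freq in rank_frequency(sample):
        print(rank, word, freq)
```

On a long text, Zipf's law predicts that these points fall close to a straight line in log-log scales.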
68 TELCO data
[Figure: count of customers vs. # of service units; "best customer" marked]
69 SALES data - store #96
[Figure: count of products vs. units sold; "aspirin" marked]
70 Olympic medals (Sydney 2000)
[Figure: log(medals) vs. log(rank)]
71 Olympic medals (Athens 2004)
[Figure: log(medals) vs. log(rank)]
72 Even more power laws
- Income distribution (Pareto's law)
- size of firms
- publication counts (Lotka's law)
73 Even more power laws
- library science (Lotka's law of publication count) and citation counts (citeseer.nj.nec.com, 6/2001)
[Figure: log(count) vs. log(citations); "Ullman" marked]
74 Even more power laws
- web hit counts [w/ A. Montgomery]
[Figure: "yahoo.com" marked]
75 Fractals & power laws
- appear in numerous settings
- medical
- geographical / geological
- social
- computer-system related
76 Power laws, contd.
- In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]
[Figure: log(freq) vs. log(indegree); from Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins]
77 Power laws, contd.
- In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]
- length of file transfers [Bestavros]
- duration of UNIX jobs [Harchol-Balter]
78 Conclusions
- Fascinating problems in Data Mining: find patterns in
- sensors/streams
- graphs/networks
79 Conclusions - contd.
- New tools for Data Mining: self-similarity & power laws appear in many cases
- Bad news: they lead to skewed distributions (no Gaussian, Poisson, uniformity, independence, mean, variance)
80 Resources
- Manfred Schroeder, Fractals, Chaos, Power Laws, 1991
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2001
81 References
- [ieeeTN94] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE/ACM Transactions on Networking, 2, 1, pp. 1-15, Feb. 1994.
- [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13.
82 References
- [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension, Proc. of VLDB, pp. 299-310, 1995.
- [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the 80-20 Law, Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996.
83 References
- [vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept. 1996.
- [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999.
84 References
- [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree, International Conference on Data Engineering (ICDE), Sydney, Australia, March 23-26, 1999.
- [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000.
85 References
- [kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos, Tri-plots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA.
86 Thank you!
- Contact info:
- christos _at_ cs.cmu.edu
- www.cs.cmu.edu/~christos
- (w/ papers, datasets, code for fractal dimension estimation, etc.)