Title: The DGX Distribution for Mining Massive, Skewed Data
1The DGX Distribution for Mining Massive, Skewed Data
- Zhiqiang Bi (CMU)
- Christos Faloutsos (CMU)
- Flip Korn (AT&T)
2Outline
- Problem definition / Motivation
- Background
- Proposed method
- Experiments
- Conclusions
3Motivation
- Many real distributions are skewed
- but they occasionally tilt more than Zipf's law expects (top concavity)
- Thus, we need a distribution more general than Zipf's
- A quick intro to the Zipf distribution first
4Outline
- Problem definition / Motivation
- Background: mini intro to Zipf
- Proposed method
- Experiments
- Conclusions
5Example
Zipf's (first) law
[Figure: Bible RANK-FREQUENCY plot (in log-log scales), log(freq) vs. log(rank); the top-ranked words are "the" and "and"]
6Equivalently
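Zipf's (second) law: frequency-count relation (PDF)
[Figure: two log-log panels. RANK-FREQUENCY: log(freq) vs. log(rank), with "the", "and", "of" at the top ranks. FREQ.-COUNT (PDF): log(count) vs. log(freq), with "of", "and", "the" at the highest frequencies]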
7Equivalently
Zipf's (second) law: frequency-count relation (PDF)
[Figure: FREQ.-COUNT (PDF), log(count) vs. log(freq), with "of", "and", "the" at the highest frequencies]
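The two views of Zipf's law can be computed from the same token stream. The sketch below is illustrative only: the function names and the toy corpus are assumptions, not part of the original deck. It builds the rank-frequency list and the frequency-count table that the log-log plots are drawn from.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, freq)], with rank 1 for the most frequent token."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def frequency_count(tokens):
    """Return {freq: count}: how many distinct tokens occur exactly `freq` times."""
    return dict(Counter(Counter(tokens).values()))

# Toy corpus (hypothetical; not the Bible data used in the slides).
tokens = "the cat and the dog and the bird".split()
rf = rank_frequency(tokens)    # points for the rank-frequency plot
fc = frequency_count(tokens)   # points for the frequency-count plot
```

Plotting the logarithm of both coordinates of `rf` (or of `fc`) gives the log-log views; under Zipf's law the points fall close to a straight line.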
8Why is Zipf important?
- because MANY distributions follow it (words, last/first names, income, etc.)
- are there skewed distributions that are NOT Zipf?
9Motivating example
Clickstream data: web site traffic, <url, u-id, ....>
[Figure: frequency-count plot, log(count) vs. log(freq), with a line labeled Zipf]
10Outline
- Problem definition / Motivation
- Background: Zipf successes and failures
- Proposed method
- Experiments
- Conclusions
11Background
- Skewed distributions appear VERY OFTEN in practice
- relational db (80-20 law, high-end histograms, skew-aware join algos)
- economics (Pareto's law)
- text / IR: Zipf
12Background contd
- library science (Lotka's law of publication count) and citation counts (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(citations); labeled point: J. Ullman]
13Background contd
- areas (lakes, islands, habitat patches): Korcak
14Korcak's law
[Figure: Scandinavian lakes, area vs. complementary cumulative count (log-log axes): log(count(> area)) vs. log(area)]
15Olympic medals
[Figure: log(medals) vs. log rank; labeled countries: Russia, China, USA]
16Earthquakes
- Energy of earthquakes (Gutenberg-Richter law), simscience.org
[Figure: axes labeled log(count), magnitude, amplitude, day]
17Background contd
- web: number of in- and out-links (CLEVER, Barabasi)
18Background contd
- UNIX file systems; web file transfers (Bestavros)
- population of cities (Zipf)
- distribution of first and last names (Mandelbrot) (-1 and 0.7, resp.)
- ...
19But occasionally Zipf fails
- We want a distribution that
- can handle the top concavity
20Applications of Zipf's Law
- Word frequency
- City size
- Surname distribution
- Olympic medals
- Web traffic
Q: Do all skewed distributions obey Zipf's law?
21Problem definition
- We want a distribution that
- is discrete
- models well real datasets
- includes Zipf as a special case
- can handle the top concavity
- needs few parameters
- is fast to compute
22Problem definition
- which one to choose? (and why?)
- Gaussian, Erlang, Weibull,
- Chi-square, Cauchy
- geometric, exponential
- Pareto, ... ?
23Problem definition - contd
- Or should we fit curves in the frequency-count plot? or the rank-frequency plot?
- parabolas? hyperbolas? sinusoids? polynomials?
24Outline
- Problem definition / Motivation
- Background
- Proposed method
- Experiments
- Conclusions
25Proposed Method: DGX
- Discretized Gaussian Exponentiated
- Inspired by the LogNormal distribution, which is continuous
26Lognormal
- DFN: If X is Gaussian (m, s), then exp(X) is Lognormal
- It has only two parameters to estimate (m and s)
- It has deep theoretical background (contrary to Zipf's) and it appears often
- size of crystals growing
- capitals that grow exponentially
- etc.; see Kotz & Johnson
- But
- it is continuous
27Lognormal
Lognormal PDF: BUT NOT discrete
[Figure: Prob(x) (count) vs. x (e.g., income), and log(Prob(x)) vs. log(x)]
28Hence DGX
[Figure: PDF, Prob(x) (count) vs. x (e.g., income), defined at discrete points only; the log axis starts at log(1)]
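A minimal sketch of what "discretizing" the lognormal could look like in code. It assumes the probability of an integer k >= 1 is proportional to (1/k) * exp(-(ln k - m)^2 / (2 s^2)), i.e. the lognormal density evaluated at the integers and renormalized. The slides do not spell out the formula, so this form, the function name `dgx_pmf`, and the truncation at `k_max` are all assumptions made for illustration.

```python
import math

def dgx_pmf(mu, sigma, k_max):
    """Return [p(1), ..., p(k_max)] for a DGX-style distribution.

    Assumed form: p(k) proportional to exp(-(ln k - mu)^2 / (2 sigma^2)) / k,
    truncated at k_max and renormalized (the untruncated support is k = 1, 2, ...).
    """
    weights = [
        math.exp(-((math.log(k) - mu) ** 2) / (2.0 * sigma ** 2)) / k
        for k in range(1, k_max + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameters chosen so that the curve rises before it falls.
p = dgx_pmf(mu=2.0, sigma=1.0, k_max=10_000)
mode = max(range(len(p)), key=p.__getitem__) + 1  # most probable value of k
```

With these parameters p(1) is smaller than the probability at the mode, which is the "top concavity" shape a pure power law cannot produce.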
29Recall our goals
- We want a distribution that
- is discrete
- models well real datasets
- includes Zipf as a special case
- can handle the top concavity
- needs few parameters
- is fast to compute
[Checkmarks: 3 of the 6 goals met so far]
30Zipf as a special case
Skip
- When log(k) << |m| (with m << 0), the DGX probability function
- becomes approximately a power law in k, i.e. Zipf-like
Details in the paper. Intuitively:
31Zipf as a special case
m >> 0 -> top concavity
m << 0 -> Zipf-like
[Figure: two panels of log(Prob(x)) vs. log(x), each starting at log(1), one per case]
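The two regimes can be read off a short derivation. It assumes the DGX probability function has the discretized-lognormal form below, with normalization constant A and with \mu, \sigma standing for the slides' m, s. This form is an assumption here; the slides defer the details to the paper.

```latex
\begin{align*}
p(k) &= \frac{A(\mu,\sigma)}{k}\,
        \exp\!\left[-\frac{(\ln k-\mu)^2}{2\sigma^2}\right],
        \qquad k = 1, 2, \dots \\
\ln p(k) &= \ln A - \frac{\mu^2}{2\sigma^2}
        - \left(1 - \frac{\mu}{\sigma^2}\right)\ln k
        - \frac{(\ln k)^2}{2\sigma^2} \\
\ln p(k) &\approx \text{const} - \left(1 + \frac{|\mu|}{\sigma^2}\right)\ln k
        \qquad \text{when } \mu < 0 \text{ and } \ln k \ll |\mu|
\end{align*}
```

In the last line the quadratic term is dropped because its ratio to the linear term is about \ln k / (2|\mu|), which is small. The result is a straight line in log-log scales, i.e. Zipf-like. When \mu > \sigma^2 the coefficient of \ln k is positive for small k, so the curve first rises and is then bent down by the quadratic term, which gives the top concavity.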
32Recall our goals
- We want a distribution that
- is discrete
- models well real datasets
- includes Zipf as a special case
- can handle the top concavity
- needs few parameters
- is fast to compute
[Checkmarks: 4 of the 6 goals met so far]
33Estimation of m, s
- single pass, to collect histogram (PDF)
- Max likelihood for m, s (using off-the-shelf max. routine)
34Estimation of m, s
Skip
Off-the-shelf maximization algo (MATLAB), to find good m, s
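A sketch of the estimation step in pure Python. The slides use an off-the-shelf MATLAB maximizer. The coarse-to-fine grid search below is only a simple stand-in, and the assumed DGX form, the search ranges, and the names `dgx_log_likelihood` and `fit_dgx` are illustrative assumptions rather than the authors' code.

```python
import math

def dgx_log_weights(mu, sigma, k_max):
    """ln w(k) for k = 1..k_max, with w(k) = exp(-(ln k - mu)^2 / (2 sigma^2)) / k."""
    return [
        -math.log(k) - (math.log(k) - mu) ** 2 / (2.0 * sigma ** 2)
        for k in range(1, k_max + 1)
    ]

def dgx_log_likelihood(hist, mu, sigma, k_max):
    """Log-likelihood of a histogram {value: count}; values must lie in 1..k_max."""
    log_w = dgx_log_weights(mu, sigma, k_max)
    peak = max(log_w)
    # log-sum-exp keeps the normalizer finite even if individual weights underflow
    log_z = peak + math.log(sum(math.exp(lw - peak) for lw in log_w))
    return sum(n * (log_w[k - 1] - log_z) for k, n in hist.items())

def fit_dgx(hist, k_max, rounds=6, steps=10):
    """Coarse-to-fine grid search for the (mu, sigma) maximizing the likelihood."""
    mu_lo, mu_hi = -5.0, math.log(k_max)   # assumed initial search range for mu
    sg_lo, sg_hi = 0.1, 5.0                # assumed initial search range for sigma
    best = (mu_lo, sg_lo)
    for _ in range(rounds):
        grid = [
            (mu_lo + i * (mu_hi - mu_lo) / steps, sg_lo + j * (sg_hi - sg_lo) / steps)
            for i in range(steps + 1)
            for j in range(steps + 1)
        ]
        best = max(grid, key=lambda p: dgx_log_likelihood(hist, p[0], p[1], k_max))
        # Halve the search box and re-center it on the current best point.
        mu_half, sg_half = (mu_hi - mu_lo) / 4.0, (sg_hi - sg_lo) / 4.0
        mu_lo, mu_hi = best[0] - mu_half, best[0] + mu_half
        sg_lo, sg_hi = max(1e-3, best[1] - sg_half), best[1] + sg_half
    return best

# Illustrative histogram: exact (unnormalized) DGX weights for mu=2.0, sigma=0.8.
hist = {k: math.exp(lw) for k, lw in enumerate(dgx_log_weights(2.0, 0.8, 300), start=1)}
mu_hat, sigma_hat = fit_dgx(hist, k_max=300)
```

The histogram is the only input, which matches the single-pass data collection described above. A general-purpose optimizer would normally replace the grid search, which is used here only to keep the example dependency-free.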
35Recall our goals
- We want a distribution that
- is discrete
- models well real datasets
- includes Zipf as a special case
- can handle the top concavity
- needs few parameters
- is fast to compute
[Checkmarks: 5 of the 6 goals met so far]
36Outline
- Problem definition / Motivation
- Background
- Proposed method
- Experiments
- datasets
- goodness of DGX
- data mining: spotting outliers with DGX
- Conclusions
37Experiments
- Data
- TEXT (Bible): N = 800,000 words and V = 12,500 vocabulary words
- SALES data from a retail chain (O(100) branches, <p-id, b-id>, 5Gb records per week)
- TELCO data: monthly usage volume per customer, from three regions, <u-id, region-id>
- CLICKSTREAM data (A. Montgomery, GSIA/CMU)
38Experiments
- Evaluation of goodness
- visual, in the frequency-count plot
- correlation coefficient in the q-q plot (quantile-quantile plot); ideally, a straight line with slope 1
[Figure: q-q plot, quantile of actual distribution vs. quantile of synthetic distribution; e.g. the 90%-tile of actual against the 90%-tile of synthetic]
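One way the q-q correlation could be computed, as a sketch: take matching quantiles of the real and the synthetic sample and compute Pearson's correlation between them. The nearest-rank quantile rule, the number of quantiles, and the function names are assumptions, and the slides do not say whether the quantiles were log-transformed first.

```python
import math

def quantiles(sorted_values, probs):
    """Empirical quantiles by nearest rank (input must be sorted ascending)."""
    n = len(sorted_values)
    return [sorted_values[min(n - 1, int(p * n))] for p in probs]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def qq_correlation(real, synthetic, n_quantiles=99):
    """Correlation between matching quantiles of two samples."""
    probs = [i / (n_quantiles + 1) for i in range(1, n_quantiles + 1)]
    return pearson(quantiles(sorted(real), probs),
                   quantiles(sorted(synthetic), probs))

# Hypothetical samples: the second is an exact rescaling of the first.
real = list(range(1, 1001))
synthetic = [2 * x for x in real]
r = qq_correlation(real, synthetic)
```

Note that the correlation only measures how straight the q-q line is. A rescaled sample, as in this example, still scores close to 1, so the slope-1 criterion has to be checked separately.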
39Results: TEXT
[Figure: frequency-count plot, count (log scale) vs. word frequency (log scale); blue = synthetic, green = real]
[Figure: quantile-quantile plot, synthetic vs. real; value shown: 0.96]
40SALES data: store 96
[Figure: blue = synthetic, green = real]
41SALES data: store 82
[Figure: blue = synthetic, green = real]
42SALES data: store 101
[Figure: blue = synthetic, green = real]
43TELCO data: region A
[Figure: blue = synthetic, green = real]
44TELCO data: region B
[Figure: blue = synthetic, green = real]
45TELCO data: region C
[Figure: blue = synthetic, green = real]
46CLICKSTREAM
[Figure: web site access count vs. number of user accesses]
47Outline
- Problem definition / Motivation
- Background
- Proposed method
- Experiments
- datasets
- goodness of DGX
- data mining: spotting outliers with DGX
- Conclusions
48How to spot outlier branches
[Figure: plot with axes m and s]
49How to spot outlier branches
[Figure: plot with axes m and s]
50How to spot outlier branches
51Conclusions
- DGX has all the desired properties
- is discrete
- models well real datasets
- includes Zipf as a special case
- can handle the top concavity
- needs few parameters
- is fast to compute
[Checkmarks: all 6 goals met]
52Philosophically, why is DGX so popular?
- Gaussian: fixed point for addition of R.V.; lognormal/DGX: for multiplication
- FOR EXAMPLE
- breaking a stick into pieces
- rich get richer phenomena
53Philosophically, why is DGX so popular?
- Stick, breaking in half, n times
- length of leftmost piece: $L_0 \, p_1 \, p_2 \cdots p_n$
[Figure: a stick split into pieces of length $L_0 \, p_1$ and $L_0 \, (1 - p_1)$]
54Philosophically, why is DGX so popular?
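- rich get richer leads to lognormals
- $C(t) = C(t-1) \cdot (1 + a + \mathrm{noise}(t))$
- $\ln C(t) = \ln C(0) + \sum_t \ln(1 + a + \mathrm{noise}(t))$
- $\ln C(t) \approx \ln C(0) + a\,t + \sum_t \mathrm{noise}(t)$, using $\ln(1+x) \approx x$
[Figure: two plots against time, one on a log() axis]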
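A small simulation of the rich-get-richer recurrence, as a sketch. The growth rate, noise level, number of steps, and the Gaussian noise model are hypothetical values chosen for illustration; the slides only say "noise". It checks that the average of ln C(t) stays close to the first-order prediction ln C(0) + a t.

```python
import math
import random

def simulate_log_capital(c0, a, noise_sd, steps, rng):
    """Return ln C(steps) for C(t) = C(t-1) * (1 + a + noise(t)),
    with noise(t) drawn from a Gaussian with mean 0 and std noise_sd (assumed)."""
    log_c = math.log(c0)
    for _ in range(steps):
        log_c += math.log(1.0 + a + rng.gauss(0.0, noise_sd))
    return log_c

rng = random.Random(0)  # fixed seed so the run is repeatable
c0, a, noise_sd, steps = 100.0, 0.01, 0.02, 200
samples = [simulate_log_capital(c0, a, noise_sd, steps, rng) for _ in range(2000)]
mean_log = sum(samples) / len(samples)
# First-order prediction from ln(1 + x) ~ x; a small systematic gap is expected
# because the approximation drops the second-order terms.
predicted = math.log(c0) + a * steps
```

Because ln C(t) is a sum of many small independent terms, its distribution across runs is approximately Gaussian, which makes C(t) itself approximately lognormal.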
55Usefulness for HP projects?
- disk traffic (bytes per unit time could be lognormal/DGX; 80-20)
- ditto for web traffic (image-file sizes: lognormal)
- feature extraction ((m,s) for printers of type A, (m,s) for type B: compare)
56Code resources
- zb26,christos_at_cs.cmu.edu
- full paper: Bi, Faloutsos & Korn, KDD 2001 (runner up for best paper award)
- Kotz, Johnson and Balakrishnan: Continuous Univariate Distributions