Title: Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics
1Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics
- Edward J. Wegman
- George Mason University
- Jeffrey L. Solka
- Naval Surface Warfare Center
2Cluster Analysis
3What is Cluster Analysis?
- Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes.
- Both the number of classes and the properties of the classes are to be determined. (Everitt 1993)
4Why Do This?
- Organize
- Prediction
- Etiology (Causes)
5How Do We Measure Quality?
- Multiple Clusters
- Male, Female
- Low, Middle, Upper Income
- Neither True Nor False
- Measured by Utility
6Data Normalization
- z_i = (x_i - m_i)/s_i
- Such a normalization step may play havoc with
cluster structure
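A minimal sketch of this standardization in Python (assuming the data are held in a numpy array with rows as objects and columns as variables); as the slide cautions, whether to standardize at all is a judgment call:

    import numpy as np

    def standardize(X):
        """Center each variable and scale to unit standard deviation."""
        m = X.mean(axis=0)           # per-variable means m_i
        s = X.std(axis=0, ddof=1)    # per-variable standard deviations s_i
        return (X - m) / s

    # Example: three objects measured on two variables of very different scales
    X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 150.0]])
    Z = standardize(X)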
7Transformations
- We may be interested in performing our cluster analysis in alternate coordinate frameworks (a sketch of the first option follows this list)
- Principal Components
- Grand Tour
- Projection Pursuit
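As a hedged illustration of the first option, a principal components rotation before clustering might look like the following sketch (scikit-learn is an assumption, not part of the original course software):

    import numpy as np
    from sklearn.decomposition import PCA

    def to_principal_components(X, n_components=2):
        """Rotate the data into its leading principal component coordinates."""
        pca = PCA(n_components=n_components)
        return pca.fit_transform(X)   # scores in the new coordinate frame

    # Cluster analysis would then be run on the returned scores.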
8Dissimilarity Measures
- Euclidean Distance
- City Block
- Canberra Metric
9Dissimilarity Measures
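The formulas on this slide did not survive conversion; as a sketch, the standard definitions of the three dissimilarities named above (an assumption about what the slide showed):

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def city_block(x, y):
        return np.sum(np.abs(x - y))

    def canberra(x, y):
        # Standard Canberra metric; terms with a zero denominator are skipped.
        num = np.abs(x - y)
        den = np.abs(x) + np.abs(y)
        mask = den > 0
        return np.sum(num[mask] / den[mask])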
10Between Group Similarity and Distance Measures
- Summary Statistics for Each Group
- Measurements of Within Group Variation
- Similarity or Distance Measures Between Groups
11Alternative Metrics
- Genetic Applications
- P_iA, P_iB: gene frequencies for the ith allele at a given locus in the two populations
12Group Distance Measures
13Group Distance Measures
- S is the pooled within group covariance matrix
- W is the p x p matrix of pooled within group dispersions of the 2 groups
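The formula itself was lost in conversion; a small sketch of a Mahalanobis-type distance between two group means using a pooled within-group covariance S (the exact form shown on the slide is assumed, not recovered):

    import numpy as np

    def pooled_covariance(X1, X2):
        """Pooled within-group covariance of two samples (rows = individuals)."""
        n1, n2 = len(X1), len(X2)
        S1 = np.cov(X1, rowvar=False)
        S2 = np.cov(X2, rowvar=False)
        return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

    def mahalanobis_group_distance(X1, X2):
        d = X1.mean(axis=0) - X2.mean(axis=0)
        S = pooled_covariance(X1, X2)
        return float(np.sqrt(d @ np.linalg.solve(S, d)))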
14Hierarchical Cluster Analysis
- 1 cluster to n clusters
- Agglomerative methods
- Fusion of n data points into groups
- Divisive methods
- Separate the n data points into finer groupings
15Dendrograms
- [Dendrogram: five individuals merge step by step, (1) and (2) into (1,2), (4) and (5) into (4,5), then (3,4,5), and finally (1,2,3,4,5); read left to right for agglomerative clustering and right to left for divisive clustering]
16Agglomerative Algorithm
- Start with clusters C1, C2, ..., Cn, each containing 1 data point
- 1 - Find the nearest pair Ci, Cj; merge Ci and Cj, delete Cj, and decrement the cluster count by 1
- 2 - If the number of clusters is greater than 1, go back to step 1
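A compact sketch of this loop (single linkage is used for concreteness; the linkage rule is the only assumption beyond the steps above):

    import numpy as np

    def agglomerate(D, target=1):
        """Merge clusters until `target` remain, given a full distance matrix D."""
        clusters = [[i] for i in range(len(D))]           # start: one point per cluster
        while len(clusters) > target:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # single linkage: distance of the closest pair between the two clusters
                    d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a] = clusters[a] + clusters[b]        # merge Cj into Ci
            del clusters[b]                                # delete Cj
        return clusters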
17Single Linkage (Nearest Neighbor) Clustering
- Distance between groups is defined as the distance between the closest pair of individuals, taking 1 individual from each group
18Example of Single Linkage Clustering
Initial distance matrix:

       1     2     3     4     5
  1   0.0
  2   2.0   0.0
  3   6.0   5.0   0.0
  4  10.0   9.0   4.0   0.0
  5   9.0   8.0   5.0   3.0   0.0

After merging 1 and 2 (single linkage):

        (1,2)   3     4     5
  (1,2)  0.0
  3      5.0   0.0
  4      9.0   4.0   0.0
  5      8.0   5.0   3.0   0.0
19Example of Single Linkage Clustering
After merging 4 and 5:

        (1,2)   3    (4,5)
  (1,2)   0
  3       5     0
  (4,5)   8     4      0

After merging 3 with (4,5):

          (1,2)  (3,4,5)
  (1,2)     0
  (3,4,5)   5      0
20Resultant Dendrogram
- [Dendrogram: leaves 1 2 3 4 5; 1 and 2 join at height 2, 4 and 5 at height 3, 3 joins (4,5) at height 4, and (1,2) joins (3,4,5) at height 5]
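The worked example can be reproduced with scipy (an assumption about tooling, not part of the original slides); the merge heights 2, 3, 4, 5 match the dendrogram above, and swapping method='complete' reproduces the complete linkage example that follows:

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    # Full 5 x 5 distance matrix from the example above
    D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
                  [ 2.0,  0.0,  5.0,  9.0,  8.0],
                  [ 6.0,  5.0,  0.0,  4.0,  5.0],
                  [10.0,  9.0,  4.0,  0.0,  3.0],
                  [ 9.0,  8.0,  5.0,  3.0,  0.0]])

    Z = linkage(squareform(D), method='single')
    print(Z[:, 2])   # merge heights: 2.0, 3.0, 4.0, 5.0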
21Complete Linkage Clustering (Furthest Neighbor)
- Distance between groups is defined as the distance between the most distant pair of individuals, 1 from each group
22Complete Linkage Example
Initial distance matrix (as before):

       1     2     3     4     5
  1   0.0
  2   2.0   0.0
  3   6.0   5.0   0.0
  4  10.0   9.0   4.0   0.0
  5   9.0   8.0   5.0   3.0   0.0

- (1,2) is the first cluster
- d(12)3 = max{d13, d23} = d13 = 6.0
- d(12)4 = max{d14, d24} = d14 = 10.0
- d(12)5 = max{d15, d25} = d15 = 9.0

        (1,2)   3     4     5
  (1,2)   0
  3       6     0
  4      10     4     0
  5       9     5     3     0

This implies (4,5) is the 2nd cluster
23Complete Linkage Example
        (1,2)   3    (4,5)
  (1,2)   0
  3       6     0
  (4,5)  10     5     0

Thus (3,4,5) is the 3rd cluster, and then (1,2,3,4,5).
24Dendrogram
- [Dendrogram: leaves 1 2 3 4 5; 1 and 2 join at height 2, 4 and 5 at height 3, 3 joins (4,5) at height 5, and (1,2) joins (3,4,5) at height 10]
25Group Average Clustering
- Distance between clusters is the average of the distances between all pairs of individuals, 1 from each of the 2 groups
26Centroid Clusters
- We use the centroid of a group once it is formed.
27Problems With Hierarchical Clustering
- Chaining
- Propensity to cluster together individuals linked by a chain of intermediates
- Biased toward finding spherical clusters
28The Number of Groups Problem
- How do we decide on the appropriate number of clusters?
- Duda and Hart (1973)
- Form E(2)/E(1), where E(m) is the sum-of-squares error criterion for the m-cluster model
- If the ratio is less than a threshold k, we reject the single-cluster model
29Optimization Methods
- Minimizing or maximizing some criterion
- Does not necessarily form hierarchical clusters
30Clustering Criteria
31Minimization of Trace of W
- This is equivalent to minimizing the sum of squared Euclidean distances between each individual and its cluster center
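Trace(W) is the within-cluster sum of squares that k-means tries to minimize, so a minimal illustration of this criterion (scikit-learn is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def trace_W(X, labels):
        """Sum of squared distances from each point to its cluster centroid."""
        return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                   for k in np.unique(labels))

    # KMeans' inertia_ is the same quantity it tries to minimize
    X = np.random.default_rng(0).normal(size=(100, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.inertia_, trace_W(X, km.labels_))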
32Minimization of the Determinant of W
- Maximize det(T)/det(W)
- Minimize det(W)
33Maximization of Trace(BW^-1)
34Optimizing the Clustering Criterion
- N(n,g) = the number of partitions of n individuals into g groups
- N(15,3) = 2,375,101
- N(20,4) = 45,232,115,901
- N(25,8) = 690,223,721,118,368,580
- N(100,5) ≈ 10^68
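These counts are Stirling numbers of the second kind; a short check of the first value (the choice of recurrence is mine, not from the slides):

    def n_partitions(n, g):
        """Stirling number of the second kind S(n, g) via the standard recurrence."""
        S = [[0] * (g + 1) for _ in range(n + 1)]
        S[0][0] = 1
        for i in range(1, n + 1):
            for j in range(1, min(i, g) + 1):
                S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
        return S[n][g]

    print(n_partitions(15, 3))   # 2,375,101, as on the slide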
35Hill Climbing Algorithms
- 1 - Form an initial partition into the required number of groups
- 2 - Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster
- 3 - Make the change which leads to the greatest improvement in the value of the clustering criterion
- 4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve
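A sketch of this exchange procedure with trace(W) as the criterion (the criterion choice and data layout are assumptions; for brevity the sketch accepts any improving move rather than searching for the single best move in each pass):

    import numpy as np

    def trace_W(X, labels, g):
        return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                   for k in range(g) if np.any(labels == k))

    def hill_climb(X, g, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, g, size=len(X))          # step 1: initial partition
        best = trace_W(X, labels, g)
        improved = True
        while improved:                                   # step 4: repeat until no gain
            improved = False
            for i in range(len(X)):                       # step 2: try moving each point
                for k in range(g):
                    if k == labels[i]:
                        continue
                    old = labels[i]
                    labels[i] = k
                    crit = trace_W(X, labels, g)
                    if crit < best:                       # step 3: keep an improving move
                        best, improved = crit, True
                    else:
                        labels[i] = old
        return labels, best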
36Alternative Methods
- Simulated Annealing
- Genetic Algorithms
- Quantum Computing
37Reference
- Everitt, B. S., Cluster Analysis (Third Edition), 1993.
38Mixture Based Clustering
39Finite Mixture Model
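The model formula on this slide was lost in conversion. The usual finite mixture form, stated here as an assumption about what was shown, is

    f(x) = sum_{c=1}^{g} pi_c f_c(x; theta_c),   with pi_c >= 0 and sum_c pi_c = 1,

where the f_c are component densities (for example multivariate normals) and the pi_c are the mixing proportions.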
40Expectation Maximization (EM) Algorithm
- Dempster, Laird, and Rubin (1977)
- Given an initial guess at the number of components in the mixture and their starting values, this technique attempts to maximize the likelihood of the model in a two-step process
41Iterative EM Equations E-Step
42Iterative EM Equations M-Step
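The E-step and M-step equations did not survive conversion; a minimal Python sketch of one EM iteration for a Gaussian mixture, following the standard formulation (an assumption about what these two slides showed):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, pis, mus, Sigmas):
        """One EM iteration for a g-component Gaussian mixture."""
        n, g = len(X), len(pis)
        # E-step: posterior probability that case i belongs to component c
        R = np.column_stack([pis[c] * multivariate_normal.pdf(X, mus[c], Sigmas[c])
                             for c in range(g)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means, and covariances
        Nc = R.sum(axis=0)
        pis = Nc / n
        mus = [R[:, c] @ X / Nc[c] for c in range(g)]
        Sigmas = [(R[:, c, None] * (X - mus[c])).T @ (X - mus[c]) / Nc[c]
                  for c in range(g)]
        return pis, mus, Sigmas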
43Problems With the EM Algorithm
- One normally needs to constrain the σ_i (component variances) to prevent spiking
- Convergence may be very slow
44How Do We Initialize the Parameters?
- Random Starts
- Helps prevent convergence to local maxima in the likelihood surface
- Human intervention
- Initial partitioning
- Set the initial posteriors to 0 or 1 according to
a prior clustering scheme.
45How Do We Choose g?
- Human Intervention
- Divine Intervention
- Likelihood Ratio Test Statistic
- Wolfe's Method
- Bootstrap
- AIC
- Adaptive Mixtures Based Methods
- Pruning
- SHIP
46Problems With the Hypothesis Testing Method
- Regularity conditions do not hold for -2 log λ ~ χ²_d
- d = difference in the number of parameters
47Wolfe's Method
- H0: X ~ mixture with g1 components
- H1: X ~ mixture with g2 components
- g1 < g2
- e.g., g1 = 1, g2 = 2
- -2c log λ ~ χ²_d
- d = 2 × (difference in the number of parameters, not counting the mixing proportions)
- c = (n - 1 - p - g2/2)/n
48Interpreting Wolfe's Method
- n/p > 5 for Wolfe's approximation to work
- Power of the test is low when the Mahalanobis distance between terms is less than 2
- Use Wolfe's test as a guide to the number of components
- Since it is a guide, McLachlan and Basford suggest using c = 1
- Examine posterior probabilities for various g's
- Long-tailed distributions (e.g., lognormal) may lead to rejection of H0: g1 = 1
49Bootstrap Method
- H0: g = g1
- H1: g = g2
- Step 1 - Generate a bootstrap sample from a mixture of g1 groups (using parameters estimated from the data)
- Step 2 - Fit the bootstrap sample with both a g1- and a g2-component mixture model
- Step 3 - Compute -2 log λ (λ = L_H0/L_H1)
- Step 4 - Repeat k times
- Step 5 - Compare these to the -2 log λ computed on the original data
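A sketch of this bootstrap test of H0: g = g1 against H1: g = g2, using scikit-learn Gaussian mixtures as a stand-in for the fitting step (the library choice and details such as the number of restarts are assumptions):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def neg2_log_lambda(X, g1, g2, n_init=5):
        """-2 log(L_H0 / L_H1) for mixtures with g1 vs. g2 components."""
        m1 = GaussianMixture(g1, n_init=n_init, random_state=0).fit(X)
        m2 = GaussianMixture(g2, n_init=n_init, random_state=0).fit(X)
        return 2.0 * (m2.score(X) - m1.score(X)) * len(X)   # score() is the mean log-likelihood

    def bootstrap_test(X, g1, g2, k=350):                   # k = 350 echoes the guideline below
        observed = neg2_log_lambda(X, g1, g2)
        m1 = GaussianMixture(g1, n_init=5, random_state=0).fit(X)
        boot = []
        for _ in range(k):
            Xb, _ = m1.sample(len(X))                       # step 1: simulate from the fitted H0 model
            boot.append(neg2_log_lambda(Xb, g1, g2))        # steps 2-4, repeated k times
        # step 5: p-value = proportion of bootstrap values at least as large as the observed one
        return observed, np.mean(np.array(boot) >= observed)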
50Bootstrap Guidelines
- How many resamples are needed?
- Perhaps 350 or more
- Since we may need random starts of the initial
parameters to get good values for each sample, we
may have some major crunching to do for
moderately large g1 and g2.
51Significance Level
- The test which rejects H0 if -2 log λ > X_(j) (the jth order statistic of the k bootstrap values) has size α = 1 - j/(k+1)
52AIC
- AIC(g) = -2 L(f̂) + 2 N(g), where L is the maximized log-likelihood and N(g) is the number of free parameters in the model of size g
- We choose g in order to minimize the AIC criterion
- This criterion is subject to the same regularity conditions as -2 log λ
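Choosing g by minimizing AIC can be sketched as follows (scikit-learn's aic() follows the same -2 log L + 2 N(g) convention; the library is an assumption):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def choose_g_by_aic(X, g_max=10):
        aics = [GaussianMixture(g, n_init=5, random_state=0).fit(X).aic(X)
                for g in range(1, g_max + 1)]
        return int(np.argmin(aics)) + 1, aics   # g with the smallest AIC, plus the whole curve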
53Mode Based Methods
- Mixture terms and modes are not in 1-1 correspondence
- Kernel approach
- Silverman
- Mode tree of Minnotte
- Mode forest of Minnotte, Marchette and Wegman
- Hybrid
- Filtered Mode Tree
54Mode Based Methods
- [Figure: a mode forest]
55Adaptive Mixtures Density Estimator (AMDE)
- Priebe and Marchette 1992
- Priebe 1994
- Hybrid of Kernel Estimator and Mixture Model
- Number of Terms Driven by the Data
- L1 Consistent
56AMDE Algorithm
- 1 - Given a New Observation
- 2 - Update Existing Model Using the Recursive EM
- or
- 3 - Add a New Term to Explain This Data Point
57Recursive EM Equations - 1
58Recursive EM Equations - 2
59Create rule
- Test the MHD (Mahalanobis distance) from the current data point to each mixture term in the existing model
- Add in a new term when this distance exceeds a certain create threshold
- Location given by the current data point
- Covariance given by a weighted average of the existing covariances
- Mixing coefficient set to 1/n
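A rough sketch of this create rule for 1-d Gaussian terms (the threshold value and the exact weights used for the new term's spread are assumptions; the recursive EM update itself is omitted):

    import numpy as np

    def maybe_create_term(x, pis, mus, sigmas, n, threshold=3.0):
        """Add a new mixture term if x is far (in MHD) from every existing term."""
        mhd = np.abs(x - np.asarray(mus)) / np.asarray(sigmas)   # 1-d Mahalanobis distances
        if np.min(mhd) > threshold:
            pis = list(np.asarray(pis) * (1 - 1.0 / n)) + [1.0 / n]   # mixing coefficient = 1/n
            mus = list(mus) + [x]                                     # centered at the data point
            # new spread: weighted average of the existing ones (weights are an assumption)
            sigmas = list(sigmas) + [float(np.average(sigmas, weights=pis[:-1]))]
        return pis, mus, sigmas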
60Pruning
- Build an Overdetermined Mixture Model Using the AMDE
- Use AIC to Remove Superfluous Terms
- Apply This Procedure to Bootstrap Samples From
the Data Set
61SHIP Algorithm
- 1 - Fit a mixture model to the data
- 2 - Use the mixture model to set the bandwidth of the kernel estimator
- 3 - Examine the mismatch
- 4 - Add terms where there is mismatch
- 5 - Fit a new mixture model, reset the bandwidth, fit a new kernel estimator
- 6 - Continue
- 7 - Use AIC to tell when we are done
62SHIP Picture
63Mixture Visualization 1-d
64Mixture Visualization 2-d
65Mixture Visualization 3-d
66References
- J. L. Solka, W. L. Poston, and E. J. Wegman, A Visualization Technique for Studying the Iterative Estimation of Mixture Densities, Journal of Computational and Graphical Statistics, 4(3), pp. 180-197, 1995.
- Priebe, C. E., Adaptive Mixtures, JASA, Vol. 89, No. 427, pp. 796-806, 1994.
- Solka, J. L., Wegman, E. J., Priebe, C. E., Poston, W. L., and Rogers, G. W., A method to determine the structure of an unknown mixture using the Akaike Information Criterion and the bootstrap, Statistics and Computing, 8, 177-188, 1998.
67References
- G. J. McLachlan and K. E. Basford, Mixture Models, Marcel Dekker, 1988.
- D. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions, Wiley Series in Probability and Mathematical Statistics, 1985.
68Autoclass
- Developed by R. Hanson, J. Stutz, and P. Cheeseman
- Allows real and discrete data
- Allows missing data
- Uses a Bayesian approach
- Automatically chooses number of classes and their
structure
69Bayesian Assumptions
- A Bayesian agent uses a single real number to describe its degree of belief in each proposition of interest
- How evidence should affect beliefs
- These lead to the standard probability axioms
70Evidence of States
- Coin States
- 2 headed
- 1 headed
- Evidence
- Lands heads up
- Lands heads down
71Priors and Posteriors
- P(ab|cd): belief in a and b given c and d
- p(H): belief in H in the absence of, or prior to, any evidence
- p(H|E): posterior describing the agent's belief after observing evidence E
- L(E|H): likelihood, describing how likely it would be to see each possible evidence combination in each possible world H
72Normalization
- 0 < p(a|b) < 1
- Σ_H p(H) = 1
- Σ_E L(E|H) = 1
73Joint Likelihood
- J(E|H) = L(E|H) p(H)
- p(H|E) = J(E|H) / Σ_H J(E|H) = L(E|H) p(H) / Σ_H L(E|H) p(H)
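A tiny numeric illustration of these two lines (the prior and likelihood values are invented for the example):

    # Two hypotheses H with priors p(H), and likelihoods L(E|H) for one observed E
    prior = {'H1': 0.5, 'H2': 0.5}
    likelihood = {'H1': 0.9, 'H2': 0.3}                           # L(E|H)

    joint = {h: likelihood[h] * prior[h] for h in prior}          # J(E|H) = L(E|H) p(H)
    total = sum(joint.values())
    posterior = {h: joint[h] / total for h in joint}              # p(H|E) = J(E|H) / sum_H J(E|H)
    print(posterior)   # {'H1': 0.75, 'H2': 0.25}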
74Continuous
- H's continuous means that p(H) → dp(H)
- Σ's → integrals
- Continuous E's → differential likelihoods
- dL(E|H) → DL(E|H) ≈ dL(E|H) DE/dE
75Utility Function
76Approach
- Normal Possible States Correspond to Certain Models
- Model Parameters
- T: how many terms
- V: parameters that describe a class
- L(E|V T S)
- dp(V T|S)
- Decouple priors into a product
- Broad priors
- Equal weights lead to different levels of model complexity
- More parameters demand a better fit
77Joint
- dJ(E V T|S) = L(E|V T S) dp(V T|S)
- This is a spiky distribution in V T in a high dimensional space
- Punt normalization
78Alternate Approach
- Break V into a region R surrounding each peak
- Search to maximize
- M(E R T|S) = ∫_R dJ(E V T|S)
- Report the best few models
79Reported Models
- Marginal joint M(E R T|S)
- Discrete parameters T
- Estimate V in R
- E(V|E R T S) = ∫_{V in R} V dJ(E V T|S) / M(E R T|S)
- V that maximizes dJ(E V T|S) in R
80Clustering
- Evidence
- I cases with K attributes each, designated X_{i,k}
- S
- V
- T
- dL(E|V T S)
- dp(V T|S)
- M(E R T|S)
- E(V|E R T S)
81Likelihood
- L(E|V T S) = Π_i L(E_i|V T S)
- L(E_i|V T S), where E_i = (X_i1, X_i2, ..., X_iK)
82Mixture Based Modeling
- L(E_i|V T S_M) = Σ_{c=1}^{C} α_c L(E_i|V_c T_c S_c)
- Parameters
- T = (C, T_c)
- V = (α_c, V_c)
83Priors
- dp(V T|S_M) = F3(C) C! dB(α_1 ... α_C|C) Π_c dp(V_c T_c|S_c)
- F3(C) = 6/(π² C²)
84Priors on Mixing Coefficients
- dB(α_1 ... α_C|C) = [Γ(aC)/Γ(a)^C] Π_c α_c^(a-1) dα_c
- a = 1/C
- M(E|S) = Γ(aC) Π_c Γ(I_c + a) / [Γ(aC + I) Γ(a)^C]
- E(α_c|E S) = (I_c + a)/(I + aC) = (I_c + 1/C)/(I + 1)
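For example (an illustrative calculation, not from the slides): with I = 100 cases, C = 4 classes, and I_c = 40 cases currently assigned to class c, E(α_c|E S) = (40 + 1/4)/(100 + 1) ≈ 0.399, slightly shrunk from the raw proportion 0.40 toward the uniform value 1/C = 0.25.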
85Priors on the Mean
- dp(μ|S) = dR(μ|μ+, μ-)
- μ+ = max x_i, μ- = min x_i
- dR(y|y+, y-) = dy/(y+ - y-) for y in [y-, y+]
- E(μ|E S) = x̄
- Π_k dp(μ_k|S_R)
86Priors on the Covariance
- The prior on the covariance is very hairy and is
given by a Wishart distribution
87Algorithm
- We use our standard EM algorithm to focus on regions of the parameter space
- At the maxima the weights should be consistent with the expectation estimates based on the priors
88Initialization
- Start at a random configuration and run 10-100 iterations of the EM algorithm
- L(E_i|V T R S) ≈ Π_{c=1}^{C} (α_c L(E_i|V_c T_c S_c))^(w_ic)
- Similarly, we form an approximate joint function
89Complexity Search
- C is chosen randomly from a log-normal distribution fit to the C's of the 6-10 best trials seen so far, after trying a fixed range of C's to start
- Merge and split techniques have been formulated but are not generally better
- The marginal joints of the different trials follow a distribution that allows one to predict how much longer it will take, on average, to find a better solution
90Class Hierarchy
- It is possible for the various models to share terms in the mixtures
- In this manner duplication of effort is prevented
91Running auto-class
- On science
- /usr/local/autoclass-c
- /usr/local/autoclass
- On nswc
- email me
92Relevant files in autoclass-c
- data
- example data sets and outputs
- sample
- in-depth example of a run with a script file
- read-me.text
- explanation
- doc (papers)
- reports-c.text