Title: Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics
1Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics
- Edward J. Wegman
- George Mason University
- Jeffrey L. Solka
- Naval Surface Warfare Center
2Cluster Analysis
3What is Cluster Analysis?
- Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes.
- Both the number of classes and the properties of the classes are to be determined. (Everitt 1993)
4Why Do This?
- Organize
- Prediction
- Etiology (Causes)
5How Do We Measure Quality?
- Multiple Clusters
- Male, Female
- Low, Middle, Upper Income
- Neither True Nor False
- Measured by Utility
6Data Normalization
- z_i = (x_i - m_i)/s_i
- Such a normalization step may play havoc with
cluster structure
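A minimal sketch of this standardization in Python (assuming the data are held in a numpy array with rows as objects and columns as variables); as the slide cautions, whether to standardize at all is a judgment call:

    import numpy as np

    def standardize(X):
        """Center each variable and scale to unit standard deviation."""
        m = X.mean(axis=0)           # per-variable means m_i
        s = X.std(axis=0, ddof=1)    # per-variable standard deviations s_i
        return (X - m) / s

    # Example: three objects measured on two variables of very different scales
    X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 150.0]])
    Z = standardize(X)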
7Transformations
- We may be interested in performing our cluster analysis in alternate coordinate frameworks (a sketch of the first option follows this list)
- Principal Components
- Grand Tour
- Projection Pursuit
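As a hedged illustration of the first option, a principal components rotation before clustering might look like the following sketch (scikit-learn is an assumption, not part of the original course software):

    import numpy as np
    from sklearn.decomposition import PCA

    def to_principal_components(X, n_components=2):
        """Rotate the data into its leading principal component coordinates."""
        pca = PCA(n_components=n_components)
        return pca.fit_transform(X)   # scores in the new coordinate frame

    # Cluster analysis would then be run on the returned scores.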
8Dissimilarity Measures
- Euclidean Distance
- City Block
- Canberra Metric
9Dissimilarity Measures
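The formulas on this slide did not survive conversion; as a sketch, the standard definitions of the three dissimilarities named above (an assumption about what the slide showed):

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def city_block(x, y):
        return np.sum(np.abs(x - y))

    def canberra(x, y):
        # Standard Canberra metric; terms with a zero denominator are skipped.
        num = np.abs(x - y)
        den = np.abs(x) + np.abs(y)
        mask = den > 0
        return np.sum(num[mask] / den[mask])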
10Between Group Similarity and Distance Measures
- Summary Statistics for Each Group
- Measurements of Within Group Variation
- Similarity or Distance Measures Between Groups
11Alternative Metrics
- Genetic Applications
- P_iA, P_iB: gene frequencies for the ith allele at a given locus in the two populations
12Group Distance Measures
13Group Distance Measures
- S is the pooled within group covariance matrix
- W is the p x p matrix of pooled within group dispersions of the 2 groups
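The formula itself was lost in conversion; a small sketch of a Mahalanobis-type distance between two group means using a pooled within-group covariance S (the exact form shown on the slide is assumed, not recovered):

    import numpy as np

    def pooled_covariance(X1, X2):
        """Pooled within-group covariance of two samples (rows = individuals)."""
        n1, n2 = len(X1), len(X2)
        S1 = np.cov(X1, rowvar=False)
        S2 = np.cov(X2, rowvar=False)
        return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

    def mahalanobis_group_distance(X1, X2):
        d = X1.mean(axis=0) - X2.mean(axis=0)
        S = pooled_covariance(X1, X2)
        return float(np.sqrt(d @ np.linalg.solve(S, d)))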
14Hierarchical Cluster Analysis
- 1 cluster to n clusters
- Agglomerative methods
- Fusion of n data points into groups
- Divisive methods
- Separate the n data points into finer groupings
15Dendrograms
- [Dendrogram: five individuals merge step by step, (1) and (2) into (1,2), (4) and (5) into (4,5), then (3,4,5), and finally (1,2,3,4,5); read left to right for agglomerative clustering and right to left for divisive clustering]
16Agglomerative Algorithm
- Start with clusters C1, C2, ..., Cn, each containing 1 data point
- 1 - Find the nearest pair Ci, Cj; merge Ci and Cj, delete Cj, and decrement the cluster count by 1
- 2 - If the number of clusters is greater than 1, go back to step 1
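A compact sketch of this loop (single linkage is used for concreteness; the linkage rule is the only assumption beyond the steps above):

    import numpy as np

    def agglomerate(D, target=1):
        """Merge clusters until `target` remain, given a full distance matrix D."""
        clusters = [[i] for i in range(len(D))]           # start: one point per cluster
        while len(clusters) > target:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # single linkage: distance of the closest pair between the two clusters
                    d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a] = clusters[a] + clusters[b]        # merge Cj into Ci
            del clusters[b]                                # delete Cj
        return clusters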
17Single Linkage (Nearest Neighbor) Clustering
- Distance between groups is defined as the distance between the closest pair of individuals, taking 1 individual from each group
18Example of Single Linkage Clustering
Initial distance matrix:

       1     2     3     4     5
  1   0.0
  2   2.0   0.0
  3   6.0   5.0   0.0
  4  10.0   9.0   4.0   0.0
  5   9.0   8.0   5.0   3.0   0.0

After merging 1 and 2 (single linkage):

        (1,2)   3     4     5
  (1,2)  0.0
  3      5.0   0.0
  4      9.0   4.0   0.0
  5      8.0   5.0   3.0   0.0
19Example of Single Linkage Clustering
After merging 4 and 5:

        (1,2)   3    (4,5)
  (1,2)   0
  3       5     0
  (4,5)   8     4      0

After merging 3 with (4,5):

          (1,2)  (3,4,5)
  (1,2)     0
  (3,4,5)   5      0
20Resultant Dendrogram
- [Dendrogram: leaves 1 2 3 4 5; 1 and 2 join at height 2, 4 and 5 at height 3, 3 joins (4,5) at height 4, and (1,2) joins (3,4,5) at height 5]
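The worked example can be reproduced with scipy (an assumption about tooling, not part of the original slides); the merge heights 2, 3, 4, 5 match the dendrogram above, and swapping method='complete' reproduces the complete linkage example that follows:

    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    # Full 5 x 5 distance matrix from the example above
    D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
                  [ 2.0,  0.0,  5.0,  9.0,  8.0],
                  [ 6.0,  5.0,  0.0,  4.0,  5.0],
                  [10.0,  9.0,  4.0,  0.0,  3.0],
                  [ 9.0,  8.0,  5.0,  3.0,  0.0]])

    Z = linkage(squareform(D), method='single')
    print(Z[:, 2])   # merge heights: 2.0, 3.0, 4.0, 5.0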
21Complete Linkage Clustering (Furthest Neighbor)
- Distance between groups is defined as the distance between the most distant pair of individuals, 1 from each group
22Complete Linkage Example
Initial distance matrix (as before):

       1     2     3     4     5
  1   0.0
  2   2.0   0.0
  3   6.0   5.0   0.0
  4  10.0   9.0   4.0   0.0
  5   9.0   8.0   5.0   3.0   0.0

- (1,2) is the first cluster
- d(12)3 = max{d13, d23} = d13 = 6.0
- d(12)4 = max{d14, d24} = d14 = 10.0
- d(12)5 = max{d15, d25} = d15 = 9.0

        (1,2)   3     4     5
  (1,2)   0
  3       6     0
  4      10     4     0
  5       9     5     3     0

This implies (4,5) is the 2nd cluster
23Complete Linkage Example
        (1,2)   3    (4,5)
  (1,2)   0
  3       6     0
  (4,5)  10     5     0

Thus (3,4,5) is the 3rd cluster, and then (1,2,3,4,5).
24Dendrogram
- [Dendrogram: leaves 1 2 3 4 5; 1 and 2 join at height 2, 4 and 5 at height 3, 3 joins (4,5) at height 5, and (1,2) joins (3,4,5) at height 10]
25Group Average Clustering
- Distance between clusters is the average of the distances between all pairs of individuals, 1 from each of the 2 groups
26Centroid Clusters
- We use the centroid of a group once it is formed.
27Problems With Hierarchical Clustering
- Chaining
- Propensity to cluster together individuals linked by a chain of intermediates
- Biased toward finding spherical clusters
28The Number of Groups Problem
- How do we decide on the appropriate number of clusters?
- Duda and Hart (1973)
- Form E(2)/E(1), where E(m) is the sum-of-squares error criterion for the m-cluster model
- If the ratio is less than a threshold k, we reject the single-cluster model
29Optimization Methods
- Minimizing or maximizing some criterion
- Does not necessarily form hierarchical clusters
30Clustering Criteria
31Minimization of Trace of W
- This is equivalent to minimizing the sum of squared Euclidean distances between each individual and its cluster center
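Trace(W) is the within-cluster sum of squares that k-means tries to minimize, so a minimal illustration of this criterion (scikit-learn is an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    def trace_W(X, labels):
        """Sum of squared distances from each point to its cluster centroid."""
        return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                   for k in np.unique(labels))

    # KMeans' inertia_ is the same quantity it tries to minimize
    X = np.random.default_rng(0).normal(size=(100, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.inertia_, trace_W(X, km.labels_))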
32Minimization of the Determinant of W
- Maximize det(T)/det(W)
- Minimize det(W)
33Maximization of Trace(BW^-1)
34Optimizing the Clustering Criterion
- N(n,g) = the number of partitions of n individuals into g groups
- N(15,3) = 2,375,101
- N(20,4) = 45,232,115,901
- N(25,8) = 690,223,721,118,368,580
- N(100,5) ≈ 10^68
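These counts are Stirling numbers of the second kind; a short check of the first value (the choice of recurrence is mine, not from the slides):

    def n_partitions(n, g):
        """Stirling number of the second kind S(n, g) via the standard recurrence."""
        S = [[0] * (g + 1) for _ in range(n + 1)]
        S[0][0] = 1
        for i in range(1, n + 1):
            for j in range(1, min(i, g) + 1):
                S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
        return S[n][g]

    print(n_partitions(15, 3))   # 2,375,101, as on the slide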
35Hill Climbing Algorithms
- 1 - Form an initial partition into the required number of groups
- 2 - Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster
- 3 - Make the change which leads to the greatest improvement in the value of the clustering criterion
- 4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve
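A sketch of this exchange procedure with trace(W) as the criterion (the criterion choice and data layout are assumptions; for brevity the sketch accepts any improving move rather than searching for the single best move in each pass):

    import numpy as np

    def trace_W(X, labels, g):
        return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                   for k in range(g) if np.any(labels == k))

    def hill_climb(X, g, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, g, size=len(X))          # step 1: initial partition
        best = trace_W(X, labels, g)
        improved = True
        while improved:                                   # step 4: repeat until no gain
            improved = False
            for i in range(len(X)):                       # step 2: try moving each point
                for k in range(g):
                    if k == labels[i]:
                        continue
                    old = labels[i]
                    labels[i] = k
                    crit = trace_W(X, labels, g)
                    if crit < best:                       # step 3: keep an improving move
                        best, improved = crit, True
                    else:
                        labels[i] = old
        return labels, best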
36Alternative Methods
- Simulated Annealing
- Genetic Algorithms
- Quantum Computing
37Reference
- Everitt, B. S., Cluster Analysis (Third Edition), 1993.
38Mixture Based Clustering
39Finite Mixture Model
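The model formula on this slide was lost in conversion. The usual finite mixture form, stated here as an assumption about what was shown, is

    f(x) = sum_{c=1}^{g} pi_c f_c(x; theta_c),   with pi_c >= 0 and sum_c pi_c = 1,

where the f_c are component densities (for example multivariate normals) and the pi_c are the mixing proportions.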
40Expectation Maximization (EM) Algorithm
- Dempster, Laird, and Rubin (1977)
- Given an initial guess at the number of components in the mixture and their starting values, this technique attempts to maximize the likelihood of the model in a two-step process
41Iterative EM Equations E-Step
42Iterative EM Equations M-Step
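The E-step and M-step equations did not survive conversion; a minimal Python sketch of one EM iteration for a Gaussian mixture, following the standard formulation (an assumption about what these two slides showed):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, pis, mus, Sigmas):
        """One EM iteration for a g-component Gaussian mixture."""
        n, g = len(X), len(pis)
        # E-step: posterior probability that case i belongs to component c
        R = np.column_stack([pis[c] * multivariate_normal.pdf(X, mus[c], Sigmas[c])
                             for c in range(g)])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means, and covariances
        Nc = R.sum(axis=0)
        pis = Nc / n
        mus = [R[:, c] @ X / Nc[c] for c in range(g)]
        Sigmas = [(R[:, c, None] * (X - mus[c])).T @ (X - mus[c]) / Nc[c]
                  for c in range(g)]
        return pis, mus, Sigmas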
43Problems With the EM Algorithm
- One normally needs to constrain the σ_i (component variances) to prevent spiking
- Convergence may be very slow
44How Do We Initialize the Parameters?
- Random Starts
- Helps prevent convergence to local maxima in the likelihood surface
- Human intervention
- Initial partitioning
- Set the initial posteriors to 0 or 1 according to
a prior clustering scheme.
45How Do We Choose g?
- Human Intervention
- Divine Intervention
- Likelihood Ratio Test Statistic
- Wolfe's Method
- Bootstrap
- AIC
- Adaptive Mixtures Based Methods
- Pruning
- SHIP
46Problems With the Hypothesis Testing Method
- Regularity conditions do not hold for -2 log λ ~ χ²_d
- d = difference in the number of parameters
47Wolfe's Method
- H0: X ~ mixture with g1 components
- H1: X ~ mixture with g2 components
- g1 < g2
- e.g., g1 = 1, g2 = 2
- -2c log λ ~ χ²_d
- d = 2 × (difference in the number of parameters, not counting the mixing proportions)
- c = (n - 1 - p - g2/2)/n
48Interpreting Wolfe's Method
- n/p > 5 for Wolfe's approximation to work
- Power of the test is low when the Mahalanobis distance between terms is less than 2
- Use Wolfe's test as a guide to the number of components
- Since it is a guide, McLachlan and Basford suggest using c = 1
- Examine posterior probabilities for various g's
- Long-tailed distributions (e.g., lognormal) may lead to rejection of H0: g1 = 1
49Bootstrap Method
- H0: g = g1
- H1: g = g2
- Step 1 - Generate a bootstrap sample from a mixture of g1 groups (using parameters estimated from the data)
- Step 2 - Fit the bootstrap sample with both a g1- and a g2-component mixture model
- Step 3 - Compute -2 log λ (λ = L_H0/L_H1)
- Step 4 - Repeat k times
- Step 5 - Compare these to the -2 log λ computed on the original data
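A sketch of this bootstrap test of H0: g = g1 against H1: g = g2, using scikit-learn Gaussian mixtures as a stand-in for the fitting step (the library choice and details such as the number of restarts are assumptions):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def neg2_log_lambda(X, g1, g2, n_init=5):
        """-2 log(L_H0 / L_H1) for mixtures with g1 vs. g2 components."""
        m1 = GaussianMixture(g1, n_init=n_init, random_state=0).fit(X)
        m2 = GaussianMixture(g2, n_init=n_init, random_state=0).fit(X)
        return 2.0 * (m2.score(X) - m1.score(X)) * len(X)   # score() is the mean log-likelihood

    def bootstrap_test(X, g1, g2, k=350):                   # k = 350 echoes the guideline below
        observed = neg2_log_lambda(X, g1, g2)
        m1 = GaussianMixture(g1, n_init=5, random_state=0).fit(X)
        boot = []
        for _ in range(k):
            Xb, _ = m1.sample(len(X))                       # step 1: simulate from the fitted H0 model
            boot.append(neg2_log_lambda(Xb, g1, g2))        # steps 2-4, repeated k times
        # step 5: p-value = proportion of bootstrap values at least as large as the observed one
        return observed, np.mean(np.array(boot) >= observed)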
50Bootstrap Guidelines
- How many resamples are needed?
- Perhaps 350 or more
- Since we may need random starts of the initial
parameters to get good values for each sample, we
may have some major crunching to do for
moderately large g1 and g2.
51Significance Level
- The test which rejects H0 if -2 log λ > X_(j) (the jth order statistic of the k bootstrap values) has size α = 1 - j/(k+1)
52AIC
- AIC(g) = -2 L(f̂) + 2 N(g), where L is the maximized log-likelihood and N(g) is the number of free parameters in the model of size g
- We choose g in order to minimize the AIC criterion
- This criterion is subject to the same regularity conditions as -2 log λ
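Choosing g by minimizing AIC can be sketched as follows (scikit-learn's aic() follows the same -2 log L + 2 N(g) convention; the library is an assumption):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def choose_g_by_aic(X, g_max=10):
        aics = [GaussianMixture(g, n_init=5, random_state=0).fit(X).aic(X)
                for g in range(1, g_max + 1)]
        return int(np.argmin(aics)) + 1, aics   # g with the smallest AIC, plus the whole curve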
53Mode Based Methods
- Mixture terms and modes are not in 1-1 correspondence
- Kernel approach
- Silverman
- Mode tree of Minnotte
- Mode forest of Minnotte, Marchette and Wegman
- Hybrid
- Filtered Mode Tree
54Mode Based Methods
- [Figure: a mode forest]
55Adaptive Mixtures Density Estimator (AMDE)
- Priebe and Marchette 1992
- Priebe 1994
- Hybrid of Kernel Estimator and Mixture Model
- Number of Terms Driven by the Data
- L1 Consistent
56AMDE Algorithm
- 1 - Given a New Observation
- 2 - Update Existing Model Using the Recursive EM
- or
- 3 - Add a New Term to Explain This Data Point
57Recursive EM Equations - 1
58Recursive EM Equations - 2
59Create rule
- Test the MHD (Mahalanobis distance) from the current data point to each mixture term in the existing model
- Add in a new term when this distance exceeds a certain create threshold
- Location given by the current data point
- Covariance given by a weighted average of the existing covariances
- Mixing coefficient set to 1/n
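A rough sketch of this create rule for 1-d Gaussian terms (the threshold value and the exact weights used for the new term's spread are assumptions; the recursive EM update itself is omitted):

    import numpy as np

    def maybe_create_term(x, pis, mus, sigmas, n, threshold=3.0):
        """Add a new mixture term if x is far (in MHD) from every existing term."""
        mhd = np.abs(x - np.asarray(mus)) / np.asarray(sigmas)   # 1-d Mahalanobis distances
        if np.min(mhd) > threshold:
            pis = list(np.asarray(pis) * (1 - 1.0 / n)) + [1.0 / n]   # mixing coefficient = 1/n
            mus = list(mus) + [x]                                     # centered at the data point
            # new spread: weighted average of the existing ones (weights are an assumption)
            sigmas = list(sigmas) + [float(np.average(sigmas, weights=pis[:-1]))]
        return pis, mus, sigmas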
60Pruning
- Build an Overdetermined Mixture Model Using the AMDE
- Use AIC to Remove Superfluous Terms
- Apply This Procedure to Bootstrap Samples From
the Data Set
61SHIP Algorithm
- 1 - Fit a mixture model to the data
- 2 - Use the mixture model to set the bandwidth of the kernel estimator
- 3 - Examine the mismatch
- 4 - Add terms where there is mismatch
- 5 - Fit a new mixture model, reset the bandwidth, fit a new kernel estimator
- 6 - Continue
- 7 - Use AIC to tell when we are done
62SHIP Picture
63Mixture Visualization 1-d
64Mixture Visualization 2-d
65Mixture Visualization 3-d
66References
- J. L. Solka, W. L. Poston, and E. J. Wegman, A Visualization Technique for Studying the Iterative Estimation of Mixture Densities, Journal of Computational and Graphical Statistics, 4(3), pp. 180-197, 1995.
- Priebe, C. E., Adaptive Mixtures, JASA, Vol. 89, No. 427, pp. 796-806, 1994.
- Solka, J. L., Wegman, E. J., Priebe, C. E., Poston, W. L., and Rogers, G. W., A method to determine the structure of an unknown mixture using the Akaike Information Criterion and the bootstrap, Statistics and Computing, 8, 177-188, 1998.
67References
- G. J. McLachlan and K. E. Basford, Mixture Models, Marcel Dekker, 1988.
- D. Titterington, A. F. M. Smith, and U. E. Makov, Statistical Analysis of Finite Mixture Distributions, Wiley Series in Probability and Mathematical Statistics, 1985.
68Autoclass
- Developed by R. Hanson, J. Stutz, and P. Cheeseman
- Allows real and discrete data
- Allows missing data
- Uses a Bayesian approach
- Automatically chooses number of classes and their
structure
69Bayesian Assumptions
- A Bayesian agent uses a single real number to describe its degree of belief in each proposition of interest
- How evidence should affect beliefs
- These lead to the standard probability axioms
70Evidence of States
- Coin States
- 2 headed
- 1 headed
- Evidence
- Lands heads up
- Lands heads down
71Priors and Posteriors
- P(ab|cd): belief in a and b given c and d
- p(H): belief in H in the absence of, or prior to, any evidence
- p(H|E): posterior describing the agent's belief after observing evidence E
- L(E|H): likelihood, describing how likely it would be to see each possible evidence combination in each possible world H
72Normalization
- 0 < p(a|b) < 1
- Σ_H p(H) = 1
- Σ_E L(E|H) = 1
73Joint Likelihood
- J(E|H) = L(E|H) p(H)
- p(H|E) = J(E|H) / Σ_H J(E|H) = L(E|H) p(H) / Σ_H L(E|H) p(H)
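A tiny numeric illustration of these two lines (the prior and likelihood values are invented for the example):

    # Two hypotheses H with priors p(H), and likelihoods L(E|H) for one observed E
    prior = {'H1': 0.5, 'H2': 0.5}
    likelihood = {'H1': 0.9, 'H2': 0.3}                           # L(E|H)

    joint = {h: likelihood[h] * prior[h] for h in prior}          # J(E|H) = L(E|H) p(H)
    total = sum(joint.values())
    posterior = {h: joint[h] / total for h in joint}              # p(H|E) = J(E|H) / sum_H J(E|H)
    print(posterior)   # {'H1': 0.75, 'H2': 0.25}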
74Continuous
- H's continuous means that p(H) → dp(H)
- Σ's → integrals
- Continuous E's → differential likelihoods
- dL(E|H) → DL(E|H) ≈ dL(E|H) DE/dE
75Utility Function
76Approach
- Normal Possible States Correspond to Certain Models
- Model Parameters
- T: how many terms
- V: parameters that describe a class
- L(E|V T S)
- dp(V T|S)
- Decouple priors into a product
- Broad priors
- Equal weights lead to different levels of model complexity
- More parameters demand a better fit
77Joint
- dJ(E V T|S) = L(E|V T S) dp(V T|S)
- This is a spiky distribution in V T in a high dimensional space
- Punt normalization
78Alternate Approach
- Break V into a region R surrounding each peak
- Search to maximize
- M(E R T|S) = ∫_R dJ(E V T|S)
- Report the best few models
79Reported Models
- Marginal joint M(E R T|S)
- Discrete parameters T
- Estimate V in R
- E(V|E R T S) = ∫_{V in R} V dJ(E V T|S) / M(E R T|S)
- V that maximizes dJ(E V T|S) in R
80Clustering
- Evidence
- I cases with K attributes each, designated X_{i,k}
- S
- V
- T
- dL(E|V T S)
- dp(V T|S)
- M(E R T|S)
- E(V|E R T S)
81Likelihood
- L(E|V T S) = Π_i L(E_i|V T S)
- L(E_i|V T S), where E_i = (X_i1, X_i2, ..., X_iK)
82Mixture Based Modeling
- L(E_i|V T S_M) = Σ_{c=1}^{C} α_c L(E_i|V_c T_c S_c)
- Parameters
- T = (C, T_c)
- V = (α_c, V_c)
83Priors
- dp(V T|S_M) = F3(C) C! dB(α_1 ... α_C|C) Π_c dp(V_c T_c|S_c)
- F3(C) = 6/(π² C²)
84Priors on Mixing Coefficients
- dB(α_1 ... α_C|C) = [Γ(aC)/Γ(a)^C] Π_c α_c^(a-1) dα_c
- a = 1/C
- M(E|S) = Γ(aC) Π_c Γ(I_c + a) / [Γ(aC + I) Γ(a)^C]
- E(α_c|E S) = (I_c + a)/(I + aC) = (I_c + 1/C)/(I + 1)
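For example (an illustrative calculation, not from the slides): with I = 100 cases, C = 4 classes, and I_c = 40 cases currently assigned to class c, E(α_c|E S) = (40 + 1/4)/(100 + 1) ≈ 0.399, slightly shrunk from the raw proportion 0.40 toward the uniform value 1/C = 0.25.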
85Priors on the Mean
- dp(μ|S) = dR(μ|μ+, μ-)
- μ+ = max x_i, μ- = min x_i
- dR(y|y+, y-) = dy/(y+ - y-) for y in [y-, y+]
- E(μ|E S) = x̄
- Π_k dp(μ_k|S_R)
86Priors on the Covariance
- The prior on the covariance is very hairy and is
given by a Wishart distribution
87Algorithm
- We use our standard EM algorithm to focus on regions of the parameter space
- At the maxima the weights should be consistent with the expectation estimates based on the priors
88Initialization
- Start at a random configuration and run 10-100 iterations of the EM algorithm
- L(E_i|V T R S) ≈ Π_{c=1}^{C} (α_c L(E_i|V_c T_c S_c))^(w_ic)
- Similarly, we form an approximate joint function
89Complexity Search
- C is chosen randomly from a log-normal distribution fit to the C's of the 6-10 best trials seen so far, after trying a fixed range of C's to start
- Merge and split techniques have been formulated but are not generally better
- The marginal joints of the different trials follow a distribution that allows one to predict how much longer it will take, on average, to find a better solution
90Class Hierarchy
- It is possible for the various models to share terms in the mixtures
- In this manner duplication of effort is prevented
91Running auto-class
- On science
- /usr/local/autoclass-c
- /usr/local/autoclass
- On nswc
- email me
92Relevant files in autoclass-c
- data
- example data sets and outputs
- sample
- in-depth example of a run with a script file
- read-me.text
- explanation
- doc (papers)
- reports-c.text