Statistical Data Mining: A Short Course for the Army Conference on Applied Statistics

1
Statistical Data Mining: A Short Course for the
Army Conference on Applied Statistics
  • Edward J. Wegman
  • George Mason University
  • Jeffrey L. Solka
  • Naval Surface Warfare Center

2
Cluster Analysis

3
What is Cluster Analysis?
  • Given a collection of n objects, each of which is described by a set of p characteristics or variables, derive a useful division into a number of classes.
  • Both the number of classes and the properties of
    the classes are to be determined. (Everitt 1993)

4
Why Do This?
  • Organize
  • Prediction
  • Etiology (Causes)

5
How Do We Measure Quality?
  • The same data admit multiple valid clusterings
  • Male, Female
  • Low, Middle, Upper Income
  • A clustering is neither true nor false
  • It is measured by its utility

6
Data Normalization
  • zi = (xi - mi)/si, where mi and si are the mean and standard deviation of the ith variable
  • Such a normalization step may play havoc with cluster structure (see the sketch below)
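As a small illustration (not from the slides), a sketch of this column-wise z-scoring in Python/NumPy; the data X are hypothetical:

import numpy as np

def z_normalize(X):
    # Column-wise z-score: subtract each variable's mean, divide by its std.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical data: two well-separated clusters along the first variable.
# After z-scoring, that cluster-carrying variable is shrunk relative to the
# noise variable, which is one way normalization can blur cluster structure.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [10, 0]])
Z = z_normalize(X)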

7
Transformations
  • We may be interested in performing our cluster
    analysis in alternate coordinate frameworks
  • Principal Components
  • Grand Tour
  • Projection Pursuit

8
Dissimilarity Measures
  • Euclidean Distance
  • City Block
  • Canberra Metric
9
Dissimilarity Measures
  • Angular separation
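The slides' formulas for these measures did not survive conversion; below is a sketch of the four measures named above, using the textbook definitions (the zero-denominator convention in the Canberra metric is our assumption):

import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def city_block(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y))

def canberra(x, y):
    # Terms with a zero denominator are taken to contribute zero.
    x, y = np.asarray(x, float), np.asarray(y, float)
    num, den = np.abs(x - y), np.abs(x) + np.abs(y)
    return np.sum(np.divide(num, den, out=np.zeros_like(num), where=den > 0))

def angular_separation(x, y):
    # Cosine of the angle between the two vectors (a similarity, not a distance).
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))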

10
Between Group Similarity and Distance Measures
  • Summary Statistics for Each Group
  • Measurements of Within Group Variation
  • Similarity or Distance Measures Between Groups

11
Alternative Metrics
  • Genetic Applications
  • piA = gene frequency for the ith allele at a given locus in population A (and similarly for the second population)

12
Group Distance Measures
13
Group Distance Measures
  • S is the pooled within-group covariance matrix
  • W is the p×p matrix of pooled within-group dispersions of the 2 groups

14
Hierarchical Cluster Analysis
  • 1 cluster to n clusters
  • Agglomerative methods
  • Fusion of n data points into groups
  • Divisive methods
  • Separate the n data points into finer groupings

15
Dendrograms
  • [Figure: a dendrogram; read left to right (agglomerative) the singletons (1), (2), (3), (4), (5) merge into (1,2) and (4,5), then (3,4,5), and finally (1,2,3,4,5); read right to left it illustrates the divisive direction]

16
Agglomerative Algorithm
  • Start: clusters C1, C2, ..., Cn, each containing 1 data point
  • 1 - Find the nearest pair Ci, Cj; merge Cj into Ci, delete Cj, and decrement the cluster count by 1
  • 2 - If the number of clusters is greater than 1, go back to step 1 (see the sketch below)
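A minimal Python sketch of this loop; single linkage is assumed for the "nearest pair" rule, and D is a symmetric distance matrix (both are our choices for illustration):

import numpy as np

def agglomerate(D, target_k=1):
    # Start: clusters C1..Cn, one data point each.
    D = np.asarray(D, dtype=float)
    clusters = [[i] for i in range(len(D))]
    merges = []
    while len(clusters) > target_k:
        # Step 1: find the nearest pair of clusters (single-linkage distance).
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # merge Cj into Ci
        del clusters[b]                           # delete Cj, count drops by 1
    return merges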

17
Single Linkage (Nearest Neighbor) Clustering
  • The distance between two groups is defined as the distance between their closest pair of individuals, taking one individual from each group

18
Example of Single Linkage Clustering
  • Starting distance matrix:
         1     2     3     4     5
    1   0.0
    2   2.0   0.0
    3   6.0   5.0   0.0
    4  10.0   9.0   4.0   0.0
    5   9.0   8.0   5.0   3.0   0.0
  • After merging the closest pair (1,2) at distance 2.0:
        (1 2)    3     4     5
  (1 2)  0.0
    3    5.0   0.0
    4    9.0   4.0   0.0
    5    8.0   5.0   3.0   0.0

19
Example of Single Linkage Clustering
  • After merging (4,5) at distance 3.0:
          (1,2)    3   (4,5)
   (1,2)    0
     3      5      0
   (4,5)    8      4     0
  • After merging 3 into (4,5) at distance 4.0:
           (1,2)  (3,4,5)
   (1,2)     0
  (3,4,5)    5       0
  • The final merge joins (1,2) and (3,4,5) at distance 5.0
20
Resultant Dendrogram
[Figure: dendrogram of the single linkage result over leaves 1 2 3 4 5; the SciPy sketch below reproduces it]
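The worked example above can be reproduced with SciPy (assuming SciPy and matplotlib are available; this is a cross-check, not the authors' code):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Distance matrix from the example (points 1..5).
D = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
              [ 2.0,  0.0,  5.0,  9.0,  8.0],
              [ 6.0,  5.0,  0.0,  4.0,  5.0],
              [10.0,  9.0,  4.0,  0.0,  3.0],
              [ 9.0,  8.0,  5.0,  3.0,  0.0]])

Z = linkage(squareform(D), method='single')
print(Z)   # merges: (1,2) at 2.0, (4,5) at 3.0, 3 joins (4,5) at 4.0, all at 5.0
dendrogram(Z, labels=[1, 2, 3, 4, 5])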
21
Complete Linkage Clustering (Furthest Neighbor)
  • The distance between two groups is defined as the distance between their most distant pair of individuals, taking one individual from each group

22
Complete Linkage Example
  • Starting distance matrix (as before):
         1     2     3     4     5
    1   0.0
    2   2.0   0.0
    3   6.0   5.0   0.0
    4  10.0   9.0   4.0   0.0
    5   9.0   8.0   5.0   3.0   0.0
  • (1,2) is the first cluster
  • d(12),3 = max{d13, d23} = d13 = 6.0
  • d(12),4 = max{d14, d24} = d14 = 10.0
  • d(12),5 = max{d15, d25} = d15 = 9.0
  • Updated matrix:
        (1,2)    3     4     5
  (1,2)   0
    3     6      0
    4    10      4     0
    5     9      5     3     0
  • This implies (4,5) is the 2nd cluster
23
Complete Linkage Example
  • Updated matrix after merging (4,5):
        (1,2)    3   (4,5)
  (1,2)   0
    3     6      0
  (4,5)  10      5     0
  • Thus (3,4,5) is the 3rd cluster, and then (1,2,3,4,5)
24
Dendrogram
[Figure: dendrogram of the complete linkage result over leaves 1 2 3 4 5; the SciPy sketch below reproduces it]
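The same example under complete linkage, again with SciPy (the condensed distances are listed in SciPy's pair order; a cross-check of the arithmetic above, not the authors' code):

from scipy.cluster.hierarchy import linkage, dendrogram

# Pairwise distances for (1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5).
d = [2.0, 6.0, 10.0, 9.0, 5.0, 9.0, 8.0, 4.0, 5.0, 3.0]
Zc = linkage(d, method='complete')
print(Zc)   # merges: (1,2) at 2, (4,5) at 3, 3 joins (4,5) at 5, all at 10
dendrogram(Zc, labels=[1, 2, 3, 4, 5])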

25
Group Average Clustering
  • The distance between two clusters is the average of the distances over all pairs of individuals, one taken from each of the 2 groups

26
Centroid Clusters
  • We use the centroid of a group once it is formed.

27
Problems With Hierarchical Clustering
  • Chaining
  • The propensity to cluster together individuals linked by a chain of intermediates
  • Bias toward finding spherical clusters

28
The Number of Groups Problem
  • How do we decide on the appropriate number of
    clusters?
  • Duda and Hart (1973)
  • Form E(2)/E(1), where E(m) is the sum-of-squares error criterion for the m-cluster model
  • If the ratio is less than a critical value k, we reject the single-cluster model

29
Optimization Methods
  • Minimizing or maximizing some criteria
  • Does not necessarily form hierarchical clusters

30
Clustering Criteria
31
Minimization of Trace of W
  • This is equivalent to minimizing the sum of squared distances between each individual in a cluster and its cluster center (see the sketch below)
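A small sketch of the trace(W) criterion (names are ours); minimizing it over partitions minimizes the within-cluster sum of squared distances to the group centroids, which is also the k-means criterion:

import numpy as np

def trace_W(X, labels):
    # Trace of the pooled within-group dispersion matrix
    # W = sum over groups g of sum over members of (x - xbar_g)(x - xbar_g)'.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    tr = 0.0
    for g in np.unique(labels):
        Xg = X[labels == g]
        tr += np.sum((Xg - Xg.mean(axis=0)) ** 2)   # trace of this group's scatter
    return tr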

32
Minimization of the Determinant of W
  • Maximize det(T)/det(W)
  • Minimize det(W)

33
Maximization of Trace(BW^-1)
34
Optimizing the Clustering Criterion
  • N(n,g) = the number of distinct partitions of n individuals into g groups
  • N(15,3) = 2,375,101
  • N(20,4) = 45,232,115,901
  • N(25,8) = 690,223,721,118,368,580
  • N(100,5) ≈ 10^68

35
Hill Climbing Algorithms
  • 1 - Form an initial partition into the required number of groups
  • 2 - Calculate the change in the clustering criterion produced by moving each individual from its own cluster to another cluster
  • 3 - Make the change which leads to the greatest improvement in the value of the clustering criterion
  • 4 - Repeat steps (2) and (3) until no move of a single individual causes the clustering criterion to improve (see the sketch below)
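A minimal sketch of this exchange/hill-climbing loop; the criterion function is passed in (e.g., the trace_W sketch earlier), and lower values are assumed to be better:

import numpy as np

def hill_climb(X, labels, criterion):
    # Repeatedly make the single-point move that most improves the criterion;
    # stop when no move of a single individual improves it (a local optimum).
    labels = np.asarray(labels).copy()
    groups = np.unique(labels)
    current = criterion(X, labels)
    improved = True
    while improved:
        improved = False
        best = (current, None, None)
        for i in range(len(X)):                      # step 2: score every move
            for g in groups:
                if g == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = g
                val = criterion(X, trial)
                if val < best[0]:
                    best = (val, i, g)
        if best[1] is not None:                      # step 3: take the best move
            current, i, g = best
            labels[i] = g
            improved = True                          # step 4: repeat
    return labels, current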

36
Alternative Methods
  • Simulated Annealing
  • Genetic Algorithms
  • Quantum Computing

37
Reference
  • Cluster Analysis (Third Edition), Brian Everitt

38
Mixture Based Clustering

39
Finite Mixture Model
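The mixture formula on this slide did not survive conversion; the standard finite mixture density it refers to has the form f(x) = π1 f1(x; θ1) + ... + πg fg(x; θg), with mixing coefficients πc ≥ 0 summing to 1 and component densities fc (Gaussian in what follows).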
40
Expectation Maximization (EM) Algorithm
  • Dempster, Laird, and Rubin (1977)
  • Given an initial guess at the number of components in the mixture and starting values for their parameters, this technique attempts to maximize the likelihood of the model in a two-step iterative process

41
Iterative EM Equations E-Step
42
Iterative EM Equations M-Step
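The E- and M-step equations on these two slides were lost in conversion; the sketch below shows the standard EM updates for a univariate Gaussian mixture (our notation, offered only as a generic illustration):

import numpy as np

def em_step(x, pi, mu, sigma2):
    # One EM iteration for a univariate Gaussian mixture.
    # pi, mu, sigma2 are length-g NumPy arrays; x is the data vector.
    x = np.asarray(x, dtype=float)
    # E-step: posterior probability (responsibility) of each term for each point.
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    tau = pi * dens
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing proportions, means, and variances.
    n_c = tau.sum(axis=0)
    pi = n_c / len(x)
    mu = (tau * x[:, None]).sum(axis=0) / n_c
    sigma2 = (tau * (x[:, None] - mu) ** 2).sum(axis=0) / n_c
    return pi, mu, sigma2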
43
Problems With the EM Algorithm
  • One normally needs to constrain the component variances σi to prevent spiking (a component collapsing onto a single observation, sending the likelihood to infinity)
  • Convergence may be very slow

44
How Do We Initialize the Parameters?
  • Random Starts
  • Helps prevent convergence to local maxima in the
    likelihood surface
  • Human intervention
  • Initial partitioning
  • Set the initial posteriors to 0 or 1 according to
    a prior clustering scheme.
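For instance, if scikit-learn's GaussianMixture is the fitter being used, random restarts and random initializations can be requested directly (a hedged sketch; the data X are hypothetical):

from sklearn.mixture import GaussianMixture

# Ten random starts; the run with the highest likelihood is kept.
gm = GaussianMixture(n_components=3, n_init=10, init_params='random').fit(X)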

45
How Do We Choose g?
  • Human Intervention
  • Divine Intervention
  • Likelihood Ratio Test Statistic
  • Wolfe's Method
  • Bootstrap
  • AIC
  • Adaptive Mixtures Based Methods
  • Pruning
  • SHIP

46
Problems With the Hypothesis Testing Method
  • The regularity conditions do not hold for the asymptotic result -2 log λ ~ χ²(d)
  • d = difference in the number of parameters between the two models

47
Wolfe's Method
  • H0: X is a mixture of g1 components
  • H1: X is a mixture of g2 components
  • g1 < g2 (e.g. g1 = 1, g2 = 2)
  • -2c log λ ~ χ²(d)
  • d = 2 × (difference in the number of parameters, not counting the mixing proportions)
  • c = (n - 1 - p - g2/2)/n

48
Interpreting Wolfe's Method
  • n/p > 5 is needed for Wolfe's approximation to work
  • The power of the test is low when the Mahalanobis distance between terms is less than 2
  • Use Wolfe's test as a guide to the number of components
  • Since it is only a guide, McLachlan and Basford suggest using c = 1
  • Examine the posterior probabilities for various values of g
  • Long tailed distributions (e.g. lognormal) may lead to rejection of H0: g = 1

49
Bootstrap Method
  • H0: g = g1
  • H1: g = g2
  • Step 1 - Generate a bootstrap sample from a mixture of g1 groups (using parameters estimated from the data)
  • Step 2 - Fit the bootstrap sample with both a g1- and a g2-component mixture model
  • Step 3 - Compute -2 log λ, where λ = L(H0)/L(H1)
  • Step 4 - Repeat k times
  • Step 5 - Compare these values to the -2 log λ computed on the original data (see the sketch below)
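A sketch of this bootstrap likelihood-ratio procedure, using scikit-learn Gaussian mixtures as a stand-in for whichever mixture fitter is used; X, g1, g2 and n_boot are assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

def neg2_log_lambda(X, g1, g2, n_init=5):
    # -2 log lambda comparing g1- versus g2-component Gaussian mixtures.
    ll1 = GaussianMixture(g1, n_init=n_init).fit(X).score(X) * len(X)
    ll2 = GaussianMixture(g2, n_init=n_init).fit(X).score(X) * len(X)
    return -2.0 * (ll1 - ll2)

def bootstrap_lrt(X, g1, g2, n_boot=350):
    observed = neg2_log_lambda(X, g1, g2)
    null_model = GaussianMixture(g1, n_init=5).fit(X)   # H0 model fit to the data
    ref = []
    for _ in range(n_boot):                             # steps 1-4
        Xb, _ = null_model.sample(len(X))               # bootstrap sample under H0
        ref.append(neg2_log_lambda(Xb, g1, g2))
    # Step 5: compare the observed statistic with the bootstrap distribution.
    p_value = np.mean(np.array(ref) >= observed)
    return observed, p_value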

50
Bootstrap Guidelines
  • How many resamples are needed?
  • Perhaps 350 or more
  • Since we may need random starts of the initial
    parameters to get good values for each sample, we
    may have some major crunching to do for
    moderately large g1 and g2.

51
Significance Level
  • The test which rejects H0 when -2 log λ exceeds the jth smallest of the k bootstrap values (the jth order statistic) has size α = 1 - j/(k+1)

52
AIC
  • AIC(g) = -2L(f) + 2N(g), where L(f) is the maximized log likelihood and N(g) is the number of free parameters in the model of size g
  • We choose g in order to minimize the AIC criterion (see the sketch below)
  • This criterion is subject to the same regularity conditions as -2 log λ
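A sketch of choosing g by AIC with scikit-learn, whose GaussianMixture.aic uses the -2 log L + 2 N(g) form; the candidate range and the data X are assumptions:

from sklearn.mixture import GaussianMixture

candidates = range(1, 9)
aics = {g: GaussianMixture(g, n_init=5).fit(X).aic(X) for g in candidates}
g_best = min(aics, key=aics.get)   # the g that minimizes AIC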

53
Mode Based Methods
  • Mixture terms and modes are not in 1-1
    Correspondence
  • kernel approach
  • Silverman
  • Mode tree of Minnotte
  • Mode forest of Minnotte, Marchette and Wegman
  • Hybrid
  • Filtered Mode Tree

54
Mode Based Methods
A Mode Forest
55
Adaptive Mixtures Density Estimator (AMDE)
  • Priebe and Marchette 1992
  • Priebe 1994
  • Hybrid of Kernel Estimator and Mixture Model
  • Number of Terms Driven by the Data
  • L1 Consistent

56
AMDE Algorithm
  • 1 - Given a New Observation
  • 2 - Update Existing Model Using the Recursive EM
  • or
  • 3 - Add a New Term to Explain This Data Point

57
Recursive EM Equations - 1
58
Recursive EM Equations - 2
59
Create rule
  • Test the Mahalanobis distance (MHD) from the current data point to each mixture term in the existing model
  • Add a new term when this distance exceeds a certain create threshold
  • Location given by the current data point
  • Covariance given by a weighted average of the existing covariances
  • Mixing coefficient set to 1/n (see the sketch below)
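A sketch of the create rule only (the recursive EM update itself is not reproduced); the storage layout, the rescaling of the existing mixing coefficients, and all names here are our assumptions:

import numpy as np

def create_rule(x, pis, mus, covs, n, threshold):
    # Mahalanobis distance from the new point x to each existing mixture term.
    x = np.asarray(x, dtype=float)
    mhd = [np.sqrt((x - m) @ np.linalg.inv(S) @ (x - m)) for m, S in zip(mus, covs)]
    if min(mhd) <= threshold:
        return None                      # caller applies the recursive EM update
    # New term: location = current point, covariance = weighted average of the
    # existing covariances, mixing coefficient = 1/n (others rescaled to sum to 1).
    new_cov = sum(p * S for p, S in zip(pis, covs)) / sum(pis)
    new_pi = 1.0 / n
    pis = [p * (1.0 - new_pi) for p in pis] + [new_pi]
    return pis, mus + [x], covs + [new_cov]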

60
Pruning
  • Build Over Determined Mixture Model Using AMDE
  • Use AIC to Remove Superfluous Terms
  • Apply This Procedure to Bootstrap Samples From
    the Data Set

61
SHIP Algorithm
  • 1 - Fit a mixture model to the data
  • 2 - Use the mixture model to set the bandwidth of a kernel estimator
  • 3 - Examine the mismatch between the two estimates
  • 4 - Add mixture terms where there is mismatch
  • 5 - Fit a new mixture model, reset the bandwidth, and fit a new kernel estimator
  • 6 - Continue
  • 7 - Use AIC to tell when we are done

62
SHIP Picture
63
Mixture Visualization 1-d
64
Mixture Visualization 2-d
65
Mixture Visualization 3-d
66
References
  • J. L. Solka, W. L. Poston, and E. J. Wegman, A
    Visualization Technique for Studying the
    Iterative Estimation of Mixture Densities,
    Journal of Computational and Graphical
    Statistics, 4(3), pp.180-197, (1995).
  • Priebe, C. E., Adaptive Mixtures, JASA, vol.
    89, No. 427, pp. 796-806, (1994).
  • Solka, J. L., Wegman, E. J., Priebe, C. E.,
    Poston, W. L. , Rogers, G. W., A method to
    determine the structure of an unknown mixture
    using the Akaike Information Criterion and the
    bootstrap, Statistics and Computing, 8, 177-188,
    (1998).

67
References
  • G. J. McLachlan, and K. E. Basford, Mixture
    Models, Marcel Dekker, 1988.
  • D. Titterington, A. F. M. Smith, U. E. Makov,
    Statistical Analysis of Finite Mixture
    Distributions, Wiley Series in Probability and
    Mathematical Statistics, 1985.

68
Autoclass
  • Developed by R. Hanson, J. Stutz, and P. Cheeseman
  • Allows real and discrete data
  • Allows missing data
  • Uses a Bayesian approach
  • Automatically chooses number of classes and their
    structure

69
Bayesian Assumptions
  • A Bayesian agent uses a single real number to describe its degree of belief in each proposition of interest
  • Rules specify how evidence should affect beliefs
  • These assumptions lead to the standard probability axioms

70
Evidence of States
  • Coin States
  • 2 headed
  • 1 headed
  • Evidence
  • Lands heads up
  • Lands heads down

71
Priors and Posteriors
  • p(a,b|c,d) = belief in a and b given c and d
  • p(H) = belief in H in the absence of, or prior to, any evidence
  • p(H|E) = posterior describing the agent's belief after observing evidence E
  • L(E|H) = likelihood describing how likely each possible evidence combination would be in each possible world H

72
Normalization
  • 0 < p(a|b) < 1
  • ΣH p(H) = 1
  • ΣE L(E|H) = 1

73
Joint Likelihood
  • J(E,H) = L(E|H) p(H)
  • p(H|E) = J(E,H) / ΣH J(E,H) = L(E|H) p(H) / ΣH L(E|H) p(H)

74
Continuous
  • If the hypothesis space is continuous, p(H) becomes a density dp(H)
  • Sums over H become integrals over H
  • If the evidence space is continuous, we use a differential likelihood
  • dL(E|H) → ΔL(E|H) ≈ (dL(E|H)/dE) ΔE

75
Utility Function
  • EU(A) = ΣH U(H) p(H|E,A)

76
Approach
  • Normal Possible States Correspond to Certain
    Models
  • Model Parameters
  • T How many terms
  • V Parameters that describe a class
  • L(E|V,T,S)
  • dp(V,T|S)
  • Decouple priors into product
  • Broad priors
  • Equal weights lead to different levels of model
    complexity
  • More parameters demand better fit

77
Joint
  • dJ(E,V,T|S) = L(E|V,T,S) dp(V,T|S)
  • This is a spiky distribution in (V,T) in a high dimensional space
  • Punt normalization

78
Alternate Approach
  • Break V into a region R surrounding each peak
  • Search to maximize
  • M(E,R|T,S) = ∫R dJ(E,V,T|S)
  • Report the best few models

79
Reported Models
  • Marginal joint M(E,R|T,S)
  • Discrete parameters T
  • Estimate of V in R:
  • E(V|E,R,T,S) = ∫(V in R) V dJ(E,V,T|S) / M(E,R|T,S)
  • or the V that maximizes dJ(E,V,T|S) in R

80
Clustering
  • Evidence
  • I cases with K attributes each, designated Xi,k
  • S
  • V
  • T
  • dL(E|V,T,S)
  • dp(V,T|S)
  • M(E,R|T,S)
  • E(V|E,R,T,S)

81
Likelihood
  • L(E|V,T,S) = Πi L(Ei|V,T,S)
  • L(Ei|V,T,S), where Ei = (Xi,1, Xi,2, ..., Xi,K)

82
Mixture Based Modeling
  • L(Ei|V,T,S_M) = Σc=1..C αc L(Ei|Vc,Tc,Sc)
  • Parameters
  • T = (C, Tc)
  • V = (αc, Vc)

83
Priors
  • dp(V,T|S_M) = F3(C) C! dB(α1,...,αC|C) Πc dp(Vc|Tc,Sc)
  • F3(C) = 6/(π² C²)

84
Priors on Mixing Coefficients
  • dB(α1,...,αC|C) = [Γ(aC)/Γ(a)^C] Πc αc^(a-1) dαc
  • a = 1/C
  • M(E|S) = Γ(aC) Πc Γ(Ic + a) / [Γ(aC + I) Γ(a)^C]
  • E(αc|E,S) = (Ic + a)/(I + aC) = (Ic + 1/C)/(I + 1)

85
Priors on the Mean
  • dp(μ|S) = dR(μ|μ+, μ-)
  • μ+ = max xi, μ- = min xi
  • dR(y|y+, y-) = dy/(y+ - y-) for y in [y-, y+]
  • E(μ|E,S) = x̄
  • For the K attributes: Πk dp(μk|S,R)

86
Priors on the Covariance
  • The prior on the covariance is very hairy and is
    given by a Wishart distribution

87
Algorithm
  • We use our standard EM algorithm to focus on
    regions of the parameter space
  • At the maxima the weights should be consistent
    with the expectation estimations based on the
    priors

88
Initialization
  • Start at a random configuration and run 10-100
    iterations of the EM algorithm
  • L(Ei|V,T,R,S) ≈ Πc (αc L(Ei|Vc,Tc,Sc))^wic, where the wic are the class weights
  • Similarly we form an approximate joint function

89
Complexity Search
  • C is chosen randomly from a log-normal distribution fit to the Cs of the 6-10 best trials seen so far, after a fixed range of Cs has been tried to start
  • Merge and split techniques have been formulated but are not generally better
  • The marginal joints of the different trials follow a distribution (on the log scale) that allows one to predict how much longer, on average, it will take to find a better solution

90
Class Hierarchy
  • It is possible for the various models to share
    terms in the mixtures
  • In this manner duplication of effort is prevented

91
Running auto-class
  • On science
  • /usr/local/autoclass-c
  • /usr/local/autoclass
  • On nswc
  • email me

92
Relevant files in autoclass-c
  • data
  • example data sets and outputs
  • sample
  • in depth example of run with script file
  • read-me.text
  • explanation
  • doc (papers)
  • reports-c.text