Transcript and Presenter's Notes

Title: Feature Selection, Dimensionality Reduction, and Clustering


1
Feature Selection, Dimensionality Reduction,
and Clustering
Summer Course: Data Mining
Presenter: Georgi Nalbantov
August 2009
2
Structure
  • Dimensionality Reduction
  • Principal Components Analysis (PCA)
  • Nonlinear PCA (Kernel PCA, CatPCA)
  • Multi-Dimensional Scaling (MDS)
  • Homogeneity Analysis
  • Feature Selection
  • Filtering approach
  • Wrapper approach
  • Embedded methods
  • Clustering
  • Density estimation and clustering
  • K-means clustering
  • Hierarchical clustering
  • Clustering with Support Vector Machines (SVMs)

3
Feature Selection, Dimensionality Reduction, and
Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
4
Feature Selection
In the presence of millions of
features/attributes/inputs/variables, select the
most relevant ones. Advantages: build better,
faster, and easier-to-understand learning
machines.

[Figure: data matrix X with n rows (samples) and m columns (features)]
5
Feature Selection
  • Goal: select the two best features individually.
  • Any reasonable objective J will rank the features:
  • J(x1) > J(x2), J(x3) > J(x4)
  • Thus, the features chosen are {x1, x2} or {x1, x3}.
  • However, x4 is the only feature that provides
    complementary information to x1.

6
Feature Selection
  • Filtering approach: ranks features or feature
    subsets independently of the predictor
    (classifier),
  • using univariate methods: consider one variable
    at a time, or
  • using multivariate methods: consider more than
    one variable at a time.
  • Wrapper approach: uses a classifier to assess
    (many) features or feature subsets.
  • Embedded approach: uses a classifier to build a
    (single) model with a subset of features that are
    internally selected.

7
Feature Selection: univariate filtering approach
  • Issue: determine the relevance of a given single
    feature.

[Figure: class-conditional densities P(Xi | Y = 1) and
P(Xi | Y = -1) of feature xi, with class means and
standard deviations]
8
Feature Selection: univariate filtering approach
  • Issue: determine the relevance of a given single
    feature.
  • Under independence:
    P(X, Y) = P(X) P(Y)
  • Measure of dependence (Mutual Information):
    MI(X, Y) = ∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY
             = KL( P(X, Y) || P(X) P(Y) )
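As an illustration of MI-based univariate filtering (not part of the original slides), a minimal sketch using scikit-learn's mutual_info_classif on a synthetic data matrix X and label vector y; the data and the ranking step are assumptions for illustration only:

    # Univariate filtering: score each feature by estimated MI with the class label.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)
    X = np.column_stack([
        y + rng.normal(scale=0.5, size=200),  # class-dependent feature
        rng.normal(size=200),                 # irrelevant feature
    ])

    mi = mutual_info_classif(X, y, random_state=0)  # one MI estimate per feature
    ranking = np.argsort(mi)[::-1]                  # most relevant features first
    print(mi, ranking)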

9
Feature Selection: univariate filtering approach
  • Correlation and MI
  • Note: correlation is a measure of linear
    dependence, whereas MI also captures nonlinear
    dependence.

10
Feature Selection: univariate filtering approach
  • Correlation and MI under the Gaussian
    distribution
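The figure for this slide is not reproduced in the transcript; for reference, the standard closed form (a well-known identity, not taken from the slide text) for a bivariate Gaussian with correlation coefficient rho is:

    % Mutual information of a bivariate Gaussian with correlation \rho
    \mathrm{MI}(X, Y) = -\tfrac{1}{2}\,\log\bigl(1 - \rho^{2}\bigr)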

11
Feature Selection: univariate filtering approach.
Criteria for measuring dependence.
12
Feature Selection: univariate filtering approach

[Figure: class-conditional densities P(Xi | Y = -1) and
P(Xi | Y = 1) of feature xi (legend: Y = 1, Y = -1), with
class means and standard deviations]
13
Feature Selection: univariate filtering approach

[Figure: class-conditional densities of feature xi
(legend: Y = 1, Y = -1), comparing P(Xi | Y = 1) with
P(Xi | Y = -1) and showing the ratio
P(Xi | Y = 1) / P(Xi | Y = -1)]
14
Feature Selection: univariate filtering approach
Is the distance between the class means significant?
T-test
  • Normally distributed classes, equal variance σ²
    unknown, estimated from the data as σ²within.
  • Null hypothesis H0: μ+ = μ-
  • T statistic: if H0 is true, then
    t = (μ+ - μ-) / (σwithin √(1/m+ + 1/m-))
      ~ Student(m+ + m- - 2 d.f.)

[Figure: class-conditional densities of feature xi with
class means μ+, μ- and standard deviations σ+, σ-]
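As a concrete illustration of ranking features by this two-sample t statistic (an added sketch, not the presenter's code; the data are synthetic), using scipy:

    # Rank features by the two-sample t statistic between the Y = +1 and Y = -1 classes.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    y = rng.choice([-1, 1], size=100)
    X = rng.normal(size=(100, 5))
    X[:, 0] += 2.0 * (y == 1)  # shift feature 0 for the positive class

    t_stat, p_val = ttest_ind(X[y == 1], X[y == -1], equal_var=True)  # per-feature test
    ranking = np.argsort(-np.abs(t_stat))  # most significant features first
    print(t_stat, p_val, ranking)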
15
Feature Selection: multivariate filtering approach
Guyon and Elisseeff, JMLR 2004; Springer 2006
16
Feature Selection: search strategies
Kohavi and John, 1997
N features, 2^N possible feature subsets!
17
Feature Selection: search strategies
  • Forward selection or backward elimination
    (a greedy forward-selection sketch follows below).
  • Beam search: keep the k best paths at each step.
  • GSFS (generalized sequential forward selection):
    when (n-k) features are left, try all subsets of
    g features. More trainings at each step, but
    fewer steps.
  • PTA(l, r): plus l, take away r; at each step,
    run SFS l times then SBS r times.
  • Floating search: one step of SFS (resp. SBS),
    then SBS (resp. SFS) as long as we find better
    subsets than those of the same size obtained so
    far.
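To make the forward-selection strategy concrete, here is a minimal greedy SFS loop (an added illustration; score_subset is a hypothetical stand-in for any objective J, e.g. cross-validated accuracy of a wrapped classifier):

    # Greedy sequential forward selection (SFS): repeatedly add the feature that
    # most improves the objective J; stop when no candidate improves it.
    def sequential_forward_selection(n_features, score_subset):
        selected, best_score = [], float("-inf")
        while len(selected) < n_features:
            candidates = [f for f in range(n_features) if f not in selected]
            score, best_f = max((score_subset(selected + [f]), f) for f in candidates)
            if score <= best_score:
                break  # no remaining feature helps; stop early
            selected.append(best_f)
            best_score = score
        return selected, best_score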

18
Feature Selection: filters vs. wrappers vs.
embedded methods
  • Main goal: rank subsets of useful features

19
Feature Selection: feature subset assessment
(wrapper)
Given M samples of N variables/features, split the
data into 3 sets: training, validation, and test set.
  • 1) For each feature subset, train the predictor
    on the training data.
  • 2) Select the feature subset which performs best
    on the validation data.
  • Repeat and average if you want to reduce variance
    (cross-validation).
  • 3) Test on the test data.
  • Danger of over-fitting with intensive search!
    (A minimal assessment sketch follows below.)
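A minimal sketch of this wrapper protocol (added for illustration; the logistic-regression predictor, split sizes, and synthetic data are assumptions, not from the slides):

    # Wrapper assessment: train on the training split, compare candidate feature
    # subsets on the validation split, and reserve the test split for the final check.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 10))
    y = (X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    def validation_score(subset):
        clf = LogisticRegression().fit(X_train[:, subset], y_train)
        return clf.score(X_val[:, subset], y_val)

    print(validation_score([0, 3]), validation_score([5, 7]))  # compare two candidate subsets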

20
Feature Selection via Embedded Methods:
L1-regularization

[Figure: coefficient paths plotted against the L1
budget sum(beta)]
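As an embedded-method example (added; the Lasso estimator, the alpha value, and the synthetic data are illustrative assumptions), L1 regularization drives some coefficients exactly to zero, so feature selection happens inside model fitting:

    # Embedded selection via L1 regularization: nonzero Lasso coefficients
    # identify the selected features.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

    lasso = Lasso(alpha=0.1).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)  # features with nonzero weight
    print(lasso.coef_, selected)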
21
Feature Selection: summary
22
Dimensionality Reduction
In the presence of many features, select the
most relevant subset of (weighted) combinations
of features.
Feature Selection
Dimensionality Reduction
23
Dimensionality Reduction:
(Linear) Principal Components Analysis
  • PCA finds a linear mapping of dataset X to a
    dataset X' of lower dimensionality, such that the
    variance of X that is retained in X' is maximal.

Dataset X is mapped to dataset X', here of the
same dimensionality. The first dimension of X'
(the first principal component) is the direction
of maximal variance. The second principal
component is orthogonal to the first. (A minimal
PCA sketch follows below.)
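A minimal PCA sketch (added illustration; the two-component choice and the random data are assumptions):

    # Linear PCA: project the data onto the directions of maximal variance.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)      # n_samples x 2 projection
    print(pca.explained_variance_ratio_)  # share of variance retained per component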
24
Dimensionality Reduction:
Nonlinear (Kernel) Principal Components Analysis
Original dataset X.
Map X to a HIGHER-dimensional space, and carry
out LINEAR PCA in that space.
(If necessary,) map the resulting principal
components back to the original space.
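A minimal kernel PCA sketch (added; the RBF kernel, its gamma, and the toy dataset are assumptions, not from the slides):

    # Kernel PCA: implicit map to a higher-dimensional space via a kernel, then
    # linear PCA there; fit_inverse_transform=True allows an approximate map back.
    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA

    X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0, fit_inverse_transform=True)
    X_kpca = kpca.fit_transform(X)           # nonlinear principal components
    X_back = kpca.inverse_transform(X_kpca)  # approximate pre-images in the original space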
25
Dimensionality Reduction:
Multi-Dimensional Scaling
  • MDS is a mathematical dimension reduction
    technique that maps the distances between
    observations from the original (high) dimensional
    space into a lower (for example, two) dimensional
    space.
  • MDS attempts to retain the pairwise Euclidean
    distances in the low-dimensional space.
  • Error on the fit is measured using a so-called
    stress function
  • Different choices for a stress function are
    possible

26
Dimensionality Reduction:
Multi-Dimensional Scaling
  • Raw stress function (identical to PCA)
  • Sammon cost function
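The stress formulas themselves are images in the original deck and did not survive the transcript; the standard forms (added here for reference, with d_ij the original pairwise distances and y_i the low-dimensional coordinates) are:

    % Raw stress and Sammon stress for MDS
    \phi_{\mathrm{raw}}(Y)    = \sum_{i<j} \bigl( d_{ij} - \lVert y_i - y_j \rVert \bigr)^{2}
    \phi_{\mathrm{Sammon}}(Y) = \frac{1}{\sum_{i<j} d_{ij}}
        \sum_{i<j} \frac{\bigl( d_{ij} - \lVert y_i - y_j \rVert \bigr)^{2}}{d_{ij}}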

27
Dimensionality Reduction:
Multi-Dimensional Scaling (Example)
Input
Output
28
Dimensionality Reduction: Homogeneity analysis
  • Homals finds a lower-dimensional representation
    of a categorical data matrix X. It may be
    considered a type of nonlinear extension of
    PCA.

29
Clustering: Similarity measures for hierarchical
clustering
Clustering / Classification / Regression

[Figure: scatter plots over axes X1 and X2 with points
marked + and -, contrasting clustering, classification,
and regression]
  • k-th Nearest Neighbour
  • Parzen Window
  • Unfolding, Conjoint Analysis, Cat-PCA
  • Linear Discriminant Analysis, QDA
  • Logistic Regression (Logit)
  • Decision Trees, LSSVM, NN, VS
  • Classical Linear Regression
  • Ridge Regression
  • NN, CART

30
Clustering
  • Clustering is an unsupervised learning technique.
  • Task: organize objects into groups whose members
    are similar in some way.
  • Clustering finds structures in a collection of
    unlabeled data.
  • A cluster is a collection of objects which are
    similar to one another and dissimilar to the
    objects belonging to other clusters.

31
Density estimation and clustering
Bayesian separation curve (optimal)
32
Clustering: K-means clustering
  • Minimizes the sum of the squared distances to the
    cluster centers (reconstruction error).
  • Iterative process (sketched below):
  • given the current centers, estimate the cluster
    assignments (construct the Voronoi partition);
  • given the new cluster assignments, set each
    cluster center to its center of mass.
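A minimal NumPy sketch of these two alternating steps (added; k, the initialization, and the stopping rule are illustrative choices):

    # K-means: alternate (1) assigning points to the nearest center (Voronoi
    # partition) and (2) moving each center to the mean of its assigned points.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                      # step 1: assignments
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):              # converged
                break
            centers = new_centers                              # step 2: centers of mass
        return centers, labels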

33
Clustering: K-means clustering
Step 1
Step 2
Step 3
Step 4
34
Clustering: Hierarchical clustering
  • Clustering based on (dis)similarities. Multilevel
    clustering: level 1 has n clusters, level n has
    one cluster.
  • Agglomerative HC: starts with N clusters and
    combines clusters iteratively.
  • Divisive HC: starts with one cluster and divides
    iteratively.
  • Disadvantage: a wrong division cannot be undone.
    (A scipy sketch follows below.)

Dendrogram
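A minimal agglomerative-clustering sketch with scipy (added; the average-linkage choice and the cut into two clusters are assumptions):

    # Agglomerative hierarchical clustering: build the merge tree with a linkage
    # rule, then cut it into a flat clustering or draw the dendrogram.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=0.0, size=(20, 2)), rng.normal(loc=5.0, size=(20, 2))])

    Z = linkage(X, method="average")                 # merge history, one row per merge
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(labels)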
35
Clustering: Nearest Neighbor algorithm for
hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
36
Clustering: Nearest Neighbor algorithm for
hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
37
Clustering: Nearest Neighbor algorithm for
hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
38
Clustering: Similarity measures for hierarchical
clustering
39
Clustering: Similarity measures for hierarchical
clustering
  • Pearson Correlation Trend Similarity

40
Clustering: Similarity measures for hierarchical
clustering
  • Euclidean Distance

41
Clustering: Similarity measures for hierarchical
clustering
  • Cosine Correlation

-1 ≤ Cosine Correlation ≤ 1
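The formula image is not in the transcript; for reference, the cosine correlation (cosine similarity) between two profiles x and y is:

    % Cosine correlation between vectors x and y
    \cos(x, y) = \frac{x^{\top} y}{\lVert x \rVert \, \lVert y \rVert}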
42
Clustering: Similarity measures for hierarchical
clustering
  • Cosine Correlation Trend Mean Distance

43
Clustering: Similarity measures for hierarchical
clustering
44
Clustering: Similarity measures for hierarchical
clustering
Similar?
45
Clustering: Grouping strategies for hierarchical
clustering
C1
Merge which pair of clusters?
C2
C3
46
Clustering: Grouping strategies for hierarchical
clustering
Single Linkage
Dissimilarity between two clusters = minimum
dissimilarity between the members of the two
clusters.

[Figure: clusters C1 and C2]
Tends to generate long chains.
47
Clustering: Grouping strategies for hierarchical
clustering
Complete Linkage
Dissimilarity between two clusters = maximum
dissimilarity between the members of the two
clusters.

[Figure: clusters C1 and C2]
Tends to generate clumps.
48
Clustering: Grouping strategies for hierarchical
clustering
Average Linkage
Dissimilarity between two clusters = averaged
distances of all pairs of objects (one from each
cluster).

[Figure: clusters C1 and C2]
49
Clustering: Grouping strategies for hierarchical
clustering
Average Group Linkage
Dissimilarity between two clusters = distance
between the two cluster means.

[Figure: clusters C1 and C2]
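Compactly, with d(a, b) a pairwise dissimilarity and x̄_C the mean of cluster C, the four grouping rules above can be written as (standard definitions, added for reference):

    % Linkage rules between clusters C1 and C2
    d_{\mathrm{single}}(C_1, C_2)   = \min_{a \in C_1,\, b \in C_2} d(a, b)
    d_{\mathrm{complete}}(C_1, C_2) = \max_{a \in C_1,\, b \in C_2} d(a, b)
    d_{\mathrm{average}}(C_1, C_2)  = \frac{1}{|C_1|\,|C_2|} \sum_{a \in C_1} \sum_{b \in C_2} d(a, b)
    d_{\mathrm{group}}(C_1, C_2)    = d\!\left( \bar{x}_{C_1}, \bar{x}_{C_2} \right)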
50
Clustering: Support Vector Machines for
clustering
The not-noisy case
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
51
Clustering: Support Vector Machines for
clustering
The noisy case
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
52
Clustering: Support Vector Machines for
clustering
The noisy case (II)
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
53
Clustering: Support Vector Machines for
clustering
The noisy case (III)
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
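The objective functions on these slides are images and are missing from the transcript; for reference, the support vector clustering formulation of Ben-Hur et al. (2001) encloses the kernel-mapped data in a minimal sphere, with slack variables xi_j allowed only in the noisy case:

    % Smallest enclosing sphere with slacks (Ben-Hur et al., 2001); the
    % not-noisy case corresponds to all \xi_j = 0.
    \min_{R,\, a,\, \xi}\; R^{2} + C \sum_{j} \xi_{j}
    \quad \text{s.t.} \quad \lVert \Phi(x_j) - a \rVert^{2} \le R^{2} + \xi_{j},
    \qquad \xi_{j} \ge 0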
54
Conclusion / Summary / References
  • Dimensionality Reduction
  • Principal Components Analysis (PCA)
  • Nonlinear PCA (Kernel PCA, CatPCA)
  • Multi-Dimensional Scaling (MDS)
  • Homogeneity Analysis
  • Feature Selection
  • Filtering approach
  • Wrapper approach
  • Embedded methods
  • Clustering
  • Density estimation and clustering
  • K-means clustering
  • Hierarchical clustering
  • Clustering with Support Vector Machines (SVMs)

References
  • Kohavi and John, 1997
  • I. Guyon et al., 2006
  • Schoelkopf et al., 2001
  • Gifi, 1990
  • Borg and Groenen, 2005
  • Hastie et al., 2001
  • MacQueen, 1967
  • Ben-Hur, Horn, Siegelmann and Vapnik, 2001
  • http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf
  • http://www.autonlab.org/tutorials/kmeans11.pdf