Title: Feature Selection, Dimensionality Reduction, and Clustering
1. Feature Selection, Dimensionality Reduction, and Clustering
Summer Course: Data Mining
Presenter: Georgi Nalbantov
August 2009
2. Structure
- Dimensionality Reduction
- Principal Components Analysis (PCA)
- Nonlinear PCA (Kernel PCA, CatPCA)
- Multi-Dimensional Scaling (MDS)
- Homogeneity Analysis
- Feature Selection
- Filtering approach
- Wrapper approach
- Embedded methods
- Clustering
- Density estimation and clustering
- K-means clustering
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs)
3. Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
4. Feature Selection
In the presence of millions of features/attributes/inputs/variables, select the most relevant ones. Advantages: build better, faster, and easier-to-understand learning machines.
[Figure: data matrix X with n samples (rows) and m features (columns)]
5. Feature Selection
- Goal: select the two best features individually.
- Any reasonable objective J will rank the features, e.g. J(x1) > J(x2) ≈ J(x3) > J(x4).
- Thus, the features chosen are {x1, x2} or {x1, x3}.
- However, x4 is the only feature that provides complementary information to x1.
6. Feature Selection
- Filtering approach: ranks features or feature subsets independently of the predictor (classifier).
  - Univariate methods consider one variable at a time.
  - Multivariate methods consider more than one variable at a time.
- Wrapper approach: uses a classifier to assess (many) features or feature subsets.
- Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.
7. Feature Selection: univariate filtering approach
- Issue: determine the relevance of a given single feature.
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = +1) of feature xi, with class means μ-, μ+ and standard deviations σ-, σ+]
8. Feature Selection: univariate filtering approach
- Issue: determine the relevance of a given single feature.
- Under independence:
  P(X, Y) = P(X) P(Y)
- Measure of dependence (Mutual Information):
  MI(X, Y) = ∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY
           = KL( P(X, Y) || P(X) P(Y) )
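A minimal sketch of this univariate MI filter, assuming scikit-learn is available; the synthetic data and the top-5 cut-off are illustrative choices, not from the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                   # best features first
print("Top 5 features by MI:", ranking[:5])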
9. Feature Selection: univariate filtering approach
- Correlation and MI
- Note: correlation is a measure of linear dependence.
10. Feature Selection: univariate filtering approach
- Correlation and MI under the Gaussian distribution
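For reference (this is the standard identity the slide presumably plots): for jointly Gaussian X and Y with correlation coefficient ρ,
\[
\mathrm{MI}(X, Y) = -\tfrac{1}{2}\,\log\!\bigl(1 - \rho^2\bigr),
\]
so under Gaussianity, mutual information is a monotone function of |ρ|.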
11. Feature Selection: univariate filtering approach. Criteria for measuring dependence.
12. Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = 1) of feature xi, with class means μ-, μ+ and standard deviations σ-, σ+]
13. Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = -1), and the likelihood ratio P(Xi | Y = 1) / P(Xi | Y = -1)]
14. Feature Selection: univariate filtering approach
- Is the distance between the class means μ+ and μ- significant? T-test:
- Normally distributed classes, equal variance σ² unknown, estimated from the data as s²_within.
- Null hypothesis H0: μ+ = μ-
- T statistic: if H0 is true, then
  t = (μ+ - μ-) / ( s_within √(1/m+ + 1/m-) ) ~ Student(m+ + m- - 2 d.f.),
  where m+ and m- are the two class sample sizes.
[Figure: class-conditional densities of feature xi with means μ-, μ+]
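A minimal sketch of t-test-based univariate filtering, assuming SciPy is available; the synthetic class samples are illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_pos = rng.normal(loc=1.0, scale=1.0, size=50)   # feature values for class y = +1
x_neg = rng.normal(loc=0.0, scale=1.0, size=60)   # feature values for class y = -1

# Two-sample t-test with pooled (equal) variance, as on the slide.
t_stat, p_value = stats.ttest_ind(x_pos, x_neg, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")     # small p suggests a relevant feature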
15. Feature Selection: multivariate filtering approach
Guyon-Elisseeff, JMLR 2004 Springer 2006
16. Feature Selection: search strategies
Kohavi-John, 1997
N features, 2^N possible feature subsets!
17. Feature Selection: search strategies
- Forward selection or backward elimination (see the sketch after this list).
- Beam search: keep the k best paths at each step.
- GSFS (generalized sequential forward selection): when (n-k) features are left, try all subsets of g features. More trainings at each step, but fewer steps.
- PTA(l, r): plus l, take away r; at each step, run SFS l times, then SBS r times.
- Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
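A minimal sketch of greedy sequential forward selection, assuming scikit-learn; the classifier, dataset, and target subset size are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)
n_select = 5
selected, remaining = [], list(range(X.shape[1]))

while len(selected) < n_select:
    # Try adding each remaining feature; keep the one with the best CV score.
    scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("Selected features:", selected)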
18. Feature Selection: filters vs. wrappers vs. embedded methods
- Main goal: rank subsets of useful features
19. Feature Selection: feature subset assessment (wrapper)
[Figure: data matrix with M samples and N variables/features]
Split the data into 3 sets: training, validation, and test set.
- 1) For each feature subset, train the predictor on the training data.
- 2) Select the feature subset which performs best on the validation data.
- Repeat and average if you want to reduce variance (cross-validation).
- 3) Test on the test data.
- Danger of over-fitting with intensive search! (A sketch of this procedure follows below.)
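A minimal sketch of wrapper-style subset assessment with a train/validation/test split, assuming scikit-learn; the classifier and the candidate subsets are hypothetical:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidate_subsets = [[0, 1, 2], [3, 4, 5], [0, 3, 7, 9]]        # hypothetical subsets

def val_score(subset):
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, subset], y_tr)
    return clf.score(X_val[:, subset], y_val)                    # 1) train, 2) validate

best = max(candidate_subsets, key=val_score)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, best], y_tr)
print("Best subset:", best, "test accuracy:", clf.score(X_te[:, best], y_te))  # 3) test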
20. Feature Selection via Embedded Methods: L1-regularization
[Figure: two plots against sum(beta)]
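A minimal sketch of embedded feature selection via L1-regularization (the Lasso), assuming scikit-learn; the regression data and the value of alpha are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)      # features with non-zero coefficients
print("Features kept by the L1 penalty:", selected)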
21. Feature Selection: summary
22. Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features.
[Figure: schematic contrasting Feature Selection with Dimensionality Reduction]
23. Dimensionality Reduction: (Linear) Principal Components Analysis
- PCA finds a linear mapping of dataset X to a dataset X' of lower dimensionality, such that the variance of X that is retained in X' is maximal.
Dataset X is mapped to dataset X', here of the same dimensionality. The first dimension of X' (the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
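A minimal sketch of linear PCA, assuming scikit-learn; the random data and the 2-component target dimensionality are illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # 200 samples, 10 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)            # project onto the top-2 principal components
print("Retained variance ratio:", pca.explained_variance_ratio_)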
24. Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis
Original dataset X.
Map X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.
(If necessary,) map the resulting principal components back to the original space.
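A minimal sketch of kernel PCA on non-linearly structured data, assuming scikit-learn; the RBF kernel, gamma, and the two-circles data are illustrative choices:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0,
                 fit_inverse_transform=True)   # enables an approximate map back
X_kpca = kpca.fit_transform(X)                 # linear PCA in the kernel-induced space
X_back = kpca.inverse_transform(X_kpca)        # approximate pre-images in input space
print(X_kpca.shape, X_back.shape)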
25. Dimensionality Reduction: Multi-Dimensional Scaling
- MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower-dimensional (for example, two-dimensional) space.
- MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.
- The error of the fit is measured using a so-called stress function.
- Different choices for the stress function are possible.
26. Dimensionality Reduction: Multi-Dimensional Scaling
- Raw stress function (identical to PCA)
- Sammon cost function
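The formulas themselves did not survive extraction; for reference, with d_ij the original pairwise distances and \(\hat d_{ij}\) the distances in the low-dimensional embedding, the standard forms are:
\[
E_{\text{raw}} = \sum_{i<j}\bigl(d_{ij} - \hat d_{ij}\bigr)^2,
\qquad
E_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\bigl(d_{ij} - \hat d_{ij}\bigr)^2}{d_{ij}}
\]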
27. Dimensionality Reduction: Multi-Dimensional Scaling (Example)
[Figure: input data and its two-dimensional MDS output]
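A minimal sketch of computing a two-dimensional metric MDS embedding, assuming scikit-learn; the digits data and the 200-sample cap are illustrative:

from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:200]                                   # keep it small; MDS scales poorly with n

mds = MDS(n_components=2, random_state=0)     # minimizes a stress function
X_2d = mds.fit_transform(X)
print("Embedding shape:", X_2d.shape, "final stress:", round(mds.stress_, 1))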
28. Dimensionality Reduction: Homogeneity analysis
- Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a nonlinear extension of PCA.
29. Clustering: Similarity measures for hierarchical clustering
[Figure: overview in feature space (X1, X2) of Clustering, Classification, and Regression, with example methods per panel: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA; Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS; Classical Linear Regression, Ridge Regression, NN, CART]
30. Clustering
- Clustering is an unsupervised learning technique.
- Task: organize objects into groups whose members are similar in some way.
- Clustering finds structure in a collection of unlabeled data.
- A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.
31. Density estimation and clustering
Bayesian separation curve (optimal)
32. Clustering: K-means clustering
- Minimizes the sum of the squared distances to the cluster centers (the reconstruction error):
  J = Σ_k Σ_{i ∈ C_k} ||x_i - μ_k||²
- Iterative process (see the sketch below):
  - Estimate the current assignments (construct a Voronoi partition).
  - Given the new cluster assignments, set each cluster center to the center of mass of its points.
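A minimal sketch of the k-means iteration described above, in NumPy; k and the synthetic data are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 3])])
k = 3
centers = X[rng.choice(len(X), k, replace=False)]        # random initial centers

for _ in range(100):
    # Step 1: assign each point to its nearest center (Voronoi partition).
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    # Step 2: move each center to the mean of the points assigned to it.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):                 # converged
        break
    centers = new_centers

print("Cluster centers:\n", centers)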
33. Clustering: K-means clustering
[Figure: Steps 1-4 of the k-means iterations]
34. Clustering: Hierarchical clustering
- Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster.
- Agglomerative HC starts with n clusters and combines clusters iteratively.
- Divisive HC starts with one cluster and divides iteratively.
- Disadvantage: a wrong division cannot be undone.
Dendrogram
35. Clustering: Nearest Neighbor algorithm for hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
36. Clustering: Nearest Neighbor algorithm for hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
37. Clustering: Nearest Neighbor algorithm for hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
38. Clustering: Similarity measures for hierarchical clustering
39. Clustering: Similarity measures for hierarchical clustering
- Pearson Correlation: trend similarity
40. Clustering: Similarity measures for hierarchical clustering
41. Clustering: Similarity measures for hierarchical clustering
-1 ≤ Cosine Correlation ≤ 1
42. Clustering: Similarity measures for hierarchical clustering
- Cosine Correlation: trend and mean distance
43. Clustering: Similarity measures for hierarchical clustering
44. Clustering: Similarity measures for hierarchical clustering
Similar?
45. Clustering: Grouping strategies for hierarchical clustering
Merge which pair of clusters?
[Figure: three clusters C1, C2, C3]
46. Clustering: Grouping strategies for hierarchical clustering
Single Linkage
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
[Figure: clusters C1 and C2]
Tends to generate long chains.
47. Clustering: Grouping strategies for hierarchical clustering
Complete Linkage
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
[Figure: clusters C1 and C2]
Tends to generate clumps.
48. Clustering: Grouping strategies for hierarchical clustering
Average Linkage
Dissimilarity between two clusters = averaged distance over all pairs of objects (one from each cluster).
[Figure: clusters C1 and C2]
49. Clustering: Grouping strategies for hierarchical clustering
Average Group Linkage
Dissimilarity between two clusters = distance between the two cluster means.
[Figure: clusters C1 and C2]
A sketch comparing these linkage strategies follows below.
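A minimal sketch of agglomerative clustering with the linkage strategies above, assuming SciPy; the synthetic data and the two-cluster cut are illustrative ("centroid" plays the role of average group linkage, i.e. distance between cluster means):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in ([0, 0], [3, 3])])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
    print(method, "->", np.bincount(labels)[1:])     # resulting cluster sizes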
50. Clustering: Support Vector Machines for clustering
The noiseless case
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
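The objective itself did not survive extraction; for reference, the standard noiseless formulation of support vector clustering (Ben-Hur et al., 2001) encloses the mapped points in the smallest sphere of radius R centred at a:
\[
\min_{R,\,a}\; R^2 \quad \text{s.t.}\quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 \;\;\forall j
\]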
51. Clustering: Support Vector Machines for clustering
The noisy case
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
52. Clustering: Support Vector Machines for clustering
The noisy case (II)
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
53. Clustering: Support Vector Machines for clustering
The noisy case (III)
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
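For reference, the noisy (soft) formulation adds slack variables ξ_j with trade-off parameter C, again following Ben-Hur et al. (2001):
\[
\min_{R,\,a,\,\xi}\; R^2 + C\sum_j \xi_j \quad \text{s.t.}\quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 + \xi_j,\;\; \xi_j \ge 0 \;\;\forall j
\]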
54. Conclusion / Summary / References
- Dimensionality Reduction
- Principal Components Analysis (PCA)
- Nonlinear PCA (Kernel PCA, CatPCA)
- Multi-Dimensional Scaling (MDS)
- Homogeneity Analysis
- Feature Selection
- Filtering approach
- Wrapper approach
- Embedded methods
- Clustering
- Density estimation and clustering
- K-means clustering
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs)
- Kohavi and John, 1997
- I. Guyon et al., 2006
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf
- Schoelkopf et al., 2001
- Gifi, 1990
- Borg and Groenen, 2005
- Hastie et al., 2001
- MacQueen, 1967
- http://www.autonlab.org/tutorials/kmeans11.pdf
- Ben-Hur, Horn, Siegelmann and Vapnik, 2001