Title: Feature Selection, Dimensionality Reduction, and Clustering
1. Feature Selection, Dimensionality Reduction, and Clustering
Summer Course: Data Mining
Presenter: Georgi Nalbantov
August 2009
2. Structure
- Dimensionality Reduction
- Principal Components Analysis (PCA)
- Nonlinear PCA (Kernel PCA, CatPCA)
- Multi-Dimensional Scaling (MDS)
- Homogeneity Analysis
- Feature Selection
- Filtering approach
- Wrapper approach
- Embedded methods
- Clustering
- Density estimation and clustering
- K-means clustering
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs)
3. Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
4. Feature Selection
In the presence of millions of features/attributes/inputs/variables, select the most relevant ones. Advantages: build better, faster, and easier-to-understand learning machines.
[Figure: data matrix X with n samples (rows) and m features (columns)]
5. Feature Selection
- Goal: select the two best features individually.
- Any reasonable objective J will rank the features, e.g. J(x1) > J(x2) ≈ J(x3) > J(x4).
- Thus, the features chosen are {x1, x2} or {x1, x3}.
- However, x4 is the only feature that provides complementary information to x1.
6. Feature Selection
- Filtering approach: ranks features or feature subsets independently of the predictor (classifier).
  - Univariate methods consider one variable at a time.
  - Multivariate methods consider more than one variable at a time.
- Wrapper approach: uses a classifier to assess (many) features or feature subsets.
- Embedded approach: uses a classifier to build a (single) model with a subset of features that are internally selected.
7. Feature Selection: univariate filtering approach
- Issue: determine the relevance of a given single feature.
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = +1) of feature xi, with class means μ-, μ+ and standard deviations σ-, σ+]
8. Feature Selection: univariate filtering approach
- Issue: determine the relevance of a given single feature.
- Under independence:
  P(X, Y) = P(X) P(Y)
- Measure of dependence (Mutual Information):
  MI(X, Y) = ∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY
           = KL( P(X, Y) || P(X) P(Y) )
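A minimal sketch of this univariate MI filter, assuming scikit-learn is available; the synthetic data and the top-5 cut-off are illustrative choices, not from the slides:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                   # best features first
print("Top 5 features by MI:", ranking[:5])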
9. Feature Selection: univariate filtering approach
- Correlation and MI
- Note: correlation is a measure of linear dependence.
10. Feature Selection: univariate filtering approach
- Correlation and MI under the Gaussian distribution
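For reference (this is the standard identity the slide presumably plots): for jointly Gaussian X and Y with correlation coefficient ρ,
\[
\mathrm{MI}(X, Y) = -\tfrac{1}{2}\,\log\!\bigl(1 - \rho^2\bigr),
\]
so under Gaussianity, mutual information is a monotone function of |ρ|.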
11. Feature Selection: univariate filtering approach. Criteria for measuring dependence.
12. Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = -1) and P(Xi | Y = 1) of feature xi, with class means μ-, μ+ and standard deviations σ-, σ+]
13. Feature Selection: univariate filtering approach
[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = -1), and the likelihood ratio P(Xi | Y = 1) / P(Xi | Y = -1)]
14. Feature Selection: univariate filtering approach
- Is the distance between the class means μ+ and μ- significant? T-test:
- Normally distributed classes, equal variance σ² unknown, estimated from the data as s²_within.
- Null hypothesis H0: μ+ = μ-
- T statistic: if H0 is true, then
  t = (μ+ - μ-) / ( s_within √(1/m+ + 1/m-) ) ~ Student(m+ + m- - 2 d.f.),
  where m+ and m- are the two class sample sizes.
[Figure: class-conditional densities of feature xi with means μ-, μ+]
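A minimal sketch of t-test-based univariate filtering, assuming SciPy is available; the synthetic class samples are illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_pos = rng.normal(loc=1.0, scale=1.0, size=50)   # feature values for class y = +1
x_neg = rng.normal(loc=0.0, scale=1.0, size=60)   # feature values for class y = -1

# Two-sample t-test with pooled (equal) variance, as on the slide.
t_stat, p_value = stats.ttest_ind(x_pos, x_neg, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")     # small p suggests a relevant feature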
15. Feature Selection: multivariate filtering approach
Guyon-Elisseeff, JMLR 2004 Springer 2006
16. Feature Selection: search strategies
Kohavi-John, 1997
N features, 2^N possible feature subsets!
17. Feature Selection: search strategies
- Forward selection or backward elimination (see the sketch after this list).
- Beam search: keep the k best paths at each step.
- GSFS (generalized sequential forward selection): when (n-k) features are left, try all subsets of g features. More trainings at each step, but fewer steps.
- PTA(l, r): plus l, take away r; at each step, run SFS l times, then SBS r times.
- Floating search: one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far.
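A minimal sketch of greedy sequential forward selection, assuming scikit-learn; the classifier, dataset, and target subset size are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)
n_select = 5
selected, remaining = [], list(range(X.shape[1]))

while len(selected) < n_select:
    # Try adding each remaining feature; keep the one with the best CV score.
    scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("Selected features:", selected)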
18. Feature Selection: filters vs. wrappers vs. embedded methods
- Main goal: rank subsets of useful features
19. Feature Selection: feature subset assessment (wrapper)
[Figure: data matrix with M samples and N variables/features]
Split the data into 3 sets: training, validation, and test set.
- 1) For each feature subset, train the predictor on the training data.
- 2) Select the feature subset which performs best on the validation data.
- Repeat and average if you want to reduce variance (cross-validation).
- 3) Test on the test data.
- Danger of over-fitting with intensive search! (A sketch of this procedure follows below.)
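A minimal sketch of wrapper-style subset assessment with a train/validation/test split, assuming scikit-learn; the classifier and the candidate subsets are hypothetical:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

candidate_subsets = [[0, 1, 2], [3, 4, 5], [0, 3, 7, 9]]        # hypothetical subsets

def val_score(subset):
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, subset], y_tr)
    return clf.score(X_val[:, subset], y_val)                    # 1) train, 2) validate

best = max(candidate_subsets, key=val_score)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr[:, best], y_tr)
print("Best subset:", best, "test accuracy:", clf.score(X_te[:, best], y_te))  # 3) test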
20. Feature Selection via Embedded Methods: L1-regularization
[Figure: two plots against sum(beta)]
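A minimal sketch of embedded feature selection via L1-regularization (the Lasso), assuming scikit-learn; the regression data and the value of alpha are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)      # features with non-zero coefficients
print("Features kept by the L1 penalty:", selected)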
21. Feature Selection: summary
22. Dimensionality Reduction
In the presence of many features, select the most relevant subset of (weighted) combinations of features.
[Figure: schematic contrasting Feature Selection with Dimensionality Reduction]
23. Dimensionality Reduction: (Linear) Principal Components Analysis
- PCA finds a linear mapping of dataset X to a dataset X' of lower dimensionality, such that the variance of X that is retained in X' is maximal.
Dataset X is mapped to dataset X', here of the same dimensionality. The first dimension of X' (the first principal component) is the direction of maximal variance. The second principal component is orthogonal to the first.
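A minimal sketch of linear PCA, assuming scikit-learn; the random data and the 2-component target dimensionality are illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # 200 samples, 10 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)            # project onto the top-2 principal components
print("Retained variance ratio:", pca.explained_variance_ratio_)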
24. Dimensionality Reduction: Nonlinear (Kernel) Principal Components Analysis
Original dataset X.
Map X to a HIGHER-dimensional space, and carry out LINEAR PCA in that space.
(If necessary,) map the resulting principal components back to the original space.
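A minimal sketch of kernel PCA on non-linearly structured data, assuming scikit-learn; the RBF kernel, gamma, and the two-circles data are illustrative choices:

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0,
                 fit_inverse_transform=True)   # enables an approximate map back
X_kpca = kpca.fit_transform(X)                 # linear PCA in the kernel-induced space
X_back = kpca.inverse_transform(X_kpca)        # approximate pre-images in input space
print(X_kpca.shape, X_back.shape)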
25. Dimensionality Reduction: Multi-Dimensional Scaling
- MDS is a mathematical dimension-reduction technique that maps the distances between observations from the original (high-dimensional) space into a lower-dimensional (for example, two-dimensional) space.
- MDS attempts to retain the pairwise Euclidean distances in the low-dimensional space.
- The error of the fit is measured using a so-called stress function.
- Different choices for the stress function are possible.
26. Dimensionality Reduction: Multi-Dimensional Scaling
- Raw stress function (identical to PCA)
- Sammon cost function
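The formulas themselves did not survive extraction; for reference, with d_ij the original pairwise distances and \(\hat d_{ij}\) the distances in the low-dimensional embedding, the standard forms are:
\[
E_{\text{raw}} = \sum_{i<j}\bigl(d_{ij} - \hat d_{ij}\bigr)^2,
\qquad
E_{\text{Sammon}} = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\bigl(d_{ij} - \hat d_{ij}\bigr)^2}{d_{ij}}
\]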
27. Dimensionality Reduction: Multi-Dimensional Scaling (Example)
[Figure: input data and its two-dimensional MDS output]
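A minimal sketch of computing a two-dimensional metric MDS embedding, assuming scikit-learn; the digits data and the 200-sample cap are illustrative:

from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:200]                                   # keep it small; MDS scales poorly with n

mds = MDS(n_components=2, random_state=0)     # minimizes a stress function
X_2d = mds.fit_transform(X)
print("Embedding shape:", X_2d.shape, "final stress:", round(mds.stress_, 1))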
28. Dimensionality Reduction: Homogeneity analysis
- Homals finds a lower-dimensional representation of a categorical data matrix X. It may be considered a nonlinear extension of PCA.
29. Clustering: Similarity measures for hierarchical clustering
[Figure: overview in feature space (X1, X2) of Clustering, Classification, and Regression, with example methods per panel: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA; Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS; Classical Linear Regression, Ridge Regression, NN, CART]
30. Clustering
- Clustering is an unsupervised learning technique.
- Task: organize objects into groups whose members are similar in some way.
- Clustering finds structure in a collection of unlabeled data.
- A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.
31. Density estimation and clustering
Bayesian separation curve (optimal)
32. Clustering: K-means clustering
- Minimizes the sum of the squared distances to the cluster centers (the reconstruction error):
  J = Σ_k Σ_{i ∈ C_k} ||x_i - μ_k||²
- Iterative process (see the sketch below):
  - Estimate the current assignments (construct a Voronoi partition).
  - Given the new cluster assignments, set each cluster center to the center of mass of its points.
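A minimal sketch of the k-means iteration described above, in NumPy; k and the synthetic data are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 3])])
k = 3
centers = X[rng.choice(len(X), k, replace=False)]        # random initial centers

for _ in range(100):
    # Step 1: assign each point to its nearest center (Voronoi partition).
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    # Step 2: move each center to the mean of the points assigned to it.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):                 # converged
        break
    centers = new_centers

print("Cluster centers:\n", centers)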
33. Clustering: K-means clustering
[Figure: Steps 1-4 of the k-means iterations]
34. Clustering: Hierarchical clustering
- Clustering based on (dis)similarities. Multilevel clustering: level 1 has n clusters, level n has one cluster.
- Agglomerative HC starts with n clusters and combines clusters iteratively.
- Divisive HC starts with one cluster and divides iteratively.
- Disadvantage: a wrong division cannot be undone.
Dendrogram
35. Clustering: Nearest Neighbor algorithm for hierarchical clustering
1. Nearest Neighbor, Level 2, k = 7 clusters.
2. Nearest Neighbor, Level 3, k = 6 clusters.
3. Nearest Neighbor, Level 4, k = 5 clusters.
36. Clustering: Nearest Neighbor algorithm for hierarchical clustering
4. Nearest Neighbor, Level 5, k = 4 clusters.
5. Nearest Neighbor, Level 6, k = 3 clusters.
6. Nearest Neighbor, Level 7, k = 2 clusters.
37. Clustering: Nearest Neighbor algorithm for hierarchical clustering
7. Nearest Neighbor, Level 8, k = 1 cluster.
38. Clustering: Similarity measures for hierarchical clustering
39. Clustering: Similarity measures for hierarchical clustering
- Pearson Correlation: trend similarity
40. Clustering: Similarity measures for hierarchical clustering
41. Clustering: Similarity measures for hierarchical clustering
-1 ≤ Cosine Correlation ≤ 1
42. Clustering: Similarity measures for hierarchical clustering
- Cosine Correlation: trend and mean distance
43. Clustering: Similarity measures for hierarchical clustering
44. Clustering: Similarity measures for hierarchical clustering
Similar?
45. Clustering: Grouping strategies for hierarchical clustering
Merge which pair of clusters?
[Figure: three clusters C1, C2, C3]
46. Clustering: Grouping strategies for hierarchical clustering
Single Linkage
Dissimilarity between two clusters = minimum dissimilarity between the members of the two clusters.
[Figure: clusters C1 and C2]
Tends to generate long chains.
47. Clustering: Grouping strategies for hierarchical clustering
Complete Linkage
Dissimilarity between two clusters = maximum dissimilarity between the members of the two clusters.
[Figure: clusters C1 and C2]
Tends to generate clumps.
48. Clustering: Grouping strategies for hierarchical clustering
Average Linkage
Dissimilarity between two clusters = averaged distance over all pairs of objects (one from each cluster).
[Figure: clusters C1 and C2]
49. Clustering: Grouping strategies for hierarchical clustering
Average Group Linkage
Dissimilarity between two clusters = distance between the two cluster means.
[Figure: clusters C1 and C2]
A sketch comparing these linkage strategies follows below.
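A minimal sketch of agglomerative clustering with the linkage strategies above, assuming SciPy; the synthetic data and the two-cluster cut are illustrative ("centroid" plays the role of average group linkage, i.e. distance between cluster means):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in ([0, 0], [3, 3])])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # build the dendrogram bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
    print(method, "->", np.bincount(labels)[1:])     # resulting cluster sizes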
50. Clustering: Support Vector Machines for clustering
The noiseless case
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
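The objective itself did not survive extraction; for reference, the standard noiseless formulation of support vector clustering (Ben-Hur et al., 2001) encloses the mapped points in the smallest sphere of radius R centred at a:
\[
\min_{R,\,a}\; R^2 \quad \text{s.t.}\quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 \;\;\forall j
\]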
51. Clustering: Support Vector Machines for clustering
The noisy case
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
52. Clustering: Support Vector Machines for clustering
The noisy case (II)
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
53. Clustering: Support Vector Machines for clustering
The noisy case (III)
Objective function
Ben-Hur, Horn, Siegelmann and Vapnik, 2001
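For reference, the noisy (soft) formulation adds slack variables ξ_j with trade-off parameter C, again following Ben-Hur et al. (2001):
\[
\min_{R,\,a,\,\xi}\; R^2 + C\sum_j \xi_j \quad \text{s.t.}\quad \lVert \Phi(x_j) - a \rVert^2 \le R^2 + \xi_j,\;\; \xi_j \ge 0 \;\;\forall j
\]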
54. Conclusion / Summary / References
- Dimensionality Reduction
- Principal Components Analysis (PCA)
- Nonlinear PCA (Kernel PCA, CatPCA)
- Multi-Dimensional Scaling (MDS)
- Homogeneity Analysis
- Feature Selection
- Filtering approach
- Wrapper approach
- Embedded methods
- Clustering
- Density estimation and clustering
- K-means clustering
- Hierarchical clustering
- Clustering with Support Vector Machines (SVMs)
- Kohavi and John, 1997
- I. Guyon et al., 2006
- http://www.cs.otago.ac.nz/cosc453/student_tutorials/...principal_components.pdf
- Schoelkopf et al., 2001
- Gifi, 1990
- Borg and Groenen, 2005
- Hastie et al., 2001
- MacQueen, 1967
- http://www.autonlab.org/tutorials/kmeans11.pdf
- Ben-Hur, Horn, Siegelmann and Vapnik, 2001