Title: Dimension Reduction and Feature Selection
1. Dimension Reduction and Feature Selection
- Craig A. Struble, Ph.D.
- Department of Mathematics, Statistics, and Computer Science
- Marquette University
2. Overview
- Dimension Reduction
- Correlation
- Principal Component Analysis
- Singular Value Decomposition
- Feature Selection
- Information Content
3. Dimension Reduction
- The number of attributes causes the complexity of learning, clustering, etc. to grow exponentially
- Curse of dimensionality
- We need methods to reduce the number of attributes
- Dimension reduction reduces attributes without (directly) considering the relevance of each attribute
- Not really removing attributes, but combining/recasting them
4. Correlation
- A causal, complementary, parallel, or reciprocal relationship
- The simultaneous change in value of two numerically valued random variables
- So, if one attribute's value changes in a predictable way whenever another one changes, why keep them both?
5. Correlation Analysis
- Pearson's correlation coefficient: r_{A,B} = Σ_i (a_i - Ā)(b_i - B̄) / (n σ_A σ_B)
- Positive means both increase simultaneously
- Negative means one increases as the other decreases
- If r_{A,B} has a large magnitude, A and B are strongly correlated and one of the attributes can be removed (see the R sketch below)
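Pearson's r is built into R as cor(); the following is a minimal sketch on simulated data (the variables A and B are invented purely for illustration):

set.seed(1)                        # reproducible simulated data
A <- rnorm(100)                    # first attribute
B <- 2 * A + rnorm(100, sd = 0.5)  # second attribute tracks the first
cor(A, B, method = "pearson")      # |r| near 1: one of A, B can be dropped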
6. Correlation Analysis
[Figure: scatter plot of two attributes showing a strong relationship]
7. Principal Component Analysis
- Karhunen-Loève or K-L method
- Combine the essence of the attributes to create a (hopefully) smaller set of variables that describe the data
- An instance with k attributes is a point in k-dimensional space
- Find c k-dimensional orthogonal vectors that best represent the data, such that c < k
- These vectors are combinations of the original attributes
8. Principal Component Analysis
- Normalize the data
- Compute c orthonormal vectors, which are the principal components
- Sort in order of decreasing significance
- Measured in terms of data variance
- Can reduce data dimension by choosing only the most significant principal components (see the R sketch below)
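As a minimal sketch, R's built-in prcomp() performs exactly these steps; the iris measurements are used purely as example data:

pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # normalize the data
summary(pc)             # components sorted by decreasing share of variance
reduced <- pc$x[, 1:2]  # keep only the two most significant components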
9. Singular Value Decomposition
- One method of PCA
- Let A be an m × n matrix. Then A can be written as the product A = U Σ Vᵀ, where U is an m × n matrix, V is an n × n matrix, and Σ is an n × n diagonal matrix with singular values σ1 ≥ σ2 ≥ … ≥ σn ≥ 0. Furthermore, the columns of U and V are orthonormal.
10. Singular Value Decomposition
[Figure: block diagram of the factorization A = U Σ Vᵀ]
11. Singular Value Decomposition
> x <- t(array(1:12, dim = c(3, 4)))
> s <- svd(x)
> s$u
           [,1]        [,2]       [,3]
[1,] -0.1408767 -0.82471435 -0.3128363
[2,] -0.3439463 -0.42626394  0.7522216
[3,] -0.5470159 -0.02781353 -0.5659342
[4,] -0.7500855  0.37063688  0.1265489
> s$v
           [,1]        [,2]       [,3]
[1,] -0.5045331  0.76077568 -0.4082483
[2,] -0.5745157  0.05714052  0.8164966
[3,] -0.6444983 -0.64649464 -0.4082483
> a <- diag(s$d)
> a
         [,1]     [,2]         [,3]
[1,] 25.46241 0.000000 0.000000e+00
[2,]  0.00000 1.290662 0.000000e+00
[3,]  0.00000 0.000000 8.920717e-16
12. Singular Value Decomposition
- The fraction of variance captured by singular value σi is fi = σi² / (σ1² + … + σn²)
- The entropy of the data set is E = -(1 / log n) Σi fi log fi, which ranges from 0 (one component explains everything) to 1 (all components equally important); see the R sketch below
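Continuing the svd() example from slide 11, both quantities are short computations on s$d; the zero guard is my addition, to handle numerically zero singular values:

f  <- s$d^2 / sum(s$d^2)            # f_i, fraction of variance per singular value
fl <- ifelse(f > 0, f * log(f), 0)  # treat 0 * log 0 as 0
E  <- -sum(fl) / log(length(f))     # normalized entropy in [0, 1]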
13. Feature Selection
- Select the most relevant subset of attributes
- Wrapper approach
- Features are selected as part of the mining algorithm
- Filter approach
- Features are selected before the mining algorithm runs
- The wrapper approach is generally more accurate but also more computationally expensive
14. Feature Selection
- Feature selection is actually a search problem
- We want to select the subset of features giving the most accurate model (see the R sketch after the diagram)
[Diagram: lattice of candidate feature subsets, from {a,b,c} at the top through {b,c}, {a,c}, {a,b} and the singletons {b}, {c}, {a}, down to the empty set ∅]
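As a sketch of that search, the subsets of a hypothetical attribute set {a, b, c} can be enumerated and scored exhaustively in R; a data frame d with predictor columns a, b, c and response y is assumed, and adjusted R² of a linear model stands in for model accuracy:

feats <- c("a", "b", "c")
subsets <- unlist(lapply(seq_along(feats),
                         function(k) combn(feats, k, simplify = FALSE)),
                  recursive = FALSE)                  # all non-empty subsets
score <- function(s) summary(lm(reformulate(s, "y"), data = d))$adj.r.squared
best <- subsets[[which.max(sapply(subsets, score))]]  # 2^k - 1 models: feasible only for small k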
15. Feature Selection
- Any search heuristic will work
- Branch and bound
- Best-first or A*
- Genetic algorithms
- etc.
- The bigger problem is estimating the relevance of attributes without building a classifier
16. Feature Selection
- Using entropy (see the R sketch below)
- Calculate the information gain of each attribute
- Select the l attributes with the highest information gain
- Removes attributes that are the same for all data instances
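A minimal sketch of this filter, assuming a discrete attribute x and class labels y as plain R vectors (both hypothetical):

entropy <- function(v) {            # H(v) over the observed proportions
  p <- table(v) / length(v)
  -sum(p * log2(p))
}
info_gain <- function(x, y) {       # gain = H(y) - H(y | x)
  cond <- sum(sapply(split(y, x),
                     function(g) length(g) / length(y) * entropy(g)))
  entropy(y) - cond
}

An attribute that is identical for all instances puts every label in one group, so its gain is 0 and it drops out of the top l.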
17. Feature Selection
- Stepwise forward selection (see the R sketch below)
- Start with an empty attribute set
- Add the best single attribute
- Add the best of the remaining attributes
- Repeat; take the top l
- Stepwise backward selection
- Start with the entire attribute set
- Remove the worst attribute
- Repeat until l attributes are left
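For linear models, R's built-in step() implements both directions, although it stops by AIC rather than at a fixed l, so this is only an analogous sketch (d, with columns a, b, c and response y, as in the earlier sketch):

empty <- lm(y ~ 1, data = d)    # forward: start with no attributes
fwd <- step(empty, scope = y ~ a + b + c, direction = "forward")
full <- lm(y ~ ., data = d)     # backward: start with every attribute
bwd <- step(full, direction = "backward")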
18. Feature Selection
- Other methods
- Sample the data; build models on subsets of the data and attributes to estimate accuracy
- Select the attributes with the most or least variance
- Select the attributes most highly correlated with the goal attribute (see the R sketch below)
- What does feature selection provide you?
- Reduced data size
- Analysis of the most important pieces of information to collect
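The variance and correlation filters are one-liners in R, again using the hypothetical d, feats, and goal attribute y from the earlier sketches:

vars <- sapply(d[feats], var)           # variance of each attribute
cors <- sapply(d[feats], cor, y = d$y)  # correlation with the goal attribute
keep <- names(sort(abs(cors), decreasing = TRUE))[1:2]  # e.g. the two largest |r|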