Title: Feature Selection and Dimensionality Reduction
1. COMP5318/4044, Lecture 10a: Knowledge Discovery and Data Mining - Feature Selection and Dimensionality Reduction
- References: Hand, pp. 465-469; Dunham, pp. 130-131
- J. Han and M. Kamber, Data Mining: Concepts and Techniques, ch. 3
- D. Pyle, Data Preparation for Data Mining
2. Outline
- Data Preprocessing
- Data cleaning
- Data transformations
- Data reduction
- Data discretization
- Singular Value Decomposition
3. Data Preprocessing
- Data in the real world is:
- Incomplete - lacking attribute values or certain attributes of interest
- Noisy - containing errors or outliers
- Inconsistent - e.g. in codes or names
- Tasks in data preprocessing:
- Data cleaning - fill in missing values, smooth noisy data, identify and remove outliers, resolve inconsistencies
- Data integration - combining multiple data sets
- Data transformations - e.g. normalization
- Data reduction - reducing the volume while keeping the content
- Data discretization - replacing numerical attributes with nominal ones
4. Forms of Data Preprocessing
5. Data Cleaning
- How to handle missing values? (a pandas sketch follows this slide)
- Ignore the example
- Use the attribute mean to fill in the missing value
- Use the attribute mean of all examples belonging to the same class
- Predict the missing value with an ML algorithm (usually DT or NB)
- How to handle noisy data?
- Binning - sort the data and partition it into bins, then smooth by bin means
- Outlier detection - detect and remove outliers (automatically or manually)
- Regression - smooth by fitting a regression function to the data
- Correct inconsistent data
- e.g. an attribute has different names in different databases
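A minimal sketch of the two mean-based strategies above, using pandas; the DataFrame and the column names ("price", "label") are hypothetical.

```python
import pandas as pd

# Toy data with missing prices; "label" plays the role of the class attribute
df = pd.DataFrame({
    "price": [4.0, 8.0, None, 21.0, None, 24.0],
    "label": ["a", "a", "a", "b", "b", "b"],
})

# 1. Fill with the overall attribute mean
df["price_mean"] = df["price"].fillna(df["price"].mean())

# 2. Fill with the mean of the examples belonging to the same class
df["price_class_mean"] = df["price"].fillna(
    df.groupby("label")["price"].transform("mean"))

print(df)
```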
6. Smoothing Noisy Data by Binning
- Binning methods smooth data by consulting its neighborhood, i.e. the values around it (local smoothing)
- Attribute price: 4, 8, 15, 21, 21, 24, 25, 28, 34
- 1. Sort the data
- 2. Partition into (equi-depth) bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
- 3. Smooth by bin means (each value is replaced by the mean value of its bin) or by bin boundaries (each value is replaced by the closest boundary value); see the sketch below
- Smoothing by bin means: Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
- Smoothing by bin boundaries: Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
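A small sketch reproducing the binning example above in plain Python; it assumes equi-depth bins of size 3 over the already sorted price values.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes the mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the closest bin boundary
by_bounds = [[min((b[0], b[-1]), key=lambda x: abs(x - v)) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```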
7. Outliers
- Outliers are instances with values much different from the rest of the data
- They may be errors in the data (e.g. a malfunctioning sensor recorded an incorrect value). If this is the case, find the correct value or treat it as a missing value.
- They may be correct data values that are simply very different from the remaining data. For example:
- a person 2.5 m tall
- insurance claims (most of them are typically small, but occasionally some are enormous)
- predicting flooding (high water levels appear rarely, and compared with normal data they may look like outliers)
8. Outlier Detection
- Clustering - clusters very far from the others
- Distance-based techniques
- Statistical-based techniques (see the sketch below)
- assume that the data follow a known distribution and apply standard tests (e.g. a discordancy test)
- Disadvantages:
- real data do not follow well-defined distributions
- most of these tests assume a single attribute, but real-world databases are described by multiple attributes
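A minimal sketch of a simple statistical test of the kind mentioned above: flag values whose z-score exceeds a cutoff, assuming the data follow roughly a normal distribution. The sample values and the 2.5 threshold are illustrative choices, not from the slides.

```python
import numpy as np

values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34, 250], dtype=float)

# Standardize and flag values far from the mean
z = (values - values.mean()) / values.std()
outliers = values[np.abs(z) > 2.5]
print(outliers)   # [250.]
```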
9. Data Transformation
- Normalization - scaling so that attribute values fall within a pre-defined range, e.g. [0, 1] (see the sketch below)
- Aggregation (for numerical attributes) - summary or aggregation operations are applied to the data
- e.g. daily sales are aggregated to compute monthly or annual amounts
- Generalization (for nominal attributes) - low-level data are replaced by higher-level concepts
- e.g. values for age can be mapped to higher-level concepts such as young, middle-aged, senior
- Attribute construction - new attributes are constructed from the given attributes
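A minimal sketch of min-max normalization to the [0, 1] range mentioned above; the sample values are illustrative.

```python
import numpy as np

x = np.array([4.0, 8.0, 15.0, 21.0, 24.0, 34.0])
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # 4 maps to 0.0, 34 maps to 1.0
```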
10. Data Reduction
- Feature selection - removing irrelevant attributes
- using statistical measures, e.g. Information Gain, Mutual Information, etc.
- using induction of DTs
- Data compression
- Clustering - using the cluster centers to represent the actual data
- Principal component analysis, singular value decomposition, wavelet transform, etc.
11. Singular Value Decomposition (SVD)
- Singular Value Decomposition (SVD) is a technique for reducing the dimensionality of high-dimensional data.
- The basic idea is to project the data into a lower-dimensional space while retaining as much of the variability in the data as desired.
12. Example 1
- Along which axis is there maximum variability?
13. More Examples
- Along which axis is there maximum variability?
14. Problem Definition
- Given N feature vectors with dimensionality m, find m new axes Z1, Z2, ..., Zm such that Var(Z1) > Var(Z2) > ... > Var(Zm)
- How does this help in data reduction? Once we have Z1, Z2, ..., Zm, then what?
- Fix a k, where k < m, and choose Z1, Z2, ..., Zk
- Or fix a percentage (e.g. 99%) and choose j such that Z1, Z2, ..., Zj capture 99% of the variability (see the sketch below)
- What are the savings?
- k/m or j/m, e.g. if m = 100 and k = 10, only 1/10 of the space is required
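A minimal sketch of choosing the smallest j that captures a given percentage of the variability; the variances of the new axes are hypothetical numbers, already sorted in decreasing order.

```python
import numpy as np

variances = np.array([50.0, 30.0, 10.0, 6.0, 3.0, 1.0])   # Var(Z1) >= ... >= Var(Zm)
cum = np.cumsum(variances) / variances.sum()

# Smallest j such that Z1, ..., Zj capture at least 99% of the variability
j = int(np.argmax(cum >= 0.99)) + 1
print(j, cum)   # j = 5 for these numbers
```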
15. Revisit Example
- (figure: the example data shown with the new axes z1 and z2)
16. SVD Background
- Based on the following theorem from linear algebra: any N x M matrix X (N >= M) can be written as the product of 3 matrices, X = U Σ V^T, where:
- U - N x M orthogonal matrix
- Σ - M x M diagonal matrix with positive or zero elements (the singular values)
- V^T - the transpose of an M x M orthogonal matrix V
- V defines the new coordinate system (set of axes)
- It provides important information about variance: the 1st axis shows the most variance among the data, the second shows the next highest, etc.
- U is the transformed data, i.e. the i-th row of U contains the coordinates of the i-th row of X in the new coordinate system (see the NumPy sketch below)
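A minimal NumPy sketch of this decomposition; the matrix X is random and purely illustrative.

```python
import numpy as np

N, M = 10, 6
X = np.random.rand(N, M)

# Economy-size SVD: U is N x M, s holds the M singular values, Vt is M x M
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# X is recovered exactly as U * Sigma * V^T
print(np.allclose(X, U @ np.diag(s) @ Vt))   # True
```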
17. SVD - Compression
- More importantly, X can be rewritten as X = σ1 u1 v1^T + σ2 u2 v2^T + ... + σm um vm^T, where the singular values σi are sorted in decreasing order
- Compression comes from keeping only the first k components (k < m)
- i.e. the size of the data can be reduced by eliminating the weaker components (the ones with low variance)
- Using only the strongest components, it is possible to reconstruct a good approximation of the original data (see the sketch below)
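A minimal sketch of the rank-k approximation described above; k and the random matrix are illustrative choices.

```python
import numpy as np

X = np.random.rand(100, 20)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5   # keep only the k strongest components
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstruction error shrinks as k grows towards the full rank
print(np.linalg.norm(X - X_approx))
```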
18. SVD Graphical Representation
- (figure: the matrices of the decomposition shown without compression and with compression, i.e. keeping only the first components of U, Σ and V^T)
19. SVD Example
- Most of the variance is captured in the first component, so the original 3-dimensional data X can be reduced to 1-dimensional data (the first column of U)
20. Compression Ratio
- Space needed before SVD: N x M
- Space needed after SVD: N·k + k + k·M
- The first k columns of U (N-dimensional vectors), the first k columns of V (M-dimensional vectors) and the k singular values, so the total is k(N + M + 1)
- Compression ratio: r = k(N + M + 1) / (N x M)
- For N >> M > k, this ratio is approximately k/M
- E.g. if M = 365 and k = 10, r ≈ 0.028, or 2.8% (see the check below)
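A quick check of the compression-ratio formula with the slide's M and k; N is not given on the slide, so the value below is a hypothetical "N >> M" choice.

```python
def compression_ratio(N, M, k):
    return k * (N + M + 1) / (N * M)

print(compression_ratio(N=10_000, M=365, k=10))   # about 0.028, i.e. roughly k/M
```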
21. SVD Application in Image Compression
- http://www.coastal.edu/jbernick/
- Yogi Rock, photographed by the Sojourner Mars mission
- 256 x 264 grayscale bitmap, so X is a 256 x 264 matrix (i.e. it contains 67584 numbers)
- After SVD with k = 81:
- 256·81 + 81 + 81·264 = 42201 numbers
- about 62% compression ratio (see the check below)
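A quick arithmetic check of the storage figures quoted above for the Yogi Rock image.

```python
N, M, k = 256, 264, 81
original = N * M                   # 67584 numbers
compressed = N * k + k + k * M     # 42201 numbers
print(original, compressed, compressed / original)   # ratio is about 0.62
```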
22. SVD in Text Categorization
- Latent Semantic Indexing (LSI) applies SVD to the DF matrix to reduce the number of features (terms)
- X (10 x 6) is the DF matrix for 10 documents and 6 terms
- U (10 x 6) - each row of U is the transformed DF vector for one document
- Σ (6 x 6) - diagonal matrix of singular values
- V^T (6 x 6) - provides the new orthogonal basis for the data (the principal component directions)
23. SVD in Text Categorization (cont.)
- σ1, ..., σ6 = 77.4, 69.5, 22.9, 13.5, 12.1, 4.8, so most of the variance is captured in the first 2 components
- For k = 2, keep only the first 2 columns of U (see the sketch below)
- The new terms capture relationships among the original terms and may better reflect the semantic content of a document
- e.g. the terms money, win, ..., profit are combined in one new term
- we can think of this new term as characterizing a spam document
- The first 2 principal component directions (2 directions in the original 6-dimensional space):
- v1 = (0.74, 0.49, 0.27, 0.28, 0.18, 0.19)
- v2 = (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31)
- (figure: the original DF matrix for 10 documents and 6 terms)
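A minimal sketch of LSI as described above: SVD is applied to a document-term frequency matrix and only the first k components are kept. The tiny 4 x 4 matrix below is hypothetical, not the slide's 10 x 6 example.

```python
import numpy as np

# Rows = documents, columns = term frequencies
X = np.array([
    [2, 3, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 3, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                              # keep the first two latent "topics"
docs_reduced = U[:, :k] * s[:k]    # each row: a document in the reduced space
print(docs_reduced)
```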