Transcript and Presenter's Notes

Title: Feature Selection and Dimensionality Reduction


1
COMP5318/4044, Lecture 10a: Knowledge Discovery
and Data Mining
  • Feature Selection and Dimensionality Reduction
  • References: Hand, pp. 465-469; Dunham, pp. 130-131
  • J. Han and M. Kamber, Data Mining Concepts and
    Techniques ch.3
  • D. Pyle, Data Preparation for Data Mining

2
Outline
  • Data Preprocessing
  • Data cleaning
  • Data transformations
  • Data reduction
  • Data discretization
  • Singular Value Decomposition

3
Data Preprocessing
  • Data in the real world is
  • Incomplete - lacking attribute values or certain
    attributes of interest
  • Noisy - containing errors or outliers
  • Inconsistent - e.g. in codes or names
  • Tasks in data preprocessing
  • Data cleaning - fill in missing values, smooth
    noisy data, identify and remove outliers,
    resolve inconsistencies
  • Data integration - combining multiple data sets
  • Data transformations - e.g. normalization
  • Data reduction - reducing the volume but keeping
    the content
  • Data discretization - replacing numerical
    attributes with nominal ones

4
Forms of Data Preprocessing
5
Data Cleaning
  • How to handle missing values?
  • Ignore the example
  • Use the attribute mean to fill in the missing
    value (sketched in code below)
  • Use the attribute mean of all examples belonging
    to the same class
  • Predict the missing value using an ML algorithm
    (usually DT or NB)
  • How to handle noisy data?
  • Binning - sort the data and partition it into bins,
    then smooth by bin means
  • Outlier detection - detect and remove outliers
    (automatically or manually)
  • Regression - smooth by fitting the data to a
    regression function
  • Correct inconsistent data
  • e.g. an attribute has different names in different
    databases
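The two mean-based strategies above can be sketched with pandas. This is a minimal illustration, not from the slides; the table, the "income" attribute and the "class" column are made up.

    # Minimal sketch: fill missing values with the attribute mean,
    # overall and per class (all names and values are hypothetical).
    import pandas as pd

    df = pd.DataFrame({
        "income": [30.0, None, 45.0, None, 52.0],
        "class":  ["low", "low", "high", "high", "high"],
    })

    # Fill with the overall attribute mean
    overall = df["income"].fillna(df["income"].mean())

    # Fill with the mean of examples belonging to the same class
    per_class = df["income"].fillna(
        df.groupby("class")["income"].transform("mean"))

    print(overall.tolist())
    print(per_class.tolist())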

6
Smoothing Noisy Data by Binning
  • Binning methods smooth data by consulting its
    neighborhood, i.e. the values around it (local
    smoothing)
  • Attribute price: 4, 8, 15, 21, 21, 24, 25, 28, 34
  • 1. Sort the data
  • 2. Partition into (equidepth) bins
  • Bin 1: 4, 8, 15
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 28, 34
  • 3a. Smoothing by bin means (each value is replaced
    by the mean value of its bin)
  • Bin 1: 9, 9, 9
  • Bin 2: 22, 22, 22
  • Bin 3: 29, 29, 29
  • 3b. Smoothing by bin boundaries (each value is replaced
    by the closest boundary value); see the code sketch below
  • Bin 1: 4, 4, 15
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 25, 34
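A minimal code sketch of the example above (not from the slides), using NumPy to build the equidepth bins and apply both smoothing methods:

    # Sort, split into equidepth bins, then smooth by bin means
    # or by the closest bin boundary.
    import numpy as np

    price = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bins = np.sort(price).reshape(3, 3)        # 3 equidepth bins of 3 values

    by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)

    by_boundaries = bins.copy()
    for b in by_boundaries:                    # replace each value by the
        lo, hi = b[0], b[-1]                   # closest bin boundary
        b[:] = [lo if v - lo <= hi - v else hi for v in b]

    print(by_means)        # rows of 9, 22, 29
    print(by_boundaries)   # [[ 4  4 15] [21 21 24] [25 25 34]]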

7
Outliers
  • Outliers - instances with values much different
    from the rest of the data
  • May be errors in the data (e.g. a malfunctioning sensor
    recorded an incorrect value). If this is the
    case, find the correct value or treat it as a
    missing value.
  • May be correct data values that are simply much
    different from the remaining data. For example:
  • a person 2.5 m tall
  • insurance claims (most of them are typically
    small, but occasionally some are enormous)
  • predicting flooding (high water levels appear
    rarely, and compared with normal data they may
    look like outliers)

8
Outlier Detection
  • Clustering - instances in clusters very far from the others
  • Distance-based techniques
  • Statistical-based techniques
  • assume that the data follows a known distribution and
    apply standard tests (e.g. a discordancy test);
    a code sketch follows below
  • Disadvantages
  • real data do not follow well-defined
    distributions
  • most of these tests assume a single attribute,
    but real-world databases are described with
    multiple attributes
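A minimal sketch of the statistical approach, assuming the attribute roughly follows a Gaussian distribution; the height data and the threshold of 2 standard deviations are made up:

    # Flag values far from the mean, measured in standard deviations.
    import numpy as np

    heights = np.array([1.60, 1.70, 1.75, 1.80, 1.68, 1.72, 2.50])  # metres
    z = (heights - heights.mean()) / heights.std()

    outliers = heights[np.abs(z) > 2]   # the 2.5 m person is flagged
    print(outliers)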

9
Data Transformation
  • Normalization - scaling so that attribute values
    fall within a pre-defined range, e.g. [0, 1]
    (a code sketch follows below)
  • Aggregation (for numerical attributes) - summary
    or aggregation operations are applied to the data
  • e.g. daily sales are aggregated to compute
    monthly or annual amounts
  • Generalization (for nominal attributes) -
    low-level data are replaced by higher-level
    concepts
  • e.g. values for age can be mapped to higher-level
    attributes like young, middle-aged, senior
  • Attribute construction - new attributes are
    constructed from the given attributes
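An illustrative sketch of min-max normalization to [0, 1]; the daily sales figures below are made up:

    # Scale an attribute linearly so its values fall within [new_min, new_max].
    import numpy as np

    def min_max(x, new_min=0.0, new_max=1.0):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

    daily_sales = [120.0, 75.0, 310.0, 98.0, 240.0]
    print(min_max(daily_sales))   # 75 maps to 0.0 and 310 maps to 1.0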

10
Data Reduction
  • Feature selection - removing irrelevant
    attributes by
  • using statistical measures, e.g. Information
    Gain, Mutual Information, etc. (a sketch follows below)
  • using induction of DTs
  • Data compression
  • Clustering - using the cluster centers to
    represent the actual data
  • Principal component analysis,
    singular value decomposition,
    wavelet transform, etc.
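A minimal sketch of ranking attributes by a statistical measure, here using scikit-learn's mutual information estimator; the toy data is made up, with one relevant and one irrelevant attribute:

    # Rank attributes by estimated mutual information with the class label.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)               # binary class label
    X = np.column_stack([
        y + rng.normal(0, 0.3, size=200),          # relevant attribute
        rng.normal(0, 1, size=200),                # irrelevant attribute
    ])

    scores = mutual_info_classif(X, y, random_state=0)
    print(scores)   # the first attribute should score much higher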

11
Singular Value Decomposition (SVD)
  • Singular Value Decomposition (SVD) is a technique
    for data reduction for high-dimensional data.
  • The basic idea is to project the data into a
    lower-dimensional space while retaining as much of
    the variability in the data as desired

12
Example 1
Along which axis is there maximum variability?
13
More Examples
Along which axis is there maximum variability?
14
Problem Definition
  • Given N feature vectors with dimensionality m,
    find m axes Z1, Z2, ..., Zm such that
    Var(Z1) > Var(Z2) > ... > Var(Zm)
  • How does this help in data reduction? Once we
    have Z1, Z2, ..., Zm, then what?
  • Fix a k, where k < m, and choose Z1, Z2, ..., Zk
  • Or fix a percentage (e.g. 99%) and choose j such
    that Z1, Z2, ..., Zj capture 99% of the variability
    (a code sketch of this choice follows below)
  • What are the savings?
  • k/m or j/m, e.g. if m = 100 and k = 10, only 1/10
    of the space is required
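A minimal sketch of the percentage-based choice: keep the smallest j whose squared singular values account for the target fraction of the variability. The singular values below are taken from the LSI example later in the lecture:

    # Choose the number of components needed to reach a variance target.
    import numpy as np

    def choose_k(singular_values, target=0.99):
        var = np.asarray(singular_values, dtype=float) ** 2
        cumulative = np.cumsum(var) / var.sum()
        return int(np.searchsorted(cumulative, target) + 1)

    sigma = [77.4, 69.5, 22.9, 13.5, 12.1, 4.8]
    print(choose_k(sigma, 0.99))   # components needed for 99% of the variability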

15
Revisit Example
(Figure: the example data shown with the new axes z1 and z2.)
16
SVD Background
  • Based on the following theorem from Linear
    Algebra: any NxM matrix X (N ≥ M) can be written
    as the product of 3 matrices: X = U Σ VT
  • U - NxM orthogonal matrix
  • VT - the transpose of an MxM orthogonal matrix V
  • Σ - MxM diagonal matrix with positive or zero
    elements (the singular values)
  • V defines the new coordinate system (set of axes)
  • It provides important information about variance:
    the 1st axis shows the most variance among the data,
    the second shows the next highest, etc.
  • U is the transformed data, i.e. the i-th row of U
    contains the coordinates of the i-th row of X in
    the new coordinate system
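A minimal sketch of the decomposition with NumPy; full_matrices=False gives the NxM and MxM shapes described above, and the random data is only for illustration:

    # Decompose X and verify that X = U Σ VT.
    import numpy as np

    X = np.random.default_rng(1).normal(size=(8, 3))   # N=8 rows, M=3 columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    print(U.shape, s.shape, Vt.shape)            # (8, 3) (3,) (3, 3)
    print(np.allclose(X, U @ np.diag(s) @ Vt))   # True: X is recovered exactly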

17
SVD - Compression
  • More importantly, X can be re-written as a sum of
    rank-1 components: X = σ1 u1 v1T + σ2 u2 v2T + ... + σM uM vMT,
    where the singular values σi are sorted in decreasing order
  • Compression comes from taking only the first k
    components (k < M)
  • i.e. the size of the data can be reduced by
    eliminating the weaker components (the ones with
    low variance)
  • Using only the strongest components, it is
    possible to reconstruct a good approximation of
    the original data (a code sketch follows below)
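A minimal sketch of rank-k compression and reconstruction; the data is random and k is chosen arbitrarily for illustration:

    # Keep only the first k components and reconstruct an approximation of X.
    import numpy as np

    X = np.random.default_rng(2).normal(size=(100, 10))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 3
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation

    error = np.linalg.norm(X - X_k) / np.linalg.norm(X)  # relative error
    print(f"relative reconstruction error with k={k}: {error:.3f}")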

18
SVD Graphical Representation
(Figure: X = U Σ VT shown without compression, and with only the first k components of U, Σ and VT kept.)
19
SVD Example
  • You can verify that X = U Σ VT holds for the example
    matrices shown on this slide
  • Most of the variance is captured in the first
    component => the original 3-dimensional data X
    can be reduced to 1-dimensional data (the first
    column of U)

20
Compression Ratio
  • Space needed before SVD: N x M
  • Space needed after SVD: Nk + k + kM
  • The first k columns of U (N-dimensional vectors),
    the first k columns of V (M-dimensional vectors) and
    k singular values => total k(N + M + 1)
  • Compression ratio: r = k(N + M + 1) / (N x M)
  • For N >> M > k, this ratio is approximately k/M
  • E.g. if M = 365 and k = 10, r ≈ 0.028, or 2.8%!
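A one-line sketch of the ratio; N below is made up only to illustrate the N >> M > k case from the slide:

    # Compression ratio r = k(N + M + 1) / (N * M).
    def compression_ratio(N, M, k):
        return k * (N + M + 1) / (N * M)

    print(compression_ratio(N=100_000, M=365, k=10))   # ~0.028, i.e. about 2.8%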

21
SVD Application in Image Compression
  • http://www.coastal.edu/jbernick/
  • Yogi - a rock photographed by the Sojourner Mars mission
  • 256 x 264 grayscale bitmap => X is a 256 x 264
    matrix (i.e. contains 67,584 numbers)
  • After SVD, k = 81
  • 256 x 81 + 81 + 81 x 264 = 42,201 numbers
  • ~62% compression ratio

22
SVD in Text Categorization
  • Latent Semantic Indexing (LSI)
  • applies SVD to the DF matrix to reduce the
    number of features (terms); a code sketch follows below
  • U - 10x6; each row of U is the transformed DF
    vector for one document
  • Σ - 6x6 diagonal matrix of singular values
  • VT - 6x6; provides the new orthogonal basis for the
    data (the principal component directions)

X (10x6) is the DF matrix for 10 documents and 6
terms
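A minimal LSI sketch: the actual DF matrix is not reproduced in this transcript, so a made-up 10x6 matrix stands in for it. Keeping k = 2 components describes each document by 2 numbers instead of 6 term frequencies:

    # Apply SVD to a (hypothetical) document-frequency matrix and keep k=2 components.
    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.integers(0, 5, size=(10, 6)).astype(float)   # made-up DF matrix

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 2
    docs_2d = U[:, :k]     # following the slides: rows of U are the transformed documents
    print(docs_2d.shape)   # (10, 2)
    print(Vt[:k])          # the first 2 principal component directions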
23
SVD in Text Categorization cont.
  • σ1, ..., σ6 = 77.4, 69.5, 22.9, 13.5, 12.1, 4.8
    => most of the variance is captured in the first
    2 components
  • For k = 2, keep only the first 2 columns of U
  • The new terms capture relationships among the original
    terms and may better reflect the semantic content of a
    document
  • e.g. the terms money, win, ..., profit are
    combined into one new term
  • we can think of this new term as characterizing a
    spam document
  • The first 2 principal component directions (2
    directions in the original 6-d space):
  • v1 = (0.74, 0.49, 0.27, 0.28, 0.18, 0.19)
  • v2 = (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31)

The original DF matrix for 10 documents and 6
terms