Transcript and Presenter's Notes

Title: Feature Selection and Dimensionality Reduction


1
COMP5318/4044, Lecture 10a: Knowledge Discovery
and Data Mining
  • Feature Selection and Dimensionality Reduction
  • References: Hand, pp. 465-469; Dunham, pp. 130-131
  • J. Han and M. Kamber, Data Mining Concepts and
    Techniques ch.3
  • D. Pyle, Data Preparation for Data Mining

2
Outline
  • Data Preprocessing
  • Data cleaning
  • Data transformations
  • Data reduction
  • Data discretization
  • Singular Value Decomposition

3
Data Preprocessing
  • Data in the real world is
  • Incomplete - lacking attribute values or certain
    attributes of interest
  • Noisy - containing errors or outliers
  • Inconsistent - e.g. in codes or names
  • Tasks in data preprocessing
  • Data cleaning - fill in missing values, smooth
    noisy data, identify and remove outliers,
    resolve inconsistencies
  • Data integration - combining multiple data sets
  • Data transformations - e.g. normalization
  • Data reduction - reducing the volume but keeping
    the content
  • Data discretization - replacing numerical
    attributes with nominal ones

4
Forms of Data Preprocessing
5
Data Cleaning
  • How to handle missing values?
  • Ignore the example
  • Use the attribute mean to fill in the missing
    value (sketched in code below)
  • Use the attribute mean of all examples belonging
    to the same class
  • Predict the missing value using an ML algorithm
    (usually DT or NB)
  • How to handle noisy data?
  • Binning - sort the data and partition it into bins,
    then smooth by bin means
  • Outlier detection - detect and remove outliers
    (automatically or manually)
  • Regression - smooth by fitting the data to a
    regression function
  • Correct inconsistent data
  • e.g. an attribute has different names in different
    databases
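The two mean-based strategies above can be sketched with pandas. This is a minimal illustration, not from the slides; the table, the "income" attribute and the "class" column are made up.

    # Minimal sketch: fill missing values with the attribute mean,
    # overall and per class (all names and values are hypothetical).
    import pandas as pd

    df = pd.DataFrame({
        "income": [30.0, None, 45.0, None, 52.0],
        "class":  ["low", "low", "high", "high", "high"],
    })

    # Fill with the overall attribute mean
    overall = df["income"].fillna(df["income"].mean())

    # Fill with the mean of examples belonging to the same class
    per_class = df["income"].fillna(
        df.groupby("class")["income"].transform("mean"))

    print(overall.tolist())
    print(per_class.tolist())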

6
Smoothing Noisy Data by Binning
  • Binning methods smooth data by consulting its
    neighborhood, i.e. the values around it (local
    smoothing)
  • Attribute price: 4, 8, 15, 21, 21, 24, 25, 28, 34
  • 1. Sort the data
  • 2. Partition into (equidepth) bins
  • Bin 1: 4, 8, 15
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 28, 34
  • 3a. Smoothing by bin means (each value is replaced
    by the mean value of its bin)
  • Bin 1: 9, 9, 9
  • Bin 2: 22, 22, 22
  • Bin 3: 29, 29, 29
  • 3b. Smoothing by bin boundaries (each value is replaced
    by the closest boundary value); see the code sketch below
  • Bin 1: 4, 4, 15
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 25, 34
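A minimal code sketch of the example above (not from the slides), using NumPy to build the equidepth bins and apply both smoothing methods:

    # Sort, split into equidepth bins, then smooth by bin means
    # or by the closest bin boundary.
    import numpy as np

    price = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bins = np.sort(price).reshape(3, 3)        # 3 equidepth bins of 3 values

    by_means = np.repeat(bins.mean(axis=1), 3).reshape(3, 3)

    by_boundaries = bins.copy()
    for b in by_boundaries:                    # replace each value by the
        lo, hi = b[0], b[-1]                   # closest bin boundary
        b[:] = [lo if v - lo <= hi - v else hi for v in b]

    print(by_means)        # rows of 9, 22, 29
    print(by_boundaries)   # [[ 4  4 15] [21 21 24] [25 25 34]]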

7
Outliers
  • Outliers - instances with values much different
    from the rest of the data
  • May be errors in the data (e.g. a malfunctioning sensor
    recorded an incorrect value). If this is the
    case, find the correct value or treat it as a
    missing value.
  • May be correct data values that are simply much
    different from the remaining data. For example:
  • a person 2.5 m tall
  • insurance claims (most of them are typically
    small, but occasionally some are enormous)
  • predicting flooding (high water levels appear
    rarely, and compared with normal data they may
    look like outliers)

8
Outlier Detection
  • Clustering - instances in clusters very far from the others
  • Distance-based techniques
  • Statistical-based techniques
  • assume that the data follows a known distribution and
    apply standard tests (e.g. a discordancy test);
    a code sketch follows below
  • Disadvantages
  • real data do not follow well-defined
    distributions
  • most of these tests assume a single attribute,
    but real-world databases are described with
    multiple attributes
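A minimal sketch of the statistical approach, assuming the attribute roughly follows a Gaussian distribution; the height data and the threshold of 2 standard deviations are made up:

    # Flag values far from the mean, measured in standard deviations.
    import numpy as np

    heights = np.array([1.60, 1.70, 1.75, 1.80, 1.68, 1.72, 2.50])  # metres
    z = (heights - heights.mean()) / heights.std()

    outliers = heights[np.abs(z) > 2]   # the 2.5 m person is flagged
    print(outliers)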

9
Data Transformation
  • Normalization - scaling so that attribute values
    fall within a pre-defined range, e.g. [0, 1]
    (a code sketch follows below)
  • Aggregation (for numerical attributes) - summary
    or aggregation operations are applied to the data
  • e.g. daily sales are aggregated to compute
    monthly or annual amounts
  • Generalization (for nominal attributes) -
    low-level data are replaced by higher-level
    concepts
  • e.g. values for age can be mapped to higher-level
    attributes like young, middle-aged, senior
  • Attribute construction - new attributes are
    constructed from the given attributes
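An illustrative sketch of min-max normalization to [0, 1]; the daily sales figures below are made up:

    # Scale an attribute linearly so its values fall within [new_min, new_max].
    import numpy as np

    def min_max(x, new_min=0.0, new_max=1.0):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

    daily_sales = [120.0, 75.0, 310.0, 98.0, 240.0]
    print(min_max(daily_sales))   # 75 maps to 0.0 and 310 maps to 1.0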

10
Data Reduction
  • Feature selection - removing irrelevant
    attributes by
  • using statistical measures, e.g. Information
    Gain, Mutual Information, etc. (a sketch follows below)
  • using induction of DTs
  • Data compression
  • Clustering - using the cluster centers to
    represent the actual data
  • Principal component analysis,
    singular value decomposition,
    wavelet transform, etc.
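A minimal sketch of ranking attributes by a statistical measure, here using scikit-learn's mutual information estimator; the toy data is made up, with one relevant and one irrelevant attribute:

    # Rank attributes by estimated mutual information with the class label.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)               # binary class label
    X = np.column_stack([
        y + rng.normal(0, 0.3, size=200),          # relevant attribute
        rng.normal(0, 1, size=200),                # irrelevant attribute
    ])

    scores = mutual_info_classif(X, y, random_state=0)
    print(scores)   # the first attribute should score much higher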

11
Singular Value Decomposition (SVD)
  • Singular Value Decomposition (SVD) is a technique
    for data reduction for high-dimensional data.
  • The basic idea is to project the data into a
    lower-dimensional space while retaining as much of
    the variability in the data as desired

12
Example 1
Along which axis is there maximum variability?
13
More Examples
Along which axis is there maximum variability?
14
Problem Definition
  • Given N feature vectors with dimensionality m,
    find m axes Z1, Z2, ..., Zm such that
    Var(Z1) > Var(Z2) > ... > Var(Zm)
  • How does this help in data reduction? Once we
    have Z1, Z2, ..., Zm, then what?
  • Fix a k, where k < m, and choose Z1, Z2, ..., Zk
  • Or fix a percentage (e.g. 99%) and choose j such
    that Z1, Z2, ..., Zj capture 99% of the variability
    (a code sketch of this choice follows below)
  • What are the savings?
  • k/m or j/m, e.g. if m = 100 and k = 10, only 1/10
    of the space is required
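A minimal sketch of the percentage-based choice: keep the smallest j whose squared singular values account for the target fraction of the variability. The singular values below are taken from the LSI example later in the lecture:

    # Choose the number of components needed to reach a variance target.
    import numpy as np

    def choose_k(singular_values, target=0.99):
        var = np.asarray(singular_values, dtype=float) ** 2
        cumulative = np.cumsum(var) / var.sum()
        return int(np.searchsorted(cumulative, target) + 1)

    sigma = [77.4, 69.5, 22.9, 13.5, 12.1, 4.8]
    print(choose_k(sigma, 0.99))   # components needed for 99% of the variability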

15
Revisit Example
(Figure: the example data shown with the new axes z1 and z2.)
16
SVD Background
  • Based on the following theorem from Linear
    Algebra: any NxM matrix X (N ≥ M) can be written
    as the product of 3 matrices: X = U Σ VT
  • U - NxM orthogonal matrix
  • VT - the transpose of an MxM orthogonal matrix V
  • Σ - MxM diagonal matrix with positive or zero
    elements (the singular values)
  • V defines the new coordinate system (set of axes)
  • It provides important information about variance:
    the 1st axis shows the most variance among the data,
    the second shows the next highest, etc.
  • U is the transformed data, i.e. the i-th row of U
    contains the coordinates of the i-th row of X in
    the new coordinate system
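A minimal sketch of the decomposition with NumPy; full_matrices=False gives the NxM and MxM shapes described above, and the random data is only for illustration:

    # Decompose X and verify that X = U Σ VT.
    import numpy as np

    X = np.random.default_rng(1).normal(size=(8, 3))   # N=8 rows, M=3 columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    print(U.shape, s.shape, Vt.shape)            # (8, 3) (3,) (3, 3)
    print(np.allclose(X, U @ np.diag(s) @ Vt))   # True: X is recovered exactly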

17
SVD - Compression
  • More importantly, X can be re-written as a sum of
    rank-1 components: X = σ1 u1 v1T + σ2 u2 v2T + ... + σM uM vMT,
    where the singular values σi are sorted in decreasing order
  • Compression comes from taking only the first k
    components (k < M)
  • i.e. the size of the data can be reduced by
    eliminating the weaker components (the ones with
    low variance)
  • Using only the strongest components, it is
    possible to reconstruct a good approximation of
    the original data (a code sketch follows below)
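A minimal sketch of rank-k compression and reconstruction; the data is random and k is chosen arbitrarily for illustration:

    # Keep only the first k components and reconstruct an approximation of X.
    import numpy as np

    X = np.random.default_rng(2).normal(size=(100, 10))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 3
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation

    error = np.linalg.norm(X - X_k) / np.linalg.norm(X)  # relative error
    print(f"relative reconstruction error with k={k}: {error:.3f}")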

18
SVD Graphical Representation
(Figure: X = U Σ VT shown without compression, and with only the first k components of U, Σ and VT kept.)
19
SVD Example
  • You can verify that X = U Σ VT holds for the example
    matrices shown on this slide
  • Most of the variance is captured in the first
    component => the original 3-dimensional data X
    can be reduced to 1-dimensional data (the first
    column of U)

20
Compression Ratio
  • Space needed before SVD: N x M
  • Space needed after SVD: Nk + k + kM
  • The first k columns of U (N-dimensional vectors),
    the first k columns of V (M-dimensional vectors) and
    k singular values => total k(N + M + 1)
  • Compression ratio: r = k(N + M + 1) / (N x M)
  • For N >> M > k, this ratio is approximately k/M
  • E.g. if M = 365 and k = 10, r ≈ 0.028, or 2.8%!
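A one-line sketch of the ratio; N below is made up only to illustrate the N >> M > k case from the slide:

    # Compression ratio r = k(N + M + 1) / (N * M).
    def compression_ratio(N, M, k):
        return k * (N + M + 1) / (N * M)

    print(compression_ratio(N=100_000, M=365, k=10))   # ~0.028, i.e. about 2.8%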

21
SVD Application in Image Compression
  • http://www.coastal.edu/jbernick/
  • Yogi - a rock photographed by the Sojourner Mars mission
  • 256 x 264 grayscale bitmap => X is a 256 x 264
    matrix (i.e. contains 67,584 numbers)
  • After SVD, k = 81
  • 256 x 81 + 81 + 81 x 264 = 42,201 numbers
  • ~62% compression ratio

22
SVD in Text Categorization
  • Latent Semantic Indexing (LSI)
  • applies SVD to the DF matrix to reduce the
    number of features (terms); a code sketch follows below
  • U - 10x6; each row of U is the transformed DF
    vector for one document
  • Σ - 6x6 diagonal matrix of singular values
  • VT - 6x6; provides the new orthogonal basis for the
    data (the principal component directions)

X (10x6) is the DF matrix for 10 documents and 6
terms
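A minimal LSI sketch: the actual DF matrix is not reproduced in this transcript, so a made-up 10x6 matrix stands in for it. Keeping k = 2 components describes each document by 2 numbers instead of 6 term frequencies:

    # Apply SVD to a (hypothetical) document-frequency matrix and keep k=2 components.
    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.integers(0, 5, size=(10, 6)).astype(float)   # made-up DF matrix

    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 2
    docs_2d = U[:, :k]     # following the slides: rows of U are the transformed documents
    print(docs_2d.shape)   # (10, 2)
    print(Vt[:k])          # the first 2 principal component directions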
23
SVD in Text Categorization cont.
  • σ1, ..., σ6 = 77.4, 69.5, 22.9, 13.5, 12.1, 4.8
    => most of the variance is captured in the first
    2 components
  • For k = 2, keep only the first 2 columns of U
  • The new terms capture relationships among the original
    terms and may better reflect the semantic content of a
    document
  • e.g. the terms money, win, ..., profit are
    combined into one new term
  • we can think of this new term as characterizing a
    spam document
  • The first 2 principal component directions (2
    directions in the original 6-d space):
  • v1 = (0.74, 0.49, 0.27, 0.28, 0.18, 0.19)
  • v2 = (-0.28, -0.24, -0.12, 0.74, 0.37, 0.31)

The original DF matrix for 10 documents and 6
terms