2. Data Preparation and Preprocessing - PowerPoint PPT Presentation

Slides: 26
Provided by: publi5

1
2. Data Preparation and Preprocessing
  • Data and Its Forms
  • Preparation
  • Preprocessing and Data Reduction

2
Data Types and Forms
  • Attribute-vector data
  • Data types
  • numeric, categorical (see the hierarchy for its
    relationship)
  • static, dynamic (temporal)
  • Other data forms
  • distributed data
  • text, Web, meta data
  • images, audio/video
  • You have seen most of them after the invited
    talks.

3
Data Preparation
  • An important, time-consuming task in KDD
  • High dimensional data (20, 100, 1000)
  • Huge size data
  • Missing data
  • Outliers
  • Erroneous data (inconsistent, misrecorded,
    distorted)
  • Raw data

4
Data Preparation Methods
  • Data annotation as in driving data analysis
  • Data normalization
  • Another example is image mining
  • Dealing with sequential or temporal data
  • Transform it to tabular form
  • Removing outliers
  • Different types

5
Normalization
  • Decimal scaling
  • $v'(i) = v(i)/10^k$ for the smallest k such that
    $\max(|v'(i)|) < 1$.
  • For the range between -991 and 99, $10^k$ is 1000 (k = 3),
    -991 → -0.991
  • Min-max normalization into the new max/min range
  • $v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
  • v = 73600 in [12000, 98000] → v' = 0.716 in [0, 1]
    (new range)
  • Zero-mean normalization
  • $v' = (v - \mathit{mean}_A) / \mathit{std\_dev}_A$
  • (1, 2, 3): mean and std_dev are 2 and 1, giving (-1, 0,
    1)
  • If mean_Income = 54000 and std_dev_Income = 16000,
    then v = 73600 → 1.225
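
The three normalizations on this slide can be checked with a short Python sketch. The function names are illustrative and not from the slides, and the standard deviation is assumed to be the sample standard deviation, which is what gives 1 for (1, 2, 3).

```python
import statistics


def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |v| below 1."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k


def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def zero_mean(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


scaled, k = decimal_scale([-991, 99])          # k is 3, so we divide by 1000
income_01 = min_max(73600, 12000, 98000)       # about 0.716
income_z = zero_mean(73600, 54000, 16000)      # 1.225

sample = [1, 2, 3]
m, s = statistics.mean(sample), statistics.stdev(sample)  # sample std dev
z_scores = [zero_mean(v, m, s) for v in sample]
```

In practice the minimum, maximum, mean and standard deviation would be estimated from the training data and reused unchanged on new data.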

6
Temporal Data
  • The goal is to forecast t(n+1) from previous
    values
  • X = t(1), t(2), ..., t(n)
  • An example with two features and window size 3
  • How to determine the window size?

Time   A    B
1      7    215
2      10   211
3      6    214
4      11   221
5      12   210
6      14   218

Inst   A(n-2)   A(n-1)   A(n)   B(n-2)   B(n-1)   B(n)
1      7        10       6      215      211      214
2      10       6        11     211      214      221
3      6        11       12     214      221      210
4      11       12       14     221      210      218

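
The transformation from the time series to the tabular instances above can be sketched in Python as follows. The function name and the dictionary layout are assumptions made for illustration; the data are the A and B series from this slide.

```python
def sliding_windows(series, size):
    """Turn equal-length series into rows holding `size` consecutive values.

    `series` maps a feature name to its list of values; each output row
    concatenates the window of every feature, in the order of the dict.
    """
    names = list(series)
    n = len(series[names[0]])
    rows = []
    for start in range(n - size + 1):
        row = []
        for name in names:
            row.extend(series[name][start:start + size])
        rows.append(row)
    return rows


data = {
    "A": [7, 10, 6, 11, 12, 14],
    "B": [215, 211, 214, 221, 210, 218],
}
instances = sliding_windows(data, 3)
for inst in instances:
    print(inst)
```

Six time points and a window of size 3 give 6 - 3 + 1 = 4 instances, so a larger window leaves fewer, wider rows.
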
7
Outlier Removal
  • Data points inconsistent with the majority of
    data
  • Different outliers
  • Valid: a CEO's salary
  • Noisy: one's age = 200, widely deviated points
  • Removal methods
  • Clustering
  • Curve-fitting
  • Hypothesis-testing with a given model

8
Data Preprocessing
  • Data cleaning
  • missing data
  • noisy data
  • inconsistent data
  • Data reduction
  • Dimensionality reduction
  • Instance selection
  • Value discretization

9
Missing Data
  • Many types of missing data
  • not measured
  • truly missed
  • wrongly placed, and ?
  • Some methods
  • leave as is
  • ignore/remove the instance with missing value
  • manual fix (assign a value for implicit meaning)
  • statistical methods (majority, most likely, mean,
    nearest neighbor, ...)
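
Two of the statistical methods listed above, mean and majority, can be sketched in Python as below. The columns are hypothetical, `None` is assumed to mark a missing value, and the function names are illustrative.

```python
from collections import Counter
from statistics import mean


def impute_mean(column):
    """Replace missing numeric values with the mean of the known ones."""
    known = [v for v in column if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in column]


def impute_majority(column):
    """Replace missing categorical values with the most frequent known one."""
    known = [v for v in column if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


ages = [20, None, 30, 40]                 # hypothetical numeric column
colors = ["red", None, "blue", "red"]     # hypothetical categorical column

ages_filled = impute_mean(ages)
colors_filled = impute_majority(colors)
```

Both functions assume at least one known value in the column; a column that is entirely missing needs one of the other strategies on the slide.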

10
Noisy Data
  • Random error or variance in a measured variable
  • inconsistent values for features or classes
    (process)
  • measuring errors (source)
  • Noise is normally a minority in the data set
  • Why?
  • Removing noise
  • Clustering/merging
  • Smoothing (rounding, averaging within a window)
  • Outlier detection (deviation-based or
    distance-based)
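
Smoothing by averaging within a window can be sketched as follows. This is one possible reading of the bullet: a centered window that is truncated at the ends of the sequence. The input values are hypothetical.

```python
def moving_average(values, window=3):
    """Replace each value by the mean of a centered window around it.

    Near the ends the window is truncated to the values that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


noisy = [1, 2, 9, 2, 1]        # hypothetical measurements with one spike
smoothed = moving_average(noisy, window=3)
```

The spike at 9 is pulled down to 13/3, at the cost of raising its neighbors, which is the usual trade-off of window averaging.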

11
Inconsistent Data
  • Inconsistent with our models or common sense
  • Examples
  • The same name occurs differently in an
    application
  • Different names appear the same (Dennis vs.
    Denis)
  • Inappropriate values (Male-Pregnant, negative
    age)
  • One bank's database shows that 5 of its
    customers were born on 11/11/11

12
Dimensionality Reduction
  • Feature selection
  • select m from n features, $m \le n$
  • remove irrelevant, redundant features
  • the saving in search space
  • Feature transformation (PCA)
  • form new features (a) in a new domain from
    original features (f)
  • many uses, but it does not reduce the original
    dimensionality
  • often used in visualization of data

13
Feature Selection
  • Problem illustration
  • Full set
  • Empty set
  • Enumeration
  • Search
  • Exhaustive/Complete (Enumeration/BAA)
  • Heuristic (Sequential forward/backward)
  • Stochastic (generate/evaluate)
  • Individual features or subsets generation/evaluation

14
Feature Selection (2)
F1   F2   F3   C
0    0    1    1
0    0    1    0
0    0    1    1
1    0    0    1
1    0    0    0
1    0    0    0

  • Goodness metrics
  • Dependency: depending on classes
  • Distance: separating classes
  • Information: entropy
  • Consistency: 1 - inconsistencies/N
  • Example: (F1, F2, F3) and (F1, F3)
  • Both sets have 2/6 inconsistency rate
  • Accuracy (classifier based): 1 - errorRate
  • Their comparisons
  • Time complexity, number of features, removing
    redundancy
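
The 2/6 inconsistency rate quoted for this slide's table can be reproduced with a short Python sketch. It assumes the usual counting rule: for each distinct pattern of feature values, every instance outside the pattern's majority class counts as one inconsistency. The function name is illustrative.

```python
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, features):
    """Fraction of instances that disagree with their pattern's majority class."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[f] for f in features)
        groups[pattern][label] += 1
    inconsistencies = sum(
        sum(counts.values()) - max(counts.values())
        for counts in groups.values()
    )
    return inconsistencies / len(rows)


# The table from this slide: columns F1, F2, F3 and class C.
rows = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
labels = [1, 0, 1, 1, 0, 0]

rate_all = inconsistency_rate(rows, labels, (0, 1, 2))   # (F1, F2, F3)
rate_f1_f3 = inconsistency_rate(rows, labels, (0, 2))    # (F1, F3)
```

Dropping F2 leaves the rate unchanged because F2 is constant in this table, which is what makes it redundant here.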

15
Feature Selection (3)
  • Filter vs. Wrapper Model
  • Pros and cons
  • time
  • generality
  • performance such as accuracy
  • Stopping criteria
  • thresholding (number of iterations, some
    accuracy, ...)
  • anytime algorithms
  • providing approximate solutions
  • solutions improve over time

16
Feature Selection (Examples)
  • SFS using consistency (cRate)
  • select 1 from n, then 1 from n-1, n-2, ... features
  • increase the number of selected features until
    pre-specified cRate is reached.
  • LVF using consistency (cRate)
  • 1. randomly generate a subset S from the full set
  • 2. if it satisfies prespecified cRate, keep S with
    min |S|
  • 3. go back to 1 until a stopping criterion is met
  • LVF is an anytime algorithm
  • Many other algorithms: SBS, B&B, ...
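
A minimal sketch of the LVF loop described above, in Python. Several details are assumptions for illustration and not taken from the slides: each feature is included in a random subset with probability 0.5, the stopping criterion is a fixed number of trials, the allowed rate is the rate of the full feature set, and the data are the six instances from the Feature Selection (2) slide.

```python
import random
from collections import Counter, defaultdict


def inconsistency_rate(rows, labels, features):
    """Fraction of instances that disagree with their pattern's majority class."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return bad / len(rows)


def lvf(rows, labels, n_features, max_rate, trials=200, seed=0):
    """Randomly probe subsets, keeping the smallest one within max_rate."""
    rng = random.Random(seed)
    best = tuple(range(n_features))          # start from the full set
    for _ in range(trials):                  # stopping criterion: trial budget
        subset = tuple(f for f in range(n_features) if rng.random() < 0.5)
        if not subset or len(subset) >= len(best):
            continue                         # only strictly smaller subsets help
        if inconsistency_rate(rows, labels, subset) <= max_rate:
            best = subset
    return best


rows = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
labels = [1, 0, 1, 1, 0, 0]
allowed = inconsistency_rate(rows, labels, (0, 1, 2))
best = lvf(rows, labels, 3, allowed)
```

Because `best` is valid at every point of the loop and can only shrink, the search can be interrupted at any time and still return a usable subset, which is the anytime property mentioned on the slide.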

17
Transformation: PCA

       E-values   Diff      Prop      Cumu
λ1     2.91082    1.98960   0.72771   0.72770
λ2     0.92122    0.77387   0.23031   0.95801
λ3     0.14735    0.12675   0.03684   0.99485
λ4     0.02061              0.00515   1.00000

  • $D' = DA$, D is mean-centered ($N \times n$)
  • Calculate and rank eigenvalues of the covariance
    matrix
  • Select the largest $\lambda$s such that r > threshold (e.g.,
    0.95)
  • corresponding eigenvectors form A ($n \times m$)
  • Example of Iris data

$r = \left(\sum_{i=1}^{m} \lambda_i\right) / \left(\sum_{i=1}^{n} \lambda_i\right)$

      V1          V2         V3          V4
F1    0.522372    0.372318   -0.721017   -0.261996
F2    -0.263355   0.925556   0.242033    0.124135
F3    0.581254    0.021095   0.140892    0.801154
F4    0.565611    0.065416   0.633801    -0.523546

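
The choice of m from the ratio r can be sketched in Python using the four eigenvalues listed on this slide. The function name is illustrative; the eigenvalues themselves are taken as given, not recomputed from the Iris data.

```python
def components_needed(eigenvalues, threshold):
    """Return the smallest m whose top-m eigenvalues give r > threshold, and r."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    running = 0.0
    for m, value in enumerate(ranked, start=1):
        running += value
        if running / total > threshold:
            return m, running / total
    return len(ranked), 1.0


eigenvalues = [2.91082, 0.92122, 0.14735, 0.02061]
m, r = components_needed(eigenvalues, 0.95)
print(m, round(r, 5))
```

With a threshold of 0.95 the first two eigenvalues are enough, so A would be formed from the first two eigenvectors (columns V1 and V2 of the table above).
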
18
Instance Selection
  • Sampling methods
  • random sampling
  • stratified sampling
  • Search-based methods
  • Representatives
  • Prototypes
  • Sufficient statistics (N, mean, stdDev)
  • Support vectors
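
Stratified sampling can be sketched in Python as below: the instances are grouped by class and the same fraction is drawn at random from each group. The data set, the function name and the rule of keeping at least one instance per class are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))   # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data set: 8 instances of class P and 4 of class N.
items = [("p%d" % i, "P") for i in range(8)] + [("n%d" % i, "N") for i in range(4)]
sample = stratified_sample(items, key=lambda item: item[1], fraction=0.5)
counts = {c: sum(1 for _, label in sample if label == c) for c in ("P", "N")}
```

Plain random sampling of six instances could by chance leave class N with very few instances; sampling per class keeps the 2:1 class ratio of the original data.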

19
Value Discretization
  • Binning methods
  • Equal-width
  • Equal-frequency
  • Class information is not used
  • Entropy-based
  • ChiMerge
  • Chi2

20
Binning
  • Attribute values (for one attribute e.g., age)
  • 0, 4, 12, 16, 16, 18, 24, 26, 28
  • Equi-width binning for bin width of e.g., 10
  • Bin 1: 0, 4 → [-, 10) bin
  • Bin 2: 12, 16, 16, 18 → [10, 20) bin
  • Bin 3: 24, 26, 28 → [20, +) bin
  • We use - to denote negative infinity, + for
    positive infinity
  • Equi-frequency binning for bin density of e.g.,
    3
  • Bin 1: 0, 4, 12 → [-, 14) bin
  • Bin 2: 16, 16, 18 → [14, 21) bin
  • Bin 3: 24, 26, 28 → [21, +] bin
  • Any problems with the above methods?
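
Both binning schemes can be reproduced on the age values above with a few lines of Python. The function names are illustrative; the equi-width version assumes bins start at 0 and simply omits empty bins.

```python
def equal_width_bins(values, width):
    """Group values by which interval [i*width, (i+1)*width) they fall in."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(int(v // width), []).append(v)
    return [bins[b] for b in sorted(bins)]


def equal_frequency_bins(values, size):
    """Cut the sorted values into consecutive groups of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
by_width = equal_width_bins(ages, 10)
by_frequency = equal_frequency_bins(ages, 3)
```

Neither function looks at class labels, so a bin boundary can fall in the middle of a run of same-class values.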

21
Entropy-based
  • Given attribute-value/class pairs
  • (0,P), (4,P), (12,P), (16,N), (16,N), (18,P),
    (24,N), (26,N), (28,N)
  • Entropy-based binning via binarization
  • Intuitively, find best split so that the bins are
    as pure as possible
  • Formally characterized by maximal information
    gain.
  • Let S denote the above 9 pairs, p = 4/9 be the fraction
    of P pairs, and n = 5/9 be the fraction of N pairs.
  • $\mathrm{Entropy}(S) = -p \log_2 p - n \log_2 n$
  • Smaller entropy: set is relatively pure;
    smallest is 0.
  • Large entropy: set is mixed. Largest is 1.
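
The entropy of the nine pairs can be computed with a short Python sketch. Logarithms are taken base 2, which is what makes the largest two-class entropy equal to 1; the function name is illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


# Classes of the pairs (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
labels = ["P", "P", "P", "N", "N", "P", "N", "N", "N"]
h = entropy(labels)               # about 0.991 for p = 4/9, n = 5/9
h_pure = entropy(["P", "P"])      # a pure set has entropy 0
h_mixed = entropy(["P", "N"])     # an evenly mixed set has entropy 1
```

The full set is close to the maximum of 1 because its two classes are almost evenly mixed.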

22
Entropy-based (2)
  • Let v be a possible split. Then S is divided into
    two sets
  • $S_1$ = pairs with value < v and $S_2$ = pairs with value > v
  • Information of the split
  • $I(S_1, S_2) = (|S_1|/|S|)\,\mathrm{Entropy}(S_1) + (|S_2|/|S|)\,\mathrm{Entropy}(S_2)$
  • Information gain of the split
  • $\mathrm{Gain}(v, S) = \mathrm{Entropy}(S) - I(S_1, S_2)$
  • Goal: split with maximal information gain.
  • Possible splits: midpoints between any two
    consecutive values.
  • For v = 14, $I(S_1, S_2) = 0 + (6/9)\,\mathrm{Entropy}(S_2) = (6/9) \times 0.65 = 0.433$
  • Gain(14, S) = Entropy(S) - 0.433
  • maximum Gain means minimum I.
  • The best split is found after examining all
    possible split points.
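
Examining all possible split points for the nine pairs of the previous slide can be sketched in Python as follows. Entropy is base 2, candidate splits are the midpoints between consecutive distinct values, and the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(pairs):
    """Return (split point, gain) with maximal information gain."""
    values = sorted({value for value, _ in pairs})
    base = entropy([label for _, label in pairs])
    best = None
    for lo, hi in zip(values, values[1:]):
        v = (lo + hi) / 2                       # midpoint candidate
        s1 = [label for value, label in pairs if value < v]
        s2 = [label for value, label in pairs if value > v]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        gain = base - info
        if best is None or gain > best[1]:
            best = (v, gain)
    return best


pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
split, gain = best_split(pairs)
```

For these pairs the search returns v = 14, the split worked out on the slide, with a gain of about 0.991 - 0.433 = 0.558. Applying the same search recursively to each side would give more than two bins.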

23
ChiMerge and Chi2
       C1    C2    Σ
I-1    A11   A12   R1
I-2    A21   A22   R2
Σ      C1    C2    N

  • Given attribute-value/class pairs
  • Build a contingency table for every pair of
    intervals (I)
  • Chi-Squared Test (goodness-of-fit)
  • Parameters: df = k - 1 and p, the level of
    significance
  • Chi2 algorithm provides an automatic way to
    adjust p

F    C
12   P
12   N
12   P
16   N
16   N
16   P
24   N
24   N
24   N

$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} (A_{ij} - E_{ij})^2 / E_{ij}$

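
The statistic can be evaluated for the adjacent intervals of the F/C data above with a short Python sketch. It uses the usual expected count $E_{ij} = R_i C_j / N$; skipping cells whose expected count is zero is an assumption of this sketch, and the function name is illustrative.

```python
def chi_square(table):
    """Chi-squared statistic for a contingency table (rows = intervals)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:          # assumption: ignore empty classes
                total += (observed - expected) ** 2 / expected
    return total


# Class counts (P, N) per interval, read off the F/C data:
# F = 12 -> 2 P, 1 N;  F = 16 -> 1 P, 2 N;  F = 24 -> 0 P, 3 N
chi_12_16 = chi_square([[2, 1], [1, 2]])
chi_16_24 = chi_square([[1, 2], [0, 3]])
```

The pair (12, 16) has the smaller value, so its class distributions are the more similar of the two pairs; a merge-based method such as ChiMerge would consider it first, subject to the chosen significance level.
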
24
Summary
  • Data have many forms
  • Attribute-vector data is the most common form
  • Raw data need to be prepared and preprocessed for
    data mining
  • Data miners have to work on the data provided
  • Domain expertise is important in DPP
  • Data preparation: Normalization, Transformation
  • Data preprocessing: Cleaning and Reduction
  • DPP is a critical and time-consuming task
  • Why?

25
Bibliography
  • H. Liu and H. Motoda, 1998. Feature Selection for
    Knowledge Discovery and Data Mining. Kluwer.
  • M. Kantardzic, 2003. Data Mining: Concepts,
    Models, Methods, and Algorithms. IEEE and Wiley
    Inter-Science.
  • H. Liu and H. Motoda, editors, 2001. Instance
    Selection and Construction for Data Mining.
    Kluwer.
  • H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002.
    Discretization: An Enabling Technique. DMKD
    6:393-423.