Title: Data Mining
2. Data Preparation and Preprocessing
- Data and Its Forms
- Preparation
- Preprocessing and Data Reduction
Data Types and Forms
- Attribute-vector data
- Data types
- numeric, categorical (see the hierarchy for their relationship)
- static, dynamic (temporal)
- Other data forms
- distributed data
- text, Web, meta data
- images, audio/video
Data Preparation
- An important, time-consuming task in KDD
- High-dimensional data (20, 100, 1000, …)
- Huge size (volume) data
- Missing data
- Outliers
- Erroneous data (inconsistent, mis-recorded, distorted)
- Raw data
Data Preparation Methods
- Data annotation
- Data normalization
- Examples: image pixels, age
- Dealing with sequential or temporal data
- Transform to tabular form
- Removing outliers
- Different types
Normalization
- Decimal scaling
- v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
- For the range between -991 and 99, 10^k is 1000, so -991 → -0.991
- Min-max normalization into a new max/min range
- v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
- v = 73600 in [12000, 98000] → v' = 0.716 in [0, 1] (the new range)
- Zero-mean normalization
- v' = (v - mean_A) / std_dev_A
- For (1, 2, 3), mean and std_dev are 2 and 1, giving (-1, 0, 1)
- If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 → v' = 1.225
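A minimal Python sketch of the three normalizations, using the slide's example values (the function names are mine, not from the slide):

def decimal_scaling(values):
    """Divide by the smallest power of 10 that maps every value into (-1, 1)."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def zero_mean(v, mean_a, std_a):
    """Zero-mean (z-score) normalization."""
    return (v - mean_a) / std_a

print(decimal_scaling([-991, 99]))               # [-0.991, 0.099]
print(round(min_max(73600, 12000, 98000), 3))    # 0.716
print(round(zero_mean(73600, 54000, 16000), 3))  # 1.225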
Temporal Data
- The goal is to forecast t(n+1) from previous values
- X = t(1), t(2), …, t(n)
- An example with two features and window size 3 (the tables below; a code sketch follows them)
- How to determine the window size?
Time A B
1 7 215
2 10 211
3 6 214
4 11 221
5 12 210
6 14 218
Inst A(n-2) A(n-1) A(n) B(n-2) B(n-1) B(n)
1 7 10 6 215 211 214
2 10 6 11 211 214 221
3 6 11 12 214 221 210
4 11 12 14 221 210 218
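A sketch of the window transform that produces the second table from the first (window size 3, as above; windowize is my own name):

def windowize(series, w):
    """Rows of w consecutive values: instance i is series[i..i+w-1]."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

A = [7, 10, 6, 11, 12, 14]
B = [215, 211, 214, 221, 210, 218]

# Join the window slices of both features into one tabular instance,
# reproducing the four rows of the second table.
for a_row, b_row in zip(windowize(A, 3), windowize(B, 3)):
    print(a_row + b_row)   # e.g. [7, 10, 6, 215, 211, 214]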
Outlier Removal
- Outlier: data points inconsistent with the majority of the data
- Different outliers
- Valid: a CEO's salary
- Noisy: one's age = 200; widely deviated points
- Removal methods
- Clustering
- Curve-fitting
- Hypothesis-testing with a given model
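A minimal deviation-based sketch: flag points far from the mean in standard-deviation units. The cutoff k is an assumption, not from the slide; clustering or curve-fitting would replace this simple test:

import statistics

def deviation_outliers(values, k=2.0):
    """Flag points more than k standard deviations from the mean (k is an assumed cutoff)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]

ages = [23, 31, 45, 27, 38, 200]   # 200 is the slide's noisy age
print(deviation_outliers(ages))    # [200]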
Data Preprocessing
- Data cleaning
- missing data
- noisy data
- inconsistent data
- Data reduction
- Dimensionality reduction
- Instance selection
- Value discretization
Missing Data
- Many types of missing data
- not measured
- not applicable
- wrongly placed, …
- Some methods
- leave as is
- ignore/remove the instance with missing value
- manual fix (assign a value for implicit meaning)
- statistical methods (majority, most likely, mean, nearest neighbor, …)
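A sketch of the statistical fills named above: mean for numeric columns, majority for categorical ones (impute is my own helper name):

import statistics

def impute(column, method="mean"):
    """Fill None entries with the column mean or its majority (most frequent) value."""
    present = [v for v in column if v is not None]
    fill = statistics.mean(present) if method == "mean" else statistics.mode(present)
    return [fill if v is None else v for v in column]

print(impute([3.0, None, 5.0, 4.0]))                     # [3.0, 4.0, 5.0, 4.0]
print(impute(["red", None, "red", "blue"], "majority"))  # fills with 'red'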
Noisy Data
- Noise: random error or variance in a measured variable
- inconsistent values for features or classes (processing)
- measuring errors (source)
- Noise is normally a minority in the data set
- Why?
- Removing noise
- Clustering/merging
- Smoothing (rounding, averaging within a window)
- Outlier detection (deviation-based or
distance-based)
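A sketch of smoothing by averaging within a window (the window size is an assumption; rounding and clustering/merging are the alternatives listed above):

def smooth(values, w=3):
    """Replace each value by the average of the length-w window centred on it."""
    half = w // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]   # truncated at the edges
        out.append(sum(window) / len(window))
    return out

noisy = [7, 10, 6, 11, 12, 14]
print([round(v, 1) for v in smooth(noisy)])   # [8.5, 7.7, 9.0, 9.7, 12.3, 13.0]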
Inconsistent Data
- Inconsistent with our models or common sense
- Examples
- The same name occurs as different ones in an application
- Different names appear the same (Dennis vs. Denis)
- Inappropriate values (Male-Pregnant, negative age)
- One bank's database shows that 5% of its customers were born on 11/11/11
Dimensionality Reduction
- Feature selection
- select m from n features, m < n
- remove irrelevant, redundant features
- saving in search space
- Feature transformation (e.g., PCA)
- form new features (a) in a new domain from original features (f)
- many uses, but it does not reduce the original dimensionality
- often used in visualization of data
Feature Selection
- Problem illustration
- Full set
- Empty set
- Enumeration
- Search
- Exhaustive/Complete (Enumeration/BB)
- Heuristic (Sequential forward/backward)
- Stochastic (generate/evaluate)
- Individual features or subsets: generation/evaluation
Feature Selection (2)
F1 F2 F3 C
0 0 1 1
0 0 1 0
0 0 1 1
1 0 0 1
1 0 0 0
1 0 0 0
- Goodness metrics
- Dependency: dependence on classes
- Distance: separating classes
- Information: entropy
- Consistency: 1 - inconsistencies/N
- Example: (F1, F2, F3) and (F1, F3)
- Both sets have a 2/6 inconsistency rate (a computation sketch follows)
- Accuracy (classifier based): 1 - errorRate
- Their comparisons
- time complexity, number of features, removing redundancy
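A sketch of the consistency metric on the slide's table: group instances by their feature pattern, then count everything beyond each pattern's majority class as inconsistent:

from collections import Counter, defaultdict

def inconsistency_rate(rows, features, labels):
    """inconsistencies/N: per matching pattern, instances minus the majority class count."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return bad / len(rows)

# The table above: columns F1, F2, F3 and class C.
rows = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
labels = [1, 0, 1, 1, 0, 0]
print(inconsistency_rate(rows, [0, 1, 2], labels))   # (F1,F2,F3): 2/6
print(inconsistency_rate(rows, [0, 2], labels))      # (F1,F3): also 2/6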
Feature Selection (3)
- Filter vs. Wrapper Model
- Pros and cons
- time
- generality
- performance such as accuracy
- Stopping criteria
- thresholding (number of iterations, some accuracy, …)
- anytime algorithms
- providing approximate solutions
- solutions improve over time
Feature Selection (Examples)
- SFS using consistency (cRate)
- select 1 from n features, then 1 from the remaining n-1, n-2, … features
- increase the number of selected features until the pre-specified cRate is reached
- LVF using consistency (cRate), sketched after this list
  1. randomly generate a subset S from the full set
  2. if S satisfies the pre-specified cRate, keep the S with minimum |S|
  3. go back to step 1 until a stopping criterion is met
- LVF is an anytime algorithm
- Many other algorithms: SBS, BB, ...
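A minimal LVF sketch, reusing the inconsistency_rate helper above; the iteration count and seed are assumptions:

import random

def lvf(rows, labels, n_features, max_crate, iterations=1000, seed=0):
    """Las Vegas Filter: random subsets, keep the smallest that meets the cRate."""
    rng = random.Random(seed)
    best = list(range(n_features))         # start from the full set
    for _ in range(iterations):
        size = rng.randint(1, len(best))   # never try larger than the best so far
        subset = rng.sample(range(n_features), size)
        if inconsistency_rate(rows, subset, labels) <= max_crate and size < len(best):
            best = subset                  # smaller consistent subset found
    return sorted(best)

# rows/labels from the previous sketch: a single feature (F1 or F3) already meets 2/6.
print(lvf(rows, labels, 3, max_crate=2/6))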
Transformation: PCA
- D' = DA: D is the mean-centered data matrix (N×n), A (n×m) holds the selected eigenvectors
- Calculate and rank the eigenvalues of the covariance matrix
- Select the largest λs such that r = (λ1 + … + λm) / (λ1 + … + λn) > threshold (e.g., 0.95)
- The corresponding eigenvectors form A (n×m)
- Example with the Iris data:

E-values   Diff      Prop     Cumu
λ1  2.91082  1.98960  0.72771  0.72771
λ2  0.92122  0.77387  0.23031  0.95801
λ3  0.14735  0.12675  0.03684  0.99485
λ4  0.02061           0.00515  1.00000

Eigenvectors:
      V1         V2        V3         V4
F1   0.522372   0.372318  -0.721017  -0.261996
F2  -0.263355   0.925556   0.242033   0.124135
F3   0.581254   0.021095   0.140892   0.801154
F4   0.565611   0.065416   0.633801  -0.523546
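A sketch of the procedure with NumPy: eigen-decompose the covariance matrix of the mean-centered data and keep the top eigenvectors until r exceeds the threshold:

import numpy as np

def pca(X, threshold=0.95):
    """Return D' = DA and the ranked eigenvalues of the covariance matrix."""
    D = X - X.mean(axis=0)                       # mean-centre the N x n data
    e_vals, e_vecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(e_vals)[::-1]             # rank eigenvalues, largest first
    e_vals, e_vecs = e_vals[order], e_vecs[:, order]
    r = np.cumsum(e_vals) / e_vals.sum()
    m = int(np.searchsorted(r, threshold)) + 1   # smallest m with r > threshold
    A = e_vecs[:, :m]                            # n x m eigenvector matrix
    return D @ A, e_vals

# On the 4-feature Iris data above this keeps m = 2 components (0.95801 > 0.95).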
Instance Selection
- Sampling methods
- random sampling
- stratified sampling
- Search-based methods
- Representatives
- Prototypes
- Sufficient statistics (N, mean, stdDev)
- Support vectors
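A sketch of stratified sampling, drawing the same fraction from each class so that class proportions survive the reduction (random sampling is simply rng.sample over the whole set):

import random
from collections import defaultdict

def stratified_sample(rows, labels, fraction, seed=0):
    """Sample a fraction of the instances of each class, preserving proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(fraction * len(members)))   # at least one per class
        sample += [(r, label) for r in rng.sample(members, k)]
    return sample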
Value Discretization
- Binning methods
- Equal-width
- Equal-frequency
- Class information is not used
- Entropy-based
- ChiMerge
- Chi2
Binning
- Attribute values (for one attribute, e.g., age)
- 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equal-width binning, for a bin width of e.g. 10
- Bin 1: {0, 4}, the [-∞, 10) bin
- Bin 2: {12, 16, 16, 18}, the [10, 20) bin
- Bin 3: {24, 26, 28}, the [20, +∞) bin
- We use -∞ to denote negative infinity, +∞ positive infinity
- Equal-frequency binning, for a bin density of e.g. 3
- Bin 1: {0, 4, 12}, the [-∞, 14) bin
- Bin 2: {16, 16, 18}, the [14, 21) bin
- Bin 3: {24, 26, 28}, the [21, +∞) bin
- Any problems with the above methods?
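A sketch of both binning schemes on the slide's age values (the function names are mine):

import math

def equal_width(values, width):
    """Bin index = floor(v / width), so width-10 bins are [0,10), [10,20), ..."""
    return {v: math.floor(v / width) for v in values}

def equal_frequency(sorted_values, density):
    """Cut the sorted values into consecutive bins of `density` values each."""
    return [sorted_values[i:i + density]
            for i in range(0, len(sorted_values), density)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width(ages, 10))      # 0,4 -> bin 0; 12..18 -> bin 1; 24..28 -> bin 2
print(equal_frequency(ages, 3))   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]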
Entropy-based
- Given attribute-value/class pairs
- (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization
- Intuitively, find the best split so that the bins are as pure as possible
- Formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs
- Entropy(S) = -p log p - n log n
- A small entropy means the set is relatively pure; the smallest is 0
- A large entropy means the set is mixed; the largest is 1
Entropy-based (2)
- Let v be a possible split. Then S is divided into two sets
- S1: value < v and S2: value > v
- Information of the split
- I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
- Information gain of the split
- Gain(v, S) = Entropy(S) - I(S1, S2)
- Goal: the split with maximal information gain
- Possible splits: mid-points between any two consecutive values
- For v = 14, I(S1, S2) = 0 + (6/9) × Entropy(S2) = (6/9) × 0.65 = 0.433
- Gain(14, S) = Entropy(S) - 0.433
- Maximum Gain means minimum I
- The best split is found after examining all possible split points (see the sketch below)
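A sketch of the whole procedure, examining every mid-point and returning the split with maximal gain; on the slide's pairs it recovers v = 14 (entropy and best_split are my own helper names):

import math

def entropy(labels):
    """-sum p log2 p over the class proportions in labels."""
    return -sum((labels.count(c) / len(labels)) * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def best_split(pairs):
    """Try mid-points between consecutive distinct values; maximise information gain."""
    values = sorted({v for v, _ in pairs})
    base = entropy([c for _, c in pairs])
    best_v, best_gain = None, -1.0
    for a, b in zip(values, values[1:]):
        v = (a + b) / 2
        s1 = [c for x, c in pairs if x < v]
        s2 = [c for x, c in pairs if x >= v]
        info = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if base - info > best_gain:
            best_v, best_gain = v, base - info
    return best_v, best_gain

pairs = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
         (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
print(best_split(pairs))   # (14.0, 0.558): best split at v = 14, I(S1,S2) = 0.433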
ChiMerge and Chi2
- Given attribute-value/class pairs (the sample F/C table below)
- Build a contingency table for every pair of adjacent intervals

      C1   C2   Σ
I-1   A11  A12  R1
I-2   A21  A22  R2
Σ     C1   C2   N

- Chi-squared test (goodness of fit)

χ² = Σ_{i=1}^{2} Σ_{j=1}^{k} (A_ij - E_ij)² / E_ij

  where E_ij = R_i × C_j / N is the expected frequency and k is the number of classes
- Parameters: df = k - 1 and p = level of significance
- The Chi2 algorithm provides an automatic way to adjust p
- Sample data (a computation sketch follows)

F   C
12  P
12  N
12  P
16  N
16  N
16  P
24  N
24  N
24  N
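A sketch of the statistic for one pair of adjacent intervals, with E_ij = R_i × C_j / N; the two rows below are the per-class counts (P, N) of intervals {12} and {16} from the sample data:

def chi_squared(row1, row2):
    """Chi-squared statistic for two adjacent intervals (rows of per-class counts)."""
    table = [row1, row2]
    R = [sum(row) for row in table]         # interval (row) totals
    C = [sum(col) for col in zip(*table)]   # class (column) totals
    N = sum(R)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(C)):
            E = R[i] * C[j] / N             # expected frequency
            if E:
                chi2 += (table[i][j] - E) ** 2 / E
    return chi2

print(round(chi_squared([2, 1], [1, 2]), 3))   # 0.667: similar intervals, merge candidates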
Summary
- Data have many forms
- Attribute-vectors: the most common form
- Raw data need to be prepared and preprocessed for data mining
- Data miners have to work on the data provided
- Domain expertise is important in DPP
- Data preparation: Normalization, Transformation
- Data preprocessing: Cleaning and Reduction
- DPP is a critical and time-consuming task
- Why?
Bibliography
- H. Liu and H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and Wiley Inter-Science.
- H. Liu and H. Motoda (eds.), 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.