Title: Data Mining
2. Data Preparation and Preprocessing
- Data and Its Forms
- Preparation
- Preprocessing and Data Reduction
Data Types and Forms
- Attribute-vector data
- Data types
- numeric, categorical (see the hierarchy for their relationship)
- static, dynamic (temporal)
- Other data forms
- distributed data
- text, Web, meta data
- images, audio/video
Data Preparation
- An important, time-consuming task in KDD
- High-dimensional data (20, 100, 1000, …)
- Huge size (volume) data
- Missing data
- Outliers
- Erroneous data (inconsistent, mis-recorded, distorted)
- Raw data
Data Preparation Methods
- Data annotation
- Data normalization
- Examples: image pixels, age
- Dealing with sequential or temporal data
- Transform to tabular form
- Removing outliers
- Different types
Normalization
- Decimal scaling
- v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1
- For the range between -991 and 99, 10^k is 1000, so -991 → -0.991
- Min-max normalization into a new max/min range
- v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
- v = 73600 in [12000, 98000] → v' = 0.716 in [0, 1] (the new range)
- Zero-mean normalization
- v' = (v - mean_A) / std_dev_A
- For (1, 2, 3), mean and std_dev are 2 and 1, giving (-1, 0, 1)
- If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 → v' = 1.225
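A minimal Python sketch of the three normalizations, using the slide's example values (the function names are mine, not from the slide):

def decimal_scaling(values):
    """Divide by the smallest power of 10 that maps every value into (-1, 1)."""
    k = 0
    while max(abs(v) for v in values) / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values]

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def zero_mean(v, mean_a, std_a):
    """Zero-mean (z-score) normalization."""
    return (v - mean_a) / std_a

print(decimal_scaling([-991, 99]))               # [-0.991, 0.099]
print(round(min_max(73600, 12000, 98000), 3))    # 0.716
print(round(zero_mean(73600, 54000, 16000), 3))  # 1.225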
Temporal Data
- The goal is to forecast t(n+1) from previous values
- X = t(1), t(2), …, t(n)
- An example with two features and window size 3 (the tables below; a code sketch follows them)
- How to determine the window size?
Time A B
1 7 215
2 10 211
3 6 214
4 11 221
5 12 210
6 14 218
Inst A(n-2) A(n-1) A(n) B(n-2) B(n-1) B(n)
1 7 10 6 215 211 214
2 10 6 11 211 214 221
3 6 11 12 214 221 210
4 11 12 14 221 210 218
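A sketch of the window transform that produces the second table from the first (window size 3, as above; windowize is my own name):

def windowize(series, w):
    """Rows of w consecutive values: instance i is series[i..i+w-1]."""
    return [series[i:i + w] for i in range(len(series) - w + 1)]

A = [7, 10, 6, 11, 12, 14]
B = [215, 211, 214, 221, 210, 218]

# Join the window slices of both features into one tabular instance,
# reproducing the four rows of the second table.
for a_row, b_row in zip(windowize(A, 3), windowize(B, 3)):
    print(a_row + b_row)   # e.g. [7, 10, 6, 215, 211, 214]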
Outlier Removal
- Outlier: data points inconsistent with the majority of the data
- Different outliers
- Valid: a CEO's salary
- Noisy: one's age = 200; widely deviated points
- Removal methods
- Clustering
- Curve-fitting
- Hypothesis-testing with a given model
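A minimal deviation-based sketch: flag points far from the mean in standard-deviation units. The cutoff k is an assumption, not from the slide; clustering or curve-fitting would replace this simple test:

import statistics

def deviation_outliers(values, k=2.0):
    """Flag points more than k standard deviations from the mean (k is an assumed cutoff)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * std]

ages = [23, 31, 45, 27, 38, 200]   # 200 is the slide's noisy age
print(deviation_outliers(ages))    # [200]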
Data Preprocessing
- Data cleaning
- missing data
- noisy data
- inconsistent data
- Data reduction
- Dimensionality reduction
- Instance selection
- Value discretization
Missing Data
- Many types of missing data
- not measured
- not applicable
- wrongly placed, …
- Some methods
- leave as is
- ignore/remove the instance with missing value
- manual fix (assign a value for implicit meaning)
- statistical methods (majority, most likely, mean, nearest neighbor, …)
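A sketch of the statistical fills named above: mean for numeric columns, majority for categorical ones (impute is my own helper name):

import statistics

def impute(column, method="mean"):
    """Fill None entries with the column mean or its majority (most frequent) value."""
    present = [v for v in column if v is not None]
    fill = statistics.mean(present) if method == "mean" else statistics.mode(present)
    return [fill if v is None else v for v in column]

print(impute([3.0, None, 5.0, 4.0]))                     # [3.0, 4.0, 5.0, 4.0]
print(impute(["red", None, "red", "blue"], "majority"))  # fills with 'red'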
Noisy Data
- Noise: random error or variance in a measured variable
- inconsistent values for features or classes (processing)
- measuring errors (source)
- Noise is normally a minority in the data set
- Why?
- Removing noise
- Clustering/merging
- Smoothing (rounding, averaging within a window)
- Outlier detection (deviation-based or
distance-based)
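A sketch of smoothing by averaging within a window (the window size is an assumption; rounding and clustering/merging are the alternatives listed above):

def smooth(values, w=3):
    """Replace each value by the average of the length-w window centred on it."""
    half = w // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]   # truncated at the edges
        out.append(sum(window) / len(window))
    return out

noisy = [7, 10, 6, 11, 12, 14]
print([round(v, 1) for v in smooth(noisy)])   # [8.5, 7.7, 9.0, 9.7, 12.3, 13.0]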
Inconsistent Data
- Inconsistent with our models or common sense
- Examples
- The same name occurs as different ones in an application
- Different names appear the same (Dennis vs. Denis)
- Inappropriate values (Male-Pregnant, negative age)
- One bank's database shows that 5% of its customers were born on 11/11/11
Dimensionality Reduction
- Feature selection
- select m from n features, m < n
- remove irrelevant, redundant features
- saving in search space
- Feature transformation (e.g., PCA)
- form new features (a) in a new domain from original features (f)
- many uses, but it does not reduce the original dimensionality
- often used in visualization of data
Feature Selection
- Problem illustration
- Full set
- Empty set
- Enumeration
- Search
- Exhaustive/Complete (Enumeration/BB)
- Heuristic (Sequential forward/backward)
- Stochastic (generate/evaluate)
- Individual features or subsets: generation/evaluation
Feature Selection (2)
F1 F2 F3 C
0 0 1 1
0 0 1 0
0 0 1 1
1 0 0 1
1 0 0 0
1 0 0 0
- Goodness metrics
- Dependency: dependence on classes
- Distance: separating classes
- Information: entropy
- Consistency: 1 - inconsistencies/N
- Example: (F1, F2, F3) and (F1, F3)
- Both sets have a 2/6 inconsistency rate (a computation sketch follows)
- Accuracy (classifier based): 1 - errorRate
- Their comparisons
- time complexity, number of features, removing redundancy
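A sketch of the consistency metric on the slide's table: group instances by their feature pattern, then count everything beyond each pattern's majority class as inconsistent:

from collections import Counter, defaultdict

def inconsistency_rate(rows, features, labels):
    """inconsistencies/N: per matching pattern, instances minus the majority class count."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    bad = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return bad / len(rows)

# The table above: columns F1, F2, F3 and class C.
rows = [(0, 0, 1), (0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 0)]
labels = [1, 0, 1, 1, 0, 0]
print(inconsistency_rate(rows, [0, 1, 2], labels))   # (F1,F2,F3): 2/6
print(inconsistency_rate(rows, [0, 2], labels))      # (F1,F3): also 2/6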
Feature Selection (3)
- Filter vs. Wrapper Model
- Pros and cons
- time
- generality
- performance such as accuracy
- Stopping criteria
- thresholding (number of iterations, some accuracy, …)
- anytime algorithms
- providing approximate solutions
- solutions improve over time
Feature Selection (Examples)
- SFS using consistency (cRate)
- select 1 from n features, then 1 from the remaining n-1, n-2, … features
- increase the number of selected features until the pre-specified cRate is reached
- LVF using consistency (cRate), sketched after this list
  1. randomly generate a subset S from the full set
  2. if S satisfies the pre-specified cRate, keep the S with minimum |S|
  3. go back to step 1 until a stopping criterion is met
- LVF is an anytime algorithm
- Many other algorithms: SBS, BB, ...
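A minimal LVF sketch, reusing the inconsistency_rate helper above; the iteration count and seed are assumptions:

import random

def lvf(rows, labels, n_features, max_crate, iterations=1000, seed=0):
    """Las Vegas Filter: random subsets, keep the smallest that meets the cRate."""
    rng = random.Random(seed)
    best = list(range(n_features))         # start from the full set
    for _ in range(iterations):
        size = rng.randint(1, len(best))   # never try larger than the best so far
        subset = rng.sample(range(n_features), size)
        if inconsistency_rate(rows, subset, labels) <= max_crate and size < len(best):
            best = subset                  # smaller consistent subset found
    return sorted(best)

# rows/labels from the previous sketch: a single feature (F1 or F3) already meets 2/6.
print(lvf(rows, labels, 3, max_crate=2/6))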
Transformation: PCA
- D' = DA: D is the mean-centered data matrix (N×n), A (n×m) holds the selected eigenvectors
- Calculate and rank the eigenvalues of the covariance matrix
- Select the largest λs such that r = (λ1 + … + λm) / (λ1 + … + λn) > threshold (e.g., 0.95)
- The corresponding eigenvectors form A (n×m)
- Example with the Iris data:

E-values   Diff      Prop     Cumu
λ1  2.91082  1.98960  0.72771  0.72771
λ2  0.92122  0.77387  0.23031  0.95801
λ3  0.14735  0.12675  0.03684  0.99485
λ4  0.02061           0.00515  1.00000

Eigenvectors:
      V1         V2        V3         V4
F1   0.522372   0.372318  -0.721017  -0.261996
F2  -0.263355   0.925556   0.242033   0.124135
F3   0.581254   0.021095   0.140892   0.801154
F4   0.565611   0.065416   0.633801  -0.523546
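A sketch of the procedure with NumPy: eigen-decompose the covariance matrix of the mean-centered data and keep the top eigenvectors until r exceeds the threshold:

import numpy as np

def pca(X, threshold=0.95):
    """Return D' = DA and the ranked eigenvalues of the covariance matrix."""
    D = X - X.mean(axis=0)                       # mean-centre the N x n data
    e_vals, e_vecs = np.linalg.eigh(np.cov(D, rowvar=False))
    order = np.argsort(e_vals)[::-1]             # rank eigenvalues, largest first
    e_vals, e_vecs = e_vals[order], e_vecs[:, order]
    r = np.cumsum(e_vals) / e_vals.sum()
    m = int(np.searchsorted(r, threshold)) + 1   # smallest m with r > threshold
    A = e_vecs[:, :m]                            # n x m eigenvector matrix
    return D @ A, e_vals

# On the 4-feature Iris data above this keeps m = 2 components (0.95801 > 0.95).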
Instance Selection
- Sampling methods
- random sampling
- stratified sampling
- Search-based methods
- Representatives
- Prototypes
- Sufficient statistics (N, mean, stdDev)
- Support vectors
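A sketch of stratified sampling, drawing the same fraction from each class so that class proportions survive the reduction (random sampling is simply rng.sample over the whole set):

import random
from collections import defaultdict

def stratified_sample(rows, labels, fraction, seed=0):
    """Sample a fraction of the instances of each class, preserving proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(fraction * len(members)))   # at least one per class
        sample += [(r, label) for r in rng.sample(members, k)]
    return sample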
Value Discretization
- Binning methods
- Equal-width
- Equal-frequency
- Class information is not used
- Entropy-based
- ChiMerge
- Chi2
Binning
- Attribute values (for one attribute, e.g., age)
- 0, 4, 12, 16, 16, 18, 24, 26, 28
- Equal-width binning, for a bin width of e.g. 10
- Bin 1: {0, 4}, the [-∞, 10) bin
- Bin 2: {12, 16, 16, 18}, the [10, 20) bin
- Bin 3: {24, 26, 28}, the [20, +∞) bin
- We use -∞ to denote negative infinity, +∞ positive infinity
- Equal-frequency binning, for a bin density of e.g. 3
- Bin 1: {0, 4, 12}, the [-∞, 14) bin
- Bin 2: {16, 16, 18}, the [14, 21) bin
- Bin 3: {24, 26, 28}, the [21, +∞) bin
- Any problems with the above methods?
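A sketch of both binning schemes on the slide's age values (the function names are mine):

import math

def equal_width(values, width):
    """Bin index = floor(v / width), so width-10 bins are [0,10), [10,20), ..."""
    return {v: math.floor(v / width) for v in values}

def equal_frequency(sorted_values, density):
    """Cut the sorted values into consecutive bins of `density` values each."""
    return [sorted_values[i:i + density]
            for i in range(0, len(sorted_values), density)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width(ages, 10))      # 0,4 -> bin 0; 12..18 -> bin 1; 24..28 -> bin 2
print(equal_frequency(ages, 3))   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]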
Entropy-based
- Given attribute-value/class pairs
- (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
- Entropy-based binning via binarization
- Intuitively, find the best split so that the bins are as pure as possible
- Formally characterized by maximal information gain
- Let S denote the above 9 pairs, p = 4/9 the fraction of P pairs, and n = 5/9 the fraction of N pairs
- Entropy(S) = -p log p - n log n
- A small entropy means the set is relatively pure; the smallest is 0
- A large entropy means the set is mixed; the largest is 1
Entropy-based (2)
- Let v be a possible split. Then S is divided into two sets
- S1: value < v and S2: value > v
- Information of the split
- I(S1, S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
- Information gain of the split
- Gain(v, S) = Entropy(S) - I(S1, S2)
- Goal: the split with maximal information gain
- Possible splits: mid-points between any two consecutive values
- For v = 14, I(S1, S2) = 0 + (6/9) × Entropy(S2) = (6/9) × 0.65 = 0.433
- Gain(14, S) = Entropy(S) - 0.433
- Maximum Gain means minimum I
- The best split is found after examining all possible split points (see the sketch below)
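A sketch of the whole procedure, examining every mid-point and returning the split with maximal gain; on the slide's pairs it recovers v = 14 (entropy and best_split are my own helper names):

import math

def entropy(labels):
    """-sum p log2 p over the class proportions in labels."""
    return -sum((labels.count(c) / len(labels)) * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def best_split(pairs):
    """Try mid-points between consecutive distinct values; maximise information gain."""
    values = sorted({v for v, _ in pairs})
    base = entropy([c for _, c in pairs])
    best_v, best_gain = None, -1.0
    for a, b in zip(values, values[1:]):
        v = (a + b) / 2
        s1 = [c for x, c in pairs if x < v]
        s2 = [c for x, c in pairs if x >= v]
        info = len(s1) / len(pairs) * entropy(s1) + len(s2) / len(pairs) * entropy(s2)
        if base - info > best_gain:
            best_v, best_gain = v, base - info
    return best_v, best_gain

pairs = [(0, 'P'), (4, 'P'), (12, 'P'), (16, 'N'), (16, 'N'),
         (18, 'P'), (24, 'N'), (26, 'N'), (28, 'N')]
print(best_split(pairs))   # (14.0, 0.558): best split at v = 14, I(S1,S2) = 0.433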
ChiMerge and Chi2
- Given attribute-value/class pairs (the sample F/C table below)
- Build a contingency table for every pair of adjacent intervals

      C1   C2   Σ
I-1   A11  A12  R1
I-2   A21  A22  R2
Σ     C1   C2   N

- Chi-squared test (goodness of fit)

χ² = Σ_{i=1}^{2} Σ_{j=1}^{k} (A_ij - E_ij)² / E_ij

  where E_ij = R_i × C_j / N is the expected frequency and k is the number of classes
- Parameters: df = k - 1 and p = level of significance
- The Chi2 algorithm provides an automatic way to adjust p
- Sample data (a computation sketch follows)

F   C
12  P
12  N
12  P
16  N
16  N
16  P
24  N
24  N
24  N
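A sketch of the statistic for one pair of adjacent intervals, with E_ij = R_i × C_j / N; the two rows below are the per-class counts (P, N) of intervals {12} and {16} from the sample data:

def chi_squared(row1, row2):
    """Chi-squared statistic for two adjacent intervals (rows of per-class counts)."""
    table = [row1, row2]
    R = [sum(row) for row in table]         # interval (row) totals
    C = [sum(col) for col in zip(*table)]   # class (column) totals
    N = sum(R)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(C)):
            E = R[i] * C[j] / N             # expected frequency
            if E:
                chi2 += (table[i][j] - E) ** 2 / E
    return chi2

print(round(chi_squared([2, 1], [1, 2]), 3))   # 0.667: similar intervals, merge candidates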
Summary
- Data have many forms
- Attribute-vectors: the most common form
- Raw data need to be prepared and preprocessed for data mining
- Data miners have to work on the data provided
- Domain expertise is important in DPP
- Data preparation: Normalization, Transformation
- Data preprocessing: Cleaning and Reduction
- DPP is a critical and time-consuming task
- Why?
Bibliography
- H. Liu and H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
- M. Kantardzic, 2003. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE Press and Wiley Inter-Science.
- H. Liu and H. Motoda (eds.), 2001. Instance Selection and Construction for Data Mining. Kluwer.
- H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.