Title: Data Mining: Concepts and Techniques
1. Data Mining: Concepts and Techniques, Chapter 3 (Cont.)
- More on Feature Selection
- Chi-squared test
- Principal Component Analysis
2. Attribute Selection
ID Outlook Temperature Humidity Windy Play
1 100 40 90 0 T
2 100 40 90 1 F
3 50 40 90 0 T
4 10 30 90 0 T
5 10 15 70 0 T
6 10 15 70 1 F
7 50 15 70 1 T
8 100 30 90 0 F
9 100 15 70 0 T
10 10 30 70 0 F
11 100 30 70 1 F
12 50 30 90 1 T
13 50 40 70 0 T
14 10 30 90 1 F
- Question: Are attributes A1 and A2 independent?
- If they are strongly dependent, we can remove either A1 or A2
- If A1 is independent of a class attribute A2, we can remove A1 from our training data
3. Deciding to remove attributes in feature selection
[Diagram: A1 is compared with A2 (class attribute) in two panels. Each panel has two branches, Dependent (chi-squared large) and Independent (chi-squared small), and each branch ends in a "?": can the attribute be removed?]
4. Chi-Squared Test (cont.)
- Question: Are attributes A1 and A2 independent?
- These features are nominal valued (discrete)
- Null hypothesis: we expect independence
Outlook Temperature
Sunny High
Cloudy Low
Sunny High
5. The Weather example: Observed Count
Outlook \ Temperature  High  Low  Outlook subtotal
Sunny                  2     0    2
Cloudy                 0     1    1
Temperature subtotal   2     1    Total count in table: 3
6. The Weather example: Expected Count
If attributes were independent, then the
subtotals would be Like this (this table is also
known as
Outlook \ Temperature  High                 Low                  Subtotal
Sunny                  2*2/3 = 4/3 = 1.33   2*1/3 = 2/3 = 0.67   2
Cloudy                 2*1/3 = 0.67         1*1/3 = 0.33         1
Subtotal               2                    1                    Total count in table: 3
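The expected counts in this table can be reproduced from the subtotals alone. The sketch below assumes the observed counts of the weather example (rows Sunny and Cloudy, columns High and Low); the function name `expected_counts` is illustrative and not taken from the slides.

```python
def expected_counts(observed):
    """Expected cell counts under independence:
    (row subtotal * column subtotal) / total count."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Observed counts: rows = Sunny, Cloudy; columns = High, Low.
observed = [[2, 0],
            [0, 1]]
expected = expected_counts(observed)

for outlook, row in zip(["Sunny", "Cloudy"], expected):
    print(outlook, [round(value, 2) for value in row])
```

Each expected row and column still sums to the same subtotal as in the observed table, which is a quick sanity check on the calculation.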
7. Question: How different are the observed and expected counts?
- If the chi-squared value is very large, then A1 and A2 are not independent; that is, they are dependent!
- Degrees of freedom: if the table has n*m cells, then degrees of freedom = (n-1)(m-1)
- In our example:
- Degrees of freedom = 1
- Chi-squared = ?
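One way to answer the question is to sum (O - E)^2 / E over all cells of the table. The sketch below is a minimal illustration, not the slides' own procedure; it assumes every expected count is non-zero, and the function name `chi_squared_statistic` is made up for this example.

```python
def chi_squared_statistic(observed):
    """Return (chi-squared value, degrees of freedom) for a contingency table.
    Assumes every row and column subtotal is non-zero."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count
            statistic += (o - e) ** 2 / e
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Weather example: rows = Sunny, Cloudy; columns = High, Low.
statistic, dof = chi_squared_statistic([[2, 0], [0, 1]])
print(round(statistic, 2), dof)
```

For the three-row weather example the four terms are 1/3, 2/3, 2/3 and 4/3, so the statistic is 3.0 with one degree of freedom.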
8. Chi-Squared Table: what does it mean?
- If the calculated value is much greater than the value in the table, then you have reason to reject the independence assumption
- When your calculated chi-squared value is greater than the χ² value shown in the 0.05 column of this table (3.84 for one degree of freedom), you can be 95% confident that the attributes are actually dependent!
- i.e. if the attributes were independent, there would be only a 5% probability that an X² value this large would occur by chance
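For one degree of freedom the table lookup can be replaced by a direct calculation, because a chi-squared variable with one degree of freedom is the square of a standard normal variable. The sketch below relies on that identity and uses only the standard library; the function name `p_value_one_dof` is illustrative, and the formula does not apply to other degrees of freedom.

```python
import math

def p_value_one_dof(chi2_value):
    """Probability of a chi-squared value at least this large,
    for 1 degree of freedom only: P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(chi2_value / 2.0))

# The 0.05 column of the table lists 3.84 for one degree of freedom.
print(round(p_value_one_dof(3.84), 3))
```

A calculated value above 3.84 therefore corresponds to a probability below 0.05.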
9. Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
- We don't need a two-dimensional count table (also known as a contingency table)
- Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1,
- but in the Honours class over the past ten years there have been 80 females and 40 males.
- Question: Is this a significant departure from the (1:1) expectation?
Observed (Honours)  Male  Female  Total
                    40    80      120
10. Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
- Note: the expected counts are filled in from the 1:1 expectation instead of being calculated
Expected (Honours)  Male  Female  Total
                    60    60      120
11. Chi-Squared Calculation
                      Female  Male   Total
Observed numbers (O)  80      40     120
Expected numbers (E)  60      60     120
O - E                 20      -20    0
(O - E)^2             400     400
(O - E)^2 / E         6.67    6.67   Sum = 13.34 = X²
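The same calculation can be written in a few lines. This sketch assumes the 1:1 expectation from the example, so each expected count is half of the total; the function name `goodness_of_fit` is illustrative.

```python
def goodness_of_fit(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [80, 40]                  # females, males in the Honours class
total = sum(observed)
expected = [total / 2, total / 2]    # filled in from the 1:1 expectation

x2 = goodness_of_fit(observed, expected)
print(round(x2, 2))  # 13.33
```

Without intermediate rounding the sum is 13.33; the table's 13.34 comes from adding the two rounded terms 6.67 + 6.67.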
12. Chi-Squared Test (Cont.)
- Then, check the chi-squared table for significance
- http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
- Compare our X² value with a χ² (chi-squared) value in a table of χ² with n-1 degrees of freedom
- n is the number of categories, i.e. 2 in our case (males and females).
- We have only one degree of freedom (n-1). From the χ² table, we find a "critical value" of 3.84 for p = 0.05.
- 13.34 > 3.84, so the expectation (that the male:female ratio in the Honours class is 1:1) is rejected!
13. Chi-Squared Test in Weka: weather.nominal.arff
14. Chi-Squared Test in Weka
15. Chi-Squared Test in Weka
16. Example of Decision Tree Induction
Initial attribute set: A1, A2, A3, A4, A5, A6
[Figure: decision tree with test nodes A4?, A6? and A1?, and leaves labeled Class 1 and Class 2]
Reduced attribute set: A1, A4, A6
17. Principal Component Analysis
- Given N data vectors in k dimensions, find c < k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector Xj is a linear combination of the c principal component vectors Y1, Y2, ..., Yc
- Xj = m + W1 Y1 + W2 Y2 + ... + Wc Yc, j = 1, 2, ..., N
- m is the mean of the data set
- Wi is the ith component (weight)
- Yi is the ith eigenvector
- Works for numeric data only
- Used when the number of dimensions is large
18. Principal Component Analysis
- See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[Figure: scatter plot on axes X1 and X2 with the eigenvector directions Y1 and Y2 drawn through the data]
Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.
Key observation: the variance is largest along Y1!
19. Principal Component Analysis: one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
- Question: how much spread is in the data along the axis? (distance to the mean)
- Variance = (standard deviation)^2
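The spread of the Temperature column can be measured with the standard library. The slide does not say whether the variance divides by N or by N-1; the sketch below assumes the population form (divide by N).

```python
import statistics

temperature = [42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30]

mean = statistics.mean(temperature)
# Population variance: average squared distance to the mean (divide by N).
variance = statistics.pvariance(temperature)
std_dev = statistics.pstdev(temperature)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```

Swapping in `statistics.variance` and `statistics.stdev` gives the sample (N-1) versions instead.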
20. Now consider two dimensions
X = Temperature  Y = Humidity
40 90
40 90
40 90
30 90
15 70
15 70
15 70
30 90
15 70
30 70
30 70
30 90
40 70
30 90
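- Covariance measures the correlation between X and Y
- cov(X,Y) = 0: no linear relationship (independent attributes have zero covariance)
- cov(X,Y) > 0: X and Y move in the same direction
- cov(X,Y) < 0: X and Y move in opposite directions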
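The covariance of the Temperature and Humidity columns above can be checked directly. This is a minimal sketch that assumes the population form (divide by N); the function name `covariance` is illustrative.

```python
def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

temperature = [40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30]
humidity = [90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90]

cov_xy = covariance(temperature, humidity)
print(round(cov_xy, 2))
```

The result is positive, so in this table higher temperatures tend to go with higher humidity.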
21. More than two attributes: covariance matrix
- Contains covariance values between all possible pairs of dimensions (attributes)
- Example for three attributes (x, y, z)
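The matrix itself did not survive in this transcript. For three attributes the standard form is as follows; the symbol C matches the name used for the covariance matrix on the later slides.

```latex
C =
\begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}
```

The diagonal entries are the variances of x, y and z, and the matrix is symmetric because cov(a,b) = cov(b,a).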
22. Background: eigenvalues and eigenvectors
- Eigenvectors e: C e = λ e
- How to calculate e and λ?
- Calculate det(C - λI), which yields a polynomial (degree n)
- Determine the roots of det(C - λI) = 0; the roots are the eigenvalues λ
- Check out any math book such as
- Elementary Linear Algebra by Howard Anton, Publisher: John Wiley & Sons
- Or any math package such as MATLAB
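For a 2x2 symmetric matrix the characteristic polynomial is a quadratic, so its roots can be written in closed form. The sketch below uses a small made-up matrix, [[2, 1], [1, 2]], rather than data from the slides; the function name is illustrative.

```python
import math

def eigenvalues_2x2_symmetric(a, b, d):
    """Roots of det(C - lambda*I) = 0 for C = [[a, b], [b, d]].
    The polynomial is lambda^2 - (a + d)*lambda + (a*d - b*b)."""
    trace = a + d
    det = a * d - b * b
    # For a symmetric matrix the discriminant equals (a - d)^2 + 4*b^2 >= 0.
    root = math.sqrt(trace * trace - 4.0 * det)
    return (trace - root) / 2.0, (trace + root) / 2.0

smaller, larger = eigenvalues_2x2_symmetric(2.0, 1.0, 2.0)
print(smaller, larger)  # 1.0 3.0
```

For larger matrices the polynomial has degree n and is normally solved numerically, which is what packages such as MATLAB do.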
23. Steps of PCA
- Let m be the mean vector (taking the mean of all rows)
- Adjust the original data by the mean
- X' = X - m
- Compute the covariance matrix C of the adjusted data X'
- Find the eigenvectors and eigenvalues of C.
- For a matrix C, the eigenvectors are the (column) vectors e that have the same direction as Ce
- e is an eigenvector of C if Ce = λe,
- λ is called an eigenvalue of C.
- Ce = λe ⇔ (C - λI)e = 0
- Most data mining packages do this for you.
24. Steps of PCA (cont.)
- Calculate the eigenvalues λ and eigenvectors e of the covariance matrix
- Each eigenvalue λj corresponds to the variance along component j
- Thus, sort by λj
- Take the first n eigenvectors ei, where n is the number of top eigenvalues to keep
- These are the directions with the largest variances
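The sorting step can be sketched as follows. The eigenvalue and eigenvector pairs are copied as quoted in the worked example on the following slides; keeping one component (`n_keep = 1`) is an assumption made for illustration.

```python
# (eigenvalue, eigenvector) pairs as quoted in the worked example.
eigenpairs = [
    (51.8, (-0.98, -0.21)),
    (560.2, (0.21, -0.98)),
]

# Sort by eigenvalue, largest first, and keep the top n_keep directions.
ranked = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)
n_keep = 1
kept = ranked[:n_keep]

# Share of the total variance carried by the kept components.
total_variance = sum(value for value, _ in eigenpairs)
explained = sum(value for value, _ in kept) / total_variance
print(kept[0][1], round(explained, 3))
```

The ratio of kept to total eigenvalues is a common guide for choosing n.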
25. An Example
Mean1 = 24.1  Mean2 = 53.8
X1 X2 X1' X2'
19 63 -5.1 9.25
39 74 14.9 20.25
30 87 5.9 33.25
30 23 5.9 -30.75
15 35 -9.1 -18.75
15 43 -9.1 -10.75
15 32 -9.1 -21.75
30 73 5.9 19.25
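The adjusted columns X1' and X2' come from subtracting each column's mean. A minimal sketch with the X1 and X2 values from the table; the function name `center` is illustrative.

```python
x1 = [19, 39, 30, 30, 15, 15, 15, 30]
x2 = [63, 74, 87, 23, 35, 43, 32, 73]

def center(values):
    """Return the mean and the mean-adjusted values."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]

mean1, x1_adjusted = center(x1)
mean2, x2_adjusted = center(x2)
print(mean1, mean2)  # 24.125 53.75
```

The table rounds the means to 24.1 and 53.8, so its adjusted values can differ from these in the second decimal place.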
26. Covariance Matrix
- C = [  75  106 ]
      [ 106  482 ]
- Using MATLAB, we find:
- Eigenvectors
- e1 = (-0.98, -0.21), λ1 = 51.8
- e2 = (0.21, -0.98), λ2 = 560.2
- Thus the second eigenvector is more important!
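The entries of C can be reproduced from the example data. The sketch below assumes the population form (divide by N), which is the form that matches the rounded entries 75, 106 and 482; the function name `covariance_matrix` is illustrative.

```python
def covariance_matrix(columns):
    """Population covariance matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in centered]
            for ci in centered]

x1 = [19, 39, 30, 30, 15, 15, 15, 30]
x2 = [63, 74, 87, 23, 35, 43, 32, 73]

C = covariance_matrix([x1, x2])
print([[round(v) for v in row] for row in C])  # [[75, 106], [106, 482]]
```

Dividing by N-1 instead would give larger entries, so the choice of normalization matters when comparing against a package's output.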
27. If we only keep one dimension: e2
- We keep the dimension of e2 = (0.21, -0.98)
- We can obtain the final data as yi:
yi: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
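Each yi is the dot product of an adjusted data row (X1', X2') with e2. The sketch below takes the adjusted values and e2 exactly as quoted on the slides, without recomputing them.

```python
# Adjusted data (X1', X2') and eigenvector e2 as quoted on the slides.
x1_adjusted = [-5.1, 14.9, 5.9, 5.9, -9.1, -9.1, -9.1, 5.9]
x2_adjusted = [9.25, 20.25, 33.25, -30.75, -18.75, -10.75, -21.75, 19.25]
e2 = (0.21, -0.98)

# Project every row onto e2: y_i = x1' * e2[0] + x2' * e2[1].
y = [a * e2[0] + b * e2[1] for a, b in zip(x1_adjusted, x2_adjusted)]
print([round(v, 2) for v in y])
```

The two-dimensional data set is thereby reduced to a single column of values.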
28. Using MATLAB to figure it out
29. PCA in Weka
30. Weather Data from UCI Dataset (comes with the Weka package)
31. PCA in Weka (I)
33. Summary of PCA
- PCA is used for reducing the number of numerical attributes
- The key is in data transformation
- Adjust data by mean
- Find eigenvectors for covariance matrix
- Transform data
- Note: only linear combinations of data (weighted sums of original data)
34. Missing and Inconsistent values
- Linear regression: data are modeled to fit a straight line
- Least-squares method to fit Y = a + bX
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above.
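The least-squares fit of Y = a + bX has a closed-form solution. The sketch below uses hypothetical points that lie exactly on a line, so the recovered coefficients are easy to verify; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope
    a = mean_y - b * mean_x   # intercept
    return a, b

# Hypothetical data lying on y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 4.0, 5.0]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 1.0

# A missing Y value could then be filled in from the fitted line.
predicted = a + b * 5.0
```

Using the fitted line to fill in a missing value is one way regression relates to the slide's title.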
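35. Regression
[Figure: Height (y-axis) plotted against Age (x-axis) with the fitted line y = x + 1; the point X1 and its values Y1 are marked]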
36. Clustering for Outlier detection
- Outliers can be incorrect data. Clusters capture the majority behavior
37. Data Reduction with Sampling
- Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data
- Simple random sampling may have very poor performance in the presence of skewed (uneven) classes
- Develop adaptive sampling methods
- Stratified sampling
- Approximate the percentage of each class (or subpopulation of interest) in the overall database
- Used in conjunction with skewed data
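Stratified sampling can be sketched by sampling the same fraction from each class. The records, the `label_of` accessor and the fixed seed below are all hypothetical choices for illustration; the rule of keeping at least one record per class is an assumption, not something the slides specify.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Sample roughly `fraction` of each class, at least one record per class."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[label_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))  # without replacement
    return sample

# Hypothetical skewed data: 90 records of class "a", 10 of class "b".
records = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
sample = stratified_sample(records, lambda record: record[0], 0.2)

count_a = sum(1 for label, _ in sample if label == "a")
count_b = sum(1 for label, _ in sample if label == "b")
print(count_a, count_b)  # 18 2
```

The 9:1 class ratio of the full data is preserved in the sample, whereas a simple random sample of 20 records could easily miss class "b" entirely.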
38. Sampling
SRSWOR (simple random sample without replacement)
SRSWR (simple random sample with replacement)
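Both schemes are available in Python's standard `random` module. The data set, sample size and seed below are hypothetical.

```python
import random

rng = random.Random(42)       # fixed seed so the run is repeatable
data = list(range(1, 101))    # hypothetical data set of 100 records

# SRSWOR: each record can be drawn at most once.
srswor = rng.sample(data, 10)

# SRSWR: a record can be drawn more than once.
srswr = rng.choices(data, k=10)

print(srswor)
print(srswr)
```

`random.sample` never repeats a record, while `random.choices` may return duplicates.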
39. Sampling Example
[Figure: raw data shown alongside a cluster/stratified sample]
40. Summary
- Data preparation is a big issue for data mining
- Data preparation includes
- Data warehousing
- Data reduction and feature selection
- Discretization
- Missing values
- Incorrect values
- Sampling