Data Mining: Concepts and Techniques - PowerPoint PPT Presentation

Slides: 41
Provided by: cseUstHk1
Transcript and Presenter's Notes


1
Data Mining: Concepts and Techniques, Chapter 3 (Cont.)
  • More on Feature Selection
  • Chi-squared test
  • Principal Component Analysis

2
Attribute Selection
ID Outlook Temperature Humidity Windy Play
1 100 40 90 0 T
2 100 40 90 1 F
3 50 40 90 0 T
4 10 30 90 0 T
5 10 15 70 0 T
6 10 15 70 1 F
7 50 15 70 1 T
8 100 30 90 0 F
9 100 15 70 0 T
10 10 30 70 0 F
11 100 30 70 1 F
12 50 30 90 1 T
13 50 40 70 0 T
14 10 30 90 1 F
  • Question: Are attributes A1 and A2 independent?
  • If they are very dependent, we can remove
    either A1 or A2
  • If A1 is independent of a class attribute A2, we
    can remove A1 from our training data

3
Deciding to remove attributes in feature selection
[Diagram: chi-squared test of attribute A1 against the class attribute A2]
Dependent (chi-squared large)? Keep A1.
Independent (chi-squared small)? A1 can be removed.
4
Chi-Squared Test (cont.)
  • Question: Are attributes A1 and A2 independent?
  • These features are nominal valued (discrete)
  • Null hypothesis: we expect independence

Outlook Temperature
Sunny High
Cloudy Low
Sunny High
5
The Weather Example: Observed Count
Outlook \ Temperature   High   Low   Outlook subtotal
Sunny                   2      0     2
Cloudy                  0      1     1
Temperature subtotal    2      1     Total count in table: 3
Outlook Temperature
Sunny High
Cloudy Low
Sunny High
6
The Weather Example: Expected Count
If the attributes were independent, each cell would hold
(row subtotal × column subtotal) / total, like this
(this table is also known as ...)
Outlook \ Temperature   High                 Low                  Subtotal
Sunny                   2×2/3 = 4/3 ≈ 1.33   2×1/3 = 2/3 ≈ 0.67   2
Cloudy                  1×2/3 = 2/3 ≈ 0.67   1×1/3 = 1/3 ≈ 0.33   1
Subtotal                2                    1                    Total count in table: 3
Outlook Temperature
Sunny High
Cloudy Low
Sunny High
7
Question: How different are the observed and
expected counts?
  • If the chi-squared value is very large, then A1 and
    A2 are not independent; that is, they are
    dependent!
  • Degrees of freedom: if the table has n × m cells, then
    degrees of freedom = (n-1)(m-1)
  • In our example (a 2 × 2 table)
  • Degrees of freedom = 1
  • Chi-squared = ?
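
The question on this slide can be worked through numerically. The sketch below assumes the usual Pearson statistic, the sum over all cells of (O - E)^2 / E, and takes the expected counts from the subtotals as on the previous slide. The function name `chi_squared` is illustrative, not something from the deck.

```python
def chi_squared(observed):
    """Return (chi2, degrees_of_freedom, expected) for a 2-D count table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row subtotal * column subtotal / total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Rows: Sunny, Cloudy; columns: High, Low (the three-record weather example).
weather_observed = [[2, 0], [0, 1]]
chi2, dof, expected = chi_squared(weather_observed)
print(f"chi2 = {chi2:.2f}, degrees of freedom = {dof}")
```

For this table the statistic is 3.0 with 1 degree of freedom, which is below the 3.84 critical value quoted on the next slide, so three records alone do not give grounds to reject independence at the 0.05 level.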

8
Chi-Squared Table: what does it mean?
  • If the calculated value is much greater than the one in the
    table, then you have reason to reject the
    independence assumption
  • When your calculated chi-squared value is greater
    than the chi-squared value shown in the 0.05 column
    (3.84) of this table, you are 95% certain that the
    attributes are actually dependent!
  • i.e. if the attributes were independent, there is only a 5%
    probability that a value this large would occur by chance

9
Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
  • We don't have to have a two-dimensional count table
    (also known as a contingency table)
  • Suppose that the ratio of male to female students
    in the Science Faculty is exactly 1:1,
  • but in the Honours class over the past ten years
    there have been 80 females and 40 males.
  • Question: Is this a significant departure from
    the (1:1) expectation?

Observed (Honours)   Male   Female   Total
                     40     80       120
10
Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
  • Suppose that the ratio of male to female students
    in the Science Faculty is exactly 1:1,
  • but in the Honours class over the past ten years
    there have been 80 females and 40 males.
  • Question: Is this a significant departure from
    the (1:1) expectation?
  • Note: the expected counts are filled in from the 1:1
    expectation instead of being calculated from subtotals

Expected (Honours)   Male   Female   Total
                     60     60       120
11
Chi-Squared Calculation
                       Female   Male    Total
Observed numbers (O)   80       40      120
Expected numbers (E)   60       60      120
O - E                  20       -20     0
(O-E)^2                400      400
(O-E)^2 / E            6.67     6.67    Sum = 13.34 = X^2
12
Chi-Squared Test (Cont.)
  • Then, check the chi-squared table for
    significance
  • http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
  • Compare our X^2 value with a χ² (chi-squared)
    value in a table of χ² with n-1 degrees of
    freedom
  • (n is the number of categories, i.e. 2 in our case:
    males and females).
  • We have only one degree of freedom (n-1). From
    the χ² table, we find a "critical value" of 3.84
    for p = 0.05.
  • 13.34 > 3.84, so the expectation (that the
    Male:Female ratio in the Honours class is 1:1) is rejected!
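
The Honours calculation can be reproduced in a few lines. This is a minimal sketch that hard-codes the 3.84 critical value quoted above rather than looking it up. The variable names are illustrative.

```python
observed = {"female": 80, "male": 40}
expected = {"female": 60, "male": 60}  # filled in from the 1:1 expectation

# Sum of (O - E)^2 / E over the categories.
x2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
dof = len(observed) - 1
CRITICAL_0_05_DOF_1 = 3.84  # critical value quoted above for p = 0.05, 1 d.o.f.

reject = x2 > CRITICAL_0_05_DOF_1
print(f"X2 = {x2:.2f}, dof = {dof}, reject 1:1 expectation: {reject}")
```

Computed without intermediate rounding the statistic is 13.33. The slide's 13.34 comes from adding the two rounded terms 6.67 + 6.67.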

13
Chi-Squared Test in Weka: weather.nominal.arff
14
Chi-Squared Test in Weka
15
Chi-Squared Test in Weka
16
Example of Decision Tree Induction
Initial attribute set: A1, A2, A3, A4, A5, A6
[Figure: decision tree with test nodes A4?, A6? and A1?, and leaves Class 1 and Class 2]
Reduced attribute set: A1, A4, A6
17
Principal Component Analysis
  • Given N data vectors in k dimensions, find c <
    k orthogonal vectors that can best be used to
    represent the data
  • The original data set is reduced to one
    consisting of N data vectors on c principal
    components (reduced dimensions)
  • Each data vector Xj is a linear combination of
    the c principal component vectors Y1, Y2, ..., Yc
  • Xj = m + W1 Y1 + W2 Y2 + ... + Wc Yc, j = 1, 2, ..., N
  • m is the mean of the data set
  • W1, W2, ... are the components (weights) of Xj
  • Y1, Y2, ... are the eigenvectors
  • Works for numeric data only
  • Used when the number of dimensions is large

18
Principal Component Analysis
  • See online tutorials such as
    http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[Figure: scatter plot on axes X1 and X2 with the eigenvector directions Y1 and Y2 drawn through the data]
Note: Y1 is the first eigenvector, Y2 is the
second. Y2 is ignorable.
Key observation: the variance along Y1 is the largest!
19
Principal Component Analysis: one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
  • Question: how much spread is in the data along
    the axis? (distance to the mean)
  • Variance = (standard deviation)^2
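
The spread of the Temperature column can be measured directly. This sketch assumes the sample variance (dividing by N - 1), since the slide does not say which normalization it intends. It uses Python's standard `statistics` module.

```python
import statistics

temperature = [42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30]

mean = statistics.mean(temperature)
# Sample variance divides by N - 1; statistics.pvariance would divide by N.
variance = statistics.variance(temperature)
std_dev = statistics.stdev(temperature)

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
```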

20
Now consider two dimensions
X = Temperature, Y = Humidity
40 90
40 90
40 90
30 90
15 70
15 70
15 70
30 90
15 70
30 70
30 70
30 90
40 70
30 90
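  • Covariance measures how X and Y vary together
    (the direction of their linear relationship)
  • cov(X,Y) = 0: uncorrelated (independent attributes have zero covariance)
  • cov(X,Y) > 0: X and Y move in the same direction
  • cov(X,Y) < 0: X and Y move in opposite directions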
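
As a check on the sign rule, the covariance of the two columns on this slide can be computed by hand. The sketch assumes the sample covariance (dividing by N - 1). The helper name `covariance` is illustrative.

```python
temperature = [40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30]
humidity = [90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90]


def covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by N - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_xy = covariance(temperature, humidity)
print(f"cov(Temperature, Humidity) = {cov_xy:.2f}")
```

The result is about 61.54, a positive value, so in this data Temperature and Humidity tend to move in the same direction.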

21
More than two attributes: covariance matrix
  • Contains covariance values between all possible
    dimensions (attributes)
  • Example for three attributes (x,y,z)
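
The matrix for the three-attribute case is not preserved in this transcript. Under the standard definition it would take the following form, where each entry is the covariance of one pair of attributes:

```latex
C =
\begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}
```

The diagonal entries are the variances of x, y and z, and the matrix is symmetric because cov(a,b) = cov(b,a).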

22
Background: eigenvalues and eigenvectors
  • Eigenvectors e: C e = λ e
  • How to calculate e and λ?
  • Calculate det(C - λI), which yields a polynomial (degree
    n)
  • Determine the roots of det(C - λI) = 0; the roots are
    the eigenvalues λ
  • Check out any math book such as
  • Elementary Linear Algebra by Howard Anton,
    publisher John Wiley & Sons
  • Or any math package such as MATLAB
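
For a 2 × 2 symmetric matrix the characteristic polynomial is a quadratic, so the roots can be found with the quadratic formula. The sketch below is restricted to that case and uses a hypothetical matrix [[2, 1], [1, 2]] chosen so the answer is easy to check by hand. The function name is illustrative.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] from det(C - lambda*I) = 0.

    The determinant expands to lambda^2 - (a + d)*lambda + (a*d - b*b).
    """
    trace = a + d
    det = a * d - b * b
    root = math.sqrt(trace * trace - 4 * det)  # never negative for symmetric C
    return (trace - root) / 2, (trace + root) / 2


# Hypothetical matrix chosen so the roots are easy to verify by hand.
small, large = eigenvalues_2x2_symmetric(2, 1, 2)
print(small, large)
```

Here the polynomial is λ^2 - 4λ + 3, whose roots are 1 and 3. Larger matrices need a numerical package, as the slide suggests.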

23
Steps of PCA
  • Let m be the mean vector (taking the mean of
    all rows)
  • Adjust the original data by the mean
  • X' = X - m
  • Compute the covariance matrix C of the adjusted X
  • Find the eigenvectors and eigenvalues of C.
  • For matrix C, the eigenvectors are the column vectors e
    that have the same direction as Ce
  • An eigenvector of C is a vector e such that Ce = λe,
  • λ is called an eigenvalue of C.
  • Ce = λe ⇔ (C - λI)e = 0
  • Most data mining packages do this for you.

24
Steps of PCA (cont.)
  • Calculate the eigenvalues λ and eigenvectors e of the
    covariance matrix
  • Each eigenvalue λj corresponds to the variance along
    component j
  • Thus, sort by λj (largest first)
  • Take the first n eigenvectors ei, where n is the
    number of top eigenvalues to keep
  • These are the directions with the largest
    variances
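
The sort-and-select step can be sketched with the eigenpairs quoted in the worked example later in the deck. The value of `n_keep` is a hypothetical choice, and the "explained variance" ratio is a common summary that the slides do not mention explicitly.

```python
# Eigenpairs (eigenvalue, eigenvector) as quoted in the worked example.
eigenpairs = [
    (51.8, (-0.98, -0.21)),
    (560.2, (0.21, -0.98)),
]

# Sort by eigenvalue, largest first.
ranked = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)

n_keep = 1  # hypothetical choice of how many components to keep
kept = ranked[:n_keep]

# Share of the total variance carried by the kept components.
total_variance = sum(value for value, _ in eigenpairs)
explained = sum(value for value, _ in kept) / total_variance
print(kept[0][1], f"explains {explained:.1%} of the variance")
```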

25
An Example
Mean1 = 24.1, Mean2 = 53.8
X1    X2    X1'    X2'
19    63    -5.1    9.25
39    74    14.9    20.25
30    87     5.9    33.25
30    23     5.9   -30.75
15    35    -9.1   -18.75
15    43    -9.1   -10.75
15    32    -9.1   -21.75
30    73     5.9    19.25
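
The mean adjustment in this table can be reproduced as follows. This is a minimal sketch with illustrative variable names.

```python
x1 = [19, 39, 30, 30, 15, 15, 15, 30]
x2 = [63, 74, 87, 23, 35, 43, 32, 73]

mean1 = sum(x1) / len(x1)  # 24.125, shown rounded as 24.1
mean2 = sum(x2) / len(x2)  # 53.75, shown rounded as 53.8

# Subtract each column's mean so the adjusted data is centred on zero.
x1_adj = [v - mean1 for v in x1]
x2_adj = [v - mean2 for v in x2]

for a, b in zip(x1_adj, x2_adj):
    print(f"{a:7.3f} {b:7.2f}")
```

The X1' column in the table appears to use the rounded mean 24.1, so its entries differ from the unrounded values here by 0.025.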
26
Covariance Matrix
C =
   75   106
  106   482
  • Using MATLAB, we find the eigenvectors and eigenvalues:
  • e1 = (-0.98, -0.21), λ1 = 51.8
  • e2 = (0.21, -0.98), λ2 = 560.2
  • Thus the second eigenvector is more important!
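
The covariance matrix can be recomputed from the example data as a cross-check. The sketch assumes division by N, an inferred choice that reproduces the entries 75, 106 and 482, and then applies the closed-form 2 × 2 eigen-solution. Computed this way, the eigenvalues come out near 49 and 508 and the dominant direction near (0.24, 0.97), so they need not match the MATLAB figures quoted on the slide, which may rest on a different normalization, rounding or sign convention.

```python
import math

x1 = [19, 39, 30, 30, 15, 15, 15, 30]
x2 = [63, 74, 87, 23, 35, 43, 32, 73]
n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n

# Dividing by N (an inferred choice) gives entries close to 75, 106 and 482.
c11 = sum((a - m1) ** 2 for a in x1) / n
c22 = sum((b - m2) ** 2 for b in x2) / n
c12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n

# Closed-form eigenvalues of the symmetric matrix [[c11, c12], [c12, c22]].
trace, det = c11 + c22, c11 * c22 - c12 * c12
root = math.sqrt(trace * trace - 4 * det)
lam_small, lam_large = (trace - root) / 2, (trace + root) / 2

# Unit eigenvector for the larger eigenvalue: solve (C - lambda*I) e = 0.
vx, vy = c12, lam_large - c11
norm = math.hypot(vx, vy)
e_large = (vx / norm, vy / norm)

print(f"C = [[{c11:.1f}, {c12:.1f}], [{c12:.1f}, {c22:.1f}]]")
print(f"eigenvalues: {lam_small:.1f}, {lam_large:.1f}")
print(f"dominant direction: ({e_large[0]:.2f}, {e_large[1]:.2f})")
```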

27
If we only keep one dimension: e2
  • We keep the dimension of e2 = (0.21, -0.98)
  • We can obtain the final data as
    yi = 0.21 × X1' - 0.98 × X2'
yi: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
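
The yi column is the dot product of each mean-adjusted row (X1', X2') with e2. The sketch below reuses the adjusted values and the eigenvector exactly as printed in the example.

```python
e2 = (0.21, -0.98)  # eigenvector kept on the slide

# Mean-adjusted data (X1', X2') from the example table.
adjusted = [
    (-5.1, 9.25), (14.9, 20.25), (5.9, 33.25), (5.9, -30.75),
    (-9.1, -18.75), (-9.1, -10.75), (-9.1, -21.75), (5.9, 19.25),
]

# Each y_i is the dot product of an adjusted row with e2.
y = [e2[0] * a + e2[1] * b for a, b in adjusted]
print([round(v, 2) for v in y])
```

The first row gives 0.21 × (-5.1) - 0.98 × 9.25 = -10.136, which matches the -10.14 listed for yi.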

28
Using Matlab to figure it out
29
PCA in Weka
30
Weather data from the UCI dataset (comes with the Weka
package)
31
PCA in Weka (I)
33
Summary of PCA
  • PCA is used for reducing the number of numerical
    attributes
  • The key is in data transformation
  • Adjust data by mean
  • Find eigenvectors for covariance matrix
  • Transform data
  • Note: PCA produces only linear combinations of the data
    (weighted sums of the original attributes)

34
Missing and Inconsistent values
  • Linear regression: data are modeled to fit a
    straight line
  • Least-squares method to fit Y = a + bX
  • Multiple regression: Y = b0 + b1 X1 + b2 X2.
  • Many nonlinear functions can be transformed into
    the above.
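
One way to use a fitted line for a missing value is sketched below. It uses the closed-form least-squares estimates for Y = a + bX on hypothetical (age, height) records. The data and the function name `fit_line` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the model Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical (age, height) records; the last height is missing.
records = [(1, 2.0), (2, 3.0), (3, 4.0), (4, None)]
known = [(x, y) for x, y in records if y is not None]
a, b = fit_line([x for x, _ in known], [y for _, y in known])

# Replace the missing value with the prediction from the fitted line.
filled = [(x, y if y is not None else a + b * x) for x, y in records]
print(filled[-1])
```

The three known points lie exactly on y = x + 1, so the missing height is filled in as 5.0.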

35
Regression
[Figure: plot of Height (y-axis) against Age (x-axis) with the fitted line y = x + 1; the values X1 and Y1 are marked]
36
Clustering for Outlier detection
  • Outliers can be incorrect data. Clusters capture the
    majority behavior; points outside them are candidate outliers

37
Data Reduction with Sampling
  • Allow a mining algorithm to run in complexity
    that is potentially sub-linear to the size of the
    data
  • Choose a representative subset of the data
  • Simple random sampling may have very poor
    performance in the presence of skewed (uneven)
    classes
  • Develop adaptive sampling methods
  • Stratified sampling
  • Approximate the percentage of each class (or
    subpopulation of interest) in the overall
    database
  • Used in conjunction with skewed data
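
A stratified sample can be sketched by grouping records by class and drawing the same fraction from each group. The function name, the `fraction` parameter and the data are hypothetical. It uses `random.Random.sample` from the Python standard library.

```python
import random
from collections import defaultdict


def stratified_sample(records, label_of, fraction, seed=0):
    """Draw the same fraction from every class, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[label_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep rare classes present
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data: 90 records of class "F" and 10 of class "T".
data = [(i, "F") for i in range(90)] + [(i, "T") for i in range(90, 100)]
sample = stratified_sample(data, label_of=lambda r: r[1], fraction=0.2)
counts = {c: sum(1 for r in sample if r[1] == c) for c in ("F", "T")}
print(counts)
```

The sample keeps the 9:1 class ratio of the full data (18 "F" records and 2 "T" records), which a simple random sample of 20 would not guarantee.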

38
Sampling
SRSWOR (simple random sample without
replacement)
SRSWR (simple random sample with replacement)
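
The two schemes differ only in whether a record can be drawn more than once. In Python's standard `random` module, `sample` draws without replacement and `choices` draws with replacement. The population below is hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
population = list(range(1, 21))  # hypothetical record IDs

srswor = rng.sample(population, k=5)   # without replacement: no repeats
srswr = rng.choices(population, k=5)   # with replacement: repeats possible

print("SRSWOR:", srswor)
print("SRSWR: ", srswr)
```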
39
Sampling Example
Cluster/Stratified Sample
Raw Data
40
Summary
  • Data preparation is a big issue for data mining
  • Data preparation includes
  • Data warehousing
  • Data reduction and feature selection
  • Discretization
  • Missing values
  • Incorrect values
  • Sampling