Data Mining Go Over - PowerPoint PPT Presentation
1
Data Mining Go Over
  • Lecture Notes for Go Over
  • Introduction to Data Mining
  • by
  • Minqi Zhou

2
Exam
  • Time: 6.16, 8:00-10:00
  • Room: ??? 301

3
What's Data Mining?
  • Many Definitions
  • Non-trivial extraction of implicit, previously
    unknown, and potentially useful information from
    data
  • Exploration and analysis, by automatic or
    semi-automatic means, of large quantities of
    data in order to discover meaningful patterns

4
Data Mining Tasks
  • Prediction Methods
  • Use some variables to predict unknown or future
    values of other variables.
  • Description Methods
  • Find human-interpretable patterns that describe
    the data.

From Fayyad et al., Advances in Knowledge
Discovery and Data Mining, 1996
5
Data Mining Tasks...
  • Classification [Predictive]
  • Clustering [Descriptive]
  • Association Rule Discovery [Descriptive]
  • Sequential Pattern Discovery [Descriptive]
  • Regression [Predictive]
  • Deviation Detection [Predictive]

6
Attribute Values
  • Attribute values are numbers or symbols assigned
    to an attribute
  • Distinction between attributes and attribute
    values
  • Same attribute can be mapped to different
    attribute values
  • Example: height can be measured in feet or
    meters
  • Different attributes can be mapped to the same
    set of values
  • Example: attribute values for ID and age are
    integers
  • But properties of attribute values can be
    different
  • ID has no limit but age has a maximum and minimum
    value

7
Types of Attributes
  • There are different types of attributes
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips
    on a scale from 1-10), grades, height in tall,
    medium, short
  • Interval
  • Examples: calendar dates, temperatures in Celsius
    or Fahrenheit
  • Ratio
  • Examples: temperature in Kelvin, length, time,
    counts

8
Aggregation
  • Combining two or more attributes (or objects)
    into a single attribute (or object)
  • Purpose
  • Data reduction
  • Reduce the number of attributes or objects
  • Change of scale
  • Cities aggregated into regions, states,
    countries, etc.
  • More stable data
  • Aggregated data tends to have less variability
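As a small illustration of the change of scale described above (the city names and sales figures here are hypothetical, not from the slides), aggregating city-level records up to the country level yields fewer objects:

```python
from collections import defaultdict

# Hypothetical (city, country, sales) records
records = [("Paris", "France", 10), ("Lyon", "France", 4), ("Milan", "Italy", 7)]

totals = defaultdict(int)
for city, country, sales in records:
    totals[country] += sales   # change of scale: cities aggregated into countries

# totals now has one entry per country instead of one per city
```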

9
Sampling
  • The key principle for effective sampling is the
    following
  • Using a sample will work almost as well as using
    the entire data set, if the sample is
    representative
  • A sample is representative if it has
    approximately the same property (of interest) as
    the original set of data
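A minimal sketch of simple random sampling (the population here is made up for illustration); representativeness can then be checked by comparing a property of interest, such as the mean, between the sample and the full data:

```python
import random

population = list(range(1000))        # hypothetical data set
rng = random.Random(42)               # fixed seed for repeatability
sample = rng.sample(population, 100)  # simple random sample, without replacement

# compare the property of interest on the sample vs. the full data
pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
```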

10
Dimensionality Reduction
  • Purpose
  • Avoid curse of dimensionality
  • Reduce amount of time and memory required by data
    mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant features or
    reduce noise
  • Techniques
  • Principal Component Analysis
  • Singular Value Decomposition
  • Others: supervised and non-linear techniques
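The first two techniques are closely related: PCA can be implemented through the SVD of the centered data matrix. A minimal sketch (the function name and shapes are my own, not from the slides):

```python
import numpy as np

def pca(X, d):
    """Project X (n x m) onto its top-d principal components via SVD."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components
    return Xc @ Vt[:d].T                               # n x d reduced representation
```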

11
Similarity and Dissimilarity
  • Similarity
  • Numerical measure of how alike two data objects
    are.
  • Is higher when objects are more alike.
  • Often falls in the range [0, 1]
  • Dissimilarity
  • Numerical measure of how different two data
    objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

12
Euclidean Distance
  • dist(p, q) = sqrt( (p1 - q1)^2 + ... + (pn - qn)^2 )
  • Where n is the number of dimensions
    (attributes) and pk and qk are, respectively, the
    kth attributes (components) of data objects p and
    q.
  • Standardization is necessary if scales differ.

13
Minkowski Distance
  • Minkowski Distance is a generalization of
    Euclidean Distance
  • dist(p, q) = ( |p1 - q1|^r + ... + |pn - qn|^r )^(1/r)
  • Where r is a parameter, n is the number of
    dimensions (attributes) and pk and qk are,
    respectively, the kth attributes (components) of
    data objects p and q.
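The two distances above can be sketched directly from the formulas (function names are my own):

```python
def minkowski(p, q, r):
    """Minkowski distance between points p and q with parameter r."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

def euclidean(p, q):
    """Euclidean distance is the special case r = 2."""
    return minkowski(p, q, 2)
```

With r = 1 this gives the city-block (Manhattan) distance, and with r = 2 the Euclidean distance.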

14
Mahalanobis Distance
  • mahalanobis(p, q) = (p - q) Σ^-1 (p - q)^T
  • Σ is the covariance matrix of the input data X
  • (Slide figure: for the red points, the Euclidean distance is 14.7,
    the Mahalanobis distance is 6.)
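A minimal sketch of the formula above with NumPy (the function name is my own; it assumes X has one data point per row and a non-singular covariance matrix):

```python
import numpy as np

def mahalanobis(p, q, X):
    """Mahalanobis distance between p and q, given the input data X."""
    cov = np.cov(X, rowvar=False)      # covariance matrix of the input data
    diff = np.asarray(p) - np.asarray(q)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```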
15
Similarity Between Binary Vectors
  • Common situation is that objects, p and q, have
    only binary attributes
  • Compute similarities using the following
    quantities
  • M01 = the number of attributes where p was 0 and
    q was 1
  • M10 = the number of attributes where p was 1 and
    q was 0
  • M00 = the number of attributes where p was 0 and
    q was 0
  • M11 = the number of attributes where p was 1 and
    q was 1
  • Simple Matching and Jaccard Coefficients
  • SMC = number of matches / number of attributes
        = (M11 + M00) / (M01 + M10 + M11 + M00)
  • J = number of 11 matches / number of
    not-both-zero attribute values
      = M11 / (M01 + M10 + M11)
  • Cosine Similarity
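The coefficients above, plus cosine similarity (dot(p, q) / (||p|| ||q||)), can be sketched as follows (function names are my own):

```python
import math

def binary_similarities(p, q):
    """SMC and Jaccard from the M00/M01/M10/M11 counts of two binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

def cosine(p, q):
    """Cosine similarity = dot(p, q) / (||p|| * ||q||)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) *
                  math.sqrt(sum(b * b for b in q)))
```

Note that SMC counts the 0-0 matches while Jaccard ignores them, which matters for sparse binary data.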

16
Techniques Used In Data Exploration
  • In EDA, as originally defined by Tukey
  • The focus was on visualization
  • In our discussion of data exploration, we focus
    on
  • Summary statistics
  • Frequency: mode, percentiles
  • Location: mean, median
  • Spread: range, variance
  • Visualization
  • Histogram
  • Box plot
  • Scatter plot
  • Matrix plot
  • Online Analytical Processing (OLAP)

17
Data Exploration
  • Visualization
  • Parallel coordinates
  • Star plots, Chernoff faces
  • OLAP
  • Multi-dimensional array
  • Data cube
  • Slice and dice
  • Roll-up, drill-down

18
Classification Definition
  • Given a collection of records (training set)
  • Each record contains a set of attributes; one of
    the attributes is the class.
  • Find a model for the class attribute as a function
    of the values of the other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with the training set
    used to build the model and the test set used to
    validate it.

19
Another Example of Decision Tree
(Slide figure: a decision tree over the same data that splits first on the
categorical attribute MarSt (Married vs. Single, Divorced), then on the
categorical attribute Refund (Yes/No), then on the continuous attribute
TaxInc (lt 80K vs. gt 80K), with class leaves YES/NO.)
There could be more than one tree that fits the
same data!
20
General Structure of Hunt's Algorithm
  • Let Dt be the set of training records that reach
    a node t
  • General Procedure
  • If Dt contains records that all belong to the same
    class yt, then t is a leaf node labeled as yt
  • If Dt is an empty set, then t is a leaf node
    labeled by the default class, yd
  • If Dt contains records that belong to more than
    one class, use an attribute test to split the
    data into smaller subsets. Recursively apply the
    procedure to each subset.
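The three-case procedure above can be sketched as a recursive function. This is a simplification with names of my own: records are (attribute-dict, label) pairs, and the attribute test is simply the next attribute in a list, whereas real implementations choose it with a measure such as Gini or information gain:

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Sketch of Hunt's recursive procedure; records are (attrs, label) pairs."""
    if not records:                        # Dt empty -> leaf with default class yd
        return default_class
    labels = {label for _, label in records}
    if len(labels) == 1:                   # all records in the same class yt -> leaf
        return labels.pop()
    if not attributes:                     # no tests left -> majority class leaf
        return Counter(l for _, l in records).most_common(1)[0][0]
    attr = attributes[0]                   # attribute test (naive choice here)
    majority = Counter(l for _, l in records).most_common(1)[0][0]
    branches = {}
    for value in {r[attr] for r, _ in records}:
        subset = [(r, l) for r, l in records if r[attr] == value]
        branches[value] = hunt(subset, attributes[1:], majority)
    return (attr, branches)                # internal node: (test attribute, children)
```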

21
Tree Induction
  • Greedy strategy.
  • Split the records based on an attribute test that
    optimizes a certain criterion.
  • Issues
  • Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
  • Determine when to stop splitting

22
Comparison among Splitting Criteria
For a 2-class problem
23
Practical Issues of Classification
  • Underfitting and Overfitting
  • Insufficient records, data noise
  • Evaluating decision trees
  • Re-substitution error, generalization error
  • Pre-pruning, post-pruning
  • Missing Values
  • Handled in terms of the probability of the
    observed values within each class
  • Costs of Classification

24
Underfitting and Overfitting
  • Underfitting: when the model is too simple, both
    training and test errors are large
  • Overfitting: when the model is too complex,
    training error is small but test error is large
25
Occam's Razor
  • Given two models of similar generalization
    errors, one should prefer the simpler model over
    the more complex model
  • For complex models, there is a greater chance
    that the model was fitted accidentally by errors in data
  • Therefore, one should include model complexity
    when evaluating a model

26
Model Evaluation
  • Metrics for Performance Evaluation
  • How to evaluate the performance of a model?
  • Accuracy, cost
  • Methods for Performance Evaluation
  • How to obtain reliable estimates?
  • Holdout, subsampling, cross validation
  • Methods for Model Comparison
  • How to compare the relative performance among
    competing models?
  • ROC curve

27
Definition Frequent Itemset
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
  • Support
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal
    to a minsup threshold
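The support computation can be sketched in a few lines. The five-transaction market-basket data below is assumed to be the textbook's standard example (it reproduces σ = 2 and s = 2/5 for {Milk, Bread, Diaper} as on the slide, but is not itself shown in this transcript):

```python
def support(itemset, transactions):
    """Support = fraction of transactions that contain the itemset."""
    itemset = set(itemset)
    count = sum(itemset <= set(t) for t in transactions)  # support count sigma
    return count / len(transactions)
```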

28
Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • Rule Generation
  • Generate high confidence rules from each frequent
    itemset, where each rule is a binary partitioning
    of a frequent itemset
  • Frequent itemset generation is still
    computationally expensive

29
Reducing Number of Candidates
  • Apriori principle
  • If an itemset is frequent, then all of its
    subsets must also be frequent
  • Apriori principle holds due to the following
    property of the support measure
  • Support of an itemset never exceeds the support
    of its subsets
  • This is known as the anti-monotone property of
    support

30
Apriori Algorithm
  • Frequent itemset generation
  • Frequent itemset support computation
  • Brute-force
  • Hash-tree
  • Rule generation
  • Rules generated from the same frequent itemset
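A level-wise sketch in the spirit of Apriori, matching the slide's "brute-force" support computation (the function name is my own; the full algorithm additionally prunes candidates that have an infrequent subset, which this minimal version omits):

```python
def apriori_frequent(transactions, minsup):
    """Level-wise frequent itemset generation with brute-force counting."""
    def sup(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]   # level 1: single items
    frequent = {}
    while candidates:
        survivors = [c for c in candidates if sup(c) >= minsup]  # drop infrequent
        frequent.update({c: sup(c) for c in survivors})
        # join step: combine frequent k-itemsets that differ in exactly one item
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == len(a) + 1})
    return frequent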

31
Maximal vs Closed Itemsets
32
FP-growth Algorithm
  • Use a compressed representation of the database
    using an FP-tree
  • Once an FP-tree has been constructed, it uses a
    recursive divide-and-conquer approach to mine the
    frequent itemsets

33
FP-tree construction
(Slide figure: after reading TID 1, the FP-tree holds the single path
null -> A:1 -> B:1; after reading TID 2, a second branch is added,
giving null -> A:1 -> B:1 and null -> B:1 -> C:1 -> D:1.)
34
FP-Tree Construction
(Slide figure: the full FP-tree built from the transaction database,
with a header table; node counts such as A:7, B:5, B:3, C:3, C:1, D:1,
and E:1 appear along the branches.)
Pointers are used to assist frequent itemset
generation
35
FP-growth
  • Conditional Pattern base for D:
    P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}
  • Recursively apply FP-growth on P
  • Frequent Itemsets found (with sup gt 1): AD,
    BD, CD, ACD, BCD
(Slide figure: the FP-tree restricted to paths ending in D.)
36
Rule Generation
  • How to efficiently generate rules from frequent
    itemsets?
  • In general, confidence does not have an
    anti-monotone property
  • c(ABC → D) can be larger or smaller than c(AB → D)
  • But confidence of rules generated from the same
    itemset has an anti-monotone property
  • e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD)
    ≥ c(A → BCD)
  • Confidence is anti-monotone w.r.t. the number of
    items on the RHS of the rule
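Confidence itself is just a ratio of supports: c(X → Y) = sup(X ∪ Y) / sup(X). A minimal sketch (function name and the tiny transaction set in the usage below are my own):

```python
def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: sup(lhs ∪ rhs) / sup(lhs)."""
    def count(itemset):
        return sum(set(itemset) <= set(t) for t in transactions)
    return count(set(lhs) | set(rhs)) / count(lhs)
```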

37
Types of Clusterings
  • A clustering is a set of clusters
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional Clustering
  • A division of data objects into non-overlapping
    subsets (clusters) such that each data object is
    in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree

38
K-means Clustering
  • Partitional clustering approach
  • Each cluster is associated with a centroid
    (center point)
  • Each point is assigned to the cluster with the
    closest centroid
  • Number of clusters, K, must be specified
  • The basic algorithm is very simple
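The basic algorithm can be sketched as follows (a minimal version assuming points are numeric tuples; initializing centroids by sampling K data points is one common choice, and a fixed iteration count stands in for a proper convergence test):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Basic K-means: assign each point to the closest centroid, recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial centroids: k data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assignment step
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):       # update step: centroid = mean
            if c:
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids, clusters
```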

39
Evaluating K-means Clusters
  • Most common measure is Sum of Squared Error (SSE)
  • For each point, the error is the distance to the
    centroid of its cluster
  • To get SSE, we square these errors and sum them:
    SSE = Σi Σx∈Ci dist(mi, x)^2
  • x is a data point in cluster Ci and mi is the
    representative point for cluster Ci
  • It can be shown that mi corresponds to the center
    (mean) of the cluster
  • Given two clusters, we can choose the one with
    the smallest error
  • One easy way to reduce SSE is to increase K, the
    number of clusters
  • A good clustering with smaller K can have a
    lower SSE than a poor clustering with higher K
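The SSE measure described above, sketched directly from its definition (the function name is my own; `clusters` is a list of point lists and `centroids` the matching list of representative points):

```python
def sse(clusters, centroids):
    """Sum of Squared Error: squared distance of each point to its centroid."""
    total = 0.0
    for points, m in zip(clusters, centroids):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, m))
    return total
```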

40
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time
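The agglomerative variant above can be sketched as follows, using the single-link (MIN) definition of inter-cluster distance; this is an O(n^3) illustration of the merge loop, not an efficient implementation (names are my own):

```python
def agglomerative(points, k, dist):
    """Merge the closest pair of clusters until only k clusters are left."""
    clusters = [[p] for p in points]       # start: each point is its own cluster
    while len(clusters) > k:
        # closest pair under single-link (minimum pairwise) distance
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(p, q)
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)     # merge one pair per step
    return clusters
```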

41
How to Define Inter-Cluster Similarity?
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
  • Ward's Method uses squared error
(Slide figure: a proximity matrix illustrating the definitions above.)
42
DBSCAN
  • DBSCAN is a density-based algorithm.
  • Density = number of points within a specified
    radius (Eps)
  • A point is a core point if it has more than a
    specified number of points (MinPts) within Eps
  • These are points that are in the interior of a
    cluster
  • A border point has fewer than MinPts within Eps,
    but is in the neighborhood of a core point
  • A noise point is any point that is not a core
    point or a border point.
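The three point types can be sketched directly from the definitions above. One common convention is assumed here (a point counts itself, and "core" means at least MinPts points within Eps); textbooks differ on whether the threshold is strict, so treat the comparison as an assumption:

```python
def classify_points(points, eps, min_pts, dist):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    # neighborhoods within eps, counting the point itself (assumed convention)
    within = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(within[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in within[p]):
            labels[p] = "border"   # not core, but within eps of a core point
        else:
            labels[p] = "noise"
    return labels
```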

43
Measures of Cluster Validity
  • Numerical measures that are applied to judge
    various aspects of cluster validity are
    classified into the following three types.
  • External Index Used to measure the extent to
    which cluster labels match externally supplied
    class labels.
  • Entropy
  • Internal Index Used to measure the goodness of
    a clustering structure without respect to
    external information.
  • Sum of Squared Error (SSE)
  • Relative Index Used to compare two different
    clusterings or clusters.
  • Often an external or internal index is used for
    this function, e.g., SSE or entropy
  • Sometimes these are referred to as criteria
    instead of indices
  • However, sometimes criterion is the general
    strategy and index is the numerical measure that
    implements the criterion.

44
Using Similarity Matrix for Cluster Validation
  • Order the similarity matrix with respect to
    cluster labels and inspect visually.