Title: Data Mining
1. Data Mining
- By
- Farzana Forhad
- CS 157B
2. Agenda
- Decision Tree and ID3
- Rough Set Theory
- Clustering
3. Introduction
- Data mining is a component of a wider process called knowledge discovery in databases (KDD).
- The basic foundations of data mining:
- decision tree
- association rules
- clustering
- other statistical techniques
4. Decision Tree
- ID3 (Quinlan, 1986) represents concepts as decision trees.
- A decision tree is a classifier in the form of a tree structure where each node is either
- a leaf node, indicating a class of instances, OR
- a decision node, which specifies a test to be carried out on a single attribute value, with one branch and a sub-tree for each possible outcome of the test.
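The two node kinds map naturally onto a small data structure. Below is a minimal sketch in Python, not taken from the slides; the class names, fields, and the classify helper are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class LeafNode:
    """Leaf node: indicates a class of instances."""
    label: str

@dataclass
class DecisionNode:
    """Decision node: tests one attribute, with one branch per possible outcome."""
    attribute: str
    branches: dict = field(default_factory=dict)  # outcome value -> LeafNode | DecisionNode

def classify(node, record):
    """Follow the record's attribute values from the root down to a leaf."""
    while isinstance(node, DecisionNode):
        node = node.branches[record[node.attribute]]
    return node.label
```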
5. Decision Tree
- The set of records available for classification is divided into two disjoint subsets:
- a training set, used for deriving the classifier
- a test set, used to measure the accuracy of the classifier
- Attributes whose domain is numerical are called numerical attributes.
- Attributes whose domain is not numerical are called categorical attributes.
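As a hedged illustration of that split (the 70/30 ratio, seed, and function name are my own choices, not from the slides):

```python
import random

def split_records(records, train_fraction=0.7, seed=0):
    """Divide the available records into two disjoint subsets:
    a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)
```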
6. Decision Tree
- A decision tree is a tree with the following properties:
- An inner node represents an attribute.
- An edge represents a test on the attribute of the parent node.
- A leaf represents one of the classes.
- Construction of a decision tree (sketched in code below):
- Based on the training data
- Top-down strategy
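A minimal sketch of that top-down construction, reusing the LeafNode and DecisionNode classes above and assuming a best_attribute() chooser (sketched with the ID3 slide below) that picks the most informative split; records are dicts and target names the class column.

```python
from collections import Counter

def build_tree(records, attributes, target):
    """Top-down construction from the training data: stop when the records
    are pure (one class) or no attributes remain, otherwise split on the
    attribute chosen by best_attribute() and recurse on each branch."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:
        return LeafNode(Counter(labels).most_common(1)[0][0])
    best = best_attribute(records, attributes, target)
    node = DecisionNode(best)
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        node.branches[value] = build_tree(subset, [a for a in attributes if a != best], target)
    return node
```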
7. Training Dataset
8. Test Dataset
9. Decision Tree
RULE 1: If it is sunny and the humidity is not above 75, then play.
RULE 2: If it is sunny and the humidity is above 75, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then do not play.
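The five rules translate directly into a conditional. The sketch below assumes attribute names (outlook, humidity, windy) and string values matching the weather data on the preceding slides; the slides do not spell them out, so treat them as placeholders.

```python
def play_decision(outlook, humidity, windy):
    """Return True for 'play' and False for 'do not play', following RULES 1-5."""
    if outlook == "sunny":
        return humidity <= 75        # RULE 1 (play) / RULE 2 (do not play)
    if outlook == "overcast":
        return True                  # RULE 3
    if outlook == "rainy":
        return not windy             # RULE 4 (play) / RULE 5 (do not play)
    raise ValueError(f"unknown outlook: {outlook}")
```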
10. Training Dataset
11. Decision Tree for Zip Code and Age
12. Iterative Dichotomizer 3 (ID3)
- Quinlan (1986)
- Each node corresponds to a splitting attribute.
- Entropy is used to measure how informative a node is.
- The algorithm uses the criterion of information gain to determine the goodness of a split.
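A minimal sketch of those two measures: entropy is -sum of p_i * log2(p_i) over the class proportions, and the information gain of an attribute is the entropy of the whole set minus the weighted entropy of the subsets produced by splitting on it. The best_attribute() helper is the one assumed by the build_tree() sketch earlier; records are again dicts keyed by attribute name.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

def best_attribute(records, attributes, target):
    """Goodness of a split: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))
```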
13. Iterative Dichotomizer 3 (ID3)
14. Rough Set Theory
- A useful means for studying patterns, rules, and knowledge in data.
- A rough set is the approximation of a vague concept by a pair of precise concepts, called the lower and upper approximations.
15Rough Set Theory
- The lower approximation is a type of the domain
objects which are known with certainty to belong
to the subset of interest. - The upper approximation is a description of the
objects which may perhaps belong to the subset. - Any subset defined through its lower and upper
approximations is called a rough set, if the
boundary region is not empty.
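A small sketch of both approximations, assuming objects that share the same attribute values (the same key) are indiscernible; the function and parameter names are mine.

```python
def approximations(universe, target, key):
    """Lower and upper approximations of `target` (a subset of `universe`)
    under the indiscernibility relation induced by `key(x)`."""
    classes = {}
    for x in universe:                        # group objects into equivalence classes
        classes.setdefault(key(x), set()).add(x)
    lower, upper = set(), set()
    for cls in classes.values():
        if cls <= target:                     # whole class certainly belongs
            lower |= cls
        if cls & target:                      # class may possibly belong
            upper |= cls
    return lower, upper
```

The boundary region is the upper approximation minus the lower one; when it is non-empty, the subset of interest is a rough set.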
16. Lower and Upper Approximations of a Rough Set
17. Association Rule Mining
18. Definition of Association Rules
19. Mining the Rules
20. Two Steps of Association Rule Mining
21. Clustering
- The process of organizing objects into groups whose members are similar in some way.
- Statistics, machine learning, and database researchers have studied data clustering.
- Recent emphasis has been on large datasets.
22. Different Approaches to Clustering
- Two main approaches to clustering:
- partitioning clustering
- hierarchical clustering
- Clustering algorithms differ among themselves in the following ways:
- in their ability to handle different types of attributes (numeric and categorical)
- in accuracy of clustering
- in their ability to handle disk-resident data
23. Problem Statement
- N objects to be grouped into k clusters
- Number of different possibilities (a counting sketch follows this list)
- The objective is to find a grouping such that the distances between objects in a group are minimized.
- Several algorithms exist to find a near-optimal solution.
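The slide leaves the count implicit; the number of ways to partition N objects into k non-empty clusters is the Stirling number of the second kind, sketched below on the assumption that this is the count meant.

```python
from math import comb, factorial

def num_clusterings(n, k):
    """Stirling number of the second kind:
    S(n, k) = (1/k!) * sum_{i=0..k} (-1)^i * C(k, i) * (k - i)^n."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1)) // factorial(k)

print(num_clusterings(9, 2))   # 255 ways to split 9 objects into 2 clusters
```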
24. k-Means Algorithm
- Step 1: Randomly select k points to be the starting points for the centroids of the k clusters.
- Step 2: Assign each object to the closest centroid, forming k exclusive clusters of examples.
- Step 3: Calculate new centroids of the clusters: take the average of all the attribute values of the objects belonging to the same cluster.
- Step 4: Check whether the cluster centroids have changed their coordinates. If yes, repeat from Step 2.
- Step 5: If no, cluster detection is finished, and all objects have their cluster memberships defined (a code sketch follows).
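A one-dimensional sketch of those steps, matching the N = 9, k = 2 example that follows; the function name, the iteration cap, and the option to pass explicit starting centroids (e.g. z1 and z2) are my own choices.

```python
import random

def k_means_1d(points, k, initial=None, max_iter=100):
    """1-D k-means following Steps 1-5 above; returns (centroids, assignments)."""
    # Step 1: starting centroids (random unless given explicitly).
    centroids = list(initial) if initial else random.sample(points, k)
    assignments = []
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid.
        assignments = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        # Step 3: new centroid = average of the objects in its cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        # Steps 4-5: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments
```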
25. Example
- One-dimensional database with N = 9
- Objects labeled z1, ..., z9
- Let k = 2
- Let us start with z1 and z2 as the initial centroids.
Table: One-dimensional database
26. Example
Table: New cluster assignments
27. Example
Table: Reassignment of objects to two clusters
28. Questions? Thank You