Title: Data Mining
1. Data Mining
- By
- Farzana Forhad
- CS 157B
2. Agenda
- Decision Tree and ID3
- Rough Set Theory
- Clustering
3. Introduction
- Data mining is a component of a wider process called knowledge discovery in databases (KDD).
- The basic foundations of data mining:
- decision tree
- association rules
- clustering
- other statistical techniques
4. Decision Tree
- ID3 (Quinlan, 1986) represents concepts as decision trees.
- A decision tree is a classifier in the form of a tree structure where each node is either
- a leaf node, indicating a class of instances, OR
- a decision node, which specifies a test to be carried out on a single attribute value, with one branch and a sub-tree for each possible outcome of the test.
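The two node kinds map naturally onto a small data structure. Below is a minimal sketch in Python, not taken from the slides; the class names, fields, and the classify helper are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class LeafNode:
    """Leaf node: indicates a class of instances."""
    label: str

@dataclass
class DecisionNode:
    """Decision node: tests one attribute, with one branch per possible outcome."""
    attribute: str
    branches: dict = field(default_factory=dict)  # outcome value -> LeafNode | DecisionNode

def classify(node, record):
    """Follow the record's attribute values from the root down to a leaf."""
    while isinstance(node, DecisionNode):
        node = node.branches[record[node.attribute]]
    return node.label
```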
5. Decision Tree
- The set of records available for classification is divided into two disjoint subsets:
- a training set, used for deriving the classifier
- a test set, used to measure the accuracy of the classifier
- Attributes whose domain is numerical are called numerical attributes.
- Attributes whose domain is not numerical are called categorical attributes.
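As a hedged illustration of that split (the 70/30 ratio, seed, and function name are my own choices, not from the slides):

```python
import random

def split_records(records, train_fraction=0.7, seed=0):
    """Divide the available records into two disjoint subsets:
    a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)
```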
6. Decision Tree
- A decision tree is a tree with the following properties:
- An inner node represents an attribute.
- An edge represents a test on the attribute of the parent node.
- A leaf represents one of the classes.
- Construction of a decision tree (sketched in code below):
- Based on the training data
- Top-down strategy
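A minimal sketch of that top-down construction, reusing the LeafNode and DecisionNode classes above and assuming a best_attribute() chooser (sketched with the ID3 slide below) that picks the most informative split; records are dicts and target names the class column.

```python
from collections import Counter

def build_tree(records, attributes, target):
    """Top-down construction from the training data: stop when the records
    are pure (one class) or no attributes remain, otherwise split on the
    attribute chosen by best_attribute() and recurse on each branch."""
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:
        return LeafNode(Counter(labels).most_common(1)[0][0])
    best = best_attribute(records, attributes, target)
    node = DecisionNode(best)
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        node.branches[value] = build_tree(subset, [a for a in attributes if a != best], target)
    return node
```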
7. Training Dataset
8. Test Dataset
9. Decision Tree
RULE 1: If it is sunny and the humidity is not above 75, then play.
RULE 2: If it is sunny and the humidity is above 75, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then do not play.
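The five rules translate directly into a conditional. The sketch below assumes attribute names (outlook, humidity, windy) and string values matching the weather data on the preceding slides; the slides do not spell them out, so treat them as placeholders.

```python
def play_decision(outlook, humidity, windy):
    """Return True for 'play' and False for 'do not play', following RULES 1-5."""
    if outlook == "sunny":
        return humidity <= 75        # RULE 1 (play) / RULE 2 (do not play)
    if outlook == "overcast":
        return True                  # RULE 3
    if outlook == "rainy":
        return not windy             # RULE 4 (play) / RULE 5 (do not play)
    raise ValueError(f"unknown outlook: {outlook}")
```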
10. Training Dataset
11. Decision Tree for Zip Code and Age
12. Iterative Dichotomizer 3 (ID3)
- Quinlan (1986)
- Each node corresponds to a splitting attribute.
- Entropy is used to measure how informative a node is.
- The algorithm uses the criterion of information gain to determine the goodness of a split.
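A minimal sketch of those two measures: entropy is -sum of p_i * log2(p_i) over the class proportions, and the information gain of an attribute is the entropy of the whole set minus the weighted entropy of the subsets produced by splitting on it. The best_attribute() helper is the one assumed by the build_tree() sketch earlier; records are again dicts keyed by attribute name.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

def best_attribute(records, attributes, target):
    """Goodness of a split: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))
```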
13. Iterative Dichotomizer 3 (ID3)
14. Rough Set Theory
- A useful means for studying patterns, rules, and knowledge in data.
- A rough set is the approximation of a vague concept by a pair of precise concepts, called the lower and upper approximations.
15Rough Set Theory
- The lower approximation is a type of the domain
objects which are known with certainty to belong
to the subset of interest. - The upper approximation is a description of the
objects which may perhaps belong to the subset. - Any subset defined through its lower and upper
approximations is called a rough set, if the
boundary region is not empty.
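A small sketch of both approximations, assuming objects that share the same attribute values (the same key) are indiscernible; the function and parameter names are mine.

```python
def approximations(universe, target, key):
    """Lower and upper approximations of `target` (a subset of `universe`)
    under the indiscernibility relation induced by `key(x)`."""
    classes = {}
    for x in universe:                        # group objects into equivalence classes
        classes.setdefault(key(x), set()).add(x)
    lower, upper = set(), set()
    for cls in classes.values():
        if cls <= target:                     # whole class certainly belongs
            lower |= cls
        if cls & target:                      # class may possibly belong
            upper |= cls
    return lower, upper
```

The boundary region is the upper approximation minus the lower one; when it is non-empty, the subset of interest is a rough set.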
16. Lower and Upper Approximations of a Rough Set
17. Association Rule Mining
18. Definition of Association Rules
19. Mining the Rules
20. Two Steps of Association Rule Mining
21. Clustering
- The process of organizing objects into groups whose members are similar in some way.
- Statistics, machine learning, and database researchers have studied data clustering.
- Recent emphasis has been on large datasets.
22. Different Approaches to Clustering
- Two main approaches to clustering:
- partitioning clustering
- hierarchical clustering
- Clustering algorithms differ among themselves in the following ways:
- in their ability to handle different types of attributes (numeric and categorical)
- in accuracy of clustering
- in their ability to handle disk-resident data
23. Problem Statement
- N objects to be grouped into k clusters
- Number of different possibilities (a counting sketch follows this list)
- The objective is to find a grouping such that the distances between objects in a group are minimized.
- Several algorithms exist to find a near-optimal solution.
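The slide leaves the count implicit; the number of ways to partition N objects into k non-empty clusters is the Stirling number of the second kind, sketched below on the assumption that this is the count meant.

```python
from math import comb, factorial

def num_clusterings(n, k):
    """Stirling number of the second kind:
    S(n, k) = (1/k!) * sum_{i=0..k} (-1)^i * C(k, i) * (k - i)^n."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1)) // factorial(k)

print(num_clusterings(9, 2))   # 255 ways to split 9 objects into 2 clusters
```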
24. k-Means Algorithm
- Step 1: Randomly select k points to be the starting points for the centroids of the k clusters.
- Step 2: Assign each object to the closest centroid, forming k exclusive clusters of examples.
- Step 3: Calculate new centroids of the clusters: take the average of all the attribute values of the objects belonging to the same cluster.
- Step 4: Check whether the cluster centroids have changed their coordinates. If yes, repeat from Step 2.
- Step 5: If no, cluster detection is finished, and all objects have their cluster memberships defined (a code sketch follows).
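A one-dimensional sketch of those steps, matching the N = 9, k = 2 example that follows; the function name, the iteration cap, and the option to pass explicit starting centroids (e.g. z1 and z2) are my own choices.

```python
import random

def k_means_1d(points, k, initial=None, max_iter=100):
    """1-D k-means following Steps 1-5 above; returns (centroids, assignments)."""
    # Step 1: starting centroids (random unless given explicitly).
    centroids = list(initial) if initial else random.sample(points, k)
    assignments = []
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid.
        assignments = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        # Step 3: new centroid = average of the objects in its cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        # Steps 4-5: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments
```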
25. Example
- One-dimensional database with N = 9
- Objects labeled z1, ..., z9
- Let k = 2
- Let us start with z1 and z2 as the initial centroids.
Table: One-dimensional database
26. Example
Table: New cluster assignments
27. Example
Table: Reassignment of objects to two clusters
28. Questions? Thank You