Title: Data Mining
1Data Mining
2Outline
- What is data mining?
- Data Mining Tasks
- Association
- Classification
- Clustering
- Data mining Algorithms
- Are all the patterns interesting?
3What is Data Mining
- The huge number of databases and web pages makes information extraction next to impossible (remember the favored statement "I will bury them in data!")
- Inability of many other disciplines (statistics, AI, information retrieval) to provide scalable algorithms that extract information and/or rules from databases
- Necessity to find relationships among data
4What is Data Mining
- Discovery of useful, possibly unexpected, data patterns
- Subsidiary issues
- Data cleansing
- Visualization
- Warehousing
5Examples
- A big objection to it was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox is a great example of how not to conduct scientific research.
6Rhine Paradox --- (1)
- David Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).
- He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
- He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
7Rhine Paradox --- (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
- Answer on next slide.
8Rhine Paradox --- (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
9A Concrete Example
- This example illustrates a problem with intelligence-gathering.
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find people who at least twice have stayed at the same hotel on the same day.
10The Details
- 10^9 people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so 10^5 hotels).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
11Calculations --- (1)
- Probability that persons p and q will be at the same hotel on day d: 1/100 × 1/100 × 10^-5 = 10^-9.
- Probability that p and q will be at the same hotel on two given days: 10^-9 × 10^-9 = 10^-18.
- Pairs of days: about 5 × 10^5.
12Calculations --- (2)
- Probability that p and q will be at the same hotel on some two days: 5 × 10^5 × 10^-18 = 5 × 10^-13.
- Pairs of people: about 5 × 10^17.
- Expected number of suspicious pairs of people: 5 × 10^17 × 5 × 10^-13 = 250,000.
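A quick sanity check of these numbers in code (a sketch; the figures 10^9 people, 10^5 hotels, a 1% hotel-stay rate, and 1000 days are taken from the slides above):

```python
from math import comb

people = 10**9          # people being tracked
days = 1000             # days of data
hotels = 10**5          # hotels, each holding 100 people
p_in_hotel = 0.01       # each person is in some hotel 1% of the time

# Probability that p and q are in the same hotel on one given day
p_same_day = p_in_hotel * p_in_hotel * (1 / hotels)      # 1e-9

# Probability that they coincide on two specific days
p_two_given_days = p_same_day ** 2                       # 1e-18

pairs_of_days = comb(days, 2)                            # ~5e5
pairs_of_people = comb(people, 2)                        # ~5e17

# Expected number of "suspicious" pairs under pure chance
expected = pairs_of_people * pairs_of_days * p_two_given_days
print(f"{expected:,.0f}")   # about 250,000 (the slide rounds the pair counts)
```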
13Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
- Not gonna happen.
- But how can we improve the scheme?
14Appetizer
- Consider a file consisting of 24,471 records. The file contains at least two condition attributes, A and D.
A \ D    D=0     D=1    Total
A=0      9272     232    9504
A=1     14695     272   14967
Total   23967     504   24471
15Appetizer (cont)
- Probability that a person has A: P(A) ≈ 0.6
- Probability that a person has D: P(D) ≈ 0.02
- Conditional probability that a person has D given that it has A: P(D|A) = P(A ∧ D)/P(A) = (272/24471)/0.6 ≈ 0.02
- P(A|D) = P(A ∧ D)/P(D) ≈ 0.54
- What can we say about dependencies between A and D?
A \ D    D=0     D=1    Total
A=0      9272     232    9504
A=1     14695     272   14967
Total   23967     504   24471
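A small script reproducing these figures from the contingency table above (a sketch; the counts are read straight from the table):

```python
# Counts from the A/D contingency table above
n_AD = 272        # A = 1 and D = 1
n_A = 14967       # A = 1
n_D = 504         # D = 1
n = 24471         # total records

p_A = n_A / n                 # ~0.61
p_D = n_D / n                 # ~0.02
p_D_given_A = n_AD / n_A      # ~0.018
p_A_given_D = n_AD / n_D      # ~0.54

# If A and D were independent, P(D|A) would equal P(D)
print(p_A, p_D, p_D_given_A, p_A_given_D)
print("lift(A -> D) =", p_D_given_A / p_D)   # ~0.88: a slight negative association
```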
16Appetizer(3)
- So far we did not ask anything that statistics would not have asked. So is data mining just another word for statistics?
- We hope that the response will be a resounding NO
- The major difference is that statistical methods work with random data samples, whereas the data in databases is not necessarily random
- The second difference is the size of the data set
- The third difference is that statistical samples do not contain dirty data
17Architecture of a Typical Data Mining System
(Layered diagram, top to bottom:)
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Knowledge base
- Database or data warehouse server
- Filtering
- Data cleaning, data integration
- Data warehouse
- Databases
18Data Mining Tasks
- Association (correlation and causality)
- Multi-dimensional vs. single-dimensional association
- age(X, "20..29") ∧ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
- contains(T, "computer") => contains(T, "software") [1%, 75%]
- What is support? The percentage of the tuples in the database that have age between 20 and 29, income between 20K and 29K, and buy a PC
- What is confidence? The probability that if a person is aged between 20 and 29 with income between 20K and 29K, then he or she buys a PC
- Clustering (grouping data that are close together into the same cluster)
- What does "close together" mean?
19Distances between data
- Distance between data is a measure of the dissimilarity between data.
- d(i,j) ≥ 0; d(i,j) = d(j,i); d(i,j) ≤ d(i,k) + d(k,j)
- Euclidean distance between <x1, x2, ..., xk> and <y1, y2, ..., yk>
- Standardize variables by finding the standard deviation and dividing each xi by the standard deviation of X
- Covariance(X, Y) = (1/k) Σ (xi - mean(X))(yi - mean(Y))
- Boolean variables and their distances
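A minimal sketch of these quantities for numeric vectors (the helper names and sample data are illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between <x1,...,xk> and <y1,...,yk>."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def standardize(x):
    """Divide each value by the standard deviation of the variable."""
    mean = sum(x) / len(x)
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / len(x))
    return [xi / sd for xi in x]

def covariance(x, y):
    """Cov(X, Y) = (1/k) * sum((xi - mean(X)) * (yi - mean(Y)))."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(euclidean(x, y), covariance(x, y), standardize(x))
```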
20Data Mining Tasks
- Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis
- Trend and evolution analysis
- Trend and deviation: regression analysis
- Sequential pattern mining, periodicity analysis
- Similarity-based analysis
- Other pattern-directed or statistical analyses
21Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
- Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
- Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
22Are All the Discovered Patterns Interesting? -
Example
              coffee=0   coffee=1   total
tea=1             5          20       25
tea=0             5          70       75
total            10          90      100

The conditional probability that if one buys coffee, one also buys tea is 20/90 = 2/9. The conditional probability that if one buys tea, she also buys coffee is 20/25 = 0.8. However, the probability that she buys coffee at all is 0.9. So, is it a significant inference that if a customer buys tea she also buys coffee? Are buying tea and buying coffee independent activities?
23How to measure Interestingness
- Rule interest: RI = |X ∧ Y| - |X| |Y| / N
- Support and confidence: |X ∧ Y| / N is the support and |X ∧ Y| / |X| is the confidence of X -> Y
- Chi-square: (|X ∧ Y| - E[|X ∧ Y|])^2 / E[|X ∧ Y|]
- J-measure: J(X -> Y) = P(Y) ( P(X|Y) log(P(X|Y)/P(X)) + (1 - P(X|Y)) log((1 - P(X|Y))/(1 - P(X))) )
- Sufficiency(X -> Y) = P(X|Y)/P(X|¬Y); Necessity(X -> Y) = P(¬X|Y)/P(¬X|¬Y)
- Interestingness of Y -> X: NC = (1 - N(X -> Y)) P(Y) if N(X -> Y) is less than 1, and 0 otherwise
24Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
- Can a data mining system find all the interesting patterns?
- Association vs. classification vs. clustering
- Search for only the interesting patterns: optimization
- Can a data mining system find only the interesting patterns?
- Approaches
- First generate all the patterns and then filter out the uninteresting ones.
- Generate only the interesting patterns (mining query optimization)
25Clustering
- Partition data set into clusters, and one can
store cluster representation only - Can be very effective if data is clustered but
not if data is smeared - Can have hierarchical clustering and be stored in
multi-dimensional index tree structures - There are many choices of clustering definitions
and clustering algorithms.
26Example Clusters
(Scatter plot: the points form several dense clusters, with a few isolated outliers.)
27Sampling
- Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data
- Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
- Stratified sampling
- Approximate the percentage of each class (or subpopulation of interest) in the overall database
- Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a time).
28Sampling
SRSWOR (simple random sample without
replacement)
SRSWR
29Sampling
Cluster/Stratified Sample
Raw Data
30Discretization
- Three types of attributes
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers
- Discretization
- divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes.
- Reduce data size by discretization
- Prepare for further analysis
31Discretization
- Discretization
- reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can
then be used to replace actual data values.
32Discretization
(Flowchart: sort the attribute, select a cut point, evaluate the measure; if the criterion is satisfied, done; otherwise split/merge and repeat, or stop.)
33Discretization
- Dynamic vs Static
- Local vs Global
- Top-Down vs Bottom-Up
- Direct vs Incremental
34Discretization Quality Evaluation
- Total number of Intervals
- The Number of Inconsistencies
- Predictive Accuracy
- Complexity
35Discretization - Binning
- Equal width: the range between the min and max values is split into intervals of equal width
- Equal frequency: each bin contains approximately the same number of data points
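A minimal sketch of both binning strategies (function names and the sample data are illustrative):

```python
def equal_width_bins(values, k):
    """Split the [min, max] range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1        # guard against a constant attribute
    # Each value is mapped to a bin index 0..k-1
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Each bin gets approximately the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

data = [4, 8, 15, 16, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))
print(equal_frequency_bins(data, 3))
```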
36Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met
- Experiments show that it may reduce data size and improve classification accuracy
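A sketch of the entropy criterion for choosing a single binary boundary (labelled samples are assumed; the recursion and the stopping test are omitted):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Pick the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate cut point
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_boundary(values, labels))   # boundary near 6.5, entropy 0.0
```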
37Data Mining Primitives, Languages, and System
Architectures
- Data mining primitives: what defines a data mining task?
- A data mining query language
- Design graphical user interfaces based on a data mining query language
- Architecture of data mining systems
38Why Data Mining Primitives and Languages?
- Data mining should be an interactive process
- User directs what to be mined
- Users must be provided with a set of primitives to be used to communicate with the data mining system
- Incorporating these primitives in a data mining query language
- More flexible user interaction
- Foundation for design of graphical user interface
- Standardization of data mining industry and
practice
39What Defines a Data Mining Task ?
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization of discovered patterns
40Task-Relevant Data (Minable View)
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria
41Types of knowledge to be mined
- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks
42A Data Mining Query Language (DMQL)
- Motivation
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- By providing a standardized language like SQL
- we hope to achieve an effect similar to that which SQL has had on relational databases
- Foundation for system development and evolution
- Facilitates information exchange, technology transfer, commercialization and wide acceptance
- Design
- DMQL is designed with the primitives described earlier
43Syntax for DMQL
- Syntax for specification of
- task-relevant data
- the kind of knowledge to be mined
- concept hierarchy specification
- interestingness measure
- pattern presentation and visualization
- Putting it all together a DMQL query
44Syntax for task-relevant data specification
- use database database_name, or use data warehouse
data_warehouse_name - from relation(s)/cube(s)Â where condition
- in relevance to att_or_dim_list
- order by order_list
- group by grouping_list
- having condition
45Specification of task-relevant data
46Syntax for specifying the kind of knowledge to be
mined
- Characterization
- Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
- Discrimination
- Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
- Association
- Mine_Knowledge_Specification ::= mine associations [as pattern_name]
47Syntax for specifying the kind of knowledge to be
mined (cont.)
- Classification
- Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
- Prediction
- Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
48Syntax for concept hierarchy specification
- To specify which concept hierarchies to use:
- use hierarchy <hierarchy> for <attribute_or_dimension>
- We use different syntax to define different types of hierarchies
- schema hierarchies
- define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
- define hierarchy age_hierarchy for age on customer as
- level1: {young, middle_aged, senior} < level0: all
- level2: {20, ..., 39} < level1: young
- level2: {40, ..., 59} < level1: middle_aged
- level2: {60, ..., 89} < level1: senior
49Syntax for concept hierarchy specification (Cont.)
- operation-derived hierarchies
- define hierarchy age_hierarchy for age on customer as
- {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
- rule-based hierarchies
- define hierarchy profit_margin_hierarchy on item as
- level_1: low_profit_margin < level_0: all
- if (price - cost) < 50
- level_1: medium_profit_margin < level_0: all
- if ((price - cost) > 50) and ((price - cost) < 250)
- level_1: high_profit_margin < level_0: all
- if (price - cost) > 250
50Syntax for interestingness measure specification
- Interestingness measures and thresholds can be
specified by the user with the statement:
- with <interest_measure_name> threshold = threshold_value
- Example:
- with support threshold = 0.05
- with confidence threshold = 0.7
51Syntax for pattern presentation and visualization
specification
- We have syntax which allows users to specify the display of discovered patterns in one or more forms
- display as <result_form>
- To facilitate interactive viewing at different concept levels, the following syntax is defined:
- Multilevel_Manipulation ::= roll up on attribute_or_dimension | drill down on attribute_or_dimension | add attribute_or_dimension | drop attribute_or_dimension
52Putting it all together the full specification
of a DMQL query
- use database AllElectronics_db
- use hierarchy location_hierarchy for B.address
- mine characteristics as customerPurchasing
- analyze count
- in relevance to C.age, I.type, I.place_made
- from customer C, item I, purchases P, items_sold S, works_at W, branch B
- where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
- and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
- and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID and B.address = "Canada" and I.price >= 100
- with noise threshold = 0.05
- display as table
53DMQL and SQL
- DMQL: describe general characteristics of graduate students in the Big-University database
- use Big_University_DB
- mine characteristics as Science_Students
- in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
- from student
- where status in "graduate"
- Corresponding SQL statement:
- Select name, gender, major, birth_place, birth_date, residence, phone, gpa
- from student
- where status in {"Msc", "MBA", "PhD"}
54Decision Trees
- Example
- Conducted survey to see what customers were
interested in new model car - Want to select customers for advertising campaign
training set
55One Possibility
(Decision tree: the root tests age < 30; internal nodes test city = sf and car = van; leaves are labeled likely or unlikely.)
56Another Possibility
(Decision tree: the root tests car = taurus; internal nodes test city = sf and age < 45; leaves are labeled likely or unlikely.)
57Issues
- A decision tree cannot be too deep
- there would not be statistically significant amounts of data for the lower decisions
- Need to select the tree that most reliably predicts outcomes
58Top-Down Induction of Decision Tree
Attributes: Outlook, Temperature, Humidity, Wind
PlayTennis: {yes, no}
59Entropy and Information Gain
- S contains si tuples of class Ci, for i = 1, ..., m
- Information: measures the info required to classify any arbitrary tuple (see the formulas below)
- Entropy of attribute A with values {a1, a2, ..., av}
- Information gained by branching on attribute A
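The formula images on this slide did not survive conversion; in the usual ID3-style notation (an assumption based on the surrounding bullets) the three quantities are:

```latex
% Standard definitions assumed here (s = s_1 + ... + s_m; attribute A
% partitions S into subsets S_1..S_v with class counts s_{ij}).
I(s_1,\dots,s_m) = -\sum_{i=1}^{m} \frac{s_i}{s}\,\log_2\frac{s_i}{s}
\qquad
E(A) = \sum_{j=1}^{v} \frac{s_{1j}+\dots+s_{mj}}{s}\; I(s_{1j},\dots,s_{mj})
\qquad
\mathrm{Gain}(A) = I(s_1,\dots,s_m) - E(A)
```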
60Example Analytical Characterization
- Task
- Mine general characteristics describing graduate
students using analytical characterization - Given
- attributes name, gender, major, birth_place,
birth_date, phone, and gpa
- Gen(ai): concept hierarchies on ai
- Ui: attribute analytical thresholds for ai
- Ti: attribute generalization thresholds for ai
- R: attribute relevance threshold
61Example Analytical Characterization (contd)
- 1. Data collection
- target class graduate student
- contrasting class undergraduate student
- 2. Analytical generalization using Ui
- attribute removal
- remove name and phone
- attribute generalization
- generalize major, birth_place, birth_date and
gpa
- accumulate counts
- candidate relation gender, major, birth_country,
age_range and gpa
62Example Analytical characterization (3)
- 3. Relevance analysis
- Calculate expected info required to classify an
arbitrary tuple - Calculate entropy of each attribute e.g. major
63Example Analytical Characterization (4)
- Calculate expected info required to classify a
given sample if S is partitioned according to the
attribute - Calculate information gain for each attribute
- Information gain for all attributes
64Example Analytical characterization (5)
- 4. Initial working relation (W0) derivation
- R = 0.1
- remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
- remove the contrasting-class candidate relation
- 5. Perform attribute-oriented induction on W0
using Ti
Initial target class working relation W0
Graduate students
65What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories.
- Applications
- Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples
- Rule form: Body => Head [support, confidence].
- buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
- major(x, "CS") ∧ takes(x, "DB") => grade(x, "A") [1%, 75%]
66Association Rule Mining
(Figure: a table of sales records with transaction id, customer id, and products bought, i.e., market-basket data.)
- Trend: products p5 and p8 are often bought together
- Trend: customer 12 likes product p9
67Association Rule
- Rule: {p1, p3, p8}
- Support: the number of baskets where these products appear
- High-support set: support ≥ threshold s
- Problem: find all high-support sets
68Association Rule Basic Concepts
- Given (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit) - Find all rules that correlate the presence of
one set of items with that of another set of
items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
- * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
- Home Electronics => * (What other products should the store stock up on?)
- Attached mailing in direct marketing
- Detecting "ping-ponging" of patients, faulty "collisions"
69Rule Measures Support and Confidence
(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both.)
- Find all the rules X ∧ Y => Z with minimum confidence and support
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
- With minimum support 50% and minimum confidence 50%, we have
- A => C (50%, 66.6%)
- C => A (50%, 100%)
70Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
- For rule A => C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle
- Any subset of a frequent itemset must be frequent
71Mining Frequent Itemsets the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.
72The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items};
- for (k = 1; Lk != ∅; k++) do begin
-     Ck+1 = candidates generated from Lk;
-     for each transaction t in the database do
-         increment the count of all candidates in Ck+1 that are contained in t
-     Lk+1 = candidates in Ck+1 with min_support
- end
- return ∪k Lk;
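A compact, runnable rendering of the pseudo-code above (a sketch: transactions are sets of items and min_support is an absolute count):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent, k_sets = {}, items
    while k_sets:
        # Count the candidates contained in each transaction
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-candidates;
        # prune step: keep only those whose k-subsets are all frequent.
        prev = list(level)
        k_sets = {a | b for a, b in combinations(prev, 2) if len(a | b) == len(a) + 1}
        k_sets = {c for c in k_sets
                  if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
    return frequent

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(transactions, min_support=2))
```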
73The Apriori Algorithm Example
(Worked example, shown as tables: scan database D to obtain candidate set C1 and frequent set L1; generate C2 and scan D to obtain L2; generate C3 and scan D to obtain L3.)
74How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
- forall itemsets c in Ck do
-     forall (k-1)-subsets s of c do
-         if (s is not in Lk-1) then delete c from Ck
75How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates
- Method
- Candidate itemsets are stored in a hash-tree
- Leaf node of hash-tree contains a list of
itemsets and counts - Interior node contains a hash table
- Subset function finds all the candidates
contained in a transaction
76Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4 = {abcd}
77Criticism to Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98)
- Among 5000 students
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball => not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence
78Criticism to Support and Confidence (Cont.)
- Example 2
- X and Y are positively correlated
- X and Z are negatively correlated
- yet the support and confidence of X => Z dominate
- We need a measure of dependent or correlated events
- P(B|A)/P(B) is also called the lift of rule A => B
79Other Interestingness Measures Interest
- Interest (correlation, lift)
- takes both P(A) and P(B) into consideration
- P(A ∧ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
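A quick lift computation for the basketball/cereal example from the previous slide:

```python
# Counts from the example: 5000 students, 3000 play basketball,
# 3750 eat cereal, 2000 do both.
n, n_b, n_c, n_bc = 5000, 3000, 3750, 2000

p_b, p_c, p_bc = n_b / n, n_c / n, n_bc / n
lift = p_bc / (p_b * p_c)            # P(B and C) / (P(B) * P(C))
print(lift)                          # ~0.89 < 1: negatively correlated
```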
80Classification vs. Prediction
- Classification
- predicts categorical class labels
- classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying
new data
- Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values
- Typical Applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
81Classification Process Model Construction
Classification Algorithms
IF rank = "professor" OR years > 6 THEN tenured = "yes"
82Classification Process Use the Model in
Prediction
(Jeff, Professor, 4)
Tenured?
83Supervised vs. Unsupervised Learning
- Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
- Unsupervised learning (clustering)
- The class labels of training data are unknown
- Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
84Training Dataset
This follows an example from Quinlan's ID3
85Output A Decision Tree for buys_computer
(Decision tree: the root tests age; the <30 branch tests student?, the 30..40 branch is a leaf labeled yes, and the >40 branch tests credit rating? (fair/excellent); leaves are labeled yes or no.)
86Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive
divide-and-conquer manner - At start, all the training examples are at the
root - Attributes are categorical (if continuous-valued,
they are discretized in advance) - Examples are partitioned recursively based on
selected attributes - Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
87Information Gain (ID3/C4.5)
- Select the attribute with the highest information
gain - Assume there are two classes, P and N
- Let the set of examples S contain p elements of
class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
88Information Gain in Decision Tree Induction
- Assume that using attribute A a set S will be
partitioned into sets S1, S2, ..., Sv
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is E(A) = Σ_i ((pi + ni)/(p + n)) I(pi, ni)
- The encoding information that would be gained by branching on A is Gain(A) = I(p, n) - E(A)
89Attribute Selection by Information Gain
Computation
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age
90Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σ pj^2, where pj is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
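A minimal sketch of gini(T) and gini_split(T) (the labels and the split shown are illustrative):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum(p_j^2) over the class frequencies p_j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_split(T) = N1/N * gini(T1) + N2/N * gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels_left = ["yes", "yes", "no"]
labels_right = ["no", "no", "no", "yes"]
print(gini(labels_left + labels_right), gini_split(labels_left, labels_right))
```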
91Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN
rules - One rule is created for each path from the root
to a leaf - Each attribute-value pair along a path forms a
conjunction - The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
- IF age = "<30" AND student = "no" THEN buys_computer = "no"
- IF age = "<30" AND student = "yes" THEN buys_computer = "yes"
- IF age = "31..40" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
92Avoid Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy for unseen samples
- Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- It is difficult to choose an appropriate threshold
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
93Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross
validation - Use all the data for training
- but apply a statistical test (e.g., chi-square)
to estimate whether expanding or pruning a node
may improve the entire distribution - Use minimum description length (MDL) principle
- halting growth of the tree when the encoding is
minimized
94Scalable Decision Tree Induction Methods in Data
Mining Studies
- SLIQ (EDBT'96, Mehta et al.)
- builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
- constructs an attribute list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
- integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
- separates the scalability aspects from the criteria that determine the quality of the tree
- builds an AVC-list (attribute, value, class label)
95Bayesian Theorem
- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
- MAP (maximum a posteriori) hypothesis: the h that maximizes P(h|D), equivalently the h that maximizes P(D|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost
96Naïve Bayes Classifier (I)
- A simplified assumption: attributes are conditionally independent
- Greatly reduces the computation cost; only count the class distribution.
97Naive Bayesian Classifier (II)
- Given a training set, we can compute the
probabilities
98Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities
- P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C.
- E.g., P(class = N | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
99Estimating a-posteriori probabilities
- Bayes' theorem:
- P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- C such that P(C|X) is maximum = C such that P(X|C) P(C) is maximum
- Problem: computing P(X|C) is unfeasible!
100Naïve Bayesian Classification
- Naïve assumption: attribute independence
- P(x1, ..., xk | C) = P(x1|C) ... P(xk|C)
- If the i-th attribute is categorical, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous, P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases
101Play-tennis example estimating P(xiC)
outlook:     P(sunny|p) = 2/9      P(sunny|n) = 3/5
             P(overcast|p) = 4/9   P(overcast|n) = 0
             P(rain|p) = 3/9       P(rain|n) = 2/5
temperature: P(hot|p) = 2/9        P(hot|n) = 2/5
             P(mild|p) = 4/9       P(mild|n) = 2/5
             P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:    P(high|p) = 3/9       P(high|n) = 4/5
             P(normal|p) = 6/9     P(normal|n) = 2/5
windy:       P(true|p) = 3/9       P(true|n) = 3/5
             P(false|p) = 6/9      P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
102Play-tennis example classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class n (don't play)
103Association-Based Classification
- Several methods for association-based classification
- ARCS: quantitative association mining and clustering of association rules (Lent et al., '97)
- It beats C4.5 in (mainly) scalability and also accuracy
- Associative classification (Liu et al., '98)
- It mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label
- CAEP (Classification by Aggregating Emerging Patterns) (Dong et al., '99)
- Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
- Mine EPs based on minimum support and growth rate
104What Is Prediction?
- Prediction is similar to classification
- First, construct a model
- Second, use model to predict unknown value
- Major method for prediction is regression
- Linear and multiple regression
- Non-linear regression
- Prediction is different from classification
- Classification refers to predicting a categorical class label
- Prediction models continuous-valued functions
105Regression Analysis and Log-Linear Models in
Prediction
- Linear regression: Y = α + β X
- The two parameters, α and β, specify the line and are to be estimated using the data at hand
- using the least squares criterion on the known values Y1, Y2, ..., and X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above.
- Log-linear models
- The multi-way table of joint probabilities is approximated by a product of lower-order tables.
- Probability: p(a, b, c, d) = αab βac χad δbcd
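A minimal least-squares fit of Y = α + βX, written without libraries (the sample data are illustrative):

```python
def fit_line(xs, ys):
    """Estimate alpha and beta for Y = alpha + beta * X by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    beta = num / den
    alpha = my - beta * mx
    return alpha, beta

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
print(fit_line(xs, ys))    # roughly alpha = 0.11, beta = 1.97
```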
106What is Cluster Analysis?
- Cluster a collection of data objects
- Similar to one another within the same cluster
- Dissimilar objects are in different clusters
- Cluster analysis
- Grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
107General Applications of Clustering
- Pattern Recognition
- Spatial Data Analysis
- create thematic maps in GIS by clustering feature
spaces - detect spatial clusters and explain them in
spatial data mining - Image Processing
- Economic Science (especially market research)
- WWW
- Document classification
- Cluster Weblog data to discover groups of similar
access patterns
108Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
109What Is Good Clustering?
- A good clustering method will produce high
quality clusters with - high intra-class similarity
- low inter-class similarity
- The quality of a clustering result depends on
both the similarity measure used by the method
and its implementation. - The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.
110Types of Data in Cluster Analysis
- Data matrix
- Dissimilarity matrix
111Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"
- the answer is typically highly subjective.
112Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the
similarity or dissimilarity between two data
objects - Some popular ones include Minkowski distance
- where i (xi1, xi2, , xip) and j (xj1, xj2,
, xjp) are two p-dimensional data objects, and q
is a positive integer - If q 1, d is Manhattan distance
113Similarity and Dissimilarity Between Objects
- If q = 2, d is the Euclidean distance
- Properties
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)
- One can also use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
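A sketch of the Minkowski distance family (q = 1 gives Manhattan, q = 2 Euclidean; the sample points are illustrative):

```python
def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

i = (1.0, 3.0, 5.0)
j = (4.0, 1.0, 2.0)
print(minkowski(i, j, q=1))   # Manhattan: 8.0
print(minkowski(i, j, q=2))   # Euclidean: ~4.69
```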
114Binary Variables
- A contingency table for binary data
- Simple matching coefficient (invariant, if the
binary variable is symmetric) - Jaccard coefficient (noninvariant if the binary
variable is asymmetric)
Object j
Object i
115Dissimilarity between Binary Variables
- Example
- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value
N be set to 0
116Major Clustering Methods
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to those models
117Partitioning Algorithms Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
- Globally optimal: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
- k-means (MacQueen '67): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster
118The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps (sketched in code below):
- Partition objects into k nonempty subsets
- Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
- Assign each object to the cluster with the nearest seed point
- Go back to Step 2; stop when no new assignments are made
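A minimal sketch of those steps for 2-D points (the naive initialization and sample data are illustrative):

```python
import math

def kmeans(points, k, iters=100):
    """Basic k-means for 2-D points; returns (centroids, assignments)."""
    centroids = list(points[:k])                  # naive initial seed points
    assign = [-1] * len(points)
    for _ in range(iters):
        # Assign each object to the cluster with the nearest seed point
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:                  # stop when no new assignment
            break
        assign = new_assign
        # Recompute seed points as the centroids of the current partition
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assign

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(points, k=2))
```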
119The K-Means Clustering Method
120Comments on the K-Means Method
- Strength
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Weakness
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes