A dynamic-programming algorithm for hierarchical discretization of continuous attributes
- Amit Goyal (15th April 2008)
- Department of Computer Science
- The University of British Columbia
Reference
- Ching-Cheng Shen and Yen-Liang Chen. A dynamic-programming algorithm for hierarchical discretization of continuous attributes. European Journal of Operational Research 184 (2008) 636-651 (Elsevier).
Overview
- Motivation
- Background
- Why Do We Need Discretization?
- Related Work
- DP Solution
- Analysis
- Conclusion
Motivation
- Situation: the attrition rate for mobile phone customers is around 25-30% per year
- Task
  - Given customer information for the past N months, predict who is likely to attrite next month
  - Also estimate customer value: what is the most cost-effective offer to make to this customer?
- Customer attributes: Age, Gender, Location, Phone bills, Income, Occupation
Pattern Discovery
- Transaction data
  - t1: Beef, Chicken, Milk
  - t2: Beef, Cheese
  - t3: Cheese, Boots
  - t4: Beef, Chicken, Cheese
  - t5: Beef, Chicken, Clothes, Cheese, Milk
  - t6: Chicken, Clothes, Milk
  - t7: Chicken, Milk, Clothes
- Assume
  - min_support = 30%
  - min_confidence = 80%
- An example frequent itemset
  - {Chicken, Clothes, Milk}, sup = 3/7
- Association rules from the itemset
  - Clothes → Milk, Chicken (sup = 3/7, conf = 3/3)
  - Clothes, Chicken → Milk (sup = 3/7, conf = 3/3)
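These numbers can be checked mechanically. A minimal sketch in Python; the transactions come from the table above, while the support/confidence helpers are my own:

    transactions = [
        {"Beef", "Chicken", "Milk"},                        # t1
        {"Beef", "Cheese"},                                 # t2
        {"Cheese", "Boots"},                                # t3
        {"Beef", "Chicken", "Cheese"},                      # t4
        {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},   # t5
        {"Chicken", "Clothes", "Milk"},                     # t6
        {"Chicken", "Milk", "Clothes"},                     # t7
    ]

    def support(itemset):
        # fraction of transactions containing every item of the itemset
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        # support of the whole rule divided by support of its left-hand side
        return support(lhs | rhs) / support(lhs)

    print(support({"Chicken", "Clothes", "Milk"}))        # 3/7 ~ 0.43
    print(confidence({"Clothes"}, {"Milk", "Chicken"}))   # 3/3 = 1.0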
Issues with Numeric Attributes
- The size of the discretized intervals affects support and confidence
  - Occupation = SE, (Income = 70,000) → Attrition = Yes
  - Occupation = SE, (60K ≤ Income ≤ 80K) → Attrition = Yes
  - Occupation = SE, (0K ≤ Income ≤ 1B) → Attrition = Yes
- If intervals are too small, rules may not have enough support
- If intervals are too large, rules may not have enough confidence
- Loss of information (how to minimize it?)
- Potential solution: use all possible intervals
  - Too many rules!
Background
- Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
- Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior).
Why Do We Need Discretization?
- Data Warehousing and Mining
- Data reduction
- Association Rule Mining
- Sequential Pattern Mining
- Some machine learning algorithms, such as Bayesian approaches and Decision Trees
- Granular Computing
Related Work
- Manual
- Equal-Width Partition
- Equal-Depth Partition
- Chi-Square Partition
- Entropy Based Partition
- Clustering
Simple Discretization Methods: Binning
- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size (uniform grid)
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward method
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
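As a concrete illustration, a minimal sketch of both schemes; the helper names and sample data are assumptions, not from the slides:

    def equal_width(values, n):
        # cut points of N equal-size intervals: W = (B - A) / N
        a, b = min(values), max(values)
        w = (b - a) / n
        return [a + i * w for i in range(1, n)]

    def equal_depth(values, n):
        # cut points so each interval holds ~the same number of samples
        s = sorted(values)
        step = len(s) / n
        return [s[int(i * step)] for i in range(1, n)]

    data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    print(equal_width(data, 3))   # [14.0, 24.0]
    print(equal_depth(data, 3))   # [21, 26]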
Chi-Square Based Partitioning
- The larger the χ² value, the more likely the variables are related
- Merge: find the best neighboring intervals and merge them recursively to form larger intervals
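A compact sketch of that merge loop in the style of ChiMerge; the per-interval class counts, the stopping rule, and the helper names are placeholder assumptions:

    def chi2_pair(a, b):
        # chi-square statistic for two adjacent intervals given their
        # per-class counts, e.g. a = {"yes": 2}, b = {"yes": 1, "no": 1}
        classes = set(a) | set(b)
        n = sum(a.values()) + sum(b.values())
        x2 = 0.0
        for row in (a, b):
            r = sum(row.values())
            for c in classes:
                e = r * (a.get(c, 0) + b.get(c, 0)) / n   # expected count
                if e:
                    x2 += (row.get(c, 0) - e) ** 2 / e
        return x2

    # intervals as per-class count dicts, ordered by attribute value
    intervals = [{"yes": 2}, {"yes": 1, "no": 1}, {"no": 3}]
    while len(intervals) > 2:                 # placeholder stopping rule
        i = min(range(len(intervals) - 1),
                key=lambda t: chi2_pair(intervals[t], intervals[t + 1]))
        merged = {c: intervals[i].get(c, 0) + intervals[i + 1].get(c, 0)
                  for c in set(intervals[i]) | set(intervals[i + 1])}
        intervals[i:i + 2] = [merged]
    print(intervals)   # the pair with the smallest chi-square got merged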
Entropy Based Partition
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is applied recursively to the partitions obtained until some stopping criterion is met.
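A minimal sketch of one entropy-minimizing binary split; the labelled sample data and helper names are mine:

    from math import log2
    from collections import Counter

    def ent(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def best_split(pairs):
        # pairs of (value, class label); returns (E(S,T), T) minimizing
        # E(S,T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)
        pairs = sorted(pairs)
        n = len(pairs)
        best = None
        for i in range(1, n):
            s1 = [c for _, c in pairs[:i]]
            s2 = [c for _, c in pairs[i:]]
            e = len(s1) / n * ent(s1) + len(s2) / n * ent(s2)
            t = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint boundary
            if best is None or e < best[0]:
                best = (e, t)
        return best

    print(best_split([(1, "a"), (2, "a"), (3, "b"), (8, "b"), (9, "b")]))
    # -> (-0.0, 2.5): boundary T = 2.5 yields zero entropy (perfect split)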
Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is smeared
- Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
Weaknesses
- Seek a locally optimal solution instead of a globally optimal one
- Are subject to the constraint that each interval can only be partitioned into a fixed number of sub-intervals
- The constructed tree may be unbalanced
Notations
- val(i): value of the ith data item
- num(i): number of occurrences of value val(i)
- R: depth of the output tree
- ub: upper bound on the number of subintervals spawned from an interval
- lb: lower bound
Example
- R = 2, lb = 2, ub = 3
Problem Definition
Given parameters R, ub, and lb, and input data val(1), val(2), ..., val(n) and num(1), num(2), ..., num(n), our goal is to build a minimum volume tree, subject to the constraints that all leaf nodes must be at level R and that the branch degree must be between lb and ub.
Distances and Volume
- Intra-distance of a node containing data from data i to data j
- Inter-distance between two adjacent siblings: the first node containing data from i to u, the second node containing data from u+1 to j
- The volume of a tree is the total intra-distance minus the total inter-distance in the tree
Theorem
- The volume of a tree = the intra-distance of the root node + the volumes of all its sub-trees − the inter-distances among its children
Notations
- T(i,j,r): the minimum volume tree that contains data from data i to data j and has depth r
- T(i,j,r,k): the minimum volume tree that contains data from data i to data j, has depth r, and whose root has k branches
- D(i,j,r): the volume of T(i,j,r)
- D(i,j,r,k): the volume of T(i,j,r,k)
Notations Cont.
Algorithm
The complete DP algorithm
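A minimal sketch of a dynamic program built directly from the theorem and notations above: the volume of a candidate node is its intra-distance, plus the volumes of its children, minus the inter-distances between adjacent children. The intra()/inter() formulas are passed in as functions rather than fixed, and the children's cut points are enumerated directly, so this illustrates the structure of the DP rather than reproducing the paper's exact recurrence:

    from functools import lru_cache
    from itertools import combinations

    def min_volume(n, R, lb, ub, intra, inter):
        # intra(i, j): intra-distance of a node holding data i..j
        # inter(i, u, j): inter-distance of adjacent siblings i..u and u+1..j
        @lru_cache(maxsize=None)
        def D(i, j, r):
            # volume of the minimum volume tree on data i..j with depth r
            if r == 0:
                return intra(i, j)          # a leaf node has no children
            best = float("inf")
            for k in range(lb, min(ub, j - i + 1) + 1):
                # choose k-1 cut points, giving k contiguous children
                for cuts in combinations(range(i, j), k - 1):
                    bounds = [i - 1] + list(cuts) + [j]
                    kids = list(zip(bounds[:-1], bounds[1:]))
                    vol = intra(i, j)
                    vol += sum(D(lo + 1, hi, r - 1) for lo, hi in kids)
                    vol -= sum(inter(kids[t][0] + 1, kids[t][1],
                                     kids[t + 1][1])
                               for t in range(k - 1))
                    best = min(best, vol)
            return best

        return D(1, n, R)

For example, with hypothetical distance definitions (not the paper's): intra as the value spread inside a node and inter as the gap between adjacent siblings:

    vals = [None, 1.0, 1.2, 1.4, 5.0, 5.1, 9.0, 9.2, 9.5]   # 1-indexed
    print(min_volume(8, R=2, lb=2, ub=3,
                     intra=lambda i, j: vals[j] - vals[i],
                     inter=lambda i, u, j: vals[u + 1] - vals[u]))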
Experimental Results
- Run times of different algorithms
- Volume of trees constructed
- Gain ratios of different algorithms (Monthly Household Income)
- Gain ratios of different algorithms (Money Spent Monthly)
Conclusion
- Finds a global optimum instead of a local optimum
- Each interval is partitioned into the most appropriate number of subintervals
- The constructed trees are balanced
- Time complexity is cubic, thus slightly slower
http://www.cs.ubc.ca/goyal (goyal_at_cs.ubc.ca)
Gain Ratio
The information gain due to a particular split of S into Si, i = 1, 2, ..., r:
Gain(S; S1, S2, ..., Sr) = purity(S) − purity(S1, S2, ..., Sr)
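A tiny worked instance, taking entropy as the purity measure (my choice; the slide leaves purity abstract):

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    S = ["y", "y", "y", "n", "n", "n"]
    S1, S2 = ["y", "y", "y"], ["n", "n", "n"]
    weighted = sum(len(p) / len(S) * entropy(p) for p in (S1, S2))
    print(entropy(S) - weighted)   # gain = 1.0 bit for a perfect split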
Chi-Square Test Example

            Heads   Tails   Total
Observed     53      47      100
Expected     50      50      100
(O-E)²        9       9

χ² = Σ (O-E)²/E = 0.18 + 0.18 = 0.36

To see whether this result is statistically significant, the P-value (the probability of a deviation at least this large arising by chance alone) must be calculated or looked up in a chart. The P-value is found to be Prob(χ²₁ ≥ 0.36) = 0.5485. There is thus a probability of about 55% of seeing data that deviates at least this much from the expected results if the coin is indeed fair. Hence, the result is consistent with a fair coin.
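The same numbers can be reproduced with scipy's chi-square survival function (the slide itself used a chart lookup):

    from scipy.stats import chi2

    observed = [53, 47]
    expected = [50, 50]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    print(x2)                  # 0.36
    print(chi2.sf(x2, df=1))   # P(chi2_1 >= 0.36) ~ 0.5485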