
1
A dynamic-programming algorithm for hierarchical
discretization of continuous attributes
  • Amit Goyal (15th April 2008)
  • Department of Computer Science
  • The University of British Columbia

2
Reference
  • Ching-Cheng Shen and Yen-Liang Chen. A
    dynamic-programming algorithm for hierarchical
    discretization of continuous attributes. European
    Journal of Operational Research 184 (2008)
    636-651. Elsevier.

3
Overview
  • Motivation
  • Background
  • Why do we need discretization?
  • Related Work
  • DP Solution
  • Analysis
  • Conclusion

4
Motivation
  • Situation: the attrition rate for mobile phone
    customers is around 25-30% per year
  • Task
  • Given customer information for the past N months,
    predict who is likely to attrite next month
  • Also estimate customer value: what cost-effective
    offer should be made to this customer

Customer attributes: Age, Gender, Location, Phone
bills, Income, Occupation
5
Pattern Discovery
  • Transaction data:
    t1: Beef, Chicken, Milk
    t2: Beef, Cheese
    t3: Cheese, Boots
    t4: Beef, Chicken, Cheese
    t5: Beef, Chicken, Clothes, Cheese, Milk
    t6: Chicken, Clothes, Milk
    t7: Chicken, Milk, Clothes
  • Assume
  • min_support = 30%
  • min_confidence = 80%
  • An example frequent itemset
  • {Chicken, Clothes, Milk}, sup = 3/7
  • Association rules from the itemset (verified in
    the sketch below)
  • Clothes → Milk, Chicken (sup = 3/7, conf = 3/3)
  • Clothes, Chicken → Milk (sup = 3/7, conf = 3/3)
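A minimal sketch in plain Python (the function name is my own) that
reproduces the support and confidence figures above from the seven
transactions:

```python
transactions = [
    {'Beef', 'Chicken', 'Milk'},                        # t1
    {'Beef', 'Cheese'},                                 # t2
    {'Cheese', 'Boots'},                                # t3
    {'Beef', 'Chicken', 'Cheese'},                      # t4
    {'Beef', 'Chicken', 'Clothes', 'Cheese', 'Milk'},   # t5
    {'Chicken', 'Clothes', 'Milk'},                     # t6
    {'Chicken', 'Milk', 'Clothes'},                     # t7
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

itemset = {'Chicken', 'Clothes', 'Milk'}
print(support(itemset))                         # 3/7, about 0.43

# Confidence of "Clothes → Milk, Chicken" = sup(itemset) / sup({Clothes}).
print(support(itemset) / support({'Clothes'}))  # 3/3 = 1.0
```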

6
Issues with Numeric Attributes
  • The size of the discretized intervals affects
    support and confidence
  • (Occupation = SE), (Income = 70,000) →
    (Attrition = Yes)
  • (Occupation = SE), (60K ≤ Income ≤ 80K) →
    (Attrition = Yes)
  • (Occupation = SE), (0K ≤ Income ≤ 1B) →
    (Attrition = Yes)
  • If intervals are too small
  • rules may not have enough support
  • If intervals are too large
  • rules may not have enough confidence
  • Loss of information (how can it be minimized?)
  • Potential solution: use all possible intervals
  • Too many rules!!!

7
Background
  • Discretization:
  • reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Concept hierarchies:
  • reduce the data by collecting and replacing
    low-level concepts (such as numeric values for the
    attribute age) with higher-level concepts such as
    young, middle-aged, or senior (see the sketch
    below)
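A one-step illustration of such a hierarchy for age; the cut points
35 and 60 are assumptions, not from the slides:

```python
def age_concept(age):
    # Map a numeric age to a higher-level concept.
    # The boundaries 35 and 60 are illustrative assumptions.
    if age < 35:
        return 'young'
    if age < 60:
        return 'middle-aged'
    return 'senior'

print([age_concept(a) for a in [23, 41, 67]])  # ['young', 'middle-aged', 'senior']
```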

8
Why do we need discretization?
  • Data Warehousing and Mining
  • Data reduction
  • Association Rule Mining
  • Sequential Patterns Mining
  • Some machine learning algorithms, such as
    Bayesian approaches and decision trees
  • Granular Computing

9
Related Work
  • Manual
  • Equal-Width Partition
  • Equal-Depth Partition
  • Chi-Square Partition
  • Entropy Based Partition
  • Clustering

10
Simple Discretization Methods: Binning
  • Equal-width (distance) partitioning
  • divides the range into N intervals of equal
    size (a uniform grid)
  • if A and B are the lowest and highest values of
    the attribute, the width of the intervals is
    W = (B - A) / N
  • the most straightforward method
  • Equal-depth (frequency) partitioning
  • divides the range into N intervals, each
    containing approximately the same number of
    samples (see the sketch below)
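A small sketch of both schemes with NumPy; the sample data and the
choice N = 3 are made up:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of intervals

# Equal-width: N bins of width W = (B - A) / N.
A, B = data.min(), data.max()
width_edges = np.linspace(A, B, N + 1)
print(width_edges)   # [ 4. 14. 24. 34.]

# Equal-depth: cut points at the i/N quantiles, so each bin holds
# approximately the same number of samples.
depth_edges = np.quantile(data, np.linspace(0, 1, N + 1))
print(depth_edges)
```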

11
Chi-Square Based Partitioning
  • χ² (chi-square) test
  • The larger the χ² value, the more likely the
    variables are related
  • Merge: find the best (most similar) neighboring
    intervals and merge them recursively to form
    larger intervals (see the sketch below)
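A minimal sketch of the χ² statistic on the class counts of two
adjacent intervals, the quantity a ChiMerge-style method uses to pick
the most similar pair to merge; the counts below are made up:

```python
def chi2_stat(counts):
    # counts[interval][class]: observed class counts in two adjacent intervals.
    classes = range(len(counts[0]))
    col = [sum(row[j] for row in counts) for j in classes]
    total = sum(col)
    stat = 0.0
    for row in counts:
        row_total = sum(row)
        for j in classes:
            expected = row_total * col[j] / total
            if expected:
                stat += (row[j] - expected) ** 2 / expected
    return stat

# A low statistic means the two intervals have similar class
# distributions, so they are good candidates for merging.
print(chi2_stat([[10, 2], [9, 3]]))  # about 0.25
```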

12
Entropy Based Partition
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is
    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected as the
    binary discretization (see the sketch below)
  • The process is applied recursively to the
    partitions obtained until some stopping criterion
    is met
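A sketch of the binary entropy split described above; the values and
class labels below are invented for illustration:

```python
import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # Minimize E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2) over all
    # boundaries T between distinct sorted values.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float('inf'), None)
    for u in range(1, n):
        if pairs[u - 1][0] == pairs[u][0]:
            continue                      # no boundary between equal values
        left = [l for _, l in pairs[:u]]
        right = [l for _, l in pairs[u:]]
        e = len(left) / n * ent(left) + len(right) / n * ent(right)
        boundary = (pairs[u - 1][0] + pairs[u][0]) / 2
        best = min(best, (e, boundary))
    return best

values = [21, 24, 30, 35, 40, 47, 55, 62]
labels = ['N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N']
print(best_split(values, labels))         # (entropy, chosen boundary T)
```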

13
Clustering
  • Partition data set into clusters based on
    similarity, and store cluster representation
    (e.g., centroid and diameter) only
  • Can be very effective if data is clustered but
    not if data is smeared
  • Can have hierarchical clustering and be stored in
    multi-dimensional index tree structures
  • There are many choices of clustering definitions
    and clustering algorithms (see the sketch below)
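A one-dimensional sketch using scikit-learn's k-means, keeping only
each cluster's representation (centroid and diameter) as the slide
suggests; the data is invented:

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([4.0, 5, 6, 20, 22, 23, 40, 41, 44]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

# Store only the cluster representation: centroid and diameter.
for c in range(3):
    members = values[km.labels_ == c].ravel()
    print(c, members.mean(), members.max() - members.min())
```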

14
Weaknesses
  • They seek a locally optimal solution instead of a
    globally optimal one
  • They are subject to the constraint that each
    interval can only be partitioned into a fixed
    number of sub-intervals
  • The constructed tree may be unbalanced

15
Notations
  • val(i): value of the ith data item
  • num(i): number of occurrences of value val(i)
  • R: depth of the output tree
  • ub: upper bound on the number of subintervals
    spawned from an interval
  • lb: lower bound

16
Example
R = 2, lb = 2, ub = 3 (the example tree appears as a
figure in the original slide)
17
Problem Definition
Given parameters R, ub, and lb, and input data
val(1), val(2), ..., val(n) and num(1), num(2), ...,
num(n), our goal is to build a minimum-volume tree
subject to the constraints that all leaf nodes must
be at level R and that the branch degree must be
between lb and ub
18
Distances and Volume
  • Intra-distance of a node containing data from
    data i to data j
  • Inter-distance between two adjacent siblings: the
    first node containing data from i to u, the second
    node containing data from u+1 to j
  • The volume of a tree is the total intra-distance
    minus the total inter-distance in the tree

19
Theorem
  • The volume of a tree = the intra-distance of the
    root node + the volumes of all its sub-trees -
    the inter-distances among its children

20
Notations
  • T(i,j,r): the minimum-volume tree that contains
    data from data i to data j and has depth r
  • T(i,j,r,k): the minimum-volume tree that contains
    data from data i to data j, has depth r, and
    whose root has k branches
  • D(i,j,r): the volume of T(i,j,r)
  • D(i,j,r,k): the volume of T(i,j,r,k)

21
Notations Cont.
22
Notations Cont.
23
Algorithm
24
Algorithm Cont.
25
Algorithm Cont.
26
The complete DP algorithm
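Slides 21-26 presented the notations and the algorithm as images, so
the formulas are missing from this transcript. The sketch below
implements the recurrence the preceding slides describe: D(i,j,r) is
the volume of the minimum-volume depth-r tree over data i..j, and an
auxiliary G(i,j,r,k) peels off the root's k children one at a time.
The intra() and inter() definitions are placeholders of my own, and
the assumption that the inter-distance depends only on the split
point is mine; the paper's exact formulas should be substituted.

```python
from functools import lru_cache

# Toy input: distinct sorted values and their occurrence counts.
vals = [1, 3, 4, 7, 8, 12, 15, 20, 22]
nums = [2, 1, 3, 2, 2, 1, 4, 1, 2]
R, lb, ub = 2, 2, 3   # depth and branching bounds, as in the example slide

def intra(i, j):
    # Placeholder intra-distance of a node covering data i..j:
    # count-weighted spread around the weighted mean.
    total = sum(nums[i:j + 1])
    mean = sum(v * n for v, n in zip(vals[i:j + 1], nums[i:j + 1])) / total
    return sum(n * abs(v - mean) for v, n in zip(vals[i:j + 1], nums[i:j + 1]))

def inter(u):
    # Placeholder inter-distance between adjacent siblings split after u.
    # Assumption: it depends only on the boundary gap; the paper's
    # definition involves both sibling ranges.
    return vals[u + 1] - vals[u]

@lru_cache(maxsize=None)
def D(i, j, r):
    # Minimum volume of a depth-r tree over data i..j.
    if r == 1:                                   # leaf node
        return intra(i, j)
    # Volume = intra(root) + best split into k in [lb, ub] children.
    return intra(i, j) + min(G(i, j, r, k) for k in range(lb, ub + 1))

@lru_cache(maxsize=None)
def G(i, j, r, k):
    # Cheapest split of i..j into k depth-(r-1) subtrees, net of the
    # inter-distances at the k-1 split points.
    if k == 1:
        return D(i, j, r - 1)
    if j - i + 1 < k:                            # too few data for k children
        return float('inf')
    return min(G(i, u, r, k - 1) + D(u + 1, j, r - 1) - inter(u)
               for u in range(i + k - 2, j))

print(D(0, len(vals) - 1, R))  # volume of the minimum-volume tree
```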
27
Experimental results (presented as charts in the
original slides):
  • Run times of different algorithms
  • Volume of trees constructed
  • Gain ratios of different algorithms (Monthly
    Household Income)
  • Gain ratios of different algorithms (Money Spent
    Monthly)
28
Conclusion
  • Finds a global optimum instead of a local optimum
  • Each interval is partitioned into the most
    appropriate number of subintervals
  • The constructed trees are balanced
  • Time complexity is cubic, so it is somewhat
    slower than the heuristic methods

29
http://www.cs.ubc.ca/goyal (goyal_at_cs.ubc.ca)
  • Thank you !!!

30
Gain Ratio
The information gain due to a particular split of S
into Si, i = 1, 2, ..., r:
Gain(S; S1, S2, ..., Sr) = purity(S) - purity(S1, S2, ..., Sr)
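A tiny numeric illustration, taking purity to be weighted entropy;
that reading of "purity" is an assumption, since the slide leaves it
undefined:

```python
import math
from collections import Counter

def purity(*parts):
    # Weighted entropy of a partition (assumed meaning of "purity").
    n = sum(len(p) for p in parts)
    def ent(p):
        return -sum(c / len(p) * math.log2(c / len(p))
                    for c in Counter(p).values())
    return sum(len(p) / n * ent(p) for p in parts)

S = ['Y'] * 4 + ['N'] * 4                 # a hypothetical class column
S1, S2 = ['Y', 'Y', 'Y', 'N'], ['Y', 'N', 'N', 'N']
print(purity(S) - purity(S1, S2))         # Gain(S; S1, S2) ≈ 0.189
```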
31
Chi-Square Test Example
            Heads   Tails   Total
Observed      53      47     100
Expected      50      50     100
(O-E)²         9       9

χ² = Σ (O-E)²/E = 0.18 + 0.18 = 0.36

To see whether this result is statistically
significant, the P-value (the probability of a
deviation at least this large arising by chance if
the coin is fair) must be calculated or looked up in
a chart. The P-value is Prob(χ² ≥ 0.36) = 0.5485 for
1 degree of freedom. There is thus a probability of
about 55% of seeing data that deviates at least this
much from the expected results if the coin is indeed
fair. Hence, the coin is judged fair.
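The numbers above can be checked with SciPy:

```python
from scipy.stats import chi2, chisquare

stat, p = chisquare([53, 47], f_exp=[50, 50])
print(stat, p)              # 0.36 0.5485...

# Equivalently: the upper-tail probability of chi-square with 1 dof.
print(chi2.sf(0.36, df=1))  # 0.5485...
```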