Title: A Probabilistic Approach to Classify Incomplete Objects Using Decision Trees
1. A Probabilistic Approach to Classify Incomplete Objects Using Decision Trees
- DB Seminar
- 1st Feb, 2007
- Speaker: Tsang Pui Kwan (Smith)
- Supervisor: Dr. B.C.M. Kao
- Paper: DEXA 2004 (evaluation: DEXA 2006), by L. Hawarah, A. Simonet and M. Simonet
2. Introduction
- Background Knowledge
- Decision Tree Classifier
- Classifying Incomplete Objects
- Missing Values Handling
- Previous Works
- Ordered Attribute Trees (OAT)
- Proposed Approaches by Authors
- Probabilistic Ordered Attribute Trees (POAT)
- Probabilistic Attribute Trees (PAT)
- Potential Problems and Solutions
- Evaluations (DEXA 2006)
3. Classification
- An important problem in data mining and machine learning
- Predicts or classifies future objects/cases using previously known results
- Supervised learning: the user specifies the targets
[Figure: previously known results are used to predict new cases or objects, e.g. "I want to know new customers' credit risks!"]
4. Classification
- A two-step process:
- Model construction: a classification algorithm builds a classifier (model) from training data with class labels
- Model usage: the classifier answers queries about unseen cases/objects
[Figure: training data → classification algorithm → classifier (model), which then classifies unseen cases ("Yes!").]
5. Classification
- Applications
- Scientific experiments
- Medical diagnosis
- Fraud detection
- Credit approval
- Target marketing
- etc.
6. Classification Models
- Various models have been proposed:
- Decision Trees
- Classification Rules
- Bayesian Classifiers
- Neural Networks
- Support Vector Machines
7. Decision Tree Classifier
- One of the most popular classification models
- Simple, powerful, human-readable
[Figure: an example decision tree. Internal nodes (outlook?, windy?, humidity?) are tests; branches (sunny, overcast, rainy; TRUE, FALSE; high, normal) are the answers to the tests; leaf nodes (yes, no) are the decisions.]
8. Decision Tree Classifier
- Decision Tree Induction
- Traditional algorithms include Quinlan's ID3 and C4.5
- Top-down, recursive, divide-and-conquer
- Greedy search for the locally best partitioning
- Attribute selection: determines how the cases in a given node are to be split
[Figure: the top-down process. Select an attribute, partition the original training set by its possible values, then continue recursively on each reduced set using the other attributes.]
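As an illustration of this top-down induction, here is a minimal sketch in Python. It is not the authors' implementation: the dataset representation (a list of dicts plus a class key) and the entropy-based scorer are assumptions made for the example.

import math
from collections import Counter

def entropy(cases, class_key):
    """Entropy (impurity) of the class label distribution over a set of cases."""
    counts = Counter(c[class_key] for c in cases)
    total = len(cases)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(cases, attr, class_key):
    """Reduction in entropy obtained by partitioning the cases on `attr`."""
    total = len(cases)
    remainder = 0.0
    for value in {c[attr] for c in cases}:
        subset = [c for c in cases if c[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, class_key)
    return entropy(cases, class_key) - remainder

def build_tree(cases, attrs, class_key):
    """Top-down, recursive, divide-and-conquer induction (ID3-style)."""
    classes = {c[class_key] for c in cases}
    if len(classes) == 1:                  # pure node: stop with a leaf
        return classes.pop()
    if not attrs:                          # attributes used up: majority class
        return Counter(c[class_key] for c in cases).most_common(1)[0][0]
    # greedy choice: the attribute giving the locally best partitioning
    best = max(attrs, key=lambda a: information_gain(cases, a, class_key))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in {c[best] for c in cases}:
        subset = [c for c in cases if c[best] == value]
        branches[value] = (len(subset), build_tree(subset, rest, class_key))
    return (best, branches)               # internal node: (attr, {value: (count, subtree)})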
9. Decision Tree Classifier
- Evaluation functions
- Examples: Information Gain (ID3), Gain Ratio (C4.5)
- Based on entropy, a measure of impurity/randomness
- An attribute is selected if partitioning on it reduces the entropy of the original set the most
- Target of partitioning: a leaf node with cases ALL of the same class (pure)
- Not possible in most cases
- Other stopping criteria:
- All attributes used up
- No cases or too few cases left
[Figure: training cases that are e.g. all of the same class become a leaf node (yes).]
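C4.5's Gain Ratio can be sketched in the same way; it normalizes the information gain by the split information so that many-valued attributes are not unduly favoured. Self-contained and hedged as before (not the paper's code):

import math
from collections import Counter

def gain_ratio(cases, attr, class_key):
    """C4.5's evaluation function: information gain / split information."""
    def H(labels):                         # entropy of a list of labels
        counts = Counter(labels)
        n = len(labels)
        return -sum((k / n) * math.log2(k / n) for k in counts.values())
    total = len(cases)
    gain = H([c[class_key] for c in cases])
    split_info = 0.0
    for value, n in Counter(c[attr] for c in cases).items():
        gain -= (n / total) * H([c[class_key] for c in cases if c[attr] == value])
        split_info -= (n / total) * math.log2(n / total)
    return gain / split_info if split_info > 0 else 0.0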
10. Example Training Set
[Table: the weather ("play") training set: to play or not to play.]
11. Decision Tree Construction
[Figure: construction on the play example. Attribute outlook makes the best partitioning of the full set (branches sunny, overcast, rainy). In the sunny partition, attribute humidity makes the best partitioning (high → no, normal → yes); in the rainy partition, attribute windy makes the best partitioning (TRUE → no, FALSE → yes). In the overcast partition all cases are of the same class, so the process stops with a leaf (yes).]
12. Decision Tree Usage
- Classification is done by searching for the leaf node, starting from the root node and following the branches
[Figure: an unseen case with outlook = rainy and windy = FALSE goes down the rainy branch to windy?, then down the FALSE branch, reaching the leaf yes.]
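A minimal sketch of this leaf search, using the (attribute, branches) tuple representation assumed in the construction sketch above:

def classify(tree, case):
    """Follow branches from the root until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):        # internal node: a test on one attribute
        attr, branches = tree
        _, tree = branches[case[attr]]    # take the branch matching the answer
    return tree

# e.g. classify(tree, {"outlook": "rainy", "windy": "FALSE"}) -> "yes"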
13. How About If Values Are Missing?
- How do we build the tree?
[Figure: the play example training set with some values replaced by "??", and only an empty root node.]
14. How About If Values Are Missing?
[Figure: the example decision tree again. An unseen case with outlook = ? arrives at the outlook? test: "Which BRANCH should I go to????"]
15. Sources of Missing Values
- Not entered due to misunderstanding
[Figure: during data entry: "What was written?"]
16. Sources of Missing Values
- Not available during data collection
[Figure: example conversation in a survey. "What is your income?" "Sorry, I'd prefer not to answer." "Do you have a car? If yes, what is your car type?" "I don't have a car."]
- Important: missing values may not be errors; they can be made intentionally!
17. Sources of Missing Values
- Equipment failures
- Inconsistent with other data values
- E.g. age vs. date of birth ("Today is 1-2-2007.")
18. Missing Values
- Could appear both in the training set and in unseen cases/objects
- Problem: how can classification be done on cases containing attributes with missing values?
19. Why Do We Need to Handle Missing Values Carefully?
- Accuracy!
- The cost of misclassification is high for some applications
- Example: cancer diagnosis
- False negative: a cancer patient wrongly classified as healthy
- Reduces the patient's chance of recovery
["Don't worry! Our diagnosis is highly accurate!!"]
20. Issue to Focus On
- Classifying incomplete objects (unseen cases with missing values)
- Only categorical attributes are considered
- Key step: estimating the missing values
- How is this done?
21. Directions
- Information available: the training set
- Popular strategy: use the training set to estimate the missing values
- An unseen case has a high chance of following the results of similar cases in the training set
- Problem: bias from the training cases
- Estimation should be more accurate if the set of training data is large
22. How to Handle Missing Values?
- Trial 1: replace a missing value with a value considered adequate
- Problem: what should the value be?
- E.g. a commonly known value
- Estimation is case-independent
SKIP
23. C4.5 Missing Value Handling
- When an internal node is encountered and the relevant attribute value is missing:
- Investigate all branches
- Estimate the probability of reaching each branch
- The class distribution is found by combining the classification results of the different branches
[Figure: the C4.5 decision tree built from the training set (outlook? with branches sunny → humidity? (normal → yes, high → no), overcast, rainy) and an unseen case whose outlook value is missing.]
24. C4.5 Missing Value Handling
- Probability estimation: by the sizes of the partitions
- Prob. of a branch = (# cases in that partition) / (# cases represented by that internal node)
- Works well if most of the attributes are independent
- A class distribution is returned as the result
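A minimal sketch of this weighted exploration, reusing the (attribute, {value: (count, subtree)}) node format assumed in the earlier sketches; the bookkeeping is an illustration of the idea, not C4.5's actual data structures:

from collections import Counter

def class_distribution(node, case):
    """C4.5-style descent: when the tested attribute is missing, explore every
    branch, weighted by its share of the training cases at this node."""
    if not isinstance(node, tuple):       # leaf: all weight on its class
        return Counter({node: 1.0})
    attr, branches = node
    if case.get(attr) is not None:        # value known: follow a single branch
        return class_distribution(branches[case[attr]][1], case)
    total = sum(n for n, _ in branches.values())
    dist = Counter()
    for n, subtree in branches.values():  # value missing: combine all branches
        for cls, p in class_distribution(subtree, case).items():
            dist[cls] += (n / total) * p
    return dist

On the play example of the next slide, the explored branches carry weights 2/5 = 40% and 3/5 = 60%, and the combined result is no: 0.4, yes: 0.6.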
25. C4.5 Missing Value Handling
[Figure: example of missing value handling in C4.5. The unseen case is missing its outlook value. The probability of each branch is estimated as the number of cases in that branch over the number of cases in the node: 2/5 = 40% for one explored branch and 3/5 = 60% for the other (normal 40%, high 60% at the humidity? node). Combining the branch results yields no = 1 × 0.4 = 0.4 and yes = 1 × 0.6 = 0.6.]
26. Using Decision Trees as Tools
- The Ordered Attribute Trees (OAT) method
- By Lobo and Numao (PAKDD 1999, JSAI 2000)
- Uses decision trees to estimate missing attribute values
- An inference-based approach
27. Ordered Attribute Trees (OAT)
- A decision tree is built for each attribute, using the corresponding attribute as if it were the class label
- Each attribute is described only by lower-ordered attributes (those with a weaker relation to the class) in the training set
[Figure: attributes A, B, C arranged in order.]
28. Ordered Attribute Trees (OAT)
- Mutual Information: the dependency measurement
- A symmetric function
- Measures the reduction in uncertainty about random variable X from learning a value of Y
- MI(X, Y) = Σx Σy P(y) · P(x|y) · log( P(x|y) / P(x) )
- x: a value of X, in the domain of X
- P(x): the probability of occurrence of value x
- P(x|y): the conditional probability of X having value x given that Y has value y
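A minimal sketch of this measure over the categorical, list-of-dicts dataset assumed in the earlier sketches (not the authors' code):

import math
from collections import Counter

def mutual_information(cases, x_key, y_key):
    """MI(X, Y) = sum over (x, y) of P(x, y) * log2( P(x, y) / (P(x) P(y)) )."""
    total = len(cases)
    px = Counter(c[x_key] for c in cases)
    py = Counter(c[y_key] for c in cases)
    pxy = Counter((c[x_key], c[y_key]) for c in cases)
    return sum((n / total) * math.log2(n * total / (px[x] * py[y]))
               for (x, y), n in pxy.items())

On the play training set this reproduces the values on the next slide, e.g. MI(outlook, play) = 0.2467.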
29. Ordered Attribute Trees (OAT)
- Play example:
- MI(outlook, play) = 0.2467
- MI(temperature, play) = 0.02922
- MI(humidity, play) = 0.1518
- MI(windy, play) = 0.04813
- Order (ascending): Temperature, Windy, Humidity, Outlook
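Putting the pieces together, a hedged sketch of OAT construction; build_tree and mutual_information are the assumed helpers from the sketches above, not the authors' code.

def build_oats(cases, attrs, class_key):
    """One tree per attribute. Attributes are ordered (ascending) by MI with
    the class label; each tree may use only strictly lower-ordered attributes."""
    order = sorted(attrs, key=lambda a: mutual_information(cases, a, class_key))
    oats = {}
    for i, attr in enumerate(order):
        # train on the cases where this attribute's value is known
        known = [c for c in cases if c.get(attr) is not None]
        oats[attr] = build_tree(known, order[:i], class_key=attr)
    return oats

With the ordering above, Temperature's tree receives no usable attributes and degenerates to its most probable value, matching the next slide.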
30. OAT Examples
- Temperature has the lowest order
- Its tree contains only the root node, with the most probable value: mild (6/14)
- The Windy OAT (using ID3) contains only Temperature
- But not Humidity or Outlook
[Figure: the Temperature OAT is a single node, mild 6/14. The Windy OAT has root Temperature (14 cases) with branches hot, mild, cool leading to leaves such as true 2/4, false 3/4 and true 3/6.]
31. OAT Examples
- Humidity OAT (using ID3)
- Similar to the Windy tree
[Figure: the Humidity OAT has root Temperature (14 cases) with branches hot, mild, cool; one branch leads to a Windy node (4 cases) with branches true and false; leaves include normal 4/4, high 4/6, high 2/3 and high 1/1.]
32. Usage of OAT
- Missing attribute values are filled in using the corresponding trees
- If a case contains two or more missing attributes, the lowest-ordered one is filled first
[Figure: a case with a missing humidity value is dropped through the Humidity OAT (root Temperature, with a Windy subtree) and the missing value is filled with the reached leaf's value, e.g. high.]
33. Problems with OAT
- Leaf nodes with a single value
- Issues with attribute dependency
[Figure: the Windy OAT (root Temperature with branches hot, mild, cool; leaves TRUE, FALSE, TRUE) and the training cases with temp = cool behind one leaf; an OAT of attribute A containing the lower-ordered attribute B.]
34. Leaf Nodes with a Single Value
- A leaf node is associated with ONE value only
- For example, the Temperature OAT: mild is chosen, but it is NOT dominant in the node
- Lack of representative power: a single value is inadequate
[Figure: mild 6/14. Mild is the most probable value, but more than half of the cases are not mild!]
35. Issues with Dependent Attributes
- The best estimation of missing values should rely on dependent attributes
- For example:
- Owns House vs. Owns Car
- Installed Cable TV vs. Watches Soccer Matches
36. Issues with Dependent Attributes
- OAT relies only on each attribute's relationship with the class (in the training set)
- It does not care about attribute dependency
- Example:
- The Humidity OAT contains a node testing Windy
- But MI(humidity, windy) = 0, i.e. they are independent
- Predictions made from independent attributes are less accurate
[Figure: the Humidity OAT tests windy, yet humidity is independent of windy.]
37. Probabilistic Approach
- Probabilistic OAT (POAT): an extended version of OAT
- Probabilistic Attribute Trees (PAT): a new approach
- Using probability makes for better and more complete results!
SKIP
38. Probabilistic OAT
- An improved version of OAT with probabilistic information
- A leaf node contains a probability distribution instead of a single most probable value
[Figure: an OAT leaf labelled high becomes a POAT leaf labelled high 67% / normal 33%.]
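A minimal sketch of the change, under the same assumed dataset representation as before: a leaf keeps the normalized value counts rather than the single most common value.

from collections import Counter

def leaf_distribution(cases, attr):
    """POAT-style leaf: a probability distribution over the attribute's values."""
    counts = Counter(c[attr] for c in cases)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# e.g. 4 'high' and 2 'normal' cases -> high ~ 0.67, normal ~ 0.33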
39. POAT Examples
[Figure: the Humidity POAT: root Temperature (14 cases) with branches hot, mild, cool, and a Windy node (4 cases) with branches true and false; the leaves now carry distributions such as high 67% / normal 33%, normal 100%, and high 100%.]
40. Usage of POAT
- Similar to OAT
- The missing values are filled with probability distributions instead
- The final classification result is a class distribution
[Figure: a case with a missing humidity value is dropped through the Humidity POAT and the missing value is filled with the distribution high 67% / normal 33%.]
41. Probabilistic Attribute Trees
- Takes the dependency of attributes into account
- NO ordering imposed on the attributes
- A leaf node contains a probability distribution instead of a single most probable value (similar to POAT)
[Figure: attributes vs. class label: the class label is NOT cared about during tree construction.]
42. PAT Construction
- A PAT is constructed for every attribute that has dependent attributes
- Mutual Information is again used for the measurement
- Dependency between attributes is defined by a threshold
- Attribute A is said to depend on attribute B (and vice versa) if MI(A, B) > threshold
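A minimal sketch of this dependency test, reusing the mutual_information helper assumed earlier:

def dependent_attributes(cases, attrs, threshold):
    """Dep(A): all other attributes B with MI(A, B) > threshold (symmetric)."""
    return {a: [b for b in attrs
                if b != a and mutual_information(cases, a, b) > threshold]
            for a in attrs}

With threshold = 0.01 on the play set, this gives the dependent sets listed on the next slide.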
43. Play Example with PAT
- Settings: threshold = 0.01
- Dependent attribute sets:
- Dep(Humidity) = {Temperature, Outlook}
- Dep(Outlook) = {Temperature, Humidity}
- Dep(Temperature) = {Humidity, Outlook, Windy}
- Dep(Windy) = {Temperature}
44. Play Example with PAT
- Humidity PAT: contains only its dependent attributes, i.e. Temperature and Outlook
[Figure: the Humidity PAT: root Temperature (14 cases) with branches hot, mild, cool leading to Outlook nodes (branches overcast, rain, sunny) and to leaves such as normal 100%, normal 50% / high 50%, high 100%, normal 33% / high 66%, plus one indeterminate leaf marked "?".]
45. Play Example with PAT
- Outlook PAT: contains only its dependent attributes, i.e. Temperature and Humidity
46. Usage of PAT
[Figure: an unseen case with a missing humidity value is dropped through the Humidity PAT (root Temperature, then Outlook nodes) and the missing value is filled with the distribution high 50% / normal 50%.]
47. Problems with PAT
- The cycle problem
- The indeterminate leaves problem
[Figure: a node testing attribute A whose branch for value a contains 0 cases!]
48. The Cycle Problem
- Happens when two or more dependent attributes have missing values
[Figure: the Humidity PAT contains outlook, so humidity depends on outlook; the Outlook PAT contains humidity, so outlook depends on humidity: a cycle.]
49. A Possible Solution
- Use POAT to estimate the missing values of the lower-ordered attributes first
- Apply PAT again once no cycle exists
[Figure: the Humidity POAT gives high 67% / normal 33%. This distribution is fed through the Outlook PAT (a leaf of which holds sunny 25%, rainy 50%, overcast 25%), weighting each branch's leaf distribution by the branch probability (67% / 33%), which yields sunny 43.5% and overcast 54.7%.]
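A hedged sketch of this weighted propagation, continuing the assumed node format from the C4.5 sketch ((attribute, {value: (count, subtree)}) internal nodes, distribution-valued leaves); it illustrates the idea rather than the authors' algorithm.

from collections import Counter

def propagate(node, case, soft_values):
    """Descend a PAT. An attribute whose value is known only as a POAT
    distribution (in `soft_values`) splits the weight over its branches."""
    if not isinstance(node, tuple):            # leaf: a {value: prob} dict
        return Counter(node)
    attr, branches = node
    if case.get(attr) is not None:             # crisp value: one branch
        return propagate(branches[case[attr]][1], case, soft_values)
    result = Counter()
    for value, weight in soft_values[attr].items():
        if value in branches:                  # weight each reachable branch
            for v, p in propagate(branches[value][1], case, soft_values).items():
                result[v] += weight * p
    return result

# e.g. propagate(outlook_pat, case, {"humidity": {"high": 0.67, "normal": 0.33}})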
50. Another Solution
- Use a set of PATs instead of POAT
- Build PATs for every subset of the dependent attributes
- If some dependent attributes are missing:
- Use the tree built from the remaining subset instead
- Problem: efficiency and space overhead
[Figure: the set of PATs for Humidity: the full Humidity PAT and variants without Temp, without Outlook, and without Temp and Outlook; when Outlook is missing, use the Humidity PAT without Outlook.]
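A minimal sketch of the bookkeeping this implies (build_tree is the assumed helper from the induction sketch); the power-set loop makes the space overhead explicit:

from itertools import combinations

def build_pat_family(cases, attr, dep_attrs):
    """One PAT per subset of Dep(attr): exponentially many trees, which is
    the efficiency and space problem noted above."""
    known = [c for c in cases if c.get(attr) is not None]
    family = {}
    for r in range(len(dep_attrs) + 1):
        for subset in combinations(dep_attrs, r):
            family[frozenset(subset)] = build_tree(known, list(subset), class_key=attr)
    return family

# At classification time: look up family[frozenset(dep attrs actually present)].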
51. The Indeterminate Leaves Problem
- A leaf node of a PAT can contain no cases
- Happens when the attribute used for partitioning has three or more values
[Figure: in the Humidity PAT (root Temperature, then Outlook), an unseen case with missing values reaches an Outlook branch with no cases. What should the result be?]
52. A Possible Solution
SKIP
- Use POAT if a case with missing values reaches a leaf with no cases
[Figure: the case reaches the empty leaf of the Humidity PAT, so the Humidity POAT is used instead, giving high 67% / normal 33%.]
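A minimal sketch of this fallback, continuing the assumptions of the propagation sketch above (both trees in the same assumed node format, with an empty leaf yielding an empty distribution):

def estimate_with_fallback(pat, poat, case, soft_values):
    """Use the PAT; if the descent ends in an empty (indeterminate) leaf,
    fall back to the POAT estimate instead."""
    dist = propagate(pat, case, soft_values)
    return dist if dist else propagate(poat, case, soft_values)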
53. Evaluation of PAT
- Compare PAT with C4.5
- The "vote" database
- Classes: 2 (Democrat, Republican)
54. Evaluation of PAT
55. Evaluation
- Thresholds are set based on the average Mutual Information of all the attributes (≈ 0.26)
- The set of thresholds: 0.2, 0.3, 0.4, 0.5
56. Results
SKIP
- The accuracy of PAT is higher than that of C4.5
57. Evaluation
- The breast-cancer database
- 2 classes (no-recurrence-events, recurrence-events)
- Some attributes are multi-valued
58. Evaluation of PAT
- The set of thresholds: 0.02, 0.03, 0.04
- Estimated using Normalized Mutual Information
- (Plain) MI is biased toward multi-valued attributes
59. Results
SKIP
- The accuracy of PAT is higher than or equal to that of C4.5
60. Analysis
- Compare classification quality via the estimated class distribution
- Instance Analysis Algorithm:
- Measure the class distribution of the training cases that are similar to the case with missing values
- A constant "near": two cases are near if the distance between them is lower than this constant
[Figure: within the training set, the similar training cases around the query include cases of class A and cases of class B.]
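A minimal sketch of such an instance analysis. The distance function here (number of disagreeing categorical attributes) is an assumption for illustration; the paper's actual distance is not given in this summary.

from collections import Counter

def instance_analysis(training, case, attrs, class_key, near):
    """Class distribution over the training cases within distance `near`."""
    def distance(a, b):
        # assumed distance: number of attributes on which the cases disagree
        return sum(1 for k in attrs if a.get(k) != b.get(k))
    similar = [t for t in training if distance(t, case) < near]
    counts = Counter(t[class_key] for t in similar)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}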
61. Analysis
62. Analysis
- The class distributions of the "similar" cases generally match the probabilistic results of PAT better than those of C4.5
- PAT is closer to reality
63. Conclusion
- Missing values: obstacles to classification
- Missing value handling
- Ordered Attribute Trees (OAT)
- Probabilistic approaches
- Probabilistic OATs (POAT)
- Probabilistic Attribute Trees (PAT)
- Potential problems and possible solutions
64. Thank You!
65. The End!
66. Evaluation of PAT
- The threshold is set near the average Normalized Mutual Information of all the attributes (≈ 0.26)
- The set of thresholds: 0.2, 0.3, 0.4, 0.5
67. Trial X
- Ignore training cases that contain missing values
- Applies to the training set only
- Problem: what if there are many cases with missing values?
- The other attribute values of training cases with missing values may be useful and valuable
68. Possible Methods
- Using decision trees
- Shapiro's Method (1987)
- Uses the subset of the training set with known values of the target attribute
- The target attribute is treated as the class label
- The class is used as another attribute
- ONLY for the building phase
[Figure: a tree for outlook is built from the training subset without missing values, then used to fill in the missing values (e.g. youth) of the training cases that have them.]