Decision Tree - PowerPoint PPT Presentation

About This Presentation
Title:

Decision Tree

Description:

Q: Can trees represent arbitrary Boolean expressions? ... Decision trees are the single most popular data mining tool. Easy to understand ... – PowerPoint PPT presentation

Slides: 34
Provided by: rong7
Learn more at: http://www.cse.msu.edu

Transcript and Presenter's Notes

Title: Decision Tree


1
Decision Tree
  • Rong Jin

2
Determine Mileage Per Gallon
3
A Decision Tree for Determining MPG
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
From slides of Andrew Moore
4
Decision Tree Learning
  • Extremely popular method
  • Credit risk assessment
  • Medical diagnosis
  • Market analysis
  • Good at dealing with symbolic features
  • Easy to comprehend, compared with logistic regression
    models and support vector machines

5
Representational Power
  • Q: Can trees represent arbitrary Boolean
    expressions?
  • Q: How many Boolean functions are there over N
    binary attributes?
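Both questions can be explored with a small sketch, offered here as an illustration and not as part of the original slides. A tree that tests every attribute along each path has one leaf per truth-table row, so it can encode any Boolean function, and there are 2^(2^N) such functions over N binary attributes. The helper names (`count_boolean_functions`, `full_tree`, `evaluate`) are hypothetical.

```python
from itertools import product

def count_boolean_functions(n):
    """Number of distinct Boolean functions over n binary attributes."""
    # A truth table has 2**n rows, and each row's output can be 0 or 1.
    return 2 ** (2 ** n)

def full_tree(truth_table, n, prefix=()):
    """Build a complete tree (nested tuples) with one leaf per truth-table row."""
    if len(prefix) == n:
        return truth_table[prefix]
    # Internal node = (attribute index, subtree for value 0, subtree for value 1)
    return (len(prefix),
            full_tree(truth_table, n, prefix + (0,)),
            full_tree(truth_table, n, prefix + (1,)))

def evaluate(tree, x):
    """Follow the branch selected by each tested attribute until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, if_zero, if_one = tree
        tree = if_one if x[attr] else if_zero
    return tree

# XOR of two attributes: a function that no single split can represent.
xor_table = {(a, b): a ^ b for a, b in product((0, 1), repeat=2)}
xor_tree = full_tree(xor_table, 2)
```

The full tree grows exponentially with N, which is one reason practical learners search for small trees instead.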

6
How to Generate Trees from Training Data
7
A Simple Idea
  • Enumerate all possible trees
  • Check how well each tree matches with the
    training data
  • Pick the one that works best

Problems?
  • Too many trees
  • How to determine the quality of decision trees?
8
Solution: A Greedy Approach
  • Choose the most informative feature
  • Split the data set on that feature
  • Recurse until each data item is classified
    correctly

9
How to Determine the Best Feature?
  • Which feature is more informative to MPG?
  • What metric should be used?

Mutual Information!
From Andrew Moore's slides
10
Mutual Information for Selecting Best Features
From Andrew Moore's slides
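The figure for this slide did not survive in the transcript. As a reminder, the standard definition of mutual information between a feature X and the class label Y is given below; the symbols X, Y, and H are the usual textbook notation and are not taken from the slide itself.

```latex
\begin{aligned}
H(Y) &= -\sum_{y} p(y)\,\log_2 p(y) \\
H(Y \mid X) &= \sum_{x} p(x)\, H(Y \mid X = x) \\
I(X; Y) &= H(Y) - H(Y \mid X)
         = \sum_{x}\sum_{y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}
\end{aligned}
```

In the decision tree setting, I(X; Y) is the information gain of splitting on X, so the greedy step picks the feature with the largest value.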
11
Another Example: Playing Tennis
12
Example: Playing Tennis
Humidity (9+, 5-)
  High: (3+, 4-)
  Norm: (6+, 1-)
Wind (9+, 5-)
  Weak: (6+, 2-)
  Strong: (3+, 3-)
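The comparison between the two splits can be worked through numerically. The sketch below assumes the counts follow the standard PlayTennis table (Humidity: High 3+/4-, Norm 6+/1-; Wind: Weak 6+/2-, Strong 3+/3-); the function names are hypothetical.

```python
from math import log2

def entropy(pos, neg):
    """Entropy in bits of a node holding pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # a class with zero examples contributes nothing
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

# Counts are (positive, negative) pairs, assumed from the standard PlayTennis data.
gain_humidity = information_gain((9, 5), [(3, 4), (6, 1)])  # High, Norm
gain_wind = information_gain((9, 5), [(6, 2), (3, 3)])      # Weak, Strong

print(f"Humidity: {gain_humidity:.3f}")  # about 0.152
print(f"Wind:     {gain_wind:.3f}")      # about 0.048
```

Under these assumed counts, Humidity is the more informative of the two features, so a greedy learner would prefer it over Wind.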
13
Prediction for Nodes
What is the prediction for each node?
From Andrew Moore's slides
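The transcript only poses the question. A common convention, assumed here, is to predict the majority output among the training records that reach the node. A minimal sketch with a hypothetical helper name:

```python
from collections import Counter

def node_prediction(labels):
    """Return the most common output among the records that reach a node."""
    # Ties are broken in favor of the label seen first.
    return Counter(labels).most_common(1)[0][0]

# Made-up labels for one node, for illustration only.
example = node_prediction(["bad", "good", "bad", "bad"])
```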
14
Prediction for Nodes
15
Recursively Growing Trees
Original Dataset
Partition it according to the value of the
attribute we split on
cylinders = 4
cylinders = 5
cylinders = 6
cylinders = 8
From Andrew Moore's slides
16
Recursively Growing Trees
From Andrew Moore's slides
17
A Two-Level Tree
18
When Should We Stop Growing Trees?
19
Base Cases
  • Base Case One: If all records in the current data
    subset have the same output, then don't recurse
  • Base Case Two: If all records have exactly the
    same set of input attributes, then don't recurse
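The recursive procedure with these two base cases can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: records are dicts, a tree is a nested dict, leaves predict the majority output, and all function names and the toy records are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    n = len(rows)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(rows, labels, attrs):
    """rows: list of dicts; labels: outputs; attrs: attribute names not yet used."""
    majority = Counter(labels).most_common(1)[0][0]
    # Base Case One: every record has the same output.
    if len(set(labels)) == 1:
        return majority
    # Base Case Two: all records agree on every remaining input attribute.
    if all(r[a] == rows[0][a] for r in rows for a in attrs):
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    children = {}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        children[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx], rest)
    return {"attr": best, "children": children, "default": majority}

def predict(tree, row):
    """Walk down the tree; unseen attribute values fall back to the node majority."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Toy records invented for illustration.
rows = [{"cylinders": 4, "maker": "asia"}, {"cylinders": 4, "maker": "america"},
        {"cylinders": 8, "maker": "america"}, {"cylinders": 8, "maker": "asia"}]
labels = ["good", "good", "bad", "bad"]
tree = build_tree(rows, labels, ["cylinders", "maker"])
```

On the toy records a single split on `cylinders` separates the outputs, so both children stop at Base Case One.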

20
Base Cases: An Idea
  • Base Case One: If all records in the current data
    subset have the same output, then don't recurse
  • Base Case Two: If all records have exactly the
    same set of input attributes, then don't recurse

Proposed Base Case 3: If all attributes have
zero information gain, then don't recurse
Is this a good idea?
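A standard counterexample suggests the answer is no. When the output is the XOR of two attributes, each attribute alone has zero information gain at the root, yet splitting on one of them makes the other perfectly informative. The sketch below checks this on a four-row table; the data and function names are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    n = len(rows)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

# y = a XOR b over all four input combinations.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

# At the root, neither attribute looks useful on its own.
root_gains = [info_gain(rows, labels, attr) for attr in (0, 1)]

# After splitting on a (keeping a == 0), b determines the output exactly.
subset = [(r, y) for r, y in zip(rows, labels) if r[0] == 0]
gain_b_after_a = info_gain([r for r, _ in subset], [y for _, y in subset], 1)
```

Stopping at the root under the proposed rule would leave a tree with 50% training error, while two levels of splits classify every row correctly.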
21
Old Topic: Overfitting
22
What Should We Do?
23
Pruning Decision Tree
  • Stop growing the tree early, before it overfits
  • Or build the full decision tree as before, and
    when it can grow no more, start to prune
  • Pruning methods: reduced error pruning and
    rule post-pruning

24
Reduced Error Pruning
  • Split the data into training and validation sets
  • Build a full decision tree over the training set
  • Keep removing the node whose removal most increases
    validation set accuracy
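The loop above can be sketched as follows. The sketch assumes a tree stored as nested dicts in which each internal node carries a `default` key holding its majority training label, and it stops as soon as no removal strictly improves validation accuracy. The representation, function names, and toy data are all hypothetical.

```python
import copy

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)

def internal_paths(tree, path=()):
    """Yield the sequence of branch values leading to every internal node."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from internal_paths(child, path + (value,))

def replace_with_leaf(tree, path):
    """Copy the tree, turning the node at path into a leaf with its majority label."""
    if not path:
        return tree["default"]
    pruned = copy.deepcopy(tree)
    node = pruned
    for value in path[:-1]:
        node = node["children"][value]
    node["children"][path[-1]] = node["children"][path[-1]]["default"]
    return pruned

def reduced_error_prune(tree, val_rows, val_labels):
    best_acc = accuracy(tree, val_rows, val_labels)
    while isinstance(tree, dict):
        candidates = [replace_with_leaf(tree, p) for p in internal_paths(tree)]
        best = max(candidates, key=lambda t: accuracy(t, val_rows, val_labels))
        acc = accuracy(best, val_rows, val_labels)
        if acc <= best_acc:  # no removal improves validation accuracy
            break
        tree, best_acc = best, acc
    return tree

# Hand-built tree whose a == 1, b == 1 leaf is assumed to have fit noise.
full = {"attr": "a", "default": "yes", "children": {
    0: "no",
    1: {"attr": "b", "default": "yes", "children": {0: "yes", 1: "no"}}}}
val_rows = [{"a": 0, "b": 0}, {"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 1}]
val_labels = ["no", "yes", "yes", "yes"]
pruned = reduced_error_prune(full, val_rows, val_labels)
```

On this toy data the subtree that tests `b` is collapsed into a single "yes" leaf, while the root split on `a` is kept because removing it would lower validation accuracy.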

25
Original Decision Tree
26
Pruned Decision Tree
27
Reduced Error Pruning
28
Rule Post-Pruning
  • Convert the tree into rules, one per leaf
  • Prune each rule by removing preconditions whose
    removal improves its estimated accuracy
  • Sort the final rules by their estimated accuracy
  • Most widely used method (e.g., C4.5)
  • Other methods: statistical significance test
    (chi-square)
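The first three steps can be sketched as follows. This illustration estimates rule accuracy on a held-out validation set, which is a simplifying assumption; C4.5 itself uses a pessimistic estimate computed from the training data. The nested-dict tree format, function names, and toy data are hypothetical.

```python
def tree_to_rules(tree, conditions=()):
    """One rule per leaf: (tuple of (attr, value) preconditions, predicted label)."""
    if not isinstance(tree, dict):
        return [(conditions, tree)]
    rules = []
    for value, child in tree["children"].items():
        rules += tree_to_rules(child, conditions + ((tree["attr"], value),))
    return rules

def rule_accuracy(conditions, label, rows, labels):
    """Accuracy of a rule on the records it covers (0.0 if it covers none)."""
    covered = [y for r, y in zip(rows, labels)
               if all(r[a] == v for a, v in conditions)]
    return sum(y == label for y in covered) / len(covered) if covered else 0.0

def prune_rule(conditions, label, rows, labels):
    """Greedily drop a precondition whenever that strictly improves accuracy."""
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        base = rule_accuracy(conditions, label, rows, labels)
        for c in conditions:
            shorter = [d for d in conditions if d != c]
            if rule_accuracy(shorter, label, rows, labels) > base:
                conditions, improved = shorter, True
                break
    return tuple(conditions)

def post_prune(tree, rows, labels):
    """Prune every rule, then sort by estimated accuracy, best first."""
    rules = [(prune_rule(c, y, rows, labels), y) for c, y in tree_to_rules(tree)]
    return sorted(rules, key=lambda r: rule_accuracy(r[0], r[1], rows, labels),
                  reverse=True)

# Hand-built tree and validation records, invented for illustration.
tree = {"attr": "a", "children": {
    0: "no",
    1: {"attr": "b", "children": {0: "yes", 1: "no"}}}}
val_rows = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 0, "b": 0}]
val_labels = ["no", "yes", "yes", "no"]
rules = post_prune(tree, val_rows, val_labels)
```

Unlike pruning a whole subtree, each rule is pruned independently, so the same attribute test can be kept in one rule and dropped from another.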

29
Real-Valued Inputs
  • What should we do to deal with real-valued inputs?

30
Information Gain
  • x: a real-valued input
  • t: split value
  • Find the split value t such that the mutual
    information I(x, y: t) between x and the class
    label y is maximized.
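A minimal sketch of this search, assuming the split test is x < t and that candidate thresholds are the midpoints between consecutive distinct values of x (a common choice, not stated on the slide). The function names and the horsepower values are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (t, gain) maximizing the information gain of the test x < t."""
    values = sorted(set(xs))
    base, n = entropy(ys), len(ys)
    best_t, best_gain = None, 0.0
    # Try the midpoint between each pair of consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = (base - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical horsepower readings and mpg classes.
horsepower = [70, 80, 95, 110, 150, 200]
mpg = ["good", "good", "good", "bad", "bad", "bad"]
t, gain = best_split(horsepower, mpg)
```

Once t is chosen, the real-valued input behaves like a binary attribute (x < t or x >= t), so the rest of the tree-growing procedure is unchanged.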

31
Conclusions
  • Decision trees are the single most popular data
    mining tool
  • Easy to understand
  • Easy to implement
  • Easy to use
  • Computationally cheap
  • It's possible to get in trouble with overfitting
  • They do classification: predict a categorical
    output from categorical and/or real inputs

32
Software
  • Most widely used decision tree: C4.5 (or C5.0)
  • http://www2.cs.uregina.ca/hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
  • Source code, tutorial

33
The End