Title: Learning decision trees
Learning decision trees
- A concept can be represented as a decision tree, built from examples, as in this problem of estimating credit risk by considering four features of a credit applicant. Such data can be derived from the history of credit applications.
Learning decision trees (2)
- At every level, one feature value is selected.
Learning decision trees (3)
- Usually many decision trees are possible, with varying average cost of classification. Not all features need to be included.
Learning decision trees (4)
- The ID3 algorithm (its latest industrial-strength implementation is called C5.0):
- If all examples are in the same class, build a leaf with this class. (If, for example, we have no historical data that record low or moderate risk, we can only learn that everything is high-risk.)
- Otherwise, if no more features can be used, build a leaf with a disjunction of the classes of the examples. (We might have data that only allow us to distinguish low risk from high and moderate risk.)
- Otherwise, select a feature for the root, partition the examples on this feature, recursively build the decision trees for all partitions, and attach them to the root.
- (This is a greedy algorithm, a form of hill climbing; a sketch in Python follows.)
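A minimal Python sketch of this recursion, under some assumptions not in the slides: examples are dicts mapping feature names (and the target) to values, and the helper names are illustrative. The greedy step minimizes expected information E(F), which is equivalent to maximizing the gain G(F) defined on a later slide.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy I of a list of class labels, in bits (defined two slides below)."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def id3(examples, features, target="risk"):
    """ID3 sketch: examples are dicts from feature names (and target) to values."""
    classes = [e[target] for e in examples]
    if len(set(classes)) == 1:      # all in one class: leaf with that class
        return classes[0]
    if not features:                # no features left: disjunctive leaf
        return set(classes)
    # Greedy step: pick the feature with the lowest expected information E(F),
    # which maximizes the gain G(F) = I(C) - E(F).
    def expected_info(f):
        parts = {}
        for e in examples:
            parts.setdefault(e[f], []).append(e[target])
        return sum(len(p) / len(classes) * info(p) for p in parts.values())
    best = min(features, key=expected_info)
    rest = [f for f in features if f != best]
    return (best, {v: id3([e for e in examples if e[best] == v], rest, target)
                   for v in {e[best] for e in examples}})
```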
Learning decision trees (5)
- Two partially constructed decision trees.
Learning decision trees (6)
- We saw that the same data can be turned into different trees. The question is which trees are better.
- Essentially, the choice of the feature for the root is important. We want to select a feature that gives the most information.
- Information in a set of disjoint classes C = {c1, ..., cn} is defined by this formula:
- I(C) = Σ -p(ci) log2 p(ci)
- p(ci) is the probability that an example is in class ci. The information is measured in bits.
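This measure (class entropy) is easy to compute; here is a small helper, reused in the sketches that follow.

```python
from math import log2

def entropy(probabilities):
    """I(C) = Σ -p(ci) log2 p(ci), in bits.
    A term with p = 0 contributes 0 by convention."""
    return -sum(p * log2(p) for p in probabilities if p > 0)
```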
Learning decision trees (7)
- Let us consider our credit risk data. There are 14 examples in three classes: 6 examples have high risk, 3 have moderate risk, 5 have low risk.
- Assuming that every example is equally likely, the class probabilities are p(high) = 6/14, p(moderate) = 3/14, p(low) = 5/14, so that
- I(RISK) = -6/14 log2 6/14 - 3/14 log2 3/14 - 5/14 log2 5/14 ≈ 1.531 bits
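With the helper above, this value can be checked directly:

```python
print(round(entropy([6/14, 3/14, 5/14]), 3))   # 1.531 bits: I(RISK)
```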
Learning decision trees (8)
- Let feature F be at the root, and let e1, ..., em be the partitions of the examples on this feature.
- Information needed to build a tree for partition ei is I(ei).
- Expected information needed to build the whole tree is a weighted average of I(ei).
- Let |s| denote the cardinality of set s.
- Let e be the set of all examples (the union of the partitions ei).
- Expected information is defined by this formula:
- E(F) = Σ (|ei| / |e|) I(ei)
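A sketch of this weighted average, reusing the entropy helper above; each partition is represented simply as the list of class labels of its examples.

```python
from collections import Counter

def expected_information(partitions):
    """E(F) = Σ |ei|/|e| * I(ei), where each partition ei is
    a list of the class labels of its examples."""
    total = sum(len(ei) for ei in partitions)
    return sum(len(ei) / total *
               entropy([c / len(ei) for c in Counter(ei).values()])
               for ei in partitions)
```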
Learning decision trees (9)
- In our data, there are three partitions based on income:
- e1 = {1, 4, 7, 11}, |e1| = 4, I(e1) = 0.0
- All examples have high risk, so I(e1) = -1 log2 1 = 0.
- e2 = {2, 3, 12, 14}, |e2| = 4, I(e2) = 1.0
- Two examples have high risk, two have moderate risk: I(e2) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1.
- e3 = {5, 6, 8, 9, 10, 13}, |e3| = 6, I(e3) ≈ 0.65
- I(e3) = -1/6 log2 1/6 - 5/6 log2 5/6 ≈ 0.65.
- The expected information to complete the tree using income as the root feature is this:
- E(INCOME) = 4/14 · 0.0 + 4/14 · 1.0 + 6/14 · 0.65 ≈ 0.564 bits
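With the helper above, this can be checked directly. Only the class counts in each partition matter; the slides give a 1:5 split for e3, so the particular labels chosen below are an illustrative assumption.

```python
e1 = ["high"] * 4                        # I(e1) = 0.0
e2 = ["high"] * 2 + ["moderate"] * 2     # I(e2) = 1.0
e3 = ["moderate"] * 1 + ["low"] * 5      # any 1:5 split gives I(e3) ≈ 0.65
print(round(expected_information([e1, e2, e3]), 3))   # 0.564
```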
Learning decision trees (10)
- Now we define the information gain from selecting feature F for tree-building, given a set of classes C:
- G(F) = I(C) - E(F)
- For our sample data and for F = income, we get this:
- G(INCOME) = I(RISK) - E(INCOME) = 1.531 bits - 0.564 bits = 0.967 bits.
- Our analysis will be complete, and our choice clear, after we have similarly considered the remaining three features. The values are as follows:
- G(COLLATERAL) = 0.756 bits,
- G(DEBT) = 0.581 bits,
- G(CREDIT HISTORY) = 0.266 bits.
- That is, we should choose INCOME as the criterion in the root of the best decision tree that we can construct.
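Putting the two helpers together gives the gain directly (the third decimal differs slightly from the slide, which subtracts the rounded values 1.531 and 0.564):

```python
def information_gain(class_labels, partitions):
    """G(F) = I(C) - E(F)."""
    n = len(class_labels)
    i_c = entropy([c / n for c in Counter(class_labels).values()])
    return i_c - expected_information(partitions)

risk = ["high"] * 6 + ["moderate"] * 3 + ["low"] * 5
print(round(information_gain(risk, [e1, e2, e3]), 3))   # 0.966 ≈ 0.967
```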
Explanation-based learning
- A target concept
- The learning system finds an operational definition of this concept, expressed in terms of some primitives. The target concept is represented as a predicate.
- A training example
- This is an instance of the target concept. It takes the form of a set of simple facts, not all of them necessarily relevant to the theory.
- A domain theory
- This is a set of rules, usually in predicate logic, that can explain how the training example fits the target concept.
- Operationality criteria
- These are the predicates (features) that should appear in an effective definition of the target concept.
Explanation-based learning (2)
- A classic example: a theory and an instance of a cup. A cup is a container for liquids that can be easily lifted. It has some typical parts, such as a handle and a bowl. Bowls, the actual containers, must be concave. Because a cup can be lifted, it should be light. And so on.
- The target concept is cup(X).
- The domain theory has five rules.
- liftable( X ) ∧ holds_liquid( X ) → cup( X )
- part( Z, W ) ∧ concave( W ) ∧ points_up( W ) → holds_liquid( Z )
- light( X ) ∧ part( X, handle ) → liftable( X )
- small( A ) → light( A )
- made_of( A, feathers ) → light( A )
Explanation-based learning (3)
- The training example lists nine facts (some of them are not relevant):
- cup( obj1 ), small( obj1 ), part( obj1, handle ), owns( bob, obj1 ), part( obj1, bottom ), part( obj1, bowl ), points_up( bowl ), concave( bowl ), color( obj1, red )
- Operationality criteria require a definition in terms of structural properties of objects (part, points_up, small, concave).
Explanation-based learning (4)
Step 1: prove the target concept using the training example.
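The proof can be reproduced mechanically. Below is a minimal sketch: the facts and the five rules are hand-encoded as Python tuples (an illustrative representation, not from the slides), and a naive forward chainer derives cup(obj1). The classification cup( obj1 ) itself is left out of the starting facts, since it is what the proof must explain.

```python
# The remaining facts from the training example above, as tuples.
facts = {
    ("small", "obj1"), ("part", "obj1", "handle"),
    ("owns", "bob", "obj1"), ("part", "obj1", "bottom"),
    ("part", "obj1", "bowl"), ("points_up", "bowl"),
    ("concave", "bowl"), ("color", "obj1", "red"),
}

def forward_chain(facts):
    """Apply the five domain-theory rules until nothing new is derived."""
    derived = set(facts)
    while True:
        new = set(derived)
        terms = {t for fact in derived for t in fact[1:]}
        for x in terms:
            if ("small", x) in derived:                 # small(A) -> light(A)
                new.add(("light", x))
            if ("made_of", x, "feathers") in derived:   # made_of -> light(A)
                new.add(("light", x))
            if {("light", x), ("part", x, "handle")} <= derived:
                new.add(("liftable", x))                # -> liftable(X)
            for w in terms:                             # -> holds_liquid(Z)
                if {("part", x, w), ("concave", w), ("points_up", w)} <= derived:
                    new.add(("holds_liquid", x))
            if {("liftable", x), ("holds_liquid", x)} <= derived:
                new.add(("cup", x))                     # -> cup(X)
        if new == derived:
            return derived
        derived = new

print(("cup", "obj1") in forward_chain(facts))   # True
```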
Explanation-based learning (5)
Step 2: generalize the proof. Constants from the domain theory, for example handle, are not generalized.
Explanation-based learning (6)
Step 3: take the definition off the tree, keeping only the root and the leaves.
In our example, we get this rule:
small( X ) ∧ part( X, handle ) ∧ part( X, W ) ∧ concave( W ) ∧ points_up( W ) → cup( X )
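The learned definition satisfies the operationality criteria: only part, points_up, small, and concave appear. As a final check, the rule can be run directly against the fact set from the sketch above (again an illustrative encoding):

```python
def is_cup(x, facts):
    """small(X) ∧ part(X, handle) ∧ part(X, W)
       ∧ concave(W) ∧ points_up(W) -> cup(X)"""
    return (("small", x) in facts
            and ("part", x, "handle") in facts
            and any(f[0] == "part" and f[1] == x
                    and ("concave", f[2]) in facts
                    and ("points_up", f[2]) in facts
                    for f in facts if len(f) == 3))

print(is_cup("obj1", facts))   # True
```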