Title: Machine Learning and Inductive Inference
Machine Learning and Inductive Inference
- Hendrik Blockeel, 2001-2002
1 Introduction
- Practical information
- What is "machine learning and inductive
inference"? - What is it useful for? (some example
applications) - Different learning tasks
- Data representation
- Brief overview of approaches
- Overview of the course
Practical information about the course
- 10 lectures (2h), 4 exercise sessions (2.5h)
- Audience with diverse backgrounds
- Course material
  - Book: Machine Learning (Mitchell, 1997, McGraw-Hill)
  - Slides + notes: http://www.cs.kuleuven.ac.be/hendrik/ML/
- Examination
  - oral exam (20') with written preparation (+/- 2h)
  - 2/3 theory, 1/3 exercises
  - only topics discussed in lectures / exercises
What is machine learning?
- Study of how to make programs improve their performance on certain tasks from their own experience
  - "performance" = speed, accuracy, ...
  - "experience" = set of previously seen cases ("observations")
- For instance (simple method):
  - experience: taking action A in situation S yielded result R
  - situation S arises again
    - if R was undesirable: try something else
    - if R was desirable: try action A again
- This is a very simple example
  - it only works if precisely the same situation is encountered
  - what if a similar situation arises? -> Need for generalisation
  - how about choosing another action even if a good one is already known? (you might find a better one) -> Need for exploration
- This course focuses mostly on generalisation, or inductive inference
Inductive inference
- Reasoning from specific to general
- e.g. statistics: from a sample, infer properties of the population
  - observation: "these dogs are all brown"
  - hypothesis: "all dogs are brown"
- Note: inductive inference is more general than statistics
  - statistics mainly consists of numerical methods for inference
    - infer mean, probability distribution, ... of a population
  - other approaches:
    - find a symbolic definition of a concept (concept learning)
    - find laws with complicated structure that govern the data
    - study induction from a logical, philosophical point of view
- Applications of inductive inference
  - Machine learning
    - "sample" of observations = experience
    - generalizing to population = finding patterns in the observations that generally hold and may be used for future tasks
  - Knowledge discovery (Data mining)
    - "sample" = database
    - generalizing = finding patterns that hold in this database and can also be expected to hold on similar data not in the database
    - discovered knowledge = comprehensible description of these patterns
  - ...
What is it useful for?
- Scientifically: for understanding learning and intelligence in humans and animals
  - interesting for psychologists, philosophers, biologists, ...
- More practically:
  - for building AI systems
    - expert systems that improve automatically with time
    - systems that help scientists discover new laws
  - also useful outside classical AI-like applications
    - when we don't know how to program something ourselves
    - when a program should adapt regularly to new circumstances
    - when a program should tune itself towards its user
Knowledge discovery
- Scientific knowledge discovery
- Some toy examples:
  - BACON rediscovered some laws of physics (e.g. Kepler's laws of planetary motion)
  - AM rediscovered some mathematical theorems
- More serious recent examples:
  - mining the human genome
  - mining the web for information on genes, proteins, ...
  - drug discovery
    - context: robots perform lots of experiments at a high rate; this yields lots of data, to be studied and interpreted by humans; try to automate this process (because humans can't keep up with the robots)
Example: given molecules that are active against some disease, find out what is common to them; this is probably the reason for their activity.
- Data mining in databases: looking for interesting patterns
  - e.g. for marketing:
    - based on the data in the DB, who should be interested in this new product? (useful for direct mailing)
    - study customer behaviour to identify typical profiles of customers
    - find out which products in a store are often bought together
  - e.g. in a hospital: help with the diagnosis of patients
Learning to perform difficult tasks
- Difficult for humans
  - the LEX system learned how to perform symbolic integration of functions
- or easy for humans, but difficult to program
  - humans can do it, but can't explain how they do it
  - e.g.
    - learning to play games (chess, go, ...)
    - learning to fly a plane, drive a car, ...
    - recognising faces
Adaptive systems
- Robots in a changing environment
  - continuously need to adapt their behaviour
- Systems that adapt to the user
  - based on user modelling:
    - observe the behaviour of the user
    - build a model describing this behaviour
    - use the model to make the user's life easier
  - e.g. adaptive web pages, intelligent mail filters, adaptive user interfaces (e.g. an intelligent Unix shell), ...
Illustration: building a system that learns checkers
- Learning = improving on task T, with respect to performance measure P, based on experience E
- In this example:
  - T: playing checkers
  - P: % of games won in the world tournament
  - E: games played against self
    - possible problem: is this experience representative for the real task?
- Questions to be answered:
  - exactly what is given, exactly what is learnt, and what representation and learning algorithm should we use?
- What do we want to learn?
  - given a board situation, which move to make
- What is given?
  - direct or indirect evidence?
    - direct: e.g., which moves were good, which were bad
    - indirect: consecutive moves in a game, outcome of the game
  - in our case: indirect evidence
    - direct evidence would require a teacher
- What exactly shall we learn?
- Choose the type of target function:
  - ChooseMove: Board -> Move?
    - directly applicable
  - V: Board -> R?
    - indicates the quality of a state
    - when playing, choose the move that leads to the best state
- Note: a reasonable definition for V is easy to give:
  - V(won) = 100, V(lost) = -100, V(draw) = 0, V(s) = V(e) with e the best state reachable from s when playing optimally
  - not feasible in practice (exhaustive minimax search)
- Let's choose the V function here
- Choose a representation for the target function
  - set of rules?
  - neural network?
  - polynomial function of numerical board features?
  - ...
- Let's choose: V = w1·bp + w2·rp + w3·bk + w4·rk + w5·bt + w6·rt
  - bp, rp: number of black / red pieces
  - bk, rk: number of black / red kings
  - bt, rt: number of black / red pieces threatened
  - wi: constants to be learnt from experience
- How to obtain training examples?
  - we need a set of examples (bp, rp, bk, rk, bt, rt, V)
  - bp etc. are easy to determine, but how to guess V?
    - we have indirect evidence only!
  - possible method:
    - with V(s) the true target function, V̂(s) the learnt function, Vt(s) the training value for a state s:
    - Vt(s) <- V̂(successor(s))
    - adapt V̂ using the Vt values (making V̂ and Vt converge)
    - hope that V̂ will converge to V
  - intuitively: V for end states is known; propagate V values from later states to earlier states in the game
- Training algorithm: how to adapt the weights wi?
- possible method:
  - look at the error: error(s) = Vt(s) - V̂(s)
  - adapt the weights so that the error is reduced
  - e.g. using a gradient descent method:
    - for each feature fi: wi <- wi + c · fi · error(s), with c some small constant
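As an illustration, here is a minimal Python sketch of this weight update (Mitchell calls it the LMS update); the feature values, the constant c, and all names are ours:

def v_hat(weights, features):
    """Learnt evaluation function: linear in the board features."""
    return sum(w * f for w, f in zip(weights, features))

def lms_update(weights, features, v_train, c=0.01):
    """One gradient-descent step: move each weight so that
    v_hat(features) gets closer to the training value v_train."""
    error = v_train - v_hat(weights, features)
    return [w + c * f * error for w, f in zip(weights, features)]

# Example: a board with 4 black pieces, 3 red pieces, 1 black king,
# 0 red kings, 2 black and 1 red pieces threatened.
weights = [0.0] * 6            # w1..w6, to be learnt
features = [4, 3, 1, 0, 2, 1]  # (bp, rp, bk, rk, bt, rt)
v_train = 10.0                 # training value, e.g. V̂ of the successor state
weights = lms_update(weights, features, v_train)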
Overview of design choices
- determine the type of training experience: games against self / games against expert / table of good moves
- determine the type of target function: Board -> Move / Board -> R
- determine the representation: linear function of 6 features
- determine the learning algorithm: gradient descent
- ready!
Some issues that influence choices
- Which algorithms are useful for what type of functions?
- How is learning influenced by
  - the training examples
  - the complexity of the hypothesis (function) representation
  - noise in the data
- Theoretical limits of learning?
- Can we help the learner with prior knowledge?
- Could a system alter its representation itself?
Typical learning tasks
- Concept learning
  - learn a definition of a concept
  - supervised vs. unsupervised
- Function learning ("predictive modelling")
  - discrete ("classification") or continuous ("regression")
  - concept = function with boolean result
- Clustering
- Finding descriptive patterns
Concept learning: supervised
- Given positive (+) and negative (-) examples of a concept, infer the properties that cause instances to be positive or negative (= concept definition)
[Figure: + and - examples as points in an instance space X, with a region C containing exactly the + examples; C: X -> {true, false}]
Concept learning: unsupervised
- Given examples of instances
- Invent reasonable concepts (= clustering)
- Find definitions for these concepts
[Figure: unlabelled points in an instance space X grouped into clusters C1, C2, C3]
- Cf. taxonomy of animals, identification of market segments, ...
Function learning
- Generalises over concept learning
- Learn a function f: X -> S where
  - S is a finite set of values: classification
  - S is a continuous range of reals: regression
[Figure: instances in X labelled with numeric values (0.6, 0.9, 1.4, 2.1, 2.7); f maps regions of X to these values]
Clustering
- Finding groups of instances that are similar
- May be a goal in itself (unsupervised classification)
- ... but also used for other tasks
  - regression
  - flexible prediction: when it is not known in advance which properties to predict from which other properties
[Figure: unlabelled points in an instance space X forming visible groups]
Finding descriptive patterns
- Descriptive patterns: any kind of patterns, not necessarily directly useful for prediction
- Generalises over predictive modelling (= finding predictive patterns)
- Examples of patterns:
  - "fast cars usually cost more than slower cars"
  - "people are never married to more than one person at the same time"
Representation of data
- Numerical data: instances are points in Rⁿ
  - many techniques focus on this kind of data
- Symbolic data (true/false, black/white/red/blue, ...)
  - can be converted to numeric data
  - some techniques work directly with symbolic data
- Structural data
  - instances have internal structure (graphs, sets, ...; cf. molecules)
  - difficult to convert to a simpler format
  - few techniques can handle these directly
Brief overview of approaches
- Symbolic approaches
  - version spaces, induction of decision trees, induction of rule sets, inductive logic programming, ...
- Numeric approaches
  - neural networks, support vector machines, ...
- Probabilistic approaches (Bayesian learning)
- Miscellaneous
  - instance-based learning, genetic algorithms, reinforcement learning
Overview of the course
- Introduction (today) (Ch. 1)
- Concept learning: version spaces (Ch. 2 - brief)
- Induction of decision trees (Ch. 3)
- Artificial neural networks (Ch. 4 - brief)
- Evaluating hypotheses (Ch. 5)
- Bayesian learning (Ch. 6)
- Computational learning theory (Ch. 7)
- Support vector machines (brief)
- Instance-based learning (Ch. 8)
- Genetic algorithms (Ch. 9)
- Induction of rule sets; association rules (Ch. 10)
- Reinforcement learning (Ch. 13)
- Clustering
- Inductive logic programming
- Combining different models
- bagging, boosting, stacking, ...
2 Version Spaces
- Recall the basic principles from the AI course
  - stressing important concepts for later use
- Difficulties with version space approaches
- Inductive bias
- See Mitchell, Ch. 2
Basic principles
- Concept learning as search
  - given a hypothesis space H and a data set S
  - find all h ∈ H consistent with S
  - this set is called the version space, VS(H,S)
- How to search in H?
  - enumerate all h in H: not feasible
  - prune the search using some generality ordering:
    - h1 more general than h2 <=> ∀x (x ∈ h2 => x ∈ h1)
  - see Mitchell, Chapter 2, for examples
An example
- + belongs to the concept, - does not
- S = set of these + and - examples
- Assume hypotheses are rectangles
  - i.e., H = set of all rectangles
- VS(H,S) = set of all rectangles that contain all + and no -
- Example of a consistent hypothesis: the green rectangle (in the figure), containing all + and no -
- h1 more general than h2 <=> h2 lies totally inside h1
- h2 more specific than h1; h3 incomparable with h1
[Figure: three rectangles h1, h2, h3, with h2 inside h1 and h3 partially overlapping h1]
Version space boundaries
- Bound the version space by giving its most specific (S) and most general (G) borders
  - S: rectangles that cannot become smaller without excluding some +
  - G: rectangles that cannot become larger without including some -
- Any hypothesis h consistent with the data
  - must be more general than some element in S
  - must be more specific than some element in G
- Thus, G and S completely specify the VS
Example, continued
- So what are S and G here?
- S = {h1}, G = {h2, h3}
Computing the version space
- Computing G and S is sufficient to know the full version space
- Algorithms: see Mitchell's book
  - Find-S computes only the S set (a sketch follows below)
    - S is always a singleton in Mitchell's examples
  - Candidate Elimination computes S and G
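For the rectangle representation used in the demonstration below, a Find-S-style computation of S is easy to sketch in Python. This is an illustration only, with our own names; Mitchell's Find-S is stated for conjunctions of attribute constraints, of which these intervals are one instance:

def covers(h, example):
    """True if hypothesis h (a rectangle) contains the example (x, y)."""
    if h is None:                      # None stands for the empty hypothesis
        return False
    xmin, xmax, ymin, ymax = h
    x, y = example
    return xmin <= x <= xmax and ymin <= y <= ymax

def more_general(h1, h2):
    """h1 is more general than (or equal to) h2 iff h2 lies inside h1."""
    if h2 is None:
        return True
    if h1 is None:
        return False
    return (h1[0] <= h2[0] and h1[1] >= h2[1] and
            h1[2] <= h2[2] and h1[3] >= h2[3])

def find_s(positives):
    """Most specific rectangle covering all positive examples: grow the
    hypothesis minimally each time a positive is not yet covered."""
    h = None
    for (x, y) in positives:
        if h is None:
            h = (x, x, y, y)
        elif not covers(h, (x, y)):
            h = (min(h[0], x), max(h[1], x), min(h[2], y), max(h[3], y))
    return h

print(find_s([(3, 2), (5, 3)]))   # -> (3, 5, 2, 3), i.e. <3-5, 2-3>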
Candidate Elimination algorithm: demonstration with rectangles
- Algorithm: see Mitchell
- Representation:
  - concepts are rectangles
  - a rectangle is represented with 2 attributes: <Xmin-Xmax, Ymin-Ymax>
  - graphical representation
- A hypothesis is consistent with the data if
  - all + are inside the rectangle
  - no - is inside the rectangle
[Figure: an empty grid (x and y coordinates 1-6) on which the examples will appear]
S = {<∅,∅>}   G = {<1-6, 1-6>}
- Example e1 (+) appears at (3,2), not covered by S; S is extended to cover e1:
  S = {<3-3, 2-2>}   G = {<1-6, 1-6>}
- Example e2 (-) appears, covered by G; G is changed to avoid covering e2
  - note: G now consists of 2 parts; each part covers all + and no -:
  S = {<3-3, 2-2>}   G = {<1-4, 1-6>, <1-6, 1-3>}
- Example e3 (-) appears, covered by G; one part of G is affected and reduced:
  S = {<3-3, 2-2>}   G = {<3-4, 1-6>, <1-6, 1-3>}
- Example e4 (+) appears, not covered by S; S is extended to cover e4:
  S = {<3-5, 2-3>}   G = {<3-4, 1-6>, <1-6, 1-3>}
- The part of G not covering the new S is removed:
  S = {<3-5, 2-3>}   G = {<1-6, 1-3>}
- The current version space contains all rectangles covering S and covered by G, e.g. h = <2-5, 2-3>:
  S = {<3-5, 2-3>}   G = {<1-6, 1-3>}
- Interesting points:
  - we here use an extended notion of generality
    - in the book: ∅ < (single value) < ? (anything)
    - here e.g.: ∅ < 2-3 < 2-5 < 1-5 < ?
  - we still use a conjunctive concept definition
    - each concept is 1 rectangle
    - this could be extended as well (but it gets complicated)
Difficulties with version space approaches
- The idea of a VS provides a nice theoretical framework
- But it is not very useful for most practical problems
- Difficulties with these approaches:
  - not very efficient
    - the borders G and S may be very large (they may grow exponentially)
  - not noise resistant
    - the VS collapses when no consistent hypothesis exists
    - often we would like to find the best hypothesis in this case
  - in Mitchell's examples: only conjunctive definitions
- We will compare with other approaches...
Inductive bias
- After having seen a limited number of examples, we believe we can make predictions for unseen cases.
- From seen cases to unseen cases: the inductive leap
- Why do we believe this? Is there any guarantee this prediction will be correct? What extra assumptions do we need to guarantee correctness?
- Inductive bias: the minimal set of extra assumptions that guarantees the correctness of the inductive leap
Equivalence between inductive and deductive systems
- inductive system: (training examples, new instance) -> result, by inductive leap
- deductive system: (training examples, new instance, inductive bias) -> result, by proof
Definition of inductive bias
- More formal definition of inductive bias (Mitchell):
  - L(x,D) denotes the classification assigned to instance x by learner L after training on D
  - the inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D:
    ∀x ∈ X: (B ∧ D ∧ x) ⊢ L(x,D)
Effect of inductive bias
- Different learning algorithms give different results on the same data set because each may have a different bias
- A stronger bias means less learning
  - more is assumed in advance
- Is learning possible without any bias at all?
  - i.e., pure learning, without any assumptions in advance
  - the answer is no.
Inductive bias of version spaces
- Bias of the candidate elimination algorithm: the target concept is in H
- H typically consists of conjunctive concepts
  - in our previous illustration: rectangles
- H could be extended towards disjunctive concepts
- Is it possible to use version spaces with H = the set of all imaginable concepts, thereby eliminating all bias?
Unbiased version spaces
- Let U be the example domain
- Unbiased: the target concept C can be any subset of U
  - hence, H = 2^U
- Consider VS(H,D) with D a strict subset of U
- Assume you see an unseen instance x (x ∈ U \ D)
- For each h ∈ VS that predicts x ∈ C, there is an h' ∈ VS that predicts x ∉ C, and vice versa
  - just take h' = h ∪ {x}; since x ∉ D, h and h' are exactly the same w.r.t. D, so either both are in the VS, or neither is
- Conclusion: version spaces without any bias do not allow generalisation
- To be able to make an inductive leap, some bias is necessary.
- We will see many different learning algorithms that all differ in their inductive bias.
- When choosing one in practice, bias should be an important criterion
  - unfortunately it is not always well understood
To remember
- Definition of version space; importance of the generality ordering for searching
- Definition of inductive bias; its practical importance; why it is necessary for learning; how it relates inductive systems to deductive systems
3 Induction of decision trees
- What are decision trees?
- How can they be induced automatically?
  - top-down induction of decision trees
  - avoiding overfitting
  - converting trees to rules
  - alternative heuristics
  - a generic TDIDT algorithm
- See Mitchell, Ch. 3
What are decision trees?
- They represent sequences of tests
  - according to the outcome of a test, perform a new test
  - continue until the result is known
- Cf. guessing a person using only yes/no questions:
  - ask some question
  - depending on the answer, ask a new question
  - continue until the answer is known
Example decision tree 1
- Mitchell's example: play tennis or not? (depending on the weather conditions)

  Outlook?
  - Sunny -> Humidity?
    - High -> No
    - Normal -> Yes
  - Overcast -> Yes
  - Rainy -> Wind?
    - Strong -> No
    - Weak -> Yes
Example decision tree 2
- Again from Mitchell: a tree for predicting whether a C-section is necessary
- Leaves are not pure here: the ratio pos/neg is given
[Tree: root Fetal_Presentation with branches 1, 2, 3; branch 1 leads to Previous_Csection with branches 0, 1; branch 0 leads to Primiparous, ...; leaves show class ratios, e.g. [3+, 29-] = .11+ .89-, [8+, 22-] = .27+ .73-, [55+, 35-] = .61+ .39-]
Representation power
- Typically:
  - examples are represented by an array of attributes
  - 1 node in the tree tests the value of 1 attribute
  - 1 child node for each possible outcome of the test
  - leaf nodes assign a classification
- Note:
  - a tree can represent any boolean function
    - i.e., also disjunctive concepts (contrast with the VS examples)
  - a tree can allow for noise (non-pure leaves)
Representing boolean formulae
- E.g., A ∨ B:

  A?
  - true -> true
  - false -> B?
    - true -> true
    - false -> false

- Similarly (try yourself): A ∧ B, A xor B, (A ∧ B) ∨ (C ∧ ¬D ∧ E)
- M of N (at least M out of N propositions are true)
- What about the complexity of the tree vs. the complexity of the original formula?
Classification, regression and clustering trees
- Classification trees represent a function X -> C with C discrete (like the decision trees we just saw)
- Regression trees predict numbers in the leaves
  - could use a constant (e.g., the mean), or a linear regression model, or ...
- Clustering trees just group examples in the leaves
- Most (but not all) research in machine learning focuses on classification trees
Example decision tree 3 (from a study of river water quality)
- "Data mining" application
- Given: descriptions of river water samples
  - biological description: occurrence of organisms in the water (abundance, graded 0-5)
  - chemical description: 16 variables (temperature, concentrations of chemicals (NH4, ...))
- Question: characterize the chemical properties of the water using the organisms that occur
Clustering tree

  abundance(Tubifex sp., 5)?
  - yes -> T = 0.357111, pH = -0.496808, cond = 1.23151, O2 = -1.09279, O2sat = -1.04837, CO2 = 0.893152, hard = 0.988909, NO2 = 0.54731, NO3 = 0.426773, NH4 = 1.11263, PO4 = 0.875459, Cl = 0.86275, SiO2 = 0.997237, KMnO4 = 1.29711, K2Cr2O7 = 0.97025, BOD = 0.67012
  - no -> abundance(Sphaerotilus natans, 5)?
    - yes -> T = 0.0129737, pH = -0.536434, cond = 0.914569, O2 = -0.810187, O2sat = -0.848571, CO2 = 0.443103, hard = 0.806137, NO2 = 0.4151, NO3 = -0.0847706, NH4 = 0.536927, PO4 = 0.442398, Cl = 0.668979, SiO2 = 0.291415, KMnO4 = 1.08462, K2Cr2O7 = 0.850733, BOD = 0.651707
    - no -> abundance(...)?

  (values are "standardized": how many standard deviations above the mean)
Top-down induction of decision trees
- Basic algorithm for TDIDT (later: a more formal version)
  - start with the full data set
  - find the test that partitions the examples as well as possible
    - "good" = examples with the same class, or otherwise similar examples, should be put together
  - for each outcome of the test, create a child node
  - move the examples to the children according to the outcome of the test
  - repeat the procedure for each child that is not "pure"
- Main question: how to decide which test is "best"
Finding the best test (for classification trees)
- For classification trees: find the test for which the children are as pure as possible
- Purity measure borrowed from information theory: entropy
  - entropy is a measure of missing information; more precisely, the number of bits needed to represent the missing information, on average, using an optimal encoding
- Given a set S with instances belonging to class i with probability pi:
  Entropy(S) = - Σi pi log2 pi
Entropy
- Intuitive reasoning:
  - use a shorter encoding for more frequent messages
  - information theory: a message with probability p should get -log2(p) bits
  - e.g. A, B, C, D, each with 25% probability: 2 bits for each (00, 01, 10, 11)
  - if some are more probable, it is possible to do better
  - the average number of bits for a message is then - Σi pi log2 pi
Entropy
- Entropy as a function of p, for 2 classes
[Figure: the entropy curve, 0 at p = 0 and p = 1, maximal (1 bit) at p = 0.5]
Information gain
- Heuristic for choosing a test in a node:
  - choose the test that on average provides the most information about the class
  - this is the test that, on average, reduces the class entropy most
    - ("on average": the class entropy reduction differs according to the outcome of the test)
  - expected reduction of entropy = information gain (a code sketch follows below):
    Gain(S,A) = Entropy(S) - Σv (|Sv|/|S|) Entropy(Sv)
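A small Python sketch of these two formulas (an illustration only; the names are ours, and `split` maps each outcome of a test A to the list of class labels of the examples with that outcome):

from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class distribution of S."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(labels, split):
    """Gain(S,A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = sum(len(sv) / n * entropy(sv) for sv in split.values())
    return entropy(labels) - remainder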
Example
- Assume S has 9 + and 5 - examples; partition according to the Wind or the Humidity attribute

  S = [9+, 5-]
  - Humidity = High -> [3+, 4-];  Humidity = Normal -> [6+, 1-]
  - Wind = Weak -> [6+, 2-];  Wind = Strong -> [3+, 3-]
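With the sketch above, the numbers for this example work out as follows (they agree with Mitchell's):

S = ['+'] * 9 + ['-'] * 5                   # Entropy(S) ≈ 0.940
humidity = {'High':   ['+']*3 + ['-']*4,    # entropy ≈ 0.985
            'Normal': ['+']*6 + ['-']*1}    # entropy ≈ 0.592
wind = {'Weak':   ['+']*6 + ['-']*2,        # entropy ≈ 0.811
        'Strong': ['+']*3 + ['-']*3}        # entropy = 1.0
print(information_gain(S, humidity))        # ≈ 0.151
print(information_gain(S, wind))            # ≈ 0.048 -> Humidity is the better test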
- Assume Outlook was chosen: continue partitioning in the child nodes

  [9+, 5-]: Outlook?
  - Sunny -> [2+, 3-]: ?
  - Overcast -> [4+, 0-]: Yes
  - Rainy -> [3+, 2-]: ?
Hypothesis space search in TDIDT
- Hypothesis space H = set of all trees
- H is searched in a hill-climbing fashion, from simple to complex
Inductive bias in TDIDT
- Note: for e.g. boolean attributes, H is complete: each concept can be represented!
  - given n attributes, we can keep on adding tests until all attributes are tested
- So what about inductive bias?
  - clearly no restriction bias (H corresponds to 2^U) as in candidate elimination
  - preference bias: some hypotheses in H are preferred over others
  - in this case: preference for short trees with informative attributes at the top
Occam's Razor
- A preference for simple models over complex models is quite generally used in machine learning
- Similar principle in science: Occam's Razor
  - roughly: do not make things more complicated than necessary
- Reasoning, in the case of decision trees: more complex trees have a higher probability of overfitting the data set
Avoiding overfitting
- Phenomenon of overfitting:
  - keep improving a model, making it better and better on the training set by making it more complicated
  - this increases the risk of modelling noise and coincidences in the data set
  - it may actually harm the predictive power of the theory on unseen cases
- Cf. fitting a curve with too many parameters
[Figure: data points fitted by a wildly oscillating curve]
Overfitting: example
[Figure: + and - training examples with an overly specific decision boundary that fits every example, including likely noise]
Overfitting: effect on predictive accuracy
- Typical phenomenon when overfitting:
  - training accuracy keeps increasing
  - accuracy on an unseen validation set starts decreasing
[Figure: accuracy vs. size of tree; accuracy on the training data keeps rising, accuracy on unseen data starts dropping at the point where overfitting starts]
How to avoid overfitting when building classification trees?
- Option 1:
  - stop adding nodes to the tree when overfitting starts occurring
  - this needs a stopping criterion
- Option 2:
  - don't bother about overfitting when growing the tree
  - after the tree has been built, start pruning it again
Stopping criteria
- How do we know when overfitting starts?
  - a) use a validation set: data not considered for choosing the best test
    - when accuracy goes down on the validation set: stop adding nodes to this branch
  - b) use some statistical test
    - significance test: e.g., is the change in class distribution still significant? (χ²-test)
    - MDL: minimal description length principle
      - fully correct theory = tree + corrections for specific misclassifications
      - minimize size(fully correct theory) = size(tree) + size(misclassifications(tree))
      - cf. Occam's razor
Post-pruning trees
- After learning the tree: start pruning branches away
  - for all nodes in the tree:
    - estimate the effect of pruning the tree at this node on predictive accuracy
      - e.g. using accuracy on a validation set
  - prune the node that gives the greatest improvement
  - continue until no improvements
- Note: this pruning constitutes a second search in the hypothesis space
[Figure: the accuracy-vs-tree-size plot again; the effect of pruning is that the tree shrinks and accuracy on unseen data improves again]
Comparison
- Advantage of Option 1: no superfluous work
- But: tests may be misleading
  - e.g., validation accuracy may go down briefly, then go up again
- Therefore, Option 2 (post-pruning) is usually preferred (though it is more work, computationally)
Turning trees into rules
- From a tree, a rule set can be derived
  - each path from root to leaf in the tree yields 1 if-then rule
- Advantages of such rule sets:
  - may increase comprehensibility
  - can be pruned more flexibly
    - in 1 rule, 1 single condition can be removed
      - vs. tree: when removing a node, the whole subtree is removed
    - 1 rule can be removed entirely
Rules from trees: example

  Outlook?
  - Sunny -> Humidity?  (High -> No, Normal -> Yes)
  - Overcast -> Yes
  - Rainy -> Wind?  (Strong -> No, Weak -> Yes)

if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
...
Pruning rules
- Possible method:
  - 1. convert the tree to rules
  - 2. prune each rule independently
    - remove conditions that do not harm the accuracy of the rule
  - 3. sort the rules (e.g., most accurate rule first)
    - before pruning: each example is covered by 1 rule
    - after pruning: 1 example might be covered by multiple rules
    - therefore some rules might contradict each other
Pruning rules: example

  Tree representing A ∨ B:
  A?
  - true -> true
  - false -> B?  (true -> true, false -> false)

if A = true then true
if A = false and B = true then true
if A = false and B = false then false

The rules represent A ∨ (¬A ∧ B).
Alternative heuristics for choosing tests
- Attributes with continuous domains (numbers)
  - cannot have a different branch for each possible outcome
  - allow, e.g., a binary test of the form Temperature < 20
- Attributes with many discrete values
  - have an unfair advantage over attributes with few values
    - cf. a question with many possible answers is more informative than a yes/no question
  - to compensate: divide the gain by the maximal potential gain SI (sketched below)
    - Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A)
    - Split information: SI(S,A) = - Σi (|Si|/|S|) log2(|Si|/|S|), with i ranging over the different results of test A
- Tests may have different costs
  - e.g. medical diagnosis: a blood test, a visual examination, ... have different costs
  - try to find a tree with low expected cost
    - instead of low expected number of tests
  - alternative heuristics, taking cost into account, have been proposed
Properties of good heuristics
- Many alternatives exist:
  - ID3 uses information gain or gain ratio
  - CART uses the Gini criterion (not discussed here)
- Q: Why not simply use accuracy as a criterion?

  [80-, 20+] split by A1 into [40-, 0+] and [40-, 20+]
  [80-, 20+] split by A2 into [40-, 10+] and [40-, 10+]
  How would (a) accuracy, (b) information gain rate these splits?
Heuristics compared
- Good heuristics are strictly concave
[Figure: splitting criteria plotted as a function of the class proportion p; good heuristics give strictly concave curves]
Why concave functions?
Assume a node with size n, entropy E and proportion of positives p is split into 2 nodes with (n1, E1, p1) and (n2, E2, p2). We have p = (n1/n)·p1 + (n2/n)·p2, and the new average entropy E' = (n1/n)·E1 + (n2/n)·E2 is therefore found by linear interpolation between (p1,E1) and (p2,E2) at p. The gain is the difference in height between (p, E) and (p, E'); for a strictly concave curve this difference is positive whenever p1 ≠ p2.
[Figure: the entropy curve with the chord from (p1,E1) to (p2,E2); E' lies on the chord, below the curve point (p,E)]
Handling missing values
- What if the result of a test is unknown for an example?
  - e.g. because the value of an attribute is unknown
- Some possible solutions, when training:
  - guess the value: just take the most common value (among all examples, among the examples in this node / class, ...)
  - assign the example partially to the different branches
    - e.g. it counts for 0.7 in the "yes" subtree, 0.3 in the "no" subtree
- When using the tree for prediction:
  - assign the example partially to the different branches
  - combine the predictions of the different branches
Generic TDIDT algorithm

function TDIDT(E: set of examples) returns tree:
    T' := grow_tree(E)
    T := prune(T')
    return T

function grow_tree(E: set of examples) returns tree:
    T := generate_tests(E)
    t := best_test(T, E)
    P := partition induced on E by t
    if stop_criterion(E, P)
    then return leaf(info(E))
    else
        for all Ej in P: tj := grow_tree(Ej)
        return node(t, {(j, tj)})
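A compact, runnable Python instantiation of this scheme for classification, as a sketch only (assumptions: no pruning; stop when a node is pure or no attributes remain; equality tests on symbolic attributes; all names are ours):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def grow_tree(examples, attributes):
    # examples: list of (attribute_dict, label) pairs
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:      # stop_criterion
        return Counter(labels).most_common(1)[0][0]  # leaf: info = mode
    def remainder(a):                                # weighted entropy after testing a
        parts = {}
        for attrs, label in examples:
            parts.setdefault(attrs[a], []).append(label)
        return sum(len(p) / len(examples) * entropy(p) for p in parts.values())
    best = min(attributes, key=remainder)            # best_test: maximal information gain
    children = {}
    for attrs, label in examples:
        children.setdefault(attrs[best], []).append((attrs, label))
    rest = [a for a in attributes if a != best]
    return (best, {v: grow_tree(sub, rest) for v, sub in children.items()})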
For classification...
- prune: e.g. reduced-error pruning, ...
- generate_tests: Attr = val, Attr < val, ...
  - for numeric attributes: generate the values val
- best_test: Gain, Gain Ratio, ...
- stop_criterion: MDL, significance test (e.g. χ²-test), ...
- info: the most frequent class ("mode")
- Popular systems: C4.5 (Quinlan 1993), C5.0 (www.rulequest.com)
For regression...
- change:
  - best_test: e.g. minimize average variance
  - info: the mean
  - stop_criterion: significance test (e.g., F-test), ...

  Example: target values {1,3,4,7,8,12}, split by A1 into {1,4,12} and {3,7,8}, or by A2 into {1,3,7} and {4,8,12}
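As a quick hand computation of the variance heuristic on this example (assuming the subsets belong to the tests as listed above): splitting by A1 gives variances of about 21.6 for {1,4,12} and 4.7 for {3,7,8}, an average of about 13.1; splitting by A2 gives about 6.2 for {1,3,7} and 10.7 for {4,8,12}, an average of about 8.4. A2 would therefore be preferred.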
CART
- Classification And Regression Trees (Breiman et al., 1984)
- Classification: info = mode, best_test = Gini
- Regression: info = mean, best_test = variance
- prune: "error-complexity" pruning
  - a penalty α for each node
  - the higher α, the smaller the tree will be
  - the optimal α is obtained empirically (cross-validation)
n-dimensional target spaces
- Instead of predicting 1 number, predict a vector of numbers
  - info: mean vector
  - best_test: variance (mean squared distance) in the n-dimensional space
  - stop_criterion: F-test
- mixed vectors (numbers and symbols)?
  - use an appropriate distance measure
  - -> "clustering trees"
Clustering tree
(the river water quality clustering tree shown earlier)
To remember
- Decision trees: their representational power
- The generic TDIDT algorithm and how to instantiate its parameters
- Search through the hypothesis space, bias, tree-to-rule conversion
- For classification trees: details on heuristics, handling missing values, pruning, ...
- Some general concepts: overfitting, Occam's razor
4 Neural networks
- (Brief summary; studied in detail in other courses)
- Basic principle of artificial neural networks
- Perceptrons and multi-layer neural networks
- Properties
- See Mitchell, Ch. 4
Artificial neural networks
- Modelled after biological neural systems
  - complex systems built from very simple units
  - 1 unit = neuron
    - has multiple inputs and outputs, connecting the neuron to other neurons
    - when the input signal is sufficiently strong, the neuron "fires" (i.e., propagates the signal)
- An ANN consists of
  - neurons
  - connections between them
    - these connections have weights associated with them
  - input and output
- ANNs can learn to associate inputs to outputs by adapting the weights
- For instance (classification):
  - inputs: pixels of a photo
  - outputs: classification of the photo (person? tree? ...)
Perceptrons
- Simplest type of neural network
- A perceptron simulates 1 neuron
  - it fires if the sum of (inputs × weights) exceeds some threshold
- Schematically: inputs x1, ..., x5 with weights w1, ..., w5; the unit computes X = Σ wi·xi and passes X through a threshold function: Y = -1 if X < t, Y = 1 otherwise
2-input perceptron
- represent the inputs in 2-D space
- the perceptron learns a function of the following form: if aX + bY > c then 1, else -1
- i.e., it creates a linear separation between the classes + and -
[Figure: a line separating the +1 region from the -1 region]
n-input perceptrons
- In general, perceptrons construct a hyperplane in an n-dimensional space
  - one side of the hyperplane: +, the other side: -
- Hence, classes must be linearly separable, otherwise the perceptron cannot learn them
- E.g. learning boolean functions:
  - encode true/false as +1, -1
  - is there a perceptron that encodes 1. A and B? 2. A or B? 3. A xor B? (see the sketch below)
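A minimal sketch of a threshold unit with hand-picked weights for the first two questions (the weights and names are illustrative choices; no weights work for XOR, since it is not linearly separable, hence the multi-layer networks below):

def perceptron(w1, w2, t):
    """Threshold unit: fires (+1) if w1*a + w2*b exceeds threshold t."""
    return lambda a, b: 1 if w1 * a + w2 * b > t else -1

AND = perceptron(1, 1, 1)    # fires only if both inputs are +1
OR  = perceptron(1, 1, -1)   # fires if at least one input is +1

for a in (1, -1):
    for b in (1, -1):
        print(a, b, AND(a, b), OR(a, b))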
Multi-layer networks
- Increase representation power by combining neurons in a network
[Figure: a 2-layer network: inputs X and Y feed hidden neurons 1 and 2, whose outputs feed an output neuron; the connection weights shown are +1 and -1]
- Use a sigmoid function instead of a crisp threshold
  - the output changes continuously instead of in 1 step
  - this has advantages for training multi-layer networks
[Figure: the same unit as before, with the step function replaced by a sigmoid]
- The non-linear sigmoid function causes non-linear decision surfaces
  - e.g., 5 areas for 5 classes a, b, c, d, e
- Very powerful representation
[Figure: input space carved into 5 regions labelled a-e]
- Note: the previous network had 2 layers of neurons
- Layered feedforward neural networks:
  - neurons organised in n layers
  - each layer has the output of the previous layer as input
  - neurons of successive layers fully interconnected
  - successive layers compute different representations of the input
- 2-layer feedforward networks are very popular
- but many other architectures are possible!
  - e.g. recurrent NNs
- Example: a 2-layer net representing the ID (identity) function
  - 8 input patterns, each mapped to the same pattern in the output
  - the network converges to a binary representation in the hidden layer, for instance: 1 = 101, 2 = 100, 3 = 011, 4 = 111, 5 = 000, 6 = 010, 7 = 110, 8 = 001
Training neural networks
- Trained by adapting the weights
- Popular algorithm: backpropagation
  - minimizes the error through gradient descent
  - principle: the output error of a layer is attributed to
    - 1. the weights of the connections in that layer
      - adapt these weights
    - 2. the inputs of that layer (except for the first layer)
      - backpropagate the error to these inputs
      - now use the same principle to adapt the weights of the previous layer
- Iterative process, may be slow
Properties of neural networks
- Useful for modelling complex, non-linear functions of numerical inputs and outputs
  - symbolic inputs/outputs are representable using some encoding, cf. true/false = +1/-1
- 2- or 3-layer networks can approximate a huge class of functions (if there are enough neurons in the hidden layers)
- Robust to noise
  - but: risk of overfitting! (because of the high expressiveness)
    - may happen when training for too long
    - usually handled using e.g. validation sets
- All inputs have some effect
  - cf. decision trees: selection of the most important attributes
- The explanatory power of ANNs is limited
  - the model is represented as weights in the network
  - there is no simple explanation why the network makes a certain prediction
  - contrast with e.g. trees, which can give the rule that was used
- Hence, ANNs are good when
  - input and output are high-dimensional (numeric or symbolic)
  - interpretability of the model is unimportant
- Examples:
  - typical: image recognition, speech recognition, ...
    - e.g. images: one input per pixel
    - see http://www.cs.cmu.edu/~tom/faces.html for an illustration
  - less typical: symbolic problems
    - cases where e.g. trees would work too
    - the performance of networks and trees is then often comparable
To remember
- Perceptrons, neural networks
- inspiration
- what they are
- how they work
- representation power
- explanatory power