Title: Machine Learning and Inductive Inference
Machine Learning and Inductive Inference
- Hendrik Blockeel, 2001-2002
1 Introduction
- Practical information
- What is "machine learning and inductive
inference"? - What is it useful for? (some example
applications) - Different learning tasks
- Data representation
- Brief overview of approaches
- Overview of the course
Practical information about the course
- 10 lectures (2h), 4 exercise sessions (2.5h)
- Audience with diverse backgrounds
- Course material
  - Book: Machine Learning (Mitchell, 1997, McGraw-Hill)
  - Slides + notes: http://www.cs.kuleuven.ac.be/hendrik/ML/
- Examination
  - oral exam (20') with written preparation (+/- 2h)
  - 2/3 theory, 1/3 exercises
  - only topics discussed in lectures / exercises
What is machine learning?
- Study of how to make programs improve their performance on certain tasks from their own experience
  - "performance" = speed, accuracy, ...
  - "experience" = set of previously seen cases ("observations")
- For instance (simple method):
  - experience: taking action A in situation S yielded result R
  - situation S arises again
    - if R was undesirable: try something else
    - if R was desirable: try action A again
- This is a very simple example
  - it only works if precisely the same situation is encountered
  - what if a similar situation arises? -> Need for generalisation
  - how about choosing another action even if a good one is already known? (you might find a better one) -> Need for exploration
- This course focuses mostly on generalisation, or inductive inference
Inductive inference
- Reasoning from specific to general
- e.g. statistics: from a sample, infer properties of the population
  - observation: "these dogs are all brown"
  - hypothesis: "all dogs are brown"
- Note: inductive inference is more general than statistics
  - statistics mainly consists of numerical methods for inference
    - infer mean, probability distribution, ... of a population
  - other approaches:
    - find a symbolic definition of a concept (concept learning)
    - find laws with complicated structure that govern the data
    - study induction from a logical, philosophical point of view
- Applications of inductive inference
  - Machine learning
    - "sample" of observations = experience
    - generalizing to population = finding patterns in the observations that generally hold and may be used for future tasks
  - Knowledge discovery (Data mining)
    - "sample" = database
    - generalizing = finding patterns that hold in this database and can also be expected to hold on similar data not in the database
    - discovered knowledge = comprehensible description of these patterns
  - ...
What is it useful for?
- Scientifically: for understanding learning and intelligence in humans and animals
  - interesting for psychologists, philosophers, biologists, ...
- More practically:
  - for building AI systems
    - expert systems that improve automatically with time
    - systems that help scientists discover new laws
  - also useful outside classical AI-like applications
    - when we don't know how to program something ourselves
    - when a program should adapt regularly to new circumstances
    - when a program should tune itself towards its user
Knowledge discovery
- Scientific knowledge discovery
- Some toy examples:
  - BACON rediscovered some laws of physics (e.g. Kepler's laws of planetary motion)
  - AM rediscovered some mathematical theorems
- More serious recent examples:
  - mining the human genome
  - mining the web for information on genes, proteins, ...
  - drug discovery
    - context: robots perform lots of experiments at a high rate; this yields lots of data, to be studied and interpreted by humans; try to automate this process (because humans can't keep up with the robots)
Example: given molecules that are active against some disease, find out what is common to them; this is probably the reason for their activity.
- Data mining in databases: looking for interesting patterns
  - e.g. for marketing:
    - based on the data in the DB, who should be interested in this new product? (useful for direct mailing)
    - study customer behaviour to identify typical profiles of customers
    - find out which products in a store are often bought together
  - e.g. in a hospital: help with the diagnosis of patients
Learning to perform difficult tasks
- Difficult for humans
  - the LEX system learned how to perform symbolic integration of functions
- or easy for humans, but difficult to program
  - humans can do it, but can't explain how they do it
  - e.g.
    - learning to play games (chess, go, ...)
    - learning to fly a plane, drive a car, ...
    - recognising faces
Adaptive systems
- Robots in a changing environment
  - continuously need to adapt their behaviour
- Systems that adapt to the user
  - based on user modelling:
    - observe the behaviour of the user
    - build a model describing this behaviour
    - use the model to make the user's life easier
  - e.g. adaptive web pages, intelligent mail filters, adaptive user interfaces (e.g. an intelligent Unix shell), ...
Illustration: building a system that learns checkers
- Learning = improving on task T, with respect to performance measure P, based on experience E
- In this example:
  - T: playing checkers
  - P: % of games won in the world tournament
  - E: games played against self
    - possible problem: is this experience representative for the real task?
- Questions to be answered:
  - exactly what is given, exactly what is learnt, and what representation and learning algorithm should we use?
- What do we want to learn?
  - given a board situation, which move to make
- What is given?
  - direct or indirect evidence?
    - direct: e.g., which moves were good, which were bad
    - indirect: consecutive moves in a game, outcome of the game
  - in our case: indirect evidence
    - direct evidence would require a teacher
- What exactly shall we learn?
- Choose the type of target function:
  - ChooseMove: Board -> Move?
    - directly applicable
  - V: Board -> R?
    - indicates the quality of a state
    - when playing, choose the move that leads to the best state
- Note: a reasonable definition for V is easy to give:
  - V(won) = 100, V(lost) = -100, V(draw) = 0, V(s) = V(e) with e the best state reachable from s when playing optimally
  - not feasible in practice (exhaustive minimax search)
- Let's choose the V function here
- Choose a representation for the target function
  - set of rules?
  - neural network?
  - polynomial function of numerical board features?
  - ...
- Let's choose: V = w1·bp + w2·rp + w3·bk + w4·rk + w5·bt + w6·rt
  - bp, rp: number of black / red pieces
  - bk, rk: number of black / red kings
  - bt, rt: number of black / red pieces threatened
  - wi: constants to be learnt from experience
- How to obtain training examples?
  - we need a set of examples (bp, rp, bk, rk, bt, rt, V)
  - bp etc. are easy to determine, but how to guess V?
    - we have indirect evidence only!
  - possible method:
    - with V(s) the true target function, V̂(s) the learnt function, Vt(s) the training value for a state s:
    - Vt(s) <- V̂(successor(s))
    - adapt V̂ using the Vt values (making V̂ and Vt converge)
    - hope that V̂ will converge to V
  - intuitively: V for end states is known; propagate V values from later states to earlier states in the game
- Training algorithm: how to adapt the weights wi?
- possible method:
  - look at the error: error(s) = Vt(s) - V̂(s)
  - adapt the weights so that the error is reduced
  - e.g. using a gradient descent method:
    - for each feature fi: wi <- wi + c · fi · error(s), with c some small constant
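As an illustration, here is a minimal Python sketch of this weight update (Mitchell calls it the LMS update); the feature values, the constant c, and all names are ours:

def v_hat(weights, features):
    """Learnt evaluation function: linear in the board features."""
    return sum(w * f for w, f in zip(weights, features))

def lms_update(weights, features, v_train, c=0.01):
    """One gradient-descent step: move each weight so that
    v_hat(features) gets closer to the training value v_train."""
    error = v_train - v_hat(weights, features)
    return [w + c * f * error for w, f in zip(weights, features)]

# Example: a board with 4 black pieces, 3 red pieces, 1 black king,
# 0 red kings, 2 black and 1 red pieces threatened.
weights = [0.0] * 6            # w1..w6, to be learnt
features = [4, 3, 1, 0, 2, 1]  # (bp, rp, bk, rk, bt, rt)
v_train = 10.0                 # training value, e.g. V̂ of the successor state
weights = lms_update(weights, features, v_train)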
Overview of design choices
- determine the type of training experience: games against self / games against expert / table of good moves
- determine the type of target function: Board -> Move / Board -> R
- determine the representation: linear function of 6 features
- determine the learning algorithm: gradient descent
- ready!
Some issues that influence choices
- Which algorithms are useful for what type of functions?
- How is learning influenced by
  - the training examples
  - the complexity of the hypothesis (function) representation
  - noise in the data
- Theoretical limits of learning?
- Can we help the learner with prior knowledge?
- Could a system alter its representation itself?
Typical learning tasks
- Concept learning
  - learn a definition of a concept
  - supervised vs. unsupervised
- Function learning ("predictive modelling")
  - discrete ("classification") or continuous ("regression")
  - concept = function with boolean result
- Clustering
- Finding descriptive patterns
Concept learning: supervised
- Given positive (+) and negative (-) examples of a concept, infer the properties that cause instances to be positive or negative (= concept definition)
[Figure: + and - examples as points in an instance space X, with a region C containing exactly the + examples; C: X -> {true, false}]
Concept learning: unsupervised
- Given examples of instances
- Invent reasonable concepts (= clustering)
- Find definitions for these concepts
[Figure: unlabelled points in an instance space X grouped into clusters C1, C2, C3]
- Cf. taxonomy of animals, identification of market segments, ...
Function learning
- Generalises over concept learning
- Learn a function f: X -> S where
  - S is a finite set of values: classification
  - S is a continuous range of reals: regression
[Figure: instances in X labelled with numeric values (0.6, 0.9, 1.4, 2.1, 2.7); f maps regions of X to these values]
Clustering
- Finding groups of instances that are similar
- May be a goal in itself (unsupervised classification)
- ... but also used for other tasks
  - regression
  - flexible prediction: when it is not known in advance which properties to predict from which other properties
[Figure: unlabelled points in an instance space X forming visible groups]
Finding descriptive patterns
- Descriptive patterns: any kind of patterns, not necessarily directly useful for prediction
- Generalises over predictive modelling (= finding predictive patterns)
- Examples of patterns:
  - "fast cars usually cost more than slower cars"
  - "people are never married to more than one person at the same time"
Representation of data
- Numerical data: instances are points in Rⁿ
  - many techniques focus on this kind of data
- Symbolic data (true/false, black/white/red/blue, ...)
  - can be converted to numeric data
  - some techniques work directly with symbolic data
- Structural data
  - instances have internal structure (graphs, sets, ...; cf. molecules)
  - difficult to convert to a simpler format
  - few techniques can handle these directly
Brief overview of approaches
- Symbolic approaches
  - version spaces, induction of decision trees, induction of rule sets, inductive logic programming, ...
- Numeric approaches
  - neural networks, support vector machines, ...
- Probabilistic approaches (Bayesian learning)
- Miscellaneous
  - instance-based learning, genetic algorithms, reinforcement learning
Overview of the course
- Introduction (today) (Ch. 1)
- Concept learning: version spaces (Ch. 2 - brief)
- Induction of decision trees (Ch. 3)
- Artificial neural networks (Ch. 4 - brief)
- Evaluating hypotheses (Ch. 5)
- Bayesian learning (Ch. 6)
- Computational learning theory (Ch. 7)
- Support vector machines (brief)
- Instance-based learning (Ch. 8)
- Genetic algorithms (Ch. 9)
- Induction of rule sets; association rules (Ch. 10)
- Reinforcement learning (Ch. 13)
- Clustering
- Inductive logic programming
- Combining different models
- bagging, boosting, stacking, ...
2 Version Spaces
- Recall the basic principles from the AI course
  - stressing important concepts for later use
- Difficulties with version space approaches
- Inductive bias
- See Mitchell, Ch. 2
Basic principles
- Concept learning as search
  - given a hypothesis space H and a data set S
  - find all h ∈ H consistent with S
  - this set is called the version space, VS(H,S)
- How to search in H?
  - enumerate all h in H: not feasible
  - prune the search using some generality ordering:
    - h1 more general than h2 <=> ∀x (x ∈ h2 => x ∈ h1)
  - see Mitchell, Chapter 2, for examples
An example
- + belongs to the concept, - does not
- S = set of these + and - examples
- Assume hypotheses are rectangles
  - i.e., H = set of all rectangles
- VS(H,S) = set of all rectangles that contain all + and no -
- Example of a consistent hypothesis: the green rectangle (in the figure), containing all + and no -
- h1 more general than h2 <=> h2 lies totally inside h1
- h2 more specific than h1; h3 incomparable with h1
[Figure: three rectangles h1, h2, h3, with h2 inside h1 and h3 partially overlapping h1]
Version space boundaries
- Bound the version space by giving its most specific (S) and most general (G) borders
  - S: rectangles that cannot become smaller without excluding some +
  - G: rectangles that cannot become larger without including some -
- Any hypothesis h consistent with the data
  - must be more general than some element in S
  - must be more specific than some element in G
- Thus, G and S completely specify the VS
Example, continued
- So what are S and G here?
- S = {h1}, G = {h2, h3}
Computing the version space
- Computing G and S is sufficient to know the full version space
- Algorithms: see Mitchell's book
  - Find-S computes only the S set (a sketch follows below)
    - S is always a singleton in Mitchell's examples
  - Candidate Elimination computes S and G
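For the rectangle representation used in the demonstration below, a Find-S-style computation of S is easy to sketch in Python. This is an illustration only, with our own names; Mitchell's Find-S is stated for conjunctions of attribute constraints, of which these intervals are one instance:

def covers(h, example):
    """True if hypothesis h (a rectangle) contains the example (x, y)."""
    if h is None:                      # None stands for the empty hypothesis
        return False
    xmin, xmax, ymin, ymax = h
    x, y = example
    return xmin <= x <= xmax and ymin <= y <= ymax

def more_general(h1, h2):
    """h1 is more general than (or equal to) h2 iff h2 lies inside h1."""
    if h2 is None:
        return True
    if h1 is None:
        return False
    return (h1[0] <= h2[0] and h1[1] >= h2[1] and
            h1[2] <= h2[2] and h1[3] >= h2[3])

def find_s(positives):
    """Most specific rectangle covering all positive examples: grow the
    hypothesis minimally each time a positive is not yet covered."""
    h = None
    for (x, y) in positives:
        if h is None:
            h = (x, x, y, y)
        elif not covers(h, (x, y)):
            h = (min(h[0], x), max(h[1], x), min(h[2], y), max(h[3], y))
    return h

print(find_s([(3, 2), (5, 3)]))   # -> (3, 5, 2, 3), i.e. <3-5, 2-3>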
Candidate Elimination algorithm: demonstration with rectangles
- Algorithm: see Mitchell
- Representation:
  - concepts are rectangles
  - a rectangle is represented with 2 attributes: <Xmin-Xmax, Ymin-Ymax>
  - graphical representation
- A hypothesis is consistent with the data if
  - all + are inside the rectangle
  - no - is inside the rectangle
[Figure: an empty grid (x and y coordinates 1-6) on which the examples will appear]
S = {<∅,∅>}   G = {<1-6, 1-6>}
- Example e1 (+) appears at (3,2), not covered by S; S is extended to cover e1:
  S = {<3-3, 2-2>}   G = {<1-6, 1-6>}
- Example e2 (-) appears, covered by G; G is changed to avoid covering e2
  - note: G now consists of 2 parts; each part covers all + and no -:
  S = {<3-3, 2-2>}   G = {<1-4, 1-6>, <1-6, 1-3>}
- Example e3 (-) appears, covered by G; one part of G is affected and reduced:
  S = {<3-3, 2-2>}   G = {<3-4, 1-6>, <1-6, 1-3>}
- Example e4 (+) appears, not covered by S; S is extended to cover e4:
  S = {<3-5, 2-3>}   G = {<3-4, 1-6>, <1-6, 1-3>}
- The part of G not covering the new S is removed:
  S = {<3-5, 2-3>}   G = {<1-6, 1-3>}
- The current version space contains all rectangles covering S and covered by G, e.g. h = <2-5, 2-3>:
  S = {<3-5, 2-3>}   G = {<1-6, 1-3>}
- Interesting points:
  - we here use an extended notion of generality
    - in the book: ∅ < (single value) < ? (anything)
    - here e.g.: ∅ < 2-3 < 2-5 < 1-5 < ?
  - we still use a conjunctive concept definition
    - each concept is 1 rectangle
    - this could be extended as well (but it gets complicated)
Difficulties with version space approaches
- The idea of a VS provides a nice theoretical framework
- But it is not very useful for most practical problems
- Difficulties with these approaches:
  - not very efficient
    - the borders G and S may be very large (they may grow exponentially)
  - not noise resistant
    - the VS collapses when no consistent hypothesis exists
    - often we would like to find the best hypothesis in this case
  - in Mitchell's examples: only conjunctive definitions
- We will compare with other approaches...
Inductive bias
- After having seen a limited number of examples, we believe we can make predictions for unseen cases.
- From seen cases to unseen cases: the inductive leap
- Why do we believe this? Is there any guarantee this prediction will be correct? What extra assumptions do we need to guarantee correctness?
- Inductive bias: the minimal set of extra assumptions that guarantees the correctness of the inductive leap
Equivalence between inductive and deductive systems
- inductive system: (training examples, new instance) -> result, by inductive leap
- deductive system: (training examples, new instance, inductive bias) -> result, by proof
Definition of inductive bias
- More formal definition of inductive bias (Mitchell):
  - L(x,D) denotes the classification assigned to instance x by learner L after training on D
  - the inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples D:
    ∀x ∈ X: (B ∧ D ∧ x) ⊢ L(x,D)
Effect of inductive bias
- Different learning algorithms give different results on the same data set because each may have a different bias
- A stronger bias means less learning
  - more is assumed in advance
- Is learning possible without any bias at all?
  - i.e., pure learning, without any assumptions in advance
  - the answer is no.
Inductive bias of version spaces
- Bias of the candidate elimination algorithm: the target concept is in H
- H typically consists of conjunctive concepts
  - in our previous illustration: rectangles
- H could be extended towards disjunctive concepts
- Is it possible to use version spaces with H = the set of all imaginable concepts, thereby eliminating all bias?
Unbiased version spaces
- Let U be the example domain
- Unbiased: the target concept C can be any subset of U
  - hence, H = 2^U
- Consider VS(H,D) with D a strict subset of U
- Assume you see an unseen instance x (x ∈ U \ D)
- For each h ∈ VS that predicts x ∈ C, there is an h' ∈ VS that predicts x ∉ C, and vice versa
  - just take h' = h ∪ {x}; since x ∉ D, h and h' are exactly the same w.r.t. D, so either both are in the VS, or neither is
- Conclusion: version spaces without any bias do not allow generalisation
- To be able to make an inductive leap, some bias is necessary.
- We will see many different learning algorithms that all differ in their inductive bias.
- When choosing one in practice, bias should be an important criterion
  - unfortunately it is not always well understood
To remember
- Definition of version space; importance of the generality ordering for searching
- Definition of inductive bias; its practical importance; why it is necessary for learning; how it relates inductive systems to deductive systems
3 Induction of decision trees
- What are decision trees?
- How can they be induced automatically?
  - top-down induction of decision trees
  - avoiding overfitting
  - converting trees to rules
  - alternative heuristics
  - a generic TDIDT algorithm
- See Mitchell, Ch. 3
What are decision trees?
- They represent sequences of tests
  - according to the outcome of a test, perform a new test
  - continue until the result is known
- Cf. guessing a person using only yes/no questions:
  - ask some question
  - depending on the answer, ask a new question
  - continue until the answer is known
Example decision tree 1
- Mitchell's example: play tennis or not? (depending on the weather conditions)

  Outlook?
  - Sunny -> Humidity?
    - High -> No
    - Normal -> Yes
  - Overcast -> Yes
  - Rainy -> Wind?
    - Strong -> No
    - Weak -> Yes
Example decision tree 2
- Again from Mitchell: a tree for predicting whether a C-section is necessary
- Leaves are not pure here: the ratio pos/neg is given
[Tree: root Fetal_Presentation with branches 1, 2, 3; branch 1 leads to Previous_Csection with branches 0, 1; branch 0 leads to Primiparous, ...; leaves show class ratios, e.g. [3+, 29-] = .11+ .89-, [8+, 22-] = .27+ .73-, [55+, 35-] = .61+ .39-]
Representation power
- Typically:
  - examples are represented by an array of attributes
  - 1 node in the tree tests the value of 1 attribute
  - 1 child node for each possible outcome of the test
  - leaf nodes assign a classification
- Note:
  - a tree can represent any boolean function
    - i.e., also disjunctive concepts (contrast with the VS examples)
  - a tree can allow for noise (non-pure leaves)
Representing boolean formulae
- E.g., A ∨ B:

  A?
  - true -> true
  - false -> B?
    - true -> true
    - false -> false

- Similarly (try yourself): A ∧ B, A xor B, (A ∧ B) ∨ (C ∧ ¬D ∧ E)
- M of N (at least M out of N propositions are true)
- What about the complexity of the tree vs. the complexity of the original formula?
Classification, regression and clustering trees
- Classification trees represent a function X -> C with C discrete (like the decision trees we just saw)
- Regression trees predict numbers in the leaves
  - could use a constant (e.g., the mean), or a linear regression model, or ...
- Clustering trees just group examples in the leaves
- Most (but not all) research in machine learning focuses on classification trees
Example decision tree 3 (from a study of river water quality)
- "Data mining" application
- Given: descriptions of river water samples
  - biological description: occurrence of organisms in the water (abundance, graded 0-5)
  - chemical description: 16 variables (temperature, concentrations of chemicals (NH4, ...))
- Question: characterize the chemical properties of the water using the organisms that occur
Clustering tree

  abundance(Tubifex sp., 5)?
  - yes -> T = 0.357111, pH = -0.496808, cond = 1.23151, O2 = -1.09279, O2sat = -1.04837, CO2 = 0.893152, hard = 0.988909, NO2 = 0.54731, NO3 = 0.426773, NH4 = 1.11263, PO4 = 0.875459, Cl = 0.86275, SiO2 = 0.997237, KMnO4 = 1.29711, K2Cr2O7 = 0.97025, BOD = 0.67012
  - no -> abundance(Sphaerotilus natans, 5)?
    - yes -> T = 0.0129737, pH = -0.536434, cond = 0.914569, O2 = -0.810187, O2sat = -0.848571, CO2 = 0.443103, hard = 0.806137, NO2 = 0.4151, NO3 = -0.0847706, NH4 = 0.536927, PO4 = 0.442398, Cl = 0.668979, SiO2 = 0.291415, KMnO4 = 1.08462, K2Cr2O7 = 0.850733, BOD = 0.651707
    - no -> abundance(...)?

  (values are "standardized": how many standard deviations above the mean)
Top-down induction of decision trees
- Basic algorithm for TDIDT (later: a more formal version)
  - start with the full data set
  - find the test that partitions the examples as well as possible
    - "good" = examples with the same class, or otherwise similar examples, should be put together
  - for each outcome of the test, create a child node
  - move the examples to the children according to the outcome of the test
  - repeat the procedure for each child that is not "pure"
- Main question: how to decide which test is "best"
Finding the best test (for classification trees)
- For classification trees: find the test for which the children are as pure as possible
- Purity measure borrowed from information theory: entropy
  - entropy is a measure of missing information; more precisely, the number of bits needed to represent the missing information, on average, using an optimal encoding
- Given a set S with instances belonging to class i with probability pi:
  Entropy(S) = - Σi pi log2 pi
Entropy
- Intuitive reasoning:
  - use a shorter encoding for more frequent messages
  - information theory: a message with probability p should get -log2(p) bits
  - e.g. A, B, C, D, each with 25% probability: 2 bits for each (00, 01, 10, 11)
  - if some are more probable, it is possible to do better
  - the average number of bits for a message is then - Σi pi log2 pi
Entropy
- Entropy as a function of p, for 2 classes
[Figure: the entropy curve, 0 at p = 0 and p = 1, maximal (1 bit) at p = 0.5]
Information gain
- Heuristic for choosing a test in a node:
  - choose the test that on average provides the most information about the class
  - this is the test that, on average, reduces the class entropy most
    - ("on average": the class entropy reduction differs according to the outcome of the test)
  - expected reduction of entropy = information gain (a code sketch follows below):
    Gain(S,A) = Entropy(S) - Σv (|Sv|/|S|) Entropy(Sv)
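A small Python sketch of these two formulas (an illustration only; the names are ours, and `split` maps each outcome of a test A to the list of class labels of the examples with that outcome):

from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the class distribution of S."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(labels, split):
    """Gain(S,A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    remainder = sum(len(sv) / n * entropy(sv) for sv in split.values())
    return entropy(labels) - remainder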
Example
- Assume S has 9 + and 5 - examples; partition according to the Wind or the Humidity attribute

  S = [9+, 5-]
  - Humidity = High -> [3+, 4-];  Humidity = Normal -> [6+, 1-]
  - Wind = Weak -> [6+, 2-];  Wind = Strong -> [3+, 3-]
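With the sketch above, the numbers for this example work out as follows (they agree with Mitchell's):

S = ['+'] * 9 + ['-'] * 5                   # Entropy(S) ≈ 0.940
humidity = {'High':   ['+']*3 + ['-']*4,    # entropy ≈ 0.985
            'Normal': ['+']*6 + ['-']*1}    # entropy ≈ 0.592
wind = {'Weak':   ['+']*6 + ['-']*2,        # entropy ≈ 0.811
        'Strong': ['+']*3 + ['-']*3}        # entropy = 1.0
print(information_gain(S, humidity))        # ≈ 0.151
print(information_gain(S, wind))            # ≈ 0.048 -> Humidity is the better test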
- Assume Outlook was chosen: continue partitioning in the child nodes

  [9+, 5-]: Outlook?
  - Sunny -> [2+, 3-]: ?
  - Overcast -> [4+, 0-]: Yes
  - Rainy -> [3+, 2-]: ?
Hypothesis space search in TDIDT
- Hypothesis space H = set of all trees
- H is searched in a hill-climbing fashion, from simple to complex
Inductive bias in TDIDT
- Note: for e.g. boolean attributes, H is complete: each concept can be represented!
  - given n attributes, we can keep on adding tests until all attributes are tested
- So what about inductive bias?
  - clearly no restriction bias (H corresponds to 2^U) as in candidate elimination
  - preference bias: some hypotheses in H are preferred over others
  - in this case: preference for short trees with informative attributes at the top
Occam's Razor
- A preference for simple models over complex models is quite generally used in machine learning
- Similar principle in science: Occam's Razor
  - roughly: do not make things more complicated than necessary
- Reasoning, in the case of decision trees: more complex trees have a higher probability of overfitting the data set
Avoiding overfitting
- Phenomenon of overfitting:
  - keep improving a model, making it better and better on the training set by making it more complicated
  - this increases the risk of modelling noise and coincidences in the data set
  - it may actually harm the predictive power of the theory on unseen cases
- Cf. fitting a curve with too many parameters
[Figure: data points fitted by a wildly oscillating curve]
Overfitting: example
[Figure: + and - training examples with an overly specific decision boundary that fits every example, including likely noise]
Overfitting: effect on predictive accuracy
- Typical phenomenon when overfitting:
  - training accuracy keeps increasing
  - accuracy on an unseen validation set starts decreasing
[Figure: accuracy vs. size of tree; accuracy on the training data keeps rising, accuracy on unseen data starts dropping at the point where overfitting starts]
How to avoid overfitting when building classification trees?
- Option 1:
  - stop adding nodes to the tree when overfitting starts occurring
  - this needs a stopping criterion
- Option 2:
  - don't bother about overfitting when growing the tree
  - after the tree has been built, start pruning it again
Stopping criteria
- How do we know when overfitting starts?
  - a) use a validation set: data not considered for choosing the best test
    - when accuracy goes down on the validation set: stop adding nodes to this branch
  - b) use some statistical test
    - significance test: e.g., is the change in class distribution still significant? (χ²-test)
    - MDL: minimal description length principle
      - fully correct theory = tree + corrections for specific misclassifications
      - minimize size(fully correct theory) = size(tree) + size(misclassifications(tree))
      - cf. Occam's razor
Post-pruning trees
- After learning the tree: start pruning branches away
  - for all nodes in the tree:
    - estimate the effect of pruning the tree at this node on predictive accuracy
      - e.g. using accuracy on a validation set
  - prune the node that gives the greatest improvement
  - continue until no improvements
- Note: this pruning constitutes a second search in the hypothesis space
[Figure: the accuracy-vs-tree-size plot again; the effect of pruning is that the tree shrinks and accuracy on unseen data improves again]
Comparison
- Advantage of Option 1: no superfluous work
- But: tests may be misleading
  - e.g., validation accuracy may go down briefly, then go up again
- Therefore, Option 2 (post-pruning) is usually preferred (though it is more work, computationally)
Turning trees into rules
- From a tree, a rule set can be derived
  - each path from root to leaf in the tree yields 1 if-then rule
- Advantages of such rule sets:
  - may increase comprehensibility
  - can be pruned more flexibly
    - in 1 rule, 1 single condition can be removed
      - vs. tree: when removing a node, the whole subtree is removed
    - 1 rule can be removed entirely
Rules from trees: example

  Outlook?
  - Sunny -> Humidity?  (High -> No, Normal -> Yes)
  - Overcast -> Yes
  - Rainy -> Wind?  (Strong -> No, Weak -> Yes)

if Outlook = Sunny and Humidity = High then No
if Outlook = Sunny and Humidity = Normal then Yes
...
Pruning rules
- Possible method:
  - 1. convert the tree to rules
  - 2. prune each rule independently
    - remove conditions that do not harm the accuracy of the rule
  - 3. sort the rules (e.g., most accurate rule first)
    - before pruning: each example is covered by 1 rule
    - after pruning: 1 example might be covered by multiple rules
    - therefore some rules might contradict each other
Pruning rules: example

  Tree representing A ∨ B:
  A?
  - true -> true
  - false -> B?  (true -> true, false -> false)

if A = true then true
if A = false and B = true then true
if A = false and B = false then false

The rules represent A ∨ (¬A ∧ B).
Alternative heuristics for choosing tests
- Attributes with continuous domains (numbers)
  - cannot have a different branch for each possible outcome
  - allow, e.g., a binary test of the form Temperature < 20
- Attributes with many discrete values
  - have an unfair advantage over attributes with few values
    - cf. a question with many possible answers is more informative than a yes/no question
  - to compensate: divide the gain by the maximal potential gain SI (sketched below)
    - Gain Ratio: GR(S,A) = Gain(S,A) / SI(S,A)
    - Split information: SI(S,A) = - Σi (|Si|/|S|) log2(|Si|/|S|), with i ranging over the different results of test A
- Tests may have different costs
  - e.g. medical diagnosis: a blood test, a visual examination, ... have different costs
  - try to find a tree with low expected cost
    - instead of low expected number of tests
  - alternative heuristics, taking cost into account, have been proposed
Properties of good heuristics
- Many alternatives exist:
  - ID3 uses information gain or gain ratio
  - CART uses the Gini criterion (not discussed here)
- Q: Why not simply use accuracy as a criterion?

  [80-, 20+] split by A1 into [40-, 0+] and [40-, 20+]
  [80-, 20+] split by A2 into [40-, 10+] and [40-, 10+]
  How would (a) accuracy, (b) information gain rate these splits?
Heuristics compared
- Good heuristics are strictly concave
[Figure: splitting criteria plotted as a function of the class proportion p; good heuristics give strictly concave curves]
Why concave functions?
Assume a node with size n, entropy E and proportion of positives p is split into 2 nodes with (n1, E1, p1) and (n2, E2, p2). We have p = (n1/n)·p1 + (n2/n)·p2, and the new average entropy E' = (n1/n)·E1 + (n2/n)·E2 is therefore found by linear interpolation between (p1,E1) and (p2,E2) at p. The gain is the difference in height between (p, E) and (p, E'); for a strictly concave curve this difference is positive whenever p1 ≠ p2.
[Figure: the entropy curve with the chord from (p1,E1) to (p2,E2); E' lies on the chord, below the curve point (p,E)]
Handling missing values
- What if the result of a test is unknown for an example?
  - e.g. because the value of an attribute is unknown
- Some possible solutions, when training:
  - guess the value: just take the most common value (among all examples, among the examples in this node / class, ...)
  - assign the example partially to the different branches
    - e.g. it counts for 0.7 in the "yes" subtree, 0.3 in the "no" subtree
- When using the tree for prediction:
  - assign the example partially to the different branches
  - combine the predictions of the different branches
Generic TDIDT algorithm

function TDIDT(E: set of examples) returns tree:
    T' := grow_tree(E)
    T := prune(T')
    return T

function grow_tree(E: set of examples) returns tree:
    T := generate_tests(E)
    t := best_test(T, E)
    P := partition induced on E by t
    if stop_criterion(E, P)
    then return leaf(info(E))
    else
        for all Ej in P: tj := grow_tree(Ej)
        return node(t, {(j, tj)})
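A compact, runnable Python instantiation of this scheme for classification, as a sketch only (assumptions: no pruning; stop when a node is pure or no attributes remain; equality tests on symbolic attributes; all names are ours):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def grow_tree(examples, attributes):
    # examples: list of (attribute_dict, label) pairs
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:      # stop_criterion
        return Counter(labels).most_common(1)[0][0]  # leaf: info = mode
    def remainder(a):                                # weighted entropy after testing a
        parts = {}
        for attrs, label in examples:
            parts.setdefault(attrs[a], []).append(label)
        return sum(len(p) / len(examples) * entropy(p) for p in parts.values())
    best = min(attributes, key=remainder)            # best_test: maximal information gain
    children = {}
    for attrs, label in examples:
        children.setdefault(attrs[best], []).append((attrs, label))
    rest = [a for a in attributes if a != best]
    return (best, {v: grow_tree(sub, rest) for v, sub in children.items()})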
For classification...
- prune: e.g. reduced-error pruning, ...
- generate_tests: Attr = val, Attr < val, ...
  - for numeric attributes: generate the values val
- best_test: Gain, Gain Ratio, ...
- stop_criterion: MDL, significance test (e.g. χ²-test), ...
- info: the most frequent class ("mode")
- Popular systems: C4.5 (Quinlan 1993), C5.0 (www.rulequest.com)
For regression...
- change:
  - best_test: e.g. minimize average variance
  - info: the mean
  - stop_criterion: significance test (e.g., F-test), ...

  Example: target values {1,3,4,7,8,12}, split by A1 into {1,4,12} and {3,7,8}, or by A2 into {1,3,7} and {4,8,12}
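As a quick hand computation of the variance heuristic on this example (assuming the subsets belong to the tests as listed above): splitting by A1 gives variances of about 21.6 for {1,4,12} and 4.7 for {3,7,8}, an average of about 13.1; splitting by A2 gives about 6.2 for {1,3,7} and 10.7 for {4,8,12}, an average of about 8.4. A2 would therefore be preferred.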
CART
- Classification And Regression Trees (Breiman et al., 1984)
- Classification: info = mode, best_test = Gini
- Regression: info = mean, best_test = variance
- prune: "error-complexity" pruning
  - a penalty α for each node
  - the higher α, the smaller the tree will be
  - the optimal α is obtained empirically (cross-validation)
n-dimensional target spaces
- Instead of predicting 1 number, predict a vector of numbers
  - info: mean vector
  - best_test: variance (mean squared distance) in the n-dimensional space
  - stop_criterion: F-test
- mixed vectors (numbers and symbols)?
  - use an appropriate distance measure
  - -> "clustering trees"
Clustering tree
(the river water quality clustering tree shown earlier)
To remember
- Decision trees: their representational power
- The generic TDIDT algorithm and how to instantiate its parameters
- Search through the hypothesis space, bias, tree-to-rule conversion
- For classification trees: details on heuristics, handling missing values, pruning, ...
- Some general concepts: overfitting, Occam's razor
4 Neural networks
- (Brief summary; studied in detail in other courses)
- Basic principle of artificial neural networks
- Perceptrons and multi-layer neural networks
- Properties
- See Mitchell, Ch. 4
Artificial neural networks
- Modelled after biological neural systems
  - complex systems built from very simple units
  - 1 unit = neuron
    - has multiple inputs and outputs, connecting the neuron to other neurons
    - when the input signal is sufficiently strong, the neuron "fires" (i.e., propagates the signal)
- An ANN consists of
  - neurons
  - connections between them
    - these connections have weights associated with them
  - input and output
- ANNs can learn to associate inputs to outputs by adapting the weights
- For instance (classification):
  - inputs: pixels of a photo
  - outputs: classification of the photo (person? tree? ...)
Perceptrons
- Simplest type of neural network
- A perceptron simulates 1 neuron
  - it fires if the sum of (inputs × weights) exceeds some threshold
- Schematically: inputs x1, ..., x5 with weights w1, ..., w5; the unit computes X = Σ wi·xi and passes X through a threshold function: Y = -1 if X < t, Y = 1 otherwise
2-input perceptron
- represent the inputs in 2-D space
- the perceptron learns a function of the following form: if aX + bY > c then 1, else -1
- i.e., it creates a linear separation between the classes + and -
[Figure: a line separating the +1 region from the -1 region]
n-input perceptrons
- In general, perceptrons construct a hyperplane in an n-dimensional space
  - one side of the hyperplane: +, the other side: -
- Hence, classes must be linearly separable, otherwise the perceptron cannot learn them
- E.g. learning boolean functions:
  - encode true/false as +1, -1
  - is there a perceptron that encodes 1. A and B? 2. A or B? 3. A xor B? (see the sketch below)
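A minimal sketch of a threshold unit with hand-picked weights for the first two questions (the weights and names are illustrative choices; no weights work for XOR, since it is not linearly separable, hence the multi-layer networks below):

def perceptron(w1, w2, t):
    """Threshold unit: fires (+1) if w1*a + w2*b exceeds threshold t."""
    return lambda a, b: 1 if w1 * a + w2 * b > t else -1

AND = perceptron(1, 1, 1)    # fires only if both inputs are +1
OR  = perceptron(1, 1, -1)   # fires if at least one input is +1

for a in (1, -1):
    for b in (1, -1):
        print(a, b, AND(a, b), OR(a, b))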
Multi-layer networks
- Increase representation power by combining neurons in a network
[Figure: a 2-layer network: inputs X and Y feed hidden neurons 1 and 2, whose outputs feed an output neuron; the connection weights shown are +1 and -1]
- Use a sigmoid function instead of a crisp threshold
  - the output changes continuously instead of in 1 step
  - this has advantages for training multi-layer networks
[Figure: the same unit as before, with the step function replaced by a sigmoid]
- The non-linear sigmoid function causes non-linear decision surfaces
  - e.g., 5 areas for 5 classes a, b, c, d, e
- Very powerful representation
[Figure: input space carved into 5 regions labelled a-e]
- Note: the previous network had 2 layers of neurons
- Layered feedforward neural networks:
  - neurons organised in n layers
  - each layer has the output of the previous layer as input
  - neurons of successive layers fully interconnected
  - successive layers compute different representations of the input
- 2-layer feedforward networks are very popular
- but many other architectures are possible!
  - e.g. recurrent NNs
- Example: a 2-layer net representing the ID (identity) function
  - 8 input patterns, each mapped to the same pattern in the output
  - the network converges to a binary representation in the hidden layer, for instance: 1 = 101, 2 = 100, 3 = 011, 4 = 111, 5 = 000, 6 = 010, 7 = 110, 8 = 001
Training neural networks
- Trained by adapting the weights
- Popular algorithm: backpropagation
  - minimizes the error through gradient descent
  - principle: the output error of a layer is attributed to
    - 1. the weights of the connections in that layer
      - adapt these weights
    - 2. the inputs of that layer (except for the first layer)
      - backpropagate the error to these inputs
      - now use the same principle to adapt the weights of the previous layer
- Iterative process, may be slow
Properties of neural networks
- Useful for modelling complex, non-linear functions of numerical inputs and outputs
  - symbolic inputs/outputs are representable using some encoding, cf. true/false = +1/-1
- 2- or 3-layer networks can approximate a huge class of functions (if there are enough neurons in the hidden layers)
- Robust to noise
  - but: risk of overfitting! (because of the high expressiveness)
    - may happen when training for too long
    - usually handled using e.g. validation sets
- All inputs have some effect
  - cf. decision trees: selection of the most important attributes
- The explanatory power of ANNs is limited
  - the model is represented as weights in the network
  - there is no simple explanation why the network makes a certain prediction
  - contrast with e.g. trees, which can give the rule that was used
- Hence, ANNs are good when
  - input and output are high-dimensional (numeric or symbolic)
  - interpretability of the model is unimportant
- Examples:
  - typical: image recognition, speech recognition, ...
    - e.g. images: one input per pixel
    - see http://www.cs.cmu.edu/~tom/faces.html for an illustration
  - less typical: symbolic problems
    - cases where e.g. trees would work too
    - the performance of networks and trees is then often comparable
To remember
- Perceptrons, neural networks
- inspiration
- what they are
- how they work
- representation power
- explanatory power