Title: RDM Chapter 3: Intro to Learning and Search
- prepared for COMP422/522-2008, Bernhard Pfahringer

3.1 Representing Hypotheses and Instances
- language Le to represent examples
- language Lh to represent hypotheses
- a hypothesis h ∈ Lh is a function h: Le → Y, e.g. Y = {0,1}
- cover relation c over Lh × Le: c(h,e) is true if and only if h(e) = 1 (see Figures 3.1 and 3.2)

3.2 Boolean data
- simplify:
- item-sets, true/false variable assignments, Herbrand interpretations, e.g. {sausage, beer, mustard, win}
- e ∈ Le is an interpretation I ⊆ {b, m, s, w}
- Le = Lh (single representation trick)

Machine Learning point of view
- given Le, Lh, and an unknown target function f: Le → Y
- examples E = {(e1, f(e1)), ...}
- loss(h,E) measures the quality of h w.r.t. E
- find h = argmin_h loss(h,E)
- zero-one loss (empirical risk): loss(h,E) = 1/|E| · Σ_{(e,f(e)) ∈ E} |f(e) − h(e)|
- regression: squared loss
- probabilistic settings: log-likelihood
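
The zero-one loss as a one-liner, a minimal sketch (names are illustrative; E is assumed to be a list of (example, label) pairs and h a callable):

    def zero_one_loss(h, E):
        """Empirical risk: the fraction of examples that h labels incorrectly."""
        return sum(h(e) != y for e, y in E) / len(E)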

Data Mining point of view
- given Le, Lh, and data D ⊆ Le
- quality criterion Q(h,D); find the set
- Th(Q,D,Lh) = {h ∈ Lh | Q(h,D) holds}
- Q can be local, global, or heuristic
- e.g. freq(h,D) = |c(h,D)| (the number of examples in D covered by h) or
- rfreq(h,D) = |c(h,D)| / |D|
- local: rfreq(h,D) > y
- acc(h,P,N) = freq(h,P) / (freq(h,P) + freq(h,N))
- global: find h = argmax_h acc(h,P,N)
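
These criteria are easy to make concrete for the Boolean data of Section 3.2. A minimal sketch, assuming hypotheses and examples are both represented as Python frozensets of items (the single representation trick); the function names are illustrative, not from the book:

    # An item-set hypothesis h covers an example e iff h is a subset of e.
    def covers(h, e):
        return h <= e                       # the cover relation c(h, e)

    def freq(h, D):
        return sum(covers(h, e) for e in D)    # |c(h, D)|

    def rfreq(h, D):
        return freq(h, D) / len(D)             # relative frequency

    def acc(h, P, N):
        tp, fp = freq(h, P), freq(h, N)        # covered positives/negatives
        return tp / (tp + fp) if tp + fp else 0.0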

Generate-and-test
- FOR ALL h ∈ Lh DO
-   IF Q(h,D) = true THEN output h
- Lh must be enumerable
- naïve and inefficient, but complete
- see Example 3.3 and the sketch below
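
A runnable sketch of generate-and-test for item-sets, enumerating Lh as the powerset of a small alphabet and reusing the helpers above (same assumptions):

    from itertools import combinations

    def all_hypotheses(items):
        """Enumerate Lh: every item-set over the given alphabet."""
        return [frozenset(c) for r in range(len(items) + 1)
                for c in combinations(items, r)]

    def generate_and_test(items, D, Q):
        """FOR ALL h in Lh: IF Q(h, D) THEN output h."""
        return [h for h in all_hypotheses(items) if Q(h, D)]

    # e.g. all item-sets covering at least half of D:
    # Th = generate_and_test('bmsw', D, lambda h, d: rfreq(h, d) >= 0.5)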

3.6 Search space structure
- h1 is more general than h2, written h1 ⪯ h2, if c(h2) ⊆ c(h1); a proper generalization if the inclusion is proper
- ⪯ is reflexive and transitive
- but syntactic variants (distinct hypotheses with identical covers) are problematic
- (canonical forms restore a partial order)
- see Example 3.5 and Fig. 3.5: a Hasse diagram with top element ⊤ and bottom element ⊥; a generality test for item-sets is sketched below
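
For item-sets, generality reduces to the subset relation, and the frozenset representation is already canonical, so syntactic variants do not arise. A sketch under the same assumptions:

    def more_general(h1, h2):
        """h1 is more general than h2 (h1 ⪯ h2): everything covered by h2
        is covered by h1; for item-sets this is just h1 ⊆ h2."""
        return h1 <= h2

    def properly_more_general(h1, h2):
        return h1 < h2                      # proper subset = proper generalisation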

Monotonicity
- Q is monotonic (truth is preserved for all specialisations):
  ∀s,g ∈ Lh, ∀D ⊆ Le: g ⪯ s ∧ Q(g,D) → Q(s,D)
- Q is anti-monotonic (truth is preserved for all generalisations):
  ∀s,g ∈ Lh, ∀D ⊆ Le: g ⪯ s ∧ Q(s,D) → Q(g,D)

Examples
- freq(h,D) ≥ x (minFreq) is anti-monotonic
- freq(h,D) ≤ x (maxFreq) is monotonic
- "a specific example e is covered" (e ∈ c(h)) is anti-monotonic
- "a specific example e is not covered" (e ∉ c(h)) is monotonic
- acc(h,P,N) ≥ x is neither
- do Exercises 3.6, 3.7, 3.8; a brute-force check of the first claim follows
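
A brute-force check of the minFreq claim on a toy dataset, reusing the sketches above (illustrative only): anti-monotonicity says that whenever a specialisation s satisfies minFreq, every generalisation g of it does too.

    D = [frozenset('bms'), frozenset('bmw'), frozenset('bw')]
    hypotheses = all_hypotheses('bmsw')

    def min_freq(h, x=2):
        return freq(h, D) >= x

    assert all(min_freq(g) or not min_freq(s)      # Q(s,D) implies Q(g,D)
               for g in hypotheses for s in hypotheses
               if more_general(g, s))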

Pruning
- if a monotonic Q is false for h, then Q is also false for all generalisations of h
- if an anti-monotonic Q is false for h, then Q is false for all specialisations of h
- see Examples 3.11/3.12, Figures 3.6/3.7

Min/max
- max(T) = {h ∈ T | ¬∃t ∈ T: h ≺ t}: maximal elements are the most specific ones
- min(T) = {h ∈ T | ¬∃t ∈ T: t ≺ h}: minimal elements are the most general ones
- if Lh is infinite, min/max might not exist
- Examples 3.13, 3.14
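
For item-sets these definitions translate directly (a sketch under the same frozenset assumptions; proper subset = properly more general):

    def max_set(T):
        """Maximal = most specific: nothing in T is a proper specialisation."""
        return [h for h in T if not any(h < t for t in T)]

    def min_set(T):
        """Minimal = most general: nothing in T is a proper generalisation."""
        return [h for h in T if not any(t < h for t in T)]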

Borders
- S: the border of maximally specific hypotheses for which Q holds:
- S(Th(Q,D,Lh)) = max(Th(Q,D,Lh))
- similarly G: the maximally general ones:
- G(Th(Q,D,Lh)) = min(Th(Q,D,Lh))
- Example 3.15

Border properties
- the borders fully specify the set of all solutions:
- anti-monotonic Q: Th(Q,D,Lh) = {h ∈ Lh | ∃s ∈ S(Th(Q,D,Lh)): h ⪯ s}
- monotonic Q: Th(Q,D,Lh) = {h ∈ Lh | ∃g ∈ G(Th(Q,D,Lh)): g ⪯ h}

Version space
- if Q is a conjunction M ∧ A of two criteria, one monotonic (M) and one anti-monotonic (A), then Th is a version space:
- Th = {h ∈ Lh | ∃s ∈ S(Th), ∃g ∈ G(Th): g ⪯ h ⪯ s}
- S and G are condensed representations, often much smaller than Th
- Example 3.18 / Figure 3.8, Example 3.20; a brute-force border computation is sketched below
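
Putting the pieces together by brute force (illustrative only; practical algorithms never materialize Th): enumerate Th for a conjunctive Q and read the borders off with max_set/min_set from above.

    def theory(items, Q):
        """Th(Q, D, Lh): all hypotheses satisfying Q."""
        return [h for h in all_hypotheses(items) if Q(h)]

    # e.g. a version space from minFreq on positives (anti-monotonic) and
    # zero frequency on negatives (monotonic), with P and N assumed given:
    # Th = theory('bmsw', lambda h: freq(h, P) >= 2 and freq(h, N) == 0)
    # S, G = max_set(Th), min_set(Th)   # every h in Th lies between G and S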

Negative borders
- the elements just outside the (positive) borders
- S⁻(Th) = min(Lh − {h ∈ Lh | ∃s ∈ S(Th): h ⪯ s})
- G⁻(Th) = max(Lh − {h ∈ Lh | ∃g ∈ G(Th): g ⪯ h})
- Example 3.21
- border sets can be large: for item-sets, G can be exponentially large in the number of items N

Refinement operators
- generalisation operator ρg: Lh → 2^Lh with ∀h ∈ Lh: ρg(h) ⊆ {c ∈ Lh | c ⪯ h}
- specialisation operator ρs: Lh → 2^Lh with ∀h ∈ Lh: ρs(h) ⊆ {c ∈ Lh | h ⪯ c}
- both can be applied repeatedly

Ideal refinement operator
- ideal specialisation: ∀h ∈ Lh: ρs(h) = min({h' ∈ Lh | h ≺ h'})
- returns exactly all children of a node in the Hasse diagram
- used in heuristic search (e.g. hill-climbing); see the sketch after the next slide

Optimal operator
- no hypothesis is generated twice → efficient
- used in complete search
- see Example 3.22
- optimal operators define a canonical form, and vice versa
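
For item-sets the two operators are easy to contrast: the ideal one returns all children in the Hasse diagram, while an optimal one uses a fixed item ordering as its canonical form and only adds items beyond the largest one present, so each set is generated exactly once. A sketch (ITEMS is an assumed global alphabet):

    ITEMS = 'bmsw'                          # fixed ordering = canonical form

    def rho_ideal(h):
        """Ideal specialisation: every child of h in the Hasse diagram."""
        return [h | {i} for i in ITEMS if i not in h]

    def rho_optimal(h):
        """Optimal specialisation: only extend past max(h) in the ordering."""
        start = max((ITEMS.index(i) for i in h), default=-1) + 1
        return [h | {i} for i in ITEMS[start:]]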

MGG: minimally general generalisations
- mgg(h1,h2) = max({h ∈ Lh | h ⪯ h1 ∧ h ⪯ h2}), i.e. the most specific common generalisations
- if unique, also called lgg (least general generalisation) or lub (least upper bound)

MGS: maximally general specialisations
- mgs(h1,h2) = min({h ∈ Lh | h1 ⪯ h ∧ h2 ⪯ h}), i.e. the most general common specialisations
- if unique, also called glb (greatest lower bound)
- if lub and glb exist for all pairs h1,h2, then Lh forms a lattice (item-sets do); Example 3.23, Exercises 3.24/3.25
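
For item-sets both operations are unique, so the lattice is concrete: the lgg is the intersection and the glb is the union. A sketch:

    def lgg(h1, h2):
        """Least general generalisation: most specific common generalisation."""
        return h1 & h2                      # intersection of item-sets

    def glb(h1, h2):
        """Greatest lower bound: most general common specialisation."""
        return h1 | h2                      # union of item-sets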

Generic learning algorithm
- Queue := Init
- Th := ∅
- WHILE not Stop DO
-   Delete h from Queue
-   IF Q(h,D) = true THEN
-     add h to Th
-   ELSE Queue := Queue ∪ ρ(h)
-   Queue := prune(Queue)
- return Th
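
A direct Python transcription of this skeleton, with the slide's parameters left pluggable (the function and parameter names are mine, mirroring the slide, not an official API):

    def generic_search(init, Q, D, rho, delete, stop, prune):
        queue, th = list(init), []
        while queue and not stop(queue, th):
            h = delete(queue)               # search strategy: FIFO/LIFO/best
            if Q(h, D):
                th.append(h)
            else:
                queue.extend(rho(h))        # refine h and continue searching
            queue[:] = prune(queue)
        return th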

Generic algorithm continued
- lots of parameters:
- Init defines the starting point
- Delete defines the search strategy:
-   first-in-first-out (queue) → breadth-first
-   last-in-first-out (stack) → depth-first
-   best first → best-first search
- Stop: e.g. Queue = ∅ → finds all solutions
- Prune: heuristic or sound

Complete general-to-specific
- Queue := {⊤}; Q is anti-monotonic
- Th := ∅
- WHILE Queue ≠ ∅ DO
-   Delete h from Queue
-   IF Q(h,D) = true THEN
-     add h to Th
-     Queue := Queue ∪ ρo(h)
- return Th
- only solutions are refined: if Q fails for h, anti-monotonicity prunes all its specialisations
- see Examples 3.26, 3.27 and the sketch below
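
A runnable instance of this search for the anti-monotonic criterion freq(h,D) ≥ x, reusing freq and rho_optimal from the sketches above (an Apriori-style, breadth-first enumeration; names are mine):

    def complete_g2s(D, x=2):
        queue, th = [frozenset()], []       # start at ⊤, the most general set
        while queue:
            h = queue.pop(0)                # FIFO: levelwise search
            if freq(h, D) >= x:             # anti-monotonic Q holds for h
                th.append(h)
                queue.extend(rho_optimal(h))   # only solutions are refined
        return th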

Heuristic general-to-specific
- Queue := {⊤}
- Th := ∅
- WHILE Th = ∅ DO
-   Delete best h from Queue
-   IF Q(h,D) = true THEN
-     add h to Th
-   ELSE Queue := Queue ∪ ρi(h)
-   Queue := prune(Queue)
- return Th
- useful when a single good solution suffices
- works for general Q if prune only keeps the k best → beam search; see also Example 3.28
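
A minimal prune for the beam-search variant (score is an assumed heuristic, e.g. accuracy on P and N):

    import heapq

    def beam_prune(queue, score, k=5):
        """Keep only the k highest-scoring hypotheses."""
        return heapq.nlargest(k, queue, key=score)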

Branch-and-bound
- assume a bound b(h) exists with
- ∀h' ∈ Lh: h ⪯ h' → b(h) ≥ f(h'), i.e. b(h) bounds the score of all specialisations of h
- then, given the current best value v, we can prune all h with b(h) ≤ v
- can be viewed as a combination of complete and heuristic search
- see Example 3.29
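
The pruning test itself is a one-liner (b, f, and v assumed as above):

    def bb_prune(queue, b, v):
        """Drop hypotheses whose optimistic bound cannot beat the best so far."""
        return [h for h in queue if b(h) > v]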

(Cautious) specific-to-general
- Queue := {⊥}
- Th := ∅
- WHILE Queue ≠ ∅ DO
-   Delete some h from Queue
-   IF Q(h,D) = true THEN
-     add h to Th
-   ELSE select a d ∈ D such that ¬c(h,d)
-     Queue := Queue ∪ {lgg(h,d)}
- return Th
- see Example 3.31; can be seen as computing S for the (anti-monotonic) criterion rfreq(h,D) = 1, as in the sketch below
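
For item-sets this cautious search collapses to something very simple, because lgg is intersection: repeatedly generalising against uncovered examples ends at the single S element for rfreq(h,D) = 1. A sketch reusing lgg from above:

    from functools import reduce

    def s_border_full_coverage(D):
        """S for rfreq(h, D) = 1: the most specific h covering all of D."""
        return reduce(lgg, D)               # iterated lgg = intersection of D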

Computing the G border general-to-specific
- Queue := {⊤}
- Th := ∅
- WHILE Queue ≠ ∅ DO
-   Delete h from Queue
-   IF Q(h,D) = true AND h ∈ G (no proper generalisation of h satisfies Q) THEN
-     add h to Th
-   ELSE IF Q(h,D) = false THEN
-     Queue := Queue ∪ ρo(h)
- return Th
- a similar algorithm computes S; when computing both S and G, more pruning is possible (see Example 3.34)

Computing S and G incrementally
- incrementally update a version space (S,G), e.g. when finding all correct h (rfreq(h,P) = 1 ∧ rfreq(h,N) = 0)
- needs an msg(g,e) operation that excludes e from g, i.e. minimally specialises g so that it no longer covers e (Example 3.35); sketched below for item-sets
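
For item-sets, msg(g,e) has a simple form: add one item that e lacks, so the specialisation no longer covers e (a sketch; ITEMS as assumed above):

    def msg(g, e):
        """Minimal specialisations of g that exclude example e."""
        return [g | {i} for i in ITEMS if i not in e and i not in g]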

Mitchell's candidate elimination
- S := {⊥}; G := {⊤}
- FOR ALL examples e DO
-   IF e ∈ N THEN
-     process negative example
-   ELSE
-     process positive example

Process negative example
- S := S − {s ∈ S | e ∈ c(s)}
- FOR ALL g ∈ G with e ∈ c(g) DO
-   Δg := {g' ∈ msg(g,e) | ∃s ∈ S: g' ⪯ s}
-   G := (G − {g}) ∪ Δg
- G := min(G)

Process positive example
- G := G − {g ∈ G | e ∉ c(g)}
- FOR ALL s ∈ S with e ∉ c(s) DO
-   Δs := {s' ∈ lgg(s,e) | ∃g ∈ G: g ⪯ s'}
-   S := (S − {s}) ∪ Δs
- S := max(S)
- see Example 3.36, Exercise 3.37; a compact item-set implementation follows
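
A compact candidate-elimination sketch for item-sets, wiring together covers, lgg, and msg from the earlier sketches (an illustration under the same assumptions, not the book's code; note that for item-sets g ⪯ s is just g ⊆ s):

    def candidate_elimination(P, N, items=ITEMS):
        S = {frozenset(items)}              # bottom: most specific hypothesis
        G = {frozenset()}                   # top: most general hypothesis
        for e, pos in [(e, True) for e in P] + [(e, False) for e in N]:
            if pos:                         # positive: e must be covered
                G = {g for g in G if covers(g, e)}
                S = {lgg(s, e) for s in S}  # minimal generalisation
                S = {s for s in S if any(g <= s for g in G)}
                S = {s for s in S if not any(s < t for t in S)}   # max(S)
            else:                           # negative: e must be excluded
                S = {s for s in S if not covers(s, e)}
                G = {g2 for g in G
                     for g2 in (msg(g, e) if covers(g, e) else [g])}
                G = {g for g in G if any(g <= s for s in S)}
                G = {g for g in G if not any(t < g for t in G)}   # min(G)
        return S, G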

Interesting properties
- S and G contain only one, identical h → converged on a single solution
- S or G empty → no solution exists
- S and G can determine whether any given h is still possible
- S and G can already predict the labels of some e, i.e. such e carry no additional information
- try Exercise 3.39

Intersection of version spaces
- two version spaces can be intersected by computing the new S as lgg(s1,s2) for all pairs of elements from S1 and S2, and the new G as glb(g1,g2) for all pairs of elements from G1 and G2
- can use this to compute a separate VS for every single positive example against all negative examples,
- then incrementally intersect these VSs; a sketch follows