Computational Learning Theory - PowerPoint PPT Presentation

1
Computational Learning Theory
  • What general laws constrain inductive learning?
  • Seeking theory to relate
  • probability of successful learning
  • number of training examples
  • complexity of H
  • accuracy of approximations
  • manner in which examples are given

2
Prototypical concept learning task
  • Given
  • X, c: X -> {0,1}, H
  • D = {<x1,c(x1)>, ..., <xm,c(xm)>}
  • Determine h s.t. h(x) = c(x)
  • for all x in D?
  • for all x in X?

3
Sample Complexity
  • How large must D be?
  • query model: learner proposes x, teacher gives
    c(x)
  • tutorial model: teacher provides (good)
    <x,c(x)> pairs
  • random model: x drawn from some unknown distribution D,
    teacher provides c(x)

4
Sample complexity
  • Query model
  • assume c is in H
  • Optimal query strategy
  • select x s.t. half of the h in VS(H,D) classify it +, half -
  • log2|H| queries
  • not always possible -> more queries (sketch below)
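A minimal sketch of this query strategy (not from the slides), assuming a small, explicitly enumerated X and H with hypotheses as callables x -> 0/1 and an oracle returning c(x); all names are illustrative:

def query_learn(X, H, oracle):
    """Query the instance that splits the version space most evenly;
    with perfect halves this needs about log2(|H|) queries.
    Assumes the hypotheses in H are pairwise distinguishable on X."""
    version_space = list(H)
    queries = 0
    while len(version_space) > 1:
        # pick the most balanced instance: predictions closest to a 50/50 split
        x = min(X, key=lambda xc: abs(sum(h(xc) for h in version_space)
                                      - len(version_space) / 2))
        label = oracle(x)                  # teacher reveals c(x)
        queries += 1
        version_space = [h for h in version_space if h(x) == label]
    return version_space[0], queries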

5
Sample complexity
  • Tutorial model
  • teacher knows c
  • Optimal teaching strategy
  • depends on H
  • e.g. Boolean conjunctions of n literals
  • n + 1 examples suffice!

6
Sample complexity
  • Random model
  • X, H, C, distribution D
  • Task
  • output h estimating c
  • performance of h evaluated on new examples drawn
    from D
  • note: probabilistic selection of x, no noise in
    c(x)

7
True error of h
  • We already had this in chapter 5
  • sample error estimates true error
  • BUT we assumed h is independent of the sample
  • however, it's the sample that is used to learn h
    -> strongly dependent
  • Probability that h makes an error
  • P[h(x) != c(x)] over D (sketch below)
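A minimal sketch of what this probability means operationally, assuming a hypothetical helper sample_from_D that draws one fresh instance from the distribution D per call:

def estimate_true_error(h, c, sample_from_D, n=10_000):
    """Monte Carlo estimate of the true error P[h(x) != c(x)] over D."""
    mistakes = 0
    for _ in range(n):
        x = sample_from_D()      # fresh instance, not from the training sample
        if h(x) != c(x):
            mistakes += 1
    return mistakes / n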

8
True error
  • Our concern
  • can we bound true error of h given the training
    error?
  • training error may be 0 (consistent learners)
  • VS(H,D) contains all consistent h
  • bound the number of examples needed to assure VS(H,D)
    contains no unacceptable h
  • applies to any consistent learner

9
Exhausting H
  • VS(H,D) is ε-exhausted if
  • every h in VS(H,D)
  • has true error smaller than ε
  • Surprise
  • a probabilistic argument allows us to bound the
    probability that VS will be ε-exhausted after a certain
    number of examples

10
Theorem (Haussler-88)
  • If
  • H is finite
  • D contains m independent random examples
  • then
  • P(exists h in VS(H,D) with error(h) > ε)
  • is bounded by |H| e^(-εm)
  • Ensure |H| e^(-εm) < δ
  • m >= (1/ε)(ln|H| + ln(1/δ))

11
Example: conjunctions of literals
  • |H| = 3^n
  • m >= (1/ε)(ln|H| + ln(1/δ))
  • m >= (1/ε)(ln 3^n + ln(1/δ))
  • m >= (1/ε)(n ln 3 + ln(1/δ))
  • EnjoySport example: |H| = 973
  • probability 95%, error < 0.10
  • ε = 0.1 and δ = 0.05
  • get m >= 98.8 (check below)
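A quick numeric check of the bound with the slide's numbers (|H| = 973, ε = 0.1, δ = 0.05); the function name is illustrative:

from math import ceil, log

def pac_sample_bound(h_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner."""
    return (log(h_size) + log(1 / delta)) / eps

m = pac_sample_bound(973, eps=0.1, delta=0.05)
print(m, ceil(m))   # ~98.8 -> 99 examples suffice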

12
PAC learning
  • Class C is PAC-learnable by algorithm L if
  • with probability (1-δ)
  • true error is smaller than ε
  • after a reasonable number of examples
  • reasonable time per example
  • Reasonable
  • polynomial in terms of 1/ε, 1/δ, size of examples
    and target concept encoding length

13
Agnostic learning
  • Don't assume c is in H
  • We want
  • h_best making fewest errors on D
  • Sample complexity?
  • m >= 1/(2ε^2)(ln|H| + ln(1/δ))
  • justification: Hoeffding bounds
  • P[error_true(h) > error_sample(h) + ε] <= e^(-2mε^2)
  • |H| alternatives to choose from (check below)
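A minimal sketch comparing this agnostic bound with the consistent-learner bound, reusing the same illustrative numbers (|H| = 973, ε = 0.1, δ = 0.05):

from math import log

def agnostic_sample_bound(h_size, eps, delta):
    """m >= (1/(2 eps^2)) * (ln|H| + ln(1/delta))."""
    return (log(h_size) + log(1 / delta)) / (2 * eps ** 2)

print(agnostic_sample_bound(973, 0.1, 0.05))   # ~494 examples, vs. ~99
                                               # in the consistent case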

14
Some examples
  • Boolean conjunctions
  • use Find_S -> PAC
  • Unbiased learners
  • too large sample
  • k-term DNF and k-CNF
  • polynomial sample complexity
  • k-term DNF: finding a consistent hypothesis is NP-complete
  • k-CNF: polynomial time

15
Sample complexity for infinite hypothesis spaces
  • ln|H| is not the best measure
  • larger than necessary
  • not applicable to infinite H
  • New measure for complexity of H
  • VC(H)

16
Shattering X
  • VC(H) = the number of distinct instances of X that can be
    completely discriminated by H
  • S ⊆ X is shattered by H if
  • for every dichotomy of S
  • there exists an h consistent with that dichotomy

17
VC(H)
  • Size of largest S that can be shattered by H
  • one large S suffices
  • infinite if arbitrarily large S can be shattered
  • Note
  • If VC(H) = d, then |H| >= 2^d
  • VC(H) <= log2|H| for finite H

18
Examples
  • X: real numbers, H: intervals (VC(H) = 2; check below)
  • X: points on the 2-dim. plane, H: linear decision
    surfaces (perceptrons) (VC(H) = 3)
  • generalizes to n dimensions (VC(H) = n + 1)
  • Conjunctions of Boolean literals
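A minimal brute-force check for the intervals example: an interval can realize a labeling of a finite point set exactly when the positively labeled points form one contiguous block in sorted order (or none), so 2 points can be shattered but 3 cannot (the labeling +,-,+ fails), giving VC = 2. Function names are illustrative:

from itertools import product

def interval_can_realize(points, labels):
    """An interval [a, b] can realize a labeling iff the positively
    labeled points are contiguous in sorted order (or there are none)."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    pos = [k for k, i in enumerate(order) if labels[i] == 1]
    return not pos or pos == list(range(pos[0], pos[-1] + 1))

def can_shatter(points):
    return all(interval_can_realize(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(can_shatter([1.0, 2.0]))        # True  -> 2 points shattered
print(can_shatter([1.0, 2.0, 3.0]))   # False -> labeling (1, 0, 1) impossible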

19
Sample complexity and VC(H)
  • Blumer et al. 1989
  • upper bound
  • m >= (1/ε)(8 VC(H) log2(13/ε) + 4 log2(2/δ))
  • Ehrenfeucht et al. 1989
  • lower bound: there exists a distribution D s.t. at least
  • max( log2(1/δ)/ε, (VC(C)-1)/(32ε) ) examples are needed
  • note C instead of H (we might have C ⊆ H); numeric check below
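A quick numeric sketch of both bounds, assuming log base 2 as in the formulas above and purely illustrative values VC(H) = VC(C) = 3 (e.g. perceptrons in the plane), ε = 0.1, δ = 0.05:

from math import log2

def vc_upper_bound(vc_h, eps, delta):
    """Blumer et al. upper bound on the number of examples."""
    return (8 * vc_h * log2(13 / eps) + 4 * log2(2 / delta)) / eps

def vc_lower_bound(vc_c, eps, delta):
    """Ehrenfeucht et al. worst-case lower bound."""
    return max(log2(1 / delta) / eps, (vc_c - 1) / (32 * eps))

print(vc_upper_bound(3, 0.1, 0.05))   # ~1900 examples are always enough
print(vc_lower_bound(3, 0.1, 0.05))   # ~43 examples can be necessary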

20
VC(H) for ANNs
  • Layered acyclic networks
  • n inputs, 1 output
  • s internal units
  • each with at most r inputs
  • each implementing a Boolean function from a class C
  • Kearns & Vazirani, 1994
  • VC(net) <= 2 VC(C) s log(es)

21
VC(H) for ANNs
  • Network of perceptrons
  • internal units (perceptrons with r inputs) have VC(C) = r + 1
  • VC(net) <= 2(r+1)s log(es)
  • apply this to compute an upper bound on the number of required
    training examples (sketch below)
  • Note
  • not applicable to sigmoid units
  • inductive bias of BP (small weights) reduces the
    effective VC dimension
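A minimal sketch of the VC(net) bound above, assuming natural log in the formula and purely illustrative sizes (s = 10 internal perceptrons with r = 4 inputs each); the result can be plugged into the Blumer et al. bound from the earlier slide:

from math import e, log

def vc_perceptron_net(r, s):
    """VC(net) <= 2(r+1) * s * log(e*s) for a layered perceptron network."""
    return 2 * (r + 1) * s * log(e * s)

print(vc_perceptron_net(r=4, s=10))   # ~330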

22
Learning settings in COLT
  • Variations in
  • generation of examples
  • presence of noise
  • definition of success
  • assumptions (about D, whether C ⊆ H)
  • evaluation of the learner
  • number of examples
  • time consumed
  • number of mistakes made

23
Mistake bound model
  • How many mistakes do we make before convergence?
  • learning while system is in real use
  • the number of errors may be more crucial than the number
    of examples
  • Problem setting
  • learner guesses c(x) for each x as they arrive,
    gets a right/wrong answer
  • here: exact learning

24
Find_S mistake bound
  • H: conjunctions of n Boolean literals
  • Computing the bound
  • Find_S never misclassifies a negative example
  • its only errors are positive examples classified as -
  • first mistake -> n of the 2n initial literals
    eliminated
  • afterwards at most n more mistakes, so n + 1 in total
    (sketch below)
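A minimal sketch of Find_S run online over n Boolean attributes, counting mistakes; the representation (hypotheses as sets of literal strings) and names are illustrative:

def literals_satisfied(x):
    """Literals consistent with example x, a tuple of 0/1 attribute values."""
    return {f"x{i}" if v else f"-x{i}" for i, v in enumerate(x)}

def find_s_online(examples):
    n = len(examples[0][0])
    # most specific hypothesis: conjunction of all 2n literals
    h = {f"x{i}" for i in range(n)} | {f"-x{i}" for i in range(n)}
    mistakes = 0
    for x, label in examples:              # label = c(x) in {0, 1}
        prediction = 1 if h <= literals_satisfied(x) else 0
        if prediction != label:
            mistakes += 1                  # only positives can be missed
        if label == 1:
            h &= literals_satisfied(x)     # drop contradicted literals
    return h, mistakes                     # mistakes <= n + 1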

25
Halving algorithm
  • Candidate elimination + majority vote
  • Mistake made when
  • majority of VS(H,D) misclassifies
  • these are eliminated -> VS(H,D) is (at least)
    halved
  • at most log2|H| mistakes (worst-case bound)
  • may learn without making any mistakes (minority
    eliminated); see the sketch below
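A minimal sketch of the Halving algorithm over an explicitly enumerated finite H (hypotheses as callables x -> 0/1); every mistake removes at least half of the version space, so at most log2(|H|) mistakes are made:

def halving(H, examples):
    version_space = list(H)
    mistakes = 0
    for x, label in examples:              # label = c(x)
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes > len(version_space) else 0
        if prediction != label:
            mistakes += 1
        # keep only the hypotheses consistent with the revealed label
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes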

26
Optimal mistake bounds
  • Lowest worst-case mistake bound over all possible learning
    algorithms
  • M_A(C)
  • max number of mistakes made by A when learning
    concepts from C
  • Find_S: n + 1, Halving: log2|C|
  • Opt(C)
  • minimum of M_A(C) over all algorithms A

27
Optimal mistake bounds
  • Informal reading of Opt(C)
  • the number of mistakes made for
  • hardest c in C
  • using hardest sequence of examples
  • with best algorithm
  • Littlestone, 1987
  • VC(C) <= Opt(C) <= M_halving(C)

28
Weighted majority alg.
  • Bayes optimal classifier
  • weight = P(h|D)
  • In general
  • pool of weighted prediction algorithms
  • learning = adjusting the weights
  • Note
  • works even with inconsistent data
  • mistake bound of the best algorithm -> mistake bound of W-M

29
W-M algorithm
  • Outline
  • initially assign weight 1 to all
  • on a misclassification, multiply the weight by β, 0 <= β < 1
  • β = 0 -> halving algorithm
  • Theorem
  • sequence D, n algorithms, k = min number of errors made by
    any of them, β = 0.5 (generalizes)
  • mistake bound <= 2.4(k + log2 n) (sketch below)
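A minimal sketch of Weighted Majority with β = 0.5, assuming a pool of prediction algorithms given as callables x -> 0/1; names are illustrative:

def weighted_majority(predictors, examples, beta=0.5):
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in examples:              # label may be noisy/inconsistent
        vote_1 = sum(w for p, w in zip(predictors, weights) if p(x) == 1)
        vote_0 = sum(w for p, w in zip(predictors, weights) if p(x) == 0)
        prediction = 1 if vote_1 > vote_0 else 0
        if prediction != label:
            mistakes += 1
        # downweight every predictor that got this example wrong
        weights = [w * beta if p(x) != label else w
                   for p, w in zip(predictors, weights)]
    return weights, mistakes               # mistakes <= ~2.4 * (k + log2 n)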

30
Summary
  • PAC model
  • Sample complexity in PAC model
  • result applies to any consistent learner with
    finite H, C ⊆ H
  • Agnostic learning
  • Complexity of H: VC(H)
  • new sample complexity result
  • Mistake bound model, W-M