1
A New Linear-threshold Algorithm
  • Anna Rapoport
  • Lev Faivishevsky

2
Introduction
  • Valiant (1984) and others have studied the
    problem of learning various classes of Boolean
    functions from examples. Now we're going to
    discuss incremental learning of these functions.
  • We consider a setting in which the learner
    responds to each example according to a current
    hypothesis. Then the learner updates it, if
    necessary, based on the correct classification of
    the example.

3
Introduction (cont.)
  • One natural measure of the quality of learning in
    this setting is the number of mistakes the
    learner makes.
  • For suitable classes of functions, learning
    algorithms are available that make a bounded
    number of mistakes, with the bound independent of
    the number of examples seen by the learner.

4
Introduction (cont.)
  • We present an algorithm that learns disjunctive
    Boolean functions, along with variants for
    learning other classes of Boolean functions.
  • The basic method can be expressed as a linear-
    threshold algorithm.
  • A primary advantage of this algorithm is that the
    number of mistakes grows only logarithmically
    with the number of irrelevant attributes in the
    examples. It is also computationally efficient in
    both time and space.

5
How does it work?
  • We study learning in an on-line setting: there's
    no separate set of training examples. The learner
    attempts to predict the appropriate response for
    each example, starting with the first example
    received.
  • After making this prediction, the learner is told
    whether the prediction was correct, and then uses
    this information to improve its hypothesis.
  • The learner continues to learn as long as it
    receives examples.

6
The Setting
  • Now we're going to describe in more detail the
    learning environment that we consider and the
    classes of functions that the algorithm can
    learn. We assume that learning takes place in a
    sequence of trials. The order of events in a
    trial is as follows:

7
The Setting (cont.)
  • (1) The learner receives some information about
    the world, corresponding to a single example.
    This information consists of the values of n
    Boolean attributes, for some n that remains
    fixed. We think of the information received as
    a point in {0,1}^n. We call this point an
    instance, and we call {0,1}^n the instance space.

8
The Setting (cont.)
  • (2) The learner makes a response. The learner
    has a choice of two responses, labeled 0 and 1.
    We call this response the learner's prediction
    of the correct value.
  • (3) The learner is told whether or not the
    response was correct. This information is
    called the reinforcement.

9
The Setting (cont.)
  • Each trial begins after the previous trial has
    ended.
  • We assume that for the entire sequence of trials,
    there is a single function f: {0,1}^n → {0,1} which
    maps each instance to the correct response to
    that instance. This function is called the target
    function or target concept.
  • An algorithm for learning in this setting is
    called an algorithm for on-line learning from
    examples (AOLLE).
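  • The trial protocol above can be sketched in a few lines of
    Python (the learner interface and the mistake counter below
    are illustrative assumptions, not part of the original slides):

    def run_trials(learner, instances, target):
        """Run on-line learning trials; return the number of mistakes."""
        mistakes = 0
        for x in instances:                     # (1) instance from {0,1}^n
            prediction = learner.predict(x)     # (2) learner's response
            correct = target(x)                 # (3) reinforcement
            if prediction != correct:
                mistakes += 1
            learner.update(x, correct)          # revise the hypothesis if needed
        return mistakes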

10
Mistake Bound (introduction)
  • We evaluate the algorithm's learning behavior by
    counting the worst-case number of mistakes that
    it will make while learning a function from a
    specified class of functions. Computational
    complexity is also considered. The method is
    computationally time and space efficient.

11
General results about mistake bounds for AOLLE
  • At first we present upper and lower bounds on the
    number of mistakes in the case where one ignores
    issues of computational efficiency.
  • The instance space can be any finite space X, and
    the target class is assumed to be a collection of
    functions, each with domain X and range {0,1}.

12
Some definitions
  • Def 1: For any learning algorithm A and any
    target function f, let M_A(f) be the maximum,
    over all possible sequences of instances, of the
    number of mistakes that algorithm A makes when
    the target function is f.

13
Some definitions
  • Def 2: For any learning algorithm A and any
    non-empty target class C, let
  • M_A(C) = max_{f∈C} M_A(f).
  • Define M_A(C) = -1 if C is empty. Any number
    greater than or equal to M_A(C) will be called a
    mistake bound for algorithm A applied to class
    C.

14
Some definitions
  • Def 3: The optimal mistake bound for a target
    class C, denoted opt(C), is the minimum over
    all algorithms A of M_A(C) (regardless of the
    algorithm's computational efficiency). An
    algorithm A is called optimal for class C if
    M_A(C) = opt(C). Thus opt(C) represents the best
    possible worst-case mistake bound for any
    algorithm learning C.

15
Two auxiliary algorithms
  • If computational resources are no issue, there
    are straightforward learning algorithms that have
    excellent mistake bounds for many classes of
    functions. We will look at them briefly, because
    they give an upper limit on the mistake bound and
    because they suggest strategies that one might
    explore in searching for computationally
    efficient algorithms.

16
Algorithm 1: halving algorithm (HA)
  • The HA can be applied to any finite class C of
    functions taking values in {0,1}. The HA
    maintains a variable CONSIST (initially
    CONSIST = C). When it receives an instance x, it
    determines the sets
  • σ0(CONSIST, x) = {f ∈ CONSIST : f(x) = 0}
  • σ1(CONSIST, x) = {f ∈ CONSIST : f(x) = 1}

17
How the HA works
  • If |σ1(CONSIST, x)| > |σ0(CONSIST, x)|, it predicts 1;
    otherwise it predicts 0.
  • When it receives the reinforcement, it sets
    CONSIST = σ1(CONSIST, x) if the correct response is 1, and
    CONSIST = σ0(CONSIST, x) if the correct response is 0.
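  • A minimal Python sketch of the halving algorithm, assuming the
    target class is given as an explicit list of functions (an
    illustrative assumption; any finite class works):

    class Halving:
        def __init__(self, target_class):
            self.consist = list(target_class)   # CONSIST, initially the whole class C
        def predict(self, x):
            ones = sum(1 for f in self.consist if f(x) == 1)
            zeros = len(self.consist) - ones
            return 1 if ones > zeros else 0     # majority vote of consistent functions
        def update(self, x, correct):
            # keep only the functions that classify x correctly
            self.consist = [f for f in self.consist if f(x) == correct]

  • Every mistake removes at least half of CONSIST, which is what
    yields the log2|C| bound on the next slide.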
18
HA main results
  • Def: Let M_HALVING(C) denote the maximum number
    of mistakes that the algorithm will make when it
    is run for the class C.
  • Th 1: For any non-empty target class C,
  • M_HALVING(C) ≤ log2 |C|
  • Th 2: For any finite target class C,
  • opt(C) ≤ log2 |C|

19
Algorithm 2: standard optimal algorithm (SOA)
  • Def 1: A mistake tree for a target class C over
    an instance space X is a binary tree, each of
    whose nodes is a non-empty subset of C; each
    internal node is labeled with a point of X and
    satisfies:
  • 1. The root of the tree is C.
  • 2. For any internal node C′ labeled with x,
    the left child of C′ is σ0(C′, x) and the right
    child is σ1(C′, x).

20
SOA
  • Def 2: A complete k-mistake tree is a mistake
    tree that is a complete binary tree of height
    k.
  • Def 3: For any non-empty finite target class C,
    let K(C) equal the largest integer k s.t. there
    exists a complete k-mistake tree for C. Define
    K(∅) = -1.
  • The SOA is similar to the HA, but it predicts 1 if
  • K(σ1(CONSIST, x)) > K(σ0(CONSIST, x))
  • and predicts 0 otherwise.
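  • A brute-force Python sketch of K(C) computed directly from the
    definition (exponential time, intended only for very small
    classes; representing functions as Python callables over a
    finite list X of instances is an assumption for illustration):

    def sigma(consist, x, value):
        return [f for f in consist if f(x) == value]

    def K(consist, X):
        """Largest k such that a complete k-mistake tree exists for consist."""
        if not consist:
            return -1
        best = 0                                # a single node is a 0-mistake tree
        for x in X:
            s0, s1 = sigma(consist, x, 0), sigma(consist, x, 1)
            if s0 and s1:                       # both children must be non-empty
                best = max(best, 1 + min(K(s0, X), K(s1, X)))
        return best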

21
SOA main results
  • Th 1: Let X be any instance space. For any finite
    C ⊆ {f : X → {0,1}}: opt(C) = M_SOA(C) = K(C)
  • Def 4: S ⊆ X is shattered by a target class C if
    for every U ⊆ S there exists f ∈ C s.t. f = 1 on U
    and f = 0 on S − U.
  • Def 5: The Vapnik-Chervonenkis dimension VCdim(C)
    is the cardinality of the largest set shattered by C.
  • Th 2: For any target class C,
  • VCdim(C) ≤ opt(C)

22
The linear-threshold algorithm (LTA)
  • Def 1: f: {0,1}^n → {0,1} is linearly separable if
    there is a hyperplane in R^n separating the
    points on which the function is 1 from those on
    which it is 0.
  • Def 2: A monotone disjunction is a disjunction in
    which no literal appears negated: f(x1,…,xn) =
    x_i1 ∨ … ∨ x_ik.
  • The hyperplane given by x_i1 + … + x_ik = 1/2 is a
    separating hyperplane for f.
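  • A quick Python check of this separating hyperplane, enumerating
    all of {0,1}^n for a small n (the particular disjunction below
    is just an example):

    from itertools import product

    n, relevant = 5, [0, 2]                     # f(x) = x1 OR x3 (0-based indices)
    for x in product([0, 1], repeat=n):
        disjunction = 1 if any(x[i] for i in relevant) else 0
        threshold = 1 if sum(x[i] for i in relevant) > 0.5 else 0
        assert disjunction == threshold         # the hyperplane separates 1s from 0s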

23
WINNOW 1
  • The instance space is X = {0,1}^n.
  • The algorithm maintains weights w1,…,wn ∈ R,
    each having 1 as its initial value.
  • θ ∈ R is the threshold.
  • When the learner receives an instance (x1,…,xn),
    the learner responds as follows:
  • if Σ wi xi > θ, then it predicts 1;
  • if Σ wi xi ≤ θ, then it predicts 0.

24
WINNOW 1
  • The weights are changed only if the learner
    makes a mistake, according to the table:
  • Predicted 0, correct response 1 (promotion step):
    for each i with xi = 1, set wi := α·wi.
  • Predicted 1, correct response 0 (elimination step):
    for each i with xi = 1, set wi := 0.
  • Weights wi with xi = 0 are never changed.
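  • A minimal Python sketch of WINNOW1 as described on the last two
    slides (the default parameter values α = 2 and θ = n/2 follow
    the example a few slides ahead and are illustrative):

    class Winnow1:
        def __init__(self, n, alpha=2.0, theta=None):
            self.alpha = alpha
            self.theta = n / 2.0 if theta is None else theta
            self.w = [1.0] * n                  # all weights start at 1
        def predict(self, x):
            total = sum(wi * xi for wi, xi in zip(self.w, x))
            return 1 if total > self.theta else 0
        def update(self, x, correct):
            if self.predict(x) == correct:
                return                          # weights change only on a mistake
            for i, xi in enumerate(x):
                if xi == 1:
                    if correct == 1:            # promotion: predicted 0, answer was 1
                        self.w[i] *= self.alpha
                    else:                       # elimination: predicted 1, answer was 0
                        self.w[i] = 0.0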

25
Requirements for WINNOW1
  • The space needed (without counting bits per
    weight) and the sequential time needed per trial
    are both linear in n.
  • Non-zero weights are powers of α, so the weights
    are at most αθ. Thus if the logarithms (base α)
    of the weights are stored, only O(log2 log_α θ)
    bits per weight are needed.

26
Mistake bound for WINNOW1
  • Th: Suppose that the target function is a
    k-literal monotone disjunction given by
  • f(x1,…,xn) = x_i1 ∨ … ∨ x_ik. If WINNOW1 is run
    with α > 1 and θ ≥ 1/α, then for any sequence of
    instances the total number of mistakes will be
    bounded by
  • αk(log_α θ + 1) + n/θ

27
Example
  • Good bounds are obtained if α ≥ 2 and θ = n/α.
  • For α = 2 we get the bound 2k log2 n + 2; the
    dominating first term is minimized for α = e; the
    bound then becomes
  • (e/log2 e) k log2 n + e ≈ 1.885 k log2 n + e
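  • A short Python check of these numbers: with θ = n/α the bound
    αk(log_α θ + 1) + n/θ simplifies to αk log_α n + α (the values
    of n and k below are illustrative):

    import math

    def winnow1_bound(n, k, alpha):
        """alpha*k*(log_alpha(theta) + 1) + n/theta, with theta = n/alpha."""
        theta = n / alpha
        return alpha * k * (math.log(theta, alpha) + 1) + n / theta

    print(winnow1_bound(1024, 3, 2.0))          # 2*k*log2(n) + 2 = 62.0
    print(winnow1_bound(1024, 3, math.e))       # ~1.885*k*log2(n) + e ~ 59.2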

28
Lower mistake bound
  • Def: For 1 ≤ k ≤ n, let Ck denote the class of
    k-literal monotone disjunctions, and let C≤k
    denote the class of all those monotone
    disjunctions that have at most k literals.
  • Th (lower bound): For 1 ≤ k ≤ n,
  • opt(C≤k) ≥ opt(Ck) ≥ k log2(n/k). For n > 1
  • we also have opt(Ck) ≥ (k/8)(1 + log2(n/k)).

29
Modified WINNOW1
  • For an instance space X ⊆ {0,1}^n and a δ such that
    0 < δ ≤ 1, let F(X, δ) be the set of functions
    f: X → {0,1} such that for each f ∈ F(X, δ) there
    exist μ1,…,μn ≥ 0 such that for all (x1,…,xn) ∈ X:
  • Σ μi xi ≥ 1 if f(x1,…,xn) = 1 (*)
  • Σ μi xi ≤ 1 − δ if f(x1,…,xn) = 0 (**)
  • So the inverse images of 0 and 1 are linearly
    separable with a minimum separation that depends
    on δ. The mistake bound that we derive will be
    practical only for those functions for which δ is
    sufficiently large.

30
Example an r-of-k threshold function
  • Def: Let X = {0,1}^n. An r-of-k threshold function
    is defined by selecting a set of k significant
    variables; f is 1 whenever at least r of these k
    variables are 1.
  • f = 1 ⟺ x_i1 + … + x_ik ≥ r, so
  • (1/r)x_i1 + … + (1/r)x_ik ≥ 1 if
    f(x1,…,xn) = 1
  • (1/r)x_i1 + … + (1/r)x_ik ≤ 1 − 1/r if
    f(x1,…,xn) = 0
  • Thus the r-of-k threshold functions belong to
    F({0,1}^n, 1/r).
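  • A brute-force Python check that an r-of-k threshold function
    satisfies (*) and (**) with μi = 1/r on the k significant
    variables (the values of n, k, r below are illustrative):

    from itertools import product

    n, relevant, r = 5, [0, 1, 3], 2            # a 2-of-3 threshold function
    mu = [1.0 / r if i in relevant else 0.0 for i in range(n)]
    for x in product([0, 1], repeat=n):
        f = 1 if sum(x[i] for i in relevant) >= r else 0
        s = sum(m * xi for m, xi in zip(mu, x))
        assert (s >= 1) if f == 1 else (s <= 1 - 1.0 / r)   # (*) and (**)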

31
WINNOW2
  • The only change from WINNOW1 is the updating rule
    applied when a mistake is made:
  • Predicted 0, correct response 1 (promotion step):
    for each i with xi = 1, set wi := α·wi, as in WINNOW1.
  • Predicted 1, correct response 0 (demotion step):
    for each i with xi = 1, set wi := wi/α, rather than
    setting wi to 0.
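  • A minimal Python sketch of the WINNOW2 update; only the demotion
    step differs from the WINNOW1 sketch above (α and θ must be
    supplied, e.g. α = 1 + δ/2 as on the next slide):

    class Winnow2:
        def __init__(self, n, alpha, theta):
            self.alpha, self.theta = alpha, theta
            self.w = [1.0] * n                  # all weights start at 1
        def predict(self, x):
            total = sum(wi * xi for wi, xi in zip(self.w, x))
            return 1 if total > self.theta else 0
        def update(self, x, correct):
            if self.predict(x) == correct:
                return                          # weights change only on a mistake
            for i, xi in enumerate(x):
                if xi == 1:
                    if correct == 1:
                        self.w[i] *= self.alpha     # promotion, as in WINNOW1
                    else:
                        self.w[i] /= self.alpha     # demotion replaces elimination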

32
Requirements for WINNOW2
  • We use α = 1 + δ/2 for learning target functions in
    F(X, δ).
  • Space and time requirements for WINNOW2 are similar
    to those for WINNOW1. However, more bits will be
    needed to store each weight, perhaps as many as
    the logarithm of the mistake bound.

33
Mistake bound for WINNOW2
  • Th: For 0 < δ ≤ 1, if the target function f is in
    F(X, δ) for X ⊆ {0,1}^n, if μ1,…,μn have been
    chosen s.t. f satisfies (*) and (**), and if
    WINNOW2 is run with α = 1 + δ/2 and θ ≥ 1, and the
    algorithm receives instances from X, then the
    number of mistakes will be bounded by
  • (8/δ²)(n/θ) + (5/δ + (14 ln θ)/δ²) Σ μi.

34
Example an r-of-k threshold function
  • Now we calculate the mistake bound for r-of-k
    threshold functions. We have δ = 1/r and
    Σ μi = k/r. So for α = 1 + 1/(2r) and θ = n the
    mistake bound is 8r² + 5k + 14kr ln n.
  • Note that 1-of-k threshold functions are just
    k-literal monotone disjunctions. Thus with α = 3/2,
    WINNOW2 will learn monotone disjunctions. The
    mistake bound is similar to the bound for
    WINNOW1, though with larger constants.
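  • A quick Python evaluation of this bound for illustrative values
    of n, k and r:

    import math

    def winnow2_rk_bound(n, k, r):
        """Mistake bound 8r^2 + 5k + 14kr*ln(n) for r-of-k threshold functions."""
        return 8 * r**2 + 5 * k + 14 * k * r * math.log(n)

    print(winnow2_rk_bound(1024, 3, 1))         # 1-of-3 (a monotone disjunction): ~314
    print(winnow2_rk_bound(1024, 3, 2))         # 2-of-3: ~629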

35
Conclusion
  • The first part gives us general results about how
    many mistakes an effective learner might make if
    computational complexity were not an issue.
  • The second part describes an efficient algorithm
    for learning specific target classes.
  • A key advantage of WINNOW1 and WINNOW2 is their
    performance when few attributes are relevant.

36
Conclusion
  • If we define the number of relevant variables
    needed to express a function in the class
    F({0,1}^n, δ) to be the least number of strictly
    positive weights needed to describe a separating
    hyperplane, then this target class, for n > 1,
    can be learned with a number of mistakes bounded
    by C·k·(log n)/δ², for some constant C, when the
    target function can be expressed with k relevant
    variables.