Title: Online learning with mistake bounds
1. On-line learning with mistake bounds
(Reminder) Algorithms for learning linear threshold functions
Model of learning:
- X = {0,1}^n, the instance space
- C = {c | c: X → {−1, 1}}, the class of concepts; each concept classifies an instance x as negative (false) or positive (true)
- The goal of concept learning: discover an unknown target concept c from labeled instances. The target concept can be described by a Boolean function, and this function is described by a weight vector w.
- The goal of the learner is to make few mistakes.
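To make the model concrete, here is a minimal sketch (my own illustration, not from the slides) in which the hidden target is a monotone disjunction over {0,1}^n:

```python
# Hypothetical illustration of the learning model: instances are
# bit vectors in {0,1}^n; the hidden target concept is x2 OR x3.
n = 4
relevant = {1, 2}                  # 0-based indices of x2 and x3

def target_concept(x):
    """Label an instance +1 (positive) or -1 (negative)."""
    return 1 if any(x[i] == 1 for i in relevant) else -1

instances = [(0, 0, 0, 0), (0, 1, 0, 0), (1, 0, 0, 1)]
labels = [target_concept(x) for x in instances]   # [-1, 1, -1]
```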
2. On-line learning with mistake bounds
An algorithm's learning behavior is evaluated by counting the worst-case number of mistakes (its mistake bound) that it makes while learning a worst-case function from a specified class of functions, on a worst-case sequence of examples.
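One standard way to formalize this (my phrasing; the slide states it only in words) is as a max over targets and example sequences:

```latex
% Mistake bound of algorithm A on concept class C: the worst case is
% taken over both the target concept and the example sequence.
\[
  M_A(C) \;=\; \max_{c \in C}\;\max_{(x_1, x_2, \dots)}
  \bigl|\{\, t : \hat{y}_t \neq c(x_t) \,\}\bigr|
\]
```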
3. Algorithms for learning linear threshold functions
- General algorithm (see the sketch after this list):
  - Initialize the vector w_1 at t = 1.
  - On round t (given a vector x_t and y_t, the label for x_t): predict ŷ_t = sgn(w_t · x_t − θ).
  - If ŷ_t ≠ y_t, update w_{t+1} using w_t, x_t, y_t.
- Difference between the Perceptron and Winnow algorithms:
  - How should we initialize the vector w_1?
  - How should we update the vector w_t?
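A minimal Python sketch of this loop (the names online_learn and update are mine; a concrete algorithm supplies the initialization and the update rule):

```python
def online_learn(examples, w, theta, update):
    """Generic online linear-threshold learner.

    examples: iterable of (x_t, y_t) pairs, y_t in {-1, +1}
    w:        initial weight vector w_1 (algorithm-specific)
    theta:    decision threshold
    update:   rule producing w_{t+1} from (w_t, x_t, y_t)
    Returns the number of mistakes made on the sequence.
    """
    mistakes = 0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score - theta > 0 else -1   # sgn(w.x - theta); ties -> -1
        if y_hat != y:                           # mistake-driven: update only on errors
            w = update(w, x, y)
            mistakes += 1
    return mistakes
```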
4. Winnow algorithm
- Input: vectors x_t and labels y_t
- Goal: find a vector w = (w_1, w_2, ..., w_n)
- Each w_i is a non-negative real number
- Parameters:
  - θ: the threshold
  - α: the parameter for the weight change
5. Winnow algorithm
- Special case: θ = n, α = 2
- This is an algorithm for learning monotone disjunctions (disjunctions in which no literal appears negated), that is, functions of the form f(x_1, x_2, ..., x_n) = x_{i1} ∨ x_{i2} ∨ ... ∨ x_{ik}
- A monotone disjunction is linearly separable:
  - there is a weight vector w such that, for all x: c(x) = 1 ⟺ w · x ≥ θ
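To see why, one separating weight vector (my choice; the slide only asserts separability) puts weight θ on each relevant variable:

```latex
% Take w_i = \theta for i \in \{i_1, \dots, i_k\} and w_i = 0 otherwise.
% If c(x) = 1, some relevant x_{i_j} = 1, so w \cdot x \ge \theta;
% if c(x) = 0, all relevant x_{i_j} = 0, so w \cdot x = 0 < \theta.
\[
  w \cdot x \;=\; \theta \sum_{j=1}^{k} x_{i_j}
  \quad\Longrightarrow\quad
  c(x) = 1 \;\Longleftrightarrow\; w \cdot x \ge \theta .
\]
```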
6. Winnow algorithm
- Initialize: θ = n, α = 2, w_{1,i} = 1 for i = 1 to n
- For each data point x_t:
  - predict ŷ_t = sgn(w_t · x_t − θ)
  - if ŷ_t < 0 and y_t = 1 (false negative):
    - then for each x_{t,i} = 1: w_{t,i} ← w_{t,i} · α (promotion)
  - else if ŷ_t > 0 and y_t = −1 (false positive):
    - then for each x_{t,i} = 1: w_{t,i} ← w_{t,i} / α (demotion)
Perceptron, for comparison, updates additively on a mistake: w_{t,i} ← w_{t,i} − sgn(w_t · x_t − θ) · x_{t,i}, i.e. w_{t,i} ← w_{t,i} + y_t · x_{t,i}.
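A runnable sketch of this special case (θ = n, α = 2; the function name winnow_fit is mine, and ties w · x = θ are predicted positive):

```python
def winnow_fit(examples, n, alpha=2.0):
    """Winnow with theta = n: multiplicative promotion/demotion.

    examples: iterable of (x, y), x a 0/1 vector of length n, y in {-1, +1}.
    Returns (final weights, number of mistakes).
    """
    theta = float(n)
    w = [1.0] * n                        # w_{1,i} = 1 for all i
    mistakes = 0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= theta else -1
        if y_hat != y:
            mistakes += 1
            for i in range(n):
                if x[i] == 1:
                    if y == 1:           # false negative: promote (double)
                        w[i] *= alpha
                    else:                # false positive: demote (halve)
                        w[i] /= alpha
    return w, mistakes
```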
7. Winnow algorithm
Example trace (shown as a figure in the original slides): target f = x_2 ∨ x_3, n = 4.
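Running the sketch above on this target gives the flavor of the slide's example (the instance sequence below is mine):

```python
# Target f = x2 OR x3 (0-based indices 1 and 2), n = 4.
f = lambda x: 1 if (x[1] == 1 or x[2] == 1) else -1
xs = [(1, 0, 0, 1), (0, 1, 0, 0), (1, 0, 1, 0),
      (0, 0, 0, 1), (0, 1, 1, 0)]
w, m = winnow_fit([(x, f(x)) for x in xs], n=4)
print(w, m)   # -> [2.0, 2.0, 2.0, 1.0] 2: two promotions, no demotions
```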
8. Winnow algorithm
- Theorem: the Winnow algorithm makes at most O(k log n) errors, where
  - k = number of variables in the target disjunction,
  - n = number of attributes (parameters).
- Proof:
  - Let u be the number of cases in which weights were doubled (i.e., false negatives) and v the number of cases in which weights were halved (i.e., false positives).
  - If attribute i is part of the target function, call w_i a relevant weight.
  - 1) Number of mistakes on positive examples (false negatives):
    - Relevant weights never decrease: weights are halved only on false positives, and on a negative example every relevant variable is 0.
    - Each weight stays below 2n: a weight is doubled only when w_t · x_t < θ = n, so it is below n just before doubling.
    - Conclusion:
      - no relevant weight needs to be doubled more than 1 + log2(n) times;
      - each false negative doubles at least one relevant weight, and there are at most k of them, so u ≤ k(1 + log2(n)).
9. Winnow algorithm
- 2) Number of mistakes on negative examples (false positives):
  - Let T = total weight = Σ_i w_i. Initially T = n, and T > 0 always.
  - Each false-negative mistake adds at most n to T (the doubled weights sum to w_t · x_t < θ = n).
  - Hence T < n + u·n ≤ n·k(1 + log2(n)) + n.
  - Each false-positive mistake subtracts at least n/2 from T (the halved weights sum to w_t · x_t ≥ θ = n).
  - Therefore v ≤ (n·k(1 + log2(n)) + n) / (n/2), i.e.
  - v ≤ 2k(1 + log2(n)) + 2.
- The total number of mistakes is at most
  - u + v ≤ 2 + 3k(1 + log2(n)), i.e. O(k log n).
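As a sanity check (my own experiment, not from the slides), one can run winnow_fit from the sketch above on random examples and compare the mistake count with 2 + 3k(1 + log2(n)):

```python
import math, random

random.seed(0)
n = 64
relevant = [3, 10, 40]                   # k = 3 relevant variables
k = len(relevant)
f = lambda x: 1 if any(x[i] for i in relevant) else -1

xs = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(2000)]
_, m = winnow_fit([(x, f(x)) for x in xs], n=n)

bound = 2 + 3 * k * (1 + math.log2(n))   # = 65 for these values
print(m, "<=", bound)                    # mistakes never exceed the bound
```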
10. Winnow algorithm
- If not all examples are consistent with the target function:
  - Define m_c = number of mistakes made by concept c.
  - Define A_c = number of attribute errors in the data for concept c (a small computation of A_c follows this slide). For each example x_t:
    - if x_t is labeled positive but contains no relevant variables of c: A_c ← A_c + 1
    - if x_t is labeled negative but satisfies r relevant variables of c: A_c ← A_c + r
  - Conclusion: if c is a disjunction of k variables, then m_c ≤ A_c ≤ k·m_c.
- It may be shown that, for any sequence of examples and any disjunction c, the number of mistakes made by Winnow is O(A_c + k log n).
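A small sketch of the attribute-error count A_c (the helper name attribute_errors is mine; relevant is the index set of c's variables):

```python
def attribute_errors(examples, relevant):
    """A_c: total number of attribute flips needed to make every
    example consistent with the disjunction c over `relevant`."""
    a_c = 0
    for x, y in examples:
        r = sum(x[i] for i in relevant)   # satisfied relevant variables
        if y == 1 and r == 0:
            a_c += 1                      # turn one relevant bit on
        elif y == -1 and r > 0:
            a_c += r                      # turn r relevant bits off
    return a_c
```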
11. Perceptron vs. Winnow
Example: committees of experts.
- Perceptron: number of mistakes O(nk)
  - Margin assumption: y_t(u · x_t) ≥ δ for all t, with
  - u = (1/√k)(0, 1, 0, ..., 1) (k nonzero entries, so ‖u‖ = 1)
  - x_t = (1/√n)(1, −1, −1, ..., 1) (so ‖x_t‖ = 1)
  - (u · x_t) ≠ 0 for all x_t ⟹ y_t(u · x_t) ≥ 1/√(nk)
  - Suppose δ = 1/√(nk); since the number of mistakes is at most 1/δ², it can reach nk (worked out below)
- Winnow: number of mistakes O(k log n)
- The mistake bound does not depend on the order of the examples or on the specific examples
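The arithmetic behind the nk figure, written out using the standard Perceptron margin bound (mistakes ≤ (R/δ)² with R = max_t ‖x_t‖, here R = 1):

```latex
% u . x_t is a sum of k terms of magnitude 1/\sqrt{nk}, so whenever it
% is nonzero its magnitude is at least 1/\sqrt{nk}; taking this as the
% margin \delta gives the O(nk) mistake bound.
\[
  \delta = \frac{1}{\sqrt{nk}}, \qquad
  \text{mistakes} \;\le\; \frac{R^2}{\delta^2}
  \;=\; \bigl(\sqrt{nk}\bigr)^{2} \;=\; nk .
\]
```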
12. Perceptron vs. Winnow
- Winnow
  - Online: can adjust to a changing target over time
  - Advantages:
    - Simple
    - Guaranteed to learn a linearly separable problem
    - Suitable for problems with many irrelevant attributes
  - Limitations:
    - only linear separations
    - only converges for linearly separable data
    - not really efficient with many features
- Perceptron
  - Online: can adjust to a changing target over time
  - Advantages:
    - Simple
    - Guaranteed to learn a linearly separable problem
  - Limitations:
    - only linear separations
    - only converges for linearly separable data
    - not really efficient with many features