Title: Lecture 3: Perceptron
Slide 1: Lecture 3: Perceptron
Slide 2: Recap: the Perceptron algorithm
- Data points $(x_1, y_1), (x_2, y_2), \dots$, with $x_t \in \mathbb{R}^d$ and $y_t \in \{+1, -1\}$, are separable by a hyperplane through the origin.
- $w \leftarrow 0$
- for $t = 1, 2, \dots$
  - if $y_t (w \cdot x_t) \le 0$:
    - $w \leftarrow w + y_t x_t$
- Claim: Suppose that
  - $\|x_t\| \le R$ for all $t$, and
  - there is some unit vector $u \in \mathbb{R}^d$ and some margin $\gamma > 0$ such that $y_t (u \cdot x_t) \ge \gamma$ for all $t$.
- Then the Perceptron makes at most $(R/\gamma)^2$ mistakes/updates.
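A minimal sketch of this algorithm in Python (the function name and the NumPy dependency are my own, not from the slides):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron with a hyperplane through the origin.

    X: (m, d) array of points; y: (m,) array of labels in {+1, -1}.
    Sweeps the data, updating on each mistake, until a full pass
    makes no mistakes (assumes the data is linearly separable).
    """
    w = np.zeros(X.shape[1])              # w <- 0
    for _ in range(max_passes):
        mistakes = 0
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) <= 0:   # mistake: wrong side or on the boundary
                w += yt * xt              # w <- w + y_t x_t
                mistakes += 1
        if mistakes == 0:                 # a clean pass: the data is separated
            break
    return w
```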
Slide 3: Preprocessing step
- Points $(x, y)$ where $x \in \mathbb{R}^d$, $y \in \{+1, -1\}$.
- Add an extra feature to $x$, and set it to 1:
- $x' = (x, 1) \in \mathbb{R}^{d+1}$
- Then the points $(x, y)$ are linearly separable $\iff$ the points $(x', y)$ are linearly separable by a hyperplane through the origin.
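In code this is a single column append; a sketch under the same NumPy assumption:

```python
import numpy as np

def add_bias_feature(X):
    """Map each x in R^d to x' = (x, 1) in R^{d+1}, so a separator
    w . x + b = 0 becomes the through-the-origin separator (w, b) . x' = 0."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])
```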
Slide 5: Fisher's IRIS data
Four features: sepal length, sepal width, petal length, petal width.
Three classes (species of iris): setosa, versicolor, virginica.
50 instances of each.
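For experimenting along, this dataset ships with scikit-learn (assuming it is installed; the slides do not specify a data source):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # shape (150, 4): sepal length, sepal width, petal length, petal width
y = iris.target  # 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.feature_names)
print(list(iris.target_names))
```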
Slide 8: Features 1 and 2 (sepal width/length)
Slide 9: Features 3 and 4 (petal width/length)
Slide 10: Features 1 and 2; goal: separate setosa from the other two
1500 updates (900 under a different permutation of the data).
Slide 11: Features 3 and 4; goal: separate setosa from the other two
[Figure: updates marked on the scatter plot. Iteration 1: points 1, 51; iteration 2: points 1, 2; iteration 3: ...]
Slide 14: Linear separator vs. nearest neighbor
- Linear separators: a parametric model, with a fixed number of parameters to learn.
- Nearest neighbor: nonparametric; the prediction on a test point $x$ depends only on the training data near $x$, not on the rest of the training data.
- Advantages of linear separators: compact; fast convergence; potentially meaningful.
Slide 15: Nonseparable data
What if the data is not linearly separable?
If the data is almost separable, as in this case, how will the perceptron perform?
Slide 16: Online perceptron
Data comes in an endless stream, so convergence is not an issue. But how many mistakes does the algorithm make?
Suppose that for all $t > 0$ there is some unit vector $u \in \mathbb{R}^d$ and some $k_t \ge 0$ such that for all but $k_t$ of the first $t$ data points $(x_s, y_s)$, we have $y_s (u \cdot x_s) \ge \gamma$. Then for all $t > 0$, the perceptron algorithm makes at most $(R/\gamma)^2 + k_t (1 + 2R/\gamma)$ updates (i.e., mistakes) up to time $t$.
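A sketch of the online setting in Python (the generator interface is my own framing, not from the slides); the algorithm never waits for convergence, it just keeps a running mistake count:

```python
import numpy as np

def online_perceptron(stream, d):
    """Run the perceptron on an (endless) stream of (x_t, y_t) pairs,
    yielding the running mistake count after each example."""
    w = np.zeros(d)
    mistakes = 0
    for xt, yt in stream:             # stream: any iterable of (x, y) pairs
        if yt * np.dot(w, xt) <= 0:   # mistake at time t
            w += yt * xt
            mistakes += 1
        yield mistakes
```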
Slide 17: Batch perceptron
Batch algorithm:
- $w \leftarrow 0$
- while some $(x_i, y_i)$ is misclassified:
  - $w \leftarrow w + y_i x_i$
On nonseparable data this will never converge. How can this be fixed?
Dream: somehow find the separator that misclassifies the fewest points. But this is NP-hard (in fact, it is NP-hard even to solve approximately).
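The batch loop as a sketch in Python; it exhibits exactly the failure mode above, looping forever on nonseparable data:

```python
import numpy as np

def batch_perceptron(X, y):
    """Batch perceptron: loop until no point is misclassified.
    Warning: does not terminate if the data is not linearly separable."""
    w = np.zeros(X.shape[1])
    while True:
        # indices of points misclassified by the current w
        wrong = np.where(y * (X @ w) <= 0)[0]
        if len(wrong) == 0:
            return w
        i = wrong[0]            # update on one misclassified point
        w += y[i] * X[i]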
Slide 18: Fixing the batch perceptron
Idea one: only go through the data once, or a fixed number of times $K$:
- $w \leftarrow 0$
- for $k = 1$ to $K$:
  - for $i = 1$ to $m$:
    - if $(x_i, y_i)$ is misclassified:
      - $w \leftarrow w + y_i x_i$
At least this stops! Problem: the final $w$ might not be good. E.g., right before terminating, the algorithm might perform an update on a total outlier.
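The same loop with a pass limit, as a sketch:

```python
import numpy as np

def fixed_pass_perceptron(X, y, K=10):
    """Perceptron limited to K passes over the data, so it always stops.
    The final w may still be poor: the very last update could be
    driven by an outlier."""
    w = np.zeros(X.shape[1])
    for _ in range(K):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:
                w += yi * xi
    return w
```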
Slide 19: Voted-perceptron
Idea two: keep around the intermediate hypotheses, and have them vote [Freund and Schapire, 1998].
- $n \leftarrow 1$; $w_1 \leftarrow 0$; $c_1 \leftarrow 0$
- for $k = 1$ to $K$:
  - for $i = 1$ to $m$:
    - if $(x_i, y_i)$ is misclassified:
      - $w_{n+1} \leftarrow w_n + y_i x_i$
      - $c_{n+1} \leftarrow 1$
      - $n \leftarrow n + 1$
    - else:
      - $c_n \leftarrow c_n + 1$
At the end, we have a collection of linear separators $w_1, w_2, \dots$, along with survival times: $c_n$ = the amount of time that $w_n$ survived.
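A sketch of this training loop, storing each retired hypothesis with its survival count (lists instead of the slide's subscript notation):

```python
import numpy as np

def voted_perceptron_train(X, y, K=10):
    """Voted perceptron: keep every intermediate hypothesis w_n
    together with its survival time c_n."""
    w = np.zeros(X.shape[1])
    ws, cs = [], []                       # retired hypotheses and their counts
    c = 0
    for _ in range(K):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # mistake: retire w, start a new one
                ws.append(w.copy())
                cs.append(c)
                w = w + yi * xi
                c = 1
            else:
                c += 1                    # current w survives one more point
    ws.append(w.copy())                   # keep the last hypothesis too
    cs.append(c)
    return ws, cs
```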
Slide 20: Voted-perceptron, cont'd
At the end, we have a collection of linear separators $w_1, w_2, \dots$, along with survival times: $c_n$ = the amount of time that $w_n$ survived.
This $c_n$ is a good measure of the reliability of $w_n$. To classify a test point $x$, use a weighted majority vote:
$$\hat{y} = \mathrm{sign}\Big( \sum_n c_n \, \mathrm{sign}(w_n \cdot x) \Big)$$
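The vote as a sketch, reusing the ws, cs lists from the training sketch above:

```python
import numpy as np

def voted_predict(ws, cs, x):
    """Weighted majority vote over all stored hypotheses."""
    votes = sum(c * np.sign(np.dot(w, x)) for w, c in zip(ws, cs))
    return 1 if votes >= 0 else -1
```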
Slide 21: Voted-perceptron, cont'd
- Problem: we need to keep around a lot of $w_n$ vectors.
- Solutions:
  - Find representatives.
  - Alternative prediction rule: predict with the survival-time-weighted average $w_{\mathrm{avg}} = \sum_n c_n w_n$ instead of voting.
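The averaged rule as a sketch; the design win is that one vector replaces the whole collection at test time:

```python
import numpy as np

def averaged_predict(ws, cs, x):
    """Averaging instead of voting: a single vector w_avg summarizes
    all hypotheses, so only one dot product is needed per test point."""
    w_avg = sum(c * w for w, c in zip(ws, cs))
    return 1 if np.dot(w_avg, x) >= 0 else -1
```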
Slide 22: IRIS, features 3 and 4; goal: separate setosa (circle) from the rest
Corrupted setosa.
Run Voted-Perc for five rounds. Survival times $c_n$: 0, 1, 2, 3, 1, 1, 5, 117, 2, 41, 2, 13, 1, 3, 8, 222, 2, 173, 3, 95, 3, 52.
Final hypothesis: 1 wrong (either voting or averaging).
Slide 23: IRIS, features 3 and 4; goal: separate one class from the other two (o/x)
100 rounds, 1595 updates (5 errors). Final hypothesis: 5 errors for voting, 6 for averaging.
Slide 24: Postscript: multiclass
What if there are $k$ classes?
Reduce to binary: all-vs-one.
Not always easy to do.
[Figure: data points labeled with classes 1, 2, 3, 4.]
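A sketch of the all-vs-one (one-vs-rest) reduction, reusing fixed_pass_perceptron from the earlier sketch; breaking ties by the largest score is one common choice, not something the slides specify:

```python
import numpy as np

def one_vs_rest_train(X, y, classes, K=10):
    """Train one binary perceptron per class: class c vs. everyone else."""
    return {c: fixed_pass_perceptron(X, np.where(y == c, 1, -1), K)
            for c in classes}

def one_vs_rest_predict(models, x):
    """Predict the class whose separator is most confident on x."""
    return max(models, key=lambda c: np.dot(models[c], x))
```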
Slide 25: Some open problems
Modify the (voted) perceptron algorithm to:
1. Find a linear separator with large margin.
2. Give up on troublesome points after a while.