Linear Separators - PowerPoint PPT Presentation

Slides: 26
Provided by: alext8
Transcript and Presenter's Notes

1
Linear Separators
2
Bankruptcy example
  • R is the ratio of earnings to expenses
  • L is the number of late payments on credit cards
    over the past year.
  • We would like to draw a linear separator here
    and so obtain a classifier.

3
1-Nearest Neighbor Boundary
  • The decision boundary will be the boundary
    between cells defined by points of different
    classes, as illustrated by the bold line shown
    here.

4
Decision Tree Boundary
  • Similarly, a decision tree also defines a
    decision boundary in the feature space.

Although both 1-NN and decision trees agree on
all the training points, they disagree on the
precise decision boundary and so will classify
some query points differently. This is the
essential difference between different learning
algorithms.
5
Linear Boundary
  • Linear separators are characterized by a single
    linear decision boundary in the space.
  • The bankruptcy data can be successfully separated
    in that manner.
  • But there is no guarantee that a single linear
    separator can correctly classify an arbitrary
    set of training data.

6
Linear Hypothesis Class
  • Line equation (assume 2D first):
  • $w_2 x_2 + w_1 x_1 + b = 0$
  • Fact 1: All the points $(x_1, x_2)$ lying on the line
    make the equation true.
  • Fact 2: The line separates the plane into two
    half-planes.
  • Fact 3: The points $(x_1, x_2)$ in one half-plane give
    us an inequality with respect to 0, which has the
    same direction for each of the points in the
    half-plane.
  • Fact 4: The points $(x_1, x_2)$ in the other
    half-plane give us the reverse inequality with
    respect to 0.

7
Fact 3 proof
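  • $w_2 x_2 + w_1 x_1 + b = 0$
  • We can write it as
    $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$ (taking $w_2 > 0$).

Let $(p, q)$ be a point in the half-plane and $(p, r)$ the point
on the line with the same first coordinate.
$(p, r)$ is on the line, so $r = -\frac{w_1}{w_2} p - \frac{b}{w_2}$.
But $q < r$, so we get $q < -\frac{w_1}{w_2} p - \frac{b}{w_2}$,
i.e. $w_2 q + w_1 p + b < 0$.
Since $(p, q)$ was an arbitrary point in the
half-plane, the same direction of
inequality holds for any other point of the
half-plane.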
8
Fact 4 proof
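  • $w_2 x_2 + w_1 x_1 + b = 0$
  • We can write it as
    $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$ (taking $w_2 > 0$).

Let $(p, s)$ be a point in the other half-plane and $(p, r)$ the point
on the line with the same first coordinate.
$(p, r)$ is on the line, so $r = -\frac{w_1}{w_2} p - \frac{b}{w_2}$.
But $s > r$, so we get $s > -\frac{w_1}{w_2} p - \frac{b}{w_2}$,
i.e. $w_2 s + w_1 p + b > 0$.
Since $(p, s)$ was an arbitrary point in the
(other) half-plane, the same
direction of inequality holds for any other point
of that half-plane.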
9
Corollary
  • What's an easy way to determine the direction of
    the inequalities for each half-plane?
  • Try it for the point (0,0), and determine the
    direction for the half-plane where (0,0) belongs.
  • The points of the other half-plane will have the
    opposite inequality direction.
  • How much bigger (or smaller) than zero
    $w_2 q + w_1 p + b$ is, is proportional to the distance of the
    point $(p, q)$ from the line.
  • The same can be said for an n-dimensional space.
    We simply don't talk about half-planes but
    half-spaces (the line is now a hyperplane creating
    two half-spaces).
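
The sign test and the distance claim above can be checked numerically. The following is a minimal sketch: the line `x1 + x2 - 2 = 0` and the helper names `line_value` and `signed_distance` are illustrative assumptions, not taken from the slides.

```python
import math

def line_value(w1, w2, b, point):
    """Evaluate w2*x2 + w1*x1 + b at point = (x1, x2)."""
    x1, x2 = point
    return w2 * x2 + w1 * x1 + b

def signed_distance(w1, w2, b, point):
    """Signed perpendicular distance: the line value divided by the length of (w1, w2)."""
    return line_value(w1, w2, b, point) / math.hypot(w1, w2)

# Hypothetical line x1 + x2 - 2 = 0 (an assumption for illustration).
w1, w2, b = 1.0, 1.0, -2.0

origin_value = line_value(w1, w2, b, (0.0, 0.0))  # negative: the origin's half-plane
far_value = line_value(w1, w2, b, (3.0, 3.0))     # positive: the other half-plane
on_line = line_value(w1, w2, b, (1.0, 1.0))       # zero: the point lies on the line
```

Dividing by the length of `(w1, w2)` is what turns "proportional to the distance" into the actual perpendicular distance.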

10
Linear classifier
  • We can now exploit the sign of this distance to
    define a linear classifier, one whose decision
    boundary is a hyperplane.
  • Instead of using 0 and 1 as the class labels
    (which was an arbitrary choice anyway), we use the
    sign of the distance, either +1 or -1, as the
    labels (that is, the values of the $y_i$).

$h(x) = \mathrm{sign}(w \cdot x + b)$, which outputs +1 or -1.
11
Margin
  • The margin is the product of $w \cdot x_i$ for the
    training point $x_i$ and the known sign of the
    class, $y_i$.

The margin $\gamma_i = y_i \, (w \cdot x_i)$ is proportional to the
perpendicular distance of point $x_i$ to the line
(hyperplane).
$\gamma_i > 0$: the point is correctly classified (sign of distance $= y_i$).
$\gamma_i < 0$: the point is incorrectly classified (sign of distance
$\neq y_i$).
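
As a small sketch of the classifier and the margin, the code below assumes that b is folded into the weight vector by appending a constant 1 to every point. The weights and the helper names `dot`, `predict` and `margin` are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def predict(w, x):
    """Linear classifier: +1 on the positive side of the hyperplane, -1 otherwise."""
    return 1 if dot(w, x) > 0 else -1

def margin(w, x, y):
    """Margin y * (w . x): positive exactly when the label y matches the side x is on."""
    return y * dot(w, x)

w = (1.0, 1.0, -2.0)  # hypothetical weights (w1, w2, b)
x = (3.0, 3.0, 1.0)   # the point (3, 3) with the constant 1 appended

label_if_positive = margin(w, x, +1)  # 4.0: correctly classified if the true label is +1
label_if_negative = margin(w, x, -1)  # -4.0: misclassified if the true label is -1
```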
12
Perceptron algorithm
  • How do we find a linear separator?
  • The perceptron algorithm was developed by Rosenblatt
    in the mid-1950s.
  • This is a greedy, "mistake-driven" algorithm.
  • Algorithm:
      • Pick an initial weight vector (including b), e.g.
        (0.1, ..., 0.1).
      • Repeat until all points are correctly classified:
          • Repeat for each point $x_i$:
              • Calculate the margin $y_i \, (w \cdot x_i)$ (this is a number).
              • If the margin is greater than 0, point $x_i$ is correctly classified.
              • Else, change the weights by an amount proportional to $y_i x_i$.
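
The steps above can be sketched in a few lines of Python. This is one possible reading of the algorithm, not the author's implementation: the toy data, the `rate` step size and the `max_epochs` safety cap are assumptions added for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def perceptron(points, labels, rate=1.0, max_epochs=1000):
    """Return (weights, converged); the last weight plays the role of b."""
    # Append a constant 1 to every point so that b is part of the weight vector.
    data = [tuple(p) + (1.0,) for p in points]
    w = [0.1] * len(data[0])  # initial weights, as suggested on the slide
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            if y * dot(w, x) <= 0:  # margin not positive: a mistake
                # Move the weights in the direction y * x.
                w = [wj + rate * y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: done
            return w, True
    return w, False  # cap reached (assumption: prevents looping forever)

# Hypothetical, linearly separable toy data (not the bankruptcy data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, converged = perceptron(points, labels)
```

The cap on passes is a practical guard only; on separable data the loop exits on its own once a pass produces no mistakes.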

13
Gradient Ascent/Descent
  • Why pick $y_i x_i$ as the increment to the weights?
  • The margin is a function of several input variables.
  • The variables are $w_2, w_1, w_0$ (or in general
    $w_n, \ldots, w_0$).
  • To reach the maximum of this function,
    we change the variables in the
    direction of the slope of the function.
  • The slope is represented by the gradient of the
    function.
  • The gradient is the vector of first (partial)
    derivatives of the function with respect to each
    of the input variables.
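
To make the link to the perceptron update explicit, here is a short worked derivation. It assumes the margin of point $x_i$ is written $\gamma_i(w) = y_i \, (w \cdot x_i)$ with b folded into $w$, that $x_{ij}$ denotes component $j$ of $x_i$, and that $\eta$ is a step size; these symbols are notational assumptions.

```latex
\gamma_i(w) = y_i \, (w \cdot x_i) = y_i \sum_j w_j x_{ij}
\qquad\Longrightarrow\qquad
\frac{\partial \gamma_i}{\partial w_j} = y_i x_{ij},
\qquad
\nabla_w \gamma_i = y_i x_i .

% Effect of one update step w \leftarrow w + \eta \, y_i x_i (using y_i^2 = 1):
\gamma_i(w + \eta \, y_i x_i)
  = y_i \, (w \cdot x_i) + \eta \, y_i^2 \, (x_i \cdot x_i)
  = \gamma_i(w) + \eta \, \lVert x_i \rVert^2 .
```

So each update moves the weights along the gradient and raises the margin of the misclassified point by $\eta \lVert x_i \rVert^2$, though it may lower the margins of other points.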

14
Perceptron algorithm
  • Changes for the different points interfere with
    each other.
  • So, it will not be the case that one pass through
    the points will produce a correct weight vector.
  • In general, we will have to go around multiple
    times.
  • However, the algorithm is guaranteed to terminate
    with the weights for a separating hyperplane as
    long as the data is linearly separable.
  • The proof of this fact is beyond our scope.
  • Notice that if the data is not separable, then
    this algorithm never terminates.
  • It is a good idea to keep track of the best separator
    we've seen so far.
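
One way to keep track of the best separator seen so far is sketched below; this variant is often called the pocket algorithm. The XOR-style data, the error-count criterion and the `max_epochs` cap are assumptions chosen to show the behaviour on non-separable data.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def count_errors(w, data, labels):
    """Number of points whose margin y * (w . x) is not positive."""
    return sum(1 for x, y in zip(data, labels) if y * dot(w, x) <= 0)

def pocket_perceptron(points, labels, rate=1.0, max_epochs=100):
    """Perceptron that remembers the weights with the fewest training errors."""
    data = [tuple(p) + (1.0,) for p in points]  # constant 1 folds b into w
    w = [0.1] * len(data[0])
    best_w, best_errors = list(w), count_errors(w, data, labels)
    for _ in range(max_epochs):
        for x, y in zip(data, labels):
            if y * dot(w, x) <= 0:
                w = [wj + rate * y * xj for wj, xj in zip(w, x)]
                errors = count_errors(w, data, labels)
                if errors < best_errors:  # keep the better separator "in the pocket"
                    best_w, best_errors = list(w), errors
        if best_errors == 0:
            break
    return best_w, best_errors

# Hypothetical XOR-style data: no line separates these four points.
xor_points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_labels = [1, 1, -1, -1]
best_w, best_errors = pocket_perceptron(xor_points, xor_labels)
```

Because the loop is capped, the function always returns, and the returned weights are the best ones encountered rather than whatever the last update happened to produce.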

15
Perceptron algorithm Bankruptcy data
  • The algorithm took 49 iterations through the bankruptcy data
    to stop.
  • The separator at the end of the loop is (0.4,
    0.94, -2.2).
  • We can pick some small "rate" constant to scale
    the change to w. This is called eta ($\eta$).

16
Dual Form
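  • The calculated w will be
    $w = \sum_i \alpha_i y_i x_i$
  • where $\alpha_i$ is the number of times data instance
    $x_i$ got misclassified.
  • So, for classification we'll check
    $\mathrm{sign}(w \cdot x) = \mathrm{sign}\left(\sum_i \alpha_i y_i \, (x_i \cdot x)\right)$

where x is the new data instance to be classified.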
17
Perceptron algorithm
  • $\alpha = 0$
  • Repeat until all points are correctly classified:
      • Repeat for each point $x_i$:
          • Calculate the margin $y_i \sum_j \alpha_j y_j \, (x_j \cdot x_i)$.
          • If the margin is greater than 0, point $x_i$ is correctly classified.
          • Else, increment $\alpha_i$.
  • If the data is not linearly separable, then the alphas
    grow without bound.
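
The dual form of the algorithm can be sketched as follows. Weights are never stored; only the per-point mistake counts are. The toy data, the helper names and the `max_epochs` cap are assumptions, and a constant 1 is appended to each point so that b is handled implicitly.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def augment(p):
    """Append a constant 1 so the bias is part of the dot product."""
    return tuple(p) + (1.0,)

def dual_perceptron(points, labels, max_epochs=1000):
    """Return (alpha, converged), where alpha[i] counts mistakes on point i."""
    data = [augment(p) for p in points]
    alpha = [0] * len(data)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(data, labels)):
            # Margin computed from dot products with the training points only.
            score = sum(a * yj * dot(xj, xi) for a, xj, yj in zip(alpha, data, labels))
            if yi * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            return alpha, True
    return alpha, False

def classify(alpha, points, labels, x):
    """Sign of sum_i alpha_i * y_i * (x_i . x) for a new point x."""
    xa = augment(x)
    score = sum(a * y * dot(augment(p), xa) for a, p, y in zip(alpha, points, labels))
    return 1 if score > 0 else -1

# Hypothetical, linearly separable toy data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
labels = [-1, -1, -1, 1, 1, 1]
alpha, converged = dual_perceptron(points, labels)
```

Since every alpha starts at 0, the first margin is 0 and counts as a mistake, which matches the rule that only a margin greater than 0 is correct.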

18
Non-linearly separable
19
Moving points into a different space
Very easy now to divide X's from O's.
  • Square every x1 and x2 value first.
  • A point that was at (-1,2) would now be at (1,4),
  • A point that was at (0.5,1) would now be at
    (0.25,1), and so on.
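
The squaring transform is easy to try out. In this sketch the two example points come from the slide, while the O and X coordinates and the separating line `z1 + z2 = 2` are made-up illustrations.

```python
def square_features(point):
    """Map (x1, x2) to (x1^2, x2^2)."""
    x1, x2 = point
    return (x1 * x1, x2 * x2)

# The two examples from the slide.
moved_a = square_features((-1, 2))   # (1, 4)
moved_b = square_features((0.5, 1))  # (0.25, 1)

# Hypothetical data: O's near the origin, X's surrounding them.
os_ = [(0.5, 0.5), (-0.5, 0.5), (0.5, -0.5), (-0.5, -0.5)]
xs = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]

# In the squared space, the line z1 + z2 = 2 separates the two groups.
o_side = [sum(square_features(p)) - 2 for p in os_]  # all negative
x_side = [sum(square_features(p)) - 2 for p in xs]   # all positive
```

No straight line separates these O's from the X's in the original coordinates, because the X's surround the O's.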

20
Main Idea
  • Transform the points (vectors) into another
    space using some function $\phi$,
  • and then do linear separation in the new space,
    i.e. considering the vectors
  • $\phi(x_1), \phi(x_2), \ldots, \phi(x_n)$.

21
The Kernel Trick
  • While you could write code to transform the data
    into a new space like this, it isn't usually done
    in practice because finding a dividing line when
    working with real datasets can require casting
    the data into hundreds or thousands of
    dimensions, and this is quite impractical to
    implement.
  • However, with any algorithm that uses
    dot products, including the linear classifier, you
    can use a technique called the kernel trick.
  • The kernel trick involves replacing the
    dot-product function with a new function that
    returns what the dot-product would have been if
    the data had first been transformed to a higher
    dimensional space using some mapping function.

22
The Kernel Trick
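  • Remember, all we care about is computing dot products.
  • See something interesting:
  • Let $\phi: R^2 \to R^3$ such that
  • $\phi(x) = \phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$
  • Now, let $r = (r_1, r_2, r_3)$ and $s = (s_1, s_2, s_3)$ be
    two vectors in $R^3$ corresponding to vectors
    $a = (a_1, a_2)$ and $b = (b_1, b_2)$ in $R^2$.
  • $\phi(a) \cdot \phi(b) = r \cdot s$
  • $= r_1 s_1 + r_2 s_2 + r_3 s_3$
  • $= (a_1 b_1)^2 + 2 a_1 a_2 b_1 b_2 + (a_2 b_2)^2$
  • $= (a_1 b_1 + a_2 b_2)^2$
  • $= (a \cdot b)^2$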
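
The identity above can be checked numerically. In this sketch the vectors `a` and `b` are arbitrary illustrative values and the function names are assumptions.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))

def phi(x):
    """Explicit map from R^2 to R^3: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Same quantity computed directly in R^2: (a . b)^2."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(a), phi(b))  # map first, then take the dot product in R^3
implicit = poly_kernel(a, b)    # one operation, no mapping
```

Both values agree up to floating-point rounding; here a . b = 1, so both are (approximately) 1.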

23
The Kernel Trick
  • So instead of mapping the data vectors via $\phi$ and
    computing the modified inner product $\phi(a) \cdot \phi(b)$,
    we can do it in one operation, leaving the
    mapping completely implicit.
  • Because "modified inner product" is a long name,
    we call it a kernel, $K(a, b) = \phi(a) \cdot \phi(b)$.
  • Useful kernels:
  • Polynomial kernel: $K(a, b) = (a \cdot b)^2$
  • Visualization: http://www.youtube.com/watch?v=3liCbRZPrZA
  • Gaussian kernel: $K(a, b) = e^{-\frac{1}{2}\|a - b\|^2}$
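
As a sketch of the kernel trick in action, the dual perceptron can be run with every dot product replaced by a kernel call. The code below assumes the Gaussian kernel with a negative exponent, exp(-0.5 * ||a - b||^2), uses no separate bias term, and trains on hypothetical XOR-style data that no line can separate in the original space.

```python
import math

def gaussian_kernel(a, b):
    """K(a, b) = exp(-0.5 * ||a - b||^2)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * squared_distance)

def kernel_perceptron(points, labels, kernel, max_epochs=1000):
    """Dual perceptron with the dot product replaced by kernel(x_j, x_i)."""
    alpha = [0] * len(points)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(points, labels)):
            score = sum(a * yj * kernel(xj, xi)
                        for a, xj, yj in zip(alpha, points, labels))
            if yi * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            return alpha, True
    return alpha, False

def kernel_classify(alpha, points, labels, kernel, x):
    """Sign of sum_i alpha_i * y_i * K(x_i, x)."""
    score = sum(a * y * kernel(p, x) for a, p, y in zip(alpha, points, labels))
    return 1 if score > 0 else -1

# Hypothetical XOR-style data: not linearly separable in the plane.
xor_points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_labels = [1, 1, -1, -1]
alpha, converged = kernel_perceptron(xor_points, xor_labels, gaussian_kernel)
predictions = [kernel_classify(alpha, xor_points, xor_labels, gaussian_kernel, x)
               for x in xor_points]
```

The mapping into the higher-dimensional space is never computed; the algorithm only ever asks the kernel for the value the dot product would have had there.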

24
Line Separators
It's difficult to characterize the separator that
the Perceptron algorithm will come up with.
Different runs can come up with different
separators. Can we do better?
25
Which one to pick?
  • Natural choice: pick the separator that has the
    maximal margin to its closest points on either
    side.
  • Most conservative.
  • Any other separator will be "closer" to one class
    than to the other.

Those closest points are called "support vectors".
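
To compare candidate separators by margin, one can compute the smallest perpendicular distance from any training point to each hyperplane. The sketch below uses hypothetical toy data and two hand-picked candidate lines; it only compares them and does not search for the maximal-margin separator.

```python
import math

def point_margins(w, b, points, labels):
    """Signed perpendicular distances y * (w . x + b) / ||w|| for every point."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
            for x, y in zip(points, labels)]

def geometric_margin(w, b, points, labels):
    """Distance from the separator to its closest training point."""
    return min(point_margins(w, b, points, labels))

def closest_points(w, b, points, labels):
    """Training points that attain the minimum distance."""
    margins = point_margins(w, b, points, labels)
    smallest = min(margins)
    return [p for p, m in zip(points, margins) if math.isclose(m, smallest)]

# Hypothetical toy data and two candidate separators.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 2.0), (2.0, 3.0)]
labels = [-1, -1, -1, 1, 1, 1]
margin_a = geometric_margin((1.0, 1.0), -2.5, points, labels)  # x1 + x2 = 2.5
margin_b = geometric_margin((1.0, 1.0), -1.5, points, labels)  # x1 + x2 = 1.5
nearest_a = closest_points((1.0, 1.0), -2.5, points, labels)
```

Both lines classify every point correctly, but the first keeps the same distance to the nearest points of each class, while the second runs much closer to the negative class.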