Efficient classification for metric data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Efficient classification for metric data


1
Efficient classification for metric data
  • Lee-Ad Gottlieb, Weizmann Institute
  • Aryeh Kontorovich, Ben-Gurion U.
  • Robert Krauthgamer, Weizmann Institute

2
Classification problem
  • Probabilistic concept learning
  • S is a set of n examples (x,y) drawn from X × {-1,+1}
    according to some unknown probability distribution P
  • The learner produces a hypothesis h: X → {-1,+1}
  • A good hypothesis (classifier) minimizes the
    generalization error
  • P(x,y)[h(x) ≠ y]  (a small sketch of this setup follows below)
  • A popular solution uses kernels
  • Data are represented as vectors, and kernels take the
    dot-product of vectors
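
As a minimal illustration of this setup (the names and toy data below are mine, not from the slides): a hypothesis is just a function from X to {-1,+1}, and its empirical error on a sample approximates the generalization error over the unknown distribution P.

def empirical_error(hypothesis, sample):
    """Fraction of examples (x, y) in the sample that the hypothesis gets wrong."""
    mistakes = sum(1 for x, y in sample if hypothesis(x) != y)
    return mistakes / len(sample)

# Toy usage: X is the real line, h is a threshold classifier.
S = [(-2.0, -1), (-0.5, -1), (0.3, +1), (1.7, +1)]
h = lambda x: +1 if x > 0 else -1
print(empirical_error(h, S))  # 0.0 on this toy sample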

3
Finite metric space
  • (X,d) is a metric space if
  • X is a set of points
  • d is a distance function that is
  • nonnegative,
  • symmetric, and
  • satisfies the triangle inequality
  • Classification for metric data?
  • Problem:
  • No vector representation
  • No notion of dot-product
  • Can't use kernels
  • What can be done in this setting?

(Illustration: a map with the pairwise road distances Tel-Aviv to Jerusalem 62 km, Tel-Aviv to Haifa 95 km, Jerusalem to Haifa 151 km; the sketch below checks the metric axioms on these values.)
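
A minimal sketch checking the three metric axioms on the distances from the illustration (assuming the 62/95/151 km figures are the pairwise road distances shown on the map):

from itertools import permutations

cities = ["Tel-Aviv", "Jerusalem", "Haifa"]
road_km = {
    ("Tel-Aviv", "Jerusalem"): 62.0,
    ("Tel-Aviv", "Haifa"): 95.0,
    ("Jerusalem", "Haifa"): 151.0,
}

def d(a, b):
    """Symmetric lookup with d(x, x) = 0."""
    if a == b:
        return 0.0
    return road_km.get((a, b), road_km.get((b, a)))

# Nonnegativity, symmetry, and the triangle inequality.
for a, b, c in permutations(cities, 3):
    assert d(a, b) >= 0
    assert d(a, b) == d(b, a)
    assert d(a, c) <= d(a, b) + d(b, c)
print("metric axioms hold on this toy example")
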
4
Preliminary definition
  • The Lipschitz constant L of a function f: X → R
    is the smallest value satisfying, for all
    points xi, xj in X,
  • L ≥ |f(xi) - f(xj)| / d(xi, xj)
  • Consider a hypothesis consistent with all of S
  • Since differently labeled points have |f(xi) - f(xj)| = 2,
    its Lipschitz constant is determined by the
    closest pair of differently labeled points:
  • L = max 2 / d(xi, xj) over all xi in S-, xj in S+,
    i.e. L = 2 / d(S+, S-)  (computed in the sketch below)
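
A minimal sketch of this computation (names are illustrative; `metric` stands for any distance function d): the smallest Lipschitz constant of a ±1-valued hypothesis consistent with the sample is 2 over the distance between the closest differently labeled pair.

def consistent_lipschitz_constant(sample, metric):
    """sample: list of (point, label) pairs with label in {-1, +1}."""
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    d_plus_minus = min(metric(p, q) for p in positives for q in negatives)  # d(S+, S-)
    return 2.0 / d_plus_minus

# Toy usage on the real line.
S = [(0.0, -1), (0.4, -1), (1.0, +1), (3.0, +1)]
print(consistent_lipschitz_constant(S, lambda a, b: abs(a - b)))  # 2 / 0.6 ≈ 3.33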

5
Classification for metric data
  • A powerful framework for this problem was
    introduced by
  • von Luxburg & Bousquet (vLB, JMLR '04)
  • The natural hypotheses (classifiers) to consider
    are maximally smooth Lipschitz functions
  • Given the classifier h on S, the problem of evaluating
    h at new points of X reduces to the problem
    of finding a Lipschitz function consistent with h
  • This is the Lipschitz extension problem, a classic problem in
    analysis
  • For example
  • f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ] over all x_i in S
  • Function evaluation reduces to exact nearest
    neighbor search (assuming zero training error);
    a sketch follows below
  • Strong theoretical motivation for the NNS
    classification heuristic
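
A brute-force sketch of the extension formula above (illustrative names, no libraries). Note that f(x) = min(1 + L·d(x, S+), -1 + L·d(x, S-)), so its value, and in particular its sign, depends only on the distances from x to the nearest positive and nearest negative sample point, which is why evaluation reduces to nearest neighbor search.

def lipschitz_extension(sample, metric):
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    d_plus_minus = min(metric(p, q) for p in positives for q in negatives)  # d(S+, S-)
    L = 2.0 / d_plus_minus

    def f(x):
        # min over the sample of y_i + L * d(x, x_i); the minimum is attained at the
        # nearest neighbor of x within S+ or within S-.
        return min(y + L * metric(x, xi) for xi, y in sample)

    return f

S = [(0.0, -1), (0.4, -1), (1.0, +1), (3.0, +1)]
f = lipschitz_extension(S, lambda a, b: abs(a - b))
print(f(0.1), f(2.0))  # negative close to S-, positive close to S+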

6
Two new directions
  • The framework of vLB leaves open two further
    questions
  • Efficient evaluation of the classifier h on X
  • In an arbitrary metric space, exact NNS requires
    Θ(n) time
  • Can we do better?
  • Bias-variance tradeoff
  • Which sample points in S should h ignore?

(Illustration: a query point q among sample points labeled +1 and -1.)
7
Doubling Dimension
  • Definition: the ball B(x,r) is the set of all points within
    distance r from x
  • The doubling constant λ (of a metric M) is the
    minimum value λ > 0 such that every ball can be
    covered by λ balls of half the radius
  • First used by [Ass-83], algorithmically by
    [Cla-97]
  • The doubling dimension is dim(M) = log₂ λ(M)
    [GKL-03]
  • A metric is doubling if its doubling dimension is
    constant
  • Packing property of doubling spaces:
  • a set with diameter D and minimum inter-point
    distance a contains at most
    (D/a)^{O(log λ)} points

(Illustration: a ball covered by seven half-radius balls, so here λ = 7; a small greedy-covering sketch follows below.)
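
As a small illustration of the definition (not from the slides), a greedy procedure that covers a ball B(x, r) with balls of radius r/2, giving an upper bound on how many half-radius balls are needed:

def greedy_half_radius_cover(points, metric, center, r):
    """Greedily pick centers so that balls of radius r/2 cover B(center, r)."""
    uncovered = [p for p in points if metric(p, center) <= r]
    centers = []
    while uncovered:
        c = uncovered[0]                      # any still-uncovered point becomes a center
        centers.append(c)
        uncovered = [p for p in uncovered if metric(p, c) > r / 2.0]
    return centers

# Toy usage: points on a line (doubling dimension 1) need only a few centers.
pts = [i / 10.0 for i in range(0, 101)]
print(len(greedy_half_radius_cover(pts, lambda a, b: abs(a - b), center=5.0, r=2.0)))
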
8
Application I
  • We provide generalization bounds for Lipschitz
    functions on spaces with low doubling dimension
  • vLB provided similar bounds using covering
    numbers and Rademacher averages
  • Fat-shattering analysis (spelled out below):
  • a Lipschitz function (with constant L) shatters a set ⇒
    the inter-point distance is at least 2/L
  • packing property ⇒
  • the set has at most (DL)^{O(log λ)} points
  • So the fat-shattering dimension is low
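
Spelled out (D the diameter, λ the doubling constant, T a shattered set): any two points of T that receive different signs satisfy |f(x_i) - f(x_j)| = 2, so

\[
d(x_i, x_j) \;\ge\; \frac{|f(x_i) - f(x_j)|}{L} \;=\; \frac{2}{L},
\qquad\text{hence}\qquad
|T| \;\le\; \Big(\frac{D}{2/L}\Big)^{O(\log \lambda)} \;=\; (DL)^{O(\log \lambda)} .
\]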

9
Application I
  • Theorem
  • For any f that classifies a sample of size n
    correctly, we have with probability at least 1-δ
  • P(x,y)[sgn(f(x)) ≠ y] ≤ (2/n) (d log(34en/d) log(578n) + log(4/δ))
  • Likewise, if f is correct on all but k examples,
    we have with probability at least 1-δ
  • P(x,y)[sgn(f(x)) ≠ y] ≤ k/n + ( (2/n) (d ln(34en/d) log₂(578n) + ln(4/δ)) )^{1/2}
  • In both cases, d ≈ ⌈8LD⌉^{⌈log λ⌉} + 1, the fat-shattering dimension bound

10
Application II
  • Evaluation of h at new points of X
  • Lipschitz extension function
  • f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S-) ]
  • Requires exact nearest neighbor search, which can
    be expensive!
  • New tool: (1+ε)-approximate nearest neighbor
    search
  • λ^{O(1)} log n + ε^{-O(log λ)} time
  • [KL-04, HM-05, BKL-06, CG-06]
  • If we evaluate f(x) using an approximate NNS, we
    can show that the result agrees with (the sign
    of) at least one of
  • g(x) = (1+ε) f(x) + ε
  • h(x) = (1+ε) f(x) - ε
  • Note that g(x) ≥ f(x) ≥ h(x)
  • g(x) and h(x) have Lipschitz constant (1+ε)L, so
    they, and hence the approximate function, generalize
    well (see the sketch below)
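
A small sketch of the effect of approximate distances (purely illustrative, not the paper's construction): the inflation factor below simulates a (1+ε)-approximate NNS by overestimating each nearest-neighbor distance by at most a (1+ε) factor, and the assert checks that the resulting value is sandwiched between f(x) and (1+ε)f(x) + ε, in the spirit of g and h above.

import random

def extension_value(x, pos, neg, metric, L, inflate=1.0):
    d_pos = min(metric(x, p) for p in pos) * inflate   # (approximate) distance to S+
    d_neg = min(metric(x, q) for q in neg) * inflate   # (approximate) distance to S-
    return min(1.0 + L * d_pos, -1.0 + L * d_neg)

pos, neg = [1.0, 3.0], [0.0, 0.4]
metric = lambda a, b: abs(a - b)
L = 2.0 / min(metric(p, q) for p in pos for q in neg)
eps = 0.1

for x in [0.1, 0.7, 2.0, 5.0]:
    exact = extension_value(x, pos, neg, metric, L)
    approx = extension_value(x, pos, neg, metric, L, inflate=1.0 + eps * random.random())
    assert exact <= approx <= (1.0 + eps) * exact + eps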

11
Bias-variance tradeoff
  • Which sample points in S should h ignore?
  • If f is correct on all but k examples, we have
    with probability at least 1-δ
  • P(x,y)[sgn(f(x)) ≠ y] ≤ k/n + ( (2/n) (d ln(34en/d) log₂(578n) + ln(4/δ)) )^{1/2}
  • where d ≈ ⌈8LD⌉^{⌈log λ⌉} + 1, as before

12
Bias-variance tradeoff
  • Algorithm
  • Fix a target Lipschitz constant L
  • Out of O(n^2) possibilities
  • Locate all pairs of points from S+ and S- whose
    distance is less than 2/L
  • At least one point of each such pair has to be counted as
    an error
  • Goal: remove as few points as possible

13
Bias variance tradeoff
  • Algorithm
  • Fix a target Lipschitz constant L
  • Out of O(n2) possibilities
  • Locate all pairs of points from S and S- whose
    distance is less than 2L
  • At least one of these points has to be taken as
    an error
  • Goal Remove as few points as possible
  • Minimum vertex cover
  • NP-Complete
  • Admits a 2-approximation in O(E) time
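
A sketch of the greedy 2-approximation on the conflict graph (differently labeled pairs at distance below 2/L); the helper names are illustrative, not from the paper:

def conflict_pairs(sample, metric, L):
    """All index pairs (i, j) with opposite labels and d(x_i, x_j) < 2/L."""
    return [(i, j)
            for i, (xi, yi) in enumerate(sample)
            for j, (xj, yj) in enumerate(sample)
            if i < j and yi != yj and metric(xi, xj) < 2.0 / L]

def greedy_vertex_cover(pairs):
    """Classic 2-approximation: take both endpoints of any uncovered edge."""
    cover = set()
    for i, j in pairs:
        if i not in cover and j not in cover:
            cover.update((i, j))
    return cover

# Toy usage: the returned indices are the sample points declared as errors.
S = [(0.0, -1), (0.4, -1), (0.9, +1), (1.0, +1), (3.0, -1)]
print(greedy_vertex_cover(conflict_pairs(S, lambda a, b: abs(a - b), L=3.0)))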

14
Bias variance tradeoff
  • Algorithm
  • Fix a target Lipschitz constant L
  • Out of O(n2) possibilities
  • Locate all pairs of points from S and S- whose
    distance is less than 2L
  • At least one of these points has to be taken as
    an error
  • Goal Remove as few points as possible
  • Minimum vertex cover
  • NP-Complete
  • Admits a 2-approximation in O(E) time
  • Minimum vertex cover on a bipartite graph
  • Equivalent to maximum matching (Konigs theorem)
  • Admits an exact solution in O(n2.376) randomized
    time
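
A sketch of the exact route via Kőnig's theorem, assuming the networkx library is available: the conflict graph is bipartite (S+ on one side, S- on the other), so a minimum vertex cover can be read off a maximum matching. The O(n^2.376) bound on the slide refers to a matrix-multiplication-based matching algorithm; the standard Hopcroft–Karp routine used below is the simpler off-the-shelf choice.

import networkx as nx
from networkx.algorithms import bipartite

def minimum_error_set(sample, metric, L):
    """Smallest set of sample indices whose removal leaves no conflicting pair."""
    pos = [i for i, (_, y) in enumerate(sample) if y == +1]
    neg = [i for i, (_, y) in enumerate(sample) if y == -1]
    G = nx.Graph()
    G.add_nodes_from(pos, bipartite=0)
    G.add_nodes_from(neg, bipartite=1)
    for i in pos:
        for j in neg:
            if metric(sample[i][0], sample[j][0]) < 2.0 / L:
                G.add_edge(i, j)
    matching = bipartite.maximum_matching(G, top_nodes=set(pos))
    return bipartite.to_vertex_cover(G, matching, top_nodes=set(pos))

S = [(0.0, -1), (0.4, -1), (0.9, +1), (1.0, +1), (3.0, -1)]
print(minimum_error_set(S, lambda a, b: abs(a - b), L=3.0))  # e.g. {1}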

15
Bias-variance tradeoff
  • Algorithm
  • For each of the O(n^2) candidate values of L
  • Run the matching algorithm to find the minimum error
  • Evaluate the generalization bound for this value of L
  • O(n^4.376) randomized time in total
  • Better algorithm
  • Binary search over the O(n^2) candidate values of L
  • For each value considered
  • Run the matching algorithm
  • Find the minimum error in O(n^2.376 log n) randomized
    time overall
  • Evaluate the generalization bound for this value of L
  • Alternatively, run the greedy 2-approximation
  • Approximate the minimum error in O(n^2 log n) time overall
  • Evaluate the approximate generalization bound for
    this value of L
  • (A sketch of this selection loop follows below.)
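
A self-contained sketch of the selection loop (illustrative only: the `complexity` term below is a placeholder stand-in for the slide's generalization bound, and a plain scan over the candidates stands in for the binary search):

import math

def conflict_pairs(sample, metric, L):
    return [(i, j)
            for i, (xi, yi) in enumerate(sample)
            for j, (xj, yj) in enumerate(sample)
            if i < j and yi != yj and metric(xi, xj) < 2.0 / L]

def greedy_error_count(pairs):
    cover = set()
    for i, j in pairs:
        if i not in cover and j not in cover:
            cover.update((i, j))
    return len(cover)

def candidate_constants(sample, metric):
    """One candidate L = 2/d for every cross-pair distance d."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    return sorted({2.0 / metric(p, q) for p in pos for q in neg})

def select_lipschitz_constant(sample, metric, complexity):
    n = len(sample)
    best = None
    for L in candidate_constants(sample, metric):
        k = greedy_error_count(conflict_pairs(sample, metric, L))  # approximate min error
        score = k / n + complexity(L, n)                           # bias + complexity
        if best is None or score < best[0]:
            best = (score, L, k)
    return best  # (score, chosen L, number of errors)

S = [(0.0, -1), (0.4, -1), (0.9, +1), (1.0, +1), (3.0, -1)]
print(select_lipschitz_constant(S, lambda a, b: abs(a - b),
                                complexity=lambda L, n: math.sqrt(L / n)))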

16
Conclusion
  • Results
  • Generalization bounds for Lipschitz classifiers
    in doubling spaces
  • Efficient evaluation of the Lipschitz extension
    hypothesis using approximate NNS
  • Efficient calculation of the bias-variance
    tradeoff
  • Continuing research
  • Similar results for continuous labels