1
A PAC Model for Learning from Labeled and
Unlabeled Data
  • Maria-Florina Balcan, Avrim Blum
  • Carnegie Mellon University, Computer Science Department

2
Outline of the talk
  • Supervised Learning
  • PAC Model
  • Sample Complexity
  • Algorithm Design
  • Semi-supervised Learning
  • A PAC Style Model
  • Examples of results in our model
  • Sample Complexity
  • Algorithmic Issues: Co-training of linear
    separators
  • Conclusions
  • Implications of our Analysis

3
Usual Supervised Learning Problem
  • Imagine you want a computer program to help you
    decide which email messages are spam and which
    are important.
  • Might represent each message by n features.
    (e.g., return address, keywords, spelling, etc.).
  • Take a sample S of data, labeled according to
    whether they were/weren't spam.
  • Goal of algorithm is to use data seen so far to
    produce good prediction rule (a "hypothesis") h
    for future data.

4
The concept learning setting
E.g.,
  • Given data, some reasonable rules might be:
  • Predict SPAM if unknown AND (sex OR sales)
  • Predict SPAM if sales + sex − known > 0.
  • ...

5
Supervised Learning, Big Questions
  • Algorithm Design: How might we automatically
    generate rules that do well on observed data?
  • Sample Complexity / Confidence Bounds: What kind of
    confidence do we have that they will do well in
    the future?

6
Supervised Learning Formalization (PAC)
  • PAC model: a nice/standard model for learning from
    labeled data.
  • X - instance space
  • S = {(x, l)} - set of labeled examples
  • examples assumed to be drawn i.i.d. from some
    distribution D over X and labeled by some target
    concept c
  • labels ∈ {−1, 1} - binary classification
  • Want to do optimization over S to find some
    hypothesis h, but we want h to have small error
    over D:
  • err(h) = Pr_{x∼D}[h(x) ≠ c(x)]

7
Basic PAC Learning Definitions
  • Algorithm A PAC-learns concept class C if for any
    target c in C, any distribution D over X, any
    ε, δ > 0:
  • A uses at most poly(n, 1/ε, 1/δ, size(c)) examples
    and running time.
  • With probability ≥ 1−δ, A produces h in C of error
    at most ε.
  • Notation: true error of h: err(h) = Pr_{x∼D}[h(x) ≠ c(x)];
    empirical error of h: err̂(h) = fraction of S that h
    mislabels.

8
Sample Complexity: Uniform Convergence, Finite
Hypothesis Spaces
  • Realizable Case
  • 1. Prob. that a bad hypothesis (one with err(h) > ε) is
    consistent with m examples is at most (1−ε)^m.
  • 2. So, prob. that there exists a bad consistent hypothesis
    is at most |C|(1−ε)^m.
  • 3. Set this to δ and solve: the # of examples needed is at
    most (1/ε)(ln|C| + ln(1/δ)).
  • If not too many rules to choose from, then
    unlikely some bad one will fool you just by
    chance.
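
A quick numeric sketch of what this bound gives (the function and the example numbers below are ours, not from the talk):

```python
import math

def pac_labeled_bound(num_hypotheses, eps, delta):
    """Realizable-case bound: m >= (1/eps)(ln|C| + ln(1/delta)) labeled
    examples suffice so that, w.p. >= 1 - delta, every h in C that is
    consistent with the sample has true error at most eps."""
    return math.ceil((math.log(num_hypotheses) + math.log(1 / delta)) / eps)

# e.g., |C| = 2^20 candidate rules, eps = 0.1, delta = 0.05:
print(pac_labeled_bound(2**20, 0.1, 0.05))  # -> 169
```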

9
Sample Complexity: Uniform Convergence, Finite
Hypothesis Spaces
  • Realizable Case: as above, (1/ε)(ln|C| + ln(1/δ))
    examples suffice.
  • Agnostic Case
  • What if there is no perfect h? Then m ≥ (1/(2ε²))(ln|C| +
    ln(2/δ)) examples suffice so that w.p. ≥ 1−δ every h in C
    satisfies |err(h) − err̂(h)| ≤ ε.
  • Gives hope for local optimization over the
    training data.

10
Shattering, VC-dimension
  • Def: A set of points S is shattered by a concept
    class C if there are concepts in C that split S
    in all 2^|S| possible ways.
  • VC-dimension of C is the size of the largest set
    of points that can be shattered by C.
  • Example: C = the class of subintervals [a, b] of [0, 1],
    0 ≤ a ≤ b ≤ 1.
  • VC-dim(C) = 2
  • C[S] = the set of splittings of dataset S using
    concepts from C.
  • C[m] = maximum number of ways to split m points
    using concepts in C, i.e. C[m] = max_{|S| = m} |C[S]|.
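
To sanity-check the VC-dim(C) = 2 claim, here is a small brute-force shattering test for intervals (a sketch of our own; for this class, candidate endpoints at the data points suffice):

```python
def interval_labels(points, a, b):
    """Label each point True if it falls inside [a, b], else False."""
    return tuple(a <= p <= b for p in points)

def shattered_by_intervals(points):
    """Brute-force check that intervals [a, b] realize all 2^|S| labelings.
    Endpoints at the data points (plus a sentinel that yields the
    all-negative labeling) suffice as candidates for this class."""
    cands = sorted(points) + [-1.0]
    achieved = {interval_labels(points, a, b) for a in cands for b in cands}
    return len(achieved) == 2 ** len(points)

print(shattered_by_intervals([0.2, 0.7]))       # True: two points are shattered
print(shattered_by_intervals([0.2, 0.5, 0.7]))  # False: +,-,+ is impossible
```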

11
Sample Complexity: Uniform Convergence, Infinite
Hypothesis Spaces
  • C[S] = the set of splittings of dataset S using
    concepts from C.
  • C[m] = maximum number of ways to split m points
    using concepts in C.
  • C[m, D] = expected number of splits of m points
    drawn from D with concepts in C.
  • Neat Fact 1: previous results still hold if we
    replace |C| with |C[2m]|.
  • Neat Fact 2: can even replace with |C[2m, D]|.

12
Sample Complexity: Uniform Convergence, Infinite
Hypothesis Spaces
  • For instance:
  • Sauer's Lemma, C[m] = O(m^{VC-dim(C)}), implies that in the
    realizable case m = O((1/ε)(VC-dim(C) ln(1/ε) + ln(1/δ)))
    labeled examples suffice.
13
Sample Complexity: Uniform Convergence, Infinite
Hypothesis Spaces
  • Agnostic Case: m = O((1/ε²)(VC-dim(C) + ln(1/δ)))
    examples suffice so that w.h.p. all h in C satisfy
    |err(h) − err̂(h)| ≤ ε.

14
Outline of the talk
  • Supervised Learning
  • PAC Model
  • Sample Complexity
  • Algorithms
  • Semi-supervised Learning
  • Proposed Model
  • Examples of results in our model
  • Sample Complexity
  • Algorithmic Issues: Co-training of linear
    separators
  • Conclusions
  • Implications of our Analysis

15
Combining Labeled and Unlabeled Data (a.k.a.
Semi-supervised Learning)
  • Hot topic in recent years in Machine Learning.
  • Many applications have lots of unlabeled data,
    but labeled data is rare or expensive
  • Web page, document classification
  • OCR, Image classification

16
Combining Labeled and Unlabeled Data
  • Several methods have been developed to try to use
    unlabeled data to improve performance, e.g.
  • Transductive SVM
  • Co-training
  • Graph-based methods

17
Can we extend the PAC model to deal with
Unlabeled Data?
  • PAC model: a nice/standard model for learning from
    labeled data.
  • Goal: extend it naturally to the case of
    learning from both labeled and unlabeled data.
  • Different algorithms are based on different
    assumptions about how data should behave.
  • Question: how can we capture many of the assumptions
    typically used?

18
Example of typical assumption
  • The separator goes through low-density regions of
    the space (large margin).
  • assume we are looking for a linear separator
  • belief: there should exist one with large separation
19
Another Example
  • Agreement between two parts: co-training.
  • examples contain two sufficient sets of features,
    i.e. an example is x = ⟨x1, x2⟩, and the belief is
    that the two parts of the example are consistent,
    i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c(x)
  • for example, if we want to classify web pages:
    x = ⟨x1, x2⟩ = ⟨text info, link info⟩
20
Co-training
[Figure: a web page represented in two views, x1 = text info and
x2 = link info, with + and − examples marked in each view]
21
Proposed Model
  • Augment the notion of a concept class C with a
    notion of compatibility χ between a concept and
    the data distribution.
  • "learn C" becomes "learn (C, χ)" (i.e. learn class
    C under compatibility notion χ)
  • Express relationships that one hopes the target
    function and underlying distribution will possess.
  • Goal: use unlabeled data and the belief that the
    target is compatible to reduce C down to just
    the highly compatible functions in C.

22
Proposed Model, cont
  • Goal: use unlabeled data and our belief to reduce
    size(C) down to size(highly compatible functions
    in C) in the previous bounds.
  • Want to be able to analyze how much unlabeled
    data is needed to uniformly estimate
    compatibilities well.
  • Require that the degree of compatibility be
    something that can be estimated from a finite
    sample.
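
A minimal sketch of this reduction for a finite class: estimate each hypothesis's unlabeled error from a sample and keep only the near-compatible ones (the function names, the toy χ, and the threshold are illustrative assumptions, not from the talk):

```python
import numpy as np

def compatible_subset(hypotheses, chi, X_unlabeled, eps):
    """Keep only hypotheses whose estimated unlabeled error
    1 - mean_x chi(h, x) is at most eps; the labeled-sample bounds
    then pay ln|subset| instead of ln|C|."""
    keep = []
    for h in hypotheses:
        err_unl = 1.0 - np.mean([chi(h, x) for x in X_unlabeled])
        if err_unl <= eps:
            keep.append(h)
    return keep

# toy usage: hypotheses are thresholds t; chi calls x compatible with t
# when x is not too close to t (a crude margin-style compatibility)
hs = [0.1 * t for t in range(11)]
chi = lambda t, x: 1.0 if abs(x - t) > 0.05 else 0.0
X_u = np.random.default_rng(2).random(1000)
print(compatible_subset(hs, chi, X_u, eps=0.07))  # likely just the edge thresholds
```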

23
Proposed Model, cont
  • Augment the notion of a concept class C with a
    notion of compatibility χ between a concept and
    the data distribution.
  • Require that the degree of compatibility be
    something that can be estimated from a finite
    sample.
  • Require χ to be an expectation over individual
    examples:
  • χ(h, D) = E_{x∼D}[χ(h, x)] - compatibility of h with
    D, where χ(h, x) ∈ [0, 1]
  • err_unl(h) = 1 − χ(h, D) - incompatibility of h with
    D (unlabeled error rate of h)

24
Margins, Compatibility
  • Margins: belief is that there should exist a
    large-margin separator.
  • Incompatibility of h and D (unlabeled error rate
    of h) = the probability mass within distance γ of h.
  • Can be written as an expectation over individual
    examples, χ(h, D) = E_{x∼D}[χ(h, x)], where:
  • χ(h, x) = 0 if dist(x, h) ≤ γ
  • χ(h, x) = 1 if dist(x, h) > γ
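
As an illustration, a sketch of estimating this margin-based unlabeled error rate for a linear separator through the origin (function names and the random sample are our own):

```python
import numpy as np

def unlabeled_error(w, X, gamma):
    """Empirical err_unl(h) for the separator h = {x : w.x = 0}:
    the fraction of unlabeled points within distance gamma of h,
    i.e. an estimate of 1 - chi(h, D)."""
    dist = np.abs(X @ w) / np.linalg.norm(w)  # distance of each point to the hyperplane
    return float(np.mean(dist <= gamma))

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))  # unlabeled sample: no labels needed
w = rng.normal(size=5)
print(unlabeled_error(w, X, gamma=0.1))
```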

25
Margins, Compatibility
  • Margins: belief is that there should exist a
    large-margin separator.
  • If we do not want to commit to γ in advance, can
    define χ(h, x) to be a smooth function of dist(x, h).
  • Illegal notion of compatibility: the largest γ
    s.t. D has probability mass exactly zero within
    distance γ of h (not an expectation over individual
    examples).

26
Co-training, Compatibility
  • Co-training: examples come as pairs ⟨x1, x2⟩,
    and the goal is to learn a pair of functions
    ⟨h1, h2⟩.
  • Hope is that the two parts of the example are
    consistent.
  • Legal (and natural) notion of compatibility:
  • the compatibility of h = ⟨h1, h2⟩ and D is the
    probability that the two views agree:
    χ(h, D) = Pr_{⟨x1,x2⟩∼D}[h1(x1) = h2(x2)]
  • can be written as an expectation over examples:
    χ(h, ⟨x1, x2⟩) = 1 if h1(x1) = h2(x2), and 0 otherwise
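
A short sketch of estimating this co-training compatibility on unlabeled pairs (the two random views and the hypotheses below are illustrative):

```python
import numpy as np

def cotrain_compatibility(h1, h2, X1, X2):
    """Estimated chi(<h1, h2>, D): the fraction of unlabeled pairs
    <x1, x2> on which the predictions of the two views agree."""
    return float(np.mean(h1(X1) == h2(X2)))

rng = np.random.default_rng(1)
X1 = rng.normal(size=(5_000, 3))  # view 1 of each unlabeled example
X2 = rng.normal(size=(5_000, 4))  # view 2 of the same examples
h1 = lambda X: X @ np.array([1.0, -0.5, 2.0]) > 0
h2 = lambda X: X @ np.array([0.3, 1.0, 0.0, -1.0]) > 0
print(cotrain_compatibility(h1, h2, X1, X2))
```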

27
Examples of results in our model: Sample
Complexity - Uniform convergence bounds
  • Finite Hypothesis Spaces, Doubly Realizable Case
  • Assume χ(h, x) ∈ {0, 1} and define C_{D,χ}(ε) =
    {h ∈ C : err_unl(h) ≤ ε}.
  • Theorem: if we see m_u ≥ (1/ε)(ln|C| + ln(2/δ)) unlabeled
    examples and m_l ≥ (1/ε)(ln|C_{D,χ}(ε)| + ln(2/δ)) labeled
    examples, then w.p. ≥ 1−δ all h ∈ C with err̂(h) = 0 and
    err̂_unl(h) = 0 have err(h) ≤ ε.
  • Bound the number of labeled examples as a measure
    of the helpfulness of D w.r.t. χ:
  • a helpful distribution is one in which C_{D,χ}(ε) is
    small

28
Semi-Supervised Learning: Natural Formalization
(PACχ)
  • We will say an algorithm "PACχ-learns" if it runs
    in poly time using samples poly in the respective
    bounds.
  • E.g., can think of ln|C| as the number of bits to
    describe the target without knowing D, and
    ln|C_{D,χ}(ε)| as the number of bits to describe the
    target knowing a good approximation to D, given the
    assumption that the target has low unlabeled error rate.

29
Examples of results in our model: Sample
Complexity - Uniform convergence bounds
  • Finite Hypothesis Spaces, c not fully
    compatible:
  • Theorem: if the unlabeled data suffices to estimate
    err_unl(h) for all h ∈ C to within ε (O((1/ε²) ln|C|)
    examples), then roughly (1/ε) ln|C_{D,χ}(err_unl(c) + 2ε)|
    labeled examples suffice.

30
Examples of results in our model: Sample
Complexity - Uniform convergence bounds
  • Infinite Hypothesis Spaces
  • Assume χ(h, x) ∈ {0, 1} and let χ(C) = {χ_h : h ∈ C},
    where χ_h(x) = χ(h, x).

31
Examples of results in our model: Sample
Complexity - Uniform convergence bounds
  • For S ⊆ X, denote by U_S the uniform distribution
    over S, and by C[m, U_S] the expected number of
    splits of m points from U_S with concepts in C.
  • Assume err(c) = 0 and err_unl(c) = 0.
  • Theorem
  • The number of labeled examples depends on the
    unlabeled sample.
  • Useful since can imagine the learning alg.
    performing some calculations over the unlabeled
    data and then deciding how many labeled examples
    to purchase.

32
Examples of results in our model: Sample
Complexity, ε-cover-based bounds
  • For algorithms that behave in a specific way:
  • first use the unlabeled data to choose a
    representative set of compatible hypotheses
  • then use the labeled sample to choose among these
  • Theorem

33
Examples of results in our model
  • Let's look at some algorithms.

34
Examples of results in our model: Algorithmic
Issues - Algorithm for a simple (C, χ)
  • X = {0,1}^n, C = class of disjunctions (e.g.,
    h = x1 ∨ x2 ∨ x3 ∨ x4 ∨ x7)
  • For x ∈ X, let vars(x) be the set of variables
    set to 1 by x.
  • For h ∈ C, let vars(h) be the set of variables
    disjoined by h.
  • χ(h, x) = 1 if either vars(x) ⊆ vars(h) or
    vars(x) ∩ vars(h) = ∅
  • Strong notion of margin:
  • every variable is either a positive indicator or
    a negative indicator
  • no example should contain both positive and
    negative indicators
  • Can give a simple PACχ-learning algorithm for
    this pair (C, χ).

35
Examples of results in our model: Algorithmic
Issues - Algorithm for a simple (C, χ)
  • Use the unlabeled sample U to build a graph G on n vertices:
  • put an edge between i and j if ∃ x in U with
    i, j ∈ vars(x).
  • Use the labeled data L to label the connected
    components.
  • Output h s.t. vars(h) is the union of the
    positively-labeled components.
  • If c is fully compatible, then no component will
    get both positive and negative labels, and
  • if U and L are as large as the bounds require, then
    w.h.p. err(h) ≤ ε (a code sketch follows the figure
    below).


[Figure: n = 6 variables. Unlabeled set U = {011000, 101000,
001000, 000011, 100100} induces a graph on vertices 1..6 with
components {1, 2, 3, 4} and {5, 6}. Labeled set L = {100000 (+),
000011 (−)}. Output: h = x1 ∨ x2 ∨ x3 ∨ x4.]
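
A minimal sketch of the algorithm above using union-find, run on the figure's instance (our own representation: examples as sets of indices of variables set to 1, with variables x1..x6 mapped to indices 0..5):

```python
def learn_disjunction(n, unlabeled, labeled):
    """The graph-components algorithm for disjunctions under chi.
    unlabeled: examples given as sets of indices of variables set to 1.
    labeled:   (example, label) pairs with label in {+1, -1}.
    Returns vars(h): the union of the positively-labeled components."""
    parent = list(range(n))
    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    for x in unlabeled:                # edge i-j whenever some x sets both to 1
        x = list(x)
        for j in x[1:]:
            union(x[0], j)

    component_label = {}               # label components via the labeled data
    for x, y in labeled:
        for i in x:
            component_label[find(i)] = y

    return {i for i in range(n) if component_label.get(find(i)) == +1}

# The figure's instance (variables x1..x6 are indices 0..5 here):
U = [{1, 2}, {0, 2}, {2}, {4, 5}, {0, 3}]
L = [({0}, +1), ({4, 5}, -1)]
print(learn_disjunction(6, U, L))  # -> {0, 1, 2, 3}, i.e. h = x1 v x2 v x3 v x4
```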
36
Examples of results in our model: Algorithmic
Issues - Algorithm for a simple (C, χ)
  • Especially non-helpful distribution: the uniform
    distribution over all examples x with |vars(x)| = 1
  • get n components, so still need Ω(n) labeled
    examples
  • Helpful distribution - one such that w.h.p. the
    number of components is small
  • need fewer labeled examples

37
Examples of results in our model: Algorithmic
Issues - Co-training of linear separators
  • Examples ⟨x1, x2⟩ ∈ R^n × R^n.
  • Target functions c1 and c2 are linear separators;
    assume c1 = c2 = c, and that no pair crosses the
    target plane.
  • For f a linear separator in R^n, err_unl(f) = the
    fraction of the pairs that cross f's boundary.
  • Consistency problem: given a set of labeled and
    unlabeled examples, want to find a separator that
    is consistent with the labeled examples and
    compatible with the unlabeled ones.
  • It is NP-hard (hardness due to Abie Flaxman).

38
Examples of results in our model: Algorithmic
Issues - Co-training of linear separators
  • Assume independence given the label (i.e., both
    points drawn from D+ or both from D−).
  • [Blum & Mitchell '98] show one can co-train (in
    polynomial time) if one has enough labeled data to
    produce a weakly-useful hypothesis to begin with.
  • We show one can learn with only a single labeled
    example.
  • Key point: independence given the label implies
    that the functions with low err_unl rate are:
  • close to c
  • close to ¬c
  • close to the all-positive function
  • close to the all-negative function

39
Examples of results in our model: Algorithmic
Issues - Co-training of linear separators
  • Nice Tool: a super-simple algorithm for weakly
    learning a large-margin separator:
  • pick a candidate separator at random
  • If the margin is ≥ 1/poly(n), then a random separator
    has at least a 1/poly(n) chance of being a weak
    predictor.
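
A sketch of this tool in isolation, evaluated on a labeled sample for illustration (the retry loop and the acceptance threshold are our own choices, not from the talk):

```python
import numpy as np

def random_weak_predictor(X, y, margin, tries=1000, seed=0):
    """Pick random unit vectors w and keep one whose sign(w.x) beats
    chance on the sample. If the data has margin >= 1/poly(n), each
    random w succeeds with probability >= 1/poly(n), so poly(n)
    tries suffice w.h.p."""
    rng = np.random.default_rng(seed)
    for _ in range(tries):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        acc = float(np.mean(np.sign(X @ w) == y))
        if acc >= 0.5 + margin / 4:  # crude weak-learning threshold (our choice)
            return w, acc
    return None

# toy usage: data with margin along a hidden direction w*
rng = np.random.default_rng(1)
w_star = np.ones(10) / np.sqrt(10)
X = rng.normal(size=(2000, 10))
y = np.sign(X @ w_star)
print(random_weak_predictor(X, y, margin=0.1))
```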

40
Examples of results in our model: Algorithmic
Issues - Co-training of linear separators
  • Assume independence given the label.
  • Draw a large unlabeled sample S = {⟨x1^i, x2^i⟩}.
  • If we also assume a large margin:
  • run the super-simple algorithm poly(n) times
  • feed each random separator into the [Blum & Mitchell]
    booster
  • examine all the hypotheses produced, and pick
    one h with small err_unl that is far from the
    all-positive and all-negative functions
  • use the labeled example to choose either h or ¬h
  • w.h.p. one of the random separators was a weakly-useful
    predictor, so on at least one of these steps we
    end up with a hypothesis h with small err(h), and
    so with small err_unl(h)
  • If we don't assume a large margin:
  • use the Outlier Removal Lemma to make sure that at
    least a 1/poly fraction of the points in S1 = {x1^i}
    have margin at least 1/poly; this is sufficient.

41
Implications of our analysis: Ways in which
unlabeled data can help
  • If the target is highly compatible with D and we
    have enough unlabeled data to estimate χ over all
    h ∈ C, then we can reduce the search space (from C
    down to just those h ∈ C whose estimated
    unlabeled error rate is low).
  • By providing an estimate of D, unlabeled data can
    allow a more refined distribution-specific notion
    of hypothesis space size (such as Annealed
    VC-entropy or the size of the smallest ε-cover).
  • If D is nice, so that the set of compatible h ∈ C
    has a small ε-cover and the elements of the cover
    are far apart, then we can learn from even fewer
    labeled examples than the O(1/ε) needed just to
    verify a good hypothesis.


43
Questions?
44
Thank you !