CS546: Machine Learning and Natural Language
Discriminative vs Generative Classifiers

Transcript and Presenter's Notes
1
CS546: Machine Learning and Natural Language
Discriminative vs Generative Classifiers
  • This lecture is based on the (Ng & Jordan, 2002) paper,
    and some slides are based on Tom Mitchell's slides
2
Outline
  • Reminder: Naive Bayes and Logistic Regression
    (MaxEnt)
  • Asymptotic analysis
  • What is better if you have an infinite dataset?
  • Non-asymptotic analysis
  • What is the rate of convergence of the parameters?
  • More important: convergence of the expected error
  • Empirical evaluation
  • Why this lecture?
  • A nice and simple application of the Large Deviation
    bounds we considered before
  • We will analyze specifically NB vs logistic
    regression, but the hope is that it generalizes to
    other models (e.g., models for sequence labeling or
    parsing)

3
Discriminative vs Generative
  • Training classifiers involves estimating f : X → Y,
    or P(Y|X)
  • Discriminative classifiers (conditional models)
  • Assume some functional form for P(Y|X)
  • Estimate the parameters of P(Y|X) directly from
    training data
  • Generative classifiers (joint models)
  • Assume some functional form for P(X|Y), P(Y)
  • Estimate the parameters of P(X|Y), P(Y) directly from
    training data
  • Use Bayes rule to calculate P(Y | X = xi)
    (see the identity below)
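
As a reminder (a standard identity, not reproduced on the slide), the Bayes rule step that turns the generative model into a conditional classifier is

    P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\, P(Y = y')}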

4
Naive Bayes
  • Example: assume Y is boolean and X = <x1, x2, ..., xn>,
    where the xi are binary
  • Generative model: Naive Bayes
  • Classify a new example x based on the ratio
    (see the sketch below)
  • You can do it in log-scale

s indicates the size of a set; l is the smoothing parameter
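
The classification rule and the smoothed estimates on this slide are not reproduced in the transcript. A plausible reconstruction, assuming standard add-l smoothing and writing #{...} for the size of a set (as in the note above): classify x as positive iff

    \log\frac{\hat P(Y=1)}{\hat P(Y=0)}
      + \sum_{i=1}^{n} \log\frac{\hat P(x_i \mid Y=1)}{\hat P(x_i \mid Y=0)} > 0,
    \qquad
    \hat P(x_i = 1 \mid Y = y) = \frac{\#\{x_i = 1,\, Y = y\} + l}{\#\{Y = y\} + 2l}
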
5
Naive Bayes vs Logistic Regression
  • Generative model: Naive Bayes
  • Classify a new example x based on the ratio
  • Logistic Regression
  • Recall: both classifiers are linear
    (see the sketch below)
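
The slide formulas are not reproduced here; a sketch of the two models in a common notation (weights w_i and b_i, \sigma the logistic function) is

    \text{LogReg:}\quad P(Y=1 \mid x) = \sigma\Big(w_0 + \sum_{i=1}^{n} w_i x_i\Big),
    \qquad
    \text{NB:}\quad \log\frac{\hat P(Y=1 \mid x)}{\hat P(Y=0 \mid x)} = b_0 + \sum_{i=1}^{n} b_i x_i,
    \quad b_i = \log\frac{\hat\theta_{i|1}\,(1-\hat\theta_{i|0})}{\hat\theta_{i|0}\,(1-\hat\theta_{i|1})},
    \quad \hat\theta_{i|y} = \hat P(x_i = 1 \mid Y = y)

so both define linear decision boundaries in x; they differ only in how the weights are estimated.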

6
What is the difference asymptotically?
  • Notation: let ε(hA,m) denote the error of the
    hypothesis learned via algorithm A from m examples
  • If the Naive Bayes model is true: ...
  • Otherwise: ...
  • The logistic regression estimator is consistent:
  • ε(hDis,m) converges to the error of the best
    classifier in H, where H is the class of all
    linear classifiers
  • Therefore, it is asymptotically better than the
    linear classifier selected by the NB algorithm
    (see the statement below)
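
In the notation above, with hGen the naive Bayes classifier and hDis logistic regression, the asymptotic comparison the slide is stating (as in Ng & Jordan, 2002) is

    \varepsilon(h_{\mathrm{Dis},\infty}) \;\le\; \varepsilon(h_{\mathrm{Gen},\infty})

since logistic regression converges to the best linear classifier, while the linear classifier that naive Bayes converges to need not be the best one (the two asymptotic errors coincide when the naive Bayes assumptions actually hold).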

7
Rate of convergence: logistic regression
  • Converges to the best linear classifier with on the
    order of n examples (see the sketch below)
  • This follows from Vapnik's structural risk bound
    (the VC-dimension of n-dimensional linear separators
    is n + 1)
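
A sketch of the bound behind this slide (standard structural risk / VC form; constants and low-order terms omitted): with probability at least 1 - \delta over a sample of size m,

    \varepsilon(h_{\mathrm{Dis},m}) \;\le\; \varepsilon(h_{\mathrm{Dis},\infty})
        + O\!\left(\sqrt{\tfrac{n}{m}\log\tfrac{m}{n} + \tfrac{1}{m}\log\tfrac{1}{\delta}}\right)

so on the order of n examples (up to logarithmic factors) suffice to come within any fixed tolerance of the asymptotic error.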

8
Rate of convergence: Naive Bayes
  • We will proceed in two stages
  • First, consider how fast the parameters converge to
    their optimal values
  • (we do not actually care about this in itself)
  • What we do care about: deriving how this translates
    into convergence of the error to the asymptotic
    error
  • The authors also consider a continuous case (where
    the input is continuous), but it is not very
    interesting for NLP
  • However, similar techniques apply

9
Convergence of Parameters
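
The lemma on this slide is not reproduced in the transcript. Based on the proof sketched two slides below, it says roughly (a reconstruction with constants omitted, not the slide's exact wording) that for any ε, δ > 0, a training set of size

    m = O\!\left(\frac{1}{\varepsilon^2}\,\log\frac{n}{\delta}\right)

suffices for all 2n + 1 naive Bayes parameter estimates (the class prior and each \hat P(x_i = 1 \mid y)) to be within ε of their asymptotic values, with probability at least 1 - δ.
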
10
Recall Chernoff Bound
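
The bound itself is not shown in the transcript; the standard Chernoff/Hoeffding form for the empirical mean \hat p of m i.i.d. Bernoulli variables with true mean p is

    P\big(|\hat p - p| > \varepsilon\big) \;\le\; 2\exp\!\big(-2\varepsilon^2 m\big)
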
11
Recall Union Bound
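
Likewise, the union bound: for any events A_1, ..., A_k (independent or not),

    P\!\left(\bigcup_{i=1}^{k} A_i\right) \;\le\; \sum_{i=1}^{k} P(A_i)
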
12
Proof of Lemma (no smoothing, for simplicity)
  • By the Chernoff bound, with probability at least
    1 - δ the fraction of positive examples will be
    within ε of its expectation
  • Therefore we have at least a constant fraction of m
    positive and negative examples
  • By the Chernoff bound, for every feature and class
    label (2n cases), the corresponding estimate is
    accurate with high probability
  • We have one event with small failure probability and
    2n events with small failure probabilities; their
    joint failure probability is not greater than the sum
  • Solve this for m, and you get the logarithmic bound
    (see the sketch below)
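
A sketch of that last step (a reconstruction consistent with the bullets above; constants omitted, and assuming the class prior is bounded away from 0 and 1 so each class receives a constant fraction of the m examples): the total failure probability is at most

    \delta \;\le\; (2n + 1)\cdot 2\exp\!\big(-c\,\varepsilon^2 m\big)

for some constant c, and solving for m gives m = O\!\big(\tfrac{1}{\varepsilon^2}\log\tfrac{n}{\delta}\big), i.e. logarithmic in the number of features n.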

13
Implications
  • With a number of samples logarithmic in n (not
    linear, as for logistic regression!), the parameters
    of the learned NB classifier approach the parameters
    of its asymptotic version
  • Are we done?
  • Not really: this does not automatically imply that
    the error approaches the asymptotic error at the
    same rate

14
Implications
  • We need to show that the learned classifier and its
    asymptotic version often agree if their parameters
    are close
  • We compare the log-scores given by the two models,
    i.e. the learned one and the asymptotic one
    (see the sketch below)
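
The log-scores are not reproduced in the transcript; one natural way to write them (notation mine, not from the slide) is

    l_m(x) = \log\frac{\hat P_m(Y=1)\prod_i \hat P_m(x_i \mid Y=1)}{\hat P_m(Y=0)\prod_i \hat P_m(x_i \mid Y=0)},
    \qquad
    l_\infty(x) = \log\frac{P(Y=1)\prod_i P(x_i \mid Y=1)}{P(Y=0)\prod_i P(x_i \mid Y=0)}

where the hats denote parameters estimated from m examples; the two classifiers agree on x whenever l_m(x) and l_\infty(x) have the same sign.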

15
Convergence of Classifiers
  • G defines the fraction of points very close to
    the decision boundary
  • What is this fraction? See later

16
Proof of Theorem (sketch)
  • By the Lemma, with high probability the parameters
    of the learned NB model are within a small ε of
    those of its asymptotic version
  • This implies that every term in the log-score sum is
    also within a small amount of the corresponding term
    of the asymptotic log-score, and hence so is the
    whole sum (up to a factor of n)
  • Let this bound on the log-score difference define a
    margin around the decision boundary
  • So the learned and asymptotic classifiers can have
    different predictions only if the asymptotic
    log-score falls within that margin
  • The probability of this event is the fraction G
    (see the sketch below)
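
A sketch of the chain of inequalities these bullets describe (using the log-score notation introduced above; constants omitted, and assuming the asymptotic parameters are bounded away from 0 and 1 so each log-term changes by at most O(ε) when its parameter does): if every estimate is within ε of its asymptotic value, then

    |l_m(x) - l_\infty(x)| \;\le\; O\big((n+1)\,\varepsilon\big)
    \quad\Longrightarrow\quad
    \text{predictions differ only if } |l_\infty(x)| \le O\big((n+1)\,\varepsilon\big)

and the probability of drawing such an x is precisely the fraction G of points that close to the decision boundary.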

17
Convergence of Classifiers
  • G -- What is this fraction?
  • This is somewhat more difficult

18
Convergence of Classifiers
  • G -- What is this fraction?
  • This is somewhat more difficult

19
What to do with this theorem
  • This is easy to prove; no proof here, just the
    intuition
  • A fraction of the terms in the log-score sum have
    large expectation
  • Therefore, the sum also has large expectation

20
What to do with this theorem
  • But this is weaker than what we need
  • We have that the expectation is large
  • We need that the probability of small values is
    low
  • What about the Chebyshev inequality?
    (see the statement below)
  • The terms are not independent ... How to deal with
    this?
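
For reference (standard form, not from the slide), the Chebyshev inequality for a random variable Z with finite variance is

    P\big(|Z - \mathrm{E}[Z]| \ge t\big) \;\le\; \frac{\mathrm{Var}(Z)}{t^2}

the difficulty being that the terms of the log-score sum are not independent, so its variance cannot be bounded by simply summing per-term variances.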

21
Corollary from the theorem
  • Is this condition realistic?
  • Yes (e.g., we can show it holds under rather
    realistic conditions)

22
Empirical Evaluation (UCI dataset)
  • Dashed line is logistic regression
  • Solid line is Naive Bayes

23
Empirical Evaluation (UCI dataset)
  • Dashed line is logistic regression
  • Solid line is Naive Bayes

24
Empirical Evaluation (UCI dataset)
  • Dashed line is logistic regression
  • Solid line is Naive Bayes

25
Summary
  • Logistic regression has lower asymptotic error
  • ... But Naive Bayes needs less data to approach
    its asymptotic error

26
First Assignment
  • I am still checking it; I will let you know by/on
    Friday
  • Note, though:
  • Do not perform multiple tests (model selection)
    on the final test set!
  • It is a form of cheating

27
Term Project / Substitution
  • This Friday I will distribute the first phase --
    due after the Spring break
  • I will be away for the next two weeks
  • During the first week (Mar 9 - Mar 15) I will be
    slow to respond to email
  • This week I will be substituted by:
  • Active Learning (Kevin Small)
  • Indirect Supervision (Alex Klementiev)
  • Presentation by Ryan Cunningham on Friday
  • In the week of Mar 16 - Mar 23 there are no
    lectures
  • Work on the project; send questions if needed