1
Linear Models (II)
  • Rong Jin

2
Recap
  • Classification problems
  • Input x → output y
  • y is from a discrete set
  • Example: height 1.8m → male or female?
  • Statistical learning approaches for
    classification problems

(1.8m, m), (1.87, m), (1.65, f), (1.66, m), (1.58, f),
(1.63, f)
p(h|male), p(male); p(h|female), p(female)
p(male|1.8) vs. p(female|1.8)
3
Recap
  • Generative Model
  • p(y|x): determines the class y for object x
  • p(y): how frequently class y appears
  • p(x|y): the input pattern for class y
  • Example
  • 1.8m → male or female?
  • p(male|1.8m) = p(male) p(1.8m|male) / p(1.8m)
  • p(female|1.8m) = p(female) p(1.8m|female) / p(1.8m)
  • p(1.8m) = p(1.8m|male) p(male) +
    p(1.8m|female) p(female)
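A quick worked example of the Bayes rule above (a minimal Python sketch; the priors and class-conditional density values are made up for illustration, not taken from the lecture):

  # Bayes rule for the height example; the numbers below are invented.
  p_male, p_female = 0.5, 0.5             # class priors p(y)
  p_h_given_male = 1.2                    # assumed density value p(1.8m | male)
  p_h_given_female = 0.3                  # assumed density value p(1.8m | female)

  # Evidence: p(1.8m) = p(1.8m|male) p(male) + p(1.8m|female) p(female)
  p_h = p_h_given_male * p_male + p_h_given_female * p_female

  # Posteriors via Bayes rule
  print(p_h_given_male * p_male / p_h)      # p(male|1.8m)   = 0.8
  print(p_h_given_female * p_female / p_h)  # p(female|1.8m) = 0.2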

4
Recap
  • Learning p(x|y) and p(y)
  • p(y) = #examples(y) / #examples
  • Maximum likelihood estimation for p(x|y)
  • Example
  • Training examples
  • (1.8m, m), (1.87, m), (1.65, f), (1.66, m), (1.58, f),
    (1.63, f)
  • p(male) = N_male/N, p(female) = N_female/N
  • Assume that the height distributions for male and
    female are Gaussian
  • (μ_male, σ_male), (μ_female, σ_female)
  • MLE estimation (a sketch follows below)
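Below is a minimal Python sketch of this MLE step on the six training examples listed above; the variable names and the use of the biased variance estimate are my own choices, not from the slides.

  import math

  data = [(1.80, 'm'), (1.87, 'm'), (1.65, 'f'), (1.66, 'm'), (1.58, 'f'), (1.63, 'f')]
  heights = {'m': [h for h, y in data if y == 'm'],
             'f': [h for h, y in data if y == 'f']}

  # Class priors: p(y) = N_y / N
  prior = {y: len(hs) / len(data) for y, hs in heights.items()}

  # MLE for a Gaussian: sample mean and (biased) sample variance per class
  def mle_gaussian(xs):
      mu = sum(xs) / len(xs)
      var = sum((x - mu) ** 2 for x in xs) / len(xs)
      return mu, var

  params = {y: mle_gaussian(hs) for y, hs in heights.items()}

  def gaussian_pdf(x, mu, var):
      return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

  # Posterior p(male | 1.8m) via Bayes rule, using the estimated parameters
  x = 1.80
  joint = {y: prior[y] * gaussian_pdf(x, *params[y]) for y in prior}
  evidence = sum(joint.values())
  print({y: joint[y] / evidence for y in joint})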

5
Recap
6
Recap
7
Recap
  • Naïve Bayes
  • Input x is a vector x = (x1, x2, …, xm)
  • Assume the features are independent of each other
    given the class y
  • p(x|y) = p(x1|y) p(x2|y) … p(xm|y)
  • Each p(xi|y) is estimated using the MLE approach

8
Text Classification (I)
  • Learning to classify text
  • Input x: a document
  • Represented by a vector of words
  • Output y: interesting or not
  • +1 for an interesting document, -1 for an
    uninteresting one
  • Generative model for text classification (TC)
  • p(+), p(-)
  • p(doc|+), p(doc|-)
  • Naïve Bayes approach

9
Text Classification (II)
  • Learning parameters for TC
  • p(+) = n(+)/N, p(-) = n(-)/N
  • n(±): number of positive (or negative) documents
  • N: total number of documents
  • Apply MLE to estimate p(w|+) and p(w|-) (see the
    sketch below)
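A minimal Python sketch of this estimation step; the two tiny labeled documents are invented for illustration, and real training would use the whole corpus.

  from collections import Counter

  docs = [("the world people center".split(), +1),
          ("the world people company".split(), -1)]

  # Class priors: p(+) = n(+)/N, p(-) = n(-)/N
  n_pos = sum(1 for _, y in docs if y == +1)
  prior = {+1: n_pos / len(docs), -1: 1 - n_pos / len(docs)}

  # MLE for p(w|y): count of w in class y divided by the total word count of class y
  counts = {+1: Counter(), -1: Counter()}
  for words, y in docs:
      counts[y].update(words)
  p_word = {y: {w: c / sum(cnt.values()) for w, c in cnt.items()}
            for y, cnt in counts.items()}

  # Naive Bayes score for a new document: p(y) * prod_i p(x_i | y)
  def score(words, y):
      s = prior[y]
      for w in words:
          s *= p_word[y].get(w, 0.0)   # an unseen word zeroes the score (the issue raised on the next slides)
      return s

  print(score("the world center".split(), +1), score("the world center".split(), -1))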

10
Text Classification (III)
An example: the Twenty Newsgroups dataset
11
Text Classification (IV)
  • Any problems with the naïve Bayes text classifier?

12
Text Classifier (V)
  • Problems
  • Irrelevant words
  • Unseen words
  • Solution
  • Select relevant words using the mutual information
    I(x, y) (see the sketch after this list)
  • x: whether or not word x appears in a document
  • y: whether or not the document is of interest
  • Unseen words
  • Word class approach
  • Introduce word classes T = {t1, t2, …, tm}
  • Compute p(ti|+), p(ti|-)
  • When w is unseen, replace p(w|±) with p(ti|±) for
    its word class ti
  • Word correlation approach
  • Find the correlations p(w|w') between words
  • Using web information
  • p(w|±) = Σ_w' p(w|w') p(w'|±)
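As a sketch of the mutual-information selection mentioned above, the snippet below scores each word by I(x, y) computed from document counts; the four toy documents and their labels are invented for illustration.

  import math
  from collections import Counter

  docs = [({"world", "people", "center"}, +1),
          ({"world", "company", "irrigation"}, -1),
          ({"people", "center", "education"}, +1),
          ({"irrigation", "company"}, -1)]

  def mutual_information(word, docs):
      n = len(docs)
      joint = Counter((word in ws, y) for ws, y in docs)   # counts over (x, y)
      px = Counter(word in ws for ws, _ in docs)           # counts over x
      py = Counter(y for _, y in docs)                     # counts over y
      # I(x, y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]
      return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
                 for (x, y), c in joint.items())

  # Rank candidate words; larger I(x, y) means more relevant to the class label
  vocab = {w for ws, _ in docs for w in ws}
  for w in sorted(vocab, key=lambda w: -mutual_information(w, docs)):
      print(w, round(mutual_information(w, docs), 3))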

13
Logistic Regression Model
  • The Gaussian generative model finds a linear
    decision boundary.
  • Why not learn a linear decision boundary
    directly?

14
Logistic Regression Model
  • The log-ratio of the positive class to the negative
    class (sketched below)
  • Results
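The equations themselves are not reproduced in the transcript; a sketch of the standard form, using the same w (weights) and c (threshold) notation as the later slides, is:

  log [ p(+|x) / p(-|x) ] = w·x + c
    =>  p(+|x) = 1 / (1 + exp(-(w·x + c)))
        p(-|x) = 1 / (1 + exp(+(w·x + c)))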

15
Logistic Regression Model
  • Assume the inputs and outputs are related by the
    log-linear function
  • Estimate the weights: MLE approach (a sketch follows
    below)
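A minimal Python sketch of this MLE step via gradient ascent on the log-likelihood; the learning rate, iteration count, and the toy usage data are my own choices, not from the slides.

  import math

  def sigmoid(z):
      return 1.0 / (1.0 + math.exp(-z))

  def fit_logistic(xs, ys, lr=0.1, iters=5000):
      """xs: scalar inputs; ys: labels in {+1, -1}; returns (w, c)."""
      w, c = 0.0, 0.0
      for _ in range(iters):
          gw = gc = 0.0
          for x, y in zip(xs, ys):
              # gradient of log p(y|x) = log sigmoid(y * (w*x + c))
              g = y * (1.0 - sigmoid(y * (w * x + c)))
              gw += g * x
              gc += g
          w += lr * gw / len(xs)
          c += lr * gc / len(xs)
      return w, c

  # Toy usage with the height data from the earlier recap slides (+1 = male, -1 = female)
  print(fit_logistic([1.80, 1.87, 1.65, 1.66, 1.58, 1.63], [+1, +1, -1, +1, -1, -1]))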

16
Example 1 Heart Disease
  Age group ID   1      2      3      4      5      6      7      8
  Age range      25-29  30-34  35-39  40-44  45-49  50-54  55-59  60-64
  • Input feature x: age group ID
  • Output y: having heart disease or not
  • +1: having heart disease
  • -1: no heart disease

17
Example 1 Heart Disease
  • Logistic regression model
  • Learning w and c: MLE approach
  • Numerical optimization: w = 0.58, c = -3.34

18
Example 1 Heart Disease
  • w = 0.58
  • An older person is more likely to have heart
    disease
  • c = -3.34
  • i·w + c < 0 → p(+|i) < p(-|i)
  • i·w + c > 0 → p(+|i) > p(-|i)
  • i·w + c = 0 → decision boundary at i = -c/w
  • i = 5.78 → about 53 years old

19
Naïve Bayes Solution
  • Inaccurate fitting
  • Non-Gaussian distribution
  • i = 5.59
  • Close to the estimate from logistic regression
  • Even though naïve Bayes does not fit the input
    patterns well, it still works fine for the
    decision boundary

20
Problems with Using Histogram Data?
21
Uneven Sampling for Different Ages
22
Solution
w = 0.63, c = -3.56 → i = 5.65
23
Example Text Classification
  • Input x: a binary vector
  • Each word is a different dimension
  • xi = 0 if the i-th word does not appear in the
    document
  • xi = 1 if it appears in the document
  • Output y: interesting document or not
  • +1: interesting
  • -1: uninteresting

24
Example Text Classification
Doc 1: The purpose of the Lady Bird Johnson
Wildflower Center is to educate people around the
world, …
Doc 2: Rain Bird is one of the leading irrigation
manufacturers in the world, providing complete
irrigation solutions for people …

  term    the   world   people   company   center
  Doc 1    1      1       1        0         1
  Doc 2    1      1       1        1         0
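A minimal Python sketch of the binary representation in this table; the term list comes from the table, while the simplistic tokenizer and the handling of the truncated documents are my own choices.

  import re

  terms = ["the", "world", "people", "company", "center"]

  doc1 = ("The purpose of the Lady Bird Johnson Wildflower Center is to "
          "educate people around the world")
  doc2 = ("Rain Bird is one of the leading irrigation manufacturers in the "
          "world, providing complete irrigation solutions for people")

  def to_binary_vector(text, terms):
      words = set(re.findall(r"[a-z]+", text.lower()))
      # x_i = 1 if the i-th term appears in the document, 0 otherwise
      return [1 if t in words else 0 for t in terms]

  print(to_binary_vector(doc1, terms))  # [1, 1, 1, 0, 1]
  print(to_binary_vector(doc2, terms))  # [1, 1, 1, 0, 0] -- the slide's table marks
                                        # "company" as 1 using the elided rest of Doc 2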
25
Example 2 Text Classification
  • Logistic regression model
  • Every term ti is assigned a weight wi
  • Learning the parameters: MLE approach
  • Numerical solutions are needed (a sketch of applying
    the learned weights follows below)
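As a sketch of how the learned per-term weights wi and the threshold c would be applied to a new document; the weight values below are invented for illustration, not learned from any corpus.

  import math

  weights = {"world": 0.4, "people": 0.9, "company": -1.1, "center": 0.7}
  c = -0.5   # threshold / bias

  def p_interesting(doc_terms):
      # p(+|x) = 1 / (1 + exp(-(sum_i w_i x_i + c))) for the binary term vector x
      score = c + sum(weights.get(t, 0.0) for t in set(doc_terms))
      return 1.0 / (1.0 + math.exp(-score))

  print(p_interesting(["the", "world", "people", "center"]))   # > 0.5 -> interesting
  print(p_interesting(["the", "world", "company"]))            # < 0.5 -> uninteresting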

26
Example 2 Text Classification
  • Weight wi
  • wi > 0: term ti is positive evidence
  • wi < 0: term ti is negative evidence
  • wi = 0: term ti is irrelevant to whether the
    document is interesting
  • The larger wi is, the more important the term ti is
    in determining whether the document is interesting.
  • Threshold c

27
Example 2 Text Classification
  • Dataset: Reuters-21578
  • Classification accuracy
  • Naïve Bayes: 77%
  • Logistic regression: 88%

28
Why Does Logistic Regression Work Better for Text
Classification?
  • Common words
  • Small weights in logistic regression
  • Large weights in naïve Bayes
  • Weight: p(w|+), p(w|-)
  • Independence assumption
  • Naïve Bayes assumes that each word is generated
    independently
  • Logistic regression is able to take the correlation
    of words into account

29
Comparison
  • Generative Model
  • Model P(x|y)
  • Model the input patterns
  • Usually fast convergence
  • Cheap computation
  • Robust to noisy data
  • But
  • Usually performs worse
  • Discriminative Model
  • Model P(y|x) directly
  • Model the decision boundary
  • Usually good performance
  • But
  • Slow convergence
  • Expensive computation
  • Sensitive to noisy data