Title: Logistic Regression
Logistic Regression Model
- In the Gaussian generative model, the log-ratio of the class posteriors is a linear function of the input
- Logistic regression generalizes this: model the ratio directly as a linear function
- Parameters: w and c
Logistic Regression Model
- Model the log-ratio of the positive class to the negative class
- The resulting model is sketched below
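A minimal sketch of the model these bullets describe, assuming the usual two-class parameterization with weight vector w and offset c (the notation is an assumption, not copied from the deck):

```latex
% Log-ratio (log-odds) of the positive class to the negative class,
% modeled as a linear function of the input x:
\log\frac{p(y=+1\mid \mathbf{x})}{p(y=-1\mid \mathbf{x})} = \mathbf{w}^{\top}\mathbf{x} + c
% Solving for the posterior gives the familiar sigmoid form, for y in {+1, -1}:
p(y\mid \mathbf{x}) = \frac{1}{1 + \exp\bigl(-y\,(\mathbf{w}^{\top}\mathbf{x} + c)\bigr)}
```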
Logistic Regression Model
- Assume the inputs and outputs are related through the log-linear function above
- Estimate the weights with the MLE approach (see the sketch below)
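A sketch of the MLE objective under the model above; the training-set notation D = {(x_i, y_i)} is an assumption:

```latex
% Log-likelihood of the training data D = {(x_i, y_i)} under the logistic model:
l(D) = \sum_{i=1}^{n}\log p(y_i\mid \mathbf{x}_i)
     = -\sum_{i=1}^{n}\log\Bigl(1 + \exp\bigl(-y_i(\mathbf{w}^{\top}\mathbf{x}_i + c)\bigr)\Bigr)
% MLE: pick (w, c) maximizing l(D); there is no closed form, so solve numerically.
(\mathbf{w}^{*}, c^{*}) = \arg\max_{\mathbf{w},\,c}\; l(D)
```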
Example 1: Heart Disease
- Age groups: 1 = 25-29, 2 = 30-34, 3 = 35-39, 4 = 40-44, 5 = 45-49, 6 = 50-54, 7 = 55-59, 8 = 60-64
- Input feature x: age group id
- Output y: having heart disease or not
  - +1: having heart disease
  - -1: no heart disease
Example 1: Heart Disease
- Logistic regression model
- Learning w and c: MLE approach
- Numerical optimization: w = 0.58, c = -3.34
Example 1: Heart Disease
- w = 0.58
  - An older person is more likely to have heart disease
- c = -3.34
  - x·w + c < 0 ⇒ p(+|x) < p(-|x)
  - x·w + c > 0 ⇒ p(+|x) > p(-|x)
  - x·w + c = 0 ⇒ decision boundary
  - x ≈ 5.78 ⇒ about 53 years old (see the sketch below)
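A small Python sketch of how the fitted numbers on this slide can be used; the values w = 0.58 and c = -3.34 come from the slide, while the function names and the sigmoid form are illustrative assumptions:

```python
import math

w, c = 0.58, -3.34  # fitted by numerical MLE, as reported on the slide

def p_positive(x):
    """P(heart disease | age group id x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(w * x + c)))

# Probability of heart disease for each age group id (1 = 25-29, ..., 8 = 60-64)
for x in range(1, 9):
    print(x, round(p_positive(x), 3))

# Decision boundary: w*x + c = 0  =>  x = -c / w
print("boundary age-group id:", -c / w)
# ~5.76 with these rounded parameters; the deck reports 5.78,
# i.e. roughly a 53-year-old
```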
Naïve Bayes Solution
- Inaccurate fitting: the inputs are not Gaussian distributed
- Estimated decision boundary: x ≈ 5.59
- Close to the estimate obtained by logistic regression
- Even though naïve Bayes does not fit the input patterns well, it still works fine for the decision boundary
Problems with Using Histogram Data?
Uneven Sampling for Different Ages
Solution
- w = 0.63, c = -3.56 ⇒ decision boundary x ≈ 5.65 < 5.78 (see below)
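As a quick check, the reported boundary follows from the same computation as before (assuming the boundary is where wx + c = 0):

```latex
x^{*} = -\frac{c}{w} = \frac{3.56}{0.63} \approx 5.65 \;<\; 5.78
```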
Example 2: Text Classification
- Learn to classify text into predefined categories
- Input x: a document
  - Represented by a vector of words and their counts
  - Example: (president, 10), (bush, 2), (election, 5), ...
- Output y: whether the document is political or not
  - +1 for a political document, -1 for a non-political document
- Training data: a collection of labeled documents
Example 2: Text Classification
- Logistic regression model
  - Every term ti is assigned a weight wi
- Learning the parameters: MLE approach
  - Requires a numerical solution (see the sketch below)
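One way to realize this setup is sketched below with scikit-learn; the library choice, the tiny toy corpus, and all identifiers are illustrative assumptions, not the pipeline used in the deck:

```python
# Sketch: bag-of-words logistic regression for "politics vs. not", assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "president bush election campaign vote",   # political
    "election president senate debate",        # political
    "soccer match goal championship",          # not political
    "stock market earnings report",            # not political
]
labels = [1, 1, -1, -1]  # +1 = political, -1 = not political

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # each document -> vector of word counts

# Fits one weight w_i per term t_i plus an intercept c; note that scikit-learn
# applies L2 regularization by default rather than plain MLE.
clf = LogisticRegression()
clf.fit(X, labels)

# Inspect the learned per-term weights: positive weight = positive evidence for "political".
for term, weight in zip(vectorizer.get_feature_names_out(), clf.coef_[0]):
    print(f"{term:15s} {weight:+.3f}")

print("prediction:", clf.predict(vectorizer.transform(["the election of the president"])))
```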
Example 2: Text Classification
- Weight wi
  - wi > 0: term ti is positive evidence for the category
  - wi < 0: term ti is negative evidence
  - wi = 0: term ti is irrelevant to the category of the document
  - The larger wi is, the more important term ti is in determining whether the document is of interest
- Threshold c
Example 2: Text Classification
- Dataset: Reuters-21578
- Classification accuracy
  - Naïve Bayes: 77%
  - Logistic regression: 88%
Why Does Logistic Regression Work Better for Text Classification?
- Optimal linear decision boundary
  - Generative model: each word's weight is log p(w|+) - log p(w|-) (see the sketch below)
  - These weights are sub-optimal for the decision boundary
- Independence assumption
  - Naïve Bayes assumes that each word is generated independently
  - Logistic regression is able to take the correlation between words into account
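A sketch of the contrast the slide draws, assuming a multinomial naïve Bayes model over word counts n_w(x); the notation is an assumption:

```latex
% Naive Bayes also produces a linear decision rule, but its per-word weights are fixed
% by the generative model rather than tuned for the decision boundary:
\log\frac{p(+\mid \mathbf{x})}{p(-\mid \mathbf{x})}
  = \sum_{w} n_{w}(\mathbf{x})\bigl[\log p(w\mid +) - \log p(w\mid -)\bigr] + \log\frac{p(+)}{p(-)}
% Logistic regression instead learns each weight by maximizing the conditional likelihood,
% so correlated words are not double-counted.
```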
Discriminative Model
- The logistic regression model is a discriminative model
  - It models the conditional probability p(y|x), i.e., the decision boundary
- The Gaussian generative model
  - Models p(x|y), i.e., the input patterns of the different classes
Comparison
- Generative model
  - Models P(x|y), i.e., the input patterns
  - Usually converges fast
  - Cheap computation
  - Robust to noisy data
  - But: usually performs worse
- Discriminative model
  - Models P(y|x) directly, i.e., the decision boundary
  - Usually good performance
  - But: slow convergence, expensive computation, sensitive to noisy data
A Few Words about Optimization
- Convex objective function
- The solution could be non-unique: the objective is convex but not necessarily strictly convex
Problems with Logistic Regression?
- What about words that appear in only one class?
Overfitting Problem with Logistic Regression
- Consider a word t that appears in only one training document d, and d is a positive document; let w be its associated weight
- Consider the derivative of l(Dtrain) with respect to w (see the sketch below)
- w will become infinite!
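A sketch of the calculation behind this slide, using the log-likelihood written out earlier; x_{d,t}, the count of term t in document d, is assumed notation:

```latex
% Only documents containing t contribute to the derivative; here that is the single
% positive document d (x_{d,t} > 0 is the count of t in d):
\frac{\partial\, l(D_{\mathrm{train}})}{\partial w}
  = \sum_{i} y_i\, x_{i,t}\bigl(1 - p(y_i\mid \mathbf{x}_i)\bigr)
  = x_{d,t}\bigl(1 - p(+1\mid \mathbf{x}_d)\bigr) \;>\; 0
% The derivative stays strictly positive for every finite w, so maximizing the
% likelihood pushes w up without bound: w diverges to infinity.
```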
Example of Overfitting for LogRes
[Figure: the classification accuracy on test data decreases as the training iterations proceed]
Solution: Regularization
- Regularized log-likelihood (see the sketch below)
- The term s·||w||^2 is called the regularizer
  - Favors small weights
  - Prevents the weights from becoming too large
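A sketch of the regularized objective, assuming s denotes the regularization coefficient as on the slide:

```latex
% Regularized log-likelihood: the log-likelihood minus a penalty on large weights.
l_{\mathrm{reg}}(D) = \sum_{i=1}^{n}\log p(y_i\mid \mathbf{x}_i) \;-\; s\,\lVert\mathbf{w}\rVert^{2}
% s > 0 controls the trade-off: the larger s is, the more strongly small weights are favored.
```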
The Rare Word Problem
- Consider a word t that appears in only one training document d, and d is a positive document; let w be its associated weight
The Rare Word Problem
- Consider the derivative of the regularized l(Dtrain) with respect to w (see the sketch below)
- When w is small, the derivative is still positive
- But it becomes negative when w is large
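Continuing the earlier rare-word sketch, now with the regularizer included:

```latex
\frac{\partial\, l_{\mathrm{reg}}(D_{\mathrm{train}})}{\partial w}
  = x_{d,t}\bigl(1 - p(+1\mid \mathbf{x}_d)\bigr) \;-\; 2\,s\,w
% The first term is bounded, while the penalty term grows linearly with w:
% the derivative is positive for small w but turns negative once w is large,
% so the regularized optimum is reached at a finite w.
```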
Regularized Logistic Regression
Interpretation of the Regularizer
- The regularizer has many interpretations
  - Bayesian statistics: a prior on the model
  - Statistical learning: minimize the generalization error
  - Robust optimization: a min-max solution
Regularizer as Robust Optimization
- Assume each data point is unknown-but-bounded within a sphere of radius s centered at xi
- Find the classifier w that can classify the unknown-but-bounded data points with high classification confidence
Sparse Solution
- What does the solution of regularized logistic regression look like?
- A sparse solution: most weights are small and close to zero
Why Do We Need a Sparse Solution?
- Two types of solutions
  1. Many non-zero weights, but many of them are small
  2. Only a small number of non-zero weights, and many of them are large
- Occam's Razor: the simpler, the better
  - A simple model that fits the data is unlikely to be a coincidence
  - A complicated model that fits the data might be a coincidence
- A smaller number of non-zero weights
  ⇒ less evidence to consider
  ⇒ a simpler model
  ⇒ case 2 is preferred
Occam's Razor
Occam's Razor: Power = 1
Occam's Razor: Power = 3
Occam's Razor: Power = 10
Finding Optimal Solutions
- Concave objective function
- No local maxima
- Many standard optimization algorithms work
Gradient Ascent
- Maximize the log-likelihood by iteratively adjusting the parameters in small increments
- In each iteration, adjust w in the direction that increases the log-likelihood (toward the gradient); a sketch follows below
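A minimal sketch of this procedure in Python; the learning rate, stopping tolerance, and all function names are illustrative assumptions, not the course's implementation:

```python
import numpy as np

def log_likelihood(w, c, X, y):
    """l(D) = -sum log(1 + exp(-y (X w + c))) for labels y in {+1, -1}."""
    margins = y * (X @ w + c)
    return -np.sum(np.log1p(np.exp(-margins)))

def gradient_ascent(X, y, lr=0.1, max_iters=1000, tol=1e-6):
    """Plain gradient ascent on the (unregularized) log-likelihood."""
    n, d = X.shape
    w, c = np.zeros(d), 0.0
    for _ in range(max_iters):
        # 1 - p(y_i | x_i) for each training example
        resid = 1.0 / (1.0 + np.exp(y * (X @ w + c)))
        grad_w = X.T @ (y * resid)
        grad_c = np.sum(y * resid)
        # small step toward the gradient: the log-likelihood increases for a small enough lr
        w += lr * grad_w
        c += lr * grad_c
        # stop when there is (almost) no incentive to move in any direction
        if np.sqrt(grad_w @ grad_w + grad_c ** 2) < tol:
            break
    return w, c
```

The final check anticipates the stopping question addressed a few slides below: iterate until the gradient is essentially zero.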
Graphical Illustration
No regularization case
When Should We Stop?
- The log-likelihood increases monotonically during the gradient ascent iterations
- When should we stop?
When Should We Stop?
- The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction
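In symbols, one common way to state this stopping rule (the tolerance epsilon is an assumption):

```latex
% Converged when the gradient (the "incentive to move") is essentially zero:
\nabla_{\mathbf{w},\,c}\; l(D) \approx \mathbf{0}
\qquad\text{in practice, stop when } \lVert\nabla l(D)\rVert < \epsilon
```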