Transcript and Presenter's Notes

Title: Statistical Classification


1
Statistical Classification
  • Rong Jin

2
Classification Problems
  • Given input X = (x1, x2, ..., xm)
  • Predict the class label y ∈ Y
  • Y = {-1, +1}: binary classification problems
  • Y = {1, 2, 3, ..., c}: multi-class classification
    problems
  • Goal: learn the function f: X → Y

3
Examples of Classification Problem
  • Text categorization
  • Input features X
  • Word frequency
  • (campaigning, 1), (democrats, 2), (basketball, 0), ...
    (see the sketch below)
  • Class label y
  • y = +1: politics
  • y = -1: non-politics

[Figure: an example document ("Months of campaigning and weeks of
round-the-clock efforts in Iowa all came down to a final push Sunday, ...")
whose topic is to be labeled Politics or Non-politics.]
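To make the word-frequency representation concrete, here is a minimal Python sketch of mapping a document to a feature vector over a fixed vocabulary. The function name, the tiny vocabulary, and the whitespace tokenizer are illustrative assumptions, not part of the original slides.

    from collections import Counter

    def word_freq_features(doc, vocabulary):
        # Count how often each vocabulary word occurs in the document.
        counts = Counter(doc.lower().split())            # naive whitespace tokenizer
        return [counts[w] for w in vocabulary]

    vocabulary = ["campaigning", "democrats", "basketball"]   # illustrative vocabulary
    doc = ("Months of campaigning and weeks of round-the-clock "
           "efforts in Iowa all came down to a final push Sunday")
    print(word_freq_features(doc, vocabulary))           # -> [1, 0, 0] for this snippet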
5
Examples of Classification Problem
  • Image Classification
  • Input features X
  • Color histogram
  • (red, 1004), (blue, 23000), ... (see the sketch below)
  • Class label y
  • y = +1: bird image
  • y = -1: non-bird image

Which images are birds, which are not?
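A color histogram can be built in the same spirit: count how many pixels fall into each coarse color cell. A minimal sketch; the bin count and the function name are illustrative choices, not taken from the slides.

    def color_histogram(pixels, bins=8):
        # pixels: iterable of (r, g, b) tuples with channel values in 0..255.
        hist = [0] * (bins ** 3)                         # one cell per coarse RGB combination
        for r, g, b in pixels:
            cell = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
            hist[cell] += 1
        return hist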
7
Classification Problems
How to obtain f?
Learn the classification function f from training examples
8
Learning from Examples
  • Training examples
  • Independent and identically distributed (i.i.d.)
  • Each training example is drawn independently from
    the same underlying distribution
  • Training examples are therefore similar to testing examples

10
Learning from Examples
  • Given training examples
  • Goal: learn a classification function f(x): X → Y
    that is consistent with the training examples
  • What is the easiest way to do it?

11
K Nearest Neighbor (kNN) Approach
How many neighbors should we count?
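A minimal Python sketch of the kNN rule: predict the label of a query point by a majority vote among its k closest training examples. Euclidean distance and the helper name knn_predict are assumptions made for illustration.

    import math
    from collections import Counter

    def knn_predict(x, train, k):
        # train: list of (feature_vector, label) pairs; Euclidean distance assumed.
        neighbors = sorted(train, key=lambda ex: math.dist(x, ex[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]                # majority label among the k neighbors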
12
Cross Validation
  • Divide the training examples into two sets
  • A training set (80%) and a validation set (20%)
  • Predict the class labels of the examples in the
    validation set using the examples in the training
    set (see the sketch below)
  • Choose the number of neighbors k that maximizes
    the classification accuracy

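A sketch of this validation procedure, reusing the illustrative knn_predict helper from the earlier sketch; the 80/20 split follows the slide, the function name is an assumption.

    import random

    def choose_k_by_validation(examples, candidate_ks, seed=0):
        # examples: list of (feature_vector, label) pairs.
        examples = examples[:]                           # leave the caller's list untouched
        random.Random(seed).shuffle(examples)
        split = int(0.8 * len(examples))                 # 80% training, 20% validation
        train, valid = examples[:split], examples[split:]
        def accuracy(k):
            return sum(knn_predict(x, train, k) == y for x, y in valid) / len(valid)
        return max(candidate_ks, key=accuracy)           # k with the highest validation accuracy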
13
Leave-One-Out Method
  • For k 1, 2, , K
  • Err(k) 0
  • Randomly select a training data point and hide
    its class label
  • Using the remaining data and given K to predict
    the class label for the left data point
  • Err(k) Err(k) 1 if the predicted label is
    different from the true label
  • Repeat the procedure until all training examples
    are tested
  • Choose the k whose Err(k) is minimal

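A sketch of the leave-one-out loop, again reusing the illustrative knn_predict helper:

    def choose_k_by_loo(examples, candidate_ks):
        # Err(k): hold out each example in turn, predict it from the rest, count mistakes.
        def err(k):
            mistakes = 0
            for i, (x, y) in enumerate(examples):
                rest = examples[:i] + examples[i + 1:]   # all examples except the held-out one
                mistakes += knn_predict(x, rest, k) != y
            return mistakes
        return min(candidate_ks, key=err)                # k with the smallest Err(k)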
19
Probabilistic interpretation of KNN
  • Estimate the probability density function Pr(y|x)
    around the location of x
  • Count of data points in class y in the
    neighborhood of x
  • Bias and variance tradeoff
  • A small neighborhood → large variance →
    unreliable estimation
  • A large neighborhood → large bias → inaccurate
    estimation

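In symbols, the estimate counts the fraction of the k nearest neighbors of x that carry label y; this is a standard reading of the bullets above, since the slide's own formula is not reproduced here:

    \Pr(y \mid x) \approx \frac{k_y}{k},
    \qquad k_y = \#\{\text{neighbors of } x \text{ among the } k \text{ nearest whose label is } y\}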
20
Weighted kNN
  • Weight the contribution of each close neighbor
    based on its distance to the query point
  • Weight function
  • Prediction (a sketch follows below)

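A sketch of the weighted vote, assuming a Gaussian weight exp(-||x - xi||^2 / (2*sigma2)); this form is consistent with the sigma^2 estimated on the following slides, but the slide's exact weight function is not shown here.

    import math

    def weighted_knn_predict(x, train, sigma2):
        # Every training example votes with a weight that decays with its distance to x.
        scores = {}
        for xi, yi in train:
            w = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma2))   # assumed Gaussian weight
            scores[yi] = scores.get(yi, 0.0) + w
        return max(scores, key=scores.get)               # label with the largest total weight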
21
Estimate σ² in the Weight Function
  • Leave-one-out cross validation
  • The training dataset D is divided into two sets
  • Validation set: a single example (x1, y1)
  • Training set: D-1, the remaining examples
  • Compute Pr(y1 | x1, D-1) for the held-out example

22
Estimate σ² in the Weight Function
Pr(y | x1, D-1) is a function of σ²
24
Estimate σ² in the Weight Function
  • In general, we can write the expression for
    Pr(yi | xi, D-i)
  • Validation set: the single example (xi, yi)
  • Training set: D-i, the remaining examples
  • Estimate σ² by maximizing the leave-one-out
    likelihood (see the formula below)

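Putting the bullets together, the quantity to maximize is presumably the leave-one-out log-likelihood of the held-out labels; a sketch of that objective, not the slide's exact formula:

    l(\sigma^2) = \sum_{i=1}^{n} \log \Pr\left(y_i \mid x_i, D_{-i}; \sigma^2\right),
    \qquad \hat{\sigma}^2 = \arg\max_{\sigma^2} \, l(\sigma^2)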
26
Optimization
  • The resulting objective is a DC function (a
    difference of two convex functions)

27
Challenges in Optimization
  • Convex functions are the easiest to optimize
  • Single-mode (unimodal) functions are the second easiest
  • Multi-mode (multimodal) functions are difficult to
    optimize

28
Gradient Ascent
29
Gradient Ascent (cont'd)
  • Compute the derivative of l(θ), i.e., the gradient ∇l(θ)
  • Update θ (the standard update is sketched below)

How to decide the step size t?
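The omitted update is presumably the standard gradient-ascent step with step size t (an assumption consistent with the bullets above):

    \theta \leftarrow \theta + t \, \nabla l(\theta)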
30
Gradient Ascent Line Search
Excerpt from the slides by Stephen Boyd
31
Gradient Ascent
  • Stop criterion: the gradient norm falls below ε,
    where ε is a predefined small value
  • Start with θ0; define α, β, and ε
  • Compute the gradient ∇l(θ)
  • Choose the step size t via backtracking line search
  • Update θ ← θ + t ∇l(θ)
  • Repeat till the stop criterion is met
  (A code sketch of this loop follows below.)

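A sketch of the full loop, combining the gradient step with backtracking line search. The parameters alpha, beta, eps mirror the slide's α, β, ε; the function signature itself is an illustrative assumption.

    def gradient_ascent(l, grad, theta0, alpha=0.3, beta=0.8, eps=1e-6):
        # l: objective, grad: its gradient (both callables); theta0: starting point (list of floats).
        theta = list(theta0)
        while True:
            g = grad(theta)
            gnorm2 = sum(gi * gi for gi in g)
            if gnorm2 ** 0.5 <= eps:                     # stop when the gradient norm is small
                return theta
            t = 1.0                                      # backtracking line search
            while l([th + t * gi for th, gi in zip(theta, g)]) < l(theta) + alpha * t * gnorm2:
                t *= beta
            theta = [th + t * gi for th, gi in zip(theta, g)]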
33
ML = Statistics + Optimization
  • Modeling Pr(y|x; θ)
  • θ is the parameter(s) involved in the model
  • Search for the best parameter θ
  • Maximum likelihood estimation
  • Construct a log-likelihood function l(θ)
  • Search for the optimal solution θ (see the
    formula below)

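Written out, the maximum likelihood recipe above is (a standard formulation; the slide's own formulas are not reproduced here):

    l(\theta) = \sum_{i=1}^{n} \log \Pr(y_i \mid x_i; \theta),
    \qquad \hat{\theta} = \arg\max_{\theta} \, l(\theta)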
34
Instance-Based Learning (Ch. 8)
  • Key idea: just store all training examples
  • k Nearest neighbor
  • Given a query example, take a vote among its k
    nearest neighbors (if discrete-valued target
    function)
  • Take the mean of the f values of its k nearest
    neighbors (if real-valued target function)

35
When to Consider Nearest Neighbor?
  • Lots of training data
  • Less than 20 attributes per example
  • Advantages
  • Training is very fast
  • Learn complex target functions
  • Don't lose information
  • Disadvantages
  • Slow at query time
  • Easily fooled by irrelevant attributes

36
KD Tree for NN Search
  • Each node contains
  • Children information
  • The tightest box that bounds all the data points
    within the node.

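A minimal sketch of such a node in Python. Only the children links and the tight bounding box come from the slide; the recursive construction and the median split along cycling axes are illustrative assumptions.

    class KDNode:
        # One KD-tree node: children links plus the tightest bounding box of its points.
        def __init__(self, points, depth=0):
            dim = len(points[0])
            self.lo = [min(p[d] for p in points) for d in range(dim)]   # box corner (mins)
            self.hi = [max(p[d] for p in points) for d in range(dim)]   # box corner (maxs)
            self.left = self.right = None
            self.points = points
            if len(points) <= 1:                         # leaf: keep the single point
                return
            axis = depth % dim                           # split axis cycles through dimensions
            points = sorted(points, key=lambda p: p[axis])
            mid = len(points) // 2
            self.points = []                             # interior node: points live in children
            self.left = KDNode(points[:mid], depth + 1)
            self.right = KDNode(points[mid:], depth + 1)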
37
NN Search by KD Tree
[A sequence of figures illustrating the nearest-neighbor search on the KD tree.]
44
Curse of Dimensionality
  • Imagine instances described by 20 attributes, but
    only 2 are relevant to the target function
  • Curse of dimensionality: nearest neighbor is
    easily misled when X is high-dimensional
  • Consider N data points uniformly distributed in a
    p-dimensional unit ball centered at the origin,
    and consider the nearest-neighbor estimate at the
    origin. The mean distance from the origin to the
    closest data point is given below.

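The omitted expression is presumably the standard result quoted by Hastie, Tibshirani and Friedman (The Elements of Statistical Learning) for the median distance from the origin to the closest of the N points:

    d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}

For example, with N = 500 and p = 10 this gives roughly 0.52, so the nearest neighbor typically lies more than halfway to the boundary of the ball, which is why nearest-neighbor estimates become unreliable in high dimensions.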