1
K nearest neighbor and Rocchio algorithm
  • LING 572
  • Fei Xia
  • 1/11/2007

2
Announcements
  • Hw2 is online now. It is due on Jan 20.
  • Hw1 is due at 11pm on Jan 13 (Sat).
  • Lab session after the class.
  • Read the DT tutorial before next Tuesday's class.

3
K-Nearest Neighbor (kNN)
4
Instance-based (IB) learning
  • No training: store all training instances.
  • → Lazy learning
  • Examples:
  • kNN
  • Locally weighted regression
  • Radial basis functions
  • Case-based reasoning
  • The most well-known IB method: kNN

5
kNN
6
kNN
  • For a new document d,
  • find the k training documents that are closest to d;
  • perform majority voting or weighted voting.
  • Properties:
  • A lazy classifier. No training.
  • Feature selection and the distance measure are crucial.

7
The algorithm
  • Determine the parameter K.
  • Calculate the distance between the query instance and all the training instances.
  • Sort the distances and determine the K nearest neighbors.
  • Gather the labels of the K nearest neighbors.
  • Use simple majority voting or weighted voting (a minimal sketch follows below).
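A minimal sketch of the steps above, assuming dense NumPy feature vectors and Euclidean distance; the name knn_classify is illustrative, not from the slides.

  import numpy as np
  from collections import Counter

  def knn_classify(query, X_train, y_train, k=3):
      """Classify `query` by majority vote among its k nearest training points."""
      # Distance from the query instance to every training instance
      dists = np.linalg.norm(X_train - query, axis=1)
      # Indices of the k smallest distances = the k nearest neighbors
      nearest = np.argsort(dists)[:k]
      # Simple majority vote over the neighbors' labels
      votes = Counter(y_train[i] for i in nearest)
      return votes.most_common(1)[0][0]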

8
Picking K
  • Use N-fold cross-validation: pick the K that minimizes the cross-validation error (see the sketch below).
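As an illustration (scikit-learn is an assumption here, not something the slides use), picking K by N-fold cross-validation:

  from sklearn.model_selection import cross_val_score
  from sklearn.neighbors import KNeighborsClassifier

  def pick_k(X, y, candidates=(1, 3, 5, 7, 9), folds=5):
      """Return the K with the highest mean cross-validation accuracy."""
      scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                   X, y, cv=folds).mean()
                for k in candidates}
      return max(scores, key=scores.get)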

9
Normalizing attribute values
  • Distance could be dominated by attributes with large values.
  • Ex: features = (age, income)
  • Original data: x1 = (35, 76K), x2 = (36, 80K), x3 = (70, 79K)
  • Assume age ∈ [0, 100], income ∈ [0, 200K]
  • After normalization: x1 = (0.35, 0.38), x2 = (0.36, 0.40), x3 = (0.70, 0.395)
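A quick check of the slide's numbers, scaling each attribute by its assumed range (min-max normalization with minima of 0):

  import numpy as np

  # Rows are instances (age, income); the divisors are the assumed maxima.
  X = np.array([[35, 76_000], [36, 80_000], [70, 79_000]], dtype=float)
  ranges = np.array([100, 200_000], dtype=float)

  X_norm = X / ranges  # minima are 0, so min-max scaling reduces to division
  print(X_norm)        # [[0.35  0.38 ] [0.36  0.4  ] [0.7   0.395]]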

10
The Choice of Features
  • Imagine there are 100 features, and only 2 of them are relevant to the target label.
  • kNN is easily misled in high-dimensional space.
  • → Use feature weighting or feature selection.

11
Feature weighting
  • Stretch the j-th axis by weight wj.
  • Use cross-validation to automatically choose the weights w1, …, wn.
  • Setting wj to zero eliminates this dimension altogether.

12
Similarity measure
  • Euclidean distance: d(x, y) = sqrt(Σj (xj - yj)^2)
  • Weighted Euclidean distance: d(x, y) = sqrt(Σj wj (xj - yj)^2)
  • Similarity measure: cosine, sim(x, y) = Σj xj yj / (|x| |y|)
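The three measures as code, assuming NumPy vectors:

  import numpy as np

  def euclidean(x, y):
      return np.sqrt(np.sum((x - y) ** 2))

  def weighted_euclidean(x, y, w):
      # w stretches each axis; w[j] = 0 eliminates dimension j
      return np.sqrt(np.sum(w * (x - y) ** 2))

  def cosine_sim(x, y):
      return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))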

13
Voting
  • Majority voting:
    c* = arg max_c Σi δ(c, fi(x))
  • Weighted voting: the weighting is on each neighbor:
    c* = arg max_c Σi wi δ(c, fi(x)), with wi = 1/dist(x, xi)
  • → With distance weighting, we can use all the training examples.
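A sketch of weighted voting over all training instances (names are illustrative):

  import numpy as np
  from collections import defaultdict

  def weighted_vote(query, X_train, y_train):
      """Each training instance votes for its label with weight 1/distance."""
      dists = np.linalg.norm(X_train - query, axis=1)
      scores = defaultdict(float)
      for d, label in zip(dists, y_train):
          scores[label] += 1.0 / (d + 1e-12)  # epsilon guards against d = 0
      return max(scores, key=scores.get)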

14
Summary of kNN
  • Strengths
  • Simplicity (conceptual)
  • Efficiency at training: no training
  • Handling multi-class
  • Stability and robustness: averaging over k neighbors
  • Prediction accuracy: good when the training data is large
  • Weaknesses
  • Efficiency at testing time: need to calculate the distance to all training instances
  • Theoretical validity
  • It is not clear which types of distance measure and features to use.

15
Rocchio Algorithm
16
Relevance Feedback for IR
  • The issue: "plane" vs. "aircraft"
  • Take advantage of user feedback on the relevance of docs to improve IR results:
  • The user issues a short, simple query.
  • The user marks returned documents as relevant or non-relevant.
  • The system computes a better representation of the information need based on the feedback.
  • Relevance feedback can go through one or more iterations.
  • Idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate.

17
Rocchio Algorithm
  • The Rocchio algorithm incorporates relevance feedback information into the vector space model.
  • Want to maximize sim(Q, Cr) - sim(Q, Cnr).
  • The optimal query vector for separating relevant and non-relevant documents (with cosine sim.):
    Qopt = (1/|Cr|) Σ_{dj ∈ Cr} dj - (1/(N - |Cr|)) Σ_{dj ∉ Cr} dj
  • Qopt: optimal query; Cr: set of relevant doc vectors; N: collection size

18
Rocchio 1971 Algorithm (SMART)
  qm = α q0 + β (1/|Dr|) Σ_{dj ∈ Dr} dj - γ (1/|Dnr|) Σ_{dj ∈ Dnr} dj
  • qm: modified query vector
  • q0: original query vector
  • α, β, γ: weights
  • Dr: set of known relevant doc vectors
  • Dnr: set of known non-relevant doc vectors
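A sketch of the SMART update, assuming documents and the query are NumPy term-weight vectors; the default weights are illustrative, not the course's settings:

  import numpy as np

  def rocchio_update(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
      """qm = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
      qm = alpha * q0
      if len(rel_docs) > 0:
          qm = qm + beta * np.mean(rel_docs, axis=0)
      if len(nonrel_docs) > 0:
          qm = qm - gamma * np.mean(nonrel_docs, axis=0)
      # Negative term weights are commonly clipped to zero
      return np.maximum(qm, 0.0)
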
19
Relevance feedback assumptions
  • Relevance prototypes are well-behaved:
  • Term distributions in relevant documents will be similar to one another.
  • Term distributions in non-relevant documents will be different from those in relevant documents.
  • Either: all relevant documents are tightly clustered around a single prototype.
  • Or: there are different prototypes, but they have significant vocabulary overlap.
  • Similarities between relevant and non-relevant documents are small.

20
Rocchio Algorithm for text classification
  • Training time: construct a set of prototype vectors, one vector per class.
  • Testing time: for a new document, find the most similar prototype vector.

21
Training time
Cj the set of positive examples for class
j. D the set of positive and negative
examples ?, ? weights
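A sketch of prototype construction under the formula above (illustrative names, dense NumPy matrices):

  import numpy as np

  def train_prototypes(X, y, alpha=1.0, beta=1.0):
      """One prototype per class: alpha*centroid(in-class) - beta*centroid(rest)."""
      return {c: alpha * X[y == c].mean(axis=0) - beta * X[y != c].mean(axis=0)
              for c in np.unique(y)}
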
22
Why this formula?
  • Rocchio showed that when α = β = 1, each prototype vector maximizes the average similarity to its positive examples minus the average similarity to its negative examples.
  • How does maximizing this quantity connect to classification accuracy?

23
Testing time
  • Given a new document d, assign the class whose prototype vector is most similar to d:
    c* = arg max_j sim(cj, d)
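A matching sketch of the test step, using cosine similarity (an assumption; the slides leave the measure open):

  import numpy as np

  def classify(d, prototypes):
      """Return the class whose prototype has the highest cosine similarity to d."""
      def cos(a, b):
          return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
      return max(prototypes, key=lambda c: cos(prototypes[c], d))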

24
kNN vs. Rocchio
  • kNN
  • Lazy learning: no training.
  • Uses all the training instances at testing time.
  • Rocchio algorithm
  • At training time, calculate the prototype vectors.
  • At test time, instead of using all the training instances, use only the prototype vectors.
  • A linear classifier: not as expressive as kNN.

25
Summary of Rocchio
  • Strengths
  • Simplicity (conceptual)
  • Efficiency at training time
  • Efficiency at testing time
  • Handling multi-class
  • Weaknesses
  • Theoretical validity
  • Stability and robustness
  • Prediction accuracy: it does not work well when the categories are not linearly separable.

26
Additional slides
27
Three major design choices
  • The weight of feature fk in document di, e.g., tf-idf
  • Document length normalization
  • The similarity measure, e.g., cosine
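For the first choice, a minimal tf-idf sketch (one common variant; the slides do not fix a formula):

  import math
  from collections import Counter

  def tfidf(docs):
      """docs: list of token lists -> one {term: tf * idf} dict per document."""
      N = len(docs)
      df = Counter(t for doc in docs for t in set(doc))  # document frequency
      return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
              for doc in docs]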

28
Extending Rocchio?
  • Generalized instance set (GIS) algorithm (Lam and
    Ho, 1998).