Title: Nearest Neighbor
1Nearest Neighbor Information Retrieval Search
- Artificial Intelligence
- CMSC 25000
- January 29, 2004
2Agenda
- Machine learning Introduction
- Nearest neighbor techniques
- Applications: Robotic motion, credit rating
- Information retrieval search
- Efficient implementations
- k-d trees, parallelism
- Extensions: K-nearest neighbor
- Limitations
- Distance, dimensions, irrelevant attributes
3Nearest Neighbor
- Memory- or case- based learning
- Supervised method: Training
- Record labeled instances and feature-value vectors
- For each new, unlabeled instance
- Identify nearest labeled instance
- Assign same label
- Consistency heuristic: Assume that a property is the same as that of the nearest reference case.
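A minimal sketch of this record-then-match procedure (Python; the toy points and plain Euclidean distance are illustrative assumptions, not from the slides):

```python
import math

# "Training" is just recording labeled feature vectors (memory-based learning).
training = [
    ((0.0, 1.2), "G"),
    ((25.0, 0.4), "P"),
]

def nearest_neighbor(query, examples):
    """Return the label of the labeled instance closest to `query`."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(examples, key=lambda ex: distance(query, ex[0]))
    return label

# A new, unlabeled instance gets the label of its nearest stored case.
print(nearest_neighbor((2.0, 1.1), training))
```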
4Nearest Neighbor Example
- Problem: Robot arm motion
- Difficult to model analytically
- Kinematic equations
- Relate joint angles and manipulator positions
- Dynamics equations
- Relate motor torques to joint angles
- Difficult to achieve good results modeling robotic arms or the human arm
- Many factors and measurements
5Nearest Neighbor Example
- Solution
- Move robot arm around
- Record parameters and trajectory segment
- Table: torques, positions, velocities, squared velocities, velocity products, accelerations
- To follow a new path
- Break into segments
- Find closest segments in table
- Get those torques (interpolate as necessary)
6Nearest Neighbor Example
- Issue: Big table
- First time with new trajectory
- Closest isn't close
- Table is sparse - few entries
- Solution: Practice
- As trajectories are attempted, fill in more of the table
- After few attempts, very close
7Nearest Neighbor Example II
- Credit Rating
- Classifier: Good / Poor
- Features
- L: late payments/yr
- R: Income/Expenses
Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P
8Nearest Neighbor Example II
Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P
[Figure: the eight instances plotted in L-R space (L on the horizontal axis, 10-30; R on the vertical axis, up to 1); Good cases cluster at low L / high R, Poor cases at high L / low R]
9Nearest Neighbor Example II
Name L R G/P
I 6 1.15
J 22 0.45
K 15 1.2 ??
[Figure: the new instances I, J, K plotted among the labeled cases in L-R space]
Distance Measure
sqrt((L1 - L2)^2 + sqrt(10) * (R1 - R2)^2)
- Scaled distance: R differences are up-weighted so both features contribute
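A small sketch applying the scaled distance above to the labeled table to find nearest neighbors for the new instances (Python; the weight sqrt(10) on squared R differences is read directly from the formula as written):

```python
import math

# Labeled training table: name -> (L, R, label)
train = {
    "A": (0, 1.20, "G"), "B": (25, 0.40, "P"), "C": (5, 0.70, "G"),
    "D": (20, 0.80, "P"), "E": (30, 0.85, "P"), "F": (11, 1.20, "G"),
    "G": (7, 1.15, "G"), "H": (15, 0.80, "P"),
}
# New, unlabeled instances from the slide (K is the ambiguous "??" case).
new = {"I": (6, 1.15), "J": (22, 0.45), "K": (15, 1.20)}

def scaled_dist(p, q):
    # sqrt((L1 - L2)^2 + sqrt(10) * (R1 - R2)^2), as on the slide
    return math.sqrt((p[0] - q[0]) ** 2 + math.sqrt(10) * (p[1] - q[1]) ** 2)

for name, point in new.items():
    nearest = min(train, key=lambda k: scaled_dist(point, train[k][:2]))
    print(name, "->", nearest, train[nearest][2])
```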
10Efficient Implementations
- Classification cost
- Find nearest neighbor: O(n)
- Compute distance between unknown and all instances
- Compare distances
- Problematic for large data sets
- Alternative
- Use binary search in a tree structure (k-d trees, below) to reduce to O(log n)
11Roadmap
- Problem
- Matching Topics and Documents
- Methods
- Classic Vector Space Model
- Challenge I: Beyond literal matching
- Expansion Strategies
- Challenge II: Authoritative sources
- PageRank
- Hubs & Authorities
12Matching Topics and Documents
- Two main perspectives
- Pre-defined, fixed, finite topics
- Text Classification
- Arbitrary topics, typically defined by statement of information need (aka query)
- Information Retrieval
13Three Steps to IR
- Three phases
- Indexing: Build collection of document representations
- Query construction
- Convert query text to vector
- Retrieval
- Compute similarity between query and doc representation
- Return closest match
14Matching Topics and Documents
- Documents are about some topic(s)
- Question: Evidence of aboutness?
- Words !!
- Possibly also meta-data in documents
- Tags, etc
- Model encodes how words capture topic
- E.g. Bag of words model, Boolean matching
- What information is captured?
- How is similarity computed?
15Models for Retrieval and Classification
- Plethora of models are used
- Here
- Vector Space Model
16Vector Space Information Retrieval
- Task
- Document collection
- Query specifies information need: free text
- Relevance judgments: 0/1 for all docs
- Word evidence: Bag of words
- No ordering information
17Vector Space Model
[Figure: documents as vectors in a three-dimensional term space with axes Tv, Program, Computer]
- Two documents: "computer program", "tv program"
- Query "computer program" matches the 1st doc exactly (distance 0 vs. 2)
- Query "educational program" matches both equally (distance 1)
18Vector Space Model
- Represent documents and queries as
- Vectors of term-based features
- Features tied to occurrence of terms in collection
- E.g.
- Solution 1: Binary features: 1 if term present, 0 otherwise
- Similarity: number of terms in common
- Dot product
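A minimal sketch of Solution 1 in Python: binary term vectors compared by dot product (the two example documents come from the earlier slide):

```python
def binary_vector(text, vocabulary):
    """1 if the term occurs in the text, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

docs = ["computer program", "tv program"]
vocab = sorted({w for d in docs for w in d.split()})  # ['computer', 'program', 'tv']

query = binary_vector("computer program", vocab)
for d in docs:
    print(d, "->", dot(query, binary_vector(d, vocab)))  # terms in common: 2 vs 1
```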
19Question
20Vector Space Model II
- Problem: Not all terms are equally interesting
- E.g. 'the' vs. 'dog' vs. 'Levow'
- Solution: Replace binary term features with weights
- Document collection: term-by-document matrix
- View as vector in multidimensional space
- Nearby vectors are related
- Normalize for vector length
21Vector Similarity Computation
- Similarity: Dot product
- Normalization
- Normalize weights in advance
- Normalize post-hoc
22Term Weighting
- Aboutness
- To what degree is this term what the document is about?
- Within-document measure
- Term frequency (tf): occurrences of term t in doc j
- Specificity
- How surprised are you to see this term?
- Collection frequency
- Inverse document frequency (idf)
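A sketch combining the two measures as tf-idf weights with length-normalized (cosine) similarity (Python; the tf * log(N/df) weighting and the toy documents are assumptions, since the slide does not fix an exact formula):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the unix cat command prints a file",
    "dogs and cats are pets",
]

N = len(docs)
doc_terms = [Counter(d.split()) for d in docs]
df = Counter(t for terms in doc_terms for t in terms)   # document frequency of each term

def tfidf(terms):
    """Weight = tf * idf, with idf = log(N / df); terms in every doc get weight 0."""
    return {t: tf * math.log(N / df[t]) for t, tf in terms.items()}

def cosine(u, v):
    num = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return num / (norm(u) * norm(v))

query = tfidf(Counter("cat".split()))
for d, terms in zip(docs, doc_terms):
    print(round(cosine(query, tfidf(terms)), 3), d)
```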
23Term Selection Formation
- Selection
- Some terms are truly useless
- Too frequent, no content
- E.g. the, a, and,
- Stop words: ignore such terms altogether
- Creation
- Too many surface forms for same concepts
- E.g. inflections of words: verb conjugations, plurals
- Stem terms: treat all forms as the same underlying term
24Key Issue
- All approaches operate on term matching
- If a synonym, rather than the original term, is used, the approach fails
- Develop more robust techniques
- Match concept rather than term
- Expansion approaches
- Add in related terms to enhance matching
- Mapping techniques
- Associate terms to concepts
- Aspect models, stemming
25Expansion Techniques
- Can apply to query or document
- Thesaurus expansion
- Use a linguistic resource (thesaurus, WordNet) to add synonyms/related terms
- Feedback expansion
- Add terms that should have appeared
- User interaction
- Direct or relevance feedback
- Automatic pseudo relevance feedback
26Query Refinement
- Typical queries very short, ambiguous
- 'cat': animal / Unix command
- Add more terms to disambiguate, improve
- Relevance feedback
- Retrieve with original queries
- Present results
- Ask user to tag relevant/non-relevant
- Push query toward relevant vectors, away from non-relevant ones
- Rocchio expansion formula: Q' = Q + (beta/r) * sum(relevant docs) - (gamma/s) * sum(non-relevant docs)
- beta, gamma typically (0.75, 0.25); r = number of relevant docs, s = number of non-relevant docs
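A sketch of the Rocchio update in Python (term-weight dicts stand in for vectors; alpha = 1 is an assumption, and the (0.75, 0.25) values for beta and gamma follow the slide):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Q' = alpha*Q + (beta/r)*sum(relevant) - (gamma/s)*sum(non-relevant)."""
    def add(acc, vec, weight):
        for term, w in vec.items():
            acc[term] = acc.get(term, 0.0) + weight * w
        return acc

    new_q = add({}, query, alpha)
    for doc in relevant:
        add(new_q, doc, beta / len(relevant))
    for doc in nonrelevant:
        add(new_q, doc, -gamma / len(nonrelevant))
    # Terms pushed below zero are usually dropped from the expanded query.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"cat": 1.0}
rel = [{"cat": 1.0, "pet": 0.5}]      # documents the user tagged relevant
nonrel = [{"cat": 0.8, "unix": 1.0}]  # documents tagged non-relevant
print(rocchio(q, rel, nonrel))
```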
27Compression Techniques
- Reduce surface term variation to concepts
- Stemming
- Map inflectional variants to root
- E.g. see, sees, seen, saw -> see
- Crucial for highly inflected languages: Czech, Arabic
- Aspect models
- Matrix representations typically very sparse
- Reduce dimensionality to small key aspects
- Mapping contextually similar terms together
- Latent semantic analysis
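Latent semantic analysis reduces the term-by-document matrix with a truncated SVD so that contextually similar terms land close together; a minimal numpy sketch (the tiny count matrix and the choice of k = 2 are illustrative assumptions):

```python
import numpy as np

# Rows = terms, columns = documents (raw counts).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                               # keep only the k strongest "aspects"
term_vecs = U[:, :k] * s[:k]        # terms represented in the reduced aspect space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# 'car' and 'automobile' never co-occur in a document, but are linked through 'engine'.
print(cos(term_vecs[0], term_vecs[1]))   # high similarity in the reduced space
print(cos(term_vecs[0], term_vecs[3]))   # near-zero similarity (different aspect)
```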
28Authoritative Sources
- Based on vector space alone, what would you expect to get searching for 'search engine'?
- Would you expect to get Google?
29Issue
- Text isn't always the best indicator of content
- Example
- search engine
- Text search -> reviews of search engines
- The term doesn't appear on search engine pages themselves
- The term probably appears on many pages that point to many search engines
30Hubs & Authorities
- Not all sites are created equal
- Finding better sites
- Question: What defines a good site?
- Authoritative
- Not just content, but connections!
- One that many other sites think is good
- Site that is pointed to by many other sites
- Authority
31Conferring Authority
- Authorities rarely link to each other
- Competition
- Hubs
- Relevant sites point to prominent sites on topic
- Often not prominent themselves
- Professional or amateur
- Good hubs point to good authorities
32Computing HITS
- Finding Hubs and Authorities
- Two steps
- Sampling
- Find potential authorities
- Weight-propagation
- Iteratively estimate best hubs and authorities
33Sampling
- Identify potential hubs and authorities
- Connected subsections of web
- Select root set with standard text query
- Construct base set
- All nodes pointed to by root set
- All nodes that point to root set
- Drop within-domain links
- 1000-5000 pages
34Weight-propagation
- Weights
- Authority weight: x
- Hub weight: y
- All weights are relative
- Updating
- Converges
- Pages with high x are good authorities; pages with high y are good hubs
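A sketch of the weight-propagation rounds (Python; the tiny link graph is an illustrative assumption): each page's authority weight x sums the hub weights y of the pages pointing to it, each hub weight y sums the authority weights of the pages it points to, and weights are renormalized after every round.

```python
import math

# links[p] = pages that p points to (a tiny illustrative base set)
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "auth1": [], "auth2": [], "auth3": ["auth1"],
}

pages = list(links)
x = {p: 1.0 for p in pages}   # authority weights
y = {p: 1.0 for p in pages}   # hub weights

def normalize(w):
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: v / norm for p, v in w.items()}

for _ in range(20):   # iterate until (approximate) convergence
    x = normalize({p: sum(y[q] for q in pages if p in links[q]) for p in pages})
    y = normalize({p: sum(x[q] for q in links[p]) for p in pages})

print(sorted(x, key=x.get, reverse=True))   # high x: good authorities
print(sorted(y, key=y.get, reverse=True))   # high y: good hubs
```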
35Google's PageRank
- Identifies authorities
- Important pages are those pointed to by many other pages
- Better pointers, higher rank
- Ranks search results
- t: page pointing to A; C(t): number of outbound links from t
- d: damping factor
- Actual ranking on logarithmic scale
- Iterate
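A sketch of the iteration using the quantities defined above, PR(A) = (1 - d) + d * sum over pages t pointing to A of PR(t)/C(t) (Python; the toy graph and d = 0.85 are illustrative assumptions):

```python
# outlinks[p] = pages that p points to; C(p) = len(outlinks[p])
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(outlinks)
d = 0.85                              # damping factor
pr = {p: 1.0 for p in pages}

for _ in range(50):                   # iterate until values stop changing much
    pr = {
        p: (1 - d) + d * sum(pr[t] / len(outlinks[t])
                             for t in pages if p in outlinks[t])
        for p in pages
    }

for p in sorted(pr, key=pr.get, reverse=True):
    print(p, round(pr[p], 3))         # higher PR = more authoritative
```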
36Contrasts
- Internal links
- Large sites carry more weight
- If well-designed
- HA ignores site-internals
- Outbound links explicitly penalized
- Lots of tweaks.
37Web Search
- Search by content
- Vector space model
- Word-based representation
- Aboutness and Surprise
- Enhancing matches
- Simple learning model
- Search by structure
- Authorities identified by link structure of web
- Hubs confer authority
38Efficient Implementation K-D Trees
- Divide instances into sets based on features
- Binary branching: e.g. > value
- 2^d leaves with d splits on a root-to-leaf path; for n instances, 2^d is about n
- d = O(log n)
- To split cases into sets,
- If there is one element in the set, stop
- Otherwise pick a feature to split on
- Find average position of two middle objects on that dimension
- Split remaining objects based on average position
- Recursively split subsets
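A sketch of this construction in Python (choosing the split feature round-robin by depth is an assumption; the slide only says "pick a feature"):

```python
def build_kdtree(points, depth=0):
    """points: list of (feature_vector, label) pairs."""
    if len(points) <= 1:
        return points[0] if points else None          # a leaf holds a single case
    axis = depth % len(points[0][0])                  # pick a feature to split on
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    # average position of the two middle objects on that dimension
    split = (points[mid - 1][0][axis] + points[mid][0][axis]) / 2.0
    left = [p for p in points if p[0][axis] <= split]
    right = [p for p in points if p[0][axis] > split]
    if not left or not right:                         # all values tied: force an even split
        left, right = points[:mid], points[mid:]
    return {"axis": axis, "split": split,
            "left": build_kdtree(left, depth + 1),
            "right": build_kdtree(right, depth + 1)}

# Credit-rating data from the earlier example: ((L, R), label)
data = [((0, 1.2), "G"), ((25, 0.4), "P"), ((5, 0.7), "G"), ((20, 0.8), "P"),
        ((30, 0.85), "P"), ((11, 1.2), "G"), ((7, 1.15), "G"), ((15, 0.8), "P")]
tree = build_kdtree(data)
```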
39K-D Trees Classification
[Figure: k-d tree for the credit data; classification follows yes/no branches at each comparison node down to a leaf labeled Good or Poor]
40Efficient Implementation: Parallel Hardware
- Classification cost
- Distance computations
- Constant time if O(n) processors
- Cost of finding closest
- Compute pairwise minimum, successively
- O(log n) time
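A sketch of the pairwise-minimum idea, simulated sequentially in Python (with one processor per pair, each round would take constant time, giving O(log n) rounds overall):

```python
distances = [4.2, 1.7, 3.3, 0.9, 2.8, 5.1, 1.1, 2.0]   # one distance per training instance

rounds = 0
current = distances[:]
while len(current) > 1:
    # Each round, adjacent pairs are compared "in parallel"; the list halves.
    current = [min(current[i], current[i + 1]) if i + 1 < len(current) else current[i]
               for i in range(0, len(current), 2)]
    rounds += 1

print("closest distance:", current[0], "found in", rounds, "rounds")   # ~log2(n) rounds
```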
41Nearest Neighbor Issues
- Prediction can be expensive if many features
- Affected by classification, feature noise
- One entry can change prediction
- Definition of distance metric
- How to combine different features
- Different types, ranges of values
- Sensitive to feature selection
42Nearest Neighbor Analysis
- Problem
- Ambiguous labeling, Training Noise
- Solution
- K-nearest neighbors
- Not just single nearest instance
- Compare to K nearest neighbors
- Label according to majority of K
- What should K be?
- Often 3; K can also be tuned on training data
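Extending the earlier 1-NN sketch to K nearest neighbors with a majority vote (Python; the toy data and K = 3 are illustrative):

```python
import math
from collections import Counter

def knn_label(query, examples, k=3):
    """Label `query` by majority vote over its k nearest labeled examples."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

examples = [((1, 1), "G"), ((1, 2), "G"), ((2, 1), "P"), ((8, 8), "P"), ((9, 8), "P")]
print(knn_label((1.5, 1.5), examples, k=3))   # majority of the 3 nearest wins
```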
43Nearest Neighbor Analysis
- Issue
- What is a good distance metric?
- How should features be combined?
- Strategy
- (Typically weighted) Euclidean distance
- Feature scaling: Normalization
- Good starting point
- (Feature - Feature_mean) / Feature_standard_deviation
- Rescales all values: centered on 0 with std dev 1
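A sketch of this normalization applied per feature column before distances are computed (Python, standard library; the raw credit features are reused for illustration):

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Rescale each feature column to mean 0, standard deviation 1."""
    cols = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

# Raw (L, R) features have very different ranges, so L would dominate raw distances.
raw = [(0, 1.2), (25, 0.4), (5, 0.7), (20, 0.8), (30, 0.85), (11, 1.2), (7, 1.15), (15, 0.8)]
for row in zscore_columns(raw):
    print(tuple(round(v, 2) for v in row))
```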
44Nearest Neighbor Analysis
- Issue
- What features should we use?
- E.g. Credit rating: Many possible features
- Tax bracket, debt burden, retirement savings, etc.
- Nearest neighbor uses ALL of them
- Irrelevant feature(s) could mislead
- Fundamental problem with nearest neighbor
45Nearest Neighbor Advantages
- Fast training
- Just record feature vector / output value pairs
- Can model wide variety of functions
- Complex decision boundaries
- Weak inductive bias
- Very generally applicable
46Summary
- Machine learning
- Acquire function from input features to value
- Based on prior training instances
- Supervised vs Unsupervised learning
- Classification and Regression
- Inductive bias
- Representation of function to learn
- Complexity, Generalization, Validation
47Summary Nearest Neighbor
- Nearest neighbor
- Training: record input vectors and output values
- Prediction: label of the closest training instance to new data
- Efficient implementations
- Pros: fast training, very general, little bias
- Cons: distance metric (scaling), sensitivity to noise and extraneous features