Title: Nearest Neighbor
1Nearest Neighbor Information Retrieval Search
- Artificial Intelligence
- CMSC 25000
- January 29, 2004
2Agenda
- Machine learning Introduction
- Nearest neighbor techniques
- Applications: Robotic motion, credit rating
- Information retrieval search
- Efficient implementations
- k-d trees, parallelism
- Extensions: K-nearest neighbor
- Limitations
- Distance, dimensions, irrelevant attributes
3Nearest Neighbor
- Memory- or case- based learning
- Supervised method: Training
- Record labeled instances and feature-value vectors
- For each new, unlabeled instance
- Identify nearest labeled instance
- Assign same label
- Consistency heuristic: Assume that a property is the same as that of the nearest reference case.
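A minimal sketch of this record-then-match procedure (Python; the toy points and plain Euclidean distance are illustrative assumptions, not from the slides):

```python
import math

# "Training" is just recording labeled feature vectors (memory-based learning).
training = [
    ((0.0, 1.2), "G"),
    ((25.0, 0.4), "P"),
]

def nearest_neighbor(query, examples):
    """Return the label of the labeled instance closest to `query`."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(examples, key=lambda ex: distance(query, ex[0]))
    return label

# A new, unlabeled instance gets the label of its nearest stored case.
print(nearest_neighbor((2.0, 1.1), training))
```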
4Nearest Neighbor Example
- Problem: Robot arm motion
- Difficult to model analytically
- Kinematic equations
- Relate joint angles and manipulator positions
- Dynamics equations
- Relate motor torques to joint angles
- Difficult to achieve good results modeling robotic arms or the human arm
- Many factors and measurements
5Nearest Neighbor Example
- Solution
- Move robot arm around
- Record parameters and trajectory segment
- Table: torques, positions, velocities, squared velocities, velocity products, accelerations
- To follow a new path
- Break into segments
- Find closest segments in table
- Get those torques (interpolate as necessary)
6Nearest Neighbor Example
- Issue: Big table
- First time with new trajectory
- Closest isn't close
- Table is sparse - few entries
- Solution: Practice
- As trajectories are attempted, fill in more of the table
- After few attempts, very close
7Nearest Neighbor Example II
- Credit Rating
- Classifier: Good / Poor
- Features
- L: late payments/yr
- R: Income/Expenses
Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P
8Nearest Neighbor Example II
Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P
[Figure: the eight instances plotted in L-R space (L on the horizontal axis, 10-30; R on the vertical axis, up to 1); Good cases cluster at low L / high R, Poor cases at high L / low R]
9Nearest Neighbor Example II
Name L R G/P
I 6 1.15
J 22 0.45
K 15 1.2 ??
[Figure: the new instances I, J, K plotted among the labeled cases in L-R space]
Distance Measure
sqrt((L1 - L2)^2 + sqrt(10) * (R1 - R2)^2)
- Scaled distance: R differences are up-weighted so both features contribute
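A small sketch applying the scaled distance above to the labeled table to find nearest neighbors for the new instances (Python; the weight sqrt(10) on squared R differences is read directly from the formula as written):

```python
import math

# Labeled training table: name -> (L, R, label)
train = {
    "A": (0, 1.20, "G"), "B": (25, 0.40, "P"), "C": (5, 0.70, "G"),
    "D": (20, 0.80, "P"), "E": (30, 0.85, "P"), "F": (11, 1.20, "G"),
    "G": (7, 1.15, "G"), "H": (15, 0.80, "P"),
}
# New, unlabeled instances from the slide (K is the ambiguous "??" case).
new = {"I": (6, 1.15), "J": (22, 0.45), "K": (15, 1.20)}

def scaled_dist(p, q):
    # sqrt((L1 - L2)^2 + sqrt(10) * (R1 - R2)^2), as on the slide
    return math.sqrt((p[0] - q[0]) ** 2 + math.sqrt(10) * (p[1] - q[1]) ** 2)

for name, point in new.items():
    nearest = min(train, key=lambda k: scaled_dist(point, train[k][:2]))
    print(name, "->", nearest, train[nearest][2])
```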
10Efficient Implementations
- Classification cost
- Find nearest neighbor: O(n)
- Compute distance between unknown and all instances
- Compare distances
- Problematic for large data sets
- Alternative
- Use binary search in a tree structure (k-d trees, below) to reduce to O(log n)
11Roadmap
- Problem
- Matching Topics and Documents
- Methods
- Classic Vector Space Model
- Challenge I: Beyond literal matching
- Expansion Strategies
- Challenge II: Authoritative sources
- PageRank
- Hubs & Authorities
12Matching Topics and Documents
- Two main perspectives
- Pre-defined, fixed, finite topics
- Text Classification
- Arbitrary topics, typically defined by statement of information need (aka query)
- Information Retrieval
13Three Steps to IR
- Three phases
- Indexing: Build collection of document representations
- Query construction
- Convert query text to vector
- Retrieval
- Compute similarity between query and doc representation
- Return closest match
14Matching Topics and Documents
- Documents are about some topic(s)
- Question: Evidence of aboutness?
- Words !!
- Possibly also meta-data in documents
- Tags, etc
- Model encodes how words capture topic
- E.g. Bag of words model, Boolean matching
- What information is captured?
- How is similarity computed?
15Models for Retrieval and Classification
- Plethora of models are used
- Here
- Vector Space Model
16Vector Space Information Retrieval
- Task
- Document collection
- Query specifies information need: free text
- Relevance judgments: 0/1 for all docs
- Word evidence: Bag of words
- No ordering information
17Vector Space Model
[Figure: documents as vectors in a three-dimensional term space with axes Tv, Program, Computer]
- Two documents: "computer program", "tv program"
- Query "computer program" matches the 1st doc exactly (distance 0 vs. 2)
- Query "educational program" matches both equally (distance 1)
18Vector Space Model
- Represent documents and queries as
- Vectors of term-based features
- Features tied to occurrence of terms in collection
- E.g.
- Solution 1: Binary features: 1 if term present, 0 otherwise
- Similarity: number of terms in common
- Dot product
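A minimal sketch of Solution 1 in Python: binary term vectors compared by dot product (the two example documents come from the earlier slide):

```python
def binary_vector(text, vocabulary):
    """1 if the term occurs in the text, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

docs = ["computer program", "tv program"]
vocab = sorted({w for d in docs for w in d.split()})  # ['computer', 'program', 'tv']

query = binary_vector("computer program", vocab)
for d in docs:
    print(d, "->", dot(query, binary_vector(d, vocab)))  # terms in common: 2 vs 1
```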
19Question
20Vector Space Model II
- Problem: Not all terms are equally interesting
- E.g. 'the' vs. 'dog' vs. 'Levow'
- Solution: Replace binary term features with weights
- Document collection: term-by-document matrix
- View as vector in multidimensional space
- Nearby vectors are related
- Normalize for vector length
21Vector Similarity Computation
- Similarity: Dot product
- Normalization
- Normalize weights in advance
- Normalize post-hoc
22Term Weighting
- Aboutness
- To what degree is this term what the document is about?
- Within-document measure
- Term frequency (tf): occurrences of term t in doc j
- Specificity
- How surprised are you to see this term?
- Collection frequency
- Inverse document frequency (idf)
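A sketch combining the two measures as tf-idf weights with length-normalized (cosine) similarity (Python; the tf * log(N/df) weighting and the toy documents are assumptions, since the slide does not fix an exact formula):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the unix cat command prints a file",
    "dogs and cats are pets",
]

N = len(docs)
doc_terms = [Counter(d.split()) for d in docs]
df = Counter(t for terms in doc_terms for t in terms)   # document frequency of each term

def tfidf(terms):
    """Weight = tf * idf, with idf = log(N / df); terms in every doc get weight 0."""
    return {t: tf * math.log(N / df[t]) for t, tf in terms.items()}

def cosine(u, v):
    num = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return num / (norm(u) * norm(v))

query = tfidf(Counter("cat".split()))
for d, terms in zip(docs, doc_terms):
    print(round(cosine(query, tfidf(terms)), 3), d)
```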
23Term Selection Formation
- Selection
- Some terms are truly useless
- Too frequent, no content
- E.g. the, a, and,
- Stop words: ignore such terms altogether
- Creation
- Too many surface forms for same concepts
- E.g. inflections of words: verb conjugations, plurals
- Stem terms: treat all forms as the same underlying term
24Key Issue
- All approaches operate on term matching
- If a synonym, rather than the original term, is used, the approach fails
- Develop more robust techniques
- Match concept rather than term
- Expansion approaches
- Add in related terms to enhance matching
- Mapping techniques
- Associate terms to concepts
- Aspect models, stemming
25Expansion Techniques
- Can apply to query or document
- Thesaurus expansion
- Use a linguistic resource (thesaurus, WordNet) to add synonyms/related terms
- Feedback expansion
- Add terms that should have appeared
- User interaction
- Direct or relevance feedback
- Automatic pseudo relevance feedback
26Query Refinement
- Typical queries very short, ambiguous
- 'cat': animal / Unix command
- Add more terms to disambiguate, improve
- Relevance feedback
- Retrieve with original queries
- Present results
- Ask user to tag relevant/non-relevant
- Push query toward relevant vectors, away from non-relevant ones
- Rocchio expansion formula: Q' = Q + (beta/r) * sum(relevant docs) - (gamma/s) * sum(non-relevant docs)
- beta, gamma typically (0.75, 0.25); r = number of relevant docs, s = number of non-relevant docs
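A sketch of the Rocchio update in Python (term-weight dicts stand in for vectors; alpha = 1 is an assumption, and the (0.75, 0.25) values for beta and gamma follow the slide):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Q' = alpha*Q + (beta/r)*sum(relevant) - (gamma/s)*sum(non-relevant)."""
    def add(acc, vec, weight):
        for term, w in vec.items():
            acc[term] = acc.get(term, 0.0) + weight * w
        return acc

    new_q = add({}, query, alpha)
    for doc in relevant:
        add(new_q, doc, beta / len(relevant))
    for doc in nonrelevant:
        add(new_q, doc, -gamma / len(nonrelevant))
    # Terms pushed below zero are usually dropped from the expanded query.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"cat": 1.0}
rel = [{"cat": 1.0, "pet": 0.5}]      # documents the user tagged relevant
nonrel = [{"cat": 0.8, "unix": 1.0}]  # documents tagged non-relevant
print(rocchio(q, rel, nonrel))
```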
27Compression Techniques
- Reduce surface term variation to concepts
- Stemming
- Map inflectional variants to root
- E.g. see, sees, seen, saw -> see
- Crucial for highly inflected languages: Czech, Arabic
- Aspect models
- Matrix representations typically very sparse
- Reduce dimensionality to small key aspects
- Mapping contextually similar terms together
- Latent semantic analysis
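Latent semantic analysis reduces the term-by-document matrix with a truncated SVD so that contextually similar terms land close together; a minimal numpy sketch (the tiny count matrix and the choice of k = 2 are illustrative assumptions):

```python
import numpy as np

# Rows = terms, columns = documents (raw counts).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                               # keep only the k strongest "aspects"
term_vecs = U[:, :k] * s[:k]        # terms represented in the reduced aspect space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# 'car' and 'automobile' never co-occur in a document, but are linked through 'engine'.
print(cos(term_vecs[0], term_vecs[1]))   # high similarity in the reduced space
print(cos(term_vecs[0], term_vecs[3]))   # near-zero similarity (different aspect)
```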
28Authoritative Sources
- Based on vector space alone, what would you expect to get searching for 'search engine'?
- Would you expect to get Google?
29Issue
- Text isn't always the best indicator of content
- Example
- search engine
- Text search -> reviews of search engines
- The term doesn't appear on search engine pages themselves
- The term probably appears on many pages that point to many search engines
30Hubs & Authorities
- Not all sites are created equal
- Finding better sites
- Question: What defines a good site?
- Authoritative
- Not just content, but connections!
- One that many other sites think is good
- Site that is pointed to by many other sites
- Authority
31Conferring Authority
- Authorities rarely link to each other
- Competition
- Hubs
- Relevant sites point to prominent sites on topic
- Often not prominent themselves
- Professional or amateur
- Good hubs point to good authorities
32Computing HITS
- Finding Hubs and Authorities
- Two steps
- Sampling
- Find potential authorities
- Weight-propagation
- Iteratively estimate best hubs and authorities
33Sampling
- Identify potential hubs and authorities
- Connected subsections of web
- Select root set with standard text query
- Construct base set
- All nodes pointed to by root set
- All nodes that point to root set
- Drop within-domain links
- 1000-5000 pages
34Weight-propagation
- Weights
- Authority weight: x
- Hub weight: y
- All weights are relative
- Updating
- Converges
- Pages with high x are good authorities; pages with high y are good hubs
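A sketch of the weight-propagation rounds (Python; the tiny link graph is an illustrative assumption): each page's authority weight x sums the hub weights y of the pages pointing to it, each hub weight y sums the authority weights of the pages it points to, and weights are renormalized after every round.

```python
import math

# links[p] = pages that p points to (a tiny illustrative base set)
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2", "auth3"],
    "auth1": [], "auth2": [], "auth3": ["auth1"],
}

pages = list(links)
x = {p: 1.0 for p in pages}   # authority weights
y = {p: 1.0 for p in pages}   # hub weights

def normalize(w):
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: v / norm for p, v in w.items()}

for _ in range(20):   # iterate until (approximate) convergence
    x = normalize({p: sum(y[q] for q in pages if p in links[q]) for p in pages})
    y = normalize({p: sum(x[q] for q in links[p]) for p in pages})

print(sorted(x, key=x.get, reverse=True))   # high x: good authorities
print(sorted(y, key=y.get, reverse=True))   # high y: good hubs
```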
35Google's PageRank
- Identifies authorities
- Important pages are those pointed to by many other pages
- Better pointers, higher rank
- Ranks search results
- t: page pointing to A; C(t): number of outbound links from t
- d: damping factor
- Actual ranking on logarithmic scale
- Iterate
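A sketch of the iteration using the quantities defined above, PR(A) = (1 - d) + d * sum over pages t pointing to A of PR(t)/C(t) (Python; the toy graph and d = 0.85 are illustrative assumptions):

```python
# outlinks[p] = pages that p points to; C(p) = len(outlinks[p])
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(outlinks)
d = 0.85                              # damping factor
pr = {p: 1.0 for p in pages}

for _ in range(50):                   # iterate until values stop changing much
    pr = {
        p: (1 - d) + d * sum(pr[t] / len(outlinks[t])
                             for t in pages if p in outlinks[t])
        for p in pages
    }

for p in sorted(pr, key=pr.get, reverse=True):
    print(p, round(pr[p], 3))         # higher PR = more authoritative
```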
36Contrasts
- Internal links
- Large sites carry more weight
- If well-designed
- HA ignores site-internals
- Outbound links explicitly penalized
- Lots of tweaks.
37Web Search
- Search by content
- Vector space model
- Word-based representation
- Aboutness and Surprise
- Enhancing matches
- Simple learning model
- Search by structure
- Authorities identified by link structure of web
- Hubs confer authority
38Efficient Implementation K-D Trees
- Divide instances into sets based on features
- Binary branching: e.g. > value
- 2^d leaves with d splits on a root-to-leaf path; for n instances, 2^d is about n
- d = O(log n)
- To split cases into sets,
- If there is one element in the set, stop
- Otherwise pick a feature to split on
- Find average position of two middle objects on that dimension
- Split remaining objects based on average position
- Recursively split subsets
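A sketch of this construction in Python (choosing the split feature round-robin by depth is an assumption; the slide only says "pick a feature"):

```python
def build_kdtree(points, depth=0):
    """points: list of (feature_vector, label) pairs."""
    if len(points) <= 1:
        return points[0] if points else None          # a leaf holds a single case
    axis = depth % len(points[0][0])                  # pick a feature to split on
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    # average position of the two middle objects on that dimension
    split = (points[mid - 1][0][axis] + points[mid][0][axis]) / 2.0
    left = [p for p in points if p[0][axis] <= split]
    right = [p for p in points if p[0][axis] > split]
    if not left or not right:                         # all values tied: force an even split
        left, right = points[:mid], points[mid:]
    return {"axis": axis, "split": split,
            "left": build_kdtree(left, depth + 1),
            "right": build_kdtree(right, depth + 1)}

# Credit-rating data from the earlier example: ((L, R), label)
data = [((0, 1.2), "G"), ((25, 0.4), "P"), ((5, 0.7), "G"), ((20, 0.8), "P"),
        ((30, 0.85), "P"), ((11, 1.2), "G"), ((7, 1.15), "G"), ((15, 0.8), "P")]
tree = build_kdtree(data)
```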
39K-D Trees Classification
[Figure: k-d tree for the credit data; classification follows yes/no branches at each comparison node down to a leaf labeled Good or Poor]
40Efficient Implementation: Parallel Hardware
- Classification cost
- Distance computations
- Constant time if O(n) processors
- Cost of finding closest
- Compute pairwise minimum, successively
- O(log n) time
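A sketch of the pairwise-minimum idea, simulated sequentially in Python (with one processor per pair, each round would take constant time, giving O(log n) rounds overall):

```python
distances = [4.2, 1.7, 3.3, 0.9, 2.8, 5.1, 1.1, 2.0]   # one distance per training instance

rounds = 0
current = distances[:]
while len(current) > 1:
    # Each round, adjacent pairs are compared "in parallel"; the list halves.
    current = [min(current[i], current[i + 1]) if i + 1 < len(current) else current[i]
               for i in range(0, len(current), 2)]
    rounds += 1

print("closest distance:", current[0], "found in", rounds, "rounds")   # ~log2(n) rounds
```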
41Nearest Neighbor Issues
- Prediction can be expensive if many features
- Affected by classification, feature noise
- One entry can change prediction
- Definition of distance metric
- How to combine different features
- Different types, ranges of values
- Sensitive to feature selection
42Nearest Neighbor Analysis
- Problem
- Ambiguous labeling, Training Noise
- Solution
- K-nearest neighbors
- Not just single nearest instance
- Compare to K nearest neighbors
- Label according to majority of K
- What should K be?
- Often 3; K can also be tuned on training data
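Extending the earlier 1-NN sketch to K nearest neighbors with a majority vote (Python; the toy data and K = 3 are illustrative):

```python
import math
from collections import Counter

def knn_label(query, examples, k=3):
    """Label `query` by majority vote over its k nearest labeled examples."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

examples = [((1, 1), "G"), ((1, 2), "G"), ((2, 1), "P"), ((8, 8), "P"), ((9, 8), "P")]
print(knn_label((1.5, 1.5), examples, k=3))   # majority of the 3 nearest wins
```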
43Nearest Neighbor Analysis
- Issue
- What is a good distance metric?
- How should features be combined?
- Strategy
- (Typically weighted) Euclidean distance
- Feature scaling: Normalization
- Good starting point
- (Feature - Feature_mean) / Feature_standard_deviation
- Rescales all values: centered on 0 with std dev 1
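A sketch of this normalization applied per feature column before distances are computed (Python, standard library; the raw credit features are reused for illustration):

```python
from statistics import mean, stdev

def zscore_columns(rows):
    """Rescale each feature column to mean 0, standard deviation 1."""
    cols = list(zip(*rows))
    stats = [(mean(c), stdev(c)) for c in cols]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

# Raw (L, R) features have very different ranges, so L would dominate raw distances.
raw = [(0, 1.2), (25, 0.4), (5, 0.7), (20, 0.8), (30, 0.85), (11, 1.2), (7, 1.15), (15, 0.8)]
for row in zscore_columns(raw):
    print(tuple(round(v, 2) for v in row))
```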
44Nearest Neighbor Analysis
- Issue
- What features should we use?
- E.g. Credit rating: Many possible features
- Tax bracket, debt burden, retirement savings, etc.
- Nearest neighbor uses ALL of them
- Irrelevant feature(s) could mislead
- Fundamental problem with nearest neighbor
45Nearest Neighbor Advantages
- Fast training
- Just record feature vector / output value pairs
- Can model wide variety of functions
- Complex decision boundaries
- Weak inductive bias
- Very generally applicable
46Summary
- Machine learning
- Acquire function from input features to value
- Based on prior training instances
- Supervised vs Unsupervised learning
- Classification and Regression
- Inductive bias
- Representation of function to learn
- Complexity, Generalization, Validation
47Summary Nearest Neighbor
- Nearest neighbor
- Training: record input vectors and output values
- Prediction: label of the closest training instance to new data
- Efficient implementations
- Pros: fast training, very general, little bias
- Cons: distance metric (scaling), sensitivity to noise and extraneous features