Title: Retrieval by Content / Retrieval by Authority
1. Retrieval by Content / Retrieval by Authority
- Artificial Intelligence
- CMSC 25000
- February 5, 2008
2. Roadmap
- Problem
- Matching Topics and Documents
- Challenge I: Beyond literal matching
- Expansion Strategies
- Challenge II: Authoritative sources
- Hubs & Authorities
- PageRank
3. Roadmap
- Problem
- Matching Topics and Documents
- Methods
- Classic Vector Space Model
- Challenge I: Beyond literal matching
- Expansion Strategies
- Challenge II: Authoritative sources
- PageRank
- Hubs & Authorities
4. Matching Topics and Documents
- Two main perspectives
- Pre-defined, fixed, finite topics
- Text Classification
- Arbitrary topics, typically defined by a statement of information need (aka query)
- Information Retrieval
5. Vector Space Information Retrieval
- Task
- Document collection
- Query specifies information need: free text
- Relevance judgments: 0/1 for all docs
- Word evidence: bag of words
- No ordering information
6. Vector Space Model
[Figure: two documents and a query as vectors in a term space with axes tv, program, computer]
- Two documents: "computer program", "tv program"
- Query "computer program" matches the 1st doc exactly (distance 2 vs. 0); query "educational program" matches both equally (distance 1)
7. Vector Space Model
- Represent documents and queries as
- Vectors of term-based features
- Features tied to occurrence of terms in collection
- Solution 1: binary features: t = 1 if term present, 0 otherwise
- Similarity: number of terms in common
- Dot product (see sketch below)
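A minimal sketch of the binary-feature dot product in Python; the vocabulary construction and the two example documents (taken from the earlier slide) are illustrative, not part of any particular IR system:

    # Binary bag-of-words matching: similarity = number of shared terms (dot product).
    def binary_vector(text, vocabulary):
        words = set(text.lower().split())
        return [1 if term in words else 0 for term in vocabulary]

    def dot_product(u, v):
        return sum(a * b for a, b in zip(u, v))

    docs = ["computer program", "tv program"]
    query = "computer program"

    vocabulary = sorted({w for d in docs + [query] for w in d.lower().split()})
    doc_vectors = [binary_vector(d, vocabulary) for d in docs]
    query_vector = binary_vector(query, vocabulary)

    for d, v in zip(docs, doc_vectors):
        print(d, "->", dot_product(query_vector, v))   # "computer program" scores 2, "tv program" scores 1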
8. Question
9. Vector Space Model II
- Problem: not all terms are equally interesting
- E.g. "the" vs. "dog" vs. "Levow"
- Solution: replace binary term features with weights
- Document collection: term-by-document matrix
- View as vector in multidimensional space
- Nearby vectors are related
- Normalize for vector length
10. Vector Similarity Computation
- Similarity: dot product
- Normalization
- Normalize weights in advance
- Normalize post-hoc (see the cosine sketch below)
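A sketch of the dot product with length normalization, i.e. cosine similarity; the example vectors are arbitrary placeholders:

    import math

    def cosine_similarity(u, v):
        # Dot product normalized by vector lengths; 1.0 = same direction, 0.0 = no shared terms.
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return dot / (norm_u * norm_v)

    print(cosine_similarity([1, 1, 0], [2, 3, 0]))  # high: same terms, different document lengths
    print(cosine_similarity([1, 1, 0], [0, 0, 5]))  # 0.0: no terms in common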
11. Term Weighting
- Aboutness
- To what degree is this term what the document is about?
- Within-document measure
- Term frequency (tf): occurrences of term t in doc j
- Specificity
- How surprised are you to see this term?
- Collection frequency
- Inverse document frequency (idf) (see sketch below)
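A minimal tf-idf weighting sketch, assuming the common formulation w = tf * log(N/df); the exact formula on the original slide is not shown, and the example documents are illustrative:

    import math
    from collections import Counter

    def tf_idf_weights(documents):
        """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
        n_docs = len(documents)
        tokenized = [doc.lower().split() for doc in documents]
        # Document frequency: number of documents containing each term.
        df = Counter(term for tokens in tokenized for term in set(tokens))
        weighted = []
        for tokens in tokenized:
            tf = Counter(tokens)
            weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return weighted

    docs = ["the dog barks", "the dog sleeps", "the cat sleeps"]
    for w in tf_idf_weights(docs):
        print(w)   # "the" gets weight 0 (appears in every doc); rarer terms get higher weights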
12. Term Selection & Formation
- Selection
- Some terms are truly useless
- Too frequent, no content
- E.g. "the", "a", "and", ...
- Stop words: ignore such terms altogether
- Creation
- Too many surface forms for the same concepts
- E.g. inflections of words: verb conjugations, plurals
- Stem terms: treat all forms as the same underlying form
13. Key Issue
- All approaches operate on term matching
- If a synonym, rather than the original term, is used, the approach fails
- Develop more robust techniques
- Match concept rather than term
- Expansion approaches
- Add in related terms to enhance matching
- Mapping techniques
- Associate terms to concepts
- Aspect models, stemming
14. Expansion Techniques
- Can apply to query or document
- Thesaurus expansion
- Use a linguistic resource (thesaurus, WordNet) to add synonyms/related terms
- Feedback expansion
- Add terms that should have appeared
- User interaction
- Direct or relevance feedback
- Automatic: pseudo-relevance feedback
15. Query Refinement
- Typical queries are very short, ambiguous
- "cat": animal / Unix command
- Add more terms to disambiguate, improve matching
- Relevance feedback
- Retrieve with the original query
- Present results
- Ask user to tag relevant/non-relevant
- Push the query toward relevant vectors, away from non-relevant ones
- Rocchio expansion formula: q' = α·q + (β/r)·Σ(relevant docs) − (γ/s)·Σ(non-relevant docs)
- Typical weights (β, γ) = (0.75, 0.25); r = number of relevant docs, s = number of non-relevant docs (see sketch below)
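A sketch of Rocchio expansion with the weights above (alpha = 1, beta = 0.75, gamma = 0.25); the document vectors are plain Python lists purely for illustration:

    def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
        """Move the query vector toward relevant docs and away from non-relevant ones."""
        def centroid(vectors):
            if not vectors:
                return [0.0] * len(query)
            return [sum(col) / len(vectors) for col in zip(*vectors)]
        rel_c = centroid(relevant)
        nonrel_c = centroid(non_relevant)
        return [alpha * q + beta * r - gamma * s
                for q, r, s in zip(query, rel_c, nonrel_c)]

    q = [1.0, 0.0, 0.0]
    relevant = [[1.0, 1.0, 0.0], [1.0, 0.5, 0.0]]
    non_relevant = [[0.0, 0.0, 1.0]]
    print(rocchio(q, relevant, non_relevant))  # weight added on term 2, subtracted on term 3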
16. Compression Techniques
- Reduce surface term variation to concepts
- Stemming
- Map inflectional variants to a root
- E.g. see, sees, seen, saw -> see
- Crucial for highly inflected languages (Czech, Arabic)
- Aspect models
- Matrix representations typically very sparse
- Reduce dimensionality to a small number of key aspects
- Map contextually similar terms together
- Latent semantic analysis (see sketch below)
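A rough sketch of the dimensionality-reduction idea behind latent semantic analysis, using numpy's SVD on a tiny term-by-document matrix; the matrix values and term labels are illustrative only:

    import numpy as np

    # Term-by-document matrix (rows = terms, columns = documents).
    A = np.array([
        [1, 1, 0, 0],   # "computer"
        [1, 0, 1, 0],   # "program"
        [0, 1, 1, 0],   # "software"
        [0, 0, 0, 1],   # "tv"
    ], dtype=float)

    # Truncated SVD: keep only the k largest singular values (the "key aspects").
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    print(np.round(A_k, 2))  # terms/documents now compared in a k-dimensional "concept" space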
17. Authoritative Sources
- Based on the vector space alone, what would you expect to get searching for "search engine"?
- Would you expect to get Google?
18. Issue
- Text isn't always the best indicator of content
- Example: "search engine"
- Text search -> reviews of search engines
- The term doesn't appear on search engine pages
- The term probably appears on many pages that point to many search engines
19. Hubs & Authorities
- Not all sites are created equal
- Finding better sites
- Question: What defines a good site?
- Authoritative
- Not just content, but connections!
- One that many other sites think is good
- Authority: a site that is pointed to by many other sites
20. Conferring Authority
- Authorities rarely link to each other
- Competition
- Hubs
- Relevant sites that point to prominent sites on a topic
- Often not prominent themselves
- Professional or amateur
- Good hubs point to good authorities
21. Computing HITS
- Finding Hubs and Authorities
- Two steps
- Sampling
- Find potential authorities
- Weight-propagation
- Iteratively estimate best hubs and authorities
22. Sampling
- Identify potential hubs and authorities
- Connected subsections of web
- Select root set with standard text query
- Construct base set
- All nodes pointed to by root set
- All nodes that point to root set
- Drop within-domain links
- 1000-5000 pages
23. Weight-propagation
- Weights
- Authority weight x
- Hub weight y
- All weights are relative
- Updating
- Converges
- Pages with high x are good authorities; pages with high y are good hubs
24. Weight Propagation
- Create adjacency matrix A
- A[i,j] = 1 if i links to j, otherwise 0
- Create vectors x (authority) and y (hub) of corresponding values
- Update: x <- A^T y, y <- A x (then normalize)
- Converges to the principal eigenvector (see sketch below)
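A sketch of the weight-propagation step on an adjacency matrix, assuming the standard HITS updates x <- A^T y (authority) and y <- A x (hub), normalized each round; the link graph is a toy example:

    import numpy as np

    def hits(A, iterations=50):
        """A[i, j] = 1 if page i links to page j. Returns (authority, hub) weights."""
        n = A.shape[0]
        x = np.ones(n)   # authority weights
        y = np.ones(n)   # hub weights
        for _ in range(iterations):
            x = A.T @ y            # good authorities are pointed to by good hubs
            y = A @ x              # good hubs point to good authorities
            x /= np.linalg.norm(x)
            y /= np.linalg.norm(y)
        return x, y

    # Pages 0 and 1 are hubs pointing at pages 2 and 3.
    A = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]], dtype=float)
    authority, hub = hits(A)
    print(np.round(authority, 2), np.round(hub, 2))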
25. Google's PageRank
- Identifies authorities
- Important pages are those pointed to by many other pages
- Better pointers, higher rank
- Ranks search results
- PR(A) = (1 - d) + d * sum over pages t pointing to A of PR(t)/C(t)
- t: a page pointing to A; C(t): number of outbound links of t; d: damping factor
- Actual ranking on a logarithmic scale
- Iterate (see sketch below)
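A minimal iterative sketch of the formula above; the link structure and iteration count are illustrative, and no convergence test is included:

    def pagerank(links, d=0.85, iterations=50):
        """links[p] = list of pages that p points to. Returns {page: rank}."""
        pages = list(links)
        rank = {p: 1.0 for p in pages}
        for _ in range(iterations):
            new_rank = {}
            for page in pages:
                # Sum contributions from every page t that links to this page.
                incoming = sum(rank[t] / len(links[t])
                               for t in pages if page in links[t])
                new_rank[page] = (1 - d) + d * incoming
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(links))   # C accumulates the highest rank in this toy graph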
26. Contrasts
- Internal links
- Large sites carry more weight (if well-designed)
- Hubs & Authorities ignores site-internal links
- Outbound links explicitly penalized
- Lots of tweaks.
27. Web Search
- Search by content
- Vector space model
- Word-based representation
- Aboutness and Surprise
- Enhancing matches
- Simple learning model
- Search by structure
- Authorities identified by link structure of web
- Hubs confer authority
28. Learning: Perceptrons
- Artificial Intelligence
- CMSC 25000
- February 5, 2008
29. Agenda
- Neural Networks
- Biological analogy
- Perceptrons: single-layer networks
- Perceptron training
- Perceptron convergence theorem
- Perceptron limitations
- Conclusions
30. Neurons: The Concept
[Figure: a neuron, with dendrites, axon, nucleus, and cell body labeled]
- Neurons receive inputs from other neurons (via synapses)
- When input exceeds a threshold, the neuron fires
- Sends output along its axon to other neurons
- Brain: ~10^11 neurons, ~10^16 synapses
31. Artificial Neural Nets
- Simulated neuron
- Node connected to other nodes via links
- Links play the role of axon + synapse
- Links associated with a weight (like a synapse)
- Multiplied by the output of the node
- Node combines input via an activation function
- E.g. sum of weighted inputs passed through a threshold
- Simpler than real neuronal processes
32. Artificial Neural Net
[Figure: inputs x, each multiplied by a weight w, summed, and passed through a threshold]
33. Perceptrons
- Single neuron-like element
- Binary inputs
- Binary outputs
- Weighted sum of inputs > threshold
34. Perceptron Structure
[Figure: inputs x1..xn with weights w1..wn feeding a single output y; a constant input x0 = 1 with weight w0 compensates for the threshold]
35. Perceptron Convergence Procedure
- Straightforward training procedure
- Learns linearly separable functions
- Until the perceptron yields correct output for all training examples:
- If the perceptron is correct, do nothing
- If the perceptron is wrong:
- If it incorrectly says yes, subtract the input vector from the weight vector
- Otherwise, add the input vector to the weight vector (see sketch below)
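A sketch of this training procedure in Python, following the add/subtract rule above; the threshold is folded into a bias weight via a constant x0 = 1 component, and the pass limit is an illustrative safeguard:

    def perceptron_train(samples, max_passes=100):
        """samples: list of (input_vector, desired_output) with desired 0 or 1.
        Each input vector should include a constant 1 component for the bias."""
        w = [0.0] * len(samples[0][0])
        for _ in range(max_passes):
            errors = 0
            for x, desired in samples:
                predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
                if predicted == desired:
                    continue                      # correct: do nothing
                errors += 1
                if predicted == 1:                # incorrectly said yes: subtract input
                    w = [wi - xi for wi, xi in zip(w, x)]
                else:                             # incorrectly said no: add input
                    w = [wi + xi for wi, xi in zip(w, x)]
            if errors == 0:
                return w                          # converged
        return w

    # LOGICAL-OR example from the next slide: x3 = 1 is the bias input.
    samples = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]
    print(perceptron_train(samples))   # (1, 1, 0), matching the trace on the next slide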
36. Perceptron Convergence Example
- LOGICAL-OR (x3 is a constant bias input)

  Sample  x1  x2  x3  Desired output
  1       0   0   1   0
  2       0   1   1   1
  3       1   0   1   1
  4       1   1   1   1

- Initial w = (0 0 0); after S2, w = w + s2 = (0 1 1)
- Pass 2: S1: w = w - s1 = (0 1 0); S3: w = w + s3 = (1 1 1)
- Pass 3: S1: w = w - s1 = (1 1 0)
37. Perceptron Convergence Theorem
- If there exists a weight vector v such that v·x > 0 for all positive examples x, perceptron training will find one
- Sketch: assume v·x >= delta for all positive examples x
- |w|^2 increases by at most |x|^2 in each iteration (updates occur only when w·x <= 0), so after k updates |w|^2 <= k·max|x|^2
- v·w increases by at least delta per update, while v·w / (|v||w|) <= 1
- Converges in k <= O(max|x|^2 / delta^2) steps
38. Perceptron Learning
- Perceptrons learn linear decision boundaries
- E.g. a line separating two classes in the (x1, x2) plane
- But not XOR:
  x1 = -1, x2 = -1: need w1·x1 + w2·x2 < 0
  x1 =  1, x2 = -1: need w1·x1 + w2·x2 > 0  => implies w1 > 0
  x1 = -1, x2 =  1: need w1·x1 + w2·x2 > 0  => implies w2 > 0
  x1 =  1, x2 =  1: then w1·x1 + w2·x2 > 0, but XOR should be false => contradiction
39. Perceptron Example
- Digit recognition
- Assume a display of 8 lightable bars
- Inputs: bars on/off; threshold unit
- 65 steps to recognize '8'
40. Perceptron Summary
- Motivated by neuron activation
- Simple training procedure
- Guaranteed to converge
- IF linearly separable
41. Neural Nets
- Multi-layer perceptrons
- Inputs: real-valued
- Intermediate "hidden" nodes
- Output(s): one (or more) discrete-valued outputs
[Figure: inputs X1..X4 feeding a layer of hidden nodes, feeding outputs Y1 and Y2]
42. Neural Nets
- Pro: more general than perceptrons
- Not restricted to linear discriminants
- Multiple outputs: one classification each
- Con: no simple, guaranteed training procedure
- Use a greedy, hill-climbing procedure to train
- Gradient descent, backpropagation
43. Solving the XOR Problem
[Figure: network topology with 2 hidden nodes (o1, o2) and 1 output (y); inputs x1 and x2; each unit also receives a constant -1 input weighted by w01, w02, or w03]
- Desired behavior:

  x1  x2  o1  o2  y
  0   0   0   0   0
  0   1   0   1   1
  1   0   0   1   1
  1   1   1   1   0

- Weights: w11 = w12 = 1, w21 = w22 = 1, w01 = 3/2, w02 = 1/2, w03 = 1/2, w13 = -1, w23 = 1
- o1 acts as an AND unit, o2 as an OR unit; y fires when o2 is on and o1 is off, i.e. XOR (see check below)
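A quick check, in Python, that the weights listed above implement XOR, assuming step-threshold units with the bias weights applied to a constant -1 input as in the figure:

    def step(z):
        return 1 if z > 0 else 0

    def xor_net(x1, x2):
        # Hidden node o1 (an AND unit) and o2 (an OR unit), then output y.
        w11 = w12 = w21 = w22 = 1.0
        w01, w02, w03 = 1.5, 0.5, 0.5     # bias weights, each on a constant -1 input
        w13, w23 = -1.0, 1.0
        o1 = step(w11 * x1 + w21 * x2 - w01)
        o2 = step(w12 * x1 + w22 * x2 - w02)
        return step(w13 * o1 + w23 * o2 - w03)

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0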
44. Neural Net Applications
- Speech recognition
- Handwriting recognition
- NETtalk: letter-to-sound rules
- ALVINN: autonomous driving
45. ALVINN
- Driving as a neural network
- Inputs
- Image pixel intensities
- I.e. lane lines
- 5 Hidden nodes
- Outputs
- Steering actions
- E.g. turn left/right, and how far
- Training
- Observe human behavior: sample images and steering
46. Backpropagation
- Greedy, hill-climbing procedure
- Weights are the parameters to change
- Original hill-climbing changes one parameter per step
- Slow
- If the function is smooth, change all parameters per step
- Gradient descent
- Backpropagation: computes the current output, then works backward to correct the error
47. Producing a Smooth Function
- Key problem
- A pure step threshold is discontinuous
- Not differentiable
- Solution
- Sigmoid (squashed 's' function): the logistic function s(z) = 1 / (1 + e^(-z)) (see sketch below)
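A small sketch of the logistic sigmoid and its derivative, which reappears in the gradient computation a few slides below:

    import math

    def sigmoid(z):
        # Smooth, differentiable replacement for the hard step threshold.
        return 1.0 / (1.0 + math.exp(-z))

    def sigmoid_derivative(z):
        # ds/dz = s(z) * (1 - s(z))
        s = sigmoid(z)
        return s * (1.0 - s)

    print(sigmoid(0.0), sigmoid(5.0), sigmoid(-5.0))   # 0.5, ~0.99, ~0.007
    print(sigmoid_derivative(0.0))                     # 0.25, the maximum slope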
48. Neural Net Training
- Goal
- Determine how to change weights to get correct output
- Large change in weight to produce large reduction in error
- Approach
- Compute actual output o
- Compare to desired output d
- Determine the effect of each weight w on the error (d - o)
- Adjust weights
49. Neural Net Example
- x^i: i-th sample input vector; w: weight vector; y^i: desired output for the i-th sample
- Sum-of-squares error over training samples: E = sum_i (y^i - o(x^i, w))^2
- Full expression of the output in terms of the inputs and weights
(From MIT 6.034 notes, Lozano-Perez)
50. Gradient Descent
- Error: sum-of-squares error of the inputs with the current weights
- Compute the rate of change of the error w.r.t. each weight
- Which weights have the greatest effect on the error?
- Effectively, partial derivatives of the error w.r.t. the weights
- In turn, these depend on other weights => chain rule
51. Gradient Descent
- E = G(w): error as a function of the weights
- Find the rate of change of the error: dG/dw
- Follow the steepest rate of change
- Change weights s.t. the error is minimized (see sketch below)
[Figure: error surface E = G(w) with descent steps from w0 toward w1; local minima possible]
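A minimal one-dimensional illustration of the idea; the error function G and the step size are illustrative, not from the slides:

    def gradient_descent(grad, w0, rate=0.1, steps=100):
        """Repeatedly step against the gradient of the error."""
        w = w0
        for _ in range(steps):
            w -= rate * grad(w)
        return w

    # Illustrative error function G(w) = (w - 3)^2, with gradient dG/dw = 2(w - 3).
    grad = lambda w: 2.0 * (w - 3.0)
    print(gradient_descent(grad, w0=0.0))   # converges near 3.0, the minimum of G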
52. Gradient of Error
- Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))
(From MIT 6.034 notes, Lozano-Perez)
53. From Effect to Update
- Gradient computation
- How each weight contributes to performance
- To train
- Need to determine how to CHANGE a weight based on its contribution to performance
- Need to determine how MUCH change to make per iteration
- Rate parameter r
- Large enough to learn quickly
- Small enough to reach, but not overshoot, target values
54. Backpropagation Procedure
[Figure: nodes i -> j -> k in successive layers]
- Pick rate parameter r
- Until performance is good enough:
- Do a forward computation to calculate the output
- Compute beta in the output node: beta_z = d - o_z
- Compute beta in all other nodes: beta_j = sum_k w_(j->k) * o_k * (1 - o_k) * beta_k
- Compute the change for all weights: delta_w_(i->j) = r * o_i * o_j * (1 - o_j) * beta_j (see sketch below)
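A compact sketch of this procedure for a 2-2-1 network of sigmoid units, following the beta-style updates above; the network shape, rate, epoch count, and XOR training data are illustrative assumptions, not from the slides:

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train_xor(rate=0.5, epochs=10000, seed=0):
        random.seed(seed)
        # Weights: 2 hidden nodes (2 inputs + bias each), 1 output node (2 hidden + bias).
        w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
        w_out = [random.uniform(-1, 1) for _ in range(3)]
        data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
        for _ in range(epochs):
            for x, d in data:
                xi = x + [1.0]                                   # append bias input
                h = [sigmoid(sum(w * v for w, v in zip(wj, xi))) for wj in w_hidden]
                hi = h + [1.0]
                o = sigmoid(sum(w * v for w, v in zip(w_out, hi)))
                beta_o = d - o                                   # beta at the output node
                beta_h = [w_out[j] * o * (1 - o) * beta_o for j in range(2)]
                # delta_w(i->j) = rate * o_i * o_j * (1 - o_j) * beta_j
                for j in range(3):
                    w_out[j] += rate * hi[j] * o * (1 - o) * beta_o
                for j in range(2):
                    for i in range(3):
                        w_hidden[j][i] += rate * xi[i] * h[j] * (1 - h[j]) * beta_h[j]
        return w_hidden, w_out

    w_hidden, w_out = train_xor()
    for x, d in [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]:
        xi = x + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(wj, xi))) for wj in w_hidden] + [1.0]
        o = sigmoid(sum(w * v for w, v in zip(w_out, h)))
        print(x, d, round(o, 2))   # outputs should approach the 0/1 targets (XOR can hit local minima)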
55. Backprop Example
Forward prop: compute z_i and y_i given x_k, w_l
56. Backpropagation Observations
- Procedure is (relatively) efficient
- All computations are local
- Use inputs and outputs of current node
- What is good enough?
- Rarely reach target (0 or 1) outputs
- Typically, train until within 0.1 of target
57. Neural Net Summary
- Training
- Backpropagation procedure
- Gradient descent strategy (usual problems)
- Prediction
- Compute outputs based on input vector and weights
- Pros: very general, fast prediction
- Cons: training can be VERY slow (1000s of epochs), overfitting
58. Training Strategies
- Online training
- Update weights after each sample
- Offline (batch) training
- Compute error over all samples
- Then update weights
- Online training is noisy
- Sensitive to individual instances
- However, may escape local minima
59. Training Strategy
- To avoid overfitting
- Split data into training, validation, and test sets
- Also, avoid excess weights (fewer weights than samples)
- Initialize with small random weights
- Small changes have noticeable effect
- Use offline training
- Train until error on the validation set reaches a minimum
- Evaluate on the test set
- No more weight changes
60. Classification
- Neural networks are best suited to classification tasks
- Single output -> binary classifier
- Multiple outputs -> multiway classification
- Applied successfully to learning pronunciation
- Sigmoid pushes outputs toward binary classification
- Not good for regression
61. Neural Net Example
- NETtalk: letter-to-sound by neural net
- Inputs
- Need context to pronounce a letter
- 7-letter window: predict the sound of the middle letter
- 29 possible characters: alphabet + space, comma, period
- 7 x 29 = 203 inputs
- 80 hidden nodes
- Output: generate 60 phones
- Output nodes map to 26 units: 21 articulatory, 5 stress/syllable
- Vector quantization of acoustic space
62. Neural Net Example: NETtalk
- Learning to talk
- 5 iterations over 1024 training words: word boundaries/stress emerge
- 10 iterations: intelligible
- 400 new test words: 80% correct
- Not as good as DecTalk, but automatic
63. Neural Net Conclusions
- Simulation based on neurons in the brain
- Perceptrons (single neuron)
- Guaranteed to find a linear discriminant
- IF one exists -> problem: XOR
- Neural nets (multi-layer perceptrons)
- Very general
- Backpropagation training procedure
- Gradient descent: local minima, overfitting issues