Title: Web Search and Data Mining
1. Web Search and Data Mining
- Lecture 4
- Adapted from Manning, Raghavan and Schuetze
2. Recap of the last lecture
- MapReduce and distributed indexing
- Scoring documents: linear combination / zone weighting
- tf-idf term weighting and vector spaces
- Derivation of idf
3. This lecture
- Vector space models
- Dimension reduction: random projection
- Review of linear algebra
- Latent semantic indexing (LSI)
4. Documents as vectors
- At the end of Lecture 3 we said
- Each doc d can now be viewed as a vector of wf-idf values, one component for each term
- So we have a vector space
- terms are axes
- docs live in this space
- Dimension is usually very large
5. Why turn docs into vectors?
- First application: query-by-example
- Given a doc d, find others like it.
- Now that d is a vector, find vectors (docs) near it.
- Natural setting for the bag-of-words model
- Dimension reduction
6. Intuition
[Figure: document vectors d1, ..., d5 plotted against term axes t1, t2, t3]
Postulate: Documents that are close together in the vector space talk about the same things.
7. Measuring Document Similarity
- Idea: the distance between d1 and d2 is the length of the vector d1 − d2 (i.e., the Euclidean distance).
- Why is this not a great idea?
- We still haven't dealt with the issue of length normalization
- Short documents would be more similar to each other by virtue of length, not topic
- However, we can implicitly normalize by looking at angles instead
8. Cosine similarity
- The distance between vectors d1 and d2 is captured by the cosine of the angle between them.
- Note: this is a similarity, not a distance
- No triangle inequality for similarity.
9. Cosine similarity
- A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm: ||d||_2 = sqrt(Σ_i d_i^2)
- This maps vectors onto the unit sphere
- Then, ||d||_2 = 1 for every normalized vector
- Longer documents don't get more weight
10. Cosine similarity
- Cosine of the angle between two vectors d1 and d2:
  sim(d1, d2) = (d1 · d2) / (||d1|| ||d2||) = Σ_i w_{i,1} w_{i,2} / ( sqrt(Σ_i w_{i,1}^2) · sqrt(Σ_i w_{i,2}^2) )
- The denominator involves the lengths of the vectors; this is the normalization.
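As a concrete illustration, here is a minimal Python/NumPy sketch of the cosine computation; the toy vocabulary size and weight values are invented for the example.

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    # Divide the dot product by both L2 norms (the normalization above).
    norm1 = np.linalg.norm(d1)
    norm2 = np.linalg.norm(d2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return float(np.dot(d1, d2) / (norm1 * norm2))

# Toy tf-idf vectors over a 4-term vocabulary (values are illustrative only).
doc1 = np.array([0.0, 2.3, 1.1, 0.0])
doc2 = np.array([0.0, 4.6, 2.2, 0.0])   # same direction as doc1, twice as long
doc3 = np.array([1.7, 0.0, 0.0, 0.9])

print(cosine_similarity(doc1, doc2))  # 1.0 -- length does not matter
print(cosine_similarity(doc1, doc3))  # 0.0 -- no terms in common
```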
11. Queries in the vector space model
- Central idea: the query as a vector
- We regard the query as a short document
- We return the documents ranked by the closeness of their vectors to the query, which is also represented as a vector.
- Note that the query vector d_q is very sparse!
12. Dimensionality reduction
- What if we could take our vectors and pack them into fewer dimensions (say from 50,000 to 100) while preserving distances?
- (Well, almost.)
- Speeds up cosine computations.
- Many possibilities, including:
- Random projection.
- Latent semantic indexing.
13. Random projection onto k << m axes
- Choose a random direction x1 in the vector space.
- For i = 2 to k,
- choose a random direction x_i that is orthogonal to x1, x2, ..., x_{i-1}.
- Project each document vector into the subspace spanned by x1, x2, ..., xk.
14. E.g., from 3 to 2 dimensions
[Figure: documents d1 and d2 projected from the (t1, t2, t3) axes onto the plane spanned by x1 and x2]
x1 is a random direction in (t1, t2, t3) space. x2 is chosen randomly but orthogonal to x1.
The dot product of x1 and x2 is zero.
15. Guarantee
- With high probability, relative distances are
(approximately) preserved by projection.
16. Computing the random projection
- Projecting n vectors from m dimensions down to k dimensions
- Start with the m × n matrix of terms × docs, A.
- Find a random k × m orthogonal projection matrix R.
- Compute the matrix product W = R · A (see the sketch below).
- The jth column of W is the vector corresponding to doc j, but now in k << m dimensions.
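A minimal NumPy sketch of this computation; here the orthonormal rows of R come from a QR factorization of a random matrix rather than explicit Gram-Schmidt, and the matrix sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 1000, 500, 50          # terms, docs, target dimensions (illustrative)
A = rng.random((m, n))           # m x n term-document matrix

# Build a k x m projection matrix R with orthonormal rows:
# QR-factorize a random m x k matrix and transpose the orthonormal factor.
Q, _ = np.linalg.qr(rng.standard_normal((m, k)))   # Q is m x k with orthonormal columns
R = Q.T                                            # k x m, orthonormal rows

W = R @ A                        # k x n: column j is doc j in k << m dimensions
print(W.shape)                   # (50, 500)
```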
17. Cost of computation
- This takes a total of kmn multiplications. (Why? R is k × m and A is m × n.)
- Expensive: see Resources for ways to do essentially the same thing, quicker.
- Other variations use a sparse random matrix, e.g.
- entries of R drawn from {+1, 0, -1} with probabilities 1/6, 2/3, 1/6 (sketched below).
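A sketch of the sparse variant; the sqrt(3/k) scaling is an Achlioptas-style choice not stated on the slide, included here as an assumption so that distances are roughly preserved in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_random_matrix(k, m):
    """k x m matrix with entries in {+1, 0, -1}, probabilities 1/6, 2/3, 1/6."""
    R = rng.choice([1.0, 0.0, -1.0], size=(k, m), p=[1/6, 2/3, 1/6])
    # Scaling by sqrt(3/k) (Achlioptas-style) roughly preserves distances in
    # expectation; the slide does not specify a scaling, so this is an assumption.
    return np.sqrt(3.0 / k) * R

m, n, k = 1000, 500, 50
A = rng.random((m, n))
W = sparse_random_matrix(k, m) @ A   # mostly-zero R makes this product cheap
print(W.shape)                       # (50, 500)
```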
18. Latent semantic indexing (LSI)
- Another technique for dimension reduction
- Random projection was data-independent
- LSI, on the other hand, is data-dependent
- Eliminate redundant axes
- Pull together "related" axes, hopefully
- e.g., car and automobile
19. Linear Algebra Background
20. Eigenvalues & Eigenvectors
- Eigenvectors (for a square m × m matrix S): a (right) eigenvector is a nonzero vector v satisfying S v = λ v, where the scalar λ is the corresponding eigenvalue.
- How many eigenvalues are there at most?
21. Eigenvalues & Eigenvectors
22. Example
- Let S be a real, symmetric 2 × 2 matrix.
- Then solve the characteristic equation det(S − λI) = 0.
- The eigenvalues are 1 and 3 (nonnegative, real).
- The eigenvectors are orthogonal (and real); plug the eigenvalues back in and solve for the eigenvectors.
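Since the example matrix itself is not reproduced above, the following NumPy check assumes the classic symmetric matrix [[2, 1], [1, 2]], which does have eigenvalues 1 and 3 and orthogonal eigenvectors; it is meant only to illustrate the computation.

```python
import numpy as np

# Assumed example: a real, symmetric matrix with eigenvalues 1 and 3.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is the routine for symmetric (Hermitian) matrices: real eigenvalues,
# orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(S)
print(eigenvalues)                       # [1. 3.]
print(eigenvectors)                      # columns are the (orthogonal) eigenvectors
print(eigenvectors.T @ eigenvectors)     # identity matrix, confirming orthogonality
```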
23. Eigen/diagonal Decomposition
- Let S be a square matrix with m linearly independent eigenvectors (a non-defective matrix).
- Theorem: there exists an eigen decomposition S = U Λ U^{-1} (cf. the matrix diagonalization theorem).
- Columns of U are eigenvectors of S.
- Diagonal elements of Λ are the eigenvalues of S.
- The decomposition is unique for distinct eigenvalues.
24. Diagonal decomposition: why/how
- Let U have the eigenvectors of S as its columns. Then S U = [S v1 ... S vm] = [λ1 v1 ... λm vm] = U Λ.
- Thus S U = U Λ, or U^{-1} S U = Λ.
- And S = U Λ U^{-1}.
25. Diagonal decomposition - example
- Recall the matrix S and its eigenvalues from the earlier example.
- The eigenvectors form the columns of U.
- Recall U U^{-1} = I.
- Inverting U, we obtain U^{-1}.
- Then, S = U Λ U^{-1}.
26. Example continued
- Let's divide each column of U by its length (and scale U^{-1} correspondingly), so that the eigenvectors have unit length.
- Then, S = Q Λ Q^{-1} with Q^{-1} = Q^T.
- Why? Stay tuned ...
27. Symmetric Eigen Decomposition
- If S is a symmetric matrix:
- Theorem: there exists a (unique) eigen decomposition S = Q Λ Q^T
- where Q is orthogonal:
- Q^{-1} = Q^T
- Columns of Q are normalized eigenvectors
- Columns are orthogonal.
- (everything is real)
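Continuing with the same assumed matrix [[2, 1], [1, 2]] from the earlier sketch, a quick numerical check of the symmetric decomposition S = Q Λ Q^T and of Q^{-1} = Q^T:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # assumed symmetric example from above

lam, Q = np.linalg.eigh(S)          # Q: normalized eigenvectors as columns
Lambda = np.diag(lam)

print(np.allclose(Q @ Lambda @ Q.T, S))        # True: S = Q Lambda Q^T
print(np.allclose(np.linalg.inv(Q), Q.T))      # True: Q^{-1} = Q^T (Q is orthogonal)
```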
28. Time out!
- "I came to this class to learn about text retrieval and mining, not to have my linear algebra past dredged up again ..."
- But if you want to dredge, Strang's Applied Mathematics is a good place to start.
- What do these matrices have to do with text?
- Recall the m × n term-document matrices ...
- But everything so far needs square matrices, so ...
29. Singular Value Decomposition
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:
A = U Σ V^T
The columns of U are orthogonal eigenvectors of A A^T.
The columns of V are orthogonal eigenvectors of A^T A.
The singular values are σ_i = sqrt(λ_i), where λ_1, ..., λ_r are the (common) eigenvalues of A A^T and A^T A, and Σ = diag(σ_1, ..., σ_r).
30. Singular Value Decomposition
- Illustration of SVD dimensions and sparseness: A (m × n) = U (m × r) · Σ (r × r, diagonal) · V^T (r × n).
31. SVD example
[Worked SVD of a small example matrix]
Typically, the singular values are arranged in decreasing order.
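A NumPy sketch on a small random matrix (shape chosen arbitrarily) confirming these properties: the singular values come back in decreasing order, A = U Σ V^T, and the σ_i² are the eigenvalues of A^T A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))                       # small m x n example (illustrative)

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(sigma)                                 # singular values, in decreasing order
print(np.allclose((U * sigma) @ Vt, A))      # True: A = U Sigma V^T

# Squares of the singular values are the eigenvalues of A^T A (and of A A^T).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigvalsh returns ascending order
print(np.allclose(sigma**2, eigvals))        # True
```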
32. Low-rank Approximation
- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: find A_k of rank k that minimizes the Frobenius norm ||A − X||_F over all matrices X of rank k.
- A_k and X are both m × n matrices.
- Typically, we want k << r.
33. Low-rank Approximation
- Solution via SVD: set the smallest r − k singular values to zero, i.e. A_k = U diag(σ_1, ..., σ_k, 0, ..., 0) V^T.
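A sketch of this truncation in NumPy, using an arbitrary random matrix; it also verifies the Frobenius-error identity discussed on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 40))                     # illustrative m x n matrix
k = 5

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

sigma_k = sigma.copy()
sigma_k[k:] = 0.0                            # keep only the k largest singular values
A_k = (U * sigma_k) @ Vt                     # best rank-k approximation of A

print(np.linalg.matrix_rank(A_k))            # k
# Frobenius error equals the root of the sum of squared discarded singular values.
print(np.linalg.norm(A - A_k, 'fro'))
print(np.sqrt(np.sum(sigma[k:] ** 2)))       # same value
```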
34. Approximation error
- How good (bad) is this approximation?
- It's the best possible, measured by the Frobenius norm of the error: min over rank-k X of ||A − X||_F = ||A − A_k||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2)
- where the σ_i are ordered such that σ_i ≥ σ_{i+1}.
- Suggests why the Frobenius error drops as k is increased.
35. SVD Low-rank approximation
- Whereas the term-doc matrix A may have m = 50,000, n = 10 million (and rank close to 50,000),
- we can construct an approximation A_100 with rank 100.
- Of all rank-100 matrices, it would have the lowest Frobenius error.
- Great ... but why would we?
- Answer: Latent Semantic Indexing
C. Eckart, G. Young, "The approximation of a matrix by another of lower rank." Psychometrika, 1, 211-218, 1936.
36. Latent Semantic Analysis via SVD
37. What it is
- From the term-doc matrix A, we compute the approximation A_k.
- There is a row for each term and a column for each doc in A_k.
- Thus docs live in a space of k << r dimensions
- These dimensions are not the original axes
- But why?
38. Vector Space Model: Pros
- Automatic selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions
- Document clustering
- Relevance feedback (modifying the query vector)
- Geometric foundation
39. Problems with Lexical Semantics
- Ambiguity and association in natural language
- Polysemy: words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).
- The vector space model is unable to discriminate between different meanings of the same word.
40. Problems with Lexical Semantics
- Synonymy: different terms may have an identical or similar meaning (weaker: words indicating the same topic).
- No associations between words are made in the vector space representation.
41. Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the document-term matrix (typical rank: 100-300)
- General idea:
- Map documents (and terms) to a low-dimensional representation.
- Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
- Compute document similarity based on the inner product in this latent semantic space
42. Goals of LSI
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
43. Latent Semantic Analysis
- Latent semantic space: illustrating example (courtesy of Susan Dumais)
44. Performing the maps
- Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
- Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.
- A query q is also mapped into this space, by q_k = Σ_k^{-1} U_k^T q.
- In this space, the query is NOT a sparse vector.
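A small end-to-end sketch under invented toy data: build a tiny term-document matrix, take a rank-k SVD, fold the query in via q_k = Σ_k^{-1} U_k^T q, and rank documents by cosine similarity in the latent space. The matrix, vocabulary, value of k, and query are all assumptions made for illustration.

```python
import numpy as np

# Toy term-document count matrix A (rows = terms, columns = docs); values invented.
A = np.array([
    [1, 0, 1, 0, 0],   # "car"
    [0, 1, 1, 0, 0],   # "automobile"
    [1, 1, 0, 0, 0],   # "engine"
    [0, 0, 0, 1, 1],   # "banana"
    [0, 0, 0, 1, 0],   # "fruit"
], dtype=float)

k = 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

docs_k = (np.diag(sigma_k) @ Vt_k).T         # each row: a doc in the k-dim LSI space

# A query containing only "car" -- in the reduced space it is no longer sparse.
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_k = np.diag(1.0 / sigma_k) @ U_k.T @ q     # q_k = Sigma_k^{-1} U_k^T q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(q_k, d) for d in docs_k]
print(np.argsort(scores)[::-1])              # car/automobile docs (0-2) rank first
```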
45. Empirical evidence
- Experiments on TREC 1/2/3 (Dumais)
- Lanczos SVD code (available on netlib), due to Berry, was used in these experiments
- Running times of about one day on tens of thousands of docs (old data)
- Dimensions: various values in the range 250-350 reported
- (Under 200 reported unsatisfactory)
- Generally expect recall to improve; what about precision?
46. Empirical evidence
- Precision at or above median TREC precision
- Top scorer on almost 20% of TREC topics
- Slightly better on average than straight vector spaces
- Effect of dimensionality:

    Dimensions    Precision
    250           0.367
    300           0.371
    346           0.374
47. Some wild extrapolation
- The "dimensionality" of a corpus is the number of distinct topics represented in it.
- More mathematical wild extrapolation:
- If A has a rank k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus. ("Latent semantic indexing: A probabilistic analysis")
48. LSI has many other applications
- In many settings in pattern recognition and retrieval, we have a feature-object matrix.
- For text, the terms are features and the docs are objects.
- Could also be, e.g., opinions and users.
- This matrix may be redundant in dimensionality.
- Can work with a low-rank approximation.
- If entries are missing (e.g., users' opinions), we can recover them if the dimensionality is low.
- Powerful general analytical technique
- Close, principled analog to clustering methods.