Title: Web Search and Data Mining
1. Web Search and Data Mining
- Lecture 4
- Adapted from Manning, Raghavan and Schuetze
2. Recap of the last lecture
- MapReduce and distributed indexing
- Scoring documents: linear combination / zone weighting
- tf-idf term weighting and vector spaces
- Derivation of idf
3. This lecture
- Vector space models
- Dimension reduction: random projection
- Review of linear algebra
- Latent semantic indexing (LSI)
4. Documents as vectors
- At the end of Lecture 3 we said
- Each doc d can now be viewed as a vector of wf-idf values, one component for each term
- So we have a vector space
- terms are axes
- docs live in this space
- Dimension is usually very large
5. Why turn docs into vectors?
- First application: query-by-example
- Given a doc d, find others like it.
- Now that d is a vector, find vectors (docs) near it.
- Natural setting for the bag-of-words model
- Dimension reduction
6. Intuition
[Figure: document vectors d1, ..., d5 plotted against term axes t1, t2, t3]
Postulate: Documents that are close together in the vector space talk about the same things.
7. Measuring Document Similarity
- Idea: the distance between d1 and d2 is the length of the vector d1 − d2 (i.e., the Euclidean distance).
- Why is this not a great idea?
- We still haven't dealt with the issue of length normalization
- Short documents would be more similar to each other by virtue of length, not topic
- However, we can implicitly normalize by looking at angles instead
8. Cosine similarity
- The distance between vectors d1 and d2 is captured by the cosine of the angle between them.
- Note: this is a similarity, not a distance
- No triangle inequality for similarity.
9. Cosine similarity
- A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm: ||d||_2 = sqrt(Σ_i d_i^2)
- This maps vectors onto the unit sphere
- Then, ||d||_2 = 1 for every normalized vector
- Longer documents don't get more weight
10. Cosine similarity
- Cosine of the angle between two vectors d1 and d2:
  sim(d1, d2) = (d1 · d2) / (||d1|| ||d2||) = Σ_i w_{i,1} w_{i,2} / ( sqrt(Σ_i w_{i,1}^2) · sqrt(Σ_i w_{i,2}^2) )
- The denominator involves the lengths of the vectors; this is the normalization.
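As a concrete illustration, here is a minimal Python/NumPy sketch of the cosine computation; the toy vocabulary size and weight values are invented for the example.

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    # Divide the dot product by both L2 norms (the normalization above).
    norm1 = np.linalg.norm(d1)
    norm2 = np.linalg.norm(d2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return float(np.dot(d1, d2) / (norm1 * norm2))

# Toy tf-idf vectors over a 4-term vocabulary (values are illustrative only).
doc1 = np.array([0.0, 2.3, 1.1, 0.0])
doc2 = np.array([0.0, 4.6, 2.2, 0.0])   # same direction as doc1, twice as long
doc3 = np.array([1.7, 0.0, 0.0, 0.9])

print(cosine_similarity(doc1, doc2))  # 1.0 -- length does not matter
print(cosine_similarity(doc1, doc3))  # 0.0 -- no terms in common
```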
11. Queries in the vector space model
- Central idea: the query as a vector
- We regard the query as a short document
- We return the documents ranked by the closeness of their vectors to the query, which is also represented as a vector.
- Note that the query vector d_q is very sparse!
12. Dimensionality reduction
- What if we could take our vectors and pack them into fewer dimensions (say from 50,000 to 100) while preserving distances?
- (Well, almost.)
- Speeds up cosine computations.
- Many possibilities, including:
- Random projection.
- Latent semantic indexing.
13. Random projection onto k << m axes
- Choose a random direction x1 in the vector space.
- For i = 2 to k,
- choose a random direction x_i that is orthogonal to x1, x2, ..., x_{i-1}.
- Project each document vector into the subspace spanned by x1, x2, ..., xk.
14. E.g., from 3 to 2 dimensions
[Figure: documents d1 and d2 projected from the (t1, t2, t3) axes onto the plane spanned by x1 and x2]
x1 is a random direction in (t1, t2, t3) space. x2 is chosen randomly but orthogonal to x1.
The dot product of x1 and x2 is zero.
15. Guarantee
- With high probability, relative distances are
(approximately) preserved by projection.
16. Computing the random projection
- Projecting n vectors from m dimensions down to k dimensions
- Start with the m × n matrix of terms × docs, A.
- Find a random k × m orthogonal projection matrix R.
- Compute the matrix product W = R · A (see the sketch below).
- The jth column of W is the vector corresponding to doc j, but now in k << m dimensions.
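A minimal NumPy sketch of this computation; here the orthonormal rows of R come from a QR factorization of a random matrix rather than explicit Gram-Schmidt, and the matrix sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, k = 1000, 500, 50          # terms, docs, target dimensions (illustrative)
A = rng.random((m, n))           # m x n term-document matrix

# Build a k x m projection matrix R with orthonormal rows:
# QR-factorize a random m x k matrix and transpose the orthonormal factor.
Q, _ = np.linalg.qr(rng.standard_normal((m, k)))   # Q is m x k with orthonormal columns
R = Q.T                                            # k x m, orthonormal rows

W = R @ A                        # k x n: column j is doc j in k << m dimensions
print(W.shape)                   # (50, 500)
```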
17. Cost of computation
- This takes a total of kmn multiplications. (Why? R is k × m and A is m × n.)
- Expensive: see Resources for ways to do essentially the same thing, quicker.
- Other variations use a sparse random matrix, e.g.
- entries of R drawn from {+1, 0, -1} with probabilities 1/6, 2/3, 1/6 (sketched below).
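A sketch of the sparse variant; the sqrt(3/k) scaling is an Achlioptas-style choice not stated on the slide, included here as an assumption so that distances are roughly preserved in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_random_matrix(k, m):
    """k x m matrix with entries in {+1, 0, -1}, probabilities 1/6, 2/3, 1/6."""
    R = rng.choice([1.0, 0.0, -1.0], size=(k, m), p=[1/6, 2/3, 1/6])
    # Scaling by sqrt(3/k) (Achlioptas-style) roughly preserves distances in
    # expectation; the slide does not specify a scaling, so this is an assumption.
    return np.sqrt(3.0 / k) * R

m, n, k = 1000, 500, 50
A = rng.random((m, n))
W = sparse_random_matrix(k, m) @ A   # mostly-zero R makes this product cheap
print(W.shape)                       # (50, 500)
```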
18. Latent semantic indexing (LSI)
- Another technique for dimension reduction
- Random projection was data-independent
- LSI, on the other hand, is data-dependent
- Eliminate redundant axes
- Pull together "related" axes, hopefully
- e.g., car and automobile
19. Linear Algebra Background
20. Eigenvalues & Eigenvectors
- Eigenvectors (for a square m × m matrix S): a (right) eigenvector is a nonzero vector v satisfying S v = λ v, where the scalar λ is the corresponding eigenvalue.
- How many eigenvalues are there at most?
21. Eigenvalues & Eigenvectors
22. Example
- Let S be a real, symmetric 2 × 2 matrix.
- Then solve the characteristic equation det(S − λI) = 0.
- The eigenvalues are 1 and 3 (nonnegative, real).
- The eigenvectors are orthogonal (and real); plug the eigenvalues back in and solve for the eigenvectors.
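Since the example matrix itself is not reproduced above, the following NumPy check assumes the classic symmetric matrix [[2, 1], [1, 2]], which does have eigenvalues 1 and 3 and orthogonal eigenvectors; it is meant only to illustrate the computation.

```python
import numpy as np

# Assumed example: a real, symmetric matrix with eigenvalues 1 and 3.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is the routine for symmetric (Hermitian) matrices: real eigenvalues,
# orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(S)
print(eigenvalues)                       # [1. 3.]
print(eigenvectors)                      # columns are the (orthogonal) eigenvectors
print(eigenvectors.T @ eigenvectors)     # identity matrix, confirming orthogonality
```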
23. Eigen/diagonal Decomposition
- Let S be a square matrix with m linearly independent eigenvectors (a non-defective matrix).
- Theorem: there exists an eigen decomposition S = U Λ U^{-1} (cf. the matrix diagonalization theorem).
- Columns of U are eigenvectors of S.
- Diagonal elements of Λ are the eigenvalues of S.
- The decomposition is unique for distinct eigenvalues.
24. Diagonal decomposition: why/how
- Let U have the eigenvectors of S as its columns. Then S U = [S v1 ... S vm] = [λ1 v1 ... λm vm] = U Λ.
- Thus S U = U Λ, or U^{-1} S U = Λ.
- And S = U Λ U^{-1}.
25. Diagonal decomposition - example
- Recall the matrix S and its eigenvalues from the earlier example.
- The eigenvectors form the columns of U.
- Recall U U^{-1} = I.
- Inverting U, we obtain U^{-1}.
- Then, S = U Λ U^{-1}.
26. Example continued
- Let's divide each column of U by its length (and scale U^{-1} correspondingly), so that the eigenvectors have unit length.
- Then, S = Q Λ Q^{-1} with Q^{-1} = Q^T.
- Why? Stay tuned ...
27. Symmetric Eigen Decomposition
- If S is a symmetric matrix:
- Theorem: there exists a (unique) eigen decomposition S = Q Λ Q^T
- where Q is orthogonal:
- Q^{-1} = Q^T
- Columns of Q are normalized eigenvectors
- Columns are orthogonal.
- (everything is real)
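Continuing with the same assumed matrix [[2, 1], [1, 2]] from the earlier sketch, a quick numerical check of the symmetric decomposition S = Q Λ Q^T and of Q^{-1} = Q^T:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # assumed symmetric example from above

lam, Q = np.linalg.eigh(S)          # Q: normalized eigenvectors as columns
Lambda = np.diag(lam)

print(np.allclose(Q @ Lambda @ Q.T, S))        # True: S = Q Lambda Q^T
print(np.allclose(np.linalg.inv(Q), Q.T))      # True: Q^{-1} = Q^T (Q is orthogonal)
```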
28. Time out!
- "I came to this class to learn about text retrieval and mining, not to have my linear algebra past dredged up again ..."
- But if you want to dredge, Strang's Applied Mathematics is a good place to start.
- What do these matrices have to do with text?
- Recall the m × n term-document matrices ...
- But everything so far needs square matrices, so ...
29. Singular Value Decomposition
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:
A = U Σ V^T
The columns of U are orthogonal eigenvectors of A A^T.
The columns of V are orthogonal eigenvectors of A^T A.
The singular values are σ_i = sqrt(λ_i), where λ_1, ..., λ_r are the (common) eigenvalues of A A^T and A^T A, and Σ = diag(σ_1, ..., σ_r).
30. Singular Value Decomposition
- Illustration of SVD dimensions and sparseness: A (m × n) = U (m × r) · Σ (r × r, diagonal) · V^T (r × n).
31. SVD example
[Worked SVD of a small example matrix]
Typically, the singular values are arranged in decreasing order.
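A NumPy sketch on a small random matrix (shape chosen arbitrarily) confirming these properties: the singular values come back in decreasing order, A = U Σ V^T, and the σ_i² are the eigenvalues of A^T A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))                       # small m x n example (illustrative)

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(sigma)                                 # singular values, in decreasing order
print(np.allclose((U * sigma) @ Vt, A))      # True: A = U Sigma V^T

# Squares of the singular values are the eigenvalues of A^T A (and of A A^T).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigvalsh returns ascending order
print(np.allclose(sigma**2, eigvals))        # True
```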
32. Low-rank Approximation
- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: find A_k of rank k that minimizes the Frobenius norm ||A − X||_F over all matrices X of rank k.
- A_k and X are both m × n matrices.
- Typically, we want k << r.
33. Low-rank Approximation
- Solution via SVD: set the smallest r − k singular values to zero, i.e. A_k = U diag(σ_1, ..., σ_k, 0, ..., 0) V^T.
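A sketch of this truncation in NumPy, using an arbitrary random matrix; it also verifies the Frobenius-error identity discussed on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 40))                     # illustrative m x n matrix
k = 5

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

sigma_k = sigma.copy()
sigma_k[k:] = 0.0                            # keep only the k largest singular values
A_k = (U * sigma_k) @ Vt                     # best rank-k approximation of A

print(np.linalg.matrix_rank(A_k))            # k
# Frobenius error equals the root of the sum of squared discarded singular values.
print(np.linalg.norm(A - A_k, 'fro'))
print(np.sqrt(np.sum(sigma[k:] ** 2)))       # same value
```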
34. Approximation error
- How good (bad) is this approximation?
- It's the best possible, measured by the Frobenius norm of the error: min over rank-k X of ||A − X||_F = ||A − A_k||_F = sqrt(σ_{k+1}^2 + ... + σ_r^2)
- where the σ_i are ordered such that σ_i ≥ σ_{i+1}.
- Suggests why the Frobenius error drops as k is increased.
35. SVD Low-rank approximation
- Whereas the term-doc matrix A may have m = 50,000, n = 10 million (and rank close to 50,000),
- we can construct an approximation A_100 with rank 100.
- Of all rank-100 matrices, it would have the lowest Frobenius error.
- Great ... but why would we?
- Answer: Latent Semantic Indexing
C. Eckart, G. Young, "The approximation of a matrix by another of lower rank." Psychometrika, 1, 211-218, 1936.
36. Latent Semantic Analysis via SVD
37. What it is
- From the term-doc matrix A, we compute the approximation A_k.
- There is a row for each term and a column for each doc in A_k.
- Thus docs live in a space of k << r dimensions
- These dimensions are not the original axes
- But why?
38. Vector Space Model: Pros
- Automatic selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions
- Document clustering
- Relevance feedback (modifying the query vector)
- Geometric foundation
39. Problems with Lexical Semantics
- Ambiguity and association in natural language
- Polysemy: words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections).
- The vector space model is unable to discriminate between different meanings of the same word.
40. Problems with Lexical Semantics
- Synonymy: different terms may have an identical or similar meaning (weaker: words indicating the same topic).
- No associations between words are made in the vector space representation.
41. Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the document-term matrix (typical rank: 100-300)
- General idea:
- Map documents (and terms) to a low-dimensional representation.
- Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
- Compute document similarity based on the inner product in this latent semantic space
42. Goals of LSI
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
43. Latent Semantic Analysis
- Latent semantic space: illustrating example (courtesy of Susan Dumais)
44. Performing the maps
- Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
- Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.
- A query q is also mapped into this space, by q_k = Σ_k^{-1} U_k^T q.
- In this space, the query is NOT a sparse vector.
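A small end-to-end sketch under invented toy data: build a tiny term-document matrix, take a rank-k SVD, fold the query in via q_k = Σ_k^{-1} U_k^T q, and rank documents by cosine similarity in the latent space. The matrix, vocabulary, value of k, and query are all assumptions made for illustration.

```python
import numpy as np

# Toy term-document count matrix A (rows = terms, columns = docs); values invented.
A = np.array([
    [1, 0, 1, 0, 0],   # "car"
    [0, 1, 1, 0, 0],   # "automobile"
    [1, 1, 0, 0, 0],   # "engine"
    [0, 0, 0, 1, 1],   # "banana"
    [0, 0, 0, 1, 0],   # "fruit"
], dtype=float)

k = 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]

docs_k = (np.diag(sigma_k) @ Vt_k).T         # each row: a doc in the k-dim LSI space

# A query containing only "car" -- in the reduced space it is no longer sparse.
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_k = np.diag(1.0 / sigma_k) @ U_k.T @ q     # q_k = Sigma_k^{-1} U_k^T q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(q_k, d) for d in docs_k]
print(np.argsort(scores)[::-1])              # car/automobile docs (0-2) rank first
```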
45. Empirical evidence
- Experiments on TREC 1/2/3 (Dumais)
- Lanczos SVD code (available on netlib), due to Berry, was used in these experiments
- Running times of about one day on tens of thousands of docs (old data)
- Dimensions: various values in the range 250-350 reported
- (Under 200 reported unsatisfactory)
- Generally expect recall to improve; what about precision?
46. Empirical evidence
- Precision at or above median TREC precision
- Top scorer on almost 20% of TREC topics
- Slightly better on average than straight vector spaces
- Effect of dimensionality:

    Dimensions    Precision
    250           0.367
    300           0.371
    346           0.374
47. Some wild extrapolation
- The "dimensionality" of a corpus is the number of distinct topics represented in it.
- More mathematical wild extrapolation:
- If A has a rank k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus. ("Latent semantic indexing: A probabilistic analysis")
48. LSI has many other applications
- In many settings in pattern recognition and retrieval, we have a feature-object matrix.
- For text, the terms are features and the docs are objects.
- Could also be, e.g., opinions and users.
- This matrix may be redundant in dimensionality.
- Can work with a low-rank approximation.
- If entries are missing (e.g., users' opinions), we can recover them if the dimensionality is low.
- Powerful general analytical technique
- Close, principled analog to clustering methods.