1
Web Search and Data Mining
  • Lecture 4
  • Adapted from Manning, Raghavan and Schuetze

2
Recap of the last lecture
  • MapReduce and distributed indexing
  • Scoring documents: linear combination / zone weighting
  • tf-idf term weighting and vector spaces
  • Derivation of idf

3
This lecture
  • Vector space models
  • Dimension reduction: random projection
  • Review of linear algebra
  • Latent semantic indexing (LSI)

4
Documents as vectors
  • At the end of Lecture 3 we said:
  • Each doc d can now be viewed as a vector of
    wf-idf values, one component for each term
  • So we have a vector space
  • terms are axes
  • docs live in this space
  • Dimension is usually very large

5
Why turn docs into vectors?
  • First application: query-by-example
  • Given a doc d, find others like it.
  • Now that d is a vector, find vectors (docs)
    near it.
  • Natural setting for bag of words model
  • Dimension reduction

6
Intuition
[Figure: documents d1-d5 shown as vectors in a space with term axes t1, t2, t3]
Postulate: Documents that are close together
in the vector space talk about the same things.
7
Measuring Document Similarity
  • Idea: Distance between d1 and d2 is the length of
    the vector d1 - d2.
  • Euclidean distance
  • Why is this not a great idea?
  • We still haven't dealt with the issue of length
    normalization
  • Short documents would be more similar to each
    other by virtue of length, not topic
  • However, we can implicitly normalize by looking
    at angles instead

8
Cosine similarity
  • Distance between vectors d1 and d2 captured by
    the cosine of the angle x between them.
  • Note: this is similarity, not distance
  • No triangle inequality for similarity.

9
Cosine similarity
  • A vector can be normalized (given a length of 1)
    by dividing each of its components by its length;
    here we use the L2 norm
  • This maps vectors onto the unit sphere
  • Then, longer documents don't get more weight

10
Cosine similarity
  • Cosine of the angle between two vectors d1 and d2:
    sim(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
  • The denominator involves the lengths of the
    vectors; it performs the length normalization.
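A minimal cosine-similarity sketch in Python/NumPy (illustrative only: the function name and the toy tf-idf-style vectors are assumptions, not taken from the lecture):

  import numpy as np

  def cosine_similarity(d1, d2):
      # Cosine of the angle between two term-weight vectors.
      # Dividing by the L2 norms performs the length normalization.
      return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

  # Toy vectors over a 4-term vocabulary (made-up weights).
  doc1 = np.array([0.0, 2.3, 1.1, 0.0])
  doc2 = np.array([0.5, 1.9, 0.0, 0.7])
  print(cosine_similarity(doc1, doc2))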
11
Queries in the vector space model
  • Central idea: the query as a vector
  • We regard the query as a short document
  • We return the documents ranked by the closeness
    of their vectors to the query, also represented
    as a vector.
  • Note that dq is very sparse!

12
Dimensionality reduction
  • What if we could take our vectors and pack them
    into fewer dimensions (say 50,000 to 100) while
    preserving distances?
  • (Well, almost.)
  • Speeds up cosine computations.
  • Many possibilities, including:
  • Random projection.
  • Latent semantic indexing.

13
Random projection onto k << m axes
  • Choose a random direction x1 in the vector space.
  • For i = 2 to k,
  • Choose a random direction xi that is orthogonal
    to x1, x2, ..., xi-1.
  • Project each document vector into the subspace
    spanned by x1, x2, ..., xk.

14
E.g., from 3 to 2 dimensions
[Figure: document vectors d1 and d2 projected from the 3-dimensional term space
(t1, t2, t3) onto the 2-dimensional subspace spanned by x1 and x2]
x1 is a random direction in (t1, t2, t3) space. x2
is chosen randomly but orthogonal to x1:
the dot product of x1 and x2 is zero.
15
Guarantee
  • With high probability, relative distances are
    (approximately) preserved by projection.

16
Computing the random projection
  • Projecting n vectors from m dimensions down to k
    dimensions
  • Start with the m × n matrix of terms × docs, A.
  • Find a random k × m orthogonal projection matrix R.
  • Compute the matrix product W = R × A.
  • jth column of W is the vector corresponding to
    doc j, but now in k << m dimensions.
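A minimal NumPy sketch of this projection step (illustrative: the sizes are toy values, and obtaining an orthogonal R from the QR factorization of a Gaussian matrix is one common construction, not necessarily the lecture's):

  import numpy as np

  m, n, k = 1000, 200, 20      # terms, docs, reduced dimensions (toy sizes)
  A = np.random.rand(m, n)     # stand-in for the m x n term-document matrix

  # Build a k x m matrix R with orthonormal rows: take the Q factor of a
  # random m x k Gaussian matrix and transpose it.
  G = np.random.randn(m, k)
  Q, _ = np.linalg.qr(G)       # Q is m x k with orthonormal columns
  R = Q.T                      # k x m projection matrix

  W = R @ A                    # k x n: column j is doc j in k << m dimensions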

17
Cost of computation
  • Why? Computing W = R × A multiplies a k × m matrix
    by an m × n matrix, for a total of kmn
    multiplications.
  • Expensive: see Resources for ways to do
    essentially the same thing, quicker.
  • Other variations use a sparse random matrix, with
    entries of R drawn from {-1, 0, +1} with
    probabilities 1/6, 2/3, 1/6 (see the sketch below).

18
Latent semantic indexing (LSI)
  • Another technique for dimension reduction
  • Random projection was data-independent
  • LSI on the other hand is data-dependent
  • Eliminate redundant axes
  • Pull together related axes, hopefully
  • e.g., car and automobile

19
Linear Algebra Background
20
Eigenvalues & Eigenvectors
  • Eigenvectors (for a square m × m matrix S):
    Sv = λv, where λ is the eigenvalue and v is the
    (right) eigenvector
  • How many eigenvalues are there at most?
21
Eigenvalues & Eigenvectors
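Written out, the standard statements behind these two slides (supplied here as ordinary linear algebra, since the slide's formula images are not part of the transcript):

  S v = \lambda v, \qquad v \neq 0
  \det(S - \lambda I_m) = 0

The eigenvalues are the roots of the characteristic polynomial \det(S - \lambda I_m) = 0, which has degree m, so an m × m matrix has at most m eigenvalues; that answers the question above.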
22
Example
  • Let S be a real, symmetric matrix.
  • Then the eigenvalues are 1 and 3 (nonnegative, real).
  • The eigenvectors are orthogonal (and real).
  • Plug in these values and solve for the eigenvectors.
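A NumPy check of this kind of example. The matrix [[2, 1], [1, 2]] below is an assumed stand-in (the slide's own matrix is not reproduced in the transcript); it is real and symmetric, with eigenvalues 1 and 3 and orthogonal eigenvectors:

  import numpy as np

  S = np.array([[2.0, 1.0],
                [1.0, 2.0]])       # real, symmetric (assumed example matrix)

  # eigh is specialized for symmetric matrices: the eigenvalues come back
  # real and the eigenvectors (columns of Q) are orthonormal.
  eigenvalues, Q = np.linalg.eigh(S)
  print(eigenvalues)               # [1. 3.]
  print(Q.T @ Q)                   # identity matrix: eigenvectors are orthogonal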
23
Eigen/diagonal Decomposition
  • Let S be a square matrix with m
    linearly independent eigenvectors (a
    non-defective matrix)
  • Theorem: there exists an eigen decomposition
    S = U Λ U^-1 (cf. matrix diagonalization theorem),
    unique for distinct eigenvalues
  • Columns of U are eigenvectors of S
  • Diagonal elements of Λ are the eigenvalues of S
24
Diagonal decomposition: why/how
Write the eigenvectors of S as the columns of U. Then column i of SU is
S u_i = λ_i u_i, which is column i of UΛ.
Thus SU = UΛ, or U^-1 S U = Λ.
And S = U Λ U^-1.
25
Diagonal decomposition - example
Recall the real, symmetric example above, with eigenvalues 1 and 3.
Its eigenvectors form the columns of U.
Recall UU^-1 = 1 (the identity). Inverting, we have U^-1.
Then, S = U Λ U^-1.
26
Example continued
Let's divide U (and multiply U^-1) by the lengths of the eigenvectors, so
that the columns of U become unit vectors.
Then S = Q Λ Q^-1, where the columns of Q are orthonormal and Q^-1 = Q^T.
Why? Stay tuned.
27
Symmetric Eigen Decomposition
  • If S is a symmetric matrix
  • Theorem: there exists a (unique) eigen
    decomposition S = Q Λ Q^T
  • where Q is orthogonal
  • Q^-1 = Q^T
  • Columns of Q are normalized eigenvectors
  • Columns are orthogonal.
  • (everything is real)

28
Time out!
  • I came to this class to learn about text
    retrieval and mining, not have my linear algebra
    past dredged up again
  • But if you want to dredge, Strang's Applied
    Mathematics is a good place to start.
  • What do these matrices have to do with text?
  • Recall the m × n term-document matrices
  • But everything so far needs square matrices, so ...

29
Singular Value Decomposition
For an m × n matrix A of rank r there exists a
factorization (Singular Value Decomposition, or
SVD) as follows: A = U Σ V^T, with Σ a diagonal
matrix of singular values.
The columns of U are orthogonal eigenvectors of
AA^T.
The columns of V are orthogonal eigenvectors of
A^TA.
30
Singular Value Decomposition
  • Illustration of SVD dimensions and sparseness
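Since the illustration itself is not reproduced here, the dimensions can be written out as the standard reduced-SVD shapes:

  A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^{T}_{r \times n},
  \qquad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r),
  \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0

U and V^T are dense, while Σ is diagonal, which is where the sparseness lies.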

31
SVD example
Let A be a small example matrix.
Typically, the singular values are arranged in
decreasing order.
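A quick illustration in NumPy (the matrix is an arbitrary stand-in, not the slide's example):

  import numpy as np

  A = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])     # arbitrary 2 x 3 example

  U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
  print(singular_values)              # returned in decreasing order
  # Reassembling the factors reproduces A (up to rounding):
  print(np.allclose(U @ np.diag(singular_values) @ Vt, A))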
32
Low-rank Approximation
  • SVD can be used to compute optimal low-rank
    approximations.
  • Approximation problem: find Ak of rank k such
    that the Frobenius error ||A - X||_F is minimized
    at X = Ak, over all rank-k matrices X.
  • Ak and X are both m × n matrices.
  • Typically, want k << r.

33
Low-rank Approximation
  • Solution via SVD: keep the k largest singular
    values and set the smallest r - k singular values
    to zero, i.e. Ak = U diag(σ1, ..., σk, 0, ..., 0) V^T.
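A minimal NumPy sketch of this truncation (the matrix and the choice of k are arbitrary illustrations):

  import numpy as np

  def low_rank_approximation(A, k):
      # Best rank-k approximation of A in the Frobenius norm, via the SVD.
      U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
      sigma[k:] = 0.0                  # zero out the smallest r - k singular values
      return U @ np.diag(sigma) @ Vt

  A = np.random.rand(50, 30)
  A2 = low_rank_approximation(A, 2)
  print(np.linalg.matrix_rank(A2))     # 2
  print(np.linalg.norm(A - A2))        # Frobenius norm of the error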
34
Approximation error
  • How good (bad) is this approximation?
  • It's the best possible, measured by the Frobenius
    norm of the error (the bound is written out below),
  • where the σi are ordered such that σi ≥ σi+1.
  • Suggests why the Frobenius error drops as k is
    increased.
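The bound referred to above, written out as the standard Eckart-Young statement for the Frobenius norm (the slide's own formula image is not reproduced here):

  \min_{X : \operatorname{rank}(X) = k} \|A - X\|_F
    = \|A - A_k\|_F
    = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}

Since the σi are in decreasing order, the right-hand side shrinks as k grows, which is why the Frobenius error drops as k is increased.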

35
SVD Low-rank approximation
  • Whereas the term-doc matrix A may have m = 50,000,
    n = 10 million (and rank close to 50,000)
  • We can construct an approximation A100 with rank
    100.
  • Of all rank-100 matrices, it would have the
    lowest Frobenius error.
  • Great, but why would we?
  • Answer: Latent Semantic Indexing

C. Eckart, G. Young, The approximation of a
matrix by another of lower rank. Psychometrika,
1, 211-218, 1936.
36
Latent Semantic Analysis via SVD
37
What it is
  • From term-doc matrix A, we compute the
    approximation Ak.
  • There is a row for each term and a column for
    each doc in Ak
  • Thus docs live in a space of k << r dimensions
  • These dimensions are not the original axes
  • But why?

38
Vector Space Model Pros
  • Automatic selection of index terms
  • Partial matching of queries and documents
    (dealing with the case where no document contains
    all search terms)
  • Ranking according to similarity score (dealing
    with large result sets)
  • Term weighting schemes (improves retrieval
    performance)
  • Various extensions
  • Document clustering
  • Relevance feedback (modifying query vector)
  • Geometric foundation

39
Problems with Lexical Semantics
  • Ambiguity and association in natural language
  • Polysemy: Words often have a multitude of
    meanings and different types of usage (more
    severe in very heterogeneous collections).
  • The vector space model is unable to discriminate
    between different meanings of the same word.

40
Problems with Lexical Semantics
  • Synonymy: Different terms may have identical or a
    similar meaning (weaker: words indicating the
    same topic).
  • No associations between words are made in the
    vector space representation.

41
Latent Semantic Indexing (LSI)
  • Perform a low-rank approximation of the
    document-term matrix (typical rank 100-300)
  • General idea
  • Map documents (and terms) to a low-dimensional
    representation.
  • Design a mapping such that the low-dimensional
    space reflects semantic associations (latent
    semantic space).
  • Compute document similarity based on the inner
    product in this latent semantic space

42
Goals of LSI
  • Similar terms map to similar location in low
    dimensional space
  • Noise reduction by dimension reduction

43
Latent Semantic Analysis
  • Latent semantic space: illustrating example

courtesy of Susan Dumais
44
Performing the maps
  • Each row and column of A gets mapped into the
    k-dimensional LSI space, by the SVD.
  • Claim: this is not only the mapping with the
    best (Frobenius error) approximation to A, but in
    fact improves retrieval.
  • A query q is also mapped into this space (see the
    sketch below).
  • The mapped query is NOT a sparse vector.
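A minimal NumPy sketch of the mapping. The query-folding formula q_k = Σ_k^{-1} U_k^T q used below is the usual LSI form; the slide's own formula is not reproduced in the transcript, so treat it as an assumption:

  import numpy as np

  def lsi_map(A, k):
      # Map the columns (docs) of the term-doc matrix A into k dimensions.
      U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
      U_k, sigma_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]
      docs_k = np.diag(sigma_k) @ Vt_k     # column j = doc j in the LSI space
      return U_k, sigma_k, docs_k

  def fold_in_query(q, U_k, sigma_k):
      # q_k = Sigma_k^{-1} U_k^T q : a dense k-dimensional query vector.
      return np.diag(1.0 / sigma_k) @ U_k.T @ q

  # Toy usage with made-up sizes: 1000 terms, 200 docs, k = 50.
  A = np.random.rand(1000, 200)
  U_k, sigma_k, docs_k = lsi_map(A, 50)
  q = np.zeros(1000)
  q[[3, 17, 42]] = 1.0                     # sparse query over three terms
  q_k = fold_in_query(q, U_k, sigma_k)     # not sparse any more

Cosine similarity between q_k and the columns of docs_k then ranks the documents, as in the earlier slides.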

45
Empirical evidence
  • Experiments on TREC 1/2/3 (Dumais)
  • Lanczos SVD code (available on netlib), due to
    Berry, was used in these experiments
  • Running times of about one day on tens of
    thousands of docs (old data)
  • Dimensions: various values in the range 250-350
    reported
  • (Under 200 reported as unsatisfactory)
  • Generally expect recall to improve; what about
    precision?

46
Empirical evidence
  • Precision at or above median TREC precision
  • Top scorer on almost 20% of TREC topics
  • Slightly better on average than straight vector
    spaces
  • Effect of dimensionality

Dimensions Precision
250 0.367
300 0.371
346 0.374
47
Some wild extrapolation
  • The dimensionality of a corpus is the number of
    distinct topics represented in it.
  • More mathematical wild extrapolation:
  • if A has a rank-k approximation of low Frobenius
    error, then there are no more than k distinct
    topics in the corpus ("Latent semantic indexing:
    A probabilistic analysis").

48
LSI has many other applications
  • In many settings in pattern recognition and
    retrieval, we have a feature-object matrix.
  • For text, the terms are features and the docs are
    objects.
  • Could be opinions and users
  • This matrix may be redundant in dimensionality.
  • Can work with low-rank approximation.
  • If entries are missing (e.g., users' opinions),
    can recover them if the dimensionality is low.
  • Powerful general analytical technique
  • Close, principled analog to clustering methods.