Title: Information Retrieval and Text Mining
1. Information Retrieval and Text Mining
- WS 2004/05, Jan 28, 2005
- Hinrich Schütze
2. How LSI is used for Text Search
- LSI is a technique for dimension reduction
- Similar to Principal Component Analysis (PCA)
- Addresses (near-)synonymy car/automobile
- Attempts to enable concept-based retrieval
- Pre-process docs using a technique from linear algebra called Singular Value Decomposition.
- Reduce dimensionality:
  - Fewer dimensions, more collapsing of axes, better recall, worse precision
  - More dimensions, less collapsing, worse recall, better precision
- Queries handled in this new (reduced) vector space.
3. Input: Term-Document Matrix
- w_i,j: (normalized) weighted count of (t_i, d_j)
- Key idea: Factorize this matrix
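As a concrete illustration, here is a minimal numpy sketch of building such a term-document matrix. The toy corpus and the unit-length column normalization are my own choices; the slide only requires some normalized weight w_i,j.

```python
import numpy as np

# Toy corpus (terms x docs, as on the slide); terms and docs are made up.
terms = ["car", "automobile", "engine", "flower"]
docs = [
    "car engine car",      # d1
    "automobile engine",   # d2
    "flower",              # d3
]

# Raw counts: A[i, j] = number of times terms[i] occurs in docs[j].
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# One simple choice of normalized weight w_i,j: unit-length document columns.
A = A / np.linalg.norm(A, axis=0, keepdims=True)

print(A)  # m x n term-document matrix, ready to be factorized
```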
4. Matrix Factorization
A (m x n) = W (m x k) x H (k x n)
- h_j is the representation of d_j in terms of the basis W.
- If rank(W) = rank(A), then we can always find H so that A = WH.
- How do we select W and H to get more semantic dimensions? -> LSI
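A minimal numpy sketch of this point (toy random data; matrix sizes and names are mine): if the column space of W contains the columns of A, then H can be recovered by least squares so that A = WH exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.random((m, n))        # toy term-document matrix (terms x docs)

# Any basis W whose column space contains col(A) works; take k = m for
# simplicity, so W is square and full rank (with probability 1).
W = rng.random((m, m))

# Each column h_j of H expresses document d_j in the basis W.
H, *_ = np.linalg.lstsq(W, A, rcond=None)

print(np.allclose(W @ H, A))  # True: A = WH, but W is not a "semantic" basis yet
```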
5. Minimization Problem
- Minimize ||A - W S V^T||
  - Minimize information loss
- Given
  - a norm
    - for SVD, the 2-norm
  - constraints on W, S, V
    - for SVD, W and V are orthonormal, and S is diagonal
6. Matrix Factorizations: SVD
A (m x n) = W (m x k) x S (k x k) x V^T (k x n)
- W: basis
- S: singular values
- V^T: representation
- Restrictions on representation: W, V orthonormal; S diagonal
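A small numpy sketch (toy random data) confirming the SVD form and its constraints: W and V have orthonormal columns and S is diagonal with the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))                    # toy term-document matrix (m x n)

# Thin SVD: A = W S V^T with orthonormal W, V and diagonal S.
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(sing_vals)                    # singular values, in decreasing order

print(np.allclose(A, W @ S @ Vt))                    # factorization reproduces A
print(np.allclose(W.T @ W, np.eye(W.shape[1])))      # columns of W are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))   # rows of V^T are orthonormal
```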
7. Dimension Reduction
- For some s << Rank, zero out all but the s biggest singular values in S.
- Denote by S_s this new version of S.
- Typically s in the hundreds while r (Rank) could be in the (tens of) thousands.
- Before: A = W S V^t
- Let A_s = W S_s V^t = W_s S_s V_s^t
- A_s is a good approximation to A.
  - Best rank-s approximation according to the 2-norm
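A numpy sketch of the truncation step (toy data; s chosen arbitrarily): zero out all but the s largest singular values and form A_s. The 2-norm error then equals the (s+1)-th singular value, which is what "best rank-s approximation according to the 2-norm" amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((40, 30))                  # toy term-document matrix
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 5                                     # keep only the s largest singular values
S_s = np.diag(np.concatenate([sing_vals[:s], np.zeros(len(sing_vals) - s)]))
A_s = W @ S_s @ Vt                        # rank-s approximation of A

# For the best rank-s approximation, the 2-norm error is the (s+1)-th singular value.
print(np.linalg.norm(A - A_s, 2), sing_vals[s])   # the two numbers agree
```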
8. Dimension Reduction
A_s (m x n) = W_s (m x s) x S_s (s x s) x V_s^T (s x n)
- W_s: basis
- S_s: singular values
- V_s^T: representation
- The columns of A_s represent the docs, but in s << m dimensions.
- Best rank-s approximation according to the 2-norm
9. More on W and V
- Recall the m x n matrix of terms x docs, A.
- Define the term-term correlation matrix T = AA^t
  - A^t denotes the matrix transpose of A.
  - T is a square, symmetric m x m matrix.
- Doc-doc correlation matrix D = A^t A.
  - D is a square, symmetric n x n matrix.
10. Eigenvectors
- Denote by W the m x r matrix of eigenvectors of T.
- Denote by V the n x r matrix of eigenvectors of D.
- Denote by S the diagonal matrix with the square roots of the eigenvalues of T = AA^t in sorted order.
- It turns out that A = WSV^t is the SVD of A.
- Semi-precise intuition: The new dimensions are the principal components of term correlation space.
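A quick numerical check of this claim on toy data (numpy; matrix sizes are mine): the eigenvalues of T = AA^t are the squared singular values of A, the columns of W are eigenvectors of T, and the columns of V are eigenvectors of D.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))                      # toy terms x docs matrix

T = A @ A.T                                 # term-term correlation matrix (m x m)
D = A.T @ A                                 # doc-doc correlation matrix (n x n)

W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of T (and of D) are the squared singular values of A.
top_eigvals_T = np.linalg.eigvalsh(T)[::-1][: len(sing_vals)]
print(np.allclose(top_eigvals_T, sing_vals ** 2))

# Columns of W are eigenvectors of T: T w_i = sigma_i^2 w_i.
print(np.allclose(T @ W, W @ np.diag(sing_vals ** 2)))

# Similarly, the columns of V (rows of V^T) are eigenvectors of D.
print(np.allclose(D @ Vt.T, Vt.T @ np.diag(sing_vals ** 2)))
```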
11. Query processing
- Exercise: How do you map the query into the reduced space?
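The slide leaves this as an exercise; one standard answer (my sketch, not the official solution) treats the query as a pseudo-document and folds it in via q_s = S_s^-1 W_s^t q, then ranks documents by cosine similarity in the reduced space.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((40, 30))                     # toy terms x docs matrix
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 5
W_s, S_s, Vt_s = W[:, :s], np.diag(sing_vals[:s]), Vt[:s, :]

q = rng.random(40)                           # query vector in the original term space

# Fold the query into the reduced space as a pseudo-document.
q_s = np.linalg.inv(S_s) @ W_s.T @ q

# Document representations are the columns of Vt_s; rank by cosine similarity.
sims = (Vt_s.T @ q_s) / (np.linalg.norm(Vt_s, axis=0) * np.linalg.norm(q_s))
print(np.argsort(-sims)[:5])                 # indices of the top 5 documents
```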
12. Take Away
- LSI is optimal
  - Optimal solution for the given dimensionality
  - Caveat: Mathematically optimal is not necessarily semantically optimal.
- LSI is unique
  - Except for signs and singular values with the same value
- Key benefits of LSI
  - Enhances recall, addresses the synonymy problem
  - But can decrease precision
- Maintenance challenges
  - Changing collections
  - Recompute in intervals?
- Performance challenges
- Cheaper alternatives for recall enhancement
  - E.g. pseudo-feedback
- Use of LSI in deployed systems
13. Resources: LSI
- Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
- Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
- Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html, http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf
- Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.