Title: Information Retrieval and Text Mining
1. Information Retrieval and Text Mining
- WS 2004/05, Jan 28, 2005
- Hinrich Schütze
2. How LSI is used for Text Search
- LSI is a technique for dimension reduction
- Similar to Principal Component Analysis (PCA)
- Addresses (near-)synonymy car/automobile
- Attempts to enable concept-based retrieval
- Pre-process docs using a technique from linear algebra called Singular Value Decomposition.
- Reduce dimensionality:
  - Fewer dimensions, more collapsing of axes, better recall, worse precision
  - More dimensions, less collapsing, worse recall, better precision
- Queries handled in this new (reduced) vector space.
3. Input: Term-Document Matrix
- w_i,j: (normalized) weighted count of (t_i, d_j)
- Key idea: Factorize this matrix
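As a concrete illustration, here is a minimal numpy sketch of building such a term-document matrix. The toy corpus and the unit-length column normalization are my own choices; the slide only requires some normalized weight w_i,j.

```python
import numpy as np

# Toy corpus (terms x docs, as on the slide); terms and docs are made up.
terms = ["car", "automobile", "engine", "flower"]
docs = [
    "car engine car",      # d1
    "automobile engine",   # d2
    "flower",              # d3
]

# Raw counts: A[i, j] = number of times terms[i] occurs in docs[j].
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# One simple choice of normalized weight w_i,j: unit-length document columns.
A = A / np.linalg.norm(A, axis=0, keepdims=True)

print(A)  # m x n term-document matrix, ready to be factorized
```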
4. Matrix Factorization
A (m x n) = W (m x k) x H (k x n)
- h_j is the representation of d_j in terms of the basis W.
- If rank(W) = rank(A), then we can always find H so that A = WH.
- How do we select W and H to get more semantic dimensions? -> LSI
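A minimal numpy sketch of this point (toy random data; matrix sizes and names are mine): if the column space of W contains the columns of A, then H can be recovered by least squares so that A = WH exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.random((m, n))        # toy term-document matrix (terms x docs)

# Any basis W whose column space contains col(A) works; take k = m for
# simplicity, so W is square and full rank (with probability 1).
W = rng.random((m, m))

# Each column h_j of H expresses document d_j in the basis W.
H, *_ = np.linalg.lstsq(W, A, rcond=None)

print(np.allclose(W @ H, A))  # True: A = WH, but W is not a "semantic" basis yet
```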
5. Minimization Problem
- Minimize ||A - W S V^T||
  - Minimize information loss
- Given
  - a norm
    - for SVD, the 2-norm
  - constraints on W, S, V
    - for SVD, W and V are orthonormal, and S is diagonal
6. Matrix Factorizations: SVD
A (m x n) = W (m x k) x S (k x k) x V^T (k x n)
- W: basis
- S: singular values
- V^T: representation
- Restrictions on representation: W, V orthonormal; S diagonal
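A small numpy sketch (toy random data) confirming the SVD form and its constraints: W and V have orthonormal columns and S is diagonal with the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))                    # toy term-document matrix (m x n)

# Thin SVD: A = W S V^T with orthonormal W, V and diagonal S.
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(sing_vals)                    # singular values, in decreasing order

print(np.allclose(A, W @ S @ Vt))                    # factorization reproduces A
print(np.allclose(W.T @ W, np.eye(W.shape[1])))      # columns of W are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))   # rows of V^T are orthonormal
```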
7. Dimension Reduction
- For some s << Rank, zero out all but the s biggest singular values in S.
- Denote by S_s this new version of S.
- Typically s in the hundreds while r (Rank) could be in the (tens of) thousands.
- Before: A = W S V^t
- Let A_s = W S_s V^t = W_s S_s V_s^t
- A_s is a good approximation to A.
  - Best rank-s approximation according to the 2-norm
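A numpy sketch of the truncation step (toy data; s chosen arbitrarily): zero out all but the s largest singular values and form A_s. The 2-norm error then equals the (s+1)-th singular value, which is what "best rank-s approximation according to the 2-norm" amounts to.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((40, 30))                  # toy term-document matrix
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 5                                     # keep only the s largest singular values
S_s = np.diag(np.concatenate([sing_vals[:s], np.zeros(len(sing_vals) - s)]))
A_s = W @ S_s @ Vt                        # rank-s approximation of A

# For the best rank-s approximation, the 2-norm error is the (s+1)-th singular value.
print(np.linalg.norm(A - A_s, 2), sing_vals[s])   # the two numbers agree
```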
8. Dimension Reduction
A_s (m x n) = W_s (m x s) x S_s (s x s) x V_s^T (s x n)
- W_s: basis
- S_s: singular values
- V_s^T: representation
- The columns of A_s represent the docs, but in s << m dimensions.
- Best rank-s approximation according to the 2-norm
9. More on W and V
- Recall the m x n matrix of terms x docs, A.
- Define the term-term correlation matrix T = AA^t
  - A^t denotes the matrix transpose of A.
  - T is a square, symmetric m x m matrix.
- Doc-doc correlation matrix D = A^t A.
  - D is a square, symmetric n x n matrix.
10. Eigenvectors
- Denote by W the m x r matrix of eigenvectors of T.
- Denote by V the n x r matrix of eigenvectors of D.
- Denote by S the diagonal matrix with the square roots of the eigenvalues of T = AA^t in sorted order.
- It turns out that A = WSV^t is the SVD of A.
- Semi-precise intuition: The new dimensions are the principal components of term correlation space.
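A quick numerical check of this claim on toy data (numpy; matrix sizes are mine): the eigenvalues of T = AA^t are the squared singular values of A, the columns of W are eigenvectors of T, and the columns of V are eigenvectors of D.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))                      # toy terms x docs matrix

T = A @ A.T                                 # term-term correlation matrix (m x m)
D = A.T @ A                                 # doc-doc correlation matrix (n x n)

W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of T (and of D) are the squared singular values of A.
top_eigvals_T = np.linalg.eigvalsh(T)[::-1][: len(sing_vals)]
print(np.allclose(top_eigvals_T, sing_vals ** 2))

# Columns of W are eigenvectors of T: T w_i = sigma_i^2 w_i.
print(np.allclose(T @ W, W @ np.diag(sing_vals ** 2)))

# Similarly, the columns of V (rows of V^T) are eigenvectors of D.
print(np.allclose(D @ Vt.T, Vt.T @ np.diag(sing_vals ** 2)))
```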
11. Query processing
- Exercise: How do you map the query into the reduced space?
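The slide leaves this as an exercise; one standard answer (my sketch, not the official solution) treats the query as a pseudo-document and folds it in via q_s = S_s^-1 W_s^t q, then ranks documents by cosine similarity in the reduced space.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((40, 30))                     # toy terms x docs matrix
W, sing_vals, Vt = np.linalg.svd(A, full_matrices=False)

s = 5
W_s, S_s, Vt_s = W[:, :s], np.diag(sing_vals[:s]), Vt[:s, :]

q = rng.random(40)                           # query vector in the original term space

# Fold the query into the reduced space as a pseudo-document.
q_s = np.linalg.inv(S_s) @ W_s.T @ q

# Document representations are the columns of Vt_s; rank by cosine similarity.
sims = (Vt_s.T @ q_s) / (np.linalg.norm(Vt_s, axis=0) * np.linalg.norm(q_s))
print(np.argsort(-sims)[:5])                 # indices of the top 5 documents
```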
12. Take Away
- LSI is optimal
  - Optimal solution for the given dimensionality
  - Caveat: Mathematically optimal is not necessarily semantically optimal.
- LSI is unique
  - Except for signs and singular values with the same value
- Key benefits of LSI
  - Enhances recall, addresses the synonymy problem
  - But can decrease precision
- Maintenance challenges
  - Changing collections
  - Recompute in intervals?
- Performance challenges
- Cheaper alternatives for recall enhancement
  - E.g. pseudo-feedback
- Use of LSI in deployed systems
13. Resources: LSI
- Random projection theorem: http://citeseer.nj.nec.com/dasgupta99elementary.html
- Faster random projection: http://citeseer.nj.nec.com/frieze98fast.html
- Latent semantic indexing: http://citeseer.nj.nec.com/deerwester90indexing.html, http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf
- Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.