1
Information Retrieval and Text Mining
  • WS 2004/05, Jan 28, 2005
  • Hinrich Schütze

2
How LSI is used for Text Search
  • LSI is a technique for dimension reduction
  • Similar to Principal Component Analysis (PCA)
  • Addresses (near-)synonymy: car/automobile
  • Attempts to enable concept-based retrieval
  • Pre-process docs using a technique from linear
    algebra called Singular Value Decomposition (SVD).
  • Reduce dimensionality to some s << rank(A).
  • Fewer dimensions: more collapsing of axes,
    better recall, worse precision
  • More dimensions: less collapsing, worse recall,
    better precision
  • Queries are handled in this new (reduced) vector
    space (a minimal sketch follows this list).
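A minimal end-to-end sketch of the pipeline described above (the toy data and all variable names are assumptions, not from the slides):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1, 0],   # "car"
    [0, 2, 1, 0],   # "automobile"
    [1, 1, 0, 2],   # "engine"
    [0, 0, 0, 3],   # "flower"
], dtype=float)

# Factorize with the SVD, then keep only the s largest singular values.
W, sigma, Vt = np.linalg.svd(A, full_matrices=False)
s = 2
docs_reduced = (np.diag(sigma[:s]) @ Vt[:s, :]).T   # one s-dim vector per doc

# "car" and "automobile" never co-occur, yet docs 0 and 1 end up close in
# the reduced space because both terms co-occur with "engine".
print(docs_reduced)
```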

3
Input Term-Document Matrix
  • w_ij = (normalized) weighted count of (t_i, d_j)
  • Key idea: factorize this matrix (a small sketch
    of one weighting scheme follows)
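One way to build w_ij (a sketch; the tf-idf weighting and column normalization used here are assumptions, the slide only says "(normalized) weighted count"):

```python
import numpy as np

# Raw counts count(t_i, d_j): rows = terms, columns = documents.
counts = np.array([
    [3, 0, 1],
    [0, 2, 0],
    [1, 1, 1],
], dtype=float)

df = (counts > 0).sum(axis=1)                  # document frequency per term
idf = np.log(counts.shape[1] / df)             # inverse document frequency
W = counts * idf[:, None]                      # tf-idf weight w_ij
W /= np.linalg.norm(W, axis=0, keepdims=True)  # normalize each document column
print(W)
```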

4
Matrix Factorization
[Diagram: A (m × n) = W (m × k) × H (k × n)]
h_j is the representation of d_j in terms of the basis W.
If rank(W) = rank(A), then we can always find H so that A = WH.
How do we select W and H to get more semantic dimensions? -> LSI
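A sketch of this generic factorization (the SVD is used here only to produce one concrete W and H with rank(W) = rank(A); any rank-preserving factorization would do):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 8))               # m x n term-document matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
k = np.linalg.matrix_rank(A)         # here k = rank(A)
W = U[:, :k]                         # basis, m x k
H = np.diag(sigma[:k]) @ Vt[:k, :]   # representation: column h_j encodes d_j

assert np.allclose(W @ H, A)         # exact: A = WH when rank(W) = rank(A)
```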
5
Minimization Problem
  • Minimize ||A - W S V^t||
  • Minimize information loss
  • Given:
  • a norm
  • for SVD, the 2-norm
  • constraints on W, S, V
  • for SVD, W and V are orthonormal, and S is
    diagonal
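Written out compactly (a standard formulation consistent with the bullets above, not copied verbatim from the slide):

```latex
\min_{W,\,S,\,V} \; \lVert A - W S V^{t} \rVert_{2}
\quad \text{subject to} \quad
W^{t} W = I,\; V^{t} V = I,\; S \text{ diagonal.}
```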

6
Matrix Factorizations: SVD
A = W × S × V^T
[Diagram: A (m × n) = W (m × k) × S (k × k) × V^T (k × n);
 W = basis, S = singular values, V^T = representation]
Restrictions on representation: W, V orthonormal;
S diagonal
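A quick numeric check of this structure on random data (a sketch; the names follow the slide rather than NumPy's usual U, Σ, Vᵀ):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                              # m x n

W, sigma, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(sigma)

assert np.allclose(W.T @ W, np.eye(W.shape[1]))     # W has orthonormal columns
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))  # so does V
assert np.all(sigma[:-1] >= sigma[1:])              # singular values sorted
assert np.allclose(W @ S @ Vt, A)                   # A = W S V^T
```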
7
Dimension Reduction
  • For some s << rank, zero out all but the s
    biggest singular values in S.
  • Denote by S_s this new version of S.
  • Typically s is in the hundreds, while r (the rank)
    can be in the (tens of) thousands.
  • Before: A = W S V^t
  • Let A_s = W S_s V^t = W_s S_s V_s^t
  • A_s is a good approximation to A.
  • Best rank-s approximation according to the 2-norm
    (checked in the sketch below)
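A sketch of the truncation step, including a check of the "best rank-s approximation" claim: for the 2-norm, the error of A_s equals the first dropped singular value (the Eckart–Young bound):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((20, 30))

W, sigma, Vt = np.linalg.svd(A, full_matrices=False)

s = 5
sigma_s = sigma.copy()
sigma_s[s:] = 0.0                           # S_s: zero all but s biggest values
A_s = W @ np.diag(sigma_s) @ Vt

# Equivalent thin form W_s S_s V_s^t using only the first s columns/rows.
A_s_thin = W[:, :s] @ np.diag(sigma[:s]) @ Vt[:s, :]
assert np.allclose(A_s, A_s_thin)

# 2-norm error of the best rank-s approximation = (s+1)-th singular value.
assert np.isclose(np.linalg.norm(A - A_s, 2), sigma[s])
```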

8
Dimension Reduction
A_s = W × S_s × V^T
[Diagram: A_s (m × n) = W (m × k) × S_s (k × k) × V^T (k × n),
 with only s nonzero singular values in S_s;
 W = basis, S_s = singular values, V^T = representation]
The columns of A_s represent the docs, but in an
s-dimensional subspace (s << m). Best rank-s
approximation according to the 2-norm.
9
More on W and V
  • Recall the m × n matrix of terms × docs, A.
  • Define the term-term correlation matrix T = AA^t.
  • A^t denotes the matrix transpose of A.
  • T is a square, symmetric m × m matrix.
  • Doc-doc correlation matrix D = A^tA.
  • D is a square, symmetric n × n matrix.
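A direct check of these definitions (a sketch on random data):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((5, 7))                   # m x n matrix of terms x docs

T = A @ A.T                              # term-term correlation matrix, m x m
D = A.T @ A                              # doc-doc correlation matrix, n x n

assert T.shape == (5, 5) and np.allclose(T, T.T)   # square, symmetric
assert D.shape == (7, 7) and np.allclose(D, D.T)
```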

10
Eigenvectors
  • Denote by W the m × r matrix of eigenvectors of
    T.
  • Denote by V the n × r matrix of eigenvectors of
    D.
  • Denote by S the diagonal matrix of the square
    roots of the eigenvalues of T = AA^t, in sorted
    order.
  • It turns out that A = WSV^t is the SVD of A.
  • Semi-precise intuition: the new dimensions are
    the principal components of term correlation
    space.
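A numeric check of this eigenvector/SVD connection (a sketch; it assumes distinct eigenvalues, which random data has almost surely):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((5, 7))

W, sigma, Vt = np.linalg.svd(A, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(A @ A.T)    # eigh: symmetric eigenproblem
order = np.argsort(eigvals)[::-1]             # sort eigenvalues descending

assert np.allclose(eigvals[order], sigma**2)  # singular value = sqrt(eigenvalue)
for i in range(W.shape[1]):                   # eigenvectors match W up to sign
    assert np.isclose(abs(eigvecs[:, order[i]] @ W[:, i]), 1.0)
```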

11
Query Processing
  • Exercise: How do you map the query into the
    reduced space? (One standard answer is sketched
    below.)
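One standard answer (the "fold-in" used with LSI; a sketch with toy data): treat the query as a pseudo-document, map it with q_s = S_s^{-1} W_s^t q, then rank documents by cosine similarity in the reduced space.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((6, 10))                        # terms x docs

W, sigma, Vt = np.linalg.svd(A, full_matrices=False)
s = 3
W_s, S_s = W[:, :s], np.diag(sigma[:s])
docs_s = Vt[:s, :].T                           # rows of V_s: docs in s dims

q = np.zeros(6)
q[0] = q[2] = 1.0                              # toy query with terms 0 and 2
q_s = np.linalg.inv(S_s) @ W_s.T @ q           # fold the query into the space

# Rank documents by cosine similarity against the folded-in query.
sims = (docs_s @ q_s) / (np.linalg.norm(docs_s, axis=1) * np.linalg.norm(q_s))
print(np.argsort(-sims))                       # best-matching documents first
```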

12
Take Away
  • LSI is optimal: the optimal solution for a given
    dimensionality
  • Caveat: mathematically optimal is not necessarily
    semantically optimal.
  • LSI is unique
  • except for signs and singular values with the
    same value
  • Key benefits of LSI
  • Enhances recall, addresses the synonymy problem
  • But can decrease precision
  • Maintenance challenges
  • Changing collections
  • Recompute at intervals?
  • Performance challenges
  • Cheaper alternatives for recall enhancement
  • E.g., pseudo-feedback
  • Use of LSI in deployed systems

13
Resources: LSI
  • Random projection theorem:
    http://citeseer.nj.nec.com/dasgupta99elementary.html
  • Faster random projection:
    http://citeseer.nj.nec.com/frieze98fast.html
  • Latent semantic indexing:
    http://citeseer.nj.nec.com/deerwester90indexing.html
    and http://cs276a.stanford.edu/handouts/fsnlp-svd.pdf
  • Books: FSNLP 15.4, MG 4.6, MIR 2.7.2.