1
Indexing by Latent Semantic Analysis
  • Deerwester, S., Dumais, S.T., Furnas, G.W.,
    Landauer, T., Harshman, R.
  • Presented by
  • Nikhil Ahuja
  • University of Arkansas at Little Rock

2
Outline
• Why Latent Semantic Analysis?
  • Problems with earlier techniques
  • Precision, recall, synonymy, polysemy
  • Singular Value Decomposition (SVD) model
  • Latent Semantic Indexing and how it works
  • Visualization of LSI
  • Worked example for concept comprehension
  • Conclusion

3
Why Latent Semantic Analysis?
• Retrieves relevant docs even when they do not contain the
    exact keywords from the query
  • Minimizes info loss by removing only the least significant
    parts of the frequency matrix
  • Reduces the dimensionality of the matrix
  • Deals with synonymy and polysemy
  • Improves recall and precision

4
Problems with earlier techniques
• Match words of queries against words of docs
  • Fail to retrieve relevant docs that do not contain the
    exact keywords
  • Do not address polysemy or synonymy
  • Poor recall and precision performance
  • Noisy data is not handled
  • Expensive (controlled vocabularies, human indexers)

5
Precision and Recall
• Precision: percentage of retrieved documents that are in
    fact relevant to the query
  • Precision = |Relevant ∩ Retrieved| / |Retrieved|
  • Recall: percentage of documents relevant to the query that
    were in fact retrieved
  • Recall = |Relevant ∩ Retrieved| / |Relevant|
    (a small sketch of both measures follows below)
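
As a quick illustration of the two formulas above, here is a
minimal Python sketch; the document-ID sets are hypothetical:

relevant = {"d1", "d2", "d3", "d5"}     # docs actually relevant to the query
retrieved = {"d2", "d3", "d4", "d6"}    # docs the system returned

hits = relevant & retrieved             # Relevant ∩ Retrieved
precision = len(hits) / len(retrieved)  # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant docs that were retrieved

print(precision, recall)                # 0.5 0.5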

6
Synonymy and Polysemy Problem
• Synonymy problem: a keyword may not appear anywhere in the
    document even though the document is closely related to it
  • e.g., keyword "software" vs. "product"
  • Polysemy problem: the same keyword might mean different
    things in different contexts
  • e.g., keyword "mining" (data mining vs. coal mining)

7
Latent Semantic Indexing Model
• The k-dimensional vector space does not exactly reconstruct
    the original term-document (t × d) matrix
  • Cosine similarity is used to measure document closeness
  • A query is represented as the weighted sum of its component
    term vectors

8
Working of LSI (1)
• Each row represents a term
  • Each column represents a document vector
  • Entry (i, j) of the frequency matrix registers the number
    of occurrences of term t_i in document d_j

9
Working of LSI (2)
• Start with a set of d documents and t terms; model each
    document as a vector v in t-dimensional space
  • The j-th coordinate of v measures the association of the
    j-th term with the given document
  • 0 if the term does not occur in the document
  • Non-zero if the document contains the term (a construction
    sketch follows below)
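
A minimal sketch of this construction in Python; the toy corpus
is illustrative, not taken from the paper:

from collections import Counter

docs = ["human machine interface",
        "user interface system",
        "human system survey"]

# vocabulary: all distinct terms across the collection
terms = sorted({t for d in docs for t in d.split()})

# X[i][j] = number of occurrences of term t_i in document d_j
X = [[Counter(d.split())[t] for d in docs] for t in terms]

for t, row in zip(terms, X):
    print(f"{t:10s}", row)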

10
Working of LSI (3)
• Create the term-frequency matrix X
  • Compute the SVD, splitting X into 3 smaller matrices:
    X = T0 S0 D0ᵀ
  • T0 and D0 are orthogonal matrices
  • S0 is the diagonal matrix of singular values
  • Keeping only the k largest singular values reduces S0 to a
    k × k matrix S
  • For each doc d, replace it with a new vector that excludes
    the terms eliminated during the SVD truncation
  • Store all vectors to find the similarity between 2 docs or
    to find the top-N matches for a query (see the sketch below)
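
A minimal NumPy sketch of this step, assuming a small toy
term-frequency matrix; variable names mirror the slides' T0, S0, D0:

import numpy as np

X = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)           # toy 4-term x 3-doc matrix

T0, s, D0T = np.linalg.svd(X, full_matrices=False)  # X = T0 @ diag(s) @ D0T

k = 2                                            # keep the k largest singular values
T, S, D = T0[:, :k], np.diag(s[:k]), D0T[:k, :].T

# each document d_j is replaced by its k-dimensional vector (row j of D @ S)
doc_vectors = D @ S
print(doc_vectors.shape)                         # (3, 2): 3 docs in k = 2 dimensions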

11
Cosine Similarity
• The cosine measure is a metric for measuring document
    similarity
  • Let v1 and v2 be document vectors
  • Cosine similarity: sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
  • v1 · v2 is the vector dot product
  • |v1| = √(v1 · v1)
  • Usage of similarity metrics:
  • Construct similarity-based indices on such documents
  • Text queries can be represented as vectors and used to
    search for their nearest neighbors in a document collection
    (see the sketch below)
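
A minimal sketch of the cosine measure and its use for top-N
ranking; the vectors below are illustrative:

import numpy as np

def cosine_sim(v1, v2):
    # sim(v1, v2) = (v1 . v2) / (|v1| |v2|)
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

docs = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.7, 0.3]])         # k-dimensional document vectors
query = np.array([1.0, 0.2])          # query folded into the same space

# rank documents by similarity to the query (nearest neighbors first)
ranked = sorted(range(len(docs)),
                key=lambda j: cosine_sim(query, docs[j]), reverse=True)
print(ranked)                         # [0, 2, 1] for these toy vectors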

12
Visualization of SVD - 1
13
Visualization of SVD - 2
14
Representation of the term-document matrix
15
(No Transcript)
16
Example of a 12 × 9 matrix
  • X: a 12-term by 9-document matrix
  • X is decomposed into T0 S0 D0ᵀ
  • T0 and D0 have orthogonal, unit-length columns (shapes
    shown in the sketch below)
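
A sketch of the shapes involved, with a random matrix standing
in for the paper's 12 × 9 example data:

import numpy as np

X = np.random.rand(12, 9)                        # stand-in for the 12 x 9 example
T0, s, D0T = np.linalg.svd(X, full_matrices=False)

print(T0.shape, s.shape, D0T.shape)              # (12, 9) (9,) (9, 9)

# columns of T0 (and rows of D0T) are orthogonal and unit length
print(np.allclose(T0.T @ T0, np.eye(9)))         # True
print(np.allclose(D0T @ D0T.T, np.eye(9)))       # True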

17
(No Transcript)
18
X with further reduced dimensions
  • Multiply T S Dᵀ to get an approximate X
19
X with reduced dimensionality
• This matrix does not exactly match the original
    term-by-document matrix
  • We want it as close as possible, but not a perfect
    reconstruction (see the sketch below)
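
A sketch making the "close but not perfect" point concrete: the
relative Frobenius error of the rank-k reconstruction shrinks as
k grows, reaching zero only at full rank:

import numpy as np

X = np.random.rand(12, 9)
T0, s, D0T = np.linalg.svd(X, full_matrices=False)

for k in (2, 5, 9):
    Xk = T0[:, :k] @ np.diag(s[:k]) @ D0T[:k, :]   # rank-k approximation T S Dᵀ
    err = np.linalg.norm(X - Xk) / np.linalg.norm(X)
    print(k, round(err, 4))                        # error drops to ~0 at k = 9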

20
Conclusion
• LSI is a powerful technique for effective information
    retrieval
  • Increases data compaction (reduces redundancy)
  • Relevant docs are grouped together even if they do not
    share the exact same keywords
  • Deals well with the synonymy problem and partially with
    the polysemy problem
  • Reduces noise and increases precision and recall