Title: Latent Semantic Indexing and its place in Information Retrieval


1
Latent Semantic Indexing and its place in
Information Retrieval
  • Michael Weller
  • Autumn 2003

2
Overview

3
Introduction
  • Searching for information
  • Library
  • Internet
  • Tools for searching the Internet
  • Search Engines
  • Electronic Library Catalogue

4
Information Retrieval
  • Indexing of documents
    • Text
    • Image
    • Multimedia
    • Bibliographic Data
  • Has existed since the 1950s/60s
  • Originally developed for library automation

5
Quality Measures
  • Three approaches
    • Recall
    • Precision
    • Fallout
  • Recall = RH / R
  • Precision = RH / H
  • Fallout = ID / H

ID = Irrelevant Documents, RH = Relevant Hits
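
The three measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `quality_measures` and the document IDs are invented, R is taken to be the set of relevant documents, H the set of hits (returned documents), and ID the irrelevant documents among the hits. Fallout follows the slide's ID / H; many textbooks instead divide by the number of irrelevant documents in the whole collection.

```python
def quality_measures(relevant, hits):
    """Return (recall, precision, fallout) using the slide's definitions."""
    relevant, hits = set(relevant), set(hits)
    relevant_hits = relevant & hits      # RH
    irrelevant_hits = hits - relevant    # ID (assumed: irrelevant documents among the hits)
    recall = len(relevant_hits) / len(relevant)
    precision = len(relevant_hits) / len(hits)
    fallout = len(irrelevant_hits) / len(hits)
    return recall, precision, fallout


# Hypothetical collection: four relevant documents, three returned.
recall, precision, fallout = quality_measures(
    relevant={"d1", "d2", "d3", "d4"},
    hits={"d1", "d2", "d5"},
)
print(recall, precision, fallout)
```

A system that returns exactly the relevant set gives recall 1, precision 1 and fallout 0 under these definitions.
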
6
Best IR System
  • Returns all relevant documents
  • Returns no irrelevant documents
  • Recall 1
  • Precision 1
  • Fallout 0

7
Tools Available
  • Binary Matching
    • Uses keywords
    • Also known as Boolean Matching
  • Vector Space Model
    • Document Indexing
    • Term Weighting
    • Similarity Coefficients
  • Latent Semantic Indexing (LSI)
  • Language Problems
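
To contrast the first two tools, here is a minimal sketch of binary matching next to a vector space similarity coefficient. The cosine measure is one common choice of coefficient, not necessarily the one the presenter had in mind, and the function names, vocabulary and weights are all hypothetical.

```python
import math


def boolean_match(query_terms, doc_terms):
    """Binary (Boolean AND) matching: every query keyword must occur."""
    return set(query_terms) <= set(doc_terms)


def cosine_similarity(a, b):
    """Similarity coefficient between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed vocabulary order: ["index", "matrix", "search"]
doc = [2, 1, 0]
query = [1, 0, 1]

# The document lacks "search", so binary matching rejects it outright ...
print(boolean_match(["index", "search"], ["index", "matrix"]))  # False
# ... while the vector space model still gives it a graded score.
print(round(cosine_similarity(query, doc), 3))
```
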

8
Latent Semantic Indexing
  • Developed by Bellcore
  • Extension of Vector Space Model
  • Designed to provide search results, even if
    keyword is not contained
  • Applies Singular Value Decomposition to index
    matrix of Vector Space Model

9
LSI
  • Appears to
    • Examine context
    • Show comprehension of relationships
  • Actually it
    • Applies pattern matching
    • Establishes connections based upon these patterns

10
How does LSI work?
  • Content Search
  • Index Matrix Composition
  • Term Space Modelling
  • Singular Value Decomposition (SVD)

11
Content Search
  • Collate all words
  • Remove non-content describing words
    • Prepositions
    • Conjunctions
    • Common verbs and adjectives
    • Readability words
  • Remove all words that are
    • In only one document
    • In all documents

12
Index Matrix Composition
  • Same as for Vector Space Model
  • Variation of weighting
  • Global
  • IDF
  • GfIdf
  • Local
  • Log-Entropy
  • Term Frequency
  • Binary
  • May be composed of thousands of terms
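
As an illustration of combining a local and a global weight, the sketch below builds a small term-by-document index matrix using raw term frequency as the local weight and IDF as the global weight. The formula idf = log(N / df) is one common variant and is an assumption here, as are the function name `tfidf_matrix` and the sample documents.

```python
import math


def tfidf_matrix(terms, documents):
    """Build a term-by-document matrix: local term frequency x global IDF."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(documents)
    matrix = []
    for term in terms:
        # Number of documents containing the term.
        doc_freq = sum(1 for words in tokenised if term in words)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        matrix.append([words.count(term) * idf for words in tokenised])
    return matrix


terms = ["library", "search"]
docs = [
    "library index library catalogue",
    "search library index",
    "search engines index internet",
]
for term, row in zip(terms, tfidf_matrix(terms, docs)):
    print(term, [round(w, 3) for w in row])
```

Each row is a term and each column a document, which is the layout the later SVD step operates on.
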

13
Term Space Modelling
  • Graphical representation of index matrix
  • Represents documents in terms of keywords
  • Can be difficult to visualise

14
Singular Value Decomposition
  • Used to reduce number of dimensions
  • Splits any rectangular matrix X (m×n) into
    • m×n matrix (U)
    • n×n matrix (V^T)
    • n×n diagonal matrix (S)
  • X = U S V^T
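
A quick way to see these shapes is to decompose a small matrix with NumPy. The 4-term by 3-document matrix below is a made-up example; `numpy.linalg.svd` with `full_matrices=False` returns the thin form whose shapes match the slide (for m ≥ n).

```python
import numpy as np

# Hypothetical 4-term x 3-document index matrix (m = 4, n = 3).
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Thin SVD: U is m x n, s holds the n singular values, Vt is n x n.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)  # n x n diagonal matrix of singular values

print(U.shape, S.shape, Vt.shape)  # (4, 3) (3, 3) (3, 3)
print(np.allclose(X, U @ S @ Vt))  # True
```
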

15
SVD
  • For an m×m symmetric matrix, the SVD can be
    calculated by
    • Solving the eigenvalue problem
    • Diagonalization

16
SVD
  • Allows the calculation of the rank-l approximation
    X(l) = sum over k = 1..l of u_k s_k v_k^T
  • X(l) is the closest rank-l matrix to X (Wall, M. et
    al. 2003, p. 1)
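
The closest rank-l matrix can be formed by keeping only the l largest singular values and their vectors. The sketch below does this with NumPy on a made-up matrix and checks a standard property of the truncation: the Frobenius-norm error equals the root sum of squares of the discarded singular values. The helper name `rank_l_approximation` is hypothetical.

```python
import numpy as np


def rank_l_approximation(X, l):
    """Keep only the l largest singular triplets of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]


# Hypothetical index matrix with singular values 2, sqrt(3) and 1.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
X2 = rank_l_approximation(X, 2)

# Compare the approximation error with the discarded singular values.
s = np.linalg.svd(X, compute_uv=False)
error = np.linalg.norm(X - X2)  # Frobenius norm by default for matrices
print(np.linalg.matrix_rank(X2))
print(np.isclose(error, np.sqrt(np.sum(s[2:] ** 2))))
```
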

17
SVD
  • V^T and S can be calculated by diagonalization of
    X^T X
  • X^T X = V S^2 V^T
  • U can be calculated by
    • U = X V S^-1
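
Both identities follow from the orthonormality of the columns of U and V. A short derivation, assuming X = U S V^T in its thin form and that every singular value is nonzero so that S is invertible:

```latex
\begin{aligned}
X^{T}X &= (U S V^{T})^{T}(U S V^{T}) = V S\,U^{T}U\,S V^{T} = V S^{2} V^{T},\\
X V &= U S V^{T} V = U S \quad\Longrightarrow\quad U = X V S^{-1}.
\end{aligned}
```

The first line uses U^T U = I and the fact that the diagonal matrix S equals its own transpose; the second uses V^T V = I.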

18
Implementations of SVD
  • Householder Reflections and Givens Rotations
    • Small, not very sparse matrix
    • Slow speed
  • Power Method and Subspace Iteration
    • Find largest eigenvalue
    • Find corresponding eigenvector for square matrix
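
As a rough illustration of the power method, the sketch below repeatedly multiplies a vector by a square matrix and normalises it, which converges to the eigenvector of the largest eigenvalue when that eigenvalue is strictly dominant and the start vector is not orthogonal to it. The function name, the fixed iteration count and the example matrix are assumptions made for the example.

```python
import math


def power_method(A, iterations=200):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix A."""
    n = len(A)
    v = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        # Multiply A by the current vector and renormalise to unit length.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # Rayleigh quotient v^T A v (v has unit length).
        Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * Av[i] for i in range(n))
    return eigenvalue, v


# X^T X for a small hypothetical index matrix; its eigenvalues are 4, 3 and 1.
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 4.0]]
value, vector = power_method(A)
print(round(value, 6), [round(x, 6) for x in vector])
```

The square root of the dominant eigenvalue of X^T X is the largest singular value of X, which is how this ties back to the SVD.
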

19
Implementations of SVD
  • Lanczos Algorithm
  • Calculate singular values
  • Large Matrices
  • Extra computations for eigenvectors
  • Variations
  • FRO
  • SO
  • SCO
  • SO2
  • Partial Orthogonalization

20
Strengths of LSI
  • Broader range of relevant documents
  • Terms and text objects represented in same space
  • Objects can be retrieved directly from query
    terms
  • Returns relevant documents that do not contain
    query terms

21
Problems with LSI
  • Addition/Removal of documents affects the SVD
    statistics
  • Queries must be transformed to reduced
    k-dimensional space
  • All queries
  • Every time
  • Matrix (U) must be kept readily available
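
The query transformation can be sketched with NumPy as below, using the common fold-in formula q_hat = q^T U_k S_k^-1, which is why U has to stay available at query time. The term list, the counts, the choice k = 2 and the helper names are all invented for the example; it also illustrates a query for "car" matching a document that contains only "automobile" and "engine".

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower"]
# Hypothetical term-by-document counts; no document contains both
# "car" and "automobile".
X = np.array([
    [1.0, 0.0, 0.0],   # car
    [0.0, 1.0, 0.0],   # automobile
    [1.0, 1.0, 0.0],   # engine
    [0.0, 0.0, 2.0],   # flower
])

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T


def fold_in(query_vector):
    """Project a term-space query into the k-dimensional space: q^T U_k S_k^-1."""
    return query_vector @ U_k @ np.linalg.inv(S_k)


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = np.array([1.0, 0.0, 0.0, 0.0])  # the single keyword "car"
q_hat = fold_in(query)
doc_coords = V_k  # row j holds document j in the reduced space
similarities = [cosine(q_hat, doc_coords[j]) for j in range(X.shape[1])]
print([round(sim, 3) for sim in similarities])
```

In this toy case the second document scores as highly as the first even though it never mentions "car", while the unrelated third document scores essentially zero.
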

22
Problems with LSI
  • Requires extra storage space
  • Requires extra computational power
  • Slower than conventional binary matching methods

23
Potential Uses
  • Relevance Feedback
  • Information Filtering
  • Spam Email
  • Chat Rooms
  • News Groups
  • Bulletin Boards
  • Family Suitable Search Engines

24
Potential Uses
  • Textual Coherence
  • Automated Writing Assessment
  • Feedback and Further Study
  • Automated Academic Integrity Checking
  • Automated Marking of Exams and Coursework
  • Cross-Language Retrieval

25
Improving LSI
  • Can Latent Semantic Indexing be improved outside
    its patented framework to improve the relevance
    of the returned documents?
  • A possible approach is to add meaning

26
Adding Meaning
  • Determine correctness according to context
    meaning
  • Not going to be easy
  • Years of research in NLP
  • Examine abstract/summary, not entire document
  • Enable search clarification
  • Verification of all documents

27
Some Questions To Think About
  • Would updating web standards help in improving
    Latent Semantic Indexing and Information
    Retrieval?
  • Is Latent Semantic Indexing the way forward or
    just a temporary solution until Natural Language
    Processing is perfected?
  • Who should manage the Index Matrix?
  • There are a variety of possible implementations
    of Latent Semantic Indexing, ranging from
    distributed to centralised systems. Which would
    provide the most benefit?

28
Conclusion
  • LSI is a mathematically-based solution
  • Relies upon finding relationships between
    keywords
  • Gives a false impression of understanding
    meaning
  • Uses SVD to reduce dimensions

29
Conclusion
  • Tends to be slower than binary matching
  • Improved success at returning relevant documents
  • 30% more effective at finding and ranking
    relevant items than comparable word matching
    methods (Telcordia Technologies Ltd. undated,
    p. 1)
  • Potential: Adding true meaning analysis