The Mathematics of Information Retrieval

1
The Mathematics of Information Retrieval
  • 11/21/2005
  • Presented by Jeremy Chapman, Grant Gelven and Ben
    Lakin

2
Acknowledgments
  • This presentation is based on the following
    paper:
  • Michael W. Berry, Zlatko Drmač, and Elizabeth
    R. Jessup, "Matrices, Vector Spaces, and
    Information Retrieval," SIAM Review, Vol. 41,
    No. 2, 1999.

3
Indexing of Scientific Works
  • Indexing is primarily done using the title,
    author list, abstract, key word list, and subject
    classification
  • These fields are created in large part to allow
    documents to be found in a search of the
    scientific literature
  • The use of automated information retrieval (IR)
    has improved both consistency and speed

4
Vector Space Model for IR
  • The basic mechanism of this model is the
    encoding of each document as a vector
  • All document vectors are stored in a single
    matrix
  • Latent Semantic Indexing (LSI) replaces the
    original matrix with a matrix of smaller rank
    that retains similar information, by means of
    rank reduction

5
Creating the Database Matrix
  • Each document is encoded as a column of the
    matrix (d is the number of documents)
  • Each term is encoded as a row (t is the number of
    terms)
  • This gives us a t x d matrix
  • The document vectors span the content of the
    collection (a construction sketch follows below)
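
A minimal sketch in Python/NumPy of how such a term-document matrix could be built; the helper name and the simple lowercase substring matching (standing in for real stemming) are illustrative assumptions, and the term and title lists preview the example on the next slide:

  import numpy as np

  # Hypothetical construction of the t x d term-document matrix:
  # entry (i, j) is 1 if term stem i occurs in document title j.
  def build_term_document_matrix(terms, documents):
      A = np.zeros((len(terms), len(documents)))
      for i, term in enumerate(terms):
          for j, doc in enumerate(documents):
              if term in doc.lower():
                  A[i, j] = 1.0
      return A

  terms = ["bak", "recipe", "bread", "cake", "pastr", "pie"]
  docs = ["How to Bake Bread Without Recipes",
          "The Classical Art of Viennese Pastry",
          "Numerical Recipes: The Art of Scientific Computing",
          "Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes",
          "Pastry: A Book of Best French Recipes"]
  A = build_term_document_matrix(terms, docs)  # 6 x 5 matrix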

6
Simple Example
  • Let the six terms be defined as follows:
  • T1 = bak(e, ing)
  • T2 = recipes
  • T3 = bread
  • T4 = cake
  • T5 = pastr(y, ies)
  • T6 = pie

The following are the d = 5 documents:
D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes
Thus the document matrix becomes (rows T1-T6, columns D1-D5):

         D1  D2  D3  D4  D5
A  =  [   1   0   0   1   0       bak(e, ing)
          1   0   1   1   1       recipes
          1   0   0   1   0       bread
          0   0   0   1   0       cake
          0   1   0   1   1       pastr(y, ies)
          0   0   0   1   0  ]    pie
7
The Matrix A after Normalization
Normalizing each column of A to unit Euclidean
length gives the following:

      [ 0.5774  0   0   0.4082  0
        0.5774  0   1   0.4082  0.7071
        0.5774  0   0   0.4082  0
        0       0   0   0.4082  0
        0       1   0   0.4082  0.7071
        0       0   0   0.4082  0      ]
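
A sketch of this column normalization in NumPy, assuming A is the matrix built in the earlier snippet:

  # Divide each column of A by its Euclidean (2-norm) length,
  # so that every document vector has unit length.
  A = A / np.linalg.norm(A, axis=0)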
8
Making a Query
  • Next we will use the document matrix to ease our
    search for related documents.
  • Referring to our example, we will make the
    following query: "baking bread"
  • We now encode the query using the term
    definitions given before:
  • q = (1 0 1 0 0 0)^T

9
Matching the Document to the Query
  • Matching the documents to a given query is
    typically done by using the cosine of the angle
    between the query and document vectors
  • The cosine is given as follows:
  • cos(theta_j) = (a_j^T q) / (||a_j||_2 ||q||_2),
    where a_j is the j-th document column of A
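
A sketch computing these cosines for all documents at once, assuming A and the NumPy setup from the earlier snippets:

  # Query "baking bread": terms T1 (bak) and T3 (bread).
  q = np.array([1, 0, 1, 0, 0, 0], dtype=float)
  # cos(theta_j) = (a_j^T q) / (||a_j||_2 ||q||_2) for each column a_j
  cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
  print(cosines)                        # approx. [0.8165, 0, 0, 0.5774, 0]
  print(np.where(cosines >= 0.5)[0])    # documents passing the 0.5 cutoff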

10
A Query
  • By using the cosine formula we get
  • cos(theta) = (0.8165, 0, 0, 0.5774, 0) for D1-D5
  • We will set our lower limit on the cosine at 0.5.
  • Thus the query "baking bread" returns the
    following two books:
  • D1: How to Bake Bread Without Recipes
  • D4: Breads, Pastries, Pies, and Cakes: Quantity
    Baking Recipes

11
Singular Value Decomposition
  • The Singular Value Decomposition (SVD) is used to
    reduce the rank of the matrix, while also giving
    a good approximation of the information stored in
    it
  • The decomposition is written in the following
    manner:
  • A = U Σ V^T
  • where U spans the column space of A, Σ is the
    matrix with the singular values of A along its
    main diagonal, and V spans the row space of A.
    U and V are also orthogonal.

12
SVD continued
  • Unlike the QR factorization, the SVD provides us
    with a lower-rank representation of both the
    column and row spaces
  • We know A_k is the best rank-k approximation to A
    by the Eckart-Young theorem, which states:
  • ||A - A_k||_F = min over rank(B) <= k of ||A - B||_F
  • Thus the rank-k approximation of A is given as
    follows:
  • A_k = U_k Σ_k V_k^T
  • where U_k = the first k columns of U
  • Σ_k = the k x k matrix whose diagonal holds the
    decreasing singular values σ_1 ≥ σ_2 ≥ ... ≥ σ_k
  • V_k^T = the k x d matrix whose rows are the
    first k rows of V
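
A sketch of the truncated SVD in NumPy, reusing the example matrix A; np.linalg.svd returns the singular values already sorted in decreasing order:

  # Thin SVD: A = U @ diag(s) @ Vt, with s[0] >= s[1] >= ...
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 3
  # Best rank-k approximation of A in the Frobenius norm (Eckart-Young)
  A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]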

13
SVD Factorization
14
Interpretation
  • From the factorization given on the slide before
    we notice that A has only four non-zero singular
    values, so the rank-4 approximation reproduces A
    exactly
  • Also, the non-zero entries of Σ tell us that the
    first four columns of U give us a basis for the
    column space of A

15
Analysis of the Rank-k Approximations
  • Using the following formula we can calculate the
    relative error between the original matrix and
    its rank-k approximation:
  • ||A - A_k||_F / ||A||_F
  • Thus only a 19% relative error is incurred in
    moving from the rank-4 matrix to a rank-3
    approximation; however, a 42% relative error is
    incurred in moving from the rank-4 to a rank-2
    approximation
  • As expected, these errors are smaller than those
    of the corresponding rank-k approximations from
    the QR factorization
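
The relative error can be computed directly, or from the discarded singular values (a sketch reusing s, k, and A_k from the previous snippet):

  rel_err = np.linalg.norm(A - A_k, 'fro') / np.linalg.norm(A, 'fro')
  # Equivalent, since ||A - A_k||_F^2 equals the sum of the
  # squared singular values that were discarded:
  rel_err_sv = np.sqrt(np.sum(s[k:]**2) / np.sum(s**2))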

16
Using the SVD for Query Matching
  • Using the following formula we can calculate the
    cosine of the angles between the query and the
    columns of our rank-k approximation of A:
  • cos(theta_j) = ((A_k e_j)^T q) / (||A_k e_j||_2 ||q||_2)
  • Using the rank-3 approximation and the cutoff of
    0.5, we again return the first and fourth books
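
The same cosine computation applied to the columns of A_k (a sketch; unlike the normalized A, the columns of A_k are not unit vectors, so their norms must be recomputed):

  col_norms = np.linalg.norm(A_k, axis=0)
  cosines_k = (A_k.T @ q) / (col_norms * np.linalg.norm(q))
  print(np.where(cosines_k >= 0.5)[0])  # again returns D1 and D4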

17
Term-Term Comparison
  • It is possible to modify the vector space model
    for comparing queries with documents so that it
    compares terms with terms.
  • When this is added to a search engine it can act
    as a tool to refine the results
  • First we run our search as before and retrieve a
    certain number of documents; in the following
    example five documents are retrieved.
  • We then create another term-document matrix from
    the retrieved documents; call it G.

18
Another Example
Terms:
  • T1 = run(ning)
  • T2 = bike
  • T3 = endurance
  • T4 = training
  • T5 = band
  • T6 = music
  • T7 = fishes

Documents:
D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running: Training and Competition
D4: Music Law: How to Run Your Band's Business
D5: Running: Learning, Training, Competing
19
Analysis of the Term-Term Comparison
  • For this we use the following formula, the
    cosine of the angle between term vectors (rows)
    g_i and g_j of G:
  • cos(theta_ij) = (g_i · g_j) / (||g_i||_2 ||g_j||_2)
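
A sketch of this term-term comparison; the incidence matrix G below is read off the term and title lists of the previous slide (an assumption, since the transcript does not show the matrix itself):

  # Rows = terms T1..T7, columns = documents D1..D5
  G = np.array([
      [1, 1, 1, 1, 1],   # T1 run(ning): appears in every title
      [1, 0, 0, 0, 0],   # T2 bike
      [1, 0, 0, 0, 0],   # T3 endurance
      [1, 0, 1, 0, 1],   # T4 training
      [0, 0, 0, 1, 0],   # T5 band
      [0, 0, 0, 1, 0],   # T6 music
      [0, 1, 0, 0, 0],   # T7 fishes
  ], dtype=float)
  # Cosine between every pair of term vectors (rows of G)
  row_norms = np.linalg.norm(G, axis=1)
  term_cosines = (G @ G.T) / np.outer(row_norms, row_norms)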

20
Clustering
  • Clustering is the process by which terms are
    grouped if they are related, such as "bike,"
    "endurance," and "training"
  • First the terms are split into related groups
  • The vectors of the terms within each group are
    almost parallel

21
Clusters
  • In this example the first cluster is "running"
  • The second cluster is "bike," "endurance," and
    "training"
  • The third is "band" and "music"
  • And the fourth is "fishes"

22
Analyzing the Term-Term Comparison
  • We will again use the SVD rank-k approximation,
    replacing G with G_k
  • Thus the cosine of the angles becomes:
  • cos(theta_ij) = (g_i^(k) · g_j^(k)) / (||g_i^(k)||_2 ||g_j^(k)||_2),
    where g_i^(k) is the i-th row of G_k

23
Conclusion
  • Through the use of this model, many libraries and
    smaller collections can index their documents
  • However, as the next presentation will show, a
    different approach is used for large collections
    such as the Internet