Information Retrieval - PowerPoint PPT Presentation

Provided by: eugenet9

1
Information Retrieval
  • A Eugene M. Taranta II Presentation

2
Topics Of Discussion
  • IR
  • Definitions
  • Process
  • Research Areas
  • Models
  • Vector-space

3
What is IR ?
  • Information Retrieval (IR) is concerned with
    techniques that can provide effective access to
    large collections of objects containing primarily
    text [1]

4
IR Definitions
  • Object (Document): Source of information in text
    form, for example:
  • Scientific journals
  • Emails
  • News articles
  • Manuals
  • Etc.

5
IR Definitions
  • Object Descriptions contain a set of attributes
    that describe the content of a document
  • Object Descriptions may be the actual object
    itself in the case of smaller documents

6
IR Definitions
  • Information Need: A set of attributes that define
    a researcher's interest
  • Query: A description of the Information Need
  • Often expressed in natural language

7
IR
  • IR is not database query processing
  • Database query processing:
  • Makes no use of natural language
  • Uses well-defined semantics
  • Example: an inventory system

8
IR Process
  1. Generate a representation for each object
     description
  2. Generate a representation of the information need
  3. Compare 1 and 2 to select the objects that most
     likely satisfy the information need

9
Areas of IR Research
  1. Document Representation
  2. Information Need Representation
  3. Comparison of 1 and 2
  4. Evaluation

10
IR Models
  • An IR Model defines how the IR Process is achieved
  • Four Common Models:
  • Boolean
  • Cluster-Based Retrieval
  • Vector Space Retrieval
  • Probabilistic Retrieval

11
IR Model: Boolean
  • Basis for most commercial IR Systems
  • Object Description is defined by a set of Boolean
    attributes
  • Information Need described by a Boolean expression
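
The Boolean model above can be sketched in a few lines of Python. This is only an illustration, assuming each object description is the set of terms present in the document (an attribute is true when its term is present). The documents and the helper name `matches` are hypothetical and do not come from the slides.

```python
# Hypothetical object descriptions: the set of terms present in each document.
docs = {
    0: {"apple", "computer"},
    1: {"adam", "eve"},
    2: {"apple", "eve", "computer"},
}

def matches(terms, required, excluded=frozenset()):
    """True when every required term is present and no excluded term is."""
    return required <= terms and not (excluded & terms)

# Information need as a Boolean expression: apple AND computer AND NOT eve
result = sorted(doc_id for doc_id, terms in docs.items()
                if matches(terms, {"apple", "computer"}, {"eve"}))
print(result)  # [0]
```

The result is an unordered set of matching documents: a document either satisfies the expression or it does not, so nothing in the output says which match is better.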

12
IR Model: Boolean
  • No ranking system
  • Difficult to use
  • Low precision
  • These drawbacks stem from the lack of attribute
    weights

13
IR Model: Cluster-Based
  • Assumes similar documents satisfy the same
    information need
  • Create clusters of like documents
  • Each cluster gets an average representation
  • Retrieval returns clusters rather than
    independent documents
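
A minimal sketch of the "average representation" idea, assuming each document is already a vector of term weights and that a query is compared to each cluster by dot product. The slides do not specify either choice, and the cluster contents are hypothetical.

```python
def centroid(vectors):
    """Average representation of a cluster: the component-wise mean."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical clusters of term-weight vectors (one vector per document).
clusters = {
    "fruit": [[1.0, 0.0, 0.0], [3.0, 0.0, 2.0]],
    "people": [[0.0, 2.0, 0.0], [0.0, 4.0, 2.0]],
}
centroids = {name: centroid(vecs) for name, vecs in clusters.items()}

# Retrieval returns the best-matching cluster, not an individual document.
query = [1.0, 0.0, 0.0]
best = max(centroids, key=lambda name: dot(centroids[name], query))
print(centroids["fruit"], best)  # [2.0, 0.0, 1.0] fruit
```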

14
IR Model: Vector-Space
  • Individual docs are described as vectors
  • Elements of the vectors are term weights
  • Weights are determined from frequency models
  • These models are generally considered ad hoc

15
IR Model: Probabilistic
  • Based on the Probability Ranking Principle
  • Overall effectiveness is best when docs are
    ranked in decreasing order of their probability
    of relevance
  • Involves estimating the probability of relevance

16
IR Model: Notes
  • Vector-space and Probabilistic share similarities
    with Boolean
  • Both support partial matching and may use Boolean
    logic in queries
  • Probabilistic contributed to our understanding
    of term weighting, ranking and relevance feedback

17
IR Architecture
18
Vector-space: The Pieces
  • Inverse Document Frequency (idf): A weighting
    factor based on how many documents in the
    collection contain a term
  • Term Frequency (tf): Number of times a term
    appears in a document

19
Vector-space: The Pieces
  • idf can be calculated as
  • idf = log( d / df )
  • d = number of documents in the collection
  • df = document frequency (number of documents
    that contain the term)
  • idf gives us a sense of term importance
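
The idf formula above as a short Python sketch. The base of the logarithm is not stated on the slide, so base 10 is an assumption here; the 5-document collection is hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency: log(d / df), base 10 assumed."""
    return math.log10(num_docs / doc_freq)

# Hypothetical collection of 5 documents: rarer terms get a larger idf.
for df in (1, 2, 3, 5):
    print(df, round(idf(5, df), 2))
# 1 0.7
# 2 0.4
# 3 0.22
# 5 0.0
```

A term that appears in every document gets an idf of 0, so it contributes nothing to a relevance score.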

20
Vector-space
21
Vector-space
22
Vector-space
  • Cell value = tf(Doc i) × idf
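
A small sketch of how such a document-by-term matrix could be filled in, assuming whitespace-tokenized documents, alphabetical column order, and a base-10 logarithm. The three-document collection is hypothetical, not the one on the slides.

```python
import math
from collections import Counter

# Hypothetical tokenized collection.
docs = [
    ["apple", "apple", "eve"],
    ["adam", "eve"],
    ["apple"],
]
terms = sorted({t for doc in docs for t in doc})        # column order
d = len(docs)
df = {t: sum(t in doc for doc in docs) for t in terms}  # document frequency
idf = {t: math.log10(d / df[t]) for t in terms}

# Row i, column j holds tf(term j in doc i) * idf(term j).
matrix = []
for doc in docs:
    tf = Counter(doc)  # missing terms count as 0
    matrix.append([tf[t] * idf[t] for t in terms])

for row in matrix:
    print([round(v, 2) for v in row])
```

Most cells are zero in a realistic collection, which is why sparse storage matters.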

23
Vector-space
  • Three Representation Vectors
  • Non-zero vector
  • Column-index vector
  • Row vector
  • Vector-space algorithms
  • Coordinate Storage (COO)
  • Compressed Sparse Column (CSC)
  • Compressed Sparse Row (CSR)
  • Block Sparse Row (BSR)
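
To make the three CSR vectors concrete, here is a sketch that builds them from a dense matrix. It assumes the row vector stores, for each row, the position in the non-zero vector where that row starts, plus one final entry marking the end. The matrix values are hypothetical.

```python
def to_csr(dense):
    """Convert a dense row-major matrix into the three CSR vectors."""
    non_zero, col_index, row_ptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                non_zero.append(value)   # the stored weight
                col_index.append(col)    # which term (column) it belongs to
        row_ptr.append(len(non_zero))    # where the next row starts
    return non_zero, col_index, row_ptr

# Hypothetical 3-document, 3-term weight matrix.
dense = [
    [0.44, 0.8, 0.0],
    [0.0, 0.0, 1.4],
    [0.22, 0.0, 0.0],
]
non_zero, col_index, row_ptr = to_csr(dense)
print(non_zero)   # [0.44, 0.8, 1.4, 0.22]
print(col_index)  # [0, 1, 2, 0]
print(row_ptr)    # [0, 2, 3, 4]
```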

24
CSR non-zero vector
25
CSR column vector
26
CSR row vector
27
Vector-space Query
  • Query Vector
  • Entry for each term
  • Value is based on idf

Information Need: Researcher is looking for
information on Apple Computers
Query: Apple Computer
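
A sketch of building the query vector for this example, assuming the query is lowercased and split on whitespace and that terms outside the vocabulary are ignored. The term IDs and idf values below are hypothetical.

```python
def query_vector(query_terms, term_ids, idf):
    """One entry per vocabulary term; query terms get their idf, others 0."""
    q = [0.0] * len(term_ids)
    for term in query_terms:
        if term in term_ids:  # ignore terms outside the vocabulary
            q[term_ids[term]] = idf[term]
    return q

# Hypothetical vocabulary and idf values.
term_ids = {"apple": 0, "eve": 1, "adam": 2, "computer": 3}
idf = {"apple": 0.22, "eve": 0.7, "adam": 0.7, "computer": 0.4}

q = query_vector("Apple Computer".lower().split(), term_ids, idf)
print(q)  # [0.22, 0.0, 0.0, 0.4]
```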
28
CSR Algorithm
    M ← number of documents
    for count ← 0 to M − 1                        // examine each doc
        temp ← 0
        row_idx ← row_vector[count]               // pointer to where the row starts in the non-zero vector
        while row_idx < row_vector[count + 1]     // examine each term in the doc
            col_idx ← col_vector[row_idx]
            temp ← temp + (non_zero_vector[row_idx] × Q[col_idx])   // if the term is in the query, add relevance
            row_idx ← row_idx + 1
        CSR_output[count] ← temp                  // save doc result
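
The same loop as a runnable Python sketch. It assumes the row vector has one entry per document plus a final end marker, so that row `doc` occupies positions `row_ptr[doc]` up to but not including `row_ptr[doc + 1]`. The data and the function name `csr_scores` are hypothetical.

```python
def csr_scores(non_zero, col_index, row_ptr, q):
    """Relevance score per document: sparse matrix times query vector."""
    scores = []
    for doc in range(len(row_ptr) - 1):                   # examine each doc
        total = 0.0
        for k in range(row_ptr[doc], row_ptr[doc + 1]):   # each stored term
            total += non_zero[k] * q[col_index[k]]        # 0 if not in query
        scores.append(total)                              # save doc result
    return scores

# Hypothetical 3-document collection over 3 terms, in CSR form.
non_zero = [0.44, 0.8, 1.4, 0.22]
col_index = [0, 1, 2, 0]
row_ptr = [0, 2, 3, 4]
q = [0.22, 0.0, 0.0]   # query contains only term 0

scores = csr_scores(non_zero, col_index, row_ptr, q)
print([round(s, 3) for s in scores])  # [0.097, 0.0, 0.048]
```

Only stored (non-zero) weights are visited, so the work is proportional to the number of non-zero cells rather than to documents times terms.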
29
CSR Example Query
  • Examine Doc 0: (0.44 × 0.22) + (0.8 × 0) = 0.097
  • Examine Doc 1: (0.8 × 0) + (1.4 × 0) = 0
  • Examine Doc 2: (0.22 × 0.22) + (0.7 × 0) + (0.4 × 0.4) = 0.21
  • Examine Doc 3: (0.22 × 0.22) + (0.7 × 0) + (0.7 × 0) + (0.7 × 0) = 0.05
  • Examine Doc 4: (0.4 × 0.4) + (0.7 × 0) = 0.16
30
CSR Example Query Result
  • Doc 0: 0.097
  • Doc 1: 0
  • Doc 2: 0.21
  • Doc 3: 0.05
  • Doc 4: 0.16
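
Turning these scores into a ranked result list takes one sort. The sketch below uses the five scores from the example; dropping documents that score 0 is an assumption, not something the slides state.

```python
# Relevance scores from the worked example, indexed by document number.
scores = [0.097, 0.0, 0.21, 0.05, 0.16]

# Sort document numbers by decreasing score; drop documents that score 0.
ranking = sorted((doc for doc, s in enumerate(scores) if s > 0),
                 key=lambda doc: scores[doc], reverse=True)
print(ranking)  # [2, 4, 0, 3]
```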
31
Summary
  • Information Retrieval uses natural language to
    query a collection of documents
  • Four common models are Boolean, Cluster-Based,
    Vector-Space, and Probabilistic
  • Vector-space in CSR form uses three representation
    vectors (non-zero, column and row) and one query
    vector
  • Vector multiplication yields a relevance score
    by which we can rank results

32
References
  • [1] Croft, W. Bruce and Turtle, Howard R., "Text
    Retrieval and Inference", in Text-Based Intelligent
    Systems: Current Research and Practice in
    Information Extraction and Retrieval, 1992,
    pp. 127-155.
  • [2] Goharian, Nazli; Jain, Ankit; and Sun, Qian,
    "Comparative Analysis of Sparse Matrix Algorithms
    for Information Retrieval", Journal of Systemics,
    Cybernetics and Informatics, 2003.
  • [3] Salton, Gerard, Automatic Text Processing: The
    Transformation, Analysis, and Retrieval by
    Computer, Addison-Wesley Pub Co, 1998.