Information Retrieval Models - PowerPoint PPT Presentation

Slides: 20
Provided by: GOG50

Transcript and Presenter's Notes

1
Information Retrieval Models
2
Retrieval Models
  • A retrieval model specifies the details of:
    • Document representation
    • Query representation
    • Retrieval function
  • Determines a notion of relevance.
  • The notion of relevance can be binary or continuous
    (i.e. ranked retrieval).

3
Classes of Retrieval Models
  • Boolean models (set theoretic)
    • Extended Boolean
  • Vector space models (statistical/algebraic)
    • Generalized VS
    • Latent Semantic Indexing
  • Probabilistic models

4
Other Model Dimensions
  • User Task
    • Retrieval
    • Browsing
  • Logical View of Documents
    • Index terms
    • Full text
    • Full text + structure (e.g. hypertext)

5
Retrieval and browsing
  • The User Task
    • Retrieval
      • information or data
      • purposeful.
    • Browsing
      • glancing around
      • e.g. F1 cars, Le Mans, France, tourism.

6
Logical View of documents
  • Logical view of the documents
  • Document representation is viewed as a continuum:
    the logical view of docs might shift.

[Figure: text operations applied to docs. Labels: structure, accents and spacing, stopwords, noun groups, stemming, manual indexing]
7
Typical IR task
[Figure: typical IR task. Labels: Docs, Index Terms, doc, Information Need, query, match, Ranking]
8
IR keyword match
  • Matching at index term level is quite imprecise
  • No surprise that users are frequently
    dissatisfied
  • Since most users have no training in query
    formation, the problem is even worse
  • Frequent dissatisfaction of Web users
  • The issue of deciding relevance is critical for IR
    systems: ranking.

9
Ranking
  • A ranking is an ordering of the documents
    retrieved that (hopefully) reflects the relevance
    of the documents to the user query
  • A ranking is based on fundamental premises
    regarding the notion of relevance, such as:
    • common sets of index terms
    • sharing of weighted terms
    • likelihood of relevance.
  • Each set of premises leads to a distinct IR model.
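
The first premise above (common sets of index terms) can be sketched as a very small ranking function. This is only an illustration, not a method taken from the slides: the function name `rank_by_shared_terms`, the sample documents, and the choice to score by raw overlap count are all assumptions.

```python
def rank_by_shared_terms(query_terms, docs):
    """Order doc ids by how many index terms they share with the query.

    Hypothetical helper: scoring by overlap size is an assumed, simplest
    possible reading of the "common sets of index terms" premise.
    """
    query = set(query_terms)
    scored = []
    for doc_id, terms in docs.items():
        overlap = len(query & set(terms))
        if overlap > 0:  # retrieve only docs sharing at least one term
            scored.append((doc_id, overlap))
    # Highest overlap first; ties are broken by doc id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Made-up documents, each already reduced to a list of index terms.
docs = {
    "d1": ["retrieval", "model", "ranking"],
    "d2": ["browsing", "hypertext"],
    "d3": ["ranking", "relevance", "retrieval", "query"],
}
ranking = rank_by_shared_terms(["ranking", "retrieval", "query"], docs)
print(ranking)
```

Swapping the overlap count for a sum of term weights, or for an estimated probability of relevance, would correspond to the other two premises in the list.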

10
IR Models
[Figure: User Task. Retrieval (ad hoc, filtering) and Browsing]
11
IR Models
  • The IR model, the logical view of the docs, and
    the retrieval task are distinct aspects of the
    system.

12
Classic IR Models
  • Traditional IR systems employ a set of index
    terms to represent the documents
  • The key idea is that the document semantics can
    be represented by the index terms
  • Usual formal models:
    • Boolean
    • Vector-space
    • Probabilistic.

13
Classic IR Models
  • Each document is represented by a set of
    representative index terms
  • An index term is a document word useful for
    remembering the document's main themes
  • Usually, index terms are nouns because nouns have
    meaning by themselves
  • However, search engines assume that all words are
    index terms (full text representation).

14
Classic IR Models
  • Not all terms are equally useful for representing
    the document contents: less frequent terms allow
    identifying a narrower set of documents
  • The importance of the index terms is represented
    by weights associated with them
  • Let:
    • $k_i$ be an index term
    • $d_j$ be a document
    • $w_{ij}$ be a weight associated with $(k_i, d_j)$
  • The weight $w_{ij}$ quantifies the importance of the
    index term for describing the document contents.
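
To make the point about less frequent terms concrete, here is a minimal sketch that counts how many documents contain each term. The slides do not say how weights are computed, so the formula log(N / n_i) used in `rarity_weight` is an assumption chosen for illustration, and the helper names and sample data are hypothetical.

```python
import math

# Made-up collection: each document is a list of index terms.
docs = {
    "d1": ["information", "retrieval", "model"],
    "d2": ["information", "browsing"],
    "d3": ["information", "retrieval", "ranking"],
    "d4": ["information", "hypertext"],
}


def docs_containing(term, docs):
    """Return the set of doc ids whose index terms include `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


def rarity_weight(term, docs):
    """Assumed weight log(N / n_i): N docs in total, n_i docs containing the term.

    This formula is not from the slides; it only illustrates that a rarer
    term can be given a larger weight.
    """
    n_i = len(docs_containing(term, docs))
    return math.log(len(docs) / n_i) if n_i else 0.0


for term in ("information", "retrieval", "ranking"):
    print(term, sorted(docs_containing(term, docs)), round(rarity_weight(term, docs), 3))
```

In this toy collection a term that occurs in every document selects nothing narrower than the whole collection and gets weight zero, while a term found in a single document gets the largest weight.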

15
Classic IR Models
  • $k_i$ is an index term
  • $d_j$ is a document
  • $t$ is the total number of terms
  • $N$ is the total number of docs
  • $K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index
    terms
  • $w_{ij} > 0$ is a weight associated with $(k_i, d_j)$
  • $w_{ij} = 0$ indicates that the term does not belong to
    the doc
  • $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted
    vector associated with the document $d_j$
  • $g_i(\vec{d}_j) = w_{ij}$ is a function which returns
    the weight associated with the pair $(k_i, d_j)$.
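
The notation above maps directly onto a small data structure. The sketch below is an illustration under stated assumptions: raw term counts stand in for the weights (the slides leave the weighting scheme open), indices are 0-based rather than 1-based, and `build_vectors` and `g` are hypothetical names.

```python
from collections import Counter


def build_vectors(docs):
    """Return (K, vectors): the vocabulary and one weight vector per document.

    Assumption: the weight of term i in document j is the raw count of the
    term in the document, so a weight of 0 means the term does not occur.
    """
    # K: the set of all index terms, sorted so that position i is fixed.
    vocabulary = sorted({term for terms in docs.values() for term in terms})
    vectors = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)  # a Counter returns 0 for missing terms
        vectors[doc_id] = [counts[term] for term in vocabulary]
    return vocabulary, vectors


def g(i, vector):
    """Return the weight of the i-th index term (0-based) in a document vector."""
    return vector[i]


# Made-up documents, already reduced to index terms.
docs = {
    "d1": ["retrieval", "model", "retrieval"],
    "d2": ["ranking", "model"],
}
vocabulary, vectors = build_vectors(docs)
print(vocabulary)
print(vectors)
print(g(vocabulary.index("retrieval"), vectors["d1"]))
```

Every vector has length t (the vocabulary size), and there are N vectors, one per document.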

16
Retrieval Tasks
  • Ad hoc retrieval: Fixed document corpus, varied
    queries.
  • Filtering: Fixed query, continuous document
    stream.
    • User Profile: a model of relatively static
      preferences.
    • Binary decision of relevant/not-relevant.
  • Routing: Same as filtering but continuously
    supplies ranked lists rather than binary filtering.
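
The contrast between ad hoc retrieval and filtering can be sketched as follows. This is a simplified illustration, not code from the presentation: relevance is assumed to mean "shares at least one term", and the names `matches`, `ad_hoc`, and `filter_stream` as well as the sample data are hypothetical. Routing is left out; it would score each incoming document instead of making a yes/no decision.

```python
def matches(query_terms, doc_terms):
    """Assumed binary relevance test: at least one term in common."""
    return bool(set(query_terms) & set(doc_terms))


def ad_hoc(collection, queries):
    """Fixed collection, varied queries: answer each query against the same docs."""
    return {
        name: [doc_id for doc_id, terms in collection.items() if matches(query, terms)]
        for name, query in queries.items()
    }


def filter_stream(profile_terms, stream):
    """Fixed profile, continuous stream: decide relevant or not as each doc arrives."""
    for doc_id, terms in stream:
        if matches(profile_terms, terms):
            yield doc_id


# Made-up data for illustration.
collection = {"d1": ["f1", "cars"], "d2": ["tourism", "france"]}
queries = {"Q1": ["cars"], "Q2": ["france", "cars"]}
stream = [("s1", ["tourism"]), ("s2", ["cars"]), ("s3", ["le", "mans", "cars"])]

answers = ad_hoc(collection, queries)
kept = list(filter_stream(["cars"], stream))
print(answers)
print(kept)
```

Writing `filter_stream` as a generator reflects that the stream has no fixed end: each document is judged once, when it arrives, against a profile that stays the same.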

17
Retrieval: Ad Hoc vs. Filtering
  • Ad hoc retrieval

[Figure: queries Q1 to Q5 issued against a collection of fixed size]
18
Retrieval: Ad Hoc vs. Filtering
  • Filtering

[Figure: a documents stream checked against User 1 Profile and User 2 Profile, producing docs for User 1 and docs filtered for User 2]
19
Common Preprocessing Steps
  • Strip unwanted characters/markup (e.g. HTML
    tags, punctuation, numbers, etc.).
  • Break into tokens (keywords) on whitespace.
  • Stem tokens to root words
    • computational → comput
  • Remove common stopwords (e.g. a, the, it, etc.).
  • Detect common phrases (possibly using a domain
    specific dictionary).
  • Build inverted index (keyword → list of docs
    containing it).
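
The steps above can be sketched end to end in a few lines. Everything here is a simplification under stated assumptions: `crude_stem` is a toy suffix stripper and not a real stemming algorithm, `STOPWORDS` is a tiny made-up list, stopwords are removed before stemming for convenience, and phrase detection is skipped. All function names are hypothetical.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "of", "is"}  # tiny illustrative list (assumption)


def strip_markup(text):
    """Replace HTML tags, punctuation, and digits with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    return re.sub(r"[^A-Za-z\s]", " ", text)  # drop punctuation and numbers


def crude_stem(token):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in ("ational", "ation", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Strip markup, tokenize on whitespace, drop stopwords, stem."""
    tokens = strip_markup(text).lower().split()
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


def build_inverted_index(docs):
    """Map each keyword to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Made-up documents for illustration.
docs = {
    "d1": "<p>The computational models, 2 of them.</p>",
    "d2": "<b>Computing</b> a ranking of the models",
}
index = build_inverted_index(docs)
print(index)
```

Looking up a query term in `index` then returns the candidate documents directly, without scanning the text of every document.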