Title: Information Retrieval


1
Information Retrieval
  • Information retrieval (IR) is the term applied to such areas as
  • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications, etc.
  • These systems are typical of variable-length record systems.
  • Text retrieval is a subset of information retrieval.
  • Research articles may use the term IR to mean text retrieval, especially in the 70s, 80s and 90s.

2
Text Retrieval - Overview
  • Information retrieval
  • a branch of database theory
  • specialises in managing the retrieval of unstructured data
  • large amounts of free-format text.
  • Response to a query
  • does not answer the query directly
  • identifies relevant information.

Information Retrieval Techniques are LANGUAGE
specific.
3
Retrieval Process
4
Purpose of Indexing
  • a sufficiently general description of a document
    so that it can be retrieved with queries that
    concern the same subject as the document
  • sufficiently specific description so that the
    document will not be returned for those queries
    which are not related to the document.

5
Automatic Indexing - A Basic Method
  • Assume that a document consists of just text and
    that we will derive our indexing terms from this
    text.
  • Break the text up into words, casefold, and index
    on every word. This technique is very simple and
    performs reasonably well.

6
Automatic Indexing - Refinement
  • Language dependent.
  • refinement for English will be different from that for Chinese
  • Stop List
  • Stemming
  • Term Weighting

7
Indexing Refinement Stop List
  • A list of common words.
  • Generally contains words that are not nouns,
    verbs, adjectives and adverbs.
  • A stop list might consist of a, the, an, is, be, ...
  • Common stop lists run from 10 to hundreds of words.
  • It does not matter much exactly which words are listed; around 300 common words will typically do well.
  • The indexing process ignores the words listed in the stop list.

8
Stop Lists
  • Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
  • Fox, C. (1992). Lexical Analysis and Stoplists. In Frakes, W.B. and Baeza-Yates, R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
  • The first 20 stop words
  • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.

9
Refinement - Stemming
  • Stemming attempts to accommodate the many variant word forms that make up a single concept.
  • This avoids exceedingly long OR query statements.
  • Example: inquiry OR inquired OR inquiries
  • The process is performed after the stop list process.
  • Porter stemming algorithm
  • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

10
Stemming - Suffix
  • Most English meaning shifts for grammatical
    purposes are handled by suffixes
  • Most retrieval systems allow for trailing (suffix) truncation.
  • Example
  • inquir will retrieve documents containing the words inquire, inquired, inquires, inquiring, inquiry, etc.

11
Stemming - Prefix
  • Prefix stemming is usually not used in English text retrieval systems.
  • A prefix is a substantial modifier, sometimes even a negation.
  • Example
  • flammable and inflammable.
  • Prefix stemming may be useful in Chemical
    databases.

12
Stemming Exception List
  • Irregularity in the language needs to be implemented as a lookup list.
  • Example
  • Irregular plurals
  • woman → women
  • child → children
  • Past tense
  • choose → chose
  • find → found

13
Weighting Terms
  • Having decided on a set of terms for indexing, we
    need to consider whether all terms should be
    given the same significance. If not, how should
    we decide on their significance?

14
Weighting Terms - tf
  • Let tfij be the term frequency of term i in document j. The more often a term appears in a document, the more likely it is to be a highly significant index term.

15
Weighting Terms - df idf
  • Let dfi be the document frequency of the i-th term, i.e. the number of documents containing that term.
  • Since significance increases as the document frequency decreases, we use the inverse document frequency idfi = log(N / dfi), where N is the number of documents in the database.

16
Weighting Terms - tf. idf
  • The above two indicators are very often multiplied together to form the tf.idf weight, wij = tfij × idfi.

17
Example
  • Consider a 5 document collection:
  • D1: Dogs eat the same things that cats eat
  • D2: No dog is a mouse
  • D3: Mice eat little things
  • D4: Cats often play with rats and mice
  • D5: Cats often play, but not with other cats

18
Example - Cont.
  • We might generate the following index sets:
  • V1 = (dog, eat, cat)
  • V2 = (dog, mouse)
  • V3 = (mouse, eat)
  • V4 = (cat, play, rat, mouse)
  • V5 = (cat, play)
  • System dictionary: (cat, dog, eat, mouse, play, rat)

19
Example-Cont
  • dfcat = 3, idfcat = ln(5/3) = 0.51
  • dfdog = 2, idfdog = ln(5/2) = 0.91
  • dfeat = 2, idfeat = ln(5/2) = 0.91
  • dfmouse = 3, idfmouse = ln(5/3) = 0.51
  • dfplay = 2, idfplay = ln(5/2) = 0.91
  • dfrat = 1, idfrat = ln(5/1) = 1.61
20
Example-Cont
  • V1 = (cat, eat, dog)
  • wcat = tfcat × idfcat = 1 × 0.51 = 0.51
  • wdog = tfdog × idfdog = 1 × 0.91 = 0.91
  • weat = tfeat × idfeat = 2 × 0.91 = 1.82
  • V2 = (dog, mouse)
  • wdog = tfdog × idfdog = 1 × 0.91 = 0.91
  • wmouse = tfmouse × idfmouse = 1 × 0.51 = 0.51

21
Example-Cont
  • V3 = (mouse, eat)
  • wmouse = tfmouse × idfmouse = 1 × 0.51 = 0.51
  • weat = tfeat × idfeat = 1 × 0.91 = 0.91
  • V4 = (cat, mouse, play, rat)
  • wcat = tfcat × idfcat = 1 × 0.51 = 0.51
  • wplay = tfplay × idfplay = 1 × 0.91 = 0.91
  • wrat = tfrat × idfrat = 1 × 1.61 = 1.61
  • wmouse = tfmouse × idfmouse = 1 × 0.51 = 0.51

22
Example-Cont
  • V5 = (cat, play)
  • wcat = tfcat × idfcat = 2 × 0.51 = 1.02
  • wplay = tfplay × idfplay = 1 × 0.91 = 0.91

23
Example - cont.
  • Dictionary: (cat, dog, eat, mouse, play, rat)
  • Weights (components in dictionary order):
  • V1: cat(0.51), dog(0.91), eat(1.82), 0, 0, 0
  • V2: 0, dog(0.91), 0, mouse(0.51), 0, 0
  • V3: 0, 0, eat(0.91), mouse(0.51), 0, 0
  • V4: cat(0.51), 0, 0, mouse(0.51), play(0.91), rat(1.61)
  • V5: cat(1.02), 0, 0, 0, play(0.91), 0

24
Retrieval Process
25
Retrieval Paradigms
  • How do we match?
  • Produce non-ranked output
  • Boolean retrieval
  • Produce ranked output
  • vector space model
  • probabilistic retrieval

26
Advantages of Ranking
  • Good control over how many documents are viewed
    by a user.
  • Good control over the order in which documents are viewed by a user.
  • The first documents that are viewed may help
    modify the order in which later documents are
    viewed.
  • The main disadvantage is computational cost.

27
The Vector Space Model
  • Each document and query is represented by a
    vector. A vector is obtained for each document
    and query from sets of index terms with
    associated weights.
  • The document and query representatives are
    considered as vectors in n dimensional space
    where n is the number of unique terms in the
    dictionary/document collection.
  • Measuring vector similarity
  • value of cosine of the angle between the two
    vectors.

28
Vector Space
  • Assume that a document is represented by vector D and the query by vector Q.
  • The total number of terms in the dictionary is n.
  • The similarity between D and Q is measured by the angle θ between them.

29
Cosine
  • The similarity between D and Q can be written as cos θ = (D · Q) / (|D| × |Q|)
  • Using the weights of the terms as the components of D and Q: cos θ = Σi (wdi × wqi) / (√(Σi wdi²) × √(Σi wqi²))

30
Simple Example (1)
  • Assume
  • there are 2 terms in the dictionary (t1, t2)
  • Doc-1 contains t1 and t2, with weights 0.5 and 0.3 respectively.
  • Doc-2 contains t1 with weight 0.6.
  • Doc-3 contains t2 with weight 0.4.
  • Query contains t2 with weight 0.5.

31
Simple Example (2)
  • The vectors for the query and documents (components in the order t1, t2):
  • Doc-1 = (0.5, 0.3)
  • Doc-2 = (0.6, 0.0)
  • Doc-3 = (0.0, 0.4)
  • Query = (0.0, 0.5)
32
Simple Example - Cosine
Similarity is measured between the Query (Q) and each of Doc-1, Doc-2 and Doc-3; the arithmetic is worked out in the sketch below.
Ranked output: D3, D1, D2
33
Large Example (1)
  • Consider the same five document collection:
  • D1: Dogs eat the same things that cats eat
  • D2: No dog is a mouse
  • D3: Mice eat little things
  • D4: Cats often play with rats and mice
  • D5: Cats often play, but not with other cats
  • Indexed by:
  • V1 = (dog, eat, cat)
  • V2 = (dog, mouse)
  • V3 = (mouse, eat)
  • V4 = (cat, play, rat, mouse)
  • V5 = (cat, play)

34
Large Example (2)
  • The set of all terms (dictionary): (cat, dog, eat, mouse, play, rat)
  • Using tf.idf weights, we obtain:
  • V1 = (cat(0.51), eat(1.82), dog(0.91))
  • V2 = (dog(0.91), mouse(0.51))
  • V3 = (mouse(0.51), eat(0.91))
  • V4 = (cat(0.51), play(0.91), rat(1.61), mouse(0.51))
  • V5 = (cat(1.02), play(0.91))

35
Large Example (3)
  • In the vector space model, we obtain the vectors (components in dictionary order):
  • D1 = (0.51, 0.91, 1.82, 0.00, 0.00, 0.00)
  • D2 = (0.00, 0.91, 0.00, 0.51, 0.00, 0.00)
  • D3 = (0.00, 0.00, 0.91, 0.51, 0.00, 0.00)
  • D4 = (0.51, 0.00, 0.00, 0.51, 0.91, 1.61)
  • D5 = (1.02, 0.00, 0.00, 0.00, 0.91, 0.00)
  • 6 dimensional space for 6 terms

36
Cosine Similarity
  • The query "what do cats play with?" forms a query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)
  • Using the cosine measure (cm), we obtain the following similarity measures:
  • D1 = 0.51² / (√(0.51² + 0.91²) × √(0.51² + 0.91² + 1.82²))
  • D2 = 0.0
  • D3 = 0.0
  • D4 = (0.51² + 0.91²) / (√(0.51² + 0.91²) × √(0.51² + 0.51² + 0.91² + 1.61²))
  • D5 = (0.51 × 1.02 + 0.91²) / (√(0.51² + 0.91²) × √(1.02² + 0.91²))
  • Thus we obtain the ranking D5, D4, D1, D2, D3 (or D3, D2).

37
Retrieval Model Improvement for Web-Based IR
  • Utilise the popularity of a page.
  • If many other pages point to a page, that page must be important, and we can assign it a high weight during search.
  • If a page is pointed to by a popular page, it can also be considered important, because it is referred to by a reputable source (a popular page).
  • PageRank function.

38
PageRank Example
39
Retrieval Model Improvement for Web-Based IR
  • Utilise the anchor text.
  • Anchors often provide more accurate descriptions
    of web pages than the pages themselves.
  • Anchors may exist for documents which cannot be
    indexed by a text-based search engine.
  • Utilise the appearance of the text.
  • Words in larger or bolder fonts are weighted higher than other words.