Title: Information Retrieval


1
Information Retrieval
  • Information retrieval (IR) is the term applied to such areas as
  • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications, etc.
  • These systems are typical of variable-length record systems.
  • Text retrieval is a subset of information retrieval.
  • Research articles may use the term IR to mean text retrieval, especially in the 70s, 80s and 90s.

2
Text Retrieval - Overview
  • Information retrieval
  • a branch of database theory
  • specialises in managing the retrieval of unstructured data
  • large amounts of free-format text.
  • Response to a query
  • does not answer the query directly
  • identifies relevant information.

Information Retrieval Techniques are LANGUAGE
specific.
3
Retrieval Process
4
Purpose of Indexing
  • a sufficiently general description of a document
    so that it can be retrieved with queries that
    concern the same subject as the document
  • sufficiently specific description so that the
    document will not be returned for those queries
    which are not related to the document.

5
Automatic Indexing - A Basic Method
  • Assume that a document consists of just text and
    that we will derive our indexing terms from this
    text.
  • Break the text up into words, casefold, and index
    on every word. This technique is very simple and
    performs reasonably well.

6
Automatic Indexing - Refinement
  • Language dependent.
  • refinement for English will be different from that for Chinese
  • Stop List
  • Stemming
  • Term Weighting

7
Indexing Refinement Stop List
  • A list of common words.
  • Generally contains words that are not nouns,
    verbs, adjectives and adverbs.
  • A stop list might consist of a, the, an, is, be, ...
  • Common stop lists run from 10 to hundreds of words.
  • It does not matter much exactly which words are listed; around 300 common words will typically do well.
  • The indexing process ignores the words listed in the stop list.

8
Stop Lists
  • Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
  • Fox, C. (1992). Lexical Analysis and Stoplists. In Frakes, W.B. and Baeza-Yates, R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
  • The first 20 stop words
  • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.

9
Refinement - Stemming
  • Stemming attempts to accommodate the many variant word forms that make up a single concept.
  • This avoids exceedingly long OR query statements.
  • Example: inquiry OR inquired OR inquiries
  • The process is performed after the stop list process.
  • Porter stemming algorithm
  • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

10
Stemming - Suffix
  • Most English meaning shifts for grammatical
    purposes are handled by suffixes
  • Most retrieval systems allow for trailing (suffix) truncation.
  • Example
  • inquir will retrieve documents containing the words inquire, inquired, inquires, inquiring, inquiry, etc.

11
Stemming - Prefix
  • Prefix stemming is usually not used in English text retrieval systems.
  • A prefix is a substantial modifier, sometimes even a negation.
  • Example
  • flammable and inflammable.
  • Prefix stemming may be useful in Chemical
    databases.

12
Stemming Exception List
  • Irregularity in the language needs to be implemented as a lookup list.
  • Example
  • Irregular plurals
  • woman → women
  • child → children
  • Past tense
  • choose → chose
  • find → found

13
Weighting Terms
  • Having decided on a set of terms for indexing, we
    need to consider whether all terms should be
    given the same significance. If not, how should
    we decide on their significance?

14
Weighting Terms - tf
  • Let tfij be the term frequency of term i in document j. The more often a term appears in a document, the more likely it is to be a highly significant index term.

15
Weighting Terms - df idf
  • Let dfi be the document frequency of the i-th term, i.e. the number of documents containing that term.
  • Since significance increases as the document frequency decreases, we use the inverse document frequency idfi = log(N / dfi), where N is the number of documents in the database.

16
Weighting Terms - tf. idf
  • The above two indicators are very often multiplied together to form the tf.idf weight, wij = tfij × idfi.

17
Example
  • Consider a 5 document collection:
  • D1: Dogs eat the same things that cats eat
  • D2: No dog is a mouse
  • D3: Mice eat little things
  • D4: Cats often play with rats and mice
  • D5: Cats often play, but not with other cats

18
Example - Cont.
  • We might generate the following index sets:
  • V1 = (dog, eat, cat)
  • V2 = (dog, mouse)
  • V3 = (mouse, eat)
  • V4 = (cat, play, rat, mouse)
  • V5 = (cat, play)
  • System dictionary: (cat, dog, eat, mouse, play, rat)

19
Example-Cont
  • dfcat = 3, idfcat = ln(5/3) = 0.51
  • dfdog = 2, idfdog = ln(5/2) = 0.91
  • dfeat = 2, idfeat = ln(5/2) = 0.91
  • dfmouse = 3, idfmouse = ln(5/3) = 0.51
  • dfplay = 2, idfplay = ln(5/2) = 0.91
  • dfrat = 1, idfrat = ln(5/1) = 1.61
20
Example-Cont
  • V1 = (cat, eat, dog)
  • wcat = tfcat × idfcat = 1 × 0.51 = 0.51
  • wdog = tfdog × idfdog = 1 × 0.91 = 0.91
  • weat = tfeat × idfeat = 2 × 0.91 = 1.82
  • V2 = (dog, mouse)
  • wdog = tfdog × idfdog = 1 × 0.91 = 0.91
  • wmouse = tfmouse × idfmouse = 1 × 0.51 = 0.51

21
Example-Cont
  • V3 = (mouse, eat)
  • wmouse = tfmouse × idfmouse = 1 × 0.51 = 0.51
  • weat = tfeat × idfeat = 1 × 0.91 = 0.91
  • V4 = (cat, mouse, play, rat)
  • wcat = tfcat × idfcat = 1 × 0.51 = 0.51
  • wplay = tfplay × idfplay = 1 × 0.91 = 0.91
  • wrat = tfrat × idfrat = 1 × 1.61 = 1.61
  • wmouse = tfmouse × idfmouse = 1 × 0.51 = 0.51

22
Example-Cont
  • V5 = (cat, play)
  • wcat = tfcat × idfcat = 2 × 0.51 = 1.02
  • wplay = tfplay × idfplay = 1 × 0.91 = 0.91

23
Example - cont.
  • Dictionary: (cat, dog, eat, mouse, play, rat)
  • Weights (components in dictionary order):
  • V1: cat(0.51), dog(0.91), eat(1.82), 0, 0, 0
  • V2: 0, dog(0.91), 0, mouse(0.51), 0, 0
  • V3: 0, 0, eat(0.91), mouse(0.51), 0, 0
  • V4: cat(0.51), 0, 0, mouse(0.51), play(0.91), rat(1.61)
  • V5: cat(1.02), 0, 0, 0, play(0.91), 0

24
Retrieval Process
25
Retrieval Paradigms
  • How do we match?
  • Produce non-ranked output
  • Boolean retrieval
  • Produce ranked output
  • vector space model
  • probabilistic retrieval

26
Advantages of Ranking
  • Good control over how many documents are viewed
    by a user.
  • Good control over the order in which documents are viewed by a user.
  • The first documents that are viewed may help
    modify the order in which later documents are
    viewed.
  • The main disadvantage is computational cost.

27
The Vector Space Model
  • Each document and query is represented by a
    vector. A vector is obtained for each document
    and query from sets of index terms with
    associated weights.
  • The document and query representatives are
    considered as vectors in n dimensional space
    where n is the number of unique terms in the
    dictionary/document collection.
  • Measuring vector similarity
  • value of cosine of the angle between the two
    vectors.

28
Vector Space
  • Assume that a document is represented by vector D and the query by vector Q.
  • The total number of terms in the dictionary is n.
  • The similarity between D and Q is measured by the angle θ between them.

29
Cosine
  • The similarity between D and Q can be written as cos θ = (D · Q) / (|D| × |Q|)
  • Using the weights of the terms as the components of D and Q: cos θ = Σi (wdi × wqi) / (√(Σi wdi²) × √(Σi wqi²))

30
Simple Example (1)
  • Assume
  • there are 2 terms in the dictionary (t1, t2)
  • Doc-1 contains t1 and t2, with weights 0.5 and 0.3 respectively.
  • Doc-2 contains t1 with weight 0.6.
  • Doc-3 contains t2 with weight 0.4.
  • Query contains t2 with weight 0.5.

31
Simple Example (2)
  • The vectors for the query and documents (components in the order t1, t2):
  • Doc-1 = (0.5, 0.3)
  • Doc-2 = (0.6, 0.0)
  • Doc-3 = (0.0, 0.4)
  • Query = (0.0, 0.5)
32
Simple Example - Cosine
Similarity is measured between the Query (Q) and each of Doc-1, Doc-2 and Doc-3; the arithmetic is worked out in the sketch below.
Ranked output: D3, D1, D2
33
Large Example (1)
  • Consider the same five document collection:
  • D1: Dogs eat the same things that cats eat
  • D2: No dog is a mouse
  • D3: Mice eat little things
  • D4: Cats often play with rats and mice
  • D5: Cats often play, but not with other cats
  • Indexed by:
  • V1 = (dog, eat, cat)
  • V2 = (dog, mouse)
  • V3 = (mouse, eat)
  • V4 = (cat, play, rat, mouse)
  • V5 = (cat, play)

34
Large Example (2)
  • The set of all terms (dictionary): (cat, dog, eat, mouse, play, rat)
  • Using tf.idf weights, we obtain:
  • V1 = (cat(0.51), eat(1.82), dog(0.91))
  • V2 = (dog(0.91), mouse(0.51))
  • V3 = (mouse(0.51), eat(0.91))
  • V4 = (cat(0.51), play(0.91), rat(1.61), mouse(0.51))
  • V5 = (cat(1.02), play(0.91))

35
Large Example (3)
  • In the vector space model, we obtain the vectors (components in dictionary order):
  • D1 = (0.51, 0.91, 1.82, 0.00, 0.00, 0.00)
  • D2 = (0.00, 0.91, 0.00, 0.51, 0.00, 0.00)
  • D3 = (0.00, 0.00, 0.91, 0.51, 0.00, 0.00)
  • D4 = (0.51, 0.00, 0.00, 0.51, 0.91, 1.61)
  • D5 = (1.02, 0.00, 0.00, 0.00, 0.91, 0.00)
  • 6 dimensional space for 6 terms

36
Cosine Similarity
  • The query "what do cats play with?" forms a query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)
  • Using the cosine measure (cm), we obtain the following similarity measures:
  • D1 = 0.51² / (√(0.51² + 0.91²) × √(0.51² + 0.91² + 1.82²))
  • D2 = 0.0
  • D3 = 0.0
  • D4 = (0.51² + 0.91²) / (√(0.51² + 0.91²) × √(0.51² + 0.51² + 0.91² + 1.61²))
  • D5 = (0.51 × 1.02 + 0.91²) / (√(0.51² + 0.91²) × √(1.02² + 0.91²))
  • Thus we obtain the ranking D5, D4, D1, D2, D3 (or D3, D2).

37
Retrieval Model Improvement for Web-Based IR
  • Utilise the popularity of a page.
  • If many other pages point to a page, that page must be important, and we can assign it a high weight during search.
  • If a page is pointed to by a popular page, it can also be considered important, because it is referred to by a reputable source (a popular page).
  • PageRank function.

38
PageRank Example
39
Retrieval Model Improvement for Web-Based IR
  • Utilise the anchor text.
  • Anchors often provide more accurate descriptions
    of web pages than the pages themselves.
  • Anchors may exist for documents which cannot be
    indexed by a text-based search engine.
  • Utilise the appearance of the text.
  • Words in larger or bolder fonts are weighted higher than other words.