CS 430 / INFO 430 Information Retrieval - PowerPoint PPT Presentation

Slides: 36
Provided by: wya1
1
CS 430 / INFO 430 Information Retrieval
Lecture 6 Boolean Methods
2
Course Administration
Assignments: You are encouraged to use existing
Java or C classes, e.g., to manage data
structures. If you do, you must acknowledge them
in your report. In addition, for a data
structure, you should explain why this structure
is appropriate for the use you make of it.
3
Porter Stemmer
A multi-step, longest-match stemmer.
M. F. Porter, An algorithm for suffix stripping. (Originally published in Program, 14 no. 3, pp 130-137, July 1980.)
http://www.tartarus.org/martin/PorterStemmer/def.txt
Notation:
v        vowel(s)
c        consonant(s)
(vc)^m   vowel(s) followed by consonant(s), repeated m times
Any word can be written: [c](vc)^m[v]
m is called the measure of the word.
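The measure can be computed by labelling each letter as a vowel or a consonant and counting the vowel-to-consonant transitions. The following is a minimal sketch, not the reference implementation; the function names are illustrative, and it assumes Porter's convention that y counts as a vowel when it follows a consonant.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the (vc) groups in the form [c](vc)^m[v]."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    # Each vowel-run followed by a consonant-run contributes one to m.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


if __name__ == "__main__":
    for w in ["tree", "trouble", "oats", "troubles", "private"]:
        print(w, measure(w))
```

For example, "tree" has measure 0, "trouble" has measure 1, and "troubles" has measure 2.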
4
Porter's Stemmer
Multi-Step Stemming Algorithm
Complex suffixes
Complex suffixes are removed bit by bit in the different steps. Thus
GENERALIZATIONS
becomes GENERALIZATION (Step 1)
becomes GENERALIZE (Step 2)
becomes GENERAL (Step 3)
becomes GENER (Step 4)
In this example, note that Steps 3 and 4 appear to be unhelpful for information retrieval.
5
Porter Stemmer Step 1a
Suffix   Replacement   Examples
sses     ss            caresses -> caress
ies      i             ponies -> poni
                       ties -> ti
ss       ss            caress -> caress
s        (null)        cats -> cat
At each step, carry out the longest match only.
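Step 1a can be sketched as a rule list that is tried in order, with longer suffixes placed before shorter ones so that only the longest match fires. This is an illustrative sketch; the names `STEP_1A_RULES` and `step_1a` are not from the slides.

```python
# (suffix, replacement) pairs, ordered so that the longest matching
# suffix is always tried first ("sses" before "ss" before "s").
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1a(word: str) -> str:
    """Apply the first (longest) matching Step 1a rule, if any."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(w, "->", step_1a(w))
```

The "ss -> ss" rule looks like a no-op, but it is what stops "caress" from falling through to the "s" rule and losing its final letter.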
6
Porter Stemmer Step 1b
Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed
                                    agreed -> agree
(*v*)        ed       null          plastered -> plaster
                                    bled -> bled
(*v*)        ing      null          motoring -> motor
                                    sing -> sing
*v* - the stem contains a vowel
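The conditions in Step 1b apply to the stem that remains after the suffix is removed. The sketch below covers only the three rules listed on this slide; the full algorithm applies further clean-up rules after "ed" or "ing" is removed, which are omitted here. Helper names are illustrative.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of (vc) groups in [c](vc)^m[v]."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


def contains_vowel(stem: str) -> bool:
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b(word: str) -> str:
    """Slide rules only: (m>0) eed->ee, (*v*) ed->null, (*v*) ing->null."""
    # Longest match first: a word ending in "eed" is handled by the
    # "eed" rule even when its condition fails (e.g. feed -> feed).
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


if __name__ == "__main__":
    for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
        print(w, "->", step_1b(w))
```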
7
Porter Stemmer Step 5a
Conditions             Suffix   Replacement   Examples
(m > 1)                e        null          probate -> probat
                                              rate -> rate
(m = 1 and not *o)     e        null          cease -> ceas
*o - the stem ends cvc, where the second c is not w, x or y
(e.g. -wil, -hop).
8
Porter Stemmer Results
Suffix stripping of a vocabulary of 10,000 words
Number of words reduced in
  step 1   3597
  step 2    766
  step 3    327
  step 4   2424
  step 5   1373
Number of words not reduced   3650
The resulting vocabulary of stems contained 6370 distinct entries. Thus the
suffix stripping process reduced the size of the vocabulary by about one
third.
9
Exact Matching (Boolean Model)
[Diagram with labels: Documents; Query; Index database; Mechanism for determining whether a document matches a query; Set of hits]
10
Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
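Each example query corresponds to a set operation on the documents that contain each term. The sketch below uses small illustrative document sets; the collection `all_docs` and the document numbers are assumptions made for the example.

```python
# Illustrative data: which documents contain each term (assumed values).
all_docs = {3, 5, 19, 22, 29, 34}
docs = {
    "abacus": {3, 19, 22},
    "actor": {19, 29},
    "atoll": {34},
}

# and -> intersection, or -> union, not -> complement within the collection
abacus_and_actor = docs["abacus"] & docs["actor"]
abacus_or_actor = docs["abacus"] | docs["actor"]
compound = (docs["abacus"] & docs["actor"]) | (docs["abacus"] & docs["atoll"])
not_actor = all_docs - docs["actor"]

if __name__ == "__main__":
    print(sorted(abacus_and_actor))  # documents with both terms
    print(sorted(abacus_or_actor))   # documents with either term
    print(sorted(compound))
    print(sorted(not_actor))
```

Note that "not" needs the full set of document identifiers, which is one reason a bare "not actor" query is expensive on a large collection.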
11
Boolean Diagram
[Venn diagram of sets A and B with regions labeled: not (A or B); A and B; A or B]
12
Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other, as in the string "abacus actor".
abacus near 4 actor
Terms abacus and actor are near to each other, as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same
sentence) or same (two terms in the same paragraph).
13
Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near    high
and, not
or           low
Example: A and B or C and B is evaluated as (A and B) or (C and B)
14
Evaluating a Boolean Query
Example: abacus and actor
[Tables: postings for abacus; postings for actor]
Document 19 is the only document that contains both terms, "abacus" and "actor".
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
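The AND merge can be sketched as a two-pointer walk over two postings lists sorted by document number. The function name `intersect` and the sample lists are illustrative assumptions; the lists are chosen so that document 19 is the only one in common, as in the example.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two sorted lists of document numbers, keeping common entries."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return result


if __name__ == "__main__":
    abacus = [3, 19, 22]  # assumed document numbers
    actor = [19, 29]
    print(intersect(abacus, actor))
```

Because both lists are sorted, each posting is examined at most once, so the cost is linear in the combined length of the lists.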
15
Evaluating an Adjacency Operation
Example: abacus adj actor
[Tables: postings for abacus; postings for actor]
Document 19, locations 212 and 213, is the only occurrence of
the terms "abacus" and "actor" adjacent.
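Adjacency and proximity can both be checked by comparing word locations within documents that contain both terms. The sketch below makes two assumptions that the slides do not state: adj is treated as order-sensitive (the second term directly follows the first), and near n accepts either order within n positions. The function name `within` and the postings values are illustrative.

```python
def within(postings_a, postings_b, max_gap, ordered=False):
    """Return (doc, loc_a, loc_b) triples where the two terms are close.

    postings_a / postings_b map document number -> sorted word locations.
    ordered=True requires term b to come after term a.
    """
    hits = []
    for doc in sorted(postings_a.keys() & postings_b.keys()):
        for loc_a in postings_a[doc]:
            for loc_b in postings_b[doc]:
                gap = loc_b - loc_a
                if ordered:
                    close = 0 < gap <= max_gap
                else:
                    close = 0 < abs(gap) <= max_gap
                if close:
                    hits.append((doc, loc_a, loc_b))
    return hits


# Assumed positional postings: document -> locations
abacus = {3: [94], 19: [7, 212], 22: [56]}
actor = {19: [213], 29: [45]}

if __name__ == "__main__":
    print(within(abacus, actor, max_gap=1, ordered=True))  # abacus adj actor
    print(within(abacus, actor, max_gap=4))                # abacus near 4 actor
```

With these postings, both queries match only document 19 at locations 212 and 213; the occurrence of "abacus" at location 7 is too far from "actor" to qualify.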
16
Query Matching Boolean Methods
  • Query: (abacus or asp*) and actor
  • 1. From the index file (word list), find the
    postings file for
  • "abacus"
  • every word that begins "asp"
  • "actor"
  • 2. Merge these posting lists. For each document
    that occurs in any of the postings lists,
    evaluate the Boolean expression to see if it is
    true or false.
  • Step 2 should be carried out in a single pass.
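The two steps can be sketched as follows: expand the prefix against the word list, merge all the relevant postings lists by document number, and evaluate the Boolean expression once per document as the merge proceeds. The index contents are assumptions for illustration, and the term "aspic" is a hypothetical entry added only to show the prefix expansion matching more than one word.

```python
import heapq
from itertools import groupby

# Assumed index: term -> sorted list of document numbers.
index = {
    "abacus": [3, 19, 22],
    "actor": [19, 29],
    "aspen": [5],
    "aspic": [29],  # hypothetical term, not from the slides
    "atoll": [34],
}


def evaluate(index):
    """Evaluate (abacus or asp*) and actor in one pass over merged postings."""
    # Step 1: find the postings lists, expanding the prefix "asp".
    asp_terms = sorted(t for t in index if t.startswith("asp"))
    terms = ["abacus", "actor"] + asp_terms
    # Tag each posting with its term so the merged stream stays identifiable.
    streams = [[(doc, term) for doc in index[term]] for term in terms]

    # Step 2: merge by document number and test each document once.
    hits = []
    merged = heapq.merge(*streams)
    for doc, group in groupby(merged, key=lambda pair: pair[0]):
        present = {term for _, term in group}
        has_asp = any(t in present for t in asp_terms)
        if ("abacus" in present or has_asp) and "actor" in present:
            hits.append(doc)
    return hits


if __name__ == "__main__":
    print(evaluate(index))
```

With this index, document 19 matches through "abacus" and document 29 matches through the hypothetical "aspic".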

17
Use of Postings File for Query Matching
  • 1 abacus: 3 94; 19 7; 19 212; 22 56
  • 2 actor: 66; 19 213; 29 45
  • 3 aspen: 5 43
  • 4 atoll: 3; 70; 34 40

18
Query Matching Vector Ranking Methods
  • Query: abacus asp*
  • 1. From the index file (word list), find the
    postings file for
  • "abacus"
  • every word that begins "asp"
  • 2. Merge these posting lists. Calculate the
    similarity to the query for each document that
    occurs in any of the postings lists.
  • 3. Sort the similarities to obtain the results in
    ranked order.
  • Steps 2 and 3 should be carried out in a single
    pass.
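The ranking steps can be sketched with cosine similarity over term-weight vectors. The slide does not say which similarity measure or weighting is used, so both are assumptions here: raw term frequencies as document weights and a weight of 1 for every query term. The document vectors are illustrative, and this simple version sorts after scoring rather than combining the two steps.

```python
import math

# Assumed document vectors: document number -> {term: term frequency}
doc_vectors = {
    3: {"abacus": 1, "atoll": 2},
    5: {"aspen": 1},
    19: {"abacus": 2, "actor": 1},
    34: {"atoll": 1},
}


def cosine(q: dict, d: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * d.get(term, 0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank(query_terms, doc_vectors):
    """Score every document containing at least one query term, best first."""
    q = {term: 1.0 for term in query_terms}
    scored = [
        (cosine(q, vec), doc)
        for doc, vec in doc_vectors.items()
        if any(term in vec for term in q)  # only docs in some postings list
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in scored]


if __name__ == "__main__":
    # "asp*" is assumed to have been expanded to "aspen" already.
    print(rank(["abacus", "aspen"], doc_vectors))
```

Unlike the Boolean case, document 5 is retrieved even though it contains only one of the query terms, and document 34 is skipped because it appears in none of the postings lists.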

19
Contrast of Ranking with Matching
With matching, a document either matches a query exactly or not at all
  Encourages short queries
  Requires precise choice of index terms
  Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents
  Encourages long queries, to have as many dimensions as possible
  Benefits from large numbers of index terms
  Benefits from queries with many terms, not all of which need match the document
20
Problems with the Boolean model
Counter-intuitive results
Query q = a and b and c and d and e
Document d has terms a, b, c and d, but not e.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = a or b or c or d or e
Document d1 has terms a, b, c, d, and e.
Document d2 has term a, but not b, c, d or e.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks
them as equal.
21
Problems with the Boolean model (continued)
Boolean is all or nothing
  Boolean model has no way to rank documents.
  Boolean model allows for no uncertainty in assigning index terms to documents.
  The Boolean model has no provision for adjusting the importance of query terms.
22
Extending the Boolean model
Term weighting
  Give weights to terms in documents and/or queries.
  Combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets
  Relax the boundaries of the sets used in Boolean retrieval.
23
Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights
  Add term weights to documents.
  Weights calculated by the standard method of term frequency * inverse document frequency.
Ranking
  Calculate results set by standard Boolean methods.
  Rank results by vector distances.
24
Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it
allows the results set to be expanded.
  Results set is created by standard Boolean retrieval.
  User selects one document from results set.
  Other documents in collection are ranked by vector distance from this document.
Relevance feedback will be covered in a later lecture.
25
Boolean model as sets
d is either in the set A or not in A.
[Diagram: element d and set A]
26
Boolean model as fuzzy sets
d is more or less in A.
[Diagram: element d and set A]
27
Fuzzy Sets Basic concept
A document has a term weight associated with each index term. The term weight
measures the degree to which that term characterizes the document.
Term weights are in the range [0, 1]. (In the standard Boolean model all
weights are either 0 or 1.)
For a given query, calculate the similarity between the query and each
document in the collection.
This calculation is needed for every document that has a non-zero weight for
any of the terms in the query.
28
Fuzzy Sets
Fuzzy set theory
$d_A$ is the degree of membership of an element to set A
intersection (and): $d_{A \cap B} = \min(d_A, d_B)$
union (or): $d_{A \cup B} = \max(d_A, d_B)$
29
Fuzzy Sets
Fuzzy set theory example
                          standard set theory    fuzzy set theory
$d_A$                     1  1  0  0             0.5  0.5  0    0
$d_B$                     1  0  1  0             0.7  0    0.7  0
and  $d_{A \cap B}$       1  0  0  0             0.5  0    0    0
or   $d_{A \cup B}$       1  1  1  0             0.7  0.5  0.7  0
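The rows of the example can be reproduced by applying min and max column by column. This is a short illustrative check; the variable names are not from the slides.

```python
# Membership degrees from the example: the first four columns are the
# standard (0/1) cases, the last four are the fuzzy cases.
d_a = [1, 1, 0, 0, 0.5, 0.5, 0, 0]
d_b = [1, 0, 1, 0, 0.7, 0, 0.7, 0]

# Fuzzy intersection (and) is the minimum, fuzzy union (or) is the maximum.
fuzzy_and = [min(a, b) for a, b in zip(d_a, d_b)]
fuzzy_or = [max(a, b) for a, b in zip(d_a, d_b)]

if __name__ == "__main__":
    print("and:", fuzzy_and)
    print("or: ", fuzzy_or)
```

On the 0/1 columns, min and max give exactly the standard Boolean and and or, so fuzzy set theory extends the standard model rather than replacing it.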
30
MMM Mixed Min and Max model
Terms $a_1, a_2, \ldots, a_n$
Document d, with index-term weights $d_1, d_2, \ldots, d_n$
$q_{or} = (a_1 \text{ or } a_2 \text{ or } \ldots \text{ or } a_n)$
Query-document similarity:
$S(q_{or}, d) = \lambda_{or} \max(d_1, d_2, \ldots, d_n) + (1 - \lambda_{or}) \min(d_1, d_2, \ldots, d_n)$
With regular Boolean logic, all $d_i$ = 1 or 0, $\lambda_{or} = 1$
31
MMM Mixed Min and Max model
Terms $a_1, a_2, \ldots, a_n$
Document d, with index-term weights $d_1, d_2, \ldots, d_n$
$q_{and} = (a_1 \text{ and } a_2 \text{ and } \ldots \text{ and } a_n)$
Query-document similarity:
$S(q_{and}, d) = \lambda_{and} \min(d_1, \ldots, d_n) + (1 - \lambda_{and}) \max(d_1, \ldots, d_n)$
With regular Boolean logic, all $d_i$ = 1 or 0, $\lambda_{and} = 1$
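Both MMM similarities are linear combinations of the largest and smallest term weights in the document. The sketch below is illustrative; the function names and the sample coefficient values are assumptions.

```python
def mmm_or(weights: list[float], lam_or: float) -> float:
    """Similarity for an or-query: mostly max, softened by min."""
    return lam_or * max(weights) + (1 - lam_or) * min(weights)


def mmm_and(weights: list[float], lam_and: float) -> float:
    """Similarity for an and-query: mostly min, softened by max."""
    return lam_and * min(weights) + (1 - lam_and) * max(weights)


if __name__ == "__main__":
    d = [0.5, 0.7]  # assumed index-term weights for one document
    print(mmm_or(d, lam_or=0.6))
    print(mmm_and(d, lam_and=0.7))
    # With 0/1 weights and a coefficient of 1, MMM is plain Boolean logic.
    print(mmm_or([1, 0, 1], lam_or=1.0), mmm_and([1, 0, 1], lam_and=1.0))
```

A coefficient below 1 lets a document that misses one term of an and-query still receive partial credit, which addresses the all-or-nothing behavior of the standard Boolean model.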
32
MMM Mixed Min and Max model
Experimental values
all $d_i$ = 1 or 0
$\lambda_{and}$ in range [0.5, 0.8]
$\lambda_{or} > 0.2$
Computational cost is low. Retrieval performance much improved.
33
Other Models
Paice model
  The MMM model considers only the maximum and minimum document weights. The
  Paice model takes into account all of the document weights. Computational
  cost is higher than MMM.
P-norm model
  Document d, with term weights $d_{A1}, d_{A2}, \ldots, d_{An}$
  Query terms are given weights, $a_1, a_2, \ldots, a_n$
  Operators have coefficients that indicate degree of strictness.
  Query-document similarity is calculated by considering each document and
  query as a point in n space.
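The slide does not give the P-norm formulas. The sketch below uses the commonly cited form of the model, in which the strictness coefficient is an exponent p; treating that form as the one intended here is an assumption, and the function names are illustrative.

```python
def pnorm_or(doc_weights, query_weights, p):
    """Or-query: normalized p-norm distance from the all-zeros point."""
    num = sum((a ** p) * (d ** p) for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1 / p)


def pnorm_and(doc_weights, query_weights, p):
    """And-query: one minus the normalized distance from the all-ones point."""
    num = sum(
        (a ** p) * ((1 - d) ** p) for a, d in zip(query_weights, doc_weights)
    )
    den = sum(a ** p for a in query_weights)
    return 1 - (num / den) ** (1 / p)


if __name__ == "__main__":
    d = [0.5, 0.7]  # assumed document term weights
    a = [1.0, 1.0]  # assumed query term weights
    for p in (1, 2, 10):
        print(p, pnorm_or(d, a, p), pnorm_and(d, a, p))
```

With p = 1 the two operators give the same score, a plain weighted average; as p grows, the or score moves toward the maximum weight and the and score toward the minimum, approaching strict Boolean behavior.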
34
Test data
          CISI   CACM   INSPEC
P-norm    79     106    210
Paice     77     104    206
MMM       68     109    195
Percentage improvement over standard Boolean model (average best precision)
Lee and Fox, 1988
35
Reading
E. Fox, S. Betrabet, M. Koushik, W. Lee, Extended Boolean Models, Frakes, Chapter 15.
Methods based on fuzzy set concepts