Title: CS/Info%20430:%20Information%20Retrieval
1CS/Info 430 Information Retrieval
Sample Midterm Examination Notes on the Solutions
2Midterm Examination -- Question 1
1(a) Define the terms inverted file, inverted
list, posting. Inverted file a list of the words
in a set of documents and the documents in which
they appear. Inverted list All the entries in
an inverted file that apply to a specific
word. Posting Entry in an inverted list 1(b)
When implementing an inverted file system, what
are the criteria that you would use to judge
whether the system is suitable for very
large-scale information retrieval?
3Q1 (continued)
Storage Inverted files are big, typically 10 to
100 the size of the collection of
documents. Update performance It must be
possible, with a reasonable amount of
computation, to (a) Add a large batch of
documents (b) Add a single document Retrieval
performance Retrieval must be fast enough to
satisfy users and not use excessive resource.
4Q1 (continued)
1(c) You are designing an inverted file system
to be used with Boolean queries on a very large
collection of textual documents. New documents
are being continually added to the collection.
(i) What file structure(s) would you use?
(ii) How well does your design satisfy the
criteria listed in Part (b)?
5Q1 (continued)
Separate inverted index from lists of postings
postings file
Term Pointer to list of postings ant
bee cat dog
elk fox gnu
hog
inverted index
Lists of postings
6Question 1 (continued)
(a) Postings file may be stored sequentially as a
linked list. (b) Index file is best stored as a
tree. Binary trees provide fast searching but
have problems with updating. B-trees are better,
with B-trees as the best. Note Other answers
are possible to this part of the question.
7Question 1 (continued)
1(c)(ii) How well does your design satisfy the
criteria listed in Part (b)? Sequential
list for each term is efficient for storage
and for processing Boolean queries. The
disadvantage is a slow update time for long
inverted lists. B-trees combine fast
retrieval with moderately efficient
updating. Bottom-up updating is usual fast,
but may require recursive tree climbing
to the root. The main weakness is poor
storage utilization typically buckets
are only 0.69 full.
8Midterm Examination -- Question 2
2(b) You have the collection of documents that
contain the following index terms D1 alpha
bravo charlie delta echo foxtrot golf D2 golf
golf golf delta alpha D3 bravo charlie bravo
echo foxtrot bravo D4 foxtrot alpha alpha golf
golf delta (i) Use an incidence matrix of terms
to calculate a similarity matrix for these four
documents, with no term weighting.
9Incidence array
D1 alpha bravo charlie delta echo foxtrot
golf D2 golf golf golf delta alpha D3 bravo
charlie bravo echo foxtrot bravo D4 foxtrot
alpha alpha golf golf delta
?7 ?3 ?4 ?4
alpha bravo charlie delta echo foxtrot golf
D1 1 1 1 1 1 1 1
D2 1 1 1
D3 1 1 1 1 D4 1
1 1 1
10Document similarity matrix
D1 D2 D3 D4 D1 0.65 0.76 0.76 D2 0.65
0.00 0.87 D3 0.76 0.00 0.25 D4 0.76 0.87
0.25
11Question 2 (continued)
2b(ii) Use a frequency matrix of terms to
calculate a similarity matrix for these
documents, with weights proportional to the term
frequency and inversely proportional to the
document frequency.
12Frequency Array
D1 alpha bravo charlie delta echo foxtrot
golf D2 golf golf golf delta alpha D3 bravo
charlie bravo echo foxtrot bravo D4 foxtrot
alpha alpha golf golf delta
alpha bravo charlie delta echo foxtrot golf
D1 1 1 1 1 1 1 1
D2 1 1 3
D3 3 1 1 1 D4 2
1 1 2
13Inverse Document Frequency Weighting
Principle (a) Weight is proportional to the
number of times that the term appears in the
document (b) Weight is inversely proportional to
the number of documents that contain the
term wik fik / dk Where wik is the weight
given to term k in document i fik is
the frequency with which term k appears in
document i dk is the number of documents that
contain term k
14Frequency Array with Weights
D1 alpha bravo charlie delta echo foxtrot
golf D2 golf golf golf delta alpha D3 bravo
charlie bravo echo foxtrot bravo D4 foxtrot
alpha alpha golf golf delta
alpha bravo charlie delta echo foxtrot golf
D1 0.33 0.50 0.50 0.33 0.50
0.33 0.33
D2 0.33 0.33
1.00 D3 1.50 0.50 0.50
0.33 D4 0.67 0.33 0.33 0.67
length 0.94 0.65 1.08 0.76
dk 3 2 2 3
2 3 3
15Document similarity matrix
D1 D2 D3 D4 D1 0.46 0.74 0.58 D2 0.46
0.00 0.86 D3 0.74 0.00 0.06 D4 0.56 0.86
0.06
16Question 3
3(a) Define the terms recall and precision.
3(b) Q is a query. D is a collection of
1,000,000 documents. When the query Q is run, a
set of 200 documents is returned. (i) How
in a practical experiment would you calculate the
precision? Have an expert examine each of the 200
documents and decide whether it is relevant.
Precision is number judged relevant divided by
200. (ii) How in a practical experiment would
you calculate the recall? It is not feasible to
examine 1,000,000 records. Therefore sampling
must be used ...
17Question 3 (continued)
3(c) Suppose that, by some means, it is known
that 100 of the documents in D are relevant to Q.
Of the 200 documents returned by the search, 50
are relevant. (i) What is the
precision? 50/200 0.25 (ii) What is the
recall? 50/100 0.5 3(d) Explain in general
terms the method used by TREC to estimate the
recall.
18Question 3 (continued)
For each query, a pool of potentially relevant
documents is assembled, using the top 100 ranked
documents from each participant The human expert
who set the query looks at every document in the
pool and determines whether it is
relevant. Documents outside the pool are not
examined. In TREC-8 7,100 documents in the
pool 1,736 unique documents (eliminating
duplicates) 94 judged relevant