Title: CS 430: Information Discovery
1CS 430 Information Discovery
Lecture 5 Inverted Files
2Course Administration
Assignment 1 has been posted. It is a
programming assignment and is due on Monday,
September 22 at 5 p.m. Read the submission
instructions carefully. Send questions to
cs430_at_cs.cornell.edu.
3Representation of Inverted Files
Index (word list, vocabulary) file: Stores the list
of terms (keywords). Designed for searching and
sequential processing, e.g., for range queries
(lexicographic index). Often held in memory.
Postings file: Stores an inverted list
(postings list) of postings for each term.
Designed for rapid merging of lists and
calculation of similarities. Each list is
usually stored sequentially.
Document file: Stores the documents. Important
for user interface design. Repositories for the
storage of document collections are covered in
CS 431.
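As a rough illustration of how the index file and the postings file relate, here is a minimal Python sketch. The function name `build_inverted_file`, the whitespace tokenization, and the use of a list position as the "pointer" are all assumptions made for the example; a real system would store both files on disk.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Return (index_file, postings_file) for a dict of doc_id -> text."""
    postings = defaultdict(list)
    for doc_id in sorted(documents):
        words = documents[doc_id].lower().split()
        for location, word in enumerate(words):
            # One posting per occurrence: (document ID, location in document).
            postings[word].append((doc_id, location))
    terms = sorted(postings)
    # Postings file: one sequentially stored list per term, in term order.
    postings_file = [postings[term] for term in terms]
    # Index file: sorted terms, each with a pointer (here, a list position).
    index_file = [(term, position) for position, term in enumerate(terms)]
    return index_file, postings_file

docs = {1: "ant bee ant", 2: "bee cat"}
index_file, postings_file = build_inverted_file(docs)
```

Because documents are processed in ID order, every inverted list comes out sorted by document ID and location.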
4Organization of Inverted Files
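[Figure: three linked files. The index file holds each term
(ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its
postings; the postings file holds the inverted lists; the
documents file holds the documents.]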
5Decisions in Building Inverted Files: What is a Term?
Underlying character set, e.g., printable
ASCII, Unicode, UTF-8.
Is there a controlled vocabulary? If so, what
words are included?
List of stopwords.
Rules to decide the beginning and end of words,
e.g., spaces or punctuation.
Character sequences not to be indexed, e.g.,
sequences of numbers.
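The decisions above can be sketched as a small term extractor. This is only an illustration: the stop list, the choice of letters and digits as word characters, and the lowercasing step are assumptions, not rules given in the lecture.

```python
import re

# Hypothetical stop list; a real system would use a much longer one.
STOPWORDS = {"a", "and", "is", "of", "the"}

def extract_terms(text):
    """Split text into index terms, dropping stopwords and numbers."""
    terms = []
    # Word boundaries: any run of characters that is not a letter or digit.
    for token in re.split(r"[^A-Za-z0-9]+", text):
        if not token:
            continue              # empty piece at the start or end of the text
        if token.isdigit():
            continue              # sequences of numbers are not indexed
        token = token.lower()     # assumption: terms are case-insensitive
        if token in STOPWORDS:
            continue
        terms.append(token)
    return terms

terms = extract_terms("The index of 1998 is big, and fast.")
```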
6Decisions in Building an Inverted File:
Efficiency and Query Languages
Some query options may require huge computation,
e.g.:
Regular expressions: If inverted files are
stored in alphabetical order,
  comp* can be processed efficiently
  *comp cannot be processed efficiently
Boolean terms: If A and B are search terms,
  A or B can be processed by comparing two
  moderate sized lists
  (not A) or (not B) requires two very large lists
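To see why the two Boolean cases differ so much in cost, consider this Python sketch. It assumes each inverted list is a sorted list of document IDs; `merge_or` and `complement` are names invented for the example.

```python
def merge_or(a, b):
    """A or B: union of two sorted lists of document IDs, in one pass."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])   # at most one of these two tails is non-empty
    result.extend(b[j:])
    return result

def complement(a, all_doc_ids):
    """(not A): every document ID in the collection that is absent from a."""
    present = set(a)
    return [doc_id for doc_id in all_doc_ids if doc_id not in present]

a_or_b = merge_or([1, 4, 7], [2, 4, 9])
not_a = complement([1, 4, 7], range(1, 11))
```

The union touches only the postings of A and B, whereas the complement has to enumerate almost the whole collection, which is why (not A) or (not B) produces very large lists.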
7Efficiency Criteria
Storage: Inverted files are big, typically 10% to
100% the size of the collection of documents.
Update performance: It must be possible, with a
reasonable amount of computation, to
(a) Add a large batch of documents
(b) Add a single document
Retrieval performance: Retrieval must be fast
enough to satisfy users and not use excessive
resources.
8Document File
The document file may be:
Central (monolithic) - all documents stored
together on a single server (e.g., library
catalog)
Distributed database - all documents managed
together but stored on several servers (e.g.,
Medline, Westlaw, Dialog)
Highly distributed - documents are stored on
independently managed servers (e.g., Web)
Each requires a Document ID, which is a unique
identifier that can be used by the inverted file
system to refer to the document, and a Location
Counter, which can be used to specify location
within a document.
9Documents File for Web Search System
For Web search systems:
A Document is a Web page.
The Documents File is the Web.
The Document ID is the URL of the document.
Indexes are built using a Web crawler, which
retrieves each page on the Web (or a subset).
After indexing, each page is discarded, unless
stored in a cache.
(In addition to the usual index file and postings
file, the indexing system stores special
information, which will be discussed in a later
lecture.)
10Postings File
The postings file stores the elements of a sparse
matrix, the term assignment matrix. It is stored
as a separate list for each column, i.e., a list
corresponding to each term in the index
file. Each list consists of one or many
individual postings.
11Postings File: A Linked List for Each Term
1 abacus -> [3 | 94] -> [19 | 7] -> [19 | 212] -> [22 | 56]
3 aspen -> [5 | 43]
A linked list for each term is convenient to
process sequentially, but slow to update when
the lists are long.
12Length of Postings File
For a common term there may be very large numbers
of postings.
Example: 1,000,000,000 documents, average length
1,000 words, total $10^{12}$ words.
By Zipf's law, the 100th ranking word occurs
approximately $(10^{12}/10)/100 = 10^{9}$ times.
13Postings File
Merging inverted lists is the most
computationally intensive task in many
information retrieval systems. Since inverted
lists may be very long, it is important to match
postings efficiently. Usually, the inverted lists
will be held on disk and paged into memory for
matching. Therefore algorithms for matching
postings process the lists sequentially. For
efficient matching, the inverted lists should all
be sorted in the same sequence. Inverted lists
are commonly cached to minimize disk accesses.
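A minimal Python sketch of sequential matching, assuming each inverted list is a list of document IDs sorted in ascending order (the function name `match_postings` is invented for the example):

```python
def match_postings(a, b):
    """Return document IDs present in both sorted lists, in one pass."""
    common = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1      # a[i] cannot appear later in b, so skip it
        else:
            j += 1      # b[j] cannot appear later in a, so skip it
    return common

both = match_postings([3, 19, 22, 40], [5, 19, 40, 51])
```

Each list is read once from front to back, which suits lists that are paged in from disk. The pass is only correct because both lists are sorted in the same sequence.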
14Inverted File Adding Term Weights
When processing a new document:
Inverse document frequency (idf)
  Update $n$, the total number of documents
  (a global variable).
  Update and store in the index file $n_j$, the
  number of documents containing term $j$.
Term frequency (tf)
  When creating each posting, calculate
  $f_{ij}$ = number of occurrences of term $j$ in document $i$
  $m_i$ = maximum frequency of any term in document $i$
  $tf_{ij} = f_{ij} / m_i$
  Store $tf_{ij}$ as part of the posting.
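The bookkeeping above might look like this in Python. The class name `WeightedIndex` and its in-memory dictionaries are assumptions made for the sketch; the lecture stores these values in the index file and the postings file.

```python
from collections import Counter

class WeightedIndex:
    def __init__(self):
        self.n = 0                   # n: total number of documents
        self.doc_counts = Counter()  # n_j: documents containing term j
        self.postings = {}           # term j -> list of (doc_id, tf_ij)

    def add_document(self, doc_id, terms):
        self.n += 1
        freq = Counter(terms)        # f_ij for every term j in document i
        if not freq:
            return                   # empty document: nothing to post
        m_i = max(freq.values())     # maximum frequency of any term
        for term, f_ij in freq.items():
            self.doc_counts[term] += 1
            tf_ij = f_ij / m_i
            self.postings.setdefault(term, []).append((doc_id, tf_ij))

index = WeightedIndex()
index.add_document(1, ["ant", "bee", "ant", "cat"])
index.add_document(2, ["bee"])
```

Here "ant" occurs twice in document 1, which is the maximum, so its term frequency is 1.0, while "bee" gets 0.5 in document 1 and 1.0 in document 2.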
15Postings File with Weights
1 abacus -> [3 | 94 | w3,1] -> [19 | 7 | w19,1] -> [19 | 212 | w19,1] -> [22 | 56 | w22,1]
2 actor -> [3 | 66 | w3,2] -> [19 | 213 | w19,2] -> [29 | 45 | w29,2]
3 aspen -> [5 | 43 | w5,3]
4 atoll -> [11 | 3 | w11,4] -> [11 | 70 | w11,4] -> [34 | 40 | w34,4]
16Index File Individual Records
The record for term j in the index file contains:
term j
pointer to inverted (postings) list for term j
count of postings for term j
number of documents in which term j occurs (nj)
17Index Files
On disk: If an index is held on disk, search time
is dominated by the number of disk accesses.
In memory: Suppose that an index has 1,000,000
distinct terms. Each index entry consists of the
term, some basic statistics and a pointer to the
inverted list, average 100 characters. Size of
index is 100 megabytes, which can easily be held
in memory of a dedicated computer.
18Index File Structures Linear Index
Advantages:
Can be searched quickly, e.g., by binary search,
O(log n)
Good for sequential processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages:
Index must be rebuilt if an extra term is added
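A linear index can be sketched in Python as a sorted list searched with the standard `bisect` module. The vocabulary and the function names `lookup` and `prefix_range` are made up for the illustration.

```python
import bisect

# Hypothetical vocabulary, kept in sorted order.
vocabulary = ["compact", "compile", "computer", "dog", "elk"]

def lookup(terms, term):
    """Binary search: return the position of term, or -1 if absent."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return i
    return -1

def prefix_range(terms, prefix):
    """Return all terms starting with prefix, e.g., for the query comp*."""
    lo = bisect.bisect_left(terms, prefix)   # first term >= prefix
    hi = lo
    while hi < len(terms) and terms[hi].startswith(prefix):
        hi += 1                              # matching terms are adjacent
    return terms[lo:hi]

position = lookup(vocabulary, "dog")
matches = prefix_range(vocabulary, "comp")
```

A prefix query is one binary search followed by a sequential scan, because all matching terms sit next to each other in sorted order. Adding a term means shifting or rewriting the array, which is the disadvantage noted above.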
19Index File Structures Binary Tree
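Input: elk, hog, bee, fox, cat, gnu, ant, dog
[Figure: binary search tree built by inserting the
terms in input order; L = left child, R = right child.]
elk
  L: bee
    L: ant
    R: cat
      R: dog
  R: hog
    L: fox
      R: gnu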
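The tree for this input can be reproduced with a few lines of Python. This is a plain, unbalanced binary search tree; the names `Node`, `insert` and `depth` are chosen for the sketch.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None

def insert(root, term):
    """Insert term into the subtree rooted at root; return the root."""
    if root is None:
        return Node(term)
    if term < root.term:
        root.left = insert(root.left, term)
    elif term > root.term:
        root.right = insert(root.right, term)
    return root          # duplicates are ignored

def depth(node):
    """Number of levels in the subtree rooted at node."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
```

The shape depends entirely on the input order: elk becomes the root simply because it arrives first.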
20Binary Tree
Advantages:
Can be searched quickly
Convenient for batch updating
Easy to add an extra term
Economical use of storage
Disadvantages:
Poor for sequential processing, e.g., comp*
Tree tends to become unbalanced
If the index is held on disk, important to
optimize the number of disk accesses
21Binary Tree
Calculation of maximum depth of tree.
Illustrates importance of balanced trees.
Worst case: depth $= n$, which is $O(n)$
Ideal case: depth $= \log(n + 1)/\log 2$, which is $O(\log n)$
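As a worked example, take an index of $n = 10^{6}$ distinct terms (the size used earlier for the in-memory index). The two cases differ as follows:

```latex
\[
\text{Worst case: } \text{depth} = n = 10^{6}
\]
\[
\text{Ideal case: } \text{depth} = \frac{\log(n + 1)}{\log 2}
  = \frac{\log(10^{6} + 1)}{\log 2} \approx 19.93
\]
```

So a perfectly balanced tree needs only 20 levels, while a tree built from terms arriving in sorted order degenerates into a chain of one million nodes.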
22Right Threaded Binary Tree
Threaded tree: A binary search tree in which each
node uses an otherwise-empty left child link to
refer to the node's in-order predecessor and an
empty right child link to refer to its in-order
successor.
Right-threaded tree: A variant of a threaded tree
in which only the right thread, i.e., the link to
the successor, of each node is maintained.
Knuth vol 1, 2.3.1, page 325.
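A right-threaded tree could be sketched in Python as follows. This is an illustrative implementation, not the one from Knuth: the `is_thread` flag and the function names are assumptions for the example.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None       # right child, or thread to the successor
        self.is_thread = False  # True when `right` is a thread

def insert(root, term):
    new = Node(term)
    if root is None:
        return new
    node = root
    while True:
        if term < node.term:
            if node.left is None:
                node.left = new
                new.right = node          # parent is the in-order successor
                new.is_thread = True
                return root
            node = node.left
        elif term > node.term:
            if node.is_thread or node.right is None:
                new.right = node.right    # inherit the parent's thread
                new.is_thread = node.is_thread
                node.right = new
                node.is_thread = False
                return root
            node = node.right
        else:
            return root                   # duplicate term: nothing to do

def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node

def in_order(root):
    """List all terms in order, with no stack and no recursion."""
    terms = []
    node = leftmost(root)
    while node is not None:
        terms.append(node.term)
        if node.is_thread:
            node = node.right             # follow the thread
        else:
            node = leftmost(node.right)   # descend into the right subtree
    return terms

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
ordered = in_order(root)
```

The threads are what make sequential processing cheap: from any node, the next term in order is reached without going back up through the tree.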
23Right Threaded Binary Tree
From Robert F. Rossa
24B-trees
B-tree of order m: A balanced, multiway search
tree.
Each node stores many keys.
Root has between 2 and 2m keys. All other
internal nodes have between m and 2m keys.
If ki is the ith key in a given internal node:
  -> all keys in the (i-1)th child are smaller than ki
  -> all keys in the ith child are bigger than ki
All leaves are at the same depth.
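The ordering rule for keys and children gives a simple search procedure. The Python sketch below covers searching only, not insertion or rebalancing. The class `BTreeNode` and the small two-level tree are hypothetical, and children are numbered from 0, so `children[i]` holds the keys that fall just before `keys[i]`.

```python
import bisect

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys in this node
        self.children = children or []   # empty list for a leaf

def search(node, key):
    """Return True if key is stored anywhere in the tree."""
    while node is not None:
        # i = number of keys in this node that are smaller than key
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                 # reached a leaf without a match
        node = node.children[i]          # descend between keys i-1 and i
    return False

# Hypothetical tree of order 2: a root with two keys and three leaves.
tree = BTreeNode([50, 65], [
    BTreeNode([10, 19, 35]),
    BTreeNode([55, 59]),
    BTreeNode([70, 90, 98]),
])
found = search(tree, 59)
missing = search(tree, 60)
```

Each step reads one node, so the number of nodes visited is bounded by the depth of the tree. With many keys per node the tree stays shallow, which keeps the number of disk accesses small.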
25B-trees
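B-tree example (order 2)
[Figure: B-tree of order 2; nodes listed by level,
left to right.]
Root: [50 65]
Second level: [10 19 35] [55 59] [70 90 98]
Third level (nodes shown): [1 5 8 9] [12 14 18]
[21 24 28] [36 47] [66 68] [72 73] [91 95 97]
Every arrow points to a node containing between 2
and 4 keys. A node with k keys has k + 1 pointers.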
26B+-tree
B+-tree: A B-tree is used as an index.
Data is stored in the leaves of the tree, known
as buckets.
[Figure: index nodes [50 65], [10 25], [55 59],
[70 81 90]; buckets labelled "... D9",
"D51 ... D54", "D66 ...", "D81 ...".]
Example: B+-tree of order 2, bucket size 4