CS 4300 INFO 4300 Information Retrieval - PowerPoint PPT Presentation

Provided by: wya2
1
CS 4300 / INFO 4300 Information Retrieval
Lecture 5 Searching Full Text 5
2
Course Administration
This course will use the Course Management System (CMS) for the assignments: http://cms.csuglab.cornell.edu/
CMS is now ready. Check that you are correctly entered in the system. If you are not there, send email to cs4300-l@lists.cs.cornell.edu. Please check soon. Do not wait until the assignment is due.
3
Organization of Files for Full Text Searching
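[Figure: three files: the documents store, the word list (each entry holds a term and a pointer to postings; terms shown: ant, bee, cat, dog, elk, fox, gnu, hog), and the postings file holding the postings lists.]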
4
Representation of Inverted Files
Document store: Stores the documents. Important for user interface design. Repositories for the storage of document collections are covered in CS/Info 4302.
Word list (vocabulary file): Stores the list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). May be held in memory or on disk.
Postings file: Stores a separate inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially. Can be very large.
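The three components above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function name `build_inverted_file`, the toy documents, and the choice of a `(doc_id, position)` pair as a posting are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Build a word list and postings lists from {doc_id: text}.

    Assumption: a posting is a (doc_id, position) pair and terms are
    lower-cased, whitespace-separated tokens.
    """
    postings = defaultdict(list)
    for doc_id in sorted(documents):  # keeps every postings list sorted by doc_id
        for position, term in enumerate(documents[doc_id].lower().split()):
            postings[term].append((doc_id, position))
    word_list = sorted(postings)  # vocabulary in lexicographic order
    return word_list, dict(postings)


# Hypothetical document store: document ID -> text.
document_store = {
    1: "ant bee cat",
    2: "bee dog bee",
    3: "cat elk",
}

word_list, postings = build_inverted_file(document_store)
print(word_list)
print(postings["bee"])
```

Here the word list is a sorted Python list and the postings file is a dictionary held in memory; a real system would keep the postings on disk and store a pointer to each list in the word list.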
5
Document Store
The Documents Store holds the corpus that is being indexed. The corpus may be:
primary documents, e.g., electronic journal articles or Web pages.
surrogates, which refer to the primary documents, e.g., catalog records or abstracts.
6
Document Store
The storage of the document store may be:
Central (monolithic): all documents stored together on a single server (e.g., library catalog)
Distributed database: all documents managed together but stored on several servers (e.g., Medline, Westlaw)
Highly distributed: documents stored on independently managed servers (e.g., the Web)
Each requires a document ID, which is a unique identifier that can be used by the search system to refer to the document, and a location counter, which can be used to specify the location of words or characters within a document.
7
Documents Store for a Library Catalog
For the Library of Congress Catalog:
A document is a catalog record.
The documents store is a group of Unix servers in a single location.
The document ID is the Library of Congress Control Number (LCCN).
The catalog records are created manually by skilled cataloguers who follow a strict set of cataloguing rules to ensure quality.
Indexes are built automatically by a process similar to Assignment 1.
(Additional indexes hold special information, such as a Name Authority file that has information about each author.)
8
Documents Store for Web Search Systems
For Web search systems:
A document is a Web page.
The documents store is the Web.
The document ID is the URL (or a hash of the URL).
Indexes are built using a web crawler, which retrieves each page on the Web for indexing. After indexing, the local copy of each page is discarded, unless stored in a cache.
(In addition to the usual word list and postings file, the indexing system stores information about links, which will be discussed in a later lecture.)
9
Requirements for the Inverted File System: Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and d_j a document, then q and d_j have no terms in common iff q · d_j = 0.
1. To calculate all the non-zero similarities, find R, the set of all the documents, d_j, that contain at least one term in the query.
2. Merge the inverted lists for each term t_i in the query, with a logical or, to establish the set, R.
3. For each d_j ∈ R, calculate Similarity(q, d_j), using appropriate weights.
4. Return the elements of R in ranked order.
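The four steps above can be sketched as follows. The slide does not say which similarity function is used, so this example assumes cosine similarity over precomputed weights; the function name `rank`, the weighted postings, and the document norms are hypothetical.

```python
import math
from collections import defaultdict


def rank(query_weights, postings, doc_norms):
    """Return [(doc_id, similarity)] for every document sharing a query term.

    postings maps term -> list of (doc_id, weight).
    Assumption: Similarity(q, d_j) is the cosine of the angle between vectors.
    """
    dot = defaultdict(float)
    # Steps 1 and 2: the union (logical or) of the query terms' lists gives R.
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in postings.get(term, []):
            dot[doc_id] += q_weight * d_weight
    # Step 3: normalise the accumulated dot products.
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    ranked = [(doc_id, s / (q_norm * doc_norms[doc_id])) for doc_id, s in dot.items()]
    # Step 4: highest similarity first, ties broken by document ID.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Toy data: every weight is 1.0, so norms follow from the number of terms.
postings = {"ant": [(1, 1.0), (2, 1.0)], "bee": [(2, 1.0)], "cat": [(3, 1.0)]}
doc_norms = {1: 1.0, 2: math.sqrt(2), 3: 1.0}

ranked = rank({"ant": 1.0, "bee": 1.0}, postings, doc_norms)
print(ranked)
```

Document 3 never appears in the result because it shares no term with the query, so its dot product with q is zero and it is never touched.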
10
Requirements for the Inverted File System: Data for Calculating Weights
The calculation of tf.idf weights requires extra data to be held in the inverted file system.
For each term, t_i, and document, d_j: f_ij = number of occurrences of t_i in d_j
For each term, t_i: n_i = number of documents containing t_i
For each document, d_j: m_j = maximum frequency of any term in d_j
For the entire document file: N = total number of documents
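As an illustration of how these four quantities combine, here is a small sketch. The slide does not give the weighting formula, so the example assumes one common variant: tf = f_ij / m_j and idf = log2(N / n_i) + 1. The function name `tf_idf` and the numbers are made up.

```python
import math


def tf_idf(f_ij, m_j, n_i, N):
    """Weight of term t_i in document d_j under the assumed tf.idf variant."""
    tf = f_ij / m_j               # term frequency, scaled by the document's most frequent term
    idf = math.log2(N / n_i) + 1  # inverse document frequency
    return tf * idf


# A term occurring 3 times in a document whose most frequent term occurs
# 6 times, and appearing in 16 of 1,024 documents.
weight = tf_idf(f_ij=3, m_j=6, n_i=16, N=1024)
print(weight)
```

Under this variant a term that appears in every document still gets an idf of 1 rather than 0, so it contributes a little to the similarity.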
11
Requirements for the Inverted File System
Location: Each posting holds information about the location of each term within the document.
Frequency: The word list includes the number of postings for each term.
12
Requirements for the Inverted File System
Lexicographic Order
It is important that the word list can be processed sequentially, i.e., in alphabetic order:
To search with wild cards, e.g., comp*, which expands to every term beginning with the letters "comp".
To list results for browsing lists of search terms.
This is a special case of the mathematical concept of lexicographic order.
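A sorted word list makes the wild card expansion cheap: a binary search finds the first candidate and the matches follow it contiguously. The sketch below uses Python's standard `bisect` module; the function name `expand_prefix` and the vocabulary are hypothetical.

```python
from bisect import bisect_left


def expand_prefix(word_list, prefix):
    """Return every term in the sorted word_list that begins with prefix."""
    start = bisect_left(word_list, prefix)  # first term >= prefix
    end = start
    while end < len(word_list) and word_list[end].startswith(prefix):
        end += 1  # matches are contiguous in lexicographic order
    return word_list[start:end]


word_list = sorted(["compact", "company", "compute", "cat", "dog", "comb", "con"])
matches = expand_prefix(word_list, "comp")
print(matches)
```

The same trick does not work for a leading wild card, because terms that share an ending are scattered throughout the sorted list.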
13
Requirements for the Inverted File System Query
Languages
Some query options may require huge computation, e.g.:
Regular expressions: If inverted files are stored in lexicographic order, comp* can be processed efficiently; *comp cannot be processed efficiently.
Logical operators: If A and B are search terms, A or B can be processed by comparing two moderate sized lists; (not A) or (not B) requires two very large lists.
14
Requirements for the Inverted File System
Storage and Performance
Storage: Inverted file systems are big, typically 10% to 100% the size of the collection of documents.
Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents, (b) add a single document.
Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.
15
Postings File Requirements
The postings file stores the elements of a sparse
matrix, the components of the term vector space,
with weights. It is stored as a separate postings
list (inverted list) for each row, i.e., a list
corresponding to each term in the index
file. Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document. Each list consists of one or many individual postings.
16
Length of Postings File
For a common term there may be a very large number of postings.
Example: 1,000,000,000 documents; 1,000,000 distinct words; average length 1,000 words per document; 10^12 postings.
By Zipf's law, the 10th ranking word occurs approximately (10^12/10)/10 = 10^10 times.
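The arithmetic in this example can be checked directly. The sketch assumes the form of Zipf's law implied by the slide's calculation, in which the word of rank r occurs about (total / 10) / r times; the function name `zipf_occurrences` is hypothetical.

```python
documents = 1_000_000_000
words_per_document = 1_000

# One posting per word occurrence: 10**9 documents * 10**3 words each.
total_postings = documents * words_per_document


def zipf_occurrences(total, rank):
    """Approximate occurrences of the word of the given rank.

    Assumption: Zipf's law with constant 1/10, as in the slide's calculation.
    """
    return total // 10 // rank


tenth = zipf_occurrences(total_postings, 10)
print(total_postings, tenth)
```

Ten billion postings for a single term is why the later slides stress sequential processing and efficient merging.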
17
Postings File Requirements
Merging inverted lists is the most
computationally intensive task in many
information retrieval systems. Since inverted
lists may be long, it is important to match
postings efficiently. Usually, the inverted lists
will be held on disk and paged into memory for
matching. Therefore algorithms for matching
postings process the lists sequentially. For
efficient matching, the inverted lists should all
be sorted in the same sequence. Inverted lists
are commonly cached to minimize disk accesses.
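When both lists are sorted in the same sequence, they can be matched in a single sequential pass with one cursor per list. The following is a minimal sketch, assuming each list holds distinct document IDs in ascending order; the function names `merge_and` and `merge_or` are made up.

```python
def merge_and(a, b):
    """Document IDs present in both sorted lists (A and B)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur later in b, so skip it
        else:
            j += 1
    return out


def merge_or(a, b):
    """Document IDs present in either sorted list (A or B)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out


both = merge_and([2, 19, 29], [3, 19, 22])
either = merge_or([2, 19, 29], [3, 19, 22])
print(both, either)
```

Each list is read once from front to back, which suits lists that are paged in from disk.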
18
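Postings File Design: A Linked List for Each Term
[Figure: one linked list of postings per term, each posting drawn as a pair of numbers]
1 abacus: (3, 94) -> (19, 7) -> (19, 63) -> (22, 56)
2 actor: (2, 66) -> (19, 64) -> (29, 45)
3 aspen: (5, 43)
4 atoll: (11, 3) -> (11, 70) -> (34, 40)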
A linked list for each term is convenient to
process sequentially, but slow to update when the
lists are long.
19
Calculating Similarities for very large
Collections
With a very large corpus, merging the postings
files is computationally demanding. See Manning,
et al., Chapter 7, for a discussion of methods to
speed up this process.
20
Word List Requirements
On disk: If a word list is held on disk, search time is dominated by the number of disk accesses.
In memory: Suppose that a word list has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. The size of the index is 100 megabytes, which can easily be held in the memory of a dedicated computer.
21
Requirements for Word List Individual Records
for Each Term
The record for term i in the word list contains:
term i
pointer to postings list for term i
number of documents in which term i occurs (n_i)
22
Data Structures for Word List Linear Index
Advantages:
Can be searched quickly, e.g., by binary search, O(log n)
Good for lexicographic processing, e.g., comp*
Convenient for batch updating
Economical use of storage
Disadvantages:
Index must be rebuilt if an extra term is added
For this reason, linear indexes are used only under special circumstances.
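A linear index can be sketched as a sorted array searched by binary search. The record layout below (term, number of documents, pointer to postings) follows the word list record described earlier, but the class names, field names, and offsets are hypothetical.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    term: str
    n_docs: int           # number of documents in which the term occurs
    postings_offset: int  # hypothetical pointer into the postings file


class LinearIndex:
    def __init__(self, entries):
        self.entries = sorted(entries)  # tuples sort by term first
        self.terms = [e.term for e in self.entries]

    def lookup(self, term) -> Optional[Entry]:
        """Binary search for term: O(log n) comparisons."""
        i = bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.entries[i]
        return None


index = LinearIndex([
    Entry("dog", 7, 300),
    Entry("ant", 3, 0),
    Entry("cat", 5, 120),
])
print(index.lookup("cat"))
print(index.lookup("bee"))
```

The whole array is built in one batch here. Adding a single term later would mean shifting every entry that sorts after it, which is the update cost listed under the disadvantages.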
23
Data Structures for Word List Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
[Figure: binary tree containing the terms ant, bee, cat, dog, elk, fox, gnu, hog]
24
Data Structures for Word List Binary Tree
Advantages:
Can be searched quickly
Convenient for batch updating
Easy to add an extra term
Economical use of storage
Disadvantages:
Less good for lexicographic processing, e.g., comp*
Tree tends to become unbalanced
If the index is held on disk, important to optimize the number of disk accesses
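A binary tree word list can be sketched as a plain binary search tree. The example assumes the tree is built by inserting the terms one at a time in the input order given on the earlier slide (elk, hog, bee, fox, cat, gnu, ant, dog); the class and function names are hypothetical.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None


def insert(root, term):
    """Insert term into the tree rooted at root and return the root."""
    if root is None:
        return Node(term)
    if term < root.term:
        root.left = insert(root.left, term)
    elif term > root.term:
        root.right = insert(root.right, term)
    return root  # duplicates are ignored


def in_order(root):
    """Yield the terms in lexicographic order."""
    if root is not None:
        yield from in_order(root.left)
        yield root.term
        yield from in_order(root.right)


def height(root):
    return 0 if root is None else 1 + max(height(root.left), height(root.right))


def build(terms):
    root = None
    for term in terms:
        root = insert(root, term)
    return root


terms = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
tree = build(terms)
degenerate = build(sorted(terms))  # inserting in sorted order gives a chain
print(list(in_order(tree)), height(tree), height(degenerate))
```

Adding a term touches only one path from the root, but the shape depends on insertion order: the same eight terms inserted in alphabetic order produce a tree of height 8 instead of 4.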
25
Data Structures for Word List Right Threaded
Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.
Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e., the link to the successor, of each node is maintained. Can be used for lexicographic processing. A good data structure when held in memory.
Knuth, vol. 1, section 2.3.1, page 325
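The right-threaded variant can be sketched as follows. This is an illustrative implementation under stated assumptions, not Knuth's code: each node carries a hypothetical `is_thread` flag saying whether its right link is a real child or a thread to its in-order successor, and terms are inserted one at a time.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None       # right child, or thread to the in-order successor
        self.is_thread = False  # True when self.right is a thread


def insert(root, term):
    """Insert term into a right-threaded tree and return the root."""
    if root is None:
        return Node(term)
    node = root
    while True:
        if term < node.term:
            if node.left is None:
                new = Node(term)
                new.right, new.is_thread = node, True  # parent is the successor
                node.left = new
                return root
            node = node.left
        elif term > node.term:
            if node.is_thread or node.right is None:
                new = Node(term)
                # The new node inherits the old right link (a thread or None).
                new.right, new.is_thread = node.right, node.is_thread
                node.right, node.is_thread = new, False
                return root
            node = node.right
        else:
            return root  # duplicates are ignored


def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node


def in_order(root):
    """Yield terms in lexicographic order without recursion or a stack."""
    node = leftmost(root)
    while node is not None:
        yield node.term
        node = node.right if node.is_thread else leftmost(node.right)


terms = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
root = None
for term in terms:
    root = insert(root, term)
print(list(in_order(root)))
```

Because every node can reach its successor directly, a scan such as a prefix expansion can start at any node and walk forward in order, which a plain binary tree cannot do without a stack or parent links.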
26
Data Structures for Word List Right Threaded
Binary Tree
[Figure: right-threaded binary tree containing the terms ant, bee, cat, dog, elk, fox, gnu, hog, with one NULL link]
27
Data Structures for Inverted Files B-trees
B-tree example (order 2)
[Figure: B-tree with root node 50 65; second-level nodes 10 19 35, 55 59, and 70 90 98; leaf nodes shown include 1 5 8 9, 12 14 18, 21 24 28, 36 47, 66 68, 72 73, and 91 95 97]
Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.
28
Data Structures for Inverted Files B-trees
B-tree of order m: a balanced, multiway search tree.
Each node stores many keys.
Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
If k_i is the ith key in a given internal node:
-> all keys in the (i-1)th child are smaller than k_i
-> all keys in the ith child are bigger than k_i
All leaves are at the same depth.
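Searching such a tree follows directly from the ordering rule: within a node, a binary search either finds the key or selects the one child that could contain it. The sketch below is a search-only illustration on a small hand-built tree of order 2; the class name `BNode`, the function `search`, and the arrangement of the keys are assumptions, and insertion with node splitting is not shown.

```python
from bisect import bisect_left


class BNode:
    def __init__(self, keys, children=None):
        self.keys = keys                # sorted keys held in this node
        self.children = children or []  # empty for a leaf, else len(keys) + 1 nodes


def search(node, key):
    """Return True if key is stored anywhere in the tree rooted at node."""
    while True:
        i = bisect_left(node.keys, key)  # number of keys smaller than key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                 # reached a leaf without finding key
        # children[i] holds the keys between keys[i - 1] and keys[i].
        node = node.children[i]


# Hypothetical tree of order 2: every node has between 2 and 4 keys.
tree = BNode(
    [10, 19, 35],
    [
        BNode([1, 5, 8, 9]),
        BNode([12, 14, 18]),
        BNode([21, 24, 28]),
        BNode([36, 47]),
    ],
)
print(search(tree, 24), search(tree, 19), search(tree, 20))
```

One node is examined per level. When each node corresponds to a disk block, the number of disk accesses for a lookup equals the depth of the tree.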
29
Data Structures for Inverted Files B-tree
A B-tree is used as an index. Data is stored in the leaves of the tree, known as buckets.
Example: B-tree of order 2, bucket size 4
[Figure: index nodes with keys 50 65; 10 25; 55 59; 70 81 90; buckets containing ... D9; D51 ... D54; D66...; D81 ...]
(Implementation of B-trees is covered in CS
4320.)