CS 430 INFO 430 Information Retrieval - PowerPoint PPT Presentation

Provided by: wya2
1
CS 430 / INFO 430 Information Retrieval
Lecture 4 Searching Full Text 4
2
Course Administration
Assignment 1 has been posted. It is a
programming assignment and is due on Sunday,
September 16 at 11 p.m. Follow the instructions
carefully. Send questions to
cs430-l_at_cs.cornell.edu. Watch the Web site for
any minor changes.
3
Course Administration
This course is using Version 3 of the Course
Management System (CMS) for the assignments:
http://cms3.csuglab.cornell.edu/
4
The Unit Sphere
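Since the length of the vectors is not used in
calculating similarities, all vectors can be
scaled to length 1.
[Figure: unit sphere with term axes t1, t2 and t3, showing the query vector q, the document vector d and the angle between them.]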
5
Document Vectors as Points on a Surface
Normalize all document vectors to be of
length 1. Then the ends of the vectors all
lie on a surface with unit radius.
For similar documents, we can represent parts of
this surface as a flat region.
Similar documents are represented as points that
are close together on this surface.
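The normalization step can be sketched in a few lines of Python. This is only an illustration: the three-term weights below are made up, and the helper names `normalize` and `dot` are not from the lecture. It shows that once vectors are scaled to length 1, the dot product of a query and a document is the cosine of the angle between them.

```python
import math

def normalize(vec):
    """Scale a term-weight vector to length 1 (an all-zero vector is returned unchanged)."""
    length = math.sqrt(sum(w * w for w in vec))
    return [w / length for w in vec] if length else list(vec)

def dot(u, v):
    """Inner product of two vectors of equal dimension."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical weights on three terms t1, t2, t3.
d = normalize([3.0, 0.0, 4.0])   # document vector, now [0.6, 0.0, 0.8]
q = normalize([1.0, 0.0, 1.0])   # query vector

# For unit vectors the dot product equals the cosine of the angle between them.
cosine = dot(q, d)
```

Documents whose normalized vectors give a cosine near 1 with the query are the ones that lie close to it on the surface.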
6
Results of a Search on a Surface
[Figure: a region of the surface of the unit sphere. Each x marks a document found by the search (a hit); a separate symbol marks the query.]
7
Organization of Files for Full Text Searching
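[Figure: three linked components.]
Word list (index file): each entry holds a term and a pointer to its postings. Example terms: ant, bee, cat, dog, elk, fox, gnu, hog.
Postings: the inverted lists, one for each term.
Documents store: the documents themselves.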
8
Representation of Inverted Files
Word list (vocabulary file): Stores the list of
terms (keywords). Designed for searching and
sequential processing, e.g., for range queries
(lexicographic index). May be held in memory.
Postings file: Stores an inverted list
(postings list) of postings for each term.
Designed for rapid merging of lists and
calculation of similarities. Each list is
usually stored sequentially. Can be very large.
Document store: Stores the documents.
Important for user interface design.
Repositories for the storage of document
collections are covered in CS 431.
9
Document Store
The Documents Store holds the corpus that is
being indexed. The corpus may be:
  primary documents, e.g., electronic journal articles or Web pages
  surrogates, e.g., catalog records or abstracts, which refer to the primary documents
10
Document Store
The storage of the document store may be:
  Central (monolithic): all documents stored together on a single server (e.g., library catalog)
  Distributed database: all documents managed together but stored on several servers (e.g., Medline, Westlaw)
  Highly distributed: documents stored on independently managed servers (e.g., the Web)
Each requires a document ID, which is a unique
identifier that can be used by the search system
to refer to the document, and a location counter,
which can be used to specify the location of words
or characters within a document.
11
Documents Store for Web Search Systems
For Web search systems:
  A document is a Web page.
  The documents store is the Web.
  The document ID is the URL (or a hash of the URL).
Indexes are built using a web crawler,
which retrieves each page on the Web for
indexing. After indexing, the local copy of each
page is discarded, unless stored in a cache.
(In addition to the usual word list and postings
file, the indexing system stores contextual
information, which will be discussed in a later
lecture.)
12
Enhancements to Inverted Files -- Concept
Location
Each posting holds information about the
location of each term within the document.
Uses:
  user interface design, e.g., highlighting the location of a search term
  adjacency and near operators (in Boolean searching)
Frequency
Each inverted list includes the number of
postings for each term.
Uses:
  term weighting
  query processing optimization
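A minimal sketch of both enhancements, assuming a tiny in-memory collection and whitespace tokenization. The sample documents and the function names `build_index`, `document_frequency` and `adjacent` are invented for illustration. Each posting keeps the word positions of the term in a document, which is enough to support an adjacency operator, and the length of a term's postings gives its frequency.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [word positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def document_frequency(index, term):
    """Number of documents that contain the term (the length of its inverted list)."""
    return len(index.get(term, {}))

def adjacent(index, first, second):
    """Doc IDs in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical three-document collection.
docs = {1: "ant bee ant", 2: "bee cat", 3: "cat ant dog"}
index = build_index(docs)
```

Here `adjacent(index, "ant", "bee")` finds only document 1, and the stored positions could equally be used to highlight the term in a results display.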
13
Data for Calculating Weights
The calculation of tf.idf weights requires extra
data to be held in the inverted file system.
For each term, ti, and document, dj:
  fij = number of occurrences of ti in dj
For each term, ti:
  ni = number of documents containing ti
For each document, dj:
  mj = maximum frequency of any term in dj
For the entire document file:
  N = total number of documents
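The four quantities can be gathered in one pass over the collection. The sketch below uses a made-up three-document corpus. The weighting formula in `weight`, (fij / mj) * (log2(N / ni) + 1), is one common tf.idf variant chosen here as an assumption; the slide lists the data needed but does not state which formula is used.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = {
    "d1": ["ant", "bee", "ant"],
    "d2": ["bee", "cat"],
    "d3": ["cat", "ant", "dog", "ant"],
}

N = len(docs)                                              # total number of documents
f = {dj: Counter(terms) for dj, terms in docs.items()}     # f[dj][ti] = fij
m = {dj: max(counts.values()) for dj, counts in f.items()} # m[dj] = mj
n = Counter(ti for counts in f.values() for ti in counts)  # n[ti] = ni

def weight(ti, dj):
    """Assumed tf.idf variant: (fij / mj) * (log2(N / ni) + 1); zero if ti is not in dj."""
    if f[dj][ti] == 0:
        return 0.0
    return (f[dj][ti] / m[dj]) * (math.log2(N / n[ti]) + 1)
```

Note that ni can be stored with the word list entry for the term, while fij belongs with each posting and mj with each document.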
14
Word List: Individual Records for Each Term
The record for term i in the word list contains:
  term i
  pointer to inverted (postings) list for term i
  number of documents in which term i occurs (ni)
15
Decisions in Building an Inverted File System
Lexicographic Order
It is important that the word list can be
processed sequentially, i.e., in alphabetic order:
  To search with wild cards, e.g., comp*, which expands to every term beginning with the letters "comp".
  To list results for browsing lists of search terms.
This is a special case of the
mathematical concept of lexicographic order.
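As an illustration of why sorted order matters, the sketch below expands a trailing wild card over a sorted word list. The terms and the function name `expand_prefix` are invented for the example. Binary search finds the first candidate, and all matches then sit in one contiguous run.

```python
from bisect import bisect_left

# Hypothetical word list, kept in lexicographic order.
word_list = sorted(["compass", "ant", "compute", "bee", "computer", "cat", "con", "comp"])

def expand_prefix(sorted_terms, prefix):
    """Return every term that begins with `prefix`, using one binary search and a scan."""
    start = bisect_left(sorted_terms, prefix)   # first term >= prefix
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]
```

A leading wild card gets no such help, because terms that share an ending are scattered throughout the sorted list.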
16
Decisions in Building an Inverted File System
Query Languages
Some query options may require huge computation,
e.g.,
Regular expressions
If inverted files are stored in lexicographic order:
  comp* can be processed efficiently
  *comp cannot be processed efficiently
Logical operators
If A and B are search terms:
  A or B can be processed by comparing two moderate sized lists
  (not A) or (not B) requires two very large lists
17
Decisions in Building an Inverted File System
Storage and Performance
Storage: Inverted file systems are big, typically
10% to 100% the size of the collection of
documents.
Update performance: It must be
possible, with a reasonable amount of
computation, to
  (a) Add a large batch of documents
  (b) Add a single document
Retrieval performance: Retrieval must be fast
enough to satisfy users and not use excessive
resources.
18
Postings File
The postings file stores the elements of a sparse
matrix, the components of the term vector space,
with weights. It is stored as a separate inverted
list for each row, i.e., a list corresponding to
each term in the index file. Each element in an
inverted list is called a posting, i.e., the
occurrence of a term in a document. Each list
consists of one or many individual postings.
19
Length of Postings File
There may be a very large number of postings for
a common term.
Example:
  1,000,000,000 documents
  1,000,000 distinct words
  average length 1,000 words per document
  10^12 postings
By Zipf's law, the 10th ranking word occurs
approximately (10^12/10)/10 = 10^10 times.
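The arithmetic can be checked directly. The sketch below follows the slide's calculation, which implies that the top-ranked word accounts for about one tenth of all occurrences and that the word of rank r occurs 1/r as often. The function name `zipf_occurrences` and the parameter `top_share` are labels introduced here, not part of the lecture.

```python
documents = 1_000_000_000
words_per_document = 1_000
total_postings = documents * words_per_document   # 10**12

def zipf_occurrences(rank, total, top_share=10):
    """Approximate occurrences of the word of the given rank.

    Assumption read off the slide's arithmetic: the rank-1 word occurs
    total / top_share times, and the rank-r word occurs 1/r as often.
    """
    return total // top_share // rank

tenth = zipf_occurrences(10, total_postings)       # (10**12 / 10) / 10
```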
20
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is a query and dj a
document, then q and dj have no terms in common
iff q.dj = 0.
1. To calculate all the non-zero similarities,
find R, the set of all the documents, dj, that
contain at least one term in the query.
2. Merge the inverted lists for each term ti in
the query, with a logical or, to establish the
set, R.
3. For each dj ∈ R, calculate Similarity(q, dj),
using appropriate weights.
4. Return the elements of R in ranked order.
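The steps above can be sketched as follows, assuming weighted postings held in memory as dictionaries and cosine similarity as the measure. The weights, the layout `term -> {doc_id: weight}` and the function name `rank` are all illustrative assumptions. Only documents that appear in some inverted list for a query term ever receive a score, which is exactly the set R.

```python
import math
from collections import defaultdict

# Hypothetical weighted postings: term -> {doc_id: weight of the term in that document}.
postings = {
    "ant": {1: 0.8, 3: 0.5},
    "bee": {1: 0.6, 2: 1.0},
    "cat": {2: 0.4, 3: 0.7},
}

def rank(query, postings):
    """Return [(doc_id, cosine similarity)] for every document sharing a term with the query."""
    # Document vector lengths; a real system would precompute and store these.
    squared = defaultdict(float)
    for plist in postings.values():
        for doc_id, w in plist.items():
            squared[doc_id] += w * w

    # Steps 1 and 2: walking the lists of the query terms is a logical OR that builds R.
    scores = defaultdict(float)
    for term, qw in query.items():
        for doc_id, w in postings.get(term, {}).items():
            scores[doc_id] += qw * w          # accumulate the dot product q.dj

    # Step 3: divide by the vector lengths to get the cosine similarity.
    q_len = math.sqrt(sum(w * w for w in query.values()))
    ranked = [(d, s / (math.sqrt(squared[d]) * q_len)) for d, s in scores.items()]

    # Step 4: ranked order, highest similarity first.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

results = rank({"ant": 1.0, "bee": 1.0}, postings)
```

Documents with no query term are never touched, so the cost depends on the lengths of the inverted lists for the query terms rather than on the size of the collection.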
21
Postings File
Merging inverted lists is the most
computationally intensive task in many
information retrieval systems. Since inverted
lists may be long, it is important to match
postings efficiently. Usually, the inverted lists
will be held on disk and paged into memory for
matching. Therefore algorithms for matching
postings process the lists sequentially. For
efficient matching, the inverted lists should all
be sorted in the same sequence. Inverted lists
are commonly cached to minimize disk accesses.
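A minimal sketch of sequential matching, assuming each inverted list is a Python list of document IDs in ascending order. The IDs and the names `merge_or` and `merge_and` are made up for the example. Because both lists are sorted in the same sequence, each is read once from front to back, which suits lists that are paged in from disk.

```python
def merge_or(a, b):
    """Union of two sorted doc-ID lists in a single sequential pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

def merge_and(a, b):
    """Intersection of two sorted doc-ID lists in a single sequential pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical inverted lists for two terms.
list_a = [3, 19, 22]
list_b = [2, 19, 29]
```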
22
Postings File: A Linked List for Each Term
1 abacus -> 3 94 -> 19 7 -> 19 63 -> 22 56
2 actor -> 2 66 -> 19 64 -> 29 45
3 aspen -> 5 43
4 atoll -> 11 3 -> 11 70 -> 34 40
A linked list for each term is convenient to
process sequentially, but slow to update when the
lists are long.
23
Calculating Similarities for Very Large Collections
With a very large corpus, merging the postings
files is computationally demanding. See Manning,
et al., Chapter 7, for a discussion of methods to
speed up this process.
24
Word List
On disk: If a word list is held on disk, search
time is dominated by the number of disk
accesses.
In memory: Suppose that a word list has
1,000,000 distinct terms. Each index entry
consists of the term, some basic statistics and a
pointer to the inverted list, on average 100
characters. The size of the index is 100
megabytes, which can easily be held in the memory
of a dedicated computer.
25
File Structures for Inverted Files: Linear Index
Advantages:
  Can be searched quickly, e.g., by binary search, O(log n)
  Good for lexicographic processing, e.g., comp*
  Convenient for batch updating
  Economical use of storage
Disadvantages:
  Index must be rebuilt if an extra term is added
26
File Structures for Inverted Files: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
Resulting tree (children listed as left, right):
  elk: bee, hog
  bee: ant, cat
  hog: fox, (none)
  cat: (none), dog
  fox: (none), gnu
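The tree for this input can be reproduced with a plain, unbalanced binary search tree. The sketch below is a generic textbook insertion routine, not code from the course; the names `Node`, `insert` and `in_order` are chosen for the example. An in-order traversal returns the terms in alphabetic order.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None

def insert(root, term):
    """Insert a term into an unbalanced binary search tree and return the root."""
    if root is None:
        return Node(term)
    if term < root.term:
        root.left = insert(root.left, term)
    elif term > root.term:
        root.right = insert(root.right, term)
    return root                      # duplicates are ignored

def in_order(node):
    """Terms of the subtree in lexicographic order."""
    if node is None:
        return []
    return in_order(node.left) + [node.term] + in_order(node.right)

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
```

The shape of the tree depends entirely on the order of insertion, which is why the same eight terms could equally produce a much deeper tree.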
27
File Structures for Inverted Files: Binary Tree
Advantages:
  Can be searched quickly
  Convenient for batch updating
  Easy to add an extra term
  Economical use of storage
Disadvantages:
  Less good for lexicographic processing, e.g., comp*
  Tree tends to become unbalanced
  If the index is held on disk, important to optimize the number of disk accesses
28
File Structures for Inverted Files: Binary Tree
Calculation of maximum depth of tree.
Illustrates importance of balanced trees.
Worst case: depth = n, O(n)
Ideal case: depth = log(n + 1)/log 2, O(log n)
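A small experiment makes the two cases concrete. The sketch below measures the depth, counted in levels, of an unbalanced binary search tree built by inserting terms in a given order. The function names are invented here, and `ideal_depth` rounds log(n + 1)/log 2 up to a whole number of levels, which is an interpretation of the formula rather than something the slide states.

```python
import math

def bst_depth(terms):
    """Number of levels in an unbalanced BST built by inserting `terms` in order."""
    root = None
    children = {}                     # term -> [left child, right child]
    depth = 0
    for term in terms:
        if root is None:
            root = term
            children[term] = [None, None]
            depth = 1
            continue
        node, level = root, 1
        while term != node:           # a duplicate term ends the walk without inserting
            side = 0 if term < node else 1
            nxt = children[node][side]
            if nxt is None:
                children[node][side] = term
                children[term] = [None, None]
                depth = max(depth, level + 1)
                break
            node, level = nxt, level + 1
    return depth

def ideal_depth(n):
    """Levels needed by a perfectly balanced tree of n nodes: ceil(log(n + 1) / log 2)."""
    return math.ceil(math.log2(n + 1))

terms = ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]
```

Inserting the eight terms in sorted order gives a chain of depth 8, while the order used on the earlier slide happens to give depth 4, the ideal for eight nodes.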
29
File Structures for Inverted Files: Right Threaded Binary Tree
Threaded tree: A binary search tree in which each
node uses an otherwise-empty left child link to
refer to the node's in-order predecessor and an
empty right child link to refer to its in-order
successor.
Right-threaded tree: A variant of a
threaded tree in which only the right thread,
i.e., the link to the successor, of each node is
maintained.
Can be used for lexicographic processing.
A good data structure when held in memory.
Knuth vol 1, 2.3.1, page 325.
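A minimal sketch of a right-threaded binary search tree, written for illustration rather than taken from Knuth. The boolean flag `is_thread`, which distinguishes a thread from a real right child, and the function names are assumptions of this example. The point is that in-order (lexicographic) traversal needs neither recursion nor a stack.

```python
class Node:
    def __init__(self, term):
        self.term = term
        self.left = None
        self.right = None        # right child, or a thread to the in-order successor
        self.is_thread = False   # True when `right` is a thread, not a child

def insert(root, term):
    """Insert a term, maintaining right threads; returns the root."""
    new = Node(term)
    if root is None:
        return new
    node = root
    while True:
        if term < node.term:
            if node.left is None:
                node.left = new
                new.right = node              # successor of a new left child is its parent
                new.is_thread = True
                return root
            node = node.left
        elif term > node.term:
            if node.right is None or node.is_thread:
                new.right = node.right        # inherit the parent's thread (may be None)
                new.is_thread = node.is_thread
                node.right = new
                node.is_thread = False
                return root
            node = node.right
        else:
            return root                       # duplicate term: nothing to do

def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node

def in_order(root):
    """Terms in lexicographic order, following threads instead of using a stack."""
    out = []
    node = leftmost(root)
    while node is not None:
        out.append(node.term)
        # A thread leads straight to the successor; a real child needs a descent left.
        node = node.right if node.is_thread else leftmost(node.right)
    return out

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
```

Only the last node in order, here "hog", ends with an empty right link.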
30
File Structures for Inverted Files: Right Threaded Binary Tree
[Figure: right-threaded binary tree over the terms ant, bee, cat, dog, elk, fox, gnu, hog, with one NULL link.]
31
File Structures for Inverted Files: B-trees
B-tree of order m: A balanced, multiway search
tree.
  Each node stores many keys.
  Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
  If ki is the ith key in a given internal node:
    -> all keys in the (i-1)th child are smaller than ki
    -> all keys in the ith child are bigger than ki
  All leaves are at the same depth.
32
File Structures for Inverted Files: B-trees
B-tree example (order 2)
[Figure: B-tree drawn on three levels.]
  Root: 50 65
  Second level: 10 19 35 | 55 59 | 70 90 98
  Third level (nodes shown): 1 5 8 9 | 12 14 18 | 21 24 28 | 36 47 | 66 68 | 72 73 | 91 95 97
Every arrow points to a node containing between 2
and 4 keys. A node with k keys has k + 1 pointers.
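Searching such a tree can be sketched as below. The small two-level tree reuses some key values that appear in the example figure, but its exact shape is an assumption, and the names `BTreeNode`, `search` and `key_counts_ok` are invented for illustration. At each node a binary search over the keys either finds the key or selects which of the k + 1 pointers to follow.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                   # sorted keys held in this node
        self.children = children or []     # empty for a leaf, else len(keys) + 1 nodes

def search(node, key):
    """Return True if `key` is stored anywhere in the subtree rooted at `node`."""
    while node is not None:
        i = bisect_left(node.keys, key)    # index of the first key >= the search key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:              # reached a leaf without finding the key
            return False
        node = node.children[i]            # child i holds keys between keys[i-1] and keys[i]
    return False

def key_counts_ok(node, m, is_root=True):
    """Check the order-m bounds on every non-root node and the k + 1 pointer rule."""
    if not is_root and not (m <= len(node.keys) <= 2 * m):
        return False
    if node.children and len(node.children) != len(node.keys) + 1:
        return False
    return all(key_counts_ok(child, m, False) for child in node.children)

# Hypothetical order-2 tree: one internal node with four leaves.
root = BTreeNode(
    [10, 19, 35],
    [BTreeNode([1, 5, 8, 9]), BTreeNode([12, 14, 18]),
     BTreeNode([21, 24, 28]), BTreeNode([36, 47])],
)
```

Because every node holds many keys, the tree stays shallow, so a lookup touches only a few nodes. This matters most when each node is a disk block.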
33
File Structures for Inverted Files: B-tree
A B-tree is used as an index. Data is
stored in the leaves of the tree, known as buckets.
Example: B-tree of order 2, bucket size 4
[Figure: root node 50 65; second-level nodes 10 25, 55 59 and 70 81 90; buckets shown include ... D9, D51 ... D54, D66... and D81 ...]
(Implementation of B-trees is covered in CS 432.)