Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Information Retrieval

Description:

Information Retrieval Inverted Files. – PowerPoint PPT presentation

Slides: 31
Provided by: wya49

Transcript and Presenter's Notes

Title: Information Retrieval


1
Information Retrieval
  • Inverted Files.

2
Document Vectors as Points on a Surface
Normalize all document vectors to be of length 1.
Define d' = d / |d|.
Then the ends of the vectors d' all lie on a surface with unit radius.
For similar documents, we can represent parts of this surface as a flat region.
Similar documents are represented as points that are close together on this surface.
3
Results of a Search
[Figure: the query (?) surrounded by the hits from the search (x).]
Legend: x = documents found by search; ? = query
4
Relevance Feedback (Concept)
[Figure: hits from the original search, with the original query and the reformulated query marked.]
Legend: x = documents identified as non-relevant; o = documents identified as relevant; ? = original query. The reformulated query is also marked.
5
Document Clustering (Concept)
[Figure: documents (x) grouped into clusters.]
Document clusters are a form of automatic
classification. A document may be in several
clusters.
6
Use of Inverted Files for Calculating Similarities
In the term vector space, if q is the query and dj a document, then q and dj have no terms in common iff q · dj = 0.
1. To calculate all the non-zero similarities, find all the documents, dj, that contain at least one term in the query:
   Merge the inverted lists for each term ti in the query, with a logical OR, to establish a set of hits, R.
   For each dj ∈ R, calculate Similarity(q, dj), using appropriate weights.
2. Return the elements of R in ranked order.
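The two steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the dictionary `inverted_index`, the function `rank`, the sample weights, and the use of a plain dot product as Similarity(q, dj) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> {document ID: weight of the term in that document}.
inverted_index = {
    "ant": {1: 0.5, 3: 0.2},
    "bee": {1: 0.1, 2: 0.7},
    "cat": {4: 0.9},
}

def rank(query_weights, index):
    """Return (doc_id, similarity) pairs for every document sharing a term with the query."""
    scores = defaultdict(float)
    # Step 1: merge the inverted lists of the query terms with a logical OR.
    # A document joins the hit set R as soon as any one list mentions it.
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight  # accumulate the dot product q . dj
    # Step 2: return the elements of R in ranked order, highest similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

hits = rank({"ant": 1.0, "bee": 1.0}, inverted_index)
ranked_ids = [doc_id for doc_id, _ in hits]
```

Documents that share no term with the query (document 4 here) never appear in any merged list, so their zero similarity is never computed.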
7
Representation of Inverted Files
Document file: Stores the documents. Important for user interface design.
Index (word list, vocabulary) file: Stores list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). Often held in memory.
Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.
8
Organization of Inverted Files
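[Figure: the index file, the postings file, and the documents file.]
Index file: each entry holds a term and a pointer to its postings (ant, bee, cat, dog, elk, fox, gnu, hog).
Postings file: holds the inverted lists.
Documents file.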
9
Decisions in Building Inverted Files: What is a Term?
  • Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
  • Is there a controlled vocabulary? If so, what words are included?
  • List of stopwords.
  • Rules to decide the beginning and end of words, e.g., spaces or punctuation.
  • Character sequences not to be indexed, e.g., sequences of numbers.
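One possible combination of these decisions can be sketched as a small tokenizer. Everything specific here is an assumption for illustration: the stopword list, the choice of ASCII letters and digits as the character set, and the function name `terms`.

```python
import re

# Assumed stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}

def terms(text):
    """Split text into index terms under one possible set of decisions."""
    # Word boundary rule: a word is a maximal run of ASCII letters or digits,
    # so spaces and punctuation both end a word.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    # Drop stopwords and character sequences made up only of digits.
    return [t for t in tokens if t not in STOPWORDS and not t.isdigit()]

result = terms("The index of 1,000 inverted files, and the B-tree.")
```

Note the side effects of these rules: "1,000" is dropped entirely, and "B-tree" is split into two terms. A different boundary rule would index it as one.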
10
Decisions in Building an Inverted File: Efficiency and Query Languages
Some query options may require huge computation, e.g.:
Regular expressions: If inverted files are stored in lexicographic order,
  • comp* can be processed efficiently
  • *comp cannot be processed efficiently
Boolean terms: If A and B are search terms,
  • A or B can be processed by comparing two moderate sized lists
  • (not A) or (not B) requires two very large lists
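The difference between a prefix query and a suffix query on a lexicographically ordered term list can be sketched as below. The vocabulary and the function name `prefix_range` are hypothetical, and the sketch assumes no term contains the code point U+10FFFF.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms that start with prefix, using two binary searches."""
    lo = bisect_left(sorted_terms, prefix)
    # Terms with this prefix sort below prefix + the largest Unicode code point.
    hi = bisect_left(sorted_terms, prefix + "\U0010ffff")
    return sorted_terms[lo:hi]

vocabulary = sorted(["company", "compute", "computer", "cone", "decomp", "incomp"])

# Prefix query: the matching terms are contiguous in the sorted list.
matches = prefix_range(vocabulary, "comp")

# Suffix query: the ordering gives no help, so every term must be examined.
suffix_matches = [t for t in vocabulary if t.endswith("comp")]
```

The prefix query costs two O(log n) searches plus the size of the answer, while the suffix query scans the whole vocabulary.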
11
Efficiency Criteria
Storage: Inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: It must be possible, with a reasonable amount of computation, to:
  (a) Add a large batch of documents
  (b) Add a single document
Retrieval performance: Retrieval must be fast enough to satisfy users and not use excessive resources.
12
Document File
The documents file stores the documents that are being indexed. The documents may be:
  • primary documents, e.g., electronic journal articles
  • surrogates, e.g., catalog records or abstracts
13
Document File
The storage of the document file may be:
  • Central (monolithic): all documents stored together on a single server (e.g., library catalog)
  • Distributed database: all documents managed together but stored on several servers (e.g., Medline, Westlaw, Dialog)
  • Highly distributed: documents are stored on independently managed servers (e.g., Web)
Each requires a document ID, which is a unique identifier that can be used by the inverted file system to refer to the document, and a location counter, which can be used to specify location within a document.
14
Docs file for web search system
For web search systems:
  • A document is a web page.
  • The documents file is the web.
  • The document ID is the URL of the document.
Indexes are built using a web crawler, which retrieves each page on the web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores contextual information, which will be discussed in a later lecture.)
15
Postings File
The postings file stores the elements of a sparse matrix, the term assignment matrix. It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.
Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
Each list consists of one or many individual postings.
16
Postings File: A Linked List for Each Term
  • 1 abacus: 3 94, 19 7, 19 212, 22 56
  • 2 actor: 66, 19 213, 29 45
  • 3 aspen: 5 43
  • 4 atoll: 3, 70, 34 40

A linked list for each term is convenient to
process sequentially, but slow to update when
the lists are long.
17
Length of Postings File
For a common term there may be a very large number of postings.
Example:
  1,000,000,000 documents
  1,000,000 distinct words
  average length 1,000 words per document
  10^12 postings
By Zipf's law, the 10th ranking word occurs approximately (10^12/10)/10 = 10^10 times.
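The arithmetic in this example can be checked with a few lines of Python. The function name `zipf_occurrences` is made up for the sketch, and the divisor of 10 is simply the constant used in the example above, not a general property of every collection.

```python
documents = 1_000_000_000        # 10^9 documents
words_per_document = 1_000       # average length of a document
total_postings = documents * words_per_document   # 10^12 word occurrences

def zipf_occurrences(rank, total):
    """The example's approximation: the rank-r word occurs about (total / 10) / r times."""
    return total // 10 // rank

tenth = zipf_occurrences(10, total_postings)   # occurrences of the 10th ranking word
first = zipf_occurrences(1, total_postings)    # occurrences of the most common word
```

Under the same approximation the most common word alone accounts for about one tenth of all postings, which is why the lists for common terms dominate storage and merge time.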
18
Postings File
Merging inverted lists is the most
computationally intensive task in many
information retrieval systems. Since inverted
lists may be long, it is important to match
postings efficiently. Usually, the inverted lists
will be held on disk and paged into memory for
matching. Therefore algorithms for matching
postings process the lists sequentially. For
efficient matching, the inverted lists should all
be sorted in the same sequence. Inverted lists
are commonly cached to minimize disk accesses.
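A sequential match of two inverted lists that are sorted in the same sequence can be sketched as follows. The function name `intersect` and the sample document IDs are illustrative assumptions. The sketch finds the documents common to both lists, as a Boolean AND would need, and reads each list once from front to back.

```python
def intersect(list_a, list_b):
    """Match two inverted lists sorted by document ID in a single sequential pass."""
    i, j, matches = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            matches.append(list_a[i])   # document appears in both lists
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1                      # advance the list that is behind
        else:
            j += 1
    return matches

common = intersect([3, 19, 22, 66], [5, 19, 29, 66, 70])
```

Because neither pointer ever moves backwards, the lists can be streamed from disk page by page. If the lists were in different orders, each lookup would need random access.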
19
Data for Calculating Weights
The calculation of weights requires extra data to be held in the inverted file system.
For each term, tj, and document, di: fij = number of occurrences of tj in di
For each term, tj: nj = number of documents containing tj
For each document, di: mi = maximum frequency of any term in di
For the entire document file: n = total number of documents
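To show how these four quantities might be combined, here is a sketch of one common tf.idf-style weight, (fij / mi) × log2(n / nj). This particular formula is an assumption chosen for illustration. The slide lists the data to store but does not say which weighting scheme uses it.

```python
import math

def weight(f_ij, m_i, n_j, n):
    """One tf.idf-style weight (an assumed scheme, not taken from the slides)."""
    tf = f_ij / m_i            # frequency of tj in di, scaled by the most frequent term in di
    idf = math.log2(n / n_j)   # terms found in few documents (small nj) get larger weights
    return tf * idf

# Term occurs 3 times in a document whose most frequent term occurs 6 times,
# and appears in 250 of 1,000 documents.
w = weight(f_ij=3, m_i=6, n_j=250, n=1000)

# A term that appears in every document carries no weight under this scheme.
w_everywhere = weight(f_ij=3, m_i=6, n_j=1000, n=1000)
```

Whatever scheme is used, fij belongs with each posting, nj with each index entry, mi with each document record, and n is a single global value.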
20
Index File: Individual Records for Each Term
The record for term j in the index file contains:
  • term j
  • pointer to inverted (postings) list for term j
  • number of documents in which term j occurs (nj)
21
Index Files
On disk: If an index is held on disk, search time is dominated by the number of disk accesses.
In memory: Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics and a pointer to the inverted list, average 100 characters. Size of index is 100 megabytes, which can easily be held in memory of a dedicated computer.
22
Index File Structures: Linear Index
Advantages:
  • Can be searched quickly, e.g., by binary search, O(log n)
  • Good for lexicographic processing, e.g., comp*
  • Convenient for batch updating
  • Economical use of storage
Disadvantages:
  • Index must be rebuilt if an extra term is added
23
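Index File Structures: Binary Tree
Input: elk, hog, bee, fox, cat, gnu, ant, dog
[Figure: binary search tree built by inserting the terms in input order. elk is the root; bee and hog are its children; ant and cat are under bee; fox is under hog; dog is under cat; gnu is under fox.]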
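The tree for this input can be rebuilt with a short sketch of plain, unbalanced binary search tree insertion. The class and function names are illustrative, not taken from the slides.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Insert key into the tree rooted at root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root   # duplicate keys are ignored

def in_order(node):
    """Return the keys in lexicographic order."""
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)

def depth(node):
    """Number of levels in the tree."""
    return 0 if node is None else 1 + max(depth(node.left), depth(node.right))

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
```

The shape depends entirely on the order of insertion: the same eight terms inserted in alphabetical order would produce a chain eight levels deep instead of four.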
24
Binary Tree
Advantages:
  • Can be searched quickly
  • Convenient for batch updating
  • Easy to add an extra term
  • Economical use of storage
Disadvantages:
  • Less good for lexicographic processing, e.g., comp*
  • Tree tends to become unbalanced
  • If the index is held on disk, important to optimize the number of disk accesses
25
Binary Tree
Calculation of maximum depth of tree. Illustrates importance of balanced trees.
Worst case: depth = n, O(n)
Ideal case: depth = log(n + 1)/log 2, O(log n)
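Both cases can be reproduced with a small sketch that measures the depth of the tree produced by a given insertion order. The function name `bst_depth` and the choice of n = 7 are assumptions for the example, and the keys are assumed to be distinct.

```python
import math

def bst_depth(keys):
    """Depth (number of levels) of the binary search tree built by inserting keys in order."""
    children = {}              # key -> [left child key, right child key]
    root, deepest = None, 0
    for key in keys:
        children[key] = [None, None]
        if root is None:
            root, deepest = key, 1
            continue
        node, level = root, 1
        while True:
            side = 0 if key < node else 1
            nxt = children[node][side]
            if nxt is None:
                children[node][side] = key          # attach the new key as a leaf
                deepest = max(deepest, level + 1)
                break
            node, level = nxt, level + 1
    return deepest

n = 7
worst = bst_depth(range(1, n + 1))          # sorted input: every node has a single child
ideal = bst_depth([4, 2, 6, 1, 3, 5, 7])    # an order that keeps the tree perfectly balanced
formula = math.log(n + 1) / math.log(2)     # the ideal-case expression from the slide
```

With n = 7 the sorted input gives a depth of 7, while the balanced order gives 3, matching log(8)/log 2.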
26
Right Threaded Binary Tree
Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor.
Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e., link to the successor, of each node is maintained.
Can be used for lexicographic processing.
A good data structure when index held in memory.
Knuth vol 1, 2.3.1, page 325.
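A minimal sketch of a right-threaded binary search tree follows. It is an illustration written for this transcript, not Knuth's algorithm verbatim. The class name `TNode`, the flag `right_is_thread`, and the reuse of the animal terms from the earlier binary tree slide are assumptions.

```python
class TNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None            # either a real right child or a thread
        self.right_is_thread = True  # True: right points to the in-order successor (or None)

def insert(root, key):
    """Insert key, maintaining right threads; returns the root."""
    if root is None:
        return TNode(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                new = TNode(key)
                new.right = node             # successor of a new left child is its parent
                node.left = new
                return root
            node = node.left
        elif key > node.key:
            if node.right_is_thread:
                new = TNode(key)
                new.right = node.right       # inherit the parent's old successor
                node.right = new
                node.right_is_thread = False
                return root
            node = node.right
        else:
            return root                      # key already present

def leftmost(node):
    while node.left is not None:
        node = node.left
    return node

def in_order(root):
    """Yield keys in lexicographic order without recursion or a stack."""
    node = leftmost(root) if root is not None else None
    while node is not None:
        yield node.key
        if node.right_is_thread:
            node = node.right                # follow the thread to the successor
        else:
            node = leftmost(node.right)      # successor is the leftmost node of the right subtree

root = None
for term in ["elk", "hog", "bee", "fox", "cat", "gnu", "ant", "dog"]:
    root = insert(root, term)
ordered = list(in_order(root))
```

Because each node can reach its successor directly, a scan can start at any term and continue in order, which is what lexicographic processing requires.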
27
Right Threaded Binary Tree
From Robert F. Rossa
28
B-trees
B-tree of order m: A balanced, multiway search tree.
  • Each node stores many keys.
  • Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys.
  • If ki is the ith key in a given internal node:
    -> all keys in the (i-1)th child are smaller than ki
    -> all keys in the ith child are bigger than ki
  • All leaves are at the same depth.
29
B-trees
B-tree example (order 2)
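[Figure: B-tree of order 2, nodes listed by level.]
Root: 50 65
Second level: 10 19 35 | 55 59 | 70 90 98
Leaves shown: 1 5 8 9 | 12 14 18 | 21 24 28 | 36 47 | 66 68 | 72 73 | 91 95 97
Every arrow points to a node containing between 2 and 4 keys. A node with k keys has k + 1 pointers.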
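Searching such a tree can be sketched as below. The class `BTreeNode`, the function `search`, and the small two-level tree are hypothetical, built only for the example. The sketch covers lookup only, not insertion or node splitting.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keys held in this node
        self.children = children or []   # k + 1 children for k keys; empty for a leaf

def search(node, key):
    """Return True if key is stored anywhere in the subtree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)  # first position whose key is >= the target
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False                 # reached a leaf without finding the key
        node = node.children[i]          # child i holds the keys that fall before keys[i]
    return False

# Hypothetical order-2 tree: every node holds between 2 and 4 keys.
tree = BTreeNode(
    [20, 40],
    [BTreeNode([5, 10]), BTreeNode([25, 30, 35]), BTreeNode([45, 50])],
)
found = [search(tree, k) for k in (30, 20, 33, 50, 1)]
```

Each step down the tree reads one node, so when nodes are sized to match disk pages the number of disk accesses per lookup equals the depth of the tree.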
30
B-tree
A B-tree is used as an index. Data is stored in the leaves of the tree, known as buckets.
Example: B-tree of order 2, bucket size 4
[Figure: index nodes 50 65 | 10 25 | 55 59 | 70 81 90, pointing to buckets ... D9 | D51 ... D54 | D66... | D81 ...]