Title: IST2140 Information Storage and Retrieval
1 IST2140 Information Storage and Retrieval
- Week 5
- Implementing IR Systems
2 Sample Statistics of Text Collections
- Dialog claims to have >12 terabytes of data in >600 databases, >800 million unique records
- LEXIS/NEXIS claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases, >200,000 searches per day, 99.99% availability, 9 mainframes, 300 Unix servers, 200 NT servers
3 Sample Statistics of Text Collections
- Web search engines cover less than 1/3 of the WWW according to the Lawrence and Giles study; the largest are Fast and Google, which claim to index over 2 billion pages
- TREC collections: a total of about 5 gigabytes of text
4 Designing an IR System
- Decisions designed to improve performance: effectiveness (precision, recall)
- Stemming, stopwords, weighting schemes, matching algorithms
- Decisions designed to improve performance: efficiency (storage space, access time)
- Compression, file structures, space-time tradeoffs
5 Implementation Issues
- Storage of text: compression?
- Indexing of text
- Memory for indexing, especially sorting
- Storage of indexes: compression?
- Accessing text
- Accessing indexes
- Processing indexes
- Accessing documents
6 Storage of text: image vs. ASCII
- Document image
- Digital image of page; words represented as patterns of pixels
- Not searchable as text
- Optical character recognition to convert to ASCII (may be error-prone)
- ASCII
- Searchable as text; words represented as ASCII codes
7 Text Compression
- Motivation: to save storage space and transmission time
- Must be lossless (cf. image compression)
- Compromises
- Encode/decode time
- Random access to text?
8 Text Compression
- Common methods
- Symbol-wise methods
- Estimate probabilities of symbols, code one at a time, shorter codes for higher probabilities (cf. Morse code)
- E.g. Huffman coding
- Dictionary methods
- Replace words and fragments with dictionary entries (cf. Braille)
- E.g. Ziv-Lempel compression
- May be static or dynamic
9 Huffman coding
- Developed in the 1950s, widely used
- Static code, variable length
- Based on frequency of occurrence of letters (from English or from a body of text)
- Method
- Sort by falling probabilities; link the 2 symbols with least probabilities and label with their sum; repeat until you reach a single symbol with probability of 1
- Read down the tree to assign codes
10 Huffman coding
- Consider a 7-symbol alphabet
- [Table of symbol probabilities not preserved in the source]
11 Huffman code tree
[Figure: binary Huffman code tree; branches labelled 0 and 1, leaves a-g]
12 Huffman coding
- So to decode, work down the tree, left to right
- E.g.
- 001000000010001000011110
- ??
- Fast for encoding and decoding
- Good for word-based models
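The tree construction and decoding steps above can be sketched in Python. The 7-symbol probabilities are hypothetical (the slide's table was not preserved), so the exact bit patterns will differ from the lecture's tree, but the codes are prefix-free and decoding works the same way:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build a Huffman code from a symbol -> probability map.
    Repeatedly merge the two least-probable subtrees; prefixing
    0/1 at each merge yields the code for every leaf."""
    tick = count()  # tie-breaker so equal probabilities never compare dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[s] for s in text)

def decode(bits, codes):
    """Walk the bit string left to right; a prefix-free code means
    the first matching codeword is always the right one."""
    rev = {v: k for k, v in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in rev:
            out.append(rev[cur])
            cur = ""
    return "".join(out)
```

For example, with assumed probabilities `{'a': 0.05, 'b': 0.05, 'c': 0.1, 'd': 0.2, 'e': 0.3, 'f': 0.2, 'g': 0.1}`, `decode(encode("efface", codes), codes)` returns `"efface"`.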
13 Ziv-Lempel Compression
- Adaptive coding
- For repeat occurrences of text segments, a pointer back to the first occurrence
- Higher compression than Huffman coding
- Also used for image compression
14 Ziv-Lempel compression
- Based on triples <a,b,c>, where
- a: how far back to the segment
- b: number of characters in the segment
- c: new character to end the segment
- E.g.
- <0,0,z>: first occurrence of z
- <17,5,r>: go back 17 characters, copy 5 characters, end in r
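The triple decoding described above can be sketched directly; this is a minimal LZ77-style decoder, not a full Ziv-Lempel implementation (no encoder, no windowing):

```python
def lz77_decode(triples):
    """Reconstruct text from (offset, length, char) triples:
    copy `length` characters starting `offset` back, then append `char`."""
    out = []
    for back, length, ch in triples:
        for _ in range(length):
            out.append(out[-back])  # char-by-char copy allows overlapping runs
        out.append(ch)
    return "".join(out)

# <0,0,'a'> and <0,0,'b'> are first occurrences; <2,3,'c'> copies an
# overlapping run ("aba") from 2 characters back, then appends 'c'.
print(lz77_decode([(0, 0, 'a'), (0, 0, 'b'), (2, 3, 'c')]))  # ababac
```

Note that the copy length may exceed the back-offset, which is why the copy must proceed one character at a time.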
15 Indexing
- Promotes efficiency in terms of time for retrieval
- Needed to resolve queries and extract relevant documents quickly
- The usual unit for indexing is the word (cf. n-grams)
- Issue of granularity of index: word, sentence, paragraph, document, block
16 Sample Document Collections
17 Index issues
- How to structure the index
- How to create the index (storage, time)
- How to store the index (storage, compression)
- How to process the index (storage, time)
- How to update the index (storage, time)
18 Inverted file indexing
- Postings file or concordance
- Inverted file contains
- Postings: for each term in the lexicon, a list of pointers to all occurrences of that term in the main text, stored in increasing document ID
- Lexicon: mapping from terms to pointer lists
19 Lexicon and postings file
- Lexicon entry: salmon, 29, PTR
- Postings: <5,23> <12,95> <16,22> <21,12> <25,42>
- Document 5: "The extinction of Atlantic salmon is predicted if actions to preserve stocks are not taken"
20 Structure of inverted index
- Document-level indexing
- No. Term Documents
- 1 cold <2; 1,4>
- 2 days <2; 3,6>
- Cf. word-level indexing
- 1 cold <2; (1,6), (4,8)>
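A document-level index of the kind shown above can be built with a short sketch; the three sample documents are hypothetical, and each entry pairs the term frequency ft with its sorted document-ID list:

```python
from collections import defaultdict

def build_index(docs):
    """Document-level inverted index.
    docs: doc_id -> text. Returns term -> (ft, sorted list of doc IDs)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {t: (len(ds), sorted(ds)) for t, ds in postings.items()}

docs = {1: "nine days old", 2: "cold days", 3: "nine cold nights"}
index = build_index(docs)
print(index["days"])  # (2, [1, 2])
```

A word-level index would store (document, word-position) pairs instead of bare document IDs, trading index size for proximity-query support.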
21 Structure of inverted index
- May be a hierarchical set of addresses, e.g.
- word number within sentence number within paragraph number within chapter number within volume number within document number
- Consider as a vector (d,v,c,p,s,w)
22 Compression of indexes
- Index size: case folding, stemming, stopword removal, compression
- Elimination of stopwords (a few dozen words, roughly 30% of the text)
- Granularity: coarse granularity compresses the index but increases processing for proximity queries
23 Compression of inverted indexes
- Uncompressed, maybe 50-100% of the size of the text
- Compression: store differences (gaps) rather than document numbers
- E.g. <8; 3,5,20,21,23,76,77,78>
- becomes <8; 3,2,15,1,2,53,1,1>
- Then code the differences using global (for all lists) or local (for each list) methods
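The gap transformation above is a two-line computation; a code such as a gamma or variable-byte code would then be applied to the (mostly small) gaps:

```python
def gaps(doc_ids):
    """Replace a sorted document-ID list with its first element
    followed by the differences between successive IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gs):
    """Invert the gap transform by running-sum."""
    out, total = [], 0
    for g in gs:
        total += g
        out.append(total)
    return out

print(gaps([3, 5, 20, 21, 23, 76, 77, 78]))  # [3, 2, 15, 1, 2, 53, 1, 1]
```

Gaps are small for frequent terms, which is exactly where the longest lists are, so the subsequent coding step pays off most where it matters.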
24 Other indexing structures
- Signature files
- Each document has an associated signature, generated by hashing each term it contains
- Leads to possible matches; further processing to resolve
- Bitmaps
- One-to-one hash function: each distinct term in the collection has a bit vector with one bit for each document
- Special case of signature file; storage expensive
25 Signature files
- Early use: edge-notched cards
- E.g. Nine days old -> 1010110001001100
- Hash each word three times using different functions to generate 1 bits in the string
- May generate false matches
- See animation at http://ciips.ee.uwa.au/morris/Year2/PLDS210/hash_tables.html
- Size of signature: processing vs. storage
- Processing: hash query, compare signatures
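The superimposed coding described above can be sketched as follows; the use of salted MD5 as the three hash functions is an illustrative assumption, not the method from the slide:

```python
import hashlib

def signature(words, bits=16, hashes=3):
    """Superimposed coding: each word sets `hashes` bits
    in a `bits`-wide document signature."""
    sig = 0
    for w in words:
        for i in range(hashes):
            # salt the hash with i to simulate `hashes` different functions
            h = int(hashlib.md5(f"{i}:{w}".encode()).hexdigest(), 16)
            sig |= 1 << (h % bits)
    return sig

def maybe_contains(sig, word, bits=16, hashes=3):
    """True if all of the word's bits are set in the signature.
    May report a false match, never a false miss."""
    wsig = signature([word], bits, hashes)
    return sig & wsig == wsig
```

Querying hashes the query word with the same functions and checks that its bits are a subset of the document signature, so every true occurrence is found, at the cost of occasional false matches that must be resolved against the text.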
26 Comparison of indexing methods
- Inverted indexes, signature files, bitmaps: different ways of storing a sparse matrix
- Signature files: extra access to the main text; poor when document lengths are variable; 2-3 times larger than compressed inverted indexes
- Inverted indexes require the lexicon file in main memory
27 Querying the index
- Lexicon entry
- (term t, ft, pointer)
- E.g. (whale, 6, PTR)
- Store in memory in sorted order; locate a term by binary search
- Compress the lexicon, e.g. front coding based on common prefixes (roughly 40% saving)
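Front coding as mentioned above exploits the fact that a sorted lexicon has long shared prefixes between neighbours; a minimal sketch:

```python
def front_code(sorted_terms):
    """Front coding: replace each term with (length of prefix
    shared with the previous term, remaining suffix)."""
    coded, prev = [], ""
    for t in sorted_terms:
        k = 0
        while k < min(len(prev), len(t)) and prev[k] == t[k]:
            k += 1
        coded.append((k, t[k:]))
        prev = t
    return coded

def front_decode(coded):
    """Rebuild terms by reusing the stated prefix of the previous term."""
    terms, prev = [], ""
    for k, suffix in coded:
        prev = prev[:k] + suffix
        terms.append(prev)
    return terms

print(front_code(["jezebel", "jezer", "jezerit", "jezia"]))
# [(0, 'jezebel'), (4, 'r'), (5, 'it'), (3, 'ia')]
```

A practical lexicon would front-code only within small blocks so that binary search can still start from an uncompressed block header.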
28 Querying the index
- If terms are partially specified, e.g. cat*
- use brute-force string matching
- In general, processing is left to right, i.e. one can stem by suffix but not by prefix
- How to handle word fragments or prefix removal?
29 Processing Boolean Queries (I)
- Assuming a conjunctive (AND) query
- For each query term t
- search the lexicon, record ft and the address of It, the inverted file entry for t
- Identify the query term t with smallest ft
- Read the corresponding inverted file entry It
- Set C <- It. C is the list of candidates.
30 Processing Boolean Queries (II)
- For each remaining term t
- Read the inverted file entry It
- For each d in C
- If d is not in It, then
- Set C <- C - {d}
- If |C| = 0
- Return: there are no answers
- For each d in C
- Look up the address of document d
- Retrieve document d and present it to the user
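The conjunctive algorithm of the two slides above can be sketched over an in-memory index (term -> sorted document-ID list standing in for the inverted file entries It):

```python
def conjunctive_query(index, terms):
    """AND query: start from the rarest term's postings and
    remove candidates absent from each remaining list."""
    try:
        entries = [index[t] for t in terms]
    except KeyError:
        return []                 # a term missing from the lexicon: no answers
    entries.sort(key=len)         # smallest ft first keeps the candidate set small
    candidates = set(entries[0])  # C <- It for the rarest term
    for it in entries[1:]:
        candidates &= set(it)     # drop every d not in this It
        if not candidates:
            return []             # |C| = 0: no answers
    return sorted(candidates)

index = {"cold": [1, 4], "days": [3, 4, 6], "nights": [2, 4]}
print(conjunctive_query(index, ["cold", "days"]))  # [4]
```

Starting with the smallest ft matters because every later intersection can only shrink the candidate set, so the cheapest list bounds the total work.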
31 Processing the inverted index for ranked-output systems
- For each query term
- for each document in the inverted list
- augment its similarity coefficient
- For each document
- finish the calculation of its similarity coefficient
- Sort the similarity coefficients
- Retrieve and present the documents
32 Processing Vector Space Queries (I)
- To retrieve r documents using the cosine measure
- Set A <- {}. A is the set of accumulators
- For each query term t in Q
- Stem t
- Search the lexicon, record ft and the address of It
- Set wt <- 1 + loge(N/ft)
33 Processing VS Queries (II)
- Read the inverted file entry It
- For each (d, fd,t) pair in It
- If Ad is not in A, then
- set Ad <- 0
- set A <- A + {Ad}
- Set Ad <- Ad + loge(1 + fd,t) * wt
- For each Ad in A
- Set Ad <- Ad/Wd (where Wd is the weight of document d)
- Ad is now proportional to the value of cosine(Q,Dd)
34 Processing VS Queries (III)
- For 1 <= i <= r
- Select d such that Ad = max A
- Look up the address of document d
- Retrieve document d and present it to the user
- Set A <- A - {Ad}
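The accumulator algorithm of slides 32-34 can be sketched as below. The weight formulas follow the reconstructed slide text (wt = 1 + loge(N/ft), contribution loge(1 + fd,t) * wt), which is one common TF-IDF variant; the tiny index and unit document weights are assumptions for illustration:

```python
import math

def rank(index, N, doc_weights, query_terms, r):
    """Accumulator-based cosine ranking over an inverted index.
    index: term -> list of (doc_id, f_dt) pairs (the entry It)
    N: number of documents; doc_weights: doc_id -> W_d.
    Returns the top-r document IDs."""
    A = {}                                    # the set of accumulators
    for t in set(query_terms):
        if t not in index:
            continue
        ft = len(index[t])
        wt = 1 + math.log(N / ft)             # query-term weight
        for d, fdt in index[t]:
            A[d] = A.get(d, 0.0) + math.log(1 + fdt) * wt
    for d in A:
        A[d] /= doc_weights[d]                # normalise by document weight W_d
    return sorted(A, key=A.get, reverse=True)[:r]

index = {"whale": [(1, 2), (3, 1)], "sea": [(1, 1), (2, 4)]}
weights = {1: 1.0, 2: 1.0, 3: 1.0}
print(rank(index, 3, weights, ["whale"], 2))  # [1, 3]
```

Only documents that share at least one term with the query ever get an accumulator, which is why this runs in time proportional to the touched postings rather than to the collection size.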
35 Building the inverted index
- Create a frequency matrix: document by term
- Read in document order
- then write to disk in term order (i.e. transpose)
- Problem: size of matrix
36 Some solutions
- Resources predicted for 6 GB of text:
- Linked lists (memory): 4 GB memory, 0 MB disk, 6 hours
- Linked lists (disk): 30 MB memory, 4 GB disk, 1,100 hours
- Sort-based: 40 MB memory, 8 GB disk, 20 hours
- Text-based partition: 40 MB memory, 35 MB disk, 15 hours
37 Dynamic Collections
- Inserting a document
- Usually an append to previous files
- May cause some problems for compression
- Updating the index
- Accumulate updates in a new file and check it for each query
- Build expansion room into the lexicon and index
- Reindex?
38 For details, see
- I.H. Witten, A. Moffat and T.C. Bell, Managing Gigabytes, 2nd ed. Morgan Kaufmann, 1999. Chapter 3 (Indexing); Chapter 4 (Querying).