1
IST2140 Information Storage and Retrieval
  • Week 5
  • Implementing IR Systems

2
Sample Statistics of Text Collections
  • Dialog claims to have >12 terabytes of data in
    >600 databases, >800 million unique records
  • LEXIS/NEXIS claims 7 terabytes, 1.7 billion
    documents, 1.5 million subscribers, 11,400
    databases, >200,000 searches per day, 99.99%
    availability, 9 mainframes, 300 Unix servers, 200
    NT servers

3
Sample Statistics of Text Collections
  • Web search engines cover less than 1/3 of the WWW
    according to the Lawrence and Giles study; the
    largest are FAST and Google, which claim to index
    over 2 billion pages
  • TREC collections: a total of about 5 gigabytes of
    text

4
Designing an IR System
  • Decisions designed to improve performance
    effectiveness: precision, recall
  • Stemming, stopwords, weighting schemes, matching
    algorithms
  • Decisions designed to improve performance
    efficiency: storage space, access time
  • Compression, file structures, space/time
    tradeoffs

5
Implementation Issues
  • Storage of text --- compression??
  • Indexing of text
  • Memory for indexing, especially sorting
  • Storage of indexes --- compression?
  • Accessing text
  • Accessing indexes
  • Processing indexes
  • Accessing documents

6
Storage of text: image vs. ASCII
  • Document image
  • Digital image of a page; words represented as
    patterns of pixels
  • Not searchable as text
  • Optical character recognition to convert to ASCII
    (may be error-prone)
  • ASCII
  • Searchable as text; words represented as ASCII
    codes

7
Text Compression
  • Motivation: to save storage space and
    transmission time
  • Must be lossless (cf. image compression)
  • Compromises
  • Encode-decode time
  • Random access to text?

8
Text Compression
  • Common methods
  • Symbol-wise methods
  • Estimate probabilities of symbols, code one at a
    time, shorter codes for high probabilities
    (Morse)
  • E.g. Huffman coding
  • Dictionary methods
  • Replace words and fragments with dictionary
    entries (Braille)
  • E.g. Ziv-Lempel compression
  • May be static or dynamic

9
Huffman coding
  • Developed in 1950s, widely used
  • Static code, variable length
  • Based on frequency of occurrence of letters (from
    English or from body of text)
  • Method
  • Sort symbols by falling probability; link the 2
    symbols with the least probabilities and label
    the new node with their sum; repeat till you
    reach a single root with probability of 1
  • Work down the tree to generate the codes (see the
    sketch below)

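A minimal sketch of this construction in Python (the deck gives no
code; the language is my choice). A heap keeps the two least-probable
subtrees easy to find; the probabilities for the 7-symbol alphabet of
the next slide are invented for illustration.

  import heapq

  def build_huffman(probs):
      # Heap entries are (probability, tie_breaker, tree); a tree is
      # either a symbol (leaf) or a (left, right) pair (internal node).
      heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          p1, _, t1 = heapq.heappop(heap)    # link the 2 subtrees with the
          p2, _, t2 = heapq.heappop(heap)    # least probabilities...
          heapq.heappush(heap, (p1 + p2, count, (t1, t2)))  # ...label with sum
          count += 1
      tree = heap[0][2]                      # single root with probability 1
      codes = {}
      def walk(node, prefix):                # work down the tree:
          if isinstance(node, tuple):        # 0 = left branch, 1 = right
              walk(node[0], prefix + "0")
              walk(node[1], prefix + "1")
          else:
              codes[node] = prefix           # leaf: record the symbol's code
      walk(tree, "")
      return tree, codes

  # Assumed probabilities, in falling order; any distribution works.
  probs = {"a": 0.20, "b": 0.18, "c": 0.16, "d": 0.15,
           "e": 0.12, "f": 0.10, "g": 0.09}
  tree, codes = build_huffman(probs)
  print(codes)   # frequent symbols get the shorter codes
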
10
Huffman coding
  • Consider a 7-symbol alphabet

11
Huffman code tree
(Figure: Huffman code tree; branches labeled 0 and 1, leaves are the
symbols a through g)
12
Huffman coding
  • So to decode, read the bits left to right,
    working down the tree from the root; each leaf
    reached emits a symbol (see the sketch below)
  • E.g.
  • 001000000010001000011110
  • ??
  • Fast for encoding and decoding
  • Good for word-based models

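A sketch of that decoding walk, reusing the tree and codes from the
build_huffman sketch above (my construction, not the deck's):

  def huffman_decode(bits, tree):
      out, node = [], tree
      for bit in bits:
          node = node[0] if bit == "0" else node[1]  # one branch per bit
          if not isinstance(node, tuple):            # reached a leaf:
              out.append(node)                       # emit its symbol,
              node = tree                            # restart at the root
      return "".join(out)

  encoded = "".join(codes[s] for s in "badge")
  print(huffman_decode(encoded, tree))   # -> "badge"
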
13
Ziv-Lempel Compression
  • Adaptive coding
  • For repeat occurrences of text segments, pointer
    back to first occurrence
  • Higher compression than Huffman coding
  • Also used for image compression

14
Ziv-Lempel compression
  • Based on triples <a,b,c>, where
  • a: how far back the repeated segment starts
  • b: number of characters in the segment
  • c: new character to end the segment
  • E.g. (see the sketch below)
  • <0,0,z>: first occurrence of z
  • <17,5,r>: go back 17 characters, repeat 5
    characters, end in r

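A minimal sketch of decoding such triples, with <a,b,c> represented as
a Python tuple (back, length, char); copying one character at a time
lets a match overlap its own output, as LZ77-style schemes allow:

  def lz_decode(triples):
      out = []
      for back, length, ch in triples:
          for _ in range(length):
              out.append(out[-back])   # repeat characters from `back` positions ago
          out.append(ch)               # new character ends the segment
      return "".join(out)

  # <0,0,a> and <0,0,b> are first occurrences; <2,2,c> copies "ab".
  print(lz_decode([(0, 0, "a"), (0, 0, "b"), (2, 2, "c")]))   # -> "ababc"
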
15
Indexing
  • Promotes efficiency in terms of time for
    retrieval
  • Needed to resolve queries and extract relevant
    documents quickly
  • Usual unit for indexing is the word (cf. n-grams)
  • Issue of granularity of index: word, sentence,
    paragraph, document, block

16
Sample Document Collections
17
Index issues
  • How to structure the index
  • How to create the index (storage, time)
  • How to store the index (storage, compression)
  • How to process the index (storage, time)
  • How to update the index (storage, time)

18
Inverted file indexing
  • Postings file or concordance
  • Inverted file contains:
  • Postings: for each term in the lexicon, a list
    of pointers to all occurrences of that term in
    the main text, stored in increasing document ID
  • Lexicon: mapping from terms to pointer lists
    (see the sketch below)

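A minimal sketch of building such a structure in memory, with the
lexicon as a Python dict and postings as (document ID, in-document
frequency) pairs; the whitespace tokenisation is a simplification:

  from collections import Counter

  def build_inverted_index(docs):
      index = {}   # lexicon: term -> postings list of (doc ID, frequency)
      for doc_id, text in enumerate(docs, start=1):
          for term, freq in Counter(text.lower().split()).items():
              index.setdefault(term, []).append((doc_id, freq))
      return index   # document IDs are appended in increasing order

  docs = ["cold days ahead", "warm days", "cold cold nights"]
  print(build_inverted_index(docs)["cold"])   # -> [(1, 1), (3, 2)]
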
19
Lexicon and postings file
salmon | 29 | PTR →
<5,23> <12,95> <16,22> <21,12> <25,42>
  • Document 5: "... The extinction of Atlantic
    salmon is predicted if actions to preserve stocks
    are not taken ..."

20
Structure of inverted index
  • Document-level indexing:
  • No.  Term   Documents
  • 1    cold   <2; 1,4>
  • 2    days   <2; 3,6>
  • Cf. word-level indexing:
  • 1    cold   <2; (1;6), (4;8)>

21
Structure of inverted index
  • May be a hierarchical set of addresses, e.g.
  • word number within sentence number within
    paragraph number within chapter number within
    volume number within document number
  • Consider as a vector (d,v,c,p,s,w)

22
Compression of indexes
  • Index size: case folding, stemming, stopwords →
    compression
  • Elimination of stopwords (a few dozen words ≈ 30%
    of the text)
  • Granularity: coarse (e.g. document-level)
    granularity compresses the index, but increases
    processing for proximity queries

23
Compression of inverted indexes
  • Uncompressed, maybe 50-100% of the size of the
    text
  • Compression: store differences (d-gaps) rather
    than document numbers
  • E.g. (8; 3, 5, 20, 21, 23, 76, 77, 78)
  • → (8; 3, 2, 15, 1, 2, 53, 1, 1)
  • Then code the differences using global (for all
    lists) or local (for each list) methods (see the
    sketch below)

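A minimal sketch of the d-gap transformation, using the postings list
implied by the slide's gap sequence; coding the gaps with a global or
local code would be the next step:

  def to_gaps(doc_ids):
      # First entry kept as-is; the rest become differences.
      return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

  def from_gaps(gaps):
      out, total = [], 0
      for g in gaps:                 # cumulative sum restores the IDs
          total += g
          out.append(total)
      return out

  postings = [3, 5, 20, 21, 23, 76, 77, 78]
  print(to_gaps(postings))                         # -> [3, 2, 15, 1, 2, 53, 1, 1]
  print(from_gaps(to_gaps(postings)) == postings)  # -> True
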
24
Other indexing structures
  • Signature files
  • Each document has an associated signature,
    generated by hashing each term it contains
  • Leads to possible matches; further processing to
    resolve
  • Bitmaps
  • One-to-one hash function: each distinct term in
    the collection has a bit vector with one bit for
    each document
  • Special case of a signature file; storage
    expensive

25
Signature files
  • Early use: edge-notched cards
  • E.g. "Nine days old" → 1010110001001100
  • Hash each word three times using different
    functions to generate 1 bits in the string (see
    the sketch below)
  • May generate false matches
  • See animation at http://ciips.ee.uwa.edu.au/~morr
    is/Year2/PLDS210/hash_tables.html
  • Size of signature: processing vs. storage
  • Processing: hash the query terms, compare
    signatures

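A minimal sketch of such a signature, matching the 16-bit example
above; the width and the salted-MD5 hash functions are my assumptions,
not the deck's:

  import hashlib

  WIDTH = 16   # bits per signature, as in the example above

  def bit_positions(word):
      # Hash each word three times, using a different salt per function.
      return [int(hashlib.md5((word + str(salt)).encode()).hexdigest(), 16) % WIDTH
              for salt in range(3)]

  def signature(text):
      sig = 0
      for word in text.lower().split():
          for pos in bit_positions(word):
              sig |= 1 << pos        # set up to three 1 bits per word
      return sig

  def maybe_contains(sig, word):
      # Every bit set -> possible match (may be false); any bit clear
      # -> the word is definitely absent.
      return all(sig & (1 << pos) for pos in bit_positions(word))

  sig = signature("Nine days old")
  print(maybe_contains(sig, "days"))   # -> True ("salmon" is likely False)
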
26
Comparison of indexing methods
  • Inverted index, signature files, bitmaps:
    different ways of storing a sparse matrix
  • Signature files: extra access to main text; poor
    when document lengths are variable; 2-3 times
    larger than compressed inverted indexes
  • Inverted indexes: require the lexicon file in
    main memory

27
Querying the index
  • Lexicon entry:
  • (term t, f_t, pointer)
  • e.g. (whale, 6, →)
  • Store in memory in sorted order; locate a term by
    binary search
  • Compress the lexicon, e.g. front coding based on
    common prefixes (40% saving; see the sketch
    below)

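A minimal sketch of front coding over a sorted lexicon: each entry
stores only the length of the prefix shared with the previous term,
plus the differing suffix:

  def front_encode(sorted_terms):
      out, prev = [], ""
      for term in sorted_terms:
          shared = 0
          while (shared < min(len(prev), len(term))
                 and prev[shared] == term[shared]):
              shared += 1            # count the common prefix
          out.append((shared, term[shared:]))
          prev = term
      return out

  print(front_encode(["whale", "whaler", "whales", "wharf"]))
  # -> [(0, 'whale'), (5, 'r'), (5, 's'), (3, 'rf')]
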
28
Querying the index
  • If terms are partially specified, e.g. cat*,
  • use brute-force string matching
  • In general, processing is left to right, i.e. the
    suffix can be left unspecified but not the prefix
  • How to handle word fragments or prefix removal?

29
Processing Boolean Queries (I)
  • Assuming a conjunctive (AND) query
  • For each query term t:
  • search the lexicon; record f_t and the address of
    I_t, the inverted file entry for t
  • Identify the query term t with the smallest f_t
  • Read the corresponding inverted file entry I_t
  • Set C ← I_t. C is the list of candidates.

30
Processing Boolean Queries (II)
  • For each remaining term t:
  • Read the inverted file entry, I_t
  • For each d ∈ C:
  • If d ∉ I_t, then
  • Set C ← C − {d}
  • If |C| = 0:
  • Return; there are no answers
  • For each d ∈ C:
  • Look up the address of document d
  • Retrieve document d and present it to the user
    (see the sketch below)

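A minimal sketch of the conjunctive algorithm on the two slides above,
with postings lists as plain sorted lists of document IDs:

  def boolean_and(index, terms):
      postings = []
      for t in terms:
          if t not in index:         # a missing term empties the conjunction
              return []
          postings.append(index[t])
      postings.sort(key=len)         # start from the smallest f_t: C stays small
      candidates = set(postings[0])  # C <- I_t for the rarest term
      for plist in postings[1:]:
          candidates &= set(plist)   # remove d from C if d not in I_t
          if not candidates:
              return []              # |C| = 0: no answers
      return sorted(candidates)      # look up and retrieve these documents

  index = {"cold": [1, 3, 7], "days": [3, 6, 7], "nine": [3, 7, 9]}
  print(boolean_and(index, ["cold", "days", "nine"]))   # -> [3, 7]
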
31
Processing the inverted index for ranked output systems
  • For each query term
  • for each document in inverted list
  • augment similarity coefficient
  • For each document
  • finish calculation of similarity coefficient
  • Perform sort of similarity coefficients
  • Retrieve and present document

32
Processing Vector Space Queries (I)
  • To retrieve r documents using the cosine measure:
  • Set A ← {}. A is the set of accumulators
  • For each query term t ∈ Q:
  • Stem t
  • Search the lexicon; record f_t and the address of
    I_t
  • Set w_t ← 1 + log_e(N/f_t)

33
Processing VS Queries II
  • Read the inverted file entry, I_t
  • For each (d, f_{d,t}) pair in I_t:
  • If A_d ∉ A, then
  • set A_d ← 0
  • set A ← A ∪ {A_d}
  • Set A_d ← A_d + log_e(1 + f_{d,t}) · w_t
  • For each A_d ∈ A:
  • Set A_d ← A_d / W_d (where W_d is the weight of
    document d)
  • A_d is now proportional to the value cosine(Q,D_d)

34
Processing VS Queries III
  • For 1 ≤ i ≤ r (see the sketch below):
  • Select d such that A_d = max A
  • Look up the address of document d
  • Retrieve document d and present it to the user
  • Set A ← A − {A_d}

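A minimal sketch of slides 32-34 together, with accumulators in a dict
and postings as (d, f_{d,t}) pairs; stemming is omitted and the example
document weights W_d are invented:

  import heapq
  import math

  def rank(index, doc_weights, query_terms, N, r):
      A = {}                                       # accumulator A_d per candidate d
      for t in query_terms:
          if t not in index:
              continue
          postings = index[t]                      # f_t = len(postings)
          w_t = 1 + math.log(N / len(postings))    # w_t <- 1 + log_e(N/f_t)
          for d, f_dt in postings:
              A[d] = A.get(d, 0.0) + math.log(1 + f_dt) * w_t
      for d in A:
          A[d] /= doc_weights[d]                   # now proportional to cosine(Q, D_d)
      return heapq.nlargest(r, A.items(), key=lambda kv: kv[1])

  index = {"whale": [(1, 2), (4, 1)], "hunt": [(1, 1), (2, 3)]}
  doc_weights = {1: 1.7, 2: 2.1, 3: 1.0, 4: 0.8}   # assumed W_d values
  print(rank(index, doc_weights, ["whale", "hunt"], N=4, r=2))
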
35
Building the inverted index
  • Create a frequency matrix, document by term
  • Read in document order,
  • then write to disk in term order (i.e.
    transpose)
  • Problem: the size of the matrix

36
Some solutions
  • Resources predicted for indexing 6 GB of text:
  • Linked lists (memory): 4 GB memory, 0 MB disk, 6
    hours
  • Linked lists (disk): 30 MB memory, 4 GB disk,
    1,100 hours
  • Sort-based: 40 MB memory, 8 GB disk, 20 hours
  • Text-based partition: 40 MB memory, 35 MB disk,
    15 hours

37
Dynamic Collections
  • Inserting a document
  • Usually an append to the previous files
  • May cause some problems in compression
  • Updating the index
  • Accumulate updates in a new file and check it for
    each query
  • Build expansion room into the lexicon and index
  • Reindex?

38
  • For details, see:
  • I.H. Witten, A. Moffat and T.C. Bell, Managing
    Gigabytes: Compressing and Indexing Documents and
    Images, 2nd ed., Morgan Kaufmann, 1999. Chapter 3
    (Indexing); Chapter 4 (Querying).