1
IST2140 Information Storage and Retrieval
  • Week 5
  • Implementing IR Systems

2
Sample Statistics of Text Collections
  • Dialog claims to have >12 terabytes of data in
    >600 databases, >800 million unique records
  • LEXIS/NEXIS claims 7 terabytes, 1.7 billion
    documents, 1.5 million subscribers, 11,400
    databases, >200,000 searches per day, 99.99%
    availability, 9 mainframes, 300 Unix servers, 200
    NT servers

3
Sample Statistics of Text Collections
  • Web search engines cover less than 1/3 of the WWW
    according to the Lawrence and Giles study; the
    largest are FAST and Google, which claim to index
    over 2 billion pages
  • TREC collections: a total of about 5 gigabytes of
    text

4
Designing an IR System
  • Decisions designed to improve performance
    effectiveness: precision, recall
  • Stemming, stopwords, weighting schemes, matching
    algorithms
  • Decisions designed to improve performance
    efficiency: storage space, access time
  • Compression, file structures, space/time
    tradeoffs

5
Implementation Issues
  • Storage of text --- compression??
  • Indexing of text
  • Memory for indexing, especially sorting
  • Storage of indexes --- compression?
  • Accessing text
  • Accessing indexes
  • Processing indexes
  • Accessing documents

6
Storage of text: image vs. ASCII
  • Document image
  • Digital image of a page; words represented as
    patterns of pixels
  • Not searchable as text
  • Optical character recognition to convert to ASCII
    (may be error-prone)
  • ASCII
  • Searchable as text; words represented as ASCII
    codes

7
Text Compression
  • Motivation: to save storage space and
    transmission time
  • Must be lossless (cf. image compression)
  • Compromises
  • Encode-decode time
  • Random access to text?

8
Text Compression
  • Common methods
  • Symbol-wise methods
  • Estimate probabilities of symbols, code one at a
    time, shorter codes for high probabilities
    (Morse)
  • E.g. Huffman coding
  • Dictionary methods
  • Replace words and fragments with dictionary
    entries (Braille)
  • E.g. Ziv-Lempel compression
  • May be static or dynamic

9
Huffman coding
  • Developed in 1950s, widely used
  • Static code, variable length
  • Based on frequency of occurrence of letters (from
    English or from body of text)
  • Method
  • Sort symbols by falling probability; link the 2
    symbols with the least probabilities and label
    the new node with their sum; repeat till you
    reach a single root with probability of 1
  • Work down the tree to generate the codes (see the
    sketch below)

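A minimal sketch of this construction in Python (the deck gives no
code; the language is my choice). A heap keeps the two least-probable
subtrees easy to find; the probabilities for the 7-symbol alphabet of
the next slide are invented for illustration.

  import heapq

  def build_huffman(probs):
      # Heap entries are (probability, tie_breaker, tree); a tree is
      # either a symbol (leaf) or a (left, right) pair (internal node).
      heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
      heapq.heapify(heap)
      count = len(heap)
      while len(heap) > 1:
          p1, _, t1 = heapq.heappop(heap)    # link the 2 subtrees with the
          p2, _, t2 = heapq.heappop(heap)    # least probabilities...
          heapq.heappush(heap, (p1 + p2, count, (t1, t2)))  # ...label with sum
          count += 1
      tree = heap[0][2]                      # single root with probability 1
      codes = {}
      def walk(node, prefix):                # work down the tree:
          if isinstance(node, tuple):        # 0 = left branch, 1 = right
              walk(node[0], prefix + "0")
              walk(node[1], prefix + "1")
          else:
              codes[node] = prefix           # leaf: record the symbol's code
      walk(tree, "")
      return tree, codes

  # Assumed probabilities, in falling order; any distribution works.
  probs = {"a": 0.20, "b": 0.18, "c": 0.16, "d": 0.15,
           "e": 0.12, "f": 0.10, "g": 0.09}
  tree, codes = build_huffman(probs)
  print(codes)   # frequent symbols get the shorter codes
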
10
Huffman coding
  • Consider a 7-symbol alphabet

11
Huffman code tree
(Figure: Huffman code tree; branches labeled 0 and 1, leaves are the
symbols a through g)
12
Huffman coding
  • So to decode, read the bits left to right,
    working down the tree from the root; each leaf
    reached emits a symbol (see the sketch below)
  • E.g.
  • 001000000010001000011110
  • ??
  • Fast for encoding and decoding
  • Good for word-based models

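A sketch of that decoding walk, reusing the tree and codes from the
build_huffman sketch above (my construction, not the deck's):

  def huffman_decode(bits, tree):
      out, node = [], tree
      for bit in bits:
          node = node[0] if bit == "0" else node[1]  # one branch per bit
          if not isinstance(node, tuple):            # reached a leaf:
              out.append(node)                       # emit its symbol,
              node = tree                            # restart at the root
      return "".join(out)

  encoded = "".join(codes[s] for s in "badge")
  print(huffman_decode(encoded, tree))   # -> "badge"
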
13
Ziv-Lempel Compression
  • Adaptive coding
  • For repeat occurrences of text segments, pointer
    back to first occurrence
  • Higher compression than Huffman coding
  • Also used for image compression

14
Ziv-Lempel compression
  • Based on triples <a,b,c>, where
  • a: how far back the repeated segment starts
  • b: number of characters in the segment
  • c: new character to end the segment
  • E.g. (see the sketch below)
  • <0,0,z>: first occurrence of z
  • <17,5,r>: go back 17 characters, repeat 5
    characters, end in r

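A minimal sketch of decoding such triples, with <a,b,c> represented as
a Python tuple (back, length, char); copying one character at a time
lets a match overlap its own output, as LZ77-style schemes allow:

  def lz_decode(triples):
      out = []
      for back, length, ch in triples:
          for _ in range(length):
              out.append(out[-back])   # repeat characters from `back` positions ago
          out.append(ch)               # new character ends the segment
      return "".join(out)

  # <0,0,a> and <0,0,b> are first occurrences; <2,2,c> copies "ab".
  print(lz_decode([(0, 0, "a"), (0, 0, "b"), (2, 2, "c")]))   # -> "ababc"
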
15
Indexing
  • Promotes efficiency in terms of time for
    retrieval
  • Needed to resolve queries and extract relevant
    documents quickly
  • Usual unit for indexing is the word (cf. n-grams)
  • Issue of granularity of index: word, sentence,
    paragraph, document, block

16
Sample Document Collections
17
Index issues
  • How to structure the index
  • How to create the index (storage, time)
  • How to store the index (storage, compression)
  • How to process the index (storage, time)
  • How to update the index (storage, time)

18
Inverted file indexing
  • Postings file or concordance
  • Inverted file contains:
  • Postings: for each term in the lexicon, a list
    of pointers to all occurrences of that term in
    the main text, stored in increasing document ID
  • Lexicon: mapping from terms to pointer lists
    (see the sketch below)

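A minimal sketch of building such a structure in memory, with the
lexicon as a Python dict and postings as (document ID, in-document
frequency) pairs; the whitespace tokenisation is a simplification:

  from collections import Counter

  def build_inverted_index(docs):
      index = {}   # lexicon: term -> postings list of (doc ID, frequency)
      for doc_id, text in enumerate(docs, start=1):
          for term, freq in Counter(text.lower().split()).items():
              index.setdefault(term, []).append((doc_id, freq))
      return index   # document IDs are appended in increasing order

  docs = ["cold days ahead", "warm days", "cold cold nights"]
  print(build_inverted_index(docs)["cold"])   # -> [(1, 1), (3, 2)]
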
19
Lexicon and postings file
salmon | 29 | PTR →
<5,23> <12,95> <16,22> <21,12> <25,42>
  • Document 5: "... The extinction of Atlantic
    salmon is predicted if actions to preserve stocks
    are not taken ..."

20
Structure of inverted index
  • Document-level indexing:
  • No.  Term   Documents
  • 1    cold   <2; 1,4>
  • 2    days   <2; 3,6>
  • Cf. word-level indexing:
  • 1    cold   <2; (1;6), (4;8)>

21
Structure of inverted index
  • May be a hierarchical set of addresses, e.g.
  • word number within sentence number within
    paragraph number within chapter number within
    volume number within document number
  • Consider as a vector (d,v,c,p,s,w)

22
Compression of indexes
  • Index size: case folding, stemming, stopwords →
    compression
  • Elimination of stopwords (a few dozen words ≈ 30%
    of the text)
  • Granularity: coarse (e.g. document-level)
    granularity compresses the index, but increases
    processing for proximity queries

23
Compression of inverted indexes
  • Uncompressed, maybe 50-100% of the size of the
    text
  • Compression: store differences (d-gaps) rather
    than document numbers
  • E.g. (8; 3, 5, 20, 21, 23, 76, 77, 78)
  • → (8; 3, 2, 15, 1, 2, 53, 1, 1)
  • Then code the differences using global (for all
    lists) or local (for each list) methods (see the
    sketch below)

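A minimal sketch of the d-gap transformation, using the postings list
implied by the slide's gap sequence; coding the gaps with a global or
local code would be the next step:

  def to_gaps(doc_ids):
      # First entry kept as-is; the rest become differences.
      return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

  def from_gaps(gaps):
      out, total = [], 0
      for g in gaps:                 # cumulative sum restores the IDs
          total += g
          out.append(total)
      return out

  postings = [3, 5, 20, 21, 23, 76, 77, 78]
  print(to_gaps(postings))                         # -> [3, 2, 15, 1, 2, 53, 1, 1]
  print(from_gaps(to_gaps(postings)) == postings)  # -> True
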
24
Other indexing structures
  • Signature files
  • Each document has an associated signature,
    generated by hashing each term it contains
  • Leads to possible matches; further processing to
    resolve
  • Bitmaps
  • One-to-one hash function: each distinct term in
    the collection has a bit vector with one bit for
    each document
  • Special case of a signature file; storage
    expensive

25
Signature files
  • Early use: edge-notched cards
  • E.g. "Nine days old" → 1010110001001100
  • Hash each word three times using different
    functions to generate 1 bits in the string (see
    the sketch below)
  • May generate false matches
  • See animation at http://ciips.ee.uwa.edu.au/~morr
    is/Year2/PLDS210/hash_tables.html
  • Size of signature: processing vs. storage
  • Processing: hash the query terms, compare
    signatures

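A minimal sketch of such a signature, matching the 16-bit example
above; the width and the salted-MD5 hash functions are my assumptions,
not the deck's:

  import hashlib

  WIDTH = 16   # bits per signature, as in the example above

  def bit_positions(word):
      # Hash each word three times, using a different salt per function.
      return [int(hashlib.md5((word + str(salt)).encode()).hexdigest(), 16) % WIDTH
              for salt in range(3)]

  def signature(text):
      sig = 0
      for word in text.lower().split():
          for pos in bit_positions(word):
              sig |= 1 << pos        # set up to three 1 bits per word
      return sig

  def maybe_contains(sig, word):
      # Every bit set -> possible match (may be false); any bit clear
      # -> the word is definitely absent.
      return all(sig & (1 << pos) for pos in bit_positions(word))

  sig = signature("Nine days old")
  print(maybe_contains(sig, "days"))   # -> True ("salmon" is likely False)
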
26
Comparison of indexing methods
  • Inverted index, signature files, bitmaps:
    different ways of storing a sparse matrix
  • Signature files: extra access to main text; poor
    when document lengths are variable; 2-3 times
    larger than compressed inverted indexes
  • Inverted indexes: require the lexicon file in
    main memory

27
Querying the index
  • Lexicon entry:
  • (term t, f_t, pointer)
  • e.g. (whale, 6, →)
  • Store in memory in sorted order; locate a term by
    binary search
  • Compress the lexicon, e.g. front coding based on
    common prefixes (40% saving; see the sketch
    below)

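A minimal sketch of front coding over a sorted lexicon: each entry
stores only the length of the prefix shared with the previous term,
plus the differing suffix:

  def front_encode(sorted_terms):
      out, prev = [], ""
      for term in sorted_terms:
          shared = 0
          while (shared < min(len(prev), len(term))
                 and prev[shared] == term[shared]):
              shared += 1            # count the common prefix
          out.append((shared, term[shared:]))
          prev = term
      return out

  print(front_encode(["whale", "whaler", "whales", "wharf"]))
  # -> [(0, 'whale'), (5, 'r'), (5, 's'), (3, 'rf')]
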
28
Querying the index
  • If terms are partially specified, e.g. cat*,
  • use brute-force string matching
  • In general, processing is left to right, i.e. the
    suffix can be left unspecified but not the prefix
  • How to handle word fragments or prefix removal?

29
Processing Boolean Queries (I)
  • Assuming a conjunctive (AND) query
  • For each query term t:
  • search the lexicon; record f_t and the address of
    I_t, the inverted file entry for t
  • Identify the query term t with the smallest f_t
  • Read the corresponding inverted file entry I_t
  • Set C ← I_t. C is the list of candidates.

30
Processing Boolean Queries (II)
  • For each remaining term t:
  • Read the inverted file entry, I_t
  • For each d ∈ C:
  • If d ∉ I_t, then
  • Set C ← C − {d}
  • If |C| = 0:
  • Return; there are no answers
  • For each d ∈ C:
  • Look up the address of document d
  • Retrieve document d and present it to the user
    (see the sketch below)

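A minimal sketch of the conjunctive algorithm on the two slides above,
with postings lists as plain sorted lists of document IDs:

  def boolean_and(index, terms):
      postings = []
      for t in terms:
          if t not in index:         # a missing term empties the conjunction
              return []
          postings.append(index[t])
      postings.sort(key=len)         # start from the smallest f_t: C stays small
      candidates = set(postings[0])  # C <- I_t for the rarest term
      for plist in postings[1:]:
          candidates &= set(plist)   # remove d from C if d not in I_t
          if not candidates:
              return []              # |C| = 0: no answers
      return sorted(candidates)      # look up and retrieve these documents

  index = {"cold": [1, 3, 7], "days": [3, 6, 7], "nine": [3, 7, 9]}
  print(boolean_and(index, ["cold", "days", "nine"]))   # -> [3, 7]
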
31
Processing the inverted index for ranked output systems
  • For each query term
  • for each document in inverted list
  • augment similarity coefficient
  • For each document
  • finish calculation of similarity coefficient
  • Perform sort of similarity coefficients
  • Retrieve and present document

32
Processing Vector Space Queries (I)
  • To retrieve r documents using the cosine measure:
  • Set A ← {}. A is the set of accumulators
  • For each query term t ∈ Q:
  • Stem t
  • Search the lexicon; record f_t and the address of
    I_t
  • Set w_t ← 1 + log_e(N/f_t)

33
Processing VS Queries II
  • Read the inverted file entry, I_t
  • For each (d, f_{d,t}) pair in I_t:
  • If A_d ∉ A, then
  • set A_d ← 0
  • set A ← A ∪ {A_d}
  • Set A_d ← A_d + log_e(1 + f_{d,t}) · w_t
  • For each A_d ∈ A:
  • Set A_d ← A_d / W_d (where W_d is the weight of
    document d)
  • A_d is now proportional to the value cosine(Q,D_d)

34
Processing VS Queries III
  • For 1 ≤ i ≤ r (see the sketch below):
  • Select d such that A_d = max A
  • Look up the address of document d
  • Retrieve document d and present it to the user
  • Set A ← A − {A_d}

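A minimal sketch of slides 32-34 together, with accumulators in a dict
and postings as (d, f_{d,t}) pairs; stemming is omitted and the example
document weights W_d are invented:

  import heapq
  import math

  def rank(index, doc_weights, query_terms, N, r):
      A = {}                                       # accumulator A_d per candidate d
      for t in query_terms:
          if t not in index:
              continue
          postings = index[t]                      # f_t = len(postings)
          w_t = 1 + math.log(N / len(postings))    # w_t <- 1 + log_e(N/f_t)
          for d, f_dt in postings:
              A[d] = A.get(d, 0.0) + math.log(1 + f_dt) * w_t
      for d in A:
          A[d] /= doc_weights[d]                   # now proportional to cosine(Q, D_d)
      return heapq.nlargest(r, A.items(), key=lambda kv: kv[1])

  index = {"whale": [(1, 2), (4, 1)], "hunt": [(1, 1), (2, 3)]}
  doc_weights = {1: 1.7, 2: 2.1, 3: 1.0, 4: 0.8}   # assumed W_d values
  print(rank(index, doc_weights, ["whale", "hunt"], N=4, r=2))
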
35
Building the inverted index
  • Create a frequency matrix, document by term
  • Read in document order,
  • then write to disk in term order (i.e.
    transpose)
  • Problem: the size of the matrix

36
Some solutions
  • Resources predicted for indexing 6 GB of text:
  • Linked lists (memory): 4 GB memory, 0 MB disk, 6
    hours
  • Linked lists (disk): 30 MB memory, 4 GB disk,
    1,100 hours
  • Sort-based: 40 MB memory, 8 GB disk, 20 hours
  • Text-based partition: 40 MB memory, 35 MB disk,
    15 hours

37
Dynamic Collections
  • Inserting a document
  • Usually an append to the previous files
  • May cause some problems in compression
  • Updating the index
  • Accumulate updates in a new file and check it for
    each query
  • Build expansion room into the lexicon and index
  • Reindex?

38
  • For details, see:
  • I.H. Witten, A. Moffat and T.C. Bell, Managing
    Gigabytes: Compressing and Indexing Documents and
    Images, 2nd ed., Morgan Kaufmann, 1999. Chapter 3
    (Indexing); Chapter 4 (Querying).