Techniques for Gigabyte-Scale N-gram Based Information Retrieval on PCs

1
Techniques for Gigabyte-Scale N-gram Based
Information Retrieval on PCs
  • Ethan L. Miller
  • University of Maryland Baltimore County
  • elm@csee.umbc.edu

2
What's the problem?
  • N-gram based IR is becoming more important
  • Language-independent
  • Garble-tolerant
  • Better accuracy (phrases, etc.)?
  • Scalability of n-gram IR now necessary
  • Adapt traditional (word-based) IR techniques to
    n-grams
  • More unique terms per corpus
  • More unique terms per document
  • Avoid use of language-dependent techniques

3
What did we do about it?
  • Scaled n-gram based IR system to handle a
    gigabyte on a commodity (<5K) PC
  • Adapted compression techniques
  • Used in-memory and on-disk methods
  • Preserved beneficial properties of n-gram based
    retrieval
  • Showed that disk isn't much slower than memory
    for postings lists
  • Fast file systems can move data quickly
  • Decompression times dominate transfer times

4
Overview
  • Information retrieval and n-gram basics
  • Adapting word-based techniques to n-grams
  • Scaling techniques to more terms
  • Adapting to different numerical characteristics
    of n-gram based IR
  • Performance
  • TELLTALE design
  • Future work
  • Conclusions

5
What's an n-gram?
  • N-gram = n-character sequence
  • Terms gathered by sliding window along the text
    (see the sketch below)
  • Term generator + IR engine need no
    language-specific knowledge
  • N-grams have desirable properties
  • Language-independent
  • Garble-resistant
  • Incorporate inter-word relations
  • N-grams have difficulties
  • More unique terms per corpus & document
  • Lower counts per term
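
The sliding window above is the whole term generator. A minimal C++
sketch (the function name and handling of short texts are illustrative,
not TELLTALE's actual code):

    #include <string>
    #include <vector>

    // Slide a window of width n across the text, emitting every
    // n-character substring as a term. No tokenizer, stemmer, or
    // other language-specific machinery is needed.
    std::vector<std::string> ngrams(const std::string& text, std::size_t n) {
        std::vector<std::string> terms;
        if (text.size() < n) return terms;   // too short to yield a term
        for (std::size_t i = 0; i + n <= text.size(); ++i)
            terms.push_back(text.substr(i, n));
        return terms;
    }

    // ngrams("telltale", 5) -> {"tellt", "ellta", "lltal", "ltale"}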

6
Information retrieval in a nutshell
  • Create an inverted index for corpus
  • Table of terms (words or n-grams) in corpus
  • Postings list for each term
  • Posting = <doc, term weight in doc>
  • Many potential weighting schemes for terms
  • Find documents in corpus similar to query
  • Break query into terms
  • Similarity between query and a given document is
    a function of the term vectors for each
  • Results ranked
  • Function often looks like a dot product (see the
    sketch below)
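
The whole pipeline fits in a few lines. A toy sketch, assuming term
weights are already stored in the postings and scoring is a plain dot
product (real systems layer weighting schemes on top):

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using DocId = int;
    // Inverted index: term -> postings list of <doc, term weight in doc>.
    using Index = std::map<std::string,
                           std::vector<std::pair<DocId, double>>>;

    // Score every document against the query's term vector; each
    // matching term contributes one product to the dot-product sum.
    std::map<DocId, double> rank(const Index& index,
                                 const std::map<std::string, double>& query) {
        std::map<DocId, double> scores;
        for (const auto& [term, qWeight] : query) {
            auto it = index.find(term);
            if (it == index.end()) continue;   // term absent from corpus
            for (const auto& [doc, dWeight] : it->second)
                scores[doc] += qWeight * dWeight;
        }
        return scores;                         // sort by value to rank
    }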

7
N-grams vs. words as terms
  • Fewer unique words
  • Differences of orders of magnitude
  • 5-grams > 4x words
  • 6-grams > 10x words
  • Longer n-grams -> even higher ratios
  • More postings per document
  • (5-gram postings) / (word postings) ≈ 10
  • Most 5-gram postings have a count of 1

8
N-gram IR memory usage
  • Postings lists
  • Naïve = 12 bytes per entry (one possible layout
    sketched below)
  • Better = compression!
  • N-gram (term) table
  • 1 entry per n-gram
  • 40 bytes per entry
  • Document file information
  • Large structures
  • Relatively few instances!
  • Most memory used by postings lists + n-gram hash
    table
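
For scale: the 12-bytes-per-entry figure corresponds to roughly three
32-bit fields per posting. One plausible layout, purely an assumption
for illustration (the slides don't give the exact fields):

    #include <cstdint>

    // Hypothetical uncompressed posting entry (~12 bytes). The field
    // choice is an assumption; the point is that fixed 32-bit fields
    // are wasteful when most counts are 1 and most gaps are small.
    struct NaivePosting {
        std::uint32_t docId;   // document identifier
        std::uint32_t count;   // occurrences of the term in this document
        std::uint32_t next;    // link to the next entry in the list
    };
    static_assert(sizeof(NaivePosting) == 12, "three 32-bit fields");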

9
Corpus compression
  • Compress integers in postings to reduce corpus
    size
  • Posting count
  • Document identifier (use difference from previous
    one in sorted list; see the sketch after this list)
  • Try different compression techniques & adjust
    parameters to best fit n-grams
  • Simple compression
  • Easy to code
  • Effective enough?
  • Gamma compression
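
Gap encoding is the step that makes the document identifiers
compressible: a sorted postings list turns into a list of small
differences. A minimal sketch:

    #include <cstdint>
    #include <vector>

    // Replace sorted document identifiers with the difference from the
    // previous one; small gaps dominate, and small integers compress well.
    std::vector<std::uint32_t> toGaps(const std::vector<std::uint32_t>& ids) {
        std::vector<std::uint32_t> gaps;
        std::uint32_t prev = 0;
        for (std::uint32_t id : ids) {
            gaps.push_back(id - prev);   // first gap is the id itself
            prev = id;
        }
        return gaps;
    }

    // {3, 7, 8, 42} -> {3, 4, 1, 34}; decoding is a running sum.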

10
5-gram posting counts
  • Q: What's the count for a particular posting in a
    document?
  • A: Almost certainly 1!
  • 80% of all postings have a count of 1
  • 98% have a count of 5 or less
  • Distribution is more skewed for n-grams than for
    words

11
Document identifier gaps
  • Curve less steep than that of posting counts
  • Curve less steep than corresponding curve for
    words
  • Compression may be less effective
  • Parameters may need to be changed

12
Simple compression
  • Raw (uncompressed) index requires 6x storage of
    documents themselves
  • Represent numbers in
  • 8 bits: 0-127 (2^7 - 1)
  • 16 bits: 128-16383 (2^14 - 1)
  • 32 bits: everything else (up to 30 bits; see the
    sketch below)
  • Simple compression effectiveness
  • 960 MB of text -> 1085 MB index
  • Factor of 6 reduction from no compression
  • gzip compressed index by another factor of 2
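
A sketch of the three-width scheme; the tag-bit layout (leading bits of
the first byte select the width) is an assumption, since the slides
specify only the widths themselves:

    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // 0xxxxxxx           -> 7-bit value in 1 byte   (0..127)
    // 10xxxxxx xxxxxxxx  -> 14-bit value in 2 bytes (..16383)
    // 11xxxxxx + 3 bytes -> 30-bit value in 4 bytes
    void encodeSimple(std::uint32_t v, std::vector<std::uint8_t>& out) {
        if (v < (1u << 7)) {
            out.push_back(static_cast<std::uint8_t>(v));
        } else if (v < (1u << 14)) {
            out.push_back(static_cast<std::uint8_t>(0x80 | (v >> 8)));
            out.push_back(static_cast<std::uint8_t>(v & 0xFF));
        } else if (v < (1u << 30)) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (v >> 24)));
            out.push_back(static_cast<std::uint8_t>((v >> 16) & 0xFF));
            out.push_back(static_cast<std::uint8_t>((v >> 8) & 0xFF));
            out.push_back(static_cast<std::uint8_t>(v & 0xFF));
        } else {
            throw std::out_of_range("value needs more than 30 bits");
        }
    }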

13
Gamma compression
  • Represent numbers as unary n followed by m-bit
    binary (see the sketch after this list)
  • n-to-m translation table can be tuned
  • Adjust translation to minimize number of bits
    used
  • Posting counts
  • Represent 1 in 1 bit
  • Small numbers have very few bits
  • Document gaps
  • Small numbers have small representations, but...
  • Shallower curve -> don't weight as much towards
    small numbers
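
A sketch of the tunable code, assuming the translation table is a
vector of binary widths like the <0,2,4,…> vector on the next slide:
the unary prefix n picks a bucket, m = widths[n] offset bits follow,
and a width-0 first bucket encodes the value 1 in a single bit:

    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Encode v (>= 1, as counts and gaps always are) as n '1' bits, a
    // '0' terminator, then an m-bit offset, where bucket n covers
    // 2^widths[n] consecutive values starting at 1.
    std::string gammaEncode(std::uint32_t v, const std::vector<int>& widths) {
        std::uint64_t base = 1;                      // first value in bucket
        for (std::size_t n = 0; n < widths.size(); ++n) {
            std::uint64_t size = 1ull << widths[n];  // values in this bucket
            if (v < base + size) {
                std::string bits(n, '1');            // unary prefix
                bits += '0';                         // terminator
                for (int b = widths[n] - 1; b >= 0; --b)
                    bits += ((v - base) >> b & 1) ? '1' : '0';
                return bits;
            }
            base += size;
        }
        throw std::out_of_range("value exceeds translation table");
    }

    // With widths w = {0,2,4,...}: gammaEncode(1, w) == "0",
    // gammaEncode(2, w) == "1000".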

14
Gamma compression results
  • Use single vector for simplicity
  • Select for minimal sum of posting counts,
    document gap sizes
  • Vector of <0,2,4,…,16,18,28> worked best
  • Within 3% of minimum of each set compressed
    separately
    separately
  • Posting counts compressed far more than document
    gaps
  • 960 MB of text -> 647 MB of index
  • Postings lists: 485 MB
  • Overhead (doc info, n-gram headers): 150 MB

15
Postings lists: memory vs. disk
  • Construct indices for 257 MB corpus
  • Run queries with postings lists
  • In memory
  • On disk
  • On disk lists slower, as expected, but
  • Less than 2x slowdown
  • Decompression not much slower than disk I/O
  • Seek time less critical than we thought

16
N-gram library rewrite
  • Build more efficient data structures
  • Better dynamic storage
  • Reduction in memory consumption
  • Make on-disk storage work better
  • More efficient
  • Independent of underlying byte order
  • Build to standard API
  • Reusable component
  • Fit with legacy apps

17
Data structure design
  • Main data structures
  • TermTable
  • Maintain per-term information
  • Store term text as hashed 64-bit value (hash
    sketched below)
  • PostingsList
  • Keep compressed postings lists
  • Dynamically allocate chunks as needed
  • Other structures
  • DocTable
  • Corpus (includes other structures)
  • Structures use templates extensively
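
Hashing each term's text down to 64 bits keeps TermTable entries small
and fixed-size. The slides don't name the hash function; FNV-1a is
shown here as one illustrative choice:

    #include <cstdint>
    #include <string>

    // Reduce a term's text to a fixed 64-bit key (FNV-1a here as an
    // illustrative choice). Collisions are possible in principle but
    // vanishingly rare in a 64-bit space at gigabyte scale.
    std::uint64_t hashTerm(const std::string& term) {
        std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
        for (unsigned char c : term) {
            h ^= c;
            h *= 1099511628211ull;                   // FNV prime
        }
        return h;
    }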

18
Data structure connections
  [Diagram: the TermTable maps hashed terms H("ellta")
  and H("lltal") to entries holding nOccs, nDocs, and a
  PostList pointer; each postings list stores Count/DocId
  pairs across dynamically allocated chunk1, chunk2, ...;
  a DocTable sits alongside.]

19
Current status
  • Basic data structures working
  • PostingsList
  • HashTable (for documents, terms)
  • Structures need to be tied together
  • Corpus data structure
  • Term generation (parsing)

20
Future work
  • Currently rewriting IR system from scratch
  • Better memory & posting list management
  • Support for trying different term weighting
    schemes & reduction mechanisms
  • Support for excluding n-grams that won't matter
  • Explore tradeoff between disk and memory
  • Try new weighting algorithms with n-grams
  • Parallelize the IR engine (-> Linux clusters)
  • Gauge IR performance for n-grams on large corpora

21
Conclusions
  • Demonstrated an n-gram based IR system indexing a
    gigabyte on a commodity PC
  • Used compression & disk storage for scaling
  • Preserved properties of n-gram based retrieval
  • Found source of performance improvement in
    scalable IR systems
  • Compression more helpful than memory residence
  • Disk access isn't so bad if the file system is
    fast