Techniques for Gigabyte-Scale N-gram Based Information Retrieval on PCs

1
Techniques for Gigabyte-Scale N-gram Based
Information Retrieval on PCs
  • Ethan L. Miller
  • University of Maryland Baltimore County
  • elm@csee.umbc.edu

2
What's the problem?
  • N-gram based IR is becoming more important
  • Language-independent
  • Garble-tolerant
  • Better accuracy (phrases, etc.)?
  • Scalability of n-gram IR now necessary
  • Adapt traditional (word-based) IR techniques to
    n-grams
  • More unique terms per corpus
  • More unique terms per document
  • Avoid use of language-dependent techniques

3
What did we do about it?
  • Scaled n-gram based IR system to handle a
    gigabyte on a commodity (<5K) PC
  • Adapted compression techniques
  • Used in-memory and on-disk methods
  • Preserved beneficial properties of n-gram based
    retrieval
  • Showed that disk isn't much slower than memory
    for postings lists
  • Fast file systems can move data quickly
  • Decompression times dominate transfer times

4
Overview
  • Information retrieval and n-gram basics
  • Adapting word-based techniques to n-grams
  • Scaling techniques to more terms
  • Adapting to different numerical characteristics
    of n-gram based IR
  • Performance
  • TELLTALE design
  • Future work
  • Conclusions

5
What's an n-gram?
  • N-gram = n-character sequence
  • Terms gathered by sliding window along the text
    (see the sketch below)
  • Term generator + IR engine need no
    language-specific knowledge
  • N-grams have desirable properties
  • Language-independent
  • Garble-resistant
  • Incorporate inter-word relations
  • N-grams have difficulties
  • More unique terms per corpus & document
  • Lower counts per term
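
The sliding window above is the whole term generator. A minimal C++
sketch (the function name and handling of short texts are illustrative,
not TELLTALE's actual code):

    #include <string>
    #include <vector>

    // Slide a window of width n across the text, emitting every
    // n-character substring as a term. No tokenizer, stemmer, or
    // other language-specific machinery is needed.
    std::vector<std::string> ngrams(const std::string& text, std::size_t n) {
        std::vector<std::string> terms;
        if (text.size() < n) return terms;   // too short to yield a term
        for (std::size_t i = 0; i + n <= text.size(); ++i)
            terms.push_back(text.substr(i, n));
        return terms;
    }

    // ngrams("telltale", 5) -> {"tellt", "ellta", "lltal", "ltale"}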

6
Information retrieval in a nutshell
  • Create an inverted index for corpus
  • Table of terms (words or n-grams) in corpus
  • Postings list for each term
  • Posting = <doc, term weight in doc>
  • Many potential weighting schemes for terms
  • Find documents in corpus similar to query
  • Break query into terms
  • Similarity between query and a given document is
    a function of the term vectors for each
  • Results ranked
  • Function often looks like a dot product (see the
    sketch below)
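
The whole pipeline fits in a few lines. A toy sketch, assuming term
weights are already stored in the postings and scoring is a plain dot
product (real systems layer weighting schemes on top):

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using DocId = int;
    // Inverted index: term -> postings list of <doc, term weight in doc>.
    using Index = std::map<std::string,
                           std::vector<std::pair<DocId, double>>>;

    // Score every document against the query's term vector; each
    // matching term contributes one product to the dot-product sum.
    std::map<DocId, double> rank(const Index& index,
                                 const std::map<std::string, double>& query) {
        std::map<DocId, double> scores;
        for (const auto& [term, qWeight] : query) {
            auto it = index.find(term);
            if (it == index.end()) continue;   // term absent from corpus
            for (const auto& [doc, dWeight] : it->second)
                scores[doc] += qWeight * dWeight;
        }
        return scores;                         // sort by value to rank
    }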

7
N-grams vs. words as terms
  • Fewer unique words
  • Differences of orders of magnitude
  • 5-grams > 4x words
  • 6-grams > 10x words
  • Longer n-grams -> even higher ratios
  • More postings per document
  • (5-gram postings) / (word postings) ≈ 10
  • Most 5-gram postings have a count of 1

8
N-gram IR memory usage
  • Postings lists
  • Naïve = 12 bytes per entry (one possible layout
    sketched below)
  • Better = compression!
  • N-gram (term) table
  • 1 entry per n-gram
  • 40 bytes per entry
  • Document file information
  • Large structures
  • Relatively few instances!
  • Most memory used by postings lists + n-gram hash
    table
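
For scale: the 12-bytes-per-entry figure corresponds to roughly three
32-bit fields per posting. One plausible layout, purely an assumption
for illustration (the slides don't give the exact fields):

    #include <cstdint>

    // Hypothetical uncompressed posting entry (~12 bytes). The field
    // choice is an assumption; the point is that fixed 32-bit fields
    // are wasteful when most counts are 1 and most gaps are small.
    struct NaivePosting {
        std::uint32_t docId;   // document identifier
        std::uint32_t count;   // occurrences of the term in this document
        std::uint32_t next;    // link to the next entry in the list
    };
    static_assert(sizeof(NaivePosting) == 12, "three 32-bit fields");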

9
Corpus compression
  • Compress integers in postings to reduce corpus
    size
  • Posting count
  • Document identifier (use difference from previous
    one in sorted list; see the sketch after this list)
  • Try different compression techniques & adjust
    parameters to best fit n-grams
  • Simple compression
  • Easy to code
  • Effective enough?
  • Gamma compression
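
Gap encoding is the step that makes the document identifiers
compressible: a sorted postings list turns into a list of small
differences. A minimal sketch:

    #include <cstdint>
    #include <vector>

    // Replace sorted document identifiers with the difference from the
    // previous one; small gaps dominate, and small integers compress well.
    std::vector<std::uint32_t> toGaps(const std::vector<std::uint32_t>& ids) {
        std::vector<std::uint32_t> gaps;
        std::uint32_t prev = 0;
        for (std::uint32_t id : ids) {
            gaps.push_back(id - prev);   // first gap is the id itself
            prev = id;
        }
        return gaps;
    }

    // {3, 7, 8, 42} -> {3, 4, 1, 34}; decoding is a running sum.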

10
5-gram posting counts
  • Q: What's the count for a particular posting in a
    document?
  • A: Almost certainly 1!
  • 80% of all postings have a count of 1
  • 98% have a count of 5 or less
  • Distribution is more skewed for n-grams than for
    words

11
Document identifier gaps
  • Curve less steep than that of posting counts
  • Curve less steep than corresponding curve for
    words
  • Compression may be less effective
  • Parameters may need to be changed

12
Simple compression
  • Raw (uncompressed) index requires 6x storage of
    documents themselves
  • Represent numbers in
  • 8 bits: 0-127 (2^7 - 1)
  • 16 bits: 128-16383 (2^14 - 1)
  • 32 bits: everything else (up to 30 bits; see the
    sketch below)
  • Simple compression effectiveness
  • 960 MB of text -> 1085 MB index
  • Factor of 6 reduction from no compression
  • gzip compressed index by another factor of 2
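
A sketch of the three-width scheme; the tag-bit layout (leading bits of
the first byte select the width) is an assumption, since the slides
specify only the widths themselves:

    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    // 0xxxxxxx           -> 7-bit value in 1 byte   (0..127)
    // 10xxxxxx xxxxxxxx  -> 14-bit value in 2 bytes (..16383)
    // 11xxxxxx + 3 bytes -> 30-bit value in 4 bytes
    void encodeSimple(std::uint32_t v, std::vector<std::uint8_t>& out) {
        if (v < (1u << 7)) {
            out.push_back(static_cast<std::uint8_t>(v));
        } else if (v < (1u << 14)) {
            out.push_back(static_cast<std::uint8_t>(0x80 | (v >> 8)));
            out.push_back(static_cast<std::uint8_t>(v & 0xFF));
        } else if (v < (1u << 30)) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (v >> 24)));
            out.push_back(static_cast<std::uint8_t>((v >> 16) & 0xFF));
            out.push_back(static_cast<std::uint8_t>((v >> 8) & 0xFF));
            out.push_back(static_cast<std::uint8_t>(v & 0xFF));
        } else {
            throw std::out_of_range("value needs more than 30 bits");
        }
    }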

13
Gamma compression
  • Represent numbers as unary n followed by m-bit
    binary (see the sketch after this list)
  • n-to-m translation table can be tuned
  • Adjust translation to minimize number of bits
    used
  • Posting counts
  • Represent 1 in 1 bit
  • Small numbers have very few bits
  • Document gaps
  • Small numbers have small representations, but...
  • Shallower curve -> don't weight as much towards
    small numbers
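
A sketch of the tunable code, assuming the translation table is a
vector of binary widths like the <0,2,4,…> vector on the next slide:
the unary prefix n picks a bucket, m = widths[n] offset bits follow,
and a width-0 first bucket encodes the value 1 in a single bit:

    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Encode v (>= 1, as counts and gaps always are) as n '1' bits, a
    // '0' terminator, then an m-bit offset, where bucket n covers
    // 2^widths[n] consecutive values starting at 1.
    std::string gammaEncode(std::uint32_t v, const std::vector<int>& widths) {
        std::uint64_t base = 1;                      // first value in bucket
        for (std::size_t n = 0; n < widths.size(); ++n) {
            std::uint64_t size = 1ull << widths[n];  // values in this bucket
            if (v < base + size) {
                std::string bits(n, '1');            // unary prefix
                bits += '0';                         // terminator
                for (int b = widths[n] - 1; b >= 0; --b)
                    bits += ((v - base) >> b & 1) ? '1' : '0';
                return bits;
            }
            base += size;
        }
        throw std::out_of_range("value exceeds translation table");
    }

    // With widths w = {0,2,4,...}: gammaEncode(1, w) == "0",
    // gammaEncode(2, w) == "1000".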

14
Gamma compression results
  • Use single vector for simplicity
  • Select for minimal sum of posting counts,
    document gap sizes
  • Vector of <0,2,4,…,16,18,28> worked best
  • Within 3% of minimum of each set compressed
    separately
    separately
  • Posting counts compressed far more than document
    gaps
  • 960 MB of text -> 647 MB of index
  • Postings lists: 485 MB
  • Overhead (doc info, n-gram headers): 150 MB

15
Postings lists: memory vs. disk
  • Construct indices for 257 MB corpus
  • Run queries with postings lists
  • In memory
  • On disk
  • On disk lists slower, as expected, but
  • Less than 2x slowdown
  • Decompression not much slower than disk I/O
  • Seek time less critical than we thought

16
N-gram library rewrite
  • Build more efficient data structures
  • Better dynamic storage
  • Reduction in memory consumption
  • Make on-disk storage work better
  • More efficient
  • Independent of underlying byte order
  • Build to standard API
  • Reusable component
  • Fit with legacy apps

17
Data structure design
  • Main data structures
  • TermTable
  • Maintain per-term information
  • Store term text as hashed 64-bit value (hash
    sketched below)
  • PostingsList
  • Keep compressed postings lists
  • Dynamically allocate chunks as needed
  • Other structures
  • DocTable
  • Corpus (includes other structures)
  • Structures use templates extensively
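
Hashing each term's text down to 64 bits keeps TermTable entries small
and fixed-size. The slides don't name the hash function; FNV-1a is
shown here as one illustrative choice:

    #include <cstdint>
    #include <string>

    // Reduce a term's text to a fixed 64-bit key (FNV-1a here as an
    // illustrative choice). Collisions are possible in principle but
    // vanishingly rare in a 64-bit space at gigabyte scale.
    std::uint64_t hashTerm(const std::string& term) {
        std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
        for (unsigned char c : term) {
            h ^= c;
            h *= 1099511628211ull;                   // FNV prime
        }
        return h;
    }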

18
Data structure connections
  [Diagram: the TermTable maps hashed terms H("ellta")
  and H("lltal") to entries holding nOccs, nDocs, and a
  PostList pointer; each postings list stores Count/DocId
  pairs across dynamically allocated chunk1, chunk2, ...;
  a DocTable sits alongside.]

19
Current status
  • Basic data structures working
  • PostingsList
  • HashTable (for documents, terms)
  • Structures need to be tied together
  • Corpus data structure
  • Term generation (parsing)

20
Future work
  • Currently rewriting IR system from scratch
  • Better memory & posting list management
  • Support for trying different term weighting
    schemes & reduction mechanisms
  • Support for excluding n-grams that won't matter
  • Explore tradeoff between disk and memory
  • Try new weighting algorithms with n-grams
  • Parallelize the IR engine (-> Linux clusters)
  • Gauge IR performance for n-grams on large corpora

21
Conclusions
  • Demonstrated an n-gram based IR system indexing a
    gigabyte on a commodity PC
  • Used compression & disk storage for scaling
  • Preserved properties of n-gram based retrieval
  • Found source of performance improvement in
    scalable IR systems
  • Compression more helpful than memory residence
  • Disk access isn't so bad if the file system is
    fast