Title: n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure
VLDB 2005
- Aug. 31, 2005
- Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and Min-Jae Lee
- Department of Computer Science
- Korea Advanced Institute of Science and Technology (KAIST)
Contents
- Introduction
- Motivation and Goals
- Structure of the n-Gram/2L Index
- Analysis of the n-Gram/2L Index
- Performance Evaluation
- Conclusions
Inverted Index
- A term-oriented index structure for quickly searching documents containing a given term [BR1999]
- Most actively used for text searching
- Classification (depending on the kind of terms) [WMB1999]
  - Word-based inverted index
  - n-gram inverted index (simply, the n-gram index): the scope of this talk
- Structure
  - A B-Tree index on terms and the posting lists of terms
  - A posting: $\langle d, [o_1, \ldots, o_f] \rangle$
    - $d$: document identifier
    - $o_i$: offset where term t occurs in document d
    - $f$: frequency of occurrence of term t in document d
n-Gram Index
- n-Gram
  - Definition: a string of fixed length n
- Extraction method
  - Sliding a window of length n by one character at a time over the text
  - Recording the sequence of characters in the window
  - (We call it the 1-sliding technique)
- Example
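As a hedged illustration of the 1-sliding technique and of the postings it produces, the following minimal Python sketch extracts n-grams and builds a toy n-gram index. The function names, the dictionary layout, and the sample documents are assumptions made for this example; they are not taken from the talk.

```python
from collections import defaultdict


def extract_ngrams(text, n):
    """Slide a window of length n one character at a time (1-sliding)."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n):
    """Map each n-gram to its postings: {document id: [offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for gram, offset in extract_ngrams(text, n):
            index[gram].setdefault(doc_id, []).append(offset)
    return index


# Hypothetical sample documents, used only for illustration.
docs = ["ABCDDABBCD", "DABCDABCDA"]
index = build_ngram_index(docs, n=2)
print(index["AB"])
```

Each entry of `index` plays the role of one posting list: the document identifier, followed by every offset at which the n-gram occurs in that document.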
Pros and Cons of the n-Gram Index [BR1999, MM2003]
- Pros
  - Language-neutral
    - Allowing us to disregard the characteristics of the language
    - Being widely used for Asian languages or DNA and protein databases
  - Error tolerant
    - Allowing us to retrieve documents with some errors in the query result
    - Being widely used for applications that allow errors (e.g., approximate matching)
- Cons
  - The size tends to be large, and the query performance tends to be bad
Motivation
- We note that the large size of the n-gram index is due to the redundancy in the position information
  - If a subsequence is repeated multiple times in documents, the relative offsets (within the subsequence) of the n-grams extracted from that subsequence are also indexed multiple times
- We find that the two-level construction eliminates that redundancy
  - If the relative offsets of n-grams extracted from a subsequence are indexed only once, the index size is reduced since such repetition is eliminated
Goals
- We propose the two-level n-gram inverted index (simply, n-gram/2L)
- We show that the n-gram/2L index significantly reduces the index size and improves the query performance over the conventional n-gram index
Structure of the n-Gram/2L Index
- Two-level structure
  - Back-end index: storing the offsets of m-subsequences within documents
  - Front-end index: storing the offsets of n-grams within m-subsequences
  - (m-subsequence: a subsequence of length m)
Building of the n-Gram/2L Index
- Algorithm
  - Step 1 (back-end index)
    - Extracting m-subsequences from a set of documents such that consecutive subsequences overlap with each other by n-1
    - Building the back-end index using the m-subsequences
  - Step 2 (front-end index)
    - Extracting n-grams from the set of m-subsequences
    - Building the front-end index using the n-grams
- Theorem 1: If m-subsequences are extracted such that consecutive ones overlap with each other by n-1, no n-gram is missed or duplicated
  - Proof (sketch) [Figure: m-subsequences]
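The two building steps and the claim of Theorem 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are invented here, the indexes are plain dictionaries, m is assumed to be at least n, and the last m-subsequence of a document is simply allowed to be shorter than m.

```python
from collections import Counter, defaultdict


def extract_msubsequences(text, m, n):
    """Cut text into length-m pieces whose neighbours overlap by n-1 characters.

    Assumption of this sketch: the last piece may be shorter than m.
    """
    step = m - n + 1
    pieces, i = [], 0
    while True:
        pieces.append((text[i:i + m], i))
        if i + m >= len(text):
            break
        i += step
    return pieces


def build_2l_index(docs, m, n):
    # Step 1 (back-end): m-subsequence -> {document id: [offsets in document]}
    back_end = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for sub, off in extract_msubsequences(text, m, n):
            back_end[sub][doc_id].append(off)
    # Step 2 (front-end): n-gram -> {m-subsequence: [offsets in m-subsequence]}
    front_end = defaultdict(lambda: defaultdict(list))
    for sub in back_end:
        for i in range(len(sub) - n + 1):
            front_end[sub[i:i + n]][sub].append(i)
    return front_end, back_end


def ngrams_via_2l(text, m, n):
    """Multiset of (n-gram, document offset) pairs recovered from both levels."""
    return Counter(
        (sub[i:i + n], off + i)
        for sub, off in extract_msubsequences(text, m, n)
        for i in range(len(sub) - n + 1)
    )
```

Comparing `ngrams_via_2l(text, m, n)` with the n-grams produced by sliding directly over `text` gives a quick empirical check of Theorem 1: the two multisets should be equal, with every count equal to one.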
Query Processing Using the n-Gram/2L Index
- Algorithm
  - Step 1 (front-end index)
    - Finding the m-subsequences that cover a query string by searching the front-end index
  - Step 2 (back-end index)
    - Finding the documents that have a set of m-subsequences $\{S_i\}$ containing the query string by searching the back-end index
- Definition 1: Cover
  - An m-subsequence S covers a query string Q if S and Q satisfy one of the following four conditions
    - A suffix of S matches a prefix of Q
    - The whole string of S matches a substring of Q
    - A prefix of S matches a suffix of Q
    - A substring of S matches the whole string of Q
- Example [Figure: an m-subsequence S covering a query string Q]
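The four conditions of Definition 1 can be written as a small predicate. The sketch below is an illustration only: the name `covers` is made up, and it assumes that any non-empty overlap counts as a match, which the slide does not state explicitly.

```python
def covers(s, q):
    """Return True if m-subsequence s covers query string q (Definition 1).

    Assumption of this sketch: an overlap of a single character is enough.
    """
    # The whole of s matches a substring of q, or a substring of s matches all of q.
    if s in q or q in s:
        return True
    # A suffix of s matches a prefix of q, or a prefix of s matches a suffix of q.
    for k in range(1, min(len(s), len(q))):
        if s[-k:] == q[:k] or s[:k] == q[-k:]:
            return True
    return False
```

For instance, `covers("ABCD", "CDDA")` holds because the suffix "CD" of S matches the prefix "CD" of Q, while `covers("ABCD", "XYZ")` does not hold.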
- Definition 2 (brief): Expand
  - The expand function expands a sequence of overlapping character sequences into one character sequence
- Definition 3: Contain
  - A set of m-subsequences $\{S_i\}$ contains a query string Q if $\{S_i\}$ and Q satisfy the following condition
    - Let $S_l S_{l+1} \ldots S_m$ be a sequence of m-subsequences overlapping with each other in $\{S_i\}$. A substring of $expand(S_l S_{l+1} \ldots S_m)$ matches the whole string of Q
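A minimal Python sketch of Definitions 2 and 3 follows. The names `expand` and `contains` and the explicit `overlap` parameter are assumptions of this example; in the n-gram/2L setting the overlap between consecutive m-subsequences would be n-1.

```python
def expand(pieces, overlap):
    """Merge strings that each overlap the previous one by `overlap` characters."""
    if not pieces:
        return ""
    out = pieces[0]
    for p in pieces[1:]:
        # Guard against input that does not actually overlap as claimed.
        if overlap and out[-overlap:] != p[:overlap]:
            raise ValueError("pieces do not overlap as expected")
        out += p[overlap:]
    return out


def contains(pieces, overlap, q):
    """True if q is a substring of the expanded sequence (Definition 3)."""
    return q in expand(pieces, overlap)


print(expand(["ABCD", "DDAB", "BBCD"], overlap=1))
```

With n = 2 the three pieces above overlap by one character, so expanding them restores the original character sequence from which they were cut.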
- Lemma 1: A document that has a set of m-subsequences $\{S_i\}$ containing the query string Q includes at least one m-subsequence covering Q
- Algorithm (revisited)
  - Step 1 (front-end index)
    - Finding the m-subsequences that cover a query string by searching the front-end index, which retrieves candidate results satisfying the necessary condition
  - Step 2 (back-end index)
    - Finding the documents that have a set of m-subsequences $\{S_i\}$ containing the query string by searching the back-end index, which refines the candidate results
- If a document d has a set of m-subsequences $\{S_i\}$ containing Q, then d has at least one m-subsequence covering Q (a necessary condition)
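The two query steps can be sketched end to end in Python. This is a simplified illustration under several assumptions, not the algorithm as implemented by the authors: the indexes are in-memory dictionaries, the helper names are invented, the last m-subsequence of a document may be shorter than m, the query is assumed to be at least n characters long, and Step 2 simply re-expands runs of consecutive candidate m-subsequences instead of merging posting lists.

```python
from collections import defaultdict


def msubsequences(text, m, n):
    step, i, out = m - n + 1, 0, []
    while True:
        out.append((text[i:i + m], i))
        if i + m >= len(text):
            return out
        i += step


def covers(s, q):
    if s in q or q in s:
        return True
    return any(s[-k:] == q[:k] or s[:k] == q[-k:]
               for k in range(1, min(len(s), len(q))))


def build_2l(docs, m, n):
    back_end = defaultdict(lambda: defaultdict(list))   # m-subsequence -> {doc: [offsets]}
    front_end = defaultdict(lambda: defaultdict(list))  # n-gram -> {m-subsequence: [offsets]}
    for doc_id, text in enumerate(docs):
        for sub, off in msubsequences(text, m, n):
            back_end[sub][doc_id].append(off)
    for sub in back_end:
        for i in range(len(sub) - n + 1):
            front_end[sub[i:i + n]][sub].append(i)
    return front_end, back_end


def query(front_end, back_end, q, m, n):
    step = m - n + 1
    # Step 1: look up the n-grams of q in the front-end index and keep the
    # m-subsequences that cover q (candidates, necessary condition only).
    candidates = set()
    for i in range(len(q) - n + 1):
        for s in front_end.get(q[i:i + n], {}):
            if covers(s, q):
                candidates.add(s)
    # Step 2: per document, expand runs of consecutive candidates and keep
    # the documents in which an expanded run contains q.
    per_doc = defaultdict(list)
    for s in candidates:
        for doc_id, offsets in back_end[s].items():
            per_doc[doc_id].extend((off, s) for off in offsets)
    hits = set()
    for doc_id, occs in per_doc.items():
        occs.sort()
        run, prev = "", None
        for off, s in occs:
            if prev is not None and off == prev + step:
                run += s[n - 1:]   # consecutive piece: append its non-overlapping tail
            else:
                run = s            # gap: start a new run
            prev = off
            if q in run:
                hits.add(doc_id)
                break
    return sorted(hits)
```

Step 1 only filters by the necessary condition of Lemma 1, so it may return m-subsequences from documents that do not contain the query; Step 2 removes those false candidates by checking the contain condition.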
Formalization of the n-Gram/2L Index
- We observe that the redundancy in the position information existing in the n-gram index is caused by non-trivial MultiValued Dependencies (MVDs)
- We show that the n-gram/2L index can be derived by eliminating that redundancy through relational decomposition to the Fourth Normal Form (4NF)
MultiValued Dependency (MVD)
- Definition [Ull1988]
  - Suppose we are given a relation schema R, and X and Y are subsets of R. $X \twoheadrightarrow Y$ holds in R if whenever r is a relation for R, and $\mu$ and $\nu$ are two tuples in r with $\mu[X] = \nu[X]$ (that is, $\mu$ and $\nu$ agree on the attributes of X), then r also contains tuples $\phi$ and $\psi$, where
    - $\phi[X] = \psi[X] = \mu[X] = \nu[X]$
    - $\phi[Y] = \mu[Y]$ and $\phi[R-X-Y] = \nu[R-X-Y]$
    - $\psi[Y] = \nu[Y]$ and $\psi[R-X-Y] = \mu[R-X-Y]$
- Non-trivial MVD: $Y \nsubseteq X$ and $X \cup Y \neq R$
- Example
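To make the definition concrete, here is a small Python checker for an MVD on a relation given as a collection of tuples. It is a sketch under assumptions: the name `mvd_holds` is made up, X and Y are assumed to be disjoint, and the course/teacher/book relation used in the usage lines is a hypothetical toy example, not one from the talk.

```python
from itertools import product


def mvd_holds(rel, attrs, X, Y):
    """Check X ->> Y on relation `rel` (tuples ordered as in `attrs`)."""
    rel = set(rel)
    x_idx = [i for i, a in enumerate(attrs) if a in X]
    # True for attributes taken from mu (X and Y); the rest come from nu.
    from_mu = [a in X or a in Y for a in attrs]
    for mu, nu in product(rel, repeat=2):
        if all(mu[i] == nu[i] for i in x_idx):
            phi = tuple(a if keep else b for a, b, keep in zip(mu, nu, from_mu))
            if phi not in rel:
                return False
    return True


# Hypothetical toy relation: teachers and books are independent per course.
attrs = ("course", "teacher", "book")
rel = [
    ("db", "kim", "ullman"), ("db", "kim", "elmasri"),
    ("db", "lee", "ullman"), ("db", "lee", "elmasri"),
]
print(mvd_holds(rel, attrs, ["course"], ["teacher"]))
```

Because the loop visits every ordered pair of tuples, the tuple $\psi$ of the definition is checked when the roles of $\mu$ and $\nu$ are swapped.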
Relational Representation for Theoretical Analysis
- NDO relation
  - Converting the n-gram index so that it obeys the First Normal Form (1NF)
  - Having three attributes N, D, and O
    - N: n-grams
    - D: document identifiers
    - O: offsets of n-grams within documents
- SNDO1O2 relation
  - Adding a new attribute S and splitting the attribute O into two attributes O1 and O2
  - Having five attributes S, N, D, O1, and O2
    - S: m-subsequences in which n-grams appear
    - O1: offsets of n-grams within m-subsequences
    - O2: offsets of m-subsequences within documents
- [Figure: n-gram index, NDO relation, and SNDO1O2 relation]
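The two relations can be materialized for a toy corpus as sets of tuples. The Python sketch below is only an illustration: the function names are assumptions, the sample documents are hypothetical, and the last m-subsequence of a document is allowed to be shorter than m.

```python
def msubsequences(text, m, n):
    """Length-m pieces with neighbours overlapping by n-1 (last may be shorter)."""
    step, i, out = m - n + 1, 0, []
    while True:
        out.append((text[i:i + m], i))
        if i + m >= len(text):
            return out
        i += step


def ndo_relation(docs, n):
    """Tuples (N, D, O): n-gram, document id, offset within the document."""
    return {(d[i:i + n], doc_id, i)
            for doc_id, d in enumerate(docs)
            for i in range(len(d) - n + 1)}


def sndo1o2_relation(docs, m, n):
    """Tuples (S, N, D, O1, O2), where the original offset O equals O1 + O2."""
    return {(s, s[i:i + n], doc_id, i, o2)
            for doc_id, d in enumerate(docs)
            for s, o2 in msubsequences(d, m, n)
            for i in range(len(s) - n + 1)}


docs = ["ABCDDABBCD", "DABCDABCDA"]  # hypothetical sample documents
r = sndo1o2_relation(docs, m=4, n=2)
```

Projecting each SNDO1O2 tuple to (N, D, O1 + O2) should give back the NDO relation, which is one way to see that splitting O into O1 and O2 loses no position information.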
Normalization of the n-Gram Index
- Lemma 2: Non-trivial MVDs S $\twoheadrightarrow$ NO1 and S $\twoheadrightarrow$ DO2 hold in the SNDO1O2 relation
  - Proof (sketch)
    - The set of documents where an m-subsequence occurs and the set of n-grams extracted from that m-subsequence are independent of each other
    - Due to this independence, there exist tuples corresponding to all possible combinations of documents and n-grams for a given m-subsequence
- Lemma 3: The decomposition (SNO1, SDO2) is in 4NF
  - Proof: See the paper
- Theorem 2: The 4NF decomposition (SNO1, SDO2) of the SNDO1O2 relation is identical to the front-end and back-end indexes of the n-gram/2L index
  - Proof: See the paper
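The decomposition can be tried on a toy corpus: project the SNDO1O2 relation onto (S, N, O1) and (S, D, O2), then join the projections on S. The sketch below uses invented helper names and hypothetical sample documents, and it allows the last m-subsequence to be shorter than m; it relies on the standard result that a binary decomposition is lossless exactly when the corresponding MVD holds.

```python
def msubsequences(text, m, n):
    step, i, out = m - n + 1, 0, []
    while True:
        out.append((text[i:i + m], i))
        if i + m >= len(text):
            return out
        i += step


def sndo1o2_relation(docs, m, n):
    """Tuples (S, N, D, O1, O2)."""
    return {(s, s[i:i + n], doc_id, i, o2)
            for doc_id, d in enumerate(docs)
            for s, o2 in msubsequences(d, m, n)
            for i in range(len(s) - n + 1)}


def project(rel, positions):
    return {tuple(t[p] for p in positions) for t in rel}


docs = ["ABCDDABBCD", "ABCDDABBCD"]  # hypothetical: the same text twice
r = sndo1o2_relation(docs, m=4, n=2)
sno1 = project(r, (0, 1, 3))   # front-end side: (S, N, O1)
sdo2 = project(r, (0, 2, 4))   # back-end side:  (S, D, O2)

# Natural join of the two projections on S.
joined = {(s, ng, d, o1, o2)
          for (s, ng, o1) in sno1
          for (s2, d, o2) in sdo2 if s == s2}

print(len(r), len(sno1), len(sdo2), joined == r)
```

In this toy corpus the n-gram offsets within each m-subsequence are stored once in SNO1 regardless of how many documents repeat the m-subsequence, which is the redundancy the two-level structure removes.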
Analysis of the n-Gram/2L Index
- Optimal length $m_o$
  - Length of the m-subsequence that minimizes the size of the n-gram/2L index
Index Size
- Space complexities
  - n-gram index: $O(avgdoc \times avgngram)$
  - n-gram/2L index: $O(avgdoc + avgngram)$
- Properties
  - $m_o$ is obtained by finding the length m that makes $avgdoc = avgngram$
  - Both avgdoc and avgngram increase as the database size gets larger
- Analytical results
  - Size of the n-gram/2L index is significantly reduced compared with that of the n-gram index for a large database
  - Reduction of the index size becomes more marked as the database size increases
- See the paper for the detailed analysis
- Formulas for the index size
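A rough way to see the product-versus-sum effect is to count stored offsets on a toy corpus. The Python sketch below is not the size formula from the paper: it counts offsets only (ignoring the B-Tree, document identifiers, and any compression), uses invented function names, and allows the last m-subsequence of a document to be shorter than m.

```python
def msubsequences(text, m, n):
    step, i, out = m - n + 1, 0, []
    while True:
        out.append((text[i:i + m], i))
        if i + m >= len(text):
            return out
        i += step


def ngram_index_size(docs, n):
    """Number of offsets stored by a plain n-gram index."""
    return sum(max(len(d) - n + 1, 0) for d in docs)


def two_level_size(docs, m, n):
    """Number of offsets stored by the back-end plus the front-end index."""
    occurrences = [s for d in docs for s, _ in msubsequences(d, m, n)]
    back = len(occurrences)                     # one offset per occurrence
    front = sum(max(len(s) - n + 1, 0)          # n-gram offsets, stored once
                for s in set(occurrences))      # per distinct m-subsequence
    return front + back


docs = ["ABCDDABBCD"] * 10  # hypothetical corpus with heavy repetition
print(ngram_index_size(docs, n=2), two_level_size(docs, m=4, n=2))
```

For this deliberately repetitive corpus the plain index stores 90 offsets while the two-level count is 39; calling `two_level_size` for several values of m gives a crude empirical view of how the choice of m affects the size.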
Query Performance
- Time complexities
  - n-gram index: $O(avgdoc \times avgngram)$
  - n-gram/2L index: $O(avgdoc + avgngram)$
- Analytical results
  - n-gram/2L index significantly improves the query performance over the n-gram index for a large database
  - Improvement of the query performance gets better as the database size increases
  - Query processing time increases only very slightly as the query length gets longer
    - It has been pointed out that the query performance of the n-gram index for long queries tends to be bad [Wil2003]
- See the paper for the detailed analysis
- Formulas for the query performance
Experiments
- Measures
  - Index size
  - Query performance
    - Number of page accesses
    - Wall clock time (ms)
- Data sets
  - PROTEIN-DATA: the set of protein sequence databases used in bioinformatics
  - TREC-DATA: the set of English text databases used in information retrieval
- Parameters
  - Data size: 10 MBytes, 100 MBytes, and 1 GByte
  - n = 3 (n-gram length) [Kuk1992, WZ2002]
  - m = 4 to 6 (m-subsequence length)
  - Len(Q) = 3, 6, 9, 12, 15, and 18 (query length)
Index Size (PROTEIN-DATA)
[Figure: index size for PROTEIN-DATA, with the optimal length $m_o$ marked]
- The size of the n-gram/2L index is significantly reduced compared with that of the n-gram index
  - By up to 2.7 times in PROTEIN-1G
- The reduction of index size becomes more marked as the database size increases
  - Approximately 25 for the PROTEIN-DATA as the database size is increased tenfold (10 MBytes to 100 MBytes to 1 GByte)
Query Performance (PROTEIN-DATA)
[Figures: query processing time (Len(Q) = 3 to 18); number of page accesses (data set: PROTEIN-1G); query processing time (data set: PROTEIN-1G)]
- n-gram/2L significantly improves the query performance over the n-gram index
  - Up to 13.1 times in wall clock time (PROTEIN-1G)
- Improvement gets better as the database size increases
  - 1.37 times in PROTEIN-100M, 6.65 times in PROTEIN-1G
- Query processing time increases only very slightly as the query length gets longer
  - n-gram/2L index: 53, as Len(Q) goes from 3 to 18
  - (cf. n-gram index: 32.9 times)
Conclusions
- We have shown that the redundancy in the position information existing in the n-gram index is due to non-trivial MVDs
- We have proposed the two-level structure of the n-gram index
- We have shown that the n-gram/2L index is derived by the relational normalization process that decomposes the n-gram index into 4NF
- We have provided a formal analysis of the space and time complexities of the n-gram/2L index
- Finally, through extensive experiments, we have shown that the n-gram/2L index significantly reduces the size and improves the query performance compared with the n-gram index
References
- [BR1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
- [Coh1997] Jonathan D. Cohen, Recursive Hashing Functions for n-Grams, ACM Trans. on Information Systems, Vol. 15, No. 3, pp. 291-320, July 1997.
- [EN2003] Ramez Elmasri and Shamkant B. Navathe, Fundamentals of Database Systems, Addison Wesley, 4th ed., 2003.
- [Kuk1992] Karen Kukich, Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, Vol. 24, No. 4, pp. 377-439, Dec. 1992.
- [LA1996] Joon Ho Lee and Jeong Soo Ahn, Using n-Grams for Korean Text Retrieval, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Zurich, Switzerland, pp. 216-224, 1996.
- [MM2003] James Mayfield and Paul McNamee, Single N-gram Stemming, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Toronto, Canada, pp. 415-416, July/Aug. 2003.
- [MSL2000] Ethan Miller, Dan Shen, Junli Liu, and Charles Nicholas, Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System, Journal of Digital Information 1(5), pp. 1-25, Jan. 2000.
- [MZ1996] Alistair Moffat and Justin Zobel, Self-indexing inverted files for fast text retrieval, ACM Trans. on Information Systems, Vol. 14, No. 4, pp. 349-379, Oct. 1996.
- [Nav2001] Gonzalo Navarro, A Guided Tour to Approximate String Matching, ACM Computing Surveys, Vol. 33, No. 1, pp. 31-88, Mar. 2001.
- [Ram1998] Raghu Ramakrishnan, Database Management Systems, McGraw-Hill, 1998.
- [SKS2001] Abraham Silberschatz, Henry F. Korth, and S. Sudarshan, Database System Concepts, McGraw-Hill, 4th ed., 2001.
- [SWY2002] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel, Compression of Inverted Indexes for Fast Query Evaluation, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Tampere, Finland, pp. 222-229, Aug. 2002.
- [Ull1988] Jeffrey D. Ullman, Principles of Database and Knowledge-Base Systems Vol. I, Computer Science Press, USA, 1988.
- [Wil2003] Hugh E. Williams, Genomic Information Retrieval, In Proc. the 14th Australasian Database Conference, 2003.
- [WLL2005] Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee, Min-Soo Kim, and Wook-Shin Han, Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features, In Proc. the 21st IEEE Int'l Conf. on Data Engineering (ICDE), Tokyo, Japan, Apr. 2005.
- [WMB1999] I. Witten, A. Moffat, and T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, Los Altos, California, 2nd ed., 1999.
- [WVT1990] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor, A Linear-Time Probabilistic Counting Algorithm for Database Applications, ACM Trans. on Database Systems, Vol. 15, No. 2, pp. 208-229, June 1990.
- [WZ2002] Hugh E. Williams and Justin Zobel, Indexing and Retrieval for Genomic Databases, IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 1, pp. 63-78, Jan./Feb. 2002.
- [YT1998] Ogawa Yasushi and Matsuda Toru, Optimizing query evaluation in n-gram indexing, In Proc. Int'l Conf. on Information Retrieval, ACM SIGIR, Melbourne, Australia, pp. 367-368, 1998.