n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure

1
n-Gram/2L: A Space and Time Efficient Two-Level
n-Gram Inverted Index Structure
VLDB 2005

Aug. 31, 2005
Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and
Min-Jae Lee
Department of Computer Science
Korea Advanced Institute of Science and
Technology (KAIST)

2
Contents

Introduction
Motivation and Goals
Structure of the n-Gram/2L Index
Analysis of the n-Gram/2L Index
Performance Evaluation
Conclusions

3
Inverted Index

A term-oriented index structure for quickly
searching documents containing a given term
[BR1999]
Most actively used for text searching
Classification (depending on the kind of terms)
[WMB1999]
Word-based inverted index
n-gram inverted index (simply, the n-gram index)
(the scope of this talk)

Figure: structure of the inverted index
B-Tree index on terms
Posting lists of terms
A posting: <d, [o1, ..., of]>
d: document identifier
oi: offset where term t occurs in document d
f: frequency of occurrence of term t in document d
4
n-Gram Index

n-Gram
Definition: a string of fixed length n
Extraction method
Sliding a window of length n by one character in
the text
Recording a sequence of characters in the window
(We call it the 1-sliding technique)
Example
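The 1-sliding technique and the postings it produces can be sketched as follows. This is a minimal illustration rather than code from the talk: the function names `extract_ngrams` and `build_ngram_index` and the two sample documents are assumptions made for the example.

```python
from collections import defaultdict


def extract_ngrams(text, n):
    """Return (offset, n-gram) pairs obtained by sliding a window of length n by one character."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n):
    """Map each n-gram to a posting list of (doc_id, [offsets]) entries."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for offset, gram in extract_ngrams(text, n):
            index[gram].setdefault(doc_id, []).append(offset)
    return {gram: sorted(postings.items()) for gram, postings in index.items()}


docs = ["ABCDDA", "CDABBC"]          # hypothetical sample documents
grams = extract_ngrams(docs[0], 2)   # 2-grams of the first document
index = build_ngram_index(docs, 2)
```

Each posting pairs a document identifier with the list of offsets at which the n-gram occurs, mirroring the posting layout described for the inverted index.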

5
Pros and Cons of the n-Gram Index [BR1999, MM2003]

Pros
Language-neutral
Allowing us to disregard the characteristics of
the language
Being widely used for Asian languages or DNA and
protein databases
Error tolerant
Allowing us to retrieve documents with some
errors in the query result
Being widely used for applications that allow
errors
(e.g., approximate matching)
Cons
The index size tends to be large, and the query
performance tends to be poor

6
Motivation

We note that the large size of the n-gram index
is due to the redundancy in the position
information
If a subsequence is repeated multiple times in
documents, the relative offsets (within the
subsequences) of the n-grams extracted from that
subsequence would also be indexed multiple times

We find that the two-level construction
eliminates that redundancy
If the relative offsets of n-grams extracted from
a subsequence are indexed only once, the index
size would be reduced since such repetition is
eliminated

8
Goals

We propose the two-level n-gram inverted index
(simply, n-gram/2L)
We show that the n-gram/2L index significantly
reduces the index size and improves the query
performance over the conventional n-gram index

9
Structure of the n-Gram/2L Index

Two-level structure
Back-end index: storing the offsets of
m-subsequences within documents
Front-end index: storing the offsets of n-grams
within m-subsequences
(m-subsequence: a subsequence of length m)

10
Building of the n-Gram/2L Index

Algorithm
Step 1 (back-end index)
Extracting m-subsequences from a set of documents
such that consecutive subsequences overlap with
each other by n-1
Building the back-end index using the
m-subsequences
Step 2 (front-end index)
Extracting n-grams from the set of m-subsequences
Building the front-end index using the n-grams

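Theorem 1: If m-subsequences are extracted such
that consecutive ones overlap with each other by
n-1, no n-gram is missed or duplicated
Proof (sketch)
Consecutive m-subsequences start m-n+1 positions
apart, and each m-subsequence holds only the
n-grams that start at its first m-n+1 positions,
so every n-gram start position in a document
belongs to exactly one m-subsequence
Figure: consecutive m-subsequences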
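The two building steps and the claim of Theorem 1 can be checked on a small string. The sketch below is illustrative only: the names `extract_subsequences` and `build_2l_index` are hypothetical, m >= n is assumed, and the last m-subsequence of a document is simply allowed to be shorter than m instead of being padded.

```python
from collections import defaultdict


def extract_subsequences(text, m, n):
    """Yield (offset, m-subsequence) pairs; consecutive ones overlap by n-1 characters."""
    stride = m - n + 1                      # assumes m >= n
    for start in range(0, len(text) - n + 1, stride):
        yield start, text[start:start + m]  # the last one may be shorter than m


def build_2l_index(docs, m, n):
    # Step 1: back-end index, m-subsequence -> [(doc_id, offset within document)]
    back = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for offset, sub in extract_subsequences(text, m, n):
            back[sub].append((doc_id, offset))
    # Step 2: front-end index, n-gram -> [(m-subsequence, offset within it)]
    front = defaultdict(list)
    for sub in back:                        # each distinct m-subsequence is indexed once
        for i in range(len(sub) - n + 1):
            front[sub[i:i + n]].append((sub, i))
    return dict(front), dict(back)


# Theorem 1 on a toy string: n-grams gathered through the m-subsequences
# coincide with the n-grams extracted directly by 1-sliding.
text, m, n = "ABCDDABBCD", 4, 2
via_subsequences = sorted(
    (start + i, sub[i:i + n])
    for start, sub in extract_subsequences(text, m, n)
    for i in range(len(sub) - n + 1)
)
direct = [(i, text[i:i + n]) for i in range(len(text) - n + 1)]
front, back = build_2l_index([text], m, n)
```

With m = 4 and n = 2 the stride is 3, so the toy string splits into ABCD, DDAB, and BBCD, and the two n-gram lists agree position by position.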
12
Query Processing Using the n-Gram/2L Index

Algorithm
Step 1 (front-end index)
Finding the m-subsequences that cover a query
string by searching the front-end index
Step 2 (back-end index)
Finding the documents that have a set of
m-subsequences {Si} containing the query string
by searching the back-end index

Definition 1: Cover
S covers Q if an m-subsequence S and a query
string Q satisfy one of the following four
conditions
A suffix of S matches a prefix of Q
The whole string of S matches a substring of Q
A prefix of S matches a suffix of Q
A substring of S matches the whole string of Q
Example
Figure: an m-subsequence S covering a query string Q
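The four conditions of the cover definition translate directly into string tests. The sketch below is one possible reading, not the authors' code: the function name `covers` is hypothetical, and it accepts any non-empty overlap for the suffix/prefix cases because the slide does not state a minimum overlap length.

```python
def covers(s, q):
    """Return True if m-subsequence s covers query string q (four-condition test)."""
    # Condition 4: a substring of S matches the whole string of Q.
    if q in s:
        return True
    # Condition 2: the whole string of S matches a substring of Q.
    if s in q:
        return True
    # Conditions 1 and 3: a proper suffix (prefix) of S matches a prefix (suffix) of Q.
    for k in range(1, min(len(s), len(q))):
        if s[-k:] == q[:k]:      # suffix of S == prefix of Q
            return True
        if s[:k] == q[-k:]:      # prefix of S == suffix of Q
            return True
    return False
```

For instance, `covers("ABCD", "CDEF")` holds through the shared "CD" (suffix of S, prefix of Q), while `covers("ABCD", "XYZ")` does not hold because no condition applies.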
14

Definition 2 (brief): Expand
The expand function expands a sequence of
overlapping character sequences into one
character sequence
Definition 3: Contain
A set of m-subsequences {Si} contains a query
string Q if {Si} and Q satisfy the following
condition
Let Sl Sl+1 ... Sm be a sequence of m-subsequences
overlapping with each other in {Si}. A substring
of expand(Sl Sl+1 ... Sm) matches the whole string
of Q

Cases of containment
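A small sketch of expand and contain, under the assumption that consecutive sequences overlap by a fixed number of characters (n-1 for m-subsequences that are adjacent in a document). The names `expand` and `contains` and the `overlap` parameter are illustrative choices, not definitions taken from the talk.

```python
def expand(subsequences, overlap):
    """Merge sequences in which each one overlaps the previous one by `overlap` characters."""
    if not subsequences:
        return ""
    result = subsequences[0]
    for nxt in subsequences[1:]:
        # Guard against inputs that do not actually overlap as claimed.
        if overlap and result[-overlap:] != nxt[:overlap]:
            raise ValueError("sequences do not overlap as expected")
        result += nxt[overlap:]
    return result


def contains(subsequences, query, overlap):
    """True if the query matches a substring of the expanded sequence."""
    return query in expand(subsequences, overlap)


merged = expand(["ABCD", "DDAB", "BBCD"], 1)   # n = 2, so overlap = n - 1 = 1
```

Here the three 4-subsequences rebuild the string they were cut from, and `contains(["ABCD", "DDAB"], "CDDA", 1)` holds because "CDDA" spans the boundary between the first two.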

Lemma 1: A document that has a set of
m-subsequences {Si} containing the query string Q
includes at least one m-subsequence covering Q
Algorithm (revisited)
Step 1 (front-end index)
Finding the m-subsequences that cover a query
string by searching the front-end index, which
retrieves candidate results satisfying the
necessary condition
Step 2 (back-end index)
Finding the documents that have a set of
m-subsequences {Si} containing the query string
by searching the back-end index, which refines
the candidate results

A document d has a set of m-subsequences {Si}
containing Q
⇒ A document d has at least one m-subsequence
covering Q
<A necessary condition>
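The filter-then-refine flow of the two steps can be sketched end to end. This is a simplified illustration under stated assumptions, not the authors' implementation: the front-end index here stores only which m-subsequences hold each n-gram (no offsets), the cover test is done by string comparison, the query is assumed to be at least n characters long, and all names (`build_2l_index`, `covers`, `search`) are hypothetical.

```python
from collections import defaultdict


def build_2l_index(docs, m, n):
    stride = m - n + 1
    back = defaultdict(list)     # m-subsequence -> [(doc_id, offset)]
    for doc_id, text in enumerate(docs):
        for start in range(0, len(text) - n + 1, stride):
            back[text[start:start + m]].append((doc_id, start))
    front = defaultdict(set)     # n-gram -> {m-subsequences holding it}
    for sub in back:
        for i in range(len(sub) - n + 1):
            front[sub[i:i + n]].add(sub)
    return front, back


def covers(s, q):
    if q in s or s in q:
        return True
    return any(s[-k:] == q[:k] or s[:k] == q[-k:]
               for k in range(1, min(len(s), len(q))))


def search(query, front, back, m, n):
    stride = m - n + 1
    # Step 1 (front-end): candidate m-subsequences that cover the query.
    grams = {query[i:i + n] for i in range(len(query) - n + 1)}
    candidates = {s for g in grams for s in front.get(g, ()) if covers(s, query)}
    # Step 2 (back-end): per document, expand runs of adjacent candidates
    # and keep the document if some expanded run contains the query.
    per_doc = defaultdict(dict)  # doc_id -> {offset: m-subsequence}
    for s in candidates:
        for doc_id, offset in back[s]:
            per_doc[doc_id][offset] = s
    hits = set()
    for doc_id, by_offset in per_doc.items():
        expanded, prev = "", None
        for offset in sorted(by_offset):
            s = by_offset[offset]
            if prev is not None and offset == prev + stride:
                expanded += s[n - 1:]        # adjacent: drop the n-1 overlapping chars
            else:
                if query in expanded:
                    hits.add(doc_id)
                expanded = s                 # start a new run
            prev = offset
        if query in expanded:
            hits.add(doc_id)
    return sorted(hits)


docs = ["ABCDDABBCD", "DDABABCDXX", "BCDDAXXXXX"]   # hypothetical documents
front, back = build_2l_index(docs, 4, 2)
hits = search("CDDA", front, back, 4, 2)
```

Step 1 only guarantees the necessary condition (some m-subsequence covers the query), so documents such as the second one can survive it and are discarded in Step 2 when no expanded run contains the query.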
17
Formalization of the n-Gram/2L Index

We observe that the redundancy in the position
information existing in the n-gram index is
caused by non-trivial MultiValued Dependencies
(MVDs)
We show that the n-gram/2L index can be derived
by eliminating that redundancy through relational
decomposition to the Fourth Normal Form (4NF)

18
MultiValued Dependency (MVD)

Definition [Ull1988]
Suppose we are given a relation schema R, and X
and Y are subsets of R. X →→ Y holds in R if
whenever r is a relation for R, and μ and ν are
two tuples in r, with μ[X] = ν[X] (that is, μ
and ν agree on the attributes of X), then r also
contains tuples φ and ψ, where
φ[X] = ψ[X] = μ[X] = ν[X]
φ[Y] = μ[Y] and φ[R-X-Y] = ν[R-X-Y]
ψ[Y] = ν[Y] and ψ[R-X-Y] = μ[R-X-Y]
Non-trivial MVD: Y ⊈ X and X ∪ Y ≠ R
Example
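The definition can be tested mechanically on a small relation. The sketch below is an illustration, not part of the talk: `holds_mvd` is a hypothetical helper, and the sample rows (one 4-subsequence with three 2-grams occurring in two documents) are made up for the example.

```python
from itertools import product


def holds_mvd(rows, attrs, x, y):
    """Check whether the MVD X ->> Y holds in a relation given as tuples over `attrs`."""
    z = [a for a in attrs if a not in x and a not in y]   # R - X - Y
    pos = {a: i for i, a in enumerate(attrs)}

    def proj(row, names):
        return tuple(row[pos[a]] for a in names)

    triples = {(proj(r, x), proj(r, y), proj(r, z)) for r in rows}
    # For every pair of tuples agreeing on X, the tuple that takes Y from the
    # first and R-X-Y from the second must also be present.
    for (x1, y1, _), (x2, _, z2) in product(triples, repeat=2):
        if x1 == x2 and (x1, y1, z2) not in triples:
            return False
    return True


attrs = ("S", "N", "D", "O1", "O2")
rows = [
    ("ABCD", "AB", 0, 0, 0), ("ABCD", "BC", 0, 1, 0), ("ABCD", "CD", 0, 2, 0),
    ("ABCD", "AB", 1, 0, 3), ("ABCD", "BC", 1, 1, 3), ("ABCD", "CD", 1, 2, 3),
]
full = holds_mvd(rows, attrs, ["S"], ["N", "O1"])
broken = holds_mvd(rows[:-1], attrs, ["S"], ["N", "O1"])
```

In the full relation every (n-gram, offset) pair of the subsequence is combined with every (document, offset) pair, so the dependency holds; dropping a single row breaks it.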

19
Relational Representation for Theoretical Analysis

NDO relation
Converting the n-gram index so that it obeys the
First Normal Form (1NF)
Having three attributes N, D, and O
N: n-grams
D: document identifiers
O: offsets of n-grams within documents
SNDO1O2 relation
Adding a new attribute S and splitting the
attribute O into two attributes O1 and O2
Having five attributes S, N, D, O1, and O2
S: m-subsequences in which n-grams appear
O1: offsets of n-grams within m-subsequences
O2: offsets of m-subsequences within documents

Figure: the n-gram index, the NDO relation, and the SNDO1O2 relation
22
Normalization of the n-Gram Index

Lemma 2: Non-trivial MVDs S →→ NO1 and S →→ DO2
hold in the SNDO1O2 relation
Proof (sketch)
The set of documents, where an m-subsequence
occurs, and the set of n-grams, which are
extracted from that m-subsequence, are
independent of each other
Due to this independence, there exist the tuples
corresponding to all possible combinations of
documents and n-grams for a given m-subsequence
Lemma 3: The decomposition (SNO1, SDO2) is in
4NF
Proof: See the paper
Theorem 2: The 4NF decomposition (SNO1, SDO2) of
the SNDO1O2 relation is identical to the
front-end and back-end indexes of the n-gram/2L
index
Proof: See the paper
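The decomposition can be tried on toy data: project the SNDO1O2 rows onto (S, N, O1) and (S, D, O2), then join the projections back on S. This is a hedged sketch with made-up, deliberately repetitive documents; the helper name `sndo1o2` is an assumption, and the last m-subsequence of a document is left unpadded.

```python
def sndo1o2(docs, m, n):
    """Rows (S, N, D, O1, O2), one per n-gram occurrence in every document."""
    stride = m - n + 1
    rows = set()
    for d, text in enumerate(docs):
        for o2 in range(0, len(text) - n + 1, stride):
            s = text[o2:o2 + m]
            for o1 in range(len(s) - n + 1):
                rows.add((s, s[o1:o1 + n], d, o1, o2))
    return rows


docs = ["ABCABCABCABC", "ABCABCXYZ"]                 # hypothetical, repetitive on purpose
r = sndo1o2(docs, m=4, n=2)
sno1 = {(s, g, o1) for s, g, d, o1, o2 in r}         # corresponds to the front-end index
sdo2 = {(s, d, o2) for s, g, d, o1, o2 in r}         # corresponds to the back-end index
rejoined = {(s, g, d, o1, o2)
            for s, g, o1 in sno1
            for s2, d, o2 in sdo2 if s == s2}        # natural join on S
sizes = (len(r), len(sno1), len(sdo2))
```

On this input the join reproduces the original relation exactly, and the two projections together hold fewer rows (10 + 7) than the undecomposed relation (19); the gap depends on how often m-subsequences repeat, so less repetitive data would show a smaller or no saving.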

24
Analysis of the n-Gram/2L Index

Notation

Optimal length mo
Length of the m-subsequence that minimizes the
size of the n-gram/2L index

25
Index Size

Space complexities
n-gram index: O(avgdoc × avgngram)
n-gram/2L index: O(avgdoc + avgngram)
Properties
mo is obtained by finding the length m that makes
avgdoc = avgngram
Both avgdoc and avgngram increase as the database
size gets larger
Analytical results
The size of the n-gram/2L index is significantly
reduced compared with that of the n-gram index
for a large database
The reduction of the index size becomes more marked
as the database size increases
See the paper for the detailed analysis
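As a rough numeric illustration, assume the two space complexities are read as a product form for the n-gram index and a sum form for the n-gram/2L index, and ignore all constant factors. The function name `size_ratio` and the sample values are hypothetical, not measurements from the talk.

```python
def size_ratio(avgdoc, avgngram):
    """Product-form cost divided by sum-form cost, constants ignored."""
    return (avgdoc * avgngram) / (avgdoc + avgngram)


# Ratio for a few illustrative (avgdoc, avgngram) pairs that grow together.
ratios = [size_ratio(d, g) for d, g in [(2, 2), (4, 4), (8, 8)]]
```

Under that reading the ratio grows as both quantities grow, which matches the stated property that the reduction becomes more marked for larger databases.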

Formulas for the index size

27
Query Performance

Time complexities
n-gram index: O(avgdoc × avgngram)
n-gram/2L index: O(avgdoc + avgngram)
Analytical results
The n-gram/2L index significantly improves the query
performance over the n-gram index for a large
database
The improvement in query performance grows
as the database size increases
Query processing time increases only very
slightly as the query length gets longer
It has been pointed out that the query
performance of the n-gram index for long queries
tends to be bad [Wil2003]
See the paper for the detailed analysis

Formulas for the query performance

29
Experiments

Measures
Index size
Query performance
Number of page accesses
Wall clock time (ms)
Data sets
PROTEIN-DATA: the set of protein sequence
databases used in bioinformatics
TREC-DATA: the set of English text databases used
in information retrieval
Parameters
Data size: 10 MBytes, 100 MBytes, and 1 GByte
n: 3 (n-gram length) [Kuk1992, WZ2002]
m: 4 to 6 (m-subsequence length)
Len(Q): 3, 6, 9, 12, 15, and 18 (query length)

30
Index Size (PROTEIN-DATA)
Figure: index size for PROTEIN-DATA (label: optimal length mo)

The size of the n-gram/2L index is significantly
reduced compared with that of the n-gram index
By up to 2.7 times in PROTEIN-1G
The reduction of index size becomes more marked as
the database size increases
Approximately 25% for the PROTEIN-DATA as the
database size is increased tenfold (10 MBytes
→ 100 MBytes → 1 GByte)

31
Query Performance (PROTEIN-DATA)
Figure panels:
<Query processing time> (Len(Q) = 3 to 18)
<No. of page accesses> (data set: PROTEIN-1G)
<Query processing time> (data set: PROTEIN-1G)

n-gram/2L significantly improves the query
performance over the n-gram index
Up to 13.1 times in wall clock time (PROTEIN-1G)
Improvement gets better as the database size
increases
1.37 times in PROTEIN-100M; 6.65 times in
PROTEIN-1G
Query processing time increases only very
slightly as the query length gets longer
n-gram/2L index: 53% as Len(Q) goes from 3 to 18
(cf. n-gram index: 32.9 times)

32
Conclusions

We have shown that the redundancy in the position
information existing in the n-gram index is due
to non-trivial MVDs
We have proposed the two-level structure of the
n-gram index
We have shown that the n-gram/2L index is derived
by the relational normalization process that
decomposes the n-gram index into 4NF
We have provided a formal analysis of the space
and time complexities of the n-gram/2L index
Finally, through extensive experiments, we have
shown that the n-gram/2L index significantly reduces
the size and improves the query performance
compared with the n-gram index

33
References

[BR1999] Ricardo Baeza-Yates and Berthier Ribeiro-Neto,
Modern Information Retrieval, ACM Press, 1999.
[Coh1997] Jonathan D. Cohen, Recursive Hashing Functions
for n-Grams, ACM Trans. on Information Systems,
Vol. 15, No. 3, pp. 291-320, July 1997.
[EN2003] Ramez Elmasri and Shamkant B. Navathe,
Fundamentals of Database Systems, Addison Wesley,
4th ed., 2003.
[Kuk1992] Karen Kukich, Techniques for Automatically
Correcting Words in Text, ACM Computing Surveys,
Vol. 24, No. 4, pp. 377-439, Dec. 1992.
[LA1996] Joon Ho Lee and Jeong Soo Ahn, Using n-Grams
for Korean Text Retrieval, In Proc. Int'l Conf. on
Information Retrieval, ACM SIGIR, Zurich, Switzerland,
pp. 216-224, 1996.
[MM2003] James Mayfield and Paul McNamee, Single N-gram
Stemming, In Proc. Int'l Conf. on Information Retrieval,
ACM SIGIR, Toronto, Canada, pp. 415-416, July/Aug. 2003.
[MSL2000] Ethan Miller, Dan Shen, Junli Liu, and
Charles Nicholas, Performance and Scalability of a
Large-Scale N-gram Based Information Retrieval System,
Journal of Digital Information 1(5), pp. 1-25, Jan. 2000.
[MZ1996] Alistair Moffat and Justin Zobel, Self-indexing
inverted files for fast text retrieval, ACM Trans. on
Information Systems, Vol. 14, No. 4, pp. 349-379,
Oct. 1996.
[Nav2001] Gonzalo Navarro, A Guided Tour to Approximate
String Matching, ACM Computing Surveys, Vol. 33, No. 1,
pp. 31-88, Mar. 2001.
[Ram1998] Raghu Ramakrishnan, Database Management
Systems, McGraw-Hill, 1998.
[SKS2001] Abraham Silberschatz, Henry F. Korth, and
S. Sudarshan, Database System Concepts, McGraw-Hill,
4th ed., 2001.
[SWY2002] Falk Scholer, Hugh E. Williams, John Yiannis,
and Justin Zobel, Compression of Inverted Indexes for
Fast Query Evaluation, In Proc. Int'l Conf. on
Information Retrieval, ACM SIGIR, Tampere, Finland,
pp. 222-229, Aug. 2002.
[Ull1988] Jeffrey D. Ullman, Principles of Database and
Knowledge-Base Systems Vol. I, Computer Science Press,
USA, 1988.
[Wil2003] Hugh E. Williams, Genomic Information
Retrieval, In Proc. the 14th Australasian Database
Conference, 2003.
[WLL2005] Kyu-Young Whang, Min-Jae Lee, Jae-Gil Lee,
Min-Soo Kim, and Wook-Shin Han, Odysseus: a
High-Performance ORDBMS Tightly-Coupled with IR
Features, In Proc. the 21st IEEE Int'l Conf. on Data
Engineering (ICDE), Tokyo, Japan, Apr. 2005.
[WMB1999] I. Witten, A. Moffat, and T. Bell, Managing
Gigabytes: Compressing and Indexing Documents and
Images, Morgan Kaufmann Publishers, Los Altos,
California, 2nd ed., 1999.
[WVT1990] Kyu-Young Whang, Brad T. Vander-Zanden, and
Howard M. Taylor, A Linear-Time Probabilistic Counting
Algorithm for Database Applications, ACM Trans. on
Database Systems, Vol. 15, No. 2, pp. 208-229, June 1990.
[WZ2002] Hugh E. Williams and Justin Zobel, Indexing and
Retrieval for Genomic Databases, IEEE Trans. on
Knowledge and Data Engineering, Vol. 14, No. 1,
pp. 63-78, Jan./Feb. 2002.
[YT1998] Ogawa Yasushi and Matsuda Toru, Optimizing
query evaluation in n-gram indexing, In Proc. Int'l
Conf. on Information Retrieval, ACM SIGIR, Melbourne,
Australia, pp. 367-368, 1998.
