Title: n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure


1
n-Gram/2L: A Space and Time Efficient Two-Level
n-Gram Inverted Index Structure
VLDB 2005
  • Aug. 31, 2005
  • Min-Soo Kim, Kyu-Young Whang, Jae-Gil Lee, and
    Min-Jae Lee
  • Department of Computer Science
  • Korea Advanced Institute of Science and
    Technology (KAIST)

2
Contents
  • Introduction
  • Motivation and Goals
  • Structure of the n-Gram/2L Index
  • Analysis of the n-Gram/2L Index
  • Performance Evaluation
  • Conclusions

3
Inverted Index
  • A term-oriented index structure for quickly
    searching documents containing a given term
    [BR1999]
  • Most actively used for text searching
  • Classification (depending on the kind of terms)
    [WMB1999]
  • Word-based inverted index
  • n-gram inverted index (simply, the n-gram index)
    (the scope of this talk)

[Figure: a B-Tree index on terms, each term pointing to its posting list]
A posting has the form <d, [o1, ..., of]>, where
  d: document identifier
  oi: offset where term t occurs in document d
  f: frequency of occurrence of term t in document d
4
n-Gram Index
  • n-Gram
  • Definition: a string of fixed length n
  • Extraction method
  • Sliding a window of length n by one character in
    the text
  • Recording a sequence of characters in the window
  • (We call it the 1-sliding technique)
  • Example
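
The 1-sliding technique and the resulting n-gram index can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `extract_ngrams` and `build_ngram_index` are hypothetical, documents are identified by their position in a list, and a plain dictionary stands in for the B-Tree on terms.

```python
from collections import defaultdict


def extract_ngrams(text, n):
    """Slide a window of length n over the text one character at a time."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n):
    """Map each n-gram to a posting list of (doc_id, [offsets]) pairs."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for gram, offset in extract_ngrams(text, n):
            index[gram].setdefault(doc_id, []).append(offset)
    # Sort postings by document identifier, as in a posting list.
    return {gram: sorted(postings.items()) for gram, postings in index.items()}
```

Each posting pairs a document identifier with the offsets at which the n-gram occurs; the frequency f is the length of that offset list.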

5
Pros and Cons of the n-Gram Index [BR1999, MM2003]
  • Pros
  • Language-neutral
  • Allowing us to disregard the characteristics of
    the language
  • Being widely used for Asian languages or DNA and
    protein databases
  • Error tolerant
  • Allowing us to retrieve documents with some
    errors in the query result
  • Being widely used for applications that allow
    errors
  • (e.g., approximate matching)
  • Cons
  • The size tends to be large, and the query
    performance tends to be bad

6
Motivation
  • We note that the large size of the n-gram index
    is due to the redundancy in the position
    information
  • If a subsequence is repeated multiple times in
    documents, the relative offsets (within the
    subsequences) of the n-grams extracted from that
    subsequence would also be indexed multiple times

7
  • We find that the two-level construction
    eliminates that redundancy
  • If the relative offsets of n-grams extracted from
    a subsequence are indexed only once, the index
    size would be reduced since such repetition is
    eliminated

8
Goals
  • We propose the two-level n-gram inverted index
    (simply, n-gram/2L)
  • We show that the n-gram/2L index significantly
    reduces the index size and improves the query
    performance over the conventional n-gram index

9
Structure of the n-Gram/2L Index
  • Two-level structure
  • Back-end index storing the offsets of
    m-subsequences within documents
  • Front-end index storing the offsets of n-grams
    within m-subsequences
  • (m-subsequence: a subsequence of length m)

10
Building of the n-Gram/2L Index
  • Algorithm
  • Step 1 (back-end index)
  • Extracting m-subsequences from a set of documents
    such that consecutive subsequences overlap with
    each other by n-1 characters
  • Building the back-end index using the
    m-subsequences
  • Step 2 (front-end index)
  • Extracting n-grams from the set of m-subsequences
  • Building the front-end index using the n-grams
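
The two build steps can be sketched as follows. This is a simplified illustration under stated assumptions: the names `extract_m_subsequences` and `build_2l_index` are hypothetical, dictionaries replace the B-Trees, m >= n is required, and a trailing subsequence shorter than m is kept as-is (the slides do not say how the tail is handled).

```python
from collections import defaultdict


def extract_m_subsequences(text, m, n):
    """Cut the text into m-subsequences that overlap by n-1 characters."""
    step = m - n + 1  # advancing by m-(n-1) leaves an overlap of n-1
    subs = []
    for start in range(0, len(text), step):
        sub = text[start:start + m]
        if len(sub) < n:  # a tail too short to hold any n-gram
            break
        subs.append((sub, start))
    return subs


def build_2l_index(docs, m, n):
    # Step 1 (back-end): m-subsequence -> {doc_id: [offsets in document]}
    back_end = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for sub, offset in extract_m_subsequences(text, m, n):
            back_end[sub][doc_id].append(offset)
    # Step 2 (front-end): n-gram -> {m-subsequence: [offsets in subsequence]}
    front_end = defaultdict(lambda: defaultdict(list))
    for sub in back_end:  # each distinct m-subsequence is indexed only once
        for i in range(len(sub) - n + 1):
            front_end[sub[i:i + n]][sub].append(i)
    to_dict = lambda idx: {k: dict(v) for k, v in idx.items()}
    return to_dict(front_end), to_dict(back_end)
```

Because the front-end index is built from the set of distinct m-subsequences, the offsets of n-grams within a repeated subsequence are stored once, however many documents contain it.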

11
  • Theorem 1: If m-subsequences are extracted such
    that consecutive ones overlap with each other by
    n-1 characters, no n-gram is missed or duplicated
  • Proof (sketch)

[Figure: m-subsequences]
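
Theorem 1 can be checked empirically on small inputs. The sketch below is not the proof from the paper; it compares the n-grams obtained directly by 1-sliding with those obtained through overlapping m-subsequences. Both function names are hypothetical.

```python
def ngram_positions(text, n):
    """All (n-gram, absolute offset) pairs obtained by 1-sliding."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


def ngram_positions_via_subsequences(text, m, n):
    """The same pairs, collected from m-subsequences overlapping by n-1."""
    step = m - n + 1
    found = []
    for start in range(0, len(text), step):
        sub = text[start:start + m]
        # n-grams inside this subsequence, shifted to document offsets
        for i in range(len(sub) - n + 1):
            found.append((sub[i:i + n], start + i))
    return found


def theorem1_holds(text, m, n):
    """True if no n-gram occurrence is missed or duplicated."""
    return ngram_positions(text, n) == ngram_positions_via_subsequences(text, m, n)
```

An overlap smaller than n-1 would lose the n-grams that straddle two subsequences, and a larger overlap would extract some n-grams twice.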
12
Query Processing Using the n-Gram/2L Index
  • Algorithm
  • Step 1 (front-end index)
  • Finding the m-subsequences that cover a query
    string by searching the front-end index
  • Step 2 (back-end index)
  • Finding the documents that have a set of
    m-subsequences {Si} containing the query string
    by searching the back-end index

13
  • Definition 1: Cover
  • S covers Q if an m-subsequence S and a query
    string Q satisfy one of the following four
    conditions:
  • A suffix of S matches a prefix of Q
  • The whole string of S matches a substring of Q
  • A prefix of S matches a suffix of Q
  • A substring of S matches the whole string of Q
  • Example

[Figure: example of an m-subsequence S covering a query string Q]
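
The four conditions of Definition 1 translate directly into a predicate. This is a sketch only: the name `covers` and the `min_overlap` parameter are assumptions added here (the definition on the slide states no minimum overlap length; `min_overlap` must be at least 1).

```python
def covers(s, q, min_overlap=1):
    """Return True if m-subsequence s covers query string q (Definition 1)."""
    # Condition 2: the whole of s matches a substring of q.
    # Condition 4: a substring of s matches the whole of q.
    if s in q or q in s:
        return True
    # Conditions 1 and 3 need a proper overlap of k characters.
    longest = min(len(s), len(q)) - 1
    for k in range(min_overlap, longest + 1):
        if s[-k:] == q[:k]:  # condition 1: suffix of s matches prefix of q
            return True
        if s[:k] == q[-k:]:  # condition 3: prefix of s matches suffix of q
            return True
    return False
```

For example, `covers("ABCD", "CDEF")` holds by the first condition, since the suffix "CD" of S matches the prefix "CD" of Q.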
14
  • Definition 2 (brief): Expand
  • The expand function expands a sequence of
    overlapping character sequences into one
    character sequence
  • Definition 3: Contain
  • A set of m-subsequences {Si} contains a query
    string Q if {Si} and Q satisfy the following
    condition:
  • Let Sl Sl+1 ... Sm be a sequence of m-subsequences
    overlapping with each other in {Si}. A substring
    of expand(Sl Sl+1 ... Sm) matches the whole string
    of Q
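
Definitions 2 and 3 can be sketched as follows, assuming the subsequences are already given in document order and that consecutive ones overlap by a known number of characters (n-1 in the n-gram/2L index). The names `expand` and `contains` and the `overlap` parameter are illustrative.

```python
def expand(subsequences, overlap):
    """Merge a sequence of overlapping strings into one string."""
    if not subsequences:
        return ""
    result = subsequences[0]
    for sub in subsequences[1:]:
        # The last `overlap` characters so far must equal the start of sub.
        if overlap and result[-overlap:] != sub[:overlap]:
            raise ValueError("subsequences do not overlap as expected")
        result += sub[overlap:]
    return result


def contains(subsequences, q, overlap):
    """True if a substring of the expanded sequence matches all of q."""
    return q in expand(subsequences, overlap)
```

With n = 3 the overlap is 2, so `expand(["ABCD", "CDEF", "EFGH"], 2)` yields "ABCDEFGH", which contains the query "DEFG".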

15
  • Cases of containment

16
  • Lemma 1: A document that has a set of
    m-subsequences {Si} containing the query string Q
    includes at least one m-subsequence covering Q
  • Algorithm (revisited)
  • Step 1 (front-end index)
  • Finding the m-subsequences that cover a query
    string by searching the front-end index, to
    retrieve candidate results satisfying the
    necessary condition
  • Step 2 (back-end index)
  • Finding the documents that have a set of
    m-subsequences {Si} containing the query string
    by searching the back-end index, to refine the
    candidate results

A document d has a set of m-subsequences {Si}
containing Q
  ⇒ A document d has at least one m-subsequence
    covering Q
(a necessary condition)
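
The two-step query algorithm can be sketched end to end as follows. This is a simplified illustration rather than the paper's algorithm: all function names are hypothetical, dictionaries replace the B-Trees, the front-end index here stores only which m-subsequences hold each n-gram (no offsets), the cover test requires an overlap of at least n characters, and the query must be at least n characters long.

```python
from collections import defaultdict


def build_2l_index(docs, m, n):
    step = m - n + 1
    back_end = defaultdict(list)   # m-subsequence -> [(doc_id, offset)]
    for doc_id, text in enumerate(docs):
        for start in range(0, len(text), step):
            sub = text[start:start + m]
            if len(sub) >= n:
                back_end[sub].append((doc_id, start))
    front_end = defaultdict(set)   # n-gram -> {m-subsequence}
    for sub in back_end:
        for i in range(len(sub) - n + 1):
            front_end[sub[i:i + n]].add(sub)
    return dict(front_end), dict(back_end)


def covers(s, q, min_overlap):
    if s in q or q in s:
        return True
    for k in range(min_overlap, min(len(s), len(q))):
        if s[-k:] == q[:k] or s[:k] == q[-k:]:
            return True
    return False


def search(query, front_end, back_end, m, n):
    if len(query) < n:
        raise ValueError("query must contain at least one n-gram")
    step = m - n + 1
    # Step 1 (front-end): m-subsequences covering the query.
    candidates = set()
    for i in range(len(query) - n + 1):
        candidates |= front_end.get(query[i:i + n], set())
    covering = {s for s in candidates if covers(s, query, n)}
    # Step 2 (back-end): place covering subsequences in each document.
    placed = defaultdict(dict)     # doc_id -> {offset: m-subsequence}
    for s in covering:
        for doc_id, offset in back_end[s]:
            placed[doc_id][offset] = s
    result = set()
    for doc_id, by_offset in placed.items():
        run, prev = "", None
        for offset in sorted(by_offset):
            if prev is not None and offset == prev + step:
                run += by_offset[offset][n - 1:]   # expand adjacent ones
            else:
                if query in run:                   # contain test on old run
                    result.add(doc_id)
                run = by_offset[offset]
            prev = offset
        if query in run:
            result.add(doc_id)
    return sorted(result)
```

Step 1 yields only candidates, because covering is a necessary condition; Step 2 expands runs of adjacent covering subsequences within each document and keeps the document only if the expanded text contains the query.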
17
Formalization of the n-Gram/2L Index
  • We observe that the redundancy in the position
    information existing in the n-gram index is
    caused by non-trivial MultiValued Dependencies
    (MVDs)
  • We show that the n-gram/2L index can be derived
    by eliminating that redundancy through relational
    decomposition to the Fourth Normal Form (4NF)

18
MultiValued Dependency (MVD)
  • Definition [Ull1988]
  • Suppose we are given a relation schema R, and X
    and Y are subsets of R. X →→ Y holds in R if
    whenever r is a relation for R, and μ and ν are
    two tuples in r, with μ[X] = ν[X] (that is, μ
    and ν agree on the attributes of X), then r also
    contains tuples φ and ψ, where
  • φ[X] = ψ[X] = μ[X] = ν[X]
  • φ[Y] = μ[Y] and φ[R-X-Y] = ν[R-X-Y]
  • ψ[Y] = ν[Y] and ψ[R-X-Y] = μ[R-X-Y]
  • Non-trivial MVD: Y ⊈ X and X ∪ Y ≠ R
  • Example
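
The MVD definition can be tested mechanically on a small relation instance. The sketch below is illustrative: the function name `mvd_holds` and the attribute names used in the usage note are hypothetical, and a relation is modeled as a set of tuples whose columns follow `attrs`.

```python
def mvd_holds(relation, attrs, x, y):
    """Check whether the MVD X ->> Y holds in the given relation instance."""
    idx = {a: i for i, a in enumerate(attrs)}
    from_mu = set(x) | set(y)      # attributes that phi copies from mu
    rel = set(relation)
    for mu in rel:
        for nu in rel:
            # Only pairs that agree on X are constrained.
            if any(mu[idx[a]] != nu[idx[a]] for a in x):
                continue
            # phi takes X and Y from mu, and R-X-Y from nu.
            phi = tuple(mu[idx[a]] if a in from_mu else nu[idx[a]]
                        for a in attrs)
            if phi not in rel:
                return False
    # psi needs no separate check: it is phi for the swapped pair (nu, mu).
    return True
```

For a relation over attributes X, Y, Z holding the tuples (1, a, p), (1, a, q), (1, b, p), and (1, b, q), `mvd_holds(rel, ("X", "Y", "Z"), ["X"], ["Y"])` is true; removing any one tuple makes it false.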

19
Relational Representation for Theoretical Analysis
  • NDO relation
  • Converting the n-gram index so that it obeys the
    First Normal Form (1NF)
  • Having three attributes N, D, and O
  • N: n-grams
  • D: document identifiers
  • O: offsets of n-grams within documents
  • SNDO1O2 relation
  • Adding a new attribute S and splitting the
    attribute O into two attributes O1 and O2
  • Having five attributes S, N, D, O1, and O2
  • S: m-subsequences in which n-grams appear
  • O1: offsets of n-grams within m-subsequences
  • O2: offsets of m-subsequences within documents

[Figure: the n-gram index, the NDO relation, and the SNDO1O2 relation]
22
Normalization of the n-Gram Index
  • Lemma 2: Non-trivial MVDs S →→ NO1 and S →→ DO2
    hold in the SNDO1O2 relation
  • Proof (sketch)
  • The set of documents, where an m-subsequence
    occurs, and the set of n-grams, which are
    extracted from that m-subsequence, are
    independent of each other
  • Due to this independence, there exist the tuples
    corresponding to all possible combinations of
    documents and n-grams for a given m-subsequence
  • Lemma 3: The decomposition (SNO1, SDO2) is in
    4NF
  • Proof: See the paper
  • Theorem 2: The 4NF decomposition (SNO1, SDO2) of
    the SNDO1O2 relation is identical to the
    front-end and back-end indexes of the n-gram/2L
    index
  • Proof: See the paper
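
The decomposition can be illustrated on a toy collection. The sketch below builds the SNDO1O2 relation, projects it onto (S, N, O1) and (S, D, O2), and joins the projections back on S. The function names are hypothetical, and the treatment of a trailing subsequence shorter than m is an assumption.

```python
def sndo1o2(docs, m, n):
    """Build the SNDO1O2 relation as a set of (S, N, D, O1, O2) tuples."""
    step = m - n + 1
    rel = set()
    for d, text in enumerate(docs):
        for o2 in range(0, len(text), step):
            s = text[o2:o2 + m]
            for o1 in range(len(s) - n + 1):
                rel.add((s, s[o1:o1 + n], d, o1, o2))
    return rel


def decompose(rel):
    """Project onto SNO1 (front-end) and SDO2 (back-end)."""
    sno1 = {(s, g, o1) for s, g, d, o1, o2 in rel}
    sdo2 = {(s, d, o2) for s, g, d, o1, o2 in rel}
    return sno1, sdo2


def natural_join(sno1, sdo2):
    """Join the two projections on the shared attribute S."""
    return {(s, g, d, o1, o2)
            for s, g, o1 in sno1
            for s2, d, o2 in sdo2 if s == s2}
```

For the documents "ABCABCABC" and "ABCABC" with m = 4 and n = 2, the relation has 13 tuples while the two projections hold 5 tuples each, and the join reproduces the original relation, so no information is lost.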

24
Analysis of the n-Gram/2L Index
  • Notation
  • Optimal length mo
  • Length of the m-subsequence that minimizes the
    size of the n-gram/2L index

25
Index Size
  • Space complexities
  • n-gram index: O(avgdoc × avgngram)
  • n-gram/2L index: O(avgdoc + avgngram)
  • Properties
  • mo is obtained by finding the length m that makes
    avgdoc = avgngram
  • Both avgdoc and avgngram increase as the database
    size gets larger
  • Analytical results
  • Size of the n-gram/2L index is significantly
    reduced compared with that of the n-gram index
    for a large database
  • Reduction of the index size becomes more marked
    as the database size increases
  • See the paper for the detailed analysis

26
  • Formulas for the index size

27
Query Performance
  • Time complexities
  • n-gram index: O(avgdoc × avgngram)
  • n-gram/2L index: O(avgdoc + avgngram)
  • Analytical results
  • n-gram/2L index significantly improves the query
    performance over the n-gram index for a large
    database
  • Improvement of the query performance gets better
    as the database size increases
  • Query processing time increases only very
    slightly as the query length gets longer
  • It has been pointed out that the query
    performance of the n-gram index for long queries
    tends to be bad [Wil2003]
  • See the paper for the detailed analysis

28
  • Formulas for the query performance

29
Experiments
  • Measures
  • Index size
  • Query performance
  • Number of page accesses
  • Wall clock time (ms)
  • Data sets
  • PROTEIN-DATA: the set of protein sequence
    databases used in bioinformatics
  • TREC-DATA: the set of English text databases used
    in information retrieval
  • Parameters
  • Data size: 10 MBytes, 100 MBytes, and 1 GByte
  • n = 3 (n-gram length) [Kuk1992, WZ2002]
  • m = 4 ~ 6 (m-subsequence length)
  • Len(Q) = 3, 6, 9, 12, 15, and 18 (query length)

30
Index Size (PROTEIN-DATA)
[Figure: index size for PROTEIN-DATA, with the optimal length mo marked]
  • The size of the n-gram/2L index is significantly
    reduced compared with that of the n-gram index
  • By up to 2.7 times in PROTEIN-1G
  • The reduction of index size becomes more marked as
    the database size increases
  • Approximately 25 for the PROTEIN-DATA as the
    database size is increased tenfold (10 MBytes
    → 100 MBytes → 1 GByte)

31
Query Performance (PROTEIN-DATA)
[Figure panels: query processing time (Len(Q) = 3 ~ 18);
number of page accesses (data set: PROTEIN-1G);
query processing time (data set: PROTEIN-1G)]
  • n-gram/2L significantly improves the query
    performance over the n-gram index
  • Up to 13.1 times in wall clock time (PROTEIN-1G)
  • Improvement gets better as the database size
    increases
  • 1.37 times in PROTEIN-100M; 6.65 times in
    PROTEIN-1G
  • Query processing time increases only very
    slightly as the query length gets longer
  • n-gram/2L index: 53, Len(Q) = 3 → 18
  • (cf. n-gram index: 32.9 times)

32
Conclusions
  • We have shown that the redundancy in the position
    information existing in the n-gram index is due
    to non-trivial MVDs
  • We have proposed the two-level structure of the
    n-gram index
  • We have shown that the n-gram/2L index is derived
    by the relational normalization process that
    decomposes the n-gram index into 4NF
  • We have provided a formal analysis of the space
    and time complexities of the n-gram/2L index
  • Finally, through extensive experiments, we have
    shown that the n-gram/2L index significantly
    reduces the size and improves the query
    performance compared with the n-gram index

33
References
  • [BR1999] Ricardo Baeza-Yates and Berthier
    Ribeiro-Neto, Modern Information Retrieval, ACM
    Press, 1999.
  • [Coh1997] Jonathan D. Cohen, Recursive Hashing
    Functions for n-Grams, ACM Trans. on Information
    Systems, Vol. 15, No. 3, pp. 291-320, July 1997.
  • [EN2003] Ramez Elmasri and Shamkant B. Navathe,
    Fundamentals of Database Systems, Addison Wesley,
    4th ed., 2003.
  • [Kuk1992] Karen Kukich, Techniques for
    Automatically Correcting Words in Text, ACM
    Computing Surveys, Vol. 24, No. 4, pp. 377-439,
    Dec. 1992.
  • [LA1996] Joon Ho Lee and Jeong Soo Ahn, Using
    n-Grams for Korean Text Retrieval, In Proc.
    Int'l Conf. on Information Retrieval, ACM SIGIR,
    Zurich, Switzerland, pp. 216-224, 1996.
  • [MM2003] James Mayfield and Paul McNamee, Single
    N-gram Stemming, In Proc. Int'l Conf. on
    Information Retrieval, ACM SIGIR, Toronto,
    Canada, pp. 415-416, July/Aug. 2003.
  • [MSL2000] Ethan Miller, Dan Shen, Junli Liu, and
    Charles Nicholas, Performance and Scalability of
    a Large-Scale N-gram Based Information Retrieval
    System, Journal of Digital Information 1(5), pp.
    1-25, Jan. 2000.
  • [MZ1996] Alistair Moffat and Justin Zobel,
    Self-indexing inverted files for fast text
    retrieval, ACM Trans. on Information Systems,
    Vol. 14, No. 4, pp. 349-379, Oct. 1996.
  • [Nav2001] Gonzalo Navarro, A Guided Tour to
    Approximate String Matching, ACM Computing
    Surveys, Vol. 33, No. 1, pp. 31-88, Mar. 2001.
  • [Ram1998] Raghu Ramakrishnan, Database Management
    Systems, McGraw-Hill, 1998.

34
  • [SKS2001] Abraham Silberschatz, Henry F. Korth,
    and S. Sudarshan, Database System Concepts,
    McGraw-Hill, 4th ed., 2001.
  • [SWY2002] Falk Scholer, Hugh E. Williams, John
    Yiannis, and Justin Zobel, Compression of
    Inverted Indexes for Fast Query Evaluation, In
    Proc. Int'l Conf. on Information Retrieval, ACM
    SIGIR, Tampere, Finland, pp. 222-229, Aug. 2002.
  • [Ull1988] Jeffrey D. Ullman, Principles of
    Database and Knowledge-Base Systems Vol. I,
    Computer Science Press, USA, 1988.
  • [Wil2003] Hugh E. Williams, Genomic Information
    Retrieval, In Proc. the 14th Australasian
    Database Conference, 2003.
  • [WLL2005] Kyu-Young Whang, Min-Jae Lee, Jae-Gil
    Lee, Min-Soo Kim, and Wook-Shin Han, Odysseus: a
    High-Performance ORDBMS Tightly-Coupled with IR
    Features, In Proc. the 21st IEEE Int'l Conf. on
    Data Engineering (ICDE), Tokyo, Japan, Apr. 2005.
  • [WMB1999] I. Witten, A. Moffat, and T. Bell,
    Managing Gigabytes: Compressing and Indexing
    Documents and Images, Morgan Kaufmann Publishers,
    Los Altos, California, 2nd ed., 1999.
  • [WVT1990] Kyu-Young Whang, Brad T. Vander-Zanden,
    and Howard M. Taylor, A Linear-Time
    Probabilistic Counting Algorithm for Database
    Applications, ACM Trans. on Database Systems,
    Vol. 15, No. 2, pp. 208-229, June 1990.
  • [WZ2002] Hugh E. Williams and Justin Zobel,
    Indexing and Retrieval for Genomic Databases,
    IEEE Trans. on Knowledge and Data Engineering,
    Vol. 14, No. 1, pp. 63-78, Jan./Feb. 2002.
  • [YT1998] Ogawa Yasushi and Matsuda Toru,
    Optimizing query evaluation in n-gram indexing,
    In Proc. Int'l Conf. on Information Retrieval,
    ACM SIGIR, Melbourne, Australia, pp. 367-368,
    1998.