Text Indexing - PowerPoint PPT Presentation

Slides: 42
Provided by: paol59
Transcript and Presenter's Notes


1
Text Indexing
  • S. Srinivasa Rao
  • April 19, 2007
  • based on slides by Paolo Ferragina

2
Why are string data interesting?
  • They are ubiquitous
  • Digital libraries and product catalogues
  • Electronic white and yellow pages
  • Specialized information sources (e.g. Genomic or
    Patent dbs)
  • Web page repositories
  • Private information dbs
  • ...
  • String collections are growing at a staggering
    rate
  • ...more than 10Tb of textual data in the web
  • ...more than 15Gb of base pairs in the genomic dbs

3
The need for an index
  • Brute-force scanning is not a viable approach
  • Fast single searches
  • Multiple simple searches for complex queries
  • The index is a basic block of any IR system.
  • An IR system also encompasses
  • IR models
  • Ranking algorithms
  • Query languages and operations
  • ...

We will concentrate only on index design !!
4
Two families of indexes
Two indexing approaches
  • Word-based indexes: a notion of "word"
    must be defined.
  • Inverted files, signature files or bitmaps.
  • Full-text indexes: no constraint on the text or on the
    queries.
  • Suffix array, suffix tree, hybrid indexes, or
    String B-tree.

5
Word-based Inverted files (or lists)
  • Query answering is a two-phase process
Example query: midnight AND time
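The two-phase query can be sketched as follows. This is a minimal illustration, assuming the two phases are a vocabulary lookup followed by posting-list intersection; the function names and the three sample documents are hypothetical, not taken from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def and_query(index, word_a, word_b):
    """Phase 1: look up both words in the vocabulary.
    Phase 2: intersect the two sorted posting lists with a linear merge."""
    list_a = index.get(word_a, [])
    list_b = index.get(word_b, [])
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical document collection (assumption for illustration only).
docs = ["it was midnight", "time flies", "at midnight the time stopped"]
index = build_inverted_index(docs)
print(and_query(index, "midnight", "time"))  # [2]
```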
6
Full-text indexes
  • Their need is pervasive
  • Raw data: DNA sequences, audio-video files, ...
  • Linguistic texts: data mining, statistics, ...
  • Vocabulary for inverted lists
  • XPath queries on XML documents
  • Intrusion detection, anti-virus software, ...
  • Classes of indexes
  • Suffix array, suffix tree (and variants)
  • Multi-level indexes: Short Pat array
  • B-tree based data structures: Prefix B-tree,
    String B-tree

7
Terminology
  • An alphabet, denoted Σ, is a set of (ordered)
    characters.
  • A string S is an array of characters: S[1,n] =
    S[1] S[2] ... S[n].
  • S[i,j] = S[i] ... S[j] is a substring of S.
  • S[1,j] is a prefix of S; S[i,n] is a suffix of S.
  • Σ* denotes the set of all strings over the alphabet Σ.
  • Lexicographic order
  • Example: for Σ = {a, b, c, ..., z}, where a < b <
    c < ... < z, the lexicographic order is the same as
    in a dictionary.

SUF(T) = the sorted set of suffixes of T
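The terminology above maps directly onto string slicing. The sketch below uses a hypothetical example string ("banana", not from the slides) and relies on Python comparing strings by code point, which coincides with dictionary order for lowercase letters.

```python
def suffixes(t):
    """All suffixes T[i,n] of t, each paired with its 1-based starting position i."""
    return [(t[i:], i + 1) for i in range(len(t))]

def suf(t):
    """SUF(T): the suffixes of t in lexicographic order."""
    return sorted(suffixes(t))

t = "banana"
print(t[0:3])                  # prefix T[1,3]: ban
print(t[3:])                   # suffix T[4,6]: ana
print([s for s, _ in suf(t)])  # ['a', 'ana', 'anana', 'banana', 'na', 'nana']
```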
8
Indexed string matching problem
  • Let T be a set of K strings in Σ*, where N is the
    total length of all strings in T.
  • String matching query on T: given a pattern P,
    find all occurrences of P in the strings in T.
  • Static problem: store T in a data structure that
    supports string matching queries. Such a data
    structure is called a full-text index.
  • Dynamic version: also supports insertions and
    deletions of strings in the full-text index.

9
A simple but crucial observation
  • Pattern P[1,p] occurs at position i of T[1,n]
  • iff P[1,p] is a prefix of the suffix T[i,n]

Occurrences of P in T = all suffixes of T having
P as a prefix
The string matching problem can therefore be transformed into a
prefix matching problem over all the suffixes.
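The observation can be checked directly, without any index, by testing every suffix for the prefix P. This brute-force sketch (the function name is an assumption) only illustrates the equivalence; it is the scanning approach that the index is meant to replace.

```python
def occurrences(t, p):
    """1-based positions i such that p is a prefix of the suffix T[i,n]."""
    return [i + 1 for i in range(len(t)) if t.startswith(p, i)]

print(occurrences("mississippi", "si"))   # [4, 7]
print(occurrences("mississippi", "ssi"))  # [3, 6]
```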
10
Suffix Array [Manber-Myers, 90]
  • Suffix array: an array of pointers to all the
    suffixes in the text, in their lexicographic order.

T = mississippi
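A minimal sketch of the definition, built by plain sorting rather than by the Manber-Myers construction. It assumes the text is terminated by an end marker '#' that is smaller than every letter, which is what the 12-entry suffix array shown on the later slides suggests.

```python
def suffix_array(t):
    """1-based starting positions of the suffixes of t, in lexicographic order.

    Naive construction: O(N log N) comparisons, each of which may scan O(N) characters.
    """
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

T = "mississippi#"  # '#' is an assumed end marker, smaller than every letter
print(suffix_array(T))  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```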
11
Two key properties [Manber-Myers, 90]
  • Prop 1. All suffixes in SUF(T) having prefix P
    are contiguous.
  • Prop 2. Their starting position in SUF(T) is the
    lexicographic position of P.

T = mississippi
P = si
12
Searching in Suffix Array [Manber-Myers, 90]
  • Indirect binary search on SA: O(p log_2 N) time

T = mississippi
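The indirect binary search can be sketched as below: each step follows the pointer SA[mid] into the text and compares at most p characters, which gives the O(p log_2 N) bound. The helper names and the '#' end marker are assumptions.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def lower_bound(t, sa, p):
    """0-based index in sa of the first suffix that is >= p (the lexicographic position of p)."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        # Comparing only the first len(p) characters of the suffix is enough
        # to decide on which side of mid the pattern falls.
        if t[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo

T = "mississippi#"
SA = suffix_array(T)
k = lower_bound(T, SA, "si")
print(k, SA[k])  # 8 7
```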
14
Listing the occurrences [Manber-Myers, 90]
  • Brute-force comparison: O(p × occ) time

[Figure: T = mississippi with positions 4, 6 and 7 marked; SA = 12 11 8 5 2 1 10 9 7 4 6 3]
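A sketch of the brute-force listing: starting from the lexicographic position of P, each following suffix is compared against P until one fails, so each reported occurrence costs up to p character comparisons. Helper names and the '#' end marker are assumptions.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def lower_bound(t, sa, p):
    """0-based index in sa of the first suffix that is >= p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid] - 1:sa[mid] - 1 + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo

def list_occurrences(t, sa, p):
    """Scan SA from the first candidate, re-comparing p against every suffix."""
    k = lower_bound(t, sa, p)
    found = []
    while k < len(sa) and t.startswith(p, sa[k] - 1):
        found.append(sa[k])
        k += 1
    return found

T = "mississippi#"
SA = suffix_array(T)
print(list_occurrences(T, SA, "si"))         # [7, 4]
print(sorted(list_occurrences(T, SA, "i")))  # [2, 5, 8, 11]
```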
15
Output-sensitive retrieval
[Figure: T = mississippi with positions 4, 6 and 7 marked]
LCP array of adjacent suffixes in SA: 0 1 1 4 0 0 1 0 2 1 3
Base B: tricky !!
Incremental search: compare against P
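A sketch of output-sensitive listing, under the assumption that the numbers on the slide are the longest-common-prefix (LCP) lengths of suffixes adjacent in SA. Once the first occurrence is known, the following ones are reported by looking at LCP values only, without touching the text again. The LCP array is computed naively here, and the '#' end marker is an assumption; note that lcp(i#, ippi#) = 1.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def lcp_array(t, sa):
    """lcp[k] = length of the longest common prefix of the suffixes sa[k] and sa[k+1]."""
    def lcp(a, b):
        n = 0
        while a + n < len(t) and b + n < len(t) and t[a + n] == t[b + n]:
            n += 1
        return n
    return [lcp(sa[k] - 1, sa[k + 1] - 1) for k in range(len(sa) - 1)]

def list_from(sa, lcp, first, p_len):
    """Report all occurrences, given the SA index of the first one.

    The next suffix also starts with P exactly when it shares at least
    p_len characters with the current one.
    """
    found = [sa[first]]
    k = first
    while k < len(lcp) and lcp[k] >= p_len:
        k += 1
        found.append(sa[k])
    return found

T = "mississippi#"
SA = suffix_array(T)
LCP = lcp_array(T, SA)
print(LCP)                       # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(list_from(SA, LCP, 8, 2))  # P = "si" first appears at SA index 8: [7, 4]
```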
16
Incremental search (case 1)
  • Incremental search using the LCP array: no
    rescanning of pattern chars

[Figure: suffix array SA with positions i and j marked]
17
Incremental search (case 2)
  • Incremental search using the LCP array: no
    rescanning of pattern chars

[Figure: suffix array SA with positions i and j marked]
18
Incremental search (case 3)
  • Incremental search using the LCP array: no
    rescanning of pattern chars

[Figure: suffix array SA with positions i, q and j marked]
Base B: more tricky (we will not consider this)
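The idea of not rescanning pattern characters can be illustrated with a simplified variant. This is not the LCP-array algorithm of the slides: it only remembers how many characters of P matched at the two current boundaries (l and r) and starts every comparison at min(l, r). It is a sketch under that assumption and does not reach the worst-case bound of the full method.

```python
def suffix_array(t):
    """Naive suffix array with 1-based positions."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])

def lower_bound_skip(t, sa, p):
    """0-based index in sa of the first suffix >= p, skipping characters known to match.

    Every suffix strictly between the two boundaries shares at least
    min(l, r) leading characters with p, so those are never compared again.
    """
    lo, hi = -1, len(sa)  # virtual boundaries just outside the array
    l = r = 0             # chars of p matched by the suffixes at lo and at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        s = sa[mid] - 1
        k = min(l, r)
        while k < len(p) and s + k < len(t) and t[s + k] == p[k]:
            k += 1
        if k < len(p) and (s + k == len(t) or t[s + k] < p[k]):
            lo, l = mid, k  # this suffix is smaller than p
        else:
            hi, r = mid, k  # this suffix is >= p
    return hi

T = "mississippi#"
SA = suffix_array(T)
print(lower_bound_skip(T, SA, "si"))  # 8
```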
19
Summary: suffix array
  • [Manber-Myers, 90]
  • Space: O(N)
  • String matching queries: O(p + log N + occ)
  • Can be constructed in O(N log N) time.
  • Static.

20
Hybrid Index: Short Pat array
  • Exploit internal memory: sample the suffix array
    and copy something in memory

[Figure: the suffix array on disk]
  • Parameter s depends on M and influences both
    performance and space !!
  • (only a heuristic)

21
Tries
  • Trie (name from the word "retrieval"): a data
    structure for storing a set of strings
  • Let's assume that all strings end with "$" (not
    in Σ)

Set of strings: {bear, bid, bulk, bull, sun,
sunday}
22
Tries
  • Properties of a trie
  • A multi-way tree.
  • Each node has from 1 to |Σ|+1 children.
  • Each edge of the tree is labeled with a
    character.
  • Each leaf node corresponds to a stored string,
    which is the concatenation of the characters on the path
    from the root to this node.
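A minimal trie sketch over the example set. The end marker "$" is an assumption (a character outside the alphabet), and the class and function names are illustrative only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # one outgoing edge per character

def trie_insert(root, s):
    """Insert s followed by the end marker, creating nodes as needed: O(len(s))."""
    node = root
    for ch in s + "$":
        node = node.children.setdefault(ch, TrieNode())

def trie_contains(root, s):
    """Follow the path spelled by s and its end marker: O(len(s))."""
    node = root
    for ch in s + "$":
        node = node.children.get(ch)
        if node is None:
            return False
    return True

root = TrieNode()
for word in ["bear", "bid", "bulk", "bull", "sun", "sunday"]:
    trie_insert(root, word)
print(trie_contains(root, "bull"), trie_contains(root, "bu"))  # True False
```

Thanks to the end marker, "sun" and "sunday" both end at leaves even though one is a prefix of the other.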

23
Analysis of the Trie
  • Given k strings of total length N
  • Size
  • O(N) in the worst case
  • Search, insertion, and deletion (string of length
    m)
  • O(m) (assuming |Σ| is constant)
  • Observation
  • Having chains of one-child nodes is wasteful

24
Compact Tries
  • Compact Trie
  • Replace a chain of one-child nodes with an edge
    labeled with a string
  • Each non-leaf node (except root) has at least two
    children

25
Compact Tries
  • Implementation
  • Strings are external to the structure in one
    array, edges are labeled with indices in the
    array (from, to)
  • Improves the space to O(k)
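A sketch of a compact trie whose edges store only (from, to) indices into one external array. It assumes the strings are distinct and terminated by "$"; the builder below groups strings recursively and is meant as an illustration, not as an efficient construction. Indices are 0-based and half-open, unlike the 1-based notation of the slides.

```python
def build_compact_trie(text, items, depth=0):
    """Compact trie over distinct '$'-terminated strings stored in text.

    items: (start, end) slices of text whose strings agree on their first depth chars.
    Returns {first_char: (frm, to, child)}: the edge label is text[frm:to],
    and child is None for a leaf.
    """
    groups = {}
    for start, end in items:
        groups.setdefault(text[start + depth], []).append((start, end))
    node = {}
    for ch, group in groups.items():
        start0, end0 = group[0]
        if len(group) == 1:
            length = end0 - start0 - depth  # the rest of the string
        else:
            length = 1                      # extend the edge while all strings agree
            while len({text[s + depth + length] for s, _ in group}) == 1:
                length += 1
        child = build_compact_trie(text, group, depth + length) if len(group) > 1 else None
        node[ch] = (start0 + depth, start0 + depth + length, child)
    return node

def contains(text, root, s):
    """Match s plus its end marker edge by edge against the external array."""
    s += "$"
    node, i = root, 0
    while i < len(s):
        if node is None or s[i] not in node:
            return False
        frm, to, child = node[s[i]]
        if s[i:i + to - frm] != text[frm:to]:
            return False
        i += to - frm
        node = child
    return node is None

words = ["bear", "bid", "bulk", "bull", "sun", "sunday"]
text = "".join(w + "$" for w in words)
items, pos = [], 0
for w in words:
    items.append((pos, pos + len(w) + 1))
    pos += len(w) + 1
root = build_compact_trie(text, items)
print(contains(text, root, "bull"), contains(text, root, "bu"))  # True False
```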

26
Patricia Tries
  • Patricia trie
  • a compact trie where each edge's label (from, to)
    is replaced by (T[from], to - from + 1)

27
Suffix Trees [McCreight, 76]
  • Suffix tree: a compact trie (or similar
    structure) of all suffixes of the text
  • A Patricia trie of suffixes is sometimes called a
    Pat tree

[Figure: example suffix tree over text positions 1 2 3 4 5 6 7 8]
28
Search in suffix trees

P = ba
T = abababbc
  • Search is a path traversal, plus O(occ) time to
    list the occurrences

[Figure: suffix tree of T with edges labeled a, b and c; text positions 1 3 5 7 9 marked]
  • What about ST in external memory ?
  • Unbalanced tree topology
  • Updates
  • Large space: 15N
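Search as a path traversal can be sketched on a naive, uncompacted suffix trie. This is an assumption made for brevity: it takes O(n^2) space and is not McCreight's linear-time suffix tree, and collecting the leaves of an uncompacted trie costs time proportional to the subtree size rather than O(occ).

```python
def build_suffix_trie(t):
    """Insert every suffix character by character; the node where a suffix ends
    records its 1-based starting position under the key "leaf"."""
    root = {}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1
    return root

def occurrences(root, p):
    node = root
    for ch in p:  # path traversal: one step per pattern character
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:  # every suffix ending below the reached node is an occurrence
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

print(occurrences(build_suffix_trie("abababbc"), "ba"))  # [2, 4]
```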
29
Summary: compact trie
  • K strings of total length N
  • Space: O(K)
  • String matching queries: O(p + occ) time
  • Can be constructed in O(N) time.
  • Update(S): O(|S|) time

30
Summary: suffix tree
  • [McCreight, 76]
  • For a string of length n
  • Space: O(n)
  • String matching queries: O(p + occ)
  • Can be constructed in O(n) time.
  • Static.

31
The String B-tree (an I/O-efficient full-text
index !!) [Ferragina-Grossi, 95]
32
The prologue
  • We are left with many open issues
  • Suffix array: updates
  • Hybrid: heuristic tuning of the performance
  • Suffix tree: difficult packing and Ω(p) I/Os
  • B-tree is ubiquitous in large-scale applications
  • Atomic keys: integers, reals, ...
  • Prefix B-tree: bounded-length keys (≤ 255 chars)

Suffix trees + B-trees ?
33
Some considerations
  • Strings have arbitrary length
  • A disk page cannot be guaranteed to store Θ(B)
    strings
  • M may be unable to store even a single string
  • String storage
  • Pointers allow Θ(B) strings to fit per disk page
  • String comparison needs disk access and may be
    expensive
  • String pointer organizations seen so far
  • Suffix array: simple but static and not optimal
  • Patricia trie: sophisticated and much more efficient
    (optimal ?)
  • Recall the problem: T is a string collection
  • Search(P[1,p]): retrieve all occurrences of P
    in T's strings
  • Update(S[1,t]): insert or delete a text S from T

34
1st step: B-tree on string pointers
P = AT
35
2nd step: the Patricia trie
[Figure: a Patricia trie stored on disk, with labels such as (1 1,3) and (4 1,4)]
36
2nd step: the Patricia trie
  • Two-phase search: P = GCACGCAC
  • Second phase: O(p/B) I/Os
  • Just one string is checked !!

[Figure: Patricia trie on disk, with edges labeled A, C and G]
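The "just one string is checked" idea can be sketched as a blind search for prefix queries: the descent looks only at branching characters and skip values, and a single stored string is then compared against P. The node layout, function names and DNA strings below are assumptions for illustration; the sketch handles prefix search only and assumes sorted, distinct, "$"-terminated strings.

```python
def build_patricia(strings, lo, hi, depth=0):
    """Patricia trie over sorted, distinct, '$'-terminated strings[lo:hi].

    A leaf is the index of a string. An internal node is (skip, children):
    skip is the number of characters shared by all strings below the node,
    and children maps each branching character to a subtree.
    """
    if hi - lo == 1:
        return lo
    first, last = strings[lo], strings[hi - 1]
    skip = depth
    while first[skip] == last[skip]:
        skip += 1
    children, start = {}, lo
    for k in range(lo + 1, hi + 1):
        if k == hi or strings[k][skip] != strings[start][skip]:
            children[strings[start][skip]] = build_patricia(strings, start, k, skip + 1)
            start = k
    return (skip, children)

def leaf_range(node):
    """Half-open range of string indices stored below node."""
    lo = hi = node
    while not isinstance(lo, int):
        lo = lo[1][min(lo[1])]
    while not isinstance(hi, int):
        hi = hi[1][max(hi[1])]
    return lo, hi + 1

def prefix_search(strings, root, p):
    """Phase 1: blind descent, no string is read. Phase 2: verify one string."""
    node = root
    while not isinstance(node, int):
        skip, children = node
        if skip >= len(p) or p[skip] not in children:
            break
        node = children[p[skip]]
    lo, hi = leaf_range(node)
    if strings[lo].startswith(p):  # the only string comparison
        return strings[lo:hi]
    return []

# Hypothetical strings (assumption for illustration only).
strings = sorted(s + "$" for s in ["GCACGCAC", "GCACGT", "AGAAGA", "AGAAGG", "GCGCAGA"])
root = build_patricia(strings, 0, len(strings))
print(prefix_search(strings, root, "GCAC"))  # ['GCACGCAC$', 'GCACGT$']
print(prefix_search(strings, root, "GCAT"))  # []
```

For "GCAT" the blind descent reaches the same node as for "GCAC", and only the final comparison against one string reveals the mismatch.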
37
3rd step: B-tree + Patricia trie
P = AT
[Figure: String B-tree nodes holding the string pointers 29 13 20 18 3 23]
38
4th step: incremental search
First case
39
4th step: incremental search
Second case: no rescanning
40
Summary: String B-tree
  • String B-tree performance
    [Ferragina-Grossi, 95]
  • Search(P) takes O(p/B + log_B N + occ/B) I/Os
  • Update(S) takes O(s log_B N) I/Os
  • Space is Θ(N/B) disk pages
  • Using the String B-tree in internal memory
  • Search(P) takes O(p + log_2 N + occ) time
  • Update(S) takes O(s log_2 N) time
  • Space is Θ(N) bytes
  • It is a sort of dynamic suffix array
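To get a feel for the external-memory bounds, the formulas can be evaluated for a hypothetical workload. The sketch takes the constant hidden by the O() notation as 1, and the values of N, B, p, occ and s are assumptions, so the results are orders of magnitude rather than predictions.

```python
import math

def string_btree_search_ios(p, n, occ, b):
    """p/B + log_B N + occ/B, with the hidden constant taken as 1."""
    return p / b + math.log(n) / math.log(b) + occ / b

def string_btree_update_ios(s, n, b):
    """s * log_B N, with the hidden constant taken as 1."""
    return s * math.log(n) / math.log(b)

# Hypothetical workload: 10^9 indexed characters, pages holding 1000 items.
N, B = 10**9, 1000
print(round(string_btree_search_ios(p=20, n=N, occ=100, b=B), 2))  # 3.12
print(round(string_btree_update_ios(s=50, n=N, b=B)))              # 150
```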

41
Summary
  • Indexed string matching problem
  • Word-based
  • Full-text
  • Internal memory data structures
  • Suffix array
  • Suffix tree
  • External memory data structures
  • Patricia trie and Pat tree
  • Short Pat array
  • String B-tree