Title: Suffix Trees and Suffix Arrays
1Suffix Trees and Suffix Arrays
- Modern Information Retrieval
- by R. Baeza-Yates and B. Ribeiro-Neto
- Addison-Wesley, 1999.
- (Chapter 8)
2Introduction
- Word-based indexing
- Inverted indices are good for searching for words
- Queries such as phrases are expensive to solve using inverted files
- For word-based applications, inverted files perform better
- Suffix trees and suffix arrays
- complex queries
3Text Suffixes
This is a text. A text has many words. Words are made from letters.
- text. A text has many words. Words are made from letters.
- text has many words. Words are made from letters.
- many words. Words are made from letters.
- words. Words are made from letters.
- Words are made from letters.
- made from letters.
- letters.
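The suffix list above can be reproduced with a short sketch. It assumes that index points sit at the start of every word except a small set of function words; the STOPWORDS set and the function names are illustrative choices, not taken from the slides.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."

# Assumption: these function words are not used as index points.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}

def index_points(text, stopwords=STOPWORDS):
    """Return the 0-based offsets at which an indexed word starts."""
    return [m.start() for m in re.finditer(r"[A-Za-z]+", text)
            if m.group().lower() not in stopwords]

def suffixes(text, points):
    """Return the text suffix that starts at each index point."""
    return [text[p:] for p in points]

points = index_points(TEXT)
text_suffixes = suffixes(TEXT, points)
for p, s in zip(points, text_suffixes):
    print(p + 1, s)  # 1-based position, then the suffix
```

Each printed line is one suffix of the text, so an index over n index points stores n positions rather than n copies of the text.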
4The Suffix Trie and Suffix Tree
5PAT Trees and PAT Arrays
- Information Retrieval: Data Structures and Algorithms
- by W.B. Frakes and R. Baeza-Yates (Eds.)
- Englewood Cliffs, NJ: Prentice Hall, 1992.
- (Chapter 5)
6PAT Trees and PAT Arrays
- Problems of traditional IR models
- Documents and words are assumed.
- Keywords must be extracted from the text (indexing).
- Queries are restricted to keywords.
- New indices for text
- A text is regarded as a long string.
- Each position corresponds to a semi-infinite string (sistring).
- No structures and no keywords
7Semi-infinite Strings
- Example
Text: Once upon a time, in a far away land
sistring 1: Once upon a time
sistring 2: nce upon a time
sistring 8: on a time, in a
sistring 11: a time, in a far
sistring 22: a far away land
- Compare sistrings: 22 < 11 < 2 < 8 < 1
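The ordering above can be checked with a minimal sketch. It assumes a case-insensitive lexicographic comparison (with plain ASCII ordering the capital "O" of sistring 1 would sort first); the helper names are illustrative.

```python
TEXT = "Once upon a time, in a far away land"

def sistring(text, pos):
    """Semi-infinite string that starts at the 1-based position pos."""
    return text[pos - 1:]

def sort_key(text, pos):
    # Assumption: sistrings are compared case-insensitively.
    return sistring(text, pos).lower()

order = sorted([1, 2, 8, 11, 22], key=lambda p: sort_key(TEXT, p))
print(order)  # [22, 11, 2, 8, 1]
```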
8PAT Tree
- PAT tree: a Patricia tree constructed over all the possible sistrings of a text
- Patricia tree
- a binary digital tree where the individual bits of the keys are used to decide on the branching
- A zero bit will cause a branch to the left subtree
- A one bit will cause a branch to the right subtree
- each internal node indicates which bit of the query is used for branching
- absolute bit position
- a count of the number of bits to skip
- each external node points to a sistring
- the integer displacement to original text
9Example
Text: 01100100010111
sistring 1: 01100100010111
sistring 2: 1100100010111
sistring 3: 100100010111
sistring 4: 00100010111
sistring 5: 0100010111
sistring 6: 100010111
sistring 7: 00010111
sistring 8: 0010111 ...
[Figure: PAT tree over these sistrings. External node: sistring (integer displacement). Internal node: total displacement of the bit to be inspected, or skip counter and pointer.]
10Example (continued)
Text: 01100100010111 (sistrings 1 to 8 as in the previous example)
[Figure: PAT trees over the sistrings of the text.]
Search: 00101
Note: sistrings 3 and 6 need 4 bits to be told apart.
11Indexing Points
- The above example assumes every position in the text is indexed, i.e., n external nodes, one for each indexed position in the text
- Word and phrase searches: only sistrings that are at the beginning of words are necessary
- Trade-off between size of the index and search requirements
12Prefix searching
- Idea: every subtree of the PAT tree has all the sistrings with a given prefix.
- Search time is proportional to the query length: exhaust the prefix or go down to an external node.
[Figure: search for the prefix 10100 and its answer.]
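A PAT tree and a prefix search over it can be sketched as follows. This is an illustrative construction, not the one from the book: it builds the tree recursively over the first eight sistrings of the bit string used in the earlier example, stores the absolute bit position in each internal node, and assumes the text is padded with 0 bits past its end. All function names are hypothetical.

```python
TEXT = "01100100010111"

def bit(text, pos, d):
    """d-th bit (1-based) of the sistring starting at 1-based position pos.
    Assumption: the text is padded with '0' past its end."""
    i = pos - 1 + d - 1
    return text[i] if i < len(text) else "0"

def build(text, positions, d=1):
    """Return a leaf (int position) or an internal node (bit_index, left, right)."""
    if len(positions) == 1:
        return positions[0]
    while d <= len(text):
        zeros = [p for p in positions if bit(text, p, d) == "0"]
        ones = [p for p in positions if bit(text, p, d) == "1"]
        if zeros and ones:
            return (d, build(text, zeros, d + 1), build(text, ones, d + 1))
        d += 1  # every key agrees on this bit, so it is skipped
    raise ValueError("sistrings cannot be told apart")

def leaves(node):
    """External nodes of a subtree, left to right."""
    if isinstance(node, int):
        return [node]
    return leaves(node[1]) + leaves(node[2])

def prefix_search(text, root, query):
    """Positions of all indexed sistrings that start with query."""
    node = root
    # Descend while the node inspects a bit that the query still has.
    while not isinstance(node, int) and node[0] <= len(query):
        node = node[1] if query[node[0] - 1] == "0" else node[2]
    candidates = leaves(node)
    # Skipped bits were never inspected, so verify one candidate against the text.
    p = candidates[0]
    ok = all(bit(text, p, d) == query[d - 1] for d in range(1, len(query) + 1))
    return candidates if ok else []

root = build(TEXT, list(range(1, 9)))
print(leaves(root))                        # [7, 4, 8, 5, 1, 6, 3, 2]
print(prefix_search(TEXT, root, "00101"))  # [8]
print(prefix_search(TEXT, root, "100"))    # [6, 3]
```

The descent inspects at most one bit per level and stops once the query is exhausted, so the work before collecting the answer subtree is bounded by the query length. One comparison against the text is enough for verification because all sistrings in the reached subtree share the prefix.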
13Proximity Searching
- Find all places where s1 is at most a fixed (user-given) number of characters away from s2.
- Example: in 4 ation -> insulation, international, information
- Algorithm
1. Search for s1 and s2.
2. Select the smaller answer set from these two sets and sort it by position.
3. Traverse the unsorted answer set, searching for every position in the sorted set and checking whether the distance between positions satisfies the proximity condition.
- Sort + traverse time: m1 log m1 + m2 log m1 (assume m1 < m2)
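The three steps can be sketched as below. Two assumptions are made for illustration: the answer sets come from a plain substring scan (a stand-in for the PAT tree search), and distance is measured between the starting positions of the two matches. The function names and the sample text are hypothetical.

```python
from bisect import bisect_left, bisect_right

def find_all(text, pattern):
    """0-based start offsets of every occurrence (stand-in for a PAT tree search)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

def proximity(text, s1, s2, k):
    """Pairs (pos_s1, pos_s2) whose start positions are at most k apart."""
    a, b = find_all(text, s1), find_all(text, s2)   # step 1
    swapped = len(a) > len(b)
    small, large = (b, a) if swapped else (a, b)
    small.sort()                                    # step 2: m1 log m1
    pairs = []
    for p in large:                                 # step 3: m2 binary searches
        lo = bisect_left(small, p - k)
        hi = bisect_right(small, p + k)
        for q in small[lo:hi]:
            pairs.append((p, q) if swapped else (q, p))
    return sorted(pairs)

print(proximity("insulation and information", "in", "ation", 6))
```

Sorting only the smaller set keeps the sort cost at m1 log m1, and each of the m2 positions in the larger set costs one binary search of log m1.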
14Range Searching
- Search for all the strings within a certain lexicographical range.
- Example: the range abc .. acc
- abracadabra, acacia: inside the range
- abacus, acrimonious: outside the range
- Algorithm
- Search each end of the defining interval.
- Collect all the subtrees between (and including) them.
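As a rough illustration, the same range query can be run over a sorted list of strings, which stands in here for the left-to-right order of the external nodes. The function name is hypothetical, and the truncated key list is built in linear time only to keep the sketch short.

```python
from bisect import bisect_left, bisect_right

def range_search(sorted_strings, low, high):
    """Strings s with low <= s, including every string that starts with high."""
    start = bisect_left(sorted_strings, low)
    # Truncating to len(high) keeps the order and makes strings that
    # begin with `high` compare equal to it, so they fall inside the range.
    keys = [s[:len(high)] for s in sorted_strings]
    end = bisect_right(keys, high)
    return sorted_strings[start:end]

words = sorted(["abacus", "abracadabra", "acacia", "acrimonious"])
print(range_search(words, "abc", "acc"))  # ['abracadabra', 'acacia']
```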
15Longest Repetition Searching
- the match between two different positions of a text where this match is the longest in the entire text
- e.g., 0 1 1 0 0 1 0 0 0 1 0 1 1 1
- the tallest internal node gives a pair of sistrings that match for the greatest number of characters
Text: 01100100010111
sistring 1: 01100100010111
sistring 2: 1100100010111
sistring 3: 100100010111
sistring 4: 00100010111
sistring 5: 0100010111
sistring 6: 100010111
sistring 7: 00010111
sistring 8: 0010111
[Figure: PAT tree over these eight sistrings.]
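The longest repetition can also be found without the tree, by sorting the sistrings and comparing neighbours, since the two sistrings with the longest common prefix are adjacent in sorted order. This is a swapped-in technique for illustration, not the tree traversal described above; the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_repetition(text):
    """Return (length, pos_a, pos_b) with 1-based positions, pos_a < pos_b."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = (0, None, None)
    for i, j in zip(order, order[1:]):
        n = common_prefix_len(text[i:], text[j:])
        if n > best[0]:
            best = (n, *sorted((i + 1, j + 1)))
    return best

print(longest_repetition("01100100010111"))  # (4, 4, 8)
```

On the example bit string the result is the pair of sistrings 4 and 8, which share the four bits 0010.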
16Most Significant or Most Frequent Matching
- The most frequently occurring strings within the text database
- e.g., the most frequent trigram
- Find the most frequent trigram
- find the largest subtree at a distance of 3 characters from the root
[Figure: PAT tree over the eight sistrings of the text 01100100010111.]
i.e., bits 1, 2, 3 are the same for sistrings 100100010111 and 100010111
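A minimal sketch of the same idea over sorted sistrings: all sistrings that share their first n characters are contiguous in sorted order, so the largest such run corresponds to the largest subtree at distance n from the root. The function name and the sample string "banana" are illustrative only; ties are resolved in favour of the first run in sorted order.

```python
from itertools import groupby

def most_frequent_ngram(text, n=3):
    """Return (ngram, count) for the most frequent n-gram in text."""
    # Only positions with at least n characters left can start an n-gram.
    starts = sorted(range(len(text) - n + 1), key=lambda i: text[i:])
    runs = ((gram, len(list(group)))
            for gram, group in groupby(starts, key=lambda i: text[i:i + n]))
    return max(runs, key=lambda item: item[1])

print(most_frequent_ngram("banana"))  # ('ana', 2)
```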
17Building PAT Trees as Patricia Trees (1)
- Bucketing of external nodes
- collect more than one external node
- a bucket replaces any subtree with size less than a certain threshold (b): saves a significant number of internal nodes
- the external nodes inside a bucket do not have any structure associated with them: increases the number of comparisons for each search
18Building PAT Trees as Patricia Trees (2)
- Mapping the tree onto the disk using super-nodes
- Advantage: saves disk accesses and space
- Every disk page has a single entry point, contains as much of the tree as possible, and terminates either in external nodes or in pointers to other disk pages
- The pointers in internal nodes will address either a disk page or another node inside the same page
- reduces the storage cost of internal nodes
- Example
- Assume a disk page contains on the order of 1,000 internal/external nodes
- on average, each disk page contains about 10 steps of a root-to-leaf path (log2 1,000 is about 10)
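The arithmetic behind the example can be sketched as a back-of-the-envelope estimate. It assumes a roughly balanced tree, so a root-to-leaf path has about log2(n) steps; the function name and the sample sizes are illustrative.

```python
import math

def page_accesses(n_sistrings, nodes_per_page=1000):
    """Rough number of disk pages touched on one root-to-leaf path.
    Assumption: the tree is roughly balanced."""
    path_length = math.ceil(math.log2(n_sistrings))
    steps_per_page = round(math.log2(nodes_per_page))  # about 10 for 1,000 nodes
    return math.ceil(path_length / steps_per_page)

print(page_accesses(10**6))  # 2
print(page_accesses(10**9))  # 3
```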
19PAT Trees Represented as Arrays
- External node bucket size, b
- If we keep the external nodes in the bucket in the same relative order as they would be in the tree, the bucket can be searched with an indirect binary search instead of a sequential search
PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
[Figure: the PAT tree whose external nodes, read left to right, form the PAT array.]
20Searching PAT Trees as Arrays
- Prefix searching and range searching: do an indirect binary search over the array, with the results of the comparisons being less than, equal, and greater than.
- Example: search for the prefix 100 and its answer
- Most frequent, longest repetition
- Manber and Baeza-Yates (1991)
PAT array: 7 4 8 5 1 6 3 2
Text: 0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
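The PAT array and the indirect binary search can be sketched as follows. The array stores only positions; every comparison goes through the array into the text, which is what makes the search indirect. The function names are illustrative, and only the first eight sistrings are indexed, as in the example.

```python
TEXT = "01100100010111"

def build_pat_array(text, positions):
    """Sort 1-based sistring positions by the sistrings they point to."""
    return sorted(positions, key=lambda p: text[p - 1:])

def prefix_search(text, pat_array, prefix):
    """Positions of all indexed sistrings that start with prefix."""
    def key(k):
        # First len(prefix) characters of the k-th sistring in the array.
        p = pat_array[k]
        return text[p - 1:p - 1 + len(prefix)]

    lo, hi = 0, len(pat_array)
    while lo < hi:                 # first entry whose key is >= prefix
        mid = (lo + hi) // 2
        if key(mid) < prefix:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pat_array)
    while lo < hi:                 # first entry whose key is > prefix
        mid = (lo + hi) // 2
        if key(mid) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return pat_array[first:lo]

pat_array = build_pat_array(TEXT, range(1, 9))
print(pat_array)                              # [7, 4, 8, 5, 1, 6, 3, 2]
print(prefix_search(TEXT, pat_array, "100"))  # [6, 3]
```

The two binary searches find the ends of the run of matching entries, so a prefix or range query costs a logarithmic number of text comparisons plus the size of the answer.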
21Comparisons
- Signature files
- Use hashing techniques to produce an index
- Advantage
- storage overhead is small (10%-20%)
- Disadvantages
- the search time on the index is linear
- some answers may not match the query, thus
filtering must be done
22Comparisons (Continued)
- Inverted files
- storage overhead (30%-100%)
- search time for word searches is logarithmic
- PAT arrays
- potential use in other kinds of searches
- phrases
- regular expression searching
- approximate string searching
- longest repetitions
- most frequent searching