Title: Counting Suffix Arrays and Strings
Suffix Array Data Structure
- Suffix array: the lexicographically sorted list of all suffixes
- Example (start position - suffix, in sorted order):
  13 - (empty suffix)
  12 - C
  10 - CTC
  5 - CTCTTCTC
  7 - CTTCTC
  2 - CTTCTCTTCTC
  11 - TC
  9 - TCTC
  4 - TCTCTTCTC
  6 - TCTTCTC
  1 - TCTTCTCTTCTC
  8 - TTCTC
  3 - TTCTCTTCTC
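The example can be reproduced with a minimal sketch that sorts the suffixes directly. The function name `naive_suffix_array` is an assumption made for illustration, the text TCTTCTCTTCTC is read off from the suffix at position 1, and the empty suffix (position 13 in the example) is left out.

```python
def naive_suffix_array(text):
    """Return the 1-based start positions of all non-empty suffixes of
    `text`, ordered lexicographically by suffix."""
    n = len(text)
    # Sorting the suffixes themselves costs O(n^2 log n) in the worst case;
    # that is acceptable for a small illustration.
    return sorted(range(1, n + 1), key=lambda pos: text[pos - 1:])


if __name__ == "__main__":
    example = "TCTTCTCTTCTC"
    for pos in naive_suffix_array(example):
        print(pos, "-", example[pos - 1:])
```

Quadratic sorting is chosen here only because it mirrors the definition; it is not meant as an efficient construction.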
Overview
- Classify strings sharing the same suffix array
- Counting strings sharing the same suffix array
- Counting suffix arrays: lower bound for suffix array compression
- Summation identities
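1. Classify Strings for Suffix Array
- t: string of length n,
- P: permutation of {1, ..., n},
- R: inverse of P.
- Theorem
  - P is the suffix array of t if and only if,
  - for all $i \in \{1, \ldots, n-1\}$,
  - $t_{P[i]} \le t_{P[i+1]}$ and
  - $t_{P[i]} = t_{P[i+1]} \Rightarrow R[P[i]+1] < R[P[i+1]+1]$,
  - where the second condition is the same as
  - $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t_{P[i]} < t_{P[i+1]}$.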
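The characterization can be checked mechanically on small inputs. The following is a sketch under one assumption that the extracted text does not state: the rank of the empty suffix is taken as R[n+1] = 0, so that the suffix starting at position n compares correctly. The function names are hypothetical.

```python
from itertools import permutations, product


def naive_suffix_array(text):
    """Reference implementation: sort the suffixes directly (1-based)."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda pos: text[pos - 1:])


def satisfies_theorem(text, P):
    """Check the two conditions of the theorem for a 1-based permutation P."""
    n = len(text)
    R = [0] * (n + 2)  # R[n + 1] stays 0: assumed rank of the empty suffix
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    for i in range(n - 1):
        a, b = P[i], P[i + 1]
        ca, cb = text[a - 1], text[b - 1]
        if ca > cb:  # violates t[P[i]] <= t[P[i+1]]
            return False
        if ca == cb and R[a + 1] >= R[b + 1]:  # violates the rank condition
            return False
    return True


def theorem_agrees_with_sorting(max_len=4, alphabet="CT"):
    """Exhaustively compare the conditions with direct sorting."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            text = "".join(letters)
            expected = naive_suffix_array(text)
            for P in permutations(range(1, n + 1)):
                if satisfies_theorem(text, list(P)) != (list(P) == expected):
                    return False
    return True
```

The exhaustive comparison covers only short strings over a two-letter alphabet; it illustrates the theorem and is no substitute for its proof.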
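- Conditions
  - a) $t_{P[i]} \le t_{P[i+1]}$ and
  - b) $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t_{P[i]} < t_{P[i+1]}$
- R-descent: a position i with $R[P[i]+1] > R[P[i+1]+1]$; by b), the text must strictly increase there.
- [Figure: text to be indexed, with its R-descents marked]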
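Counting R-descents needs only the permutation, not the text. A minimal sketch, assuming that an R-descent is a position i with R[P[i]+1] > R[P[i+1]+1] and that R[n+1] = 0 for the empty suffix (both are readings of condition b), and `count_r_descents` is a hypothetical name):

```python
def count_r_descents(P):
    """Count positions i with R[P[i]+1] > R[P[i+1]+1] for a 1-based
    permutation P, using R[n+1] = 0 (an assumption, see the lead-in)."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    # Rank of the suffix that follows each listed suffix by one position.
    shifted = [R[pos + 1] for pos in P]
    return sum(1 for i in range(n - 1) if shifted[i] > shifted[i + 1])
```

For the suffix array of TCTTCTCTTCTC this gives one R-descent, which matches the fact that the text uses two distinct characters.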
- Equivalences between strings
- [Figure: text to be indexed, compared with order-equivalent and order-distinct strings]
2. Counting Strings for Suffix Array
- [Figure: text to be indexed, its base string, and non-decreasing sequences]
- Suffix array P of length n with d R-descents.
- Number of strings over an alphabet of size a for P: $\binom{n+a-d-1}{n}$,
- which is the number of non-decreasing sequences of length n over a-d elements.
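The count of non-decreasing sequences of length n over a-d elements is the binomial coefficient C(n+a-d-1, n). The sketch below compares that closed form with brute-force enumeration on small cases; all function names are hypothetical, and R[n+1] = 0 is assumed for the empty suffix.

```python
from itertools import permutations, product
from math import comb


def naive_suffix_array(text):
    n = len(text)
    return sorted(range(1, n + 1), key=lambda pos: text[pos - 1:])


def count_r_descents(P):
    n = len(P)
    R = [0] * (n + 2)  # R[n + 1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    shifted = [R[pos + 1] for pos in P]
    return sum(1 for i in range(n - 1) if shifted[i] > shifted[i + 1])


def strings_for_suffix_array(n, a, d):
    """Closed form: non-decreasing sequences of length n over a-d elements."""
    return comb(n + a - d - 1, n)


def brute_force_count(P, a):
    """Enumerate all strings over {0, ..., a-1} and keep those indexed by P."""
    n = len(P)
    return sum(1 for letters in product(range(a), repeat=n)
               if naive_suffix_array(letters) == list(P))


def formula_matches_brute_force(n, max_a):
    return all(
        strings_for_suffix_array(n, a, count_r_descents(P)) == brute_force_count(P, a)
        for P in permutations(range(1, n + 1))
        for a in range(1, max_a + 1)
    )
```

When a is at most d, the binomial coefficient is zero, which agrees with the observation that d R-descents force at least d+1 distinct characters.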
- Suffix array P of length n with d R-descents.
- Number of strings composed of exactly k distinct characters for P is $\binom{n-d-1}{k-d-1}$.
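One way to see this count: the d R-descents already force d+1 distinct characters, and each of the remaining k-d-1 character changes must be placed at one of the n-d-1 positions that are not R-descents, giving C(n-d-1, k-d-1). The sketch below checks that reading by brute force. It assumes that "exactly k distinct characters" means an alphabet of size k in which every character occurs; the function names are hypothetical.

```python
from itertools import permutations, product
from math import comb


def naive_suffix_array(text):
    n = len(text)
    return sorted(range(1, n + 1), key=lambda pos: text[pos - 1:])


def count_r_descents(P):
    n = len(P)
    R = [0] * (n + 2)  # R[n + 1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    shifted = [R[pos + 1] for pos in P]
    return sum(1 for i in range(n - 1) if shifted[i] > shifted[i + 1])


def strings_with_exactly_k_chars(n, k, d):
    """Closed form C(n-d-1, k-d-1); zero when fewer than d+1 characters."""
    if k < d + 1:
        return 0
    return comb(n - d - 1, k - d - 1)


def brute_force_exactly_k(P, k):
    """Count strings over {0, ..., k-1} that use all k characters and have P."""
    n = len(P)
    return sum(1 for letters in product(range(k), repeat=n)
               if len(set(letters)) == k
               and naive_suffix_array(letters) == list(P))


def exact_formula_matches(n):
    return all(
        strings_with_exactly_k_chars(n, k, count_r_descents(P)) == brute_force_exactly_k(P, k)
        for P in permutations(range(1, n + 1))
        for k in range(1, n + 1)
    )
```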
- [Plot: number of strings over an alphabet of size 20 for suffix arrays of length n with 10 R-descents]
- Suffix array P of length n with d R-descents
- Number of order-distinct strings over an alphabet of size a: [formula missing]
- Number of order-distinct strings where all k distinct characters must appear: [formula missing]
3. Counting Suffix Arrays
- Definition
  - Let P be a permutation of {1, ..., n}.
  - Position $i \in \{1, \ldots, n-1\}$ is a permutation descent
  - if $P[i] > P[i+1]$.
- Definition
  - The Eulerian number $\left\langle {n \atop d} \right\rangle$ gives the number of
  - permutations of {1, ..., n} with exactly d
  - permutation descents.
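- Well-known fact
  - Recursive enumeration of Eulerian numbers:
  - $\left\langle {n \atop 0} \right\rangle = 1$,
  - $\left\langle {n \atop d} \right\rangle = 0$ for $n \le d$, and
  - $\left\langle {n \atop d} \right\rangle = (d+1) \left\langle {n-1 \atop d} \right\rangle + (n-d) \left\langle {n-1 \atop d-1} \right\rangle$.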
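The recursive enumeration translates directly into a memoized function. This is a minimal sketch; the name `eulerian` and the helper that counts permutation descents by brute force are introduced here for illustration only.

```python
from functools import lru_cache
from itertools import permutations


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number via the standard recursion, for n >= 1."""
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def permutations_with_descents(n, d):
    """Brute force: permutations of 1..n with exactly d positions
    i where P[i] > P[i+1]."""
    return sum(
        1 for P in permutations(range(1, n + 1))
        if sum(1 for i in range(n - 1) if P[i] > P[i + 1]) == d
    )
```

For n = 4 the recursion yields the row 1, 11, 11, 1, which sums to 4! = 24.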
- Definition
  - Let A(n,d) be the number of permutations of length n with d R-descents.
- Observation
  - A(n,0) = 1
  - A(n,d) = 0 for n ≤ d
  - For the remaining cases, see the construction that follows.
- [Figure: text to be indexed; (d+1) possible positions without an additional R-descent]
- Together
  - A(n,0) = 1,
  - A(n,d) = 0 for n ≤ d, and
  - A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1).
- Theorem
  - The number A(n,d) of permutations of length n
  - with d R-descents is the Eulerian number $\left\langle {n \atop d} \right\rangle$.
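The theorem can be illustrated by tabulating R-descents over all permutations of a small length and comparing the result with the recursion. The sketch below assumes R[n+1] = 0 for the empty suffix and uses hypothetical function names.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations


def count_r_descents(P):
    n = len(P)
    R = [0] * (n + 2)  # R[n + 1] = 0 is an assumed convention
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    shifted = [R[pos + 1] for pos in P]
    return sum(1 for i in range(n - 1) if shifted[i] > shifted[i + 1])


@lru_cache(maxsize=None)
def eulerian(n, d):
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def r_descent_distribution(n):
    """Return [A(n,0), ..., A(n,n-1)] by enumerating all permutations."""
    counts = Counter(count_r_descents(P) for P in permutations(range(1, n + 1)))
    return [counts[d] for d in range(n)]
```

Enumeration is factorial in n, so this is only practical for very short permutations.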
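- The number of distinct suffix arrays of length n for strings over an alphabet of size a is $\sum_{d=0}^{a-1} \left\langle {n \atop d} \right\rangle$, since a suffix array with d R-descents needs at least d+1 distinct characters.
- Lower bound for the compressibility of suffix arrays in the Kolmogorov sense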
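Since a suffix array with d R-descents requires at least d+1 distinct characters, the suffix arrays realizable over an alphabet of size a are those with at most a-1 R-descents. The sketch below sums the corresponding Eulerian numbers and compares the total with brute-force enumeration. The bit count is a plain counting bound (any encoding that distinguishes all these arrays needs that many bits for at least one of them) and is offered as one reading of the compression bound. All names are hypothetical.

```python
from functools import lru_cache
from itertools import product


def naive_suffix_array(text):
    n = len(text)
    return sorted(range(1, n + 1), key=lambda pos: text[pos - 1:])


@lru_cache(maxsize=None)
def eulerian(n, d):
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def count_suffix_arrays(n, a):
    """Sum of Eulerian numbers for d = 0, ..., a-1."""
    return sum(eulerian(n, d) for d in range(a))


def brute_force_suffix_arrays(n, a):
    """Distinct suffix arrays over all strings of length n, alphabet size a."""
    return len({tuple(naive_suffix_array(letters))
                for letters in product(range(a), repeat=n)})


def lower_bound_bits(n, a):
    """Smallest b with 2**b >= number of distinct suffix arrays."""
    return (count_suffix_arrays(n, a) - 1).bit_length()
```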
- [Plot: number of distinct suffix arrays of length n for strings over an alphabet of size 20]
- [Plot: number of distinct suffix arrays of length n for strings over an alphabet of size 4]
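4. Summation Identities
- Worpitzky's identity, by summing up the number of strings of length n for each suffix array: $a^n = \sum_{d=0}^{n-1} \left\langle {n \atop d} \right\rangle \binom{n+a-d-1}{n}$
- Summation rule for Eulerian numbers to generate the Stirling numbers of the second kind: $k! \left\{ {n \atop k} \right\} = \sum_{d=0}^{n-1} \left\langle {n \atop d} \right\rangle \binom{n-d-1}{k-d-1}$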
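Both identities follow from counting all strings of length n in two ways: directly, and grouped by suffix array. A numerical check is sketched below, assuming the per-suffix-array counts C(n+a-d-1, n) for an alphabet of size a and C(n-d-1, k-d-1) for exactly k distinct characters; the function names are hypothetical.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def eulerian(n, d):
    if d < 0:
        return 0
    if d == 0:
        return 1
    if n <= d:
        return 0
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via S(n,k) = k S(n-1,k) + S(n-1,k-1)."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def worpitzky_sum(n, a):
    """Strings of length n over a characters, grouped by R-descent count d."""
    return sum(eulerian(n, d) * comb(n + a - d - 1, n) for d in range(n))


def surjection_sum(n, k):
    """Strings of length n using all of k characters, grouped by d."""
    return sum(eulerian(n, d) * comb(n - d - 1, k - d - 1)
               for d in range(min(k, n)))
```

The first sum should equal a**n and the second k! times the Stirling number, the number of strings of length n in which all k characters appear.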
Summary
- Constructive proofs to count strings sharing the same suffix array
- Constructive proof to count distinct suffix arrays, yielding a lower bound for suffix array compression
- Constructive proofs for Worpitzky's identity and the summation rule of Eulerian numbers to count Stirling numbers of the second kind
Outlook
- Efficient enumeration algorithm for suffix arrays
- Compressed suffix arrays for fast querying in bioinformatics applications
- Average case analysis under a non-uniform model
Thank you for your attention!