Title: Bioinformatics PhD Course
1Bioinformatics PhD. Course
1. Biological introduction
2. Comparison of short sequences (up to 10,000 bp)
Dot matrix, pairwise alignment, multiple alignment, hash algorithms
3. Comparison of large sequences (more than 10,000 bp)
Data structures, suffix trees, MUMs
4. String matching
Exact, extended, approximate
5. Sequence assembly
6. Projects PROMO, MREPATT,
2Comparison of large sequences
First part Alignment of large sequences
3Dynamic programming
[Figure: dynamic programming alignment of two sequences]
- Quadratic cost of space and time.
- Short sequences (up to 10,000 bp) can be aligned using dynamic programming.
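The quadratic cost can be seen in a minimal sketch of global alignment by dynamic programming. The function name and the scoring scheme (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; they are not taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best global alignment score, number of table cells filled).

    The table has (len(a) + 1) * (len(b) + 1) cells, which is where the
    quadratic cost in time and space comes from.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1], rows * cols


print(global_alignment_score("acgt", "act"))
```

For two sequences of 10,000 bp the table already holds about 10^8 cells, which is why this approach stops being practical for genomes.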
What about genomes?
4Genomic sequences
- Genomic sequences have millions of base pairs.
- They are about 1000 times longer than the short sequences above.
In which cases can dynamic programming be applied?
5First assumption
6Realistic assumption?
Unrealistic assumption!
More realistic assumption
7Realistic assumptions?
Unrealistic assumption!
More realistic assumption
But is this now a real case?
8Preview in a real case
Chlamydia muridarum: 1,084,689 bp
Chlamydia trachomatis: 1,057,413 bp
9Preview in a real case
Pyrococcus abyssi: 1,790,334 bp
Pyrococcus horikoshii: 1,763,341 bp
?
?
10Methodology of an alignment
Identify the portions that can be aligned.
(Linear cost)
11Methodology of an alignment
?
(Linear cost)
12Preview-Revisited
Maximal Unique Matching (MUM)
Connect to MALGEN
13Methodology of an alignment
Linear cost with Suffix trees
How can MUMs be found?
Identify the portions that can be aligned.
How can these portions be determined?
With CLUSTALW, TCOFFEE,
14Comparison of large sequences
M-GCAT Todd Treangen
15Homework
1. Javier
2. Dmitry
3. Ana Iris
4. David
5. Patricia
6. Rogeli
7. Atif
8. Aina
9. Isaac
10. Maria Merce
11. Romina
12. Guillem
13. Raul
14. Alexis
15. Ramon
16Bioinformatics PhD. Course
Second part Introducing Suffix trees
17Suffix trees
Given string ababaas
Suffixes
3 abaas
1 ababaas
4 baas
2 babaas
What kind of queries?
18Applications of Suffix trees
1. Exact string matching
- Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
19Quadratic insertion algorithm
Invariant Properties
Given the string ......
...
P1: the leaves of the suffixes from ? have been inserted.
20Quadratic insertion algorithm
Given the string ababaabbs
21Quadratic insertion algorithm
Given the string ababaabbs
ababaabbs,1
22Quadratic insertion algorithm
Given the string ababaabbs
ababaabbs,1
babaabbs,2
23Quadratic insertion algorithm
Given the string ababaabbs
babaabbs,2
25Quadratic insertion algorithm
Given the string ababaabbs
26Quadratic insertion algorithm
Given the string ababaabbs
ba
baabbs,2
29Quadratic insertion algorithm
Given the string ababaabbs
ba
ba
baabbs,2
30Quadratic insertion algorithm
Given the string ababaabbs
ba
baabbs,2
32Quadratic insertion algorithm
Given the string ababaabbs
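The insertion of one suffix after another can be sketched as follows. This is a simplified illustration, not the slides' exact data structure: it builds an uncompressed suffix trie (one character per edge) rather than a suffix tree with string-labelled edges. It also assumes that the last symbol of the string is a unique terminator, written `$` here. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert the suffixes of text one by one; total work is quadratic."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # follow or create the edge for ch
        node["leaf"] = start + 1             # 1-based suffix number, as on the slides
    return root


def leaf_labels(node):
    """Collect the suffix numbers stored at the leaves below node."""
    labels = []
    for key, child in node.items():
        if key == "leaf":
            labels.append(child)
        else:
            labels.extend(leaf_labels(child))
    return sorted(labels)


trie = build_suffix_trie("ababaabb$")
print(leaf_labels(trie))
```

Because the terminator occurs only once, every suffix ends in its own leaf, which mirrors invariant P1: after step i, the leaves of the first i suffixes are in the structure.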
35Generalized suffix tree
The suffix tree of several strings is called the generalized suffix tree.
It is the suffix tree of the concatenation of the strings, each one closed by its own terminator symbol.
For instance:
36Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
Given the suffix tree of ababaaba
37Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
38Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
ab
a
ba,5
40Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
aaß,1
ab
a
ba,5
b
a
bba,3
a
baabba,1
42Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
aaß,1
ab
a
ba,5
ß,2
b
a
bba,3
a
b
baabba,1
a
a
bba,4
baabba,2
44Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
aaß,1
a
b
a
ba,5
ß,2
b
a
bba,3
a
b
baabba,1
ß,3
a
a
bba,4
baabba,2
46Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
ß,5
ß,4
aaß,1
a
b
a
ba,5
ß,2
b
a
bba,3
a
b
baabba,1
ß,3
a
a
bba,4
baabba,2
48Generalized suffix tree
Construction of the suffix tree of
ababaabbaaabaaß
ß,5
ß,4
aaß,1
ß,6
a
b
a
ba,5
ß,2
b
a
bba,3
a
b
baabba,1
ß,3
a
a
bba,4
baabba,2
49Generalized suffix tree
Generalized suffix tree of ababaabbaaabaaß
50Applications of Suffix trees
1. Exact string matching
- Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
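The query above can be answered by walking from the root along the characters of the pattern: the pattern occurs if and only if the walk never falls off the structure. The sketch below assumes an uncompressed suffix trie in place of a real suffix tree, and its function names are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie: every suffix is spelled out from the root."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern can be spelled from the root, i.e. it is a substring."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
```

Once the structure is built, each query costs time proportional to the pattern length, independent of the length of the sequence.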
51Applications of Suffix trees
2. The substring problem for a database of
strings DB
- Does the DB contain any occurrence of the patterns abab, aab, and ab?
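For a database of strings, the same walk works on a generalized structure in which every node remembers which strings pass through it. The following is a sketch under assumptions: it uses an uncompressed trie, the two database strings are example values, and the names `build_generalized_trie` and `strings_containing` are hypothetical.

```python
def build_generalized_trie(strings):
    """Insert all suffixes of all strings; each node stores the ids of its strings."""
    root = {"ids": set()}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {"ids": set()})
                node["ids"].add(idx)
    return root


def strings_containing(trie, pattern):
    """Return the set of string indices whose text contains pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return set()
        node = node[ch]
    return node["ids"]


db = ["ababaabb", "aabaa"]          # example database (assumption)
trie = build_generalized_trie(db)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(trie, pattern)))
```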
52Applications of Suffix trees
3. The longest common substring of two strings
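In a generalized structure, the longest common substring is the deepest path that is shared by both strings. A minimal sketch, assuming an uncompressed trie and hypothetical names and example strings:

```python
def longest_common_substring(s1, s2):
    """Deepest path in the generalized suffix trie visited by both strings."""
    root = {"ids": set()}
    for idx, s in enumerate((s1, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {"ids": set()})
                node["ids"].add(idx)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.items():
            # only descend into nodes reached by suffixes of both strings
            if ch != "ids" and child["ids"] == {0, 1}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

With a real suffix tree the same idea runs in time linear in the total length; the trie used here is quadratic and only meant to show the principle.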
53Applications of Suffix trees
4. Finding the maximal repeats.
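A substring is a maximal repeat if it occurs at least twice and two of its occurrences differ both in the character to their left and in the character to their right, so that the repeat cannot be extended. The brute-force sketch below only illustrates the definition; it does not use a suffix tree, and the function name is hypothetical.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return the sorted list of maximal repeats of text (brute force)."""
    n = len(text)
    starts_of = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts_of[text[i:j]].append(i)

    result = []
    for word, starts in starts_of.items():
        if len(starts) < 2:
            continue
        # None stands for "no character": the boundary of the text
        lefts = {text[s - 1] if s > 0 else None for s in starts}
        rights = {text[s + len(word)] if s + len(word) < n else None for s in starts}
        if len(lefts) > 1 and len(rights) > 1:
            result.append(word)
    return sorted(result)


print(maximal_repeats("ababaabb"))
```

In a suffix tree the candidates are exactly the internal nodes, which is what makes the linear-time version possible.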
54Applications of Suffix trees
5. Finding MUMs.
55Bioinformatics PhD. Course
Third part Suffix links
56Suffix links
67Traversal using Suffix links
Given S2 a a b a a
69Traversal using Suffix links
Unique matchings
Given S2 a a b a a
aa in S2[1]
70Traversal using Suffix links
Unique matchings
Given S2 a a b a a
aa in S2[1]
aab in S2[1]
S1[5..6-7] in S2[1]
71Traversal using Suffix links
Unique matchings
Given S2 a a b a a
S1[5..6-7] in S2[1]
73Traversal using Suffix links
Unique matchings
Given S2 a a b a a b b a
S1[5..6-7] in S2[1]
S1[3..6-] in S2[2]
77Traversal using Suffix links
Unique matchings
Given S2 a a b a a b b a
S1[5..6-7] in S2[1]
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
78Traversal using Suffix links
Unique matchings
Given S2 a a b a a b b a
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
79From UMs to MUMs
Given S2 a a b a a b b a
and S1 a b a b a a b b a
Unique matchings
S1[3..6-8] in S2[2]
S1[4..6-8] in S2[3]
S1[5..8] in S2[4]
S1[6..8] in S2[5]
S1[7..8] in S2[6]
Array of UMs (start in S1: end in S1)
1: -, 2: -, 3: 6-8, 4: 6-8, 5: 8, 6: 8, 7: 8, 8: -, 9: -
MUM: S1[3..6-8] in S2[2]
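The result of this slide can be cross-checked with a brute-force sketch that tests every pair of positions directly. It does not use suffix links and is far from linear. It assumes that the final symbol of S1 and S2 on the slide is a terminator, so that S1 = ababaabb and S2 = aabaabb. The function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    return sum(text.startswith(pattern, i) for i in range(len(text) - len(pattern) + 1))


def find_mums(s1, s2):
    """Return (start in s1, start in s2, match) with 1-based positions."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # a pair that can be extended to the left is not maximal
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            length = 0
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1
            match = s1[i:i + length]
            # keep the match only if it is unique in both sequences
            if count_occurrences(s1, match) == 1 and count_occurrences(s2, match) == 1:
                mums.append((i + 1, j + 1, match))
    return mums


print(find_mums("ababaabb", "aabaabb"))
```

The single MUM found, abaabb starting at position 3 of S1 and position 2 of S2, agrees with the one reported on the slide.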
80Bioinformatics PhD. Course
Fourth part Linear insertion algorithm
81Quadratic insertion algorithm
Invariant Properties
Given the string ......
...
P1 the leaves of suffixes from ? have been
inserted
82Linear insertion algorithm
Invariant Properties
Given the string ......
P1: the leaves of the suffixes from ? have been inserted.
P2: the string ? is the longest string that can be spelled through the tree.
83Linear insertion algorithm example
Given the string ababaababb...
84Linear insertion algorithm example
Given the string ababaababb...
6 7 8
85Linear insertion algorithm example
?
Given the string ababaababb...
6 7 8
?
86Linear insertion algorithm example
?
Given the string ababaababb...
6 7 89
?
90Linear insertion algorithm example
ababb...,5
ababb...,3
ba
ba
ababb...,4
baababb...,2
91Linear insertion algorithm example
ababb...,5
ababb...,3
ba
ba
ababb...,4
b
aababb...,2
baababb...,2
baababb...,2
92Linear insertion algorithm example
?
Given the string ababaababb...
7 8
?
ababb...,5
ababb...,3
ba
ba
ababb...,4
baababb...,2
93Linear insertion algorithm example
?
Given the string ababaababb...
89
?
ababb...,5
ababb...,3
ba
ba
ababb...,4
97Linear insertion algorithm example
?
Given the string ababaababb...
89
?
ababb...,5
a
b
ba
ababb...,4
b
aababb...,2
b...,7
98Linear insertion algorithm example
?
Given the string ababaababb...
89
?
ababb...,5
a
b
b...,8
ba
ababb...,4
b
aababb...,2
b...,7
99Linear insertion algorithm example
Given the string ababaababb...
9
?
ababb...,5
a
b
b...,8
ba
ababb...,4
b
aababb...,2
b...,7
101Linear insertion algorithm example
Given the string ababaababb...
9
?
ababb...,5
a
b
b...,8
a
b
ababb...,4
b
aababb...,2
b...,7
102Linear insertion algorithm example
Given the string ababaababb...
9
?
ababb...,5
a
b
b...,8
a
b
ababb...,4
b...,9
b
aababb...,2
b...,7
105Linear insertion algorithm example
?
Given the string ababaababb...
9
?
ababb...,5
a
b
b...,8
a
b
ababb...,4
b...,9
b
aababb...,2
b...,7
106Index
Suffix arrays: "Suffix arrays: a new method for on-line string searches", G. Myers, U. Manber
107Suffix arrays
Given string ababaa
Suffixes
1 ababaa
2 babaa
3 abaa
4 baa
5 aa
6 a
but lexicographically sorted
6 a
5 aa
3 abaa
1 ababaa
4 baa
2 babaa
What is the cost?
O(n log(n))
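The sorted order above can be reproduced with a short sketch that sorts the suffix start positions by the suffix they denote. The function name is hypothetical, and positions are 1-based to match the slide.

```python
def suffix_array(text):
    """Return the 1-based start positions of the suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_array("ababaa"))
```

This uses O(n log(n)) suffix comparisons; note that each comparison of two suffixes may itself inspect many characters.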
108Applications of suffix arrays
1. Exact string matching
- Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Binary search
What is the cost?
O(log(n) P)
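The plain binary search can be sketched as follows: each of the O(log(n)) steps compares the pattern with a suffix prefix of length at most P. The sketch uses the string ababaa from the suffix array slide; the function names are hypothetical.

```python
def suffix_array(text):
    """1-based start positions of the suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrence(text, sa, pattern):
    """Return the 1-based start of one occurrence of pattern, or None."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        # compare only the first len(pattern) characters of the suffix
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sa) and text.startswith(pattern, sa[lo] - 1):
        return sa[lo]
    return None


text = "ababaa"
sa = suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, find_occurrence(text, sa, pattern))
```

The position returned is the one of the lexicographically smallest suffix that starts with the pattern, not necessarily the leftmost occurrence in the text.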
Can it be improved to
O(log(n) + P)?
109Fast search with cost O(log(n) + P)
Invariant Properties
110Fast search with cost O(log(n) + P)
Invariant Properties
Algorithm