Title: Search algorithms
Search algorithms
Search algorithms: problem definition
(text, pattern)
The approach depends on what we know at the outset:
- Only the text --> Structure the text (suffix tree)
- Only the pattern(s) --> Structure the pattern(s)
  - 1 pattern --> The algorithm depends on the length and ?
  - k patterns --> The algorithm depends on the number k, the length and ?
It depends on the length of the pattern
2.2 Pairwise alignment
Given two DNA sequences $A = (a_1 a_2 \dots a_n)$ and $B = (b_1 b_2 \dots b_m)$ over the alphabet $\{a, c, t, g\}$, we say that $A'$ and $B'$ over $\{a, c, t, g, -\}$ are aligned iff
- $A'$ and $B'$ become $A$ and $B$ if the gaps ($-$) are removed.
- $|A'| = |B'|$
- For all $i$, it is not possible that $a'_i = b'_i = -$
MALIG (an example)
How many alignments of two sequences exist?
Which is the best alignment?
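The three conditions of the definition can be checked mechanically. The following is a minimal Python sketch; the function name `is_alignment` and its parameters are illustrative assumptions, not something taken from the slides.

```python
def is_alignment(a, b, a_aligned, b_aligned, gap="-"):
    """Check whether (a_aligned, b_aligned) is an alignment of (a, b)."""
    # Condition 1: removing the gaps gives back the original sequences.
    restores = (a_aligned.replace(gap, "") == a
                and b_aligned.replace(gap, "") == b)
    # Condition 2: both gapped sequences have the same length.
    same_length = len(a_aligned) == len(b_aligned)
    # Condition 3: no column pairs a gap with a gap.
    no_double_gap = all(not (x == gap and y == gap)
                        for x, y in zip(a_aligned, b_aligned))
    return restores and same_length and no_double_gap


print(is_alignment("ac", "ctac", "--ac", "ctac"))
print(is_alignment("ac", "ctac", "-ac-", "c-tac"))
```

The first call describes a valid alignment; the second fails because the two gapped rows differ in length.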
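2.2 Number of alignments
Given two DNA sequences $A = (a_1 a_2 \dots a_n)$ and $B = (b_1 b_2 \dots b_m)$, the number of alignments, written $\#(A, B)$, satisfies
$$\#(a_1 \dots a_n,\ b_1 \dots b_m) = \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) + \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) + \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1})$$
where the three terms count the alignments that end with $(a_n, -)$, with $(-, b_m)$ and with $(a_n, b_m)$, respectively.
For single letters, $\#(a_1, b_1) = 3$.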
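2.2 Number of alignments
Given two DNA sequences $A = (a_1 a_2 \dots a_n)$ and $B = (b_1 b_2 \dots b_m)$, then
$$\#(a_1 \dots a_n,\ b_1 \dots b_m) = \#(a_1 \dots a_{n-1},\ b_1 \dots b_m) + \#(a_1 \dots a_n,\ b_1 \dots b_{m-1}) + \#(a_1 \dots a_{n-1},\ b_1 \dots b_{m-1})$$
counting, in this order, those that end with $(a_n, -)$, those that end with $(-, b_m)$ and those that end with $(a_n, b_m)$.
Filling the table of counts cell by cell:

        -    b1   b2   b3
   -    1    1    1    1
   a1   1    3    5    7
   a2   1    5
   a3   1    7

But what is the asymptotic value?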
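The recurrence translates directly into a short program. The sketch below is only an illustration: the name `count_alignments` is an assumption, and the base case (a sequence against the empty sequence has exactly one alignment) is taken from the row and column of 1s in the table.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of alignments of a sequence of length n with one of length m."""
    if n == 0 or m == 0:
        return 1  # the remaining letters can only be paired with gaps
    return (count_alignments(n - 1, m)          # ends with (a_n, -)
            + count_alignments(n, m - 1)        # ends with (-, b_m)
            + count_alignments(n - 1, m - 1))   # ends with (a_n, b_m)


# Print the 4 x 4 corner of the table, one row per line.
for i in range(4):
    print([count_alignments(i, j) for j in range(4)])
```

The second printed row is `[1, 3, 5, 7]`, which matches the entries recovered in the table.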
2.2 Asymptotic value
As
$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n)$
and
$n! \approx n^n e^{-n}$ (Stirling approximation)
then
$\#(a_1 a_2 \dots a_n,\ b_1 b_2 \dots b_n) > 2^{2n}$
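The bound can be checked numerically for small n. The sketch below fills the table of counts row by row for two sequences of the same length and compares the result with 2^{2n}; the function name `count_square` is an assumption. The comparison turns out to hold from n = 4 onwards, while for n = 1, 2, 3 the count is still slightly below 2^{2n}.

```python
def count_square(n):
    """Number of alignments of two sequences of length n (bottom-up table)."""
    row = [1] * (n + 1)              # row 0: only one alignment, all gaps
    for _ in range(n):
        new = [1]                    # column 0: only one alignment, all gaps
        for j in range(1, n + 1):
            # cell above + cell to the left + diagonal cell
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]


for n in range(1, 9):
    print(n, count_square(n), 2 ** (2 * n), count_square(n) > 2 ** (2 * n))
```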
2.2 Best alignment
How can an alignment be scored?
catcactactgacgactatcgtagcgcggctatacatctacgccaa-ctac-t-gtgtagatcgccgg
c-tgactgc--acgactatcgt-attgcggctacacactacgcacaactactgtatgtcgc-cgg----
Then we assign a score to each case (match, mismatch, gap), for example 1, -1, -2.
How can the best alignment be found?
2.2 Edit distance and alignment of strings
The best alignment of two strings is related to the edit distance, first discussed in 1966...
The most efficient algorithm was proposed in 1968 and in 1970, using the technique called dynamic programming.
2.2 Best alignment
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T G A]
The cell in row AC and column CTACT contains the score of the best alignment of AC and CTACT.
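2.2 Best alignment
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T G A. The first row is initialised to 0, -2, -4, -6, -8, ... and the first column to 0, -2, -4, -6, ...]
BA(AC, CTAC), the best alignment of AC and CTAC, has score
$$s(AC, CTAC) = \max \{\, s(AC, CTA) - 2,\ s(A, CTA) + 1,\ s(A, CTAC) - 2 \,\}$$
where -2 is the gap score and +1 is the score of matching the final letters C and C.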
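The recurrence for a single cell extends to the whole matrix. The following sketch fills the table for the example scores 1, -1, -2; this dynamic-programming scheme is commonly known as the Needleman-Wunsch algorithm. The function name `global_score_table` and the constant names are assumptions made for illustration.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # example scores used in the slides


def global_score_table(a, b):
    """Return table with table[i][j] = best score aligning a[:i] and b[:j]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        table[i][0] = i * GAP          # a[:i] against the empty sequence
    for j in range(1, len(b) + 1):
        table[0][j] = j * GAP          # the empty sequence against b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            table[i][j] = max(table[i - 1][j - 1] + pair,  # (a_i, b_j)
                              table[i - 1][j] + GAP,       # (a_i, -)
                              table[i][j - 1] + GAP)       # (-, b_j)
    return table


table = global_score_table("ACTGA", "CTACTACTACGT")
print(table[0][:5])   # first row of the matrix
print(table[2][4])    # s(AC, CTAC)
```

Rows are indexed by the prefix of the first sequence and columns by the prefix of the second, so `table[2][4]` is the cell discussed above.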
Best alignment
Given the maximum score, how can the best alignment be found?
- Quadratic cost in space and time
- Sequences of up to 10,000 bp in length
Download the alggen tool
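One standard way to recover an alignment from the score table is to walk back from the last cell, at each step choosing a neighbour whose score explains the current one. The sketch below is an illustration under the example scores 1, -1, -2; the function name `best_global_alignment` is an assumption, and when several alignments are optimal it returns only one of them (preferring the diagonal move).

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # example scores used in the slides


def best_global_alignment(a, b):
    """Return (score, gapped a, gapped b) for one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * GAP
    for j in range(1, m + 1):
        s[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            s[i][j] = max(s[i - 1][j - 1] + pair,
                          s[i - 1][j] + GAP,
                          s[i][j - 1] + GAP)
    # Traceback from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            if s[i][j] == s[i - 1][j - 1] + pair:      # column (a_i, b_j)
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and s[i][j] == s[i - 1][j] + GAP:     # column (a_i, -)
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:                                          # column (-, b_j)
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = best_global_alignment("AC", "CTAC")
print(score)
print(top)
print(bottom)
```

The table needs (n + 1)(m + 1) cells and each cell is filled in constant time, which is the quadratic cost in space and time mentioned above.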
2.2 Some slides revisited
We have developed the theory according to the following principles:
1) Both sequences have a similar length (global alignment).
2) The gap model is linear: if there are k consecutive gaps, the penalty is k(-2).
2.2 Semiglobal pairwise alignment
Assume that we have two sequences, S1 and S2, of different lengths.
It is meaningless to introduce gaps until both sequences have a similar length.
The most probable alignment should have initial gaps and final gaps:
[Figure: alignment of S1 and S2 with the initial gaps and the final gaps marked]
How can these alignments be found?
2.2 Semiglobal pairwise alignment
[Figure: dynamic-programming matrix with columns C T A C T A C T A C G T and rows A C T, with the initial gaps and the final gaps marked]
2.2 Semiglobal pairwise alignment
Given a cell
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T; every cell of the first row is 0]
The cell in the first row under CTA contains the score of the best alignment of CTA with the empty sequence.
2.2 Semiglobal pairwise alignment
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T; every cell of the first row is 0]
The contribution of the initial gaps is disregarded, so the first row is filled with zeros.
But what happens with the final gaps?
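2.2 Semiglobal pairwise alignment
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T; every cell of the first row is 0; the marked cells in rows A, C and T hold 1, 2 and 3]
How does the algorithm search for the best alignment?
Practice with the alggen tool.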
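A common way to answer both questions is sketched below; it is an illustration, not necessarily the procedure used by the alggen tool. Initial gaps are made free by filling the first row with zeros, and final gaps are made free by taking the best score anywhere in the last row instead of the bottom-right cell. The function name `semiglobal_score`, the choice to keep the usual -2 penalty in the first column, and the tie-breaking rule (leftmost best cell) are assumptions.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # example scores used in the slides


def semiglobal_score(short, long_):
    """Best score of `short` placed anywhere inside `long_`.

    Returns (score, j), where j is the column of the last row in which the
    best alignment ends, i.e. the alignment covers a piece of long_[:j].
    """
    n, m = len(short), len(long_)
    prev = [0] * (m + 1)                  # first row: initial gaps are free
    for i in range(1, n + 1):
        cur = [i * GAP] + [0] * m         # first column keeps the gap penalty
        for j in range(1, m + 1):
            pair = MATCH if short[i - 1] == long_[j - 1] else MISMATCH
            cur[j] = max(prev[j - 1] + pair,
                         prev[j] + GAP,
                         cur[j - 1] + GAP)
        prev = cur
    # Final gaps are free: take the best cell of the last row.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j


print(semiglobal_score("ACT", "CTACTACTACGT"))
```

For the sequences of the slides, the best score is 3: the three letters of ACT match the letters at positions 3 to 5 of the longer sequence, which agrees with the values 1, 2 and 3 along the diagonal.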
2.2 Affine-gap model score
Given the following alignments, which all have the same score:

a g t a c c c c g t a g
a g t - c c - - g t a -

a g t a c c c c g t a g
a g t - c - c - g t a -

a g t a c c c c g t a g
a g t - c - - c g t a -

a g t a c c c c g t a g
a g t - - c c - g t a -

a g t a c c c c g t a g
a g t - - - c c g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

Which is the most reliable case from a biological point of view?
2.2 Affine-gap model score
Then, how can we distinguish between consecutive gaps and separated gaps?

a g t a c c c c g t a g
a g t - - - c c g t a -

a g t a c c c c g t a g
a g t - - c - c g t a -

Then, the penalty of k consecutive gaps becomes $OG + (k-1)\,EG$, where OG is the penalty for opening a gap and EG the penalty for extending it; this is an affine-gap function.
How is the best alignment found?
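To see how the affine-gap function separates the two alignments, the sketch below scores a given alignment column by column: the first gap of a run costs OG and every further gap of the same run costs EG. The function name `affine_score` and the values OG = -4 and EG = -1 are hypothetical, chosen only for illustration; the slides do not fix them.

```python
def affine_score(top, bottom, match=1, mismatch=-1, og=-4, eg=-1):
    """Score an alignment with gap penalty OG + (k - 1) * EG per gap run."""
    score = 0
    prev_gap_row = None            # row holding a gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # Extending a run in the same row costs EG, otherwise OG.
            score += eg if row == prev_gap_row else og
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


consecutive = affine_score("agtaccccgtag", "agt---ccgta-")
separated = affine_score("agtaccccgtag", "agt--c-cgta-")
print(consecutive, separated)
```

With OG = EG = -2 the function reduces to the linear model and both alignments score the same, whereas with the hypothetical affine values the alignment with consecutive gaps scores higher.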
2.2 Affine-gap model score
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T G A, with small and large arrows between cells]
The smallest arrows refer to the introduction of an opening gap. The largest arrows refer to the introduction of an extension gap.
But from which cell do the largest arrows originate?
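One common answer, which the slides leave open, is to store three scores per cell, one for each kind of final column, so that an extension gap always continues from the gap score of the adjacent cell. This three-table formulation is usually attributed to Gotoh. The sketch below is an illustration under assumptions: the function name `affine_global_score` and the penalties OG = -4 and EG = -1 are hypothetical.

```python
NEG = float("-inf")  # marks states that cannot occur


def affine_global_score(a, b, match=1, mismatch=-1, og=-4, eg=-1):
    """Best global score with gap penalty OG + (k - 1) * EG per gap run."""
    n, m = len(a), len(b)
    # M: last column is (a_i, b_j); X: last column is (a_i, -);
    # Y: last column is (-, b_j).
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = og + (i - 1) * eg
    for j in range(1, m + 1):
        Y[0][j] = og + (j - 1) * eg
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1],
                          Y[i - 1][j - 1]) + pair
            # Opening a gap comes from M or from the other gap table;
            # extending a gap comes from the same gap table.
            X[i][j] = max(M[i - 1][j] + og, Y[i - 1][j] + og,
                          X[i - 1][j] + eg)
            Y[i][j] = max(M[i][j - 1] + og, X[i][j - 1] + og,
                          Y[i][j - 1] + eg)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("agtaccccgtag", "agtccgta"))
```

In this formulation the large (extension) arrows into a cell originate from the gap table of the neighbouring cell in the same row or column, while the small (opening) arrows originate from the other two tables.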
2.2 Affine-gap model score
[Dynamic-programming matrix: columns C T A C T A C T A C G T, rows A C T G A]
Access to ClustalW: http://www.ebi.ac.uk/clustalw
2.2 Local alignment
Given two sequences, we can consider the alignments of all their substrings. How can the best of them be found?
Two questions arise:
- How can the alignments be compared?
- How can the best one be selected?
2.2 Local alignment
Given a path
Imagine the graph of the scores along the path: can the best subalignments be detected?
It suffices to compare the value of each cell with zero!
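Comparing each cell with zero means that a partial alignment whose score drops below zero is discarded and a new one may start at that cell. The sketch below applies this idea with the example scores 1, -1, -2; the scheme is commonly known as the Smith-Waterman algorithm. The function name `local_alignment_score` and the choice to report the first cell that reaches the best score are assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best local score, (i, j)) where the best subalignment ends."""
    best, best_cell = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                      # restart: drop what came before
                         prev[j - 1] + pair,
                         prev[j] + gap,
                         cur[j - 1] + gap)
            if cur[j] > best:
                best, best_cell = cur[j], (i, j)
        prev = cur
    return best, best_cell


print(local_alignment_score("ttacgtt", "ggacgcc"))
```

The best subalignment ends at the cell with the highest score in the whole matrix, and it starts where the walk back from that cell first reaches a zero.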