Title: Class 4: Sequence Alignment II Gaps, Heuristic Search
1Class 4 Sequence Alignment IIGaps, Heuristic
Search
2Alignment with Gaps Example
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
3Gaps
- Both alignments have the same number of matches
and spaces but alignment 2 seems better - Definition A gap is any maximal, consecutive run
of spaces in a single string. - The length of a gap the number of spaces in it
- Example 1 has 11 gaps, example 2 only 2 gaps
- Idea develop alignment scores that take gaps
(not spaces) into account
4Biological Motivation
- Number of mutational events
- A single gap due to a single event that removed
a number of residues - Each gap due to distinct, independent event
- Protein structure
- Protein secondary structure consists of alpha
helices, beta sheets and loops - Loops of varying size can lead to very similar
structure
5Biological Motivation
6cDNA Mataching
- cDNA is the sequence after splicing (introns have
been removed) and editing - We expect regions of high similarity, separated
by long gaps
7Gap Penalty Models (I)
- Constant model
- Gives each gap a constant score, spaces are free
- Maximize
- Time O(mn)
- Works well with cDNA matching
- Affine model
- Penalty for starting a gap penalty for each
additional space - Each gap costs Wg qWs
- Maximize
- Time O(mn)
- Widely used
8Gap Penalty Models (II)
- Convex model
- Each extra space contributes less penalty
- Gap function is convex in its length
- Example Ws log(q)
- Time O(mnlogm)
- A better model of biology
- General model
- The weight of a gap is some arbitrary w(q)
- Time O(mn2 nm2)
9Example Revised
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
10Indel Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -6
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score -6
Scoring Parameters Match 1 Indel -2
11Constant Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -6
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score 12
Scoring Parameters Match 1 Open gap -2
12Affine Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -17
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score 1
Scoring Parameters Match 1 Open gap -2, each
space -1
13Convex Model
1
A A C A A T T A A G A C T A C G T T C A T G A C
A C G A T T A G C A C A C T G T A G A
Score -6
2
A A C A A T T A A G A C T A C G T T C A T G A C
A A C A A T T G T T C A T G A C G C A
Score 7
Scoring Parameters Match 1 Open gap -2, gap
length -logn
14Affine Weight Model
- We divide all possible alignments of the prefixes
s1..i and t1..j into 3 types - s i
- t j
- s i-----
- t j
- s i
- t j-----
15Affine Weight Model
Recurrence relations
16Affine Weight Model
Initial condition Optimal alignment Complex
ity Time O(mn) Space O(mn)
17Affine Weight Model
- This model has a natural explanation as a finite
state automata
18Alignment in Real Life
- One of the major uses of alignments is to find
sequences in a database - Such collections contain massive number of
sequences (order of 106) - Finding homologies in these databases with the
standard dynamic programming can take too long - Example
- query protein 232 AAs
- NR protein DB 2.7 million sequences 748
million AAs - mn 1.7 1011 cells !
19Heuristic Search
- Instead, most searches rely on heuristic
procedures - These are not guaranteed to find the best match
- Sometimes, they will completely miss a
high-scoring match - We now describe the main ideas used by some of
these procedures - Actual implementations often contain additional
tricks and hacks
20Basic Intuition
- The main resource consuming factor in the
standard DP is decision of where the gaps are. If
there were no gaps, life was easy! -
- Almost all heuristic search procedures are based
on the observation that real-life well-matching
pairs of sequences often do contain long strings
with gap-less matches. - These heuristics try to find significant local
gap-less matches and then extend them.
21Banded DP
- Suppose that we have two strings s1..n and
t1..m such that n?m - If the optimal global alignment of s and t has
few gaps, then path of the alignment will be
close to the diagonal
s
t
22Banded DP
- To find such a path, it suffices to search in a
diagonal region of the matrix - If the diagonal band has presumed width a, then
the dynamic programming step takes O(an) - Much faster than O(n2) of standard DP in this case
s
a
t
23Banded DP
- Problem (for local alignment)
- If we know that ti..j matches the query
sp..q, then we can use banded DP to evaluate
quality of the match - However, we do not know i,j,p,q !
- How do we select which sub-sequences to align
using banded DP?
24FASTA Overview
- Main idea
- Find (fast!) good diagonals and extend them to
complete matches - Suppose that we have a relatively long gap-less
local match (diagonal) - AGCGCCATGGATTGAGCGA
- TGCGACATTGATCGACCTA
- Can we find clues that will let us find it
quickly?
25Signature of a Match
- Assumption good matches contain several
patches of perfect matches - AGCGCCATGGATTGAGCGA
- TGCGACATTGATCGACCTA
26FASTA
- Given s and t, and a parameter k
- Find all pairs (i,j) such that si..ik and
tj..jk match perfectly - Locate sets of pairs that are on the same
diagonal by sorting according to i-j thus - Locating diagonals that contain
- many close pairs.
- This is faster than O(nm) !
s
i ik
j
jk
t
27FASTA
- Extend the best diagonal matches to imperfect
(yet ungapped) matches, compute alignment scores
per diagonal. Pick the best-scoring matches. - Try to combine close diagonals to potential
gapped matches, picking the best-scoring matches. - Finally, run banded DP on the regions containing
these matches, resulting in several good
candidate alignments. - Most applications of FASTA use very small k (2
for proteins, and 4-6 for DNA)
28BLAST Overview
- FASTA drawback is its reliance on perfect matches
- BLAST (Basic Local Alignment Search Tool) uses
similar intuition, but relies on high scoring
matches rather than exact matches - Given parameters length k, and threshold T
- Two strings s and t of length k are a high
scoring pair (HSP) if d(s,t) gt T
29High-Scoring Pair
- Given a query string s, BLAST construct all words
w (neighborhood words), such that w is an HSP
with a k-substring of s. - Note not all k-mers have an HSP in s
30BLAST phase 1
- Phase 1 compile a list of word pairs (k3) above
threshold T - Example for the following query
- FSGTWYA (query word is in green)
- A list of words (k3) is
- FSG SGT GTW TWY WYA
- YSG TGT ATW SWY WFA
- FTG SVT GSW TWF WYS
31BLAST phase 1
scores
GTW 6,5,11 22 neighborhood ASW 6,1,11
18 word hits ATW 0,5,11 16 gt threshold NTW
0,5,11 16 GTY 6,5,2 13 GNW 10 neighborh
ood GAW 9 word hits below threshold
(T11)
32BLAST phase 2
- Search the database for perfect matches with
neighborhood words. Those are hits for further
alignment. - We can locate seed words in a large database in a
single pass, given the database is properly
preprocessed (using hashing techniques).
33Extending Potential Matches
- Once a hit is found, BLAST attempts to find a
local alignment that extends it. - Seeds on the same diagonal tend to be combined
(as in FASTA)
34Two HSP diagonal
- An improvement look for 2 HSPs on close
diagonals - Extend the alignment between them
- Fewer extensions considered
- There is a version of BLAST,
- involving gapped
- extensions.
- Generally faster then FASTA,
- arguably better.
s
t
35Blast Variants
- blastn (nucleotide BLAST)
- blastp (protein BLAST)
- tblastn (protein query, translated DB BLAST)
- blastx (translated query, protein DB BLAST)
- tblastx (translated query, translated DB BLAST)
- bl2seq (pairwise alignment)
36Biological Databases
- Today, most of the biological information can be
freely accessed on the web. - One can
- Search for information on a known gene
- Check if a sequence exists in a database
- Find a homologous protein, helping us guess
- Structure
- Function
37Databases and Tool
- Important gateways
- National Center for Biotechnology (GenBank)
- http//www.ncbi.nlm.nih.gov/
- European Bioinformatics Institue (EMBL-Bank)
- http//www.ebi.ac.uk/
- Expert Protein Analysis System (SwissProt)
- http//www.expasy.org/
- ? Different tools and DBs to allow biologists a
rich suite of queries
38Database Types
- Nucleotide DBs (GenBank, EMBL-Bank)
- Contain any and every type of DNA fragment
- Full cDNA, ESTs, repeats, fragments
- Dirty and redundant
- Protein DBs (SwissProt)
- Contain amino-acid sequences for full proteins
- High quality, strict screening process
- Lots of annotated information on each protein