Title: Sequence Alignment
1Sequence Alignment
2sequence alignment
- how to position two strings, with gaps, so as to
maximize the agreement between
3sequence alignment
CLUSTAL W (1.82) multiple sequence alignment a
THISMAKESSENSE 14 b THISISN---NSENSE 13
. .
CLUSTAL W (1.82) multiple sequence alignment a
THISMAKESSENSE 14 b THISISN---NSENSE 13
. .
4biological sequences
- Are sequences similar
- Infer homology
- Infer function
- (sequence -gt structure -gt function)
5Types of Homology
6pairwise alignments multiple sequence
alignments
7S
T
O
P
S
S
T
O
P
S
8dot plots
- Plot one sequence against another
- Good for finding repeats/inverts
- Can specify word size
- EMBOSS has several programs (available via PISE)
- Dotter
- Dotlet (web-based)
9S
T
O
P
S
S
P
O
T
S
10gt
tc
ct
ta
ga
ag
gt
tg
aa
aa
ac
c
ta
gt
tc
ct
ta
ag
gt
tg
ga
aa
a
11dynamic programming
- developed by Richard Bellman, 1950s
- can be used for pairwise sequence alignment
- ex seq1 TTCATA
- seq2 TGCTCGTA
- try all possible?
12(No Transcript)
13dynamic programming
- good example at
- http//meetings.cshl.org/TGAC/TGAC/flash/DynamicPr
ogramming.swf
14Global vs. Local
SPQ-RTGKCCWIAGPGILHRMSL SGALRCSWND-IAGPCAQH-MSA
Global Needleman-Wunsch start at end and
add gaps until one end is reached
Local Smith-Waterman finds region(s) of
highest similarity and build outward
15evaluating alignments
16scoring aligments
- count all frequencies of aligned pairs in
confirmed alignments develop probabilities - Problems
- 1 - how to get random sample?
- 2 different pairs have diverged different
amounts
17Percent Accepted Mutation
- Dayhoff et al. (1978) -- idea obtain
substitution data from very similar proteins,
extrapolate information to larger evolutionary
distances - PAM1 construct phylogenetic tree from sequences
in 71 families with at least 85 identity - One PAM unit is an average of 1 change in all
amino acid positions - PAM250 (most popular) 250 changes in 100 aa
- Convert mutation probabilities to a scoring
matrix by log odds and scaling
18Percent Accepted Mutation
- PAM Problems
- PAM1 entries from short time interval
substitutions - PAM250 scaled version of PAM1
- -- short time interval substitutions mostly
single base changes in codon triplets - -- long time interval substitutions all types
of codon changes
19BLOck SUbstition Matrices
- Henikoff Henikoff (1992) -- substitution
matrices based on local alignments - all BLOSUM matrices based on observed alignments
- BLOSUM N -- calculated from comparisons of
sequences with no less than N divergence
http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/Sc
oring2.html
20gap penalties
- no standard model
- generally estimated empirically once substitution
scores are chosen - Parameters
- initial and terminal
- gap opening
- gap extending
21alignment methods
- dynamic programming too computationally expensive
for any practical task - show example
- use heuristics
- FASTA - Pearson and Lipman (1988)
- BLAST - Altschul et al. (1990)
22FASTA
- Fast All
- find local high scoring alignments from exact
short word matches
23FASTA Algorithm
- find all identically matching k-mers in two seqs
- (proteins k 1, 2 DNA k
4,6) - -- find diagonals with many mutually supporting
word matches - -- take best diagonals to step 2
- extend exact word matches to find maximal scoring
ungapped regions - check if any ungapped regions can be joined
(allowing for gap costs) - re-align highest scoring candidate matches by
dynamic programming
24FASTA Algorithm
25FASTA Algorithm
26FASTA Incarnations
- FASTA compares sequence to another sequence or
database (protein or DNA) - SSEARCH performs Smith-Waterman alignment
between sequences - FASTX compare DNA to protein
- others ... see http//fasta.bioch.virginia.edu/
27BLASTBasic Local Alignment Search Tool
- find high scoring local alignments between query
sequence and target database - assumption true match alignments very likely to
contain within them very high scoring matches
28BLAST Steps
1. Seeding
- For each word of length W in the query, generate
a list of all possible words (neighborhood) with
a score of at least threshold T (determined by
using the scoring matrix)
29Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Determine the locations of all common words
between the query and the database (word hits).
30Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
Identifies all word hits
31BLAST Steps
2. Extension
- Use dynamic programming to extend hits until the
score drops a value of X expensive!! -- 90 of
time
ABCDEFGHIJKLMNOPQRST
ABCDEFZYIJKLMXWVUTAB 12345654567898765654 -gt
Score 00000012100001234345 -gt Drop off score
Match 1 Mismatch -1 X 5
323. Evaluation
Evaluates the statistical significance of
extended hits and reports only those above the
determined threshold.
33Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
LAL TP G R W P D ER
A Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
Returns highest-scoring segment pairs (HSP)
34BLAST statistical evaluation
- for local, ungapped alignments
- m size of query n size of database
- E expected of HSPs with scores at least S
- p prob of finding at least one HSP with S
- good tutorial at
- http//www.ncbi.nlm.nih.gov/BLAST/tutorial
/Altschul-1.html
35other BLASTs
36Gapped BLAST
- 2-hit model HSP may have multiple hits along
same diagonal - find two non-overlapping hits on same diagonal
within some distance A - T must be lowered to maintain senstivity
- gt more single hits found
- BUT small fraction have associated 2nd hit
- overall - increase in speed
37Gapped BLAST example
- - BLAST
- 15 hits, T gt 13
- 22 hits, T gt 11
- - two-hit with T 1 triggers two extensions
38Gapped BLAST example
- - left pair produces HSP with E value
sufficient to trigger gapped extentsion - E 0.5
- - original BLAST, finds first and last ungapped
segments and assigns combined E-val gt 50 greater
39BLATBLAST Like Alignment Tool
- BLAT
- -- builds index of DB and scans linearly through
query - -- trigger extensions on any number of perfect or
near perfect hits
- BLAST
- builds index of query sequence and scans
linearly through DB - -- trigger extensions for one or two hit models
40BLAT
- Good for aligning mRNA, ESTs to genome
- fast
- aligns whole mRNA, not just exons
- handles introns and splice-sites
41BLAT
- Steps for cDNA alignment
- 1 break cDNA into 500 base chunks
- 2 use index to find regions in genome similar
to each chunk of cDNA - 3 do detailed alignment between genomic regions
and cDNA chunk - 4 dynamic programming - stictch together
detailed alignments of chunks into alignment of
whole
42- genome cacaattatcacgaccgc
- 3-mers cac aat tat cac gac cgc
- 0 3 6 9 12 15
- cDNA aattctcac
- 3-mers aat att ttc tct ctc tca cac
- 0 1 2 3 4 5 6
- hits aat 0,3 -3
- cac 6,0 6
- cac 6,9 -3
- clump cacAATtatCACgaccgc
example from Jim Kent
43PSI-BLASTPosition Specific Iterated BLAST
- database searches using position-specific scoring
matrices more powerful than simply using single
sequence - STEPS
- collect all DB sequences that align with E-val lt
T - align these to make position-specific scoring
matrix - use scoring matrix to search for new hits
- iterate
44PSI Blast
45Pattern-Hit-Initiated BLAST (PHI-BLAST)
- Combines regular expression matching with local
alignments - Finds proteins containing the pattern and
similarity in the region of the pattern - Integrated with PSI-BLAST
- E-values are computed differently
46- multiple sequence alignment
47msa
- msa of PTEN catalytic core
48uses of MSAs
- find functionally important sites
- predict structure
- find weak similarities in databases
- reconstruct evolutionary history
49multiple sequence alignment
- Two separate problems
- 1 generating all possible alignments
- 2 scoring alignments
50scoring alignments
- scoring system should include
- 1 some positions more conserved than others
- 2 sequences related evolutionarily
- ideal define probabilistic model for sequence
evolution - -- not enough information
51scoring alignments
- without phylogenetic information ...
- 1 minimum entropy
- 2 sum of pairs
52generating alignments
- dynamic programming
- progressive alignment
- iterative search
- HMMs
- blocksets
53dynamic programming
54Dynamic Programming for 3 sequences
To see more, check out http//bibiserv.techfak.uni
-bielefeld.de/visualign/
55dynamic programming
- computationally infeasible for more than a few
sequences - Carrillo Lipman not all cells have to be
examined to guarantee optimal alignment - MSA can optimally align 5-7 protein sequences
of length 300
56progessive alignment
- succession of pairwise alignments
- heurisic cannot separate scoring and
optimization - works well for close sequences
- many examples
- -- Feng-Doolittle (1987)
- -- ClustalW
- -- T-coffee
57Feng-Doolittle Progressive Alignment
- 1- do global pairwise alignments for every pair
of sequences - 2 convert alignment scores to distances
- 3 construct guide tree from distance matrix
- 4 progressively align sequences with weights
from guide tree
58Feng-Doolittle Clustering
Similarity matrix (from pairwise alignment)
X1
X2
X3
X4
X5
X1
- 15 11 3 4
- 3 30 5 3 1
-
- 5 25 12 11
- 3 4 12 40 9
- 4 1 11 9 30
X2
X3
X4
X5
X5
X3
X1
X2
X4
X1
X2
X3
X4
X5
59Feng-Doolittle
- At each step, follow the guide tree and consider
all possible pairwise alignments of sequences in
the two candidate groups (3 cases) - Sequence vs. sequence
- Sequence vs. group (the best matching sequence in
the group determines the alignment) - group vs. group (the best matching pair of
sequences determines the alignment) - Once a gap, always a gap
- gap is replaced by a neutral symbol X
- no cost for aligning X with anything
60Generating a Multi-Sequence Alignment
- Align the two most similar sequences
- Following the guide tree, add in the next
sequences, aligning to the existing alignment - Insert gaps as necessary
Sample output FOS_RAT
PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNIS
NMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPE
SEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK
SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPS
G--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----
STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUM
AN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP---
--------------LPFQ . . .
.. . .
Dots and stars show how well-conserved a column
is.
61problems
- all alignments are completely determined by
initial pairwise sequence alignment errors in
intial alignment are propagated - gaps can proliferate
- No backtracking (subalignment is frozen)
- no way to correct an early mistake
- non-optimality Mismatches and gaps at highly
conserved region should be penalized more, but we
cant tell where is a highly conserved region
early in the process
62ClustalW
- essentially doing Feng-Doolittle with some
important heuristics - 1 weighting of sequences
- 2 choice of substitution matrices
- 3 gap penalties
- 4 guide tree adjustment on the fly
63T-coffee
- Tree-based Consistency Objective Function For
alignmEnt Evaluation - -- pre-process data set of all pairwise
alignments between sequences - -- build library
- -- intermediate alignments based on sequences and
library information - -- Generally gives improved results over
ClustalW, for sequences with lt 30 identity, but
is slower
64(No Transcript)
65iterative local search methods
- seek to increase MSA scores by randomly altering
the alignment - usually used to refine alignment
- start with generated msa
- not guaranteed to find optimal alignment
- examples simulated annealing, GA
66- Hidden Markov Models
- the Legos of computational sequence analysis --
Sean Eddy
67HMMs in computational biology
- gene finding
- regulatory site identification
- profile searching
- multiple sequence alignment
- CpG island detection
68Markov Chain
Any DNA sequence can be thought of as a series of
emissions from the following model
69Markov Chain
Any DNA sequence can be thought of as a series of
emissions from the following model
70Transition Probabilities
71HMM for CpG islands
In a hidden Markov Model, the identity of the
emitted symbol does not tell us the state.
Looking at the emission, A is equivalent to A-.
72CpG Island Transition Probabilities
73HMMs non-biological example
0.9
0.95
1 1/10
1 1/6
2 1/10
2 1/6
0.05
3 1/10
3 1/6
4 1/10
4 1/6
5 1/10
0.1
5 1/6
6 1/2
6 1/6
Fair Die
Loaded Die
74Profile HMM
75Profile HMM
76Block Alignments
- typical msa fix a particular sequence as
reference align all others to this - positive simple
- negative - regions conserved in subset but not
in reference sequence not found - - alignments generated with
different reference seqs may be inconsistent
77Block Alignments
- Threaded Blockset Aligner (2004)
- -- produce blocks in which each position in given
sequences appear precicesly once - blocks local alignments of some or all of the
given sequences - -- assumption matching regions occur in the same
order and orientation in all sequences
78TBA
79(No Transcript)