Sequence Alignment

About This Presentation

Title:

Sequence Alignment

Description:

how to position two strings, with gaps, so as to maximize the ... FASTA Incarnations. FASTA compares sequence to another sequence or database (protein or DNA) ... – PowerPoint PPT presentation

Number of Views:58

Avg rating:3.0/5.0

Slides: 79

Provided by: meetin8

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Alignment

1
Sequence Alignment

The Genome Access Course

2
sequence alignment

how to position two strings, with gaps, so as to
maximize the agreement between

3
sequence alignment
CLUSTAL W (1.82) multiple sequence alignment a
THISMAKESSENSE 14 b THISISN---NSENSE 13
. .
CLUSTAL W (1.82) multiple sequence alignment a
THISMAKESSENSE 14 b THISISN---NSENSE 13
. .
4
biological sequences

Are sequences similar
Infer homology
Infer function
(sequence -gt structure -gt function)

5
Types of Homology
6
pairwise alignments multiple sequence
alignments
7
S
T
O
P
S
S
T
O
P
S
8
dot plots

Plot one sequence against another
Good for finding repeats/inverts
Can specify word size
EMBOSS has several programs (available via PISE)
Dotter
Dotlet (web-based)

9
S
T
O
P
S
S
P
O
T
S
10
gt
tc
ct
ta
ga
ag
gt
tg
aa
aa
ac
c
ta
gt
tc
ct
ta
ag
gt
tg
ga
aa
a
11
dynamic programming

developed by Richard Bellman, 1950s
can be used for pairwise sequence alignment
ex seq1 TTCATA
seq2 TGCTCGTA
try all possible?

12
(No Transcript)
13
dynamic programming

good example at
http//meetings.cshl.org/TGAC/TGAC/flash/DynamicPr
ogramming.swf

14
Global vs. Local
SPQ-RTGKCCWIAGPGILHRMSL SGALRCSWND-IAGPCAQH-MSA
Global Needleman-Wunsch start at end and
add gaps until one end is reached
Local Smith-Waterman finds region(s) of
highest similarity and build outward
15
evaluating alignments

scoring matrices

16
scoring aligments

count all frequencies of aligned pairs in
confirmed alignments develop probabilities
Problems
1 - how to get random sample?
2 different pairs have diverged different
amounts

17
Percent Accepted Mutation

Dayhoff et al. (1978) -- idea obtain
substitution data from very similar proteins,
extrapolate information to larger evolutionary
distances
PAM1 construct phylogenetic tree from sequences
in 71 families with at least 85 identity
One PAM unit is an average of 1 change in all
amino acid positions
PAM250 (most popular) 250 changes in 100 aa
Convert mutation probabilities to a scoring
matrix by log odds and scaling

18
Percent Accepted Mutation

PAM Problems
PAM1 entries from short time interval
substitutions
PAM250 scaled version of PAM1
-- short time interval substitutions mostly
single base changes in codon triplets
-- long time interval substitutions all types
of codon changes

19
BLOck SUbstition Matrices

Henikoff Henikoff (1992) -- substitution
matrices based on local alignments
all BLOSUM matrices based on observed alignments
BLOSUM N -- calculated from comparisons of
sequences with no less than N divergence

http//www.ncbi.nlm.nih.gov/Education/BLASTinfo/Sc
oring2.html
20
gap penalties

no standard model
generally estimated empirically once substitution
scores are chosen
Parameters
initial and terminal
gap opening
gap extending

21
alignment methods

dynamic programming too computationally expensive
for any practical task
show example
use heuristics
FASTA - Pearson and Lipman (1988)
BLAST - Altschul et al. (1990)

22
FASTA

Fast All
find local high scoring alignments from exact
short word matches

23
FASTA Algorithm

find all identically matching k-mers in two seqs
(proteins k 1, 2 DNA k
4,6)
-- find diagonals with many mutually supporting
word matches
-- take best diagonals to step 2
extend exact word matches to find maximal scoring
ungapped regions
check if any ungapped regions can be joined
(allowing for gap costs)
re-align highest scoring candidate matches by
dynamic programming

24
FASTA Algorithm
25
FASTA Algorithm
26
FASTA Incarnations

FASTA compares sequence to another sequence or
database (protein or DNA)
SSEARCH performs Smith-Waterman alignment
between sequences
FASTX compare DNA to protein
others ... see http//fasta.bioch.virginia.edu/

27
BLASTBasic Local Alignment Search Tool

find high scoring local alignments between query
sequence and target database
assumption true match alignments very likely to
contain within them very high scoring matches

28
BLAST Steps
1. Seeding

For each word of length W in the query, generate
a list of all possible words (neighborhood) with
a score of at least threshold T (determined by
using the scoring matrix)

29
Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Determine the locations of all common words
between the query and the database (word hits).
30
Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
Identifies all word hits
31
BLAST Steps
2. Extension

Use dynamic programming to extend hits until the
score drops a value of X expensive!! -- 90 of
time

ABCDEFGHIJKLMNOPQRST
ABCDEFZYIJKLMXWVUTAB 12345654567898765654 -gt
Score 00000012100001234345 -gt Drop off score
Match 1 Mismatch -1 X 5
32
3. Evaluation
Evaluates the statistical significance of
extended hits and reports only those above the
determined threshold.
33
Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
LAL TP G R W P D ER
A Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
Returns highest-scoring segment pairs (HSP)
34
BLAST statistical evaluation

for local, ungapped alignments
m size of query n size of database
E expected of HSPs with scores at least S
p prob of finding at least one HSP with S

good tutorial at
http//www.ncbi.nlm.nih.gov/BLAST/tutorial
/Altschul-1.html

35
other BLASTs
36
Gapped BLAST

2-hit model HSP may have multiple hits along
same diagonal
find two non-overlapping hits on same diagonal
within some distance A
T must be lowered to maintain senstivity
gt more single hits found
BUT small fraction have associated 2nd hit
overall - increase in speed

37
Gapped BLAST example

- BLAST
15 hits, T gt 13
22 hits, T gt 11
- two-hit with T 1 triggers two extensions

38
Gapped BLAST example

- left pair produces HSP with E value
sufficient to trigger gapped extentsion
E 0.5
- original BLAST, finds first and last ungapped
segments and assigns combined E-val gt 50 greater

39
BLATBLAST Like Alignment Tool

BLAT
-- builds index of DB and scans linearly through
query
-- trigger extensions on any number of perfect or
near perfect hits

BLAST
builds index of query sequence and scans
linearly through DB
-- trigger extensions for one or two hit models

40
BLAT

Good for aligning mRNA, ESTs to genome
fast
aligns whole mRNA, not just exons
handles introns and splice-sites

41
BLAT

Steps for cDNA alignment
1 break cDNA into 500 base chunks
2 use index to find regions in genome similar
to each chunk of cDNA
3 do detailed alignment between genomic regions
and cDNA chunk
4 dynamic programming - stictch together
detailed alignments of chunks into alignment of
whole

genome cacaattatcacgaccgc
3-mers cac aat tat cac gac cgc
0 3 6 9 12 15
cDNA aattctcac
3-mers aat att ttc tct ctc tca cac
0 1 2 3 4 5 6
hits aat 0,3 -3
cac 6,0 6
cac 6,9 -3
clump cacAATtatCACgaccgc

example from Jim Kent
43
PSI-BLASTPosition Specific Iterated BLAST

database searches using position-specific scoring
matrices more powerful than simply using single
sequence
STEPS
collect all DB sequences that align with E-val lt
T
align these to make position-specific scoring
matrix
use scoring matrix to search for new hits
iterate

44
PSI Blast
45
Pattern-Hit-Initiated BLAST (PHI-BLAST)

Combines regular expression matching with local
alignments
Finds proteins containing the pattern and
similarity in the region of the pattern
Integrated with PSI-BLAST
E-values are computed differently

multiple sequence alignment

47
msa

msa of PTEN catalytic core

48
uses of MSAs

find functionally important sites
predict structure
find weak similarities in databases
reconstruct evolutionary history

49
multiple sequence alignment

Two separate problems
1 generating all possible alignments
2 scoring alignments

50
scoring alignments

scoring system should include
1 some positions more conserved than others
2 sequences related evolutionarily
ideal define probabilistic model for sequence
evolution
-- not enough information

51
scoring alignments

without phylogenetic information ...
1 minimum entropy
2 sum of pairs

52
generating alignments

dynamic programming
progressive alignment
iterative search
HMMs
blocksets

53
dynamic programming

2 sequences

54
Dynamic Programming for 3 sequences
To see more, check out http//bibiserv.techfak.uni
-bielefeld.de/visualign/
55
dynamic programming

computationally infeasible for more than a few
sequences
Carrillo Lipman not all cells have to be
examined to guarantee optimal alignment
MSA can optimally align 5-7 protein sequences
of length 300

56
progessive alignment

succession of pairwise alignments
heurisic cannot separate scoring and
optimization
works well for close sequences
many examples
-- Feng-Doolittle (1987)
-- ClustalW
-- T-coffee

57
Feng-Doolittle Progressive Alignment

1- do global pairwise alignments for every pair
of sequences
2 convert alignment scores to distances
3 construct guide tree from distance matrix
4 progressively align sequences with weights
from guide tree

58
Feng-Doolittle Clustering
Similarity matrix (from pairwise alignment)
X1
X2
X3
X4
X5
X1

15 11 3 4
3 30 5 3 1
5 25 12 11
3 4 12 40 9
4 1 11 9 30

X2
X3
X4
X5
X5
X3
X1
X2
X4
X1
X2
X3
X4
X5
59
Feng-Doolittle

At each step, follow the guide tree and consider
all possible pairwise alignments of sequences in
the two candidate groups (3 cases)
Sequence vs. sequence
Sequence vs. group (the best matching sequence in
the group determines the alignment)
group vs. group (the best matching pair of
sequences determines the alignment)
Once a gap, always a gap
gap is replaced by a neutral symbol X
no cost for aligning X with anything

60
Generating a Multi-Sequence Alignment

Align the two most similar sequences
Following the guide tree, add in the next
sequences, aligning to the existing alignment
Insert gaps as necessary

Sample output FOS_RAT
PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNIS
NMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPE
SEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK
SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPS
G--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----
STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUM
AN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP---
--------------LPFQ . . .
.. . .
Dots and stars show how well-conserved a column
is.
61
problems

all alignments are completely determined by
initial pairwise sequence alignment errors in
intial alignment are propagated
gaps can proliferate
No backtracking (subalignment is frozen)
no way to correct an early mistake
non-optimality Mismatches and gaps at highly
conserved region should be penalized more, but we
cant tell where is a highly conserved region
early in the process

62
ClustalW

essentially doing Feng-Doolittle with some
important heuristics
1 weighting of sequences
2 choice of substitution matrices
3 gap penalties
4 guide tree adjustment on the fly

63
T-coffee

Tree-based Consistency Objective Function For
alignmEnt Evaluation
-- pre-process data set of all pairwise
alignments between sequences
-- build library
-- intermediate alignments based on sequences and
library information
-- Generally gives improved results over
ClustalW, for sequences with lt 30 identity, but
is slower

64
(No Transcript)
65
iterative local search methods

seek to increase MSA scores by randomly altering
the alignment
usually used to refine alignment
start with generated msa
not guaranteed to find optimal alignment
examples simulated annealing, GA

Hidden Markov Models
the Legos of computational sequence analysis --
Sean Eddy

67
HMMs in computational biology

gene finding
regulatory site identification
profile searching
multiple sequence alignment
CpG island detection

68
Markov Chain
Any DNA sequence can be thought of as a series of
emissions from the following model
69
Markov Chain
Any DNA sequence can be thought of as a series of
emissions from the following model
70
Transition Probabilities
71
HMM for CpG islands
In a hidden Markov Model, the identity of the
emitted symbol does not tell us the state.
Looking at the emission, A is equivalent to A-.
72
CpG Island Transition Probabilities
73
HMMs non-biological example
0.9
0.95
1 1/10
1 1/6
2 1/10
2 1/6
0.05
3 1/10
3 1/6
4 1/10
4 1/6
5 1/10
0.1
5 1/6
6 1/2
6 1/6
Fair Die
Loaded Die
74
Profile HMM
75
Profile HMM
76
Block Alignments

typical msa fix a particular sequence as
reference align all others to this
positive simple
negative - regions conserved in subset but not
in reference sequence not found
- alignments generated with
different reference seqs may be inconsistent

77
Block Alignments

Threaded Blockset Aligner (2004)
-- produce blocks in which each position in given
sequences appear precicesly once
blocks local alignments of some or all of the
given sequences
-- assumption matching regions occur in the same
order and orientation in all sequences

78
TBA
79
(No Transcript)

Write a Comment

User Comments (0)