Sequence Alignment - PowerPoint PPT Presentation

Slides: 79
Provided by: meetin8

Transcript and Presenter's Notes

Title: Sequence Alignment


1
Sequence Alignment
  • The Genome Access Course

2
sequence alignment
  • how to position two strings, with gaps, so as to
    maximize the agreement between them

3
sequence alignment
CLUSTAL W (1.82) multiple sequence alignment
a THISMAKESSENSE 14
b THISISN---NSENSE 13
. .
4
biological sequences
  • Are sequences similar?
  • Infer homology
  • Infer function
  • (sequence -> structure -> function)

5
Types of Homology
6
pairwise alignments multiple sequence
alignments
7
[Figure: STOPS plotted against STOPS]
8
dot plots
  • Plot one sequence against another
  • Good for finding repeats/inverts
  • Can specify word size
  • EMBOSS has several programs (available via PISE)
  • Dotter
  • Dotlet (web-based)
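
A minimal sketch of the idea behind a dot plot, assuming exact word matches only. The function name `dot_plot` and the `word` parameter are illustrative and are not taken from EMBOSS, Dotter, or Dotlet.

```python
def dot_plot(seq_x, seq_y, word=1):
    """Return the rows of a dot plot as strings.

    A '*' marks a position where the word of length `word` starting
    there is identical in both sequences; '.' marks everything else.
    """
    rows = []
    for j in range(len(seq_y) - word + 1):
        row = ""
        for i in range(len(seq_x) - word + 1):
            row += "*" if seq_x[i:i + word] == seq_y[j:j + word] else "."
        rows.append(row)
    return rows


# A word against itself gives a main diagonal; against its reverse,
# the diagonal runs the other way (an inversion).
for line in dot_plot("STOPS", "STOPS"):
    print(line)
print()
for line in dot_plot("STOPS", "SPOTS"):
    print(line)
```

Raising `word` removes isolated single-letter matches, which is presumably why the tools above let you specify a word size.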

9
[Figure: STOPS plotted against SPOTS]
10
[Figure: two DNA sequences shown as 2-base words; image not transcribed]
11
dynamic programming
  • developed by Richard Bellman, 1950s
  • can be used for pairwise sequence alignment
  • ex: seq1 = TTCATA
  •     seq2 = TGCTCGTA
  • try all possible alignments?
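
Rather than trying every possible alignment, dynamic programming fills a table of (len(seq1) + 1) x (len(seq2) + 1) cells. The sketch below is one way to do that for the two example sequences. The scoring values (match 1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not values from the course.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap cost (illustrative scores)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row1, row2 = needleman_wunsch("TTCATA", "TGCTCGTA")
print(total)
print(row1)
print(row2)
```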

12
(No Transcript)
13
dynamic programming
  • good example at
  • http://meetings.cshl.org/TGAC/TGAC/flash/DynamicProgramming.swf

14
Global vs. Local
SPQ-RTGKCCWIAGPGILHRMSL
SGALRCSWND-IAGPCAQH-MSA
Global (Needleman-Wunsch): start at the end and add gaps until one end is reached
Local (Smith-Waterman): finds the region(s) of highest similarity and builds outward
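
To make the local case concrete, here is a small sketch of Smith-Waterman. It differs from the global recurrence in two ways: a cell score never drops below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores (match 2, mismatch -1, gap -2) and the test strings are assumptions for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a fresh local alignment
                h[i - 1][j - 1] + sub,
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            if h[i][j] > best:
                best, best_i, best_j = h[i][j], i, j
    # Trace back from the best cell until the score reaches zero.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Only the shared core ACGT is reported; the unrelated flanks are ignored.
print(smith_waterman("TTTACGTAAA", "CCCACGTGGG"))
```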
15
evaluating alignments
  • scoring matrices

16
scoring alignments
  • count the frequencies of all aligned pairs in
    confirmed alignments, then develop probabilities
  • Problems:
  • 1 - how to get a random sample?
  • 2 - different pairs have diverged by different
    amounts

17
Percent Accepted Mutation
  • Dayhoff et al. (1978) -- idea: obtain
    substitution data from very similar proteins,
    extrapolate information to larger evolutionary
    distances
  • PAM1: construct phylogenetic tree from sequences
    in 71 families with at least 85% identity
  • One PAM unit is an average of 1% change over all
    amino acid positions
  • PAM250 (most popular): 250 changes per 100 aa
  • Convert mutation probabilities to a scoring
    matrix by log odds and scaling
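
The extrapolation and log-odds steps can be sketched with a toy two-letter alphabet. Everything numeric here is made up for illustration: the 2x2 `pam1` matrix, the uniform background frequencies, and the scale factor of 10 with base-10 logs. The convention assumed is that `m[i][j]` is the probability that letter i is replaced by letter j.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Extrapolate: PAMn is PAM1 multiplied by itself n times."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def log_odds(m, freqs, scale=10):
    """Score = scale * log10(observed probability / background frequency)."""
    return [[round(scale * math.log10(m[i][j] / freqs[j])) for j in range(len(m))]
            for i in range(len(m))]


# Toy PAM1: each letter has a 1% chance of changing per PAM unit.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
freqs = [0.5, 0.5]

for n in (1, 50, 250):
    pam_n = mat_pow(pam1, n)
    print(n, round(pam_n[0][0], 4), log_odds(pam_n, freqs))
```

Because a position can change more than once, the diagonal of the extrapolated matrix does not fall to zero as n grows, which is why 250 changes per 100 positions still leaves detectable similarity.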

18
Percent Accepted Mutation
  • PAM Problems:
  • PAM1 entries come from short time interval
    substitutions
  • PAM250 is a scaled version of PAM1
  • -- short time interval substitutions: mostly
    single base changes in codon triplets
  • -- long time interval substitutions: all types
    of codon changes

19
BLOcks SUbstitution Matrices
  • Henikoff & Henikoff (1992) -- substitution
    matrices based on local alignments
  • all BLOSUM matrices based on observed alignments
  • BLOSUM N -- calculated from comparisons of
    sequences with no more than N% identity

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
20
gap penalties
  • no standard model
  • generally estimated empirically once substitution
    scores are chosen
  • Parameters
  • initial and terminal
  • gap opening
  • gap extending
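
One common way to combine these parameters is an affine gap cost, sketched below. The default values and the choice to make terminal gaps free are assumptions for illustration. Conventions also differ between programs (some charge the extension cost for the first gap position as well), so this is one variant rather than a standard.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5,
                    terminal=False, free_terminal_gaps=True):
    """Cost of a single gap spanning `length` positions.

    The first gapped position pays `gap_open`; each further position
    pays `gap_extend`. Gaps at the ends of a sequence can optionally
    be left unpenalized.
    """
    if length <= 0:
        return 0.0
    if terminal and free_terminal_gaps:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One long gap is much cheaper than several short ones of the same total length.
print(affine_gap_cost(6), 3 * affine_gap_cost(2))
```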

21
alignment methods
  • dynamic programming is too computationally
    expensive for searching large databases
  • show example
  • use heuristics
  • FASTA - Pearson and Lipman (1988)
  • BLAST - Altschul et al. (1990)

22
FASTA
  • Fast All
  • find local high scoring alignments from exact
    short word matches

23
FASTA Algorithm
  • find all identically matching k-mers in two seqs
  • (proteins: k = 1, 2; DNA: k = 4, 6)
  • -- find diagonals with many mutually supporting
    word matches
  • -- take best diagonals to step 2
  • extend exact word matches to find maximal scoring
    ungapped regions
  • check if any ungapped regions can be joined
    (allowing for gap costs)
  • re-align highest scoring candidate matches by
    dynamic programming
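
The first stage above (exact k-mer matches grouped by diagonal) can be sketched as follows. This is not the FASTA implementation: the function names are hypothetical, and the later rescoring, joining, and dynamic programming stages are left out.

```python
from collections import Counter, defaultdict


def kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k):
    """Count identical k-mer matches per diagonal (query pos - subject pos)."""
    index = kmer_index(query, k)
    counts = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(query, subject, k, top=10):
    """Diagonals with the most mutually supporting word matches."""
    return diagonal_hits(query, subject, k).most_common(top)


print(best_diagonals("ACGTACGT", "TTACGTAC", k=4))
```

Matches that share a diagonal correspond to the same ungapped offset between the two sequences, so a diagonal with many hits is a good candidate for extension.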

24
FASTA Algorithm
25
FASTA Algorithm
26
FASTA Incarnations
  • FASTA: compares a sequence to another sequence or
    database (protein or DNA)
  • SSEARCH: performs Smith-Waterman alignment
    between sequences
  • FASTX: compares DNA to protein
  • others ... see http://fasta.bioch.virginia.edu/

27
BLAST: Basic Local Alignment Search Tool
  • find high scoring local alignments between query
    sequence and target database
  • assumption: true match alignments are very likely
    to contain within them very high scoring matches

28
BLAST Steps
1. Seeding
  • For each word of length W in the query, generate
    a list of all possible words (neighborhood) with
    a score of at least threshold T (determined by
    using the scoring matrix)

29
Query word (W = 3)
Query: GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFGCATSWPI
Neighborhood words: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PHG 13, PMG 13, PSG 13, PQA 12, PQN 12
Neighborhood score threshold (T = 13)
Determine the locations of all common words
between the query and the database (word hits).
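
The neighborhood for the query word PQG can be reproduced with a small sketch. The pair scores in `PAIR_SCORES` are only the handful needed for this word. They agree with the word scores listed on the slide and with BLOSUM62, but the slide does not name its matrix, so treat that as an assumption. Every pair not listed falls back to a placeholder score of -1, which is not a real matrix value.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# (query letter, candidate letter) -> score; partial table, see lead-in.
PAIR_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0, ("Q", "H"): 0,
    ("Q", "M"): 0, ("Q", "S"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}


def word_score(word, candidate, scores, default=-1):
    """Sum of pair scores; unlisted pairs get the placeholder default."""
    return sum(scores.get(pair, default) for pair in zip(word, candidate))


def neighborhood(word, threshold, scores, alphabet=AMINO_ACIDS):
    """All words of the same length scoring at least `threshold`."""
    hits = {}
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        s = word_score(word, candidate, scores)
        if s >= threshold:
            hits[candidate] = s
    return hits


for w, s in sorted(neighborhood("PQG", 13, PAIR_SCORES).items(),
                   key=lambda item: -item[1]):
    print(w, s)
```

PQA and PQN score 12 here, so they fall just below T = 13 and are not seeded.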
30
Query word (W = 3), neighborhood words and threshold (T = 13) as on the previous slide
Hit
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
Identifies all word hits
31
BLAST Steps
2. Extension
  • Use dynamic programming to extend hits until the
    score drops by a value of X. Expensive!! -- 90% of
    time

ABCDEFGHIJKLMNOPQRST
ABCDEFZYIJKLMXWVUTAB
12345654567898765654 (score)
00000012100001234345 (drop-off score)
Match = 1, Mismatch = -1, X = 5
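
The drop-off rule can be sketched as a simple ungapped walk along the diagonal. This is a simplification of what BLAST does. The function name is hypothetical, and stopping as soon as the drop reaches X (rather than exceeds it) is an assumption.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Ungapped extension to the right of a hit.

    Keeps a running score and remembers the best score seen so far.
    Stops once the running score has fallen x_drop below that best.
    Returns (best score, length of the extension that achieved it).
    """
    score = best = best_len = 0
    for k, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best:
            best, best_len = score, k
        elif best - score >= x_drop:
            break
    return best, best_len


best, length = extend_right("ABCDEFGHIJKLMNOPQRST", "ABCDEFZYIJKLMXWVUTAB")
print(best, "ABCDEFGHIJKLMNOPQRST"[:length])
```

The extension is trimmed back to the position of the best score, so the mismatches that triggered the stop are not part of the reported segment.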
32
3. Evaluation
Evaluates the statistical significance of
extended hits and reports only those above the
determined threshold.
33
Query word (W = 3), neighborhood words and threshold (T = 13) as on slide 29
Hit
Query:   SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
          LA  L    TP G R    W   P  D     ER     A
Subject: TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
Returns highest-scoring segment pairs (HSP)
34
BLAST statistical evaluation
  • for local, ungapped alignments
  • m = size of query; n = size of database
  • E = expected number of HSPs with scores at least S
  • p = probability of finding at least one HSP with score at least S
  • good tutorial at
  • http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
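
The formulas themselves did not survive in the transcript. The sketch below uses the standard Karlin-Altschul form, E = K m n exp(-lambda S) and p = 1 - exp(-E), which is presumably what the slide showed. The default values for `k` and `lam` are placeholders for illustration. Real values depend on the scoring matrix and background composition.

```python
import math


def expected_hsps(score, m, n, k=0.13, lam=0.32):
    """E: expected number of HSPs scoring at least `score` by chance.

    m is the query length, n the database length; k and lam are the
    Karlin-Altschul parameters (placeholder values here).
    """
    return k * m * n * math.exp(-lam * score)


def p_at_least_one(e_value):
    """p: probability of at least one such HSP, assuming a Poisson count."""
    return 1.0 - math.exp(-e_value)


e = expected_hsps(score=50, m=300, n=1_000_000)
print(e, p_at_least_one(e))
```

E grows linearly with the search space m x n and falls exponentially with the score, and for small E the two quantities are nearly equal.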

35
other BLASTs
36
Gapped BLAST
  • 2-hit model: an HSP may have multiple hits along
    the same diagonal
  • find two non-overlapping hits on the same diagonal
    within some distance A
  • T must be lowered to maintain sensitivity
  • -> more single hits found
  • BUT only a small fraction have an associated 2nd hit
  • overall - increase in speed
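
A sketch of the two-hit test, assuming word hits are already available as (query position, subject position) pairs. The function name, the tuple layout, and the example values for the word size and the distance A (`window`) are illustrative choices, not BLAST's internals.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_size, window):
    """Find pairs of non-overlapping hits on one diagonal within `window`.

    Returns a list of (diagonal, first query pos, second query pos).
    Only such pairs would go on to trigger an extension.
    """
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)
    seeds = []
    for diagonal, positions in by_diagonal.items():
        last = None
        for q in sorted(positions):
            if last is not None:
                distance = q - last
                if distance < word_size:
                    continue  # overlaps the previous hit; ignore it
                if distance <= window:
                    seeds.append((diagonal, last, q))
            last = q
    return seeds


hits = [(0, 5), (1, 6), (10, 15), (60, 65), (3, 20)]
print(two_hit_seeds(hits, word_size=3, window=40))
```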

37
Gapped BLAST example
  • - BLAST
  • 15 hits, T > 13
  • 22 hits, T > 11
  • - two-hit with T 1 triggers two extensions

38
Gapped BLAST example
  • - left pair produces HSP with E-value
    sufficient to trigger gapped extension
  • E = 0.5
  • - original BLAST finds first and last ungapped
    segments and assigns a combined E-value > 50 times greater

39
BLAT: BLAST-Like Alignment Tool
  • BLAT
  • -- builds index of DB and scans linearly through
    query
  • -- triggers extensions on any number of perfect or
    near perfect hits
  • BLAST
  • -- builds index of query sequence and scans
    linearly through DB
  • -- triggers extensions for one or two hit models

40
BLAT
  • Good for aligning mRNA, ESTs to genome
  • fast
  • aligns whole mRNA, not just exons
  • handles introns and splice-sites

41
BLAT
  • Steps for cDNA alignment
  • 1 break cDNA into 500 base chunks
  • 2 use index to find regions in genome similar
    to each chunk of cDNA
  • 3 do detailed alignment between genomic regions
    and cDNA chunk
  • 4 dynamic programming - stitch together
    detailed alignments of chunks into alignment of
    whole

42
  • genome: cacaattatcacgaccgc
  • 3-mers (non-overlapping): cac aat tat cac gac cgc
  • positions: 0 3 6 9 12 15
  • cDNA: aattctcac
  • 3-mers (overlapping): aat att ttc tct ctc tca cac
  • positions: 0 1 2 3 4 5 6
  • hits (cDNA position, genome position, difference):
  •   aat 0,3 -3
  •   cac 6,0 6
  •   cac 6,9 -3
  • clump: cacAATtatCACgaccgc

example from Jim Kent
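
The example above can be reproduced in a few lines: index the genome with non-overlapping 3-mers, then look up every overlapping 3-mer of the cDNA. The function names are hypothetical and this is only the seeding step, not BLAT itself.

```python
def index_nonoverlapping(genome, k):
    """Index the genome using k-mers that start at 0, k, 2k, ..."""
    index = {}
    for pos in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index


def find_hits(cdna, index, k):
    """Look up every overlapping k-mer of the cDNA in the genome index.

    Each hit is (word, cDNA position, genome position, difference).
    """
    hits = []
    for q in range(len(cdna) - k + 1):
        word = cdna[q:q + k]
        for g in index.get(word, []):
            hits.append((word, q, g, q - g))
    return hits


genome = "cacaattatcacgaccgc"
cdna = "aattctcac"
for hit in find_hits(cdna, index_nonoverlapping(genome, 3), 3):
    print(hit)
```

The two hits with difference -3 lie on the same diagonal and close together in the genome, which is what forms the clump; the lone hit with difference 6 does not.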
43
PSI-BLAST: Position-Specific Iterated BLAST
  • database searches using position-specific scoring
    matrices are more powerful than simply using a single
    sequence
  • STEPS
  • collect all DB sequences that align with E-value <
    T
  • align these to make position-specific scoring
    matrix
  • use scoring matrix to search for new hits
  • iterate

44
PSI Blast
45
Pattern-Hit-Initiated BLAST (PHI-BLAST)
  • Combines regular expression matching with local
    alignments
  • Finds proteins containing the pattern and
    similarity in the region of the pattern
  • Integrated with PSI-BLAST
  • E-values are computed differently

46
  • multiple sequence alignment

47
msa
  • msa of PTEN catalytic core

48
uses of MSAs
  • find functionally important sites
  • predict structure
  • find weak similarities in databases
  • reconstruct evolutionary history

49
multiple sequence alignment
  • Two separate problems
  • 1 generating all possible alignments
  • 2 scoring alignments

50
scoring alignments
  • scoring system should account for:
  • 1 some positions more conserved than others
  • 2 sequences related evolutionarily
  • ideal: define a probabilistic model for sequence
    evolution
  • -- not enough information

51
scoring alignments
  • without phylogenetic information ...
  • 1 minimum entropy
  • 2 sum of pairs

52
generating alignments
  • dynamic programming
  • progressive alignment
  • iterative search
  • HMMs
  • blocksets

53
dynamic programming
  • 2 sequences

54
Dynamic Programming for 3 sequences
To see more, check out http://bibiserv.techfak.uni-bielefeld.de/visualign/
55
dynamic programming
  • computationally infeasible for more than a few
    sequences
  • Carrillo & Lipman: not all cells have to be
    examined to guarantee optimal alignment
  • the MSA program can optimally align 5-7 protein
    sequences of length 300

56
progressive alignment
  • succession of pairwise alignments
  • heuristic: cannot separate scoring and
    optimization
  • works well for close sequences
  • many examples
  • -- Feng-Doolittle (1987)
  • -- ClustalW
  • -- T-coffee

57
Feng-Doolittle Progressive Alignment
  • 1- do global pairwise alignments for every pair
    of sequences
  • 2 convert alignment scores to distances
  • 3 construct guide tree from distance matrix
  • 4 progressively align sequences with weights
    from guide tree
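
Steps 2 and 3 can be sketched as follows. The distance conversion uses the commonly cited form D = -ln(S_eff), with S_eff = (S_obs - S_rand) / (S_max - S_rand). The tree building below swaps in simple average-linkage clustering, which is an assumption made for brevity and not the clustering method of the original paper. All names and numbers are illustrative.

```python
import math


def fd_distance(s_obs, s_rand, s_max):
    """Convert an alignment score into a distance.

    s_obs: observed pairwise score; s_rand: expected score for
    shuffled sequences; s_max: score for aligning each sequence to itself.
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    return -math.log(s_eff)


def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, member leaves)

    def average(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


dist = {
    frozenset({"A", "B"}): fd_distance(90, 10, 100),
    frozenset({"A", "C"}): fd_distance(30, 10, 100),
    frozenset({"B", "C"}): fd_distance(35, 10, 100),
}
print(guide_tree(["A", "B", "C"], dist))
```

The innermost tuple is the pair that would be aligned first in step 4.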

58
Feng-Doolittle Clustering
Similarity matrix (from pairwise alignment)
Rows and columns: X1, X2, X3, X4, X5 (rows as transcribed; some entries lost in extraction)
  • 15 11 3 4
  • 3 30 5 3 1
  • 5 25 12 11
  • 3 4 12 40 9
  • 4 1 11 9 30

[Figure: clustering of X5, X3, X1, X2, X4]
59
Feng-Doolittle
  • At each step, follow the guide tree and consider
    all possible pairwise alignments of sequences in
    the two candidate groups (3 cases)
  • Sequence vs. sequence
  • Sequence vs. group (the best matching sequence in
    the group determines the alignment)
  • group vs. group (the best matching pair of
    sequences determines the alignment)
  • Once a gap, always a gap
  • gap is replaced by a neutral symbol X
  • no cost for aligning X with anything

60
Generating a Multi-Sequence Alignment
  • Align the two most similar sequences
  • Following the guide tree, add in the next
    sequences, aligning to the existing alignment
  • Insert gaps as necessary

Sample output
FOS_RAT    PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE  PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK  SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ
Dots and stars show how well-conserved a column
is.
61
problems
  • all alignments are completely determined by the
    initial pairwise sequence alignment; errors in the
    initial alignment are propagated
  • gaps can proliferate
  • No backtracking (subalignment is frozen)
  • no way to correct an early mistake
  • non-optimality: mismatches and gaps in highly
    conserved regions should be penalized more, but we
    can't tell where the highly conserved regions are
    early in the process

62
ClustalW
  • essentially doing Feng-Doolittle with some
    important heuristics
  • 1 weighting of sequences
  • 2 choice of substitution matrices
  • 3 gap penalties
  • 4 guide tree adjustment on the fly

63
T-coffee
  • Tree-based Consistency Objective Function For
    alignmEnt Evaluation
  • -- pre-process data set of all pairwise
    alignments between sequences
  • -- build library
  • -- intermediate alignments based on sequences and
    library information
  • -- Generally gives improved results over
    ClustalW, for sequences with < 30% identity, but
    is slower

64
(No Transcript)
65
iterative local search methods
  • seek to increase MSA scores by randomly altering
    the alignment
  • usually used to refine alignment
  • start with generated msa
  • not guaranteed to find optimal alignment
  • examples: simulated annealing, genetic algorithms (GA)

66
  • Hidden Markov Models
  • the Legos of computational sequence analysis --
    Sean Eddy

67
HMMs in computational biology
  • gene finding
  • regulatory site identification
  • profile searching
  • multiple sequence alignment
  • CpG island detection

68
Markov Chain
Any DNA sequence can be thought of as a series of
emissions from the following model
69
Markov Chain
Any DNA sequence can be thought of as a series of
emissions from the following model
70
Transition Probabilities
71
HMM for CpG islands
In a hidden Markov Model, the identity of the
emitted symbol does not tell us the state.
Looking at the emission, A+ is equivalent to A-.
72
CpG Island Transition Probabilities
73
HMMs non-biological example
Fair Die: 1 1/6, 2 1/6, 3 1/6, 4 1/6, 5 1/6, 6 1/6
Loaded Die: 1 1/10, 2 1/10, 3 1/10, 4 1/10, 5 1/10, 6 1/2
Transition probabilities on the arrows: 0.95, 0.05, 0.9, 0.1
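
A sketch of decoding the hidden state (fair or loaded) from a series of rolls with the Viterbi algorithm. The emission probabilities are the ones on the slide. The assignment of the four transition probabilities (fair stays fair with 0.95, loaded stays loaded with 0.9) and the 50/50 start are assumptions, since the transcript does not say which arrow carries which number.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}  # assumed
TRANS = {"F": {"F": 0.95, "L": 0.05},  # assumed assignment of the
         "L": {"F": 0.10, "L": 0.90}}  # four numbers on the slide
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2},
}


def viterbi(rolls):
    """Most probable state path for a non-empty string of rolls."""
    # v[s] = log probability of the best path ending in state s
    v = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for r in rolls[1:]:
        prev, v, ptr = v, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            v[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][r])
            ptr[s] = best
        back.append(ptr)
    # Follow the back pointers from the best final state.
    state = max(STATES, key=lambda s: v[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("3151162464466442453113216311641521336"))
```

Working in log space avoids numerical underflow on long roll sequences.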
74
Profile HMM
75
Profile HMM
76
Block Alignments
  • typical msa: fix a particular sequence as
    reference, align all others to this
  • positive - simple
  • negative - regions conserved in a subset but not
    in the reference sequence are not found
  • - alignments generated with
    different reference seqs may be inconsistent

77
Block Alignments
  • Threaded Blockset Aligner (2004)
  • -- produce blocks in which each position in the given
    sequences appears precisely once
  • blocks: local alignments of some or all of the
    given sequences
  • -- assumption: matching regions occur in the same
    order and orientation in all sequences

78
TBA
79
(No Transcript)