Title: TGAC Pairwise Sequence Comparisons
1TGACPairwise SequenceComparisons
The Artist in His Museum, Charles Wilson Peale
2Why Compare Sequences?
- Determine whether two sequences are homologous
- Determine how two sequences are related
- Find common evolutionary histories
- Find common functional domains
3- Identity
- Similarity
- Homology
- Paralogy
- Orthology
4Types of Homology
5Comparing Sequences
- Dotplots
- Alignments
- Scoring alignments
- Algorithms for finding alignments
- Dynamic programming
- Heuristic methods
- BLAST
- FASTA
6Comparing Sequences
- Dotplots
- Alignments
- Scoring alignments
- Algorithms for finding alignments
- Dynamic programming
- Heuristic methods
- BLAST
- FASTA
7S
T
O
P
S
S
T
O
O
P
8S
T
O
P
S
S
T
O
P
S
9S
T
O
P
S
S
P
O
T
S
10a
c
t
g
a
t
c
g
a
a
c
t
g
a
c
t
11a
c
t
g
a
t
c
g
a
g
t
c
a
g
t
c
12a
c
t
g
a
t
c
g
a
a
c
t
g
g
t
c
13t
g
t
c
t
g
a
g
a
a
a
c
t
g
t
c
t
a
g
t
g
a
a
14gt
tc
ct
ta
ga
ag
gt
tg
aa
aa
ac
c
ta
gt
tc
ct
ta
ag
gt
tg
ga
aa
a
15Self-Alignment of Human LDL Receptor
16Dot Plot Analysis
- Plot two sequences or sequence against itself
- Diagonal is identical match
- Variable word size possible
- Good for finding features
- Duplications
- Inversions
- Deletions
17Dot Plot Programs
- EMBOSS has several programs (available via PISE)
- Dotter
- Dotlet (web-based)
18Comparing Sequences
- Dotplots
- Alignments
- Scoring alignments
- Algorithms for finding alignments
- Dynamic programming
- Heuristic methods
- BLAST
- FASTA
19Hamming Distance
Between strings of equal length, the number of
mismatches
fatcat fatrat
Hamming Distance 2
fatcats fastcat
Hamming Distance 5
20Alignment
fatcats fastcat
Hamming Distance 5
fa-tcats fastcat-
Edit Distance 2
21Optimal Alignment
- Given two sequences, there are many possible
alignments. - The edit distance is the minimal number of
differences in an alignment. - We can also use a more complicated scoring system
to evaluate alignments.
ac-gtac-g
acg-tac-g
acagtatag
acagtatag
acgtac----g
-ac-agtatag
22Comparing Sequences
- Dotplots
- Alignments
- Scoring alignments
- Algorithms for finding alignments
- Dynamic programming
- Heuristic methods
- BLAST
- FASTA
23Scoring Alignments
- Additive
- Score of AB Score of A Score of B
- Biologically meaningful high score
- Treat gaps differently than mismatches
- Point mutations are more common than insertions
or deletions.
24Scoring Systems
- Primitive option Aligned residues are either
identical or not identical - Physically motivated Match score is
proportionate to chemical similarity - Mutationally motivated (PAM, BLOSUM)If
mutations often convert one residue to another,
give that a higher match score.
25PAM(Point/Percent Accepted Mutation)
- One PAM unit is an average of 1 mutation per 100
amino acids - PAM 1 was generated from 71 protein sequence
groups of at least 85 similarity to avoid errors
from multiple mutations at the same site, as well
as insertions and deletions - Convert mutation probabilities to a (log odds)
scoring matrix
26PAM Matrices
- PAMn supplies scoring for sequences that have
diverged n PAM units - PAM1n generates other matrices based on Markov
model - PAM250 represents an average of 250 mutations in
100 amino acids
27BLOSUM(BLOck SUbstitution Matrix)
- BLOCKS is a database of 3,000 blocks short,
continuous multiple alignments of protein domains
(functional sequences) - Generates substitution frequency, which can be
converted into log odds score - BLOSUM62
- Block has 62 similarity
- Close to PAM250
- More Reliable
28Gap Penalties
- Insertions and deletions are less likely than
substitutions gaps should be penalized. - Simple gap penalty takes a penalty for each match
of gap to residue. - Lots of small deletions are less biologically
likely than one long deletion. - Affine gap penalty has different penalties for
opening a gap or for continuing a gap. - Opening a gap (5 nucleotides, 11 proteins)
- Continuing a gap (2 nucleotides, 1 proteins)
29Finding best alignment
- We know how to recognize optimal alignment in a
pool of possibilities - We could enumerate all possibilities, then choose
the best one. - Thats slow. Is it necessary?
30Comparing Sequences
- Dotplots
- Alignments
- Scoring alignments
- Algorithms for finding alignments
- Dynamic programming
- Heuristic methods
- BLAST
- FASTA
31Premise of Dynamic Programming
- If we know the optimal alignment of the prefixes
of two sequences, we can find the optimal
alignment of the entire sequence.
32Dynamic Programming Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-
G
GT
GTA
33Initializing the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
GAT --- -9
G - -3
GA -- -6
empty alignment 0
-
- G -3
G
-- GT -6
GT
--- GTA -9
GTA
34Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
G - -3
empty alignment 0
-
- G -3
G G -3
G
-- GT -6
GT
--- GTA -9
GTA
35Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
G - -3
empty alignment 0
-
- G -3
G G -3
G
-- GT -6
GT
--- GTA -9
GTA
36Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
GAT --- -9
GA -- -6
G - -3
G - -3
empty alignment 0
-
- G -3
G G -3
GA G- 0
GAT G-- -3
G
-- GT -6
G- GT 0
GAT G-T 2
GA GT 2
GT
--- GTA -9
G-- GTA 0
GAT GTA 1
G-A GTA 3
GTA
37Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
38Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
39Finding the Alignment
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
GAT GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
40A larger example
http//meetings.cshl.org/tgac/tgac/flash/justMatri
x.swf
41Local Alignments
a_fat_cat_lives_a_happy_existence the_dog_likes_to
_chase_cats
A local alignment is an alignment between a
substring of one sequence and a substring of the
other.
42Dynamic Programming
- Global alignment (what we just did)Initiate
traceback in bottom right-hand corner of the
matrix.Often called Needleman-Wunsch - Local alignment Find top-scoring
sub-alignments.Optimal scores cannot be
negative. Initiate traceback at highest scoring
cell in the matrix.Often called Smith-Waterman
43Dynamic Programming
- Provides an optimal alignment
- Results depend on scoring system and gap
penalties - Sometimes too slow
- Perfectly good for aligning two sequences
- Slow for aligning query sequence against whole
database of sequences
44Pairwise Sequence Alignment
- Often, we dont know which two sequences to
align. - Given a query sequence, we need a way to find all
homologous sequences in a large database.
45Comparing Sequences
- Dotplots
- Alignments
- Scoring alignments
- Algorithms for generating alignments
- Dynamic programming
- Heuristic methods
- BLAST
- FASTA
46BLAST
- Basic Local Alignment Search ToolFinds local
alignments. - Most popular tool available to search a database
for sequences homologous to a query sequence. - Will not necessarily find the optimal alignment,
if it has a low score. - Will find all high-scoring alignments with high
probability
47BLAST Steps
- -1. Preprocess database to get sorted list of
words in every sequence in the database. - For each word of length W in the query, generate
a list of all possible words with a score of at
least threshold T - Generate hits by searching the database for any
words on the list. - Use dynamic programming to extend hits until the
score drops gtX return highest-scoring segment
pair (HSP) - Evaluate significance of extended hits
48Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
LAL TP G R W P D ER
A Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
High-scoring Segment Pair (HSP)
49BLAST Settings
- Program
- Filters
- Expect
- Word Size
- Show
- Number of Descriptions/Alignments
50BLAST Programs
51Other BLAST Programs
- MEGABLAST
- Genomic BLAST
- Fugu rubripes
52BLAST Databases
53BLASTing Custom Databases
- If you run BLAST on your own computer, instead of
at NCBI, you can create a custom database. - Gather sequences for database (e.g., by using
Batch ENTREZ) - Run formatdb to preprocess the database.
54BLAST Filters
- Low complexity
- Human repeats
- Mask for lookup table only
- Mask lower case
- Limit by Entrez query
55ENTREZ Queries
- nucleotide allFilter NOT specimen-voucherAll
Fields - You can eliminate most of the BAC-type records
from the default nucleotide database with the
following query nucleotide allFilter NOT
htgKeyword
56E-values
- Lower values give more stringent results
- Default is 10 fractional values permissible
- E10 means 10 matches are expected by chance
- In output, E lt 0.05 is significant
57BLAST Output
- Reference
- Query
- Database
- TaxBLAST link
- Graphical Overview aligned matches
58BLAST Output Significant Alignments
- SeqID Genbank ID and Accession
- Description (links to entry)
- Score (links down)
- E-value
- LocusLink
- Unigene
59BLAST Output - Alignments
- Length
- Score
- Raw score is just score from scoring matrix
- Bit score is adjusted for statistical properties
of scoring matrix, and can be compared between
searches. - E-value
- Identities/Gaps
- Strand (e.g. Plus/Plus)
- Alignment shows identity
60BLAST Output - Proteins
- Conserved Domains Pfam CD-Search
- Alignments
- Matching letters shown
- Mismatches blank
- Conservative substitutions shown by
- Gaps are shown by dashes
61BLAST Output End Summary
- Database -- of letters sequences
- Lambda, K, H
- Gapped Lambda, K, H
- Matrix
- Gap Penalties
- Hit summary , extensions, gt 10.0
- HSPs
- length of query
- T,A, X1, X2, X3, S1, S2
62Gapped BLAST
- Find two non-overlapping hits of score at least T
and distance at most d one from another - Ungapped extension
- If the HSP generated has high enough normalized
score, start gapped extension - Report resulting alignment if it has sufficiently
large E-value
63Benefits of Protein Searches
- 20 amino acids reduce chance of random matches
- Protein DBs are smaller
- Degrees of similarity
- Chemical
- Observed mutation frequencies
- Mutational steps between codons
64FASTA
- Find hot-spots (pairs of words of length k these
are called hits in BLAST) - Locate sequences of consecutive hot spots on a
diagonal - Combine sub-alignments into a longer alignment
- Find alternative local alignments using dynamic
programming restricted to a ribbon along the
diagonal containing best run
65FASTA Algorithm
66FASTA Algorithm
67FASTA Programs
- Fasta3 - Compare a protein sequence to a protein
database, or a DNA sequence to a DNA database - Ssearch3 - Compare a protein sequence to a
protein database, or a DNA database, using the
Smith-Waterman algorithm. It is very slow but
much more sensitive for full-length protein
comparison. - Fastx3 - Compare a DNA sequence to a protein
database, by comparing the translated DNA
sequence in three frames and allowing gaps and
frameshifts.
68BLAST and FASTA
- BLAST for proteins and speed
- FASTA for DNA and frameshifts
- FASTA for accurate statistics (protein and coding
DNA) - SSEARCH for optimal
69Blast vs. Fasta II
- Database format is the same
- BLAST is generally faster
- FASTA may produce better alignments
- blastp will show several alignments between
domains in the same sequences. Fasta3 shows one
alignment for each sequence pair.
70Blast vs. Fasta III
- Different default matrices and gap penalties
- BLAST cannot search very short sequences (can use
Fasta, but EMBOSS fuzznuc or fuzzpro are better) - Fasta searches one strand of DNA
- Fasta does not filter low complexity by default
71Sites using WU-BLAST
- European Bioinformatics Institute BLAST server
- EMBL Advanced BLAST2 Search
- Institut Pasteur
- Berkeley Drosophila Genome Project (BDGP)
- European Drosophila Genome Project at the EBI
- Mendel Bioinformatics Group at the John Innes
Centre - Mouse Genome Database at the Jackson Laboratory
- PlasmoDB (Plasmodium Genome Resource) at the
University of Pennsylvania - Saccharomyces cerevisiae search at Stanford
University - TAP (Transcript Assembly Program) by Zhengyan
("George") Kan - TIGR Gene Indices
- TIGR Microbial Genomes Database
- WU Genome Sequencing Center