TGAC Pairwise Sequence Comparisons - PowerPoint PPT Presentation

1 / 71

About This Presentation

Title:

TGAC Pairwise Sequence Comparisons

Description:

ga. ag. gt. tg. aa. aa. ta. ac. c. ta. ag. aa. Self-Alignment of Human LDL Receptor. Dot Plot Analysis. Plot two sequences or sequence against itself. Diagonal is ... – PowerPoint PPT presentation

Number of Views:30

Avg rating:3.0/5.0

Slides: 72

Provided by: jamesg61

Category:

more less

Transcript and Presenter's Notes

Title: TGAC Pairwise Sequence Comparisons

1
TGACPairwise SequenceComparisons
The Artist in His Museum, Charles Wilson Peale
2
Why Compare Sequences?

Determine whether two sequences are homologous
Determine how two sequences are related
Find common evolutionary histories
Find common functional domains

Identity
Similarity
Homology
Paralogy
Orthology

4
Types of Homology
5
Comparing Sequences

Dotplots
Alignments
Scoring alignments
Algorithms for finding alignments
Dynamic programming
Heuristic methods
BLAST
FASTA

6
Comparing Sequences

Dotplots
Alignments
Scoring alignments
Algorithms for finding alignments
Dynamic programming
Heuristic methods
BLAST
FASTA

7
S
T
O
P
S
S
T
O
O
P
8
S
T
O
P
S
S
T
O
P
S
9
S
T
O
P
S
S
P
O
T
S
10
a
c
t
g
a
t
c
g
a
a
c
t
g
a
c
t
11
a
c
t
g
a
t
c
g
a
g
t
c
a
g
t
c
12
a
c
t
g
a
t
c
g
a
a
c
t
g
g
t
c
13
t
g
t
c
t
g
a
g
a
a
a
c
t
g
t
c
t
a
g
t
g
a
a
14
gt
tc
ct
ta
ga
ag
gt
tg
aa
aa
ac
c
ta
gt
tc
ct
ta
ag
gt
tg
ga
aa
a
15
Self-Alignment of Human LDL Receptor
16
Dot Plot Analysis

Plot two sequences or sequence against itself
Diagonal is identical match
Variable word size possible
Good for finding features
Duplications
Inversions
Deletions

17
Dot Plot Programs

EMBOSS has several programs (available via PISE)
Dotter
Dotlet (web-based)

18
Comparing Sequences

Dotplots
Alignments
Scoring alignments
Algorithms for finding alignments
Dynamic programming
Heuristic methods
BLAST
FASTA

19
Hamming Distance
Between strings of equal length, the number of
mismatches
fatcat fatrat
Hamming Distance 2
fatcats fastcat
Hamming Distance 5
20
Alignment
fatcats fastcat
Hamming Distance 5
fa-tcats fastcat-
Edit Distance 2
21
Optimal Alignment

Given two sequences, there are many possible
alignments.
The edit distance is the minimal number of
differences in an alignment.
We can also use a more complicated scoring system
to evaluate alignments.

ac-gtac-g
acg-tac-g
acagtatag
acagtatag
acgtac----g
-ac-agtatag
22
Comparing Sequences

Dotplots
Alignments
Scoring alignments
Algorithms for finding alignments
Dynamic programming
Heuristic methods
BLAST
FASTA

23
Scoring Alignments

Additive
Score of AB Score of A Score of B
Biologically meaningful high score
Treat gaps differently than mismatches
Point mutations are more common than insertions
or deletions.

24
Scoring Systems

Primitive option Aligned residues are either
identical or not identical
Physically motivated Match score is
proportionate to chemical similarity
Mutationally motivated (PAM, BLOSUM)If
mutations often convert one residue to another,
give that a higher match score.

25
PAM(Point/Percent Accepted Mutation)

One PAM unit is an average of 1 mutation per 100
amino acids
PAM 1 was generated from 71 protein sequence
groups of at least 85 similarity to avoid errors
from multiple mutations at the same site, as well
as insertions and deletions
Convert mutation probabilities to a (log odds)
scoring matrix

26
PAM Matrices

PAMn supplies scoring for sequences that have
diverged n PAM units
PAM1n generates other matrices based on Markov
model
PAM250 represents an average of 250 mutations in
100 amino acids

27
BLOSUM(BLOck SUbstitution Matrix)

BLOCKS is a database of 3,000 blocks short,
continuous multiple alignments of protein domains
(functional sequences)
Generates substitution frequency, which can be
converted into log odds score
BLOSUM62
Block has 62 similarity
Close to PAM250
More Reliable

28
Gap Penalties

Insertions and deletions are less likely than
substitutions gaps should be penalized.
Simple gap penalty takes a penalty for each match
of gap to residue.
Lots of small deletions are less biologically
likely than one long deletion.
Affine gap penalty has different penalties for
opening a gap or for continuing a gap.
Opening a gap (5 nucleotides, 11 proteins)
Continuing a gap (2 nucleotides, 1 proteins)

29
Finding best alignment

We know how to recognize optimal alignment in a
pool of possibilities
We could enumerate all possibilities, then choose
the best one.
Thats slow. Is it necessary?

30
Comparing Sequences

Dotplots
Alignments
Scoring alignments
Algorithms for finding alignments
Dynamic programming
Heuristic methods
BLAST
FASTA

31
Premise of Dynamic Programming

If we know the optimal alignment of the prefixes
of two sequences, we can find the optimal
alignment of the entire sequence.

32
Dynamic Programming Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-
G
GT
GTA
33
Initializing the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
GAT --- -9
G - -3
GA -- -6
empty alignment 0
-
- G -3
G
-- GT -6
GT
--- GTA -9
GTA
34
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
G - -3
empty alignment 0
-
- G -3
G G -3
G
-- GT -6
GT
--- GTA -9
GTA
35
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
G - -3
empty alignment 0
-
- G -3
G G -3
G
-- GT -6
GT
--- GTA -9
GTA
36
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
GAT --- -9
GA -- -6
G - -3
G - -3
empty alignment 0
-
- G -3
G G -3
GA G- 0
GAT G-- -3
G
-- GT -6
G- GT 0
GAT G-T 2
GA GT 2
GT
--- GTA -9
G-- GTA 0
GAT GTA 1
G-A GTA 3
GTA
37
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
38
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
39
Finding the Alignment
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
GAT GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
40
A larger example
http//meetings.cshl.org/tgac/tgac/flash/justMatri
x.swf
41
Local Alignments
a_fat_cat_lives_a_happy_existence the_dog_likes_to
_chase_cats
A local alignment is an alignment between a
substring of one sequence and a substring of the
other.
42
Dynamic Programming

Global alignment (what we just did)Initiate
traceback in bottom right-hand corner of the
matrix.Often called Needleman-Wunsch
Local alignment Find top-scoring
sub-alignments.Optimal scores cannot be
negative. Initiate traceback at highest scoring
cell in the matrix.Often called Smith-Waterman

43
Dynamic Programming

Provides an optimal alignment
Results depend on scoring system and gap
penalties
Sometimes too slow
Perfectly good for aligning two sequences
Slow for aligning query sequence against whole
database of sequences

44
Pairwise Sequence Alignment

Often, we dont know which two sequences to
align.
Given a query sequence, we need a way to find all
homologous sequences in a large database.

45
Comparing Sequences

Dotplots
Alignments
Scoring alignments
Algorithms for generating alignments
Dynamic programming
Heuristic methods
BLAST
FASTA

46
BLAST

Basic Local Alignment Search ToolFinds local
alignments.
Most popular tool available to search a database
for sequences homologous to a query sequence.
Will not necessarily find the optimal alignment,
if it has a low score.
Will find all high-scoring alignments with high
probability

47
BLAST Steps

-1. Preprocess database to get sorted list of
words in every sequence in the database.
For each word of length W in the query, generate
a list of all possible words with a score of at
least threshold T
Generate hits by searching the database for any
words on the list.
Use dynamic programming to extend hits until the
score drops gtX return highest-scoring segment
pair (HSP)
Evaluate significance of extended hits

48
Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
LAL TP G R W P D ER
A Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
High-scoring Segment Pair (HSP)
49
BLAST Settings

Program
Filters
Expect
Word Size
Show
Number of Descriptions/Alignments

50
BLAST Programs
51
Other BLAST Programs

MEGABLAST
Genomic BLAST
Fugu rubripes

52
BLAST Databases
53
BLASTing Custom Databases

If you run BLAST on your own computer, instead of
at NCBI, you can create a custom database.
Gather sequences for database (e.g., by using
Batch ENTREZ)
Run formatdb to preprocess the database.

54
BLAST Filters

Low complexity
Human repeats
Mask for lookup table only
Mask lower case
Limit by Entrez query

55
ENTREZ Queries

nucleotide allFilter NOT specimen-voucherAll
Fields
You can eliminate most of the BAC-type records
from the default nucleotide database with the
following query nucleotide allFilter NOT
htgKeyword

56
E-values

Lower values give more stringent results
Default is 10 fractional values permissible
E10 means 10 matches are expected by chance
In output, E lt 0.05 is significant

57
BLAST Output

Reference
Query
Database
TaxBLAST link
Graphical Overview aligned matches

58
BLAST Output Significant Alignments

SeqID Genbank ID and Accession
Description (links to entry)
Score (links down)
E-value
LocusLink
Unigene

59
BLAST Output - Alignments

Length
Score
Raw score is just score from scoring matrix
Bit score is adjusted for statistical properties
of scoring matrix, and can be compared between
searches.
E-value
Identities/Gaps
Strand (e.g. Plus/Plus)
Alignment shows identity

60
BLAST Output - Proteins

Conserved Domains Pfam CD-Search
Alignments
Matching letters shown
Mismatches blank
Conservative substitutions shown by
Gaps are shown by dashes

61
BLAST Output End Summary

Database -- of letters sequences
Lambda, K, H
Gapped Lambda, K, H
Matrix
Gap Penalties
Hit summary , extensions, gt 10.0
HSPs
length of query
T,A, X1, X2, X3, S1, S2

62
Gapped BLAST

Find two non-overlapping hits of score at least T
and distance at most d one from another
Ungapped extension
If the HSP generated has high enough normalized
score, start gapped extension
Report resulting alignment if it has sufficiently
large E-value

63
Benefits of Protein Searches

20 amino acids reduce chance of random matches
Protein DBs are smaller
Degrees of similarity
Chemical
Observed mutation frequencies
Mutational steps between codons

64
FASTA

Find hot-spots (pairs of words of length k these
are called hits in BLAST)
Locate sequences of consecutive hot spots on a
diagonal
Combine sub-alignments into a longer alignment
Find alternative local alignments using dynamic
programming restricted to a ribbon along the
diagonal containing best run

65
FASTA Algorithm
66
FASTA Algorithm
67
FASTA Programs

Fasta3 - Compare a protein sequence to a protein
database, or a DNA sequence to a DNA database
Ssearch3 - Compare a protein sequence to a
protein database, or a DNA database, using the
Smith-Waterman algorithm. It is very slow but
much more sensitive for full-length protein
comparison.
Fastx3 - Compare a DNA sequence to a protein
database, by comparing the translated DNA
sequence in three frames and allowing gaps and
frameshifts.

68
BLAST and FASTA

BLAST for proteins and speed
FASTA for DNA and frameshifts
FASTA for accurate statistics (protein and coding
DNA)
SSEARCH for optimal

69
Blast vs. Fasta II

Database format is the same
BLAST is generally faster
FASTA may produce better alignments
blastp will show several alignments between
domains in the same sequences. Fasta3 shows one
alignment for each sequence pair.

70
Blast vs. Fasta III

Different default matrices and gap penalties
BLAST cannot search very short sequences (can use
Fasta, but EMBOSS fuzznuc or fuzzpro are better)
Fasta searches one strand of DNA
Fasta does not filter low complexity by default

71
Sites using WU-BLAST