TGAC Pairwise Sequence Comparisons - PowerPoint PPT Presentation

1 / 71
About This Presentation
Title:

TGAC Pairwise Sequence Comparisons

Description:

ga. ag. gt. tg. aa. aa. ta. ac. c. ta. ag. aa. Self-Alignment of Human LDL Receptor. Dot Plot Analysis. Plot two sequences or sequence against itself. Diagonal is ... – PowerPoint PPT presentation

Number of Views:30
Avg rating:3.0/5.0
Slides: 72
Provided by: jamesg61
Category:

less

Transcript and Presenter's Notes

Title: TGAC Pairwise Sequence Comparisons


1
TGACPairwise SequenceComparisons
The Artist in His Museum, Charles Wilson Peale
2
Why Compare Sequences?
  • Determine whether two sequences are homologous
  • Determine how two sequences are related
  • Find common evolutionary histories
  • Find common functional domains

3
  • Identity
  • Similarity
  • Homology
  • Paralogy
  • Orthology

4
Types of Homology
5
Comparing Sequences
  • Dotplots
  • Alignments
  • Scoring alignments
  • Algorithms for finding alignments
  • Dynamic programming
  • Heuristic methods
  • BLAST
  • FASTA

6
Comparing Sequences
  • Dotplots
  • Alignments
  • Scoring alignments
  • Algorithms for finding alignments
  • Dynamic programming
  • Heuristic methods
  • BLAST
  • FASTA

7
S
T
O
P
S
S
T
O
O
P
8
S
T
O
P
S
S
T
O
P
S
9
S
T
O
P
S
S
P
O
T
S
10
a
c
t
g
a
t
c
g
a
a
c
t
g
a
c
t
11
a
c
t
g
a
t
c
g
a
g
t
c
a
g
t
c
12
a
c
t
g
a
t
c
g
a
a
c
t
g
g
t
c
13
t
g
t
c
t
g
a
g
a
a
a
c
t
g
t
c
t
a
g
t
g
a
a
14
gt
tc
ct
ta
ga
ag
gt
tg
aa
aa
ac
c
ta
gt
tc
ct
ta
ag
gt
tg
ga
aa
a
15
Self-Alignment of Human LDL Receptor
16
Dot Plot Analysis
  • Plot two sequences or sequence against itself
  • Diagonal is identical match
  • Variable word size possible
  • Good for finding features
  • Duplications
  • Inversions
  • Deletions

17
Dot Plot Programs
  • EMBOSS has several programs (available via PISE)
  • Dotter
  • Dotlet (web-based)

18
Comparing Sequences
  • Dotplots
  • Alignments
  • Scoring alignments
  • Algorithms for finding alignments
  • Dynamic programming
  • Heuristic methods
  • BLAST
  • FASTA

19
Hamming Distance
Between strings of equal length, the number of
mismatches
fatcat fatrat
Hamming Distance 2
fatcats fastcat
Hamming Distance 5
20
Alignment
fatcats fastcat
Hamming Distance 5
fa-tcats fastcat-
Edit Distance 2
21
Optimal Alignment
  • Given two sequences, there are many possible
    alignments.
  • The edit distance is the minimal number of
    differences in an alignment.
  • We can also use a more complicated scoring system
    to evaluate alignments.

ac-gtac-g
acg-tac-g
acagtatag
acagtatag
acgtac----g
-ac-agtatag
22
Comparing Sequences
  • Dotplots
  • Alignments
  • Scoring alignments
  • Algorithms for finding alignments
  • Dynamic programming
  • Heuristic methods
  • BLAST
  • FASTA

23
Scoring Alignments
  • Additive
  • Score of AB Score of A Score of B
  • Biologically meaningful high score
  • Treat gaps differently than mismatches
  • Point mutations are more common than insertions
    or deletions.

24
Scoring Systems
  • Primitive option Aligned residues are either
    identical or not identical
  • Physically motivated Match score is
    proportionate to chemical similarity
  • Mutationally motivated (PAM, BLOSUM)If
    mutations often convert one residue to another,
    give that a higher match score.

25
PAM(Point/Percent Accepted Mutation)
  • One PAM unit is an average of 1 mutation per 100
    amino acids
  • PAM 1 was generated from 71 protein sequence
    groups of at least 85 similarity to avoid errors
    from multiple mutations at the same site, as well
    as insertions and deletions
  • Convert mutation probabilities to a (log odds)
    scoring matrix

26
PAM Matrices
  • PAMn supplies scoring for sequences that have
    diverged n PAM units
  • PAM1n generates other matrices based on Markov
    model
  • PAM250 represents an average of 250 mutations in
    100 amino acids

27
BLOSUM(BLOck SUbstitution Matrix)
  • BLOCKS is a database of 3,000 blocks short,
    continuous multiple alignments of protein domains
    (functional sequences)
  • Generates substitution frequency, which can be
    converted into log odds score
  • BLOSUM62
  • Block has 62 similarity
  • Close to PAM250
  • More Reliable

28
Gap Penalties
  • Insertions and deletions are less likely than
    substitutions gaps should be penalized.
  • Simple gap penalty takes a penalty for each match
    of gap to residue.
  • Lots of small deletions are less biologically
    likely than one long deletion.
  • Affine gap penalty has different penalties for
    opening a gap or for continuing a gap.
  • Opening a gap (5 nucleotides, 11 proteins)
  • Continuing a gap (2 nucleotides, 1 proteins)

29
Finding best alignment
  • We know how to recognize optimal alignment in a
    pool of possibilities
  • We could enumerate all possibilities, then choose
    the best one.
  • Thats slow. Is it necessary?

30
Comparing Sequences
  • Dotplots
  • Alignments
  • Scoring alignments
  • Algorithms for finding alignments
  • Dynamic programming
  • Heuristic methods
  • BLAST
  • FASTA

31
Premise of Dynamic Programming
  • If we know the optimal alignment of the prefixes
    of two sequences, we can find the optimal
    alignment of the entire sequence.

32
Dynamic Programming Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-
G
GT
GTA
33
Initializing the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
GAT --- -9
G - -3
GA -- -6
empty alignment 0
-
- G -3
G
-- GT -6
GT
--- GTA -9
GTA
34
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
G - -3
empty alignment 0
-
- G -3
G G -3
G
-- GT -6
GT
--- GTA -9
GTA
35
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
G - -3
empty alignment 0
-
- G -3
G G -3
G
-- GT -6
GT
--- GTA -9
GTA
36
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
GAT --- -9
GA -- -6
G - -3
G - -3
empty alignment 0
-
- G -3
G G -3
GA G- 0
GAT G-- -3
G
-- GT -6
G- GT 0
GAT G-T 2
GA GT 2
GT
--- GTA -9
G-- GTA 0
GAT GTA 1
G-A GTA 3
GTA
37
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
38
Filling in the Matrix
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
39
Finding the Alignment
Match 3 Mismatch -1 Gap -3
Sequence A GAT Sequence B GTA
GAT GTA
G
GA
GAT
-
-3
-6
-9
0
-
-3
-3
0
-3
G
-6
0
2
2
GT
0
-9
3
1
GTA
40
A larger example
http//meetings.cshl.org/tgac/tgac/flash/justMatri
x.swf
41
Local Alignments
a_fat_cat_lives_a_happy_existence the_dog_likes_to
_chase_cats
A local alignment is an alignment between a
substring of one sequence and a substring of the
other.
42
Dynamic Programming
  • Global alignment (what we just did)Initiate
    traceback in bottom right-hand corner of the
    matrix.Often called Needleman-Wunsch
  • Local alignment Find top-scoring
    sub-alignments.Optimal scores cannot be
    negative. Initiate traceback at highest scoring
    cell in the matrix.Often called Smith-Waterman

43
Dynamic Programming
  • Provides an optimal alignment
  • Results depend on scoring system and gap
    penalties
  • Sometimes too slow
  • Perfectly good for aligning two sequences
  • Slow for aligning query sequence against whole
    database of sequences

44
Pairwise Sequence Alignment
  • Often, we dont know which two sequences to
    align.
  • Given a query sequence, we need a way to find all
    homologous sequences in a large database.

45
Comparing Sequences
  • Dotplots
  • Alignments
  • Scoring alignments
  • Algorithms for generating alignments
  • Dynamic programming
  • Heuristic methods
  • BLAST
  • FASTA

46
BLAST
  • Basic Local Alignment Search ToolFinds local
    alignments.
  • Most popular tool available to search a database
    for sequences homologous to a query sequence.
  • Will not necessarily find the optimal alignment,
    if it has a low score.
  • Will find all high-scoring alignments with high
    probability

47
BLAST Steps
  • -1. Preprocess database to get sorted list of
    words in every sequence in the database.
  • For each word of length W in the query, generate
    a list of all possible words with a score of at
    least threshold T
  • Generate hits by searching the database for any
    words on the list.
  • Use dynamic programming to extend hits until the
    score drops gtX return highest-scoring segment
    pair (HSP)
  • Evaluate significance of extended hits

48
Query word (W 3)
Query GSDFWQETRASFGCSLAALLNKCKTPQGQRLVNQWIKQPLMDK
NRIEERLNLVEAFGCATSWPI
PQG 18 PEG 15 PRG 14 PKG 14 PNG 13 PDG 13 PHG 13 P
MG 13 PSG 13 PQA 12 PQN 12
Neighborhood words
Neighborhood score threshold (T 13)
Hit
Query SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA
LAL TP G R W P D ER
A Subject TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQT
IGA
High-scoring Segment Pair (HSP)
49
BLAST Settings
  • Program
  • Filters
  • Expect
  • Word Size
  • Show
  • Number of Descriptions/Alignments

50
BLAST Programs
51
Other BLAST Programs
  • MEGABLAST
  • Genomic BLAST
  • Fugu rubripes

52
BLAST Databases
53
BLASTing Custom Databases
  • If you run BLAST on your own computer, instead of
    at NCBI, you can create a custom database.
  • Gather sequences for database (e.g., by using
    Batch ENTREZ)
  • Run formatdb to preprocess the database.

54
BLAST Filters
  • Low complexity
  • Human repeats
  • Mask for lookup table only
  • Mask lower case
  • Limit by Entrez query

55
ENTREZ Queries
  • nucleotide allFilter NOT specimen-voucherAll
    Fields
  • You can eliminate most of the BAC-type records
    from the default nucleotide database with the
    following query nucleotide allFilter NOT
    htgKeyword

56
E-values
  • Lower values give more stringent results
  • Default is 10 fractional values permissible
  • E10 means 10 matches are expected by chance
  • In output, E lt 0.05 is significant

57
BLAST Output
  • Reference
  • Query
  • Database
  • TaxBLAST link
  • Graphical Overview aligned matches

58
BLAST Output Significant Alignments
  • SeqID Genbank ID and Accession
  • Description (links to entry)
  • Score (links down)
  • E-value
  • LocusLink
  • Unigene

59
BLAST Output - Alignments
  • Length
  • Score
  • Raw score is just score from scoring matrix
  • Bit score is adjusted for statistical properties
    of scoring matrix, and can be compared between
    searches.
  • E-value
  • Identities/Gaps
  • Strand (e.g. Plus/Plus)
  • Alignment shows identity

60
BLAST Output - Proteins
  • Conserved Domains Pfam CD-Search
  • Alignments
  • Matching letters shown
  • Mismatches blank
  • Conservative substitutions shown by
  • Gaps are shown by dashes

61
BLAST Output End Summary
  • Database -- of letters sequences
  • Lambda, K, H
  • Gapped Lambda, K, H
  • Matrix
  • Gap Penalties
  • Hit summary , extensions, gt 10.0
  • HSPs
  • length of query
  • T,A, X1, X2, X3, S1, S2

62
Gapped BLAST
  • Find two non-overlapping hits of score at least T
    and distance at most d one from another
  • Ungapped extension
  • If the HSP generated has high enough normalized
    score, start gapped extension
  • Report resulting alignment if it has sufficiently
    large E-value

63
Benefits of Protein Searches
  • 20 amino acids reduce chance of random matches
  • Protein DBs are smaller
  • Degrees of similarity
  • Chemical
  • Observed mutation frequencies
  • Mutational steps between codons

64
FASTA
  • Find hot-spots (pairs of words of length k these
    are called hits in BLAST)
  • Locate sequences of consecutive hot spots on a
    diagonal
  • Combine sub-alignments into a longer alignment
  • Find alternative local alignments using dynamic
    programming restricted to a ribbon along the
    diagonal containing best run

65
FASTA Algorithm
66
FASTA Algorithm
67
FASTA Programs
  • Fasta3 - Compare a protein sequence to a protein
    database, or a DNA sequence to a DNA database
  • Ssearch3 - Compare a protein sequence to a
    protein database, or a DNA database, using the
    Smith-Waterman algorithm. It is very slow but
    much more sensitive for full-length protein
    comparison.
  • Fastx3 - Compare a DNA sequence to a protein
    database, by comparing the translated DNA
    sequence in three frames and allowing gaps and
    frameshifts.

68
BLAST and FASTA
  • BLAST for proteins and speed
  • FASTA for DNA and frameshifts
  • FASTA for accurate statistics (protein and coding
    DNA)
  • SSEARCH for optimal

69
Blast vs. Fasta II
  • Database format is the same
  • BLAST is generally faster
  • FASTA may produce better alignments
  • blastp will show several alignments between
    domains in the same sequences. Fasta3 shows one
    alignment for each sequence pair.

70
Blast vs. Fasta III
  • Different default matrices and gap penalties
  • BLAST cannot search very short sequences (can use
    Fasta, but EMBOSS fuzznuc or fuzzpro are better)
  • Fasta searches one strand of DNA
  • Fasta does not filter low complexity by default

71
Sites using WU-BLAST
  • European Bioinformatics Institute BLAST server
  • EMBL Advanced BLAST2 Search
  • Institut Pasteur
  • Berkeley Drosophila Genome Project (BDGP)
  • European Drosophila Genome Project at the EBI
  • Mendel Bioinformatics Group at the John Innes
    Centre
  • Mouse Genome Database at the Jackson Laboratory
  • PlasmoDB (Plasmodium Genome Resource) at the
    University of Pennsylvania
  • Saccharomyces cerevisiae search at Stanford
    University
  • TAP (Transcript Assembly Program) by Zhengyan
    ("George") Kan
  • TIGR Gene Indices
  • TIGR Microbial Genomes Database
  • WU Genome Sequencing Center
Write a Comment
User Comments (0)
About PowerShow.com