E. Sequence Alignment - PowerPoint PPT Presentation

Provided by: stephe87

1
E. Sequence Alignment
  • Part I: Overview and foundation. A. Central
    Dogma, B. Accessing Databases, C. Data Mining,
    D. Tree of life.
  • Part II: Gene Sequence. E. Sequence Alignment, F.
    Blasting (NCBI Tutorial), G. SNP (Video), H.
    Methods for Manipulating DNA and RNA (Ch. 8 MBoC,
    pp. 469-513), I. Hidden Markov Model (Handouts),
    J. Gene Finding (Handout), K. Microarray Analysis
    (Ch. 4 and Ch. 5 DGPB)
  • Part III: Gene Regulation. L. Cell Chemistry and
    Biosynthesis (Ch. 2 MBoC), M. Proteins (Ch. 3
    MBoC), N. DNA and Chromosomes (Ch. 4 MBoC), O.
    From DNA to Protein (Ch. 6 MBoC), P. Control of
    Gene Expression (Ch. 7 MBoC)
  • Part IV: Proteomics. Q. Introduction (Ch. 6
    DGPB), R. Subcellular Localization (Handout), S.
    Structural Prediction (Handout), T. Protein
    Interaction (Handout), U. Mass Spec Proteomic
    Informatics (Handout)
  • Part V: Whole Genome Perspective. V. Protein and
    Gene Networking (Handout, some parts of Unit
    Three DGPB)
  • Part VI: Applications and Conclusion. W. Cancer
    (Ch. 23 MBoC), X. Pathogens (Ch. 25 MBoC), Y.
    Case Studies (some parts of Unit Four DGPB), Z.
    Future Directions (Handout)

2
E. Sequence Alignment
  • E. Sequence Alignment
  • E.1. Sequence Similarity
  • E.2. Dynamic Programming
  • E.3. Blasting (and PSI-BLAST)
  • Reading:
  • The NCBI BLAST education pages (all three
    sections).
  • Altschul et al. on PSI-BLAST and on improving
    PSI-BLAST (using ROC curves)

3
E. Sequence Alignment: Purpose of Sequence
Alignment and Blasting
  • What is an alignment, and why might it be
    significant?
  • An alignment is a mapping from one sequence to
    another, identifying elements that are likely to
    have arisen from a common ancestor
  • A good alignment is an indication of homology
  • Alignments are NOT exact matches. We will need a
    method to find good alignments in a database...

4
E. Sequence Alignment: Phylogenetic Tree

5
E.1 Sequence Similarity: Similarity vs. Homology,
Paralogs vs. Orthologs
  • Homology is an evolutionary relationship that
    either exists or does not. It cannot be partial.
  • An ortholog is a homolog that arose through a
    speciation event. Orthologs often retain a
    shared function.
  • A paralog is a homolog that arose through a gene
    duplication event. Paralogs often have divergent
    function.
  • Similarity is a measure of the quality of
    alignment between two sequences. High similarity
    is evidence for homology. Similar sequences may
    be orthologs or paralogs.

6
E.1 Sequence Similarity: How do we compute
similarity?
  • Similarity can be defined by counting positions
    that are identical between two sequences
  • Gaps (insertions/deletions) can be important:
    abcdef    abcdef    abcdef
    abceef    acdef     a-cdef
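
Counting identical positions can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity` is made up here, and it assumes the two rows are already aligned (equal length, with "-" marking a gap).

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of aligned columns that hold the same residue in both rows."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    # A column counts as identical only if the residues match and neither is a gap.
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * identical / len(row_a)


# The first and third pairs from the slide: one mismatch, then one gap.
print(percent_identity("abcdef", "abceef"))
print(percent_identity("abcdef", "a-cdef"))
```

Both pairs have five identical columns out of six, which is why the unaligned middle pair (abcdef vs. acdef) needs a gap before it can be compared fairly.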

7
E.1 Sequence Similarity: Not all mismatches are
the same
  • Some amino acids are more substitutable for each
    other than others. Serine and threonine are
    more alike than tryptophan and lysine.
  • We can introduce "mismatch costs" for handling
    different substitutions.
  • We don't usually use mismatch costs in aligning
    nucleotide sequences, since no substitution is
    per se better than any other.

8
E.1 Sequence Similarity: Many possible alignments
to consider
  • Without gaps, there are N×M possible
    alignments between sequences of length N and M
  • Once we start allowing gaps, there are many
    possible arrangements to consider:
    abcbcd    abcbcd    abcbcd
    abc--d    a--bcd    ab--cd
  • This becomes a very large number when we allow
    mismatches, since we then need to look at every
    possible pairing between elements: there are
    roughly N^M possible alignments.

9
E.1 Sequence Similarity: Avoiding random
alignments with a score function
  • Not only are there many possible gapped
    alignments, but introducing too many gaps makes
    nonsense alignments possible:
    s--e-----qu---en--ce
    sometimesquipsentice
  • Need to distinguish between alignments that occur
    due to homology, and those that could be expected
    to be seen just by chance.
  • Define a score function that accounts for both
    element mismatches and a gap penalty

10
E.1 Sequence Similarity: Match scores
  • Match scores are often calculated on the basis of
    the frequency of particular mutations in very
    similar sequences.
  • We can transform substitution frequencies into
    log odds scores, which can then be added together.
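
A minimal sketch of the log-odds idea, assuming the usual form score = log2(observed pair frequency / frequency expected by chance). The function name, the scaling to half-bit units, and the frequencies are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def log_odds(q_ab: float, p_a: float, p_b: float, scale: float = 2.0) -> int:
    """Rounded log-odds score in units of 1/scale bits.

    q_ab     : frequency of the pair (a, b) in trusted alignments
    p_a, p_b : background frequencies of a and b
    """
    return round(scale * math.log2(q_ab / (p_a * p_b)))


# Hypothetical frequencies, for illustration only.
print(log_odds(0.02, 0.1, 0.1))   # pair seen twice as often as chance: positive
print(log_odds(0.005, 0.1, 0.1))  # pair seen half as often as chance: negative
```

Because each score is a logarithm, adding the scores of independent columns corresponds to multiplying their odds ratios, which is what makes a summed alignment score meaningful.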

11
E.1 Sequence Similarity: Local vs. Global
alignments
  • A global alignment includes all elements of a
    sequence, and includes gaps
  • A global alignment may or may not include "end
    gap" penalties.
  • A local alignment includes only subsequences,
    and is sometimes computed without gaps.
  • Local alignments can find shared domains in
    divergent proteins and are fast to compute
  • Global alignments are better indicators of
    homology and take longer to compute.

12
E.1 Sequence Similarity: An alignment score
  • An alignment score is the sum of all the match
    scores of an alignment, with a penalty subtracted
    for each gap.
  • Gap penalties are usually "affine", meaning that
    the penalty for one long gap is smaller than the
    penalty for many smaller gaps that add up to the
    same size.
  • Example (gap start penalty 10, continuation
    penalty 2):
    a  b  c  -  -  d
    a  c  c  e  f  d
    9  2  7        6     match scores, sum = 24
    Alignment score = 24 - (10 + 2) = 12
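
The score of the example alignment can be recomputed with a short Python sketch. The function name and the dictionary of match scores are illustrative; the sketch assumes the first column of a gap costs the gap start penalty and every further column of the same gap costs the continuation penalty, which is how the slide's 10 + 2 arises.

```python
def alignment_score(row_a, row_b, match_score, gap_open, gap_extend):
    """Sum of match scores minus affine gap penalties for one fixed alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            cur_gap = "a" if a == "-" else "b"
            # Continuing the same gap is cheaper than opening a new one.
            total -= gap_extend if cur_gap == prev_gap else gap_open
            prev_gap = cur_gap
        else:
            total += match_score[(a, b)]
            prev_gap = None
    return total


# Match scores taken from the slide's example.
scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}
print(alignment_score("abc--d", "accefd", scores, gap_open=10, gap_extend=2))
```
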
13
E.2 Dynamic Programming: Finding the optimal
alignment
  • Given a pair of sequences and a score function,
    identify the best scoring (optimal) alignment
    between the sequences.
  • Remember, there is an exponential number of
    possible alignments (most with terrible scores).
  • Computer science to the rescue: dynamic
    programming identifies optimal alignments in time
    proportional to the product of the lengths of the
    sequences

14
E.2. Dynamic programming
  • The name comes from an operations research task,
    and has nothing to do with writing programs.
  • Dynamic programming alignments are a key
    technology in bioinformatics, and you should
    understand how they work.
  • Called Needleman-Wunsch (global alignment) or
    Smith-Waterman (local alignment)

15
E.2. Dynamic Programming: The recurrence
  • The idea: a recursion for the cost that you want
    to optimize
  • Boundary condition
  • Recurrence
  • Example: pairwise sequence alignment
  • $C_{i,j}$ = optimal cost for aligning the prefixes
    $A_{1..i}$ and $B_{1..j}$, where A and B are input
    sequences of length n and m.
  • Boundary condition:
  • $C_{0,j} = C_{i,0} = 0$ for $i, j = 0$.
  • $C_{0,j} = C_{i,0}$ = cost of deletion (summed
    over the prefix) for $i, j > 0$.
  • Recurrence: $C_{i,j}$ is the minimum of
  • $C_{i-1,j-1}$ + cost of substituting $A_i$ with $B_j$
  • $C_{i-1,j}$ + cost of deletion
  • $C_{i,j-1}$ + cost of insertion

16
E.2. Dynamic Programming: Filling up the table
  • Using the recurrence compute all Ci,j in a
    table C
  • At the same time, keep track of which case you
    use to get Ci,j in another table F
  • Use F to backtrack and construct the solution.

[Figure: table C with cell Ci,j; arrows labeled (1), (2), (3) mark the three recurrence cases recorded for backtracking]
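
The table-filling and backtracking steps can be sketched in Python as follows. This is a minimal global alignment under assumed unit costs (0 for a match, 1 for a mismatch, 1 per inserted or deleted element); the names `align`, `sub_cost` and `indel` are illustrative and not taken from the slides.

```python
def align(a, b, sub_cost=lambda x, y: 0 if x == y else 1, indel=1):
    """Minimum-cost global alignment; returns (cost, row_a, row_b)."""
    n, m = len(a), len(b)
    C = [[0] * (m + 1) for _ in range(n + 1)]     # optimal costs
    F = [[None] * (m + 1) for _ in range(n + 1)]  # case used, for backtracking
    for i in range(1, n + 1):                     # boundary: delete a prefix of a
        C[i][0], F[i][0] = i * indel, "del"
    for j in range(1, m + 1):                     # boundary: insert a prefix of b
        C[0][j], F[0][j] = j * indel, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (C[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]), "sub"),
                (C[i - 1][j] + indel, "del"),
                (C[i][j - 1] + indel, "ins"),
            ]
            C[i][j], F[i][j] = min(choices, key=lambda c: c[0])
    # Backtrack from the bottom-right corner to the top-left corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = F[i][j]
        if step == "sub":
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif step == "del":
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return C[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


cost, top, bottom = align("abcdef", "acdef")
print(cost)
print(top)
print(bottom)
```

The two nested loops visit every cell once, so the running time grows with the product of the two sequence lengths.
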
17
E.2. Dynamic Programming: Dynamic programming
alignment
  • Each cell has the score for the best aligned
    sequence prefix up to that position.
  • Start by filling in initial gap and first element
    to first element match score
  • Use arrow to indicate path to that alignment

18
E.2. Dynamic Programming: Continue filling in
optimal path scores
  • For each cell, have three choices for how to get
    there from the last optimal alignment (match, gap
    sequence 1, gap sequence 2).
  • Best score(s) are selected, and arrows added to
    indicate the route. Candidate scores in the
    example: -5 + 5 = 0, 5 + (-5) = 0, -7 + (-5) = -12

19
E.2. Dynamic Programming: Optimal alignment by
traceback
  • We trace back a path that gets us the highest
    score. If we don't have end gap penalties,
    then take any path from the last row or column to
    the first.
  • Otherwise the path needs to include the top and
    bottom corners
  • Example:
    AACADCD
    ---A-CD

20
E.2. Dynamic Programming: How do we pick match
scores?
  • For match scores, two main options:
  • PAM: based on global alignments of closely related
    sequences. Normalized to changes per 100 sites,
    then exponentiated for more distant relatives.
  • BLOSUM: based on local alignments in much more
    diverse sequences
  • Picking the right distance is important, and may
    be hard to do. BLOSUM seems to work better for
    more evolutionarily distant sequences. BLOSUM62
    is a good default.

21
E.2. Dynamic Programming: Picking gap penalties
  • Many different possible forms
  • Most common is affine (gap open + gap continue
    penalties)
  • More complex penalties have been proposed.
  • Penalties must be commensurate with match scores.
    Therefore, the match scoring scheme influences
    the gap penalty
  • Most alignment programs suggest appropriate
    penalties for each match score option.

22
E.2. Dynamic Programming: Searching for optimal
scores
  • One possibility is to try several different match
    scores and gap penalties, and choose the best
  • In general, this is called parameter space search
    and it is important in many areas.
  • Problems:
  • it requires a lot of computation
  • we need some principled way to compare the
    results.
  • Use significance testing to compare...

23
E.2. Dynamic Programming: The significance of an
alignment
  • Significance testing is the branch of statistics
    that is concerned with assessing the probability
    that a particular result could have occurred by
    chance.
  • How do we calculate the probability that an
    alignment occurred by chance?
  • Either with a model of evolution, or
  • Empirically, by scrambling our sequences and
    calculating scores on many randomized sequences.
  • Extreme value statistics (max, not sum)
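
The empirical approach can be sketched as a shuffle test in Python. Everything here is an assumption made for illustration: the simple scoring scheme (+1 match, -1 mismatch, -2 per gap position), the function names, and the number of trials. The sketch counts how often a shuffled sequence scores at least as well as the real one; it does not fit an extreme value distribution.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def empirical_p_value(a, b, trials=200, seed=0):
    """Fraction of shuffled versions of b that score at least as well as b."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = global_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps composition, destroys order
        if global_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


seq = "ACDEFGHIKLMNPQRSTVWY"
print(empirical_p_value(seq, seq))
```

Shuffling preserves the composition of the sequence, so a low p-value says the score depends on the order of the elements and not just on which elements are present.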

24
E.2. Dynamic Programming: Metric-space database
search
  • A metric is a function with these
    characteristics:
  • f(A,B) = f(B,A)
  • f(A,A) = 0
  • f(A,B) + f(B,C) ≥ f(A,C) (the triangle inequality)
  • O(log n) database search for the closest (inexact)
    match can be done for metrics
  • Miranker, et al. have applied this to sequence
    databases (transforming score matrix)
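
Why the triangle inequality helps a search can be shown with a small Python sketch. It is not the method of Miranker et al. and it is not a logarithmic-time index: it is a linear scan with a single pivot, using Hamming distance on equal-length strings as a stand-in metric. The names `hamming` and `nearest` are illustrative.

```python
def hamming(a, b):
    """Number of differing positions; a metric on strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def nearest(query, db, pivot, dist=hamming):
    """Closest database entry, skipping entries ruled out by the pivot.

    Returns (best_entry, best_distance, number_of_distances_computed).
    """
    to_pivot = {s: dist(pivot, s) for s in db}  # computed once per database
    d_query_pivot = dist(query, pivot)
    best, best_d, computed = None, float("inf"), 0
    for s in db:
        # Triangle inequality: dist(query, s) >= |dist(query, pivot) - dist(pivot, s)|.
        if abs(d_query_pivot - to_pivot[s]) >= best_d:
            continue  # s cannot beat the current best, so skip the comparison
        d = dist(query, s)
        computed += 1
        if d < best_d:
            best, best_d = s, d
    return best, best_d, computed


db = ["AAAA", "AAAT", "TTTT", "TTTA"]
print(nearest("AAAT", db, pivot="AAAA"))
```

In this toy database two of the four comparisons are skipped. Metric-space index structures apply the same bound recursively to discard whole groups of entries at once.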

25
E.3 Blasting: Why BLAST?
  • Dynamic programming solutions to alignment
    problems are relatively slow, and don't lend
    themselves to efficient database search.
  • Need some way to search a large database to find
    sequences that have an inexact match to a query
    sequence
  • Competing solutions: FASTA and BLAST.
  • Both are imperfect approximations to DP. DP finds
    some distantly related sequences that the
    approximations miss.
  • BLAST is more commonly used, although both are
    fine.

26
E.3 Blasting: Sequence search basics
  • BLAST/FASTA are 50-100x faster than DP
  • If searching for coding regions, always translate
    nucleotide to amino acid sequence.
  • Use appropriate substitution and gap scores
  • BLOSUM62 is good for weak protein similarities
  • Use PAM30, PAM70 or BLOSUM80 for better results
    on more similar sequences, BLOSUM45 for the most
    distant
  • Use low-complexity filters and, for human
    sequence, filter out human repeats (ALUs, etc.)

27
E.3 Blasting: How does BLAST work?
  • BLAST2 (gapped BLAST)
  • Break sequence into overlapping words, by
    default of length 3. There are n - l + 1 words of
    size l for a sequence of length n.
    ABCDE → ABC, BCD, CDE
  • For each word, define about 50 other words that
    are similar (use substitution matrix, threshold T)
  • Repeat for each of the n - l + 1 words, giving
    about 50n words (out of 20^3 = 8000 possible)
  • Use a hash table to find all places in DB with
    exact match to any of those words.
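
The seeding step can be sketched in Python as below. This is a toy version under stated assumptions: `toy_score` (+5 for identical residues, -1 otherwise) stands in for a real substitution matrix, the threshold value is arbitrary, and all function names are illustrative. It is not how the BLAST programs are implemented internally.

```python
from collections import defaultdict
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def words(seq, size=3):
    """All overlapping words of the given size: len(seq) - size + 1 of them."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def toy_score(x, y):
    # Stand-in for a real substitution matrix such as BLOSUM62.
    return 5 if x == y else -1


def neighbors(word, threshold, score=toy_score):
    """Every possible word whose score against `word` reaches the threshold."""
    return ["".join(c) for c in product(AMINO, repeat=len(word))
            if sum(score(x, y) for x, y in zip(word, c)) >= threshold]


def build_index(db, size=3):
    """Hash table from each database word to its (sequence name, position) list."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos, w in enumerate(words(seq, size)):
            index[w].append((name, pos))
    return index


def seed_hits(query, index, threshold, size=3):
    """(query position, sequence name, database position) for each word hit."""
    hits = []
    for qpos, w in enumerate(words(query, size)):
        for nb in neighbors(w, threshold):
            for name, dpos in index.get(nb, []):
                hits.append((qpos, name, dpos))
    return hits


index = build_index({"s1": "MKACDW"})
print(words("ABCDE"))
print(len(neighbors("ACD", threshold=9)))
print(seed_hits("ACE", index, threshold=9))
```

With this toy scoring and threshold 9, a word's neighborhood is the word itself plus every word that differs at exactly one position, so the query word ACE still finds ACD in the database even though the match is inexact.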

28
E.3 Blasting: Extending alignments
  • Identify database sequences that contain several
    matching words on the same diagonal (think DP
    alignments) and within a short distance.
  • Extend these short, ungapped alignments in both
    directions along the sequence so long as score of
    alignment increases.
  • Call these extended alignments HSPs, for
    high-scoring segment pairs
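
Ungapped extension of one seed can be sketched in Python as follows. The function name, the scoring function and the `x_drop` parameter are assumptions for illustration: extension in each direction stops once the running score falls more than `x_drop` below the best score seen, and with `x_drop=0` it stops at the first decrease, which matches the rule on the slide.

```python
def extend(query, subject, qpos, spos, size, score, x_drop=0):
    """Extend an ungapped seed of length `size` in both directions.

    Returns (best_score, q_start, q_end) with q_end exclusive.
    """
    best = run = sum(score(query[qpos + k], subject[spos + k]) for k in range(size))
    right = 0
    k = 0
    # Extend to the right along the same diagonal.
    while qpos + size + k < len(query) and spos + size + k < len(subject):
        run += score(query[qpos + size + k], subject[spos + size + k])
        k += 1
        if run > best:
            best, right = run, k
        elif best - run > x_drop:
            break
    run = best
    left = 0
    k = 0
    # Extend to the left along the same diagonal.
    while qpos - 1 - k >= 0 and spos - 1 - k >= 0:
        run += score(query[qpos - 1 - k], subject[spos - 1 - k])
        k += 1
        if run > best:
            best, left = run, k
        elif best - run > x_drop:
            break
    return best, qpos - left, qpos + size + right


def toy(x, y):
    # Hypothetical scoring: +5 for a match, -4 for a mismatch.
    return 5 if x == y else -4


best, start, end = extend("WWACDEFKK", "PPACDEFRR", qpos=3, spos=3, size=3, score=toy)
print(best, "WWACDEFKK"[start:end])
```

The seed CDE grows by one matching residue on each side and stops at the first mismatch, giving the segment ACDEF.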

29
E.3 Blasting: PSI-BLAST
  • Position Specific Iterated BLAST
  • Intuition: substitution matrices should be
    specific to a particular site. Penalize
    alanine→glycine more in a helix.
  • Idea: Use BLAST with high stringency to get a set
    of closely related sequences. Align those
    sequences to create a new substitution matrix for
    each position. Then use that matrix
    (iteratively) to find additional sequences.
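
Building a position-specific matrix from aligned sequences can be sketched in Python. This is a deliberately simplified illustration: it assumes a uniform background frequency and a flat pseudocount, it ignores sequence weighting, and the function names are made up. It should not be read as the PSI-BLAST procedure itself.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """One dict of log-odds scores (in bits) per alignment column."""
    background = 1.0 / len(AMINO)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO)
        total = sum(counts.values()) + pseudocount * len(AMINO)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO
        })
    return pssm


def score_with_pssm(pssm, seq):
    """Ungapped score of seq against the matrix, one column per residue."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


pssm = build_pssm(["ACD", "ACE", "ACD"])
print(score_with_pssm(pssm, "ACD") > score_with_pssm(pssm, "WCD"))
```

Each column gets its own scores, so a residue that is conserved at one position is rewarded there and penalized elsewhere, which a single generic substitution matrix cannot express.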

30
E.3 Blasting: Why (not) PSI-BLAST
  • If the sequences used to construct the Position
    Specific Scoring Matrices (PSSMs) are all
    homologous, the sensitivity at a given
    specificity improves significantly.
  • However, if non-homologous sequences are included
    in the PSSMs, the PSSMs are corrupted. They then
    pull in more non-homologous sequences, and become
    worse than generic matrices.

31
E.3 Blasting: How to use PSI-BLAST
  • Set initial thresholds high. Inspect each
    iteration's result for suspicious sequences.
  • Do several iterations (5), or until no new
    sequences are found
  • Even if only looking for a small set of
    sequences, make the initial search very broad
  • First, search NR (the non-redundant database) with
    up to 5 iterations to build the PSSM
  • Then use that PSSM to search in the restricted
    domain

32
E.3 Blasting: PSI-BLAST example

33
E.3 Blasting: First iteration
...

34
E.3 Blasting: Second iteration
...

35
E.3 Blasting: PSI-BLAST caveats
  • Benefit: increased ability to find distant
    homologues
  • Cost: additional care is required to prevent
    non-homologous sequences from being included in
    the PSSM calculation.
  • When in doubt, leave it out!
  • Examine sequences with moderate similarity
    carefully.
  • Be particularly cautious about matches to
    sequences with highly biased amino acid content

36
E.3 Blasting: Some notes on sequence-based
database searching
  • Matches of >50% identity in a 20-40 amino acid
    region occur frequently by chance.
  • Most sequences that share statistically
    significant similarity throughout their entire
    lengths are homologous.
  • If A is homologous to B, and B to C, then A must
    be homologous to C, even if they share no
    significant sequence similarity.
  • Low complexity regions, transmembrane regions and
    coiled-coil regions often display significant
    similarity without homology
  • Screen them out of your query sequences!