Core Module 7 Bioinformatics

1
Core Module 7 Bioinformatics
  • Sequence Comparisons
  • February 13, 2008
  • Bruce Byrne, PhD

2
Sequence Alignment
  • What we will do
  • Ask what considerations underlie comparing two or
    more sequences
  • Step through and calculate simple sequence
    comparisons using various assumptions
  • Review how to use several sequence comparison
    tools

3
Sequence Comparisons: Finding Similarities
  • What is Sequence Alignment?
  • Procedure for comparing two (or more) sequences
  • Individual characters aligned, in rows, to best
    match
  • Two sequences are said to be aligned by writing
    them across in two rows
  • Identical (or similar) characters are matches
  • We will discuss similarity later
  • Non-identical characters are mismatches
  • Gaps can be introduced in either (or both)
    sequences
  • How would gaps appear in evolution?
  • What is the likely consequence of small deletions
    in coding sequences?
  • Why might we think differently about gaps within
    a sequence rather than gaps at the ends of
    sequences?

4
Sequence Alignments: Interpretations and Importance
  • Why do we do Sequence Alignment?
  • Defines degree and location of possible
    similarities
  • Can look at entire sequence or localized
    similarities
  • Evolutionary relationships and relationship of
    sequence to function
  • Model sequence to function and structure

5
Alignment Tools
  • Different applications (computer programs)
    support quite different alignment needs
  • Dot Matrix Comparisons
  • Visualize the geometry of similarities
  • Variable Numbers of Sequences
  • Pairwise alignment - only two sequences compared
  • One sequence per file
  • Multiple alignment - multiple sequences compared
  • Multiple sequences per file
  • What is the Question?
  • Global alignment - aligns sequences over their
    entire length
  • Local alignment - determines the longest/best
    subsequence pair that gives maximum similarity

6
How and Where Identical?
  • LGPSSKQTGKGSSRIWDN
  • LNITKSAGKGAIMRLGDA

7
Two Possible Answers
  • Global
    LGPSSKQTGKGS-SRIWDN
    LN-ITKSAGKGAIMRLGDA
  • Local
    -------TGKG--------
    -------AGKG--------

Figure 1 from Bioinformatics: Sequence and Genome
Analysis

8
Dot Plot
  • Gibbs and McIntyre (1970), Eur. J. Biochem.
  • Full comparison
  • Gives the big picture: a visual depiction of the
    sequence relationship
  • Finding direct or inverted repeats
  • Steps
  • Create a two-dimensional matrix placing the
    N-terminal end (in the case of proteins) in the
    top-left corner
  • For every match, a dot is placed in the position
    of the intersection
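
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names `dot_plot` and `render` are made up for this example, and a "match" is assumed to mean exact character identity.

```python
def dot_plot(seq_a, seq_b):
    """Build the dot matrix: one row per residue of seq_b, one column per
    residue of seq_a. A cell is True where the two residues are identical."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text, with seq_a across the top (starting in the
    top-left corner) and seq_b down the side. '*' marks a dot."""
    lines = ["  " + seq_a]
    for residue, row in zip(seq_b, matrix):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Comparing a sequence with itself puts a dot on every diagonal cell.
    print(render(dot_plot("GATTACA", "GATTACA"), "GATTACA", "GATTACA"))
```

Every cell is visited once, so the work grows with the product of the two sequence lengths.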

9
Running a Dot Plot
Two-dimensional grid with the sequences entered
along j and i. In this case, the two sequences are
identical. Compare the two sequences in each cell.
[Figure: dot-plot grid with axes j and i, labeled
Sequence A and Sequence B]

10
Anatomy of a Dot Plot
Note that cells (1,1), (2,2), (3,3), etc. hold
identical residues. The connected dots create a
diagonal visualizing the identity.
What's our running time to traverse the entire
matrix?

11
Output: Cytochrome C (Cox1), Human vs. Bacterium,
at Different Stringency

12
Dotmatcher Stringency
A window of specified length is moved up all
possible diagonals and a score is calculated
within each window for each position along the
diagonals. The score is the sum of the
comparisons of the two sequences using the given
similarity matrix along the window. If the score
is above the threshold, then a line is plotted on
the image over the position of the window.
  • Recommendations
  • For DNA comparisons: long windows, high
    stringencies
  • For protein comparisons: short windows and low
    stringencies
  • For a short domain of partial similarity, use a
    longer window and a small stringency
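
The window-and-threshold idea described above can be sketched as follows. This is a simplified illustration and not the dotmatcher source: the function name `windowed_dots` and its parameters are hypothetical, and plain identity scoring (1 for a match, 0 otherwise) stands in for a real similarity matrix unless a `score` function is supplied.

```python
def windowed_dots(seq_a, seq_b, window=3, threshold=2, score=None):
    """Slide a window along every diagonal and report the window start
    positions (i in seq_b, j in seq_a) whose summed score is above threshold."""
    if score is None:
        # Assumption for this sketch: identity scoring instead of a matrix.
        score = lambda x, y: 1 if x == y else 0
    hits = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            total = sum(score(seq_a[j + k], seq_b[i + k]) for k in range(window))
            if total > threshold:  # "above the threshold", as in the text
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # A 4-residue window that must match at all 4 positions (score > 3).
    print(windowed_dots("ACGTACGT", "ACGTACGT", window=4, threshold=3))
```

Raising the threshold relative to the window length removes isolated chance matches; lowering it lets weaker, partial similarities show up.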

13
Similarity Matrix: Blosum62

14
The Blosum Matrix
  • BLOcks of Amino Acid SUbstitution Matrix
  • Variety of matrices derived by observation
  • Reflect frequency of substitutions observed in
    highly conserved, well aligned sequences from a
    variety of taxa
  • Blosum62 frequently employed
  • Higher number (e.g. Blosum80) might be better for
    very closely related species
  • Lower number for distant relatives

15
Summary on Dot Plot
  • Advantages
  • Highly illustrative of alignment issues
  • All possible matches of residues between two
    sequences are found
  • Good for finding direct and inverted repeats
  • Allows for fast visual inspection
  • Disadvantages
  • Random matches cause noise
  • Computer cannot visually detect diagonals
  • Diagonals can be missed by visual inspection
  • Unreasonable for large number of comparisons
  • Doesn't give good statistics for comparison

16
Alternatives to Doing an Alignment
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT

------CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT---------
(Human)

CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
(Computer)
  • How many matches?
  • How many gaps?
  • Meaning of the gaps?

17
Scoring an Alignment
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
  • Score for each match is given by m (1 is used
    here)
  • Score for each mismatch is given by n (0 is used
    here)
  • Score for each gap we introduce is given by g (1
    is used here)
  • Sum the match scores, then subtract the mismatch
    and gap penalties
  • For the example above, the score is
    7 - (0 + 1) = 6
What kind of alignment is shown above?
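
The scoring rule on this slide can be checked with a short Python sketch. The function name `score_alignment` is made up for illustration. One assumption is made explicit: a "gap we introduce" is counted once per run of consecutive gap characters in one row, which is the reading under which the slide's example gives 6.

```python
def score_alignment(row_a, row_b, m=1, n=0, g=1, gap_char="-"):
    """Score two aligned rows: matches * m - (mismatches * n + gaps * g).
    A gap is counted once per run of gap characters in the same row."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    previous_side = None  # which row held a gap in the previous column
    for a, b in zip(row_a, row_b):
        side = "a" if a == gap_char else "b" if b == gap_char else None
        if side is None:
            if a == b:
                matches += 1
            else:
                mismatches += 1
        elif side != previous_side:
            gaps += 1  # a new gap run starts here
        previous_side = side
    return matches * m - (mismatches * n + gaps * g)


if __name__ == "__main__":
    top = "CCTTCAGAATACAGAATAGGGACATAGAGA"
    bottom = "ATCCCA---CCCAGCCCCCTGGACCTGTAT"
    print(score_alignment(top, bottom))  # 7 matches, one gap: 7 - (0 + 1) = 6
```

Changing `n` or `g` shows how sensitive the ranking of candidate alignments is to the chosen penalties.
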
18
Number of Possible Optimal Alignments
Example of five sequence alignments
  1        2        3        4         5
AG.GC    A.GGC    .AGGC    A..GGC    .A.GGC
AATGC    AATGC    AATGC    AATG.C    AATG.C
What if we imposed a penalty, e.g., -1, for
introducing gaps? Which alignment(s) would be
better?
There may be more than one optimal solution to
a problem
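
One way to explore the question on this slide is to score the five alignments with and without a gap penalty. The sketch below is illustrative only: the names `ALIGNMENTS` and `score` are invented here, matches score 1, mismatches score 0, and the penalty is assumed to apply to every gap position (marked with "." as on the slide).

```python
# The five alignments of AGGC against AATGC from the slide.
ALIGNMENTS = {
    1: ("AG.GC", "AATGC"),
    2: ("A.GGC", "AATGC"),
    3: (".AGGC", "AATGC"),
    4: ("A..GGC", "AATG.C"),
    5: (".A.GGC", "AATG.C"),
}


def score(top, bottom, gap_penalty=0, gap_char="."):
    """Add 1 per identical column and gap_penalty per column holding a gap."""
    total = 0
    for a, b in zip(top, bottom):
        if gap_char in (a, b):
            total += gap_penalty
        elif a == b:
            total += 1
    return total


if __name__ == "__main__":
    for label, (top, bottom) in ALIGNMENTS.items():
        print(label, score(top, bottom), score(top, bottom, gap_penalty=-1))
```

Without a penalty all five alignments tie at 3. With a penalty of -1 per gap position, alignments 1 to 3 score 2 and alignments 4 and 5 score 0, so three alignments remain equally optimal.
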
19
Optimal Sequence Alignment Methods
  • Total number of distinct alignments (with gaps)
    is usually extraordinarily large
  • How do we identify the best one?
  • Brute force method of trying every possible gap
    placement is slow
  • Roughly N^M, where N is the length of sequence A
    and M is the length of sequence B
  • Dynamic programming offers a more efficient
    solution (but still expensive), with time
    proportional to N^3, where N is the length of
    the longer sequence
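
To get a feel for how quickly the number of gapped alignments grows, the sketch below counts them exactly for short sequences. It rests on a stated assumption: every column of an alignment is either a residue pair, a gap in sequence A, or a gap in sequence B, and different orderings of adjacent gaps count as different alignments. The function name `count_alignments` is made up for this example.

```python
def count_alignments(n, m):
    """Count gapped alignments of sequences with lengths n and m.
    The last column is a residue pair, a gap in A, or a gap in B, so
    count(i, j) = count(i-1, j-1) + count(i-1, j) + count(i, j-1)."""
    table = [[1] * (m + 1) for _ in range(n + 1)]  # one way to align with nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = table[i - 1][j - 1] + table[i - 1][j] + table[i][j - 1]
    return table[n][m]


if __name__ == "__main__":
    for length in (2, 5, 10, 20):
        print(length, count_alignments(length, length))
```

The counting itself is a small dynamic program: it fills an (n+1) by (m+1) table once instead of listing the alignments one by one.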

20
Dynamic Programming
  • Computational method used to align sequences
  • Solution not known in advance but built as we go,
    hence dynamic
  • Optimizes a solution to a problem
  • Builds on previously computed optimal solutions
    to sub-parts of the original problem (recursion)
  • Alignment is guaranteed to be optimal for the
    chosen scoring scheme

21
Alignment Algorithms
  • Needleman-Wunsch (1970) algorithm is a global
    alignment algorithm
  • General algorithm for sequence comparison
  • May miss important local alignments
  • A global alignment may not be biologically
    relevant
  • Smith-Waterman (1981) algorithm is a local
    alignment method
  • Scoring system includes negative mismatch scores
  • Minimum score recorded in matrix is zero
  • End of optimal path is not restricted to last row
    or column

22
Needleman-Wunsch
  • Fundamental principle
  • To calculate the alignment score S(i,j), you only
    need to enumerate and score all the ways in which
    one aligned pair can be added to a shorter
    alignment to produce an alignment of the first i
    residues of seq1 and the first j residues of seq2
  • All possible pairs are represented by a
    two-dimensional array, and all possible
    comparisons are represented by pathways through
    this array
  • Global alignment: every residue of the two
    sequences has to participate, so it will not
    detect motif or active-site homology alone
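
The principle above can be sketched as a small dynamic program. This is a teaching sketch under stated assumptions, not the implementation used by any alignment package: the function name `needleman_wunsch`, the default scores (match 1, mismatch -1), and the single per-position gap penalty are all choices made for this example.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment. S[i][j] is the best score for aligning the first i
    residues of seq1 with the first j residues of seq2."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap  # seq1 prefix aligned against gaps only
    for j in range(1, cols):
        S[0][j] = j * gap  # seq2 prefix aligned against gaps only

    def pair(a, b):
        return match if a == b else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(
                S[i - 1][j - 1] + pair(seq1[i - 1], seq2[j - 1]),  # add a pair
                S[i - 1][j] + gap,  # residue of seq1 against a gap
                S[i][j - 1] + gap,  # residue of seq2 against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    out1, out2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair(seq1[i - 1], seq2[j - 1]):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-"); i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1]); j -= 1
    return S[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    print(needleman_wunsch("AGGC", "AATGC", match=1, mismatch=0, gap=-1))
```

The traceback returns only one optimal alignment; when several paths tie, the others are equally valid.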

23
Smith-Waterman
  • Based on Needleman-Wunsch
  • Instead of looking at each sequence in its
    entirety, compare segments of all possible
    lengths and choose whichever optimizes the
    similarity measure (local alignments)
  • Assign a negative score for a mismatch, and a
    negative score for each insertion/deletion that
    depends on its length
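
A local alignment along these lines can be sketched as follows. It is a simplified illustration, not the code behind any specific tool: the function name `smith_waterman` and the default scores are assumptions, and a single per-position gap penalty is used in place of a length-dependent one.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment. H[i][j] is the best score of any alignment ending at
    seq1[i-1] and seq2[j-1]; scores are never allowed to drop below zero."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0

    def pair(a, b):
        return match if a == b else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,  # start a fresh local alignment here
                H[i - 1][j - 1] + pair(seq1[i - 1], seq2[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_i, best_j = H[i][j], i, j

    # Trace back from the highest cell anywhere in the matrix until a zero.
    out1, out2 = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair(seq1[i - 1], seq2[j - 1]):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-"); i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1]); j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


if __name__ == "__main__":
    print(smith_waterman("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA"))
```

Compared with a global alignment, the two differences are the floor of zero in each cell and the traceback that starts at the highest-scoring cell rather than at the bottom-right corner.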

24
Global Alignment Implementation
  • Needle

25
Local Alignment Implementation
  • matcher

26
Multiple Alignment Implementation
  • emma and prettyplot

27
Summary
  • We should be able to choose the correct
    application depending on
  • What question we are asking
  • What we know about the sequences
  • What we need to find out about similarities
  • We are also now aware of the important difference
    between identity and similarity
  • We can make good judgments about how to interpret
    some gaps