Core Module 7 Bioinformatics - PowerPoint PPT Presentation

Slides: 26

Provided by: ryang

Transcript and Presenter's Notes

Title: Core Module 7 Bioinformatics

1
Core Module 7 Bioinformatics

Sequence Comparisons
February 13, 2008
Bruce Byrne, PhD

2
Sequence Alignment

What we will do
Ask what considerations underlie comparing two or
more sequences
Step through and calculate simple sequence
comparisons using various assumptions
Review how to use several sequence comparison
tools

3
Sequence Comparisons: Finding Similarities

What is Sequence Alignment?
Procedure for comparing two (or more) sequences
Individual characters aligned, in rows, to best
match
Two sequences are said to be aligned by writing
them across in two rows
Identical (or similar) characters are matches
We will discuss similarity later
Non-identical characters are mismatches
Gaps can be introduced in either (or both)
sequences
How would gaps appear in evolution?
What is the likely consequence of small deletions
in coding sequences?
Why might we think differently about gaps within
a sequence rather than gaps at the ends of
sequences?

4
Sequence Alignments: Interpretations and
Importance

Why do we do Sequence Alignment?
Defines degree and location of possible
similarities
Can look at entire sequence or localized
similarities
Evolutionary relationships and relationship of
sequence to function
Model sequence to function and structure

5
Alignment Tools

Different applications (computer programs)
support quite different alignment needs
Dot Matrix Comparisons
Visualize the geometry of similarities
Variable Numbers of Sequences
Pairwise alignment - only two sequences compared
One sequence per file
Multiple alignment - multiple sequences compared
Multiple sequences per file
What is the Question?
Global alignment - aligns sequences over their
entire length
Local alignment - determines the longest/best
subsequence pair that gives maximum similarity

6
How and Where Identical?

LGPSSKQTGKGSSRIWDN
LNITKSAGKGAIMRLGDA

7
Two Possible Answers

LGPSSKQTGKGS-SRIWDN
(Global)
LN-ITKSAGKGAIMRLGDA
-------TGKG--------
(Local)
-------AGKG--------

Figure 1 from Bioinformatics: Sequence and Genome Analysis

8
Dot Plot

Gibbs & McIntyre (1970), Eur. J. Biochem.
Full comparison
Gives a big picture: a visual depiction of the
sequence relationship
Finding direct or inverted repeats
Steps
Create a two-dimensional matrix placing the
N-terminal end (in the case of proteins) in the
top-left corner
For every match, a dot is placed in the position
of the intersection
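
The steps above can be sketched in Python. This is a minimal illustration that assumes exact identity as the match criterion; the names dot_plot and render and the text rendering are illustrative choices, not part of the original slides. The input is the peptide pair shown on slide 6.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix that is True wherever seq_b[i] == seq_a[j].

    Rows (i) follow seq_b and columns (j) follow seq_a, so the first
    residues of both sequences meet in the top-left corner.
    """
    return [[b == a for a in seq_a] for b in seq_b]


def render(matrix, seq_a, seq_b):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    lines = ["  " + seq_a]
    for residue, row in zip(seq_b, matrix):
        cells = "".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


seq_a = "LGPSSKQTGKGSSRIWDN"
seq_b = "LNITKSAGKGAIMRLGDA"
matrix = dot_plot(seq_a, seq_b)
print(render(matrix, seq_a, seq_b))
```

Every cell is visited once, so the work grows with the product of the two sequence lengths. The shared GKG segment shows up as three dots on one diagonal.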

9
Running a Dot Plot
Two-dimensional grid with the sequences entered along axes j and i. In
this case, the two sequences are identical.
[Figure: grid with axes j and i, labeled Sequence A and Sequence B]
Compare the residues of the two sequences in each cell

10
Anatomy of a Dot Plot
Note that cells (1,1), (2,2), (3,3), etc. hold identical residues.
The connected dots create a diagonal visualizing
the identity.
What's our running time to traverse the entire matrix?

11
Output: Cytochrome C (Cox1), Human vs. Bacterium,
at Different Stringency

12
Dotmatcher Stringency
A window of specified length is moved up all
possible diagonals and a score is calculated
within each window for each position along the
diagonals. The score is the sum of the
comparisons of the two sequences using the given
similarity matrix along the window. If the score
is above the threshold, then a line is plotted on
the image over the position of the window.

Recommendations
For DNA comparisons: long windows, high
stringencies
For protein comparisons: use short windows and
stringencies
For a short domain of partial similarity: use a
longer window and a small stringency
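
The windowing idea can be sketched in Python as follows. This is a simplified illustration, not the dotmatcher implementation: it scores with an assumed identity function (1 for a match, 0 otherwise) in place of a real similarity matrix, and the names windowed_hits, window, and threshold are hypothetical.

```python
def identity(x, y):
    """Assumed scoring: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


def windowed_hits(seq_a, seq_b, window, threshold, score=identity):
    """Return (i, j) window start positions whose score exceeds threshold.

    A window of length `window` is placed at every position along every
    diagonal; its score is the sum of the pairwise scores it covers.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            total = sum(score(seq_a[i + k], seq_b[j + k])
                        for k in range(window))
            if total > threshold:
                hits.append((i, j))
    return hits


seq_a = "LGPSSKQTGKGSSRIWDN"
seq_b = "LNITKSAGKGAIMRLGDA"
strict = windowed_hits(seq_a, seq_b, window=3, threshold=2)
loose = windowed_hits(seq_a, seq_b, window=3, threshold=1)
print("strict:", strict)
print("loose hit count:", len(loose))
```

Raising the threshold keeps only windows in which all three residues match; lowering it admits more windows and therefore more noise. A substitution matrix could be supplied through the score argument as any function of two residues.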

13
Similarity Matrix: BLOSUM62

14
The BLOSUM Matrix

BLOcks SUbstitution Matrix (amino acid substitutions)
Variety of matrices derived by observation
Reflect frequency of substitutions observed in
highly conserved, well-aligned sequences from a
variety of taxa
BLOSUM62 frequently employed
Higher number (e.g. BLOSUM80) might be better for
very closely related species
Lower number for distant relatives

15
Summary on Dot Plot

Advantages
Highly illustrative of alignment issues
All possible matches of residues between two
sequences are found
Good for finding direct and inverted repeats
Allows for fast visual inspection
Disadvantages
Random matches cause noise
Computer cannot visually detect diagonals
Diagonals can be missed by visual inspection
Unreasonable for large number of comparisons
Doesn't give good statistics for comparison

16
Alternatives to Doing an Alignment
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT

------CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT---------
(Human)

CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
(Computer)

How many matches?
How many gaps?
Meaning of the gaps?

17
Scoring an Alignment
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
Score for each match is given by m (1 is used here)
Score for each mismatch is given by n (0 is used here)
Score for each gap we introduce is given by g (1 is used here)
Sum the match scores and then reduce by n and g
For the example above, the score is 7 - (0 + 1) = 6
What kind of alignment is shown above?
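
The scoring rule on this slide can be checked with a short Python sketch. It assumes that g is charged once per run of gap characters, which is the reading that reproduces the slide's score of 6; the function name score_alignment is illustrative.

```python
def score_alignment(row_a, row_b, m=1, n=0, g=1):
    """Score two aligned rows of equal length.

    m: score per match, n: penalty per mismatch, g: penalty per gap
    introduced. A run of '-' characters is counted as a single gap
    (assumption; adjacent gap runs in different rows are also merged).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            if not in_gap:
                gaps += 1  # count the gap only where it opens
            in_gap = True
        else:
            in_gap = False
            if a == b:
                matches += 1
            else:
                mismatches += 1
    return matches * m - (mismatches * n + gaps * g)


row_a = "CCTTCAGAATACAGAATAGGGACATAGAGA"
row_b = "ATCCCA---CCCAGCCCCCTGGACCTGTAT"
print(score_alignment(row_a, row_b))  # 7 matches, 1 gap: 7 - (0 + 1) = 6
```

Changing n or g shows how strongly the ranking of candidate alignments depends on the chosen penalties.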
18
Number of Possible Optimal Alignments
Example of five sequence alignments
AG.GC   A.GGC   .AGGC   A..GGC   .A.GGC
AATGC   AATGC   AATGC   AATG.C   AATG.C
  1       2       3       4        5
What if we imposed a penalty, e.g., -1, for
introducing gaps? Which alignment(s) would be
better?
There may be more than one optimal solution to
a problem
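
The gap-penalty question can be explored with a small Python sketch. It reads the five alignments as row pairs (AG.GC over AATGC, and so on), treats '.' as a gap, and assumes the penalty is applied to every gap character; the match score of 1 and mismatch score of 0 are carried over from the previous slide as assumptions.

```python
def score(row_a, row_b, match=1, mismatch=0, gap=0):
    """Sum the column scores of an alignment; '.' marks a gap."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "." or b == ".":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


alignments = {
    1: ("AG.GC", "AATGC"),
    2: ("A.GGC", "AATGC"),
    3: (".AGGC", "AATGC"),
    4: ("A..GGC", "AATG.C"),
    5: (".A.GGC", "AATG.C"),
}

no_penalty = {k: score(a, b) for k, (a, b) in alignments.items()}
with_penalty = {k: score(a, b, gap=-1) for k, (a, b) in alignments.items()}
print(no_penalty)
print(with_penalty)
```

Without a gap penalty all five alignments tie; with the penalty, the alignments that use a single gap come out ahead, and those three still tie with each other.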
19
Optimal Sequence Alignment Methods

Total number of distinct alignments (with gaps) is
usually extraordinarily large
How do we identify the best one?
Brute-force method of trying every possible gap
placement is slow:
Roughly N^M, where N is length of sequence A, M is
length of sequence B
Dynamic programming offers a more efficient
solution
(but still expensive) with time proportional to
N^3 for general gap penalties, where N is the length
of the longer sequence (N x M with a fixed penalty
per gap position)
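
To get a feel for how quickly the number of alignments grows, the following Python sketch counts them with a simple recurrence. It assumes one common counting convention: every alignment ends in a residue/residue, residue/gap, or gap/residue column, and alignments that differ only in the order of adjacent insertions and deletions are counted separately. The function name count_alignments is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of gapped alignments of sequences of lengths n and m."""
    if n == 0 or m == 0:
        # Only one way left: pair every remaining residue with a gap.
        return 1
    return (count_alignments(n - 1, m - 1)   # last column: residue/residue
            + count_alignments(n - 1, m)     # last column: residue/gap
            + count_alignments(n, m - 1))    # last column: gap/residue


for length in (2, 5, 10, 20):
    print(length, count_alignments(length, length))
```

Even for two sequences of five residues there are already well over a thousand alignments under this convention, which is why enumerating them all is impractical for real sequences.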

20
Dynamic Programming

Computational method used to align sequences
Solution not known in advance but built as we go,
hence dynamic
Optimizes a solution to a problem
Builds on previously found optimal solutions to
sub-parts of the original problem (recursion)
Alignment is guaranteed to be optimal for the
chosen scoring scheme

21
Alignment Algorithms

Needleman-Wunsch (1970) algorithm is a global
alignment algorithm
General algorithm for sequence comparison
May miss important local alignments
A global alignment may not be biologically
relevant
Smith-Waterman (1981) algorithm is a local
alignment method
Scoring system includes negative mismatch scores
Minimum score recorded in matrix is zero
End of optimal path is not restricted to last row
or column

22
Needleman-Wunsch

Fundamental principle
To calculate the alignment score S(i,j), you only
need to enumerate and score all the ways in which
one aligned pair can be added to a shorter
alignment to produce an alignment of the first i
residues of seq1 and the first j residues of seq2
All possible pairs are represented by a
two-dimensional array, and all possible
comparisons are represented by pathways through
this array
Global alignments: every residue of the
two sequences has to participate, so the method
alone will not detect motif or active-site homology
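
The principle above can be sketched in Python. This is a minimal illustration that only fills the score array S and returns the final score, with no traceback. The scoring values (match 1, mismatch 0, and a penalty of -1 per gap position) are assumptions chosen to agree with the earlier scoring slides, not parameters given for this slide.

```python
def needleman_wunsch_score(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Return the optimal global alignment score of seq1 and seq2."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    S = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # add an aligned pair
                S[i - 1][j] + gap,       # residue of seq1 against a gap
                S[i][j - 1] + gap,       # residue of seq2 against a gap
            )
    # Global alignment: the answer is in the last row and last column.
    return S[-1][-1]


print(needleman_wunsch_score("AGGC", "AATGC"))
```

Each cell depends only on its three neighbors, so the array is filled in a single pass over all (i, j) pairs.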

23
Smith-Waterman

Based on Needleman-Wunsch
Instead of looking at each sequence in its
entirety, compare segments of all possible
lengths and choose whichever optimizes the
similarity measure (local alignments)
Assign negative score for a mismatch and a
negative score based on introduction of
insertion/deletion and length of insert/delete
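
A minimal Python sketch of the local variant follows. It assumes a match score of 1, a mismatch score of -1, and a penalty of -1 per gap position; a length-dependent gap penalty as described above would need a more elaborate recurrence. The function name smith_waterman is illustrative.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned part of seq1, aligned part of seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # Scores never drop below zero, so an alignment can restart.
            H[i][j] = max(0, H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell (anywhere in the matrix) to a zero.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


print(smith_waterman("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA"))
```

With identity-based scoring the sketch picks out the shared GKG segment of the slide 6 sequences. A similarity matrix that rewards conservative substitutions could extend the reported segment.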

24
Global Alignment Implementation

needle

25
Local Alignment Implementation

matcher

26
Multiple Alignment Implementation

emma and prettyplot

27
Summary

We should be able to choose the correct
application depending on
What question we are asking
What we know about the sequences
What we need to find out about similarities
We are also now aware of the important difference
between identity and similarity
We can make good judgments about how to interpret
some gaps
