Title: Core Module 7 Bioinformatics
1Core Module 7 Bioinformatics
- Sequence Comparisons
- February 13, 2008
- Bruce Byrne, PhD
2Sequence Alignment
- What we will do
- Ask what considerations underlie comparing two or more sequences
- Step through and calculate simple sequence comparisons using various assumptions
- Review how to use several sequence comparison tools
3Sequence Comparisons Finding Similarities
- What is Sequence Alignment?
- Procedure for comparing two (or more) sequences
- Individual characters aligned, in rows, to best match
- Two sequences are said to be aligned by writing them across in two rows
- Identical (or similar) characters are matches
- We will discuss similarity later
- Non-identical characters are mismatches
- Gaps can be introduced in either (or both) sequences
- How would gaps appear in evolution?
- What is the likely consequence of small deletions in coding sequences?
- Why might we think differently about gaps within a sequence rather than gaps at the ends of sequences?
4Sequence Alignments Interpretations and Importance
- Why do we do Sequence Alignment?
- Defines degree and location of possible similarities
- Can look at entire sequence or localized similarities
- Evolutionary relationships and relationship of sequence to function
- Model sequence to function and structure
5Alignment Tools
- Different applications (computer programs) support quite different alignment needs
- Dot Matrix Comparisons
- Visualize the geometry of similarities
- Variable Numbers of Sequences
- Pairwise alignment - only two sequences compared
- One sequence per file
- Multiple alignment - multiple sequences compared
- Multiple sequences per file
- What is the Question?
- Global alignment - aligns sequences over their entire length
- Local alignment - determines the longest/best subsequence pair that gives maximum similarity
6How and Where Identical?
- LGPSSKQTGKGSSRIWDN
- LNITKSAGKGAIMRLGDA
7Two Possible Answers
- (Global)
- LGPSSKQTGKGS-SRIWDN
- LN-ITKSAGKGAIMRLGDA
- (Local)
- -------TGKG--------
- -------AGKG--------
Figure 1 from Bioinformatics: Sequence and Genome Analysis
8Dot Plot
- Gibbs and McIntyre (1970), Eur. J. Biochem.
- Full comparison
- Gives a big picture: a visual depiction of sequence relationship
- Finding direct or inverted repeats
- Steps
- Create a two-dimensional matrix placing the N-terminal end (in the case of proteins) in the top-left corner
- For every match, a dot is placed in the position of the intersection
9Running a Dot Plot
Two-dimensional grid with the sequences entered along j (Sequence A) and i (Sequence B). In this case, the two sequences are identical. Compare the two sequences in each cell.
10Anatomy of a Dot Plot
Note that the cells (1,1), (2,2), (3,3), etc. are identical.
The connected dots create a diagonal visualizing the identity.
What's our running time to traverse the entire matrix?
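The dot plot steps can be sketched in a few lines of Python. This is a minimal illustration that assumes plain identity matching (no window, no scoring matrix); the names `dot_plot` and `render` are made up for this sketch and do not come from any tool in these slides. Each cell is visited exactly once, so the work grows with the product of the two sequence lengths.

```python
def dot_plot(seq_a, seq_b):
    """Return a grid where grid[i][j] is True when seq_b[i] equals seq_a[j]."""
    return [[b == a for a in seq_a] for b in seq_b]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in grid
    )


# Comparing a sequence with itself: the main diagonal is always filled.
seq = "LGPSSKQTGKGS"
grid = dot_plot(seq, seq)
print(render(grid))
```

Dots off the main diagonal in the printed grid are repeated residues, which is the noise the later slides mention.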
11Output Cytochrome C (Cox1) Human vs. Bacterium at Different Stringency
12Dotmatcher Stringency
A window of specified length is moved up all
possible diagonals and a score is calculated
within each window for each position along the
diagonals. The score is the sum of the
comparisons of the two sequences using the given
similarity matrix along the window. If the score
is above the threshold, then a line is plotted on
the image over the position of the window.
- Recommendations
- For DNA Comparisons: long windows, high stringencies
- For Protein Comparisons: short windows and small stringencies
- For a short domain of partial similarity, use a longer window and a small stringency
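The window-and-threshold idea can be sketched as follows. This is not the dotmatcher program itself: as a simplifying assumption it scores 1 for an identical pair and 0 otherwise instead of using a similarity matrix, and the function name `windowed_dots` is hypothetical. A window start is reported only when its score is above the threshold, as in the description above.

```python
def windowed_dots(seq_a, seq_b, window, threshold):
    """Return (i, j) window starts whose diagonal window score exceeds threshold.

    i indexes seq_b and j indexes seq_a. Scoring is simple identity
    (1 per matching pair), an assumption made for this sketch.
    """
    hits = []
    for i in range(len(seq_b) - window + 1):
        for j in range(len(seq_a) - window + 1):
            score = sum(
                1 for k in range(window) if seq_b[i + k] == seq_a[j + k]
            )
            if score > threshold:
                hits.append((i, j))
    return hits


# The two protein fragments from the "How and Where Identical?" slide.
seq_a = "LGPSSKQTGKGSSRIWDN"
seq_b = "LNITKSAGKGAIMRLGDA"

# A window of 3 with threshold 2 keeps only windows where all 3 residues match.
hits = windowed_dots(seq_a, seq_b, window=3, threshold=2)
print(hits)
```

Raising the window and threshold together removes isolated single-residue dots and leaves only runs of agreement along a diagonal.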
13Similarity Matrix Blosum62
14The Blosum Matrix
- BLOcks of Amino Acid SUbstitution Matrix
- Variety of matrices derived by observation
- Reflect frequency of substitutions observed in highly conserved, well aligned sequences from a variety of taxa
- Blosum62 frequently employed
- Higher number (e.g. Blosum80) might be better for very closely related species
- Lower number for distant relatives
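To make the difference between counting identities and scoring with a similarity matrix concrete, here is a small sketch. It hard-codes only the handful of Blosum62 entries needed for the local alignment TGKG / AGKG shown earlier; the dictionary name `BLOSUM62_EXCERPT` and the helper functions are illustrative, and a real analysis would load the full matrix from a maintained source rather than typing values by hand.

```python
# A few Blosum62 entries (an excerpt, not the full 20 x 20 matrix).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4,
    ("G", "G"): 6,
    ("K", "K"): 5,
    ("T", "T"): 5,
    ("A", "T"): 0,
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    key = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
    return BLOSUM62_EXCERPT[key]


def ungapped_score(seq1, seq2):
    """Sum matrix scores over the aligned columns of two equal-length strings."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


identities = sum(a == b for a, b in zip("TGKG", "AGKG"))
similarity = ungapped_score("TGKG", "AGKG")
print(identities, similarity)
```

Under this matrix the T/A column neither adds to nor subtracts from the score, while identical residues contribute different amounts depending on the residue.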
15Summary on Dot Plot
- Advantages
- Highly illustrative of alignment issues
- All possible matches of residues between two sequences are found
- Good for finding direct and inverted repeats
- Allows for fast visual inspection
- Disadvantages
- Random matches cause noise
- Computer cannot visually detect diagonals
- Diagonals can be missed by visual inspection
- Unreasonable for large number of comparisons
- Doesn't give good statistics for comparison
16Alternatives to Doing an Alignment
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT
------CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCACCCAGCCCCCTGGACCTGTAT---------
(Human)
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
(Computer)
- How many matches?
- How many gaps?
- Meaning of the gaps?
17Scoring an Alignment
CCTTCAGAATACAGAATAGGGACATAGAGA
ATCCCA---CCCAGCCCCCTGGACCTGTAT
- Score for each match is given by m (1 is used here)
- Score for each mismatch is given by n (0 is used here)
- Score for each gap we introduce is given by g (1 is used here)
- Sum the match scores and then reduce by n and g
- For the example above, the score is 7 - (0 + 1) = 6
What kind of alignment is shown above?
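The scoring rule on this slide can be checked with a short script. It is a sketch under the slide's stated values (m = 1, n = 0, g = 1) and treats a run of consecutive gap characters in one row as a single introduced gap, which is the reading that reproduces the slide's result. The function name `score_alignment` is illustrative.

```python
def score_alignment(row_a, row_b, m=1, n=0, g=1, gap="-"):
    """Return (matches, mismatches, gaps, score) for two aligned rows.

    A run of consecutive gap characters in the same row counts as one gap.
    """
    matches = mismatches = gaps = 0
    prev_gap_row = None  # which row held the gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gap_row = "a" if a == gap else "b"
            if gap_row != prev_gap_row:
                gaps += 1  # a new gap has been opened
            prev_gap_row = gap_row
        else:
            prev_gap_row = None
            if a == b:
                matches += 1
            else:
                mismatches += 1
    score = m * matches - (n * mismatches + g * gaps)
    return matches, mismatches, gaps, score


row_a = "CCTTCAGAATACAGAATAGGGACATAGAGA"
row_b = "ATCCCA---CCCAGCCCCCTGGACCTGTAT"
print(score_alignment(row_a, row_b))
```

Changing `n` or `g` in the call shows how strongly the final score depends on the chosen penalties rather than on the sequences alone.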
18Number of Possible Optimal Alignments
Example of five sequence alignments:
1       2       3       4        5
AG.GC   A.GGC   .AGGC   A..GGC   .A.GGC
AATGC   AATGC   AATGC   AATG.C   AATG.C
What if we imposed a penalty, e.g., -1, for introducing gaps? Which alignment(s) would be better?
There may be more than one optimal solution to
a problem
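The question on this slide can be explored with a small script that scores the five alignments with and without a gap penalty. It is a sketch that assumes 1 per match, 0 per mismatch, and a penalty applied to every gap position (each "." column); other gap conventions would give different numbers. The names `ALIGNMENTS` and `score` are illustrative.

```python
# The five alignments of AGGC and AATGC from the slide, "." marks a gap.
ALIGNMENTS = [
    ("AG.GC", "AATGC"),
    ("A.GGC", "AATGC"),
    (".AGGC", "AATGC"),
    ("A..GGC", "AATG.C"),
    (".A.GGC", "AATG.C"),
]


def score(top, bottom, match=1, mismatch=0, gap_penalty=0, gap="."):
    """Sum column scores; gap_penalty is added for every gap position."""
    total = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            total += gap_penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


no_penalty = [score(t, b) for t, b in ALIGNMENTS]
with_penalty = [score(t, b, gap_penalty=-1) for t, b in ALIGNMENTS]
print(no_penalty)
print(with_penalty)
```

Without a penalty all five alignments tie; with the penalty the alignments that use fewer gap positions come out ahead, and several of them still tie with each other.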
19Optimal Sequence Alignment Methods
- Total number of distinct alignments (with gaps) is usually extraordinarily large
- How do we identify the best one?
- Brute force method of trying every possible gap is slow
- Roughly N^M, where N is the length of sequence A and M is the length of sequence B
- Dynamic programming offers a more efficient solution
- (but still expensive) with time proportional to N^3, where N is the length of the longer sequence
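To get a feel for how quickly the number of gapped alignments grows, the sketch below counts them under one specific convention: every column is either a residue pair, a residue of A over a gap, or a gap over a residue of B. That convention is an assumption of this sketch and is not the same estimate as the rough figure quoted on the slide. The function name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Count global alignments of sequences of length n and m.

    The last column is a pair, a gap in B, or a gap in A, which gives
    a three-way recurrence over shorter prefixes.
    """
    if n == 0 or m == 0:
        return 1  # only one way: all remaining residues face gaps
    return (
        count_alignments(n - 1, m - 1)  # last column pairs two residues
        + count_alignments(n - 1, m)    # last column: residue of A over a gap
        + count_alignments(n, m - 1)    # last column: gap over a residue of B
    )


for length in (2, 5, 10, 20):
    print(length, count_alignments(length, length))
```

The recurrence has the same three-way shape that dynamic programming uses, except that dynamic programming keeps only the best score at each cell instead of counting every path.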
20Dynamic Programming
- Computational method used to align sequences
- Solution not known in advance but built as we go, hence dynamic
- Optimizes a solution to a problem
- Builds on previously optimal solution to a sub-part of the original problem (recursion)
- Alignment is guaranteed to be optimal
21Alignment Algorithms
- Needleman-Wunsch (1970) algorithm is a global alignment algorithm
- General algorithm for sequence comparison
- May miss important local alignments
- A global alignment may not be biologically relevant
- Smith-Waterman (1981) algorithm is a local alignment method
- Scoring system includes negative mismatch scores
- Minimum score recorded in matrix is zero
- End of optimal path is not restricted to last row or column
22Needleman-Wunsch
- Fundamental principle
- To calculate the alignment score S(i,j), you only need to enumerate and score all the ways in which one aligned pair can be added to a shorter alignment to produce an alignment of the first i residues of seq1 and the first j residues of seq2
- All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through this array
- Global alignments, i.e. every residue of the two sequences has to participate; therefore the method will not detect motif or active site homology alone
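The principle above can be sketched as a compact global alignment routine. This is a simplified variant, not the 1970 formulation: it assumes a fixed penalty per gap position and illustrative scores (match +1, mismatch -1, gap -1), so each cell S(i,j) needs only its three neighbours. The function name and parameters are chosen for this sketch.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_seq1, aligned_seq2) for one optimal global alignment."""
    n, m = len(seq1), len(seq2)
    # S[i][j] = best score aligning the first i residues of seq1
    # with the first j residues of seq2.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + pair,  # add an aligned pair
                S[i - 1][j] + gap,       # residue of seq1 against a gap
                S[i][j - 1] + gap,       # residue of seq2 against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(out1)), "".join(reversed(out2))


score, row1, row2 = needleman_wunsch("AGGC", "AATGC")
print(score)
print(row1)
print(row2)
```

The traceback returns only one of the possibly several equally good alignments; which one depends on the order in which the three moves are tested.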
23Smith-Waterman
- Based on Needleman-Wunsch
- Instead of looking at each sequence in its entirety, compare segments of all possible lengths and choose whichever optimizes the similarity measure (local alignments)
- Assign negative score for a mismatch and a negative score based on introduction of insertion/deletion and length of insert/delete
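A local alignment routine differs from the global one in two places: no cell is allowed to drop below zero, and the traceback starts at the highest-scoring cell anywhere in the matrix. The sketch below assumes simple illustrative scores (match +1, mismatch -1, and -1 per gap position) rather than a substitution matrix or a length-dependent gap penalty, so its result on protein sequences may differ from what a real tool reports. The function name is chosen for this sketch.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (score, segment1, segment2) for one best local alignment."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(
                0,                       # start a fresh local alignment
                H[i - 1][j - 1] + pair,  # extend with an aligned pair
                H[i - 1][j] + gap,       # residue of seq1 against a gap
                H[i][j - 1] + gap,       # residue of seq2 against a gap
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


# The two protein fragments from the "How and Where Identical?" slide.
result = smith_waterman("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA")
print(result)
```

Because the path may start and end anywhere, unrelated flanking regions do not drag the score down the way they would in a global alignment.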
24Global Alignment Implementation
25Local Alignment Implementation
26Multiple Alignment Implementation
27Summary
- We should be able to choose the correct application depending on
- What question we are asking
- What we know about the sequences
- What we need to find out about similarities
- We are also now aware of the important difference between identity and similarity
- We can make good judgments about how to interpret some gaps