Title: Welcome to Introduction to Bioinformatics Friday, 30 October
1. Welcome to Introduction to Bioinformatics, Friday, 30 October
- I. Scenario 3: Sequence alignment
- Bring up course web site
- Go to Scenario 3
- Open the first sequence alignment notes
2. Scenario 3: Our Story
You: our first defense at the CDC
Outbreak
. . . Anthrax?
Samples
3. Scenario 3: Our Story
4. Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
5. Scenario 3: Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
(no product)
6. Scenario 3: Our Story
DG47
7. Scenario 3: Our Story
8. Scenario 3: Our Story
DG47
9. Scenario 3: Our Story
Maybe it's not from the toxin gene?
10. Scenario 3: Our Story
DG47
11. DG47 nucleotide sequence: matches nothing in GenBank
DG47 amino acid sequence: 100% match to toxin gene
12. Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47 vs lef
13. Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47        1  AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
lef gene 1831  AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG
DG47       61  TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
lef gene 1891  TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC
DG47      121  ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
lef gene 1951  ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT
DG47      181  AGTATTTCTA
lef gene 2011  AGTATTTCTA
89% identical!
14. Scenario 3: Our Story
Compare nucleotide sequences by hand
DG47
lef gene
15. Scenario 3: Our Story
Why can't Blast figure out what you can plainly see?
16. Scenario 3: How does Blast work?
- Clearly we need to understand more about how sequence alignment really works!
- Theory behind nucleotide vs nucleotide Blast
- Theory behind protein-protein Blast
- How to get Blast to do what you want
17. Flavours of sequence alignment
Global Alignment (Needleman-Wunsch algorithm)
- Compares two sequences across their whole length
- Mostly useful only when you already know the sequences are likely to be similar
- Not useful for comparing a short query to an entire genome
- Not discussed further in this class
Local Alignment
- Allows alignment of subsequences of the target and the query
- Usually what we want: the query can be searched against entire genomes or large databases
18. Crude Local Alignment Methods
The Dot Matrix method (Gibbs and McIntyre, 1970)
Represents the query and target sequences as a matrix (a two-dimensional array), using a sliding window of similarity
The human eye is very good at distinguishing the identity line from the noise
19. The Dot Matrix method (Gibbs and McIntyre, 1970)
Normally a window size and a stringency are specified
e.g. if the window size is 8 and the stringency is 6, a dot is placed only if at least 6 of the current 8 positions in the query match the target
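The window and stringency rule can be sketched in a few lines of Python. This is an illustration only, not the course's dotmatrix1.pl; the function names `dot_matrix` and `render` and the two short sequences are assumptions made for the example.

```python
def dot_matrix(query, target, window=8, stringency=6):
    """Return a 2-D list of booleans; True marks a dot.

    Cell [i][j] compares the window starting at query[i] with the
    window starting at target[j].
    """
    rows = max(len(query) - window + 1, 0)
    cols = max(len(target) - window + 1, 0)
    matrix = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Count matching positions inside the current window.
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            matrix[i][j] = matches >= stringency
    return matrix


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Illustrative sequences (assumed), with window 2 and stringency 2.
dots = dot_matrix("GGTAATAG", "GTAATA", window=2, stringency=2)
print(render(dots))
```

Every cell of the matrix is visited, which is why the method gets expensive as the sequences grow.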
20. The Dot Matrix method (Gibbs and McIntyre, 1970)
[Figure: dot matrix example with window 2, stringency 2; the axis labels read G G T A A T A G and G T A A T A]
21. Two dimensional arrays in Perl
Two dimensional arrays are represented as lists containing references to other lists
[Figure: table grid with row index I (0 to 4) and column index J (0 to 3)]
$table[0] returns a reference to an array
@{$table[0]} dereferences it to allow use of the array
Refer to notes and download dotmatrix1.pl
22. Problems with the Dot Matrix method
- Requires human supervision!
- A memory and processor time pig (a complete m × n matrix is calculated each time)
- No explicit handling of gaps
- No good quantitative score of alignment quality
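23. The Smith-Waterman Algorithm (no gaps version)
[Figure: score matrix with GGTAATA across the top and GGTACTA down the side; scores climb 1, 2, 3, 4 along the diagonal of matches, drop to 2 at the C mismatch, then climb back to 4]
Match Extension +1, NoMatch Penalty -2
Negative values are reset to zero!!
Download SmithWaterman1.pl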
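A minimal Python sketch of the no-gaps scoring rule, for illustration only (it is not the course's SmithWaterman1.pl). The function name `sw_no_gaps` is an assumption, and the two sequences are assumed from the labels in the slide's figure.

```python
def sw_no_gaps(query, target, match=1, mismatch=-2):
    """Fill a score matrix using only the diagonal neighbour (no gaps)."""
    rows, cols = len(query), len(target)
    # Extra row and column of zeros so every cell has a diagonal neighbour.
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            # Negative values are reset to zero.
            score[i][j] = max(0, score[i - 1][j - 1] + step)
    return score


matrix = sw_no_gaps("GGTACTA", "GGTAATA")
best = max(max(row) for row in matrix)
print(best)
```

With these inputs the main diagonal reads 1, 2, 3, 4, 2, 3, 4: the C/A mismatch costs 2 and the following matches rebuild the score.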
24. Smith-Waterman: Dynamic Programming
An optimal alignment can be found by starting from the highest scoring box and working backwards.
Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution.
If we incorporate gaps, we must start keeping track of this traceback pathway.
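For the no-gaps case, working backwards is simple, because the only way into a box is along the diagonal. A hedged Python sketch follows; the function name `best_ungapped_local`, the scoring defaults and the example sequences are assumptions made for illustration.

```python
def best_ungapped_local(query, target, match=1, mismatch=-2):
    """Return (score, query_part, target_part) for the best gap-free local match."""
    rows, cols = len(query), len(target)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(0, score[i - 1][j - 1] + step)
            # Remember the first box that reaches the highest score.
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j
    # Walk back up the diagonal from the highest scoring box until a zero.
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        i -= 1
        j -= 1
    return best, query[i:best_i], target[j:best_j]


result = best_ungapped_local("GGTACTA", "GGTAATA")
print(result)
```

Because no gaps are allowed, the walk needs no stored pointers. Once gaps are allowed, each box can be reached from three directions, so the chosen direction has to be recorded.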
25. The Smith-Waterman Algorithm (with gaps)
[Figure: score matrix with GGTAATA across the top and GGTACTA down the side]
Match Extension +1, NoMatch Penalty -2, Gap Penalty -3
Take the Max of:
- 0
- adding a Query Gap
- adding a Target Gap
- Match/No match
Download SmithWaterman2.pl
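The "take the max" rule and the traceback can be sketched together in Python. This is an illustration under the slide's scoring values, not the course's SmithWaterman2.pl; the function name `smith_waterman`, the move labels and the second example pair of sequences are assumptions.

```python
def smith_waterman(query, target, match=1, mismatch=-2, gap=-3):
    """Return (score, aligned_query, aligned_target) for one best local alignment."""
    rows, cols = len(query), len(target)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    # move[i][j] records which option won; None means the box was reset to zero.
    move = [[None] * (cols + 1) for _ in range(rows + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            options = [
                (0, None),
                (score[i - 1][j - 1] + step, "diag"),  # match / no match
                (score[i - 1][j] + gap, "up"),         # gap in the target
                (score[i][j - 1] + gap, "left"),       # gap in the query
            ]
            # On ties, max() keeps the earliest option in the list.
            score[i][j], move[i][j] = max(options, key=lambda option: option[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the highest scoring box until a box that was reset to zero.
    i, j = best_pos
    q_aln, t_aln = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            q_aln.append(query[i - 1])
            t_aln.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            q_aln.append(query[i - 1])
            t_aln.append("-")
            i -= 1
        else:
            q_aln.append("-")
            t_aln.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(q_aln)), "".join(reversed(t_aln))


gapped = smith_waterman("GATCACTG", "GATCTACTG")
print(gapped)
```

In the second example pair, the target carries one extra T, so the best alignment pays the gap penalty once (4 - 3 + 4 = 5) rather than stopping at the first four matches.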
26. Problems with Smith-Waterman
Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m × n matrix is still calculated each time!!)
Do we really need to calculate the whole matrix?
27. BlastN: word-based heuristics
Notice that in a typical S-W matrix, most of the boxes are empty!!!
What if we find exact matches of some seed words, then just work in the area surrounding these seeds, trying to extend the alignment?
This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (see figure on page 6 of Alignment notes)
28. BlastN Procedure
- Filter the query sequence for repetitive, low-complexity sequences
- Identify the subsequences of size "word" in the query
- Find the exact matches in the target of all the words
- Use a modified S-W to extend the hits around the seed words
- Score and report on the best matches
More on scoring on Monday!!!
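The word and seed steps of this procedure can be sketched in Python. This is a toy illustration, not Blast's actual implementation: the names `find_seeds`, `extend_seed` and `_walk` are hypothetical, the low-complexity filter is left out, and a simple gap-free walk along the diagonal stands in for the modified S-W extension.

```python
from collections import defaultdict


def find_seeds(query, target, word_size=4):
    """Return (query_pos, target_pos) pairs where a query word occurs exactly in the target."""
    index = defaultdict(list)
    for q in range(len(query) - word_size + 1):
        index[query[q:q + word_size]].append(q)
    seeds = []
    for t in range(len(target) - word_size + 1):
        for q in index.get(target[t:t + word_size], []):
            seeds.append((q, t))
    return seeds


def _walk(query, target, qi, ti, step, start_score, match, mismatch):
    """Walk along one diagonal; return (best extra score, length at that best)."""
    running, best, best_len, length = 0, 0, 0, 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        running += match if query[qi] == target[ti] else mismatch
        length += 1
        if start_score + running <= 0:
            break  # the total has fallen to zero, so stop extending
        if running > best:
            best, best_len = running, length
        qi += step
        ti += step
    return best, best_len


def extend_seed(query, target, q, t, word_size=4, match=1, mismatch=-2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, target_start, length).
    """
    seed_score = word_size * match
    right, right_len = _walk(query, target, q + word_size, t + word_size, 1,
                             seed_score, match, mismatch)
    left, left_len = _walk(query, target, q - 1, t - 1, -1,
                           seed_score, match, mismatch)
    return seed_score + left + right, q - left_len, t - left_len, left_len + word_size + right_len


query, target = "GATCACTG", "TTGATCACTGTT"
seeds = find_seeds(query, target)
hits = {extend_seed(query, target, q, t) for q, t in seeds}
print(seeds, hits)
```

Only the diagonals that pass through a seed are examined, so most of the m × n matrix is never calculated. Here all five seeds lie on the same diagonal and extend to the same hit.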