Welcome to Introduction to Bioinformatics: Friday, 30 October

Provided by: bioinforma6
Learn more at: https://bulletin.vcu.edu
1
Welcome to Introduction to Bioinformatics
Friday, 30 October
  • I. Scenario 3 Sequence alignment
  • Bring up course web site
  • Go to Scenario 3
  • Open the first sequence alignment notes

2
Scenario 3 Our Story
You: our first line of defense at the CDC
Outbreak
. . . Anthrax?
Samples
  • Confirm agent
  • Identify strain

3
Scenario 3 Our Story
4
Scenario 3 Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
5
Scenario 3 Our Story
If DNA from bacterium with toxin gene
If DNA NOT from bacterium with toxin gene?
(no product)
6
Scenario 3 Our Story
DG47
7
Scenario 3 Our Story
8
Scenario 3 Our Story
DG47
9
Scenario 3 Our Story
Maybe it's not from the toxin gene?
10
Scenario 3 Our Story
DG47
11
DG47 nucleotide sequence: matches nothing in GenBank
DG47 amino acid sequence: 100% match to toxin gene
12
Scenario 3 Our Story
Compare nucleotide sequences by hand
DG47 vs lef
13
Scenario 3 Our Story
Compare nucleotide sequences by hand
DG47         1  AATATTGACGCTTTACTACATCAGTCCATCGGAAGTACGTTGTATAATAAAATATATCTG
lef gene  1831  AATATTGATGCTTTATTACATCAATCCATTGGAAGTACCTTGTACAATAAAATTTATTTG

DG47        61  TATGAAAACATGAATATAAATAACTTAACAGCAACGTTAGGTGCCGATTTAGTAGATTCC
lef gene  1891  TATGAAAATATGAATATCAATAACCTTACAGCAACCCTAGGTGCGGATTTAGTTGATTCC

DG47       121  ACAGATAATACAAAAATTAATCGAGGTATATTCAATGAGTTCAAAAAAAATTTCAAATAC
lef gene  1951  ACTGATAATACTAAAATTAATAGAGGTATTTTCAATGAATTCAAAAAAAATTTCAAATAT

DG47       181  AGTATTTCTA
lef gene  2011  AGTATTTCTA

89% identical!
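Comparing by hand amounts to counting the positions at which the two sequences agree. The sketch below does that count in Python rather than the course's Perl. The function name `percent_identity` is made up for illustration, and it assumes two equal-length sequences with no gaps. The inputs are just the first ten bases of each sequence on the slide, not the full alignment.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (gaps are not handled)")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# First ten bases of DG47 and of the lef gene region shown on the slide.
dg47_start = "AATATTGACG"
lef_start = "AATATTGATG"

# 9 of the 10 positions agree (only the ninth base differs).
print(percent_identity(dg47_start, lef_start))
```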
14
Scenario 3 Our Story
Compare nucleotide sequences by hand
DG47
lef gene
15
Scenario 3 Our Story
Why can't Blast figure out what you can plainly see?
16
Scenario 3 How does Blast work?
  • Clearly we need to understand more about how sequence alignment really works!
  • Theory behind nucleotide vs nucleotide Blast
  • Working BlastN program
  • Theory behind protein-protein Blast
  • How to get Blast to do what you want

17
Flavours of sequence alignment
Global Alignment - Needleman-Wunsch algorithm
- Compares two sequences across their whole length
- Mostly useful only when you already know the sequences might be similar
- Not useful for comparing a short query to an entire genome
- Not discussed further in this class
Local Alignment
- Allows alignment of subsequences of the target and the query
- Usually what we want: the query can be searched against entire genomes or large databases

18
Crude Local Alignment Methods
The Dot Matrix method (Gibbs and McIntyre, 1970)
Represents the query and target sequences as a matrix (a two-dimensional array), using a sliding window of similarity
The human eye can powerfully distinguish the identity line from the noise
19
The Dot Matrix method (Gibbs and McIntyre, 1970)
Normally a window size and a stringency are specified.
For example, if the window size is 8 and the stringency is 6, a dot is placed only if at least 6 of the current 8 positions in the query match the target.
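The window and stringency rule can be sketched in a few lines. This is a Python illustration, not the course's dotmatrix1.pl. The names `dot_matrix` and `render` are made up here, and which sequence plays query and which plays target is an assumption. The example uses the two sequences from the next slide (GGTAATAG and GTAATA) with window 2 and stringency 2.

```python
def dot_matrix(query: str, target: str, window: int = 2, stringency: int = 2):
    """Return a grid of booleans; True marks a dot.

    Rows follow the query and columns follow the target. A dot at (i, j)
    means at least `stringency` of the `window` positions starting at
    query[i] and target[j] are identical.
    """
    rows = len(query) - window + 1
    cols = len(target) - window + 1
    grid = [[False] * max(cols, 0) for _ in range(max(rows, 0))]
    for i in range(rows):
        for j in range(cols):
            matches = sum(query[i + k] == target[j + k] for k in range(window))
            grid[i][j] = matches >= stringency
    return grid


def render(grid) -> str:
    """Draw the grid as text: '*' for a dot, '.' for an empty box."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


grid = dot_matrix("GTAATA", "GGTAATAG", window=2, stringency=2)
print(render(grid))
```

The long diagonal run of dots in the output is the identity line. The scattered off-diagonal dots are the noise that a larger window or higher stringency would suppress.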
20
The Dot Matrix method (Gibbs and McIntyre, 1970)
[Figure: dot matrix of GGTAATAG (across) against GTAATA (down)]
window = 2, stringency = 2
21
Two dimensional arrays in Perl
Two-dimensional arrays are represented as lists containing references to other lists
[Figure: a table indexed by I (0 to 4) and J (0 to 3)]
@{$table[0]} dereferences to allow use of the array
$table[0] returns a reference to an array
Refer to notes and download dotmatrix1.pl
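The same idea appears in other languages. As a rough analogue only (Python, not Perl, and not taken from the course notes), a Python table is also a list whose elements refer to other lists. Fetching a row gives a reference to that row, not a copy. The 5 by 4 size below is an assumption read off the index labels on the slide.

```python
# Assumed size: index I runs 0..4 (rows), index J runs 0..3 (columns).
rows, cols = 5, 4

# Build each row separately so that every row is its own list.
table = [[0] * cols for _ in range(rows)]

table[2][3] = 1   # set the box at I = 2, J = 3

row0 = table[0]   # a reference to the first row, not a copy
row0[1] = 7       # so this change is visible through table as well

print(table[0])   # the first row now holds the 7
print(table[2])   # the third row holds the 1
```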
22
Problems with the Dot Matrix method
  1. Requires human supervision!
  2. A memory and processor time pig (a complete m × n matrix is calculated each time)
  3. No explicit handling of gaps
  4. No good quantitative score of alignment quality

23
The Smith-Waterman Algorithm (no gaps version)
[Figure: scoring matrix for GGTAATAG (across) against GGTACTA (down); the highest score is 4]
Match Extension = +1, NoMatch Penalty = -2
Negative values are reset to zero!!
Download SmithWaterman1.pl
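The no-gaps scoring rule can be sketched as follows. This is a Python illustration under the slide's scoring (match +1, no match -2, negatives reset to zero); it is not SmithWaterman1.pl. The function name `ungapped_scores` is made up, and the two sequences are read off the matrix labels on the slides.

```python
def ungapped_scores(query: str, target: str, match: int = 1, mismatch: int = -2):
    """Fill the scoring matrix using only the diagonal neighbour (no gaps)."""
    rows, cols = len(query), len(target)
    # An extra row and column of zeros gives every box a diagonal neighbour.
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            # Negative values are reset to zero.
            score[i][j] = max(0, score[i - 1][j - 1] + step)
    return score


score = ungapped_scores("GGTACTA", "GGTAATAG")
best = max(max(row) for row in score)
print(best)
for row in score[1:]:
    print(row[1:])
```

Along the main diagonal the scores climb 1, 2, 3, 4 over GGTA, drop to 2 at the C/A mismatch, then climb back to 4 over TA.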
24
Smith Waterman Dynamic Programming
An optimal alignment can be found starting from the highest scoring box and working backwards.
Dynamic Programming is a method for recording the solutions to subproblems, then working backwards to find an overall solution.
If we incorporate gaps, we must start keeping track of this traceback pathway.
25
The Smith-Waterman Algorithm (with gaps)
[Figure: partially filled scoring matrix for GGTAATAG (across) against GGTACTA (down)]
Match Extension = +1, NoMatch Penalty = -2, Gap Penalty = -3
For each box, take the max of:
- 0
- adding a query gap
- adding a target gap
- match/no match
Download SmithWaterman2.pl
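The gapped rule plus the traceback can be sketched together. This is a Python illustration under the slide's scoring (match +1, no match -2, gap -3); it is not SmithWaterman2.pl. The function name `smith_waterman` and the pointer labels are made up for this sketch. It uses a simple linear gap penalty and, when scores tie, keeps the first candidate in the list, which is an arbitrary choice.

```python
def smith_waterman(query: str, target: str, match=1, mismatch=-2, gap=-3):
    """Return (best score, aligned query, aligned target) for a local alignment."""
    rows, cols = len(query), len(target)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    move = [[None] * (cols + 1) for _ in range(rows + 1)]  # traceback pointers
    best, best_pos = 0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            candidates = [
                (0, None),                              # start afresh
                (score[i - 1][j - 1] + step, "diag"),   # match / no match
                (score[i - 1][j] + gap, "up"),          # query base opposite a gap
                (score[i][j - 1] + gap, "left"),        # target base opposite a gap
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Work backwards from the highest scoring box until a zero box is reached.
    i, j = best_pos
    q_aln, t_aln = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            q_aln.append(query[i - 1])
            t_aln.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            q_aln.append(query[i - 1])
            t_aln.append("-")
            i -= 1
        else:
            q_aln.append("-")
            t_aln.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(q_aln)), "".join(reversed(t_aln))


# Sequences read off the slide's matrix labels.
print(smith_waterman("GGTACTA", "GGTAATAG"))
# A made-up pair in which opening a gap pays off.
print(smith_waterman("ACGTACTTGCA", "ACGTACGTTGCA"))
```

With the slide's sequences the best local alignment is GGTA against GGTA with no gap. In the made-up second pair, one gap opposite the extra G lets the alignment continue through the rest of the sequence.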
26
Problems with Smith-Waterman
Still a pig! Memory and processor time requirements are huge when the query and/or the database gets large (a complete m × n matrix is still calculated each time!).
Do we really need to calculate the whole matrix?
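A quick back-of-the-envelope calculation shows how fast the matrix grows. The sketch below is Python with made-up sizes: the 1,000-base query, the 5,000,000-base target, and the 4 bytes per box are all assumptions for illustration, not figures from the course.

```python
def matrix_cells(m: int, n: int) -> int:
    """Number of boxes in a complete m x n scoring matrix."""
    return m * n


def matrix_bytes(m: int, n: int, bytes_per_cell: int = 4) -> int:
    """Memory needed if every box is stored (bytes_per_cell is an assumption)."""
    return matrix_cells(m, n) * bytes_per_cell


query_len = 1_000        # hypothetical query length
target_len = 5_000_000   # hypothetical target length, roughly a bacterial genome

print(matrix_cells(query_len, target_len))   # boxes to fill
print(matrix_bytes(query_len, target_len))   # bytes to store them all
```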
27
BlastN word based heuristics
Notice that in a typical S-W matrix, most of the boxes are empty!
What if we find exact matches of some seed words, then work only in the area surrounding these seeds, trying to extend the alignment?
This is exactly the heuristic that Blast employs to avoid calculating the whole matrix! (See the figure on page 6 of the Alignment notes.)
28
BlastN Procedure
  1. Filter the query sequence for repetitive, low-complexity sequences
  2. Identify the subsequences of size word in the query
  3. Find the exact matches in the target for all of the words
  4. Use a modified S-W to extend the hits around the seed words
  5. Score and report on the best matches
More on scoring on Monday!
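The word-matching and extension steps can be sketched as below. This is a simplified Python illustration, not BlastN itself and not a course script. The low-complexity filter is omitted. The extension is a plain ungapped walk along the seed's diagonal, standing in for the modified S-W. The names `find_seeds` and `extend_seed`, and the `max_drop` cut-off, are all made up for this sketch.

```python
from collections import defaultdict


def find_seeds(query: str, target: str, word: int = 4):
    """Return (query_pos, target_pos) pairs where a word matches exactly."""
    index = defaultdict(list)
    for j in range(len(target) - word + 1):
        index[target[j:j + word]].append(j)
    seeds = []
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, target, i, j, word, match=1, mismatch=-2, max_drop=4):
    """Extend a seed both ways without gaps; keep the best-scoring extension.

    Returns (score, query_start, target_start, length).
    """
    def walk(qi, tj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            score += match if query[qi] == target[tj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= max_drop:
                break  # the score has fallen too far below the best; give up
            qi += step
            tj += step
        return best, best_len

    right_score, right_len = walk(i + word, j + word, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = word * match + left_score + right_score
    return score, i - left_len, j - left_len, left_len + word + right_len


query, target = "TACGTACC", "GGACGTACCA"   # made-up sequences
seeds = find_seeds(query, target, word=4)
print(seeds)
print(extend_seed(query, target, *seeds[0], word=4))
```

Only the boxes along each seed's diagonal are ever visited, which is how the heuristic avoids filling in the whole matrix.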