Why Life is Beautiful James Tisdall - PowerPoint PPT Presentation

About This Presentation

Title:

Why Life is Beautiful James Tisdall

Description:

Author of 'Beginning Perl for Bioinformatics' and 'Mastering Perl for ... PRI: primate sequences. ROD: rodent sequences. PHG: bacteriophage sequences ... – PowerPoint PPT presentation

Number of Views:37

Avg rating:3.0/5.0

Slides: 51

Provided by: jamesdt

Category:

more less

Transcript and Presenter's Notes

Title: Why Life is Beautiful James Tisdall

1
Why Life is BeautifulJames Tisdall
2
James Tisdall (Jim)
Biocomputing Associates PO Box 9965 Philadelphia
PA 19118 USA (610)933-1200 http/www.biocomputinga
ssociates.com And DuPont Experimental
Station James.Tisdall_at_DuPont.com
3
Who am I?

Musician
Bell Laboratories, Information Principles
Research
Human Genome Project
early Perl program DNA WorkBench
Mercator Genetics, Computational Biologist
Fox Chase Cancer Center, Bioinformatics Manager
Biocomputing Associates, consulting
DuPont Genetics Research
Author of "Beginning Perl for Bioinformatics" and
"Mastering Perl for Bioinformatics" from O'Reilly

Bioinformatics means using computers for biology
research, and
Perl is a popular computer language for
bioinformatics programming

5
OUTLINE

Perl resources
Overview of basic Perl
The art of programming
Sequences and strings
Motifs and loops
Subroutines and bugs

6
OUTLINE (Continued)

Mutations and Randomization
The Genetic Code
Restriction Maps and Regular Expressions
GenBank
Protein Data Bank
BLAST
Further Topics

7
Advantages of Perl

Ease of Programming
Simplify common programming tasks. A lot of free
software for bioinformatics, web programming,

Rapid Prototyping
Often less programming time than other languages

Portability
Most operating systems Windows, Mac, Unix, etc
Free
Run speed - suitable for most tasks

8
Biological Sequence Data

Bases and amino acids are represented as letters
in Perl, just as in biology literature
DNA and RNA
A C G T (or U)
Amino acids
A C D E F G H I K L M N P Q R S T V W X Y

9
A Small Bioinformatics Program

Sequences are represented as Strings
First we store the DNA in a variable called
DNA
DNA 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
Next, we print the DNA onto the screen
print DNA
Finally, we'll specifically tell the program to
exit.
exit

10
Virtual Transcription DNA to RNA

DNA 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
print "DNA\n\n"
Transcribe the DNA to RNA by substituting all
T's with U's.
RNA DNA
RNA s/T/U/g
print "RNA\n"

11
Reverse Complement

DNA 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
revcom reverse DNA
Next substitute all bases by their complements,
A-gtT, T-gtA, G-gtC, C-gtG, uppercase or lowercase
revcom tr/ACGTacgt/TGCAtgca/

Mutations and Randomization

13
Modeling mutations randomly

How to randomly select a position in a string
How to randomly select an index in an array
Randomly select a base in DNA
Mutate a base to some other random base
Generate random DNA data sets
Repeatedly mutate DNA

14
A program to simulate DNA mutation

Pseudocode
1. Select a random position in the DNA
2. Choose a random nucleotide
3. Substitute the random nucleotide in the random
position

15
A program to simulate DNA mutation

1. Select a random position in the DNA
position int(rand(length(DNA)))
2. Choose a random nucleotide
_at_nucleotides ('A', 'C', 'G', 'T')
new nucleotidesrand _at_nucleotides
3. Substitute the random nucleotide in the random
position
substr(DNA, position, 1, new)

16
Generating random DNA

sub make_random_DNA
Collect arguments, declare variables
my(length) _at__
my dna
for (my i0 i lt length i)
dna . nucleotidesrand _at_nucleotides
return dna

The Genetic Code

18
The genetic code

DNA encodes amino acids
3 bases of DNA a codon
Each codon represents an amino acid or a "stop"
There are 64 codons, but 20 amino acids
Most amino acids are coded for by multiple
codons, which makes the genetic code a redundant
code

19
Translating codons to amino acids

sub codon2aa
my(codon) _at__
codon uc codon
my(genetic_code) (
'TCA' gt 'S', Serine
'TCC' gt 'S', Serine
'TCG' gt 'S', Serine
'TCT' gt 'S', Serine
'TTC' gt 'F', Phenylalanine
et cetera

20
Translating DNA into proteins

sub dna2peptide
my(dna) _at__
Initialize variables
my protein ''
Translate each three-base codon to an amino
acid, and append to a protein
for(my i0 i lt (length(dna) - 2) i
3)
protein . codon2aa( substr(dna,i,3)
)
return protein

21
Reading DNA from FASTA files

FASTA format is the most common way that DNA is
stored in files (but there are many others)
Example
gt sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcag

22
Reading DNA from FASTA files

gt sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
Tgggacgccgccgagtggtctgtgcag
Notice that first line starts with a gt symbol,
and contains annotation.
Each other line just contains sequence data.
Simple!

23
Reading FASTA files core code

From textbook, p. 170
foreach my line (_at_fasta_file_data)
if(line /gt/) discard fasta header line
next
else keep line, add to sequence string
sequence . line
remove non-sequence data (in this case,
whitespace) from sequence string
sequence s/\s//g

24
Regular expressions
Regular expressions are a language within the
Perl language that describe many kinds of
patterns in strings, e.g. motifs in DNA. Regular
expressions permit you to find and alter many
patterns with relative ease. The excellent
regular expressions in Perl are a major reason
for Perl's success as a bioinformatics
programming language.
25
Restriction enzymes, sites, and maps

A restriction enzyme is a protein that cuts DNA
at short, specific sequences
EcoRI cuts at the restriction site GAATTC,
between the G and A, on both complementary
strands (the reverse complement of GAATTC is
GAATTC), leaving "sticky ends" for insertion into
a vector for cloning and sequencing.
A restriction map is the list of locations where
a given restriction site occurs in the sequence

26
Finding restriction sites

while ( sequence /regexp/ig )
push ( _at_positions, pos(sequence) -
length() 1)
while loop conditional is a pattern match
ig i stands for "case insensitive"
g stands for "global", will keep searching entire
string
pos is a built-in function that gives the ending
position of the last successful match subtract
the length of the match () to get to the
beginning of the match add 1 because biologists
call the first base in DNA as position 1, while
Perl calls the first letter in a string as
position 0

GenBank

28
GenBank

http//ncbi.nlm.nih.gov/Genbank

29
GenBank Libraries

Examples
PRI primate sequences
ROD rodent sequences
PHG bacteriophage sequences
GSS genome survey sequences

Protein Data Bank

31
PDB - Protein Data Bank

The primary depository for 3D protein structures
http//www.rcsb.org/pdb/

BLAST

33
BLAST

Basic Local Alignment Search Tool
Most popular bioinformatics program
Discovers sequence similarity between your
sequence and large databases
Voluminous output
Need to process output by program

34
Similarity vs Homology

String matching is computer science term for
finding sequence similarity
Exact
Approximate percent identity
Homology
Sequences are or are not related evolutionarily
Yes or no no percentages

35
Parsing BLAST output

Many bioinformatics projects require frequent
BLAST searches
Need to automate collection and processing
BLAST output does not label lines by their
function, so use regular expressions
Several BLAST parsers are available as Perl
modules see for example Bioperl modules
BioToolsBPlite and BioToolsBlast

36
Bioinformatics Resources

Books
Bioinformatics Sequence and Genome Analysis by
Mount
Developing Bioinformatics Computer Skills Gibas
Jambeck
Introduction to Computational Biology
Maps,Sequences and Genomes, Waterman
Bioinformatics A Practical Guide to the Analysis
of Genes and Proteins. Baxecvanis Ouellette,
editors

2. Government/Public Information Resources
National Center for Biotechnology Information
(NCBI) U.S. Government Center
www.ncbi.nlm.nih.gov
European Molecular Biology Laboratory (EMBL).
European Union Laboratory www.embl.org
European Bioinformatics Institute (EBI)- From UK,
part of EMBL- www.ebi.ac.uk
UCSC Genome Resource- serves constructed genome
data www.genome.ucsc.edu

37
Bioinformatics Resources (cont.)

3. Conferences
ISMB Intellegent Systems for Molecular Biology,
www.iscb.org/ismb2004/
Bioinformatics Open Source Conference,
www.bioinformatics.org
RECOMB Conference on Computational Molecular
Biology, recomb04.sdsc.edu/

38
Molecular Biology

1. Helpful Reference Books for Programmers
Recombinant DNA, 2nd edition, Watson et al.
Molecular Biology of the Gene, Watson et al.
Molecular Cell Biology, Lodish, et al.

39
Approximate string matching
As an example of advanced data structures,
following is an algorithm that relies on a 2
dimensional array. It finds the best match for a
(shorter) pattern in a (longer) text. It is one
of several important dynamic programming
algorithms in bioinformatics
40
Approximate string matching 1 of 4
my pattern 'EIQADEVRL' print
"PATTERN\npattern\n" my text
'SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE' print
"TEXT\ntext\n" my TLEN length text my
PLEN length pattern D is the Distance
matrix, which shows the "edit distance" between
substrings of the pattern and the text. It is
implemented as a reference to an anonymous
array. my D The rows correspond to the
text Initialize row 0 of D. for (my t0 t lt
TLEN t) D-gtt0 0
41
Approximate string matching 2 of 4
Compute the edit distances. for (my t1 t lt
TLEN t) for (my p1 p lt PLEN
p) D-gttp Choose whichever
alternative has the least cost min3(
Text and pattern may or may not match at this
character ... substr(text, t-1, 1) eq
substr(pattern, p-1, 1) ?
D-gtt-1p-1 If match, no increase in edit
distance! D-gtt-1p-1 1,
If the text is missing a character
D-gtt-1p 1, If the pattern is
missing a character D-gttp-1 1
)
42
Approximate string matching 3 of 4
Print D, the resulting edit distance array for
(my p0 p lt PLEN p) for (my t0
t lt TLEN t) print D-gttp, "
" print "\n" Find the best
match(es). The edit distances appear in the
last row. my _at_matches () my bestscore
10000000 for (my t1 t lt TLEN t)
if( D-gttPLEN lt bestscore)
bestscore D-gttPLEN _at_matches
(t) elsif( D-gttPLEN bestscore)
push(_at_matches, t)
43
Approximate string matching 4 of 4
Report the best match(es). print "\nThe best
match for the pattern pattern\n" print "has an
edit distance of bestscore\n" print "and
appears in the text ending at location" print
"s" if ( _at_matches gt 1) print " _at_matches\n"
44
Approximate string matching Output
PATTERN EIQADEVRL TEXT SVLQDRSMPHQEILAADEVLQESEM
RQQDMISHDE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1
1 1 1 0 2 2 2 2 2 2 2 2 2 2 2 2 1 0 1 2 2 2 1 1
2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 1 3 3 3 3 2 3 3 3
3 3 3 2 2 1 1 2 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2
3 3 2 4 4 4 4 3 3 4 4 4 4 4 3 3 2 2 1 2 3 3 3 3
3 3 3 3 3 3 3 3 3 4 3 3 3 4 3 5 5 5 5 4 3 4 5 5
5 5 4 4 3 3 2 2 2 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
3 4 6 6 6 6 5 4 4 5 6 6 6 5 4 4 4 3 3 3 2 3 4 5
4 5 4 5 5 5 5 4 4 5 5 5 4 3 7 7 6 7 6 5 5 5 6 7
7 6 5 5 5 4 4 4 3 2 3 4 5 5 5 5 6 6 6 5 5 5 6 6 5
4 8 8 7 7 7 6 5 6 6 7 8 7 6 6 6 5 5 5 4 3 3 4 5
6 6 6 5 6 7 6 6 6 6 7 6 5 9 9 8 7 8 7 6 6 7 7 8
8 7 7 6 6 6 6 5 4 3 4 5 6 7 7 6 6 7 7 7 7 7 7 7 6
The best match for the pattern EIQADEVRL has an
edit distance of 3 and appears in the text ending
at location 20
45
Bioperl
Bioperl is a collection of Perl modules for
bioinformatics. It has been developed by a large
group of volunteers. Bioperl has become a fairly
stable platform with which to develop software
it is growing and changing rapidly, but its core
is relatively mature. Bioperl is a must for
anyone interested in writing bioinformatics
software in Perl Don't reinvent the wheel!
(Unless you like to, for the fun or interest of
it.)
46
What can Bioperl do?