Why Life is Beautiful James Tisdall - PowerPoint PPT Presentation

About This Presentation
Title:

Why Life is Beautiful James Tisdall

Description:

Author of 'Beginning Perl for Bioinformatics' and 'Mastering Perl for ... PRI: primate sequences. ROD: rodent sequences. PHG: bacteriophage sequences ... – PowerPoint PPT presentation

Number of Views:37
Avg rating:3.0/5.0
Slides: 51
Provided by: jamesdt
Category:

less

Transcript and Presenter's Notes

Title: Why Life is Beautiful James Tisdall


1
Why Life is BeautifulJames Tisdall
2
James Tisdall (Jim)
Biocomputing Associates PO Box 9965 Philadelphia
PA 19118 USA (610)933-1200 http/www.biocomputinga
ssociates.com And DuPont Experimental
Station James.Tisdall_at_DuPont.com
3
Who am I?
  • Musician
  • Bell Laboratories, Information Principles
    Research
  • Human Genome Project
  • early Perl program DNA WorkBench
  • Mercator Genetics, Computational Biologist
  • Fox Chase Cancer Center, Bioinformatics Manager
  • Biocomputing Associates, consulting
  • DuPont Genetics Research
  • Author of "Beginning Perl for Bioinformatics" and
    "Mastering Perl for Bioinformatics" from O'Reilly

4
  • Bioinformatics means using computers for biology
    research, and
  • Perl is a popular computer language for
    bioinformatics programming

5
OUTLINE
  • Perl resources
  • Overview of basic Perl
  • The art of programming
  • Sequences and strings
  • Motifs and loops
  • Subroutines and bugs

6
OUTLINE (Continued)
  • Mutations and Randomization
  • The Genetic Code
  • Restriction Maps and Regular Expressions
  • GenBank
  • Protein Data Bank
  • BLAST
  • Further Topics

7
Advantages of Perl
  • Ease of Programming
  • Simplify common programming tasks. A lot of free
    software for bioinformatics, web programming,
  • Rapid Prototyping
  • Often less programming time than other languages
  • Portability
  • Most operating systems Windows, Mac, Unix, etc
  • Free
  • Run speed - suitable for most tasks

8
Biological Sequence Data
  • Bases and amino acids are represented as letters
    in Perl, just as in biology literature
  • DNA and RNA
  • A C G T (or U)
  • Amino acids
  • A C D E F G H I K L M N P Q R S T V W X Y

9
A Small Bioinformatics Program
  • Sequences are represented as Strings
  • First we store the DNA in a variable called
    DNA
  • DNA 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
  • Next, we print the DNA onto the screen
  • print DNA
  • Finally, we'll specifically tell the program to
    exit.
  • exit

10
Virtual Transcription DNA to RNA
  • DNA 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
  • print "DNA\n\n"
  • Transcribe the DNA to RNA by substituting all
    T's with U's.
  • RNA DNA
  • RNA s/T/U/g
  • print "RNA\n"

11
Reverse Complement
  • DNA 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC'
  • revcom reverse DNA
  • Next substitute all bases by their complements,
  • A-gtT, T-gtA, G-gtC, C-gtG, uppercase or lowercase
  • revcom tr/ACGTacgt/TGCAtgca/

12
  • Mutations and Randomization

13
Modeling mutations randomly
  • How to randomly select a position in a string
  • How to randomly select an index in an array
  • Randomly select a base in DNA
  • Mutate a base to some other random base
  • Generate random DNA data sets
  • Repeatedly mutate DNA

14
A program to simulate DNA mutation
  • Pseudocode
  • 1. Select a random position in the DNA
  • 2. Choose a random nucleotide
  • 3. Substitute the random nucleotide in the random
    position

15
A program to simulate DNA mutation
  • 1. Select a random position in the DNA
  • position int(rand(length(DNA)))
  • 2. Choose a random nucleotide
  • _at_nucleotides ('A', 'C', 'G', 'T')
  • new nucleotidesrand _at_nucleotides
  • 3. Substitute the random nucleotide in the random
    position
  • substr(DNA, position, 1, new)

16
Generating random DNA
  • sub make_random_DNA
  • Collect arguments, declare variables
  • my(length) _at__
  • my dna
  • for (my i0 i lt length i)
  • dna . nucleotidesrand _at_nucleotides
  • return dna

17
  • The Genetic Code

18
The genetic code
  • DNA encodes amino acids
  • 3 bases of DNA a codon
  • Each codon represents an amino acid or a "stop"
  • There are 64 codons, but 20 amino acids
  • Most amino acids are coded for by multiple
    codons, which makes the genetic code a redundant
    code

19
Translating codons to amino acids
  • sub codon2aa
  • my(codon) _at__
  • codon uc codon
  • my(genetic_code) (
  • 'TCA' gt 'S', Serine
  • 'TCC' gt 'S', Serine
  • 'TCG' gt 'S', Serine
  • 'TCT' gt 'S', Serine
  • 'TTC' gt 'F', Phenylalanine
  • et cetera

20
Translating DNA into proteins
  • sub dna2peptide
  • my(dna) _at__
  • Initialize variables
  • my protein ''
  • Translate each three-base codon to an amino
    acid, and append to a protein
  • for(my i0 i lt (length(dna) - 2) i
    3)
  • protein . codon2aa( substr(dna,i,3)
    )
  • return protein

21
Reading DNA from FASTA files
  • FASTA format is the most common way that DNA is
    stored in files (but there are many others)
  • Example
  • gt sample dna (This is a typical fasta header.)
  • agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
  • tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
  • gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
  • tgggacgccgccgagtggtctgtgcag

22
Reading DNA from FASTA files
  • gt sample dna (This is a typical fasta header.)
  • agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
  • tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
  • gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
  • Tgggacgccgccgagtggtctgtgcag
  • Notice that first line starts with a gt symbol,
    and contains annotation.
  • Each other line just contains sequence data.
  • Simple!

23
Reading FASTA files core code
  • From textbook, p. 170
  • foreach my line (_at_fasta_file_data)
  • if(line /gt/) discard fasta header line
  • next
  • else keep line, add to sequence string
  • sequence . line
  • remove non-sequence data (in this case,
    whitespace) from sequence string
  • sequence s/\s//g

24
Regular expressions
Regular expressions are a language within the
Perl language that describe many kinds of
patterns in strings, e.g. motifs in DNA. Regular
expressions permit you to find and alter many
patterns with relative ease. The excellent
regular expressions in Perl are a major reason
for Perl's success as a bioinformatics
programming language.
25
Restriction enzymes, sites, and maps
  • A restriction enzyme is a protein that cuts DNA
    at short, specific sequences
  • EcoRI cuts at the restriction site GAATTC,
    between the G and A, on both complementary
    strands (the reverse complement of GAATTC is
    GAATTC), leaving "sticky ends" for insertion into
    a vector for cloning and sequencing.
  • A restriction map is the list of locations where
    a given restriction site occurs in the sequence

26
Finding restriction sites
  • while ( sequence /regexp/ig )
  • push ( _at_positions, pos(sequence) -
    length() 1)
  • while loop conditional is a pattern match
  • ig i stands for "case insensitive"
  • g stands for "global", will keep searching entire
    string
  • pos is a built-in function that gives the ending
    position of the last successful match subtract
    the length of the match () to get to the
    beginning of the match add 1 because biologists
    call the first base in DNA as position 1, while
    Perl calls the first letter in a string as
    position 0

27
  • GenBank

28
GenBank
  • http//ncbi.nlm.nih.gov/Genbank

29
GenBank Libraries
  • Examples
  • PRI primate sequences
  • ROD rodent sequences
  • PHG bacteriophage sequences
  • GSS genome survey sequences

30
  • Protein Data Bank

31
PDB - Protein Data Bank
  • The primary depository for 3D protein structures
  • http//www.rcsb.org/pdb/

32
  • BLAST

33
BLAST
  • Basic Local Alignment Search Tool
  • Most popular bioinformatics program
  • Discovers sequence similarity between your
    sequence and large databases
  • Voluminous output
  • Need to process output by program

34
Similarity vs Homology
  • String matching is computer science term for
    finding sequence similarity
  • Exact
  • Approximate percent identity
  • Homology
  • Sequences are or are not related evolutionarily
  • Yes or no no percentages

35
Parsing BLAST output
  • Many bioinformatics projects require frequent
    BLAST searches
  • Need to automate collection and processing
  • BLAST output does not label lines by their
    function, so use regular expressions
  • Several BLAST parsers are available as Perl
    modules see for example Bioperl modules
    BioToolsBPlite and BioToolsBlast

36
Bioinformatics Resources
  • Books
  • Bioinformatics Sequence and Genome Analysis by
    Mount
  • Developing Bioinformatics Computer Skills Gibas
    Jambeck
  • Introduction to Computational Biology
    Maps,Sequences and Genomes, Waterman
  • Bioinformatics A Practical Guide to the Analysis
    of Genes and Proteins. Baxecvanis Ouellette,
    editors
  • 2. Government/Public Information Resources
  • National Center for Biotechnology Information
    (NCBI) U.S. Government Center
    www.ncbi.nlm.nih.gov
  • European Molecular Biology Laboratory (EMBL).
    European Union Laboratory www.embl.org
  • European Bioinformatics Institute (EBI)- From UK,
    part of EMBL- www.ebi.ac.uk
  • UCSC Genome Resource- serves constructed genome
    data www.genome.ucsc.edu

37
Bioinformatics Resources (cont.)
  • 3. Conferences
  • ISMB Intellegent Systems for Molecular Biology,
    www.iscb.org/ismb2004/
  • Bioinformatics Open Source Conference,
    www.bioinformatics.org
  • RECOMB Conference on Computational Molecular
    Biology, recomb04.sdsc.edu/

38
Molecular Biology
  • 1. Helpful Reference Books for Programmers
  • Recombinant DNA, 2nd edition, Watson et al.
  • Molecular Biology of the Gene, Watson et al.
  • Molecular Cell Biology, Lodish, et al.

39
Approximate string matching
As an example of advanced data structures,
following is an algorithm that relies on a 2
dimensional array. It finds the best match for a
(shorter) pattern in a (longer) text. It is one
of several important dynamic programming
algorithms in bioinformatics
40
Approximate string matching 1 of 4
my pattern 'EIQADEVRL' print
"PATTERN\npattern\n" my text
'SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDE' print
"TEXT\ntext\n" my TLEN length text my
PLEN length pattern D is the Distance
matrix, which shows the "edit distance" between
substrings of the pattern and the text. It is
implemented as a reference to an anonymous
array. my D The rows correspond to the
text Initialize row 0 of D. for (my t0 t lt
TLEN t) D-gtt0 0
41
Approximate string matching 2 of 4
Compute the edit distances. for (my t1 t lt
TLEN t) for (my p1 p lt PLEN
p) D-gttp Choose whichever
alternative has the least cost min3(
Text and pattern may or may not match at this
character ... substr(text, t-1, 1) eq
substr(pattern, p-1, 1) ?
D-gtt-1p-1 If match, no increase in edit
distance! D-gtt-1p-1 1,
If the text is missing a character
D-gtt-1p 1, If the pattern is
missing a character D-gttp-1 1
)
42
Approximate string matching 3 of 4
Print D, the resulting edit distance array for
(my p0 p lt PLEN p) for (my t0
t lt TLEN t) print D-gttp, "
" print "\n" Find the best
match(es). The edit distances appear in the
last row. my _at_matches () my bestscore
10000000 for (my t1 t lt TLEN t)
if( D-gttPLEN lt bestscore)
bestscore D-gttPLEN _at_matches
(t) elsif( D-gttPLEN bestscore)
push(_at_matches, t)
43
Approximate string matching 4 of 4
Report the best match(es). print "\nThe best
match for the pattern pattern\n" print "has an
edit distance of bestscore\n" print "and
appears in the text ending at location" print
"s" if ( _at_matches gt 1) print " _at_matches\n"
44
Approximate string matching Output
PATTERN EIQADEVRL TEXT SVLQDRSMPHQEILAADEVLQESEM
RQQDMISHDE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1
1 1 1 0 2 2 2 2 2 2 2 2 2 2 2 2 1 0 1 2 2 2 1 1
2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 1 3 3 3 3 2 3 3 3
3 3 3 2 2 1 1 2 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2
3 3 2 4 4 4 4 3 3 4 4 4 4 4 3 3 2 2 1 2 3 3 3 3
3 3 3 3 3 3 3 3 3 4 3 3 3 4 3 5 5 5 5 4 3 4 5 5
5 5 4 4 3 3 2 2 2 3 4 4 4 4 4 4 4 4 4 4 3 4 4 4 4
3 4 6 6 6 6 5 4 4 5 6 6 6 5 4 4 4 3 3 3 2 3 4 5
4 5 4 5 5 5 5 4 4 5 5 5 4 3 7 7 6 7 6 5 5 5 6 7
7 6 5 5 5 4 4 4 3 2 3 4 5 5 5 5 6 6 6 5 5 5 6 6 5
4 8 8 7 7 7 6 5 6 6 7 8 7 6 6 6 5 5 5 4 3 3 4 5
6 6 6 5 6 7 6 6 6 6 7 6 5 9 9 8 7 8 7 6 6 7 7 8
8 7 7 6 6 6 6 5 4 3 4 5 6 7 7 6 6 7 7 7 7 7 7 7 6
The best match for the pattern EIQADEVRL has an
edit distance of 3 and appears in the text ending
at location 20
45
Bioperl
Bioperl is a collection of Perl modules for
bioinformatics. It has been developed by a large
group of volunteers. Bioperl has become a fairly
stable platform with which to develop software
it is growing and changing rapidly, but its core
is relatively mature. Bioperl is a must for
anyone interested in writing bioinformatics
software in Perl Don't reinvent the wheel!
(Unless you like to, for the fun or interest of
it.)
46
What can Bioperl do?
  • Sequence manipulation (DNA, RNA, proteins)
  • Several kinds of sequence objects
  • Sequence format conversion
  • Translation
  • Revcom
  • Sequence statistics (mol. weights, etc.)
  • Restriction enzyme maps
  • Signal cleavage sites

47
What can Bioperl do?
  • Sequence alignment
  • Interfaces with many alignment programs such as
    fasta, clustalw, gcg
  • Extract subalignment, consensus string
  • Alignment statistics

48
What can Bioperl do?
  • Biological databases
  • Retrieving Genbank, EMBL, acedb, etc.
  • Retrieving from remote databases
  • Indexing and retrieving from local databases
  • Parsing program reports from Blast and variants,
    HMMer, genefinding programs

49
What can Bioperl do?
  • Sequence annotation
  • Adding annotation
  • Parsing annotation from sequence databases

50
What can Bioperl do?
  • Protein structure
  • Read in PDB files
  • Extract chains and their residues
Write a Comment
User Comments (0)
About PowerShow.com