Title: Various Career Options Available
1 Introduction to Perl and BioPerl
Dr G. P. S. Raghava
Bioinformatics Centre, IMTECH, Chandigarh
Email: raghava@imtech.res.in
Web: http://imtech.res.in/raghava/
2 Perl
- Practical Extraction and Report Language
- Created by Larry Wall
- Runs on just about every platform
- Most popular on Unix/Linux systems
- Excellent language for file and data processing
3 Simple Program
#!/usr/local/bin/perl
# This is a comment line.
# This program prints "Hello world." to the screen.
print "Hello world.\n";
- On Unix, the first line (#!) gives the location of the Perl interpreter
- Comments start with # and end with the end of the line
- Program statements are terminated with semicolons
- \n is the newline character
4 Control Structures
#!/usr/local/bin/perl
# Print out 0,1,2,3,4,5,6,7,8,9
# In this case, $x is local only to the loop because "my" is used
for (my $x = 0; $x < 10; $x++) {
    print $x;
    if ($x < 9) {
        print ",";
    }
}
print "\n";
5 Control Structures
#!/usr/local/bin/perl
# Demonstrate the foreach loop, which goes through elements in an array.
my @users = ("bonzo", "gorgon", "pluto", "sting");
foreach $user (@users) {
    print "$user is alright.\n";
}
6 Functions
- Use sub to create a function.
- No named formal parameters; assign @_ to local subroutine variables.
#!/usr/local/bin/perl
# Subroutine for calculating the maximum
sub max {
    my $max = shift(@_);            # shift removes the first value from @_
    foreach $val (@_) {
        $max = $val if $max < $val; # Notice perl allows post ifs
    }
    return $max;
}
$high = max(1,5,6,7,8,2,4,9,3,4);
print "High value is $high\n";
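The second bullet above, assigning @_ to local variables, is often done with a single list assignment instead of repeated shift calls. A minimal sketch follows; the gc_content function and its optional second argument are hypothetical and chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fraction of G and C bases in a DNA string.
# Both arguments are copied out of @_ in one list assignment.
sub gc_content {
    my ($sequence, $decimals) = @_;
    $decimals = 2 unless defined $decimals;    # default for the optional argument
    return 0 if length($sequence) == 0;
    my $gc = ($sequence =~ tr/GCgc//);         # tr with an empty replacement only counts
    return sprintf("%.${decimals}f", $gc / length($sequence));
}

print "GC content is ", gc_content("ATGCGC"), "\n";    # prints 0.67
```

Because @_ is an ordinary list, a missing trailing argument simply arrives as undef, which is why the default is set explicitly.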
7 Files
- File handles are used to access files
- open and close functions
#!/usr/local/bin/perl
# Open a file and print its contents to copy.txt
my $filename = $ARGV[0];
open(MYFILE, "<$filename");     # < indicates read, > indicates write
open(OUTPUT, ">copy.txt");
while ($line = <MYFILE>) {      # The <> operator reads a line
    print OUTPUT $line;         # no newline is needed, read from file
}
close MYFILE;                   # Parentheses are optional
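The script above uses bareword file handles and does not check whether open succeeded. A minimal sketch of the same copy, assuming Perl 5.6 or later, with lexical file handles, three-argument open, and error checks. The copy_file name is illustrative, not part of the original slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy $source to $target line by line; returns the number of lines copied.
sub copy_file {
    my ($source, $target) = @_;
    open(my $in,  '<', $source) or die "Cannot read $source: $!";
    open(my $out, '>', $target) or die "Cannot write $target: $!";
    my $count = 0;
    while (my $line = <$in>) {
        print {$out} $line;    # the line still carries its own newline
        $count++;
    }
    close($in);
    close($out) or die "Cannot close $target: $!";
    return $count;
}

# Usage: perl copy.pl input.txt   (writes copy.txt, as in the slide)
if (@ARGV) {
    my $copied = copy_file($ARGV[0], "copy.txt");
    print "Copied $copied lines\n";
}
```

Keeping the mode as a separate argument means a file name that happens to start with < or > cannot change how the file is opened.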
8 Regular Expressions
- One of Perl's strengths is pattern matching
- Perl's regular expression language is extremely powerful, but can be challenging to learn
- Some examples follow
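A few small pattern-matching operations on biological strings, as a sketch. The function names, the sample sequence, and the FASTA header are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simple match: does the sequence contain the given site?
sub has_site {
    my ($dna, $site) = @_;
    return $dna =~ /\Q$site\E/ ? 1 : 0;
}

# Capture: pull the identifier out of a FASTA header line.
sub fasta_id {
    my ($header) = @_;
    my ($id) = $header =~ /^>(\S+)/;
    return $id;
}

# Global match: zero-based offsets of every (non-overlapping) motif occurrence.
sub match_offsets {
    my ($dna, $motif) = @_;
    my @offsets;
    while ($dna =~ /\Q$motif\E/g) {
        push @offsets, pos($dna) - length($motif);    # pos() points just past the match
    }
    return @offsets;
}

# Substitution: transcribe DNA to RNA by replacing T with U.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;
    return $rna;
}

my $dna = "GGATGAAATTTCCCTAGAATTCGG";
print "EcoRI site found\n" if has_site($dna, "GAATTC");
print "ID is ", fasta_id(">seq1 hypothetical protein"), "\n";
print "ATG at offsets: ", join(",", match_offsets($dna, "ATG")), "\n";
print "RNA: ", transcribe($dna), "\n";
```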
9 Comma Separated Value Files
#!/usr/local/bin/perl
# Some simple code demonstrating how to use split and regular expressions.
# This code extracts out values in a CSV file.
my $filename = $ARGV[0];
open(INPUT, "<$filename");
while (<INPUT>) {
    chomp;                      # Remove terminating newline
    my @values = split /,/;     # Split string in $_ where , exists
    print "The first value is " . $values[0] . "\n";
}
close INPUT;
10 Objects
- Perl supports object-oriented programming
- The constructor is named new by convention
- A class is really a special kind of package.
- Objects are created with bless
11 Example Class Definition
package Critter;
# constructor
sub new {
    my $objref = {};            # reference to an empty hash
    bless $objref;              # make it an object in Critter class
    return $objref;             # return the reference
}
# Instance method, first parameter is object reference
sub display {
    my $self = shift;           # just to demonstrate
    print "I'm a critter.\n";
}
1;                              # must end class with a true value
- Store in Critter.pm
12 Example Object Usage
#!/usr/local/bin/perl
use Critter;
my $critter = new Critter;      # create an object
$critter->display;              # display the object
display $critter;               # alternative notation
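The Critter class above uses one-argument bless and the indirect notation (new Critter). A sketch of a commonly used alternative style: two-argument bless, so that a subclass can inherit the constructor, and arrow method calls throughout. The name attribute and the speak method are additions for illustration and are not part of the slide's class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Critter;

# Constructor: the class name arrives as the first argument.
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name} || 'critter' };
    return bless $self, $class;    # two-argument bless records the class
}

# Instance method: the object reference is the first argument.
sub speak {
    my ($self) = @_;
    return "I'm a $self->{name}.";
}

package main;

my $critter = Critter->new(name => 'gopher');
print $critter->speak(), "\n";
```

Here both packages live in one file for brevity; in the slide's layout the Critter package would sit in Critter.pm and end with a true value.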
13 BioPerl (http://www.bioperl.org/)
- Definition: It is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics.
- FACTS
- Started in 1995; supported by the Open Bioinformatics Foundation
- These are reusable code (subroutines)
- It does not include any ready-to-use programs
- It needs basic knowledge of Perl programming
- BioPerl is open source software that is still under active development
- Related projects: BioJava, Biopython, BioCORBA, EMBOSS
14 What is BioPerl?
- An open source project
- http://bio.perl.org or http://www.cpan.org
- A loose international collaboration of biologist/programmers
- Nobody (that I know of) gets paid for this
- A collection of Perl modules and methods for doing a number of bioinformatics tasks
- Think of it as subroutines to do biology
- Consider it a tool-box
- If you need a hammer, it is in there.
- If you need a 7/16 hex-head wrench, you might need to code that yourself.
15 What BioPerl isn't
- Out of the box solutions to problems
- You will have to know a little Perl, and you will have to read documentation
- Particularly well documented
- Yes, there is documentation, but it can be difficult to see the big picture, or sometimes the small picture
- Fast
- Generally the overhead involved in making Perl objects (complicated ones) and loading in dependencies tends to make BioPerl slow.
- My own BLAST parser takes a small fraction of the time of the BioPerl parser. Pro: mine is fast. Con: I had to write it and maintain it.
16 Applications of BioPerl
- Accessing sequence data from local and remote databases
- Transforming formats of database/file records
- Manipulating individual sequences
- Searching for "similar" sequences
- Creating and manipulating sequence alignments
- Searching for genes and other structures on genomic DNA
- Developing machine-readable sequence annotation
17 Accessing sequence data from Databases
- Accessing data from remote databases
- GenBank
- Swissprot
- EMBL
- Indexing and accessing local databases
- Create index for your file
- Access the file by keyword
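The two steps listed for local databases, creating an index and then accessing a record by keyword, can be sketched in plain Perl. This is not how BioPerl's own index modules are implemented; it only illustrates the idea using byte offsets into a FASTA file. The function names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build an index: sequence ID => byte offset of its header line.
sub build_index {
    my ($fasta_file) = @_;
    my %offset_of;
    open(my $fh, '<', $fasta_file) or die "Cannot read $fasta_file: $!";
    my $position = tell($fh);
    while (my $line = <$fh>) {
        $offset_of{$1} = $position if $line =~ /^>(\S+)/;
        $position = tell($fh);    # offset at which the next line starts
    }
    close($fh);
    return \%offset_of;
}

# Fetch one record by ID: seek to its offset, then read until the next header.
sub fetch_sequence {
    my ($fasta_file, $index, $id) = @_;
    return unless exists $index->{$id};
    open(my $fh, '<', $fasta_file) or die "Cannot read $fasta_file: $!";
    seek($fh, $index->{$id}, 0);
    my $header = <$fh>;           # consume the header line itself
    my $sequence = '';
    while (my $line = <$fh>) {
        last if $line =~ /^>/;
        chomp $line;
        $sequence .= $line;
    }
    close($fh);
    return $sequence;
}
```

A typical use would be to call build_index once per file, keep the returned hash reference, and then call fetch_sequence with an ID as often as needed, so that only the requested record is read from disk.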
18 Transforming formats of database/file records
- This allows you to convert your sequence/alignment from one format to another (e.g. Readseq)
- Change sequence format
- From GCG, PIR, FASTA etc.
- To FASTA, GCG, PIR etc.
- Change format of multiple sequence alignment
- From FASTA, CLUSTAL-W, GCG etc.
- To GCG, FASTA, CLUSTAL-W etc.
19 Manipulating sequence
- BioPerl contains many modules with functions for sequence analysis.
- Sequence data manipulation: display components, reverse complement, translate/reverse translate from NT to protein
- Obtaining basic sequence statistics (residues, MW, codon)
- Restriction enzyme mapping
- Identifying amino acid cleavage sites
- Manipulation by EMBOSS using BioPerl
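To make two of these operations concrete, here is a plain-Perl sketch of reverse complement and translation. BioPerl offers these as methods on its sequence objects (revcom and translate on Bio::Seq); the code below is an independent illustration, not BioPerl's implementation. It assumes uppercase DNA, reading frame 1, and the standard genetic code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
my @bases = qw(T C A G);
my @amino = split //, 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = shift @amino;
        }
    }
}

# Reverse complement: reverse the string, then swap A<->T and C<->G.
sub reverse_complement {
    my ($dna) = @_;
    my $revcom = reverse $dna;    # scalar context reverses the characters
    $revcom =~ tr/ACGT/TGCA/;
    return $revcom;
}

# Translate frame 1; unknown codons become X, a trailing partial codon is dropped.
sub translate {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= exists $codon_table{$codon} ? $codon_table{$codon} : 'X';
    }
    return $protein;
}

my $dna = "ATGGCCTAA";
print "Reverse complement: ", reverse_complement($dna), "\n";    # TTAGGCCAT
print "Translation:        ", translate($dna), "\n";             # MA*
```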
20 Searching for "similar" sequences
- Running BLAST locally (Standalone BLAST)
- Running BLAST remotely
- Parsing BLAST and FASTA reports
- Parsing HMM reports
21 Creating and manipulating sequence alignments
- Aligning 2 sequences with Smith-Waterman (pSW)
- Aligning 2 sequences with BLAST using bl2seq and AlignIO
- Aligning multiple sequences (Clustalw.pm, TCoffee.pm)
- Manipulating / displaying alignments (SimpleAlign)
22 Example 1
#!/usr/local/bin/perl
# Collect documents from PubMed containing the term "Breast Cancer" and print them.
use Bio::Biblio;
my $biblio = new Bio::Biblio;
my $collection = $biblio->find("breast cancer");
while ($collection->has_next) {     # there are underlines before next
    print $collection->get_next;
}
23 Example 2
#!/usr/local/bin/perl
# Get a sequence from RefSeq by accession number
use Bio::DB::RefSeq;
$gb = new Bio::DB::RefSeq;
$seq = $gb->get_Seq_by_acc("NM_007304");
print $seq->seq();
24 A simple script
#!/opt/perl/bin/perl -w
# bioperl_gb_fetch.pl
use strict;
use Bio::DB::GenBank;
my $gb = Bio::DB::GenBank->new();
my $seq_obj = $gb->get_Seq_by_acc("AF303112");
print $seq_obj->display_id(), "\n";
print $seq_obj->seq(), "\n";
25 Changing Formats
#!/opt/perl/bin/perl -w
# genbank_to_fasta.pl
use Bio::SeqIO;
my $input = Bio::SeqIO->new(-file => $ARGV[0],
                            -format => 'GenBank');
my $output = Bio::SeqIO->new(-file => ">output.fasta",
                             -format => 'Fasta');
while (my $seq = $input->next_seq()) {
    $output->write_seq($seq);
}
26 Parsing Blast Reports
- One of the strengths of BioPerl is its ability to parse complex data structures, like a BLAST report.
- Unfortunately, there is a bit of arcane terminology.
- Also, you have to "think like BioPerl" in order to figure out the syntax.
- This next script might get you started
27 Parse BLAST output
#!/opt/perl/bin/perl -w
# bioperl_blast_parse.pl
# program prints out query, and all hits with scores for each blast result
use Bio::SearchIO;
my $record = Bio::SearchIO->new(-format => "blast", -file => $ARGV[0]);
while (my $result = $record->next_result) {
    print ">", $result->query_name, " ", $result->query_description, "\n";
    my $seen = 0;
    while (my $hit = $result->next_hit) {
        print "\t", $hit->name, "\t", $hit->bits, "\t", $hit->significance, "\n";
        $seen++;
    }
    if ($seen == 0) { print "No Hits Found\n"; }
}
28 SearchIO parsers
- SearchIO can parse (and reformat) several formats containing alignment or similarity data
- blast
- xml formatted blast (still a little wonky)
- psi-blast
- exonerate
- WABA
- FASTA
- HMMER
- The interface is the same for all of these, making your life a little easier.
29 Some advanced BioPerl
- What if you want to draw pretty images of blast reports? Bio::Graphics to the rescue
- Input - A blast report (single query)
- Output
- Script in PERL13/blast_to_pngImage.pl
30 Some really advanced BioPerl
- What if you want to display an entire genome with annotation? Grow your own genome project
- We can do that all in BioPerl (and a MySQL or Oracle database)
- GBrowse - the Generic Genome Browser
- Allows any feature to be displayed on a reference sequence
- Many different styles of glyphs, so all features can be drawn in a different style
- Allows user to zoom in and out on reference sequence
- User can select which features to display
- User can upload their own features to be displayed on your reference sequence
- And more!!!
- http://www.bch.msu.edu/cgi-bin/gbrowse
- http://www.gmod.org/
31 GBrowse (Generic Genome Browser)
32 BioPerl Pros and Cons
- It can be a substantial investment in time to learn how to use BioPerl properly
- Once you get used to it, it is pretty simple to do some complicated things
- There are a lot of nice tools in there, and (usually) somebody else takes care of fixing parsers when they break
- BioPerl code is portable - if you give somebody a script, it will probably work on their system
- BioPerl, unfortunately, has high overhead. Sometimes it is quicker and easier to write it yourself
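As an illustration of the "write it yourself" option, here is a minimal sketch of a hand-written parser. It assumes BLAST+ tabular output (-outfmt 6) with the default twelve columns, which is a much simpler input than the text reports handled by SearchIO. The function name and the sample lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Default column names for BLAST+ tabular output (-outfmt 6).
my @columns = qw(qseqid sseqid pident length mismatch gapopen
                 qstart qend sstart send evalue bitscore);

# Parse tabular lines into a hash: query ID => list of hit records.
sub parse_tabular_blast {
    my (@lines) = @_;
    my %hits_for;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^#/ or $line =~ /^\s*$/;    # skip comments and blank lines
        my %hit;
        @hit{@columns} = split /\t/, $line;           # hash slice: one key per column
        push @{ $hits_for{ $hit{qseqid} } }, \%hit;
    }
    return \%hits_for;
}

# Made-up sample lines, for demonstration only.
my @sample = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t360\n",
    "query1\tsubjB\t80.00\t150\t30\t0\t10\t159\t1\t150\t2e-30\t120\n",
);
my $hits = parse_tabular_blast(@sample);
for my $query (sort keys %$hits) {
    print ">$query\n";
    print "\t$_->{sseqid}\t$_->{bitscore}\t$_->{evalue}\n" for @{ $hits->{$query} };
}
```

The trade-off is the one the slide describes: this is short and fast, but it understands exactly one format and has to be maintained by whoever wrote it.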
33 So what is BioPerl? (continued)
- 551 modules (incl. 82 interface modules)
- 37 module groups
- 79,582 lines of code (223,310 lines total)
- 144 lines of code per module
- For more info: BioPerl Module Listing
34 Some statistics
35 Downloading modules
- Modules can be obtained from
- www.CPAN.org (Perl Modules)
- www.BioPerl.org (BioPerl Modules)
- Downloading modules from CPAN
- Interactive mode
- perl -MCPAN -e shell
- Batch mode
- use CPAN
- clean, install, make, recompile, test
36 Installing Modules
- Steps for installing modules
- Uncompress the module
- gunzip file.tar.gz
- tar xvf file.tar
- perl Makefile.PL
- make
- make test
- make install
37 Directory Structure
- BioPerl directory structure organization
- Bio/ BioPerl modules
- models/ UML for BioPerl classes
- t/ Perl built-in tests
- t/data/ Data files used for the tests
- scripts/ Reusable scripts that use BioPerl
- scripts/contributed/ Contributed scripts not necessarily integrated into BioPerl.
- doc/ "How To" files and the FAQ as XML
38 Platforms
- HP-UX / Itanium ('ia64') 11.0
- HP-UX / PA-RISC 11.0
- MacOS
- MacOS X
- CygWin NT_5 v. 1.3.10 on Windows 2000 5.00
- Win32, WinNT i386
- IRIX64 6.5 SGI
- Solaris 2.8 UltraSparc
- OpenBSD 2.8 i386
- RedHat Linux 7.2 i686
- Linux i386
39 References
- Programming Perl by Wall, Christiansen, and Schwartz (O'Reilly)
- Learning Perl by Schwartz and Phoenix (O'Reilly)
- Beginning Perl for Bioinformatics by Tisdall (O'Reilly)
- http://www.perl.com
- http://www.bioperl.org