Title: print
1??? ???? ?? ??? ??????
- ???? ????? ?-06.01.13 ????? ???? ????? ????
????? ?????? -
- ???? ???? ???? ?????? ????? ?????
print It is your chance to give us feedback\n
use strict
2BioPerl
3BioPerl
- BioPerl is a collection of Perl modules for
bioinformatics applications. - It is an active open source software project
founded in 1996. - BioPerl provides software modules for many of the
typical bioinformatic tasks. - Among these are
- Format conversions
- Manipulating biological sequences
- Searching for similar sequences
- Creating sequence alignments
- Searching for ORFs on genomic DNA
- And more
4BioPerl
BioPerl modules are called BioXXX You can use
the BioPerl wiki http//bio.perl.org/ with
documentation and examples for how to use them
which is the best way to learn this. We recommend
beginning with the "How-tos" http//www.bioperl.o
rg/wiki/HOWTOs To a more hard-core inspection
of BioPerl modules BioPerl 1.6.1 Module
Documentation
5Object-oriented use of packages
Many packages are meant to be used as objects.
In Perl, an object is a data structure that can
use subroutines that are associated with
it. We will not learn object oriented
programming,but we will learn how to create and
use objects defined by BioPerl packages.
6BioPerl the SeqIO module
BioPerl modules are named Bioxxxx The
BioSeqIO module deals with Sequences Input and
Output We will pass arguments to the new
argument of the file name and format use
BioSeqIO my in BioSeqIO-gtnew("-file" gt
"ltseq.gb", "-format" gt "GenBank")
File argument(filename as would be in open)
Format argument
A list of all the sequence formats BioPerl can
read is inhttp//www.bioperl.org/wiki/HOWTOSeqI
OFormats
7BioPerl the SeqIO module
use BioSeqIO my in BioSeqIO-gtnew("-file"
gt "ltseq.gb", "-format" gt
"GenBank") my seqObj in-gtnext_seq()
Perform next_seq()subroutine on in You could
think of it asSeqIOnext_seq(in)
next_seq() returns the next sequence in the file
as a BioSeq object (we will talk about them
soon)
8BioPerl the SeqIO module
use BioSeqIO my in BioSeqIO-gtnew("-file"
gt "ltadeno12.gb", "-format" gt
"GenBank") my out BioSeqIO-gtnew("-file" gt
"gtadeno12.out.fas",
"-format" gt "Fasta") my seqObj
in-gtnext_seq() while ( defined(seqObj)
) out-gtwrite_seq(seqObj) seqObj
in-gtnext_seq()
write_seq()write a BioSeq object to out
according to its format
9BioPerl the Seq module
use BioSeqIO my in BioSeqIO-gtnew( "-file"
gt "ltEcoli.prot.fasta",
"-format" gt "Fasta") my seqObj
in-gtnext_seq() while (defined(seqObj))
print "ID".seqObj-gtid()."\n" 1st word
in header print "Desc".seqObj-gtdesc()."\n"
rest of header print "Sequence".seqObj-gtseq()
."\n" seq string print "Length".seqObj-gtleng
th()."\n" seq length seqObj
in-gtnext_seq() You can read more about the
BioSeq subroutines in http//www.bioperl.org/w
iki/HOWTOBeginnersThe_Sequence_Object
10Print last 30aa of each sequence (no BioPerl)
open (my in, "lt","seq.fasta") or die "Cannot
open seq.fasta..." my fastaLine ltingt while
(defined fastaLine) chomp fastaLine my
header"" Read first word of header if
(fastaLine m/gt(\S)/) header
substr(fastaLine,1) fastaLine ltingt
Read seq until next header my seq "" while
((defined fastaLine) and(substr(fastaLine,0,1)
ne "gt" )) chomp fastaLine seq
seq.fastaLine fastaLine ltingt
print last 30aa my subseq substr(seq,-30) p
rint "header\n."subseq\n"
11Now using BioPerl
use BioSeqIO my in BioSeqIO-gtnew("-file"gt
"ltseq.fasta","-format"gt"Fasta") my seqObj
in-gtnext_seq() while (defined(seqObj))
Read first word of header my header
seqObj-gtid() print last 30aa my seq
seqObj-gtseq() my subseq substr(seq,-30) p
rint "header\n" print "subseq\n" seqObj
in-gtnext_seq()
Note BioPerl warnings aboutSubroutine ...
redefined at ... Should not trouble you, it is a
known issue it is not your fault and won't
effect your script's performances.
12Now using BioPerl
use BioSeqIO my in BioSeqIO-gtnew("-file"gt
"ltseq.fasta","-format"gt"Fasta") my
seqObj while (defined (seqObj
in-gtnext_seq()) ) Read first word of
header my header seqObj-gtid() print last
30aa my seq seqObj-gtseq() my subseq
substr(seq,-30) print "header\n" print
"subseq\n"
or alternatively
13Class exercise 12a
- Use BioSeqIO to read a FASTA file and print to
an output FASTA file only sequences shorter than
3,000 bases. (use the EHD nucleotide FASTA from
the webpage) - Use BioSeqIO to read a FASTA file, and print
(to the screen) header lines that contain the
words "Mus musculus". - 3. Write a script that uses BioSeqIO to read a
GenPept file and convert it to FASTA. (use
preProInsulinRecords.gp from the courses
webpage) - 4. Same as Q1, but print to the FASTA the
reverse complement of each sequence. (Do not use
the reverse or tr// functions! BioPerl can do it
for you - read the BioPerl documentation).
14BioPerl downloading files from the web
The BioDBGenbank module allows us to
downloada specific record from the NCBI
website use BioDBGenBankmy gb
BioDBGenBank-gtnewmy seqObj
gb-gtget_Seq_by_acc("J00522") print
seqObj-gtseq() see more options
in http//www.bioperl.org/wiki/HOWTOBeginnersRe
trieving_a_sequence_from_a_database
http//doc.bioperl.org/releases/bioperl-1.4/Bio/DB
/GenBank.html
15BLAST
Congrats, you just sequenced yourself some DNA.
And you want to see if it exists in any other
organism
16BLAST
BLAST - Basic Local Alignment and Search Tool
BLAST helps you find similarity between your
sequence and other sequences
17BLAST
BLAST - Basic Local Alignment and Search Tool
BLAST helps you find similarity between your
sequence and other sequences
18BLAST
BLAST helps you find similarity between your
sequence and other sequences
19BLAST
Database
hit
query
20BLAST
You can search using BLAST proteins or DNA
Query DNA Protein Database DNA Protein
blastn nucleotides vs. nucleotides
blastp protein vs. protein
blastx translated query vs. protein database
tblastn protein vs. translated nuc. DB
tblastx translated query vs. translated database
21BioPerl reading BLAST output
First we need to have the BLAST results in a text
file BioPerl can read.Here is one way to achieve
this (using NCBI BLAST)
Download
Text
An alternative is to use BLASTALL on your computer
22BioPerl reading BLAST output
Query
Query gi52840257refYP_094056.1 chromosomal
replication initiator protein DnaA Legionella
pneumophila subsp. pneumophila str. Philadelphia
1 (452 letters) Database Coxiella.faa
1818 sequences 516,956 total
letters Searching................................
..................done
Score
E Sequences producing significant alignments
(bits) Value gi29653365refNP_
819057.1 chromosomal replication initiator p...
633 0.0 gi29655022refNP_820714.1
DnaA-related protein Coxiella burn... 72
4e-14 gi29654861refNP_820553.1 Holliday
junction DNA helicase B C... 32
0.033 gi29654871refNP_820563.1 ATPase, AFG1
family Coxiella burne... 27 1.4
gi29654481refNP_820173.1 hypothetical
protein CBU_1178 Coxi... 25 3.1
gi29654004refNP_819696.1 succinyl-diaminopime
late desuccinyl... 25 3.1
Results info
23BioPerl reading BLAST output
gi215919162refNP_820316.2 threonyl-tRNA
synthetase Coxiella... 25 5.3
gi29655364refNP_821056.1 transcription
termination factor rh... 24 9.0
gi215919324refNP_821004.2 adenosylhomocystein
ase Coxiella b... 24 9.0
gi29653813refNP_819505.1 putative
phosphoribosyl transferase... 24 9.0
gtgi29653365refNP_819057.1 chromosomal
replication initiator protein
Coxiella burnetii RSA 493 Length
451 Score 633 bits (1632), Expect 0.0
Identities 316/452 (69), Positives 371/452
(82), Gaps 5/452 (1) Query 1
MSTTAWQKCLGLLQDEFSAQQFNTWLRPLQAYMDEQR-LILLAPNRFVVD
WVRKHFFSRI 59 T W KCLG LDE
QQNTWRPL A Q LLLAPNRFVDW F RI Sbjct
3 LPTSLWDKCLGYLRDEIPPQQYNTWIRPLHAIESKQNGLLLLAPNR
FVLDWINERFLNRI 62 Query 60 EELIKQFSGDDIKAISIEVG
SKPVEAVDTPAETIVTSSSTAPLKSAPKKAVDYKSSHLNK 119
EL S D I GS E AP
N Sbjct 63 TELLDELS-DTPPQIRLQIGSR
STEMPTKNSHEPSHRKAAAPPAGT---TISHTQANINS
118 Query 120 KFVFDSFVEGNSNQLARAASMQVAERPGDAYNPL
FIYGGVGLGKTHLMHAIGNSILKNNP 179 F
FDSFVEG SNQLARAA QVAE PG AYNPLFIYGGVGLGKTHLMHAGN
IL Sbjct 119 NFTFDSFVEGKSNQLARAAATQVAENPGQAY
NPLFIYGGVGLGKTHLMHAVGNAILRKDS 178
Result header
high scoring pair (HSP) data
HSP Alignment
Note There could be more than one HSP for each
result, in case of homology in different parts
of the protein
24BioSearchIO reading BLAST output
The BioSearchIO module can read and parse BLAST
output use BioSearchIO my blast_report
BioSearchIO-gtnew("-file" gt
"ltLegCox.blastp", "-format" gt
"blast" ) my (resultObj, hitObj,
hspObj) while( defined(resultObj
blast_report-gtnext_result()) ) print
"Checking query ".resultObj-gtquery_name()."\n"
while( defined(hitObj resultObj-gtnext_hit())
) print "Checking hit ". hitObj-gtname()."\n
" hspObj hitObj-gtnext_hsp() print
"Best score ".hspObj-gtscore()."\n"
(See the BLAST output example in the
courses website)
25BioPerl reading BLAST output
You can send parameters to the subroutines of the
objects Get length of HSP (including
gaps) hspObj-gtlength("total") Get length of
hit part of alignment (without gaps) hspObj-gtleng
th("hit") Get length of query part of
alignment (without gaps) hspObj-gtlength("query")
More about what you can do with query, hit and
hsp see in http//www.bioperl.org/wiki/HOWTOSear
chIOTable_of_Methods
26Class exercise 12b
- Uses BioSearchIO to parse the BLAST results
(LegCox.blastp provided in the courses website) - For each query print out its name and the name of
its first hit. - b) Print the percent identity of each HSP of the
first hit of each query. - c) Print the e-value of each HSP of the first
hit of each query.