1. Introduction to perl programming: the minimum to know!
Fredj Tekaia, Institut Pasteur, tekaia_at_pasteur.fr
Bioinformatic and Comparative Genome Analysis Course
HKU-Pasteur Research Centre - Hong Kong, China, August 17 - August 29, 2009
2. perl
A basic program:
#!/bin/perl
# Program to print a message
print 'Hello world.';    # Print a message
3. Variables, Arrays
$val = 9;
$val = "9";
$val = "ABC transporter";
Variable names are case sensitive: $val is different from $Val.
4. Operations and Assignment
Perl uses arithmetic operators:
$a = 1 + 2;     # Add 1 and 2 and store in $a
$a = 3 - 4;     # Subtract 4 from 3 and store in $a
$a = 5 * 6;     # Multiply 5 and 6
$a = 7 / 8;     # Divide 7 by 8 to give 0.875
$a = 9 ** 10;   # Nine to the power of 10
$a = 5 % 2;     # Remainder of 5 divided by 2
$a++;           # Return $a and then increment it
$a--;           # Return $a and then decrement it
For strings perl has, among others:
$a = $b . $c;   # Concatenate $b and $c
$a = $b x $c;   # $b repeated $c times
5. To assign values perl includes:
$a = $b;        # Assign $b to $a
$a += $b;       # Add $b to $a
$a -= $b;       # Subtract $b from $a
$a .= $b;       # Append $b onto $a
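The operators from the two slides above can be tried together in a short script. This is only a sketch: the variable names and values are illustrative, and `use strict`, `use warnings` and `my` are additions that do not appear on the slides.

```perl
use strict;
use warnings;

# Arithmetic operators
my $sum       = 1 + 2;      # 3
my $power     = 2 ** 10;    # 1024
my $remainder = 5 % 2;      # 1

# String operators: concatenation and repetition
my $codon  = 'AT' . 'G';    # 'ATG'
my $repeat = 'TA' x 3;      # 'TATATA'

# Assignment operators change the variable in place
my $count = 10;
$count += 5;                # 15
$count -= 3;                # 12
my $seq = 'ATG';
$seq .= 'TAA';              # 'ATGTAA'

print "$sum $power $remainder $codon $repeat $count $seq\n";
```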
6. Array variables
An array variable is a list of scalars (i.e. numbers and/or strings). Array variables are prefixed by @.
@SEQNAME = ("MG001", "MG002", "MG003");
$SEQNAME[2]     # (MG003) Attention: indices start at 0 (0, 1, 2, ...)
@num = (0,1,2,3);
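A short sketch of the most common array operations, using the @SEQNAME array from the slide. The added element "MG004" is hypothetical, and `$#SEQNAME`, `scalar` and `push` are standard Perl features not shown on the slide.

```perl
use strict;
use warnings;

my @SEQNAME = ("MG001", "MG002", "MG003");

my $first      = $SEQNAME[0];        # "MG001": indices start at 0
my $third      = $SEQNAME[2];        # "MG003"
my $last_index = $#SEQNAME;          # 2: index of the last element
my $size       = scalar(@SEQNAME);   # 3: number of elements

push(@SEQNAME, "MG004");             # append an element at the end

print "first=$first third=$third last_index=$last_index size=$size\n";
print "now: @SEQNAME\n";
```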
7.
@L_CODONS = ('TTT','TTC','TTA','TTG',
'CTT','CTC','CTA','CTG',
'ATT','ATC','ATA','ATG',
'GTT','GTC','GTA','GTG',
'TCT','TCC','TCA','TCG',
'CCT','CCC','CCA','CCG',
'ACT','ACC','ACA','ACG',
'GCT','GCC','GCA','GCG',
'TAT','TAC','TAA','TAG',
'CAT','CAC','CAA','CAG',
'AAT','AAC','AAA','AAG',
'GAT','GAC','GAA','GAG',
'TGT','TGC','TGA','TGG',
'CGT','CGC','CGA','CGG',
'AGT','AGC','AGA','AGG',
'GGT','GGC','GGA','GGG');
8.
@AA = ('A','R','N','D','C','Q','E','G','H','I','L',
'K','M','F','P','S','T','W','Y','V','B');
@mm = ('a','r','n','d','c','q','e','g','h','i','l','k',
'm','f','p','s','t','w','y','v','b');
9. Associative arrays (hash tables)
Ordinary list arrays allow us to access their elements by number. The first element of array @AA is $AA[0]. The second element is $AA[1], and so on. But perl also allows us to create arrays which are accessed by string. These are called associative arrays. The array itself is prefixed by a % sign.
10.
%ages = ("Michael", 39, "Angie", 27,
"Willy", "21 years", "The Queen Mother", 108);
$ages{"Michael"}             # Returns 39
$ages{"Angie"}               # Returns 27
$ages{"Willy"}               # Returns "21 years"
$ages{"The Queen Mother"}    # Returns 108
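The same idea applies to sequence data. The sketch below assumes a hypothetical hash %codon2aa that maps a few codons to one-letter amino acids. The `=>` operator is an alternative spelling of the comma used on the slide, and `exists`, `keys` and `sort` are standard functions not shown there.

```perl
use strict;
use warnings;

# Hypothetical mapping from codon to one-letter amino acid (small subset)
my %codon2aa = (
    'ATG' => 'M',
    'TGG' => 'W',
    'TTT' => 'F',
    'TAA' => '*',   # stop codon
);

my $aa    = $codon2aa{'ATG'};                    # 'M'
my $known = exists $codon2aa{'GGG'} ? 1 : 0;     # 0: key is not in the hash

# keys returns the keys in no particular order, so sort them first
my @lines;
foreach my $codon (sort keys %codon2aa) {
    push @lines, "$codon -> $codon2aa{$codon}";
}
print "$_\n" foreach @lines;
```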
11. File handling
A script (cat.pl) equivalent to the UNIX cat:
#!/bin/perl
open(FILE, "GMG.pep");
while (<FILE>) {
print $_;
}
close(FILE);
Use:
chmod a+x cat.pl
cat.pl
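A slightly more defensive variant of the same loop is sketched below. It uses the three-argument form of `open` with a lexical filehandle and stops with an error message if the open fails. So that it runs without GMG.pep, it reads from a string ($data, with invented contents) treated as an in-memory file. With a real file, the name would replace `\$data`.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of a file such as GMG.pep
my $data = ">MG001 hypothetical protein\nMKV\n>MG002 another entry\nMAL\n";

# Opening a reference to a string gives an in-memory filehandle
open(my $fh, '<', \$data) or die "Cannot open input: $!";

my $nlines = 0;
while (my $line = <$fh>) {
    print $line;      # same effect as cat
    $nlines++;
}
close($fh);
```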
12. split
A very useful function in perl: split breaks up a string and places the pieces into an array.
#!/bin/perl
open(FILE, "GMG.pep");
while (<FILE>) {
@tab = split(/\s/, $_);
print "$tab[0]\n";
}
close(FILE);
13.
#!/bin/perl
open(FILE, "GMG.pep");
while (<FILE>) {
@tab = split(/\s/, $_, 2);
$NOM{$tab[0]} = $tab[1];
print $NOM{$tab[0]};
}
close(FILE);
General form: @tab = split(/\s/, $_, n);    # split $_ into at most n fields
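The effect of the third argument is easiest to see on a single line. The header below is invented for illustration. Without a limit every whitespace character separates two fields, while a limit of 2 keeps everything after the first whitespace together.

```perl
use strict;
use warnings;

my $header = ">MG001 DNA polymerase III beta subunit";   # hypothetical

# No limit: every whitespace character separates two fields
my @all = split(/\s/, $header);

# Limit of 2: split only at the first whitespace
my ($name, $desc) = split(/\s/, $header, 2);

print scalar(@all), " fields\n";   # 6 fields
print "name: $name\n";             # name: >MG001
print "desc: $desc\n";             # desc: DNA polymerase III beta subunit
```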
14. Control structures
foreach
To go through each element of an array or other list-like structure (such as lines in a file) perl uses the foreach structure. This has the form:
foreach $nom (@SEQNAME)    # Visit each item in turn and call it $nom
{
print "$nom\n";            # Print the item
}
15.
foreach $j (0 .. 2)        # Visit each value in turn and call it $j
{
print "$SEQNAME[$j]\n";    # Print the item
}
foreach $j (0 .. $#AA)     # Visit each index of @AA in turn and call it $j
{
print "$AA[$j]\n";         # Print the item
}
16. Testing
Here are some tests on numbers and strings.
$a == $b     # Is $a numerically equal to $b? Beware: don't use the = operator.
$a != $b     # Is $a numerically unequal to $b?
$a eq $b     # Is $a string-equal to $b?
$a ne $b     # Is $a string-unequal to $b?
You can also use logical and, or and not:
($a && $b)   # Are $a and $b both true?
($a || $b)   # Is either $a or $b true?
!($a)        # Is $a false?
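The difference between the numeric and the string tests matters as soon as values are read from a file, where everything arrives as text. A small sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $x = "1.0";
my $y = "1";

my $num_equal = ($x == $y) ? 1 : 0;   # 1: both are the number 1
my $str_equal = ($x eq $y) ? 1 : 0;   # 0: "1.0" and "1" differ as strings

my $both   = ($num_equal && $str_equal) ? 1 : 0;   # 0
my $either = ($num_equal || $str_equal) ? 1 : 0;   # 1
my $not    = !$str_equal ? 1 : 0;                  # 1

print "$num_equal $str_equal $both $either $not\n";
```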
17. for
for (initialise; test; inc)
{
first_action;
second_action;
etc....
}
for ($i = 0; $i < 10; ++$i)    # Start with $i = 0
                               # Do it while $i < 10
                               # Increment $i before repeating
{
print "$i\n";
}
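The increment does not have to be 1. As an illustration, the sketch below walks through a coding sequence three bases at a time and counts the codons. The sequence is invented, and `substr` and `length` are standard functions not introduced on the slides.

```perl
use strict;
use warnings;

my $cds = 'ATGGCCGCCTAA';   # hypothetical coding sequence

my @codons;
my %codon_count;
for (my $i = 0; $i + 3 <= length($cds); $i += 3) {
    my $codon = substr($cds, $i, 3);   # 3 characters starting at offset $i
    push @codons, $codon;
    $codon_count{$codon}++;
}

print "@codons\n";                     # ATG GCC GCC TAA
print "GCC: $codon_count{'GCC'}\n";    # GCC: 2
```

The test `$i + 3 <= length($cds)` skips an incomplete codon at the end of the sequence instead of reporting it.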
18. Conditionals
if ($a)
{
print "The string is not empty\n";
}
else
{
print "The string is empty\n";
}
#!/bin/perl
open(FILE, "GMG.pep");
while (<FILE>) {
print $_ if ( m/>/ );
}
close(FILE);
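Combining if/else with the header test gives a simple counter for a fasta file. This is a sketch under a few assumptions: the data are invented and read from a string instead of GMG.pep, and the pattern is anchored with ^ so that only lines starting with '>' count as headers.

```perl
use strict;
use warnings;

# Hypothetical fasta content standing in for GMG.pep
my $data = ">MG001 first entry\nMKVL\nAAGT\n>MG002 second entry\nMALW\n";
open(my $fh, '<', \$data) or die "Cannot open input: $!";

my ($nseq, $nres) = (0, 0);
while (my $line = <$fh>) {
    chomp $line;                 # remove the trailing newline
    if ($line =~ m/^>/) {        # header line: starts with '>'
        $nseq++;
    }
    else {                       # sequence line: count its residues
        $nres += length($line);
    }
}
close($fh);

print "$nseq sequences, $nres residues\n";   # 2 sequences, 12 residues
```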
19. String matching
$a eq $b     # Is $a string-equal to $b?
$a ne $b     # Is $a string-unequal to $b?
Here are some special RE characters and their meaning:
.    Any single character except a newline
^    The beginning of the line or string
$    The end of the line or string
*    Zero or more of the last character
+    One or more of the last character
?    Zero or one of the last character
20. Some more special characters
\n    A newline
\t    A tab
\w    Any alphanumeric (word) character. The same as [a-zA-Z0-9_]
\W    Any non-word character. The same as [^a-zA-Z0-9_]
\d    Any digit. The same as [0-9]
\D    Any non-digit. The same as [^0-9]
\s    Any whitespace character: space, tab, newline, etc
\S    Any non-whitespace character
\b    A word boundary, outside [] only
\B    No word boundary
21.
Characters like $, |, [, ), \, / and so on are peculiar cases in regular expressions. If you want to match one of those then you have to precede it by a backslash (\). So:
\|    Vertical bar
\[    An open square bracket
\)    A closing parenthesis
\*    An asterisk
\^    A caret symbol
\/    A slash
\\    A backslash
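A few of these characters in action. The header line and the variable names are invented for illustration. The parentheses used to capture part of a match are a standard regular-expression feature that the slides do not cover.

```perl
use strict;
use warnings;

my $header = ">gi|12345|MG001 ABC transporter";   # hypothetical header

# \| is a literal vertical bar, \d+ one or more digits;
# the parentheses capture the digits
my ($gi) = $header =~ /\|(\d+)\|/;

# ^ anchors at the start, \S+ is a run of non-whitespace characters
my ($id) = $header =~ /^>(\S+)/;

# \b marks a word boundary, so 'ABC' must be a whole word
my $has_abc = ($header =~ /\bABC\b/) ? 1 : 0;

# An unescaped . matches any character; \. matches only a literal dot
my $loose = ("GMGxpep" =~ /^GMG.pep$/)  ? 1 : 0;   # 1
my $exact = ("GMGxpep" =~ /^GMG\.pep$/) ? 1 : 0;   # 0

print "$gi $id $has_abc $loose $exact\n";
```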
22. Substitution and translation
s/london/London/                  # replaces the first occurrence in $_
$sentence =~ s/london/London/     # same, applied to $sentence
Global substitution (g option) and the i option (for "ignore case"):
s/london/London/gi
Translation
$sentence =~ tr/abc/edf/
tr/a-z/A-Z/     # converts $_ to upper case
tr/A-Z/a-z/     # converts $_ to lower case
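Both operators are handy for sequences. The sketch below, on an invented sequence, upper-cases it with tr, builds the reverse complement, counts G and C (tr returns the number of characters it matched), and uses s///g to turn DNA into RNA. The idiom `(my $copy = $original) =~ ...` works on a copy and leaves the original unchanged.

```perl
use strict;
use warnings;

my $dna = 'atgGCCtaa';                  # hypothetical sequence

(my $upper = $dna) =~ tr/a-z/A-Z/;      # 'ATGGCCTAA'

# Reverse complement: reverse the string, then swap A<->T and C<->G
my $revcomp = scalar reverse($upper);
$revcomp =~ tr/ACGT/TGCA/;              # 'TTAGGCCAT'

# With an empty replacement list tr only counts, it changes nothing
my $gc = ($upper =~ tr/GC//);           # 4

# s///g replaces every occurrence
(my $rna = $upper) =~ s/T/U/g;          # 'AUGGCCUAA'

print "$upper $revcomp $gc $rna\n";
```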
23. Simple scripts
- given a nucleotide sequence
  - base composition
- given a protein sequence
  - amino-acid composition
- given a nucleic database (in fasta format)
  - base composition
- given a protein database (in fasta format)
  - amino-acid composition
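One possible starting point for the composition scripts is sketched below. The subroutine name `composition` and the test sequence are hypothetical, and this is not necessarily how the course scripts do it. The same subroutine works for bases and for amino acids because it simply counts characters.

```perl
use strict;
use warnings;

# Count how often each character occurs in a sequence
sub composition {
    my ($seq) = @_;
    my %count;
    foreach my $char (split(//, uc($seq))) {   # split(//, ...) gives single characters
        $count{$char}++;
    }
    return %count;
}

my $seq   = 'ATGGCCTAA';               # hypothetical nucleotide sequence
my %count = composition($seq);
my $total = length($seq);

# One line per character: symbol, count, frequency
foreach my $base (sort keys %count) {
    printf "%s\t%d\t%.3f\n", $base, $count{$base}, $count{$base} / $total;
}
```

For a fasta database, the same subroutine could be applied to each sequence after the header lines have been separated from the sequence lines.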
24.
- sequence size (bases or amino acids)
- extract a portion of a sequence (start position, end position)
- extract a sequence by name (from a database of sequences)
- gene sequence: codon count
- given an allxxseqnew file
  - script to compute frequencies of multiple matches
See splitfasta.pl, splitdnafasta.pl
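The first three tasks can be sketched in a few lines once the database is loaded into a hash keyed by sequence name. Everything here is illustrative: the data are invented and read from a string, the subroutine `subsequence` is hypothetical, and positions are assumed to be 1-based and inclusive. The scripts splitfasta.pl and splitdnafasta.pl are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical two-entry fasta database held in a string
my $data = ">MG001 first entry\nATGGCC\nTAA\n>MG002 second entry\nATGTGA\n";
open(my $fh, '<', \$data) or die "Cannot open input: $!";

my (%seq, $name);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {     # header: the name is the first word
        $name = $1;
        $seq{$name} = '';
    }
    elsif (defined $name) {       # sequence line: append to current entry
        $seq{$name} .= $line;
    }
}
close($fh);

# Extract positions $start..$end (1-based, inclusive) of a sequence
sub subsequence {
    my ($sequence, $start, $end) = @_;
    return substr($sequence, $start - 1, $end - $start + 1);
}

my $size = length($seq{'MG001'});               # 9
my $part = subsequence($seq{'MG001'}, 4, 6);    # 'GCC'

print "MG001: size $size, positions 4-6: $part\n";
```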
25.
- given an allxxseqnew file
  - script to compute frequencies of multiple matches
Data manipulation exercises
- home directory, mkdir, cd, path, pwd, find
- notation: DB.pep, DB.dna, seq.dna, seq.prt
- use tab as the separator
- using sed and grep
- the fasta sequence format
- count the number of sequences in a sequence database in fasta format (grep ">" DB.pep | wc -l)
- replace one character by another
- extract the sequences from a database (file in fasta format) (splitfasta.pl, splitdnafasta.pl)
- extract part of a sequence (the sequence is in fasta format)
- amino-acid frequencies of a protein sequence
- base frequencies of a nucleotide sequence
- size of a sequence
- sizes of the sequences in a database
- codon frequencies of a coding sequence
- codon volatility; codon/amino-acid correspondence