Title: Python programming for Life Science researchers
1 Python programming for Life Science researchers
- Sebastián Bassi
- Universidad Nacional de Quilmes, Argentina
- sbassi_at_gmail.com
- Updates: http://genes.unq.edu.ar
2 What is Python?
- Python is a dynamic object-oriented programming
language. It offers strong support for
integration with other languages and tools, comes
with extensive standard libraries, and can be
learned in a few days.
- Python is free, the source code (in C) is
available, and programs made in Python can be
distributed free of charge or sold for a fee.
- The Python website is www.python.org
Python Overview
3 Why Python?
- Some Python characteristics:
- Easy to read (pseudocode that works)
- Easy/Fast to code
- Batteries included
- Multiplatform (Windows, Linux, OSX, PDA)
- Dynamically typed
- Strongly typed
- Friendly community
Python Overview: what is different from other
languages
4 Python philosophy
- Python's designers reject exuberant syntax, such
as in Perl, in favor of a sparser, less cluttered
one. Python's developers expressly promote a
particular "culture" or ideology based on what
they want the language to be, favoring language
forms they see as "beautiful", "explicit" and
"simple".
- Mandatory indentation, English keywords instead
of punctuation, and few syntactic constructions
are derived from this philosophy.
Source: Wikipedia article on Python.
5 What can be done with Python
BitTornado: a BitTorrent client (p2p, a file
sharing application).
There are several toolkits for building Python GUIs.
They are all outside the scope of this tutorial.
6 Dynamic web page generation (via CGI, like Perl)
The Python CGI module allows you to show
dynamically generated content on your website.
7 Bioinformatics apps (like GUI-BLAST)
A multi-platform GUI is used for running BLAST
queries.
8 How I use Python
- Retrieve data from different sources: MySQL,
Access, web pages, CSV files, Excel files, XML
files.
- Write data in different formats: XML, CSV, PDF,
plain text.
- Draw graphics in SVG and GIF/PNG.
- Make dynamic web pages, some of which even query
other sources (like other web pages).
- Make GUIs for command-line programs.
- Parse BLAST files.
- Run multiple BLAST searches.
- Convert and manipulate biological data.
There are many different uses of Python.
9 Python interactive interpreter
Python interactive interpreter screenshot
10 Python as a calculator
Python can be used as a calculator
11 Numeric Data types: Int, Long, Float
- Int: from -2^31 to 2^31 - 1 (on 32-bit systems).
- Long: integers beyond the Int range (no longer
used, see footnote).
- Float: floating point numbers.
Long is no longer used as a separate type: the int
data type can handle integers as large as system
capacity allows.
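As a small illustration of the footnote on large integers, here is a minimal sketch written in Python 3 syntax (an assumption, since the slides predate it), where int is the only integer type and grows past the 32-bit limit without overflowing. The variable names are made up for the example.

```python
# Largest value of a signed 32-bit integer
max_int32 = 2**31 - 1

# Going past that limit does not overflow: int grows as needed
big = max_int32 + 1
huge = 2**100

# Floats are floating point numbers
ratio = 7 / 2      # true division gives a float in Python 3
whole = 7 // 2     # floor division gives an int

print(max_int32, big, huge, ratio, whole)
```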
12 Text Data Types: String
Strings can be concatenated like this: '%s ... %s' % ('tga', 'atg')
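A minimal sketch of two common ways to build one string from several pieces: the + operator and %s formatting. The variable names are hypothetical and the code uses Python 3 syntax.

```python
# Two short DNA fragments (example values)
start = 'atg'
stop = 'tga'

# Concatenation with the + operator
joined = start + stop

# String formatting: each %s is replaced by a value from the tuple
formatted = '%s ... %s' % (stop, start)

print(joined)
print(formatted)
```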
13 Note: Escape characters
Non-printable and special characters should be
escaped with the backslash character (\).
Newline (Enter): \n   Tab: \t   Backslash: \\   Double quote: \"
You can use an escape character to insert a double
quote inside a text, like this: print("Here is a
\" (double quote)"). This will be printed as: Here
is a " (double quote)
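The escape sequences can be tried out with a short sketch like the one below (Python 3 syntax; the sample strings are invented for illustration).

```python
# \" inserts a double quote inside a double-quoted string
quoted = "Here is a \" (double quote)"

# \n is a newline, \t is a tab, \\ is a single backslash
two_lines = "first line\nsecond line"
columns = "name\tvalue"
path = "C:\\data"

print(quoted)
print(two_lines)
print(columns)
print(path)
```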
14 Data types: Lists
- An array of data, like C vectors or VB and Perl
arrays.
Lists: definition, creation and access. Using
one index, you access only one element.
15 List: Slice Notation
Slice notation is used for lists and strings.
Using 2 indexes, you get a sublist and
not a single element.
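A minimal sketch of indexing and slicing on a list and on a string. The data is hypothetical and the code uses Python 3 syntax.

```python
# A list of codons (example data)
codons = ['atg', 'gcc', 'tta', 'cag', 'tga']

first = codons[0]        # one index: a single element
last = codons[-1]        # negative indexes count from the end
middle = codons[1:3]     # two indexes: a sublist, end index excluded
head = codons[:2]        # omitted start means "from the beginning"

# Slices work the same way on strings
sequence = 'atggcctta'
first_codon = sequence[0:3]

print(first, last, middle, head, first_codon)
```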
16 List operations: insert data
- Append: add an element after the last element.
- Insert: add an element at any arbitrary position
(before the given index).
- Extend: add the elements of another list after
the last position.
Append, insert and extend as ways to insert data
in a list.
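The three methods can be compared side by side in a short sketch (Python 3 syntax, made-up data):

```python
bases = ['a', 'c']

bases.append('t')          # add one element at the end
bases.insert(1, 'g')       # insert 'g' before index 1
bases.extend(['n', 'x'])   # add every element of another list

print(bases)
```

Note that append with a list argument would add the whole list as a single element, while extend adds its elements one by one.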
17 List: delete elements
- LIST.pop(n) removes and returns the element at
index n of LIST (default: the last one).
- LIST.remove(N) removes the first occurrence of
the value N in LIST.
Delete with pop and remove. pop returns the
value, and pop() with no argument acts on the
last element.
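A minimal sketch of pop and remove on a hypothetical list of enzyme names (Python 3 syntax):

```python
enzymes = ['EcoRI', 'BamHI', 'HindIII', 'BamHI', 'PstI']

last = enzymes.pop()       # removes and returns the last element
first = enzymes.pop(0)     # removes and returns the element at index 0
enzymes.remove('BamHI')    # removes only the first 'BamHI' found

print(last, first, enzymes)
```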
18 Data type: Tuples
- Defined like a list, with parentheses instead of
square brackets.
- Indexes work as in lists. Slicing can be used.
- Tuples are immutable: elements can't be added
or removed.
- Tuples are faster than lists. Tuples are like
write-protected lists.
When you need to iterate over a list of constant
values, use a tuple instead of a list.
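A short sketch of tuple behaviour, using a hypothetical tuple of stop codons (Python 3 syntax). Trying to change an element raises TypeError, which the example catches to show the write protection.

```python
# A tuple of constant values
stop_codons = ('taa', 'tag', 'tga')

first = stop_codons[0]     # indexing works as in lists
pair = stop_codons[1:]     # slicing returns a new tuple

# Assigning to an element is not allowed
try:
    stop_codons[0] = 'xxx'
    changed = True
except TypeError:
    changed = False

# Iterating over the constant values
for codon in stop_codons:
    print(codon)
```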
19 Dictionaries
- Data type used to store one-to-one relationships
between keys and values (like a hash in Perl or the
Scripting.Dictionary object in Visual Basic).
The threecode dictionary is part of Biopython.
Elements in a dictionary are unordered.
20 Dictionaries: Some methods
- If a key is not found, Python raises an error:
  >>> threecode["kkk"]
  Traceback (most recent call last):
    File "<pyshell#299>", line 1, in -toplevel-
      threecode["kkk"]
  KeyError: 'kkk'
- Before looking up a value, check the key:
  >>> threecode.has_key("kkk")
  False
del threecode["A"] deletes that item from the
dictionary. threecode.clear() deletes all items.
In Python 3, has_key is gone; use "kkk" in
threecode instead.
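A minimal sketch of the dictionary operations on this slide, written in Python 3 syntax, where the in operator replaces has_key. The small threecode dictionary here is a hand-made stand-in with three entries, not the one shipped with Biopython.

```python
# A small stand-in for a one-letter to three-letter amino acid table
threecode = {'A': 'Ala', 'C': 'Cys', 'D': 'Asp'}

ala = threecode['A']               # look up a value by its key

# Check the key before looking it up
has_kkk = 'kkk' in threecode

# get() returns a default value instead of raising KeyError
unknown = threecode.get('kkk', 'Xaa')

del threecode['A']                 # delete one item
remaining = sorted(threecode.keys())

threecode.clear()                  # delete all items
print(ala, has_kkk, unknown, remaining, threecode)
```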
21 Program flow: if, elif, else
elif works like switch in C. Note the indentation:
in Python it is mandatory and delimits code blocks.
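A minimal sketch of an if / elif / else chain. The function name classify_base is hypothetical; the indentation alone marks which lines belong to each branch.

```python
def classify_base(base):
    """Return a label for a single nucleotide letter."""
    if base in ('a', 'g'):
        kind = 'purine'
    elif base in ('c', 't'):
        kind = 'pyrimidine'
    else:
        kind = 'unknown'
    return kind

print(classify_base('a'), classify_base('t'), classify_base('n'))
```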
22 If sample
See the footnote on slide 12 for string concatenation
and slide 15 for list slicing. elif works like a C
switch.
23 Program flow: for
This is how you iterate over a sequence (list).
Never modify the sequence you are iterating over
inside the body of a for loop.
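One common way to respect this rule, sketched below under the assumption that the list is small enough to copy, is to iterate over a copy made with slice notation and modify the original. The data is invented for the example.

```python
lengths = [29, 28, 0, 42, 0]

# Iterate over a copy (lengths[:]) so the original can be changed safely
for value in lengths[:]:
    if value == 0:
        lengths.remove(value)

print(lengths)
```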
24 For sample
for x in range(5) works like BASIC's FOR x = 0 TO 4.
To cycle through numbers, create a list of
numbers with the range function. Note the indentation.
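A minimal sketch of a counting loop with range, in Python 3 syntax. In Python 3, range produces its numbers lazily rather than building a list, so list() is used here to show them.

```python
squares = []
for x in range(5):          # x takes the values 0, 1, 2, 3, 4
    squares.append(x * x)

numbers = list(range(5))    # materialize the numbers as a list

print(numbers)
print(squares)
```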
25 While: do while a condition is true
while True generates an infinite loop, which can
be exited with break.
We will use while True and break in the BLAST
parsers.
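The while True and break pattern can be sketched as follows. This is only an illustration of the idiom, not the parser from the later slides; an in-memory io.StringIO object with invented content stands in for an open file.

```python
import io

# An in-memory stand-in for an open file (hypothetical content)
handle = io.StringIO("line 1\nline 2\nline 3\n")

lines_read = 0
while True:
    line = handle.readline()
    if not line:        # an empty string means end of file
        break
    lines_read += 1

print(lines_read)
```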
26 Modularize your code: Functions
- Variables declared inside a function live only
inside the function. Only the value given to
return is passed back to the program.
- If the function just does something instead of
returning a value, use return None (this is not
mandatory, but improves the legibility of the code).
- Usage: MyInterproHandle = get_interpro_entry("IPR004560")
To return more than one value, return a list with
all the variables you need.
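A minimal sketch of a function that returns two values in a list. The function gc_content is a hypothetical example, not part of the original slides.

```python
def gc_content(sequence):
    """Return [number of G and C bases, sequence length]."""
    # 'upper' and 'gc_count' exist only inside the function
    upper = sequence.upper()
    gc_count = upper.count('G') + upper.count('C')
    return [gc_count, len(upper)]

# Unpack the returned list into two variables
gc, length = gc_content('atggcc')
print(gc, length)
```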
27 Modules
A chunk of code that can be used from a program
or in interactive mode. Its functions, classes,
constants and dictionaries can be called and used
from a program. A module must be imported before
it is used.
Modules are searched for in several paths, such as
the directory of the running script. See them all
with sys.path.
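A short sketch of importing modules from the standard library and listing the search paths (Python 3 syntax):

```python
import sys                  # import a whole module
import math
from math import sqrt       # import a single name from a module

root = sqrt(16)
area = math.pi * 2 ** 2     # constants are reached through the module name

# sys.path is the list of folders searched for modules
for folder in sys.path:
    print(folder)
```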
28 Reading text files
fileobject = open(filename, "r")
for line in fileobject:
    print(line)
readlines() returns a list of strings, one for each
line of the file.
A file should not be edited while it is open: wait
until it is closed before editing it, even with an
external program.
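A self-contained sketch of both ways of reading a file, in Python 3 syntax. It first writes a small temporary file (the name sample.txt and its content are invented) so that the example can run anywhere, and it uses the with statement so the file is closed automatically.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained
filename = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(filename, 'w') as out:
    out.write('first line\nsecond line\n')

# Iterate over the file line by line
stripped = []
with open(filename, 'r') as fileobject:
    for line in fileobject:
        stripped.append(line.rstrip('\n'))

# readlines() returns a list with every line, newlines included
with open(filename, 'r') as fileobject:
    all_lines = fileobject.readlines()

print(stripped)
print(all_lines)
```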
29 Writing text files
- There are two modes for writing files:
- w: write, overwriting the file if it exists.
- a: write at the end of the file (append). Useful
for log files.
open can take a third argument, which defines how
the file is buffered before writing.
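The difference between the two modes can be seen in a minimal sketch like this one (Python 3 syntax; the file name run.log is hypothetical and the file is created in a temporary folder):

```python
import os
import tempfile

logname = os.path.join(tempfile.mkdtemp(), 'run.log')

with open(logname, 'w') as log:      # creates the file
    log.write('first run\n')

with open(logname, 'w') as log:      # 'w' again: previous content is lost
    log.write('second run\n')

with open(logname, 'a') as log:      # 'a': new text goes to the end
    log.write('third run\n')

with open(logname) as log:
    content = log.read()

print(content)
```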
30 Data Manipulation
- The problem: a text file with data in it should
be parsed, that is, read and interpreted by the
program, which then displays or stores only
selected information.
- Python tools:
- Built-in open file function.
- Control flow structures.
- String manipulation methods.
This is a generic overview of the problem and
tools.
31 Sample file: BLAST Hit table
inseq2 gi|26249933|ref|NP_755973.1| 100.00 29 0 0 1 29 837 865 1e-08 60.8
inseq2 gi|1789736|gb|AAC76363.1| 100.00 29 0 0 1 29 834 862 1e-08 60.8
inseq2 gi|3483131|gb|AAC33265.1| 100.00 29 0 0 1 29 480 508 1e-08 60.8
inseq2 gi|29542596|gb|AAO91530.1| 46.43 28 15 0 2 29 515 542 4.2 32.3
inseq2 gi|67762813|ref|ZP_00501511.1| 48.28 29 15 0 1 29 278 306 7.2 31.6
inseq2 gi|67737420|ref|ZP_00488193.1| 43.12 27 15 0 1 29 278 306 7.2 31.6
inseq2 gi|67714721|ref|ZP_00484082.1| 47.88 42 15 0 1 29 278 306 7.2 31.6
inseq2 gi|69988727|ref|ZP_00641885.1| 41.31 59 15 0 1 29 221 249 7.2 31.6
- 2000 more lines follow (removed to fit on
this slide)
Your mission (should you choose to accept it):
get all GIs from this file and build the URL of the
full GenBank record, but only for hits whose
identity is greater than 45%.
This URL will be handy for this kind of task:
ncbi.nlm.nih.gov/entrez/query/static/linking.html
32 Python script for data manipulation
To send the output to a text file, just redirect
it on the command line with >, like this:
program.py > my_text
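The script shown on the slide is not reproduced in this transcript. As a rough sketch of one way to approach the mission, assuming the usual 12-column tabular layout (percent identity in the third column) and subject identifiers of the form gi|number|db|accession|, the code below keeps the GI of every hit above the identity threshold. The function name genbank_urls, the sample lines and the BASE_URL pattern are assumptions made for illustration; check the NCBI linking page for the URL form you need.

```python
# A few lines in BLAST tabular format (sample data for illustration)
hit_table = """\
inseq2 gi|26249933|ref|NP_755973.1| 100.00 29 0 0 1 29 837 865 1e-08 60.8
inseq2 gi|29542596|gb|AAO91530.1| 46.43 28 15 0 2 29 515 542 4.2 32.3
inseq2 gi|67737420|ref|ZP_00488193.1| 43.12 27 15 0 1 29 278 306 7.2 31.6
"""

# Assumed URL pattern; confirm it against the NCBI linking documentation
BASE_URL = 'https://www.ncbi.nlm.nih.gov/protein/'


def genbank_urls(lines, min_identity=45.0):
    """Return one URL per hit whose identity is above min_identity."""
    urls = []
    for line in lines:
        fields = line.split()
        if len(fields) < 12:          # skip blank or incomplete lines
            continue
        identity = float(fields[2])   # third column: percent identity
        if identity > min_identity:
            gi = fields[1].split('|')[1]   # gi|NUMBER|db|accession|
            urls.append(BASE_URL + gi)
    return urls


urls = genbank_urls(hit_table.splitlines())
for url in urls:
    print(url)
```

To process a real file, the same function can be given an open file object instead of the sample lines, since it only needs something that yields one line at a time.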
33 XML: Basic Overview
- Language to describe data (with nothing about
data presentation).
- Based on text format (binary XML is outside the
scope of this tutorial).
- XML is human-readable (kind of).
- Easy to write programs to process XML documents.
- Header with parsing information:
- <?xml version="1.0"?>
- Body:
- <tagname attribute_name="attribute_value">a
text</tagname>
- <line type='demo'>A simple line</line>
- Empty element: <img src="logo.png" />
Pay attention: XML is everywhere! The official
webpage is www.w3.org/XML
34 XML: Some real world samples
An RSS feed is XML based.
RSS is a popular way to syndicate news. Atom is
another protocol, also based on XML.
35 XML: Some real world samples
XML BLAST output.
BLAST can be instructed to output XML instead
of text or HTML.
36 XML: Sample with attributes
All elements in this sample contain attributes.
The svg element has width and height; text has x, y
and style; and path has d and style.
Plasmids in SVG at bioinformatics.org/savvy/.
More bioXML at xml.com/pub/rg/Bioinformatics
37 XML: Parser with ElementTree
ElementTree is located at effbot.org/zone/element.htm.
Since Python 2.5, it is included in the
standard library.
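The parser code from the slide is not included in this transcript. A minimal sketch of reading elements, text and attributes with the standard library version of ElementTree (xml.etree.ElementTree) could look like this. The small document, including its root element name, is invented for the example.

```python
import xml.etree.ElementTree as ET

# A tiny XML document (hypothetical content)
xml_text = """<?xml version="1.0"?>
<document>
  <line type="demo">A simple line</line>
  <line type="other">Another line</line>
  <img src="logo.png" />
</document>"""

root = ET.fromstring(xml_text)

# Collect the 'type' attribute and the text of every <line> element
lines = [(el.get('type'), el.text) for el in root.findall('line')]

# Read an attribute from an empty element
img_src = root.find('img').get('src')

print(root.tag, lines, img_src)
```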
38 XML code output
39 What is Biopython?
- It is a distributed collaborative effort to
develop Python libraries and applications that
address the needs of current and future work in
bioinformatics.
- It provides:
- Tools for working with sequences (aa and nt).
- Parsers for all popular bio file formats (fasta,
gb, pdb, BLAST output).
- Data retrieval from biological databases.
- Wrappers for bio-programs (BLAST, ClustalW, EMBOSS,
Primer3, and more).
- Biological functions like LCC, restriction
enzyme cutting, and more.
- Tables and constants.
With Biopython you can automate repetitive tasks
by chaining several programs.
40 Biopython sample: BLAST output parsing for
removing vector from DNA sequences
BLAST can be instructed to output a table with
Hit Table enabled in the Alignment view.
41 This first half parses the BLAST output without
Biopython.
42 Using the fasta parser to read sequences and
FastaWriter to write the modified sequence.
43 With HTML it is easy to make GUIs for command-line
programs or Biopython functions. Just use any
HTML or text editor. This form asks for the same
parameters that the Tm function uses.
This is a GUI (Graphical User Interface) for the
Biopython melting point function.
44 Form code
Look for the action path and variable names.
45 Generate Tm in HTML from multiple sequences using
Python
The Tm function is inlined to avoid dependency
problems (Biopython is not included in standard
hosting packages).
46 All form variables are stored in formu. Doc is
an object used for storing the HTML info.
47 CGI output generated from the command line. The CGI
script can also run from the CLI (without a web server).
48 Result of the CGI code after the submit button is
pressed in the HTML form.
There is a FAQ for Python CGI: http://starship.python.net/crew/davem/cgifaq/faqw.cgi
49 Source code of the generated web page
50 That's all for today. But there is a lot more in
Python!
Resources:
- The Quick Python Book, Daryl Harms and Kenneth McDonald, Manning, 2000
- Professional XML, Birbeck et al., 2nd Ed., Wrox Press, 2001
- Python Tutorial, Guido van Rossum, March 2006 (http://docs.python.org/tut/)
- Dive into Python (diveintopython.com)
- Biopython tutorial and cookbook, Jeff Chang, Brad Chapman, Iddo Friedberg, 2001 (http://bioweb.pasteur.fr/docs/doc-gensoft/biopython/Doc/Tutorial.pdf)
- Python Speed Performance Tips (http://wiki.python.org/moin/PythonSpeed/PerformanceTips)
- Python course in Bioinformatics, Katja Schuerer, 2004 (http://www.pasteur.fr/recherche/unites/sis/formation/python/)
- Beginners Guide to Python, 2006 (http://wiki.python.org/moin/BeginnersGuide)
- Software development skills for scientists and engineers, Greg Wilson (http://osl.iu.edu/lums/swc/)
There is also an IRC channel at irc.freenode.org
(#python)