Title: Programming for Bioinformatics
1Programming for Bioinformatics
2The module
- will teach students basic programming skills
relevant to bioinformatics, which will enable
them to actively develop bioinformatics tools. - will take a problem-driven approach.
- will present bioinformatics problems and show how
to solve them using existing online tools and how
to implement such tools. - will revisit some of the problems and databases
discussed in applied bioinformatics. - will be very practical and hands-on approach to
basic computer science tools such as using
command line operating systems, programming in
Python, and using relational databases.
3Objectives
- Students will have an understanding of different
operating systems - Students will be able to automate simple
repetitive information retrieval tasks - Students will be able to write simple programs in
Python - Students will be able to work with relational
databases - Students will appreciate the principles, limits,
and possibilities of programming - Students will be able to formulate biological
questions as information processing problems - Students will understand when and how programming
can help to automate bioinformatics problems
4Module Structure
- Introduction
- Databases
- Introduction to SQL
- A Little Exercise
- A Little Science
- Introduction to Python
- Data types and loops
- Sequences and lists
- Patterns and functions
- Dictionaries
- Object Orientation
- Advanced topics
- More Python
- Dynamic programming
- Clustering
- Revision Class
5Books
- You will need two books for the module a
reference book on MySQL and a book on Python
6Books Python
- We will follow a number of online resources.
(see course web page) - Further, we look in Python in a Nutshell, Alex
Martelli, OReilly - Wesley Chun's Core Python Programming
- Python Cookbook (OReilly)
- The publisher OReilly has many general
programming books on linux, python, etc. - They allow you to read all books for 2 weeks
online for free. This is very nice to decide what
to buy and what not. - You can also buy electronic copies of the book.
7Books MySQL
- There are many, many books on MySQL
- The following two are just sugestions, as there
are many other books covering the same material - MySQL Cookbook by Paul DuBois, O'Reilly or
- MySQL by Paul DuBois, Michael Widenius, O'Reilly
8Structure of Labs
- Databases
- Lab 1,2 Simple SQL
- Lab 3,4 SQL to answer interesting scientific
questions - Python
- Lab 5 Data types and loops, accessing a DB from
Python - Lab 6 Sequences and lists
- Lab 7 Patterns and Functions
- Lab 8 Dictionaries
- Lab 9 BioPython
- Lab 10 Python PyMOL
- More Python
- Lab 11 Dynamic programming revisited
- Lab 12 Clustering revisited
- Lab 13 Revision
9Assessment
- Lab
- Exercises
- Each week during the lab you get exercises which
you have to do during the lab and finish on your
own during the week - These exercises need to be handed in at the next
lecture - Results are discussed during the labs and as part
of the assessment you will have to present a
solution once - Doing the exercises is compulsory, but there are
no marks - Project/Test
- You will demonstrate your programming skills by
implementing a comprehensive software project. - The test will include a thorough code review and
you will be asked to present the features of your
programme. - Exam
- Pen and paper exam on material covered in lecture
10Motivation Databases
- In the last term,
- we accessed most information online via the web
- we interacted directly and manually with
databases and tools - we had to manually submit queries, interpret
results. select interesting results, cutpaste
them, and submit queries again, - Pro
- Reasonably easy to get hold of information
- Con
- Not possible to ask many queries
- Queries limited by interface provided by web page
- Difficult/impossible to integrate information
from different sites - In this term, we will look at the databases
underlying the online front ends - How is the data internally stored?
- How can we - and more important computer programs
- directly interact with the underlying data, so
that we can ask more powerful queries, large
queries, and integrate different systems
11What actually happens
- You are limited by what web server allows you to
ask - Example CATH
- PDB ID,
- CATH code, or
- General text
- But you cannot ask
- In how many different PDB structures is there a
P-loop domain? - Is there a PDB entry with a P-loop and a
DNA-binding domain - How many different superfamilies does the largest
structure in PDB have? - With direct access to the underlying database you
could answer all these questions (and many more)
12Motivation SCOP as Relational Database
- We worked with SCOP, the Structural
Classification of Proteins - Family gt30 sequence identity
- Superfamily Similar structure and function
(possibly lower 30 sequence identity)
13Motivation Databases
- We wish to answer the following questions
- How many families and superfamilies are there?
- Do all superfamilies roughly have the same number
of families? - How many families does the immunoglobulin
superfamily have? - Which superfamily has the most families and how
many? - How many percent of superfamilies have only one
family? - Which PDB structure has the largest number of
distinct superfamilies? - How many percent of PDB structures have only one
type of superfamily, how many percent have at
least two? - Which is the most popular superfamily?
- Are all superfamilies equally likely to co-occur
or do they have preferences? - Which superfamily has the most co-occurrence
partners? - Is the number of co-occurrence partners and the
frequency of the superfamily correlated?
14What is a Database
- SCOP contains relevant information, but we cannot
answer the above questions through the
web-interface of SCOP - The problem is that we do not have access to the
underlying database - What is a database anyway?
- A database provides
- Logical organization of data
- data models, schema design, dictionaries
- Physical organization of data
- Fast retrieval, indexing, compact storage of data
15Relational Database
- Central Idea Data as relations in a table
- E.g. Employee
------------------------------- id
name salary role --------------------
----------- 46457 pete 50.000
director 46458 jane 60.000 nurse
46459 asif 70.000 driver
-------------------------------
16Relational Database
- Central Idea Data as relations in a table
- E.g. SCOP, Structural Classification of Proteins
---------------------------------------------
------------------------ id type sccs
sid description
-------------------------------------------
-------------------------- 46457 cf a.1
- Globin-like
46458 sf a.1.1 -
Globin-like 46459
fa a.1.1.1 - Truncated hemoglobin
46460 dm a.1.1.1 -
Truncated hemoglobin 46461
sp a.1.1.1 - Ciliate (Paramecium
caudatum) 14982 px a.1.1.1
d1dlwa_ 1dlw A
46462 sp a.1.1.1 - Green alga
(Chlamydomonas eugametos) 14983 px
a.1.1.1 d1dlya_ 1dly A
63437 sp a.1.1.1 -
Mycobacterium tuberculosis 62301
px a.1.1.1 d1idra_ 1idr A
------------------------------
---------------------------------------
17SCOP Tables
- mysqlgt select from cla limit 1
- --------------------------------------------
------------------------------- - sid pdb_id sccs cl cf sf
fa dm sp px - --------------------------------------------
------------------------------- - d1dlwa_ 1dlw a.1.1.1 46456 46457
46458 46459 46460 46461 14982 - --------------------------------------------
------------------------------- - mysqlgt select from des limit 1
- ---------------------------------------------
- id type sccs sid description
- ---------------------------------------------
- 46456 cl a - All alpha proteins
- ---------------------------------------------
- mysqlgt select from astral limit 1
- -----------------------------------------------
------------------------------ - sid sccs seq
- -----------------------------------------------
------------------------------ - d1dlwa_ a.1.1.1 slfeqlggqaavqavtaqfyaniqadat
vatffngidmpnqtnktaaflcaalgg...
18SCOP Tables
- mysqlgt select from cla limit 1
- --------------------------------------------
------------------------------- - sid pdb_id sccs cl cf sf
fa dm sp px - --------------------------------------------
------------------------------- - d1dlwa_ 1dlw a.1.1.1 46456 46457
46458 46459 46460 46461 14982 - --------------------------------------------
------------------------------- - mysqlgt select from des limit 1
- ---------------------------------------------
- id type sccs sid description
- ---------------------------------------------
- 46456 cl a - All alpha proteins
- ---------------------------------------------
- mysqlgt select from astral limit 1
- -----------------------------------------------
------------------------------ - sid sccs seq
- -----------------------------------------------
------------------------------ - d1dlwa_ a.1.1.1 slfeqlggqaavqavtaqfyaniqadat
vatffngidmpnqtnktaaflcaalgg...
19Querying Relational Databases
- SQL Structured Query Language
- Select Which attributes? from Which tables? where
Which conditions? - Select from where
- Distinct
- Like
- Union/intersect
- Join
- Count/average/sum/min/max
- Group by
- Having
- Show tables
- Show databases
- Use
- Create database
- Create table as
- Drop table
- Load data
- Insert into
20Databases
- Given SCOP as relational database, we can answer
all the questions raised above using the SQL
constructs of the previous slide!
21Programming
- We will use Python (Guido van Rossum, named after
Monty Python) as a convenient extension to the
operating system - Easy to write quick programs
- More than just a scripting language
- Interpreted, interactive, indented
- Supports string processing well
- Widely used in bioinformatics
- Object oriented, general purpose
- Many nice libraries for database access,
Graphics, Web, GUI, R - Scientific orientation Numerical Python (math),
Scientific Python, Biopython - Beware Python is inefficient, but
computationally expensive parts can be included
as C-libraries
22Motivation Families and Identity
- We said that SCOP families share gt30 identity
- What does that mean?
- Any two structures in a family gt30?
- At least one other member in family with gt30?
- What is the average sequence similarity within a
family? Within a superfamily? - Given a sequence and that we know already which
superfamily it belongs to. Can we find the
superfamilys family best suited for the sequence
23Two approaches Blast vs. DIY
- We can answer the above easily
- We use SCOP database and run database queries
from a Python script - For a given superfamily select all corresponding
sequences from the astral table - For all pairs of selected sequences
- Call Blast and record the sequence identity
- Or run your own dynamic programming algorithm and
record the sequence identity - For second problem Compare sequence to all
family sequences and assign it to the family
which shares the highest (must be gt30)
similarity with the sequence
24Motivation Sequence vs. Structure
- Can we verify the plot below?
- Can we create a similar plot for specific
superfamilies? E.g. DNA-binding domains?
25Motivation Sequence vs. Structure
- Again select the relevant sequences from the
astral table and besides computing the sequence
identity, we compute structural similarity to the
relevant structure using an algorithm like Dali
or CE - Then plot the two similarities against each other
in a scatter plot
26Motivation Amino Acid Composition of Families
- Can we characterise the amino acid composition of
different families/superfamilies? - Again select the relevant sequences from astral
and count the frequencies of amino acids - Is the amino acid composition at the interface of
a domain different from the rest of the domain?
27Motivation Lets rebuild SCOP families
- Given a SCOP superfamily and its sequences, how
can we divide it into families? - First, we need dynamic programming to determine
the sequence similarity - Then we do the following
- For all pairs of sequences, call the sequence
similarity algorithm and record the similarity
into a distance matrix - Next, run hierarchical clustering to cluster the
sequences.
28Whats needed
29Python Programming Constructs
- Variables, strings,
- For/while Loops
- If statements
- File I/O
- Regular expressions
- Data structures Lists, Hashes
- Code Structure Objects, classes, modules
30Hello World in Python
- Given a file helloworld.py
- Open a shell and type at the command prompt
helloworld.py - The shell then executes your programme
- In the first line, it realises that the python
interpreter needs to be loaded and that what
follows is a python program - The line below prints a message
File helloworld.py
print "Hello World"
31Read a text file in python
- The command open opens a text file and creates
- r as second argument after the filename
indicates that file is read (this is default, ie.
can be left out) - w as second argument indicates that file is
written to - a as second argument indicates that file is
appended to - The for-loop reads all lines of the file one by
one (requires python gt2.2) - The body of the loop prints them on the screen
(note that print adds a new line automatically,
avoid that with adding a , )
File fileIO.py
Output
data open("seq.txt, r) for line in data
print "Line, line,
File seq.txt
Line acgt Line gggt
acgt gggt
32Variables in Python
- The symbol is used to assign values to
variables - The symbol is also used to concatenate strings
File fileIO.pl
lineNo 1 for line in open(seq.txt) print
lineNo line, lineNo lineNo1
Output
File seq.txt
1 acgt 2 gggt
acgt gggt
33If-then-else and strings in Python
File seqcomp.py
data open("seq.txt") line1 data.readline().rst
rip() line2 data.readline().rstrip() len1len(l
ine1) len2len(line2) if len1 lt len2 minLen
len1 else minLen len2 line3 "" for i
in range(minLen) if line1i line2i
line3line3"" else
line3line3" " print "Sequence
comparison" print line1 print line2 print line3
File seq.txt
acgt gggt
Output
Sequence comparison acgt gggt
34Programming Example