Title: Introduction to Bioinformatics, Part I: How did we get here and what can we do now?
1Introduction to Bioinformatics - Part I: How did we get here and what can we do now?
- Irilenia Nobeli
- BBK Biological Sciences - Crystallography
2Why I am here
- I work here (well, in Crystallography)
- You will inevitably come across bioinformatics at some stage
- Learning about databases and tools may come in handy
- I hope you will be inspired
3Overview of these lectures
- First lecture (26/11/2009)
- Introduction (a rather biased and subjective view of the field and its history)
- Second lecture (first half of 3/12/2009)
- Sequence analysis: pairwise and multiple alignment, BLAST and HMMs
- Third lecture (second half of 3/12/2009)
- Structural bioinformatics
4Overview of these lectures (where to find things)
- Lecture slides and lecture notes are on Blackboard
- There are notes accompanying these slides (only for slides that are not self-explanatory)
- Practicals
- There will be one practical session on the 10th of December
- Room G10 has been booked from 6:00 pm to 9:00 pm. Event title is Molecular Biology
- Coursework
- Coursework will be assigned at the end of the second lecture
5Today's lecture
- A brief history of bioinformatics and the events that led to the establishment of this field
- A series of research questions that can be addressed by a bioinformatics/computational biology approach
- Biased by my experience and that of people in our department
- Aim of this lecture
- To get your curiosity going and give you a broad overview of the field of computational biology
6Bioinformatics - The early days (? - 1990)
7Theory Milestones - Evolution
- How is evolution achieved?
- Sequences change over time. Mutations happen often due to errors in replication, chemicals, light, etc.
- Divergence of sequences is also a result of recombination, gene duplication, speciation and horizontal gene transfer events.
natural selection (19th century)
genetic drift (20th century)
8Milestones - Evolution (backwards)
- DNA sequences determine (almost entirely) the appearance and characteristics of organisms
- Biological sequences show complex patterns of similarity to one another, regardless of external similarities
- The logical explanation for the similarities
observed is that sequences (and organisms) share
common ancestry
9Theory Milestones - Inheritance
Laws of heredity: dominant genes conceal the phenotype of recessive genes but do not alter the recessive genes themselves
Parents: dd x rr
1st generation: dr, dr
2nd generation: dd, dr, dr, rr
Gregor Johann Mendel (1822 -1884)
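The cross on this slide can be reproduced with a few lines of code. The sketch below is only an illustration: the function name `cross` and the one-letter allele encoding (`d` for dominant, `r` for recessive) are assumptions made here, not part of the lecture.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Count offspring genotypes for one gene, given two alleles per parent."""
    offspring = Counter()
    for a, b in product(parent1, parent2):
        # Sort the pair so that 'dr' and 'rd' count as the same genotype.
        offspring["".join(sorted((a, b)))] += 1
    return dict(offspring)

print(cross("dd", "rr"))  # {'dr': 4}
print(cross("dr", "dr"))  # {'dd': 1, 'dr': 2, 'rr': 1}
```

In the second generation three of the four combinations carry at least one `d`, which gives the familiar 3:1 ratio of dominant to recessive phenotypes, while the `rr` offspring show that the recessive allele itself was never altered.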
10Theory Milestones - The central dogma of Biology
- The direction of information flow between DNA, RNA and proteins is restricted.
- The central dogma is often stated as:
- Once (sequential) information has passed into protein, it cannot get out again.
Crick, F. (1958), Symp. Soc. Exp. Biol., XII, 138.
Crick, F. (1970), Nature, 227, 561.
11Theory Milestones - The central dogma (II)
- The central dogma was not accepted without controversy
- Much of it related to the simplification of stating it as "DNA makes RNA makes protein"
- If a dogma ends up having too many exceptions, it somehow loses much of its appeal
"the Central Dogma now would have to go something like this: 'DNA makes RNA makes protein, but sometimes RNA can make DNA and other times RNA makes RNA, which makes proteins different from what they would be if only DNA made the RNA, and once upon a time RNA made protein, probably, but no-one knows for certain'."
From Petsko's comment "Dog eat dogma" in Genome Biology (2000), 1 (2), comment1002.1-1002.2
12Theory Milestones - The central dogma (III)
- The word "dogma" created as much, if not more, controversy.
- Crick himself writes:
"As it turned out, the use of the word dogma caused almost more trouble than it was worth. Many years later Jacques Monod pointed out to me that I did not appear to understand the correct use of the word dogma, which is a belief that cannot be doubted. I used the word the way I myself thought about it, and simply applied it to a grand hypothesis that, however plausible, had little direct experimental support."
From Crick's autobiography, as quoted in http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
13 1953: The Structure of DNA
Maurice Wilkins 1916 - 2004
X-ray photograph of DNA
The Watson and Crick model
Rosalind Franklin 1920 - 1958
James Watson (1928-) and Francis Crick (1916-2004)
14 1955: Complete sequence of insulin
- Proteins are not mixtures of molecules - they are
unique molecules with unique amino acid sequences
Primary structure of bovine insulin
from Stretton, A.O.W. (2002), Genetics, 162, 527.
Fred Sanger 1918 -
15 1950s: The first X-ray structures of proteins - myoglobin and haemoglobin
Picture of haemoglobin from Perutz, Br Med Bull. (1976), 32, 195-208
The first ever model of a protein molecule (1957, myoglobin model in plasticine). From the image library of the Science Museum
16Other structure-related milestones
17Other structure-related milestones PDB
- 1971: Establishment of the Protein Data Bank (PDB)
- initially with only 7 structures!
- currently holding > 60,000 structures
Number of searchable structures
http://www.wwpdb.org/
18The mother (and father) of bioinformatics
ALA > A, ARG > R, MET > M, PHE > F, TRP > W
Margaret Dayhoff (1925-1983)
Comprotein: a computer program to aid primary protein structure determination. Dayhoff, M.O. and Ledley, R.S. (1962), Proceedings of the December 4-6, 1962, Fall Joint Computer Conference (AFIPS), pages 262-274
IBM 7090 (Hagen, 2000)
19First attempts at graphics - 1960s
Space-filling model of the structure of
myoglobin (Francoeur, 2002)
Photograph of the Kluge display showing detail
from a myoglobin structure (Francoeur, 2002)
- Cyrus Levinthal and others at MIT were the first to use computers with powerful graphics to visualise the 3D structures of proteins.
- Levinthal built the first 3D model of cytochrome c (later shown to be incorrect)
20A helping hand for visualisation
1980: Ribbon diagrams introduced by Jane Richardson (hand-drawn!)
Ribbon schematic (hand-drawn and colored in 1981 by Jane Richardson) of the 3D structure of triose phosphate isomerase. Source: Wikipedia
The same protein (1tim) in the same orientation
but drawn in stick representation with Chimera.
21Bioinformatics milestones - Aligning sequences
global alignment
local alignment
T.F. SMITH AND M.S. WATERMAN
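To make the idea of local alignment concrete, here is a minimal sketch of the dynamic-programming recurrence associated with Smith and Waterman. It is a simplification for illustration only: the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) and the linear gap penalty are assumptions chosen here, and the function returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    # H[i][j] is the best score of a local alignment ending at a[i-1] and b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 lets an alignment restart anywhere: this is what
            # makes the alignment local rather than global.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

The two sequences share only the stretch ACG, so the best local score is three matches. Removing the floor of 0 and charging gap penalties along the borders of the table turns the same recurrence into a global alignment, which scores the sequences over their whole length.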
22Bioinformatics milestones - GenBank
- Began as a small database of sequences collected by Walter Goad in Los Alamos in 1979
- 1982: GenBank goes public, funded by the NIH
- A national nucleic acid sequence database
- More than 2000 sequences stored by 1983
- Now hosted at the National Center for Biotechnology Information (NCBI), and is part of an international collaboration involving EMBL and Japan
- Since its inception, GenBank has approximately doubled in size every 18 months
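To get a feel for what doubling every 18 months means, the growth can be written as a one-line formula. The helper below is a hypothetical illustration (the function name and the idealised, perfectly regular doubling are assumptions); it computes a fold increase, not actual GenBank sizes.

```python
def growth_factor(months, doubling_time_months=18):
    """Fold increase after `months` if size doubles every `doubling_time_months`."""
    return 2 ** (months / doubling_time_months)

print(growth_factor(18))   # 2.0
print(growth_factor(180))  # 1024.0
```

Under this idealised model a database grows roughly a thousandfold in 15 years, which is why manual inspection of the data quickly becomes impossible.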
23Bioinformatics - The golden era (1990 - 2000)
Please note that calling the 1990s the golden era is entirely my own subjective choice and not a widely accepted term in the bioinformatics community
24The golden era started in 1990 with BLAST
Altschul et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.
BLAST is a very fast program for searching large
databases of sequences
It is by far the most widely used tool produced
by bioinformaticians
25and ended with the first draft of the human
genome in 2000
From BBC news, 15 March 2000
26In that decade major events influenced the progress of bioinformatics
- 1990: Introduction of MAD (multiwavelength anomalous diffraction) for solving protein structures - Wayne Hendrickson
- 1993: The Sanger Centre was established
- 1994: The European Bioinformatics Institute was established
- 1995: The first bacterial genome was sequenced (Haemophilus influenzae)
- 1996: The yeast genome was sequenced
- 1997: PSI-BLAST was published
- 1998: First high-resolution structure of an ion channel, Rod MacKinnon
- 1999: First structures of the ribosome, Yonath, Steitz, Ramakrishnan, Noller
- and many others!
27Bioinformatics - The mature science (2001 -
today)
28The era of systems biology
The focus shifts from here
To here
29Computational biology at the centre of systems
biology
Figure from myCIB, at the University of Nottingham http://www.mycib.ac.uk/zope/mycib/about-mycib/document.2007-04-04.4299993065
30and synthetic biology
Hierarchy for synthetic biology inspired by
computer engineering
Figure from Andrianantoandro et al. (2006),
Molecular Systems Biology 2, 2006.0028
31So after this brief introduction, what is bioinformatics and what can we do with it?
32Many definitions of bioinformatics
- but they all boil down to more or less this:
- The application of computational, mathematical and statistical methods to solve biological problems
33Two types of bioinformatics
- The development of tools
- i.e. writing programs that implement algorithms that provide solutions to specific questions
- The use and application of such tools
- e.g. web-accessible databases, software that gets installed locally
34What is an algorithm?
An algorithm is a finite list of well-defined
instructions for accomplishing some task that,
given an initial state, will terminate in a
defined end-state.
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
35Slide adapted from Dr Thomas Schlitt (KCL) with
permission
36[Figure: list of experiment labels, Exp1_01C04 to Exp1_01D06]
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
37Why do we need bioinformatics?
- The answer usually comes down to the following
- There is too much data
- The calculations are too complex
- We don't have enough time
- For example, we can no longer look through all
available protein structures and check manually
whether they match a new structure we just
solved. There are more than 60,000 of them and
life is simply too short
38Bioinformatics may be more relevant than you
think
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
BMJ (2007), 335, 460-461
39Some examples of things you can do with
bioinformatics
40Major centres of biological data and tools - NCBI
- National Center for Biotechnology Information, Bethesda, US
- http://www.ncbi.nlm.nih.gov/guide/
41Major centres of biological data and tools - EBI
- European Bioinformatics Institute (EMBL outstation at Hinxton, UK)
- http://www.ebi.ac.uk
42Major centres of biological data and tools - RCSB
- Research Collaboratory for Structural Bioinformatics
- http://www.pdb.org
43Major centres of biological data and tools - KEGG
- Kyoto Encyclopedia of Genes and Genomes
- http://www.genome.jp/kegg
44Sample questions and where to find the answers
45Slide adapted from Dr Thomas Schlitt (KCL) with
permission
46Slide adapted from Dr Thomas Schlitt (KCL) with
permission
47Showcasing Bioinformatics research at Birkbeck
48Exploring the fly genome (Alona Sosinsky)
49Exploring regulatory elements (Alona Sosinsky)
http://te.cryst.bbk.ac.uk/
50Exploring regulatory elements (II) (Alona
Sosinsky)
- Gene regulatory elements consist of short conserved binding sites for specific transcription factors (TFs)
- Programs that attempt to find such binding sites often return many false positives and biologically unimportant sites
- However, in eukaryotic genomes regulatory binding sites are found in clusters (modules)
- Using information about the combination of transcription factors and their relative positioning can increase the accuracy of the predictions for new regulatory sites
51Exploring regulatory elements (III) (Alona
Sosinsky)
Graphical map for a cluster of putative binding sites
Individual binding sites for the Lozenge (Lz) transcription factor and its co-factor Pointed (Pnt)
Cluster of Lozenge and Pointed binding sites
Sosinsky A., Nucleic Acids Res. (2003)
52New regulatory elements for programmed cell death
(Alona Sosinsky)
- TargetExplorer was used to predict the binding sites of the transcription factor Lozenge in Drosophila
- Among the new targets, genes controlling cell death were over-represented
- A new functional role was predicted for Lozenge as a regulator of programmed cell death in the Drosophila eye
References:
Sosinsky A, Bonin CP, Mann RS, Honig B. (2003) Nucleic Acids Research, 31, 3589.
Wildonger J, Sosinsky A, Honig B, Mann RS. (2005) Genes & Development, 19, 1034.
53Towards a better understanding of intermolecular interactions (Mark Williams)
54Pro_ACT - Protein Accessibilities, Cavities & Contacts (Mark Williams)
Williams, M.A., Goodfellow, J.M., and Thornton,
J.M. (1994) Protein Science, 3, 1224.
55SCORPIO (Mark Williams)
A database of calorimetric data on binding of small molecules to proteins
Olsson et al. (2008), J Mol Biol, 384, 1002.
56Thermodynamics and surface area burial (Mark Williams)
57Role of hydration in molecular recognition (Mark Williams)
Software for the prediction and analysis of biomolecular atomic interactions and hydration
Pro_ACT: Protein Accessibilities, Cavities & conTacts
58Fighting flu (William Lees and Adrian Shepherd)
59The influenza virus (Adrian Shepherd)
- Influenza is an RNA-based virus infecting birds and mammals.
- Both epidemics and pandemics cause significant human mortality.
- Influenza type A is the most virulent in humans. It is divided into subtypes based on the antigenic properties of the HA and NA surface proteins (e.g. H3N2).
- The infection cycle begins when the HA surface protein binds to sialic acid on the surface of the host cell.
- Immunogenic activity is predominantly associated with HA.
Figure from Wikipedia
HA: haemagglutinin
NA: neuraminidase
60Immunodominant locations on haemagglutinin (Adrian Shepherd)
- Studies with monoclonal antibodies in the 1980s established 5 binding regions near the head of HA.
- Antibodies binding in these regions are believed to interfere sterically with receptor binding.
Wilson & Cox, 1990
61Antigenic clusters and vaccines (Adrian Shepherd)
Related influenza strains form antigenic clusters. Breakout from a cluster requires a vaccine update.
Smith et al., 2004
- Question
- Given a sequence of haemagglutinin, can we predict whether existing vaccines are any good?
- In other words, given two HA sequences, can we predict their antigenic distance?
62A new model of antigenic distance (Adrian Shepherd)
- A linear model, based on a count of changes at each antibody binding site
- Also includes changes in N-glycosylation sites as they are known to affect antibody binding (Skehel et al., 1984)
$$\log D^{c}_{ij} = x_1 N^{A}_{ij} + x_2 N^{B}_{ij} + x_3 N^{C}_{ij} + x_4 N^{D}_{ij} + x_5 N^{E}_{ij} + x_6 N^{DIFF}_{ij} + x_7 N^{NON}_{ij} + x_8 N^{GLYADD}_{ij} + x_9 N^{GLYCHANGE}_{ij} + k$$
- $N^{A}_{ij}, \dots, N^{E}_{ij}$: number of differing residues at binding sites A to E
- $N^{DIFF}_{ij}$: number of binding sites with differing residues
- $N^{NON}_{ij}$: number of differing residues outside binding sites
- $N^{GLYADD}_{ij}$: difference in number of N-glycosylation sites
- $N^{GLYCHANGE}_{ij}$: number of varying N-glycosylation sites
- The constants $x_i$ and $k$ are found by minimising the least-squares residual over a training set.
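Fitting the constants of a linear model by least squares can be sketched in plain Python. This is a generic illustration, not the authors' code: the function name `fit_linear_model`, the use of the normal equations, and the toy numbers (two predictors instead of nine, values made up so that the answer is known) are all assumptions introduced here.

```python
def fit_linear_model(features, targets):
    """Least-squares fit of targets = features . x + k via the normal equations.

    features: list of rows, one per pair of strains, each a list of counts.
    targets:  list of observed log distances, one per row.
    Returns (coefficients, intercept).
    """
    # Append a constant 1.0 to every row so that the intercept k is fitted too.
    rows = [list(r) + [1.0] for r in features]
    n = len(rows[0])
    # Augmented matrix of the normal equations (A^T A) w = A^T y.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    weights = [aug[i][n] / aug[i][i] for i in range(n)]
    return weights[:-1], weights[-1]

# Made-up training set generated from y = 0.5*a + 0.25*b + 1.0 (no noise).
toy_features = [[1, 0], [0, 1], [2, 1], [1, 3], [3, 2]]
toy_targets = [1.5, 1.25, 2.25, 2.25, 3.0]
coefficients, intercept = fit_linear_model(toy_features, toy_targets)
print([round(c, 6) for c in coefficients], round(intercept, 6))
```

With real data the targets would be noisy and the fit would only approximate them; in practice a numerical library would normally be used instead of hand-written elimination.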
63Fighting flu - conclusions (Adrian Shepherd)
- The commonly accepted list of varying amino acid locations near antigenic binding sites should be updated.
- Based on our data to 2008, generalised models can meet or exceed the predictive performance of immunodominant models on novel data.
- Performance of our models suggests that antibody binding may occur in regions outside the previously identified 5 antigenic sites.
64Sodium channels and the molecular basis of pain (Bonnie Wallace)
65Molecular basis of pain (Bonnie Wallace)
- The sodium channel Nav1.7 has been recognised as
a key contributor to human pain
- Mutations of Nav1.7 that promote channel
activation induce Erythromelalgia (Burning-foot
Syndrome), an inherited pain disorder
- Families with Nav1.7 nonsense mutations (i.e.
no functional copies of the channel) feel no pain!
66A structural basis for the effect of the F1449V mutation (Bonnie Wallace)
Wild-type Nav1.7
F1449V Mutant
Side View
View from Cytoplasm
Lampert et al. (2008), J. Biol. Chem. 283, 24118
67Molecular docking against disease (Irilenia Nobeli)
68Molecular docking (Slide adapted from Dr Arun Prasad)
Role of molecular docking
- Use to identify lead compounds
- Quantify the association of the lead compounds with the receptor
- Optimize lead compounds
The docking problem
- Sample the docking space (translation and rotation of ligand)
- Sample the ligand conformational space (torsion angles)
- Score the ligand-receptor interaction
69The case of alpha1-antitrypsin (I.N. in collaboration with Dr Gooptu)
- The native fold of alpha1-antitrypsin is metastable, allowing for the characteristic serpin mechanism of action
- The Glu342Lys (Z) mutant of alpha1-antitrypsin results in the formation of polymers that lead to disease of the liver and lungs
Gooptu et al. (2009), J Mol Biol, 387, 857.
70The case of alpha1-antitrypsin (II) (I.N. in collaboration with Dr B. Gooptu)
- The Thr114Phe mutation preserves activity but reduces polymerisation of wild-type antitrypsin in vitro
Wild-type
Thr114Phe
Pharmacophore for mimicking the Thr114Phe mutation
Gooptu et al. (2009), J Mol Biol, 387, 857.
71Fragment screening against a mutation-defined pharmacophore (I.N. with B. Gooptu)
5 top-ranking from Glide SP
65 top-ranking from induced fit docking
Gooptu et al. (2009), J Mol Biol, 387, 857.
72Solving the EM puzzles (Maya Topf)
73Models and resolution (Maya Topf)
20Å
2Å
10Å
74Fitting to EM density maps (Maya Topf)
75Multi-Component Fitting (Maya Topf)
Crystal structure of Arp2/3 complex (PDB 1TYQ, Nolen et al., 2004): 7 subunits, ranging from 15 to 45 kDa in size
76Modelling the dog ribosome (Maya Topf)
8.7 Å resolution
- 48 homology models (SSU - 16, LSU - 32) based on different templates (25-50% seq id), selected by a combination of CC and statistical potentials.
- Core rRNA (T. thermophilus for SSU, H. marismortui for LSU)
- Expansion segments (SSU - 11, LSU - 16), mostly A-form helices.
Chandramouli, Topf, Ménétret, Eswar, Gutell, Sali, Akey, Structure, 2008
77The metabolome (Irilenia Nobeli)
78The missing "ome"! (Irilenia Nobeli)
transcriptome
proteome
genome
Small molecules were pretty much ignored by
bioinformatics!
79Chemoinformatics (Irilenia Nobeli)
By analogy to bioinformatics, chemoinformatics uses computational methods to study small molecules
The function of small molecules is encoded in
their properties, and the properties are encoded
in their structure
80The metabolome and protein function (Irilenia Nobeli)
- Do homologous proteins bind similar substrates?
- the answer is superfamily dependent
farnesyl diphosphate synthase
triose phosphate isomerase
substrate conservation
substrate promiscuity
Nobeli et al. (2005), J Mol Biol 347, 415.
81Can we predict a protein's substrate? (Irilenia Nobeli)
922 metabolites docked against 27 SDR proteins
78% of the time we find the substrate in the top 10% of all scores
Favia et al. (2008), J Mol Biol, 375, 855.
82Metabolites drugs (Irilenia Nobeli)
Macchiarulo et al. (2009), J Chem Inf Model, 49,
2272
83Simulating the immune system (Adrian Shepherd)
84The ImmunoGrid aims (Adrian Shepherd)
- Develop a virtual human immune system
- Simulate immune processes at a natural scale, connecting molecular-level interactions with system-level models
- Ultimate goal: provide tools for applications in clinical immunology, the design of vaccines and immunotherapies
- Data standardisation
85The ImmunoGrid - How? (Adrian Shepherd)
Conway's Game of Life (1970)
Emergence of complex, unpredictable behaviour
from simple rules
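The rules of Conway's Game of Life fit in a few lines, which is part of what makes the emergent behaviour so striking. The sketch below is an illustration only: the function name `life_step` and the representation of the board as a set of live-cell coordinates on an unbounded square grid are choices made here.

```python
from collections import Counter

def life_step(live):
    """Advance Conway's Game of Life by one generation.

    live: set of (x, y) coordinates of live cells on an unbounded grid.
    """
    # For every cell next to a live cell, count its live neighbours.
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # A dead cell with exactly 3 live neighbours is born;
    # a live cell with 2 or 3 live neighbours survives; all others die.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(life_step(blinker)))  # [(1, 0), (1, 1), (1, 2)]
```

The three-cell "blinker" flips between a horizontal and a vertical bar forever, a tiny example of structured behaviour that is nowhere written into the rules themselves.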
86ImmunoGrid - The rules of life (Adrian Shepherd)
87The ImmunoGrid - An agent-based model (Adrian Shepherd)
- Agent-based model: a set of biological agents (cells and molecules) at given locations on a lattice, interacting probabilistically
In practice a hexagonal or triangular lattice is often used.
88Some reading material for your free time
- Clare Sansom (2009). Molecules made to measure. Chemistry World, November 2009, 50.
- Available from www.rsc.org/images/Drug%20design%20HIV_tcm18-166406.pdf
- Minoru Kanehisa (1998). Grand challenges in Bioinformatics. Bioinformatics, 14, 309.
- Available from www.ncbi.nlm.nih.gov/pubmed/9687209
- Hiroaki Kitano (2002). Systems Biology - A brief overview. Science 295, 1662.
- Email me for a reprint if you have no access to Science.
89Bibliography
- Disclaimer
- These are resources I used to put together these lectures and by no means do I endorse any books or suggest you should go out and buy them! You are lucky enough to have a huge bookstore right next to your door. Go and find out for yourselves what you like and what you don't like! Many chapters may also be available through Google Books, so you can have a look at them as well.
90Bibliography
- Books
- Developing Bioinformatics Computer Skills
- Gibas & Jambeck, O'Reilly, ISBN 1-56592-664-1
- A soft introduction including a nice intro to basic Unix commands
- Reasonable overview and might just get your curiosity going
- Does not go into detail on anything and it is relatively old (2001)
91Bibliography
- Bioinformatics for Dummies. Claverie & Notredame. Wiley, 2006.
- A classic from the Dummies series. Has generally received very good reviews.
- Introduction to Bioinformatics. Arthur Lesk. OUP, 2008.
- Now in its 3rd edition, so obviously not bad.
92Bibliography
- Websites
- Bioinformatics milestones
- http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html
- 50 years of protein structure determination
- http://publications.nigms.nih.gov/psi/timeline.html
93Bibliography
- Papers
- Hagen (2000). The origins of bioinformatics. Nat. Rev. Genet., 1, 231.
- Stretton (2002). The first sequence: Fred Sanger and insulin. Genetics, 162, 527.
- Francoeur (2002). Cyrus Levinthal, the Kluge, and the origins of interactive molecular graphics. Endeavour, 26, 127.
94Acknowledgements
- Many thanks to Dr Thomas Schlitt for his slides
- Thanks to all computational biologists at BBK
Crystallography who made slides and articles from
their research available to me
95If you want to build a ship, don't drum up people to collect wood and don't assign them tasks and work, but rather teach them to long for the endless immensity of the sea.
Antoine de Saint-Exupéry