Introduction to Bioinformatics, Part I: How did we get here and what can we do now?

1
Introduction to Bioinformatics - Part I
How did we get here and what can we do now?

Irilenia Nobeli
BBK Biological Sciences - Crystallography

2
Why I am here

I work here (well, in Crystallography)
You will inevitably come across Bioinformatics at
some stage
Learning about databases and tools may come in handy
I hope you will be inspired

3
Overview of these lectures

First lecture (26/11/2009)
Introduction (a rather biased and subjective view
of the field and its history)
Second lecture (first half of 3/12/2009)
Sequence analysis: pairwise and multiple
alignment, BLAST and HMMs
Third lecture (second half of 3/12/2009)
Structural bioinformatics

4
Overview of these lectures (where to find things)

Lecture slides and lecture notes on the
Blackboard
There are notes accompanying these slides (only
for slides that are not self-explanatory)
Practicals
There will be one practical session on the 10th
of December
Room G10 has been booked from 6:00 pm to 9:00 pm.
The event title is "Molecular Biology"
Coursework
Coursework will be assigned at the end of the
second lecture

5
Today's lecture

A brief history of bioinformatics and the events
that led to the establishment of this field
A series of research questions that can be
addressed by bioinformatics/computational biology
approaches
Biased by my experience and that of people in our
department

Aim of this lecture
To get your curiosity going and give you a broad
overview of the field of computational biology

6
Bioinformatics - The early days (? - 1990)
7
Theory Milestones - Evolution

How is evolution achieved?
Sequences change over time. Mutations happen,
often due to errors in replication, chemicals,
light, etc.
Divergence of sequences is also a result of
recombination, gene duplication, speciation,
horizontal gene transfer events.

natural selection (19th century)
genetic drift (20th century)
8
Milestones - Evolution (backwards)

DNA sequences determine (almost entirely) the
appearance and characteristics of organisms
Biological sequences show complex patterns of
similarity to one another, regardless of external
similarities

The logical explanation for the similarities
observed is that sequences (and organisms) share
common ancestry

9
Theory Milestones - Inheritance
Laws of heredity: dominant genes conceal the
phenotype of recessive genes but do not alter the
recessive genes themselves

Parents: dd x rr
1st generation: dr, dr
2nd generation: dd, dr, dr, rr
Gregor Johann Mendel (1822 - 1884)
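
The cross on this slide can be enumerated mechanically. The following is a minimal Python sketch, assuming each parent passes on one of its two alleles with equal probability. The function name `cross` is illustrative, and the allele letters `d` (dominant) and `r` (recessive) follow the slide's labels.

```python
from collections import Counter
from itertools import product


def cross(parent1, parent2):
    """Count offspring genotypes from all allele combinations of two parents."""
    # Sort each pair so that "rd" and "dr" are counted as the same genotype.
    return Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))


first_generation = cross("dd", "rr")    # every offspring is "dr"
second_generation = cross("dr", "dr")   # dd : dr : rr in the ratio 1 : 2 : 1
print(first_generation, second_generation)
```

Only the `rr` offspring of the second generation show the recessive phenotype, which gives the familiar 3 : 1 ratio of dominant to recessive appearance.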
10
Theory Milestones - The central dogma of Biology

The direction of information flow between DNA,
RNA and proteins is restricted.
The central dogma is often stated as:
"Once (sequential) information has passed into
protein, it cannot get out again."

1958
1970
Crick, F. (1958), Symp. Soc. Exp. Biol., XII, 138.
Crick, F. (1970), Nature, 227, 561.
11
Theory Milestones - The central dogma (II)

The central dogma was not accepted without
controversy
Much of it related to the simplification of
stating it as "DNA makes RNA makes protein"
If a dogma ends up having too many exceptions, it
somehow loses much of its appeal

the Central Dogma now would have to go
something like this: 'DNA makes RNA makes
protein, but sometimes RNA can make DNA and other
times RNA makes RNA, which makes proteins
different from what they would be if only DNA
made the RNA, and once upon a time RNA made
protein, probably, but no-one knows for certain'.
From Petsko's comment "Dog eat dogma" in Genome
Biology (2000), 1 (2), comment1002.1-1002.2
12
Theory Milestones - The central dogma (III)

The word "dogma" created as much, if not more,
controversy.
Crick himself writes:

"As it turned out the use of the word dogma
caused almost more trouble than it was worth... Many
years later Jacques Monod pointed out to me that
I did not appear to understand the correct use of
the word dogma, which is a belief that cannot be
doubted. ... I used the word the way I myself
thought about it ..., and simply applied it to a
grand hypothesis that, however plausible, had
little direct experimental support."
From Crick's autobiography, as quoted in
http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
13
1953: The structure of DNA
Maurice Wilkins 1916 - 2004
X-ray photograph of DNA
The Watson and Crick model
Rosalind Franklin 1920 - 1958
James Watson (1928-) and Francis Crick (1916-2004)
14
1955: Complete sequence of insulin

Proteins are not mixtures of molecules - they are
unique molecules with unique amino acid sequences

Primary structure of bovine insulin
from Stretton, A.O.W. (2002), Genetics, 162, 527.
Fred Sanger 1918 -
15
1950s: The first X-ray structures of proteins -
Myoglobin & Haemoglobin
Picture of haemoglobin from Perutz, Br Med
Bull. (1976), 32, 195-208
The first ever model of a protein molecule
(1957, myoglobin model in plasticine) From the
image library of the Science Museum
16
Other structure-related milestones
17
Other structure-related milestones PDB

1971: Establishment of the Protein Data Bank
(PDB)
initially with only 7 structures!
currently holding > 60,000 structures

Number of searchable structures
http://www.wwpdb.org/
18
The mother (and father) of bioinformatics
ALA -> A, ARG -> R, MET -> M, PHE -> F, TRP -> W
Margaret Dayhoff (1925-1983)
Comprotein: a computer program to aid primary
protein structure determination. Dayhoff, M.O. &
Ledley, R.S. (1962), Proceedings of the December
4-6, 1962, Fall Joint Computer Conference (AFIPS),
pages 262-274
IBM 7090 (Hagen, 2000)
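
The one-letter amino acid code shown on this slide is, in practice, a lookup table. A minimal Python sketch follows. The table holds the 20 standard amino acids, and the helper name `to_one_letter` is an illustrative choice, not taken from any particular library.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


print(to_one_letter(["Ala", "Arg", "Met", "Phe", "Trp"]))  # prints ARMFW
```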
19
First attempts at graphics - 1960s
Space-filling model of the structure of
myoglobin (Francoeur, 2002)
Photograph of the Kluge display showing detail
from a myoglobin structure (Francoeur, 2002)

Cyrus Levinthal and others at MIT were the first
to use computers with powerful graphics to
visualise the 3D structures of proteins.
Levinthal built the first 3D model of cytochrome
c (later shown to be incorrect)

20
A helping hand for visualisation
1980: Ribbon diagrams introduced by Jane
Richardson (hand-drawn!)
Ribbon schematic (hand-drawn & colored, in 1981
by Jane Richardson) of the 3D structure of triose
phosphate isomerase. Source: Wikipedia
The same protein (1tim) in the same orientation
but drawn in stick representation with Chimera.
21
Bioinformatics milestones - Aligning sequences
global alignment
local alignment
T.F. SMITH AND M.S. WATERMAN
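
To make the idea of local alignment concrete, here is a minimal Python sketch of the Smith-Waterman dynamic programming recurrence. It returns only the best local score, not the alignment itself. The scoring values (match +2, mismatch -1, linear gap -1) are illustrative assumptions, not parameters given in these slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = score[i - 1][j] + gap              # gap in b
            left = score[i][j - 1] + gap            # gap in a
            # The floor at zero lets an alignment restart anywhere.
            score[i][j] = max(0, diagonal, up, left)
            best = max(best, score[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6: the shared "ACG"
```

Dropping the floor at zero and filling the first row and column with cumulative gap penalties turns the same recurrence into a global alignment, where the score is read from the bottom-right cell.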
22
Bioinformatics milestones - GenBank

Began as a small database of sequences collected
by Walter Goad in Los Alamos in 1979
1982: GenBank goes public, funded by the NIH
A national nucleic acid sequence database
More than 2000 sequences stored by 1983
Now hosted at the National Center for
Biotechnology Information (NCBI) and part of
an international collaboration involving EMBL and
Japan
Since its inception, GenBank has approximately
doubled in size every 18 months
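
To get a feel for what doubling every 18 months means, here is a small Python sketch. It assumes idealised, perfectly regular doubling, and the function name `growth_factor` is illustrative.

```python
def growth_factor(years, doubling_time_years=1.5):
    """Factor by which a quantity grows if it doubles every doubling_time_years."""
    return 2 ** (years / doubling_time_years)


print(growth_factor(3))    # 4.0: two doublings in three years
print(growth_factor(15))   # 1024.0: ten doublings in fifteen years
```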

23
Bioinformatics - The golden era (1990 - 2000)
Please note that calling the 1990s the "golden
era" is entirely my own subjective choice and not
a widely accepted term in the bioinformatics
community
24
The golden era started in 1990 with BLAST
Altschul et al. (1990). Basic local alignment
search tool. J. Mol. Biol. 215: 403-10.
BLAST is a very fast program for searching large
databases of sequences
It is by far the most widely used tool produced
by bioinformaticians
25
and ended with the first draft of the human
genome in 2000
From BBC news, 15 March 2000
26
In that decade major events influenced the
progress of bioinformatics

1990: Introduction of MAD (multiwavelength
anomalous diffraction) for solving protein
structures - Wayne Hendrickson
1993: The Sanger Centre was established
1994: The European Bioinformatics Institute was
established
1995: The first bacterial genome was sequenced
(Haemophilus influenzae)
1996: The yeast genome was sequenced
1997: PSI-BLAST was published
1998: First high-resolution structure of an ion
channel, Rod MacKinnon
1999: First structures of the ribosome, Yonath,
Steitz, Ramakrishnan, Noller
and many others!

27
Bioinformatics - The mature science (2001 -
today)
28
The era of systems biology
The focus shifts from here
To here
29
Computational biology at the centre of systems
biology
Figure from myCIB, at the University of
Nottingham
http://www.mycib.ac.uk/zope/mycib/about-mycib/document.2007-04-04.4299993065
30
and synthetic biology
Hierarchy for synthetic biology inspired by
computer engineering
Figure from Andrianantoandro et al. (2006),
Molecular Systems Biology 2, 2006.0028
31
So after this brief introduction, what is
bioinformatics and what can we do with it?
32
Many definitions of bioinformatics

but they all boil down to more or less this:
the application of computational, mathematical
and statistical methods to solve biological
problems

33
Two types of bioinformatics

The development of tools
i.e. writing programs that implement algorithms
that provide solutions to specific questions
The use and application of such tools
e.g. web-accessible databases, software that gets
installed locally

34
What is an algorithm?
An algorithm is a finite list of well-defined
instructions for accomplishing some task that,
given an initial state, will terminate in a
defined end-state.
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
35
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
36
[Figure: data columns labelled Exp1_01C04 to Exp1_01C12 and Exp1_01D01 to Exp1_01D06]
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
37
Why do we need bioinformatics?

The answer usually comes down to the following:
There is too much data
The calculations are too complex
We don't have enough time
For example, we can no longer look through all
available protein structures and check manually
whether they match a new structure we just
solved. There are more than 60,000 of them and
life is simply too short.

38
Bioinformatics may be more relevant than you
think
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
BMJ (2007), 335, 460-461
39
Some examples of things you can do with
bioinformatics
40
Major centres of biological data and tools - NCBI

National Center for Biotechnology Information,
Bethesda, US

http://www.ncbi.nlm.nih.gov/guide/
41
Major centres of biological data and tools - EBI

European Bioinformatics Institute (EMBL
outstation at Hinxton, UK)
http://www.ebi.ac.uk

42
Major centres of biological data and tools - RCSB

Research Collaboratory for Structural
Bioinformatics
http://www.pdb.org

43
Major centres of biological data and tools - KEGG

Kyoto Encyclopedia of Genes and Genomes
http://www.genome.jp/kegg

44
Sample questions and where to find the answers
45
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
46
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
47
Showcasing Bioinformatics research at Birkbeck
48
Exploring the fly genome - Alona Sosinsky
49
Exploring regulatory elements (Alona Sosinsky)
http://te.cryst.bbk.ac.uk/
50
Exploring regulatory elements (II) (Alona
Sosinsky)

Gene regulatory elements consist of short
conserved binding sites for specific
transcription factors (TFs)
Programs that attempt to find such binding sites
often return many false positives and
biologically unimportant sites
However, in eukaryotic genomes regulatory binding
sites are found in clusters (modules)
Using information about the combination of
transcription factors and their relative
positioning can increase the accuracy of the
predictions for new regulatory sites
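
The clustering idea above can be sketched in a few lines of Python. This is only an illustration of the principle, not the method used by any particular tool. The function name `find_clusters`, the `window` parameter and the example positions are all hypothetical.

```python
def find_clusters(sites, required, window):
    """Return (start, end) spans holding at least one site for every required factor.

    sites: list of (position, factor) tuples for predicted binding sites.
    required: set of factor names that must all occur in a cluster.
    window: maximum span, in base pairs, that a cluster may cover.
    """
    ordered = sorted(sites)
    clusters = []
    for i, (start, _) in enumerate(ordered):
        # All predicted sites that fall within `window` of this start site.
        members = [(pos, tf) for pos, tf in ordered[i:] if pos - start <= window]
        if required <= {tf for _, tf in members}:
            span = (start, members[-1][0])
            # Skip spans already contained in the previously reported cluster.
            if not clusters or span[1] > clusters[-1][1]:
                clusters.append(span)
    return clusters


# Hypothetical predictions: positions in base pairs, factor names as on the slides.
predicted = [(100, "Lz"), (130, "Pnt"), (160, "Lz"), (900, "Lz"), (2000, "Pnt")]
print(find_clusters(predicted, {"Lz", "Pnt"}, window=100))  # [(100, 160)]
```

The isolated predictions at 900 and 2000 are discarded because no partner site lies nearby, which is how requiring a combination of factors can cut down false positives.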

51
Exploring regulatory elements (III) (Alona
Sosinsky)
Graphical map for a cluster of putative binding
sites
Individual binding sites for the Lozenge (Lz)
transcription factor and its co-factor Pointed (Pnt)
Cluster of Lozenge and Pointed binding sites
Sosinsky A., Nucleic Acids Res. (2003)
52
New regulatory elements for programmed cell death
(Alona Sosinsky)

TargetExplorer was used to predict the binding
sites of the transcription factor Lozenge in
Drosophila
Among the new targets, genes controlling
cell death were over-represented
A new functional role was predicted for Lozenge
as regulator of programmed cell death in the
Drosophila eye

References:
Sosinsky A, Bonin CP, Mann RS, Honig B. (2003)
Nucleic Acids Research, 31, 3589.
Wildonger J, Sosinsky A, Honig B, Mann RS.
(2005) Genes & Development, 19, 1034.
53
Towards a better understanding of intermolecular
interactions - Mark Williams
54
Pro_ACT - Protein Accessibilities, Cavities &
conTacts (Mark Williams)
Williams, M.A., Goodfellow, J.M., and Thornton,
J.M. (1994) Protein Science, 3, 1224.
55
SCORPIO (Mark Williams)
A database of calorimetric data on binding of
small-molecules to proteins
Olsson et al. (2008), J Mol Biol, 384, 1002.
56
Thermodynamics and surface area burial (Mark
Williams)
57
Role of hydration in molecular recognition (Mark
Williams)
Software for the prediction and analysis of
biomolecular atomic interactions and hydration
Pro_ACT: Protein Accessibilities, Cavities &
conTacts
58
Fighting flu - William Lees & Adrian Shepherd
59
The influenza virus (Adrian Shepherd)

Influenza is an RNA-based virus infecting birds
and mammals.
Both epidemics and pandemics cause significant
human mortality.
Influenza type A is the most virulent in humans.
It is divided into subtypes based on the
antigenic properties of the HA and NA surface
proteins (e.g. H3N2).
The infection cycle begins when the HA surface
protein binds to sialic acid on the surface of
the host cell.
Immunogenic activity is predominantly associated
with HA.

Figure from Wikipedia
HA: haemagglutinin
NA: neuraminidase
60
Immunodominant locations on haemagglutinin (Adrian
Shepherd)

Studies with monoclonal antibodies in the 1980s
established 5 binding regions near the head of
HA.
Antibodies binding in these regions are believed
to interfere sterically with receptor binding.

Wilson & Cox, 1990
61
Antigenic clusters and vaccines (Adrian Shepherd)
Related influenza strains form antigenic
clusters. Breakout from a cluster requires a
vaccine update.
Smith et al., 2004

Question
Given a sequence of haemagglutinin, can we
predict whether existing vaccines are any good?
In other words, given two HA sequences, can we
predict their antigenic distance?

62
A new model of antigenic distance (Adrian
Shepherd)

A linear model, based on a count of changes at
each antibody binding site
Also includes changes in N-glycosylation sites as
they are known to affect antibody binding (Skehel
et al., 1984)

\log D^{c}_{ij} = x_1 N^{A}_{ij} + x_2 N^{B}_{ij} + x_3 N^{C}_{ij} + x_4 N^{D}_{ij}
    + x_5 N^{E}_{ij} + x_6 N^{DIFF}_{ij} + x_7 N^{NON}_{ij}
    + x_8 N^{GLYADD}_{ij} + x_9 N^{GLYCHANGE}_{ij} + k

N^{A}_{ij} to N^{E}_{ij}: number of differing residues at binding site A to E
N^{DIFF}_{ij}: number of binding sites with differing residues
N^{NON}_{ij}: number of differing residues outside binding sites
N^{GLYADD}_{ij}: difference in number of N-glycosylation sites
N^{GLYCHANGE}_{ij}: number of varying N-glycosylation sites

The constants x_i and k are found by minimising
the least-squares residual over a training set.
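
As a rough illustration of how the sequence-difference terms of such a linear model could be computed, here is a minimal Python sketch. It covers only the per-site, DIFF and NON counts and leaves out the glycosylation terms. The binding-site positions, coefficients and constant `k` in the example are hypothetical placeholders, not fitted values from the study, and the function names are illustrative.

```python
def difference_features(seq_i, seq_j, binding_sites):
    """Count residue differences between two aligned, equal-length sequences.

    binding_sites maps a site label to the set of 0-based positions it covers.
    Returns per-site counts plus the DIFF and NON terms of the model.
    """
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    differing = {p for p, (a, b) in enumerate(zip(seq_i, seq_j)) if a != b}
    features = {label: len(differing & positions)
                for label, positions in binding_sites.items()}
    n_sites_changed = sum(1 for label in binding_sites if features[label] > 0)
    in_sites = set().union(*binding_sites.values())
    features["DIFF"] = n_sites_changed
    features["NON"] = len(differing - in_sites)
    return features


def log_distance(features, coefficients, k):
    """Evaluate the linear model: k plus the weighted sum of the counts."""
    return k + sum(coefficients[name] * count for name, count in features.items())


# Hypothetical toy input: two short aligned fragments and two made-up sites.
sites = {"A": {0, 1}, "B": {3}}
feats = difference_features("KTGNS", "KAGDT", sites)
print(feats)  # {'A': 1, 'B': 1, 'DIFF': 2, 'NON': 1}
weights = {"A": 0.5, "B": 0.25, "DIFF": 0.125, "NON": 0.0625}  # placeholders
print(log_distance(feats, weights, k=0.5))  # 1.5625
```

In a real application the coefficients would come from the least-squares fit over a training set of measured antigenic distances, as the slide describes.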

63
Fighting flu - conclusions (Adrian Shepherd)

The commonly accepted list of varying amino acid
locations near antigenic binding sites should be
updated.
Based on our data to 2008, generalised models can
meet or exceed predictive performance of
immunodominant models on novel data.
Performance of our models suggests that antibody
binding may occur in regions outside the
previously identified 5 antigenic sites.

64
Sodium channels and the molecular basis of
pain - Bonnie Wallace
65
Molecular basis of pain (Bonnie Wallace)

The sodium channel Nav1.7 has been recognised as
a key contributor to human pain

Mutations of Nav1.7 that promote channel
activation induce Erythromelalgia (Burning-foot
Syndrome), an inherited pain disorder

Families with Nav1.7 nonsense mutations (i.e.
no functional copies of the channel) feel no pain!

66
A structural basis for the effect of the F1449V
mutation (Bonnie Wallace)
Wild-type Nav1.7
F1449V Mutant
Side View
View from Cytoplasm
Lampert et al. (2008), J. Biol. Chem. 283, 24118
67
Molecular docking against disease - Irilenia Nobeli
68
Molecular docking (Slide adapted from Dr Arun
Prasad)
Role of molecular docking

Identify lead compounds
Quantify the association of the lead compounds
with the receptor
Optimize lead compounds

The docking problem

Sample the docking space (translation and
rotation of ligand)
Sample the ligand conformational space (torsion
angles)
Score the ligand-receptor interaction

69
The case of alpha1-antitrypsin (I.N. in
collaboration with Dr Gooptu)

The native fold of alpha1-antitrypsin is
metastable, allowing for the characteristic serpin
mechanism of action
The Glu342Lys (Z) mutant of alpha1-antitrypsin
results in the formation of polymers that lead to
disease of the liver and lungs

Gooptu et al. (2009), J Mol Biol, 387, 857.
70
The case of alpha1-antitrypsin (II) (I.N. in
collaboration with Dr B. Gooptu)

The Thr114Phe mutation preserves activity but
reduces polymerisation of wild-type antitrypsin
in vitro

Wild-type
Thr114Phe
Pharmacophore for mimicking the Thr114Phe mutation
Gooptu et al. (2009), J Mol Biol, 387, 857.
71
Fragment screening against a mutation-defined
pharmacophore (I.N. with B. Gooptu)
5 top-ranking from Glide SP
65 top-ranking from induced fit docking
Gooptu et al. (2009), J Mol Biol, 387, 857.
72
Solving the EM puzzles - Maya Topf
73
Models and resolution (Maya Topf)
20Å
2Å
10Å
74
Fitting to EM density maps (Maya Topf)
75
Multi-Component Fitting (Maya Topf)
Crystal structure of Arp2/3 complex (PDB
1TYQ; Nolen et al., 2004): 7 subunits, ranging
from 15 to 45 kDa in size
76
Modelling the dog ribosome (Maya Topf)
8.7 Å resolution

48 homology models (SSU - 16, LSU - 32) based on
different templates (25-50% seq id), selected by
a combination of CC and statistical potentials.
Core rRNA (T. thermophilus for SSU, H.
marismortui for LSU)
Expansion segments (SSU - 11, LSU - 16), mostly
A-form helices.

Chandramouli, Topf, Ménétret, Eswar, Gutell,
Sali, Akey, Structure, 2008
77
The metabolome - Irilenia Nobeli
78
The missing "ome"! (Irilenia Nobeli)
transcriptome
proteome
genome
Small molecules were pretty much ignored by
bioinformatics!
79
Chemoinformatics (Irilenia Nobeli)
By analogy to bioinformatics, chemoinformatics
uses computational methods to study small molecules
The function of small molecules is encoded in
their properties, and the properties are encoded
in their structure
80
The metabolome and protein function (Irilenia
Nobeli)

Do homologous proteins bind similar substrates?
The answer is superfamily-dependent

farnesyl diphosphate synthase
triose phosphate isomerase
substrate conservation
substrate promiscuity
Nobeli et al. (2005), J Mol Biol 347, 415.
81
Can we predict a protein's substrate? (Irilenia
Nobeli)
922 metabolites docked against 27 SDR proteins
78% of the time we find the substrate in the top
10% of all scores
Favia et al. (2008), J Mol Biol, 375, 855.
82
Metabolites & drugs (Irilenia Nobeli)
Macchiarulo et al. (2009), J Chem Inf Model, 49,
2272
83
Simulating the immune system - Adrian Shepherd
84
The ImmunoGrid aims (Adrian Shepherd)

Develop a virtual human immune system
Simulate immune processes at a natural scale,
connecting molecular level interactions with
system level models
Ultimate goal: provide tools for applications in
clinical immunology, the design of vaccines and
immunotherapies
Data standardisation

85
The ImmunoGrid - How? (Adrian Shepherd)
Conway's Game of Life (1970)
Emergence of complex, unpredictable behaviour
from simple rules
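
To show how few rules are needed, here is a minimal Python sketch of one generation of Conway's Game of Life on an unbounded grid. Live cells are stored as a set of (x, y) coordinates, which is just one convenient representation, and the function name `life_step` is illustrative.

```python
from collections import Counter


def life_step(live):
    """Advance a set of live (x, y) cells by one generation."""
    # Count, for every cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next turn with exactly 3 live neighbours,
    # or with 2 live neighbours if it is already alive.
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live)}


blinker = {(0, 1), (1, 1), (2, 1)}     # horizontal bar of three cells
print(sorted(life_step(blinker)))      # [(1, 0), (1, 1), (1, 2)]: vertical bar
```

Applying `life_step` twice returns the blinker to its starting state. Richer patterns, such as gliders, emerge from exactly the same two rules.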
86
ImmunoGrid - The rules of life (Adrian Shepherd)
87
The ImmunoGrid - An agent-based model (Adrian
Shepherd)

Agent-based model: a set of biological agents
(cells and molecules), each at a given location on
a lattice, interacting probabilistically

In practice, a hexagonal or triangular lattice is
often used.
88
Some reading material for your free time

Clare Sansom (2009). Molecules made to measure.
Chemistry World, November 2009, 50.
Available from
www.rsc.org/images/Drug%20design%20HIV_tcm18-166406.pdf
Minoru Kanehisa (1998). Grand challenges in
Bioinformatics. Bioinformatics, 14, 309.
Available from
www.ncbi.nlm.nih.gov/pubmed/9687209
Hiroaki Kitano (2002). Systems Biology - A brief
overview. Science 295, 1662.
Email me for a reprint if you have no access to
Science.

89
Bibliography

Disclaimer
These are resources I used to put together these
lectures and by no means do I endorse any books
or suggest you should go out and buy them! You
are lucky enough to have a huge bookstore right
next to your door. Go and find out for yourselves
what you like and what you don't like! Many
chapters may also be available through Google
Books so you can have a look at them as well.

90
Bibliography

Books
Developing Bioinformatics Computer Skills.
Gibas & Jambeck, O'Reilly, ISBN 1-56592-664-1
A soft introduction including a nice intro to
basic Unix commands
Reasonable overview and might just get your
curiosity going..
Does not go into detail in anything and it is
relatively old (2001)

91
Bibliography

Bioinformatics for dummies. Claverie & Notredame.
Wiley, 2006.
A classic from the "dummies" series. Has
generally received very good reviews.
Introduction to bioinformatics. Arthur Lesk. OUP,
2008.
Now in its 3rd edition, so obviously not bad.

92
Bibliography

Websites
Bioinformatics milestones
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html
50 years of protein structure determination
http://publications.nigms.nih.gov/psi/timeline.html

93
Bibliography

Papers
Hagen (2000). The origins of bioinformatics. Nat.
Rev. Genet., 1, 231.
Stretton (2002). The first sequence: Fred Sanger
and insulin. Genetics, 162, 527.
Francoeur (2002). Cyrus Levinthal, the Kluge, and
the origins of interactive molecular graphics.
Endeavour, 26, 127.

94
Acknowledgements

Many thanks to Dr Thomas Schlitt for his slides
Thanks to all computational biologists at BBK
Crystallography who made slides and articles from
their research available to me

95
If you want to build a ship, don't drum up people
to collect wood and don't assign them tasks and
work, but rather teach them to long for the
endless immensity of the sea.
Antoine de Saint-Exupéry
