Introduction to Bioinformatics Part I: How did we get here and what can we do now?

Provided by: irileni
1
Introduction to Bioinformatics - Part I: How did
we get here and what can we do now?
  • Irilenia Nobeli
  • BBK Biological Sciences - Crystallography

2
Why I am here
  • I work here (well, in Crystallography)
  • You will inevitably come across Bioinformatics at
    some stage
  • Learning about databases and tools may come in handy
  • I hope you will be inspired

3
Overview of these lectures
  • First lecture (26/11/2009)
  • Introduction (a rather biased and subjective view
    of the field and its history)
  • Second lecture (first half of 3/12/2009)
  • Sequence analysis: pairwise and multiple
    alignment, BLAST and HMMs
  • Third lecture (second half of 3/12/2009)
  • Structural bioinformatics

4
Overview of these lectures (where to find things)
  • Lecture slides and lecture notes on the
    Blackboard
  • There are notes accompanying these slides (only
    for slides that are not self-explanatory)
  • Practicals
  • There will be one practical session on the 10th
    of December
  • Room G10 has been booked from 6:00 pm to 9:00 pm.
    The event title is Molecular Biology
  • Coursework
  • Coursework will be assigned at the end of the
    second lecture

5
Today's lecture
  • A brief history of bioinformatics and the events
    that led to the establishment of this field
  • A series of research questions that can be
    addressed by bioinformatics/computational biology
    approaches
  • Biased by my experience and that of people in our
    department
  • Aim of this lecture
  • To get your curiosity going and give you a broad
    overview of the field of computational biology

6
Bioinformatics - The early days? - 1990
7
Theory Milestones - Evolution
  • How is evolution achieved?
  • Sequences change over time. Mutations happen
    often due to errors in replication, chemicals,
    light etc
  • Divergence of sequences is also a result of
    recombination, gene duplication, speciation,
    horizontal gene transfer events.

natural selection (19th century)
genetic drift (20th century)
8
Milestones - Evolution (backwards)
  • DNA sequences determine (almost entirely) the
    appearance and characteristics of organisms
  • Biological sequences show complex patterns of
    similarity to one another, regardless of external
    similarities
  • The logical explanation for the similarities
    observed is that sequences (and organisms) share
    common ancestry

9
Theory Milestones - Inheritance
Laws of heredity: dominant genes conceal the
phenotype of recessive genes but do not alter the
recessive genes themselves

parents: dd x rr
1st generation: dr, dr
2nd generation: dd, dr, dr, rr
Gregor Johann Mendel (1822-1884)
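
The cross on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `cross` and `phenotype` are made up for this example, and it assumes a single gene with one dominant allele (d) and one recessive allele (r), as in the diagram.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes from crossing two single-gene genotypes."""
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort the pair so that 'rd' and 'dr' count as the same genotype.
        offspring["".join(sorted(allele1 + allele2))] += 1
    return offspring


def phenotype(genotype: str) -> str:
    """A single dominant allele (d) is enough to conceal the recessive one."""
    return "dominant" if "d" in genotype else "recessive"


first_generation = cross("dd", "rr")   # every offspring is dr
second_generation = cross("dr", "dr")  # dd, dr, dr, rr

for genotype, count in sorted(second_generation.items()):
    print(genotype, count, phenotype(genotype))
```

In the second generation the recessive phenotype reappears in one offspring out of four, which shows that the recessive gene was concealed in the first generation but not altered.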
10
Theory Milestones - The central dogma of Biology
  • The direction of information flow between DNA,
    RNA and proteins is restricted.
  • The central dogma is often stated as:
  • "Once (sequential) information has passed into
    protein, it cannot get out again."

1958
1970
Crick, F. (1958), Symp. Soc. Exp. Biol., XII,
138. Crick, F. (1970), Nature, 227, 561.
11
Theory Milestones - The central dogma (II)
  • The central dogma was not accepted without
    controversy
  • Much of it related to the simplification of
    stating it as "DNA makes RNA makes protein"
  • If a dogma ends up having too many exceptions, it
    somehow loses much of its appeal

"... the Central Dogma now would have to go
something like this: 'DNA makes RNA makes
protein, but sometimes RNA can make DNA and other
times RNA makes RNA, which makes proteins
different from what they would be if only DNA
made the RNA, and once upon a time RNA made
protein, probably, but no-one knows for certain'."
From Petsko's comment "Dog eat dogma" in Genome
Biology (2000), 1 (2), comment1002.1-1002.2
12
Theory Milestones - The central dogma (III)
  • The word "dogma" created as much, if not more,
    controversy.
  • Crick himself writes:

"As it turned out the use of the word dogma
caused almost more trouble than it was worth... Many
years later Jacques Monod pointed out to me that
I did not appear to understand the correct use of
the word dogma, which is a belief that cannot be
doubted. ... I used the word the way I myself
thought about it ... and simply applied it to a
grand hypothesis that, however plausible, had
little direct experimental support."
From Crick's autobiography, as quoted in
http://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology
13
1953 The Structure of DNA
Maurice Wilkins 1916 - 2004
X-ray photograph of DNA
The Watson and Crick model
Rosalind Franklin 1920 - 1958
James Watson (1928-) and Francis Crick (1916-2004)
14
1955 Complete sequence of insulin
  • Proteins are not mixtures of molecules - they are
    unique molecules with unique amino acid sequences

Primary structure of bovine insulin
from Stretton, A.O.W. (2002), Genetics, 162, 527.
Fred Sanger 1918 -
15
1950s The first X-ray structures of proteins -
Myoglobin & Haemoglobin
Picture of haemoglobin from Perutz, Br Med
Bull. 1976; 32: 195-208
The first ever model of a protein molecule
(1957, myoglobin model in plasticine). From the
image library of the Science Museum
16
Other structure-related milestones
17
Other structure-related milestones - PDB
  • 1971 Establishment of the Protein Data Bank
    (PDB)
  • initially with only 7 structures!
  • currently holding > 60,000 structures

Number of searchable structures
http://www.wwpdb.org/
18
The mother (and father) of bioinformatics
ALA -> A, ARG -> R, MET -> M, PHE -> F, TRP -> W
Margaret Dayhoff (1925-1983)
Comprotein: a computer program to aid primary
protein structure determination. Dayhoff, M.O. &
Ledley, R.S. (1962), Proceedings of the December
4-6, 1962, Fall Joint Computer Conference (AFIPS),
pages 262-274
IBM 7090 (Hagen, 2000)
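
The three-letter to one-letter amino acid abbreviations shown on this slide are easy to express as a lookup table. The sketch below is not Dayhoff's program; it simply applies the standard one-letter codes for the 20 common amino acids. The names `THREE_TO_ONE` and `to_one_letter` are chosen for this example.

```python
# Standard one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[name.upper()] for name in residues)


print(to_one_letter(["ALA", "ARG", "MET", "PHE", "TRP"]))
```

The compact notation matters once sequences are stored and compared by computer: a protein of several hundred residues becomes a single short string.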
19
First attempts at graphics - 1960s
Space-filling model of the structure of
myoglobin (Francoeur, 2002)
Photograph of the Kluge display showing detail
from a myoglobin structure (Francoeur, 2002)
  • Cyrus Levinthal and others at MIT were the first
    to use computers with powerful graphics to
    visualise the 3D structures of proteins.
  • Levinthal built the first 3D model of cytochrome
    C (later shown to be incorrect)

20
A helping hand for visualisation
1980 Ribbon diagrams introduced by Jane
Richardson (hand-drawn!)
Ribbon schematic (hand-drawn and coloured, in 1981,
by Jane Richardson) of the 3D structure of triose
phosphate isomerase. Source: Wikipedia
The same protein (1tim) in the same orientation
but drawn in stick representation with Chimera.
21
Bioinformatics milestones - Aligning sequences
global alignment
local alignment
T.F. SMITH AND M.S. WATERMAN
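
To give a flavour of local alignment, here is a minimal sketch of the Smith-Waterman dynamic programming recurrence that returns only the best local alignment score. The scoring values (match = 2, mismatch = -1, gap = -2) and the simple per-position gap penalty are assumptions made for this example, not values taken from the slide.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

A global alignment uses the same kind of table but drops the zero floor and penalises gaps along the borders, so that both sequences must be aligned from end to end.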
22
Bioinformatics milestones - GenBank
  • Began as a small database of sequences collected
    by Walter Goad in Los Alamos in 1979
  • 1982 GenBank goes public, funded by the NIH
  • A national nucleic acid sequence database
  • More than 2000 sequences stored by 1983
  • Now hosted at the National Center for
    Biotechnology Information (NCBI), and is part of
    an international collaboration involving EMBL and
    Japan
  • Since its inception, GenBank has approximately
    doubled in size every 18 months
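
To see what doubling every 18 months implies, the growth factor after t months is 2^(t/18). The short sketch below works this out; the function name `growth_factor` is made up for this example, and it assumes the doubling time stays constant, which is only an approximation.

```python
def growth_factor(months: float, doubling_time: float = 18.0) -> float:
    """Factor by which a database grows if it doubles every `doubling_time` months."""
    return 2 ** (months / doubling_time)


for years in (1.5, 3, 10):
    print(f"after {years} years: x{growth_factor(years * 12):.1f}")
```

Under this assumption a database grows roughly a hundredfold in ten years.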

23
Bioinformatics - The golden era (1990 - 2000)
Please note that calling the 1990s the golden
era is entirely my own subjective choice and not
a widely accepted term in the bioinformatics
community
24
The golden era started in 1990 with BLAST
Altschul et al. (1990). Basic local alignment
search tool. J. Mol. Biol. 215:403-10.
BLAST is a very fast program for searching large
databases of sequences
It is by far the most widely used tool produced
by bioinformaticians
25
and ended with the first draft of the human
genome in 2000
From BBC news, 15 March 2000
26
In that decade major events influenced the
progress of bioinformatics
  • 1990 Introduction of MAD (multiwavelength
    anomalous diffraction) for solving protein
    structures - Wayne Hendrickson
  • 1993 The Sanger Centre was established
  • 1994 The European Bioinformatics Institute was
    established
  • 1995 The first bacterial genome was sequenced
    (Haemophilus influenzae)
  • 1996 The yeast genome was sequenced
  • 1997 PSI-BLAST was published
  • 1998 First high resolution structure of an ion
    channel, Rod MacKinnon
  • 1999 First structures of the ribosome, Yonath,
    Steitz, Ramakrishnan, Noller
  • and many others!

27
Bioinformatics - The mature science (2001 -
today)
28
The era of systems biology
The focus shifts from here
To here
29
Computational biology at the centre of systems
biology
Figure from myCIB, at the University of
Nottingham: http://www.mycib.ac.uk/zope/mycib/about-mycib/document.2007-04-04.4299993065
30
and synthetic biology
Hierarchy for synthetic biology inspired by
computer engineering
Figure from Andrianantoandro et al. (2006),
Molecular Systems Biology 2, 2006.0028
31
So after this brief introduction, what is
bioinformatics and what can we do with it?
32
Many definitions of bioinformatics
  • but they all boil down to more or less
    this:
  • The application of computational,
    mathematical and statistical methods to solve
    biological problems

33
Two types of bioinformatics
  • The development of tools
  • i.e. writing programs that implement algorithms
    that provide solutions to specific questions
  • The use and application of such tools
  • e.g. web-accessible databases, software that gets
    installed locally

34
What is an algorithm?
An algorithm is a finite list of well-defined
instructions for accomplishing some task that,
given an initial state, will terminate in a
defined end-state.
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
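
As a small illustration of this definition, the function below computes the GC content of a DNA sequence: the initial state is the input string, each step is well defined, and the procedure terminates after one pass with a single number. The function name `gc_content` is chosen for this example.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # defined end-state for the empty input
    gc_count = sum(1 for base in sequence if base in "GC")
    return gc_count / len(sequence)


print(gc_content("ATGCGC"))
```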
35
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
36
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
37
Why do we need bioinformatics?
  • The answer usually comes down to the following:
  • There is too much data
  • The calculations are too complex
  • We don't have enough time
  • For example, we can no longer look through all
    available protein structures and check manually
    whether they match a new structure we just
    solved. There are more than 60,000 of them and
    life is simply too short

38
Bioinformatics may be more relevant than you
think
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
BMJ 2007; 335: 460-461
39
Some examples of things you can do with
bioinformatics
40
Major centres of biological data and tools - NCBI
  • National Center for Biotechnology Information,
    Bethesda, US

http://www.ncbi.nlm.nih.gov/guide/
41
Major centres of biological data and tools - EBI
  • European Bioinformatics Institute (EMBL
    outstation at Hinxton, UK)
  • http://www.ebi.ac.uk

42
Major centres of biological data and tools - RCSB
  • Research Collaboratory for Structural
    Bioinformatics
  • http://www.pdb.org

43
Major centres of biological data and tools - KEGG
  • Kyoto Encyclopedia of Genes and Genomes
  • http://www.genome.jp/kegg

44
Sample questions and where to find the answers
45
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
46
Slide adapted from Dr Thomas Schlitt (KCL) with
permission
47
Showcasing Bioinformatics research @ Birkbeck
48
Exploring the fly genome
Alona Sosinsky
49
Exploring regulatory elements (Alona Sosinsky)
http://te.cryst.bbk.ac.uk/
50
Exploring regulatory elements (II) (Alona
Sosinsky)
  • Gene regulatory elements consist of short
    conserved binding sites for specific
    transcription factors (TFs)
  • Programs that attempt to find such binding sites
    often return many false positives and
    biologically unimportant sites
  • However, in eukaryotic genomes regulatory binding
    sites are found in clusters (modules)
  • Using information about the combination of
    transcription factors and their relative
    positioning can increase the accuracy of the
    predictions for new regulatory sites

51
Exploring regulatory elements (III) (Alona
Sosinsky)
Graphical map for a cluster of putative binding
sites: individual binding sites for the Lozenge (Lz)
transcription factor and its co-factor Pointed (Pnt),
and a cluster of Lozenge and Pointed binding sites
Sosinsky A., Nucleic Acids Res. (2003)
52
New regulatory elements for programmed cell death
(Alona Sosinsky)
  • TargetExplorer was used to predict the binding
    sites of the transcription factor Lozenge in
    Drosophila
  • Among the new targets, genes controlling
    cell death were over-represented
  • A new functional role was predicted for Lozenge
    as regulator of programmed cell death in the
    Drosophila eye

References: Sosinsky A, Bonin CP, Mann RS, Honig
B. (2003) Nucleic Acids Research, 31,
3589. Wildonger J, Sosinsky A, Honig B, Mann RS.
(2005) Genes & Development, 19, 1034.
53
Towards a better understanding of intermolecular
interactions
Mark Williams
54
Pro_ACT - Protein Accessibilities, Cavities &
Contacts (Mark Williams)
Williams, M.A., Goodfellow, J.M., and Thornton,
J.M. (1994) Protein Science, 3, 1224.
55
SCORPIO (Mark Williams)
A database of calorimetric data on binding of
small-molecules to proteins
Olsson et al. (2008), J Mol Biol, 384, 1002.
56
Thermodynamics and surface area burial (Mark
Williams)
57
Role of hydration in molecular recognition (Mark
Williams)
Software for the prediction and analysis of
biomolecular atomic interactions and hydration
Pro_ACT: Protein Accessibilities, Cavities &
conTacts
58
Fighting flu
William Lees & Adrian Shepherd
59
The influenza virus (Adrian Shepherd)
  • Influenza is an RNA-based virus infecting birds
    and mammals.
  • Both epidemics and pandemics cause significant
    human mortality.
  • Influenza type A is the most virulent in humans.
    It is divided into subtypes based on the
    antigenic properties of the HA and NA surface
    proteins (e.g. H3N2).
  • The infection cycle begins when the HA surface
    protein binds to sialic acid on the surface of
    the host cell.
  • Immunogenic activity is predominantly associated
    with HA.

Figure from Wikipedia
HA: haemagglutinin
NA: neuraminidase
60
Immunodominant locations on haemagglutinin (Adrian
Shepherd)
  • Studies with monoclonal antibodies in the 1980s
    established 5 binding regions near the head of
    HA.
  • Antibodies binding in these regions are believed
    to interfere sterically with receptor binding.

Wilson & Cox, 1990
61
Antigenic clusters and vaccines (Adrian Shepherd)
Related influenza strains form antigenic
clusters. Breakout from a cluster requires a
vaccine update.
Smith et al., 2004
  • Question
  • Given a sequence of haemagglutinin, can we
    predict whether existing vaccines are any good?
  • In other words, given two HA sequences, can we
    predict their antigenic distance?

62
A new model of antigenic distance (Adrian
Shepherd)
  • A linear model, based on a count of changes at
    each antibody binding site
  • Also includes changes in N-glycosylation sites as
    they are known to affect antibody binding (Skehel
    et al., 1984)

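$$\log D^{c}_{ij} = x_1 N^{A}_{ij} + x_2 N^{B}_{ij} + x_3 N^{C}_{ij} + x_4 N^{D}_{ij} + x_5 N^{E}_{ij} + x_6 N^{DIFF}_{ij} + x_7 N^{NON}_{ij} + x_8 N^{GLYADD}_{ij} + x_9 N^{GLYCHANGE}_{ij} + k$$

N^A ... N^E: number of differing residues at binding site A ... E
N^DIFF: number of binding sites with differing residues
N^NON: number of differing residues outside binding sites
N^GLYADD: difference in number of N-glycosylation sites
N^GLYCHANGE: number of varying N-glycosylation sites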
  • The constants xi and k are found by minimising
    the least-squares residual over a training set.
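
The structure of this linear model can be sketched in a few lines of Python. The feature names follow the terms of the model on this slide, but the function names and all numbers below are invented purely for illustration; they are not the fitted constants or any real data. The second function is the quantity that the least-squares fit minimises over the training set.

```python
# Feature names follow the terms of the linear model on the slide.
FEATURES = ["NA", "NB", "NC", "ND", "NE",
            "NDIFF", "NNON", "NGLYADD", "NGLYCHANGE"]


def predict_log_distance(counts, coeffs, k):
    """Linear model: log D_ij = sum over features of x * N_ij, plus constant k."""
    return sum(coeffs[name] * counts[name] for name in FEATURES) + k


def sum_squared_residuals(training_set, coeffs, k):
    """Least-squares objective over (counts, observed log distance) pairs."""
    return sum((observed - predict_log_distance(counts, coeffs, k)) ** 2
               for counts, observed in training_set)


# Hypothetical values, for illustration only.
coeffs = {name: 0.5 for name in FEATURES}
k = 1.0
pair = {name: 0 for name in FEATURES}
pair["NA"] = 2  # two differing residues at binding site A, nothing else

print(predict_log_distance(pair, coeffs, k))
print(sum_squared_residuals([(pair, 2.5)], coeffs, k))
```

Fitting the model means searching for the coefficients and k that make `sum_squared_residuals` as small as possible; a numerical library would normally do that step.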

63
Fighting flu - conclusions (Adrian Shepherd)
  • The commonly accepted list of varying amino acid
    locations near antigenic binding sites should be
    updated.
  • Based on our data to 2008, generalised models can
    meet or exceed predictive performance of
    immunodominant models on novel data.
  • Performance of our models suggests that antibody
    binding may occur in regions outside the
    previously identified 5 antigenic sites.

64
Sodium channels and the molecular basis of
pain
Bonnie Wallace
65
Molecular basis of pain (Bonnie Wallace)
  • The sodium channel Nav1.7 has been recognised as
    a key contributor to human pain
  • Mutations of Nav1.7 that promote channel
    activation induce Erythromelalgia (Burning-foot
    Syndrome), an inherited pain disorder
  • Families with Nav1.7 nonsense mutations (i.e.
    no functional copies of the channel) feel no pain!

66
A structural basis for the effect of the F1449V
mutation (Bonnie Wallace)
Wild-type Nav1.7
F1449V Mutant
Side View
View from Cytoplasm
Lampert et al. (2008), J. Biol. Chem. 283, 24118
67
Molecular docking against disease
Irilenia Nobeli
68
Molecular docking (Slide adapted from Dr Arun
Prasad)
Role of molecular docking
  • Used to identify lead compounds
  • Quantify the association of the lead compounds
    with the receptor
  • Optimize lead compounds

The Docking problem
  • Sample the docking space (translation and
    rotation of ligand)
  • Sample the ligand conformational space (torsion
    angles)
  • Score the ligand-receptor interaction

69
The case of alpha1-antitrypsin (I.N. in
collaboration with Dr Gooptu)
  • The native fold of alpha1-antitrypsin is
    metastable, allowing for the characteristic serpin
    mechanism of action
  • The Glu342Lys (Z) mutant of alpha1-antitrypsin
    results in the formation of polymers that lead to
    disease of the liver and lungs

Gooptu et al. (2009), J Mol Biol, 387, 857.
70
The case of alpha1-antitrypsin (II) (I.N. in
collaboration with Dr B. Gooptu)
  • The Thr114Phe mutation preserves activity but
    reduces polymerisation of wild type antitrypsin
    in vitro

Wild-type
Thr114Phe
Pharmacophore for mimicking the Thr114Phe mutation
Gooptu et al. (2009), J Mol Biol, 387, 857.
71
Fragment screening against a mutation-defined
pharmacophore (I.N. with B. Gooptu)
5 top-ranking from Glide SP
65 top-ranking from induced fit docking
Gooptu et al. (2009), J Mol Biol, 387, 857.
72
Solving the EM puzzles
Maya Topf
73
Models and resolution (Maya Topf)
20Å

10Å
74
Fitting to EM density maps (Maya Topf)
75
Multi-Component Fitting (Maya Topf)
Crystal structure of Arp2/3 complex (PDB:
1TYQ, Nolen et al., 2004): 7 subunits, ranging
from 15 to 45 kDa in size
76
Modelling the dog ribosome (Maya Topf)
8.7 Å resolution
  • 48 homology models (SSU - 16, LSU - 32) based on
    different templates (25-50% seq id), selected by
    a combination of CC and statistical potentials.
  • Core rRNA (T. thermophilus for SSU, H.
    marismortui for LSU)
  • Expansion segments (SSU - 11, LSU - 16), mostly
    A-form helices.

Chandramouli, Topf, Ménétret, Eswar, Gutell,
Sali, Akey, Structure, 2008
77
The metabolome
Irilenia Nobeli
78
The missing "ome"! (Irilenia Nobeli)
transcriptome
proteome
genome
Small molecules were pretty much ignored by
bioinformatics!
79
Chemoinformatics (Irilenia Nobeli)
By analogy to bioinformatics, chemoinformatics
uses computational methods to study small molecules
The function of small molecules is encoded in
their properties, and the properties are encoded
in their structure
80
The metabolome and protein function (Irilenia
Nobeli)
  • Do homologous proteins bind similar substrates?
  • the answer is superfamily-dependent

farnesyl diphosphate synthase
triose phosphate isomerase
substrate conservation
substrate promiscuity
Nobeli et al. (2005), J Mol Biol 347, 415.
81
Can we predict a protein's substrate? (Irilenia
Nobeli)
922 metabolites docked against 27 SDR proteins
78% of the time we find the substrate in the top
10% of all scores
Favia et al. (2008), J Mol Biol, 375, 855.
82
Metabolites & drugs (Irilenia Nobeli)
Macchiarulo et al. (2009), J Chem Inf Model, 49,
2272
83
Simulating the immune system
Adrian Shepherd
84
The ImmunoGrid aims (Adrian Shepherd)
  • Develop a virtual human immune system
  • Simulate immune processes at a natural scale,
    connecting molecular level interactions with
    system level models
  • Ultimate goal: provide tools for applications in
    clinical immunology, the design of vaccines and
    immunotherapies
  • Data standardisation

85
The ImmunoGrid - How? (Adrian Shepherd)
Conway's Game of Life (1970)
Emergence of complex, unpredictable behaviour
from simple rules
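
The "simple rules" of Conway's Game of Life fit in a few lines. This is a minimal sketch of the standard rules on an unbounded square grid, with live cells stored as a set of (x, y) coordinates; it is not ImmunoGrid code, and the function name `step` is chosen for this example.

```python
from collections import Counter


def step(live):
    """Advance a set of live (x, y) cells by one Game of Life generation."""
    # Count, for every cell, how many of its eight neighbours are alive.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next if it has 3 live neighbours,
    # or 2 live neighbours and is already alive.
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live)}


blinker = {(0, 1), (1, 1), (2, 1)}  # a horizontal bar of three cells
print(sorted(step(blinker)))        # the bar flips to vertical
```

Even this tiny rule set produces oscillators, moving patterns and long-lived chaotic growth, which is the sense in which complex behaviour emerges from simple rules.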
86
ImmunoGrid - The rules of life (Adrian Shepherd)
87
The ImmunoGrid - An agent-based model (Adrian
Shepherd)
  • Agent-based model: a set of biological agents
    (cells and molecules) at a given location on
    a lattice, interacting probabilistically

In practice a hexagonal or triangular lattice is
often used.
88
Some reading material for your free time
  • Clare Sansom (2009). Molecules made to measure.
    Chemistry World, November 2009, 50.
  • Available from
  • www.rsc.org/images/Drug%20design%20HIV_tcm18-166406.pdf
  • Minoru Kanehisa (1998). Grand challenges in
    Bioinformatics. Bioinformatics, 14, 309.
  • Available from
  • www.ncbi.nlm.nih.gov/pubmed/9687209
  • Hiroaki Kitano (2002). Systems Biology - A brief
    overview. Science 295, 1662.
  • Email me for a reprint if you have no access to
    Science.

89
Bibliography
  • Disclaimer
  • These are resources I used to put together these
    lectures and by no means do I endorse any books
    or suggest you should go out and buy them! You
    are lucky enough to have a huge bookstore right
    next to your door. Go and find out for yourselves
    what you like and what you don't like! Many
    chapters may also be available through Google
    Books, so you can have a look at them as well.

90
Bibliography
  • Books
  • Developing Bioinformatics Computer Skills
  • Gibas & Jambeck, O'Reilly, ISBN 1-56592-664-1
  • A soft introduction including a nice intro to
    basic Unix commands
  • Reasonable overview and might just get your
    curiosity going.
  • Does not go into detail on anything and it is
    relatively old (2001)

91
Bibliography
  • Bioinformatics for Dummies. Claverie & Notredame.
    Wiley, 2006.
  • A classic from the "Dummies" series. Has
    generally received very good reviews.
  • Introduction to bioinformatics. Arthur Lesk. OUP,
    2008.
  • Now in its 3rd edition, so obviously not bad.

92
Bibliography
  • Websites
  • Bioinformatics milestones
  • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html
  • 50 years of protein structure determination
  • http://publications.nigms.nih.gov/psi/timeline.html

93
Bibliography
  • Papers
  • Hagen (2000). The origins of bioinformatics. Nat.
    Rev. Gen., 1, 231.
  • Stretton (2002). The first sequence: Fred Sanger
    and insulin. Genetics, 162, 527.
  • Francoeur (2002). Cyrus Levinthal, the Kluge, and
    the origins of interactive molecular graphics.
    Endeavour, 26, 127.

94
Acknowledgements
  • Many thanks to Dr Thomas Schlitt for his slides
  • Thanks to all computational biologists at BBK
    Crystallography who made slides and articles from
    their research available to me

95
If you want to build a ship, don't drum up people
to collect wood and don't assign them tasks and
work, but rather teach them to long for the
endless immensity of the sea.
Antoine de Saint-Exupéry