Title: The Future As Defined by Structural Genomics
1The Future As Defined by Structural Genomics
- Philip E. Bourne
- Dept. of Pharmacology
- University of California San Diego
- pbourne_at_ucsd.edu
2Agenda
- The Data
- What is structural genomics exactly?
- What has it achieved thus far?
- What are its goals going forward?
- One possible strategy for selecting targets
- Unsolved Problems
- New Challenges
3Structural Genomics: A Broad Working Definition
- Structural genomics is the process of
high-throughput determination of the
3-dimensional structures of biological
macromolecules
4Ah Yes, But What is the Goal? Phase I
- The goal of the human genome project was clear
cut. The goal of structural genomics is not so
clear cut.
- Provision of enough structural templates to
facilitate homology modeling of most proteins
- Structures of all proteins in a complete proteome
- Structural elucidation of a complete biological
pathway
- Structural elucidation of a complete disease
5Example Goals (PSI Phase I)
The hyperthermophilic bacterium Thermotoga
maritima has been the target of choice for
pipeline development and genome-wide fold
coverage.
207
Structural Genomics of Pathogenic Protozoa (SGPP)
The SGPP consortium will determine and analyze
the three-dimensional structures of a large
number of proteins from major global pathogenic
protozoa: Leishmania major, Trypanosoma brucei,
Trypanosoma cruzi and Plasmodium falciparum.
35
It is aimed at determining structures of proteins
and protein complexes directly relevant to human
health and diseases.
79
6Ah Yes, But What is the Goal? Phase II
7Growth in the Number of Folds per Year According
To SCOP
[Chart: new folds and total folds per year]
http://www.pdb.org/pdb/statistics/contentGrowthChart.do?content=fold-scop
from Nov. 2008
8The Process - X-ray Crystallography
Basic Steps
- Target Selection
- Crystallomics: Isolation, Expression,
Purification, Crystallization
- Data Collection
- Structure Solution
- Structure Refinement
- Functional Annotation
- Publish
9What Has The Process Achieved Thus Far?
10Much of the Data Discussed Will Come from
http://kb.psi-structuralgenomics.org/
Nucleic Acids Research 2006 34 D302-5
11Current Status of All Centers 2006/2008
90421 / 200291 Targets
56626 / 133958
/ 89229
2479 / 6020 (7.5% / 11.1% of PDB)
Chen et al. 2004 Bioinformatics 20(16) 2860-2
http://targetdb.rcsb.org Oct 20, 2005
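The 2006/2008 pairs above can be turned into simple ratios. The sketch below is illustrative only. It assumes the first row counts registered targets and the last row counts structures deposited in the PDB, as the row annotations suggest; the names `status`, `growth` and `per_100_targets` are hypothetical.

```python
# Figures copied from the slide; each pair is (2006, 2008).
# Assumption: "targets" are registered targets and "structures"
# are structures deposited in the PDB.
status = {
    "targets": (90421, 200291),
    "structures": (2479, 6020),
}


def growth(pair):
    """Fold increase from the 2006 value to the 2008 value."""
    early, late = pair
    return late / early


def per_100_targets(structures, targets):
    """Structures per 100 targets for one snapshot."""
    return 100.0 * structures / targets


for name, pair in status.items():
    print(f"{name}: {growth(pair):.2f}x growth")

for i, year in enumerate((2006, 2008)):
    rate = per_100_targets(status["structures"][i], status["targets"][i])
    print(f"{year}: {rate:.1f} structures per 100 targets")
```

The two unlabeled intermediate rows are left out because the slide does not say which pipeline stages they represent.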
12Total Structures Released per Year
2006: 586
2007: 792
2008: 3483
Chen et al. 2004 Bioinformatics 20(16) 2860-2
http://targetdb.rcsb.org Oct 20, 2005
13PepcDB http://pepcdb.pdb.org/
Capturing of protocols associated with the
experiment
16What Has The Process Achieved Thus Far?
- While structural genomics accounts for only 7.5%
of the current PDB (30-year history), it is now
contributing 11% of all structures in a given year
- Higher throughput is being achieved; traditional
laboratories benefit too
- Useful data are being collected more
systematically, but the situation could still be
improved
17Todd, Marsden, Thornton and Orengo 2005 JMB
348(5) 1235-60 provide the following data, but
based on 316 non-redundant structures
- Quality and size of structures are comparable
- 29% of domains revealed an evolutionary
relationship not apparent from sequence
- 19% and 11% contributed new superfamilies and
folds, respectively ???
- 9287 reliable homology models built across 206
completely sequenced genomes
18What Should be the Target Selection Strategy
Going Forward?
19One Approach - Pfam 5000 (Chandonia and Brenner
2005 Proteins 58(1) 166-179)
- Would provide fold assignment for 68% of
prokaryotic proteins and 61% of eukaryotic
proteins
- This is significantly greater than would be
achieved by completing a single genome
20Our Approach is to Consider Coverage Relative to
the Human Genome
- What protein structures would tell us most about
the human condition if determined?
21Basic Logic of Our Approach to Target Selection
- Given the functions of proteins currently in the
PDB
- And what we can ascertain about the function of
structural genomics targets
- And what we know about the functional coverage of
the human genome
- What structures should be determined to increase
our coverage of functional space?
- Which of those structures are most tractable?
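The logic above amounts to set operations over functional categories. A minimal sketch follows, assuming each data source can be reduced to a set of functional category labels. The labels are made-up placeholders, not data from the study, and the function `coverage_gaps` is hypothetical.

```python
def coverage_gaps(human, pdb, targets):
    """Split human functional categories by structural coverage.

    All three arguments are sets of functional category labels.
    """
    covered = human & pdb               # a structure already exists
    pending = (human & targets) - pdb   # a target exists, no structure yet
    uncovered = human - pdb - targets   # neither structure nor target
    return covered, pending, uncovered


# Placeholder labels for illustration only.
human = {"kinase", "transporter", "receptor", "protease", "ligase"}
pdb = {"kinase", "protease"}
targets = {"ligase", "kinase"}

covered, pending, uncovered = coverage_gaps(human, pdb, targets)
print("covered:", sorted(covered))
print("pending:", sorted(pending))
print("uncovered:", sorted(uncovered))
print(f"coverage now: {100 * len(covered) / len(human):.0f}%")
```

The `uncovered` set is the natural pool for new targets; ranking it by tractability would be a further step that this sketch does not attempt.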
22Coverage of the Human Genome By Structure
PDB
Structural Genomics Targets
GO
Ensembl Human Genome Annotation
Superfamily
EC
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
23Drill down to the Appropriate level
Define the level of redundancy
Coverage by domain(s) or structure
24PDB vs Human Genome Top Level EC Shows Even
Distribution
PDB
607/1141 Structures
9698 Sequences
Ensembl Human Genome Annotation
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
25PDB vs Human Genome EC Hydrolases Begins to
Illustrate the Bias in the PDB
PDB
- EC 2.5 (transferring alkyl or aryl groups):
over-represented in PDB
- EC 2.4 (glycosyltransferases):
under-represented in PDB
Ensembl Human Genome Annotation
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
26Functional Coverage (GO Molecular Function) of
the Human Genome By Structure, Targets and Models
SG Targets
Human Genome
PDB
Homology Models
- As expected, few structures of unknown function
in the PDB at this stage. Large number of targets
of unknown function
- Enzyme regulation over-represented in PDB:
GTPase, kinase regulator, caspase regulator
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
27Target Selection Relative to Disease
PDB
Structural Genomics Targets
OMIM
Swiss-Prot
Superfamily
Ensembl Human Genome Annotation
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
28Human Disease Coverage
SG Targets
Human Genome
PDB
Homology Models
- PDB covers 69% of OMIM disease categories
- Diseases of the CNS are over-represented by
targets
- Diseases of the ear, nose and throat are
under-represented in PDB but covered by targets
and models
- Cancers: fewer targets at top level, but
female-related cancers over-represented, male
under-represented by structures
29Structural Coverage of the Human Genome
- Single domains cover 37% of the functional
classes identified in the genome
- Whole structures cover 25%
- 37% goes to 56% with homology models
- 25% goes to 31% with homology models
- If all current structural genomics targets were
solved (3x current PDB)
- 37% goes to 69%
- 25% goes to 44%
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
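The coverage figures above are easier to compare as gains over the structures-only baseline. A small sketch follows, assuming the quoted figures are percentages of human functional classes; the dictionary layout and the function `gains` are hypothetical conveniences, not part of the cited analysis.

```python
# Coverage of human functional classes, in percent, as quoted on the slide.
coverage = {
    "single domains": {
        "structures": 37, "with models": 56, "all targets solved": 69,
    },
    "whole structures": {
        "structures": 25, "with models": 31, "all targets solved": 44,
    },
}


def gains(levels):
    """Percentage-point gain of each scenario over structures alone."""
    base = levels["structures"]
    return {k: v - base for k, v in levels.items() if k != "structures"}


for unit, levels in coverage.items():
    for scenario, delta in gains(levels).items():
        print(f"{unit}, {scenario}: +{delta} points")
```

On these numbers, solving all current targets would add more coverage than homology modeling does, for both single domains and whole structures.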
30Other Points to Note
- Coverage by homology models is not even; more
divergent families are less well represented
- Transporters and receptors (non-membrane regions)
are the most pressing
- Possible to create a most wanted list of
structures
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
31The Most Wanted List
- So Far We Have Considered the Functional Coverage
of Structures, Models and Targets Relative to the
Human Genome (Based on the Current Level of
Functional Annotation)
- What if we turn that round and, rather than ask
what we know, ask what we do not know?
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
32Bottom Line
- There are approximately 1800 domains which have
been functionally recognized in the human genome
for which no structure exists (hence no homology
models) and for which no target exists
Xie and Bourne 2005 PLoS Comp. Biol. 1(3) e31
http://sg.rcsb.org
33How Do We Get To This List?
- Start with functional categories without
structures
- Select those without Superfamily assignments,
i.e., those that can't be modeled
- Prefer those with a disease association
- Remove those that appear less tractable based on
prediction of transmembrane segments,
coiled coils and low complexity
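The selection steps above can be read as a filter-then-rank pipeline. The following is a minimal sketch under stated assumptions: the `Category` record, its boolean fields and the sample entries are all hypothetical, and "prefer" is interpreted as a sort key rather than a hard filter.

```python
from dataclasses import dataclass


@dataclass
class Category:
    name: str
    has_structure: bool
    has_superfamily: bool   # Superfamily assignment, i.e. can be modeled
    disease_linked: bool
    transmembrane: bool     # predicted transmembrane segments
    coiled_coil: bool       # predicted coiled coil
    low_complexity: bool    # predicted low-complexity region


def most_wanted(categories):
    """Filter to tractable, unmodelable gaps, then rank disease-linked first."""
    # Start with functional categories without structures.
    pool = [c for c in categories if not c.has_structure]
    # Keep those without Superfamily assignments (cannot be modeled).
    pool = [c for c in pool if not c.has_superfamily]
    # Remove those predicted to be less tractable.
    pool = [c for c in pool
            if not (c.transmembrane or c.coiled_coil or c.low_complexity)]
    # Prefer disease association: stable sort, disease-linked entries first.
    return sorted(pool, key=lambda c: not c.disease_linked)


# Made-up records for illustration only.
cats = [
    Category("cat_a", True, True, True, False, False, False),
    Category("cat_b", False, True, False, False, False, False),
    Category("cat_c", False, False, False, False, False, False),
    Category("cat_d", False, False, True, False, False, False),
    Category("cat_e", False, False, True, True, False, False),
]
wanted = [c.name for c in most_wanted(cats)]
print(wanted)
```

Treating the disease association as a ranking rather than a filter keeps categories with no known disease link on the list, just lower down.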
34Examples from the Most Wanted List
- The most understudied structures are various
kinds of receptors and transporters
- For catalytic activity the largest
under-representation is in protein synthesis and
gene regulation
- Congenital adrenal hyperplasia appears to have
tractable domains without structure representation
35Unsolved Problems
36Some Problems with Estimators of What has Been
Achieved 2006/2008
- Basic knowledge of macromolecular structure
(40 - missing temporal view, alternative views
e.g. ligand view, rules for molecular recognition)
- Integrated view of structure as part of a
biological continuum of data and associated
knowledge (20/30)
- Structure representation, comparison and
classification (60/80)
- Structure prediction from sequence (30/27)
37Some Problems with Estimators of What has Been
Achieved: The Challenges 2006/2008
- Inferring function from structure (20/30)
- Inferring protein interactions (20/21)
- Macromolecular assemblies (30/40)
- Docking (20/25)
- Rational drug discovery (5/6)
- Structural evolution (1/5)