Title: Structural Genomics:
1 Structural Genomics: Case studies in assigning function from structure
James D Watson watson_at_ebi.ac.uk
2 Structural Genomics Collaborators
MCSG - Mid-west Centre for Structural Genomics
SPINE - Structural Proteomics in Europe
SGC - Structural Genomics Consortium
3 Structural Genomics Aims
Pathogens and disease
Automation / High Throughput
?
Coverage of Fold Space
Human Proteins
4 Proteins: known sequences and 3D structures
5,500 non-redundant structures
1.3m non-redundant protein sequences
260,000 homology models
[Figure: example protein sequence fragments]
5 Proteins: known sequences and 3D structures
5,500 non-redundant structures
10 unknown
3D structures of 16,000 carefully selected
proteins
Homology models
6 Protein Function
- Protein function has many definitions
- Biochemical Function - The biochemical role of the protein, e.g. serine protease
- Biological Function - The role of the protein in the cell/organism, e.g. digestion, blood clotting, fertilisation
7 Function through homology
8 Template Methodology
- Use 3D templates to describe the active site of the enzyme
- Analogous to 1-D sequence motifs such as PROSITE, but in 3-D (Wallace et al. 1997)
- A template defines a functional site
- Search a new structure for a functional site
- Search a database of structures for similar clusters
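A minimal sketch of the template idea, assuming a template is stored as a list of residue types with one representative atom each, and that a match is any set of residues with the same types and the same pairwise distances within a tolerance. The names `Residue` and `find_matches`, the tolerance and all coordinates are illustrative only. Real template methods superpose several atoms per residue and report an rmsd, and a pure distance check like this one cannot tell a site from its mirror image.

```python
from dataclasses import dataclass
from itertools import permutations
from math import dist


@dataclass(frozen=True)
class Residue:
    name: str     # three-letter residue type, e.g. "SER"
    number: int   # residue number in the chain
    xyz: tuple    # coordinates of one representative atom (Å)


def find_matches(template, structure, tol=1.0):
    """Return residue tuples in `structure` whose types and pairwise
    distances agree with `template` to within `tol` Å."""
    n = len(template)
    matches = []
    for cand in permutations(structure, n):
        # Residue types must agree position by position.
        if any(c.name != t.name for c, t in zip(cand, template)):
            continue
        # Every inter-residue distance must agree within the tolerance.
        if all(
            abs(dist(cand[i].xyz, cand[j].xyz)
                - dist(template[i].xyz, template[j].xyz)) <= tol
            for i in range(n) for j in range(i + 1, n)
        ):
            matches.append(cand)
    return matches


# Toy Ser-His-Asp template and query structure (made-up coordinates).
template = [
    Residue("SER", 195, (0.0, 0.0, 0.0)),
    Residue("HIS", 57, (3.0, 0.0, 0.0)),
    Residue("ASP", 102, (3.0, 4.0, 0.0)),
]
structure = [
    Residue("SER", 10, (10.0, 10.0, 10.0)),
    Residue("HIS", 40, (13.1, 10.0, 10.0)),
    Residue("ASP", 77, (13.0, 14.2, 10.0)),
    Residue("ASP", 90, (30.0, 30.0, 30.0)),
]
hits = find_matches(template, structure)
print([[r.number for r in hit] for hit in hits])
```

The brute-force loop over permutations is only workable for toy inputs; it is here to show what "search a new structure for a functional site" means, not how to do it efficiently.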
9 SiteSeer's reverse templates
10 Problems with template methods
- Too many hits (hundreds, thousands or even tens of thousands)
- Use of rmsd rarely discriminates true from false positives
- Local distortion in structure may give a large rmsd
- Top hit rarely the correct hit, even in obvious cases
11 An example
12 Enzyme active site templates
Hits for 1hsk
102. E.C.1.1.1.158 2.19 Å UDP-N-acetylmuramate dehydrogenase
13 Comparison of template environments
14 Comparison of template environments
15 Comparison of template environments
Identical residues in neighbourhood
Template structure 1mbb
Query structure 1hsk
16 Comparison of template environments
Similar residues in neighbourhood
Template structure 1mbb
Query structure 1hsk
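A minimal sketch of how identical and similar residues in the neighbourhood of a template hit could be turned into a score. The slides do not give the scoring formula, so the similarity groups, the weights and the function names below are assumptions for illustration, and the residue pairs are invented rather than taken from 1mbb or 1hsk.

```python
# Hypothetical similarity classes (one-letter codes); the grouping actually
# used for the environment comparison is not stated in the slides.
SIMILAR_GROUPS = [
    set("ILVM"), set("FYW"), set("KRH"), set("DE"),
    set("NQ"), set("ST"), set("AG"),
]


def residues_similar(a, b):
    """True if both residues fall in the same similarity class."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def environment_score(pairs, w_identical=2.0, w_similar=1.0):
    """Score a hit from superposed (template, query) residue pairs.

    Returns (number identical, number similar, weighted score).
    The weights are placeholders, not published values.
    """
    identical = sum(1 for t, q in pairs if t == q)
    similar = sum(1 for t, q in pairs if t != q and residues_similar(t, q))
    return identical, similar, w_identical * identical + w_similar * similar


# Invented neighbourhood pairs: (template residue, query residue).
pairs = [("S", "S"), ("R", "K"), ("E", "D"), ("G", "G"), ("L", "W")]
result = environment_score(pairs)
print(result)
```

Ranking hits by a score of this kind, rather than by rmsd alone, rewards hits whose surroundings also resemble the template's.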
17 Results for 1hsk
Hit  E.C. number     Rmsd   Score   Enzyme
1.   E.C.1.1.1.158   2.08   209.1   UDP-N-acetylmuramate dehydrogenase
2.   E.C.3.2.1.14    2.13   146.0   Chitinase A (chitodextrinase, 1,4-beta-poly-N-acetylglucosaminidase, poly-beta-glucosaminidase)
3.   E.C.3.2.1.17    1.92   142.4   Turkey lysozyme
4.   E.C.3.2.1.17    1.89   138.7   Hen lysozyme
5.   E.C.3.5.1.26    1.47   132.3   Aspartylglucosylaminidase
6.   E.C.3.2.1.3     1.54   131.1   Glucan 1,4-alpha-glucosidase
18 ProFunc: function from 3D structure
19 Large scale analysis
- Created an edited version of the target database from the PDB: only those with status "In PDB"
- Extract all PDB codes for each Structural Genomics group
- Extract prior knowledge (Header, Title, Jrnl, etc.)
- Find any associated GOA annotation
- Classify each structure by whether function is known, unknown or limited info
- Run ProFunc in a batch process on all codes (560)
- Extract summary results from each analysis
- Compare to prior knowledge and estimate success
20 Number of deposits to the TargetDB by Structural Genomics group (total of 577 unique entries), March 2004
21 PDB Blast
- Run query sequences against the PDB using BLAST
- Filtered out those matches released AFTER the query sequence
- Any hits are ignored from subsequent analyses
- Still get significant matches
- why?
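A minimal sketch of the release-date filter, assuming each BLAST hit has already been parsed into a record carrying the release date of the matched PDB entry. The record layout, the function name and the example entries are hypothetical. Keeping hits released on the same day as the query is also an assumption.

```python
from datetime import date


def hits_available_at_release(hits, query_release):
    """Keep only hits whose PDB entry was released on or before the query,
    so the analysis sees what was known when the query was deposited."""
    return [h for h in hits if h["released"] <= query_release]


# Placeholder hits, not real PDB entries.
hits = [
    {"pdb": "hitA", "evalue": 1e-30, "released": date(2001, 5, 2)},
    {"pdb": "hitB", "evalue": 1e-12, "released": date(2003, 11, 18)},
]
kept = hits_available_at_release(hits, date(2002, 7, 1))
print([h["pdb"] for h in kept])
```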
22 InterPro Scan
- InterPro scan on proteins of known function
- Cannot backdate the InterPro database
- Essentially picking up itself
23 Function of query structure known
24 Limited Functional Info
25 Unknown Function
26 The Good, the Not So Good and the Ugly
Three examples show the varying levels of
information that can be retrieved from
structures
27 The Good: BioH structure (MCSG)
One very strong hit
Function Discovered
28 The Not So Good: APC1040 (MCSG)
- Assigned as a probable glutaminase
- Most methods suggest β-lactamase activity
- No match to PROSITE patterns
Function being assayed
APC1040: 70 F-T-M-Q-S-I-S-K-V-I-S-F-I-A-A-C 85
Class A: [FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]
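A minimal sketch of why the APC1040 fragment fails the class A pattern. It assumes the pattern on the slide is written in PROSITE syntax with square brackets around the alternative residues (the extracted text shows it without them), and that the fragment at residues 70-85 lines up with the pattern position by position, as the slide layout suggests. The function name is illustrative and the parser covers only the syntax needed here.

```python
import re


def prosite_positions(pattern):
    """Expand a fixed-length PROSITE pattern into one regex class per position.

    Handles x, [..], {..} and fixed repeats such as x(4);
    ranges such as x(2,4) and anchors are not covered.
    """
    positions = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, repeat = m.group(1), int(m.group(2) or 1)
        if core == "x":                 # any residue
            core = "."
        elif core.startswith("{"):      # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        positions.extend([core] * repeat)
    return positions


CLASS_A = "[FY]-x-[LIVMFY]-x-S-[TV]-x-K-x(4)-[AGLM]-x(2)-[LC]"
FRAGMENT = "FTMQSISKVISFIAAC"   # APC1040 residues 70-85, from the slide
START = 70

positions = prosite_positions(CLASS_A)
matches = re.fullmatch("".join(positions), FRAGMENT) is not None
mismatches = [
    (START + i, aa, cls)
    for i, (cls, aa) in enumerate(zip(positions, FRAGMENT))
    if not re.fullmatch(cls, aa)
]
print(matches)
print(mismatches)
```

Under these assumptions the fragment differs from the pattern at two positions, both isoleucines, which is consistent with a near miss rather than an unrelated sequence.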
29 The Ugly: MT0777 (MCSG)
Hypothetical protein from Methanobacterium
thermoautotrophicum
- No sequence motifs
- Residue conservation is poor.
- Fold associated with many functions (Rossmann
fold)
Function Unknown
30 Future Work
- Improvements to scoring system and additional templates
- Further utilisation of SOAP services as they become available (e.g. KEGG API service)
- Possible adaptation to use as part of a larger workflow or in LIMS systems (Taverna and MyGrid)
- More truly predictive analyses being developed (e.g. electrostatics, ligand prediction, catalytic residue prediction)
31 Detection of DNA-binding proteins (with HTH motif) using structural motifs and electrostatics (Hugh Shanahan)
- Combine electrostatics with HTH structural templates.
- Can detect HTH DNA-binding proteins only.
- 1/3 of DNA-binding protein families have an HTH motif.
- Use a linear predictor as discriminant.
- True positive rate (80%) comparable with more complicated methods.
- Very low (< 0.01) false positive rate.
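A minimal sketch of a linear predictor used as a discriminant, assuming two input features per candidate: the rmsd of the best HTH template match and an electrostatic measure near the motif. The feature choice, the weights and the bias are placeholders for illustration, not fitted values from the work described.

```python
def linear_score(features, weights, bias):
    """Weighted sum of the features plus a bias term."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_dna_binding(features, weights, bias):
    """Classify as DNA-binding when the linear score is positive."""
    return linear_score(features, weights, bias) > 0.0


# Placeholder parameters: a lower rmsd and a more positive potential
# both push the score towards "DNA-binding".
WEIGHTS = (-1.5, 0.8)   # (template rmsd in Å, electrostatic measure)
BIAS = 1.0

good_candidate = (0.6, 2.0)   # close template match, positive patch
poor_candidate = (2.5, 0.1)   # distorted match, little positive charge
print(is_dna_binding(good_candidate, WEIGHTS, BIAS))
print(is_dna_binding(poor_candidate, WEIGHTS, BIAS))
```

In practice the weights would be fitted on proteins of known binding behaviour, and the decision threshold tuned to trade true positives against false positives.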
32 Ligand Prediction
Can active site geometry, shape,
physical-chemical properties etc. be used to
predict the preferred ligand class?
Active Site Ligand description/fingerprinting
methods
33 Spherical Harmonics (Richard Morris)
Spherical t-designs
The computation of Legendre polynomials of high
order requires a robust integration scheme
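A minimal sketch of evaluating Legendre polynomials of high order with Bonnet's three-term recurrence, which avoids expanding the polynomial coefficients explicitly. The slide does not say how the polynomials were computed, so this is a standard technique shown for orientation only; the integration over the sphere (the spherical t-designs) is not covered here.

```python
def legendre(n, x):
    """Evaluate the Legendre polynomial P_n(x) by Bonnet's recurrence:
    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x),
    starting from P_0(x) = 1 and P_1(x) = x."""
    if n == 0:
        return 1.0
    prev, curr = 1.0, x
    for k in range(1, n):
        prev, curr = curr, ((2 * k + 1) * x * curr - k * prev) / (k + 1)
    return curr


p2 = legendre(2, 0.5)     # (3 * 0.25 - 1) / 2
p50 = legendre(50, 1.0)   # P_n(1) = 1 for every n
print(p2, p50)
```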
34 Hybrid Ellipsoids (Rafael Najmanovich)
- Every shape can be modelled by a set of hybrid ellipsoids
- The parameters describe location and a, b, c of the ellipsoid and a smear factor
- Similar parameters mean similar active sites and ligands
35 Predicting Catalytic Residues (Alex Gutteridge)
- Aims
- To predict the location of the active site in an enzyme structure.
- To predict the catalytic residues of an enzyme.
- How?
- Train a neural network to identify catalytic residues.
- Cluster high scoring residues to find the active site.
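A minimal sketch of the clustering step, assuming each residue already carries a score from the network and one representative coordinate. The network itself is not shown. The score cutoff, the linking distance, the function name and the example residues are all placeholders; single-linkage clustering is one simple choice, not necessarily the method used.

```python
from math import dist


def cluster_high_scoring(residues, score_cutoff=0.5, link_cutoff=8.0):
    """Cluster high scoring residues in space.

    residues: list of (label, score, (x, y, z)).
    Residues scoring at or above score_cutoff are kept; two residues join
    the same cluster if they lie within link_cutoff Å (single linkage).
    Returns clusters sorted largest first.
    """
    selected = [r for r in residues if r[1] >= score_cutoff]
    clusters = []
    for res in selected:
        linked, rest = [], []
        for cluster in clusters:
            near = any(dist(res[2], m[2]) <= link_cutoff for m in cluster)
            (linked if near else rest).append(cluster)
        # Merge the new residue with every cluster it touches.
        clusters = rest + [[res] + [m for c in linked for m in c]]
    return sorted(clusters, key=len, reverse=True)


# Invented scores and coordinates.
residues = [
    ("H57", 0.90, (0.0, 0.0, 0.0)),
    ("D102", 0.80, (5.0, 0.0, 0.0)),
    ("S195", 0.95, (0.0, 6.0, 0.0)),
    ("K20", 0.70, (40.0, 40.0, 40.0)),
    ("G5", 0.10, (1.0, 1.0, 1.0)),
]
clusters = cluster_high_scoring(residues)
print(sorted(r[0] for r in clusters[0]))
```

The largest cluster is taken as the predicted active site; an isolated high scoring residue such as the fourth one above ends up in a cluster of its own.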
36 Workflows and Taverna (Tom Oinn)
- Most procedures used now follow a workflow type scheme
- Taverna allows users to pick elements from services to create their own workflows for automation of complex sets of procedures.
- Removes the need to write complex scripts
Beta 9 release available at http://taverna.sourceforge.net/
37 Acknowledgements
- Janet Thornton
- Christine Orengo
- Roman Laskowski - ProFunc
- Richard Morris - InterPro search, Spherical Harmonics
- Gail Bartlett, Craig Porter - Enzyme Templates
- Alex Gutteridge - Catalytic Residue Prediction
- Sue Jones - HTH motifs
- Hugh Shanahan - DNA binding, Electrostatics
- Jonathan Barker - JESS
- Hannes Ponstingl - PITA
- Rafael Najmanovich - Hybrid Ellipsoids
- Martin Senger, Siamak Sobhany - SOAP
- Tom Oinn - Taverna
- Annabel Todd and Russell Marsden - UCL
- MCSG consortium for lots of structures, plus many more at EBI and UCL
- Work was supported by NIH grant (GM 62414) and by the US DoE under contract (W-31-109-Eng-38)