Title: Structure Databases
1 Structure Databases
- Sameer Velankar
- http://msd.ebi.ac.uk
2 http://www.ebi.ac.uk
3 Information Management is Central to Biological Research
- Analysis: Reading. Talking. Thinking.
- Hypothesis!
- Private Data: Past experiments. Lab notebooks. Group discussions.
- Public Data: Journals. Conferences.
- Experiment: Design. Execution.
- Publish!
4 Bioinformatics is Central to Biological Research
- Analysis: Reading. Talking. Thinking. Computational analysis. Software development.
- Public Data: Journals, conferences, DNA sequences, protein sequences, genetic maps, transcripts, 3D structures, proteomics results, SNP data, etc.
- Private Data: Past experiments, lab notebooks, group discussions, DNA sequences, protein sequences, genetic maps, transcripts, 3D structures, proteomics results, SNP data, etc.
- Hypothesis! Computer aided.
- Experiment: Design. Execution. Computational experiments. Simulation.
- Publish! Database submission. Database management.
6 EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
7 EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
SWISS-PROT, TrEMBL: Protein Sequences
8 EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
SWISS-PROT, TrEMBL: Protein Sequences
Array-Express: Microarray Expression Data
9 EMBL-Bank: DNA sequences
EnsEMBL: Human Genome Gene Annotation
SWISS-PROT, TrEMBL: Protein Sequences
Array-Express: Microarray Expression Data
EMSD: Macromolecular Structure Data
10 EMBL-Bank: DNA sequences
EnsEMBL
Array-Express: Microarray Expression Data
SWISS-PROT, TrEMBL: Protein Sequences
IntAct: Protein-Protein Interaction Data
EMSD: Macromolecular Structure Data
11 Integr8
12 Integr8 will encompass information on:
- genetic regulatory elements
- gene expression
- protein expression
- gene product information
- protein families and domains
- structural and ligand information
- molecular function
- biological role
- location of gene products
13 EMBL Outstation, European Bioinformatics Institute: Macromolecular Structure Database Project (EMSD)
http://msd.ebi.ac.uk
To develop an autonomous structural database capability in Europe.
14 Global Context of Data on Macromolecular Structure
For more than 20 years, the Protein Data Bank (PDB) at Brookhaven National Laboratory, USA, collected public data on protein structure.
In 1998 the US funding agencies re-competed the grant, and Brookhaven National Laboratory lost the contract.
Responsibility for the Protein Data Bank in the USA has now moved to the RCSB (Research Collaboratory for Structural Bioinformatics).
The RCSB comprises Rutgers University (NJ), the San Diego Supercomputer Center (SDSC) and the National Institute of Standards and Technology (NIST).
15 Two Different Database Systems, One Common Set of Data
- USA: RCSB (deposition, services)
- Europe: EBI-MSD (deposition, services)
- Agreed common data items and exchange mechanism
- FTP archive
16 http://msd.ebi.ac.uk
17 Database Tasks
- GET DATA
- ORGANISE AND STORE DATA
- GIVE IT BACK
18 [Diagram: data flow from capture to delivery. Labels: New data from NMR spectroscopy, X-ray crystallography and electron microscopy; Legacy data (old PDB); Data capture; Harvesting; Manual; Clean-up; Data Storage; Relational Database; Oracle; Data Delivery; XML; HTML; Flat files; CORBA]
19 Where does the 3D-structural data come from?
- Three complex techniques
- X-ray crystallography
- NMR spectroscopy
- cryo-electron microscopy
20 ESRF, Grenoble: synchrotron radiation source; includes beamlines for macromolecular X-ray crystallography
Electron microscope for cryo-EM
High-field NMR spectrometer
21 J. Frank, Wadsworth Centre, New York
Rat-liver mitochondrion
Mitotic spindle
Model of translational apparatus
22 3D-EM
- Structural information about macromolecular complexes
- A vital link between cell biology and X-ray and NMR studies
- Insights into macromolecular mechanisms and interrelationships
23 1be3 - transmembrane helix bundle
24 1be3: 11 unique proteins
- UBIQUINOL CYTOCHROME C OXIDOREDUCTASE, COMPLEX
- DBREF 1BE3 A 1 446 GB 1730447 1730447 35 480
- DBREF 1BE3 B 21 439 SWS P23004 UCR2_BOVIN 35 453
- DBREF 1BE3 C 1 379 SWS P00157 CYB_BOVIN 1 379
- DBREF 1BE3 D 1 241 GB 223266 223266 1 241
- DBREF 1BE3 E 1 196 SWS P13272 UCRI_BOVIN 79 274
- DBREF 1BE3 F 5 110 SWS P00129 UCR6_BOVIN 5 110
- DBREF 1BE3 G 1 81 SWS P13271 UCRQ_BOVIN 1 81
- DBREF 1BE3 H 15 78 SWS P00126 UCRH_BOVIN 15 78
- DBREF 1BE3 I 46 78 SWS P07588 UCRI_BOVIN 46 78
- DBREF 1BE3 J 1 62 SWS P00130 UCRX_BOVIN 1 62
- DBREF 1BE3 K 15 36 SWS P07552 UCRY_BOVIN 15 36
- 4 Haems
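The DBREF records on this slide link each chain of the entry to a sequence database entry. As a minimal sketch, assuming the whitespace-separated layout shown on the slide (real PDB files use fixed columns, where fields can run together), the records could be read into a small structure. The names `DbRef` and `parse_dbref` are hypothetical and not part of any EMSD software.

```python
from dataclasses import dataclass


@dataclass
class DbRef:
    pdb_id: str
    chain: str
    seq_begin: int   # first residue of the chain covered by the reference
    seq_end: int     # last residue of the chain covered by the reference
    database: str    # e.g. "SWS" (SWISS-PROT) or "GB" (GenBank)
    accession: str
    db_code: str
    db_begin: int    # matching range in the sequence database entry
    db_end: int


def parse_dbref(line: str) -> DbRef:
    """Parse one whitespace-separated DBREF record as printed on the slide."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "DBREF":
        raise ValueError(f"not a DBREF record: {line!r}")
    _, pdb_id, chain, s1, s2, db, acc, code, d1, d2 = fields
    return DbRef(pdb_id, chain, int(s1), int(s2), db, acc, code, int(d1), int(d2))


records = [
    "DBREF 1BE3 B 21 439 SWS P23004 UCR2_BOVIN 35 453",
    "DBREF 1BE3 C 1 379 SWS P00157 CYB_BOVIN 1 379",
]
by_chain = {ref.chain: ref for ref in map(parse_dbref, records)}
print(by_chain["C"].accession, by_chain["B"].db_begin)
```

Keying the records by chain makes the chain-to-sequence mapping explicit, which is the cross-reference that later questions (for example, from an SNP to a residue in a structure) depend on.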
25 3D-structural data are complex
- Need database models that can cope with this complexity
- Need ease of maintenance and integration
- EMSD system stretches current relational database technology
26 We would like to be able to...
- ...ask questions that span from genome to ligand.
- This SNP is in a coding region. Does the native protein have a known three-dimensional structure?
- Does the amino acid occur in the active site of the protein?
- Which ligands are known to bind in the active site of this protein or a homologue?
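A question that spans from genome to ligand amounts to a chain of joins across several data sets. The following is only a sketch using SQLite from Python. Every table name, column name and row is a hypothetical placeholder, not the EMSD schema or real data, and it assumes that sequence and structure residue numbers coincide, which in practice requires a mapping such as the DBREF records.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical tables, one per data source.
CREATE TABLE snp (snp_id TEXT, protein_acc TEXT, residue_number INTEGER);
CREATE TABLE structure (pdb_id TEXT, protein_acc TEXT);
CREATE TABLE site_residue (pdb_id TEXT, residue_number INTEGER, site_name TEXT);
CREATE TABLE ligand_site (pdb_id TEXT, site_name TEXT, ligand_code TEXT);

-- Placeholder rows, not real data.
INSERT INTO snp VALUES ('snp1', 'ACC1', 57), ('snp2', 'ACC2', 10);
INSERT INTO structure VALUES ('pdb1', 'ACC1');
INSERT INTO site_residue VALUES ('pdb1', 57, 'active_site');
INSERT INTO ligand_site VALUES ('pdb1', 'active_site', 'LIG');
""")

# SNP -> protein with a known structure -> residue in a site -> ligands bound there.
query = """
SELECT s.snp_id, st.pdb_id, sr.site_name, ls.ligand_code
FROM snp s
JOIN structure st ON st.protein_acc = s.protein_acc
JOIN site_residue sr ON sr.pdb_id = st.pdb_id
                    AND sr.residue_number = s.residue_number
JOIN ligand_site ls ON ls.pdb_id = sr.pdb_id
                   AND ls.site_name = sr.site_name
"""
rows = con.execute(query).fetchall()
print(rows)
```

Only `snp1` survives all three joins here; `snp2` drops out at the first join because its protein has no structure in the toy data.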
27 Database Tasks
- GET DATA
- ORGANISE AND STORE DATA
- GIVE IT BACK
28 Harvesting
Data Capture
29 Harvesting Concept
- Harvesting for macromolecular structure data is SIMPLY communicating relevant data from software to deposition site
- The idea was first used in XPLOR with the module pdbsubmission (1996) (Jian-Sheng Jiang and A. T. Brunger)
30 Harvesting Rationale
- RICHER DATA BASE
- MORE ACCURATE DATA
- SYSTEMATIC DATA
- CONFIDENCE LEVELS FOR DERIVED DATA
- SIMPLER DEPOSITION
31 Data Harvesting
Deposition files
Database
Submission procedure
Manually-entered information
32 Web-Based Structure Deposition Sites
- EBI-MSD (Europe): http://autodep.ebi.ac.uk/
- RCSB-Rutgers (USA): http://pdb.rutgers.edu/adit/
- Osaka University (Japan): http://pdbdep.protein.osaka-u.ac.jp/adit/
33 DEPOSITION via AUTODEP
34 [Diagram: data flow from capture to delivery. Labels: New data from NMR spectroscopy, X-ray crystallography and electron microscopy; Legacy data (old PDB); Data capture; Harvesting; Manual; Clean-up; Data Storage; Relational Database; Oracle; Data Delivery; Data Warehouse; XML; HTML; Flat files; CORBA]
35 Database organisation
Hierarchy (diagram labels, top to bottom): Biological Unit(s); Independent units; ASU (observed experimental data); Chains; Residues; Atoms
- Each level of the hierarchy can have associated properties, e.g.
- Bound molecules
- Domains
- Site residues
- Derived properties (e.g. ASA, accessible surface area)
- Reference information (e.g. standard geometry)
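The hierarchy on this slide, with properties attachable at every level, can be sketched as nested records. This is an illustration only: the class names, the `properties` dictionaries and the toy values are assumptions, not the EMSD data model, and the biological-unit level is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str


@dataclass
class Residue:
    name: str
    number: int
    atoms: list = field(default_factory=list)        # list of Atom
    properties: dict = field(default_factory=dict)   # e.g. site membership, derived values


@dataclass
class Chain:
    chain_id: str
    residues: list = field(default_factory=list)     # list of Residue
    properties: dict = field(default_factory=dict)   # e.g. domains


@dataclass
class AsymmetricUnit:
    chains: list = field(default_factory=list)       # list of Chain
    properties: dict = field(default_factory=dict)   # e.g. bound molecules


def count_atoms(asu: AsymmetricUnit) -> int:
    """Walk the hierarchy from chains down to atoms."""
    return sum(len(res.atoms) for ch in asu.chains for res in ch.residues)


# Toy content, not taken from a real entry.
ala = Residue("ALA", 1, atoms=[Atom("N"), Atom("CA")], properties={"site": "active_site"})
asu = AsymmetricUnit(chains=[Chain("A", residues=[ala])])
print(count_atoms(asu))
```

In a relational database each class would become a table and each containment a foreign key, which is one reason the table count grows quickly for structural data.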
36 Classes of data
- Chemical descriptions (a.k.a. Hetgroup or chem_comp information)
- Structure, coordinates and associated information
- Experimental information
- Bibliographic data
- Sequence database cross-references
- Taxonomy
37 Database Design Goals
- Robust - thorough analysis and consistency checks; expect substantial growth
- Clean - top-down design
- Maintainable - use industry-standard tools
- Open interface to independently maintained software
- Extensible to meet evolving needs
38 What does an RDBMS provide?
- efficient database management
- non-procedural
- maintain data in an organised form
- reading and writing data to the computer
- fast data access mechanisms
- reduce or eliminate need for redundant data
- ensure integrity and consistency of data
39 Example Tables and Relationships
DEPOSITION
CREATION_DATE
REF_CONTACT_NAME
LAST_UPDATE
EMAIL
TITLE
NAME_FAMILY
o ACCESSION_CODE
o BUILDING
o DATE_ALL_ARRIVED_DATE
o DEPARTMENT_1
o DEP_PASSWORD
o DEPARTMENT_2
o DETAILS_NDB
o FAX
o HARVEST_PROJECT_NAME
o NAME_GIVEN
o HOLD_DATE
o NAME_INITIALS
o HTTP_REFERER
o NAME_SUFFIX
o HTTP_USER_AGENT
o PHONE
o IMMUNE_RELATED
o TITLE
o RELEASE_DATE
o URL
o VIRUS_FLAG
40 Our Schema
- The whole schema statistics
- 410 tables
- 661 relationships
- 1851 columns
- Manage with Designer
41 Database Implementation
- Industry-standard relational database management system: Oracle.
- Tables of data and relationships.
- Constraints in the database prevent inconsistencies.
- Many features come for free, e.g. interfaces to Excel and other products.
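To illustrate how constraints in the database prevent inconsistencies, here is a minimal sketch. It uses SQLite from Python as a stand-in for Oracle. The `deposition` table borrows two column names from the example slide, while the `chain` table and all rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
con.executescript("""
CREATE TABLE deposition (
    accession_code TEXT PRIMARY KEY,
    title          TEXT NOT NULL
);
-- Hypothetical child table: every chain must belong to an existing deposition.
CREATE TABLE chain (
    accession_code TEXT NOT NULL REFERENCES deposition(accession_code),
    chain_id       TEXT NOT NULL,
    PRIMARY KEY (accession_code, chain_id)
);
""")

con.execute("INSERT INTO deposition VALUES ('dep1', 'placeholder title')")
con.execute("INSERT INTO chain VALUES ('dep1', 'A')")

try:
    # No deposition 'dep2' exists, so the database refuses the orphan chain.
    con.execute("INSERT INTO chain VALUES ('dep2', 'A')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The check lives in the schema rather than in application code, so every program that writes to the database is held to the same rule.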
42 Use of Designer
- Why Designer?
- start to finish software engineering
- Components of Designer
- Process Modeller
- Entity relationship diagrammer
- Data flow Modeller
- Function hierarchy diagrammer
- Transformers, Editors, Generators, Servers
43 Normalised database: 410 tables, 2000 attributes
- deposition data: 260 tables, 1300 attributes
- reference data: 130 tables, 500 attributes
- derived data: 20 tables, 200 attributes
- Arrow labels in the diagram: "is used to calculate", "provides standard values for"
46 Data Delivery, Data Warehouse
Data Capture
Clean-up
Legacy data (old PDB)
47 Search system
- Design goals
- usable by novices - useful for experts
- arbitrary combination of queries
- easy-to-write own queries
- run locally or use remotely
- incremental updates
- warehouse subsetting fallback database
48 EMSD Relational Databases
Normalised database: 410 tables, 2000 attributes
- deposition data: 260 tables, 1300 attributes
- reference data: 130 tables, 500 attributes
- derived data: 20 tables, 200 attributes
- Arrow labels in the diagram: "transformed to data warehouse structure", "provides standard values for", "is used to calculate"
49 What Is a Data Warehouse?
- A data warehouse is simply a different way of thinking about (and therefore organising) data
- same data, different representation
- not transactional, but for reporting and analysis
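"Same data, different representation" can be shown on a very small scale. In the sketch below, two normalised tables are transformed into one denormalised summary table that is convenient for reporting. SQLite is used as a stand-in, and all table names, column names and rows are hypothetical placeholders.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Normalised (transactional) side: one fact per row, no repetition.
CREATE TABLE entry (entry_id TEXT PRIMARY KEY, method TEXT);
CREATE TABLE chain (entry_id TEXT, chain_id TEXT, n_residues INTEGER);

-- Placeholder rows, not real data.
INSERT INTO entry VALUES ('e1', 'X-ray'), ('e2', 'NMR');
INSERT INTO chain VALUES ('e1', 'A', 100), ('e1', 'B', 120), ('e2', 'A', 80);

-- Warehouse side: the same data, pre-joined and pre-aggregated for queries.
CREATE TABLE wh_entry_summary AS
SELECT e.entry_id,
       e.method,
       COUNT(c.chain_id)  AS n_chains,
       SUM(c.n_residues)  AS total_residues
FROM entry e
JOIN chain c ON c.entry_id = e.entry_id
GROUP BY e.entry_id, e.method;
""")

rows = con.execute(
    "SELECT entry_id, method, n_chains, total_residues "
    "FROM wh_entry_summary ORDER BY entry_id"
).fetchall()
print(rows)
```

A reporting query against `wh_entry_summary` needs no joins, at the cost of having to rebuild or incrementally update the table whenever the normalised data change.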
50 Using the Warehouse
- Ad Hoc query
- not available to external users
- Simple questions
- text-based
- Complex Analytical Questions
- including true 3D queries
- Data Mining
- searching the data for hitherto unknown
correlations
51 Information Hierarchy
Data Mining
Data Warehouse
Analytical Questions
Ad Hoc Querying and Browsing
Operational Database
Operational Reporting
Data
52 STRUCTURAL GENOMICS
53 STRUCTURAL GENOMICS
- SELECTION OF TARGET PROTEINS OR DOMAINS
- CLONING, EXPRESSION, PURIFICATION
- CRYSTALLIZATION AND STRUCTURE DETERMINATION
- ARCHIVING AND ANNOTATION OF THE NEW STRUCTURE
54 MSD URLs
- EBI Macromolecular Structure Database: http://msd.ebi.ac.uk/
- MacroMolecular PDB Code Search: http://msd.ebi.ac.uk/Services/Quaternary/mm_search/mm_search.html
- PQS Protein Quaternary Structure Query Form: http://pqs.ebi.ac.uk/
- 3D sequence server: http://www3.ebi.ac.uk:7654/3Dseq/
- UN-Published References Assignments: http://msd.ebi.ac.uk/Services/UnPubRef/Server/form_page/form_page.html
- PDB EBI Home: http://pdb-browsers.ebi.ac.uk/
- OCA: http://oca.ebi.ac.uk/oca-bin/ocamain
- 3DB Browser: http://pdb-browsers.ebi.ac.uk/pdb-bin/pdbmain
- PDB Lite: http://pdb-browsers.ebi.ac.uk/pdb-bin/pdblite
- AutoDep - PDB Submission: http://autodep.ebi.ac.uk/
- Oracle Search: http://www.ebi.ac.uk/jji/ecsrch/
- Biotech Validation: http://biotech.ebi.ac.uk:8400/
56 PDB SEARCHES AT THE EBI
57 Quaternary Structures
58 http://pqs.ebi.ac.uk/
59 3D sequence server: http://www3.ebi.ac.uk:7654/3Dseq/
1. Derive a sequence from the coordinate section of the PDB entry.
2. Align it with the sequence taken from the 'SEQRES' lines in the PDB file.
3. Align the coordinate-derived sequence again, but this time do not allow gaps to be inserted in covalently linked segments.
4. Derive a probe sequence from this alignment, and use it to try to find the corresponding SWISS-PROT entry, on the basis of sequence similarity and taxonomy.
5. Check for any inconsistencies.
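The first two steps of this procedure can be sketched in a few lines. This is not the 3D sequence server's code: `difflib.SequenceMatcher` stands in for a proper sequence alignment, the function names are hypothetical, and the four-residue input is toy data laid out in PDB fixed columns.

```python
from difflib import SequenceMatcher

# Only the residue types used by the toy data; anything else becomes "X".
THREE_TO_ONE = {"ALA": "A", "GLY": "G", "SER": "S", "LEU": "L"}


def seqres_sequence(lines, chain):
    """Sequence as declared in the SEQRES records of one chain."""
    residues = []
    for line in lines:
        if line.startswith("SEQRES") and line[11] == chain:
            residues.extend(line[19:].split())
    return "".join(THREE_TO_ONE.get(r, "X") for r in residues)


def coordinate_sequence(lines, chain):
    """Step 1: sequence of residues that actually have coordinates."""
    seen, residues = set(), []
    for line in lines:
        if line.startswith("ATOM") and line[21] == chain:
            key = line[22:27]  # residue number plus insertion code
            if key not in seen:
                seen.add(key)
                residues.append(line[17:20])
    return "".join(THREE_TO_ONE.get(r, "X") for r in residues)


# Toy entry: GLY 2 is declared in SEQRES but has no coordinates.
pdb_lines = [
    "SEQRES   1 A    4  ALA GLY SER LEU",
    "ATOM      1  CA  ALA A   1      11.000  12.000  13.000",
    "ATOM      2  CA  SER A   3      14.000  15.000  16.000",
    "ATOM      3  CA  LEU A   4      17.000  18.000  19.000",
]

declared = seqres_sequence(pdb_lines, "A")
observed = coordinate_sequence(pdb_lines, "A")

# Step 2 (stand-in): match the two sequences and list SEQRES segments
# that have no counterpart in the coordinates.
matcher = SequenceMatcher(None, declared, observed)
missing = [declared[i1:i2] for tag, i1, i2, _, _ in matcher.get_opcodes() if tag == "delete"]
print(declared, observed, missing)
```

Step 3 would additionally forbid gaps inside covalently linked segments, which needs the atomic coordinates and is beyond this sketch.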
60 3D sequence server: http://www3.ebi.ac.uk:7654/3Dseq/
61 http://oca.ebi.ac.uk/oca-bin/ocamain
62 OCA oca_at_bioinfo.weizmann.ac.il
63 Validation Server