Title: High Performance Application Program Interfaces to Macromolecular Structure Data
1 Data Integration and Management: A PDB Perspective
http://www.pdb.org/ info_at_rcsb.org
2 What is PDB?
- Single international repository of three-dimensional data for biological macromolecules
- Public community resource
- Established at Brookhaven in 1971 (7 structures)
- Moved to RCSB in 1998
- wwPDB established in 2004
- > 25,000 structures in PDB
3 Community
- Scientific Community - at all levels
- Structural biologists (crystallography, NMR, cryo-EM)
- Biologists
- Computational biologists
- Journals
- General Community
- Secondary school
- General public
- Internal
- RCSB PDB staff
- wwPDB members
4 Data Representation
- Macromolecular Crystallographic Information Framework
- XML DTD/Schema Mapping
- SQL Schema Mapping
- CORBA IDL Mapping
- Supporting emerging ontology representations - OWL
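
As a rough illustration of what a mapping from an mmCIF category to XML might look like, the sketch below renders one category row as an XML element. The naming scheme (category as the element tag, items as child elements) and the helper `row_to_xml` are assumptions made for illustration; the actual PDB XML schema mapping may differ.

```python
import xml.etree.ElementTree as ET

def row_to_xml(category, row):
    """Render one category row (item name -> value) as an XML element.

    Hypothetical mapping: the category becomes the element tag and each
    item becomes a child element holding the value as text.
    """
    elem = ET.Element(category)
    for item, value in row.items():
        child = ET.SubElement(elem, item)
        child.text = str(value)
    return elem

# Illustrative values for a few items of the mmCIF `cell` category.
cell = {"length_a": 58.39, "length_b": 86.70, "angle_alpha": 90.0}
xml_text = ET.tostring(row_to_xml("cell", cell), encoding="unicode")
print(xml_text)
```

The same row dictionary could feed an SQL or IDL writer, which is the appeal of keeping the data definition separate from any one output format.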
5 Elements of Dictionary Metadata
- Data Attributes
- Definition
- Examples
- Data type (primitive type/regular expression patterns)
- Range or allowed values
- Classes
- Categories
- Subcategories
- Category groups
- Associations
- Parent-child relationships
- Interdependencies/exclusivity
- Methods
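
The data-attribute metadata listed above (definition, data type pattern, range) can be sketched as a small validating class. This is a minimal sketch, not the dictionary's actual API: the class `ItemDefinition`, its field names, and the pattern and bounds chosen for `_cell.angle_alpha` are all assumptions for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ItemDefinition:
    """Hypothetical, simplified stand-in for one dictionary data attribute."""
    name: str
    definition: str
    pattern: str                      # regular expression for the data type
    minimum: Optional[float] = None   # inclusive lower bound, if any
    maximum: Optional[float] = None   # inclusive upper bound, if any

    def validate(self, value: str) -> bool:
        """Check a raw string value against the type pattern and range.

        Range checks assume the pattern only admits numeric strings.
        """
        if re.fullmatch(self.pattern, value) is None:
            return False
        if self.minimum is None and self.maximum is None:
            return True
        number = float(value)
        if self.minimum is not None and number < self.minimum:
            return False
        if self.maximum is not None and number > self.maximum:
            return False
        return True

# Illustrative definition; pattern and bounds are assumed, not quoted.
angle_alpha = ItemDefinition(
    name="_cell.angle_alpha",
    definition="Unit-cell angle alpha in degrees.",
    pattern=r"-?\d+(\.\d+)?",
    minimum=0.0,
    maximum=180.0,
)

print(angle_alpha.validate("90.0"))    # True
print(angle_alpha.validate("270.0"))   # False (out of range)
print(angle_alpha.validate("ninety"))  # False (fails the type pattern)
```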
6 Difficult Issues
- Resolving semantic ambiguities and encoding meaning
- Integrating controlled vocabularies
- Separation of primary and derived information
- Supporting rapid evolution of science
7 What's Driving Data Definition
- IUCr-sponsored community effort
- Automated data acquisition
- Data management and data exchange for PDB
- New technologies (e.g. cryo-electron microscopy)
- High-throughput structure determination and
structural genomics
8 Typical Project Deposition Data Flow
[Flow diagram; box labels: Target Selection, Crystal Production, Protein Production, Project Database, Structure Determination, Merged Project Data, Exchange Dictionary, PDB Deposition]
9 Data Sharing Nightmare
10 Incremental Data Pipeline
11 Current Integration Strategy
- Provide software tools to collect bits of data from the output from each program step
- Convert data in log and output files to a common representation
- Merge the data corresponding to the successful outcome
- Provide an editor tool to enter remaining data and check consistency of results
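
The merge and consistency-check steps above can be sketched as follows, assuming each program step's output has already been converted to a common item-name to value dictionary. The functions `merge_steps` and `missing_items`, the step names, and the values are hypothetical; this is not the RCSB tooling itself.

```python
def merge_steps(step_outputs):
    """Merge per-step dictionaries into one record and report conflicts.

    step_outputs: list of (step_name, {item_name: value}) pairs.
    The first value seen for an item is kept; later disagreements are
    collected for manual review instead of silently overwriting.
    """
    merged, conflicts = {}, []
    for step_name, items in step_outputs:
        for item, value in items.items():
            if item in merged and merged[item] != value:
                conflicts.append((item, merged[item], value, step_name))
            else:
                merged[item] = value
    return merged, conflicts

def missing_items(merged, required):
    """List required items that still have to be entered by hand."""
    return [item for item in required if item not in merged]

# Illustrative step outputs (values are made up for the example).
steps = [
    ("data_reduction", {"_cell.length_a": "58.39",
                        "_reflns.d_resolution_high": "1.80"}),
    ("refinement", {"_cell.length_a": "58.39",
                    "_refine.ls_R_factor_R_work": "0.192"}),
]
merged, conflicts = merge_steps(steps)
todo = missing_items(merged, ["_cell.length_a", "_exptl.method"])
print(merged)
print(conflicts)
print(todo)
```

Items reported by `missing_items` are the ones an editor tool would prompt the depositor to fill in.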
12 Data Deposition and Annotation
[Workflow diagram, Step 1 to Step 5; labels: Depositor, ADIT, Validation Report, Annotate, Depositor Approval]
13 Integrated Data Processing System
[System diagram; labels: MAXIT, Validation, Data Assembled by Depositor, ADIT, ADITsrv, Client Input Tool, Database Loader, Reports, Final Files, Metadata Dictionaries, Data Views]
14 Features of System
15 Data Distribution
[Diagram; labels: mmCIF Data Files (Data Reference Standard), mmCIF Parsers, XML Files, Relational Database, API Servers, Applications]
16 Automatic Production of Macromolecular Structure API Components
[Diagram; labels: PDB Exchange Dictionary, API Specific Data Dictionaries, Metamodel Framework, CORBA IDL, SQL Schema, XML DTD/Schemas, Data Loaders, Database Access Classes]
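
To make the idea of generating API components from dictionary metadata concrete, the sketch below derives an SQL `CREATE TABLE` statement from a small category description. The function `category_to_sql`, the `SQL_TYPES` mapping, and the primitive types assigned to the `atom_site` items are assumptions for illustration, not the generator or schema actually used by the RCSB PDB.

```python
# Assumed mapping from dictionary primitive types to SQL column types.
SQL_TYPES = {"int": "INTEGER", "float": "REAL", "text": "TEXT"}

def category_to_sql(category, items, keys):
    """Build a CREATE TABLE statement from category metadata.

    items: {item_name: primitive_type}; keys: item names forming the key.
    """
    columns = [f"    {name} {SQL_TYPES[ptype]}" for name, ptype in items.items()]
    columns.append(f"    PRIMARY KEY ({', '.join(keys)})")
    return f"CREATE TABLE {category} (\n" + ",\n".join(columns) + "\n);"

# Illustrative subset of the `atom_site` category; types are simplified.
ddl = category_to_sql(
    "atom_site",
    {"id": "int", "label_atom_id": "text", "Cartn_x": "float"},
    keys=["id"],
)
print(ddl)
```

Because the statement is produced from metadata alone, a change to the dictionary would propagate to the schema by rerunning the generator; the same metadata could drive IDL or XML schema writers.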
17 Management
- Complex challenges in technology and sociology
- Communicate and work with diverse community
- Help create and enforce community policies and standards
- Must take advantage of the most current innovations in new technologies
- New technologies must be introduced so as to enable and not disrupt the users of the resource
- Beyond all else is the need for good data and a robust data representation
18 Access
- RCSB Protein Data Bank Site
- http://www.pdb.org/
- OpenMMS site (Java implementation)
- http://openmms.sdsc.edu/
- RCSB PDB Software Download Site (C and Python implementation, NDB server)
- http://deposit.pdb.org/mmcif/FILM/
- RCSB PDB Dictionary Resource Site
- http://deposit.pdb.org/mmcif/
- RCSB PDB Beta Data Site
- ftp://beta.rcsb.org/pub/pdb/uniformity/data/
19 http://www.pdb.org/ info_at_rcsb.org
- Operated by three members of the RCSB: Rutgers, The State University of New Jersey; San Diego Supercomputer Center at the University of California, San Diego; Center for Advanced Research in Biotechnology/UMBI/NIST
- The RCSB PDB is supported by funds from the National Science Foundation (NSF), the National Institute of General Medical Sciences (NIGMS), the Office of Science, Department of Energy (DOE), the National Library of Medicine (NLM), the National Cancer Institute (NCI), the National Center for Research Resources (NCRR), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the National Institute of Neurological Disorders and Stroke (NINDS).
- The RCSB PDB is a member of the wwPDB (http://www.wwpdb.org/)