Title: Fast Access to Big Molecules: An Introduction to the OMG/LSR Macromolecular Structure API
1Fast Access to Big Molecules: An Introduction to the OMG/LSR Macromolecular Structure API
- Douglas S. Greer
- University of California, San Diego
- Alexy Khrabrov
- Rutgers University
- Philip E. Bourne
- University of California, San Diego
- John D. Westbrook
- Rutgers University
2Overview
- RCSB
- An Ontology Driven Architecture
- OpenMMS Toolkit
- Macromolecular Structure (MMS) Metamodel
- Parser, XML, SQL
- CORBA
- Ten Macromolecular Structure Classes
- Two Examples of the API
3What Is the RCSB?
- The Research Collaboratory for Structural Bioinformatics
- Manages the Protein Data Bank
- Members
- University of California San Diego
- Rutgers University
- National Institute of Standards and Technology
- http://www.rcsb.org
- info@rcsb.org
4RCSB Goal
To enable the science of molecular biology
5How to Enable?
- Promote well defined MMS specifications
- Deposition and Archiving
- Fast turnaround of accurate data
- Capture more information directly
- Distribution: Open Interfaces
- Classic: flat files, Web browsing and searching
- New: XML, SQL, CORBA
6RCSB Internal Responsibilities
X-ray and NMR Depositions
- Rutgers - Data deposition and validation
- UCSD - Data query and distribution
- NIST - Long term archive and data clean-up
[Figure: deposition data flow diagram with boxes labeled EBI, Direct, Rutgers, Cleanup, UCSD, NIST, and Mirrors]
7Why OpenMMS?
- Allow programmers to more easily create efficient, high-performance, and robust applications.
- A Java-only toolkit that creates XML, CORBA, and relational DB representations of the mmCIF macromolecular structure data.
- Source code is publicly available
- Extensible: add your own dictionary definitions and data
8mmCIF Dictionary and Data Files
- Based on the ontology for macromolecular structure defined by the International Union of Crystallography
- Replaces the older 80-column PDB files
- mmCIF Dictionary contains over 140 Category and 1600 Item definitions
- Open Standards Process
- Provides a well-defined reference standard for the specification and distribution of macromolecular structure data
9Ontology Driven Architecture: OpenMMS Toolkit
Data Flow
[Figure: data flow diagram with boxes labeled mmCIF Data Files (Reference Standard), mmCIF Parsers, XML Files, Relational Database, CORBA Server, and Applications]
10Ontology Driven Architecture: Metamodel
Information Flow
[Figure: information flow diagram with boxes labeled mmCIF Dictionaries (Ontology), Ontology Metamodel, Metamodel Framework, and the generated outputs: CORBA IDL, SQL Schema, XML DTD, Java Data Loaders, JDBC Loaders]
11MMS Metamodel Hierarchy
[Figure: hierarchy diagram with nodes labeled Root, Module, Interface, Struct, and Field, plus a Visitor Abstract Class and a Visitor Subclass]
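The idea behind the metamodel hierarchy can be sketched in a few lines: one tree of modules, structs, and fields is walked by an abstract visitor, and each visitor subclass emits a different representation. This is only an illustrative sketch. All class and method names here (MetamodelSketch, Visitor, IdlVisitor, SqlVisitor) are hypothetical and are not the OpenMMS classes, and the IDL-to-SQL type mapping is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class MetamodelSketch {
    // Hypothetical metamodel nodes: a Module holds Structs, a Struct holds Fields.
    static class Field {
        final String type, name;
        Field(String type, String name) { this.type = type; this.name = name; }
    }
    static class Struct {
        final String name;
        final List<Field> fields = new ArrayList<>();
        Struct(String name) { this.name = name; }
        Struct add(String type, String fieldName) {
            fields.add(new Field(type, fieldName));
            return this;
        }
    }
    static class Module {
        final String name;
        final List<Struct> structs = new ArrayList<>();
        Module(String name) { this.name = name; }
    }

    // Abstract visitor: owns the traversal; subclasses decide what text to emit.
    static abstract class Visitor {
        protected final StringBuilder out = new StringBuilder();
        String generate(Module m) {
            beginModule(m);
            for (Struct s : m.structs) {
                beginStruct(s);
                for (int i = 0; i < s.fields.size(); i++) {
                    field(s.fields.get(i), i == s.fields.size() - 1);
                }
                endStruct(s);
            }
            endModule(m);
            return out.toString();
        }
        void beginModule(Module m) {}
        void endModule(Module m) {}
        abstract void beginStruct(Struct s);
        abstract void field(Field f, boolean last);
        abstract void endStruct(Struct s);
    }

    // Emits an IDL-like declaration.
    static class IdlVisitor extends Visitor {
        @Override void beginModule(Module m) { out.append("module ").append(m.name).append(" {\n"); }
        @Override void endModule(Module m) { out.append("};\n"); }
        @Override void beginStruct(Struct s) { out.append("  struct ").append(s.name).append(" {\n"); }
        @Override void field(Field f, boolean last) {
            out.append("    ").append(f.type).append(' ').append(f.name).append(";\n");
        }
        @Override void endStruct(Struct s) { out.append("  };\n"); }
    }

    // Emits a CREATE TABLE statement; the type mapping is an assumption.
    static class SqlVisitor extends Visitor {
        @Override void beginStruct(Struct s) { out.append("CREATE TABLE ").append(s.name).append(" (\n"); }
        @Override void field(Field f, boolean last) {
            String sqlType = f.type.equals("float") ? "REAL" : "VARCHAR(255)";
            out.append("  ").append(f.name).append(' ').append(sqlType).append(last ? "\n" : ",\n");
        }
        @Override void endStruct(Struct s) { out.append(");\n"); }
    }

    static Module example() {
        Module m = new Module("Mms");
        m.structs.add(new Struct("AtomSite").add("string", "id").add("float", "occupancy"));
        return m;
    }

    public static void main(String[] args) {
        Module m = example();
        System.out.print(new IdlVisitor().generate(m));
        System.out.print(new SqlVisitor().generate(m));
    }
}
```

Adding another output format under this design means adding one more visitor subclass, while the tree built from the dictionary stays untouched.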
12Some Advantages of Using an Ontology Driven Architecture
- Scales to very large Ontologies
- More reliable and maintainable code
- Transfer between representations
- Scientific correctness of representation
- Help in maintaining backward compatibility
13mmCIF Parsers
- General Purpose, Low-level access to data
- Parsers available in many Languages
- OpenMMS toolkit includes Java Parser
- An application subclasses Abstract class and
stores data into its own data structure
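The subclassing pattern in the last bullet might look roughly like the sketch below. It is not the OpenMMS parser API: LoopHandler, row, and XCoordinateCollector are hypothetical names, and the toy parser handles only a single loop_ block with unquoted, whitespace-separated values, one row per line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CifCallbackSketch {
    /** Hypothetical callback base class; not the actual OpenMMS parser API. */
    static abstract class LoopHandler {
        /** Called once per data row with the item names and the row's values. */
        abstract void row(List<String> itemNames, List<String> values);

        /** Parses one simple loop_ block and feeds each row to row(). */
        final void parse(String text) {
            List<String> names = new ArrayList<>();
            for (String raw : text.split("\n")) {
                String line = raw.trim();
                if (line.isEmpty() || line.equals("loop_") || line.startsWith("#")) continue;
                if (line.startsWith("_")) {   // item name declaration
                    names.add(line);
                    continue;
                }
                List<String> values = Arrays.asList(line.split("\\s+"));
                if (values.size() == names.size()) row(names, values);
            }
        }
    }

    /** Application subclass: keeps only the x coordinates, in its own list. */
    static class XCoordinateCollector extends LoopHandler {
        final List<Double> xs = new ArrayList<>();
        @Override void row(List<String> itemNames, List<String> values) {
            int col = itemNames.indexOf("_atom_site.Cartn_x");
            if (col >= 0) xs.add(Double.parseDouble(values.get(col)));
        }
    }

    static final String SAMPLE = String.join("\n",
        "loop_",
        "_atom_site.id",
        "_atom_site.type_symbol",
        "_atom_site.Cartn_x",
        "1 N 11.065",
        "2 C 12.436");

    public static void main(String[] args) {
        XCoordinateCollector c = new XCoordinateCollector();
        c.parse(SAMPLE);
        System.out.println(c.xs);   // prints [11.065, 12.436]
    }
}
```

The parser never builds a full in-memory model; the application decides what to keep, which is what makes this style low-level but flexible.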
14MMS in XML (Prototype)
- Very Large Flat Files (due to open and close tags)
- CIF ? mmCIF
- Tables can be grouped by rows or columns
- XML from SQL Query
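To make the row versus column grouping concrete, here is a small sketch that serializes the same table both ways. The element names (row, and item names used directly as tags) are assumptions for illustration and are not taken from the prototype's DTD; values are assumed to contain no XML special characters, so no escaping is done.

```java
public class XmlLayoutSketch {
    // Row-wise: every value is wrapped in its own item tags, once per row.
    static String byRows(String category, String[] items, String[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(category).append(">\n");
        for (String[] row : rows) {
            sb.append("  <row>");
            for (int c = 0; c < items.length; c++) {
                sb.append('<').append(items[c]).append('>')
                  .append(row[c])
                  .append("</").append(items[c]).append('>');
            }
            sb.append("</row>\n");
        }
        sb.append("</").append(category).append(">\n");
        return sb.toString();
    }

    // Column-wise: each item tag appears once and holds all of its values.
    static String byColumns(String category, String[] items, String[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(category).append(">\n");
        for (int c = 0; c < items.length; c++) {
            sb.append("  <").append(items[c]).append('>');
            for (int r = 0; r < rows.length; r++) {
                if (r > 0) sb.append(' ');
                sb.append(rows[r][c]);
            }
            sb.append("</").append(items[c]).append(">\n");
        }
        sb.append("</").append(category).append(">\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] items = {"id", "type_symbol"};
        String[][] rows = {{"1", "N"}, {"2", "C"}, {"3", "C"}};
        String r = byRows("atom_site", items, rows);
        String c = byColumns("atom_site", items, rows);
        System.out.print(r);
        System.out.print(c);
        System.out.println("row-wise chars: " + r.length() + ", column-wise chars: " + c.length());
    }
}
```

Row grouping repeats the open and close tags for every value, which is where most of the file size comes from; column grouping pays for each tag pair once per item.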
15Relational DB Expression
- SQL-92 Compatible
- Schemas for all the standard DB vendors
- Oracle, DB2, MySQL, MS Access, Sybase
- Fast and Flexible Keyword searches
- PDBase loader allows structures to be selectively
loaded
16CORBA Expression of MMS Data
- No Parsing of Flat Files
- Direct Access to Binary Data Structures
- Strongly Typed Data
- Granularity of Access
- Indices and Presence Flags Pre-computed
- Highest Performance
17Two OpenMMS Tools: pdbase and dbserv
[Figure: diagram with boxes labeled dbserv CORBA Server, Compute Farm, pdbase DB loader, and RAM Cache]
18OMG/LSR MMS Specification Adoption Process
- August 1999: RFP issued
- March 2000: Initial Submission
- September 2000: Revised Submission
- February 2001: Adopted by the OMG
- November 2001: Version 1.0 of OpenMMS source code publicly available
- February 2002: Formal OMG Specification
19Using the CORBA MMS Server
An excerpt from a legacy 80-column PDB-formatted file (4hhb):
...
ATOM      6  CG1 VAL A   1       7.009  20.127   5.418  6.00 61.79 ...
ATOM      7  CG2 VAL A   1       5.246  18.533   5.681  6.00 80.12 ...
ATOM      8  N   LEU A   2       9.096  18.040   3.857  7.00 26.44 ...
ATOM      9  CA  LEU A   2      10.600  17.889   4.283  6.00 26.32 ...
ATOM     10  C   LEU A   2      11.265  19.184   5.297  6.00 32.96 ...
ATOM     11  O   LEU A   2      10.813  20.177   4.647  8.00 31.90 ...
ATOM     12  CB  LEU A   2      11.099  18.007   2.815  6.00 29.23 ...
ATOM     13  CG  LEU A   2      11.322  16.956   1.934  6.00 37.71 ...
ATOM     14  CD1 LEU A   2      11.468  15.596   2.337  6.00 39.10 ...
ATOM     15  CD2 LEU A   2      11.423  17.268    .300  6.00 37.47 ...
...
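For contrast with the CORBA approach, this is roughly what a client has to do to read one legacy ATOM record itself: slice fixed character columns and convert each piece. The column ranges follow the standard PDB ATOM record layout; the class name PdbAtomLine and its fields are hypothetical and only serve this sketch.

```java
public class PdbAtomLine {
    final int serial;
    final String name;
    final String resName;
    final char chainId;
    final int resSeq;
    final double x, y, z;

    PdbAtomLine(int serial, String name, String resName, char chainId,
                int resSeq, double x, double y, double z) {
        this.serial = serial; this.name = name; this.resName = resName;
        this.chainId = chainId; this.resSeq = resSeq;
        this.x = x; this.y = y; this.z = z;
    }

    // PDB columns are documented 1-based and inclusive;
    // substring() is 0-based and end-exclusive, hence the shifted indices.
    static PdbAtomLine parse(String line) {
        int serial = Integer.parseInt(line.substring(6, 11).trim());      // cols 7-11
        String name = line.substring(12, 16).trim();                      // cols 13-16
        String resName = line.substring(17, 20).trim();                   // cols 18-20
        char chainId = line.charAt(21);                                   // col 22
        int resSeq = Integer.parseInt(line.substring(22, 26).trim());     // cols 23-26
        double x = Double.parseDouble(line.substring(30, 38).trim());     // cols 31-38
        double y = Double.parseDouble(line.substring(38, 46).trim());     // cols 39-46
        double z = Double.parseDouble(line.substring(46, 54).trim());     // cols 47-54
        return new PdbAtomLine(serial, name, resName, chainId, resSeq, x, y, z);
    }

    // First record of the excerpt, laid out in its fixed columns.
    static final String SAMPLE =
        "ATOM      6  CG1 VAL A   1       7.009  20.127   5.418  6.00 61.79";

    public static void main(String[] args) {
        PdbAtomLine a = parse(SAMPLE);
        System.out.println(a.serial + " " + a.name + " " + a.resName + " " + a.chainId
            + a.resSeq + " (" + a.x + ", " + a.y + ", " + a.z + ")");
    }
}
```

Every field arrives as text and must be positioned, trimmed, and converted by the client, which is the work a strongly typed binary interface avoids.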
20LSR/MMS ATOM Record
DsLSRMacromolecularStructure.idl excerpt:
struct AtomSite {
    string id;
    IndexId type_symbol;
    AtomIndex label;
    IndexId label_entity;
    VectorXYZ cartn;
    float occupancy;
    float b_iso_or_equiv;
};
21Example code to get a list of atomic coordinates
// Fetch the entry by its PDB id, then walk its atom_site list.
Entry e = entryFactory.get_entry_from_id("4hhb");
AtomSite[] a = e.get_atom_site_list();
for (int i = 0; i < a.length; i++)
    System.out.println(a[i].id + " " + a[i].type_symbol.id + " (" +
        a[i].cartn.x + ", " + a[i].cartn.y + ", " + a[i].cartn.z + ")");
produces:
1 N (11.065, 7.352, 9.598)
2 C (12.436, 7.764, 9.902)
3 C (12.883, 7.09, 11.208)
4 O (12.088, 7.0, 12.147)
5 C (12.611, 9.264, 10.06)
...
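Because the coordinates arrive as typed numeric fields, downstream geometry needs no string handling. The sketch below uses stand-in classes so that it runs without a CORBA server: AtomSite and VectorXYZ here are simplified local mock-ups carrying only the fields used, not the generated stubs, and the distance helper is a hypothetical addition. The coordinates are the first two atoms from the printed output above.

```java
public class AtomDistanceSketch {
    // Simplified stand-ins for the IDL structs (only the fields used here).
    static class VectorXYZ {
        final float x, y, z;
        VectorXYZ(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }
    static class AtomSite {
        final String id;
        final VectorXYZ cartn;
        AtomSite(String id, VectorXYZ cartn) { this.id = id; this.cartn = cartn; }
    }

    // Euclidean distance between two atoms, computed in double precision.
    static double distance(AtomSite a, AtomSite b) {
        double dx = (double) a.cartn.x - b.cartn.x;
        double dy = (double) a.cartn.y - b.cartn.y;
        double dz = (double) a.cartn.z - b.cartn.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    static AtomSite[] sample() {
        return new AtomSite[] {
            new AtomSite("1", new VectorXYZ(11.065f, 7.352f, 9.598f)),
            new AtomSite("2", new VectorXYZ(12.436f, 7.764f, 9.902f)),
        };
    }

    public static void main(String[] args) {
        AtomSite[] a = sample();
        System.out.println("distance between atoms " + a[0].id + " and " + a[1].id
            + ": " + distance(a[0], a[1]));
    }
}
```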
22Overview of Ten Core Classes (mmCIF Categories)
23Ten Core Classes continued...
24Secondary Structure Core Classes
25Secondary Structure Code Example
Entry e = entryFactory.get_entry_from_id("4hhb");
StructConf[] scf = e.get_struct_conf_list();
EntityPolySeq[] eps = e.get_entity_poly_seq_list();
ChemComp[] cc = e.get_chem_comp_list();
// For each secondary structure element, list the monomers it spans.
for (int j = 0; j < scf.length; j++) {
    System.out.println("Structure Conformation " + scf[j].id + " in chain " +
        scf[j].beg_label.asym.id + " contains");
    int start = scf[j].beg_label.seq.index;
    int end = scf[j].end_label.seq.index;
    // eps[i].mon.index is a pre-computed index into the chem_comp list.
    for (int i = start; i < end; i++)
        System.out.println(" Monomer " + cc[eps[i].mon.index].name + " (" +
            eps[i].mon.id + ") at position " + eps[i].num);
}
26Secondary Structure Print Results
...
Structure Conformation HELX_P24 in chain C contains
 Monomer THREONINE (THR) at position 118
 Monomer PROLINE (PRO) at position 119
 Monomer ALANINE (ALA) at position 120
 Monomer VALINE (VAL) at position 121
 Monomer HISTIDINE (HIS) at position 122
 Monomer ALANINE (ALA) at position 123
 Monomer SERINE (SER) at position 124
...
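The secondary structure example relies on cross-references that carry both a string id and an array index, so a lookup such as the monomer name is a plain array access. A server could pre-compute those indices once, roughly as sketched below. The classes are simplified stand-ins whose field names mirror the slide code; the resolve method and the use of -1 for an unresolved reference are assumptions, not part of the specification.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexIdSketch {
    // Stand-in for a cross-reference: the string key plus its resolved array position.
    static class IndexId {
        final String id;
        int index = -1;   // assumption: -1 marks "not resolved"
        IndexId(String id) { this.id = id; }
    }
    static class ChemComp {
        final String id, name;
        ChemComp(String id, String name) { this.id = id; this.name = name; }
    }
    static class EntityPolySeq {
        final int num;
        final IndexId mon;
        EntityPolySeq(int num, String monId) { this.num = num; this.mon = new IndexId(monId); }
    }

    // One pass builds an id-to-position map; a second pass stores each index.
    static void resolve(EntityPolySeq[] eps, ChemComp[] cc) {
        Map<String, Integer> byId = new HashMap<>();
        for (int i = 0; i < cc.length; i++) byId.put(cc[i].id, i);
        for (EntityPolySeq e : eps) {
            Integer idx = byId.get(e.mon.id);
            e.mon.index = (idx == null) ? -1 : idx;
        }
    }

    static ChemComp[] sampleComps() {
        return new ChemComp[] {
            new ChemComp("ALA", "ALANINE"),
            new ChemComp("PRO", "PROLINE"),
            new ChemComp("THR", "THREONINE"),
        };
    }

    static EntityPolySeq[] sampleSeq() {
        return new EntityPolySeq[] {
            new EntityPolySeq(118, "THR"),
            new EntityPolySeq(119, "PRO"),
            new EntityPolySeq(120, "ALA"),
        };
    }

    public static void main(String[] args) {
        ChemComp[] cc = sampleComps();
        EntityPolySeq[] eps = sampleSeq();
        resolve(eps, cc);
        // After resolve(), the client loop needs no searching or string comparison.
        for (EntityPolySeq e : eps) {
            System.out.println(" Monomer " + cc[e.mon.index].name + " (" + e.mon.id
                + ") at position " + e.num);
        }
    }
}
```

The cost of the map lookups is paid once on the server side, so each client-side cross-reference is a constant-time array access.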
28Work in progress
- MMS CORBA API Graphics Applications
- MMS CORBA API Searching and Analysis Applications
- MMS CORBA API Linux Cluster Applications
29Thanks and Acknowledgments
- Michael Miller
- Martin Senger
- Lynn TenEyck
- David Benton
- Helen Berman
- Karl Konnerth
- The OMG