Title: Memops Data modelling and automatic code generation
1 Memops - Data modelling and automatic code generation
- Edinburgh 9 September 2008
2 Memops - main points
- Code generation framework
- Data access subroutine libraries
- Fully automatic code generation from model
- Several programming languages in parallel
- Precise, detailed, validated data
3 Memops
- Introduction
- Code generation
- Generated libraries
- Applications of Memops
4 The CCPN Project
- Collaborative Computing Project for NMR
- Since 1999
- Unifying platform for NMR software, similar to CCP4 for X-ray crystallography
- Community-based, open-source software development
- Code generation, data model, applications, meetings
5 NMR Structural Biology Pipeline
Sample Preparation
NMR Machine
Structure Calculation
Data Processing
Spectrum Analysis
Slow, complex, interactive
Repository Database
6 Native Anarchy
[Diagram: several programs for each of Task1, Task2 and Task3, linked to one another by separate Convert steps]
7 With Data Standard
[Diagram: programs for Task1, Task2 and Task3, linked by Convert steps to a central Data Standard]
8 Data standard - objectives
- Lossless data transfer between programs (different approaches and architectures)
- All data needed for pipeline software
- Creating data, not analysing end results
- Intermediate results needed
- Comprehensive, detailed, complex
- Completeness, integrity of changing data
- Precisely defined standard
- A single central description
- Validation directly against standard
9 CCPN approach
- Standard API, no stable format
- easier to maintain as model changes
- Abstract data model
- Exact correspondence to APIs
- API implementations for several languages
- Transparent access to XML or DB storage
- Complete validation of model rules and constraints
10 Memops
- Introduction
- Code generation
- Generated libraries
- Applications of Memops
11 Automatic Code generation
- Model will change over time
- Several parallel implementations
- Synchronisation between APIs and model
- Maintenance and debugging
- Resources are limited
- Automatic Code Generation
- Write and debug once and for all
- Any domain, from Astrophysics to Zoology
- Quick and simple to extend model
- E.g. Application-specific packages
12 Code Generation Framework
13 Code Generation
[Diagram. Legend: CCPN code; Off-the-shelf; files; CCPN generated. Elements: edit UML; On-disk model (XML file); In-Memory Model (Python objects); MetaModel; API code, Schemas, Mappings etc.]
14 API generator
- Written in Python
- Modular
- Different generators share code
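As a rough illustration of generating API code from a model (not the Memops generator itself), the sketch below turns a tiny in-memory model description into Python class source and executes it. The `MODEL` layout and the `generate_class` and `generate_api` names are hypothetical, chosen only for this example.

```python
# Hypothetical miniature model: class name -> {attribute name: type name}.
MODEL = {
    "Atom": {"name": "str", "elementSymbol": "str"},
    "Bond": {"bondOrder": "float"},
}


def generate_class(class_name, attributes):
    """Return Python source for one class with type-checked get/set methods."""
    lines = [f"class {class_name}:"]
    lines.append("    def __init__(self, **values):")
    lines.append("        self._data = {}")
    lines.append("        for key, value in values.items():")
    lines.append("            getattr(self, 'set' + key[0].upper() + key[1:])(value)")
    for attr, type_name in attributes.items():
        cap = attr[0].upper() + attr[1:]
        lines.append(f"    def get{cap}(self):")
        lines.append(f"        return self._data.get({attr!r})")
        lines.append(f"    def set{cap}(self, value):")
        lines.append(f"        if not isinstance(value, {type_name}):")
        lines.append(f"            raise TypeError({attr!r} + ' must be {type_name}')")
        lines.append(f"        self._data[{attr!r}] = value")
    return "\n".join(lines) + "\n"


def generate_api(model):
    """Generate source for every class in the model and execute it."""
    source = "\n".join(generate_class(name, attrs) for name, attrs in model.items())
    namespace = {}
    exec(source, namespace)
    return namespace
```

Adding a class or attribute to `MODEL` and regenerating is then enough to extend the API, which is the point of keeping the generator separate from the model.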
15 Memops
- Introduction
- Code generation
- Generated libraries
- Applications of Memops
16 Model features
- Packages to subdivide model, code, and data files
- Objects. Unique context, compare-by-identity
- Complex data types. Different contexts, compare-by-value
- Simple data types, e.g. PositiveInt, enumerations
- Attributes and links
- Cardinality, frozen/modifiable, derived
- Unique/ordered collections (sets, lists, unique lists)
- Ad-hoc constraints on attributes, simple and complex datatypes, and objects.
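A minimal sketch of how a constrained simple data type, cardinality and uniqueness could be described and checked. The `DataType` and `Attribute` classes and the `seqCodes` attribute are assumptions made for illustration; only the name PositiveInt comes from the slide, and the real Memops metamodel is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DataType:
    """Simple data type: a base Python type plus an optional constraint."""
    name: str
    base: type
    constraint: Optional[Callable[[object], bool]] = None

    def is_valid(self, value):
        # bool is a subclass of int in Python, so reject it for int types.
        if self.base is int and isinstance(value, bool):
            return False
        if not isinstance(value, self.base):
            return False
        return self.constraint is None or self.constraint(value)


@dataclass(frozen=True)
class Attribute:
    """Attribute with a value type, a cardinality range and a uniqueness flag."""
    name: str
    value_type: DataType
    low: int = 1                 # minimum number of values
    high: Optional[int] = 1      # maximum number of values, None = unlimited
    unique: bool = False         # True: no repeated values allowed

    def errors(self, values):
        """Return a list of human-readable rule violations for the values."""
        values = list(values)
        found = []
        if len(values) < self.low:
            found.append(f"{self.name}: fewer than {self.low} values")
        if self.high is not None and len(values) > self.high:
            found.append(f"{self.name}: more than {self.high} values")
        if self.unique and len(set(values)) != len(values):
            found.append(f"{self.name}: repeated values")
        for value in values:
            if not self.value_type.is_valid(value):
                found.append(f"{self.name}: {value!r} is not a {self.value_type.name}")
        return found


# Hypothetical model fragment: a unique collection of one or more PositiveInt values.
PositiveInt = DataType("PositiveInt", int, lambda value: value > 0)
seqCodes = Attribute("seqCodes", PositiveInt, low=1, high=None, unique=True)
```

Because the rules live in the model description rather than in hand-written code, the same description can drive validation in every generated implementation.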
17 Molstructure model package
18 CCPN APIs
- Application Programming Interface
- Object oriented
- Data accessed in memory as if stored in the data model
- Implementations come with
- Integrated, transparent I/O (file or database)
- Complete validity checking
- Protection against casual change (data encapsulation)
- Versioning and backwards compatibility
- Event notifier system
- Slot for application-specific data
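The combination of encapsulation, validity checking and event notification can be sketched as follows. This is an illustrative toy, not the CCPN API: the `Peak` class, its `height` attribute and the `register_notify` function are hypothetical names.

```python
from collections import defaultdict

# Hypothetical registry: (class name, operation name) -> list of callbacks.
_notifiers = defaultdict(list)


def register_notify(callback, class_name, operation):
    """Ask for callback(obj) to be called after the given operation."""
    _notifiers[(class_name, operation)].append(callback)


class Peak:
    def __init__(self, height):
        self._height = None      # leading underscore: not for direct access
        self.setHeight(height)

    def getHeight(self):
        return self._height

    def setHeight(self, value):
        # Validity check happens before any data is changed.
        if not isinstance(value, float):
            raise TypeError("height must be a float")
        self._height = value
        # Tell every registered listener that the object changed.
        for callback in _notifiers[("Peak", "setHeight")]:
            callback(self)
```

A graphical application could register a redraw function this way, so that displays stay in step with the data regardless of which part of the program made the change.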
19 PythonXML at runtime
User application
Data get, set. Validity check
Python API
XML parser
XML I/O code
Generic XML read/write
XML I/O mappings
What to do for which element
User data in CCPN XML format
Data Storage: XML files
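The split between generic XML read/write code and per-element mappings can be sketched with the standard library. The `MAPPINGS` table, the `Data` root element and the record layout below are assumptions for illustration and do not reproduce the actual CCPN XML format.

```python
import xml.etree.ElementTree as ET

# Hypothetical I/O mappings: element tag -> {attribute name: converter from string}.
MAPPINGS = {
    "Atom": {"name": str, "elementSymbol": str},
    "Bond": {"bondOrder": float},
}


def write_xml(records):
    """Generic writer: records is a list of (tag, values dict) pairs."""
    root = ET.Element("Data")
    for tag, values in records:
        elem = ET.SubElement(root, tag)
        for attr in MAPPINGS[tag]:
            elem.set(attr, str(values[attr]))
    return ET.tostring(root, encoding="unicode")


def read_xml(text):
    """Generic reader: the mappings say what to do for which element."""
    records = []
    for elem in ET.fromstring(text):
        mapping = MAPPINGS[elem.tag]
        values = {attr: convert(elem.get(attr)) for attr, convert in mapping.items()}
        records.append((elem.tag, values))
    return records
```

Only the mappings table depends on the model, so in a generated setting the reader and writer could stay hand-written while the mappings are regenerated with each model change.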
20 JavaDB at runtime
Legend
CCPN code
Off-the-shelf
Application code
files
CCPN generated
Presentation layer
HQL
Custom queries (Hibernate Query Language)
Optional
Java API
Hibernate mappings
Hibernate
Database Schema
Database
21 Now Available
- Version 2.0 just released
- PythonXML, JavaXML, CXML, JavaDB (with Hibernate)
- Available under GPL license from SourceForge or www.ccpn.ac.uk
- CCPN Data Standard
- NMR, Macromolecules, LIMS
- 46 packages
- 552 classes and data types
- PythonXML implementation 800,000 lines of code
22 Memops
- Introduction
- Code generation
- Generated libraries
- Applications of Memops
23 CcpNmr Suite
- Analysis
- Interactive NMR analysis
- FormatConverter
- Convert between 30 NMR and structure formats
- Built on top of CCPN model (PythonXML)
- Version 2.0 released
- Widely used in macromolecular NMR
24 CcpNmr Analysis
25 ExtendNMR NMR pipeline
- Integrated macromolecular NMR pipeline, from sample to structure
- Pre-existing programs from 8 groups
- In-memory conversion to internal data structures
- Integrated versions released
- ARIA (NMR structure generation)
- Bruker TOPSPIN, manufacturer's processing/analysis package
26 BIOXDM
- Software pipeline for on-synchrotron crystallography
- Exploit new technology (? goniometers)
- Experiment optimisation, acquisition, and on-line processing
- Independent data model, with Memops machinery
- JavaDB implementation for runtime concurrent access
27 EUROCarbDB
- Distributed deposition database
- Glycobiology and glycomics
- NMR, MS, HPLC and topology
- Java. Database storage using Hibernate
- CCPN model JavaDB implementation slots in as-is
28 Funding acknowledgements
- BBSRC CCPN grants
- European Union grants
- EXTEND-NMR, EU-NMR, NMR-Life, NMRQUAL, and TEMBLOR contracts
- Industry support
- AstraZeneca, Dupont Pharma (now BMS), Genentech, GlaxoSmithKline
- Peter Keller (BIOXDM) thanks Synchrotron Soleil, the Global Phasing Consortium and EU FP6 BIOXHIT
29 People
- Authors: Prof. Ernest Laue, Wayne Boucher, Rasmus Fogh, Tim Stevens, John Ionides, Wim Vranken (EBI), Peter Keller (Global Phasing)
- Collaborators at U. Cambridge: Dan O'Donovan, Wolfgang Rieping, Alan da Silva, Darima Lamazhapova
- Collaborators at EBI (MSD), Hinxton: Kim Henrick, Anne Pajon, Chris Penkett
- Special thanks to Bruker Biospin GmbH (TOPSPIN), Michael Nilges (ARIA), Bas Leeflang (EUROCarbDB FP6 contract RIDS-CT-2004-01195
30 END
31 Overview
- Packages
- The Implementation package
- Objects
- DataTypes and DataObjTypes
- Access control
32 ARIA structure generation from NMR data
Custom conversion
Application
ARIA XML
ARIA Data Model
CCPN Data Model
CCPN XML
- ARIA imports
- Peak Lists
- Constraints
- Sequences
- Chemical shifts
- ARIA exports
- Peak Assignments
- Filtered Constraints
- Violations
- Structures
33 API functions
- get and set (Attributes and links)
- add and remove (Collection attributes and links)
- sorted (Unordered collection links)
- findFirst and findAll (Collection links)
- Simple filtering (attribute value)
- create and new (Objects)
- Normal and factory function object creation
- delete (Objects)
- Delete function cascades to objects rendered invalid by deletion
- checkValid, checkAllValid (Objects)
- API classes are strongly coupled. For efficiency reasons object-to-object links are two-way.
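The function families listed above can be illustrated with a small hand-written pair of classes. This is a sketch under assumptions, not generated CCPN code: the `Residue` and `Atom` classes, their attributes and the method names `newAtom`, `findFirstAtom` and `findAllAtoms` are hypothetical stand-ins for the kind of functions a generated API might offer.

```python
class Residue:
    def __init__(self, seqCode):
        self.seqCode = seqCode
        self.isDeleted = False
        self._atoms = []            # reverse side of the Atom.residue link

    def newAtom(self, name, elementSymbol):
        """Factory function: create an Atom that belongs to this Residue."""
        return Atom(self, name, elementSymbol)

    def getAtoms(self):
        return tuple(self._atoms)   # copy, so callers cannot edit the link

    def findAllAtoms(self, **conditions):
        """Return all atoms whose attributes equal the given values."""
        return [atom for atom in self._atoms
                if all(getattr(atom, key) == value
                       for key, value in conditions.items())]

    def findFirstAtom(self, **conditions):
        """Return the first matching atom, or None if there is none."""
        matches = self.findAllAtoms(**conditions)
        return matches[0] if matches else None

    def delete(self):
        # Cascade: atoms cannot exist without their parent residue.
        for atom in list(self._atoms):
            atom.delete()
        self.isDeleted = True


class Atom:
    def __init__(self, residue, name, elementSymbol):
        self.name = name
        self.elementSymbol = elementSymbol
        self.isDeleted = False
        self.residue = residue          # forward link
        residue._atoms.append(self)     # keep the reverse link in step

    def delete(self):
        self.residue._atoms.remove(self)
        self.isDeleted = True
```

Storing the link on both sides makes navigation cheap in either direction, at the cost of coupling the two classes: neither can be used without the other.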
34 FormatConverter - The NMR Translator
[Diagram: format-specific readers (XEasy, NmrView, Bruker, Varian, ...) take peaks, chemical shifts and acquisition parameters; generic peak, chemical shift and acquisition parameter converters handle data model entry into the CCPN Data Model; format-specific writers (XEasy, NmrView, NMRPipe, Azara, ...) output chemical shifts, peaks and processing parameters]
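The reader/writer layout in this diagram can be sketched with two toy formats. With N formats, routing everything through one common representation needs N readers and N writers rather than N(N-1) pairwise converters. The "comma" and "space" formats below are invented for this example and are not the XEasy, NmrView or any other real file format; the list of (name, shift) pairs stands in for the data model.

```python
def read_comma(text):
    """Toy reader: one 'name,shift' pair per line."""
    pairs = (line.split(",") for line in text.splitlines() if line.strip())
    return [(name, float(shift)) for name, shift in pairs]


def read_space(text):
    """Toy reader: one 'name shift' pair per line."""
    pairs = (line.split() for line in text.splitlines() if line.strip())
    return [(name, float(shift)) for name, shift in pairs]


def write_comma(shifts):
    return "\n".join(f"{name},{shift}" for name, shift in shifts)


def write_space(shifts):
    return "\n".join(f"{name} {shift}" for name, shift in shifts)


# One reader and one writer per format; no format knows about any other.
READERS = {"comma": read_comma, "space": read_space}
WRITERS = {"comma": write_comma, "space": write_space}


def convert(text, source, target):
    """Convert by reading into the common model, then writing from it."""
    shifts = READERS[source](text)      # format -> common model
    return WRITERS[target](shifts)      # common model -> format
```

Supporting a further format then means adding one entry to each dictionary, with no change to existing readers or writers.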
35 ExtendNMR ARIA
- Structure generation from macromolecular NMR data, ambiguous distance constraints
- One of two leading programs
- Python and scripts, with CNS dynamics engine
- All input and output integrated to CCPN standard
36 ARIA CCPN object selection
37 ExtendNMR Bruker TOPSPIN
- NMR processing program of major NMR instrument company
- Java. In-memory conversion to CCPN JavaXML implementation
- CCPN output in current TOPSPIN release, expanded in upcoming release.
38 Data Model v. Data Format
Abstract model (UML)
Relational Database
Atom
Bond
Atom_Bond_Connect
XML
<Atom ID="AT1" elementName="C">
  <BondList>
    <Bond IDREF="BD1"/>
    ...
  </BondList>
</Atom>
<Bond ID="BD1" bondOrder="1.0">
  <Atom1 IDREF="AT1"/>
  <Atom2 IDREF="AT2"/>
  ...
</Bond>
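To make the relational side of this comparison concrete, the sketch below stores the same Atom and Bond link in three tables using an in-memory SQLite database. The table names come from the slide; the column names, the second atom's element and the `build_database` and `atoms_of_bond` helpers are assumptions for illustration.

```python
import sqlite3


def build_database():
    """Create the three tables and store one bond between two atoms."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Atom (id TEXT PRIMARY KEY, elementName TEXT NOT NULL);
        CREATE TABLE Bond (id TEXT PRIMARY KEY, bondOrder REAL NOT NULL);
        -- The many-to-many link needs its own table in a relational schema.
        CREATE TABLE Atom_Bond_Connect (
            atom_id TEXT NOT NULL REFERENCES Atom(id),
            bond_id TEXT NOT NULL REFERENCES Bond(id),
            PRIMARY KEY (atom_id, bond_id)
        );
    """)
    # AT2's element is a made-up value; the slide only shows AT1.
    conn.executemany("INSERT INTO Atom VALUES (?, ?)", [("AT1", "C"), ("AT2", "C")])
    conn.execute("INSERT INTO Bond VALUES (?, ?)", ("BD1", 1.0))
    conn.executemany("INSERT INTO Atom_Bond_Connect VALUES (?, ?)",
                     [("AT1", "BD1"), ("AT2", "BD1")])
    return conn


def atoms_of_bond(conn, bond_id):
    """Follow the link from a bond to its atoms through the connect table."""
    rows = conn.execute(
        "SELECT atom_id FROM Atom_Bond_Connect WHERE bond_id = ? ORDER BY atom_id",
        (bond_id,))
    return [row[0] for row in rows]
```

The abstract model has a single Atom to Bond link; the XML form expresses it with ID references and the relational form with an extra table, which is why the model rather than either format is treated as the standard.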
39 Packages
40 Packages
- Partition model, code, and data
- Import each other
- Can be omitted
- All import Implementation and AccessControl
- Each has a TopObject
- No links between data from rival TopObjects (different extents of data)
41 Root and TopObjects
42 TopObjects
- One in every package
- Ultimate parent to all objects in package
- Have globally unique identifier (guid)
- currentXyz links from root
- Links can constrain links between descendants
- In file implementations
- Hold links to storage and backup locations
- Live in Implementation as almost empty shell
43 Overview
- Packages
- The Implementation package
- Objects
- DataTypes and DataObjTypes
- Access control
44 CcpNmr Analysis
- NMR Assignment Program
- Inspired by ANSIG and Sparky
- Demonstrates CCPN approach
- Modern interface and scripting
- Scalable and extensible
- Operating Systems
- Linux, Sun, SGI, OSX, Windows
- Languages
- Python
- Data model interaction
- Tk Graphical interface
- Scripting
- C
- OpenGL/Tk contours
- Structure display
- Mathematical operations
45 Implementation Package
- Model and Code
- Supertypes that define all objects
- Objects
- DataTypes
- DataObjTypes
- Basic data types
- Data on how to access the real data
- Data location pointers
- Current-package pointers
- Implementation data are not part of the data set, and are not in the database.
- Represent view or session?
46 Data Location
47 Objects and their Supertypes
48 Simple Data Types
49 Complex Data Types