Title: eHTPX
1. e-HTPX: HPC, Grid and Web-Portal Technologies in
High Throughput Protein Crystallography
Rob Allan (r.j.allan_at_dl.ac.uk), Ronan Keegan (r.m.keegan_at_dl.ac.uk), David Meredith (d.j.meredith_at_dl.ac.uk), Martyn Winn (m.d.winn_at_dl.ac.uk), Graeme Winter (g.winter_at_dl.ac.uk), CCLRC Daresbury Laboratory
Jonathan Diprose (jon_at_strubi.ox.ac.uk), Chris Mayo (chris.mayo_at_strubi.ox.ac.uk), University of Oxford, The Wellcome Trust Centre for Human Genetics, Oxford
Ludovic Launer (launer_at_embl-grenoble.fr), MRC France, ESRF, Grenoble
Joel Fillon (fillon_at_ebi.ac.uk), European Bioinformatics Institute, Cambridge
Paul Young (pyoung_at_ysbl.york.ac.uk), York Structural Biology Laboratory
2. e-HTPX Overview
- The vast amounts of data coming from the genome projects have generated a demand for new methods to obtain structural information about proteins and macromolecules. This has led to a demand for high-throughput structural biology to determine the structure of important proteins.
- e-HTPX: a distributed computing infrastructure for remotely planning, initiating and monitoring experiments for protein crystallographic structure determination (workflow).
- Relies heavily on Grid portal, web-service and HPC technologies.
- The project integrates a number of key services provided by UK e-Science, protein manufacture and synchrotron laboratories.
3. e-HTPX Workflow
- Stage 1: Select protein target
- Stage 2: Crystallisation of protein
- Stage 3: Data collection (X-ray diffraction images, scaling and integration)
- Stage 4: Structure solution (HPC data processing to derive digital protein model)
- Stage 5: Submit model into public database
[Workflow diagram; labels: Start, Target Selection, Structure Solution, Finish]
- A single all-encompassing web interface from which users can initiate, plan, direct and document the experimental workflow, either locally or remotely, from a desktop computer.
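The five stages on this slide form a strictly ordered pipeline. The sketch below is a minimal illustration of that ordering only; the names `Stage` and `next_stage` are hypothetical, and the slides do not say how the portal represents stages internally.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """The five e-HTPX workflow stages, in pipeline order."""
    TARGET_SELECTION = 1
    CRYSTALLISATION = 2
    DATA_COLLECTION = 3
    STRUCTURE_SOLUTION = 4
    MODEL_DEPOSITION = 5


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last one."""
    if current is Stage.MODEL_DEPOSITION:
        return None
    return Stage(current + 1)
```

Using an ordered enumeration makes "new entry points" (such as starting directly at structure solution) a matter of choosing the initial `Stage` value.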
4. Key Technologies
[Architecture diagram; components shown:]
- Web server: JSP, scoped beans, Java CoG Kit, Apache Axis, Apache Tomcat
- MyProxy server: credentials uploaded with a Java Web Start application
- Grid nodes: third-party GridFTP transfers, RSL job descriptions
- HPC cluster: Sun Grid Engine
- SRS beamline: beamline machine, beamline database, web services (PPDM)
- Access control: username and password; IP recognition through the firewall
5. Stages 1 to 2 (Target Selection and Protein Production)
6. Web-Service Call Stack, PPDM
- Web-service call stack
  - A complex sequence of communications is required between the user and the different laboratories involved (protein production and delivery to SRS).
  - The hub centralises all the requests/responses between the user and the various labs involved.
- PPDM: Protein Production Data Model
  - Provides a model for exchanging information between the different partners of the high-throughput process.
  - Communicates with hardware, databases, LIMS and the different stages of the protein crystallography pipeline.
  - Each facility can implement the service independently, according to an agreed standard.
  - Describes many components (experiments, molecules, sample constituents).
  - Expressed in several languages: XML Schema, SQL, Java classes, Python classes.
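The hub idea described on this slide, one place through which every request and response passes, can be sketched as follows. This is an illustration under stated assumptions, not the e-HTPX implementation: the class names `Request` and `Hub`, the field names, and the status strings are all hypothetical, and the real PPDM schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Request:
    """A hypothetical PPDM-style message; field names are illustrative only."""
    facility: str   # which laboratory should handle the request
    action: str     # e.g. "order_protein"
    payload: dict   # request data, shaped by the agreed data model


@dataclass
class Hub:
    """Routes every request through one place and records each call's status."""
    handlers: Dict[str, Callable[[Request], dict]] = field(default_factory=dict)
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def register(self, facility: str, handler: Callable[[Request], dict]) -> None:
        """Each facility supplies its own implementation of the service."""
        self.handlers[facility] = handler

    def send(self, request: Request) -> dict:
        """Forward a request to its facility and log the outcome."""
        handler = self.handlers.get(request.facility)
        if handler is None:
            self.log.append((request.facility, request.action, "FAILED"))
            raise KeyError(f"no facility registered as {request.facility!r}")
        response = handler(request)
        self.log.append((request.facility, request.action, "DONE"))
        return response
```

Because facilities only share the message shape, each one can register a handler written against its own hardware, database or LIMS, and the log gives the user a single view of the call stack.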
7. e-HTPX Hub Interface
- Interface used to plan an experiment and input the required data (e.g. on-line completion of safety forms, specification of crystal growth conditions, remote authorization of necessary permissions and allocation of sufficient beam-time).
- The interface simplifies the complex web-service call stack (the status of each call is automatically updated and presented to the user).
8. Protein Visualisation Web-Service
- OPPF: automatic pipetting into 96-well trays, facilities for imaging the wells, and a database. Images of the wells can be provided to the user over the internet.
- Colour codes indicate the likelihood that a crystal has developed in each droplet.
9. Stage 3: Data Collection (X-ray Diffraction and Collection of Images)
- A typical experiment on a high-brilliance beamline may generate a few gigabytes of data.
- Data collection involves automated X-ray diffraction facilities, including sample changers to exchange crystals on the beamline.
- Automation of this type is essential for remote operation.
- The system is being linked to a database which is used to store requests from the user and handle the data for individual samples.
10. Stage 3: Data Collection
[Diagram; components shown: Portal, SRS beamline (sample changer, diffractometer, detector), Grid node storage facility. X-ray diffraction images are moved by GridFTP.]
1) Start: specify experimental requirements
2) Expert system providing automated and synchronous analysis / verification of data quality
3) Feedback: modify data collection parameters
4) Data collection (Beamline Control Module)
5) Finish: data collection and processing complete
6) Start Stage 4: HPC, further data processing
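Steps 2 to 5 on this slide describe a feedback loop: data are collected, an expert system checks their quality, and the collection parameters are adjusted until the data are acceptable. The sketch below shows only that control flow. The function name, the `collect` and `assess` callables, the parameter dictionary and the `max_rounds` limit are assumptions made for illustration; the slides do not describe the actual beamline control interface.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def collect_with_feedback(
    params: Params,
    collect: Callable[[Params], object],
    assess: Callable[[object], Tuple[bool, Params]],
    max_rounds: int = 5,
) -> Tuple[object, Params, int]:
    """Collect data, let the expert system assess it, and retry with the
    suggested parameters until the data are accepted or rounds run out."""
    for round_number in range(1, max_rounds + 1):
        images = collect(params)                 # data collection
        accepted, suggested = assess(images)     # quality verification
        if accepted:
            return images, params, round_number  # collection complete
        params = {**params, **suggested}         # feedback: modify parameters
    raise RuntimeError("data quality not accepted within max_rounds")
```

Bounding the number of rounds keeps an unattended, remotely operated run from looping indefinitely on a poor crystal.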
11. Stage 4: Solve Structure of Protein
[Diagram labels: end of Stage 3, third-party GridFTP data transfer, job submission]
1) Continue pipeline (from the end of Stage 3), or new entry point
2) Job submission
   a) Globus 2.4 GRAM job manager: automated job submission and data transfer (continuation of e-HTPX pipeline)
   b) Sun Grid Engine: batch queuing
3) CCP4 code suite. Key codes parallelised: Beast, Molecular Replacement, Scala, Mosflm
4) Digital protein model
5) Submit model in DB
12. Remotely Accessing the Facilities
- Monitor status of GridFTP hosts and GRAM job managers
- Interface to GridFTP (JSP, servlets, Java CoG Kit)
- e-HTPX requires secure transfer of diffraction images to HPC for structure solution
13. Remotely Accessing the Facilities (Upload Data from Remote Machine)
[Diagram; components shown: e-HTPX portal, remote machine, GridFTP data transfer]
1) Run the Proxy Delegation Tool from the portal with Web Start (download of digitally signed jars; delegation via Web Services)
2) Run the GridFTP File Transfer Tool (Web Start download of digitally signed jars)
- Remote machine requirements: Java Web Start, internet access, port 2811 open
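A user can check the "port 2811 open" requirement before attempting a transfer. The sketch below tests only whether an outbound TCP connection to that port succeeds; it does not speak the GridFTP protocol, and the helper name `port_is_reachable` and its defaults are illustrative assumptions.

```python
import socket


def port_is_reachable(host: str, port: int = 2811, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A `True` result only shows that the control connection is not blocked; a transfer may still fail if the firewall blocks the separate data connections.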
14. Remotely Accessing the Facilities
- Job submission interface (session-scope Java beans, Java CoG Kit, GT2.4)
- Batch / interactive jobs, staging of executables, stdout and stderr redirection
- Monitoring status of jobs (application-scope job-monitor bean)
15. Remotely Accessing the Facilities
- Custom interfaces for e-HTPX-specific jobs
  - Molecular Replacement (new entry point for part of the Stage 4 structure solution process)
16. Conclusions
- Key problems solved
  - Allows biologists to concentrate on the scientific questions rather than technical details.
  - A comprehensive data model (PPDM) allows each facility to implement the service independently, according to an agreed standard.
  - Gives biologists access to high-performance facilities (HPC, CCP4 codes).
- Key technologies: Java Beans, JSP, Servlets, Web Start, Java CoG Kit, GT2.4, Web Services, expert systems, databases.
- Future plans
  - Remotely interfacing with robotic hardware (sample changers).
  - Outreach to industry to integrate e-HTPX into drug discovery pipelines.
17. Stages 1 to 2
- Stage 1: Protein production (select target, fill out safety description, diffraction plan)
- Stage 2: Crystallisation of protein (monitoring of crystal growth, delivery of crystal to synchrotron)
18. Web Service Hub / Grid Portal
- A single all-encompassing web interface from which users can initiate, plan, direct and document the experimental workflow, either locally or remotely, from a desktop computer.
- Main features
  - Web-service call stack, with a hub used to centralise the requests.
  - PPDM: each facility implements the service independently.
  - Grid technologies used to provide security, data transfer and job submission.
- The portal is located on a 'Hub', which centralises all the requests/answers.
- The requests/answers are expressed in XML embedded in SOAP messages and addressed via Web Services calls.
- Each facility can implement the service independently, according to an agreed Web Services interface (PPDM).
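To make "XML embedded in SOAP messages" concrete, the sketch below wraps a small request in a SOAP 1.1 envelope using Python's standard library. The envelope namespace is the standard SOAP 1.1 one; everything else is assumed for illustration: the `urn:example:ppdm` namespace is a placeholder, and the operation and field names do not come from the real PPDM schema.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
PPDM_NS = "urn:example:ppdm"  # placeholder; the real PPDM namespace is not given


def build_soap_request(operation: str, fields: dict) -> str:
    """Wrap one request element, with simple child fields, in a SOAP envelope."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{PPDM_NS}}}{operation}")
    for name, value in fields.items():
        ET.SubElement(request, f"{{{PPDM_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")
```

In the portal itself this job falls to the web-services toolkit (Apache Axis is listed among the key technologies), which generates and parses such envelopes rather than leaving it to hand-written code.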