Title: TeraGyroid
1. TeraGyroid - HPC Applications ready for UKLight
Stephen Pickles <stephen.pickles@man.ac.uk>
http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html
UKLight Town Meeting, NeSC, Edinburgh, 9/9/2004
2. The TeraGyroid Project
- Funded by EPSRC (UK) and NSF (USA) to join the UK e-Science Grid and the US TeraGrid
- application from RealityGrid, a UK e-Science Pilot Project
- 3-month project including work exhibited at SC03 and SC Global, Nov 2003
- thumbs up from TeraGrid mid-September, funding from EPSRC approved later
- Main objective was to deliver high-impact science that would not be possible without the combined resources of the US and UK grids
- Study of defect dynamics in liquid crystalline surfactant systems using lattice-Boltzmann methods
- featured the world's largest lattice-Boltzmann simulation
- a 1024³-cell simulation of the gyroid phase demands terascale computing - hence TeraGyroid
3. Networking
Architecture diagram: HPC engines, a visualization engine and storage, with flows of checkpoint files, steering control and status, visualization data, and compressed video between them.
4. LB3D: 3-dimensional lattice-Boltzmann simulations
- LB3D code is written in Fortran90 and parallelized using MPI
- Scales linearly on all available resources (Lemieux, HPCx, CSAR, Linux/Itanium II clusters)
- Data produced during a single run can range from hundreds of gigabytes to terabytes (a rough estimate follows below)
- Simulations require supercomputers
- High-end visualization hardware (e.g. SGI Onyx, dedicated viz clusters) and parallel rendering software (e.g. VTK) needed for data analysis
Figure: 3D datasets showing snapshots from a simulation of spinodal decomposition. A binary mixture of water and oil phase separates. Blue areas denote high water densities and red visualizes the interface between both fluids.
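To make these volumes concrete, here is a back-of-envelope sketch; the D3Q19 velocity set, three fluid components and field counts are assumptions for illustration, not figures taken from the LB3D source:

```python
# Rough data-volume estimate for a large lattice-Boltzmann run.
# Assumed layout: D3Q19 velocity set, three fluid components
# (water, oil, surfactant), double-precision (8-byte) reals.

cells = 1024 ** 3                 # lattice sites in a 1024^3 simulation
bytes_per_real = 8                # double precision

# A full checkpoint must store every distribution function:
q = 19                            # velocities per component (D3Q19)
components = 3
checkpoint_bytes = cells * q * components * bytes_per_real
print(f"checkpoint   ~ {checkpoint_bytes / 1e12:.2f} TB")    # ~0.49 TB

# A visualization dump keeps only a few scalar fields
# (e.g. component densities plus an order parameter):
fields_per_dump = 4
dump_bytes = cells * fields_per_dump * bytes_per_real
dumps_per_run = 100
print(f"single dump  ~ {dump_bytes / 1e9:.1f} GB")           # ~34 GB
print(f"whole run    ~ {dumps_per_run * dump_bytes / 1e12:.2f} TB")
```

Under these assumptions the full checkpoint comes out at roughly half a terabyte, consistent with the figure quoted later for the 1024³ Lemieux runs.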
5. Computational Steering of Lattice-Boltzmann Simulations
- LB3D instrumented for steering using the RealityGrid steering library (see the control-loop sketch below).
- Malleable checkpoint/restart functionality allows rewinding of simulations and run-time job migration across architectures.
- Steering reduces storage requirements because the user can adapt data-dumping frequencies.
- CPU time is saved because users do not have to wait for jobs to finish if they can already see that nothing relevant is happening.
- Instead of task farming, parameter searches are accelerated by steering through parameter space.
- Analysis time is significantly reduced because less irrelevant data is produced.
Applied to the study of the gyroid mesophase of amphiphilic liquid crystals at unprecedented space and time scales.
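The control loop below illustrates the kind of instrumentation involved; the method names are hypothetical stand-ins, not the real RealityGrid steering library API:

```python
# Sketch of a steered simulation main loop. poll_steering_commands() and
# emit_status() stand in for the real steering-library calls.

def run_steered(sim, steerer, max_steps=100_000):
    output_freq = 1000        # steerable: how often to dump visualization data
    checkpoint_freq = 10_000  # steerable: how often to write a checkpoint

    for step in range(max_steps):
        sim.advance_one_timestep()

        # Act on any commands issued from a steering client.
        for cmd in steerer.poll_steering_commands():
            if cmd.name == "set_output_freq":
                output_freq = cmd.value    # fewer dumps -> less storage
            elif cmd.name == "checkpoint":
                sim.write_checkpoint()     # enables rewind and job migration
            elif cmd.name == "stop":
                return                     # nothing relevant happening: save CPU time

        if step % output_freq == 0:
            sim.dump_fields()              # feeds the on-line visualization
        if step % checkpoint_freq == 0:
            sim.write_checkpoint()

        # Report progress so clients can decide what to steer next.
        steerer.emit_status(step=step, order_parameter=sim.order_parameter())
```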
6. Parameter space exploration
Figure panel captions:
- Cubic micellar phase, high surfactant density gradient.
- Cubic micellar phase, low surfactant density gradient.
- Initial condition: random water/surfactant mixture.
- Self-assembly starts.
- Lamellar phase: surfactant bilayers between water layers.
- Rewind and restart from checkpoint.
7. Strategy
- Aim: use federated resources of the US TeraGrid and UK e-Science Grid to accelerate the scientific process
- Rapidly map out parameter space using a large number of independent small (128³) simulations
- use job cloning and migration to exploit available resources and save equilibration time
- monitor their behaviour using on-line visualization
- Hence identify parameters for high-resolution simulations on HPCx and Lemieux
- 1024³ on Lemieux (PSC) takes 0.5 TB to checkpoint!
- create initial conditions by stacking smaller simulations with periodic boundary conditions (see the sketch below)
- Selected 128³ simulations were used for long-time studies
- All simulations monitored and steered by a geographically distributed team of computational scientists
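Because the small runs use periodic boundary conditions, an equilibrated 128³ configuration can simply be tiled 8 x 8 x 8 times to seed a 1024³ run with no discontinuities at the seams. A minimal sketch, in which the array layout is an assumption for illustration:

```python
import numpy as np

def stack_initial_condition(small, copies=8):
    """Tile an equilibrated periodic configuration to seed a larger run.

    small  : array of shape (nx, ny, nz, nfields) from a small simulation
    copies : replicas along each axis (8 turns 128^3 into 1024^3)
    """
    # Periodicity guarantees the fields are continuous across every seam,
    # so plain tiling gives a valid starting state for the large run.
    return np.tile(small, (copies, copies, copies, 1))

# Toy example: a 4^3 configuration with two fields stands in for 128^3 data.
small = np.random.rand(4, 4, 4, 2)
large = stack_initial_condition(small, copies=8)
print(large.shape)   # (32, 32, 32, 2)
```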
8. The Architecture of Steering
- OGSI middle tier
- multiple clients: Qt/C++, .NET on PocketPC, GridSphere portlet (Java)
- remote visualization through SGI VizServer, Chromium, and/or streamed to Access Grid
- Computations run at HPCx, CSAR, SDSC, PSC and NCSA
- Visualizations run at Manchester, UCL, Argonne, NCSA, Phoenix
- Scientists at 4 sites steer calculations, collaborating via Access Grid
- Visualizations viewed remotely
- Grid services run anywhere (a minimal sketch of this client / middle-tier / simulation pattern follows below)
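The sketch below shows only the shape of that pattern in plain Python; the class and method names, and the job identifier, are invented for illustration and do not reflect the project's OGSI service interfaces:

```python
# Illustrative three-tier steering pattern: client -> middle tier -> simulation.

class Simulation:
    """Stands in for a running LB3D job instrumented for steering."""
    def __init__(self, name):
        self.name, self.params = name, {"output_freq": 1000}
    def apply(self, key, value):
        self.params[key] = value
        print(f"[{self.name}] {key} set to {value}")

class MiddleTier:
    """Stands in for the per-job Grid service that clients attach to."""
    def __init__(self):
        self.registry = {}                  # job id -> Simulation
    def register(self, job_id, sim):
        self.registry[job_id] = sim
    def steer(self, job_id, key, value):    # callable from any client flavour
        self.registry[job_id].apply(key, value)

# Any of the client flavours (Qt/C++, .NET, portlet) plays the same role:
tier = MiddleTier()
tier.register("lb3d-hpcx-042", Simulation("LB3D on HPCx"))
tier.steer("lb3d-hpcx-042", "output_freq", 250)
```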
9. SC Global 03 Demonstration
10. TeraGyroid Testbed
Network diagram of the testbed: US sites (ANL, NCSA, SDSC, PSC, Caltech, Phoenix) and UK sites (Manchester, Daresbury, UCL), connected through Starlight (Chicago) and Netherlight (Amsterdam); link capacities shown include 10 Gbps and BT-provisioned 2 x 1 Gbps circuits; UK traffic crosses the MB-NG research network and the SJ4 production network. Legend: visualization, computation, Access Grid node, service registry, network PoP, dual-homed system.
11. Trans-Atlantic Network
- Collaborators:
- Manchester Computing
- Daresbury Laboratory Networking Group
- MB-NG and UKERNA
- UCL Computing Service
- BT
- SURFnet (NL)
- Starlight (US)
- Internet2 (US)
12. TeraGyroid Hardware Infrastructure
- Computation (using more than 6000 processors), including:
  - HPCx (Daresbury), 1280 procs, IBM Power4 Regatta, 6.6 Tflops peak, 1.024 TB memory
  - Lemieux (PSC), 3000 procs, HP/Compaq, 3 TB memory, 6 Tflops peak
  - TeraGrid Itanium2 cluster (NCSA), 256 procs, 1.3 Tflops peak
  - TeraGrid Itanium2 cluster (SDSC), 256 procs, 1.3 Tflops peak
  - Green (CSAR), SGI Origin 3800, 512 procs, 0.512 TB memory (shared)
  - Newton (CSAR), SGI Altix 3700, 256 Itanium 2 procs, 384 GB memory (shared)
- Visualization
  - Bezier (Manchester), SGI Onyx 300, 6 x IR3, 32 procs
  - Dirac (UCL), SGI Onyx 2, 2 x IR3, 16 procs
  - SGI loan machine (Phoenix), SGI Onyx, 1 x IR4, 1 x IR3, commissioned on site
  - TeraGrid visualization cluster (ANL), Intel Xeon
  - SGI Onyx (NCSA)
- Service Registry
  - Frik (Manchester), Sony PlayStation 2
- Storage
  - 20 TB of science data generated in project
  - 2 TB moved to long-term storage for on-going analysis: Atlas Petabyte Storage System (RAL)
- Access Grid nodes at Boston University, UCL, Manchester, Martlesham, Phoenix (4)
13. Network lessons
- Less than three weeks to debug networks
- applications people and network people nodded wisely but didn't understand each other
- middleware such as GridFTP is infrastructure to applications folk, but an application to network folk
- rapprochement necessary for success
- Grid middleware not designed with dual-homed systems in mind
- HPCx, CSAR (Green) and Bezier are busy production systems - had to be dual-homed on SJ4 and MB-NG
- great care needed with routing
- complication: we needed to drive everything from laptops that couldn't see the MB-NG network
- Many other problems encountered
- but nothing that can't be fixed once and for all, given persistent infrastructure
14. Measured Transatlantic Bandwidths during SC03
15. TeraGyroid Summary
- Real computational science...
- Gyroid mesophase of amphiphilic liquid crystals
- Unprecedented space and time scales
- investigating phenomena previously out of reach
- ...on real Grids...
- enabled by high-bandwidth networks
- ...to reduce time to insight
Figure labels: Dislocations; Interfacial Surfactant Density
16. TeraGyroid Collaborating Organisations
- Our thanks to hundreds of individuals at...
- Argonne National Laboratory (ANL)
- Boston University
- BT
- BT Exact
- Caltech
- CSC
- Computing Services for Academic Research (CSAR)
- CCLRC Daresbury Laboratory
- Department of Trade and Industry (DTI)
- Edinburgh Parallel Computing Centre
- Engineering and Physical Sciences Research Council (EPSRC)
- Forschungszentrum Jülich
- HLRS (Stuttgart)
- HPCx
- IBM
- Imperial College London
- National Center for Supercomputing Applications (NCSA)
17. The TeraGyroid Experiment
- S. M. Pickles (1), R. J. Blake (2), B. M. Boghosian (3), J. M. Brooke (1), J. Chin (4), P. E. L. Clarke (5), P. V. Coveney (4), N. González-Segredo (4), R. Haines (1), J. Harting (4), M. Harvey (4), M. A. S. Jones (1), M. Mc Keown (1), R. L. Pinning (1), A. R. Porter (1), K. Roy (1), and M. Riding (1)
- (1) Manchester Computing, University of Manchester
- (2) CCLRC Daresbury Laboratory, Daresbury
- (3) Tufts University, Massachusetts
- (4) Centre for Computational Science, University College London
- (5) Department of Physics & Astronomy, University College London
http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html
18. New Application at AHM 2004
Exact calculation of peptide-protein binding energies by steered thermodynamic integration using high-performance computing grids.
- Philip Fowler, Peter Coveney, Shantenu Jha and Shunzhou Wan
- UK e-Science All Hands Meeting
- 31 August - 3 September 2004
19. Why are we studying this system?
- Measuring binding energies is vital for, e.g., designing new drugs.
- Calculating a peptide-protein binding energy can take weeks to months.
- We have developed a grid-based method to accelerate this process.
Goal: to compute ΔΔG_bind during the AHM 2004 conference, i.e. in less than 48 hours, using the federated resources of the UK National Grid Service and the US TeraGrid.
20. Thermodynamic Integration on Computational Grids
Workflow diagram (axes: λ against time): use steering to launch, spawn and terminate λ-jobs. From a starting conformation, seed successive simulations at λ = 0.1, 0.2, 0.3, ..., 0.9 (10 sims, each 2 ns), run each independent job on the Grid, check for convergence, then combine and calculate the integral (a sketch of this final step follows below).
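Thermodynamic integration combines the per-λ ensemble averages of ∂H/∂λ by numerical quadrature, ΔG = ∫₀¹ ⟨∂H/∂λ⟩_λ dλ, and ΔΔG_bind follows from the difference between two such legs (peptide bound to the protein versus free in solution). A minimal sketch of the "combine and calculate the integral" step, with sample values invented for illustration:

```python
import numpy as np

# Per-lambda ensemble averages of dH/dlambda from the independent lambda-jobs.
# These numbers are invented; real values come from the 2 ns simulations.
lam   = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
dh_dl = np.array([45.1, 38.7, 30.2, 22.9, 15.4, 9.8, 4.1, -1.2, -6.0])

# Trapezoidal quadrature over the sampled window of lambda; the endpoints
# (lambda = 0 and 1) need separate treatment, as noted in the results slide.
delta_G = float(np.sum(0.5 * (dh_dl[1:] + dh_dl[:-1]) * np.diff(lam)))
print(f"dG over sampled window = {delta_G:.1f} kcal/mol")

# ddG_bind = dG(bound leg) - dG(free leg), each leg computed the same way.
```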
21. (Architecture diagram)
Diagram labels: checkpointing; steering and control; monitoring.
22. We successfully ran many simulations
- This is the first time we have completed an entire calculation.
- Insight gained will help us improve the throughput.
- The simulations were started at 5pm on Tuesday and the data was collated at 10am on Thursday.
- 26 simulations were run.
- At 4.30pm on Wednesday, we had nine simulations in progress (140 processors): 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL
- We simulated over 6.8 ns of classical molecular dynamics in this time.
23. Very preliminary results
ΔΔG (kcal/mol): Experiment: -1.0 ± 0.3; quick-and-dirty analysis: -9 to -12 (as at 41 hours)
We expect our value to improve with further analysis around the endpoints.
24. Conclusions
- We can harness today's grids to accelerate high-end computational science
- On-line visualization and job migration require high-bandwidth networks
- Need persistent network infrastructure
- else set-up costs are too high
- QoS: would like the ability to reserve bandwidth
- and processors, graphics pipes, AG rooms, virtual venues, node ops... (but that's another story)
- Hence our interest in UKLight