Title: Terascale Numerical Relativity using Cactus
1. Terascale Numerical Relativity using Cactus
- John Shalf
- LBNL/NERSC
- Ed Seidel, Gabrielle Allen, and the Cactus Team
- Max Planck Institute for Gravitational Physics
- (Albert Einstein Institute)
2. The Story in 5 Chapters
- The Science
- Cactus: A Community Code
- The Grid: Pervasive Access to Distributed Resources
- Portals: Spacetime Superglue to Put It All Together
- What's Next? Dynamic Computing on the Grid?
3. Chapter I
4. Gravitational Wave Astronomy: A New Field, Fundamental New Information about the Universe
5. Motivation for Grand Challenge Simulations
- NSF Black Hole Grand Challenge
- 8 US institutions, 5 years
- Towards colliding black holes
- Examples of the future of science and engineering:
- Require large-scale simulations, beyond the reach of any single machine
- Require large geo-distributed, cross-disciplinary collaborations
- Require Grid technologies, but are not yet using them!
- Both apps and Grids are dynamic
6. Working Towards the Big Splash
- Finite difference evolution of Einstein's equations (ADM-BSSN method)
- Schwarzschild (1916 solution!)
- Kerr (spinning, 1963!)
- Misner (head-on collision)
- Good for calibration, but not a likely event
- 16 GB of memory for 190^3 octant symmetry in 3D on a 512-CPU CM5 in '95
- Grazing collisions / full in-spiral
- This is astrophysically relevant!
- No analytic solution
- 1.5 TByte, 5 TFlops for bitant symmetry on NERSC-3
- 3 TByte required for full 3D
- 10 TBytes for wave extraction
- Initial conditions (the next big thing)
7. 2002 Big Splash on Seaborg (NERSC)
- The Splash (recipe)
- The Cactus Code
- 5 TFlop supercomputer system at the NERSC Oakland Scientific Computing Facility (OSF)
- 1.5 TBytes of RAM (1024x1024x768 DP x 250 grid functions)
- Set aside 5 TB of disk space (2 TB for checkpoints alone)
- Two deployment scenarios
- 184 nodes
- 64 fat nodes
- Consumed over 1M CPU hours in 2 months (114 CPU years!)
- Results?
- Followed closely the predictions of the Meudon model (counter to the Cook-Baumgarte model for coalescences). More analysis to come!
- Visualization of the BH merger in the April Scientific American article
- Discovery Channel movie
- Vis by Werner Benger, production by Tom Lucas, Donna Cox, and Bob Patterson
8. 2D BBH Spacetime Splashes Circa 1992 (vis by Mark Bajuk)
9. 3D Big Splash in Scientific American (image by Werner Benger)
10. Evaluation of Apparent Horizon Boundary Conditions
11. Uncovering Painfully Obvious Numerical Nonsense
12. AMR Diagnostics
- Debug a clustering algorithm
- Convergence testing
13. Role of Visualization?
- "Research is what I'm doing when I don't know what I'm doing" (Wernher von Braun)
- Data mining? (maybe?)
- Drill-down (yes)
- Larger simulations mean larger spatial dynamic range
- Understanding the connection between large-scale and small-scale features is critical
- New data structures (tensor vis, geodesics, AMR hierarchies, multidimensional analysis)
- Computational monitoring is important
- Rapid visual inspection for quick turn-around during development
- Shepherd/protect very costly hero runs
- Best way to deal with big data is to move it as little as possible
- Offline analysis is also important, but may involve a completely different set of tools and methods (even serial raytracing)
- Physicists have little tolerance for complexity or software installation
- Motivates the need for vis portals and thin-client interfaces to vis tools
- More dims, more qualitative (1D vis is still critical!)
- Need vis tools customized for the domain
- General-purpose tools have too many options: confusing and unwieldy
14. Multidisciplinary Scientific Communities
- Nature is fundamentally multidisciplinary. As we strive to understand its complexity, researchers from different fields and different locations must become engaged in large multinational teams to tackle these Grand Challenge problems
- Need a software infrastructure to support the multidisciplinary Virtual Organization (VO)
- Community code (open/modular/shared simulation codes)
- Tools that support collaboration and data sharing
- Location-independent, equal access to shared resources (visualization, supercomputers, experiments, telescopes, etc.)
15. Chapter II
16. Cactus
- CACTUS is a freely available, modular, portable, and manageable environment for collaboratively developing parallel, high-performance, multi-dimensional simulations
THE GRID: dependable, consistent, pervasive access to high-end resources
www.CactusCode.org
17. History
- Cactus originated in 1997 as a code for numerical relativity, following a long line of codes developed in Ed Seidel's research groups, at NCSA and more recently at the AEI.
- Numerical relativity: complicated 3D hyperbolic/elliptic PDEs, dozens of equations, thousands of terms, many people from very different disciplines working together, needing a fast, portable, flexible, easy-to-use code which can incorporate new technologies without disrupting users.
- Originally Paul Walker, Joan Masso, John Shalf, Ed Seidel.
- Cactus 4.0, August 1999: total rewrite and redesign of the code, learning from experiences with previous versions.
18. What Is Cactus?
- Modular component architecture for simulation code development
- Multi-language: C, C++, F90/F77
- Tightly integrated with a revision control system (CVS)
- Trivially Grid-enabled
- Open-source community code, distributed under the GNU GPL and actively supported/documented
- Current release: Cactus 4.0 beta 12
- Supported architectures:
- IBM SP2
- Cray T3E
- Hitachi SR8000-F
- NEC SX-5
- Intel Linux IA32/IA64
- Windows NT
- MacOS X
- HP Exemplar
- Sun Solaris
- SGI Origin (n32/64)
- DEC Alpha
- ...
19. Modularity of Cactus...
[Diagram: applications, sub-apps, legacy codes, and symbolic-manipulation tools plug into the Cactus Flesh; the user selects the desired functionality and the code is created. Swappable abstraction layers underneath provide unstructured grids, AMR (GrACE, Carpet, etc.), I/O, MPI, MDS/remote spawn, and remote steering, on top of Globus metacomputing services.]
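To make the plug-in structure above concrete, here is a minimal sketch of the three configuration files a Cactus thorn uses to declare itself to the flesh. The thorn, its variables, and its parameter are invented for illustration, and the CCL syntax is reproduced from memory of the Cactus 4.0 documentation, so treat the exact keywords as assumptions rather than something shown in the talk.

    # interface.ccl -- what the thorn provides to the rest of the code
    implements: wavetoy
    inherits: grid

    public:
    CCTK_REAL scalarfield TYPE=GF TIMELEVELS=2
    {
      phi
    } "Scalar field evolved by this thorn"

    # schedule.ccl -- when the flesh should call the thorn's routines
    schedule WaveToy_Evolve AT evol
    {
      LANG: C
    } "Evolve the scalar field one timestep"

    # param.ccl -- runtime parameters; STEERABLE ones can be changed
    # while the code runs (e.g. from the HTTPD thorn shown later)
    REAL amplitude "Initial Gaussian amplitude" STEERABLE=ALWAYS
    {
      0.0:* :: "Non-negative"
    } 1.0

The flesh reads these declarations at build time, so swapping in a different driver, I/O layer, or physics thorn does not require touching the other thorns' code.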
20. Supported Simulation Types
- Unigrid
- Numerics: Einstein, Hydro (Valencia, EOS), MHD (Zeus), PETSc
- Features: computational monitoring, vis, parallel I/O (PANDA, HDF5, FlexIO)
- Metacomputing: MPICH-G2, SC2001 Gordon Bell Award
- AMR (Berger-Oliger, Berger-Colella)
- DAGH (the framework), '97
- Carpet
- PAGH/GrACE
- Unstructured grids
- '99 Unstructured Grids summit at LBL (AEI, Cornell, LLNL, Stanford, SDSC)
- PPPL PIM, curvilinear meshes
- Chemistry
- U. Kansas (Karen Camarda)
- Cornell crack propagation
21. Cactus Community
22. Chapter III
- The Grid
- Pervasive Access to Distributed Resources
23. Why Grid Computing?
- The Cactus numerical relativity community has access to high-end resources at over ten centers in Europe/USA
- They want:
- Bigger simulations, more simulations, and faster throughput
- Intuitive I/O at the local workstation
- No new systems/techniques to master!!
- How to make the best use of these resources?
- Provide easier access: no one can remember ten usernames, passwords, batch systems, file systems. A great start!!!
- Combine resources for larger production runs (more resolution badly needed!)
- Dynamic scenarios: automatically use what is available
- Remote/collaborative visualization, steering, monitoring
- Many other motivations for Grid computing ...
24. Grid Applications: Some Examples
- Dynamic staging: move to a faster/cheaper/bigger machine
- Cactus Worm
- Multiple universes: create a clone to investigate a steered parameter
- Automatic convergence testing: from initial data or initiated during the simulation
- Look-ahead: spawn off and run a coarser resolution to predict the likely future
- Spawn independent/asynchronous tasks: send to a cheaper machine, the main simulation carries on
- Thorn profiling: best machine/queue, choose resolution parameters based on the queue
- Dynamic load balancing: inhomogeneous loads, multiple grids
- Intelligent parameter surveys: farm out to different machines
- Must get the application community to rethink algorithms
25. Grand Picture
[Diagram: Grid-enabled Cactus runs on distributed machines (SP2 at NERSC, Origin at AEI) are launched from the Cactus Portal via Globus, streaming HDF5 isosurfaces and DataGrid/DPSS-downsampled data over http to remote clients: viz in St. Louis, remote viz and steering from Berlin, remote steering and monitoring from an airport, and viz of data from previous simulations in a San Francisco café.]
26. Remote Visualization
[Diagram: grid functions streamed as HDF5 (downsampled to match bandwidth) and remote files delivered through VizLauncher (download) to local clients such as OpenDX, Amira, and LCA Vision, showing isosurfaces and geodesics.]
Use a variety of local clients to view remote simulation data. Collaborative: colleagues can access it from anywhere. Now adding matching of the data to network characteristics.
27. Remote Monitoring/Steering: Thorn HTTPD
- Thorn which allows any simulation to act as its own web server (see the parameter-file sketch below)
- Connect to the simulation from any browser, anywhere, and collaborate
- Monitor run parameters, basic visualization, ...
- Change steerable parameters
- See a running example at www.CactusCode.org
- Wireless remote viz, monitoring, and steering
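As an illustration only, a fragment of a Cactus parameter file that would turn a run into its own web server. The thorn names (from the CactusConnect arrangement) and the port parameter are given as I recall them, not quoted from the talk; check the thorn documentation of your Cactus release.

    # Assumed thorn and parameter names -- verify against your release
    ActiveThorns = "httpd httpdextra socket"   # plus your physics/IO thorns
    httpd::port  = 5555        # then browse to http://<host>:5555
    # Steerable parameters (e.g. output frequency) can then be changed
    # from the browser while the simulation is running.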
28. Remote Steering
[Diagram: the running simulation serves remote viz data as HDF5 streams to Amira and over HTTP/XML to any viz client, which can steer it back.]
29. VizLauncher
- VizLauncher: output data (remote files/streamed data) automatically launched into the appropriate local viz client (extending to include application-specific networks)
- Debugging information (individual thorns can easily provide their own information)
- Timing information (thorns, communications, I/O) allows users to steer their simulation for better performance (switch off analysis/I/O)
30. Remote File Access
[Diagram: a visualization client in Berlin requests downsampled data and hyperslabs (only what is needed) from remote data servers (web, FTP, DPSS) fronting 4 TB of data at NCSA.]
31. Remote File Access
[Diagram: through an HDF5 VFD over GridFTP, clients open the file by URL and request downsampled or hyperslabbed subsets, using the bandwidth available back to NCSA (USA).]
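A minimal sketch in C of the hyperslab/downsampling idea: the client asks the HDF5 library for every fourth point of a 3D grid function, so only the reduced data ever needs to cross the network. The file and dataset names are invented, and the GridFTP/streaming virtual file drivers mentioned above would be selected through the file-access property list rather than the default local-file driver used here.

    /* Sketch: read every 4th point of a 3D dataset via an HDF5 hyperslab.
     * Written against the HDF5 1.6-era C API (H5Dopen with two arguments);
     * newer releases add a property-list argument. */
    #include <hdf5.h>
    #include <stdlib.h>

    int main(void)
    {
        hsize_t start[3]  = {0, 0, 0};
        hsize_t stride[3] = {4, 4, 4};          /* downsample by 4 per axis */
        hsize_t dims[3], count[3];
        int i;

        hid_t file   = H5Fopen("bh_merger.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset   = H5Dopen(file, "psi4");   /* dataset name is invented */
        hid_t fspace = H5Dget_space(dset);
        H5Sget_simple_extent_dims(fspace, dims, NULL);

        for (i = 0; i < 3; i++)                 /* size of the sampled grid */
            count[i] = (dims[i] + stride[i] - 1) / stride[i];

        /* strided selection in the file, and a matching memory space */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, stride, count, NULL);
        hid_t mspace = H5Screate_simple(3, count, NULL);

        double *buf = malloc(count[0] * count[1] * count[2] * sizeof(double));
        H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

        /* ... hand buf to the local visualization client ... */

        free(buf);
        H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
        return 0;
    }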
32. Chapter IV
- Portal Architecture
- Spacetime superglue to make these components work together for the Virtual Organization
33. Cactus/ASC Portal
- KDI ASC project (Argonne, NCSA, AEI, LBL, WashU)
- Technology: web-based (end-user requirement)
- Globus, GSI, DHTML, Java CoG, MyProxy, GPDK, TomCat, Stronghold/Apache, SQL/RDBMS
- The portal should hide/simplify the Grid for users
- Single point of access; locates resources; builds/finds executables; central management of parameter files/job output; submits jobs to local batch queues; tracks active jobs; submission/management of distributed runs
- Accesses the ASC Grid Testbed
34. Portal Client Layers
- Thin client: slow interaction, but you know it's going to work!
- Delivery: DHTML to any old web browser
- Users: no time investment
- Slender client: faster interaction, but primary work on the remote server. Downloaded on every invocation!
- Delivery: Java applet, signed applications, DCOM, tiny binaries
- Users: some time investment in acquiring a compliant JVM
- Fat client: the portal is merely a data broker between distributed resources and your helper application
- Delivery: standalone applications of any sort (or even a veneer)
- Users: more significant time investment to install the helper app
35. Computational Physics: Complex Workflow
[Flowchart spanning code development, simulation development, production, and analysis: acquire code modules, configure and build, run many test jobs with regression testing and remote vis, report/fix bugs; set parameters and initial data and check correctness; select the largest resource and run for a week with remote vis and steering, steer/kill/restart as needed, archive TBs of data; select and stage data to a storage array, data-mine for novel results, compare with observation; papers and Nobel Prizes.]
36. Dynamic Grid Computing
[Diagram of scenarios: add more resources; queue time over, find a new machine; free CPUs!!; clone the job with a steered parameter; physicist has a new idea!]
37. Chapter V
- What's Next?
- Distributed applications with intelligent adaptation? Nomadic Grid entities?
38. New Paradigms
- Dynamically redistributed applications
- The code should be aware of its environment
- What resources are out there NOW, and what is their current state?
- What is my allocation?
- What is the bandwidth/latency between sites?
- The code should be able to make decisions on its own
- A slow part of my simulation can run asynchronously: spawn it off!
- New, more powerful resources just became available: migrate there!
- Machine went down: reconfigure and recover!
- Need more memory: get it by adding more machines!
- The code should be able to publish this information to the portal for tracking, monitoring, steering
- Unexpected event: notify users!
- Collaborators from around the world all connect and examine the simulation
- Two prototypical examples:
- Dynamic, adaptive distributed computing
- Cactus Worm: intelligent simulation migration
39. Distributed Computing
- Why do this?
- Capability: need more memory than any single machine has
- Throughput: for smaller jobs, can still be quicker than the queues
- Technology
- Globus GRAM for job submission/authentication
- MPICH-G2 for communications (native MPI/TCP)
- Cactus is simply compiled with the MPICH-G2 implementation of MPI: gmake cactus MPI=globus (a launch sketch follows this list)
- New Cactus communication technologies
- Overlap communication and computation
- The simulation dynamically adapts to the WAN
- Compression/buffer size for communication
- Extra ghost zones: communicate across the WAN every N timesteps
- Available generically: all applications/grid topologies
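For concreteness, a sketch of how such a distributed run might be built and launched. Hostnames, paths, process counts, and the configuration name are placeholders, and the RSL attributes follow my recollection of Globus/MPICH-G2 usage rather than anything quoted from the talk.

    # build a Cactus configuration against the MPICH-G2 (Globus) MPI
    gmake bbh-config MPI=globus
    gmake bbh

    # bbh.rsl -- a Globus RSL multi-request splitting one MPI job across two sites
    +
    ( &(resourceManagerContact="sp2.nersc.gov")
       (count=256)(jobtype=mpi)
       (executable="/home/user/Cactus/exe/cactus_bbh")
       (arguments="bbh.par") )
    ( &(resourceManagerContact="origin.aei.mpg.de")
       (count=128)(jobtype=mpi)
       (executable="/home/user/Cactus/exe/cactus_bbh")
       (arguments="bbh.par") )

    # launch: GSI single sign-on, then co-allocate both subjobs
    grid-proxy-init
    globusrun -f bbh.rsl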
40. Dynamic Adaptive Distributed Computation (with Argonne/U. Chicago)
- Large-scale physics calculation: for accuracy, need more resolution than the memory of one machine can provide
- OC-12 line (but only 2.5 MB/sec realized)
- This experiment:
- Einstein equations (but could be any Cactus application)
- Achieved:
- First runs: 15% scaling
- With new techniques: 70-85% scaling, 250 GF
41. Dynamic Adaptation
- Automatically adapt to bandwidth and latency issues
- The application has NO KNOWLEDGE of the machine(s) it is on, networks, etc.
- The adaptive techniques make NO assumptions about the network
- Issues:
- More intelligent adaption algorithms
- E.g., if network conditions change faster than the adaption
- Next: real BH run across Linux clusters for high-quality data for viz
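A minimal sketch of the "extra ghost zones" trick named in the distributed-computing slide above: with a halo NGHOST points wide, each local update invalidates one layer, so the expensive WAN exchange is only needed every NGHOST steps. The 1D decomposition, the trivial averaging stencil, and all names are illustrative, not the Cactus implementation; physical boundary conditions are omitted.

    #include <mpi.h>
    #include <string.h>

    #define NLOCAL 1024
    #define NGHOST 8                  /* wide halo: exchange every NGHOST steps */
    #define NTOT   (NLOCAL + 2*NGHOST)

    static double u[NTOT], unew[NTOT];

    /* swap halo regions with left/right neighbours (one WAN message each way) */
    static void exchange(int rank, int nproc)
    {
        MPI_Status st;
        int left  = (rank > 0)         ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nproc - 1) ? rank + 1 : MPI_PROC_NULL;

        MPI_Sendrecv(&u[NGHOST],        NGHOST, MPI_DOUBLE, left,  0,
                     &u[NGHOST+NLOCAL], NGHOST, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Sendrecv(&u[NLOCAL],        NGHOST, MPI_DOUBLE, right, 1,
                     &u[0],             NGHOST, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, &st);
    }

    int main(int argc, char **argv)
    {
        int rank, nproc, step, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        for (step = 0; step < 1000; step++) {
            int s = step % NGHOST;    /* sub-steps taken since the last exchange */
            if (s == 0)
                exchange(rank, nproc);

            /* the region we may update shrinks by one point per side per
             * sub-step; after NGHOST sub-steps it is exactly the owned interior */
            for (i = s + 1; i < NTOT - s - 1; i++)
                unew[i] = 0.5 * (u[i-1] + u[i+1]);   /* placeholder stencil */
            memcpy(&u[s+1], &unew[s+1], (NTOT - 2*(s+1)) * sizeof(double));
        }

        MPI_Finalize();
        return 0;
    }

The trade-off is redundant computation in the widened halo in exchange for fewer, larger messages, which is exactly what helps over a high-latency WAN link.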
42. Cactus Worm: Basic Scenario (live demo at http://www.cactuscode.org)
- Cactus simulation starts
- Queries a Grid Information Server, finds resources
- Makes an intelligent decision to move
- Locates a new resource and migrates
- Registers its new location with the GIS
- Continues around Europe
- Basic prototypical example of many things we want to do!
43. Migration due to Contract Violation (Foster, Angulo, Cactus Team)
44. GridLab: Enabling Dynamic Grid Applications
- Large EU project under negotiation with the EC
- Members: AEI, ZIB, PSNC, Lecce, Athens, Cardiff, Amsterdam, SZTAKI, Brno, ISI, Argonne, Wisconsin, Sun, Compaq
- Grid Application Toolkit for application developers and infrastructure (APIs/tools)
- Will be around 20 new Grid positions in Europe!! Look at www.gridlab.org for details
45. More Information
- The Science of Numerical Relativity (Chpt 1)
- http://jean-luc.ncsa.uiuc.edu/
- http://www.nersc.gov/
- http://dsc.discovery.com/schedule/episode.jsp?episode=23428000
- Cactus Community Code (Chpt 2)
- http://www.cactuscode.org/
- The Grid/Globus (Chpt 3)
- http://www.gridforum.org
- http://www.globus.org/
- http://www.zib.de/Visual/projects/TIKSL/ (the TIKSL project at ZIB)
- The ASC Portal (Chpt 4)
- http://www.ascportal.org
- http://www-itg.lbl.gov/grid/projects/GPDK/
- What's Next: Dynamic Grid Computing (Chpt 5)
- http://www.gridlab.org/