Title: Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective
1Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective
- Dr. William M. Jones
- Electrical and Computer Engineering Department
- United States Naval Academy
- Dr. Nathan A. DeBardeleben and John T. Daly
- High Performance Computing Division
- Los Alamos National Laboratory
2UNCLASSIFIED
Los Alamos National Laboratory
LA-UR-07-6965
4Tri-Labs
Advanced Simulation and Computing
5Weapon Design and Testing
- Massive data collection
- Feedback into the design process
6Types of Nuclear Testing
Image courtesy of Wikimedia Commons
7Image courtesy of Wikimedia Commons
8Estimated (actual numbers are classified)
Image courtesy of Wikimedia Commons
9Weapon Test Bans
- Limited Test Ban Treaty (1963)
- Bans tests in the atmosphere, underwater, and in outer space
- Underground testing still permitted
- Comprehensive Test Ban Treaty (1996; U.S. testing moratorium since 1992)
- No detonation testing permitted at all
- Implications
- Validation of existing stockpile
- Creation of new weapons
- Stockpile stewardship
- Ensure safety and reliability
10Enter High Performance Computing
- Multi-physics simulations
- Enormous computational complexity
- Parallel and distributed computing
Images courtesy of LANL and ASC
11Capacity Versus Capability
- Capacity
- Smaller / less expensive HPC systems
- Modest computational requirements
- Commodity Linux clusters
- 4-year typical life cycle
- Capability
- Most powerful supercomputers
- COTS plus custom hardware
- Major programmatic efforts
- Some become capacity machines
BG/L: 106,496 nodes, 212,992 processors
12Cluster Computing
- COTS computers
- Computational resources
- Network interconnect
- Single supercomputer
- Parallel libraries
- File I/O services
- Solve larger problems
- Multiple users
- Pervasive alternative
- Multiple clusters ...
Small → Medium → Large
Image courtesy of Clemson University
13Top500 Supercomputer Stats Architecture Breakdown
16Japan's Earth Simulator
17Hybrid Computing Platforms
LANL's Roadrunner will be a capability machine (likely the fastest)
Tera/petaflop-scale computing is not cheap --> we have to share!
Image courtesy of IBM and LANL
18Multiple users, multiple jobs, multiple clusters
What happens to overall reliability at the
petaflop scale?
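The reliability question can be made concrete with back-of-the-envelope arithmetic. The following is a minimal sketch, assuming independent node failures at a constant rate (an assumption, not something the slides state); the 5-year node MTBF and the function names are hypothetical illustration values.

```python
def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of the whole machine if any single node failure interrupts it.

    Assumes independent, exponentially distributed node failures, so the
    failure rates add and the system MTBF is the node MTBF divided by N.
    """
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_hours / nodes


def expected_interrupts(job_hours: float, node_mtbf_hours: float, nodes: int) -> float:
    """Expected number of interrupts seen by a job spanning `nodes` nodes."""
    return job_hours / system_mtbf_hours(node_mtbf_hours, nodes)


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24  # hypothetical: one failure per node every 5 years
    for n in (128, 1024, 8192, 106_496):
        mtbf = system_mtbf_hours(node_mtbf, n)
        print(f"{n:>7} nodes: system MTBF = {mtbf:8.2f} h")
```

Under these assumptions the time between interrupts shrinks in direct proportion to node count, which is why a job that is reliable on a small capacity cluster may be interrupted many times on a capability machine.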
19Dismal Performance at Scale
(Figure labels: capacity and capability systems; annotation: decreasing performance)
Image courtesy of John Daly
20Failures
So what does this mean?
Image courtesy of John Daly
21Failures May Go Unnoticed
wasted time
Application stops making progress
22What about fault-tolerance?
- Suppose you could detect that an error occurred, migrate the job, and restart the job from the last checkpoint.
- How quickly would you need to determine that an interrupt occurred?
23What is C/R (Checkpoint/Restart)? How Much Does It Cost?
R = f(detection latency, restart overhead)
Let's conduct a simulation-based study
Image courtesy of John Daly
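To illustrate how detection latency and restart overhead enter the cost, here is a minimal sketch that reads R as the time needed to get a job computing again after an interrupt. The additive form, the "half an interval of rework" estimate, and the function name are assumptions made for illustration, not the model used in the talk.

```python
def time_lost_per_interrupt(checkpoint_interval: float,
                            detection_latency: float,
                            restart_overhead: float) -> float:
    """Estimate the wall-clock time lost to one interrupt (same units in and out).

    Assumption: the interrupt lands uniformly at random inside a checkpoint
    interval, so on average half an interval of work must be redone.
    """
    rework = checkpoint_interval / 2.0
    return rework + detection_latency + restart_overhead


if __name__ == "__main__":
    # Hypothetical values in minutes: 60-minute checkpoint interval, 5-minute restart.
    for latency in (0, 10, 30, 60, 120):
        lost = time_lost_per_interrupt(60, latency, 5)
        print(f"detection latency {latency:>3} min -> about {lost:.0f} min lost")
```

In this simple form the detection latency adds directly to every interrupt, so it dominates only once it grows large relative to the rework and restart terms.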
24BeoSim: A Computational Grid Simulator
- Parallel job scheduling research
- Single and multiple clusters
- Reliability studies
- Java front-end, C back-end
- Discrete event simulator
- Single-threaded
- Parallel parameter studies
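For readers unfamiliar with discrete event simulation, the core idea is a time-ordered event queue that a single thread drains in order. The sketch below is illustrative only: it is not BeoSim code (BeoSim's back end is written in C), and the class and function names are hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Single-threaded discrete event loop: always run the earliest event next."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for events at equal times
        self.now = 0.0

    def schedule(self, delay, action):
        """Schedule `action(sim)` to run `delay` time units after the current time."""
        heapq.heappush(self._heap, (self.now + delay, next(self._counter), action))

    def run(self):
        """Advance simulated time from event to event until none remain."""
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action(self)


def demo():
    """One job that arrives at t=5 and runs for 10 time units."""
    log = []
    sim = EventQueue()

    def finish(s):
        log.append((s.now, "finish"))

    def start(s):
        log.append((s.now, "start"))
        s.schedule(10.0, finish)

    sim.schedule(5.0, start)
    sim.run()
    return log
```

Because simulated time jumps directly between events, a long workload can be replayed quickly, and independent parameter settings can be run as separate single-threaded processes in parallel.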
25Lots and lots of Parameter Studies
26BeoSim Framework
C-language implementation written from scratch (approx. 13K SLOC)
BeoSim: http://www.parl.clemson.edu/beosim
27Cluster Model
28Parallel IPC Model
- All-to-all personalized
2D
29Network State Bookkeeping
30Scheduling Policy: Conservative Backfilling
- Out-of-order execution
- No delay to start time
- Prevents job starvation (compared to FCFS)
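The policy named on this slide can be sketched as follows: every job receives a reservation at submission time, at the earliest slot that does not delay any reservation already made. This is a simplified illustration, assuming user-supplied runtime estimates and omitting the schedule compression that normally happens when jobs finish early; it is not BeoSim's implementation, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    start: float
    end: float
    nodes: int


def free_nodes_at(t, reservations, total_nodes):
    """Nodes not claimed by any reservation active at time t."""
    used = sum(r.nodes for r in reservations if r.start <= t < r.end)
    return total_nodes - used


def earliest_start(now, nodes, runtime, reservations, total_nodes):
    """Earliest time the job fits for its whole runtime without moving others."""
    if nodes > total_nodes:
        raise ValueError("job is larger than the cluster")
    # Free capacity only increases when a reservation ends, so only `now`
    # and later reservation end times need to be tried as start times.
    candidates = sorted({now} | {r.end for r in reservations if r.end > now})
    for t in candidates:
        # Free capacity only decreases at reservation starts, so checking t and
        # every start inside the job's window covers the whole window.
        checks = [t] + [r.start for r in reservations if t < r.start < t + runtime]
        if all(free_nodes_at(c, reservations, total_nodes) >= nodes for c in checks):
            return t
    raise RuntimeError("unreachable: the cluster is empty after the last reservation")


def submit(now, nodes, runtime, reservations, total_nodes):
    """Reserve a slot for a new job and return the reservation."""
    start = earliest_start(now, nodes, runtime, reservations, total_nodes)
    res = Reservation(start, start + runtime, nodes)
    reservations.append(res)
    return res
```

For example, on a hypothetical 4-node cluster with a 3-node job running and a 4-node job waiting behind it, a short 1-node job can start immediately (out-of-order execution), while a long 1-node job cannot, because it would push back the 4-node job's reserved start.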
31Workload Distribution
(1926-node cluster)
Event-driven simulation using 4,000,000 jobs (using BeoSim)
32Impact of Increasing Failure Rates
A single interrupt may seem negligible, but multiple interrupts add up and impact throughput
33Impact on Throughput
significant reduction in queueing delays
CPdelta (time to determine that an interrupt occurred, in minutes)
34Impact on Execution Time
marginal (1.8)
significant (13.5)
CPdelta (time to determine that an interrupt occurred, in minutes)
35Job Checkpoint and Restart
- Save the state of the job out to disk
- (Also allows partial runs)
- If failure occurs, restart from last checkpoint
- So what's the problem?
- How do you know failure occurred?
- Sometimes it's obvious, sometimes not!
Image courtesy of John Daly
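The checkpoint/restart cycle described above can be sketched in a few lines. This is a toy illustration, assuming the job state fits in a small Python dictionary; production simulation codes write far larger state through parallel file I/O. The function names, the `fail_at` parameter, and the file layout are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, then rename, so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing periodically; `fail_at` simulates an interrupt."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated interrupt")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

Calling `run` again with the same path after an interrupt resumes from the last saved step; only the work done since that checkpoint is repeated. Note that the sketch restarts only because the failure raises an exception, which is exactly the "obvious" case; a hung job raises nothing.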
36Application Monitoring
- Is the job making progress?
- CPU load
- File I/O
- Network I/O
- Intrusiveness
- Top secret computing platforms
- Legacy simulation codes
- Resistance to modify OS
- What can we do?
- Types of monitoring
- Node level
- System level
- Application level
- How costly is it?
- How good does it need to be?
(Gratuitous images)
Images courtesy of ASC Image Library
37Monitoring File I/O
- Cons
- Not all applications generate file I/O
- Those that do, not necessarily at a fixed interval
- Pros
- Easy to implement
- Not intrusive
- Portable implementation
- Initial Opposition
- Too easy
- Couldn't be that easy
- Simulation, implementation, deployment
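A file I/O monitor of the kind discussed on this slide can be very small. The sketch below is a hypothetical illustration, not the script used at LANL: it assumes the job's output files are known in advance and treats "no file modified within a threshold" as a sign that the application may have stopped making progress.

```python
import os
import time


def seconds_since_last_write(paths, now=None):
    """Seconds since the most recent modification of any existing path.

    Returns None if none of the files exist yet (the job may not have
    produced output so far, so no conclusion can be drawn).
    """
    now = time.time() if now is None else now
    mtimes = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    if not mtimes:
        return None
    return now - max(mtimes)


def looks_stalled(paths, threshold_seconds, now=None):
    """True if output exists but nothing has been written for too long."""
    idle = seconds_since_last_write(paths, now)
    return idle is not None and idle > threshold_seconds


def watch(paths, threshold_seconds, poll_seconds=60):
    """Poll until the job looks stalled, then return the idle time in seconds."""
    while True:
        if looks_stalled(paths, threshold_seconds):
            return seconds_since_last_write(paths)
        time.sleep(poll_seconds)
```

The monitor only reads file metadata, so it needs no changes to the application or the operating system. The threshold has to be chosen per application, since not every code writes output at a fixed interval.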
38Conclusions
- Minimizing interrupt detection latency
- Immediate detection not necessary
- Potential approaches
- Less sophisticated
- Less intrusive
- More portable
(simple script to monitor file I/O, actually used on LEP runs at LANL)
Application resilience is an unsolved problem
Many research venues
Image courtesy of John Daly and LANL
39Thank you! Questions?
- William M. Jones
- http://www.parl.clemson.edu/beosim