Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective

Slides: 40

Provided by: parlCl

Transcript and Presenter's Notes

1
Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective

Dr. William M. Jones
Electrical and Computer Engineering Department
United States Naval Academy
Dr. Nathan A. DeBardeleben and John T. Daly
High Performance Computing Division
Los Alamos National Laboratory

2
UNCLASSIFIED
Los Alamos National Laboratory
LA-UR-07-6965
3
(No Transcript)
4
Tri-Labs
Advanced Simulation and Computing
5
Weapon Design and Testing

Massive data collection
Feedback into the design process

6
Types of Nuclear Testing
Image courtesy of Wikimedia Commons
7
Image courtesy of Wikimedia Commons
8
Estimated. Actual numbers are classified.
Image courtesy of Wikimedia Commons
9
Weapon Test Bans

Limited Test Ban Treaty (1963)
Atmosphere, underwater, outer space
Underground testing still permitted
Comprehensive Test Ban Treaty (opened for signature 1996; U.S. testing moratorium since 1992)
No detonation testing permitted at all
Implications
Validation of existing stockpile
Creation of new weapons
Stockpile stewardship
Ensure safety and reliability

10
Enter High Performance Computing

Multi-physics simulations
Enormous computational complexity
Parallel and distributed computing

Images courtesy of LANL and ASC
11
Capacity Versus Capability

Capacity
Smaller / less expensive HPC systems
Modest computational requirements
Commodity Linux clusters
4-year typical life cycle
Capability
Most powerful supercomputers
COTS and custom hardware
Major programmatic efforts
Some become capacity machines

BG/L: 106,496 nodes, 212,992 processors
12
Cluster Computing

COTS computers
Computational resources
Network interconnect
Single supercomputer
Parallel libraries
File I/O services
Solve larger problems
Multiple users
Pervasive alternative
Multiple clusters ...

Small -> Medium -> Large
Image courtesy of Clemson University
13
Top500 Supercomputer Stats: Architecture Breakdown
14
(No Transcript)
15
(No Transcript)
16
Japan's Earth Simulator
17
Hybrid Computing Platforms
LANL's Roadrunner will be a capability machine (likely the fastest)
Tera/petaflop-scale computing is not cheap --> We have to share!
Image courtesy of IBM and LANL
18
Multiple users, multiple jobs, multiple clusters
What happens to overall reliability at the
petaflop scale?
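
One way to make this question concrete is a back-of-the-envelope model. The sketch below assumes (this is not stated on the slide) that nodes fail independently with exponentially distributed lifetimes and that any single node failure interrupts the system, so failure rates simply add. The function name and the node MTBF used in the note are hypothetical.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """System-level mean time between failures, in hours.

    Assumption: independent, exponentially distributed node lifetimes,
    and any one node failure interrupts the whole system. Under that
    assumption the system failure rate is num_nodes times the node rate.
    """
    if node_mtbf_hours <= 0 or num_nodes <= 0:
        raise ValueError("node_mtbf_hours and num_nodes must be positive")
    return node_mtbf_hours / num_nodes
```

Under these assumptions, a hypothetical node MTBF of 5 years (43,800 hours) shrinks to about 0.44 hours between interrupts on a 100,000-node system.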
19
Dismal Performance at Scale
(Figure: performance at scale for capacity and capability systems, annotated "Decreasing performance")
Image courtesy of John Daly
20
Failures
So what does this mean?
Image courtesy of John Daly
21
Failures May Go Unnoticed
wasted time
Application stops making progress
22

What about fault-tolerance?
Suppose you could detect that an error occurred,
migrate the job, and restart the job from last
checkpoint.
How quickly would you need to determine that an
interrupt occurred?

23
What is C/R? How Much Does It Cost?
R = f(detection latency, restart overhead)
Let's conduct a simulation-based study
Image courtesy of John Daly
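
As a rough illustration of why detection latency matters for checkpoint/restart (C/R), the sketch below adds up the time lost to a single interrupt. The additive form and the "half a checkpoint interval of rework on average" term are simplifying assumptions made here, not the model used in the talk, and the function name is hypothetical.

```python
def expected_loss_per_interrupt(detection_latency: float,
                                restart_overhead: float,
                                checkpoint_interval: float) -> float:
    """Expected time lost to one interrupt (all arguments in the same unit).

    Assumptions (illustrative only):
      - the interrupt lands uniformly at random within a checkpoint
        interval, so on average half an interval of work is redone;
      - detection latency and restart overhead simply add.
    """
    if min(detection_latency, restart_overhead, checkpoint_interval) < 0:
        raise ValueError("all durations must be non-negative")
    rework = checkpoint_interval / 2.0
    return detection_latency + restart_overhead + rework
```

For example, with a 10-minute detection latency, a 5-minute restart, and checkpoints every 60 minutes, this model charges 45 minutes per interrupt; only the first term is under the monitor's control.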
24
BeoSim: A Computational Grid Simulator
Parallel job scheduling research
Single and multiple clusters
Reliability studies
Java front-end, C back-end
Discrete event simulator
Single-threaded
Parallel parameter studies
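
For readers unfamiliar with the term, the core of a single-threaded discrete event simulator can be sketched in a few lines. This is a generic illustration in Python, not BeoSim's code (BeoSim's back-end is in C); the class name, method names, and the 30-unit runtime are all hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Minimal single-threaded discrete-event simulator core."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps
        self.now = 0.0

    def schedule(self, delay, action):
        """Schedule action(sim) to fire `delay` time units from now."""
        if delay < 0:
            raise ValueError("delay must be non-negative")
        heapq.heappush(self._heap, (self.now + delay, next(self._counter), action))

    def run(self):
        """Pop events in timestamp order, jumping the clock to each one."""
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action(self)


def demo():
    """One job arrives at t=5 and finishes 30 time units later."""
    log = []
    sim = EventQueue()

    def job_finishes(s):
        log.append((s.now, "finish"))

    def job_arrives(s):
        log.append((s.now, "arrive"))
        s.schedule(30.0, job_finishes)  # hypothetical runtime

    sim.schedule(5.0, job_arrives)
    sim.run()
    return log
```

Because the clock jumps from event to event rather than ticking, simulated time costs nothing between events, which is what makes large parameter studies practical.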
25
Lots and lots of Parameter Studies
26
BeoSim Framework
C-language implementation written from scratch (approx. 13K SLOC)
BeoSim: http://www.parl.clemson.edu/beosim
27
Cluster Model
28
Parallel IPC Model

All-to-all personalized
2D

29
Network State Bookkeeping
30
Scheduling Policy: Conservative Backfilling
Out of order execution
No delay to start time
Prevents job starvation (compared to FCFS)
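
The idea behind conservative backfilling can be sketched as follows: each job, in arrival order, receives a reservation at the earliest time it fits for its whole runtime without moving any reservation already made. This is a simplified illustration, not BeoSim's scheduler; it assumes exact runtime estimates and a single cluster, and all function names and the job tuple layout are hypothetical.

```python
def used_at(reservations, t):
    """Nodes in use at time t, given (start, end, nodes) reservations."""
    return sum(n for (s, e, n) in reservations if s <= t < e)


def fits(reservations, total_nodes, start, runtime, nodes):
    """True if `nodes` are free throughout [start, start + runtime)."""
    end = start + runtime
    # Usage only rises where a reservation starts, so checking the start
    # time and every reservation start inside the window is sufficient.
    points = [start] + [s for (s, _, _) in reservations if start < s < end]
    return all(used_at(reservations, t) + nodes <= total_nodes for t in points)


def conservative_backfill(jobs, total_nodes):
    """jobs: (name, submit, nodes, runtime) tuples in arrival order.

    Returns {name: start_time}. Earlier reservations are never delayed.
    """
    reservations = []
    starts = {}
    for name, submit, nodes, runtime in jobs:
        if nodes > total_nodes:
            raise ValueError(f"job {name} needs more nodes than the cluster has")
        # The earliest feasible start is the submit time or the end of
        # some existing reservation (the only times usage can drop).
        candidates = sorted({submit} | {e for (_, e, _) in reservations if e > submit})
        start = next(t for t in candidates
                     if fits(reservations, total_nodes, t, runtime, nodes))
        reservations.append((start, start + runtime, nodes))
        starts[name] = start
    return starts
```

In a 10-node example with jobs A (6 nodes, 10 units, submitted at 0), B (8 nodes, 5 units, submitted at 1), and C (4 nodes, 8 units, submitted at 2), C runs ahead of B in the hole beside A, while B's reserved start at time 10 is unchanged.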
31
Workload Distribution
(1926-node cluster)
Event-driven simulation using 4,000,000 jobs (using BeoSim)
32
Impact of Increasing Failure Rates
May seem negligible, but multiple interrupts impact throughput
33
Impact on Throughput
significant reduction in queueing delays
CPdelta: time to determine that an interrupt occurred (min)
34
Impact on Execution Time
marginal (1.8)
significant (13.5)
CPdelta: time to determine that an interrupt occurred (min)
35
Job Checkpoint and Restart

Save the state of the job out to disk
(Also allows partial runs)
If failure occurs, restart from last checkpoint

So what's the problem?
How do you know failure occurred?
Sometimes it's obvious, sometimes not!

Image courtesy of John Daly
36
Application Monitoring

Is the job making progress?
CPU load
File I/O
Network I/O
Intrusiveness
Top secret computing platforms
Legacy simulation codes
Resistance to modify OS
What can we do?
Types of monitoring
Node level
System level
Application level
How costly is it?
How good does it need to be?

(Gratuitous images)
Images courtesy of ASC Image Library
37
Monitoring File I/O

Cons
Not all applications generate file I/O
Those that do, not necessarily at a fixed
interval
Pros
Easy to implement
Not intrusive
Portable implementation
Initial Opposition
Too easy
Couldn't be that easy
Simulation, implementation, deployment
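
A monitor of this kind can be very small. The sketch below is not the script used at LANL; it is a hypothetical illustration that treats a job as stalled when none of its output files has been modified within a chosen quiet period. The function names and the threshold are assumptions, and picking the threshold runs into the "not necessarily at a fixed interval" caveat above.

```python
import os
import time


def seconds_since_last_write(paths, now=None):
    """Seconds since the most recent modification among `paths`.

    Returns None if none of the paths exist yet.
    """
    now = time.time() if now is None else now
    mtimes = [os.path.getmtime(p) for p in paths if os.path.exists(p)]
    if not mtimes:
        return None
    return now - max(mtimes)


def is_stalled(paths, max_quiet_seconds, now=None):
    """True if output exists but nothing was written for too long."""
    quiet = seconds_since_last_write(paths, now)
    return quiet is not None and quiet > max_quiet_seconds
```

A wrapper could call `is_stalled` periodically from outside the application, which is why the approach needs no changes to the simulation code or the operating system.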

38
Conclusions

Minimizing interrupt detection latency
Immediate detection not necessary
Potential approaches
Less sophisticated
Less intrusive
More portable

(Simple script to monitor file I/O) (actually used on LEP runs at LANL)
Application resilience is an unsolved problem
Many research venues
Image courtesy of John Daly and LANL
39
Thank you! Questions?