Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective
Transcript and Presenter's Notes

Title: Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective


1
Nuclear Stockpile Stewardship, Trials and Tribulations: A Computing Perspective
  • Dr. William M. Jones
  • Electrical and Computer Engineering Department
  • United States Naval Academy
  • Dr. Nathan A. DeBardeleben and John T. Daly
  • High Performance Computing Division
  • Los Alamos National Laboratory

2
U N C L A S S I F I E D
Los Alamos National Laboratory
LA-UR-07-6965
3
(No Transcript)
4
Tri-Labs
Advanced Simulation and Computing
5
Weapon Design and Testing
  • Massive data collection
  • Feedback into the design process

6
Types of Nuclear Testing
Image courtesy of Wikimedia Commons
7
Image courtesy of Wikimedia Commons
8
Estimated (actual numbers are classified)
Image courtesy of Wikimedia Commons
9
Weapon Test Bans
  • Limited Test Ban Treaty (1963)
  • Atmosphere, underwater, outer space
  • Underground testing still permitted
  • Comprehensive Test Ban Treaty (1996)
  • No detonation testing permitted at all
  • Implications
  • Validation of the existing stockpile
  • Creation of new weapons
  • Stockpile stewardship
  • Ensure safety and reliability

10
Enter High Performance Computing
  • Multi-physics simulations
  • Enormous computational complexity
  • Parallel and distributed computing

Images courtesy of LANL and ASC
11
Capacity Versus Capability
  • Capacity
  • Smaller / less expensive HPC systems
  • Modest computational requirements
  • Commodity Linux clusters
  • 4 year typical life cycle
  • Capability
  • Most powerful supercomputers
  • COTS and custom hardware
  • Major programmatic efforts
  • Some become capacity machines

BlueGene/L: 106,496 nodes, 212,992 processors
12
Cluster Computing
  • COTS computers
  • Computational resources
  • Network interconnect
  • Single supercomputer
  • Parallel libraries
  • File I/O services
  • Solve larger problems
  • Multiple users
  • Pervasive alternative
  • Multiple clusters ...

Small → Medium → Large
Image courtesy of Clemson University
13
Top500 Supercomputer Stats: Architecture Breakdown
14
(No Transcript)
15
(No Transcript)
16
Japan's Earth Simulator
17
Hybrid Computing Platforms
LANL's Roadrunner will be a capability machine (likely the fastest)
Tera/petaflop-scale computing is not cheap → we have to share!
Image courtesy of IBM and LANL
18
Multiple users, multiple jobs, multiple clusters
What happens to overall reliability at the petaflop scale?
19
Dismal Performance at Scale
(Chart: several capacity and capability systems, showing decreasing performance at scale)
Image courtesy of John Daly
20
Failures
So what does this mean?
Image courtesy of John Daly
21
Failures May Go Unnoticed
(Figure: the application silently stops making progress; all time until the failure is noticed is wasted)
22
  • What about fault-tolerance?
  • Suppose you could detect that an error occurred,
    migrate the job, and restart the job from last
    checkpoint.
  • How quickly would you need to determine that an
    interrupt occurred?

23
What is C/R? How Much Does It Cost?
R = f(detection latency + restart overhead). Let's conduct a simulation-based study.
Image courtesy of John Daly
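As background for this cost question, here is a minimal sketch of the widely cited first-order (Young/Daly) estimate of how often to checkpoint, given the dump time and the mean time between interrupts; the variable names and numbers below are illustrative assumptions, not values from this study.

  #include <math.h>
  #include <stdio.h>

  /* First-order optimal checkpoint interval (Young/Daly):
   *   tau_opt ~= sqrt(2 * delta * M)
   * where delta is the time to write one checkpoint and M is the mean time
   * between interrupts. Valid when delta << M. Values below are assumed. */
  int main(void) {
      double delta = 600.0;           /* 10-minute checkpoint dump (assumed) */
      double mtbf  = 8.0 * 3600.0;    /* 8-hour mean time between interrupts (assumed) */
      double tau   = sqrt(2.0 * delta * mtbf);
      printf("checkpoint roughly every %.0f minutes\n", tau / 60.0);
      return 0;
  }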
24
BeoSim: A Computational Grid Simulator
  • Parallel job scheduling research
  • Single and multiple clusters
  • Reliability studies
  • Java front-end, C back-end
  • Discrete-event simulator
  • Single-threaded
  • Parallel parameter studies
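A minimal sketch of the kind of single-threaded discrete-event loop a simulator like this uses; the event types and handlers here are hypothetical, not BeoSim's actual internals.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical event types for a cluster simulator. */
  typedef enum { JOB_ARRIVAL, JOB_COMPLETION, NODE_FAILURE } event_type;

  typedef struct event {
      double        time;   /* simulated time at which the event fires */
      event_type    type;
      struct event *next;   /* sorted linked list as the event queue */
  } event;

  static event *queue = NULL;

  /* Insert an event in time order (a production simulator would use a heap). */
  static void schedule(double time, event_type type) {
      event *e = malloc(sizeof *e);
      e->time = time; e->type = type;
      event **p = &queue;
      while (*p && (*p)->time <= time) p = &(*p)->next;
      e->next = *p; *p = e;
  }

  int main(void) {
      double now = 0.0;
      schedule(1.0, JOB_ARRIVAL);
      schedule(5.0, NODE_FAILURE);

      /* Core loop: pop the earliest event, advance the clock, handle it. */
      while (queue) {
          event *e = queue; queue = e->next;
          now = e->time;
          switch (e->type) {
          case JOB_ARRIVAL:    printf("%.1f: job arrives\n", now);   break;
          case JOB_COMPLETION: printf("%.1f: job completes\n", now); break;
          case NODE_FAILURE:   printf("%.1f: node fails\n", now);    break;
          }
          free(e);
      }
      return 0;
  }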
25
Lots and lots of Parameter Studies
26
BeoSim Framework
C-language implementation written from scratch (approx. 13K SLOC)
BeoSim: http://www.parl.clemson.edu/beosim
27
Cluster Model
28
Parallel IPC Model
  • All-to-all personalized
    2D
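To make the "all-to-all personalized" pattern concrete, here is a minimal MPI sketch in which every rank sends a distinct value to every other rank; it only illustrates the communication pattern the model represents, not BeoSim's internal IPC model.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Each rank sends one distinct integer to every other rank and
   * receives one distinct integer from every other rank. */
  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int *sendbuf = malloc(size * sizeof *sendbuf);
      int *recvbuf = malloc(size * sizeof *recvbuf);
      for (int i = 0; i < size; i++)
          sendbuf[i] = rank * 100 + i;   /* distinct payload per destination */

      /* All-to-all personalized exchange: sendbuf[i] goes to rank i. */
      MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

      printf("rank %d received from rank 0: %d\n", rank, recvbuf[0]);

      free(sendbuf); free(recvbuf);
      MPI_Finalize();
      return 0;
  }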

29
Network State Bookkeeping
30
Scheduling Policy: Conservative Backfilling
Out-of-order execution, with no delay to any earlier job's reserved start time; prevents job starvation (compared to FCFS).
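A minimal sketch of the core conservative-backfilling admission check, assuming an availability profile has already been built from the running jobs and every queued job's reservation; the data structures and numbers are illustrative, not the scheduler's actual implementation.

  #include <stdio.h>

  /* Availability profile: free nodes in each time interval, already
   * accounting for running jobs and all queued-job reservations
   * (that is what makes the policy "conservative"). */
  typedef struct { double start, end; int free_nodes; } slot;

  /* A job can be backfilled at time t iff enough nodes stay free for its
   * entire requested walltime, i.e. it cannot delay any reservation. */
  int can_backfill(const slot *profile, int nslots,
                   double t, double walltime, int nodes) {
      double finish = t + walltime;
      for (int i = 0; i < nslots; i++) {
          /* Only intervals that overlap [t, finish) matter. */
          if (profile[i].end <= t || profile[i].start >= finish) continue;
          if (profile[i].free_nodes < nodes) return 0;
      }
      return 1;
  }

  int main(void) {
      /* Hypothetical profile for a 128-node cluster: 32 nodes free now,
       * only 16 free once a reservation starts at t = 100. */
      slot profile[] = { {0.0, 100.0, 32}, {100.0, 1e9, 16} };
      printf("%d\n", can_backfill(profile, 2, 0.0,  50.0, 24)); /* fits: 1 */
      printf("%d\n", can_backfill(profile, 2, 0.0, 200.0, 24)); /* would delay a reservation: 0 */
      return 0;
  }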
31
Workload Distribution
(1926-node cluster)
Event-driven simulation of 4,000,000 jobs (using BeoSim)
32
Impact of Increasing Failure Rates
May seem negligible, but multiple interrupts impact overall throughput
33
Impact on Throughput
Significant reduction in queueing delays
CPdelta: time to determine that an interrupt occurred (min)
34
Impact on Execution Time
Marginal (1.8) vs. significant (13.5)
CPdelta: time to determine that an interrupt occurred (min)
35
Job Checkpoint and Restart
  • Save the state of the job out to disk
  • (Also allows partial runs)
  • If a failure occurs, restart from the last checkpoint
  • So what's the problem?
  • How do you know failure occurred?
  • Sometimes it's obvious, sometimes not!

Image courtesy of John Daly
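A minimal sketch of what application-level checkpoint/restart looks like in practice: periodically serialize the state to disk and, on restart, reload the newest checkpoint. The file name and state structure are hypothetical.

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical application state; a real code would save its full mesh,
   * field arrays, timestep counters, RNG state, etc. */
  typedef struct { long step; double t; double field[1024]; } state_t;

  /* Write atomically: dump to a temp file, then rename over the old one,
   * so a crash mid-write never corrupts the last good checkpoint. */
  int write_checkpoint(const state_t *s, const char *path) {
      char tmp[256];
      snprintf(tmp, sizeof tmp, "%s.tmp", path);
      FILE *f = fopen(tmp, "wb");
      if (!f) return -1;
      size_t ok = fwrite(s, sizeof *s, 1, f);
      fclose(f);
      if (ok != 1) return -1;
      return rename(tmp, path);
  }

  /* On restart, reload the last checkpoint if one exists. */
  int read_checkpoint(state_t *s, const char *path) {
      FILE *f = fopen(path, "rb");
      if (!f) return -1;               /* no checkpoint: start from scratch */
      size_t ok = fread(s, sizeof *s, 1, f);
      fclose(f);
      return ok == 1 ? 0 : -1;
  }

  int main(void) {
      state_t s;
      if (read_checkpoint(&s, "app.ckpt") != 0)
          memset(&s, 0, sizeof s);     /* fresh run */
      for (; s.step < 1000; s.step++) {
          s.t += 0.01;                 /* stand-in for real physics work */
          if (s.step % 100 == 0)
              write_checkpoint(&s, "app.ckpt");
      }
      return 0;
  }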
36
Application Monitoring
  • Is the job making progress?
  • CPU load
  • File I/O
  • Network I/O
  • Intrusiveness
  • Top secret computing platforms
  • Legacy simulation codes
  • Resistance to modifying the OS
  • What can we do?
  • Types of monitoring
  • Node level
  • System level
  • Application level
  • How costly is it?
  • How good does it need to be?

(Gratuitous images)
Images courtesy of ASC Image Library
37
Monitoring File I/O
  • Cons
  • Not all applications generate file I/O
  • Those that do don't necessarily write at a fixed interval
  • Pros
  • Easy to implement
  • Not intrusive
  • Portable implementation
  • Initial Opposition
  • Too easy
  • Couldn't be that easy
  • Simulation, implementation, deployment
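A minimal sketch of the file I/O monitoring idea: poll the modification time of the job's output file and flag a possible hang if it stops advancing. The output file name and stall threshold are assumptions for illustration.

  #include <stdio.h>
  #include <sys/stat.h>
  #include <time.h>
  #include <unistd.h>

  /* Watch a job's output file; if its modification time stops advancing
   * for longer than a threshold, flag a possible hang so the scheduler
   * (or an operator) can react. */
  int main(void) {
      const char *output = "job_output.dat";        /* assumed file name */
      const double stall_limit = 30.0 * 60.0;       /* 30 minutes without progress */

      for (;;) {
          struct stat st;
          if (stat(output, &st) == 0) {
              double idle = difftime(time(NULL), st.st_mtime);
              if (idle > stall_limit) {
                  fprintf(stderr, "possible hang: no file I/O for %.0f s\n", idle);
                  /* here one might notify the resource manager or trigger
                     a restart from the last checkpoint */
              }
          }
          sleep(60);   /* poll once a minute; cheap and non-intrusive */
      }
      return 0;
  }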

38
Conclusions
  • Minimizing interrupt detection latency
  • Immediate detection not necessary
  • Potential approaches
  • Less sophisticated
  • Less intrusive
  • More portable

(Simple script to monitor file I/O; actually used on LEP runs at LANL)
Application resilience is an unsolved problem; many research avenues remain open.
Image courtesy of John Daly and LANL
39
Thank you! Questions?
  • William M. Jones
  • http://www.parl.clemson.edu/beosim