Considerations for Scalable CAE on the SGI ccNUMA Architecture
1
Considerations for Scalable CAE on the SGI ccNUMA Architecture
Stan Posey, Applications Market Development
Cheng Liao, Principal Scientist, FEA Applications
Christian Tanasescu, CAE Applications Manager
2
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
3
Motivation for CAE Technology
Economics: physical prototyping costs continue to increase, and an engineer's time is more expensive than simulation tools.

MSC/NASTRAN Simulation Costs (Source: General Motors)
  • 1960: $30,000 per simulation (mainframe era)
  • 1999: $0.02 per simulation (workstations and servers)

CAE Engineer vs. System Costs (Source: Detroit Big 3)
  • Engineer: $36/hr
  • System: $1.5/hr

[Chart, 1960 to 2000: the cost of CAE simulation falls as mainframes give way to workstations and servers, plotted alongside the cost of the CAE engineer and the cost of physical prototyping.]
4
Recent Technology Achievements
Rapid CAE Advancement from 1996 to 1999
Computer Hardware Advances
  • Processors: ability to hide system latency
  • Architecture: ccNUMA crossbar switch replaces the shared bus
5
Recent History of Parallel Computing
Late 1980s: Shared Memory Parallel
  • Hardware: bus-based shared memory parallel (SMP)
  • Parallel model: compiler-enabled loop level (SMP fine grain)
  • Characteristics: low scalability (2p to 6p) but easy to program
  • Limitations: expensive memory for vector architectures

Early 1990s: Distributed Memory Parallel
  • Hardware: MPP and cluster distributed memory parallel (DMP)
  • Parallel model: DMP coarse grain through explicit message passing
  • Characteristics: high scalability (>64p) but difficult to program
  • Limitations: commercial CAE applications generally unavailable

Late 1990s: Distributed Shared Memory Parallel
  • Hardware: physically DMP but logically SMP (ccNUMA)
  • Parallel model: SMP fine grain; DMP and SMP coarse grain
  • Characteristics: high scalability and easy to program
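As a minimal illustration of the loop-level (fine grain) style, here is a sketch in C with OpenMP, the standardized successor to the vendor loop directives of that era; array names and sizes are illustrative, not from the presentation. The coarse-grain message-passing style is sketched later, alongside the domain-decomposition slide.

```c
/* Fine-grain (loop-level) SMP parallelism: the compiler/runtime
 * splits the iteration space across threads sharing one address
 * space, so no explicit data movement is needed.
 * Illustrative sketch; compile with: cc -fopenmp axpy.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    const double a = 2.0;

    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* One directive parallelizes the loop, easy to program, but
     * scalability is limited by memory bandwidth and granularity. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %g using up to %d threads\n", y[0], omp_get_max_threads());
    return 0;
}
```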
6
Origin ccNUMA Architecture Basics
Features of ccNUMA Multi-purpose Architecture
Detail of Two Node (w/Router) Architecture (32p
Topology)
[Diagram: two nodes, each with two processors and their caches, a local switch, main memory with its directory (Dir), and I/O; the two nodes share a router that connects into the global switch interconnect of the 32p topology.]
7
Parallel Computing with ccNUMA
Features of ccNUMA Multi-purpose Architecture
  • Origin2000 ccNUMA available since 1996
  • Non-blocking crossbar switch as the interconnect fabric
  • High levels of scalability over shared-bus SMP
  • Physically DMP but logically SMP (synchronized cache memories)
  • 2 to 512 MIPS R12000/400 MHz processors with 8MB L2 cache
  • High memory bandwidth (1.6 GB/s) and scalable I/O
  • Distributed and shared memory (fine and coarse grain) parallel models, which can be combined as in the sketch below

[Photo: Origin2000/256 system]
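Because the machine is physically distributed but logically shared, the coarse- and fine-grain models can be combined in a single job. A minimal hybrid MPI+OpenMP sketch (workload and names are illustrative, not from the presentation):

```c
/* Hybrid coarse/fine grain parallelism as ccNUMA permits: MPI
 * processes (coarse grain) each spawn OpenMP threads (fine grain).
 * Illustrative sketch; compile with: mpicc -fopenmp hybrid.c */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0;

    /* Fine grain: threads share this process's memory. */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += nprocs)
        local += 1.0 / (1.0 + (double)i);   /* arbitrary work */

    /* Coarse grain: explicit communication between processes. */
    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %g over %d processes\n", total, nprocs);

    MPI_Finalize();
    return 0;
}
```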
8
Recent Technology Achievements
Rapid CAE Advancement from 1996 to 1999
Computer Hardware Advances
  • Processors: ability to hide system latency
  • Architecture: ccNUMA crossbar switch replaces the shared bus

Application Software Advances
  • Implicit FEA: sparse solvers increase performance by 10-fold
  • Explicit FEA: domain parallel increases performance by 10-fold
  • CFD: scalability increases performance by 100-fold
  • Meshing: automatic and robust tetra meshing
9
Characterization of CAE Applications
[Chart: degree of parallelism (low to high) vs. compute intensity in flops per word of memory traffic (0.1 to 1000, from memory-bandwidth-bound to cache-friendly). CFD codes (OVERFLOW, FLUENT, STAR-CD) sit at high parallelism; explicit FEA (LS-DYNA, PAM-CRASH, RADIOSS) and implicit FEA direct frequency response (MSC.Nastran SOL 108) in the middle; implicit FEA statics and modal frequency response (MARC, ADINA, ANSYS, ABAQUS, MSC.Nastran SOL 101, 103, 111) at low parallelism.]
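To make the horizontal axis concrete: a vector update such as DAXPY performs 2 flops while moving 3 words per element, an intensity near 0.67 flops/word and firmly memory-bandwidth-bound, while blocked dense kernels reuse cached data and reach far higher intensities. A small illustration of the arithmetic (not from the presentation):

```c
/* Compute intensity (flops per word of memory traffic) for two
 * kernels, illustrating the chart's x-axis. Illustrative only. */
#include <stdio.h>

int main(void) {
    /* DAXPY: y[i] = a*x[i] + y[i]
     * Per element: 2 flops; traffic: read x, read y, write y = 3 words. */
    double daxpy_intensity = 2.0 / 3.0;

    /* Blocked matrix multiply with b x b blocks held in cache:
     * 2*b^3 flops against roughly 3*b^2 words moved per block pair. */
    double b = 64.0;
    double blocked_mm_intensity = (2.0 * b * b * b) / (3.0 * b * b);

    printf("DAXPY: %.2f flops/word (memory-BW bound)\n", daxpy_intensity);
    printf("Blocked GEMM (b=64): %.1f flops/word (cache-friendly)\n",
           blocked_mm_intensity);
    return 0;
}
```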
10
Characterization of CAE Applications
[Chart: the same parallelism vs. compute intensity map, annotated with architecture regimes: "MP SCALAR" toward the high-parallelism codes and "VECTOR" toward the memory-bandwidth-bound implicit FEA codes.]
11
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
12
Scalability Emerging for all CAE
Scalable CAE: Domain Decomposition Parallel
  • Implicit FEA: ABAQUS, ANSYS, MSC.Marc, MSC.Nastran
  • Explicit FEA: LS-DYNA, PAM-CRASH, RADIOSS
  • General CFD: CFX, FLUENT, STAR-CD
Domain Parallel Example: compressible 2D flow over a wedge, partitioned into 4 domains for parallel execution on 4 processors. A coarse-grain sketch of this pattern follows the diagram.

[Diagram: the flow field split into domains 1 through 4, each assigned to one of CPU1 through CPU4 in the system.]
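A coarse-grain DMP sketch of the same pattern, using a hypothetical 1D decomposition rather than the codes' actual partitioning: each rank owns a slab of cells plus ghost cells, and exchanges boundary values with its neighbors every iteration.

```c
/* Domain-decomposition halo exchange: each MPI rank owns a slab of
 * the field plus one ghost cell per side, exchanged with neighbors
 * each iteration. Hypothetical 1D sketch; compile with: mpicc halo.c */
#include <stdio.h>
#include <mpi.h>

#define NLOCAL 1000   /* interior cells per rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* u[0] and u[NLOCAL+1] are ghost cells. */
    double u[NLOCAL + 2];
    for (int i = 0; i <= NLOCAL + 1; i++) u[i] = (double)rank;

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my boundary cells; receive neighbors' into my ghosts. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... update interior cells using the ghost values ... */

    MPI_Finalize();
    return 0;
}
```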
13
Parallel Scalability in CAE
[Chart: usable vs. peak parallelism on a CPU axis from 1 to 512. CFD codes reach the highest CPU counts, crash codes follow, and MSC.Nastran trails, with the SMP solutions (SOL 101, 103, 108 in v70.5) near the bottom and the DMP SOL 108 in v70.7 extending well beyond.]
14
Considerations for Scalable CAE
Sources that Inhibit Efficient Parallelism
Source: computational load imbalance
Solution: nearly equal-sized partitions (see the sketch below)

Source: communication overhead between neighboring partitions
Solution: minimize communication between adjacent cells on different CPUs

Source: data and process placement
Solution: enforce memory-process affinity

Source: message-passing performance (MPICH latency 31 µs)
Solution: latency and bandwidth awareness (SGI MPI 3.1 latency 12 µs)

Left unaddressed, these sources limit scaling to 16p only; with the solutions applied, scaling to 64p!
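A minimal sketch of the first solution, dividing n cells into nearly equal partitions so that no processor waits on a larger neighbor (hypothetical helper, not from any of the named codes):

```c
/* Nearly equal-sized 1D partitioning: distribute n cells over p
 * processors so partition sizes differ by at most one cell.
 * Hypothetical helper, illustrative only. */
#include <stdio.h>

/* First cell index owned by partition `rank` out of `p`. */
static long partition_start(long n, int p, int rank) {
    long base = n / p, rem = n % p;
    /* The first `rem` partitions each get one extra cell. */
    return rank * base + (rank < rem ? rank : rem);
}

int main(void) {
    long n = 29000000;   /* e.g., a 29M-cell model */
    int p = 240;
    for (int r = 0; r < 3; r++) {
        long lo = partition_start(n, p, r);
        long hi = partition_start(n, p, r + 1);
        printf("rank %d: cells [%ld, %ld) -> %ld cells\n", r, lo, hi, hi - lo);
    }
    return 0;
}
```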
15
Considerations for Scalable CAE
Processor-Memory Affinity (Data Placement)
Theory: the system will place data and execution threads together properly, and will migrate data to follow the executing process.

Real life (32p Origin 2000): the process migrates, but its data stays behind.
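A common remedy on first-touch ccNUMA systems is to initialize data in parallel so that each page is first touched, and therefore placed, by the thread that will later use it; SGI also provided explicit placement tools such as dplace. A minimal OpenMP sketch (illustrative):

```c
/* First-touch placement: on ccNUMA, a page is allocated on the node
 * of the thread that first touches it. Initializing in parallel with
 * the same schedule as the compute loop keeps data near its users.
 * Illustrative sketch; compile with: cc -fopenmp firsttouch.c */
#include <stdlib.h>
#include <stdio.h>

#define N 10000000

int main(void) {
    double *x = malloc(N * sizeof *x);
    if (!x) return 1;

    /* Parallel initialization: each thread first-touches the pages it
     * will later compute on, so they land in its node's local memory
     * (instead of all on the master thread's node). */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        x[i] = 0.0;

    /* Compute loop with the same static schedule: mostly local access. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        x[i] = x[i] + 1.0;

    printf("x[N-1] = %g\n", x[N - 1]);
    free(x);
    return 0;
}
```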
16
FLUENT Scalability on ccNUMA
FLUENT Scalability Study of SSI vs. Cluster

Software: FLUENT 5.1.1
CFD model: external aerodynamics, 3D, k-ε turbulence, segregated, incompressible, isothermal, 29M cells

Time per iteration in seconds (speedup relative to 10 CPUs):

CPUs     10          30          60         120          240
SSI      381 (1.0)   99 (3.9)    67 (5.7)   29 (13.1)    18 (21.2)
4 x 64   424 (1.0)   139 (3.0)   72 (5.9)   39 (10.9)    49 (8.7)
The largest FLUENT automotive case achieved near-ideal scaling on an SGI 2800/256.
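The speedups in the table can be read as parallel efficiency by dividing by the CPU-count ratio: at 240 CPUs the SSI run retains roughly 88% efficiency relative to 10 CPUs, while the cluster drops to about 36%. A small helper over the figures above (illustrative):

```c
/* Parallel efficiency from the FLUENT table: speedup relative to the
 * 10-CPU run, divided by the CPU-count ratio. Illustrative helper. */
#include <stdio.h>

int main(void) {
    int cpus[] = {10, 30, 60, 120, 240};
    double ssi[]     = {381, 99, 67, 29, 18};   /* sec/iteration, SSI */
    double cluster[] = {424, 139, 72, 39, 49};  /* sec/iteration, 4x64 */

    for (int i = 0; i < 5; i++) {
        double ratio = cpus[i] / 10.0;
        printf("%3d CPUs: SSI eff %.0f%%, cluster eff %.0f%%\n",
               cpus[i],
               100.0 * (ssi[0] / ssi[i]) / ratio,
               100.0 * (cluster[0] / cluster[i]) / ratio);
    }
    return 0;
}
```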
17
SSI Advantage for CFD with MPI
Single System Image (SSI) Latency (256-CPU SSI)

CPUs                 8       16      32      64      128     256
Shared memory (ns)   528     641     710     796     903     1200
MPI (ns)             19000   23000   26000   29000   34000   44000

4 x 64 Cluster Latency

HIPPI osBYPASS (ns)  139000
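Latency figures like these are commonly measured with a ping-pong microbenchmark: two ranks bounce a small message, and half the average round trip approximates the one-way latency. A minimal sketch (illustrative, not SGI's benchmark code):

```c
/* Ping-pong latency microbenchmark: ranks 0 and 1 bounce a 1-byte
 * message; half the average round trip approximates one-way latency.
 * Illustrative sketch; run with 2 ranks: mpirun -np 2 ./pingpong */
#include <stdio.h>
#include <mpi.h>

#define REPS 10000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double one_way = (MPI_Wtime() - t0) / (2.0 * REPS);
        printf("one-way latency: %.1f us\n", one_way * 1e6);
    }

    MPI_Finalize();
    return 0;
}
```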
18
Grand Scale HPC NASA and Boeing
NASA Ames Research Center
Boeing Commercial Aircraft
OVERFLOW: Complete Boeing 747 Aerodynamics Simulation
  • Problem: 35M points, 160 zones
  • Largest model in NASA history; achieved 60 GFLOP/s on an SGI 2800/512 with linear scaling (Oct 99)

[Chart: performance in GFLOP/s (0 to 75) vs. number of CPUs (0 to 512), peaking at 60 GFLOP/s, shown against the FY98 milestone and the C916/16 OVERFLOW limit.]
19
Computational Requirements for MSC.Nastran
Share of demand by compute task:

Compute Task           Memory Bandwidth   CPU Cycles
Sparse Direct Solver     7%                 93%
Lanczos Solver          60%                 40%
Iterative Solver        83%                 17%
I/O Activity           100%                  0%
20
MSC.Nastran Scalability on ccNUMA
MSC/NASTRAN MPI-Based Scalability for SOL 103, 111
  • Typical scalability: 2x to 3x on 8p, less for SOL 111

MSC/NASTRAN MPI-Based Scalability for SOL 108
  • Independent frequency steps, naturally parallel
  • File and memory space not shared
  • Near-linear parallel scalability
  • Improved accuracy over SOL 111 with increasing frequency
  • Released on SGI with v70.7 (Oct 99)

21
MSC.Nastran Scalability on ccNUMA
Parallel Schematics: parallel schemes for an excitation frequency of 200Hz on a 4-CPU system.

MSC/NASTRAN MPI-Based Scalability for SOL 111: modes spanning 0 to 400Hz are distributed across CPUs (modes / CPU):
  • 150 modes: CPU 1
  • 350 modes: CPU 2
  • 300 modes: CPU 3
  • 200 modes: CPU 4

MSC/NASTRAN MPI-Based Scalability for SOL 108: the 0 to 200Hz range is divided into equal bands (freqs / CPU):
  • 1 - 50: CPU 1
  • 51 - 100: CPU 2
  • 101 - 150: CPU 3
  • 151 - 200: CPU 4
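Because the frequency steps are independent, the SOL 108 scheme amounts to a block distribution of steps over MPI ranks. A hypothetical sketch of that bookkeeping (not MSC.Nastran's actual code; solve_frequency is a stand-in):

```c
/* Block distribution of independent frequency steps over MPI ranks,
 * the scheme behind SOL 108's natural parallelism. Hypothetical
 * sketch, not MSC.Nastran source; compile with: mpicc freqsteps.c */
#include <stdio.h>
#include <mpi.h>

#define NSTEPS 200   /* e.g., frequency steps 1..200 */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Contiguous block of steps for this rank (sizes differ by <= 1). */
    int base = NSTEPS / nprocs, rem = NSTEPS % nprocs;
    int lo = rank * base + (rank < rem ? rank : rem) + 1;
    int hi = lo + base + (rank < rem ? 1 : 0) - 1;

    for (int step = lo; step <= hi; step++) {
        /* solve_frequency(step): each step is an independent solve
         * with its own file and memory space, so no communication. */
    }
    printf("rank %d: steps %d..%d\n", rank, lo, hi);

    MPI_Finalize();
    return 0;
}
```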
22
MSC.Nastran Scalability on ccNUMA
SOL 108 Comparison with Conventional NVH (SOL 111 on T90)

Cray T90 baseline results, SOL 111: DOF 525K; eigensolution 2714 modes; freq steps 96; elapsed time 31610 sec

SOL 108 parallel results:

CPUs   Elapsed Time (s)   Parallel Speed-up
1      120720             1.0
2      61680              2.0
4      32160              3.8
8      17387              6.9
16     10387              11.6 (*)

(*) measured on populated nodes
23
MSC.Nastran Scalability on ccNUMA
The Future of Automotive NVH Modeling
MSC.Nastran Parallel Scalability for Direct Frequency Response (SOL 108)

Model description: BIW model, SOL 108, 536K DOF, 96 freq steps
Run statistics (per MPI process): memory 340 MB; FFIO cache 128 MB; disk space 3.6 GB; 2 processes per node

CPUs   Elapsed Time (h)   Parallel Speed-up
1      31.7               1.0
8      4.1                7.8
16     2.2                14.2
32     1.4                22.6
24
Future Automotive NVH Modeling
Higher excitation frequencies of interest will increase DOF and modal density beyond the practical limits of SOL 103 and 111.

[Chart: elapsed time vs. frequency for modal frequency response (SOL 103, 111) and direct frequency response (SOL 108), marking the 199X-generation and 200X-generation model regimes.]
25
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
26
Economics of HPC Rapidly Changing
SGI Partnership with the HPC Community on a Technology Roadmap

[Roadmap diagram: functionality migrates from UNICOS/Vector to IRIX/MIPS SSI and on to Linux/IA-64 and clustered SSI, with capability features carried forward into general availability.]
27
HPC Architecture Roadmap at SGI
SN-MIPS Features of Next-Generation ccNUMA
  • Bandwidth improvement of 2x over Origin2000
  • System support for IRIX/MIPS or Linux/IA-64
  • Modular design allows subsystem upgrades without a forklift
  • Latency decrease of 50% over Origin2000

Next-Generation IRIX Features and Improvements
  • Shared memory to 512 processors and beyond
  • RAS enhancements: resiliency and hot swap
  • Data center management: scheduling, accounting
  • HPC clustering: GSN, CXFS shared file system
28
Characterization of CAE Applications
[Chart: the same parallelism vs. compute intensity map, annotated with the region expected to benefit from SN-MIPS, near the high-parallelism CFD codes.]
29
Characterization of CAE Applications
[Chart: the same map, now annotated with both the SN-MIPS benefit region (near the high-parallelism CFD codes) and an SN-IA benefit region (near the explicit FEA and direct-frequency implicit FEA codes).]
30
Architecture Mix for Automotive HPC
1999: 2.9 TFlops installed in automotive OEMs worldwide
1997: 1.1 TFlops installed in automotive OEMs worldwide
(Current as of Sep 1999)
31
Automotive Industry HPC Investments
GM and DaimlerChrysler each grew capacity by more than 2x over the past year.
32
Future Directions in CAE Applications
Meta-Computing with Explicit FEA: non-deterministic methods for improved FEA simulation

Los Alamos and DOE Applied Engineering Analysis: a stochastic simulation of 18 CPU-years completed in 3 days on ASCI Blue Mountain. USDOE-supported research achieved the first-ever full-scale ABAQUS/Explicit simulation of nuclear weapons impact response on the Origin/6144 ASCI system (Feb 00).

Ford Motor SRL and NASA Langley: optimization of a vehicle body for NVH and crash completed 9 CPU-months of RADIOSS and MSC.Nastran overnight with a response surface technique (Apr 00).

BMW Body Engineering: 672 MIPS CPUs dedicated to stochastic crash simulation with PAM-CRASH (Jan 00).
33
Meta-Computing with Explicit FEA
Objective
  • Manage design uncertainty from variability
  • Scatter in materials, loading, test conditions
  • Non-deterministic simulation of vehicle
    population
  • Meta-computing on SSI or large cluster
  • Improved design space exploration
  • Moving design towards target parameters

[Diagram: distribution of vehicle performance across the simulated population, from most likely to unlikely outcomes.]
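A minimal sketch of the non-deterministic approach (hypothetical: solve_crash is a stand-in for a real solver, and the parameter ranges are invented): sample the scatter in material and loading inputs, run one deterministic simulation per sample, and study the spread of the response.

```c
/* Stochastic (Monte Carlo) simulation sketch: sample scatter in
 * material/loading inputs and farm out one deterministic run per
 * sample. Hypothetical; solve_crash() stands in for a real solver. */
#include <stdio.h>
#include <stdlib.h>

/* Uniform sample in [lo, hi]. */
static double sample(double lo, double hi) {
    return lo + (hi - lo) * ((double)rand() / RAND_MAX);
}

/* Placeholder for an expensive crash run; returns a response metric. */
static double solve_crash(double yield_stress, double impact_speed) {
    return impact_speed * impact_speed / yield_stress;   /* stand-in */
}

int main(void) {
    srand(42);
    int nsamples = 100;   /* size of the vehicle "population" */
    double mean = 0.0;

    for (int i = 0; i < nsamples; i++) {
        /* Scatter in material and test condition (invented ranges). */
        double ys = sample(240.0, 260.0);   /* MPa */
        double v  = sample(15.0, 16.5);     /* m/s */
        mean += solve_crash(ys, v);
    }
    printf("mean response over %d samples: %g\n", nsamples, mean / nsamples);
    return 0;
}
```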
34
Grand Scale HPC NASA and Ford
NVH Crash Optimization of Vehicle Body Overnight
  • Ford body-in-prime (BIP) model of 390K DOF
  • MSC.Nastran for NVH, 30 design variables
  • RADIOSS for crash, 20 design variables
  • 10 design variables in common
  • Sensitivity based Taylor approx. for NVH
  • Polynomial response surface for crash

Achieved overnight BIP optimization on an SGI 2800/256, with an equivalent yield of 9 months of CPU time.
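A sketch of the response-surface idea (the quadratic form is generic; the coefficients and variable count are invented, not Ford's model): fit a cheap polynomial to a set of expensive crash runs, then let the optimizer evaluate the polynomial instead of the solver.

```c
/* Polynomial response surface sketch: approximate an expensive crash
 * response with a fitted quadratic in the design variables, then
 * evaluate the surrogate cheaply inside the optimizer. Hypothetical
 * form and coefficients, illustrative only. */
#include <stdio.h>

#define NVARS 3   /* stand-in for the 20 crash design variables */

/* Quadratic surrogate: y = b0 + sum b[i]*x[i] + sum c[i][j]*x[i]*x[j].
 * Coefficients would come from a least-squares fit over sampled runs. */
static double surrogate(const double x[NVARS]) {
    static const double b0 = 1.0;
    static const double b[NVARS] = {0.5, -0.2, 0.1};
    static const double c[NVARS][NVARS] = {
        {0.05, 0.00, 0.01},
        {0.00, 0.02, 0.00},
        {0.01, 0.00, 0.03},
    };
    double y = b0;
    for (int i = 0; i < NVARS; i++) {
        y += b[i] * x[i];
        for (int j = 0; j < NVARS; j++)
            y += c[i][j] * x[i] * x[j];
    }
    return y;
}

int main(void) {
    double x[NVARS] = {1.0, 2.0, 0.5};
    /* One surrogate evaluation replaces a multi-hour crash run. */
    printf("predicted response: %g\n", surrogate(x));
    return 0;
}
```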
35
Historical Growth of CAE Application
[Chart: growth index from 1 to 100 (log scale), 1993 vs. 1999, across eight categories: cost per CPU-hour, capacity (GFlops), crash model size, number of engineers, crash turnaround time (SMP), crash and CFD turnaround time (MPP), NVH model size, and CFD model size. Growth factors across the categories range from x5 to x90; 1999 endpoints include 564 GFlops of capacity (from 1 GFlops in 1993), 450,000-element crash models, 2M-DOF NVH models, and >10M-cell CFD models.]

Source: survey of major automotive developers
36
Future Directions of Scalable CAE
CAE to evolve into fully scalable, RISC-based technology
  • High-resolution models: CFD today; crash and FEA emerging
  • Deterministic CAE giving way to probabilistic techniques
  • Deployment increases computational requirements 10-fold

Visual interaction with models beyond 3M cells/DOF
  • High-resolution modeling will strain visualization technology

Multi-discipline optimization (MDO) implementation in earnest
  • Coupling of structures, fluids, acoustics, electromagnetics
37
Conclusions
For small and medium-size problems, clusters can be a viable solution in the range of 8 to 16 CPUs.

For large and extremely large problems, the SSI architecture provides better parallel performance, owing to the superior characteristics of the in-box interconnect.

To increase single-CPU performance, developers should consider how their data structures and algorithms map onto the specific memory hierarchy.

ccNUMA systems allow various parallel programming paradigms to be coupled, which can benefit the performance of multiphysics applications.