Title: The Blue Gene/L Supercomputer, George Chiu
1 The Blue Gene/L Supercomputer (George Chiu)
2 June 2005: Top 10 of the 25th TOP500 list
IBM is the clear leader in the TOP500 list, with 51.8% of systems and 57.9% of installed performance.
3 BG/L, 32,768 nodes (IBM Rochester): Linpack 136.8 TF/s sustained, 183.5 TF/s peak (1 TF = 1,000,000,000,000 Flops)
4 Blue Gene/L Sales
- Advanced Industrial Science and Technology, Japan (Yutaka Akiyama): 4 racks, 2/05
- Argonne National Laboratory Consortium (William Gropp, Rick Stevens): 1 rack, 12/04
- ASTRON LOFAR, Holland (Harvey Butcher): 6 racks, 3/05
- Boston University (Claudio Rebbi): 1 rack, 12/04
- Ecole Polytechnique Federale de Lausanne (Henry Markram): 4 racks, 06/05
- IBM Yorktown Research Center: 22 racks, 06/05
- IBM Almaden Research Center: 2 racks, 03/05
- Juelich (Thomas Lippert): 1 rack, 7/05
- Lawrence Livermore National Laboratory (Mark Seager, Don Dossa): 64 racks (32 as of 3/05)
- National Center for Atmospheric Research (Richard Loft): 1 rack, 3/05
- NIWS (Suesada): 1 rack, 1/05
- San Diego Supercomputing Center (Wayne Pfeiffer), "Intimidata": 1 rack, 12/17/04
- University of Edinburgh (Anthony Kennedy, Richard Kenway): 1 rack, 12/04
5 BlueGene/L System Buildup
Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
Node Card: 32 chips (4x4x2), 16 compute cards, 0-2 I/O cards, 90/180 GF/s, 16 GB
Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
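The system-level figures follow directly from the per-chip numbers. The short C sketch below is our own illustration of that arithmetic (constant names, rounding, and layout are assumptions, not IBM's):

```c
/* Minimal sketch (ours, purely illustrative): multiply out the packaging
 * hierarchy above to recover the system-level figures on this slide.   */
#include <stdio.h>

int main(void)
{
    const double gf_per_core = 2.8;          /* 700 MHz x 4 flops/cycle (dual FPU FMAs) */
    const double gb_per_node = 0.5;          /* 1.0 GB compute card shared by 2 chips   */
    const long   nodes = 2L * 16 * 32 * 64;  /* chips/card x cards/node card x node cards/rack x racks */

    printf("nodes:  %ld\n", nodes);                        /* 65,536 */
    printf("peak:   %.0f / %.0f TF/s\n",
           nodes * gf_per_core / 1000.0,
           nodes * 2 * gf_per_core / 1000.0);              /* ~183/367; the slide rounds to 180/360 */
    printf("memory: %.0f TB\n", nodes * gb_per_node / 1024.0);   /* 32 TB */
    return 0;
}
```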
6 Two Computation Modes for the BG/L Node
- Mode 1 (Co-processor mode)
  - CPU0 does all the computations
  - CPU1 does all the communications (including MPI, etc.)
  - Communications can overlap with computations (see the MPI sketch below)
  - Peak compute performance is 5.6/2 = 2.8 GFlops
- Mode 2 (Virtual node mode)
  - CPU0 and CPU1 act as independent virtual nodes
  - Each one does both computations and communications
  - The two CPUs communicate via common memory buffers
  - Computations and communications cannot overlap
  - Peak compute performance is 5.6 GFlops
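To make the overlap point concrete, here is a minimal, generic MPI halo-exchange sketch; the function and variable names are our own illustration and nothing here is a BG/L-specific interface. In co-processor mode, CPU1 can progress the posted sends and receives while CPU0 runs the interior loop; in virtual node mode, each virtual node executes the whole sequence itself.

```c
/* Minimal sketch of the overlap that co-processor mode targets: a 1-D
 * halo exchange with non-blocking MPI.                                 */
#include <mpi.h>

void halo_step(double *u, double *out, double *halo,
               int n, int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Post the boundary exchange first; in co-processor mode CPU1 can
     * drive the torus network while CPU0 keeps computing.              */
    MPI_Irecv(halo, 1, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* CPU0 updates every point that does not need the halo value.      */
    for (int i = 1; i < n; i++)
        out[i] = 0.5 * (u[i - 1] + u[i]);

    /* Block only when the boundary value is actually required.         */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    out[0] = 0.5 * (halo[0] + u[0]);
}
```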
7 (No Transcript)
8 Supercomputer Power Efficiencies
9 Microprocessor Power Density Growth
10 (No Transcript)
11 HPC Challenge Global Random Access (GUPS)
12 Summary of performance results
- DGEMM
  - 92.3% of dual-core peak on 1 node
- LINPACK
  - 73.73% of peak on 32,768 nodes (136.8 Tflops/s on 3/23/05)
- SPECint2000: 316
- SPECfp2000: 436
- G-FFTE: 48.993 GFlop/s
- STREAM (kernels restated below)
  - Tuned: Copy 3.8 GB/s, Scale 3.3 GB/s, Add 2.8 GB/s, Triad 3.0 GB/s
  - Standard: Copy 1.8 GB/s, Scale 1.3 GB/s, Add 1.5 GB/s, Triad 1.5 GB/s
  - Competitive with STREAM numbers for most high-end microprocessors
- G-RandomAccess
  - Ranked #1 in HPCC at 0.134994 GUPS
- MPI
  - Latency 3.3 µs at 700 MHz
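For reference, the four STREAM kernels behind those bandwidth figures are restated below in plain C (standard STREAM definitions; the array size and scalar are our illustrative choices, not the tuned BG/L code, sized to exceed the 4 MB on-chip cache so that memory bandwidth is what gets measured):

```c
/* The four STREAM kernels quoted above (standard definitions). */
#include <stddef.h>

#define N (2 * 1000 * 1000)        /* well beyond the 4 MB on-chip cache */
static double a[N], b[N], c[N];

void stream_kernels(double scalar)
{
    for (size_t j = 0; j < N; j++) c[j] = a[j];                  /* Copy  */
    for (size_t j = 0; j < N; j++) b[j] = scalar * c[j];         /* Scale */
    for (size_t j = 0; j < N; j++) c[j] = a[j] + b[j];           /* Add   */
    for (size_t j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];  /* Triad */
}
```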
13 Comparing Systems
14 BlueGene/L Compute ASIC
- IBM CU-11, 0.13 µm
- 11 x 11 mm die size
- 25 x 32 mm CBGA
- 474 pins, 328 signal
- 1.5/2.5 Volt
15 Dual FPU Architecture
- Two 64-bit floating point units
- Designed with input from compiler and library developers
- SIMD instructions over both register files
  - FMA operations over double precision data
  - More general operations available with cross and replicated operands
  - Useful for complex arithmetic, matrix multiply, FFT
- Parallel (quadword) loads/stores
  - Fastest way to transfer data between processors and memory
  - Data needs to be 16-byte aligned (see the sketch after this list)
  - Load/store with swap order available (useful for matrix transpose)
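As a rough illustration of the loop shape this hardware rewards, the plain-C sketch below pairs 16-byte-aligned arrays with one fused multiply-add per element, so a vectorizing compiler can emit quadword loads and SIMD FMAs over element pairs. This is our generic example; BG/L-specific intrinsics and alignment pragmas are deliberately left out.

```c
/* Sketch: data layout and loop shape suited to the dual FPU.
 * Plain C99/POSIX; nothing here is a BG/L-specific interface. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* y[i] += a * x[i] -- one FMA per element, two elements per quadword. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

/* Allocate 16-byte-aligned arrays so parallel (quadword) loads/stores
 * are legal on every element pair.                                    */
double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}
```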
16 BlueGene/L Interconnection Networks
- 3-Dimensional Torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  - 1 µs latency between nearest neighbors, 5 µs to the farthest (see the hop-count sketch after this list)
  - MPI: 3.3 µs latency for one hop, 10 µs to the farthest
  - Communications backbone for computations
  - 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
- Collective Network
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of one-way tree traversal 2.5 µs, MPI 6 µs
  - 23 TB/s total binary tree bandwidth (64k machine)
  - Interconnects all compute and I/O nodes (1,024)
- Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (1:64)
  - All external communication (file I/O, control, user interaction, etc.)
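The nearest-neighbor versus farthest latencies track the hop count on the wrap-around torus. The sketch below is our own illustration (the real routing is virtual cut-through in hardware): it computes the minimum hop count between two nodes on the 64x32x32 torus, whose worst case is 32 + 16 + 16 = 64 hops.

```c
/* Sketch: minimum hop count between two nodes on the 64x32x32 torus. */
static int torus_dist_1d(int a, int b, int size)
{
    int d = a > b ? a - b : b - a;
    return d < size - d ? d : size - d;   /* take the shorter way around the ring */
}

int torus_hops(const int p[3], const int q[3])
{
    const int dims[3] = { 64, 32, 32 };   /* full 64-rack system dimensions */
    int hops = 0;
    for (int i = 0; i < 3; i++)
        hops += torus_dist_1d(p[i], q[i], dims[i]);
    return hops;                          /* maximum is 32 + 16 + 16 = 64 hops */
}
```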
17 BlueGene/L System
18 Complete BlueGene/L System at LLNL (system diagram)
- BG/L compute nodes: 65,536
- BG/L I/O nodes: 1,024 (1,024 links into the federated Gigabit Ethernet switch)
- Federated Gigabit Ethernet switch: 2,048 ports
- Front-end nodes (28), service node (8), and control network (8)
- CWFS (226), archive (128), visualization (128), WAN (506)
19 Applications
- N-body simulation
- Classical molecular dynamics: AMBER8, Blue Matter, ddcMD, DL_POLY, GRASP, LAMMPS, LJ, MDCASK, NAMD, PRIME (Schrodinger), SPaSM
- Quantum chemistry: CHARMm, CPMD, FEQMD, GAMESS-UK, GAMESS-US, Gaussian, NWChem, Qbox, QMC
- Plasma physics: TBLE
- Stellar dynamics of galaxies: Enzo (complex multiphysics code)
- Climatology: CCSM, HOMME
- Computational fluid dynamics: FLUENT, Miranda, Overflow, POP (ocean), Raptor, SAGE, sPPM, STAR-CD
- Astronomy: accretion, planetary formation and evolution, stellar evolution, FLASH (supernova), radiotelescope (ASTRON)
- Electromagnetics: FDTD code
- Finite element analysis, car crash: LS-DYNA, NASTRAN, PAM-Crash (ESI), HPCMW (RIST)
- Radiative transport: 2-D SPHOT, 3-D UMT2000
- Neutron transport: Sweep3D
- Weather: MM5, IFS (ECMWF)
- Life sciences and biotechnology: mpiBLAST, Smith-Waterman
- CAD/CAE: AVBP
- Crystallography with X-ray diffraction: Shake-and-Bake
- Drug screening: OpenEye Scientific Software, Tripos, MOE (CCG Chemical Computing Group)
- Finance: NIWS (Nissei)
20 BlueGene/L will allow overlapping evaluation of models for the first time
(Diagram: length scale from nm to mm versus time scale from ps to s)
- Atomic scale: Molecular Dynamics, unit mechanisms of defect mobility and interaction
- Microscale: Dislocation Dynamics, collective behavior of defects, single-crystal plasticity
- Mesoscale: Aggregate Materials, aggregate grain response, poly-crystal plasticity
- Continuum: Finite Element, plasticity of complex shapes
BlueGene/L simulations bring a qualitative change to material and physics modeling efforts.
21 Closing Points
- Blue Gene represents an innovative way to scale to multi-teraflops capability
  - Massive scalability
  - Efficient packaging for low power and floor space consumption
  - Unique in the market for its balance between massive scale-out capacity and preservation of familiar user/administrator environments
  - Better than COTS clusters by virtue of density, scalability, and innovative interconnect design
  - Better than vector-based supercomputers by virtue of adherence to Linux and MPI standards
- Blue Gene is applicable to a wide range of Deep Computing workloads
- Programs are underway to ensure Blue Gene technology is accessible to a broad range of researchers and technologists
- Based on PowerPC, Blue Gene leverages and advances core IBM technology
- Blue Gene R&D continues so as to ensure the program stays vital