Title: Blue Gene/L system architecture
1. Blue Gene/L system architecture
2. Overview of the IBM Blue Gene/L System Architecture
- Design objectives
- Design approach
- Hardware overview
- System architecture
- Node architecture
- Interconnect architecture
3. Highlights
- A 64K-node highly integrated supercomputer based on system-on-a-chip technology
- Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
- Distributed memory, massively parallel processing (MPP) architecture
- Uses the message-passing programming model (MPI); a minimal sketch follows this list
- 360 Tflops peak performance
- Optimized for cost/performance
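A minimal sketch of the distributed-memory, message-passing model mentioned above, written against the standard MPI C API (generic MPI, not BG/L-specific code). It passes one integer between two ranks and needs at least two processes (e.g. mpirun -np 2 ./a.out).

/* Minimal sketch of the message-passing model: each rank owns its private
 * memory and data moves only via explicit messages. Generic MPI C code,
 * not BG/L-specific; compile with an MPI wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of ranks */

    if (size < 2) {
        if (rank == 0) printf("run with at least 2 ranks\n");
    } else if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                           /* receive from rank 0 */
        printf("rank 1 received %d from rank 0\n", token);
    }

    MPI_Finalize();
    return 0;
}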
4. Design objectives
- Objective 1: a 360-Tflops supercomputer
- For comparison: the Earth Simulator (Japan, fastest supercomputer from 2002 to 2004) at 35.86 Tflops
- Objective 2: power efficiency
- Performance/rack = (performance/watt) x (watt/rack)
- Watt/rack is roughly constant at about 20 kW
- Performance/watt therefore determines performance/rack
5. Power efficiency
- 360 Tflops would take more than 20 megawatts with conventional processors
- Need a low-power processor design (2 to 10 times better power efficiency); see the worked arithmetic after this list
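A back-of-the-envelope illustration of the performance/rack relation above, assuming roughly 1,024 nodes per rack (64 racks for 64K nodes) together with the ~20 kW/rack figure from the slide; the numbers are illustrative, not official specifications.

% Worked numbers for performance/rack = performance/watt x watt/rack,
% assuming 64 racks for the 64K-node system; illustrative only.
\[
\frac{\text{perf}}{\text{rack}} = \frac{\text{perf}}{\text{watt}} \times \frac{\text{watt}}{\text{rack}},
\qquad
\frac{360\ \text{Tflops}}{64\ \text{racks}} \approx 5.6\ \tfrac{\text{Tflops}}{\text{rack}}
\;\Rightarrow\;
\frac{\text{perf}}{\text{watt}} \approx \frac{5.6\ \text{Tflops}}{20\ \text{kW}} \approx 0.28\ \tfrac{\text{Gflops}}{\text{W}}.
\]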
6. Design objectives (continued)
- Objective 3: extreme scalability
- Optimizing for cost/performance -> use low-power, less powerful processors -> need a lot of processors
- Up to 65,536 processors
- Interconnect scalability
- Reliability, availability, and serviceability
- Application scalability
7. Application-based design approach
- Limit the types of applications to improve scalability and the cost/performance ratio
- Which applications?
- Applications at the national labs (Lawrence Livermore, Los Alamos, Sandia)
- Simulations of physical phenomena
- Real-time data processing
- Offline data analysis
8. Application scalability issue
- Two types of applications (speedup formulas for both are sketched after this list)
- Strong scaling: fixed problem size
- The data on each node decreases as the number of nodes increases
- Weak scaling: fixed data size on each node
- The problem size increases as the number of nodes increases
- Most applications from the national labs are weak-scaling applications, while commercial HPC applications tend to be strong scaling
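The standard speedup formulas behind the two cases (textbook formulas, not BG/L-specific), with s the serial fraction of the work and p the number of processors:

% Strong scaling (fixed problem size) is bounded by Amdahl's law;
% weak scaling (problem grows with p) follows Gustafson's law.
\[
S_{\text{strong}}(p) = \frac{1}{s + \frac{1-s}{p}} \;\le\; \frac{1}{s},
\qquad
S_{\text{weak}}(p) = s + (1-s)\,p .
\]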
9. Application scalability issue
- Strong scaling: fixed problem size
- Amdahl's law
- Communication-to-computation ratio
- Load balancing
- Small messages
- Global communication dominates
- Memory footprint
- File I/O
10. Application scalability issue
- Weak scaling
- Amdahl's law
- Problem segmentation limits
- Load balancing
- Global communication dominates
- Memory footprint
- File I/O
11. Application scalability issue
- Amdahl's law: usually not a problem for the applications considered
- Problem segmentation limits and the communication-to-computation ratio are determined by the application; not much can be done about them
- Load balancing
- A major limit for both types of applications
- Not much can be done about it
- Global communication dominates
- Major limit for both types of applications
- Calls for efficient hardware support for global
communication
12. Application scalability issue
- Small messages
- Calls for efficient support for small messages
- The memory footprint determines how much memory to put on each node
- File I/O needs parallel I/O support; a minimal MPI-IO sketch follows this list
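A minimal sketch of what parallel I/O support looks like at the application level, using the standard MPI-IO interface (generic MPI code, not the BG/L I/O-node software); the filename out.dat and the block size are illustrative placeholders. Each rank writes its own disjoint block of one shared file.

/* Generic MPI-IO sketch of parallel file output: every rank writes its
 * own block of a shared file at a disjoint byte offset. */
#include <mpi.h>

#define N 1024  /* integers written per rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, buf[N];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < N; i++) buf[i] = rank;  /* fill with this rank's id */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Disjoint offsets keep the ranks from overwriting each other. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(int);
    MPI_File_write_at(fh, offset, buf, N, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}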
13. Blue Gene/L system components
14. Blue Gene/L Compute ASIC
- Two PowerPC 440 cores with floating-point enhancements (the peak-flops arithmetic is sketched after this list)
- 700 MHz
- Everything of a typical superscalar processor: a pipelined microarchitecture with dual instruction fetch, decode, and out-of-order issue, dispatch, execution, and completion, etc.
- 1 W each through extensive power management
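A hedged sketch of where the 360-Tflops peak figure comes from, assuming the floating-point enhancement lets each core issue two fused multiply-adds (4 flops) per cycle:

% Peak-performance arithmetic under the 4-flops-per-cycle-per-core assumption.
\[
700\ \text{MHz} \times 4\ \tfrac{\text{flops}}{\text{cycle}} = 2.8\ \tfrac{\text{Gflops}}{\text{core}},\qquad
2 \times 2.8 = 5.6\ \tfrac{\text{Gflops}}{\text{node}},\qquad
65{,}536 \times 5.6\ \text{Gflops} \approx 367\ \text{Tflops}.
\]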
15. Blue Gene/L Compute ASIC
16. Memory system on a BG/L node
- BG/L only supports the distributed-memory paradigm
- No need for efficient support for cache coherence on each node
- Coherence is enforced by software if needed
- The two cores operate in one of two modes
- Communication coprocessor mode: coherence is needed, managed in system-level libraries
- Virtual node mode: memory is physically partitioned (not shared)
17. Blue Gene/L networks
- Five networks
- 100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
- 1000 Mbps (Gigabit) Ethernet for I/O
- Three high-bandwidth, low-latency networks for data transmission and synchronization:
- 3-D torus network for point-to-point communication
- Collective network for global operations
- Barrier network
- All network logic is integrated into the BG/L node ASIC
- Memory-mapped interfaces accessible from user space
18. 3-D torus network
- Supports point-to-point communication (a minimal MPI sketch follows this list)
- Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
- 64x32x32 torus; diameter 32+16+16 = 64 hops; worst-case hardware latency 6.4 us
- Cut-through routing
- Adaptive routing
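A minimal MPI sketch of point-to-point communication on a 3-D torus: ranks are arranged with MPI_Cart_create and each exchanges a value with its +x neighbor. This is generic application-level MPI, not BG/L system software; the 64x32x32 dimensions mirror the slide, so as written it needs 65,536 ranks (shrink dims[] to try it on a small machine).

/* Generic MPI sketch: arrange ranks as a 3-D torus and exchange a value
 * with the +x neighbor. Dimensions follow the 64x32x32 figure above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm torus;
    int dims[3]    = {64, 32, 32};   /* X x Y x Z; product must equal the rank count */
    int periods[3] = {1, 1, 1};      /* wrap-around links => a torus */
    int rank, xminus, xplus, sendval, recvval;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* reorder */, &torus);
    MPI_Comm_rank(torus, &rank);

    /* Neighbors one hop away along the X dimension. */
    MPI_Cart_shift(torus, 0, 1, &xminus, &xplus);

    sendval = rank;
    MPI_Sendrecv(&sendval, 1, MPI_INT, xplus, 0,
                 &recvval, 1, MPI_INT, xminus, 0,
                 torus, MPI_STATUS_IGNORE);
    printf("rank %d got %d from its -x neighbor\n", rank, recvval);

    MPI_Finalize();
    return 0;
}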
19. Collective network
- Binary-tree topology, static routing
- Link bandwidth 2.8 Gb/s
- Maximum hardware latency 5 us
- Arithmetic and logic hardware in the tree can perform integer operations on the data
- Efficient support for reduce, scan, global sum, and broadcast operations (see the MPI sketch after this list)
- Floating-point operations can be done with 2 passes
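A minimal MPI sketch of the global operations these networks accelerate: an integer global sum, a broadcast, and a barrier. The code is ordinary MPI and runs anywhere; on BG/L it is the MPI library that maps these calls onto the collective and barrier hardware.

/* Generic MPI sketch of global operations: a global sum, a broadcast,
 * and a barrier. On BG/L the library maps these onto the tree and
 * barrier networks; the code itself is plain MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local, sum;
    double pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;
    /* Integer global sum across all ranks. */
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) pi = 3.14159265358979;
    /* Broadcast from rank 0 to everyone. */
    MPI_Bcast(&pi, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Global synchronization (the barrier network's job on BG/L). */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %d, pi = %f\n", sum, pi);
    MPI_Finalize();
    return 0;
}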
20. Barrier network
- Hardware support for global synchronization
- 1.5 us for a barrier across 64K nodes
21. System-level reliability, availability, and serviceability
- IBM is strong in this area
- Simplicity
- A failing node can be isolated and replaced
- In units of 512 nodes (8x8x8)
- Lots of redundancy
- Flexible partitioning for availability
22. Conclusion
- Optimize cost/performance by limiting the target applications
- Use a low-power design
- Lower frequency, system-on-a-chip integration
- Great performance-per-watt metric
- Scalability support
- Hardware support for global communication and barriers
- Low-latency, high-bandwidth interconnects
- Simplicity for reliability, availability, and serviceability