Title: System Architecture: Near, Medium, and Long-term Scalable Architectures
1 System Architecture: Near, Medium, and Long-term Scalable Architectures
- Panel Discussion Presentation
- Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
- June 4, 2008
- Kevin Pedretti
- Scalable System Software Dept.
- Sandia National Laboratories
- ktpedre_at_sandia.gov
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin
Company, for the United States Department of Energy's National Nuclear Security
Administration under contract DE-AC04-94AL85000.
2 Near Term
- Odds are good, but goods are odd...
- Multi-core, many-core, mega-core
- Heterogeneous ISAs, cores, systems
- Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
- Embedded: Tilera, SPI, Ambric (336-core), Tensilica
- Scalable Architectures
- Peak FLOPS not bottleneck
- Improving per-socket efficiency on real applications is low-hanging fruit
- Decreasing memory size and bandwidth per core
- Symbiosis of architecture and system software
3 Near Term (Cont.)
- Adapting MPI implementations for architecture
- Shared memory copies vs. NIC
- Cache pollution, injection
- Leverage hierarchy / intra-node locality
- Adapting MPI applications for architecture
- MPI + shared memory (LIBSM)
- MPI + something else for intra-node
- OpenMP, Threading Building Blocks, ALF Streaming, CUDA, RapidMind, PeakStream/Google, etc.
- All incompatible, some similar concepts
- Adapting architecture for MPI?
- Leveraging interconnect capabilities for PGAS
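The "shared memory copies vs. NIC" and "leverage hierarchy / intra-node locality" bullets can be illustrated with a minimal sketch. It is not taken from any MPI implementation: the function names, the `"shared_memory"`/`"nic"` labels, and the two-node layout are all hypothetical, and a real library would discover the rank-to-node mapping at startup.

```python
from collections import defaultdict

def group_ranks_by_node(rank_to_node):
    """Map each node name to the list of ranks running on it (ascending)."""
    groups = defaultdict(list)
    for rank, node in enumerate(rank_to_node):
        groups[node].append(rank)
    return dict(groups)

def choose_transport(src, dst, rank_to_node):
    """Return a hypothetical transport label for a message from src to dst."""
    if rank_to_node[src] == rank_to_node[dst]:
        return "shared_memory"  # same node: copy through memory, skip the NIC
    return "nic"                # different nodes: go over the interconnect

# Assumed layout: 8 ranks on 2 quad-core nodes, filled node by node.
layout = ["node0"] * 4 + ["node1"] * 4
groups = group_ranks_by_node(layout)

# One leader per node, e.g. for a two-level (intra-node, then inter-node) collective.
leaders = {node: ranks[0] for node, ranks in groups.items()}

print(groups)
print(leaders)
print(choose_transport(0, 3, layout), choose_transport(3, 4, layout))
```

In this sketch only the node leaders would need to touch the network during a hierarchical collective; the remaining ranks exchange data inside the node.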
4 OS Scalability
At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem.
Doesn't appear to be a bandwidth issue.
5 Task and Memory Placement
- No standard mechanisms, most punt and hope for best
- Explicit vs. implicit mechanisms
- More important than node placement?
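To make "explicit vs. implicit mechanisms" concrete, here is a small sketch of two common rank-to-core layouts plus an explicit pinning call. The helper names and the 2-socket, 2-core example are assumptions for illustration; `os.sched_setaffinity` is a real Python call but is only available on Linux, so the sketch falls back to the implicit path elsewhere.

```python
import os

def block_placement(n_ranks, cores_per_socket):
    """Fill each socket before moving on: rank -> (socket, core)."""
    return [(r // cores_per_socket, r % cores_per_socket) for r in range(n_ranks)]

def cyclic_placement(n_ranks, n_sockets):
    """Deal ranks round-robin across sockets: rank -> (socket, core)."""
    return [(r % n_sockets, r // n_sockets) for r in range(n_ranks)]

def pin_current_process(core_id):
    """Explicitly pin the calling process to one core, if the OS supports it."""
    if hasattr(os, "sched_setaffinity"):   # Linux only
        os.sched_setaffinity(0, {core_id})  # pid 0 means the calling process
        return True
    return False  # implicit path: leave placement to the scheduler

# Assumed node: 2 sockets with 2 cores each, 4 ranks.
block = block_placement(4, 2)
cyclic = cyclic_placement(4, 2)
print("block: ", block)
print("cyclic:", cyclic)
```

Block placement keeps neighboring ranks on the same socket, while cyclic placement spreads them across sockets, which gives each rank a larger share of per-socket memory bandwidth. Which one is preferable depends on the application's communication pattern.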
6 Intra-node MPI
7 Virtual Memory: Nice, but Gets in Way
Figure legend (Dual-core Opteron):
- Dashed line: small pages
- Solid line: large pages
- Open shapes: existing logarithmic algorithm (Gibson/Bruck)
- Solid shapes: new constant-time algorithm (Slepoy, Thompson, Plimpton)
Unexpected Behavior Due to TLB
TLB misses increased with large pages, but time to service a miss decreased
dramatically (10x). Page table fits in L1! (vs. 2 MB per GB with small pages)
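The "2 MB per GB" figure can be reproduced with a short calculation. This sketch assumes x86-64 paging with 4 KiB small pages, 2 MiB large pages, and 8-byte page-table entries, and counts only the lowest-level entries. The 64 KiB L1 data cache size is also an assumption about the Opteron in question; none of these parameters appear on the slide.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3
PTE_BYTES = 8          # assumed entry size (x86-64)
L1_BYTES = 64 * KIB    # assumed L1 data cache size

def leaf_table_bytes(mapped_bytes, page_bytes, pte_bytes=PTE_BYTES):
    """Bytes of lowest-level page-table entries needed to map mapped_bytes."""
    return (mapped_bytes // page_bytes) * pte_bytes

small = leaf_table_bytes(GIB, 4 * KIB)   # 262144 entries of 8 bytes
large = leaf_table_bytes(GIB, 2 * MIB)   # 512 entries of 8 bytes

print(f"small pages: {small // MIB} MiB of entries per GiB mapped")
print(f"large pages: {large // KIB} KiB of entries per GiB mapped")
print("large-page table fits in L1:", large <= L1_BYTES)
```

Under these assumptions a page-table walk with large pages is likely to hit in cache, which is consistent with the much cheaper miss service time reported above even though the miss count rose.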
8 So, Answer is Large Pages?
- DRAM bank conflicts can be considerable depending on data alignment
- OS-level and hardware mitigation strategies
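A toy model shows how data alignment can drive bank conflicts. Everything here is hypothetical: real address-to-bank mappings are specific to the memory controller, and the bank count, interleave granularity, and array base addresses below are chosen only to make the effect visible.

```python
N_BANKS = 8          # hypothetical number of DRAM banks
INTERLEAVE = 4096    # hypothetical bytes mapped to one bank before switching

def bank_of(addr):
    """Toy mapping from a physical address to a bank index."""
    return (addr // INTERLEAVE) % N_BANKS

def conflicts(base_a, base_b, n_bytes, stride=64):
    """Count cache-line offsets where two streams land in the same bank."""
    return sum(
        bank_of(base_a + off) == bank_of(base_b + off)
        for off in range(0, n_bytes, stride)
    )

span = N_BANKS * INTERLEAVE  # bytes covered by one pass over all banks

# Two arrays whose bases differ by a multiple of span always share a bank.
aligned = conflicts(0, 4 * span, span)

# Padding one base by a single interleave unit shifts it to the next bank.
padded = conflicts(0, 4 * span + INTERLEAVE, span)

print("aligned conflicts:", aligned)
print("padded conflicts: ", padded)
```

In this model, padding an allocation is one possible software-level mitigation; whether it helps on real hardware depends on the actual bank mapping.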
9 Affects SpMV Also (28 Node HPCCG Run)
10 Medium Term
- More accelerators, normalization
- Attractive power and memory efficiency
- Commodity processors will integrate GPUs on-chip
- HPC-centric off-chip accelerators
- General-purpose cores not getting much faster
- Leverage architecture for specific app domains
- Some common mechanism will/must emerge for dealing with data-parallel accelerators
- General-purpose cores become more light-weight, better match for light-weight system software
- Chip stacking
- Off-chip optics
11 Long Term
- MPP-on-a-chip
- On- and off-chip optics
- More intelligent memory systems
- Application driven architectures