System Architecture: Near, Medium, and Long-term Scalable Architectures

Slides: 12

Provided by: csSa2

Transcript and Presenter's Notes

1
System Architecture: Near, Medium, and Long-term Scalable Architectures

Panel Discussion Presentation
Sandia CSRI Workshop on Next-generation Scalable Applications: When MPI-only is not enough
June 4, 2008
Kevin Pedretti
Scalable System Software Dept.
Sandia National Laboratories
ktpedre_at_sandia.gov

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
2
Near Term

Odds are good, but goods are odd...
Multi-core, many-core, mega-core
Heterogeneous ISAs, cores, systems
Accelerators: GPU, Cell, ClearSpeed, FPGA, etc.
Embedded: Tilera, SPI, Ambric (336-core), Tensilica
Scalable Architectures
Peak FLOPS not bottleneck
Improving per-socket efficiency on real applications is low-hanging fruit
Decreasing memory size & bandwidth per core
Symbiosis of architecture and system software

3
Near Term (Cont.)

Adapting MPI implementations for architecture
Shared memory copies vs. NIC
Cache pollution, injection
Leverage hierarchy / intra-node locality
Adapting MPI applications for architecture
MPI + shared memory (LIBSM)
MPI + something else for intra-node
OpenMP, Threading Building Blocks, ALF Streaming, CUDA, RapidMind, PeakStream/Google, etc.
All incompatible, some similar concepts
Adapting architecture for MPI?
Leveraging interconnect capabilities for PGAS
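
To make the intra-node locality point concrete, the sketch below counts network messages when one value from every rank is gathered to rank 0, first with every rank sending directly, then with one leader per node collecting local values through shared memory. The block rank-to-node layout and all function names are assumptions for illustration; real MPI implementations use more elaborate algorithms.

```python
# Sketch: network traffic of a flat vs. a node-aware gather to rank 0.
# Assumption: ranks are laid out in blocks, ranks_per_node per node.

def node_of(rank, ranks_per_node):
    """Node index hosting `rank` under a block layout (assumed)."""
    return rank // ranks_per_node


def flat_network_messages(num_nodes, ranks_per_node):
    """Every non-root rank sends directly to rank 0; count off-node sends."""
    total = num_nodes * ranks_per_node
    root_node = node_of(0, ranks_per_node)
    return sum(
        1 for r in range(1, total) if node_of(r, ranks_per_node) != root_node
    )


def hierarchical_network_messages(num_nodes, ranks_per_node):
    """Ranks combine on-node via shared memory; each remote node's
    leader then sends a single combined message to rank 0."""
    return num_nodes - 1


flat = flat_network_messages(4, 4)
hier = hierarchical_network_messages(4, 4)
print(f"flat: {flat} network messages, hierarchical: {hier}")
```

With 4 nodes of 4 ranks each, the flat scheme crosses the network 12 times and the node-aware scheme 3 times; the gap grows with the number of cores per node.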

4
OS Scalability
At 8192 nodes, CNL (2.0.44) is 49% worse than Catamount on this Partisn problem. Doesn't appear to be a bandwidth issue.
5
Task and Memory Placement

No standard mechanisms; most punt and hope for the best
Explicit vs. implicit mechanisms
More important than node placement?
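
As one illustration of an explicit mechanism, the sketch below computes a rank-to-core mapping and offers a helper that, on Linux, pins the calling process with `os.sched_setaffinity`. The node shape, the socket-major core numbering, and the two policies named `block` and `cyclic` are hypothetical examples, not a scheme from the presentation. Memory placement would need a separate mechanism and is not shown.

```python
import os

# Sketch: explicit task placement within one node.
# Assumption: core ids are numbered socket by socket (socket-major).


def place(local_rank, sockets, cores_per_socket, policy="block"):
    """Return (socket, core_id) for a process's rank within its node."""
    slot = local_rank % (sockets * cores_per_socket)
    if policy == "block":
        # Fill one socket completely before moving to the next.
        socket = slot // cores_per_socket
        core = slot % cores_per_socket
    elif policy == "cyclic":
        # Deal ranks round-robin across sockets.
        socket = slot % sockets
        core = slot // sockets
    else:
        raise ValueError(f"unknown policy: {policy}")
    return socket, socket * cores_per_socket + core


def pin_self(core_id):
    """Pin the calling process to one core; returns False where the
    platform does not provide os.sched_setaffinity (non-Linux)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core_id})
        return True
    return False


for rank in range(4):
    print(rank, place(rank, 2, 2, "block"), place(rank, 2, 2, "cyclic"))
```

Block placement keeps neighboring ranks on the same socket, while cyclic placement spreads them across sockets; which is better depends on how the application shares data and memory bandwidth.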

6
Intra-node MPI
7
Virtual Memory: Nice, but Gets in Way
Dashed Line = Small pages; Solid Line = Large pages (Dual-core Opteron)
Open Shapes = Existing Logarithmic Algorithm (Gibson/Bruck)
Solid Shapes = New Constant-Time Algorithm (Slepoy, Thompson, Plimpton)
Unexpected Behavior Due to TLB
TLB misses increased with large pages, but time to service a miss decreased dramatically (10x).
Page table fits in L1! (vs. 2MB per GB with small pages)
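
The page-table figure quoted here can be checked with a little arithmetic. The sketch below assumes x86-64 paging with 8-byte page-table entries, 4 KiB small pages, and 2 MiB large pages; these sizes are assumptions about the configuration, not values stated on the slide. It counts only the last-level entries needed to map 1 GiB.

```python
# Sketch: bytes of last-level page-table entries needed to map a region.
# Assumptions (not from the slide): 8-byte entries, 4 KiB small pages,
# 2 MiB large pages, as on x86-64.

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
ENTRY_BYTES = 8  # assumed size of one page-table entry


def leaf_table_bytes(mapped_bytes, page_bytes, entry_bytes=ENTRY_BYTES):
    """Bytes of leaf page-table entries needed to map `mapped_bytes`."""
    pages = -(-mapped_bytes // page_bytes)  # ceiling division
    return pages * entry_bytes


small = leaf_table_bytes(GIB, 4 * KIB)
large = leaf_table_bytes(GIB, 2 * MIB)
print(f"small pages: {small // MIB} MiB of entries per GiB mapped")
print(f"large pages: {large // KIB} KiB of entries per GiB mapped")
```

Under these assumptions small pages need 2 MiB of entries per GiB, matching the slide, while large pages need about 4 KiB, which is small enough to stay resident in an L1 data cache.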
8
So, Answer is Large Pages?

DRAM bank conflicts can be considerable depending on data alignment
OS-level and hardware mitigation strategies
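
A toy model of why alignment matters for bank conflicts is sketched below. The address-to-bank mapping (8 banks, interleaved every 8 KiB) is a hypothetical illustration, not the mapping of any particular memory controller. It shows that two arrays whose base addresses differ by a multiple of the full interleave span land in the same bank at every index, while a small offset separates them.

```python
# Sketch: how often two arrays walked in lockstep hit the same DRAM bank.
# Assumption: a made-up mapping with 8 banks interleaved every 8 KiB.

NUM_BANKS = 8
INTERLEAVE_BYTES = 8 * 1024


def bank_of(addr):
    """Bank index for a physical address under the assumed mapping."""
    return (addr // INTERLEAVE_BYTES) % NUM_BANKS


def conflict_fraction(base_a, base_b, n_elems, elem_bytes=8):
    """Fraction of indices i where a[i] and b[i] map to the same bank."""
    same = sum(
        1
        for i in range(n_elems)
        if bank_of(base_a + i * elem_bytes) == bank_of(base_b + i * elem_bytes)
    )
    return same / n_elems


span = NUM_BANKS * INTERLEAVE_BYTES  # bytes covered by one pass over all banks
aligned = conflict_fraction(0, 16 * span, 4096)
padded = conflict_fraction(0, 16 * span + INTERLEAVE_BYTES, 4096)
print(f"aligned bases: {aligned:.0%} same-bank, padded bases: {padded:.0%}")
```

In this model, shifting one array by a single interleave unit removes every same-bank pairing. Padding of this kind is only one possible way to break the alignment and is not taken from the slide.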

9
Affects SpMV Also (28 Node HPCCG Run)
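
For readers unfamiliar with the kernel, a minimal sparse matrix-vector product in compressed sparse row (CSR) form is sketched below. It is an illustrative pure-Python version, not the HPCCG code, and the small tridiagonal matrix is a made-up example. The indirect loads of `x[col_idx[k]]` are the kind of irregular access that tends to make SpMV sensitive to memory-system behavior.

```python
# Sketch: y = A @ x with A stored in CSR form (row_ptr, col_idx, values).


def spmv_csr(row_ptr, col_idx, values, x):
    """Multiply a CSR matrix by a dense vector and return the result."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[row] = acc
    return y


# Example matrix (hypothetical): [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
print(y)
```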
10
Medium Term

More accelerators, normalization
Attractive power and memory efficiency
Commodity processors will integrate GPUs on-chip
HPC-centric off-chip accelerators
General-purpose cores not getting much faster
Leverage architecture for specific app domains
Some common mechanism will/must emerge for dealing with data-parallel accelerators
General-purpose cores become more light-weight, better match for light-weight system software
Chip stacking
Off-chip optics