Title: Cray Roadmap 2004-2010
1 Cray Roadmap (2004-2010)
- John M. Levesque
- Senior Technologist
- (Virtual Steve Scott, Chief Architect for X1/X1E/BW)
2 Cray's Computing Vision
Scalable High-Bandwidth Computing
[Roadmap figure: a vector product line (X1 in 2004, X1E, Black Widow in 2006, Black Widow 2) and a scalar/MPP line (Red Storm/RS in 2004, Strider 2 in 2005, Strider 3 in 2006, Strider X), with product integration converging on Cascade and sustained petaflops by 2010]
3 Cray X1
- Cray PVP
  - Powerful vector processors
  - Very high memory bandwidth
  - Non-unit stride computation
  - Special ISA features
  - Modernized the ISA
- T3E
  - Extreme scalability
  - Optimized communication
  - Memory hierarchy
  - Synchronization features
  - Improved via vectors
High bandwidth, scalable shared memory supercomputer
4 Key Architectural Features
- New vector instruction set architecture (ISA)
  - Much larger register set (32x64 vector, 64+64 scalar)
  - 64- and 32-bit memory and IEEE arithmetic
  - Based on 25 years of experience compiling with the Cray-1 ISA
- Decoupled execution
  - Scalar unit runs ahead of vector unit, doing addressing and control
  - Hardware dynamically unrolls loops and issues multiple loops concurrently
  - Special sync operations keep the pipeline full, even across barriers
  - Allows the processor to perform well on short nested loops (see the sketch after this list)
- Scalable, distributed shared memory (DSM) architecture
  - Memory hierarchy: caches, local memory, remote memory
  - Low latency, load/store access to the entire machine (tens of TBs)
  - Processors support 1000s of outstanding references with flexible addressing
  - Very high bandwidth network
  - Coherence protocol, addressing and synchronization optimized for DSM
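As a concrete illustration of the non-unit-stride and short-nested-loop points above, here is a minimal, generic C loop nest (illustrative only; the kernel, names and sizes are not from the slides). A vectorizing compiler can turn the strided inner loop into vector loads and stores while the scalar unit handles the addressing and loop control, which is the behavior the decoupled-execution bullets describe.

/* Illustrative sketch, not Cray-specific code: scale each column of a
 * row-major rows x cols matrix by a[j].  The inner loop walks a column,
 * so its memory stride is `cols` doubles -- a non-unit stride -- and its
 * trip count may be short; both are cases a vector ISA handles well. */
#include <stddef.h>

void scale_columns(size_t rows, size_t cols, const double *a, double *m)
{
    for (size_t j = 0; j < cols; j++)        /* outer loop */
        for (size_t i = 0; i < rows; i++)    /* short inner trip count */
            m[i * cols + j] *= a[j];         /* stride = cols (non-unit) */
}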
5 Cray X1 Node
51 Gflops, 200 GB/s (worked out below)
- Four multistream processors (MSPs), each 12.8 Gflops
- High bandwidth local shared memory (128 Direct Rambus channels)
- 32 network links and four I/O links per node
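The headline node number follows directly from the per-MSP rate above; as a quick check (arithmetic only, derived from the figures on this slide):

\[ 4~\text{MSPs} \times 12.8~\tfrac{\text{Gflops}}{\text{MSP}} = 51.2 \approx 51~\text{Gflops per node}, \]

and the 64-processor system on a later slide is simply \( 64 \times 12.8 = 819.2 \approx 820~\text{Gflops} \).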
6 NUMA: Scalable up to 1024 Nodes
Interconnection Network
- 16 parallel networks for bandwidth
- Global shared memory across the machine (see the peak estimate below)
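Combining the node peak from the previous slide with the 1024-node limit gives a rough upper bound for a maximal X1 configuration (an inference from these two slides' figures, not a number quoted in the deck):

\[ 1024~\text{nodes} \times 51.2~\tfrac{\text{Gflops}}{\text{node}} \approx 52~\text{Tflops peak}. \]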
7 Network Topology (16 CPUs)
[Figure: four 4-processor nodes (node 0 through node 3), each with memory modules M0 through M15; each memory module Mi connects into network section i (Section 0 through Section 15)]
8 Network Topology (128 CPUs)
[Figure: 128-CPU network topology, 16 links]
9 Network Topology (512 CPUs)
10 Cray X1 Node Module
11 Cray X1 Chassis
12 64-Processor Cray X1 System: 820 Gflops
13 Cray X1E Product Enhancement
14 Cray X1E Mid-life Enhancement
- Technology refresh of the X1 (0.13 µm)
- 50% faster processors
- Scalar performance enhancements
- Doubled processor density
- Modest increase in memory system bandwidth
- Same interconnect and I/O
- Machine is upgradeable
  - Can replace Cray X1 nodes with X1E nodes
- Shipping at the end of this year
15 Cray BlackWidow System
- Second-generation vector MPP
- Upward compatible with the Cray X1
- Shipping in 2006
- Major improvement (>> Moore's Law rate) in
  - Single-thread scalar performance
  - Price/performance
- BlackWidow features
  - Single-chip vector microprocessor
  - Globally addressable memory with 4-way SMP nodes
  - Scalable to tens of thousands of processors
  - Even more bandwidth per flop than the X1
  - Innovative fault tolerance features
  - Configurable memory capacity, memory BW and network BW
16 System Goals
- Balanced performance between CPU, memory, interconnect, and I/O
- Highly scalable system hardware and software
- High speed, high bandwidth 3D mesh interconnect
- Run a set of applications 7 times faster than ASCI Red
- Run an ASCI Red application on the full system for 50 hours
- Flexible partitioning for classified and non-classified computing
- High performance I/O subsystem (file system and storage)
17 Red Storm System Overview
- 40 TF peak performance (see the estimate below)
- 108 compute node cabinets, 16 service and I/O node cabinets, and 16 Red/Black switch cabinets
- 10,368 compute processors (2.0 GHz AMD Opteron)
- 512 service and I/O processors (256P for red, 256P for black)
- 10 TB DDR memory
- 240 TB of disk storage (120 TB for red, 120 TB for black)
- MPP system software
  - Linux lightweight compute node operating system
  - Managed and used as a single system
  - Easy to use programming environment
  - Common programming environment
  - High performance file system
  - Low overhead RAS and message passing
- Approximately 3,000 ft² including disk systems
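The 40 TF peak figure is consistent with the processor count, assuming the Opteron's two double-precision floating-point results per clock (an assumption, since the flops-per-clock rate is not stated on the slide):

\[ 10{,}368~\text{processors} \times 2.0~\text{GHz} \times 2~\tfrac{\text{flops}}{\text{clock}} \approx 41.5~\text{Tflops}. \]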
18 Typical Architecture
- Memory latency is about 160 ns, and bandwidth is shared between multiple processors
- The Northbridge chip is the 2nd most complex chip on the board; a typical chip uses about 11 Watts
19 AMD Opteron Generic System
- The SDRAM memory controller and the function of the Northbridge are pulled onto the Opteron die; memory latency is reduced to 60-90 ns (see the latency sketch below)
- Eliminating the Northbridge chip saves heat, power, and complexity, and increases performance
- The interface off the chip is an open standard (HyperTransport)
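The 160 ns versus 60-90 ns latency figures on this slide and the previous one are the kind of numbers a pointer-chasing microbenchmark exposes. The sketch below is a generic illustration (not from the slides, and not a Cray tool); the buffer size, iteration count, and POSIX clock_gettime timer are arbitrary choices.

/* Generic pointer-chasing sketch to estimate main-memory load latency.
 * Every load depends on the previous one, so the average time per step
 * approaches the round-trip latency of wherever the working set lives. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 1u << 24;             /* ~128 MB of size_t: well past cache */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: a single random cycle, so the chase visits
     * every entry and hardware prefetchers get no help. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const size_t iters = 20 * 1000 * 1000;
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = next[p];                       /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (checksum %zu)\n", ns / iters, p);
    free(next);
    return 0;
}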
20 Using HyperTransport to Interface With the System Interconnect
- 6 GB/sec (3 GB/sec bi-directional)
- 3D torus interconnect (neighbor arithmetic sketched below)
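As a small illustration of what the 3D torus means for addressing: every node has six nearest neighbors, and coordinates wrap around at the edges of the mesh. The dimensions and node numbering below are hypothetical, not the actual Red Storm configuration.

/* Illustrative 3D-torus neighbor arithmetic; NX/NY/NZ and the node
 * numbering are hypothetical example values. */
#include <stdio.h>

enum { NX = 8, NY = 8, NZ = 8 };             /* example mesh dimensions */

static int node_id(int x, int y, int z)      /* linearize (x, y, z) */
{
    return (z * NY + y) * NX + x;
}

static int wrap(int v, int n)                /* torus wraparound */
{
    return (v % n + n) % n;
}

int main(void)
{
    int x = 0, y = 3, z = 7;                 /* an arbitrary node */
    printf("node %d neighbors: +x %d, -x %d, +y %d, -y %d, +z %d, -z %d\n",
           node_id(x, y, z),
           node_id(wrap(x + 1, NX), y, z), node_id(wrap(x - 1, NX), y, z),
           node_id(x, wrap(y + 1, NY), z), node_id(x, wrap(y - 1, NY), z),
           node_id(x, y, wrap(z + 1, NZ)), node_id(x, y, wrap(z - 1, NZ)));
    return 0;
}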
21 Cray BlackWidow
The Next-Generation HPC System From Cray Inc.
22 Cray BlackWidow System
- Second-generation vector MPP
- Upward compatible with the Cray X1
- Shipping in 2006
- Major improvement (>> Moore's Law rate) in
  - Single-thread scalar performance
  - Price/performance
- BlackWidow features
  - Single-chip vector microprocessor
  - Globally addressable memory with 4-way SMP nodes
  - Scalable to tens of thousands of processors
  - Even more bandwidth per flop than the X1
  - Innovative fault tolerance features
  - Configurable memory capacity, memory BW and network BW
23 Cascade: Toward Sustained Petaflop Computing
24 HPCS Phases
- Phase I: Concept Development
  - Forecast available technology
  - Propose HPCS hardware/software concepts
  - Explore productivity metrics
  - Develop research plan for Phase II
25 Cray's Approach to HPCS
- High system efficiency at scale
  - Bandwidth is the most critical and expensive part of scalability
  - Enable very high (but configurable) global bandwidth
  - Design processor and system to use this bandwidth wisely
  - Reduce bandwidth demand architecturally
- High human productivity and portability
  - Support legacy and emerging languages
  - Provide strong compiler, tools and runtime support
  - Support a mixed UMA/NUMA programming model
  - Develop higher-level programming language and tools
- System robustness
  - Provide excellent fault detection and diagnosis
  - Implement automatic reconfiguration and fault containment
  - Make all resources virtualized and dynamically reconfigurable
26 Using Bandwidth Wisely
- Implement global shared memory (see the sketch after this list)
  - Lowest latency communication
  - Lowest overhead communication
  - Fine-grained overlap of computation and communication
- Tolerate latency with processor concurrency
  - Message passing concurrency is constraining and hard to program
  - Vectors and streaming provide concurrency within a thread
  - Multithreading provides concurrency between threads
- Exploit locality to reduce bandwidth demand
  - Heavyweight processors (HWPs) to exploit temporal locality
  - Lightweight processors (LWPs) to exploit spatial locality
- Use other techniques to reduce network traffic
  - Atomic memory operations
  - Single-word network transfers when no locality is present
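The global-shared-memory, single-word-transfer, and atomic-memory-operation bullets above map naturally onto a one-sided, partitioned-global-address-space style of programming. The sketch below expresses that style with OpenSHMEM calls purely as an illustration; the slides do not prescribe this API, and shmem_long_atomic_fetch_add is the OpenSHMEM 1.4 name for the remote fetch-and-add assumed here.

/* A sketch of one-sided, fine-grained communication in the spirit of the
 * slide: a single-word remote put plus a remote atomic fetch-and-add,
 * expressed with OpenSHMEM as a stand-in for globally addressable memory. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric (globally addressable) allocations. */
    long *slot    = shmem_malloc(sizeof *slot);      /* one word per PE */
    long *counter = shmem_malloc(sizeof *counter);   /* shared tally on PE 0 */
    *slot = 0;
    *counter = 0;
    shmem_barrier_all();

    /* Single-word put to a neighbor: no matching receive, no buffering. */
    long value = 100 + me;
    shmem_long_put(slot, &value, 1, (me + 1) % npes);

    /* Atomic fetch-and-add on PE 0: one network transaction instead of a
     * read-modify-write round trip. */
    long old = shmem_long_atomic_fetch_add(counter, 1, 0);

    shmem_barrier_all();                             /* puts/AMOs complete */
    if (me == 0)
        printf("counter = %ld (my ticket was %ld), slot = %ld\n",
               *counter, old, *slot);

    shmem_free(slot);
    shmem_free(counter);
    shmem_finalize();
    return 0;
}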
27 Questions?