Title: Cray Roadmap 2004-2010
1 Cray Roadmap (2004-2010)
- John M. Levesque
- Senior Technologist
- (Virtual Steve Scott, Chief Architect for X1/X1E/BW)
2 Cray's Computing Vision
Scalable High-Bandwidth Computing
[Roadmap figure: a vector product line (X1 in 2004, X1E, Black Widow in 2006, Black Widow 2) and a scalar/MPP line (Red Storm/RS in 2004, Strider 2 in 2005, Strider 3 in 2006, Strider X), with product integration converging on Cascade and sustained petaflops by 2010]
3 Cray X1
- Cray PVP
  - Powerful vector processors
  - Very high memory bandwidth
  - Non-unit stride computation
  - Special ISA features
  - Modernized the ISA
- T3E
  - Extreme scalability
  - Optimized communication
  - Memory hierarchy
  - Synchronization features
  - Improved via vectors
High bandwidth, scalable shared memory supercomputer
4 Key Architectural Features
- New vector instruction set architecture (ISA)
  - Much larger register set (32x64 vector, 64+64 scalar)
  - 64- and 32-bit memory and IEEE arithmetic
  - Based on 25 years of experience compiling with the Cray-1 ISA
- Decoupled execution
  - Scalar unit runs ahead of vector unit, doing addressing and control
  - Hardware dynamically unrolls loops and issues multiple loops concurrently
  - Special sync operations keep the pipeline full, even across barriers
  - Allows the processor to perform well on short nested loops (see the sketch after this list)
- Scalable, distributed shared memory (DSM) architecture
  - Memory hierarchy: caches, local memory, remote memory
  - Low latency, load/store access to the entire machine (tens of TBs)
  - Processors support 1000s of outstanding references with flexible addressing
  - Very high bandwidth network
  - Coherence protocol, addressing and synchronization optimized for DSM
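As a concrete illustration of the non-unit-stride and short-nested-loop points above, here is a minimal, generic C loop nest (illustrative only; the kernel, names and sizes are not from the slides). A vectorizing compiler can turn the strided inner loop into vector loads and stores while the scalar unit handles the addressing and loop control, which is the behavior the decoupled-execution bullets describe.

/* Illustrative sketch, not Cray-specific code: scale each column of a
 * row-major rows x cols matrix by a[j].  The inner loop walks a column,
 * so its memory stride is `cols` doubles -- a non-unit stride -- and its
 * trip count may be short; both are cases a vector ISA handles well. */
#include <stddef.h>

void scale_columns(size_t rows, size_t cols, const double *a, double *m)
{
    for (size_t j = 0; j < cols; j++)        /* outer loop */
        for (size_t i = 0; i < rows; i++)    /* short inner trip count */
            m[i * cols + j] *= a[j];         /* stride = cols (non-unit) */
}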
5 Cray X1 Node
51 Gflops, 200 GB/s (worked out below)
- Four multistream processors (MSPs), each 12.8 Gflops
- High bandwidth local shared memory (128 Direct Rambus channels)
- 32 network links and four I/O links per node
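The headline node number follows directly from the per-MSP rate above; as a quick check (arithmetic only, derived from the figures on this slide):

\[ 4~\text{MSPs} \times 12.8~\tfrac{\text{Gflops}}{\text{MSP}} = 51.2 \approx 51~\text{Gflops per node}, \]

and the 64-processor system on a later slide is simply \( 64 \times 12.8 = 819.2 \approx 820~\text{Gflops} \).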
6 NUMA: Scalable up to 1024 Nodes
Interconnection Network
- 16 parallel networks for bandwidth
- Global shared memory across the machine (see the peak estimate below)
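Combining the node peak from the previous slide with the 1024-node limit gives a rough upper bound for a maximal X1 configuration (an inference from these two slides' figures, not a number quoted in the deck):

\[ 1024~\text{nodes} \times 51.2~\tfrac{\text{Gflops}}{\text{node}} \approx 52~\text{Tflops peak}. \]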
7 Network Topology (16 CPUs)
[Figure: four 4-processor nodes (node 0 through node 3), each with memory modules M0 through M15; each memory module Mi connects into network section i (Section 0 through Section 15)]
8 Network Topology (128 CPUs)
[Figure: 128-CPU network topology, 16 links]
9 Network Topology (512 CPUs)
10 Cray X1 Node Module
11 Cray X1 Chassis
12 64-Processor Cray X1 System: 820 Gflops
13 Cray X1E Product Enhancement
14 Cray X1E Mid-life Enhancement
- Technology refresh of the X1 (0.13 µm)
- 50% faster processors
- Scalar performance enhancements
- Doubled processor density
- Modest increase in memory system bandwidth
- Same interconnect and I/O
- Machine is upgradeable
  - Can replace Cray X1 nodes with X1E nodes
- Shipping at the end of this year
15 Cray BlackWidow System
- Second-generation vector MPP
- Upward compatible with the Cray X1
- Shipping in 2006
- Major improvement (>> Moore's Law rate) in
  - Single-thread scalar performance
  - Price/performance
- BlackWidow features
  - Single-chip vector microprocessor
  - Globally addressable memory with 4-way SMP nodes
  - Scalable to tens of thousands of processors
  - Even more bandwidth per flop than the X1
  - Innovative fault tolerance features
  - Configurable memory capacity, memory BW and network BW
16 System Goals
- Balanced performance between CPU, memory, interconnect, and I/O
- Highly scalable system hardware and software
- High speed, high bandwidth 3D mesh interconnect
- Run a set of applications 7 times faster than ASCI Red
- Run an ASCI Red application on the full system for 50 hours
- Flexible partitioning for classified and non-classified computing
- High performance I/O subsystem (file system and storage)
17 Red Storm System Overview
- 40 TF peak performance (see the estimate below)
- 108 compute node cabinets, 16 service and I/O node cabinets, and 16 Red/Black switch cabinets
- 10,368 compute processors (2.0 GHz AMD Opteron)
- 512 service and I/O processors (256P for red, 256P for black)
- 10 TB DDR memory
- 240 TB of disk storage (120 TB for red, 120 TB for black)
- MPP system software
  - Linux lightweight compute node operating system
  - Managed and used as a single system
  - Easy to use programming environment
  - Common programming environment
  - High performance file system
  - Low overhead RAS and message passing
- Approximately 3,000 ft² including disk systems
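The 40 TF peak figure is consistent with the processor count, assuming the Opteron's two double-precision floating-point results per clock (an assumption, since the flops-per-clock rate is not stated on the slide):

\[ 10{,}368~\text{processors} \times 2.0~\text{GHz} \times 2~\tfrac{\text{flops}}{\text{clock}} \approx 41.5~\text{Tflops}. \]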
18 Typical Architecture
- Memory latency is about 160 ns, and bandwidth is shared between multiple processors
- The Northbridge chip is the 2nd most complex chip on the board; a typical chip uses about 11 Watts
19 AMD Opteron Generic System
- The SDRAM memory controller and the function of the Northbridge are pulled onto the Opteron die; memory latency is reduced to 60-90 ns (see the latency sketch below)
- Eliminating the Northbridge chip saves heat, power, and complexity, and increases performance
- The interface off the chip is an open standard (HyperTransport)
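The 160 ns versus 60-90 ns latency figures on this slide and the previous one are the kind of numbers a pointer-chasing microbenchmark exposes. The sketch below is a generic illustration (not from the slides, and not a Cray tool); the buffer size, iteration count, and POSIX clock_gettime timer are arbitrary choices.

/* Generic pointer-chasing sketch to estimate main-memory load latency.
 * Every load depends on the previous one, so the average time per step
 * approaches the round-trip latency of wherever the working set lives. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 1u << 24;             /* ~128 MB of size_t: well past cache */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: a single random cycle, so the chase visits
     * every entry and hardware prefetchers get no help. */
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    const size_t iters = 20 * 1000 * 1000;
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = next[p];                       /* serially dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (checksum %zu)\n", ns / iters, p);
    free(next);
    return 0;
}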
20 Using HyperTransport to Interface With the System Interconnect
- 6 GB/sec (3 GB/sec bi-directional)
- 3D torus interconnect (neighbor arithmetic sketched below)
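As a small illustration of what the 3D torus means for addressing: every node has six nearest neighbors, and coordinates wrap around at the edges of the mesh. The dimensions and node numbering below are hypothetical, not the actual Red Storm configuration.

/* Illustrative 3D-torus neighbor arithmetic; NX/NY/NZ and the node
 * numbering are hypothetical example values. */
#include <stdio.h>

enum { NX = 8, NY = 8, NZ = 8 };             /* example mesh dimensions */

static int node_id(int x, int y, int z)      /* linearize (x, y, z) */
{
    return (z * NY + y) * NX + x;
}

static int wrap(int v, int n)                /* torus wraparound */
{
    return (v % n + n) % n;
}

int main(void)
{
    int x = 0, y = 3, z = 7;                 /* an arbitrary node */
    printf("node %d neighbors: +x %d, -x %d, +y %d, -y %d, +z %d, -z %d\n",
           node_id(x, y, z),
           node_id(wrap(x + 1, NX), y, z), node_id(wrap(x - 1, NX), y, z),
           node_id(x, wrap(y + 1, NY), z), node_id(x, wrap(y - 1, NY), z),
           node_id(x, y, wrap(z + 1, NZ)), node_id(x, y, wrap(z - 1, NZ)));
    return 0;
}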
21 Cray BlackWidow
The Next-Generation HPC System From Cray Inc.
22 Cray BlackWidow System
- Second-generation vector MPP
- Upward compatible with the Cray X1
- Shipping in 2006
- Major improvement (>> Moore's Law rate) in
  - Single-thread scalar performance
  - Price/performance
- BlackWidow features
  - Single-chip vector microprocessor
  - Globally addressable memory with 4-way SMP nodes
  - Scalable to tens of thousands of processors
  - Even more bandwidth per flop than the X1
  - Innovative fault tolerance features
  - Configurable memory capacity, memory BW and network BW
23 Cascade: Toward Sustained Petaflop Computing
24 HPCS Phases
- Phase I: Concept Development
  - Forecast available technology
  - Propose HPCS hardware/software concepts
  - Explore productivity metrics
  - Develop research plan for Phase II
25 Cray's Approach to HPCS
- High system efficiency at scale
  - Bandwidth is the most critical and expensive part of scalability
  - Enable very high (but configurable) global bandwidth
  - Design processor and system to use this bandwidth wisely
  - Reduce bandwidth demand architecturally
- High human productivity and portability
  - Support legacy and emerging languages
  - Provide strong compiler, tools and runtime support
  - Support a mixed UMA/NUMA programming model
  - Develop higher-level programming language and tools
- System robustness
  - Provide excellent fault detection and diagnosis
  - Implement automatic reconfiguration and fault containment
  - Make all resources virtualized and dynamically reconfigurable
26 Using Bandwidth Wisely
- Implement global shared memory (see the sketch after this list)
  - Lowest latency communication
  - Lowest overhead communication
  - Fine-grained overlap of computation and communication
- Tolerate latency with processor concurrency
  - Message passing concurrency is constraining and hard to program
  - Vectors and streaming provide concurrency within a thread
  - Multithreading provides concurrency between threads
- Exploit locality to reduce bandwidth demand
  - Heavyweight processors (HWPs) to exploit temporal locality
  - Lightweight processors (LWPs) to exploit spatial locality
- Use other techniques to reduce network traffic
  - Atomic memory operations
  - Single-word network transfers when no locality is present
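The global-shared-memory, single-word-transfer, and atomic-memory-operation bullets above map naturally onto a one-sided, partitioned-global-address-space style of programming. The sketch below expresses that style with OpenSHMEM calls purely as an illustration; the slides do not prescribe this API, and shmem_long_atomic_fetch_add is the OpenSHMEM 1.4 name for the remote fetch-and-add assumed here.

/* A sketch of one-sided, fine-grained communication in the spirit of the
 * slide: a single-word remote put plus a remote atomic fetch-and-add,
 * expressed with OpenSHMEM as a stand-in for globally addressable memory. */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric (globally addressable) allocations. */
    long *slot    = shmem_malloc(sizeof *slot);      /* one word per PE */
    long *counter = shmem_malloc(sizeof *counter);   /* shared tally on PE 0 */
    *slot = 0;
    *counter = 0;
    shmem_barrier_all();

    /* Single-word put to a neighbor: no matching receive, no buffering. */
    long value = 100 + me;
    shmem_long_put(slot, &value, 1, (me + 1) % npes);

    /* Atomic fetch-and-add on PE 0: one network transaction instead of a
     * read-modify-write round trip. */
    long old = shmem_long_atomic_fetch_add(counter, 1, 0);

    shmem_barrier_all();                             /* puts/AMOs complete */
    if (me == 0)
        printf("counter = %ld (my ticket was %ld), slot = %ld\n",
               *counter, old, *slot);

    shmem_free(slot);
    shmem_free(counter);
    shmem_finalize();
    return 0;
}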
27 Questions?