Cray Roadmap (2004-2010) - Presentation Transcript


1
Cray Roadmap (2004-2010)
  • John M. Levesque
  • Senior Technologist
  • (Virtual Steve Scott, Chief Architect for X1/X1E/BW)

2
Cray's Computing Vision
Scalable High-Bandwidth Computing
[Roadmap chart: a vector product line (X1 → X1E → Black Widow → Black Widow 2) and an Opteron-based line (Red Storm (RS) → Strider 2 → Strider 3 → Strider X) converge through product integration, leading to Cascade with sustained petaflops in 2010; intermediate milestones fall in 2004, 2005, and 2006]
3
Cray X1
  • Cray PVP
    • Powerful vector processors
    • Very high memory bandwidth
    • Non-unit stride computation
    • Special ISA features
    • Modernized the ISA
  • T3E
    • Extreme scalability
    • Optimized communication
    • Memory hierarchy
    • Synchronization features
    • Improved via vectors

High-bandwidth, scalable shared-memory supercomputer
4
Key Architectural Features
  • New vector instruction set architecture (ISA)
    • Much larger register set (32 × 64-element vector registers, 64 + 64 scalar registers)
    • 64- and 32-bit memory and IEEE arithmetic
    • Based on 25 years of experience compiling with the Cray-1 ISA
  • Decoupled execution
    • Scalar unit runs ahead of the vector unit, doing addressing and control
    • Hardware dynamically unrolls loops and issues multiple loops concurrently
    • Special sync operations keep the pipeline full, even across barriers
    • → Allows the processor to perform well on short, nested loops (a small C sketch follows this list)
  • Scalable, distributed shared memory (DSM) architecture
    • Memory hierarchy: caches, local memory, remote memory
    • Low-latency load/store access to the entire machine (tens of TBs)
    • Processors support thousands of outstanding references with flexible addressing
    • Very high bandwidth network
    • Coherence protocol, addressing, and synchronization optimized for distributed memory
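A minimal C sketch (not from the original slides) of the kind of loop these features target: a gathered load feeding a non-unit-stride store, which the decoupled scalar unit can run ahead of while the vector unit streams memory references. The directive follows Cray C compiler convention (#pragma _CRI ivdep); treat the pragma and the function itself as illustrative assumptions.

    /* Gathered load, non-unit-stride store: the sort of loop the X1's
     * vector ISA and decoupled execution aim to keep streaming from
     * memory.  The ivdep hint (Cray C convention, assumed here) asserts
     * that the indexed accesses carry no dependences, so the compiler
     * can vectorize the loop. */
    void scaled_gather(double *restrict y, const double *restrict x,
                       const long *restrict idx, double a, long n, long stride)
    {
    #pragma _CRI ivdep
        for (long i = 0; i < n; i++) {
            y[i * stride] = a * x[idx[i]];   /* non-unit stride store, gathered load */
        }
    }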

5
Cray X1 Node
51 Gflops, 200 GB/s
  • Four multistream processors (MSPs) at 12.8 Gflops each (4 × 12.8 ≈ 51 Gflops per node)
  • High bandwidth local shared memory (128 Direct
    Rambus channels)
  • 32 network links and four I/O links per node

6
NUMA Scalable up to 1024 Nodes
Interconnection Network
  • 16 parallel networks for bandwidth
  • Global shared memory across machine

7
Network Topology (16 CPUs)
[Figure: four 4-processor nodes (node 0 through node 3), each with memory banks M0-M15, connected across 16 network sections (Section 0 through Section 15)]
8
Network Topology (128 CPUs)
[Figure: network topology for 128 CPUs, 16 links]
9
Network Topology (512 CPUs)
10
Cray X1 Node Module
11
Cray X1 Chassis
12
64-Processor Cray X1 System: 820 Gflops
13
Cray X1E Product Enhancement
14
Cray X1E Mid-life Enhancement
  • Technology refresh of the X1 (0.13 µm)
    • 50% faster processors
    • Scalar performance enhancements
    • Doubling processor density
    • Modest increase in memory system bandwidth
    • Same interconnect and I/O
  • Machine upgradeable
    • Can replace Cray X1 nodes with X1E nodes
  • Shipping at the end of this year

15
Cray BlackWidow System
  • Second-generation vector MPP
    • Upward compatible with the Cray X1
    • Shipping in 2006
  • Major improvement (>> Moore's Law rate) in
    • Single-thread scalar performance
    • Price/performance
  • BlackWidow features
    • Single-chip vector microprocessor
    • Globally addressable memory with 4-way SMP nodes
    • Scalable to tens of thousands of processors
    • Even more bandwidth per flop than the X1
    • Innovative fault tolerance features
    • Configurable memory capacity, memory BW and network BW

16
System Goals
  • Balanced Performance between CPU, Memory,
    Interconnect, and I/O
  • Highly scalable system hardware and software
  • High speed, high bandwidth 3D mesh interconnect
  • Run a set of applications 7 times faster than
    ASCI Red
  • Run an ASCI Red application on the full system for 50 hours
  • Flexible partitioning for classified and non-classified computing
  • High performance I/O subsystem (file system and storage)

17
Red Storm System Overview
  • 40 TF peak performance
  • 108 compute node cabinets, 16 service and I/O node cabinets, and 16 Red/Black switch cabinets
  • 10,368 compute processors - 2.0 GHz AMD Opteron
  • 512 service and I/O processors (256P for red, 256P for black)
  • 10 TB DDR memory
  • 240 TB of disk storage (120 TB for red, 120 TB for black)
  • MPP system software
    • Linux lightweight compute node operating system
    • Managed and used as a single system
    • Easy-to-use programming environment
    • Common programming environment
    • High performance file system
    • Low-overhead RAS and message passing (see the MPI sketch after this list)
  • Approximately 3,000 ft² including disk systems
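Red Storm is used as a conventional message-passing MPP: one process per compute processor, communicating over the 3D mesh. A minimal MPI sketch of that usage model is shown below; the program itself is illustrative only and does not come from the slides.

    /* Minimal MPI sketch of the Red Storm usage model: one process per
     * compute processor, communicating only through message passing.
     * Illustrative only; not taken from the slides. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process contributes a partial value; MPI_Allreduce combines
         * them across the mesh interconnect. */
        double local = (double)rank, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %g\n", size, global);

        MPI_Finalize();
        return 0;
    }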

18
Typical Architecture
  • Memory latency is about 160 ns, and memory bandwidth is shared between multiple processors
  • The Northbridge is the second most complex chip on the board; a typical Northbridge uses about 11 watts

19
AMD Opteron Generic System
  • The SDRAM memory controller and the Northbridge functions are pulled onto the Opteron die, reducing memory latency to 60-90 ns
  • Eliminating the Northbridge chip saves heat, power, and complexity, and improves performance
  • The off-chip interface is an open standard (HyperTransport)

20
Using HyperTransport to Interface with the System Interconnect
  • 6 GB/s (3 GB/s in each direction)
  • 3D Torus Interconnect

21
Cray BlackWidow
The Next Generation HPC System From Cray Inc.
22
Cray BlackWidow System
  • Second-generation vector MPP
    • Upward compatible with the Cray X1
    • Shipping in 2006
  • Major improvement (>> Moore's Law rate) in
    • Single-thread scalar performance
    • Price/performance
  • BlackWidow features
    • Single-chip vector microprocessor
    • Globally addressable memory with 4-way SMP nodes (see the sketch after this list)
    • Scalable to tens of thousands of processors
    • Even more bandwidth per flop than the X1
    • Innovative fault tolerance features
    • Configurable memory capacity, memory BW and network BW
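The slides describe globally addressable memory but name no programming interface. The small sketch below uses OpenSHMEM-style one-sided calls (an assumption for illustration) to show what direct access to another processor's memory looks like to the programmer: the origin stores into remote memory with no matching receive on the target.

    /* One-sided put into another PE's memory, in the spirit of the
     * globally addressable memory described above.  OpenSHMEM-style
     * calls are assumed; the deck itself names no library. */
    #include <shmem.h>
    #include <stdio.h>

    long remote_word = 0;   /* symmetric: every PE has a remotely addressable copy */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* The last PE stores directly into PE 0's copy of remote_word;
         * no receive is posted on the target side. */
        long value = (long)me;
        if (me == npes - 1)
            shmem_long_put(&remote_word, &value, 1, 0);

        shmem_barrier_all();
        if (me == 0)
            printf("remote_word on PE 0 = %ld (written by PE %d)\n",
                   remote_word, npes - 1);

        shmem_finalize();
        return 0;
    }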

23
Cascade: Toward Sustained Petaflop Computing
24
HPCS Phases
  • Phase I: Concept Development
    • Forecast available technology
    • Propose HPCS hw/sw concepts
    • Explore productivity metrics
    • Develop research plan for Phase II

25
Cray's Approach to HPCS
  • High system efficiency at scale
    • Bandwidth is the most critical and expensive part of scalability
    • Enable very high (but configurable) global bandwidth
    • Design the processor and system to use this bandwidth wisely
    • Reduce bandwidth demand architecturally
  • High human productivity and portability
    • Support legacy and emerging languages
    • Provide strong compiler, tools and runtime support
    • Support a mixed UMA/NUMA programming model
    • Develop higher-level programming language and tools
  • System robustness
    • Provide excellent fault detection and diagnosis
    • Implement automatic reconfiguration and fault containment
    • Make all resources virtualized and dynamically reconfigurable

26
Using Bandwidth Wisely
  • Implement global shared memory
    • Lowest latency communication
    • Lowest overhead communication
    • Fine-grained overlap of computation and communication
  • Tolerate latency with processor concurrency
    • Message passing concurrency is constraining and hard to program
    • Vectors and streaming provide concurrency within a thread
    • Multithreading provides concurrency between threads
  • Exploit locality to reduce bandwidth demand
    • Heavyweight processors (HWPs) to exploit temporal locality
    • Lightweight processors (LWPs) to exploit spatial locality
  • Use other techniques to reduce network traffic
    • Atomic memory operations (example after this list)
    • Single-word network transfers when no locality is present
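As an illustration of the atomic-memory-operation idea in the last bullets, a remote fetch-and-add lets many processors share one work counter with a single-word network transaction each, instead of lock acquire/release round trips. The OpenSHMEM-style call below is an assumption for illustration; the slide names no API.

    /* Each PE claims a block of 64 work items with one single-word
     * remote atomic on a counter owned by PE 0, replacing a lock
     * round trip over the network.  OpenSHMEM-style call assumed. */
    #include <shmem.h>
    #include <stdio.h>

    long next_index = 0;   /* symmetric work counter; PE 0's copy is the shared one */

    int main(void)
    {
        shmem_init();
        int me = shmem_my_pe();

        long start = shmem_long_atomic_fetch_add(&next_index, 64, 0);
        printf("PE %d owns work indices [%ld, %ld)\n", me, start, start + 64);

        shmem_finalize();
        return 0;
    }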

27
Questions?
File name: BWOpsReview081103.ppt