1
Pipelined Vector Processing and Scientific Computation
John G. Zabolitzky
2
Applications of High-Performance Computing
  • Weather prediction, climatic simulation
  • Fluid dynamics simulation (aerodynamics for
    aerospace, automobiles, combustion, ...)
  • Basic science
    • cosmology
    • quantum mechanical many-body problems
    • chemistry
    • solid-state
    • quantum fluids
    • high-energy physics
  • Cryptography
  • Weapons research
  • Energy research
    • nuclear reactor simulation
    • fusion research
  • Many, many more

3
Terminal State of Scalar Computing: CDC 7600, 1968
  • Maximum RISC performance of 1 operation/cycle
    achieved
  • No further improvement possible without a change of
    paradigm
  • 36 MHz => 36 MIPS => 5 MFLOPS real

The CDC 7600 (designed by Seymour Cray) was the most powerful of all computers from 1968 to 1976, when the Cray-1 achieved > 10 times its performance
4
Pipelined Scalar Execution
5
(No Transcript)
6
Scalar Code Example
  • DO i=1,100: a(i) = b(i) * c(i)
  • load b, inc address
  • load c, inc address
  • multiply
  • store a, inc address
  • decrement count, loop?
  • 5 instruction cycles (optimum) for one
    multiply
  • a pipelined multiplier could start one multiply each
    and every cycle => only 20% efficient use
  • the expensive multiplier sits idle most of the time
    (see the sketch below)
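A minimal Fortran sketch of the loop above; the per-element comments
restate the five-instruction breakdown from this slide, and the
subroutine wrapper is illustrative, not from the original:

      subroutine scalmul(a, b, c, n)
      integer n, i
      real a(n), b(n), c(n)
      do 10 i = 1, n
c        per element (optimum): load b(i), load c(i), multiply,
c        store a(i), decrement count / branch  =>  5 cycles,
c        of which only 1 uses the (pipelined) multiplier
         a(i) = b(i) * c(i)
   10 continue
      end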

7
Architectural Alternatives
  • Pipelined Scalar (RISC): as outlined before
  • Pipelined Vector: this presentation, further
    down
  • SIMD (Single Instruction Multiple Data):
    parallel arithmetic (e.g., ILLIAC IV)
  • too expensive, inefficient: large number of
    lightly used multipliers
  • Superscalar: multiple issue in one cycle
  • all modern single-chip CPUs (Intel to TI); keeps
    all function units busy
  • VLIW (Very Long Instruction Word): variant
    of superscalar
  • MIMD (Multiple Instruction Multiple Data): true
    parallel streams, e.g. Cray T3E, IBM Blue Gene,
    IBM Cell; may be superimposed on top of ANY CPU
    architecture

8
Vector Computation
  • Scientific codes spend a high percentage of time looping
    over simple data structures
  • DO i=1,100: a(i) = b*c(i) + d(i)  (sketched in full below)
  • simple logical structure =>
  • set up such that one multiply/cycle
  • one instruction for the entire loop
  • MFLOP rate = cycle rate or a multiple thereof
  • specialized for scientific/engineering tasks
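A minimal Fortran sketch of the loop above as a vectorizable kernel;
the subroutine wrapper is an illustrative assumption:

      subroutine triad(a, b, c, d, n)
      integer n, i
      real a(n), c(n), d(n), b
c     independent iterations, unit stride, no recurrence:
c     a vectorizing compiler can issue this as vector loads,
c     a vector multiply-add, and a vector store, giving one
c     result per cycle once the pipelines are full
      do 10 i = 1, n
         a(i) = b * c(i) + d(i)
   10 continue
      end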

9
Vector Pipeline: c(i) = a(i) * b(i)
Inventor: Henry Ford
10
Need to Vectorize: some is automatic, high quality
requires hand-optimization
  • Naive scalar code for matrix multiply:
  • s = 0.0
  • do j = 1, n
  •   s = s + a(i,j) * b(j,k)
  • Recurrence on s => adder pipeline blocked
  • Vector code for matrix multiply:
  • do i = 1, n
  •   c(i,k) = c(i,k) + a(i,j) * b(j,k)
  • Independent vector elements, but 1.5x bandwidth
  • Frequently a good idea: exchange inner/outer loops
    (full triple loop sketched below)
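A sketch of the exchanged-loop matrix multiply in full, assuming the
standard triple loop and that c has been zeroed by the caller; the
subroutine name is illustrative:

      subroutine mxmvec(a, b, c, n)
      integer n, i, j, k
      real a(n,n), b(n,n), c(n,n)
      do 30 k = 1, n
         do 20 j = 1, n
c           innermost loop over i: successive c(i,k) are independent,
c           so the multiply and add pipelines stream over the column
            do 10 i = 1, n
               c(i,k) = c(i,k) + a(i,j) * b(j,k)
   10       continue
   20    continue
   30 continue
      end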

11
First Vector Computers
  • Control Data Corporation (CDC) STAR-100: STring
    ARray, 100 MFLOPS
  • memory-to-memory architecture
  • therefore long startup times (several hundred cycles)
  • very slow scalar unit (2 MFLOPS)
  • overall disappointing performance
  • contracted 1967, announced 1972, delivered 1974
  • total of 4 machines, 2 at Lawrence Livermore Lab
  • Thornton (CDC) and Fernbach (LLL) lose their
    jobs

12
CDC STAR-100
Photograph courtesy of Charles Babbage Institute,
University of Minnesota, Minneapolis
13
Texas Instruments ASC
  • Advanced Scientific Computer, early 1970s
  • architecturally similar to CDC STAR-100
  • 7 units sold
  • TI dropped out of mainframe computer
    manufacturing after this machine

14
Vector Performance I
  • MFLOP rate (MFLOPS) as a function of vector length
    n
  • scalar: constant (only some loop overhead, then
    n * loop time)
  • vector (n = length of vector):
  • cycles = startup + n / nflop_per_cycle
  • rate/clock = ops / cycles = n / (startup + n)
  • half rate at vector length n = startup
  • full rate needs n >> startup => Long Vector
    Machine (worked numbers below)
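A worked example of the rate formula above, assuming 1 flop per cycle
and a startup of 100 cycles (an illustrative value, not a measured one):

    rate/clock = n / (startup + n)
    n =   10:    10 / (100 +   10) = 0.09 of peak
    n =  100:   100 / (100 +  100) = 0.50 of peak   (half rate at n = startup)
    n = 1000:  1000 / (100 + 1000) = 0.91 of peak

so only vectors much longer than the startup time come close to peak.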

15
Performance vs. Startup, Length
16
Vector Performance II
  • Vector/Scalar Subsections
  • ALL codes have some scalar (non-vectorizable)
    sections
  • total time = (scalar fraction)/(scalar rate) +
    (vector fraction)/(vector rate)
  • example: 10% at 1 MFLOPS + 90% at 100 MFLOPS
  • rate = 100 / (0.1 * 100 + 0.9 * 1) = 9.2 MFLOPS
    !!!

17
Vector Version of Amdahl's Law
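The slide itself is a figure; a sketch of the relation it presumably
plots, consistent with the example on the previous slide (f = vector
fraction, r_s = scalar rate, r_v = vector rate):

    effective rate = 1 / ( (1 - f)/r_s + f/r_v )

For f = 0.9, r_s = 1 MFLOPS, r_v = 100 MFLOPS this gives 9.2 MFLOPS;
even as r_v grows without bound, the rate is capped at r_s/(1 - f) = 10 MFLOPS.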
18
Vector Computer Design Guide
  • Must have SHORT vector startup => can work with
    short vectors
  • Must have FASTEST POSSIBLE scalar unit => can
    afford scalar sections
  • irregular data structures => need gather,
    scatter, merge operations (and a few more),
    as in the loops sketched below
  • gather:  x(i) = a(index(i)) + b(i)
  • scatter: y(index(i)) = c(i) + d(i)
  • merge:   where (a(i) > b(i)) c(i) = d(i)
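A minimal Fortran sketch of the three access patterns above, written as
explicit loops; the "+" operators follow the reconstruction above and
the subroutine wrapper is illustrative:

      subroutine irreg(x, y, a, b, c, d, index, n)
      integer n, i, index(n)
      real x(n), y(n), a(n), b(n), c(n), d(n)
c     gather: indirect load through index(i)
      do 10 i = 1, n
         x(i) = a(index(i)) + b(i)
   10 continue
c     scatter: indirect store through index(i)
      do 20 i = 1, n
         y(index(i)) = c(i) + d(i)
   20 continue
c     merge: conditional (masked) assignment
      do 30 i = 1, n
         if (a(i) .gt. b(i)) c(i) = d(i)
   30 continue
      end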

19
Cray Research, Inc.
  • Founded by Seymour Cray (father of the CDC 6600/7600)
    in 1972 (STAR-100 design already known)
  • first Cray-1 delivered in 1976 to Los Alamos
    Scientific Laboratory (LASL)
  • 8 vector registers of 64 elements each
  • vector load/store instructions
  • fastest scalar computer of its time
  • 160 MFLOPS peak rate (= 2 ops/cycle @ 80 MHz), few
    cycles startup

20
Seymour Cray with the Cray-1, 1976: single processor, 80
MFLOPS, 1 Mword = 8 Mbyte
Photograph courtesy of Charles Babbage Institute,
University of Minnesota, Minneapolis
21
Large working set
  - 8 vector registers, 64 words each
  - 8 scalar registers
  - 8 address registers
  - large instruction buffer
Performance Features
  - vector processing: one operation affects 64 vector
    elements, streamed through a functional unit
  - small vector startup time
  - chaining between vector ops (see the sketch below)
  - large, fast semiconductor memory
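A sketch of the kind of loop that benefits from chaining, assuming (as
is standard for the Cray-1) that results of the vector multiply are fed
element by element into the vector add unit without waiting for the
full multiply to complete:

      subroutine chain(a, b, c, d, n)
      integer n, i
      real a(n), b(n), c(n), d(n)
c     multiply and add functional units work on the same stream
c     at once, approaching 2 flops per cycle on 64-element registers
      do 10 i = 1, n
         a(i) = b(i) * c(i) + d(i)
   10 continue
      end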
22
Cray Research, Inc. cont'd
  • 1982 Cray X-MP (Steve Chen improvements, up to 4
    processors, shared memory)
  • 1985 Cray-2, 256 Mword memory, 4 processors,
    immersion cooled
  • 1988 Cray Y-MP (last Chen machine)
  • 1991 Cray C90 (up to 16 vector CPUs, shared
    memory)
  • 1993 Cray T3D (massively parallel Alpha)
  • one and only Cray-3 delivered to NCAR
    (Cray Computer Corporation)
  • 1994 Cray J90 (up to 32 vector CPUs, shared
    memory), air cooled
  • 1995 Cray T3E (most successful MPP machine), Cray
    T90 (parallel vector, immersion cooled)
  • Cray-4 abandoned (Cray Computer
    Corporation chapter 11)
  • 1996 acquired by Silicon Graphics
  • 1998 Cray SV1 (parallel vector, air cooled)
  • 2000 acquired by Tera Computer Company => Cray, Inc.
  • 2002 Cray X1, parallel vector, immersion spray
    cooled
  • 2004 Cray X1E, enhanced version of X1
  • Cray XT3, AMD-based 3D-torus massively
    parallel machine

23
CDC Cyber 200 Family
  • 1980: enhanced version of STAR-100
  • reduced startup time, 50 cycles
  • fast scalar unit
  • rich instruction repertoire
  • still memory-to-memory, 400 MFLOPS peak
  • Cyber 203, Cyber 205, ETA-10 (10 GFLOPS)
  • vector FORTRAN language extensions provided
  • terminated in 1989 since unprofitable
  • around 40 Cyber 200, 34 ETA-10 sold

24
Minnesota Supercomputer Center, Minneapolis,
1986: Cray-2, CDC Cyber 205
25
NEC Japan
  • 1983 SX-1, single processor vector, 650 MFLOPS
  • 1985 SX-2, single processor vector, 1300 MFLOPS
  • 1990 SX-3, four processors at 5 GFLOPS each, 4
    Gbyte = 0.5 Gword memory
  • 1995 SX-4, 32 processors at 2 GFLOPS each
    (CMOS; all previous machines ECL)
  • 1998 SX-5, up to 512 processors, 8 GFLOPS each
  • 2002 SX-6, up to 1024 processors, 8 GFLOPS each
  • 2004 SX-7, up to 2048 processors, 8.8 GFLOPS each
  • 2004 SX-8, up to 4096 processors, 16 GFLOPS each

26
IBM - Sony - Toshiba CELL processor
  - 8 vector CPUs + general-purpose CPU on a single chip
  - 256 kbyte = 32 kword local storage (very small !!)
  - 12 word/cycle internal interconnect = 386 Gbyte/sec
  - 24 Gbyte/sec = 3 Gword/sec main memory
  - 76 Gbyte/sec = 9.5 Gword/sec communication
  - @ 4 GHz clock: 256 GFLOPS (32 bit) peak,
    26 GFLOPS (64 bit) peak
  - max 4.5 Gbyte addressable, 512 Mbyte implemented
  - system interconnect ?
  - used within Sony Playstation 3
  - Mercury, IBM blades available, 512 Mbyte only
  - highly imbalanced for scientific computation
    (see the arithmetic below)
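A rough check of the "highly imbalanced" claim, using only the numbers
on this slide (peak rates vs. main-memory bandwidth; detailed figures
would differ):

    32 bit: 256 GFLOPS peak vs. 24 Gbyte/sec = 6 Gword/sec
            => about 43 flops required per word fetched to stay busy
    64 bit:  26 GFLOPS peak vs. 3 Gword/sec
            => about 9 flops required per word fetched

A memory-streaming vector kernel such as a(i) = b*c(i) + d(i) performs
2 flops per 3 words moved, so out of main memory it would reach only a
few percent of peak; the small local storage must carry the load.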
27
IBM - Sony - Toshiba CELL processor
  - 90 nm SOI, 8 layers Cu interconnect
  - 234 M transistors
  - 221 mm² die size
  - significant potential in future revisions
  - but 80 W @ 1.1 V / 4.0 GHz is too much
  - 180 W @ 1.4 V / 5.6 GHz is much too much
  - work needed: power reduction,
    larger internal memory,
    improved 64-bit arithmetic
28
IBM - Sony - Toshiba CELL processor
From S. Williams et al., Lawrence Berkeley National
Laboratory:
  - single Cell chip performance
  - compared with a Cray X1E single vector processor
    and several commodity microprocessors (AMD, Intel)
  - already the current version shows impressive
    speedup, at the cost of significant programming
    complexity (explicit storage moves as opposed to
    caching)
  - a slightly enhanced Cell (Cell+) simulation
    provides very significant additional speedup
    (more efficient double precision)
  - current version insufficient for major impact
  - future versions may change that, great potential