1 Pipelined Vector Processing and Scientific Computation
John G. Zabolitzky
2 Applications of High-Performance Computing
- Weather prediction, climatic simulation
- fluid dynamics simulation (aerodynamics for aerospace, automobile, combustion, ...)
- basic science
- cosmology
- quantum mechanical many-body problems
- chemistry
- solid-state
- quantum fluids
- high-energy physics
- cryptography
- weapons research
- energy research
- nuclear reactor simulation
- fusion research
- many many more
3 Terminal State of Scalar Computing: CDC 7600, 1968
- Maximum RISC performance of 1 operation/cycle achieved
- No further improvement possible without change of paradigm
- 36 MHz => 36 MIPS => 5 MFLOPS real
The CDC 7600 (designed by Seymour Cray) was the most powerful of all computers from 1968 to 1976, when the Cray-1 achieved > 10 times its performance.
4 Pipelined Scalar Execution
6 Scalar Code Example
- DO i=1,100   a(i) = b(i)*c(i)
- load b, inc address
- load c, inc address
- multiply
- store a, inc address
- decrement count, loop?
- 5 instructions = 5 cycles (optimum) for one multiply
- pipelined multiplier could start one multiply each and every cycle => only 20% efficient use
- expensive multiplier sits idle most of the time
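The 20% figure follows directly from the instruction count of the loop body. A minimal sketch, assuming one instruction issues per cycle as on the slide; the function and variable names are illustrative and not from the presentation:

```python
def multiplier_utilization(instructions_per_iteration, multiplies_per_iteration=1):
    """Fraction of cycles in which the multiplier starts new work.

    Assumes exactly one instruction issues per cycle (the scalar optimum).
    """
    return multiplies_per_iteration / instructions_per_iteration


# The five instructions of the scalar loop body listed on the slide.
loop_body = [
    "load b, inc address",
    "load c, inc address",
    "multiply",
    "store a, inc address",
    "decrement count, loop",
]

utilization = multiplier_utilization(len(loop_body))
print(f"multiplier busy {utilization:.0%} of the time")
```

Only one of the five issue slots feeds the multiplier, so a fully pipelined unit that could accept a new operand pair every cycle is used at one fifth of its capacity.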
7 Architectural Alternatives
- Pipelined Scalar (RISC): as outlined before
- Pipelined Vector (this presentation, further down)
- SIMD (Single Instruction Multiple Data): parallel arithmetic (e.g., ILLIAC IV)
- too expensive, inefficient: larger number of lightly used multipliers
- Superscalar: multiple issue in one cycle
- all modern single-chip CPUs (Intel to TI): keep all functions busy
- VLIW (Very Long Instruction Word): variant of Superscalar
- MIMD (Multiple Instruction Multiple Data): true parallel streams, e.g. Cray T3E, IBM Blue Gene, IBM Cell; may be superimposed on top of ANY CPU architecture
8 Vector Computation
- Scientific codes spend a high percentage of their time looping over simple data structures
- DO i=1,100   a(i) = b*c(i) + d(i)
- simple logical structure =>
- set up such that one multiply/cycle
- one instruction for entire loop
- MFLOP rate = cycle rate or multiple thereof
- specialized for scientific/engineering tasks
9 Vector Pipeline: c(i) = a(i)*b(i)
Inventor: Henry Ford
10 Need to Vectorize: some automatic, high quality requires hand-optimization
- Naive scalar code for matrix multiply
- s = 0.0
- do j=1,n
- s = s + a(i,j)*b(j,k)
- Recursive on s => adder pipeline blocked
- vector code for matrix multiply
- do i=1,n
- c(i,k) = c(i,k) + a(i,j)*b(j,k)
- Independent vector elements, but 1.5x bandwidth
- Frequently a good idea: exchange inner/outer loop
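The two loop orders can be written out in full to see where the dependence sits. This is a sketch in plain Python (0-based indices, nested lists), not the Fortran of the slide; the function names are made up for illustration:

```python
def matmul_dot_product(a, b, n):
    """Inner loop over j: every step updates the same scalar s.

    Each addition needs the previous value of s, which is the
    recurrence that keeps a pipelined adder from overlapping work.
    """
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            s = 0.0
            for j in range(n):
                s = s + a[i][j] * b[j][k]
            c[i][k] = s
    return c


def matmul_column_update(a, b, n):
    """Inner loop over i: each c[i][k] is updated independently.

    No iteration of the inner loop depends on another, so the inner
    loop is a candidate for a single vector operation.
    """
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            b_jk = b[j][k]  # scalar, constant over the inner loop
            for i in range(n):
                c[i][k] = c[i][k] + a[i][j] * b_jk
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_dot_product(a, b, 2))
print(matmul_column_update(a, b, 2))
```

Both versions produce the same product. The second one reads and writes c(i,k) on every step in addition to reading a(i,j), which is presumably the source of the extra memory traffic the slide refers to.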
11 First Vector Computers
- Control Data Corporation (CDC) STAR-100: STring ARray, 100 MFLOPS
- memory-to-memory architecture
- therefore long startup times (n00 cycles)
- very slow scalar unit (2 MFLOPS)
- overall disappointing performance
- contracted 1967, announced 1972, delivered 1974
- total of 4 machines, 2 at Lawrence Livermore Lab
- Thornton (CDC) and Fernbach (LLL) lose their jobs
12 CDC STAR-100
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis
13 Texas Instruments ASC
- Advanced Scientific Computer, early 1970s
- architecturally similar to CDC STAR-100
- 7 units sold
- TI dropped out of mainframe computer
manufacturing after this machine
14 Vector Performance I
- MFLOP rate (MFLOPS) as function of vector length n
- scalar: constant (only some loop overhead, then n * loop time)
- vector (n = length of vector)
- cycles = startup + n / nflop_per_cycle
- rate/clock = ops / cycles = n / (startup + n)
- half rate at vector length n = startup
- full rate needs n >> startup => Long Vector Machine
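The startup model is easy to tabulate. The sketch below assumes one floating-point result per cycle unless stated otherwise; the startup values of 100 and 5 cycles are hypothetical stand-ins for a long-startup and a short-startup machine, not measured figures:

```python
def vector_cycles(n, startup, flops_per_cycle=1.0):
    """Cycles for one vector operation of length n: startup + n / flops_per_cycle."""
    return startup + n / flops_per_cycle


def rate_per_clock(n, startup, flops_per_cycle=1.0):
    """Delivered operations per clock cycle for vector length n."""
    return n / vector_cycles(n, startup, flops_per_cycle)


# Compare a long-startup and a short-startup design at several vector lengths.
for n in (10, 100, 1000, 10000):
    slow = rate_per_clock(n, startup=100.0)
    fast = rate_per_clock(n, startup=5.0)
    print(f"n={n:6d}  startup=100: {slow:.3f}  startup=5: {fast:.3f}")
```

With one result per cycle the rate reaches half of peak exactly at n = startup, so the machine with 100 cycles of startup needs vectors of length 100 to reach the efficiency the other machine already has at length 5.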
15 Performance vs. Startup, Length
16 Vector Performance II
- Vector/Scalar Subsections
- ALL codes have some scalar (non-vectorizable) sections
- total time = (scalar fraction)/(scalar rate) + (vector fraction)/(vector rate)
- example: 10% / 1 MFLOPS + 90% / 100 MFLOPS
- 100 / (0.1 * 100 + 0.9 * 1) = 9.2 MFLOPS !!!
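The arithmetic of the example can be checked with a few lines. This is a sketch of the total-time formula on the slide; the function name and argument names are illustrative:

```python
def effective_rate(scalar_fraction, scalar_rate, vector_rate):
    """Overall rate when a fraction of the work runs at the scalar rate.

    Time per unit of work is the sum of the two parts, each divided by
    its own rate; the overall rate is the reciprocal of that time.
    """
    vector_fraction = 1.0 - scalar_fraction
    time_per_unit = scalar_fraction / scalar_rate + vector_fraction / vector_rate
    return 1.0 / time_per_unit


# 10% of the operations at 1 MFLOPS, 90% at 100 MFLOPS.
rate = effective_rate(0.10, scalar_rate=1.0, vector_rate=100.0)
print(f"{rate:.1f} MFLOPS")
```

The scalar 10% of the work takes more than ten times as long as the vector 90%, so the machine delivers under a tenth of its vector peak.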
17 Vector Version of Amdahl's Law
18 Vector Computer Design Guide
- Must have SHORT vector startup => can work with short vectors
- Must have FASTEST POSSIBLE scalar unit => can afford scalar sections
- irregular data structures => need gather, scatter, merge operations (and a few more)
- x(i) = a(index(i)) + b(i)
- y(index(i)) = c(i) + d(i)
- where (a(i) > b(i)) c(i) = d(i)
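The three access patterns can be spelled out element by element. This is a plain Python sketch with 0-based indices; the arithmetic operators in the gather and scatter lines did not survive in the transcript, so the `+` used here is an assumption, and all function names are illustrative:

```python
def gather_add(a, index, b):
    """Gather: x(i) = a(index(i)) + b(i). Reads a through an index list."""
    return [a[index[i]] + b[i] for i in range(len(index))]


def scatter_add(y, index, c, d):
    """Scatter: y(index(i)) = c(i) + d(i). Writes y through an index list."""
    for i in range(len(index)):
        y[index[i]] = c[i] + d[i]
    return y


def merge_where(a, b, c, d):
    """Merge: where a(i) > b(i), take d(i); elsewhere keep c(i)."""
    return [d[i] if a[i] > b[i] else c[i] for i in range(len(c))]


x = gather_add([10.0, 20.0, 30.0, 40.0], [3, 0, 2], [1.0, 2.0, 3.0])
y = scatter_add([0.0] * 4, [3, 0, 2], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
m = merge_where([1, 5, 3], [2, 4, 3], [0, 0, 0], [9, 9, 9])
print(x, y, m)
```

In all three cases the iterations are independent as long as the index list has no repeated entries, which is what makes hardware support for them worthwhile on a vector machine.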
19 Cray Research, Inc.
- Founded by Seymour Cray (father of CDC 6600/7600) in 1972 (STAR-100 known)
- first Cray-1 delivered in 1976 to Los Alamos Scientific Laboratory (LASL)
- 8 vector registers of 64 elements each
- Vector load/store instructions
- fastest scalar computer of its time
- 160 MFLOPS peak rate (2 ops/cycle @ 80 MHz), few cycles startup
20 Seymour Cray, Cray-1, 1976: Single Processor, 80 MFLOPS, 1 Mword = 8 Mbyte
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis
21 Large working set
- 8 vector registers, 64 words
- 8 scalar registers
- 8 address registers
- large instruction buffer
Performance Features
- vector processing: one operation affects 64 vector elements, streamed through functional unit
- small vector startup time
- chaining between vector ops
- large, fast semiconductor memory
22 Cray Research, Inc. (cont'd)
- 1982 Cray-XMP (Steve Chen improvements, up to 4 processors, shared memory)
- 1985 Cray-2, 256 Mword memory, 4 processors, immersion cooled
- 1988 Cray-YMP (last Chen machine)
- 1991 Cray C90 (up to 16 vector CPUs, shared memory)
- 1993 Cray T3D (massively parallel Alpha)
- one and only Cray-3 delivered to NCAR (Cray Computer Corporation)
- 1994 Cray J90 (up to 32 vector CPUs, shared memory), air cooled
- 1995 Cray T3E (most successful MPP machine), Cray T90 (parallel vector, immersion cooled)
- Cray-4 abandoned (Cray Computer Corporation, Chapter 11)
- 1996 acquired by Silicon Graphics
- 1998 Cray SV1 (parallel vector, air cooled)
- 2000 acquired by Tera Computer Company => Cray, Inc.
- 2002 Cray X1, parallel vector, immersion spray cooled
- 2004 Cray X1e, enhanced version of X1
- Cray XT3, AMD based 3D Torus massively parallel machine
23 CDC Cyber 200 Family
- 1980, enhanced version of STAR-100
- reduced startup time, 50 cycles
- fast scalar unit
- rich instruction repertoire
- still memory-to-memory, 400 MFLOPS peak
- Cyber 203, Cyber 205, ETA-10: 10 GFLOPS
- vector FORTRAN language extensions provided
- terminated in 1989 since unprofitable
- around 40 Cyber 200, 34 ETA-10 sold
24 Minnesota Supercomputer Center, Minneapolis, 1986: Cray-2, CDC Cyber 205
25 NEC Japan
- 1983 SX-1: single processor vector, 650 MFLOPS
- 1985 SX-2: single processor vector, 1300 MFLOPS
- 1990 SX-3: four processors at 5 GFLOPS each, 4 Gbyte = 0.5 Gword memory
- 1995 SX-4: 32 processors at 2 GFLOPS each (CMOS; all previous ECL)
- 1998 SX-5: up to 512 processors, 8 GFLOPS each
- 2002 SX-6: up to 1024 processors, 8 GFLOPS each
- 2004 SX-7: up to 2048 processors, 8.8 GFLOPS each
- 2004 SX-8: up to 4096 processors, 16 GFLOPS each
26 IBM - Sony - Toshiba CELL processor
- 8 vector CPUs + GPU on single chip
- 256 kbyte = 32 kword local storage (very small !!)
- 12 word/cycle internal interconnect = 386 Gbyte/sec
- 24 Gbyte/sec = 3 Gword/sec main memory
- 76 Gbyte/sec = 9.5 Gword/sec communication
- @ 4 GHz clock: 256 GFLOPS (32 bit) peak
- 26 GFLOPS (64 bit) peak
- max 4.5 Gbyte addressable, 512 Mbyte implemented
- system interconnect ?
- used within Sony Playstation 3
- Mercury, IBM blades available, 512 Mbyte only
- highly imbalanced for scientific computation
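One way to put a number on the imbalance is to divide main-memory bandwidth by peak arithmetic rate. The sketch below uses the bandwidth and peak figures from the slide; treating "operand words per floating-point operation" as the balance measure is an assumption made here, not something the slide states, and the function name is illustrative:

```python
def words_per_flop(bandwidth_gbyte_per_s, peak_gflops, bytes_per_word):
    """Operand words main memory can deliver per peak floating-point operation."""
    gword_per_s = bandwidth_gbyte_per_s / bytes_per_word
    return gword_per_s / peak_gflops


# Figures from the slide: 24 Gbyte/sec main memory,
# 26 GFLOPS peak (64 bit) and 256 GFLOPS peak (32 bit) at 4 GHz.
dp_balance = words_per_flop(24.0, 26.0, bytes_per_word=8)
sp_balance = words_per_flop(24.0, 256.0, bytes_per_word=4)

print(f"64 bit: {dp_balance:.3f} words/flop")
print(f"32 bit: {sp_balance:.3f} words/flop")
```

A loop such as a(i) = b*c(i) + d(i) moves three words for two operations, so any code that streams from main memory at well under one word per operation will run far below the quoted peak.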
27 IBM - Sony - Toshiba CELL processor
- 90 nm SOI, 8 layers Cu interconnect
- 234 M Transistors
- 221 mm² die size
- significant potential in future revisions
- but 80W @ 1.1V, 4.0 GHz is too much
- 180W @ 1.4V, 5.6 GHz is much too much
- work needed in power reduction
- larger internal memory
- 64 bit arithmetic improved
28 IBM - Sony - Toshiba CELL processor
From S. Williams et al., Lawrence Berkeley Laboratory
- single Cell chip performance
- compared with Cray X1E single vector processor and several commodity microprocessors (AMD, Intel)
- the current version already shows impressive speedup, at the cost of significant programming complexity (explicit storage moves as opposed to caching)
- simulation of a slightly enhanced Cell (Cell+) provides very significant additional speedup (more efficient DP)
- current version insufficient for major impact
- future versions may change that, great potential