Title: Scalable Numerical Algorithms and Methods on the ASCI Machines
CS61V
Parallel Architectures
Architecture Classification
- The Flynn taxonomy (proposed in 1966!)
- Functional taxonomy based on the notion of streams of information: data and instructions
- Platforms are classified according to whether they have a single (S) or multiple (M) stream of data or instructions.
Flynn's Classification
Architecture Categories:
- SISD
- SIMD
- MISD
- MIMD
SISD
- Classic von Neumann machine
- Basic components: CPU (control unit, ALU) and Main Memory (RAM)
- Connected via a bus (aka the von Neumann bottleneck)
- Examples: standard desktop computer, laptop
SISD
[Diagram: a control unit (C) sends a single instruction stream (IS) to one processor (P), which exchanges a single data stream (DS) with memory (M).]
SIMD
- Pure SIMD machine
- single CPU devoted exclusively to control
- collection of subordinate ALUs, each with a small amount of memory
- Instruction cycle: CPU broadcasts, ALUs execute or idle (a small sketch of this cycle follows below)
- lock-step progress (effectively a global clock)
- Key point: completely synchronous execution of statements
- Vector and matrix computation lend themselves to an SIMD implementation
- Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2
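A small sequential C sketch of that broadcast/execute-or-idle cycle. This is only an illustrative model: the number of PEs, the activity mask, and the broadcast operation are assumptions for the example, not details of any particular machine.

    /* One SIMD instruction cycle: the control CPU broadcasts an operation;   */
    /* every active ALU applies it to its local datum, inactive ALUs idle.    */
    #include <stdio.h>

    #define NPE 8                      /* number of processing elements (assumed) */

    int main(void) {
        double local[NPE]  = {1, -2, 3, -4, 5, -6, 7, -8};  /* per-PE local memory */
        int    active[NPE] = {1, 1, 1, 1, 0, 0, 1, 1};      /* PE activity flags   */

        /* Broadcast instruction: local = 2 * local (all PEs in lock-step). */
        for (int pe = 0; pe < NPE; pe++)
            if (active[pe])
                local[pe] *= 2.0;

        for (int pe = 0; pe < NPE; pe++)
            printf("PE %d: %g\n", pe, local[pe]);
        return 0;
    }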
Data Parallel Systems
- Programming model
- Operations performed in parallel on each element of a data structure
- Logically a single thread of control, performing sequential or parallel steps
- Conceptually, a processor associated with each data element
- Architectural model
- Array of many simple, cheap processors, each with little memory
- Processors don't sequence through instructions
- Attached to a control processor that issues instructions
- Specialized and general communication, cheap global synchronization
- Original motivations
- Matches simple differential equation solvers
- Centralizes the high cost of instruction fetch/sequencing
Data Parallel Programming
- In this approach, we must determine how large amounts of data can be split up. In other words, we need to identify small chunks of data which require similar processing.
- These chunks of data are then assigned to different sites where they can be processed. The computations at each node may require some intermediate results from peer nodes.
- The same executable could be running on each processing site, but each processing site would have different datasets.
- For data parallelism to work best, the volume of communicated values should be small compared with the volume of locally computed results. (A simple block decomposition is sketched below.)
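As a concrete illustration of splitting data into similar chunks, the following C sketch block-partitions an array of n elements across nsites processing sites. The sizes and names are made up for the example, not taken from any particular system.

    /* Block decomposition sketch: site s owns elements [lo, hi). */
    #include <stdio.h>

    int main(void) {
        int n = 1000;                  /* total number of data elements (assumed) */
        int nsites = 4;                /* number of processing sites (assumed)    */
        int chunk = (n + nsites - 1) / nsites;        /* ceiling division         */

        for (int s = 0; s < nsites; s++) {
            int lo = s * chunk;
            int hi = (lo + chunk < n) ? lo + chunk : n;
            printf("site %d processes elements %d..%d\n", s, lo, hi - 1);
        }
        return 0;
    }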
Data Parallel Programming
- Data parallel decomposition can be implemented using an SPMD (single program, multiple data) programming model.
- One processing element is regarded as "first among equals".
- This processor starts up the program and initialises the other processors. It then works as an equal to these processors.
- Each PE is doing approximately the same calculation on different data (see the MPI-style sketch below).
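A minimal SPMD sketch in C using MPI, where rank 0 acts as "first among equals": it initialises a parameter, broadcasts it, and then computes on its own slice of the data exactly like every other rank. The slice size and the per-element work are assumptions for illustration only.

    /* SPMD sketch: the same executable runs on every rank, on different data. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double a = 0.0;
        if (rank == 0) a = 2.5;                            /* rank 0 initialises */
        MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* ... then shares a  */

        /* Every rank now performs the same calculation on its own slice. */
        double sum = 0.0;
        for (int i = 0; i < 4; i++)
            sum += a * (rank * 4 + i);
        printf("rank %d of %d: local sum = %g\n", rank, size, sum);

        MPI_Finalize();
        return 0;
    }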
Data Parallel Languages
- Data-parallel architectures introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different to the operations allowed on a sequential array.
- Unfortunately, each data parallel language had features tied to a particular manufacturer's parallel computer architecture, e.g.
- *LISP, C* and CM Fortran for Thinking Machines Corporation's Connection Machine series of computers.
- In the 1980s and 1990s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction, Multiple Data (MIMD) parallel computer architectures.
Example - ILLIAC IV
- ILLIAC IV was the first large system to employ semiconductor primary memory, built in 1974 at the University of Illinois.
- The ILLIAC IV was a SIMD computer for array processing.
- It consisted of
- a control unit (CU) and
- 64 processing elements (PEs).
- Each processing element had two thousand 64-bit words of memory associated with it. The CU could access all 128K words of memory through a bus, but each PE could only directly access its local memory.
Example - ILLIAC IV
- An 8 by 8 grid interconnect joined each PE to 4 neighbours.
- The CU interpreted program instructions scattered across the memory, and broadcast them to the PEs.
- Neither the PEs nor the CU were general-purpose computers in the modern sense; the CU had quite limited arithmetic capabilities.
- Between 1975 and 1981 it was the world's fastest computer.
Example - ILLIAC IV
- The ILLIAC IV had thirteen rotating fixed-head disks which comprised part of the central system memory.
- The ILLIAC IV was one of the first computers to use an all-semiconductor main memory.
Data Parallel Languages
- CFD was a data parallel language developed in the early 70s at the Computational Fluid Dynamics Branch of Ames Research Center.
- CFD was a "FORTRAN-like" language, rather than a FORTRAN dialect.
- The language design was extremely pragmatic. No attempt was made to hide the hardware peculiarities from the user; in fact, every attempt was made to give programmers access to and control of all of the ILLIAC hardware so they could construct an efficient program.
- CFD had five basic datatypes:
- CU INTEGER
- CU REAL
- CU LOGICAL
- PE REAL
- PE INTEGER
Data Parallel Languages
- The type of a variable statically encoded its home
- either on the control unit or on the processing elements.
- Apart from restrictions on their home, the two INTEGER and REAL types behaved like the corresponding types in ordinary FORTRAN.
- The CU LOGICAL type was more idiosyncratic
- it had 64 independent bits that acted as flags controlling activity of the PEs.
Data Parallel Languages
- Scalars and arrays of the five types could be declared as in FORTRAN.
- An ordinary variable or array of type CU REAL, for example, would be allocated in the (very small) control unit memory.
- An ordinary variable or array of type PE REAL would be allocated somewhere in the collective memory of the processing elements (accessible by the control unit over the data bus), e.g.
- CU REAL A, B(100)
- PE INTEGER I
- PE REAL D(25), E(1000)
- The last data structure available in CFD was a new kind of array called a vector-aligned array.
Data Parallel Languages
- Only the first dimension could be distributed, and the extent of that dimension had to be exactly 64.
- A vector-aligned array would be of PE INTEGER or PE REAL type, and the syntax for the distributed dimension involved an asterisk:
- PE INTEGER J(*)
- PE REAL X(*,4), Y(*,2,8)
- These are parallel arrays.
- J(1) is stored on the first PE
- J(2) is stored on the second PE, and so on.
- Similarly X(1,1), X(1,2), X(1,3), X(1,4) are stored on PE 1
- X(2,1), X(2,2), X(2,3), X(2,4) are stored on PE 2, etc.
Data Parallel Languages
- A vector expression was a vector-aligned array with a (*) subscript in the first dimension.
- Communication between neighbouring PEs was captured by allowing the (*) to have some shift added, as in
- DIFP(*) = P(*+1) - P(*-1)
- All shifts were cyclic (end-around) shifts, so this parallel statement is equivalent to the sequential statements (a C rendering follows below)
- DIFP(1) = P(2) - P(64)
- DIFP(2) = P(3) - P(1)
- ...
- DIFP(64) = P(1) - P(63)
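In C, the end-around behaviour of that statement could be written sequentially as follows. The array length of 64 matches the number of ILLIAC PEs; the 1-based CFD indices become 0-based C indices, and the function name is chosen just for this sketch.

    /* Sequential equivalent of DIFP(*) = P(*+1) - P(*-1) with cyclic shifts. */
    #define NPE 64

    void cyclic_diff(const double P[NPE], double DIFP[NPE]) {
        for (int i = 0; i < NPE; i++) {
            int right = (i + 1) % NPE;           /* end-around +1 shift */
            int left  = (i - 1 + NPE) % NPE;     /* end-around -1 shift */
            DIFP[i] = P[right] - P[left];
        }
    }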
Data Parallel Languages
- Essential flexibility was added by allowing vector assignments to be executed conditionally with a vector test (sketched in C below), e.g.
- IF (A(*) .LT. 0) A(*) = -A(*)
- Less structured methods of masking operations by explicitly assigning PE activity flags in CU LOGICAL variables were also available
- there were special primitives for restricting activity to simply-specified ranges of PEs.
- PEs could concurrently access different addresses in their local memory by using vector subscripts:
- DIAG(*) = RHO(*, X(*))
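A sequential C sketch of the masked assignment IF (A(*) .LT. 0) A(*) = -A(*): the vector test sets an activity flag per PE, and only active PEs perform the assignment while the rest idle. Names and sizes are assumptions for the example.

    /* Masked vector assignment: negate only the negative elements. */
    #define NPE 64

    void mask_negate(double A[NPE]) {
        int active[NPE];
        for (int i = 0; i < NPE; i++)
            active[i] = (A[i] < 0.0);   /* vector test sets PE activity flags */
        for (int i = 0; i < NPE; i++)
            if (active[i])
                A[i] = -A[i];           /* executed only on active PEs        */
    }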
Connection Machine
[Figure: Connection Machine architecture (Tucker, IEEE Computer, Aug. 1988)]
CM-5
- Repackaged SparcStation
- 4 per board
- Fat-Tree network
- Control network for global synchronization
Whither SIMD machines?
- Trade-off: individual processor performance for collective performance
- CM-1 had 64K PEs, each 1-bit!
- Problems with SIMD
- Inflexible - not all problems can use this style of parallelism
- cannot leverage off microprocessor technology
- => cannot be general-purpose architectures
- Special-purpose SIMD architecture still viable (array processors, DSP chips)
Vector Processors
- Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
- These are specified as operations on vector registers
- A processor comes with some number of such registers
- A vector register holds 32-64 elements
- The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
- The hardware performs a full vector operation in (elements-per-vector-register / pipes) cycles
[Diagram: r3 = r1 + r2 on vector registers - logically, it performs #elements adds in parallel; actually, it performs #pipes adds in parallel]
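As a worked example using the numbers quoted above: with 64-element vector registers and 4 pipes, a single vector add logically performs 64 element-wise additions but physically takes roughly 64 / 4 = 16 cycles of arithmetic (plus pipeline start-up).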
Concept of Vector Processing
A processor that is capable of adding two vectors by streaming the two vectors through a pipelined adder.
[Diagram: Stream A and Stream B flow from a multiport memory system through a pipelined adder, producing Stream C = A + B.]
The Architecture of a Vector Computer
[Diagram: instructions and data come from main memory (program and data); a scalar control unit routes scalar instructions and scalar data to the scalar processor, while a vector control unit routes vector instructions and vector data to vector registers feeding several vector functional pipelines; a host computer provides mass storage and user I/O.]
Vector Processors
- Advantages
- quick fetch and decode of a single instruction for multiple operations
- the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
- The compiler does the work for you, of course
- Memory-to-memory machines
- no registers
- can process very long vectors, but startup time is large
- appeared in the 70s and died in the 80s
- Examples: Cray, Fujitsu, Hitachi, NEC
Vector Processors
- What about
- for (j = 0; j < 100; j++)
- A[j] = B[j] + C[j]
- Scalar code: load, operate, store for each iteration
- Both instructions and data consume memory bandwidth
- The solution: a vector instruction
Vector Processors
- A(0:99) = B(0:99) + C(0:99)
- Single instruction requires memory bandwidth for data only.
- No control overhead for loops (a strip-mined sketch of how hardware executes this appears below)
- Pitfalls
- extension to the instruction set, vector functional units, vector registers, memory subsystem changes for vectors
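One way to picture how hardware with fixed-length vector registers executes A(0:99) = B(0:99) + C(0:99) is strip mining: the 100-element operation is broken into pieces of at most one vector-register length. The C sketch below assumes a maximum vector length of 64; the outer loop corresponds to issuing one vector instruction per strip, so instruction bandwidth stays low.

    /* Strip-mining sketch for A(0:99) = B(0:99) + C(0:99); MVL assumed to be 64. */
    #define MVL 64

    void vadd_stripmined(double *A, const double *B, const double *C, int n) {
        for (int start = 0; start < n; start += MVL) {        /* one strip ...      */
            int vl = (n - start < MVL) ? (n - start) : MVL;   /* ... of <= MVL elts */
            for (int i = 0; i < vl; i++)                      /* the body of one    */
                A[start + i] = B[start + i] + C[start + i];   /* "vector add"       */
        }
    }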
Vector Processors
- Merits of vector processors
- Very deep pipeline without data hazards
- The computation of each result is independent of the computation of previous results
- Instruction bandwidth requirement is reduced
- A vector instruction specifies a great deal of work
- Control hazards are nonexistent
- A vector instruction represents an entire loop.
- No loop branch
Vector Processors (Cont'd)
- The high latency of initiating a main memory access is amortized
- A single access is initiated for the entire vector rather than a single word
- Known access pattern
- Interleaved memory banks
- A vector operation is faster than a sequence of scalar operations on the same number of data items!
Vector Programming Example
Y = a * X + Y

          LD    F0, a        ; load scalar a
          ADDI  R4, Rx, 512  ; last address to load
    Loop: LD    F2, 0(Rx)    ; load X(i)
          MULTD F2, F0, F2   ; a x X(i)
          LD    F4, 0(Ry)    ; load Y(i)
          ADDD  F4, F2, F4   ; a x X(i) + Y(i)
          SD    F4, 0(Ry)    ; store into Y(i)
          ADDI  Rx, Rx, 8    ; increment index to X
          ADDI  Ry, Ry, 8    ; increment index to Y
          SUB   R20, R4, Rx  ; compute bound
          BNZ   R20, Loop    ; check if done

The loop body repeats 64 times (RISC machine).
Vector Programming Example (Cont'd)
Y = a * X + Y

          LD     F0, a       ; load scalar a
          LV     V1, Rx      ; load vector X
          MULTSV V2, F0, V1  ; vector-scalar multiply
          LV     V3, Ry      ; load vector Y
          ADDV   V4, V2, V3  ; add
          SV     Ry, V4      ; store the result

6 instructions in total - low instruction bandwidth (vector machine). A plain C version of the kernel follows below.
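For reference, here is the same kernel (Y = a*X + Y over 64 elements, matching the 512-byte bound in the scalar loop) written in plain C; the function name is chosen just for this sketch.

    /* Y(i) = a * X(i) + Y(i) for i = 0..63, as in the listings above. */
    void daxpy64(double a, const double X[64], double Y[64]) {
        for (int i = 0; i < 64; i++)
            Y[i] = a * X[i] + Y[i];
    }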
A Vector-Register Architecture (DLXV)
[Diagram: main memory feeds a vector load-store unit; vector registers and scalar registers connect through crossbars to multiple functional pipelines (FP add/subtract, etc.).]
Vector Machines

Machine         Registers   Elements per register   Load/store units   Functional units
CRAY-1          8           64                      1                  6
Fujitsu VP200   8-256       32-1024                 2                  3
CRAY X-MP       8           64                      2 Ld / 1 St        8
Hitachi S820    32          256                     4                  4
NEC SX/2        8 + 8192    256                     8                  16
Convex C-1      8           128                     1                  4
CRAY-2          8           64                      1                  5
CRAY Y-MP       8           64                      2 Ld / 1 St        8
CRAY C-90       8           128                     4                  8
NEC SX/4        8 + 8192    256                     8                  16
MISD
- Multiple instruction, single data
- Doesn't really exist, unless you consider pipelining an MISD configuration
MISD
[Diagram: multiple control units (C) each send their own instruction stream (IS) to separate processors (P), all of which operate on a single data stream (DS) from memory (M).]