Title: Scalable Numerical Algorithms and Methods on the ASCI Machines
CS61V
Parallel Architectures
Architecture Classification
- The Flynn taxonomy (proposed in 1966!)
- Functional taxonomy based on the notion of streams of information: data and instructions
- Platforms are classified according to whether they have a single (S) or multiple (M) stream of data or instructions.
Flynn's Classification
Architecture Categories:
- SISD
- SIMD
- MISD
- MIMD
SISD
- Classic von Neumann machine
- Basic components: CPU (control unit, ALU) and Main Memory (RAM)
- Connected via a bus (aka the von Neumann bottleneck)
- Examples: standard desktop computer, laptop
SISD
[Diagram: a control unit (C) sends a single instruction stream (IS) to one processor (P), which exchanges a single data stream (DS) with memory (M).]
SIMD
- Pure SIMD machine
- single CPU devoted exclusively to control
- collection of subordinate ALUs, each with a small amount of memory
- Instruction cycle: CPU broadcasts, ALUs execute or idle (a small sketch of this cycle follows below)
- lock-step progress (effectively a global clock)
- Key point: completely synchronous execution of statements
- Vector and matrix computation lend themselves to an SIMD implementation
- Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2
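A small sequential C sketch of that broadcast/execute-or-idle cycle. This is only an illustrative model: the number of PEs, the activity mask, and the broadcast operation are assumptions for the example, not details of any particular machine.

    /* One SIMD instruction cycle: the control CPU broadcasts an operation;   */
    /* every active ALU applies it to its local datum, inactive ALUs idle.    */
    #include <stdio.h>

    #define NPE 8                      /* number of processing elements (assumed) */

    int main(void) {
        double local[NPE]  = {1, -2, 3, -4, 5, -6, 7, -8};  /* per-PE local memory */
        int    active[NPE] = {1, 1, 1, 1, 0, 0, 1, 1};      /* PE activity flags   */

        /* Broadcast instruction: local = 2 * local (all PEs in lock-step). */
        for (int pe = 0; pe < NPE; pe++)
            if (active[pe])
                local[pe] *= 2.0;

        for (int pe = 0; pe < NPE; pe++)
            printf("PE %d: %g\n", pe, local[pe]);
        return 0;
    }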
Data Parallel Systems
- Programming model
- Operations performed in parallel on each element of a data structure
- Logically a single thread of control, performing sequential or parallel steps
- Conceptually, a processor associated with each data element
- Architectural model
- Array of many simple, cheap processors, each with little memory
- Processors don't sequence through instructions
- Attached to a control processor that issues instructions
- Specialized and general communication, cheap global synchronization
- Original motivations
- Matches simple differential equation solvers
- Centralizes the high cost of instruction fetch/sequencing
Data Parallel Programming
- In this approach, we must determine how large amounts of data can be split up. In other words, we need to identify small chunks of data which require similar processing.
- These chunks of data are then assigned to different sites where they can be processed. The computations at each node may require some intermediate results from peer nodes.
- The same executable could be running on each processing site, but each processing site would have different datasets.
- For data parallelism to work best, the volume of communicated values should be small compared with the volume of locally computed results. (A simple block decomposition is sketched below.)
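As a concrete illustration of splitting data into similar chunks, the following C sketch block-partitions an array of n elements across nsites processing sites. The sizes and names are made up for the example, not taken from any particular system.

    /* Block decomposition sketch: site s owns elements [lo, hi). */
    #include <stdio.h>

    int main(void) {
        int n = 1000;                  /* total number of data elements (assumed) */
        int nsites = 4;                /* number of processing sites (assumed)    */
        int chunk = (n + nsites - 1) / nsites;        /* ceiling division         */

        for (int s = 0; s < nsites; s++) {
            int lo = s * chunk;
            int hi = (lo + chunk < n) ? lo + chunk : n;
            printf("site %d processes elements %d..%d\n", s, lo, hi - 1);
        }
        return 0;
    }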
Data Parallel Programming
- Data parallel decomposition can be implemented using an SPMD (single program, multiple data) programming model.
- One processing element is regarded as "first among equals".
- This processor starts up the program and initialises the other processors. It then works as an equal to these processors.
- Each PE is doing approximately the same calculation on different data (see the MPI-style sketch below).
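A minimal SPMD sketch in C using MPI, where rank 0 acts as "first among equals": it initialises a parameter, broadcasts it, and then computes on its own slice of the data exactly like every other rank. The slice size and the per-element work are assumptions for illustration only.

    /* SPMD sketch: the same executable runs on every rank, on different data. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double a = 0.0;
        if (rank == 0) a = 2.5;                            /* rank 0 initialises */
        MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);   /* ... then shares a  */

        /* Every rank now performs the same calculation on its own slice. */
        double sum = 0.0;
        for (int i = 0; i < 4; i++)
            sum += a * (rank * 4 + i);
        printf("rank %d of %d: local sum = %g\n", rank, size, sum);

        MPI_Finalize();
        return 0;
    }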
Data Parallel Languages
- Data-parallel architectures introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different to the operations allowed on a sequential array.
- Unfortunately, each data parallel language had features tied to a particular manufacturer's parallel computer architecture, e.g.
- *LISP, C* and CM Fortran for Thinking Machines Corporation's Connection Machine series of computers.
- In the 1980s and 1990s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction, Multiple Data (MIMD) parallel computer architectures.
Example - ILLIAC IV
- ILLIAC IV was the first large system to employ semiconductor primary memory, built in 1974 at the University of Illinois.
- The ILLIAC IV was a SIMD computer for array processing.
- It consisted of
- a control unit (CU) and
- 64 processing elements (PEs).
- Each processing element had two thousand 64-bit words of memory associated with it. The CU could access all 128K words of memory through a bus, but each PE could only directly access its local memory.
Example - ILLIAC IV
- An 8 by 8 grid interconnect joined each PE to 4 neighbours.
- The CU interpreted program instructions scattered across the memory, and broadcast them to the PEs.
- Neither the PEs nor the CU were general-purpose computers in the modern sense; the CU had quite limited arithmetic capabilities.
- Between 1975 and 1981 it was the world's fastest computer.
Example - ILLIAC IV
- The ILLIAC IV had thirteen rotating fixed-head disks which comprised part of the central system memory.
- The ILLIAC IV was one of the first computers to use an all-semiconductor main memory.
Data Parallel Languages
- CFD was a data parallel language developed in the early 70s at the Computational Fluid Dynamics Branch of Ames Research Center.
- CFD was a "FORTRAN-like" language, rather than a FORTRAN dialect.
- The language design was extremely pragmatic. No attempt was made to hide the hardware peculiarities from the user; in fact, every attempt was made to give programmers access to and control of all of the ILLIAC hardware so they could construct an efficient program.
- CFD had five basic datatypes:
- CU INTEGER
- CU REAL
- CU LOGICAL
- PE REAL
- PE INTEGER
Data Parallel Languages
- The type of a variable statically encoded its home
- either on the control unit or on the processing elements.
- Apart from restrictions on their home, the two INTEGER and REAL types behaved like the corresponding types in ordinary FORTRAN.
- The CU LOGICAL type was more idiosyncratic
- it had 64 independent bits that acted as flags controlling activity of the PEs.
Data Parallel Languages
- Scalars and arrays of the five types could be declared as in FORTRAN.
- An ordinary variable or array of type CU REAL, for example, would be allocated in the (very small) control unit memory.
- An ordinary variable or array of type PE REAL would be allocated somewhere in the collective memory of the processing elements (accessible by the control unit over the data bus), e.g.
- CU REAL A, B(100)
- PE INTEGER I
- PE REAL D(25), E(1000)
- The last data structure available in CFD was a new kind of array called a vector-aligned array.
Data Parallel Languages
- Only the first dimension could be distributed, and the extent of that dimension had to be exactly 64.
- A vector-aligned array would be of PE INTEGER or PE REAL type, and the syntax for the distributed dimension involved an asterisk:
- PE INTEGER J(*)
- PE REAL X(*,4), Y(*,2,8)
- These are parallel arrays.
- J(1) is stored on the first PE
- J(2) is stored on the second PE, and so on.
- Similarly X(1,1), X(1,2), X(1,3), X(1,4) are stored on PE 1
- X(2,1), X(2,2), X(2,3), X(2,4) are stored on PE 2, etc.
Data Parallel Languages
- A vector expression was a vector-aligned array with a (*) subscript in the first dimension.
- Communication between neighbouring PEs was captured by allowing the (*) to have some shift added, as in
- DIFP(*) = P(*+1) - P(*-1)
- All shifts were cyclic (end-around) shifts, so this parallel statement is equivalent to the sequential statements (a C rendering follows below)
- DIFP(1) = P(2) - P(64)
- DIFP(2) = P(3) - P(1)
- ...
- DIFP(64) = P(1) - P(63)
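In C, the end-around behaviour of that statement could be written sequentially as follows. The array length of 64 matches the number of ILLIAC PEs; the 1-based CFD indices become 0-based C indices, and the function name is chosen just for this sketch.

    /* Sequential equivalent of DIFP(*) = P(*+1) - P(*-1) with cyclic shifts. */
    #define NPE 64

    void cyclic_diff(const double P[NPE], double DIFP[NPE]) {
        for (int i = 0; i < NPE; i++) {
            int right = (i + 1) % NPE;           /* end-around +1 shift */
            int left  = (i - 1 + NPE) % NPE;     /* end-around -1 shift */
            DIFP[i] = P[right] - P[left];
        }
    }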
Data Parallel Languages
- Essential flexibility was added by allowing vector assignments to be executed conditionally with a vector test (sketched in C below), e.g.
- IF (A(*) .LT. 0) A(*) = -A(*)
- Less structured methods of masking operations by explicitly assigning PE activity flags in CU LOGICAL variables were also available
- there were special primitives for restricting activity to simply-specified ranges of PEs.
- PEs could concurrently access different addresses in their local memory by using vector subscripts:
- DIAG(*) = RHO(*, X(*))
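A sequential C sketch of the masked assignment IF (A(*) .LT. 0) A(*) = -A(*): the vector test sets an activity flag per PE, and only active PEs perform the assignment while the rest idle. Names and sizes are assumptions for the example.

    /* Masked vector assignment: negate only the negative elements. */
    #define NPE 64

    void mask_negate(double A[NPE]) {
        int active[NPE];
        for (int i = 0; i < NPE; i++)
            active[i] = (A[i] < 0.0);   /* vector test sets PE activity flags */
        for (int i = 0; i < NPE; i++)
            if (active[i])
                A[i] = -A[i];           /* executed only on active PEs        */
    }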
Connection Machine
[Figure: Connection Machine architecture (Tucker, IEEE Computer, Aug. 1988)]
CM-5
- Repackaged SparcStation
- 4 per board
- Fat-Tree network
- Control network for global synchronization
Whither SIMD machines?
- Trade-off: individual processor performance for collective performance
- CM-1 had 64K PEs, each 1-bit!
- Problems with SIMD
- Inflexible - not all problems can use this style of parallelism
- cannot leverage off microprocessor technology
- => cannot be general-purpose architectures
- Special-purpose SIMD architecture still viable (array processors, DSP chips)
Vector Processors
- Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
- These are specified as operations on vector registers
- A processor comes with some number of such registers
- A vector register holds 32-64 elements
- The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
- The hardware performs a full vector operation in (elements-per-vector-register / pipes) cycles
[Diagram: r3 = r1 + r2 on vector registers - logically, it performs #elements adds in parallel; actually, it performs #pipes adds in parallel]
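As a worked example using the numbers quoted above: with 64-element vector registers and 4 pipes, a single vector add logically performs 64 element-wise additions but physically takes roughly 64 / 4 = 16 cycles of arithmetic (plus pipeline start-up).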
Concept of Vector Processing
A processor that is capable of adding two vectors by streaming the two vectors through a pipelined adder.
[Diagram: Stream A and Stream B flow from a multiport memory system through a pipelined adder, producing Stream C = A + B.]
The Architecture of a Vector Computer
[Diagram: instructions and data come from main memory (program and data); a scalar control unit routes scalar instructions and scalar data to the scalar processor, while a vector control unit routes vector instructions and vector data to vector registers feeding several vector functional pipelines; a host computer provides mass storage and user I/O.]
Vector Processors
- Advantages
- quick fetch and decode of a single instruction for multiple operations
- the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
- The compiler does the work for you, of course
- Memory-to-memory machines
- no registers
- can process very long vectors, but startup time is large
- appeared in the 70s and died in the 80s
- Examples: Cray, Fujitsu, Hitachi, NEC
Vector Processors
- What about
- for (j = 0; j < 100; j++)
- A[j] = B[j] + C[j]
- Scalar code: load, operate, store for each iteration
- Both instructions and data consume memory bandwidth
- The solution: a vector instruction
Vector Processors
- A(0:99) = B(0:99) + C(0:99)
- Single instruction requires memory bandwidth for data only.
- No control overhead for loops (a strip-mined sketch of how hardware executes this appears below)
- Pitfalls
- extension to the instruction set, vector functional units, vector registers, memory subsystem changes for vectors
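One way to picture how hardware with fixed-length vector registers executes A(0:99) = B(0:99) + C(0:99) is strip mining: the 100-element operation is broken into pieces of at most one vector-register length. The C sketch below assumes a maximum vector length of 64; the outer loop corresponds to issuing one vector instruction per strip, so instruction bandwidth stays low.

    /* Strip-mining sketch for A(0:99) = B(0:99) + C(0:99); MVL assumed to be 64. */
    #define MVL 64

    void vadd_stripmined(double *A, const double *B, const double *C, int n) {
        for (int start = 0; start < n; start += MVL) {        /* one strip ...      */
            int vl = (n - start < MVL) ? (n - start) : MVL;   /* ... of <= MVL elts */
            for (int i = 0; i < vl; i++)                      /* the body of one    */
                A[start + i] = B[start + i] + C[start + i];   /* "vector add"       */
        }
    }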
Vector Processors
- Merits of vector processors
- Very deep pipeline without data hazards
- The computation of each result is independent of the computation of previous results
- Instruction bandwidth requirement is reduced
- A vector instruction specifies a great deal of work
- Control hazards are nonexistent
- A vector instruction represents an entire loop.
- No loop branch
Vector Processors (Cont'd)
- The high latency of initiating a main memory access is amortized
- A single access is initiated for the entire vector rather than a single word
- Known access pattern
- Interleaved memory banks
- A vector operation is faster than a sequence of scalar operations on the same number of data items!
Vector Programming Example
Y = a * X + Y

          LD    F0, a        ; load scalar a
          ADDI  R4, Rx, 512  ; last address to load
    Loop: LD    F2, 0(Rx)    ; load X(i)
          MULTD F2, F0, F2   ; a x X(i)
          LD    F4, 0(Ry)    ; load Y(i)
          ADDD  F4, F2, F4   ; a x X(i) + Y(i)
          SD    F4, 0(Ry)    ; store into Y(i)
          ADDI  Rx, Rx, 8    ; increment index to X
          ADDI  Ry, Ry, 8    ; increment index to Y
          SUB   R20, R4, Rx  ; compute bound
          BNZ   R20, Loop    ; check if done

The loop body repeats 64 times (RISC machine).
Vector Programming Example (Cont'd)
Y = a * X + Y

          LD     F0, a       ; load scalar a
          LV     V1, Rx      ; load vector X
          MULTSV V2, F0, V1  ; vector-scalar multiply
          LV     V3, Ry      ; load vector Y
          ADDV   V4, V2, V3  ; add
          SV     Ry, V4      ; store the result

6 instructions in total - low instruction bandwidth (vector machine). A plain C version of the kernel follows below.
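For reference, here is the same kernel (Y = a*X + Y over 64 elements, matching the 512-byte bound in the scalar loop) written in plain C; the function name is chosen just for this sketch.

    /* Y(i) = a * X(i) + Y(i) for i = 0..63, as in the listings above. */
    void daxpy64(double a, const double X[64], double Y[64]) {
        for (int i = 0; i < 64; i++)
            Y[i] = a * X[i] + Y[i];
    }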
A Vector-Register Architecture (DLXV)
[Diagram: main memory feeds a vector load-store unit; vector registers and scalar registers connect through crossbars to multiple functional pipelines (FP add/subtract, etc.).]
Vector Machines

Machine         Registers   Elements per register   Load/store units   Functional units
CRAY-1          8           64                      1                  6
Fujitsu VP200   8-256       32-1024                 2                  3
CRAY X-MP       8           64                      2 Ld / 1 St        8
Hitachi S820    32          256                     4                  4
NEC SX/2        8 + 8192    256                     8                  16
Convex C-1      8           128                     1                  4
CRAY-2          8           64                      1                  5
CRAY Y-MP       8           64                      2 Ld / 1 St        8
CRAY C-90       8           128                     4                  8
NEC SX/4        8 + 8192    256                     8                  16
MISD
- Multiple instruction, single data
- Doesn't really exist, unless you consider pipelining an MISD configuration
MISD
[Diagram: multiple control units (C) each send their own instruction stream (IS) to separate processors (P), all of which operate on a single data stream (DS) from memory (M).]