Scalable Numerical Algorithms and Methods on the ASCI Machines


1
CS61V
Parallel Architectures
2
Architecture Classification
  • The Flynn taxonomy (proposed in 1966!)
  • A functional taxonomy based on the notion of
    streams of information: data and instructions
  • Platforms are classified according to whether
    they have single (S) or multiple (M) streams of
    data or instructions.

3
Flynn's Classification
Architecture Categories
SISD
SIMD
MISD
MIMD
4
SISD
  • Classic von Neumann machine
  • Basic components: CPU (control unit, ALU) and
    main memory (RAM)
  • Connected via a bus (aka the von Neumann bottleneck)
  • Examples: standard desktop computer, laptop

5
SISD
[Diagram: SISD organization: memory (M) supplies an instruction stream (IS) to the control unit (C), which issues it to the processor (P); the processor exchanges a data stream (DS) with memory]
6
SIMD
  • Pure SIMD machine
  • A single CPU devoted exclusively to control
  • A collection of subordinate ALUs, each with a small
    amount of memory
  • Instruction cycle: the CPU broadcasts, the ALUs
    execute or idle (see the sketch after this list)
  • Lock-step progress (effectively a global clock)
  • Key point: completely synchronous execution of
    statements
  • Vector and matrix computations lend themselves to
    an SIMD implementation
  • Examples of SIMD computers: Illiac IV, MPP, DAP,
    CM-2, and MasPar MP-2
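
To make the broadcast/execute-or-idle cycle concrete, here is a minimal C sketch (mine, not from the slides; all names are illustrative): a control loop broadcasts one operation, and each subordinate ALU applies it to its local memory unless its activity flag is off.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_ALUS 8

    /* Per-ALU state: a tiny local memory and an activity flag. */
    struct alu { double local; bool active; };

    /* One instruction cycle: the control unit broadcasts the same
       operation to every ALU; active ALUs execute it, idle ones don't. */
    static void broadcast_scale(struct alu pe[], double factor) {
        for (int i = 0; i < NUM_ALUS; i++)   /* conceptually simultaneous */
            if (pe[i].active)
                pe[i].local *= factor;
    }

    int main(void) {
        struct alu pe[NUM_ALUS];
        for (int i = 0; i < NUM_ALUS; i++) {
            pe[i].local = i;
            pe[i].active = (i % 2 == 0);     /* mask: odd-numbered ALUs idle */
        }
        broadcast_scale(pe, 2.0);            /* one broadcast "instruction" */
        for (int i = 0; i < NUM_ALUS; i++)
            printf("PE %d: %g\n", i, pe[i].local);
        return 0;
    }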

7
SIMD
8
Data Parallel Systems
  • Programming model:
  • Operations performed in parallel on each element
    of a data structure
  • Logically a single thread of control, performing
    sequential or parallel steps
  • Conceptually, a processor associated with each
    data element
  • Architectural model:
  • Array of many simple, cheap processors, each with
    little memory
  • Processors don't sequence through instructions
  • Attached to a control processor that issues
    instructions
  • Specialized and general communication, cheap
    global synchronization
  • Original motivations:
  • Matches simple differential equation solvers
  • Centralizes the high cost of instruction
    fetch/sequencing

9
Data Parallel Programming
  • In this approach, we must determine how large
    amounts of data can be split up. In other words,
    we need to identify small chunks of data which
    require similar processing.
  • These chunks of data are then assigned to
    different sites where they can be processed. The
    computations at each node may require some
    intermediate results from peer nodes.
  • The same executable could be running on each
    processing site, but each processing site would
    have different datasets.
  • For data parallelism to work best, the volume of
    communicated values should be small compared with
    the volume of locally computed results (see the
    sketch below).
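
As a concrete illustration (my sketch, not from the slides), this C fragment splits an array into contiguous chunks, one per processing site, so every site runs the same computation on its own portion of the data:

    #include <stdio.h>

    #define N     1000   /* total data elements */
    #define SITES 4      /* number of processing sites */

    /* The computation every site applies to its own chunk. */
    static double process_chunk(const double x[], int lo, int hi) {
        double sum = 0.0;
        for (int i = lo; i < hi; i++)
            sum += x[i] * x[i];
        return sum;
    }

    int main(void) {
        static double x[N];
        for (int i = 0; i < N; i++) x[i] = 1.0;

        double total = 0.0;
        for (int site = 0; site < SITES; site++) {
            /* Contiguous block decomposition: site owns [lo, hi). */
            int lo = site * N / SITES;
            int hi = (site + 1) * N / SITES;
            total += process_chunk(x, lo, hi);  /* would run in parallel */
        }
        printf("total = %g\n", total);
        return 0;
    }

In a real run each chunk would live on a different site, and only small intermediate results (here, one partial sum per site) would need to be communicated.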

10
Data Parallel Programming
  • Data parallel decomposition can be implemented
    using the SPMD (single program, multiple data)
    programming model.
  • One processing element is regarded as "first
    among equals".
  • This processor starts up the program and
    initialises the other processors. It then works
    as an equal to these processors.
  • Each PE does approximately the same
    calculation on different data (see the SPMD
    sketch below).
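
A minimal SPMD sketch in C using MPI (the slides do not name MPI; it is just one common realization): rank 0 acts as "first among equals" by initialising a shared parameter and broadcasting it, after which every rank, rank 0 included, performs the same per-rank calculation.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Rank 0 starts things up: it sets the shared parameter... */
        double alpha = 0.0;
        if (rank == 0)
            alpha = 2.5;                /* e.g. read from an input file */
        MPI_Bcast(&alpha, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ...then works as an equal: every rank runs the same code
           on its own data. */
        double local = alpha * rank;
        printf("rank %d of %d: local result %g\n", rank, size, local);

        MPI_Finalize();
        return 0;
    }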

11
Data Parallel Programming
  • Data-parallel architectures introduced the new
    programming-language concept of a distributed or
    parallel array. Typically the set of semantic
    operations allowed on a distributed array was
    somewhat different from the operations allowed on a
    sequential array.
  • Unfortunately, each data parallel language had
    features tied to a particular manufacturer's
    parallel computer architecture, e.g.
  • *Lisp, C*, and CM Fortran for Thinking Machines
    Corporation's Connection Machine series of
    computers.
  • In the 1980s and 1990s microprocessors grew in
    power and availability, and fell in price.
    Building SIMD computers out of simple but
    specialized compute nodes gradually became less
    economical than putting a general-purpose
    commodity microprocessor at every node.
    Eventually SIMD computers were displaced almost
    completely by Multiple Instruction Multiple Data
    (MIMD) parallel computer architectures.

12
Example - ILLIAC IV
  • ILLIAC IV, built in 1974 at the University of
    Illinois, was the first large system to employ
    semiconductor primary memory.
  • The ILLIAC IV was a SIMD computer for array
    processing.
  • It consisted of
  • a control unit (CU) and
  • 64 processing elements (PEs).
  • Each processing element had two thousand 64-bit
    words of memory associated with it. The CU could
    access all 128K words of memory through a bus,
    but each PE could only directly access its local
    memory.

13
Example - ILLIAC IV
  • An 8 by 8 grid interconnect joined each PE to 4
    neighbours.
  • The CU interpreted program instructions scattered
    across the memory, and broadcast them to the PEs.
  • Neither the PEs nor the CU were general-purpose
    computers in the modern sense; the CU had quite
    limited arithmetic capabilities.
  • Between 1975 and 1981 it was the world's fastest
    computer.

14
Example - ILLIAC IV
  • The ILLIAC IV had thirteen rotating fixed-head
    disks which comprised part of the central system
    memory.
  • The ILLIAC IV was one of the first computers to
    use an all-semiconductor main memory.

15
Example - ILLIAC IV
16
Example - ILLIAC IV
17
Data Parallel Languages
  • CFD was a data parallel language developed in the
    early '70s at the Computational Fluid Dynamics
    Branch of Ames Research Center.
  • CFD was a "FORTRAN-like" language, rather than
    a FORTRAN dialect.
  • The language design was extremely pragmatic. No
    attempt was made to hide the hardware
    peculiarities from the user; in fact, every
    attempt was made to give programmers access to
    and control of all of the ILLIAC hardware so they
    could construct an efficient program.
  • CFD had five basic datatypes:
  • CU INTEGER
  • CU REAL
  • CU LOGICAL
  • PE REAL
  • PE INTEGER.

18
Data Parallel Languages
  • The type of a variable statically encoded its
    home:
  • either on the control unit or on the processing
    elements.
  • Apart from restrictions on their home, the two
    INTEGER and two REAL types behaved like the
    corresponding types in ordinary FORTRAN.
  • The CU LOGICAL type was more idiosyncratic:
  • it had 64 independent bits that acted as flags
    controlling activity of the PEs.

19
Data Parallel Languages
  • Scalars and arrays of the five types could be
    declared as in FORTRAN.
  • An ordinary variable or array of type CU REAL,
    for example, would be allocated in the (very
    small) control unit memory.
  • An ordinary variable or array of type PE REAL
    would be allocated somewhere in the collective
    memory of the processing elements (accessible by
    the control unit over the data bus), e.g.
  • CU REAL A, B(100)
  • PE INTEGER I
  • PE REAL D(25), E(1000)
  • The last data structure available in CFD was a
    new kind of array called a vector-aligned array.

20
Data Parallel Languages
  • Only the first dimension could be distributed,
    and the extent of that dimension had to be
    exactly 64.
  • A vector-aligned array would be of PE INTEGER or
    PE REAL type, and the syntax for the distributed
    dimension involved an asterisk:
  • PE INTEGER J(*)
  • PE REAL X(*,4), Y(*,2,8)
  • These are parallel arrays.
  • J(1) is stored on the first PE,
  • J(2) is stored on the second PE, and so on.
  • Similarly X(1,1), X(1,2), X(1,3), X(1,4) are
    stored on PE 1;
  • X(2,1), X(2,2), X(2,3), X(2,4) are stored on
    PE 2, etc.

21
Data Parallel Languages
  • A vector expression was a vector-aligned array
    with a (*) subscript in the first dimension.
  • Communication between neighbouring PEs was
    captured by allowing the (*) to have some shift
    added, as in
  • DIFP(*) = P(*+1) - P(*-1)
  • All shifts were cyclic (end-around) shifts, so
    this parallel statement is equivalent to the
    sequential statements
  • DIFP(1) = P(2) - P(64)
  • DIFP(2) = P(3) - P(1)
  • ...
  • DIFP(64) = P(1) - P(63)
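
A plain C emulation of this end-around shift (my sketch; 0-based indices rather than CFD's 1-based PE numbering):

    #include <stdio.h>

    #define NPE 64   /* one array element per processing element */

    int main(void) {
        double p[NPE], difp[NPE];
        for (int i = 0; i < NPE; i++) p[i] = i * i;

        /* DIFP(*) = P(*+1) - P(*-1) with cyclic (end-around) shifts:
           each "PE" i reads its two neighbours modulo 64. */
        for (int i = 0; i < NPE; i++)
            difp[i] = p[(i + 1) % NPE] - p[(i + NPE - 1) % NPE];

        printf("difp[0] = %g, difp[63] = %g\n", difp[0], difp[63]);
        return 0;
    }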

22
Data Parallel Languages
  • Essential flexibility was added by allowing
    vector assignments to be executed conditionally
    with a vector test, e.g.
  • IF (A(*) .LT. 0) A(*) = -A(*)
  • Less structured methods of masking operations, by
    explicitly assigning PE activity flags in CU
    LOGICAL variables, were also available;
  • there were special primitives for restricting
    activity to simply-specified ranges of PEs.
  • PEs could concurrently access different addresses
    in their local memory by using vector subscripts:
  • DIAG(*) = RHO(*, X(*))
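
The masked vector assignment above has the semantics of the following scalar C loop (a sketch of the meaning, not ILLIAC code): the vector test builds a per-PE mask, and only the PEs whose flag is set execute the assignment.

    #include <stdio.h>

    #define NPE 64

    int main(void) {
        double a[NPE];
        for (int i = 0; i < NPE; i++)
            a[i] = (i % 2) ? -(double)i : (double)i;

        /* IF (A(*) .LT. 0) A(*) = -A(*): evaluate the test on every
           PE, then apply the assignment only where the mask is set. */
        for (int i = 0; i < NPE; i++) {
            int mask = (a[i] < 0.0);
            if (mask)
                a[i] = -a[i];
        }

        printf("a[1] = %g (was -1)\n", a[1]);
        return 0;
    }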

23
Connection Machine
(Tucker, IEEE Computer, Aug. 1988)
24
CM-5
  • Repackaged SparcStation
  • 4 per board
  • Fat-Tree network
  • Control network for global synchronization

25
Whither SIMD machines?
  • Trades off individual processor performance for
    collective performance
  • The CM-1 had 64K PEs, each 1-bit!
  • Problems with SIMD:
  • Inflexible: not all problems can use this style
    of parallelism
  • Cannot leverage off microprocessor technology
  • => cannot be general-purpose architectures
  • Special-purpose SIMD architectures are still viable
    (array processors, DSP chips)

26
Vector Processors
  • Definition: a processor that can do element-wise
    operations on entire vectors with a single
    instruction, called a vector instruction
  • These are specified as operations on vector
    registers
  • A processor comes with some number of such
    registers
  • A vector register holds 32-64 elements
  • The number of elements is larger than the amount
    of parallel hardware, called vector pipes or
    lanes, say 2-4
  • The hardware performs a full vector operation in
    (elements per vector register) / (number of pipes)
    cycles
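
For example, with 64-element vector registers and 4 pipes, one vector instruction completes in 64 / 4 = 16 cycles (plus pipeline startup), even though it logically specifies all 64 element operations at once.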

[Diagram: a vector add r3 = r1 + r2; logically the machine performs one add per element in parallel, but the hardware actually performs only as many adds in parallel as there are pipes]
27
Concept of Vector Processing
A processor that is capable of adding two vectors
by streaming the two vectors through a pipelined
adder.
[Diagram: streams A and B flow from a multiport memory system through a pipelined adder, producing stream C = A + B]
28
The Architecture of a Vector Computer
[Diagram: a host computer with mass storage and user I/O attaches to main memory (program and data); a scalar control unit issues scalar instructions to a scalar processor, while a vector control unit issues vector instructions to vector registers and multiple vector functional pipelines; scalar and vector data move between the processors and main memory]
29
Vector Processors
  • Advantages:
  • Quick fetch and decode of a single instruction
    for multiple operations
  • The instruction provides the processor with a
    regular source of data, which can arrive each
    cycle and be processed in a pipelined fashion
  • The compiler does the work for you, of course
  • Memory-to-memory variants:
  • No registers
  • Can process very long vectors, but startup time
    is large
  • Appeared in the '70s and died in the '80s
  • Examples: Cray, Fujitsu, Hitachi, NEC

30
Vector Processors
  • What about:
  • for (j = 0; j < 100; j++)
  •   A[j] = B[j] + C[j];
  • Scalar code: load, operate, store for each
    iteration
  • Both instructions and data consume memory
    bandwidth
  • The solution: a vector instruction

31
Vector Processors
  • A[0:99] = B[0:99] + C[0:99]
  • A single instruction requires memory bandwidth for
    data only.
  • No control overhead for loops
  • Pitfalls:
  • Extensions to the instruction set, vector
    functional units, vector registers, and memory
    subsystem changes for vectors
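
A compiler maps such a statement onto fixed-length vector hardware by strip-mining: processing the array in register-sized strips. A hedged C sketch of the idea, with the inner loop standing in for one vector add (on real hardware it would be a single vector instruction):

    #include <stdio.h>

    #define N    100
    #define VLEN 64   /* elements per vector register */

    int main(void) {
        double a[N], b[N], c[N];
        for (int j = 0; j < N; j++) { b[j] = j; c[j] = 2 * j; }

        /* Strip-mining: each pass handles at most VLEN elements. */
        for (int lo = 0; lo < N; lo += VLEN) {
            int vl = (N - lo < VLEN) ? N - lo : VLEN;  /* vector length */
            for (int j = 0; j < vl; j++)               /* one vector add */
                a[lo + j] = b[lo + j] + c[lo + j];
        }

        printf("a[99] = %g\n", a[99]);
        return 0;
    }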

32
Vector Processors
  • Merits of vector processors:
  • Very deep pipelines without data hazards
  • The computation of each result is independent of
    the computation of previous results
  • Instruction bandwidth requirements are reduced
  • A vector instruction specifies a great deal of
    work
  • Control hazards are nonexistent
  • A vector instruction represents an entire loop:
  • No loop branch

33
Vector Processors (Contd)
  • The high latency of initiating a main memory
    access is amortized
  • A single access is initiated for the entire
    vector rather than a single word
  • Known access patterns
  • Interleaved memory banks (sketched below)
  • A vector operation is faster than a sequence of
    scalar operations on the same number of data
    items!
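
To illustrate the interleaving (my sketch, with an assumed bank count): low-order interleaving sends consecutive words to different banks, so a unit-stride vector access lets bank busy times overlap and a new element can arrive every cycle once the pipeline fills.

    #include <stdio.h>

    #define BANKS 8   /* assumed number of interleaved memory banks */

    int main(void) {
        /* Word address -> bank under low-order interleaving: a
           unit-stride vector load visits banks 0,1,2,... in turn,
           giving each bank BANKS cycles to recover between accesses. */
        for (int addr = 0; addr < 16; addr++)
            printf("word %2d -> bank %d\n", addr, addr % BANKS);
        return 0;
    }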

34
Vector Programming Example
Y = a * X + Y
RISC (scalar) machine:
        LD    F0, a        ; load scalar a
        ADDI  R4, Rx, 512  ; last address to load
Loop:   LD    F2, 0(Rx)    ; load X(i)
        MULTD F2, F0, F2   ; a * X(i)
        LD    F4, 0(Ry)    ; load Y(i)
        ADDD  F4, F2, F4   ; a * X(i) + Y(i)
        SD    F4, 0(Ry)    ; store into Y(i)
        ADDI  Rx, Rx, 8    ; increment index to X
        ADDI  Ry, Ry, 8    ; increment index to Y
        SUB   R20, R4, Rx  ; compute bound
        BNZ   R20, Loop    ; check if done
The loop body repeats 64 times.
35
Vector Programming Example(Contd)
Y = a * X + Y
Vector machine:
        LD     F0, a       ; load scalar a
        LV     V1, Rx      ; load vector X
        MULTSV V2, F0, V1  ; vector-scalar multiply
        LV     V3, Ry      ; load vector Y
        ADDV   V4, V2, V3  ; add
        SV     Ry, V4      ; store the result
6 instructions (low instruction bandwidth)
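
For these 64-element vectors (512 bytes at 8 bytes per element), the scalar version issues 2 setup instructions plus 9 per iteration over 64 iterations, roughly 578 in all, versus 6 for the vector version: the low instruction bandwidth the slide highlights.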
36
A Vector-Register Architecture (DLXV)
[Diagram: main memory connects through a vector load-store unit to the vector registers; crossbars route the vector registers to several floating-point functional pipelines (e.g. FP add/subtract) and to the scalar registers]
37
Vector Machines
Machine         Registers   Elements per register   Load-store units   Functional units
CRAY-1          8           64                      1                  6
Fujitsu VP200   8-256       32-1024                 2                  3
CRAY X-MP       8           64                      2 Ld / 1 St        8
Hitachi S820    32          256                     4                  4
NEC SX/2        8 + 8192    256                     8                  16
Convex C-1      8           128                     1                  4
CRAY-2          8           64                      1                  5
CRAY Y-MP       8           64                      2 Ld / 1 St        8
CRAY C-90       8           128                     4                  8
NEC SX/4        8 + 8192    256                     8                  16
38
MISD
  • Multiple instruction, single data
  • Doesn't really exist, unless you consider
    pipelining an MISD configuration

39
MISD
[Diagram: MISD organization: several control units (C) each issue an instruction stream (IS) to their own processor (P), and all processors operate on a single data stream (DS) supplied by memory (M)]