Transcript and Presenter's Notes

Title: CS 267 Applications of Parallel Computers Lecture 5: Sources of Parallelism (continued) Shared-Memory Multiprocessors


1
CS 267 Applications of Parallel Computers
Lecture 5: Sources of Parallelism (continued); Shared-Memory Multiprocessors
  • Kathy Yelick
  • http://www.cs.berkeley.edu/dmartin/cs267/

2
Outline
  • Recap
  • Parallelism and Locality in PDEs
  • Continuous Variables Depending on Continuous
    Parameters
  • Example: The heat equation
  • Euler's method
  • Indirect methods
  • Shared Memory Machines
  • Historical Perspective: Centralized Shared Memory
  • Bus-based Cache-coherent Multiprocessors
  • Scalable Shared Memory Machines

3
Recap: Sources of Parallelism and Locality
  • Discrete event system
  • model is discrete space with discrete
    interactions
  • synchronous and asynchronous versions
  • parallelism over the graph of entities; communication for events
  • Particle systems
  • discrete entities moving in continuous space and
    time
  • parallelism between particles; communication for interactions
  • ODEs
  • systems of lumped (discrete) variables,
    continuous parameters
  • parallelism in solving (usually sparse) linear
    systems
  • graph partitioning for parallelizing the sparse
    matrix computation
  • PDEs (today)

4
Continuous Variables, Continuous Parameters
  • Examples of such systems include
  • Heat flow: Temperature(position, time)
  • Diffusion: Concentration(position, time)
  • Electrostatic or Gravitational Potential: Potential(position)
  • Fluid flow: Velocity, Pressure, Density(position, time)
  • Quantum mechanics: Wave-function(position, time)
  • Elasticity: Stress, Strain(position, time)

5
Example: Deriving the Heat Equation

[Diagram: an insulated bar stretching from position 0 to position 1, with interior points x and x+h marked]
  • Consider a simple problem
  • A bar of uniform material, insulated except at
    ends
  • Let u(x,t) be the temperature at position x at
    time t
  • Heat travels from x to x+h at a rate proportional to the temperature
    difference (u(x,t) - u(x+h,t)) / h
  • So the temperature at x changes at the rate

  d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h

  • As h → 0, we get the heat equation
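In LaTeX form, with the limit step written out explicitly (C is the material constant from this slide; the middle step is standard calculus added here for completeness):

  \frac{\partial u(x,t)}{\partial t}
    = \lim_{h \to 0} \frac{C}{h}\left[\frac{u(x-h,t)-u(x,t)}{h}
      - \frac{u(x,t)-u(x+h,t)}{h}\right]
    = \lim_{h \to 0} C \, \frac{u(x-h,t) - 2u(x,t) + u(x+h,t)}{h^2}
    = C \, \frac{\partial^2 u(x,t)}{\partial x^2}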

6
Explicit Solution of the Heat Equation
  • For simplicity, assume C = 1
  • Discretize both time and position
  • Use finite differences, with x[i, t] as the temperature at timestep t
    and grid point i
  • initial conditions give x[i, 0]
  • boundary conditions fix the temperature at the two ends of the bar
  • At each timestep
  • This corresponds to
  • matrix vector multiply
  • nearest neighbors on grid

x[i, t+1] = z * x[i-1, t] + (1 - 2z) * x[i, t] + z * x[i+1, t],   where z = k/h^2

[Diagram: space-time grid with timesteps t0..t5 (vertical) and grid points x0..x5 (horizontal); each point is computed from its three neighbors at the previous timestep]
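A minimal serial sketch of this timestep loop in C; the grid size, initial condition, number of steps, and value of z are illustrative assumptions, not from the slides.

#include <stdio.h>
#include <string.h>

#define N 6          /* grid points x0..x5, as in the picture above */

/* One explicit timestep: x_new[i] = z*x[i-1] + (1-2z)*x[i] + z*x[i+1].
   The boundary points x[0] and x[N-1] are held fixed (boundary conditions). */
static void explicit_step(const double x[N], double x_new[N], double z) {
    x_new[0] = x[0];
    x_new[N - 1] = x[N - 1];
    for (int i = 1; i < N - 1; i++)
        x_new[i] = z * x[i - 1] + (1.0 - 2.0 * z) * x[i] + z * x[i + 1];
}

int main(void) {
    double x[N] = {0, 0, 1, 1, 0, 0};   /* illustrative initial condition */
    double x_new[N];
    double z = 0.25;                    /* z = k/h^2; small enough to stay stable */
    for (int t = 0; t < 10; t++) {
        explicit_step(x, x_new, z);
        memcpy(x, x_new, sizeof x);
    }
    for (int i = 0; i < N; i++) printf("%g ", x[i]);
    printf("\n");
    return 0;
}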
7
Parallelism in Explicit Method for PDEs
  • Partition the space (x) into p large chunks (see the OpenMP sketch
    below)
  • good load balance (assuming a large number of points relative to p)
  • minimized communication (only at the boundaries of the p chunks)
  • Generalizes to
  • multiple dimensions
  • arbitrary graphs (= sparse matrices)
  • Problem with explicit approach
  • numerical instability
  • need to make the timesteps very small

8
Implicit Solution
  • As with many (stiff) ODEs, need an implicit
    method
  • This turns into solving the following equation
  • Where I is the identity matrix and T is
  • I.e., essentially solving Poisson's equation

(I + (z/2)T) * x[t+1]  =  (I - (z/2)T) * x[t]

       [  2 -1             ]
       [ -1  2 -1          ]
  T =  [    -1  2 -1       ]    (tridiagonal: 2 on the diagonal,
       [        ...   ...  ]     -1 on the off-diagonals)
       [           -1  2   ]
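A minimal serial sketch of one such implicit step for the 1D case: form the right-hand side (I - (z/2)T) x[t], then solve (I + (z/2)T) x[t+1] = rhs with a tridiagonal (Thomas) solve. The choice of solver and all names here are illustrative assumptions, and zero values outside the two ends are assumed as boundary conditions.

#include <stdlib.h>

/* One Crank-Nicolson step for the 1D heat equation (illustrative sketch).
   With T as above, A = I + (z/2)T has diagonal 1+z and off-diagonals -z/2,
   and B = I - (z/2)T has diagonal 1-z and off-diagonals +z/2. */
void implicit_step(const double *x, double *x_next, int n, double z) {
    double *b = malloc(n * sizeof *b);   /* right-hand side B*x, then overwritten */
    double *c = malloc(n * sizeof *c);   /* scratch for the Thomas algorithm      */

    /* b = (I - (z/2)T) x */
    for (int i = 0; i < n; i++) {
        b[i] = (1.0 - z) * x[i];
        if (i > 0)     b[i] += 0.5 * z * x[i - 1];
        if (i < n - 1) b[i] += 0.5 * z * x[i + 1];
    }

    /* Thomas algorithm for A x_next = b: diagonal d = 1+z, off-diagonal e = -z/2 */
    double d = 1.0 + z, e = -0.5 * z;
    c[0] = e / d;
    b[0] = b[0] / d;
    for (int i = 1; i < n; i++) {
        double m = d - e * c[i - 1];
        c[i] = e / m;
        b[i] = (b[i] - e * b[i - 1]) / m;
    }
    x_next[n - 1] = b[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x_next[i] = b[i] - c[i] * x_next[i + 1];

    free(b);
    free(c);
}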
9
2D Implicit Method
  • Similar to the 1D case, but the matrix T is now
  • Multiplying by this matrix (as in the explicit
    case) is simply nearest neighbor computation
  • To solve this system, there are several techniques

  T = the 2D analogue of the 1D matrix: 4 on the diagonal, and -1 in the
      positions corresponding to each grid point's nearest neighbors in the
      2D grid (a block tridiagonal matrix with tridiagonal blocks)
10
Algorithms for Solving the Poisson Equation
  Algorithm      Serial       PRAM            Memory      #Procs
  ---------      ------       ----            ------      ------
  Dense LU       N^3          N               N^2         N^2
  Band LU        N^2          N               N^(3/2)     N
  Jacobi         N^2          N               N           N
  Conj. Grad.    N^(3/2)      N^(1/2) log N   N           N
  RB SOR         N^(3/2)      N^(1/2)         N           N
  Sparse LU      N^(3/2)      N^(1/2)         N log N     N
  FFT            N log N      log N           N           N
  Multigrid      N            log^2 N         N           N
  Lower bound    N            log N           N
  • PRAM is an idealized parallel model with zero-cost communication

11
Administrative
  • HW2 extended to Monday, Feb. 16th
  • Break
  • On to shared memory machines

12
Programming Recap and History of SMPs
13
Relationship of Architecture and Programming Model
[Layer diagram, top to bottom: Parallel Application -> Programming Model ->
 compiler / library / operating system -> HW / SW interface (communication
 primitives) -> Hardware; the User / System Interface sits between the
 programming model and the compiler / library / OS layer]
14
Shared Address Space Programming Model
  • Collection of processes
  • Naming
  • Each process can name data in its own private address space, and
  • all can name all data in a common shared address space
  • Operations
  • Uniprocessor operations, plus synchronization operations on shared
    addresses
  • lock, unlock, test&set, fetch&add, ... (a lock-based sketch follows
    this list)
  • Operations on the shared address space appear to be performed in
    program order
  • each process's own operations appear to be in program order
  • all processes see a consistent interleaving of each other's operations
  • like timesharing on a uniprocessor
  • explicit synchronization operations are used when program ordering is
    not sufficient.
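As a concrete illustration of the model, a small pthreads sketch (thread counts and names are illustrative, not from the slides): every thread names the same shared counter, each also has private data, and lock/unlock orders the updates to the shared address.

#include <pthread.h>
#include <stdio.h>

/* All threads name the same shared variable; each also has private locals. */
static long shared_count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long private_iters = 100000;            /* private (per-thread) data     */
    (void)arg;
    for (long i = 0; i < private_iters; i++) {
        pthread_mutex_lock(&lock);          /* "lock" from the slide's list  */
        shared_count++;                     /* operation on a shared address */
        pthread_mutex_unlock(&lock);        /* "unlock"                      */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", shared_count);          /* 400000 with the lock in place */
    return 0;
}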

15
Example: shared flag indicating full/empty

  P1:              P2:
  a: A = 1         while (flag == 0) do nothing
  b: flag = 1      print A

  • Intuitively clear that the intention was to convey meaning by the order
    of the stores
  • No data dependences
  • A sequential compiler / architecture would be free to reorder them!
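One way to make the intended ordering explicit rather than assumed, sketched with C11 atomics and release/acquire ordering; this is an illustration of the kind of synchronization the slide calls for, not what any particular machine discussed here did, and the names are illustrative.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int A = 0;
static atomic_int flag = 0;

static void *p1(void *arg) {
    (void)arg;
    A = 1;                                                   /* a: A = 1    */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* b: flag = 1 */
    return NULL;
}

static void *p2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                  /* spin until the flag is set     */
    printf("%d\n", A);                     /* guaranteed to print 1          */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}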

16
Historical Perspective
  • Diverse spectrum of parallel machines designed to
    implement a particular programming model directly
  • Technological convergence on collections of
    microprocessors on a scalable interconnection
    network
  • Map any programming model to simple hardware
  • with some specialization

[Diagram: early programming models mapped directly to hardware --
 Shared Address Space -> centralized shared memory,
 Message Passing -> hypercubes and grids,
 Data Parallel -> SIMD;
 the convergence point: nodes that are essentially complete computers
 (processor P, memory M, communication assist CA) connected by a
 Scalable Interconnection Network]
17
60s Mainframe Multiprocessors
  • Enhance memory capacity or I/O capabilities by
    adding memory modules or I/O devices
  • How do you enhance processing capacity?
  • Add processors
  • Already need an interconnect between slow memory
    banks and processor I/O channels
  • cross-bar or multistage interconnection network

[Diagram: processors and I/O channels (IOC) connected to multiple memory
 modules (Mem) through an interconnect; I/O devices hang off the IOCs]
18
Caches: A Solution and New Problems
19
70s breakthrough
  • Caches!

[Diagram: a fast processor (P) with a cache holding a copy of memory
 location A (value 17); the cache sits between the processor and the
 interconnect, which also connects the slow memory and other I/O devices
 or processors]
20
Technology Perspective
           Capacity           Speed
  Logic    2x in 3 years      2x in 3 years
  DRAM     4x in 3 years      1.4x in 10 years
  Disk     2x in 3 years      1.4x in 10 years

  DRAM     Year    Size      Cycle Time
           1980    64 Kb     250 ns
           1983    256 Kb    220 ns
           1986    1 Mb      190 ns
           1989    4 Mb      165 ns
           1992    16 Mb     145 ns
           1995    64 Mb     120 ns

  Capacity grew about 1000:1 over this period; cycle time improved only about 2:1!
21
Bus Bottleneck and Caches
  • Assume a 100 MB/s bus
  • 50 MIPS processor w/o cache
  • => 200 MB/s instruction BW per processor
  • => 60 MB/s data BW at 30% load-store frequency
  • Suppose a 98% instruction hit rate and a 95% data hit rate (16-byte
    block)
  • => 4 MB/s instruction BW per processor
  • => 12 MB/s data BW per processor
  • => 16 MB/s combined BW
  • Therefore 8 processors will saturate the bus

[Diagram: processors, each with a cache, sharing one bus to memory and I/O;
 roughly 260 MB/s of demand per processor without a cache vs. about
 16 MB/s with one]

Cache provides a bandwidth filter as well as reducing average access time
22
Cache Coherence: The Semantic Problem
  • Scenario
  • p1 and p2 both have cached copies of x (as 0)
  • p1 writes x = 1 and then the flag, f = 1, pulling f into its cache
  • both of these writes may write through to memory
  • p2 reads f (bringing it into its cache) to see if it is 1, which it is
  • p2 therefore reads x, but gets the stale cached copy (x = 0)

[Diagram: memory holds x = 1, f = 1; p1's cache holds x = 1, f = 1;
 p2's cache still holds the stale x = 0 alongside f = 1]
23
Snoopy Cache Coherence
  • Bus is a broadcast medium
  • all caches can watch the other caches' memory operations
  • All processors write through
  • a write updates the local cache and generates a bus write, which
  • updates main memory
  • invalidates/updates all other caches holding that item
  • Examples: early Sequent and Encore machines
  • Caches stay coherent
  • Consistent view of memory!
  • One shared write at a time
  • Performance is much worse than uniprocessor write-back caches
  • Since 15-30% of references are writes, this scheme consumes tremendous
    bus bandwidth. Few processors can be supported.

24
Write-Back/Ownership Schemes
  • When a single cache has ownership of a block,
    processor writes do not result in bus writes,
    thus conserving bandwidth.
  • reads by others cause it to return to shared
    state
  • Most bus-based multiprocessors today use such
    schemes.
  • Many variants of ownership-based protocols
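As an illustration of such a protocol, here is a toy sketch of one cache line's state transitions in an invalidation-based, MSI-style ownership scheme; this is a generic textbook protocol chosen for illustration, not the exact protocol of any machine named in these slides.

/* Toy MSI-style state machine for a single cache line (illustrative only). */
typedef enum { INVALID, SHARED, MODIFIED } LineState;

/* Local processor actions.  A write takes ownership (MODIFIED); on a real bus
   it would broadcast an invalidation, and later writes then hit silently,
   generating no bus traffic. */
LineState on_cpu_read(LineState s)  { return (s == INVALID) ? SHARED : s; }
LineState on_cpu_write(LineState s) { (void)s; return MODIFIED; }

/* Snooped bus transactions from other processors. */
LineState on_bus_read(LineState s)   /* another CPU reads the block          */
{
    return (s == MODIFIED) ? SHARED : s;   /* owner supplies data, drops to shared */
}
LineState on_bus_write(LineState s)  /* another CPU wants ownership          */
{
    (void)s;
    return INVALID;                        /* our copy is invalidated         */
}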

25
Programming SMPs
  • Consistent view of shared memory
  • All addresses equidistant
  • don't worry about data partitioning
  • Automatic replication of shared data close to the processor
  • If the program concentrates on a block of the data set that no one
    else updates => very fast
  • Communication occurs only on cache misses
  • cache misses are slow
  • Processor cannot distinguish communication misses from regular cache
    misses
  • Cache blocks may introduce artifacts
  • two distinct variables in the same cache block
  • false sharing (see the sketch below)
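A common illustration of the false-sharing artifact and its usual fix: pad each thread's hot counter so it sits on its own cache line. The 64-byte line size, the iteration counts, and all names are assumptions made for this sketch.

#include <pthread.h>

#define CACHE_LINE 64   /* assumed line size; not specified in the slides */

/* Without padding, two adjacent long counters share one cache line: threads
   that each update only "their own" counter still invalidate each other's
   cached copy on every write (false sharing).  Padding puts each counter on
   its own line, so the updates generate no coherence traffic. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counter[2];

static void *bump(void *arg) {
    int id = *(int *)arg;                  /* 0 or 1: which counter is "mine" */
    for (long i = 0; i < 10000000; i++)
        counter[id].value++;
    return NULL;
}

int main(void) {
    int ids[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump, &ids[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}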

26
Scalable Cache-Coherence
27
90s: Scalable, Cache-Coherent Multiprocessors
28
SGI Origin 2000
29
90s: Pushing the bus to the limit (Sun Enterprise)
30
90s: Pushing the SMP to the masses
31
Caches and Scientific Computing
  • Caches tend to perform worst on demanding
    applications that operate on large data sets
  • transaction processing
  • operating systems
  • sparse matrices
  • Modern scientific codes use tiling/blocking to become cache friendly
    (see the sketch below)
  • easier for dense codes than for sparse
  • tiling and parallelism are similar transformations
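A standard example of the tiling transformation: a blocked matrix multiply that reuses each tile of B while it is resident in cache. The block size and the names are illustrative; in practice the tile size is tuned to the cache.

/* Blocked (tiled) matrix multiply: C += A*B for n x n row-major matrices.
   Each BSIZE x BSIZE tile of A and B is reused while it sits in cache,
   instead of streaming the whole matrices through for every row of C. */
#define BSIZE 32   /* illustrative tile size; tuned to the cache in practice */

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BSIZE)
        for (int kk = 0; kk < n; kk += BSIZE)
            for (int jj = 0; jj < n; jj += BSIZE)
                for (int i = ii; i < ii + BSIZE && i < n; i++)
                    for (int k = kk; k < kk + BSIZE && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BSIZE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}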

32
Scalable Global Address Space
33
Structured Shared Memory: SPMD
  • Each process is the same program with the same address space layout
[Diagram: the machine physical address space as seen by processes P0..Pn --
 each process has a private portion (P0 private, P1 private, P2 private, ...,
 Pn private) and a shared portion mapped to common physical addresses; a load
 by Pn and a store by P0 both refer to the same shared location x]
34
Large Scale Shared Physical Address
[Diagram: nodes, each a processor plus memory (M) with pseudo-memory and
 pseudo-processor controllers, attached to a scalable network; a local load
 "Ld R <- Addr" to a remote address becomes a read request message
 (src, dest, addr) across the network and a read response message
 (tag, data) back]
  • Processor performs a load
  • The pseudo-memory controller turns it into a message transaction with a
    remote controller, which performs the memory operation and replies with
    the data
  • Examples: BBN Butterfly, Cray T3D

35
Cray T3D
[Node diagram: a 3D torus of up to 2048 PEs, 64 MB each; pairs of PEs share
 the network interface and a block transfer engine (BLT); request/response
 paths in and out of the network; message queue (4080 x 4 x 64), prefetch
 queue (16 x 64), DTB annex; 150 MHz DEC Alpha (64-bit) with 8 KB instruction
 and 8 KB data caches; 43-bit virtual addresses, 32-bit physical addresses
 (5 + 27); 32- and 64-bit memory and byte operations; non-blocking stores,
 memory barrier, prefetch, load-lock / store-conditional; special registers
 for swap operand, fetch&add, and barrier]
36
The Cray T3D
  • 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network
  • 43-bit virtual address space, 32-bit physical
  • 32-bit and 64-bit load/store + byte manipulation on registers
  • no L2 cache
  • non-blocking stores, load/store re-ordering, memory fence
  • load-lock / store-conditional
  • Direct global memory access via external segment registers
  • DTB annex, 32 entries, remote processor number and mode
  • atomic swap between a special local register and memory
  • special fetch&inc register
  • global-OR, global-AND barriers
  • Prefetch Queue
  • Block Transfer Engine
  • User-level Message Queue

37
T3D Local Read (average latency)
  • No TLB!
  • Line size: 32 bytes
  • L1 cache size: 8 KB
  • Cache access time: 6.7 ns (1 cycle)
  • Memory access time: 155 ns (23 cycles)
  • DRAM page miss: 100 ns (15 cycles)
38
T3D Remote Read (uncached)
  • 610 ns (91 cycles): 3-4x a local memory read!
  • 100 ns DRAM-page miss, as in the local case
  • Network latency: an additional 13-20 ns (2-3 cycles) per hop
39
Bulk Read Options
40
Where are things going?
  • High-end
  • collections of almost complete workstations/SMPs on a high-speed
    network
  • with a specialized communication assist integrated with the memory
    system to provide global access to shared data
  • Mid-end
  • almost all servers are bus-based CC SMPs
  • high-end servers are replacing the bus with a network
  • Sun Enterprise 10000, IBM J90, HP/Convex SPP
  • volume approach is the Pentium Pro quad-pack + SCI ring
  • Sequent, Data General
  • Low-end
  • SMP desktop is here
  • Major change ahead
  • SMP on a chip as a building block