Transcript and Presenter's Notes

Title: CS 267 Applications of Parallel Computers Lecture 5: Sources of Parallelism (continued) Shared-Memory Multiprocessors


1
CS 267 Applications of Parallel Computers
Lecture 5: Sources of Parallelism (continued); Shared-Memory Multiprocessors
  • Kathy Yelick
  • http://www.cs.berkeley.edu/dmartin/cs267/

2
Outline
  • Recap
  • Parallelism and Locality in PDEs
  • Continuous Variables Depending on Continuous
    Parameters
  • Example: The heat equation
  • Euler's method
  • Indirect methods
  • Shared Memory Machines
  • Historical Perspective: Centralized Shared Memory
  • Bus-based Cache-coherent Multiprocessors
  • Scalable Shared Memory Machines

3
Recap: Sources of Parallelism and Locality
  • Discrete event system
  • model is discrete space with discrete
    interactions
  • synchronous and asynchronous versions
  • parallelism over the graph of entities; communication for events
  • Particle systems
  • discrete entities moving in continuous space and
    time
  • parallelism between particles; communication for interactions
  • ODEs
  • systems of lumped (discrete) variables,
    continuous parameters
  • parallelism in solving (usually sparse) linear
    systems
  • graph partitioning for parallelizing the sparse
    matrix computation
  • PDEs (today)

4
Continuous Variables, Continuous Parameters
  • Examples of such systems include
  • Heat flow: Temperature(position, time)
  • Diffusion: Concentration(position, time)
  • Electrostatic or Gravitational Potential: Potential(position)
  • Fluid flow: Velocity, Pressure, Density(position, time)
  • Quantum mechanics: Wave-function(position, time)
  • Elasticity: Stress, Strain(position, time)

5
Example: Deriving the Heat Equation

[Diagram: an insulated bar stretching from position 0 to position 1, with interior points x and x+h marked]
  • Consider a simple problem
  • A bar of uniform material, insulated except at
    ends
  • Let u(x,t) be the temperature at position x at
    time t
  • Heat travels from x to x+h at a rate proportional to the temperature
    difference (u(x,t) - u(x+h,t)) / h
  • So the temperature at x changes at the rate

  d u(x,t)/dt = C * [ (u(x-h,t) - u(x,t))/h - (u(x,t) - u(x+h,t))/h ] / h

  • As h → 0, we get the heat equation
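In LaTeX form, with the limit step written out explicitly (C is the material constant from this slide; the middle step is standard calculus added here for completeness):

  \frac{\partial u(x,t)}{\partial t}
    = \lim_{h \to 0} \frac{C}{h}\left[\frac{u(x-h,t)-u(x,t)}{h}
      - \frac{u(x,t)-u(x+h,t)}{h}\right]
    = \lim_{h \to 0} C \, \frac{u(x-h,t) - 2u(x,t) + u(x+h,t)}{h^2}
    = C \, \frac{\partial^2 u(x,t)}{\partial x^2}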

6
Explicit Solution of the Heat Equation
  • For simplicity, assume C = 1
  • Discretize both time and position
  • Use finite differences, with x[i, t] as the temperature at timestep t
    and grid point i
  • initial conditions give x[i, 0]
  • boundary conditions fix the temperature at the two ends of the bar
  • At each timestep
  • This corresponds to
  • matrix vector multiply
  • nearest neighbors on grid

x[i, t+1] = z * x[i-1, t] + (1 - 2z) * x[i, t] + z * x[i+1, t],   where z = k/h^2

[Diagram: space-time grid with timesteps t0..t5 (vertical) and grid points x0..x5 (horizontal); each point is computed from its three neighbors at the previous timestep]
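A minimal serial sketch of this timestep loop in C; the grid size, initial condition, number of steps, and value of z are illustrative assumptions, not from the slides.

#include <stdio.h>
#include <string.h>

#define N 6          /* grid points x0..x5, as in the picture above */

/* One explicit timestep: x_new[i] = z*x[i-1] + (1-2z)*x[i] + z*x[i+1].
   The boundary points x[0] and x[N-1] are held fixed (boundary conditions). */
static void explicit_step(const double x[N], double x_new[N], double z) {
    x_new[0] = x[0];
    x_new[N - 1] = x[N - 1];
    for (int i = 1; i < N - 1; i++)
        x_new[i] = z * x[i - 1] + (1.0 - 2.0 * z) * x[i] + z * x[i + 1];
}

int main(void) {
    double x[N] = {0, 0, 1, 1, 0, 0};   /* illustrative initial condition */
    double x_new[N];
    double z = 0.25;                    /* z = k/h^2; small enough to stay stable */
    for (int t = 0; t < 10; t++) {
        explicit_step(x, x_new, z);
        memcpy(x, x_new, sizeof x);
    }
    for (int i = 0; i < N; i++) printf("%g ", x[i]);
    printf("\n");
    return 0;
}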
7
Parallelism in Explicit Method for PDEs
  • Partition the space (x) into p large chunks (see the OpenMP sketch
    below)
  • good load balance (assuming a large number of points relative to p)
  • minimized communication (only at the boundaries of the p chunks)
  • Generalizes to
  • multiple dimensions
  • arbitrary graphs (= sparse matrices)
  • Problem with explicit approach
  • numerical instability
  • need to make the timesteps very small

8
Implicit Solution
  • As with many (stiff) ODEs, need an implicit
    method
  • This turns into solving the following equation
  • Where I is the identity matrix and T is
  • I.e., essentially solving Poisson's equation

(I + (z/2)T) * x[t+1]  =  (I - (z/2)T) * x[t]

       [  2 -1             ]
       [ -1  2 -1          ]
  T =  [    -1  2 -1       ]    (tridiagonal: 2 on the diagonal,
       [        ...   ...  ]     -1 on the off-diagonals)
       [           -1  2   ]
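A minimal serial sketch of one such implicit step for the 1D case: form the right-hand side (I - (z/2)T) x[t], then solve (I + (z/2)T) x[t+1] = rhs with a tridiagonal (Thomas) solve. The choice of solver and all names here are illustrative assumptions, and zero values outside the two ends are assumed as boundary conditions.

#include <stdlib.h>

/* One Crank-Nicolson step for the 1D heat equation (illustrative sketch).
   With T as above, A = I + (z/2)T has diagonal 1+z and off-diagonals -z/2,
   and B = I - (z/2)T has diagonal 1-z and off-diagonals +z/2. */
void implicit_step(const double *x, double *x_next, int n, double z) {
    double *b = malloc(n * sizeof *b);   /* right-hand side B*x, then overwritten */
    double *c = malloc(n * sizeof *c);   /* scratch for the Thomas algorithm      */

    /* b = (I - (z/2)T) x */
    for (int i = 0; i < n; i++) {
        b[i] = (1.0 - z) * x[i];
        if (i > 0)     b[i] += 0.5 * z * x[i - 1];
        if (i < n - 1) b[i] += 0.5 * z * x[i + 1];
    }

    /* Thomas algorithm for A x_next = b: diagonal d = 1+z, off-diagonal e = -z/2 */
    double d = 1.0 + z, e = -0.5 * z;
    c[0] = e / d;
    b[0] = b[0] / d;
    for (int i = 1; i < n; i++) {
        double m = d - e * c[i - 1];
        c[i] = e / m;
        b[i] = (b[i] - e * b[i - 1]) / m;
    }
    x_next[n - 1] = b[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x_next[i] = b[i] - c[i] * x_next[i + 1];

    free(b);
    free(c);
}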
9
2D Implicit Method
  • Similar to the 1D case, but the matrix T is now
  • Multiplying by this matrix (as in the explicit
    case) is simply nearest neighbor computation
  • To solve this system, there are several techniques

  T = the 2D analogue of the 1D matrix: 4 on the diagonal, and -1 in the
      positions corresponding to each grid point's nearest neighbors in the
      2D grid (a block tridiagonal matrix with tridiagonal blocks)
10
Algorithms for Solving the Poisson Equation
  Algorithm      Serial       PRAM            Memory      #Procs
  ---------      ------       ----            ------      ------
  Dense LU       N^3          N               N^2         N^2
  Band LU        N^2          N               N^(3/2)     N
  Jacobi         N^2          N               N           N
  Conj. Grad.    N^(3/2)      N^(1/2) log N   N           N
  RB SOR         N^(3/2)      N^(1/2)         N           N
  Sparse LU      N^(3/2)      N^(1/2)         N log N     N
  FFT            N log N      log N           N           N
  Multigrid      N            log^2 N         N           N
  Lower bound    N            log N           N
  • PRAM is an idealized parallel model with zero-cost communication

11
Administrative
  • HW2 extended to Monday, Feb. 16th
  • Break
  • On to shared memory machines

12
Programming Recap and History of SMPs
13
Relationship of Architecture and Programming Model
[Layer diagram, top to bottom: Parallel Application -> Programming Model ->
 compiler / library / operating system -> HW / SW interface (communication
 primitives) -> Hardware; the User / System Interface sits between the
 programming model and the compiler / library / OS layer]
14
Shared Address Space Programming Model
  • Collection of processes
  • Naming
  • Each process can name data in its own private address space, and
  • all can name all data in a common shared address space
  • Operations
  • Uniprocessor operations, plus synchronization operations on shared
    addresses
  • lock, unlock, test&set, fetch&add, ... (a lock-based sketch follows
    this list)
  • Operations on the shared address space appear to be performed in
    program order
  • each process's own operations appear to be in program order
  • all processes see a consistent interleaving of each other's operations
  • like timesharing on a uniprocessor
  • explicit synchronization operations are used when program ordering is
    not sufficient.
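As a concrete illustration of the model, a small pthreads sketch (thread counts and names are illustrative, not from the slides): every thread names the same shared counter, each also has private data, and lock/unlock orders the updates to the shared address.

#include <pthread.h>
#include <stdio.h>

/* All threads name the same shared variable; each also has private locals. */
static long shared_count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long private_iters = 100000;            /* private (per-thread) data     */
    (void)arg;
    for (long i = 0; i < private_iters; i++) {
        pthread_mutex_lock(&lock);          /* "lock" from the slide's list  */
        shared_count++;                     /* operation on a shared address */
        pthread_mutex_unlock(&lock);        /* "unlock"                      */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", shared_count);          /* 400000 with the lock in place */
    return 0;
}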

15
Example: shared flag indicating full/empty

  P1:              P2:
  a: A = 1         while (flag == 0) do nothing
  b: flag = 1      print A

  • Intuitively clear that the intention was to convey meaning by the order
    of the stores
  • No data dependences
  • A sequential compiler / architecture would be free to reorder them!
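One way to make the intended ordering explicit rather than assumed, sketched with C11 atomics and release/acquire ordering; this is an illustration of the kind of synchronization the slide calls for, not what any particular machine discussed here did, and the names are illustrative.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int A = 0;
static atomic_int flag = 0;

static void *p1(void *arg) {
    (void)arg;
    A = 1;                                                   /* a: A = 1    */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* b: flag = 1 */
    return NULL;
}

static void *p2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                  /* spin until the flag is set     */
    printf("%d\n", A);                     /* guaranteed to print 1          */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}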

16
Historical Perspective
  • Diverse spectrum of parallel machines designed to
    implement a particular programming model directly
  • Technological convergence on collections of
    microprocessors on a scalable interconnection
    network
  • Map any programming model to simple hardware
  • with some specialization

[Diagram: early programming models mapped directly to hardware --
 Shared Address Space -> centralized shared memory,
 Message Passing -> hypercubes and grids,
 Data Parallel -> SIMD;
 the convergence point: nodes that are essentially complete computers
 (processor P, memory M, communication assist CA) connected by a
 Scalable Interconnection Network]
17
60s Mainframe Multiprocessors
  • Enhance memory capacity or I/O capabilities by
    adding memory modules or I/O devices
  • How do you enhance processing capacity?
  • Add processors
  • Already need an interconnect between slow memory
    banks and processor I/O channels
  • cross-bar or multistage interconnection network

[Diagram: processors and I/O channels (IOC) connected to multiple memory
 modules (Mem) through an interconnect; I/O devices hang off the IOCs]
18
Caches: A Solution and New Problems
19
70s breakthrough
  • Caches!

[Diagram: a fast processor (P) with a cache holding a copy of memory
 location A (value 17); the cache sits between the processor and the
 interconnect, which also connects the slow memory and other I/O devices
 or processors]
20
Technology Perspective
           Capacity           Speed
  Logic    2x in 3 years      2x in 3 years
  DRAM     4x in 3 years      1.4x in 10 years
  Disk     2x in 3 years      1.4x in 10 years

  DRAM     Year    Size      Cycle Time
           1980    64 Kb     250 ns
           1983    256 Kb    220 ns
           1986    1 Mb      190 ns
           1989    4 Mb      165 ns
           1992    16 Mb     145 ns
           1995    64 Mb     120 ns

  Capacity grew about 1000:1 over this period; cycle time improved only about 2:1!
21
Bus Bottleneck and Caches
  • Assume a 100 MB/s bus
  • 50 MIPS processor w/o cache
  • => 200 MB/s instruction BW per processor
  • => 60 MB/s data BW at 30% load-store frequency
  • Suppose a 98% instruction hit rate and a 95% data hit rate (16-byte
    block)
  • => 4 MB/s instruction BW per processor
  • => 12 MB/s data BW per processor
  • => 16 MB/s combined BW
  • Therefore 8 processors will saturate the bus

[Diagram: processors, each with a cache, sharing one bus to memory and I/O;
 roughly 260 MB/s of demand per processor without a cache vs. about
 16 MB/s with one]

Cache provides a bandwidth filter as well as reducing average access time
22
Cache Coherence: The Semantic Problem
  • Scenario
  • p1 and p2 both have cached copies of x (as 0)
  • p1 writes x = 1 and then the flag, f = 1, pulling f into its cache
  • both of these writes may write through to memory
  • p2 reads f (bringing it into its cache) to see if it is 1, which it is
  • p2 therefore reads x, but gets the stale cached copy (x = 0)

[Diagram: memory holds x = 1, f = 1; p1's cache holds x = 1, f = 1;
 p2's cache still holds the stale x = 0 alongside f = 1]
23
Snoopy Cache Coherence
  • Bus is a broadcast medium
  • all caches can watch the other caches' memory operations
  • All processors write through
  • a write updates the local cache and generates a bus write, which
  • updates main memory
  • invalidates/updates all other caches holding that item
  • Examples: early Sequent and Encore machines
  • Caches stay coherent
  • Consistent view of memory!
  • One shared write at a time
  • Performance is much worse than uniprocessor write-back caches
  • Since 15-30% of references are writes, this scheme consumes tremendous
    bus bandwidth. Few processors can be supported.

24
Write-Back/Ownership Schemes
  • When a single cache has ownership of a block,
    processor writes do not result in bus writes,
    thus conserving bandwidth.
  • reads by others cause it to return to shared
    state
  • Most bus-based multiprocessors today use such
    schemes.
  • Many variants of ownership-based protocols
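As an illustration of such a protocol, here is a toy sketch of one cache line's state transitions in an invalidation-based, MSI-style ownership scheme; this is a generic textbook protocol chosen for illustration, not the exact protocol of any machine named in these slides.

/* Toy MSI-style state machine for a single cache line (illustrative only). */
typedef enum { INVALID, SHARED, MODIFIED } LineState;

/* Local processor actions.  A write takes ownership (MODIFIED); on a real bus
   it would broadcast an invalidation, and later writes then hit silently,
   generating no bus traffic. */
LineState on_cpu_read(LineState s)  { return (s == INVALID) ? SHARED : s; }
LineState on_cpu_write(LineState s) { (void)s; return MODIFIED; }

/* Snooped bus transactions from other processors. */
LineState on_bus_read(LineState s)   /* another CPU reads the block          */
{
    return (s == MODIFIED) ? SHARED : s;   /* owner supplies data, drops to shared */
}
LineState on_bus_write(LineState s)  /* another CPU wants ownership          */
{
    (void)s;
    return INVALID;                        /* our copy is invalidated         */
}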

25
Programming SMPs
  • Consistent view of shared memory
  • All addresses equidistant
  • don't worry about data partitioning
  • Automatic replication of shared data close to the processor
  • If the program concentrates on a block of the data set that no one
    else updates => very fast
  • Communication occurs only on cache misses
  • cache misses are slow
  • Processor cannot distinguish communication misses from regular cache
    misses
  • Cache blocks may introduce artifacts
  • two distinct variables in the same cache block
  • false sharing (see the sketch below)
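A common illustration of the false-sharing artifact and its usual fix: pad each thread's hot counter so it sits on its own cache line. The 64-byte line size, the iteration counts, and all names are assumptions made for this sketch.

#include <pthread.h>

#define CACHE_LINE 64   /* assumed line size; not specified in the slides */

/* Without padding, two adjacent long counters share one cache line: threads
   that each update only "their own" counter still invalidate each other's
   cached copy on every write (false sharing).  Padding puts each counter on
   its own line, so the updates generate no coherence traffic. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counter[2];

static void *bump(void *arg) {
    int id = *(int *)arg;                  /* 0 or 1: which counter is "mine" */
    for (long i = 0; i < 10000000; i++)
        counter[id].value++;
    return NULL;
}

int main(void) {
    int ids[2] = {0, 1};
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, bump, &ids[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}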

26
Scalable Cache-Coherence
27
90s: Scalable, Cache-Coherent Multiprocessors
28
SGI Origin 2000
29
90s: Pushing the bus to the limit (Sun Enterprise)
30
90s: Pushing the SMP to the masses
31
Caches and Scientific Computing
  • Caches tend to perform worst on demanding
    applications that operate on large data sets
  • transaction processing
  • operating systems
  • sparse matrices
  • Modern scientific codes use tiling/blocking to become cache friendly
    (see the sketch below)
  • easier for dense codes than for sparse
  • tiling and parallelism are similar transformations
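A standard example of the tiling transformation: a blocked matrix multiply that reuses each tile of B while it is resident in cache. The block size and the names are illustrative; in practice the tile size is tuned to the cache.

/* Blocked (tiled) matrix multiply: C += A*B for n x n row-major matrices.
   Each BSIZE x BSIZE tile of A and B is reused while it sits in cache,
   instead of streaming the whole matrices through for every row of C. */
#define BSIZE 32   /* illustrative tile size; tuned to the cache in practice */

void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BSIZE)
        for (int kk = 0; kk < n; kk += BSIZE)
            for (int jj = 0; jj < n; jj += BSIZE)
                for (int i = ii; i < ii + BSIZE && i < n; i++)
                    for (int k = kk; k < kk + BSIZE && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BSIZE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}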

32
Scalable Global Address Space
33
Structured Shared Memory: SPMD
  • Each process is the same program with the same address space layout
[Diagram: the machine physical address space as seen by processes P0..Pn --
 each process has a private portion (P0 private, P1 private, P2 private, ...,
 Pn private) and a shared portion mapped to common physical addresses; a load
 by Pn and a store by P0 both refer to the same shared location x]
34
Large Scale Shared Physical Address
[Diagram: nodes, each a processor plus memory (M) with pseudo-memory and
 pseudo-processor controllers, attached to a scalable network; a local load
 "Ld R <- Addr" to a remote address becomes a read request message
 (src, dest, addr) across the network and a read response message
 (tag, data) back]
  • Processor performs a load
  • The pseudo-memory controller turns it into a message transaction with a
    remote controller, which performs the memory operation and replies with
    the data
  • Examples: BBN Butterfly, Cray T3D

35
Cray T3D
[Node diagram: a 3D torus of up to 2048 PEs, 64 MB each; pairs of PEs share
 the network interface and a block transfer engine (BLT); request/response
 paths in and out of the network; message queue (4080 x 4 x 64), prefetch
 queue (16 x 64), DTB annex; 150 MHz DEC Alpha (64-bit) with 8 KB instruction
 and 8 KB data caches; 43-bit virtual addresses, 32-bit physical addresses
 (5 + 27); 32- and 64-bit memory and byte operations; non-blocking stores,
 memory barrier, prefetch, load-lock / store-conditional; special registers
 for swap operand, fetch&add, and barrier]
36
The Cray T3D
  • 2048 Alphas (150 MHz, 16 or 64 MB each) + fast network
  • 43-bit virtual address space, 32-bit physical
  • 32-bit and 64-bit load/store + byte manipulation on registers
  • no L2 cache
  • non-blocking stores, load/store re-ordering, memory fence
  • load-lock / store-conditional
  • Direct global memory access via external segment registers
  • DTB annex, 32 entries, remote processor number and mode
  • atomic swap between a special local register and memory
  • special fetch&inc register
  • global-OR, global-AND barriers
  • Prefetch Queue
  • Block Transfer Engine
  • User-level Message Queue

37
T3D Local Read (average latency)
  • No TLB!
  • Line size: 32 bytes
  • L1 cache size: 8 KB
  • Cache access time: 6.7 ns (1 cycle)
  • Memory access time: 155 ns (23 cycles)
  • DRAM page miss: 100 ns (15 cycles)
38
T3D Remote Read (uncached)
  • 610 ns (91 cycles): 3-4x a local memory read!
  • 100 ns DRAM-page miss, as in the local case
  • Network latency: an additional 13-20 ns (2-3 cycles) per hop
39
Bulk Read Options
40
Where are things going?
  • High-end
  • collections of almost complete workstations/SMPs on a high-speed
    network
  • with a specialized communication assist integrated with the memory
    system to provide global access to shared data
  • Mid-end
  • almost all servers are bus-based CC SMPs
  • high-end servers are replacing the bus with a network
  • Sun Enterprise 10000, IBM J90, HP/Convex SPP
  • volume approach is the Pentium Pro quad-pack + SCI ring
  • Sequent, Data General
  • Low-end
  • SMP desktop is here
  • Major change ahead
  • SMP on a chip as a building block