CS 267: Shared Memory Machines Programming Example: Sharks and Fish

Slides: 61

Provided by: kathyy

Learn more at: https://people.eecs.berkeley.edu

1
CS 267: Shared Memory Machines Programming Example: Sharks and Fish

James Demmel
demmel_at_cs.berkeley.edu
www.cs.berkeley.edu/demmel/cs267_Spr05

2
Basic Shared Memory Architecture

Processors all connected to a large shared memory
Where are caches?

P2
P1
Pn
interconnect
memory

Now take a closer look at structure, costs,
limits, programming

3
Outline

Evolution of Hardware and Software
CPUs getting exponentially faster than memory
they share
Hardware evolves to try to match speeds
Program semantics evolve too
Programs change from correct to buggy, unless
programmed carefully
Performance evolves as well
Well tuned programs today may be inefficient
tomorrow
Goal: teach a programming style likely to stay
correct, if not always as efficient as possible
Use locks to avoid race conditions
Current research seeks best of both worlds
Example: Sharks and Fish (part of next homework)

4
Processor-DRAM Gap (latency)

[Figure: performance (log scale, 1 to 1000) vs. time (1980 to 2000).
CPU (µProc, "Moore's Law") improves 60%/yr.; DRAM improves 7%/yr.
The processor-memory performance gap grows 50%/year.]
5
Shared Memory Code for Computing a Sum: s = f(A[0]) + f(A[1])

static int s = 0;
Thread 0: s = s + f(A[0])
Thread 1: s = s + f(A[1])

Might get f(A[0]) + f(A[1]), or f(A[0]), or f(A[1])
Problem is a race condition on variable s in the
program
A race condition or data race occurs when
two processors (or two threads) access the same
variable, and at least one does a write.
The accesses are concurrent (not synchronized) so
they could happen simultaneously
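
The three possible results can be reproduced deterministically by treating "s = s + f(A[i])" as three separate steps (load, add, store) and replaying a chosen interleaving. This is only a single-threaded sketch of the race, not real threaded code; the function f, the helper run_interleaving, and the step encoding are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for f; the slide does not define it. */
static int f(int x) { return 10 * x; }

/* Replay "s = s + f(A[t])" for threads t = 0 and t = 1 as three steps each:
   load s into a private register, add f(A[t]), store the register to s.
   order[k] names the thread that performs its next step at time k. */
int run_interleaving(const int A[2], const int order[6]) {
    int s = 0;             /* the shared variable */
    int reg[2] = {0, 0};   /* one private register per thread */
    int step[2] = {0, 0};  /* next step each thread will execute */
    for (int k = 0; k < 6; k++) {
        int t = order[k];
        assert(t == 0 || t == 1);
        switch (step[t]++) {
        case 0: reg[t] = s;        break;  /* load  */
        case 1: reg[t] += f(A[t]); break;  /* add   */
        case 2: s = reg[t];        break;  /* store */
        }
    }
    return s;
}
```

With the order 0,0,0,1,1,1 both updates survive; with 0,1,0,1,0,1 both threads load s = 0 and the last store wins, so one update is lost.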

6
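Approaches to Building Parallel Machines

[Figure: three organizations, in order of increasing scale]
Shared Cache: P1 ... Pn connected through a switch to an (interleaved)
first-level cache and (interleaved) main memory
Centralized Memory (UMA: Uniform Memory Access): P1 ... Pn connected by
an interconnection network to memory modules (Mem)
Distributed Memory (NUMA: Non-UMA): P1 ... Pn, each with local memory
(Mem), connected by an interconnection network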
7
Shared Cache Advantages and Disadvantages

Advantages
Cache placement identical to single cache
Only one copy of any cached block
Can't have values of the same memory location in
different caches
Fine-grain sharing is possible
Good Interference
One processor may prefetch data for another
Can share data within a line without moving line
Disadvantages
Bandwidth limitation
Bad Interference
One processor may flush another processor's data

8
Limits of Shared Cache Approach

Assume
1 GHz processor w/o cache
=> 4 GB/s inst BW per processor (32-bit)
=> 1.2 GB/s data BW at 30% load-store
Need 5.2 GB/s of bus bandwidth per processor!
Typical off-chip bus bandwidth is closer to 1
GB/s

9
Evolution of Shared Cache

Alliant FX-8 (early 1980s)
eight 68020s with x-bar to 512 KB interleaved
cache
Encore & Sequent (1980s)
first 32-bit micros (N32032)
two to a board with a shared cache
Disappeared for a while, and then
Cray X1 shares L3 cache
IBM Power 4 and Power 5 share L2 cache
If switch and cache on chip, may have enough
bandwidth again

10
Approaches to Building Parallel Machines

[Figure repeated from slide 6: Shared Cache; Centralized Memory (UMA);
Distributed Memory (NUMA)]
11
Intuitive Memory Model

Reading an address should return the last value
written to that address
Easy in uniprocessors
except for I/O
Cache coherence problem in MPs is more pervasive
and more performance critical
More formally, this is called sequential
consistency
"A multiprocessor is sequentially consistent if
the result of any execution is the same as if the
operations of all the processors were executed in
some sequential order, and the operations of each
individual processor appear in this sequence in
the order specified by its program." [Lamport,
1979]

12
Sequential Consistency Intuition

Sequential consistency says the machine behaves
as if it does the following

13
Memory Consistency Semantics

What does this imply about program behavior?
No process ever sees garbage values, i.e., half of
each of 2 values
Processors always see values written by some
processor
The value seen is constrained by program order on
all processors
Time always moves forward
Example: spin lock
P1 writes data=1, then writes flag=1
P2 waits until flag=1, then reads data

If P2 sees the new value of flag (=1), it must
see the new value of data (=1)
initially: flag=0; data=0
P1: data = 1; flag = 1;
P2: 10: if flag=0, goto 10; ... = data
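
The flag/data handshake can be sketched in C with C11 atomics and Pthreads. This is one possible rendering, not the course's code: the release store and acquire load on flag are an assumption about how to get the required ordering on a machine that is not sequentially consistent, and the names producer, consumer, and run_flag_data are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;          /* ordinary shared variable, initially 0 */
static atomic_int flag;   /* synchronization flag, initially 0 */

/* P1: write data, then publish it by setting flag. */
static void *producer(void *arg) {
    (void)arg;
    data = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* P2: spin until flag is set, then read data. */
static void *consumer(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* spin */
    *(int *)arg = data;  /* the acquire/release pair guarantees this reads 1 */
    return NULL;
}

/* Runs one handshake and returns the value of data that P2 observed. */
int run_flag_data(void) {
    pthread_t p1, p2;
    int seen = -1;
    data = 0;
    atomic_store(&flag, 0);
    int rc2 = pthread_create(&p2, NULL, consumer, &seen);
    int rc1 = pthread_create(&p1, NULL, producer, NULL);
    assert(rc1 == 0 && rc2 == 0);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    return seen;
}
```

With a plain int flag instead, the compiler or hardware would be free to reorder the two writes or to keep flag in a register, which is the failure the slide describes.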
14
If Caches are Not Coherent

Coherence means different copies of same location
have same value
p1 and p2 both have cached copies of data (as 0)
p1 writes data=1
May write through to memory
p2 reads data, but gets the stale cached copy
This may happen even if it read an updated value
of another variable, flag, that came from memory

[Figure: p1 and p2 each cache data = 0; after p1 writes data = 1,
p2's cache still holds data = 0]
15
Snoopy Cache-Coherence Protocols
[Figure: processors P0 ... Pn with caches on a shared memory bus with
memory modules (Mem); each cache snoops the bus, e.g., a memory op
from Pn]

Memory bus is a broadcast medium
Caches contain information on which addresses
they store
Cache Controller snoops all transactions on the
bus
A transaction is a relevant transaction if it
involves a cache block currently contained in
this cache
Take action to ensure coherence
invalidate, update, or supply value
Many possible designs (see CS252 or CS258)

16
Limits of Bus-Based Shared Memory

Assume
1 GHz processor w/o cache
=> 4 GB/s inst BW per processor (32-bit)
=> 1.2 GB/s data BW at 30% load-store
Suppose 98% inst hit rate and 95% data hit rate
=> 80 MB/s inst BW per processor
=> 60 MB/s data BW per processor
140 MB/s combined BW
Assuming 1 GB/s bus bandwidth
Therefore 8 processors will saturate bus

[Figure: each PROC needs 5.2 GB/s from its cache; each cache needs
140 MB/s from the bus shared with MEM and I/O]
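
The arithmetic on this slide can be checked with a few lines of C. All numbers are the slide's; the function names are hypothetical, and 1 GB/s is taken as 1000 MB/s, which is an assumption about the slide's rounding.

```c
#include <assert.h>

/* Bus traffic per processor in MB/s: only cache misses reach the bus. */
double per_processor_bus_mbs(double inst_mbs, double data_mbs,
                             double inst_hit, double data_hit) {
    assert(inst_hit >= 0.0 && inst_hit <= 1.0);
    assert(data_hit >= 0.0 && data_hit <= 1.0);
    return inst_mbs * (1.0 - inst_hit) + data_mbs * (1.0 - data_hit);
}

/* Smallest number of processors whose combined traffic reaches bus_mbs. */
int processors_to_saturate(double bus_mbs, double per_proc_mbs) {
    assert(per_proc_mbs > 0.0);
    int n = (int)(bus_mbs / per_proc_mbs);
    if (n * per_proc_mbs < bus_mbs)
        n++;  /* round up */
    return n;
}
```

Calling per_processor_bus_mbs(4000, 1200, 0.98, 0.95) gives about 140 MB/s (80 for instructions plus 60 for data), and 1000 / 140 is about 7.1, so the eighth processor saturates a 1 GB/s bus.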
17
Sample Machines

Intel Pentium Pro Quad
Coherent
4 processors
Sun Enterprise server
Coherent
Up to 16 processor and/or memory-I/O cards
IBM Blue Gene/L
L1 not coherent, L2 shared

18
Approaches to Building Parallel Machines

[Figure repeated from slide 6: Shared Cache; Centralized Memory (UMA);
Distributed Memory (NUMA)]
19
Basic Choices in Memory/Cache Coherence

Keep Directory to keep track of which memory
stores latest copy of data
Directory, like cache, may keep information such
as
Valid/invalid
Dirty (inconsistent with memory)
Shared (in other caches)
When a processor executes a write operation to
shared data, basic design choices are
With respect to memory
Write-through cache: do the write in memory as
well as cache
Write-back cache: wait and do the write later,
when the item is flushed
With respect to other cached copies
Update: give all other processors the new value
Invalidate: all other processors remove the item
from cache
See CS252 or CS258 for details

20
SGI Altix 3000

A node contains up to 4 Itanium 2 processors and
32GB of memory
Network is SGI's NUMAlink, the NUMAflex
interconnect technology.
Uses a mixture of snoopy and directory-based
coherence
Up to 512 processors that are cache coherent
(global address space is possible for larger
machines)

21
Cache Coherence and Sequential Consistency

There is a lot of hardware/work to ensure
coherent caches
Never more than 1 version of data for a given
address in caches
Data is always a value written by some processor
But other HW/SW features may break sequential
consistency (SC)
The compiler reorders/removes code (e.g., your
spin lock)
The compiler allocates a register for flag on
Processor 2 and spins on that register value
without ever completing
Write buffers (place to store writes while
waiting to complete)
Processors may reorder writes to merge addresses
(not FIFO)
Write X=1, Y=1, X=2 (second write to X may happen
before Y's)
Prefetch instructions cause read reordering (read
data before flag)
The network reorders the two write messages.
The write to flag is nearby, whereas data is far
away.
Some of these can be prevented by declaring
variables volatile
Most current commercial SMPs give up SC

22
Programming with Weaker Memory Models than SC

Possible to reason about machines with fewer
properties, but difficult
Some rules for programming with these models
Avoid race conditions
Use system-provided synchronization primitives
If you have race conditions on variables, make
them volatile
At the assembly level, may use fences (or analog)
directly
The high level language support for these differs
Built-in synchronization primitives normally
include the necessary fence operations
lock(); ... only one thread at a time allowed
here ... unlock();
Region between lock/unlock called critical region
For performance, need to keep critical region
short

23
Improved Code for Computing a Sum: s = f(A[0]) + ... + f(A[n-1])

static int s = 0;
Thread 1:
local_s1 = 0
for i = 0, n/2-1
  local_s1 = local_s1 + f(A[i])
s = s + local_s1
Thread 2:
local_s2 = 0
for i = n/2, n-1
  local_s2 = local_s2 + f(A[i])
s = s + local_s2

Since addition is associative, it's OK to
rearrange order

24
Improved Code for Computing a Sum: s = f(A[0]) + ... + f(A[n-1])

static int s = 0;
Thread 1:
local_s1 = 0
for i = 0, n/2-1
  local_s1 = local_s1 + f(A[i])
lock();
s = s + local_s1
unlock();
Thread 2:
local_s2 = 0
for i = n/2, n-1
  local_s2 = local_s2 + f(A[i])
lock();
s = s + local_s2
unlock();

Since addition is associative, it's OK to
rearrange order
Critical section smaller
Most work outside it
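
A minimal Pthreads sketch of this pattern follows. Each thread accumulates into a private local sum and takes the lock only once, to add that sum into s. The array size, the function f, and the names partial_sum and parallel_sum are assumptions made for illustration, not the course's code.

```c
#include <assert.h>
#include <pthread.h>

#define N 1000                       /* assumed problem size */

static int A[N];
static int s;                        /* shared sum */
static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

static int f(int x) { return 2 * x; }   /* hypothetical f */

typedef struct { int lo, hi; } range_t; /* half-open index range [lo, hi) */

static void *partial_sum(void *arg) {
    const range_t *r = arg;
    int local_s = 0;                 /* private: no lock needed in the loop */
    for (int i = r->lo; i < r->hi; i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&s_lock);     /* critical section: a single addition */
    s += local_s;
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

int parallel_sum(void) {
    range_t r1 = {0, N / 2}, r2 = {N / 2, N};
    pthread_t t1, t2;
    for (int i = 0; i < N; i++)
        A[i] = i;
    s = 0;
    int rc1 = pthread_create(&t1, NULL, partial_sum, &r1);
    int rc2 = pthread_create(&t2, NULL, partial_sum, &r2);
    assert(rc1 == 0 && rc2 == 0);
    pthread_join(t1, NULL);          /* r1 and r2 stay alive until both joins */
    pthread_join(t2, NULL);
    return s;
}
```

Locking inside the loop would also be correct, but it would serialize nearly all of the work; here the lock is taken twice in total.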

25
Caches and Scientific Computing

Caches tend to perform worst on demanding
applications that operate on large data sets
transaction processing
operating systems
sparse matrices
Modern scientific codes use tiling/blocking to
become cache friendly
easier for dense matrix codes (e.g., matmul) than
for sparse
tiling and parallelism are similar transformations
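
As an illustration of tiling, here is a sketch of dense matrix multiply written both as a plain triple loop and in blocked form. The sizes N and TILE are assumptions (a real code would tune TILE to the cache), and tiled_matches_naive is a hypothetical helper that only checks that both versions compute the same product.

```c
#include <assert.h>

enum { N = 64, TILE = 16 };  /* assumed sizes; N must be a multiple of TILE */

static double A[N][N], B[N][N], C_naive[N][N], C_tiled[N][N];

/* Plain triple loop: walks a whole column of B for every element of C. */
static void matmul_naive(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C_naive[i][j] = sum;
        }
}

/* Tiled version: works on TILE x TILE blocks so each block is reused
   many times while it is still in cache. */
static void matmul_tiled(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C_tiled[i][j] = 0.0;
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++) {
                        double sum = C_tiled[i][j];
                        for (int k = kk; k < kk + TILE; k++)
                            sum += A[i][k] * B[k][j];
                        C_tiled[i][j] = sum;
                    }
}

/* Fill A and B with small integers (exact in double), run both versions,
   and return 1 if every entry agrees, 0 otherwise. */
int tiled_matches_naive(void) {
    assert(N % TILE == 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (i + j) % 3;
            B[i][j] = (i * j) % 5;
        }
    matmul_naive();
    matmul_tiled();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C_naive[i][j] != C_tiled[i][j])
                return 0;
    return 1;
}
```

The independent (ii, jj) blocks are also natural units of parallel work, which is one sense in which tiling and parallelism are similar transformations.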

26
Sharing: A Performance Problem

True sharing
Frequent writes to a variable can create a
bottleneck
OK for read-only or infrequently written data
Technique: make copies of the value, one per
processor, if this is possible in the algorithm
Example problem: the data structure that stores
the freelist/heap for malloc/free
False sharing
Cache block may also introduce artifacts
Two distinct variables in the same cache block
Technique: allocate data used by each processor
contiguously, or at least avoid interleaving
Example problem: an array of ints, one written
frequently by each processor
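
One common way to avoid the false-sharing example above is to pad each per-thread counter out to a full cache block. The sketch below assumes a 64-byte block, which is machine dependent; the names padded_counter_t, bump, and run_counters are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define CACHE_LINE 64     /* assumed cache block size in bytes */
#define NTHREADS 4
#define ITERS 100000

/* One counter per thread, padded so consecutive counters are
   CACHE_LINE bytes apart and can never fall in the same block. */
typedef struct {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

static padded_counter_t counters[NTHREADS];

static void *bump(void *arg) {
    padded_counter_t *c = arg;
    for (long i = 0; i < ITERS; i++)
        c->count++;       /* each thread writes only its own counter */
    return NULL;
}

/* Runs NTHREADS writers and returns the sum of all counters. */
long run_counters(void) {
    pthread_t t[NTHREADS];
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        counters[i].count = 0;
        int rc = pthread_create(&t[i], NULL, bump, &counters[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += counters[i].count;
    }
    return total;
}
```

With a plain long counters[NTHREADS] the result would be the same, but several counters would share one block and every increment could invalidate the other threads' cached copies.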

27
What to Take Away?

Programming shared memory machines
May allocate data in large shared region without
too many worries about where
Memory hierarchy is critical to performance
Even more so than on uniprocs, due to coherence
traffic
For performance tuning, watch sharing (both true
and false)
Semantics
Need to lock access to shared variable for
read-modify-write
Sequential consistency is the natural semantics
Architects worked hard to make this work
Caches are coherent with buses or directories
No caching of remote data on shared address space
machines
But compiler and processor may still get in the
way
Non-blocking writes, read prefetching, code
motion
Avoid races or use machine-specific fences
carefully

28
Creating Parallelism with Threads
29
Programming with Threads

Several Thread Libraries
PTHREADS is the Posix Standard
Solaris threads are very similar
Relatively low level
Portable but possibly slow
OpenMP is newer standard
Support for scientific programming on shared
memory
http://www.openMP.org
P4 (Parmacs) is another portable package
Higher level than Pthreads
http://www.netlib.org/p4/index.html

30
Language Notions of Thread Creation

cobegin/coend
fork/join
cobegin cleaner, but fork is more general

cobegin job1(a1); job2(a2); coend

Statements in block may run in parallel
cobegins may be nested
Scoped, so you cannot have a missing coend

tid1 = fork(job1, a1); job2(a2); join tid1;

Forked function runs in parallel with current
join waits for completion (may be in different
function)

31
Forking Posix Threads
Signature:
int pthread_create(pthread_t *,
                   const pthread_attr_t *,
                   void * (*)(void *),
                   void *);
Example call:
errcode = pthread_create(&thread_id, &thread_attribute,
                         &thread_fun, &fun_arg);

thread_id is the thread id or handle (used to
halt, etc.)
thread_attribute: various attributes
standard default values obtained by passing a
NULL pointer
thread_fun: the function to be run (takes and
returns void*)
fun_arg: an argument can be passed to thread_fun
when it starts
errcode will be set nonzero if the create
operation fails

32
Posix Thread Example

#include <pthread.h>
#include <stdio.h>
void *print_fun(void *message) {
  printf("%s \n", (char *) message);
  return NULL;
}
int main() {
  pthread_t thread1, thread2;
  char *message1 = "Hello";
  char *message2 = "World";
  pthread_create(&thread1,
                 NULL,
                 print_fun,
                 (void *) message1);
  pthread_create(&thread2,
                 NULL,
                 print_fun,
                 (void *) message2);
  /* wait for both threads; otherwise main may return before they print */
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);
  return 0;
}

Compile using gcc -lpthread. See
Millennium/Seaborg docs for paths/modules
Note: There is a race condition in the print
statements
33
Loop Level Parallelism

Many scientific applications have parallelism in
loops
With threads
... my_stuff[n][n];
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    ... pthread_create(update_cell, ..., my_stuff[i][j]);
But overhead of thread creation is nontrivial

update_cell also needs i & j
34
Shared Data and Threads

Variables declared outside of main are shared
Object allocated on the heap may be shared (if
pointer is passed)
Variables on the stack are private: passing
pointers to these around to other threads can
cause problems
Sharing is often done by creating a large "thread data"
struct
Passed into all threads as argument
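
A sketch of the thread data struct idea, which also avoids creating one thread per cell: a small fixed number of threads each update a subset of the rows. The struct layout, the cyclic row assignment, and the names thread_data_t, update_cell, and run_updates are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define N 8          /* assumed grid size */
#define NTHREADS 4   /* assumed thread count, much smaller than N*N */

/* Everything a thread needs, bundled into one argument. */
typedef struct {
    int tid;              /* this thread's index */
    int nthreads;         /* total number of threads */
    double (*grid)[N];    /* pointer to the shared N x N grid */
} thread_data_t;

/* Hypothetical per-cell update; it needs i and j as well as the grid. */
static void update_cell(double (*grid)[N], int i, int j) {
    grid[i][j] = i * N + j;
}

static void *update_rows(void *arg) {
    thread_data_t *td = arg;
    for (int i = td->tid; i < N; i += td->nthreads)  /* rows tid, tid+T, ... */
        for (int j = 0; j < N; j++)
            update_cell(td->grid, i, j);
    return NULL;
}

/* Returns 1 if every cell was updated exactly as expected, 0 otherwise. */
int run_updates(void) {
    static double grid[N][N];
    pthread_t threads[NTHREADS];
    thread_data_t td[NTHREADS];  /* on this stack; valid until the joins below */
    for (int t = 0; t < NTHREADS; t++) {
        td[t].tid = t;
        td[t].nthreads = NTHREADS;
        td[t].grid = grid;
        int rc = pthread_create(&threads[t], NULL, update_rows, &td[t]);
        assert(rc == 0);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (grid[i][j] != i * N + j)
                return 0;
    return 1;
}
```

Passing pointers into the creating function's stack is safe here only because that function joins every thread before it returns.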

35
Basic Types of Synchronization Barrier

Barrier -- global synchronization
fork multiple copies of the same function "work"
SPMD: Single Program Multiple Data
simple use of barriers -- all threads hit the
same one
work_on_my_subgrid();
barrier;
read_neighboring_values();
barrier;
more complicated -- barriers on branches (or
loops)
if (tid % 2 == 0) {
  work1();
  barrier
} else { barrier }
barriers are not provided in many thread libraries
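
Where the optional POSIX barrier interface is available (it is on Linux, but not on every platform), the simple pattern above can be sketched as follows. The arrays mine and seen stand in for the subgrid and its neighbor values; those names and run_barrier_demo are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L  /* request pthread_barrier_* declarations */
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

static pthread_barrier_t barrier;
static int mine[NTHREADS];  /* value each thread produces */
static int seen[NTHREADS];  /* neighbor value each thread reads */

static void *work(void *arg) {
    int tid = *(int *)arg;
    mine[tid] = 100 + tid;                   /* work_on_my_subgrid()       */
    pthread_barrier_wait(&barrier);          /* all writes are done here   */
    seen[tid] = mine[(tid + 1) % NTHREADS];  /* read_neighboring_values()  */
    return NULL;
}

/* Returns 1 if every thread saw its neighbor's finished value. */
int run_barrier_demo(void) {
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    int ok = 1;
    int rc = pthread_barrier_init(&barrier, NULL, NTHREADS);
    assert(rc == 0);
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        rc = pthread_create(&t[i], NULL, work, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    for (int i = 0; i < NTHREADS; i++)
        if (seen[i] != 100 + (i + 1) % NTHREADS)
            ok = 0;
    return ok;
}
```

The second barrier from the slide is omitted because this sketch does no further writes; an iterative code would need it before the next update of mine.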

36
Basic Types of Synchronization Mutexes

Mutexes -- mutual exclusion aka locks
threads are working mostly independently
need to access common data structure
lock *l = alloc_and_init(); /* shared */
acquire(l);
access data
release(l);
Java and other languages have lexically scoped
synchronization
similar to cobegin/coend vs. fork and join
Semaphores give guarantees on fairness in
getting the lock, but the same idea of mutual
exclusion
Locks only affect processors using them
pair-wise synchronization

37
A Model Problem: Sharks and Fish

Illustration of parallel programming
Original version (discrete event only) proposed
by Geoffrey Fox
Called WATOR
Sharks and fish living in a 2D toroidal ocean
We can imagine several variations to show
different physical phenomena
Basic idea: sharks and fish living in an ocean
rules for movement
breeding, eating, and death
forces in the ocean
forces between sea creatures

38
Particle Systems

A particle system has
a finite number of particles.
moving in space according to Newton's Laws (i.e.,
F = ma).
time is continuous.
Examples
stars in space with laws of gravity.
electron beam and ion beam semiconductor
manufacturing.
atoms in a molecule with electrostatic forces.
neutrons in a fission reactor.
cars on a freeway with Newton's laws plus model
of driver and engine.
Many simulations combine particle simulation
techniques with some discrete event techniques
(e.g., Sharks and Fish).

39
Forces in Particle Systems

Force on each particle decomposed into near and
far
force = external_force + nearby_force +
far_field_force

External force
ocean current to sharks and fish world
externally imposed electric field in electron
beam.
Nearby force
sharks attracted to eat nearby fish
balls on a billiard table bounce off of each other.
Van der Waals forces in fluid (1/r^6).
Far-field force
fish attract other fish by gravity-like (1/r^2)
force
gravity, electrostatics
forces governed by elliptic PDE.

40
Parallelism in External Forces

External forces are the simplest to implement.
The force on each particle is independent of
other particles.
Called embarrassingly parallel.
Evenly distribute particles on processors
Any even distribution works.
Locality is not an issue, no communication.
For each particle on processor, apply the
external force.
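
A sketch of the external-force case: the particles are split into contiguous chunks, one per thread, and each thread applies the same force to its own particles with an explicit Euler update. No locks are needed because no particle is touched by two threads. The struct fields, the uniform force (fx, fy), and the names apply_external and run_external_demo are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>

typedef struct { double x, y, vx, vy, m; } particle_t;

/* One thread's share of the work: particles [lo, hi) and the force. */
typedef struct {
    particle_t *p;
    int lo, hi;
    double fx, fy;   /* external force, the same for every particle */
    double dt;       /* time step */
} chunk_t;

static void *apply_external(void *arg) {
    chunk_t *c = arg;
    for (int i = c->lo; i < c->hi; i++) {
        particle_t *q = &c->p[i];
        q->x += c->dt * q->vx;           /* Euler: position from old velocity */
        q->y += c->dt * q->vy;
        q->vx += c->dt * c->fx / q->m;   /* Euler: velocity from a = F/m */
        q->vy += c->dt * c->fy / q->m;
    }
    return NULL;
}

/* One time step on 8 particles with 2 threads; returns 1 if all results
   match the values worked out by hand. */
int run_external_demo(void) {
    enum { NP = 8, NT = 2 };
    particle_t p[NP];
    chunk_t c[NT];
    pthread_t t[NT];
    for (int i = 0; i < NP; i++)
        p[i] = (particle_t){ .x = i, .y = 0.0, .vx = 0.0, .vy = 0.0, .m = 1.0 };
    for (int k = 0; k < NT; k++) {
        c[k] = (chunk_t){ p, k * NP / NT, (k + 1) * NP / NT, 2.0, 0.0, 0.5 };
        int rc = pthread_create(&t[k], NULL, apply_external, &c[k]);
        assert(rc == 0);
    }
    for (int k = 0; k < NT; k++)
        pthread_join(t[k], NULL);
    for (int i = 0; i < NP; i++)
        if (p[i].x != i || p[i].vx != 1.0 || p[i].vy != 0.0)
            return 0;
    return 1;
}
```

Any even split of the particles works equally well here, since the update of one particle never reads another.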

41
Parallelism in Nearby Forces

Nearby forces require interaction and therefore
communication.
Force may depend on other nearby particles
Example: collisions.
simplest algorithm is O(n^2): look at all pairs to
see if they collide.
Usual parallel model is decomposition of
physical domain
O(n/p) particles per processor if evenly
distributed.
Often called domain decomposition (which also
refers to numerical alg.)
Challenges
Dealing with particles near processor boundaries
Dealing with load imbalance from nonuniformly
distributed particles

42
Parallelism in Far-Field Forces

Far-field forces involve all-to-all interaction
and therefore communication.
Force depends on all other particles
Examples: gravity, protein folding
Simplest algorithm is O(n^2)
Just decomposing space does not help since every
particle needs to visit every other particle.
Use more clever algorithms to lower O(n^2) to O(n
log n)
Several later lectures

Implement by rotating particle sets.
Keeps processors busy
All processors eventually see all particles

43
Examine Sharks and Fish code

Gravitational forces among fish only
Use Euler's method to move fish numerically
Sequential and Shared Memory with Pthreads
www.cs.berkeley.edu/demmel/cs267_Spr05/SharksAndFish

44
Extra Slides
45
Engineering Intel Pentium Pro Quad

SMP for the masses
All coherence and multiprocessing glue in
processor module
Highly integrated, targeted at high volume
Low latency and bandwidth

46
Engineering SUN Enterprise

Proc mem card - I/O card
16 cards of either type
All memory accessed over bus, so symmetric
Higher bandwidth, higher latency bus

47
Outline

Historical perspective
Bus-based machines
Pentium SMP
IBM SP node
Directory-based (CC-NUMA) machine
Origin 2000
Global address space machines
Cray t3d and (sort of) t3e

48
60s Mainframe Multiprocessors

Enhance memory capacity or I/O capabilities by
adding memory modules or I/O devices
How do you enhance processing capacity?
Add processors
Already need an interconnect between slow memory
banks and processor I/O channels
cross-bar or multistage interconnection network

49
70s Breakthrough Caches

Memory system scaled by adding memory modules
Both bandwidth and capacity
Memory was still a bottleneck
Enter Caches!
Cache does two things
Reduces average access time (latency)
Reduces bandwidth requirements to memory

[Figure: memory (slow) holding A = 17, connected through an interconnect
to an I/O device or processor and to processor P (fast)]
50
Technology Perspective
         Capacity         Speed
Logic    2x in 3 years    2x in 3 years
DRAM     4x in 3 years    1.4x in 10 years
Disk     2x in 3 years    1.4x in 10 years

DRAM
Year     Size      Cycle Time
1980     64 Kb     250 ns
1983     256 Kb    220 ns
1986     1 Mb      190 ns
1989     4 Mb      165 ns
1992     16 Mb     145 ns
1995     64 Mb     120 ns
Ratio    1000:1!   2:1!
51
Example Write-thru Invalidate

Update and write-thru both use more memory
bandwidth if there are writes to the same address
Update to the other caches
Write-thru to memory

52
Write-Back/Ownership Schemes

When a single cache has ownership of a block,
processor writes do not result in bus writes,
thus conserving bandwidth.
reads by others cause it to return to shared
state
Most bus-based multiprocessors today use such
schemes.
Many variants of ownership-based protocols

53
Directory-Based Cache-Coherence
54
90 Scalable, Cache Coherent Multiprocessors
55
Cache Coherence and Memory Consistency
56
Violations of Sequential Consistency

Flag/data program is one example that relies on
SC
Given coherent memory, all violations of SC based
on reordering of independent operations are
figure 8s
See paper by Shasha and Snir for more details
Operations can be linearized (move forward time)
if SC

[Figure: timelines for P0, P1, and P2 with read y and write x operations]
57
Sufficient Conditions for Sequential Consistency

Processor issues memory operations in program
order
Processor waits for store to complete before
issuing any more memory operations
E.g., wait for write-through and invalidations
Processor waits for load to complete before
issuing any more memory operations
E.g., data in another cache may have to be marked
as shared rather than exclusive
A load must also wait for the store that produced
the value to complete
E.g., if data is in cache and update event
changes value, all other caches must also have
processed that update
There are much more aggressive ways of
implementing SC, but most current commercial SMPs
give up SC

Based on slide by Mark Hill et al
58
Classification for Relaxed Models

Optimizations can generally be categorized by
Program order relaxation
Write -> Read
Write -> Write
Read -> Read, Write
Read others' write early
Read own write early
All models provide safety net, e.g.,
A write fence instruction waits for writes to
complete
A read fence prevents prefetches from moving
before this point
Prefetches may be synchronized automatically on
use
All models maintain uniprocessor data and control
dependences, write serialization
Memory models differ on orders to two different
locations

Slide source Sarita Adve et al
59
Some Current System-Centric Models
Relaxation   W->R    W->W    R->RW   Read Others'   Read Own      Safety Net
             Order   Order   Order   Write Early    Write Early
IBM 370      yes     -       -       -              -             serialization instructions
TSO          yes     -       -       -              yes           RMW
PC           yes     -       -       yes            yes           RMW
PSO          yes     yes     -       -              yes           RMW, STBAR
WO           yes     yes     yes     -              yes           synchronization
RCsc         yes     yes     yes     -              yes           release, acquire, nsync, RMW
RCpc         yes     yes     yes     yes            yes           release, acquire, nsync, RMW
Alpha        yes     yes     yes     -              yes           MB, WMB
RMO          yes     yes     yes     -              yes           various MEMBARs
PowerPC      yes     yes     yes     yes            yes           SYNC
Slide source Sarita Adve et al
60
Data-Race-Free-0 Some Definitions