Title: CS 2200 Lecture 19 Parallel Processing
1. CS 2200 Lecture 19: Parallel Processing
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
2. Our Road Map
Processor
Memory Hierarchy
I/O Subsystem
Parallel Systems
Networking
3. The Next Step
- Create more powerful computers simply by interconnecting many small computers
  - Should be scalable
  - Should be fault tolerant
  - More economical
- Multiprocessors
  - High throughput running independent tasks
- Parallel Processing
  - Single program on multiple processors
4. Key Questions
- How do parallel processors share data?
- How do parallel processors communicate?
- How many processors?
5. Today: Parallelism vs. Parallelism
- ILP: Uni, Pipelined, Superscalar, VLIW/EPIC
- TLP: SMP (Symmetric), Distributed
6. Flynn's taxonomy
- Single instruction stream, single data stream (SISD)
  - Essentially, this is a uniprocessor
- Single instruction stream, multiple data streams (SIMD)
  - Same instruction executed by multiple processors with different data streams
  - Each processor has its own data memory, but there is 1 instruction memory and control processor to fetch/dispatch instructions
7. Flynn's Taxonomy
- Multiple instruction streams, single data stream (MISD)
  - Can anyone think of a good application for this machine?
- Multiple instruction streams, multiple data streams (MIMD)
  - Each processor fetches its own instructions and operates on its own data
8. A history
- From a parallel perspective, many early processors were SIMD
- In the recent past, MIMD has been the most common multiprocessor arch.
- Why MIMD?
  - Often MIMD machines are made of off-the-shelf components
  - Usually means flexibility: could be used as a single-user machine or a multi-programmed machine
9. A history
- MIMD machines can be further sub-divided:
- Centralized shared-memory architectures
  - Multiple processors share a single centralized memory and interconnect to it via a bus
  - Works best with a smaller # of processors
  - Because of the centralization/uniform access time, sometimes called Uniform Memory Access (UMA)
- Physically distributed memory
  - Almost a must for larger processor counts, else bandwidth becomes a problem
10. OK, so we've introduced the two kinds of parallel computer architectures that we're going to talk about. We'll come back to them soon enough. But 1st, we'll talk about why parallel processing is a good thing.
11. Parallel Computers
- Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." - Almasi and Gottlieb, Highly Parallel Computing, 1989
- Questions about parallel computers:
  - How large a collection?
  - How powerful are the processing elements?
  - How do they cooperate and communicate?
  - How is data transmitted?
  - What type of interconnection?
  - What are the HW and SW primitives for the programmer?
  - Does it translate into performance?
- (i.e. things you should have some understanding of after class today)
12. The Plan
- Applications (problem space)
- Key hardware issues
  - Shared memory: how to keep caches coherent
  - Message passing: low-cost communication
- See board (intro. to cache coherency)
13. Current Practice
- Some success with MPPs (Massively Parallel Processors)
  - Dense matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
  - File servers, databases, web search engines
  - Entertainment/graphics
- Small-scale machines: Dell Workstation 530
  - 1.7 GHz Intel Pentium IV (in minitower)
  - 512 MB RDRAM memory, 40 GB disk, 20X CD, 19" monitor, Quadro2 Pro graphics card, RedHat Linux, 3 yrs service: $2,760; for a 2nd processor, add $515
  - (Can also chain these together)
14. Parallel Architecture
- Parallel architecture extends traditional computer architecture with a communication architecture
  - Programming model (SW view)
  - Abstractions (HW/SW interface)
  - Implementation to realize the abstraction efficiently
- Historically, implementations have been tied to programming models, but that is changing.
15. Parallel Applications
- Throughput-oriented (want many answers)
  - Multiprogramming
  - Databases, web servers
- Latency-oriented (want one answer, fast)
  - Grand Challenge problems
    - See http://www.nhse.org/grand_challenge.html
    - See http://www.research.att.com/dsj/nsflist.html
  - Global climate model
  - Human genome
  - Quantum chromodynamics
  - Combustion model
  - Cognition
16. Programming
- As contrasted to instruction-level parallelism, which may be largely ignored by the programmer...
- Writing efficient multiprocessor programs is hard.
  - Wizards write programs with a sequential interface (e.g. databases, file servers, CAD)
- Communication overhead becomes a factor
- Requires a lot of knowledge of the hardware!!!
17. Speedup: metric for performance on latency-sensitive applications
- Speedup = Time(1) / Time(P) for P processors
  - Note: must use the best sequential algorithm for Time(1) -- the parallel algorithm may be different.
- [Plot: speedup vs. # of processors (1, 2, 4, ..., 64). Linear speedup is the ideal; a typical curve rolls off with some # of processors; occasionally you see superlinear speedup... why?]
18. Speedup Challenge
- To get the full benefit of parallelism, you need to be able to parallelize the entire program!
- Amdahl's Law
  - Time_after = (Time_affected / Improvement) + Time_unaffected
- Example: we want 100 times speedup with 100 processors
  - This requires Time_unaffected = 0!!!
- (see board notes for this worked out, and the sketch below)
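A minimal sketch of the worked example (not from the original slides; the function and the chosen fractions are illustrative). In speedup form, Amdahl's Law is Speedup = 1 / ((1 - f) + f/P), where f is the fraction of execution time that can be parallelized across P processors; running it shows why 100x speedup on 100 processors forces the unaffected (serial) time to 0.

    #include <stdio.h>

    /* Amdahl's Law in speedup form: the fraction f of the runtime is
     * improved by a factor P (here, parallelized across P processors). */
    static double amdahl_speedup(double f, double P)
    {
        return 1.0 / ((1.0 - f) + f / P);
    }

    int main(void)
    {
        double P = 100.0;                                 /* 100 processors */
        double fracs[] = { 0.5, 0.9, 0.99, 0.999, 1.0 };  /* candidate parallel fractions */
        for (int i = 0; i < 5; i++)
            printf("f = %.3f -> speedup = %6.2f\n", fracs[i], amdahl_speedup(fracs[i], P));
        /* Only f = 1.0 (i.e. Time_unaffected = 0) reaches the full 100x speedup. */
        return 0;
    }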
19. Hardware: Two Main Variations
- Shared-Memory
  - May be physically shared or only logically shared
  - Communication is implicit in loads and stores
- Message-Passing
  - Must add explicit communication
20. Shared-Memory Hardware (1)
- Hardware and programming model don't have to match, but this is the mental model for shared-memory programming
- Memory is centralized with uniform access time (UMA) and a bus interconnect, I/O
- Examples: Dell Workstation 530, Sun Enterprise, SGI Challenge
- Typical latencies:
  - 1 cycle to local cache
  - 20 cycles to remote cache
  - 100 cycles to memory
21. Sharing Data (another view)
- [Diagram: Uniform Memory Access (UMA) -- processors share one memory over a bus; a Symmetric Multiprocessor (SMP)]
22. Shared-Memory Hardware (2)
- Variation: memory is not centralized. Called non-uniform memory access (NUMA)
- Shared-memory accesses are converted into a messaging protocol (usually by HW)
- Examples: DASH/Alewife/FLASH (academic), SGI Origin, Compaq GS320, Sequent (IBM) NUMA-Q
23. Sharing Data (another view)
- [Diagram: Non-Uniform Memory Access (NUMA) -- memory is distributed among the nodes]
24. More on distributed memory
- Distributing memory among processing nodes has 2 pluses:
  - It's a great way to save some bandwidth
    - With memory distributed at the nodes, most accesses are to local memory within a particular node
    - No need for bus communication
  - Reduces latency for accesses to local memory
- It also has 1 big minus!
  - Have to communicate among the various processors
  - Leads to a higher latency for inter-node communication
  - Also need bandwidth to actually handle the communication
25. Message Passing Model
- Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations
  - Essentially NUMA, but integrated at the I/O devices instead of at the memory system
- Send specifies a local buffer + the receiving process on the remote computer
26. Message Passing Model
- Receive specifies the sending process on the remote computer + the local buffer to place the data in
- Usually send includes a process tag, and receive has a rule on the tag: match 1, match any
- Synchronization: when send completes, when buffer is free, when request accepted, receive waits for send
- Send + receive => memory-memory copy, where each supplies a local address, AND does pairwise synchronization! (see the sketch below)
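To make the send/receive pairing concrete, here is a minimal MPI sketch; MPI itself is not covered on these slides, and the ranks, tag, and buffer are illustrative. Each side names its local buffer, the remote process, and a tag, and MPI_ANY_TAG plays the "match any" role mentioned above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* sender: local buffer + receiving process + tag */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/7, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* receiver: sending process + local buffer; "match any" tag rule */
            MPI_Recv(&data, 1, MPI_INT, /*source=*/0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", data);  /* send+receive = memory-memory copy + synchronization */
        }

        MPI_Finalize();
        return 0;
    }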
27. Two terms: multicomputers vs. multiprocessors
28. Communicating between nodes
- One way to communicate between processors treats the physically separate memories as 1 big memory
  - (i.e. 1 big logically shared address space)
  - Any processor can make a memory reference to any memory location, even if it's at a different node
  - Machines are called distributed shared memory (DSM)
  - The same physical address on two processors refers to the same single location in memory
- Another method involves private address spaces
  - Memories are logically disjoint; they cannot be addressed by a remote processor
  - The same physical address on two processors refers to two different locations in memory
  - These are multicomputers
29. Multicomputer
- [Diagram: two nodes, each with a processor, cache (Cache A / Cache B), and its own memory, joined by an interconnect]
30. Multiprocessor: Symmetric Multiprocessor or SMP
- [Diagram: two processors with Cache A and Cache B sharing a single memory]
31. But both can have a cache coherency problem
- [Diagram: Cache A reads X (X = 0), then writes X = 1; Cache B then reads X and still gets X = 0 from memory. Oops!]
32. Simplest Coherence Strategy: Enforce Exactly One Copy
- [Diagram: the same Read X / Write X / Read X sequence, now with at most one cached copy of X at any time]
33. Exactly One Copy
- [State diagram per cache line: INVALID -> VALID on a read or write (which invalidates other copies); stays VALID on more reads or writes; VALID -> INVALID on replacement or invalidation]
- Maintain a lock per cache line
- Invalidate other caches on a read/write
- Easy on a bus: snoop the bus for transactions (see the sketch below)
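A minimal sketch of the VALID/INVALID state machine above, assuming one tiny controller per cache line (the types and the bus_invalidate callback are illustrative, not from the slides):

    /* Per-line state for the "exactly one copy" scheme. */
    typedef enum { INVALID, VALID } line_state_t;

    typedef struct {
        line_state_t state;
        unsigned     tag;
    } cache_line_t;

    /* Local CPU read or write: this cache must hold the only copy, so it first
     * broadcasts an invalidate that evicts the line from every other cache. */
    void cpu_access(cache_line_t *line, unsigned tag, void (*bus_invalidate)(unsigned tag))
    {
        if (line->state == INVALID || line->tag != tag) {
            bus_invalidate(tag);     /* remove all other copies */
            line->tag   = tag;
            line->state = VALID;     /* sole copy: further reads/writes hit locally */
        }
    }

    /* Snoop side: another cache's access (or a replacement) takes our copy away. */
    void snoop_invalidate(cache_line_t *line, unsigned tag)
    {
        if (line->state == VALID && line->tag == tag)
            line->state = INVALID;
    }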
34. Exactly One Copy
- Works, but performance is crummy.
- Suppose we all just want to read the same memory location
  - One lousy global variable: n, the size of the problem, written once at the start of the program and read thereafter
- Permit multiple readers (a readers/writer lock per cache line)
35. Cache consistency (i.e. how do we avoid the previous protocol?)
36. Multiprocessor Cache Coherency
- Means values in cache and memory are consistent, or that we know they are different and can act accordingly
- Considered to be a good thing.
- Becomes more difficult with multiple processors and multiple caches!
- Popular technique: snooping!
  - Write-invalidate
  - Write-update
37. Cache coherence protocols
- Directory-based
  - Whether or not some physical memory location is shared is recorded in 1 central location
  - Called the directory
- Snooping
  - Every cache with entries from the centralized main memory also has a particular block's sharing status
  - No centralized state is kept
  - Caches are connected to a shared memory bus
  - If there is bus traffic, the caches check (or "snoop") to see if they have the block being transferred on the bus
  - Main focus of the upcoming discussion
38. Side note: Snoopy Cache
- [Diagram: CPU above a cache whose lines hold State, Tag, and Data, connected to the bus]
- CPU references check the cache tags (as usual)
- Cache misses are filled from memory (as usual)
- Other reads/writes on the bus must check the tags too, and possibly invalidate
39. Maintaining the coherence requirement
- 1 way: make sure a processor has exclusive access to a data word before it's written
  - Called the write invalidate protocol
  - Will actually invalidate other copies of the data word on a write
- Most common for both snooping and directory schemes
40. Maintaining the coherence requirement
- What if 2 processors try to write at the same time?
- The short answer: one of them will get permission to write first
  - The other's copy will be invalidated,
  - Then it'll get a new copy of the data with the updated value,
  - Then it can get permission and write
- Probably more on how later, but briefly:
  - Caches snoop on the bus, so they'll detect a request to write; whichever machine gets to the bus 1st, goes 1st
41. Write invalidate example
- Assumes neither cache had value/location X in it 1st
- When the 2nd miss by B occurs, CPU A responds with the value, canceling the response from memory.
- Update B's cache and memory: the contents of X are updated
- Typical and simple
42. Maintaining the cache coherency requirement
- Alternative to write invalidate: update all cached copies of a data item when the item is written
  - Called a write update/broadcast protocol
- One problem: bandwidth use could quickly get out of hand
- Solution: track whether or not a word in the cache is shared (i.e. contained in another cache)
  - If the word is not shared, there's no need to broadcast on a write (see the sketch below)
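A minimal sketch of that shared-bit check under a write-update protocol (the block layout and bus callback are made up for illustration): the local copy is always updated, but the bus broadcast only happens when the block might also live in another cache.

    #define BLOCK_WORDS 8            /* illustrative block size */

    typedef struct {
        unsigned tag;
        int      valid;
        int      shared;             /* set if another cache may also hold this block */
        unsigned data[BLOCK_WORDS];
    } cache_block_t;

    /* Write-update (write-broadcast): update locally; broadcast only if shared. */
    void write_word(cache_block_t *blk, unsigned word_idx, unsigned value,
                    void (*bus_broadcast_update)(unsigned tag, unsigned idx, unsigned value))
    {
        blk->data[word_idx] = value;
        if (blk->shared)
            bus_broadcast_update(blk->tag, word_idx, value);  /* other caches and memory update their copies */
        /* not shared: no broadcast needed, saving bus bandwidth */
    }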
43. Write update example
- (Shaded parts are different than before)
- Assumes neither cache had value/location X in it 1st
- CPU and memory contents show the value after the processor and bus activity have both completed
- When CPU A broadcasts the write, the cache in CPU B and memory location X are updated
44. Comparing write update/write invalidate
- What if there are multiple writes and no intermediate reads to the same word?
  - With the update protocol, multiple write broadcasts are required
  - With the invalidation protocol, only one invalidation
- Writing to multiword cache blocks
  - With the update protocol, each word written in a cache block requires a write broadcast
  - With the invalidation protocol, only the 1st write to any word needs to generate an invalidate
45. Comparing write update/write invalidate
- What about delays between writing and reading?
  - With the update protocol, the delay between writing a word on one processor and reading it on another is usually less
    - Written data is immediately updated in the reader's cache
  - With the invalidation protocol, the reader is invalidated and later re-reads/stalls
46. See example
47. Messages vs. Shared Memory?
- Shared Memory
  - As a programming model, shared memory is considered easier
  - Automatic caching is good for dynamic/irregular problems
- Message Passing
  - As a programming model, messages are the most portable
  - The Right Thing for static/regular problems
  - BW ++, latency --, no concept of caching
- Model == implementation?
  - Not necessarily...
48. More on address spaces (i.e. 1 shared memory vs. distributed, multiple memories)
49. Communicating between nodes
- In a shared address space:
  - Data can be implicitly transferred with just a load or a store instruction (see the sketch below)
  - Ex. Machine X executes Load $5, 0($4), where the location 0($4) is actually stored in the memory of Machine Y.
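A minimal pthreads sketch of this implicit communication (not from the slides; on a DSM machine the same loads and stores could just as well land in another node's memory): neither thread calls any communication routine, yet the value moves between them through an ordinary store and an ordinary load.

    #include <pthread.h>
    #include <stdio.h>

    int shared_x = 0;                  /* lives in the single shared address space */

    void *producer(void *arg)
    {
        (void)arg;
        shared_x = 5;                  /* an ordinary store: communication is implicit */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(&t, NULL);        /* join orders the store before the load below */
        printf("x = %d\n", shared_x);  /* an ordinary load observes the other thread's store */
        return 0;
    }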
50. Communicating between nodes
- With private/multiple address spaces:
  - Communication of data is done by explicitly passing messages among the processors
  - Usually based on the Remote Procedure Call (RPC) protocol
    - A synchronous transfer, i.e. the requesting machine waits for a reply before continuing
    - This is OS stuff; no more detail here
  - Could also have the writer initiate data transfers
    - Done in hopes that a node will be a soon-to-be consumer
    - Often done asynchronously; the sender process can continue right away
51. Performance metrics
- 3 performance metrics are critical for communication
- (1) Communication bandwidth
  - Usually limited by processor, memory, and interconnection bandwidths
  - Not by some aspect of the communication mechanism
  - Often occupancy can be a limiting factor
    - When communication occurs, resources within the nodes are tied up or "occupied", preventing other outgoing communication
    - If occupancy is incurred for each word of a message, it sets a limit on communication bandwidth
    - (often lower than what the network or memory system can provide)
52. Performance metrics
- (2) Communication latency
  - Latency includes:
    - Transport latency (a function of the interconnection network)
    - SW/HW overheads (from sending/receiving messages)
  - Largely determined by the communication mechanism and its implementation
  - Latency must be hidden!!!
    - Else, the processor might just spend lots of time waiting for messages
53. Performance metrics
- (3) Hiding communication latency
  - Ideally, we want to mask the latency of waiting for communication, etc.
  - This might be done by overlapping communication with other, independent computations
  - Or maybe 2 independent messages could be sent at once?
  - This metric quantifies how well a multiprocessor configuration can do this
  - Often this burden is placed to some degree on the SW and the programmer
  - Also, this metric is heavily application dependent
54. Performance metrics
- All of these metrics are actually affected by application type, data sizes, communication patterns, etc.
55. Advantages and disadvantages
- What's good about shared memory? What's bad about it?
- What's good about message passing? What's bad about it?
- Note: message passing implies distributed memory
56. Advantages and disadvantages
- Shared memory: good
  - Compatibility with the well-understood mechanisms in use in centralized multiprocessors, which used shared memory
  - It's easy to program
    - Especially if communication patterns are complex
    - Easier just to do a load/store operation and not worry about where the data might be (i.e. on another node with DSM)
    - But you also take a big performance hit
  - Smaller messages are more efficient with shared memory
    - Might communicate via memory mapping instead of going through the OS
    - (like we'd have to do for a remote procedure call)
57. Advantages and disadvantages
- Shared memory: good (continued)
  - Caching can be controlled by the hardware
    - Reduces the frequency of remote communication by supporting automatic caching of all data
- Message passing: good
  - The HW is lots simpler
    - Especially by comparison with a scalable shared-memory implementation that supports coherent caching of data
  - Communication is explicit
    - Forces programmers/compiler writers to think about it and make it efficient
    - This could be a bad thing too, FYI
58. More detail on cache coherency protocols, with some examples
59. More on centralized shared memory
- It's worth studying the various ramifications of a centralized shared memory machine
  - (and there are lots of them)
- Later we'll look at distributed shared memory
- When studying memory hierarchies we saw:
  - Cache structures can substantially reduce the memory bandwidth demands of a processor
  - Multiple processors may be able to share the same memory
60. More on centralized shared memory
- Centralized shared memory supports private/shared data
- If 1 processor in a multiprocessor network operates on private data, caching, etc. are handled just as in uniprocessors
- But if shared data is cached, there can be multiple copies and multiple updates
  - Good because it reduces the required memory bandwidth; bad because we now must worry about cache coherence
61. Cache coherence: why it's a problem
- Assumes that neither cache had value/location X in it 1st
- Both a write-through cache and a write-back cache will encounter this problem
- If B reads the value of X after Time 3, it will get 1, which is the wrong value!
62. Coherence in shared memory programs
- Must have coherence and consistency
- A memory system is coherent if:
  - Program order is preserved (always true in a uniprocessor)
    - Say we have a read by processor P of location X
    - Before the read, processor P wrote something to location X
    - In the interim, no other processor has written to X
    - A read of X should always return the value written by P
  - A coherent view of memory is provided
    - 1st, processor A writes something to memory location X
    - Then, processor B tries to read from memory location X
    - Processor B should get the value written by processor A, assuming:
      - Enough time has passed between the two events
      - No other writes to X have occurred in the interim
63. Coherence in shared memory programs (continued)
- A memory system is coherent if (continued):
  - Writes to the same location are serialized
    - Two writes to the same location by any two processors are seen in the same order by all processors
    - Ex. the values A and B are written to memory location X
    - Processors can't read the value as B and then later as A
  - If writes are not serialized:
    - One processor might see the write of processor P2 to location X 1st
    - Then, it might later see a write to location X by processor P1
    - (P1 actually wrote X before P2)
    - The value written by P1 could be maintained indefinitely, even though it was overwritten
64. Coherence/consistency
- Coherence and consistency are complementary
  - Coherence defines the actions of reads and writes to the same memory location
  - Consistency defines the actions of reads and writes with regard to accesses of other memory locations
- Assumptions for the following discussion:
  - A write does not complete until all processors have seen the effect of the write
  - A processor does not change the order of any write with respect to any other memory access
  - Not exactly the case for either one really... but more later
65. Caches in coherent multiprocessors
- In multiprocessors, caches at the individual nodes help with performance
  - Usually by providing the properties of migration and replication
- Migration
  - Instead of going to centralized memory for each reference, a data word will "migrate" to a cache at a node
  - Reduces latency
- Replication
  - If data is simultaneously read by two different nodes, a copy is made at each node
  - Reduces access latency and contention for a shared item
- Supporting these requires cache coherence protocols
  - Really, we need to keep track of shared blocks
66. Detail about snooping
67. Implementing protocols
- We'll focus on the invalidation protocol
  - And start with a generic template for invalidation
- To perform an invalidate:
  - The processor must acquire bus access
  - Broadcast the address to be invalidated on the bus
  - Processors connected to the bus snoop on addresses
  - If an address on the bus is in a processor's cache, the data is invalidated
- Serialization of bus accesses enforces serialization of writes
  - When 2 processors compete to write to the same location, 1 gets access to the bus 1st
68. It's not THAT easy though
- What happens on a cache miss?
- With a write-through cache, no problem
  - Data is always in main memory
  - But in a shared-memory machine, every cache write would go back to main memory: bad, bad, bad for bandwidth!
- What about write-back caches though?
  - Much harder.
  - The most recent value of the data could be in a cache instead of in memory
- How to handle write-back caches?
  - Snoop.
  - Each processor snoops every address placed on the bus
  - If a processor has a dirty copy of the requested cache block, it responds to the read request, and the memory request is cancelled
69. Specifics of snooping
- Normal cache tags can be used
- The existing valid bit makes it easy to invalidate
- What about read misses?
  - Easy to handle too; rely on the snooping capability
- What about writes?
  - We'd like to know if any other copies of the block are cached
    - If they're NOT, we can save bus bandwidth
  - Can add an extra bit of state to solve this problem: a "shared" state bit
    - Tells us if the block is shared, i.e. whether we must generate an invalidate
  - When a write to a block in the shared state happens, the cache generates an invalidation and marks the block as private (see the sketch below)
    - No other invalidations are then sent by that processor for that block
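A minimal sketch of that extra state bit in a snooping write-invalidate scheme (the state names and bus callbacks are illustrative; the real controller on the next slides has more transitions, e.g. supplying dirty data on a remote miss):

    /* One extra bit of state per block beyond valid: shared vs. exclusive. */
    typedef enum { INVALID, SHARED, EXCLUSIVE } blk_state_t;

    typedef struct {
        blk_state_t state;
        unsigned    tag;
    } snoop_block_t;

    /* Processor-side write hit. */
    void cpu_write(snoop_block_t *blk, unsigned tag, void (*bus_send_invalidate)(unsigned tag))
    {
        if (blk->state == SHARED && blk->tag == tag) {
            bus_send_invalidate(tag);   /* kick every other copy out... */
            blk->state = EXCLUSIVE;     /* ...then no further invalidates are needed for this block */
        }
        /* EXCLUSIVE writes hit silently; an INVALID or mismatched block would first miss on the bus. */
    }

    /* Snoop side: another processor's bus traffic changes our state. */
    void bus_snoop(snoop_block_t *blk, unsigned tag, int is_invalidate)
    {
        if (blk->state == INVALID || blk->tag != tag)
            return;
        if (is_invalidate)
            blk->state = INVALID;       /* someone else is writing: drop our copy */
        else
            blk->state = SHARED;        /* someone else read our exclusive block: downgrade */
    }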
70. Specifics of snooping
- When the invalidation is sent, the state of the owner's cache block (the processor with the sole copy of the cache block) is changed from shared to unshared (or exclusive)
- If another processor later requests the cache block, the state must be made shared again
  - The snooping cache also sees any misses
  - So it knows when an exclusive cache block has been requested by another processor and the state should be made shared
71. Specifics of snooping
- More overhead:
  - Every bus transaction would have to check the cache-address tags
  - Could easily overwhelm normal CPU cache accesses
- Solutions:
  - Duplicate the tags: snooping and CPU accesses can go on in parallel
  - Employ a multi-level cache with inclusion
    - Everything in the L1 cache is also in L2; snooping checks L2, the CPU checks L1
72. An example protocol
- A bus-based protocol is usually implemented with a finite state machine controller in each node
  - The controller responds to requests from the processor and from the bus
  - It changes the state of the selected cache block and uses the bus to access data or to invalidate it
- An example protocol (which we'll go through an example of)