CS 2200 Parallel Processing
Transcript and Presenter's Notes

1
CS 2200 Parallel Processing
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, and Michael
    Niemier)

2
Our Road Map
Processor
Memory Hierarchy
I/O Subsystem
Parallel Systems
Networking
3
The Next Step
  • Create more powerful computers simply by
    interconnecting many small computers
  • Should be scalable
  • Should be fault tolerant
  • More economical
  • Multiprocessors
  • High throughput running independent tasks
  • Parallel Processing
  • Single program on multiple processors

4
Key Questions
  • How do parallel processors share data?
  • How do parallel processors communicate?
  • How many processors?

5
Today: Parallelism vs. Parallelism
  • ILP (instruction-level parallelism): uniprocessor,
    pipelined, superscalar, VLIW/EPIC
  • TLP (thread-level parallelism): SMP (symmetric),
    distributed
6
Flynn's Taxonomy
  • Single instruction stream, single data stream
    (SISD)
  • Essentially, this is a uniprocessor
  • Single instruction stream, multiple data streams
    (SIMD)
  • Same instruction executed by all processors, but
    each operates on its own data
  • Each processor has its own data memory, but all
    share the instruction memory and fetch/dispatch

7
Flynn's Taxonomy
  • Multiple instruction streams, single data stream
    (MISD)
  • E.g. a pipeline of specialized processors, each
    does something on the data and passes on to next
    processor
  • Multiple instruction streams, multiple data
    streams (MIMD)
  • Each processor fetches its own instructions and
    operates on its own data

8
A history
  • Many early parallel processors were SIMD
  • Recently, MIMD most common multiprocessor arch.
  • Why MIMD?
  • Can make MIMD machines using off-the-shelf
    chips
  • Many more uniprocessors are made than
    multiprocessors
  • Price is low for uniprocessor chips (mass
    market), high for specialized multiprocessor
    chips (too few are made)
  • Can get cheaper multiprocessors if they use the
    same chips as uniprocessors

9
A history
  • MIMD machines can be further sub-divided
  • Centralized shared-memory architectures
  • All processors sit on the same bus and use the
    same centralized memory
  • Works well with a smaller # of processors
  • Bus bandwidth a problem with many processors
  • Physically distributed memory
  • Each processor has some memory near it, can
    access others' memory over a network
  • With good data locality, most memory accesses
    local
  • Works well even with a large # of processors

10
Ok, so we introduced the two kinds of parallel
computer architectures that we're going to talk
about. We'll come back to them soon enough. But
first, we'll talk about why parallel processing is
a good thing.
11
Parallel Computers
  • Definition: A parallel computer is a collection
    of processing elements that cooperate and
    communicate to solve large problems fast.
  • Almasi and Gottlieb, Highly Parallel Computing,
    1989
  • Questions about parallel computers
  • How large a collection?
  • How powerful are processing elements?
  • How do they cooperate and communicate?
  • How is data transmitted?
  • What type of interconnection?
  • What are HW and SW primitives for programmer?
  • Does it translate into performance?

(i.e. things you should have some understanding
of after class today)
12
The Plan
  • Applications (problem space)
  • Key hardware issues
  • Shared memory: how to keep caches coherent
  • Message passing: low-cost communication

13
Current Practice
  • Some success w/MPPs (Massively Parallel
    Processors)
  • Matrix-based scientific and engineering
    computing (oil, nukes, rockets, cars, medicines,
    weather)
  • File servers, databases, web search engines
  • Entertainment/graphics
  • Small-scale machines: Dell Workstation 530
  • 1.7GHz Intel Pentium IV (in minitower)
  • 512 MB RDRAM memory, 40GB disk, 20X CD, 19"
    monitor, Quadro2Pro graphics card, RedHat Linux,
    3 yrs service
  • $2,760; add $515 for a 2nd processor
  • (Can also chain these together)

14
Parallel Architecture
  • Parallel Architecture extends traditional
    computer architecture with a communication
    architecture
  • Programming model (SW view)
  • Abstractions (HW/SW interface)
  • Implementation to realize abstraction efficiently
  • Historically, implementations have been tied to
    programming models but that is changing.

15
Parallel Applications
  • Throughput-oriented (want many answers)
  • multiprogramming
  • databases, web servers
  • Latency oriented (want one answer, fast)
  • Grand Challenge problems
  • See http://www.nhse.org/grand_challenge.html
  • See http://www.research.att.com/dsj/nsflist.html
  • global climate model
  • human genome
  • quantum chromodynamics
  • combustion model
  • cognition

16
Programming
  • As contrasted to instruction-level parallelism,
    which may be largely ignored by the programmer,
    writing efficient multiprocessor programs is
    hard
  • Wizards write programs with a sequential
    interface (e.g., databases, file servers, CAD)
  • Communications overhead becomes a factor
  • Requires a lot of knowledge of the hardware!!!

17
Speedup: metric for performance on
latency-sensitive applications
  • Time(1) / Time(P) for P processors
  • note: must use the best sequential algorithm for
    Time(1) b/c the best parallel algorithm may be
    different

[Plot: speedup vs. # of processors (1, 2, 4, 8, 16, 32, 64). Linear speedup
is the ideal; typical curves roll off with some # of processors; occasionally
you see superlinear speedup... why?]
18
Speedup Challenge
  • To get full benefit of parallelism, need to be
    able to parallelize the entire program!
  • Amdahl's Law:
  • Time_after = (Time_affected / Improvement) + Time_unaffected
  • Example: we want 100x speedup with 100
    processors
  • Time_unaffected = 0!!! (see the sketch below)

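A minimal numeric sketch of this challenge (my own illustration, not from the slides): it applies Amdahl's Law to show how even a tiny serial fraction caps speedup at 100 processors. The fraction values are illustrative assumptions.

```c
#include <stdio.h>

/* Amdahl's Law: if a fraction f of the original run time is sped up by a
 * factor of P (here, P processors), overall speedup = 1 / ((1 - f) + f / P). */
static double amdahl_speedup(double parallel_fraction, int processors)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors);
}

int main(void)
{
    int p = 100;                                    /* 100 processors, as in the slide */
    double fractions[] = { 1.0, 0.999, 0.99, 0.9 }; /* illustrative parallel fractions */

    for (int i = 0; i < 4; i++)
        printf("parallel fraction %.3f -> speedup %.1fx\n",
               fractions[i], amdahl_speedup(fractions[i], p));
    /* Only fraction 1.0 (i.e. Time_unaffected = 0) reaches the full 100x. */
    return 0;
}
```

Even a 1% serial portion limits the speedup to roughly 50x, which is the point of the slide.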
19
Hardware: Two Main Variations
  • Shared-Memory
  • may be physically shared or only logically shared
  • communication is implicit in loads and stores
  • Message-Passing
  • must add explicit communication

20
Shared-Memory Hardware (1): Hardware and
programming model don't have to match, but this
is the mental model for shared-memory programming
  • Memory centralized with uniform access time
    (UMA) and bus interconnect, I/O
  • Examples: Dell Workstation 530, Sun Enterprise,
    SGI Challenge
  • typical:
  • 1 cycle to local cache
  • 20 cycles to remote cache
  • 100 cycles to memory

21
Sharing Data (another view)
[Diagram: several processors sharing one Memory; Uniform Memory Access (UMA);
this is a Symmetric Multiprocessor (SMP)]
22
Shared-Memory Hardware (2)
  • Variation: memory is not centralized. Called
    non-uniform memory access (NUMA)
  • Shared memory accesses are converted into a
    messaging protocol (usually by HW)
  • Examples: DASH/Alewife/FLASH (academic), SGI
    Origin, Compaq GS320, Sequent (IBM) NUMA-Q

[Diagram: two nodes, each with a processor (P), network interface (NI), and
memory (M), connected by a Network]
23
More on distributed memory
  • Distributing memory among nodes has 2 pluses
  • It's a great way to get more bandwidth
  • Most accesses are to local memory within a
    particular node
  • Each node gets its own bus, instead of all using
    one
  • Can have a fancy (not bus) network for inter-node
    accesses
  • Reduces latency for accesses to local memory
  • It also has 1 big minus!
  • Have to communicate among various processors
  • Leads to a higher latency for inter-node
    communication
  • Also need bandwidth to actually handle
    communication

24
Message Passing Model
  • Whole computers (CPU, memory, I/O devices)
    communicate as explicit I/O operations
  • Essentially NUMA but integrated at I/O devices
    instead of at the memory system
  • Send specifies local buffer and receiving process
    on remote computer

25
Message Passing Model
  • Receive specifies sending process on remote
    computer and local buffer to place data
  • Usually send includes process tag and receive
    has rule on tag: match 1, match any
  • Synch: when send completes, when buffer free,
    when request accepted, receive wait for send
  • Send+receive => memory-memory copy, where each
    supplies a local address, AND does pairwise
    synchronization! (see the MPI sketch below)

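A minimal send/receive sketch in MPI-style C (MPI is not named on these slides; it is used here as an assumed, widely available message-passing library). Rank 0 sends a buffer with a tag; rank 1 receives by matching sender and tag, illustrating the pairwise synchronization described above.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send: local buffer, destination process, and a tag. */
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: match on sender and tag (could also use MPI_ANY_TAG),
         * naming the local buffer where the data should be placed. */
        MPI_Recv(&data, 1, MPI_INT, /*source=*/0, /*tag=*/7,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}
```

Run with something like `mpirun -np 2 ./a.out`; each process is a whole "computer" in the model above, and all communication is through these explicit operations.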
26
Two terms: multicomputers vs. multiprocessors
27
Communicating between nodes
  • One way to communicate b/t processors treats
    physically separate memories as 1 big memory
  • (i.e. 1 big logically shared address space)
  • Any processor can make a memory reference to any
    memory location, even if it's at a different node
  • Machines are called distributed shared
    memory (DSM)
  • Same physical address on two processors refers to
    the same one location in memory
  • Another method involves private address spaces
  • Memories are logically disjoint; cannot be
    addressed by a remote processor
  • Same physical address on two processors refers to
    two different locations in memory
  • These are multicomputers

28
Multicomputer
[Diagram: two nodes (Proc + Cache A, Proc + Cache B), each with its own
memory, connected by an interconnect]
29
Multiprocessor: Symmetric Multiprocessor or SMP
[Diagram: two processors with Cache A and Cache B sharing one memory]
30
But both can have a cache coherence problem
[Diagram: Cache A reads X then writes X = 1; Cache B reads X and still sees
the stale X = 0 from memory. Oops!]
31
Simplest Coherence Strategy: Exactly One Copy at
a Time
[Diagram: same sequence as before, but only one cache may hold X at a time]
32
Exactly One Copy
[State diagram: INVALID -> VALID on a read or write (invalidate other copies);
VALID stays VALID on more reads or writes; VALID -> INVALID on replacement or
invalidation]
  • Maintain a lock per cache line
  • Invalidate other caches on a read/write
  • Easy on a bus: snoop bus for transactions (see
    the sketch below)

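A minimal sketch of the exactly-one-copy policy (assumed names, not code from the slides): every access to a line by one cache invalidates any other cached copy, so at most one cache holds the line at a time.

```c
#include <stdio.h>

#define NCACHES 4

/* Per-cache state for one memory line: exactly-one-copy allows at most
 * one VALID holder at a time. */
enum line_state { INVALID, VALID };
static enum line_state line[NCACHES];  /* state of the same line in each cache */

/* Any access (read or write) by one cache invalidates all other copies. */
static void access_line(int cache_id)
{
    for (int c = 0; c < NCACHES; c++)
        if (c != cache_id)
            line[c] = INVALID;   /* snooped invalidation of other copies */
    line[cache_id] = VALID;      /* requester now holds the only copy */
}

int main(void)
{
    access_line(0);              /* cache 0 reads: it becomes the sole holder */
    access_line(1);              /* cache 1 writes: cache 0's copy is invalidated */
    for (int c = 0; c < NCACHES; c++)
        printf("cache %d: %s\n", c, line[c] == VALID ? "VALID" : "INVALID");
    return 0;
}
```

The next slide points out why this is slow: even a line that everyone only reads keeps bouncing between caches.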
33
Exactly One Copy
  • Works, but performance is crummy.
  • Suppose we all just want to read the same memory
    location
  • one lousy global variable n (the size of the
    problem), written once at the start of the
    program and read thereafter

Permit multiple readers (readers/writer lock per
cache line)
34
Multiprocessor Cache Coherence
  • Means:
  • values in cache and memory are same, or
  • we know they are different and can act
    accordingly
  • Considered to be a good thing
  • Becomes much more difficult with multiple
    processors and multiple caches!
  • Popular technique: Snooping!
  • Write-invalidate
  • Write-update

35
Cache coherence protocols
  • Directory Based
  • Whether or not a physical memory location is
    shared is recorded in 1 central location
  • Called the directory
  • Snooping
  • Every cache w/ entries from centralized main
    memory also has a particular block's sharing
    status
  • No centralized state kept
  • Caches connected to shared memory bus
  • If there is bus traffic, caches check (or
    snoop) to see if they have the block being
    transferred on bus
  • Main focus of upcoming discussion (see the
    sketch below)

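A minimal data-structure sketch (assumed field names, not from the slides) of what each approach keeps per memory block: a directory entry at one central location versus the per-block status bits held in every snooping cache.

```c
#include <stdint.h>
#include <stdio.h>

#define NPROCS 8   /* illustrative machine size */

/* Directory-based: one entry per memory block, kept in a central directory. */
struct dir_entry {
    uint8_t presence[NPROCS];   /* which caches hold a copy of this block */
    enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } state;
};

/* Snooping: no central state; each cache records the sharing status of the
 * blocks it holds and watches the shared bus for transactions on them. */
struct snoop_line {
    uint32_t tag;
    unsigned valid  : 1;
    unsigned dirty  : 1;
    unsigned shared : 1;
};

int main(void)
{
    printf("dir entry: %zu bytes, snoop line: %zu bytes\n",
           sizeof(struct dir_entry), sizeof(struct snoop_line));
    return 0;
}
```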
36
Side note: Snoopy Cache
[Diagram: CPU above a cache (State, Tag, Data) attached to the Bus]
CPU references check cache tags (as usual).
Cache misses are filled from memory (as usual).
Other reads/writes on the bus must check tags,
too, and possibly invalidate.
37
Maintaining the coherence requirement
  • Method one: make sure the writing processor has
    the only cached copy of a data word before it is
    written
  • Called the write invalidate protocol
  • Write invalidates other cached copies of the data
  • Most common for both snooping and directory
    schemes

38
Maintaining the coherence requirement
  • What if 2 processors try to write at the same
    time?
  • The short answer: one of them does it first
  • The other's copy will be invalidated
  • When the first write is done, the other gets that
    copy
  • Then it again invalidates all cached copies and
    writes
  • Probably more on how later, but briefly:
  • Caches snoop on the bus, so they'll detect a
    request to write; whichever machine gets to the
    bus first goes first

39
Write invalidate example
  • Assumes neither cache had value/location X in it
    1st
  • When the 2nd miss by B occurs, CPU A responds
    with the value, canceling the response from
    memory
  • B's cache and the memory contents of X are
    updated
  • Typical and simple

40
Maintaining the cache coherence requirement
  • Method two: update all cached copies of a data
    item when we write it
  • Called a write update/broadcast protocol
  • Bandwidth quickly becomes a problem
  • Every write needs to go to the bus, can't just
    write to cache
  • Solution: track whether or not a word in the
    cache is shared (i.e. contained in another cache)
  • If the word is not shared, there's no need to
    broadcast on a write

41
Write update example
(Shaded parts are different than before)
  • Assumes neither cache had value/location X in it
    1st
  • CPU and memory contents show value after
    processor and bus activity both completed
  • When CPU A broadcasts the write, cache in CPU B
    and memory location X are updated

42
Comparing write update/write invalidate
  • What if there are multiple writes and no
    intermediate reads to the same word?
  • With update protocol, multiple write broadcasts
    required
  • With invalidation protocol, only one invalidation
  • Writing to multiword cache blocks
  • With update protocol, each word written in a
    cache block requires a write broadcast
  • With invalidation protocol, only 1st write to any
    word needs to generate an invalidate

43
Comparing write update/write invalidate
  • What about delays between writing and reading?
  • With update protocol, delay b/t writing a word on
    one processor and reading it on another is
    usually less
  • Written data is immediately updated in the
    reader's cache
  • With invalidation protocol, the reader is
    invalidated and must re-read later (stalling)

44
Messages vs. Shared Memory?
  • Shared Memory
  • As a programming model, shared memory is
    considered easier
  • automatic caching is good for dynamic/irregular
    problems
  • Message Passing
  • As a programming model, messages are the most
    portable
  • Right Thing for static/regular problems
  • BW ++, latency --, no concept of caching
  • Model = implementation?
  • not necessarily...

45
More on address spaces (i.e. 1 shared memory vs.
distributed, multiple memories)
46
Communicating between nodes
  • In a shared address space
  • Data could be implicitly transferred with just a
    load or a store instruction
  • Ex.: Machine X executes Load $5, 0($4), where
    0($4) is actually stored in the memory of
    Machine Y (see the sketch below)

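A minimal shared-address-space sketch using POSIX threads (an assumed illustration, not from the slides): the two threads communicate only through ordinary loads and stores to shared variables; no explicit message is ever sent.

```c
#include <pthread.h>
#include <stdio.h>

/* One shared address space: both threads see the same `value` and `flag`. */
static int value;
static volatile int flag;   /* simplistic synchronization, for illustration only */

static void *producer(void *arg)
{
    (void)arg;
    value = 42;             /* a plain store: this is the "communication" */
    flag = 1;               /* signal that the value is ready */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (!flag)           /* spin until the producer's store is visible */
        ;
    printf("consumer loaded %d\n", value);   /* a plain load */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

A real program would use atomics or a mutex rather than a volatile spin flag; the point here is only that communication is implicit in loads and stores.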
47
Communicating between nodes
  • With private/multiple address spaces
  • Communication of data done by explicitly passing
    messages among processors
  • Usually based on Remote Procedure Call (RPC)
    protocol
  • It is a synchronous transfer, i.e. the requesting
    machine waits for a reply before continuing
  • This is OS stuff; no more detail here
  • Could also have the writer initiate data
    transfers
  • Done in hopes that a node will be a soon-to-be
    consumer
  • Often done asynchronously; the sender process can
    continue right away

48
Performance metrics
  • 3 performance metrics critical for communication
  • (1) Communication bandwidth
  • Usually limited by processor, memory, and
    interconnection bandwidths
  • Not by some aspect of communication mechanism
  • Often occupancy can be a limiting factor.
  • When communication occurs, resources within
    nodes are tied up or occupied, preventing other
    outgoing communication
  • If occupancy is incurred for each word of a
    message, this sets a limit on communication
    bandwidth
  • (often lower than what network or memory system
    can provide)

49
Performance metrics
  • (2) Communication latency
  • Latency includes
  • Transport latency (function of interconnection
    network)
  • SW/HW overheads (from sending/receiving messages)
  • Largely determined by communication mechanism and
    its implementation
  • Latency must be hidden!!!
  • Else, processor might just spend lots of time
    waiting for messages

50
Performance metrics
  • (3) Hiding communication latency
  • Ideally we want to mask latency of waiting for
    communication, etc.
  • This might be done by overlapping communication
    with other, independent computations
  • Or maybe 2 independent messages could be sent at
    once?
  • This metric quantifies how well a multiprocessor
    configuration can do this
  • Often this burden is placed to some degree on the
    SW and the programmer
  • Also, this metric is heavily application dependent

51
Performance metrics
  • All of these metrics are actually affected by
    application type, data sizes, communication
    patterns, etc.

52
Advantages and disadvantages
  • What's good about shared memory? What's bad
    about it?
  • What's good about message-passing? What's bad
    about it?
  • Note: message passing implies distributed memory

53
Advantages and disadvantages
  • Shared memory good:
  • Compatibility with the well-understood mechanisms
    in use in centralized multiprocessors, which used
    shared memory
  • It's easy to program
  • Especially if communication patterns are complex
  • Easier just to do a load/store operation and not
    worry about where the data might be (i.e. on
    another node with DSM)
  • But, you also take a big-time performance hit
  • Smaller messages are more efficient w/ shared
    memory
  • Might communicate via memory mapping instead of
    going through the OS
  • (like we'd have to do for a remote procedure call)

54
Advantages and disadvantages
  • Shared memory good (continued)
  • Caching can be controlled by the hardware
  • Reduces the frequency of remote communication by
    supporting automatic caching of all data
  • Message-passing good:
  • The HW is a lot simpler
  • Especially by comparison with a scalable
    shared-memory implementation that supports
    coherent caching of data
  • Communication is explicit
  • Forces programmers/compiler writers to think
    about it and make it efficient
  • This could be a bad thing too, FYI

55
More detail on cache coherency protocols with
some examples
56
More on centralized shared memory
  • It's worth studying the various ramifications of
    a centralized shared memory machine
  • (and there are lots of them)
  • Later we'll look at distributed shared memory
  • When studying memory hierarchies, we saw:
  • cache structures can substantially reduce memory
    bandwidth demands of a processor
  • Multiple processors may be able to share the same
    memory

57
More on centralized shared memory
  • Centralized shared memory supports private/shared
    data
  • If 1 processor in a multiprocessor network
    operates on private data, caching, etc. are
    handled just as in uniprocessors
  • But if shared data is cached there can be
    multiple copies and multiple updates
  • Good b/c it reduces required memory bandwidth;
    bad because we now must worry about cache
    coherence

58
Cache coherence: why it's a problem
  • Assumes that neither cache had value/location X
    in it 1st
  • Both a write-through cache and a write-back cache
    will encounter this problem
  • If B reads the value of X after Time 3, it will
    get 1, which is the wrong value!

59
Coherence in shared memory programs
  • Must have coherence and consistency
  • Memory system coherent if
  • Program order preserved (always true in
    uniprocessor)
  • Say we have a read by processor P of location X
  • Before the read, processor P wrote something to
    location X
  • In the interim, no other processor has written to
    X
  • A read to X should always return the value
    written by P
  • A coherent view of memory is provided
  • 1st, processor A writes something to memory
    location X
  • Then, processor B tries to read from memory
    location X
  • Processor B should get the value written by
    processor A assuming
  • Enough time has passed b/t the two events
  • No other writes to X have occurred in the interim

60
Coherence in shared memory programs (continued)
  • Memory system coherent if (continued)
  • Writes to same location are serialized
  • Two writes to the same location by any two
    processors are seen in the same order by all
    processors
  • Ex.: values A and B are written to memory
    location X
  • Processors can't read the value B and then later
    read A
  • If writes not serialized
  • One processor might see the write of processor P2
    to location X 1st
  • Then, it might later see a write to location X by
    processor P1
  • (P1 actually wrote X before P2)
  • Value of P1 could be maintained indefinitely even
    though it was overwritten

61
Coherence/consistency
  • Coherence and consistency are complementary
  • Coherence defines actions of reads and writes to
    same memory location
  • Consistency defines actions of reads and writes
    with regard to accesses of other memory locations
  • Assumption for the following discussion
  • Write does not complete until all processors have
    seen effect of write
  • Processor does not change order of any write with
    any other memory accesses
  • Not exactly the case for either one really, but
    more later

62
Caches in coherent multiprocessors
  • In multiprocessors, caches at individual nodes
    help w/ performance
  • Usually by providing properties of migration
    and replication
  • Migration
  • Instead of going to centralized memory for each
    reference, a data word will migrate to a cache at
    a node
  • Reduces latency
  • Replication
  • If data simultaneously read by two different
    nodes, copy is made at each node
  • Reduces access latency and contention for shared
    item
  • Supporting these requires cache coherence
    protocols
  • Really, we need to keep track of shared blocks

63
Detail about snooping
64
Implementing protocols
  • We'll focus on the invalidation protocol
  • And start with a generic template for
    invalidation
  • To perform an invalidate:
  • Processor must acquire bus access
  • Broadcast the address to be invalidated on the
    bus
  • Processors connected to the bus snoop on
    addresses
  • If the address on the bus is in a processor's
    cache, the data is invalidated
  • Serialization of accesses enforces serialization
    of writes
  • When 2 processors compete to write to the same
    location, 1 gets access to the bus 1st

65
It's not THAT easy though
  • What happens on a cache miss?
  • With a write through cache, no problem
  • Data is always in main memory
  • In a shared memory machine, every cache write
    would go back to main memory; bad, bad, bad for
    bandwidth!
  • What about write back caches though?
  • Much harder.
  • Most recent value of data could be in a cache
    instead of memory
  • How to handle write back caches?
  • Snoop.
  • Each processor snoops every address placed on the
    bus
  • If a processor has a dirty copy of the requested
    cache block, it responds to the read request, and
    the memory request is cancelled

66
Specifics of snooping
  • Normal cache tags can be used
  • Existing valid bit makes it easy to invalidate
  • What about read misses?
  • Easy to handle too; rely on snooping capability
  • What about writes?
  • We'd like to know if any other copies of the
    block are cached
  • If they're NOT, we can save bus bandwidth
  • Can add an extra bit of state to solve this
    problem: a state bit
  • Tells us if the block is shared, i.e. if we must
    generate an invalidate
  • When a write to a block in shared state happens,
    the cache generates an invalidation and marks the
    block as private
  • No other invalidations sent by that processor for
    that block

67
Specifics of snooping
  • When the invalidation is sent, the state of the
    owner's (the processor with the sole copy of the
    cache block) cache block is changed from shared
    to unshared (or exclusive)
  • If another processor later requests the cache
    block, the state must be made shared again
  • Snooping cache also sees any misses
  • Knows when exclusive cache block has been
    requested by another processor and state should
    be made shared

68
Specifics of snooping
  • More overhead
  • Every bus transaction would have to check
    cache-addr. tags
  • Could easily overwhelm normal CPU cache accesses
  • Solutions
  • Duplicate the tags: snooping/CPU accesses can go
    on in parallel
  • Employ a multi-level cache with inclusion
  • Everything in the L1 cache is also in L2;
    snooping checks L2, the CPU checks L1

69
An example protocol
  • Bus-based protocol usually implemented with a
    finite state machine controller in each node
  • Controller responds to requests from the
    processor and the bus
  • Changes the state of the selected cache block and
    uses the bus to access data or invalidate it
  • An example protocol (which we'll walk through
    below; see also the simulation sketch that
    follows)

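Before the slide-by-slide walkthrough, here is a minimal software sketch of the write-invalidate protocol being illustrated (an assumed simplification, not code from the course): each cache line carries Valid, Dirty, and Shared bits, and every bus transaction is snooped by the other caches.

```c
#include <stdio.h>

#define NCACHES 2
#define NLINES  4   /* direct-mapped, 4 lines per cache, as in the example */

/* One cache line: tag plus Valid / Dirty / Shared bits. */
struct line { unsigned tag; int v, d, s; };
static struct line cache[NCACHES][NLINES];

/* Other caches snoop every bus transaction for the given address. */
static void snoop(int requester, unsigned tag, unsigned idx, int is_write)
{
    for (int c = 0; c < NCACHES; c++) {
        struct line *l = &cache[c][idx];
        if (c == requester || !l->v || l->tag != tag)
            continue;
        if (is_write)
            l->v = 0;          /* bus write miss: invalidate our copy */
        else {
            l->s = 1;          /* bus read: another cache now shares the line */
            l->d = 0;          /* (a dirty owner would supply data and write back) */
        }
    }
}

/* Processor access: check tags, go to the bus on a miss or on a write to a
 * shared line, and update the V/D/S bits accordingly. */
static void cpu_access(int c, unsigned addr, int is_write)
{
    unsigned idx = addr & (NLINES - 1);   /* simplistic index/tag split */
    unsigned tag = addr / NLINES;
    struct line *l = &cache[c][idx];
    int hit = l->v && l->tag == tag;

    if (!hit || (is_write && l->s)) {
        snoop(c, tag, idx, is_write);     /* bus transaction, others snoop */
        l->tag = tag;
        l->v = 1;
        l->s = is_write ? 0 : 1;          /* write => exclusive, read => shared */
    }
    if (is_write) {
        l->d = 1;                         /* line now differs from memory */
        l->s = 0;                         /* sole, exclusive copy */
    }
    printf("cache %d %s 0x%02x: %s (V=%d D=%d S=%d)\n",
           c, is_write ? "write" : "read ", addr,
           hit ? "hit" : "miss", l->v, l->d, l->s);
}

int main(void)
{
    cpu_access(0, 0x2A, 0);   /* CPU 0 reads 101010: miss, filled as shared */
    cpu_access(1, 0x2A, 0);   /* CPU 1 reads same line: both copies shared */
    cpu_access(0, 0x2A, 1);   /* CPU 0 writes: invalidates CPU 1, exclusive+dirty */
    cpu_access(1, 0x2A, 0);   /* CPU 1 re-reads: CPU 0's dirty copy becomes shared */
    return 0;
}
```

The walkthrough that follows steps through essentially this behavior one bus transaction at a time.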
70
[Diagram: a single processor P]
One of many processors.
71
[Diagram: processor P with request fields Addr=000000, R, W]
This indicates what operation the processor is
trying to perform and with what address.
72
[Diagram: processor P and its cache]
The processor's cache: Tag (4 bits), 4 lines
(ID), Valid, Dirty and Shared bits.
73
Note: For this somewhat simplified example we
won't concern ourselves with how many bytes (or
words) are in each line. Assume that it's more
than one.
74
[Diagram: the bus, with fields Addr=000000, R, W]
The bus, with indication of address and operation.
75
These bus operations are coming from other
processors, which aren't shown.
76
[Diagram: processor P, its cache, the bus, and main MEMORY]
Main memory.
77
Processor issues a read.
[Processor request: Addr=101010 (read); bus idle; MEMORY]
78
Cache reports... MISS.
[Processor request: Addr=101010 (read)]

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
0000  10  0  0  0
0000  11  0  0  0
79
Cache reports... MISS.
Because the tags don't match!
[Processor request: Addr=101010 (read)]

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
0000  10  0  0  0
0000  11  0  0  0
80
Data read from memory.
[Processor request: Addr=101010 (read)]

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
1010  10  1  0  1
0000  11  0  0  0
81
Data read from memory. The S bit indicates that
this line is shared, which means other caches
might have the same value.

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
1010  10  1  0  1
0000  11  0  0  0
82
From now on we will show these as 2-step
operations. Step 1: the request.
[Processor request: Addr=101010 (read)]

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
0000  10  0  0  0
0000  11  0  0  0
83
Step 2: what was the result and the change to the
cache. The result is a MISS, and the line is
filled from memory.

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
1010  10  1  0  1
0000  11  0  0  0
84
A write...
[Processor request: Addr=111100 (write)]

Tag   ID  V  D  S
0000  00  0  0  0
0000  01  0  0  0
1010  10  1  0  1
0000  11  0  0  0
85
Write Miss.
[Processor request: Addr=111100 (write)]
86
Write Miss. Keep in mind that since most cache
configurations have multiple bytes per line, a
write miss will actually require us to get the
line from memory into the cache first, since we
are only writing one byte into the line.
87
Note: The dirty bit signifies that the data in
the cache is not the same as in memory.
88
Another read...
[Processor request: Addr=101010 (read)]
89
...this time a HIT!
[Processor request: Addr=101010 (read)]
90
Now another write...
[Processor request: Addr=111100 (write)]
91
To a dirty line! This is a write hit, and since
the shared bit is 0 we know we are in the
exclusive state.
[Processor request: Addr=111100 (write)]
92
Now another processor, failing to find what it
needs in its cache, goes to the bus: a bus read
miss.
[Bus request: Addr=010101 (read), from another processor]
93
Our cache, which is monitoring the bus
(snooping), sees the miss but can't help.
[Bus request: Addr=010101 (read)]
94
Another bus request...
[Bus request: Addr=101010 (read)]
95
Since we have this value in our cache, we can
satisfy the request from our cache, assuming that
this will be quicker than from memory.
[Bus request: Addr=101010 (read)]
96
And another request. This time to a dirty line.
[Bus request: Addr=111100 (read)]
97
We have to supply the value out of our cache,
since it is more current than the value in
memory.
[Bus request: Addr=111100 (read)]
98
We also mark it as shared. Why?
99
If, for example, our next operation were a write
to this line...
[Processor request: Addr=111100 (write)]
100
...we would have to note that it was again
exclusive and let the other caches know (ZAP:
invalidate their copies).
[Processor request: Addr=111100 (write)]
101
We could then write repeatedly to this line, and
since we have exclusive ownership no one has to
know!
[Processor request: Addr=111100 (write)]
102
In a similar way we must respond to write misses
by other caches.
[Bus request: Addr=101010 (write), from another processor]
103
In this case we know that some other processor is
going to have a newer value, so we must mark
this line as invalid.
[Bus request: Addr=101010 (write)]
104
Now assume some other processor requests a
byte from the 111100 line of its cache.
[Bus request: Addr=111100]
105
Since our line is marked valid and exclusive, the
other caches' copies should be marked as invalid.
[Bus request: Addr=111100]
106
So the first thing that the other cache will do
is a read, to get the correct value for all the
bytes in the line before it writes the one new
byte.
[Bus request: Addr=111100 (read)]
107
Our cache supplies the value and goes to the
shared state.
[Bus request: Addr=111100 (read)]
108
Sometime later, for whatever reason, the other
cache writes back the value.
[Bus request: Addr=111100 (write-back)]
109
This requires us to mark our line as invalid,
since we no longer have the most current value
for this line.
[Bus request: Addr=111100 (write-back)]
110
We don't need to worry about the dirty bit,
since we already supplied that value to the other
cache. Its entry should now be marked as dirty.
111
This concludes our demonstration of some basic
multiprocessor techniques for guaranteeing cache
coherency and consistency.