Title: Multiprocessing and Scalability
1 Multiprocessing and Scalability
- A.R. Hurson
- Computer Science Department
- Missouri University of Science and Technology
- hurson@mst.edu
2 Multiprocessing and Scalability
- Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uni-processor systems.
- However, due to a number of difficult problems, the potential of these machines has been difficult to realize. The reasons include the following factors.
3 Multiprocessing and Scalability
- Advances in technology - the rapid rate of increase in uni-processor performance.
- Complexity of multiprocessor system design - this drastically affected the cost and the implementation cycle.
- Programmability of multiprocessor systems - the design complexity of parallel algorithms and parallel programs.
4 Multiprocessing and Scalability
- Programming a parallel machine is more difficult than programming a sequential one. In addition, it takes much more effort to port an existing sequential program to a parallel machine than to a newly developed sequential machine.
- The lack of good parallel programming environments and standard parallel languages has further aggravated this issue.
5 Multiprocessing and Scalability
- As a result, the absolute performance of many early concurrent machines was not significantly better than available or soon-to-be available uni-processors.
6 Multiprocessing and Scalability
- Recently, there has been an increased interest in large-scale or massively parallel processing systems. This interest stems from many factors, including:
- Advances in integrated circuit technology.
- Very high aggregate performance of these machines.
- Cost/performance ratio.
- Widespread use of small-scale multiprocessors.
- Advent of high-performance microprocessors.
7 Multiprocessing and Scalability
- Integrated technology
- The rate of advance in integrated circuit technology is slowing down.
- Chip density - integrated circuit technology now allows all performance features found in a complex processor to be implemented on a single chip, and adding more functionality has diminishing returns.
8 Multiprocessing and Scalability
- Integrated technology
- Studies of more advanced superscalar processors indicate that they may not offer a performance improvement of more than a factor of 2 to 4 for general applications.
9 Multiprocessing and Scalability
- Aggregate performance of machines
- The Cray X-MP (1983) had a peak performance of 235 MFLOPS.
- A node in an Intel iPSC/2 (1987) with attached vector units had a peak performance of 10 MFLOPS. As a result, a 256-processor configuration of the Intel iPSC/2 would offer less than 11 times the peak performance of a single Cray X-MP.
10 Multiprocessing and Scalability
- Aggregate performance of machines
- Today a 256-processor system might offer about 100 times the peak performance of the fastest single-processor systems.
- The Cray C90 has a peak performance of 952 MFLOPS, while a 256-processor configuration using the MIPS R8000 would have a peak performance of 76,800 MFLOPS.
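The ratios quoted on the last two slides can be checked with quick arithmetic. The per-node R8000 figure of 300 MFLOPS is inferred here from 76,800 / 256; all other numbers are the slides' own:

```python
# Peak-performance figures from the slides, in MFLOPS.
cray_xmp = 235        # Cray X-MP (1983)
ipsc2_node = 10       # one Intel iPSC/2 node with vector units (1987)
n = 256               # processors in the large configuration

ratio_1987 = n * ipsc2_node / cray_xmp
print(f"256-node iPSC/2 vs. one Cray X-MP: {ratio_1987:.1f}x")   # ~10.9x, i.e. "less than 11 times"

cray_c90 = 952        # Cray C90 peak
r8000_node = 300      # inferred: 76,800 MFLOPS / 256 nodes
ratio_1990s = n * r8000_node / cray_c90
print(f"256 x MIPS R8000 vs. one Cray C90: {ratio_1990s:.1f}x")  # ~80.7x, roughly the "100 times" claim
```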
11 Multiprocessing and Scalability
- Economy
- The Cray C90 or Convex C4, which exploit IC and packaging technology, cost about $1 million.
- A high-end microprocessor costs somewhere between $2,000 and $5,000.
- By using a small number of microprocessors, equal performance can be achieved at a fraction of the cost of more advanced supercomputers.
12 Multiprocessing and Scalability
- Small-scale multiprocessors
- Due to advances in technology, multiprocessing has come to desktops and high-end personal computers. This has led to improvements in parallel processing hardware and software.
13 Multiprocessing and Scalability
- In spite of these developments, many challenges need to be overcome in order to exploit the potential of large-scale multiprocessors:
- Scalability
- Ease of programming
14 Multiprocessing and Scalability
- Scalability - maintaining the cost-performance of a uni-processor while linearly increasing overall performance as processors are added.
- Ease of programming - programming a parallel system is inherently more difficult than programming a uni-processor, where there is a single thread of control. Providing a single shared address space to the programmer is one way to alleviate this problem - the shared-memory architecture.
15 Multiprocessing and Scalability
- Shared memory eases the burden of programming a parallel machine, since there is no need to distribute data or explicitly communicate data between the processors in software.
- Shared memory is also the model used on small-scale multiprocessors, so it is easier to port programs that have been developed for a small system to a larger shared-memory system.
16 Multiprocessing and Scalability
- Flynn's Taxonomy
- Flynn classified the concurrent space according to the multiplicity of instruction and data streams.
17 Multiprocessing and Scalability
- Flynn's Taxonomy
- The Cartesian product of these two sets defines four different classes:
- SISD (single instruction stream, single data stream)
- SIMD (single instruction stream, multiple data streams)
- MISD (multiple instruction streams, single data stream)
- MIMD (multiple instruction streams, multiple data streams)
18 Multiprocessing and Scalability
- Flynn's Taxonomy
- The MIMD class can be further divided based on:
- Memory structure - global or distributed.
- Communication/synchronization mechanism - shared variable or message passing.
19 Multiprocessing and Scalability
- Flynn's Taxonomy
- As a result, we have four additional classes of computers:
- GMSV (global memory, shared variables) - shared-memory multiprocessors
- GMMP (global memory, message passing) - rarely used in practice
- DMSV (distributed memory, shared variables) - distributed shared memory
- DMMP (distributed memory, message passing) - distributed memory (multi-computers)
20 Multiprocessing and Scalability
- A taxonomy of MIMD organization
- The memory can be:
- Centralized
- Distributed
21 Multiprocessing and Scalability
- A taxonomy of MIMD organization
- Inter-processor communication can be either through explicit messages or through the common address space. This brings out two classes:
- Message passing systems (also called distributed-memory systems; it is natural to assume that the memory in a message passing system is distributed), e.g., Intel iPSC, Intel Paragon, nCUBE 2, IBM SP1 and SP2.
- Shared-memory systems, e.g., IBM RP3, Cray X-MP, Cray Y-MP, Cray C90, Sequent Symmetry.
22 Multiprocessing and Scalability
- Multiple instruction streams, multiple data streams (MIMD) organizations are made of multiple processors and multiple memory modules connected together via some interconnection network.
- In this paradigm, based on the memory organization and the way processors communicate with each other, one can distinguish two broad categories:
- Shared memory organization
- Message passing organization
23 Multiprocessing and Scalability
- A shared memory organization typically accomplishes inter-processor coordination through a global memory shared by all processors - the global address space is shared by all processors.
- In shared-memory multiprocessor systems the processors are more tightly coupled. Memory is accessible to all processors, and communication among processors is through shared variables or messages deposited in shared memory buffers.
24 Multiprocessing and Scalability
- Shared-memory system (programmer's view)
25 Multiprocessing and Scalability
- Shared-memory system (programmer's view)
- In a shared-memory machine, processors can distinguish communication destination, type, and value through shared-memory addresses.
- There is no requirement for the programmer to manage the movement of data.
26 Multiprocessing and Scalability
- Shared-Memory Multiprocessor
27 Multiprocessing and Scalability
- Multiprocessor interconnection networks can be classified based on a number of criteria:
- Mode of operation - synchronous vs. asynchronous.
- In synchronous mode a single clock is used by all components in the system (the whole system operates in a lock-step manner).
- In asynchronous mode, hand-shaking signals are used to coordinate the operations.
- Synchronous systems are slower than asynchronous systems, but they are race- and hazard-free.
28 Multiprocessing and Scalability
- Control strategy - centralized vs. decentralized.
- In centralized control systems, a single central control unit is used to oversee and control the operations of the components.
- In decentralized control, the control functions are distributed among different components.
- The reliability of centralized control is its weak point (single point of failure).
- A crossbar switch is centrally controlled, while a multistage interconnection network is controlled in a decentralized fashion.
29 Multiprocessing and Scalability
- Switching techniques - circuit switching vs. packet switching.
- In circuit switching, a complete path has to be established before the start of communication. This path remains intact during the communication.
- In packet switching, communication is divided into packets. Packets from one component to another are sent in a store-and-forward manner.
- Packet switching uses the network resources more efficiently.
30 Multiprocessing and Scalability
- Topology - static vs. dynamic.
- Topology defines how the resources in an interconnection network communicate with each other. In a ring topology, for example, each processor (say p_k) is physically connected to its neighbors (p_{k-1}, p_{k+1}).
- In a static topology, direct fixed links are established among nodes to form a fixed network.
- In a dynamic topology, connections are established as needed.
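The ring example above can be sketched in a few lines; the function name is mine, and the wrap-around comes from modular arithmetic:

```python
def ring_neighbors(k, n):
    """Physical neighbors of processor p_k in an n-processor ring topology."""
    return ((k - 1) % n, (k + 1) % n)

# In an 8-processor ring, p_0's neighbors wrap around to p_7 and p_1.
print(ring_neighbors(0, 8))  # (7, 1)
print(ring_neighbors(5, 8))  # (4, 6)
```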
31 Multiprocessing and Scalability
- A shared-memory architecture allows all memory locations to be accessed by every processor. This helps to ease the burden of programming a parallel system, since:
- There is no need to distribute data or explicitly communicate data between processors in the system.
- Shared memory is the model adopted on small-scale multiprocessors, so it is easier to map programs that have been parallelized for small-scale systems to a larger shared-memory system.
32 Multiprocessing and Scalability
- A message passing organization typically combines the local memory and processor as a node of the interconnection network - there is no global address space, so data must be moved from one processor to another via messages.
- In message passing multiprocessor systems (distributed memory), processors communicate with each other by sending explicit messages.
33 Multiprocessing and Scalability
- Message passing system (programmer's view)
34 Multiprocessing and Scalability
- Message passing system (programmer's view)
- In a message passing machine, the user must explicitly communicate all information passed between processors. Unless the communication patterns are very regular, the management of this data movement is very difficult.
- Note that multi-computers are equivalent to this organization. It is also referred to as a No Remote Memory Access (NORMA) machine.
35 Multiprocessing and Scalability
- Message passing multiprocessor
36 Multiprocessing and Scalability
- Inter-processor communication is critical in comparing the performance of message passing and shared-memory machines:
- Communication in a message passing environment is direct and is initiated by the producer of the data.
- In a shared-memory system, communication is indirect, and the producer usually moves the data no further than memory. The consumer then has to fetch the data from memory, which decreases performance.
37 Multiprocessing and Scalability
- In a shared-memory system, communication requires no intervention on the part of a run-time library or operating system. In a message passing environment, access to the network port is typically managed by the system software. This overhead at the sending and receiving processors makes the start-up costs of communication much higher (usually tens to hundreds of microseconds).
38 Multiprocessing and Scalability
- As a result of the high communication cost in a message passing organization, either performance is compromised or a coarser grain of parallelism, and hence more limitation on the exploitation of available parallelism, must be adopted.
- Note that the start-up overhead of communication in a shared-memory organization is typically on the order of microseconds.
39 Multiprocessing and Scalability
- Communication in shared-memory systems is usually demand-driven by the consuming processor.
- The problem here is overlapping communication with computation. This is not a problem for small data items. However, it can degrade performance if there is frequent communication or a large amount of data is exchanged.
40 Multiprocessing and Scalability
- Consider the following case in a message passing environment:
- A producer process wants to send 10 words to a consumer process. In a typical message passing environment with a blocking send and receive protocol, this would be coded as:

Producer process: Send(proc_i, process_j, @sbuffer, num-bytes)
Consumer process: Receive(@rbuffer, max-bytes)
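The blocking semantics above can be mimicked in a few lines of Python. This is a sketch, not the slides' actual API: send and receive here are hypothetical wrappers around a bounded queue standing in for the network link, and threads stand in for the two processes:

```python
import queue
import threading

channel = queue.Queue(maxsize=1)       # bounded queue plays the role of the network

def send(ch, sbuffer, num_words):
    ch.put(list(sbuffer[:num_words]))  # blocks while the channel is full

def receive(ch, max_words):
    return ch.get()[:max_words]        # blocks until a message arrives

source = list(range(10))               # the producer wants to send 10 words
producer = threading.Thread(target=send, args=(channel, source, 10))
producer.start()
rbuffer = receive(channel, 10)         # the consumer blocks here until the send completes
producer.join()
print(rbuffer)
```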
41 Multiprocessing and Scalability
- This code usually breaks down into the following steps:
- The operating system checks protections and then programs the network DMA controller to move the message from the sender's buffer to the network interface.
- A DMA channel on the consumer processor has been programmed to move all messages to a common system buffer. When the message arrives, it is moved from the network interface to the system buffer and an interrupt is posted to the processor.
42 Multiprocessing and Scalability
- The receiving processor services the interrupt and determines which process the message is intended for. It then copies the message to the specified receive buffer and reschedules the program on the processor's ready queue.
- The process is dispatched on the processor and reads the message from the user's receive buffer.
43 Multiprocessing and Scalability
- On a shared-memory system there is no operating system involved, and the processes can transfer data using a shared data area. Assuming the data is protected by a flag indicating its availability and its size, we have:

Producer process:
    For i = 0 to num-bytes: buffer[i] = source[i]
    Flag = num-bytes

Consumer process:
    While (Flag == 0) {}
    For i = 0 to Flag: dest[i] = buffer[i]
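The same transfer, written directly from the pseudocode above as Python threads. Note that the producer raises the flag only after the whole buffer is filled, which is what makes the consumer's copy safe; real hardware would additionally need memory barriers, a detail CPython's interpreter hides here:

```python
import threading
import time

NUM_WORDS = 10
buffer = [0] * NUM_WORDS     # shared data area
flag = 0                     # 0 = not ready; otherwise the number of words written

def producer(source):
    global flag
    for i in range(NUM_WORDS):
        buffer[i] = source[i]
    flag = NUM_WORDS         # publish the size last, as on the slide

def consumer(dest):
    while flag == 0:         # demand-driven: the consumer spins until data is ready
        time.sleep(0)
    for i in range(flag):
        dest[i] = buffer[i]

source = list(range(100, 110))
dest = [0] * NUM_WORDS
c = threading.Thread(target=consumer, args=(dest,))
p = threading.Thread(target=producer, args=(source,))
c.start(); p.start()
p.join(); c.join()
print(dest)
```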
44 Multiprocessing and Scalability
- For the message passing environment, the dominant factors are the operating system overhead, programming the DMA, and the internal processing.
- For the shared-memory environment, the overhead falls primarily on the consumer reading the data, since it is then that the data moves from the global memory to the consumer processor.
45 Multiprocessing and Scalability
- Thus, for a short message the shared-memory system is much more efficient. For long messages, the message passing environment has similar or possibly higher performance.
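This short-versus-long trade-off can be illustrated with a toy cost model. All four parameters below are made-up values chosen only to reflect the slides' qualitative claim (large fixed start-up cost for message passing, larger per-word fetch cost for shared memory), not measurements of any real machine:

```python
# Illustrative cost model in microseconds (all parameters assumed).
MP_STARTUP_US = 50.0    # OS/DMA start-up overhead per message
MP_PER_WORD_US = 0.01   # network transfer cost per word
SM_STARTUP_US = 1.0     # start-up cost in shared memory
SM_PER_WORD_US = 0.05   # consumer's per-word fetch cost from global memory

def t_message_passing(n):
    return MP_STARTUP_US + n * MP_PER_WORD_US

def t_shared_memory(n):
    return SM_STARTUP_US + n * SM_PER_WORD_US

for n in (10, 100, 1000, 10000):
    print(n, t_message_passing(n), t_shared_memory(n))
# Shared memory wins for short messages; message passing catches up for long ones.
```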
46 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Message passing systems minimize the hardware overhead.
- The single-thread performance of a message passing system is as high as that of a uni-processor system, since memory can be tightly coupled to a single processor.
- Shared-memory systems are easier to program.
47 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Overall, the shared-memory paradigm is preferred, since it is easier to use and more flexible.
- A shared-memory organization can emulate a message passing environment, while the reverse is not possible without a significant performance degradation.
48 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Shared-memory systems offer lower performance than message passing systems if communication is frequent and not overlapped with computation.
- In a shared-memory environment, the interconnection network between processors and memory usually requires higher bandwidth and more sophistication than the network in a message passing environment. This can increase overhead costs to the level that the system does not scale well.
49 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Solving the latency problem through memory caching and cache coherence is the key to a shared-memory multiprocessor.
50 Multiprocessing and Scalability
- In the design of a shared memory organization, a number of basic issues have to be taken into account:
- Access control determines which process accesses are possible to which resources.
- Access control protocols check every access request issued by the processors to the shared memory against the contents of the access control table.
51 Multiprocessing and Scalability
- Synchronization rules and constraints limit the times at which sharing processors may access shared resources.
- Appropriate synchronization ensures that information flows properly and ensures system functionality.
52 Multiprocessing and Scalability
- Protection prevents processes from making
arbitrary access to resources belonging to other
processes.
53 Multiprocessing and Scalability
- In a message passing organization, processors communicate with each other via send/receive operations. The processing units of a message passing system are connected in a variety of ways, ranging from architecture-specific interconnection structures to geographically dispersed networks.
- The message passing approach, in general, is scalable.
54 Multiprocessing and Scalability
- Two design factors must be considered in designing interconnection networks for message passing systems:
- Bandwidth, which is the number of bits that can be transmitted per unit of time, and
- Latency, which is defined as the time to complete a message transfer.
55 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- In most shared-memory organizations, processors and memory are separated by an interconnection network.
- For small-scale shared-memory systems, the interconnection network is a simple bus.
- For large-scale shared-memory systems, similar to the message passing systems, we use multistage networks.
56 Multiprocessing and Scalability
- Shared memory multiprocessors - a taxonomy
- Based on the memory organization, shared memory multiprocessor systems are classified as:
- Uniform Memory Access (UMA) systems,
- Non-Uniform Memory Access (NUMA) systems, and
- Cache-Only Memory Architecture (COMA) systems.
- Note that most large-scale shared memory machines utilize a NUMA structure.
57 Multiprocessing and Scalability
- Uniform Memory Access (UMA) architecture
- The physical memory is uniformly shared by all the processors.
- Each processor may use a private cache.
- UMA is equivalent to the tightly coupled multiprocessor class.
- When all processors have equal access to peripheral devices, the system is called a symmetric multiprocessor (SMP).
58 Multiprocessing and Scalability
- In UMA, the shared memory is accessible by all processors via an interconnection network in the same way a single processor accesses its memory. As a result, all processors have equal access time to any memory address.
- The interconnection network in UMA can be a single bus, multiple buses, a crossbar switch, or a multiport memory.
59 Multiprocessing and Scalability
- Uniform Memory Access architecture
60 Multiprocessing and Scalability
- Symmetric Multiprocessor
- Each processor has its own cache; all the processors and memory modules are attached to the same interconnect - usually a shared bus.
- The ability to access all shared data efficiently and uniformly using ordinary load and store operations, and
- the automatic movement and replication of data in the local caches
- make this organization attractive for concurrent processing.
61 Multiprocessing and Scalability
62 Multiprocessing and Scalability
- Non-Uniform Memory Access (NUMA)
- In NUMA, each processor has part of the shared memory attached. However, the access time to a shared memory address depends on the distance between the processor and the designated memory module - the access time varies with the location of the memory word.
- In NUMA, a number of architectures can be used to interconnect processors to memory modules.
- NUMA is also referred to as the Distributed Shared-Memory (DSM) multiprocessor architecture.
63 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- UMA is easier to program than NUMA.
- The NUMA configuration is not symmetric; as a result, locality should be exploited to improve performance.
- Most small-scale shared-memory systems are based on the UMA philosophy, while large-scale shared-memory systems are based on NUMA.
64 Multiprocessing and Scalability
- Non-Uniform Memory Access (NUMA)
65 Multiprocessing and Scalability
- Shared Memory Multiprocessor - distributed memory
66 Multiprocessing and Scalability
- Distributed shared-memory architecture
- As noted, it is possible to build a shared-memory machine with a distributed-memory organization.
- Such a machine has the same structure as a message passing system, but instead of sending messages to other processors, every processor can directly address both its local memory and the remote memories of other processors.
67 Multiprocessing and Scalability
- Cache-Only Memory Architecture (COMA)
- In COMA, similar to NUMA, each processor has part of the shared memory. However, the shared memory consists of cache memory - data must be migrated to the processor requesting it.
- In short, COMA is a special class of NUMA in which the distributed global main memory consists of caches.
68 Multiprocessing and Scalability
- Cache-Only Memory Architecture (COMA)
69 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- Since all communication and local computations generate memory accesses in a shared address space, from a system architect's perspective, the key high-level design issue lies in the organization of the memory hierarchy.
- In general, there are the following choices:
70 Multiprocessing and Scalability
- Shared Memory Multiprocessor - shared cache
71 Multiprocessing and Scalability
- Shared Memory Multiprocessor - shared cache
- This platform is more suitable for small-scale configurations of 2 to 8 processors.
- It is the more common approach for multiprocessors on a chip.
72 Multiprocessing and Scalability
- Shared Memory Multiprocessor - bus-based shared memory
73 Multiprocessing and Scalability
- Shared Memory Multiprocessor - bus-based shared memory
- This platform is suitable for small to medium configurations of 20 to 30 processors.
- Due to its simplicity, it is very popular.
- The bandwidth of the shared bus is the major bottleneck of the system.
- This configuration is also known as the cache-coherent shared memory configuration.
74 Multiprocessing and Scalability
- Shared Memory Multiprocessor - dance hall
75 Multiprocessing and Scalability
- Shared Memory Multiprocessor - dance hall
- This platform is symmetric and scalable.
- All memory modules are far away from the processors.
- In large configurations, several hops are needed to access memory.
76 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- In all shared memory multiprocessor organizations, caching is very attractive for reducing the memory latency and the required bandwidth.
- In all design cases except the shared cache, each processor has at least one level of private cache. This raises the issue of cache coherence as an attractive research direction.
77 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- The architecture supports the caching of both shared and private data.
- When a private item is requested, it is migrated to the cache, reducing the average access time as well as the memory bandwidth required.
- When shared data is cached, the shared value may be replicated in multiple caches. This reduces the access latency, the memory bandwidth required, and contention for the shared data item.
78 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- Caching of shared data, however, introduces a new problem - cache coherence.
- The cache coherence problem means that two different processors can have two different values for the same location.
79 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- Assume write-through caches in a 2-processor configuration:
- [Table: Time / Event / contents of CPU A's cache / CPU B's cache / memory location X]
80 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- Copies of the same memory block are resident in one or more caches, and processors make modifications to the very same cache block.
- Unless special action is taken, other processors will continue to access the old, stale copy of the block in their caches.
81 Multiprocessing and Scalability
- Cache coherence problem - an example
- Consider the following configuration and the sequence of events:
- 1. p1 reads u from main memory
- 2. p2 reads u from main memory
- 3. p3 writes u, changing it from 5 to 7, using a write-through policy
- 4. p1 reads u again
82 Multiprocessing and Scalability
- Cache coherence problem - an example
83 Multiprocessing and Scalability
- Cache coherence problem - an example
- Unless we take special action, if p1 reads u for the second time, it might get an old value from its cache.
- The value read by p2 depends on the write policy.
84 Multiprocessing and Scalability
- Cache coherence problem - an example
- Assume that in the previous example p3 instead uses a write-back policy. Does it help the situation or not?
85 Multiprocessing and Scalability
- Cache coherence problem - an example
86 Multiprocessing and Scalability
- Cache coherence problem - an example
- Now the situation is even worse: if p2 initiates a read on u, it will get an old value of u from main memory.
- In a shared memory multiprocessor system, reads and writes of shared variables by different processors are frequent events. As a result, cache coherence needs to be addressed as a basic hardware design issue.
87 Multiprocessing and Scalability
- Cache coherence problem
- Informally, memory is coherent if any read of a data item returns the most recently written value of that data item.
- This simple definition covers two different aspects of memory behavior:
- Coherence - what values can be returned by a read?
- Consistency - when will a written value be returned by a read?
88 Multiprocessing and Scalability
- Cache coherence problem
- A memory system is coherent if it satisfies the following conditions.
- Preserving program order - a read by a processor P to a location X that follows a write by P to X, with no writes to X by another processor between the write and the read by P, always returns the value written by P.
89 Multiprocessing and Scalability
- Cache coherence problem
- Coherent view - a read by a processor to location X that follows a write by another processor to X returns the written value if the read and the write are sufficiently separated and no other write to X occurs between the two accesses.
90 Multiprocessing and Scalability
- Cache coherence problem
- Write serialization - writes to the same location are serialized: two writes to the same location by two processors are seen in the same order by all processors.
91 Multiprocessing and Scalability
- Cache coherence definitions
- A memory operation refers to a read, write, or read-modify-write on an individual data element.
- A memory operation is atomic.
- Write propagation means that writes become visible to all processors.
- Write serialization means that all writes to a location are seen in the same order by all processors.
92 Multiprocessing and Scalability
- Cache coherence
- For a multiprocessor system, memory is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:
- operations issued by any particular process occur in the order in which they were issued to the memory system by that process, and
- the value returned by each read operation is the value written by the last write to that location in the serial order.
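The two conditions above can be turned into a small checker. This is an illustrative sketch (function and variable names are mine): given a proposed serial order of operations on one location, it verifies per-process program order and that every read returns the last written value:

```python
def is_coherent_serialization(serial, program_orders):
    """Check a proposed serial order of operations on one location.
    Each operation is a tuple (proc, kind, value) with kind 'R' or 'W';
    program_orders maps each process to its operations in issue order."""
    # Condition 1: the serial order preserves each process's program order.
    for proc, ops in program_orders.items():
        if [op for op in serial if op[0] == proc] != ops:
            return False
    # Condition 2: each read returns the value of the last preceding write.
    last_written = None
    for proc, kind, value in serial:
        if kind == 'W':
            last_written = value
        elif value != last_written:
            return False
    return True

p1 = [('p1', 'W', 5), ('p1', 'R', 7)]
p2 = [('p2', 'W', 7)]
ok  = [p1[0], p2[0], p1[1]]   # W5, W7, R7: the read sees the last write
bad = [p1[0], p1[1], p2[0]]   # the read returns 7 before 7 is ever written
print(is_coherent_serialization(ok,  {'p1': p1, 'p2': p2}))   # True
print(is_coherent_serialization(bad, {'p1': p1, 'p2': p2}))   # False
```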
93 Multiprocessing and Scalability
- Cache coherence protocols
- The protocols that maintain coherence for multiple processors are called cache coherence protocols.
- The key issue in implementing a cache coherence protocol is tracking the state of any sharing of a data block.
94 Multiprocessing and Scalability
- Cache coherence protocols
- Directory based - the sharing status of a block of physical memory is kept in just one location, the directory.
- Snooping based - every cache that has a copy of the data is on the shared memory bus, and all caches snoop on the bus to determine whether or not they have a copy of a block that is requested on the bus.
95 Multiprocessing and Scalability
- Cache coherence protocols
- There are two ways to maintain the coherence requirement:
- Write invalidation
- Write propagation (write update or write broadcast)
96 Multiprocessing and Scalability
- Write invalidation - ensures that a processor has exclusive access to a data item before it writes that item.
- [Table: Processor activity / Bus activity / CPU A's cache / CPU B's cache / memory location X (initially 0)]
97 Multiprocessing and Scalability
- Write propagation - all cached copies of a data item are updated when the data item is written.
- [Table: Processor activity / Bus activity / CPU A's cache / CPU B's cache / memory location X (initially 0)]
98 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Within the class of bus-based shared memory multiprocessors, this is a simple and elegant solution. Recall that:
- a bus is a single set of wires connecting resources,
- every resource attached to the bus can observe every bus transaction, and
- all transactions that appear on the bus are visible to the cache controllers in the same order.
99 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- When a processor issues a request to its cache, its cache controller examines the state of the cache and takes suitable actions, including a request to access memory.
- All cache controllers snoop on the bus and monitor transactions.
100 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- A snooping cache controller may take action if a bus transaction is relevant to it.
- In the case of a write-through policy, if a snooping cache has a copy of the data, it either:
- updates its copy - update-based protocols, or
- invalidates its copy - invalidation-based protocols.
- In either case, when a processor requests a block, it will see the most recent value.
101 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Refer to our previous example:
102 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- When p3 writes into u, its cache controller generates a bus transaction to update memory.
- This bus transaction is relevant to p1, and hence p1's cache controller invalidates its own copy of the block containing u.
- The main memory updates u to 7.
- Subsequent reads of u by processors p1 and p2 generate misses, and hence they will get the correct value of 7 from the main memory.
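The walk-through above can be sketched as a toy write-through, invalidation-based snooping model. This is a simplification, not a hardware-accurate protocol: it tracks a single variable u, every write goes straight to memory, and the "bus" is modeled by invalidating all other caches on a write:

```python
# Toy model of the p1/p2/p3 example with write-through and invalidation.
memory = {'u': 5}
caches = {p: {} for p in ('p1', 'p2', 'p3')}

def read(p, addr):
    if addr not in caches[p]:            # miss: fetch the value from main memory
        caches[p][addr] = memory[addr]
    return caches[p][addr]

def write(p, addr, value):
    caches[p][addr] = value
    memory[addr] = value                 # write-through: memory is updated immediately
    for other, cache in caches.items():  # snoop: invalidate every other cached copy
        if other != p:
            cache.pop(addr, None)

read('p1', 'u')        # step 1: p1 reads u (5)
read('p2', 'u')        # step 2: p2 reads u (5)
write('p3', 'u', 7)    # step 3: p3 writes 7; p1's and p2's copies are invalidated
print(read('p1', 'u'), read('p2', 'u'))  # 7 7 -- both miss and refetch from memory
```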
103 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Simplicity is the main advantage of snooping protocols: there is no need to explicitly use coherence operations in the program.
- By extending the cache controller and exploiting the properties of the bus, the reads and writes that are natural to the program keep the caches coherent.
- However, snooping systems with a write-through policy are not very efficient, especially given the natural bandwidth limitation of a bus.
104 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Let us look at the snooping protocol with a write-back policy. The processors are able to write to different blocks in their local caches concurrently without any bus transactions.
- Recall that in a uni-processor system, updates to main memory for blocks resident in the cache are done at replacement time - a cache block may be dirty for a period.
- In a multiprocessor system, this modified state is used by the protocols to indicate exclusive ownership of the block by a cache.
105 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- A cache is said to be the owner of a block if it must supply the data upon a request for that block.
- A cache has an exclusive copy of a block if it is the only cache with a valid copy of the block - exclusivity implies that the cache may modify the block without notifying anyone else.
106 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- As a result, if a cache does not have exclusivity, then it cannot write into a block before first putting a transaction on the bus to communicate with the others. Consequently, a processor might have the block in its cache in a valid state, but since a transaction must be generated, the write is treated as a write miss.
107 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- If a cache has a modified block, then it is the owner and thus has exclusivity.
- On a write miss in an invalidation protocol, a special form of transaction, a read-exclusive, is generated to acquire exclusive ownership.
- As a result, concurrent writes to the same block are not allowed - the read-exclusive bus transactions are serialized.
108 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors (SSMP) provide the shared-memory programming model while removing the bottleneck of today's small-scale systems.
109 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- Scalability must be physically possible and technically feasible.
- Adding processors increases the potential computational capability of the system, but to realize this potential, all aspects of the system must be scaled up - especially the memory bandwidth.
110 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- A natural solution is to distribute memory among processors so that each processor has direct access to its local memory.
- However, the interconnection network connecting the processors must provide scalable bandwidth at reasonable cost and latency.
111 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- Scalable systems are less closely coupled than bus-based shared memory multiprocessors, so the interactions among processors and between processors and memories are different.
- A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system.
112 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- Scalability is studied based on its effect on the following metrics:
- Throughput
- Latency per operation
- Cost
- Implementation