Title: Multiprocessing and Scalability
1 Multiprocessing and Scalability
- A.R. Hurson
- Computer Science Department
- Missouri University of Science and Technology
- hurson@mst.edu
2 Multiprocessing and Scalability
- Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uni-processor systems.
- However, due to a number of difficult problems, the potential of these machines has been difficult to realize. The reasons include the following factors.
3 Multiprocessing and Scalability
- Advances in technology - the rapid rate of increase in uni-processor performance.
- Complexity of multiprocessor system design - this drastically affected the cost and the implementation cycle.
- Programmability of multiprocessor systems - the design complexity of parallel algorithms and parallel programs.
4 Multiprocessing and Scalability
- Programming a parallel machine is more difficult than programming a sequential one. In addition, it takes much more effort to port an existing sequential program to a parallel machine than to a newly developed sequential machine.
- The lack of good parallel programming environments and standard parallel languages has further aggravated this issue.
5 Multiprocessing and Scalability
- As a result, the absolute performance of many early concurrent machines was not significantly better than available or soon-to-be available uni-processors.
6 Multiprocessing and Scalability
- Recently, there has been an increased interest in large-scale or massively parallel processing systems. This interest stems from many factors, including:
- Advances in integrated circuit technology.
- Very high aggregate performance of these machines.
- Cost/performance ratio.
- Widespread use of small-scale multiprocessors.
- Advent of high-performance microprocessors.
7 Multiprocessing and Scalability
- Integrated technology
- The rate of advance in integrated circuit technology is slowing down.
- Chip density - integrated circuit technology now allows all performance features found in a complex processor to be implemented on a single chip, and adding more functionality has diminishing returns.
8 Multiprocessing and Scalability
- Integrated technology
- Studies of more advanced superscalar processors indicate that they may not offer a performance improvement of more than a factor of 2 to 4 for general applications.
9 Multiprocessing and Scalability
- Aggregate performance of machines
- The Cray X-MP (1983) had a peak performance of 235 MFLOPS.
- A node in an Intel iPSC/2 (1987) with attached vector units had a peak performance of 10 MFLOPS. As a result, a 256-processor configuration of the Intel iPSC/2 would offer less than 11 times the peak performance of a single Cray X-MP.
10 Multiprocessing and Scalability
- Aggregate performance of machines
- Today a 256-processor system might offer about 100 times the peak performance of the fastest single-processor systems.
- The Cray C90 has a peak performance of 952 MFLOPS, while a 256-processor configuration using the MIPS R8000 would have a peak performance of 76,800 MFLOPS.
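The ratios quoted on the last two slides can be checked with quick arithmetic. The per-node R8000 figure of 300 MFLOPS is inferred here from 76,800 / 256; all other numbers are the slides' own:

```python
# Peak-performance figures from the slides, in MFLOPS.
cray_xmp = 235        # Cray X-MP (1983)
ipsc2_node = 10       # one Intel iPSC/2 node with vector units (1987)
n = 256               # processors in the large configuration

ratio_1987 = n * ipsc2_node / cray_xmp
print(f"256-node iPSC/2 vs. one Cray X-MP: {ratio_1987:.1f}x")   # ~10.9x, i.e. "less than 11 times"

cray_c90 = 952        # Cray C90 peak
r8000_node = 300      # inferred: 76,800 MFLOPS / 256 nodes
ratio_1990s = n * r8000_node / cray_c90
print(f"256 x MIPS R8000 vs. one Cray C90: {ratio_1990s:.1f}x")  # ~80.7x, roughly the "100 times" claim
```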
11 Multiprocessing and Scalability
- Economy
- The Cray C90 or Convex C4, which exploit IC and packaging technology, cost about $1 million.
- A high-end microprocessor costs somewhere between $2,000 and $5,000.
- By using a small number of microprocessors, equal performance can be achieved at a fraction of the cost of more advanced supercomputers.
12 Multiprocessing and Scalability
- Small-scale multiprocessors
- Due to advances in technology, multiprocessing has come to desktops and high-end personal computers. This has led to improvements in parallel processing hardware and software.
13 Multiprocessing and Scalability
- In spite of these developments, many challenges need to be overcome in order to exploit the potential of large-scale multiprocessors:
- Scalability
- Ease of programming
14 Multiprocessing and Scalability
- Scalability - maintaining the cost-performance of a uni-processor while linearly increasing overall performance as processors are added.
- Ease of programming - programming a parallel system is inherently more difficult than programming a uni-processor, where there is a single thread of control. Providing a single shared address space to the programmer is one way to alleviate this problem - the shared-memory architecture.
15 Multiprocessing and Scalability
- Shared memory eases the burden of programming a parallel machine, since there is no need to distribute data or explicitly communicate data between the processors in software.
- Shared memory is also the model used on small-scale multiprocessors, so it is easier to port programs that have been developed for a small system to a larger shared-memory system.
16 Multiprocessing and Scalability
- Flynn's Taxonomy
- Flynn classified the concurrent space according to the multiplicity of instruction and data streams.
17 Multiprocessing and Scalability
- Flynn's Taxonomy
- The Cartesian product of these two sets defines four different classes:
- SISD (single instruction stream, single data stream)
- SIMD (single instruction stream, multiple data streams)
- MISD (multiple instruction streams, single data stream)
- MIMD (multiple instruction streams, multiple data streams)
18 Multiprocessing and Scalability
- Flynn's Taxonomy
- The MIMD class can be further divided based on:
- Memory structure - global or distributed.
- Communication/synchronization mechanism - shared variable or message passing.
19 Multiprocessing and Scalability
- Flynn's Taxonomy
- As a result, we have four additional classes of computers:
- GMSV (global memory, shared variables) - shared-memory multiprocessors
- GMMP (global memory, message passing) - rarely used in practice
- DMSV (distributed memory, shared variables) - distributed shared memory
- DMMP (distributed memory, message passing) - distributed memory (multi-computers)
20 Multiprocessing and Scalability
- A taxonomy of MIMD organization
- The memory can be:
- Centralized
- Distributed
21 Multiprocessing and Scalability
- A taxonomy of MIMD organization
- Inter-processor communication can be either through explicit messages or through the common address space. This brings out two classes:
- Message passing systems (also called distributed-memory systems; it is natural to assume that the memory in a message passing system is distributed), e.g., Intel iPSC, Intel Paragon, nCUBE 2, IBM SP1 and SP2.
- Shared-memory systems, e.g., IBM RP3, Cray X-MP, Cray Y-MP, Cray C90, Sequent Symmetry.
22 Multiprocessing and Scalability
- Multiple instruction streams, multiple data streams (MIMD) organizations are made of multiple processors and multiple memory modules connected together via some interconnection network.
- In this paradigm, based on the memory organization and the way processors communicate with each other, one can distinguish two broad categories:
- Shared memory organization
- Message passing organization
23 Multiprocessing and Scalability
- A shared memory organization typically accomplishes inter-processor coordination through a global memory shared by all processors - the global address space is shared by all processors.
- In shared-memory multiprocessor systems the processors are more tightly coupled. Memory is accessible to all processors, and communication among processors is through shared variables or messages deposited in shared memory buffers.
24 Multiprocessing and Scalability
- Shared-memory system (programmer's view)
25 Multiprocessing and Scalability
- Shared-memory system (programmer's view)
- In a shared-memory machine, processors can distinguish communication destination, type, and value through shared-memory addresses.
- There is no requirement for the programmer to manage the movement of data.
26 Multiprocessing and Scalability
- Shared-Memory Multiprocessor
27 Multiprocessing and Scalability
- Multiprocessor interconnection networks can be classified based on a number of criteria:
- Mode of operation - synchronous vs. asynchronous.
- In synchronous mode a single clock is used by all components in the system (the whole system operates in a lock-step manner).
- In asynchronous mode, hand-shaking signals are used to coordinate the operations.
- Synchronous systems are slower than asynchronous systems, but they are race- and hazard-free.
28 Multiprocessing and Scalability
- Control strategy - centralized vs. decentralized.
- In centralized control systems, a single central control unit is used to oversee and control the operations of the components.
- In decentralized control, the control functions are distributed among different components.
- The reliability of centralized control is its weak point (single point of failure).
- A crossbar switch is centrally controlled, while a multistage interconnection network is controlled in a decentralized fashion.
29 Multiprocessing and Scalability
- Switching techniques - circuit switching vs. packet switching.
- In circuit switching, a complete path has to be established before the start of communication. This path remains intact during the communication.
- In packet switching, communication is divided into packets. Packets from one component to another are sent in a store-and-forward manner.
- Packet switching uses the network resources more efficiently.
30 Multiprocessing and Scalability
- Topology - static vs. dynamic.
- Topology defines how the resources in an interconnection network communicate with each other. In a ring topology, for example, each processor (say p_k) is physically connected to its neighbors (p_{k-1}, p_{k+1}).
- In a static topology, direct fixed links are established among nodes to form a fixed network.
- In a dynamic topology, connections are established as needed.
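The ring example above can be sketched in a few lines; the function name is mine, and the wrap-around comes from modular arithmetic:

```python
def ring_neighbors(k, n):
    """Physical neighbors of processor p_k in an n-processor ring topology."""
    return ((k - 1) % n, (k + 1) % n)

# In an 8-processor ring, p_0's neighbors wrap around to p_7 and p_1.
print(ring_neighbors(0, 8))  # (7, 1)
print(ring_neighbors(5, 8))  # (4, 6)
```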
31 Multiprocessing and Scalability
- A shared-memory architecture allows all memory locations to be accessed by every processor. This helps to ease the burden of programming a parallel system, since:
- There is no need to distribute data or explicitly communicate data between processors in the system.
- Shared memory is the model adopted on small-scale multiprocessors, so it is easier to map programs that have been parallelized for small-scale systems to a larger shared-memory system.
32 Multiprocessing and Scalability
- A message passing organization typically combines the local memory and processor as a node of the interconnection network - there is no global address space, so data must be moved from one processor to another via messages.
- In message passing multiprocessor systems (distributed memory), processors communicate with each other by sending explicit messages.
33 Multiprocessing and Scalability
- Message passing system (programmer's view)
34 Multiprocessing and Scalability
- Message passing system (programmer's view)
- In a message passing machine, the user must explicitly communicate all information passed between processors. Unless the communication patterns are very regular, the management of this data movement is very difficult.
- Note that multi-computers are equivalent to this organization. It is also referred to as a No Remote Memory Access (NORMA) machine.
35 Multiprocessing and Scalability
- Message passing multiprocessor
36 Multiprocessing and Scalability
- Inter-processor communication is critical in comparing the performance of message passing and shared-memory machines:
- Communication in a message passing environment is direct and is initiated by the producer of the data.
- In a shared-memory system, communication is indirect, and the producer usually moves the data no further than memory. The consumer then has to fetch the data from memory, which decreases performance.
37 Multiprocessing and Scalability
- In a shared-memory system, communication requires no intervention on the part of a run-time library or operating system. In a message passing environment, access to the network port is typically managed by the system software. This overhead at the sending and receiving processors makes the start-up costs of communication much higher (usually tens to hundreds of microseconds).
38 Multiprocessing and Scalability
- As a result of the high communication cost in a message passing organization, either performance is compromised or a coarser grain of parallelism, and hence more limitation on the exploitation of available parallelism, must be adopted.
- Note that the start-up overhead of communication in a shared-memory organization is typically on the order of microseconds.
39 Multiprocessing and Scalability
- Communication in shared-memory systems is usually demand-driven by the consuming processor.
- The problem here is overlapping communication with computation. This is not a problem for small data items. However, it can degrade performance if there is frequent communication or a large amount of data is exchanged.
40 Multiprocessing and Scalability
- Consider the following case in a message passing environment:
- A producer process wants to send 10 words to a consumer process. In a typical message passing environment with a blocking send and receive protocol, this would be coded as:

Producer process: Send(proc_i, process_j, @sbuffer, num-bytes)
Consumer process: Receive(@rbuffer, max-bytes)
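The blocking semantics above can be mimicked in a few lines of Python. This is a sketch, not the slides' actual API: send and receive here are hypothetical wrappers around a bounded queue standing in for the network link, and threads stand in for the two processes:

```python
import queue
import threading

channel = queue.Queue(maxsize=1)       # bounded queue plays the role of the network

def send(ch, sbuffer, num_words):
    ch.put(list(sbuffer[:num_words]))  # blocks while the channel is full

def receive(ch, max_words):
    return ch.get()[:max_words]        # blocks until a message arrives

source = list(range(10))               # the producer wants to send 10 words
producer = threading.Thread(target=send, args=(channel, source, 10))
producer.start()
rbuffer = receive(channel, 10)         # the consumer blocks here until the send completes
producer.join()
print(rbuffer)
```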
41 Multiprocessing and Scalability
- This code usually breaks down into the following steps:
- The operating system checks protections and then programs the network DMA controller to move the message from the sender's buffer to the network interface.
- A DMA channel on the consumer processor has been programmed to move all messages to a common system buffer. When the message arrives, it is moved from the network interface to the system buffer and an interrupt is posted to the processor.
42 Multiprocessing and Scalability
- The receiving processor services the interrupt and determines which process the message is intended for. It then copies the message to the specified receive buffer and reschedules the program on the processor's ready queue.
- The process is dispatched on the processor and reads the message from the user's receive buffer.
43 Multiprocessing and Scalability
- On a shared-memory system there is no operating system involved, and the processes can transfer data using a shared data area. Assuming the data is protected by a flag indicating its availability and its size, we have:

Producer process:
    For i = 0 to num-bytes: buffer[i] = source[i]
    Flag = num-bytes

Consumer process:
    While (Flag == 0) {}
    For i = 0 to Flag: dest[i] = buffer[i]
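The same transfer, written directly from the pseudocode above as Python threads. Note that the producer raises the flag only after the whole buffer is filled, which is what makes the consumer's copy safe; real hardware would additionally need memory barriers, a detail CPython's interpreter hides here:

```python
import threading
import time

NUM_WORDS = 10
buffer = [0] * NUM_WORDS     # shared data area
flag = 0                     # 0 = not ready; otherwise the number of words written

def producer(source):
    global flag
    for i in range(NUM_WORDS):
        buffer[i] = source[i]
    flag = NUM_WORDS         # publish the size last, as on the slide

def consumer(dest):
    while flag == 0:         # demand-driven: the consumer spins until data is ready
        time.sleep(0)
    for i in range(flag):
        dest[i] = buffer[i]

source = list(range(100, 110))
dest = [0] * NUM_WORDS
c = threading.Thread(target=consumer, args=(dest,))
p = threading.Thread(target=producer, args=(source,))
c.start(); p.start()
p.join(); c.join()
print(dest)
```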
44 Multiprocessing and Scalability
- For the message passing environment, the dominant factors are the operating system overhead, programming the DMA, and the internal processing.
- For the shared-memory environment, the overhead falls primarily on the consumer reading the data, since it is then that the data moves from the global memory to the consumer processor.
45 Multiprocessing and Scalability
- Thus, for a short message the shared-memory system is much more efficient. For long messages, the message passing environment has similar or possibly higher performance.
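This short-versus-long trade-off can be illustrated with a toy cost model. All four parameters below are made-up values chosen only to reflect the slides' qualitative claim (large fixed start-up cost for message passing, larger per-word fetch cost for shared memory), not measurements of any real machine:

```python
# Illustrative cost model in microseconds (all parameters assumed).
MP_STARTUP_US = 50.0    # OS/DMA start-up overhead per message
MP_PER_WORD_US = 0.01   # network transfer cost per word
SM_STARTUP_US = 1.0     # start-up cost in shared memory
SM_PER_WORD_US = 0.05   # consumer's per-word fetch cost from global memory

def t_message_passing(n):
    return MP_STARTUP_US + n * MP_PER_WORD_US

def t_shared_memory(n):
    return SM_STARTUP_US + n * SM_PER_WORD_US

for n in (10, 100, 1000, 10000):
    print(n, t_message_passing(n), t_shared_memory(n))
# Shared memory wins for short messages; message passing catches up for long ones.
```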
46 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Message passing systems minimize the hardware overhead.
- The single-thread performance of a message passing system is as high as that of a uni-processor system, since memory can be tightly coupled to a single processor.
- Shared-memory systems are easier to program.
47 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Overall, the shared-memory paradigm is preferred, since it is easier to use and more flexible.
- A shared-memory organization can emulate a message passing environment, while the reverse is not possible without a significant performance degradation.
48 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Shared-memory systems offer lower performance than message passing systems if communication is frequent and not overlapped with computation.
- In a shared-memory environment, the interconnection network between processors and memory usually requires higher bandwidth and more sophistication than the network in a message passing environment. This can increase overhead costs to the level that the system does not scale well.
49 Multiprocessing and Scalability
- Message passing vs. shared-memory systems
- Solving the latency problem through memory caching and cache coherence is the key to a shared-memory multiprocessor.
50 Multiprocessing and Scalability
- In the design of a shared memory organization, a number of basic issues have to be taken into account:
- Access control determines which process accesses are possible to which resources.
- Access control protocols check every access request issued by the processors to the shared memory against the contents of the access control table.
51 Multiprocessing and Scalability
- Synchronization rules and constraints limit the times at which sharing processors may access shared resources.
- Appropriate synchronization ensures that information flows properly and ensures system functionality.
52 Multiprocessing and Scalability
- Protection prevents processes from making
arbitrary access to resources belonging to other
processes.
53 Multiprocessing and Scalability
- In a message passing organization, processors communicate with each other via send/receive operations. The processing units of a message passing system are connected in a variety of ways, ranging from architecture-specific interconnection structures to geographically dispersed networks.
- The message passing approach, in general, is scalable.
54 Multiprocessing and Scalability
- Two design factors must be considered in designing interconnection networks for message passing systems:
- Bandwidth, which is the number of bits that can be transmitted per unit of time, and
- Latency, which is defined as the time to complete a message transfer.
55 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- In most shared-memory organizations, processors and memory are separated by an interconnection network.
- For small-scale shared-memory systems, the interconnection network is a simple bus.
- For large-scale shared-memory systems, similar to the message passing systems, we use multistage networks.
56 Multiprocessing and Scalability
- Shared memory multiprocessors - a taxonomy
- Based on the memory organization, shared memory multiprocessor systems are classified as:
- Uniform Memory Access (UMA) systems,
- Non-Uniform Memory Access (NUMA) systems, and
- Cache-Only Memory Architecture (COMA) systems.
- Note that most large-scale shared memory machines utilize a NUMA structure.
57 Multiprocessing and Scalability
- Uniform Memory Access (UMA) architecture
- The physical memory is uniformly shared by all the processors.
- Each processor may use a private cache.
- UMA is equivalent to the tightly coupled multiprocessor class.
- When all processors have equal access to peripheral devices, the system is called a symmetric multiprocessor (SMP).
58 Multiprocessing and Scalability
- In UMA, the shared memory is accessible by all processors via an interconnection network in the same way a single processor accesses its memory. As a result, all processors have equal access time to any memory address.
- The interconnection network in UMA can be a single bus, multiple buses, a crossbar switch, or a multiport memory.
59 Multiprocessing and Scalability
- Uniform Memory Access architecture
60 Multiprocessing and Scalability
- Symmetric Multiprocessor
- Each processor has its own cache; all the processors and memory modules are attached to the same interconnect - usually a shared bus.
- The ability to access all shared data efficiently and uniformly using ordinary load and store operations, and
- the automatic movement and replication of data in the local caches
- make this organization attractive for concurrent processing.
61 Multiprocessing and Scalability
62 Multiprocessing and Scalability
- Non-Uniform Memory Access (NUMA)
- In NUMA, each processor has part of the shared memory attached. However, the access time to a shared memory address depends on the distance between the processor and the designated memory module - the access time varies with the location of the memory word.
- In NUMA, a number of architectures can be used to interconnect processors to memory modules.
- NUMA is also referred to as the Distributed Shared-Memory (DSM) multiprocessor architecture.
63 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- UMA is easier to program than NUMA.
- The NUMA configuration is not symmetric; as a result, locality should be exploited to improve performance.
- Most small-scale shared-memory systems are based on the UMA philosophy, while large-scale shared-memory systems are based on NUMA.
64 Multiprocessing and Scalability
- Non-Uniform Memory Access (NUMA)
65 Multiprocessing and Scalability
- Shared Memory Multiprocessor - distributed memory
66 Multiprocessing and Scalability
- Distributed shared-memory architecture
- As noted, it is possible to build a shared-memory machine with a distributed-memory organization.
- Such a machine has the same structure as a message passing system, but instead of sending messages to other processors, every processor can directly address both its local memory and the remote memories of other processors.
67 Multiprocessing and Scalability
- Cache-Only Memory Architecture (COMA)
- In COMA, similar to NUMA, each processor has part of the shared memory. However, the shared memory consists of cache memory - data must be migrated to the processor requesting it.
- In short, COMA is a special class of NUMA in which the distributed global main memory consists of caches.
68 Multiprocessing and Scalability
- Cache-Only Memory Architecture (COMA)
69 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- Since all communication and local computations generate memory accesses in a shared address space, from a system architect's perspective, the key high-level design issue lies in the organization of the memory hierarchy.
- In general, there are the following choices:
70 Multiprocessing and Scalability
- Shared Memory Multiprocessor - shared cache
71 Multiprocessing and Scalability
- Shared Memory Multiprocessor - shared cache
- This platform is more suitable for small-scale configurations of 2 to 8 processors.
- It is the more common approach for multiprocessors on a chip.
72 Multiprocessing and Scalability
- Shared Memory Multiprocessor - bus-based shared memory
73 Multiprocessing and Scalability
- Shared Memory Multiprocessor - bus-based shared memory
- This platform is suitable for small to medium configurations of 20 to 30 processors.
- Due to its simplicity, it is very popular.
- The bandwidth of the shared bus is the major bottleneck of the system.
- This configuration is also known as the cache-coherent shared memory configuration.
74 Multiprocessing and Scalability
- Shared Memory Multiprocessor - dance hall
75 Multiprocessing and Scalability
- Shared Memory Multiprocessor - dance hall
- This platform is symmetric and scalable.
- All memory modules are far away from the processors.
- In large configurations, several hops are needed to access memory.
76 Multiprocessing and Scalability
- Shared Memory Multiprocessor
- In all shared memory multiprocessor organizations, caching is very attractive for reducing the memory latency and the required bandwidth.
- In all design cases except the shared cache, each processor has at least one level of private cache. This raises the issue of cache coherence as an attractive research direction.
77 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- The architecture supports the caching of both shared and private data.
- When a private item is requested, it is migrated to the cache, reducing the average access time as well as the memory bandwidth required.
- When shared data is cached, the shared value may be replicated in multiple caches. This reduces the access latency, the memory bandwidth required, and contention for the shared data item.
78 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- Caching of shared data, however, introduces a new problem - cache coherence.
- The cache coherence problem means that two different processors can have two different values for the same location.
79 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- Assume write-through caches in a 2-processor configuration:
- [Table: Time / Event / contents of CPU A's cache / CPU B's cache / memory location X]
80 Multiprocessing and Scalability
- Shared Memory Multiprocessor - cache coherence
- Copies of the same memory block are resident in one or more caches, and processors make modifications to the very same cache block.
- Unless special action is taken, other processors will continue to access the old, stale copy of the block in their caches.
81 Multiprocessing and Scalability
- Cache coherence problem - an example
- Consider the following configuration and the sequence of events:
- 1. p1 reads u from main memory
- 2. p2 reads u from main memory
- 3. p3 writes u, changing it from 5 to 7, using a write-through policy
- 4. p1 reads u again
82 Multiprocessing and Scalability
- Cache coherence problem - an example
83 Multiprocessing and Scalability
- Cache coherence problem - an example
- Unless we take special action, if p1 reads u for the second time, it might get an old value from its cache.
- The value read by p2 depends on the write policy.
84 Multiprocessing and Scalability
- Cache coherence problem - an example
- Assume that in the previous example p3 instead uses a write-back policy. Does it help the situation or not?
85 Multiprocessing and Scalability
- Cache coherence problem - an example
86 Multiprocessing and Scalability
- Cache coherence problem - an example
- Now the situation is even worse: if p2 initiates a read on u, it will get an old value of u from main memory.
- In a shared memory multiprocessor system, reads and writes of shared variables by different processors are frequent events. As a result, cache coherence needs to be addressed as a basic hardware design issue.
87 Multiprocessing and Scalability
- Cache coherence problem
- Informally, memory is coherent if any read of a data item returns the most recently written value of that data item.
- This simple definition covers two different aspects of memory behavior:
- Coherence - what values can be returned by a read?
- Consistency - when will a written value be returned by a read?
88 Multiprocessing and Scalability
- Cache coherence problem
- A memory system is coherent if it satisfies the following conditions.
- Preserving program order - a read by a processor P to a location X that follows a write by P to X, with no writes to X by another processor between the write and the read by P, always returns the value written by P.
89 Multiprocessing and Scalability
- Cache coherence problem
- Coherent view - a read by a processor to location X that follows a write by another processor to X returns the written value if the read and the write are sufficiently separated and no other write to X occurs between the two accesses.
90 Multiprocessing and Scalability
- Cache coherence problem
- Write serialization - writes to the same location are serialized: two writes to the same location by two processors are seen in the same order by all processors.
91 Multiprocessing and Scalability
- Cache coherence definitions
- A memory operation refers to a read, write, or read-modify-write on an individual data element.
- A memory operation is atomic.
- Write propagation means that writes become visible to all processors.
- Write serialization means that all writes to a location are seen in the same order by all processors.
92 Multiprocessing and Scalability
- Cache coherence
- For a multiprocessor system, memory is coherent if the results of any execution of a program are such that, for each location, it is possible to construct a hypothetical serial order of all operations to the location that is consistent with the results of the execution and in which:
- operations issued by any particular process occur in the order in which they were issued to the memory system by that process, and
- the value returned by each read operation is the value written by the last write to that location in the serial order.
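The two conditions above can be turned into a small checker. This is an illustrative sketch (function and variable names are mine): given a proposed serial order of operations on one location, it verifies per-process program order and that every read returns the last written value:

```python
def is_coherent_serialization(serial, program_orders):
    """Check a proposed serial order of operations on one location.
    Each operation is a tuple (proc, kind, value) with kind 'R' or 'W';
    program_orders maps each process to its operations in issue order."""
    # Condition 1: the serial order preserves each process's program order.
    for proc, ops in program_orders.items():
        if [op for op in serial if op[0] == proc] != ops:
            return False
    # Condition 2: each read returns the value of the last preceding write.
    last_written = None
    for proc, kind, value in serial:
        if kind == 'W':
            last_written = value
        elif value != last_written:
            return False
    return True

p1 = [('p1', 'W', 5), ('p1', 'R', 7)]
p2 = [('p2', 'W', 7)]
ok  = [p1[0], p2[0], p1[1]]   # W5, W7, R7: the read sees the last write
bad = [p1[0], p1[1], p2[0]]   # the read returns 7 before 7 is ever written
print(is_coherent_serialization(ok,  {'p1': p1, 'p2': p2}))   # True
print(is_coherent_serialization(bad, {'p1': p1, 'p2': p2}))   # False
```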
93 Multiprocessing and Scalability
- Cache coherence protocols
- The protocols that maintain coherence for multiple processors are called cache coherence protocols.
- The key issue in implementing a cache coherence protocol is tracking the state of any sharing of a data block.
94 Multiprocessing and Scalability
- Cache coherence protocols
- Directory based - the sharing status of a block of physical memory is kept in just one location, the directory.
- Snooping based - every cache that has a copy of the data is on the shared memory bus, and all caches snoop on the bus to determine whether or not they have a copy of a block that is requested on the bus.
95 Multiprocessing and Scalability
- Cache coherence protocols
- There are two ways to maintain the coherence requirement:
- Write invalidation
- Write propagation (write update or write broadcast)
96 Multiprocessing and Scalability
- Write invalidation - ensures that a processor has exclusive access to a data item before it writes that item.
- [Table: Processor activity / Bus activity / CPU A's cache / CPU B's cache / memory location X (initially 0)]
97 Multiprocessing and Scalability
- Write propagation - all cached copies of a data item are updated when the data item is written.
- [Table: Processor activity / Bus activity / CPU A's cache / CPU B's cache / memory location X (initially 0)]
98 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Within the class of bus-based shared memory multiprocessors, this is a simple and elegant solution. Recall that:
- a bus is a single set of wires connecting resources,
- every resource attached to the bus can observe every bus transaction, and
- all transactions that appear on the bus are visible to the cache controllers in the same order.
99 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- When a processor issues a request to its cache, its cache controller examines the state of the cache and takes suitable actions, including a request to access memory.
- All cache controllers snoop on the bus and monitor transactions.
100 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- A snooping cache controller may take action if a bus transaction is relevant to it.
- In the case of a write-through policy, if a snooping cache has a copy of the data, it either:
- updates its copy - update-based protocols, or
- invalidates its copy - invalidation-based protocols.
- In either case, when a processor requests a block, it will see the most recent value.
101 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Refer to our previous example:
102 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- When p3 writes into u, its cache controller generates a bus transaction to update memory.
- This bus transaction is relevant to p1, and hence p1's cache controller invalidates its own copy of the block containing u.
- The main memory updates u to 7.
- Subsequent reads of u by processors p1 and p2 generate misses, and hence they will get the correct value of 7 from the main memory.
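The walk-through above can be sketched as a toy write-through, invalidation-based snooping model. This is a simplification, not a hardware-accurate protocol: it tracks a single variable u, every write goes straight to memory, and the "bus" is modeled by invalidating all other caches on a write:

```python
# Toy model of the p1/p2/p3 example with write-through and invalidation.
memory = {'u': 5}
caches = {p: {} for p in ('p1', 'p2', 'p3')}

def read(p, addr):
    if addr not in caches[p]:            # miss: fetch the value from main memory
        caches[p][addr] = memory[addr]
    return caches[p][addr]

def write(p, addr, value):
    caches[p][addr] = value
    memory[addr] = value                 # write-through: memory is updated immediately
    for other, cache in caches.items():  # snoop: invalidate every other cached copy
        if other != p:
            cache.pop(addr, None)

read('p1', 'u')        # step 1: p1 reads u (5)
read('p2', 'u')        # step 2: p2 reads u (5)
write('p3', 'u', 7)    # step 3: p3 writes 7; p1's and p2's copies are invalidated
print(read('p1', 'u'), read('p2', 'u'))  # 7 7 -- both miss and refetch from memory
```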
103 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Simplicity is the main advantage of snooping protocols: there is no need to explicitly use coherence operations in the program.
- By extending the cache controller and exploiting the properties of the bus, the reads and writes that are natural to the program keep the caches coherent.
- However, snooping systems with a write-through policy are not very efficient, especially given the natural bandwidth limitation of a bus.
104 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- Let us look at the snooping protocol with a write-back policy. The processors are able to write to different blocks in their local caches concurrently without any bus transactions.
- Recall that in a uni-processor system, updates to main memory for blocks resident in the cache are done at replacement time - a cache block may be dirty for a period.
- In a multiprocessor system, this modified state is used by the protocols to indicate exclusive ownership of the block by a cache.
105 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- A cache is said to be the owner of a block if it must supply the data upon a request for that block.
- A cache has an exclusive copy of a block if it is the only cache with a valid copy of the block - exclusivity implies that the cache may modify the block without notifying anyone else.
106 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- As a result, if a cache does not have exclusivity, then it cannot write into a block before first putting a transaction on the bus to communicate with the others. Consequently, a processor might have the block in its cache in a valid state, but since a transaction must be generated, the write is treated as a write miss.
107 Multiprocessing and Scalability
- Cache coherence solutions - bus snooping
- If a cache has a modified block, then it is the owner and thus has exclusivity.
- On a write miss in an invalidation protocol, a special form of transaction, a read-exclusive, is generated to acquire exclusive ownership.
- As a result, concurrent writes to the same block are not allowed - the read-exclusive bus transactions are serialized.
108 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors (SSMP) provide the shared-memory programming model while removing the bottleneck of today's small-scale systems.
109 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- Scalability must be physically possible and technically feasible.
- Adding processors increases the potential computational capability of the system, but to realize this potential, all aspects of the system must be scaled up - especially the memory bandwidth.
110 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- A natural solution is to distribute memory among processors so that each processor has direct access to its local memory.
- However, the interconnection network connecting the processors must provide scalable bandwidth at reasonable cost and latency.
111 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- Scalable systems are less closely coupled than bus-based shared memory multiprocessors, so the interactions among processors and between processors and memories are different.
- A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system.
112 Multiprocessing and Scalability
- Scalable shared-memory multiprocessors
- Scalability is studied based on its effect on the following metrics:
- Throughput
- Latency per operation
- Cost
- Implementation