1
Multiprocessing and Scalability
  • A.R. Hurson
  • Computer Science Department
  • Missouri Science & Technology
  • hurson@mst.edu

2
Multiprocessing and Scalability
  • Large-scale multiprocessor systems have long held
    the promise of substantially higher performance
    than traditional uni-processor systems.
  • However, due to a number of difficult problems,
    the potential of these machines has been hard to
    realize:
3
Multiprocessing and Scalability
  • Advances in technology - the rapid rate of
    increase in uni-processor performance,
  • Complexity of multiprocessor system design -
    this drastically affected the cost and the
    implementation cycle, and
  • Programmability of multiprocessor systems - the
    design complexity of parallel algorithms and
    parallel programs.

4
Multiprocessing and Scalability
  • Programming a parallel machine is more difficult
    than programming a sequential one. In addition,
    it takes much more effort to port an existing
    sequential program to a parallel machine than to
    a newly developed sequential machine.
  • The lack of good parallel programming
    environments and standard parallel languages has
    further aggravated this issue.

5
Multiprocessing and Scalability
  • As a result, the absolute performance of many
    early concurrent machines was not significantly
    better than that of available or soon-to-be
    available uni-processors.

6
Multiprocessing and Scalability
  • Recently, there has been increased interest in
    large-scale or massively parallel processing
    systems. This interest stems from many factors,
    including:
  • Advances in integrated technology.
  • Very high aggregate performance of these
    machines.
  • Cost/performance ratio.
  • Widespread use of small-scale multiprocessors.
  • Advent of high performance microprocessors.

7
Multiprocessing and Scalability
  • Integrated technology
  • Advances in integrated technology are slowing
    down.
  • Chip density - integrated technology now allows
    all the performance features found in a complex
    processor to be implemented on a single chip,
    and adding more functionality has diminishing
    returns.

8
Multiprocessing and Scalability
  • Integrated technology
  • Studies of more advanced superscalar processors
    indicate that they may not offer a performance
    improvement of more than a factor of 2 to 4 for
    general applications.

9
Multiprocessing and Scalability
  • Aggregate performance of machines
  • Cray X-MP (1983) had a peak performance of 235
    MFLOPS.
  • A node in an Intel iPSC/2 (1987) with attached
    vector units had a peak performance of 10 MFLOPS.
    As a result, a 256-processor configuration of
    Intel iPSC/2 would offer less than 11 times the
    peak performance of a single Cray X-MP.
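  • (Check: 256 × 10 MFLOPS = 2,560 MFLOPS, and
    2,560 / 235 ≈ 10.9 - less than 11 times the
    X-MP's peak.)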

10
Multiprocessing and Scalability
  • Aggregate performance of machines
  • Today a 256-processor system might offer on the
    order of 100 times the peak performance of the
    fastest uni-processor systems.
  • A Cray C90 processor has a peak performance of
    952 MFLOPS, while a 256-processor configuration
    using the MIPS R8000 would have a peak
    performance of 76,800 MFLOPS.
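  • (Check: 76,800 / 256 = 300 MFLOPS per R8000
    processor, and 76,800 / 952 ≈ 81 - on the order
    of 100 times the C90's single-processor peak.)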

11
Multiprocessing and Scalability
  • Economy
  • A Cray C90 or Convex C4, which exploit advanced
    IC and packaging technology, costs about $1
    million.
  • A high-end microprocessor costs somewhere
    between $2,000 and $5,000.
  • By using a small number of microprocessors,
    equal performance can be achieved at a fraction
    of the cost of the more advanced supercomputers.

12
Multiprocessing and Scalability
  • Small-scale multiprocessors
  • Due to advances in technology, multiprocessing
    has come to desktops and high-end personal
    computers. This has led to improvements in
    parallel processing hardware and software.

13
Multiprocessing and Scalability
  • In spite of these developments, many challenges
    need to be overcome in order to exploit the
    potential of large-scale multiprocessors:
  • Scalability
  • Ease of programming

14
Multiprocessing and Scalability
  • Scalability: maintaining the cost-performance of
    a uni-processor while linearly increasing
    overall performance as processors are added.
  • Ease of programming: programming a parallel
    system is inherently more difficult than
    programming a uni-processor, where there is a
    single thread of control. Providing a single
    shared address space to the programmer is one
    way to alleviate this problem - the
    shared-memory architecture.

15
Multiprocessing and Scalability
  • Shared-memory eases the burden of programming a
    parallel machine since there is no need to
    distribute data or explicitly communicate data
    between the processors in software.
  • Shared-memory is also the model used on
    small-scale multiprocessors, so it is easier to
    port programs that have been developed for a
    small system to a larger shared-memory system.

16
Multiprocessing and Scalability
  • Flynn's Taxonomy
  • Flynn classified the concurrent space according
    to the multiplicity of instruction and data
    streams:

17
Multiprocessing and Scalability
  • Flynn's Taxonomy
  • The Cartesian product of these two sets defines
    four different classes:
  • SISD
  • SIMD
  • MISD
  • MIMD

18
Multiprocessing and Scalability
  • Flynn's Taxonomy
  • The MIMD class can be further divided based on:
  • Memory structure: global or distributed
  • Communication/synchronization mechanism: shared
    variables or message passing.

19
Multiprocessing and Scalability
  • Flynn's Taxonomy
  • As a result, we have four additional classes of
    computers (GM/DM for global/distributed memory,
    SV/MP for shared variables/message passing):
  • GMSV: shared-memory multiprocessors
  • GMMP: ?
  • DMSV: distributed shared memory
  • DMMP: distributed memory (multi-computers)

20
Multiprocessing and Scalability
  • A taxonomy of MIMD organization
  • The memory can be
  • Centralized
  • Distributed

21
Multiprocessing and Scalability
  • A taxonomy of MIMD organization
  • Inter-processor communication can be either
    through explicit messages or through the common
    address space. This brings out two classes:
  • Message passing systems (also called distributed
    memory systems; it is natural to assume that the
    memory in a message passing system is
    distributed) → Intel iPSC, Paragon, nCUBE 2, IBM
    SP1 and SP2
  • Shared-memory systems → IBM RP3, Cray X-MP, Cray
    Y-MP, Cray C90, Sequent Symmetry

22
Multiprocessing and Scalability
  • Multiple instruction streams, multiple data
    streams (MIMD) organizations are made of
    multiple processors and multiple memory modules
    connected together via some interconnection
    network.
  • In this paradigm, based on the memory
    organization and the way processors communicate
    with each other, one can distinguish two broad
    categories:
  • Shared memory organization
  • Message passing organization.

23
Multiprocessing and Scalability
  • A shared memory organization typically
    accomplishes inter-processor coordination
    through a global memory shared by all processors
    - a global address space shared by all
    processors.
  • In shared-memory multiprocessor systems, the
    processors are more tightly coupled. Memory is
    accessible to all processors, and communication
    among processors is through shared variables or
    messages deposited in shared memory buffers.

24
Multiprocessing and Scalability
  • Shared-memory System (programmer's view)

25
Multiprocessing and Scalability
  • Shared-memory System (programmer's view)
  • In a shared-memory machine, processors can
    distinguish communication destination, type, and
    value through shared-memory addresses.
  • There is no requirement for the programmer to
    manage the movement of data.

26
Multiprocessing and Scalability
  • Shared-Memory Multiprocessor

27
Multiprocessing and Scalability
  • Multiprocessor interconnection networks can be
    classified based on a number of criteria:
  • Mode of operation - synchronous vs. asynchronous
  • In synchronous mode, a single clock is used by
    all components in the system (the whole system
    operates in a lock-step manner).
  • In asynchronous mode, hand-shaking signals are
    used to coordinate the operations.
  • Synchronous systems are slower than asynchronous
    systems, but they are race and hazard free.

28
Multiprocessing and Scalability
  • Control strategy - centralized vs. decentralized
  • In centralized control systems, a single central
    control unit is used to oversee and control the
    operations of the components.
  • In decentralized control, the control functions
    are distributed among different components.
  • The reliability of centralized control is its
    weak point (a single point of failure).
  • A crossbar switch is an example of centralized
    control, while a multistage interconnection
    network is an example of decentralized control.

29
Multiprocessing and Scalability
  • Switching techniques - circuit switching vs.
    packet switching
  • In circuit switching a complete path has to be
    established before the start of communication.
    This path remains intact during the
    communication.
  • In packet switching, communication is divided
    into packets. Packets are sent from one
    component to another in a store-and-forward
    manner.
  • Packet switching uses the network resources more
    efficiently.
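  • A first-order comparison (a standard
    approximation, not from the original slides):
    sending an m-bit message over h hops takes
    roughly t_setup + m/B with circuit switching,
    versus about h(m/B) with whole-message
    store-and-forward; pipelining p-bit packets
    reduces the latter toward m/B + (h-1)(p/B),
    ignoring queuing delays.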

30
Multiprocessing and Scalability
  • Topology - Static vs. dynamic
  • Topology defines how the resources in an
    interconnection network communicate with each
    other. In a ring topology, for example, each
    processor pk is physically connected to its
    neighbors (pk-1, pk+1).
  • In a static topology, direct fixed links are
    established among the nodes to form a fixed
    network.
  • In a dynamic topology, connections are
    established as needed.

31
Multiprocessing and Scalability
  • A shared-memory architecture allows all memory
    locations to be accessed by every processor.
    This helps to ease the burden of programming a
    parallel system, since:
  • There is no need to distribute data or explicitly
    communicate data between processors in the
    system.
  • Shared-memory is the model adopted on small-scale
    multiprocessors, so it is easier to map programs
    that have been parallelized for small-scale
    systems to a larger shared-memory system.

32
Multiprocessing and Scalability
  • A message passing organization typically
    combines the local memory and processor as a
    node of the interconnection network - there is
    no global address space, so data must be moved
    from one processor to another via messages.
  • In message passing multiprocessor systems
    (distributed memory), processors communicate with
    each other by sending explicit messages.

33
Multiprocessing and Scalability
  • Message passing System (programmer's view)

34
Multiprocessing and Scalability
  • Message passing System (programmer's view)
  • In a message passing machine, the user must
    explicitly communicate all information passed
    between processors. Unless the communication
    patterns are very regular, managing this data
    movement is very difficult.
  • Note that multi-computers are equivalent to this
    organization. It is also referred to as the No
    Remote Memory Access machine (NORMA).

35
Multiprocessing and Scalability
  • Message passing multiprocessor

36
Multiprocessing and Scalability
  • Inter-processor communication is critical in
    comparing the performance of message passing and
    shared-memory machines:
  • Communication in a message passing environment
    is direct and is initiated by the producer of
    the data.
  • In a shared-memory system, communication is
    indirect, and the producer usually moves data no
    further than memory. The consumer then has to
    fetch the data from memory → a decrease in
    performance.

37
Multiprocessing and Scalability
  • In a shared-memory system, communication
    requires no intervention on the part of a
    run-time library or operating system. In a
    message passing environment, access to the
    network port is typically managed by the system
    software. This overhead at the sending and
    receiving processors makes the start-up cost of
    communication much higher (usually 10s to 100s
    of microseconds).

38
Multiprocessing and Scalability
  • As a result of the high communication cost in a
    message passing organization, either performance
    is compromised or a coarser grain of parallelism
    (and hence more limitation on the exploitation
    of available parallelism) must be adopted.
  • Note that the start-up overhead of communication
    in a shared-memory organization is typically on
    the order of microseconds.

39
Multiprocessing and Scalability
  • Communication in shared-memory systems is usually
    demand-driven by the consuming processor.
  • The problem here is overlapping communication
    with computation. This is not a problem for
    small data items. However, it can degrade the
    performance if there is frequent communication or
    a large amount of data is exchanged.

40
Multiprocessing and Scalability
  • Consider the following case in a message passing
    environment:
  • A producer process wants to send 10 words to a
    consumer process. In a typical message passing
    environment with a blocking send and receive
    protocol, this problem would be coded as:

Producer process   Send(proc_i, process_j, @sbuffer, num-bytes)

Consumer process   Receive(@rbuffer, max-bytes)
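
  • For concreteness, the same exchange can be
    sketched with MPI's blocking primitives (an
    illustration, not part of the original slides;
    the ranks, tag, and 10-word payload are
    assumed):

/* Blocking producer/consumer exchange sketched with MPI.
   Rank 0 produces, rank 1 consumes; compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, sbuffer[10], rbuffer[10];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                  /* producer process */
        for (int i = 0; i < 10; i++)
            sbuffer[i] = i;
        /* Blocking send: returns once sbuffer may be reused. */
        MPI_Send(sbuffer, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {           /* consumer process */
        /* Blocking receive: returns once the message is in rbuffer. */
        MPI_Recv(rbuffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d..%d\n", rbuffer[0], rbuffer[9]);
    }

    MPI_Finalize();
    return 0;
}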
41
Multiprocessing and Scalability
  • This code is usually broken into the following
    steps:
  • The operating system checks protections and then
    programs the network DMA controller to move the
    message from the sender's buffer to the network
    interface.
  • A DMA channel on the consumer processor has been
    programmed to move all messages to a common
    system buffer. When the message arrives, it is
    moved from the network interface to the system
    buffer and an interrupt is posted to the
    processor.

42
Multiprocessing and Scalability
  • The receiving processor services the interrupt
    and determines which process the message is
    intended for. It then copies the message to the
    specified receive buffer and reschedules the
    program on the processor's ready queue.
  • The process is dispatched on the processor and
    reads the message from the user's receive
    buffer.

43
Multiprocessing and Scalability
  • On a shared-memory system, no operating system
    is involved and the processes can transfer data
    using a shared data area. Assuming the data is
    protected by a flag indicating its availability
    and its size, we have:

Producer process   for i = 0 to num-bytes
                       buffer[i] = source[i]
                   Flag = num-bytes

Consumer process   while (Flag == 0) ;   -- spin until data is ready
                   for i = 0 to Flag
                       dest[i] = buffer[i]
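
  • The same pattern, written as a runnable C
    program with two threads standing in for the two
    processors (an illustration, not from the
    original slides; the C11 atomic flag and the
    buffer size are assumptions):

/* Shared-memory transfer guarded by a flag, sketched with C11
   atomics; two POSIX threads stand in for the two processors. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_BYTES 10

static char buffer[NUM_BYTES];      /* shared data area        */
static atomic_int flag = 0;         /* 0 = empty, else count   */

static void *producer(void *arg)
{
    const char *source = "0123456789";
    (void)arg;
    for (int i = 0; i < NUM_BYTES; i++)
        buffer[i] = source[i];
    /* Release store: publishes the buffer contents with the count. */
    atomic_store_explicit(&flag, NUM_BYTES, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    char dest[NUM_BYTES];
    int n;
    (void)arg;
    /* Spin until the producer posts a nonzero byte count. */
    while ((n = atomic_load_explicit(&flag, memory_order_acquire)) == 0)
        ;
    for (int i = 0; i < n; i++)
        dest[i] = buffer[i];
    printf("consumed %d bytes: %.*s\n", n, n, dest);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}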
44
Multiprocessing and Scalability
  • For the message passing environment, the
    dominant factors are the operating system
    overhead, programming the DMA, and the internal
    processing.
  • For the shared-memory environment, the overhead
    falls primarily on the consumer reading the
    data, since it is then that the data moves from
    the global memory to the consumer processor.

45
Multiprocessing and Scalability
  • Thus, for a short message the shared-memory
    system is much more efficient. For long
    messages, the message-passing environment has
    similar or possibly higher performance.

46
Multiprocessing and Scalability
  • Message passing vs. Shared-memory systems
  • Message passing systems minimize the hardware
    overhead.
  • The single-thread performance of a message
    passing system is as high as that of a
    uni-processor system, since memory can be
    tightly coupled to a single processor.
  • Shared-memory systems are easier to program.

47
Multiprocessing and Scalability
  • Message passing vs. Shared-memory systems
  • Overall, the shared-memory paradigm is
    preferred, since it is easier to use and more
    flexible.
  • A shared-memory organization can emulate a
    message passing environment, while the reverse
    is not possible without a significant
    performance degradation.

48
Multiprocessing and Scalability
  • Message passing vs. Shared-memory systems
  • Shared-memory systems offer lower performance
    than message passing systems if communication is
    frequent and not overlapped with computation.
  • In a shared-memory environment, the
    interconnection network between the processors
    and memory usually requires higher bandwidth and
    more sophistication than the network in a
    message passing environment. This can increase
    overhead costs to the point that the system does
    not scale well.

49
Multiprocessing and Scalability
  • Message passing vs. Shared-memory systems
  • Solving the latency problem through memory
    caching and cache coherence is the key to a
    shared-memory multiprocessor.

50
Multiprocessing and Scalability
  • In the design of a shared memory organization, a
    number of basic issues have to be taken into
    account:
  • Access control determines which process accesses
    are possible to which resources.
  • Access control protocols check every access
    request issued by the processors to the shared
    memory against the contents of the access
    control table.

51
Multiprocessing and Scalability
  • Synchronization rules and constraints limit the
    times at which the sharing processors may access
    shared resources.
  • Appropriate synchronization ensures that
    information flows properly and preserves system
    functionality.

52
Multiprocessing and Scalability
  • Protection prevents processes from making
    arbitrary accesses to resources belonging to
    other processes.

53
Multiprocessing and Scalability
  • In a message passing organization, processors
    communicate with each other via send/receive
    operations. The processing units of a message
    passing system may be connected in a variety of
    ways, ranging from architecture-specific
    interconnection structures to geographically
    dispersed networks.
  • The message passing approach is, in general,
    scalable.

54
Multiprocessing and Scalability
  • Two design factors must be considered in
    designing interconnection networks for message
    passing systems:
  • Bandwidth, which is the number of bits that can
    be transmitted per unit of time, and
  • Latency, which is defined as the time to
    complete a message transfer.
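  • These two are often combined in a first-order
    cost model (a standard approximation, not from
    the original slides): the time to deliver an
    m-bit message is T(m) = ts + m/B, where ts is
    the start-up latency and B is the bandwidth.
    Start-up cost dominates for short messages,
    bandwidth for long ones.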

55
Multiprocessing and Scalability
  • Shared Memory Multiprocessor
  • In most shared-memory organizations, processors
    and memory are separated by an interconnection
    network.
  • For small-scale shared-memory systems, the
    interconnection network is a simple bus.
  • For large-scale shared-memory systems, similar to
    the message passing systems, we use multistage
    networks.

56
Multiprocessing and Scalability
  • Shared memory multiprocessors - A taxonomy
  • Based on the memory organization, shared memory
    multiprocessor systems are classified as:
  • Uniform Memory Access (UMA) systems,
  • Non-uniform Memory Access (NUMA) systems, and
  • Cache-Only Memory Architecture (COMA) systems.
  • Note that most large-scale shared memory
    machines utilize a NUMA structure.

57
Multiprocessing and Scalability
  • Uniform Memory Access Architecture (UMA)
  • The physical memory is uniformly shared by all
    the processors.
  • Each processor may use a private cache.
  • UMA is equivalent to the tightly coupled
    multiprocessor class.
  • When all processors have equal access to
    peripheral devices, the system is called a
    symmetric multiprocessor (SMP).

58
Multiprocessing and Scalability
  • In UMA, the shared memory is accessible by all
    processors via an interconnection network, in
    the same way a single processor accesses its
    memory. As a result, all processors have equal
    access time to any memory address.
  • The interconnection network in UMA can be a
    single bus, multiple buses, a crossbar switch,
    or a multiport memory.

59
Multiprocessing and Scalability
  • Uniform Memory Access Architecture

60
Multiprocessing and Scalability
  • Symmetric Multiprocessor
  • Each processor has its own cache; all the
    processors and memory modules are attached to
    the same interconnect, usually a shared bus.
  • The ability to access all shared data
    efficiently and uniformly using ordinary load
    and store operations, and
  • The automatic movement and replication of data
    in the local caches,
  • make this organization attractive for concurrent
    processing.

61
Multiprocessing and Scalability
  • Symmetric Multiprocessor

62
Multiprocessing and Scalability
  • Non-uniform Memory Access (NUMA)
  • In NUMA, each processor has part of the shared
    memory attached to it. However, the access time
    to a shared memory address depends on the
    distance between the processor and the
    designated memory module - the access time
    varies with the location of the memory word.
  • In NUMA, a number of architectures can be used
    to interconnect processors to memory modules.
  • NUMA is also referred to as Distributed
    Shared-Memory Multiprocessor Architecture (DSM).

63
Multiprocessing and Scalability
  • Shared Memory Multiprocessor
  • UMA is easier to program than NUMA.
  • The NUMA configuration is not symmetric; as a
    result, locality should be exploited to improve
    performance.
  • Most small-scale shared-memory systems are based
    on the UMA philosophy, while large-scale
    shared-memory systems are based on NUMA.

64
Multiprocessing and Scalability
  • Non-uniform Memory Access (NUMA)

65
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Distributed Memory

66
Multiprocessing and Scalability
  • Distributed Shared-memory architecture
  • As noted, it is possible to build a shared-memory
    machine with a distributed-memory organization.
  • Such a machine has the same structure as a
    message passing system, but instead of sending
    messages to other processors, every processor
    can directly address both its local memory and
    the remote memories of other processors.

67
Multiprocessing and Scalability
  • Cache-only memory (COMA)
  • In COMA, similar to NUMA, each processor holds
    part of the shared memory. However, the shared
    memory consists of cache memory - data must be
    migrated to the processor requesting it.
  • In short, COMA is a special class of NUMA in
    which the distributed global main memory
    consists of caches.

68
Multiprocessing and Scalability
  • Cache Only Memory Architecture (COMA)

69
Multiprocessing and Scalability
  • Shared Memory Multiprocessor
  • Since all communication and local computations
    generate memory accesses in a shared address
    space, from a system architecture's perspective
    the key high-level design issue lies in the
    organization of the memory hierarchy.
  • In general, there are the following choices:

70
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Shared cache

71
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Shared cache
  • This platform is more suitable for small-scale
    configurations (2 to 8 processors).
  • It is the more common approach for
    multiprocessors on a chip.

72
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Bus based Shared
    Memory

73
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Bus based Shared
    Memory
  • This platform is suitable for small to medium
    configurations (20 to 30 processors).
  • Due to its simplicity, it is very popular.
  • The bandwidth of the shared bus is the major
    bottleneck of the system.
  • This configuration is also known as the Cache
    Coherent Shared Memory configuration.

74
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Dance Hall

75
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Dance Hall
  • This platform is symmetric and scalable.
  • All memory modules are far away from the
    processors.
  • In large configurations, several hops are needed
    to access memory.

76
Multiprocessing and Scalability
  • Shared Memory Multiprocessor
  • In all shared memory multiprocessor
    organizations, the concept of caching is very
    attractive for reducing memory latency and the
    required bandwidth.
  • In all design cases except the shared cache,
    each processor has at least one level of private
    cache. This raises the issue of cache coherence,
    an active research direction.

77
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Cache Coherency
  • The architecture supports the caching of both
    shared and private data:
  • When a private item is requested, it is migrated
    to the cache, reducing the average access time
    as well as the required memory bandwidth.
  • When shared data is cached, the shared value may
    be replicated in multiple caches. This reduces
    the access latency, the required memory
    bandwidth, and contention at the shared data
    item.

78
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Cache Coherency
  • Caching of shared data, however, introduces a
    new problem - cache coherence.
  • The cache coherence problem means that two
    different processors can see two different
    values for the same location.

79
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Cache Coherency
  • Assume write-through caches in a 2-processor
    configuration:

Time   Event                   Cache CPU A   Cache CPU B   Memory for X
0                                                          1
1      CPU A reads X           1                           1
2      CPU B reads X           1             1             1
3      CPU A stores 0 into X   0             1             0

  • Here X is 1 initially. After CPU A's store at
    time 3 (write-through updates memory), CPU B's
    cache still holds the stale value 1.
80
Multiprocessing and Scalability
  • Shared Memory Multiprocessor - Cache Coherency
  • Copies of the same memory block are resident in
    one or more caches, and the processors make
    modifications to the very same cache block.
  • Unless special action is taken, other processors
    will continue to access the old, stale copy of
    the block in their caches.

81
Multiprocessing and Scalability
  • Cache Coherence Problem - An Example
  • Consider the following configuration and the
    sequence of events:
  • 1. p1 reads u from main memory
  • 2. p3 reads u from main memory
  • 3. p3 writes u, changing it from 5 to 7, using a
    write-through policy
  • 4. p1 reads u again

82
Multiprocessing and Scalability
  • Cache Coherence Problem - An Example

83
Multiprocessing and Scalability
  • Cache Coherence Problem - An Example
  • Unless we take special action, when p1 reads u
    for the second time it might get the old value
    from its own cache.
  • The value read by p2 depends on the write
    policy.

84
Multiprocessing and Scalability
  • Cache Coherence Problem - An Example
  • Assume that in the previous example p3 instead
    uses a write-back policy. Does it help the
    situation or not?

85
Multiprocessing and Scalability
  • Cache Coherence Problem - An Example

86
Multiprocessing and Scalability
  • Cache Coherence Problem - An Example
  • Now the situation is even worse: if p2 initiates
    a read on u, it will get an old value of u,
    since main memory was never updated.
  • In a shared memory multiprocessor system,
    reading and writing shared variables by
    different processors are frequent events. As a
    result, cache coherence needs to be addressed as
    a basic hardware design issue.

87
Multiprocessing and Scalability
  • Cache Coherence Problem
  • Informally memory is coherent if any read of a
    data item returns the most recently written value
    of that data item.
  • This simple definition contains two different
    aspects of memory behavior
  • Coherence what values can be returned by a read?
  • Consistency When a written value will be
    returned by a read?

88
Multiprocessing and Scalability
  • Cache Coherence Problem
  • A memory system is coherent if:
  • Preserving Program Order - a read by a processor
    P to a location X that follows a write by P to
    X, with no writes to X by another processor
    between the write and the read by P, always
    returns the value written by P.

89
Multiprocessing and Scalability
  • Cache Coherence Problem
  • Coherent View - a read by a processor to
    location X that follows a write by another
    processor to X returns the written value, if the
    read and write are sufficiently separated in
    time and no other write to X occurs between the
    two accesses.

90
Multiprocessing and Scalability
  • Cache Coherence Problem
  • Write Serialization - writes to the same
    location are serialized: two writes to the same
    location by any two processors are seen in the
    same order by all processors.

91
Multiprocessing and Scalability
  • Cache Coherence - Definitions
  • A memory operation is a read, write, or
    read-modify-write on an individual data element.
  • A memory operation is assumed to be atomic.
  • Write propagation means that writes become
    visible to all processors.
  • Write serialization means that all writes to a
    location are seen in the same order by all
    processors.

92
Multiprocessing and Scalability
  • Cache Coherence
  • For a multiprocessor system, memory is coherent
    if the results of any execution of a program are
    such that, for each location, it is possible to
    construct a hypothetical serial order of all
    operations to the location that is consistent
    with the results of the execution, and in which:
  • Operations issued by any particular process
    occur in the order in which they were issued to
    the memory system by that process, and
  • The value returned by each read operation is the
    value written by the last write to that location
    in the serial order.
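  • Example (an illustration, not from the original
    slides): with x initially 0, if P1 issues
    write(x)=1 followed by read(x) while P2 issues
    write(x)=2, the read may return 1 or 2 - each
    admits a valid serial order - but never 0, since
    P1's own write precedes the read in program
    order.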

93
Multiprocessing and Scalability
  • Cache Coherence Protocol
  • The protocols used to maintain coherence for
    multiple processors are called cache coherence
    protocols.
  • The key issue in implementing a cache coherence
    protocol is tracking the state of any sharing of
    a data block.

94
Multiprocessing and Scalability
  • Cache Coherence Protocol
  • Directory Based: the sharing status of a block
    of physical memory is kept in just one location,
    the directory.
  • Snooping Based: every cache that has a copy of
    the data is attached to the shared memory bus,
    and snoops on the bus to determine whether or
    not it has a copy of a block that is requested
    on the bus.
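  • A minimal sketch of what a directory entry might
    hold (illustrative C; the field names and the
    bit-vector representation are assumptions, not
    from the original slides):

/* Illustrative directory entry: one per memory block. The presence
   bit-vector records which caches hold a copy; the dirty bit marks
   a single exclusive (modified) holder. */
#include <stdint.h>

#define MAX_PROCS 64

typedef struct {
    uint64_t presence;   /* bit i set => cache i has a copy        */
    int      dirty;      /* 1 => exactly one cache holds it dirty  */
} dir_entry_t;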

95
Multiprocessing and Scalability
  • Cache Coherence Protocol
  • There are two ways to maintain the coherence
    requirement:
  • Write Invalidation
  • Write Propagate (write update or write
    broadcast)

96
Multiprocessing and Scalability
  • Write Invalidation: ensure that a processor has
    exclusive access to a data item before it writes
    that item.

Processor activity    Bus activity          CPU A's cache   CPU B's cache   Memory X
                                                                            0
CPU A reads X         Cache miss for X      0                               0
CPU B reads X         Cache miss for X      0               0               0
CPU A writes 1 to X   Invalidation for X    1                               0
CPU B reads X         Cache miss for X      1               1               1

  • With write-back caches, the write invalidates
    CPU B's copy without updating memory; memory is
    updated only when CPU B's later miss forces the
    owner to supply the block.
97
Multiprocessing and Scalability
  • Write Propagate: all cached copies of a data
    item are updated when the data item is written.

Processor activity    Bus activity           CPU A's cache   CPU B's cache   Memory X
                                                                             0
CPU A reads X         Cache miss for X       0                               0
CPU B reads X         Cache miss for X       0               0               0
CPU A writes 1 to X   Write broadcast of X   1               1               1
CPU B reads X                                1               1               1

  • The write broadcast immediately updates CPU B's
    copy and memory, so CPU B's later read hits in
    its own cache.
98
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • For bus-based shared memory multiprocessors,
    this is a simple and elegant solution. Recall
    that:
  • The bus is a single set of wires connecting
    resources,
  • Every resource attached to the bus can observe
    every bus transaction, and
  • All transactions that appear on the bus are
    visible to the cache controllers in the same
    order.

99
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • When a processor issues a request to its cache,
    its cache controller examines the status of the
    cache and takes suitable actions, including a
    request to access memory.
  • All cache controllers snoop on the bus and
    monitor transactions.

100
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • A snooping cache controller may take action if a
    bus transaction is relevant to it:
  • In the case of a write-through policy, if a
    snooping cache has a copy of the data, it
    either:
  • Updates its copy (update based protocols), or
  • Invalidates its copy (invalidation based
    protocols).
  • In either case, when a processor requests a
    block, it will see the most recent value.
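  • A minimal sketch of this snooping decision for a
    write-through, invalidation-based protocol
    (illustrative C; the valid/invalid line state
    and the names used here are assumptions, not a
    real controller):

/* Illustrative snoop logic for a write-through cache: every write
   goes to memory over the bus, where other controllers observe it. */
typedef enum { INVALID, VALID } line_state_t;

typedef struct {
    unsigned long tag;        /* which block this line holds */
    line_state_t  state;      /* VALID or INVALID            */
} cache_line_t;

/* Called in every other cache controller for each observed write. */
void snoop_bus_write(cache_line_t *line, unsigned long write_tag)
{
    if (line->state == VALID && line->tag == write_tag) {
        /* Another processor wrote this block: our copy is stale. */
        line->state = INVALID;         /* invalidation based      */
        /* An update based protocol would instead copy the new
           data into the line and leave it VALID.                 */
    }
}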

101
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • Refer to our previous example:

102
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • When p3 writes into u, its cache controller
    generates a bus transaction to update memory.
  • Observing that this bus transaction is relevant
    to it, p1's cache controller invalidates its own
    copy of the block containing u.
  • The main memory updates u to 7.
  • Subsequent reads of u from processors p1 and p2
    generate misses, and hence they will get the
    correct value of 7 from the main memory.

103
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • Simplicity is the main advantage of snooping
    protocols: there is no need to use explicit
    coherence operations in the program.
  • By extending the cache controller and exploiting
    the properties of the bus, the reads and writes
    that are natural to the program keep the caches
    coherent.
  • However, snooping systems with a write-through
    policy are not very efficient, especially given
    the inherent bandwidth limitation of a bus.

104
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • Let us look at the snooping protocol with a
    write-back policy - the processors are able to
    write to different blocks in their local caches
    concurrently, without any bus transactions.
  • Recall that in a uni-processor system, updates
    to main memory blocks resident in the cache are
    done at replacement time - a cache block may be
    dirty for a period.
  • In a multiprocessor system, this modified state
    is used by the protocols to indicate exclusive
    ownership of the block by a cache.

105
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • A cache is said to be the owner of a block if it
    must supply the data upon a request for that
    block.
  • A cache has an exclusive copy of a block if it
    is the only cache with a valid copy of the block
    - exclusivity implies that the cache may modify
    the block without notifying anyone else.

106
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • As a result, if a cache does not have
    exclusivity, it cannot write into a block before
    first putting a transaction on the bus to
    communicate with the others. Consequently, a
    processor might have the block in its cache in a
    valid state, but since a transaction must be
    generated, the write is treated as a write miss.

107
Multiprocessing and Scalability
  • Cache Coherence Solutions - Bus Snooping
  • If a cache has a modified block, then it is the
    owner and thus has exclusivity.
  • On a write miss in an invalidation protocol, a
    special form of transaction, a read exclusive,
    is generated to acquire exclusive ownership.
  • As a result, concurrent writes to the same block
    are not allowed - the read exclusive bus
    transactions are serialized.
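  • These rules can be summarized as an MSI-style
    (Modified/Shared/Invalid) state machine - a
    common textbook formulation, sketched here as an
    illustration rather than the exact protocol of
    any machine:

/* MSI-style write-back invalidation protocol, processor-side write.
   MODIFIED implies exclusive ownership; a write from SHARED or
   INVALID must first issue a read-exclusive (BusRdX) transaction. */
typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX } bus_op_t;

/* Returns the bus transaction needed for a processor write. */
bus_op_t on_processor_write(msi_state_t *state)
{
    switch (*state) {
    case MSI_MODIFIED:
        return BUS_NONE;        /* already exclusive owner: write hit */
    case MSI_SHARED:            /* valid copy, but not exclusive...   */
    case MSI_INVALID:           /* ...either way, a write miss        */
        *state = MSI_MODIFIED;  /* ownership gained via read-exclusive */
        return BUS_RDX;         /* BusRdX transactions are serialized  */
    }
    return BUS_NONE;
}

/* Snoop side: another cache's read-exclusive invalidates our copy.
   If we hold the block MODIFIED, we must first supply/flush it. */
void on_snooped_rdx(msi_state_t *state)
{
    *state = MSI_INVALID;
}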

108
Multiprocessing and Scalability
  • Scalable shared-memory multiprocessors (SSMP)
    provide the shared-memory programming model
    while removing the bottleneck of today's
    small-scale systems.

109
Multiprocessing and Scalability
  • Scalable shared-memory multiprocessors
  • Scalability must be physically possible and
    technically feasible.
  • Adding processors increases the potential
    computational capability of the system, but to
    realize this potential, all aspects of the
    system must be scaled up - especially the memory
    bandwidth.

110
Multiprocessing and Scalability
  • Scalable shared-memory multiprocessors
  • A natural solution is to distribute memory among
    processors so that each processor has direct
    access to its local memory.
  • However, the interconnection network connecting
    the processors must provide scalable bandwidth
    at reasonable cost and latency.

111
Multiprocessing and Scalability
  • Scalable shared-memory multiprocessors
  • Scalable systems are less closely coupled than
    bus-based shared memory multiprocessors, so the
    interactions among processors and between
    processors and memories are different.
  • A scalable system attempts to avoid inherent
    design limits on the extent to which resources
    can be added to the system.

112
Multiprocessing and Scalability
  • Scalable shared-memory multiprocessors
  • Scalability is studied based on its effect on
    the following metrics:
  • Throughput
  • Latency per operation
  • Cost
  • Implementation.