1
ccNUMA
Cache Coherent Non-Uniform Memory Access
Chris Coughlin, MSCS521, Prof. Ten Eyck, Spring 2004
2
Let's First Talk About Computer Architectures
In 1966, Michael Flynn proposed a classification
for computer architectures based on the number of
instruction streams and data streams (Flynn's
Taxonomy).
  • SISD (Single Instruction Stream, Single Data
    Stream)
  • A single-processor computer (uniprocessor) in
    which a single stream of instructions is
    generated from the program.
  • SIMD (Single Instruction Stream, Multiple Data
    Stream)
  • Each instruction is executed on a different set
    of data by different processors. (Used for vector
    and array processing.)
  • MISD (Multiple Instruction Stream, Single Data
    Stream)
  • Each processor executes a different sequence of
    instructions.
  • Never been commercially implemented.
  • MIMD (Multiple Instruction Stream, Multiple Data
    Stream)
  • Each processor has a separate program.
  • An instruction stream is generated from each
    program.
  • Each instruction operates on different data.

3
Multiprocessors
  • The idea behind multiprocessors is to create
    powerful computers by connecting many smaller
    ones.
  • Computational speed is increased by using
    multiple processors operating together on a
    single problem.
  • A parallel processing program is a single program
    that runs on multiple processors simultaneously.
  • The overall problem is split into parts, each of
    which is performed by a separate processor in
    parallel.
  • In addition to a faster solution, it may also
    generate a more precise solution.

4
MIMD Systems
  • Shared Memory Multiprocessor System
  • Multiple processors are connected to multiple
    memory modules such that each processor can
    access any other processor's memory module. This
    multiprocessor employs a shared address space
    (also known as a single address space).
  • Communication is implicit with loads and stores;
    there is no explicit recipient of a shared
    memory access.
  • Processors may communicate without necessarily
    being aware of one another.
  • A single image of the operating system runs
    across all the processors.

5
MIMD Systems (cont.)
  • Multicomputer
  • A term for parallel processors with separate,
    private address spaces (not accessible by the
    other processors in the system).
  • Communicate by message passing: the messages
    carry data from one processor to another as
    dictated by the program.
  • Complete computers, consisting of a processor and
    local memory, connected through an
    interconnection network (e.g. a LAN).

6
Computer Architecture Classifications
Processor Organizations
  • SISD (Single Instruction Stream, Single Data Stream)
      - Uniprocessor
  • SIMD (Single Instruction Stream, Multiple Data Stream)
      - Vector Processor
      - Array Processor
  • MISD (Multiple Instruction Stream, Single Data Stream)
  • MIMD (Multiple Instruction Stream, Multiple Data Stream)
      - Shared Memory (tightly coupled)
      - Multicomputer (loosely coupled)
Note: We will expand on this later.
7
Back to Shared Memory Multiprocessors
  • Two styles: UMA and NUMA
  • UMA (Uniform Memory Access)
  • The time to access main memory is the same for
    all processors since they are equally close to
    all memory locations.
  • Machines that use UMA are called Symmetric
    Multiprocessors (SMPs).
  • In a typical SMP architecture, all memory
    accesses are posted to the same shared memory
    bus.
  • Contention - as more CPUs are added, competition
    for access to the bus leads to a decline in
    performance.
  • Thus, scalability is limited to about 32
    processors.


8
Shared Memory Multiprocessors (cont.)
  • NUMA (Non-Uniform Memory Access)
  • Since memory is physically distributed, it is
    faster for a processor to access its own local
    memory than non-local memory (memory local to
    another processor or shared between processors).
  • Unlike SMPs, all processors are not equally close
    to all memory locations.
  • A processor's own internal computations can be
    done in its local memory, leading to reduced
    memory contention.
  • Designed to surpass the scalability limits of
    SMPs.

9
Communication and Connection Options for
Multiprocessors
Multiprocessors come in two main configurations:
a single-bus connection and a network
connection. The choice of communication
model and physical connection depends largely
on the number of processors in the organization.
Notice that the scalability of NUMA makes it
ideal for a network configuration; UMA, however,
is best suited to a bus connection.
10
A Multiprocessor Bus Configuration
The single bus design is limited in terms of
scalability. The largest number of processors in
a commercial product using this configuration is
36 (SGI Power Challenge).
11
A Multiprocessor Network Configuration
The network-connected processor design is very
scalable. Since each processor has its own
memory, the network connection is only used for
communication between processors.
12
A Quick Look at Cache
  • Modern processors use a faster, smaller cache
    memory to act as a buffer for slower, larger
    memory.
  • Caches exploit the principle of locality in
    memory accesses.
  • Temporal locality: the concept that if data is
    referenced, it will tend to be referenced again
    soon after.
  • Spatial locality: the concept that data is
    more likely to be referenced soon if data near
    it was just referenced.
  • Caches hold recently referenced data, as well as
    data near the recently referenced data.
  • This can lead to performance increases by
    reducing the need to access main memory on every
    reference.

13
What is ccNUMA?
  • The cc in ccNUMA stands for cache coherent.
  • The use of cache memory in modern computer
    architectures leads to the cache coherence
    problem.
  • It is a situation that can occur when two or more
    processors reference the same shared data. If
    one processor modifies its copy of the data, the
    other processors will have stale copies of the
    data in their caches.
  • Machines that are cache coherent ensure that a
    processor accessing a memory location receives
    the most up-to-date version of the data.
  • Cache coherence is maintained by software,
    special-purpose hardware, or both.
  • NUMA systems that maintain cache coherence are
    referred to as ccNUMA machines.
  • Since few applications still exist for
    non-cache-coherent NUMA machines, the terms NUMA
    and ccNUMA are used interchangeably.

14
Computer Architecture Classifications (revisited)
Processor Organizations
  • SISD (Single Instruction Stream, Single Data Stream)
      - Uniprocessor
  • SIMD (Single Instruction Stream, Multiple Data Stream)
      - Vector Processor
      - Array Processor
  • MISD (Multiple Instruction Stream, Single Data Stream)
  • MIMD (Multiple Instruction Stream, Multiple Data Stream)
      - Shared Memory (tightly coupled)
          - UMA (SMP)
          - NUMA
              - ccNUMA
      - Multicomputer (loosely coupled)
15
Cache Coherency Protocols
  • Snooping protocol
  • A bus-based method in which cache controllers
    monitor the bus for activity and update or
    invalidate cache entries as necessary.
  • Two types:
  • Write-invalidate: the writing processor sends
    an invalidation signal to the bus. All other
    caches check to see if they have a copy of the
    cache block. If they do, the block containing
    the data gets invalidated. The writing
    processor then changes its local copy.
  • Write-update: the writing processor broadcasts
    the new data over the bus and all copies are
    updated with the new value.
  • Commercial machines use write-invalidate to
    preserve bandwidth.
  • Write-update has the advantage of making the new
    values appear in the caches sooner.

16
Cache Coherency Protocols (cont.)
  • Directory-based protocol
  • A central directory maintains the information
    about which memory locations are being shared in
    multiple caches and which are contained in just
    one processor's cache.
  • On any memory access, it knows the caches that
    need to be updated or invalidated.
  • It is used by all software-based implementations
    of shared memory.
  • It is a scalable scheme that is suitable for a
    network configuration.

17
A Side-Effect of Cache Coherency
  • False sharing
  • Caches are organized into blocks of contiguous
    memory locations mainly because programs tend
    to use spatial locality of reference.
  • It is therefore possible for two processors to
    share the same cache block, but to not share the
    same memory location within the block.
  • If one processor writes to its own part of the
    block, it then causes the other processor's
    entire block, including the memory location it
    was accessing, to get updated or invalidated.
  • Unnecessary invalidations can affect performance.
  • It is up to the programmer to detect it and avoid
    it.
  • Compiler-based solutions are being researched.

18
ccNUMA Implementations
  • Stanford Dash
  • Dash stands for Directory Architecture for Shared
    Memory.
  • First to use directory-based cache coherence.
  • SGI Origin 2000 (Silicon Graphics Inc.)
  • Can support up to 1024 processors.
  • SGI claims it accounts for over 95% of worldwide
    shipments of ccNUMA-based systems.
  • IBM's LA (Local Access) ccNUMA

19
References
  • Computer Organization and Design: The
    Hardware/Software Interface, David A. Patterson &
    John L. Hennessy, 1998, 2nd edition
  • Supercomputing Systems: Architectures, Design,
    and Performance, Svetlana P. Kartashev & Steven
    I. Kartashev, 1990
  • Parallel Programming: Techniques and Applications
    Using Networked Workstations and Parallel
    Computers, Barry Wilkinson & Michael Allen, 1999
  • www.mkp.com/cod2e.htm
  • Non-Uniform Memory Access - Wikipedia
  • Symmetric Multiprocessing - Wikipedia
  • Cache Coherence - Wikipedia
  • Parallel Computing - Wikipedia
  • Locality of Reference - Wikipedia

20
References (cont.)
  • A Primer on NUMA (Non-Uniform Memory Access)
  • Cache Coherence in the Context of Shared Memory
    Architecture
  • Distributed Shared Memory: ccNUMA Interconnects
  • The Stanford Dash Multiprocessor
  • The SGI Origin: A ccNUMA Highly Scalable Server
  • IBM Distributed Shared Memory Plans Uncovered
  • http://benchoi.info/Bens/Teaching/Csc364/PDF/CH18.pdf
  • http://www.cs.ucsd.edu/classes/fa00/cse240/lectures/Lecture17.html
  • http://www.cs.ucsd.edu/users/carter/260/260class02.pdf