Multiprocessors and Thread-Level Parallelism - PowerPoint PPT Presentation

Slides: 16
Provided by: cappS
1
Chapter 5
  • Multiprocessors and Thread-Level Parallelism

2
6.1 Introduction
  • Thread
  • A set of instructions that can be executed on a
    different processor
  • Examples: a process, an iteration of a loop
  • Must be identified by the programmer or generated
    by a compiler
  • Size: from hundreds to millions of instructions
  • Thread-level parallelism
  • Run different threads on different processors
  • N threads → run on N processors in parallel
  • To fully utilize a machine with N processors, a
    compiler or programmer needs to identify N
    threads
  • Difference from instruction-level parallelism
    (ILP)
  • Thread-level parallelism is large-grain (also
    called coarse-grain) parallelism
  • ILP is small-grain (also called fine-grain)
    parallelism
  • Thread-level parallelism is identified by the
    programmer or compiler
  • ILP is mainly identified by hardware
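
The idea of turning loop iterations into N threads can be sketched as follows. This is only an illustration of the decomposition: the function names and the strided partition are assumptions, not something the slides prescribe, and in standard CPython builds the global interpreter lock generally keeps CPU-bound threads from actually running in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one thread: sum the squares of its share of the iterations."""
    return sum(i * i for i in chunk)


def parallel_sum_of_squares(n_items, n_threads):
    """Split loop iterations 0..n_items-1 into n_threads chunks, one per thread."""
    # Strided partition (an arbitrary choice): thread t gets t, t + n_threads, ...
    chunks = [range(t, n_items, n_threads) for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Each chunk is an independent thread of work; results are combined at the end.
        return sum(pool.map(partial_sum, chunks))
```

To keep N processors busy, the work has to be divided into at least N such chunks, which is the programmer's or compiler's job rather than the hardware's.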

3
A Taxonomy of Parallel Architectures
  • Flynn's categorization
  • SISD (single instruction stream, single data
    stream): uniprocessor
  • SIMD (single instruction stream, multiple data
    streams): e.g., instructions for multimedia extensions
  • MISD (multiple instruction streams, single data
    stream): no commercial multiprocessor, some
    special-purpose stream processors
  • MIMD (multiple instruction streams, multiple data
    streams): general form of parallel processors
  • MIMD architectures
  • Centralized shared-memory architecture
  • → Fig. 6.1
  • Multiple processors share a single centralized
    memory
  • Simplest architecture
  • Normally a single-bus-based architecture
  • But multiple buses or a switch can also be used
  • At most a few dozen processors
  • Because scaling is difficult (the shared memory
    can become a bottleneck)
  • Most popular architecture
  • Also called symmetric (shared-memory)
    multiprocessors (SMP) or uniform memory access
    (UMA) architecture

4
A Taxonomy of Parallel Architectures
  • MIMD Architectures (continued)
  • Physically distributed memory
  • → Fig. 6.2
  • Advantages
  • Fast local memory access
  • Easy to scale up memory bandwidth
  • Cost-effective
  • Assumption: most accesses are to the local memory
    in the node
  • Disadvantage
  • Slow remote memory access and slow communication
    between processors
  • Variation: each node may itself consist of multiple
    processors (i.e., an SMP)

5
Models for Communication and Memory Architecture
  • This discussion is for physically distributed
    memory systems
  • Distributed shared-memory (DSM) architecture
  • Physically distributed but logically shared
    address space
  • A single address space over physically-distributed
    memory
  • The same address on two processors refers to the
    same location
  • Interprocessor communication: load/store
    instructions through the shared memory
  • Also called nonuniform memory access (NUMA)
    architecture
  • Message-passing multiprocessors
  • Each processor has its own private address space
  • The same address on two processors refers to
    different locations
  • Can be considered as multiple independent
    computers connected through a network →
    multicomputer, cluster
  • Interprocessor communication: explicit calls to a
    message-passing interface
  • MPI (Message Passing Interface): the most popular
    standard library for message passing
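
The two communication models can be contrasted with a small sketch. It runs inside one Python process, so it only imitates the programming style of each model: the shared dictionary stands in for a shared address space, and `queue.Queue` stands in for an MPI-style send/receive (it is not MPI). All names here are hypothetical.

```python
import queue
import threading


def _run(worker, n_workers):
    """Start n_workers threads running worker() and wait for all of them."""
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def shared_memory_demo(n_workers=4, n_increments=1000):
    """Workers communicate implicitly by loading and storing one shared variable."""
    state = {"counter": 0}          # same location seen by every worker
    lock = threading.Lock()         # synchronization is the programmer's job

    def worker():
        for _ in range(n_increments):
            with lock:
                state["counter"] += 1

    _run(worker, n_workers)
    return state["counter"]


def message_passing_demo(n_workers=4, n_increments=1000):
    """Workers keep private state and communicate only through explicit sends."""
    mailbox = queue.Queue()

    def worker():
        local = 0                   # private to this worker
        for _ in range(n_increments):
            local += 1
        mailbox.put(local)          # explicit "send" of the result

    _run(worker, n_workers)
    return sum(mailbox.get() for _ in range(n_workers))  # explicit "receives"
```

In the first function communication is invisible in the code (ordinary reads and writes); in the second every transfer is an explicit call, which is what makes it easier to observe and control.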

6
Models for Communication and Memory Architecture
  • Performance Metrics for Communication Mechanisms
  • Communication bandwidth
  • Communication latency
  • Communication latency hiding
  • Overlapping with computation or other
    communication
  • Advantages of different communication mechanisms
  • Shared-memory communication
  • Advantages
  • The shared-memory model is a familiar programming
    environment
  • Ease of programming (or compiler design)
  • Compatible with centralized memory systems
  • OpenMP: a standardized programming interface for
    shared-memory multiprocessors
  • Easy to focus optimization on the critical part
  • Lower communication overhead (normally lower
    latency)
  • Cache management may be automatically supported
    by hardware

7
Models for Communication and Memory Architecture
  • Advantages of different communication mechanisms
    (continued)
  • Message-passing communication
  • Advantages
  • Simple and scalable hardware implementation
  • Explicit communication
  • Communication is more observable and controllable
  • Can explicitly focus on a specific part
  • Fewer synchronization errors
  • Performance advantage from explicitly handling
    communication
  • Analogy: shared-memory communication is like
    programming in a high-level language, while
    message passing is like programming in assembly

8
Challenges of Parallel Processing
  • Limited parallelism
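  • Example: you want to achieve a speedup of 80 with 100
    processors.
  • $\text{Speedup} = \dfrac{1}{\text{Fraction}_{\text{enhanced}} / \text{Speedup}_{\text{enhanced}} + (1 - \text{Fraction}_{\text{enhanced}})}$
  • $\text{Speedup} = \dfrac{1}{\text{Fraction}_{\text{parallel}} / \text{Speedup}_{\text{parallel}} + (1 - \text{Fraction}_{\text{parallel}})}$
  • $80 = \dfrac{1}{\text{Fraction}_{\text{parallel}} / 100 + (1 - \text{Fraction}_{\text{parallel}})}$
  • $\text{Fraction}_{\text{parallel}} = 0.9975$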
  • Large latency of remote access in a parallel
    processor
  • → Figure 6.3
  • Summary: SMP, DSM, UMA, NUMA, message passing,
    MPP (Massively Parallel Processors)
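
The speedup example on this slide (a target of 80 on 100 processors) can be checked numerically. This is a small sketch of Amdahl's law, assuming the parallel part is sped up by exactly the number of processors; the function names are made up for illustration.

```python
def amdahl_speedup(fraction_parallel, n_processors):
    """Overall speedup when fraction_parallel of the work runs on n_processors."""
    return 1.0 / (fraction_parallel / n_processors + (1.0 - fraction_parallel))


def required_parallel_fraction(target_speedup, n_processors):
    """Solve target = 1 / (f / n + 1 - f) for the parallel fraction f."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)


fraction = required_parallel_fraction(80, 100)
print(round(fraction, 4))                        # 0.9975
print(round(amdahl_speedup(fraction, 100), 6))   # 80.0
```

In other words, only about 0.25% of the original computation may remain sequential, which is why limited parallelism is listed as a challenge.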

9
6.3 Symmetric Shared-Memory Architectures
  • Private data: used by a single processor
  • Shared data: used by multiple processors
  • Caching of shared data causes a cache coherence
    problem
  • → Fig. 6.7
  • Cache coherence problem
  • Informal definition: a memory system is coherent
    if any read of a data item returns the most
    recently written value of that data item → vague
  • A memory system is coherent if
  • A read by a processor P to a location X that
    follows a write by P to X, with no writes of X by
    another processor occurring between the write and
    the read by P, always returns the value written
    by P
  • A read by a processor to location X that follows
    a write by another processor to X returns the
    written value if the read and write are
    sufficiently separated in time and no other
    writes to X occur between the two accesses
  • Exactly when? → memory consistency problem in
    Section 6
  • Writes to the same location are serialized; that
    is, two writes to the same location by any two
    processors are seen in the same order by all
    processors. For example, if the values 1 and then
    2 are written to a location, a processor can never
    read the value of the location as 2 and then
    later read it as 1
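
The coherence problem itself can be reproduced with a toy model: two private write-through caches over one memory, with no coherence actions at all. The class below is a hypothetical sketch (one word per block, no evictions) and is not meant to reproduce any figure from the textbook.

```python
class IncoherentCache:
    """Private write-through cache that never invalidates or updates its copies."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory (a dict: address -> value)
        self.lines = {}        # this processor's private cached copies

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: may return a stale copy

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value            # write-through to memory


memory = {"X": 1}
cache_a = IncoherentCache(memory)
cache_b = IncoherentCache(memory)

cache_a.read("X")            # A caches X = 1
cache_b.read("X")            # B caches X = 1
cache_a.write("X", 0)        # A writes 0; memory is updated, B's copy is not
stale = cache_b.read("X")    # B still sees the old value
print(stale, memory["X"])    # 1 0
```

B's read follows A's write and is well separated from it, yet it returns the old value, which violates the second condition of the definition above.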

10
Basic Schemes for Enforcing Coherence
  • Cache coherence protocol
  • Maintain coherence for multiple processors
  • Track the state of any sharing of a data block
  • Two approaches
  • Directory based: a single central data structure
    stores the sharing status
  • Snooping: every cache that holds a copy of a block
    also holds its sharing status
  • No centralized data structure
  • Caches are usually on a shared-memory bus

11
An Example Protocol
  • Write invalidate protocol for a write-back cache
  • → Fig. 6.10, Fig. 6.11, Fig. 6.12
  • In the left half of Fig. 6.11
  • All states needed for a write-back uniprocessor
    cache: invalid, valid, dirty
  • All arcs needed for a write-back uniprocessor
    cache, except a write hit on the shared state
    (which generates a write miss)
  • Simplification
  • No distinction between a write hit and a write
    miss to a shared cache block
  • Place a write miss on the bus
  • Memory will supply the data
  • Any processor with a copy of the cache block
    invalidates it
  • A more complicated protocol distinguishes a write
    hit from a write miss
  • A write hit need not fetch data from memory → no
    data movement is necessary → just a status change →
    addition of a write-invalidate transaction
  • Addition of a write-invalidate transaction to an
    Exclusive state
  • E.g., a write miss on an invalid block while another
    cache has it in the exclusive state
  • Just invalidate the cache block, with no data
    write-back
  • Just a change of ownership
  • This may depend on the size of the write (full cache
    block write or partial write)
  • Additional state: clean and private
  • No need to generate a bus transaction on a write
    to this block
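
The simplified write-invalidate protocol described on this slide can be sketched as a small simulation. It assumes one word per block, no evictions, and atomic bus transactions, and it treats a write hit on a shared block as a write miss, as the slide's simplification does. The class and method names are hypothetical; the three states follow the invalid / shared / exclusive (dirty) naming used above.

```python
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"


class Bus:
    """Shared bus: broadcasts misses to all other caches, then memory responds."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.transactions = []

    def _broadcast(self, kind, requester, addr):
        self.transactions.append((kind, addr))
        for cache in self.caches:
            if cache is not requester:
                cache.snoop(kind, addr)
        return self.memory[addr]      # memory supplies the data after snooping


class Cache:
    def __init__(self, bus):
        self.bus, self.state, self.data = bus, {}, {}
        bus.caches.append(self)

    def snoop(self, kind, addr):
        state = self.state.get(addr, INVALID)
        if state == EXCLUSIVE:                    # we hold the only, dirty copy
            self.bus.memory[addr] = self.data[addr]   # write it back first
        if kind == "write_miss" and state != INVALID:
            self.state[addr] = INVALID            # another processor will write
        elif kind == "read_miss" and state == EXCLUSIVE:
            self.state[addr] = SHARED             # downgrade, keep a clean copy

    def read(self, addr):
        if self.state.get(addr, INVALID) == INVALID:
            self.data[addr] = self.bus._broadcast("read_miss", self, addr)
            self.state[addr] = SHARED
        return self.data[addr]

    def write(self, addr, value):
        if self.state.get(addr, INVALID) != EXCLUSIVE:
            # Simplification: a write hit on a shared block also places a
            # write miss on the bus and refetches the block from memory.
            self.data[addr] = self.bus._broadcast("write_miss", self, addr)
            self.state[addr] = EXCLUSIVE
        self.data[addr] = value       # write-back cache: memory is not updated


memory = {"X": 1}
bus = Bus(memory)
a, b = Cache(bus), Cache(bus)
a.read("X"); b.read("X")   # both hold X in the shared state
a.write("X", 5)            # b's copy is invalidated, a becomes exclusive
print(b.read("X"))         # 5: a writes back, then memory supplies the value
```

A second write by `a` while it is still exclusive would generate no bus transaction, which is the property the more complicated variants on the slide try to extend to more cases.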

12
An Example Protocol
  • Protocol assumption: operations are atomic, i.e.,
    no intervening operation can occur.
  • E.g., as a single atomic operation: detect a write
    miss, acquire the bus, and receive a response
  • Non-atomic operations may cause deadlock
  • Microprocessors support cache coherence protocols
  • Example: for the Pentium IV, 4 processors can be
    directly connected to a shared bus

13
Snooping Protocols
  • Two approaches
  • Write invalidate protocol
  • Processor has exclusive access to a data item
    before it writes that item
  • → Fig. 6.8
  • Write update protocol (or write broadcast
    protocol)
  • Update all the cached copies of a data item when
    that item is written
  • → Fig. 6.9
  • Performance differences
  • Multiple writes to the same item
  • Write invalidate: invalidate just once
  • Write update: update every time
  • Multiple writes to different words in the same
    cache block
  • Write invalidate: invalidate just once
  • Write update: update for every word
  • Delay between a write in one processor and a read
    in another processor
  • The write update scheme is faster
  • Write invalidate: less bus and memory traffic →
    more popular
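
The traffic difference between the two snooping schemes can be sketched with a simple counter. The model assumes one writing processor, other caches that initially hold copies of every block, and no intervening reads by those caches; the function name and the default of 4 words per block are arbitrary choices for illustration.

```python
def count_bus_transactions(write_addresses, protocol, words_per_block=4):
    """Count bus transactions caused by a sequence of word writes by one processor."""
    exclusive_blocks = set()   # blocks the writer already owns exclusively
    transactions = 0
    for addr in write_addresses:
        block = addr // words_per_block
        if protocol == "update":
            transactions += 1              # every written word is broadcast
        elif protocol == "invalidate":
            if block not in exclusive_blocks:
                transactions += 1          # invalidate other copies once per block
                exclusive_blocks.add(block)
        else:
            raise ValueError(f"unknown protocol: {protocol}")
    return transactions


same_item = [0] * 10          # ten writes to the same word
same_block = [0, 1, 2, 3]     # four different words in one block
print(count_bus_transactions(same_item, "invalidate"),
      count_bus_transactions(same_item, "update"))      # 1 10
print(count_bus_transactions(same_block, "invalidate"),
      count_bus_transactions(same_block, "update"))     # 1 4
```

The model deliberately leaves out the case where write update wins: a reader in another processor would find the new value already in its cache instead of taking a miss.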

14
Basic Implementation Techniques
  • Write-through vs. write-back
  • On a cache miss, the most recent value of the
    data is found:
  • Write-through cache: always in memory
  • Write-back cache: either in memory or in
    another processor's cache
  • The processor broadcasts the address to be read
    on the bus
  • If a processor has the dirty copy of the data, it
    sends the data block
  • Note: a write-back cache requires less memory
    traffic → preferred for multiprocessor systems

15
Basic Implementation Techniques
  • Data structure for cache coherence protocol
  • The normal cache structure is needed: cache tag,
    valid bit, dirty bit
  • Shared bit: indicates whether the block is shared
    by another processor
  • Write invalidation is not necessary for
    non-shared data
  • Owner of a cache block: the processor that
    holds the sole copy of the data block
  • Every bus transaction must check the cache tags →
    may interfere with CPU cache accesses
  • Duplicate tags
  • Multilevel cache with the inclusion property
  • Snooping: 2nd-level cache
  • CPU access: 1st-level cache
  • Adopted in many designs
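
Why inclusion helps can be shown with a toy model in which snoops check only the second-level tags. Everything here is a hypothetical sketch (tags are kept in Python sets, there is no data and no capacity limit); it only illustrates the filtering effect.

```python
class InclusiveTwoLevelCache:
    """Two-level cache where every block in L1 is also present in L2."""

    def __init__(self):
        self.l1_tags = set()
        self.l2_tags = set()
        self.l1_probes = 0    # number of snoops that had to be forwarded to L1

    def cpu_load(self, block):
        self.l2_tags.add(block)       # fill L2 first so inclusion always holds
        self.l1_tags.add(block)

    def evict_from_l1(self, block):
        self.l1_tags.discard(block)   # block stays in L2; inclusion still holds

    def snoop_invalidate(self, block):
        """Handle an invalidation seen on the bus; return True if we had a copy."""
        if block not in self.l2_tags:
            return False              # filtered: by inclusion, L1 cannot hold it
        self.l2_tags.discard(block)
        self.l1_probes += 1           # only now is the L1 disturbed
        self.l1_tags.discard(block)
        return True
```

Snoops for blocks the node does not hold are answered from the L2 tags alone, so the CPU keeps uninterrupted access to its first-level cache for those transactions.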