MIMD COMPUTERS OR MULTIPROCESSORS
1
MIMD COMPUTERS OR MULTIPROCESSORS
  • References
  • [8] Jordan and Alaghband, Fundamentals of
    Parallel Algorithms, Architectures, Languages,
    Prentice Hall, Chapters 4 and 5.
  • [20] Gregory Pfister, In Search of Clusters: The
    Ongoing Battle in Lowly Parallel Computing,
    Second Edition, Prentice Hall PTR, 1998, Ch 6, 7,
    9, 13.
  • This chapter is a continuation of the brief
    coverage of MIMDs in our introductory chapter.
  • In practice, the name MIMD usually refers to a
    type of parallel computer, while multiprocessor
    is the more common term for this style of
    computing.
  • Defn (see [8]): A multiprocessor is a single
    integrated system that contains multiple
    processors, each capable of executing an
    independent stream of instructions, but with one
    integrated system for moving data among
    processors, memory, and I/O devices.
  • If data are transferred among processors (PEs)
    infrequently, possibly in large chunks, with long
    periods of independent computing in between, the
    multiprocessing is called coarse grained or
    loosely coupled.
  • In fine grained computation, or tightly coupled
    computation, small amounts of data (i.e., one or
    a few words) are communicated frequently.

2
Shared Memory Multiprocessors
  • There is a wide variation in the types of shared
    memory multiprocessors.
  • A shared memory multiprocessor in which some
    memory locations take longer to access than
    others is called a NUMA (for NonUniform Memory
    Access) machine.
  • One in which all memory locations take the same
    time to access is called a UMA (Uniform Memory
    Access) machine.
  • See the earlier discussion in our Ch 1.
  • Some shared memory processors allow each
    processor to have its own private memory as well
    as to have shared memory.
  • An interconnection network (e.g., a ring, 2D
    mesh, or a hypercube) is used to connect all
    processors to the shared memory.
  • Characteristics of Shared Memory Multiprocessors
  • Interprocessor communication is done in the
    memory interface by read and write instructions
  • Memory may be physically distributed and the
    reads and writes from different processors may
    take different time and may collide in the
    interconnection network.
  • Memory latency (i.e., time to complete a read or
    write) may be long and variable.
  • Messages through the interconnection network are
    the size of single memory words.

3
  • Randomization of requests (as by interleaving
    words across memory modules) may be used to
    reduce the probability of collisions.
  • Contrasting characteristics of message-passing
    multiprocessors
  • Interprocessor communication is done by software
    using data transmission instructions (e.g., send,
    receive).
  • Read and write refer only to memory private to
    the processor issuing them.
  • Data may be aggregated into a long message before
    being sent through the interconnection network.
  • Large data transmissions may mask long and
    variable latency in the communications network.
  • Global scheduling of communications can help
    avoid collisions between long messages
  • SPMD (single program, multiple data) programs
  • About the only practical choice for managing a
    huge number of processes (i.e., hundreds, perhaps
    thousands).
  • Multiple processes execute the same program
    simultaneously but normally not synchronously.
  • Writing distinct programs for a large number of
    processes is not feasible (see the sketch below).
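  • The following minimal sketch illustrates the SPMD
    idea in C using MPI (message passing is discussed
    later in these notes): every process runs the same
    program and chooses its share of the work from its
    rank. The problem size and the block partitioning
    are hypothetical choices made only for
    illustration.

```c
/* Minimal SPMD sketch: every process runs this same program and
 * selects its own work based on its rank.
 * Compile/run with, e.g., mpicc spmd.c && mpirun -np 4 ./a.out
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many processes?  */

    /* Same program, different data: each process handles its own
     * slice of the index range 0..999 (hypothetical problem size). */
    int n = 1000;
    int lo = rank * n / nprocs;
    int hi = (rank + 1) * n / nprocs;
    long local_sum = 0;
    for (int i = lo; i < hi; i++)
        local_sum += i;

    printf("process %d of %d summed indices [%d,%d): %ld\n",
           rank, nprocs, lo, hi, local_sum);

    MPI_Finalize();
    return 0;
}
```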

4
The OpenMP Language Extension for Shared Memory
Multiprocessors
  • OpenMP is a language extension built on top of an
    existing sequential language.
  • OpenMP extensions exist for both C/C++ and
    Fortran.
  • When it is necessary to refer to a specific
    version, we will refer to the Fortran77 version.
  • Can be contrasted with F90 vector extensions.
  • OpenMP constructs are limited to compiler
    directives and library subroutine calls.
  • The compiler directives are formatted so that
    they will be treated as comments by a sequential
    compiler.
  • This allows existing sequential compilers to
    easily be modified to support OpenMP.
  • Whether a program executes the same computation
    (or any meaningful computation) when executed
    sequentially is the responsibility of the
    programmer.
  • Execution starts with a sequential process that
    forks a fixed number of threads when it reaches a
    parallel region.
  • This team of threads executes to the end of the
    parallel region and then joins the original
    process (see the sketch below).
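  • A minimal fork/join sketch, written in C OpenMP
    rather than the Fortran77 version used in these
    notes: a sequential process forks a team of
    threads at a parallel region and joins it at the
    end. The gcc flag in the comment is only one way
    to build it.

```c
/* Minimal OpenMP fork/join sketch.
 * Compile with, e.g., gcc -fopenmp forkjoin.c
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("sequential part: one initial thread\n");

    #pragma omp parallel            /* fork a team of threads here  */
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        printf("thread %d of %d inside the parallel region\n", id, nt);
    }                               /* implicit join at region end  */

    printf("sequential part again: back to the original thread\n");
    return 0;
}
```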

5
OpenMP (cont)
  • The number of threads is constant within a
    parallel region.
  • Different parallel regions can have a different
    number of threads.
  • Nested parallelism is supported by allowing a
    thread to fork a new team of threads at the
    beginning of a nested parallel region.
  • A thread that forks other threads is called the
    master thread of the team.
  • User controlled environment variables
  • num_threads specifies the number of threads.
  • dynamic controls whether the number of threads
    can change from one parallel section to another.
  • nested specifies whether nested parallelism is
    allowed or whether nested parallel regions are
    executed sequentially.
  • Process Control
  • Parallel regions are bracketed by parallel and
    end parallel directives.
  • The directives parallel do and parallel sections
    can be used to combine parallel regions with work
    distribution (see the sketch below).
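  • A sketch of thread-count control and a combined
    work-distribution directive in C OpenMP (parallel
    for is the C analogue of Fortran's parallel do).
    The environment variables named informally above
    correspond in standard OpenMP to OMP_NUM_THREADS,
    OMP_DYNAMIC, and OMP_NESTED, or equivalently to
    the library calls used below; the team sizes
    chosen here are arbitrary.

```c
/* Sketch of thread-count control and combined parallel-region/loop
 * work distribution.  Compile with, e.g., gcc -fopenmp control.c
 */
#include <omp.h>
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N];

    omp_set_dynamic(0);       /* "dynamic": keep the team size fixed          */
    omp_set_num_threads(4);   /* "num_threads": request 4 threads from now on */

    /* Combined parallel region + loop work distribution: the loop
     * iterations are divided among the team's threads.             */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* A different parallel region may use a different team size.   */
    #pragma omp parallel num_threads(2)
    {
        printf("region 2: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    printf("a[N-1] = %.1f\n", a[N - 1]);
    return 0;
}
```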

6
OpenMP (cont)
  • The term parallel construct denotes a parallel
    region or a block-structured work distribution
    contained in a parallel region.
  • The static scope of a parallel region consists of
    all statements between the start and end
    statement in that construct.
  • The dynamic scope of a parallel region consists
    of all statements executed by a team member
    between the entry to and exit from this
    construct.
  • This may include statements outside of the static
    scope of the parallel region.
  • Parallel directives that lie in the dynamic scope
    but outside the static scope of a parallel
    construct are called orphan directives.
  • These orphan directives pose special problems for
    the compiler.
  • An SPMD-style program could be written in OpenMP
    by entering a parallel region at the beginning of
    the main program and exiting it just before the
    end (see the sketch below).
  • This includes the entire program in the dynamic
    scope of the parallel region.
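  • A sketch of an SPMD-style OpenMP program in C
    containing an orphaned work-sharing directive: the
    omp for in do_work() is in the dynamic scope of
    the parallel region in main() but outside its
    static (lexical) scope. The function and array
    names are purely illustrative.

```c
/* SPMD-style OpenMP program with an orphaned work-sharing directive.
 * Compile with, e.g., gcc -fopenmp orphan.c
 */
#include <omp.h>
#include <stdio.h>

#define N 16
static double x[N];

/* The work-sharing directive here is "orphaned": it binds to whatever
 * parallel region is active when do_work() is called.               */
static void do_work(void)
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        x[i] = (double)i * i;
}

int main(void)
{
    /* SPMD style: essentially the whole program is one parallel region. */
    #pragma omp parallel
    {
        do_work();    /* every thread enters; iterations are shared  */

        #pragma omp single
        printf("x[N-1] = %.1f\n", x[N - 1]);
    }
    return 0;
}
```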

7
OpenMP (cont)
  • Work Distribution consists of parallel loops,
    parallel code sections, single-thread execution,
    and master-thread execution (a combined sketch
    follows this slide's list).
  • Parallel code sections emphasize distributing
    code sections to parallel processes that are
    already running, rather than forking new
    processes.
  • The section between single and end single is
    executed by one (and only one) thread.
  • The code between master and end master is
    executed by the master thread and is often done
    to provide synchronization.
  • OpenMP synchronization is handled by various
    methods, including
  • critical sections
  • single-point barrier
  • ordered sections of a parallel loop that have to
    be performed in the specified order
  • locks
  • subroutine calls
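  • The sketch below (C OpenMP) combines, in one
    parallel region, several of the constructs listed
    above: sections, single, master, a critical
    section, and a barrier. It is only an illustration
    of the directives, not a useful computation.

```c
/* Sketch of work distribution and synchronization constructs.
 * Compile with, e.g., gcc -fopenmp sync.c
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int total = 0;

    #pragma omp parallel
    {
        /* Code sections distributed to already-running threads.    */
        #pragma omp sections
        {
            #pragma omp section
            { printf("section A on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("section B on thread %d\n", omp_get_thread_num()); }
        }   /* implicit barrier at the end of sections               */

        /* Executed by one (and only one) thread.                    */
        #pragma omp single
        printf("single: executed once\n");

        /* Critical section: one thread at a time updates total.     */
        #pragma omp critical
        total += 1;

        /* Explicit barrier: all threads wait here.                  */
        #pragma omp barrier

        /* Executed only by the master thread (no implied barrier).  */
        #pragma omp master
        printf("master: total = %d\n", total);
    }
    return 0;
}
```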

8
OpenMP (cont)
  • Memory Consistency
  • The flush directive allows the programmer to
    force a consistent view of memory by all
    processors at the point where it occurs (see the
    sketch at the end of this slide).
  • Needed because assignments to variables may
    become visible to different processors at
    different times due to the hierarchically
    structured memory.
  • Understanding this fully requires more detail
    about the memory system.
  • Also needed because a shared variable may be
    stored in a register, and hence not be visible to
    other processors.
  • It is not practical to forbid storing shared
    variables in registers.
  • Must identify program points and/or variables for
    which mutual visibility affects program
    correctness.
  • The compiler can recognize explicit points where
    synchronization is needed.
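  • A sketch of the flush directive in C OpenMP,
    patterned on the usual producer/consumer
    handshake: one thread writes a value and then a
    flag, flushing between the two writes; another
    thread spins on the flag and flushes again before
    reading the value. It assumes the team actually
    gets two threads.

```c
/* Sketch of the flush directive: thread 0 produces a value and sets a
 * flag; thread 1 spins on the flag, then reads the value.
 * Compile with, e.g., gcc -fopenmp flush.c
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                  /* produce the value             */
            #pragma omp flush(data)     /* make data visible first       */
            flag = 1;
            #pragma omp flush(flag)     /* then publish the flag         */
        } else {
            int ready = 0;
            while (!ready) {            /* spin until the flag is set    */
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)     /* data is now guaranteed visible */
            printf("consumer saw data = %d\n", data);
        }
    }
    return 0;
}
```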

9
OpenMP (cont)
  • Two extreme philosophies for parallelizing
    programs
  • In the minimal approach, parallel constructs are
    placed only where large amounts of independent
    data are processed (see the sketch after this
    list).
  • Typically use nested loops
  • Rest of program is executed sequentially.
  • One problem is that it may not exploit all of the
    parallelism available in the program.
  • Process creation and termination may be invoked
    many times, and their cost may be high.
  • The other extreme is the SPMD approach, which
    treats the entire program as parallel code.
  • Steps are serialized only when required by
    program logic.
  • Many programs are a mixture of these two
    parallelizing extremes. (Examples are given in
    [9], pp. 152-158.)
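  • A sketch of the minimal approach in C OpenMP: a
    single parallel construct is placed around the one
    nested loop that processes a large amount of
    independent data (here a matrix-vector product
    with a hypothetical size), while setup and output
    stay sequential.

```c
/* Sketch of the "minimal" parallelization approach: only the
 * expensive nested loop is parallel.
 * Compile with, e.g., gcc -fopenmp minimal.c
 */
#include <omp.h>
#include <stdio.h>

#define N 512

static double a[N][N], x[N], y[N];

int main(void)
{
    /* Sequential setup. */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? 2.0 : 0.0;
    }

    /* Parallel construct placed only around the expensive kernel
     * (matrix-vector product); the rows are independent.           */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += a[i][j] * x[j];
        y[i] = sum;
    }

    /* Sequential output. */
    printf("y[0] = %.1f, y[N-1] = %.1f\n", y[0], y[N - 1]);
    return 0;
}
```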

10
OpenMP Language: Additional References
  • The references below may be more accessible than
    [8], Jordan and Alaghband, which was used as the
    primary reference here.
  • The Ohio Supercomputer Center (OSC, www.osc.org)
    has an online WebCT course on OpenMP. All you
    have to do is create a user name and password.
  • The textbook, Introduction to Parallel Computing
    by Kumar et al. [25], has a section/chapter on
    OpenMP.
  • The "Parallel Computing Sourcebook" [23]
    discusses OpenMP at a number of places, but
    particularly on pp. 301-303 and 323-329.
  • Chapter 10 gives a short overview and comparison
    of message passing and multithreaded programming.

11
Symmetric Multiprocessors or SMPs
  • An SMP is a shared memory multiprocessor whose
    processors are symmetric:
  • Multiple, identical processors
  • Any processor can do anything (e.g., access I/O)
  • Only shared memory.
  • Currently the primary example of shared memory
    multiprocessors.
  • A very popular type of computer (with a number of
    variations). See 20 for additional information.
  • FOR INFORMATION ON PROBLEMS THAT SERIOUSLY LIMIT
    PERFORMANCE OF SMPS, SEE
  • [20] Gregory Pfister, In Search of Clusters: The
    Ongoing Battle in Lowly Parallel Computing,
    Second Edition, Prentice Hall PTR, 1998, Ch 6, 7,
    9, 13.
  • ABOVE INFORMATION TO BE ADDED IN THE FUTURE

12
Distributed Memory Multiprocessors
  • References
  • [1] Wilkinson and Allen, Ch 1-2
  • [3] Quinn, Chapter 1
  • [8] Jordan and Alaghband, Chapter 5
  • [25] Kumar, Grama, Gupta, and Karypis,
    Introduction to Parallel Computing, 2nd Edition,
    Ch 2
  • General Characteristics
  • In a distributed memory system, each memory cell
    belongs to a particular processor.
  • In order for data to be available to a processor,
    it must be stored in the local memory of that
    processor.
  • Data produced by one processor that is needed by
    other processors must be moved to the memory of
    the other processors.
  • The data movement is usually handled by message
    passing using send and receive commands (see the
    sketch at the end of this slide).
  • The data transmissions between processors have a
    huge impact on performance.
  • The distribution of the data among the processors
    is a very important factor in the performance
    efficiency.
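  • A minimal sketch of this data movement using MPI
    send and receive in C: process 0 produces an array
    in its own local memory and explicitly sends it to
    process 1, which must receive it into its local
    memory before using it. The buffer size and values
    are arbitrary.

```c
/* Sketch of explicit data movement in a distributed memory system.
 * Compile/run with, e.g., mpicc sendrecv.c && mpirun -np 2 ./a.out
 */
#include <mpi.h>
#include <stdio.h>

#define N 100

int main(int argc, char *argv[])
{
    int rank;
    double buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++)      /* data exists only in       */
            buf[i] = 0.5 * i;            /* process 0's local memory  */
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);     /* copy into local memory    */
        printf("process 1 received buf[N-1] = %.1f\n", buf[N - 1]);
    }

    MPI_Finalize();
    return 0;
}
```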

13
Some Interconnection Network Terminology
  • A link is the connection between two nodes.
  • A switch that enables packets to be routed
    through the node to other nodes without
    disturbing the processor is assumed.
  • The link between two nodes can either be
    bidirectional or use two directional links.
  • Either one wire to carry one bit or parallel
    wires (one wire for each bit in word) can be
    used.
  • The above choices do not have a major impact on
    the concepts presented in this course.
  • The terminology below is given in [1] and will
    occasionally be needed.
  • The bandwidth is the number of bits that can be
    transmitted in unit time (i.e., bits per second).
  • The network latency is the time required to
    transfer a message through the network.
  • The communication latency is the total time
    required to send a message, including software
    overhead and interface delay.
  • The message latency or startup time is the time
    required to send a zero-length message (a worked
    sketch follows this list).
  • It consists of software and hardware overhead,
    such as
  • finding a route
  • packing and unpacking the message
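  • A small worked sketch (plain C) of these terms,
    using the common first-order model in which
    communication latency is approximately the startup
    time plus message length divided by bandwidth. The
    startup time, bandwidth, and message sizes below
    are hypothetical numbers chosen only to show how
    the terms combine.

```c
/* Worked sketch of the latency terms, using time = startup + L/B. */
#include <stdio.h>

int main(void)
{
    double startup   = 50e-6;      /* message latency (startup), seconds   */
    double bandwidth = 1e9;        /* bits per second                      */
    double bits      = 8.0 * 1e6;  /* a 1 MB message = 8e6 bits            */

    double transfer = bits / bandwidth;    /* pure transmission time        */
    double total    = startup + transfer;  /* communication latency estimate */

    printf("transfer time = %.6f s\n", transfer);
    printf("total latency = %.6f s\n", total);

    /* For a short message (say 100 bytes) the startup term dominates. */
    double small = startup + (8.0 * 100) / bandwidth;
    printf("100-byte message ~ %.6f s (mostly startup)\n", small);
    return 0;
}
```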

14
Communication Methods
  • Two basic ways of transferring messages from
    source to destination (see [1], [25]).
  • Circuit switching
  • Establishing a path and allowing the entire
    message to transfer uninterrupted.
  • Similar to telephone connection that is held
    until the end of the call.
  • Links are not available to other messages until
    the transfer is complete.
  • Latency (or message transfer time): if the length
    of the control packet sent to establish the path
    is small wrt (with respect to) the message
    length, the latency is essentially the constant
    L/B, where L is the message length and B is the
    bandwidth.
  • Packet switching
  • Message is divided into packets of information
  • Each packet includes source and destination
    addresses.
  • Packets cannot exceed a fixed maximum size
    (e.g., 1000 bytes).
  • A packet is stored in a node in a buffer until it
    can move to the next node.

15
Communications (cont)
  • At each node, the destination information is
    examined and used to select the node to which the
    packet is forwarded.
  • Routing algorithms (often probabilistic) are used
    to avoid hot spots and to minimize traffic jams.
  • Significant latency is created by storing each
    packet in each node it reaches.
  • Latency increases linearly with the length of the
    route.
  • Store-and-forward packet switching is the name
    used to describe the preceding packet switching.
  • Virtual cut-through packet switching can be used
    to reduce the latency.
  • It allows a packet to pass through a node without
    being stored, if the outgoing link is available.
  • If the complete path is available, a message can
    move immediately from source to destination.
  • Wormhole routing is an alternative to
    store-and-forward packet routing.
  • A message is divided into small units called
    flits (flow control units).
  • Flits are 1-2 bytes in size.
  • They can be transferred in parallel on links with
    multiple wires.
  • Only the head flit is initially transferred when
    the next link becomes available.

16
Communications (cont)
  • As each flit moves forward, the next flit can
    move forward.
  • The entire path must be reserved for a message as
    these flits pull each other along (like cars of a
    train).
  • Request/acknowledge bit messages are required to
    coordinate these pull-along moves (see [1]).
  • The complete path must be reserved, as these
    flits are linked together.
  • Latency: if the header flit is very small
    compared to the length of the message, then the
    latency is essentially the constant L/B, with L
    the message length and B the link bandwidth (see
    the sketch at the end of this slide).
  • Deadlock
  • Routing algorithms are needed to find a path
    between the nodes.
  • Adaptive routing algorithms choose different
    paths, depending on traffic conditions.
  • Livelock is a deadlock-type situation where a
    packet continues to go around the network,
    without ever reaching its destination.
  • Deadlock: no packet can be forwarded because each
    is blocked by other stored packets waiting to be
    forwarded.
  • Input/Output: a significant problem on all
    parallel computers.
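  • A sketch (plain C) comparing the latency models
    from the last two slides, treating the message as
    a single packet that is fully stored at every node
    for store-and-forward, while wormhole/cut-through
    pays the per-hop cost only for a small header.
    Link bandwidth, message length, header size, and
    hop count are hypothetical, and contention is
    ignored.

```c
/* Comparing store-and-forward and wormhole/cut-through latencies
 * under simple, uniform-link assumptions. */
#include <stdio.h>

int main(void)
{
    double B      = 1e9;        /* link bandwidth, bits per second      */
    double L      = 8.0 * 1e6;  /* message length: 1 MB = 8e6 bits      */
    double header = 8.0 * 16;   /* header/flit size: 16 bytes in bits   */
    int    hops   = 10;         /* number of links on the route         */

    double store_forward = hops * (L / B);              /* whole message per hop */
    double wormhole      = L / B + hops * (header / B); /* header per hop only   */

    printf("store-and-forward:    %.6f s\n", store_forward);
    printf("wormhole/cut-through: %.6f s (close to L/B = %.6f s)\n",
           wormhole, L / B);
    return 0;
}
```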

17
Languages for Distributed Memory Multiprocessors
  • HPF is a data parallel programming language that
    is supported by most distributed memory
    multiprocessors.
  • Good for applications where data can be stored
    and processed as vectors.
  • Message passing is generated by the compiler for
    each machine and is hidden from the programmer.
  • MPI is a message passing language that can be
    used to support both data parallel and control
    parallel programming.
  • MPI commands are low level and very error prone.
  • Programs are typically long due to low level
    commands.