MIMD COMPUTERS OR MULTIPROCESSORS
1
MIMD COMPUTERS OR MULTIPROCESSORS
  • References
  • [8] Jordan and Alaghband, Fundamentals of
    Parallel Algorithms, Architectures, Languages,
    Prentice Hall, Chapters 4 and 5.
  • [20] Gregory Pfister, In Search of Clusters: The
    Ongoing Battle in Lowly Parallel Computing,
    Second Edition, Prentice Hall PTR, 1998, Ch 6, 7,
    9, 13.
  • This chapter is a continuation of the brief
    coverage of MIMDs in our introductory chapter.
  • In practice, the name MIMD usually refers to a
    type of parallel computer, while multiprocessor
    is the more common term for this style of
    computing.
  • Defn (see [8]): A multiprocessor is a single
    integrated system that contains multiple
    processors, each capable of executing an
    independent stream of instructions, but with one
    integrated system for moving data among
    processors, memory, and I/O devices.
  • If data are transferred among processors (PEs)
    infrequently, possibly in large chunks, with long
    periods of independent computing in between, the
    multiprocessing is called coarse grained or
    loosely coupled.
  • In fine grained computation, or tightly coupled
    computation, small amounts of data (i.e., one or
    a few words) are communicated frequently.

2
Shared Memory Multiprocessors
  • There is a wide variation in the types of shared
    memory multiprocessors.
  • A shared memory multiprocessor in which some
    memory locations take longer to access than
    others is called a NUMA (for NonUniform Memory
    Access) machine.
  • One in which all memory locations take the same
    time to access is called a UMA (Uniform Memory
    Access) machine.
  • See the earlier discussion in our Ch 1.
  • Some shared memory processors allow each
    processor to have its own private memory as well
    as to have shared memory.
  • An interconnection network (e.g., a ring, 2D
    mesh, or a hypercube) is used to connect all
    processors to the shared memory.
  • Characteristics of Shared Memory Multiprocessors
  • Interprocessor communication is done in the
    memory interface by read and write instructions
  • Memory may be physically distributed and the
    reads and writes from different processors may
    take different time and may collide in the
    interconnection network.
  • Memory latency (i.e., time to complete a read or
    write) may be long and variable.
  • Messages through the interconnection network are
    the size of single memory words.

3
  • Randomization of requests (as by interleaving
    words across memory modules) may be used to
    reduce the probability of collisions.
  • Contrasting characteristics of message-passing
    multiprocessors
  • Interprocessor communication is done by software
    using data transmission instructions (e.g., send,
    receive).
  • Read and write refer only to memory private to
    the processor issuing them.
  • Data may be aggregated into a long message before
    being sent through the interconnection network.
  • Large data transmissions may mask long and
    variable latency in the communications network.
  • Global scheduling of communications can help
    avoid collisions between long messages
  • SPMD (single program, multiple data) programs
  • About the only practical choice for managing a
    huge number of processes (i.e., hundreds, perhaps
    thousands).
  • Multiple processes execute the same program
    simultaneously but normally not synchronously.
  • Writing distinct programs for a large number of
    processes is not feasible (see the sketch below).
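  • The following minimal sketch illustrates the SPMD
    idea in C using MPI (message passing is discussed
    later in these notes): every process runs the same
    program and chooses its share of the work from its
    rank. The problem size and the block partitioning
    are hypothetical choices made only for
    illustration.

```c
/* Minimal SPMD sketch: every process runs this same program and
 * selects its own work based on its rank.
 * Compile/run with, e.g., mpicc spmd.c && mpirun -np 4 ./a.out
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many processes?  */

    /* Same program, different data: each process handles its own
     * slice of the index range 0..999 (hypothetical problem size). */
    int n = 1000;
    int lo = rank * n / nprocs;
    int hi = (rank + 1) * n / nprocs;
    long local_sum = 0;
    for (int i = lo; i < hi; i++)
        local_sum += i;

    printf("process %d of %d summed indices [%d,%d): %ld\n",
           rank, nprocs, lo, hi, local_sum);

    MPI_Finalize();
    return 0;
}
```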

4
The OpenMP Language Extension for Shared Memory
Multiprocessors
  • OpenMP is a language extension built on top of an
    existing sequential language.
  • OpenMP extensions exist for both C/C++ and
    Fortran.
  • When it is necessary to refer to a specific
    version, we will refer to the Fortran77 version.
  • Can be contrasted with F90 vector extensions.
  • OpenMP constructs are limited to compiler
    directives and library subroutine calls.
  • The compiler directives are formatted so that
    they will be treated as comments by a sequential
    compiler.
  • This allows existing sequential compilers to
    easily be modified to support OpenMP.
  • Whether a program executes the same computation
    (or any meaningful computation) when executed
    sequentially is the responsibility of the
    programmer.
  • Execution starts with a sequential process that
    forks a fixed number of threads when it reaches a
    parallel region.
  • This team of threads executes to the end of the
    parallel region and then joins the original
    process (see the sketch below).
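  • A minimal fork/join sketch, written in C OpenMP
    rather than the Fortran77 version used in these
    notes: a sequential process forks a team of
    threads at a parallel region and joins it at the
    end. The gcc flag in the comment is only one way
    to build it.

```c
/* Minimal OpenMP fork/join sketch.
 * Compile with, e.g., gcc -fopenmp forkjoin.c
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    printf("sequential part: one initial thread\n");

    #pragma omp parallel            /* fork a team of threads here  */
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        printf("thread %d of %d inside the parallel region\n", id, nt);
    }                               /* implicit join at region end  */

    printf("sequential part again: back to the original thread\n");
    return 0;
}
```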

5
OpenMP (cont)
  • The number of threads is constant within a
    parallel region.
  • Different parallel regions can have a different
    number of threads.
  • Nested parallelism is supported by allowing a
    thread to fork a new team of threads at the
    beginning of a nested parallel region.
  • A thread that forks other threads is called the
    master thread of the team.
  • User controlled environment variables
  • num_threads specifies the number of threads.
  • dynamic controls whether the number of threads
    can change from one parallel section to another.
  • nested specifies whether nested parallelism is
    allowed or whether nested parallel regions are
    executed sequentially.
  • Process Control
  • Parallel regions are bracketed by parallel and
    end parallel directives.
  • The directives parallel do and parallel sections
    can be used to combine parallel regions with work
    distribution (see the sketch below).
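  • A sketch of thread-count control and a combined
    work-distribution directive in C OpenMP (parallel
    for is the C analogue of Fortran's parallel do).
    The environment variables named informally above
    correspond in standard OpenMP to OMP_NUM_THREADS,
    OMP_DYNAMIC, and OMP_NESTED, or equivalently to
    the library calls used below; the team sizes
    chosen here are arbitrary.

```c
/* Sketch of thread-count control and combined parallel-region/loop
 * work distribution.  Compile with, e.g., gcc -fopenmp control.c
 */
#include <omp.h>
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N];

    omp_set_dynamic(0);       /* "dynamic": keep the team size fixed          */
    omp_set_num_threads(4);   /* "num_threads": request 4 threads from now on */

    /* Combined parallel region + loop work distribution: the loop
     * iterations are divided among the team's threads.             */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    /* A different parallel region may use a different team size.   */
    #pragma omp parallel num_threads(2)
    {
        printf("region 2: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    printf("a[N-1] = %.1f\n", a[N - 1]);
    return 0;
}
```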

6
OpenMP (cont)
  • The term parallel construct denotes a parallel
    region or a block-structured work distribution
    contained in a parallel region.
  • The static scope of a parallel region consists of
    all statements between the start and end
    statement in that construct.
  • The dynamic scope of a parallel region consists
    of all statements executed by a team member
    between the entry to and exit from this
    construct.
  • This may include statements outside of the static
    scope of the parallel region.
  • Parallel directives that lie in the dynamic scope
    but outside the static scope of a parallel
    construct are called orphan directives.
  • These orphan directives pose special problems for
    the compiler.
  • An SPMD-style program could be written in OpenMP
    by entering a parallel region at the beginning of
    the main program and exiting it just before the
    end (see the sketch below).
  • This includes the entire program in the dynamic
    scope of the parallel region.
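  • A sketch of an SPMD-style OpenMP program in C
    containing an orphaned work-sharing directive: the
    omp for in do_work() is in the dynamic scope of
    the parallel region in main() but outside its
    static (lexical) scope. The function and array
    names are purely illustrative.

```c
/* SPMD-style OpenMP program with an orphaned work-sharing directive.
 * Compile with, e.g., gcc -fopenmp orphan.c
 */
#include <omp.h>
#include <stdio.h>

#define N 16
static double x[N];

/* The work-sharing directive here is "orphaned": it binds to whatever
 * parallel region is active when do_work() is called.               */
static void do_work(void)
{
    #pragma omp for
    for (int i = 0; i < N; i++)
        x[i] = (double)i * i;
}

int main(void)
{
    /* SPMD style: essentially the whole program is one parallel region. */
    #pragma omp parallel
    {
        do_work();    /* every thread enters; iterations are shared  */

        #pragma omp single
        printf("x[N-1] = %.1f\n", x[N - 1]);
    }
    return 0;
}
```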

7
OpenMP (cont)
  • Work Distribution consists of parallel loops,
    parallel code sections, single-thread execution,
    and master-thread execution (a combined sketch
    follows this slide's list).
  • Parallel code sections emphasize distributing
    code sections to parallel processes that are
    already running, rather than forking new
    processes.
  • The section between single and end single is
    executed by one (and only one) thread.
  • The code between master and end master is
    executed by the master thread and is often done
    to provide synchronization.
  • OpenMP synchronization is handled by various
    methods, including
  • critical sections
  • single-point barrier
  • ordered sections of a parallel loop that have to
    be performed in the specified order
  • locks
  • subroutine calls
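  • The sketch below (C OpenMP) combines, in one
    parallel region, several of the constructs listed
    above: sections, single, master, a critical
    section, and a barrier. It is only an illustration
    of the directives, not a useful computation.

```c
/* Sketch of work distribution and synchronization constructs.
 * Compile with, e.g., gcc -fopenmp sync.c
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int total = 0;

    #pragma omp parallel
    {
        /* Code sections distributed to already-running threads.    */
        #pragma omp sections
        {
            #pragma omp section
            { printf("section A on thread %d\n", omp_get_thread_num()); }
            #pragma omp section
            { printf("section B on thread %d\n", omp_get_thread_num()); }
        }   /* implicit barrier at the end of sections               */

        /* Executed by one (and only one) thread.                    */
        #pragma omp single
        printf("single: executed once\n");

        /* Critical section: one thread at a time updates total.     */
        #pragma omp critical
        total += 1;

        /* Explicit barrier: all threads wait here.                  */
        #pragma omp barrier

        /* Executed only by the master thread (no implied barrier).  */
        #pragma omp master
        printf("master: total = %d\n", total);
    }
    return 0;
}
```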

8
OpenMP (cont)
  • Memory Consistency
  • The flush directive allows the programmer to
    force a consistent view of memory by all
    processors at the point where it occurs (see the
    sketch at the end of this slide).
  • Needed because assignments to variables may
    become visible to different processors at
    different times due to the hierarchically
    structured memory.
  • Understanding this fully requires more detail
    about the memory system.
  • Also needed because a shared variable may be
    stored in a register, and hence not be visible to
    other processors.
  • It is not practical to forbid storing shared
    variables in registers.
  • Must identify program points and/or variables for
    which mutual visibility affects program
    correctness.
  • The compiler can recognize explicit points where
    synchronization is needed.
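  • A sketch of the flush directive in C OpenMP,
    patterned on the usual producer/consumer
    handshake: one thread writes a value and then a
    flag, flushing between the two writes; another
    thread spins on the flag and flushes again before
    reading the value. It assumes the team actually
    gets two threads.

```c
/* Sketch of the flush directive: thread 0 produces a value and sets a
 * flag; thread 1 spins on the flag, then reads the value.
 * Compile with, e.g., gcc -fopenmp flush.c
 */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                  /* produce the value             */
            #pragma omp flush(data)     /* make data visible first       */
            flag = 1;
            #pragma omp flush(flag)     /* then publish the flag         */
        } else {
            int ready = 0;
            while (!ready) {            /* spin until the flag is set    */
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)     /* data is now guaranteed visible */
            printf("consumer saw data = %d\n", data);
        }
    }
    return 0;
}
```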

9
OpenMP (cont)
  • Two extreme philosophies for parallelizing
    programs
  • In the minimal approach, parallel constructs are
    placed only where large amounts of independent
    data are processed (see the sketch after this
    list).
  • Typically use nested loops
  • Rest of program is executed sequentially.
  • One problem is that it may not exploit all of the
    parallelism available in the program.
  • Process creation and termination may be invoked
    many times, and their cost may be high.
  • The other extreme is the SPMD approach, which
    treats the entire program as parallel code.
  • Steps are serialized only when required by
    program logic.
  • Many programs are a mixture of these two
    parallelizing extremes. (Examples are given in
    [9], pp. 152-158.)
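  • A sketch of the minimal approach in C OpenMP: a
    single parallel construct is placed around the one
    nested loop that processes a large amount of
    independent data (here a matrix-vector product
    with a hypothetical size), while setup and output
    stay sequential.

```c
/* Sketch of the "minimal" parallelization approach: only the
 * expensive nested loop is parallel.
 * Compile with, e.g., gcc -fopenmp minimal.c
 */
#include <omp.h>
#include <stdio.h>

#define N 512

static double a[N][N], x[N], y[N];

int main(void)
{
    /* Sequential setup. */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? 2.0 : 0.0;
    }

    /* Parallel construct placed only around the expensive kernel
     * (matrix-vector product); the rows are independent.           */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += a[i][j] * x[j];
        y[i] = sum;
    }

    /* Sequential output. */
    printf("y[0] = %.1f, y[N-1] = %.1f\n", y[0], y[N - 1]);
    return 0;
}
```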

10
OpenMP Language: Additional References
  • The references below may be more accessible than
    [8], Jordan and Alaghband, which was used as the
    primary reference here.
  • The Ohio Supercomputer Center (OSC, www.osc.org)
    has an online WebCT course on OpenMP. All you
    have to do is create a user name and password.
  • The textbook, Introduction to Parallel Computing
    by Kumar et al. [25], has a section/chapter on
    OpenMP.
  • The "Parallel Computing Sourcebook" [23]
    discusses OpenMP at a number of places, but
    particularly on pp. 301-303 and 323-329.
  • Chapter 10 gives a short overview and comparison
    of message passing and multithreaded programming.

11
Symmetric Multiprocessors or SMPs
  • An SMP is a shared memory multiprocessor whose
    processors are symmetric:
  • Multiple, identical processors
  • Any processor can do anything (e.g., access I/O)
  • Only shared memory.
  • Currently the primary example of shared memory
    multiprocessors.
  • A very popular type of computer (with a number of
    variations). See 20 for additional information.
  • FOR INFORMATION ON PROBLEMS THAT SERIOUSLY LIMIT
    PERFORMANCE OF SMPS, SEE
  • [20] Gregory Pfister, In Search of Clusters: The
    Ongoing Battle in Lowly Parallel Computing,
    Second Edition, Prentice Hall PTR, 1998, Ch 6, 7,
    9, 13.
  • ABOVE INFORMATION TO BE ADDED IN THE FUTURE

12
Distributed Memory Multiprocessors
  • References
  • [1] Wilkinson and Allen, Ch 1-2
  • [3] Quinn, Chapter 1
  • [8] Jordan and Alaghband, Chapter 5
  • [25] Kumar, Grama, Gupta, and Karypis,
    Introduction to Parallel Computing, 2nd Edition,
    Ch 2
  • General Characteristics
  • In a distributed memory system, each memory cell
    belongs to a particular processor.
  • In order for data to be available to a processor,
    it must be stored in the local memory of that
    processor.
  • Data produced by one processor that is needed by
    other processors must be moved to the memory of
    the other processors.
  • The data movement is usually handled by message
    passing using send and receive commands (see the
    sketch at the end of this slide).
  • The data transmissions between processors have a
    huge impact on performance.
  • The distribution of the data among the processors
    is a very important factor in the performance
    efficiency.
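  • A minimal sketch of this data movement using MPI
    send and receive in C: process 0 produces an array
    in its own local memory and explicitly sends it to
    process 1, which must receive it into its local
    memory before using it. The buffer size and values
    are arbitrary.

```c
/* Sketch of explicit data movement in a distributed memory system.
 * Compile/run with, e.g., mpicc sendrecv.c && mpirun -np 2 ./a.out
 */
#include <mpi.h>
#include <stdio.h>

#define N 100

int main(int argc, char *argv[])
{
    int rank;
    double buf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++)      /* data exists only in       */
            buf[i] = 0.5 * i;            /* process 0's local memory  */
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);     /* copy into local memory    */
        printf("process 1 received buf[N-1] = %.1f\n", buf[N - 1]);
    }

    MPI_Finalize();
    return 0;
}
```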

13
Some Interconnection Network Terminology
  • A link is the connection between two nodes.
  • A switch that enables packets to be routed
    through the node to other nodes without
    disturbing the processor is assumed.
  • The link between two nodes can either be
    bidirectional or use two directional links.
  • Either one wire to carry one bit or parallel
    wires (one wire for each bit in word) can be
    used.
  • The above choices do not have a major impact on
    the concepts presented in this course.
  • The terminology below is given in [1] and will
    occasionally be needed.
  • The bandwidth is the number of bits that can be
    transmitted in unit time (i.e., bits per second).
  • The network latency is the time required to
    transfer a message through the network.
  • The communication latency is the total time
    required to send a message, including software
    overhead and interface delay.
  • The message latency or startup time is the time
    required to send a zero-length message (a worked
    sketch follows this list).
  • It consists of software and hardware overhead,
    such as
  • finding a route
  • packing and unpacking the message
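  • A small worked sketch (plain C) of these terms,
    using the common first-order model in which
    communication latency is approximately the startup
    time plus message length divided by bandwidth. The
    startup time, bandwidth, and message sizes below
    are hypothetical numbers chosen only to show how
    the terms combine.

```c
/* Worked sketch of the latency terms, using time = startup + L/B. */
#include <stdio.h>

int main(void)
{
    double startup   = 50e-6;      /* message latency (startup), seconds   */
    double bandwidth = 1e9;        /* bits per second                      */
    double bits      = 8.0 * 1e6;  /* a 1 MB message = 8e6 bits            */

    double transfer = bits / bandwidth;    /* pure transmission time        */
    double total    = startup + transfer;  /* communication latency estimate */

    printf("transfer time = %.6f s\n", transfer);
    printf("total latency = %.6f s\n", total);

    /* For a short message (say 100 bytes) the startup term dominates. */
    double small = startup + (8.0 * 100) / bandwidth;
    printf("100-byte message ~ %.6f s (mostly startup)\n", small);
    return 0;
}
```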

14
Communication Methods
  • Two basic ways of transferring messages from
    source to destination (see [1], [25]).
  • Circuit switching
  • Establishing a path and allowing the entire
    message to transfer uninterrupted.
  • Similar to telephone connection that is held
    until the end of the call.
  • Links are not available to other messages until
    the transfer is complete.
  • Latency (or message transfer time): if the length
    of the control packet sent to establish the path
    is small wrt (with respect to) the message
    length, the latency is essentially the constant
    L/B, where L is the message length and B is the
    bandwidth.
  • Packet switching
  • Message is divided into packets of information
  • Each packet includes source and destination
    addresses.
  • Packets cannot exceed a fixed maximum size
    (e.g., 1000 bytes).
  • A packet is stored in a node in a buffer until it
    can move to the next node.

15
Communications (cont)
  • At each node, the destination information is
    examined and used to select the node to which the
    packet is forwarded.
  • Routing algorithms (often probabilistic) are used
    to avoid hot spots and to minimize traffic jams.
  • Significant latency is created by storing each
    packet in each node it reaches.
  • Latency increases linearly with the length of the
    route.
  • Store-and-forward packet switching is the name
    used to describe the preceding packet switching.
  • Virtual cut-through packet switching can be used
    to reduce the latency.
  • It allows a packet to pass through a node without
    being stored, if the outgoing link is available.
  • If the complete path is available, a message can
    move immediately from source to destination.
  • Wormhole routing is an alternative to
    store-and-forward packet routing.
  • A message is divided into small units called
    flits (flow control units).
  • Flits are 1-2 bytes in size.
  • They can be transferred in parallel on links with
    multiple wires.
  • Only the head flit is initially transferred when
    the next link becomes available.

16
Communications (cont)
  • As each flit moves forward, the next flit can
    move forward.
  • The entire path must be reserved for a message as
    these flits pull each other along (like cars of a
    train).
  • Request/acknowledge bit messages are required to
    coordinate these pull-along moves (see [1]).
  • The complete path must be reserved, as these
    flits are linked together.
  • Latency: if the header flit is very small
    compared to the length of the message, then the
    latency is essentially the constant L/B, with L
    the message length and B the link bandwidth (see
    the sketch at the end of this slide).
  • Deadlock
  • Routing algorithms are needed to find a path
    between the nodes.
  • Adaptive routing algorithms choose different
    paths, depending on traffic conditions.
  • Livelock is a deadlock-type situation where a
    packet continues to go around the network,
    without ever reaching its destination.
  • Deadlock: no packet can be forwarded because each
    is blocked by other stored packets waiting to be
    forwarded.
  • Input/Output: a significant problem on all
    parallel computers.
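  • A sketch (plain C) comparing the latency models
    from the last two slides, treating the message as
    a single packet that is fully stored at every node
    for store-and-forward, while wormhole/cut-through
    pays the per-hop cost only for a small header.
    Link bandwidth, message length, header size, and
    hop count are hypothetical, and contention is
    ignored.

```c
/* Comparing store-and-forward and wormhole/cut-through latencies
 * under simple, uniform-link assumptions. */
#include <stdio.h>

int main(void)
{
    double B      = 1e9;        /* link bandwidth, bits per second      */
    double L      = 8.0 * 1e6;  /* message length: 1 MB = 8e6 bits      */
    double header = 8.0 * 16;   /* header/flit size: 16 bytes in bits   */
    int    hops   = 10;         /* number of links on the route         */

    double store_forward = hops * (L / B);              /* whole message per hop */
    double wormhole      = L / B + hops * (header / B); /* header per hop only   */

    printf("store-and-forward:    %.6f s\n", store_forward);
    printf("wormhole/cut-through: %.6f s (close to L/B = %.6f s)\n",
           wormhole, L / B);
    return 0;
}
```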

17
Languages for Distributed Memory Multiprocessors
  • HPF is a data parallel programming language that
    is supported by most distributed memory
    multiprocessors.
  • Good for applications where data can be stored
    and processed as vectors.
  • Message passing is generated by the compiler for
    each machine and is hidden from the programmer.
  • MPI is a message passing language that can be
    used to support both data parallel and control
    parallel programming.
  • MPI commands are low level and very error prone.
  • Programs are typically long due to low level
    commands.