Title: An Overview of Parallel Computing
2. Hardware
- There are many varieties of parallel computing hardware and many different architectures.
- The original classification of parallel computers is popularly known as Flynn's taxonomy.
- Flynn classified systems according to the number of instruction streams and the number of data streams:
  1. SISD (the von Neumann machine)
  2. MIMD (the most general class: a collection of autonomous processors that operate on their own data streams)
  3. SIMD
  4. MISD
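The four classes can be read as a two-by-two table indexed by "one or many instruction streams" and "one or many data streams". The following is a minimal sketch of that lookup; the enum and the function name `flynn_class` are hypothetical names chosen for illustration.

```c
#include <assert.h>

enum flynn { SISD, SIMD, MISD, MIMD };

/* Classify a machine by its stream counts (hypothetical helper).
   Both counts must be at least 1; "multiple" means more than one. */
enum flynn flynn_class(int instruction_streams, int data_streams)
{
    assert(instruction_streams >= 1 && data_streams >= 1);
    if (instruction_streams == 1)
        return data_streams == 1 ? SISD : SIMD;
    return data_streams == 1 ? MISD : MIMD;
}
```

For example, `flynn_class(1, 1)` yields `SISD`, and a machine with several instruction streams and several data streams yields `MIMD`.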
3. The Classical von Neumann Machine
- The machine is divided into a CPU and main memory.
- The CPU is divided into a control unit and an arithmetic-logic unit (ALU).
- The memory stores both instructions and data.
- The control unit directs the execution of programs.
- The ALU carries out the calculations. While they are being used by the program, instructions and data are stored in very fast memory locations called registers. Both data and instructions are moved between memory and the registers in the CPU via a bus.
- A bus is a collection of parallel wires; faster buses have more wires.
4. The Classical von Neumann Machine
- To be useful, some additional devices are needed, including input devices, output devices, and extended storage devices (such as disks).
- The bottleneck is the transfer of data and instructions between memory and the CPU.
- Few computers use the classical von Neumann machine unmodified.
- Most machines now have a hierarchical memory. A cache is used to achieve faster access.
5. Pipeline and Vector Architecture
- The first widely used extension to the von Neumann model was pipelining.
- Suppose we have the following program:
    float x[100], y[100], z[100];
    for (i = 0; i < 100; i++)
        z[i] = x[i] + y[i];
- Further, suppose a single addition consists of the following operations:
  1. Fetch the operands from memory
  2. Compare exponents
  3. Shift one operand
  4. Add
  5. Normalize the result
  6. Store the result in memory
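The payoff of pipelining the six operations listed above can be estimated with a simple cycle count. The sketch below assumes that every stage takes exactly one cycle and that the pipeline never stalls; both assumptions, and the function names, are illustrative and not taken from the slides.

```c
#include <assert.h>

/* Cycles for n additions when each one passes through `stages`
   one-cycle stages and no two additions overlap. */
long unpipelined_cycles(long n, long stages)
{
    assert(n >= 0 && stages >= 1);
    return n * stages;
}

/* Cycles when a new addition enters the pipeline every cycle:
   the first result appears after `stages` cycles, and each of the
   remaining n - 1 results follows one cycle later. */
long pipelined_cycles(long n, long stages)
{
    assert(n >= 0 && stages >= 1);
    return n == 0 ? 0 : stages + (n - 1);
}
```

Under these assumptions, the 100-element loop with six stages needs 600 cycles without pipelining and 105 cycles with it.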
6. Pipeline and Vector Architecture
- A further improvement is to add vector instructions.
- With vector instructions, each of the basic instructions only needs to be issued once: one short instruction encodes N operations.
- Another improvement is the use of multiple memory banks.
- Different authors place vector processors in different categories (MISD, a variant of SIMD, or even not really parallel machines).
- Examples: CRAY C90 and NEC SX4
7. Pipeline and Vector Architecture
- Advantages: it is relatively easy to write programs that obtain very high performance, so these machines are very popular for high-performance scientific computing.
- Disadvantages: they don't work well for programs that use irregular structures or many branches.
8. SIMD Systems
- A pure SIMD system, as opposed to a vector processor, has a single CPU devoted to control and a collection of subordinate processors.
- During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors. Each of them either executes the instruction or is idle.
- Example:
    for (i = 0; i < 1000; i++)
        if (y[i] != 0.0)
            z[i] = x[i] / y[i];
        else
            z[i] = x[i];
9. SIMD Systems
- Each subordinate processor would execute the following steps:
  - Time Step 1: Test local_y != 0.0.
  - Time Step 2:
    a. If local_y was nonzero, z[i] = x[i] / y[i].
    b. If local_y was zero, do nothing.
  - Time Step 3:
    a. If local_y was nonzero, do nothing.
    b. If local_y was zero, z[i] = x[i].
- Execution is completely synchronous: a given subordinate processor is either active or idle at any given instant of time.
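The three time steps can be emulated sequentially in C, with each loop over `i` standing in for one instruction broadcast to all subordinate processors. This is only a sketch: the function name `simd_divide`, the caller-supplied `active` mask, and the idle counter are assumptions added for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Emulate the three time steps on n subordinate processors, where
   processor i holds x[i], y[i], and z[i]. The caller supplies
   active[], a scratch array of n flags. Returns the number of
   processor-steps spent idle during time steps 2 and 3. */
int simd_divide(const float *x, const float *y, float *z, bool *active, int n)
{
    assert(n >= 0);
    int idle = 0;

    /* Time step 1: every processor tests its local y. */
    for (int i = 0; i < n; i++)
        active[i] = (y[i] != 0.0f);

    /* Time step 2: processors with nonzero y divide; the rest idle. */
    for (int i = 0; i < n; i++) {
        if (active[i])
            z[i] = x[i] / y[i];
        else
            idle++;
    }

    /* Time step 3: processors with zero y copy x; the rest idle. */
    for (int i = 0; i < n; i++) {
        if (!active[i])
            z[i] = x[i];
        else
            idle++;
    }
    return idle;
}
```

Every processor sits out exactly one of the last two steps, so the function always returns n: half of the processor-steps spent on the conditional are idle, whatever the data.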
10. SIMD Systems
- The disadvantage is clear: in a program with many conditional branches, or long segments of code whose execution depends on conditionals, many processors may be idle for long periods of time.
- SIMD systems are easy to program if the underlying problem has a regular structure.
- The most famous examples of SIMD machines are the CM-1 and CM-2 Connection Machines produced by Thinking Machines.
11. General MIMD Systems
- The key difference between SIMD and MIMD systems is that in MIMD systems the processors are autonomous.
- MIMD systems are asynchronous. There is often no global clock, and there may be no correspondence between what different processors are doing, even if they execute the same program.
- MIMD systems are divided into shared-memory and distributed-memory systems, sometimes also called multiprocessors and multicomputers, respectively.
12. Shared-Memory MIMD
- The generic shared-memory architecture: a collection of processors and memory modules connected by an interconnection network.
13. Bus-Based Architecture
- A bus is the simplest interconnection network.
- If multiple processors access memory at the same time, the bus becomes saturated, causing long delays.
- To reduce bus traffic, each processor is given a fairly large cache.
- Because of the limited bandwidth of a bus, these architectures do not scale to large numbers of processors.
14. Switch-Based Architecture
- Most other shared-memory architectures rely on some type of switch-based network.
- A crossbar can be viewed as a rectangular mesh of wires with switches at the points of intersection and terminals on its left and top edges.
15. Switch-Based Architecture
- Processors or memory modules can be connected to the terminals.
- The switches can either allow a signal to pass through in both the vertical and horizontal directions simultaneously, or redirect a signal from vertical to horizontal or vice versa.
- Any processor can access any memory module while any other processor is simultaneously accessing any other memory module, so crossbars don't suffer from the saturation problems of buses.
- However, they are very expensive: an m x n crossbar needs mn hardware switches.
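Both properties of the crossbar, its cost and its freedom from saturation, can be written down directly. The sketch below models one round of memory requests as an array; the function names and the `target` array are hypothetical, and the model ignores everything except which module each processor asks for.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of hardware switches in an m x n crossbar. */
long crossbar_switches(long m, long n)
{
    assert(m >= 0 && n >= 0);
    return m * n;
}

/* target[i] is the memory module requested by processor i.
   All requests can be served at the same time as long as no two
   processors ask for the same module. */
bool crossbar_conflict_free(const int *target, int processors, int modules)
{
    assert(processors >= 0 && modules >= 0);
    for (int i = 0; i < processors; i++) {
        assert(target[i] >= 0 && target[i] < modules);
        for (int j = 0; j < i; j++)
            if (target[j] == target[i])
                return false;
    }
    return true;
}
```

On a bus, by contrast, any two simultaneous requests compete for the same wires, whichever modules they address.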
16. Cache Coherence
- Cache coherence is a problem for any shared-memory architecture.
- When a processor accesses a shared variable in its cache, how will it know whether the value stored in the variable is current?
- Example: assume x = 2 initially.
    Time   P1        P2
    0      y0 = x    y1 = 3*x
    1      x = 7     z = 6
    2      y = 5     z1 = 4*x
- y0 ends up 2 and y1 ends up 6. How about z1? It is 4*7 = 28 if P2 sees the updated x, but 4*2 = 8 if P2 uses a stale cached copy.
17. Cache Coherence
- The simplest solution is probably the snoopy protocol.
- Each CPU has a cache controller.
- The cache controller monitors the bus traffic. When a processor updates a shared variable, it also updates the corresponding main memory location. The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid.
- This approach is only suitable for bus-based shared memory, because any traffic on the bus can be monitored by all the controllers.
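The effect of snooping can be seen by replaying the x = 2 example from the preceding slide on a toy model with one shared variable and two private caches. Everything in this sketch (the struct, the function names, and the `snooping` flag) is a hypothetical simplification; a real cache controller works on cache lines in hardware, not on a single integer.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 2   /* index 0 is P1, index 1 is P2 */

/* One shared variable x: its value in main memory plus a private
   cached copy, with a valid bit, for each processor. */
struct shared_var {
    int memory;
    int cached[NPROC];
    bool valid[NPROC];
};

/* Read x on processor p, loading from main memory on a cache miss. */
int read_x(struct shared_var *x, int p)
{
    assert(p >= 0 && p < NPROC);
    if (!x->valid[p]) {
        x->cached[p] = x->memory;
        x->valid[p] = true;
    }
    return x->cached[p];
}

/* Write x on processor p: update the local cache and main memory.
   With snooping on, the other controllers see the write on the bus
   and mark their copies invalid. */
void write_x(struct shared_var *x, int p, int value, bool snooping)
{
    assert(p >= 0 && p < NPROC);
    x->cached[p] = value;
    x->valid[p] = true;
    x->memory = value;
    if (snooping)
        for (int q = 0; q < NPROC; q++)
            if (q != p)
                x->valid[q] = false;
}

/* Replay the example and return the value computed for z1. */
int run_example(bool snooping)
{
    struct shared_var x = { .memory = 2 };  /* caches start invalid */
    read_x(&x, 0);                /* time 0, P1: y0 = x   */
    read_x(&x, 1);                /* time 0, P2: y1 = 3*x */
    write_x(&x, 0, 7, snooping);  /* time 1, P1: x = 7    */
    return 4 * read_x(&x, 1);     /* time 2, P2: z1 = 4*x */
}
```

In this model `run_example(false)` returns 8, because P2 keeps using its stale copy of x, while `run_example(true)` returns 28, because the invalidation forces P2 to reload x from main memory.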
18. Distributed-Memory MIMD
- Each processor has its own private memory.
- The generic distributed-memory MIMD architecture: processor/memory pairs connected by an interconnection network.
- If we view the system as a graph, the edges are communication wires. Either each vertex corresponds to a processor/memory pair (a node), or some vertices correspond to nodes and others correspond to switches.
- Networks of the first kind are static networks; networks of the second kind are dynamic networks.
19. Distributed-Memory MIMD
- Different types of distributed-memory systems:
  (a) a static network (mesh)
  (b) a dynamic network (crossbar)
20. Dynamic Interconnection Networks
- Example: an omega network
21. Dynamic Interconnection Networks
- A less expensive solution is to use a multistage switching network, such as an omega network.
- With p nodes, p log2(p)/2 switches are needed, fewer than the p^2 switches of a crossbar.
- The delay in transmitting a message is increased, since log2(p) switches must be set.
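The two switch counts are easy to compare numerically. The sketch below assumes that p is a power of two and that the omega network is built from log2(p) stages of p/2 two-by-two switches each, which is the usual construction; the function names are hypothetical.

```c
#include <assert.h>

/* log2(p) for p a power of two. */
int log2_exact(long p)
{
    assert(p >= 1 && (p & (p - 1)) == 0);
    int k = 0;
    while (p > 1) {
        p >>= 1;
        k++;
    }
    return k;
}

/* Switches in an omega network on p nodes:
   log2(p) stages of p/2 switches each. */
long omega_switches(long p)
{
    return (p / 2) * log2_exact(p);
}

/* Switches in a p x p crossbar. */
long square_crossbar_switches(long p)
{
    assert(p >= 0);
    return p * p;
}
```

For p = 1024 the omega network needs 5,120 switches against 1,048,576 for the crossbar, at the price of setting 10 switches per message.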
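22. Static Interconnection Networks
- Fully connected interconnection network
- The ideal case from the standpoint of performance and programming
- Communication involves no forwarding delay, since every pair of nodes is directly connected
- Costs are huge: p nodes require p(p-1)/2 wires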
23. Static Interconnection Networks
- A linear array or a ring
- Relatively inexpensive (p-1 wires for a linear array, p for a ring)
- Easy to increase the size of the network
- The number of available wires is extremely limited
- The longest path is p-1 (linear array) or p/2 (ring)
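The longest-path figures follow from how many wires a message has to cross between two nodes. The following sketch assumes the nodes are numbered 0 to p-1 in order along the array or around the ring; the function names are hypothetical.

```c
#include <assert.h>

/* Wires crossed between nodes a and b in a linear array of p nodes. */
int linear_distance(int a, int b, int p)
{
    assert(p >= 1 && a >= 0 && a < p && b >= 0 && b < p);
    return a > b ? a - b : b - a;
}

/* Wires crossed in a ring of p nodes: the message can travel in
   either direction, so take the shorter way around. */
int ring_distance(int a, int b, int p)
{
    int d = linear_distance(a, b, p);
    return d < p - d ? d : p - d;
}
```

With p = 8, the two end nodes of a linear array are 7 wires apart, while in a ring the same two nodes are neighbors and the worst case, between opposite nodes, is 4 wires.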
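24. Static Interconnection Networks
- The hypercube is, in practice, the closest to the fully connected network.
- A d-dimensional hypercube has 2^d nodes.
- A message between any two nodes traverses at most d wires.
- Drawback: a relative lack of scalability, since each increase in dimension doubles the number of nodes and adds a wire to every node.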
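The bound of d wires is easiest to see with the usual labeling, assumed here, in which each node carries a d-bit number and two nodes are wired together exactly when their labels differ in one bit. A shortest path then fixes one differing bit per hop. The function names in this sketch are hypothetical.

```c
#include <assert.h>

/* Number of nodes in a d-dimensional hypercube. */
unsigned hypercube_nodes(unsigned d)
{
    assert(d < 31);
    return 1u << d;
}

/* Wires on a shortest path between nodes a and b: the number of
   bit positions in which their labels differ. */
unsigned hypercube_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}
```

In a 3-dimensional hypercube, nodes 0 (000) and 7 (111) differ in all three bits, so they are the maximum of d = 3 wires apart.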
26. Static Interconnection Networks
- A mesh or torus lies between the hypercube and the linear array.
- Meshes and tori scale better than hypercubes.
- They are quite popular.
27. Communication and Routing
- If two nodes are not directly connected, or if a processor is not directly connected to a memory module, how is data transmitted between the two?
- If there are multiple routes, how is one chosen? Is the route always the shortest path? Most systems use a deterministic shortest-path algorithm.
- How do intermediate nodes forward communications? The two basic approaches are store-and-forward routing and cut-through routing.
- Store-and-forward routing uses considerably more memory.
- Most systems use some variant of cut-through routing.
28. Store-and-Forward Routing
- The intermediate node reads in the entire message and only then sends it on to the destination, C.
29. Cut-Through Routing
- The intermediate node immediately forwards each identifiable piece of the message.
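The difference between the two routing schemes shows up in a simple timing model. The sketch below assumes the message is split into equal pieces and that one piece crosses one wire per time unit; these assumptions and the function names are illustrative and not taken from the slides.

```c
#include <assert.h>

/* Time units to deliver a message of `packets` pieces across `links`
   consecutive wires when every intermediate node waits for the whole
   message before forwarding it. */
long store_and_forward_time(long packets, long links)
{
    assert(packets >= 1 && links >= 1);
    return packets * links;
}

/* Same model, but each node forwards a piece as soon as it arrives:
   the first piece needs `links` time units, and the remaining pieces
   follow one time unit apart. */
long cut_through_time(long packets, long links)
{
    assert(packets >= 1 && links >= 1);
    return links + (packets - 1);
}
```

For a four-piece message that passes through one intermediate node (two wires), the model gives 8 time units for store-and-forward and 5 for cut-through. In the same model the intermediate node must buffer all four pieces under store-and-forward but only one piece at a time under cut-through, which matches the memory cost noted for store-and-forward routing.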