An Overview of Parallel Computing

Slides: 30
Provided by: RF556

1
An Overview of Parallel Computing

2
Hardware
  • There are many varieties of parallel computing
    hardware and many different architectures
  • The original classification of parallel computers
    is popularly known as Flynn's taxonomy.
  • Flynn classified systems according to the number
    of instruction streams and the number of data
    streams
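  • 1. SISD: single instruction, single data (the
    classical von Neumann machine)
  • 2. MIMD: multiple instruction, multiple data (the
    most general: a collection of autonomous
    processors, each operating on its own data
    stream)
  • 3. SIMD: single instruction, multiple data
  • 4. MISD: multiple instruction, single data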

3
The Classical von Neumann Machine
  • Divided into a CPU and main memory
  • CPU is divided into a control unit and an ALU
  • The memory stores both instructions and data
  • The control unit directs the execution of
    programs
  • The ALU carries out the calculations. While they
    are being used by the program, instructions and
    data are stored in very fast memory locations
    called registers. Both data and instructions are
    moved between memory and the CPU registers via a
    bus.
  • A bus is a collection of parallel wires; faster
    buses have more wires

4
The Classical von Neumann Machine
  • To be useful, some additional devices are needed
    including input devices, output devices, and
    extended storage devices (disk)
  • The bottleneck is the transfer of data and
    instructions between memory and CPU
  • Few computers use the classical von Neumann
    design
  • Most machines now have a hierarchical memory.
    Cache is used to achieve faster access.

5
Pipeline and Vector Architecture
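  • The first widely used extension to the von
    Neumann model was pipelining.
  • Suppose we have the following program:
      float x[100], y[100], z[100];
      for (i = 0; i < 100; i++)
          z[i] = x[i] + y[i];
  • Further, suppose a single addition consists of
    the following operations:
      1. Fetch the operands from memory
      2. Compare exponents
      3. Shift one operand
      4. Add
      5. Normalize the result
      6. Store the result in memory
  • With a separate unit for each operation, a new
    addition can enter the pipeline while earlier
    ones are still in later stages, so once the
    pipeline is full one result completes per step.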
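The benefit of pipelining the six-stage addition can be estimated with a small calculation. The sketch below assumes every stage takes exactly one cycle and the pipeline never stalls; both function names are hypothetical and are not from the slides.

```c
/* Cycles needed when each addition runs through every stage
   before the next addition is allowed to start. */
unsigned long unpipelined_cycles(unsigned long stages, unsigned long n_ops)
{
    return stages * n_ops;
}

/* Cycles needed when a new addition can enter the pipeline every cycle
   (assumption: one cycle per stage, no stalls). The first result
   appears after `stages` cycles, then one more result per cycle. */
unsigned long pipelined_cycles(unsigned long stages, unsigned long n_ops)
{
    if (n_ops == 0)
        return 0;
    return stages + (n_ops - 1);
}
```

Under these assumptions, the 100-element loop with 6 stages needs `unpipelined_cycles(6, 100)` = 600 cycles without pipelining and `pipelined_cycles(6, 100)` = 105 cycles with it.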

6
Pipeline and Vector Architecture
  • A further improvement is the addition of vector
    instructions
  • With vector instructions, each basic instruction
    needs to be issued only once. One short
    instruction encodes N operations.
  • Another improvement is the use of multiple memory
    banks.
  • Authors disagree on how to classify vector
    processors: as MISD, as a variant of SIMD, or as
    not really parallel machines at all
  • Examples: CRAY C90 and NEC SX-4

7
Pipeline and Vector Architecture
  • Advantages: it is relatively easy to write
    programs that obtain very high performance, so
    these machines are very popular for
    high-performance scientific computing
  • Disadvantages: they do not work well for programs
    that use irregular structures or many branches

8
SIMD Systems
  • A pure SIMD system, as opposed to a vector
    processor, has a single CPU devoted to control
    and a large collection of subordinate processors
  • During each instruction cycle, the control
    processor broadcasts an instruction to all of the
    subordinate processors. Each of them either
    executes the instruction or is idle.
  • Example:
      for (i = 0; i < 1000; i++)
          if (y[i] != 0.0)
              z[i] = x[i] / y[i];
          else
              z[i] = x[i];

9
SIMD Systems
  • Each subordinate processor would execute the
    following:
  • Time Step 1: Test local_y != 0.0.
  • Time Step 2:
  • a. If local_y was nonzero, z[i] = x[i] / y[i].
  • b. If local_y was zero, do nothing.
  • Time Step 3:
  • a. If local_y was nonzero, do nothing.
  • b. If local_y was zero, z[i] = x[i].
  • Execution is completely synchronous. A given
    subordinate processor is either active or idle
    at any given instant of time
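The three lockstep time steps can be imitated in ordinary C to see where the idle time comes from. This is only a sequential simulation, assuming one array element per subordinate processor; the function name `simd_divide_or_copy` and the idle counter are illustrative assumptions, not part of any SIMD machine's interface.

```c
#include <stddef.h>

/* Simulates the broadcast steps for n subordinate processors, where
   processor i holds x[i], y[i] and z[i]. Returns the number of
   processor-steps spent idle during time steps 2 and 3. */
size_t simd_divide_or_copy(const float *x, const float *y, float *z, size_t n)
{
    size_t idle = 0;

    /* Time step 1 (the test) is folded into the conditions below. A real
       SIMD machine would keep the result as a per-processor flag; this
       sketch simply re-tests y[i] in each later step. */

    /* Time step 2: "z = x / y" is broadcast; processors whose local y
       is zero must sit idle. */
    for (size_t i = 0; i < n; i++) {
        if (y[i] != 0.0f)
            z[i] = x[i] / y[i];
        else
            idle++;
    }

    /* Time step 3: "z = x" is broadcast; processors whose local y
       is nonzero must sit idle. */
    for (size_t i = 0; i < n; i++) {
        if (y[i] == 0.0f)
            z[i] = x[i];
        else
            idle++;
    }

    return idle;
}
```

Every processor is idle in exactly one of the two steps, so the function always returns n: half of the processor-steps spent on the two branches do no useful work.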

10
SIMD Systems
  • The disadvantage is clear: in a program with many
    conditional branches, or with long segments of
    code whose execution depends on conditionals,
    many processors may be idle for long periods of
    time
  • Easy to program if the underlying problem has a
    regular structure.
  • The most famous examples of SIMD machines are the
    CM-1 and CM-2 Connection Machines produced by
    Thinking Machines.

11
General MIMD Systems
  • The key difference between SIMD and MIMD is that
    in MIMD the processors are autonomous.
  • MIMD systems are asynchronous. There is often no
    global clock, and there may be no correspondence
    between what different processors are doing,
    even if they execute the same program
  • MIMD systems are divided into shared-memory and
    distributed-memory systems, also sometimes
    called multiprocessors and multicomputers,
    respectively.

12
Shared-Memory MIMD
  • The generic shared-memory architecture

13
Bus-based Architecture
  • The simplest interconnection network
  • If multiple processors access memory at the same
    time, the bus becomes saturated, causing long
    delays
  • Each processor therefore usually has a fairly
    large cache to reduce bus traffic
  • Because of the limited bandwidth of a bus, these
    systems do not scale to large numbers of
    processors.

14
Switch-based Architecture
  • Most other shared-memory architectures rely on
    some type of switch-based network
  • A crossbar can be viewed as a rectangular mesh of
    wires with switches at the points of
    intersection, and terminals on its left and top
    edges.

15
Switch-based Architecture
  • Processors or memory modules can be connected to
    the terminals
  • The switches can either allow a signal to pass
    through in both directions simultaneously, or
    they can redirect a signal from vertical to
    horizontal or vice versa.
  • While one processor accesses a memory module, any
    other processor can simultaneously access any
    other memory module, so crossbars do not suffer
    from the problems of saturation
  • However, they are very expensive: an m × n
    crossbar needs mn hardware switches

16
Cache Coherence
  • Cache coherence is a problem for any
    shared-memory architecture
  • When a processor accesses a shared variable in
    its cache, how will it know whether the value
    stored in the variable is current?
  • Example: assume x = 2 initially
      Time   P1          P2
      0      y0 = x      y1 = 3*x
      1      x = 7       z = 6
      2      y = 5       z1 = 4*x
  • y0 ends up 2 and y1 ends up 6. How about z1?

17
Cache Coherence
  • The simplest solution is probably the snoopy
    protocol
  • Each CPU has a cache controller
  • The cache controller monitors the bus traffic.
    When a processor updates a shared variable, it
    also updates the corresponding main memory
    location. The cache controllers on the other
    processors detect the write to main memory and
    mark their copies of the variable as invalid
  • This approach is only suitable for bus-based
    shared-memory because any traffic on the bus can
    be monitored by all the controllers
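The effect of snooping can be seen in a toy model of a single shared variable with one cached copy per CPU. This is a minimal sketch of a write-through, invalidate-on-write scheme as described above, not a model of any real cache controller; the struct, its fields, and the three functions are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

#define NUM_CPUS 2

/* One shared variable with a private cached copy per CPU. */
struct shared_var {
    int  memory;            /* value held in main memory */
    int  cached[NUM_CPUS];  /* each CPU's cached copy */
    bool valid[NUM_CPUS];   /* is that cached copy usable? */
    bool snooping;          /* do the cache controllers watch the bus? */
};

void var_init(struct shared_var *v, int value, bool snooping)
{
    v->memory = value;
    v->snooping = snooping;
    for (int c = 0; c < NUM_CPUS; c++) {
        v->cached[c] = 0;
        v->valid[c] = false;
    }
}

/* Read: use the cached copy if it is valid, else fetch from memory. */
int var_read(struct shared_var *v, int cpu)
{
    if (!v->valid[cpu]) {
        v->cached[cpu] = v->memory;
        v->valid[cpu] = true;
    }
    return v->cached[cpu];
}

/* Write: update the writer's cache and main memory (write-through).
   With snooping enabled, the other controllers see the write on the
   bus and mark their own copies invalid. */
void var_write(struct shared_var *v, int cpu, int value)
{
    v->cached[cpu] = value;
    v->valid[cpu] = true;
    v->memory = value;
    if (v->snooping) {
        for (int c = 0; c < NUM_CPUS; c++)
            if (c != cpu)
                v->valid[c] = false;
    }
}
```

If CPU 1 caches x = 2 and CPU 0 then writes x = 7, a later `var_read` on CPU 1 returns 7 when snooping is enabled and the stale value 2 when it is not.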

18
Distributed-Memory MIMD
  • Each processor has its own private memory
  • Generic distributed-memory MIMD
  • If we view it as a graph, the edges are
    communication wires. Each vertex corresponds to a
    processor/memory pair (or node), or some vertices
    correspond to nodes and others correspond to
    switches
  • There are static networks and dynamic networks

19
Distributed-Memory MIMD
  • Different types of distributed-memory systems:
    (a) a static network (mesh), (b) a dynamic
    network (crossbar)

20
Dynamic Interconnection Networks
  • Dynamic interconnection networks
  • Example: an omega network

21
Dynamic Interconnection Networks
  • A less expensive solution is to use a multistage
    switching network, such as an omega network
  • With p nodes, (p log2 p)/2 switches are needed,
    fewer than the p^2 switches of a crossbar
  • The delay in transmitting a message increases,
    since log2 p switches must be set
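The hardware cost of the two dynamic networks can be compared directly from these formulas. The sketch below assumes p is a power of two and that the omega network is built from 2 × 2 switches arranged in log2 p stages of p/2 switches each; the function names are hypothetical.

```c
/* Number of switches in a p x p crossbar: one per wire intersection. */
unsigned long crossbar_switches(unsigned long p)
{
    return p * p;
}

/* log2(p) for p a power of two; this is both the number of stages in
   the omega network and the number of switches a message must pass. */
unsigned int omega_stages(unsigned long p)
{
    unsigned int stages = 0;
    while (p > 1) {
        p >>= 1;
        stages++;
    }
    return stages;
}

/* Number of 2 x 2 switches in an omega network on p nodes:
   p/2 switches in each of log2(p) stages. */
unsigned long omega_switches(unsigned long p)
{
    return (p / 2) * omega_stages(p);
}
```

For p = 1024 nodes this gives 1,048,576 crossbar switches against 5,120 omega switches, at the price of setting 10 switches per message.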

22
Static Interconnection Networks
  • Fully connected interconnection network
  • Ideal from the standpoint of performance and
    programming
  • Every node is wired directly to every other
    node, so communication involves no forwarding
    delay
  • Costs are huge

23
Static Interconnection Networks
  • A linear array or a ring
  • Relatively inexpensive (p-1 wires for a linear
    array, p wires for a ring)
  • Easy to increase the size of the network
  • Number of available wires is extremely limited
  • The longest path is p-1 wires in a linear array
    or p/2 in a ring
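The path lengths quoted for the linear array and the ring follow from a simple hop count. The sketch assumes the p nodes are numbered 0 to p-1 in order along the array or around the ring; the function names are hypothetical.

```c
/* Wires traversed between nodes a and b in a linear array. */
unsigned int linear_hops(unsigned int a, unsigned int b)
{
    return a > b ? a - b : b - a;
}

/* Wires traversed between nodes a and b in a ring of p nodes:
   go around whichever way is shorter. */
unsigned int ring_hops(unsigned int a, unsigned int b, unsigned int p)
{
    unsigned int d = linear_hops(a, b);
    return d < p - d ? d : p - d;
}
```

With p = 8, the end nodes 0 and 7 are 7 wires apart in the linear array but only 1 apart in the ring, where the worst case is `ring_hops(0, 4, 8)` = 4.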

24
Static Interconnection Networks
  • Hypercube: in practice, the closest to the fully
    connected network
  • A d-dimensional hypercube has 2^d nodes
  • A message between any two nodes traverses at
    most d wires
  • Drawback: relative lack of scalability
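The bound of d wires is easy to check under the usual labeling, which is assumed here rather than stated on the slide: each node gets a d-bit binary label, and two nodes share a wire exactly when their labels differ in one bit. The function names are hypothetical.

```c
/* Number of nodes in a d-dimensional hypercube. */
unsigned long hypercube_nodes(unsigned int d)
{
    return 1UL << d;
}

/* Minimum number of wires between nodes a and b: the number of bit
   positions in which their labels differ (the Hamming distance). */
unsigned int hypercube_hops(unsigned int a, unsigned int b)
{
    unsigned int diff = a ^ b;
    unsigned int hops = 0;
    while (diff != 0) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

In a 3-dimensional hypercube, nodes 0 (000) and 7 (111) differ in all three bits, so they are the maximum d = 3 wires apart.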

25
Static Interconnection Networks
  • Mesh or torus

26
Static Interconnection Networks
  • A mesh or torus lies between the hypercube and
    the linear array in cost and performance
  • Scales better than the hypercube
  • Quite popular

27
Communication and Routing
  • If two nodes are not directly connected or if a
    processor is not directly connected to a memory
    module, how is data transmitted between the two?
  • If there are multiple routes, how is one chosen?
    Is the route always the shortest path? Most
    systems use a deterministic shortest-path
    algorithm
  • How do intermediate nodes forward communications?
    Two basic approaches are store-and-forward
    routing and cut-through routing
  • Store-and-forward routing uses considerably more
    memory
  • Most systems use some variant of cut-through
    routing

28
Store-and-Forward Routing
  • The intermediate node reads in the entire
    message, and then sends it on to C

29
Cut-Through Routing
  • Immediately forward each identifiable piece of
    the message
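The difference between the two routing approaches can be put in numbers with a deliberately simple model. The sketch assumes the message is split into equal pieces, each piece takes one time unit to cross one wire, and there is no contention or routing overhead; the function names are hypothetical.

```c
/* Store-and-forward: every node on the path reads in the whole message
   before sending it on, so each hop costs the full message time. */
unsigned long store_and_forward_time(unsigned long pieces, unsigned long hops)
{
    return pieces * hops;
}

/* Cut-through: each piece is forwarded as soon as it arrives, so the
   pieces follow one another down the path like items in a pipeline. */
unsigned long cut_through_time(unsigned long pieces, unsigned long hops)
{
    if (pieces == 0 || hops == 0)
        return 0;
    return pieces + (hops - 1);
}
```

For a 4-piece message that crosses 2 wires through one intermediate node, this model gives 8 time units for store-and-forward and 5 for cut-through. The intermediate node also has to buffer all 4 pieces in the first case but only one piece at a time in the second.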