1
Parallel Programming Platforms
Chapter 2
Reference: http://www-users.cs.umn.edu/~karypis/parbook/
http://www.eel.tsint.edu.tw/teacher/ttsu/teach01.htm
2
Introduction
  • The traditional logical view of a sequential
    computer consists of a memory connected to a
    processor via a datapath.
  • All three components (processor, memory, and
    datapath) present bottlenecks to the overall
    processing rate of a computer system.

3
Introduction
  • A number of architectural innovations over the
    years have addressed these bottlenecks. One of
    the most important innovations is multiplicity in
  • processor units,
  • datapaths, and
  • memory units.
  • This multiplicity is either entirely hidden from
    the programmer, as in the case of implicit
    parallelism, or exposed to the programmer in
    different forms.

4
Introduction
  • Learning objectives in this chapter
  • An overview of important architecture concepts as
    they relate to parallel processing.
  • To provide sufficient detail for programmers to
    be able to write efficient code on a variety of
    platforms.
  • It develops cost models and abstractions for
    quantifying the performance of various parallel
    algorithms, and identifying bottlenecks resulting
    from various programming constructs.

5
Introduction
  • Parallelizing sub-optimal serial codes often has
    the undesirable effects of unreliable speedups
    and misleading runtimes.
  • The chapter advocates optimizing the serial
    performance of codes before attempting
    parallelization.
  • The tasks of serial and parallel optimization
    often have very similar characteristics.

6
Outline
  1. Implicit Parallelism
  2. Limitations of Memory System Performance
  3. Dichotomy of Parallel Computing Platforms
  4. Physical Organization of Parallel Platforms
  5. Communication Costs in Parallel Machines
  6. Routing Mechanisms for Interconnection Networks
  7. Impact of Process-Processor Mapping and Mapping
    Techniques
  8. Case Studies

7
Implicit Parallelism
  • Trend in Microprocessor Architecture
  • Pipelining and Superscalar Execution
  • Very Long Instruction Word (VLIW) Processors

8
Trend in Microprocessor Architecture
  • Clock speeds of microprocessors have posted
    impressive gains (two to three orders of
    magnitude) over the past 20 years.
  • However, these gains are severely diluted by the
    limitations of memory technology.
  • Consequently, techniques that enable execution of
    multiple instructions in a single clock cycle
    have become popular.

9
Trend in Microprocessor Architecture
  • Mechanisms used by various processors for
    supporting multiple instruction execution.
  • Pipelining and Superscalar Execution
  • Very Long Instruction Word (VLIW) Processors

10
Pipelining and Superscalar Execution
  • By overlapping various stages in instruction
    execution, pipelining enables faster execution.
  • To increase the speed of a single pipeline, one
    would break down the tasks into smaller and
    smaller units, thus lengthening the pipeline and
    increasing overlap in execution.

11

Pipelining and Superscalar Execution
  • For example, the Pentium 4, which operates at 2.0
    GHz, has a 20-stage pipeline.
  • Long instruction pipelines therefore need
    effective techniques for predicting branch
    destinations so that pipelines can be
    speculatively filled.
  • An obvious way to improve instruction execution
    rate beyond this level is to use multiple
    pipelines.
  • During each clock cycle, multiple instructions
    are piped into the processor in parallel.

12
Superscalar Execution Example 2.1
Example of a two-way superscalar execution of
instructions.
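(Figure 2.1 itself is not reproduced in this transcript. Reconstructed
from the discussion on the following slides, and offered only as an
illustrative sketch rather than the exact figure, the three fragments
it compares all add four numbers stored at addresses 1000 through 100C:

  Fragment (i):   load R1, @1000 / load R2, @1008 / add R1, @1004 /
                  add R2, @100C / add R1, R2 / store R1, @2000
  Fragment (ii):  load R1, @1000 / add R1, @1004 / add R1, @1008 /
                  add R1, @100C / store R1, @2000
  Fragment (iii): load R1, @1000 / add R1, @1004 / load R2, @1008 /
                  add R2, @100C / add R1, R2 / store R1, @2000

Fragment (i) allows two-way issue at t0 and t1; fragment (ii) is fully
serialized by data dependences; fragment (iii) is a reordering of (i)
that needs out-of-order issue to recover the same schedule.)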
13
  • Consider Fig. 2.1(a), the first code fragment
  • t0: The first and second instructions are
    independent and therefore can be issued
    concurrently.
  • t1: The next two instructions (rows 3, 4) are also
    mutually independent, although they must be
    executed after the first two instructions (t0).
    They can be issued concurrently at t1 since the
    processors are pipelined.
  • t2: Only the add instruction is issued.
  • t3: Only the store instruction is issued.
  • The last two instructions (rows 5, 6) cannot be
    executed concurrently since the result of the
    former is used by the latter.

14
Superscalar Execution
  • Scheduling of instructions is determined by a
    number of factors
  • True Data Dependency: The result of one operation
    is an input to the next.
  • Resource Dependency: Two operations require the
    same resource.
  • Branch Dependency: Scheduling instructions across
    conditional branch statements cannot be done
    deterministically a priori.

15
Superscalar Execution
  • Scheduling of instructions is determined by a
    number of factors
  • The scheduler, a piece of hardware, looks at a
    large number of instructions in an instruction
    queue and selects an appropriate number of
    instructions to execute concurrently based on
    these factors.
  • The complexity of this hardware is an important
    constraint on superscalar processors.

16
Dependency- True data dependency
  • True data dependency: the result of an
    instruction is required by a subsequent
    instruction.
  • Consider the 2nd code fragment: there is a true
    data dependency between load R1, @1000 and add
    R1, @1004.
  • Since the resolution is done at runtime, it must
    be supported in hardware. The complexity of this
    hardware can be high.
  • The amount of instruction-level parallelism in a
    program is often limited and is a function of
    coding technique.

17
Dependency- True data dependency
  • In the 2nd code fragment, there can be no
    simultaneous issue, leading to poor resource
    utilization.
  • The third code fragment also illustrates that in
    many cases it is possible to extract more
    parallelism by reordering the instructions and by
    altering the code.
  • The code reorganization corresponds to exposing
    parallelism in a form that can be used by the
    instruction issue mechanism.

18
Dependency- Resource dependency
  • The form of dependency in which two instructions
    compete for a single processor resource.
  • As an example, consider the co-scheduling of two
    floating point operations on a dual issue machine
    with a single floating point unit.
  • Although there might be no data dependencies
    between the instructions, they cannot be
    scheduled together since both need the floating
    point unit.

19
Dependency- Branch or procedural dependencies
  • Since the branch destination is known only at the
    point of execution, scheduling instructions a
    priori across branches may lead to errors.
  • These dependencies are referred to as branch or
    procedural dependencies and are typically handled
    by speculatively scheduling across branches and
    rolling back in case of errors.

20
Dependency- Branch or procedural dependencies
  • On average, a branch instruction is encountered
    between every five to six instructions.
  • Therefore, just as in populating instruction
    pipelines, accurate branch prediction is critical
    for efficient superscalar execution.
  • The ability of a processor to detect and schedule
    concurrent instructions is critical to superscalar
    performance.

21
Dependency- Branch or procedural dependencies
  • The 3rd code fragment is merely a semantically
    equivalent reordering of the 1st code fragment.
    However, there is a data dependency between load
    R1, @1000 and add R1, @1004.
  • Therefore, these instructions cannot be issued
    simultaneously. However, if the processor has the
    ability to look ahead, it would realize that it
    is possible to schedule the 3rd instruction with
    the 1st instruction.
  • In this way, the same execution schedule can be
    derived for the 1st and 3rd code fragments.
    However, the processor needs the ability to issue
    instructions out-of-order to accomplish the
    desired ordering.

22
Dependency- Branch or procedural dependencies
  • Most current microprocessors are capable of
    out-of-order issue and completion.
  • This model, also referred to as dynamic
    instruction issue, exploits maximum instruction
    level parallelism. The processor uses a window of
    instructions from which it selects instructions
    for simultaneous issue. This window corresponds
    to the look-ahead of the scheduler. (Dynamic
    Dependency Analysis)

23
Dependency-Branch or procedural dependencies
  • In Fig. 2.1(C)
  • These are essentially wasted cycles from the
    point of view of the execution unit. If, during a
    particular cycle, no instructions are issued on
    the execution units, it is referred to as
    vertical waste. If only part of the execution
    units are used during a cycle, it is termed
    horizontal waste.
  • In all, only three of the eight available cycles
    are used for computation. This implies that the
    code fragment will yield no more than
    three-eighths of the peak rated FLOPS count of
    the processor.

24
Dependency- Branch or procedural dependencies
  • Often, due to limited parallelism, resource
    dependencies, or the inability of a processor to
    extract parallelism, the resources of superscalar
    processors are heavily under-utilized.
  • Current microprocessors typically support up to
    four-issue superscalar execution.

25
Very Long Instruction Word Processors VLIW
  • The parallelism extracted by superscalar
    processors is often limited by the instruction
    look-ahead.
  • The hardware logic for dynamic dependency
    analysis is typically in the range of 5-10% of
    the total logic on conventional microprocessors.
  • This complexity grows roughly quadratically with
    the number of issue slots and can become a
    bottleneck.

26
Very Long Instruction Word Processors (VLIW)
  • An alternate concept for exploiting
    instruction-level parallelism, used in very long
    instruction word (VLIW) processors, relies on the
    compiler to resolve dependencies and resource
    availability at compile time.

27
Very Long Instruction Word Processors VLIW
  • Instruction that can be executed concurrently are
    packed into groups and parceled?? off the
    processor as a single long instruction word to be
    executed on multiple functional units at the same
    time.

28
Very Long Instruction Word Processors (VLIW)
  • VLIW advantages
  • Since scheduling is done in software, the
    decoding and instruction issue mechanisms are
    simpler in VLIW processors.
  • The compiler has a larger context from which to
    select instructions and can use a variety of
    transformations to optimize parallelism when
    compared to a hardware issue unit.
  • Additional parallel instructions are typically
    made available to the compiler to control
    parallel execution.

29
Very Long Instruction Word Processors VLIW
  • VLIW disadvantages
  • Compilers do not have the dynamic program state
    (e.g. the branch history buffer) available to
    make scheduling decisions.
  • This reduces the accuracy of branch and memory
    prediction, but allows the use of more
    sophisticated static prediction schemes.
  • Other runtime situations are extremely difficult
    to predict accurately.
  • This limits the scope and performance of static
    compiler-based scheduling.

30
Very Long Instruction Word Processors VLIW
  • VLIW performance is very sensitive to the
    compiler's ability to detect data and resource
    dependencies and read/write hazards, and to
    schedule instructions for maximum parallelism.
    Loop unrolling, branch prediction, and
    speculative execution all play important roles in
    the performance of VLIW processors.
  • While superscalar and VLIW processors have been
    successful in exploiting implicit parallelism,
    they are generally limited to smaller scales of
    concurrency, in the range of four- to eight-way
    parallelism.

31
Limitations of Memory System Performance
32
Limitations of Memory System Performance
  • The memory system, and not processor speed, is
    often the bottleneck for many applications.
  • Memory system performance is largely captured by
    two parameters, latency and bandwidth.

33
Limitations of Memory System Performance
  • Latency is the time from the issue of a memory
    request to the time the data is available at the
    processor.
  • Bandwidth is the rate at which data can be pumped
    to the processor by the memory system.

34
Example 2.2: Effect of memory latency on performance
  • Consider a processor operating at 1 GHz (1 ns
    clock) connected to a DRAM with a latency of 100
    ns (no caches), i.e. 100 cycles.
  • Assume that the processor has two multiply-add
    units and is capable of executing four
    instructions in each cycle of 1 ns. The peak
    processor rating is therefore 4 GFLOPS.
  • (4 FLOPs/cycle x 10^9 cycles/s = 4 x 10^9 FLOPS)

35
Example 2.2: Effect of memory latency on performance
  • Since the memory latency is equal to 100 cycles
    and the block size is one word, every time a
    memory request is made, the processor must wait
    100 cycles before it can process the data.
  • It is easy to see that the peak speed of this
    computation is limited to one floating point
    operation every 100 ns (100 x 10^-9 s = 10^-7 s),
    or a speed of 10 MFLOPS (10 x 10^6 = 10^7 FLOPS).

36
Limitations of Memory System Performance
  • Improve Effective Memory Latency Using Caches
  • Impact of Memory Bandwidth
  • Alternate Approaches for Hiding Memory Latency
  • Multithreading for Latency Hiding
  • Prefetching for Latency Hiding
  • Tradeoffs of Multithreading and Prefetching

37
Improve Effective Memory Latency Using Caches
  • One innovation addresses the speed mismatch by
    placing a smaller and faster memory between the
    processor and the DRAM.
  • The fraction of data references satisfied by the
    cache is called the cache hit ratio (a rough way
    to quantify its benefit is given after this
    list).
  • The notion of repeated reference to a data item
    in a small time window is called temporal
    locality.
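(A standard way to quantify this, not spelled out on the slide, is the
effective access time: with hit ratio h,
    t_effective = h x t_cache + (1 - h) x t_DRAM.
For example, h = 0.9, t_cache = 1 ns and t_DRAM = 100 ns give roughly
0.9 x 1 + 0.1 x 100 = 10.9 ns per access, about a tenfold improvement
over going to DRAM on every reference.)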

38
Improve Effective Memory Latency Using Caches
  • The effective computation rate of many
    applications is bounded not by the processing
    rate of the CPU, but by the rate at which data
    can be pumped into the CPU. Such computations are
    referred to as being memory bound.

39
Example 2.3: Impact of caches on memory system performance
  • As in Example 2.2, consider a 1 GHz processor with
    a 100 ns latency DRAM, and introduce a cache of
    size 32 KB with a latency of 1 ns, or one cycle.
    We use this setup to multiply two matrices A and
    B of dimension 32 x 32.
  • A: 32 x 32 = 2^10 = 1K words; B: 32 x 32 = 2^10 =
    1K words; in total 1K + 1K = 2K words, about 2000
    words.
  • Fetching the two matrices into the cache takes
    about 2000 x 100 ns = 200 µs.
  • Multiplying two n x n matrices takes 2n^3
    operations; 2 x (32)^3 = 64K operations.

40
Example 2.3: Impact of caches on memory system performance
  • The processor has two multiply-add units and is
    capable of executing four instructions in each
    cycle of 1 ns.
  • The 64K operations therefore take 64K / 4 = 16K
    cycles (or 16 µs) at four instructions per cycle.
  • Total time: 200 µs + 16 µs = 216 µs.
  • Peak computation rate: 64K operations / 216 µs =
    303.4 x 10^6, about 303 MFLOPS.
  • Compare with Example 2.2:
  • Improvement ratio 303 / 10 = 30.3, about a
    thirtyfold improvement. (A sketch of the
    multiplication loop follows.)
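(A minimal C sketch of the computation in Example 2.3; the dimension
is the 32 assumed above, and the triple loop is where the 2n^3
operation count, one multiply and one add per innermost iteration,
comes from.)

    #define N 32

    /* C = A * B for N x N matrices: 2*N*N*N floating point operations */
    void matmul(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];   /* one multiply-add */
                C[i][j] = sum;
            }
    }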

41
Impact of Memory Bandwidth
  • Memory Bandwidth
  • The rate at which data can be moved between the
    processor and memory.
  • It is determined by the memory bus as well as the
    memory units.
  • A single memory request returns a contiguous
    block of four words. The single unit of four
    words in this case is also referred to as a
    cache line.

42
Impact of Memory Bandwidth
  • In the following example, the data layout is
    assumed to be such that consecutive data words in
    memory are used by successive instructions. In
    other words, if we take a computation-centric
    view, there is a spatial locality of memory
    access.

43
Example 2.4: Effect of block size on the dot-product of two vectors
  • With a one-word block size, the peak speed is
    10 MFLOPS, as illustrated in Example 2.2.
  • Suppose the block size is increased to four words,
    i.e., the processor can fetch a four-word cache
    line every 100 cycles.
  • For each pair of words (one from each vector), the
    dot-product performs one multiply-add, i.e., 2
    FLOPs.
  • The 8 FLOPs for one pair of cache lines can
    therefore be performed in 2 x 100 = 200 cycles
    (one line fetch per vector).
  • This corresponds to a FLOP every 200 / 8 = 25 ns,
    for a peak speed of 1 / 25 ns = 10^9 / 25 = 40
    MFLOPS. (A sketch of the dot-product loop
    follows.)
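(A minimal C sketch of the dot product discussed in Example 2.4; with
a four-word cache line, each pair of line fetches, one from a and one
from b, supplies the operands for four consecutive multiply-adds.)

    /* dot product of two n-element vectors: one multiply-add per pair */
    float dot_product(const float *a, const float *b, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }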

44
Impact of Memory Bandwidth
  • If we take a data-layout centric point of view,
    the computation is ordered so that successive
    computations require contiguous data.
  • If the computation (or access pattern) does not
    have spatial locality, then effective bandwidth
    can be much smaller than the peak bandwidth.

45
Row-Major vs. Column-Major Access
  • Row-major order access
  • for (i = 0; i < 100; i++)
  •   for (j = 0; j < 100; j++)
  •     a[i][j] = b[i][j] + c[i][j];
  • Column-major order access
  • for (j = 0; j < 100; j++)
  •   for (i = 0; i < 100; i++)
  •     a[i][j] = b[i][j] + c[i][j];
46
Impact of strided access: Example 2.5
  • Consider the following code fragment
  • for (i = 0; i < 1000; i++) {
  •   column_sum[i] = 0.0;
  •   for (j = 0; j < 1000; j++)
  •     column_sum[i] += A[j][i];
  • }
  • The code fragment sums the columns of the matrix
    A into the vector column_sum.
  • Assumption: the matrix is stored in row-major
    fashion in memory.

47
  • Example 2.5: Impact of strided access
    (Figure: memory access patterns for "Example 2.5 column sum" and
    the reordered "Example 2.6 column sum II".)
48
Eliminating strided access: Example 2.6
  • We can fix the above code as follows
  • for (i = 0; i < 1000; i++)
  •   column_sum[i] = 0.0;
  • for (j = 0; j < 1000; j++)
  •   for (i = 0; i < 1000; i++)
  •     column_sum[i] += A[j][i];
  • In this case, the matrix is traversed in row
    order and performance can be expected to be
    significantly better.

49
Memory System Performance Summary
  • The series of examples presented in this section
    illustrate the following concepts
  • Exploiting spatial and temporal locality in
    applications is critical for amortizing memory
    latency and increasing effective memory
    bandwidth.
  • The ratio of the number of operations to the
    number of memory accesses is a good indicator of
    anticipated tolerance to memory bandwidth. For
    example, a dot product performs about one
    operation per word fetched, while an n x n matrix
    multiply performs 2n^3 operations on 3n^2 words,
    which is why the latter benefits so much more
    from caching (Example 2.3).
  • Memory layout and appropriately organizing
    computation can make a significant impact on
    spatial and temporal locality.

50
Alternate Approaches for Hiding Memory Latency
  • Imagine sitting at your computer browsing the web
    during peak network traffic hours. The lack of
    response from your browser can be alleviated by
  • Multithreading for latency hiding: like opening
    multiple browsers and accessing different pages
    in each browser; while we are waiting for one
    page to load, we can be reading others.
  • Prefetching for latency hiding: like anticipating
    which pages we are going to browse ahead of time
    and issuing requests for them in advance.
  • Spatial locality in accessing memory words: like
    accessing a whole bunch of pages in one go.

51
Multithreading for Latency Hiding
  • A thread is a single stream of control in the
    flow of a program.
  • We illustrate threads with a simple example
    (Example 2.7)
  • for (i = 0; i < n; i++)
  •   c[i] = dot_product(get_row(a, i), b);
  • Each dot-product is independent of the others, and
    therefore represents a concurrent unit of
    execution. We can safely rewrite the above code
    segment as follows (a POSIX-threads sketch is
    given after this list)
  • for (i = 0; i < n; i++)
  •   c[i] = create_thread(dot_product, get_row(a, i), b);
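(The create_thread call above is pseudocode from the text. A minimal
sketch of the same idea using POSIX threads is given below; the array
sizes, the one-thread-per-row mapping, and the sample data are
illustrative assumptions, not taken from the slides. Compile with
-pthread.)

    #include <pthread.h>
    #include <stdio.h>

    #define N 4                      /* illustrative problem size */

    static double a[N][N], b[N], c[N];

    /* each thread computes one dot product: c[i] = (row i of a) . b */
    static void *dot_product_task(void *arg) {
        int i = *(int *)arg;
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += a[i][j] * b[j];
        c[i] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        int rows[N];

        for (int i = 0; i < N; i++) {          /* some sample data */
            b[i] = 1.0;
            for (int j = 0; j < N; j++)
                a[i][j] = i + j;
        }

        /* one thread per row: while one thread waits on memory,
           another can be scheduled, hiding the latency */
        for (int i = 0; i < N; i++) {
            rows[i] = i;
            pthread_create(&tid[i], NULL, dot_product_task, &rows[i]);
        }
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);

        for (int i = 0; i < N; i++)
            printf("c[%d] = %f\n", i, c[i]);
        return 0;
    }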

52
Multithreading for Latency Hiding Example 2.7
  • In the code, the first instance of this function
    accesses a pair of vector elements and waits for
    them.
  • In the meantime, the second instance of this
    function can access two other vector elements in
    the next cycle, and so on.
  • After l units of time, where l is the latency of
    the memory system, the first function instance
    gets the requested data from memory and can
    perform the required computation.
  • In the next cycle, the data items for the next
    function instance arrive, and so on.
  • In this way, in every clock cycle, we can perform
    a computation.

53
Multithreading for Latency Hiding
  • The execution schedule in the previous example is
    predicated upon two assumptions
  • the memory system is capable of servicing
    multiple outstanding requests, and
  • the processor is capable of switching threads at
    every cycle.

54
Multithreading for Latency Hiding
  • It also requires the program to have an explicit
    specification of concurrency in the form of
    threads.
  • Machines such as the HEP and Tera rely on
    multithreaded processors that can switch the
    context of execution in every cycle.
  • Consequently, they are able to hide latency
    effectively.

55
Prefetching for Latency Hiding
  • Misses on loads cause programs to stall.
  • Why not advance the loads so that by the time the
    data is actually needed, it is already there!
  • The only drawback is that you might need more
    space to store advanced loads.
  • However, if the advanced loads are overwritten,
    we are no worse than before!

56
Example 2.8: Hiding latency with prefetching
  • Consider the problem of adding two vectors a and
    b using a single loop.
  • In the first iteration of the loop
  • The processor requests a[0] and b[0].
  • Since these are not in the cache, the processor
    must pay the memory latency.
  • While these requests are being serviced, the
    processor also requests a[1] and b[1].
  • Assume that each request is generated in one
    cycle (1 ns) and memory requests are satisfied in
    100 ns.
  • After 100 such requests, the first set of data
    items is returned by the memory system.
  • Subsequently, one pair of vector components will
    be returned every cycle. (A prefetching sketch
    follows.)
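(A minimal sketch of software prefetching for this vector add,
assuming a GCC-style compiler that provides the __builtin_prefetch
intrinsic; the look-ahead distance AHEAD is a tuning parameter, not a
value from the text.)

    #define AHEAD 16   /* how many iterations ahead to request data */

    void vector_add(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i++) {
            if (i + AHEAD < n) {
                __builtin_prefetch(&a[i + AHEAD]);  /* issue future loads  */
                __builtin_prefetch(&b[i + AHEAD]);  /* while current words */
            }                                       /* are being added     */
            c[i] = a[i] + b[i];
        }
    }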

57
Example 2.9: Impact of bandwidth on multithreaded programs
  • Consider a computation running on a machine with
    a 1 GHz clock, 4-word cache line, single cycle
    access to the cache, and 100 ns latency to DRAM.
    The computation has a cache hit ratio of 25% at
    1 KB and of 90% at 32 KB.
  • Case 1: a single threaded execution in which the
    entire cache (32 KB) is available to the serial
    context.
  • Case 2: a multithreaded execution with 32 threads
    where each thread has a cache residency of 1 KB.
  • Assume the computation makes one data request in
    every cycle of 1 ns.

58
Example 2.9
  • In the first case (single thread, 90% hit ratio
    in the 32 KB cache)
  • The computation makes one data request per cycle,
    i.e. 1 word/ns.
  • Only the 10% of requests that miss, i.e. 0.1
    words/ns, go to DRAM.
  • With 4-byte words this is 0.1 x 4 bytes/ns = 0.4
    GB/s.
  • Required DRAM bandwidth: about 400 MB/s.

59
Example 2.9
  • In the second case (32 threads, each with a 25%
    hit ratio in its 1 KB of cache)
  • The computation still makes one data request per
    cycle, i.e. 1 word/ns.
  • Now the 75% of requests that miss, i.e. 0.75
    words/ns, go to DRAM.
  • With 4-byte words this is 0.75 x 4 bytes/ns = 3
    GB/s.
  • Required DRAM bandwidth: about 3 GB/s, roughly
    7.5 times that of the single-threaded case.

60
Tradeoffs of Multithreading and Prefetching
  • Bandwidth requirements of a multithreaded system
    may increase very significantly because of the
    smaller cache residency of each thread.
  • Multithreaded systems become bandwidth bound
    instead of latency bound.
  • Multithreading and prefetching only address the
    latency problem and may often exacerbate the
    bandwidth problem.
  • Multithreading and prefetching also require
    significantly more hardware resources in the form
    of storage.

61
Dichotomy of Parallel Computing Platforms
62
Dichotomy of Parallel Computing Platforms
  • Logical
  • Control Structure of Parallel Platforms (the
    former)
  • Communication Model of Parallel Platforms (chap.
    10) (the latter)
  • Shared-Address-Space Platforms (chap. 7)
  • Message-Passing Platforms (chap. 6)
  • Physical
  • Architecture of an Ideal Parallel Computer
  • Interconnection Networks for Parallel Computers
  • Network Topologies
  • Evaluating Static Interconnection Networks
  • Evaluating Dynamic Interconnection Networks
  • Cache Coherence in Multiprocessor Systems

63
Control Structure of Parallel Programs
  • Parallelism can be expressed at various levels of
    granularity, from the instruction level to
    processes.
  • Between these extremes exist a range of models,
    along with corresponding architectural support.

64
Example 2.10: Parallelism from a single instruction
on multiple processors
  • Consider the following code segment that adds two
    vectors
  • for (i = 0; i < 1000; i++)
  •   c[i] = a[i] + b[i];
  • c[0] = a[0] + b[0], c[1] = a[1] + b[1], etc. can
    be executed independently of each other.
  • If there is a mechanism for executing the same
    instruction on all the processors, each with
    appropriate data, we could execute this loop much
    faster.

65
SIMD and MIMD
66
SIMD
  • SIMD (Single instruction stream, multiple data
    stream) Architecture
  • A single control unit dispatches instructions to
    each processing unit.
  • In an SIMD parallel computer, the same
    instruction is executed synchronously by all
    processing units.
  • These Architectural enhancements rely on the
    highly structured (regular) nature of underlying
    computations, for example in image processing and
    graphics, to deliver improved performance.

67
MIMD
  • MIMD (Multiple instruction stream, multiple data
    stream) Architecture
  • Computers in which each processing element can
    execute a different program independently of the
    other processing elements.
  • A simple variant of this model, called single
    program multiple data (SPMD), relies on multiple
    instances of the same program executing on
    different data.
  • The SPMD model is widely used by many parallel
    platforms and requires minimal architectural
    support. Examples of such platforms include the
    Sun Ultra Servers, multiprocessor PCs,
    workstation clusters, and the IBM SP.

68
SIMD vs. MIMD
  • SIMD computers require less hardware than MIMD
    computers because they have only one global
    control unit.
  • Furthermore, SIMD computers require less memory
    because only one copy of the program needs to be
    stored.
  • In contrast, platforms supporting the SPMD
    paradigm can be built from inexpensive
    off-the-shelf components with relatively little
    effort in a short amount of time.

69
SIMD Disadvantages
  • Since the underlying serial processors change so
    rapidly, SIMD computers suffer from fast
    obsolescence.
  • The irregular nature of many applications makes
    SIMD architectures less suitable.
  • Example 2.11 illustrates a case in which SIMD
    architectures yield poor resource utilization
    under conditional execution.

70
Example 2.11: Conditional Execution in SIMD
Processors
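(The figure for Example 2.11 is not reproduced in this transcript. The
conditional it discusses is along the following lines, offered as a
reconstruction, with A, B, and C denoting per-processor data items.)

    if (B == 0)
        C = A;
    else
        C = A / B;

On a SIMD machine, all processing elements first execute the "if"
branch while those with a nonzero B idle, and then all execute the
"else" branch while those with a zero B idle. In each step part of the
machine does no useful work, which is the poor resource utilization
referred to above.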
71
Communication Model of Parallel Platforms
  • There are two primary forms of data exchange
    between parallel tasks
  • Shared-Address-Space Platforms (ch. 7)
  • Message-Passing Platforms (ch. 6)

72
Shared-Address-Space Platforms
  • Part (or all) of the memory is accessible to all
    processors.
  • Processors interact by modifying data objects
    stored in this shared-address-space.
  • If the time taken by a processor to access any
    memory word in the system (global or local) is
    identical, the platform is classified as a
    uniform memory access (UMA) machine; otherwise it
    is a non-uniform memory access (NUMA) machine.

73
Shared-Address-Space Platforms
74
Shared-Address-Space Platforms
  • The shared-address-space view of a parallel
    platform supports a common data space that is
    accessible to all processors.
  • Processors interact by modifying data objects
    stored in this shared address space.
  • Memory in shared-address-space platforms can be
    local or global.
  • Shared-Address-Space platforms supporting SPMD
    programming are also referred to as
    multiprocessors.

75
Shared-Address-Space vs. Shared Memory Machines
  • It is important to note the difference between
    the terms shared address space and shared memory.
  • We refer to the former as a programming
    abstraction and to the latter as a physical
    machine attribute.
  • It is possible to provide a shared address space
    using a physically distributed memory.

76
Message-Passing Platforms
  • These platforms comprise a set of processors,
    each with its own (exclusive) memory.
  • Instances of such a view arise naturally from
    clustered workstations and
    non-shared-address-space multicomputers.
  • These platforms are programmed using (variants
    of) send and receive primitives.
  • Libraries such as MPI and PVM provide such
    primitives (a minimal MPI sketch follows).
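(A minimal sketch of the send/receive style named above, using the MPI
library mentioned on the slide; the ranks, tag, and payload are
illustrative, and error handling is omitted. Run with e.g.
mpirun -np 2.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;               /* data owned by process 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }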

77
Message Passing vs. Shared Address Space
Platforms
  • Message passing requires little hardware support,
    other than a network.
  • Shared address space platforms can easily
    emulate message passing. The reverse is more
    difficult to do (in an efficient manner).