Parallel Architectures - PowerPoint PPT Presentation

Slides: 33
Provided by: barbara179

Transcript and Presenter's Notes
1
Parallel Architectures
  • State of the Art
  • Shared Memory Computers

2
Agenda
  • System level
  • Shared memory
  • Distributed memory
  • Microprocessor level
  • Multithreading
  • Multicore

3
Classification of Architectures
  • Flynn's taxonomy: not very accurate, but widely
    used
  • SISD (Single Instruction Single Data stream)
  • sequential von Neumann architecture
  • MISD (Multiple Instruction Single Data stream)
  • do any exist?
  • SIMD (Single Instruction Multiple Data stream)
  • Lock-step machines
  • MIMD (Multiple Instruction Multiple Data stream)
  • modern data and instruction parallel systems

4
SIMD Architectures
  • Lock-step parallel computers, developed as
    alternative to vector computers
  • Front end (control unit) and back end (SIMD
    system)
  • Sequential instructions performed by front end
  • Parallel instructions broadcast to back end at
    each instruction cycle
  • Massive parallelism: maybe thousands of
    processors
  • Examples
  • Historical: Connection Machine, MasPar
  • Today: SIMD instructions (MMX, SSE), graphics
    processors, the Cell's SPEs
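The lock-step idea can be modeled in a few lines. This is an illustrative sketch, not real SIMD hardware: the front end broadcasts one instruction per cycle, and every back-end processing element (PE) applies it to its own data slice simultaneously. The data and instruction names are made up.

```python
# Model of SIMD lock-step execution: one instruction stream, many data lanes.
data_lanes = [[1, 2], [3, 4], [5, 6], [7, 8]]   # one slice of data per PE
program = [("add", 10), ("mul", 2)]              # sequential front-end program

for op, operand in program:                      # front end broadcasts one instruction...
    for lane in data_lanes:                      # ...every back-end PE executes it in lock step
        for i in range(len(lane)):
            lane[i] = lane[i] + operand if op == "add" else lane[i] * operand

print(data_lanes[0])  # [22, 24]
```

Each element goes through the same instruction sequence: (1 + 10) * 2 = 22, (2 + 10) * 2 = 24, and so on across all lanes.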

5
MIMD Architectures
  • Computer with multiple independently operating
    CPUs
  • CPUs are connected with each other and with
    memory
  • Many types of topologies: linear (bus), ring,
    tree, mesh, 3D torus
  • CPUs have independent channels to memory
  • Two major kinds of MIMD machines
  • shared memory
  • distributed memory
  • now also distributed shared memory systems
  • Most current HPC systems are MIMD machines
  • Examples: SGI PowerChallenge, IBM SP2, HP Exemplar

6
Examples
[Figure: example distributed memory systems (nodes connected by a LAN) and shared memory systems]
7
Shared Memory Architectures
  • Built during the 1980s, now popular again
  • Multiple processors that access same main memory
  • connected to main memory via bus
  • Each processor has local cache
  • Cache coherence: when a memory location is
    updated, any copies in the local caches of other
    processors must be invalidated
  • Advantage: easy to program, easy processor
    communication
  • Disadvantage
  • Bus is a bottleneck, so only small systems are
    possible
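The "easy to program" point can be seen with ordinary threads, which share one address space. A minimal sketch (variable names are illustrative): communication happens through plain memory, with a lock standing in for the coordination that coherent hardware makes cheap.

```python
import threading

counter = {"value": 0}           # lives in the single shared address space
lock = threading.Lock()          # protects the shared update from races

def work(n):
    for _ in range(n):
        with lock:               # without the lock, increments could be lost
            counter["value"] += 1

workers = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(counter["value"])  # 4000
```

No explicit data movement is needed: every thread reads and writes the same `counter` directly, which is exactly what makes shared memory easy to program.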

8
Shared Memory
  • All processes/threads access same shared memory
    data space

[Diagram: proc1 ... procN all accessing a single shared memory space]
9
More Realistic View of Shared Memory Architecture
[Diagram: cpu1 ... cpuN, each with its own cache, connected by a bus to shared memory]
10
Distributed Memory Architecture
  • Built from the late 1980s onward
  • Multiple processors, each with its own local
    memory
  • No shared memory
  • Processor/memory nodes connected via a network
  • If data in one processor's memory is needed by
    another processor, it must be transmitted as
    messages via the network
  • Data exchange is slow in comparison to clock speed
  • Hard to program
  • message passing is the "assembly language" of
    parallel programming
  • Advantage: distributed memory parallel
    architectures scale easily
  • Examples: Intel iPSC, Paragon; IBM SP2; Cray T3D,
    T3E

11
Distributed Memory
  • The alternative model to shared memory

[Diagram: proc1 ... procN, each with its own local memory mem1 ... memN, connected by a network]
12
Distributed Memory Multicomputers
  • Each node is von Neumann machine
  • Cost of local memory access << cost of remote
    memory access
  • Variety of interconnects used (butterfly
    networks, hypercubes, multistage networks)

13
Distributed Shared Memory Systems
  • Recent architecture that overcomes the size limits
    of shared memory systems
  • Distributed memory system, but memory is globally
    addressed
  • So it looks like a shared memory system to users
  • Hardware supports cache coherency
  • Examples: HP Exemplar, SGI Origin, SGI Altix

Program as shared memory system
14
Distributed Shared Memory
  • For larger numbers of processors

[Diagram: proc1 ... procN with caches cache1 ... cacheN, connected by an interconnect; the local memories mem1 ... memN together form one global shared memory space]
15
Why Shared Memory Parallelism?
  • Shared memory parallelism was introduced to
    desktop computing
  • Now it is becoming the basis for general-purpose
    computing
  • Shared memory parallelism is powerful and very
    flexible
  • can provide large amounts of memory for a program
  • More power without raising the clock speed
  • Easy to program compared to distributed memory
  • Also a good starting point for learning more
    aggressive parallel programming
  • Technology was fairly cost-effective
  • Can help avoid excessive power consumption and
    overheating
  • Multicore/multithreading

16
Agenda
  • System level
  • Shared memory
  • Distributed memory
  • Microprocessor level
  • Multithreading
  • Multicore

17
Generations of Microprocessors
  • Serial processors
  • handle each instruction back to back
  • IPC (instructions per cycle) < 1
  • Pipelined processors
  • overlap different stages of handling of different
    instructions
  • IPC approaches 1 as throughput
  • Superscalar processors
  • issue/execute multiple instructions in parallel
    (Instruction level parallelism ILP)
  • Multiple function units Out-of-order/speculative
    execution
  • IPC > 1
  • Multicore/multithreaded processors
  • IPC >> 1

18
Limitations of superscalar uniprocessor computers
  • Limited instruction level parallelism
  • Data hazards
  • Instruction depends on result of prior
    instruction still in the pipeline
  • Structural hazards
  • limited resources to support parallel execution
    of instructions
  • Control hazards
  • Branches and jumps cause the wrong instructions
    to be fetched/executed
  • Difficulties in increasing clock frequency
  • Power consumption
  • Heat dissipation
  • Solution: thread-level parallelism

19
Multithreading
  • Thread: a lightweight stream of instructions
    within a process
  • Shared resources: memory address space, file
    descriptors
  • Private state (context): registers,
    program counter, stack, etc.
  • Software multithreading: context maintained by
    software (OS)
  • Useful for overlapping I/O and computation
  • Hardware multithreading: thread context
    maintained by hardware
  • Multiple processors: SMP
  • A multithreaded processor
  • Fine-grained and coarse-grained multithreading
  • Simultaneous multithreading
  • Multicore
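The "overlapping I/O and computation" benefit is easy to demonstrate with software threads. A minimal sketch in which `time.sleep` stands in for a blocking read: while one thread waits, the others wait concurrently rather than back to back.

```python
import threading
import time

def slow_io():
    time.sleep(0.2)   # stand-in for a blocking I/O call

start = time.perf_counter()
workers = [threading.Thread(target=slow_io) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
elapsed = time.perf_counter() - start
# The four 0.2 s waits overlap, so elapsed is roughly 0.2 s, not 4 x 0.2 = 0.8 s.
```

This is exactly the case where software multithreading pays off even on a single CPU: the processor is idle during I/O anyway, so switching to another thread costs nothing in throughput.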

20
MultiThreaded Processors
  • Fine-grain: switch threads on every instruction
    issue
  • Round-robin thread interleaving (skipping stalled
    threads)
  • Advantage: can hide throughput losses that come
    from both short and long stalls
  • Disadvantage: slows down the execution of an
    individual thread; frequent thread switching is
    costly
  • Coarse-grain: switch threads only on costly
    stalls (e.g., L2 cache misses)
  • Advantages: thread switching is cheaper, and much
    less likely to slow down the execution of an
    individual thread
  • Disadvantage: limited ability to overcome
    throughput loss
  • Pipeline must be flushed and refilled on thread
    switches
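The fine-grained policy can be sketched as a toy scheduler: one issue slot per cycle, threads rotated round-robin, and stalled threads skipped. Thread names and instruction labels here are made up for illustration; real hardware does this per pipeline stage, not in software.

```python
# Toy round-robin issue for fine-grained multithreading.
threads = {"T0": ["i0", "i1", "i2"], "T1": ["j0", "j1"], "T2": []}  # T2 is stalled
issued = []

while any(threads.values()):
    for name in threads:              # rotate over the hardware threads
        if threads[name]:             # skip a thread with nothing ready to issue
            issued.append(threads[name].pop(0))

print(issued)  # ['i0', 'j0', 'i1', 'j1', 'i2']
```

Note how T0's instructions are spread out in time even though T0 never stalls: that spacing is the per-thread slowdown the slide lists as the fine-grained disadvantage.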

21
Simultaneous Multithreading
  • A combination of ILP and TLP
  • Superscalar: issue/execute multiple instructions
    per cycle
  • TLP: hardware support for instructions from
    multiple threads simultaneously
  • Improved resource utilization
  • Shared function units, cache, TLB, etc.
  • Pentium 4 Hyper-Threading: about 5% die size
    increase for about 30% performance gain

22
Hyper-Threading from Intel
  • 1 physical processor supporting 2 logical
    processors
  • Shared instruction trace cache and L1 D-cache
  • Replicated PC and register renamer
  • Other resources partitioned equally between the 2
    threads
  • Recombines partitioned resources when running
    single-threaded (ensures single-thread performance)

[Figure: Intel NetBurst microarchitecture pipeline with Hyper-Threading Technology]
23
ILP and TLP
[Figure: issue-slot utilization compared across superscalar, fine-grained multithreading, and SMT. Courtesy of Pratyusa Manadhata (CMU)]
24
Multicore Processors
  • Multicore (chip multiprocessing)
  • Several processors integrated into one chip
  • Dedicated execution resources for each core
  • Shared L1, off-chip datapath
  • L2 shared or dedicated
  • Benefits of integration compared to SMP
  • Increased communication bandwidth between
    processors
  • Decreased latency
  • Cost/performance efficiency

25
Dual-core Intel vs. AMD
  • Major difference: how to ensure scalability, which
    is limited by memory/I/O bandwidth
  • Intel: increased front side bus frequency
  • AMD: integrated memory controller, 3
    HyperTransport links (to I/O and other CPUs)

26
Performance SMT vs. Multicore
Sun Fire V490: 4 dual-core processors (4 x 2
threads)
Intel Xeon: 2 SMT processors (2 x 2 threads)
NAS parallel benchmark in OpenMP
Chunhua Liao, Zhenying Liu, Lei Huang, and
Barbara Chapman. "Evaluating OpenMP on Chip
MultiThreading Platforms". First International
Workshop on OpenMP, IWOMP 2005. Eugene, Oregon
USA. June 1-4, 2005.
27
Hybrid Multicore/Multithreading
  • Sun's UltraSPARC T1 (Niagara): throughput
    computing, not ideal for scientific computing
  • 8 single-issue, in-order cores, each 4-way
    fine-grained multithreaded

[Diagram: eight 4-way multithreaded SPARC pipelines connected by a crossbar to a 4-way banked L2 cache, memory controllers, and shared I/O functions]
28
More Hybrid Multicore/Multithreading
Intel Itanium 2: dual core, 2-way
coarse-grained multithreading
IBM Power5: dual core, 4-way SMT
29
An Aggressive One: Tera MTA
  • A large-scale multithreading pioneer
  • Tera MultiThreaded Architecture
  • Up to 256 processors
  • Up to 128 threads per processor
  • Fine-grained multithreading
  • 256 x 128 = 32,768 threads!
  • Uniform shared memory
  • No data cache
  • up to 2.5 gigabytes per second via the
    interconnection network
  • About 25 active streams per MTA processor are
    needed to overlap memory latency with
    computational processing

Tera Computer Company bought the remains of the
Cray Research division of Silicon Graphics in
2000 and promptly renamed itself Cray Inc.
30
Asymmetric (Heterogeneous) Multiprocessing
  • Combining general-purpose and special-purpose
    cores together
  • Cell: 1 + 8 cores in one chip
  • 1 Power Processor Element (PPE)
  • general purpose, 64-bit RISC
  • 8 Synergistic Processor Elements (SPEs)
  • SIMD units, support both parallel and stream
    processing

Cell Processor for game consoles
31
Summary
  • Almost all modern supercomputers are MIMD
    architectures
  • SIMD components are also popular
  • Main distinction is between shared memory and
    distributed memory
  • But also distributed shared memory with cache
    coherency
  • Microprocessors are increasingly powerful
  • Relying more on thread-level parallelism than on
    increasing frequency and complexity

32
Announcements
  • Did you get an email from Dr. Chapman?
  • If not, please sign up with your name and email now.
  • Class website
  • http://www.cs.uh.edu/chapman/teachpubs/6397
  • There may be other invited speakers teaching
    this class on Wednesday.