Decoupled Architectures for Complexity-Effective General Purpose Processors

Transcript and Presenter's Notes

1
Decoupled Architectures for Complexity-Effective
General Purpose Processors
  • Ronny Krashinsky and Mike Sung
  • 6.893 Term Project Presentation
  • MIT Laboratory for Computer Science
  • 12-7-2000

2
Motivation
  • out-of-order superscalar designs are inefficient
    and hard to scale
  • decoupled architectures can provide latency
    hiding, dynamic scheduling, and ILP in a much
    more complexity-effective and scalable manner
  • in previous work, decoupled architectures have
    been investigated for scientific apps
  • superscalar architectures are used universally
    for general purpose computing requirements
  • why? superscalars provide more flexibility, and
    decoupled architectures break down when there is
    a loss of decoupling

3
Proposal
  • use decoupled architectures for
    complexity-effective general purpose computing
  • multithreading can be used to hide loss of
    decoupling latency
  • potentially get the best of both architectures
    by providing a superscalar processor with
    decoupled engines for complexity-effective
    streaming computations
  • we present a survey of prior work and our
    proposed architectural innovations; unfortunately,
    a more detailed investigation would require
    substantial infrastructure (e.g. a compiler)

4
Decoupled Access/Execute Architecture
  • AP and EP process separate instruction streams
  • EP used for computation (floating point)
  • ILP
  • data values communicated via queues (sketched
    below)
  • slip: AP runs ahead of EP
  • memory latency hiding
  • dynamic scheduling
  • head of AEQ can be used as an instruction operand
    in EP
  • blocks if data isn't available
  • takes the place of register renaming
  • store addresses wait in WAQ until the
    corresponding data arrives from EP
  • loads can bypass stores (with an address check)

Decoupled Access/Execute Computer Architectures,
Smith, 1982
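
A toy Python sketch of this queue flow (purely illustrative, not the
Smith/ZS-1 design; the 4-cycle memory latency and one-load-per-cycle AP
are assumed parameters): the AP slips ahead issuing loads, values arrive
in the AEQ, the EP stalls only when the AEQ is empty, and store
addresses wait in the WAQ until the EP produces the data.

from collections import deque

MEM_LATENCY = 4                      # assumed fixed memory latency (cycles)
memory = {addr: addr * 10 for addr in range(8)}

aeq = deque()                        # access-to-execute data queue
waq = deque()                        # write address queue (store addresses)
inflight = deque()                   # (ready_cycle, value) loads in flight

loads_to_issue = deque(range(8))     # AP's load-address stream
ep_work = 8                          # EP: one computation per loaded value

cycle = ep_stalls = 0
while ep_work or inflight or aeq:
    cycle += 1
    # AP: runs ahead, issuing one load per cycle and queuing a store address
    if loads_to_issue:
        addr = loads_to_issue.popleft()
        inflight.append((cycle + MEM_LATENCY, memory[addr]))
        waq.append(addr + 100)       # result will be stored to addr+100
    # memory: deliver completed loads into the AEQ
    while inflight and inflight[0][0] <= cycle:
        aeq.append(inflight.popleft()[1])
    # EP: consume the AEQ head as an operand, or stall if it is empty
    if ep_work:
        if aeq:
            value = aeq.popleft() + 1         # the "execute" computation
            memory[waq.popleft()] = value     # store pairs with its WAQ entry
            ep_work -= 1
        else:
            ep_stalls += 1

print(f"finished in {cycle} cycles, EP stalled {ep_stalls} cycles")

Because the AP keeps issuing while the EP computes, only the first few
loads expose their latency; the rest is hidden by the slip between the
two streams.
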
5
Decoupled Access/Execute Architecture
  • program control flow is implemented with a
    corresponding conditional branch in each stream
  • branch condition queues allow the AP to hide
    branch latency from the EP (sketched below)
  • loss of decoupling if the AP depends on a branch
    condition from the EP
  • not discussed in early works
  • implemented in the Astronautics ZS-1 processor
  • a single interleaved instruction stream is split
    to feed the instruction queues
  • control flow instructions are executed in the
    splitter

Decoupled Access/Execute Computer Architectures,
Smith, 1982
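
A similarly minimal Python sketch of the branch condition queue idea
(assumed semantics, not the ZS-1 encoding): the AP resolves the loop
branch and pushes each outcome onto the BCQ, and the EP's copy of the
branch simply pops the queued outcome, so branch resolution never
stalls the EP. If the dependence ran the other way, with the AP waiting
on a condition computed by the EP, the AP would have to drain and wait,
which is the loss-of-decoupling case above.

from collections import deque

bcq = deque()          # branch condition queue, AP -> EP
aeq = deque()          # data queue, AP -> EP
data = [3, 1, 4, 1, 5]

# AP stream: runs ahead, resolving control flow and feeding both queues
for x in data:
    aeq.append(x)
    bcq.append(True)   # outcome of the AP's copy of the loop branch
bcq.append(False)      # final outcome: exit the loop

# EP stream: its copy of the branch consumes the queued outcome, so it
# never waits on branch resolution
total = 0
while bcq.popleft():
    total += aeq.popleft()
print(total)           # 14
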
6
Simultaneous Multithreading with DAE
  • observation that functional unit latencies and
    true data dependencies in EP hinder performance
  • use SMT and thread level parallelism to better
    utilize functional units (same as with SMT in
    superscalars)
  • only a few threads are required
  • decoupling provides memory latency tolerance, SMT
    hides functional unit latencies

The Synergy of Multithreading and Access/Execute
Decoupling, Parcerisa and Gonzalez, 1998
7
Decoupled Control/Access/Execute Architecture
  • further optimization: control decoupling
  • three instruction streams, dynamic slip
  • CP processes the control flow graph and sends
    directives to the AP and EP to execute basic
    blocks (sketched below)
  • limited control capabilities in AP and EP: loop
    count and predication
  • fetch engines fill queues with valid instructions
  • dynamic loop unrolling
  • control latency hidden (without speculation)
  • stream units
  • CU can operate in stand-alone mode
  • implemented as a 21064, ran the OS

The Effectiveness of Decoupling, Bird et al.,
1993
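
A rough Python sketch of control decoupling (the directive format and
block names are invented for illustration, not the DCAE ISA): the CP
walks the control flow and sends (block, count) directives, and each
fetch engine simply replays the named block that many times, so taken
branches never reach the AP or EP and loops are unrolled dynamically.

from collections import deque

ap_directives, ep_directives = deque(), deque()

def control_processor(n):
    # CP resolves the loop trip count up front and issues one directive
    # per unit instead of one branch per iteration
    for q in (ap_directives, ep_directives):
        q.append(("loop_body", n))
        q.append(("epilogue", 1))

# hypothetical basic blocks for each stream
ap_blocks = {"loop_body": "load a[i]; load b[i]", "epilogue": "store sum"}
ep_blocks = {"loop_body": "sum += a[i] * b[i]",   "epilogue": "nop"}

def fetch_engine(name, directives, blocks):
    trace = []
    while directives:
        block, count = directives.popleft()
        trace.extend([f"{name}: {blocks[block]}"] * count)   # dynamic unrolling
    return trace

control_processor(n=3)
print("\n".join(fetch_engine("AP", ap_directives, ap_blocks) +
                fetch_engine("EP", ep_directives, ep_blocks)))
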
8
Decoupled Control/Access/Execute Architecture
  • loss of decoupling events cause breakdown

The Performance of Decoupled Architectures,
Parcerisa et al., 1996
9-24
Decoupled Control/Access/Execute Architecture
  • (animation-only slides; the figures are not
    preserved in the transcript, with a LOD! event
    marked at slide 14)
25
Decoupled vs. Superscalar Architectures
  • Dynamic out-of-order execution with less
    complexity
  • Allows non-speculative instruction and data
    prefetching. We can shrink structures like
    first-level caches, potentially reducing critical
    paths as well as power
  • Inherent tolerance of long memory latency
    provides a performance advantage for streaming
    applications and other codes where a lack of
    locality limits the benefit of caches
  • Simplified issue logic that can be implemented
    with small structures/queues (contrast with
    ROB/IW/bypass structures)
  • Better resource utilization by partitioning
    between CP/AP/EP; processors can have specialized
    ISAs
  • Scalability is a direct consequence of the
    simplified logic
  • Superscalar processors instead need to grow the
    issue window, which does not scale
    (Palacharla/Agarwal papers)
  • Decoupled machines alleviate centralized resource
    bottlenecks
  • Queue-based structure is amenable to tiled
    architectures with on-chip networks

26
Decoupled Architectures for General Purpose
Computing
  • So why haven't decoupled machines taken over the
    world?
  • Because superscalar architectures took over the
    world first
  • Primary drawback of decoupled architectures comes
    from LOD events: twisty C code can cause severe
    performance degradation
  • Inability of compilers to target separate
    instruction streams effectively; lack of
    research/development in programming and compiler
    analysis
  • Wheel of Reincarnation: no such thing as a new
    idea
  • If we can augment existing decoupled
    architectures to remove the effects of LOD
    events, we have an architecture that can feasibly
    be used for general purpose computing
  • Leverage existing ideas to augment decoupling:
    multithreading and auxiliary processing

27
Multithreading on a DCAE Architecture
  • Multithreading hides the latency of LOD events.
  • LOD events result in very long latencies (>100s
    of cycles) to reestablish decoupling
  • Motivation is to hide LOD events so the machine
    need not stall while resynchronizing
  • SMT hides functional unit latencies.

28
Multithreading on a DCAE Architecture
  • Multithreading in the access/execute units
  • Multiple contexts (IP/RF) for fast context
    switching during a LOD event (sketched below)
  • Interleaved SMT to hide horizontal as well as
    vertical waste within the execute processor
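
A back-of-the-envelope Python sketch of the intended behavior (the
100-cycle resynchronization penalty and the work-chunk lengths are
made-up numbers): each hardware context runs until its thread hits a
LOD event, is parked for the resynchronization latency, and a ready
context is switched in, so the unit only idles when every context is
waiting.

import heapq

LOD_PENALTY = 100                      # assumed resynchronization latency

def run_unit(segments):
    """segments[t] = busy-work chunks for thread t, with a LOD stall
    between consecutive chunks."""
    ready = [(t, 0) for t in range(len(segments))]   # (thread, next chunk)
    parked = []                                      # (wakeup, thread, chunk)
    cycle = idle = 0
    while ready or parked:
        while parked and parked[0][0] <= cycle:      # wake resynchronized threads
            _, t, c = heapq.heappop(parked)
            ready.append((t, c))
        if not ready:                  # every context is waiting out a LOD
            idle += parked[0][0] - cycle
            cycle = parked[0][0]
            continue
        t, c = ready.pop(0)
        cycle += segments[t][c]        # run this chunk up to its LOD (or the end)
        if c + 1 < len(segments[t]):   # LOD hit: park this context, switch threads
            heapq.heappush(parked, (cycle + LOD_PENALTY, t, c + 1))
    return cycle, idle

# one thread alone: 20 + 100-cycle LOD stall + 40  ->  (160, 100)
print(run_unit([[20, 40]]))
# three such threads sharing the unit: most of the penalty is hidden
# behind the other threads' work  ->  (240, 60)
print(run_unit([[20, 40], [20, 40], [20, 40]]))
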

29
Multithreading on a DCAE Architecture
  • With multithreading, utilization of CP/AP/EP by
    different threads is pipelined
  • analogous to instruction pipelining in a CPU
    datapath

30-46
Multithreading on a DCAE Architecture
  • (animation-only slides; the figures are not
    preserved in the transcript, with a LOD! event
    marked at slide 35)
47
Auxiliary Decoupled Access/Execute Streaming Units
  • Implement the control processor as a fully
    functional, high-performance microprocessor. The
    compiler can avoid decoupling control-intensive
    code.
  • When decoupling is possible (e.g. streaming
    computations), the decoupled access/execute
    engines provide a high-performance,
    complexity-effective alternative.
  • Analogous to vector coprocessors or SIMD array
    coprocessors. The basic idea is to utilize
    specialized hardware when possible and have a
    fallback plan when the Achilles' heel is exposed.

48
Extensions for Improved Performance
  • Wider-issue access/execute processors
  • Speculative multithreading
  • Control processor can spawn speculative threads
    when only a single thread of control is available
  • Mis-speculation detection can be performed by
    checking accessed memory addresses (in queues)
    for collisions (see the sketch after this list)
  • Kill a speculative thread by simply flushing its
    queues/context
  • Can merge concepts, with multithreaded decoupled
    execution under the auxiliary access/execute
    units paradigm.
  • Use decoupling/multithreading when possible, and
    fall back on the high-performance control
    processor otherwise
  • Tiled architectures: extend decoupled
    architectures to scalable multiprocessor systems
    such as RAW.
  • Queue-based structure is a good fit for
    incorporating communication from other tiles
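
A small Python sketch of the proposed collision check (the queue names
and addresses are illustrative, not a concrete design): the speculative
thread's load addresses are compared against the store addresses still
queued for the older, non-speculative thread; any overlap means
mis-speculation, and recovery is just flushing the speculative thread's
queues and context.

from collections import deque

# hypothetical speculative-thread context: its queues plus the load
# addresses it has touched so far
spec_ctx = {"aeq": deque([7, 8]), "waq": deque([0x210]),
            "loads": [0x100, 0x140]}
older_waq = deque([0x140, 0x180])    # older thread's pending store addresses

collisions = set(older_waq) & set(spec_ctx["loads"])
if collisions:
    # kill the speculative thread by flushing its queues and context
    spec_ctx["aeq"].clear()
    spec_ctx["waq"].clear()
    spec_ctx["loads"].clear()
    print("mis-speculation at", [hex(a) for a in collisions], "- thread flushed")
else:
    print("no collision: speculative thread may commit")
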

49
Summary
  • Decoupled architectures represent a
    complexity-effective and scalable way to provide
    dynamic scheduling, hide latency, and exploit ILP
  • To enable general purpose computation, we can
    augment decoupling with multithreading to hide
    the latency of LODs
  • By using decoupled access and execute units as
    auxiliary processors, we can leverage the
    benefits of both: decoupling for streaming
    computations, and out-of-order superscalars for
    control-flow-intensive computations