13th Lecture: 6. Future Processors to use Coarse-Grain Parallelism (PPT Presentation Transcript)

1
13th Lecture: 6. Future Processors to use
Coarse-Grain Parallelism
  • Today's microprocessors utilize instruction-level
    parallelism by a deep instruction pipeline and by
    the superscalar or VLIW multiple-issue techniques
  • Today's (2001) technology: approx. 40 M
    transistors per chip; in the future (2012):
    1.4 G transistors per chip. What next?
  • Two directions:
  • Increase of single-thread performance → use of
    more speculative instruction-level parallelism
  • Increase of multi-thread (multi-task)
    performance → utilize thread-level parallelism
    in addition to instruction-level parallelism.
    A thread in this lecture means a HW thread,
    which can be a SW (POSIX) thread, a process, ...
  • Far future (??): increase of single-thread
    performance by use of speculative
    instruction-level and thread-level parallelism

2
5.2 Advanced Superscalar Processors for Billion
Transistor Chips in Year 2005 - Characteristics
  • Aggressive speculation, such as a very aggressive
    dynamic branch predictor,
  • a large trace cache,
  • very-wide-issue superscalar processing (an issue
    width of 16 or 32 instructions per cycle),
  • a large number of reservation stations to
    accommodate 2,000 instructions,
  • 24 to 48 highly optimized, pipelined functional
    units,
  • sufficient on-chip data cache, and
  • sufficient resolution and forwarding logic.
  • see Yale N. Patt, Sanjay J. Patel, Marius
    Evers, Daniel H. Friendly, Jared Stark: One
    Billion Transistors, One Uniprocessor, One Chip.
    IEEE Computer, September 1997, pp. 51-57.

3
(No Transcript)
4
Requirements and Solutions
  • Delivering optimal instruction bandwidth
    requires
  • a minimal number of empty fetch cycles,
  • a very wide, full issue each cycle (conservatively
    16 instructions, aggressively 32),
  • and a minimal number of cycles in which the
    instructions fetched are subsequently discarded.
  • Consuming this instruction bandwidth requires
  • sufficient data supply,
  • and sufficient processing resources to handle the
    instruction bandwidth.
  • Suggestions:
  • an instruction cache system (the I-cache) that
    provides for out-of-order fetch (fetch, decode,
    and issue in the presence of I-cache misses).
  • a large Trace cache for providing a logically
    contiguous instruction stream,
  • an aggressive Multi-Hybrid branch predictor
    (multiple, separate branch predictors, each tuned
    to a different class of branches) with support
    for context switching, indirect jumps, and
    interference handling.

5
Multi-Hybrid Branch Predictor
  • Hybrid predictors comprise several predictors,
    each targeting different classes of branches.
  • Idea: each predictor scheme works best for a
    different branch type.
  • McFarling already combined two predictors (see
    also PowerPC 620, Alpha 21364,...).
  • As predictors increase in size, they often take
    more time to react to changes in a program
    (warm-up time).
  • A hybrid predictor with several components can
    solve this problem by using component predictors
    with shorter warm-up times while the larger
    predictors are warming up.
  • Examples of predictors with shorter warm-up times
    are two-level predictors with shorter histories
    as well as smaller dynamic predictors.
  • The Multi-Hybrid uses a set of selection counters
    for each entry in the branch target buffer, in
    the trace cache, or in a similar structure,
    keeping track of the predictor currently most
    accurate for each branch and then using the
    prediction from that predictor for that branch.
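
A minimal sketch of such a per-branch selector (an illustration of the selection mechanism, not the paper's actual implementation), assuming one 2-bit saturating selection counter per component predictor in each BTB entry:

```c
/* Hypothetical per-branch selector: each BTB entry keeps one 2-bit saturating
 * counter per component predictor and uses the currently highest-scoring
 * component for the prediction. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_COMPONENTS 4          /* e.g. bimodal, gshare, loop, 2-level */

typedef struct {
    uint8_t select[NUM_COMPONENTS];   /* 2-bit saturating selection counters */
} btb_entry_t;

typedef struct {
    bool taken[NUM_COMPONENTS];       /* each component's prediction          */
} component_preds_t;

/* Pick the component with the highest selection counter (ties -> lowest index). */
static int best_component(const btb_entry_t *e)
{
    int best = 0;
    for (int i = 1; i < NUM_COMPONENTS; i++)
        if (e->select[i] > e->select[best])
            best = i;
    return best;
}

static bool predict(const btb_entry_t *e, const component_preds_t *p)
{
    return p->taken[best_component(e)];
}

/* After the branch resolves: reward components that were right,
 * penalise those that were wrong (saturating at 0..3). */
static void update_selectors(btb_entry_t *e, const component_preds_t *p, bool outcome)
{
    for (int i = 0; i < NUM_COMPONENTS; i++) {
        if (p->taken[i] == outcome) {
            if (e->select[i] < 3) e->select[i]++;
        } else {
            if (e->select[i] > 0) e->select[i]--;
        }
    }
}

int main(void)
{
    btb_entry_t entry = { {1, 1, 1, 1} };                  /* all equal at start */
    component_preds_t preds = { {true, false, true, false} };
    bool outcome = false;                                  /* branch not taken   */

    printf("prediction: %s\n", predict(&entry, &preds) ? "taken" : "not taken");
    update_selectors(&entry, &preds, outcome);
    printf("prediction after update: %s\n",
           predict(&entry, &preds) ? "taken" : "not taken");
    return 0;
}
```

On every update the components that predicted correctly are rewarded and the wrong ones are penalised, so the counters converge on the component that is currently most accurate for that particular branch.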

6
Instruction supply and out-of-order fetch
  • An in-order fetch processor, upon encountering a
    trace cache miss, waits until the miss is
    serviced before fetching any new segments.
  • But an out-of-order fetch processor temporarily
    ignores the segment associated with the miss,
    attempting to fetch, decode, and issue the
    segments that follow it.
  • After the miss has been serviced, the processor
    decodes and issues the ignored segment.
  • Higher performance can be achieved by fetching
    instructions that, in terms of a dynamic
    instruction trace, appear after a mispredicted
    branch but are not control-dependent upon that
    branch.
  • In the event of a mispredict, only instructions
    control-dependent on the mispredicted branch are
    discarded.
  • Out-of-order fetch provides a way to fetch
    control-independent instructions.
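
A toy model of this out-of-order fetch behaviour (the segment granularity, miss latencies, and pending-list size below are assumptions of the sketch, not taken from the paper):

```c
/* Toy out-of-order fetch: each trace segment either hits in the trace cache
 * (miss_latency == 0) or misses and becomes ready after miss_latency cycles.
 * On a miss the segment is set aside and the segments behind it are fetched;
 * the deferred segment is issued as soon as its miss is serviced. */
#include <stdio.h>

#define NUM_SEGMENTS 6
#define MAX_PENDING  4

int main(void)
{
    int miss_latency[NUM_SEGMENTS] = {0, 10, 0, 0, 0, 0};  /* segment 1 misses */

    int pending_seg[MAX_PENDING], pending_ready[MAX_PENDING], num_pending = 0;
    int next = 0;

    for (int cycle = 0; next < NUM_SEGMENTS || num_pending > 0; cycle++) {
        /* Issue any deferred segment whose miss has been serviced. */
        for (int i = 0; i < num_pending; i++) {
            if (cycle >= pending_ready[i]) {
                printf("cycle %2d: issue deferred segment %d\n",
                       cycle, pending_seg[i]);
                pending_seg[i]   = pending_seg[--num_pending];
                pending_ready[i] = pending_ready[num_pending];
                i--;
            }
        }
        if (next >= NUM_SEGMENTS)
            continue;

        if (miss_latency[next] == 0) {                 /* trace cache hit     */
            printf("cycle %2d: issue segment %d\n", cycle, next);
        } else if (num_pending < MAX_PENDING) {        /* out-of-order fetch  */
            printf("cycle %2d: segment %d misses, fetch continues past it\n",
                   cycle, next);
            pending_seg[num_pending]     = next;
            pending_ready[num_pending++] = cycle + miss_latency[next];
        } else {
            continue;   /* pending list full: stall like an in-order fetcher */
        }
        next++;
    }
    return 0;
}
```

Running it shows segment 1 being set aside on its miss while segments 2 to 5 are fetched and issued, and segment 1 being issued once the miss returns, which is the behaviour described in the bullets above.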

7
Data Supply
  • A 16-wide-issue processor will need to execute
    about eight loads/stores per cycle.
  • The primary design goal of the data-cache
    hierarchy is to provide the necessary bandwidth
    to support eight loads/stores per cycle.
  • The size of a single, monolithic, multi-ported,
    first-level data cache would likely be so large
    that it would jeopardize the cycle time.
  • Because of this, we expect the first-level data
    cache to be replicated to provide the required
    ports.
  • Further features of the data supply system:
  • A bigger second-level data cache with fewer port
    requirements.
  • Data prefetching.
  • Processors will predict the addresses of loads,
    allowing loads to be executed before the operands
    needed for their address calculation have been
    computed (see the sketch after this list).
  • Processors will predict dependencies between
    loads and stores, allowing them to predict that a
    load is always dependent on some older store.
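
As a concrete illustration of load-address prediction, a per-PC stride predictor could look like the sketch below (the table organisation, confidence threshold, and example PC are assumptions, not a description of any particular processor):

```c
/* Minimal per-PC stride address predictor: each table entry remembers the last
 * address a load produced and the last observed stride; the prediction is
 * last address + stride, issued only above a confidence threshold. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024   /* direct-mapped, indexed by load PC */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    uint8_t  confidence;   /* 2-bit saturating confidence counter */
} addr_pred_entry_t;

static addr_pred_entry_t table[TABLE_SIZE];

static addr_pred_entry_t *entry_for(uint64_t pc)
{
    return &table[(pc >> 2) % TABLE_SIZE];
}

/* Predict the address of the load at 'pc'; returns false if not confident. */
static bool predict_load_address(uint64_t pc, uint64_t *pred_addr)
{
    addr_pred_entry_t *e = entry_for(pc);
    if (e->confidence < 2)
        return false;                      /* not confident enough to speculate */
    *pred_addr = e->last_addr + (uint64_t)e->stride;
    return true;
}

/* Train the predictor once the load's real address is known. */
static void update_load_address(uint64_t pc, uint64_t real_addr)
{
    addr_pred_entry_t *e = entry_for(pc);
    int64_t new_stride = (int64_t)(real_addr - e->last_addr);

    if (new_stride == e->stride) {
        if (e->confidence < 3) e->confidence++;   /* stride confirmed */
    } else {
        e->stride = new_stride;                   /* learn the new stride */
        e->confidence = 0;
    }
    e->last_addr = real_addr;
}

int main(void)
{
    uint64_t pred;
    /* A load walking an array with an 8-byte stride (hypothetical PC). */
    for (uint64_t i = 0; i < 6; i++) {
        uint64_t addr = 0x1000 + 8 * i;
        if (predict_load_address(0x400100, &pred))
            printf("iteration %llu: predicted %#llx, actual %#llx\n",
                   (unsigned long long)i, (unsigned long long)pred,
                   (unsigned long long)addr);
        else
            printf("iteration %llu: no confident prediction\n",
                   (unsigned long long)i);
        update_load_address(0x400100, addr);
    }
    return 0;
}
```

A load whose address advances by a constant stride (e.g. walking an array) quickly reaches full confidence, so its address, and hence its data, can be obtained before the address operands are computed.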

8
Execution Core
  • If the fetch engine is providing 16 to 32
    instructions per cycle, then the execution core
    must consume instructions just as rapidly.
  • To avoid unnecessary delays due to false
    dependencies, logical registers must be renamed
    (a small rename-map sketch follows this list).
  • To compensate for the delays imposed by the true
    data dependencies, instructions must be executed
    out of order.
  • Patt et al. envision an execution core
    comprising 24 to 48 functional units supplied
    with instructions from large reservation stations
    and having a total storage capacity of 2,000 or
    more instructions.
  • Functional units will be partitioned into
    clusters of three to five units. Each cluster
    will maintain an individual register file. Each
    functional unit has its own reservation station.
  • Instruction scheduling will be done in stages.
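
The rename-map sketch referenced above, with illustrative sizes (32 logical, 128 physical registers); freeing of physical registers at commit is omitted for brevity:

```c
/* Illustrative rename stage: a map from logical to physical registers plus a
 * free list.  Renaming removes false (WAR/WAW) dependencies because every new
 * definition of a logical register gets a fresh physical register. */
#include <stdio.h>

#define NUM_LOGICAL  32
#define NUM_PHYSICAL 128

static int map[NUM_LOGICAL];            /* logical reg -> current physical reg */
static int free_list[NUM_PHYSICAL];
static int free_top;

static void rename_init(void)
{
    for (int l = 0; l < NUM_LOGICAL; l++)
        map[l] = l;                                  /* identity mapping at reset */
    free_top = 0;
    for (int p = NUM_PHYSICAL - 1; p >= NUM_LOGICAL; p--)
        free_list[free_top++] = p;                   /* remaining regs are free   */
}

/* Rename one instruction "dst = src1 op src2"; returns 0 on success,
 * -1 if no physical register is free (the rename stage would stall). */
static int rename_instr(int dst, int src1, int src2,
                        int *pdst, int *psrc1, int *psrc2)
{
    *psrc1 = map[src1];                  /* sources read the current mapping    */
    *psrc2 = map[src2];
    if (free_top == 0)
        return -1;
    *pdst = free_list[--free_top];       /* destination gets a fresh register   */
    map[dst] = *pdst;
    return 0;
}

int main(void)
{
    rename_init();
    int pd, ps1, ps2;
    /* r1 = r2 + r3, then r1 = r1 + r4: the second write to r1 gets a new
     * physical register, so WAW and WAR hazards on r1 disappear and only the
     * true dependence through the first result remains. */
    rename_instr(1, 2, 3, &pd, &ps1, &ps2);
    printf("r1=r2+r3  -> p%d = p%d op p%d\n", pd, ps1, ps2);
    rename_instr(1, 1, 4, &pd, &ps1, &ps2);
    printf("r1=r1+r4  -> p%d = p%d op p%d\n", pd, ps1, ps2);
    return 0;
}
```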

9
Conclusions (so far)
  • Patt et al. argue that the highest-performance
    computing system (when one-billion-transistor
    chips are available) will contain a single
    processor on each processor chip.
  • All this makes sense, however, only
  • if CAD tools can be improved to design such
    chips, and
  • if algorithms and compilers can be redesigned to
    take advantage of such powerful dynamically
    scheduled engines.

10
6. Future Processors to use Coarse-Grain
Parallelism
  • Chip multiprocessors (CMPs) or multiprocessor
    chips
  • integrate two or more complete processors on a
    single chip,
  • every functional unit of a processor is
    duplicated
  • Simultaneous multithreaded processors (SMTs)
  • store multiple contexts in different register
    sets on the chip,
  • the functional units are multiplexed between the
    threads,
  • instructions of different contexts are
    simultaneously executed

11
6.2 Chip Multiprocessors (CMPs)
6.2.1 Principal Chip Multiprocessor Alternatives
  • symmetric multiprocessor (SMP),
  • distributed shared memory multiprocessor (DSM),
  • message-passing shared-nothing multiprocessor.

12
Organizational principles of multiprocessors
13
Typical SMP
14
Shared memory candidates for CMPs:
shared-main memory and shared-secondary cache
15
Shared memory candidates for CMPs:
shared-primary cache
16
Grain-levels for CMPs
  • multiple processes in parallel
  • multiple threads from a single application →
    implies a common address space for all threads
  • extracting threads of control dynamically from a
    single instruction stream
  • → see last chapter: multiscalar, trace
    processors, ...

17
Texas Instruments TMS320C80 Multimedia Video
Processor
18
Hydra: A Single-Chip Multiprocessor
19
Conclusions on CMP
  • Usually, a CMP will feature
  • separate L1 I-cache and D-cache per on-chip CPU
  • and an optional unified L2 cache.
  • If the CPUs always execute threads of the same
    process, the L2 cache organization will be
    simplified, because different processes do not
    have to be distinguished.
  • Recently announced commercial processors with CMP
    hardware
  • IBM Power4 processor with two processors on a
    single die
  • Sun MAJC5200: two processors on a die (each
    processor a 4-threaded block-interleaving VLIW)

20
6.3 Multithreaded Processors
  • Aim: latency tolerance
  • What is the problem?
  • Load access latencies measured on an AlphaServer
    4100 SMP with four 300 MHz Alpha 21164 processors
    are:
  • 7 cycles for a primary cache miss which hits in
    the on-chip L2 cache of the 21164 processor,
  • 21 cycles for an L2 cache miss which hits in the
    L3 (board-level) cache,
  • 80 cycles for a miss that is served by the
    memory, and
  • 125 cycles for a dirty miss, i.e., a miss that
    has to be served from another processor's cache
    memory.
  • Multithreaded processors are able to bridge
    latencies by switching to another thread of
    control - in contrast to chip multiprocessors.
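
A rough back-of-the-envelope model of this latency bridging (my own illustration; the 20-cycle run length between misses is an assumed workload parameter, not from the lecture):

```c
/* Simple analytical model: a thread runs R cycles, then stalls for L cycles.
 * With N threads the processor stays busy when N*R >= R + L, i.e.
 * N >= 1 + L/R, and a single thread achieves utilisation R / (R + L). */
#include <stdio.h>

int main(void)
{
    const int run_len = 20;                       /* assumed cycles between misses */
    const int latencies[] = {7, 21, 80, 125};     /* from the Alpha 21164 numbers  */
    const char *kind[] = {"L2 hit", "L3 hit", "memory", "dirty miss"};

    for (int i = 0; i < 4; i++) {
        int L = latencies[i];
        int needed = 1 + (L + run_len - 1) / run_len;     /* ceil(1 + L/R)  */
        double util_1 = (double)run_len / (run_len + L);  /* single thread  */
        printf("%-10s L=%3d cycles: 1 thread -> %4.0f%% busy, "
               "~%d threads hide the latency\n",
               kind[i], L, 100.0 * util_1, needed);
    }
    return 0;
}
```

For the 80-cycle memory miss, for example, a single thread keeps the processor only about 20% busy under this model, while roughly five threads suffice to hide the latency completely.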

21
Multithreaded Processors
  • Multithreading
  • Provide several program counter registers (and
    usually several register sets) on chip
  • Fast context switching by switching to another
    thread of control

22
Approaches of Multithreaded Processors
  • Cycle-by-cycle interleaving
  • An instruction of another thread is fetched and
    fed into the execution pipeline at each processor
    cycle.
  • Block-interleaving
  • The instructions of a thread are executed
    successively until an event occurs that may cause
    latency. This event induces a context switch.
  • Simultaneous multithreading
  • Instructions are simultaneously issued from
    multiple threads to the FUs of a superscalar
    processor.
  • combines wide superscalar instruction issue with
    multithreading.
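
A small sketch of how an SMT issue stage might fill its slots from several threads in one cycle (purely illustrative; real fetch/issue policies such as ICOUNT are more involved):

```c
/* Illustrative SMT issue stage: each cycle, up to ISSUE_WIDTH slots are filled
 * with ready instructions taken from all hardware threads, sweeping over the
 * threads round-robin so no single thread monopolises the slots. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ISSUE_WIDTH 8

/* Number of ready instructions each thread has this cycle (toy input). */
static int ready[NUM_THREADS] = {5, 0, 3, 7};

int main(void)
{
    int issued_from[NUM_THREADS] = {0};
    int slots_left = ISSUE_WIDTH;

    /* Keep sweeping over the threads until the slots are full or
     * no thread has anything ready. */
    bool progress = true;
    while (slots_left > 0 && progress) {
        progress = false;
        for (int t = 0; t < NUM_THREADS && slots_left > 0; t++) {
            if (ready[t] > 0) {
                ready[t]--;
                issued_from[t]++;
                slots_left--;
                progress = true;
            }
        }
    }

    for (int t = 0; t < NUM_THREADS; t++)
        printf("thread %d: issued %d instruction(s) this cycle\n",
               t, issued_from[t]);
    printf("slots used: %d of %d\n", ISSUE_WIDTH - slots_left, ISSUE_WIDTH);
    return 0;
}
```

With per-thread ready counts of 5, 0, 3 and 7, all eight slots are filled from threads 0, 2 and 3; a single-threaded superscalar in the same cycle could issue at most the five ready instructions of thread 0.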

23
Comparison of Multithreading with
Non-Multithreading Approaches
  • (a) single-threaded scalar
  • (b) cycle-by-cycle interleaving multithreaded
    scalar
  • (c) block interleaving multithreaded scalar

24
Comparison of Multithreading with
Non-Multithreading Approaches
(Figure: time in processor cycles vs. issue slots)
  • (a) superscalar, (b) VLIW, (c) cycle-by-cycle
    interleaving, (d) cycle-by-cycle interleaving VLIW

25
Comparison of Multithreading with
Non-Multithreading Approaches
  • simultaneous multithreading (SMT) and
    chip multiprocessor (CMP)

26
6.3.3 Cycle-by-Cycle Interleaving
  • the processor switches to a different thread
    after each instruction fetch
  • pipeline hazards cannot arise and the processor
    pipeline can be easily built without the
    necessity of complex forwarding paths
  • context-switching overhead is zero cycles
  • memory latency is tolerated by not scheduling a
    thread until the memory transaction has completed
    (see the selector sketch after this list)
  • requires at least as many threads as pipeline
    stages in the processor
  • degrading the single-thread performance if not
    enough threads are present
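
A tiny fetch selector for cycle-by-cycle interleaving (the thread count and the memory-wait flags are assumptions of the sketch): each cycle it picks the next thread in round-robin order that is not waiting on a memory transaction.

```c
/* Toy fetch selector for cycle-by-cycle interleaving, matching the
 * "do not schedule a thread until its memory transaction has completed"
 * rule above.  The strict scheme additionally allows at most one instruction
 * per thread in the pipeline; that constraint is not modelled here. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_THREADS 4

static bool waiting_on_memory[NUM_THREADS] = {false, true, false, false};
static int  last_thread = NUM_THREADS - 1;

/* Returns the thread to fetch from this cycle, or -1 if none is ready. */
static int select_thread(void)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last_thread + i) % NUM_THREADS;
        if (!waiting_on_memory[t]) {
            last_thread = t;
            return t;
        }
    }
    return -1;   /* all threads stalled: the fetch slot stays empty */
}

int main(void)
{
    for (int cycle = 0; cycle < 8; cycle++) {
        int t = select_thread();
        if (t >= 0)
            printf("cycle %d: fetch from thread %d\n", cycle, t);
        else
            printf("cycle %d: no ready thread, bubble\n", cycle);
    }
    return 0;
}
```

Because the strict scheme keeps at most one instruction of each thread in the pipeline, at least as many ready threads as pipeline stages are needed to keep every stage busy, which is the single-thread penalty noted in the last bullet.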

27
Cycle-by-Cycle Interleaving - Improving
single-thread performance
  • The dependence look-ahead technique adds several
    bits to each instruction format in the ISA.
  • The scheduler can then feed instructions of the
    same thread that are neither data- nor
    control-dependent successively into the pipeline
    (see the sketch after this list).
  • The interleaving technique proposed by Laudon et
    al. adds caching and full pipeline interlocks to
    the cycle-by-cycle interleaving approach.
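
A sketch of how such look-ahead bits could be used (the field meaning and the example values are assumptions for illustration, not the actual Tera encoding):

```c
/* Illustrative use of a dependence look-ahead field: each instruction carries
 * a small count of how many of the immediately following instructions of the
 * same thread do not depend on it, so the scheduler may keep issuing from
 * that thread instead of switching after every instruction. */
#include <stdio.h>

typedef struct {
    const char *text;
    int lookahead;    /* following instructions guaranteed independent */
} instr_t;

int main(void)
{
    /* One thread's instruction stream with compiler-filled look-ahead bits. */
    instr_t stream[] = {
        {"load  r1, [r5]",   2},   /* next 2 instructions don't use r1 */
        {"add   r2, r3, r4", 1},
        {"sub   r6, r7, r8", 0},
        {"mul   r9, r1, r2", 0},   /* uses r1: must wait for the load  */
    };
    int n = sizeof stream / sizeof stream[0];

    int allowed = 1;               /* instructions we may issue back-to-back */
    for (int i = 0; i < n; i++) {
        if (allowed == 0) {
            printf("-- switch to another thread (dependence reached) --\n");
            allowed = 1;           /* this thread is scheduled again later   */
        }
        printf("issue: %s\n", stream[i].text);
        /* The look-ahead field may extend the run of independent issues. */
        if (stream[i].lookahead + 1 > allowed)
            allowed = stream[i].lookahead + 1;
        allowed--;
    }
    return 0;
}
```

Here the load's look-ahead value of 2 lets the two independent instructions behind it issue back-to-back, and the scheduler only switches away before the multiply, which truly depends on the load.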

28
Tera MTA
  • cycle-by-cycle interleaving technique
  • employs the dependence-look-ahead technique
  • VLIW ISA (3-issue)
  • The processor switches context every cycle (3 ns
    cycle period) among as many as 128 distinct
    threads, thereby hiding up to 128 cycles (384 ns)
    of memory latency. → 128 register sets

29
Tera Processing Element
30
Tera MTA