Tarantula: A Vector Extension to the Alpha Architecture

1
Tarantula: A Vector Extension to the Alpha Architecture
Espasa, et al.
Compaq-UPC Microprocessor Lab in Spain
Alpha Development Group in Massachusetts
Presented by Curt Harting
2
Motivation
  • To build CMPs, multithreaded systems, etc.,
    control logic scales non-linearly
  • Power and area are spent without doing any
    computation
  • Each instruction only does a limited number of
    computations
  • Multi-billion-dollar scientific (parallel)
    industry
  • Want to exploit memory/L2 bandwidth as much as
    possible
  • On unit, regular non-unit, and irregular strides
  • Can't invent a new ISA/coherence protocol, or
    spend much time on development

3
Overview
  • A vector extension to the Alpha architecture
  • A vector unit (VBox) on chip with the EV8 core,
    made up of 16 lanes.
  • Capable of 32 flops/cycle
  • 16MB L2 cache to pipe data directly into the VBox

4
Architectural Changes
  • 45 New instructions
  • Predication, not prediction within VBox
  • 35 Architectural Registers
  • 32 Vector Registers (128 x 64 bits each), VL, VS, VM
  • Register Renaming
  • V31 tied to 0 for easy prefetching (128 lines, or
    8 kB, with one instruction)
  • Runs old code
  • Must recompile to take advantage of the VBox
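
The sizes on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the 64-byte cache line is an assumption inferred from "128 lines, or 8kB", and the function names are hypothetical.

```python
# Figures from the slide: 32 vector registers, 128 elements each, 64 bits per element.
NUM_VECTOR_REGS = 32
ELEMENTS_PER_REG = 128
BITS_PER_ELEMENT = 64

# Assumption (not stated on the slide): 64-byte cache lines,
# which is what makes 128 lines come out to 8 kB.
CACHE_LINE_BYTES = 64


def vector_register_bytes():
    """Size of one vector register in bytes."""
    return ELEMENTS_PER_REG * BITS_PER_ELEMENT // 8


def vector_file_bytes():
    """Total size of the 32 vector registers in bytes."""
    return NUM_VECTOR_REGS * vector_register_bytes()


def prefetch_bytes(lines=ELEMENTS_PER_REG):
    """Bytes touched by one vector load into V31 that hits one line per element."""
    return lines * CACHE_LINE_BYTES


print(vector_register_bytes())  # 1024
print(vector_file_bytes())      # 32768
print(prefetch_bytes())         # 8192
```

Under these assumptions a single vector register is 1 kB, the whole vector register file is 32 kB, and one prefetch instruction can cover 8 kB.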

5
The VBox
  • 16 Lanes
  • Each lane holds a slice of the registers and mask;
    a unified register file would be too large
  • 2 functional units (North and South)
  • Instruction, LD, and ST queues
  • TLB - 1 per lane, 32 entries each
  • 512MB virtual pages
  • On a miss, either fill one or all
  • Symmetric
  • Multithreaded
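
As a rough illustration of lanes and predication, the sketch below models a masked vector add in plain Python. The round-robin mapping of elements to lanes and all function names are assumptions made for the example, not details taken from the slides.

```python
LANES = 16          # from the slide
UNITS_PER_LANE = 2  # North and South functional units, from the slide
MAX_VL = 128        # elements per vector register


def lane_of(element_index):
    """Assumed round-robin mapping of vector elements to lanes."""
    return element_index % LANES


def elements_per_lane(vl=MAX_VL):
    """How many elements of a vl-long vector land in each lane."""
    counts = [0] * LANES
    for i in range(vl):
        counts[lane_of(i)] += 1
    return counts


def masked_add(va, vb, vdest, vl, vm):
    """Predicated add: only elements below vl with a set mask bit are written."""
    out = list(vdest)
    for i in range(vl):
        if vm[i]:
            out[i] = va[i] + vb[i]
    return out


def peak_flops_per_cycle():
    """16 lanes x 2 functional units, matching the 32 flops/cycle on the overview slide."""
    return LANES * UNITS_PER_LANE


print(elements_per_lane()[:4])  # [8, 8, 8, 8]
print(masked_add([1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0], 3, [1, 0, 1, 1]))
# [11, 0, 33, 0]
```

With this mapping each lane holds 8 of the 128 elements of every register, which is one way to read "slice of registers" on the slide.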

6
Communication with EV8 Core
  • 3 Instruction bus to IQ
  • 3 9-bit instruction ID buses for retirement from VCU
  • A bus to carry scalars (2 x 64 bits)
  • Kill Signal
  • On exception, only the instruction is given, not
    the faulting lane

7
Memory, Addressing
  • VBox communicates solely with the L2
  • L2 has 16 banks that can be accessed in parallel
  • Normal Strides: those that aren't self-
    conflicting or 1
  • Built-in ROM to generate 8 slices of 16 qw (8
    cycles)
  • PUMP operation: stride of 1 (16/17 cache lines)
  • 2x the bandwidth (4 cycles); routed through a
    special structure
  • Makes a difference (sometimes)
  • Gather/Scatters: arbitrary addresses
  • Greedy algorithm in the CR box
  • Worst case: 128 cycles
  • Self-Conflicting Strides: stride = y * 2^x, where
    y mod 2 = 1 (y is odd) and x > 4
  • Treated as a Gather/Scatter
  • Caveat: still have to wait the full time
    regardless of the number of quadwords needed
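
The stride categories on this slide can be sketched as a small classifier. This is a minimal sketch: it assumes strides are measured in quadwords and reads the self-conflicting rule as stride = y * 2^x with y odd and x > 4. The function name and the cycle table are illustrative, with the cycle counts copied from the bullets above.

```python
def classify_stride(stride_qw):
    """Classify a non-zero stride (assumed to be in quadwords)."""
    if stride_qw == 0:
        raise ValueError("stride must be non-zero")
    s = abs(stride_qw)
    if s == 1:
        return "pump"
    # x is the number of trailing zero bits, so s = y * 2**x with y odd.
    x = (s & -s).bit_length() - 1
    if x > 4:
        return "self-conflicting"  # handled like a gather/scatter
    return "normal"


# Cycle counts quoted on the slide; gather/scatter is a worst case.
CYCLES = {"pump": 4, "normal": 8, "self-conflicting": 128}

for stride in (1, 2, 16, 32, 96):
    kind = classify_stride(stride)
    print(stride, kind, CYCLES[kind])
```

Strides of 32 and 96 quadwords both have five trailing zero bits, so they fall into the self-conflicting class, while a stride of 16 still counts as normal.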

8
Memory, Consistency
  • Problem: VBox writes to the L2, behind the L1's
    back!
  • Every line in the L2 has a presence bit that is
    set if the EV8 core has touched that line
  • If a line has its P-bit set, the L2 must
    essentially issue a GETX to the L1.
  • Scalar Write, Vector Read: the vector read can't
    see store/write buffers (no P-bit set yet)
  • Programmer/Compiler must anticipate this case and
    add an extra barrier
  • DrainM forces a purging of the store/write
    buffers into cache
  • Also forces the killing and re-fetching of
    younger instructions
  • On a cache miss, the entire slice waits until the
    offending block is replaced and a retry occurs
  • After a threshold of retries, the cache enters a
    panic mode
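
The scalar-write, vector-read hazard can be illustrated with a toy model. Everything in it is a simplifying assumption made for illustration: the class and method names are hypothetical, the L1 is modeled as write-through (so a vector read never needs data back from the L1), and the P-bit is a plain set of addresses. It is not a description of the real hardware.

```python
class ToyHierarchy:
    """Toy model: scalar core with a store buffer and L1; VBox talks only to L2."""

    def __init__(self):
        self.store_buffer = []  # pending scalar stores as (addr, value)
        self.l1 = {}            # addr -> value
        self.l2 = {}            # addr -> value
        self.p_bit = set()      # L2 lines the scalar core has touched

    def scalar_write(self, addr, value):
        # The store sits in the buffer; the L2 (and so the VBox) cannot see it yet.
        self.store_buffer.append((addr, value))

    def drain_m(self):
        # Barrier: purge buffered stores into the caches (write-through assumed).
        for addr, value in self.store_buffer:
            self.l1[addr] = value
            self.l2[addr] = value
            self.p_bit.add(addr)
        self.store_buffer.clear()

    def scalar_read(self, addr):
        for a, v in reversed(self.store_buffer):  # youngest matching store wins
            if a == addr:
                return v
        if addr not in self.l1:                   # L1 miss: fill from L2, set P-bit
            self.l1[addr] = self.l2.get(addr, 0)
            self.p_bit.add(addr)
        return self.l1[addr]

    def vector_read(self, addr):
        return self.l2.get(addr, 0)               # VBox reads the L2 only

    def vector_write(self, addr, value):
        if addr in self.p_bit:                    # GETX-like invalidate of the L1 copy
            self.l1.pop(addr, None)
            self.p_bit.discard(addr)
        self.l2[addr] = value


h = ToyHierarchy()
h.scalar_write(0x40, 7)
stale = h.vector_read(0x40)   # store is still buffered, so the old value is read
h.drain_m()
fresh = h.vector_read(0x40)   # after the barrier the vector read sees the store
print(stale, fresh)           # 0 7
```

In this model the vector read before `drain_m()` returns the old value, which is the case the programmer or compiler has to guard with a barrier.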

9
Evaluation
  • Vectorizable portions of vector benchmarks chosen
  • Large Vectors chosen, they do better
  • All but 2 (sixtrack, linpack100) have over 98%
    vector code
  • EV8 Code compiled with an EV6 scheduler
  • Hand compiled and hand tuned for Tarantula
  • All benchmarks cache-friendly or custom tiled
    (up to 2x speedup)
  • Many more registers in Tarantula
  • Large prefetches
  • A standard microprocessor being compared to the
    specialized processor

10
Low-Level Questions
  • Power: only 20% more than an equivalent EV8 CMP.
    Is 144W really a power win, even with the
    increase in performance per watt?
  • Memory Bandwidth: one of the most expensive
    pieces of overall cost
  • Why was it assumed to quadruple in four years?
  • Why does the system not have it as a bottleneck?
    Amdahl's Law and Fig. 8
  • L2 Sizing/bandwidth seemed critical to the
    performance, what would happen if the L2 was
    smaller and/or slower?
  • Was DrainM the best way of accomplishing its goal?
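
The Amdahl's Law question can be made concrete with the standard formula, speedup = 1 / ((1 - f) + f / s), where f is the vectorizable fraction and s is the speedup of that fraction. The sketch below uses f = 0.98 as an example, loosely based on the figure quoted on the evaluation slide; the vector speedups are hypothetical inputs, not measured results.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only vector_fraction of the work is sped up."""
    serial = 1.0 - vector_fraction
    return 1.0 / (serial + vector_fraction / vector_speedup)


def max_speedup(vector_fraction):
    """Upper bound as the vector speedup grows without limit."""
    return 1.0 / (1.0 - vector_fraction)


# Hypothetical vector speedups, for illustration only.
for s in (2, 4, 8, 16):
    print(s, round(amdahl_speedup(0.98, s), 2))
```

Even at 98% vector code the scalar remainder caps the overall speedup at roughly 50x, so the bound gets much tighter for less vectorizable code.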

11
High-Level Questions
  • Gather/Scatter support seems like a great idea
  • How many programs touch random parts of an array
    in a parallel fashion?
  • How can you compile pointer walk throughs?
  • Is there a multi-billion dollar scientific
    compute industry?
  • If so, is this processor an answer for it? Only
    does well for large vectors.
  • Is this a commodity processor or an expensive
    system?
  • The paper implies the goal of making it a
    commodity plug and play processor
  • Talks of very large memory bandwidth
    requirements, power requirements, huge L2
  • Is Tarantula one of those ideas that goes from
    good to bad to good?
  • Did Tarantula catch on? Just Google it!