Tarantula: A Vector Extension to the Alpha Architecture

About This Presentation

Description:

Tarantula: A Vector Extension to the Alpha Architecture. Espasa, et al. Compaq-UPC Microprocessor Lab in Spain; Alpha Development Group in Massachusetts.

Slides: 12

Provided by: CurtH150

Learn more at: http://people.ee.duke.edu

Transcript and Presenter's Notes

1
Tarantula: A Vector Extension to the Alpha
Architecture
Espasa, et al.
Compaq-UPC Microprocessor Lab in Spain
Alpha Development Group in Massachusetts
Presented by Curt Harting
2
Motivation

When building CMPs, multithreaded systems, etc.,
control logic scales non-linearly
Power and area are spent without doing
any computation
Each instruction does only a limited number of
computations
Multi-billion-dollar scientific (parallel)
industry
Want to exploit memory/L2 bandwidth as much as
possible
On unit, regular non-unit, and irregular strides
Can't invent a new ISA/coherence protocol, or
spend much time on development

3
Overview

A vector extension to the Alpha architecture
A vector unit (VBox) on chip with the EV8 core,
made up of 16 lanes.
Capable of 32 flops/cycle
16MB L2 cache to pipe data directly into the VBox

4
Architectural Changes

45 New instructions
Predication, not prediction within VBox
35 Architectural Registers
32 Vector Registers (128 x 64 bits each), VL, VS, VM
Register Renaming
V31 tied to 0 for easy prefetching (128 lines, or
8kB with 1 insn)
Runs old code
Must recompile to take advantage of the VBox
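
The register and prefetch figures on this slide can be cross-checked with a little arithmetic. The sketch below is not from the presentation: it assumes each vector register holds 128 elements of 64 bits, and it assumes a 64-byte cache line, which is inferred from the slide's "128 lines, or 8kB" figure. All names are illustrative.

```python
# Back-of-the-envelope check of the architectural state and the V31 prefetch.
# Assumptions (labeled, not from the slides): 128 elements per vector register,
# 64 bits per element, and a 64-byte cache line (inferred from 8 kB / 128 lines).

VECTOR_REGS = 32        # V0..V31
ELEMS_PER_REG = 128     # assumed elements per vector register
ELEM_BITS = 64          # assumed element width (one quadword)
LINE_BYTES = 64         # assumed cache-line size


def register_bytes() -> int:
    """Bytes held by one vector register."""
    return ELEMS_PER_REG * ELEM_BITS // 8


def register_file_bytes() -> int:
    """Bytes held by all 32 vector registers together."""
    return VECTOR_REGS * register_bytes()


def prefetch_bytes_per_instruction() -> int:
    """Bytes touched by one load into V31 if every element hits a new line."""
    return ELEMS_PER_REG * LINE_BYTES


if __name__ == "__main__":
    print("one register:", register_bytes(), "bytes")
    print("register file:", register_file_bytes(), "bytes")
    print("one V31 prefetch:", prefetch_bytes_per_instruction(), "bytes")
```

Under these assumptions a single vector load whose destination is V31 touches 128 distinct lines, which is where the 8 kB per instruction on the slide comes from.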

5
The VBox

16 Lanes
Slice of registers and mask per lane; a unified
register file would be too large
2 functional units (North and South)
Instruction, LD, and ST queues
TLB - 1 per lane, 32 entries each
512MB virtual pages
On a miss, either fill one or all
Symmetric
Multithreaded
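
A minimal sketch of how the lane organization relates to the 32 flops/cycle quoted in the Overview. It assumes each functional unit completes one floating-point operation per cycle, that a vector register has 128 elements, and that element i lives in lane i mod 16; none of these are stated on the slide, and the function names are illustrative.

```python
# Toy model of the 16-lane VBox organization.
# Assumptions (not from the slides): one flop per functional unit per cycle,
# 128 elements per vector register, elements interleaved across lanes.

LANES = 16              # from the slide
UNITS_PER_LANE = 2      # "North" and "South" functional units
ELEMS_PER_REG = 128     # assumed vector register length


def peak_flops_per_cycle() -> int:
    """Peak throughput if every functional unit retires one flop per cycle."""
    return LANES * UNITS_PER_LANE


def lane_of(element_index: int) -> int:
    """Lane that would own a given element under round-robin interleaving."""
    return element_index % LANES


def elements_per_lane() -> int:
    """Elements of each vector register stored in one lane's register slice."""
    return ELEMS_PER_REG // LANES


if __name__ == "__main__":
    print("peak flops/cycle:", peak_flops_per_cycle())
    print("elements per lane per register:", elements_per_lane())
    print("element 17 lives in lane", lane_of(17))
```

With these assumptions each lane only needs ports for its own slice of every register, which is the motivation the slide gives for not building one unified register file.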

6
Communication with EV8 Core

3-instruction bus to IQ
Three 9-bit instruction ID buses for retirement from VCU
A bus to carry scalars (2x64 bit)
Kill Signal
On exception, only the instruction is given, not
the faulting lane

7
Memory, Addressing

VBox communicates solely with the L2
L2 has 16 banks that can be accessed in parallel
Normal strides: those that aren't self-conflicting
or 1
Built-in ROM to generate 8 slices of 16 qw (8
cycles)
PUMP operation: stride of 1 (16/17 cache lines)
2x the bandwidth (4 cycles); routed through a
special structure
Makes a difference (sometimes)
Gather/scatters: arbitrary addresses
Greedy algorithm in the CR box
Worst case: 128 cycles
Self-conflicting strides: stride = y * 2^x, where
y % 2 = 1 and x > 4
Treated as a gather/scatter
Caveat: still have to wait the full time
regardless of the number of quadwords needed
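
The addressing cases on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the hardware's algorithm: `classify_stride` reads the garbled self-conflicting condition as "an odd number times 2^x with x > 4" and takes the stride in whatever unit the slide intends (not specified), while `greedy_slices` assumes the 16 banks are interleaved on 64-byte lines. Both function names and the bank mapping are hypothetical.

```python
# Sketch of stride classification and greedy, bank-conflict-free slicing.
# Assumptions (not from the slides): 64-byte lines, banks interleaved by line,
# and a simple first-fit greedy pass standing in for the CR box.

BANKS = 16          # from the slide: 16 L2 banks accessed in parallel
LINE_BYTES = 64     # assumed cache-line size


def classify_stride(stride: int) -> str:
    """Return 'pump', 'normal', or 'self-conflicting' for a nonzero stride."""
    if stride == 0:
        raise ValueError("stride must be nonzero")
    s = abs(stride)
    if s == 1:
        return "pump"
    x = 0
    while s % 2 == 0:       # factor stride as (odd y) * 2**x
        s //= 2
        x += 1
    return "self-conflicting" if x > 4 else "normal"


def bank_of(addr: int) -> int:
    """Assumed bank mapping: consecutive cache lines go to consecutive banks."""
    return (addr // LINE_BYTES) % BANKS


def greedy_slices(addrs):
    """Pack addresses into slices that touch each bank at most once."""
    pending = list(addrs)
    slices = []
    while pending:
        used, current, rest = set(), [], []
        for a in pending:
            b = bank_of(a)
            if b in used:
                rest.append(a)      # bank already busy in this slice
            else:
                used.add(b)
                current.append(a)
        slices.append(current)
        pending = rest
    return slices


if __name__ == "__main__":
    unit = [i * 8 for i in range(128)]              # 128 consecutive quadwords
    same_bank = [i * LINE_BYTES * BANKS for i in range(128)]
    print(classify_stride(1), classify_stride(3), classify_stride(32))
    print(len(greedy_slices(unit)), len(greedy_slices(same_bank)))
```

Under this bank mapping, 128 consecutive quadwords pack into 8 slices of 16, and 128 addresses that all fall in one bank need 128 slices, which lines up with the 8-cycle and 128-cycle figures on the slide if one slice issues per cycle.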

8
Memory, Consistency

Problem: VBox writes to the L2, behind the L1's
back!
Every line in the L2 has a presence bit that is
set if the EV8 core has touched that line
If a line has its P-bit set, the L2 must
essentially issue a GETX to the L1.
Scalar write, vector read: the vector read can't
see store/write buffers (no P-bit set yet)
Programmer/compiler must anticipate this case and
add an extra barrier
DrainM forces a purging of the store/write
buffers into cache
Also forces the killing and re-fetching of
younger instructions
On a cache miss, the entire slice waits until the
offending block is replaced and a retry occurs
After a threshold of retries, the cache enters a
panic mode
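
The scalar-write/vector-read hazard and the role of DrainM can be illustrated with a toy model. This is a heavily simplified sketch, not the EV8/Tarantula protocol: it tracks single addresses instead of lines, treats the L1 as write-through, and models the "GETX to the L1" as a plain invalidation on vector writes. The class and method names are hypothetical.

```python
# Toy model of P-bits, the scalar store buffer, and a DrainM-style barrier.
# Assumptions (not from the slides): address-granular state, write-through L1,
# and invalidation of the L1 copy when a vector write hits a P-bit line.

class ToyMemory:
    def __init__(self):
        self.l2 = {}              # address -> value, what the VBox sees
        self.l1 = {}              # scalar core's private copies
        self.p_bits = set()       # L2 entries the scalar core has touched
        self.store_buffer = []    # scalar stores not yet written to cache

    def scalar_store(self, addr, value):
        """Scalar stores sit in the store buffer until drained."""
        self.store_buffer.append((addr, value))

    def drain_m(self):
        """Barrier: push buffered scalar stores into the caches."""
        for addr, value in self.store_buffer:
            self.l1[addr] = value
            self.l2[addr] = value
            self.p_bits.add(addr)
        self.store_buffer.clear()

    def scalar_load(self, addr):
        """Scalar loads forward from the store buffer, then L1, then L2."""
        for a, v in reversed(self.store_buffer):
            if a == addr:
                return v
        if addr not in self.l1:
            self.l1[addr] = self.l2.get(addr, 0)
            self.p_bits.add(addr)
        return self.l1[addr]

    def vector_load(self, addr):
        """Vector loads read the L2 directly and cannot see the store buffer."""
        return self.l2.get(addr, 0)

    def vector_store(self, addr, value):
        """Vector stores write the L2; a set P-bit forces an L1 invalidation."""
        if addr in self.p_bits:
            self.l1.pop(addr, None)
            self.p_bits.discard(addr)
        self.l2[addr] = value


if __name__ == "__main__":
    mem = ToyMemory()
    mem.l2[0x40] = 1
    mem.scalar_store(0x40, 7)
    print("vector read before barrier:", mem.vector_load(0x40))  # stale value
    mem.drain_m()
    print("vector read after barrier:", mem.vector_load(0x40))
    mem.vector_store(0x40, 9)
    print("scalar read after vector write:", mem.scalar_load(0x40))
```

In this model the vector read returns the stale value until the barrier drains the store buffer, which is the case the slide says the programmer or compiler must anticipate.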

9
Evaluation

Vectorizable portions of vector benchmarks chosen
Large vectors chosen; they do better
All but 2 (sixtrack, linpack100) have over 98%
vector code
EV8 Code compiled with an EV6 scheduler
Hand compiled and hand tuned for Tarantula
All benchmarks cache-friendly or custom tiled
(up to 2x speedup)
Many more registers in Tarantula
Large prefetches
A standard microprocessor being compared to the
specialized processor

10
Low-Level Questions

Power: only 20% more than equivalent EV8 CMP.
Is 144W really a power win, even with the
increase in performance per watt?
Memory bandwidth: one of the most expensive
pieces of overall cost
Why was it assumed to quadruple in four years?
Why does the system not have it as a bottleneck?
Amdahl's Law and Fig. 8
L2 Sizing/bandwidth seemed critical to the
performance, what would happen if the L2 was
smaller and/or slower?
Was DrainM the best way of accomplishing its goal?
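
For the Amdahl's Law question, a short sketch of the standard formula shows why the vectorized fraction matters so much. The 98% fraction echoes the Evaluation slide; the vector speedup values in the loop are hypothetical inputs chosen for illustration, not results from the paper.

```python
# Amdahl's Law: overall speedup when a fraction f of the work is sped up by s.
# The speedup values tried below are hypothetical, not measured results.

def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup = 1 / ((1 - f) + f / s)."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    serial = 1.0 - vector_fraction
    return 1.0 / (serial + vector_fraction / vector_speedup)


if __name__ == "__main__":
    for s in (2, 4, 8, 16):
        print(f"f=0.98, s={s}: overall {amdahl_speedup(0.98, s):.2f}x")
```

Even with 98% of the code vectorized, the remaining 2% caps the overall speedup at 50x no matter how fast the vector unit is, so benchmarks with a lower vector fraction are limited much sooner.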

11
High-Level Questions