Tarantula: A Vector Extension to the Alpha Architecture

1
Tarantula: A Vector Extension to the Alpha
Architecture
  • Roger Espasa, Federico Ardanaz, Joel Emer,
    Stephen Felix, Julio Gago, Roger Gramunt, Isaac
    Hernandez, Toni Juan, Geoff Lowney, Matthew
    Mattina, André Seznec
  • Universitat Politècnica de Catalunya, Barcelona,
    Spain
  • Compaq Computer Corporation, Shrewsbury, MA

2
State of the World
  • CMOS Technology progresses
  • More transistors, more functional units, more
    control overhead
  • VLIW and Wide Superscalar
  • More individually controlled units
  • Amount of real estate for control logic grows
    non-linearly
  • Vector ISA
  • Localization of parallelism, aggregation of
    control
  • Regular structures, simple control

3
Tarantula
  • EV8 core plus a tightly integrated Vector Unit
  • Out of Order execution, Register Renaming
  • Integrated in VM and cache coherence system
  • SMT support
  • Targeted at scientific computing applications
  • Requires compiler support and recompilation

4
Vector ISA
  • New Architectural State
  • 32 vector registers (v0-v31)
  • v31 wired to 0. Used for prefetch
  • Vector length (vl), Vector stride (vs), Vector
    Mask (vm)
  • 45 New Instructions
  • 5 Groups
  • Vector-Vector, Vector-Scalar, Strided Memory
    Access, Random Memory Access, Vector Control

5
Vector Mask
  • Allows conditional execution without EV8 scalar
    registers
  • VM can be renamed
  • A(i).ne.0.and.B(i).gt.2
  • vloadq A(i) --> v0
  • vloadq B(i) --> v1
  • vcmpne v0, 0 --> v6
  • vcmpgt v1, 2 --> v7
  • vand v6, v7 --> v8
  • setvm v8 --> vm
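
The mask sequence above can be modelled in a few lines of Python. This is only a sketch of the semantics suggested by the slide: the helper names mirror the mnemonics, vectors are plain lists, mask bits are 0/1 integers, and the sample data is made up for illustration.

```python
# Sketch of the mask computation for A(i).ne.0 .and. B(i).gt.2.
# Helper names mirror the slide's mnemonics; the list-based model
# and the sample data are assumptions for illustration only.

def vcmpne(v, scalar):
    """Element-wise 'not equal' compare, producing 0/1 flags."""
    return [int(x != scalar) for x in v]

def vcmpgt(v, scalar):
    """Element-wise 'greater than' compare, producing 0/1 flags."""
    return [int(x > scalar) for x in v]

def vand(a, b):
    """Element-wise AND of two flag vectors."""
    return [x & y for x, y in zip(a, b)]

def compute_mask(a, b):
    v6 = vcmpne(a, 0)    # A(i) != 0
    v7 = vcmpgt(b, 2)    # B(i) > 2
    return vand(v6, v7)  # value that setvm would copy into vm

A = [0, 3, 5, 0, 7]
B = [9, 1, 4, 8, 2]
vm = compute_mask(A, B)
print(vm)  # [0, 0, 1, 0, 0]
```

Only element 2 satisfies both conditions, so only that element would take part in subsequent masked vector instructions.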

6
Tarantula Block Diagram
7
Vector Execution Unit
  • 16 independent lanes
  • No communication, except for gather/scatter
  • Each lane has
  • 2 functional units
  • Slice of Register File and Mask
  • Allows high bandwidth
  • Address generator and private TLB
  • 32 functional units appear as only 2 issue ports
  • Simple scheduling

8
Vector Unit Core Interface
  • Vector Unit physically separate from core
  • Little modification to core
  • Large bus prevented by routing space
  • Core to VBox
  • 3-Instruction Bus
  • 2 Data Buses for scalars from the EV8 register file
  • 3-Instruction Kill Signal Bus for misspeculation
  • VBox to Core
  • 3-Instruction Completion Bus

9
Power Consumption
10
Vector Memory System
  • Bound to EV8 VM and Cache Coherence architecture
  • High Load/Store Bandwidth required
  • Goal: one 64-bit datum per flop
  • Memory Bus too slow
  • L1 Cache too small for vector data
  • Direct Connection to L2 Cache
  • Non-Unit Stride is the central problem
  • 20% of all accesses
  • Don't match cache lines

11
Non-Unit Strides
  • EV8 4MByte L2 Cache in 128 banks
  • 8 ways, 16 banks per way
  • Read 8 ways, select correct one
  • Non-unit stride accesses
  • Read 16 independent cache lines
  • Select one qword per line
  • Requires
  • Conflict free addresses
  • Conflict free writes to 16 lanes
  • One qword per lane per cycle

12
Conflict Free Addresses
  • Possible for any 128 consecutive elements
  • For strides S = σ × 2^s (σ odd) with s ≤ 4
  • Order stored in ROM table
  • Elements accessed out of order
  • Even for vector length < 128, address generation
    takes the full eight cycles
  • Slice
  • Group of 16 conflict free addresses

13
PUMP
  • Stride 1 accesses
  • 80% of all accesses
  • 128 Qwords in 16 (aligned) or 17 (misaligned)
    cache lines
  • Full cache lines read into PUMP latches
  • Two qwords per cycle sent to VBox
  • Similar for writes
  • Allows double bandwidth
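
The 16-versus-17 line count follows from simple arithmetic: 128 qwords are 1024 bytes, which is exactly 16 lines if lines are 64 bytes (the size implied by the slide's numbers). A small sketch, with a hypothetical function name, makes the aligned and misaligned cases explicit:

```python
# Count the cache lines touched by a unit-stride vector access.
# LINE_BYTES = 64 is inferred from the slide (128 qwords in 16 lines);
# the function name is hypothetical.

LINE_BYTES = 64
QWORD_BYTES = 8

def lines_touched(base_addr, num_qwords=128):
    """Number of distinct cache lines covered by a stride-1 access."""
    first_line = base_addr // LINE_BYTES
    last_byte = base_addr + num_qwords * QWORD_BYTES - 1
    last_line = last_byte // LINE_BYTES
    return last_line - first_line + 1

print(lines_touched(0))  # 16: base aligned to a line boundary
print(lines_touched(8))  # 17: base one qword past a line boundary
```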

14
Gathers and Scatters
  • Arbitrary Address for every vector element
  • Reordering algorithm doesn't work
  • Conflict Resolution Box (CR)
  • Find biggest subset of non-conflicting addresses,
    pack into slice
  • Add new addresses to remaining ones and repeat
  • Worst case 128 slices generated
  • Same algorithm used for self-conflicting strides
  • Strides S = σ × 2^s (σ odd) with s > 4
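
The packing loop described above can be sketched as a greedy procedure. This is not the actual Conflict Resolution Box logic. It assumes a hypothetical bank mapping (64-byte lines, 16 banks, bank = line index mod 16), a window of 16 pending addresses, and it checks only bank conflicts, ignoring the lane constraint. The function name is made up for illustration.

```python
# Greedy sketch of slice packing for gather/scatter addresses.
# Assumptions (not from the slides): bank = (addr // 64) % 16,
# a 16-entry pending window, and bank conflicts only (no lane check).

def resolve_conflicts(addresses, num_banks=16, line_bytes=64, window=16):
    def bank_of(addr):
        return (addr // line_bytes) % num_banks

    stream = list(addresses)
    pos = 0
    pending = []
    slices = []
    while pos < len(stream) or pending:
        # Add new addresses to the ones left over from the last round.
        while len(pending) < window and pos < len(stream):
            pending.append(stream[pos])
            pos += 1
        # Pick at most one address per bank; the rest wait.
        used_banks = set()
        current, leftover = [], []
        for addr in pending:
            bank = bank_of(addr)
            if bank in used_banks:
                leftover.append(addr)
            else:
                used_banks.add(bank)
                current.append(addr)
        slices.append(current)
        pending = leftover
    return slices

# Every address maps to bank 0: one address per slice, 128 slices.
worst = resolve_conflicts([i * 1024 for i in range(128)])
print(len(worst))  # 128

# Consecutive cache lines rotate through all banks: 8 full slices.
best = resolve_conflicts([i * 64 for i in range(128)])
print(len(best))   # 8
```

Under these assumptions the sketch reproduces the worst case of 128 slices when every address falls in the same bank, and the ideal eight slices when the addresses spread evenly across banks.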

15
Vector Misses
  • To handle L2 misses consider slices as atomic
  • On miss, slice moved to Miss Address File (MAF)
  • Wait for missing data
  • Go to retry queue
  • Too many retries cause Panic Mode
  • MAF NACKs all other L2 requests that might
    prevent progress

16
Scalar-Vector Coherency
  • VBox by-passes L1 cache
  • Presence bit P indicates L2 cache line loaded by
    VCore
  • If P Set, VBox invalidates L1
  • Scalar Write followed by Vector Read is not
    covered
  • Barrier command required
  • DrainM: purges the write buffer and causes a replay trap

17
Evaluation
  • No Compiler support available
  • Hand-coded assembler cores
  • Scientific Benchmarks
  • ASIM Simulator
  • Cycle Accurate EV8 simulator
  • Tarantula compared to
  • EV8
  • EV8 + Tarantula's memory system
  • Tarantula4: 1:4 ratio to RAMBUS frequency

18
Operations per Cycle
19
Speed Up over EV8
20
Conclusions
  • Vector Processor most efficient solution for many
    applications
  • Vector Unit can be added to standard
    microprocessor core
  • Big Bandwidth requirement can only be satisfied
    by L2 cache
  • Potentially big performance gains
  • 2x to 20x over EV8
  • Performance depends on good code
  • Tiling, aggressive prefetching
  • Very good power/performance ratio

21
Questions
  • Can only scientific applications exploit vector
    processors?
  • Radix sort worked
  • Powerful memory access instructions
  • Masks allow logic execution
  • Does anyone know more about PRAM algorithms?
  • EV8/VBox coherency seems quirky. Does anyone see
    a better solution?