TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS

Aaron Severance, University of British Columbia. Advised by Guy Lemieux.

1
TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR
HIGH-THROUGHPUT FPGA APPLICATIONS
  • Aaron Severance
  • University of British Columbia
  • Advised by Guy Lemieux

2
Our Problem
  • We use overlays for data processing
  • Partially/fully fixed processing elements
  • Virtual CGRAs, soft vector processors
  • Memory
  • Large register files/scratchpad in overlay
  • Low latency, local data
  • Trivial (large DMA) burst to/from DDR
  • Non-trivial?

3
Scatter/Gather
  • Data dependent store/load
  • vscatter adr_ptr, idx_vect, data_vect
  • for i in 1..N
  •   adr_ptr[idx_vect[i]] <= data_vect[i]
  • Random narrow (32-bit) accesses
  • Waste bandwidth on DDR interfaces
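The scatter semantics on this slide, together with the matching gather, can be modelled in a few lines of Python. This is an illustrative sketch only: the function names and the flat-list memory are assumptions made for the example, not the MXP instruction interface.

```python
def vscatter(memory, base, idx_vect, data_vect):
    """Data-dependent store: memory[base + idx_vect[i]] = data_vect[i]."""
    for idx, value in zip(idx_vect, data_vect):
        memory[base + idx] = value


def vgather(memory, base, idx_vect):
    """Data-dependent load: one element per index in idx_vect."""
    return [memory[base + idx] for idx in idx_vect]


# Hypothetical 16-word memory; each access touches one narrow word.
memory = [0] * 16
vscatter(memory, 4, [3, 0, 7], [30, 10, 70])
gathered = vgather(memory, 4, [0, 3, 7])
```

Each iteration touches a single word at an address known only at run time, which is why a burst-oriented DDR interface serves this pattern poorly.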

4
If Data Fits on the FPGA
  • BRAMs with interconnect network
  • General network
  • Not customized per application
  • Shared: all masters <-> all slaves
  • Memory mapped BRAM
  • Double-pump (2x clk) if possible
  • Banking/LVT/etc. for further ports

5
Example BRAM system

6
But if data doesn't fit (oversimplified)

7
So Let's Use a Cache
  • But a throughput focused cache
  • Low latency data held in local memories
  • Amortize latency over multiple accesses
  • Focus on bandwidth

8
Replace on-chip memory or augment memory controller?
  • Data fits on-chip
  • Want BRAM-like speed, bandwidth
  • Low overhead compared to shared BRAM
  • Data doesn't fit on-chip
  • Use leftover BRAMs for performance

9
TputCache Design Goals
  • Fmax near BRAM Fmax
  • Fully pipelined
  • Support multiple outstanding misses
  • Write coalescing
  • Associativity

10
TputCache Architecture
  • Replay based architecture
  • Reinsert misses back into pipeline
  • Separate line fill/evict logic in background
  • Token FIFO for completing requests in order
  • No MSHRs for tracking misses
  • Fewer muxes (only single replay request mux)
  • 6-stage pipeline -> 6 outstanding misses
  • Good performance with high hit rate
  • Common case fast
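A toy software model may help make the replay idea concrete. The sketch below is an assumption-laden simplification, not the TputCache RTL: it has unlimited capacity (no eviction), counts fill latency in pipeline passes rather than clock cycles, and does not model the 6 pipeline stages. All names are hypothetical.

```python
from collections import deque


def run_replay_cache(addresses, line_bytes=32, fill_latency=3):
    """Toy replay cache: a miss is reinserted until the background fill lands."""
    resident = set()        # line numbers currently held in the cache
    pending_fill = {}       # line number -> passes left until the fill lands
    queue = deque(enumerate(addresses))     # (token, address) in request order
    tokens = deque(range(len(addresses)))   # token FIFO: required completion order
    done = set()            # tokens whose request has hit
    completed, replays = [], 0

    while queue:
        # Background fill logic advances independently of the request stream.
        for line in list(pending_fill):
            pending_fill[line] -= 1
            if pending_fill[line] == 0:
                resident.add(line)
                del pending_fill[line]

        token, addr = queue.popleft()
        line = addr // line_bytes
        if line in resident:
            done.add(token)
        else:
            pending_fill.setdefault(line, fill_latency)  # start a fill once per line
            queue.append((token, addr))                  # replay the request
            replays += 1

        # Retire strictly in token order, even if a later request hit first.
        while tokens and tokens[0] in done:
            completed.append(tokens.popleft())

    return completed, replays


completed, replays = run_replay_cache([0, 4, 64, 8])
```

In this run the fourth request hits before the first three have completed, yet the token FIFO still retires the requests in their original order. No per-miss bookkeeping structure is needed: a missed request simply keeps circulating until its line arrives.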

11
TputCache Architecture

12
Cache Hit

13
Cache Miss

14
Evict/Fill Logic

15
Area & Fmax Results
  • Reaches 253MHz compared to 270MHz BRAM fmax on
    Cyclone IV
  • 423MHz compared to 490MHz BRAM fmax on Stratix IV
  • Minor degradation with increasing size,
    associativity
  • 13% to 35% extra BRAM usage for tags, queues
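As a quick check on how close these clock rates are to the BRAM limit, the ratios can be computed directly from the figures on this slide. The dictionary layout is just a convenience for the example.

```python
# (TputCache MHz, BRAM fmax MHz) as quoted on the slide.
results = {
    "Cyclone IV": (253, 270),
    "Stratix IV": (423, 490),
}

# Cache clock as a percentage of the BRAM fmax, rounded to one decimal place.
ratios = {
    device: round(100 * cache_mhz / bram_mhz, 1)
    for device, (cache_mhz, bram_mhz) in results.items()
}
```

This works out to roughly 93.7% of BRAM fmax on Cyclone IV and 86.3% on Stratix IV.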

16
Benchmark Setup
  • TputCache
  • 128kB, 4-way, 32-byte lines
  • MXP soft vector processor
  • 16 lanes, 128kB scratchpad memory
  • Scatter/Gather memory unit
  • Indexed loads/stores per lane
  • Double-pumping port adapters
  • TputCache runs at 2x frequency of MXP
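The benchmark configuration implies a specific cache geometry, which can be worked out as follows. The address split shown is the conventional tag/index/offset layout and is an assumption; the slides do not state how TputCache actually maps addresses to sets.

```python
CACHE_BYTES = 128 * 1024   # 128kB
WAYS = 4                   # 4-way set associative
LINE_BYTES = 32            # 32-byte lines

NUM_LINES = CACHE_BYTES // LINE_BYTES        # total cache lines
NUM_SETS = NUM_LINES // WAYS                 # sets, each holding WAYS lines
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # log2 of a power of two
INDEX_BITS = NUM_SETS.bit_length() - 1


def split_address(addr):
    """Split a byte address into (tag, set index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
```

With these parameters there are 4096 lines in 1024 sets, using 5 offset bits and 10 index bits.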

17
MXP Soft Vector Processor

18
Histogram
  • Instantiate a number of Virtual Processors (VPs)
    mapped across lanes
  • Each VP histograms part of the image
  • Final pass to sum VP partial histograms
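The partial-histogram scheme can be sketched in scalar Python. The interleaved assignment of pixels to VPs and the default of 4 VPs are assumptions for illustration; the slides only say that each VP handles part of the image.

```python
def vp_histogram(pixels, num_vps=4, bins=256):
    """Histogram via per-VP partial histograms followed by a summing pass."""
    partial = [[0] * bins for _ in range(num_vps)]
    for vp in range(num_vps):
        # Each VP sees every num_vps-th pixel, so no two VPs share a counter.
        for p in pixels[vp::num_vps]:
            partial[vp][p] += 1
    # Final pass: add the partial histograms bin by bin.
    return [sum(column) for column in zip(*partial)]


hist = vp_histogram([0, 1, 1, 2, 255, 1, 0, 2])
```

Giving each VP a private histogram avoids two lanes updating the same bin at once, at the cost of the final reduction. The counter updates themselves are data-dependent accesses, the pattern the scatter/gather unit is there to serve.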

19
Hough Transform
  • Convert an image to 2D Hough Space (angle,
    radius)
  • Each vector element calculates the radius for a
    given angle
  • Adds pixel value to counter
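A scalar sketch of the accumulation step is shown below. The angle count, the rounding of the radius, and the offset used to store negative radii are assumptions for the example; the inner loop over angles corresponds to the work that would be spread across vector elements.

```python
import math


def hough_accumulate(image, num_angles=180):
    """Accumulate pixel values into (angle, radius) Hough space."""
    height, width = len(image), len(image[0])
    max_radius = int(math.ceil(math.hypot(width, height)))
    # Radius can be negative, so column r is stored at r + max_radius.
    acc = [[0] * (2 * max_radius + 1) for _ in range(num_angles)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value == 0:
                continue
            for a in range(num_angles):       # one vector element per angle
                theta = math.pi * a / num_angles
                r = int(round(x * math.cos(theta) + y * math.sin(theta)))
                acc[a][r + max_radius] += value   # data-dependent counter update
    return acc, max_radius


# 5x5 test image containing a vertical line at x = 2.
image = [[1 if x == 2 else 0 for x in range(5)] for y in range(5)]
acc, max_radius = hough_accumulate(image)
```

At angle 0 all five pixels of the line fall into the same radius bin (r = 2), producing the peak that identifies the line.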

20
Motion Compensation
  • Load block from reference image, interpolate
  • Offset by small amount from location in current
    image
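The access pattern can be illustrated with a small block-prediction routine. The slides do not specify the interpolation filter, so the 2x2 averaging here is an assumption, as are the function name and parameters.

```python
def predict_block(ref, x, y, mv_x, mv_y, size=4):
    """Fetch a size x size block from ref at (x + mv_x, y + mv_y), interpolated."""
    bx, by = x + mv_x, y + mv_y   # block position offset by the motion vector
    block = []
    for j in range(size):
        row = []
        for i in range(size):
            # Rounded average of a 2x2 neighbourhood (simple half-pixel filter).
            total = (ref[by + j][bx + i] + ref[by + j][bx + i + 1]
                     + ref[by + j + 1][bx + i] + ref[by + j + 1][bx + i + 1])
            row.append((total + 2) // 4)
        block.append(row)
    return block


# Hypothetical 8x8 reference frame with a simple gradient.
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
block = predict_block(ref, 1, 1, 1, 0, size=2)
```

Because the motion vector is small, consecutive blocks read overlapping regions of the reference image, which is the kind of locality a cache can exploit without manual buffer management.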

21
Future Work
  • More ports needed for scalability
  • Share evict/fill BRAM port with 2nd request
  • Banking (sharing same evict/fill logic)
  • Multiported BRAM designs
  • Write cache
  • Allocate on write currently
  • Track dirty state of bytes in BRAMs' 9th bit
  • Non-blocking behavior
  • Multiple token FIFOs (one per requestor)?
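The per-byte dirty tracking proposed under the write cache item could look something like the sketch below. Since this is future work in the talk, everything here is speculative: the 9-bit word layout, the function names, and the eviction behaviour are assumptions, not a description of an existing design.

```python
DIRTY = 1 << 8   # the 9th bit of a 9-bit-wide BRAM byte lane


def write_byte(line, offset, value):
    """Store a byte and mark it dirty in the same 9-bit word."""
    line[offset] = DIRTY | (value & 0xFF)


def evict(line):
    """Return (offset, byte) pairs needing write-back, then clear dirty bits."""
    writes = [(i, word & 0xFF) for i, word in enumerate(line) if word & DIRTY]
    for i in range(len(line)):
        line[i] &= 0xFF
    return writes


line = [0] * 32                  # one 32-byte cache line, initially clean
write_byte(line, 3, 0xAB)
write_byte(line, 5, 0xFF)
writes = evict(line)
```

Only the bytes that were actually written are sent back on eviction, so a write miss would not need to fetch the rest of the line first.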

22
FAQ
  • Coherency
  • Envisioned as the only cache or the last-level cache (LLC)
  • Future work
  • Replay loops/problems
  • Random replacement and associativity
  • Power expected to be not great

23
Conclusions
  • TputCache: alternative to shared BRAM
  • Low overhead (13-35% extra BRAM)
  • Nearly as high fmax (253MHz vs 270MHz)
  • More flexible than shared BRAM
  • Performance degrades gradually
  • Cache behavior instead of manual filling

24
Questions?
  • Thank you