RAMP Gold - PowerPoint PPT Presentation

Provided by: yunsu
Transcript and Presenter's Notes

Title: RAMP Gold


1
RAMP Gold
  • RAMPants
  • rimas,waterman,yunsup_at_cs

Parallel Computing Laboratory, University of
California, Berkeley
2
A Survey of µArch Simulation Trends
  • Typical ISCA 2008 papers simulated about
    twice as many instructions as those in 1998. So
    what?

3
A Survey of µArch Simulation Trends
  • Something seems broken here

4
A Survey of µArch Simulation Trends
  • Something is clearly broken here.

5
Something is Rotten in the State of California
  • The median ISCA '08 paper's simulations run for
    fewer than four OS scheduling quanta!
  • We run yesterday's apps at yesteryear's
    timescales
  • And attempt to model N communicating cores with
    O(1/N) instructions per core?!
  • The problem is that simulators are too slow
  • Irony: since performance scales as
    sqrt(complexity), simulated instructions per
    wall-clock second falls as processors get faster

6
RAMP Gold: Our Solution
  • RAMP Gold is an FPGA-based, 100 MIPS manycore
    simulator
  • Only 100x slower than real-time
  • Economical: RTL is BSD-licensed, and it runs on commodity HW

                     Cost         Performance (MIPS)   Simulations per day
Software Simulator   2,000        0.1 - 1              1
RAMP Gold            2,000 750    50 - 100             100
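To put the table's MIPS figures in wall-clock terms, here is a rough back-of-the-envelope sketch. The simulation rates are taken from the table and the 10-billion-instruction run length from the use-case slide later in the deck; the function name is illustrative only.

```python
def wall_clock_hours(instructions, mips):
    """Hours needed to simulate `instructions` at `mips` million instructions/second."""
    seconds = instructions / (mips * 1e6)
    return seconds / 3600.0

RUN = 10e9  # one 10-billion-instruction simulation

for name, mips in [("software, 0.1 MIPS", 0.1),
                   ("software, 1 MIPS", 1.0),
                   ("RAMP Gold, 50 MIPS", 50.0),
                   ("RAMP Gold, 100 MIPS", 100.0)]:
    print(f"{name:>20}: {wall_clock_hours(RUN, mips):8.2f} h")
```

At the low end of the software range a single such run takes more than a day; at 100 MIPS it takes under two minutes.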
7
Our Target Machine
[Diagram: 64 SPARC V8 cores, each with its own I and D caches, connected through a shared L2 / interconnect to DRAM]
8
RAMP Gold Architecture
  • Mapping the target machine directly to an FPGA is
    inefficient
  • Solution: split timing and functionality
  • The timing logic decides how many target cycles
    an instruction sequence should take
  • Simulating the functionality of an instruction
    might take multiple host cycles
  • Target time and host time are orthogonal
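The split can be illustrated with a minimal software sketch. It is not the RAMP Gold RTL: the class names, the two-instruction toy ISA, and all cycle costs are assumptions chosen for illustration. The point is only that host cycles (what the simulator spends) and target cycles (what the modeled machine is charged) are accounted for separately.

```python
class FunctionalModel:
    """Executes instructions; each may cost several *host* cycles (hypothetical costs)."""
    def __init__(self, host_costs):
        self.host_costs = host_costs
        self.regs = {}

    def execute(self, instr):
        op, dst, a, b = instr
        if op == "add":
            self.regs[dst] = a + b
        elif op == "fmul":
            self.regs[dst] = float(a) * float(b)
        else:
            raise ValueError(f"unknown op: {op}")
        return self.host_costs[op]  # host cycles spent on this instruction


class TimingModel:
    """Decides how many *target* cycles each instruction takes (hypothetical latencies)."""
    def __init__(self, target_latencies):
        self.target_latencies = target_latencies

    def target_cycles(self, op):
        return self.target_latencies[op]


def simulate(program, functional, timing):
    host = target = 0
    for instr in program:
        host += functional.execute(instr)          # host time
        target += timing.target_cycles(instr[0])   # target time, tracked independently
    return host, target


functional = FunctionalModel({"add": 1, "fmul": 10})  # e.g. a slow host FPU
timing = TimingModel({"add": 1, "fmul": 2})           # target FPU modeled as 2 cycles
program = [("add", "r1", 1, 2), ("fmul", "r2", 1.5, 2.0)]
print(simulate(program, functional, timing))
```

Changing the target latencies changes the reported target time without touching the functional model, which is what makes the target reconfigurable after the design is built.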

9
Function/Timing Split Advantages
  • Flexibility
  • Can configure target at runtime
  • Synthesize design once, change target model
    parameters at will
  • Efficient FPGA resource usage
  • Example 1: model a 2-cycle FPU in 10 host cycles
  • Example 2: model a 16MB L2 using only 256KB host
    BRAM to store tags/metadata
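Example 2 can be sanity-checked with a little arithmetic: because only tags and metadata are stored, the budget that matters is BRAM bits per modeled cache line. The slides do not state the L2 line size, so the 64-byte and 128-byte values below are assumptions, and the function name is illustrative.

```python
MB, KB = 1 << 20, 1 << 10

def metadata_bits_per_line(cache_bytes, line_bytes, bram_bytes):
    """BRAM bits available per modeled cache line when no data is stored."""
    lines = cache_bytes // line_bytes
    return (bram_bytes * 8) / lines

for line_bytes in (64, 128):  # assumed line sizes, not from the slides
    bits = metadata_bits_per_line(16 * MB, line_bytes, 256 * KB)
    print(f"{line_bytes:>3}-byte lines: {bits:.0f} metadata bits per line")
```

Under these assumptions the model has 8 or 16 bits per line for tag and state, against 512 or 1024 bits of data per line that a direct implementation would also have to hold.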

10
Host Multithreading
  • How are we going to model 64 cores?

Option 1: Build 64 pipelines
Option 2: Time-multiplex one pipeline
[Diagram: 64 F-D-X-M-W pipelines running in parallel over time]
  • Single hardware pipeline with multiple copies of
    CPU state
  • No bypass path required
  • Not multithreaded target!

[Diagram: instructions flowing through one F-D-X-M-W pipeline over time]
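Why no bypass path is needed can be seen in a small sketch. It assumes strict round-robin thread selection and one host cycle per pipeline stage; the slides say only that one pipeline is time-multiplexed, so both are assumptions, as are the function names.

```python
STAGES = ["F", "D", "X", "M", "W"]

def interleave(num_threads, host_cycles):
    """Round-robin issue: thread (cycle % num_threads) enters F on each host cycle.
    Returns, for each host cycle, a dict mapping stage name -> thread id in that stage."""
    timeline = []
    for cycle in range(host_cycles):
        occupancy = {}
        for depth, stage in enumerate(STAGES):
            issue_cycle = cycle - depth  # when the instruction now in `stage` was fetched
            if issue_cycle >= 0:
                occupancy[stage] = issue_cycle % num_threads
        timeline.append(occupancy)
    return timeline

def needs_bypass(timeline):
    """True if two in-flight instructions ever belong to the same thread."""
    return any(len(set(occ.values())) < len(occ) for occ in timeline)

print(needs_bypass(interleave(64, 200)))  # 64 threads on a 5-stage pipeline
print(needs_bypass(interleave(2, 200)))   # too few threads to hide the pipeline depth
```

With at least as many threads as pipeline stages, a thread's previous instruction has always left the pipeline before its next one is fetched, so there is nothing to forward.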
11
Cache Modeling
  • The cache model maintains tag, state, protocol
    bits internally
  • Whenever the functional model issues a memory
    operation, the cache model determines how many
    target cycles to stall

[Diagram: the address is split into tag | index | offset; the index selects a set of (tag, state) entries, one per way up to the max associativity]

Hit: don't stall. Miss: stall an arbitrary number of cycles.
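A minimal software sketch of such a tag-only cache model follows. It is not the RAMP Gold implementation: LRU replacement, the fixed miss penalty, and all names and parameters are assumptions, and coherence/protocol state is omitted. It stores no data, only tags, and returns the number of target cycles to stall.

```python
class CacheTimingModel:
    """Tag-only cache model: tracks which lines are present, never their contents."""

    def __init__(self, size_bytes, line_bytes, ways, miss_penalty):
        self.line_bytes = line_bytes
        self.ways = ways
        self.miss_penalty = miss_penalty  # target cycles charged on a miss (assumed fixed)
        self.num_sets = size_bytes // (line_bytes * ways)
        # One list of tags per set, least recently used first (LRU is an assumption).
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, addr):
        """Return the number of target cycles to stall for this memory operation."""
        line = addr // self.line_bytes      # drop the offset bits
        index = line % self.num_sets        # index bits select the set
        tag = line // self.num_sets         # remaining high bits are the tag
        tags = self.sets[index]
        if tag in tags:                     # hit: refresh LRU order, no stall
            tags.remove(tag)
            tags.append(tag)
            return 0
        if len(tags) == self.ways:          # miss in a full set: evict the LRU tag
            tags.pop(0)
        tags.append(tag)
        return self.miss_penalty


dm = CacheTimingModel(size_bytes=16 * 1024, line_bytes=32, ways=1, miss_penalty=20)
print([dm.access(a) for a in (0, 0, 16 * 1024, 0)])
```

In the usage line, the third access maps to the same set as address 0 and evicts it in a direct-mapped cache, so the fourth access misses again.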
12
Putting it all together
[Block diagram: instruction cache; ifetch stage; decode stage; register access stage; memory stage; data cache; exception stage; memory controller; cache model; performance counters]
  • Resource Utilization (XC5VLX110T)
  • LUTs 14, BRAM 23
  • We can fit 3 pipelines on one FPGA!

13
Infrastructure
14
Our accomplishments this semester
Jan 2009                            Last Night
Single Threaded                     64-Way Host Multithreaded
0.000032GB BRAM                     2GB DDR2 SDRAM
Hello World works (sometimes)       ParLab Damascene CBIR App, SPLASH2, SPEC CPU2000
No Timing Model or Introspection    Runtime Configurable Cache Model, Performance Counters
No Floating Point                   Hardware FPU Multiply/Add, Software Emulation
15
HARDware ain't no joke
500 Man Hours   DDR2 Memory Controller Debugging
100 Man Hours   DMA Engine/Pipeline Issues
150 Man Hours   CAD Tool Issues
?0              Pipeline Corner Cases
16
Sample Use Case: L1 D Tradeoffs
  • Assume we have a 64-core CMP with private 16KB
    direct-mapped L1 D
  • In the next tech gen, we can fit either of these
    improved configurations in a clock cycle
  • 32KB direct-mapped L1
  • 16KB 4-way set-associative L1
  • Which should we choose?
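The three candidate configurations differ only in how an address is carved into tag, index, and offset. The sketch below works out that geometry; the 32-byte line size is an assumption (the slides do not give one), the 32-bit address width follows from SPARC V8, and the function name is illustrative.

```python
def geometry(size_bytes, ways, line_bytes=32, addr_bits=32):
    """Return (sets, index_bits, tag_bits) for a cache with power-of-two parameters."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, index_bits, tag_bits

KB = 1024
for name, size, ways in [("16KB direct-mapped (baseline)", 16 * KB, 1),
                         ("32KB direct-mapped", 32 * KB, 1),
                         ("16KB 4-way set-associative", 16 * KB, 4)]:
    sets, index_bits, tag_bits = geometry(size, ways)
    print(f"{name:<30} sets={sets:<5} index bits={index_bits:<3} tag bits={tag_bits}")
```

Since these are just parameters of the cache model, switching between the configurations is a runtime change rather than a new hardware build.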

17
Sample Use Case: L1 D Tradeoffs
  • Evidently, the associative cache is superior
  • It took longer to make these slides than to run
    these 10 billion instruction simulations

18
Future Directions
  • RAMP Gold closes two critical feedback loops
  • Expedient HW/SW co-tuning is within our grasp
  • Simulations can now be run on a thermal
    timescale, enabling the exploration of
    temperature-aware scheduling policies
  • We intend to explore both avenues!

19
DEMO: Damascene
[Pipeline diagram: Image; Convert Colorspace; Textons K-means; Intervening Contour; Texture Gradient; Generalized Eigensolver; Bg; Cga; Cgb; Combine; Oriented Energy Combination; Non-max suppression; Combine, Normalize; Contours]