RAMP Gold - PowerPoint PPT Presentation

About This Presentation

Title:

RAMP Gold

Description:

Cache Modeling. The cache model maintains tag, state, protocol bits internally ... RAMP Gold closes two critical feedback loops. Expedient HW/SW co-tuning is ... – PowerPoint PPT presentation

Slides: 20

Provided by: yunsu

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

Title: RAMP Gold

1
RAMP Gold

RAMPants
{rimas,waterman,yunsup}@cs

Parallel Computing Laboratory, University of
California, Berkeley
2
A Survey of µArch Simulation Trends

Typical ISCA 2008 papers simulated about
twice as many instructions as those in 1998. So
what?

3
A Survey of µArch Simulation Trends

Something seems broken here

4
A Survey of µArch Simulation Trends

Something is clearly broken here.

5
Something is Rotten in the State of California

The median ISCA '08 paper's simulations run for
fewer than four OS scheduling quanta!
We run yesterday's apps at yesteryear's
timescales
And attempt to model N communicating cores with
O(1/N) instructions per core?!
The problem is that simulators are too slow
Irony: since performance scales as
sqrt(complexity), simulated instructions per
wall-clock second falls as processors get faster

6
RAMP Gold: Our Solution

RAMP Gold is an FPGA-based, 100 MIPS manycore
simulator
Only 100x slower than real-time
Economical: BSD-licensed RTL, commodity HW

                    Cost            Performance (MIPS)   Simulations per day
Software Simulator  $2,000          0.1 - 1              1
RAMP Gold           $2,000 + $750   50 - 100             100
7
Our Target Machine
[Figure: 64 SPARC V8 cores, each with its own I and D caches, connected through a shared L2 / interconnect to DRAM]
8
RAMP Gold Architecture

Mapping the target machine directly to an FPGA is
inefficient
Solution: split timing and functionality
The timing logic decides how many target cycles
an instruction sequence should take
Simulating the functionality of an instruction
might take multiple host cycles
Target time and host time are orthogonal
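
The split described on this slide can be sketched in a few lines of Python. This is only an illustration, not RAMP Gold's RTL: the class names, the opcode names, and every cycle count in the two tables are hypothetical. The functional model advances host time, the timing model advances target time, and neither counter depends on the other.

```python
# Hypothetical per-opcode costs, invented for this sketch.
HOST_CYCLES = {"add": 1, "fmul": 10, "load": 3}    # host cycles spent computing the result
TARGET_CYCLES = {"add": 1, "fmul": 2, "load": 1}   # target cycles charged to the simulated core


class FunctionalModel:
    """Executes instructions; only host time passes here."""

    def __init__(self):
        self.host_cycles = 0

    def execute(self, op):
        self.host_cycles += HOST_CYCLES[op]


class TimingModel:
    """Decides how many target cycles each instruction takes."""

    def __init__(self, target_cycles):
        self.table = dict(target_cycles)  # replaceable without touching the functional model
        self.target_cycles = 0

    def account(self, op):
        self.target_cycles += self.table[op]


def simulate(program, timing):
    """Run a list of opcodes and return (host_cycles, target_cycles)."""
    func = FunctionalModel()
    for op in program:
        func.execute(op)
        timing.account(op)
    return func.host_cycles, timing.target_cycles


program = ["add", "fmul", "load", "fmul"]
print(simulate(program, TimingModel(TARGET_CYCLES)))
# Modelling a slower target FPU changes target time only; host time is unchanged.
print(simulate(program, TimingModel({**TARGET_CYCLES, "fmul": 4})))
```

Swapping the timing table changes what the target machine looks like without changing how long the host takes to execute each instruction.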

9
Function/Timing Split Advantages

Flexibility
Can configure target at runtime
Synthesize design once, change target model
parameters at will
Efficient FPGA resource usage
Example 1: model a 2-cycle FPU in 10 host cycles
Example 2: model a 16MB L2 using only 256KB host
BRAM to store tags/metadata
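
A rough way to see why tags and metadata need far less storage than the cache data is to count bits per line. The helper below is a hypothetical sketch. The slide does not give the line size, associativity, address width, or number of state bits, so the values in the example (128-byte lines, 4 ways, 32-bit addresses, 2 state bits) are assumptions chosen only to illustrate the arithmetic.

```python
def tag_store_bytes(cache_bytes, line_bytes, ways, addr_bits=32, state_bits=2):
    """Bytes of tag + state metadata needed to model a cache of the given shape.

    Assumes power-of-two sizes. No cache data is stored, only metadata.
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = (line_bytes - 1).bit_length()   # bits that select a byte within a line
    index_bits = (sets - 1).bit_length()          # bits that select a set
    tag_bits = addr_bits - index_bits - offset_bits
    total_bits = lines * (tag_bits + state_bits)
    return (total_bits + 7) // 8                  # round up to whole bytes


MB = 1024 * 1024
KB = 1024

# Assumed target configuration: 16MB L2, 128-byte lines, 4-way set-associative.
metadata = tag_store_bytes(16 * MB, line_bytes=128, ways=4)
print(metadata // KB, "KB of metadata for a 16MB target cache")
```

Under these assumed parameters the metadata comes to 192KB, which fits in 256KB, while the modelled cache itself is 16MB.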

10
Host Multithreading

How are we going to model 64 cores?

Build 64 pipelines
Time-multiplex one pipeline

[Figure: pipeline timing diagrams showing F, D, X, M, W stages over time, contrasting 64 pipelines with one time-multiplexed pipeline]

Single hardware pipeline with multiple copies of
CPU state
No bypass path required
Not a multithreaded target!
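
The "no bypass path required" point can be checked with a small scheduling sketch. It assumes strict round-robin issue, one instruction per host cycle, across the five stages named on the slide. The function names are hypothetical, and the check is deliberately conservative: it requires a thread's previous instruction to finish writeback before its next instruction is even fetched.

```python
STAGES = ["F", "D", "X", "M", "W"]


def schedule(num_threads, instrs_per_thread):
    """Round-robin issue order as (fetch_cycle, thread_id, instr_index) tuples."""
    issue = []
    cycle = 0
    for i in range(instrs_per_thread):
        for t in range(num_threads):
            issue.append((cycle, t, i))
            cycle += 1
    return issue


def needs_bypass(num_threads, instrs_per_thread=2):
    """True if any thread fetches before its previous instruction has written back."""
    depth = len(STAGES)
    last_writeback = {}
    for fetch_cycle, t, _ in schedule(num_threads, instrs_per_thread):
        if t in last_writeback and fetch_cycle <= last_writeback[t]:
            return True
        # An instruction fetched at cycle c is in stage W at cycle c + depth - 1.
        last_writeback[t] = fetch_cycle + depth - 1
    return False


print(needs_bypass(64))  # 64 host threads: consecutive instructions are 64 cycles apart
print(needs_bypass(1))   # one thread: back-to-back instructions overlap in the pipeline
```

With 64 thread contexts, consecutive instructions from the same thread are 64 host cycles apart, far more than the pipeline depth, so results are always back in the register file before they are needed.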
11
Cache Modeling

The cache model maintains tag, state, protocol
bits internally
Whenever the functional model issues a memory
operation, the cache model determines how many
target cycles to stall

[Figure: an address split into tag, index, and offset fields; the index selects a set of (tag, state) entries, up to the max associativity]

Hit: don't stall
Miss: stall an arbitrary number of cycles
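
A minimal software sketch of such a cache timing model follows. It is not the RAMP Gold hardware. The class name, the LRU replacement policy, the fixed miss penalty, and the 32-byte line size are all assumptions, and "state" is reduced to whether a tag is present in its set. The model stores no data, only tags, and returns the number of target cycles to stall.

```python
class CacheTimingModel:
    """Tag-only cache model: returns target stall cycles for each access."""

    def __init__(self, num_sets, ways, line_bytes, miss_penalty, max_ways=8):
        if ways > max_ways:
            raise ValueError("ways exceeds the maximum associativity")
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        self.miss_penalty = miss_penalty            # assumed fixed penalty per miss
        self.sets = [[] for _ in range(num_sets)]   # per-set tags, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes     # drop the offset bits
        index = line % self.num_sets       # index bits select the set
        tag = line // self.num_sets        # remaining high bits form the tag
        entries = self.sets[index]
        if tag in entries:                 # hit: refresh LRU position, no stall
            entries.remove(tag)
            entries.append(tag)
            self.hits += 1
            return 0
        self.misses += 1                   # miss: evict the LRU entry if the set is full
        if len(entries) == self.ways:
            entries.pop(0)
        entries.append(tag)
        return self.miss_penalty


# Two 16KB configurations with assumed 32-byte lines and a 20-cycle miss penalty.
direct_mapped = CacheTimingModel(num_sets=512, ways=1, line_bytes=32, miss_penalty=20)
four_way = CacheTimingModel(num_sets=128, ways=4, line_bytes=32, miss_penalty=20)
trace = [0x0000, 0x0000, 0x4000, 0x0000]   # two lines that map to the same set
print([direct_mapped.access(a) for a in trace])
print([four_way.access(a) for a in trace])
```

Because the number of sets and ways are ordinary constructor arguments bounded by `max_ways`, the same model can be reconfigured between runs, which mirrors the runtime-configurable idea without claiming to match the actual implementation.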
12
Putting it all together
[Figure: pipeline block diagram with instruction cache, ifetch stage, decode stage, register access stage, memory stage, data cache, exception stage, memory controller, cache model, performance counters]

Resource Utilization (XC5VLX110T)
LUTs 14%, BRAM 23%
We can fit 3 pipelines on one FPGA!

13
Infrastructure
14
Our accomplishments this semester
Jan 2009                           Last Night
Single Threaded                    64-Way Host Multithreaded
0.000032GB BRAM                    2GB DDR2 SDRAM
Hello World works (sometimes)      ParLab Damascene CBIR App, SPLASH2, SPEC CPU2000
No Timing Model or Introspection   Runtime Configurable Cache Model, Performance Counters
No Floating Point                  Hardware FPU Multiply/Add, Software Emulation
15
HARDware ain't no joke
500 Man Hours: DDR2 Memory Controller Debugging
100 Man Hours: DMA Engine/Pipeline Issues
150 Man Hours: CAD Tool Issues
?0 Pipeline Corner Cases
16
Sample Use Case: L1 D Tradeoffs

Assume we have a 64-core CMP with private 16KB
direct-mapped L1 D
In the next tech gen, we can fit either of these
improved configurations in a clock cycle:
32KB direct-mapped L1
16KB 4-way set-associative L1
Which should we choose?

17
Sample Use Case: L1 D Tradeoffs

Evidently, the associative cache is superior
It took longer to make these slides than to run
these 10 billion instruction simulations
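
To put the run time in perspective, the throughput ranges from the earlier cost/performance table can be turned into wall-clock estimates. This is a back-of-the-envelope sketch. It assumes each run executes 10 billion target instructions in total and that the simulator sustains the quoted MIPS for the whole run; the function and variable names are illustrative.

```python
def wall_clock_seconds(instructions, mips):
    """Seconds needed to simulate `instructions` at `mips` million instructions/second."""
    return instructions / (mips * 1e6)


RUN_LENGTH = 10_000_000_000  # assumed: 10 billion instructions per simulation run

# (low, high) MIPS ranges as quoted in the cost/performance table.
SIMULATORS = {
    "Software Simulator": (0.1, 1),
    "RAMP Gold": (50, 100),
}

for name, (low, high) in SIMULATORS.items():
    slowest = wall_clock_seconds(RUN_LENGTH, low)
    fastest = wall_clock_seconds(RUN_LENGTH, high)
    print(f"{name}: {fastest:,.0f} s to {slowest:,.0f} s per run")
```

Under these assumptions a software simulator needs roughly 3 to 28 hours per run, while RAMP Gold needs about 100 to 200 seconds.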

18
Future Directions

RAMP Gold closes two critical feedback loops
Expedient HW/SW co-tuning is within our grasp
Simulations can now be run on a thermal
timescale, enabling the exploration of
temperature-aware scheduling policies
We intend to explore both avenues!

19
DEMO: Damascene
[Figure: Damascene flow diagram with stages Image, Convert Colorspace, Textons: K-means, Intervening Contour, Texture Gradient, Generalized Eigensolver, Bg, Cga, Cgb, Combine, Oriented Energy Combination, Non-max suppression, Combine/Normalize, Contours]
