Title: RAMP Gold
1RAMP Gold
- RAMPants
- rimas,waterman,yunsup_at_cs
Parallel Computing Laboratory, University of California, Berkeley
2A Survey of µArch Simulation Trends
- Typical ISCA 2008 papers simulated about twice as many instructions as those in 1998. So what?
3A Survey of µArch Simulation Trends
- Something seems broken here
4A Survey of µArch Simulation Trends
- Something is clearly broken here.
5Something is Rotten in the State of California
- The median ISCA '08 paper's simulations run for fewer than four OS scheduling quanta!
- We run yesterday's apps at yesteryear's timescales
- And attempt to model N communicating cores with O(1/N) instructions per core?!
- The problem is that simulators are too slow
- Irony: since performance scales as sqrt(complexity), simulated instructions per wall-clock second fall as processors get faster
6RAMP Gold: Our Solution
- RAMP Gold is an FPGA-based, 100 MIPS manycore simulator
- Only 100x slower than real-time
- Economical: RTL is BSD-licensed; commodity HW

Simulator | Cost | Performance (MIPS) | Simulations per day
Software Simulator | 2,000 | 0.1 - 1 | 1
RAMP Gold | 2,000 750 | 50 - 100 | 100
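To put the MIPS column in wall-clock terms, here is a minimal sketch of the arithmetic. The run length of 10 billion instructions is an assumed workload size, and `wall_clock_hours` is a hypothetical helper name, not part of RAMP Gold.

```python
def wall_clock_hours(instructions: float, mips: float) -> float:
    """Hours of wall-clock time needed to simulate `instructions` at `mips`."""
    seconds = instructions / (mips * 1e6)
    return seconds / 3600.0


RUN = 10e9  # assumed run length: 10 billion simulated instructions

# Simulation rates taken from the table's Performance (MIPS) column.
for name, mips in [
    ("Software simulator, 0.1 MIPS", 0.1),
    ("Software simulator, 1 MIPS", 1.0),
    ("RAMP Gold, 50 MIPS", 50.0),
    ("RAMP Gold, 100 MIPS", 100.0),
]:
    print(f"{name}: {wall_clock_hours(RUN, mips):.2f} h")
```

Under these assumptions the software simulator needs hours to more than a day per run, while the FPGA-based simulator needs minutes.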
7Our Target Machine
[Diagram: 64 SPARC V8 cores, each with its own I and D cache, connected through a shared L2 / interconnect to DRAM]
8RAMP Gold Architecture
- Mapping the target machine directly to an FPGA is inefficient
- Solution: split timing and functionality
  - The timing logic decides how many target cycles an instruction sequence should take
  - Simulating the functionality of an instruction might take multiple host cycles
- Target time and host time are orthogonal
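The split can be illustrated with a small software sketch. It is not the RAMP Gold RTL: `functional_step`, `TimingModel`, and `run` are hypothetical names, and the host and target latencies are illustrative numbers only. The functional model computes results and reports how many host cycles it spent. The timing model alone decides how many target cycles to charge.

```python
def functional_step(op, a, b):
    """Functional model: compute the result and report host cycles spent."""
    if op == "add":
        return a + b, 1
    if op == "fmul":
        return a * b, 10  # assumed: a slow but cheap host implementation
    raise ValueError(f"unknown op: {op}")


class TimingModel:
    """Timing model: maps each operation to a target-cycle latency."""

    def __init__(self, latency):
        self.latency = dict(latency)

    def target_cycles(self, op):
        return self.latency[op]


def run(program, timing):
    """Execute `program`; return (results, host cycles, target cycles)."""
    host = target = 0
    results = []
    for op, a, b in program:
        value, host_cycles = functional_step(op, a, b)
        results.append(value)
        host += host_cycles                 # time the simulator took
        target += timing.target_cycles(op)  # time the modeled machine took
    return results, host, target


program = [("add", 1, 2), ("fmul", 3.0, 4.0)]
fast_fpu = TimingModel({"add": 1, "fmul": 2})
slow_fpu = TimingModel({"add": 1, "fmul": 4})
print(run(program, fast_fpu))
print(run(program, slow_fpu))
```

Swapping the timing model changes only the reported target time. The results and the host cycles stay the same, which is the sense in which the two notions of time are independent.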
9Function/Timing Split Advantages
- Flexibility
  - Can configure target at runtime
  - Synthesize design once, change target model parameters at will
- Efficient FPGA resource usage
  - Example 1: model a 2-cycle FPU in 10 host cycles
  - Example 2: model a 16MB L2 using only 256KB host BRAM to store tags/metadata
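A rough sketch of why storing only tags/metadata is cheap: the host keeps a few bits per cache line instead of the line's data. The 64-byte line size and 8 bits of metadata per line are assumptions chosen for illustration, not figures from the slides, and `metadata_bytes` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB


def metadata_bytes(cache_bytes: int, line_bytes: int, bits_per_line: int) -> int:
    """Host storage needed if only per-line metadata (no data) is kept."""
    lines = cache_bytes // line_bytes
    return lines * bits_per_line // 8


modeled = 16 * MB  # size of the modeled L2
host = metadata_bytes(modeled, line_bytes=64, bits_per_line=8)  # assumed parameters
print(host // KB, "KB of host storage,", modeled // host, "x smaller than the modeled data array")
```

With these assumed parameters the metadata comes to 256 KB. Other line sizes or metadata widths would give a different total.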
10Host Multithreading
- How are we going to model 64 cores?
  - Build 64 pipelines
  - Time-multiplex one pipeline
[Diagram: 64 separate F-D-X-M-W pipelines running in parallel over time]
- Single hardware pipeline with multiple copies of CPU state
- No bypass path required
- Not multithreaded target!
[Diagram: one F-D-X-M-W pipeline with successive instructions overlapped in time]
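One way to see why no bypass path is needed: if the single pipeline issues from a different thread every cycle, and there are at least as many threads as pipeline stages, no thread ever has two instructions in flight at once. The sketch below checks this for a strict round-robin schedule. The round-robin policy is an assumption, the 5-stage depth is taken from the F/D/X/M/W labels in the diagram, and `max_in_flight_per_thread` is a hypothetical name.

```python
def max_in_flight_per_thread(num_threads: int, depth: int = 5, cycles: int = 1000) -> int:
    """Round-robin issue: thread (c % num_threads) enters the pipeline at cycle c.

    Returns the largest number of instructions that any single thread
    has in the pipeline at the same time.
    """
    worst = 0
    for c in range(cycles):
        # Instructions issued in cycles c-depth+1 .. c are still in the pipeline.
        in_flight = [i % num_threads for i in range(max(0, c - depth + 1), c + 1)]
        worst = max(worst, max(in_flight.count(t) for t in set(in_flight)))
    return worst


print(max_in_flight_per_thread(64))  # 64 host threads on a 5-stage pipeline
print(max_in_flight_per_thread(2))   # too few threads: instructions overlap
```

A result of 1 means an instruction always completes before the same thread's next instruction is fetched, so its results never need to be forwarded between stages.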
11Cache Modeling
- The cache model maintains tag, state, protocol bits internally
- Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall
[Diagram: the address is split into tag, index, and offset fields; the model stores (tag, state) entries up to the maximum associativity. Hit: don't stall. Miss: stall an arbitrary number of cycles.]
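A minimal software sketch of such a tags-only cache model follows. It is not the RAMP Gold implementation: the class name, the LRU replacement policy, the 32-byte line size, and the 20-cycle miss penalty are all assumptions made for illustration. The model stores no data. Each access returns only the number of target cycles to stall.

```python
class CacheTimingModel:
    """Tags-only cache model: tracks which lines are present, not their data."""

    def __init__(self, size_bytes, line_bytes, ways, miss_penalty):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.miss_penalty = miss_penalty
        # One list of tags per set, least recently used first.
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, addr):
        """Return the number of target cycles to stall for this access."""
        line = addr // self.line_bytes  # drop the offset bits
        index = line % self.num_sets    # index bits select the set
        tag = line // self.num_sets     # remaining bits form the tag
        tags = self.sets[index]
        if tag in tags:                 # hit: refresh LRU order, no stall
            tags.remove(tag)
            tags.append(tag)
            return 0
        if len(tags) == self.ways:      # set full: evict the LRU tag (assumed policy)
            tags.pop(0)
        tags.append(tag)
        return self.miss_penalty        # miss: stall


# Two addresses that collide in a 16 KB direct-mapped cache.
trace = [0, 16384, 0, 16384]
direct_mapped = CacheTimingModel(16 * 1024, 32, ways=1, miss_penalty=20)
four_way = CacheTimingModel(16 * 1024, 32, ways=4, miss_penalty=20)
dm_stall = sum(direct_mapped.access(a) for a in trace)
sa_stall = sum(four_way.access(a) for a in trace)
print(dm_stall, sa_stall)
```

Because geometry and penalty are ordinary parameters, the same model can be reconfigured between runs without touching the functional side.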
12Putting it all together
[Diagram: host pipeline blocks: instruction cache, ifetch stage, decode stage, register access stage, memory stage, data cache, exception stage, memory controller, cache model, performance counters]
- Resource Utilization (XC5VLX110T)
- LUTs 14%, BRAM 23%
- We can fit 3 pipelines on one FPGA!
13Infrastructure
14Our accomplishments this semester
Jan 2009 | Last Night
Single Threaded | 64 Way Host Multithreaded
0.000032GB BRAM | 2GB DDR2 SDRAM
Hello World works (sometimes) | ParLab Damascene CBIR App, SPLASH2, SPEC CPU2000
No Timing Model or Introspection | Runtime Configurable Cache Model, Performance Counters
No Floating Point | Hardware FPU Multiply/Add, Software Emulation
15HARDware ain't no joke
500 Man Hours DDR2 Memory Controller Debugging
100 Man Hours DMA Engine/Pipeline Issues
150 Man Hours CAD Tool Issues
?0 Pipeline Corner Cases
16Sample Use Case: L1 D Tradeoffs
- Assume we have a 64-core CMP with private 16KB direct-mapped L1 D
- In the next tech gen, we can fit either of these improved configurations in a clock cycle
  - 32KB direct-mapped L1
  - 16KB 4-way set-associative L1
- Which should we choose?
17Sample Use Case: L1 D Tradeoffs
- Evidently, the associative cache is superior
- It took longer to make these slides than to run
these 10 billion instruction simulations
18Future Directions
- RAMP Gold closes two critical feedback loops
  - Expedient HW/SW co-tuning is within our grasp
  - Simulations can now be run on a thermal timescale, enabling the exploration of temperature-aware scheduling policies
- We intend to explore both avenues!
19DEMO Damascene
[Diagram: Damascene pipeline from Image to Contours. Blocks shown: Convert Colorspace; Textons: K-means; Intervening Contour; Texture Gradient; Generalized Eigensolver; Bg, Cga, Cgb; Combine; Oriented Energy Combination; Non-max suppression; Combine, Normalize]