Title: RAMP Gold: Architecture and Timing Model
1 RAMP Gold: Architecture and Timing Model
- Andrew Waterman, Zhangxi Tan, Rimas Avizienis, Yunsup Lee, David Patterson, Krste Asanovic
Parallel Computing Laboratory, University of California, Berkeley
2 RAMP Gold Overview
Par Lab InfiniCore
- Tiled CMP simulator
- ISA: SPARC V8
  - (ARM/Thumb-2 later?)
- Split timing and function (both on FPGA)
- Host-multithreaded
- Runs on a Xilinx Virtex-5 LX110T FPGA (XUP board)
3 RAMP Gold Target Machine
[Figure: 64 SPARC V8 cores, each with its own I and D caches, connected through a shared L2 / interconnect to DRAM.]
4 RAMP Gold v1 Target Features
- 64 single-issue, in-order SPARC V8 processors
- Simple, 5-stage pipeline
- FPU
- Cache timing model
  - Configurable size, line size, associativity, miss penalty, shared/private
  - Change parameters without resynthesis
5 RAMP Gold Architecture
- Mapping the target machine directly to an FPGA is inefficient
- Solution: split timing and functionality, plus multithreading
- The timing logic decides how many target cycles an instruction sequence should take
- Simulating the functionality of an instruction might take multiple host cycles
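The decoupling of target cycles from host cycles can be sketched in a few lines of Python. Each instruction class carries two independent costs: the target cycles the timing logic charges, and the host cycles the functional simulation needs. The instruction classes and every cycle count below are hypothetical values chosen for illustration, not RAMP Gold parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    target_cycles: int  # cycles charged in the simulated (target) machine
    host_cycles: int    # cycles the host spends computing the result


# Hypothetical cost table (illustrative numbers only).
COSTS = {
    "add": Cost(target_cycles=1, host_cycles=1),
    "load": Cost(target_cycles=1, host_cycles=2),
    "fmul": Cost(target_cycles=2, host_cycles=10),
}


def simulate(trace):
    """Return (target_cycles, host_cycles) for a sequence of instruction classes."""
    target = sum(COSTS[op].target_cycles for op in trace)
    host = sum(COSTS[op].host_cycles for op in trace)
    return target, host


target, host = simulate(["add", "load", "fmul", "add"])
print(f"target cycles: {target}, host cycles: {host}")
```

The two totals never constrain each other: changing a `host_cycles` entry alters simulation speed but not the reported target time.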
6 Function/Timing Split Advantages
- Flexibility
  - Can configure target at runtime
  - Synthesize design once, change target model parameters at will
- Efficient FPGA resource usage
  - Example 1: model a 2-cycle FPU in 10 host cycles
  - Example 2: model a 16MB L2 using only 256KB host BRAM to store tags/metadata
- Enables multithreading
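The storage saving in Example 2 comes from keeping only tags and state, never the cached data. A rough calculator, as a sketch: the slide does not give the L2 line size, associativity, address width, or number of state bits, so the 128-byte lines, 4 ways, 32-bit addresses, and 2 state bits used here are assumptions picked only to show the arithmetic.

```python
def tag_store_bytes(cache_bytes, line_bytes, ways, addr_bits=32, state_bits=2):
    """Bytes needed to hold only the tags and state bits of a target cache.

    All sizes are assumed to be powers of two.
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    tag_bits = addr_bits - index_bits - offset_bits
    total_bits = lines * (tag_bits + state_bits)
    return (total_bits + 7) // 8


KIB = 1024
MIB = 1024 * KIB

# Assumed geometry: 16 MiB, 128-byte lines, 4-way, 32-bit addresses, 2 state bits.
needed = tag_store_bytes(16 * MIB, line_bytes=128, ways=4)
print(needed // KIB, "KiB of metadata for a 16 MiB target cache")
```

Under these assumed parameters the metadata fits in 256KB; other geometries (for example 64-byte lines) need more, so the fit depends on the configuration.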
7 Split Timing and Function
[Figure: target machine (CPU, L1 D, MEM) split into a functional model (CPU FM, L1 D FM, MEM FM) and a timing model (CPU TM, L1 D TM, MEM TM).]
- Functional model executes ISA correctly
- Timing model determines how long a program takes to run
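A toy illustration of the two roles, as a sketch only: the four-opcode mini-ISA, the event format, and the latency values are invented for this example and have nothing to do with SPARC V8 or the actual RAMP Gold interfaces. The functional model computes results; the timing model only counts target cycles from the stream of events the functional model produces.

```python
def functional_model(program, memory):
    """Execute a toy program; yield one (opcode, address) event per instruction."""
    regs = {}
    for op, dst, src in program:
        if op == "li":        # load immediate: regs[dst] <- constant src
            regs[dst] = src
            yield op, None
        elif op == "ld":      # load: regs[dst] <- memory[src]
            regs[dst] = memory[src]
            yield op, src
        elif op == "add":     # accumulate: regs[dst] <- regs[dst] + regs[src]
            regs[dst] += regs[src]
            yield op, None
        elif op == "st":      # store: memory[dst] <- regs[src]
            memory[dst] = regs[src]
            yield op, dst
        else:
            raise ValueError(f"unknown opcode {op!r}")


def timing_model(events, mem_latency):
    """Count target cycles: one per instruction plus mem_latency per load/store."""
    cycles = 0
    for _op, addr in events:
        cycles += 1
        if addr is not None:
            cycles += mem_latency
    return cycles


PROGRAM = [("li", "r1", 5), ("ld", "r2", 0x10), ("add", "r1", "r2"), ("st", 0x20, "r1")]


def run(mem_latency):
    """Return (value stored at 0x20, target cycles) for the toy program."""
    memory = {0x10: 7}
    cycles = timing_model(functional_model(PROGRAM, memory), mem_latency)
    return memory[0x20], cycles


print(run(mem_latency=0))
print(run(mem_latency=20))
```

Changing `mem_latency` changes only the cycle count; the stored value is identical in both runs, which is the point of the split.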
8 Split Timing and Function
[Figure: target machine (CPU, L1 D, MEM) split into a functional model (CPU FM, MEM FM) and a timing model (CPU TM, L1 D TM, MEM TM); the L1 D has no functional model.]
9 TM and FM from 30,000 ft
[Figure: block diagram of the CPU Timing Model, L1 D Timing Model, Memory Timing Model, CPU Functional Model, and Memory Functional Model. Signal labels: instruction, instruction complete, ld/st address, store data, load data, stall.]
10 TM and FM from 3,000 ft
[Figure: the same block diagram (CPU TM, L1 D TM, Memory Timing Model, CPU FM, Memory Functional Model) with pipeline stage labels IF, CTRL, TM1, TM2 and DEC, EX, MEM, WB. Signal labels: instruction, instruction complete, ld/st address, store data, load data, stall.]
11 Example: Target Load Miss
[Figure: the TM/FM block diagram (CPU TM, L1 D TM, Memory Timing Model, CPU FM, Memory Functional Model) annotated with numbered steps 1 through 7 tracing a target load that misses. Signal labels: instruction, instruction complete, ld/st address, store data, load data, stall.]
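12 Timing-Driven Host Pipeline
[Figure: timing-driven host pipeline. Target threads T0, T1, T2 are shown with instructions ADD, LD, ST. Labels: CPU/D Timing Model (TS, TM1, TM2, TM3, IF), L1 D TM, Store Buffer, CPU Functional Model (DE, EX, MEM1, MEM2, WB), Load Result Buffer, Target Memory TM/FM. Signal labels: TID,INST and TID,ADDR.]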
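One plausible reading of a timing-driven, host-multithreaded pipeline is sketched below: the host visits the target threads in turn, and the timing model decides for each thread whether it may hand its next instruction to the functional model or must sit out a stall. The round-robin order, the per-opcode stall counts, and the rule that one pass over all threads equals one target cycle are all assumptions made for this illustration.

```python
from collections import deque

# Hypothetical extra target stall cycles per instruction class.
STALL = {"ADD": 0, "LD": 2, "ST": 1}


def run(threads):
    """Interleave target threads on one host pipeline.

    threads: dict mapping thread id -> list of opcodes.
    Returns (target_cycles, issue_log), where each log entry is
    (target_cycle, tid, opcode).
    """
    pending = {tid: deque(prog) for tid, prog in threads.items()}
    stall = {tid: 0 for tid in threads}
    log = []
    target_cycle = 0
    while any(pending.values()) or any(stall.values()):
        for tid in sorted(threads):        # one host slot per target thread
            if stall[tid] > 0:
                stall[tid] -= 1            # timing model holds this thread back
            elif pending[tid]:
                op = pending[tid].popleft()
                log.append((target_cycle, tid, op))  # (TID, INST) issued
                stall[tid] = STALL[op]
        target_cycle += 1
    return target_cycle, log


cycles, log = run({0: ["ADD", "LD"], 1: ["LD", "ADD"], 2: ["ST"]})
print(cycles)
for entry in log:
    print(entry)
```

A stalled thread costs the host only an idle slot, so the other threads keep the pipeline busy while it waits.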
13 Cache Modeling
- The cache model maintains tag, state, protocol bits internally
- Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall
[Figure: the address is divided into tag, index, and offset fields; the index selects one (tag, state) entry per way of associativity, and comparing the stored tags with the address tag yields hit/miss.]
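A tag-only cache model of this kind can be sketched as follows. It stores no data, only tags, and answers each access with a stall count. The LRU replacement policy and the miss penalty of 20 cycles are assumptions (the slides name neither), and the state and protocol bits are omitted to keep the sketch short.

```python
class CacheTimingModel:
    """Tag-only set-associative cache: tracks hits and misses, stores no data."""

    def __init__(self, size_bytes, line_bytes, ways, miss_penalty):
        self.line_bytes = line_bytes
        self.ways = ways
        self.miss_penalty = miss_penalty
        self.num_sets = size_bytes // (line_bytes * ways)
        # Each set is a list of tags ordered from least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, addr):
        """Return the number of target stall cycles for one load or store."""
        line = addr // self.line_bytes   # drop the offset bits
        index = line % self.num_sets     # set index
        tag = line // self.num_sets      # remaining high-order bits
        tags = self.sets[index]
        if tag in tags:                  # hit: move tag to most recently used
            tags.remove(tag)
            tags.append(tag)
            return 0
        if len(tags) == self.ways:       # miss in a full set: evict the LRU tag
            tags.pop(0)
        tags.append(tag)
        return self.miss_penalty


# 32 KiB, 2-way, 64-byte lines; the miss penalty is a made-up value.
l1 = CacheTimingModel(32 * 1024, line_bytes=64, ways=2, miss_penalty=20)
stalls = [l1.access(a) for a in (0x0000, 0x0004, 0x4000, 0x8000, 0x0000)]
print(stalls)
```

Because every parameter is a constructor argument, a different cache geometry is just a different object, which mirrors the idea of changing target parameters without resynthesis.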
14 Multithreaded, Pipelined Cache TM
[Figure: the address supplies an index into an array of (tag, state) entries; the selected entries are compared to produce "hit?".]
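One way a single cache timing model could serve many target cores is to keep all of their tags in one flat array and let the thread id select a region of it. The sketch below assumes private, direct-mapped caches and a made-up miss penalty; the slide does not say how the real design partitions its tag storage, and the hardware pipelining is not modeled here.

```python
class MultithreadedCacheTM:
    """Private direct-mapped caches for many target cores in one flat tag array."""

    def __init__(self, num_threads, num_sets, line_bytes, miss_penalty):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.miss_penalty = miss_penalty
        # One entry per (thread, set); None marks an invalid line.
        self.tags = [None] * (num_threads * num_sets)

    def access(self, tid, addr):
        """Return target stall cycles for thread `tid` accessing `addr`."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        slot = tid * self.num_sets + index  # thread id selects a region of the array
        if self.tags[slot] == tag:
            return 0
        self.tags[slot] = tag               # fill on miss
        return self.miss_penalty


tm = MultithreadedCacheTM(num_threads=64, num_sets=512, line_bytes=64, miss_penalty=20)
print(tm.access(0, 0x100), tm.access(0, 0x100), tm.access(1, 0x100))
```

Thread 1 misses on an address thread 0 has already cached, because each thread sees only its own slice of the shared array.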
15 Quick and Dirty Validation
- 32KB, 2-way L1 D, 64B lines
- 256KB, 4-way L2, 64B lines
16 Status
- Functional model and simple timing model work in HW
- Running real programs (e.g. SPLASH-2)
- Near-term future work
  - Move from the current functional-first stall configuration to the timing-driven one described here
  - More interesting memory system timing model
  - Functional potpourri (FDIV, MMU, ...)
17 DEMO
- Run OCEAN with different L1 D parameters
18 Questions?