Title: RAMP Tutorial Introduction/Overview
1 RAMP Tutorial: Introduction/Overview
- Krste Asanovic
- UC Berkeley
- RAMP Tutorial, ASPLOS, Seattle, WA
- March 2, 2008
2 Technology Trends: CPU
- Microprocessor: Power Wall + Memory Wall + ILP Wall = Brick Wall
- End of uniprocessors and faster clock rates
- Every program(mer) is a parallel program(mer); sequential algorithms are slow algorithms
- Since parallel is more power efficient (W ≈ CV²F), the new "Moore's Law" is 2X processors or "cores" per socket every 2 years, at the same clock frequency
- Conservative: 2007 4 cores, 2009 8 cores, 2011 16 cores for embedded, desktop, and server
- Sea change for HW and SW industries, since it changes the programmer model and responsibilities
- HW/SW industries have bet the farm that parallel will be successful
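The power argument and the core-count projection above can be sketched numerically. This is a minimal illustration in normalized units; the function names and the 0.8 voltage scale for the slower cores are assumptions made for the example, not figures from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power, W ~ C * V^2 * F (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


def cores_per_socket(year, base_year=2007, base_cores=4):
    """Cores per socket if the count doubles every 2 years from the 2007 baseline."""
    return base_cores * 2 ** ((year - base_year) // 2)


# One core at nominal voltage and frequency (normalized units).
single = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Two cores, each at half the frequency. The 0.8 voltage scale is an
# illustrative assumption: a slower clock usually tolerates a lower voltage.
dual = 2 * dynamic_power(capacitance=1.0, voltage=0.8, frequency=0.5)

print(f"single core power: {single:.2f}")
print(f"dual core power:   {dual:.2f}")
for year in (2007, 2009, 2011):
    print(year, cores_per_socket(year))
```

Under these assumptions the two slower cores offer the same aggregate clock cycles per second as the single fast core while drawing less power, which is the sense in which parallel is more power efficient.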
3 Problems with the "Manycore" Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready for 1000 CPUs / chip
- Only companies can build HW, and it takes years
- Software people don't start working hard until hardware arrives
- 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
- How do we get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, ...?
- Can we avoid waiting years between HW/SW iterations?
4 Vision: Build Research MPP from FPGAs
- As ≈16 CPUs will fit in a Field Programmable Gate Array (FPGA), build a 1000-CPU system from ≈64 FPGAs?
- 8 simple 32-bit "soft core" RISC processors at 100 MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs: ≈2X CPUs, ≈1.2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box MPP
- E.g., 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ≈150 MHz/CPU in 2007
- 6 universities, 10 faculty
- 3rd party sells RAMP 2.0 (BEE3) hardware at low cost
- RAMP: "Research Accelerator for Multiple Processors"
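The scaling figures on this slide can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and it assumes the growth rates apply per whole 1.5-year generation starting from the 2004 Virtex-II data point.

```python
import math


def project_fpga(generations, cpus=8, clock_mhz=100.0,
                 cpu_growth=2.0, clock_growth=1.2):
    """Scale the 2004 Virtex-II data point forward by whole FPGA generations."""
    return (cpus * cpu_growth ** generations,
            clock_mhz * clock_growth ** generations)


def fpgas_needed(target_cpus, cpus_per_fpga):
    """Smallest number of FPGAs that holds target_cpus soft cores."""
    return math.ceil(target_cpus / cpus_per_fpga)


# 2004 -> 2007 is 3 years, i.e. two 1.5-year generations.
cpus_2007, mhz_2007 = project_fpga(generations=2)
print(f"simple 32-bit cores per FPGA: {cpus_2007:.0f}, clock: {mhz_2007:.0f} MHz")

# Using the slide's estimate of about 16 CPUs per FPGA.
print("FPGAs for 1000 CPUs:", fpgas_needed(1000, 16))
print("CPUs on 64 FPGAs:", 64 * 16)
```

Two generations of 1.2X clock growth from 100 MHz land near the ≈150 MHz quoted for 2007, and 64 FPGAs at 16 CPUs each give slightly more than 1000 CPUs.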
5 Why RAMP Good for Research MPP?
6 Partnerships
- Co-PIs: Krste Asanović (UCB), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)
- RAMP hardware development activity centered at Berkeley Wireless Research Center
- Three-year NSF grant for staff (awarded 3/06)
- GSRC (Jan Rabaey) has paid partial staff and some students
- Major continuing commitment from Xilinx
- Collaboration with MSR (Chuck Thacker) on BEE3 FPGA platform
- Sun, IBM contributing processor designs, IBM faculty awards
High-speed high-confidence emulation is widely
recognized as a necessary component of
multiprocessor research and development. FPGA
emulation is the only practical approach.
7 BEE3 Design
Chuck Thacker (MSR), Chen Chang (UC Berkeley)
- New RAMP systems to be based on Berkeley Emulation Engine version 3 (BEE3)
- BEEcube, Inc. (UC Berkeley spinout startup company)
- To provide manufacturing, distribution, and support to commercial and academic users
- General availability 2Q08
BEE3, 1st prototype 11/07
For small scale design, or to get started, use
Xilinx ML505
8 RAMP: An infrastructure to build simulators using FPGAs
9 Run Target Model on Host Platform
Hard Work
10 Reduce, Reuse, Recycle
- Reduce effort to build target models
- Users just build components (units); infrastructure handles connections (the RDL Compiler)
- Reuse units by having good abstractions
- Across different target models
- Across different host platforms
- XUP, Calinx, BEE2, BEE3, ML505, also Altera platforms
- Recycle existing IP for use as simulation models
- Commercial processor RTL is (almost) its own model
11 RAMP Target Model
- Units
- Relatively large chunks of functionality
- e.g., processor + L1 cache
- User-written in some HDL or software
- Channels
- Point-to-point, unidirectional, two kinds
- FIFO channel: flow-controlled interface
- Pipeline channel: simple shift register, bits drop off the end
- Generated by RAMP infrastructure
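The difference between the two channel kinds can be illustrated with a small software model. This is a sketch only: the class and method names are hypothetical and are not the interfaces generated by the RAMP infrastructure, and the pipeline channel is assumed to have a latency of at least one target cycle.

```python
from collections import deque


class FifoChannel:
    """Flow-controlled channel: the sender must check for space first."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def can_send(self):
        return len(self.items) < self.depth

    def send(self, msg):
        """Enqueue msg; return False (nothing is lost) if the FIFO is full."""
        if not self.can_send():
            return False
        self.items.append(msg)
        return True

    def receive(self):
        """Dequeue the oldest message, or None if the FIFO is empty."""
        return self.items.popleft() if self.items else None


class PipelineChannel:
    """Fixed-latency shift register with no flow control (latency >= 1)."""

    def __init__(self, latency):
        self.stages = deque([None] * latency, maxlen=latency)

    def tick(self, msg=None):
        """Advance one target cycle: shift msg in, return what drops off the end."""
        out = self.stages[-1]
        self.stages.appendleft(msg)  # maxlen discards the oldest stage
        return out


pipe = PipelineChannel(latency=2)
print([pipe.tick(m) for m in ["a", "b", "c", "d"]])

fifo = FifoChannel(depth=1)
print(fifo.send(1), fifo.send(2), fifo.receive())
```

In the pipeline model a value that the receiver ignores on the cycle it emerges is simply gone, whereas the FIFO model pushes back on the sender instead of losing data.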
12 Target Pipeline Channel Parameters
[Figure: pipeline channel; labels: D, Datawidth, Forward Latency]
13 RAMP Description Language (RDL)
Greg Gibeling, UCB
[Figure: RDLC maps a target (Unit A, Unit B, Unit C) onto a host (FPGA1, FPGA2); labels: Generated Unit Wrappers, Generated links carry channels]
- User describes target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL
- RDL Compiler (RDLC) generates configurations
14 Virtual Target Clock
15 Virtualized RTL Improves FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
- E.g., Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
- If RTL is mapped directly, requires 48K flip-flops
- Slow cycle time, large area
- If mapped into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
- Faster cycle time (3X) and far fewer resources
- Example 2: Large L2/L3 caches
- Current FPGAs only have 1MB of on-chip SRAM
- Use on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch data from off-chip DRAM
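The numbers in Example 1 follow from simple arithmetic, sketched below. The helper names are hypothetical, and the scheduling rule (reads and writes proceed in parallel on the block RAM's separate read and write ports) is an assumption made for illustration.

```python
import math


def flip_flops_needed(storage_bytes):
    """Direct RTL mapping stores every bit in its own flip-flop."""
    return storage_bytes * 8


def host_cycles_per_target_cycle(read_ports, write_ports,
                                 bram_reads_per_cycle=1,
                                 bram_writes_per_cycle=1):
    """Host cycles needed to serialize all port accesses onto a block RAM.

    Assumes reads and writes overlap, so the larger of the two counts wins.
    """
    read_cycles = math.ceil(read_ports / bram_reads_per_cycle)
    write_cycles = math.ceil(write_ports / bram_writes_per_cycle)
    return max(read_cycles, write_cycles)


bits = flip_flops_needed(6 * 1024)
print(f"flip-flops for 6KB: {bits} ({bits // 1024}K)")
print("host cycles per target cycle:",
      host_cycles_per_target_cycle(read_ports=3, write_ports=2))
```

The three read ports set the target-host clock ratio at 3 host cycles per target cycle, which the slide trades for a roughly 3X faster host cycle time and far less area.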
16 Start/Done Timing Interface
[Figure: wrapper around a unit; signals: In1, In2, Out, Start, Done]
- Wrapper generated by RDL asserts "Start" on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
- Unit asserts "Done" when it finishes the target cycle and its outputs are ready
- Unit can take a variable amount of time
- Unvirtualized RTL unit can connect "Done" to "Start" (but must not clock until "Start")
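The Start/Done handshake can be modeled in software as a small cycle-by-cycle simulation. This is a sketch under stated assumptions: the `Unit` class, its `work_cycles` parameter, and the `simulate` helper are hypothetical names for illustration, not part of RDL or its generated wrappers.

```python
class Unit:
    """Toy unit that needs `work_cycles` host cycles per target cycle."""

    def __init__(self, work_cycles):
        self.work_cycles = work_cycles
        self.remaining = 0
        self.busy = False

    def host_tick(self, start):
        """Advance one host cycle; return True on the cycle Done is asserted."""
        if start and not self.busy:
            self.busy = True
            self.remaining = self.work_cycles
        if self.busy:
            self.remaining -= 1
            if self.remaining == 0:
                self.busy = False
                return True
        return False


def simulate(unit, inputs_ready):
    """Play the wrapper's role: assert Start when inputs are ready and the
    unit is idle, and count completed target cycles (Done assertions)."""
    target_cycles = 0
    for ready in inputs_ready:
        start = ready and not unit.busy
        if unit.host_tick(start):
            target_cycles += 1
    return target_cycles


# Virtualized unit: 3 host cycles per target cycle.
print(simulate(Unit(work_cycles=3), [True] * 9))

# Unvirtualized unit: Done in the same host cycle as Start, but it only
# advances on cycles where the wrapper asserts Start.
print(simulate(Unit(work_cycles=1), [True, False, False, True]))
```

With `work_cycles=1` the model behaves like a unit whose Done is wired to Start: it completes a target cycle on every host cycle in which Start is asserted and stays frozen otherwise.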
17 Distributed Timing Models
18 Distributed Timing Example
[Figure: Unit A and Unit B connected by a channel; labels: D, Target Latency L]
19 Other Automatically Generated Networks
- Control network has workstation as master and every unit as slave device
- Memory-mapped interface with block transfers
- Used for initialization, stats gathering, debugging, and monitoring
- Units can connect to DRAM resources outside of timed target channels
- Used to support emulation and virtualization state
- Units can communicate with each other outside of timed target channels
- Supports arbitrary communication, e.g., for distributed stats gathering
20 Wide Variety of RAMP Simulators
21 Simulator Design Choices
- Structural Analog versus Highly Virtualized
- Functional-only versus Functional+Timing
- Timing via (virtual) RTL design versus separate functional and timing models
- Hybrid software/hardware simulators
22 Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
23 Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
[Figure: Functional Model and Timing Model]
- Functional model executes CPU ISA correctly, no timing information
- Only need to develop functional model once for each ISA
- Timing model captures pipeline timing details, does not need to execute code
- Much easier to change timing model for architectural experimentation
- Without RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between timing and functional model
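One possible split between the two models can be sketched in a few lines. Everything here is hypothetical and chosen for illustration: the toy three-instruction ISA, the per-opcode latency tables, and the trace-based interface are not taken from HASIM or FAST.

```python
def functional_model(program, regs):
    """Execute a toy ISA correctly, with no notion of time.

    Returns the trace of executed opcodes for a timing model to consume.
    """
    trace = []
    for op, dst, a, b in program:
        if op == "li":        # load immediate: dst <- a
            regs[dst] = a
        elif op == "add":     # dst <- regs[a] + regs[b]
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":     # dst <- regs[a] * regs[b]
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown opcode: {op}")
        trace.append(op)
    return trace


def timing_model(trace, latencies):
    """Charge cycles per instruction without executing anything."""
    return sum(latencies[op] for op in trace)


program = [
    ("li", "r1", 2, None),
    ("li", "r2", 3, None),
    ("add", "r3", "r1", "r2"),
    ("mul", "r4", "r3", "r3"),
]
regs = {}
trace = functional_model(program, regs)
print("result r4 =", regs["r4"])

# Two candidate microarchitectures share the same functional run.
print("design A cycles:", timing_model(trace, {"li": 2, "add": 1, "mul": 3}))
print("design B cycles:", timing_model(trace, {"li": 4, "add": 1, "mul": 1}))
```

Swapping the latency table is all it takes to evaluate a different design here, which mirrors why a separate timing model is easier to change for architectural experiments than a full RTL design.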
24 Multithreaded Func. + Timing Models (RAMP Gold: Tan, Gibeling, Asanovic, UCB)
[Figure: Timing Model Pipeline, MT-Channels, MT-Unit]
- MT-Unit multiplexes multiple target units on a single host engine
- MT-Channel multiplexes multiple target channels over a single host link
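The multiplexing idea can be sketched as follows. This is an illustrative model only: the class names, the round-robin schedule, and the tagged-message format are assumptions, not the RAMP Gold design.

```python
from collections import deque


class MTUnit:
    """One host engine time-shared by several target units (round-robin)."""

    def __init__(self, num_targets):
        # Per-target state; here just a count of completed target cycles.
        self.target_cycles = [0] * num_targets
        self.next_target = 0

    def host_tick(self):
        """Spend one host cycle advancing one target unit by one target cycle."""
        tid = self.next_target
        self.target_cycles[tid] += 1
        self.next_target = (tid + 1) % len(self.target_cycles)
        return tid


class MTChannel:
    """One host link carrying several target channels, tagged by channel id."""

    def __init__(self):
        self.link = deque()

    def send(self, channel_id, msg):
        self.link.append((channel_id, msg))

    def receive(self):
        """Return the next (channel_id, msg) pair, or None if the link is idle."""
        return self.link.popleft() if self.link else None


unit = MTUnit(num_targets=4)
order = [unit.host_tick() for _ in range(8)]
print("host cycle -> target unit:", order)
print("target cycles completed:", unit.target_cycles)

chan = MTChannel()
chan.send(0, "from target 0")
chan.send(3, "from target 3")
print(chan.receive(), chan.receive())
```

With four target units sharing the engine, each target advances one target cycle for every four host cycles, while only one copy of the engine's logic occupies FPGA resources.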
25 Schedule
- 9:00-9:45 Welcome/Overview
- 9:45-10:15 RAMP Blue Overview and Demo
- 10:15-10:45 Break
- 10:45-12:30 RAMP White Live Demo
- BEE3 Rollout (MSR/BEEcube/Q&A)
- 12:30-13:30 Lunch
- 13:30-15:00 ATLAS Transactional Memory (RAMP Red)
- 15:00-15:15 Break
- 15:15-16:45 CMU Simics/RAMP Cache Study
- 16:45 Wrapup
26
- RAMP Blue Release 2/25/2008
- Design available from RAMP website
- ramp.eecs.berkeley.edu
27 RAMP White
Hari Angepat, Derek Chiou (UT Austin)
- Scalable Coherent Shared Memory Multiprocessor
- Support standard shared memory programming models
[Figure: RAMP-White block diagram; labels: Leon3 shim, Intersection Unit, NIU, Router, AHB shim, AHB bus, MP IntCntrl, DSU, Eth, DDR2]
29 CMU Simics/RAMP Simulator
16-CPU Shared-memory UltraSPARC III Server
(SunFire 3800)
BEE2 Platform
30 RAMP Home Page/Repository
- ramp.eecs.berkeley.edu
- Remotely accessible subversion repository
31 Thank You!