1
RAMP Tutorial: Introduction/Overview

Krste Asanovic
UC Berkeley
RAMP Tutorial, ASPLOS, Seattle, WA
March 2, 2008

2
Technology Trends: CPU

Microprocessor: Power Wall + Memory Wall + ILP Wall = Brick Wall
End of uniprocessors and faster clock rates
Every program(mer) is a parallel program(mer); sequential algorithms are slow algorithms
Since parallel is more power efficient (W ≈ CV²F), the new "Moore's Law" is 2X processors (or cores) per socket every 2 years at the same clock frequency
Conservative: 2007: 4 cores, 2009: 8 cores, 2011: 16 cores for embedded, desktop, and server
Sea change for HW and SW industries, since it changes the programmer model and responsibilities
HW/SW industries have bet the farm that parallel will be successful
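
The conservative projection on this slide is plain doubling arithmetic. A minimal sketch, assuming the slide's baseline of 4 cores in 2007; the function name `cores_per_socket` and its parameters are illustrative, not part of the talk:

```python
def cores_per_socket(year, base_year=2007, base_cores=4, years_per_doubling=2):
    """Project cores per socket, doubling every `years_per_doubling` years.

    The baseline (4 cores in 2007) comes from the slide; everything else
    here is a hypothetical helper for illustration.
    """
    doublings = (year - base_year) // years_per_doubling
    return base_cores * 2 ** doublings


# Reproduce the three data points listed on the slide.
for year in (2007, 2009, 2011):
    print(year, cores_per_socket(year))
```

The clock frequency does not appear in the calculation because the slide holds it constant; only the core count scales.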

3
Problems with Manycore Sea Change

Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, and Libraries are not ready for 1000 CPUs per chip
≈ Only companies can build HW, and it takes years
Software people don't start working hard until hardware arrives
3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
How do we get 1000-CPU systems into the hands of researchers to innovate in a timely fashion on algorithms, compilers, languages, OS, architectures, ...?
Can we avoid waiting years between HW/SW iterations?

4
Vision: Build Research MPP from FPGAs

As ≈ 16 CPUs will fit in a Field Programmable Gate Array (FPGA), build a 1000-CPU system from ≈ 64 FPGAs?
8 simple 32-bit soft-core RISC CPUs at 100 MHz in 2004 (Virtex-II)
FPGA generations every 1.5 years: ≈ 2X CPUs, ≈ 1.2X clock rate
HW research community does logic design ("gate shareware") to create an out-of-the-box MPP
E.g., 1000-processor, standard-ISA binary-compatible, 64-bit, cache-coherent supercomputer @ ≈ 150 MHz/CPU in 2007
6 universities, 10 faculty
3rd party sells RAMP 2.0 (BEE3) hardware at low cost
RAMP: Research Accelerator for Multiple Processors
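
The sizing argument on this slide can be checked with a few lines of arithmetic. This is a sketch only: the helper names `fpgas_needed` and `scale_generations` are hypothetical, and the scaling factors are the slide's approximate figures (2X CPUs and 1.2X clock per FPGA generation).

```python
import math


def fpgas_needed(target_cpus, cpus_per_fpga):
    """Number of FPGAs required to host `target_cpus` soft cores."""
    return math.ceil(target_cpus / cpus_per_fpga)


def scale_generations(cpus, clock_mhz, generations, cpu_factor=2, clock_factor=1.2):
    """Apply the per-generation scaling factors quoted on the slide."""
    for _ in range(generations):
        cpus *= cpu_factor
        clock_mhz *= clock_factor
    return cpus, clock_mhz


# 1000 CPUs at 16 CPUs per FPGA needs 63 boards, in line with the slide's "≈ 64".
print(fpgas_needed(1000, 16))

# Starting from the 2004 data point (8 CPUs at 100 MHz on Virtex-II):
print(scale_generations(8, 100.0, generations=1))  # one generation later
print(scale_generations(8, 100.0, generations=2))  # two generations later
```

Under these assumptions, one generation reaches the 16 CPUs per FPGA quoted above, and two generations bring the clock to about 144 MHz, close to the slide's ≈ 150 MHz.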

5
Why RAMP Good for Research MPP?
6
Partnerships

Co-PIs: Krste Asanović (UCB), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)
RAMP hardware development activity is centered at the Berkeley Wireless Research Center.
Three-year NSF grant for staff (awarded 3/06).
GSRC (Jan Rabaey) has paid partial staff and some students.
Major continuing commitment from Xilinx.
Collaboration with MSR (Chuck Thacker) on the BEE3 FPGA platform.
Sun and IBM are contributing processor designs; IBM faculty awards.

High-speed, high-confidence emulation is widely recognized as a necessary component of multiprocessor research and development. FPGA emulation is the only practical approach.
7
BEE3 Design
Chuck Thacker (MSR), Chen Chang (UC Berkeley)

New RAMP systems to be based on Berkeley
Emulation Engine version 3 (BEE3).
BEECube, Inc.
(UC Berkeley spinout startup company)
To provide manufacturing, distribution, and
support to commercial and academic users.
General availability 2Q08

BEE3, 1st prototype 11/07
For small-scale designs, or to get started, use the Xilinx ML505
8
RAMP: An infrastructure to build simulators using FPGAs
9
Run Target Model on Host Platform
Hard Work
10
Reduce, Reuse, Recycle

Reduce effort to build target models
Users just build components (units); the infrastructure handles connections (the RDL Compiler)
Reuse units by having good abstractions
Across different target models
Across different host platforms: XUP, Calinx, BEE2, BEE3, ML505; also Altera platforms
Recycle existing IP for use as simulation models
Commercial processor RTL is (almost) its own model

11
RAMP Target Model

Units
Relatively large chunks of functionality, e.g., processor + L1 cache
User-written in some HDL or software
Channels
Point-to-point, unidirectional, two kinds:
FIFO channel: flow-controlled interface
Pipeline channel: simple shift register, bits drop off the end
Generated by RAMP infrastructure
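
The two channel kinds described on this slide can be modeled in software to show how they differ. The sketch below is not RAMP or RDL code; the class names `PipelineChannel` and `FifoChannel` and their methods are assumptions made for illustration.

```python
from collections import deque


class PipelineChannel:
    """Fixed-latency shift register: whatever reaches the end drops off."""

    def __init__(self, latency):
        # One slot per target cycle of latency; None marks an empty slot.
        self.stages = deque([None] * latency, maxlen=latency)

    def cycle(self, value=None):
        """Shift by one target cycle; return the entry leaving the channel."""
        out = self.stages[0]
        self.stages.append(value)  # maxlen discards the oldest entry
        return out


class FifoChannel:
    """Flow-controlled channel: the sender must wait when the queue is full."""

    def __init__(self, depth):
        self.depth = depth
        self.queue = deque()

    def send(self, value):
        """Return False (back-pressure) instead of dropping data when full."""
        if len(self.queue) >= self.depth:
            return False
        self.queue.append(value)
        return True

    def receive(self):
        """Return the oldest message, or None if the channel is empty."""
        return self.queue.popleft() if self.queue else None


pipe = PipelineChannel(latency=2)
print([pipe.cycle(v) for v in ("a", "b", "c", None)])

fifo = FifoChannel(depth=1)
print(fifo.send("x"), fifo.send("y"), fifo.receive())
```

The pipeline channel never stalls the sender but loses anything the receiver does not consume on time, while the FIFO channel never loses data but can stall the sender.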

12
Target Pipeline Channel Parameters
[Figure: target pipeline channel; labeled parameters: D, Datawidth, Forward Latency]
13
RAMP Description Language (RDL)
Greg Gibeling, UCB
[Figure: RDLC maps a target model (Unit A, Unit B, Unit C) onto a host (FPGA1, FPGA2) with generated unit wrappers; generated links carry channels.]

User describes target model topology, channel
parameters, and (manual) mapping to host platform
FPGAs using RDL
RDL Compiler (RDLC) generates configurations

14
Virtual Target Clock
15
Virtualized RTL Improves FPGA Resource Usage

RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
Example 1: Multiported register file
E.g., Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
If the RTL is mapped directly, it requires 48K flip-flops
Slow cycle time, large area
If mapped into block RAMs (one read + one write per cycle), it takes 3 host cycles and 3x2KB block RAMs
Faster cycle time (3X) and far fewer resources
Example 2: Large L2/L3 caches
Current FPGAs only have 1MB of on-chip SRAM
Use on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch the data from off-chip DRAM
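
The register-file numbers in Example 1 follow from simple arithmetic. The sketch below assumes that the ports are time-multiplexed over a block RAM offering one read and one write per host cycle, which is one plausible reading of the slide; the function names are hypothetical.

```python
import math


def flip_flops_for_storage(kilobytes):
    """Direct RTL mapping: one flip-flop per bit of register storage."""
    return kilobytes * 1024 * 8


def host_cycles_per_target_cycle(read_ports, write_ports, bram_reads=1, bram_writes=1):
    """Host cycles needed to serve all ports through a time-shared block RAM.

    Assumes each block RAM performs `bram_reads` reads and `bram_writes`
    writes per host cycle, as stated on the slide.
    """
    return max(math.ceil(read_ports / bram_reads),
               math.ceil(write_ports / bram_writes))


print(flip_flops_for_storage(6))           # bits of storage for 6KB
print(host_cycles_per_target_cycle(3, 2))  # 3 read ports, 2 write ports
```

6KB works out to 49,152 bits, which is the slide's 48K flip-flops, and the three read ports set the 3 host cycles per target cycle.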

16
Start/Done Timing Interface
[Figure: wrapper enclosing a unit, with inputs In1 and In2, output Out, and Start/Done signals]

Wrapper generated by RDL asserts Start on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
Unit asserts Done when it finishes the target cycle and its outputs are ready
Unit can take a variable amount of time
An unvirtualized RTL unit can connect Done to Start (but must not clock until Start)
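
The Start/Done handshake can be illustrated with a small software model. This is a sketch under simplifying assumptions: inputs are always ready, so the wrapper asserts Start whenever the unit is idle, and each target cycle needs at least one host cycle. The names `Unit`, `host_tick`, and `run_wrapper` are invented for the example.

```python
class Unit:
    """Toy unit that needs a variable number of host cycles per target cycle."""

    def __init__(self, latencies):
        self.latencies = list(latencies)  # host cycles for each target cycle (>= 1)
        self.remaining = 0
        self.busy = False

    def host_tick(self, start):
        """Advance one host cycle; return True (Done) when a target cycle ends."""
        if start and not self.busy:
            self.remaining = self.latencies.pop(0)
            self.busy = True
        if self.busy:
            self.remaining -= 1
            if self.remaining == 0:
                self.busy = False
                return True
        return False


def run_wrapper(unit, target_cycles):
    """Count the host cycles spent completing `target_cycles` target cycles."""
    host_cycles = 0
    completed = 0
    while completed < target_cycles:
        start = not unit.busy  # inputs assumed ready, so Start follows idleness
        done = unit.host_tick(start)
        host_cycles += 1
        if done:
            completed += 1
    return host_cycles


# Three target cycles taking 1, 3, and 2 host cycles respectively.
print(run_wrapper(Unit([1, 3, 2]), target_cycles=3))
```

A unit whose latency is always 1 returns Done in the same host cycle as Start, which corresponds to the unvirtualized case where Done is wired to Start.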

17
Distributed Timing Models
18
Distributed Timing Example
[Figure: Unit A connected to Unit B by a target channel; labels: D, Latency L]
19
Other Automatically Generated Networks

Control network has workstation as master and
every unit as slave device
Memory-mapped interface with block transfers
Used for initialization, stats gathering,
debugging, and monitoring
Units can connect to DRAM resources outside of
timed target channels
Used to support emulation and virtualization
state
Units can communicate with each other outside of
timed target channels
Supports arbitrary communication, e.g., for distributed stats gathering

20
Wide Variety of RAMP Simulators
21
Simulator Design Choices

Structural Analog versus Highly Virtualized
Functional-only versus Functional+Timing
Timing via (virtual) RTL design versus separate
functional and timing models
Hybrid software/hardware simulators

22
Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))

Multithreading emulation engine reduces FPGA
resource use and improves emulator throughput
Hides emulation latencies (e.g., communicating
across FPGAs)

23
Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
[Figure: Functional Model and Timing Model]

Functional model executes CPU ISA correctly, no
timing information
Only need to develop functional model once for
each ISA
Timing model captures pipeline timing details,
does not need to execute code
Much easier to change timing model for
architectural experimentation
Without RTL design, cannot be 100% certain that timing is accurate
Many possible splits between timing and
functional model
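
The split described on this slide can be illustrated with a toy simulator. The sketch below is not HASIM or FAST code: the two-instruction ISA, the trace format, and the names `functional_model` and `timing_model` are assumptions chosen to keep the example short, and it shows only one of the many possible splits.

```python
def functional_model(program, regs):
    """Execute a toy ISA correctly and emit one trace entry per instruction."""
    trace = []
    for op, dst, a, b in program:
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown op {op!r}")
        trace.append(op)  # no timing information is recorded here
    return trace


def timing_model(trace, latency):
    """Charge each traced instruction a cycle cost; never computes values."""
    return sum(latency[op] for op in trace)


program = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")]
regs = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
trace = functional_model(program, regs)

# Two architectural experiments reuse the same functional run.
print(regs["r3"])
print(timing_model(trace, {"add": 1, "mul": 3}))
print(timing_model(trace, {"add": 1, "mul": 1}))
```

Changing the multiplier latency only touches the table passed to the timing model, which is the ease-of-experimentation point the slide makes.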

24
Multithreaded Func. & Timing Models (RAMP Gold: Tan, Gibeling, Asanovic, UCB)
[Figure: timing model pipeline built from MT-Units connected by MT-Channels]

MT-Unit multiplexes multiple target units on a
single host engine
MT-Channel multiplexes multiple target channels
over a single host link
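
The multiplexing idea behind an MT-Unit can be sketched as a single engine that keeps separate state for each target unit and serves them in turn. This is an illustration only, not RAMP Gold code: the class name `MTUnit`, the round-robin policy, and the counter used as per-target state are all assumptions.

```python
class MTUnit:
    """One host engine that round-robins over the state of several target units."""

    def __init__(self, num_targets):
        # Per-target state; a simple target-cycle counter stands in for
        # the real architectural state of each target unit.
        self.target_cycles = [0] * num_targets
        self.next_thread = 0

    def host_cycle(self):
        """Advance one target unit by one target cycle; return its index."""
        tid = self.next_thread
        self.target_cycles[tid] += 1
        self.next_thread = (tid + 1) % len(self.target_cycles)
        return tid


engine = MTUnit(num_targets=4)
order = [engine.host_cycle() for _ in range(8)]
print(order)
print(engine.target_cycles)
```

With 4 target units, 8 host cycles advance every target by 2 target cycles, so the host engine is shared rather than replicated once per target unit.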

25
Schedule

9:00-9:45 Welcome/Overview
9:45-10:15 RAMP Blue Overview & Demo
10:15-10:45 Break
10:45-12:30 RAMP White Live Demo
BEE3 Rollout (MSR/BEEcube/Q&A)
12:30-13:30 Lunch
13:30-15:00 ATLAS Transactional Memory (RAMP Red)
15:00-15:15 Break
15:15-16:45 CMU Simics/RAMP Cache Study
16:45 Wrapup

RAMP Blue Release 2/25/2008
design available from RAMP website
ramp.eecs.berkeley.edu

27
RAMP White: Hari Angepat, Derek Chiou (UT Austin)

Scalable Coherent Shared Memory Multiprocessor
Support standard shared memory programming models

[Figure: RAMP-White block diagram; labels: Leon3 shim, Intersection Unit, NIU, Router, AHB shim, AHB bus, MP IntCntrl, DSU, Eth, DDR2]
29
CMU Simics/RAMP Simulator
16-CPU Shared-memory UltraSPARC III Server
(SunFire 3800)
BEE2 Platform
30
RAMP Home Page/Repository