RAMP Tutorial: Introduction/Overview


1
RAMP Tutorial: Introduction/Overview
  • Krste Asanovic
  • UC Berkeley
  • RAMP Tutorial, ASPLOS, Seattle, WA
  • March 2, 2008

2
Technology Trends: CPU
  • Microprocessor: Power Wall + Memory Wall + ILP
    Wall = Brick Wall
  • End of uniprocessors and faster clock rates
  • Every program(mer) is a parallel program(mer);
    sequential algorithms are slow algorithms
  • Since parallel is more power efficient (W ≈
    CV²F), the new "Moore's Law" is 2X processors or
    cores per socket every 2 years, same clock
    frequency
  • Conservative: 2007 4 cores, 2009 8 cores, 2011
    16 cores for embedded, desktop, and server
  • Sea change for HW and SW industries since
    changing programmer model, responsibilities
  • HW/SW industries bet the farm that parallel is
    successful
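
The power argument and the core-count projection on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the relation W ≈ CV²F and the 2007/2009/2011 core counts come from the slide, while the normalized units and the 0.8 supply-voltage factor for the half-speed cores are illustrative assumptions, not figures from the talk.

```python
def dynamic_power(c, v, f):
    """Dynamic power from the slide's relation W ~ C * V^2 * F."""
    return c * v ** 2 * f


def cores_per_socket(year, base_year=2007, base_cores=4):
    """Cores per socket, doubling every 2 years from 4 cores in 2007."""
    return base_cores * 2 ** ((year - base_year) // 2)


# One core at full clock and full voltage (normalized units).
one_fast = dynamic_power(c=1.0, v=1.0, f=1.0)

# Two cores at half clock. ASSUMPTION: the slower clock allows the
# supply voltage to drop to 0.8 of nominal; this factor is made up.
two_slow = 2 * dynamic_power(c=1.0, v=0.8, f=0.5)

print(one_fast, two_slow)
print([cores_per_socket(y) for y in (2007, 2009, 2011)])
```

Under that assumed voltage, two half-speed cores offer the same aggregate clock for roughly two thirds of the power, which is the sense in which parallel is more power efficient.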

3
Problems with Manycore Sea Change
  • Algorithms, Programming Languages, Compilers,
    Operating Systems, Architectures, Libraries, ...
    not ready for 1000 CPUs / chip
  • Only companies can build HW, and it takes years
  • Software people don't start working hard until
    hardware arrives
  • 3 months after HW arrives, SW people list
    everything that must be fixed, then we all wait 4
    years for next iteration of HW/SW
  • How to get 1000-CPU systems into the hands of
    researchers to innovate in a timely fashion on
    algorithms, compilers, languages, OS,
    architectures, ...?
  • Can we avoid waiting years between HW/SW
    iterations?

4
Vision: Build Research MPP from FPGAs
  • As ≈ 16 CPUs will fit in a Field Programmable
    Gate Array (FPGA), 1000-CPU system from ≈ 64
    FPGAs?
  • 8 32-bit simple "soft core" RISC at 100MHz in
    2004 (Virtex-II)
  • FPGA generations every 1.5 yrs: ≈ 2X CPUs, ≈ 1.2X
    clock rate
  • HW research community does logic design ("gate
    shareware") to create out-of-the-box MPP
  • E.g., 1000-processor, standard ISA
    binary-compatible, 64-bit, cache-coherent
    supercomputer @ ≈ 150 MHz/CPU in 2007
  • 6 universities, 10 faculty
  • 3rd party sells RAMP 2.0 (BEE3) hardware at low
    cost
  • Research Accelerator for Multiple Processors
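
The sizing on this slide is back-of-the-envelope arithmetic, sketched below. The inputs (≈16 CPUs per FPGA, 100 MHz in 2004, one FPGA generation every 1.5 years, ≈1.2X clock per generation) are taken from the bullets; the function names are hypothetical and the whole-generation rounding is an assumption.

```python
import math


def fpgas_needed(target_cpus, cpus_per_fpga):
    """Smallest number of FPGAs that holds the requested CPU count."""
    return math.ceil(target_cpus / cpus_per_fpga)


def projected_clock_mhz(base_mhz, base_year, year,
                        years_per_gen=1.5, clock_gain=1.2):
    """Clock rate after whole FPGA generations (assumed rounding down)."""
    generations = int((year - base_year) / years_per_gen)
    return base_mhz * clock_gain ** generations


print(fpgas_needed(1000, 16))               # close to the slide's "≈ 64"
print(projected_clock_mhz(100, 2004, 2007))  # close to "≈ 150 MHz"
```

Two generations between 2004 and 2007 give about 144 MHz, in line with the ≈ 150 MHz/CPU target quoted for 2007.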

5
Why RAMP Good for Research MPP?
6
Partnerships
  • Co-PIs: Krste Asanović (UCB), Derek Chiou (UT
    Austin), Joel Emer (MIT/Intel), James Hoe (CMU),
    Christos Kozyrakis (Stanford), Shih-Lien Lu
    (Intel), Mark Oskin (Washington), David
    Patterson (Berkeley), and John Wawrzynek
    (Berkeley)
  • RAMP hardware development activity centered at
    Berkeley Wireless Research Center.
  • Three-year NSF grant for staff (awarded 3/06).
  • GSRC (Jan Rabaey) has paid partial staff and some
    students.
  • Major continuing commitment from Xilinx
  • Collaboration with MSR (Chuck Thacker) on BEE3
    FPGA platform.
  • Sun, IBM contributing processor designs, IBM
    faculty awards.

High-speed high-confidence emulation is widely
recognized as a necessary component of
multiprocessor research and development. FPGA
emulation is the only practical approach.
7
BEE3 Design
Chuck Thacker (MSR), Chen Chang (UC Berkeley)
  • New RAMP systems to be based on Berkeley
    Emulation Engine version 3 (BEE3).
  • BEEcube, Inc. (UC Berkeley spinout startup
    company) to provide manufacturing, distribution,
    and support to commercial and academic users.
  • General availability 2Q08

BEE3, 1st prototype 11/07
For small scale design, or to get started, use
Xilinx ML505
8
RAMP: An infrastructure to build simulators using
FPGAs
9
Run Target Model on Host Platform
Hard Work
10
Reduce, Reuse, Recycle
  • Reduce effort to build target models
      • Users just build components (units),
        infrastructure handles connections (the RDL
        Compiler)
  • Reuse units by having good abstractions
      • Across different target models
      • Across different host platforms
      • XUP, Calinx, BEE2, BEE3, ML505; also Altera
        platforms
  • Recycle existing IP for use as simulation models
      • Commercial processor RTL is (almost) its own
        model

11
RAMP Target Model
  • Units
      • Relatively large chunks of functionality
      • e.g., processor + L1 cache
      • User-written in some HDL or software
  • Channels
      • Point-to-point, unidirectional, two kinds
      • FIFO channel: flow-controlled interface
      • Pipeline channel: simple shift register, bits
        drop off end
      • Generated by RAMP infrastructure
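
The difference between the two channel kinds can be illustrated with a small software model. This is a behavioral sketch only, not RAMP-generated code: the class names, the `depth` and `latency` parameters, and the method names are all hypothetical.

```python
from collections import deque


class FifoChannel:
    """Flow-controlled channel: the sender is refused when it is full."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def send(self, msg):
        if len(self.items) >= self.depth:
            return False          # back-pressure: nothing is lost
        self.items.append(msg)
        return True

    def receive(self):
        return self.items.popleft() if self.items else None


class PipelineChannel:
    """Shift register: every tick shifts once, the last stage drops off."""

    def __init__(self, latency):
        self.stages = deque([None] * latency, maxlen=latency)

    def tick(self, msg=None):
        out = self.stages[-1]         # value leaving the far end this tick
        self.stages.appendleft(msg)   # maxlen discards the old last stage
        return out


fifo = FifoChannel(depth=1)
sent = [fifo.send("x"), fifo.send("y")]   # second send is refused

pipe = PipelineChannel(latency=2)
seen = [pipe.tick("a"), pipe.tick("b"), pipe.tick(), pipe.tick()]
print(sent, seen)
```

The FIFO never loses data but can stall the sender; the pipeline never stalls, but a value that is not consumed on the tick it emerges is gone.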

12
Target Pipeline Channel Parameters
[Figure: pipeline channel, labeled D, with parameters
Datawidth and Forward Latency]
13
RAMP Description Language (RDL)
Greg Gibeling, UCB
[Figure: a target model of Units A, B, and C joined by
channels is compiled by RDLC onto host FPGAs (FPGA1,
FPGA2), with generated unit wrappers and generated
links that carry the channels]
  • User describes target model topology, channel
    parameters, and (manual) mapping to host platform
    FPGAs using RDL
  • RDL Compiler (RDLC) generates configurations
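
What the user supplies (topology, plus a manual mapping of units to FPGAs) can be pictured as plain data. The sketch below is written in Python and is not RDL syntax; the unit names and FPGA names follow the figure labels, but the particular channels and the unit-to-FPGA assignment are assumptions made for illustration.

```python
# Hypothetical target topology: (source unit, destination unit) per channel.
channels = [("A", "B"), ("B", "C"), ("C", "A")]

# Hypothetical manual mapping of target units onto host FPGAs.
mapping = {"A": "FPGA1", "B": "FPGA1", "C": "FPGA2"}


def split_channels(channels, mapping):
    """Separate channels that stay on one FPGA from those that cross."""
    on_chip, cross_chip = [], []
    for src, dst in channels:
        if mapping[src] == mapping[dst]:
            on_chip.append((src, dst))
        else:
            cross_chip.append((src, dst))
    return on_chip, cross_chip


on_chip, cross_chip = split_channels(channels, mapping)
print(on_chip, cross_chip)
```

The cross-FPGA channels are the ones that would need generated inter-chip links in a real configuration.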

14
Virtual Target Clock
15
Virtualized RTL Improves FPGA Resource Usage
  • RAMP allows units to run at varying target-host
    clock ratios to optimize area and overall
    performance
  • Example 1: Multiported register file
      • E.g., Sun Niagara has 3 read ports and 2
        write ports to 6KB of register storage
      • If RTL mapped directly, requires 48K
        flip-flops
      • Slow cycle time, large area
      • If mapped into block RAMs (one read + one
        write per cycle), takes 3 host cycles and
        3 x 2KB block RAMs
      • Faster cycle time (3X) and far fewer
        resources
  • Example 2: Large L2/L3 caches
      • Current FPGAs only have 1MB of on-chip SRAM
      • Use on-chip SRAM to build cache of active
        piece of L2/L3 cache; stall target cycle if
        access misses and fetch data from off-chip
        DRAM
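
The numbers in Example 1 follow from simple arithmetic, sketched here. The port counts, the 6KB of storage, and the 2KB block RAM size come from the slide; the function names and the assumption that all block RAMs are accessed in lockstep are illustrative.

```python
import math


def direct_flip_flops(storage_bytes):
    """One flip-flop per bit when the register file is mapped as RTL."""
    return storage_bytes * 8


def block_ram_mapping(storage_bytes, read_ports, write_ports,
                      bram_bytes, bram_reads=1, bram_writes=1):
    """Host cycles per target cycle and block RAM count for a
    time-multiplexed mapping (one read + one write per host cycle)."""
    host_cycles = max(math.ceil(read_ports / bram_reads),
                      math.ceil(write_ports / bram_writes))
    brams = math.ceil(storage_bytes / bram_bytes)
    return host_cycles, brams


KB = 1024
print(direct_flip_flops(6 * KB))               # 49152, i.e. 48K
print(block_ram_mapping(6 * KB, 3, 2, 2 * KB))  # (3, 3)
```

The three read ports set the host-cycle count; the two writes fit inside the same three cycles.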

16
Start/Done Timing Interface
[Figure: wrapper around a unit, with inputs In1 and
In2, output Out, and Start/Done signals]
  • Wrapper generated by RDL asserts Start on the
    physical FPGA cycle when the inputs to the unit
    are ready for the next target cycle
  • Unit asserts Done when it finishes the target
    cycle and its outputs are ready
  • Unit can take variable amount of time
  • Unvirtualized RTL unit can connect Done to
    Start (but must not clock until Start)
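
The Start/Done handshake can be modeled in software as follows. This is a simplified sketch, assuming the inputs are always available (so the wrapper asserts Start as soon as the previous target cycle is done) and a toy unit that adds its two inputs; `AdderUnit`, `run_wrapper`, and the per-cycle cost parameter are hypothetical names.

```python
class AdderUnit:
    """Toy unit that needs a fixed number of host cycles per target cycle."""

    def __init__(self, host_cycles_per_target_cycle):
        self.cost = host_cycles_per_target_cycle
        self.remaining = 0
        self.a = self.b = 0
        self.out = None

    def host_tick(self, start, in1=None, in2=None):
        """Advance one host cycle; return True when Done is asserted."""
        if start:
            self.a, self.b = in1, in2   # latch inputs for this target cycle
            self.remaining = self.cost
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                self.out = self.a + self.b
                return True
        return False


def run_wrapper(unit, in1_stream, in2_stream):
    """Drive one target cycle per input pair; count host cycles used."""
    outputs, host_cycles = [], 0
    for in1, in2 in zip(in1_stream, in2_stream):
        done = unit.host_tick(True, in1, in2)   # Start: inputs are ready
        host_cycles += 1
        while not done:                         # wait for Done
            done = unit.host_tick(False)
            host_cycles += 1
        outputs.append(unit.out)
    return outputs, host_cycles


print(run_wrapper(AdderUnit(3), [1, 2], [10, 20]))
print(run_wrapper(AdderUnit(1), [1, 2], [10, 20]))
```

With a cost of 1 the unit asserts Done in the same host cycle as Start, which corresponds to the unvirtualized case where Done is wired to Start.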

17
Distributed Timing Models
18
Distributed Timing Example
[Figure: target Units A and B connected by a channel;
labels: D, Latency L]
19
Other Automatically Generated Networks
  • Control network has workstation as master and
    every unit as slave device
      • Memory-mapped interface with block transfers
      • Used for initialization, stats gathering,
        debugging, and monitoring
  • Units can connect to DRAM resources outside of
    timed target channels
      • Used to support emulation and virtualization
        state
  • Units can communicate with each other outside of
    timed target channels
      • Supports arbitrary communication, e.g., for
        distributed stats gathering

20
Wide Variety of RAMP Simulators
21
Simulator Design Choices
  • Structural Analog versus Highly Virtualized
  • Functional-only versus Functional+Timing
  • Timing via (virtual) RTL design versus separate
    functional and timing models
  • Hybrid software/hardware simulators

22
Host Multithreading (Zhangxi Tan (UCB), Chung
(CMU))
  • Multithreading emulation engine reduces FPGA
    resource use and improves emulator throughput
  • Hides emulation latencies (e.g., communicating
    across FPGAs)
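
Host multithreading can be sketched as one host engine that interleaves several target CPU contexts, one per host cycle. The model below is illustrative only: the "programs" are lists of numbers to accumulate rather than a real ISA, and the round-robin schedule is an assumed policy.

```python
def run_multithreaded(programs, host_cycles):
    """Round-robin one host engine over several target CPU contexts.

    Each host cycle serves a different target CPU, so a given target
    CPU is revisited only every len(programs) host cycles. That gap is
    what leaves room to hide long emulation latencies.
    """
    contexts = [{"pc": 0, "acc": 0} for _ in programs]
    for cycle in range(host_cycles):
        tid = cycle % len(programs)          # which target CPU runs now
        ctx, prog = contexts[tid], programs[tid]
        if ctx["pc"] < len(prog):            # context not yet finished
            ctx["acc"] += prog[ctx["pc"]]
            ctx["pc"] += 1
    return contexts


result = run_multithreaded([[1, 2, 3], [10, 20]], host_cycles=6)
print(result)
```

The per-CPU state is only a program counter and a register set, so one shared pipeline replaces several copies of the datapath.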

23
Split Functional/Timing Models (HASIM: Emer
(MIT/Intel); FAST: Chiou (UT Austin))
[Figure: Functional Model coupled to Timing Model]
  • Functional model executes CPU ISA correctly, no
    timing information
  • Only need to develop functional model once for
    each ISA
  • Timing model captures pipeline timing details,
    does not need to execute code
  • Much easier to change timing model for
    architectural experimentation
  • Without RTL design, cannot be 100% certain that
    timing is accurate
  • Many possible splits between timing and
    functional model
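
One possible split is sketched below: the functional model executes instructions and emits a trace, and the timing model assigns cycles to that trace without executing anything. The two-instruction toy ISA and the latency tables are invented for illustration and do not describe HASIM or FAST.

```python
def functional_model(program, regs):
    """Execute a toy ISA correctly; produce a trace with no timing."""
    trace = []
    for op, dst, a, b in program:
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown op: {op}")
        trace.append(op)
    return trace


def timing_model(trace, latencies):
    """Charge cycles per traced instruction; never touches register values."""
    return sum(latencies[op] for op in trace)


program = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")]
regs = {"r0": 2, "r1": 3, "r2": 0, "r3": 0}
trace = functional_model(program, regs)

# Two hypothetical machines share the same functional run.
fast_mul = timing_model(trace, {"add": 1, "mul": 3})
slow_mul = timing_model(trace, {"add": 1, "mul": 5})
print(regs, fast_mul, slow_mul)
```

Changing the architecture under study means changing only the latency table, which is why the timing model is the easier half to modify.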

24
Multithreaded Func. + Timing Models (RAMP Gold:
Tan, Gibeling, Asanovic, UCB)
[Figure: timing model pipeline built from MT-Units
connected by MT-Channels]
  • MT-Unit multiplexes multiple target units on a
    single host engine
  • MT-Channel multiplexes multiple target channels
    over a single host link
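
The MT-Channel idea, several target channels sharing one host link, can be sketched as tagging and interleaving. This is a conceptual model and not the RAMP Gold implementation; the tag format and the round-robin interleaving are assumptions.

```python
def mt_channel_send(per_target_queues):
    """Interleave messages from several target channels onto one host
    link, tagging each message with its target channel id."""
    pending = [list(q) for q in per_target_queues]
    link = []
    while any(pending):
        for channel_id, queue in enumerate(pending):
            if queue:
                link.append((channel_id, queue.pop(0)))
    return link


def mt_channel_receive(link, n_channels):
    """Demultiplex the host link back into per-target channel streams."""
    out = [[] for _ in range(n_channels)]
    for channel_id, msg in link:
        out[channel_id].append(msg)
    return out


link = mt_channel_send([["a0", "a1"], ["b0"]])
restored = mt_channel_receive(link, 2)
print(link, restored)
```

Per-channel ordering is preserved across the shared link, which is the property each target channel relies on.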

25
Schedule
  • 9:00-9:45 Welcome/Overview
  • 9:45-10:15 RAMP Blue Overview and Demo
  • 10:15-10:45 Break
  • 10:45-12:30 RAMP White Live Demo
  • BEE3 Rollout (MSR/BEEcube/Q&A)
  • 12:30-13:30 Lunch
  • 13:30-15:00 ATLAS Transactional Memory (RAMP Red)
  • 15:00-15:15 Break
  • 15:15-16:45 CMU Simics/RAMP Cache Study
  • 16:45 Wrapup

26
  • RAMP Blue Release 2/25/2008
  • design available from RAMP website
  • ramp.eecs.berkeley.edu

27
RAMP White
Hari Angepat, Derek Chiou (UT Austin)
  • Scalable Coherent Shared Memory Multiprocessor
  • Support standard shared memory programming models

[Figure: RAMP-White block diagram. Labeled blocks:
Leon3 shim, Intersection Unit, NIU, Router, AHB shim,
AHB bus, MP IntCntrl, DSU, Eth, DDR2]
28
(No Transcript)
29
CMU Simics/RAMP Simulator
16-CPU Shared-memory UltraSPARC III Server
(SunFire 3800)
BEE2 Platform
30
RAMP Home Page/Repository
  • ramp.eecs.berkeley.edu
  • Remotely accessible subversion repository

31
Thank You!
  • Questions?