Title: Research Accelerator for MultiProcessing
1 Research Accelerator for MultiProcessing
- Dave Patterson, EECS, UC Berkeley
- President, Association for Computing Machinery
- November 2005
RAMP: 10 collaborators at Berkeley, Carnegie Mellon University, MIT, Stanford, University of Texas, and University of Washington
2 Conventional Wisdom (CW) in Computer Architecture
- Old CW: Multiplies are slow, memory access is fast
- New CW: the Memory Wall. Memory is slow, multiplies are fast (200 clocks to DRAM memory, 4 clocks for a multiply; see the arithmetic note below)
- Old CW: Power is free, transistors are expensive
- New CW: the Power Wall. Power is expensive, transistors are free (we can put more on a chip than we can afford to turn on)
- Old CW: Uniprocessor performance grows 2X / 1.5 yrs
- New CW: Power Wall + Memory Wall = Brick Wall
  - Uniprocessor performance now only 2X / 5 yrs
- Sea change in chip design: multiple "cores" (2X processors per chip / 2 years)
  - More, simpler processors are more power efficient
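A back-of-the-envelope note on what these numbers imply (my arithmetic, not on the original slide):

\[ \frac{200 \ \text{clocks per DRAM access}}{4 \ \text{clocks per multiply}} = 50 \ \text{multiplies in the time of one memory access} \]

\[ 2^{1/1.5} \approx 1.59 \ (\text{2X / 1.5 yrs} \approx 59\%/\text{yr}) \qquad 2^{1/5} \approx 1.15 \ (\text{2X / 5 yrs} \approx 15\%/\text{yr}) \]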
3 Sea Change in Chip Design
- Intel 4004 (1971): 4-bit processor, 2,312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
- A 125 mm2 chip in 0.065 micron CMOS = 2,312 RISC II + FPU + Icache + Dcache
  - RISC II shrinks to 0.02 mm2 at 65 nm
- Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
- Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
- Processor is the new transistor?
4 Problems with Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready for 1000 CPUs / chip
- Software people don't start working hard until the hardware arrives
- How to do research in a timely fashion on 1000-CPU systems (algorithms, compilers, OS, architectures) without waiting years between hardware generations?
5 FPGAs as New Research Platform
- As ~25 CPUs can fit in a Field Programmable Gate Array (FPGA), a 1000-CPU system from ~40 FPGAs? (arithmetic sketched below)
  - 64-bit simple "soft core" RISC at 100 MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs: 2X CPUs, 2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box Massively Parallel Processor that runs standard binaries of OS and applications
  - Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, ...
  - E.g., a 1000-processor IBM Power cache-coherent supercomputer
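The scaling arithmetic behind the bullets above, spelled out (the ~25 CPUs/FPGA and ~40 FPGA figures come from the slide; treating the two 2X factors per FPGA generation as multiplicative is my assumption):

\[ 25 \ \frac{\text{CPUs}}{\text{FPGA}} \times 40 \ \text{FPGAs} = 1000 \ \text{CPUs} \]

\[ 2\times \ \text{CPUs per FPGA} \times 2\times \ \text{clock rate} \approx 4\times \ \text{aggregate emulation capacity per FPGA generation (every 1.5 yrs)} \]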
6 Why RAMP Good for Research?
7 RAMP 1 Hardware
- Completed Dec. 2004 (14 x 17 inch, 22-layer PCB)
- Module
  - FPGAs, memory, 10 GigE conn.
  - Compact Flash
  - Administration/maintenance ports:
    - 10/100 Enet
    - HDMI/DVI
    - USB
- ~$4K/module w/o FPGAs or DRAM
8 Multiple-Module RAMP 1 Systems
- 8 compute modules (plus power supplies) in an 8U rack-mount chassis
  - 500-1000 emulated processors
- Many topologies possible
- 2U single-module tray for developers
- Disk storage: disk emulator + Network Attached Storage
9 RAMP Development Plan
- Distribute systems internally for RAMP 1 development
  - Xilinx agreed to pay for production of a set of modules for initial contributing developers and the first full RAMP system
  - Others could be available if we can recover costs
- Release a publicly available out-of-the-box MPP emulator
  - Based on a standard ISA (IBM Power, Sun SPARC, ...) for binary compatibility
  - Complete OS/libraries
  - Locally modify RAMP as desired
- Design next-generation platform for RAMP 2
  - Based on 65 nm FPGAs (2 generations later than Virtex-II)
  - Pending results from RAMP 1, Xilinx will cover hardware costs for the initial set of RAMP 2 machines
  - Find a 3rd party to build and distribute systems (at near-cost)
- NSF/CRI proposal pending to help support the effort
  - 2 full-time staff (one HW/gateware, one OS/software)
  - Look for grad student support from industrial donations
10 Gateware Design Framework
- Insight: almost every large building block fits inside an FPGA today
  - What doesn't fit is what sits between chips in a real design
- Supports both cycle-accurate emulation of detailed, parameterized machine models and rapid functional-only emulations
  - Carefully accounts for Target Clock Cycles (see the sketch below)
- Units written in any hardware design language (will work with Verilog, VHDL, Bluespec, C, ...)
- RAMP Design Language (RDL) describes the "plumbing" that connects units
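The RAMP framework and RDL themselves are not reproduced here. As a rough illustration of what "carefully accounting for Target Clock Cycles" can mean, the hypothetical Verilog sketch below decouples an emulated machine's target clock from the FPGA's host clock; the module and signal names are invented for illustration and are not RAMP's actual gateware.

    // Hypothetical sketch, not actual RAMP gateware: a wrapper that lets a
    // unit simulate exactly one target clock cycle per "fire" pulse, so the
    // target-cycle count stays exact even though the number of host cycles
    // spent per target cycle varies with channel traffic.
    module target_cycle_wrapper (
      input  wire        host_clk,      // free-running FPGA (host) clock
      input  wire        host_rst,
      input  wire        in_valid,      // a message is waiting on the input channel
      input  wire        out_ready,     // the output channel has space
      output wire        fire,          // unit may simulate one target cycle now
      output reg  [63:0] target_cycle   // exact count of emulated target cycles
    );
      assign fire = in_valid && out_ready;   // otherwise the unit stalls

      always @(posedge host_clk) begin
        if (host_rst)
          target_cycle <= 64'd0;
        else if (fire)
          target_cycle <= target_cycle + 64'd1;
      end
    endmodule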
11 Gateware Design Framework
- Design is composed of units that send messages over channels via ports
- Units (10,000+ gates)
  - E.g., CPU + L1 cache, DRAM controller, ...
- Channels (~ FIFO; see the sketch below)
  - Lossless, point-to-point, unidirectional, in-order message delivery
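As a concrete, hypothetical reading of the channel properties listed above, the Verilog sketch below models a channel as a small synchronous FIFO between one sender port and one receiver port: unidirectional, point-to-point, in-order, and lossless because the sender may only enqueue when the FIFO has space. The interface and names are illustrative assumptions, not RAMP's RDL-generated plumbing.

    // Hypothetical channel sketch: a 4-entry synchronous FIFO.
    module channel #(parameter WIDTH = 32) (
      input  wire             clk,
      input  wire             rst,
      // sender-side port
      input  wire             enq_valid,
      output wire             enq_ready,
      input  wire [WIDTH-1:0] enq_data,
      // receiver-side port
      output wire             deq_valid,
      input  wire             deq_ready,
      output wire [WIDTH-1:0] deq_data
    );
      reg [WIDTH-1:0] mem [0:3];
      reg [2:0] head, tail;              // extra bit distinguishes full from empty

      wire empty = (head == tail);
      wire full  = (head[1:0] == tail[1:0]) && (head[2] != tail[2]);

      assign enq_ready = !full;          // lossless: sender must wait when full
      assign deq_valid = !empty;         // in-order: receiver sees oldest message first
      assign deq_data  = mem[head[1:0]];

      always @(posedge clk) begin
        if (rst) begin
          head <= 3'd0;
          tail <= 3'd0;
        end else begin
          if (enq_valid && enq_ready) begin
            mem[tail[1:0]] <= enq_data;
            tail <= tail + 3'd1;
          end
          if (deq_valid && deq_ready)
            head <= head + 3'd1;
        end
      end
    endmodule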
12 Status
- Submitted NSF proposal in August
- 10 more RAMP 1 boards being fabricated
- Asked IBM and Sun for a simple, industrial-strength, commercial-ISA, 64-bit HDL model of a CPU + FPU
- Working on the design framework document
- Biweekly teleconferences (8 since June)
- RAMP 1 short course / board distribution for RAMP "conspirators", Jan 06 in Berkeley
- FPGA workshop at HPCA, Feb 06 in Austin
13 RAMP in RADS: "Internet in a Box"
- RAMP building blocks also apply to Distributed Computing
- RAMP vs. Clusters (Emulab, PlanetLab)
  - Scale: RAMP O(1000) vs. Clusters O(100)
  - Private use at ~$100K: every group has one
  - Develop/Debug: reproducibility, observability
  - Flexibility: modify modules (router, SMP, OS)
  - Explore via repeatable experiments while varying parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic
14 Multiprocessing Watering Hole
- Research directions sharing the platform: parallel file system, dataflow language/computer, data center in a box, thread scheduling, Internet in a box, security enhancements, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages
- RAMP as the next Standard Research Platform? (e.g., VAX/BSD Unix in the 1980s)
- RAMP attracts many communities to a shared artifact ⇒ cross-disciplinary interactions ⇒ accelerated innovation in multiprocessing
15 Supporters (wrote letters to NSF)
- Gordon Bell (Microsoft)
- Ivo Bolsens (Xilinx CTO)
- Norm Jouppi (HP Labs)
- Bill Kramer (NERSC/LBL)
- Craig Mundie (MS CTO)
- G. Papadopoulos (Sun CTO)
- Justin Rattner (Intel CTO)
- Ivan Sutherland (Sun Fellow)
- Chuck Thacker (Microsoft)
- Kees Vissers (Xilinx)
- Doug Burger (Texas)
- Bill Dally (Stanford)
- Carl Ebeling (Washington)
- Susan Eggers (Washington)
- Steve Keckler (Texas)
- Greg Morrisett (Harvard)
- Scott Shenker (Berkeley)
- Ion Stoica (Berkeley)
- Kathy Yelick (Berkeley)
RAMP Participants: Arvind (MIT), Krste Asanović (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley)
16 Conclusion
- RAMP as a system-level time machine: preview computers of the future to accelerate HW/SW generations
  - Trace anything, reproduce everything, tape out every day
  - Emulate a multiprocessor, data center, or distributed computer
  - FTP a supercomputer overnight and boot it in the morning
  - Clone it to check results (as fast in Berkeley as in Boston?)
- Carpe Diem
  - Systems researchers (HW and SW) need the capability
  - FPGA technology is ready today, and getting better every year
  - Stand on shoulders vs. toes: standardize on a design framework and the multi-year Berkeley effort on FPGA platforms (Berkeley Emulation Engine, BEE2)
  - Architecture researchers get the opportunity to immediately aid colleagues via gateware (as SW researchers have done in the past)
- Multiprocessor Research Watering Hole: accelerate research in multiprocessing via a standard research platform ⇒ hasten the sea change from sequential to parallel computing
17 Backup Slides
18 Why RAMP Attractive?
Priorities for research parallel computers. Insight: commercial priorities are radically different from research priorities.
- 1a. Cost of purchase
- 1b. Cost of ownership (staff to administer it)
- 1c. Scalability (1000 CPUs much better than 100 CPUs)
- 4. Power/Space (machine-room cooling, number of racks)
- 5. Community synergy (share code, ...)
- 6. Observability (non-obtrusively measure, trace everything)
- 7. Reproducibility (to debug, run experiments)
- 8. Flexibility (change for different experiments)
- 9. Credibility (faithfully predicts real hardware behavior)
- 10. Performance (as long as experiments are not too slow)
19 Uniprocessor Performance (SPECint)
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ~20%/year, 2002 to present (doubling times worked out below)
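Converting these annual rates into doubling times (my arithmetic, not on the slide), using doubling time = ln 2 / ln(1 + r):

\[ r = 0.52: \ \frac{\ln 2}{\ln 1.52} \approx 1.7 \ \text{years} \quad (\text{the "2X / 1.5 yrs" era}) \]

\[ r = 0.25: \ \frac{\ln 2}{\ln 1.25} \approx 3.1 \ \text{years} \qquad r = 0.20: \ \frac{\ln 2}{\ln 1.20} \approx 3.8 \ \text{years} \quad (\text{roughly the "2X / 5 yrs" slowdown cited on slide 2}) \]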
20 Related Approaches (1)
- Quickturn, Axis, IKOS, Thara
  - FPGA- or special-processor-based gate-level hardware emulators
  - Synthesizable HDL is mapped to the array for cycle- and bit-accurate netlist emulation
  - RAMP's emphasis is on emulating high-level architecture behaviors
    - Hardware and supporting software provide architecture-level abstractions for modeling and analysis
    - Targets architecture and software research
    - Provides a spectrum of tradeoffs between speed and accuracy/precision of emulation
- RPM at USC in the early 1990s
  - Up to only 8 processors
  - Only the memory controller implemented with configurable logic
21 Related Approaches (2)
- Software simulators
- Clusters (standard microprocessors)
- PlanetLab (distributed environment)
- Wisconsin Wind Tunnel (used a CM-5 to simulate shared memory)
- All suffer from some combination of:
  - Slowness, inaccuracy, limited scalability, unbalanced computation/communication, and target inflexibility