Title: Research Accelerator for MultiProcessing
1 Research Accelerator for MultiProcessing
- Dave Patterson, EECS, UC Berkeley
- President, Association for Computing Machinery
- November 2005
RAMP: 10 collaborators at Berkeley, Carnegie Mellon University, MIT, Stanford, University of Texas, and University of Washington
2 Conventional Wisdom (CW) in Computer Architecture
- Old CW: Multiplies are slow, memory access is fast
- New CW: the Memory Wall. Memory is slow, multiplies are fast (200 clocks to DRAM memory, 4 clocks for a multiply; see the arithmetic note below)
- Old CW: Power is free, transistors are expensive
- New CW: the Power Wall. Power is expensive, transistors are free (we can put more on a chip than we can afford to turn on)
- Old CW: Uniprocessor performance grows 2X / 1.5 yrs
- New CW: Power Wall + Memory Wall = Brick Wall
  - Uniprocessor performance now only 2X / 5 yrs
- Sea change in chip design: multiple "cores" (2X processors per chip / 2 years)
  - More, simpler processors are more power efficient
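A back-of-the-envelope note on what these numbers imply (my arithmetic, not on the original slide):

\[ \frac{200 \ \text{clocks per DRAM access}}{4 \ \text{clocks per multiply}} = 50 \ \text{multiplies in the time of one memory access} \]

\[ 2^{1/1.5} \approx 1.59 \ (\text{2X / 1.5 yrs} \approx 59\%/\text{yr}) \qquad 2^{1/5} \approx 1.15 \ (\text{2X / 5 yrs} \approx 15\%/\text{yr}) \]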
3 Sea Change in Chip Design
- Intel 4004 (1971): 4-bit processor, 2,312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
- A 125 mm2 chip in 0.065 micron CMOS = 2,312 RISC II + FPU + Icache + Dcache
  - RISC II shrinks to 0.02 mm2 at 65 nm
- Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
- Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
- Processor is the new transistor?
4 Problems with Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready for 1000 CPUs / chip
- Software people don't start working hard until the hardware arrives
- How to do research in a timely fashion on 1000-CPU systems (algorithms, compilers, OS, architectures) without waiting years between hardware generations?
5 FPGAs as New Research Platform
- As ~25 CPUs can fit in a Field Programmable Gate Array (FPGA), a 1000-CPU system from ~40 FPGAs? (arithmetic sketched below)
  - 64-bit simple "soft core" RISC at 100 MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs: 2X CPUs, 2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box Massively Parallel Processor that runs standard binaries of OS and applications
  - Gateware: processors, caches, coherency, Ethernet interfaces, switches, routers, ...
  - E.g., a 1000-processor IBM Power cache-coherent supercomputer
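The scaling arithmetic behind the bullets above, spelled out (the ~25 CPUs/FPGA and ~40 FPGA figures come from the slide; treating the two 2X factors per FPGA generation as multiplicative is my assumption):

\[ 25 \ \frac{\text{CPUs}}{\text{FPGA}} \times 40 \ \text{FPGAs} = 1000 \ \text{CPUs} \]

\[ 2\times \ \text{CPUs per FPGA} \times 2\times \ \text{clock rate} \approx 4\times \ \text{aggregate emulation capacity per FPGA generation (every 1.5 yrs)} \]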
6 Why RAMP Good for Research?
7 RAMP 1 Hardware
- Completed Dec. 2004 (14 x 17 inch, 22-layer PCB)
- Module
  - FPGAs, memory, 10 GigE conn.
  - Compact Flash
  - Administration/maintenance ports:
    - 10/100 Enet
    - HDMI/DVI
    - USB
- ~$4K/module w/o FPGAs or DRAM
8 Multiple-Module RAMP 1 Systems
- 8 compute modules (plus power supplies) in an 8U rack-mount chassis
  - 500-1000 emulated processors
- Many topologies possible
- 2U single-module tray for developers
- Disk storage: disk emulator + Network Attached Storage
9 RAMP Development Plan
- Distribute systems internally for RAMP 1 development
  - Xilinx agreed to pay for production of a set of modules for initial contributing developers and the first full RAMP system
  - Others could be available if we can recover costs
- Release a publicly available out-of-the-box MPP emulator
  - Based on a standard ISA (IBM Power, Sun SPARC, ...) for binary compatibility
  - Complete OS/libraries
  - Locally modify RAMP as desired
- Design next-generation platform for RAMP 2
  - Based on 65 nm FPGAs (2 generations later than Virtex-II)
  - Pending results from RAMP 1, Xilinx will cover hardware costs for the initial set of RAMP 2 machines
  - Find a 3rd party to build and distribute systems (at near-cost)
- NSF/CRI proposal pending to help support the effort
  - 2 full-time staff (one HW/gateware, one OS/software)
  - Look for grad student support from industrial donations
10 Gateware Design Framework
- Insight: almost every large building block fits inside an FPGA today
  - What doesn't fit is what sits between chips in a real design
- Supports both cycle-accurate emulation of detailed, parameterized machine models and rapid functional-only emulations
  - Carefully accounts for Target Clock Cycles (see the sketch below)
- Units written in any hardware design language (will work with Verilog, VHDL, Bluespec, C, ...)
- RAMP Design Language (RDL) describes the "plumbing" that connects units
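The RAMP framework and RDL themselves are not reproduced here. As a rough illustration of what "carefully accounting for Target Clock Cycles" can mean, the hypothetical Verilog sketch below decouples an emulated machine's target clock from the FPGA's host clock; the module and signal names are invented for illustration and are not RAMP's actual gateware.

    // Hypothetical sketch, not actual RAMP gateware: a wrapper that lets a
    // unit simulate exactly one target clock cycle per "fire" pulse, so the
    // target-cycle count stays exact even though the number of host cycles
    // spent per target cycle varies with channel traffic.
    module target_cycle_wrapper (
      input  wire        host_clk,      // free-running FPGA (host) clock
      input  wire        host_rst,
      input  wire        in_valid,      // a message is waiting on the input channel
      input  wire        out_ready,     // the output channel has space
      output wire        fire,          // unit may simulate one target cycle now
      output reg  [63:0] target_cycle   // exact count of emulated target cycles
    );
      assign fire = in_valid && out_ready;   // otherwise the unit stalls

      always @(posedge host_clk) begin
        if (host_rst)
          target_cycle <= 64'd0;
        else if (fire)
          target_cycle <= target_cycle + 64'd1;
      end
    endmodule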
11 Gateware Design Framework
- Design is composed of units that send messages over channels via ports
- Units (10,000+ gates)
  - E.g., CPU + L1 cache, DRAM controller, ...
- Channels (~ FIFO; see the sketch below)
  - Lossless, point-to-point, unidirectional, in-order message delivery
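As a concrete, hypothetical reading of the channel properties listed above, the Verilog sketch below models a channel as a small synchronous FIFO between one sender port and one receiver port: unidirectional, point-to-point, in-order, and lossless because the sender may only enqueue when the FIFO has space. The interface and names are illustrative assumptions, not RAMP's RDL-generated plumbing.

    // Hypothetical channel sketch: a 4-entry synchronous FIFO.
    module channel #(parameter WIDTH = 32) (
      input  wire             clk,
      input  wire             rst,
      // sender-side port
      input  wire             enq_valid,
      output wire             enq_ready,
      input  wire [WIDTH-1:0] enq_data,
      // receiver-side port
      output wire             deq_valid,
      input  wire             deq_ready,
      output wire [WIDTH-1:0] deq_data
    );
      reg [WIDTH-1:0] mem [0:3];
      reg [2:0] head, tail;              // extra bit distinguishes full from empty

      wire empty = (head == tail);
      wire full  = (head[1:0] == tail[1:0]) && (head[2] != tail[2]);

      assign enq_ready = !full;          // lossless: sender must wait when full
      assign deq_valid = !empty;         // in-order: receiver sees oldest message first
      assign deq_data  = mem[head[1:0]];

      always @(posedge clk) begin
        if (rst) begin
          head <= 3'd0;
          tail <= 3'd0;
        end else begin
          if (enq_valid && enq_ready) begin
            mem[tail[1:0]] <= enq_data;
            tail <= tail + 3'd1;
          end
          if (deq_valid && deq_ready)
            head <= head + 3'd1;
        end
      end
    endmodule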
12 Status
- Submitted NSF proposal in August
- 10 more RAMP 1 boards being fabricated
- Asked IBM and Sun for a simple, industrial-strength, commercial-ISA, 64-bit HDL model of a CPU + FPU
- Working on the design framework document
- Biweekly teleconferences (8 since June)
- RAMP 1 short course / board distribution for RAMP "conspirators", Jan 06 in Berkeley
- FPGA workshop at HPCA, Feb 06 in Austin
13 RAMP in RADS: "Internet in a Box"
- RAMP building blocks also apply to Distributed Computing
- RAMP vs. Clusters (Emulab, PlanetLab)
  - Scale: RAMP O(1000) vs. Clusters O(100)
  - Private use at ~$100K: every group has one
  - Develop/Debug: reproducibility, observability
  - Flexibility: modify modules (router, SMP, OS)
  - Explore via repeatable experiments while varying parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic
14 Multiprocessing Watering Hole
- Research directions sharing the platform: parallel file system, dataflow language/computer, data center in a box, thread scheduling, Internet in a box, security enhancements, multiprocessor switch design, router design, compile to FPGA, fault insertion to check dependability, parallel languages
- RAMP as the next Standard Research Platform? (e.g., VAX/BSD Unix in the 1980s)
- RAMP attracts many communities to a shared artifact ⇒ cross-disciplinary interactions ⇒ accelerated innovation in multiprocessing
15 Supporters (wrote letters to NSF)
- Gordon Bell (Microsoft)
- Ivo Bolsens (Xilinx CTO)
- Norm Jouppi (HP Labs)
- Bill Kramer (NERSC/LBL)
- Craig Mundie (MS CTO)
- G. Papadopoulos (Sun CTO)
- Justin Rattner (Intel CTO)
- Ivan Sutherland (Sun Fellow)
- Chuck Thacker (Microsoft)
- Kees Vissers (Xilinx)
- Doug Burger (Texas)
- Bill Dally (Stanford)
- Carl Ebeling (Washington)
- Susan Eggers (Washington)
- Steve Keckler (Texas)
- Greg Morrisett (Harvard)
- Scott Shenker (Berkeley)
- Ion Stoica (Berkeley)
- Kathy Yelick (Berkeley)
RAMP Participants: Arvind (MIT), Krste Asanović (MIT), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), Jan Rabaey (Berkeley), and John Wawrzynek (Berkeley)
16 Conclusion
- RAMP as a system-level time machine: preview computers of the future to accelerate HW/SW generations
  - Trace anything, reproduce everything, tape out every day
  - Emulate a multiprocessor, data center, or distributed computer
  - FTP a supercomputer overnight and boot it in the morning
  - Clone it to check results (as fast in Berkeley as in Boston?)
- Carpe Diem
  - Systems researchers (HW and SW) need the capability
  - FPGA technology is ready today, and getting better every year
  - Stand on shoulders vs. toes: standardize on a design framework and the multi-year Berkeley effort on FPGA platforms (Berkeley Emulation Engine, BEE2)
  - Architecture researchers get the opportunity to immediately aid colleagues via gateware (as SW researchers have done in the past)
- Multiprocessor Research Watering Hole: accelerate research in multiprocessing via a standard research platform ⇒ hasten the sea change from sequential to parallel computing
17 Backup Slides
18 Why RAMP Attractive?
Priorities for research parallel computers. Insight: commercial priorities are radically different from research priorities.
- 1a. Cost of purchase
- 1b. Cost of ownership (staff to administer it)
- 1c. Scalability (1000 CPUs much better than 100 CPUs)
- 4. Power/Space (machine-room cooling, number of racks)
- 5. Community synergy (share code, ...)
- 6. Observability (non-obtrusively measure, trace everything)
- 7. Reproducibility (to debug, run experiments)
- 8. Flexibility (change for different experiments)
- 9. Credibility (faithfully predicts real hardware behavior)
- 10. Performance (as long as experiments are not too slow)
19 Uniprocessor Performance (SPECint)
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ~20%/year, 2002 to present (doubling times worked out below)
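Converting these annual rates into doubling times (my arithmetic, not on the slide), using doubling time = ln 2 / ln(1 + r):

\[ r = 0.52: \ \frac{\ln 2}{\ln 1.52} \approx 1.7 \ \text{years} \quad (\text{the "2X / 1.5 yrs" era}) \]

\[ r = 0.25: \ \frac{\ln 2}{\ln 1.25} \approx 3.1 \ \text{years} \qquad r = 0.20: \ \frac{\ln 2}{\ln 1.20} \approx 3.8 \ \text{years} \quad (\text{roughly the "2X / 5 yrs" slowdown cited on slide 2}) \]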
20 Related Approaches (1)
- Quickturn, Axis, IKOS, Thara
  - FPGA- or special-processor-based gate-level hardware emulators
  - Synthesizable HDL is mapped to the array for cycle- and bit-accurate netlist emulation
  - RAMP's emphasis is on emulating high-level architecture behaviors
    - Hardware and supporting software provide architecture-level abstractions for modeling and analysis
    - Targets architecture and software research
    - Provides a spectrum of tradeoffs between speed and accuracy/precision of emulation
- RPM at USC in the early 1990s
  - Up to only 8 processors
  - Only the memory controller implemented with configurable logic
21 Related Approaches (2)
- Software simulators
- Clusters (standard microprocessors)
- PlanetLab (distributed environment)
- Wisconsin Wind Tunnel (used a CM-5 to simulate shared memory)
- All suffer from some combination of:
  - Slowness, inaccuracy, limited scalability, unbalanced computation/communication, and target inflexibility