Title: Three related projects at Berkeley
1 Welcome
- Three related projects at Berkeley
- Intelligent RAM (IRAM)
- Intelligent Storage (ISTORE)
- OceanStore
- Groundrules
- Questions are welcome during talks
- Feedback required Friday morning
- Time for rafting and talking
- Introductions
2 Overview of the IRAM Project
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson
http://iram.cs.berkeley.edu/
Summer 2000 Retreat
3 Outline
- IRAM Motivation
- VIRAM architecture and VIRAM-1 microarchitecture
- Benchmarks
- Compiler
- Back to the future: 1997 and today
4 Original IRAM Motivation
- Processor-DRAM Gap (latency)
[Figure: performance vs. time, 1980-2000, log scale. CPU performance grows at 60%/yr while DRAM grows at 7%/yr, so the processor-memory performance gap grows 50%/year.]
5 Intelligent RAM: IRAM
- Microprocessor + DRAM on a single chip:
- 10X capacity vs. SRAM
- on-chip memory latency 5-10X lower
- memory bandwidth 50-100X higher
- 2X-4X better energy efficiency (no off-chip bus)
- smaller board area/volume
- IRAM advantages extend to:
- a single-chip system
- a building block for larger systems
6 1997 Vanilla IRAM Study
- Estimated performance of an IRAM version of the Alpha (same caches, benchmarks, standard DRAM)
- Assumed logic slower, DRAM faster
- Results: SPEC92 slower, sparse matrices faster, databases even
- Two conclusions:
- Conventional benchmarks like SPEC match conventional architectures
- Conventional architectures do not utilize memory bandwidth
- Research plan:
- Focus on power/area advantages, including portable, hand-held devices
- Focus on multimedia benchmarks to match these devices
- Develop an architecture that can exploit the enormous memory bandwidth
7 Vector IRAM Architecture
[Figure: vector register file. Each of the data registers vr0..vr31 holds mvl elements of width vpw, viewed as virtual processors VP0..VPvl-1.]
- Maximum vector length is given by a read-only register mvl
- E.g., in the VIRAM-1 implementation, each register holds 32 64-bit values
- Vector length is given by the register vl
- This is the # of active elements or virtual processors
- To handle variable-width data (8/16/32/64-bit):
- Width of each VP is given by the register vpw
- vpw is one of 8b, 16b, 32b, 64b (no 8b in VIRAM-1)
- mvl depends on implementation and vpw: 32 64-bit, 64 32-bit, or 128 16-bit elements
8 IRAM Architecture Update
- ISA mostly frozen since 6/99
- Changes in 2H99 for a better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
- Minor changes in 1H00 to address the new co-processor interface in the MIPS core
- ISA manual publicly available
- http://www.cs.berkeley.edu
- Suite of simulators actively used
- vsim-isa (functional)
- Major rewrite nearly complete for new scalar processor
- All UCB code
- vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)
9 VIRAM-1 Implementation
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- Vector unit @ 200 MHz
- 4 64-bit lanes
- 8 32-bit virtual lanes
- 16 16-bit virtual lanes
- 0.18 µm embedded DRAM/logic process
- 17x17 mm
- 2 Watt power target
[Figure: floorplan with two memory halves (64 Mbits / 8 MBytes each) connected through a crossbar (Xbar).]
- Design easily scales in number of lanes, e.g.:
- 2 64-bit lanes: lower-power version
- 8 64-bit lanes: higher-performance version
- Number of memory banks is independent
10 VIRAM-1 Microarchitecture
- Memory system
- 8 DRAM banks
- 256-bit synchronous interface
- 1 sub-bank per bank
- 16 Mbytes total capacity
- Peak performance
- 3.2 GOPS64, 12.8 GOPS16 (w. madd)
- 1.6 GOPS64, 6.4 GOPS16 (wo. madd)
- 0.8 GFLOPS64, 1.6 GFLOPS32
- 6.4 GByte/s memory bandwidth consumed by VU
- 2 arithmetic units
- both execute integer operations
- one executes FP operations
- 4 64-bit datapaths (lanes) per unit
- 2 flag processing units
- for conditional execution and speculation support
- 1 load-store unit
- optimized for strides 1, 2, 3, and 4
- 4 addresses/cycle for indexed and strided operations
- decoupled indexed and strided stores
11 VIRAM-1 block diagram
12 IRAM Chip Update
- IBM supplying embedded DRAM/logic (100%)
- Agreement in place as of June 1, 2000
- MIPS supplying scalar core (100%)
- MIPS processor, caches, TLB
- MIT supplying FPU (100%)
- VIRAM-1 tape-out scheduled for January 2001
- Simplifications
- Floating point
- Network interface
13 Hand-Coded Benchmark Review
- Image processing kernels (old FPU model)
- Note BLAS-2 performance
14 Base-line system comparison
- All numbers in cycles/pixel
- MMX and VIS results assume all data in L1 cache
15 FFT Uses In-Register Permutations
Without in-register permutations
16 Problem: General Element Permutation
- Hardware for a full vector permutation instruction (128 16b elements, 256b datapath):
- Datapath: 16 x 16 (x 16b) crossbar; scales as O(N^2)
- Control: 16 16-to-1 multiplexors; scales as O(N log N)
- Time/energy wasted on wide vector register file port
17 Simple Vector Permutations
- Simple steps of butterfly permutations
- A register provides the butterfly radix
- Separate instructions for moving elements to left/right
- Sufficient semantics for:
- Fast reductions of vector registers (dot products)
- Fast FFT kernels
18 Hardware for Simple Permutations
- Hardware for 128 16b elements, 256b datapath:
- Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales as O(N)
- Control: 6 control cases; scales as O(N)
- Other benefits:
- Consecutive result elements written together
- Buses used only for small radices
19 IRAM Compiler Status
- Retarget of the Cray compiler
- Steps in compiler development:
- Build MIPS backend (done)
- Build VIRAM backend for vectorized loops (done)
- Instruction scheduling for VIRAM-1 (done)
- Insertion of memory barriers (using Cray strategy, improving)
- Additional optimizations (ongoing)
- Feedback results to Cray, new version from Cray (ongoing)
20 Compiled Applications Update
- Applications using the compiler:
- Speech processing under development
- Developed a new small-memory algorithm for speech processing
- Uses some existing kernels (FFT and MM)
- Vector search algorithm is most challenging
- DIS image-understanding application under development
- Compiles, but does not yet vectorize well
- Singular Value Decomposition
- Better than 2 VLIW machines (TI C67 and TM 1100)
- Challenging BLAS-1,2 operations work well on IRAM because of memory BW
- Kernels
- Simple floating-point kernels are very competitive with hand-coded versions
21 (10n x n SVD, rank 10)
(From Herman, Loo, Tang, CS252 project)
22 IRAM Latency Advantage
- 1997 estimate: 5-10x improvement
- No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins
- 30 ns for IRAM (or much lower with DRAM redesign)
- Compare to Alpha 600: 180 ns for 128b, 270 ns for 512b
- 2000 estimate: 5x improvement
- IRAM memory latency is 25 ns for 256 bits, fixed pipeline delay
- Alpha 4000/4100: 120 ns
23 IRAM Bandwidth Advantage
- 1997 estimate: 100x
- 1024 1-Mbit modules, each 1 Kb wide (1 Gb chip)
- 10% @ 40 ns RAS/CAS ⇒ 320 GBytes/sec
- If crossbar switch or multiple busses deliver 1/3 to 2/3 of total ⇒ 100-200 GBytes/sec
- Compare to AlphaServer 8400: 1.2 GBytes/sec; 4100: 1.1 GBytes/sec
- 2000 estimate: 10-100x
- VIRAM-1: 16 MB chip divided into 8 banks ⇒ 51.2 GB/s peak from memory banks
- Crossbar can consume 12.8 GB/s
- 6.4 GB/s from Vector Unit, 6.4 GB/s from either scalar or I/O
24 Power and Energy Advantages
- 1997: case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
- cell-size advantages ⇒ much larger cache ⇒ fewer off-chip references ⇒ up to 2X-4X energy efficiency for memory-intensive algorithms
- less energy per bit access for DRAM
- Power target for VIRAM-1:
- 2 watt goal
- Based on preliminary SPICE runs, this looks very feasible today
- Scalar core included
25 Summary
- IRAM takes advantage of high on-chip bandwidth
- Vector IRAM ISA utilizes this bandwidth
- Unit, strided, and indexed memory access patterns supported
- Exploits fine-grained parallelism, even with pointer chasing
- Compiler
- Well-understood compiler model, semi-automatic
- Still some work on code-generation quality
- Application benchmarks
- Compiled and hand-coded
- Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
26 IRAM Applications: Intelligent PDA
- Pilot PDA + gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ...
- Wireless data (WWW)
- Speech, vision recognition
- Voice output for conversations
- Speech control; vision to see, scan documents, read bar codes, ...
27 IRAM as Building Block for ISTORE
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- Target for 5-7 years:
- building block: 2006 MicroDrive integrated with IRAM
- 9 GB disk, 50 MB/sec (projected)
- connected via crossbar switch
- O(10) Gflops
- 10,000 nodes fit into one rack!