Title: Three related projects at Berkeley
1 Welcome
- Three related projects at Berkeley
- Intelligent RAM (IRAM)
- Intelligent Storage (ISTORE)
- OceanStore
- Groundrules
- Questions are welcome during talks
- Feedback required Friday morning
- Time for rafting and talking
- Introductions
2 Overview of the IRAM Project
Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson
http://iram.cs.berkeley.edu/
Summer 2000 Retreat
3 Outline
- IRAM Motivation
- VIRAM architecture and VIRAM-1 microarchitecture
- Benchmarks
- Compiler
- Back to the future: 1997 and today
4 Original IRAM Motivation
- Processor-DRAM Gap (latency)
[Figure: performance vs. time, 1980-2000, log scale. CPU performance grows at 60%/yr while DRAM grows at 7%/yr, so the processor-memory performance gap grows 50%/year.]
5 Intelligent RAM: IRAM
- Microprocessor + DRAM on a single chip:
- 10X capacity vs. SRAM
- on-chip memory latency 5-10X lower
- memory bandwidth 50-100X higher
- 2X-4X better energy efficiency (no off-chip bus)
- smaller board area/volume
- IRAM advantages extend to:
- a single-chip system
- a building block for larger systems
6 1997 Vanilla IRAM Study
- Estimated performance of an IRAM version of the Alpha (same caches, benchmarks, standard DRAM)
- Assumed logic slower, DRAM faster
- Results: SPEC92 slower, sparse matrices faster, databases even
- Two conclusions:
- Conventional benchmarks like SPEC match conventional architectures
- Conventional architectures do not utilize memory bandwidth
- Research plan:
- Focus on power/area advantages, including portable, hand-held devices
- Focus on multimedia benchmarks to match these devices
- Develop an architecture that can exploit the enormous memory bandwidth
7 Vector IRAM Architecture
[Figure: vector register file. Each of the data registers vr0..vr31 holds mvl elements of width vpw, viewed as virtual processors VP0..VPvl-1.]
- Maximum vector length is given by a read-only register mvl
- E.g., in the VIRAM-1 implementation, each register holds 32 64-bit values
- Vector length is given by the register vl
- This is the # of active elements or virtual processors
- To handle variable-width data (8/16/32/64-bit):
- Width of each VP is given by the register vpw
- vpw is one of 8b, 16b, 32b, 64b (no 8b in VIRAM-1)
- mvl depends on implementation and vpw: 32 64-bit, 64 32-bit, or 128 16-bit elements
8 IRAM Architecture Update
- ISA mostly frozen since 6/99
- Changes in 2H99 for a better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
- Minor changes in 1H00 to address the new co-processor interface in the MIPS core
- ISA manual publicly available
- http://www.cs.berkeley.edu
- Suite of simulators actively used
- vsim-isa (functional)
- Major rewrite nearly complete for new scalar processor
- All UCB code
- vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)
9 VIRAM-1 Implementation
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- Vector unit @ 200 MHz
- 4 64-bit lanes
- 8 32-bit virtual lanes
- 16 16-bit virtual lanes
- 0.18 µm embedded DRAM/logic process
- 17x17 mm
- 2 Watt power target
[Figure: floorplan with two memory halves (64 Mbits / 8 MBytes each) connected through a crossbar (Xbar).]
- Design easily scales in number of lanes, e.g.:
- 2 64-bit lanes: lower-power version
- 8 64-bit lanes: higher-performance version
- Number of memory banks is independent
10 VIRAM-1 Microarchitecture
- Memory system
- 8 DRAM banks
- 256-bit synchronous interface
- 1 sub-bank per bank
- 16 Mbytes total capacity
- Peak performance
- 3.2 GOPS64, 12.8 GOPS16 (w. madd)
- 1.6 GOPS64, 6.4 GOPS16 (wo. madd)
- 0.8 GFLOPS64, 1.6 GFLOPS32
- 6.4 GByte/s memory bandwidth consumed by VU
- 2 arithmetic units
- both execute integer operations
- one executes FP operations
- 4 64-bit datapaths (lanes) per unit
- 2 flag processing units
- for conditional execution and speculation support
- 1 load-store unit
- optimized for strides 1, 2, 3, and 4
- 4 addresses/cycle for indexed and strided operations
- decoupled indexed and strided stores
11 VIRAM-1 block diagram
12 IRAM Chip Update
- IBM supplying embedded DRAM/logic (100%)
- Agreement in place as of June 1, 2000
- MIPS supplying scalar core (100%)
- MIPS processor, caches, TLB
- MIT supplying FPU (100%)
- VIRAM-1 tape-out scheduled for January 2001
- Simplifications
- Floating point
- Network interface
13 Hand-Coded Benchmark Review
- Image processing kernels (old FPU model)
- Note BLAS-2 performance
14 Base-line system comparison
- All numbers in cycles/pixel
- MMX and VIS results assume all data in L1 cache
15 FFT Uses In-Register Permutations
Without in-register permutations
16 Problem: General Element Permutation
- Hardware for a full vector permutation instruction (128 16b elements, 256b datapath):
- Datapath: 16 x 16 (x 16b) crossbar; scales as O(N^2)
- Control: 16 16-to-1 multiplexors; scales as O(N log N)
- Time/energy wasted on wide vector register file port
17 Simple Vector Permutations
- Simple steps of butterfly permutations
- A register provides the butterfly radix
- Separate instructions for moving elements to left/right
- Sufficient semantics for:
- Fast reductions of vector registers (dot products)
- Fast FFT kernels
18 Hardware for Simple Permutations
- Hardware for 128 16b elements, 256b datapath:
- Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales as O(N)
- Control: 6 control cases; scales as O(N)
- Other benefits:
- Consecutive result elements written together
- Buses used only for small radices
19 IRAM Compiler Status
- Retarget of the Cray compiler
- Steps in compiler development:
- Build MIPS backend (done)
- Build VIRAM backend for vectorized loops (done)
- Instruction scheduling for VIRAM-1 (done)
- Insertion of memory barriers (using Cray strategy, improving)
- Additional optimizations (ongoing)
- Feedback results to Cray, new version from Cray (ongoing)
20 Compiled Applications Update
- Applications using the compiler:
- Speech processing under development
- Developed a new small-memory algorithm for speech processing
- Uses some existing kernels (FFT and MM)
- Vector search algorithm is most challenging
- DIS image-understanding application under development
- Compiles, but does not yet vectorize well
- Singular Value Decomposition
- Better than 2 VLIW machines (TI C67 and TM 1100)
- Challenging BLAS-1,2 operations work well on IRAM because of memory BW
- Kernels
- Simple floating-point kernels are very competitive with hand-coded versions
21 (10n x n SVD, rank 10)
(From Herman, Loo, Tang, CS252 project)
22 IRAM Latency Advantage
- 1997 estimate: 5-10x improvement
- No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins
- 30 ns for IRAM (or much lower with DRAM redesign)
- Compare to Alpha 600: 180 ns for 128b, 270 ns for 512b
- 2000 estimate: 5x improvement
- IRAM memory latency is 25 ns for 256 bits, fixed pipeline delay
- Alpha 4000/4100: 120 ns
23 IRAM Bandwidth Advantage
- 1997 estimate: 100x
- 1024 1-Mbit modules, each 1 Kb wide (1 Gb chip)
- 10% @ 40 ns RAS/CAS ⇒ 320 GBytes/sec
- If crossbar switch or multiple busses deliver 1/3 to 2/3 of total ⇒ 100-200 GBytes/sec
- Compare to AlphaServer 8400: 1.2 GBytes/sec; 4100: 1.1 GBytes/sec
- 2000 estimate: 10-100x
- VIRAM-1: 16 MB chip divided into 8 banks ⇒ 51.2 GB/s peak from memory banks
- Crossbar can consume 12.8 GB/s
- 6.4 GB/s from Vector Unit, 6.4 GB/s from either scalar or I/O
24 Power and Energy Advantages
- 1997: case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
- cell-size advantages ⇒ much larger cache ⇒ fewer off-chip references ⇒ up to 2X-4X energy efficiency for memory-intensive algorithms
- less energy per bit access for DRAM
- Power target for VIRAM-1:
- 2 watt goal
- Based on preliminary SPICE runs, this looks very feasible today
- Scalar core included
25 Summary
- IRAM takes advantage of high on-chip bandwidth
- Vector IRAM ISA utilizes this bandwidth
- Unit, strided, and indexed memory access patterns supported
- Exploits fine-grained parallelism, even with pointer chasing
- Compiler
- Well-understood compiler model, semi-automatic
- Still some work on code-generation quality
- Application benchmarks
- Compiled and hand-coded
- Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
26 IRAM Applications: Intelligent PDA
- Pilot PDA + gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ...
- Wireless data (WWW)
- Speech, vision recognition
- Voice output for conversations
- Speech control; vision to see, scan documents, read bar codes, ...
27 IRAM as Building Block for ISTORE
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- Target for 5-7 years:
- building block: 2006 MicroDrive integrated with IRAM
- 9 GB disk, 50 MB/sec (projected)
- connected via crossbar switch
- O(10) Gflops
- 10,000 nodes fit into one rack!