Three related projects at Berkeley
1
Welcome
  • Three related projects at Berkeley
  • Intelligent RAM (IRAM)
  • Intelligent Storage (ISTORE)
  • OceanStore
  • Ground rules
  • Questions are welcome during talks
  • Feedback required Friday morning
  • Time for rafting and talking
  • Introductions

2
Overview of the IRAM Project
  • Kathy Yelick

Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Christoforos Kozyrakis, David Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, and David Patterson

http://iram.cs.berkeley.edu/

Summer 2000 Retreat
3
Outline
  • IRAM Motivation
  • VIRAM architecture and VIRAM-1 microarchitecture
  • Benchmarks
  • Compiler
  • Back to the future: 1997 and today

4
Original IRAM Motivation
[Chart: Processor-DRAM gap (latency); performance on a log scale vs. time, 1980-2000. µProc performance grows 60%/yr, DRAM 7%/yr; the processor-memory performance gap grows 50%/year.]
5
Intelligent RAM (IRAM)
  • Microprocessor + DRAM on a single chip
  • 10X capacity vs. on-chip SRAM
  • 5-10X lower on-chip memory latency
  • 50-100X higher memory bandwidth
  • 2X-4X better energy efficiency (no off-chip bus)
  • smaller board area/volume
  • IRAM advantages extend to
  • a single-chip system
  • a building block for larger systems

6
1997 Vanilla IRAM Study
  • Estimated performance of an IRAM version of the Alpha (same caches, benchmarks, standard DRAM)
  • Assumed logic slower, DRAM faster
  • Results: SPEC92 slower, sparse matrices faster, DBs even
  • Two conclusions
  • Conventional benchmarks like SPEC match conventional architectures
  • Conventional architectures do not utilize memory bandwidth
  • Research plan
  • Focus on power/area advantages, including
    portable, hand-held devices
  • Focus on multimedia benchmarks to match these
    devices
  • Develop an architecture that can exploit the
    enormous memory bandwidth

7
Vector IRAM Architecture
[Diagram: vector register file vr0-vr31; each register holds mvl (maximum vector length) elements of width vpw, one per virtual processor VP0..VP(vl-1).]
  • Maximum vector length is given by a read-only register mvl
  • E.g., in the VIRAM-1 implementation, each register holds 32 64-bit values
  • Vector length is given by the register vl
  • This is the number of active elements, or virtual processors
  • To handle variable-width data (8, 16, 32, 64-bit)
  • Width of each VP is given by the register vpw
  • vpw is one of 8b, 16b, 32b, 64b (no 8b in VIRAM-1)
  • mvl depends on implementation and vpw: 32 64-bit, 64 32-bit, or 128 16-bit elements (see the sketch below)
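To make the interplay of mvl, vl, and vpw concrete, here is a minimal C sketch of a strip-mined loop under this register model. The 2048-bit register size matches the VIRAM-1 numbers above; the helper names are illustrative, not actual VIRAM intrinsics.

```c
#include <stddef.h>

/* Illustrative model of the VIRAM register parameters: each vector
 * register holds mvl elements, where mvl depends on the element
 * width vpw. In VIRAM-1 a register is 2048 bits, giving
 * 32 x 64-bit, 64 x 32-bit, or 128 x 16-bit elements. */
enum { REG_BITS = 2048 };

static size_t mvl_for(unsigned vpw_bits) { return REG_BITS / vpw_bits; }

/* Strip-mined vector add: each trip sets vl = min(mvl, remaining),
 * so one vector instruction operates on VP0..VP(vl-1). */
void vadd32(int *c, const int *a, const int *b, size_t n) {
    size_t mvl = mvl_for(32);                    /* 64 elements       */
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < mvl) ? n - i : mvl; /* set vl register   */
        for (size_t vp = 0; vp < vl; vp++)       /* one vector op     */
            c[i + vp] = a[i + vp] + b[i + vp];
        i += vl;
    }
}
```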

8
IRAM Architecture Update
  • ISA mostly frozen since 6/99
  • Changes in 2H 99 for better fixed-point model and
    some instructions for short vectors (auto
    increment and in-register permutations)
  • Minor changes in 1H 00 to address new
    co-processor interface in MIPS core
  • ISA manual publicly available
  • http://www.cs.berkeley.edu
  • Suite of simulators actively used
  • vsim-isa (functional)
  • Major rewrite nearly complete for new scalar
    processor
  • All UCB code
  • vsim-p (performance), vsim-db (debugger),
    vsim-sync (memory synchronization)

9
VIRAM-1 Implementation
  • 16 MB DRAM, 8 banks
  • MIPS scalar core and caches @ 200 MHz
  • Vector unit @ 200 MHz
  • 4 64-bit lanes
  • 8 32-bit virtual lanes
  • 16 16-bit virtual lanes
  • 0.18 µm EDL process
  • 17x17 mm
  • 2 Watt power target

[Diagram: floorplan with two 64-Mbit (8-MByte) memory halves connected through a crossbar (Xbar).]
  • Design easily scales in number of lanes, e.g.
  • 2 64-bit lanes: lower-power version
  • 8 64-bit lanes: higher-performance version
  • Number of memory banks is independent

10
VIRAM-1 Microarchitecture
  • Memory system
  • 8 DRAM banks
  • 256-bit synchronous interface
  • 1 sub-bank per bank
  • 16 Mbytes total capacity
  • Peak performance (see the arithmetic sketch after this list)
  • 3.2 GOPS64, 12.8 GOPS16 (with madd)
  • 1.6 GOPS64, 6.4 GOPS16 (without madd)
  • 0.8 GFLOPS64, 1.6 GFLOPS32
  • 6.4 GByte/s memory bandwidth consumed by VU
  • 2 arithmetic units
  • both execute integer operations
  • one executes FP operations
  • 4 64-bit datapaths (lanes) per unit
  • 2 flag processing units
  • for conditional execution and speculation support
  • 1 load-store unit
  • optimized for strides 1,2,3, and 4
  • 4 addresses/cycle for indexed and strided
    operations
  • decoupled indexed and strided stores
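For reference, the peak figures above follow from the 200 MHz clock, 2 arithmetic units, and 4 64-bit lanes per unit; here is a sketch of the arithmetic, assuming a madd counts as two operations per lane per cycle:

```latex
\begin{align*}
\mathrm{GOPS}_{64} &= 0.2\,\mathrm{GHz}\times 2\ \mathrm{units}\times 4\ \mathrm{lanes} = 1.6 \quad (\times 2\ \mathrm{with\ madd} = 3.2)\\
\mathrm{GOPS}_{16} &= 0.2\,\mathrm{GHz}\times 2\ \mathrm{units}\times 16\ \mathrm{virtual\ lanes} = 6.4 \quad (\times 2\ \mathrm{with\ madd} = 12.8)\\
\mathrm{GFLOPS}_{64} &= 0.2\,\mathrm{GHz}\times 1\ \mathrm{FP\ unit}\times 4\ \mathrm{lanes} = 0.8\\
\mathrm{GFLOPS}_{32} &= 0.2\,\mathrm{GHz}\times 1\ \mathrm{FP\ unit}\times 8\ \mathrm{virtual\ lanes} = 1.6\\
\mathrm{VU\ bandwidth} &= 32\,\mathrm{B}\times 0.2\,\mathrm{GHz} = 6.4\ \mathrm{GB/s}
\end{align*}
```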

11
VIRAM-1 block diagram
12
IRAM Chip Update
  • IBM supplying embedded DRAM/Logic (100%)
  • Agreement in place as of June 1, 2000
  • MIPS supplying scalar core (100%)
  • MIPS processor, caches, TLB
  • MIT supplying FPU (100%)
  • VIRAM-1 Tape-out scheduled for January 2001
  • Simplifications
  • Floating point
  • Network Interface

13
Hand-Coded Benchmark Review
  • Image processing kernels (old FPU model)
  • Note BLAS-2 performance

14
Baseline system comparison
  • All numbers in cycles/pixel
  • MMX and VIS results assume all data in L1 cache

15
FFT Uses In-Register Permutations
[Charts: FFT performance with and without in-register permutations.]
16
Problem: General Element Permutation
  • Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
  • Datapath: 16 x 16 (x 16b) crossbar; scales as O(N²)
  • Control: 16 16-to-1 multiplexors; scales as O(N log N)
  • Time/energy wasted on wide vector register file port

17
Simple Vector Permutations
  • Simple steps of butterfly permutations
  • A register provides the butterfly radix
  • Separate instructions for moving elements to
    left/right
  • Sufficient semantics for
  • Fast reductions of vector registers (dot products; see the sketch below)
  • Fast FFT kernels
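A minimal C model of how a radix-r butterfly step yields a fast reduction. Pairing elements at XOR-distance r is an assumed reading of the permutation semantics, and the names are illustrative, not VIRAM mnemonics.

```c
#include <stddef.h>

/* One butterfly permutation step of radix r on a vl-element register:
 * element i is paired with the element at XOR-distance r. */
static void butterfly(int *dst, const int *src, size_t vl, size_t r) {
    for (size_t i = 0; i < vl; i++)
        dst[i] = src[i ^ r];         /* partner at distance r    */
}

/* Reduction (e.g., the adds of a dot product) in log2(vl) vector
 * steps instead of vl - 1 sequential adds; vl must be a power of 2
 * and at most 128 (the largest mvl in VIRAM-1). */
int reduce_add(int *v, size_t vl) {
    int tmp[128];
    for (size_t r = vl / 2; r >= 1; r /= 2) {
        butterfly(tmp, v, vl, r);    /* bring partners across    */
        for (size_t i = 0; i < vl; i++)
            v[i] += tmp[i];          /* elementwise vector add   */
    }
    return v[0];                     /* total sum lands in VP0   */
}
```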

18
Hardware for Simple Permutations
  • Hardware for 128 16b elements, 256b datapath
  • Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, or 32b only); scales as O(N)
  • Control: 6 control cases; scales as O(N)
  • Other benefits
  • Consecutive result elements written together
  • Buses used only for small radices

19
IRAM Compiler Status
  • Retarget of Cray compiler
  • Steps in compiler development
  • Build MIPS backend (done)
  • Build VIRAM backend for vectorized loops (done)
  • Instruction scheduling for VIRAM-1 (done)
  • Insertion of memory barriers (using Cray
    strategy, improving)
  • Additional optimizations (ongoing)
  • Feedback results to Cray, new version from Cray
    (ongoing)

20
Compiled Applications Update
  • Applications using compiler
  • Speech processing under development
  • Developed new small-memory algorithm for speech
    processing
  • Uses some existing kernels (FFT and MM)
  • Vector search algorithm is most challenging
  • DIS image understanding application under
    development
  • Compiles, but does not yet vectorize well
  • Singular Value Decomposition
  • Better than 2 VLIW machines (TI C67 and TM 1100)
  • Challenging: BLAS-1,2 operations work well on IRAM because of memory BW
  • Kernels
  • Simple floating-point kernels are very competitive with hand-coded versions

21
[Chart: performance on a 10n x n SVD, rank 10 (from Herman, Loo, Tang, CS252 project).]
22
IRAM Latency Advantage
  • 1997 estimate: 5-10x improvement
  • No parallel DRAMs, memory controller, bus to turn around, SIMM module, pins
  • 30 ns for IRAM (or much lower with DRAM redesign)
  • Compare to Alpha 600: 180 ns for 128b, 270 ns for 512b
  • 2000 estimate: 5x improvement
  • IRAM memory latency is 25 ns for 256 bits, fixed pipeline delay
  • Alpha 4000/4100: 120 ns

23
IRAM Bandwidth Advantage
  • 1997 estimate: 100x
  • 1024 1-Mbit modules, each 1 Kb wide (1-Gbit chip)
  • 10% @ 40 ns RAS/CAS ⇒ 320 GBytes/sec
  • If crossbar switch or multiple busses deliver 1/3 to 2/3 of total ⇒ 100-200 GBytes/sec
  • Compare to AlphaServer 8400: 1.2 GBytes/sec; 4100: 1.1 GBytes/sec
  • 2000 estimate: 10-100x
  • VIRAM-1: 16 MB chip divided into 8 banks ⇒ 51.2 GB/s peak from memory banks (arithmetic below)
  • Crossbar can consume 12.8 GB/s
  • 6.4 GB/s from Vector Unit, 6.4 GB/s from either scalar or I/O
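The 2000 figures follow from the VIRAM-1 parameters given earlier (8 banks, 256-bit synchronous interfaces, 200 MHz); a sketch of the arithmetic:

```latex
\begin{align*}
\text{per bank:}\quad 256\,\mathrm{b}\times 0.2\,\mathrm{GHz} &= 6.4\ \mathrm{GB/s}\\
\text{all banks:}\quad 8\times 6.4\ \mathrm{GB/s} &= 51.2\ \mathrm{GB/s\ peak}\\
\text{crossbar:}\quad 6.4\ (\mathrm{VU}) + 6.4\ (\mathrm{scalar\ or\ I/O}) &= 12.8\ \mathrm{GB/s}
\end{align*}
```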

24
Power and Energy Advantages
  • 1997: Case study of StrongARM memory hierarchy vs. IRAM memory hierarchy
  • cell-size advantages ⇒ much larger cache ⇒ fewer off-chip references ⇒ up to 2X-4X energy efficiency for memory-intensive algorithms
  • less energy per bit access for DRAM
  • Power target for VIRAM-1
  • 2-watt goal
  • Based on preliminary SPICE runs, this looks very feasible today
  • Scalar core included

25
Summary
  • IRAM takes advantage of high on-chip bandwidth
  • Vector IRAM ISA utilizes this bandwidth
  • Unit, strided, and indexed memory access patterns supported (see the C sketch after this list)
  • Exploits fine-grained parallelism, even with
    pointer chasing
  • Compiler
  • Well-understood compiler model, semi-automatic
  • Still some work on code generation quality
  • Application benchmarks
  • Compiled and hand-coded
  • Include FFT, SVD, MVM, sparse MVM, and other
    kernels used in image and signal processing
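As an illustration of those access patterns, here are the three loop forms in C that a vectorizing compiler would map to unit-stride, strided, and indexed (gather) vector memory operations; this is a sketch only, not actual compiler output:

```c
#include <stddef.h>

/* Unit stride: consecutive elements. */
void scale(float *y, const float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = 2.0f * x[i];
}

/* Strided: every s-th element (e.g., a matrix column). */
void column(float *y, const float *x, size_t n, size_t s) {
    for (size_t i = 0; i < n; i++)
        y[i] = x[i * s];
}

/* Indexed (gather): exposes fine-grained parallelism even for
 * pointer-chasing-style irregular references. */
void gather(float *y, const float *x, const int *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = x[idx[i]];
}
```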

26
IRAM Applications: Intelligent PDA
  • Pilot PDA
  • gameboy, cell phone, radio, timer, camera, TV remote, AM/FM radio, garage door opener, ...
  • Wireless data (WWW)
  • Speech, vision recognition
  • Voice output for conversations

Speech control. Vision to see, scan documents, read bar codes, ...
27
IRAM as Building Block for ISTORE
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • Target for 5-7 years out
  • building block: 2006 MicroDrive integrated with IRAM
  • 9 GB disk, 50 MB/sec (projected)
  • connected via crossbar switch
  • O(10) Gflops
  • 10,000 nodes fit into one rack!