Title: Rough Schedule
1Rough Schedule
- 1:30-2:15 IRAM overview
- 2:15-3:00 ISTORE overview
- break
- 3:15-3:30 Financial
- 4:00-5:00 Future
2IRAM Hardware and Software
- Kathy Yelick
- Computer Science Division
- UC Berkeley
3Intelligent RAM: IRAM
- Microprocessor and DRAM on a single chip
- 10X capacity vs. DRAM
- on-chip memory latency 5-10X, bandwidth 50-100X
- improve energy efficiency 2X-4X (no off-chip bus)
- serial I/O 5-10X v. buses
- smaller board area/volume
- IRAM advantages extend to
- a single chip system
- a building block for larger systems
4VIRAM System on a Chip
- 0.18 um EDL process
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- 4 64-bit vector unit pipelines @ 200 MHz
- 17x17 mm, 2 Watts target
- 25.6 GB/s memory (6.4 GB/s per direction and per Xbar)
- 0.8 Gflops (64-bit), 6.4 GOPs (16-bit)
[Figure labels: Memory (64 Mbits / 8 MBytes), Xbar, Memory (64 Mbits / 8 MBytes)]
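The peak figures on this slide can be cross-checked with simple arithmetic. The sketch below is not from the original deck: it assumes each crossbar moves 256 bits per cycle in each direction, that there are two crossbars (one per memory block), and that the 16-bit rate counts two 16-bit operations per sub-lane per cycle. Those three assumptions are chosen only because they reproduce the quoted numbers; the function names are hypothetical.

```python
# Back-of-the-envelope check of the peak rates quoted on the slide.
CLOCK_HZ = 200e6      # 200 MHz, from the slide
PIPELINES = 4         # four 64-bit vector pipelines, from the slide
XBAR_BITS = 256       # ASSUMPTION: bits moved per cycle, per direction, per crossbar


def crossbar_bandwidth_gb_s(clock_hz=CLOCK_HZ, bits=XBAR_BITS):
    """Bandwidth of one crossbar in one direction, in GB/s."""
    return clock_hz * bits / 8 / 1e9


def total_memory_bandwidth_gb_s(directions=2, crossbars=2):
    """ASSUMPTION: two directions (load/store) and two crossbars."""
    return crossbar_bandwidth_gb_s() * directions * crossbars


def peak_gops(element_bits, ops_per_sublane=1):
    """Peak operations per second (in units of 1e9) for a given element width.

    Each 64-bit pipeline is treated as 64 // element_bits sub-lanes.
    ops_per_sublane is an ASSUMPTION used to match the 16-bit figure.
    """
    sublanes = PIPELINES * (64 // element_bits)
    return sublanes * ops_per_sublane * CLOCK_HZ / 1e9


if __name__ == "__main__":
    print("per crossbar, per direction (GB/s):", crossbar_bandwidth_gb_s())
    print("total memory bandwidth (GB/s):", total_memory_bandwidth_gb_s())
    print("64-bit peak (Gflops):", peak_gops(64))
    print("16-bit peak (GOPs):", peak_gops(16, ops_per_sublane=2))
```

Under these assumptions the 25.6 GB/s figure is simply four times the per-direction, per-crossbar rate.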
5IRAM Chip Update
- IBM supplying embedded DRAM/Logic (100)
- Agreement in place and technology files available
- MIPS supplying scalar core (100)
- MIPS processor, caches, TLB
- MIT supplying FPU (100)
- VIRAM-1 Tape-out scheduled for late-2000
- Simplifications
- Floating point
- Network Interface
6VIRAM-1 Chip Design Status
- MIPS scalar core
- Synthesizable RTL code received from MIPS
- Cache RAMs to be compiled for IBM technology
- FPU RTL code almost complete
- Vector unit
- RTL models for sub-blocks developed; currently being integrated and tested
- Control logic to be compiled for IBM technology
- Full-custom layout for multipliers/adders developed; layout for shifters to be developed
- Memory system
- Synthesizable model for DRAM controllers done
- To be integrated with IBM DRAM macros
- Full-custom layout for crossbar under development
- Testing infrastructure
- Environment developed for automatic test validation
- Directed tests for single/multiple instruction groups developed
- Random instruction sequence generator developed
7IRAM Architecture Update
- ISA mostly frozen since 6/99
- Changes in 2H 99 for better fixed-point model and some instructions for short vectors (auto-increment and in-register permutations)
- Minor changes in 1H 00 to address new co-processor interface in MIPS core
- ISA manual publicly available
- http://www.cs.berkeley.edu
- Suite of simulators actively used
- vsim-isa (functional)
- Major rewrite underway for new scalar processor
- All UCB code
- vsim-p (performance), vsim-db (debugger),
vsim-sync (memory synchronization)
8IRAM Compiler Status
- Retarget of Cray Backend
- Steps in compiler development
- Build MIPS backend (done)
- Build VIRAM backend for vectorized loops (done)
- Instruction scheduling for VIRAM-1 (works, but could be improved)
- Insertion of memory barriers (using Cray strategy, improving)
- Optimizations for short loops (reduce overhead)
- Feedback results to Cray, new version from Cray (ongoing)
9IRAM Compiler Update
- Study of compiler quality using 100 Dongarra loops
- 70 vectorized
- Average 10x reduction in dynamic instruction count
- Average vector length of 42
- 30 did not, usually due to a dependence
- Some reductions missed
- Vector version of math libraries (sin, cos, etc.) needed
- Some failed due to bugs in benchmark
- Identified 2 specific areas for improvements in loop overhead
- Use VL and MVL more carefully
- Use auto-increment instruction more extensively
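To make the VL/MVL point concrete, here is a minimal sketch of strip-mining, the standard way a vectorizing compiler maps a long loop onto fixed-length vector registers. It is a plain Python model, not VIRAM code: the function name, the default `mvl=32`, and the strip counter are illustrative assumptions. Each pass of the `while` loop stands in for setting the vector length (VL) and issuing one group of vector instructions.

```python
def saxpy_strip_mined(a, x, y, mvl=32):
    """Compute a*x + y element-wise in strips of at most mvl elements.

    mvl models the maximum vector length (MVL); its default is an
    ASSUMPTION for illustration. Returns (result, number_of_strips).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    out = list(y)
    strips = 0
    i = 0
    while i < n:
        # Set VL for this strip: a full MVL, or whatever is left at the end.
        vl = min(mvl, n - i)
        # One "vector instruction group" operating on vl elements.
        out[i:i + vl] = [a * xi + yi
                         for xi, yi in zip(x[i:i + vl], out[i:i + vl])]
        i += vl
        strips += 1
    return out, strips


if __name__ == "__main__":
    result, strips = saxpy_strip_mined(2.0, [1, 2, 3, 4, 5], [10] * 5, mvl=2)
    print(result, strips)
```

The per-strip bookkeeping (computing `vl`, advancing `i`) is the loop overhead the slide refers to; the fewer and fuller the strips, the smaller its share of the work.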
10Compiled Applications Update
- Applications using compiler
- Speech processing under development
- Developed new small-memory algorithm for speech processing
- Uses some existing kernels (FFT and MM)
- Vector search algorithm is most challenging
- DIS image understanding application under development
- Compiles, but does not yet vectorize well
- Singular Value Decomposition
- Better than 2 VLIW machines (TI C67 and TM 1100)
- Challenging BLAS-1,2 work well on IRAM because of memory BW
- Kernels
- SAXPY, MVM, etc.
- Will include DIS stress-marks
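One way to see why BLAS-1 and BLAS-2 kernels depend on memory bandwidth is to count floating-point operations per word of memory traffic. The sketch below uses textbook operation counts and an idealized traffic model (each operand touched once, perfect reuse for matrix multiply); the function name and the model are assumptions for illustration, not measurements from the project.

```python
def flops_per_word(kernel, n):
    """Idealized arithmetic intensity: flops per memory word touched.

    ASSUMPTION: every operand word is moved exactly once.
    """
    if kernel == "saxpy":
        # y = a*x + y: 2n flops; read x, read y, write y.
        return (2 * n) / (3 * n)
    if kernel == "mvm":
        # y = A*x: 2n^2 flops; n^2 matrix words plus the x and y vectors.
        return (2 * n * n) / (n * n + 2 * n)
    if kernel == "mm":
        # C = A*B: 2n^3 flops; three n x n matrices with perfect reuse.
        return (2 * n ** 3) / (3 * n * n)
    raise ValueError("unknown kernel: " + kernel)


if __name__ == "__main__":
    for name in ("saxpy", "mvm", "mm"):
        print(name, round(flops_per_word(name, 1000), 2))
```

In this model SAXPY (BLAS-1) and MVM (BLAS-2) stay below 2 flops per word regardless of problem size, so their speed is set by how fast memory can deliver operands, while matrix multiply (BLAS-3) can amortize each word over many operations.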
11(10n x n SVD, rank 10)
(From Herman, Loo, Tang, CS252 project)
12Hand-Coded Applications Update
- Image processing kernels (old FPU model)
- Note BLAS-2 performance
13Problem: General Element Permutation
- Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
- Datapath: 16 x 16 (x 16b) crossbar; scales by O(N^2)
- Control: 16 16-to-1 multiplexors; scales by O(N log N)
- Time/energy wasted on wide vector register file port
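The sizes on this slide follow from the datapath width. As a rough sketch, with N taken as the number of elements processed per cycle, a 256b datapath and 16b elements give N = 16, an N x N crossbar, and N multiplexors that each need about log2(N) select bits. Reading the O(N log N) control term as "select bits" is an interpretation, and the function name is hypothetical.

```python
def full_permutation_cost(datapath_bits=256, element_bits=16):
    """Rough hardware size for a full element permutation.

    Returns (n, crosspoints, select_bits), where
      n           = elements handled per cycle,
      crosspoints = n * n (crossbar datapath, O(N^2)),
      select_bits = n * ceil(log2(n)) (mux control, O(N log N));
                    counting select bits is an ASSUMED interpretation.
    """
    n = datapath_bits // element_bits
    crosspoints = n * n
    bits_per_mux = (n - 1).bit_length()   # ceil(log2(n)) for n >= 1
    select_bits = n * bits_per_mux
    return n, crosspoints, select_bits


if __name__ == "__main__":
    print(full_permutation_cost(256, 16))
    print(full_permutation_cost(512, 16))
```

Doubling the datapath width in this model quadruples the crosspoint count, which is the scaling problem the slide points to.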
14Simple Vector Permutations
- Simple steps of butterfly permutations
- A register provides the butterfly radix
- Separate instructions for moving elements to left/right
- Sufficient semantics for
- Fast reductions of vector registers (dot products)
- Fast FFT kernels
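The reduction use case can be sketched as follows. This is a plain Python model of the idea, not the VIRAM instructions themselves: at each step a "permutation" brings the upper `radix` elements down next to the lower `radix` elements, and a single vector add of length `radix` combines them, halving the vector until one element remains. The function names and the power-of-two length requirement are assumptions made for the sketch.

```python
def butterfly_sum(v):
    """Sum a vector in log2(n) butterfly steps (n must be a power of two)."""
    n = len(v)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    work = list(v)
    radix = n // 2                      # the butterfly radix for this step
    while radix >= 1:
        # Permutation step: move the upper `radix` elements down.
        moved = work[radix:2 * radix]
        # One vector add at vector length `radix`.
        work = [a + b for a, b in zip(work[:radix], moved)]
        radix //= 2
    return work[0]


def dot(x, y):
    """Dot product: one element-wise multiply, then a butterfly reduction."""
    return butterfly_sum([a * b for a, b in zip(x, y)])


if __name__ == "__main__":
    print(butterfly_sum([1, 2, 3, 4, 5, 6, 7, 8]))
    print(dot([1, 2, 3, 4], [5, 6, 7, 8]))
```

Each step only moves a contiguous block of elements by a fixed distance, which is far simpler than an arbitrary permutation.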
15Hardware for Simple Permutations
- Hardware for 128 16b elements, 256b datapath
- Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales by O(N)
- Control: 6 control cases; scales by O(N)
- Other benefits
- Consecutive result elements written together
- Buses used only for small radices
16FFT Uses In-Register Permutations
Without in-register permutations
17Summary
- IRAM takes advantage of high on-chip bandwidth
- BLAS-2 performance confirms this
- Vector IRAM ISA utilizes this bandwidth
- Unit, strided, and indexed memory access patterns supported
- Exploits fine-grained parallelism, even with pointer chasing
- Compiler
- Well-understood compiler model, semi-automatic
- Still some work on code generation quality
- Application benchmarks
- Compiled and hand-coded
- Include FFT, SVD, MVM, sparse MVM, and other
kernels used in image and signal processing
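The three memory access patterns named in the summary can be illustrated with a small model. The sketch below treats memory as a Python list and shows unit-stride, strided, and indexed (gather) loads, then uses them in a sparse matrix-vector multiply over a CSR layout. The helper names and the choice of CSR are assumptions for illustration; the deck does not say which sparse format was used.

```python
def unit_stride_load(mem, base, vl):
    """Load vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def strided_load(mem, base, stride, vl):
    """Load vl elements spaced stride apart (e.g. a matrix column)."""
    return [mem[base + i * stride] for i in range(vl)]


def indexed_load(mem, base, indices):
    """Gather: load mem[base + j] for each j in an index vector."""
    return [mem[base + j] for j in indices]


def sparse_mvm_csr(values, col_idx, row_ptr, x):
    """y = A*x with A in CSR form (ASSUMED layout for this sketch)."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        vals = unit_stride_load(values, start, end - start)
        cols = unit_stride_load(col_idx, start, end - start)
        xs = indexed_load(x, 0, cols)          # gather from x
        y.append(sum(v * xv for v, xv in zip(vals, xs)))
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]], x = [1, 2, 3]
    print(sparse_mvm_csr([1, 2, 3, 4, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5], [1, 2, 3]))
    print(strided_load(list(range(10)), 1, 3, 3))
```

Sparse MVM is the case that needs indexed access: the nonzero values are contiguous, but the matching entries of x are scattered.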
18IRAM as Building Block for ISTORE
- System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
- Target for 5-7 years
- building block: 2006 MicroDrive integrated with IRAM
- 9GB disk, 50 MB/sec disk (projected)
- connected via crossbar switch
- O(10) Gflops
- 10,000 nodes fit into one rack!