IRAM and ISTORE Projects - PowerPoint PPT Presentation

About This Presentation

Title:

IRAM and ISTORE Projects

Slides: 32

Provided by: davidoppe

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

1
IRAM and ISTORE Projects

Aaron Brown, James Beck, Rich Fromm, Joe Gebis,
Paul Harvey, Adam Janin, Dave Judd,
Kimberly Keeton, Christoforos Kozyrakis, David
Martin, Rich Martin, Thinh Nguyen, David
Oppenheimer, Steve Pope, Randi Thomas,
Noah Treuhaft, Sam Williams, John
Kubiatowicz, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/istore
Fall 2000 DIS DARPA Meeting

2
IRAM and ISTORE Vision

Integrated processor in memory provides efficient
access to high memory bandwidth

Two Post-PC applications
IRAM: Single-chip system for embedded and
portable applications
Target: media processing (speech, images, video,
audio)
ISTORE: Building block when combined with disk
for storage and retrieval servers
Up to 10K nodes in one rack
Non-IRAM prototype addresses key scaling issues:
availability, manageability, evolution

Photo from Itsy, Inc.

3
IRAM Overview

A processor architecture for embedded/portable
systems running media applications
Based on media processing and embedded DRAM
Simple, scalable, energy and area efficient
Good compiler target

[Block diagram: MIPS64 5Kc core with FPU, instruction cache (8KB), and CP, SysAD, and JTAG interfaces; vector unit with vector register file (8KB), flag register file (512B), Arith 0 and Arith 1, Flag 0 and Flag 1, and a memory unit with TLB; datapaths labeled 256b and 64b; DMA; memory crossbar connecting DRAM0 through DRAM7 (2MB each).]

4
Architecture Details

MIPS64 5Kc core (200 MHz)
Single-issue scalar core with 8 KByte I and D caches
Vector unit (200 MHz)
8 KByte register file (32 64b elements per
register)
256b datapaths, can be subdivided into 16b, 32b,
64b
2 arithmetic units (1 FP, single precision), 2 flag processing units
Memory unit
4 address generators for strided/indexed accesses
Main memory system
8 2-MByte DRAM macros
25ns random access time, 7.5ns page access time
Crossbar interconnect
12.8 GBytes/s peak bandwidth per direction
(load/store)
Off-chip interface
2-channel DMA engine and 64b SysAD bus
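
The register file figures on this slide imply a register count and a maximum vector length for each element width. A minimal sketch of that arithmetic, assuming (the slide does not say so) that the 8 KByte file is split into equal-sized registers and that narrower elements pack into the same register bits without padding:

```python
# Figures taken from the slide.
REG_FILE_BYTES = 8 * 1024   # 8 KByte vector register file
ELEMS_64B_PER_REG = 32      # 32 64b elements per register
DATAPATH_BITS = 256         # 256b datapaths

# Assumption: equal-sized registers, elements packed with no padding.
REG_BITS = ELEMS_64B_PER_REG * 64
NUM_VECTOR_REGS = REG_FILE_BYTES * 8 // REG_BITS


def max_vector_length(elem_bits):
    """Number of elements of the given width that fit in one register."""
    return REG_BITS // elem_bits


def elems_per_cycle(elem_bits):
    """Elements one 256b datapath can process per cycle."""
    return DATAPATH_BITS // elem_bits


if __name__ == "__main__":
    print("vector registers:", NUM_VECTOR_REGS)
    for width in (64, 32, 16):
        print(width, "bit:", max_vector_length(width), "elements per register,",
              elems_per_cycle(width), "per cycle per datapath")
```

Under these assumptions a full-length vector instruction occupies a datapath for several cycles at every element width, since a register holds more elements than the datapath processes per cycle.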

5
Floorplan

Technology: IBM SA-27E
0.18 µm CMOS, 6 metal layers
290 mm² die area
225 mm² for memory/logic
Transistor count: 150M
Power supply
1.2V for logic, 1.8V for DRAM
Typical power consumption: 2.0 W
0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM)
+ 0.3 W (misc)
Peak vector performance
1.6/3.2/6.4 Gops w/o multiply-add (64b/32b/16b
operations)
3.2/6.4/12.8 Gops w/ multiply-add
1.6 Gflops (single-precision)
Tape-out planned for March '01
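
The peak vector performance figures follow from the clock rate, the datapath width, and the number of arithmetic units given on the Architecture Details slide. A minimal sketch of that arithmetic, assuming a multiply-add is counted as two operations and that floating-point work runs on only one of the two units (the function and parameter names are illustrative):

```python
CLOCK_HZ = 200e6      # vector unit clock
DATAPATH_BITS = 256   # width of each arithmetic datapath
ARITH_UNITS = 2       # arithmetic units; only one handles FP


def peak_gops(elem_bits, units=ARITH_UNITS, madd=False):
    """Peak giga-operations per second at the given element width."""
    ops_per_cycle = (DATAPATH_BITS // elem_bits) * units
    if madd:
        ops_per_cycle *= 2  # assumption: multiply-add counts as two ops
    return ops_per_cycle * CLOCK_HZ / 1e9


if __name__ == "__main__":
    for width in (64, 32, 16):
        print(width, "bit:", peak_gops(width), "Gops,",
              peak_gops(width, madd=True), "Gops with multiply-add")
    # Single-precision FP: 32-bit elements on the one FP-capable unit.
    print("FP32:", peak_gops(32, units=1), "Gflops")
```

With these inputs the sketch reproduces the 1.6/3.2/6.4 Gops, 3.2/6.4/12.8 Gops, and 1.6 Gflops figures listed on the slide.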

6
Alternative Floorplans

VIRAM-8MB: 4 lanes, 8 MBytes, 190 mm², 3.2 Gops at 200 MHz (32-bit ops)
VIRAM-2Lanes: 2 lanes, 4 MBytes, 120 mm², 1.6 Gops at 200 MHz
VIRAM-Lite: 1 lane, 2 MBytes, 60 mm², 0.8 Gops at 200 MHz

7
VIRAM Compiler
[Diagram: frontends (C, C++, Fortran95) feed the optimizer (Cray's PDGCS), which feeds the code generators (T3D/T3E, C90/T90/SV1, SV2/VIRAM).]

Based on Cray's production compiler
Challenges:
narrow data types and scalar/vector memory
consistency
Advantages relative to media extensions:
powerful addressing modes and ISA independent of
datapath width

8
Exploiting On-Chip Bandwidth

Vector ISA uses high bandwidth to mask latency
Compiled matrix-vector multiplication: 2
flops/element
Easy compilation problem; stresses memory
bandwidth
Compare to 304 Mflops (64-bit) for Power3
(hand-coded)

Performance scales with number of lanes up to 4
Need more memory banks than default DRAM macro
for 8 lanes
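
To make the "2 flops/element" figure concrete, here is a minimal reference sketch of dense matrix-vector multiplication that counts its own floating-point operations. It is plain Python for illustration only, not the compiled VIRAM kernel. The `matrix_bytes_per_second` helper assumes each matrix element is fetched from memory exactly once and that the vector operand is reused from registers; that assumption is not stated on the slide.

```python
def matvec(a, x):
    """Dense y = A * x over nested lists; returns (y, flop_count)."""
    y = []
    flops = 0
    for row in a:
        acc = 0.0
        for a_ij, x_j in zip(row, x):
            acc += a_ij * x_j  # one multiply and one add per matrix element
            flops += 2
        y.append(acc)
    return y, flops


def matrix_bytes_per_second(flops_per_second, elem_bytes=8):
    """Matrix traffic needed to sustain a flop rate at 2 flops/element."""
    elements_per_second = flops_per_second / 2
    return elements_per_second * elem_bytes


if __name__ == "__main__":
    y, flops = matvec([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
    print(y, flops)
    # Traffic implied by the 304 Mflops (64-bit) comparison point.
    print(matrix_bytes_per_second(304e6) / 1e9, "GB/s")
```

Because every matrix element is used only once, the kernel has no reuse to hide behind, which is why it stresses memory bandwidth rather than arithmetic throughput.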

9
Compiling Media Kernels on IRAM

The compiler generates code for narrow data
widths, e.g., 16-bit integer
Compilation model is simple, more scalable
(across generations) than MMX, VIS, etc.

Strided and indexed loads/stores simpler than
pack/unpack
Maximum vector length is longer than datapath
width (256 bits); all lane scalings done with
single executable
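
One way to see why a single executable can serve every lane configuration is the usual strip-mined vector loop: the code is written against a vector length, not against the datapath width. The sketch below is a Python model rather than VIRAM code. `set_vector_length`, `strip_mined_add`, and `cycles_per_vector_op` are hypothetical names, and the 64b lane width is an assumption (a 256b datapath divided over 4 lanes).

```python
import math

LANE_BITS = 64  # assumption: each lane is 64b wide


def set_vector_length(remaining, mvl):
    """Model of a set-vector-length step: clamp the request to the maximum."""
    return min(remaining, mvl)


def strip_mined_add(a, b, mvl):
    """c = a + b, processed in strips of at most mvl elements."""
    if mvl < 1:
        raise ValueError("mvl must be at least 1")
    n = len(a)
    c = [0] * n
    strips = 0
    i = 0
    while i < n:
        vl = set_vector_length(n - i, mvl)
        # The slice assignment stands in for one vector add of length vl.
        c[i:i + vl] = [p + q for p, q in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
        strips += 1
    return c, strips


def cycles_per_vector_op(vl, lanes, elem_bits):
    """Cycles for one vector instruction, ignoring startup overhead."""
    elems_per_cycle = lanes * (LANE_BITS // elem_bits)
    return math.ceil(vl / elems_per_cycle)


if __name__ == "__main__":
    print(strip_mined_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], mvl=2))
    for lanes in (1, 2, 4):
        print(lanes, "lanes:", cycles_per_vector_op(128, lanes, 16), "cycles")
```

In this model the loop is identical for every lane count; only the cycles each vector instruction takes change, which is the contrast with fixed-width media extensions that the slide draws.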

10
IRAM Status

Chip
ISA has not changed significantly in over a year
Verilog complete, except SRAM for scalar cache
Testing framework in place
Compiler
Backend code generation complete
Continued performance improvements, especially
for narrow data widths
Application Benchmarks
Hand-coded kernels better than MMX, VIS, general-purpose DSPs
DCT, FFT, MVM, convolution, image composition,
Compiled kernels demonstrate ISA advantages
MVM, sparse MVM, decrypt, image composition,
Full applications: H.263 encoding (done), speech
(underway)

11
Scaling to 10K Processors

IRAM + micro-disk offer huge scaling
opportunities
Still many hard system problems (AME)
Availability
systems should continue to meet quality of
service goals despite hardware and software
failures
Maintainability
systems should require only minimal ongoing human
administration, regardless of scale or complexity
Evolutionary Growth
systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
These are problems at today's scales, and will
only get worse as systems grow

12
Is Maintenance the Key?

Rule of Thumb: Maintenance 10X HW
so over 5-year product life, 95% of cost is
maintenance

13
Hardware Techniques for AME

Cluster of Storage Oriented Nodes (SON)
Scalable, tolerates partial failures, automatic
redundancy
Heavily instrumented hardware
Sensors for temp, vibration, humidity, power,
intrusion
Independent diagnostic processor on each node
Remote control of power; collects environmental
data for
Diagnostic processors connected via independent
network
On-demand network partitioning/isolation
Allows testing, repair of online system
Managed by diagnostic processor
Built-in fault injection capabilities
Used for hardware introspection
Important for AME benchmarking

14
ISTORE-1 system

Hardware: plug-and-play intelligent devices with
self-monitoring, diagnostics, and fault injection
hardware
intelligence used to collect and filter
monitoring data
diagnostics and fault injection enhance
robustness
networked to create a scalable shared-nothing
cluster
Scheduled for 4Q 00

15
ISTORE-1 System Layout
[Rack layout figure. Legend: PE1000s are PowerEngines 100Mb switches; PE5200s are PowerEngines 1 Gb switches; UPSs used. Labeled components: PE1000s, two PE5200s, six UPSs.]

16
ISTORE Brick Node Block Diagram
[Block diagram: Mobile Pentium II module (CPU, North Bridge), DRAM 256 MB, PCI, South Bridge, Super I/O, BIOS, SCSI, disk (18 GB), Ethernets 4x100 Mb/s; diagnostic processor with dual UART, Flash, RTC, RAM, monitor and control, and diagnostic net.]

Sensors for heat and vibration
Control over power to individual nodes

17

ISTORE Brick Node
Pentium-II/266MHz
256 MB DRAM
18 GB SCSI (or IDE) disk
4x100Mb Ethernet
m68k diagnostic processor and CAN diagnostic
network
Packaged in standard half-height RAID array
canister

18
Software Techniques

Reactive introspection
Mining available system data
Proactive introspection
Isolation + fault insertion => test recovery code
Semantic redundancy
Use of coding and application-specific
checkpoints
Self-Scrubbing data structures
Check (and repair?) complex distributed
structures
Load adaptation for performance faults
Dynamic load balancing for regular computations
Benchmarking
Define quantitative evaluations for AME

19
Network Redundancy

Each brick node has four 100Mb Ethernets
TCP striping used for performance
Demonstration on 2-node prototype using 3 links
When a link fails, packets on that link are
dropped
Nodes detect failures using independent pings
More scalable approach being developed
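
The behavior described on this slide (stripe traffic across links, detect a dead link by pinging, stop using it) can be sketched as a small simulation. This is not the ISTORE implementation. `StripedChannel` and its methods are hypothetical names, and round-robin is assumed as the striping policy because the slide does not name one.

```python
class StripedChannel:
    """Round-robin striping over several links, skipping failed ones."""

    def __init__(self, num_links):
        self.alive = [True] * num_links
        self._next = 0

    def apply_ping_results(self, ping_ok):
        """Update link state from one round of per-link ping results."""
        self.alive = list(ping_ok)

    def pick_link(self):
        """Return the next live link in round-robin order."""
        n = len(self.alive)
        for _ in range(n):
            link = self._next
            self._next = (self._next + 1) % n
            if self.alive[link]:
                return link
        raise RuntimeError("no live links")

    def send(self, packets):
        """Assign each packet to a link; returns (link, packet) pairs."""
        return [(self.pick_link(), p) for p in packets]


if __name__ == "__main__":
    ch = StripedChannel(3)                 # 3 links, as in the demonstration
    print(ch.send(["a", "b", "c", "d", "e", "f"]))
    ch.apply_ping_results([True, False, True])   # link 1 stops answering pings
    print(ch.send(["g", "h", "i", "j"]))
```

After the failed ping, traffic continues on the remaining two links at reduced aggregate bandwidth. Packets already queued on the failed link are lost in this model, matching the slide's note that they are dropped.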

20
Load Balancing for Performance Faults

Failure is not always a discrete property
Some fraction of components may fail
Some components may perform poorly
Graph shows effect of Graduated Declustering on
cluster I/O with disk performance faults

21
Availability benchmarks

Goal: quantify variation in QoS as fault events
occur
Leverage existing performance benchmarks
to generate fair workloads
to measure and trace quality-of-service metrics
Use fault injection to compromise system
Results are most accessible graphically
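
As one hedged illustration of "quantify variation in QoS as fault events occur", the sketch below summarizes a QoS time series around a single injected fault. The metric definitions are assumptions made for illustration and are not taken from the ISTORE benchmarks: the baseline is the mean before the fault, and a sample counts as degraded when it falls below a chosen fraction of that baseline.

```python
def summarize_fault_run(qos, fault_index, tolerance=0.9):
    """Summarize QoS samples (e.g. requests/sec) around an injected fault.

    qos: list of samples taken at fixed intervals.
    fault_index: index of the first sample taken after fault injection.
    tolerance: fraction of baseline below which a sample is 'degraded'.
    """
    if not 0 < fault_index < len(qos):
        raise ValueError("need samples both before and after the fault")
    baseline = sum(qos[:fault_index]) / fault_index
    after = qos[fault_index:]
    threshold = tolerance * baseline
    degraded = [v for v in after if v < threshold]
    return {
        "baseline": baseline,
        "worst": min(after),
        "degraded_samples": len(degraded),
    }


if __name__ == "__main__":
    # Made-up trace: steady at 100, dips after a fault, then recovers.
    trace = [100, 100, 100, 100, 60, 70, 95, 100]
    print(summarize_fault_run(trace, fault_index=4))
```

The trace in the usage example is invented purely to exercise the function; it is not measured data.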

22
Example: Faults in Software RAID
[Graphs: Linux and Solaris panels]

Compares Linux and Solaris reconstruction
Linux: minimal performance impact but longer
window of vulnerability to second fault
Solaris: large perf. impact but restores
redundancy fast

23
Towards Manageability Benchmarks

Goal is to gain experience with a small piece of
the problem
can we measure the time and learning-curve costs
for one task?
Task: handling disk failure in RAID system
includes detection and repair
Same test systems as availability case study
Windows 2000/IIS, Linux/Apache, Solaris/Apache
Five test subjects and fixed training session
(Too small to draw statistical conclusions)

24
Sample results: time

Graphs plot human time, excluding wait time

25
Analysis of time results

Rapid convergence across all OSs/subjects
despite high initial variability
final plateau defines minimum time for task
plateau invariant over individuals/approaches
Clear differences in plateaus between OSs
Solaris < Windows < Linux
note: statistically dubious conclusion given
sample size!

26
ISTORE Status

ISTORE Hardware
All 80 Nodes (boards) manufactured
PCB backplane in layout
Finish 80 node system December 2000
Software
2-node system running -- boots OS
Diagnostic Processor SW and device driver done
Network striping done; fault adaptation ongoing
Load balancing for performance heterogeneity done
Benchmarking
Availability benchmark example complete
Initial maintainability benchmark complete,
revised strategy underway

27
BACKUP SLIDES

IRAM

28
IRAM Latency Advantage

1997 estimate: 5-10x improvement
No parallel DRAMs, memory controller, bus to turn
around, SIMM module, pins
30ns for IRAM (or much lower with DRAM redesign)
Compare to Alpha 600: 180 ns for 128b, 270 ns
for 512b
2000 estimate: 5x improvement
IRAM memory latency is 25 ns for 256 bits, fixed
pipeline delay
Alpha 4000/4100: 120 ns
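
The improvement factors quoted on this slide can be checked directly from the latencies it lists. A minimal sketch, using only the slide's numbers (the variable names are illustrative):

```python
# Latencies from the slide, in nanoseconds.
IRAM_1997_NS = 30
ALPHA_600_NS = {128: 180, 512: 270}   # keyed by access width in bits
IRAM_2000_NS = 25
ALPHA_4000_4100_NS = 120


def improvement(conventional_ns, iram_ns):
    """Latency ratio of a conventional system to IRAM."""
    return conventional_ns / iram_ns


if __name__ == "__main__":
    for bits, ns in ALPHA_600_NS.items():
        print("1997,", bits, "bit access:", improvement(ns, IRAM_1997_NS), "x")
    print("2000:", improvement(ALPHA_4000_4100_NS, IRAM_2000_NS), "x")
```

The 1997 ratios come out at 6x and 9x, inside the quoted 5-10x range. The 2000 ratio is 4.8x, which the slide rounds to 5x.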

29
IRAM Bandwidth Advantage

1997 estimate: 100x
1024 1Mbit modules, each 1Kb wide (1Gb chip)
10% @ 40 ns RAS/CAS = 320 GBytes/sec
If crossbar switch or multiple busses deliver
1/3 to 2/3 of total => 100 - 200 GBytes/sec
Compare to AlphaServer 8400: 1.2 GBytes/sec,
4100: 1.1 GBytes/sec
2000 estimate: 10-100x
VIRAM-1 16 MB chip divided into 8 banks
=> 51.2 GB/s peak from memory banks
Crossbar can consume 12.8 GB/s
6.4 GB/s from Vector Unit + 6.4 GB/s from either
scalar or I/O
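
The bandwidth figures on this slide can be rederived with a few lines of arithmetic. The sketch below relies on several assumptions that the slide does not state outright: the 1997 line is read as 10% of the modules being active per 40 ns RAS/CAS cycle, "1Kb" is taken as 1024 bits, and each 6.4 GB/s port is taken to be one 256b transfer per 200 MHz cycle (figures from the Architecture Details slide).

```python
def gbytes_per_sec(bits_per_access, accesses_per_sec):
    """Bandwidth in GBytes/sec (1 GByte = 1e9 bytes)."""
    return bits_per_access / 8 * accesses_per_sec / 1e9


# 1997 estimate: 1024 modules, each 1Kb wide, 10% active every 40 ns.
EST_1997 = gbytes_per_sec(1024 * 1024 * 0.10, 1 / 40e-9)

# 2000 estimate: one 256b transfer per cycle at 200 MHz per port.
PORT = gbytes_per_sec(256, 200e6)
CROSSBAR = 2 * PORT      # vector unit plus scalar-or-I/O
BANKS_PEAK = 8 * PORT    # assumption: each of 8 banks sustains one port's rate

ALPHASERVER_8400 = 1.2   # GBytes/sec, from the slide

if __name__ == "__main__":
    print("1997 raw estimate:", round(EST_1997, 2), "GB/s")
    print("port:", PORT, "crossbar:", CROSSBAR, "banks:", BANKS_PEAK)
    print("2000 ratio range:",
          round(CROSSBAR / ALPHASERVER_8400, 1), "to",
          round(BANKS_PEAK / ALPHASERVER_8400, 1))
```

Under these assumptions the 1997 figure comes out near 328 GBytes/sec, close to the slide's 320. The 2000 ratios against the AlphaServer 8400 fall at roughly 11x and 43x, inside the quoted 10-100x range.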

30
Power and Energy Advantages

1997: Case study of StrongARM memory hierarchy
vs. IRAM memory hierarchy
cell size advantages => much larger cache => fewer
off-chip references => up to 2X-4X energy
efficiency for memory-intensive algorithms
less energy per bit access for DRAM
Power target for VIRAM-1
2 watt goal
Based on preliminary spice runs, this looks very
feasible today
Scalar core included

31
Summary

IRAM takes advantage of high on-chip bandwidth
Vector IRAM ISA utilizes this bandwidth
Unit, strided, and indexed memory access patterns
supported
Exploits fine-grained parallelism, even with
pointer chasing
Compiler
Well-understood compiler model, semi-automatic
Still some work on code generation quality
Application benchmarks
Compiled and hand-coded
Include FFT, SVD, MVM, sparse MVM, and other
kernels used in image and signal processing