IRAM and ISTORE Projects

1
IRAM and ISTORE Projects

Aaron Brown, James Beck, Rich Fromm, Joe Gebis,
Paul Harvey, Adam Janin, Dave Judd,
Kimberly Keeton, Christoforos Kozyrakis, David
Martin, Rich Martin, Thinh Nguyen, David
Oppenheimer, Steve Pope, Randi Thomas,
Noah Treuhaft, Sam Williams, John
Kubiatowicz, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/istore
Fall 2000 DIS DARPA Meeting

2
IRAM and ISTORE Vision

Integrated processor in memory provides efficient
access to high memory bandwidth

Two Post-PC applications
IRAM: Single-chip system for embedded and
portable applications
Target: media processing (speech, images, video,
audio)
ISTORE: Building block when combined with disk
for storage and retrieval servers
Up to 10K nodes in one rack
Non-IRAM prototype addresses key scaling issues:
availability, manageability, evolution

Photo from Itsy, Inc.
3
IRAM Overview

A processor architecture for embedded/portable
systems running media applications
Based on media processing and embedded DRAM
Simple, scalable, energy and area efficient
Good compiler target

[Block diagram. Labels: MIPS64 5Kc Core, Instr Cache (8KB), FPU, CP IF, SysAD IF, Vector Register File (8KB), Flag Register File (512B), Arith 0, Arith 1, Flag 0, Flag 1, Memory Unit, TLB, JTAG IF, JTAG, DMA, Memory Crossbar, DRAM0 (2MB), DRAM1 (2MB), ... DRAM7 (2MB); datapath widths 256b and 64b]
4
Architecture Details

MIPS64 5Kc core (200 MHz)
Single-issue scalar core with 8 KByte I/D caches
Vector unit (200 MHz)
8 KByte register file (32 64b elements per
register)
256b datapaths, can be subdivided into 16b, 32b,
64b
2 arithmetic (1 FP, single), 2 flag processing
Memory unit
4 address generators for strided/indexed accesses
Main memory system
8 2-MByte DRAM macros
25ns random access time, 7.5ns page access time
Crossbar interconnect
12.8 GBytes/s peak bandwidth per direction
(load/store)
Off-chip interface
2-channel DMA engine and 64b SysAD bus

5
Floorplan

Technology: IBM SA-27E
0.18 µm CMOS, 6 metal layers
290 mm² die area
225 mm² for memory/logic
Transistor count: 150M
Power supply
1.2V for logic, 1.8V for DRAM
Typical power consumption: 2.0 W
0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM)
+ 0.3 W (misc)
Peak vector performance
1.6/3.2/6.4 Gops w/o multiply-add (64b/32b/16b
operations)
3.2/6.4/12.8 Gops w/ madd
1.6 Gflops (single-precision)
Tape-out planned for March '01
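
The peak rates on this slide can be reproduced from the datapath figures on the Architecture Details slide. A minimal sketch, assuming each arithmetic unit completes one operation per element per cycle (two when a multiply-add is counted as two operations) and that only one unit handles floating point; the function and parameter names are illustrative and not taken from the slides:

```python
# Peak-rate arithmetic for the vector unit (illustrative sketch).
CLOCK_HZ = 200e6       # vector unit clock, from the slides
DATAPATH_BITS = 256    # width of each arithmetic datapath, from the slides
ARITH_UNITS = 2        # arithmetic units, from the slides


def peak_gops(elem_bits, units=ARITH_UNITS, ops_per_elem=1):
    """Peak giga-operations per second for a given element width."""
    elems_per_cycle = DATAPATH_BITS // elem_bits
    return units * elems_per_cycle * ops_per_elem * CLOCK_HZ / 1e9


if __name__ == "__main__":
    for bits in (64, 32, 16):
        plain = peak_gops(bits)
        madd = peak_gops(bits, ops_per_elem=2)  # assumption: madd counts as 2 ops
        print(f"{bits}b: {plain} Gops, {madd} Gops with multiply-add")
    # Assumption: single precision uses one unit on 32-bit elements, no madd.
    print("single-precision:", peak_gops(32, units=1), "Gflops")
```

Under these assumptions the sketch matches the 1.6/3.2/6.4 Gops, 3.2/6.4/12.8 Gops, and 1.6 Gflops figures quoted above.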

6
Alternative Floorplans

VIRAM-8MB
4 lanes, 8 MBytes
190 mm²
3.2 Gops at 200 MHz (32-bit ops)

VIRAM-2Lanes
2 lanes, 4 MBytes
120 mm²
1.6 Gops at 200 MHz

VIRAM-Lite
1 lane, 2 MBytes
60 mm²
0.8 Gops at 200 MHz
7
VIRAM Compiler
[Compiler structure diagram: frontends (C, C++, Fortran95) -> optimizer (Cray's PDGCS) -> code generators (T3D/T3E, C90/T90/SV1, SV2/VIRAM)]

Based on Cray's production compiler
Challenges:
narrow data types and scalar/vector memory
consistency
Advantages relative to media extensions:
powerful addressing modes and ISA independent of
datapath width

8
Exploiting On-Chip Bandwidth

Vector ISA uses high bandwidth to mask latency
Compiled matrix-vector multiplication: 2
Flops/element
Easy compilation problem; stresses memory
bandwidth
Compare to 304 Mflops (64-bit) for Power3
(hand-coded)

Performance scales with number of lanes up to 4
Need more memory banks than default DRAM macro
for 8 lanes
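
Why matrix-vector multiplication stresses memory bandwidth can be seen by counting work per matrix element. A minimal sketch in plain Python (not the compiled benchmark from the slide; the function name and counters are illustrative): each matrix element is loaded once and used for exactly one multiply and one add, so there is no reuse to hide behind a cache.

```python
def mvm_with_counts(a, x):
    """Compute y = A * x and count flops and matrix-element loads."""
    rows, cols = len(a), len(x)
    y = [0.0] * rows
    flops = 0
    loads = 0
    for i in range(rows):
        acc = 0.0
        for j in range(cols):
            acc += a[i][j] * x[j]  # one multiply and one add
            flops += 2
            loads += 1             # each matrix element is read exactly once
        y[i] = acc
    return y, flops, loads


if __name__ == "__main__":
    y, flops, loads = mvm_with_counts([[1, 2], [3, 4]], [1, 1])
    print(y, flops / loads, "flops per matrix element")
```

With only 2 flops per element loaded, the achievable rate is set by how fast matrix elements can be streamed from memory rather than by arithmetic throughput.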

9
Compiling Media Kernels on IRAM

The compiler generates code for narrow data
widths, e.g., 16-bit integer
Compilation model is simple, more scalable
(across generations) than MMX, VIS, etc.

Strided and indexed loads/stores simpler than
pack/unpack
Maximum vector length is longer than datapath
width (256 bits); all lane scalings done with
single executable

10
Vector Vs. SIMD Example

Simple image processing example:
conversion from RGB to YUV
Y = (9798*R + 19235*G + 3736*B) / 32768
U = (-4784*R - 9437*G + 4221*B) / 32768 + 128
V = (20218*R - 16941*G - 3277*B) / 32768 + 128
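
The conversion above is fixed-point arithmetic: integer coefficients scaled by 32768, followed by a divide that can be done as a right shift by 15. A minimal scalar sketch in Python, assuming the arithmetic operators lost in the transcript are `*`, `+`, and `-` as shown in the code, and using an arithmetic right shift (which rounds toward negative infinity) in place of the division; the function name is illustrative:

```python
SHIFT = 15  # 2**15 == 32768, the scale factor of the coefficients


def rgb_to_yuv(r, g, b):
    """Convert one 8-bit RGB pixel using the slide's integer coefficients."""
    y = (9798 * r + 19235 * g + 3736 * b) >> SHIFT
    # Coefficients are kept exactly as transcribed from the slide.
    u = ((-4784 * r - 9437 * g + 4221 * b) >> SHIFT) + 128
    v = ((20218 * r - 16941 * g - 3277 * b) >> SHIFT) + 128
    return y, u, v


if __name__ == "__main__":
    for pixel in [(0, 0, 0), (255, 0, 0), (255, 255, 255)]:
        print(pixel, "->", rgb_to_yuv(*pixel))
```

As transcribed, the V coefficients sum to zero but the U coefficients do not, so a gray input does not map to U = 128. A B coefficient of 14221 would restore that property, which hints that a digit may have been lost; the sketch keeps the transcribed value.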

11
VIRAM Code (22 instructions)

RGBtoYUV:
  vlds.u.b    r_v, r_addr, stride3, addr_inc   # load R
  vlds.u.b    g_v, g_addr, stride3, addr_inc   # load G
  vlds.u.b    b_v, b_addr, stride3, addr_inc   # load B
  xlmul.u.sv  o1_v, t0_s, r_v                  # calculate Y
  xlmadd.u.sv o1_v, t1_s, g_v
  xlmadd.u.sv o1_v, t2_s, b_v
  vsra.vs     o1_v, o1_v, s_s
  xlmul.u.sv  o2_v, t3_s, r_v                  # calculate U
  xlmadd.u.sv o2_v, t4_s, g_v
  xlmadd.u.sv o2_v, t5_s, b_v
  vsra.vs     o2_v, o2_v, s_s
  vadd.sv     o2_v, a_s, o2_v
  xlmul.u.sv  o3_v, t6_s, r_v                  # calculate V
  xlmadd.u.sv o3_v, t7_s, g_v
  xlmadd.u.sv o3_v, t8_s, b_v
  vsra.vs     o3_v, o3_v, s_s
  vadd.sv     o3_v, a_s, o3_v
  vsts.b      o1_v, y_addr, stride3, addr_inc  # store Y

12
MMX Code (part 1)

RGBtoYUV:
movq mm1, [eax]
pxor mm6, mm6
movq mm0, mm1
psrlq mm1, 16
punpcklbw mm0, ZEROS
movq mm7, mm1
punpcklbw mm1, ZEROS
movq mm2, mm0
pmaddwd mm0, YR0GR
movq mm3, mm1
pmaddwd mm1, YBG0B
movq mm4, mm2
pmaddwd mm2, UR0GR
movq mm5, mm3
pmaddwd mm3, UBG0B
punpckhbw mm7, mm6
pmaddwd mm4, VR0GR
paddd mm0, mm1

paddd mm4, mm5
movq mm5, mm1
psllq mm1, 32
paddd mm1, mm7
punpckhbw mm6, ZEROS
movq mm3, mm1
pmaddwd mm1, YR0GR
movq mm7, mm5
pmaddwd mm5, YBG0B
psrad mm0, 15
movq TEMP0, mm6
movq mm6, mm3
pmaddwd mm6, UR0GR
psrad mm2, 15
paddd mm1, mm5
movq mm5, mm7
pmaddwd mm7, UBG0B
psrad mm1, 15
pmaddwd mm3, VR0GR

13
MMX Code (part 2)

paddd mm6, mm7
movq mm7, mm1
psrad mm6, 15
paddd mm3, mm5
psllq mm7, 16
movq mm5, mm7
psrad mm3, 15
movq TEMPY, mm0
packssdw mm2, mm6
movq mm0, TEMP0
punpcklbw mm7, ZEROS
movq mm6, mm0
movq TEMPU, mm2
psrlq mm0, 32
paddw mm7, mm0
movq mm2, mm6
pmaddwd mm2, YR0GR
movq mm0, mm7
pmaddwd mm7, YBG0B

movq mm4, mm6
pmaddwd mm6, UR0GR
movq mm3, mm0
pmaddwd mm0, UBG0B
paddd mm2, mm7
pmaddwd mm4,
pxor mm7, mm7
pmaddwd mm3, VBG0B
punpckhbw mm1,
paddd mm0, mm6
movq mm6, mm1
pmaddwd mm6, YBG0B
punpckhbw mm5,
movq mm7, mm5
paddd mm3, mm4
pmaddwd mm5, YR0GR
movq mm4, mm1
pmaddwd mm4, UBG0B
psrad mm0, 15

14
MMX Code (part 3; 121 instructions)

pmaddwd mm7, UR0GR
psrad mm3, 15
pmaddwd mm1, VBG0B
psrad mm6, 15
paddd mm4, OFFSETD
packssdw mm2, mm6
pmaddwd mm5, VR0GR
paddd mm7, mm4
psrad mm7, 15
movq mm6, TEMPY
packssdw mm0, mm7
movq mm4, TEMPU
packuswb mm6, mm2
movq mm7, OFFSETB
paddd mm1, mm5
paddw mm4, mm7
psrad mm1, 15
movq [ebx], mm6
packuswb mm4,

movq [ecx], mm4
packuswb mm5, mm3
add ebx, 8
add ecx, 8
movq [edx], mm5
dec edi
jnz RGBtoYUV

15
IRAM Status

Chip
ISA has not changed significantly in over a year
Verilog complete, except SRAM for scalar cache
Testing framework in place
Compiler
Backend code generation complete
Continued performance improvements, especially
for narrow data widths
Application Benchmarks
Handcoded kernels better than MMX, VIS, general-purpose DSPs
DCT, FFT, MVM, convolution, image composition, ...
Compiled kernels demonstrate ISA advantages
MVM, sparse MVM, decrypt, image composition, ...
Full applications: H.263 encoding (done), speech
(underway)

16
Scaling to 10K Processors

IRAM + micro-disk offer huge scaling
opportunities
Still many hard system problems (AME)
Availability
systems should continue to meet quality of
service goals despite hardware and software
failures
Maintainability
systems should require only minimal ongoing human
administration, regardless of scale or complexity
Evolutionary Growth
systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
These are problems at today's scales, and will
only get worse as systems grow

17
Is Maintenance the Key?

Rule of Thumb: Maintenance = 10X HW
so over 5-year product life, 95% of cost is
maintenance

18
Hardware Techniques for AME

Cluster of Storage Oriented Nodes (SON)
Scalable, tolerates partial failures, automatic
redundancy
Heavily instrumented hardware
Sensors for temp, vibration, humidity, power,
intrusion
Independent diagnostic processor on each node
Remote control of power; collects environmental
data for
Diagnostic processors connected via independent
network
On-demand network partitioning/isolation
Allows testing, repair of online system
Managed by diagnostic processor
Built-in fault injection capabilities
Used for hardware introspection
Important for AME benchmarking

19
ISTORE-1 system

Hardware: plug-and-play intelligent devices with
self-monitoring, diagnostics, and fault injection
hardware
intelligence used to collect and filter
monitoring data
diagnostics and fault injection enhance
robustness
networked to create a scalable shared-nothing
cluster
Scheduled for 4Q '00

20
ISTORE-1 System Layout
[Rack layout figure. Legend: PE1000s = PowerEngines 100Mb switches; PE5200s = PowerEngines 1 Gb switches; UPSs = used. The layout shows PE1000s, two PE5200s, and six UPSs.]
21
ISTORE Brick Node Block Diagram
[Block diagram. Labels: Mobile Pentium II Module (CPU, North Bridge), DRAM 256 MB, PCI, South Bridge, Super I/O, BIOS, SCSI, Disk (18 GB), Ethernets 4x100 Mb/s, DUAL UART, Diagnostic Processor (Flash, RTC, RAM), Monitor Control, Diagnostic Net]

Sensors for heat and vibration
Control over power to individual nodes
22

ISTORE Brick Node
Pentium-II/266MHz
256 MB DRAM
18 GB SCSI (or IDE) disk
4x100Mb Ethernet
m68k diagnostic processor and CAN diagnostic
network
Packaged in standard half-height RAID array
canister

23
Software Techniques

Reactive introspection
Mining available system data
Proactive introspection
Isolation + fault insertion => test recovery code
Semantic redundancy
Use of coding and application-specific
checkpoints
Self-Scrubbing data structures
Check (and repair?) complex distributed
structures
Load adaptation for performance faults
Dynamic load balancing for regular computations
Benchmarking
Define quantitative evaluations for AME

24
Network Redundancy

Each brick node has four 100 Mb Ethernets
TCP striping used for performance
Demonstration on 2-node prototype using 3 links
When a link fails, packets on that link are
dropped
Nodes detect failures using independent pings
More scalable approach being developed
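
The slides do not describe how striping and failure detection are implemented. As a rough illustration only, the sketch below models round-robin striping over the links currently believed to be up, with link state updated from independent ping results; the class and method names are hypothetical, and it does not model retransmission of packets dropped on a failed link.

```python
class StripedChannel:
    """Round-robin striping of packets over links believed to be up."""

    def __init__(self, n_links):
        self.up = [True] * n_links
        self._next = 0

    def report_ping(self, link, ok):
        """Record the result of an independent ping on one link."""
        self.up[link] = ok

    def pick_link(self):
        """Return the next live link in round-robin order."""
        n = len(self.up)
        for _ in range(n):
            link = self._next
            self._next = (self._next + 1) % n
            if self.up[link]:
                return link
        raise RuntimeError("no live links")

    def send(self, packets):
        """Assign each packet to a link; returns (link, packet) pairs."""
        return [(self.pick_link(), p) for p in packets]


if __name__ == "__main__":
    ch = StripedChannel(3)            # the demonstration used 3 links
    print(ch.send(["a", "b", "c"]))
    ch.report_ping(1, ok=False)       # a ping on link 1 fails
    print(ch.send(["d", "e", "f"]))   # traffic now avoids link 1
```

After a failed ping the remaining links carry all traffic, so throughput degrades in proportion to the number of lost links instead of dropping to zero.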

25
Load Balancing for Performance Faults

Failure is not always a discrete property
Some fraction of components may fail
Some components may perform poorly
Graph shows effect of Graduated Declustering on
cluster I/O with disk performance faults

26
Availability benchmarks

Goal: quantify variation in QoS as fault events
occur
Leverage existing performance benchmarks
to generate fair workloads
to measure and trace quality of service metrics
Use fault injection to compromise system
Results are most accessible graphically
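
One way to turn "variation in QoS as fault events occur" into a number is to compare a QoS trace recorded during fault injection against a band of normal behavior from fault-free runs. The sketch below is an assumption about how such an analysis could look, not the project's actual tooling: the band is taken as the baseline mean minus `k` standard deviations, and both `k` and the function names are illustrative.

```python
from statistics import mean, stdev


def normal_band(baseline, k=2.0):
    """Return (low, high) bounds of normal QoS from fault-free samples."""
    m, s = mean(baseline), stdev(baseline)
    return m - k * s, m + k * s


def degraded_intervals(trace, band):
    """Return (start, end) sample-index pairs where QoS is below the band."""
    low, _ = band
    intervals = []
    start = None
    for i, q in enumerate(trace):
        if q < low and start is None:
            start = i                      # degradation begins
        elif q >= low and start is not None:
            intervals.append((start, i))   # degradation ends before sample i
            start = None
    if start is not None:
        intervals.append((start, len(trace)))
    return intervals


if __name__ == "__main__":
    baseline = [100, 102, 98, 101, 99]          # e.g. hits/sec without faults
    trace = [100, 99, 60, 62, 80, 98, 101]      # fault injected near sample 2
    print(degraded_intervals(trace, normal_band(baseline)))
```

The length of each interval is the time the system spent outside its normal QoS range after a fault, which is the kind of quantity a graph of the trace makes visible.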

27
Example: Faults in Software RAID
[Graphs: Linux, Solaris]

Compares Linux and Solaris reconstruction
Linux: minimal performance impact but longer
window of vulnerability to second fault
Solaris: large performance impact but restores
redundancy fast

28
Towards Manageability Benchmarks

Goal is to gain experience with a small piece of
the problem
can we measure the time and learning-curve costs
for one task?
Task: handling disk failure in RAID system
includes detection and repair
Same test systems as availability case study
Windows 2000/IIS, Linux/Apache, Solaris/Apache
Five test subjects and fixed training session
(Too small to draw statistical conclusions)

29
Sample results: time

Graphs plot human time, excluding wait time

30
Analysis of time results

Rapid convergence across all OSs/subjects
despite high initial variability
final plateau defines minimum time for task
plateau invariant over individuals/approaches
Clear differences in plateaus between OSs
Solaris < Windows < Linux
note: statistically dubious conclusion given
sample size!

31
ISTORE Status

ISTORE Hardware
All 80 Nodes (boards) manufactured
PCB backplane in layout
Finish 80 node system December 2000
Software
2-node system running -- boots OS
Diagnostic Processor SW and device driver done
Network striping done; fault adaptation ongoing
Load balancing for performance heterogeneity done
Benchmarking
Availability benchmark example complete
Initial maintainability benchmark complete,
revised strategy underway

32
BACKUP SLIDES

IRAM

33
Modular Vector Unit Design

Single 64b lane design replicated 4 times
Reduces design and testing time
Provides a simple scaling model (up or down)
without major control or datapath redesign
Lane scaling independent of DRAM scaling
Most instructions require only intra-lane
interconnect
Tolerance to interconnect delay scaling

34
Performance: FFT (1)
35
Performance: FFT (2)
36
Media Kernel Performance
37
Base-line system comparison

All numbers in cycles/pixel
MMX and VIS results assume all data in L1 cache

38
Vector Architecture State
39
Vector Instruction Set

Complete load-store vector instruction set
Uses the MIPS64 ISA coprocessor 2 opcode space
Ideas work with any core CPU: ARM, PowerPC, ...
Architecture state
32 general-purpose vector registers
32 vector flag registers
Data types supported in vectors
64b, 32b, 16b (and 8b)
91 arithmetic and memory instructions
Not specified by the ISA
Maximum vector register length
Functional unit datapath width
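
Because the maximum vector length is not fixed by the ISA, code has to work for whatever length an implementation provides. A common way to do this, assumed here and not spelled out on the slides, is strip mining: process the data in strips of at most the hardware's maximum vector length. The Python sketch below models that loop structure; `mvl` stands in for the implementation-defined maximum vector length and the function name is illustrative.

```python
def vector_add(a, b, mvl):
    """Element-wise add, processed in strips of at most `mvl` elements."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = min(mvl, n - i)  # vector length for this strip
        # One "vector instruction" operating on vl elements.
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out


if __name__ == "__main__":
    a = list(range(100))
    b = list(range(100, 200))
    # The same code gives the same result for any maximum vector length.
    print(vector_add(a, b, mvl=32) == vector_add(a, b, mvl=128))
```

Only the number of loop iterations changes with `mvl`, which is why one executable can run unchanged on implementations with different lane counts.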

40
Compiler/OS Enhancements

Compiler support
Conditional execution of vector instruction
Using the vector flag registers
Support for software speculation of load
operations
Operating system support
MMU-based virtual memory
Restartable arithmetic exceptions
Valid and dirty bits for vector registers
Tracking of maximum vector length used

41
BACKUP SLIDES

ISTORE

42
ISTORE: A Server for the PostPC Era

Aaron Brown, Dave Martin, David Oppenheimer, Noah
Treuhaft, Dave Patterson, Katherine Yelick
University of California at Berkeley
Patterson@cs.berkeley.edu
UC Berkeley ISTORE Group
istore-group@cs.berkeley.edu
August 2000

43
ISTORE as Storage System of the Future

Availability, Maintainability, and Evolutionary
growth: key challenges for storage systems
Maintenance cost > 10X purchase cost per year
Even 2X purchase cost for 1/2 maintenance cost
wins
AME improvement enables even larger systems
ISTORE also has cost-performance advantages
Better space, power/cooling costs (@ colocation
site)
More MIPS, cheaper MIPS, no bus bottlenecks
Compression reduces network $, encryption
protects
Single interconnect, supports evolution of
technology, single network technology to
maintain/understand
Match to future software storage services
Future storage service software targets clusters

44
Lampson: Systems Challenges

Systems that work
Meeting their specs
Always available
Adapting to changing environment
Evolving while they run
Made from unreliable components
Growing without practical limit
Credible simulations or analysis
Writing good specs
Testing
Performance
Understanding when it doesn't matter

"Computer Systems Research: Past and Future,"
Keynote address, 17th SOSP, Dec. 1999, Butler
Lampson, Microsoft
45
Jim Gray: Trouble-Free Systems

Manager
Sets goals
Sets policy
Sets budget
System does the rest.
Everyone is a CIO (Chief Information Officer)
Build a system
used by millions of people each day
Administered and managed by a ½ time person.
On hardware fault, order replacement part
On overload, order additional equipment
Upgrade hardware and software automatically.

"What Next? A dozen remaining IT
problems," Turing Award Lecture, FCRC, May
1999, Jim Gray, Microsoft
46
Jim Gray: Trustworthy Systems

Build a system used by millions of people that
Only services authorized users
Service cannot be denied (can't destroy data or
power).
Information cannot be stolen.
Is always available (out less than 1 second per
100 years = 8 9s of availability)
1950s: 90% availability; today: 99% uptime for
web sites, 99.99% for well-managed sites
(50 minutes/year); 3 extra 9s in 45 years.
Goal: 5 more 9s: 1 second per century.
And prove it.
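
The relationship between "9s" of availability and downtime is simple arithmetic. A small sketch, assuming a 365-day year; the function names are illustrative:

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_seconds_per_year(availability):
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def nines(availability):
    """Number of 9s of availability, e.g. 0.99 -> about 2."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    for a in (0.90, 0.99, 0.9999):
        minutes = downtime_seconds_per_year(a) / 60
        print(f"{a}: {nines(a):.1f} nines, {minutes:.1f} minutes/year")
```

For 99.99% availability this gives roughly 53 minutes of downtime per year, in line with the "50 minutes/year" quoted on the slide.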

47
Hennessy: What Should the New World Focus Be?

Availability
Both appliance & service
Maintainability
Two functions
Enhancing availability by preventing failure
Ease of SW and HW upgrades
Scalability
Especially of service
Cost
per device and per service transaction
Performance
Remains important, but it's not SPECint

"Back to the Future: Time to Return to
Longstanding Problems in Computer Systems?"
Keynote address, FCRC, May 1999, John
Hennessy, Stanford
48
The real scalability problems: AME

Availability
systems should continue to meet quality of
service goals despite hardware and software
failures
Maintainability
systems should require only minimal ongoing human
administration, regardless of scale or
complexity; today, cost of maintenance = 10-100X
cost of purchase
Evolutionary Growth
systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
These are problems at today's scales, and will
only get worse as systems grow

49
Principles for achieving AME

No single points of failure, lots of redundancy
Performance robustness is more important than
peak performance
Performance can be sacrificed for improvements in
AME
resources should be dedicated to AME
biological systems spend > 50% of resources on
maintenance
can make up performance by scaling system
Introspection
reactive techniques to detect and adapt to
failures, workload variations, and system
evolution
proactive techniques to anticipate and avert
problems before they happen

50
Hardware Techniques (1) SON

SON: Storage Oriented Nodes
Distribute processing with storage
If AME really important, provide resources!
Most storage servers limited by speed of CPUs!!
Amortize sheet metal, power, cooling, network for
disk to add processor, memory, and a real
network?
Embedded processors 2/3 perf, 1/10 cost, power?
Serial lines, switches also growing with Moore's
Law; less need today to centralize vs. bus-
oriented systems
Advantages of cluster organization
Truly scalable architecture
Architecture that tolerates partial failure
Automatic hardware redundancy

51
Hardware techniques (2)

Heavily instrumented hardware
sensors for temp, vibration, humidity, power,
intrusion
helps detect environmental problems before they
can affect system integrity
Independent diagnostic processor on each node
provides remote control of power, remote console
access to the node, selection of node boot code
collects, stores, processes environmental data
for abnormalities
non-volatile flight recorder functionality
all diagnostic processors connected via
independent diagnostic network

52
Hardware techniques (3)

On-demand network partitioning/isolation
Internet applications must remain available
despite failures of components, therefore can
isolate a subset for preventative maintenance
Allows testing, repair of online system
Managed by diagnostic processor and network
switches via diagnostic network
Built-in fault injection capabilities
Power control to individual node components
Injectable glitches into I/O and memory busses
Managed by diagnostic processor
Used for proactive hardware introspection
automated detection of flaky components
controlled testing of error-recovery mechanisms

53
Hardware culture (4)

Benchmarking
One reason for 1000X processor performance was
ability to measure (vs. debate) which is better
e.g., Which most important to improve clock
rate, clocks per instruction, or instructions
executed?
Need AME benchmarks
what gets measured gets done
benchmarks shape a field
quantification brings rigor

54
Example single-fault result
[Graphs: Linux, Solaris]

Compares Linux and Solaris reconstruction
Linux: minimal performance impact but longer
window of vulnerability to second fault
Solaris: large performance impact but restores
redundancy fast

55
Deriving ISTORE

What is the interconnect?
FC-AL? (Interoperability? Cost of switches?)
Infiniband? (When? Cost of switches? Cost of
NIC?)
Gbit Ethernet?
Pick Gbit Ethernet as commodity switch, link
As mainstream, fastest improving in cost-
performance
We assume Gbit Ethernet switches will get cheap
over time (Network Processors, volume, ...)

56
Deriving ISTORE

Number of Disks / Gbit port?
Bandwidth of 2000 disk
Raw bit rate: 427 Mbit/sec
Data transfer rate: 40.2 MByte/sec
Capacity: 73.4 GB
Disk trends
BW: 40%/year
Capacity, areal density, $/MB: 100%/year
2003 disks
500 GB capacity (<8X)
110 MB/sec or 0.9 Gbit/sec (2.75X)
Number of Disks / Gbit port = 1
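
The 2003 projection is compound growth applied to the 2000 disk. A short sketch, assuming the growth rates on the slide are annual percentages (40%/year for bandwidth, up to 100%/year for capacity) compounded over three years; the helper name is illustrative:

```python
def project(value, annual_growth, years):
    """Compound `value` by `annual_growth` (0.40 means 40%/year)."""
    return value * (1.0 + annual_growth) ** years


bw_2003 = project(40.2, 0.40, 3)    # MByte/s, from the 2000 transfer rate
cap_2003 = project(73.4, 1.00, 3)   # GB, upper bound at 100%/year (8X)

if __name__ == "__main__":
    print(f"bandwidth: {bw_2003:.0f} MB/s = {bw_2003 * 8 / 1000:.2f} Gbit/s")
    print(f"capacity: up to {cap_2003:.0f} GB")
```

Three years at 40%/year is a factor of about 2.74, giving roughly 110 MB/s, so a single disk comes close to filling one Gbit port. The capacity bound of about 587 GB is consistent with the slide's more conservative "500 GB (<8X)".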

57
ISTORE-1 Brick

Webster's Dictionary: "brick: a handy-sized unit
of building or paving material typically being
rectangular and about 2 1/4 x 3 3/4 x 8 inches"
ISTORE-1 Brick: 2 x 4 x 11 inches (1.3x)
Single physical form factor, fixed cooling
required, compatible network interface to
simplify physical maintenance, scaling over time
Contents should evolve over time: contains most
cost-effective MPU, DRAM, disk, compatible NI
If useful, could have special bricks (e.g., DRAM
rich)
Suggests network that will last, evolve: Ethernet

58
ISTORE-1 hardware platform

80-node x86-based cluster, 1.4TB storage
cluster nodes are plug-and-play, intelligent,
network-attached storage bricks
a single field-replaceable unit to simplify
maintenance
each node is a full x86 PC w/256MB DRAM, 18GB
disk
more CPU than NAS; fewer disks/node than cluster

59
Common Question: RAID?

Switched Network sufficient for all types of
communication, including redundancy
Hierarchy of buses is generally not superior to
switched network
Veritas, others offer software RAID 5 and
software Mirroring (RAID 1)
Another use of processor per disk

60
A Case for Intelligent Storage

Advantages
Cost of Bandwidth
Cost of Space
Cost of Storage System v. Cost of Disks
Physical Repair, Number of Spare Parts
Cost of Processor Complexity
Cluster advantages dependability, scalability
1 v. 2 Networks

61
Cost of Space, Power, Bandwidth

Co-location sites (e.g., Exodus) offer space,
expandable bandwidth, stable power
Charge $1000/month per rack (~10 sq. ft.)
Includes one 20-amp circuit/rack; charges
$100/month per extra 20-amp circuit/rack
Bandwidth cost: $500 per Mbit/sec/month

62
Cost of Bandwidth, Safety

Network bandwidth cost is significant
1000 Mbit/sec/month => $6,000,000/year
Security will increase in importance for storage
service providers
XML => server format conversion for gadgets
=> Storage systems of future need greater
computing ability
Compress to reduce cost of network bandwidth 3X;
save $4M/year?
Encrypt to protect information in transit for B2B
=> Increasing processing/disk for future storage
apps
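
The bandwidth figures follow from the co-location price quoted on the previous slide. A minimal sketch, assuming that price is 500 dollars per Mbit/sec per month (the currency symbol is missing from the transcript) and that compression reduces the purchased bandwidth by its ratio; the names are illustrative:

```python
PRICE_PER_MBIT_MONTH = 500  # assumed dollars per Mbit/sec per month


def yearly_bandwidth_cost(mbit_per_sec, compression_ratio=1.0):
    """Yearly cost of a link after compressing traffic by the given ratio."""
    return mbit_per_sec / compression_ratio * PRICE_PER_MBIT_MONTH * 12


full = yearly_bandwidth_cost(1000)
compressed = yearly_bandwidth_cost(1000, compression_ratio=3)

if __name__ == "__main__":
    print(f"uncompressed: {full:,.0f} per year")
    print(f"3X compression saves {full - compressed:,.0f} per year")
```

At 1000 Mbit/sec this reproduces the 6,000,000 per year on the slide, and 3X compression saves about 4,000,000 of it, which is the argument for putting more processing next to each disk.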

63
Cost of Space, Power

Sun Enterprise server/array (64 CPUs/60 disks)
10K Server (64 CPUs): 70 x 50 x 39 in.
A3500 Array (60 disks): 74 x 24 x 36 in.
2 Symmetra UPS (11KW): 2 x (52 x 24 x 27 in.)
ISTORE-1: 2X savings in space
ISTORE-1: 1 rack (big) switches, 1 rack (old)
UPSs, 1 rack for 80 CPUs/disks (3/8 VME rack
unit/brick)
ISTORE-2: 8X-16X space?
Space, power cost/year for 1000 disks: Sun
$924k, ISTORE-1 $484k, ISTORE-2 $50k

64
Disk Limit Bus Hierarchy
[Bus hierarchy diagram. Labels: Server (CPU, Memory bus, Memory, Internal I/O bus (PCI)); Storage Area Network (FC-AL); RAID bus, Mem; External I/O bus (SCSI); Disk Array (15 disks/bus)]

Data rate vs. disk rate
SCSI Ultra3 (80 MHz), Wide (16 bit): 160
MByte/s
FC-AL: 1 Gbit/s = 125 MByte/s
Use only 50% of a bus
Command overhead (~20%)
Queuing theory (< 70%)
65
Physical Repair, Spare Parts

ISTORE: Compatible modules based on hot-pluggable
interconnect (LAN) with few Field Replaceable
Units (FRUs): node, power supplies, switches,
network cables
Replace node (disk, CPU, memory, NI) if any fail
Conventional: Heterogeneous system with many
server modules (CPU, backplane, memory cards, ...)
and disk array modules (controllers, disks, array
controllers, power supplies, ...)
Store all components available somewhere as FRUs
Sun Enterprise 10k has 100 types of spare parts
Sun 3500 Array has 12 types of spare parts

66
ISTORE: Complexity v. Perf

Complexity increase
HP PA-8500: issues 4 instructions per clock cycle,
56 instructions out-of-order execution, 4 Kbit
branch predictor, 9-stage pipeline, 512 KB I
cache, 1024 KB D cache (> 80M transistors just in
caches)
Intel XScale: 16 KB I, 16 KB D, 1 instruction,
in-order execution, no branch prediction, 6-stage
pipeline
Complexity costs in development time, development
power, die size, cost
550 MHz HP PA-8500: 477 mm², 0.25 micron/4M, $330,
60 Watts
1000 MHz Intel StrongARM2 (XScale) @ 1.5 Watts,
800 MHz @ 0.9 W, 50 MHz @ 0.01 W, 0.18 micron
(old chip: 50 mm², 0.35 micron, $18)
=> Count $ for system, not processors/disk

67
ISTORE: Cluster Advantages

Architecture that tolerates partial failure
Automatic hardware redundancy
Transparent to application programs
Truly scalable architecture
Given maintenance is 10X-100X capital costs,
cluster size limits today are maintenance, floor
space cost - generally NOT capital costs
As a result, it is THE target architecture for
new software apps for Internet

68
ISTORE: 1 vs. 2 networks

Current systems all have LAN + Disk interconnect
(SCSI, FC-AL)
LAN is improving fastest, most investment, most
features
SCSI, FC-AL poor network features, improving
slowly, relatively expensive for switches,
bandwidth
FC-AL switches don't interoperate
Two sets of cables, wiring?
SysAdmin trained in 2 networks, SW interface,
???
Why not single network based on best HW/SW
technology?
Note there can be still 2 instances of the
network (e.g. external, internal), but only one
technology

69
Initial Applications

ISTORE-1 is not one super-system that
demonstrates all these techniques!
Initially provide middleware, library to support
AME
Initial application targets
information retrieval for multimedia data (XML
storage?)
self-scrubbing data structures, structuring
performance-robust distributed computation
Example home video server using XML interfaces
email service
self-scrubbing data structures, online
self-testing
statistical identification of normal behavior

70
A glimpse into the future?

System-on-a-chip enables computer, memory,
redundant network interfaces without
significantly increasing size of disk
ISTORE HW in 5-7 years

2006 brick System On a Chip integrated with
MicroDrive
9GB disk, 50 MB/sec from disk
connected via crossbar switch
From brick to domino
If low power, 10,000 nodes fit into one rack!
O(10,000) scale is our ultimate design point

71
Conclusion: ISTORE as Storage System of the
Future

Availability, Maintainability, and Evolutionary
growth: key challenges for storage systems
Maintenance Cost 10X Purchase Cost per year, so
over 5-year product life, 95% of cost of
ownership
Even 2X purchase cost for 1/2 maintenance cost
wins
AME improvement enables even larger systems
ISTORE has cost-performance advantages
Better space, power/cooling costs (@ colocation
site)
More MIPS, cheaper MIPS, no bus bottlenecks
Compression reduces network $, encryption
protects
Single interconnect, supports evolution of
technology, single network technology to
maintain/understand
Match to future software storage services
Future storage service software targets clusters

72
Questions?

Contact us if you're interested: email
patterson@cs.berkeley.edu, http://iram.cs.berkeley.edu/
"If it's important, how can you say if it's
impossible if you don't try?"
Jean Monnet, a founder of European Union

73
Clusters and TPC Software 8/00

TPC-C: 6 of Top 10 performance are clusters,
including all of Top 5; 4 SMPs
TPC-H: SMPs and NUMAs
100 GB: All SMPs (4-8 CPUs)
300 GB: All NUMAs (IBM/Compaq/HP 32-64 CPUs)
TPC-R: All are clusters
1000 GB: NCR World Mark 5200
TPC-W: All web servers are clusters (IBM)

74
Clusters and TPC-C Benchmark

Top 10 TPC-C Performance (Aug. 2000), Ktpm

Rank  System                       Type     Ktpm
 1    Netfinity 8500R c/s          Cluster   441
 2    ProLiant X700-96P            Cluster   262
 3    ProLiant X550-96P            Cluster   230
 4    ProLiant X700-64P            Cluster   180
 5    ProLiant X550-64P            Cluster   162
 6    AS/400e 840-2420             SMP       152
 7    Fujitsu GP7000F Model 2000   SMP       139
 8    RISC S/6000 Ent. S80         SMP       139
 9    Bull Escala EPC 2400 c/s     SMP       136
10    Enterprise 6500 Cluster      Cluster   135

75
Cost of Storage System v. Disks

Examples show cost of way we build current
systems (2 networks, many buses, CPU, ...)

System      Date   Cost   Maint.  Disks  Disks/CPU  Disks/IObus
NCR WM      10/97  $8.3M  --      1312   10.2       5.0
Sun 10k     3/98   $5.2M  --      668    10.4       7.0
Sun 10k     9/99   $6.2M  $2.1M   1732   27.0       12.0
IBM Netinf  7/00   $7.8M  $1.8M   7040   55.0       9.0

=> Too complicated, too heterogeneous
And databases are often CPU or bus bound!
ISTORE disks per CPU: 1.0
ISTORE disks per I/O bus: 1.0

76
Common Question: Why Not Vary Number of
Processors and Disks?

Argument: if can vary numbers of each to match
application, more cost-effective solution?
Alternative Model 1: Dual Nodes + E-switches
P-node: Processor, Memory, 2 Ethernet NICs
D-node: Disk, 2 Ethernet NICs
Response
As D-nodes are running network protocol, still need
processor and memory, just smaller; how much
is saved?
Saves processors/disks, costs more NICs/switches:
N ISTORE nodes vs. N/2 P-nodes + N D-nodes
Isn't ISTORE-2 a good HW prototype for this
model? Only run the communication protocol on N
nodes, run the full app and OS on N/2

77
Common Question: Why Not Vary Number of
Processors and Disks?

Alternative Model 2: N Disks/node
Processor, Memory, N disks, 2 Ethernet NICs
Response
Potential I/O bus bottleneck as disk BW grows
2.5" ATA drives are limited to 2/4 disks per ATA
bus
How does a research project pick N? What's
natural?
Is there sufficient processing power and memory
to run the AME monitoring and testing tasks as
well as the application requirements?
Isn't ISTORE-2 a good HW prototype for this
model? Software can act as simple disk interface
over network and run a standard disk protocol,
and then run that on N nodes per apps/OS node.
Plenty of Network BW available in redundant
switches