Title: ISTORE Update
1. ISTORE Update
- David Patterson
- University of California at Berkeley
- patterson@cs.berkeley.edu
- UC Berkeley IRAM Group
- UC Berkeley ISTORE Group
- istore-group@cs.berkeley.edu
- May 2000
2. Lampson: Systems Challenges
- Systems that work
  - Meeting their specs
  - Always available
  - Adapting to changing environment
  - Evolving while they run
  - Made from unreliable components
  - Growing without practical limit
- Credible simulations or analysis
- Writing good specs
- Testing
- Performance
  - Understanding when it doesn't matter
"Computer Systems Research: Past and Future," keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft.
3. Hennessy: What Should the New World Focus Be?
- Availability
  - Both appliance & service
- Maintainability
  - Two functions:
    - Enhancing availability by preventing failure
    - Ease of SW and HW upgrades
- Scalability
  - Especially of service
- Cost
  - Per device and per service transaction
- Performance
  - Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" keynote address, FCRC, May 1999. John Hennessy, Stanford.
4. The real scalability problems: AME
- Availability
  - systems should continue to meet quality-of-service goals despite hardware and software failures
- Maintainability
  - systems should require only minimal ongoing human administration, regardless of scale or complexity
- Evolutionary Growth
  - systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and will only get worse as systems grow
5. Principles for achieving AME (1)
- No single points of failure
- Redundancy everywhere
- Performance robustness is more important than peak performance
  - performance robustness implies that real-world performance is comparable to best-case performance
- Performance can be sacrificed for improvements in AME
  - resources should be dedicated to AME
  - compare: biological systems spend > 50% of their resources on maintenance
  - can make up performance by scaling the system
6. Principles for achieving AME (2)
- Introspection
  - reactive techniques to detect and adapt to failures, workload variations, and system evolution
  - proactive techniques to anticipate and avert problems before they happen
7. ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
  - cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
  - a single field-replaceable unit to simplify maintenance
  - each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
  - more CPU than NAS; fewer disks/node than a cluster
- Intelligent Disk "Brick": portable-PC CPU (Pentium II/266), DRAM, redundant NICs (4 x 100 Mb/s links), diagnostic processor
- ISTORE Chassis
  - 80 nodes, 8 per tray
  - 2 levels of switches: 20 x 100 Mbit/s, 2 x 1 Gbit/s
  - environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...
8. ISTORE-1 Status
- 10 nodes manufactured; 45 boards fabbed, 40 to go
- Boots OS
- Diagnostic Processor Interface SW complete
- PCB backplane not yet designed
- Finish 80-node system: Summer 2000
9. Hardware techniques
- Fully shared-nothing cluster organization
  - truly scalable architecture
  - architecture that tolerates partial failure
  - automatic hardware redundancy
10. Hardware techniques (2)
- No Central Processor Unit: distribute processing with storage
- Serial lines and switches are also growing with Moore's Law; less need today to centralize vs. bus-oriented systems
- Most storage servers are limited by the speed of their CPUs; why does this make sense?
- Why not amortize the sheet metal, power, and cooling infrastructure for the disk to add processor, memory, and network?
- If AME is important, must provide resources to be used to help AME: local processors responsible for the health and maintenance of their storage
11. Hardware techniques (3)
- Heavily instrumented hardware
  - sensors for temperature, vibration, humidity, power, intrusion
  - helps detect environmental problems before they can affect system integrity
- Independent diagnostic processor on each node
  - provides remote control of power, remote console access to the node, and selection of node boot code
  - collects, stores, and processes environmental data for abnormalities (see the sketch below)
  - non-volatile "flight recorder" functionality
  - all diagnostic processors connected via an independent diagnostic network
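A minimal sketch of the kind of environmental check the diagnostic processor performs. The sensor names, thresholds, and read_sensors() hook are illustrative assumptions; the actual diagnostic-processor firmware is not shown in these slides.

```python
from collections import deque

# Hypothetical sensors and thresholds, for illustration only.
THRESHOLDS = {"temp_C": (10, 45), "vibration_g": (0.0, 0.5), "power_W": (5, 60)}
flight_recorder = deque(maxlen=4096)   # stands in for the non-volatile log

def read_sensors():
    """Placeholder: would query the node's environmental sensors."""
    return {"temp_C": 32.0, "vibration_g": 0.1, "power_W": 28.0}

def check_environment():
    readings = read_sensors()
    flight_recorder.append(readings)   # always log, for post-mortem analysis
    alarms = [name for name, (lo, hi) in THRESHOLDS.items()
              if not lo <= readings[name] <= hi]
    if alarms:
        # in ISTORE this would be reported over the diagnostic network
        print(f"DIAG ALARM: {alarms} out of range in {readings}")
    return alarms
```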
12. Hardware techniques (4)
- On-demand network partitioning/isolation
  - Internet applications must remain available despite component failures, so a subset of the system can be isolated for preventive maintenance
  - allows testing and repair of an online system
  - managed by the diagnostic processor and network switches via the diagnostic network
13. Hardware techniques (5)
- Built-in fault-injection capabilities
  - power control to individual node components
  - injectable glitches into I/O and memory busses
  - managed by the diagnostic processor
- Used for proactive hardware introspection
  - automated detection of flaky components
  - controlled testing of error-recovery mechanisms
- Important for AME benchmarking (see next slide and the sketch below)
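A rough sketch of how a fault-injection campaign might drive these capabilities. The fault menu and inject() are illustrative stand-ins for commands sent over the diagnostic network, not the real interface.

```python
import random
import time

# Illustrative fault menu; the real mechanism is the diagnostic processor's
# power control and bus-glitch hardware.
FAULTS = ["node_power_off", "nic_glitch", "memory_bus_glitch"]

def inject(node_id, fault):
    """Placeholder for a command sent over the diagnostic network."""
    print(f"node {node_id}: injecting {fault}")

def fault_injection_campaign(nodes, duration_s=60, mean_interval_s=10):
    """Inject random single faults to exercise error-recovery paths."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        inject(random.choice(nodes), random.choice(FAULTS))
        time.sleep(random.expovariate(1.0 / mean_interval_s))
```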
14. Hardware techniques (6)
- Benchmarking
  - one reason for 1000X processor performance was the ability to measure (vs. debate) which design is better
    - e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
  - need AME benchmarks
    - "what gets measured gets done"
    - benchmarks shape a field
    - quantification brings rigor
15. Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks
  - to generate fair workloads
  - to measure and trace quality-of-service metrics
- Use fault injection to compromise the system
  - hardware faults (disk, memory, network, power)
  - software faults (corrupt input, driver error returns)
  - maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
  - the availability analogues of performance micro- and macro-benchmarks (see the sketch below)
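A sketch of the single-fault measurement loop implied by this methodology; run_workload_step() is a hypothetical stand-in for one interval of the underlying performance benchmark.

```python
import time

def run_workload_step():
    """Placeholder: one interval of the underlying performance benchmark;
    returns the QoS metric for that interval (e.g., hits/sec)."""
    return 100.0

def availability_run(total_s=600, fault_at_s=120, fault="disk_failure"):
    trace, start, injected = [], time.time(), False
    while (elapsed := time.time() - start) < total_s:
        if not injected and elapsed >= fault_at_s:
            print(f"injecting {fault} at t={elapsed:.0f}s")
            injected = True            # real version calls the injection hooks
        trace.append((elapsed, run_workload_step()))
    return trace                       # plotted as QoS vs. time (next slide)
```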
16. Benchmark Availability? Methodology for reporting results
- Results are most accessible graphically
  - plot change in QoS metrics over time
  - compare to "normal" behavior
  - 99% confidence intervals calculated from no-fault runs (see the sketch below)
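A sketch of deriving the 99% confidence band from no-fault runs and flagging samples that fall outside it, assuming roughly normally distributed per-interval QoS samples.

```python
import statistics
from math import sqrt

def normal_band(no_fault_samples, z=2.576):       # z for 99% two-sided
    """99% confidence interval for mean QoS, from no-fault runs."""
    mean = statistics.mean(no_fault_samples)
    sem = statistics.stdev(no_fault_samples) / sqrt(len(no_fault_samples))
    return mean - z * sem, mean + z * sem

def degraded_intervals(trace, band):
    """Points in a faulted run whose QoS falls outside normal behavior."""
    lo, hi = band
    return [(t, q) for t, q in trace if not lo <= q <= hi]
```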
17. Example single-fault result
[Graphs: QoS over time during RAID reconstruction after a single disk failure, Linux vs. Solaris]
- Compares Linux and Solaris reconstruction
  - Linux: minimal performance impact, but a longer window of vulnerability to a second fault
  - Solaris: large performance impact, but restores redundancy fast
18. Reconstruction Policy
- Linux: favors performance over data availability
  - automatically-initiated reconstruction using idle bandwidth
  - virtually no performance impact on the application
  - very long window of vulnerability (>1 hr for a 3 GB RAID)
- Solaris: favors data availability over application performance
  - automatically-initiated reconstruction at high bandwidth
  - as much as a 34% drop in application performance
  - short window of vulnerability (10 minutes for 3 GB)
- Windows: favors neither!
  - manually-initiated reconstruction at moderate bandwidth
  - as much as an 18% drop in application performance
  - somewhat short window of vulnerability (23 min for 3 GB)
19. Transient error-handling policy
- Linux is paranoid with respect to transients
  - stops using the affected disk (and reconstructs) on any error, transient or not
  - fragile: system is more vulnerable to multiple faults
  - disk-inefficient: wastes two disks per transient
  - but no chance of a slowly-failing disk impacting performance
- Solaris and Windows are more forgiving
  - both ignore most benign/transient faults
  - robust: less likely to lose data, more disk-efficient
  - less likely to catch slowly-failing disks and remove them
- Neither policy is ideal!
  - need a hybrid that detects streams of transients (see the sketch below)
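One possible hybrid, sketched below: tolerate isolated transients but eject a disk once errors cluster in time. The window and threshold values are illustrative assumptions, not a measured policy.

```python
import time
from collections import defaultdict, deque

WINDOW_S, THRESHOLD = 3600, 5      # eject after 5 errors within an hour
recent = defaultdict(deque)        # disk id -> timestamps of recent errors

def on_disk_error(disk, transient):
    if not transient:
        return "eject"             # hard errors still fail fast, like Linux
    now = time.time()
    q = recent[disk]
    q.append(now)
    while q and now - q[0] > WINDOW_S:
        q.popleft()                # forget errors outside the window
    # a stream of transients suggests a slowly-failing disk; a lone
    # transient is ignored, as Solaris and Windows do
    return "eject" if len(q) >= THRESHOLD else "ignore"
```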
20. Software techniques
- Fully-distributed, shared-nothing code
  - centralization breaks as systems scale up to O(10,000)
  - avoids single-point-of-failure front ends
- Redundant data storage
  - required for high availability; simplifies self-testing
  - replication at the level of application objects
    - application can control consistency policy
    - more opportunity for data-placement optimization
21. Software techniques (2)
- "River" storage interfaces
  - NOW Sort experience: performance heterogeneity is the norm
    - e.g., disks: outer vs. inner track (1.5X), fragmentation
    - e.g., processors: load (1.5-5x)
  - so: demand-driven delivery of data to apps
    - via distributed queues and graduated declustering
    - for apps that can handle unordered data delivery
  - automatically adapts to variations in the performance of producers and consumers (see the sketch below)
  - also helps with evolutionary growth of the cluster
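A minimal single-process sketch of the demand-driven idea, not the actual River or graduated-declustering code: producers push records into a shared queue, and each consumer pulls at whatever rate it can sustain, so slower nodes naturally take less work.

```python
import queue
import threading

work = queue.Queue(maxsize=1024)   # stands in for River's distributed queue

def producer(records):
    for r in records:
        work.put(r)                # blocks when all consumers fall behind
    work.put(None)                 # sentinel: end of stream

def consumer(name, process):
    while (r := work.get()) is not None:
        process(r)                 # a fast consumer simply loops more often
    work.put(None)                 # pass the sentinel on to other consumers

# e.g., two consumers of different speeds, each pulling at its own rate:
# for n in ("fast", "slow"):
#     threading.Thread(target=consumer, args=(n, print)).start()
# producer(range(10_000))
```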
22. Software techniques (3)
- Reactive introspection
  - use statistical techniques to identify "normal" behavior and detect deviations from it (see the sketch below)
  - policy-driven automatic adaptation to abnormal behavior once detected
    - initially, rely on a human administrator to specify policy
    - eventually, the system learns to solve problems on its own by experimenting on isolated subsets of the nodes
    - one candidate: reinforcement learning
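One statistical technique that fits this description, sketched with illustrative parameters: track an exponentially weighted moving average and variance of a metric, and flag samples far from the learned normal.

```python
class DeviationDetector:
    """Flags samples more than k standard deviations from learned normal."""

    def __init__(self, alpha=0.05, k=3.0):
        self.alpha, self.k = alpha, k  # smoothing factor and sensitivity
        self.mean, self.var = None, 0.0

    def observe(self, x):
        if self.mean is None:          # first sample defines the baseline
            self.mean = x
            return False
        dev = x - self.mean
        abnormal = self.var > 0 and abs(dev) > self.k * self.var ** 0.5
        self.mean += self.alpha * dev  # update the model either way
        self.var = (1 - self.alpha) * (self.var + self.alpha * dev * dev)
        return abnormal
```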
23. Software techniques (4)
- Proactive introspection
  - continuous online self-testing of HW and SW
    - in deployed systems!
    - goal is to shake out "Heisenbugs" before they're encountered in normal operation
    - needs data redundancy, node isolation, fault injection
  - techniques
    - fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
    - stress testing: pushing HW/SW to their limits
    - scrubbing: periodic restoration of potentially decaying hardware or software state (see the sketch below)
      - self-scrubbing data structures (like MVS)
      - ECC scrubbing for disks and memory
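A sketch of a periodic data scrubber along these lines; read_block() and repair_block() are hypothetical storage-layer hooks, and the checksum scheme and period are illustrative.

```python
import hashlib
import time

def scrub_pass(checksums, read_block, repair_block):
    """One pass over replicated blocks: verify each against its checksum."""
    for block_id, expected in checksums.items():
        if hashlib.sha1(read_block(block_id)).hexdigest() != expected:
            repair_block(block_id)   # decay caught before any client read it

def scrub_forever(checksums, read_block, repair_block, period_s=86400):
    while True:                      # low-priority background activity
        scrub_pass(checksums, read_block, repair_block)
        time.sleep(period_s)
```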
24. Initial Applications
- ISTORE is not one super-system that demonstrates all these techniques!
  - initially provide middleware and libraries to support the AME goals
- Initial application targets
  - cluster web/email servers
    - self-scrubbing data structures, online self-testing
    - statistical identification of normal behavior
  - information retrieval for multimedia data
    - self-scrubbing data structures, structuring performance-robust distributed computation
25. A glimpse into the future?
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- ISTORE HW in 5-7 years
  - building block: a 2006 MicroDrive integrated with IRAM
    - 9 GB disk, 50 MB/sec from disk
    - connected via a crossbar switch
  - if low power, 10,000 nodes fit into one rack!
- O(10,000) scale is our ultimate design point
26. Future Targets
- Maintenance in DoD application
- Security in Computer Systems
- Computer Vision
27. Maintenance in DoD systems
- Introspective middleware, built-in fault injection, a diagnostic computer, isolatable subsystems, ... should reduce maintenance of DoD hardware and software systems
- Is maintenance a major concern of DoD?
- Does improved maintenance fit within the goals of Polymorphous Computing Architecture?
28. Security in DoD Systems?
- Separate diagnostic processor and network give interesting security possibilities
  - monitoring of behavior by a separate computer
  - isolation of a portion of the cluster from the rest of the network
  - remote reboot and software installation
29. Attacking Computer Vision
- Analogy: computer vision recognition in 2000 is like computer speech recognition in 1985
  - pre-1985: community searching for good algorithms; classic AI vs. statistics?
  - by 1985: reached consensus on statistics
    - field focuses and makes progress, uses special hardware
    - systems become fast enough that they can be trained rather than preloaded with information, which accelerates progress
  - by 1995: speech recognition systems starting to deploy
  - by 2000: widely used, available on PCs
30. Computer Vision at Berkeley
- Jitendra Malik believes he has an approach that is very promising
- 2-step process
  - 1) Segmentation: divide the image into regions of coherent color, texture, and motion
  - 2) Recognition: combine regions and search an image database to find a match
- Algorithms for 1) work well, just slowly (300 seconds per image on a PC)
- Algorithms for 2) are being tested this summer using hundreds of PCs, which will determine accuracy
31. Human-Quality Computer Vision
- Suppose the algorithms work: what would it take to match human vision?
- At 30 images per second: segmentation
  - convolution and vector-matrix multiply of sparse matrices (10,000 x 10,000, 10 nonzeros/row)
  - 32-bit floating point
  - 300 seconds on a PC (assuming 333 MFLOPS) => ~100 GFLOPs/image
  - at 30 Hz => a ~3,000-GFLOPS machine to do segmentation (arithmetic below)
32. Human-Quality Computer Vision (2)
- At 1 image per second: object recognition
  - humans can remember 10,000 to 100,000 objects per category (e.g., 10k faces, 10k Chinese characters, a high-school vocabulary of 50k words, ...)
  - to recognize a 3D object, need ~10 2D views
  - 100 x 100 x 8 bits (or fewer) per view => 10,000 x 10 x 100 x 100 bytes, or ~10^9 bytes
  - pruning using color and texture, and organizing shapes into an index, reduces shape matches to ~1,000
  - compare 1,000 candidate merged regions with 1,000 candidate object images
  - if that takes 10 hours on a PC (333 MFLOPS) => ~12,000 GFLOPS to do it each second (arithmetic below)
33. ISTORE Successor Does Human-Quality Vision?
- 10,000 nodes with system-on-a-chip + Microdrive + network
- 1 to 10 GFLOPS/node => 10,000 to 100,000 GFLOPS
- High-bandwidth network
- 1 to 10 GB of disk storage per node => can replicate the images on each node
- Need dependability and maintainability advances to keep 10,000 nodes useful
- Human-quality vision useful for DoD apps? Retrainable recognition?
34. Conclusions (1): ISTORE
- Availability, maintainability, and evolutionary growth are the key challenges for server systems
  - more important even than performance
- ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
  - via clusters of network-attached, computationally-enhanced storage nodes running distributed code
  - via hardware and software introspection
  - we are currently performing application studies to investigate and compare techniques
- Availability benchmarks: a powerful tool?
  - revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000