ISTORE Update - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: ISTORE Update


1
ISTORE Update
  • David Patterson
  • University of California at Berkeley
  • patterson@cs.berkeley.edu
  • UC Berkeley IRAM Group
  • UC Berkeley ISTORE Group
  • istore-group@cs.berkeley.edu
  • May 2000

2
Lampson: Systems Challenges
  • Systems that work
  • Meeting their specs
  • Always available
  • Adapting to changing environment
  • Evolving while they run
  • Made from unreliable components
  • Growing without practical limit
  • Credible simulations or analysis
  • Writing good specs
  • Testing
  • Performance
  • Understanding when it doesn't matter

Computer Systems Research: Past and Future.
Keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft.
3
Hennessy: What Should the New World Focus Be?
  • Availability
  • Both appliance and service
  • Maintainability
  • Two functions
  • Enhancing availability by preventing failure
  • Ease of SW and HW upgrades
  • Scalability
  • Especially of service
  • Cost
  • per device and per service transaction
  • Performance
  • Remains important, but it's not SPECint

Back to the Future: Time to Return to Longstanding
Problems in Computer Systems?
Keynote address, FCRC, May 1999. John Hennessy, Stanford.
4
The real scalability problems: AME
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or complexity
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

5
Principles for achieving AME (1)
  • No single points of failure
  • Redundancy everywhere
  • Performance robustness is more important than
    peak performance
  • performance robustness implies that real-world
    performance is comparable to best-case
    performance
  • Performance can be sacrificed for improvements in
    AME
  • resources should be dedicated to AME
  • compare: biological systems spend > 50% of
    resources on maintenance
  • can make up performance by scaling system

6
Principles for achieving AME (2)
  • Introspection
  • reactive techniques to detect and adapt to
    failures, workload variations, and system
    evolution
  • proactive techniques to anticipate and avert
    problems before they happen

7
ISTORE-1 hardware platform
  • 80-node x86-based cluster, 1.4TB storage
  • cluster nodes are plug-and-play, intelligent,
    network-attached storage bricks
  • a single field-replaceable unit to simplify
    maintenance
  • each node is a full x86 PC w/256MB DRAM, 18GB
    disk
  • more CPU than NAS; fewer disks/node than cluster

Intelligent Disk Brick: portable PC CPU (Pentium
II/266), DRAM, redundant NICs (4 x 100 Mb/s links),
Diagnostic Processor
  • ISTORE Chassis
  • 80 nodes, 8 per tray
  • 2 levels of switches
  • 20 x 100 Mbit/s
  • 2 x 1 Gbit/s
  • Environment Monitoring
  • UPS, redundant PS,
  • fans, heat and vibration sensors...

8
ISTORE-1 Status
  • 10 nodes manufactured; 45 boards fabbed, 40 to go
  • Boots OS
  • Diagnostic Processor Interface SW complete
  • PCB backplane not yet designed
  • Finish 80-node system: Summer 2000

9
Hardware techniques
  • Fully shared-nothing cluster organization
  • truly scalable architecture
  • architecture that tolerates partial failure
  • automatic hardware redundancy

10
Hardware techniques (2)
  • No central processor unit: distribute processing
    with storage
  • Serial lines, switches also growing with Moore's
    Law; less need today to centralize vs. bus-
    oriented systems
  • Most storage servers limited by speed of CPUs;
    why does this make sense?
  • Why not amortize sheet metal, power, cooling
    infrastructure for disk to add processor, memory,
    and network?
  • If AME is important, must provide resources to be
    used to help AME: local processors responsible
    for health and maintenance of their storage

11
Hardware techniques (3)
  • Heavily instrumented hardware
  • sensors for temp, vibration, humidity, power,
    intrusion
  • helps detect environmental problems before they
    can affect system integrity
  • Independent diagnostic processor on each node
  • provides remote control of power, remote console
    access to the node, selection of node boot code
  • collects, stores, processes environmental data
    for abnormalities
  • non-volatile "flight recorder" functionality
  • all diagnostic processors connected via
    independent diagnostic network

12
Hardware techniques (4)
  • On-demand network partitioning/isolation
  • Internet applications must remain available
    despite failures of components, therefore can
    isolate a subset for preventative maintenance
  • Allows testing, repair of online system
  • Managed by diagnostic processor and network
    switches via diagnostic network

13
Hardware techniques (5)
  • Built-in fault injection capabilities
  • Power control to individual node components
  • Injectable glitches into I/O and memory busses
  • Managed by diagnostic processor
  • Used for proactive hardware introspection
  • automated detection of flaky components
  • controlled testing of error-recovery mechanisms
  • Important for AME benchmarking (see next slide)

14
Hardware techniques (6)
  • Benchmarking
  • One reason for 1000X processor performance was
    ability to measure (vs. debate) which is better
  • e.g., which is most important to improve: clock
    rate, clocks per instruction, or instructions
    executed?
  • Need AME benchmarks
  • what gets measured gets done
  • benchmarks shape a field
  • quantification brings rigor

15
Availability benchmark methodology
  • Goal: quantify variation in QoS metrics as events
    occur that affect system availability (see the
    sketch after this list)
  • Leverage existing performance benchmarks
  • to generate fair workloads
  • to measure and trace quality-of-service metrics
  • Use fault injection to compromise system
  • hardware faults (disk, memory, network, power)
  • software faults (corrupt input, driver error
    returns)
  • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads
  • the availability analogues of performance micro-
    and macro-benchmarks
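
A minimal sketch of the measurement loop described above. The workload
iteration and fault injector are hypothetical placeholders, not part of any
ISTORE tool; any existing performance benchmark could play these roles.

  import random
  import time

  def run_workload_iteration():
      """Placeholder: one interval of an existing performance benchmark."""
      time.sleep(0.01)
      return random.gauss(100.0, 5.0)   # e.g., observed requests/sec

  def inject_fault():
      """Placeholder: e.g., power off one disk via the diagnostic processor."""
      print("fault injected")

  qos_trace = []            # (elapsed seconds, QoS metric) samples
  start = time.time()
  fault_time = 5.0          # single-fault workload: one fault mid-run
  fault_done = False

  while time.time() - start < 10.0:
      if not fault_done and time.time() - start >= fault_time:
          inject_fault()
          fault_done = True
      qos_trace.append((time.time() - start, run_workload_iteration()))

  # qos_trace is then compared against a no-fault baseline (next slide).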

16
Benchmark Availability? Methodology for reporting
results
  • Results are most accessible graphically
  • plot change in QoS metrics over time
  • compare to normal behavior: 99% confidence
    intervals calculated from no-fault runs (sketch
    below)
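
A small sketch of deriving the "normal behavior" band from no-fault runs,
assuming a simple normal model for the QoS samples; the slide states the 99%
confidence interval but not the exact statistics, so this is illustrative.

  import statistics

  def normal_band(no_fault_samples, z=2.576):   # z for a ~99% two-sided interval
      mean = statistics.mean(no_fault_samples)
      sd = statistics.stdev(no_fault_samples)
      return mean - z * sd, mean + z * sd

  baseline = [98.2, 101.5, 99.7, 100.3, 97.9, 102.1]   # hypothetical no-fault QoS
  low, high = normal_band(baseline)
  # Faulted-run samples outside [low, high] are plotted as degraded QoS.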

17
Example single-fault result
(Graphs: QoS over time during RAID reconstruction, Linux vs. Solaris)
  • Compares Linux and Solaris reconstruction
  • Linux: minimal performance impact but longer
    window of vulnerability to second fault
  • Solaris: large perf. impact but restores
    redundancy fast

18
Reconstruction Policy
  • Linux favors performance over data availability
  • automatically-initiated reconstruction, idle
    bandwidth
  • virtually no performance impact on application
  • very long window of vulnerability (> 1 hr for 3GB
    RAID)
  • Solaris favors data availability over app. perf.
  • automatically-initiated reconstruction at high BW
  • as much as 34% drop in application performance
  • short window of vulnerability (10 minutes for
    3GB)
  • Windows favors neither!
  • manually-initiated reconstruction at moderate BW
  • as much as 18% app. performance drop
  • somewhat short window of vulnerability (23
    min/3GB)

19
Transient error handling Policy
  • Linux is paranoid with respect to transients
  • stops using affected disk (and reconstructs) on
    any error, transient or not
  • fragile: system is more vulnerable to multiple
    faults
  • disk-inefficient: wastes two disks per transient
  • but no chance of slowly-failing disk impacting
    perf.
  • Solaris and Windows are more forgiving
  • both ignore most benign/transient faults
  • robust: less likely to lose data, more
    disk-efficient
  • less likely to catch slowly-failing disks and
    remove them
  • Neither policy is ideal!
  • need a hybrid that detects streams of transients
    (sketched after this list)
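
One possible shape of that hybrid, sketched with illustrative window and
threshold values (not values measured in the ISTORE work): isolated
transients are ignored, a stream of errors retires the disk.

  import time
  from collections import deque

  class TransientErrorFilter:
      def __init__(self, window_seconds=60.0, max_errors=3):
          self.window = window_seconds
          self.max_errors = max_errors
          self.timestamps = deque()

      def record_error(self, now=None):
          """Return True if the disk should be retired (error stream detected)."""
          now = time.time() if now is None else now
          self.timestamps.append(now)
          while self.timestamps and now - self.timestamps[0] > self.window:
              self.timestamps.popleft()
          return len(self.timestamps) >= self.max_errors

  f = TransientErrorFilter()
  assert f.record_error(0.0) is False    # single transient: keep using the disk
  assert f.record_error(10.0) is False
  assert f.record_error(20.0) is True    # three errors in a minute: retire it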

20
Software techniques
  • Fully-distributed, shared-nothing code
  • centralization breaks as systems scale up to
    O(10,000)
  • avoids single-point-of-failure front ends
  • Redundant data storage
  • required for high availability, simplifies
    self-testing
  • replication at the level of application objects
  • application can control consistency policy
  • more opportunity for data placement optimization

21
Software techniques (2)
  • River storage interfaces
  • NOW-Sort experience: performance heterogeneity
    is the norm
  • e.g., disks: outer vs. inner track (1.5X),
    fragmentation
  • e.g., processors: load (1.5-5x)
  • So: demand-driven delivery of data to apps (see
    the sketch after this list)
  • via distributed queues and graduated declustering
  • for apps that can handle unordered data delivery
  • Automatically adapts to variations in performance
    of producers and consumers
  • Also helps with evolutionary growth of cluster
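
A minimal single-machine sketch of the pull-based idea: producers fill a
shared queue and each consumer takes records as fast as it can, so a slow
node simply handles fewer of them. The real River system is distributed and
uses graduated declustering across replicas, which this does not show.

  import collections
  import queue
  import threading
  import time

  work = queue.Queue()
  for record in range(200):
      work.put(record)

  processed = collections.Counter()

  def consumer(name, delay):
      while True:
          try:
              record = work.get_nowait()
          except queue.Empty:
              return
          time.sleep(delay)          # simulate heterogeneous node speed
          processed[name] += 1

  threads = [threading.Thread(target=consumer, args=("fast", 0.001)),
             threading.Thread(target=consumer, args=("slow", 0.005))]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  print(processed)   # the fast consumer ends up with most of the records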

22
Software techniques (3)
  • Reactive introspection
  • Use statistical techniques to identify normal
    behavior and detect deviations from it (sketched
    after this list)
  • Policy-driven automatic adaptation to abnormal
    behavior once detected
  • initially, rely on human administrator to specify
    policy
  • eventually, system learns to solve problems on
    its own by experimenting on isolated subsets of
    the nodes
  • one candidate: reinforcement learning
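
A toy sketch of the statistical detection idea, using a rolling mean and
standard deviation as an assumed stand-in for whatever statistics the
project actually used; the metric, window, and threshold are illustrative.

  import statistics
  from collections import deque

  class DeviationDetector:
      def __init__(self, window=100, k=3.0):
          self.history = deque(maxlen=window)
          self.k = k

      def observe(self, value):
          """Return True if value deviates from learned normal behavior."""
          abnormal = False
          if len(self.history) >= 10:
              mean = statistics.mean(self.history)
              sd = statistics.stdev(self.history) or 1e-9
              abnormal = abs(value - mean) > self.k * sd
          self.history.append(value)
          return abnormal

  d = DeviationDetector()
  for latency_ms in [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 250]:
      if d.observe(latency_ms):
          print("abnormal latency:", latency_ms)   # trigger policy-driven adaptation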

23
Software techniques (4)
  • Proactive introspection
  • Continuous online self-testing of HW and SW
  • in deployed systems!
  • goal is to shake out Heisenbugs before they're
    encountered in normal operation
  • needs data redundancy, node isolation, fault
    injection
  • Techniques
  • fault injection: triggering hardware and software
    error-handling paths to verify their
    integrity/existence
  • stress testing: push HW/SW to their limits
  • scrubbing: periodic restoration of potentially
    decaying hardware or software state (sketched
    after this list)
  • self-scrubbing data structures (like MVS)
  • ECC scrubbing for disks and memory
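
A sketch of one scrubbing pass, assuming a hypothetical layout of replicated
blocks with stored checksums; repair itself (copying from a good replica) is
left out.

  import hashlib

  def checksum(block: bytes) -> str:
      return hashlib.sha1(block).hexdigest()

  def scrub(replicas, stored_checksums):
      """Compare every replica of every block against its recorded checksum;
      return (block_id, replica_id) pairs that need repair."""
      bad = []
      for block_id, copies in replicas.items():
          for replica_id, data in copies.items():
              if checksum(data) != stored_checksums[block_id]:
                  bad.append((block_id, replica_id))
      return bad

  # Hypothetical example: replica 1 of block "b0" has silently decayed.
  replicas = {"b0": {0: b"hello", 1: b"hellp"}}
  stored = {"b0": checksum(b"hello")}
  print(scrub(replicas, stored))    # [('b0', 1)] -> repair from a good copy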

24
Initial Applications
  • ISTORE is not one super-system that demonstrates
    all these techniques!
  • Initially provide middleware, library to support
    AME goals
  • Initial application targets
  • cluster web/email servers
  • self-scrubbing data structures, online
    self-testing
  • statistical identification of normal behavior
  • information retrieval for multimedia data
  • self-scrubbing data structures, structuring
    performance-robust distributed computation

25
A glimpse into the future?
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • ISTORE HW in 5-7 years
  • building block: 2006 MicroDrive integrated with
    IRAM
  • 9GB disk, 50 MB/sec from disk
  • connected via crossbar switch
  • If low power, 10,000 nodes fit into one rack!
  • O(10,000) scale is our ultimate design point

26
Future Targets
  • Maintenance in DoD application
  • Security in Computer Systems
  • Computer Vision

27
Maintenance in DoD systems
  • Introspective Middleware, Built-in Fault
    Injection, Diagnostic Computer, Isolatable
    Subsystems ... should reduce Maintenance of DoD
    Hardware and Software systems
  • Is Maintenance a major concern of DoD?
  • Does Improved Maintenance fit within Goals of
    Polymorphous Computing Architecture?

28
Security in DoD Systems?
  • Separate Diagnostic Processor and Network gives
    interesting Security possibilities
  • Monitoring of behavior by separate computer
  • Isolation of portion of cluster from rest of
    network
  • Remote reboot, software installation

29
Attacking Computer Vision
  • Analogy: computer vision recognition in 2000 is
    like computer speech recognition in 1985
  • Pre-1985: community searching for good algorithms;
    classic AI vs. statistics?
  • By 1985: reached consensus on statistics
  • Field focuses and makes progress, uses special
    hardware
  • Systems become fast enough that they can be
    trained rather than preloaded with information,
    which accelerates progress
  • By 1995: speech recognition systems starting to
    deploy
  • By 2000: widely used, available on PCs

30
Computer Vision at Berkeley
  • Jitendra Malik believes he has an approach that is
    very promising
  • 2-step process
  • 1) Segmentation: divide image into regions of
    coherent color, texture, and motion
  • 2) Recognition: combine regions and search an
    image database to find a match
  • Algorithms for 1) work well, just slowly (300
    seconds per image on a PC)
  • Algorithms for 2) are being tested this summer
    using hundreds of PCs to determine accuracy

31
Human Quality Computer Vision
  • Suppose the algorithms work: what would it take to
    match human vision?
  • At 30 images per second: segmentation
  • Convolution and Vector-Matrix Multiply of Sparse
    Matrices (10,000 x 10,000, 10 nonzero/row)
  • 32-bit Floating Point
  • 300 seconds on a PC (assuming 333 MFLOPS) =>
    ~100 GFLOPs/image
  • 30 Hz => 3000-GFLOPS machine to do segmentation
    (checked in the sketch below)
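
A back-of-envelope check of these numbers; the 333 MFLOPS PC rate is the
slide's own assumption.

  pc_flops = 333e6                    # assumed PC rate: 333 MFLOPS
  flop_per_image = 300 * pc_flops     # 300 s/image -> ~1e11 = ~100 GFLOP
  required = 30 * flop_per_image      # 30 images/s -> ~3e12 = ~3000 GFLOPS
  print(flop_per_image / 1e9, required / 1e9)   # ~100, ~3000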

32
Human Quality Computer Vision
  • At 1 per second: object recognition
  • Human can remember 10,000 to 100,000 objects per
    category (e.g., 10k faces, 10k Chinese
    characters, high school vocabulary of 50k words,
    ...)
  • To recognize a 3D object, need 10 2D views
  • 100 x 100 x 8 bits (or fewer) per view =>
    10,000 x 10 x 100 x 100 bytes, or 10^9 bytes
  • Pruning using color and texture and by organizing
    shapes into an index reduces shape matches to
    1000
  • Compare 1000 candidate merged regions with 1000
    candidate object images
  • If 10 hours on a PC (333 MFLOPS) => 12,000 GFLOPS
    (checked in the sketch below)
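
The same kind of back-of-envelope check for the recognition estimates on
this slide.

  image_bytes = 10_000 * 10 * 100 * 100          # objects x views x pixels = 1e9 bytes
  pc_flops = 333e6                               # assumed PC rate: 333 MFLOPS
  flop_per_recognition = 10 * 3600 * pc_flops    # 10 hours on that PC -> ~1.2e13 FLOP
  print(image_bytes, flop_per_recognition / 1e9) # 1e9 bytes, ~12,000 GFLOP
  # Doing one recognition per second therefore needs ~12,000 GFLOPS.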

33
ISTORE Successor does Human Quality Vision?
  • 10,000 nodes with System-On-A-Chip, MicroDrive,
    and network
  • 1 to 10 GFLOPS/node => 10,000 to 100,000 GFLOPS
  • High Bandwidth Network
  • 1 to 10 GB of disk storage per node => can
    replicate images per node
  • Need Dependability, Maintainability advances to
    keep 10,000 nodes useful
  • Human quality vision useful for DoD Apps?
    Retrainable recognition?

34
Conclusions (1) ISTORE
  • Availability, Maintainability, and Evolutionary
    growth are key challenges for server systems
  • more important even than performance
  • ISTORE is investigating ways to bring AME to
    large-scale, storage-intensive servers
  • via clusters of network-attached,
    computationally-enhanced storage nodes running
    distributed code
  • via hardware and software introspection
  • we are currently performing application studies
    to investigate and compare techniques
  • Availability benchmarks: a powerful tool?
  • revealed undocumented design decisions affecting
    SW RAID availability on Linux and Windows 2000