Computers for the Post-PC Era

Transcript and Presenter's Notes


1
Computers for the Post-PC Era
  • David Patterson
  • University of California at Berkeley
  • Patterson@cs.berkeley.edu
  • UC Berkeley IRAM Group
  • UC Berkeley ISTORE Group
  • istore-group@cs.berkeley.edu
  • 10 February 2000

2
Perspective on Post-PC Era
  • Post-PC Era will be driven by 2 technologies
  • 1) Tiny Embedded or Mobile Consumer Devices
  • e.g., successor to PDA, cell phone, wearable
    computers
  • ubiquitous in everything
  • 2) Infrastructure to Support such Devices
  • e.g., successor to Big Fat Web Servers, Database
    Servers

3
Outline
  • 1) One instance of microprocessors for gadgets
  • 2) Motivation and the ISTORE project vision
  • AME = Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

4
Intelligent RAM: IRAM
  • Microprocessor + DRAM on a single chip
  • 10X capacity vs. SRAM
  • on-chip memory latency 5-10X, bandwidth 50-100X
  • improve energy efficiency 2X-4X (no off-chip
    bus)
  • serial I/O 5-10X v. buses
  • smaller board area/volume
  • IRAM advantages extend to
  • a single chip system
  • a building block for larger systems

5
Revive Vector Architecture
  • Cost ($1M each)? Single-chip CMOS MPU/IRAM
  • Low latency, high BW memory system? IRAM
  • Code density? Much smaller than VLIW
  • Compilers? For sale, mature (> 20 years)
    (we retarget Cray compilers)
  • Performance? Easy to scale speed with technology
  • Power/Energy? Parallel to save energy, keep
    performance
  • Limited to scientific applications? Multimedia
    apps vectorizable too: N x 64b, 2N x 32b, 4N x 16b

6
V-IRAM1 Low Power v. High Perf.

[Block diagram: a 2-way superscalar scalar processor with 16K I-cache and
16K D-cache, a vector instruction queue feeding vector registers and a
load/store unit, vector datapaths configurable as 4 x 64b, 8 x 32b, or
16 x 16b, serial I/O, and a memory crossbar switch with multiple 4 x 64b
paths to the on-chip DRAM banks.]
7
VIRAM-1 System on a Chip
  • Prototype scheduled for tape-out mid 2000
  • 0.18 um EDL process
  • 16 MB DRAM, 8 banks
  • MIPS scalar core and
    caches @ 200 MHz
  • 4 x 64-bit vector unit
    pipelines @ 200 MHz
  • 4 x 100 MB/s parallel I/O lines
  • 17x17 mm, 2 Watts
  • 25.6 GB/s memory (6.4 GB/s per direction
    and per Xbar)
  • 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)

[Floorplan: two 8 MB (64 Mbit) DRAM arrays with the crossbar (Xbar) and I/O
between them]
8
Media Kernel Performance
9
Base-line system comparison
  • All numbers in cycles/pixel
  • MMX and VIS results assume all data in L1 cache

10
IRAM Chip Challenges
  • Merged Logic-DRAM process: cost of wafer, impact
    on yield, testing cost of logic and DRAM
  • Price on-chip DRAM v. separate DRAM chips?
  • Time delay of transistor speeds, memory cell
    sizes in Merged process vs. Logic only or DRAM
    only
  • DRAM block flexibility via DRAM compiler (vary
    size, width, no. subbanks) vs. fixed block
  • Applications advantages in memory bandwidth,
    energy, system size to offset above challenges?

11
Other examples: Sony Playstation 2
  • Emotion Engine: 6.2 GFLOPS, 75 million polygons
    per second (Microprocessor Report, 135)
  • Superscalar MIPS core + vector coprocessor +
    graphics/DRAM
  • Claim: Toy Story realism brought to games!

12
Other examples: IBM Blue Gene
  • Blue Gene Chip
  • 20 x 20 mm
  • 32 multithreaded RISC processors + ??MB embedded
    DRAM + high-speed network interface on a single
    chip
  • 1 GFLOPS / processor
  • 2 x 2 Board = 64 chips
  • Tower = 8 Boards
  • System = 64 Towers
  • Total: 1 million processors (2^5 x 2^6 x 2^3 x 2^6)
    in just 2000 sq. ft.
  • Cost: $100M
  • Goal: 1 PetaFLOPS in 2005?
  • Application: protein folding

13
Outline
  • 1) One instance of microprocessors for gadgets
  • 2) Motivation and the ISTORE project vision
  • AME = Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

14
The problem space: big data
  • Big demand for enormous amounts of data
  • today: high-end enterprise and Internet
    applications
  • enterprise decision-support, data mining
    databases
  • online applications: e-commerce, mail, web,
    archives
  • future: infrastructure services, richer data
  • computational storage back-ends for mobile
    devices
  • more multimedia content
  • more use of historical data to provide better
    services
  • Today's server designs can't easily scale to meet
    these huge demands
  • bus bandwidth bottlenecks limit access to stored
    data
  • SMP designs are near their limits and don't offer
    an incremental growth path

15
One approach: traditional NAS
  • Network-attached storage makes storage devices
    first-class citizens on the network
  • network file server appliances (NetApp, SNAP,
    ...)
  • storage-area networks (CMU NASD, NSIC OOD, ...)
  • active disks (CMU, UCSB, Berkeley IDISK)
  • These approaches primarily target performance
    scalability
  • scalable networks remove bus bandwidth
    limitations
  • migration of layout functionality to storage
    devices removes overhead of intermediate servers
  • There are bigger scaling problems than scalable
    performance!

16
The real scalability problems: AME
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or complexity
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

17
The ISTORE project vision
  • Our goal
  • develop principles and investigate
    hardware/software techniques for building
    storage-based server systems that
  • are highly available
  • require minimal maintenance
  • robustly handle evolutionary growth
  • are scalable to O(10000) nodes

18
Principles for achieving AME (1)
  • No single points of failure
  • Redundancy everywhere
  • Performance robustness is more important than
    peak performance
  • performance robustness implies that real-world
    performance is comparable to best-case
    performance
  • Performance can be sacrificed for improvements in
    AME
  • resources should be dedicated to AME
  • compare: biological systems spend > 50% of
    resources on maintenance
  • can make up performance by scaling system

19
Principles for achieving AME (2)
  • Introspection
  • reactive techniques to detect and adapt to
    failures, workload variations, and system
    evolution
  • proactive techniques to anticipate and avert
    problems before they happen
  • Benchmarking
  • quantification brings rigor
  • requires new AME benchmarks
  • what gets measured gets done
  • benchmarks shape a field

20
Outline
  • 1) One instance of microprocessors for gadgets
  • 2) Motivation and the ISTORE project vision
  • AME = Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

21
Hardware techniques
  • Fully shared-nothing cluster organization
  • truly scalable architecture
  • architecture that can tolerate partial failure
  • automatic hardware redundancy
  • Storage distributed with computation nodes
  • distributed processing reduces data movement and
    avoids network bottlenecks
  • nodes are responsible for the health of the
    storage that they own
  • if AME is important, must provide resources to be
    used for AME

22
Hardware techniques (2)
  • Heavily instrumented hardware
  • sensors for temp, vibration, humidity, power,
    intrusion
  • helps detect environmental problems before they
    can affect system integrity
  • Independent diagnostic processor on each node
  • provides remote control of power, remote console
    access to the node, selection of node boot code
  • collects, stores, processes environmental data
    for abnormalities
  • non-volatile flight recorder functionality
  • all diagnostic processors connected via
    independent diagnostic network
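
To make the flight-recorder idea concrete, here is a minimal sketch of the kind of loop such a diagnostic processor could run; the sensor-reading stub, thresholds, and log path are illustrative assumptions, not the actual ISTORE firmware.

    import time

    # Illustrative limits only; real thresholds would come from the hardware spec.
    LIMITS = {"temp_C": (5.0, 55.0), "vibration_g": (0.0, 0.5), "humidity_pct": (10.0, 80.0)}

    def read_sensors():
        """Placeholder for reading the node's environmental sensors."""
        return {"temp_C": 38.2, "vibration_g": 0.02, "humidity_pct": 45.0}

    def alert(name, value):
        """Placeholder for raising an alarm over the diagnostic network."""
        print("abnormal %s reading: %s" % (name, value))

    def monitor(poll_seconds=10, log_path="flight_recorder.log"):
        """Collect, store, and check environmental data for abnormalities."""
        while True:                                    # runs for the life of the node
            sample = read_sensors()
            with open(log_path, "a") as log:           # non-volatile "flight recorder"
                log.write("%f %r\n" % (time.time(), sample))
            for name, value in sample.items():
                low, high = LIMITS[name]
                if not (low <= value <= high):
                    alert(name, value)
            time.sleep(poll_seconds)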

23
Hardware techniques (3)
  • Built-in fault injection capabilities
  • power control to individual node components
  • injectable glitches into I/O and memory busses
  • on-demand network partitioning/isolation
  • managed by diagnostic processor and network
    switches via diagnostic network
  • used for proactive hardware introspection
  • automated detection of flaky components
  • controlled testing of error-recovery mechanisms
  • important for AME benchmarking

24
ISTORE-1 hardware platform
  • 80-node x86-based cluster, 1.4TB storage
  • cluster nodes are plug-and-play, intelligent,
    network-attached storage bricks
  • a single field-replaceable unit to simplify
    maintenance
  • each node is a full x86 PC w/256MB DRAM, 18GB
    disk
  • more CPU than NAS; fewer disks/node than a cluster

25
ISTORE Brick Block Diagram
[Block diagram: Mobile Pentium II CPU module with North Bridge and South
Bridge, 256 MB DRAM, BIOS, Super I/O, PCI bus, SCSI to the 18 GB disk, and
four 100 Mb/s Ethernets; a separate Diagnostic Processor (with its own
Flash, RAM, and RTC) attached via dual UART provides monitor and control
functions over the Diagnostic Net.]
  • Sensors for heat and vibration
  • Control over power to individual nodes
26
A glimpse into the future?
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • ISTORE HW in 5-7 years
  • building block: 2006 MicroDrive integrated with
    IRAM
  • 9GB disk, 50 MB/sec from disk
  • connected via crossbar switch
  • 10,000 nodes fit into one rack!
  • This scale is our ultimate design point

27
Software techniques
  • Fully-distributed, shared-nothing code
  • centralization breaks as systems scale up to
    O(10000) nodes
  • avoids single-point-of-failure front ends
  • Redundant data storage
  • required for high availability, simplifies
    self-testing
  • replication at the level of application objects
  • application can control consistency policy
  • more opportunity for data placement optimization

28
Software techniques (2)
  • River storage interfaces
  • NOW Sort experience: performance heterogeneity
    is the norm
  • disks: inner vs. outer track (50%), fragmentation
  • processors: load (1.5-5x)
  • So: demand-driven delivery of data to apps
  • via distributed queues and graduated declustering
  • for apps that can handle unordered data delivery
  • automatically adapts to variations in performance
    of producers and consumers
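
As a toy illustration of demand-driven delivery (not the actual River implementation), the sketch below has producers of different speeds feeding one bounded queue while consumers simply pull whatever record is ready next, so faster components automatically do more of the work; the queue size, thread counts, and delays are made-up parameters.

    import queue
    import threading
    import time

    work = queue.Queue(maxsize=64)   # bounded queue shared by producers and consumers
    DONE = object()                  # sentinel telling a consumer to stop

    def producer(records, delay):
        """One data source (e.g., a disk); 'delay' models inner/outer-track speed gaps."""
        for r in records:
            time.sleep(delay)
            work.put(r)

    def consumer(name, results):
        """Pulls the next available record on demand; order of delivery is not preserved."""
        while True:
            r = work.get()
            if r is DONE:
                break
            results.append((name, r))

    def run():
        results = []
        producers = [threading.Thread(target=producer,
                                      args=(range(i * 100, i * 100 + 20), 0.001 * (i + 1)))
                     for i in range(3)]               # three sources at different speeds
        consumers = [threading.Thread(target=consumer, args=("c%d" % i, results))
                     for i in range(2)]
        for t in producers + consumers:
            t.start()
        for t in producers:
            t.join()
        for _ in consumers:
            work.put(DONE)                            # one stop marker per consumer
        for t in consumers:
            t.join()
        return results

    if __name__ == "__main__":
        print(len(run()), "records delivered")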

29
Software techniques (3)
  • Reactive introspection
  • use statistical techniques to identify normal
    behavior and detect deviations from it
  • policy-driven automatic adaptation to abnormal
    behavior once detected
  • initially, rely on human administrator to specify
    policy
  • eventually, system learns to solve problems on
    its own by experimenting on isolated subsets of
    the nodes
  • one candidate: reinforcement learning
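
One very simple statistical technique of the kind alluded to above: learn "normal" from a sliding window of a metric and flag values that fall far outside it. The window size, warm-up length, and 3-sigma threshold are illustrative assumptions.

    from collections import deque
    import statistics

    class DeviationDetector:
        """Learns 'normal' from a sliding window of a metric and flags large deviations."""

        def __init__(self, window=100, threshold_sigmas=3.0, warmup=10):
            self.history = deque(maxlen=window)
            self.threshold = threshold_sigmas
            self.warmup = warmup

        def observe(self, value):
            """Return True if 'value' deviates abnormally from recent behavior."""
            abnormal = False
            if len(self.history) >= self.warmup:
                mean = statistics.mean(self.history)
                spread = statistics.pstdev(self.history) or 1e-9
                abnormal = abs(value - mean) > self.threshold * spread
            self.history.append(value)
            return abnormal

    # Example: feed per-interval hits/second into the detector.
    detector = DeviationDetector()
    for hits in [100, 102, 98, 101, 99, 100, 97, 103, 100, 99, 101, 40]:
        if detector.observe(hits):
            print("abnormal throughput:", hits)   # triggers on the drop to 40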

30
Software techniques (4)
  • Proactive introspection
  • continuous online self-testing of HW and SW
  • in deployed systems!
  • goal is to shake out Heisenbugs before they're
    encountered in normal operation
  • needs data redundancy, node isolation, fault
    injection
  • techniques
  • fault injection: triggering hardware and software
    error handling paths to verify their
    integrity/existence
  • stress testing: push HW/SW to their limits
  • scrubbing: periodic restoration of potentially
    decaying hardware or software state
  • self-scrubbing data structures (like MVS)
  • ECC scrubbing for disks and memory
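
A minimal sketch of the scrubbing idea for stored data: periodically re-read each object, check it against a recorded checksum, and repair from a redundant copy when decay is found. The file layout, manifest, and repair policy here are assumptions for illustration only.

    import hashlib
    import os
    import shutil

    def checksum(path):
        """SHA-1 of a file's contents, read in 1 MB chunks."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def scrub(primary_dir, replica_dir, manifest):
        """Re-check every object against its recorded checksum; repair from the replica."""
        for name, expected in manifest.items():
            primary = os.path.join(primary_dir, name)
            if os.path.exists(primary) and checksum(primary) == expected:
                continue                                   # still healthy
            replica = os.path.join(replica_dir, name)
            if os.path.exists(replica) and checksum(replica) == expected:
                shutil.copyfile(replica, primary)          # restore the decayed copy
                print("repaired", name)
            else:
                print("both copies of", name, "are bad; escalate")

    # A deployed system would run scrub() continuously at low priority, not just once.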

31
Applications
  • ISTORE is not one super-system that demonstrates
    all these techniques!
  • Initially provide library to support AME goals
  • Initial application targets
  • cluster web/email servers
  • self-scrubbing data structures, online
    self-testing
  • statistical identification of normal behavior
  • decision-support database query execution system
  • River-based storage, replica management
  • information retrieval for multimedia data
  • self-scrubbing data structures, structuring
    performance-robust distributed computation

32
Outline
  • 1) One instance of microprocessors for gadgets
  • 2) Motivation and the ISTORE project vision
  • AME = Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

33
Availability benchmarks
  • Questions to answer
  • what factors affect the quality of service
    delivered by the system, and by how much/how
    long?
  • how well can systems survive typical failure
    scenarios?
  • Availability metrics
  • traditionally, percentage of time system is up
  • time-averaged, binary view of system state
    (up/down)
  • traditional metric is too inflexible
  • doesn't capture the spectrum of degraded states
  • time-averaging discards important temporal
    behavior
  • Solution: measure variation in system quality of
    service metrics over time
  • performance, fault-tolerance, completeness,
    accuracy

34
Availability benchmark methodology
  • Goal quantify variation in QoS metrics as events
    occur that affect system availability
  • Leverage existing performance benchmarks
  • to generate fair workloads
  • to measure and trace quality of service metrics
  • Use fault injection to compromise system
  • hardware faults (disk, memory, network, power)
  • software faults (corrupt input, driver error
    returns)
  • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads
  • the availability analogues of performance micro-
    and macro-benchmarks
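
Schematically, a single-fault availability run could be driven by a harness like the one below; start_workload, inject_fault, and sample_qos are hypothetical stand-ins for SpecWeb99, the fault injector, and the QoS tracer, not actual project code.

    import random
    import time

    def start_workload():
        """Placeholder: start the load generator (e.g., SpecWeb99) against the system."""

    def inject_fault(fault):
        """Placeholder: trigger the chosen hardware/software/maintenance fault."""
        print("injecting fault:", fault)

    def sample_qos():
        """Placeholder: report QoS (e.g., average hits/second) for the last interval."""
        return random.gauss(100, 2)

    def run_single_fault_experiment(fault, warmup_s=600, total_s=3600, interval_s=120):
        """Run the workload, inject one fault after warm-up, and trace QoS over time."""
        start_workload()
        trace, injected, elapsed = [], False, 0
        while elapsed < total_s:
            time.sleep(interval_s)
            elapsed += interval_s
            if not injected and elapsed >= warmup_s:
                inject_fault(fault)
                injected = True
            trace.append((elapsed, sample_qos()))   # (seconds since start, QoS metric)
        return trace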

35
Methodology reporting results
  • Results are most accessible graphically
  • plot change in QoS metrics over time
  • compare to normal behavior
  • 99% confidence intervals calculated from no-fault
    runs
  • Graphs can be distilled into numbers
  • quantify distribution of deviations from normal
    behavior, compute area under curve for
    deviations, ...
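
As a rough sketch of the distillation step, the 99% band can be estimated per interval from the no-fault runs, and the faulty run summarized by how often and by how much it leaves that band; the normal approximation and the example numbers below are illustrative, not the project's exact analysis.

    import statistics

    def normal_band(no_fault_runs, z=2.576):
        """Per-interval 99% band estimated from several aligned no-fault QoS traces."""
        low, high = [], []
        for samples in zip(*no_fault_runs):             # all runs' values for one interval
            m, s = statistics.mean(samples), statistics.stdev(samples)
            low.append(m - z * s)
            high.append(m + z * s)
        return low, high

    def deviation_summary(fault_run, low, high, interval_s=120):
        """Distill a graph into numbers: how many intervals leave the band, and by how much."""
        deviations = [max(l - q, q - h, 0) for q, l, h in zip(fault_run, low, high)]
        out_of_band = sum(1 for d in deviations if d > 0)
        area = sum(deviations) * interval_s             # QoS-units x seconds outside the band
        return out_of_band, area

    # Example with made-up hits/second traces: three no-fault runs, one faulty run.
    no_fault = [[100, 101, 99, 100], [102, 100, 98, 101], [99, 100, 101, 100]]
    low, high = normal_band(no_fault)
    print(deviation_summary([100, 60, 70, 99], low, high))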

36
Example results: software RAID-5
  • Test systems: Linux/Apache and Win2000/IIS
  • SpecWeb 99 to measure hits/second as QoS metric
  • fault injection at disks based on empirical fault
    data
  • transient, correctable, uncorrectable, timeout
    faults
  • 15 single-fault workloads injected per system
  • only 4 distinct behaviors observed
  • (A) no effect
  • (B) system hangs
  • (C) RAID enters degraded mode
  • (D) RAID enters degraded mode and starts
    reconstruction
  • both systems hung (B) on simulated disk hangs
  • Linux exhibited (D) on all other errors
  • Windows exhibited (A) on transient errors and (C)
    on uncorrectable, sticky errors

37
Example results: multiple faults
[Graphs: hits/second over time during the multiple-fault scenario for
Windows 2000/IIS and Linux/Apache]
  • Windows reconstructs 3x faster than Linux
  • Windows reconstruction noticeably affects
    application performance, while Linux
    reconstruction does not

38
Conclusions
  • IRAM attractive for two Post-PC applications
    because of low power, small size, high memory
    bandwidth
  • Mobile consumer electronic devices
  • Scaleable infrastructure
  • IRAM benchmarking result: faster than DSPs
  • ISTORE: hardware/software architecture for
    large-scale network services
  • Scaling systems requires
  • new continuous models of availability
  • performance not limited by the weakest link
  • self-maintaining systems to reduce human interaction

39
Benchmark conclusions
  • Linux and Windows take opposite approaches to
    managing benign and transient faults
  • Linux is paranoid and stops using a disk on any
    error
  • Windows ignores most benign/transient faults
  • Windows is more robust except when disk is truly
    failing
  • Linux and Windows have different reconstruction
    philosophies
  • Linux uses idle bandwidth for reconstruction
  • Windows steals app. bandwidth for reconstruction
  • Windows rebuilds fault-tolerance more quickly
  • Win2k favors fault-tolerance over performance;
    Linux favors performance over fault-tolerance

40
ISTORE conclusions
  • Availability, Maintainability, and Evolutionary
    growth are key challenges for server systems
  • more important even than performance
  • ISTORE is investigating ways to bring AME to
    large-scale, storage-intensive servers
  • via clusters of network-attached,
    computationally-enhanced storage nodes running
    distributed code
  • via hardware and software introspection
  • we are currently performing application studies
    to investigate and compare techniques
  • Availability benchmarks are a powerful tool
  • revealed undocumented design decisions affecting
    SW RAID availability on Linux and Windows 2000

41
Future work
  • ISTORE
  • implement AME-enhancing techniques in a variety
    of Internet, enterprise, and info retrieval
    applications
  • select the best techniques and integrate into a
    generic runtime system with AME API
  • AME benchmarks
  • expand availability benchmarks to distributed
    apps
  • add maintainability
  • use methodology from availability benchmark
  • but include administrator's response to faults
  • must develop model of typical administrator
    behavior
  • can we quantify administrative work needed to
    maintain a certain level of availability?

42
The UC Berkeley ISTORE Project: bringing
availability, maintainability, and evolutionary
growth to storage-based clusters
  • For more information
  • http://iram.cs.berkeley.edu/istore
  • istore-group@cs.berkeley.edu

43
Backup Slides
  • (mostly in the area of benchmarking)

44
Case study
  • Software RAID-5 plus web server
  • Linux/Apache vs. Windows 2000/IIS
  • Why software RAID?
  • well-defined availability guarantees
  • RAID-5 volume should tolerate a single disk
    failure
  • reduced performance (degraded mode) after failure
  • may automatically rebuild redundancy onto spare
    disk
  • simple system
  • easy to inject storage faults
  • Why web server?
  • an application with measurable QoS metrics that
    depend on RAID availability and performance

45
Benchmark environment: metrics
  • QoS metrics measured
  • hits per second
  • roughly tracks response time in our experiments
  • degree of fault tolerance in storage system
  • Workload generator and data collector
  • SpecWeb99 web benchmark
  • simulates realistic high-volume user load
  • mostly static read-only workload; some dynamic
    content
  • modified to run continuously and to measure
    average hits per second over each 2-minute
    interval
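
The per-interval averaging amounts to bucketing request completion times into 2-minute windows; a small sketch follows (the timestamp list stands in for whatever log the modified SpecWeb99 produces).

    from collections import Counter

    def hits_per_second(timestamps, interval_s=120):
        """Average hits/second for each consecutive interval_s-second window."""
        if not timestamps:
            return []
        t0 = min(timestamps)
        buckets = Counter(int((t - t0) // interval_s) for t in timestamps)
        return [buckets.get(i, 0) / interval_s for i in range(max(buckets) + 1)]

    # Example: 300 requests in the first two minutes, 60 in the next two.
    ts = [0.4 * i for i in range(300)] + [120 + 2.0 * i for i in range(60)]
    print(hits_per_second(ts))   # [2.5, 0.5]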

46
Benchmark environment: faults
  • Focus on faults in the storage system (disks)
  • How do disks fail?
  • according to Tertiary Disk project, failures
    include
  • recovered media errors
  • uncorrectable write failures
  • hardware errors (e.g., diagnostic failures)
  • SCSI timeouts
  • SCSI parity errors
  • note: no head crashes, no fail-stop failures

47
Disk fault injection technique
  • To inject reproducible failures, we replaced one
    disk in the RAID with an emulated disk
  • a PC that appears as a disk on the SCSI bus
  • I/O requests processed in software, reflected to
    local disk
  • fault injection performed by altering SCSI
    command processing in the emulation software
  • Types of emulated faults
  • media errors (transient, correctable,
    uncorrectable)
  • hardware errors (firmware, mechanical)
  • parity errors
  • power failures
  • disk hangs/timeouts
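
Conceptually, injection hooks into the emulator's command loop: each SCSI command is checked against the active fault and either answered with an injected error (or a hang) or passed through to the local disk. The dictionary-based command format, sense strings, and in-memory "disk" below are stand-ins, not the real emulation software.

    import time

    # Hypothetical fault specification: which command to perturb and how.
    active_fault = {"op": "READ", "lba": 1000, "kind": "uncorrectable_media_error"}

    local_disk = {}   # stand-in for the PC's real disk that backs the emulated one

    def handle_scsi_command(cmd):
        """Process one emulated SCSI command, injecting the active fault when it matches."""
        if (active_fault is not None and cmd["op"] == active_fault["op"]
                and cmd.get("lba") == active_fault["lba"]):
            if active_fault["kind"] == "disk_hang":
                time.sleep(3600)                          # simulate a hung disk (timeout)
            return {"status": "error", "sense": active_fault["kind"]}
        if cmd["op"] == "WRITE":                          # normal path: reflect to local storage
            local_disk[cmd["lba"]] = cmd["data"]
            return {"status": "ok"}
        return {"status": "ok", "data": local_disk.get(cmd["lba"], b"\x00" * 512)}

    # A read of the poisoned block reports an error; other blocks behave normally.
    print(handle_scsi_command({"op": "READ", "lba": 1000}))
    print(handle_scsi_command({"op": "READ", "lba": 5}))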

48
System configuration
  • RAID-5 Volume 3GB capacity, 1GB used per disk
  • 3 physical disks, 1 emulated disk, 1 emulated
    spare disk
  • 2 web clients connected via 100Mb switched
    Ethernet

49
Results single-fault experiments
  • One expt for each type of fault (15 total)
  • only one fault injected per experiment
  • no human intervention
  • system allowed to continue until stabilized or
    crashed
  • Four distinct system behaviors observed
  • (A) no effect system ignores fault
  • (B) RAID system enters degraded mode
  • (C) RAID system begins reconstruction onto spare
    disk
  • (D) system failure (hang or crash)

50
System behavior single-fault
[Table: observed behavior per injected fault type and system: (A) no effect,
(B) enter degraded mode, (C) begin reconstruction, (D) system failure]
51
System behavior single-fault (2)
  • Windows ignores benign faults
  • Windows can't automatically rebuild
  • Linux reconstructs on all errors
  • Both systems fail when disk hangs

52
Interpretation: single-fault expts
  • Linux and Windows take opposite approaches to
    managing benign and transient faults
  • these faults do not necessarily imply a failing
    disk
  • Tertiary Disk: 368/368 disks had transient SCSI
    errors; 13/368 disks had transient hardware
    errors; only 2/368 needed replacing
  • Linux is paranoid and stops using a disk on any
    error
  • fragile system is more vulnerable to multiple
    faults
  • but no chance of slowly-failing disk impacting
    perf.
  • Windows ignores most benign/transient faults
  • robust: less likely to lose data, more
    disk-efficient
  • less likely to catch slowly-failing disks and
    remove them
  • Neither policy is ideal!
  • need a hybrid?

53
Results: multiple-fault experiments
  • Scenario
  • (1) disk fails
  • (2) data is reconstructed onto spare
  • (3) spare fails
  • (4) administrator replaces both failed disks
  • (5) data is reconstructed onto new disks
  • Requires human intervention
  • to initiate reconstruction on Windows 2000
  • simulate 6 minute sysadmin response time
  • to replace disks
  • simulate 90 seconds of time to replace hot-swap
    disks

54
Interpretation: multi-fault expts
  • Linux and Windows have different reconstruction
    philosophies
  • Linux uses idle bandwidth for reconstruction
  • little impact on application performance
  • increases length of time system is vulnerable to
    faults
  • Windows steals app. bandwidth for reconstruction
  • reduces application performance
  • minimizes system vulnerability
  • but must be manually initiated (or scripted)
  • Windows favors fault-tolerance over performance;
    Linux favors performance over fault-tolerance
  • the same design philosophies seen in the
    single-fault experiments

55
Maintainability Observations
  • Scenario: administrator accidentally removes and
    replaces a live disk in degraded mode
  • double failure: no guarantee on data integrity
  • theoretically, can recover if writes are queued
  • Windows recovers, but loses active writes
  • journalling NTFS is not corrupted
  • all data not being actively written is intact
  • Linux will not allow removed disk to be
    reintegrated
  • total loss of all data on RAID volume!

56
Maintainability Observations (2)
  • Scenario: administrator adds a new spare
  • a common task that can be done with hot-swap
    drive trays
  • Linux requires a reboot for the disk to be
    recognized
  • Windows can dynamically detect the new disk
  • Windows 2000 RAID is easier to maintain
  • easier GUI configuration
  • more flexible in adding disks
  • SCSI rescan and NTFS deal with administrator
    goofs
  • less likely to require administration due to
    transient errors
  • BUT must manually initiate reconstruction when
    needed