Title: Computers for the Post-PC Era
1. Computers for the Post-PC Era
- David Patterson
- University of California at Berkeley
- patterson@cs.berkeley.edu
- UC Berkeley IRAM Group
- UC Berkeley ISTORE Group
- istore-group@cs.berkeley.edu
- 10 February 2000
2. Perspective on the Post-PC Era
- The Post-PC Era will be driven by 2 technologies:
- 1) Tiny embedded or mobile consumer devices
- e.g., successors to the PDA, cell phone, wearable computers; ubiquitous in everything
- 2) Infrastructure to support such devices
- e.g., successors to big fat web servers, database servers
3. Outline
- 1) One instance of microprocessors for gadgets
- 2) Motivation and the ISTORE project vision
- AME: Availability, Maintainability, Evolutionary growth
- ISTORE's research principles
- Proposed techniques for achieving AME
- Benchmarks for AME
- Conclusions and future work
4. Intelligent RAM (IRAM)
- Microprocessor + DRAM on a single chip:
- 10X capacity vs. SRAM
- on-chip memory latency 5-10X, bandwidth 50-100X
- improved energy efficiency 2X-4X (no off-chip bus)
- serial I/O 5-10X vs. buses
- smaller board area/volume
- IRAM advantages extend to:
- a single-chip system
- a building block for larger systems
5. Revive Vector Architecture
- The standard objections to vectors, and the IRAM-era answers:
- Cost ($1M each?) -> single-chip CMOS MPU/IRAM
- Low-latency, high-bandwidth memory system? -> IRAM
- Code density? -> much smaller than VLIW
- Compilers? -> for sale, mature (>20 years); we retarget Cray compilers
- Performance? -> easy to scale speed with technology
- Power/energy? -> parallel to save energy, keep performance
- Limited to scientific applications? -> multimedia apps vectorize too: N x 64b, 2N x 32b, or 4N x 16b (see the NumPy sketch below)
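To make the multimedia point concrete, here is a minimal NumPy sketch of the kind of data-parallel kernel a vector unit handles well: the same saturating add can run as N 64-bit, 2N 32-bit, or 4N 16-bit element operations depending on the element width. The function and variable names are illustrative, not part of the VIRAM toolchain.

```python
import numpy as np

def saturating_add(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Element-wise saturating add: one vector operation per strip on a
    vector machine, regardless of the element width chosen."""
    info = np.iinfo(a.dtype)
    # Widen, add, then clamp back to the narrow type (e.g. 16-bit pixels).
    wide = a.astype(np.int64) + b.astype(np.int64)
    return np.clip(wide, info.min, info.max).astype(a.dtype)

# The same kernel over 16-bit pixels: a 256-bit vector datapath would
# process 16 of these per cycle (the slide's "4N x 16b" configuration).
pixels_a = np.array([30000, -30000, 100], dtype=np.int16)
pixels_b = np.array([10000, -10000, 28], dtype=np.int16)
print(saturating_add(pixels_a, pixels_b))   # -> [ 32767 -32768    128]
```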
6. V-IRAM1: Low Power vs. High Performance
[Block diagram: a 2-way superscalar scalar processor (16K I-cache, 16K D-cache) feeding a vector instruction queue; a vector unit with vector registers and load/store pipelines operating as 4 x 64b, 8 x 32b, or 16 x 16b lanes; a memory crossbar switch connecting multiple 4 x 64 ports to an array of DRAM macros; serial I/O.]
7. VIRAM-1: System on a Chip
- Prototype scheduled for tape-out mid-2000
- 0.18 um embedded DRAM logic (EDL) process
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- 4 64-bit vector unit pipelines @ 200 MHz
- 4 parallel I/O lines @ 100 MB/s
- 17x17 mm, 2 Watts
- 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
- 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit); see the peak-rate arithmetic below
[Floorplan: two 8-MByte (64-Mbit) DRAM halves flanking the crossbar and I/O]
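As a sanity check, the sketch below rederives the quoted peak rates from the clock and lane counts. It assumes each 64-bit pipeline retires one multiply-add (2 flops) per cycle, that each 64-bit pipe subdivides into four 16-bit lanes, and that there are two crossbars (one per memory half, per the floorplan); none of these assumptions is stated explicitly on the slide, but they are consistent with its numbers.

```python
CLOCK_HZ = 200e6       # 200 MHz
PIPES = 4              # 64-bit vector pipelines
FLOPS_PER_PIPE = 2     # assumption: one fused multiply-add per cycle

peak_flops_64 = PIPES * FLOPS_PER_PIPE * CLOCK_HZ
print(peak_flops_64 / 1e9)        # 1.6 GFLOPS (64-bit)

LANES_16 = PIPES * 4   # assumption: each 64-bit pipe = 4 x 16-bit lanes
peak_ops_16 = LANES_16 * FLOPS_PER_PIPE * CLOCK_HZ
print(peak_ops_16 / 1e9)          # 6.4 GOPS (16-bit)

XBARS = 2              # assumption: one crossbar per memory half
DIRECTIONS = 2         # 6.4 GB/s per direction and per Xbar
print(6.4 * XBARS * DIRECTIONS)   # 25.6 GB/s total memory bandwidth
```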
8. Media Kernel Performance
[Chart: media kernel performance results]
9. Base-line system comparison
- All numbers in cycles/pixel
- MMX and VIS results assume all data in L1 cache
10. IRAM Chip Challenges
- Merged logic-DRAM process: cost of wafer, impact on yield, testing cost of logic and DRAM
- Price of on-chip DRAM vs. separate DRAM chips?
- Time delay of transistor speeds and memory cell sizes in a merged process vs. logic-only or DRAM-only
- DRAM block flexibility via a DRAM compiler (vary size, width, no. of subbanks) vs. a fixed block
- Do application advantages in memory bandwidth, energy, and system size offset the above challenges?
11. Other examples: Sony PlayStation 2
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report 13.5)
- Superscalar MIPS core + vector coprocessor + graphics/DRAM
- Claim: "Toy Story" realism brought to games!
12. Other examples: IBM Blue Gene
- Blue Gene chip
- 20 x 20 mm
- 32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single chip
- 1 GFLOPS / processor
- 2 x 2 board = 64 chips
- Tower = 8 boards
- System = 64 towers
- Total: 1 million processors (2^5 x 2^6 x 2^3 x 2^6) in just 2000 sq. ft.
- Cost: $100M
- Goal: 1 PetaFLOPS in 2005?
- Application: protein folding
13. Outline
- 1) One instance of microprocessors for gadgets
- 2) Motivation and the ISTORE project vision
- AME: Availability, Maintainability, Evolutionary growth
- ISTORE's research principles
- Proposed techniques for achieving AME
- Benchmarks for AME
- Conclusions and future work
14. The problem space: big data
- Big demand for enormous amounts of data
- today: high-end enterprise and Internet applications
- enterprise decision-support, data mining databases
- online applications: e-commerce, mail, web, archives
- future: infrastructure services, richer data
- computational storage back-ends for mobile devices
- more multimedia content
- more use of historical data to provide better services
- Today's server designs can't easily scale to meet these huge demands
- bus bandwidth bottlenecks limit access to stored data
- SMP designs are near their limits and don't offer an incremental growth path
15. One approach: traditional NAS
- Network-attached storage makes storage devices first-class citizens on the network
- network file server appliances (NetApp, SNAP, ...)
- storage-area networks (CMU NASD, NSIC OOD, ...)
- active disks (CMU, UCSB, Berkeley IDISK)
- These approaches primarily target performance scalability
- scalable networks remove bus bandwidth limitations
- migration of layout functionality to storage devices removes the overhead of intermediate servers
- There are bigger scaling problems than scalable performance!
16. The real scalability problems: AME
- Availability
- systems should continue to meet quality of service goals despite hardware and software failures
- Maintainability
- systems should require only minimal ongoing human administration, regardless of scale or complexity
- Evolutionary Growth
- systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and will only get worse as systems grow
17. The ISTORE project vision
- Our goal:
- develop principles and investigate hardware/software techniques for building storage-based server systems that
- are highly available
- require minimal maintenance
- robustly handle evolutionary growth
- are scalable to O(10000) nodes
18. Principles for achieving AME (1)
- No single points of failure
- Redundancy everywhere
- Performance robustness is more important than peak performance
- performance robustness implies that real-world performance is comparable to best-case performance
- Performance can be sacrificed for improvements in AME
- resources should be dedicated to AME
- compare: biological systems spend > 50% of resources on maintenance
- can make up performance by scaling the system
19. Principles for achieving AME (2)
- Introspection
- reactive techniques to detect and adapt to failures, workload variations, and system evolution
- proactive techniques to anticipate and avert problems before they happen
- Benchmarking
- quantification brings rigor
- requires new AME benchmarks
- what gets measured gets done
- benchmarks shape a field
20. Outline
- 1) One instance of microprocessors for gadgets
- 2) Motivation and the ISTORE project vision
- AME: Availability, Maintainability, Evolutionary growth
- ISTORE's research principles
- Proposed techniques for achieving AME
- Benchmarks for AME
- Conclusions and future work
21. Hardware techniques
- Fully shared-nothing cluster organization
- truly scalable architecture
- architecture that can tolerate partial failure
- automatic hardware redundancy
- Storage distributed with computation nodes
- distributed processing reduces data movement and avoids network bottlenecks
- nodes are responsible for the health of the storage that they own
- if AME is important, must provide resources to be used for AME
22. Hardware techniques (2)
- Heavily instrumented hardware
- sensors for temperature, vibration, humidity, power, intrusion
- helps detect environmental problems before they can affect system integrity
- Independent diagnostic processor on each node
- provides remote control of power, remote console access to the node, and selection of node boot code
- collects, stores, and processes environmental data for abnormalities
- non-volatile "flight recorder" functionality
- all diagnostic processors connected via an independent diagnostic network
23. Hardware techniques (3)
- Built-in fault injection capabilities (see the sketch below)
- power control to individual node components
- injectable glitches into I/O and memory busses
- on-demand network partitioning/isolation
- managed by the diagnostic processor and network switches via the diagnostic network
- used for proactive hardware introspection
- automated detection of flaky components
- controlled testing of error-recovery mechanisms
- important for AME benchmarking
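A minimal sketch of how a diagnostic processor might drive these capabilities over the diagnostic network. The class and method names are hypothetical illustrations, not the ISTORE API.

```python
import random

class DiagnosticProcessor:
    """Hypothetical per-node controller reached via the diagnostic net."""

    def __init__(self, node_id: int):
        self.node_id = node_id

    def power_cycle(self, component: str) -> None:
        print(f"node {self.node_id}: power-cycling {component}")

    def glitch_bus(self, bus: str, duration_ms: int) -> None:
        print(f"node {self.node_id}: {duration_ms} ms glitch on {bus} bus")

    def isolate(self) -> None:
        print(f"node {self.node_id}: partitioned off the data network")

def proactive_sweep(nodes: list[DiagnosticProcessor]) -> None:
    """Proactive hardware introspection: exercise error-recovery paths
    on one isolated node and observe whether the system copes."""
    victim = random.choice(nodes)
    victim.isolate()                          # keep the blast radius small
    victim.glitch_bus("memory", duration_ms=5)
    victim.power_cycle("disk")

proactive_sweep([DiagnosticProcessor(n) for n in range(4)])
```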
24. ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
- cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
- a single field-replaceable unit to simplify maintenance
- each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
- more CPU than NAS; fewer disks/node than a cluster
25. ISTORE Brick Block Diagram
[Block diagram: a Mobile Pentium II module (CPU, North Bridge, South Bridge, 256 MB DRAM, BIOS) with PCI to 4 x 100 Mb/s Ethernets and SCSI to an 18 GB disk; a separate diagnostic processor (Flash, RAM, RTC, dual UART, Super I/O) providing monitoring and control and attaching to the diagnostic net.]
- Sensors for heat and vibration
- Control over power to individual nodes
26. A glimpse into the future?
- System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk
- ISTORE HW in 5-7 years:
- building block: 2006 MicroDrive integrated with IRAM
- 9 GB disk, 50 MB/sec from disk
- connected via crossbar switch
- 10,000 nodes fit into one rack!
- This scale is our ultimate design point
27. Software techniques
- Fully-distributed, shared-nothing code
- centralization breaks as systems scale up to O(10000)
- avoids single-point-of-failure front ends
- Redundant data storage
- required for high availability, simplifies self-testing
- replication at the level of application objects (see the sketch below)
- application can control the consistency policy
- more opportunity for data placement optimization
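A toy sketch of application-object replication with an application-chosen consistency policy. This is an illustration of the idea under stated assumptions (a simple write quorum), not the ISTORE library.

```python
class ReplicatedObject:
    """Illustrative only: an application object replicated across nodes,
    with the consistency policy chosen by the application."""

    def __init__(self, replicas: int, write_quorum: int):
        assert write_quorum <= replicas
        self.stores = [dict() for _ in range(replicas)]  # stand-in nodes
        self.write_quorum = write_quorum

    def put(self, key, value) -> None:
        # Application-controlled policy: acknowledge after `write_quorum`
        # copies; a real system would complete the rest asynchronously.
        for store in self.stores[: self.write_quorum]:
            store[key] = value

    def get(self, key):
        # Read from any replica holding the key, so reads survive node loss.
        for store in self.stores:
            if key in store:
                return store[key]
        raise KeyError(key)

obj = ReplicatedObject(replicas=3, write_quorum=2)
obj.put("mail/inbox/42", b"message body")
print(obj.get("mail/inbox/42"))
```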
28. Software techniques (2)
- River storage interfaces
- NOW Sort experience: performance heterogeneity is the norm
- disks: inner vs. outer track (50%), fragmentation
- processors: load (1.5-5x)
- So: demand-driven delivery of data to apps (sketched below)
- via distributed queues and graduated declustering
- for apps that can handle unordered data delivery
- automatically adapts to variations in the performance of producers and consumers
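A toy sketch of the demand-pull idea behind River: consumers take records from a shared queue, so a consumer running at a third of the speed simply ends up handling roughly a third as many records, with no explicit load-balancing step. The queue here is in-process; River's queues are distributed.

```python
import collections
import queue
import threading
import time

work = queue.Queue()
for record in range(1000):
    work.put(record)

handled = collections.Counter()

def consumer(name: str, delay_s: float) -> None:
    # Demand-driven delivery: faster consumers simply pull more records.
    while True:
        try:
            work.get_nowait()
        except queue.Empty:
            return
        time.sleep(delay_s)          # stand-in for per-record work
        handled[name] += 1

threads = [
    threading.Thread(target=consumer, args=("fast", 0.001)),
    threading.Thread(target=consumer, args=("slow", 0.003)),  # 3x slower
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(handled)   # "fast" handles roughly 3x the records of "slow"
```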
29. Software techniques (3)
- Reactive introspection
- use statistical techniques to identify normal behavior and detect deviations from it (see the sketch below)
- policy-driven automatic adaptation to abnormal behavior once detected
- initially, rely on a human administrator to specify policy
- eventually, the system learns to solve problems on its own by experimenting on isolated subsets of the nodes
- one candidate: reinforcement learning
30. Software techniques (4)
- Proactive introspection
- continuous online self-testing of HW and SW
- in deployed systems!
- goal is to shake out "Heisenbugs" before they're encountered in normal operation
- needs data redundancy, node isolation, fault injection
- techniques:
- fault injection: triggering hardware and software error-handling paths to verify their integrity/existence
- stress testing: push HW/SW to their limits
- scrubbing: periodic restoration of potentially decaying hardware or software state (see the sketch below)
- self-scrubbing data structures (like MVS)
- ECC scrubbing for disks and memory
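A minimal sketch of a scrubbing pass, assuming per-block checksums and a replica to repair from; the names are illustrative, not ISTORE code.

```python
import hashlib

def scrub(blocks: dict[int, bytes], checksums: dict[int, str], repair) -> None:
    """Periodic scrubbing: re-read every block, verify its checksum, and
    restore decayed data from redundancy before normal operation ever
    reads it."""
    for block_id, data in blocks.items():
        if hashlib.sha1(data).hexdigest() != checksums[block_id]:
            blocks[block_id] = repair(block_id)   # e.g., from a replica

# Illustrative repair source: a surviving replica of the same blocks.
replica = {0: b"hello", 1: b"world"}
blocks = {0: b"hello", 1: b"w0rld"}               # block 1 has decayed
checksums = {i: hashlib.sha1(d).hexdigest() for i, d in replica.items()}
scrub(blocks, checksums, repair=lambda i: replica[i])
assert blocks[1] == b"world"
```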
31. Applications
- ISTORE is not one super-system that demonstrates all these techniques!
- Initially provide a library to support the AME goals
- Initial application targets:
- cluster web/email servers
- self-scrubbing data structures, online self-testing
- statistical identification of normal behavior
- decision-support database query execution system
- River-based storage, replica management
- information retrieval for multimedia data
- self-scrubbing data structures, structuring performance-robust distributed computation
32. Outline
- 1) One instance of microprocessors for gadgets
- 2) Motivation and the ISTORE project vision
- AME: Availability, Maintainability, Evolutionary growth
- ISTORE's research principles
- Proposed techniques for achieving AME
- Benchmarks for AME
- Conclusions and future work
33. Availability benchmarks
- Questions to answer:
- what factors affect the quality of service delivered by the system, by how much, and for how long?
- how well can systems survive typical failure scenarios?
- Availability metrics
- traditionally, the percentage of time the system is up
- time-averaged, binary view of system state (up/down)
- the traditional metric is too inflexible
- doesn't capture the spectrum of degraded states
- time-averaging discards important temporal behavior
- Solution: measure variation in system quality-of-service metrics over time
- performance, fault-tolerance, completeness, accuracy
34. Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks
- to generate fair workloads
- to measure and trace quality of service metrics
- Use fault injection to compromise the system (skeleton below)
- hardware faults (disk, memory, network, power)
- software faults (corrupt input, driver error returns)
- maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
- the availability analogues of performance micro- and macro-benchmarks
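The skeleton of a single-fault run, in sketch form: sample the QoS metric each interval while the workload runs, and inject one fault partway through so the trace shows the impact and any recovery. The stand-in workload and injector below are placeholders for a real benchmark and fault-injection harness.

```python
import random

def run_availability_benchmark(measure_qos, inject_fault,
                               intervals: int, fault_at: int) -> list[float]:
    """Record the QoS metric (e.g., hits/sec) once per interval and
    inject a single fault at interval `fault_at`."""
    trace = []
    for t in range(intervals):
        if t == fault_at:
            inject_fault()
        trace.append(measure_qos())
    return trace

# Stand-ins for a real workload generator and fault injector:
state = {"degraded": False}
qos = lambda: random.gauss(90 if state["degraded"] else 100, 2)
fault = lambda: state.update(degraded=True)
print(run_availability_benchmark(qos, fault, intervals=10, fault_at=5))
```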
35. Methodology: reporting results
- Results are most accessible graphically
- plot change in QoS metrics over time
- compare to normal behavior
- 99% confidence intervals calculated from no-fault runs (see the sketch below)
- Graphs can be distilled into numbers
- quantify the distribution of deviations from normal behavior, compute the area under the curve for deviations, ...
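A sketch of both reporting steps: derive a per-interval 99% band from repeated no-fault runs (assuming approximate normality), then distill a faulty run's graph into a single deviation number.

```python
import statistics

def normal_band(no_fault_runs: list[list[float]], z: float = 2.576):
    """Per-interval 99% confidence band from repeated no-fault runs
    (z = 2.576 for a two-sided 99% interval under normality)."""
    lo, hi = [], []
    for samples in zip(*no_fault_runs):
        m = statistics.fmean(samples)
        s = statistics.stdev(samples)
        lo.append(m - z * s)
        hi.append(m + z * s)
    return lo, hi

def deviation_area(trace, lo, hi) -> float:
    """Distill the graph into a number: total QoS outside the band."""
    return sum(max(0.0, l - x) + max(0.0, x - h)
               for x, l, h in zip(trace, lo, hi))

lo, hi = normal_band([[100, 101, 99], [98, 102, 100], [101, 100, 98]])
print(deviation_area([99, 80, 100], lo, hi))  # penalty from the dip to 80
```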
36. Example results: software RAID-5
- Test systems: Linux/Apache and Win2000/IIS
- SpecWeb99 to measure hits/second as the QoS metric
- fault injection at disks based on empirical fault data
- transient, correctable, uncorrectable, and timeout faults
- 15 single-fault workloads injected per system
- only 4 distinct behaviors observed:
- (A) no effect
- (B) system hangs
- (C) RAID enters degraded mode
- (D) RAID enters degraded mode and starts reconstruction
- both systems hung (B) on simulated disk hangs
- Linux exhibited (D) on all other errors
- Windows exhibited (A) on transient errors and (C) on uncorrectable, sticky errors
37. Example results: multiple faults
[Graphs: QoS over time during the multiple-fault scenario, for Windows 2000/IIS and Linux/Apache]
- Windows reconstructs 3x faster than Linux
- Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
38. Conclusions
- IRAM is attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
- mobile consumer electronic devices
- scalable infrastructure
- IRAM benchmarking result: faster than DSPs
- ISTORE: a hardware/software architecture for large-scale network services
- Scaling systems requires
- new, continuous models of availability
- performance not limited by the weakest link
- self-managing systems to reduce human interaction
39. Benchmark conclusions
- Linux and Windows take opposite approaches to managing benign and transient faults
- Linux is paranoid and stops using a disk on any error
- Windows ignores most benign/transient faults
- Windows is more robust, except when the disk is truly failing
- Linux and Windows have different reconstruction philosophies
- Linux uses idle bandwidth for reconstruction
- Windows steals application bandwidth for reconstruction
- Windows rebuilds fault-tolerance more quickly
- Win2k favors fault-tolerance over performance; Linux favors performance over fault-tolerance
40. ISTORE conclusions
- Availability, Maintainability, and Evolutionary growth are key challenges for server systems
- more important even than performance
- ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
- via clusters of network-attached, computationally-enhanced storage nodes running distributed code
- via hardware and software introspection
- we are currently performing application studies to investigate and compare techniques
- Availability benchmarks are a powerful tool
- they revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
41. Future work
- ISTORE
- implement AME-enhancing techniques in a variety of Internet, enterprise, and information retrieval applications
- select the best techniques and integrate them into a generic runtime system with an AME API
- AME benchmarks
- expand availability benchmarks to distributed apps
- add maintainability
- use the methodology from the availability benchmarks
- but include the administrator's response to faults
- must develop a model of typical administrator behavior
- can we quantify the administrative work needed to maintain a certain level of availability?
42. The UC Berkeley ISTORE Project: bringing availability, maintainability, and evolutionary growth to storage-based clusters
- For more information:
- http://iram.cs.berkeley.edu/istore
- istore-group@cs.berkeley.edu
43. Backup Slides
- (mostly in the area of benchmarking)
44. Case study
- Software RAID-5 plus web server
- Linux/Apache vs. Windows 2000/IIS
- Why software RAID?
- well-defined availability guarantees
- a RAID-5 volume should tolerate a single disk failure
- reduced performance ("degraded mode") after failure
- may automatically rebuild redundancy onto a spare disk
- simple system
- easy to inject storage faults
- Why a web server?
- an application with measurable QoS metrics that depend on RAID availability and performance
45. Benchmark environment: metrics
- QoS metrics measured
- hits per second
- roughly tracks response time in our experiments
- degree of fault tolerance in the storage system
- Workload generator and data collector
- SpecWeb99 web benchmark
- simulates realistic high-volume user load
- mostly static read-only workload; some dynamic content
- modified to run continuously and to measure average hits per second over each 2-minute interval
46. Benchmark environment: faults
- Focus on faults in the storage system (disks)
- How do disks fail?
- according to the Tertiary Disk project, failures include:
- recovered media errors
- uncorrectable write failures
- hardware errors (e.g., diagnostic failures)
- SCSI timeouts
- SCSI parity errors
- note: no head crashes, no fail-stop failures
47. Disk fault injection technique
- To inject reproducible failures, we replaced one disk in the RAID with an emulated disk
- a PC that appears as a disk on the SCSI bus
- I/O requests processed in software, reflected to a local disk
- fault injection performed by altering SCSI command processing in the emulation software (sketched below)
- Types of emulated faults:
- media errors (transient, correctable, uncorrectable)
- hardware errors (firmware, mechanical)
- parity errors
- power failures
- disk hangs/timeouts
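A sketch of the shape of the emulator's command loop; this is a hypothetical illustration of "altering SCSI command processing", not the actual ISTORE emulation software. Faults are injected by rewriting the response to selected commands before it goes back on the bus.

```python
# This run's fault plan: fail READs with an uncorrectable media error.
FAULT_PLAN = {"READ": "media_error_uncorrectable"}

def handle_scsi_command(cmd: str, lba: int, backing_store: dict):
    fault = FAULT_PLAN.get(cmd)
    if fault == "media_error_uncorrectable":
        return {"status": "CHECK_CONDITION", "sense": "MEDIUM_ERROR"}
    if fault == "timeout":
        return None                    # never answer: the host sees a hang
    # Normal path: reflect the request to the local backing disk.
    if cmd == "READ":
        return {"status": "GOOD", "data": backing_store.get(lba, b"\x00")}
    if cmd == "WRITE":
        backing_store[lba] = b"..."    # payload elided in this sketch
        return {"status": "GOOD"}

print(handle_scsi_command("READ", lba=42, backing_store={}))
```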
48. System configuration
- RAID-5 volume: 3 GB capacity, 1 GB used per disk
- 3 physical disks, 1 emulated disk, 1 emulated spare disk
- 2 web clients connected via 100 Mb switched Ethernet
49. Results: single-fault experiments
- One experiment for each type of fault (15 total)
- only one fault injected per experiment
- no human intervention
- system allowed to continue until it stabilized or crashed
- Four distinct system behaviors observed:
- (A) no effect: system ignores the fault
- (B) RAID system enters degraded mode
- (C) RAID system begins reconstruction onto the spare disk
- (D) system failure (hang or crash)
50. System behavior: single-fault
[Graphs: example QoS traces for each behavior: (A) no effect; (B) enter degraded mode; (C) begin reconstruction; (D) system failure]
51. System behavior: single-fault (2)
- Windows ignores benign faults
- Windows can't automatically rebuild
- Linux reconstructs on all errors
- Both systems fail when the disk hangs
52. Interpretation: single-fault experiments
- Linux and Windows take opposite approaches to managing benign and transient faults
- these faults do not necessarily imply a failing disk
- Tertiary Disk: 368/368 disks had transient SCSI errors; 13/368 disks had transient hardware errors; only 2/368 needed replacing
- Linux is paranoid and stops using a disk on any error
- fragile: the system is more vulnerable to multiple faults
- but no chance of a slowly-failing disk impacting performance
- Windows ignores most benign/transient faults
- robust: less likely to lose data, more disk-efficient
- less likely to catch slowly-failing disks and remove them
- Neither policy is ideal!
- need a hybrid?
53. Results: multiple-fault experiments
- Scenario
- (1) disk fails
- (2) data is reconstructed onto the spare
- (3) spare fails
- (4) administrator replaces both failed disks
- (5) data is reconstructed onto the new disks
- Requires human intervention
- to initiate reconstruction on Windows 2000
- simulated 6-minute sysadmin response time
- to replace disks
- simulated 90 seconds of time to replace hot-swap disks
54. Interpretation: multi-fault experiments
- Linux and Windows have different reconstruction philosophies
- Linux uses idle bandwidth for reconstruction
- little impact on application performance
- increases the length of time the system is vulnerable to faults
- Windows steals application bandwidth for reconstruction
- reduces application performance
- minimizes system vulnerability
- but must be manually initiated (or scripted)
- Windows favors fault-tolerance over performance; Linux favors performance over fault-tolerance
- the same design philosophies seen in the single-fault experiments
55. Maintainability observations
- Scenario: administrator accidentally removes and replaces a live disk in degraded mode
- double failure; no guarantee on data integrity
- theoretically, can recover if writes are queued
- Windows recovers, but loses active writes
- journalling NTFS is not corrupted
- all data not being actively written is intact
- Linux will not allow the removed disk to be reintegrated
- total loss of all data on the RAID volume!
56. Maintainability observations (2)
- Scenario: administrator adds a new spare
- a common task that can be done with hot-swap drive trays
- Linux requires a reboot for the disk to be recognized
- Windows can dynamically detect the new disk
- Windows 2000 RAID is easier to maintain
- easier GUI configuration
- more flexible in adding disks
- SCSI rescan and NTFS deal with administrator goofs
- less likely to require administration due to transient errors
- BUT reconstruction must be manually initiated when needed