Title: Computers for the Post-PC Era
1. Computers for the Post-PC Era
- David Patterson
- University of California at Berkeley
- patterson@cs.berkeley.edu
- UC Berkeley IRAM Group
- UC Berkeley ISTORE Group
- istore-group@cs.berkeley.edu
- May 2000
2. Perspective on Post-PC Era
- Post-PC Era will be driven by 2 technologies:
- 1) Gadgets: Tiny Embedded or Mobile Devices
- ubiquitous: in everything
- e.g., successor to PDA, cell phone, wearable computers
- 2) Infrastructure to Support such Devices
- e.g., successor to Big Fat Web Servers, Database Servers
3. VIRAM-1 Block Diagram
4. VIRAM-1 System on a Chip
- Prototype scheduled for tape-out mid-2000
- 0.18 µm EDL process
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- 4 64-bit vector unit pipelines @ 200 MHz
- 4 100-MB/s parallel I/O lines
- 17x17 mm, 2 Watts
- 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
- 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
[Block diagram: two 8-MByte (64-Mbit) DRAM halves connected through a crossbar (Xbar) to the 4 vector pipes/lanes and I/O.]
5. Problem: General Element Permutation
- Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
- Datapath: 16 x 16 (x 16b) crossbar; scales by O(N²)
- Control: 16 16-to-1 multiplexors; scales by O(N log N)
- Other problems:
- Consecutive result elements not written together; time/energy wasted on wide vector register file port
6. Simple Vector Permutations
- Simple steps of butterfly permutations
- A register provides the butterfly radix
- Separate instructions for moving elements to left/right
- Sufficient semantics for:
- Fast reductions of vector registers (dot products); see the sketch below
- Fast FFT/DCT kernels
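As an aside, here is a minimal C sketch of the reduction idea, with plain scalar code standing in for the vector move-plus-add instructions (this is not the VIRAM ISA, just the access pattern it enables):

    /* Reduction with butterfly-style moves: each step moves the upper
     * half of the active elements onto the lower half and adds, so an
     * n-element sum (n a power of 2) takes log2(n) steps. */
    #include <stddef.h>
    #include <stdint.h>

    int32_t butterfly_sum(int32_t *v, size_t n)   /* modifies v in place */
    {
        for (size_t r = n; r > 1; r /= 2)         /* radix halves each step */
            for (size_t i = 0; i < r / 2; i++)
                v[i] += v[i + r / 2];             /* "move left" + add */
        return v[0];
    }

Because each step only moves elements by half the current radix, the O(N) bus-and-shifter datapath of the next slide suffices.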
7. Hardware for Simple Permutations
- Hardware for 128 16b elements, 256b datapath
- Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, or 32b only); scales by O(N)
- Control: 6 control cases; scales by O(N)
- Other benefits:
- Consecutive result elements written together
- Buses used only for small radices
8. FFT: Straightforward Approach
Problem: most time is spent on short vectors in the later stages of the FFT (see the sketch below).
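To see why, consider a minimal radix-2 decimation-in-frequency FFT sketch in C (illustrative only, not the VIRAM kernel). The inner i-loop is the one that vectorizes, and its trip count, span, halves every stage, so the final stages run with very short vectors:

    /* Minimal radix-2 DIF FFT (complex data as separate re/im arrays,
     * n a power of 2). Output is left in bit-reversed order; the
     * reordering pass is omitted for brevity. */
    #include <math.h>
    #include <stddef.h>

    #define PI 3.14159265358979323846

    void fft_dif(double *re, double *im, size_t n)
    {
        for (size_t span = n / 2; span >= 1; span /= 2)   /* log2(n) stages */
            for (size_t g = 0; g < n; g += 2 * span)      /* each group */
                for (size_t i = 0; i < span; i++) {       /* vector loop */
                    double wr = cos(-PI * (double)i / (double)span);
                    double wi = sin(-PI * (double)i / (double)span);
                    double ar = re[g + i],        ai = im[g + i];
                    double br = re[g + i + span], bi = im[g + i + span];
                    re[g + i] = ar + br;
                    im[g + i] = ai + bi;
                    double dr = ar - br, di = ai - bi;
                    re[g + i + span] = dr * wr - di * wi; /* twiddle */
                    im[g + i + span] = dr * wi + di * wr;
                }
    }

The transpose inside the vector registers on the next slide is one way to keep vector lengths long across all stages.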
9. FFT: Transpose inside Vector Registers
10. FFT: Straightforward Approach
11. VIRAM-1 Design Status
- MIPS scalar core
- Synthesizable RTL code received from MIPS
- Cache RAMs to be compiled for IBM technology
- FPU RTL code almost complete
- Vector unit
- RTL models for sub-blocks developed; currently being integrated and tested
- Control logic to be compiled for IBM technology
- Full-custom layout for multipliers/adders developed; layout for shifters to be developed
- Memory system
- Synthesizable model for DRAM controllers done
- To be integrated with IBM DRAM macros
- Full-custom layout for crossbar under development
- Testing infrastructure
- Environment developed for automatic test validation
- Directed tests for single/multiple instruction groups developed
- Random instruction sequence generator developed
12. FPU Features
- Executes MIPS IV ISA single-precision FP instructions
- Thirty-two 32-bit Floating-Point Registers
- Two 32-bit Control Registers
- One fully pipelined, nearly full IEEE-754-compliant execution unit with 3-cycle latency (division takes 10 cycles) (from Albert Ma @ MIT)
- 6-stage pipeline (R-X-X-X-CDB-WB)
- Support for partial out-of-order execution and precise exceptions
- Scalar core dispatches FP instructions to the FPU using an interface that splits instructions into 3 classes (a toy classifier follows this list):
- Arithmetic instructions (ADD.S, SUB.S, MUL.S, DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S, TRUNC.W.S, MOV.S, MOVZ.S, MOVN.S)
- From-Coprocessor Data Transfer instructions (SWC1, MFC1, CFC1)
- To-Coprocessor Data Transfer instructions (LWC1, MTC1, CTC1)
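As a toy illustration of that three-way split (the real interface is hardware signals between the scalar core and the FPU; the mnemonic-string lookup here is purely hypothetical):

    /* Classify a dispatched FP instruction into the three interface
     * classes listed above; anything that is not a coprocessor data
     * transfer falls into the arithmetic class. */
    #include <string.h>

    enum fpu_class { FPU_ARITH, FPU_FROM_COP, FPU_TO_COP };

    enum fpu_class classify(const char *mnemonic)
    {
        static const char *from_cop[] = { "SWC1", "MFC1", "CFC1" };
        static const char *to_cop[]   = { "LWC1", "MTC1", "CTC1" };
        for (int i = 0; i < 3; i++) {
            if (strcmp(mnemonic, from_cop[i]) == 0) return FPU_FROM_COP;
            if (strcmp(mnemonic, to_cop[i])   == 0) return FPU_TO_COP;
        }
        return FPU_ARITH;   /* ADD.S, SUB.S, MUL.S, DIV.S, ... */
    }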
13. FPU Architecture
14. Multiplier Partitioning
- 64-bit multiplier built from 16-bit multiplier subblocks
- Subblocks combined with adders to perform larger multiplies
- Performs 2 simultaneous 32-bit multiplies by grouping 4 subblocks (see the sketch below)
- Performs 4 simultaneous 16-bit multiplies by using individual subblocks
- Unused blocks turned off to conserve power
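A minimal C sketch of the arithmetic the partitioning exploits (illustrative, not the FPU datapath): one 32 x 32 product assembled from the four 16 x 16 sub-products that a group of four subblocks computes, combined with shifts and adds. A 64-bit multiply extends the same scheme to sixteen sub-products.

    /* 32x32 -> 64-bit multiply from four 16x16 subblock products:
     * a*b = hh<<32 + (lh + hl)<<16 + ll. */
    #include <stdint.h>

    uint64_t mul32_from_16(uint32_t a, uint32_t b)
    {
        uint32_t al = a & 0xFFFF, ah = a >> 16;      /* 16-bit halves */
        uint32_t bl = b & 0xFFFF, bh = b >> 16;
        uint64_t ll = (uint64_t)al * bl;             /* four subblocks */
        uint64_t lh = (uint64_t)al * bh;
        uint64_t hl = (uint64_t)ah * bl;
        uint64_t hh = (uint64_t)ah * bh;
        return ll + ((lh + hl) << 16) + (hh << 32);  /* combining adders */
    }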
15. FPU Current Status
- Current functionality
- Able to execute most instructions (all except C.cond.S, CFC1, and CTC1)
- Supports precise exception semantics
- Functionality verification
- Used a random test generator that generates/kills instructions at random and compares the results from the RTL Verilog simulator against the results from an ISA Perl simulator (the same differential-testing idea is sketched after this list)
- What remains to be done
- Instructions that use the Control Registers (C.cond.S, CFC1, and CTC1)
- Exception generation
- Integrate execution pipeline with the rest of the design
- Synthesize, place, and route
- Final assembly and verification of multiplier
- Performance
- Sustainable throughput: 1 instruction/cycle (assuming no data hazards)
- Instruction latency: 6 cycles
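For flavor, a self-contained sketch of such a differential-testing loop, reusing the partitioned-multiply sketch from slide 14 as the design under test and a plain multiply as the reference model (the actual harness compares the Verilog RTL against the Perl ISA simulator):

    /* Random differential testing: run the same random operands through
     * the unit under test and a trusted reference, flag any mismatch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    uint64_t mul32_from_16(uint32_t a, uint32_t b);  /* sketch on slide 14 */

    int main(void)
    {
        srand(42);                                   /* reproducible runs */
        for (long i = 0; i < 1000000; i++) {
            uint32_t a = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
            uint32_t b = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
            if (mul32_from_16(a, b) != (uint64_t)a * b) {
                printf("mismatch: %08x * %08x\n", (unsigned)a, (unsigned)b);
                return 1;                            /* keep failing case */
            }
        }
        puts("1M random cases passed");
        return 0;
    }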
16. UC-IBM Agreement
- Biggest IRAM obstacle: the Intellectual Property Agreement between the University of California and IBM
- Can the university accept free fab costs ($2.0M to $2.5M) in return for capped, non-exclusive patent licensing fees for IBM if UC files for IRAM patents?
- Process started with IBM March 1999
- IBM won't give full process info until a contract is signed
- UC started negotiating seriously Jan. 2000
- Agreement June 1, 2000!
17. Other examples: IBM Blue Gene
- 1 PetaFLOPS in 2005 for $100M?
- Application: Protein Folding
- Blue Gene chip
- 32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip
- 1 GFLOPS / processor
- 2 x 2 board: 64 chips (2K CPUs)
- Rack: 8 boards (512 chips, 16K CPUs)
- System: 64 racks (512 boards, 32K chips, 1M CPUs)
- Total: 1 million processors in just 2000 sq. ft.
18. Other examples: Sony PlayStation 2
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
- Superscalar MIPS core + vector coprocessor + graphics/DRAM
- Claim: "Toy Story" realism brought to games
19. Outline
- 1) Example microprocessor for Post-PC gadgets
- 2) Motivation and the ISTORE project vision
- AME: Availability, Maintainability, Evolutionary growth
- ISTORE's research principles
- Benchmarks for AME
- Conclusions and future work
20. Lampson: Systems Challenges
- Systems that work
- Meeting their specs
- Always available
- Adapting to changing environment
- Evolving while they run
- Made from unreliable components
- Growing without practical limit
- Credible simulations or analysis
- Writing good specs
- Testing
- Performance
- Understanding when it doesn't matter
"Computer Systems Research: Past and Future," keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft.
21. Hennessy: What Should the New World Focus Be?
- Availability
- Both appliance and service
- Maintainability
- Two functions:
- Enhancing availability by preventing failure
- Ease of SW and HW upgrades
- Scalability
- Especially of service
- Cost
- Per device and per service transaction
- Performance
- Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" Keynote address, FCRC, May 1999. John Hennessy, Stanford.
22. The real scalability problems: AME
- Availability
- systems should continue to meet quality-of-service goals despite hardware and software failures
- Maintainability
- systems should require only minimal ongoing human administration, regardless of scale or complexity
- Evolutionary Growth
- systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and they will only get worse as systems grow
23. Principles for achieving AME (1)
- No single points of failure
- Redundancy everywhere
- Performance robustness is more important than peak performance
- performance robustness implies that real-world performance is comparable to best-case performance
- Performance can be sacrificed for improvements in AME
- resources should be dedicated to AME
- compare: biological systems spend > 50% of resources on maintenance
- can make up performance by scaling the system
24. Principles for achieving AME (2)
- Introspection
- reactive techniques to detect and adapt to failures, workload variations, and system evolution
- proactive techniques to anticipate and avert problems before they happen (see the sketch below)
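A minimal C sketch of the reactive/proactive distinction, assuming a hypothetical read_temp() sensor call (stubbed here so the sketch is self-contained) and an assumed thermal limit:

    /* Reactive: respond once a limit is crossed. Proactive: extrapolate
     * the trend and act before it is crossed. */
    #include <stdio.h>

    #define LIMIT_C 70.0                             /* assumed limit */

    static double read_temp(void) { return 55.0; }   /* hypothetical sensor */

    void introspect(int samples)
    {
        double prev = read_temp();
        for (int i = 0; i < samples; i++) {
            double cur = read_temp();
            if (cur > LIMIT_C)                            /* reactive */
                puts("over limit: throttle or fail over now");
            else if (cur + 10.0 * (cur - prev) > LIMIT_C) /* proactive */
                puts("trend predicts limit soon: schedule maintenance");
            prev = cur;
        }
    }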
25. ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
- cluster nodes are plug-and-play, intelligent, network-attached storage bricks
- a single field-replaceable unit to simplify maintenance
- each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
- more CPU than NAS; fewer disks/node than a cluster
- Intelligent Disk Brick: portable-PC CPU (Pentium II/266), DRAM, redundant NICs (4 x 100 Mb/s links), diagnostic processor
- ISTORE Chassis
- 80 nodes, 8 per tray
- 2 levels of switches: 20 x 100 Mbit/s, 2 x 1 Gbit/s
- Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
26. ISTORE-1 Status
- 10 nodes manufactured
- Boots OS
- Diagnostic Processor interface SW complete
- PCB backplane not yet designed
- Finish 80-node system Summer 2000
27. Hardware techniques (1)
- Fully shared-nothing cluster organization
- truly scalable architecture
- architecture that tolerates partial failure
- automatic hardware redundancy
28. Hardware techniques (2)
- No central processor unit: distribute processing with storage
- Serial lines and switches also growing with Moore's Law: less need today to centralize vs. bus-oriented systems
- Most storage servers limited by speed of CPUs: why does this make sense?
- Why not amortize the sheet metal, power, and cooling infrastructure for disk to add processor, memory, and network?
- If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage
29. Hardware techniques (3)
- Heavily instrumented hardware
- sensors for temperature, vibration, humidity, power, intrusion
- helps detect environmental problems before they can affect system integrity
- Independent diagnostic processor on each node
- provides remote control of power, remote console access to the node, and selection of node boot code
- collects, stores, and processes environmental data for abnormalities
- non-volatile "flight recorder" functionality
- all diagnostic processors connected via independent diagnostic network
30. Hardware techniques (4)
- On-demand network partitioning/isolation
- Internet applications must remain available despite failures of components, so a subset can be isolated for preventative maintenance
- Allows testing and repair of online system
- Managed by diagnostic processor and network switches via diagnostic network
31. Hardware techniques (5)
- Built-in fault-injection capabilities
- Power control to individual node components
- Injectable glitches into I/O and memory buses
- Managed by diagnostic processor
- Used for proactive hardware introspection
- automated detection of flaky components
- controlled testing of error-recovery mechanisms
- Important for AME benchmarking (see next slide); a software-level analogue is sketched below
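The same idea carries over to software-level fault injection; a minimal C sketch (names illustrative, not ISTORE's actual mechanism) of a hook that injects a single-bit glitch into a buffer with a given probability:

    /* Flip one random bit of buf with probability p, to exercise
     * error-detection and recovery paths that normally never run. */
    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    void maybe_inject(uint8_t *buf, size_t len, double p)
    {
        if (len == 0 || (double)rand() / RAND_MAX >= p)
            return;                                  /* no fault this time */
        size_t byte = (size_t)rand() % len;
        buf[byte] ^= (uint8_t)(1u << (rand() % 8));  /* single-bit glitch */
    }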
32. Hardware techniques (6)
- Benchmarking
- One reason for 1000X processor performance was the ability to measure (vs. debate) which design is better
- e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed? (see the identity after this list)
- Need AME benchmarks
- what gets measured gets done
- benchmarks shape a field
- quantification brings rigor
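The quantitative answer comes from the standard CPU-time identity (textbook background, not from the slide):

    CPU time = Instructions executed x CPI / Clock rate

For example, 10^9 instructions at 1.5 CPI on a 200 MHz clock take 7.5 s; halving CPI buys exactly as much as doubling the clock rate, and measuring each factor separately is what settles the debate.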
33. Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks
- to generate fair workloads
- to measure and trace quality-of-service metrics
- Use fault injection to compromise the system
- hardware faults (disk, memory, network, power)
- software faults (corrupt input, driver error returns)
- maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
- the availability analogues of performance micro- and macro-benchmarks
34. Benchmark Availability? Methodology for reporting results
- Results are most accessible graphically
- plot change in QoS metrics over time
- compare to "normal behavior"
- 99% confidence intervals calculated from no-fault runs (see the sketch below)
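A minimal C sketch of one way to compute that "normal behavior" band from the no-fault runs (assuming roughly normal samples; 2.576 is the two-sided 99% z-value; not necessarily the exact procedure used):

    /* 99% confidence interval for the mean QoS metric over n >= 2
     * no-fault samples: mean +/- 2.576 * s / sqrt(n). */
    #include <math.h>
    #include <stddef.h>

    void qos_band(const double *x, size_t n, double *lo, double *hi)
    {
        double mean = 0.0, ss = 0.0;
        for (size_t i = 0; i < n; i++)
            mean += x[i];
        mean /= (double)n;
        for (size_t i = 0; i < n; i++)
            ss += (x[i] - mean) * (x[i] - mean);
        double s = sqrt(ss / (double)(n - 1));       /* sample std. dev. */
        double half = 2.576 * s / sqrt((double)n);
        *lo = mean - half;
        *hi = mean + half;
    }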
35. Example results: multiple faults
[Graphs: QoS under multiple faults, Windows 2000/IIS vs. Linux/Apache]
- Windows reconstructs 3x faster than Linux
- Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
36. Conclusions (1): ISTORE
- Availability, Maintainability, and Evolutionary growth are key challenges for server systems
- more important even than performance
- ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
- via clusters of network-attached, computationally-enhanced storage nodes running distributed code
- via hardware and software introspection
- we are currently performing application studies to investigate and compare techniques
- Availability benchmarks: a powerful tool?
- revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
37. Conclusions (2)
- IRAM attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
- Gadgets: Embedded/Mobile devices
- Infrastructure: Intelligent Storage and Networks
- Post-PC infrastructure requires:
- New Goals: Availability, Maintainability, Evolution
- New Principles: Introspection, Performance Robustness
- New Techniques: Isolation/fault insertion, Software scrubbing
- New Benchmarks: measure and compare AME metrics