Title: Computers for the Post-PC Era
1. Computers for the Post-PC Era
- David Patterson
- University of California at Berkeley
- patterson@cs.berkeley.edu
- UC Berkeley IRAM Group
- UC Berkeley ISTORE Group
- istore-group@cs.berkeley.edu
- May 2000
2. Perspective on Post-PC Era
- Post-PC Era will be driven by 2 technologies:
- 1) Gadgets: Tiny Embedded or Mobile Devices
- ubiquitous: in everything
- e.g., successor to PDA, cell phone, wearable computers
- 2) Infrastructure to Support such Devices
- e.g., successor to Big Fat Web Servers, Database Servers
3. VIRAM-1 Block Diagram
4. VIRAM-1 System on a Chip
- Prototype scheduled for tape-out mid-2000
- 0.18 µm EDL process
- 16 MB DRAM, 8 banks
- MIPS scalar core and caches @ 200 MHz
- 4 64-bit vector unit pipelines @ 200 MHz
- 4 100-MB/s parallel I/O lines
- 17x17 mm, 2 Watts
- 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
- 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
[Block diagram: two 8-MByte (64-Mbit) DRAM halves connected through a crossbar (Xbar) to the 4 vector pipes/lanes and I/O.]
5. Problem: General Element Permutation
- Hardware for a full vector permutation instruction (128 16b elements, 256b datapath)
- Datapath: 16 x 16 (x 16b) crossbar; scales by O(N²)
- Control: 16 16-to-1 multiplexors; scales by O(N log N)
- Other problems:
- Consecutive result elements not written together; time/energy wasted on wide vector register file port
6. Simple Vector Permutations
- Simple steps of butterfly permutations
- A register provides the butterfly radix
- Separate instructions for moving elements to left/right
- Sufficient semantics for:
- Fast reductions of vector registers (dot products); see the sketch below
- Fast FFT/DCT kernels
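As an aside, here is a minimal C sketch of the reduction idea, with plain scalar code standing in for the vector move-plus-add instructions (this is not the VIRAM ISA, just the access pattern it enables):

    /* Reduction with butterfly-style moves: each step moves the upper
     * half of the active elements onto the lower half and adds, so an
     * n-element sum (n a power of 2) takes log2(n) steps. */
    #include <stddef.h>
    #include <stdint.h>

    int32_t butterfly_sum(int32_t *v, size_t n)   /* modifies v in place */
    {
        for (size_t r = n; r > 1; r /= 2)         /* radix halves each step */
            for (size_t i = 0; i < r / 2; i++)
                v[i] += v[i + r / 2];             /* "move left" + add */
        return v[0];
    }

Because each step only moves elements by half the current radix, the O(N) bus-and-shifter datapath of the next slide suffices.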
7. Hardware for Simple Permutations
- Hardware for 128 16b elements, 256b datapath
- Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, or 32b only); scales by O(N)
- Control: 6 control cases; scales by O(N)
- Other benefits:
- Consecutive result elements written together
- Buses used only for small radices
8. FFT: Straightforward Approach
Problem: most time is spent on short vectors in the later stages of the FFT (see the sketch below).
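To see why, consider a minimal radix-2 decimation-in-frequency FFT sketch in C (illustrative only, not the VIRAM kernel). The inner i-loop is the one that vectorizes, and its trip count, span, halves every stage, so the final stages run with very short vectors:

    /* Minimal radix-2 DIF FFT (complex data as separate re/im arrays,
     * n a power of 2). Output is left in bit-reversed order; the
     * reordering pass is omitted for brevity. */
    #include <math.h>
    #include <stddef.h>

    #define PI 3.14159265358979323846

    void fft_dif(double *re, double *im, size_t n)
    {
        for (size_t span = n / 2; span >= 1; span /= 2)   /* log2(n) stages */
            for (size_t g = 0; g < n; g += 2 * span)      /* each group */
                for (size_t i = 0; i < span; i++) {       /* vector loop */
                    double wr = cos(-PI * (double)i / (double)span);
                    double wi = sin(-PI * (double)i / (double)span);
                    double ar = re[g + i],        ai = im[g + i];
                    double br = re[g + i + span], bi = im[g + i + span];
                    re[g + i] = ar + br;
                    im[g + i] = ai + bi;
                    double dr = ar - br, di = ai - bi;
                    re[g + i + span] = dr * wr - di * wi; /* twiddle */
                    im[g + i + span] = dr * wi + di * wr;
                }
    }

The transpose inside the vector registers on the next slide is one way to keep vector lengths long across all stages.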
9. FFT: Transpose inside Vector Registers
10. FFT: Straightforward Approach
11. VIRAM-1 Design Status
- MIPS scalar core
- Synthesizable RTL code received from MIPS
- Cache RAMs to be compiled for IBM technology
- FPU RTL code almost complete
- Vector unit
- RTL models for sub-blocks developed; currently being integrated and tested
- Control logic to be compiled for IBM technology
- Full-custom layout for multipliers/adders developed; layout for shifters to be developed
- Memory system
- Synthesizable model for DRAM controllers done
- To be integrated with IBM DRAM macros
- Full-custom layout for crossbar under development
- Testing infrastructure
- Environment developed for automatic test validation
- Directed tests for single/multiple instruction groups developed
- Random instruction sequence generator developed
12. FPU Features
- Executes MIPS IV ISA single-precision FP instructions
- Thirty-two 32-bit Floating-Point Registers
- Two 32-bit Control Registers
- One fully pipelined, nearly full IEEE-754-compliant execution unit with 3-cycle latency (division takes 10 cycles) (from Albert Ma @ MIT)
- 6-stage pipeline (R-X-X-X-CDB-WB)
- Support for partial out-of-order execution and precise exceptions
- Scalar core dispatches FP instructions to the FPU using an interface that splits instructions into 3 classes (a toy classifier follows this list):
- Arithmetic instructions (ADD.S, SUB.S, MUL.S, DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S, TRUNC.W.S, MOV.S, MOVZ.S, MOVN.S)
- From-Coprocessor Data Transfer instructions (SWC1, MFC1, CFC1)
- To-Coprocessor Data Transfer instructions (LWC1, MTC1, CTC1)
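As a toy illustration of that three-way split (the real interface is hardware signals between the scalar core and the FPU; the mnemonic-string lookup here is purely hypothetical):

    /* Classify a dispatched FP instruction into the three interface
     * classes listed above; anything that is not a coprocessor data
     * transfer falls into the arithmetic class. */
    #include <string.h>

    enum fpu_class { FPU_ARITH, FPU_FROM_COP, FPU_TO_COP };

    enum fpu_class classify(const char *mnemonic)
    {
        static const char *from_cop[] = { "SWC1", "MFC1", "CFC1" };
        static const char *to_cop[]   = { "LWC1", "MTC1", "CTC1" };
        for (int i = 0; i < 3; i++) {
            if (strcmp(mnemonic, from_cop[i]) == 0) return FPU_FROM_COP;
            if (strcmp(mnemonic, to_cop[i])   == 0) return FPU_TO_COP;
        }
        return FPU_ARITH;   /* ADD.S, SUB.S, MUL.S, DIV.S, ... */
    }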
13. FPU Architecture
14. Multiplier Partitioning
- 64-bit multiplier built from 16-bit multiplier subblocks
- Subblocks combined with adders to perform larger multiplies
- Performs 2 simultaneous 32-bit multiplies by grouping 4 subblocks (see the sketch below)
- Performs 4 simultaneous 16-bit multiplies by using individual subblocks
- Unused blocks turned off to conserve power
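A minimal C sketch of the arithmetic the partitioning exploits (illustrative, not the FPU datapath): one 32 x 32 product assembled from the four 16 x 16 sub-products that a group of four subblocks computes, combined with shifts and adds. A 64-bit multiply extends the same scheme to sixteen sub-products.

    /* 32x32 -> 64-bit multiply from four 16x16 subblock products:
     * a*b = hh<<32 + (lh + hl)<<16 + ll. */
    #include <stdint.h>

    uint64_t mul32_from_16(uint32_t a, uint32_t b)
    {
        uint32_t al = a & 0xFFFF, ah = a >> 16;      /* 16-bit halves */
        uint32_t bl = b & 0xFFFF, bh = b >> 16;
        uint64_t ll = (uint64_t)al * bl;             /* four subblocks */
        uint64_t lh = (uint64_t)al * bh;
        uint64_t hl = (uint64_t)ah * bl;
        uint64_t hh = (uint64_t)ah * bh;
        return ll + ((lh + hl) << 16) + (hh << 32);  /* combining adders */
    }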
15. FPU Current Status
- Current functionality
- Able to execute most instructions (all except C.cond.S, CFC1, and CTC1)
- Supports precise exception semantics
- Functionality verification
- Used a random test generator that generates/kills instructions at random and compares the results from the RTL Verilog simulator against the results from an ISA Perl simulator (the same differential-testing idea is sketched after this list)
- What remains to be done
- Instructions that use the Control Registers (C.cond.S, CFC1, and CTC1)
- Exception generation
- Integrate execution pipeline with the rest of the design
- Synthesize, place, and route
- Final assembly and verification of multiplier
- Performance
- Sustainable throughput: 1 instruction/cycle (assuming no data hazards)
- Instruction latency: 6 cycles
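For flavor, a self-contained sketch of such a differential-testing loop, reusing the partitioned-multiply sketch from slide 14 as the design under test and a plain multiply as the reference model (the actual harness compares the Verilog RTL against the Perl ISA simulator):

    /* Random differential testing: run the same random operands through
     * the unit under test and a trusted reference, flag any mismatch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    uint64_t mul32_from_16(uint32_t a, uint32_t b);  /* sketch on slide 14 */

    int main(void)
    {
        srand(42);                                   /* reproducible runs */
        for (long i = 0; i < 1000000; i++) {
            uint32_t a = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
            uint32_t b = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
            if (mul32_from_16(a, b) != (uint64_t)a * b) {
                printf("mismatch: %08x * %08x\n", (unsigned)a, (unsigned)b);
                return 1;                            /* keep failing case */
            }
        }
        puts("1M random cases passed");
        return 0;
    }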
16. UC-IBM Agreement
- Biggest IRAM obstacle: the Intellectual Property Agreement between the University of California and IBM
- Can the university accept free fab costs ($2.0M to $2.5M) in return for capped, non-exclusive patent licensing fees for IBM if UC files for IRAM patents?
- Process started with IBM March 1999
- IBM won't give full process info until a contract is signed
- UC started negotiating seriously Jan. 2000
- Agreement June 1, 2000!
17. Other examples: IBM Blue Gene
- 1 PetaFLOPS in 2005 for $100M?
- Application: Protein Folding
- Blue Gene chip
- 32 multithreaded RISC processors + ??MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip
- 1 GFLOPS / processor
- 2 x 2 board: 64 chips (2K CPUs)
- Rack: 8 boards (512 chips, 16K CPUs)
- System: 64 racks (512 boards, 32K chips, 1M CPUs)
- Total: 1 million processors in just 2000 sq. ft.
18. Other examples: Sony PlayStation 2
- Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
- Superscalar MIPS core + vector coprocessor + graphics/DRAM
- Claim: "Toy Story" realism brought to games
19. Outline
- 1) Example microprocessor for Post-PC gadgets
- 2) Motivation and the ISTORE project vision
- AME: Availability, Maintainability, Evolutionary growth
- ISTORE's research principles
- Benchmarks for AME
- Conclusions and future work
20. Lampson: Systems Challenges
- Systems that work
- Meeting their specs
- Always available
- Adapting to changing environment
- Evolving while they run
- Made from unreliable components
- Growing without practical limit
- Credible simulations or analysis
- Writing good specs
- Testing
- Performance
- Understanding when it doesn't matter
"Computer Systems Research: Past and Future," keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft.
21. Hennessy: What Should the New World Focus Be?
- Availability
- Both appliance and service
- Maintainability
- Two functions:
- Enhancing availability by preventing failure
- Ease of SW and HW upgrades
- Scalability
- Especially of service
- Cost
- Per device and per service transaction
- Performance
- Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" Keynote address, FCRC, May 1999. John Hennessy, Stanford.
22. The real scalability problems: AME
- Availability
- systems should continue to meet quality-of-service goals despite hardware and software failures
- Maintainability
- systems should require only minimal ongoing human administration, regardless of scale or complexity
- Evolutionary Growth
- systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
- These are problems at today's scales, and they will only get worse as systems grow
23. Principles for achieving AME (1)
- No single points of failure
- Redundancy everywhere
- Performance robustness is more important than peak performance
- performance robustness implies that real-world performance is comparable to best-case performance
- Performance can be sacrificed for improvements in AME
- resources should be dedicated to AME
- compare: biological systems spend > 50% of resources on maintenance
- can make up performance by scaling the system
24. Principles for achieving AME (2)
- Introspection
- reactive techniques to detect and adapt to failures, workload variations, and system evolution
- proactive techniques to anticipate and avert problems before they happen (see the sketch below)
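A minimal C sketch of the reactive/proactive distinction, assuming a hypothetical read_temp() sensor call (stubbed here so the sketch is self-contained) and an assumed thermal limit:

    /* Reactive: respond once a limit is crossed. Proactive: extrapolate
     * the trend and act before it is crossed. */
    #include <stdio.h>

    #define LIMIT_C 70.0                             /* assumed limit */

    static double read_temp(void) { return 55.0; }   /* hypothetical sensor */

    void introspect(int samples)
    {
        double prev = read_temp();
        for (int i = 0; i < samples; i++) {
            double cur = read_temp();
            if (cur > LIMIT_C)                            /* reactive */
                puts("over limit: throttle or fail over now");
            else if (cur + 10.0 * (cur - prev) > LIMIT_C) /* proactive */
                puts("trend predicts limit soon: schedule maintenance");
            prev = cur;
        }
    }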
25. ISTORE-1 hardware platform
- 80-node x86-based cluster, 1.4 TB storage
- cluster nodes are plug-and-play, intelligent, network-attached storage bricks
- a single field-replaceable unit to simplify maintenance
- each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
- more CPU than NAS; fewer disks/node than a cluster
- Intelligent Disk Brick: portable-PC CPU (Pentium II/266), DRAM, redundant NICs (4 x 100 Mb/s links), diagnostic processor
- ISTORE Chassis
- 80 nodes, 8 per tray
- 2 levels of switches: 20 x 100 Mbit/s, 2 x 1 Gbit/s
- Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
26. ISTORE-1 Status
- 10 nodes manufactured
- Boots OS
- Diagnostic Processor interface SW complete
- PCB backplane not yet designed
- Finish 80-node system Summer 2000
27. Hardware techniques (1)
- Fully shared-nothing cluster organization
- truly scalable architecture
- architecture that tolerates partial failure
- automatic hardware redundancy
28. Hardware techniques (2)
- No central processor unit: distribute processing with storage
- Serial lines and switches also growing with Moore's Law: less need today to centralize vs. bus-oriented systems
- Most storage servers limited by speed of CPUs: why does this make sense?
- Why not amortize the sheet metal, power, and cooling infrastructure for disk to add processor, memory, and network?
- If AME is important, must provide resources to be used to help AME: local processors responsible for health and maintenance of their storage
29. Hardware techniques (3)
- Heavily instrumented hardware
- sensors for temperature, vibration, humidity, power, intrusion
- helps detect environmental problems before they can affect system integrity
- Independent diagnostic processor on each node
- provides remote control of power, remote console access to the node, and selection of node boot code
- collects, stores, and processes environmental data for abnormalities
- non-volatile "flight recorder" functionality
- all diagnostic processors connected via independent diagnostic network
30. Hardware techniques (4)
- On-demand network partitioning/isolation
- Internet applications must remain available despite failures of components, so a subset can be isolated for preventative maintenance
- Allows testing and repair of online system
- Managed by diagnostic processor and network switches via diagnostic network
31. Hardware techniques (5)
- Built-in fault-injection capabilities
- Power control to individual node components
- Injectable glitches into I/O and memory buses
- Managed by diagnostic processor
- Used for proactive hardware introspection
- automated detection of flaky components
- controlled testing of error-recovery mechanisms
- Important for AME benchmarking (see next slide); a software-level analogue is sketched below
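The same idea carries over to software-level fault injection; a minimal C sketch (names illustrative, not ISTORE's actual mechanism) of a hook that injects a single-bit glitch into a buffer with a given probability:

    /* Flip one random bit of buf with probability p, to exercise
     * error-detection and recovery paths that normally never run. */
    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    void maybe_inject(uint8_t *buf, size_t len, double p)
    {
        if (len == 0 || (double)rand() / RAND_MAX >= p)
            return;                                  /* no fault this time */
        size_t byte = (size_t)rand() % len;
        buf[byte] ^= (uint8_t)(1u << (rand() % 8));  /* single-bit glitch */
    }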
32. Hardware techniques (6)
- Benchmarking
- One reason for 1000X processor performance was the ability to measure (vs. debate) which design is better
- e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed? (see the identity after this list)
- Need AME benchmarks
- what gets measured gets done
- benchmarks shape a field
- quantification brings rigor
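The quantitative answer comes from the standard CPU-time identity (textbook background, not from the slide):

    CPU time = Instructions executed x CPI / Clock rate

For example, 10^9 instructions at 1.5 CPI on a 200 MHz clock take 7.5 s; halving CPI buys exactly as much as doubling the clock rate, and measuring each factor separately is what settles the debate.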
33. Availability benchmark methodology
- Goal: quantify variation in QoS metrics as events occur that affect system availability
- Leverage existing performance benchmarks
- to generate fair workloads
- to measure and trace quality-of-service metrics
- Use fault injection to compromise the system
- hardware faults (disk, memory, network, power)
- software faults (corrupt input, driver error returns)
- maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
- the availability analogues of performance micro- and macro-benchmarks
34. Benchmark Availability? Methodology for reporting results
- Results are most accessible graphically
- plot change in QoS metrics over time
- compare to "normal behavior"
- 99% confidence intervals calculated from no-fault runs (see the sketch below)
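A minimal C sketch of one way to compute that "normal behavior" band from the no-fault runs (assuming roughly normal samples; 2.576 is the two-sided 99% z-value; not necessarily the exact procedure used):

    /* 99% confidence interval for the mean QoS metric over n >= 2
     * no-fault samples: mean +/- 2.576 * s / sqrt(n). */
    #include <math.h>
    #include <stddef.h>

    void qos_band(const double *x, size_t n, double *lo, double *hi)
    {
        double mean = 0.0, ss = 0.0;
        for (size_t i = 0; i < n; i++)
            mean += x[i];
        mean /= (double)n;
        for (size_t i = 0; i < n; i++)
            ss += (x[i] - mean) * (x[i] - mean);
        double s = sqrt(ss / (double)(n - 1));       /* sample std. dev. */
        double half = 2.576 * s / sqrt((double)n);
        *lo = mean - half;
        *hi = mean + half;
    }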
35. Example results: multiple faults
[Graphs: QoS under multiple faults, Windows 2000/IIS vs. Linux/Apache]
- Windows reconstructs 3x faster than Linux
- Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
36. Conclusions (1): ISTORE
- Availability, Maintainability, and Evolutionary growth are key challenges for server systems
- more important even than performance
- ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
- via clusters of network-attached, computationally-enhanced storage nodes running distributed code
- via hardware and software introspection
- we are currently performing application studies to investigate and compare techniques
- Availability benchmarks: a powerful tool?
- revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
37. Conclusions (2)
- IRAM attractive for two Post-PC applications because of low power, small size, and high memory bandwidth
- Gadgets: Embedded/Mobile devices
- Infrastructure: Intelligent Storage and Networks
- Post-PC infrastructure requires:
- New Goals: Availability, Maintainability, Evolution
- New Principles: Introspection, Performance Robustness
- New Techniques: Isolation/fault insertion, Software scrubbing
- New Benchmarks: measure and compare AME metrics