Computers for the Post-PC Era - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Computers for the Post-PC Era
  • David Patterson
  • University of California at Berkeley
  • patterson@cs.berkeley.edu
  • UC Berkeley IRAM Group
  • UC Berkeley ISTORE Group
  • istore-group@cs.berkeley.edu
  • May 2000

2
Perspective on Post-PC Era
  • PostPC Era will be driven by 2 technologies
  • 1) Gadgets: Tiny Embedded or Mobile Devices
  • ubiquitous in everything
  • e.g., successor to PDA, cell phone, wearable
    computers
  • 2) Infrastructure to Support such Devices
  • e.g., successor to Big Fat Web Servers, Database
    Servers

3
VIRAM-1 Block Diagram
4
VIRAM-1 System on a Chip
  • Prototype scheduled for tape-out mid 2000
  • 0.18 µm EDL process
  • 16 MB DRAM, 8 banks
  • MIPS scalar core and caches @ 200 MHz
  • 4 64-bit vector unit pipelines @ 200 MHz
  • 4 x 100 MB/s parallel I/O lines
  • 17 x 17 mm, 2 Watts
  • 25.6 GB/s memory bandwidth (6.4 GB/s per direction
    and per Xbar)
  • 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit); see the
    back-of-envelope check below

(Block diagram: two DRAM halves of 64 Mbits / 8 MBytes each, 4 vector pipes/lanes, crossbar (Xbar), and I/O)
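The quoted peak rates follow from the clock, lane count, and subword widths. Below is a minimal back-of-envelope check, assuming a multiply-add counts as two operations and that the 25.6 GB/s figure aggregates two directions over two crossbars; the slide does not spell out this accounting.

```python
# Sanity check of the VIRAM-1 peak numbers quoted above.
# Assumptions (not stated on the slide): a multiply-add counts as 2 ops,
# each 64-bit lane processes four 16-bit subwords, and the 25.6 GB/s
# aggregate is 6.4 GB/s per direction per crossbar x 2 directions x 2 crossbars.

CLOCK_HZ = 200e6
LANES = 4

flops_64bit = LANES * 2 * CLOCK_HZ        # 4 lanes x 2 flops/cycle  = 1.6 GFLOPS
gops_16bit = LANES * 4 * 2 * CLOCK_HZ     # 4 lanes x 4 subwords x 2 = 6.4 GOPS
mem_bw_bytes = 6.4e9 * 2 * 2              # per direction, per Xbar  = 25.6 GB/s

print(f"{flops_64bit / 1e9:.1f} GFLOPS, {gops_16bit / 1e9:.1f} GOPS, "
      f"{mem_bw_bytes / 1e9:.1f} GB/s")
```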
5
Problem: General Element Permutation
  • Hardware for a full vector permutation
    instruction (128 16b elements, 256b datapath)
  • Datapath: 16 x 16 (x 16b) crossbar; scales by O(N²)
    (see the cost sketch below)
  • Control: 16 16-to-1 multiplexors; scales by O(N log N)
  • Other problems
  • Consecutive result elements not written together;
    time/energy wasted on a wide vector register file
    port
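For reference, the semantics that make general permutation expensive can be written in a few lines. The instruction and cost model below are an illustrative sketch, not the actual VIRAM ISA.

```python
import math

# dst[i] = src[index[i]]: any destination element may take any source element,
# which is what forces a full crossbar datapath and per-element source selection.
def vperm_general(src, index):
    return [src[index[i]] for i in range(len(src))]

# Cost of supporting this directly on a 256-bit datapath holding sixteen
# 16-bit elements per cycle (the configuration on the slide):
N = 16
crosspoints = N * N                     # datapath: 16x16 crossbar, O(N^2)
select_bits = N * int(math.log2(N))     # control: N select fields of log2(N) bits, O(N log N)
print(crosspoints, select_bits)         # 256 crosspoints, 64 control bits
```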

6
Simple Vector Permutations
  • Simple steps of butterfly permutations
  • A register provides the butterfly radix
  • Separate instructions for moving elements to
    left/right
  • Sufficient semantics for
  • Fast reductions of vector registers (dot
    products; modeled in the sketch below)
  • Fast FFT/DCT kernels
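A plausible software model of the butterfly step, and of a reduction built from it, is sketched below. The function names and exact element movement are assumptions for illustration, not the VIRAM encoding.

```python
# Radix-r butterfly step, "left" direction: within each group of r elements,
# copy the upper r/2 elements down onto the lower r/2 positions.
def butterfly_left(src, radix):
    dst = list(src)
    half = radix // 2
    for base in range(0, len(src), radix):
        for i in range(half):
            dst[base + i] = src[base + half + i]
    return dst

# Dot product as a log2(n)-step reduction: multiply elementwise, then
# repeatedly fold the upper half onto the lower half and add.
def dot_product(a, b):
    acc = [x * y for x, y in zip(a, b)]
    n = len(acc)
    while n > 1:
        folded = butterfly_left(acc, n)
        acc = [acc[i] + folded[i] for i in range(n // 2)] + acc[n // 2:]
        n //= 2
    return acc[0]

assert dot_product([1, 2, 3, 4], [5, 6, 7, 8]) == 70   # 5 + 12 + 21 + 32
```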

7
Hardware for Simple Permutations
  • Hardware for 128 16b elements, 256b datapath
  • Datapath: 2 buses, 8 tristate drivers, 4
    multiplexors, 4 shifters (by 0, 16b, 32b only);
    scales by O(N)
  • Control: 6 control cases; scales by O(N)
  • Other benefits
  • Consecutive result elements written together
  • Buses used only for small radices

8
FFT: Straightforward
Problem: most time is spent on short vectors in the
later stages of the FFT (illustrated in the sketch below)
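The sketch below shows why, assuming a decimation-in-frequency ordering vectorized across the butterflies within a group (the slide does not show the kernel): the natural vector length halves every stage.

```python
# Natural vector length per stage of a straightforwardly vectorized 1024-point
# radix-2 FFT (decimation-in-frequency assumed): the butterfly span halves each
# stage, so the last stages run with very short vectors.
N = 1024
span = N // 2
for stage in range(1, N.bit_length()):     # stages 1 .. log2(N)
    print(f"stage {stage:2d}: vector length {span}")
    span //= 2
# The final stages use lengths 4, 2, 1 -- far below the 128-element vectors
# the datapath is designed around, so per-instruction startup overhead dominates.
```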
9
FFT: Transpose inside Vector Regs
10
FFT: Straightforward
11
VIRAM-1 Design Status
  • MIPS scalar core
  • Synthesizable RTL code received from MIPS
  • Cache RAMs to be compiled for IBM technology
  • FPU RTL code almost complete
  • Vector unit
  • RTL models for sub-blocks developed; currently
    being integrated and tested
  • Control logic to be compiled for IBM technology
  • Full-custom layout for multipliers/adders
    developed; layout for shifters to be developed
  • Memory system
  • Synthesizable model for DRAM controllers done
  • To be integrated with IBM DRAM macros
  • Full-custom layout for crossbar under development
  • Testing infrastructure
  • Environment developed for automatic test
    validation
  • Directed tests for single/multiple instruction
    groups developed
  • Random instruction sequence generator developed

12
FPU Features
  • Executes MIPS IV ISA single-precision FP
    instructions
  • Thirty-two 32-bit Floating Point Registers
  • Two 32-bit Control Registers
  • One fully pipelined, nearly full IEEE-754
    compliant execution unit with 3-cycle latency
    (division takes 10 cycles) (from Albert Ma at MIT)
  • 6-stage pipeline (R-X-X-X-CDB-WB)
  • Support for partial out-of-order execution and
    precise exceptions
  • Scalar Core dispatches FP instructions to FPU
    using an interface that splits instructions into
    3 classes
  • Arithmetic instructions (ADD.S, SUB.S, MUL.S,
    DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S,
    TRUNC.W.S, MOV.S, MOVZ.S, MOVN.S)
  • From Coprocessor Data Transfer instructions
    (SWC1, MFC1, CFC1)
  • To Coprocessor Data Transfer instructions (LWC1,
    MTC1, CTC1)

13
FPU Architecture
14
Multiplier Partitioning
  • 64-bit multiplier built from 16-bit multiplier
    subblocks
  • Subblocks combined with adders to perform larger
    multiplies (see the sketch below)
  • Performs 2 simultaneous 32-bit multiplies by
    grouping 4 subblocks
  • Performs 4 simultaneous 16-bit multiplies by
    using individual subblocks
  • Unused blocks turned off to conserve power
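A minimal sketch of the 32-bit grouping, assuming the standard partial-product decomposition; the actual datapath wiring is not shown on the slide.

```python
# One 32x32 multiply composed from four 16x16 multiplier subblocks plus adders.
MASK16 = 0xFFFF

def mul32_from_16x16_subblocks(a, b):
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    ll = a_lo * b_lo                 # four 16-bit subblocks, one partial product each
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    return ll + ((lh + hl) << 16) + (hh << 32)   # adders align and sum the partials

assert mul32_from_16x16_subblocks(0xDEADBEEF, 0x12345678) == 0xDEADBEEF * 0x12345678
```

In 16-bit mode the same four subblocks instead produce four independent products, and unused subblocks are turned off to save power, as the slide notes.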

15
FPU Current Status
  • Current Functionality
  • Able to execute most instructions (all except
    C.cond.S, CFC1 and CTC1).
  • Supports precise exception semantics.
  • Functionality verification.
  • Used a random test generator that generates/kills
    instructions at random and compares the results
    from the RTL Verilog simulator against the
    results from an ISA Perl simulator (the overall
    flow is sketched below)
  • What remains to be done
  • Instructions that use the Control Registers
    (C.cond.S, CFC1 and CTC1).
  • Exception generation.
  • Integrate execution pipeline with the rest of the
    design.
  • Synthesize, place and route.
  • Final assembly and verification of multiplier
  • Performance
  • Sustainable throughput: 1 instruction/cycle
    (assuming no data hazards)
  • Instruction latency: 6 cycles
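The shape of that random-test flow, as a hedged sketch: the actual harness uses Verilog and a Perl ISA simulator, so the function names and state comparison below are placeholders.

```python
import random

FP_OPS = ["ADD.S", "SUB.S", "MUL.S", "DIV.S", "ABS.S", "NEG.S", "MOV.S"]

def random_program(length):
    """Random single-precision FP instruction sequence over registers $f0..$f31."""
    return [(random.choice(FP_OPS),
             f"$f{random.randrange(32)}",
             f"$f{random.randrange(32)}",
             f"$f{random.randrange(32)}") for _ in range(length)]

def regression(trials, length, run_rtl_sim, run_isa_sim):
    """Run each random program through both simulators and diff the final state."""
    for t in range(trials):
        prog = random_program(length)
        if run_rtl_sim(prog) != run_isa_sim(prog):
            return t, prog      # keep the failing case for debugging
    return None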

16
UC-IBM Agreement
  • Biggest IRAM obstacle: the Intellectual Property
    Agreement between the University of California and
    IBM
  • Can the university accept free fab costs ($2.0M to
    $2.5M) in return for capped, non-exclusive patent
    licensing fees for IBM if UC files for IRAM
    patents?
  • Process started with IBM March 1999
  • IBM won't give full process info until there is a contract
  • UC started negotiating seriously Jan 2000
  • Agreement June 1, 2000!

17
Other examples: IBM Blue Gene
  • 1 PetaFLOPS in 2005 for $100M?
  • Application: Protein Folding
  • Blue Gene Chip
  • 32 multithreaded RISC processors + ??MB embedded
    DRAM + high-speed network interface on a single 20
    x 20 mm chip
  • 1 GFLOPS / processor
  • 2 x 2 Board: 64 chips (2K CPUs)
  • Rack: 8 Boards (512 chips, 16K CPUs)
  • System: 64 Racks (512 boards, 32K chips, 1M CPUs)
  • Total: 1 million processors in just 2,000 sq. ft.

18
Other examples: Sony Playstation 2
  • Emotion Engine: 6.2 GFLOPS, 75 million polygons
    per second (Microprocessor Report, 135)
  • Superscalar MIPS core + vector coprocessor +
    graphics/DRAM
  • Claim: Toy Story realism brought to games

19
Outline
  • 1) Example microprocessor for PostPC gadgets
  • 2) Motivation and the ISTORE project vision
  • AME: Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Benchmarks for AME
  • Conclusions and future work

20
Lampson: Systems Challenges
  • Systems that work
  • Meeting their specs
  • Always available
  • Adapting to changing environment
  • Evolving while they run
  • Made from unreliable components
  • Growing without practical limit
  • Credible simulations or analysis
  • Writing good specs
  • Testing
  • Performance
  • Understanding when it doesn't matter

"Computer Systems Research: Past and Future,"
keynote address, 17th SOSP, Dec. 1999, Butler
Lampson, Microsoft
21
Hennessy: What Should the New World Focus Be?
  • Availability
  • Both appliance and service
  • Maintainability
  • Two functions
  • Enhancing availability by preventing failure
  • Ease of SW and HW upgrades
  • Scalability
  • Especially of service
  • Cost
  • per device and per service transaction
  • Performance
  • Remains important, but it's not SPECint

"Back to the Future: Time to Return to Longstanding
Problems in Computer Systems?"
keynote address, FCRC, May 1999, John
Hennessy, Stanford
22
The real scalability problems: AME
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or complexity
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

23
Principles for achieving AME (1)
  • No single points of failure
  • Redundancy everywhere
  • Performance robustness is more important than
    peak performance
  • performance robustness implies that real-world
    performance is comparable to best-case
    performance
  • Performance can be sacrificed for improvements in
    AME
  • resources should be dedicated to AME
  • compare: biological systems spend > 50% of
    resources on maintenance
  • can make up performance by scaling system

24
Principles for achieving AME (2)
  • Introspection
  • reactive techniques to detect and adapt to
    failures, workload variations, and system
    evolution
  • proactive techniques to anticipate and avert
    problems before they happen (both are illustrated
    in the sketch below)
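A toy illustration of the reactive/proactive split, with entirely hypothetical policies and thresholds; the slide does not specify ISTORE's actual mechanisms.

```python
def reactive_check(errors_per_hour, limit=5):
    """Reactive: flag a component only after its error rate has crossed a limit."""
    return errors_per_hour > limit

def proactive_check(error_history, limit=5, horizon_hours=24):
    """Proactive: extrapolate the recent error-rate trend and flag a component
    that is predicted to cross the limit within the given horizon."""
    if len(error_history) < 2:
        return False
    slope = (error_history[-1] - error_history[0]) / (len(error_history) - 1)
    return error_history[-1] + slope * horizon_hours > limit

print(reactive_check(2), proactive_check([0, 1, 2]))   # False, True
```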

25
ISTORE-1 hardware platform
  • 80-node x86-based cluster, 1.4TB storage
  • cluster nodes are plug-and-play, intelligent,
    network-attached storage bricks
  • a single field-replaceable unit to simplify
    maintenance
  • each node is a full x86 PC w/256MB DRAM, 18GB
    disk
  • more CPU than NAS; fewer disks/node than a cluster

Intelligent Disk Brick: portable-PC CPU (Pentium
II/266), DRAM, redundant NICs (4 x 100 Mb/s
links), Diagnostic Processor
  • ISTORE Chassis
  • 80 nodes, 8 per tray
  • 2 levels of switches
  • 20 x 100 Mbit/s
  • 2 x 1 Gbit/s
  • Environment Monitoring
  • UPS, redundant PS,
  • fans, heat and vibration sensors...

26
ISTORE-1 Status
  • 10 Nodes manufactured
  • Boots OS
  • Diagnostic Processor Interface SW complete
  • PCB backplane not yet designed
  • Finish 80-node system: Summer 2000

27
Hardware techniques
  • Fully shared-nothing cluster organization
  • truly scalable architecture
  • architecture that tolerates partial failure
  • automatic hardware redundancy

28
Hardware techniques (2)
  • No central processing unit: distribute processing
    with storage
  • Serial lines and switches are also growing with
    Moore's Law; less need today to centralize vs.
    bus-oriented systems
  • Most storage servers are limited by the speed of
    their CPUs; why does this make sense?
  • Why not amortize the sheet metal, power, and cooling
    infrastructure for the disk to add processor, memory,
    and network?
  • If AME is important, must provide resources to be
    used to help AME; local processors are responsible
    for the health and maintenance of their storage
29
Hardware techniques (3)
  • Heavily instrumented hardware
  • sensors for temp, vibration, humidity, power,
    intrusion
  • helps detect environmental problems before they
    can affect system integrity
  • Independent diagnostic processor on each node
  • provides remote control of power, remote console
    access to the node, selection of node boot code
  • collects, stores, and processes environmental data,
    watching for abnormalities
  • non-volatile "flight recorder" functionality
    (sketched below)
  • all diagnostic processors connected via
    independent diagnostic network
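A minimal sketch of the flight-recorder idea: a bounded log of environmental samples with out-of-range detection. The field names and limits below are invented for illustration; the slide does not give them.

```python
from collections import deque

LIMITS = {"temp_C": (10, 45), "vibration_g": (0.0, 0.5), "power_W": (20, 60)}

class FlightRecorder:
    def __init__(self, capacity=10_000):
        self.log = deque(maxlen=capacity)   # oldest samples age out, ring-buffer style

    def record(self, sample):
        """Store a sample and return the names of any out-of-range readings."""
        self.log.append(sample)
        return [name for name, (lo, hi) in LIMITS.items()
                if not lo <= sample.get(name, lo) <= hi]

recorder = FlightRecorder()
print(recorder.record({"temp_C": 52, "vibration_g": 0.1, "power_W": 35}))  # ['temp_C']
```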

30
Hardware techniques (4)
  • On-demand network partitioning/isolation
  • Internet applications must remain available
    despite failures of components, so a subset of the
    system can be isolated for preventative maintenance
  • Allows testing, repair of online system
  • Managed by diagnostic processor and network
    switches via diagnostic network

31
Hardware techniques (5)
  • Built-in fault injection capabilities
  • Power control to individual node components
  • Injectable glitches into I/O and memory busses
  • Managed by diagnostic processor
  • Used for proactive hardware introspection
  • automated detection of flaky components
  • controlled testing of error-recovery mechanisms
    (sketched below)
  • Important for AME benchmarking (see next slide)
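A sketch of how such a fault-injection campaign might be driven; the diagnostic-processor hooks and fault names are hypothetical, since the slide only lists the capabilities.

```python
import random
import time

FAULTS = ["power_off_disk", "glitch_memory_bus", "glitch_io_bus"]   # illustrative names

def fault_injection_campaign(nodes, inject_fault, check_recovery, trials=20):
    """Inject one fault at a time into a random node and verify that the
    system's recovery mechanisms mask it; collect the cases that don't recover."""
    failures = []
    for _ in range(trials):
        node, fault = random.choice(nodes), random.choice(FAULTS)
        inject_fault(node, fault)            # e.g. issued over the diagnostic network
        time.sleep(1)                        # give recovery code time to react
        if not check_recovery(node, fault):
            failures.append((node, fault))   # flaky component or broken recovery path
    return failures
```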

32
Hardware techniques (6)
  • Benchmarking
  • One reason for 1000X processor performance was
    ability to measure (vs. debate) which is better
  • e.g., which is most important to improve: clock
    rate, clocks per instruction, or instructions
    executed?
  • Need AME benchmarks
  • what gets measured gets done
  • benchmarks shape a field
  • quantification brings rigor

33
Availability benchmark methodology
  • Goal: quantify variation in QoS metrics as events
    occur that affect system availability
  • Leverage existing performance benchmarks
  • to generate fair workloads
  • to measure/trace quality-of-service metrics
  • Use fault injection to compromise system
  • hardware faults (disk, memory, network, power)
  • software faults (corrupt input, driver error
    returns)
  • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads,
    the availability analogues of performance micro-
    and macro-benchmarks (the overall loop is sketched below)
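A skeleton of that methodology: run the workload, inject faults at scheduled offsets, and trace a QoS metric over time. The workload and fault hooks below are placeholders, not ISTORE's actual harness.

```python
import time

def availability_run(measure_qos, inject, faults, duration_s, period_s=1.0):
    """faults: list of (time_offset_s, fault_name) pairs to inject during the run."""
    trace, pending = [], sorted(faults)
    start = time.time()
    while (now := time.time() - start) < duration_s:
        while pending and pending[0][0] <= now:
            inject(pending.pop(0)[1])          # e.g. fail a disk, corrupt an input
        trace.append((now, measure_qos()))     # e.g. requests/sec or response latency
        time.sleep(period_s)
    return trace                               # later compared against no-fault runs
```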

34
Benchmark Availability? Methodology for reporting
results
  • Results are most accessible graphically
  • plot change in QoS metrics over time
  • compare to normal behavior
  • 99% confidence intervals calculated from no-fault
    runs (see the sketch below)
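One way to compute that band, as a sketch: build a per-sample 99% confidence interval from repeated no-fault runs and flag where a faulty run's QoS trace leaves it (assumes several no-fault runs and roughly normal per-sample noise).

```python
import statistics

def confidence_band(no_fault_runs, z99=2.576):
    """no_fault_runs: several equal-length QoS traces measured with no faults."""
    band = []
    for samples in zip(*no_fault_runs):                    # per time step, across runs
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5
        band.append((mean - z99 * sem, mean + z99 * sem))
    return band

def degraded_points(fault_trace, band):
    """Indices where the faulty run falls outside the 99% no-fault band."""
    return [i for i, (q, (lo, hi)) in enumerate(zip(fault_trace, band))
            if not lo <= q <= hi]
```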

35
Example results: multiple faults
(Charts compare Windows 2000/IIS and Linux/Apache)
  • Windows reconstructs 3x faster than Linux
  • Windows reconstruction noticeably affects
    application performance, while Linux
    reconstruction does not

36
Conclusions (1): ISTORE
  • Availability, Maintainability, and Evolutionary
    growth are key challenges for server systems
  • more important even than performance
  • ISTORE is investigating ways to bring AME to
    large-scale, storage-intensive servers
  • via clusters of network-attached,
    computationally-enhanced storage nodes running
    distributed code
  • via hardware and software introspection
  • we are currently performing application studies
    to investigate and compare techniques
  • Availability benchmarks a powerful tool?
  • revealed undocumented design decisions affecting
    SW RAID availability on Linux and Windows 2000

37
Conclusions (2)
  • IRAM attractive for two Post-PC applications
    because of low power, small size, high memory
    bandwidth
  • Gadgets: Embedded/Mobile devices
  • Infrastructure: Intelligent Storage and Networks
  • PostPC infrastructure requires
  • New Goals: Availability, Maintainability,
    Evolution
  • New Principles: Introspection, Performance
    Robustness
  • New Techniques: Isolation/fault insertion,
    Software scrubbing
  • New Benchmarks: measure, compare AME metrics