Title: Physical Implementation of Computation
1. Physical Implementation of Computation
- André DeHon
- California Institute of Technology
Sastry/ITO May 24, 2000
2.
- How do we design and engineer physical devices which implement computations?
- How do we build programmable VLSI computing devices in the era of billion-transistor silicon die capacity? (and beyond)
- Capacity increase of 1,000-100,000×
  - 1984: 15Mλ² → 1999: 30Gλ² → 2007: 1Tλ²
3. DARPA/ITO Background
- Microsystems
- MIT Large Scale Parallel Systems 1988-1993
- MIT Reinventing Computing 1993-1996
- Adaptive Computing Systems (JITHW)
- UCB BRASS 1996-present
- Polymorphic Computing?
4. Outline
- Programmable Design Space
- Instruction Organization Effects
- Size
- Interconnect
- Requirements of Computation
5. Programmable Design Space
- Basic design parameters:
  - SIMD data width (w)
  - Instruction Depth (c)
  - Retiming Depth (d)
  - Interconnect Richness (p)
  - Control Granularity
- Overview: IEEE Computer, April 2000
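A toy sketch of how these parameters might feed an area/density model. The unit-area constants below are illustrative assumptions, not the figures from the IEEE Computer article:

```python
# Toy area/density model over the (w, c, d) parameters above.
# Area constants are in arbitrary lambda^2-like units and are
# illustrative assumptions, NOT the published model.
A_BITOP = 100   # compute area per bit-processing element
A_INSTR = 50    # area per stored instruction context
A_REG   = 20    # area per retiming register bit

def pe_area(w, c, d):
    """Area of one processing element: w-bit datapath, c instruction
    contexts, d-deep retiming registers per datapath bit."""
    return w * A_BITOP + c * A_INSTR + w * d * A_REG

def peak_density(w, c, d):
    """Peak bit-operations per cycle per unit area (all w bits busy,
    one instruction issued per cycle)."""
    return w / pe_area(w, c, d)

# An FPGA-like point (w=c=d=1) vs. a processor-like point (w=64, c=1024):
fpga = peak_density(1, 1, 1)
cpu  = peak_density(64, 1024, 1)
print(f"FPGA-like/processor-like peak density ratio: {fpga / cpu:.1f}")
```

Even with these made-up constants, the two conventional design points land at very different peak densities, which is the point of the slide.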
6. Architectural Characterization
(figure: temporal vs. spatial organization)
7. Peak Computational Densities from Model
- Small slice of space
  - only 2 parameters
- 100× density variation across the space
- Large difference in peak densities
  - large design space!
8. Yielded Efficiency
- FPGA (c=w=1)
- Processor (c=1024, w=64)
- Large variation in yielded density
  - large design space!
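The mismatch penalty between an application's natural parameters and an architecture's can be sketched with a toy utilization model. This is purely my illustrative assumption, not the model behind the slide's data:

```python
def efficiency(w_app, c_app, w_arch, c_arch):
    """Fraction of an architecture's raw capacity doing useful work for
    an application with natural word width w_app and instruction depth
    c_app.  Toy model: unused datapath bits waste area, and mismatched
    instruction depth wastes either contexts or compute."""
    width_util = min(w_app, w_arch) / w_arch
    depth_util = min(c_app, c_arch) / max(c_app, c_arch)
    return width_util * depth_util

# A bit-level, single-instruction (FPGA-friendly) task on a
# processor-like point (w=64, c=1024) yields a tiny fraction of peak:
print(efficiency(1, 1, 64, 1024))   # orders of magnitude below 1.0
```

Even this crude model reproduces the qualitative claim of the next slide: a conventional processor can be orders of magnitude less efficient than an alternative programmable architecture on a mismatched workload.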
9. Large Design Space: Reflection (1)
- No single conventional architecture is robustly general-purpose across the design space.
- E.g. a processor can be orders of magnitude less efficient than an alternative programmable architecture.
10. Large Design Space: Reflection (2)
- Need to understand the space and application characteristics to tailor Application-Specific Processors.
- Specific applications may have limited/dominating characteristics in the space.
- Can get it wrong by orders of magnitude.
11. UCB BRASS: RISC + HSRA (heterogeneous mix)
- Integrate
  - processor
  - reconfigurable array
- Key Idea
  - best of both worlds: temporal/spatial
12. MIT MATRIX
- Make instruction distribution flexible
- Efficient/flexible word size and depth
- Base unit
  - 8-bit RF+ALU slice
  - c=4 or 256, d=1 or 128
  - w=8, expandable
- FCCM '96 / Hot Chips '97
13. Design Lesson?
- General Purpose
  - BRASS hybrid is a first step
    - integrating two complementary points
    - generalize?
- Application Specific
  - Within an application, requirements vary
  - even here, a single point in the space can be suboptimal
  - identify the best portfolio
14. Generalize: Mix-and-Match
- Heterogeneous Composition
- Heterogeneous Tile
- → Framework to systematize exploration and construction
15. Interconnect
- Along with the instruction store, the dominant area in temporal (processor) designs
  - Also dominant in power, delay...
- The dominant area in spatial designs
16. Can Parameterize Richness
- p=0.5
- p=0.75
- (axis: Interconnect Richness →)
17. Effects of Richness on Area
18. How rich should interconnect be?
- Single design
- Binary tree or 1-D: p=0.0
- Crossbar: p=1.0
- (axis: Interconnect Richness →)
19. Interconnect
- Since it grows faster than linearly in system size:
  - not surprising it is the dominant component
  - not surprising its importance is growing
- Important to develop a systematic understanding of design
  - richness and structure
  - energy, delay, power tradeoffs
  - switching requirements
  - mapping/routing requirements
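The superlinear growth can be sketched with a Rent's-rule-style estimate over a 2-D layout (hierarchical bisection with all constants dropped; the exponent plays the role of the richness parameter p above). A toy sketch under those assumptions:

```python
# Rent's-rule-style sketch of total wiring vs. system size n in a 2-D
# layout.  IO of a block of s elements ~ s**p; wire span ~ sqrt(s).
# Constants are dropped, so only growth rates are meaningful.
def wiring_estimate(n, p):
    total, size = 0.0, n
    while size > 1:
        blocks = n / size        # partitions at this hierarchy level
        io = size ** p           # Rent's rule: IO ~ (block size)^p
        span = size ** 0.5       # wire length ~ side of the block
        total += blocks * io * span
        size /= 2
    return total

# Doubling the number of elements more than doubles the wiring:
for p in (0.5, 0.75):
    ratio = wiring_estimate(2048, p) / wiring_estimate(1024, p)
    print(f"p={p}: wiring grows {ratio:.2f}x when n doubles")
```

Since compute area grows only linearly in n, wiring under this estimate eventually dominates any fixed per-element area, matching the bullet above.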
20. What capacity is required to perform a computation?
- A strong component based on the structural characteristics just identified
  - application interconnect richness, throughput, instruction locality/diversity, state
- Also a component based on dataset characteristics
  - information content of the input
21. Idea
- Program semantics are very general
  - handle any data input
- Specialized code/circuits
  - require less capacity
  - fewer cycles, fewer circuits
- Input data is not random, but structured
  - Exploit this to minimize work
  - very roughly like data compression
22. Examples: Information Content
- Branch predictability
  - e.g. trace-schedule the likely path
- Common/exceptional case
  - e.g. common case: no error condition
- Memoization
  - save/cache a result rather than recompute it
- Binding time
  - a value is unchanged once bound; specialize around it
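Memoization in particular is easy to illustrate. The repeating input stream below is a made-up stand-in for the structured, non-random data the previous slide describes:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def expensive(x):
    """Stand-in for a costly computation; counts how often it really runs."""
    global calls
    calls += 1
    return x * x

# Real input data repeats (it is structured, not random) ...
data = [3, 7, 3, 3, 7, 3]
results = [expensive(x) for x in data]
# ... so six requests cost only two underlying computations.
print(results, calls)
```

The cache converts capacity required into a function of the input's information content (two distinct values) rather than its raw length, which is the sense in which this resembles data compression.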
23. Data Example
- Conventional C semantics
  - compute on 32b quantities
- But many items do not need the full width
- Look at the bit-ops actually used in practice
  - Identify the fraction of bit-ops doing useful work on conventional processors
- Student: Eylon Caspi (UCB)
24. Bit Classification
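A minimal sketch of one such classification (my own toy version, not Caspi's actual analysis): count the significant bits of each value and treat the rest of the 32-bit word as carrying no information.

```python
def useful_bits(x):
    """Significant bits of a two's-complement value: leading zeros
    (or leading sign-copy ones, for negatives) carry no information."""
    if x >= 0:
        return max(1, x.bit_length())
    return (~x).bit_length() + 1   # +1 for the sign bit

values = [3, 100, -2, 255, 12]     # made-up sample data
used = sum(useful_bits(v) for v in values)
total = 32 * len(values)
print(f"{1 - used / total:.0%} of bit positions carry no information")
```

This only classifies static value widths; the actual study tracked bit-operations performed, but even this crude count shows how much of a 32-bit datapath can sit idle.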
25. Lesson
- With simple models, can identify a large amount of redundancy in conventional computations
  - e.g. 60-75% of bit-ops redundant
- Interesting to develop a Specialization Theory, a computational analog to Information Theory
26. Vectors
- Interconnect
  - requirements
  - systematic construction
- Understand real computational requirements
  - specialization theory
- Programming model
  - robust across the architecture space
  - the ISA abstraction is outdated; time to find the next Middleware abstraction
- Acknowledge the space is large
  - Systematize
    - understand tradeoffs
    - elaborate details
  - Intermediate variables
    - capture application (algorithm) fingerprint
  - Mix-and-match, or break the assumptions which force a tradeoff
27. Extra
28. Post-Fabrication
- Examples
  - Personal Computers, microprocessors, PLDs, FPGAs, DSPs, VLIW, Vector, multiprocessors
- More important today
  - Increasing die capacity (1000×+)
  - Greater absolute performance per die
  - SoB → SoC → differentiation
  - Increasing design complexity
- Advantages
  - Economies of scale
  - Reduced Time-To-Market
    - crucial today
  - Robust to change