Title: What Is a Compiler When The Architecture Is Not Hardware
1What Is a Compiler When The Architecture Is Not Hard(ware)?
This work was supported in part by awards from
Hewlett-Packard Corporation, IBM, Panasonic AVC,
and by DARPA under Contract No. DABT63-96-C-0049
and Grant No. 25-74100-F0944. Portions of this
presentation were given by the speaker as a
keynote at ACM LCTES 2001 and as an invited
talk at EMSOFT 2001.
2Embedded Computing
Why ?
What ?
How ?
3The Nature of Embedded Systems
4Favorable Trends
- Supported by Moore's (second) law
- Computing power doubles every eighteen months
- Corollary: cost per unit of computing halves every eighteen months
- From hundreds of millions to billions of units
- Projected by market research firms (VDC) to be a $50 billion space over the next five years
- High volume, relatively low per-unit margin
5Embedded Systems Desiderata
6Timing Example
Video-On-Demand
Predictable Timing Behavior
Unpredictable Timing Behavior
7Current Art
Vertical application domains
- Meet desiderata while overcoming NRE cost hurdles through volume
- High migration inertia across applications
- Long time to market
8Subtle but Sure Hurdles
- For Moore's corollary to be true
- Non-recurring engineering (NRE) cost must be amortized over high volume
- Else prohibitively high per-unit costs
- Implies uniform designs over large workload classes
- E.g., numerical, integer, signal processing
- Demands of embedded systems
- Non-uniform or application-specific designs
- Per-application volume might not be high
- High NRE costs => infeasible cost/unit
- Time-to-market pressure
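The amortization argument above can be made concrete with a small calculation. This is a minimal sketch: the function name `cost_per_unit` and all dollar figures are hypothetical, chosen only to show how per-unit cost behaves when the same NRE is spread over a high versus a low volume.

```python
def cost_per_unit(nre_cost, unit_cost, volume):
    """Per-unit cost when a one-time NRE cost is spread over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre_cost / volume + unit_cost


# Hypothetical numbers, not taken from the presentation.
NRE = 10_000_000  # one-time design and verification cost
UNIT = 5.0        # marginal manufacturing cost per chip

# Uniform design sold across a large workload class.
high_volume = cost_per_unit(NRE, UNIT, 10_000_000)

# Application-specific design with a modest per-application volume.
low_volume = cost_per_unit(NRE, UNIT, 50_000)

print(high_volume, low_volume)
```

With these assumed inputs the NRE share dominates the low-volume part, which is the "infeasible cost/unit" case in the list above.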
9The Embedded Systems Challenge
Multiple application domains
- Sustain Moore's corollary
- Keep NRE costs down
10Responding Via Automation
Multiple application domains
11Three Active Approaches
- Custom microprocessors
- Architecture exploration and synthesis
- Architecture assembly for reconfigurable computing
12Custom Processor Implementation
Proprietary ISA, Architecture Specification
Custom Processor implementation
Proprietary Tools
Application analysis
Fabricate
Processor
Proprietary ISA
Language with custom extensions
Application
Compiler
Binary
- High performance implementation
- Customized in silicon for particular application domain
- O(months) of design time
- Once designed, programmable like standard processors
Tensilica, HP-ST Microelectronics approach
13Architecture Exploration and Synthesis
The PICO Vision
Automatic synthesis of application-specific
parallel/VLIW ULSI microprocessors and their
compilers for embedded computing
14Custom Microprocessors
Application(s) define workload
Analyze
Define ISA extension (e.g., IA-64)
Optimizing Compiler
Define Compiler Optimizations
Design Implementation
Microprocessor (e.g., Itanium)
15Application Specific Design
Applications
Single Application
Analyze
Program Analysis
Extended EPIC Compiler Technology
Library of possible implementations (Bypass ISA)
Explore and Synthesize implementations
VLIW core, non-programmable extension
Application-specific processor runs single application
16The Compiler Optimization Trajectory
17What Is the Compiler's Target ISA?
Compiler
Hardware
- Target is a range of architectures and their building blocks
- Compiler reaches into a constrained space of silicon
- Explores architectural implementations
- O(days to weeks) of design time
- Exploration sensitive to application-specific hardware modules
- Fixed-function silicon is the result
- Verification NRE costs still there
- One approach to overcoming time to market
Figure: division of work between compiler and hardware, by architecture class
Frontend and Optimizer (compiler, all classes)
Determine Dependences (done in hardware for Superscalar)
Determine Independences (done in hardware for Dataflow)
Bind Operations to Function Units (done in hardware for Indep. Arch.)
Bind Transports to Busses (done in hardware for VLIW)
Execute (the only step left to hardware for TTA)
B. Ramakrishna Rau and Joseph A. Fisher.
Instruction-level parallel processing: History,
overview, and perspective. The Journal of
Supercomputing, 7(1-2):9-50, May 1993.
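The compiler/hardware split shown on this slide can be written down as a small lookup table. The sketch below is one way to encode it; the names `STAGES`, `FIRST_HARDWARE_STAGE`, and `split` are illustrative and not from the presentation. Each architecture class is described by the first stage that hardware takes over, and everything before that stage is the compiler's job.

```python
# Stages in the order they appear on the slide.
STAGES = [
    "frontend and optimizer",
    "determine dependences",
    "determine independences",
    "bind operations to function units",
    "bind transports to busses",
    "execute",
]

# Index into STAGES of the first stage performed by hardware.
FIRST_HARDWARE_STAGE = {
    "superscalar": 1,
    "dataflow": 2,
    "independence architecture": 3,
    "VLIW": 4,
    "TTA": 5,
}


def split(arch):
    """Return (compiler_stages, hardware_stages) for an architecture class."""
    k = FIRST_HARDWARE_STAGE[arch]
    return STAGES[:k], STAGES[k:]


for arch in FIRST_HARDWARE_STAGE:
    compiler_side, hardware_side = split(arch)
    print(f"{arch}: compiler does {len(compiler_side)} stage(s), "
          f"hardware does {len(hardware_side)}")
```

Read top to bottom, the table shows the compiler's share growing from superscalar to TTA, which is the trajectory the slide title refers to.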
18Choices of Silicon
High level design/synthesis
19Reconfigurable Computing
20FPGAs As an Alternative Choice for Customization
- Frequent (re)configuration and hence frequent recustomization
- Fabrication process is steadily improving
- Gate densities are going up
- Performance levels are acceptable
- Amortize large NRE investments by using COTS platform
21Major Motivations
- Poor compilation times
- Lack of correspondence between standard IR and final configurations
- Place and route inherently complex
- Would like to have compile times of the order of seconds to minutes
- Provide customization of data-path by automatic analysis and optimization in software
22Adaptive EPIC
23 Compiler-Processor Interface
Source program -> Compiler -> Executable
ISA: ADD (format, semantics), LD (format, semantics), Registers, Exceptions, ...
24Redefining Processor-Compiler Interface
Source program
Compiler
Executable
Let compiler determine the instruction sets (and
their realization on chip)
25EPIC execution model
Record of execution
26Adaptive EPIC execution model
ILP-1
Configured Functional units
Reconfigure datapath
ILP-2
27What can be efficiently compiled for today?
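[Figure: architecture classes arranged by parallelism against approximate instruction packet size]
Architectures shown: FPGA, Complex ASIC, DPGA, RAW, RaPiD, GARP, MultiChip, CVH, TRACE (Multiscalar), SMT, SuperSpeculative, EPIC/VLIW, TTA, SuperScalar, Dataflow, VECTOR, Simple Pipelined/Embedded, Early x86, Simple ASIC
Axis labels: Parallelism; Approximate instruction packet size (0, 4, 16, 32, 64, 128-512, 1K-10K, 100K-1M, >1M)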
28Adaptive EPIC
- AEPIC = EPIC processing + datapath reconfiguration
- Reconfiguration through three basic operations
- add a functional unit to the datapath
- remove a functional unit from the datapath
- execute an operation on a resident functional unit
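The three basic operations can be pictured with a toy software model. This is only a sketch under stated assumptions: the class `Datapath`, its `capacity` parameter, and the `mac` unit are hypothetical names introduced for illustration, and a Python callable stands in for a configured functional unit. It is not a description of the AEPIC hardware.

```python
class Datapath:
    """Toy model of a reconfigurable datapath with a fixed number of slots."""

    def __init__(self, capacity):
        self.capacity = capacity  # how many units can be resident at once
        self.resident = {}        # unit name -> callable standing in for the unit

    def add_unit(self, name, fn):
        """Add a functional unit to the datapath."""
        if name in self.resident:
            return  # already configured, nothing to load
        if len(self.resident) >= self.capacity:
            raise RuntimeError("no free slot; remove a unit first")
        self.resident[name] = fn

    def remove_unit(self, name):
        """Remove a functional unit from the datapath."""
        self.resident.pop(name, None)

    def execute(self, name, *operands):
        """Execute an operation on a resident functional unit."""
        if name not in self.resident:
            raise LookupError(f"unit {name!r} is not resident")
        return self.resident[name](*operands)


dp = Datapath(capacity=2)
dp.add_unit("mac", lambda a, b, c: a * b + c)  # hypothetical multiply-accumulate
result = dp.execute("mac", 2, 3, 4)
dp.remove_unit("mac")
print(result)
```

In this model, executing on a unit that is not resident is an error, which is why the compiler has to place add operations ahead of the corresponding execute operations.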
29AEPIC Machine Organization
Level-1 configuration cache
I-Cache
Configuration cache
CRF
F/D
Multi-context Reconfigurable Logic Array
GPR/FPR/
Array Register File
L1
L2
30AEPIC Compilation
31Key Tasks For AEPIC Compiler
- Partitioning
- Identifying code sections that convert to custom instructions
- Operation synthesis
- Efficient CFU implementations for identified partitions
- Instruction selection
- Multiple CFU choices for code partitions
- Resource allocation
- Allocation of C-cache and MRLA resources for CFUs
- Scheduling
- When and where to schedule adaptive component instructions
32Compiler Modules
Source program
Front-end processing
Partitioning
Configuration Library
High-level optimizations (EPIC core)
Mapping (Adaptive)
Machine description
Performance statistics
33Resource allocation for configurations
Reconfigurable logic can accommodate only two
configurations simultaneously
Record of execution
B
G
RL
R
Overlapping live-range region
Processor
34Managing Reconfigurable Resource (contd.)
- Simpler version related to register allocation problem
- New parameters
- No need to save to memory on spill
- configurations immutable
- Load costs different for different configurations
- C-cache, MRLA: multi-level register file
- Adapted register allocation techniques to configuration allocation
- Non-uniform sizes of configurations: graph multi-coloring
- Adapted Chaitin's coloring-based techniques
35The Problem With Long Reconfiguration Times
36Speculating Configuration Loads
- Mask reconfiguration times!
- Need to know when and where to speculate
- If f1 >> f2, do not speculate to the red empty load slot
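One way to reason about when to speculate is an expected-stall estimate. The slide does not define f1 and f2, so the sketch below makes its own assumptions: f1 and f2 are taken to be profiled execution frequencies of two alternative paths, each path needs its own configuration, and a single empty load slot ahead of the branch can preload one of them. The function `choose_speculative_load` and the cycle counts are hypothetical.

```python
def choose_speculative_load(freq, load_cycles):
    """Pick the path whose configuration should fill the single empty load slot.

    freq        -- path name -> profiled execution count (assumed available)
    load_cycles -- path name -> cycles to load that path's configuration
    Returns (path_to_preload, expected_exposed_stall_cycles).
    """
    total = sum(freq.values())

    def expected_stall(preloaded):
        # Every path other than the preloaded one pays its full load latency.
        return sum(freq[p] * load_cycles[p] / total
                   for p in freq if p != preloaded)

    best = min(freq, key=lambda p: (expected_stall(p), p))
    return best, expected_stall(best)


# Assumed profile: path1 is taken far more often than path2 (f1 >> f2).
freq = {"path1": 90, "path2": 10}
load_cycles = {"path1": 1000, "path2": 1000}
choice, stall = choose_speculative_load(freq, load_cycles)
print(choice, stall)
```

Under these assumptions, spending the slot on the rarely taken path leaves most of the reconfiguration time exposed, which is one reading of the rule stated on the slide.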
37Sample Compiler Topics
- Configuration cache management
- Power/clock rate vs. Performance tradeoff
- Bit-width optimizations
38More Generally Architecture Assembly
Applications
- An ISA view
- Synthesis and other hardware design off-line
- Much closer to compiler optimizations implies
faster compile time
Build off-line (synthesis, place and route)
Program
Prebuilt Implementations
Compiler selects, assembles, and optimizes
program
Data path
Storage
Interconnect
Dynamically variable ISA Architecture
implementation
Also applicable to yield fixed implementations in
silicon