Title: A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning
1 A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning
- Roman Lysecky, Frank Vahid
- Department of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship
2 Introduction: Dynamic Software Optimization
- Dynamic optimizations are increasingly common
- Dynamo: dynamic software optimizations
- Transmeta Crusoe, Efficeon: dynamic code morphing
- Just In Time (JIT) compilation: interpreted languages
- Advantages
- Transparent optimizations
- No designer effort
- No tool restrictions
- Adapts to actual usage
- Drawbacks
- Currently limited to software optimizations
- Limited speedup (1.1x to 1.3x common)
3 Introduction: Hardware/Software Partitioning
- Benefits
- Speedups of 2x to 10x typical
- Speedups of 800x possible
- Far more potential than dynamic SW optimizations (1.2x)
- Energy reductions of 25% to 95% typical
4 Introduction: Traditional Hardware/Software Partitioning
- Requires specialized CAD tools
- Non-standard partitioning compilers
5 Introduction: Binary Hardware/Software Partitioning
- Binary partitioning [Stitt/Vahid ICCAD'02, Banerjee DATE'03]
- Partition application starting from SW binary
- Can be desktop based
- Advantages
- Use any standard compiler
- Supports any language
- Supports multiple sources from multiple languages
- Supports assembly/object code
- Supports legacy code
- Disadvantage
- Loses some high-level information, so some loss of quality may result
6 Introduction: Dynamic Hardware/Software Partitioning
- Dynamic HW/SW Partitioning
- Embed HW/SW partitioning CAD tools on-chip
- Feasible in era of billion-transistor chips
- Advantages
- Does not require any special compilers
- Completely transparent
- Bring benefits of HW/SW partitioning to all SW designers
- Complements other approaches
- Desktop CAD best from purely technical perspective
- Dynamic opens additional market segments (i.e., all software developers) that otherwise might not use desktop CAD
7 Introduction: Warp Processors
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic and update software binary
5. Partitioned application executes faster with lower energy consumption
[Figure: warp processor block diagram with MIPS/ARM core, I and D blocks, Profiler, Configurable Logic, and Dynamic Part. Module (DPM)]
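The profiling step can be illustrated with a small sketch. The slides do not say how the profiler detects critical regions. The code below assumes, purely for illustration, that critical regions are loops and that they can be found by counting taken backward branches in an instruction trace. The trace format and the name `find_critical_loops` are hypothetical.

```python
from collections import Counter


def find_critical_loops(trace, top_n=1):
    """Return the top_n most frequently taken backward branches.

    trace: iterable of (branch_pc, target_pc) pairs for taken branches
           (a hypothetical trace format, not described in the slides).
    A taken branch whose target is at or before the branch itself is
    treated as a loop back edge.
    """
    back_edges = Counter(
        (pc, target) for pc, target in trace if target <= pc
    )
    return back_edges.most_common(top_n)


# Toy trace: a hot loop at 0x100..0x120, a cold loop at 0x200..0x210,
# and one forward branch that must be ignored.
trace = [(0x120, 0x100)] * 1000 + [(0x210, 0x200)] * 10 + [(0x130, 0x200)]
hot = find_critical_loops(trace)
```

Here `hot` holds the single back edge `(0x120, 0x100)` with its count of 1000, which would be the candidate region handed to the partitioning step.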
8 Warp Processors: Requirements and Tools
- Warp Processor Architecture and Tools
- Basic configurable logic architecture
- Efficient profiling architecture
- On-chip CAD tools for HW/SW partitioning
- Decompilation
- Synthesis
- Technology Mapping
- Placement and Routing
[Figure: warp processor block diagram: ARM, I, D, Profiler, Config. Logic, DPM]
9 Warp Configurable Logic Architecture: Requirements
- Robustness
- Capable of supporting large set of applications
- Simplicity
- Existing FPGAs are too complex for warp processors
- Design goals of FPGAs are much different
- Design the configurable fabric by analyzing how architectural features affect the on-chip CAD tools
- Fast execution
- Very low data memory use
- Produce reasonable hardware circuits
- Efficient interface to memory
10 Warp Configurable Logic Architecture
- Data address generators (DADG) and loop control hardware (LCH)
- Found in most digital signal processors
- Provide fast loop execution
- Support memory accesses with regular access patterns
- Synthesis of FSM not required for many critical loops
- Configurable logic fabric inputs provide alternative control of loop execution
[Figure: DADG and LCH, 32-bit MAC, and Configurable Logic Fabric blocks]
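A minimal behavioral sketch of what a data address generator and loop control hardware contribute, assuming a base-plus-stride access pattern and a simple iteration down-counter. The class names, the 4-byte stride, and the toy memory are illustrative assumptions, not details from the slides.

```python
class LoopControl:
    """Counts iterations so the loop needs no synthesized FSM (assumed model)."""

    def __init__(self, iterations):
        self.remaining = iterations  # assumes iterations >= 1

    def step(self):
        """Finish one iteration; return True while more iterations remain."""
        self.remaining -= 1
        return self.remaining > 0


class DataAddressGenerator:
    """Produces the regular sequence base, base + stride, base + 2*stride, ..."""

    def __init__(self, base, stride):
        self.addr = base
        self.stride = stride

    def next_address(self):
        addr = self.addr
        self.addr += self.stride
        return addr


def run_loop(memory, base, stride, iterations):
    """Sum `iterations` words; the DADG supplies addresses, the LCH ends the loop."""
    dadg = DataAddressGenerator(base, stride)
    lch = LoopControl(iterations)
    total = 0
    while True:
        total += memory[dadg.next_address()]
        if not lch.step():
            break
    return total


memory = {0x1000 + 4 * i: i for i in range(8)}  # toy memory holding 0..7
result = run_loop(memory, base=0x1000, stride=4, iterations=8)
```

The loop body contains only the datapath operation (the addition). Address computation and loop termination sit in the two helper objects, which mirrors why no FSM has to be synthesized for loops of this shape.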
11 Warp Configurable Logic Architecture
- Integrated 32-bit multiplier-accumulator (MAC)
- Multiplications are frequently found within critical loops
- Frequently in the form of a multiply-accumulate operation
- Fast, single-cycle multipliers are large and require many interconnections
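The multiply-accumulate pattern named above can be sketched as follows. The slides give only the 32-bit width. The unsigned wraparound on overflow and the function names are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF


def mac32(acc, a, b):
    """Return (acc + a * b) truncated to 32 bits (assumed unsigned wraparound)."""
    return (acc + (a & MASK32) * (b & MASK32)) & MASK32


def dot_product(xs, ys):
    """Typical critical-loop shape: one MAC operation per iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac32(acc, x, y)
    return acc


result = dot_product([1, 2, 3], [4, 5, 6])
wrapped = mac32(0xFFFFFFFF, 1, 1)
```

`result` is 32 (4 + 10 + 18), and `wrapped` is 0 because the sum overflows the assumed 32-bit accumulator.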
12 Warp Configurable Logic Architecture: Configurable Logic Fabric
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
- Each CLB is directly connected to an SM
- Switch matrix connections
- Four short wires connect adjacent SMs
- Four long wires connect every other SM together
[Figure: grid of CLBs surrounded by SMs]
13 Warp Configurable Logic Architecture: Combinational Logic Block Design
- Several studies have analyzed the impact of LUT and CLB size on overall design area and delay
- LUTs with 5 to 6 inputs result in best performance
- LUTs with fewer than 3 inputs have much worse performance [Chow, et al. 1999; Singh, et al. 1992]
- CLB cluster sizes of 3 to 20 LUTs are feasible [Marquardt, Betz, Rose 2000]
14 Warp Configurable Logic Architecture: Combinational Logic Block Design
- Incorporate two 3-input 2-output LUTs
- Corresponds to four 3-input LUTs
- Allows for good quality circuit while reducing on-chip CAD tools complexity
- Provide routing resources between adjacent CLBs to support carry chains
FPGAs vs. WCLA
- FPGAs: Flexibility (large CLBs, various internal routing resources)
- WCLA: Simplicity (limited internal routing, reduced technology mapping complexity)
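To make the 3-input 2-output LUT and the carry-chain remark concrete, here is a small behavioral model. It assumes each such LUT is two 8-entry truth tables sharing the same three inputs, which is one reading of "corresponds to four 3-input LUTs" for two of them. The mapping of a full adder onto one LUT and all names are illustrative, not taken from the slides.

```python
class Lut3x2:
    """A 3-input, 2-output LUT modeled as two truth tables with shared inputs."""

    def __init__(self, table0, table1):
        assert len(table0) == 8 and len(table1) == 8
        self.tables = (tuple(table0), tuple(table1))

    def eval(self, a, b, c):
        index = (a << 2) | (b << 1) | c
        return self.tables[0][index], self.tables[1][index]


def full_adder_lut():
    """One LUT computes both the sum bit and the carry-out of a full adder."""
    sum_bits = [((i >> 2) & 1) ^ ((i >> 1) & 1) ^ (i & 1) for i in range(8)]
    carry_bits = [1 if bin(i).count("1") >= 2 else 0 for i in range(8)]
    return Lut3x2(sum_bits, carry_bits)


def ripple_add(x, y, width):
    """Add two width-bit values, one LUT per bit, carry passed to the neighbor."""
    lut = full_adder_lut()
    carry, result = 0, 0
    for bit in range(width):
        s, carry = lut.eval((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry


total, carry_out = ripple_add(11, 6, 4)
```

With 4-bit operands, 11 + 6 = 17 gives `total == 1` and `carry_out == 1`. The carry passed from one bit position to the next is the kind of signal that dedicated routing between adjacent CLBs would carry.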
15 Warp Configurable Logic Architecture: Switch Matrix
- Switch Matrix
- SM connected using eight channels per side
- Four short channels
- Four long channels
- Routes connect wires from different sides using the same channel
- Each short channel is associated with a single long channel
- Wires are routed using a single pair of channels through the configurable logic fabric
FPGAs vs. WCLA
- FPGAs: Flexibility (large routing resources, requires complex routing algorithms)
- WCLA: Simplicity (allows for design of less complex routing algorithm)
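Because a wire keeps the same channel pair across the whole fabric, a router only needs to find one pair index that is free on every segment of a path. The sketch below shows that idea with a greedy first-fit choice. It is not the authors' routing algorithm, and the segment naming and function names are hypothetical.

```python
NUM_CHANNEL_PAIRS = 4  # four short channels, each paired with one long channel


def pick_channel(path_segments, used):
    """Return the lowest channel-pair index free on every segment, or None.

    path_segments: list of segment ids along the route (hypothetical naming).
    used: dict mapping segment id -> set of channel-pair indices already taken.
    """
    for ch in range(NUM_CHANNEL_PAIRS):
        if all(ch not in used.get(seg, set()) for seg in path_segments):
            return ch
    return None


def route(path_segments, used):
    """Reserve one channel pair along the whole path; None if the path is blocked."""
    ch = pick_channel(path_segments, used)
    if ch is not None:
        for seg in path_segments:
            used.setdefault(seg, set()).add(ch)
    return ch


used = {}
first = route(["sm0-sm1", "sm1-sm2"], used)
second = route(["sm1-sm2", "sm2-sm3"], used)
third = route(["sm0-sm1"], used)
```

The search space per net is only four candidates, which suggests why this restriction permits a much simpler router than a general FPGA fabric, where each segment may switch tracks independently.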
16 Results: Benchmarks
- Considered 12 embedded benchmarks from NetBench, MediaBench, EEMBC, and Powerstone
- Average of 53% of total software execution time was spent executing a single critical loop (more speedup possible if more loops considered)
- On average, critical loops comprised only 1% of total program size
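Amdahl's law shows what the 53% figure implies for speedup when only one loop is moved to hardware. This is a back-of-the-envelope sketch for a hypothetical benchmark sitting exactly at the average. It ignores communication overhead, and per-benchmark fractions will differ.

```python
def amdahl_speedup(fraction, region_speedup):
    """Overall speedup when `fraction` of run time is sped up by `region_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / region_speedup)


def amdahl_limit(fraction):
    """Upper bound on overall speedup as the region's speedup grows without limit."""
    return 1.0 / (1.0 - fraction)


limit = amdahl_limit(0.53)            # loop time reduced to zero
with_10x = amdahl_speedup(0.53, 10.0)  # loop itself runs 10x faster
```

For a 53% loop, the bound is about 2.13x, and a 10x faster loop yields about 1.91x overall. The remaining 47% of software execution time dominates, which is why considering more loops raises the achievable speedup.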
17 Results: Experimental Setup
- Warp Processor
- 75 MHz ARM7 processor
- Configurable logic fabric with fixed frequency of 60 MHz
- Used dynamic partitioning CAD tools to map critical region to hardware
- CAD tools executed on an ARM7 processor
- CAD tools active for roughly 10 seconds to perform partitioning
- Traditional HW/SW Partitioning
- 75 MHz ARM7 processor
- Xilinx Virtex-E FPGA (executing at maximum possible speed)
- Manually partitioned software using VHDL
- VHDL synthesized using Xilinx ISE 4.1 on desktop
18 Results: Performance Speedup
19 Results: Energy Reduction
20 Context: UCR's Research on Configurable SoCs
Self Tuning, Self Configuring Mass Produced ICs
21 Conclusions and Future Work
- Warp Configurable Logic Fabric
- Supports wide range of embedded systems applications
- Designed specifically to allow development of lean on-chip CAD tools
- Provides excellent results
- Average speedup of 2.1x
- Average energy reduction of 33%
- Much better than dynamic software optimizations
- One loop only; more speedup possible
- More recent examples since DATE publication: 10x speedups
- Working towards examples with 100x speedups
- Future Work
- Partitioning multiple software loops to hardware
- Synthesizing Finite State Machines (FSMs)
- Improved synthesis, technology mapping, and place and route