Title: Dynamic FPGA Routing for Just-in-Time Compilation
1. Dynamic FPGA Routing for Just-in-Time Compilation
- Roman Lysecky (a), Frank Vahid (a), Sheldon X.-D. Tan (b)
- (a) Department of Computer Science and Engineering
- (b) Department of Electrical Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship
2. Introduction: Just-in-Time Compilation Has Become Commonplace
- Just-in-Time Compilation
  - Modern Pentium processors
    - Dynamically translate instructions onto underlying RISC architecture
  - Transmeta Crusoe and Efficeon
    - Dynamic code morphing
    - Translate x86 instructions to underlying VLIW processor
  - Interpreted languages
    - Distribute SW as processor-independent bytecode/source
    - SW typically executed on a virtual machine
    - JIT compile bytecode to processor's native instructions
    - Java, Python, etc.
3. Introduction: Just-in-Time Compilation Also Performs Optimization
- Dynamic optimizations are increasingly common
  - Dynamically recompile binary during execution
  - Dynamo [Bala, et al., 2000] - Dynamic software optimizations
    - Identify frequently executed code segments (hot paths)
    - Recompile with higher optimization
  - BOA [Gschwind, et al., 2000] - Dynamic optimizer for PowerPC
- Advantages
  - Transparent optimizations
    - No designer effort
    - No tool restrictions
  - Adapts to actual usage
  - Speedups of up to 20-30% (1.3X)
- JIT compilation operates on software binaries
4. Introduction: But Today's Binaries Are More than Just Software
5. Introduction: Just-in-Time FPGA Compilation?
- JIT FPGA compilation
  - Idea: standard binary for FPGA
  - Similar benefits as standard binary for microprocessor
    - Portability, transparency, standard tools
  - Embedded JIT compilation tools optimized for each FPGA
6. Introduction: One Use of JIT FPGA Compilation - Cable TV Company
9. Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor block diagram with Profiler, µP, I, D, Warp Configurable Logic Architecture, and Dynamic Partitioning Module (DPM)]
[Lysecky/Vahid, DATE'04], [Stitt/Lysecky/Vahid, DAC'03], [Stitt/Vahid, ICCAD'02]
10. Introduction: Another Use - Warp Processors (Dynamic HW/SW Partitioning)
[Figure: warp processor block diagram with Profiler, ARM, I, D, WCLA, and DPM]
[Lysecky/Vahid, DATE'04], [Stitt/Lysecky/Vahid, DAC'03], [Stitt/Vahid, ICCAD'02]
11. Introduction: All that CAD On-Chip?
- CAD people may first think Just-in-Time FPGA compilation is absurd
  - CAD tools are extremely complex
  - Require long execution times on powerful desktop workstations
  - Require very large memory resources
  - Usually require GBytes of hard drive space
  - Costs of complete CAD tools package can exceed $1 million
- All that CAD on-chip?
12. Simultaneous FPGA/CAD Design
- Careful simultaneous design of configurable logic fabric and CAD tools
  - Analyze the impact of architectural features on on-chip Just-in-Time CAD tools
    - Fast execution time
    - Very low data memory
    - Produce reasonable (good) hardware circuits
13. Simultaneous FPGA/CAD Design: Configurable Logic Fabric
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
- Each CLB is directly connected to an SM
- Switch matrix connections
  - Four short wires connect adjacent SMs
  - Four long wires connect every other SM together
[Figure: array of CLBs surrounded by SMs]
[Lysecky/Vahid, DATE'04]
14. Simultaneous FPGA/CAD Design: Combinational Logic Block Design
- Incorporate two 3-input 2-output LUTs
  - Corresponds to four 3-input LUTs
  - Allows for good quality circuit while reducing on-chip CAD tools complexity
- Provide routing resources between adjacent CLBs to support carry chains
[Lysecky/Vahid, DATE'04]
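To make the "3-input 2-output LUT" idea concrete, here is a minimal sketch that models one such LUT as two 8-entry truth tables sharing the same inputs. The class name `Lut3x2` and the full-adder configuration are illustrative assumptions, not taken from the slides; the full adder is chosen only because it shows why a second output per LUT fits well with carry chains.

```python
class Lut3x2:
    """Hypothetical model of a 3-input, 2-output LUT: two truth tables, one per output."""

    def __init__(self, table0, table1):
        assert len(table0) == 8 and len(table1) == 8
        self.tables = (tuple(table0), tuple(table1))

    def eval(self, a, b, c):
        # The three input bits select one of the eight entries in each table.
        index = (a << 2) | (b << 1) | c
        return self.tables[0][index], self.tables[1][index]


def full_adder_lut():
    """Configure one LUT as a full adder: output 0 is the sum, output 1 the carry."""
    sums = [((i >> 2) & 1) ^ ((i >> 1) & 1) ^ (i & 1) for i in range(8)]
    carries = [int(bin(i).count("1") >= 2) for i in range(8)]
    return Lut3x2(sums, carries)
```

With this configuration, `full_adder_lut().eval(a, b, cin)` returns the sum and carry-out bits for one adder stage, so a chain of such LUTs would form a ripple-carry adder.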
15. Simultaneous FPGA/CAD Design: Switch Matrix
- Switch Matrix
  - SM connected using eight channels per side
    - Four short channels
    - Four long channels
  - Routes wires from different sides using the same channel
    - Each short channel is associated with a single long channel
    - Wires are routed using a single pair of channels through configurable logic fabric
[Lysecky/Vahid, DATE'04]
16. FPGA Routing
- FPGA Routing
  - Find a path within FPGA to connect source and sinks of each net within our hardware circuit
  - Typically use a form of maze routing [Lee, 1961]
    - Routes each net using Dijkstra's shortest path algorithm
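As a rough illustration of routing one net with Dijkstra's shortest path algorithm, the sketch below finds the cheapest path between a source and a single sink. The graph representation and the helper `grid_edges` (a uniform 4-neighbor grid) are assumptions made for this example; a real FPGA router works on a much richer routing-resource graph and handles multi-sink nets.

```python
import heapq


def dijkstra_route(edges, source, sink):
    """Return (cost, path) for the cheapest source-to-sink path, or None.

    edges: dict mapping node -> list of (neighbor, cost) pairs.
    """
    best = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == sink:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for nbr, weight in edges.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    if sink not in best:
        return None
    # Walk the predecessor links back from the sink to recover the path.
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return best[sink], path


def grid_edges(width, height, cost=1):
    """Build a uniform 4-neighbor grid (hypothetical stand-in for routing resources)."""
    edges = {}
    for x in range(width):
        for y in range(height):
            nbrs = []
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    nbrs.append(((nx, ny), cost))
            edges[(x, y)] = nbrs
    return edges
```

For example, `dijkstra_route(grid_edges(3, 3), (0, 0), (2, 2))` routes between opposite corners of a 3x3 grid.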
17. FPGA Routing
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of routing resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources based on the amount of overuse
    - Rip-up all routes and reroute all nets
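The negotiated-congestion loop described above can be sketched as follows. This is a simplified illustration, not the Pathfinder implementation: it handles only two-terminal nets, assumes a connected graph, puts costs on edges, and uses an assumed cost form `(1 + history) * (1 + pres_gain * overuse)`. The function names and the `hist_gain` and `pres_gain` parameters are hypothetical.

```python
import heapq
from collections import defaultdict


def shortest_path(adj, cost_of, src, dst):
    """Dijkstra over undirected edges; cost_of(edge) gives the current edge cost."""
    best, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        dist, u = heapq.heappop(heap)
        if u == dst:
            break
        if dist > best[u]:
            continue
        for v in adj[u]:
            new_dist = dist + cost_of(frozenset((u, v)))
            if new_dist < best.get(v, float("inf")):
                best[v], prev[v] = new_dist, u
                heapq.heappush(heap, (new_dist, v))
    path = [dst]
    while path[-1] != src:  # assumes dst is reachable from src
        path.append(prev[path[-1]])
    return path[::-1]


def negotiated_congestion(adj, capacity, nets, max_iters=20,
                          hist_gain=1.0, pres_gain=1.0):
    """Route every net each pass; penalize overused edges until the routing is legal."""
    history = defaultdict(float)
    for _ in range(max_iters):
        usage = defaultdict(int)  # rip up all routes: every pass starts empty
        routes = {}
        for name, (src, dst) in nets.items():
            def cost_of(edge):
                over = max(0, usage[edge] + 1 - capacity[edge])
                return (1.0 + history[edge]) * (1.0 + pres_gain * over)
            path = shortest_path(adj, cost_of, src, dst)
            routes[name] = path
            for u, v in zip(path, path[1:]):
                usage[frozenset((u, v))] += 1
        overused = {e: usage[e] - capacity[e] for e in usage if usage[e] > capacity[e]}
        if not overused:
            return routes  # legal routing found
        for edge, over in overused.items():
            history[edge] += hist_gain * over  # cost grows with the amount of overuse
    return None
```

When two nets compete for the same single-wire edge, the growing history cost eventually makes a detour cheaper for one of them, which is the "negotiation" the algorithm is named for.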
18. FPGA Routing
- VPR - Versatile Place and Route [Betz, et al., 1997]
  - Uses modified Pathfinder algorithm
    - Increased performance over original Pathfinder algorithm
  - Routability-driven routing
    - Goal: Use fewest tracks possible
  - Timing-driven routing
    - Goal: Optimize circuit speed
19. JIT FPGA Routing
- Riverside On-Chip Router (ROCR)
  - Represent routing nets between CLBs as routing between SMs
  - Resource Graph
    - Nodes correspond to SMs
    - Edges correspond to channels between SMs
    - Capacity of edge equal to the number of wires within the channel
  - Requires much less memory than VPR as resource graph is much smaller
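A resource graph of this kind might be built as in the sketch below: one node per SM, one edge per channel, and the edge capacity set to the number of wires. The function names are hypothetical, and the reading of "every other SM" as "two SMs apart in the same row or column" is an assumption based on the fabric description, not a detail confirmed by the slides.

```python
def build_resource_graph(rows, cols, short_wires=4, long_wires=4):
    """Map each channel, keyed by its (sm_a, sm_b) endpoints, to its wire capacity."""
    graph = {}
    for r in range(rows):
        for c in range(cols):
            # Short channels link adjacent SMs; long channels skip one SM (assumed).
            for step, wires in ((1, short_wires), (2, long_wires)):
                if c + step < cols:
                    graph[((r, c), (r, c + step))] = wires
                if r + step < rows:
                    graph[((r, c), (r + step, c))] = wires
    return graph


def neighbors(graph, sm):
    """List the SMs reachable from `sm` over a single channel."""
    return [b if a == sm else a for (a, b) in graph if sm in (a, b)]
```

Because the graph has one node per SM rather than one node per wire or pin, its size grows with the number of SMs only, which is consistent with the memory argument made on the slide.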
20. JIT FPGA Routing
- Riverside On-Chip Router (ROCR) - Global Routing
  - Based on VPR's routability-driven router
    - Utilizes similar cost model consisting of base, historical congestion, and current congestion costs
  - Routes nets between SMs using greedy, depth-first routing algorithm
    - Faster than VPR's traditional breadth-first routing method
    - Requires addition of adjustment cost to direct ROCR to re-route illegal nets using different initial routing path
    - Ignores illegal routing within SMs
  - If congestion exists, rip-up and re-route only the illegal routes
    - Reduces computation time during successive routing iterations
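One way a greedy, depth-first route and a selective rip-up could look is sketched below. This is not ROCR's actual code. The slide does not say how the base, historical, and current congestion costs are combined, so the cost is left as a caller-supplied function. The Manhattan-distance tie-breaker, the function names, and the omission of the adjustment cost are all assumptions of this example.

```python
def greedy_dfs_route(neighbors, cost, src, dst):
    """Greedy depth-first search between two SMs given as (row, col) tuples.

    Always extends the path along the cheapest-looking unvisited neighbor and
    backtracks only at dead ends, so the result is fast but not guaranteed optimal.
    """
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    path, visited = [src], {src}
    while path:
        here = path[-1]
        if here == dst:
            return path
        options = [n for n in neighbors(here) if n not in visited]
        if not options:
            path.pop()  # dead end: backtrack one step
            continue
        nxt = min(options, key=lambda n: cost(here, n) + manhattan(n, dst))
        visited.add(nxt)
        path.append(nxt)
    return None


def illegal_nets(routes, capacity):
    """Return the names of nets that use at least one over-capacity channel."""
    usage = {}
    for path in routes.values():
        for a, b in zip(path, path[1:]):
            edge = frozenset((a, b))
            usage[edge] = usage.get(edge, 0) + 1
    over = {edge for edge, count in usage.items() if count > capacity(edge)}
    return {name for name, path in routes.items()
            if any(frozenset(pair) in over for pair in zip(path, path[1:]))}
```

A router built on these pieces would re-route only the nets returned by `illegal_nets`, leaving legal routes in place between iterations.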
21. JIT FPGA Routing
- Riverside On-Chip Router (ROCR) - Detailed Routing
  - Assign specific channels to each route
  - Construct routing conflict graph
    - Routes conflict if assigning same channel results in an illegal routing within any SM
  - Use Brelaz's greedy vertex coloring algorithm [Brelaz, 1979]
  - If illegal routes exist, rip-up illegal routes and repeat global routing
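Channel assignment by vertex coloring can be sketched with a saturation-degree (DSATUR-style) greedy coloring in the spirit of Brelaz's algorithm. Each color stands for one channel. The conflict graph format, the function names, and the use of total degree as the tie-breaker are assumptions of this example rather than details from the slides.

```python
def dsatur_coloring(conflicts):
    """Greedy coloring: repeatedly color the route with the most distinct neighbor colors.

    conflicts: dict mapping route name -> set of conflicting route names.
    Returns a dict mapping route name -> color (channel index starting at 0).
    """
    colors = {}
    uncolored = set(conflicts)
    while uncolored:
        def saturation(route):
            return len({colors[n] for n in conflicts[route] if n in colors})

        # Highest saturation first; ties broken by degree, then by name order.
        route = max(sorted(uncolored),
                    key=lambda r: (saturation(r), len(conflicts[r])))
        used = {colors[n] for n in conflicts[route] if n in colors}
        color = 0
        while color in used:
            color += 1  # smallest channel not taken by a conflicting route
        colors[route] = color
        uncolored.remove(route)
    return colors


def routes_over_budget(colors, num_channels):
    """Routes whose assigned color has no matching channel would be ripped up."""
    return {route for route, color in colors.items() if color >= num_channels}
```

If `routes_over_budget` returns a non-empty set, those routes would be treated as illegal and sent back to global routing, matching the last step on the slide.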
22. Experiments: Memory Usage
23. Experiments: Algorithm Performance
24. Experiments: Critical Path Results
- But 10% shorter critical path than VPR (RD)
25. Experiments: Wire Segments
26. Conclusions
- Developed Riverside On-Chip Router (ROCR)
  - Fast, lean on-chip router for JIT FPGA compilation
    - Order of magnitude less memory required
    - On average 10X faster than VPR's faster routing algorithm
  - Produces acceptable circuit quality
    - Uses only 10% more routing resources
    - Critical path 10% shorter than VPR's routability-driven router
- JIT FPGA Compilation
  - Enables development of a standard HW binary
  - Brings portability of SW design to HW designers
  - Presently requires custom FPGA fabric
    - Future work: Overhead of mapping simple fabric onto commercial fabric?