Title: Modeling and Mapping Techniques for Dynamically Reconfigurable Hybrid Architectures
1Modeling and Mapping Techniques for Dynamically
Reconfigurable Hybrid Architectures
- Kiran Bondalapati
- Computer Engineering
2Outline
- Introduction
- Background
- Thesis Contributions
- HySAM Model
- Mapping Techniques
- DRIVE Simulation
- Conclusions
3Computing Landscape
Cost/Performance Gap
- ASIC: special purpose, excellent application-specific performance
- Microprocessor: general purpose, good average performance
[Figure: axes of increasing performance and increasing flexibility; labeled points include IBM JPEG]
Introduction
4Configurable Computing Concept
- Computation and Communication adapted on-the-fly
Introduction
5Configurable Computing Characteristics
- Variable Hardware and Software
- Spatial Computation
- Distributed Resources
- Distributed Control
Introduction
6Configurable Computing Variable Hardware
Introduction
Evolution
7Configurable Computing Spatial Computing
ALU
computations
Chip
Introduction
Time
Temporal
Chip
Spatial
8Configurable Computing Distributed Resources
- Active Logic: silicon resources which actually perform the computations
Introduction
Configurable Processor
Microprocessor
9Configurable Computing Distributed Control
Instruction Broadcast
Localized Control
Introduction
Decode
Configurable Processor
Microprocessor
10Early Hybrid Chip Xilinx XC 6200 FPGA
- SRAM based FPGA architecture
- 64 x 64 array of cells
- 2-input, 2-output logic function in cell
- Reconfiguration
- Less than 1 ms. for entire device
- Partial reconfiguration
- 40 ns each cell
- FastMap Processor Interface
- FPGA connected to system bus
- Memory mapped device
- Normal load/store instructions for reconfiguration
- Dynamic reconfiguration
Background
cell
11Billion Transistor Chips
Conf. Logic
Background
Conf. Logic
CPU
Memory
Memory
Current Systems
Emerging Systems
12BRASS Garp
- Reconfigurable array unit with a RISC processor
- Gate array of 32x24 logic blocks
- Partial configuration of gate array in row increments
- Configuration cache for fast reconfiguration
- 4 cycles on-chip and 12 cycles off-chip reconfiguration time
Memory
Background
Instruction cache
Data cache
Configurable Array
MIPS
13Chameleon RCP
ARC 32-bit RISC
DMA Controller
128-bit RoadRunner Bus
Background
Reconfigurable Processing Fabric
- 84 DPUs, 24 Multipliers
- 24 Billion OPS
- 3 Billion MACS
- Single cycle Reconfiguration
- 2 Gbyte/sec I/O
14Xilinx Platform FPGA
Distributed Multipliers
PowerPC 32-bit RISC
Background
- 10 Million System Gates
- 200 MHz System Clock
- 300 MHz PowerPC Core
- 600 Billion MACS
- 6 Gbyte/sec RISC-Logic b/w
Virtex-II CLB Array
15Configurable Computing Challenges
- High-level System Models
- Current abstraction is register-level
- Formal Methodologies
- Performance analysis and optimization
- Integrated Mapping Techniques
- Exploit all resources in a unified approach
- Design and Simulation Tools
- System-level simulation and analysis
Background
16Our Model Based Approach
Application Developer
Background
Optimized mapping algorithms for generic
problems and applications
Models
Computational Model Compilation Model
Devices
Systems
17Related Work
- Mapping Loops
- Weinhardt, Pande, Luk, etc.
- No high-level model
- Limited applicability
- Compiler Projects
- National NAPA, Synopsys Nimble, Berkeley Garp, USC/ISI DEFACTO, Chameleon c2b
- Mainly complementary software efforts
- Reconfiguration costs not addressed
- Interactions with DEFACTO and Chameleon
Background
18Thesis Contributions
Mapping Techniques
HySAM Model
DRIVE Simulation
19Parameterized HySAM Model
Application Developer
Algorithmic techniques for mapping generic
loops onto hybrid architectures
HySAM
Hybrid Reconfigurable Architectures
20Summary of Mapping Techniques
- Mapping Linear Loops
- Polynomial complexity algorithms
- Mapping onto multi-context architectures
- Dynamic precision management
- Reconfigurable Pipelines
- Mapping onto configurable pipelines
- Heuristic pipeline segmentation techniques
- Integrated Mapping Techniques
- Parallelizing feedback loops
- Data Context Switching
21DRIVE Interpretive Simulation Framework
System Abstraction
Performance Characterization
Analysis and Transformation
System Models
Task Models
Mapping Algorithms
Interpretive Simulation
Performance Analysis Design Exploration
22Thesis Contributions
Mapping Techniques
HySAM Model
HySAM
DRIVE Simulation
23Hybrid System Architecture Model
Main Processor
Memory
Interconnection Network
HySAM
Conf. Cache
Conf. Logic Unit (CLU)
Parameterized Model
24Functions and Configurations
- F - Functions
- Computational units (e.g. Add, Multiply, Select)
- Library Modules
- C - Configurations
- Area, Configuration time, Execution time, Precision, Power dissipation, I/O requirement
- A Function can be executed by various Configurations
- Aij - Attributes for function Fi in configuration Cj
- Rij - Reconfiguration cost from Ci to Cj
- Depends on both Ci and Cj
- Can include partial reconfiguration
- Reconfiguration cost matrix
HySAM
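A minimal Python sketch of how the F, C, Aij and Rij parameters above could be held in one library structure. The class names, field names and all numeric values are illustrative assumptions, not part of HySAM itself; only latency and precision are modeled as attributes, and staying in the same configuration is assumed to cost nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attributes:
    """A_ij: attributes of function F_i executed in configuration C_j."""
    exec_time_ns: float    # t_ij, latency
    precision_bits: int    # p_ij


@dataclass
class Library:
    """Hypothetical container for the model parameters F, C, A and R."""
    functions: list        # F
    configurations: list   # C
    attrs: dict            # (F_i, C_j) -> Attributes
    reconf_ns: dict        # (C_i, C_j) -> R_ij

    def configs_for(self, function):
        """All configurations that can execute the given function."""
        return [c for c in self.configurations if (function, c) in self.attrs]

    def reconf_cost(self, src, dst):
        """R_ij lookup; assumed zero when the configuration does not change."""
        return 0.0 if src == dst else self.reconf_ns[(src, dst)]


# Made-up numbers, for illustration only.
lib = Library(
    functions=["Multiply", "Add"],
    configurations=["C1", "C2", "C3"],
    attrs={
        ("Multiply", "C1"): Attributes(40.0, 16),
        ("Multiply", "C2"): Attributes(55.0, 16),
        ("Add", "C3"): Attributes(8.0, 16),
    },
    reconf_ns={
        ("C1", "C2"): 6000.0, ("C1", "C3"): 1600.0,
        ("C2", "C1"): 14000.0, ("C2", "C3"): 1600.0,
        ("C3", "C1"): 14000.0, ("C3", "C2"): 6000.0,
    },
)
print(lib.configs_for("Multiply"), lib.reconf_cost("C3", "C2"))
```

Keeping R as a full matrix keyed by (source, destination) lets the cost depend on both Ci and Cj, which is where partial reconfiguration would show up as cheaper entries.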
25Attributes
Execution of Fi in Cj
- tij - Latency (execution time)
- ?ij - Throughput
- pij - Precision
- Others
- Data access
- Power dissipation
HySAM
26Scheduled Reconfiguration
Computation (e.g. Program)
Hybrid System Architecture Model
HySAM
Problem Instance (e.g. Input Data)
Configurations and Schedule
Intermediate Results
Computation and Reconfiguration
Result
27Tasks and Configurations
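[Figure: the p input application tasks T1, T2, T3, ..., Tp are mapped onto the m configurations (e.g. C3, C5), which execute functions (e.g. F4, F6); moving between configurations incurs a reconfiguration cost (e.g. R53)]
HySAM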
28Example Garp Architectural Parameters
Function  Operation              Configuration  Conf. Time (R0j)  Exec. Time (tij)
F1        Multiplication (Fast)  C1             14.4 us           37.5 ns
F1        Multiplication (Slow)  C2             6.4 us            52.5 ns
F2        Addition               C3             1.6 us            7.5 ns
F3        Subtraction            C4             1.6 us            7.5 ns
F4        Shift                  C5             3.2 us            7.5 ns
HySAM
29HySAM Abstraction
- Resource Model - hardware
- Task Model - applications
- Execution Model - run-time
- Attribute Model - library
- Generative Model - design
HySAM
30Thesis Contributions
Mapping Techniques
HySAM Model
Mapping
DRIVE Simulation
31Why Loops?
- Dense and Regular computations
- Occur in most applications
- More than 90% of execution time is spent in loops
- Extensive research in loop analysis
- - Task identification
- - Configuration generation
Mapping
32Loops Definitions
- Loop Iteration
- Loop Index
- Dependency
- Data
- Control
- Loop carried dependency
FOR I1 TO 100 AI 2 BI CI AI -
3 DI DI CI
Mapping
33Sub-space of Problems
Loop Characteristics
Feedback
Linear
Linear Loop Mapping
Parallel Pipelines
Unlimited
Resources
Mapping
Heuristic Pipeline Segmentation
Data Context Switching
Limited
34Example Mapping Contributions
- Problems and Highlighted Characteristics
- LMP: Theoretical complexity
- DPMA: Novel reconfiguration ideas
- DCS: Application area focus
Mapping
35General Mapping Problem
- Execute a given sequence of tasks for N iterations
- Minimize total execution time
- E = Computation + Reconfiguration
- Find a sequence of configurations which minimizes E
LOOP 1..N
T1
T2
Mapping - LMP
Tp
END
NP-Complete!
36LMP Linear Mapping Problem
- Given
- A set of tasks <T1, T2, ..., Tp> to be executed sequentially N times
- Set of configurations C
- Reconfiguration cost matrix R
- Find
- A sequence <C1, C2, ..., Cq> of configurations which minimizes the total execution time E
LOOP 1..N
T1
T2
Mapping - LMP
Tp
END
37Optimal Solution
- Lemma 1
- An optimal sequence of configurations for executing one iteration of the loop can be computed in O(pm^2) time
- Lemma 2
- An optimal sequence of configurations can be computed by unrolling the loop only m times
- Theorem
- An optimal sequence of configurations for N iterations of a loop statement with p tasks, where each task can be executed in one of m possible configurations, can be computed in O(pm^3) time.
Mapping - LMP
38Optimal Solution - One Iteration
- Explore all possible execution sequences? Exponential!
- Exploit subsequence optimality
- Utilize dynamic programming to reduce search space
[Figure: tasks T1, ..., Ti, Ti+1, ..., Tp, each with candidate configurations (e.g. C2, C3, C4, C6, C7, C9), linked by reconfiguration costs]
Mapping - LMP
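The dynamic programming idea for one iteration can be sketched in Python as follows. This is an illustration of the subsequence-optimality argument under assumed inputs (the function name, the `options` dictionary layout and the `unit_switch` cost model are all hypothetical), not the thesis implementation. For each task it keeps, per configuration, the cheapest way to have reached it, which takes O(pm^2) steps for p tasks and m configurations.

```python
def best_sequence(tasks, options, reconf, start=None):
    """Cheapest configuration per task for one pass over `tasks`.

    tasks   -- task names in execution order
    options -- options[task] is {configuration: execution_time}
    reconf  -- reconf(src, dst) returns the reconfiguration cost
    start   -- configuration loaded before the first task (None if unknown)
    """
    # cost[c]: cheapest time to finish the tasks seen so far with the
    # latest task executed in configuration c (subsequence optimality).
    cost = {c: reconf(start, c) + t for c, t in options[tasks[0]].items()}
    back = []  # back[i][c]: configuration chosen for the previous task
    for task in tasks[1:]:
        new_cost, choice = {}, {}
        for c, t in options[task].items():
            prev = min(cost, key=lambda q: cost[q] + reconf(q, c))
            new_cost[c] = cost[prev] + reconf(prev, c) + t
            choice[c] = prev
        back.append(choice)
        cost = new_cost
    # Walk the back-pointers from the cheapest final configuration.
    last = min(cost, key=cost.get)
    total = cost[last]
    sequence = [last]
    for choice in reversed(back):
        sequence.append(choice[sequence[-1]])
    sequence.reverse()
    return total, sequence


def unit_switch(src, dst):
    """Hypothetical cost model: 1 time unit per change of configuration."""
    return 0 if src is None or src == dst else 1


options = {
    "T1": {"A": 5, "B": 1},
    "T2": {"A": 1, "B": 5},
    "T3": {"A": 5, "B": 1},
}
total, sequence = best_sequence(["T1", "T2", "T3"], options, unit_switch)
print(total, sequence)  # 5 ['B', 'A', 'B']
```

With a switch cost of 10 instead of 1, the same call would keep configuration B throughout, which is the trade-off between execution time and reconfiguration time that the mapping problem captures.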
39Optimal Solution Multiple Iterations
- Maximum number of distinct <T,C> pairs is pm
- Compute dynamic programming solution for T1...Tp T1...Tp ... T1...Tp (unrolled m times)
- Solution repeated N/m times is required sequence
- Complexity O(pm^3)
- p - number of tasks
- m - number of configurations
Mapping - LMP
40Summary of LMP Solution
C1
Configuration Library (size m)
C2
Our Mapping
...
Mapping - LMP
Algorithm
Cq
Optimal Configuration Sequence
Input
- Minimizes total execution time including reconfiguration time
- Algorithm complexity independent of number of loop iterations
- O(pm^3) compile time algorithm
41Example Garp Architectural Parameters
Function  Operation              Configuration  Conf. Time (R0j)  Exec. Time (tij)
F1        Multiplication (Fast)  C1             14.4 us           37.5 ns
F1        Multiplication (Slow)  C2             6.4 us            52.5 ns
F2        Addition               C3             1.6 us            7.5 ns
F3        Subtraction            C4             1.6 us            7.5 ns
F4        Shift                  C5             3.2 us            7.5 ns
Mapping - LMP
42Example Mapping FFT onto Garp
FFT Butterfly Operation - One Complex Multiply, One Complex Add, One Complex Subtract - 4 Real Multiplies, 3 Real Adds, 3 Real Subtracts
FFT Linearized Task Sequence
TM - Multiplication, TA - Addition, TS - Subtraction
TM TM TM TM TA TS TA TA TS TS
Optimal Solution: N × 13.055 us (N = number of iterations)
Important Characteristic of Solution
- Uses the Multiplier configuration with the slower execution time
- Its faster reconfiguration helps in amortizing the execution cost over all the iterations
Mapping - LMP
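The 13.055 us figure can be checked by hand from the Garp parameters. The Python sketch below does the arithmetic for the linearized FFT task sequence, assuming that the configuration time R0j is paid every time the loaded configuration changes (including the wrap-around from the last task of one iteration to the first task of the next) and that it does not depend on the previous configuration. The function and variable names are illustrative.

```python
# Garp parameters: configuration -> (reconfiguration ns, execution ns)
GARP = {
    "C1": (14400.0, 37.5),  # multiplication (fast)
    "C2": (6400.0, 52.5),   # multiplication (slow)
    "C3": (1600.0, 7.5),    # addition
    "C4": (1600.0, 7.5),    # subtraction
}

FFT_TASKS = "TM TM TM TM TA TS TA TA TS TS".split()


def iteration_time_ns(config_per_task):
    """Steady-state time of one loop iteration for a fixed schedule."""
    total = 0.0
    loaded = config_per_task[-1]  # left loaded by the previous iteration
    for cfg in config_per_task:
        reconf_ns, exec_ns = GARP[cfg]
        if cfg != loaded:
            total += reconf_ns    # assumed cost R0j of loading cfg
            loaded = cfg
        total += exec_ns
    return total


def fft_schedule(multiplier_cfg):
    """Map each FFT task to a configuration; only the multiplier varies."""
    mapping = {"TM": multiplier_cfg, "TA": "C3", "TS": "C4"}
    return [mapping[task] for task in FFT_TASKS]


slow = iteration_time_ns(fft_schedule("C2"))
fast = iteration_time_ns(fft_schedule("C1"))
print(slow / 1000, fast / 1000)  # 13.055 20.995 (us per iteration)
```

The slow multiplier costs 4 × 15 ns more in execution per iteration but saves 8 us of reconfiguration, which is why it appears in the optimal solution.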
43Variable Precision Computation
- Precision requirement is lower than implemented
- Match implementation to algorithm requirements
- Fewer resources
- Execution time
- Logic area
- Power dissipation
- Run-time precision management
- Dynamic modification
Mapping - DPMA
44Precision Variation in Loops
Example:
DO 10 I=1,N
  DO 20 J=1,N
    RSQ(J) = RSQ(J) + XDIFF(I,J)*YDIFF(I,J)
20  IF (MAXQ.LT.RSQ(J)) THEN MAXQ = RSQ(J)
10 VIRTXY = VIRTXY + MAXQ*SCALE(I)
Mapping - DPMA
- 8-bit inputs XDIFF(I,J) and YDIFF(I,J)
- MAXQ operand and operation
- Accumulation
- Precision changes with iterations of I
- Does not change every iteration
- Lower than maximum possible precision (for most
iterations)
45Precision Variation Curve
Mapping - DPMA
46Precision Management Problem
- Given
- PVC (Precision Variation Curve) for a given operation in the loop
- Find
- A valid optimal schedule which minimizes total execution time
- Valid schedule
- Satisfies the precision requirements of the computation
- Total execution time
- Execution time + Reconfiguration time
Mapping - DPMA
47Dynamic Precision Management Algorithm
- DPMA algorithm
- Dynamic programming based
- Explores sub-optimal configurations
- For a few iterations
- Reduces reconfiguration overhead
- O(um^2) complexity
- u - number of PVC points, m - number of configurations
Mapping - DPMA
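A hedged Python sketch of a dynamic program in the spirit of DPMA. It assumes the PVC is given as a list of (iterations, required bits) segments, that every configuration has one precision and one per-iteration execution time, that all reconfigurations cost the same, and that loading the first configuration is free. These simplifications, and every name and number below, are assumptions for illustration only. For readability it stores whole schedules rather than back-pointers.

```python
def schedule_precision(segments, configs, reconf_cost):
    """Pick one configuration per PVC segment, minimizing total time.

    segments    -- [(iterations, required_bits), ...] taken from the PVC
    configs     -- {name: (precision_bits, exec_time_per_iteration)}
    reconf_cost -- cost of switching between two different configurations
    Returns (total_time, [configuration per segment]).
    """
    # best[c]: (time, schedule) of the cheapest valid schedule ending in c
    best = {None: (0, [])}
    for iterations, need_bits in segments:
        nxt = {}
        for name, (bits, t_iter) in configs.items():
            if bits < need_bits:
                continue  # not a valid schedule: precision too low
            run = iterations * t_iter
            candidates = []
            for prev, (time, sched) in best.items():
                switch = 0 if prev in (None, name) else reconf_cost
                candidates.append((time + switch + run, sched + [name]))
            nxt[name] = min(candidates, key=lambda item: item[0])
        best = nxt
    return min(best.values(), key=lambda item: item[0])


# Made-up multiplier library and PVC.
configs = {"8x8": (8, 10), "8x16": (16, 20), "8x32": (32, 40)}
segments = [(5, 8), (100, 16), (100, 32)]
total, schedule = schedule_precision(segments, configs, reconf_cost=500)
print(total, schedule)  # 6600 ['8x16', '8x16', '8x32']
```

In this example the first five iterations run on a wider multiplier than they need, because the extra execution time (50) is cheaper than one more reconfiguration (500). This is the "sub-optimal configuration for a few iterations" behavior listed above.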
48Experimental Results
Mapping - DPMA
49Mapping onto XC 6200
Mapping the multiplier operation in MAXQ * SCALE(I)
Algorithm  Execution Time (ns)  Reconfig. Time (ns)  Total Time (ns)
Raw        655360               20480                675840
Static     532480               17920                550400
Greedy     468010               56320                524330
DPMA       471160               33280                504440
- Raw - 8x32 precision for all iterations
- Static - 8x28 precision for all iterations
- Greedy - schedule using greedy algorithm
- DPMA - schedule using theoretical PVC
Mapping - DPMA
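As a quick arithmetic check on the XC 6200 numbers, the short Python sketch below recomputes each total as execution time plus reconfiguration time and derives the relative saving of the DPMA schedule. The dictionary layout and variable names are illustrative.

```python
# (execution ns, reconfiguration ns) per algorithm, from the results table
results_ns = {
    "Raw": (655360, 20480),
    "Static": (532480, 17920),
    "Greedy": (468010, 56320),
    "DPMA": (471160, 33280),
}

totals = {name: exe + rec for name, (exe, rec) in results_ns.items()}

# Fraction of total time saved by DPMA relative to each other approach.
savings = {
    name: 1 - totals["DPMA"] / total
    for name, total in totals.items()
    if name != "DPMA"
}

for name, total in totals.items():
    print(f"{name:7s} total = {total} ns")
for name, saved in savings.items():
    print(f"DPMA saves {saved:.1%} over {name}")
```

Greedy has the lowest execution time but the highest reconfiguration time; DPMA gives up a little execution time and ends with the lowest total.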
50Assumptions
- Higher precision requires more resources
- Execution time
- Logic area
- Monotonic variation in precision
- Several image processing and signal processing applications
- Split non-monotonic PVC into monotonic subsequences
- Optimal solution for the given PVC
- Near optimal if actual precision variation is different
Mapping - DPMA
51DCS Data Context Switching
- Introduction to Voice Coding
- Synthesis Filter
- Single Channel Design
- Data Context Switching
- Multi-Channel Design
Mapping - DCS
52Multimedia Communication
Video
Audio
Control and Management
Data
H.261 H.263
G.711, G.723.1, G.728, G.729
H.225.0 H.225.0 H.245 RAS
Signaling Control
T.124
RTCP
RTP
X.224 Class 0
Mapping - DCS
UDP
TCP
T.123
Network (IP)
Datalink (IEEE 802.3)
53Hybrid Vocoder
- Waveform coding: encodes the voice signal for transmission (Tx)
- Vocoding: models speech using parameters
- Hybrid Coding
- Extract parameters of speech
- Regenerate the signal using parameters
- Compare to original voice signal
- Refine
Mapping - DCS
54Voice Compression
- G.729 is a hybrid vocoder algorithm
- Conjugate-Structure Algebraic Code Excited Linear Prediction (CS-ACELP)
LP Analysis Quantization Interpolation
Input Speech
Mapping - DCS
Parameter Encoding Refinement
Synthesis Filter
Transmitted Bitstream
55Voice Decompression
Fixed Codebook
Received Bitstream
Gc
Synthesis Filter
Post- Processing
Mapping - DCS
Adaptive Codebook
Gp
56Synthesis Filter
- 10th order Infinite Impulse Response (IIR) Filter
Mapping - DCS
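For reference, a direct-form all-pole IIR filter of the kind named above can be sketched in Python as below. The recurrence y[n] = x[n] - sum(a[k] * y[n-k]) assumes the sign convention A(z) = 1 + sum(a[k] z^-k); the function name and the coefficient values in the example are illustrative, not G.729 data.

```python
def synthesis_filter(excitation, a):
    """All-pole IIR filter 1/A(z), with A(z) = 1 + sum_k a[k-1] * z^-k.

    excitation -- input samples x[n]
    a          -- feedback coefficients a_1 .. a_M (M = 10 for a 10th order filter)
    """
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                # Each output depends on previously computed outputs:
                # this is the feedback that limits pipelining.
                acc -= coeff * y[n - k]
        y.append(acc)
    return y


# Illustrative 10th order coefficient set with a single non-zero tap.
coeffs = [-0.5] + [0.0] * 9
impulse_response = synthesis_filter([1.0, 0.0, 0.0, 0.0], coeffs)
print(impulse_response)  # [1.0, 0.5, 0.25, 0.125]
```

The inner loop is a 10-tap multiply-accumulate over past outputs, so output n cannot start until output n-1 is available.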
57Generalization Feedback Loops
Loop Carried Dependence
FOR I=1 TO N DO
  FOR J=1 TO N DO
    ...
    VAR[J] = f(VAR[J-1])
Mapping - DCS
Many Signal and Image Processing Kernels and
Cryptographic Engines
58Mapping Pipelining Timing Constraints
y(n)
x(n)
y(n)
Mapping - DCS
-
59Limitations of the Design
- Pipeline delay of 5-12 cycles/stage
- Feedback limits throughput
- Cannot feed a new input every cycle
- Only one output every 5-12 clock cycles
- 40 sample frame takes 250-600 cycles!
Mapping - DCS
60Mapping Technique Goals
- Maximize channels/sec
- Improve throughput
- Maximize Multiplier and DPU utilization
- Integrated Mapping Technique
- Multi-dimensional optimization
Mapping - DCS
Parallelism Pipelining
Embedded Memory
Configurability
61Data Context Switching
- Computed result has to pipe through buffers
- No useful computation performed in delay cycles
- Multiplier and DPU are idle
- Data Context Switching
- Perform multi-channel computations
- Switch Data Context
- Overlapped multiple data set processing
- Utilize multipliers every cycle
Mapping - DCS
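The effect of switching data contexts can be illustrated with a small cycle-count model in Python. It is a deliberately simplified sketch: one shared compute unit accepts one operation per cycle, each result becomes available `latency` cycles after issue, and the next sample of a channel cannot issue before the previous result of that channel is available. The function name and the parameter values are assumptions, and both inputs are assumed to be positive.

```python
def cycles_to_finish(channels, samples, latency):
    """Return (total cycles, unit utilization while issuing)."""
    ready = [0] * channels    # first cycle at which a channel may issue again
    issued = [0] * channels   # samples issued so far per channel
    remaining = channels * samples
    cycle = 0
    busy = 0
    while remaining:
        for ch in range(channels):
            if issued[ch] < samples and ready[ch] <= cycle:
                issued[ch] += 1
                ready[ch] = cycle + latency  # wait for the feedback value
                remaining -= 1
                busy += 1
                break  # the shared unit accepts one operation per cycle
        cycle += 1
    # The last operation was issued at cycle - 1 and needs `latency` cycles.
    return cycle - 1 + latency, busy / cycle


single = cycles_to_finish(channels=1, samples=40, latency=5)
multi = cycles_to_finish(channels=5, samples=40, latency=5)
print(single, multi)
```

In this model one channel alone needs 200 cycles for a 40 sample frame and keeps the unit busy about 20% of the time, while five interleaved channels finish in 204 cycles with the unit busy every cycle.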
62Overlapped Multi-Channel Processing
Data Parallel Programming
Mapping - DCS
63DCS Loop Interchange Transformation
Original loop:
FOR I=1 TO N DO
  FOR J=1 TO N DO
    ...
    VAR[J] = f(VAR[J-1])
After loop interchange:
FOR J=1 TO N DO
  FOR I=1 TO N DO
    ...
    VAR[J] = f(VAR[J-1])
Mapping - DCS
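The interchange can be checked on a toy recurrence in Python. The sketch assumes that I indexes independent data sets (channels) and J indexes the dependent samples within a channel, so that VAR is really a per-channel array; the helper names and the function `step` are hypothetical.

```python
def run(order, channels, samples, f):
    """Evaluate var[i][j] = f(var[i][j-1]) in the given visiting order."""
    var = [[0] * (samples + 1) for _ in range(channels)]
    for i, j in order:
        var[i][j] = f(var[i][j - 1])
    return var


def dependency_distance(order):
    """Smallest number of steps between an operation and the one it needs."""
    position = {op: n for n, op in enumerate(order)}
    return min(
        position[(i, j)] - position[(i, j - 1)]
        for (i, j) in order
        if (i, j - 1) in position
    )


channels, samples = 3, 4
# Original nest: channel loop outside, sample loop inside.
original = [(i, j) for i in range(channels) for j in range(1, samples + 1)]
# Interchanged nest: sample loop outside, channel loop inside.
interchanged = [(i, j) for j in range(1, samples + 1) for i in range(channels)]


def step(value):
    return 2 * value + 1


same_result = run(original, channels, samples, step) == run(
    interchanged, channels, samples, step
)
print(same_result, dependency_distance(original), dependency_distance(interchanged))
```

Both orders compute the same values, but after the interchange each operation is separated from the one it depends on by as many steps as there are channels, which is the slack that lets a pipelined unit stay busy.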
64Data Flow in Multi-Channel Processing
[Figure: data flow in multi-channel processing. Samples i and i+1 of channels 1, 2, ..., N enter in interleaved order; the feedback and coefficient paths cycle through channel 1, channel 2, channel 3, ... in the same rotation, producing outputs for channels 1, 2, ..., N]
Mapping - DCS
65Multi-Channel Design Datapath
Distributed memories store the coefficients
Distributed memories as buffers schedule the dataflow
Mapping - DCS
66Chameleon RCP Mapping
Pipelined Design - Local Resources - Routable
Design Optimal Throughput
Mapping - DCS
67Analytical Performance Comparison
- Single channel design - N × 250 cycles
- Multi-channel design - N × 50 cycles
- 5x speedup
- One output per cycle - Optimal
- DSPs - N × 400 cycles
- 8x Chameleon RCP speedup
Mapping - DCS
68Performance Speedup
Approach            Time (us)  Speedup
UltraSPARC          2000       1.0
DSP                 660        3.0
Virtex Standard     1426       1.4
Virtex DCS          158        12.7
Chameleon Standard  432        4.6
Chameleon DCS       36         27.8
Mapping - DCS
100 channels, 80 samples, 10-stage filter
400 MHz UltraSPARC, 300 MHz DSP (TI C62x), 200 MHz Virtex (max frequency), 125 MHz Chameleon
69Thesis Contributions
Mapping Techniques
HySAM Model
DRIVE
DRIVE Simulation
70Simulation Tools
- Performance Analysis
- Execution time, memory access, power,
- Algorithmic Analysis
- Various mapping and scheduling algorithms
- Architectural Exploration
- Device and architectural alternatives
DRIVE
71EDA Simulation Tools
- Simulation of VHDL designs
- High level behavioral simulation
- Verifies correctness
- Does not provide performance characteristics
- Simulation of netlist/placed and routed design
- Low level timing simulation
- Fixed to specific implementation on specific device
- Needs final design for each alternative device/algorithm
DRIVE
72DRIVE Goals
- High level performance analysis
- Module level performance characterization
- Architecture abstraction
- Insulate developer from hardware intricacies
- Algorithm analysis
- Extensible to study various algorithmic techniques
- Architecture exploration
- Parameterized architectural model for exploration
DRIVE
73Interpretive Simulation Framework
System Abstraction
Performance Characterization
Analysis and Transformation
System Models
Task Models
Mapping Algorithms
DRIVE
Interpretive Simulation
Performance Analysis Design Exploration
74Interpretive Simulation
- Simulate the application model on the system model
- Performance is based on module characterization
- Advantages
- Exploits the design methodology
- Elimination of actual execution
- Interactive and real-time simulation
- Disadvantages
- Analysis only as accurate as module analysis
- Approximates module interactions
DRIVE
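The idea of interpretive simulation can be sketched in a few lines of Python: walk a schedule and add up pre-characterized module times instead of executing anything. This is not the DRIVE code or its interface; the function name, the data layout and the numbers are assumptions for illustration, and reconfiguration cost is assumed to depend only on the configuration being loaded.

```python
def interpret(schedule, module_time_ns, reconf_time_ns):
    """Estimate run time of a schedule from module characterization.

    schedule       -- [(task, configuration), ...] in execution order
    module_time_ns -- {(task, configuration): execution time}
    reconf_time_ns -- {configuration: time to load it}
    Returns (total time, event trace) without running any task.
    """
    trace = []
    clock = 0.0
    loaded = None
    for task, cfg in schedule:
        if cfg != loaded:
            clock += reconf_time_ns[cfg]
            trace.append((clock, "reconfigure", cfg))
            loaded = cfg
        clock += module_time_ns[(task, cfg)]
        trace.append((clock, "execute", task))
    return clock, trace


# Made-up characterization data.
module_time_ns = {("T1", "C1"): 10.0, ("T2", "C1"): 20.0, ("T3", "C2"): 5.0}
reconf_time_ns = {"C1": 100.0, "C2": 50.0}
schedule = [("T1", "C1"), ("T2", "C1"), ("T3", "C2")]

total_ns, trace = interpret(schedule, module_time_ns, reconf_time_ns)
print(total_ns)
for event in trace:
    print(event)
```

The trace is the kind of data a visualizer could replay, and the estimate is only as accurate as the per-module numbers, which matches the disadvantages listed above.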
75DRIVE Components
USER
Visualizer
System State
Simulator Core
Data
DRIVE
HySAM Model
Scheduler
Applications
Architectures
76Sample Visualizer View
DRIVE
77Sample Publications
- Integrated Mapping Techniques for Reconfigurable SoC Architectures. FPGAs for Custom Computing Machines (FCCM), 2001 (Submitted).
- Parallelizing DSP Nested Loops on Reconfigurable Architectures using Data Context Switching. Design Automation Conference 2001 (Submitted).
- DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconfigurable Systems. Field Programmable Logic and Applications, Aug-Sept 1999.
- Hardware Object Selection for Mapping Loops onto Reconfigurable Architectures. Parallel and Distributed Processing Techniques and Applications, June 1999.
- DEFACTO: A Design Environment for Adaptive Computing Technology (with ISI DEFACTO). Reconfigurable Architectures Workshop 1999, April 1999.
- Dynamic Precision Management for Loop Computations on Reconfigurable Architectures. FPGAs for Custom Computing Machines (FCCM), April 1999.
- Mapping Loops onto Reconfigurable Architectures. Field Programmable Logic and Applications, Aug-Sept 1998.
- Reconfigurable Meshes: Theory and Practice. Reconfigurable Architectures Workshop, Int. Parallel Processing Symposium, April 1997.
78Conclusions
- Reconfigurable Computing Is Here
- Model-based Approach
- HySAM: Hybrid System Architecture Model of reconfigurable architectures
- Algorithmic Mapping Techniques
- Mapping of application loops onto reconfigurable architectures
- Dynamic precision management to exploit run-time reconfiguration
- Configurable pipeline generation and segmentation
- Integrated mapping techniques for hybrid architectures
- Simulation Methodology
- DRIVE: module based interpretive simulation framework