Modeling and Mapping Techniques for Dynamically Reconfigurable Hybrid Architectures

About This Presentation

Description:

Dynamically Reconfigurable Hybrid Architectures. Kiran Bondalapati. Computer Engineering ... Early Hybrid Chip: Xilinx XC 6200 FPGA. Background. SRAM based FPGA ... – PowerPoint PPT presentation

Slides: 74

Provided by: kiranbon

Transcript and Presenter's Notes

Title: Modeling and Mapping Techniques for Dynamically Reconfigurable Hybrid Architectures

1
Modeling and Mapping Techniques for Dynamically
Reconfigurable Hybrid Architectures

Kiran Bondalapati
Computer Engineering

2
Outline

Introduction
Background
Thesis Contributions
HySAM Model
Mapping Techniques
DRIVE Simulation
Conclusions

3
Computing Landscape
[Figure: computing landscape with axes Increasing Flexibility and Increasing Performance. Labels: Microprocessor, ASIC, IBM JPEG, Cost/Performance Gap]
Microprocessor: General purpose, good average performance
ASIC: Special purpose, excellent application-specific performance
Introduction
4
Configurable Computing Concept

Computation and Communication adapted on-the-fly

Introduction
5
Configurable Computing Characteristics

Variable Hardware and Software
Spatial Computation
Distributed Resources
Distributed Control

Introduction
6
Configurable Computing: Variable Hardware
Introduction
Evolution
7
Configurable Computing: Spatial Computing
[Figure: Temporal vs. Spatial computation. Labels: ALU, computations, Chip, Time]
Introduction
8
Configurable Computing: Distributed Resources
Active Logic: Silicon resources which actually perform the computations
Introduction
Configurable Processor
Microprocessor
9
Configurable Computing: Distributed Control
Instruction Broadcast
Localized Control
Introduction
Decode
Configurable Processor
Microprocessor
10
Early Hybrid Chip: Xilinx XC 6200 FPGA

SRAM based FPGA architecture
64 x 64 array of cells
2-input, 2-output logic function in cell
Reconfiguration
Less than 1 ms for entire device
Partial reconfiguration
40 ns per cell
FastMap Processor Interface
FPGA connected to system bus
Memory mapped device
Normal load/store instructions for
reconfiguration
Dynamic reconfiguration

Background
cell
11
Billion Transistor Chips
Conf. Logic
Background
Conf. Logic
CPU
Memory
Memory
Current Systems
Emerging Systems
12
BRASS Garp

Reconfigurable array unit with a RISC processor
Gate array of 32x24 logic blocks
Partial configuration of gate array in row
increments
Configuration cache for fast reconfiguration
4-cycle on-chip and 12-cycle off-chip
reconfiguration time

Memory
Background
Instruction cache
Data cache
Configurable Array
MIPS
13
Chameleon RCP
ARC 32-bit RISC
DMA Controller
128-bit RoadRunner Bus
Background
Reconfigurable Processing Fabric

84 DPUs, 24 Multipliers
24 Billion OPS
3 Billion MACS
Single cycle Reconfiguration
2 Gbyte/sec I/O

14
Xilinx Platform FPGA
Distributed Multipliers
PowerPC 32-bit RISC
Background

10 Million System Gates
200 MHz System Clock
300 MHz PowerPC Core
600 Billion MACS
6 Gbyte/sec RISC-Logic b/w

Virtex-II CLB Array
15
Configurable Computing Challenges

High-level System Models
Current abstraction is register-level
Formal Methodologies
Performance analysis and optimization
Integrated Mapping Techniques
Exploit all resources in a unified approach
Design and Simulation Tools
System-level simulation and analysis

Background
16
Our Model Based Approach
Application Developer
Background
Optimized mapping algorithms for generic
problems and applications
Models
Computational Model Compilation Model
Devices
Systems
17
Related Work

Mapping Loops
Weinhardt, Pande, Luk, etc.
No high-level model
Limited applicability
Compiler Projects
National NAPA, Synopsys Nimble, Berkeley Garp,
USC/ISI DEFACTO, Chameleon c2b
Mainly complementary software efforts
Reconfiguration costs not addressed
Interactions with DEFACTO and Chameleon

Background
18
Thesis Contributions
Mapping Techniques
HySAM Model
DRIVE Simulation
19
Parameterized HySAM Model
Application Developer
Algorithmic techniques for mapping generic
loops onto hybrid architectures
HySAM
Hybrid Reconfigurable Architectures
20
Summary of Mapping Techniques

Mapping Linear Loops
Polynomial complexity algorithms
Mapping onto multi-context architectures
Dynamic precision management
Reconfigurable Pipelines
Mapping onto configurable pipelines
Heuristic pipeline segmentation techniques
Integrated Mapping Techniques
Parallelizing feedback loops
Data Context Switching

21
DRIVE Interpretive Simulation Framework
System Abstraction
Performance Characterization
Analysis and Transformation
System Models
Task Models
Mapping Algorithms
Interpretive Simulation
Performance Analysis Design Exploration
22
Thesis Contributions
Mapping Techniques
HySAM Model
HySAM
DRIVE Simulation
23
Hybrid System Architecture Model
Main Processor
Memory
Interconnection Network
HySAM
Conf. Cache
Conf. Logic Unit (CLU)
Parameterized Model
24
Functions and Configurations

F - Functions
Computational units (e.g. Add, Multiply, Select)
Library Modules
C - Configurations
Area, Configuration time, Execution time,
Precision, Power dissipation, I/O requirement
A Function can be executed by various
Configurations
Aij - Attributes for function Fi in
configuration Cj
Rij - Reconfiguration cost from Ci to Cj
Depends on both Ci and Cj
Can include partial reconfiguration
Reconfiguration cost matrix
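
As an illustration only, the parameters listed on this slide could be held in a small data structure such as the Python sketch below. The names `Configuration`, `HybridModel`, `candidates`, and `reconfig_cost` are assumptions made for this example, as are all the numbers; only configuration time and per-function execution time are modeled, and a missing entry in the cost matrix is assumed to mean a full load of the target configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    name: str
    load_time_ns: float                               # time to load Cj from scratch
    exec_time_ns: dict = field(default_factory=dict)  # function name -> tij


@dataclass
class HybridModel:
    configs: dict = field(default_factory=dict)     # name -> Configuration
    partial_ns: dict = field(default_factory=dict)  # (Ci, Cj) -> Rij, if cheaper than a full load

    def add(self, cfg):
        self.configs[cfg.name] = cfg

    def candidates(self, function):
        """Configurations that can execute the given function."""
        return sorted(n for n, c in self.configs.items() if function in c.exec_time_ns)

    def reconfig_cost(self, src, dst):
        """Rij: depends on both the source and the target configuration."""
        if src == dst:
            return 0.0
        # Assumption: fall back to a full load of dst when no partial cost is recorded.
        return self.partial_ns.get((src, dst), self.configs[dst].load_time_ns)


# Hypothetical library: two multiplier configurations and one adder.
model = HybridModel()
model.add(Configuration("C1", 1000.0, {"Multiply": 40.0}))
model.add(Configuration("C2", 500.0, {"Multiply": 60.0}))
model.add(Configuration("C3", 200.0, {"Add": 8.0}))
model.partial_ns[("C1", "C2")] = 300.0
```

Here `model.candidates("Multiply")` lists both multiplier configurations, which is the "a Function can be executed by various Configurations" property a mapping algorithm would exploit.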

HySAM
25
Attributes
Execution of Fi in Cj

tij - Latency (execution time)
?ij - Throughput
pij - Precision
Others
Data access
Power dissipation

HySAM
26
Scheduled Reconfiguration
Computation (e.g. Program)
Hybrid System Architecture Model
HySAM
Problem Instance (e.g. Input Data)
Configurations and Schedule
Intermediate Results
Computation and Reconfiguration
Result
27
Tasks and Configurations
[Figure: Input Application Tasks T1, T2, T3, ..., Tp (p tasks) are mapped onto m Configurations. Labels: configurations C3 and C5, functions F4 and F6, Reconfiguration cost R53]
HySAM
28
Example: Garp Architectural Parameters

Function  Operation              Configuration  Conf. Time (R0j)  Exec. Time (tij)
F1        Multiplication (Fast)  C1             14.4 us           37.5 ns
F1        Multiplication (Slow)  C2             6.4 us            52.5 ns
F2        Addition               C3             1.6 us            7.5 ns
F3        Subtraction            C4             1.6 us            7.5 ns
F4        Shift                  C5             3.2 us            7.5 ns

HySAM
29
HySAM Abstraction

Resource Model - hardware
Task Model - applications
Execution Model - run-time
Attribute Model - library
Generative Model - design

HySAM
30
Thesis Contributions
Mapping Techniques
HySAM Model
Mapping
DRIVE Simulation
31
Why Loops?

Dense and Regular computations
Occur in most applications
More than 90% of execution time is spent in loops
Extensive research in loop analysis
- Task identification
- Configuration generation

Mapping
32
Loops Definitions

Loop Iteration
Loop Index
Dependency
Data
Control
Loop carried dependency

FOR I1 TO 100 AI 2 BI CI AI -
3 DI DI CI
Mapping
33
Sub-space of Problems
[Figure: mapping techniques arranged by Loop Characteristics (Linear, Feedback) and Resources (Unlimited, Limited): Linear Loop Mapping, Parallel Pipelines, Heuristic Pipeline Segmentation, Data Context Switching]
Mapping
34
Example Mapping Contributions

Problems and Highlighted Characteristics
LMP: Theoretical complexity
DPMA: Novel reconfiguration ideas
DCS: Application area focus

Mapping
35
General Mapping Problem

Execute a given sequence of tasks for N
iterations
Minimize total execution time
E = Computation + Reconfiguration
Find a sequence of configurations which minimizes
E

LOOP 1..N
T1
T2
...
Tp
END
NP-Complete!
Mapping - LMP
36
LMP: Linear Mapping Problem

Given
A set of tasks <T1, T2, ..., Tp> to be executed
sequentially N times
Set of configurations C
Reconfiguration cost matrix R
Find
A sequence <C1, C2, ..., Cq> of configurations
which minimizes the total execution time E

LOOP 1..N
T1
T2
...
Tp
END
Mapping - LMP
37
Optimal Solution

Lemma 1
An optimal sequence of configurations for
executing one iteration of the loop can be
computed in O(pm^2) time
Lemma 2
An optimal sequence of configurations can be
computed by unrolling the loop only m times
Theorem
An optimal sequence of configurations for N
iterations of a loop statement with p tasks,
where each task can be executed in one of m
possible configurations, can be computed in
O(pm^3) time.

Mapping - LMP
38
Optimal Solution - One Iteration

Explore all possible execution sequences? Exponential!

[Figure: tasks T1, ..., Ti, Ti+1, ..., Tp, each with candidate configurations (C2, C3, C4, C6, C7, C9 shown), connected by Reconfiguration Costs]
Mapping - LMP

Exploit subsequence optimality
Utilize dynamic programming to reduce search
space

39
Optimal Solution - Multiple Iterations

Maximum number of distinct <T,C> pairs is pm
Compute dynamic programming solution for
T1...Tp T1...Tp ... T1...Tp (unrolled m times)
Solution repeated N/m times is required sequence
Complexity O(pm^3)
p = number of tasks
m = number of configurations
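
The complexity figure can be read off from the two lemmas. A short derivation, assuming the dynamic program costs a constant amount of work per pair of (previous configuration, current configuration) at each task position:

```latex
% One iteration: p task positions, m candidate configurations each,
% and m possible predecessors per candidate (Lemma 1).
T_{\text{one}} = O(p \cdot m \cdot m) = O(p m^{2})

% Unrolling m times (Lemma 2) gives a chain of p m task positions,
% so the same dynamic program costs
T_{\text{unrolled}} = O\big((p m) \cdot m^{2}\big) = O(p m^{3})

% which does not depend on the iteration count N.
```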

Mapping - LMP
40
Summary of LMP Solution
[Figure: Our Mapping Algorithm takes the Input and the Configuration Library (size m) and produces the Optimal Configuration Sequence C1, C2, ..., Cq]
Mapping - LMP

Minimizes total execution time including
reconfiguration time
Algorithm complexity independent of number of
loop iterations
O(pm^3) compile time algorithm

41
Example: Garp Architectural Parameters

Function  Operation              Configuration  Conf. Time (R0j)  Exec. Time (tij)
F1        Multiplication (Fast)  C1             14.4 us           37.5 ns
F1        Multiplication (Slow)  C2             6.4 us            52.5 ns
F2        Addition               C3             1.6 us            7.5 ns
F3        Subtraction            C4             1.6 us            7.5 ns
F4        Shift                  C5             3.2 us            7.5 ns

Mapping - LMP
42
Example: Mapping FFT onto Garp
FFT Butterfly Operation
- One Complex Multiply, One Complex Add, One Complex Subtract
- 4 Real Multiplies, 3 Real Adds, 3 Real Subtracts
FFT Linearized Task Sequence
TM - Multiplication, TA - Addition, TS - Subtraction
TM TM TM TM TA TS TA TA TS TS
Optimal Solution: N × 13.055 us (N = number of iterations)
Important Characteristic of Solution
- Uses slower execution time Multiplier configuration
- Faster reconfiguration helps in amortizing the execution cost over all the iterations
Mapping - LMP
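
The per-iteration figure for the FFT task sequence can be reproduced with a small dynamic program over the Garp parameters. The Python sketch below is not the thesis algorithm itself. It assumes that switching to a different configuration always costs that configuration's full load time (Rij = R0j), that staying in the same configuration costs nothing, and that every iteration starts in the configuration the previous one ended in, rather than unrolling the loop m times. Function names are hypothetical, and times are kept in picoseconds so the arithmetic is exact.

```python
# config -> (function, configuration time R0j in ps, execution time tij in ps)
CONFIGS = {
    "C1": ("MUL", 14_400_000, 37_500),   # fast multiplier, slow to load
    "C2": ("MUL", 6_400_000, 52_500),    # slow multiplier, fast to load
    "C3": ("ADD", 1_600_000, 7_500),
    "C4": ("SUB", 1_600_000, 7_500),
    "C5": ("SHIFT", 3_200_000, 7_500),
}


def reconfig_cost(src, dst):
    # Assumption: Rij = R0j for i != j, and 0 when dst is already loaded.
    return 0 if src == dst else CONFIGS[dst][1]


def best_chain(tasks, start):
    """Dynamic program over the task list.

    Returns {final_config: (cost, sequence)} given the configuration
    that is loaded before the first task runs.
    """
    frontier = {start: (0, [])}
    for task in tasks:
        nxt = {}
        for cfg, (func, _, exec_time) in CONFIGS.items():
            if func != task:
                continue
            cost, seq = min(
                ((c + reconfig_cost(prev, cfg), s) for prev, (c, s) in frontier.items()),
                key=lambda item: item[0],
            )
            nxt[cfg] = (cost + exec_time, seq + [cfg])
        frontier = nxt
    return frontier


def steady_state(tasks):
    """Cheapest iteration that ends in the configuration it started from."""
    best = None
    for cfg in CONFIGS:
        result = best_chain(tasks, cfg).get(cfg)
        if result is not None and (best is None or result[0] < best[0]):
            best = result
    return best


FFT = ["MUL"] * 4 + ["ADD", "SUB", "ADD", "ADD", "SUB", "SUB"]
cost_ps, sequence = steady_state(FFT)
print(cost_ps / 1_000_000, "us per iteration:", " ".join(sequence))
```

Under these assumptions the sketch selects the slow multiplier C2 for all four multiplications and arrives at 13.055 us per iteration: 12.8 us of reconfiguration (6.4 us plus four loads of 1.6 us) and 0.255 us of execution.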
43
Variable Precision Computation

Precision requirement is lower than implemented
Match implementation to algorithm requirements
Less resources
Execution time
Logic area
Power dissipation
Run-time precision management
Dynamic modification

Mapping - DPMA
44
Precision Variation in Loops
Example:
DO 10 I=1,N
DO 20 J=1,N
RSQ(J) = RSQ(J) + XDIFF(I,J)*YDIFF(I,J)
20 IF (MAXQ.LT.RSQ(J)) THEN MAXQ = RSQ(J)
10 VIRTXY = VIRTXY + MAXQ*SCALE(I)
Mapping - DPMA

8-bit inputs XDIFF(I,J) and YDIFF(I,J)
MAXQ operand and operation
Accumulation
Precision changes with iterations of I
Does not change every iteration
Lower than maximum possible precision (for most
iterations)

45
Precision Variation Curve
Mapping - DPMA
46
Precision Management Problem

Given
PVC for a given operation in the loop
Find
A valid optimal schedule which minimizes total
execution time
Valid schedule
Satisfies the precision requirements of the
computation
Total execution time
Execution time + Reconfiguration time

Mapping - DPMA
47
Dynamic Precision Management Algorithm

DPMA algorithm
Dynamic programming based
Explores sub-optimal configurations
For a few iterations
Reduces reconfiguration overhead
O(um^2) complexity
u = number of PVC points, m = number of configurations
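
A dynamic program of this shape can be sketched in a few lines of Python. This is a simplified illustration, not the published DPMA: the PVC is assumed to be given as segments of (iteration count, required bits), each configuration as (bits, execution time per iteration, load time), and loading a configuration is assumed to cost the same regardless of what was loaded before. The function name and all numbers are hypothetical.

```python
def schedule_precision(pvc, configs):
    """Return (total_time, config index chosen for each PVC segment).

    pvc:     list of (iterations, required_bits), one entry per PVC point
    configs: list of (bits, exec_time_per_iteration, load_time)
    Runs in O(u * m^2) for u PVC points and m configurations.
    """
    INF = float("inf")
    best = None  # best[j] = (cost, choices) with config j loaded after the segments so far
    for iterations, need in pvc:
        nxt = []
        for j, (bits, exec_time, load_time) in enumerate(configs):
            if bits < need:
                nxt.append((INF, []))  # config j cannot satisfy this segment
                continue
            if best is None:
                prev_cost, prev_choice = load_time, []
            else:
                prev_cost, prev_choice = min(
                    ((cost + (0 if i == j else load_time), choice)
                     for i, (cost, choice) in enumerate(best)),
                    key=lambda item: item[0],
                )
            nxt.append((prev_cost + iterations * exec_time, prev_choice + [j]))
        best = nxt
    return min(best, key=lambda item: item[0])


# Hypothetical example: 3 iterations need 8 bits, then 100 iterations need 16 bits.
pvc = [(3, 8), (100, 16)]
configs = [(8, 10, 100), (16, 12, 100)]  # (bits, exec time, load time), arbitrary units
total, choices = schedule_precision(pvc, configs)
```

In this example the schedule runs the first three iterations on the wider, slower 16-bit configuration (total 1336) instead of loading the 8-bit one and reconfiguring afterwards (total 1430), which is the "sub-optimal configuration for a few iterations" trade-off named on the slide.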

Mapping - DPMA
48
Experimental Results
Mapping - DPMA
49
Mapping onto XC 6200
Mapping the multiplier operation in MAXQ * SCALE(I)

Algorithm  Execution Time (ns)  Reconfig. Time (ns)  Total Time (ns)
Raw        655360               20480                675840
Static     532480               17920                550400
Greedy     468010               56320                524330
DPMA       471160               33280                504440

Mapping - DPMA

Raw - 8x32 precision for all iterations
Static - 8x28 precision for all iterations
Greedy - schedule using greedy algorithm
DPMA - schedule using theoretical PVC

50
Assumptions

Higher precision requires more resources
Execution time
Logic area
Monotonic variation in precision
Several image processing and signal processing
applications
Split non-monotonic PVC into monotonic
subsequences
Optimal solution for the given PVC
Near optimal if actual precision variation is
different

Mapping - DPMA
51
DCS Data Context Switching

Introduction to Voice Coding
Synthesis Filter
Single Channel Design
Data Context Switching
Multi-Channel Design

Mapping - DCS
52
Multimedia Communication
Video
Audio
Control and Management
Data
H.261 H.263
G.711, G.723.1, G.728, G.729
H.225.0 H.225.0 H.245 RAS
Signaling Control
T.124
RTCP
RTP
X.224 Class 0
Mapping - DCS
UDP
TCP
T.123
Network (IP)
Datalink (IEEE 802.3)
53
Hybrid Vocoder

Waveform coding: encodes voice signal for transmission
Vocoding: models speech using parameters
Hybrid Coding
Extract parameters of speech
Regenerate the signal using parameters
Compare to original voice signal
Refine

Mapping - DCS
54
Voice Compression

G.729 is a hybrid vocoder algorithm
Conjugate-Structure Algebraic Code Excited
Linear Prediction

LP Analysis Quantization Interpolation
Input Speech
Mapping - DCS
Parameter Encoding Refinement
Synthesis Filter
Transmitted Bitstream

55
Voice Decompression
Fixed Codebook
Received Bitstream
Gc
Synthesis Filter
Post- Processing

Mapping - DCS
Adaptive Codebook
Gp
56
Synthesis Filter

10th order Infinite Impulse Response (IIR) Filter

Mapping - DCS
57
Generalization: Feedback Loops
Loop Carried Dependence
FOR I = 1 TO N DO
FOR J = 1 TO N DO
...
VAR[J] = f(VAR[J-1])
Many Signal and Image Processing Kernels and
Cryptographic Engines
Mapping - DCS
58
Mapping Pipelining Timing Constraints
[Figure: pipelined filter datapath with input x(n), output y(n), and feedback of y(n)]
Mapping - DCS
59
Limitations of the Design

Pipeline delay of 5-12 cycles/stage
Feedback limits throughput
Cannot feed a new input every cycle
Only one output every 5-12 clock cycles
40 sample frame takes 250-600 cycles!

Mapping - DCS
60
Mapping Technique Goals

Maximize channels/sec
Improve throughput
Maximize Multiplier and DPU utilization
Integrated Mapping Technique
Multi-dimensional optimization

Mapping - DCS
Parallelism Pipelining
Embedded Memory
Configurability
61
Data Context Switching

Computed result has to pipe through buffers
No useful computation performed in delay cycles
Multiplier and DPU are idle
Data Context Switching
Perform multi-channel computations
Switch Data Context
Overlapped multiple data set processing
Utilize multipliers every cycle

Mapping - DCS
62
Overlapped Multi-Channel Processing
Data Parallel Programming
Mapping - DCS
63
DCS: Loop Interchange Transformation
Before interchange:
FOR I = 1 TO N DO
FOR J = 1 TO N DO
...
VAR[J] = f(VAR[J-1])
After interchange:
FOR J = 1 TO N DO
FOR I = 1 TO N DO
...
VAR[J] = f(VAR[J-1])
Mapping - DCS
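
The effect of the interchange can be illustrated with a toy feedback loop in Python. The recurrence `y[n] = x[n] + a * y[n-1]` is a hypothetical first-order stand-in for the synthesis filter, and `cycles` is an assumed cost model (one operation issued per cycle, each result available `latency` cycles later), not a measurement of any device.

```python
def channel_major(x, a):
    """Original loop order: finish one channel before starting the next.
    Every step waits on the previous output of the same channel."""
    out = []
    for chan in x:
        y, prev = [], 0.0
        for sample in chan:
            prev = sample + a * prev
            y.append(prev)
        out.append(y)
    return out


def sample_major(x, a):
    """Interchanged order: the inner loop walks the channels, so consecutive
    operations belong to different data contexts and are independent."""
    n_chan, n_samp = len(x), len(x[0])
    state = [0.0] * n_chan              # one feedback value per channel
    out = [[] for _ in range(n_chan)]
    for n in range(n_samp):
        for c in range(n_chan):
            state[c] = x[c][n] + a * state[c]
            out[c].append(state[c])
    return out


def cycles(n_chan, n_samp, latency, interleaved):
    """Assumed cost model for a pipelined feedback stage."""
    if interleaved:
        # A new channel enters every cycle; a channel's next sample is ready
        # once max(n_chan, latency) cycles have passed.
        return n_samp * max(n_chan, latency)
    return n_chan * n_samp * latency    # each sample waits out the full latency


x = [[1.0, 2.0, 3.0], [0.5, 0.25, 0.125]]
same = channel_major(x, 0.5) == sample_major(x, 0.5)
speedup = cycles(100, 40, 5, False) / cycles(100, 40, 5, True)
```

Both orderings compute identical outputs because each channel's recurrence is evaluated in the same order. Under the assumed cost model, interleaving 100 channels through a 5-cycle stage keeps the pipeline full and reduces the cycle count by the pipeline latency.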
64
Data Flow in Multi-Channel Processing
[Figure: data flow for channels 1, 2, ..., N. Labels: sample i, sample i+1, feedback, coefficient, per-channel streams (channel 1, channel 2, channel 3, ..., channel N)]
Mapping - DCS
65
Multi-Channel Design Datapath
Distributed memories store the coefficients
Distributed memories as buffers schedule the dataflow
Mapping - DCS
66
Chameleon RCP Mapping
Pipelined Design
- Local Resources
- Routable Design
Optimal Throughput
Mapping - DCS
67
Analytical Performance Comparison

Single channel design - N × 250 cycles
Multi-channel design - N × 50 cycles
5x speedup
One output per cycle: Optimal
DSPs - N × 400 cycles
8x Chameleon RCP speedup

Mapping - DCS
68
Performance Speedup
Approach            Time (us)  Speedup
UltraSPARC          2000       1.0
DSP                 660        3.0
Virtex Standard     1426       1.4
Virtex DCS          158        12.7
Chameleon Standard  432        4.6
Chameleon DCS       36         27.8

100 channels, 80 samples, 10-stage filter
400 MHz UltraSPARC
300 MHz DSP TI C62x
200 MHz Virtex (max frequency)
125 MHz Chameleon
Mapping - DCS
69
Thesis Contributions
Mapping Techniques
HySAM Model
DRIVE
DRIVE Simulation
70
Simulation Tools

Performance Analysis
Execution time, memory access, power,
Algorithmic Analysis
Various mapping and scheduling algorithms
Architectural Exploration
Device and architectural alternatives

DRIVE
71
EDA Simulation Tools

Simulation of VHDL designs
High level behavioral simulation
Verifies correctness
Does not provide performance characteristics
Simulation of netlist/placed and routed design
Low level timing simulation
Fixed to specific implementation on specific
device
Needs final design for each alternative
device/algorithm

DRIVE
72
DRIVE Goals

High level performance analysis
Module level performance characterization
Architecture abstraction
Insulate developer from hardware intricacies
Algorithm analysis
Extensible to study various algorithmic
techniques
Architecture exploration
Parameterized architectural model for exploration

DRIVE
73
Interpretive Simulation Framework
System Abstraction
Performance Characterization
Analysis and Transformation
System Models
Task Models
Mapping Algorithms
DRIVE
Interpretive Simulation
Performance Analysis Design Exploration
74
Interpretive Simulation

Simulate the application model on the system
model
Performance is based on module characterization
Advantages
Exploits the design methodology
Elimination of actual execution
Interactive and real-time simulation
Disadvantages
Analysis only as accurate as module analysis
Approximates module interactions

DRIVE
75
DRIVE Components
USER
Visualizer
System State
Simulator Core
Data
DRIVE
HySAM Model
Scheduler
Applications
Architectures
76
Sample Visualizer View
DRIVE
77
Sample Publications

Integrated Mapping Techniques for Reconfigurable SoC Architectures
FPGAs for Custom Computing Machines (FCCM), 2001 (Submitted).
Parallelizing DSP Nested Loops on Reconfigurable Architectures using Data Context Switching
Design Automation Conference 2001 (Submitted).
DRIVE: An Interpretive Simulation and Visualization Environment for Dynamically Reconfigurable Systems
Field Programmable Logic and Applications, Aug-Sept 1999.
Hardware Object Selection for Mapping Loops onto Reconfigurable Architectures
Parallel and Distributed Processing Techniques and Applications, June 1999.
DEFACTO: A Design Environment for Adaptive Computing Technology (with ISI DEFACTO)
Reconfigurable Architectures Workshop 1999, April 1999.
Dynamic Precision Management for Loop Computations on Reconfigurable Architectures
FPGAs for Custom Computing Machines (FCCM), April 1999.
Mapping Loops onto Reconfigurable Architectures
Field Programmable Logic and Applications, Aug-Sept 1998.
Reconfigurable Meshes: Theory and Practice
Reconfigurable Architectures Workshop, Int. Parallel Processing Symposium, April 1997.