Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths presentation

1
Exploiting Operation Level Parallelism Through
Dynamically Reconfigurable Datapaths

Zhining Huang, Sharad Malik
Department of Electrical Engineering
Princeton University

2
Dynamically Reconfigurable Datapaths

Speed up kernel loops using reconfigurable hardware

[Figure labels: Applications (Loop 1, Loop 2, Loop 3, Trivial Codes), µP, Reconfigurable Datapath]

3
Outline

Application specific programmable platforms
Methodology overview and architectural model
Datapath design for kernel loops
Direct Mapping, Pipelining
Reconfigurable datapath design
Case studies
GSM, MPEG II
Conclusion

4
Application Specific Programmable Platforms

Why programmable platforms?
Design cost, time to market
Different programmable platforms
Bit level: FPGA based
Word level: specialized VLIW, coarse grained reconfigurable coprocessors
Thread level: multiple PEs with on-chip communication networks

5
Application Specific Programmable Platforms
(contd.)
[Figure: axes labeled Flexibility and Performance, Power]

Goal: approach the flexibility of GPPs with the efficiency of ASICs
Part of the MESCAL project (Modern Embedded Systems, Compilers, Architectures and Languages)
A disciplined effort for application specific programmable platform development

6
Related Research

Various reconfigurable coprocessors
Garp [Hauser97], PipeRench [Goldstein99], Pleiades [Wan00]
Chameleon Systems, Morphics Technology
General reconfigurable fabrics compiler
Hardware resource, routing, compiler
Our approach
Design automation of the application specific
reconfigurable fabrics
Coarse grained dynamically reconfigurable logic

7
Architectural Model

RISC + coarse grained reconfigurable datapath
Fixed function units
Reconfigurable interconnections

[Figure: Reconfigurable Datapath]

8
Methodology Overview

Designing the application specific reconfigurable datapath.

[Flow diagram labels: Front End Compilation (Profiling, DA, etc.); Kernel Loop Extraction; Direct Mapping; Hardware Constraint; Performance Estimation; Datapaths; Mapping Algorithm; Reconfigurable Datapath]

9
Mapping Kernel Loops from C to Hardware

Generating a datapath for each kernel loop.

[Flow diagram labels: IR code after front end compilation; Mapping within basic blocks; Branch merging; Intra-iteration Scheduling; Detail; Datapath with maximum operation level parallelism; Inter-iteration Scheduling; Critical Path Detection; Datapath with high computation throughput; FU merging]

10
Direct Mapping

Direct mapping from IR to hardware
One instruction to one function unit

Cb5:
    ld   r1, r2, r12
    ld   r13, r9, r12
    ld   r14, r10, r12
    add  r5, r1, 1
    add  r11, r5, r13
    lsr  r3, r11, 1
    sub  r6, r3, r14

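A minimal Python sketch of the direct-mapping idea: each IR instruction becomes one function unit, and register reads are wired to whichever unit last wrote that register. The `FU` class, the `direct_map` helper, and the reading of each instruction as (op, dest, src1, src2) are illustrative assumptions, not the authors' tool.

```python
from dataclasses import dataclass, field


@dataclass
class FU:
    """One function unit, created for exactly one IR instruction."""
    op: str                                       # operation, e.g. "ld" or "add"
    dest: str                                     # register the unit produces
    inputs: list = field(default_factory=list)    # ("fu", index) or ("in", operand)


def direct_map(instrs):
    """Map (op, dest, src1, src2) tuples to FUs, one FU per instruction."""
    fus = []
    last_writer = {}   # register name -> index of the FU that last wrote it
    for op, dest, *srcs in instrs:
        inputs = [("fu", last_writer[s]) if s in last_writer else ("in", s)
                  for s in srcs]
        fus.append(FU(op, dest, inputs))
        last_writer[dest] = len(fus) - 1
    return fus


# The Cb5 block, assuming the first operand is the destination register.
cb5 = [
    ("ld", "r1", "r2", "r12"),
    ("ld", "r13", "r9", "r12"),
    ("ld", "r14", "r10", "r12"),
    ("add", "r5", "r1", 1),
    ("add", "r11", "r5", "r13"),
    ("lsr", "r3", "r11", 1),
    ("sub", "r6", "r3", "r14"),
]
datapath = direct_map(cb5)
for index, fu in enumerate(datapath):
    print(index, fu.op, fu.dest, fu.inputs)
```

Operands tagged `"in"` are live-in registers or immediates that come from outside the block; operands tagged `"fu"` become wires between function units.
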
11
Direct Mapping (contd.)

Branch condition transforms

Cb5:
    blt  r6, 0, Cb7
Cb6:
    add  r19, r19, r6
    jump Cb8
Cb7:
    sub  r19, r19, r6
Cb8:

[Figure labels: cmp, -, mux]

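The branch transform can be illustrated in software: both arms of the branch are computed unconditionally and a multiplexer picks the result, so no control flow remains in the datapath. This is a sketch under the assumption that `blt r6, 0, Cb7` branches when r6 is negative; the function names are hypothetical.

```python
def branchy(r19, r6):
    """Control-flow version, following the Cb5..Cb8 blocks."""
    if r6 < 0:            # blt r6, 0, Cb7
        r19 = r19 - r6    # Cb7: sub r19, r19, r6
    else:
        r19 = r19 + r6    # Cb6: add r19, r19, r6
    return r19


def muxed(r19, r6):
    """Datapath version: both arms always execute, a mux selects one."""
    sel = r6 < 0                        # cmp unit
    added = r19 + r6                    # adder (Cb6 arm)
    subbed = r19 - r6                   # subtractor (Cb7 arm)
    return subbed if sel else added     # mux driven by the compare result


for r6 in (-3, 0, 4):
    assert branchy(10, r6) == muxed(10, r6)
```

The cost is that both arithmetic units are always active; the benefit is that the compare and both arms can sit in the same pipe stage.
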
12
Intra-iteration Scheduling

Schedule FUs into different pipe stages

Cb5:
    ld   r1, r2, r12
    ld   r13, r9, r12
    ld   r14, r10, r12
    add  r5, r1, 1
    add  r11, r5, r13
    lsr  r3, r11, 1
    sub  r6, r3, r14
    blt  r6, 0, Cb7
Cb6:
    add  r19, r19, r6
    jump Cb8
Cb7:
    sub  r19, r19, r6
Cb8:

[Figure: scheduled datapath with units labeled SH, -, cmp, -. Caption: Kernel loop code from GSM]

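One simple way to place function units into pipe stages is as-soon-as-possible (ASAP) scheduling: a unit goes one stage after the latest unit it reads from. The slides do not say which scheduler is used, so the sketch below is an assumption, as are the unit names and the one-stage-per-unit delay. The loop-carried read of r19 is ignored here because it belongs to inter-iteration scheduling.

```python
def asap_stages(deps):
    """deps maps each FU to the FUs it reads from; returns FU -> stage (1-based)."""
    stage = {}

    def visit(fu):
        if fu not in stage:
            # One stage after the latest producer; sources land in stage 1.
            stage[fu] = 1 + max((visit(p) for p in deps[fu]), default=0)
        return stage[fu]

    for fu in deps:
        visit(fu)
    return stage


# Dataflow of the GSM kernel loop body after the branch is turned into a mux.
deps = {
    "ld_r1": [], "ld_r13": [], "ld_r14": [],
    "add_r5": ["ld_r1"],
    "add_r11": ["add_r5", "ld_r13"],
    "lsr_r3": ["add_r11"],
    "sub_r6": ["lsr_r3", "ld_r14"],
    "cmp": ["sub_r6"],
    "add_r19": ["sub_r6"],
    "sub_r19": ["sub_r6"],
    "mux": ["cmp", "add_r19", "sub_r19"],
}
stages = asap_stages(deps)
print(stages)
```

Under these assumptions the three loads share stage 1, and the compare and both r19 arms share one stage, which is the operation level parallelism the datapath exposes.
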
13
Inter-iteration Scheduling

Pipelining the execution of loop iterations
Determine the Initial Interval (II) of a loop datapath

If there is no data dependence:
II = 1 (single copy datapath)
II = 0 (multiple copies of datapaths)

[Figure: datapath with pipe stages p2 to p5 and units labeled <<, x, -]

14
Inter-iteration Scheduling (contd.)

Data dependence from FU i to FU j across loop iterations
Feedback connection
II = PipeStage(i) - PipeStage(j) + FU_Delay(j), if II > 0

II = 5 - 1 + 1 = 5
Fetch new loop iteration every 5 cycles

[Figure: datapath with pipe stages p1 to p5 and units labeled <<, x, -]

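The feedback rule on this slide can be written as a small helper, reading the formula as II = PipeStage(i) - PipeStage(j) + FU_Delay(j). The operators are reconstructed from the slide's worked example (5, 1, 1 giving 5). Taking the maximum over several dependences and defaulting to II = 1 are assumptions made for this sketch.

```python
def feedback_ii(stage_i, stage_j, fu_delay_j):
    """II forced by a loop-carried dependence from FU i back to FU j."""
    ii = stage_i - stage_j + fu_delay_j
    # The slide applies the formula only when the result is positive.
    return ii if ii > 0 else None


def loop_ii(dependences):
    """Assumed combination rule: the tightest dependence sets the loop's II.

    dependences: iterable of (stage_i, stage_j, fu_delay_j) tuples.
    With no binding dependence, fall back to II = 1 (single datapath copy).
    """
    bounds = [feedback_ii(*d) for d in dependences]
    return max([b for b in bounds if b is not None], default=1)


print(feedback_ii(5, 1, 1))   # the slide's example: 5 - 1 + 1
```
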
15
Inter-iteration Scheduling (contd.)

Data dependence on memory access
No feedback connections needed
II = ⌈(PipeStage(i) - PipeStage(j) + 1) / k⌉
k: distance of dependent iterations, from data dependence analysis

[Figure: LD and ST units placed across pipe stages p1 to p5, with example values II = 4, II = 0, II = 4]

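For memory-carried dependences, the slide's formula appears to be a ceiling division of the stage span by the dependence distance k. That reading is an assumption, since the operators did not survive in the transcript. A sketch under that assumption, with hypothetical names:

```python
def memory_ii(stage_i, stage_j, k):
    """Assumed reading: II = ceil((PipeStage(i) - PipeStage(j) + 1) / k).

    k is the iteration distance between the dependent memory accesses.
    """
    if k < 1:
        raise ValueError("dependence distance k must be at least 1")
    span = stage_i - stage_j + 1
    return -(-span // k)   # integer ceiling division


# A store in stage 4 feeding a load in stage 1 of the next iteration (k = 1).
print(memory_ii(4, 1, 1))
# The same pair when the dependence spans two iterations (k = 2).
print(memory_ii(4, 1, 2))
```

A larger dependence distance relaxes the bound, because several iterations may be in flight before the dependent access has to wait.
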
16
Execution Time Estimation
T = S + II × (N - 1) + O + W (cycles)

S: total number of pipeline stages of the datapath
II: initial interval between the fetch of 2 consecutive iterations
N: number of loop iterations
O: configuration overhead
W: system write back
Example: T = 5 + 2 × (32 - 1) + 4 = 71
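The estimate reads as T = S + II × (N - 1) + O + W, and the slide's numbers check out under that reading. The sketch below reproduces the example. The slide gives a single trailing term of 4 without saying how it splits between O and W, so it is passed here as one overhead value; the function and parameter names are hypothetical.

```python
def estimate_cycles(stages, ii, iterations, config_overhead=0, write_back=0):
    """T = S + II * (N - 1) + O + W, in cycles."""
    fill = stages                      # first iteration flows through every stage
    steady = ii * (iterations - 1)     # each further iteration starts II cycles later
    return fill + steady + config_overhead + write_back


# Slide example: S = 5, II = 2, N = 32, trailing overhead term = 4.
print(estimate_cycles(5, 2, 32, config_overhead=4))   # 5 + 62 + 4 = 71
```
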

17
Reconfigurable Datapath Design

Embed individual datapaths into a single datapath.
Datapath graph Gi
Vertices are hardware resources (memories, registers, function units)
Edges are connections between them
Construct a single graph G such that each Gi ⊆ G and G has the fewest edges and vertices
Bipartite matching based algorithm [Huang 2001]
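To make the embedding problem concrete, here is a deliberately naive greedy merge in Python. It is not the bipartite matching algorithm cited on the slide and does not minimize edges; it only shows what it means for every Gi to be contained in a merged graph G. The graph representation and all names are assumptions for illustration.

```python
def merge_datapaths(graphs):
    """Greedily embed each datapath graph into one merged graph.

    graphs: list of (vertices, edges), where vertices maps a name to an FU
    type and edges is a set of (src_name, dst_name) pairs.
    Returns (merged_types, merged_edges, mappings).
    """
    merged_types = []      # merged vertex index -> FU type
    merged_edges = set()   # (src_index, dst_index)
    mappings = []          # per input graph: vertex name -> merged index
    for vertices, edges in graphs:
        used = set()       # merged vertices already taken by this graph
        mapping = {}
        for name, fu_type in vertices.items():
            # Reuse the first free merged FU of the same type, else add one.
            for idx, existing in enumerate(merged_types):
                if existing == fu_type and idx not in used:
                    break
            else:
                idx = len(merged_types)
                merged_types.append(fu_type)
            used.add(idx)
            mapping[name] = idx
        merged_edges |= {(mapping[s], mapping[d]) for s, d in edges}
        mappings.append(mapping)
    return merged_types, merged_edges, mappings


g1 = ({"a": "add", "s": "shift"}, {("a", "s")})
g2 = ({"x": "add", "y": "sub"}, {("x", "y")})
types, edges, maps = merge_datapaths([g1, g2])
print(types, sorted(edges))
```

Here both loops share one adder, so the merged graph needs three units instead of four. A matching-based method would additionally choose which same-type units to share so that as many edges as possible coincide.
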

18
Reconfigurable Datapath

Merged graph G to reconfigurable datapath
Vertices to function units
Edges to reconfigurable interconnects
By selecting a subset of the interconnections, any of the embedded datapaths can be generated and executed on the reconfigurable datapath
The appropriate interconnects in the merged datapath are enabled using configuration bits
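As a rough illustration of selecting a datapath through configuration bits, the sketch below assigns one enable bit to each interconnect of the merged graph and sets the bits for the edges that one kernel loop uses. One bit per edge is an assumption for clarity; real routing boxes would typically encode a selection among inputs more compactly. The names are hypothetical.

```python
def config_bits(merged_edges, active_edges):
    """Return one enable bit per merged interconnect, in sorted edge order."""
    missing = set(active_edges) - set(merged_edges)
    if missing:
        raise ValueError(f"edges not present in merged datapath: {sorted(missing)}")
    order = sorted(merged_edges)
    return [1 if edge in active_edges else 0 for edge in order]


# Merged datapath with two interconnects; each loop enables its own subset.
merged = {(0, 1), (0, 2)}
print(config_bits(merged, {(0, 1)}))   # context for the first loop
print(config_bits(merged, {(0, 2)}))   # context for the second loop
```

Each returned bit vector plays the role of one configuration context: loading it turns the merged hardware into the datapath of a single kernel loop.
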

19
Routing

Useful interconnections are selected
Routing box to select between multiple
connections
Configuration contexts
Configuration bits for routing box
Control bits for some FUs
Static register initialization

[Figure legend: Interconnection Routing, Routing Box, Function Unit, Register]

20
Reconfiguration Overhead

Store configuration contexts of a limited number of kernel loops in distributed RAMs
Fast context switch for reconfigurable fabrics
NEC OmniPath [Furuta00], Chameleon Systems
Reconfiguration overhead
read live-in register set
write live-out register set

[Figure: Reconfiguration controller (RC) selecting among Context 1 to Context 4 via a Context Address]

21
Critical Path and Clock Speed

Critical path in the reconfigurable datapath
Delay of FU
Delay of routing box
Delay of directly connected wires
Critical path in general processor
No longer in FU stage
Branch control, decoding stage
The clock speed of the reconfigurable datapath should be no less than that of a general processor

22
Benchmark Studies

MPEG
Overall speedup: 3.57
10 kernel loops: 86% of execution time
Max possible speedup: 7.14

GSM
Overall speedup: 2.78
10 kernel loops: 81% of execution time
Max possible speedup: 5.26

[Figure: Speed-up]

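The "max possible speedup" figures follow from Amdahl's law: if the kernel loops take a fraction f of the run time and were made infinitely fast, the program would speed up by 1 / (1 - f). The sketch below checks both benchmarks. The function names are illustrative, and reading 86 and 81 as percentages of execution time is an assumption that the arithmetic supports.

```python
def max_speedup(kernel_fraction):
    """Amdahl bound when the accelerated fraction takes zero time."""
    return 1.0 / (1.0 - kernel_fraction)


def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law for a finite speedup of the accelerated fraction."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


print(round(max_speedup(0.86), 2))   # MPEG: 7.14
print(round(max_speedup(0.81), 2))   # GSM: 5.26
```

The remaining 14% and 19% of execution time stay on the processor, which is why the reported overall speedups sit well below these bounds.
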
23
Datapath Mapping Results

Significant overlap between datapaths is obtained.
Configuration bits: MPEG < 500 bits, GSM < 1000 bits

24
Speed-up vs. Memory Bandwidth

Make multiple copies of datapath
Constraint: number of memory ports

[Figures: Time and Speed-up vs. # of memory ports, for the MPEG II Coder and the GSM Coder]

25
Clustered VLIW machine?

Application specific clustered VLIW processor
with one instruction per kernel loop
Reconfiguration contexts as instructions
Interconnections as application specific
bypassing networks

[Figure labels: Configuration contexts, Mem Port, FU]

26
Reconfigurable Datapath (RD) vs. VLIW

[Figures: Execution Time for MPEG II and for GSM]

27
Applicable Application Domain

computation intensive applications
localized operational parallelism
a few areas account for most of the execution time

28
Conclusion

A methodology for the design of a dynamically
reconfigurable datapath coprocessor
Kernel loop IR to datapath hardware
Datapath hardware merged into reconfigurable
hardware
MPEG, GSM benchmark case studies
Examined reconfigurable datapaths vs. VLIW
processors
Future research
Increasing the datapath pipelining throughput
through FU merging
Fully automating the process
