Exploiting Operation Level Parallelism Through Dynamically Reconfigurable Datapaths

1
Exploiting Operation Level Parallelism Through
Dynamically Reconfigurable Datapaths
  • Zhining Huang, Sharad Malik
  • Department of Electrical Engineering
  • Princeton University

2
Dynamically Reconfigurable Datapaths
  • Speed up kernel loops using reconfigurable
    hardware

[Figure: an application made of Loop 1, Loop 2, Loop 3 and trivial code, next to a µP and a Reconfigurable Datapath]

3
Outline
  • Application specific programmable platforms
  • Methodology overview and architectural model
  • Datapath design for kernel loops
      • Direct Mapping, Pipelining
  • Reconfigurable datapath design
  • Case studies
      • GSM, MPEG II
  • Conclusion

4
Application Specific Programmable Platforms
  • Why programmable platforms?
      • Design cost, time to market
  • Different programmable platforms
      • Bit level: FPGA based
      • Word level: specialized VLIW, coarse grained
        reconfigurable coprocessors
      • Thread level: multiple PEs with on-chip
        communication networks

5
Application Specific Programmable Platforms
(contd.)
[Figure: flexibility versus performance and power]
  • Goal: approach the flexibility of GPPs with the
    efficiency of ASICs
  • Part of the MESCAL project
      • Modern Embedded Systems, Compilers, Architectures
        and Languages
      • A disciplined effort for application specific
        programmable platform development

6
Related Research
  • Various reconfigurable coprocessors
      • Garp [Hauser97], PipeRench [Goldstein99],
        Pleiades [Wan00]
      • Chameleon Systems, Morphics Technology
  • General reconfigurable fabrics compiler
      • Hardware resource, routing, compiler
  • Our approach
      • Design automation of the application specific
        reconfigurable fabrics
      • Coarse grained dynamically reconfigurable logic

7
Architectural Model
  • RISC + coarse grained reconfigurable datapath
      • Fixed function units
      • Reconfigurable interconnections

[Figure: Reconfigurable Datapath]

8
Methodology Overview
  • Designing the application specific reconfigurable
    datapath.

[Figure: design flow with the stages Front End Compilation (profiling, DA, etc.), Kernel Loop Extraction, Direct Mapping, Hardware Constraint, Performance Estimation, Datapaths, Mapping Algorithm and Reconfigurable Datapath]

9
Mapping Kernel Loops from C to Hardware
  • Generating a datapath for each kernel loop.

[Figure: mapping flow with the steps IR code after front end compilation, Mapping within basic blocks, Branch merging, Intra-iteration Scheduling, Datapath with maximum operation level parallelism, Inter-iteration Scheduling, Critical Path Detection, FU merging and Datapath with high computation throughput]

10
Direct Mapping
  • Direct mapping from IR to hardware
  • One instruction to one function unit

Cb5:
    ld   r1, r2, r12
    ld   r13, r9, r12
    ld   r14, r10, r12
    add  r5, r1, 1
    add  r11, r5, r13
    lsr  r3, r11, 1
    sub  r6, r3, r14
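
A minimal Python sketch of the one-instruction-to-one-function-unit idea, applied to the Cb5 block above. The `Instr` record, the `direct_map` helper and the reading of the first operand as the destination register are assumptions made for illustration; the slides do not describe the tool's actual data structures.

```python
from collections import namedtuple

# Hypothetical IR record: opcode, destination register, source operands.
Instr = namedtuple("Instr", "op dest srcs")

def direct_map(instrs):
    """Create one function unit (FU) per instruction and wire register dataflow."""
    last_writer = {}      # register name -> index of the FU that produced it
    units, edges = [], []
    for idx, ins in enumerate(instrs):
        units.append(ins.op)
        for src in ins.srcs:
            if src in last_writer:   # value produced earlier in this block
                edges.append((last_writer[src], idx))
        last_writer[ins.dest] = idx
    return units, edges

# Cb5, assuming the first operand of every instruction is its destination.
cb5 = [
    Instr("ld",  "r1",  ("r2", "r12")),
    Instr("ld",  "r13", ("r9", "r12")),
    Instr("ld",  "r14", ("r10", "r12")),
    Instr("add", "r5",  ("r1", "1")),
    Instr("add", "r11", ("r5", "r13")),
    Instr("lsr", "r3",  ("r11", "1")),
    Instr("sub", "r6",  ("r3", "r14")),
]
units, edges = direct_map(cb5)
```

Each entry of `units` stands for one FU, and each pair in `edges` is a wire from a producing FU to a consuming FU. Registers that are only read (r2, r9, r10, r12) and immediates create no edges because nothing inside the block writes them.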
11
Direct Mapping (contd.)
  • Branch condition transforms

Cb5:
    blt  r6, 0, Cb7
Cb6:
    add  r19, r19, r6
    jump Cb8
Cb7:
    sub  r19, r19, r6
Cb8:

[Figure: resulting datapath with cmp, - and mux units]
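
The branch condition transform can be illustrated with a short Python sketch. It assumes `blt r6, 0, Cb7` means "branch to Cb7 if r6 < 0"; the function names are hypothetical. Both arms are computed unconditionally and the comparison result only drives the select input of a multiplexer.

```python
def update_branching(r19, r6):
    """Original control flow: the branch picks either the add or the sub."""
    if r6 < 0:              # blt r6, 0, Cb7
        r19 = r19 - r6      # Cb7
    else:
        r19 = r19 + r6      # Cb6
    return r19

def mux(sel, a, b):
    """2-to-1 multiplexer: returns a when sel is true, otherwise b."""
    return a if sel else b

def update_datapath(r19, r6):
    """Branch-free form: both FUs always compute, cmp drives the mux select."""
    added = r19 + r6
    subbed = r19 - r6
    cmp_out = r6 < 0
    return mux(cmp_out, subbed, added)
```

Both versions add the absolute value of r6 to r19, so the transformed datapath needs no control flow inside the loop body.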
12
Intra-iteration Scheduling
  • Schedule FUs into different pipe stages

Cb5:
    ld   r1, r2, r12
    ld   r13, r9, r12
    ld   r14, r10, r12
    add  r5, r1, 1
    add  r11, r5, r13
    lsr  r3, r11, 1
    sub  r6, r3, r14
    blt  r6, 0, Cb7
Cb6:
    add  r19, r19, r6
    jump Cb8
Cb7:
    sub  r19, r19, r6
Cb8:

[Figure: function units (including SH, - and cmp) scheduled into pipe stages]
Kernel loop code from GSM
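
One simple way to place FUs into pipe stages is as-soon-as-possible (ASAP) scheduling, sketched below in Python. This is an assumption for illustration: the slides do not say which scheduling policy is used, and the sketch gives every FU a full stage of its own (no chaining of fast FUs within a stage). The FU numbering and edge list are a hand-built dataflow graph for the GSM kernel code above.

```python
def asap_stages(num_units, edges):
    """Give each FU the earliest pipe stage that follows all of its producers."""
    stage = [1] * num_units
    # Every edge goes from a lower to a higher index (program order), so
    # visiting edges by consumer index sees final producer stages first.
    for src, dst in sorted(edges, key=lambda e: e[1]):
        stage[dst] = max(stage[dst], stage[src] + 1)
    return stage

# FU indices: 0-2 loads, 3 add r5, 4 add r11, 5 lsr, 6 sub r6,
# 7 cmp (from blt), 8 add r19, 9 sub r19, 10 mux selecting the new r19.
gsm_edges = [
    (0, 3), (3, 4), (1, 4), (4, 5), (5, 6), (2, 6),
    (6, 7), (6, 8), (6, 9), (7, 10), (8, 10), (9, 10),
]
stages = asap_stages(11, gsm_edges)
```

Under these assumptions the three loads share stage 1 and the comparison, add and sub of r19 share a later stage, which is where the operation level parallelism comes from.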
13
Inter-iteration Scheduling
  • Pipelining the execution of loop iterations
  • Determine the Initial Interval (II) of a loop
    datapath
  • If there is no data dependence:
      • II = 1 (single copy of the datapath)
      • II = 0 (multiple copies of the datapath)

[Figure: pipelined datapath with stages p1 to p5 and <<, x and - function units]

14
Inter-iteration Scheduling (contd.)
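  • Data dependence from FU i to FU j across loop
    iterations
      • Feedback connection
      • II = PipeStage(i) - PipeStage(j) + FU_Delay(j),
        if II > 0
      • Example: II = 5 - 1 + 1 = 5
      • Fetch a new loop iteration every 5 cycles

[Figure: pipelined datapath with stages p1 to p5 and <<, x and - function units, with a feedback connection]
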
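
The feedback rule on this slide can be written as a small Python helper. It reads the garbled formula as II = PipeStage(i) - PipeStage(j) + FU_Delay(j), which matches the slide's own example (5, 1, 1 giving 5). Taking the maximum over several feedback dependences and falling back to II = 1 when there are none (the single-copy case from the previous slide) are assumptions of this sketch; `loop_ii` is a hypothetical name.

```python
def loop_ii(feedback_deps, default_ii=1):
    """Interval between iterations for a loop datapath with feedback edges.

    feedback_deps: list of (stage_i, stage_j, fu_delay_j) tuples, one per
    loop-carried dependence from FU i to FU j.
    """
    candidates = [si - sj + delay for si, sj, delay in feedback_deps]
    positive = [c for c in candidates if c > 0]   # the rule applies only if II > 0
    # Assumption: the tightest (largest) constraint decides the loop's II.
    return max(positive, default=default_ii)

# Slide example: producer in stage 5, consumer in stage 1, FU delay 1.
example_ii = loop_ii([(5, 1, 1)])
```

With the slide's numbers the datapath can accept a new iteration only every 5 cycles, because the consumer in stage 1 has to wait for the value leaving stage 5.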
15
Inter-iteration Scheduling (contd.)
  • Data dependence on memory access
      • No feedback connections needed
      • II = ⌈(PipeStage(i) - PipeStage(j) + 1) / k⌉
      • k: distance between dependent iterations, from data
        dependence analysis

[Figure: three LD/ST placements over pipe stages p1 to p5, labeled II = 4, II = 0 and II = 4]

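
A hedged Python sketch of the memory dependence rule. It assumes the two unreadable characters in the formula are ceiling brackets, so that II = ceil((PipeStage(i) - PipeStage(j) + 1) / k), and that a non-positive result means the dependence imposes no constraint (II = 0). It also assumes that in the figure the access in stage p4 plays the role of FU i in the II = 4 cases. None of this is confirmed by the transcript.

```python
import math

def memory_ii(stage_i, stage_j, k):
    """II imposed by a memory dependence from FU i to FU j, k iterations apart."""
    ii = math.ceil((stage_i - stage_j + 1) / k)
    return max(ii, 0)   # assumption: non-positive values mean "no constraint"

late_to_early = memory_ii(4, 1, 1)   # dependent access sits 3 stages earlier
early_to_late = memory_ii(1, 4, 1)   # dependent access sits later in the pipe
two_apart = memory_ii(4, 1, 2)       # same stages, dependence distance k = 2
```

A larger dependence distance k relaxes the constraint, since the dependent access belongs to an iteration that enters the pipeline k fetches later.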
16
Execution Time Estimation
T = S + II × (N - 1) + O + W (cycles)
  • S: total number of pipeline stages of the datapath
  • II: initial interval between the fetch of 2
    consecutive iterations
  • N: loop iteration number
  • O: configuration overhead
  • W: system write back
  • Example: T = 5 + 2 × (32 - 1) + 4 = 71
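
The estimate can be checked with a few lines of Python. The slide's example shows only three summands, so how the final 4 cycles split between configuration overhead O and write back W is not stated; the call below puts all 4 into O purely for illustration.

```python
def execution_cycles(S, II, N, O, W):
    """Estimated cycles: pipeline fill + steady-state iterations + overheads."""
    return S + II * (N - 1) + O + W

# Slide example: S = 5 stages, II = 2, N = 32 iterations, 4 overhead cycles.
# Assumption: O = 4 and W = 0 (the slide does not give the split).
example_cycles = execution_cycles(S=5, II=2, N=32, O=4, W=0)
```

For large N the II × (N - 1) term dominates, which is why the scheduling slides concentrate on lowering II.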

17
Reconfigurable Datapath Design
  • Embed individual datapaths into a single
    datapath.
  • Datapath graph Gi
      • Vertices are hardware resources (memories,
        registers, function units)
      • Edges are connections between them
  • Construct a single graph G such that each Gi ⊆
    G and G has the fewest edges and vertices
  • Bipartite matching based algorithm [Huang 2001]
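
To make the embedding idea concrete, here is a deliberately simplified Python sketch. It is not the bipartite matching algorithm cited on the slide: it uses a greedy first-fit rule that reuses any merged vertex of the same type, which guarantees that every Gi maps into G but does not minimize edges or vertices. The graph encoding and the name `merge_datapaths` are assumptions.

```python
def merge_datapaths(graphs):
    """Greedily embed each datapath graph into one merged graph.

    graphs: list of (vertices, edges), where vertices maps a vertex id to its
    resource type and edges is a set of (src_id, dst_id) pairs.
    """
    merged_types = []      # merged vertex index -> resource type
    merged_edges = set()   # (src, dst) pairs over merged vertex indices
    mappings = []          # per input graph: vertex id -> merged index
    for vertices, edges in graphs:
        used, mapping = set(), {}
        for v, res_type in vertices.items():
            # First fit: reuse a same-type merged vertex not yet taken by this graph.
            slot = next((m for m, t in enumerate(merged_types)
                         if t == res_type and m not in used), None)
            if slot is None:
                slot = len(merged_types)
                merged_types.append(res_type)
            used.add(slot)
            mapping[v] = slot
        merged_edges |= {(mapping[u], mapping[v]) for u, v in edges}
        mappings.append(mapping)
    return merged_types, merged_edges, mappings

g1 = ({"a": "ld", "b": "add", "c": "sub"}, {("a", "b"), ("b", "c")})
g2 = ({"x": "ld", "y": "add", "z": "add"}, {("x", "y"), ("y", "z")})
merged_types, merged_edges, mappings = merge_datapaths([g1, g2])
```

In this toy case the two datapaths share the load, one adder and the load-to-adder wire, so the merged graph needs 4 vertices and 3 edges instead of 6 and 4. A matching-based formulation would pick the vertex pairing that maximizes such sharing rather than taking the first fit.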

18
Reconfigurable Datapath
  • Merged graph G to reconfigurable datapath
      • Vertices to function units
      • Edges to reconfigurable interconnects
  • By selecting a subset of the interconnections, any
    selected datapath can be generated and executed
    on the reconfigurable datapath
  • Appropriate interconnects in the merged datapath are
    enabled using configuration bits
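
The role of the configuration bits can be sketched as a bit mask over the merged interconnects. The encoding below (one enable bit per merged edge) is an assumption chosen for clarity; the slides describe routing boxes and do not give the real bit layout. `config_word` and the edge lists are hypothetical.

```python
def config_word(merged_edges, active_edges):
    """Build a context word with one enable bit per merged interconnect."""
    word = 0
    for bit, edge in enumerate(merged_edges):
        if edge in active_edges:
            word |= 1 << bit   # bit i enables interconnect i
    return word

# Merged interconnects, in a fixed order that defines the bit positions.
merged = [(0, 1), (1, 2), (1, 3)]
context_a = config_word(merged, {(0, 1), (1, 2)})   # first kernel loop
context_b = config_word(merged, {(0, 1), (1, 3)})   # second kernel loop
```

Switching between kernel loops then amounts to loading a different context word, while the function units themselves stay fixed.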

19
Routing
  • Useful interconnections are selected
  • Routing box to select between multiple
    connections
  • Configuration contexts
      • Configuration bits for routing box
      • Control bits for some FU
      • Static registers initialization

[Figure: interconnection routing between function units, routing boxes and registers]

20
Reconfiguration Overhead
  • Store configuration contexts of a limited number of
    kernel loops in distributed RAMs
  • Fast context switch for reconfigurable fabrics
      • NEC OmniPath [Furuta00], Chameleon Systems
  • Reconfiguration overhead
      • read live-in register set
      • write live-out register set

[Figure: reconfiguration controller (RC) with a context address selecting among Context 1 to Context 4]

21
Critical Path and Clock Speed
  • Critical path in the reconfigurable datapath
      • Delay of FU
      • Delay of routing box
      • Delay of directly connected wires
  • Critical path in general processor
      • No longer in FU stage
      • Branch control, decoding stage
  • The clock speed of reconfigurable datapath should
    be no less than that for a general processor

22
Benchmark Studies
  • MPEG
      • Overall speedup: 3.57
      • 10 kernel loops: 86% of execution time
      • Max possible speedup: 7.14
  • GSM
      • Overall speedup: 2.78
      • 10 kernel loops: 81% of execution time
      • Max possible speedup: 5.26

[Figure: speed-up chart]
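
The "max possible speedup" figures are consistent with Amdahl's law if 86 and 81 are read as percentages of execution time spent in the kernel loops: 1 / (1 - 0.86) ≈ 7.14 and 1 / (1 - 0.81) ≈ 5.26. The Python sketch below reproduces that arithmetic. Applying Amdahl's law here, and back-solving an implied kernel speedup from the rounded overall numbers, are this sketch's assumptions rather than something the slides state.

```python
def max_speedup(f):
    """Amdahl bound when the accelerated fraction f takes zero time."""
    return 1.0 / (1.0 - f)

def overall_speedup(f, s):
    """Overall speedup when a fraction f of the run time is sped up s times."""
    return 1.0 / ((1.0 - f) + f / s)

def implied_kernel_speedup(f, overall):
    """Kernel speedup s that would yield the given overall speedup."""
    return f / (1.0 / overall - (1.0 - f))

mpeg_bound = max_speedup(0.86)
gsm_bound = max_speedup(0.81)
mpeg_kernel = implied_kernel_speedup(0.86, 3.57)
gsm_kernel = implied_kernel_speedup(0.81, 2.78)
```

Under these assumptions the reported overall speedups would correspond to the kernel loops themselves running roughly 6 times faster for MPEG and a little under 5 times faster for GSM; since the inputs are rounded, these values are only indicative.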
23
Datapath Mapping Results
  • Significant overlap between datapaths is
    obtained.
  • Configuration bits: MPEG < 500 bits, GSM < 1000 bits

24
Speed-up vs. Memory Bandwidth
  • Make multiple copies of the datapath
  • Constraint: number of memory ports

[Figures: execution time and speed-up versus # of memory ports, for the MPEG II coder and the GSM coder]

25
Clustered VLIW machine?
  • Application specific clustered VLIW processor
    with one instruction per kernel loop
  • Reconfiguration contexts as instructions
  • Interconnections as application specific
    bypassing networks

[Figure: FUs and memory ports driven by configuration contexts]

26
Reconfigurable Datapath (RD) vs. VLIW
[Figures: execution time of RD vs. VLIW for MPEG II and for GSM]

27
Applicable Application Domain
  • computation intensive applications
  • localized operational parallelism
  • a few areas account for most of the execution time

28
Conclusion
  • A methodology for the design of a dynamically
    reconfigurable datapath coprocessor
      • Kernel loop IR to datapath hardware
      • Datapath hardware merged into reconfigurable
        hardware
  • MPEG, GSM benchmark case studies
  • Examined reconfigurable datapaths vs. VLIW
    processors
  • Future research
      • Increasing the datapath pipelining throughput
        through FU merging
      • Fully automating the process