Title: The Garp Architecture and C Compiler
1. The Garp Architecture and C Compiler
T.J. Callahan, J.R. Hauser, J. Wawrzynek,
U.C. Berkeley
- Brought to you by
- Liao Jirong
- liaojiro@comp.nus.edu.sg
- http://www.comp.nus.edu.sg/liaojiro
2. Outline
- Background
- The Garp Architecture
- The Compiler for Garp
- Simulation results
- Summary
3. Background
- Emergence of reconfigurable hardware: FPGAs, etc.
- Impressive speedups for various tasks
- DNA sequence matching, encryption, etc.
- Obstacles to be overcome
- configuration time, size, floating-point operations, compatibility of the various implementations on the market
- Past work -- PRISC, NAPA, PRISM, etc.
- limited to specific application domains
- no fully automatic compilation
4. The Big Picture
(Diagram: compilation splits an application into a non-computation part, compiled for the processor (CPU), and a computation kernel, compiled/synthesized for the coprocessor (FPGA, ASIC, etc.), with communication between the two.)
5. Working Flow of Kernel Execution
1. Load a configuration
2. Copy any initial register data to the coprocessor
3. Start execution on the coprocessor
4. Copy the results back to the processor
Steps 1, 2, and 4 are overhead.
6. Motivation
- Integrate reconfigurable hardware more closely with the processor, overcoming:
- long reconfiguration times
- low-bandwidth paths for data transfer
- the need for hardware design expertise
7. Assumptions
- A few cycles of overhead for register data transfer is acceptable
- The coprocessor needs its own direct path to the processor's memory system
- it is impossible for the processor to handle this on the coprocessor's behalf
- The coprocessor needs to be rapidly reconfigurable
8. The Garp Architecture
- Single-issue MIPS processor core with reconfigurable hardware (the coprocessor)
- Coprocessor is on the same die as the processor
- Coprocessor and processor share the same memory
- The reconfigurable hardware architecture and interfaces are fully defined
- Does not exist as real silicon (simulation only)
9. The Blueprint
10. The Garp Architecture (cont.)
- For general-purpose applications
- Fits into an ordinary processing environment
- The main thread of control through a program is managed by the processor:
- 1. a configuration can be loaded only when the coprocessor is idle
- 2. the coprocessor can work independently
- 3. coprocessor execution can be halted or resumed
- 4. the processor cannot load a configuration or access the coprocessor while it is active
11. The Reconfigurable Hardware
- Two-dimensional array of blocks
- the number of rows is implementation-specific
- in an upward-compatible fashion
- Interconnected by programmable wiring
- A fixed global clock
- Sequencer
- Configuration cache
- Memory buses
- Memory queues
13. Blocks
- Configurable Logic Blocks (CLBs)
- 2 bits wide
- 16 CLBs in a row form a 32-bit datapath
- each takes up to four 2-bit inputs
- an expression combining (a << 10) and (b & c) can be implemented in one row
- Control blocks
- one for each row, in the leftmost column
- serve as the row's liaison to the rest of the system
- Boolean values
- for if-conversion, used in hyperblocks
14. Wires
- Vertical wires
- connect blocks in the same column
- Horizontal wires
- connect blocks in the same or adjacent rows
- Built-in carry chains
- support addition, subtraction, and comparison
- Multi-bit shifts across a row make multiplication and division by constants fairly efficient
- The wire network is passive
- a value cannot jump from one wire to another without passing through a logic block
15. Memory Tricks
- Configuration cache
- holds recently displaced configurations
- reloading from the cache requires only 5 cycles
- can hold 4 full-sized configurations
- Wide path between coprocessor and memory
- used for both data transfer and configuration loading
- Memory buses
- four 32-bit data buses and one 32-bit address bus
- the coprocessor is master of the memory buses while active
- can initiate one access every cycle
- Memory queues
16. Comparing Garp with Other Architectures
- VLIW
- Garp resembles a VLIW machine
- Advantages over VLIW
- no per-cycle limits on instruction issue, functional units, or register file bandwidth
- pipelining on Garp is more straightforward than software pipelining on a VLIW: no competition for functional units
- maintains high performance for sequential code on the processor
- Disadvantages versus VLIW
- kernel size limit
- cannot exploit ILP outside of loops
17. Garp vs. Vector
- Garp resembles a memory-to-memory vector processor when synthesizing a vectorizable loop
- Feedback loops can be constructed arbitrarily, while vector units can handle only very specialized recurrences
- Garp easily handles data-dependent loop exits, which are a problem for vector architectures
18. Garp vs. Superscalar
- With only a modest number of instruction issue slots, a superscalar processor cannot compete with the Garp coprocessor in cases with a large amount of ILP.
19. Any Questions About Garp?
For further details: "Garp: A MIPS Processor with a Reconfigurable Coprocessor", J.R. Hauser and J. Wawrzynek, IEEE FCCM 1997.
20. Automatic Compilation
- Standard ANSI C as input
- SUIF C compiler for the front-end phase
- parsing and standard optimizations
- Fully automatic compilation
21. Compilation Flow
(Diagram: the application goes through kernel selection; kernel code is optimized and synthesized into a bit-stream for the coprocessor, while non-kernel code is optimized into an executable file for the processor.)
22. Kernel Selection
- Loops
- Take the whole loop? -- No:
- the loop may be too large
- it may contain infrequently executed code
- -- longer load time
- -- longer interconnects
- it may contain operations that cannot be implemented
- ILP is limited within a single basic block
23. Hyperblock
- Join all the basic blocks of a loop body using predication (Boolean predicate values)
- increases ILP
- Precedence edges
- array subscript analysis
- inter-procedural pointer analysis
- Contains the loop back edges
- avoids switching control back and forth between processor and coprocessor
25. Hyperblock (cont.)
- Reject loops whose speedup doesn't make up for the overhead
- decided by profiling and execution-time estimates
- Exceptional exit cases
- execution continues on the processor
- occur only in a small fraction of iterations
26. Optimization Techniques
- Speculative loads
- crucial for pipelining
- Pipelining
- loop-carried dependencies
- simultaneous memory accesses
- Memory queues
- 3 memory queues
- buffering, reading ahead, and writing behind
- non-cache-allocating
27. Configuration Synthesis
- Module mapping
- mapping groups of nodes in the data-flow graph (DFG) to compound modules in the configuration, minimizing its size and critical path
- Placement
- placing connected modules close to one another
- Generating the bit-stream file
28. Simulation Results
- 32-row array
- Adapted UltraSPARC processor model
- Cycle-accurate simulator
- models cache misses and interlocks
29. Wavelet Image Compression
30. Gzip Compression
- Gzip has irregular memory accesses
- these reduce parallelism and prevent pipelining
- Each loop executes for only a few cycles
- so the overhead cost is more significant
- The overhead negates the benefit
31. Compilation Time and Code Expansion
- Compilation time
- typically much less than double that of compiling for software only
- Code size
- typically increases by 10 to 50 percent
- wavelet benchmark: 16 percent
32. Garp vs. UltraSPARC
- UltraSPARC
- a four-way superscalar, 167 MHz
- Garp
- implemented using the same VLSI process
- 133 MHz
- Wavelet
- Garp is 68% faster than the UltraSPARC
- Gzip
- the UltraSPARC is 14% faster than Garp
33. Garp vs. UltraSPARC (cont.)
- Hand-coded configurations
- show that Garp has great further potential
34. Future Work
- More experiments over a broader range of benchmarks
- Development of new optimizations
- Finding the strengths and weaknesses of the Garp architecture
35. Summary
- The Garp Architecture
- processor + coprocessor
- configuration cache
- memory queues
- high-bandwidth, low-latency data access
- A synthesizing compiler for Garp
36. The End
Thank you! Any feedback will be appreciated.
liaojiro@comp.nus.edu.sg
http://www.comp.nus.edu.sg/liaojiro