Title: The Garp Architecture and C Compiler
1. The Garp Architecture and C Compiler
T.J. Callahan, J.R. Hauser, J. Wawrzynek,
U.C. Berkeley
- Brought to you by
- Liao Jirong
- liaojiro@comp.nus.edu.sg
- http://www.comp.nus.edu.sg/liaojiro
2. Outline
- Background
- The Garp Architecture
- The Compiler for Garp
- Simulation results
- Summary
3. Background
- Emergence of reconfigurable hardware: FPGAs, etc.
- Impressive speedups for various tasks
- DNA sequence matching, encryption, etc.
- Obstacles to be overcome
- configuration time, size, floating-point operations, compatibility of the various implementations on the market
- Past work -- PRISC, NAPA, PRISM, etc.
- limited to specific application domains
- no fully automatic compilation
4. The Big Picture
(Diagram: compilation splits an application into a non-computation part, compiled for the processor (CPU), and a computation kernel, compiled/synthesized for the coprocessor (FPGA, ASIC, etc.), with communication between the two.)
5. Working Flow of Kernel Execution
1. Load a configuration
2. Copy any initial register data to the coprocessor
3. Start execution on the coprocessor
4. Copy the results back to the processor
Steps 1, 2, and 4 are overhead.
6. Motivation
- Integrate reconfigurable hardware more closely with the processor, overcoming:
- long reconfiguration times
- low-bandwidth paths for data transfer
- the need for hardware design expertise
7. Assumptions
- A few cycles of overhead for register data transfer is acceptable
- The coprocessor needs its own direct path to the processor's memory system
- it is impossible for the processor to handle this on the coprocessor's behalf
- The coprocessor needs to be rapidly reconfigurable
8. The Garp Architecture
- Single-issue MIPS processor core with reconfigurable hardware (the coprocessor)
- Coprocessor is on the same die as the processor
- Coprocessor and processor share the same memory
- The reconfigurable hardware architecture and interfaces are fully defined
- Does not exist as real silicon (simulation only)
9. The Blueprint
10. The Garp Architecture (cont.)
- For general-purpose applications
- Fits into an ordinary processing environment
- The main thread of control through a program is managed by the processor:
- 1. a configuration can be loaded only when the coprocessor is idle
- 2. the coprocessor can work independently
- 3. coprocessor execution can be halted or resumed
- 4. the processor cannot load a configuration or access the coprocessor while it is active
11. The Reconfigurable Hardware
- Two-dimensional array of blocks
- the number of rows is implementation-specific
- in an upward-compatible fashion
- Interconnected by programmable wiring
- A fixed global clock
- Sequencer
- Configuration cache
- Memory buses
- Memory queues
13. Blocks
- Configurable Logic Blocks (CLBs)
- 2 bits wide
- 16 CLBs in a row form a 32-bit datapath
- each takes up to four 2-bit inputs
- an expression combining (a << 10) and (b & c) can be implemented in one row
- Control blocks
- one for each row, in the leftmost column
- serve as the row's liaison to the rest of the system
- Boolean values
- for if-conversion, used in hyperblocks
14. Wires
- Vertical wires
- connect blocks in the same column
- Horizontal wires
- connect blocks in the same or adjacent rows
- Built-in carry chains
- support addition, subtraction, and comparison
- Multi-bit shifts across a row make multiplication and division by constants fairly efficient
- The wire network is passive
- a value cannot jump from one wire to another without passing through a logic block
15. Memory Tricks
- Configuration cache
- holds recently displaced configurations
- reloading from the cache requires only 5 cycles
- can hold 4 full-sized configurations
- Wide path between coprocessor and memory
- used for both data transfer and configuration loading
- Memory buses
- four 32-bit data buses and one 32-bit address bus
- the coprocessor is master of the memory buses while active
- can initiate one access every cycle
- Memory queues
16. Comparing Garp with Other Architectures
- VLIW
- Garp resembles a VLIW machine
- Advantages over VLIW
- no per-cycle limits on instruction issue, functional units, or register file bandwidth
- pipelining on Garp is more straightforward than software pipelining on a VLIW: no competition for functional units
- maintains high performance for sequential code on the processor
- Disadvantages versus VLIW
- kernel size limit
- cannot exploit ILP outside of loops
17. Garp vs. Vector
- Garp resembles a memory-to-memory vector processor when synthesizing a vectorizable loop
- Feedback loops can be constructed arbitrarily, while vector units can handle only very specialized recurrences
- Garp easily handles data-dependent loop exits, which are a problem for vector architectures
18. Garp vs. Superscalar
- With only a modest number of instruction issue slots, a superscalar processor cannot compete with the Garp coprocessor in cases with a large amount of ILP.
19. Any Questions About Garp?
For further details: "Garp: A MIPS Processor with a Reconfigurable Coprocessor", J.R. Hauser and J. Wawrzynek, IEEE FCCM 1997.
20. Automatic Compilation
- Standard ANSI C as input
- SUIF C compiler for the front-end phase
- parsing and standard optimizations
- Fully automatic compilation
21. Compilation Flow
(Diagram: the application goes through kernel selection; kernel code is optimized and synthesized into a bit-stream for the coprocessor, while non-kernel code is optimized into an executable file for the processor.)
22. Kernel Selection
- Loops
- Take the whole loop? -- No:
- the loop may be too large
- it may contain infrequently executed code
- -- longer load time
- -- longer interconnects
- it may contain operations that cannot be implemented
- ILP is limited within a single basic block
23. Hyperblock
- Join all the basic blocks of a loop body using predication (Boolean predicate values)
- increases ILP
- Precedence edges
- array subscript analysis
- inter-procedural pointer analysis
- Contains the loop back edges
- avoids switching control back and forth between processor and coprocessor
25. Hyperblock (cont.)
- Reject loops whose speedup doesn't make up for the overhead
- decided by profiling and execution-time estimates
- Exceptional exit cases
- execution continues on the processor
- occur only in a small fraction of iterations
26. Optimization Techniques
- Speculative loads
- crucial for pipelining
- Pipelining
- loop-carried dependencies
- simultaneous memory accesses
- Memory queues
- 3 memory queues
- buffering, reading ahead, and writing behind
- non-cache-allocating
27. Configuration Synthesis
- Module mapping
- mapping groups of nodes in the data-flow graph (DFG) to compound modules in the configuration, minimizing its size and critical path
- Placement
- placing connected modules close to one another
- Generating the bit-stream file
28. Simulation Results
- 32-row array
- Adapted UltraSPARC processor model
- Cycle-accurate simulator
- models cache misses and interlocks
29. Wavelet Image Compression
30. Gzip Compression
- Gzip has irregular memory accesses
- these reduce parallelism and prevent pipelining
- Each loop executes for only a few cycles
- so the overhead cost is more significant
- The overhead negates the benefit
31. Compilation Time and Code Expansion
- Compilation time
- typically much less than double that of compiling for software only
- Code size
- typically increases by 10 to 50 percent
- wavelet benchmark: 16 percent
32. Garp vs. UltraSPARC
- UltraSPARC
- a four-way superscalar, 167 MHz
- Garp
- implemented using the same VLSI process
- 133 MHz
- Wavelet
- Garp is 68% faster than the UltraSPARC
- Gzip
- the UltraSPARC is 14% faster than Garp
33. Garp vs. UltraSPARC (cont.)
- Hand-coded configurations
- show that Garp has great further potential
34. Future Work
- More experiments over a broader range of benchmarks
- Development of new optimizations
- Finding the strengths and weaknesses of the Garp architecture
35. Summary
- The Garp Architecture
- processor + coprocessor
- configuration cache
- memory queues
- high-bandwidth, low-latency data access
- A synthesizing compiler for Garp
36. The End
Thank you! Any feedback will be appreciated.
liaojiro@comp.nus.edu.sg
http://www.comp.nus.edu.sg/liaojiro