The Garp Architecture and C Compiler

1
The Garp Architecture and C Compiler
T.J. Callahan, J.R. Hauser, J. Wawrzynek
U.C. Berkeley
  • Brought to you by
  • Liao Jirong
  • liaojiro_at_comp.nus.edu.sg
  • http://www.comp.nus.edu.sg/liaojiro

2
Outline
  • Background
  • The Garp Architecture
  • The Compiler for the Garp
  • Simulation result
  • Summary

3
Background
  • Emergence of reconfigurable hardware (FPGAs, etc.)
  • Impressive speedups for various tasks
  • DNA sequence matching, encryption, etc.
  • Obstacles to be overcome
  • configuration time, size, floating-point
    operations, compatibility of various
    implementations in the market
  • Past work: PRISC, NAPA, PRISM, etc.
  • limited to specific application domains
  • compilation not fully automatic

4
The Big Picture
[Diagram: the Application is split into a Non-Computation Kernel,
compiled for the Processor (CPU), and a Computation Kernel,
compiled/synthesized for the Coprocessor (FPGA, ASIC, etc.);
the two sides are linked by communication.]
5
Workflow for Executing a Kernel
  • 1. Load a configuration
  • 2. Copy any initial register data to the coprocessor
  • 3. Start execution on the coprocessor
  • 4. Copy the result back to the processor
  • Steps 1, 2 and 4 are overhead.
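
The four steps listed above suggest a simple cost model for deciding whether a kernel is worth running on the coprocessor. The following is a minimal sketch only: the function names and all cycle counts are hypothetical and are not taken from the Garp work.

```python
def offload_cycles(config_load, copy_in, kernel, copy_out):
    """Total cycles for one coprocessor invocation.

    config_load, copy_in and copy_out are the overhead steps;
    kernel is the time spent actually executing on the coprocessor.
    """
    return config_load + copy_in + kernel + copy_out


def worth_offloading(cpu_cycles, config_load, copy_in, kernel, copy_out):
    """True if running on the coprocessor beats running on the CPU."""
    return offload_cycles(config_load, copy_in, kernel, copy_out) < cpu_cycles


# Hypothetical numbers: the same overhead pays off for a long-running
# kernel but not for a short one.
overhead = dict(config_load=50, copy_in=4, copy_out=2)
long_kernel = worth_offloading(cpu_cycles=400, kernel=100, **overhead)
short_kernel = worth_offloading(cpu_cycles=40, kernel=10, **overhead)
```

With these made-up figures the long kernel is worth offloading and the short one is not, because the fixed overhead dominates when the kernel itself runs for only a few cycles.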

6
Motivation
  • Integrate reconfigurable hardware more closely
    with the processor
  • Long reconfiguration times
  • Low-bandwidth paths for data transfer
  • Need for hardware design expertise

7
Assumption
  • A few cycles of overhead for transferring register
    data is acceptable
  • The coprocessor needs its own direct path to the
    processor's memory system
  • impractical for the processor to move this data itself
  • The coprocessor needs to be rapidly reconfigurable.

8
The Garp Architecture
  • Single-issue MIPS processor core with
    reconfigurable hardware (coprocessor)
  • Coprocessor is on the same die as the processor
  • Coprocessor and processor share the same memory
  • The reconfigurable hardware architecture and its
    interfaces have been designed
  • Does not exist as real silicon (simulation only)

9
The Blueprint
10
The Garp Arch. (Cont)
  • For general purpose applications
  • Fit into an ordinary processing environment
  • The main thread of control through a program is
    managed by the processor
  • 1. a configuration can be loaded only when the
    coprocessor is idle
  • 2. the coprocessor can work independently
  • 3. coprocessor execution can be halted or
    resumed
  • 4. the processor cannot load a configuration or access
    the coprocessor while it is active
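
The control rules above can be read as a small state machine. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes that a halted coprocessor counts as idle for the purpose of loading a configuration.

```python
class Coprocessor:
    """Toy model of the processor-visible control rules (hypothetical API)."""

    def __init__(self):
        self.active = False   # True while the array is executing
        self.config = None    # name of the currently loaded configuration

    def load_config(self, name):
        # A configuration can be loaded only when the coprocessor is idle.
        if self.active:
            raise RuntimeError("cannot load a configuration while active")
        self.config = name

    def start(self):
        # Once started, the coprocessor runs independently of the processor.
        if self.config is None:
            raise RuntimeError("no configuration loaded")
        self.active = True

    def halt(self):
        # Execution can be halted; assumed here to make the array idle.
        self.active = False

    def resume(self):
        # Resuming continues the previously loaded configuration.
        self.start()
```

In this model the main thread of control stays with the processor: it loads a configuration, starts the array, and must halt it (or wait for it to finish) before loading another.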

11
The reconfigurable hardware
  • Two-dimensional array of Blocks
  • Number of rows is implementation-specific
  • arrays can grow in an upward-compatible fashion
  • Interconnected by programmable wiring
  • A fixed global clock - sequencer
  • Configuration cache
  • Memory buses
  • Memory queues

12
(No Transcript)
13
Blocks
  • Configurable Logic Block (CLB)
  • 2 bits wide
  • 16 CLBs in a row form a 32-bit data path
  • each takes up to four 2-bit inputs
  • (a << 10) | (b & c) can be implemented in one row
  • Control blocks
  • one for each row, in the leftmost column
  • serve as a liaison
  • Boolean values
  • used for if-conversion in hyperblocks
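
To illustrate how 2-bit logic blocks can make up a 32-bit data path, the sketch below splits each 32-bit operand into sixteen 2-bit slices, applies the same bitwise function in every "block", and reassembles the result. The helper names and the example function are hypothetical; this models only the bit-slicing idea, not the real block configuration.

```python
BLOCKS_PER_ROW = 16   # 16 blocks x 2 bits = one 32-bit data path
SLICE_MASK = 0b11


def to_slices(word):
    """Split a 32-bit word into sixteen 2-bit slices, least significant first."""
    return [(word >> (2 * i)) & SLICE_MASK for i in range(BLOCKS_PER_ROW)]


def from_slices(slices):
    """Reassemble sixteen 2-bit slices into one 32-bit word."""
    word = 0
    for i, s in enumerate(slices):
        word |= (s & SLICE_MASK) << (2 * i)
    return word


def row_eval(block_fn, *operands):
    """Evaluate a bitwise function of up to four operands, one block per slice."""
    if len(operands) > 4:
        raise ValueError("a block takes at most four 2-bit inputs")
    columns = zip(*(to_slices(w) for w in operands))
    return from_slices([block_fn(*col) for col in columns])


def example_fn(a, b, c):
    # Arbitrary bitwise function chosen for illustration.
    return (a & b) ^ c


a, b, c = 0xDEADBEEF, 0x12345678, 0x0F0F0F0F
row_result = row_eval(example_fn, a, b, c)
```

Because the function is purely bitwise, evaluating it slice by slice gives the same result as evaluating it on the full 32-bit words.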

14
Wires
  • Vertical wire
  • connects blocks in the same column
  • Horizontal wire
  • connects blocks in the same or adjacent
    rows
  • Built-in carry chain
  • supports addition, subtraction and
    comparison
  • Multi-bit shifts across a row make multiplication
    and division by a constant fairly efficient
  • The wire network is passive
  • a value cannot jump from one wire to another
    without passing through a logic block

15
Memory tricks
  • Configuration cache
  • holds recently displaced configurations
  • reloading from the cache requires only 5 cycles
  • can hold 4 full-sized configurations
  • Wide path between coprocessor and memory
  • used for data transfer and configuration loads
  • Memory buses
  • four 32-bit data buses and one 32-bit address bus
  • the coprocessor is master of the memory buses when
    active
  • can initiate one access every cycle
  • Memory Queues
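
The effect of the configuration cache can be sketched with a small model. The slide gives the capacity (4 full-sized configurations) and the reload cost from the cache (5 cycles); the least-recently-used replacement policy and the miss cost below are assumptions made purely for illustration, and the model simply caches every configuration it loads.

```python
from collections import OrderedDict

CACHE_SLOTS = 4      # from the slide: holds 4 full-sized configurations
HIT_CYCLES = 5       # from the slide: reload from cache takes 5 cycles
MISS_CYCLES = 500    # hypothetical cost of loading from main memory


class ConfigCache:
    """Toy configuration cache with an assumed LRU replacement policy."""

    def __init__(self):
        self.slots = OrderedDict()  # oldest entry first

    def load(self, name):
        """Return the cycle cost of making configuration `name` active."""
        if name in self.slots:
            self.slots.move_to_end(name)    # mark as most recently used
            return HIT_CYCLES
        if len(self.slots) == CACHE_SLOTS:
            self.slots.popitem(last=False)  # evict least recently used
        self.slots[name] = True
        return MISS_CYCLES


cache = ConfigCache()
costs = [cache.load(k) for k in ["A", "B", "A", "B", "A"]]
```

A program that alternates between a few kernels pays the full load cost once per kernel and the short cache reload afterwards, which is the case the cache is meant to help.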

16
Comparing Garp with Other Architectures
  • VLIW
  • Garp resembles VLIW
  • Advantages over VLIW
  • no VLIW-style per-cycle limits on
    instruction issue, functional units, or register
    file bandwidth
  • pipelining on Garp is more straightforward than
    software pipelining on VLIW: there is no competition
    for functional units
  • maintains high performance for sequential code
    on the processor
  • Disadvantages relative to VLIW
  • kernel size limit
  • cannot exploit ILP outside of loops

17
Garp vs. Vector
  • Garp resembles a memory-to-memory vector processor
    when synthesizing a vectorizable loop.
  • Feedback loops can be constructed arbitrarily,
    while vector units can handle only very
    specialized recurrences
  • Garp can easily handle data-dependent loop exits,
    which are a problem for vector architectures.

18
Garp vs. Superscalar
  • Because of its modest number of instruction
    issue slots, a superscalar processor cannot
    compete with the Garp coprocessor in cases with a
    large amount of ILP.

19
Any questions about Garp?
For further details: "Garp: A MIPS Processor with a
Reconfigurable Coprocessor," J.R. Hauser and J. Wawrzynek,
IEEE FCCM 1997.
20
Automatic Compilation
  • Standard ANSI C as input
  • SUIF C compiler for the front-end phase
  • parsing and standard optimizations
  • Fully automatic compilation

21
Compilation Flow
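[Diagram: the Application passes through Kernel selection, which
separates kernel from non-kernel code. Kernels go through
Optimization and Synthesis to a Bit-stream for the coprocessor;
non-kernel code goes through Optimization to an Executable file
for the processor.]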
22
Kernel selection
  • Candidates: loops
  • The whole loop? No:
  • the loop may be too large
  • it may contain infrequently executed code
  • longer load time
  • longer interconnects
  • some operations cannot be implemented on the array
  • ILP is limited within a single basic block

23
Hyperblock
  • Joins the basic blocks of a loop body by using
    predicate (Boolean) values
  • Increases ILP
  • Precedence edges
  • array subscript analysis
  • inter-procedural pointer analysis
  • Contains the loop back edges
  • avoids switching control back and forth
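
As a small illustration of if-conversion, the transformation that lets several basic blocks be merged into one hyperblock, the sketch below rewrites a branch so that both sides are computed and a Boolean predicate selects the result. The function names are hypothetical and the example is not taken from the Garp compiler.

```python
def branchy(x):
    """Original control flow: two basic blocks joined by a branch."""
    if x < 0:
        y = -x
    else:
        y = x * 2
    return y + 1


def select(pred, if_true, if_false):
    """Stand-in for a hardware multiplexer driven by a predicate."""
    return if_true if pred else if_false


def if_converted(x):
    """Same computation as one straight-line block with a predicate."""
    p = x < 0            # predicate (Boolean value)
    t = -x               # "then" side, computed unconditionally
    f = x * 2            # "else" side, computed unconditionally
    y = select(p, t, f)  # predicate picks the live result
    return y + 1


results_match = all(branchy(x) == if_converted(x) for x in range(-5, 6))
```

Since both sides are evaluated without a branch, they can proceed in parallel, which is where the increase in ILP comes from; this is only safe when the speculated operations have no side effects.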

24
(No Transcript)
25
Hyperblock (Cont)
  • Reject loops whose speedup does not make up for the
    overhead
  • determined by profiling and execution time estimates
  • Exceptional exit cases
  • execution continues on the processor
  • occur only a small fraction of the time

26
Optimization Techniques
  • Speculative loads
  • crucial for pipelining
  • Pipelining
  • loop-carried dependencies
  • simultaneous memory access
  • Memory queues
  • 3 memory queues
  • buffering and reading ahead, writing behind
  • non-cache-allocating

27
Configuration Synthesis
  • Module mapping
  • map groups of nodes in the data-flow graph (DFG) to
    compound modules in the configuration, minimizing
    its size and critical path
  • Placement
  • place connected modules close to one another
  • Generating the bit-stream file

28
Simulation Results
  • 32-row array
  • Adapted UltraSPARC processor
  • Cycle-accurate simulator
  • Models cache misses and interlocks.

29
Wavelet image compression
30
Gzip compression
  • Gzip has irregular memory accesses
  • these reduce parallelism and prevent pipelining
  • Each loop executes for only a few cycles
  • overhead cost becomes more significant
  • The overhead negates the benefit

31
Compilation Time and Code Expansion
  • Compilation time
  • typically much less than double that of
    compiling for software only
  • Code size
  • typically increases by 10 to 50 percent
  • wavelet benchmark: 16 percent

32
Garp vs. UltraSPARC
  • UltraSPARC
  • a four-way superscalar, 167 MHz
  • Garp
  • implemented using the same VLSI process
  • 133 MHz
  • Wavelet
  • Garp is 68 percent faster than UltraSPARC
  • Gzip
  • UltraSPARC is 14 percent faster than Garp

33
Garp vs. UltraSPARC (Cont)
  • Hand-coded functions
  • Garp has great potential

34
Future
  • More experiments over a broader range of
    benchmarks
  • Development of new optimizations
  • Find out the strengths and weaknesses of the Garp
    architecture

35
Summary
  • The Garp Architecture
  • processor plus coprocessor
  • configuration cache
  • memory queues
  • high-bandwidth, low-latency data access
  • Synthesis Compiler for Garp

36
The End
Thank you! Any feedback will be appreciated.
liaojiro_at_comp.nus.edu.sg
http://www.comp.nus.edu.sg/liaojiro