Title: An Introduction to High Performance Reconfigurable Computing
1. An Introduction to High Performance Reconfigurable Computing
Grid Computing Workshop, Department of Physics, University of Cape Town, 13 September 2006
- Peter McMahon (peter_at_dotnet.za.net)
- Department of Electrical Engineering
- University of Cape Town
Disclaimer: References are missing. Only the results presented as mine are my own.
2. Agenda
- High performance computing and the motivation for alternative architectures
- Speeding up computing with Field Programmable Gate Arrays
- The current state of reconfigurable computing
- Recent performance results
- The future of reconfigurable computing
3. Motivation for alternative architectures (1)
- Present systems use conventional architectures
- CPU clock speeds have become a significant barrier: no longer doubling every 18 months
- Power consumption has become a major issue. Many sizeable centres consume >5 MW, with next generation centres planning for 30 MW.
- Koeberg produces 1800 MW!
4. (No transcript)
5. Motivation for alternative architectures (2)
- The present solution from vendors seems to be to tack together more processors (multi-core and bigger clusters)
- More cores and/or chips lead to greater power consumption and cooling issues
- Even if CPUs could be made to run faster, they would then run hotter
- Maybe it's time to look at other ways of doing high performance computing?
6. What are we trying to do?
- We're not looking at personal or embedded computing: some of the same issues, but not our fight!
- Increase the performance capability of high performance computing systems, scaling to petaflops and beyond
- The majority of scientific codes running at HPC centres are floating-point intensive
- Hence what we specifically want to do is increase the performance of floating-point intensive software (roughly, FLOPS)
7. What options are there? (1)
- Multi-core and/or multi-CPU systems
- Parallelism via VLIW
- Sea of ALUs/processors: IBM Cell processor
- High performance floating-point coprocessor (ClearSpeed)
- Reconfigurable computing
8. What options are there? (2)
- Sony/Toshiba/IBM Cell processor
- Delivers 30x the performance of a single PPC for some applications; 100x in exceptional cases.
9. What options are there? (3)
- ClearSpeed floating-point accelerator
- 0.17 GFLOPS/Watt in an HP cluster
- Up from 0.07 GFLOPS/Watt without ClearSpeed
- Performance increase of 2.7x by using two ClearSpeed accelerators per server
10. Introduction to FPGAs (1)
- Field Programmable Gate Arrays
- Back to basics: all programs are essentially a series of logic operations on bits
- The key idea is that FPGAs can be custom-designed like ICs (ASICs), but are also software-reprogrammable
11. Introduction to FPGAs (2)
- You can in some sense think of an FPGA as a grid of wires connecting together logic gates. The joints between the wires are defined when you configure the device.
- These wires have fuses between them, and the fuses can be blown or connected in software.
- At least, that was the original idea (Programmable Array Logic); now they are far more sophisticated.
12. (No transcript)
13. Introduction to FPGAs (3)
- Instead of just AND/OR gates, FPGAs now use lookup-table and flip-flop blocks, and include onboard memory (block RAM), hardware integer multipliers, fast I/O interconnects, etc. (the lookup-table idea is sketched below)
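To make the lookup-table idea concrete, here is a minimal C sketch (not from the talk; the function and constants are illustrative): a 4-input LUT is just a 16-entry truth table, so any boolean function of 4 inputs reduces to one table lookup.

#include <stdint.h>
#include <stdio.h>

/* Any boolean function of 4 inputs fits in a 16-bit truth table. Here we
   tabulate f(a,b,c,d) = (a AND b) XOR (c OR d), purely as an example. */
uint16_t build_lut(void)
{
    uint16_t lut = 0;
    for (int idx = 0; idx < 16; idx++) {
        int a = idx & 1, b = (idx >> 1) & 1, c = (idx >> 2) & 1, d = (idx >> 3) & 1;
        lut |= (uint16_t)(((a & b) ^ (c | d)) << idx);
    }
    return lut;
}

int lut_eval(uint16_t lut, int a, int b, int c, int d)
{
    int idx = a | (b << 1) | (c << 2) | (d << 3);
    return (lut >> idx) & 1;   /* one lookup replaces the whole gate network */
}

int main(void)
{
    uint16_t lut = build_lut();
    printf("f(1,1,0,0) = %d\n", lut_eval(lut, 1, 1, 0, 0));  /* (1&1)^(0|0) = 1 */
    return 0;
}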
14. What is a reconfigurable computer?
- The idea is hardware that can modify itself to suit the executing program
- "Reconfigurable computing" is sometimes used to refer to FPGAs alone.
- We use the term to refer to hybrid computers that include both conventional microprocessors and FPGA reconfigurable logic.
15. A generic reconfigurable computer architecture
16. Performance advantages of reconfigurable computers
- Simple idea: use the CPU when it is faster, and the FPGA when it is faster
- FPGAs have been used to do application-type computation before, but:
- Typically programming has been done in VHDL/Verilog
- All-or-nothing: the whole machine is built out of custom hardware
17. Performance in the real world
- The first commercial reconfigurable computers have yielded promising results.
18. Performance in the real world
- Why does the speedup vary with input size?
19. Performance in the real world
- FPGAs are faster for certain applications
- FPGAs can execute in parallel
- Programs which do not depend on previously calculated values can be executed in a single clock cycle
- Programs where the number of iterations is not known a priori generally perform better on general purpose computers: the hardware to implement such routines requires complex control structures (see the sketch after this list)
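A toy illustration of the two loop shapes, in plain C (a hypothetical example, not from the talk): the first loop has independent iterations that an FPGA can pipeline and replicate, while the second has a loop-carried dependence and a data-dependent trip count.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Independent iterations: each out[i] depends only on in[i], so an
       FPGA can evaluate many elements per clock in a replicated pipeline. */
    double in[8] = {1, 2, 3, 4, 5, 6, 7, 8}, out[8];
    for (int i = 0; i < 8; i++)
        out[i] = in[i] * in[i];

    /* Loop-carried dependence with an unknown trip count: Newton's method
       for sqrt(2). Each step needs the previous x, and the exit test needs
       control logic the CPU already has, so this shape favours the CPU. */
    double x = 1.0;
    while (fabs(x * x - 2.0) > 1e-12)
        x = x - (x * x - 2.0) / (2.0 * x);

    printf("out[3] = %.0f, sqrt(2) ~ %.12f\n", out[3], x);
    return 0;
}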
20. Programming models
- FPGA-based general-purpose computing devices have been possible for many years, but have not taken off. Why?
- A simple matter of programming.
- 10 years ago the problem was programming. It is still programming.
21. Programming models
- VHDL is a nice abstraction, but it is still hardware design.
- Software developers cannot be expected to operate at the digital logic level.
- There are, conservatively, 10x more software developers than there are digital electronics designers.
- Hand-coding may yield 2x better performance than an automated tool, but productivity must be factored in: 100 apps running 50x faster is better than 2 apps running 100x faster.
22. Programming models
- Ultimate objective: create a programming model that abstracts the hardware from the programmer, including the decision of whether to run code on the microprocessor or the reconfigurable logic components
- Intermediate objective: create a programming model that abstracts the complexities of FPGA design from the programmer, and allows the programmer to develop applications in a high-level language
23. FPGA programming models
- Lots of wire
- Vast quantities of Mountain Dew
- Healthy disregard for personal hygiene
- Very little wire
- Vast quantities of Mountain Dew
- Healthy disregard for personal hygiene
24. Programming models
- In the simplest case, the CPU may be coded with a high-level language (e.g. C), and the FPGA with an HDL (e.g. VHDL or Verilog).
- This is not ideal: the programmer shouldn't have to worry about logic design.
- Solution: use a special C-to-HDL compiler for FPGA routines, allowing the programmer to write the entire program in a high-level language
25. Commercial implementations
[Chart: commercial implementations plotted on axes of efficiency (low to high) and ease of use (easy to difficult); VHDL and Verilog appear among the plotted points.]
26. SRC programming
- SRC MapStation. Two languages: C, and MAP C for the MAP component (FPGA)
27. SRC MapStation
- Xeon processors, common memory, and MAP FPGA boards
28. SRC programming

// CPU side (main program):
#include <libmap.h>
void sub_routine(int*, int*);
void main()
{
    int *A = (int*)malloc(10*sizeof(int));
    int *B = (int*)malloc(10*sizeof(int));
    // Put data to process into A
    map_allocate(1);
    sub_routine(A, B);
    map_free(1);
    // Do something with data in B
    free(A); free(B);
}

// MAP (FPGA) side:
#include <libmap.h>
void sub_routine(int A[], int B[])
{
    OBM_BANK_A(AL, int, 10)
    OBM_BANK_B(BL, int, 10)
    // DMA the input from common memory into on-board memory bank AL
    DMA_CPU(CM2OBM, AL, , A, 1, 10*sizeof(int), );
    wait_DMA(0);
    // Do some processing with AL
    // DMA the results from bank BL back out to B in common memory
    DMA_CPU(OBM2CM, BL, , B, 1, 10*sizeof(int), );
    wait_DMA(0);
}
29. Programming models
- SRC's model hides the logic hardware design element from the programmer, but s/he still has to be familiar with the reconfigurable computer's architecture.
- Estimating which subroutines will be best to run on the FPGA is not necessarily trivial: need to perform code profiling.
- Short bursts of FPGA use are counter-productive: call overhead.
- Not easy to switch between microprocessor and FPGA target code.
30. Performance Results
- SRC has reported 250x speedups for some signal processing applications
- Non-floating-point applications seem to be getting speedups of 20x
- Floating-point performance is currently fairly poor: not familiar with any floating-point code that has a speedup of >10x
31. What is limiting performance?
- 10x is nice, but why should we care?
- This is possibly just the tip of the iceberg.
- Current clock rates are 100 MHz, so there's a possibility of scaling them up
- Most importantly, FPGAs don't yet ship with floating-point multipliers, and have only limited integer multipliers.
- Slice count limits parallelism (max 10 parallel engines): there's a possibility of scaling this up too
32. Where does performance come from? (1)
- Intensity! (100 MHz vs 4 GHz)
- CPUs can do 0.5 integer operations per clock cycle.
- This is the best case, when there are no caching issues, the pipeline is working perfectly, etc. The worst case is much worse!
- The FPGA only implements the functionality that is needed, resulting in much less complicated logic (no need for a Control Unit, ALU, etc.)
- Pipelined code can reduce to just a few clock cycles per loop iteration!
33. Where does performance come from? (2)
- Lower latency
- Ties into intensity (since lower latency increases intensity)
- Can have 1-cycle access to static variables
- Can have 5-cycle access to BRAM
- Spatial parallelism on the chip
- We can make one small pipeline for doing some computation, then replicate it (see the sketch after this list)
- Since it is small, we can do much better than processor-core or cluster parallelism!
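A rough sketch of the replication idea in plain C (illustrative only; NENGINES is a hypothetical parameter, and on the FPGA the engines would be physical copies of the pipeline running concurrently):

#include <stdio.h>

#define NENGINES 4   /* hypothetical number of replicated engines */
#define N 16

int main(void)
{
    int in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i;

    /* On the FPGA the engines are physical copies of one small pipeline
       running at the same time; here the strided split just mimics how
       the work would be divided among them. */
    for (int e = 0; e < NENGINES; e++)
        for (int i = e; i < N; i += NENGINES)
            out[i] = in[i] * in[i] + 1;

    printf("out[5] = %d\n", out[5]);   /* 5*5 + 1 = 26 */
    return 0;
}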
34. Parallel random number generation
- To do Monte Carlo, we need a decent random number generator.
- To do Monte Carlo in parallel, we need to generate uncorrelated random number streams in parallel.
- SRC does not provide an RNG library.
- Implement the parallel LCG from SPRNG (sketch below):
- X_i = (a * X_{i-1} + b) mod M
- Parameterize b (by stream)
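A minimal C sketch of the per-stream LCG, assuming M = 2^64 (implicit in unsigned 64-bit overflow); the multiplier and the per-stream constants are illustrative, not SPRNG's actual parameters.

#include <stdint.h>
#include <stdio.h>

/* Each stream shares the multiplier a and modulus M = 2^64, but gets its
   own additive constant b, so the streams are uncorrelated. */
typedef struct { uint64_t x, b; } lcg_t;

static const uint64_t A = 2862933555777941757ULL;  /* example multiplier */

uint64_t lcg_next(lcg_t *s)
{
    s->x = A * s->x + s->b;    /* X_i = (a * X_{i-1} + b) mod 2^64 */
    return s->x;
}

int main(void)
{
    lcg_t s0 = { 1, 3037000493ULL };   /* stream 0: seed 1, odd b */
    lcg_t s1 = { 1, 4294967291ULL };   /* stream 1: same seed, different b */
    for (int i = 0; i < 4; i++)
        printf("%llu  %llu\n", (unsigned long long)lcg_next(&s0),
                               (unsigned long long)lcg_next(&s1));
    return 0;
}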
35. Monte Carlo (1)
36. Monte Carlo (2)
- Asian option pricing
- Uses the parallel random number generator
- Significantly more calculation required than for the pi estimator (sketched below)
- Floating-point code
- 1x performance
- Could only fit 2 parallel computations on the FPGA!
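For contrast, the simpler pi estimator mentioned above can be sketched as a standard quarter-circle Monte Carlo in plain C (an assumption about the estimator's form, not the MAP code used; rand() stands in for the parallel LCG for brevity).

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long hits = 0, trials = 10000000;
    srand(42);
    for (long i = 0; i < trials; i++) {
        double x = (double)rand() / RAND_MAX;   /* point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter circle */
            hits++;
    }
    printf("pi ~ %f\n", 4.0 * hits / trials);   /* area ratio is pi/4 */
    return 0;
}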
37. Conway's Game of Life (1)
- A cellular automaton with simple rules (one update step is sketched below)
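One update step of the rules, as a minimal C sketch (a plain software version, not the MAP implementation).

#include <stdio.h>

/* One update of Conway's rules on a toroidal grid: a live cell survives
   with 2 or 3 live neighbours; a dead cell becomes live with exactly 3. */
void life_step(const unsigned char *cur, unsigned char *next, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if (dx || dy)
                        n += cur[((y + dy + h) % h) * w + ((x + dx + w) % w)];
            next[y * w + x] = (n == 3) || (n == 2 && cur[y * w + x]);
        }
}

int main(void)
{
    enum { W = 8, H = 8 };
    unsigned char a[W * H] = {0}, b[W * H];
    /* a glider */
    a[1 * W + 2] = a[2 * W + 3] = a[3 * W + 1] = a[3 * W + 2] = a[3 * W + 3] = 1;
    life_step(a, b, W, H);
    int live = 0;
    for (int i = 0; i < W * H; i++) live += b[i];
    printf("live cells after one step: %d\n", live);  /* a glider keeps 5 */
    return 0;
}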
38. Conway's Game of Life (2)
- 128x128 cell grid, 100,000 iterations
- CPU: 2 min 39 sec
- MAP: 58 sec
- Speedup: 2.7x
- Used 4 parallel engines, consuming 30% of slices.
- Compiler issues (e.g. a limit on lexical writes) made it more trouble than it was worth to extend to more engines, and I can see how it scales already.
39. Lattice Gas (1)
- A cellular automaton algorithm, but more complicated than Conway's Game of Life.
40. Lattice Gas (2)
- On a large lattice, can get reasonable results
41. Lattice Gas (3)
- 480x480 point lattice, 10,000 iterations
- CPU: 1 min 39 sec
- MAP: 1 min 6 sec
- Speedup: 1.5x
- Used 4 parallel engines, consuming 50% of slices and 60% of BRAM.
42. Edge Detection (1)
- Find the edges in an image
43. Edge Detection (2)
- The Sobel edge detection algorithm just involves 2D convolution, which is actually implemented in a very similar way to CA (a sketch follows below)
- But CA uses 1,000s of iterations, and ED uses only one, so I/O time dominates.
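A minimal C sketch of the Sobel convolution (a plain software version with assumed 8-bit grayscale input, not the MAP kernel from the talk).

#include <stdio.h>
#include <stdlib.h>

/* 3x3 Sobel kernels; the output is the clamped |Gx| + |Gy| magnitude. */
void sobel(const unsigned char *in, unsigned char *out, int w, int h)
{
    static const int gx[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};
    static const int gy[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            int sx = 0, sy = 0;
            for (int j = -1; j <= 1; j++)
                for (int i = -1; i <= 1; i++) {
                    int p = in[(y + j) * w + (x + i)];
                    sx += gx[j + 1][i + 1] * p;
                    sy += gy[j + 1][i + 1] * p;
                }
            int mag = abs(sx) + abs(sy);           /* cheap gradient magnitude */
            out[y * w + x] = mag > 255 ? 255 : (unsigned char)mag;
        }
}

int main(void)
{
    enum { W = 8, H = 8 };
    unsigned char img[W * H], edges[W * H] = {0};
    for (int y = 0; y < H; y++)                    /* vertical step edge */
        for (int x = 0; x < W; x++)
            img[y * W + x] = x < 4 ? 0 : 255;
    sobel(img, edges, W, H);
    printf("response at the step: %d\n", edges[4 * W + 4]);  /* strong edge */
    return 0;
}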
44. Edge Detection (3)
- 700x2100 image
- CPU: 0.45 sec
- MAP: 0.95 sec
- Slowdown: 2.1x
- Measuring just compute time, we see a 20x speedup.
- Used 3 parallel engines, consuming 23% of slices.
45. Molecular model docking
- Automatically dock molecular models into electron density maps.
- Spent a couple of weeks converting existing code to a form suitable for MAP (i.e. C++ to C, static arrays, etc.)
- No performance results yet.
46. Reconfigurable Computing at UCT
- Some experience with FPGAs in Electrical Engineering, used for specific applications.
- Plan to start a reconfigurable computing lab at the end of this year.
- Will focus on applications for KAT. The main two are:
- RFI excision
- Pulsar searching
- May look at other scientific computing software applications as well
47. The future: hardware
- At the very least, reconfigurable computers will likely remain the best choice for non-floating-point code implementing simple algorithms and big loops.
- Hardware is still expensive, but any application with a 50x speedup or better starts to have a very compelling price/performance case.
- Higher clock speeds + more hardware multipliers + floating-point hardware + more slices.
- At least there is room to grow!
48. The future: software
- Not yet mature enough: MAP C is better than Verilog, but not close enough to C.
- Library support is lacking.
- Adoption will be limited until programming becomes as easy as it is for clusters.
- Currently, getting performance gains involves a lot of thinking about the hardware; abstracting this away without ruining performance is going to be difficult.
- But in the last 5 years big gains have been made: hope for the future.