Title: Implementing Complex Algorithms in FPGAs
1Implementing Complex Algorithms in FPGAs
Workshop Dr Steve Chappell
Director Apps Engineering
2Workshop Materials
- For the Labs
- Course Workbook, Tutorials and Application Notes
- DK integrated help system
- On your Workstations
- DK, PDK
- Target Platforms
- RC100, RC1000
3Contents
- Introductions
- About Celoxica
- The Basics
- Opportunities with a HW Coprocessor
- Target Boards
- Design Flows DK and Handel-C in brief
- Handel-C Language
- Tool Connectivity
- Platform Developers Kit
- Platform Abstraction
- Codesign
- Appendices
- Technology, Applications, CUP
Lab1
Lab2
Labs3,4
4About Celoxica
- System EDA company
- Design Tools, FPGA Boards, Consultancy and Services
- Incorporated on the 25th September 2000 (formerly ESL)
- Market leader in complete solutions for software-compiled system design
- Core technology is DK, incorporating the Handel-C programming language
- Senior management with a wealth of EDA and electronics industry experience
- Industry-leading partners
- Strong links with Research and Development
- Technology and expertise based upon decades of research into the state of the art at The University of Oxford
- Chief Science Officer Ian Page, visiting Professor at the Imperial College of Science, Technology and Medicine, London
- Established and active University Program (700 institutions world-wide)
- Investors
- Premier league investors including Intel
5Supporting Argonne
- Augmented Cluster supplied by Linux Networks
- Incorporating Tarari CPP cards and software drivers
- Celoxica Development Kit for FPGA content
- Ensuring successful deployment and evaluation
- Cluster support by Linux Networks
- Augmented Application and CPP card support by Celoxica
6The Basics
- Opportunities and Challenges
- Essence of an FPGA
- Design Flows
7Opportunities with a HW Co-processor
- Algorithm Acceleration
- Exploit the parallelism in algorithms to increase performance with implementation in custom (parallel) hardware
- Algorithm Offload
- Exploit the coprocessor to free CPU resource
- e.g., in an SSL proxy, the CPU can always handle more TCP traffic if algorithms such as RSA and 3DES are moved to a coprocessor
- For PCI-based coprocessor cards, candidate algorithms include ones where CPU execution time far exceeds data transfer time over PCI
- Full analysis needs to consider
- Time required to perform the algorithm in the co-processor
- System application performance improvement (Amdahl's Law)
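The two checks named above (transfer time versus execution time, and Amdahl's Law for the whole application) can be sketched as a small software model. This is a minimal sketch in Python, not part of the workshop material: the function names and the example numbers are hypothetical, and the transfer model assumes a simple sustained bus bandwidth with no setup latency.

```python
def amdahl_speedup(offload_fraction, kernel_speedup):
    """Overall application speedup when a fraction of the run time is accelerated."""
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    # Amdahl's Law: the part that is not offloaded runs at its original speed.
    return 1.0 / ((1.0 - offload_fraction) + offload_fraction / kernel_speedup)


def offload_speedup(cpu_time_s, coproc_time_s, bytes_moved, bus_bytes_per_s):
    """Speedup of one offloaded kernel once bus transfer time is included.

    Assumption: transfer time is simply bytes moved divided by sustained bandwidth.
    """
    transfer_time_s = bytes_moved / bus_bytes_per_s
    return cpu_time_s / (coproc_time_s + transfer_time_s)
```

For instance, a kernel that accounts for 80% of run time and is accelerated 20x gives `amdahl_speedup(0.8, 20)`, roughly 4.2x overall, which is why the non-accelerated remainder and the transfer time both need to be part of the analysis.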
8Opportunities with FPGAs
- FPGA architecture
- What it means for applications
- Soft Hardware
- Reconfigurability/Programmability
- Integer processors (FP is resource expensive)
- Wide data paths
- Parallel Computation
- Challenges to deployment in enterprise computing
- Development complexity
- IP deployment and integration
- Design Framework and methods
- Data Bandwidth to/from coprocessor
- Choosing the right applications
9Essence of an FPGA
- SRAM Field Programmable Gate Array
- CLBs, IOBs, Interconnect Matrix
10Target Boards
- RC100, RC1000, RC2000, Tarari CPP
11RC-100
- Xilinx Spartan2-200 FPGA
- 2MB ZBT SRAM, in 2 36-bit banks; 8MB Flash RAM
- 50 pin expansion header, PS/2 mouse/keyboard, parallel port
- Video input decoder, VGA output DAC
- Two 7-segment LED displays
- 80MHz maximum clock
12RC-1000
- PCI card, DMA transfers > 110 MB/sec sustained
- Xilinx Virtex-2000 FPGA
- 8MB SRAM, in 4 32-bit banks
- 2 PMC slots
- 50 auxiliary I/O pins
- Programmable clock
13RC-1000
14RC-2000
- Virtex II 2V3000-4, 2V6000-4 and 2V6000-6 FPGAs
- 64bit 66MHz PCI bus
- 6 banks of ZBT SRAM offering a total of either 12Mb or 24Mb
- Front-panel I/O up to 146 lines, dependent on options
- 64 I/O lines via PMC connector
- 16Mb Flash for configuration storage
- 2 Programmable clocks
- Options include
- 16Mb additional ZBT SRAM in 2 banks
- 128Mb DDR Ram
15RC-2000
16CPP Basic Board Architecture
- Two CPEs (Content Processing Engines)
- Virtex-II 1000 FPGA
- Eight LEDs
- 2x 1MB SRAM
- Connection to CPC
- CPC (Content Processing Controller)
- 256MB DDR SDRAM
- PCI Bus to Host
17Design Flows
18Designing acceleration IP
- Traditional Options: HDL-based design
- Purchase FPGA (HW) development tools
- Hire/use HW engineers
- Pay 3rd party development fees
- The Alternative: Software Compiled System Design
- Use Celoxica Content Processing Development Kit
- Development framework with Example Acceleration IP
- Comprehensive Hardware-Software Co-simulation environment
- Tool and Language Connectivity
- Enable SW engineers and/or increase HW engineer productivity
19Why a Software Language Based Approach for System Design?
- Some problems are better expressed as a software algorithm
- Software Reference designs can be utilized
- Designs are often specified by a C/C++ executable
- Simplifies and delays hardware-software partitioning
- Software development techniques can be used
- Brings hardware and software teams closer together
- New Possibilities
20RC100
- RC100 prototyping board
- 10 FPGA
- Commodity memory chips
- Video Input and Output
22CPDK for developing acceleration IP
- The Content Processing Development Kit includes
- Celoxica DK and supporting libraries
- Consisting of
- Software Compiled System Design environment
- Simple design flow with integrated Simulation and direct implementation
- Similar SW/HW design methods simplify design exploration and optimal allocation of functionality between SW and HW
- Verification and Debug using a Symbolic Debugger
- Connectivity and co-simulation with SW and HDL cores
- APIs to hide complexity
- Enabling your software and hardware developers
- To rapidly develop acceleration IP
23Celoxica DK1 Rapid Design
- Handel-C direct to FPGA, Minimum Tool Chain
- Easy-to-learn language based on ISO-C (ANSI-C)
- Design of hardware and software in parallel with co-simulation
[Figure: DK1 design flow. Labels: Handel-C, Simulate, Compile, Netlist, Configure, Final Hardware]
24Supported FPGA/PLD Devices
25Development Flow
[Figure: development flow diagram. Labels: Specification, Algorithm Definition, Partition (HW, SW), LIBS, SW Tool, DK, C, Handel-C, BSP, OS, Develop, HLL Co-Verification, Implementation, HDL, EDIF, LIB, Compile, OBJ, Host CPU, CPP]
26APIs Enable Rapid Co-verification
[Figure: co-verification flow diagram. Labels: Specification, DK, Nexus, C, Handel-C, HW, SW, BSP, HDL-Simulator, SW and/or ISS, Virtual Platform, Implementation]
- Virtual Platform for Co-simulation and Co-design
- Cycle-accurate HLL simulator for Acceleration IP modelling
- Extendable Co-Sim to C/C++, HDL, System-C, ISS
27DK User Interface
[Screenshot: DK user interface. Callouts: Simulate, Build, Syntax highlighting, Break-points, Multithreaded Debug, File view, Symbol view, Watch variables, Clock Cycles, Info]
28Handel-C in Brief
- Handel-C is based on ANSI C
- Well-defined semantics similar to OCCAM/CSP
- Additions
- support for parallelism
- channels for communications between parallel processes
- operators for detailed control of hardware
- constructs for RAM, ROM, interfacing, etc.
29HW-SW Co-Design
30Handel-C Language
31Core Language Features
- Standard C (if, while, switch etc) including
- Functions
- Structures
- Pointers
- par construct for parallelism
- Simple model of timing
- each assignment is one clock cycle
- Arbitrary widths on variables
- Enhanced bit manipulation operators
- Sharing/Copying expressions
- Support for hardware constructs
- Multiple clock domains, RAM, ROM, external interfaces
32Handel-C describes Hardware!
- No side effects in expressions
- i.e. statements that embed an increment or assignment inside a larger expression are not supported
- No floating point
- Floating point is not directly supported by Handel-C
- Library support provided for fixed and floating point arithmetic
- No run-time recursion
- Due to the absence of any kind of call stack in hardware
- Limited standard library (i.e. no printf, fopen etc.)
- However, DK1.1 allows direct calls to external functions written in C/C++, and these could incorporate file I/O, user interaction, recursion, etc.
33Variables
- Handel-C has one basic type - integer
- May be signed or unsigned
- Can be any width, not limited to 8, 16, 32 etc.
Variables are mapped to hardware registers.
34Bit Manipulation Operators
- Extra operators have been added to allow more hardware-like bit manipulation

<<    Shift Left                      b = a << 2
>>    Shift Right                     b = a >> 1
<-    Take least significant bits     b = a <- 5
\\    Drop least significant bits     b = a \\ 5
@     Concatenate bits                b = a @ c
[]    Bit Selection                   b = a[4:1]
35Example Bit Manipulation
36Bit Manipulation 2
- Other bit manipulation examples

signed int 4 a;
signed b, c, d;
a = 0b1100;
b = a << 1;    // b = 0b1000
b = a >> 1;    // b = 0b1110
c = a[2:1];    // c = 0b10
c = a <- 2;    // c = 0b00
c = a \\ 2;    // c = 0b11
d = a @ a;     // d = 0b11001100
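To check results like those in the example by hand, the operators can be mimicked in ordinary software. The following is a minimal Python model, not Handel-C: the helper names are hypothetical, values are treated as raw bit patterns, and widths are passed explicitly because Python integers have no fixed width.

```python
def shift_left(a, n, width):
    """Model of a << n on a width-bit value: bits shifted past the top are lost."""
    return (a << n) & ((1 << width) - 1)


def shift_right_signed(a, n, width):
    """Model of a >> n on a signed width-bit pattern (sign bit is replicated)."""
    sign = a >> (width - 1)
    result = a >> n
    if sign:
        result |= ((1 << n) - 1) << (width - n)
    return result


def take_lsb(a, n):
    """Model of a <- n: keep the n least significant bits."""
    return a & ((1 << n) - 1)


def drop_lsb(a, n):
    """Model of a \\ n: discard the n least significant bits."""
    return a >> n


def concat(a, b, width_b):
    """Model of a @ b: a becomes the high bits, b (width_b bits wide) the low bits."""
    return (a << width_b) | b


def select(a, msb, lsb):
    """Model of a[msb:lsb]: extract bits msb down to lsb inclusive."""
    return (a >> lsb) & ((1 << (msb - lsb + 1)) - 1)
```

With `a = 0b1100` and a width of 4, each helper reproduces the value given in the corresponding comment of the Handel-C example.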
37Timing model
- Assignments and delay statements take 1 clock cycle
- Combinatorial expressions computed between clock edges
- Most complex expression determines clock period
- Example takes 1+n cycles (n is number of iterations)

index = 0;                      // 1 Cycle
while (index < length)
{
    if (table[index] == key)
        found = index;          // 1 Cycle
    else
        index = index + 1;      // 1 Cycle
}
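The 1+n cycle count can be reproduced with a small software model that charges one cycle per executed assignment and nothing for evaluating conditions. This is a sketch in Python rather than Handel-C; the function name is hypothetical, and it assumes the search stops as soon as the key is found, which the slide code does not state explicitly.

```python
def search_cycles(table, key):
    """Return (found_index, cycles) for the linear search, one cycle per assignment."""
    cycles = 1                  # index = 0
    index = 0
    found = None
    while index < len(table):   # loop test is combinatorial: no cycle charged
        cycles += 1             # exactly one assignment executes per iteration
        if table[index] == key:
            found = index
            break               # assumption: the search ends once the key is found
        index += 1
    return found, cycles
```

A key in the last slot of a three-entry table costs `1 + 3` cycles, and so does a key that is absent, since both cases run three iterations.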
38Parallelism
- Handel-C blocks are by default sequential
- par executes statements in parallel
- par block completes when all statements complete
- Time for block is time for longest statement
- Can nest sequential blocks in par blocks
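The timing rules for sequential and par blocks can be captured in a few lines. The sketch below is a Python model, not Handel-C: a block is written as a nested tuple (a hypothetical representation chosen for this example), a leaf is the cycle count of a single statement, and channel or loop effects are deliberately left out.

```python
def cycles(block):
    """Cycle count of a block.

    A block is either an int (cycles taken by one statement) or a tuple
    ("seq" | "par", [child blocks]).
    """
    if isinstance(block, int):
        return block
    kind, children = block
    child_cycles = [cycles(child) for child in children]
    if kind == "seq":
        return sum(child_cycles)             # sequential: times add up
    if kind == "par":
        return max(child_cycles, default=0)  # parallel: longest branch wins
    raise ValueError("unknown block kind: %r" % (kind,))
```

Three one-cycle assignments take 3 cycles as `("seq", [1, 1, 1])` and 1 cycle as `("par", [1, 1, 1])`; nesting a two-statement sequential block inside a par, `("par", [("seq", [1, 1]), 1])`, takes 2 cycles because the par waits for its longest branch.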
39More Parallelism
- Example array initialisation
- Sequential version takes 20 clock cycles
- for() loop has 1 cycle overhead for increment
- Parallel version takes 1 clock cycle
- Replicated par() builds hardware to execute all 20 iterations in a single cycle
- Allows trade-off between hardware size and performance
40Channels
- Allow communication and synchronisation between two parallel branches
- Semantics based on CSP: unbuffered (synchronous) send and receive
- Declaration
- Specifies data type to be communicated
41Sharing Hardware for Expressions
- Functions provide a means of sharing hardware for expressions
- By default, compiler generates separate hardware for each expression
- Hardware is idle when control flow is elsewhere in the program
- Hardware function body is shared among call sites

x = x*a + b;
y = y*c + d;

int mult_add(int z, c1, c2)
{
    return z*c1 + c2;
}

x = mult_add(x, a, b);
y = mult_add(y, c, d);
42Replicating Hardware for Expressions
- Inline Functions are expanded at the call site
- Provide for functional abstraction of complex hardware

inline complex mult_complex(complex x, y)
{
    complex z;
    par
    {
        z.re = x.re*y.re - x.im*y.im;
        z.im = x.re*y.im + x.im*y.re;
    }
    return z;
}

complex x1, y1, x2, y2, z1, z2;
par
{
    z1 = mult_complex(x1, y1);
    z2 = mult_complex(x2, y2);
}
43Macro procedures
- macro proc is similar to an inline function, but is expanded at compile time
- They also allow for arbitrary bit width calculations
- The following generates a reusable timer

macro proc usleep(ms)
{
    #define TENTH_SEC CLOCK_RATE/10
    unsigned (log2ceil(TENTH_SEC)) Counter;
    Counter = TENTH_SEC * (0 @ ms);
    while (Counter)
        Counter--;
}
44Signals
- A signal behaves like a wire: it takes the value assigned to it, but only for that clock cycle
- The value can be read back during the same clock cycle
- The signal can also be given a default value
45Interfaces - Introduction
- Interfaces allow Handel-C designs to connect to external hardware and logic
- Three types of interfaces
- Buses: used for connecting to external pins
- Ports: used for creating connection points for external logic
- e.g. creating the ports for a VHDL entity
- User Defined: used for including external logic blocks inside a Handel-C design
- e.g. including an EDIF black box inside a design
46Interfaces Buses
- Makes connections to pins on the FPGA
- Bus types
- Output
- Input: direct, clocked and latched input
- Tri-state: direct, clocked and latched tri-state

interface bus_in(int 4) Address() with {data = {P1, P2, P3, P4}};
x = Address.in;
47Interfaces Ports
- Allows connection points for external logic to be specified, e.g. defining the ports for a black box VHDL entity
- Port types: Input, Output

// Declare Ports
interface port_in(int 4 Input1) InputPort1();
interface port_in(int 4 Input2) InputPort2();
interface port_out() OutputPort(int 4 Output = OutReg);
48Interfaces User Defined
- Allows external logic blocks to be used inside a Handel-C design, e.g. using an EDIF core

// Instantiate connections to core
interface pipe_mult(int 4 Result) Multiplier(int 4 A, int 4 B);
49Multiple Clock Domains - example
Domain1.c
Domain2.c
50Handel-C Summary
- Handel-C is based on ANSI C
- Well-defined semantics similar to OCCAM/CSP
- Additions
- support for parallelism
- channels for communications between parallel processes
- operators for detailed control of hardware
- constructs for RAM, ROM, interfacing, etc.
51Lab 1
- Quick Start DK1, Handel-C and the RC100
52Tool Connectivity
53Tool Connectivity
Co-Simulation HDL Implementation
RTOS / ISS
54Black Boxes - Xilinx CoreGen
55Co-Simulation with HDL
56Co-Simulation with ISS
57HW-SW Co-Simulation Virtual Platforms
58MatLab Simulink
dll
Filter.hcc
Sfunc.cpp
59Co-Simulation with System-C
60Lab 2
61PDK Platform Dev Kit
62Introduction to PDK
- PDK: Platform Developers Kit
- Goal: to provide an integrated package of tools, support libraries and implementations to simplify application development and verification using DK1
- Insulate developer from hardware details
- Improve portability and maintainability
- Provide key pre-packaged value-adding functionality
- Allow simulation of the complete environment from modelling through to hardware implementation
- Benefits
- Reduce development time
- Allow development focus on application added value
63Introduction to PDK
- PDK: three major components
- DSM
- Integration between processors and FPGA/PLD
- PAL
- A consistent API for portable board-level Handel-C implementations
- PSL
- Provides board, hardware or development tool specific support for DK1 and Handel-C
64Introduction to PDK
- Each PDK component provides four functional areas
- Simulation
- Provides hardware independent simulation of DSM and PAL APIs and co-simulation with external tools and simulators
- Kit
- Provides key components and/or templates to allow development of new, platform specific, implementations
- Platform
- Platform specific implementations of DSM, PAL and PSL components
- Cores
- Implementations of added-value functionality, demos or examples
65Platform Abstraction Layer (PAL)
Handel-C Application
PAL-Core
Platform Abstraction Layer Application Programming Interface
Platform Support Library (PSL)
Board
Peripheral 1
Peripheral 2
66DSM Data Stream Manager
67Labs 3 and 4
68Summary
- High performance gains with HW acceleration cards
- For appropriate algorithms
- Development kit enabling rapid design using a software-like development framework
- Celoxica DK and Handel-C
- Consultancy and Services
- For more: www.celoxica.com
69CPDK for the Tarari CPP
70Development Framework Stack
[Figure: development framework stack. Labels: APP(S), Host CPU, Agent Device Driver(s), Base Driver, SW, SW Tool, DK Tool, HW BSP, HW, AGENT(S)]
- Dev Kit: Examples, Supplied Base Driver/HW BSP
71HW Development in DK
[Figure: HW development in DK. Labels: Handel-C APP, AGENT(S), BSP and API calls, CPP.hcl, CPP_ZBT, CPP_LED, CPP_SAI, CPP_ABI, CPC, PCI, 1MB SRAM (x2), 256MB DDR, 8x LED]
72Appendices
- Technology Behind DK
- Consultancy, Services, Projects
- Case studies
- University Programme
73The Technology Behind DK
74The Technology Behind DK
- Simple Hardware constructs
- Compilation Flow
- Optimisations
75The Hardware Description
- Data Path
- Circuitry to Move/Manipulate/Store Data
- Control Path
- Circuitry to schedule operations
76Control and Assignment
- Variables are mapped to hardware registers
- The control start signal forms the clock enable signal for the destination register of the assignment

void main(void)
{
    R = Exp;
}
77The IF Construct
void main(void)
{
    if (BE)
        S1;
    else
        S2;
}
78Sequential Composition
void main(void)
{
    S1;
    S2;
    S3;
}
79Parallel Composition
void main(void)
{
    par
    {
        S1;
        S2;
        S3;
    }
}
80Compilation Flow - Optimisations
- Generate AST from Source code
- Macro Expansion
- Width Inferencing
- Design Checking
- Compilation to High Level Netlist
- Expansion to technology specific netlist
81Re-Writing
- Logical equivalence
- (a) Constant 1 input to AND Gate removed
- (b) Gate removed with unused output
- (c) Block removed with unused output
82Conditional Re-Writing
- Logical equivalence by testing for impossible conditions
- Gates removed for circuit with output independent of y
83Common Sub-Expression Elimination
- Test for common logic
- Duplicate AND gate removed
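The idea behind removing a duplicate gate can be illustrated on a toy netlist. The sketch below is in Python and is not a description of how DK implements the optimisation: the netlist format (a topologically ordered list of `(output, op, inputs)` tuples) and the function name are hypothetical, and every gate type is assumed to be commutative (AND, OR, XOR) so that input order can be ignored.

```python
def eliminate_common_gates(gates):
    """Merge gates that compute the same function of the same inputs.

    gates: list of (output, op, inputs) in topological order.
    Returns (kept_gates, alias) where alias maps each removed output
    to the surviving output that replaces it.
    """
    seen = {}    # (op, canonical inputs) -> output of the first such gate
    alias = {}   # removed output -> surviving output
    kept = []
    for out, op, inputs in gates:
        # Rewire inputs that referred to an already-removed duplicate,
        # then sort so that AND(a, b) and AND(b, a) compare equal.
        resolved = tuple(sorted(alias.get(name, name) for name in inputs))
        key = (op, resolved)
        if key in seen:
            alias[out] = seen[key]
        else:
            seen[key] = out
            kept.append((out, op, resolved))
    return kept, alias
```

Given two AND gates over the same pair of inputs, each feeding an OR with the same third input, the second AND is dropped first and, once its consumers are rewired, the second OR becomes a duplicate as well.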
84DK1 Optimisation Settings
85Customer Highlights
- Consultancy, Services, Projects
- Case studies
86Celoxica Expertise
- Technical Strengths
- Design Methodologies and Hardware Compiler technology
- FPGA board design and prototyping
- Image, Data processing and Multimedia
- Encryption
- Compression/Decompression
- Video Processing
- Telecommunications
- Routers/Switches
- Protocol stacks IPv6, VoIP (H323, SIP), ATM
- Software defined radio UMTS, 3G, DAB
- Business Consultancy
- Analysis, Marketing and Strategy
- Venture capital
- Services and Support
87Marconi Celoxica Technology Demonstrator
- Internet Reconfigurable Hardware from Software
- FPGA based, no microprocessor or operating system
- Different applications from the same hardware
- Can be reconfigured over internet to new applications
- MMT 2000
- IP Phone
- MP3 player
- Games console
- Graphic display
88High Speed Video Prototyping System
- Customer Requirement
- to shorten the evaluation time of video filter algorithms as candidates for use in DTVB
- Solution
- FPGA-based system comprising
- Wealth of analogue and digital video I/O
- COTS boards and custom
- Development kit DK and Video framework libraries (SW/HW)
- Outcome
- Real-time evaluation system rather than slow software models
- Algorithm Evaluation times reduced from 12 to 3-6 months
- Prototypes for ASIC process rather than software models
89EuroSkyWay Multimedia Satellite Ground Traffic Simulator
- for system verification and end-to-end perf. testing
- generation of total network traffic (ATM, IP)
- full implementation of ESW protocols (layer 1/2/3)
- digital baseband transmission
- Services 512, 2048 kb/s, 8...32 Mb/s (provider)
- fixed and mobile users (aircraft, buses, vessels)
- service launch in 2004
90JPEG2000 MQ encoder implementation
Metric                                SCSD version    Traditional HDL implementation
Slices                                1,999           620
Device utilization (%)                18              6
Speed (MHz)                           115.5           76
Lines of code                         330             800
Design time (days)                    10 + 2          30
Av cycles per code block (000s)       108             67.5
Processing time (ms)                  0.939           0.888
Simulation time for Lena jpeg         5 minutes       XXX

Platforms named on the slide: Xilinx Virtex; Wind River SBC405 GP; IBM Power PC; Proteus FPGA daughter card
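The processing times quoted for the two implementations appear to follow from the cycle counts and clock speeds. The sketch below, in Python, assumes that processing time is simply average cycles per code block divided by clock frequency; the slide does not state this relationship, and the function name is hypothetical.

```python
def processing_time_ms(kilocycles, clock_mhz):
    """Time in milliseconds to run `kilocycles` thousand cycles at `clock_mhz` MHz."""
    cycles = kilocycles * 1_000
    seconds = cycles / (clock_mhz * 1e6)
    return seconds * 1e3
```

Under that assumption, 67.5 thousand cycles at 76 MHz gives about 0.888 ms, matching the HDL figure, and 108 thousand cycles at 115.5 MHz gives about 0.935 ms, within about half a percent of the 0.939 ms quoted for the SCSD version (the cycle counts on the slide are rounded). The higher clock rate of the SCSD design is what offsets its larger cycle count.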
91Customer highlights
"The DK1 suite enables us to work at a high level, quickly optimise C code for hardware implementation, prototype using FPGAs and will ultimately provide the HDL output for our ASIC design." Shigeru Kawada, General Manager, NEC Electronics Singapore's technology centre

"The real value of the Celoxica tools is the quick re-engineering capability and smooth transfer to a production platform. The DK1 methodology allows us to accomplish tasks in a time frame that conventional design methods cannot handle." Dennis Hazel, Director of Engineering, Foxboro

"Our evaluation of DK1 clearly demonstrated that the flow increases our engineering throughput, and allows us to make better use of our scarce hardware engineering resources. Using Celoxica's Handel-C to hardware flow, our software engineers can take a software solution through to hardware allowing the hardware designer to focus on system integration and optimisation." Andy Davey, senior engineer at Cogent Defence Systems

A new joint development team to create powerful, flexible and scaleable application specific servers was announced today. Celoxica Ltd, Motorola and StrongBow Technologies are working together to create servers that embed applications, such as transaction processing for credit cards, directly in hardware. "Without Celoxica's tools, this would not have been possible," Alan Prouse, CEO and founder of StrongBow Technologies.

"Our original project plan was slated for 12-18 months using the traditional HDL design methodology. By adopting the Handel-C high level design language methodology, we were able to finish the project in 6 months with DK1 design suite and Xilinx ISE software targeting Xilinx VII-6000 FPGAs. We put in minimum development resource, but still met the design specification, timing far ahead of the schedule. Anyone can use the DK1 design suite to design efficient hardware." Gary Mallaley, Manager of Strategy Development at Northrop Grumman.

"I visited Celoxica's headquarters. While there, I re-implemented our existing VHDL solution using the DK1 suite in just one day. I was hooked." Jan Mennekens, chief technical officer, M-TEC WIRELESS.
92CUP: Celoxica University Programme
93Introduction to CUP
- CUP has been active since the company was formed
- 700 universities worldwide registered with a multi-disciplinary user base
- Strategic relationship with XUP
- University specific products and services
- Heavily discounted
- Focused upon supporting innovative teaching and research
- Comprehensive Website
- www.celoxica.com/programs/university/index
- Register Now!
94Benefits to Universities
- Rapid Design Exploration
- Fit more interest into time dependent project work through rapid prototyping and productivity improvements
- Port prototyped C designs to Handel-C for implementation in FPGAs
- For Computer Science disciplines
- Familiar software environment
- Parallel programming environment
- Computer architecture exploration: build your own instruction sets
- Exploring hardware accelerated systems
- For EE disciplines
- Cycle accurate interactive simulation
- SW/HW co-design, system design and SOC
- Integration with HDLs
- Creates a bridge for increased collaboration between different disciplines
95Update on University activity
- Research Articles
- Customising Floating-Point Designs, Imperial College, Xilinx
- Accelerating Radiosity Calculations using Reconfigurable Platforms, Altaf Abdul Gaffar and Wayne Luk, Imperial College
- A Hardware Implementation of a Genetic Programming System Using FPGAs and Handel-C, Peter Martin, University of Birmingham
- Teaching Programmes
- VDEC Japan now support DK1/Handel-C
- HARDWARE/SOFTWARE CO-DESIGN: A SHORT COURSE FOR UNBELIEVERS, A. Downton et al, University of Essex
96IGOL Framework
- What is it?
- COM based Framework for Development and Distribution of Hardware Acceleration
- Testing and debugging for development
- Runtime services and packaging for deployment
- Application Examples
- Premiere, Photoshop, WinAmp, VirtualDub, DirectShow
- Demonstrates
- Ease of Development and Deployment of Hardware Acceleration
- Separation of concerns
- Hardware developers only develop hardware
- Application developers only develop software
- Re-use of hardware and software components
- Simple updating and patching
- Automatic application support for new components