Title: NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications
Slide 1: NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications
Chris Conger, Ian Troxel, Daniel Espinosa, Vikas Aggarwal, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida
Slide 2: Outline
- Introduction
- NARC Board Architecture, Protocols
- Case Study Applications
- Experimental Setup
- Results and Analysis
- Pitfalls and Lessons Learned
- Conclusions
- Future Work
Slide 3: Introduction
- Network-Attached Reconfigurable Computer (NARC) project
  - Inspiration: network-attached storage (NAS) devices
  - Core concept: investigate challenges and alternatives for enabling direct network access to, and control over, reconfigurable computing (RC) devices
  - Method: prototype a hardware interface and software infrastructure, and demonstrate a proof of concept for the benefits of network-attached RC resources
- Motivations for the NARC project include (but are not limited to) applications such as:
  - Network-accessible processing resources
    - Generic networked RC resource, a viable alternative to server and supercomputer solutions
    - Power and cost savings over server-based FPGA cards are key benefits
      - No server needed to host the RC device
      - Infrastructure provided for robust operation and interfacing with users
      - A performance increase over existing RC solutions is not a primary goal of this approach
  - Network monitoring and packet analysis
Slide 4: Envisioned Applications
- Aerospace and military applications
  - Modular, low-power design lends itself well to deployment on military craft and munitions
  - FPGAs providing high-performance radar, sonar, and other computational capabilities
- Scientific field operations
  - Quickly provide first-level estimations in the field for geologists, biologists, etc.
- Field-deployable covert operations
  - Completely wireless device enabled through battery and WLAN
  - Passive network monitoring applications
  - Active network traffic injection
- Distributed computing
  - Cost-effective, RC-enabled clusters or cluster resources
  - Cluster NARC devices at a fraction of the cost, power, and cooling
- Cost-effective intelligent sensor networks
  - Use FPGAs in close conjunction with sensors to provide pre-processing before network transmission
- High-performance network technologies
  - Fast Ethernet may be replaced by any network technology
  - Gig-E, InfiniBand, RapidIO, proprietary communication protocols
Slide 5: NARC Board Architecture: Hardware
- ARM9 network control with FPGA processing power (see Figure 1)
- Prototype design consists of two boards connected via cable
  - Network interface board (ARM9 processor and peripherals)
  - Xilinx development board(s) (FPGA)
- Network interface peripherals include:
  - Layer-2 network connection (hardware PHY/MAC)
  - External memory, SDRAM and Flash
  - Serial port (debug communication link)
  - FPGA control and data lines
- NARC hardware specifications
  - ARM-core microcontroller, 1.8V core, 3.3V peripherals
    - 32-bit RISC, 5-stage pipeline, in-order execution
    - 16KB data cache, 16KB instruction cache
    - Core clock speed 180MHz, peripheral clock 60MHz
    - On-chip Ethernet MAC layer with DMA
  - External memory, 3.3V
    - 32MB SDRAM, 32-bit data bus
    - 2MB Flash, 16-bit data bus
    - Port available for additional 16-bit SRAM devices

Figure 1: Block diagram of NARC device
Slide 6: NARC Board Architecture: Software
- ARM processor runs Linux kernel 2.4.19
  - Provides TCP(UDP)/IP stack, resource management, threaded execution, and a Berkeley Sockets interface for applications
  - Configured and compiled with drivers specifically for our board
- Applications written in C, compiled using the GCC compiler for ARM (see Figure 2)
- NARC API: low-level driver function library for basic services
  - Initialize and configure on-chip peripherals of the ARM-core processor
  - Configure FPGA (SelectMAP protocol)
  - Transfer data to/from FPGA, manipulate control lines
  - Monitor and initiate network traffic
- NARC protocol for job exchange (from a remote workstation)
  - NARC board application and client application must follow standard rules and procedures for responding to requests from a user
  - User appends a small header onto the data (if any) containing information about the request before sending it over the network (see Figure 3; a hypothetical sketch follows)
- Bootstrap software in on-board Flash automatically loads and executes on power-up
  - Configures clocks, memory controllers, I/O pins, etc.
  - Contacts a TFTP server running on the network, downloads Linux and ramdisk
  - Boots Linux, automatically executes the NARC board software contained in the ramdisk
  - Optional serial interface through HyperTerminal for debugging/development

Figure 2: Software development process
Figure 3: Request header field definitions
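Figure 3's field definitions are not reproduced in this text. Purely as an illustration of the request-header idea described above, a sketch in C might look as follows; every field name and width here is an assumption, not the actual NARC layout (only RTYPE 01 for FPGA configuration is documented elsewhere in these slides).

```c
#include <stdint.h>

/* Hypothetical sketch of a NARC request header. Field names, widths,
 * and ordering are assumptions; the real layout is given in Figure 3. */
struct narc_request {
    uint8_t  rtype;       /* request type; e.g. 0x01 = configure FPGA */
    uint8_t  flags;       /* assumed: option/reserved bits */
    uint16_t seq;         /* assumed: sequence number to match replies */
    uint32_t payload_len; /* length in bytes of data following header */
    /* payload (bitfile, function descriptor, job data, ...) follows */
} __attribute__((packed));
```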
Slide 7: NARC Board Architecture: FPGA Interface
- Data communicated to/from FPGA by means of unidirectional data paths
  - 8-bit input port, 8-bit output port, 8 control lines (Figure 4)
  - Control lines manage data transfer and also drive configuration signals
  - Data transferred one byte at a time; full-duplex communication possible
- Control lines include the following signals:
  - Clock: software-generated signal to clock data on the data ports
  - Reset: reset signal for interface logic in the FPGA
  - Ready: signal indicating the device is ready to accept another byte of data
  - Valid: signal indicating the device has placed valid data on the port
  - SelectMAP: all signals necessary to drive SelectMAP configuration

Figure 4: FPGA interface signal diagram
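As a minimal sketch of the software-generated handshake described above, the ARM-side routine below pushes one byte across the 8-bit output port using the Clock, Ready, and Valid lines. The gpio_* accessors and pin identifiers are placeholders, not the actual NARC API.

```c
/* Placeholder GPIO accessors; the real NARC API manipulates the
 * ARM's GPIO registers directly. */
extern void gpio_write(int pin, int level);
extern int  gpio_read(int pin);
extern void gpio_write_port(unsigned char value);  /* 8-bit output port */

enum { PIN_CLK, PIN_RESET, PIN_READY, PIN_VALID }; /* assumed pin IDs */

/* Send one byte to the FPGA: wait for Ready, present the byte with
 * Valid asserted, then toggle the software-generated clock. */
static void fpga_send_byte(unsigned char b)
{
    while (!gpio_read(PIN_READY))
        ;                        /* FPGA not ready for another byte */
    gpio_write_port(b);          /* drive the 8-bit output port */
    gpio_write(PIN_VALID, 1);    /* flag valid data on the port */
    gpio_write(PIN_CLK, 1);      /* rising edge clocks the byte in */
    gpio_write(PIN_CLK, 0);
    gpio_write(PIN_VALID, 0);
}
```

Widening the port to 16 or 32 bits, as discussed later under Pitfalls, would change only gpio_write_port and the per-transfer byte count in a routine of this shape.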
- FPGA configuration through SelectMAP protocol
  - Fastest configuration option for Xilinx FPGAs; protocol emulated using GPIO pins of the ARM (a sketch of the emulated write sequence follows this list)
- NARC board enables remote configuration and management of the FPGA
  - User submits a configuration request (RTYPE 01), along with a bitfile and function descriptor
  - Function descriptor is an ASCII string: a formatted list of functions with associated RTYPE definitions
  - ARM halts and configures the FPGA, then stores the descriptor in a dedicated RAM buffer for user queries
- All FPGA designs must restrict use of all SelectMAP pins after configuration
  - Some signals are shared between the SelectMAP port and the FPGA-ARM link
  - Once configured, SelectMAP pins must remain tri-stated and unused
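A GPIO-emulated SelectMAP write reduces to pulsing PROG_B, waiting on INIT_B, then clocking the bitfile out one byte per CCLK edge until DONE rises. The sketch below follows the standard Xilinx SelectMAP write sequence using the same placeholder gpio_* accessors as the earlier sketch; it is not the NARC driver itself.

```c
extern void gpio_write(int pin, int level);
extern int  gpio_read(int pin);
extern void gpio_write_port(unsigned char value);

enum { PIN_CCLK = 4, PIN_PROG_B, PIN_INIT_B, PIN_CS_B, PIN_RDWR_B,
       PIN_DONE };                       /* assumed pin IDs */

/* Bit-banged SelectMAP configuration (write mode). Returns 0 on
 * success, -1 if the FPGA flags a configuration error via INIT_B. */
static int fpga_configure(const unsigned char *bitfile, unsigned long len)
{
    gpio_write(PIN_PROG_B, 0);           /* pulse PROG_B: clear old config */
    gpio_write(PIN_PROG_B, 1);
    while (!gpio_read(PIN_INIT_B))
        ;                                /* wait for initialization */
    gpio_write(PIN_CS_B, 0);             /* select device ... */
    gpio_write(PIN_RDWR_B, 0);           /* ... in write mode */
    for (unsigned long i = 0; i < len; i++) {
        gpio_write_port(bitfile[i]);
        gpio_write(PIN_CCLK, 1);         /* byte captured on rising edge */
        gpio_write(PIN_CCLK, 0);
        if (!gpio_read(PIN_INIT_B))
            return -1;                   /* CRC/configuration error */
    }
    gpio_write(PIN_CS_B, 1);
    while (!gpio_read(PIN_DONE)) {       /* extra CCLKs until DONE high */
        gpio_write(PIN_CCLK, 1);
        gpio_write(PIN_CCLK, 0);
    }
    return 0;
}
```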
Slide 8: Results and Analysis: Raw Performance
- FPGA interface I/O throughput (Table 1)
  - 1KB of data transferred over the link, timed
  - Measured using hardware methods
    - Logic analyzer captures raw link activity; data rate is data sent divided by the time from first clock to last clock (see Figure 9)
  - Performance lower than desired for the prototype
    - Handshake protocol may add unnecessary overhead
    - Widening data paths and optimizing the software routine will significantly improve FPGA I/O performance
- Network throughput (Table 2)
  - Measured using the Linux network benchmark IPerf
  - NARC board located on an arbitrary switch within the network; application partner is a user workstation
  - Transfers as much data as possible in 10 seconds, then calculates throughput as data sent divided by 10 seconds
  - Performed two experiments, with the NARC board serving as client in one run and server in the other
  - Both local and remote IPerf partners (remote location 400 miles away, at Florida State University)
  - Network interface achieves reasonably good bandwidth efficiency
- External memory throughput (Table 3)
  - 4KB transferred to external SDRAM, both read and write
  - Measurements again taken using the logic analyzer
  - Memory throughput sufficient to provide wire-speed buffering of network traffic
    - On-chip Ethernet MAC has DMA to this SDRAM
Figure 9: Logic analyzer timing

                Input   Output
Logic Analyzer  6.08    6.12
Table 1: FPGA interface I/O performance (Mb/s)

               Local Network   Remote Network (WAN)
NARC-Server    75.4            4.9
Server-Server  78.9            5.3
Table 2: Network throughput (Mb/s)

                Read    Write
Logic Analyzer  183.2   160.0
Table 3: External SDRAM throughput (Mb/s)
Slide 9: Results and Analysis: Raw Performance
- Reconfiguration speed
  - Includes time to transfer the bitfile over the network, plus time to configure the device (transfer bitfile from ARM to FPGA), plus time to receive an acknowledgement
  - Our design currently completes a user-initiated reconfiguration request with a 1.2MB bitfile in 2.35 sec
- Area/resource usage of a minimal wrapper for the Virtex-II Pro FPGA
  - Resource requirements for a minimal design providing the required link control and data transfer in an application wrapper are presented below
  - Design implemented on an older Virtex-II Pro FPGA
  - Numbers indicate requirements for the wrapper only; unused resources remain available for user applications
  - Extremely small footprint!
  - Footprint will be even smaller on a larger FPGA
Device utilization summary
Selected Device: 2vp20ff1152-5
  Number of Slices:            143 out of  9280   (1%)
  Number of Slice Flip Flops:  120 out of 18560   (0%)
  Number of 4 input LUTs:      238 out of 18560   (1%)
  Number of bonded IOBs:        24 out of   564   (4%)
  Number of BRAMs:               8 out of    88   (9%)
  Number of GCLKs:               1 out of    16   (6%)
Slide 10: Case Study Applications
- Clustered RC devices: N-Queens
  - HPC application demonstrating the NARC board's role as a generic compute resource
  - Application characterized by minimal communication and heavy computation within the FPGA
  - NARC version of N-Queens adapted from a previously implemented application for the PCI-based Celoxica RC1000 board housed in a conventional server
  - N-Queens algorithm is part of the DoD high-performance computing benchmark suite and representative of select military and intelligence processing algorithms
  - Exercises functionality of the various developed mechanisms and protocols for job submission, data transfer, etc. on NARC
  - User specifies a single parameter N; upon completion, the algorithm returns the total number of possible solutions
  - Purpose of the algorithm is to determine how many possible arrangements of N queens there are on an N x N chess board such that no queen may attack another (see Figure 5; a reference software sketch follows)
  - Results are presented from both NARC-based execution and RC1000-based execution for comparison
Figure 5: Possible 8x8 solution (figure c/o Jeff Somers)
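For reference, the computation itself is compact in software. Below is a standard bitwise backtracking counter for N-Queens in C, shown only to illustrate what the FPGA design computes; it is not the NARC or RC1000 implementation, and it assumes N <= 32.

```c
#include <stdio.h>

/* Count all N-Queens solutions with bitwise backtracking.
 * cols/ld/rd mark columns and diagonals attacked by queens already
 * placed on earlier rows; mask has the low N bits set. */
static unsigned long solve(unsigned cols, unsigned ld, unsigned rd,
                           unsigned mask)
{
    if (cols == mask)
        return 1;                        /* all N queens placed */
    unsigned long count = 0;
    unsigned avail = mask & ~(cols | ld | rd);
    while (avail) {
        unsigned bit = avail & -avail;   /* lowest available square */
        avail -= bit;
        count += solve(cols | bit,
                       ((ld | bit) << 1) & mask,
                       (rd | bit) >> 1, mask);
    }
    return count;
}

int main(void)
{
    unsigned n = 8;
    printf("%u-Queens solutions: %lu\n", n,
           solve(0, 0, 0, (1u << n) - 1)); /* prints 92 for n = 8 */
    return 0;
}
```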
Slide 11: Case Study Applications
- Network processing: Bloom Filter
  - This application performs passive packet analysis through use of a classification algorithm known as a Bloom Filter
  - Application characterized by constant, bursty communication patterns
    - Most communication is Rx over the network, with transmission to the FPGA
  - Filter may be programmed or queried
  - NARC device copies all received network frames to memory; the ARM parses the TCP/IP header and sends it to the Bloom Filter for classification
  - User can send programming requests, which include a header and a string to be programmed into the Filter
  - User can also send result-collection requests, which cause a formatted results packet to be sent back to the user
  - Otherwise, the application runs continuously, querying each header against the current Bloom Filter and recording match/header pair information
- Bloom Filter works by applying multiple hash functions to a given bit string, each hash function rendering indices into a separate bit vector (see Figure 6)
  - To program: hash the input string and set the resulting bit positions to 1
  - To query: hash the input string; if all resulting bit positions are 1, the string matches (a software sketch follows Figure 7)
- Implemented on Virtex-II Pro FPGA
  - Uses a slightly larger, but ultimately more effective, application wrapper (see Figure 7)
  - Larger FPGA selected to demonstrate interoperability with any FPGA
Figure 6: Bloom Filter algorithmic architecture
Figure 7: Bloom Filter implementation architecture
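To make the program/query steps concrete, here is a minimal software Bloom filter in C using the classic single-bit-vector formulation (the NARC design, per Figure 6, uses a separate bit vector per hash function). The vector size, hash count, and seeded FNV-1a hash are illustrative choices only, not those of the FPGA design.

```c
#include <stdint.h>
#include <stddef.h>

#define VEC_BITS 4096              /* bit-vector size (illustrative) */
#define NUM_HASH 4                 /* number of hash functions */

static uint8_t vec[VEC_BITS / 8];

/* FNV-1a with a per-function seed stands in for the k hash functions. */
static uint32_t hash(const void *data, size_t len, uint32_t seed)
{
    const uint8_t *p = data;
    uint32_t h = 2166136261u ^ seed;
    while (len--) { h ^= *p++; h *= 16777619u; }
    return h % VEC_BITS;
}

/* Program: set every bit position selected by the hashes. */
void bloom_program(const void *data, size_t len)
{
    for (uint32_t i = 0; i < NUM_HASH; i++) {
        uint32_t b = hash(data, len, i);
        vec[b / 8] |= 1u << (b % 8);
    }
}

/* Query: match only if all selected bit positions are set. */
int bloom_query(const void *data, size_t len)
{
    for (uint32_t i = 0; i < NUM_HASH; i++) {
        uint32_t b = hash(data, len, i);
        if (!(vec[b / 8] & (1u << (b % 8))))
            return 0;              /* definitely not programmed */
    }
    return 1;                      /* probable match; false positives possible */
}
```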
Slide 12: Experimental Setup
- N-Queens: clustered RC devices
  - NARC device located on an arbitrary switch in the network
  - User interfaces through a client application on a workstation and requests the N-Queens procedure
  - Figure 8 illustrates the experimental environment
  - Client application records the time required to satisfy the request
  - Power supply measures current draw of the active NARC device
  - N-Queens also implemented on an RC-enabled server equipped with a Celoxica RC1000 board
    - Client-side function call to the NARC board replaced with a function call to the RC1000 board in the local workstation, with the same timing measurement
  - Comparison offered in terms of performance, power, and cost
- Bloom Filter: network processing
  - Same experimental setup as the N-Queens case study
  - Software on the ARM co-processor captures all Ethernet frames (see the capture sketch after Figure 8)
  - Only packet headers (TCP/IP) are passed to the FPGA
  - Data continuously sent to the FPGA as packets arrive over the network
  - With the NARC device attached to a switch, only limited packets can be captured
    - Only broadcast packets and packets destined for the NARC device can be seen
    - A dual-port device could be inserted in-line with a network link to monitor all flow-through traffic
Figure 8: Experimental environment
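As a sketch of the capture step referenced above, assuming the standard Linux packet-socket API (available in the 2.4 kernels the board runs, though not necessarily what the NARC software actually uses), the ARM-side loop might look like the following. It must run with root privileges.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>   /* ETH_P_ALL, ETH_P_IP, struct ethhdr */
#include <netinet/ip.h>       /* struct iphdr */

int main(void)
{
    unsigned char frame[2048];

    /* Raw packet socket: delivers every frame the interface sees. */
    int s = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (s < 0) { perror("socket"); return 1; }

    for (;;) {
        ssize_t n = recv(s, frame, sizeof frame, 0);
        if (n <= 0)
            continue;
        struct ethhdr *eth = (struct ethhdr *)frame;
        if (ntohs(eth->h_proto) != ETH_P_IP)
            continue;                     /* keep only IP packets */
        struct iphdr *ip = (struct iphdr *)(frame + sizeof *eth);
        if (ip->protocol != IPPROTO_TCP)
            continue;                     /* keep only TCP traffic */
        /* Here the TCP/IP header would be queued for transfer to the
         * FPGA-side Bloom Filter for classification. */
        (void)ip;
    }
}
```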
Slide 13: Results and Analysis: N-Queens Case Study
- First, consider an execution-time comparison between our NARC board and a PCI-based RC card (see Figures 10a and 10b)
  - Both FPGA designs clocked at 50MHz
  - Performance difference is minimal between the devices
  - Being able to match the performance of a PCI-based card is a resounding success!
- Power consumption and cost of NARC devices are drastically lower than those of server-plus-RC-card combinations
  - Multiple users may share a NARC device; PCI-based cards are somewhat fixed in an individual server
- Power consumption calculated using the following method:
  - Three regulated power supplies exist in the complete NARC device (network interface + FPGA board): 5V, 3.3V, 2.5V
  - Current draw from each supply was measured
  - Power consumption is calculated as the sum of the V*I products of all three supplies
Figure 10: Performance comparison between NARC board and PCI-based RC card on server
Slide 14: Results and Analysis: N-Queens Case Study
- Figure 11 summarizes the performance ratio of N-Queens between the NARC and RC1000 platforms
- Consider Table 4 for a summary of cost and power statistics
  - Unit price shown excludes cost of the FPGA
    - FPGA cost is offset when compared against another FPGA-equipped device
  - Price shown includes PCB fabrication and component costs
  - Approximate power consumption drastically less than a server + RC-card combination
    - Power consumption of a server varies depending on its particular hardware
    - Typical servers operate off of 200-400W power supplies
  - See Figure 12 for the approximate power consumption calculation
Figure 11: N-Queens performance ratio between NARC and RC1000 platforms
                             NARC Board
Cost per unit (prototype)    $175.00
Approx. power consumption    3.28 W
Table 4: Price and power figures for NARC device
P = (5V)(I_5) + (3.3V)(I_3.3) + (2.5V)(I_2.5)
I_5 = 0.2A, I_3.3 = 0.49A, I_2.5 = 0.27A
P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) ≈ 3.28W
Figure 12: Power consumption calculation
Slide 15: Results and Analysis: Bloom Filter
- Passive, continuous network traffic analysis
  - Wrapper design was slightly larger than the minimal wrapper previously used with N-Queens
  - Still a small footprint on chip; the majority of the FPGA remains for the application
  - Maximum wrapper clock frequency of 183MHz should not limit the application clock if in the same clock domain
  - Packets received over the network link are parsed by the ARM, with the TCP/IP header saved in a buffer
  - Headers are sent one at a time as query requests to the Bloom Filter (FPGA); when a query finishes, another header is de-queued if available (see the buffering sketch below)
  - User may query the NARC device at any time for a results update or to program a new pattern
- Figure 13 shows resource usage for the Virtex-II Pro FPGA
  - Maximum clock frequency of 113MHz
    - Not affected by the wrapper constraint
  - Computation speed significantly faster than FPGA-ARM link communication speed
    - FPGA-side buffer will not fill up; each header is processed before the next header is transmitted to the FPGA
    - ARM-side buffer may fill up under heavy traffic loads
      - 32MB of ARM-side RAM gives a large buffer
Device utilization summary
Selected Device: 2vp20ff1152-5
  Number of Slices:            1174 out of  9280   (13%)
  Number of Slice Flip Flops:  1706 out of 18560   (9%)
  Number of 4 input LUTs:      2032 out of 18560   (11%)
  Number of bonded IOBs:         24 out of   564   (4%)
  Number of BRAMs:                9 out of    88   (10%)
  Number of GCLKs:                1 out of    16   (6%)

Figure 13: Device utilization statistics for Bloom Filter design
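One plausible shape for the ARM-side header buffering described above is a fixed-size ring of header records, sketched below; the record size and queue depth are assumptions for illustration, not details of the NARC implementation.

```c
#include <string.h>

#define HDR_BYTES 40     /* 20-byte IP + 20-byte TCP header, no options */
#define QUEUE_LEN 4096   /* illustrative depth; 32MB of RAM allows far more */

static unsigned char q[QUEUE_LEN][HDR_BYTES];
static unsigned head, tail;    /* head: next dequeue; tail: next enqueue */

/* Enqueue a parsed TCP/IP header; returns 0 if the buffer is full,
 * which should only happen under heavy traffic loads. */
int hdr_enqueue(const unsigned char *hdr)
{
    unsigned next = (tail + 1) % QUEUE_LEN;
    if (next == head)
        return 0;
    memcpy(q[tail], hdr, HDR_BYTES);
    tail = next;
    return 1;
}

/* Dequeue the next header for transfer to the FPGA; 0 if empty. */
int hdr_dequeue(unsigned char *out)
{
    if (head == tail)
        return 0;
    memcpy(out, q[head], HDR_BYTES);
    head = (head + 1) % QUEUE_LEN;
    return 1;
}
```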
Slide 16: Pitfalls and Lessons Learned
- FPGA I/O throughput capacity remains a persistent problem
  - One motivation for designing custom hardware is to remove the typical PCI bottleneck and provide wire-speed network connectivity for the FPGA
  - The under-provisioned data path between the FPGA and network interface restricts performance benefits for our prototype design
  - Luckily, this problem may be solved through a variety of approaches:
    - Wider data paths (16-bit, 32-bit) double or quadruple throughput, at the expense of higher pin count
    - Use of a higher-performance co-processor capable of faster I/O switching frequencies
    - An optimized data transfer protocol
- Having a co-processor in addition to the FPGA to handle the network interface is vital to the success of our approach
  - Required in order to permit initial remote configuration of the FPGA, as well as additional reconfigurations upon user request
  - Offloading the network stack, basic request handling, and other maintenance-type tasks from the FPGA saves a significant number of valuable slices for user designs
  - Drastically eases interfacing with the user application on a networked workstation
  - Serves as an active co-processor for FPGA applications, e.g. parsing network packets as in the Bloom Filter application
Slide 17: Conclusions
- A novel approach to providing FPGAs with standalone network connectivity has been prototyped and successfully demonstrated
  - Investigated issues critical to providing remote management of standalone NARC resources
  - Proposed and demonstrated solutions to discovered challenges
  - Performed a pair of case studies with two distinct, representative applications for a NARC device
- Network-attached RC devices offer potential benefits for a variety of applications
  - Impressive cost and power savings over server-based RC processing
  - Independent NARC devices may be shared by multiple users without relocating hardware
  - Tightly coupled network interface enables the FPGA to be used directly in the path of network traffic for real-time analysis and monitoring
- Two issues that are proving to be a challenge to our approach:
  - Data latency in FPGA communication
  - Software infrastructure required to achieve a robust standalone RC unit
- While the prototype design achieves relatively good performance in some areas and limited performance in others, this is acceptable for a concept demonstration
  - A fairly complex board design architecture and software enhancements are in development
  - As proof of the NARC concept, an important goal of the project was achieved in the demonstration of an effective and efficient infrastructure for managing NARC devices
Slide 18: Future Work
- Expansion of network processing capabilities
  - Further development of the packet filtering application
    - More specific and practical activity or behavior sought from network traffic
    - Analyze streaming packets at or near wire-speed rates
  - Expansion of Ethernet link to a 2-port hub
    - Permits transparent insertion of the device into a network path
    - Provides easier access to all packets in a switched IP network
- Merging the FPGA with the ARM co-processor and network interface into one device
  - Ultimate vision for the NARC device
  - Will restrict the number of different FPGAs which may be supported, according to the FPGA socket/footprint chosen for the board
  - Increased difficulty in PCB design
- Expansion to Gig-E and other network technologies
  - Fast Ethernet was targeted for the prototyping effort and concept demonstration
  - A true high-performance device should support Gigabit Ethernet
  - Other potential technologies include (but are not limited to) InfiniBand and RapidIO
- Further development of management infrastructure
  - Need for more robust control/decision-making middleware
  - Automatic device discovery, concurrent job execution, fault-tolerant operation