Conger #233 MAPLD 2005
NARC: Network-Attached Reconfigurable Computing for High-performance, Network-based Applications
Chris Conger, Ian Troxel, Daniel Espinosa, Vikas Aggarwal, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida
Outline
  Introduction
  NARC Board Architecture, Protocols
  Case Study Applications
  Experimental Setup
  Results and Analysis
  Pitfalls and Lessons Learned
  Conclusions
  Future Work
Introduction
Network-Attached Reconfigurable Computer (NARC) Project
  Inspiration: network-attached storage (NAS) devices
  Core concept: investigate challenges and alternatives for enabling direct network access and control over reconfigurable (RC) devices
  Method: prototype hardware interface and software infrastructure, demonstrate proof of concept for benefits of network-attached RC resources
Motivations for NARC project include (but are not limited to) applications such as:
  Network-accessible processing resources
    Generic network RC resource, viable alternative to server and supercomputer solutions
    Power and cost savings over server-based FPGA cards are key benefits
      No server needed to host RC device
      Infrastructure provided for robust operation and interfacing with users
    Performance increase over existing RC solutions is not a primary goal of this approach
  Network monitoring and packet analysis
    Easy attachment; unobtrusive, fast traffic gathering and processing
    Network intrusion and attack detection, performance monitoring, active traffic injection
    Direct network connection of FPGA can enable wire-speed processing of network traffic
  Aircraft and advanced munitions systems
    Standard Ethernet interface eases addition and integration of RC devices in aircraft and munitions systems
    Low weight and power are also attractive characteristics of NARC device for such applications
Envisioned Applications
Aerospace & military applications
  Modular, low-power design lends itself well to military craft and munitions deployment
  FPGAs providing high-performance radar, sonar, and other computational capabilities
Scientific field operations
  Quickly provide first-level estimations for scientific field operations for geologists, biologists, etc.
Field-deployable covert operations
  Completely wireless device enabled through battery, WLAN
  Passive network monitoring applications
  Active network traffic injection
Distributed computing
  Cost-effective, RC-enabled clusters or cluster resources
  Cluster NARC devices at a fraction of cost, power, cooling
Cost-effective intelligent sensor networks
  Use FPGAs in close conjunction with sensors to provide pre-processing functions before network transmission
High-performance network technologies
  Fast Ethernet may be replaced by any network technology
  Gig-E, InfiniBand, RapidIO, proprietary communication protocols
NARC Board Architecture: Hardware
ARM9 network control with FPGA processing power (see Figure 1)
Prototype design consists of two boards, connected via cable:
  Network interface board (ARM9 processor + peripherals)
  Xilinx development board(s) (FPGA)
Network interface peripherals include:
  Layer-2 network connection (hardware PHY+MAC)
  External memory, SDRAM and Flash
  Serial port (debug communication link)
  FPGA control and data lines
NARC hardware specifications:
  ARM-core microcontroller, 1.8V core, 3.3V peripheral
    32-bit RISC, 5-stage pipeline, in-order execution
    16KB data cache, 16KB instruction cache
    Core clock speed 180MHz, peripheral clock 60MHz
    On-chip Ethernet MAC layer with DMA
  External memory, 3.3V
    32MB SDRAM, 32-bit data bus
    2MB Flash, 16-bit data bus
    Port available for additional 16-bit SRAM devices
  Ethernet transceiver, 3.3V
    DM9161 PHY layer transceiver
    100Mbps, full duplex capable
    RMII interface to MAC
Figure 1 – Block diagram of NARC device (custom network interface board with RJ-45, Ethernet PHY, AT91RM9200 ARM processor, external memory on the memory bus, and serial port; RC device on Xilinx HW-AFX-BG560-100 board with FPGA and JTAG; the two boards are linked by 8-bit input, 8-bit output, and control lines)
NARC Board Architecture: Software
ARM processor runs Linux kernel 2.4.19
  Provides TCP(UDP)/IP stack, resource management, threaded execution, Berkeley Sockets interface for applications
  Configured and compiled with drivers specifically for our board
Applications written in C, compiled using GCC compiler for ARM (see Figure 2)
NARC API: low-level driver function library for basic services
  Initialize and configure on-chip peripherals of ARM-core processor
  Configure FPGA (SelectMAP protocol)
  Transfer data to/from FPGA, manipulate control lines
  Monitor and initiate network traffic
NARC protocol for job exchange (from remote workstation)
  NARC board application and client application must follow standard rules and procedures for responding to requests from a user
  User prepends a small header to the data (if any) containing information about the request before sending over network (see Figure 3)
Bootstrap software in on-board Flash, automatically loads and executes on power-up
  Configures clocks, memory controllers, I/O pins, etc.
  Contacts TFTP server running on network, downloads Linux and ramdisk
  Boots Linux, automatically executes NARC board software contained in ramdisk
Optional serial interface through HyperTerminal for debugging/development
  Field               Size
  RTYPE               1 byte
  Job ID              1 byte
  Undefined           2 bytes
  Data Size in bytes  4 bytes
  Data                (length given by Data Size)

  RTYPE
    00 – request status
    01 – configure FPGA
    02-FE – user-definable functions
    FF – reboot board
  Job ID
    Unique identifier of request
    Included with response
Figure 3 – Request header field definitions
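The header layout in Figure 3 can be sketched on the client side as follows. This is a minimal illustration, not the project's client code: the function names are hypothetical, and network (big-endian) byte order for the 4-byte size field is an assumption, since the slides do not state the byte order.

```python
import struct

# Layout from Figure 3: RTYPE (1 byte), Job ID (1 byte), undefined (2 bytes),
# data size in bytes (4 bytes). Big-endian byte order is an assumption.
HEADER = struct.Struct("!BBHI")

RTYPE_STATUS = 0x00      # request status
RTYPE_CONFIGURE = 0x01   # configure FPGA
RTYPE_REBOOT = 0xFF      # reboot board


def build_request(rtype: int, job_id: int, data: bytes = b"") -> bytes:
    """Return the 8-byte header followed by the payload (hypothetical helper)."""
    return HEADER.pack(rtype, job_id, 0, len(data)) + data


def parse_request(packet: bytes):
    """Split a packet into (rtype, job_id, data); reject truncated payloads."""
    rtype, job_id, _undefined, size = HEADER.unpack_from(packet)
    data = packet[HEADER.size:HEADER.size + size]
    if len(data) != size:
        raise ValueError("truncated payload")
    return rtype, job_id, data
```

For instance, a user-defined request with RTYPE 02 and a one-byte payload would be built with `build_request(0x02, 7, b"\x08")`, where the job ID 7 is arbitrary.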
Figure 2 – Software development process (NARC board application: main.c (main application), util.c (library routines), and narc.h (definitions, global vars), built via Makefile with arm-linux-gcc and placed in the ramdisk with the Linux kernel on the NARC board; client application: client.c, built with gcc on the user workstation)
NARC Board Architecture: FPGA Interface
Data communicated to/from FPGA by means of unidirectional data paths
  8-bit input port, 8-bit output port, 8 control lines (Figure 4)
  Control lines manage data transfer, also drive configuration signals
  Data transferred one byte at a time, full duplex communication possible
Control lines include following signals:
  Clock – software-generated signal to clock data on data ports
  Reset – reset signal for interface logic in FPGA
  Ready – signal indicating device is ready to accept another byte of data
  Valid – signal indicating device has placed valid data on port
  SelectMAP – all signals necessary to drive SelectMAP configuration
Figure 4 – FPGA interface signal diagram (ARM Out[0:7] to FPGA In[0:7] with a_valid and f_ready; FPGA Out[0:7] to ARM In[0:7] with f_valid and a_ready; shared clock and reset; SelectMAP port with D[0:7] and PROG, INIT, CS, WRITE, DONE)
FPGA configuration through SelectMAP protocol
  Fastest configuration option for Xilinx FPGAs, protocol emulated using GPIO pins of ARM
  NARC board enables remote configuration and management of FPGA
    User submits configuration request (RTYPE = 01), along with bitfile and function descriptor
    Function descriptor is ASCII string, formatted list of functions with associated RTYPE definition
    ARM halts and configures FPGA, stores descriptor in dedicated RAM buffer for user queries
  All FPGA designs must restrict use of all SelectMAP pins after configuration
    Some signals are shared between SelectMAP port and FPGA-ARM link
    Once configured, SelectMAP pins must remain tri-stated and unused
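The byte-at-a-time ARM-to-FPGA transfer described for this interface can be illustrated with a small software model. This is only a sketch under stated assumptions: `FakeFpga` is a hypothetical stand-in for the FPGA-side interface logic, the one-clock busy period after each byte is invented to exercise the ready/valid handshake, and the exact signal sequencing of the real link is not given in the slides.

```python
class FakeFpga:
    """Hypothetical model of the FPGA side of the ARM-to-FPGA data path."""

    def __init__(self):
        self.f_ready = True   # FPGA is ready to accept another byte
        self.received = []

    def clock_edge(self, a_valid: bool, data: int) -> None:
        """React to one software-generated clock pulse from the ARM."""
        if not self.f_ready:
            self.f_ready = True          # busy cycle over (assumed 1 clock)
            return
        if a_valid:
            self.received.append(data & 0xFF)  # latch the 8-bit input port
            self.f_ready = False         # busy while the byte is consumed


def send_bytes(fpga: FakeFpga, payload: bytes) -> int:
    """Send payload one byte at a time; return the number of clock pulses."""
    clocks = 0
    for byte in payload:
        while not fpga.f_ready:          # wait for f_ready before driving data
            fpga.clock_edge(False, 0)
            clocks += 1
        fpga.clock_edge(True, byte)      # drive data, assert a_valid, pulse clock
        clocks += 1
    return clocks
```

In this model every byte after the first costs an extra clock spent waiting on `f_ready`, which hints at how handshake overhead can eat into throughput on a software-clocked link.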
Results and Analysis: Raw Performance
FPGA interface I/O throughput (Table 1)
  1KB data transferred over link, timed
  Measured using hardware methods
    Logic analyzer – to capture raw link data rate, divide data sent by time from first clock to last clock (see Figure 9)
  Performance lower than desired for prototype
    Handshake protocol may add unnecessary overhead
    Widening data paths, optimizing software routine will significantly improve FPGA I/O performance
Network throughput (Table 2)
  Measured using Linux network benchmark IPerf
    NARC board located on arbitrary switch within network, application partner is user workstation
    Transfers as much data as possible in 10 seconds, calculates throughput based on data sent divided by 10 seconds
  Performed two experiments with NARC board serving as client in one run, server in other
  Both local and remote (remote location ~400 miles away, at Florida State University) IPerf partner
  Network interface achieves reasonably good bandwidth efficiency
External memory throughput (Table 3)
  4KB transferred to external SDRAM, both read and write
  Measurements again taken using logic analyzer
  Memory throughput sufficient to provide wire-speed buffering of network traffic
    On-chip Ethernet MAC has DMA to this SDRAM
    Should help alleviate I/O bottleneck between ARM and FPGA
Figure 9 – Logic analyzer timing

Table 1 – FPGA interface I/O performance (Mb/s)
                  Input   Output
  Logic Analyzer  6.08    6.12

Table 2 – Network throughput (Mb/s)
                 Local Network   Remote Network (WAN)
  NARC-Server    75.4            4.9
  Server-Server  78.9            5.3

Table 3 – External SDRAM throughput (Mb/s)
                  Read    Write
  Logic Analyzer  183.2   160
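The "data sent divided by elapsed time" method behind these tables can be written out as a short calculation. The sketch below assumes 1 KB = 1024 bytes and 1 Mb = 10^6 bits, neither of which the slides define, and it uses the nominal 100 Mb/s Fast Ethernet line rate as the reference for bandwidth efficiency.

```python
def throughput_mbps(num_bytes: int, seconds: float) -> float:
    """Data sent divided by elapsed time, in megabits per second."""
    return num_bytes * 8 / seconds / 1e6


def transfer_time_s(num_bytes: int, mbps: float) -> float:
    """Time needed to move num_bytes at a given rate in Mb/s."""
    return num_bytes * 8 / (mbps * 1e6)


# Time implied by Table 1 for the 1 KB test transfer into the FPGA.
t_link = transfer_time_s(1024, 6.08)

# NARC-Server local throughput (Table 2) relative to the 100 Mb/s line rate.
efficiency = 75.4 / 100.0
```

Under these assumptions the 1 KB transfer takes roughly 1.35 ms on the FPGA link, and the local network result corresponds to about 75% of the nominal line rate.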
Results and Analysis: Raw Performance
Reconfiguration speed
  Includes time to transfer bitfile over network, plus time to configure device (transfer bitfile from ARM to FPGA), plus time to receive acknowledgement
  Our design currently completes a user-initiated reconfiguration request with a 1.2MB bitfile in 2.35 sec
Area/resource usage of minimal wrapper for Virtex-II Pro FPGA
  Stats on resource requirements for a minimal design to provide required link control and data transfer in an application wrapper are presented below:
    Design implemented on older Virtex-II Pro FPGA
    Numbers below indicate requirements for wrapper only; unused resources remain available for user applications
    Extremely small footprint!
    Wrapper will occupy an even smaller fraction of a larger FPGA
Device utilization summary:
--------------------------------------------------------
Selected Device : 2vp20ff1152-5
Number of Slices: 143 out of 9280 1%
Number of Slice Flip Flops: 120 out of 18560 0%
Number of 4 input LUTs: 238 out of 18560 1%
Number of bonded IOBs: 24 out of 564 4%
Number of BRAMs: 8 out of 88 9%
Number of GCLKs: 1 out of 16 6%
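The reconfiguration figure quoted on this slide (a 1.2MB bitfile in 2.35 sec, end to end) can be turned into an effective rate for comparison with the raw throughput numbers. A minimal sketch, assuming "1.2MB" means 1.2 × 10^6 bytes and 1 Mb = 10^6 bits:

```python
BITFILE_BYTES = 1.2e6   # "1.2MB" taken as 1.2 * 10**6 bytes (assumption)
TOTAL_SECONDS = 2.35    # network transfer + FPGA configuration + acknowledgement

# Effective end-to-end rate of a user-initiated reconfiguration request.
effective_mbps = BITFILE_BYTES * 8 / TOTAL_SECONDS / 1e6
```

With these assumptions the effective rate is a little over 4 Mb/s, which folds together all three phases of the request rather than measuring any single link.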
Case Study Applications
Clustered RC Devices: N-Queens
  HPC application demonstrating NARC board’s role as generic compute resource
  Application characterized by minimal communication, heavy computation within FPGA
  NARC version of N-Queens adapted from previously implemented application for PCI-based Celoxica RC1000 board housed in a conventional server
  N-Queens algorithm is a part of the DoD high-performance computing benchmark suite and representative of select military and intelligence processing algorithms
  Exercises functionality of various developed mechanisms and protocols for job submission, data transfer, etc. on NARC
User specifies a single parameter N; upon completion the algorithm returns total number of possible solutions
Purpose of algorithm is to determine how many possible arrangements of N queens there are on an N × N chess board, such that no queen may attack another (see Figure 5)
Results are presented from both NARC-based execution and RC1000-based execution for comparison
Figure c/o Jeff Somers
Figure 5 – Possible 8x8 solution
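As a software reference for what the N-Queens job computes, the count of valid arrangements can be obtained with a standard bitmask backtracking search. This is a generic textbook formulation offered as a sketch; it is not the FPGA design used on the NARC or RC1000 boards, and the function name is hypothetical.

```python
def count_n_queens(n: int) -> int:
    """Count placements of n mutually non-attacking queens on an n x n board."""
    full = (1 << n) - 1  # one bit per column

    def place(cols: int, diag_l: int, diag_r: int) -> int:
        if cols == full:              # a queen has been placed in every row
            return 1
        total = 0
        free = full & ~(cols | diag_l | diag_r)
        while free:
            bit = free & -free        # lowest free column in the current row
            free -= bit
            total += place(cols | bit,
                           ((diag_l | bit) << 1) & full,
                           (diag_r | bit) >> 1)
        return total

    return place(0, 0, 0)
```

The single input N and single integer result match the job shape described above: `count_n_queens(8)` gives 92 for the standard chess board.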
Case Study Applications
Network processing: Bloom Filter
  This application performs passive packet analysis through use of a classification algorithm known as a Bloom Filter
    Application characterized by constant, bursty communication patterns
    Most communication is Rx over network, transmission to FPGA
    Filter may be programmed or queried
  NARC device copies all received network frames to memory, ARM parses TCP/IP header and sends it to Bloom Filter for classification
    User can send programming requests, which include a header and string to be programmed into Filter
    User can also send result collection requests, which cause a formatted results packet to be sent back to the user
    Otherwise, application constantly runs, querying each header against the current Bloom Filter and recording match/header pair information
  Bloom Filter works by using multiple hash functions on a given bit string, each hash function rendering indices of a separate bit vector (see Figure 6)
    To program, hash the input string and set the resulting bit positions to 1
    To query, hash the input string; if all resulting bit positions are 1, the string matches
  Implemented on Virtex-II Pro FPGA
    Uses slightly larger, but ultimately more effective application wrapper (see Figure 7)
    Larger FPGA selected to demonstrate interoperability with any FPGA
Figure 6 – Bloom Filter algorithmic architecture
Figure 7 – Bloom Filter implementation architecture
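The program and query operations described for the Bloom Filter can be modeled in software as below. This is a sketch under assumptions: the number of hash functions, the vector size, and the salted SHA-256 hashes are illustrative choices, since the slides do not specify the hash functions or dimensions of the FPGA design. It does follow the stated structure of one separate bit vector per hash function.

```python
import hashlib


class BloomFilter:
    """Software model with one bit vector per hash function (sizes assumed)."""

    def __init__(self, num_hashes: int = 4, vector_bits: int = 1024):
        self.num_hashes = num_hashes
        self.vector_bits = vector_bits
        self.vectors = [0] * num_hashes  # each vector stored as a Python int

    def _indices(self, item: bytes):
        # Derive one index per hash function by salting SHA-256 with the
        # function number (an illustrative stand-in for hardware hashes).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield i, int.from_bytes(digest[:4], "big") % self.vector_bits

    def program(self, item: bytes) -> None:
        """Set the bit selected by each hash function in its own vector."""
        for i, idx in self._indices(item):
            self.vectors[i] |= 1 << idx

    def query(self, item: bytes) -> bool:
        """Match only if every hash function finds its bit already set."""
        return all((self.vectors[i] >> idx) & 1
                   for i, idx in self._indices(item))
```

A programmed string always matches on query. An unprogrammed string usually does not, but false positives are possible by design, which is the usual trade-off of a Bloom filter.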
Experimental Setup
N-Queens: Clustered RC devices
  NARC device located on arbitrary switch in network
  User interfaces through client application on workstation, requests N-Queens procedure
    Figure 8 illustrates experimental environment
    Client application records time required to satisfy request
    Power supply measures current draw of active NARC device
  N-Queens also implemented on RC-enabled server equipped with Celoxica RC1000 board
    Client-side function call to NARC board replaced with function call to RC1000 board in local workstation, same timing measurement
    Comparison offered in terms of performance, power, cost
Figure 8 – Experimental environment (user workstation, NARC devices, and RC-enabled servers attached to an Ethernet network)
Bloom Filter: Network processing
  Same experimental setup as N-Queens case study
  Software on ARM co-processor captures all Ethernet frames
    Only packet headers (TCP/IP) are passed to FPGA
    Data continuously sent to FPGA as packets arrive over network
  With the NARC device attached to a switch, only a limited set of packets can be captured
    Only broadcast packets and packets destined for the NARC device can be seen
    Dual-port device could be inserted in-line with network link, monitor all flow-through traffic
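The header-extraction step described for the Bloom Filter setup (capture the Ethernet frame, pass only the TCP/IP header onward) can be sketched as follows. This is an illustration rather than the NARC board software, which the slides say is written in C. The function name is hypothetical, and the sketch assumes untagged Ethernet II frames carrying IPv4 and ignores VLAN tags and IPv6.

```python
import struct

ETH_HEADER_LEN = 14       # destination MAC, source MAC, EtherType
ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6


def extract_tcp_ip_header(frame: bytes):
    """Return the IPv4 + TCP header bytes of an Ethernet frame, or None."""
    if len(frame) < ETH_HEADER_LEN + 20:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None
    ip_start = ETH_HEADER_LEN
    ip_len = (frame[ip_start] & 0x0F) * 4        # IHL, in 32-bit words
    if ip_len < 20 or frame[ip_start + 9] != IPPROTO_TCP:
        return None
    tcp_start = ip_start + ip_len
    if len(frame) < tcp_start + 20:
        return None
    tcp_len = (frame[tcp_start + 12] >> 4) * 4   # data offset, in 32-bit words
    if tcp_len < 20 or len(frame) < tcp_start + tcp_len:
        return None
    return frame[ip_start:tcp_start + tcp_len]
```

Frames that are not IPv4/TCP return `None`, so a caller would simply skip them and forward only the extracted header bytes to the filter.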
Results and Analysis: N-Queens Case Study
First, consider an execution time comparison between our NARC board and a PCI-based RC card (see Figures 10a and 10b)
  Both FPGA designs clocked at 50MHz
  Performance difference is minimal between devices
  Being able to match performance of PCI-based card is a resounding success!
    Power consumption and cost of NARC devices drastically lower than that of server and RC-card combos
    Multiple users may share NARC device, PCI-based cards somewhat fixed in an individual server
Power consumption calculated using following method
  Three regulated power supplies exist in complete NARC device (network interface + FPGA board): 5V, 3.3V, 2.5V
  Current draw from each supply was measured
  Power consumption is calculated as sum of V×I products of all three supplies
Figure 10 – Performance comparison between NARC board and PCI-based RC card on server: N-Queens execution time (s) versus algorithm parameter N for NARC and RC-1000; (a) small board size, N = 5 to 10; (b) large board size, N = 11 to 14
Results and Analysis: N-Queens Case Study
Figure 11 summarizes the performance ratio of N-Queens between the NARC and RC-1000 platforms
Consider Table 4 for a summary of cost and power statistics
  Unit price shown excluding cost of FPGA
    FPGA costs offset when compared to another device
    Price shown includes PCB fabrication, component costs
  Approximate power consumption drastically less than server + RC-card combo
    Power consumption of server varies depending on particular hardware
    Typical servers operate off of 200-400W power supplies
  See Figure 12 for example of approximate power consumption calculation
Figure 11 – NARC / RC-1000 performance ratio versus algorithm parameter N (N = 5 to 14), plotted against an equivalency line

Table 4 – Price and power figures for NARC device
  NARC Board
    Cost per unit (prototype)   $175.00
    Approx. power consumption   3.28 W

Figure 12 – Power consumption calculation
  P = (5V)(I5) + (3.3V)(I33) + (2.5V)(I25)
  I5 ≈ 0.2A ; I33 ≈ 0.49A ; I25 ≈ 0.27A
  P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) ≈ 3.28W
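The power calculation of Figure 12 (sum of V×I over the three regulated supplies) can be reproduced in a few lines. The sketch uses the rounded current values quoted on the slide, so it lands at about 3.29 W, in line with the reported 3.28 W. The function and variable names are hypothetical.

```python
def supply_power_w(rails):
    """Sum of V * I over (voltage, current) pairs, in watts."""
    return sum(volts * amps for volts, amps in rails)


# (supply voltage in V, approximate measured current in A) from Figure 12.
NARC_RAILS = [(5.0, 0.20), (3.3, 0.49), (2.5, 0.27)]

narc_power = supply_power_w(NARC_RAILS)
```

The small gap from the slide's figure likely comes from rounding of the measured currents, which are given only approximately.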
Results and Analysis: Bloom Filter
Passive, continuous network traffic analysis
  Wrapper design was slightly larger than previous minimal wrapper used with N-Queens
    Still small footprint on chip, majority of FPGA remains for application
    Maximum wrapper clock frequency 183 MHz, should not limit application clock if in same clock domain
  Packets received over network link are parsed by ARM, with TCP/IP header saved in buffer
  Headers sent one-at-a-time as query requests to Bloom Filter (FPGA); when a query finishes, another header will be de-queued if available
  User may query NARC device at any time for results update, program new pattern
Device utilization summary:
-------------------------------------------------------
Selected Device : 2vp20ff1152-5
Number of Slices: 1174 out of 9280 13%
Number of Slice Flip Flops: 1706 out of 18560 9%
Number of 4 input LUTs: 2032 out of 18560 11%
Number of bonded IOBs: 24 out of 564 4%
Number of BRAMs: 9 out of 88 10%
Number of GCLKs: 1 out of 16 6%
Figure 13 – Device utilization statistics for Bloom Filter design
Figure 13 shows resource usage for Virtex-II Pro FPGA
Maximum clock frequency of 113MHz
  Not affected by wrapper constraint
  Significantly faster computation speed than FPGA-ARM link communication speed
FPGA-side buffer will not fill up, headers are processed before next header transmitted to FPGA
ARM-side buffer may fill up under heavy traffic loads
  32MB ARM-side RAM gives large buffer
Pitfalls and Lessons Learned
FPGA I/O throughput capacity remains persistent problem
  One motivation for designing custom hardware is to remove typical PCI bottleneck and provide wire-speed network connectivity for FPGA
  Under-provisioned data path between FPGA and network interface restricts performance benefits for our prototype design
  Luckily, this problem may be solved through a variety of approaches
    Wider data paths (16-bit, 32-bit) double or quadruple throughput, at expense of higher pin count
    Use of higher-performance co-processor capable of faster I/O switching frequencies
    Optimized data transfer protocol
Having co-processor in addition to FPGA to handle network interface is vital to success of our approach
  Required in order to permit initial remote configuration of FPGA, as well as additional reconfigurations upon user request
  Offloading network stack, basic request handling, and other maintenance-type tasks from FPGA saves significant amount of valuable slices for user designs
  Drastically eases interfacing with user application on networked workstation
  Active co-processor for FPGA applications, e.g. parsing network packets as in Bloom Filter application
Conclusions
A novel approach to providing FPGAs with standalone network connectivity has been prototyped and successfully demonstrated
  Investigated issues critical to providing remote management of standalone NARC resources
  Proposed and demonstrated solutions to discovered challenges
  Performed pair of case studies with two distinct, representative applications for a NARC device
Network-attached RC devices offer potential benefits for a variety of applications
  Impressive cost and power savings over server-based RC processing
  Independent NARC devices may be shared by multiple users without moving
  Tightly coupled network interface enables FPGA to be used directly in path of network traffic for real-time analysis and monitoring
Two issues that are proving to be a challenge to our approach include:
  Data latency in FPGA communication
  Software infrastructure required to achieve a robust standalone RC unit
While prototype design achieves relatively good performance in some areas, and limited performance in others, this is acceptable for concept demonstration
  Fairly complex board design; architecture and software enhancements in development
  As proof of “NARC” concept, important goal of project was achieved in demonstration of an effective and efficient infrastructure for managing NARC devices
Future Work
Expansion of network processing capabilities
  Further development of packet filtering application
    More specific and practical activity or behavior sought from network traffic
    Analyze streaming packets at or near wire-speed rates
  Expansion of Ethernet link to 2-port hub
    Permit transparent insertion of device into network path
    Provide easier access to all packets in switched IP network
Merging FPGA with ARM co-processor and network interface into one device
  Ultimate vision for NARC device
  Will restrict number of different FPGAs which may be supported, according to chosen FPGA socket/footprint for board
  Increased difficulty in PCB design
Expansion to Gig-E, other network technologies
  Fast Ethernet targeted for prototyping effort, concept demonstration
  True high-performance device should support Gigabit Ethernet
  Other potential technologies include (but are not limited to) InfiniBand, RapidIO
Further development of management infrastructure
  Need for more robust control/decision-making middleware
  Automatic device discovery, concurrent job execution, fault-tolerant operation