
IANUS: an FPGA-based System for High Performance Scientific Computing

Francesco Belletti 1,2, Maria Cotallo 3,4, Andres Cruz 3,4, Luis Antonio Fernández 5,4, Antonio Gordillo 6,4, Andrea Maiorano 1,4, Filippo Mantovani 1,2, Enzo Marinari 7, Victor Martín-Mayor 5,4, Antonio Muñoz-Sudupe 5,4, Denis Navarro 8,9, Sergio Pérez-Gaviro 3,4, Mauro Rossi 10, Juan Jesus Ruiz-Lorenzo 6,4, Sebastiano Fabio Schifano 1,2, Daniele Sciretti 3,4, Alfonso Tarancón 3,4, Raffaele Tripiccione 1,2, and Jose Luis Velasco 3,4

1 Dipartimento di Fisica, Università di Ferrara, I-44100 Ferrara (Italy)
2 INFN, Sezione di Ferrara, I-44100 Ferrara (Italy)
3 Departamento de Física Teórica, Facultad de Ciencias, Universidad de Zaragoza, 50009 Zaragoza (Spain)
4 Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), 50009 Zaragoza (Spain)
5 Departamento de Física Teórica, Facultad de Ciencias Físicas, Universidad Complutense, 28040 Madrid (Spain)
6 Departamento de Física, Facultad de Ciencia, Universidad de Extremadura, 06071 Badajoz (Spain)
7 Dipartimento di Fisica, Università di Roma "La Sapienza", I-00100 Roma (Italy)
8 Departamento de Ingeniería Electrónica y Comunicaciones, Universidad de Zaragoza, CPS, Maria de Luna 1, 50018 Zaragoza (Spain)
9 Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, Maria de Luna 3, 50018 Zaragoza (Spain)
10 ETH Lab - Eurotech Group, I-33020 Amaro (Italy)

This paper describes IANUS, a modular, massively parallel and reconfigurable FPGA-based computing system. Each IANUS module has a computational core and a host. The computational core is a 4×4 array of FPGA-based processing elements with nearest-neighbor data links. Processors are also directly connected to an I/O node attached to the IANUS host, a conventional PC. IANUS is tailored for, but not limited to, the requirements of a class of scientific applications characterized by regular code structure, a large ratio of computation to data-base size and heavy use of bit-manipulation operations. We assess the potential of this system with accurate performance figures of the IANUS prototype for a Monte Carlo code for statistical physics. For this application one module has the same performance as approximately 1500 high-end PCs. The flexibility and reconfigurability of the system promise similar levels of performance for other applications, such as optimization problems.

1 Overview

Monte Carlo simulations are a widely used investigation technique in many fields of theoretical physics, including, among others, Lattice Gauge Theories, statistical mechanics and optimization (such as, for instance, K-sat problems); see 1, 2 for a review.

In several cases the simulation of even small systems, described by simple dynamics equations, requires huge computing resources. Spin systems - discrete systems whose variables (that we call spins) sit at the vertexes of a D-dimensional lattice - fall in this category 3. Reaching statistical equilibrium for a lattice of 48³ sites requires 10¹²–10¹³ Monte Carlo steps, to be performed on O(100) copies of the system, corresponding to ≈ 10²⁰ spin updates, a challenge never attempted so far. Careful programming on traditional computers (introducing some amount of SIMD processing) yields an average spin-update time of ≈ 1 ns: the simulation outlined above would mean running on one PC for ≈ 10⁴ years (or for 1 year on ≈ 10000 PCs, optimistically and unrealistically assuming that perfect scaling takes place).
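The arithmetic behind these estimates is straightforward:

48³ × O(100) copies × 10¹³ steps ≈ 10²⁰ spin updates,
10²⁰ updates × 1 ns/update = 10¹¹ s ≈ 3 × 10³ years,

that is, of order 10⁴ years of single-PC time.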

The structure of simulation algorithms for spin systems - basically massive parallelism, regular code flow and heavy use of logical bit manipulation - is however very well suited to unconventional computer architectures that are clearly emerging at this point in time (see e.g. 4), and promises substantial performance gains.

In this paper we describe the architecture, the implementation, the tests performed on the first prototype and some measured performance statistics of IANUS (named after the double-faced ancient Roman god of doors and gates), a dedicated massively parallel and reconfigurable system based on FPGA technologies. IANUS explores such issues as massive on-chip parallel processing, close integration of memory and processors, and reconfigurable processing, and in doing so efficiently satisfies the number-crunching requirements associated to statistical physics simulations: each FPGA-based processor in the system has, for selected applications, the same performance as between 100 and 1000 high-end PCs.

This paper is structured as follows: in section 2 we start from the computing needs associated to typical algorithms of spin-system simulations and derive a set of requirements that are not usually addressed by existing high-performance architectures. In section 3 we describe the structure of the IANUS massively parallel system, which we have architected trying to match the requirements of section 2 in the most flexible way; we also provide details on the IANUS prototype and on the large IANUS system that we plan to assemble in fall 2007. In section 4 we describe the specific configuration of IANUS as a spin-simulation engine, seen as a prototype application whose performance is based on massive on-chip parallelism, on-chip multiple memory accesses and on-chip inter-process communications. This section also contains some performance figures measured on the prototype. Section 5 gives some details on the interactions between the IANUS-core and its host system, including some details on the programming framework that supports IANUS operations. The final section summarizes our results and discusses short- and long-term prospects.

2 Computing Requirements

Several relevant applications in high-performance scientific computing offer a very large amount of parallelism in their associated algorithms. A subset of these applications also exhibits additional features that make it relatively easy, within current technological constraints, to actually implement carefully focused architectures able to extract a very large fraction of the available parallelism.

A list of such features includes:

• A regular code, associated to computational kernels based on regular loop structure.

• Data structures characterized by memory accesses that can be predicted in advance.

• A small size of the memory data base.

• Data processing associated to bit manipulation (as opposed to traditional arithmetic performed on long data words).

The simulation of several models of spin systems is an example of a computational problem that requires huge computing resources and that shares all the properties listed above. The dynamics of the spin variable associated to site i (s_i) of the lattice is defined in terms of an energy equation (known as the Hamiltonian)

H = -\sum_{\langle i,j \rangle} J_{ij} x_i x_j s_i s_j - \sum_i h_i x_i s_i \qquad (1)

characterized by interaction parameters J_ij, external fields h_i, and dilution parameters x_i (flagging occupied vs. unoccupied sites). The notation ⟨i, j⟩ means that the summation runs only over geometrically neighboring sites of the lattice.

Spin systems are usually studied with Monte Carlo techniques. A Monte Carlo algorithm for a spin system updates at successive iteration steps the spin values of one or more sites of the lattice, following e.g. the Heat-Bath procedure. At each Heat-Bath step, a new value for s_i is chosen randomly according to a probability distribution that depends only on the values of the neighboring spins. Probabilities are constant in time and can be stored in a small table (of ≈ 10 entries).

A sequential C version of a typical Monte Carlo kernel is shown in figure 1. Inspection of figure 1 and of equation (1) shows that the procedure has several properties relevant in this context:

• The code kernel has a regular structure, associated to regular loops. At each iteration the same set of operations is performed on data stored at locations whose addresses can be predicted in advance.

• Data processing is associated to logical operations performed on bit-variables defining the site variables ((pseudo-)random numbers must also be generated).

• The data base associated to the computation is of the order of just a few bytes per site (e.g., ≈ 100 KBytes for a grid of 48³ sites).

Figure 1. The kernel of the Heat-Bath Monte Carlo simulation algorithm for the Edwards-Anderson model (one possible spin-system model). Variables s are defined at all lattice sites, as are the (constant) J couplings. In this case the lattice side is 24 (i.e. the total number of spins is 24³). HBT is a constant look-up table with 7 entries.
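As a concrete illustration, here is a minimal C sketch of a kernel of this kind; the L = 24 geometry and the 7-entry HBT table follow the caption of figure 1, while the data layout, the helper names and the random number source are illustrative assumptions, not the code of the figure itself.

/* Heat-Bath sweep sketch for the Edwards-Anderson model on an L^3
   lattice. Spins and couplings take the values +1/-1; HBT[k] holds the
   precomputed probability P(s_i = +1) for local field f = 2k - 6. */
#define L 24
#define V (L * L * L)

static int s[V];        /* spin variables, +1 or -1                    */
static int J[3][V];     /* couplings to the +x, +y, +z neighbors; the
                           -x coupling of site i is the +x coupling
                           stored at its -x neighbor, and so on        */
static double HBT[7];   /* heat-bath table, one entry per field value  */
double rnd01(void);     /* assumed uniform random number in (0,1)      */

void heatbath_sweep(void)
{
    for (int z = 0; z < L; z++)
        for (int y = 0; y < L; y++)
            for (int x = 0; x < L; x++) {
                int i   = x + L * (y + L * z);
                int ixm = (x + L - 1) % L + L * (y + L * z);
                int ixp = (x + 1) % L     + L * (y + L * z);
                int iym = x + L * ((y + L - 1) % L + L * z);
                int iyp = x + L * ((y + 1) % L     + L * z);
                int izm = x + L * (y + L * ((z + L - 1) % L));
                int izp = x + L * (y + L * ((z + 1) % L));

                /* local field: sum over the six coupled neighbors,
                   an even integer between -6 and +6 (7 values)        */
                int f = J[0][i] * s[ixp] + J[0][ixm] * s[ixm]
                      + J[1][i] * s[iyp] + J[1][iym] * s[iym]
                      + J[2][i] * s[izp] + J[2][izm] * s[izm];

                /* heat-bath update: spin up with probability P(+1)    */
                s[i] = (rnd01() < HBT[(f + 6) / 2]) ? +1 : -1;
            }
}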

What is not explicitly written in the sample code (but easily checked by inspection) is that a very large amount of parallelism is available: each site interacts with its nearest neighbors only, while correctness of the procedure requires that when one site is being updated its neighbors are kept fixed. As a consequence, up to one-half of all the sites, organized in a checkerboard structure, can in principle be operated upon in parallel. The same sequence of operations is performed while updating any site, so almost perfect load-balancing is also possible.
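In code, the checkerboard partition amounts to a one-line parity test (a trivial sketch):

/* Checkerboard partition: the color of a site is the parity of its
   coordinates. All six neighbors of a site have the opposite color,
   so all sites of one color can be updated simultaneously. */
static inline int color(int x, int y, int z)
{
    return (x + y + z) & 1;   /* 0 = "white", 1 = "black" */
}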

Several of the features that we have discussed above are shared by application algorithms relevant in different scientific areas. Graph coloring is an example in the broad area of optimization. The goal here is to assign a color c_i, belonging to a given allowed set C, to each node (labelled by i) of a graph, fulfilling the condition that adjacent nodes of the graph do not have the same color. A Monte Carlo procedure is also appropriate in this case: at each step one node is considered, a new color c'_i is proposed, the numbers of color conflicts associated to c_i and c'_i are compared, and a choice between the two is made, again with a given probability. What is different from the previous case is that memory access patterns are slightly more complex, since lists of arcs spanning from i have to be navigated to explore all adjacent nodes.

We want to define a processing system delivering very high performance (orders of magnitude more than possible with current PCs), able to deploy a very large fraction of the parallelism available in algorithms like those described above and to leverage the other features discussed above. The structure has to be implemented using readily available technologies. From the architectural point of view, our goal translates into two key features:

1. the processing features available inside the processor must be limited to the subset of functions required by the algorithm. In this way, for a given silicon budget, the largest possible number of processing elements, exploiting the parallelism offered by the algorithm, can be instantiated.

2. the performance made possible in principle by the previous point must be sustained by sufficient memory bandwidth (as an example - see later for details - in our first target application we move ≈ 4000 bits from/to memory per clock cycle). This dictates the use of fine-grained on-chip memory structures.

A methodology to study the best match of algorithm and architecture has been studied theoretically, in a slightly different context, in 5. Here we make the choice of using latest-generation FPGAs as the basic block of our system. Their reconfigurable structure matches point 1 above, and the availability of a large number of on-board RAM blocks matches point 2. The system that we have planned is described in the next section. What is still missing from the picture is the following:

• Parallelism has to be extracted by hand and painfully handcrafted into (e.g.) VHDL or Verilog code. Our main focus in the short term is: i) to deploy a system with enough computing power to match our performance requirements for simulation; ii) to show that our system is a resource-efficient computing engine for a focused class of problems. In the long term, our reconfigurable system will be able to benefit from any progress in automatic program partitioning and restructuring, or automatic code generation.


• The problem of the interplay between the logical processing structures configured on the FPGA and their actual layout on silicon, which has a potentially large impact on performance, is completely neglected for small lattices (or, more accurately, completely left to the standard placement tools made available by FPGA vendors). However, for large models a careful manual floorplan of the internal memories (more than 300) is mandatory. In some cases we have needed a few iterations between physical synthesis and place-and-route to reach the target frequency. As shown later on, this is not a bottleneck for our present applications. Again, any progress in this area can be immediately used by our system.

3 The IANUS architecture

3.1 General structure

In this section we describe IANUS (early ideas on the IANUS architecture were presented in 6). IANUS builds on previous attempts made several years ago within the SUE project 7. With respect to SUE, IANUS has a wider application focus and much higher expected performance, made possible by a more flexible design and by recent progress in FPGA technology.

Figure 2. IANUS-core: topology (A) and overview of the IANUS prototype implementation (B) with 16 SPs, the I/O module (IOP) and the IANUS host.

IANUS is based on modules containing a computational core (the IANUS-core) loosely connected to a host system (the IANUS-host) of one or more networked PCs.

The IANUS-core is based on:

• a 4×4 2-D grid of processing elements (each called SP) with nearest-neighbor links along both axes (and periodic boundary conditions),

• a so-called IO-Processor (IOP) connected with all SPs.

The IOP module controls the operation of all available SPs and merges and moves data to/from the IANUS-host. A bank of staging memory is available, and the main communication channel uses two Gigabit Ethernet links (a USB port and a serial interface are also available for debugging and test).

Both the IOP and the SPs are implemented with FPGAs. The prototype is based on Xilinx Virtex4-LX160 devices, while our planned large IANUS system will upgrade to Virtex4-LX200 FPGAs.

A block diagram of an IANUS module is shown in figure 2, while a picture of the prototype of the IANUS-core is shown in figure 3.

Each SP processor contains uncommitted logic that can be configured at will (on-the-fly reconfiguration of all SPs is handled by the IOP). The SP is the main computational engine of the system. Number-crunching kernels are allocated to one or more SPs. They run either as independent threads, ultimately reporting to the IANUS-host through the IOP, or they run concurrently, sharing data as appropriate across the nearest-neighbor links (in the current implementation each full-duplex link transfers four bytes per clock cycle with negligible latency, 1 or 2 clock cycles). A single task may execute on the complete core, or up to 16 independent tasks may be allocated on the system.

As explained in detail later, we expect that in a typical high-performance application each SP by itself will be reconfigured as a many-core processor (we use the terminology proposed recently in 4). A major design decision is that the SPs have no memory apart from the one available on-chip (of the order of ≈ 0.5 MByte per processing element). This choice constrains the class of problems that IANUS can handle. On the other hand, in their quest for performance, applications can rely on the huge bandwidth and low latency made possible by on-chip memory to feed all the cores configured onto the FPGAs.

The IOP is used as a relatively stable control and I/O interface for the SPs. However, close to 70% of the logical resources of its FPGA are available, so an active role in a complex application is possible. The IOP can, for example, act as a cross-bar switch for the array of SPs, supporting complex patterns of data transfers and possibly performing global operations on aggregated data sets. Each full-duplex point-to-point link transfers 2 bytes of data in each direction every clock cycle, again with small latency.

Figure 3. Building blocks of the IANUS-core: on the left, the SP daughter board of the IANUS system. On the right, a prototype of an IANUS processing board: the SPs are arranged in 4 lines of 4 processors each, while the IOP is placed at the center of the board.

The IOP and SP are engineered as small daughter-boards plugging onto a mother-board that we call the Processing Board (PB). This implementation allows each component to be upgraded independently. We are already considering mid-term upgrades, such as newer-generation FPGAs for the SPs and higher-performance core-host interfaces (using e.g. PCI-Express).

3.2 IOP structure

The current firmware configuration of the Input/Output Processor (IOP) focuses on the implementation of an interface between the SPs and the host. The IOP is not a programmable processor: it basically allows streaming of data from the host to the appropriate destination (and back), under the complete control of the IANUS-host.

The configured structure of the IOP is naturally split into two "worlds":

• the IOlink block handles the I/O interfaces between the IOP and the host-PC (gigabit channels, serial and USB ports). It supports the lower layers of the Gigabit Ethernet protocol, ensures data integrity and provides a bandwidth close to the theoretical limit.

• the multidev block contains a logical device associated to each hardware sub-system that may be reached for control and/or data transfer: there is a memory interface for the staging memory, a programming interface to configure the SPs, an SP interface to handle communication with the SPs (after they are configured) and several service/debug interfaces.

The link between these two worlds is the Stream Router, which forwards the streams coming from the IOlink to the appropriate device within the multidev according to a mask encoded in the stream.


Figure 4. Schematic logical representation of the IOP processor: on the left, the IOlink that handles the I/O interfaces between the IOP and the IANUS-host; on the right, the devices of the multidev.

A block diagram of the IOP is shown in figure 4. The structure of this IOP configuration:

• at the hardware level, is flexible enough that additional devices can be readily added without interference with the rest of the system;

• at the software/application level, supports the programming and run-time framework that we describe in section 5. Upgrades to this framework, either for better performance or to implement standard interfaces between conventional processors and FPGAs as they emerge and consolidate, will probably require few changes to the present structure.

3.3 Project status

A full prototype of the IANUS system (as well as several systems with a reduced number of SPs) has been under test since early 2007. The performance figures shown in section 4 were measured on these systems, which have shown remarkably stable operation.

Several IANUS modules can be grouped into a larger system: each core is packed in one standard 19" box. Several cores (8 to 10) and their associated hosts are placed in a standard rack. All IANUS-hosts are networked and connected to a supervisor host. End-users log onto the supervisor and connect to one (or more) available IANUS modules.

A schematic representation of our planned system is shown in figure 5. A large system (2 racks with 12 IANUS modules) is expected in fall 2007. Power dissipation for the full system, estimated from measurements on the prototype, is ≈ 4 kW, a huge improvement with respect to a large PC farm.

4 Reconfiguring IANUS for spin-system simulations

Systems described by the Hamiltonian (1) have been our main testbench during IANUS development. This Hamiltonian describes several different systems, well known in statistical physics, depending on the values assigned to its parameters. For example, the Edwards-Anderson spin glass model is obtained if the J_ij are randomly assigned the values ±1 with equal probability; the Random Field Ising Model has J_ij = 1 everywhere, while h_i may take the values +h or -h with probability 1/2; the Diluted Anti-Ferromagnet in a Field has J_ij = -1 and h_i = +h everywhere, while x_i takes the value 1 with probability d and 0 with probability 1 - d (d is the mean dilution).

Figure 5. Schematic logical representation of an IANUS system: a supervisor-PC works as the interface between users and the IANUS modules.

One is interested in studying the thermodynamic properties of the system, kept at temperature T. This is usually done by generating several spin configurations following the equilibrium statistical distribution ∝ exp(-H/T). Sampling equilibrium configurations is possible by means of standard Monte Carlo spin update algorithms such as the Heat Bath or Metropolis scheme.

Let us consider for instance the case d = 1, x_i = 1 ∀i (a non-diluted system). In the Heat Bath algorithm, the main assumption is that at any time each spin variable is in thermal equilibrium with its surrounding environment, i.e. nearest-neighbor spins and local applied fields. Defining the local energy of the spin on site 0 as

E(s_0) = -s_0 \sum_{\langle 0,j \rangle} J_{0j} s_j - h_0 s_0,

the probability for s_0 to be up is

P(+1) = \frac{e^{-E(+1)/T}}{e^{-E(+1)/T} + e^{+E(+1)/T}} \qquad (2)

The Hamiltonian contains no explicit information on the dynamics of the spin variables, so the "spin up" energy E(+1) is the only relevant quantity that we need to compute. The sequential update scheme is then as follows:

1. pick up the first site of the lattice and compute E(+1)

2. compute P (+1)

3. extract a random number 0 < ρ < 1

4. if ρ < P (+1) set the spin value to up, otherwise set it to down (−1)

5. repeat steps 1-4 for all lattice sites

The procedure is repeated as many times as needed to generate enough uncorrelated spin configurations. Alternatively, in the Metropolis scheme it is the "spin-flip" probability that is tested:

P(s_0 \to -s_0) = \min\left[ 1, e^{-\Delta E(s_0)/T} \right] \qquad (3)

where ΔE(s_0) = E(-s_0) - E(s_0) = -2E(s_0). In both schemes the transition probabilities take only a limited set of values, corresponding to discrete energy values (7 different values for each applied field orientation), so they may be tabulated and addressed by an integer representation of E(+1) or E(s_0). The pointer to the table is itself computed by just a few logical bitwise operations. Moreover, one can reach a high degree of internal parallelism inside the FPGA-based processing element, allowing for an efficient parallel update of many sites of the lattice (known in the trade as synchronous multi-spin coding).
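A sketch of how such a table pointer can be formed, with spins and couplings coded as single bits (0 for +1, 1 for -1); the names and the zero-field simplification are illustrative:

/* A bond is unsatisfied exactly when spin ^ neighbor ^ coupling == 1,
   so the sum of the six XORs is an integer 0..6 that directly
   addresses a 7-entry probability table. prob[u] is assumed to hold
   the precomputed probability that the new spin bit is 1, given u
   unsatisfied bonds. */
unsigned heatbath_bit(unsigned si, const unsigned nb[6],
                      const unsigned jj[6], const double prob[7],
                      double rnd)          /* rnd uniform in (0,1) */
{
    unsigned unsat = 0;                    /* unsatisfied-bond count */
    for (int d = 0; d < 6; d++)
        unsat += si ^ nb[d] ^ jj[d];
    return rnd < prob[unsat];              /* new spin bit */
}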

In fact, the update scheme leaves freedom in the choice of the sequence of sites to be updated, so several spins may be tested simultaneously provided there are no nearest neighbors among them. This can be achieved, for instance, by labelling all sites in the lattice as "black" and "white" as in a checkerboard: all black (white) sites can be simultaneously tested for update. A further step in parallelizing the computation is to consider two "replicas", i.e. two independent systems having the same interactions, fields and dilution configurations: one arranges all black sites of replica 1 and all white sites of replica 2 on one lattice (we call it "P") and all remaining sites on a second one (the "Q" lattice). All P (Q) lattice sites can be updated at the same time, since all their neighbors are on lattice Q (P). In practice, the number of simultaneous updates is only limited by FPGA resources. As an example, on a Xilinx Virtex4-LX160 FPGA the available resources allow up to 512 parallel updates, so systems on lattices of up to 512 = 8³ sites can be updated as a whole. For larger systems, the global update takes more than one step.

Figure 6. Parallel update scheme. The spins, their neighbors, and all other relevant values are passed to the update cells, where the local energy or the "spin-up" energy is computed. A probability which is a function of the energy (and stored in a look-up table) is compared with a random number (RNG) in order to compute the new value for the spins.

The computational architecture that we configure on each SP for our Monte Carlo simulations is shown in figure 6.

As a guide to understanding the structure: we have a set (512 in the picture) of processing engines (also called Update Cells) that receive as input all the variables and parameters needed to update one spin, and perform all the associated arithmetic and logic operations, producing the updated value of the spin variable. Data (variables and parameters) are kept in memories and fed to the appropriate update cell. Updated spin values are written back to memory.

Variables and parameters are stored in the BLOCK-RAMs available in the Virtex4 FPGAs. Transition probabilities (look-up tables), update cells and random number generators are implemented as configurable logic blocks. The choice of an appropriate storage structure for data, and the provision of appropriate data channels to feed all update cells with the values they need, is a complex challenge; designing the update cells is a comparatively minor task, since the latter elements perform relatively simple logical functions. Updating a spin in a cubic lattice involves reading six nearest neighbors, six interaction variables, possibly the local spin variable (Metropolis algorithm) and local parameters (dilution, applied field). Data has to be routed to and from the update cells taking into account that only two accesses per clock cycle are allowed on BLOCK-RAMs. Let us consider for definiteness a system of linear size L = 16 (volume = 4096 sites). The spin memory structure is made up of 16 different memories of 16-bit words, with depth 16. A single spin or interaction variable is coded by means of a single bit, so we need 5 such structures to store the P, Q and interaction lattices (only interactions along the positive axis directions are needed). This way we can access a whole plane of a lattice (256 sites) for reading or writing with a single operation per memory. At regime, in the Heat Bath algorithm, updating a plane of the P lattice requires two reads per Q memory, one read per interaction memory and one write per P memory. The scheme is only slightly modified for larger systems, for which the limit of 512 parallel updates forces us to divide the lattice into sub-blocks.
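The sizing of this structure can be checked directly:

16 memories × 16-bit words = 256 bits = one full 16 × 16 lattice plane per access,
16 planes × 256 sites = 16³ = 4096 sites per bit-lattice.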

Look-up tables for probability values are replicated in configurable logic in such a way that each of them is accessed by only two update cells at a time.

Random number generators are also implemented in configurable logic. We use the Parisi-Rapuano shift-register generator 8, whose hardware implementation is relatively simple.


The Parisi-Rapuano random number generator is defined as

I(k) = I(k-24) + I(k-55), \qquad R(k) = I(k) \otimes I(k-61) \qquad (4)

where I(k-24), I(k-55) and I(k-61) are elements (32 bits wide) of an in-principle infinitely long buffer, and ⊗ denotes the bitwise XOR operation. Only 62 values of the buffer must be kept in the system at any time, so we use a circular buffer (the wheel) that we initialize with externally generated random values. I(k) is the new element of the updated wheel, and R(k) is the generated pseudo-random value. As a new random value is computed, the wheel is rolled one position and the new value is inserted. N wheels are needed to compute N random numbers per cycle. However, just one wheel may be used to compute k values at the same time, if all the corresponding arithmetic and logic operations are performed and the wheel is rolled by k positions (with k new values inserted). Sharing one wheel saves storage but increases routing congestion. A good trade-off is found in a broad plateau around k ≈ 50.
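A scalar C sketch of this generator follows; the hardware version produces many values per cycle from a shared wheel, and the names here are illustrative:

/* Parisi-Rapuano generator of eq. (4). The 62-entry circular buffer
   ("wheel") must be seeded with externally generated random values
   before use. */
#include <stdint.h>

#define WHEEL 62
static uint32_t wheel[WHEEL];   /* seeded externally */
static int k = 0;               /* current wheel position */

uint32_t parisi_rapuano(void)
{
    uint32_t ik = wheel[(k - 24 + WHEEL) % WHEEL]       /* I(k-24)  */
                + wheel[(k - 55 + WHEEL) % WHEEL];      /* +I(k-55) */
    uint32_t rk = ik ^ wheel[(k - 61 + WHEEL) % WHEEL]; /* ^I(k-61) */
    wheel[k] = ik;              /* roll the wheel: insert I(k)      */
    k = (k + 1) % WHEEL;
    return rk;                  /* R(k), the pseudo-random output   */
}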

In spite of these optimization steps, random number generation is the main bottleneck in the implementation of our algorithm. We have to stop at 512 (1024) random values (and therefore spin updates) per clock cycle on the Virtex4-LX160 (Virtex4-LX200) FPGA. Note that at our clock frequency of 62.5 MHz, each SP generates 3.2 × 10¹⁰ (6.4 × 10¹⁰) 32-bit random numbers per second, corresponding to 32 (64) Gops (not accounting for XOR operations). BLOCK-RAM size limits the largest lattice size that we can handle: L = 88 for the LX160 and L = 100 for the LX200 (larger systems have to be split over two or more cooperating SPs). Independent simulations are carried out on each SP of the IANUS system, since several simulations associated to different parameters of the system are needed anyway.

In short, we are in the rewarding situation in which: i) the algorithm offers a large degree of parallelism; ii) the processor architecture does not introduce any bottleneck to the actual exploitation of the available parallelism; iii) the performance of the actual implementation is only limited by the hardware resources contained in the FPGAs.

At the clock frequency quoted above (62.5 MHz), the average update time for one SP is 32 ps/spin (16 ps/spin for LX200-based nodes), that is, 2 ps/spin (1 ps/spin) for an IANUS-core.
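These figures follow directly from the clock frequency and the update parallelism:

1 / (512 × 62.5 MHz) = 31.25 ps/spin ≈ 32 ps/spin per LX160-based SP,
31.25 ps/spin / 16 SPs ≈ 2 ps/spin per IANUS-core,

and, equivalently, 512 × 62.5 × 10⁶ ≈ 3.2 × 10¹⁰ updates (and random numbers) per second per SP.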

It is interesting to compare these figures with those appropriate for a PC. Understanding what exactly has to be compared is not completely trivial. Popular PC codes update in parallel the same site of a large number (up to 128) of replicas of the system, sharing the same random number across all samples: this scheme (we call it asynchronous multi-spin coding, AMSC) is useful for statistical analysis. On the other hand, it is less appropriate when a small number of replicas of a large system must be updated many (e.g., 10¹²–10¹³) times. In this case an algorithm that updates in parallel several spins of the same lattice (exactly what we do in IANUS) obviously performs better. Such codes (that we call synchronous multi-spin coding, SMSC) were proposed several years ago, but never widely used. We put a rather large effort into developing efficient AMSC and SMSC codes on a high-end PC (an Intel Core2Duo (64-bit) 1.6 GHz processor). A comparison - see table 1 - shows that one IANUS-core has the performance of hundreds (or even thousands) of PCs.

              LX160       LX200       PC (SMSC)      PC (AMSC)
Update Rate   2 ps/spin   1 ps/spin   3000 ps/spin   700 ps/spin

Table 1. Comparing the performance of one IANUS-core and two different PC codes.

A straightforward generalization of the scheme presented here for spin systems works efficiently for Potts systems, where the dynamic variables s_i may take more than two values, on three- and four-dimensional lattices. In this case we are limited to smaller lattice sizes because of memory limitations, but the performance gain with respect to a traditional CPU is about ten times greater than in the spin-system case.

At present we are fine-tuning for IANUS a Monte Carlo procedure for random graph coloring (a prototype of a K-sat problem), for which we also anticipate very large performance gains with respect to standard PCs. This code, which we describe elsewhere together with the Potts systems, will assess the performance potential of FPGA-based processors for algorithms associated to irregular memory access patterns. Apart from this issue, we stress that the graph coloring problem with N colors is analogous to the energy minimization of an antiferromagnetic model of N-valued Potts variables, so we expect IANUS to be effective in such application fields as well.


5 IANUS Programming Framework

Figure 7. Framework of an application running on the IANUS system.

Efficient applications on IANUS spend most of their processing time inside the IANUS-core, with limited data exchange with the IANUS-host. Typically the host performs run initialization, uploads data onto the FPGAs and waits for results. This is at variance with some recent architectures, like the Cray-XD1 and others, in which one or more FPGA co-processors are directly attached to a traditional CPU; in that case the key role of the configurable processors is to accelerate an algorithm by executing fine-grained processing tasks. The limited bandwidth between the core and the host does not allow this mode of operation in IANUS: the interface between the two elements of the system is therefore optimized for loose control of the core and for streaming transfers of relatively large data sets across the two subsystems.

The programming framework developed for IANUS is intended to meet the requirements of the prevailing operating modes of our system, outlined above. Applications running on the IANUS system can be thought of as split into two sub-applications: one, called the Software Application (SA), written for example in C and running on the IANUS-host; the other, called the Firmware Application (FA), written for example in VHDL and running on the IANUS-core. As shown in figure 7, the two entities SA and FA are connected by a Communication Infrastructure (CI), allowing them to exchange data and perform synchronization operations directly and in a transparent way.

The CI abstracts the low-level communication between the SA and FA applications, implemented in hardware by the IOP and its interfaces. It includes three main components:

• a C communication library linked by the SA,

• a communication firmware running on the IOP processor, interfacing both the host PC and the SP processor,

• a VHDL library linked by the FA.

The firmware running on the IOP processor communicates with the host PC over a dual gigabit channel, using the standard raw-Ethernet communication protocol. The choice of raw Ethernet is based on the following considerations:

• the standard TCP/IP protocol provides a lot of functionality that is not needed for our purposes;

• a hardware implementation of the TCP/IP stack is non-trivial, uses far more hardware resources and is more time-critical than a raw-Ethernet implementation;

• the TCP/IP protocol introduces overheads in the communication.

To guarantee reliability of the communication in the direction from the IOP to the PC, we adopt the Go-back-N protocol 9. In the Go-back-N protocol, N frames are sent without waiting for an acknowledgment from the receiver. After N frames have been transmitted, the sender waits for an acknowledgment. If the acknowledgment is positive it starts to send the next N frames, otherwise it sends the last N frames again. The receiver waits for N frames and, after receiving all the frames, sends an acknowledgment to the sender. If some frames are lost (typically dropped by the network card because of a CRC error, or by the Linux operating system because of a network buffer overflow), the receiver gets a timeout and sends back a negative acknowledgment, asking for re-transmission of the last N frames. Using the Go-back-N protocol we reach approximately 90% of the full gigabit bandwidth, transferring messages whose length is of the order of 1 MB, using the maximum data payload per frame (1500 bytes) and N = 64; see figure 8.
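An illustrative C sketch of the sender side of this scheme; send_frame() and wait_ack() are hypothetical stand-ins for the raw-Ethernet primitives, not the actual library calls:

/* Go-back-N sender loop (N = 64, 1500-byte payloads, as in the text):
   transmit a burst of up to N frames, wait for the acknowledgment,
   and resend the whole burst on a negative ack or timeout. */
#include <stdbool.h>
#include <stddef.h>

#define GBN_N   64
#define PAYLOAD 1500                 /* max data payload per frame */

bool send_frame(const unsigned char *buf, size_t len);  /* assumed */
bool wait_ack(void);   /* true = ack, false = nack or timeout      */

void gbn_send(const unsigned char *msg, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        size_t off;
        do {
            off = sent;
            /* transmit up to N frames without waiting for an ack */
            for (int f = 0; f < GBN_N && off < len; f++) {
                size_t chunk = (len - off < PAYLOAD) ? len - off : PAYLOAD;
                send_frame(msg + off, chunk);
                off += chunk;
            }
        } while (!wait_ack());       /* nack/timeout: resend the burst */
        sent = off;                  /* burst acknowledged: advance    */
    }
}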


Figure 8. Measured transfer bandwidth of the IOP processor. Red bullets are for write operations (host to core), blue triangles are for read operations (core to host). The lines are fits to the simple functional form B(s) = s / (\alpha + \beta s), where s is the message size.
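In this functional form α plays the role of a fixed per-message overhead (a latency-like constant) and 1/β is the asymptotic bandwidth, since B(s) → 1/β for large s.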

In the other direction we did not adopt any communication protocol since, as explained in section 3, the IOP interface, barring hardware errors, ensures that no packets are lost. Incoming frames are protected by the standard Ethernet CRC code, and errors are flagged by the IOP processor.

Data coming from the SA application are interpreted by the IOP as commands for itself (to set, for example, internal control registers) or are routed to one or more SPs according to specific control information packed together with the data, as described in section 3.2. Data coming from the FA application are packed as bursts of N frames, according to the Go-back-N protocol.

The IOP is also responsible for programming the SP processors. The firmware to be uploaded onto the SPs is loaded into the staging memory attached to the IOP, and then a programming procedure, fully handled by the IOP, starts.

Information exchanged by the SA and FA applications is packed in a data-stream whose content depends only on the specific FA application running on the SP processors. Data-streams are sent to or received from the FA application through the functions provided by the C communication library. The library encapsulates the outgoing data-stream into frames and sends them to the IOP processor via the Linux network system, using a dedicated gigabit network card. The library is also responsible for managing the Go-back-N protocol for frames incoming from the IOP, and for assembling them into a data-stream to be delivered to the SA application.

For further improvement, the C communication library can easily be interfaced to a custom Linux network card driver, to send and receive frames completely bypassing the Linux network system. An implementation for Intel PRO/1000 network cards is under test: first measurements show an improvement in both latency and bandwidth of the communication.

The VHDL communication library provides a set of entities to be linked into the FA application, to send and receive data-streams to and from the SA application. The VHDL entities standardize the communication interface between the IOP and the SPs, as described in section 3.

Developers of applications running on the IANUS system have to write the SA and the FA relying on the CI communication infrastructure, to make the two applications cooperate.

A typical SA application configures, using the functions provided by the communication library, the SP processors with the firmware corresponding to the FA program; it then loads input data for the FA application and waits for the incoming results. As previously underlined, all data are exchanged via the communication library. A typical FA application waits for incoming requests, performs some calculation and sends back the results.
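A hypothetical skeleton of such an SA application; since the text does not spell out the C communication library interface, every ianus_* name below is an illustrative placeholder:

/* Hypothetical host-side (SA) flow: configure, upload, wait, collect. */
#include <stddef.h>

typedef struct ianus_core ianus_core;                 /* placeholder  */
ianus_core *ianus_open(const char *ifname);           /* placeholder  */
void ianus_configure(ianus_core *c, const char *bitstream_file);
void ianus_send_stream(ianus_core *c, const void *buf, size_t len);
void ianus_recv_stream(ianus_core *c, void *buf, size_t len);
void ianus_close(ianus_core *c);

int main(void)
{
    static unsigned char input[1 << 20], result[1 << 20]; /* examples */

    ianus_core *core = ianus_open("eth1");        /* link via the IOP */
    ianus_configure(core, "fa_heatbath.bit");     /* program the SPs  */
    ianus_send_stream(core, input, sizeof input); /* upload input     */
    /* ... the FA application runs autonomously on the core ...       */
    ianus_recv_stream(core, result, sizeof result);  /* collect data  */
    ianus_close(core);
    return 0;
}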

The approach adopted to develop applications for the IANUS system will allow the use of development toolkits to automatically decompose an application into cooperating modules. Some modules are executed on the PC side, while other modules are executed on dedicated hardware. Currently available tools decompose, for example, the application into a set of C modules and a set of VHDL modules. The C modules are compiled, linked together with the C communication library and executed on the PC. The VHDL modules are synthesized together with the VHDL communication library and mapped onto the SPs.

6 Conclusions

In this paper we have described the IANUS computer, a parallel system that explores the performance potential of large FPGAs on massively parallel fine-grained algorithms. We have shown that very high performance can be reached on a class of applications relevant for statistical physics. These results are obtained by expressing a large fraction of the available parallelism, made possible by the efficient use of the logical resources inside the FPGA. Equally important is the use of fine-grained memory structures that provide the huge memory bandwidth required to sustain performance.

Our present results are admittedly obtained for a large but limited class of applications and are based on careful handcrafting of the kernel codes. In our opinion, however, our experience teaches some lessons relevant in a more general context:

• we have put into practice the performance potential of a many-core architecture, exploiting all the parallelism available in the application up to the limits of the hardware resources.

• FPGAs can be configured to perform functions that traditional CPUs do poorly, allowing a large fraction of the available resources to be put to useful work. This dramatically compensates for the overheads associated to reconfigurability and inherent to any FPGA structure.

In the future we will continue to develop IANUS applications. The migration of other applications to IANUS will undoubtedly benefit from progress in the automatic mapping of codes onto reconfigurable computers.


References

1. D. P. Landau and K. Binder, "A Guide to Monte Carlo Simulations in Statistical Physics", Cambridge University Press (2005).

2. See the papers contained in the special issue of Computing in Science and Engineering, vol. 8, no. 1, January/February 2006 (S. Gottlieb, guest editor).

3. See for instance P. Young, "Spin Glasses and Random Fields", World Scientific (1998).

4. K. Asanovic et al., "The Landscape of Parallel Computing Research: A View from Berkeley", Tech. Report UCB/EECS-2006-183 (2006).

5. G. Bilardi, A. Pietracaprina, G. Pucci, F. Schifano, R. Tripiccione, "The Potential of On-Chip Multiprocessing for QCD Machines", High Performance Computing - HiPC 2005 (D. A. Bader, M. Parashar, V. Sridhar, editors), Lecture Notes in Computer Science, vol. 3769, Springer (2005).

6. F. Belletti et al., "IANUS: An Adaptive FPGA Computer", Computing in Science and Engineering, vol. 8, no. 1, January/February 2006, 41.

7. A. Cruz et al., "SUE: A Special Purpose Computer for Spin Glass Models", Comp. Phys. Comm. 133 (2001) 165.

8. G. Parisi and F. Rapuano, "Effects of the random number generator on computer simulations", Phys. Lett. B 157 (1985) 301.

9. S. Sumimoto et al., "The design and evaluation of high performance communication using a Gigabit Ethernet", Proceedings of the 13th International Conference on Supercomputing (1999) 260.
