
A hardware accelerator implementation for real-time collision detection

Fredy Augusto M. Alves¹, Lucas Bragança da Silva¹, Ricardo S. Ferreira³, Peter Jamieson², José A. Miranda Nacif¹

¹ Institute of Exact Sciences and Technology, Federal University of Viçosa, Campus UFV-Florestal, Brazil

² College of Engineering and Computing, Miami University, USA

³ Department of Informatics, UFV, Brazil


Index Terms: Parallel Processor, Hardware Accelerator, Computer Architecture, Collision Detection

I. ABSTRACT

Collision detection algorithms determine when virtual objects collide and compute the results of those collisions. They are used in many research areas, such as simulation, automatic path finding, and tolerance checking, and they usually run in real time. Intel created the HARP platform, which connects the CPU and the FPGA through a PCI link with 8 GB/s of bandwidth. This reduces memory latency and allows real-time processing of fine-grained applications such as collision detection algorithms. In this paper we propose a heterogeneous system that uses the FPGA on the HARP as an accelerator for collision detection algorithms. Our results show a speedup of up to 14.5% in execution time.

II. INTRODUCTION

With the recent acquisition of Altera by Intel, the tendency is that FPGAs will be included in CPUs as accelerators for a wide range of applications. Collision detection algorithms determine when virtual objects collide and compute the results of those collisions. They are used in many research areas, such as simulation, automatic path finding, and tolerance checking. Applications can be found in games, factory simulators, training applications for medical procedures, virtual reality, etc. [1].

Collision detection in games and simulators is usually implemented in engines in the form of APIs that can be used out of the box. In these applications, collision detection is done in real time, which means that collisions are processed as soon as they happen. As the applications that use this type of algorithm grow more complex in order to resemble reality more closely, the number of collisions in each simulation step has been increasing, so finding ways to improve efficiency is crucial.

In a single simulation step we have access to a set of collision data. One of the most important advantages of FPGAs over CPUs is their capacity for parallel processing, which is ideal for collision detection algorithms. Before the acquisition of Altera by Intel, the use of FPGAs as accelerators for this type of algorithm was highly affected by memory latency. This changed with HARP (Heterogeneous Accelerator Research Platform), a new platform by Intel that consists of a Xeon processor connected to an Altera Stratix V FPGA by a PCI link with 8 GB/s of bandwidth, which drastically reduces the memory latency. The ODE [2] (Open Dynamics Engine) is an open-source engine used by a wide variety of games and simulators.

Applications for HARP are developed with the AAL (Accelerator Abstraction Layer) framework. AAL uses a service abstraction for accelerators: the user implements a method that accelerates part of an application, and this method can be called like any other function from a C++ application such as ODE.

In this paper we develop a novel hardware accelerator for sphere collision detection as a proof of concept for optimizing this type of algorithm on FPGAs. It uses many processing units to process collisions in parallel, a producer-consumer approach to improve communication efficiency, and pipelining to improve throughput. The algorithm uses floating-point arithmetic. We tested the design on the HARP platform and collected results in real time for many ODE simulation benchmarks. We obtained an improvement of up to 14.5%.

This paper is outlined as follows. In Section III we briefly explain the background of the collision detection algorithm and its implementation on the HARP platform. In Section IV we explain how our results were obtained. In Section V we put our work into perspective. In Section VI we present and discuss our results. Then, in Section VII, we conclude our work and describe future work.

III. BACKGROUND

A. Collision detection algorithm

The ODE offers a collision detection engine and also provides an interface for reimplementing its methods. This engine receives information about the shape and position of each body in a space. At each simulation step, the collision engine is responsible for identifying which bodies are colliding and returning contact points, which are structures that hold information about the collision.

B. Contact points

Contact points are structs returned by the collision detection engine. They are composed of the following attributes:

• Pos: The position of the contact point in space.
• Normal: The resultant vector that is perpendicular to the collision contact point.
• Depth: The depth of penetration of the bodies into each other.
• G1 and G2: The bodies colliding at the contact point.
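The attributes listed above can be pictured as a small record type. The following Python sketch is only illustrative: the field names mirror the list, while the `ContactPoint` class name and the use of integer body identifiers for G1 and G2 are assumptions made for this example and are not taken from ODE.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]  # x, y and z coordinates


@dataclass
class ContactPoint:
    """Illustrative stand-in for the contact point struct (hypothetical name)."""
    pos: Vec3      # Pos: position of the contact point in space
    normal: Vec3   # Normal: vector perpendicular to the contact point
    depth: float   # Depth: how far the bodies penetrate each other
    g1: int        # G1: first colliding body (integer id is an assumption)
    g2: int        # G2: second colliding body (integer id is an assumption)


# Example record for two bodies with ids 0 and 1.
contact = ContactPoint(pos=(0.0, 0.0, 0.5), normal=(0.0, 0.0, 1.0),
                       depth=0.25, g1=0, g2=1)
```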

C. Sphere Collision detection algorithm

In our case study, we reimplemented the method for collision detection between two spheres. The pseudocode for this algorithm, with its inner methods, is given in Algorithm 1. The inputs are the positions of two spheres in space (p1 and p2) and their respective radii (r1 and r2); all data is in 32-bit IEEE 754 floating-point format. The output is a contact point for the collision.

The algorithm begins by calculating the distance between the centers of the two spheres and storing it in the variable d. Based on d, the algorithm has three possible execution flows. Flow 1 is when the collision does not happen; we call this a fake collision. Flow 2 is when the bodies barely touch each other; this is a grazing collision. Flow 3 is when the bodies collide with each other and the collision has a depth; this is a real collision. Flow 2 almost never happens, so we focus our study on Flows 1 and 3.

By studying the data dependencies in the algorithm, we were able to identify which calculations could be executed in parallel. We divided the calculations into stages that execute in sequence, with the operations inside each stage running in parallel on a parallel processor such as an FPGA. The stages are listed in Algorithm 2. In this project we developed a Verilog RTL design to execute those stages.

As we are dealing with bodies in a 3D space, p1, p2, cNormal, and cPos are all vectors of size 3 that hold the x, y, and z coordinates. This means that each operation in Algorithm 1 that uses any of these vectors is actually three operations.

D. The System architecture

An overview of the system architecture is shown in Figure 1. On the FPGA side, in addition to the Accelerator Function Unit (AFU) for the collision detection algorithm, the system uses another component provided by Intel, the System Protocol Layer 2 (SPL2), which is responsible for memory address translation, since the FPGA uses virtual addressing. The shared memory can go up to 2 GB.

The shared memory is divided into sections. The device status memory (DSM) is a 4 KB pinned memory region that holds the status of the device. The control memory is a memory space reserved for handshake signals between the CPU and the FPGA; it is used for the start and reset signals. The Src Buffer is used by the CPU to send data to the AFU, and the Dst Buffer is used by the FPGA to send results to the software application. The VAFU2_CNTXT holds the pointers to the buffers.

The CPU side is composed of the ODE simulations and the AAL application used to communicate with the AFU.

Algorithm 1 Spheres collision detection.

    d ← DCALCPOINTSDISTANCE3(p1, p2)
    Flow 1: Fake Collision
    if d > (r1 + r2) then
        return 0
    end if
    Flow 2: Grazing Collision
    if d ≤ 0 then
        cPos ← p1
        cNormal ← (1, 0, 0)
        cDepth ← r1 + r2
    Flow 3: Real Collision
    else
        d1 ← DRECIP(d)
        cNormal ← (p1 − p2) ∗ d1
        k ← 0.5 ∗ (r2 − r1 − d)
        cPos ← p1 + cNormal ∗ k
        cDepth ← r1 + r2 − d
    end if

    function DCALCPOINTSDISTANCE3(p1, p2)
        tmp ← DSUBTRACTVECTORS3(p1, p2)
        res ← DCALCVECTORLENGTH3(tmp)
        return res
    end function

    function DSUBTRACTVECTORS3(p1, p2)
        res ← p1 − p2
    end function

    function DCALCVECTORLENGTH3(tmp)
        res ← sqrt(tmp[0] ∗ tmp[0] + tmp[1] ∗ tmp[1] + tmp[2] ∗ tmp[2])
    end function
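As a reading aid, Algorithm 1 can be transcribed into a short runnable sketch. This is a minimal Python version written for illustration, not the ODE source or the RTL design: the function names, the dictionary used for the contact point, and the use of `None` for a fake collision are assumptions, and Python floats are 64-bit rather than the 32-bit format used in the paper.

```python
import math


def calc_points_distance3(p1, p2):
    """Distance between two 3D points (mirrors DCALCPOINTSDISTANCE3)."""
    tmp = [a - b for a, b in zip(p1, p2)]
    return math.sqrt(tmp[0] * tmp[0] + tmp[1] * tmp[1] + tmp[2] * tmp[2])


def collide_spheres(p1, r1, p2, r2):
    """Return a contact point as a dict, or None when there is no collision."""
    d = calc_points_distance3(p1, p2)
    # Flow 1: fake collision, the spheres are too far apart.
    if d > r1 + r2:
        return None
    # Flow 2: grazing collision as labeled in Algorithm 1 (d <= 0).
    if d <= 0:
        return {"pos": tuple(p1), "normal": (1.0, 0.0, 0.0), "depth": r1 + r2}
    # Flow 3: real collision.
    d1 = 1.0 / d  # DRECIP(d)
    normal = tuple((a - b) * d1 for a, b in zip(p1, p2))
    k = 0.5 * (r2 - r1 - d)
    pos = tuple(a + n * k for a, n in zip(p1, normal))
    return {"pos": pos, "normal": normal, "depth": r1 + r2 - d}


# Two overlapping spheres on the x axis.
contact = collide_spheres((0.0, 0.0, 0.0), 1.5, (2.0, 0.0, 0.0), 1.0)
```

In this example the spheres overlap on the interval from x = 1 to x = 1.5, so the contact position lands in the middle of the overlap and the depth equals its width.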

Algorithm 2 Parallel Spheres collision detection.

    Stage 1:
        d ← DCALCPOINTSDISTANCE3(p1, p2)
        rsum ← r1 + r2
        psub ← p1 − p2
        rsub ← r2 − r1
    Stage 2:
        d1 ← DRECIP(d)
        d > rsum
        r2 − r1 − d
        r1 + r2 − d
    Stage 3:
        cNormal ← (psub) ∗ d1
        k ← 0.5 ∗ (rsub − d)
    Stage 4:
        cnk ← cNormal ∗ k
    Stage 5:
        cPos ← p1 + cnk
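The staged form of Algorithm 2 can be mimicked in software to check that each stage only consumes values produced by earlier stages. The Python sketch below is an illustration under assumptions, not the Verilog design: it computes every stage unconditionally and selects the flow at the end (a guess at how a datapath might behave), it guards the reciprocal against d = 0, and all names are hypothetical.

```python
import math


def collide_spheres_staged(p1, r1, p2, r2):
    """Software model of the five stages; returns a dict or None."""
    # Stage 1: operations that depend only on the inputs.
    psub = [a - b for a, b in zip(p1, p2)]
    d = math.sqrt(psub[0] * psub[0] + psub[1] * psub[1] + psub[2] * psub[2])
    rsum = r1 + r2
    rsub = r2 - r1
    # Stage 2: operations that depend on d, rsum and rsub.
    d1 = 1.0 / d if d > 0.0 else 0.0  # software guard, an assumption
    is_fake = d > rsum
    rsub_d = rsub - d                 # r2 - r1 - d
    depth = rsum - d                  # r1 + r2 - d
    # Stage 3: operations that depend on stage 2 results.
    normal = [c * d1 for c in psub]
    k = 0.5 * rsub_d
    # Stage 4.
    cnk = [n * k for n in normal]
    # Stage 5.
    pos = [a + c for a, c in zip(p1, cnk)]
    # Flow selection, done last in this model.
    if is_fake:
        return None
    if d <= 0:
        return {"pos": tuple(p1), "normal": (1.0, 0.0, 0.0), "depth": rsum}
    return {"pos": tuple(pos), "normal": tuple(normal), "depth": depth}


staged = collide_spheres_staged((0.0, 0.0, 0.0), 1.5, (2.0, 0.0, 0.0), 1.0)
```

Within a stage, none of the assignments reads a value written in the same stage, which is the property that lets them run side by side in hardware.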

E. The Sphere Collision Detection AFU

The AFU is divided into two main components, as seen in the datapath in Figure 2: the Processing Units Controller (PUC) and the Sphere Collision Processing Units (SCPUs).

Figure 1. System architecture overview.

The PUC implements the control protocol for the handshake signals. It also implements a Collector, responsible for fetching collision data from the Src Buffer, and a Dispatcher, responsible for writing the collision processing results to the Dst Buffer. Besides the algorithm parallelism based on data dependencies described in Section III-C, our architecture also uses a many-processing-units approach: each SCPU processes one collision in parallel with the other SCPUs. So, if we have N SCPUs and one collision takes time T to process, the average processing time per collision ideally becomes T/N. The PUC also controls the start and reset signals for the SCPUs. It receives the number of collisions to be processed, Ncol, from the CPU AAL application, and it then executes each SCPU Ncol/N times.

Figure 2. The AFU datapath.
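The way Ncol collisions are spread over N SCPUs can be illustrated with a small scheduling sketch. This Python model is hypothetical and ignores fetch and write-back costs. It assumes that when Ncol is not a multiple of N the last round is only partially filled, so the number of rounds is the ceiling of Ncol/N; the source only states Ncol/N.

```python
import math


def schedule_rounds(n_col, n_scpu):
    """Group collision indices into rounds of at most n_scpu parallel jobs."""
    rounds = math.ceil(n_col / n_scpu)  # ceiling is an assumption
    return [list(range(r * n_scpu, min((r + 1) * n_scpu, n_col)))
            for r in range(rounds)]


def ideal_total_time(n_col, n_scpu, t_collision):
    """Idealized total time: one collision time T per round."""
    return len(schedule_rounds(n_col, n_scpu)) * t_collision


# 25 collisions on 10 SCPUs need three rounds; the last one uses 5 SCPUs.
plan = schedule_rounds(25, 10)
```

With Ncol = 1000 and N = 10, the model gives 100 rounds, which matches the Ncol/N figure when Ncol is a multiple of N.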

IV. METHODOLOGY

A. Collision detection execution time on AFU

The AAL framework is service-oriented, so it works with the concept of transactions. A transaction must be initiated, processed, and stopped. The steps for executing a transaction from the AAL to the AFU are presented in Figure 3 and work as follows:

1) The AAL application gets pointers to the source buffer (pSource), destination buffer (pDest), and AFU context (pVAFU2_cntxt).

2) The application sends the collision data to be processed to the Source Buffer and starts the SPL.

3) The application sends the start signal and the number of collisions to be processed (nCol) to the AFU through the control memory.

4) The AFU writes its AFU ID to the DSM, which signals to the CPU that it is running.

5) After the AFU finishes processing the collisions, it signals to the CPU that it is done.

6) The CPU uses pDest to retrieve the collision processing results from the Destination Buffer.

7) The application ends the transaction by freeing the workspace.
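The seven steps above can be walked through with a purely software mock. The Python sketch below does not use the AAL API and does not model real shared memory: `MockSharedMemory`, `mock_afu_run`, `run_transaction`, and the field names are all hypothetical stand-ins chosen to mirror the DSM, the control memory, and the two buffers.

```python
class MockSharedMemory:
    """Hypothetical stand-in for the shared memory sections."""

    def __init__(self):
        self.dsm = {"afu_id": None, "done": False}    # device status memory
        self.control = {"start": False, "n_col": 0}   # handshake signals
        self.src = []                                 # Src Buffer
        self.dst = []                                 # Dst Buffer


def mock_afu_run(mem, process):
    """Pretend AFU: reacts to the start signal and fills the Dst Buffer."""
    if not mem.control["start"]:
        return
    mem.dsm["afu_id"] = "MOCK_AFU"                    # step 4: AFU is running
    n_col = mem.control["n_col"]
    mem.dst = [process(c) for c in mem.src[:n_col]]
    mem.dsm["done"] = True                            # step 5: AFU is done


def run_transaction(collisions, process):
    mem = MockSharedMemory()                          # step 1: get the buffers
    mem.src = list(collisions)                        # step 2: send the data
    mem.control.update(start=True, n_col=len(collisions))  # step 3: start
    mock_afu_run(mem, process)                        # steps 4 and 5
    if not mem.dsm["done"]:
        raise RuntimeError("mock AFU did not finish")
    results = list(mem.dst)                           # step 6: read results
    del mem                                           # step 7: free workspace
    return results


# Toy "collision processing" that doubles each input value.
results = run_transaction([1, 2, 3], lambda c: c * 2)
```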

The application re-executes the collision processing on the AFU X times and computes the average of all execution times. This is done to minimize variations in the time measurements. As the path for executing a given type of collision is always the same, the execution time for a given number of collisions is expected to be similar across runs.

Figure 3. Transaction steps.

B. Collision detection execution time on ODE

For the ODE, a set of parameterizable benchmarks was developed. Each benchmark creates a simulation environment with spheres organized in the form of a diamond. As an input parameter, it receives the maximum height hMax of the diamond, measured from its center in number of spheres. Figure 4 shows examples of three benchmarks and their respective hMax. It is also possible to increase the distance dSph between the spheres. After the spheres are created, a force pointing to the center of the diamond is applied to each one so that they collide.

Figure 4. Benchmarks.

In order to get the execution time, the user must specify how many collisions should be taken into account and their type tCol (fake, real, or grazing). Figure 5 shows the execution flow for obtaining the ODE sphere collision execution times. First the benchmark parameters are set and the simulation starts. When a collision of the specified type happens, it is executed 10 times and the average execution time is added to the total execution time. When the number of collisions nCol reaches the maximum number of collisions nColMax, the simulation stops and the benchmark returns the total execution time. The benchmarks were designed to take into account any optimization made by the processor in a real-time application.
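The measurement loop described above can be sketched as follows. This Python version only illustrates the control flow (repeat each matching collision 10 times, add the average to the total, stop at nColMax). The function names, the `classify` callback, and the use of `time.perf_counter` are assumptions, and the toy workload stands in for the ODE collision routine.

```python
import time


def average_time(fn, args, repeats=10):
    """Average wall-clock time of fn(*args) over `repeats` executions."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


def benchmark(collision_stream, fn, classify, t_col, n_col_max):
    """Accumulate averaged times for collisions of type t_col."""
    total = 0.0
    n_col = 0
    for args in collision_stream:
        if classify(*args) != t_col:
            continue                   # not the requested collision type
        total += average_time(fn, args)
        n_col += 1
        if n_col == n_col_max:
            break                      # the simulation stops here
    return total, n_col


# Toy stream: even inputs are labeled "real", odd inputs "fake".
stream = [(i,) for i in range(20)]
total, counted = benchmark(stream, lambda x: x * x,
                           lambda x: "real" if x % 2 == 0 else "fake",
                           "real", 5)
```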

V. RELATED WORKS

In [3], an algorithm for solving linear programs is used to improve the speed of collision detection algorithms; that work uses preloaded data with memory initialization files in an FPGA kit to generate its results. In [4], a collision detection design for FPGAs using fixed-point arithmetic and bounded error is proposed. Its focus is on saving space in order to reduce area overhead; its results are given in terms of speedup, but it also uses preloaded data. Works such as [5], [6] rely on GPU-CPU systems to improve collision detection.

Figure 5. ODE execution flow.

The main difference in our work is that we propose a collision detection design for real-time processing on a CPU-FPGA heterogeneous system without preloaded data. Our results are obtained in a real environment and are given in terms of execution time speedup.

VI. RESULTS AND DISCUSSION

Our experiments were conducted by executing sets of real collisions on both the ODE and the AFU. For the ODE, the tests used a benchmark with hMax equal to 3 and dSph equal to 0. We started with a set of 10 collisions and kept increasing it by 10 until we reached 1000 collisions. We present the results in terms of collision processing time in a graph comparing the ODE and AFU times. For the AFU we used 10 SCPUs.

Figure 6 shows the graph for the sets of real collisions. The increase in execution time is linear, as expected, since the path to execute a real collision is always the same. Above 250 collisions, the AFU is always faster than the ODE. Below this point the ODE is faster because the AFU has a configuration time before and after each transaction. This configuration time is on average 2000 milliseconds, so the AFU must process a minimum number of collisions in order to compensate for it. With the number of collisions that we collected, we reach a speedup of up to 14.5% on the AFU for real collisions.

We then ran another test in which we start with a set of 1000 collisions and increase it by 1000 until we reach 90000 collisions. The results can be seen in Figure 7. For the HARP tests, the average execution time per collision was 86 ns, and for the ODE it was 102 ns.

VII. CONCLUSION AND FUTURE WORK

In this work we presented a hardware accelerator for a collision detection algorithm. We implemented a case study with sphere collision detection, reaching a speedup of up to 14.5% with an FPGA-proven design running in a real environment. The results show that our design is a promising option for decreasing execution times for this type of algorithm in the emerging landscape of FPGA accelerators. As future work, we intend to expand the AFU library to other types of collision detection algorithms in order to produce a new engine.

Figure 6. Graph for real collisions for 1000 set.

Figure 7. Graph for real collisions for 100000 set.

ACKNOWLEDGMENTS

We would like to thank UFV for providing us with the necessary materials and support. We would also like to acknowledge the students Connor Blandford and Oakley Katterheinrich for participating in the development of the SCPUs during a summer internship at Miami University.

REFERENCES

[1] H. Wei and W. W. Gen, “A comprehensive FPGA implementation of collision detection,” in IET International Communication Conference on Wireless Mobile and Computing (CCWMC 2011), Nov 2011, pp. 341–346.

[2] R. Smith. Open dynamics engine. [Online]. Available: http://www.ode.org/

[3] C. H. Wu, S. O. Memik, and S. Mehrotra, “FPGA implementation of the interior-point algorithm with applications to collision detection,” in 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, April 2009, pp. 295–298.

[4] A. Raabe, S. Hochgurtel, J. Anlauf, and G. Zachmann, “Space-efficient FPGA-accelerated collision detection for virtual prototyping,” in Proceedings of the Design Automation Test in Europe Conference, vol. 2, March 2006, 6 pp.

[5] A. Hermann, F. Drews, J. Bauer, S. Klemm, A. Roennau, and R. Dillmann, “Unified GPU voxel collision detection for mobile manipulation planning,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sept 2014, pp. 4154–4160.

[6] A. Hermann, F. Mauch, K. Fischnaller, S. Klemm, A. Roennau, and R. Dillmann, “Anticipate your surroundings: Predictive collision detection between dynamic obstacles and planned robot trajectories on the GPU,” in 2015 European Conference on Mobile Robots (ECMR), Sept 2015, pp. 1–8.
