


Distributed Execution of Transmural Electrophysiological Imaging with CPU, GPU, and FPGA

Sam Skalicky, Sonia López, Marcin Łukowiak
Department of Computer Engineering, Rochester Institute of Technology
Rochester, NY, USA
{sxs5464,slaeec,mxleec}@rit.edu

Abstract—One of the main challenges of using cutting edge medical imaging applications in the clinical setting is the large amount of data processing required. Many of these applications are based on linear algebra computations operating on large data sizes, and their execution may require days on a standard CPU. Distributed heterogeneous systems are capable of improving the performance of applications by using the right computation-to-hardware mapping. To achieve high performance, hardware platforms are chosen to satisfy the needs of each computation with corresponding architectural features such as clock speed, number of parallel computational units, and memory bandwidth. In this paper we evaluate the performance benefits of using different hardware platforms to accelerate the execution of a transmural electrophysiological imaging algorithm, targeting a standard CPU with GPU and FPGA accelerators. Using this cutting edge medical imaging application as a case study, we demonstrate the importance of making intelligent computation assignments for improved performance. We show that, depending on the size of the data structures the application works with, the usage of an FPGA to run certain computations can make a big difference: a heterogeneous system with all three hardware platforms (CPU+GPU+FPGA) can cut the execution time by half compared to the best result using a single accelerator (CPU+GPU). In addition, our experimental results show that combining CPU, GPU, and FPGA platforms in a single system achieves a speedup of up to 62x, 2x, and 1605x compared to systems with a single CPU, GPU, or FPGA platform respectively.

I. INTRODUCTION

In the past, invasive techniques for transmural electrophysiological imaging required cardiac catheterization to measure electrical activity directly by touching a sensor to the interior walls of the heart. The large risk associated with this procedure and the data inaccuracies caused by measuring electrical activity at different positions throughout many cardiac cycles drove researchers to find better techniques. Noninvasive transmural electrophysiological imaging (NTEPI) uses body surface potential electrical readings combined with anatomical data from MR imaging to model the electrical activity not only on heart surfaces but also deep into the 3D myocardium of the ventricles, with better accuracy and less risk to the patient. Despite this initial success, the clinical translation of NTEPI is hindered by the tremendous computational cost of the algorithm.

This compute-intensive application, when implemented on a standard CPU, does not achieve a sufficient level of performance for its required purpose. Previous research by Camara et al. [1] reported an average computational time of 250 hours on a regular desktop computer. This is not an acceptable execution time for a critical diagnostic tool used in the clinical environment. Making use of this new diagnostic technique in the clinical domain will require alternative hardware solutions, such as GPU or FPGA accelerated implementations, to operate within an available time budget.

Heterogeneous systems present a solution to this problem by combining different architectures such as CPU, GPU, and FPGA into a single system to provide the best performance. Through proper computation-to-hardware assignments, the NTEPI application can achieve the performance required for the clinical environment. However, in order to take full advantage of heterogeneous systems, some major challenges must be overcome. These include choosing the best computation-to-hardware assignments and choosing which hardware platforms to include in the system.

The core of the NTEPI application is composed of linear algebra computations such as dot product, matrix-vector multiplication, matrix-matrix multiplication, matrix inverse, and matrix decomposition. In this paper we analyze three of these linear algebra computations on CPU, GPU, and FPGA architectures with multiple implementations. These linear algebra computations are commonly used in many other compute-intensive applications in the medical field and other relevant fields, and our conclusions can be applied to those as well. From the literature [2][3][4][5], successful FPGA architectures for these computations are selected. Multiple implementations for the CPU [6][7] and GPU [8][9][7] architectures from commonly used scientific libraries are used for comparison. These results can be used in systems where one or more of these architectures are available to assist in selecting the best implementation for any computation using a particular matrix size.

The main contributions of this work are:

• Decomposition of the NTEPI application and identification of the bottleneck computations.

• Demonstration of the impact of computation-to-hardware assignments on the performance of this medical imaging application.

• Evaluation of the performance benefits of these assignments, including data transfer costs, across several systems with various combinations of hardware platforms.


Fig. 1: The steps required for each NTEPI iteration.

II. RELATED WORK

Initial steps have been taken to parallelize whole heart EP simulation using complex ionic models by Sato et al. [10] and Bartocci et al. [11]. However, few efforts have reported on GPU acceleration of noninvasive EP imaging, except for the work by Corraine et al. [12], which presented a 16x speedup compared to a high end CPU. This is the first work to analyze the performance of noninvasive EP imaging on a heterogeneous system containing CPUs, GPUs, and FPGAs.

The performance of various processing architectures has been evaluated for many computations. CPU, GPU, and FPGA implementations of a Low-Density Parity-Check decoder were compared by Falcao et al. [13] for data sizes 8000x4000 and 1024x512. They concluded that the FPGA was faster for the smaller data size and the GPU was faster at the larger data size. Sotiropoulos et al. designed an FPGA matrix-matrix multiplication architecture [3] and compared its performance to a standard CPU implementation. This comparison was only for specifically sized matrices and did not detail their CPU implementation. The results showed that the FPGA outperforms the CPU with a speedup of up to 557x. A comparison of matrix decomposition by Yang et al. [14] evaluated the performance on CPUs, GPUs, and FPGAs. They analyzed four data sizes from 256 to 1024 and demonstrated that the FPGA was faster than the GPU, followed by the CPU, for both single and double precision floating point. In contrast to these previous works, which compared the performance of each architecture against the others, in this work we combine their capabilities into a single system to achieve the best performance.

Higher level functions such as 2D filtering by Llamocca et al. [15] and an implementation of Bayesian networks by Fletcher et al. [16] were evaluated using GPU and FPGA architectures. Grozea et al. evaluated a sorting algorithm on CPUs, GPUs, and FPGAs [17] to speed up the performance of network intrusion detection systems. Their results showed that the highest performing architecture was the CPU, followed by the FPGA and then the GPU. However, when making these comparisons, the authors implemented each algorithm solely in one architecture and, therefore, chose one particular processor over another. They did not discuss the best implementations for the computations within these higher level functions, only the best implementation for the whole algorithm.

III. NTEPI ALGORITHM

The NTEPI algorithm employs a sequential maximum a posteriori (MAP) estimation of the transmural action potential (electrical propagation) distributions u_k given the body-surface potential data (as measured from a standard ECG) from all samples up to the current sample k, denoted as φ_{1:k} [18]. At each time step when a new sample is available, the computational steps shown in Figure 1 are executed. A typical patient analysis requires 2000-3000 iterations.

In each iteration, a Cholesky decomposition of the covariance matrix of u_{k-1} is performed. Then, a set of sample vectors U_{k-1|k-1} is generated from the mean and covariance matrix of u_{k-1}. Each sample vector in U_{k-1|k-1} is individually entered into a simulation of the Aliev-Panfilov model to predict a new set of sample vectors U_{k|k-1}. From these, the mean and covariance matrix of u_k^- are predicted.
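For concreteness, the following is a minimal numpy sketch of this prediction step as described above. The aliev_panfilov_step function is a hypothetical stand-in for the actual EP model simulation, and the simple sampling and unit-weight averaging are simplifications of the full estimation procedure in [18]:

```python
import numpy as np

def predict(u_mean, u_cov, aliev_panfilov_step, n_samples):
    """Prediction step: sample from the current estimate, propagate each
    sample through the EP model, and compute the predicted statistics."""
    N = u_mean.size
    L = np.linalg.cholesky(u_cov)          # Cholesky decomposition of the covariance
    # Generate sample vectors U_{k-1|k-1} from the mean and covariance of u_{k-1}
    U_prev = u_mean[:, None] + L @ np.random.randn(N, n_samples)
    # Propagate each sample through the (stubbed) Aliev-Panfilov model -> U_{k|k-1}
    U_pred = np.column_stack([aliev_panfilov_step(U_prev[:, j])
                              for j in range(n_samples)])
    # Predicted mean and covariance of u_k^-
    u_mean_pred = U_pred.mean(axis=1)
    D = U_pred - u_mean_pred[:, None]
    u_cov_pred = (D @ D.T) / (n_samples - 1)   # a large matrix-matrix multiplication
    return u_mean_pred, u_cov_pred
```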

Since the sampled ECG measurements φ_k contain electrical noise from sources other than the heart, such as the respiratory muscles located between the electrodes and the heart, a Kalman filter is used to reduce the impact of random noise in the data. The Kalman update process requires inverting an MxM matrix, where M is the dimension of the body surface data φ_k. Each iteration repeats the above prediction and update processes.
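A sketch of the corresponding update is given below, written as a standard linear Kalman correction for illustration; the forward matrix H (mapping transmural potentials to the M body-surface leads) and the noise covariance R are hypothetical placeholders, and the MxM inversion named in the text appears explicitly:

```python
import numpy as np

def kalman_update(u_mean_pred, u_cov_pred, phi_k, H, R):
    """Correct the predicted state with the body-surface measurement phi_k."""
    # Innovation covariance S is MxM, where M = phi_k.size
    S = H @ u_cov_pred @ H.T + R
    K = u_cov_pred @ H.T @ np.linalg.inv(S)   # the MxM matrix inversion
    u_mean = u_mean_pred + K @ (phi_k - H @ u_mean_pred)
    u_cov = u_cov_pred - K @ H @ u_cov_pred
    return u_mean, u_cov
```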

The computations required for the above calculations are common matrix operations such as addition, subtraction, element-wise and standard multiplication, scaling, inversion, and Cholesky decomposition. Previous work profiled this algorithm in detail [12] and found that the majority of the execution time is spent on the Aliev-Panfilov model. We further investigated which specific computations are the bottleneck and found that 98% of the time is spent on matrix-matrix multiplication, matrix inversion, and Cholesky decomposition. As such, this work attempts to improve the overall NTEPI algorithm's performance by focusing on these three types of computations. The number of each of these computations and the dependencies between them for each NTEPI iteration are shown in Figure 2.
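This breakdown can be reproduced at small scale with a sketch like the one below, which times the three bottleneck computations with numpy for the 836x836 sample use case size mentioned in Section IV. This is illustrative only: it runs on a CPU via numpy, not the platform-specific libraries measured in this paper.

```python
import time
import numpy as np

def time_op(fn, *args, reps=5):
    """Return the best-of-reps wall-clock time for fn(*args)."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

n = 836                                   # sample use case size from Section IV
A = np.random.rand(n, n)
spd = A @ A.T + n * np.eye(n)             # symmetric positive definite input

print("MM  :", time_op(np.matmul, A, A))
print("Chol:", time_op(np.linalg.cholesky, spd))
print("Inv :", time_op(np.linalg.inv, spd))
```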

Fig. 2: Dataflow Graph (DFG) of the bottleneck computations, matrix-matrix multiply (MM), Cholesky decomposition (Chol), and matrix inverse (Inv), in each iteration of the NTEPI algorithm.


Each iteration requires 12 computations, broken down into 1 Cholesky decomposition, 1 matrix inversion, and 10 matrix-matrix multiplications. Between iterations there is no opportunity for overlap, as every iteration is dependent on the previous update calculations. However, there is sufficient parallelism within each iteration to potentially keep all three CPU, GPU, and FPGA processors busy. In the next section we evaluate the performance of the three computation types and the data transfer costs to determine a schedule that improves the overall performance.
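One way to expose this intra-iteration parallelism is to express the DFG as a dependency map and list the computations that become ready at each step. The sketch below uses placeholder edges, since the actual graph appears only in Figure 2; only the node counts (1 Chol, 1 Inv, 10 MMs) are taken from the text:

```python
# Hypothetical dependency map: node -> set of predecessor nodes.
# The real edges are those of Fig. 2; these are placeholders.
deps = {
    "Chol": set(),
    "MM1": {"Chol"}, "MM2": {"Chol"}, "MM3": set(), "MM4": set(),
    "MM5": {"MM1"}, "MM6": {"MM2", "MM3"},
    "Inv": {"MM4"},
    "MM7": {"Inv"}, "MM8": {"Inv"},
    "MM9": {"MM5", "MM7"}, "MM10": {"MM6", "MM8"},
}

def ready_waves(deps):
    """Group computations into waves that could run concurrently."""
    remaining = dict(deps)
    done = set()
    while remaining:
        wave = [n for n, preds in remaining.items() if preds <= done]
        yield wave
        done.update(wave)
        for n in wave:
            del remaining[n]

for i, wave in enumerate(ready_waves(deps)):
    print(f"step {i}: {wave}")
```

Each printed wave could, in principle, be spread across the CPU, GPU, and FPGA simultaneously.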

IV. RESULTS

To run the NTEPI algorithm in a heterogeneous system, each computation must be assigned to a hardware platform for execution. As stated in [19], the performance of a computation depends on the number of operations, the memory bandwidth requirements, the control flow complexity, and the data size. The NTEPI algorithm operates on matrix sizes that range from 500x500 to 8000x8000, depending on the size of the mesh used to represent the heart. The sample use case that we originally evaluated operated on a data size of 836x836. Larger data sizes allow for more precise electrical activity modeling but also dramatically increase the execution time. Additionally, double precision (DP) floating point is required in order to store the small action potentials and intermediary values used during calculation. However, we also investigate the potential performance improvement of moving to single precision (SP) floating point.
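As a point of reference for these sizes, the memory footprint of one dense operand is easy to quantify: n² elements at 8 bytes each in DP and 4 in SP. A minimal calculation, assuming dense storage:

```python
# Dense matrix footprint across the size range used in this work.
for n in (500, 836, 2000, 4000, 6000, 8000):
    dp = n * n * 8 / 2**20   # MiB for double precision
    print(f"{n:>4}x{n:<4}  DP {dp:8.1f} MiB   SP {dp/2:8.1f} MiB")
```

At 8000x8000 a single DP matrix is nearly 500 MiB, which is why platform memory and transfer bandwidth matter to the assignments below.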

The performance of each computation and data size was evaluated on CPU, GPU, and FPGA hardware platforms with the specifications shown in Table I. For each computation on the FPGA, a well researched custom design was used without modification [2][3][4][5] on a Virtex 6 and a Virtex 7 device, although we only show results for the Virtex 7, as explained later. For the CPU [6][7] and GPU [8][9][7], high performance implementations that are commonly used in scientific computing were used, stemming from the original BLAS and LAPACK libraries. The AMD C Math Library (ACML) and MathWorks Matlab were used to implement computations on the CPU. Parallel versions of these libraries are also available for GPUs in the form of Compute Unified BLAS (CUBLAS), Matrix Algebra on GPU and Multicore Architectures (MAGMA), and Matlab. Figure 3 shows the performance of the three computations for the data sizes evaluated in this work.

TABLE I: Hardware Platform Specifications

CPU
  HW Platform: Intel Core i7 2600 @ 3.4GHz; 16GB DDR3 @ 1333MHz
  Implementations: AMD C Math Library 5.1.0; MathWorks Matlab 2012b 64b

GPU
  HW Platform: Nvidia Tesla K20 @ 706MHz; 5GB GDDR5 @ 5.2GHz
  Implementations: CUBLAS + MAGMA libraries; MathWorks Matlab 2012b 64b

FPGA
  HW Platforms: Xilinx Virtex 6 LX240T (ML605); 512MB DDR3 @ 400MHz
                Xilinx Virtex 7 VX485T (VC707); 1GB DDR3 @ 800MHz
  Implementations: [2][3][4][5]

Fig. 3: Performance of matrix-matrix multiplication, Cholesky decomposition, and matrix inversion on CPU, GPU, and FPGA platforms for both single precision (SP) and double precision (DP) floating point.


Fig. 4: Schedules of a single iteration of the NTEPI algorithm on two and three-platform systems: (a) CPU+GPU system; (b) CPU+FPGA system; (c) GPU+FPGA system; (d) CPU+GPU+FPGA system. The matrix inverse computation (Inv) can be placed on the CPU, GPU, or FPGA to achieve the best performance depending on data size and precision, whereas the Cholesky (Chol) and matrix-matrix multiply (MM) computations always perform best on the FPGA and GPU platforms respectively. The CPU+GPU system's schedule may be longer than the others since, for some data sizes, the Cholesky decomposition's second best platform is the GPU.

Although multiple implementations were used for each hardware platform, only one showed the best results for each platform across our range of data sizes and precisions. For the CPU, the Matlab implementation performed better than all other platforms and implementations for double precision matrix inversion at data size 500x500. The Matlab implementation on the GPU was also better than CUBLAS (for matrix-matrix multiplication) and MAGMA (for matrix inverse). However, the FPGA's ability to execute the complex control flow of Cholesky decomposition using custom parallel pipelines was better than all other platforms. Since the Cholesky decomposition implemented on the FPGA only required one operand per pipeline, the design never utilized the entire memory bandwidth of the Virtex 6 device. Moreover, the same number of pipelines was implemented in both devices, so the additional bandwidth of the Virtex 7 device was of no benefit, and both devices performed the same. Similarly, the matrix inversion design only required a single operand regardless of the size of the pipeline implemented, and as a result both FPGA devices performed the same for this computation.

Fig. 5: Design space for CPU, GPU, and FPGA hardware platforms for the three computations evaluated in this work across a range of data sizes using single precision (SP) and double precision (DP).

The design space chart in Figure 5 shows the best computation-to-hardware mappings for various data sizes. This chart was derived from the performances shown in the plots in Figure 3 and is based only on the execution times of each computation. Data transfer times also affect these mappings; next we define each platform's communication interfaces and their connections within the system.

Fig. 6: Distributed CPU, GPU, FPGA system showing each platform's full duplex PCIe connections to both other platforms.

Figure 6 shows the connections between each commercial-off-the-shelf (COTS) hardware platform as PCI Express interfaces with the specified number of lanes. The GPU's PCIe bandwidth to the CPU using 16 lanes is assumed to be 8GBps, and the FPGA's bandwidth is assumed to be 2GBps using 4 lanes. We clearly define the CPU and its chipset as separate chips, since the PCIe root complex within the chipset can act as a switch and route transactions between the various devices without interaction from the CPU. Bittner et al. [20] presented a technique to enable direct GPU to FPGA communication. We assume that this functionality is enabled within this system and that data transfer between the GPU and FPGA happens at the FPGA's bandwidth and without any CPU interaction. These bandwidths, along with the data size, determine the data transfer costs that are included in the scheduling decisions that choose which platform to assign a computation to. The goal of assigning computations to different platforms is that the difference in execution time be larger than the associated data transfer time, leading to an overall decrease in execution time.
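This assignment rule can be written down directly: offload a computation only when the compute time saved exceeds the PCIe transfer cost, estimated here from the stated bandwidths (8GBps for the GPU's 16-lane link, 2GBps for the FPGA's 4-lane link). In the sketch below, exec_time is a placeholder for the measured per-platform times behind Figure 3, and the operand count and example timings are made up for illustration:

```python
BANDWIDTH = {"GPU": 8e9, "FPGA": 2e9}      # bytes/s over PCIe, per the text

def transfer_time(n, platform, bytes_per_elem=8):
    """Time to move one dense nxn double precision operand over PCIe."""
    return (n * n * bytes_per_elem) / BANDWIDTH[platform]

def best_assignment(op, n, exec_time, n_operands=3):
    """Pick the platform minimizing compute + transfer time.
    exec_time: dict platform -> measured seconds for (op, n); placeholder data."""
    def total(p):
        xfer = 0.0 if p == "CPU" else n_operands * transfer_time(n, p)
        return exec_time[p] + xfer
    return min(exec_time, key=total)

# Example with made-up timings: offloading wins only if the gap beats the transfer.
print(best_assignment("MM", 4000, {"CPU": 12.0, "GPU": 0.9, "FPGA": 6.5}))
```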

For our three computations, matrix-matrix multiplication always performs best on the GPU, and Cholesky decomposition on the FPGA. However, depending on the data size and precision, the CPU, GPU, or FPGA may be best for matrix inverse, as shown in Figures 3e-f. Figure 4 shows the best schedules for the heterogeneous systems: CPU+GPU, CPU+FPGA, GPU+FPGA, and CPU+GPU+FPGA. Each schedule has multiple possible assignments for matrix inverse, and depending on the data size and precision, different platforms will be chosen to achieve the best performance. In a system with only two of the hardware platforms, where the best platform for a particular computation is not in the system, the second or third best platform must be used. For the CPU+GPU schedule in Figure 4a, the Cholesky decomposition cannot be scheduled on the best processor (the FPGA) since it is not in the system. Figures 3c-d show that the second best platform for Cholesky decomposition is either the CPU or the GPU, depending on data size and precision. Figure 4b shows the schedule for the CPU+FPGA system. In this system a GPU is not available, and Figures 3a-b show that the CPU platform always performs better than the FPGA for matrix-matrix multiplication, so it is used in lieu of the GPU. In the GPU+FPGA system, there is only one case where a second best processor is used: for the 500x500 double precision case, the matrix inverse is assigned to the second best processor (the GPU) instead of the CPU.
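This fallback behavior can be captured as ranked preference lists filtered by the platforms actually present. The rankings below encode the assignments stated above for the 500x500 double precision case only; the Inv ordering in particular changes with data size and precision:

```python
# Preference order per computation (best first), for the 500x500 DP case.
PREFERENCE = {
    "MM":   ["GPU", "CPU", "FPGA"],
    "Chol": ["FPGA", "CPU", "GPU"],   # second-best varies between CPU and GPU
    "Inv":  ["CPU", "GPU", "FPGA"],
}

def assign(op, available):
    """Return the best available platform for op, falling back as needed."""
    for platform in PREFERENCE[op]:
        if platform in available:
            return platform
    raise ValueError(f"no platform available for {op}")

# GPU+FPGA system: Inv falls back to its second-best platform, the GPU.
print(assign("Inv", {"GPU", "FPGA"}))   # -> GPU
```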

A. Medical Imaging Use Case

As stated above, we tested seven different system configurations: systems with a single hardware platform (CPU, GPU, and FPGA), two hardware platforms (CPU+GPU, CPU+FPGA, and GPU+FPGA), and three hardware platforms (CPU+GPU+FPGA). As expected, the best performing one is the three-platform CPU+GPU+FPGA system, and we present our results by comparing this one to the rest. For the single GPU and FPGA systems and the GPU+FPGA system, we assume they are also accompanied by a CPU for control purposes only, to initiate computations and data transfers; no computing happens on the CPU. Since there is no opportunity for overlap between iterations, the execution time for a single iteration can be used to estimate the execution time for an application with any number of iterations.

Fig. 7: Speedup of the heterogeneous CPU+GPU+FPGA system for a single iteration of the NTEPI algorithm compared to single and two-platform systems across a range of data sizes using both single precision (SP) and double precision (DP) floating point: (a) SP speedup over CPU, FPGA, and CPU+FPGA; (b) SP speedup over GPU, CPU+GPU, and GPU+FPGA; (c) DP speedup over CPU, FPGA, and CPU+FPGA; (d) DP speedup over GPU, CPU+GPU, and GPU+FPGA.

Page 6: Distributed Execution of Transmural … Execution of Transmural Electrophysiological Imaging with CPU, GPU, and FPGA ... jj1 are generated from the mean and covariance matrix of u

Figure 7 shows the speedup of the three-platform system against the other six systems for single and double precision across a range of data sizes. For the sake of visibility, the results were split into two groups: those systems with a GPU (GPU, CPU+GPU, GPU+FPGA) in Figures 7b & 7d, and those systems without a GPU (CPU, FPGA, CPU+FPGA) in Figures 7a & 7c. The reason for this split leads to our first conclusion: the systems including a GPU show performance closest to the best performance achievable with the CPU+GPU+FPGA system. This is due to the large number of computations that are best performed on the GPU (all matrix-matrix multiplications and some matrix inversions) and the large difference in execution time of these computations on the other platforms. The GPU achieves such high performance compared to the other platforms as a natural consequence of its inherent high parallelism. Matrix-matrix multiplication is especially appropriate for the GPU in this range of matrix sizes. The results would be very different if we were working with a lower number of matrix-matrix multiplications or smaller matrix sizes. These two factors would reduce the level of parallelism of the application and make the GPU's hardware resources overkill for the computational needs of the application.

We can see that the closest performance to the three-platform system is achieved by the GPU+FPGA system, with a speedup of 1x (that is, equal performance) for almost all cases. This shows that the contribution of the CPU is minimal, if not null, and that the full load of the execution lies on the GPU and FPGA platforms. Only for the double precision GPU+FPGA plot in Figure 7d, at size 500x500, can we see a small speedup of 1.03x of the three-platform system over the GPU+FPGA system. This slight improvement of 3%, achieved by the addition of the CPU to the system, is due to the smaller execution time of the matrix inverse on the CPU, which works well for that matrix size. Given that the CPU is required in such a system to control the data transfers and initiate computations on the GPU and FPGA platforms, it is natural to also use it to further improve performance.

Continuing with this analysis of the three-platform system versus the other two-platform systems (CPU+GPU and CPU+FPGA), we analyze the impact of the third component (FPGA and GPU respectively) when it is added. For the CPU+FPGA system in Figures 7a and 7c, we can see that the three-platform system achieves performance almost equal to the system with no GPU for small matrix sizes, and the GPU gains relevance as the matrix size increases. As stated above, to really exploit the potential of the GPU's hardware, large amounts of parallelism in the computations are necessary, which increase as we approach the 6000x6000 mark. Here, the addition of the GPU cuts the execution time by 12x for double and 6x for single precision compared to the CPU+FPGA system. On the other hand, looking at Figures 7b and 7d, we can evaluate the impact of adding the FPGA by analyzing the performance of the CPU+GPU system. The performance of the three-platform and CPU+GPU systems is very similar, leading us to believe that adding the FPGA does not make a big difference. However, this again depends on the size of the data that we work with. Depending on the size of the matrix, the addition of an FPGA for certain computations can improve the performance considerably. This is the case for the 500x500 data size, where the execution time is almost cut in half thanks to the addition of the FPGA, which achieves the best performance when executing the matrix inverse computation for that data size.

Current implementations of the NTEPI algorithm regularly execute 2000-3000 iterations. Future implementations will need to execute millions of iterations to improve the accuracy and usability of the algorithm in the clinical setting, so every potential speedup is needed for this to become feasible. We found that, within the three-platform system, the potential speedup from moving to single precision ranged from 1.3x to 2x, with an average of 1.7x.

In summary, we can conclude that, out of the three hardware platforms in the full heterogeneous system, the one with the most relevant impact on performance is the GPU, while the one with the smallest impact is the CPU. For any application, the mix of computations and the data sizes they operate on are the key factors in selecting the right hardware platforms for a heterogeneous system. For our specific application, given that the smallest matrix size we consider is 500x500 and the high ratio of matrix-matrix multiplications, we found that the single most important component of a heterogeneous solution is the GPU, due to the high level of parallelism of both the application and the hardware. We found that the FPGA can be a key addition to the system for small matrix sizes, cutting the execution time by up to half for a critical medical diagnosis application.

V. CONCLUSIONS

In this work we have evaluated the performance of the NTEPI medical imaging algorithm on CPU, GPU, and FPGA hardware platforms. Although many previous works have compared and contrasted CPUs, GPUs, and FPGAs to determine which architecture is better, we show that a single system containing all three architectures results in higher performance. Using design space charts, computations were mapped to hardware platforms to improve their individual performance based on data size and precision. The large differences shown between the fastest and second fastest implementations are key to enabling heterogeneous systems to spend time on data transfers and still achieve higher performance than single architecture systems. Then, schedules for a single iteration of the algorithm were constructed for two and three platform systems to improve the performance of the bottleneck computations in the application.

We compared the performance of single CPU, GPU, or FPGA systems and two-platform systems (CPU+GPU, CPU+FPGA, and GPU+FPGA) against the three-platform CPU+GPU+FPGA system. Our results showed that in the three-platform system, the GPU had the highest contribution to the overall performance of the application, and the CPU had the lowest. After the GPU, adding the FPGA can cut the execution time of the system by up to half for small matrix sizes. As expected, the single platform systems were the worst performing, but out of these the GPU performed the best. The three-platform system performed up to 12x better than the two-platform systems, with the GPU+FPGA system achieving equal performance for all but one data size.

Page 7: Distributed Execution of Transmural … Execution of Transmural Electrophysiological Imaging with CPU, GPU, and FPGA ... jj1 are generated from the mean and covariance matrix of u

Future work will use scheduling algorithms to automatically determine the best or optimal schedule, statically or dynamically at run time, for the number and type of hardware platforms in the system. Following this, a simulation of the application to estimate its performance in a heterogeneous system will be used to validate the assignments or identify places for improvement. In the future, we will use the results from this paper to design an automatic framework to convert an application from its initial software implementation to a faster implementation in a heterogeneous system.

REFERENCES

[1] O. Camara, M. Sermesant, P. Lamata, L. Wang, M. Pop, J. Relan, M. D. Craene, H. Delingette, H. Liu, S. Niederer, A. Pashaei, G. Plank, D. Romero, R. Sebastian, K. Wong, H. Zhang, N. Ayache, A. Frangi, P. Shi, N. Smith, and G. Wright, "Inter-model Consistency and Complementarity: Learning from Ex-vivo Imaging and Electrophysiological Data Towards an Integrated Understanding of Cardiac Physiology," Progress in Biophysics and Molecular Biology, vol. 107, no. 1, Oct. 2011.

[2] L. Zhuo and V. K. Prasanna, "High-Performance Designs for Linear Algebra Operations on Reconfigurable Hardware," IEEE Transactions on Computers, vol. 57, no. 8, Aug. 2008.

[3] I. Sotiropoulos and I. Papaefstathiou, "A Fast Parallel Matrix Multiplication Reconfigurable Unit Utilized in Face Recognitions Systems," International Conference on Field Programmable Logic and Applications, Sept. 2009.

[4] F. Edman and V. Owall, "Implementation of a Highly Scalable Architecture for Fast Inversion of Triangular Matrices," IEEE International Conference on Electronics, Circuits and Systems, Dec. 2003.

[5] D. Yang, G. Peterson, and H. Li, "Compressed Sensing and Cholesky Decomposition on FPGAs and GPUs," Parallel Computing, vol. 38, no. 8, Mar. 2012.

[6] Advanced Micro Devices Inc. (2011) AMD C Math Library 5.1.0. Sunnyvale, CA, USA. [Online]. Available: http://developer.amd.com/libraries/acml

[7] MathWorks Inc. (2012) Matlab 2012b. Natick, MA, USA. [Online]. Available: http://www.mathworks.com/help/techdoc/

[8] nVidia Corporation. (2012) CUDA Toolkit 4.2, CUBLAS Library. Santa Clara, CA, USA. [Online]. Available: http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/CUBLAS_Library.pdf

[9] University of Tennessee Innovative Computing Laboratory. (2012) Matrix Algebra on GPU and Multicore Architectures. Knoxville, TN, USA. [Online]. Available: http://icl.cs.utk.edu/magma/

[10] D. Sato, Y. Xie, J. Weiss, Z. Qu, A. Garfinkel, and A. Sanderson, "Acceleration of Cardiac Tissue Simulation with Graphic Processing Units," Medical & Biological Engineering & Computing, vol. 47, no. 9, Sept. 2009.

[11] E. Bartocci, E. M. Cherry, J. Glimm, R. Grosu, S. A. Smolka, and F. H. Fenton, "Toward Real-time Simulation of Cardiac Dynamics," International Conference on Computational Methods in Systems Biology, Sept. 2011.

[12] M. Corraine, S. Lopez, and L. Wang, "GPU Acceleration of Transmural Electrophysiological Imaging," Computing in Cardiology, Sept. 2012.

[13] G. Falcao, M. Owaida, D. Novo, M. Purnaprajna, N. Bellas, C. D. Antonopoulos, G. Karakonstantis, A. Burg, and P. Ienne, "Shortening Design Time through Multiplatform Simulations with a Portable OpenCL Golden-model: The LDPC Decoder Case," IEEE International Symposium on Field-Programmable Custom Computing Machines, Apr. 2012.

[14] D. Yang, J. Sun, J. Lee, G. Liang, D. D. Jenkins, G. D. Peterson, and H. Li, "Performance Comparison of Cholesky Decomposition on GPUs and FPGAs," Symposium on Application Accelerators in High Performance Computing, July 2010.

[15] D. Llamocca, C. Carranza, and M. Pattichis, "Separable FIR Filtering in FPGA and GPU Implementations: Energy, Performance, and Accuracy Considerations," International Conference on Field Programmable Logic and Applications, Sept. 2011.

[16] C. W. Fletcher, I. Lebedev, N. B. Asadi, D. R. Burke, and J. Wawrzynek, "Bridging the GPGPU-FPGA Efficiency Gap," ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Feb. 2011.

[17] C. Grozea, Z. Bankovic, and P. Laskov, "FPGA vs. Multi-core CPUs vs. GPUs: Hands-on Experience with a Sorting Application," Facing the Multicore-Challenge, vol. 6310, Mar. 2010.

[18] L. Wang, H. Zhang, K. C. L. Wong, H. Liu, and P. Shi, "Physiological-Model-Constrained Noninvasive Reconstruction of Volumetric Myocardial Transmembrane Potentials," IEEE Transactions on Biomedical Engineering, vol. 57, no. 2, Feb. 2010.

[19] S. Skalicky, S. Lopez, M. Lukowiak, J. Letendre, and D. Gasser, "Linear Algebra Computations in Heterogeneous Systems," IEEE International Conference on Application-specific Systems, Architectures and Processors, June 2013.

[20] R. Bittner and E. Ruf, "Direct GPU/FPGA Communication via PCI Express," International Conference on Parallel Processing Workshops, Sept. 2012.