
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Development of Stockham Fast Fourier Transform using Data-Centric Parallel Programming

GABRIEL BENGTSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Development of Stockham Fast Fourier Transform using Data-Centric Parallel Programming

GABRIEL BENGTSSON

Master in Computer Science
Date: November 5, 2020
Supervisors: Artur Podobas and Steven Wei Der Chien
Examiner: Stefano Markidis
School of Electrical Engineering and Computer Science
Swedish title: Implementation av Stockhams Snabba Fouriertransform med Datacentrerad Parallel Programmering


Development of Stockham Fast Fourier Transform using Data-Centric Parallel Programming / Implementation av Stockhams Snabba Fouriertransform med Datacentrerad Parallel Programmering

© 2020 Gabriel Bengtsson


Abstract

Writing efficient scientific applications for modern High Performance Computing (HPC) systems has become much harder during the last years due to the rise of heterogeneous computer architectures. The inclusion of hardware accelerators has caused an increase in the number of technologies required to reach high performance, a change that has moved programming these systems outside the skill set of a domain scientist. The Data-Centric (DaCe) project aims to ease programming efforts and improve performance portability by separating the implementation of the program from its optimization. To do this, DaCe introduces an Intermediate Representation (IR) called the Stateful DataFlow multiGraph (SDFG), a transformable graph that can represent a wide variety of algorithms.

Currently, the DaCe framework only has limited examples, many of which are either stencil calculations or graph calculations. An evaluation of the implementation of a general-purpose algorithm in DaCe would be helpful to show the status of the framework.

The Fast Fourier Transform (FFT) is very important to modern science, as it is used, for example, to perform digital signal processing, solve partial differential equations, and carry out spectral analysis. The transform computes the frequency-space representation of a sequence of length n in O(n log(n)) operations. The FFT is a sparse factorization of the Discrete Fourier Transform (DFT), which takes O(n²) operations to perform. This reduction in required operations enabled the use of Fourier analysis in modern science.

In this thesis, we implement the Stockham FFT algorithm to evaluate the process of implementing algorithms in DaCe. The algorithm was implemented using the restricted Python frontend of the DaCe framework, then optimized and ported to use Graphics Processing Units (GPUs). In comparison to the state-of-the-art solutions, Fastest Fourier Transform in the West (FFTW) and the CUDA Fast Fourier Transform library (cuFFT), our implementation reaches a maximum/minimum of 54%/0.33% of their performance on the Central Processing Unit (CPU) and 62%/10% on the GPU, respectively, on HPC hardware.


Sammanfattning

Att skriva effektiva vetenskapliga applikationer för moderna högpresterande datorsystem har blivit mycket svårare under de senaste åren på grund av den ökade förekomsten av heterogena datorarkitekturer. Införandet av hårdvaruacceleratorer har orsakat en ökning av antalet tekniker som behövs för hög prestanda, en förändring som har flyttat programmeringen av dessa system utanför en forskares kompetens. Data-Centric Parallel Programming-projektet (DaCe) syftar till att minska programmeringsinsatsen och förbättra prestandaportabiliteten genom att separera implementationen av programmet från optimeringen. För att göra detta introducerar DaCe en mellanliggande representation kallad Stateful DataFlow multiGraph (SDFG), en transformerbar graf som kan representera en mängd olika algoritmer.

För närvarande har DaCe-ramverket bara en begränsad uppsättning exempel, varav många är antingen stencilberäkningar eller grafberäkningar. En utvärdering av implementationen av en annan algoritmtyp i DaCe skulle vara till hjälp för att visa ramverkets aktuella status.

Snabb fouriertransform är mycket viktig för modern vetenskap, eftersom den används till exempel för att utföra digital signalbehandling, lösa partiella differentialekvationer och utföra spektralanalys. Transformen beräknar frekvensrymdrepresentationen för en sekvens med längd n i O(n log(n)) operationer. Snabb fouriertransform är en gles faktorisering av diskret fouriertransform, som tar O(n²) operationer att utföra. Minskningen av antalet nödvändiga operationer möjliggjorde användningen av fourieranalys inom modern vetenskap.

I denna avhandling implementerar vi Stockhams snabba fouriertransform för att utvärdera processen att implementera algoritmer i DaCe. Algoritmen implementerades med så kallad begränsad Python i DaCe-ramverket, optimerades och porterades för att använda grafikprocessorer. Jämfört med de andra toppmoderna lösningarna Fastest Fourier Transform in the West (FFTW) och CUDA Fast Fourier Transform library (cuFFT) når vår implementering maximalt/minimalt 54%/0.33% av deras prestanda vid körning på processor och 62%/10% vid körning på grafikprocessor.


Acknowledgments

I would like to thank my supervisors Artur Podobas and Steven Wei Der Chien for their support, feedback, and help during the project. I would like to thank my examiner Stefano Markidis for the project. Additionally, I would like to thank the rest of the HPC group at CST for being nice company during the pandemic lockdown.

My family has meant everything to me during my years at KTH and especially during the writing of this thesis. Without you, this thesis would not have been possible.

This work would not have been possible without my friends, some of whom have been of extra support during the thesis work. A big thanks to Svante Rollenhagen, Helmer Nylén, Elisabet Arvidsson, Felix Liu, Morris Eriksson, and Jonas Nylund.

I would also like to thank the PDC Center for High Performance Computing at KTH and Gilbert Netzer for providing the necessary hardware for testing the performance of the software.

Stockholm, November 2020
Gabriel Bengtsson


Contents

List of Abbreviations

1 Introduction
1.1 Motivation
1.2 Research questions
1.3 Contribution
1.4 Sustainability and ethics
1.5 Outline

2 Related Works
2.1 Heterogeneous HPC hardware
2.2 Programming for HPC
2.3 The Data-Centric Parallel Programming project
2.4 Fast Fourier Transform on parallel processors

3 Background
3.1 Scientific Computing and HPC
3.2 Hardware in HPC
3.3 Programming for HPC
3.4 Data-Centric Parallel Programming
3.5 Fast Fourier Transform

4 Methods
4.1 Discrete Fourier Transform in DaCe
4.2 Stockham FFT in DaCe

5 Experimental Setup
5.1 Validation of output
5.2 Performance measurements
5.3 Discrete Fourier Transform optimization via SDFG transformations
5.4 Bottleneck analysis
5.5 Performance testing of the final program
5.6 Computer systems

6 Results
6.1 Validation of computations
6.2 Improvement of DFT performance via SDFG transformations
6.3 Bottleneck analysis of CPU implementation
6.4 Performance testing of DaCe-FFT

7 Discussion and conclusions
7.1 Discussion
7.2 Scientific closure
7.3 Conclusions
7.3.1 Summary
7.3.2 Future work

Bibliography

A Code
A.1 Code changed in DaCe

B Extra SDFG figures
B.1 DaCe-DFT
B.2 DaCe-FFT

List of Figures

3.1 Transistor count in microprocessors between the years 1971 and 2019, which shows scaling according to Moore's law. Data from [76], with additional data for the years 2018 and 2019.
3.2 A simple DaCe map before (a) and after (b) applying the MapTiling transformation.
3.3 A simple DaCe map before (a) and after (b) applying the MapReduceFusion transformation.
3.4 A simple DaCe map before (a) and after (b) applying the GPUTransformSDFG transformation. Added transient nodes are outlined in red.
3.5 The raw data (a) and frequency spectrum (b) of a recording of a person saying "Fast Fourier transform".
3.6 Dataflow graph of the Stockham FFT for the case r = 2, k = 3. Modified image from [97, 98].
3.7 Dataflow for the product with the butterfly matrix for the block-parallel (a) and vector-parallel (b) structure.
4.1 SDFG calculating the DFT of the input X.
4.2 DFT-SDFG after applying MapWCRFusion.
4.3 SDFG of a single Stockham FFT pass. During each iteration, the three operators in Equation 3.8 are applied.
6.1 Performance of the different variants of DaCe-DFT on CPU and GPU.
6.2 Performance comparison between the DaCe-DFT on CPU for a select number of regular variants and the halved variant.
6.3 Speedup comparison of the isolated DFT matrix generation between the full and halved variant.
6.4 Distribution of the time taken by the different stages in the naive and optimized DaCe implementation with an FFT of length n = 4096. The optimized is the worst case with r = 16 and the naive is the best case with r = 4.
6.5 Distribution of the time taken by the different parts of the optimized program with the input length set to 4096 and varying radix.
6.6 Performance scaling with varying number of threads for the three best and three worst implementations in DaCe-FFT on CPU.
6.7 Performance comparison between DaCe-FFT on CPU and FFTW. For DaCe, the best thread- and radix-configuration was selected for each input length.
6.8 Performance comparison with different radices in DaCe-FFT on CPU.
6.9 Performance comparison between DaCe-FFT on GPU and cuFFT. For DaCe, the best radix-configuration was selected for each input length.
6.10 Performance comparison with different radices in DaCe-FFT on GPU.
B.1 DFT-SDFG after applying MapReduceFusion.
B.2 DFT-SDFG using the BLAS GEMV code.
B.3 DFT-SDFG using the DFT-matrix symmetry.
B.4 DFT-SDFG after applying MapReduceFusion and ported to GPU using GPUTransformSDFG.
B.5 SDFG of the final version of DaCe-FFT on GPU.

List of Tables

3.1 Illustration of SDFG syntax.
6.1 Differences between the results from the DaCe CPU and GPU implementations and the reference MKL FFT function.
6.2 Best- and worst-case performance factor of the final DaCe program compared to FFTW on CPU. The performance factor is the FLOP/s value of FFTW divided by the FLOP/s value of DaCe.
6.3 Worst- and best-case performance factor of the DaCe program compared to cuFFT on GPU. The performance factor is the FLOP/s value of cuFFT divided by the FLOP/s value of DaCe.

Listings

4.1 Annotated implementation of double-precision DFT in DaCe.
4.2 BLAS GEMV code replacement for Listing 4.1.
4.3 DFT matrix mirroring code replacement for Listing 4.1.
4.4 Stockham DaCe program signature.
4.5 DaCe code setting up the DFT matrix, generating indices, and moving data for the DaCe-FFT.
4.6 Looping over the Stockham passes in DaCe.
4.7 Calling our stockhamFFT DaCe program from within Python.
4.8 Code for performing the vector-parallel product with the butterfly matrix by using library BLAS GEMM.
A.1 C++ template code that adds atomic addition for double-precision complex numbers on GPU for DaCe.

List of Abbreviations

API  Application Programming Interface
ASIC  Application-Specific Integrated Circuit
BLAS  Basic Linear Algebra Subprograms
CLI  Command Line Interface
CPU  Central Processing Unit
cuBLAS  CUDA Basic Linear Algebra Subroutine library
CUDA  Nvidia CUDA
cuFFT  CUDA Fast Fourier Transform library
DaCe  Data-Centric
DAPP  Data-Centric Parallel Programming
DFT  Discrete Fourier Transform
DIODE  Data-centric Interactive Optimization Development Environment
DSL  Domain-Specific Language
FFT  Fast Fourier Transform
FFTW  Fastest Fourier Transform in the West
FLOP/s  FLoating Point Operations per Second
FPGA  Field-Programmable Gate Array
FT  Fourier Transform
GEMM  GEneral Matrix-Matrix multiplication
GEMV  GEneral Matrix-Vector multiplication
GPU  Graphics Processing Unit
HIP  Heterogeneous-compute Interface for Portability
HPC  High Performance Computing
IC  Integrated Circuit
IDE  Integrated Development Environment
IR  Intermediate Representation
MIMD  Multiple Instructions, Multiple Data streams
MKL  Intel Math Kernel Library
ML  Machine Learning
MPI  Message Passing Interface
OpenACC  Open Accelerators
OpenCL  Open Computing Language
OpenMP  Open Multi-Processing
SDFG  Stateful DataFlow multiGraph
SIMD  Single Instruction, Multiple Data streams
SIMT  Single Instruction, Multiple Threads
SISD  Single Instruction, Single Data stream
TPU  Tensor Processing Unit
WCR  Write-Conflict Resolution


Chapter 1

Introduction

1.1 Motivation

During the last few years, heterogeneous hardware has become the norm in HPC systems [1], resulting in many modern computer systems using hardware accelerators in conjunction with the classic CPU. Currently, GPUs are used as accelerators in the majority of the top ten fastest supercomputers in the world [2], and other accelerators such as many-core processors [3], Field-Programmable Gate Arrays (FPGAs) [4, 5], and Application-Specific Integrated Circuits (ASICs) [6] are becoming more and more common. As the free performance gains from transistor shrinking have ended [7, 8], we should expect hardware in HPC systems to become more heterogeneous [1] to keep up with the growing demands for computational power.

Many times, HPC programmers sacrifice ease of coding and portability between computer systems to reach the highest performance possible [9]. However, the complexity of the code changes required to reach the sought-after performance on heterogeneous platforms has increased rapidly during the last few years, as more types of hardware have become accessible. This complexity spike has made program development too difficult for domain scientists [9].

Many different methods have been attempted to alleviate the problems of portability and ease of programming, the most common being standardized Application Programming Interfaces (APIs) that have been implemented on a wide variety of hardware and systems. Common examples are Open Multi-Processing (OpenMP) [10] for multi-core parallel processing, Message Passing Interface (MPI) [11] for message-passing between nodes, and Basic Linear Algebra Subprograms (BLAS) [12] for performing basic linear algebra. However, using standardized programming interfaces usually requires rewriting the original code or using compiler directives, which may cause bugs or decrease the readability of the code. There are also programming languages that have attempted a top-down approach to the problem with an integrated solution; examples are Chapel [13], Kokkos [14], and Halide [15]. Here, however, the problem is that these languages have a rather steep learning curve, are rather restrictive, and require a lot of implementation work for specific functions.

As the heterogeneity of hardware increases, so does the complexity of writing high-performance, portable code. Some examples of technologies required for programming GPUs are Nvidia CUDA (CUDA) [16] for programming Nvidia GPUs, Open Computing Language (OpenCL) [17] for programming other GPUs, and Open Accelerators (OpenACC) [18] for programming GPUs via compiler directives. The combined complexity of all of these technologies has made it unfeasible for domain scientists to use modern hardware efficiently by themselves, as this is a full-time job. Today, domain scientists write applications for some scientific calculation, which a performance engineer then optimizes by applying the technologies mentioned earlier. A performance engineer is a person with wide knowledge of optimization and tuning techniques for computing software. The resulting program is usually much faster and scales better on a certain platform but is usually difficult to understand.

The Data-Centric Parallel Programming (DAPP) [9] project aims to solve these problems by allowing scientists to write simple code in a language like Python [19], which can then be optimized separately without changing the original code. To achieve this, DAPP introduces the DaCe environment with the SDFG, a graph-based IR that can be edited and transformed without changing the original code. The strength of the SDFG is that it combines fine-grained data access control with large-scale parallelism, enabling data access analysis while still allowing simple parallelism. With DaCe, simple Python code can be compiled to an SDFG, which can then be modified to enhance performance and/or support hardware accelerators. This means that the previously mentioned standardized APIs no longer need to be present in the code written by the scientist; they are all moved to the SDFG stage.
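As a concrete illustration of this workflow, consider the following minimal sketch. It is our own example, not code from this thesis, and the names (scale, N) are hypothetical; the calls follow the public DaCe Python frontend, whose details may differ between versions:

    import dace
    import numpy as np

    N = dace.symbol('N')  # symbolic size, resolved when the program is called

    @dace.program
    def scale(X: dace.float64[N], Y: dace.float64[N]):
        # dace.map becomes a parallel map scope in the resulting SDFG
        for i in dace.map[0:N]:
            Y[i] = 2.0 * X[i]

    sdfg = scale.to_sdfg()   # obtain the transformable graph IR
    x = np.ones(1024)
    y = np.zeros(1024)
    scale(X=x, Y=y)          # compiles the SDFG to a C++ library and runs it

A performance engineer can then transform sdfg, for example by tiling the map or targeting a GPU, without touching the short program above.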

The FFT algorithm [20] has been cited as one of the ten most important algorithms in science and engineering during the 20th century [21]. Transforming a sequence of length n from an original domain, such as the time domain, to the frequency domain in O(n log(n)) operations, the FFT enables applications such as digital signal processing, multimedia compression, and efficient solving of partial differential equations [22]. The FFT is a sparse factorization of the DFT algorithm and is straightforward to implement in code. However, getting the code to perform well on modern hardware is not as straightforward and requires many optimizations to use caches, vector instructions, and multi-threading efficiently [23].
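To make the complexity gap concrete: the DFT evaluates X_k = sum_{j=0}^{n-1} x_j * exp(-2*pi*i*j*k/n) for every k, an O(n²) procedure when computed directly, while an FFT reaches the same result in O(n log(n)) operations. A small illustrative sketch in Python/NumPy (our own example, not part of the thesis code):

    import numpy as np

    def naive_dft(x):
        """Direct O(n^2) evaluation of the DFT as a matrix-vector product."""
        n = len(x)
        j = np.arange(n)
        W = np.exp(-2j * np.pi * np.outer(j, j) / n)  # the n-by-n DFT matrix
        return W @ x

    x = np.random.rand(1024) + 1j * np.random.rand(1024)
    # The FFT computes the same transform in O(n log n) operations
    assert np.allclose(naive_dft(x), np.fft.fft(x))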

In an effort to test the DaCe framework, we implement the Stockham FFT algorithm as an SDFG and transform it to use a GPU for acceleration. This test is part of the evaluation needed to determine whether DaCe is ready for general algorithm implementation on HPC systems. The objective of this work is to investigate whether DaCe can lower the time taken to develop performance-portable software that can run on heterogeneous hardware configurations.

1.2 Research questions

In this section, we outline the three research questions of this work, which test the hypothesis that DaCe can lower the time taken to develop performance-portable software that can run on heterogeneous hardware configurations.

How do we express algorithms such as the FFT in DaCe? As DaCe is a new research project, it has only been tested with a limited number of problem types, such as matrix and graph calculations. This means that it is not certain that the SDFG will be able to accommodate all types of algorithms, possibly due to the limitations on memory accesses inside calculations. We evaluate whether the framework can represent an FFT by implementing one of the most common iterative FFT algorithms, the Stockham algorithm, and comparing the validity of the results against state-of-the-art solutions.

How do we port DaCe-FFT to GPU? As the core focus of DaCe is performance portability on heterogeneous HPC systems, we would like to study how much effort is needed to port an FFT implementation to a CUDA-enabled GPU. This evaluation is important as the porting procedure is usually time-consuming, error-prone, and intrusive when done manually. We evaluate this by porting our implementation of the FFT to use a GPU and reporting the time taken as well as all problems that arise during the process.


How do we improve the performance of DaCe-FFT? A naive implementation of the Stockham FFT algorithm results in poor performance compared to modern state-of-the-art implementations that use specialized routines for multi-threading, vectorization, and/or GPU support. We would like to study whether the transformations available in DaCe are enough to achieve levels of performance comparable to other state-of-the-art implementations. To evaluate the relative performance of our implementation against the state-of-the-art, we implement a simple benchmark routine for both CPU and GPU.
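The benchmark itself is described in Chapter 5. As background on the general approach: FFT benchmarks conventionally convert run time into FLOP/s using the nominal count of 5*n*log2(n) floating-point operations per complex transform of length n (the convention used by FFTW's benchmark suite; we use it here purely for illustration). A minimal sketch of such a measurement, with NumPy's FFT standing in for the library under test and fft_gflops being a hypothetical name of our own:

    import time
    import numpy as np

    def fft_gflops(n, reps=100):
        """Report GFLOP/s using the nominal 5*n*log2(n) flop count per FFT."""
        x = np.random.rand(n) + 1j * np.random.rand(n)
        t0 = time.perf_counter()
        for _ in range(reps):
            np.fft.fft(x)
        t = (time.perf_counter() - t0) / reps  # average seconds per transform
        return 5 * n * np.log2(n) / t / 1e9

    for n in [2**10, 2**14, 2**18]:
        print(n, fft_gflops(n))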

1.3 Contribution

Our contributions in this work are as follows. First, we evaluate whether the DaCe framework can represent the FFT by implementing the DFT as well as the Stockham FFT in DaCe. Second, we use DaCe transformations to port both implementations to CUDA-enabled GPUs. To evaluate our work, we implement benchmarks to compare our solutions to the current state-of-the-art on CPU and GPU.

1.4 Sustainability and ethics

Sustainability

This thesis solely considers software development and does not interact with the physical world except for the consideration of computer hardware, and as such does not have a direct effect on the environment.

The thesis focuses on testing a framework for the development of scientific software for large heterogeneous HPC systems. The framework could enable more efficient use of supercomputers in the future and could thereby reduce the electricity usage of large calculations, as showcased in the optimization of an ab initio quantum transport solver using DaCe [24, 25]. However, an increase in availability could paradoxically increase the number of calculations that are run and, as such, increase the total electricity use. A deeper analysis is left for future work.


Ethics

This work does not contain any personal or confidential material, except for Figure 3.5, which shows a recording of the author's voice.

The software developed in this thesis calculates a simple mathematical algorithm and cannot be used for nefarious purposes by itself.

1.5 Outline

This thesis is organized as follows. In Chapter 2, we present other works that cover the topics of hardware in modern HPC, programming for HPC, the DAPP project, and FFTs on parallel processors. Then, in Chapter 3, we present a detailed background on the problems with modern hardware in HPC, introduce the DaCe framework, and introduce the theory behind the FFT. In Chapter 4, we describe the steps of the software development of a DFT and an FFT to show how the DaCe framework is used. In Chapter 5, with the software in place, we explain how we verify the results and measure the performance of our software. In Chapter 6, we present the results. Finally, in Chapter 7, we discuss the performance of the implementation, draw conclusions from the work, and discuss the future of the DaCe framework.


Chapter 2

Related Works

In this chapter, we provide a brief overview of other works related to this thesis. First, we present the current situation of hardware in HPC systems. Then, we present some of the programming models and frameworks used in HPC application development. Next, we compare the DaCe framework to other similar frameworks. Finally, we discuss state-of-the-art FFTs on parallel and heterogeneous HPC hardware.

2.1 Heterogeneous HPC hardware

Hardware in modern supercomputers is heterogeneous [1], with hardware accelerators being used in 7 out of 10 of the fastest supercomputers in the world [2], 6 of which use GPUs [16]. Besides GPUs, HPC systems use a plethora of other accelerators, such as many-core processors [26], ASICs [27], and FPGAs [28]. GPUs have been successful by focusing on large numbers of slow and simple cores, aiming for high throughput. This approach provides higher total performance together with better performance per watt compared to CPUs, which have been crippled by the end of Dennard scaling [7], as the power consumption of transistors no longer scaled with size, and the apparent end of Moore's law [29], as more and more issues with transistor shrinking arise. The downside of GPUs is that they only perform well in highly parallel situations, as their single-thread performance is very low. Many-core processors are slow CPUs with many cores and are comparable to GPUs while maintaining better single-core performance. One advantage is that many-core processors often use the same programming models as multi-core CPUs, enabling easier porting of existing code designed for CPUs. One of the most common many-core processors is the Intel Xeon Phi [3], an x86 accelerator with up to 72 processor cores and support for vector instructions. Another example of a many-core processor is the Matrix-2000 processor with 128 cores, an accelerator used in China's Tianhe-2A supercomputer [30]. A recent trend in Machine Learning (ML) is to use ASICs [1], accelerators specialized in one specific type of computation, to achieve higher performance and power efficiency. An example of an ASIC used for ML is Google's Tensor Processing Unit (TPU) [6], enabling very fast processing of neural networks. The supercomputer Anton [31] used ASICs for calculations on molecular dynamics. Even though the setup cost of ASIC systems is very high, we should expect more widespread use of them in the future [1]. Another technology that could find use in HPC with more research and development is the FPGA, a re-programmable circuit, like an ASIC but less efficient. FPGAs have been used for HPC in the past, for example in the JANUS systems [32].

2.2 Programming for HPC

In terms of programming, HPC systems are commonly divided into two categories: shared memory systems and distributed memory systems [33]. A shared memory system is a system where all processor cores access the same memory, while a distributed memory system is a system where different processors have their own memory and data needs to be explicitly passed between processes. Programming the two types requires different methods for efficient use of hardware. A common structure for modern CPU-based HPC systems is to program one node as a shared memory system and then combine several nodes as a distributed system. However, with the advent of heterogeneous hardware, many of the components in a node have memory structures of their own, causing each node to become a distributed system on its own, complicating the programming [9].

When programming shared memory systems, standardized APIs like OpenMP [10] or BLAS [12, 34] are often used. Standardized APIs separate the usage from the implementation, which enables porting code to other platforms [35], under the assumption that the APIs are widely adopted. Furthermore, the API implementations are often highly optimized and tuned for the specific hardware, enabling higher performance. OpenMP provides easy multi-threading of common processing patterns in HPC applications by forking and joining threads over code sections marked with compiler directives. The OpenMP standard is implemented in a wide variety of compilers such as GCC [36] and the Intel C/C++ compiler [37]. BLAS is a specification of routines performing general linear algebra operations, such as matrix-matrix multiplication, and is optimized for many platforms. Some examples of BLAS implementations are the Intel Math Kernel Library (MKL) [38] and OpenBLAS [39].

Distributed computing requires communication between nodes to perform calculations, except for embarrassingly parallel problems. The routines defined by the MPI [11] standard can be used to perform the communication between processes. MPI provides support for virtual topologies, synchronization, and communication between any number of processes, simplifying complicated communication patterns. Since MPI is a standard, it provides a platform-agnostic interface for communication, enabling easier porting between systems. MPI-IO [40] provides routines to perform parallel I/O operations on a single file from multiple processes. Besides vendor-specific libraries, OpenMPI [41] and MPICH [42] are the two major implementations.

Hardware accelerators often have their own memory, requiring data to be explicitly passed between different processors, much like in distributed systems. Additionally, accelerators often have their own unique programming language dialects, such as CUDA [16]. GPUs are programmed using CUDA [16], OpenCL [17], OpenACC [18], or the Heterogeneous-compute Interface for Portability (HIP) [43]. Many-core processors, even though they share instruction sets with CPUs, still require hardware-specific tuning to reach peak performance [44]. FPGAs are programmed with a hardware description language such as Verilog [45] or VHDL [46], or, more recently, with OpenCL [47]. ASICs are not programmed but rather designed with VHDL or Verilog [48].

2.3 The Data-Centric Parallel Programming project

DaCe is a software framework developed by ETH Zürich which targets scientific computing [9]. The project aims to decouple program definition from optimization by introducing the Stateful DataFlow multiGraph (SDFG), a human-readable Intermediate Representation (IR) structured as a directed graph of directed acyclic multigraphs that combines state programming (the directed graph) with dataflow programming (the directed acyclic multigraphs) [49]. The DaCe environment supports the compilation of SDFGs from several common programming languages such as Python [19] or MATLAB [50], as well as compiling SDFGs into C++ libraries. The SDFG can be modified by the user, either by changing settings on individual graph nodes or by applying transformations that use pattern matching to change the structure of the graph. The design of the SDFG allows for the representation of a wide variety of programs together with non-invasive optimization and tuning, enabling better performance portability [9]. The C++ code generated by DaCe uses OpenMP, MPI, BLAS, CUDA, or HIP to achieve high performance in addition to performance portability.

DaCe combines innovations from projects in several different areas and refines them to enable the malleable and flexible SDFG. The complete separation of programming from optimizing application code can be seen in CHiLL [51] and Halide [15]. CHiLL enables optimization of high-level loops written in C/C++, Fortran, or CUDA via Python scripts, while Halide introduces a Domain-Specific Language (DSL) for image processing that can be optimized via scheduling commands. Some of the available optimizations are loop unrolling, order permutation, and tiling. DaCe combines this separation with a graph-based IR, as do the LLVM-based [52] HPVM [53] and MLIR [54], IRs that aim to unify heterogeneous parallel platforms by providing a common interface for compiler optimization. MLIR is a compiler infrastructure focused on machine learning that was originally a part of the TensorFlow project; it can currently be used to optimize ML models and to connect additional hardware to the TensorFlow ecosystem. HPVM is a dataflow graph IR combined with a virtual instruction set that supports shared memory and vector instructions. The dataflow graph enables graph optimizations while the instruction set enables better performance portability between heterogeneous systems.

2.4 Fast Fourier Transform on parallel processors

The FFT transforms a sequence of length n to the frequency domain in O(n log(n)) operations. The FFT is commonly used for digital signal processing and media compression but is also one of the most important algorithms in modern scientific computing [22] and is used in GROMACS [55]. While a naive implementation of the Cooley-Tukey [56] or Stockham algorithm [20] is rather straightforward, it will not reach peak performance on modern hardware, as it will not utilize cache memory, multi-core processors, or vector instructions [57]. Some of the most widely used FFT frameworks for HPC are FFTW [58], SPIRAL [59], and MKL's [38] FFT routine. FFTW was originally developed at MIT, with the first release in 1997. Developed in C, FFTW supports shared memory and distributed memory systems for arbitrarily sized, multi-dimensional inputs. FFTW is optimized for memory hierarchies and vectorized instructions but does not support the use of accelerators. SPIRAL is a library designed to automatically generate efficient code for linear transformations from a high-level mathematical algorithm specification. SPIRAL can be used for generating efficient FFT codelets with support for multi-threading, vectorization, or GPUs [57]. The MKL FFT routines are proprietary and, as such, little is known about the structure of the code.
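To show what "straightforward" means in practice, here is a compact radix-2 Stockham kernel in Python/NumPy. This is our own illustrative sketch, not the thesis implementation (which is a radix-r DaCe program, see Chapter 4), and the name stockham_fft is ours. The defining feature of the Stockham variant is visible in the code: by ping-ponging between two buffers, it produces output in natural order, so no bit-reversal pass is needed.

    import numpy as np

    def stockham_fft(x):
        """Radix-2 Stockham autosort FFT; out-of-place, no bit-reversal."""
        n = len(x)
        assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
        a = np.array(x, dtype=complex)
        b = np.empty_like(a)
        l, s = n, 1                 # l: butterfly span, s: number of sub-sequences
        while l > 1:
            m = l // 2
            for p in range(m):
                w = np.exp(-2j * np.pi * p / l)           # twiddle factor
                u = a[s * p : s * p + s]                  # first butterfly input
                v = a[s * (p + m) : s * (p + m) + s]      # second butterfly input
                b[s * 2 * p : s * 2 * p + s] = u + v
                b[s * (2 * p + 1) : s * (2 * p + 1) + s] = (u - v) * w
            a, b = b, a             # swap buffers; output stays auto-sorted
            l, s = m, 2 * s
        return a

    x = np.random.rand(1024) + 1j * np.random.rand(1024)
    assert np.allclose(stockham_fft(x), np.fft.fft(x))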

There are multiple implementations of the FFT on GPU, most notably the one provided by Nvidia, cuFFT [60]. Other implementations include the OpenCL-based one provided by AMD [61] and SPIRAL. There have been studies on optimizing FFTs using CUDA [62] and on auto-tuning frameworks for three-dimensional FFTs in CUDA [63]. More recent works include an implementation of both the Cooley-Tukey and Stockham FFT algorithms on GPU [64] using the LIFT programming language [65], as well as the FFTX project, which aims to combine the strengths of SPIRAL and FFTW to bring FFTs to heterogeneous exascale supercomputers [66].


Chapter 3

Background

This chapter presents the theoretical background of the concepts used in the thesis. First, we give a short introduction to Scientific Computing and HPC. Then, we describe the current state of hardware in HPC. We follow by describing the problems of programming computers for HPC and some of the previous solutions to these problems. We then describe the DaCe framework in depth and give the background for its components. Finally, we give a mathematical description of the FFT algorithm.

3.1 Scientific Computing and HPC

Scientific Computing is an interdisciplinary scientific field that combines computer science and applied mathematics to solve problems in separate scientific fields, usually via simulations [67]. Since the problems are wide in scope, several scientists with expertise in different areas need to co-operate to create an accurate simulation that completes within a reasonable time. Some common problems studied in Scientific Computing are weather forecasting [68], computational fluid dynamics [69], and molecular dynamics [55].

The field of High Performance Computing (HPC) focuses on the effective usage of aggregations of computers to solve problems that are unfeasible to solve on a normal desktop machine [70]. These machines may be massive supercomputer clusters with over 200 000 CPU cores and 25 000 GPUs [71] or smaller machines the size of a normal desktop computer [72]. Some of the areas researched in the field of HPC are the usage of hardware and memory on a single machine, communication between nodes, data storage and retrieval, and the programmability of complex computer systems.

3.2 Hardware in HPC

To understand the problems of programming modern HPC applications for heterogeneous systems, we must understand the different types of processors that are used. In this section, we describe the background of the CPU, GPU, FPGA, and ASIC, respectively, from the perspective of parallel processing. To explain the different types of processors we use Flynn's taxonomy [73], with the inclusion of Single Instruction, Multiple Threads (SIMT) to cover GPUs.

Central Processing Unit (CPU)

Traditionally, CPUs worked sequentially and executed one instruction on one stream of data at a time, and were classified as Single Instruction, Single Data stream (SISD) machines. The aim is to minimize the time between starting and finishing a single task. To improve processing speeds further, processors in supercomputers began introducing vector operations, where one instruction is executed on several data streams at once, classified as Single Instruction, Multiple Data streams (SIMD). SIMD processing is the standard in modern processors, with AVX2 [74, 75] occurring in all mainstream processors from Intel and AMD. There is more functionality in a processor to speed up processing, such as pipelining and prefetching, but that does not change the core idea.

For many years, between the 1940s and 2000s, performance in computers was gained by exponential growth of clock frequencies in CPUs, meaning more calculations per second. This growth was enabled by Dennard scaling [7], which meant that the power density of transistors stayed constant: smaller transistors required less power, which allowed for higher frequencies. Unfortunately, the trend stopped in 2004 [1], and smaller transistors consumed the same amount of power, leading to the so-called "power wall": exponential growth in power and heat production with increasing frequency.

The slowdown in clock frequency pushed the CPU market towards multi-core CPUs, as having multiple, but slower, processor cores still allows for raw performance gains. With the combination of vector instructions and a multi-core structure, CPUs are classified as Multiple Instructions, Multiple Data streams (MIMD) processors. This process of increasing the number of cores follows Moore's law [29], which is an observation that the number of transistors in an Integrated Circuit (IC) will double approximately every two years. This has held for almost 50 years straight, as seen in Figure 3.1. While this gain in raw numbers of transistors increases the raw performance, many problems are more difficult to program for efficient execution on parallel processors. Additionally, the power scaling issues have not gone away and have led to issues with heat dissipation in multi-core processors if all transistors are active at the same time, forcing CPUs to only activate a limited area of the chip at any time, causing so-called "dark silicon" [77].

[Figure 3.1: Transistor count in microprocessors between the years 1971 and 2019, showing scaling according to Moore's law. Data from [76], with additional data for the years 2018 and 2019.]

The future of the CPU as the lone processor in computers is uncertain; more and more problems arise with the shrinking of transistors, and it is believed that smaller transistors will soon become impossible due to electrons experiencing quantum tunneling between separate transistors [8]. This will most likely lead to the end of Moore's law [8], as factors such as the speed of light limit the size of a processor die, further limiting the natural performance growth of CPUs. What this means for computer hardware is that, to keep improving computational speeds, the computing community needs to look towards other solutions than solely using CPUs. In practice, this could imply more and more heterogeneous computing platforms [1] if hardware engineers do not find alternative solutions.


Graphics Processing Unit (GPU)

A recent trend in computing is to use GPUs for performing calculations as CPU performance improvements are stagnating. A GPU works by having massive amounts of smaller, simpler cores paired with a large memory bus to RAM. For example, the latest Nvidia GPU designed for HPC, the A100, will have 6912 cores [78] for doing floating-point operations, while the largest CPUs are in the 64-core range [79]. GPUs work by running each sequence of calculations as a separate thread, with several threads being run in lockstep. A simple explanation is that each iteration of a for-loop would be turned into a thread, and then the GPU would process as many threads at once as possible. This structure can be classified as Single Instruction, Multiple Threads (SIMT), as many threads are executing the same instruction simultaneously. Instead of aiming for fast completion of every single task like a CPU, a GPU core can be 10 to 100 times slower at finishing a single task [80] compared to a CPU core but compensates with parallelism. The result is a high-throughput but high-latency model, which is vastly different from the traditional CPU, which has been the model for learning programming since the 1950s, leading to a gap in programmer knowledge.

GPUs started out only handling calculations related to computer graphics, which are simple and inherently parallelizable. The development towards the current general-purpose GPUs began with the advent of so-called programmable shaders, where the programmer could control the processing of the GPU cores [81]. The large possibilities, however, came with the release of CUDA [16], where the programmer uses common HPC methods to program calculations on the GPU. At the time of writing this thesis, the two major frameworks for writing general-purpose programs that run on GPUs are CUDA and OpenCL. CUDA is limited to Nvidia-specific cards, but OpenCL [17] is available on most GPUs. However, in an effort to replace the proprietary solutions on the market, AMD has developed the ROCm open-source software development framework [82] for connecting different frameworks with support for hardware accelerators. As an alternative to the CUDA programming model, ROCm offers HIP [43], a C++ language that is a subset of CUDA, which can be compiled for both AMD and Nvidia GPUs. The Frontier supercomputer will use AMD GPUs and be programmed using HIP [83].


Field-Programmable Gate Array (FPGA)

Field-Programmable Gate Arrays (FPGAs) [28] are re-configurable Integrated Circuits (ICs), in which re-programmable logic blocks can be connected using an interconnect. By reprogramming the blocks and connecting them in different patterns, the programmer can create complex logic flows. Today, most FPGAs also include dedicated circuits for digital signal processing and dedicated memory to provide higher computational power. FPGAs are not common in HPC today, as they are expensive in terms of absolute performance [84] and can be difficult to use efficiently [85]. However, recent development has enabled high-level synthesis to provide easy compilation of programming languages like OpenCL to FPGA setups [86], which could enable more widespread use. Recently, two supercomputers that use FPGAs have been built: the Cygnus [4] computer is used to simulate the early stages of the universe, and the Noctua [5] computer is used for materials engineering besides research on FPGA use in HPC.

Application-Specific Integrated Circuit (ASIC)

Application-Specific Integrated Circuits (ASICs) [27] are ICs that are specialized for one specific use, such as digital signal processing or cryptographic hashing [87]. As ASICs are specialized for one specific task, they achieve very high energy efficiency, a feature that is highly sought after in HPC. The downside of ASICs is that they have very long lead times in production, as the designs are different and unique, meaning that the use of ASICs is limited to situations where they are needed in large numbers or provide large performance gains.

3.3 Programming for HPC

Traditionally, programming languages have been designed for formalizing serial logic which executes one task at a time on a single computer system [88]. This means that most programmers only learn how to program serially. As most applications written while learning how to program are not performance-critical, programmers do not learn how to utilize computer hardware efficiently. However, in HPC, virtually all systems are parallel, and many times also distributed, leading to a large gap in knowledge.

Many scientific programs are created by domain scientists with limited knowledge of HPC, who write a simple program that solves the problem at hand. This program is probably slow and inefficient, as it is hard to be proficient enough in both the original field and in HPC to be able to perform the optimizations needed to reach maximum utilization of the hardware. The number of currently available approaches, such as OpenMP, BLAS, CUDA, and MPI, is too wide for a domain scientist to fully understand and use.

To solve the performance and scaling issues of scientific applications, the role of the performance engineer was created [9]. A performance engineer optimizes the program written by the domain scientist by introducing APIs such as OpenMP, BLAS, CUDA, or MPI to the code via rewrites or compiler annotations. The performance engineer knows how to optimize an application to use hardware efficiently while still being able to scale; some of the optimizations performed might be loop tiling and buffering. These changes are, however, often intrusive and change the original code written by the domain scientist, meaning that if the domain scientist needs to change or improve some part of the original logic, that part might have changed significantly. The domain scientist changing the performance engineer's code might introduce bugs or impact the performance, and once again the performance engineer needs to work on the code.

This development has been accelerated by the increasing heterogeneity of modern HPC hardware. Accelerators such as GPUs, FPGAs, and ASICs require special programming to achieve peak performance. There have been many attempts to solve this by introducing libraries that simplify the programming via different measures. The compiler-directive approach of OpenMP and OpenACC can be very effective at enabling easy conversion of code to use multiple threads or GPU resources, but may easily become inefficient and, as the specifications grow, too complicated. Designated frameworks or languages like Kokkos [14], Chapel [13], and Tiramisu [89] can be useful for creating applications that use hardware efficiently but can restrict the user to specific use cases.

Below, we explain OpenMP and CUDA further from the perspective of thiswork.

OpenMP

The OpenMP API is a portable programming model for using shared memory systems by introducing a standard for several compiler directives that can parallelize common calculations or tasks. OpenMP is controlled using preprocessor pragmas in C/C++, written as #pragma. A very common operation that is used in this work is to parallelize a for-loop using the #pragma omp parallel for pragma to create code for thread handling and work distribution; the result is that the different iterations of the for-loop are distributed between the threads. As data races are an issue in shared memory systems, OpenMP implements different data clauses to handle possible problems. In this work, both #pragma omp critical and #pragma omp atomic are used in DaCe to make certain summation updates thread-safe. The critical pragma limits execution so that only one thread may perform a statement at once, making the other threads wait, while the atomic pragma makes OpenMP implement the statement using atomics, so that the waiting is individual to each memory location and thread.
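As a hedged sketch of where these pragmas come from in this context (our own example; the exact code DaCe emits depends on its version and configuration), summing an array inside a dace.map produces conflicting writes from the parallel iterations, which DaCe marks with a Write-Conflict Resolution (WCR) and lowers to an atomic or critical update in the generated OpenMP code:

    import dace
    import numpy as np

    N = dace.symbol('N')

    @dace.program
    def total(X: dace.float64[N], result: dace.float64[1]):
        for i in dace.map[0:N]:
            # All parallel iterations write to result[0]; DaCe resolves the
            # conflict with a sum WCR, emitted as an atomic/critical update.
            result[0] += X[i]

    x = np.ones(1000)
    res = np.zeros(1)
    total(X=x, result=res)  # res[0] == 1000.0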

CUDA

CUDA is a software platform and API for parallel programming on Nvidia GPUs. It works by enabling users to interface CUDA C/C++ or CUDA Fortran from regular C, C++, or Fortran. The CUDA code is written in so-called compute kernels that are compiled to GPU code and can then be launched from the CPU to execute on data located in GPU memory. As CUDA can only perform calculations on data located in GPU memory, the program needs to copy data to/from the GPU before and after each computation. CUDA compute kernels are run in parallel by up to several thousand threads at once, each running the kernel code by itself. A modern Nvidia GPU contains thousands of CUDA cores that execute CUDA threads in groups of 32 in lockstep, resulting in a SIMD structure with a width of 32 elements.

3.4 Data-Centric Parallel Programming

Data-Centric (DaCe) is an open-source framework aimed at scientific computing, developed by the SPCL team at ETH Zürich. The goal of DaCe is to decouple domain science from implementation and performance optimization by introducing a new way to represent scientific programs. The first focus is to prevent scientific software from becoming too complex for domain scientists to understand, by separating the problem solving from the performance-improving modifications. The second is that it simplifies performance portability by not limiting implementations to a specific platform and by having a clear focus on parallel execution. The last feature is that it enables the user to identify regions in the program where unnecessary data transfers are carried out and provides easy methods to mitigate these inefficiencies.


DaCe allows a domain scientist to write a program that solves a problem in a common programming language like a restricted subset of Python or a DSL like MATLAB. DaCe then compiles this code to a so-called SDFG, a graph that can represent a wide range of operations while still maintaining fine-grained data access control together with a coarse parallel structure. The SDFG is human-readable and can be modified without changing the original code, allowing a performance engineer to optimize and tune a program without disturbing the domain scientist, who can, in turn, make changes to the original code without disturbing the optimizations. The DaCe framework is used either via a web interface or directly via the Python API.

The DaCe framework was used to implement a massively parallel ab initio quantum transport solver [24, 25], which increased the simulation size by a factor of 10× and reduced the run-time by a factor of 14× compared to the state-of-the-art on the same hardware. In 2019, this effort was awarded the Gordon Bell Prize, given for "Outstanding achievement in high-performance computing applications".

Stateful DataFlow multiGraph

The Stateful DataFlow multiGraph (SDFG) [9] is a directed graph of directed acyclic multigraphs, where the directed graph can represent stateful and cyclic program logic while the directed acyclic multigraphs are dataflow representations of data movement and computations. By combining the two, many different programs can be represented with the SDFG. The SDFG can be modified and changed without changing the original code and is as such an IR. The modification of an SDFG is usually done via transforms, where a subgraph is either changed or replaced using a pattern-matching algorithm. A JavaScript frontend called Data-centric Interactive Optimization Development Environment (DIODE) facilitates modification, testing, and benchmarking of SDFGs and allows a performance engineer to manually tune the SDFG for a specific platform. An SDFG can be compiled into a C++ library for execution on an HPC system.

The SDFG uses dataflow programming [49] to describe computations inside states. By using graph nodes to represent basic arithmetic operations and directed edges to represent data dependencies, dataflow programming can encapsulate many computations. Originally, dataflow programming was created to avoid the limitations of the von Neumann architecture of CPUs by creating a new type of computing hardware, called dataflow computers, with limited success. Instead of running on special hardware, dataflow programs were executed on regular CPUs and turned out to be very inefficient at a very fine level of detail, due to large overheads for each calculation. However, with a coarser granularity of the operations, where blocks of regular code such as C act as nodes, high performance can be reached. The dataflow structure with strict data dependencies enables analysis of the possibilities for parallel execution of code. However, there are downsides to pure dataflow programming, such as the lack of state and problems with implementing iterative methods. DaCe solves both problems by adding states to the SDFG, enabling iterative methods.

The SDFG builds upon a model called data-centric programming, which is a combination of concepts defined by Ben-Nun et al. [9] as

1. Separating Containers from Computation: Data-holding constructs with volatile or non-volatile information are defined as separate entities from computations, which consist of stateless functional units that perform arithmetic or logical operations in any granularity.

2. Dataflow: The concept of information moving from one container or computation to another. This may be translated to copying, communication, or other forms of movement.

3. States: Constructs that provide a mechanism to introduce execution order independent of data movement.

4. Coarsening: The ability to view parallel patterns in a hierarchical manner, e.g., by grouping repeating computations.

These concepts are implemented as different types of nodes, edges, and commands in the SDFG model, with the ones used in this work displayed in Table 3.1. Below, we give a brief introduction to the types used in this work.

Data and computations

Data nodes are containers in the graph representing an N-dimensional array. These nodes can either be permanent input/output to/from the SDFG or transient, which means that they are temporary variables. The data nodes can represent data in different types of computer memory, from heap memory on the CPU to shared memory on a GPU.

Table 3.1: Illustration of SDFG syntax. (The graphical symbols are omitted here; the table lists the notation for input/output and transient data nodes, memlets and memlets with WCR, tasklets, map entry/exit nodes, reduce nodes such as "Reduce (Sum), Axes: [1]", and states such as the loop guard.)

Memlets are the edges between the nodes in the SDFG and describe data access in the graph. Data nodes cannot be accessed without memlets, a design decision that enables analysis of data movements inside the program. Accessing an element a at index i in the array A can be written as a << A[i]. To specify an output, >> is used.

Write-Conflict Resolutions (WCRs) are lambda functions that extend memlets to provide a way to define what action to take when several writes are performed on the same location. A WCR does not necessarily imply concurrent writes and is usually implemented using atomic operations in case of possible conflicts.

Tasklet nodes contain stateless computing functions that operate on memory using memlets. Tasklets are immutable to prevent the performance engineer from changing computational semantics. Tasklets can be defined in several different programming languages, mainly Python, and are compiled to C++ during SDFG compilation.

Parallelism and states

Map nodes define regions of parallel execution of tasklets by enclosing the tasklet between two scope nodes. Maps can represent the execution of tasklets over several dimensions and can be nested multiple times. When a tasklet is parallelized with a map, all the tasklet's connections go through the map, enabling detailed analysis of the data flow during execution. As maps are implementation agnostic, they can be compiled to parallel OpenMP for-loops on CPU or parallel CUDA kernels on GPU; a minimal sketch of the syntax is shown below.
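As an illustration in the restricted-Python syntax used in Chapter 4 (our minimal sketch, not code from this work; A, B, and N stand for a pair of DaCe arrays and a symbolic size), a map, a tasklet, and its memlets combine like this:

    @dace.map(_[0:N])    # Map: run the tasklet for every i in parallel
    def scale(i):        # Tasklet
        a << A[i]        # Input memlet: read A[i] into a
        b >> B[i]        # Output memlet: write b to B[i]
        b = 2 * a        # Stateless computation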

Reduce nodes are operations that reduce the number of dimensions in a data structure by operating on the elements in a specified direction using a lambda function. An example is summing the elements of a vector into a scalar.

States are the second major part of the SDFG, enabling control flow, and can be used to create complex loops that extend those offered by maps; an example is a scenario where a stencil pattern is applied until the solution converges. As a map is defined with a fixed number of iterations, states are required unless loop unrolling is performed in a graph. Inline use of other SDFGs is allowed, but recursion of SDFGs is disallowed since it can break transformations.

Summary of the SDFG types

The types of nodes and edges described above fulfill the requirements from the concepts for data-centric programming defined by Ben-Nun et al. [9] earlier. The types defined under Data and computations fulfill concepts 1 and 2, while the types defined under Parallelism and states fulfill concepts 3 and 4.

From the SDFG building blocks defined above, the SDFG can be compiled into efficient C++ code, where the strict memory flow definitions enable concurrent execution of parts that do not cause conflicting memory use. This is done either via OpenMP regions or by utilizing several CUDA streams.

SDFG transformations

As stated earlier, one of the most important features of the SDFG is that the graph can be transformed via different operations without changing the original code. There are several transformations included in DaCe; the first group is automatic, and the rest can be applied by the performance engineer. Some transformations are designed to change the program flow for higher performance, while others are designed to port the SDFG to different accelerators. As the transforms applied in DaCe are platform agnostic and simple to apply, they benefit performance portability, since an SDFG can easily be adapted to a new platform.

Transformations work by first finding subgraphs in an SDFG that match any of the pattern subgraphs defined by the available transforms. When a match is found, DaCe evaluates whether all preconditions defined for the transform are fulfilled and displays the available transform to the performance engineer. When a transform is applied, DaCe replaces the matching pattern with a replacement subgraph defined by the transform. Transforms can be applied either via the interactive web interface or directly via the Python API, as sketched below.
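Programmatically, this pattern-match-and-apply workflow looks roughly as follows (a sketch against the DaCe Python API as of 2020; some_program is a hypothetical @dace.program, and the module path and method names are assumptions that may differ between releases):

    from dace.transformation.dataflow import MapTiling

    sdfg = some_program.to_sdfg()          # compile a @dace.program to an SDFG
    sdfg.apply_strict_transformations()    # the automatic clean-up passes
    sdfg.apply_transformations(MapTiling)  # match the pattern and apply one transform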

Below, we introduce the transformations used in this work.

Strict transformations are automatically applied when the initial code is compiled into an SDFG. Strict transformations remove unnecessary nodes and edges and improve performance by removing unnecessary operations from the code. The transformations are called strict because they ensure that the behavior of the program is unchanged.

Figure 3.2: A simple DaCe map before (a) and after (b) applying the MapTiling transformation. (In (a) the map iterates over dft_mat_gen[i=0:N, j=0:N]; in (b) an outer map over tile_i, tile_j in 0:int_ceil(N, 128) encloses an inner map covering one 128×128 tile at a time.)

MapTiling transformations are used to perform loop tiling [90] (also called loop blocking) on a map inside an SDFG. The transformation works by nesting the original map inside a new outer map. The inner map can be viewed as a tile (or block) of the original map, with the outer map looping over all tiles. The most common purpose of applying MapTiling is to increase cache locality, as it limits calculations to one memory region at a time.

An example of the MapTiling transformation can be seen in Figure 3.2.
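In plain loop form, tiling with the tile size 128 shown in Figure 3.2 corresponds to the following sketch (our illustration; N and work are hypothetical):

    N = 300  # example size, deliberately not a multiple of the tile size

    def work(i, j):
        pass  # stands in for the per-element tasklet

    # Untiled map: for i in range(N): for j in range(N): work(i, j)
    # After MapTiling, the outer loops iterate over tiles and the inner
    # loops cover one 128x128 block (cf. int_ceil and Min in Figure 3.2):
    for tile_i in range(0, N, 128):
        for tile_j in range(0, N, 128):
            for i in range(tile_i, min(N, tile_i + 128)):
                for j in range(tile_j, min(N, tile_j + 128)):
                    work(i, j)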

MapReduceFusion transformations combine a map and a reduction node and introduce a WCR for data conflicts. The reduce node reduces the dimension of a variable by some operation, meaning that by fusing the map and reduction, an unnecessary transient variable can be removed, leading to a lower memory footprint.

An example of the MapReduceFusion transformation can be seen in Figure 3.3. We see that the transformation adds an additional state with a map that zeroes the output array, due to technical limitations.

Vectorization transformations are designed to enable vectorization in a map by increasing the stride length of the iteration. This increase in stride indicates to the C++ compiler that the code should be vectorized using SIMD instructions.

Figure 3.3: A simple DaCe map before (a) and after (b) applying the MapReduceFusion transformation. (After the transformation, the dft_tasklet map writes directly to Y, and an extra freduce_init_map state zeroes the output.)

GPUTransformSDFG transformations are designed to change a whole SDFG to use the GPU for computations. The transform changes all computations in maps or standalone tasklets to be performed in CUDA kernels. As CUDA kernels require the data to be in GPU memory, DaCe automatically creates data nodes representing GPU memory, as well as memlets responsible for transferring data to and from GPU memory using asynchronous copies.

An example of the GPUTransformSDFG transformation can be seen in Figure 3.4, with the added transients located in GPU memory indicated by a red outline. Besides the visual changes, the configuration of the DaCe maps is changed to make them compile to CUDA code.

GPUTransformMap transformations work just like GPUTransformSDFG, except that they are limited to transforming a single map.

Figure 3.4: A simple DaCe map before (a) and after (b) applying the GPUTransformSDFG transformation. Added transient nodes (gpu_X, gpu_Y) are outlined in red.

Using DaCe

There are two ways to use DaCe as a framework. The first is to use DIODE to write the program, transform the compiled SDFG, and then run or test the generated C++ library. The second option is to do all steps directly via the DaCe Python API.

When writing code for use in DaCe, the user has three options. The first is to use a restricted form of Python to program explicit dataflow, the second is to use a DSL like MATLAB, and the third is to define the SDFG directly by using the SDFG API. The restricted Python is a subset of the Python language plus some additional Numpy [91] operators and can be used to create complex processing flows with relative ease. The user can also use MATLAB or TensorFlow as DSLs and have this code compiled to an SDFG; however, this is much more restricted compared to the Python alternative. Using the API to construct the SDFG is of course the most versatile option but carries the cost of being much more complex and verbose.

DIODE is an experimental Integrated Development Environment (IDE) included in the DaCe framework, specifically designed to work with SDFGs. DIODE enables visualization, editing, and transformation of SDFGs by providing several interfaces to the performance engineer. DIODE is written in JavaScript and interfaces with the DaCe Python API.

All usage of DaCe in this work was performed through DIODE, with the programs written in restricted Python.

Figure 3.5: The raw data (a) and frequency spectrum (b) of a recording of a person saying, "Fast Fourier transform". (Panel (a) plots signal amplitude against time in seconds; panel (b) plots amplitude against frequency in Hz.)

3.5 Fast Fourier Transform

The Fast Fourier Transform (FFT) transforms an input sequence of numbers from the time or space domain to the frequency domain. Fourier analysis tells us that all general functions can be represented or approximated by a sum or integral of sine or cosine functions [92]. The process of decomposing functions into their constituent frequencies is concretized by the Fourier Transform (FT) and, for the discrete case, the Discrete Fourier Transform (DFT). The FFT is a factorization of the DFT that performs the transform efficiently, enabling its use on large data sets. This transformation has become one of the most important tools in modern science and can be used, for example, to solve differential equations, compress audio, and perform convolutions [22]. An example of the FFT of an audio signal can be seen in Figure 3.5.

The FFT can be extended to multiple dimensions to process n-dimensional data. This is out of scope for this work; we only cover the one-dimensional case.


Fourier Transform

The FT [93] is a linear transform that transforms an integrable function f : R → C from the time or space domain into the frequency domain. The function f is usually time-dependent, i.e., f = f(t), and the resulting function is frequency-dependent. The transform decomposes the original function into the frequencies that create the function.

The transform for the continuous case is defined as

$$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i x \xi}\, dx \qquad (3.1)$$

for any number ξ ∈ R. The inverse of the transform is also available, in the form

$$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi)\, e^{2\pi i x \xi}\, d\xi \qquad (3.2)$$

for any number x ∈ R.

Discrete Fourier Transform

The major issue with the original FT is that it is defined for infinite, cyclic, and continuous functions, which would require infinitely fast sampling and infinite memory to implement in digital form. The DFT approximates the solution with a finite number of equally spaced sample points. The transform of length n acts upon a sequence of complex numbers x := x_0, x_1, ..., x_{n-1} and produces an output y := y_0, y_1, ..., y_{n-1}, where x_l, y_l ∈ C. The transform is defined as

$$y_k = \sum_{l=0}^{n-1} x_l \cdot e^{-\frac{2\pi i}{n} kl} \qquad (3.3)$$

The inverse transform is defined by

$$x_l = \frac{1}{n} \sum_{k=0}^{n-1} y_k \cdot e^{\frac{2\pi i}{n} kl} \qquad (3.4)$$

The factor 1/n is the normalization factor, which can be replaced with a factor $\sqrt{1/n}$ in front of both Eq. 3.3 and Eq. 3.4 to make the transform unitary. The DFT can also be represented as a matrix: the DFT matrix DFT_n is an n × n complex-valued matrix defined as

$$\mathrm{DFT}_n = \left[\omega_n^{jk}\right]_{j,k = 0,\dots,n-1} \quad \text{where} \quad \omega_n = e^{-\frac{2\pi i}{n}} \qquad (3.5)$$

For n = 4, the matrix looks like

$$\mathrm{DFT}_4 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -i & -1 & i \\ 1 & -1 & 1 & -1 \\ 1 & i & -1 & -i \end{bmatrix}$$

Using the representation in Eq. 3.5, the transform can be written as

$$y = \mathrm{DFT}_n\, x, \qquad x, y \in \mathbb{C}^n \qquad (3.6)$$

The problem with the DFT is that producing the full output y requires calculating every element y_k, each of which uses all elements of x. This leads to a computational complexity of O(n²), where n is the input size stated earlier. The quadratic growth in computation time makes this form unusable for the large inputs required in modern applications.
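As a quick sanity check of Eq. 3.5 and 3.6 (our illustration, not code from this work), the DFT matrix can be built and compared against a library FFT in a few lines of NumPy:

    import numpy as np

    n = 8
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    omega = np.exp(-2j * np.pi / n)      # omega_n from Eq. 3.5
    dft_mat = omega ** (j * k)           # DFT_n = [omega_n^{jk}]

    x = np.random.rand(n) + 1j * np.random.rand(n)
    assert np.allclose(dft_mat @ x, np.fft.fft(x))  # y = DFT_n x (Eq. 3.6)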

Fast Fourier Transform

The FFT is usually a factorization of the DFT matrix, and there are many different formulations. The first widely known and used variant is the one by James Cooley and John Tukey from 1965 [56]. There were many earlier formulations, the earliest known being Carl Friedrich Gauss's unpublished work from 1805 [94], but none of them reached general use. The complexity of the FFT is O(n log(n)), a reduction from the O(n²) of the plain DFT definition. The reduction in computational complexity comes from the factorization of the DFT matrix into several sparse matrices.

There are many different variants of the FFT, such as forward or inverse, for complex or real input data, and many others. However, in this work, we focus on the difference between recursive and iterative algorithms.

Mathematical notation and concepts

Before we can start describing the different FFT variants, we need to introduce some mathematical formalism as well as some terms used in FFT definitions.


I_n represents the n × n identity matrix, the butterfly matrix DFT_n is the matrix defined in Equation 3.5, and the Kronecker product A ⊗ B is defined as

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}$$

where A is an m × n matrix and B is a p × q matrix.
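NumPy exposes this product directly as np.kron, which makes the block structure easy to inspect (our illustration, not from this work):

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    # Each entry a_jk of A is replaced by the block a_jk * I_2.
    print(np.kron(A, np.eye(2, dtype=int)))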

A common concept used in FFT factorizations is the twiddle factors, which are used to combine smaller butterfly matrices. They are derived from the factorization of the DFT matrix: twiddle factors compensate for factors that go missing when the DFT matrix shrinks during the recursion. For example, the DFT_4 matrix can be factorized into two DFT_2 matrices, but the DFT_2 matrix does not contain any complex entries even though the DFT_4 matrix does.

A permutation matrix [95] is a matrix that shuffles an input vector via multiplication. As the matrix is only supposed to permute the input, every row and column sums to one.

Cooley-Tukey FFT

The general-radix Cooley-Tukey algorithm is the most common recursive variant. From [57] we get the following formulation when n is a composite number n = km:

$$\mathrm{DFT}_n = (\mathrm{DFT}_k \otimes I_m)\, T^n_m\, (I_k \otimes \mathrm{DFT}_m)\, L^n_k, \qquad n = km \qquad (3.7)$$

In this formulation, T^n_m is a diagonal matrix that contains the twiddle factors for the Cooley-Tukey algorithm, and L^n_k is the permutation matrix defined in [57]. If k or m are composite numbers, the process can be repeated with DFT_k or DFT_m, leading to recursion.

This recursion is repeated until the butterfly matrices are sufficiently small, usually of size 2 or 4; the smallest size is called the radix of the factorization.
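For intuition, a radix-2 version of this recursion fits in a few lines of NumPy (our sketch of the textbook algorithm, not the thesis code; it assumes the length is a power of two):

    import numpy as np

    def fft_ct(x):
        # Radix-2 Cooley-Tukey: split into even/odd halves, recurse,
        # and combine the halves with twiddle factors.
        n = len(x)
        if n == 1:
            return x
        even, odd = fft_ct(x[0::2]), fft_ct(x[1::2])
        twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
        return np.concatenate([even + twiddle * odd, even - twiddle * odd])

    x = np.random.rand(8) + 1j * np.random.rand(8)
    assert np.allclose(fft_ct(x), np.fft.fft(x))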

Stockham FFT

Another common FFT variant is the iterative Stockham FFT, which is widely used for FFTs on GPUs. The Stockham algorithm is a variant of the Cooley-Tukey algorithm and was attributed to T. G. Stockham in 1966 in an article by Cochran et al. [96]. One benefit of the variant is that the radix is fixed in every iteration, meaning that it can be optimized for the target platform to match the width of vector instructions. Additionally, the Stockham FFT is self-sorting, meaning that the output is in the correct order, which is not necessarily the case with other variants.

$$\mathrm{FFT}_{r^k} = \prod_{i=0}^{k-1} \overbrace{ \underbrace{\left(\mathrm{DFT}_r \otimes I_{r^{k-1}}\right)}_{\text{Product with Butterfly}} \; \underbrace{D^{r^k}_i}_{\text{Twiddle Factors}} \; \underbrace{\left(L^{r^{k-i}}_r \otimes I_{r^i}\right)}_{\text{Stride Permutation}} }^{\text{Stockham pass}} \qquad (3.8)$$

For this work, we use the definition of the Stockham FFT found in Equation 3.8, originally expressed in [57]. In the Stockham FFT, r is the radix and k is the number of iterations.

In Equation 3.8, D^{r^k}_i is a diagonal matrix that contains the twiddle factors for the Stockham algorithm, and L^{r^{k-i}}_r is a permutation matrix. As we can see from the definition in Equation 3.8, this variant is iterative rather than recursive: the input vector is multiplied by one Stockham pass at a time.

To define the twiddle factors D^{r^k}_i and the permutation matrix L^{r^{k-i}}_r, we use the formulations in [64, 20]. In [64], the twiddle factors can be extracted from the combination of butterfly matrices, leading to the following expression:

$$D^{r^k}_i = \mathrm{diag}\left(I_{r^i}, \Omega_{r,r^i}, \Omega^2_{r,r^i}, \cdots, \Omega^{r-1}_{r,r^i}\right) \otimes I_{r^{k-i-1}} \qquad (3.9)$$

where Ω_{r,r^i} are diagonal matrices defined as

$$\Omega_{r,r^i} = \mathrm{diag}\left(1, \omega_{r^{i+1}}, \omega^2_{r^{i+1}}, \cdots, \omega^{r^i-1}_{r^{i+1}}\right)$$

and ω_{r^{i+1}} is defined as in Eq. 3.5. The permutation operation y = L^{r^{k-i}}_r x is defined in [20] as

$$y = \begin{bmatrix} x(0 : r^{k-i} : n-1) \\ \vdots \\ x(r : r^{k-i} : n-1) \end{bmatrix} \qquad (3.10)$$
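A direct NumPy rendering of Eq. 3.9 (our sketch, not thesis code; it materializes the diagonal of D^{r^k}_i explicitly, which is only sensible for small sizes) reads:

    import numpy as np

    def twiddle_diagonal(r, k, i):
        # diag(I, Omega, Omega^2, ..., Omega^{r-1}) kron I_{r^{k-i-1}}, per Eq. 3.9.
        w = np.exp(-2j * np.pi / r ** (i + 1))     # omega_{r^{i+1}} from Eq. 3.5
        omega = w ** np.arange(r ** i)             # diagonal of Omega_{r,r^i}
        d = np.concatenate([omega ** p for p in range(r)])  # p = 0 gives I_{r^i}
        # A Kronecker product with an identity repeats each diagonal entry
        # r^{k-i-1} times.
        return np.kron(d, np.ones(r ** (k - i - 1)))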

The dataflow graph of the Stockham FFT is visualized in Figure 3.6.
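To make this structure concrete, the following radix-2 reference sketch (our NumPy illustration, not the thesis code; its index conventions differ slightly from Eq. 3.8-3.10) shows the iterative, self-sorting pattern:

    import numpy as np

    def stockham_fft(x):
        # Iterative radix-2 Stockham FFT: each pass applies butterflies and
        # twiddle factors and writes to a second buffer in permuted order,
        # so no bit-reversal step is needed. Assumes len(x) is a power of two.
        n = len(x)
        y = np.asarray(x, dtype=complex).copy()
        tmp = np.empty(n, dtype=complex)
        half, stride = n // 2, 1
        while half >= 1:
            p = np.arange(half)
            w = np.exp(-2j * np.pi * p / (2 * half))   # twiddle factors
            for q in range(stride):
                a = y[q + stride * p]                  # upper butterfly inputs
                b = y[q + stride * (p + half)]         # lower butterfly inputs
                tmp[q + stride * 2 * p] = a + b
                tmp[q + stride * (2 * p + 1)] = (a - b) * w
            y, tmp = tmp, y                            # ping-pong the buffers
            half //= 2
            stride *= 2
        return y

    x = np.random.rand(8) + 1j * np.random.rand(8)
    assert np.allclose(stockham_fft(x), np.fft.fft(x))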

Figure 3.6: Dataflow graph of the Stockham FFT for the case r = 2, k = 3. Modified image from [97, 98]. (The figure shows three butterfly stages mapping x(0)-x(7) to y(0)-y(7), with twiddle factors ω₂, ω₄, and ω₈ applied between stages.)

Figure 3.7: Dataflow of the product with butterfly for the block parallel (a) y = (I₄ ⊗ DFT₂)x and vector parallel (b) y = (DFT₂ ⊗ I₄)x structure.

Our choice of algorithm

In this work we implement the iterative Stockham FFT described in Section 3.5. The reasoning behind this is that the DaCe framework does not support recursive calls to SDFGs when compiling from restricted Python, preventing us from using a recursive FFT variant. In another dataflow program without the explicit states present in DaCe, a back-edge and switch could be used to produce a recursive variant. As an additional benefit, the Stockham algorithm is self-sorting and avoids bit reversals, providing better performance on GPUs [99].

Another big benefit of using the Stockham FFT on GPU is that it is designed for wide vector operations, as opposed to small independent iterative operations. This difference is showcased in Figure 3.7, where Figure 3.7a shows the structure of the operations in the Cooley-Tukey FFT and Figure 3.7b shows the structure of the operations in the Stockham FFT. The first enables the use of deep cache memory hierarchies, like the ones found in modern CPUs. The second can easily be expressed as vector instructions [99] and hence fits machines with flat memory hierarchies like modern GPUs.

Page 53: Development of Stockham Fast Fourier Transform using Data- …1511982/... · 2020. 12. 21. · DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS AND THE MAIN FIELD OF

Methods | 33

Chapter 4

Methods

This chapter presents the implementation of two common mathematical operations used in scientific computing: the DFT and the FFT. We present how to express the algorithms for implementation in DaCe, the modifications to DaCe necessary to enable the required functionality, and what was done to improve the performance of the resulting programs. The code is available on GitHub [100].

4.1 Discrete Fourier Transform in DaCe

DaCe is a young research project, meaning that some functionality may be missing. When this project started, complex-valued map-reductions that had been merged using MapReduceFusion could not be compiled to CUDA. By introducing new code into the DaCe code generator and implementing a small DFT program, we show that complex-valued reductions can be performed on GPU. First, we describe how the algorithm is implemented in restricted Python and then how the resulting SDFG is transformed using DIODE for higher performance as well as GPU capability. The implementation is kept as naive as possible to allow showcasing several DaCe graph transformations.

Implementation

The code required in DaCe to implement a double-precision DFT can be seen in Listing 4.1. The program performs the calculations described in Equations 3.5 and 3.6, operates on an input vector X, and produces the resulting DFT in the output vector Y.


 1 import dace
 2
 3 # Define DaCe symbolic variable
 4 N = dace.symbol('N')
 5
 6 # Declare DaCe program with two arrays
 7 # of length N as arguments.
 8 @dace.program(dace.complex128[N], dace.complex128[N])
 9 def DFT(X, Y):
10     # Define transient variable to hold DFT matrix
11     dft_mat = dace.define_local([N, N], dtype=dace.complex128)
12     # Execute tasklet in parallel to fill DFT mat.
13     @dace.map(_[0:N, 0:N])
14     def dft_mat_gen(i, j):  # Tasklet
15         omega >> dft_mat[i, j]  # Memlet
16
17         omega = exp(-dace.complex128(0, 2 * 3.14159265359 * i * j) / dace.complex128(N))
18
19     # Transient to hold result before reduce
20     tmp = dace.define_local([N, N], dtype=dace.complex128)
21     # Perform multiplication
22     @dace.map(_[0:N, 0:N])
23     def dft_tasklet(k, n):  # Tasklet
24         x << X[n]  # Memlet
25         omega << dft_mat[k, n]  # Memlet
26         out >> tmp[k, n]  # Memlet
27
28         out = x * omega
29
30     # Reduce the transient into the output array
31     dace.reduce(lambda a, b: a + b, tmp, Y, axis=1, identity=0)

Listing 4.1: Annotated implementation of double-precision DFT in DaCe.


The DaCe program itself is described by the function definition on line 9 in Listing 4.1, and the Python decorator @dace.program on line 8 tells DaCe that the function should be compiled into an SDFG. The arguments to @dace.program tell DaCe the size of the inputs and outputs by using the symbolic variable N defined on line 4.

The first step of the program is to calculate the DFT matrix. A two-dimensional transient data node dft_mat is allocated on line 11 in Listing 4.1 to hold the values of the DFT matrix. The @dace.map on line 13 runs the tasklet dft_mat_gen(i, j) on line 14 in parallel over the DFT matrix defined earlier. The tasklet defines a memlet omega >> dft_mat[i, j] to the corresponding location in the DFT matrix, to be able to write the correct value to it. Remember, no data access inside tasklets is allowed without using memlets.

The second step of the program is to multiply the generated DFT matrix with the input vector. Usually, when performing a matrix-vector multiplication, the program would use += to sum rows, but in this case we use a transient two-dimensional matrix (line 20), which is then summed over columns into the output using dace.reduce on line 31; the lambda function lambda a, b: a + b describes how data should be combined. The @dace.map on line 22 performs the multiplication, using the input vector and the newly generated DFT matrix as input and the transient matrix as the output.

The SDFG compiled from Listing 4.1 can be seen in Figure 4.1.

Performance transformations

The original implementation can be improved by applying several different transformations to the SDFG via DIODE. For each of the following optimizations, the original code or SDFG is changed to test a possible performance enhancement.

MapReduceFusion

The first inefficiency of the implementation is the extra transient that is used for the multiplication and the subsequent reduction. By applying a MapReduceFusion transformation in DIODE, the @dace.map and dace.reduce are combined into a single @dace.map, where the results are written directly to the output Y with a WCR that defines how data conflicts should be handled. By moving the calculations from the reduction node to the map, we avoid one write and one read for each element in the matrix.

The resulting SDFG can be seen in Figure B.1.


Figure 4.1: SDFG calculating the DFT of the input X. (The state contains the dft_mat_gen and dft_tasklet maps, the transients dft_mat and tmp, and a Reduce (Sum) node writing to Y.)

Removing the WCR

The previous application of MapReduceFusion introduces a WCR to the map; currently DaCe uses the OpenMP #pragma omp critical to avoid data races while writing to the output. The use of #pragma omp critical means that only one thread at a time will be able to enter the summation section, leading to massive overhead. The WCR is introduced as a safety measure, but when the @dace.map on line 22 distributes the work over OpenMP threads, they will by default be assigned to different rows and thus not write to the same output elements at once, meaning that the WCR is unnecessary. To disable the WCR, we simply click on the node in DIODE and change the wcr_conflict property to false.

This change does not alter the appearance of the SDFG.

Map fusion

As the original DFT in Equation 3.3 does not contain any intermediate matrix, the two maps in Listing 4.1 should theoretically be combinable into a single map. Since the two maps present in the program have the same boundaries and data dependencies, this can be achieved in DaCe by first applying a MapExpansion transformation to the multiplication map on line 22 in Listing 4.1. With the map expanded, DaCe enables a MapWCRFusion transformation that merges the two maps into a single map. The merge removes one unnecessary write and read operation for each element in the DFT matrix.

The resulting SDFG can be seen in Figure 4.2.

Figure 4.2: DFT-SDFG after applying MapWCRFusion. (The fused dft_mat_gen map writes directly to Y, with a separate freduce_init_map state that zeroes Y.)

Using BLAS GEMV

To ensure that the system resources are used to their maximum, we go back to the program generated after removing the WCR and replace our own implementation of the matrix-vector multiplication with a GEneral Matrix-Vector multiplication (GEMV) BLAS operation from MKL.

1 with dace.tasklet(language=dace.Language.CPP, code_global='#include <mkl.h>'):
2     x << X; omega << dft_mat; y >> Y
3     '''
4     dace::complex128 alpha(1, 0), beta(0, 0);
5     cblas_zgemv(CblasRowMajor, CblasNoTrans, N, N, &alpha, omega, N, x, 1, &beta, y, 1);
6     '''

Listing 4.2: BLAS GEMV code replacement for Listing 4.1.

 1 @dace.mapscope(_[0:N])
 2 def out_map_gen(i):
 3     @dace.map(_[i:N])
 4     def dft_mat_gen(j):
 5         omega1 >> dft_mat[i, j]
 6         omega2 >> dft_mat[j, i]
 7
 8         omega = exp(-dace.complex128(0, 2 * 3.14159265359 * i * j) / dace.complex128(N))
 9         omega1 = omega
10         omega2 = omega

Listing 4.3: DFT matrix mirroring code replacement for Listing 4.1.

To use the library function, we replace the multiplication and reduction code (lines 19-31 in Listing 4.1) with the code in Listing 4.2. The tasklet specifies the inputs and outputs to data nodes and then the C++ code that calls the library function. This approach requires a bit more knowledge about the internal structure of DaCe and C++ but is still integrable with the rest of the restricted Python.

The resulting SDFG can be seen in Figure B.2.

Using symmetry in the DFT matrix

As the DFT matrix is symmetric in i and j, as seen in Equation 3.5, we can remove around half of the required computations in the map generating the DFT matrix by looping over the upper-right half of the matrix and writing each value both to the regular location and to the location mirrored in the diagonal.

The mirroring code replaces the DFT matrix generation (lines 13-17 in Listing 4.1) with the code in Listing 4.3. The difference from the original map is that the inner loop range now depends on the index of the outer loop, as seen on lines 2 and 3 in Listing 4.3, in addition to the second memlet that writes to the mirrored matrix location.

With these changes to the DFT matrix generation, the resulting SDFG can be seen in Figure B.3.

Porting to GPU

To port the different SDFGs produced by the previous transformations, we apply a GPUTransformSDFG transformation to each of them, enabling all versions to use GPU resources for computations. After applying the transformation, we need to re-enable the WCR: the argument that each thread handles its own row of the multiplication no longer holds, since the CUDA kernels work on each element of the DFT matrix simultaneously.

The versions with MapReduceFusion applied and with symmetry-reduction were ported to GPU. The SDFG for the MapReduceFusion version with GPUTransformSDFG applied can be seen in Figure B.4.

Code changes in DaCe

At the time of writing, DaCe does not support atomic operations on complex-typed variables in CUDA, leading to compilation errors when WCRs introduced by MapReduceFusion exist in the C++ code. The problem was resolved by adding specific functions that handle atomic additions to DaCe's C++ template code, which can be seen in Listing A.1.

Differences from normal Python

The code in Listing 4.1 contains all the code necessary for performing the calculations and uses only primitive functions such as exponentiation and multiplication. In the version of DIODE used to implement the DFT, there was no support for separating the code into functions, such as moving the DFT matrix generation to a separate function. It was possible to use separate functions if the optimizer Command Line Interface (CLI) was used directly, skipping DIODE, which could be useful for larger applications.


4.2 Stockham FFT in DaCe

The Stockham FFT algorithm defined in Equation 3.8 does not trivially fit the DaCe programming model, and several methods are used to encapsulate the procedure of the algorithm in an SDFG. First, the Stockham algorithm is iterative, meaning that it multiplies the input with the sparse matrices called Stockham passes several times before finishing. This implies that the implementation needs to use stateful programming in DaCe to loop the inner Stockham passes over the input variable. Second, DaCe has no support for the sparse matrix operations present in the Product with Butterfly and Stride Permutation stages of a single Stockham pass, meaning that an alternative formulation must be used. A fitting formulation of the Product with Butterfly can be found in [57], while the definition of the Stride Permutation can be found in [20].

To summarize, the sought-after structure for the implementation is a larger loop representing the complete Stockham pass, with the individual operations of the Stockham pass rewritten as DaCe maps inside the larger loop.

Our implementation is a complex-to-complex, forward, out-of-place FFT. A forward FFT transforms from the spatial or time domain to the frequency domain. An out-of-place FFT stores the input and output separately and does not overwrite the input.

The full code for the implementation is available on GitHub [100].

DaCe implementation

In this part, we describe the implementation of the DaCe-FFT step by step, with solutions to the problems encountered during implementation.

Program signature

When implemented as a DaCe program, the whole process is encapsulated in a @dace.program, as seen in Listing 4.4. The double-precision complex variables x and y are the input and output, respectively. N, R, and K are symbolic variables representing the length, radix, and number of iterations from Equation 3.8.


1 # Define DaCe symbolic variables.
2 N, R, K = (dace.symbol(name) for name in ['N', 'R', 'K'])
3
4 # DaCe program with two arrays as arguments.
5 @dace.program(dace.complex128[N], dace.complex128[N])
6 def stockhamFFT(x, y):

Listing 4.4: Stockham DaCe program signature.

Setup

The first part of the program sets up the DFT matrix, calculates the loop indices, and copies data from the input to the output.

DFT matrix initialization is done in accordance with Equation 3.5, and the DaCe code can be seen between lines 1 and 7 in Listing 4.5.

Moving data is required because our implementation is out-of-place. The data in the input x must be moved to the output y before processing can begin; the DaCe code performing this can be seen between lines 9 and 16 in Listing 4.5.

Generation of indices is required, as DaCe is not designed for the variable loop/map ranges needed to perform the loops from the operations in the Stockham pass. Using the symbolic variables (N, R, K) to calculate the indices currently results in compilation errors. By generating the indices as transient variables, we enable both map ranges and tasklets to use the generated values; the code for generating them is located between lines 18 and 35 in Listing 4.5.

Loop over Stockham passes

After having performed the necessary setup, the program can start looping over the Stockham passes from Equation 3.8. To create a stateful control loop representing the ∏ in Equation 3.8 in DaCe, it is sufficient to write the code in Listing 4.6. The code performing the Product with Butterfly, Twiddle Factors, and Stride Permutation is placed inside this loop. This code creates several states in the SDFG, the first being the initial setup and the second being the state representing the Stockham pass. DaCe will create additional states that act as counters and guards to emulate the main loop.


 1 # Generate DFT matrix for radix R.
 2 # Define transient variable for matrix.
 3 dft_mat = dace.define_local([R, R], dtype=dace.complex128)
 4 @dace.map(_[0:R, 0:R])  # Parallel execution
 5 def dft_mat_gen(ii, jj):  # Tasklet
 6     omega >> dft_mat[ii, jj]  # Memlet
 7     omega = exp(-dace.complex128(0, 2 * 3.14159265359 * ii * jj / R))
 8
 9 # Move input x to output y
10 # to avoid overwriting the input.
11 @dace.map(_[0:N])  # Execute tasklet in parallel
12 def move_x_to_y(ii):  # Tasklet
13     x_in << x[ii]  # Memlet
14     y_out >> y[ii]  # Memlet
15
16     y_out = x_in
17
18 # Calculate loop indices.
19 # Allocate indices for each iteration of main loop.
20 r_i = dace.define_local([K], dtype=dace.int64)
21 r_i_1 = dace.define_local([K], dtype=dace.int64)
22 r_k_1 = dace.define_local([K], dtype=dace.int64)
23 r_k_i_1 = dace.define_local([K], dtype=dace.int64)
24
25 @dace.map(_[0:K])  # Execute tasklet in parallel
26 def calc_indices(ii):  # Tasklet
27     r_i_out >> r_i[ii]  # Memlet
28     r_i_1_out >> r_i_1[ii]  # Memlet
29     r_k_1_out >> r_k_1[ii]  # Memlet
30     r_k_i_1_out >> r_k_i_1[ii]  # Memlet
31
32     r_i_out = R ** ii
33     r_i_1_out = R ** (ii + 1)
34     r_k_1_out = R ** (K - 1)
35     r_k_i_1_out = R ** (K - ii - 1)

Listing 4.5: DaCe code setting up the DFT matrix, generating indices, and moving data for the DaCe-FFT.


 1 # Main Stockham loop
 2 for i in range(K):
 3     # Stride permutation
 4     tmp_perm = dace.define_local([N], dtype=dace.complex128)  # Transient for temporary data
 5     @dace.map(_[0:R, 0:r_i[i], 0:r_k_i_1[i]])  # Execute tasklet in parallel
 6     def permute(ii, jj, kk):  # Tasklet
 7         r_k_i_1_in << r_k_i_1[i]  # Memlet
 8         r_i_in << r_i[i]  # Memlet
 9         y_in << y[r_k_i_1_in * (jj * R + ii) + kk]  # Memlet
10         tmp_out >> tmp_perm[r_k_i_1_in * (ii * r_i_in + jj) + kk]  # Memlet
11
12         tmp_out = y_in
13
14     # Twiddle Factor multiplication
15     D = dace.define_local([N], dtype=dace.complex128)  # Transient for twiddle factors
16     @dace.map(_[0:R, 0:r_i[i], 0:r_k_i_1[i]])  # Execute tasklet in parallel
17     def generate_twiddles(ii, jj, kk):  # Tasklet
18         r_i_1_in << r_i_1[i]  # Memlet
19         r_i_in << r_i[i]  # Memlet
20         r_k_i_1_in << r_k_i_1[i]  # Memlet
21         twiddle_o >> D[r_k_i_1_in * (ii * r_i_in + jj) + kk]  # Memlet
22         twiddle_o = exp(dace.complex64(0, -2 * 3.14159265359 * ii * jj / r_i_1_in))
23
24     tmp_twid = dace.define_local([N], dtype=dace.complex128)  # Transient for results after applying twiddle factors
25     @dace.map(_[0:N])  # Execute tasklet in parallel
26     def twiddle_multiplication(i):  # Tasklet
27         tmp_in << tmp_perm[i]  # Memlet
28         D_in << D[i]  # Memlet
29         tmp_out >> tmp_twid[i]  # Memlet
30
31         tmp_out = tmp_in * D_in
32
33     # Product with Butterfly
34     tmp_y = dace.define_local([N, N], dtype=dace.complex128)  # Transient for results before reduction
35     @dace.map(_[0:r_k_1[i], 0:R, 0:R])  # Execute tasklet in parallel
36     def tensormult(ii, jj, kk):  # Tasklet
37         r_k_1_in << r_k_1[i]  # Memlet
38         dft_in << dft_mat[jj, kk]  # Memlet
39         tmp_in << tmp_twid[ii + r_k_1_in * kk]  # Memlet
40         tmp_y_out >> tmp_y[ii + r_k_1_in * jj, ii + r_k_1_in * kk]  # Memlet
41
42         tmp_y_out = dft_in * tmp_in
43
44     # Reduce to finish the product with butterfly step
45     dace.reduce(lambda a, b: a + b, tmp_y, y, axis=1, identity=0)

Listing 4.6: Looping over the Stockham passes in DaCe.



Stride permutation of the input in the Stockham pass is defined as a sparse matrix-vector multiplication in the form of

$$\left(L^{r^{k-i}}_r \otimes I_{r^i}\right) y$$

where L^{r^{k-i}}_r is defined using Equation 3.10, which, combined with the MATLAB pseudo-code for the Kronecker product from [57], can be written as a DaCe map over three dimensions. The implementation in DaCe code can be found between lines 3 and 12 in Listing 4.6.

Twiddle factors are defined as diagonal matrices, as seen in Equation 3.9, and are applied to the input vector as a sparse matrix-vector multiplication,

$$D^{r^k}_i\, y$$

and as the matrix is diagonal, this turns into an element-wise vector-vector multiplication. The DaCe code is adapted from the MATLAB pseudo-code in [57]. For maximum clarity, the generation and the multiplication of the twiddle factors are split into two parts: the generation is between lines 14 and 22 in Listing 4.6 and the multiplication is between lines 24 and 31 in Listing 4.6.

Product with butterfly is vector parallel and defined in the Stockham pass as

$$\left(\mathrm{DFT}_r \otimes I_{r^{k-1}}\right) y$$

The naive implementation is adapted from the MATLAB pseudo-code in [57], and the DaCe code is between lines 33 and 45 in Listing 4.6.

Visualization of one Stockham FFT pass

The SDFG of a Stockham pass in our implementation of the Stockham FFT in DaCe can be viewed in Figure 4.3. The annotations indicate which part of the SDFG belongs to which part of the algorithm and code, with L ⊗ I being the Stride Permutation, D the Twiddle Factors, and DFT ⊗ I the Product with Butterfly.

Figure 4.3: SDFG of a single Stockham FFT pass. During each iteration, the three operators in Equation 3.8 are applied (the permute, generate_twiddles, twiddle_multiplication, and tensormult maps, followed by a Reduce (Sum) node writing back to y).


 1 r = 2  # Set radix
 2 k = 4  # Set number of iterations
 3 n = r ** k  # Calculate length
 4 print('FFT on vector of length %d' % n)
 5
 6 # Assign DaCe symbolic values
 7 N.set(n)
 8 R.set(r)
 9 K.set(k)
10
11 # Define input and output variables
12 X = np.random.rand(n).astype(np.complex128) + 1j * np.random.rand(n).astype(np.complex128)
13 Y = np.zeros_like(X, dtype=np.complex128)
14
15 # Call the DaCe program
16 stockhamFFT(X, Y, N=N, K=K, R=R)

Listing 4.7: Calling our stockhamFFT DaCe program from within Python.

Calling the DaCe program from Python

With the combination of Listings 4.4, 4.5, and 4.6, we have implemented the Stockham FFT in DaCe, DaCe-FFT for short. To use the generated SDFG, we call the function defined in Listing 4.4 with the input, output, and symbolic variables, as seen in Listing 4.7. The input and output are defined as Numpy arrays; the DaCe symbolic variables are set before execution and passed as function arguments on line 16.

Transformations

The original implementation is naive and as such does not perform well. We increase the performance by applying transformations to the SDFG in DIODE and by replacing the Product with Butterfly code with a BLAS GEneral Matrix-Matrix multiplication (GEMM) call.

MapReduceFusion

The naive implementation uses a reduction node by default, which is highly inefficient, as mentioned in Section 4.1. By applying a MapReduceFusion transformation, we merge the map and the reduction node. This creates a WCR, which in turn causes DaCe to generate code using OpenMP critical, leading to poor performance, just like in the DFT implementation. The code does not perform multiple writes to the same array location, meaning that the WCR can be disabled in DIODE without any issues.

GPUTransformSDFG

Porting the DaCe-FFT SDFG to GPU is straightforward in DIODE. By applying a GPUTransformSDFG transformation, the CUDA kernels are created, along with all necessary memory allocations and memory copies to and from GPU device memory. DaCe transforms the SDFG to use GPU resources by introducing additional data nodes set to exist in GPU device memory. The edges between the original data nodes and the GPU data nodes represent memory copies to the GPU. The original tasklets are set to be performed on GPU and as such are compiled to CUDA kernels. The ported SDFG can be seen in Figure B.5.

The C++/CUDA code compiled from the ported SDFG is not functional, due to an issue with the map indices. The indices calculated during start-up need to be used both for the DaCe map ranges and inside the tasklet code. When compiled to C++/CUDA, the indices have been ported to GPU memory, meaning that the CPU code launching the CUDA kernels cannot use them and crashes.

To solve this issue, we create separate indices for CPU and GPU, respectively, and use the CPU version in the map definitions and the GPU version inside the tasklets. The problem with this approach, however, is that the GPUTransformSDFG transform moves all data nodes and tasklets to GPU, meaning that we needed to manually change the index nodes back to CPU, which is done by editing the properties of the nodes in DIODE.

Using matmul node

The naive code for the Product with Butterfly step in Listing 4.6 is not optimized. The order of the map produces a rather random access pattern, reducing cache prefetching and leading to worse performance. Additionally, there is no support for vectorization, which is crucial to reach peak performance on modern CPUs. The formulation of the Product with Butterfly translates to performing multiple matrix-vector multiplications, with DFT_r as the matrix and all vectors with stride r^{k-1} from the input y. However, this process can be merged into a single matrix-matrix multiplication, as the multiplication is independent with regard to the columns of the second matrix.
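The merge into a single GEMM can be illustrated in plain NumPy (our sketch, independent of the DaCe code; r and m = r^{k-1} are chosen arbitrarily small):

    import numpy as np

    r, m = 2, 4                                         # radix r, m = r**(k-1)
    dft_r = np.array([[1, 1], [1, -1]], dtype=complex)  # DFT_2 butterfly matrix
    y = np.random.rand(r * m) + 1j * np.random.rand(r * m)

    packed = y.reshape(r, m)      # column q holds the strided vector y[q::m]
    batched = dft_r @ packed      # all m matrix-vector products as one GEMM

    for q in range(m):            # identical to m separate matrix-vector products
        assert np.allclose(batched[:, q], dft_r @ y[q::m])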

To improve the performance, we change the code between lines 33 and 44 in Listing 4.6 to the code in Listing 4.8. The new code creates a matrix-matrix multiplication (matmul) node in the SDFG by invoking the @ operator on line 13. Unfortunately, the original data format of the input and output is not recognized automatically, meaning that the data needs to be restructured. The code between lines 4 and 10 and between lines 15 and 21 in Listing 4.8 performs the packing and unpacking of data to and from temporary data nodes. The use of a matmul node in the SDFG enables DaCe to generate code using highly optimized BLAS routines. The matmul node can be changed in DIODE to use a pure C++, MKL, or CUDA Basic Linear Algebra Subroutine library (cuBLAS) implementation, with the last only being available in CUDA on GPU.

 1 # Transients for vectors packed as arrays
 2 x_packed = dace.define_local([R, N_div_R], dtype=dace.complex128)
 3 y_packed = dace.define_local([R, N_div_R], dtype=dace.complex128)
 4 @dace.map(_[0:R, 0:N_div_R])  # Execute tasklet in parallel
 5 def pack_matrices(ii, jj):  # Tasklet
 6     g_r_k_1_in << g_r_k_1[i]  # Memlet
 7     tmp_in << tmp_perm[jj + ii * g_r_k_1_in]  # Memlet
 8     x_packed_out >> x_packed[ii, jj]  # Memlet
 9
10     x_packed_out = tmp_in
11
12 # Perform matrix-matrix multiplication
13 y_packed[:] = dft_mat @ x_packed
14
15 @dace.map(_[0:R, 0:N_div_R])  # Execute tasklet in parallel
16 def unpack_matrices(ii, jj):  # Tasklet
17     g_r_k_1_in << g_r_k_1[i]  # Memlet
18     y_out >> y[jj + ii * g_r_k_1_in]  # Memlet
19     y_packed_in << y_packed[ii, jj]  # Memlet
20
21     y_out = y_packed_in

Listing 4.8: Code for performing the vector parallel product with the butterfly matrix using a library BLAS GEMM.

However, this setup requires the transient data nodes x_packed and y_packed to work. The temporary data nodes are shaped as r × r^{k-1}, with the latter value being calculated in the setup phase. But as DaCe places all memory allocations at the beginning of the generated C++ function, the value of r^{k-1} does not exist yet at allocation time. The solution is to add an additional input, N_div_R, as a symbolic variable to the program signature in Listing 4.4.

Incorrect code generation

Currently, DaCe generates incorrect code for a specific case in the naive original SDFG. To ensure correctness during the execution of reductions, DaCe generates code to zero the output array. In the case of DaCe-FFT, this code zeroes the output y after each iteration, even though the data is reused. During testing, the problem was avoided by manually moving the zeroing code before the loop. The problem was not present in the other variants.



Chapter 5

Experimental Setup

In this chapter, we describe how we validate the accuracy and performance of the DaCe-DFT and DaCe-FFT implementations described in Chapter 4. First, we describe how the correctness of the output was validated. Second, we describe how the performance of the implementations was measured. We then give a detailed presentation of how the DaCe-DFT was tested, followed by how the DaCe-FFT was analyzed to locate performance hot-spots, as well as the final performance tests against state-of-the-art solutions on both CPU and GPU. Finally, we describe the different computer systems used during testing.

5.1 Validation of output

During the development of the DaCe-FFT from Section 4.2, all versions were compared against NumPy's FFT function, which in our case used MKL's FFT routine. The comparison was done by calculating the norm of the difference between the outputs, divided by the number of elements, as seen below:

\[ d = \frac{\lVert y_{\mathrm{DaCe}} - y_{\mathrm{MKL}} \rVert}{n} \]

As the algorithm operates using double-precision floating-point arithmetic, the maximum precision available is the smallest possible difference between two numbers that can be represented. For values close to 1 this is 2^{-52} ≈ 2.2204 × 10^{-16} [101], meaning that a difference near 10^{-16} between the implementations is to be considered equal. This is similar to what the FFTW implementation achieves when compared against an arbitrary-precision FFT



in their tests [102].
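As an illustration, the metric can be computed with a short routine like the following sketch (our own illustration, not the thesis code; the two arrays are assumed to hold the DaCe and reference outputs):

    #include <cmath>
    #include <complex>
    #include <vector>

    // Sketch of the validation metric d = ||y_dace - y_ref|| / n for
    // double-precision complex outputs of equal length n.
    double validation_difference(const std::vector<std::complex<double>>& y_dace,
                                 const std::vector<std::complex<double>>& y_ref) {
        double sum_sq = 0.0;
        for (std::size_t i = 0; i < y_dace.size(); ++i)
            sum_sq += std::norm(y_dace[i] - y_ref[i]);  // squared magnitude
        return std::sqrt(sum_sq) / static_cast<double>(y_dace.size());
    }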

5.2 Performance measurements

We used two methods to run performance benchmarks on the code generated by DaCe. The first method was to benchmark directly from Python using the dace.timethis function included in DaCe; the function returns the average elapsed time and the number of FLoating Point Operations per Second (FLOP/s). The other method was a small C++ benchmarking program that called the generated C++ code. The main difference between the two is that the Python method is harder to use when performing a large number of tests and does not allow editing of the C++ code, while the C++ program requires more manual work but permits changes to the C++ code.

The benchmarks work by generating random double-precision complex numbers in an array and then executing the FFT with the generated data as input. The FFT lengths tested are in the range 32 to 65 536 elements, meaning that the FFTs are small enough to fit in cache, so a warm-up run is performed to avoid issues with cold cache starts. On GPU, the warm-up run also serves to initialize the CUDA runtime. After performing the warm-up, the code was run and measured individually 100 times to reduce the impact of noisy results. In the small C++ program, the high-resolution clock from the chrono library was used to measure each execution of the FFT independently.
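A minimal sketch of such a timing loop is shown below (illustrative only; run_fft is a placeholder for the entry point into the generated code):

    #include <chrono>
    #include <complex>
    #include <vector>

    // Hypothetical entry point into the generated DaCe C++ code.
    void run_fft(std::complex<double>* data, int n, int r, int k);

    std::vector<double> benchmark(std::complex<double>* data, int n, int r, int k) {
        run_fft(data, n, r, k);  // warm-up: caches and, on GPU, the CUDA runtime
        std::vector<double> times;
        for (int i = 0; i < 100; ++i) {  // 100 individually measured runs
            auto t0 = std::chrono::high_resolution_clock::now();
            run_fft(data, n, r, k);
            auto t1 = std::chrono::high_resolution_clock::now();
            times.push_back(std::chrono::duration<double>(t1 - t0).count());
        }
        return times;  // seconds per run
    }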

The time taken to transfer data to and from the GPU is included in the execution time for the GPU tests. When the DaCe implementation is transformed to use GPU resources, it will include all necessary array (de-)allocations in GPU memory and the data movement between CPU and GPU memory in the function that we call to execute our FFT implementation.

Using these measurements, the FLOP/s could be calculated under the assumption that the FFT takes 5n log2(n) operations, with n being the length of the input. When presenting performance figures, we use the harmonic mean when calculating the average FLOP/s for the different implementations.
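Spelled out, the conversion from measured times to an average FLOP/s figure corresponds to the following sketch (our own illustration):

    #include <cmath>
    #include <vector>

    // FLOP/s for one run, assuming the 5 n log2(n) operation count.
    double flop_rate(double seconds, double n) {
        return 5.0 * n * std::log2(n) / seconds;
    }

    // Harmonic mean of per-run FLOP/s values, as used in the figures.
    double harmonic_mean(const std::vector<double>& rates) {
        double inv_sum = 0.0;
        for (double r : rates) inv_sum += 1.0 / r;
        return static_cast<double>(rates.size()) / inv_sum;
    }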



5.3 Discrete Fourier Transform optimization via SDFG transformations

To evaluate the performance of the DFT in Section 4.1, the small C++ program was used, and the measurements were repeated 100 times. The FLOP/s can be calculated using 4n^2 operations for the GEMV and n^2 operations for the DFT matrix generation.

The tested input sizes were 32, 64, 128, 256, 512, 1024, 2048, and 4096, with between 1 and 24 OpenMP threads.

To calculate the time saved by using the symmetry of the matrix and only calculating half as many elements, the DFT matrix generation code was isolated and benchmarked individually.

5.4 Bottleneck analysis

By adding some additional functionality to the C++ program described in Section 5.2, we performed an analysis of the DaCe-FFT on CPU to find the most time-consuming parts of the generated code. By manually adding timers that reported the execution time of each part of the C++ code, we could accurately measure the impact of each part on the run time. The test was performed with a fixed FFT input length of 4096 elements and varying radices 2, 4, 8, 16, and 64. The analysis was performed on both the naive version and the optimized version on our development machine.

The code was divided into five parts based on the implementation described in Section 4.2:

• Startup is the initialization before the main loop,

• Permute is the short form for Stride Permutation,

• Twiddle factors,

• Matmul is the short form for Product with Butterfly, and

• Other covers all parts that are not covered by the other four categories, which are mostly memory (de-)allocation.

All listed parts are measured individually except for Other, which is just the total time minus the other parts combined. For the optimized version, the Matmul part includes the packing and unpacking of data.



5.5 Performance testing of the final program

The performance testing on HPC hardware against state-of-the-art implementations was done using small C++ programs.

The input lengths that the different implementations were tested on were generated by iterating over the radix r and the number of iterations k:

\[ n = r^k, \quad r \in [2, 256], \quad k \in [1, 30], \quad 32 \le n \le 65536 \]
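Enumerated explicitly, this corresponds to a loop like the following sketch (our own illustration):

    #include <cstdint>
    #include <set>

    // Enumerate all n = r^k with r in [2, 256] and k in [1, 30]
    // that fall inside the tested range 32 <= n <= 65536.
    std::set<std::uint64_t> test_lengths() {
        std::set<std::uint64_t> lengths;
        for (std::uint64_t r = 2; r <= 256; ++r) {
            std::uint64_t n = 1;
            for (int k = 1; k <= 30; ++k) {
                n *= r;
                if (n > 65536) break;  // n grows monotonically with k
                if (n >= 32) lengths.insert(n);
            }
        }
        return lengths;
    }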

Strong scaling

An important evaluation factor of a program is how it scales strongly, meaning that the problem size is held constant while the number of threads varies. We test the strong scaling of the DaCe CPU implementation by running each input with 1 up to the maximum number of hardware threads available.

Comparison to state-of-the-art

To evaluate the performance of the code generated from the SDFG, we compare our implementation against FFTW [58] on CPU and cuFFT [60] on GPU.

FFTW

We compiled FFTW version 3.3.8 using Intel ICC 19.0.1.144 with support for the SSE2, AVX, and AVX2 vector instruction sets, as well as OpenMP threading support. When preparing to execute an FFT, FFTW planned using the FFTW_MEASURE flag, meaning that it tries the most probable setups a few times and then selects the best-performing one. More rigorous planning is available via other flags, but the performance gain is usually minimal according to the FFTW documentation. From the documentation, we find that FFTW performs best on input lengths that can be factored into the primes 2, 3, 5, and 7.
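The planning and execution follow the standard FFTW API; a minimal single-threaded sketch (our own illustration) might look like:

    #include <fftw3.h>

    // Plan and execute a double-precision complex-to-complex FFT of
    // length n with FFTW_MEASURE, as in the benchmark setup. Note that
    // FFTW_MEASURE may overwrite the arrays during planning, so the
    // input data should be initialized after the plan is created.
    void run_fftw(fftw_complex* in, fftw_complex* out, int n) {
        fftw_plan plan = fftw_plan_dft_1d(n, in, out,
                                          FFTW_FORWARD, FFTW_MEASURE);
        fftw_execute(plan);  // the timed region in the benchmark
        fftw_destroy_plan(plan);
    }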

cuFFT

We used cuFFT version 10.1 supplied via the CUDA runtime. Before executing the measured runs for each length, the cufftPlan1d function was used with the CUFFT_Z2Z type to enable double-precision complex-to-complex transforms. The FFTs were then executed on the plans with the cufftExecZ2Z function. From the documentation, we find that cuFFT works best on input lengths that can be factored into the primes 2, 3, 5, and 7 [103].
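The corresponding plan-and-execute sequence in the cuFFT API can be sketched as follows (our own illustration, operating on data already resident in GPU memory):

    #include <cufft.h>

    // Plan and run a double-precision complex-to-complex (Z2Z)
    // transform of length n, in place on device memory.
    void run_cufft(cufftDoubleComplex* d_data, int n) {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);  // batch size 1
        cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }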

5.6 Computer systems

The benchmarks of the program were conducted on three different systems: a development machine, an HPC machine focusing on CPU, and an HPC machine focusing on GPU.

Development machine

The development machine was a workstation with an Intel i9-10920X processor and 32 GB of RAM, running Ubuntu 18.04 LTS (kernel: 5.3.0-46-generic). The machine featured an Nvidia RTX-2060 with 8 GB GDDR6 RAM.

The compiler was Intel ICC 19.1.1.217, linked to Intel MKL with the -mkl=parallel flag. The MKL version was 2020.1. When compiling for CUDA, GCC version 8.4.0 and NVCC version 10.2 were used.

HPC CPU system

The CPU system is a node on Beskow, a Cray XC-40 system. The compute node used contained two Xeon E5-2698v3 processors with 64 GB of RAM, where each processor had 16 cores and 32 threads.

The code was compiled using Intel ICC 19.0.1.144 and linked to Intel MKL with the -mkl=parallel flag. The MKL version was 2019.0 Update 1.

Before executing the program, the OMP_PROC_BIND=true and OMP_PLACES=cores environment variables were set to control thread placement during execution. The first binds threads to specific cores and prevents migration to other cores during execution. The second causes threads to be distributed over physical cores first instead of utilizing hyperthreading first.

HPC GPU system

The GPU benchmarks were run on a custom system at PDC. The system was equipped with two Power8 processors and 132.8 GB of RAM, where each processor had 8 cores and 64 threads. The system had two Nvidia P100 SMX2 GPUs with 16 GB HBM2 RAM each.

The system ran CentOS Linux 7.6.1810 and the code was compiled using GCCversion 8.3.0 and NVCC 10.1.



Chapter 6

Results

In this chapter, we present the results from the experiments performed in Chapter 5.

6.1 Validation of computations

We ran the procedure described in Section 5.1 on our development machine and noted the differences for some input lengths.

Table 6.1a shows the differences acquired for the DaCe-FFT on CPU version, and Table 6.1b shows the differences acquired for the DaCe-FFT on GPU version. We can immediately see that these results are close to the maximum double-precision accuracy of around 10^{-16}.

Table 6.1: Differences between the results from the DaCe CPU and GPU implementations and the reference MKL FFT function.

(a) DaCe CPU

    Input length (n)    Difference
    128                 3.5062 × 10^{-13}
    1024                7.7635 × 10^{-13}
    4096                9.0438 × 10^{-13}

(b) DaCe GPU

    Input length (n)    Difference
    128                 3.4624 × 10^{-13}
    1024                7.7635 × 10^{-13}
    4096                9.0438 × 10^{-13}



[Figure 6.1 (plot): input size (n), 32 to 4096, on the x-axis versus performance (GFLOP/s), 1 to 6, on the y-axis; series: CPU naive, CPU MapReduceFusion, CPU no WCR, CPU MapFusion, CPU GEMV, GPU naive, GPU halved, GPU MapFusion.]
Figure 6.1: Performance of the different variants of DaCe-DFT on CPU and GPU.

6.2 Improvement of DFT performance via SDFG transformations

We ran the benchmark described in Section 5.3 on our development machine (see Section 5.6).

SDFG transformation performance gains

Even though the tests were supposed to be run with 1 to 24 OpenMP threads, the naive and MapReduceFusion variants were only run with 1 and 2 threads because of excessive run times with additional threads. Figure 6.1 shows the performance of the DFT-SDFG after the application of transformations to the original naive implementation.



[Figure 6.2 (plot): input size (n), 32 to 4096, on the x-axis versus performance (GFLOP/s), 1 to 6, on the y-axis; series: CPU no WCR, CPU MapFusion, CPU GEMV, CPU halved.]
Figure 6.2: Performance comparison between the DaCe-DFT on CPU for a select number of regular variants and the halved variant.

The figure only shows the best-performing variant with regard to the number of threads; the best-performing variants used between 11 and 24 threads, except for the naive and MapReduceFusion variants, which used only 1 thread.

We can see in the figure that the naive and MapReduceFusion variants are the worst-performing; their peak performance is 0.11 GFLOP/s and 0.13 GFLOP/s, respectively. The variant where the WCR is disabled and the variant using the MKL GEMV have very similar performance profiles. The best-performing CPU variant is the MapFusion variant, peaking at 3.64 GFLOP/s.

On the GPU side, we see that the MapFusion variant performs the best and peaks at 5.24 GFLOP/s, the naive variant peaks at 2.68 GFLOP/s, and the halved variant peaks at 1.93 GFLOP/s. An interesting observation is that when the input length grows, the performance of the different variants converges around 2 GFLOP/s.

Performance using DFT matrix symmetry

Figure 6.2 shows the performance of selected normal CPU variants versus the halved symmetrical variants on CPU.

We immediately see that the performance of the halved variants does not significantly differ from the normal variants.

Figure 6.3 shows the time taken for the normal generation of the DFT matrix divided by that of the halved generation, for all tested input lengths and all possible thread counts.



[Figure 6.3 (plot): number of threads, 1 to 24, on the x-axis versus relative time taken, 0 to 3.5, on the y-axis; series: n = 32, n = 256.]
Figure 6.3: Speedup comparison of the isolated DFT matrix generation between the full and halved variants.

Two input lengths of interest, n = 32 and n = 256, are distinguished in the figure; for these lengths, the halved variant shows a significant speedup for some thread configurations compared to the normal variant.

6.3 Bottleneck analysis of CPU implementation

We performed the bottleneck analysis of the naive and optimized implementations described in Section 5.4 on our development machine (see Section 5.6).

Time spent in different parts

Figure 6.4 shows a comparison of the time distribution for the naive and optimized versions. To showcase the massive difference in time spent on the matmul part in the naive version, the naive version with the lowest percentage of time spent on matmul is compared to the optimized version with the highest percentage of time spent on matmul. The naive version performs best at a radix of 4, resulting in a total run time of 2.42 s, where 98.01% is spent on the matmul. The optimized version performs worst at a radix of 16, resulting in an average total run time of 22 ms, where 33.42% is spent on the matmul.



[Figure 6.4 (bar chart): parts of the program (Other, Startup, Permute, Twiddle factors, Matmul) on the x-axis versus percentage of total time taken, 0 to 100%, on the y-axis; series: Naive, Optimized.]
Figure 6.4: Distribution of the time taken for the different stages in the naive and optimized DaCe implementations with an FFT of length n = 4096. The optimized is the worst case with r = 16 and the naive is the best case with r = 4.



[Figure 6.5 (bar chart): parts of the program (Other, Startup, Permute, Twiddle factors, Matmul) on the x-axis versus percentage of total time taken, 0 to 70%, on the y-axis; series: radix size (r) = 2, 4, 8, 16, 64.]
Figure 6.5: Distribution of the time taken for the different parts of the optimized program with the input length set to 4096 and varying radix.

These results show that to increase the performance of the program, one should optimize the matmul part; other optimizations would save around 50 ms at most, negligible in comparison to the total run time of 2.42 s. The naive radix where the matmul took the most time was 64, resulting in an average run time of 12.98 s with 99.66% spent on the matmul.

The geometric mean of the speedup of the optimized program over the naive one across all listed radices is in this case around 89 times.

Optimized variant on CPU

Figure 6.5 shows how the time distribution for the different parts varies with the size of the radix. The run times for the radices from 2 to 64 were 124 ms, 49 ms, 30 ms, 22 ms, and 53 ms, respectively.

The first thing we notice about the distribution is that the start-up part grows quickly and starts taking a large share of the total run time. The most probable reason is that the size of the DFT matrix used in the matmul step scales as a square function of the radix R, as seen in Listing 4.5.

The decrease in the fraction of time spent on the Twiddle factors can be attributed to the fact that DaCe uses an OpenMP pragma directive to parallelize the @dace.map seen in Listing 4.6. By default, the directive only parallelizes the outermost loop, which is the one iterating from 0 to R, meaning that for small radices this step is limited to R threads even though the total length is n = 4096 and there are no race conditions. This leaves all cases potentially limited, except radix 64, where R is larger than the number of threads.
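The limitation and a possible remedy can be sketched with a plain OpenMP loop nest (our own illustration of the loop structure, not the generated code):

    #include <omp.h>

    void work(int ii, int jj);  // placeholder for the tasklet body

    void twiddle(int R, int N_div_R) {
        // Default: only the outer loop is parallelized, so at most
        // R threads can be busy when R is small.
        #pragma omp parallel for
        for (int ii = 0; ii < R; ++ii)
            for (int jj = 0; jj < N_div_R; ++jj)
                work(ii, jj);

        // Collapsing both loops exposes R * N_div_R iterations instead.
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < R; ++ii)
            for (int jj = 0; jj < N_div_R; ++jj)
                work(ii, jj);
    }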



[Figure 6.6 (plot): number of threads, 0 to 35, on the x-axis versus speedup compared to one thread, 0 to 4.5, on the y-axis; series: r = 40, k = 3, n = 64000; r = 39, k = 3, n = 59319; r = 38, k = 3, n = 54872; r = 66, k = 1, n = 66; r = 70, k = 1, n = 70; r = 69, k = 1, n = 69.]
Figure 6.6: Performance scaling with varying number of threads for the three best and three worst implementations in DaCe-FFT on CPU.


6.4 Performance testing of DaCe-FFT

In this section, we measure the performance of our final DaCe implementation with the benchmark described in Section 5.5 on HPC hardware (see Section 5.6).

Strong scaling on CPU

When we increase the number of threads, we see a speedup for the better-performing variants, while some variants experience little to no speedup, as seen in Figure 6.6.



[Figure 6.7 (plot): input size (n), 0 to 60000, on the x-axis versus performance (GFLOP/s, logarithmic, 0.01 to 100) on the y-axis; series: FFTW, DaCe.]
Figure 6.7: Performance comparison between DaCe-FFT on CPU and FFTW. For DaCe, the best thread and radix configuration was selected for each input length.


The best-performing variant scales up to 15 threads and achieves a speedup of 4.09 over the same variant running on 1 thread. The worst-performing variant reaches its lowest point, a speedup factor of 0.02 at 15 threads, compared to the same variant running on 1 thread.

We can see from the figure that the performance starts to oscillate after 15 threads for the best-performing variants and after around 12 threads for the worst variants. Some thread counts provide good performance while others provide bad performance.

Additionally, the worst-performing variants start by slowing down as threads are added until, around 9 to 12 threads, they suddenly gain a tiny bit of performance before collapsing again.

Comparison against state-of-the-art

Here we present the results of our benchmarks against the state-of-the-art on CPU as well as GPU.



Table 6.2: Best and worst performance factor of the final DaCe program compared to FFTW on CPU. The performance factor is the FLOP/s value of DaCe divided by the FLOP/s value of FFTW.

              n      r      k    Factor
    Best      49     7      2    54.07%
    Worst     240    240    1    0.334%

[Figure 6.8 (plot): radix (r), 2 to 256, on the x-axis versus performance (GFLOP/s), 0 to 3, on the y-axis, with the corresponding transform size, 0 to 65536, plotted alongside; series: transform size, performance.]
Figure 6.8: Performance comparison with different radices in DaCe-FFT on CPU.

CPU

The comparison between our DaCe CPU implementation and FFTW can be seen in Figure 6.7. Table 6.2 lists the best- and worst-performing input lengths when comparing the FLOP/s count of the two.

The performance of our DaCe implementation is quite stable and has peaks distributed throughout the possible input lengths; these are radices in the range 10 to 40 with iteration depth k larger than 2, while the others are larger radices with iteration depth k equal to 2. This effect can be seen in Figure 6.8, which displays the best performance for each radix.

The FFTW implementation is better than our DaCe implementation for every single tested input length, by a rather large margin in most cases, as the geometric mean of the performance factor is 3.5%. The peak performance reached by FFTW is 118.2 GFLOP/s, while DaCe-FFT on CPU reaches a peak of 2.35 GFLOP/s. There are a few exceptions where DaCe-FFT starts to reach acceptable performance levels compared to FFTW, but they are limited to shorter inputs.



[Figure 6.9 (plot): input size (n), 0 to 60000, on the x-axis versus performance (GFLOP/s, logarithmic, 0.01 to 10) on the y-axis; series: cuFFT, DaCe.]
Figure 6.9: Performance comparison between DaCe-FFT on GPU and cuFFT. For DaCe, the best radix configuration was selected for each input length.

Table 6.3: Worst- and best-case performance factor of the DaCe program compared to cuFFT on GPU. The performance factor is the FLOP/s value of DaCe divided by the FLOP/s value of cuFFT.

              n         r      k    Factor
    Best      27 889    167    2    62.34%
    Worst     224       224    1    10.33%


GPU

The comparison between the DaCe-FFT on GPU implementation and cuFFT can be seen in Figure 6.9. Table 6.3 lists the best- and worst-performing combinations of input length and radix, with DaCe-FFT performance given as a percentage of cuFFT performance for the same length.

cuFFT is better than DaCe-FFT on GPU for every input length tested, and the geometric mean of the performance factor is 23.76%. cuFFT reaches a peak performance of 17.94 GFLOP/s, while DaCe-FFT on GPU reaches 4.317 GFLOP/s.

The input lengths where DaCe-FFT on GPU performs best compared to cuFFT can be found around lengths of 30 000, where cuFFT has some sudden drops in performance, as well as around input lengths of 55 000.



[Figure 6.10 (plot): radix (r), 2 to 256, on the x-axis versus performance (GFLOP/s), 0 to 5, on the y-axis, with the corresponding transform size, 0 to 65536, plotted alongside; series: transform size, performance.]
Figure 6.10: Performance comparison with different radices in DaCe-FFT on GPU.


The performance of DaCe-FFT on GPU increases steadily with the length of the input but fluctuates along the way. When looking at the performance versus radix in Figure 6.10, we notice that the performance follows different interleaved exponential curves; as an example, one such curve appears between radix 35 and 120, with another appearing between 82 and 105.



Chapter 7

Discussion and conclusions

In this chapter we discuss the results presented in Chapter 6, draw conclusions from the results in regard to the research questions presented in Chapter 1, summarize the work, and present suggestions for future work.

7.1 Discussion

Validation of computations

In this section, we discuss the results presented in Section 6.1.

The differences in Tables 6.1a and 6.1b are 10^3 times larger than the 10^{-16} we would have preferred, but are still in a reasonable range. There are two possible reasons why the results are not perfectly accurate. The first is that the value of π used for generating the DFT matrix and the twiddle factors is defined manually with a limited number of decimals, limiting the precision. The second could be the structure of the implementation, where some of the design decisions may have affected the accuracy, but analyzing this is outside the scope of this work.

A small test during the thesis writing confirms that the first theory is the cause: increasing the number of decimals in the value of π to over 20 decreases the difference to around 3 × 10^{-16} for an FFT of length 1024 using DaCe-FFT on CPU.



Performance improvements of DaCe-DFT

In this section, we discuss the results presented in Section 6.2.

The naive DaCe-DFT and the version with MapReduceFusion applied were only run with 1 to 2 threads due to excessive run times. This was most likely caused by the #pragma omp critical inserted to prevent data races, which only allows one thread at a time to perform the critical operation. As the loop body is very simple, all threads constantly wait for each other, leading to no concurrency, with added overhead for locking and unlocking the critical region. Using #pragma omp atomic would be preferable, as it uses atomic instructions instead, letting the processor handle each memory operation separately and allowing for concurrency with less overhead.
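The difference can be sketched with a plain accumulation loop (our own illustration, not the generated code):

    #include <omp.h>

    void accumulate(const double* values, int n, double* sum) {
        // Critical section: one thread at a time, with lock/unlock
        // overhead on every iteration.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            #pragma omp critical
            { *sum += values[i]; }
        }

        // Atomic update: a hardware atomic instruction per iteration,
        // with far less overhead.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            *sum += values[i];
        }
    }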

In Figure 6.1 for DaCe-DFT, we note that the MapFusion variant performs much better than the rest of the CPU variants. The regular variants calculate and write each element of the DFT matrix and then read each element back before the multiplication. As each element is used only once, the write/read operations are completely unnecessary, leading to a performance loss. Additionally, we see that there is no performance gain from using the MKL GEMV routine, indicating that the GEMV operation is most likely memory-bound, as a rather naive implementation matches an optimized library routine.

Figure 6.2 shows that there is no speedup from using the halved variant. However, Figure 6.3, which shows the performance comparison for the isolated DFT matrix generation, shows that some halved matrices are generated faster than the regular ones, especially for the cases n = 32 and n = 256, which could contradict the results from Figure 6.2. A possible explanation is that the results in Figure 6.2 use between 11 and 24 threads, where the difference between the normal and halved generation is minimal, as seen in Figure 6.3. The fact that the halved DFT matrix provides no speedup in this benchmark is intriguing. The most reasonable explanation is that, by default, the static thread scheduler is used when the @dace.map is parallelized, meaning that the i range is split into equal chunks. Because of the triangular structure, higher values of i need fewer calculations to finish, so the first chunks are more computationally heavy. This leads to the whole process being limited by the slowest thread. It could be improved by using another scheduler or adjusting the chunk size manually, as sketched below.
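For a triangular iteration space, the imbalance and one possible remedy look like this in plain OpenMP terms (our own illustration, not the DaCe-generated code):

    #include <omp.h>

    void compute_element(int i, int j);  // placeholder for one matrix entry

    void generate_halved_dft(int n) {
        // schedule(static) would split [0, n) into equal contiguous
        // chunks, but row i only has n - i entries, so early chunks
        // carry more work and the slowest thread dominates.
        // schedule(dynamic, 16) hands out small chunks on demand and
        // balances the triangular workload.
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; ++i)
            for (int j = i; j < n; ++j)
                compute_element(i, j);
    }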

The results on GPU are also interesting, as the performance peaks around sizes of 512 to 1024 and then drops below CPU performance levels. This could have to do with the thread-block sizes that DaCe defaults to when running CUDA kernels; the performance could probably be tuned by changing the block sizes.

Bottleneck analysis

In this section, we discuss the results presented in Section 6.3.

Naive vs optimized

The naive DaCe-FFT uses a map and a reduction node to perform the Product with Butterfly, meaning that a sparse operation is done using a normal matrix. This operation is very inefficient and could explain the large performance penalty. Applying a MapReduceFusion transformation increases performance significantly but still falls far short of the matmul code. The naive version with MapReduceFusion applied could in theory be optimized further by applying SDFG transformations for loop blocking, but this might be difficult due to the variable indices used by the map and the fact that the optimal cache reuse depends on the radix.

A shortcoming of the comparison between the two is that Figures 6.4 and 6.5 do not convey the change in total time taken; for example, it could be misunderstood that the optimized matmul takes around a third of the time of the naive matmul.

Optimized variant comparison

While the change from the naive Product with Butterfly to the optimized GEMM routines improved performance, there is still potential for additional improvements. The operation is still inefficient, as it requires (un-)packing of data. The data movement is inefficient, and a better solution would be to call the GEMM routine using strides on the original arrays. This could possibly be implemented by calling a sub-SDFG with a change of the pointers to the different arrays. However, it is not obvious how to do this while only using restricted Python and developing in DIODE. Manually changing the code to use the GEMM routine, as was done for DaCe-DFT with GEMV in Section 4.1, is possible but goes against the core idea of the SDFG.

The downside of the optimization with the GEMM routine is that it moved the code away from the original notation that was close to the mathematical definition, even though the goal with DaCe is to avoid changes to the original implementation.

One of the shortcomings of Figure 6.5 is that it does not convey the fact that the different radices have different run times.

Performance of DaCe-FFT

Here we discuss the results for the strong scaling test on CPU and the comparison against the state-of-the-art presented in Section 6.4.

Strong scaling on CPU

From Figure 6.6, we see that the performance scaling seems to break at around 16 threads. The reason is most likely that the benchmark uses OMP_PROC_BIND=true and OMP_PLACES=cores, meaning that the first 16 threads are placed on the first processor, while the rest are placed on the second processor. The placement of threads on the second processor causes cache problems, since data is shared between two processors; the alternating pattern seems to suggest that some thread counts distribute sections of the data better between processors. It was, however, hard to find documentation about the behavior of thread placement, and it seems to depend on the compiler used. Further analysis could have been performed using Intel VTune [104] to localize the root issues with the multi-threading, but was not performed due to a lack of time.

The speedup when using multiple threads does not scale linearly, as the DaCe-FFT scaling peaks at a 4.09× speedup at 15 threads, far from the ideal 15× speedup. We would have liked to compare the strong scaling of DaCe-FFT to FFTW, but unfortunately, with the planning setup we used in FFTW, we do not have access to the scaling information of FFTW on our system.

Performance on CPU

In this section, we discuss the performance of DaCe-FFT on CPU and compare it to the performance of FFTW.

Given the low geometric mean of 3.5% for the relative performance of DaCe-FFT on CPU compared to FFTW, we can conclude that this implementation is not suitable for general use in production. It should be noted that the Stockham FFT is optimized for vector-parallel machines like GPUs, not deep-cache machines like the Intel CPUs used in the benchmarks. FFTW also has additional advantages over DaCe-FFT, the most important being that FFTW plans executions of FFTs and selects the best-performing transform and thread count.

The fact that some input lengths, such as n = 49, result in reasonable performance in comparison to FFTW should not be taken too seriously. The spikes in performance are most likely due to the timings being skewed by the setup phase, which is large for both DaCe-FFT and FFTW, leading to a smaller fraction of time being spent on the actual FFT routine. The relevant performance comparison is for lengths above 1000, and as we can see in Figure 6.7, DaCe-FFT stands no chance against FFTW there.

Performance on GPU

In this section, we discuss the performance of DaCe-FFT on GPU and compare it to the performance of cuFFT.

DaCe-FFT on GPU performs better relative to cuFFT than DaCe-FFT on CPU did relative to FFTW, as seen in the geometric mean of 23.76%. As mentioned earlier, the Stockham FFT was designed for vector-parallel machines such as GPUs, meaning that this result is somewhat expected. However, it is crucial to remember that the transfer of data is included in the performance calculation, meaning that the FLOP/s difference could be skewed.

A way to counter this effect would have been to measure the average allocation and transfer time between CPU and GPU, subtract it from the times measured for DaCe-FFT on GPU, and compare against FFT execution on data already in GPU memory. This was not done due to a lack of time.

As cuFFT is proprietary and Nvidia provides little information about the actual implementation, it is difficult to compare it with DaCe-FFT on GPU. cuFFT is stated to be an implementation of the Cooley-Tukey FFT, but the Stockham FFT is just a Cooley-Tukey in another form, making the statement rather uninformative.

As we can see in Figure 6.10, the performance goes up with larger transform sizes. It is worth noting that each drastic performance drop up until the peak at radix 40 corresponds to a lowering of the maximum value of k, thus limiting the size of the transform. As an example, r = 40 with k = 3 gives 64 000, while r = 41 with k = 3 gives 68 921, which is over the limit of 65 536. A probable explanation is that larger transform sizes enable higher utilization of the massive parallelism in the GPU. Additionally, DaCe-FFT on GPU should not suffer the same problems as DaCe-FFT on CPU with regard to the generation of the DFT matrix, since the GPU can calculate thousands of matrix values concurrently, while the CPU can only do at most 32 concurrently.

Problems encountered during implementation

Here, we discuss the problems that were encountered during the development of our DaCe-FFT implementation.

Errors in generated code

As mentioned in Section 4.2, DaCe sometimes generates erroneous C++ code from certain SDFGs that either produces incorrect results or does not compile. These problems are simply symptoms of DaCe being in an active state of development and not yet feature-complete. Many of them will be solved in future updates as the project matures.

Indices

As mentioned in Section 4.2, DaCe does not currently support dynamic map indices in the way that this work uses them. One of the major issues with the approach taken here is that it does not work well with the graph pattern matcher when it is trying to find possible transformations. This issue could be solved in two ways: either improve the pattern matcher so it can match the maps present in this work, or rewrite the code to use maps that iterate from start to end and calculate the index inside the tasklet. However, both approaches are too complex or time-consuming for this work.

Using the code outside small scripts

Due to the complex structure of DaCe, it can be difficult to use DaCe-FFT on large HPC systems, as the compilation procedure is rather convoluted. This goes against one of the core ideas of DaCe, which is to enable performance portability, but it can probably be solved in the future with more implementation work on the Python backend, the design of the build system, and the generated C++ code.



7.2 Scientific closure

Here, we return to the hypothesis and research questions formulated in Section 1.2.

How do we express algorithms such as the FFT in DaCe? Algorithms are implemented as stateful graphs in DaCe using SDFGs, meaning that the user must convert the algorithm to a suitable form. As DaCe forces concurrency by design and enables merging of overlapping operations, the algorithm should be divided into the smallest sub-units possible to avoid complex logic and enable DaCe transformations. Establishing a clear logic flow and translating the mathematical notation into code after dividing the algorithm into parts provides all the information needed for an implementation in DaCe.

How do we port DaCe-FFT to GPU? Porting SDFGs to use GPU resources is enabled by the design decision to force concurrency. With only one transformation and minimal manual changes, the algorithm runs on GPU. Compared to the usual process of moving calculations to GPU, where manual memory management and kernel implementation are needed, this reduces the time required from hours to minutes. Even though not showcased here, DaCe automatically utilizes asynchronous data transfers and kernel launches when using GPUs, made possible by the forced concurrency.

How do we improve the performance of DaCe-FFT? Improving the performance of an SDFG is preferably done via the transformations integrated into DaCe, but sometimes it is necessary to make changes in the original code. Using transformations to remove the unnecessary reduce operation was trivial, while attempts to enable cache-friendly looping and vectorization via transformations failed. Switching to library code for calculating the matrix-vector product improved performance, but the goal of achieving performance on par with the state-of-the-art was not reached.

To evaluate the hypothesis stated in Section 1.1, we summarize the results from the scientific questions. We were able to implement an FFT in DaCe, we ported it to use GPU resources, and we managed to improve the performance of the naive implementation. However, we did not reach performance comparable to state-of-the-art solutions. It is also hard to objectively judge the time and effort spent on the implementation and compare it to a regular implementation in C/C++.



With these answers, we can deduce that the hypothesis was partly falsified, as the time aspect is hard to measure and the performance of the final program was not on par with current state-of-the-art solutions.

7.3 Conclusions

7.3.1 Summary

In this work, we have implemented a simple DFT and the Stockham FFT algorithm in DaCe, verified the correctness of the calculations, analyzed bottlenecks in the code, and compared the performance against the state-of-the-art on both CPU and GPU systems aimed at HPC.

7.3.2 Future work

This work has shown that implementing algorithms such as the Stockham FFT in DaCe is possible; however, given the poor performance of the implementation, it is clear that further development is needed. The first step could be to optimize the Product with Butterfly step using transformations that enable vector instructions and cache-friendlier access. This could be achieved using a nested sub-graph or by writing a small matmul kernel that can be transformed.

An interesting idea stemming from the inefficiencies of the Product with Butterfly is whether the optimization could be automated, either by a tuning tool that applies different transformations or by automatically switching the naive code for the optimized GEMM code. Projects of this kind exist in the form of polyhedral compilers, but the non-uniform array accesses could make it a bit more complicated.

The SDFG structure enables visualizing data movement in DaCe, and in a similar fashion, it could be useful to display the run time of the different operations, akin to TensorFlow.



Bibliography

[1] Jeffrey S. Vetter et al. “Extreme Heterogeneity 2018 - Productive Computational Science in the Era of Extreme Heterogeneity: Report for DOE ASCR Workshop on Extreme Heterogeneity”. In: (Dec. 2018). doi: 10.2172/1473756.

[2] TOP500.org. TOP500 June 2020. url: https://www.top500.org/lists/top500/2020/06/ (visited on 06/23/2020).

[3] James Jeffers and James Reinders. Intel Xeon Phi Coprocessor High Performance Programming. Newnes, 2013.

[4] University of Tsukuba Center for Computational Sciences. Overview of Cygnus: a new Supercomputer at CCS. url: https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2018/12/About-Cygnus.pdf (visited on 09/03/2020).

[5] Michael Feldman. German University Will Deploy FPGA-Powered Cray Supercomputer. url: https://www.top500.org/news/german-university-will-deploy-fpga-powered-cray-supercomputer/ (visited on 09/03/2020).

[6] Norman P. Jouppi et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit”. In: SIGARCH Comput. Archit. News 45.2 (June 2017), pp. 1–12. issn: 0163-5964. doi: 10.1145/3140659.3080246. url: https://doi.org/10.1145/3140659.3080246.

[7] M. Bohr. “A 30 Year Retrospective on Dennard’s MOSFET Scaling Paper”. In: IEEE Solid-State Circuits Society Newsletter 12.1 (2007), pp. 11–13.

[8] M. Mitchell Waldrop. “The Chips are Down for Moore’s Law”. In: Nature News 530.7589 (2016), p. 144.



[9] Tal Ben-Nun et al. “Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’19. 2019.

[10] L. Dagum and R. Menon. “OpenMP: an Industry Standard API for Shared-Memory Programming”. In: IEEE Computational Science and Engineering 5.1 (1998), pp. 46–55.

[11] William Gropp et al. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Vol. 1. MIT Press, 1999.

[12] Chuck L Lawson et al. “Basic Linear Algebra Subprograms for Fortran Usage”. In: ACM Transactions on Mathematical Software (TOMS) 5.3 (1979), pp. 308–323.

[13] Bradford L Chamberlain, David Callahan, and Hans P Zima. “Parallel Programmability and the Chapel Language”. In: The International Journal of High Performance Computing Applications 21.3 (2007), pp. 291–312.

[14] H Carter Edwards, Christian R Trott, and Daniel Sunderland. “Kokkos: Enabling Manycore Performance Portability Through Polymorphic Memory Access Patterns”. In: Journal of Parallel and Distributed Computing 74.12 (2014), pp. 3202–3216.

[15] Jonathan Ragan-Kelley et al. “Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines”. In: SIGPLAN Not. 48.6 (June 2013), pp. 519–530. issn: 0362-1340. doi: 10.1145/2499370.2462176. url: https://doi.org/10.1145/2499370.2462176.

[16] John Nickolls et al. “Scalable Parallel Programming with CUDA”. In: Queue 6.2 (Mar. 2008), pp. 40–53. issn: 1542-7730. doi: 10.1145/1365490.1365500. url: https://doi.org/10.1145/1365490.1365500.

[17] Aaftab Munshi. “The OpenCL Specification”. In: 2009 IEEE Hot Chips 21 Symposium (HCS). IEEE. 2009, pp. 1–314.

[18] Sandra Wienke et al. “OpenACC — First Experiences with Real-World Applications”. In: Euro-Par 2012 Parallel Processing. Ed. by Christos Kaklamanis, Theodore Papatheodorou, and Paul G. Spirakis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 859–870. isbn: 978-3-642-32820-6.



[19] Guido Van Rossum and Fred L Drake Jr. Python Reference Manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995.

[20] Charles Van Loan. Computational Frameworks for the Fast Fourier Transform. Frontiers in Applied Mathematics 10. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM), 1992. isbn: 1-61197-099-7.

[21] Jack Dongarra and Francis Sullivan. “Guest Editors’ Introduction: The Top 10 Algorithms”. In: Computing in Science & Engineering 2.1 (2000), pp. 22–23.

[22] K.R Rao, D.N Kim, and J.-J Hwang. Fast Fourier Transform - Algorithms and Applications. Signals and Communication Technology. Dordrecht: Springer, 2011. isbn: 9781402066283.

[23] Matteo Frigo and Steven G Johnson. “The Design and Implementation of FFTW3”. In: Proceedings of the IEEE 93.2 (2005), pp. 216–231.

[24] Alexandros Nikolaos Ziogas et al. “A Data-Centric Approach to Extreme-Scale ab initio Dissipative Quantum Transport Simulations”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019, pp. 1–13.

[25] Alexandros Nikolaos Ziogas et al. “Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019, pp. 1–17.

[26] D. H. Woo and H. S. Lee. “Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era”. In: Computer 41.12 (2008), pp. 24–31.

[27] Michael John Sebastian Smith. Application-Specific Integrated Circuits. Vol. 7. Addison-Wesley Reading, MA, 1997.

[28] Stephen M Trimberger. Field-Programmable Gate Array Technology. Springer Science & Business Media, 2012.

[29] R. R. Schaller. “Moore’s Law: Past, Present and Future”. In: IEEE Spectrum 34.6 (1997), pp. 52–59.



[30] Jack Dongarra. Report on the Tianhe-2A System. Tech. rep. University of Tennessee / Oak Ridge National Laboratory, 2017. url: https://www.dropbox.com/s/0jyh5qlgok73t1f/TH-2A-report.pdf?dl=0.

[31] David E Shaw et al. “Millisecond-Scale Molecular Dynamics Simulations on Anton”. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 2009, pp. 1–11. isbn: 9781605587448.

[32] Raffaele Tripiccione. “JANUS FPGA-Based Machine”. In: Encyclopedia of Parallel Computing. Ed. by David Padua. Boston, MA: Springer US, 2011, pp. 985–992. isbn: 978-0-387-09766-4. doi: 10.1007/978-0-387-09766-4_414. url: https://doi.org/10.1007/978-0-387-09766-4_414.

[33] Rolf Rabenseifner et al. “Hybrid Parallel Programming on HPC Platforms”. In: Proceedings of the Fifth European Workshop on OpenMP, EWOMP. Vol. 3. 2003, pp. 185–194.

[34] Jack J Dongarra et al. “A Set of Level 3 Basic Linear Algebra Subprograms”. In: ACM Transactions on Mathematical Software (TOMS) 16.1 (1990), pp. 1–17.

[35] Stephen Lien Harrell et al. “Effective Performance Portability”. In: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE. 2018, pp. 24–36.

[36] Free Software Foundation, Inc. & GCC Team. GCC, the GNU Compiler Collection. url: https://gcc.gnu.org/ (visited on 08/10/2020).

[37] Intel Corporation. Intel® C++ Compiler. url: https://software.intel.com/content/www/us/en/develop/tools/compilers/c-compilers.html (visited on 08/10/2020).

[38] MKL Intel. “Intel Math Kernel Library”. In: (2007). url: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html.

[39] Zhang Xianyi, Wang Qian, and Zaheer Chothia. “OpenBLAS”. In: URL: http://xianyi.github.io/OpenBLAS (2012), p. 88.



[40] Rajeev Thakur, William Gropp, and Ewing Lusk. “On Implementing MPI-IO Portably and with High Performance”. In: Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems. IOPADS ’99. Atlanta, Georgia, USA: Association for Computing Machinery, 1999, pp. 23–32. isbn: 1581131232. doi: 10.1145/301816.301826. url: https://doi.org/10.1145/301816.301826.

[41] The Open MPI Project. Open MPI: Open Source High Performance Computing. url: https://www.open-mpi.org/ (visited on 08/10/2020).

[42] MPICH. MPICH. url: https://www.mpich.org/ (visited on 08/10/2020).

[43] Advanced Micro Devices, Inc. HIP Programming Guide. url: https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-GUIDE.html (visited on 08/10/2020).

[44] James Jeffers, James Reinders, and Avinash Sodani. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. 2nd. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2016. isbn: 0128091940.

[45] Donald Thomas and Philip Moorby. The Verilog® Hardware Description Language. Springer Science & Business Media, 2008.

[46] Peter J Ashenden. The Designer’s Guide to VHDL. Morgan Kaufmann, 2010.

[47] Tomasz S Czajkowski et al. “From OpenCL to High-Performance Hardware on FPGAs”. In: 22nd International Conference on Field Programmable Logic and Applications (FPL). IEEE. 2012, pp. 531–534.

[48] David Chinnery and Kurt Keutzer. Closing the Gap Between ASIC & Custom: Tools and Techniques for High-Performance ASIC Design. Springer Science & Business Media, 2002.

[49] Wesley M Johnston, JR Paul Hanna, and Richard J Millar. “Advances in Dataflow Programming Languages”. In: ACM Computing Surveys (CSUR) 36.1 (2004), pp. 1–34.

[50] MATLAB. MATLAB 9.7.0.1190202 (R2019b). Natick, Massachusetts: The MathWorks Inc., 2019.



[51] Chun Chen, Jacqueline Chame, and Mary Hall. CHiLL: A Framework for Composing High-Level Loop Transformations. Tech. rep. Citeseer, 2008.

[52] Chris Lattner and Vikram Adve. “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation”. In: International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE. 2004, pp. 75–86.

[53] Maria Kotsifakou et al. “HPVM: Heterogeneous Parallel Virtual Machine”. In: SIGPLAN Not. 53.1 (Feb. 2018), pp. 68–80. issn: 0362-1340. doi: 10.1145/3200691.3178493. url: https://doi.org/10.1145/3200691.3178493.

[54] Chris Lattner et al. “MLIR: A Compiler Infrastructure for the End of Moore’s Law”. In: arXiv preprint arXiv:2002.11054 (2020).

[55] David Van Der Spoel et al. “GROMACS: Fast, Flexible, and Free”. In: Journal of Computational Chemistry 26.16 (2005), pp. 1701–1718.

[56] James W Cooley and John W Tukey. “An Algorithm for the Machine Calculation of Complex Fourier Series”. In: Mathematics of Computation 19.90 (1965), pp. 297–301.

[57] F. Franchetti et al. “Discrete Fourier Transform on Multicore”. In: IEEE Signal Processing Magazine 26.6 (2009), pp. 90–102.

[58] M. Frigo and S. G. Johnson. “FFTW: an Adaptive Software Architecture for the FFT”. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 (Cat. No.98CH36181). Vol. 3. 1998, 1381–1384 vol. 3.

[59] Markus Puschel et al. “SPIRAL: Code Generation for DSP Transforms”. In: Proceedings of the IEEE 93.2 (2005), pp. 232–275.

[60] NVIDIA. cuFFT. url: https://developer.nvidia.com/cufft (visited on 06/23/2020).

[61] Bragadeesh Natarajan. clMathLibraries – clFFT. url: https://github.com/clMathLibraries/clFFT (visited on 07/10/2020).

[62] Naga K Govindaraju et al. “High Performance Discrete Fourier Transforms on Graphics Processors”. In: SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE. 2008, pp. 1–12.



[63] Akira Nukada and Satoshi Matsuoka. “Auto-Tuning 3-D FFT Library for CUDA GPUs”. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. IEEE. 2009, pp. 1–10.

[64] Bastian Köpcke, Michel Steuwer, and Sergei Gorlatch. “Generating Efficient FFT GPU Code with Lift”. In: Proceedings of the 8th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing. FHPNC 2019. Berlin, Germany: Association for Computing Machinery, 2019, pp. 1–13. isbn: 9781450368148. doi: 10.1145/3331553.3342613. url: https://doi.org/10.1145/3331553.3342613.

[65] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. “LIFT: A Functional Data-Parallel IR for High-Performance GPU Code Generation”. In: IEEE, 2017, pp. 74–85. isbn: 1509049312.

[66] Franz Franchetti et al. “FFTX and SpectralPack: A First Look”. In: IEEE, 2018, pp. 18–27. isbn: 172810114X.

[67] Gene H Golub. Scientific Computing and Differential Equations: an Introduction to Numerical Methods. 2nd ed. Boston: Academic Press, 1992. isbn: 0-12-289255-0.

[68] National Centers For Environmental Information at NOAA. Global Forecast System (GFS). url: https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00634# (visited on 06/30/2020).

[69] Paul F. Fischer, James W. Lottes, and Stefan G. Kerkemeier. nek5000 Web Page. http://nek5000.mcs.anl.gov. 2008.

[70] United States Geological Survey. What is High Performance Computing? url: https://www.usgs.gov/core-science-systems/sas/arc/about/what-high-performance-computing (visited on 06/02/2020).

[71] Oak Ridge National Laboratory. Summit. url: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/ (visited on 06/02/2020).

[72] Nvidia. NVIDIA DGX STATION. url: https://www.nvidia.com/en-us/data-center/dgx-station/ (visited on 06/02/2020).

[73] Michael Flynn. “Flynn’s Taxonomy”. In: Encyclopedia of Parallel Computing. Ed. by David Padua. Boston, MA: Springer US, 2011, pp. 689–697. isbn: 978-0-387-09766-4. doi: 10.1007/978-0-387-09766-4_2. url: https://doi.org/10.1007/978-0-387-09766-4_2.

[74] Chris Lomont. “Introduction to Intel Advanced Vector Extensions”. In: Intel White Paper 23 (2011).

[75] Mark Buxton. Haswell New Instruction Descriptions Now Available! url: https://software.intel.com/content/www/us/en/develop/blogs/haswell-new-instruction-descriptions-now-available.html (visited on 09/01/2020).

[76] Karl Rupp. 42 Years of Microprocessor Trend Data. url: https://github.com/karlrupp/microprocessor-trend-data (visited on 06/24/2020).

[77] Hadi Esmaeilzadeh et al. “Dark Silicon and the End of Multicore Scaling”. In: 2011 38th Annual International Symposium on Computer Architecture (ISCA). IEEE. 2011, pp. 365–376.

[78] Nvidia. NVIDIA A100 Tensor Core GPU Architecture. url: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf (visited on 06/03/2020).

[79] Advanced Micro Devices (AMD). AMD EPYC™ 7002 Series Processors: A New Standard for the Modern Data Center. url: https://www.amd.com/system/files/documents/AMD-EPYC-7002-Series-Datasheet.pdf (visited on 06/03/2020).

[80] J. D. Owens et al. “GPU Computing”. In: Proceedings of the IEEE 96.5 (2008), pp. 879–899.

[81] E. Scott Larsen and David McAllister. “Fast Matrix Multiplies Using Graphics Hardware”. In: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing. SC ’01. Denver, Colorado: Association for Computing Machinery, 2001, p. 55. isbn: 158113293X. doi: 10.1145/582034.582089. url: https://doi.org/10.1145/582034.582089.

[82] Advanced Micro Devices, Inc. AMD ROCm™ Open Ecosystem. url: https://www.amd.com/en/graphics/servers-solutions-rocm (visited on 09/01/2020).

[83] Oak Ridge Leadership Computing Facility. Frontier Spec Sheet. url: https://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet.pdf (visited on 09/01/2020).

[84] T. Hamada et al. “A Comparative Study on ASIC, FPGAs, GPUs and General Purpose Processors in the O(N^2) Gravitational N-body Simulation”. In: 2009 NASA/ESA Conference on Adaptive Hardware and Systems. 2009, pp. 447–452.

[85] J. Cong et al. “High-Level Synthesis for FPGAs: From Prototyping to Deployment”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30.4 (2011), pp. 473–491.

[86] Tomasz S. Czajkowski et al. “From OpenCL to High-Performance Hardware on FPGAs”. In: 22nd International Conference on Field Programmable Logic and Applications (FPL). IEEE. 2012, pp. 531–534.

[87] M. Bedford Taylor. “The Evolution of Bitcoin Hardware”. In: Computer 50.9 (2017), pp. 58–66.

[88] Blaise Barney et al. “Introduction to Parallel Computing”. In: Lawrence Livermore National Laboratory 6.13 (2010), p. 10.

[89] R. Baghdadi et al. “Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code”. In: 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 2019, pp. 193–205.

[90] Jingling Xue. Loop Tiling for Parallelism. Vol. 575. Springer Science & Business Media, 2012.

[91] Eric Jones, Travis Oliphant, Pearu Peterson, et al. “SciPy: Open Source Scientific Tools for Python”. In: (2001).

[92] Javier Duoandikoetxea and Javier Duoandikoetxea Zuazo. Fourier Analysis. Vol. 29. American Mathematical Soc., 2001.

[93] Ronald Newbold Bracewell and Ronald N. Bracewell. The Fourier Transform and Its Applications. Vol. 31999. McGraw-Hill New York, 1986.

[94] M. Heideman, D. Johnson, and C. Burrus. “Gauss and the History of the Fast Fourier Transform”. In: IEEE ASSP Magazine 1.4 (1984), pp. 14–21. issn: 0740-7467.

[95] Eric W. Weisstein. Permutation Matrix. From MathWorld—A Wolfram Web Resource. url: https://mathworld.wolfram.com/PermutationMatrix.html (visited on 07/01/2020).

[96] W. T. Cochran et al. “What is the Fast Fourier Transform?” In: Proceedings of the IEEE 55.10 (1967), pp. 1664–1674.

[97] Kjell Magne Fauske. Example: Radix-2 FFT Signal Flow. url: http://www.texample.net/tikz/examples/radix2fft/ (visited on 06/26/2020).

[98] Qrrbrbirlbel. Making FFT Figure Using LaTeX Tikz. url: https://tex.stackexchange.com/questions/239447/making-fft-figure-using-latex-tikz/239473#239473 (visited on 06/26/2020).

[99] Franz Franchetti and Markus Püschel. “FFT (Fast Fourier Transform)”. In: Encyclopedia of Parallel Computing. Ed. by David Padua. Boston, MA: Springer US, 2011, pp. 658–671. isbn: 978-0-387-09766-4. doi: 10.1007/978-0-387-09766-4_243. url: https://doi.org/10.1007/978-0-387-09766-4_243.

[100] Gabriel Bengtsson. GitHub Repository for Implementation of DaCe-FFT. url: https://github.com/Gabbeo/dace-fft (visited on 07/10/2020).

[101] Tim Sauer. Numerical Analysis. 2nd ed., Pearson New International Edition. Harlow, Essex: Pearson, 2014. isbn: 9781292023588.

[102] M. Frigo and S. G. Johnson. FFT Accuracy Benchmark Methodology. url: http://www.fftw.org/accuracy/method.html (visited on 06/03/2020).

[103] NVIDIA. The API Reference Guide for cuFFT, the CUDA Fast Fourier Transform Library. url: https://docs.nvidia.com/cuda/cufft/index.html (visited on 06/23/2020).

[104] Intel Corporation. Intel® VTune™ Profiler. url: https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html (visited on 09/15/2020).

Appendix A

Code

A.1 Code changed in DaCe

// Implementation of Complex128 summation reduction.
// Special case for Complex128.
template <>
struct _wcr_fixed<ReductionType::Sum, dace::complex128>
{
    // Reduce the real and imaginary parts separately; the double-precision
    // sum reduction (including its atomic variant) is already defined.
    static DACE_HDFI void reduce(dace::complex128 *ptr, const dace::complex128 &value)
    {
        double *real_ptr = reinterpret_cast<double *>(ptr);
        double *imag_ptr = real_ptr + 1;

        _wcr_fixed<ReductionType::Sum, double>::reduce(real_ptr, value.real());
        _wcr_fixed<ReductionType::Sum, double>::reduce(imag_ptr, value.imag());
    }

    static DACE_HDFI void reduce_atomic(dace::complex128 *ptr, const dace::complex128 &value)
    {
        double *real_ptr = reinterpret_cast<double *>(ptr);
        double *imag_ptr = real_ptr + 1;

        _wcr_fixed<ReductionType::Sum, double>::reduce_atomic(real_ptr, value.real());
        _wcr_fixed<ReductionType::Sum, double>::reduce_atomic(imag_ptr, value.imag());
    }

    DACE_HDFI dace::complex128 operator()(const dace::complex128 &a, const dace::complex128 &b) const
    {
        return a + b; // sum reduction: complex addition
    }
};

// Enables the template only when T is dace::complex128.
template <typename T>
using EnableIfComplex128 = typename std::enable_if<std::is_same<T, dace::complex128>::value>::type;

// When atomics are supported, use _wcr_fixed normally.
template <ReductionType REDTYPE, typename T>
struct wcr_fixed<REDTYPE, T, EnableIfComplex128<T>>
{
    static DACE_HDFI void reduce(T *ptr, const T &value)
    {
        _wcr_fixed<REDTYPE, T>::reduce(ptr, value);
    }

    static DACE_HDFI void reduce_atomic(T *ptr, const T &value)
    {
        _wcr_fixed<REDTYPE, T>::reduce_atomic(ptr, value);
    }

    DACE_HDFI T operator()(const T &a, const T &b) const
    {
        return _wcr_fixed<REDTYPE, T>()(a, b);
    }
};

Listing A.1: C++ template code that adds atomic addition for double-precision complex numbers on GPU for DaCe.
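To illustrate where this specialization is exercised, the following is a minimal, hypothetical DaCe program (the names complex_sum, x, and s are illustrative and not from the thesis) that sums a complex128 vector through a write-conflict resolution (WCR) edge. When such a program is mapped to GPU, concurrent writes to s are resolved with the atomic summation defined above:

import numpy as np
import dace

N = dace.symbol('N')

@dace.program
def complex_sum(x: dace.complex128[N], s: dace.complex128[1]):
    s[0] = 0
    for i in dace.map[0:N]:
        with dace.tasklet:
            v << x[i]
            # WCR memlet: conflicting writes are combined with a + b,
            # which dispatches to the complex128 sum reduction.
            out >> s(1, lambda a, b: a + b)[0]
            out = v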

Appendix B

Extra SDFG figures

B.1 DaCe-DFT

This section presents additional figures of the DFT-SDFGs.
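As a reading aid, the following NumPy sketch (hypothetical, for illustration only; the function name dft is not from the thesis) spells out the computation that the DFT-SDFGs below encode: the dft_mat_gen map fills the DFT matrix with the twiddle factors omega, and the dft_tasklet map followed by a sum reduction contracts that matrix with the input X to produce Y:

import numpy as np

def dft(x):
    # dft_mat_gen[i=0:N, j=0:N]: omega = exp(-2*pi*1j*i*j / N)
    N = len(x)
    i, j = np.mgrid[0:N, 0:N]
    dft_mat = np.exp(-2j * np.pi * i * j / N)
    # dft_tasklet[k=0:N, n=0:N]: out = x * omega, then Reduce (Sum) over n
    return dft_mat @ x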

Figure B.1: DFT-SDFG after applying MapReduceFusion.

Figure B.2: DFT-SDFG using the BLAS GEMV code (the tasklet calls cblas_zgemv(CblasRowMajor, CblasNoTrans, N, N, &alpha, omega, N, x, 1, &beta, y, 1)).

Figure B.3: DFT-SDFG using the DFT-matrix symmetry.

Figure B.4: DFT-SDFG after applying MapReduceFusion and ported to GPU using GPUTransformSDFG.

B.2 DaCe-FFT

This section presents additional figures of the FFT-SDFGs.
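As a reading aid for Figure B.5, the sketch below shows a radix-2 Stockham autosort FFT in plain NumPy. It is a hypothetical simplification: the thesis implementation is radix-R and expresses the permutation, twiddle-factor multiplication, and batched matrix multiplication as separate SDFG maps, whereas this sketch fuses them into explicit butterfly loops. The characteristic Stockham property is still visible: every stage writes its outputs already reordered, so no bit-reversal pass is needed.

import numpy as np

def stockham_fft(x):
    # Radix-2 Stockham autosort FFT; requires len(x) to be a power of two.
    x = np.asarray(x, dtype=np.complex128).copy()
    y = np.empty_like(x)
    N = len(x)
    assert N & (N - 1) == 0
    n, s = N, 1  # n: current sub-transform length, s: stride between sub-sequences
    while n > 1:
        m = n // 2
        for p in range(m):
            wp = np.exp(-2j * np.pi * p / n)  # twiddle factor
            for q in range(s):
                a = x[q + s * p]
                b = x[q + s * (p + m)]
                y[q + s * (2 * p)] = a + b           # butterfly, written reordered
                y[q + s * (2 * p + 1)] = (a - b) * wp
        x, y = y, x  # ping-pong between the two buffers instead of in-place updates
        n //= 2
        s *= 2
    return x

For example, stockham_fft(np.arange(8)) agrees with np.fft.fft(np.arange(8)) up to rounding error.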

Figure B.5: SDFG of the final version of DaCe-FFT on GPU.


TRITA-EECS-EX-2020:825

www.kth.se