DFScala: High Level Dataflow Support for Scala

Daniel Goodman∗ Behram Khan Salman Khan Mikel Luján Chris Seaton Ian Watson
University of Manchester

Oxford Road, Manchester, UK

{goodmand, khanb, salman.khan, mlujan, seatonc, watson}@cs.man.ac.uk

Abstract—In this paper we present DFScala, a library for constructing and executing dataflow graphs in the Scala language. Through the use of Scala this library allows the programmer to construct coarse grained dataflow graphs that take advantage of functional semantics for the dataflow graph and both functional and imperative semantics within the dataflow nodes. This combination allows for very clean code which exhibits the properties of dataflow programs, but we believe is more accessible to imperative programmers. We first describe DFScala in detail, before using a number of benchmarks to evaluate both its scalability and its absolute performance relative to existing codes. DFScala has been constructed as part of the Teraflux project and is being used extensively as a basis for further research into dataflow programming.

Keywords-Dataflow; Scala; Coarse Grained; Programming Model

I. INTRODUCTION

In this paper we introduce and evaluate DFScala, a dataflow programming library to allow the construction of coarse grained Dataflow programs [17] in the Scala [12] programming language.

The recent trend towards ever higher levels of parallelism through rising core counts in processors has reinvigorated the search for a solution to the problems of constructing correct parallel programs. This in turn has revived interest in Dataflow programming [17] and associated functional programming approaches [2]. With these the computation is side-effect free and execution is triggered by the presence of data instead of the explicit flow of control. These constraints simplify the task of constructing and executing parallel programs as they guarantee the absence of both deadlocks and race conditions.

The Teraflux [16] project is investigating highly extensible multi-core systems including both hardware architectures and software systems built around the dataflow approach. One part of this project is the construction of a high level dataflow implementation which serves two purposes:

1) To provide a high productivity language in which to construct dataflow programs.

2) To provide a high level platform for experimenting with new ideas such as using the type system to enforce different properties of the dataflow graph and different memory models [13].

DFScala implements the base functionality of this language. In doing so this platform provides a foundation on which a range of other work within the Teraflux project is built.

In the remainder of this paper we first describe the dataflow programming model in more detail; we then introduce Scala before describing why we feel that a new dataflow library in Scala can improve on the existing libraries available in other languages. Next we describe the implementation and API of DFScala before evaluating its scalability and performance, and finally making our concluding remarks.

II. DATAFLOW PROGRAMS

In a dataflow program the computation is split into sections. Depending on the granularity of the program these vary from a single instruction to whole functions which can include calls to other functions, allowing arbitrarily large computation units. All of these sections are deterministic based on their input and side-effect free. To enforce this, once a piece of data has been constructed it remains immutable for the lifetime of the program. The execution of the program is then orchestrated through the construction of a directed acyclic graph where the nodes are the sections of computation and the arcs are the data dependencies between them. An example of this can be seen in Figure 1. Once all the inputs of a node in the graph have been computed the node can be scheduled for execution. As such the execution of the program is controlled by the flow of data through the graph unlike imperative programs where control is explicitly passed to threads.

Dataflow programming has been shown to be very effective at exposing parallelism [6] as the side-effect free nature means segments of code forming nodes whose inputs have been generated can be executed in parallel regardless of the actions of other segments in the graph and without effect on the deterministic result. Dataflow programs are by definition deadlock and race condition free. These properties make it easier to construct correct parallel programs.
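The firing rule described above can be sketched in a few lines of Scala. This is our own illustration, not the DFScala implementation; `Node2`, `arg1` and `arg2` are invented names: a node runs its function exactly once, when the last of its inputs arrives.

```scala
import java.util.concurrent.atomic.AtomicInteger

// A two-input dataflow node: it fires its function exactly once,
// when the second of its two inputs arrives (in either order).
class Node2(f: (Int, Int) => Unit) {
  private val pending = new AtomicInteger(2)
  private var a, b = 0
  def arg1: Int = a
  def arg1_=(v: Int): Unit = { a = v; arrived() }
  def arg2: Int = b
  def arg2_=(v: Int): Unit = { b = v; arrived() }
  private def arrived(): Unit =
    if (pending.decrementAndGet() == 0) f(a, b) // all inputs present: fire
}

var result = 0
val add = new Node2((x, y) => result = x + y)
add.arg1 = 40 // one input present: the node waits
add.arg2 = 2  // last input arrives: the node fires
// result == 42
```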

Figure 1. An instance of a dataflow graph for a circuit routing algorithm.

In addition to having properties that reduce the complexity of constructing parallel programs, dataflow programs also reduce the required complexity of the supporting hardware. The explicit description within dataflow programs of when results must be passed allows for a relaxation of the cache coherency requirements. Instead of maintaining coherence at the instruction level it is now only necessary to ensure that results from one segment are written back to main memory before the segments that depend on these results begin. While it is argued that cache coherency does not limit the core count on processors [10] it is observable that well blocked programs that do not require cache coherency outperform those that do, and processors such as GPGPUs which are closer to the dataflow model achieve better performance per watt and higher peak performance.

III. THE CHOICE OF SCALA

Scala [12] is a general purpose programming language designed to smoothly integrate features of object-oriented [18] and functional languages [2]. By design it supports seamless integration with Java, including existing compiled Java code and libraries. The compiler produces Java byte-code [9], meaning that Scala can be called from Java and vice versa. Scala is a pure object-oriented language in the sense that every value is an object. Types and behaviour of objects are described by classes and traits, and classes are extended by sub-classing and a flexible mixin-based composition mechanism as a replacement for multiple inheritance. However, Scala is also a functional language in the sense that every function is a value. This power is furthered through the provision of a lightweight syntax for defining anonymous functions as shown in Figure 2, support for higher-order functions, also seen in Figure 2, the nesting of functions,

apply(life _, (x:Int) => x*x)

...

def life(): Int = { 42 }

...

def apply(f1: () => Int, f2: Int => Int) = {
  println(f2(f1()))
}

Figure 2. A snippet of Scala code demonstrating the construction of conventional and anonymous functions and their use as arguments in other function calls. The first line calls apply with the function life and an anonymous function as inputs. life returns the value 42 and the anonymous function takes an integer as an input and returns this value squared. apply takes these two functions as arguments and applies the second function to the output of the first. It then prints the resulting value, in this case 1764.

and for currying. Scala's case classes and its built-in support for pattern matching and algebraic types are equivalent to those used in many functional programming languages.

Scala is statically typed and equipped with a type system that enforces that abstractions are used in a safe and coherent manner. A local type inference mechanism means that the user is not required to annotate the program with redundant type information.

Finally, Scala provides a combination of language mechanisms that make it easy to smoothly add new language constructs in the form of libraries. Specifically, any method may be used as an infix or postfix operator, and closures are constructed automatically depending on the expected type (target typing). A joint use of both features facilitates the definition of new statements without extending the syntax and without using macro-like meta-programming facilities.
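As a hypothetical illustration of this point (not taken from DFScala), target typing of a by-name parameter is enough to define a `repeat` "statement" as an ordinary library method:

```scala
// A new control construct defined as a plain method: the body is a
// by-name parameter, so the closure is constructed automatically.
def repeat(n: Int)(body: => Unit): Unit = {
  var i = 0
  while (i < n) { body; i += 1 }
}

var sum = 0
repeat(3) { sum += 2 } // reads like built-in syntax
// sum == 6
```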

A. Why Create a New Dataflow Library

Having introduced Scala we will now describe how a Scala based dataflow framework extends the functionality provided by existing dataflow frameworks.

Current dataflow frameworks overwhelmingly fall into two categories: those implemented in pure functional programming languages [3], [4] and those implementations based on low level languages such as C [1], [15]. In addition to these there are a small number implemented in managed environments such as Java [8]. Scala's combination of both functional and imperative elements in a high level language allows it to add to these by introducing a dataflow programming library that maintains the accessibility to programmers of high level imperative languages by allowing imperative code in the dataflow nodes while adding the strengths and functionality of pure functional programming.

In doing so it creates a bridge between these two styles of dataflow language.

Purely Functional Implementations: While functional programming is extremely powerful, the lack of mutable state prevents the construction of many intuitive programming constructs such as loops. This in turn makes these implementations challenging for many imperative programmers to use, limiting their uptake. While the functional semantics between nodes in a dataflow graph are the key to their power, there is no need to restrict the programmer to using purely functional semantics within the single threaded execution of the dataflow nodes. Scala provides the semantics and a clean syntax with which to add this freedom.

Other attempts have been made to address this by either adding specific imperative constructs to functional languages such as OCaml [14], which maintains a functional style of syntax and semantics while allowing a set of imperative constructs, or alternatively by the addition of functional constructs into languages such as C# [7] for specific tasks. However, to date we are unaware of a language familiar to imperative programmers that supports both dataflow computations and functional programming in such a way that these programs can be both clean and high level.

Pure Imperative Languages: Imperative libraries often allow for coarse grained dataflow nodes where the code within the node is imperative. However, the libraries that do this can be split into those supporting high level languages and those supporting low level languages.

Libraries based on low level languages have weaker type systems and do not support features such as garbage collection. For example StarSS [1] and TFlux [15] are pragma based systems, where sequential C code is annotated with task information including inputs and outputs or specific tasks and their dependences respectively.

High level languages are able to maintain a strong type system, and a managed environment with features such as garbage collection; however, the code can become restrictive or verbose as functions are explicitly wrapped to be passed to the library. For example the absence of functional semantics in Out of Order Java [8] requires the user to manually mark up tasks in the code which can then be examined by the compiler and runtime to determine if they can be executed in parallel. Scala's inbuilt ability to pass functions as first class variables removes this complexity, allowing for cleaner code while maintaining the properties of a high level language. This is furthered by Scala's type inference system which prevents users from having to appreciate the subtleties of the types used by the underlying library.

IV. DFSCALA LIBRARY

The DFScala library provides the functionality to construct and execute dataflow graphs in Scala. The nodes in the graph are dynamically constructed over the course of a program and each node executes a function which is passed as an argument. The arcs between nodes are all statically typed. An example of a function using the library can be seen in Figure 3.

The library is constructed from two principal components: a Dataflow Manager that controls the execution of the graph and a hierarchy of thread classes which represent the nodes in the graph. For any given dataflow graph there will be a single manager object; however, as we discuss in more detail later, within a given program there may be many independent dataflow graphs calculating different results for the overall program. These can exist either because dataflow functions have been inserted into legacy code, or because dataflow graphs are being nested within other dataflow graphs to calculate sub-results. We will now look at these components, starting with the Dataflow Manager.

A. Dataflow Manager

The Dataflow Manager performs two functions:

1) Scheduling dataflow threads to execute when all their inputs become available.

2) Providing a factory which takes a function and constructs the appropriate node for the graph (DFThread).

To achieve this, the Dataflow Manager is constructed from two parts: a static singleton object that contains the factory methods to construct new threads and a dynamic object that manages the scheduling of threads for a given dataflow graph. This separation allows the factory methods to be accessed directly from anywhere without having to explicitly pass a manager around, while at the same time there can be many managers each managing a separate dataflow graph.
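A minimal sketch of this singleton/instance split follows; the names are illustrative, not the real DFScala API. Each Manager instance schedules the threads of one graph, while the companion object plays the role of the static singleton, so factory calls need no explicit manager reference at the call site.

```scala
// Each Manager instance schedules the "threads" of one graph.
class Manager {
  private var ready: List[() => Unit] = Nil
  def schedule(body: () => Unit): Unit = ready = body :: ready
  def runAll(): Unit = {
    val r = ready.reverse
    ready = Nil
    r.foreach(f => f())
  }
}

object Manager {
  // factory method: builds a thread and registers it with a graph's manager
  def createThread(graph: Manager)(body: () => Unit): Unit =
    graph.schedule(body)
}

val graph = new Manager
var order = List.empty[Int]
Manager.createThread(graph)(() => order = order :+ 1)
Manager.createThread(graph)(() => order = order :+ 2)
graph.runAll()
// order == List(1, 2)
```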

B. Dataflow Threads

Threads are implemented through a hierarchy of classes. This hierarchy allows threads to be passed between functions without forcing the receiving functions to be overly specific about the type of thread they are expecting, instead merely specifying the required properties of the thread within the function.

Passed arguments are called tokens. The threads support static type checking of the tokens they pass between each other. This checking allows the compiler to guarantee the type correctness of a dataflow graph. To improve programmability threads will not release any threads or tokens they create or pass until the thread completes its execution. This restriction simplifies the task of keeping track of the order in which events can occur when debugging.

All dataflow threads are constructed from a base class DFThread which contains all the logic required to interact with the Dataflow Manager, and manage the storing of tokens and newly constructed threads until the thread completes. This class is then extended by a chain of abstract classes, each of which represents a thread that takes at least n arguments for increasing values of n. Finally each of these

Expanded Version

def fib(n: Int, out: Token[Int]) {
  if (n <= 2)
    out(1)
  else {
    var t1 = DFManager.createThread(
      (x: Int, y: Int, out: Token[Int]) => { out(x + y) }
    )
    var t2 = DFManager.createThread(fib _)
    var t3 = DFManager.createThread(fib _)

    t2.arg1 = n - 1
    t2.arg2 = t1.token1

    t3.arg1 = n - 2
    t3.arg2 = t1.token2

    t1.arg3 = out
  }
}

Concise Version

def fib(n: Int): Int = {
  if (n <= 2)
    1
  else
    fib(n - 1) + fib(n - 2)
}

Figure 3. An example of a dataflow program to compute Fibonacci numbers. This demonstrates the API, but the functional nature of the language means this function could be automatically constructed from the more concise function appearing below.

classes is extended with a concrete class that contains the functionality to execute a single method.

This separation of methods and the collection of arguments not only saves re-implementing code for each of the concrete classes, but also allows the casting down of DFThreads to threads that expect fewer arguments, and the construction of special classes of thread. Such special threads include those used to collect a number of elements and either return them as a collection or reduce them to a single value using an appropriate function. This is achieved by allowing the thread to have multiple tokens passed to one argument. The number of times tokens can be passed is specified at runtime when the thread is created. This class of threads is required because, while the number of threads spawned can be defined at runtime, for example by iterating round a loop and spawning a thread on each iteration, the number of tokens received by a regular thread is set by the thread type at compile time.
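Such a collector can be sketched as follows (our own illustration; the real DFScala class differs): the number of expected tokens is fixed at runtime when the collector is created, and the reduction fires when the last token arrives.

```scala
import scala.collection.mutable.ArrayBuffer

// Collects `expected` tokens into one argument, then reduces them
// to a single value and passes it on.
class Collector(expected: Int, reduce: Seq[Int] => Int, out: Int => Unit) {
  private val values = ArrayBuffer.empty[Int]
  def send(v: Int): Unit = synchronized {
    values += v
    if (values.size == expected) out(reduce(values.toSeq)) // last token: fire
  }
}

var total = 0
val sum4 = new Collector(4, _.sum, v => total = v)
for (i <- 1 to 4) sum4.send(i) // e.g. one token per loop iteration
// total == 10
```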

def join(arg1: Int, arg2: Int,
         out: DFThread4[Int, Int, _, Pos, _],
         position: Int) {
  if (position == 1)
    out.arg1 = arg1 + arg2
  else
    out.arg2 = arg1 + arg2
}

Figure 4. An example of a function for reducing multiple values through a binary tree. Note the additional complexity determining which value to set and the fixed type of the thread argument. Compare this with the reduction using tokens in Figure 5. Figure 3 shows a function using tokens to pass data between threads when reducing a binary tree.

def join(arg1: Int, arg2: Int, out: Token[Int]) {
  out(arg1 + arg2)
}

Figure 5. Using the indirection provided by tokens to simplify the code in Figure 4.

C. Passing Tokens

Tokens can either be pushed by the producing thread, or they can be pulled by the consuming thread.

Pushing Tokens: If a thread is available in the code that generates its token, the token can be directly assigned, for example: t.arg2 = 10. This invokes an underlying setter method which handles the control logic. The type of the thread specifies the number and types of arguments that can be passed. As threads with higher numbers of arguments extend threads with lower numbers of arguments, any thread which has n or more arguments of the appropriate type can be used.

Setting thread arguments directly can be convenient; however, as the signature of the thread is fixed, this can also be restrictive. For example, to reduce a binary tree to a single value a thread can be constructed for each node; these threads take two arguments, a thread and a parameter to indicate if the result is output to the left or right argument of the next thread. See Figure 4. This approach has two problems:

1) First the parameter indicating where to assign the argument adds complexity;

2) Second the signature of the thread that the final result is to be passed to is now required to match that of the intermediate threads. This will clearly not normally be the case.

Token objects address this by acting as a proxy for the argument setter method, so allowing the consumer and producer to be decoupled. Tokens can be retrieved from threads for each argument. As the token is a proxy for the setter method it handles all the required control logic to manage the release of the input once the thread has finished executing. An example of this technique also being used to merge two values in a binary tree can be seen in the Fibonacci program in Figure 3, and the resultant sequence of events can be seen in Figure 6. The code that would be used to replace that presented in Figure 4 can be seen in Figure 5.

Figure 6. A diagram of the events when a token is used in Figure 3. 1. The thread calculating fib(n) requests the token for the first argument of the function that will perform the addition. 2. This token is passed to the thread calculating fib(n-1). 3. This thread sets the token to the correct value once it is known, in doing so passing this value to the thread performing the addition.
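The proxy idea can be sketched as follows (our own sketch, not the DFScala implementation): a Token simply wraps the consumer's setter, so the producer's code is independent of the consumer's signature.

```scala
// A Token forwards an applied value to whichever setter it wraps.
class Token[T](setter: T => Unit) {
  def apply(v: T): Unit = setter(v)
}

// the simplified join of Figure 5, written against this sketch
def join(arg1: Int, arg2: Int, out: Token[Int]): Unit =
  out(arg1 + arg2)

var received = 0
join(20, 22, new Token[Int](v => received = v))
// received == 42
```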

Pulling Tokens: Pushing values into threads directly from within the code allows for very descriptive code, but it has two key weaknesses:

1) The executing thread has to be aware of how many other threads need to receive a given value as a token.

2) Legacy functions cannot be used as they do not include the required code to push tokens to other threads.

To address this all DFThreads have the ability to take Tokens as listeners. Then, once the thread has completed, these Tokens will be assigned the value returned by the thread's function. This allows the thread's result to be distributed to an arbitrary number of threads, and allows legacy functions to be included within a dataflow graph.
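This listener mechanism can be sketched with invented names (ours, not the DFScala API): registered listeners receive the function's return value on completion, so a legacy function that knows nothing about tokens can feed any number of consumers.

```scala
import scala.collection.mutable.ListBuffer

// Runs a function and distributes its result to all registered listeners.
class ResultThread[T](f: () => T) {
  private val listeners = ListBuffer.empty[T => Unit]
  def addListener(l: T => Unit): Unit = listeners += l
  def run(): Unit = { val v = f(); listeners.foreach(l => l(v)) }
}

def legacyAnswer(): Int = 6 * 7 // a legacy function, unaware of tokens
val t = new ResultThread(() => legacyAnswer())
var a, b = 0
t.addListener(v => a = v)
t.addListener(v => b = v)
t.run()
// a == 42 && b == 42
```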

D. Constructing a Dataflow Graph

Two techniques are provided to start the initial thread of a dataflow graph depending on the situation in which the thread is to be started.

1) Dataflow Applications: The simplest way to start a dataflow graph in DFScala is, instead of constructing a main method to start the program, to extend the class DFApp implementing the method dfMain. When this extended class has its main method called to start the program, it will perform all the required initialisation and start a thread which will call the function dfMain with the arguments provided to main.

2) Nested Dataflow Graphs: Extending a class and calling its main method is not always an appropriate way to start a dataflow graph, either because another library such as QT [11] that uses this technique is being used, or because it is desirable to insert independent dataflow graphs into either

def foo(argument: Int): Int = {
  val subgraph = new NestedGraph[Int]
  subgraph.createThread(bar _, argument, subgraph.token1)
}

Figure 7. An example of using a NestedGraph object to nest a separate dataflow graph in a function. The nested graph is spawned by executing the function passed to createThread (in this case bar) in a new dataflow thread. createThread also takes the arguments for bar as its remaining arguments.

a legacy program, or into another dataflow graph. We will now look at this final scenario in more detail.

As this is a coarse grain dataflow model, dataflow threads may call other functions. These functions may contain code that it is advantageous to run in parallel. However, to ensure the absence of deadlocks in the dataflow graph, once a thread has started it cannot receive input from any other thread within the graph. This prevents the creation of additional worker threads within called functions. This could be addressed by splitting the thread into two threads at the function call. This would then allow the first thread to create the worker threads and then for the worker threads to pass their results on to the second thread. However, this has several disadvantages:

• It requires the tokenizing of any thread local data that spans the split threads.

• The separation either has to be done at runtime which could be expensive, or the compiled code has to know if the function will be executed in parallel, which prevents the decision from being made at runtime based on data set size etc.

To overcome this a dataflow thread can create a NestedGraph object representing an independent dataflow graph. This will run to completion and return a result. The independent execution allows the function to continue to execute as pure dataflow while spawning helper threads.

The NestedGraph provides two elements: an object of type Token that can be passed into the dataflow graph to return a result; and a function that takes a function and a set of arguments for the function. The passed function is then executed as the root of a separate dataflow graph. This separate execution allows it to spawn additional threads to handle the computation. An example of this can be seen in Figure 7.

To implement the nested dataflow graphs, the NestedGraph object creates a new manager object to manage the graph. It then runs the passed function using a new thread obtained from this manager. This ensures that all threads created by this function will also be run using this new manager, so ensuring their independence of the existing dataflow graph. A diagram detailing this chain of events can be seen in Figure 8.

Figure 8. The chain of events involved in the construction and execution of an independent dataflow graph within an existing program. 1. A NestedGraph object is created by the function foo (subgraph). 2. A result token is retrieved from this object. 3. The root thread of the independent graph is constructed by foo providing a function (bar) and the required arguments. This call blocks while the graph is executed. 4. subgraph constructs all the required framework and executes bar and any resulting child threads. 5. The dataflow graph sets the value of the passed token, in doing so this value is passed to the object subgraph. 6. The createThread method returns the resultant value passed from the dataflow graph.

E. Debugging and Performance Analysis

It is important to be able to extract from an execution the dataflow graph that was executed. This information is needed for the detection of errors, performance measuring and optimisation. In terms of error detection, there is potential for both deadlocks and livelocks resulting from the failure to pass tokens. This can result from an error that occurred in a thread that has long since finished executing, and being able to trace where a thread was expecting an argument from is important. To address this the library makes calls out to a DFLogging object which, depending on the way the application was started, will record the information passed to it in such a way that debugging tools can read this information while the program is executing, or it can be saved onto disk or into a database for later analysis.

V. EVALUATION

To evaluate the scalability of DFScala we implemented four benchmarks and ran them on both Intel and AMD based systems. The benchmarks we used were: KMeans, 0-1 Knapsack, Monte-Carlo Tree Search, and Matrix Matrix Multiplication.

KMeans is an algorithm that groups points into clusters through a process of iterative refinement. In this instance there are 16 thousand 24 dimensional points which form 16 clusters.

0-1 Knapsack is an optimisation problem requiring items to be picked from a set such that their weight does not exceed a given amount while maximising the overall value of the items. This benchmark solves the problem through the use of dynamic programming. In this instance there are 1000 items and a maximum weight of 10,000.
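For reference, the recurrence at the heart of this benchmark looks like the following sequential sketch (ours, not the benchmark code):

```scala
// Classic 0-1 knapsack dynamic programme: best(w) is the maximum value
// achievable with capacity w using the items considered so far.
def knapsack(weights: Array[Int], values: Array[Int], capacity: Int): Int = {
  val best = Array.fill(capacity + 1)(0)
  for (i <- weights.indices; w <- capacity to weights(i) by -1)
    best(w) = math.max(best(w), best(w - weights(i)) + values(i))
  best(capacity)
}

val answer = knapsack(Array(2, 3, 4), Array(3, 4, 5), capacity = 5)
// answer == 7: take the items of weight 2 and 3
```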

Figure 9. Benchmark performance on a 4 core Intel Xeon processor with Hyper Threading.

Monte-Carlo Tree Search is a randomised technique used to search a state space for a best guess at an optimum solution. It is used in situations where searching exhaustively is too costly. In this instance it is used by a computer player to look for moves in a game of Go [5].

Matrix Matrix Multiplication For this benchmark we performed Matrix Matrix multiplication on matrices of size varying from 256x256 to 2048x2048.
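Each row of the result depends only on one row of the first matrix and the whole of the second, which is what makes this benchmark a natural fit for coarse grained dataflow nodes. A sequential reference version (our own sketch, not the benchmark code):

```scala
// Naive matrix product: result(i)(j) = sum over k of a(i)(k) * b(k)(j).
def multiply(a: Array[Array[Double]],
             b: Array[Array[Double]]): Array[Array[Double]] = {
  val inner = b.length
  Array.tabulate(a.length, b(0).length) { (i, j) =>
    (0 until inner).map(k => a(i)(k) * b(k)(j)).sum
  }
}

val m = multiply(
  Array(Array(1.0, 2.0), Array(3.0, 4.0)),
  Array(Array(0.0, 1.0), Array(1.0, 0.0)))
// first row has elements (2.0, 1.0); second row has (4.0, 3.0)
```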

A. Results

For each of these we used a constant problem size and varied the number of system threads. For KMeans we recorded the time for each iteration; for the other benchmarks we iterated the algorithm numerous times. We then took the median time and used this to determine the speedup relative to the single threaded solution. The results of this analysis can be seen in Figures 9 and 10. These show that for both test platforms DFScala shows very good scaling properties for all four benchmarks. The inflection point on the Xeon processor is a well documented artifact of Hyper Threading, not a property of the library.

B. Performance

To demonstrate that this scalability was not gained at the expense of performance when compared with existing solutions, we also implemented a version of Scala Parallel Collections based on DFScala. Both this version and the current Scala Parallel Collections shipped with the Scala runtime had their performance compared relative to the standard single threaded Scala Collections libraries. This means that the speed improvements shown in the graphs in Figures 11 and 12 are the improvements that would be seen by replacing the currently shipping single threaded libraries with these implementations, not just the performance gain from adding more threads to our own implementation. The results of these tests, for mapping a function across all the elements in a collection on a 48 core AMD based system, can be seen in Figures 11 and 12. These show that the scalability of the library based on DFScala is maintained and the ultimate performance exceeds that of the current implementations of Parallel Collections as the size of the datasets increases. We attribute the better performance to the dataflow framework resulting in better separation of the data used by the different threads, reducing the need for inter-thread communication. Unfortunately, due to limitations of the existing parallel collections library it was not possible to compare the performance of the two with differing numbers of threads.

Figure 10. Benchmark performance on a machine with two 6 core AMD Opteron processors.

Figure 11. Graph showing the speedup provided by our Dataflow collections with differing numbers of elements and threads. All speedups are normalised against the performance of the standard Scala single threaded collections.
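The operation being measured — mapping a function across every element of a collection — can be sketched as a simple timing harness; this harness is illustrative, not the one used to produce the figures:

```scala
object MapBenchSketch {
  // Time applying f to every element of a collection sequentially.
  // The paper compares this single-threaded baseline against both
  // Parallel Collections and the DFScala-based implementation.
  def timeMap[A, B](xs: Vector[A])(f: A => B): (Vector[B], Long) = {
    val t0 = System.nanoTime()
    val out = xs.map(f)
    (out, System.nanoTime() - t0)
  }
}
```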

VI. CONCLUSION

In this paper we have presented a library for constructing and executing coarse grained dataflow tasks in Scala. In doing so we have built on Scala's merging of Object Oriented programming and Functional programming to provide a means of constructing dataflow graphs that cleanly takes advantage of functional semantics while allowing the user to use more familiar imperative programming within the nodes. We believe that this will allow the construction of cleaner dataflow codes that are more accessible to the majority of programmers.

Figure 12. Graph showing the performance of Scala Parallel collections and our Dataflow collections library on a 48 core AMD based machine for differing sizes of collection. Again all speedups are normalised against the performance of the standard Scala single threaded collections.

This library forms the basis of a wide range of work within the Teraflux project and has been used to implement a sizable number of applications of varying sizes, from small benchmarks to entire parts of the Scala runtime. A subset of these applications is described here, along with measurements of their scalability and performance on a range of systems. These results show that DFScala provides effective scaling and good performance. There is always room for improvement, and targets for such improvement include the addition of a more advanced scheduler and reducing the overhead of creating DFThread objects. However, as described in the introduction, the Teraflux project aims to develop a complete system including processors, compilers, tools and languages for dataflow computing, and as a result many of the overheads in this first software implementation will ultimately be able to be offloaded onto dedicated hardware. Applications written using this library can then be used to assess the effectiveness of any such hardware. In addition, this library is also being used as a platform to experiment with different memory models, the automatic insertion of dataflow constructs into programs, and the enforcement of specific properties of dataflow graphs.

ACKNOWLEDGMENT

The authors would like to thank the European Community's Seventh Framework Programme (FP7/2007-2013) for funding this work under grant agreement no 249013 (Teraflux project). Dr. Lujan is supported by a Royal Society University Research Fellowship.


REFERENCES

[1] AYGUADE, E., BADIA, R. M., IGUAL, F. D., LABARTA, J., MAYO, R., AND QUINTANA-ORTI, E. S. An Extension of the StarSs Programming Model for Platforms with Multiple GPUs. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing (Berlin, Heidelberg, 2009), Euro-Par '09, Springer-Verlag, pp. 851–862.

[2] BIRD, R. Introduction to Functional Programming using Haskell, second ed. Prentice Hall, 1998.

[3] CANN, D. Retire Fortran?: a debate rekindled. Commun. ACM 35 (August 1992), 81–89.

[4] C.A.R. HOARE, Ed. OCCAM Programming Manual. Prentice Hall International Series in Computer Science. Prentice Hall International, 1984.

[5] CHASLOT, G., WINANDS, M. H. M., AND VAN DEN HERIK, H. J. Parallel Monte-Carlo tree search. In Computers and Games (2008), pp. 60–71.

[6] GURD, J. R., KIRKHAM, C. C., AND WATSON, I. The Manchester prototype dataflow computer. Commun. ACM 28 (January 1985), 34–52.

[7] ECMA INTERNATIONAL. Standard ECMA-334 - C# Language Specification, 4th ed. June 2006.

[8] JENISTA, J. C., EOM, Y. H., AND DEMSKY, B. C. OoOJava: software out-of-order execution. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2011), PPoPP '11, ACM, pp. 57–68.

[9] LINDHOLM, T., AND YELLIN, F. Java Virtual Machine Specification, 2nd ed. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[10] MARTIN, M. M. K., HILL, M. D., AND SORIN, D. J. Why on-chip cache coherence is here to stay. Communications of the ACM (2012).

[11] NOKIA. QT, http://qt.nokia.com/, 2012.

[12] ODERSKY, M., SPOON, L., AND VENNERS, B. Programming in Scala: [a comprehensive step-by-step guide], 1st ed. Artima Incorporation, USA, 2008.

[13] SEATON, C., GOODMAN, D., LUJAN, M., AND WATSON, I. Applying dataflow and transactions to Lee routing. In Workshop on Programmability Issues for Heterogeneous Multicores (2012).

[14] SMITH, J. B. Practical OCaml (Practical). Apress, Berkeley, CA, USA, 2006.

[15] STAVROU, K., NIKOLAIDES, M., PAVLOU, D., ARANDI, S., EVRIPIDOU, P., AND TRANCOSO, P. TFlux: A portable platform for data-driven multithreading on commodity multicore systems. In Proceedings of the 2008 37th International Conference on Parallel Processing (Washington, DC, USA, 2008), ICPP '08, IEEE Computer Society, pp. 25–34.

[16] TERAFLUX. The TERAFLUX project, http://www.teraflux.org, 2010.

[17] WATSON, I., WOODS, V., WATSON, P., BANACH, R., GREENBERG, M., AND SARGEANT, J. Flagship: A parallel architecture for declarative programming. In ISCA (1988), pp. 124–130.

[18] WU, C. T. An Introduction to Object-Oriented Programming with Java, 2nd ed. McGraw-Hill, Inc., New York, NY, USA, 2000.