
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

DARMA

Janine C. Bennett, Jonathan Lifflander, David S. Hollman, Jeremiah Wilke, Hemanth Kolla, Aram Markosyan, Nicole Slattengren, Robert L. Clay (PM)

PSAAP-WEST
February 22, 2017

Unclassified SAND2017-1430C

What is DARMA?

DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes.

It provides a set of abstractions to facilitate the expression of tasking that map to a variety of underlying AMT runtime system technologies.

Sandia’s ATDM program is using DARMA to inform its technical roadmap for next generation codes.

2015 study to assess leading AMT runtimes led to DARMA

Aim: inform Sandia’s technical roadmap for next generation codes

§ Broad survey of many AMT runtime systems
§ Deep dive on Charm++, Legion, Uintah
§ Programmability: Does this runtime enable efficient expression of ATDM workloads?
§ Performance: How performant is this runtime for our workloads on current platforms and how well suited is this runtime to address future architecture challenges?
§ Mutability: What is the ease of adopting this runtime and modifying it to suit our code needs?

§ Conclusions
  § AMT systems show great promise
  § Gaps in requirements for Sandia applications
  § No common user-level APIs
  § Need for best practices and standards
§ Survey recommendations led to DARMA
  § C++ abstraction layer for AMT runtimes
  § Requirements driven by Sandia ATDM applications
  § A single user-level API
  § Support multiple AMT runtimes to begin identification of best practices

Mapping to a variety of AMT runtime system technologies

DARMA provides a unified API to application developers for expressing tasks

[Figure: layered stack of Application, DARMA, Runtime, and OS/Hardware. The DARMA layer comprises a Front End API (application user), a Translation Layer, a Back End API (specification for runtime), and Glue Code (specific to each runtime). Two callouts read "Common API across runtimes".]

Application code is translated into a series of back end API calls to an AMT runtime

Not all runtimes provide the same functionality

Challenge: design a back end API that maps to a variety of runtimes
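The layering on these slides (a front end API for the application, a back end specification, and runtime-specific glue code) can be illustrated with a small single-process sketch. Everything here is hypothetical: `Backend`, `EagerBackend`, `DeferredBackend`, `register_task`, `finalize`, and this `create_work` signature are made-up names for illustration and are not DARMA's actual back end API. The sketch only shows that application code written once against the front end gives the same result on two different "runtimes".

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical back end specification: what every runtime must supply.
// Illustrative only; this is not DARMA's real back end API.
struct Backend {
  virtual ~Backend() = default;
  virtual void register_task(std::function<void()> task) = 0;
  virtual void finalize() = 0;  // returns once all registered tasks have run
};

// Glue code for a "runtime" that runs each task as soon as it is registered.
struct EagerBackend : Backend {
  void register_task(std::function<void()> task) override { task(); }
  void finalize() override {}
};

// Glue code for a "runtime" that queues tasks and runs them at finalize().
struct DeferredBackend : Backend {
  std::vector<std::function<void()>> queue;
  void register_task(std::function<void()> task) override {
    queue.push_back(std::move(task));
  }
  void finalize() override {
    for (auto& task : queue) task();
    queue.clear();
  }
};

// Front end: the application calls this, and the call is translated into
// back end calls regardless of which runtime sits underneath.
template <typename Callable>
void create_work(Backend& backend, Callable&& work) {
  backend.register_task(std::function<void()>(std::forward<Callable>(work)));
}

// Application code: sums 1..n using one task per term.
int sum_with(Backend& backend, int n) {
  int total = 0;
  for (int i = 1; i <= n; ++i) {
    create_work(backend, [&total, i] { total += i; });
  }
  backend.finalize();
  return total;
}
```

Calling `sum_with` with an `EagerBackend` or a `DeferredBackend` returns the same sum; only the moment at which each task runs differs, which is the kind of freedom a back end specification leaves to the runtime.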

Considerations when developing a back end API that maps to a variety of runtimes

§ AMT runtimes often operate with a directed acyclic graph (DAG)
  § Captures relationships between application data and inter-dependent tasks
§ DAGs can be annotated to capture additional information
  § Tasks’ read/write usage of data
  § Task needs a subset of data
§ Additional information enables runtime to reason more completely about
  § When and where to execute a task
  § Whether to load balance
§ Existing runtimes leverage DAGs with varying degrees of annotation

[Figure: example DAG with data nodes d1 to d7 and task nodes t1 to t5; annotation labels: subset, reads.]

DARMA captures data-task dependency information and the runtime builds and executes the DAG

Captures data-task dependency information

Runtime controls construction and execution of the DAG
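As a rough illustration of what "executing the DAG" involves, the sketch below orders tasks so that each one runs only after all of its predecessors. It uses Kahn's topological sort, a standard textbook algorithm chosen here for illustration; the slides do not say how any particular back end schedules tasks. The function name `execution_order` and the adjacency-list encoding are assumptions, and every successor index is assumed to be a valid task index.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// successors[t] lists the tasks that must wait for task t.
// Returns one valid execution order. The result is shorter than the number
// of tasks if the graph contains a cycle (that is, if it is not a DAG).
std::vector<std::size_t> execution_order(
    const std::vector<std::vector<std::size_t>>& successors) {
  const std::size_t n = successors.size();
  std::vector<std::size_t> unmet(n, 0);  // unmet dependency count per task
  for (const auto& out : successors)
    for (std::size_t s : out) ++unmet[s];

  std::queue<std::size_t> ready;
  for (std::size_t t = 0; t < n; ++t)
    if (unmet[t] == 0) ready.push(t);

  std::vector<std::size_t> order;
  while (!ready.empty()) {
    std::size_t t = ready.front();
    ready.pop();
    order.push_back(t);  // a real runtime would launch the task here
    for (std::size_t s : successors[t])
      if (--unmet[s] == 0) ready.push(s);
  }
  return order;
}
```

Tasks that sit in the `ready` queue at the same time have no dependency between them, so a runtime is free to run them in parallel or place them on different resources.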

Abstractions that facilitate the expression of tasking

DARMA front end abstractions for data and tasks are co-designed with Sandia ATDM application scientists

Provide abstractions to simplify capturing of data-task dependencies

DARMA Data Model

How are data collections/data structures described?
§ Asynchronous smart pointers wrap application data
  § Track meta-data used to build and annotate the DAG
    – Currently permissions information
    – Subsetting information under development
  § Enable extraction of parallelism in a data-race-free manner

How are data partitioning and distribution expressed?
§ There is an explicit, hierarchical, logical decomposition of data
  § AccessHandle<T>
    – Does not span multiple memory spaces
    – Must be serialized to be transferred between memory spaces
  § AccessHandleCollection<T, R>
    – Expresses a collection of data
    – Can be mapped across memory spaces in a scalable manner
§ Distribution of data is up to individual back end runtime

DARMA Control Model

How is parallelism achieved?
§ create_work
  § A task that doesn’t span multiple execution spaces
  § Sequential semantics: the order and manner (e.g., read, write) in which data (AccessHandle) is used determines what tasks may be run in parallel
§ create_concurrent_work
  § Scalable abstraction to launch across distributed systems
  § A collection of tasks that must make simultaneous forward progress
  § Sequential semantics supported across different task collections based on order and manner of AccessHandleCollection usage

How is synchronization expressed?
§ DARMA does not provide explicit temporal synchronization abstractions
§ DARMA does provide data coordination abstractions
  § publish/fetch semantics between participants in a task collection
  § Asynchronous collectives between participants in a task collection

Using DARMA to inform Sandia’s technical roadmap

Currently there are three back ends in various stages of development

§ Charm++: fully distributed
§ Threads: development tool
§ HPX: prototype (in progress)

2017 study: Explore programmability and performance of the DARMA approach in the context of ATDM codes

§ Electromagnetic Plasma Particle-in-cell Kernels
§ Multiscale Proxy
§ Multi Level Monte Carlo Uncertainty Quantification Proxy

§ Kernels and proxies
  § Form basis for programmability assessments
  § Will be used to explore performance characteristics of the DARMA-Charm++ back end
§ Simple benchmarks enable studies on
  § Task granularity
  § Overlap of communication and computation
  § Runtime-managed load balancing
§ These early results are being used to identify and address bottlenecks in DARMA-Charm++ back end in preparation for studies with kernels/proxies

DARMA’s programming model enables runtime-managed, measurement-based load balancing

Initial strong scaling results with DARMA-Charm++ back end are promising

2D Newtonian particle simulation that starts highly imbalanced.

[Figure: percent utilization (0% to 100%) versus time (0 s to 165 s), starting from an initial load imbalance, with four load balancing (LB) steps marked. Iteration-time annotations: 1.74 s/iter before LB, 1.66 s/iter after LB, 1.70 s/iter before LB, 1.47 s/iter before LB, 1.54 s/iter after LB, 1.41 s/iter after LB.]

The Charm++ load balancer incrementally runs as particles migrate and the work distribution changes.
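To make "measurement-based" concrete, here is a minimal sketch of one rebalancing step driven by measured per-unit costs. It uses a generic greedy heuristic (heaviest unit first, onto the least-loaded processor) picked purely for illustration; the slides do not state which Charm++ strategy was used, and the names `max_load` and `rebalance` are hypothetical. Both functions assume at least one processor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Largest per-processor total for a given assignment of work units.
// assignment[i] is the processor that owns unit i.
double max_load(const std::vector<double>& measured,
                const std::vector<std::size_t>& assignment,
                std::size_t processors) {
  std::vector<double> load(processors, 0.0);
  for (std::size_t i = 0; i < measured.size(); ++i)
    load[assignment[i]] += measured[i];
  return *std::max_element(load.begin(), load.end());
}

// Greedy rebalance: visit units from heaviest to lightest and give each one
// to the processor that currently has the least work.
std::vector<std::size_t> rebalance(const std::vector<double>& measured,
                                   std::size_t processors) {
  std::vector<std::size_t> by_weight(measured.size());
  for (std::size_t i = 0; i < by_weight.size(); ++i) by_weight[i] = i;
  std::sort(by_weight.begin(), by_weight.end(),
            [&](std::size_t a, std::size_t b) {
              return measured[a] > measured[b];
            });

  std::vector<double> load(processors, 0.0);
  std::vector<std::size_t> assignment(measured.size(), 0);
  for (std::size_t unit : by_weight) {
    const std::size_t target = static_cast<std::size_t>(
        std::min_element(load.begin(), load.end()) - load.begin());
    assignment[unit] = target;
    load[target] += measured[unit];
  }
  return assignment;
}
```

Because the inputs are measured costs rather than anything the application declares, a step like this can be repeated whenever the measurements drift, without changes to the application code.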

Summary: DARMA seeks to accelerate discovery of best practices

§ Application developers
  § Use a unified interface to explore a variety of different runtime system technologies
  § Directly inform DARMA’s user-level API via co-design requirements/feedback
§ System software developers
  § Acquire a synthesized set of requirements via the back end specification
  § Directly inform back end specification via co-design feedback
  § Can experiment with proxy applications written in DARMA
§ Sandia ATDM is using DARMA to inform its technology roadmap in the context of AMT runtime systems

Backup Slides

Asynchronous smart pointers enable extraction of parallelism in a data-race-free manner

darma::AccessHandle<T> enforces sequential semantics: it uses the order in which data is accessed in your program and how it is accessed (read/write/etc.) to automatically extract parallelism

Permission Level
§ None
§ Read
§ Write
§ Reduce

Permission Type
§ Scheduling: A task with scheduling permission can create deferred tasks that can access the data at the specified permission level.
§ Immediate: A task with immediate permission can dereference the AccessHandle<T> and use it according to the permission level.

Tasks are annotated in the code via a lambda or functor interface

Tasks can be recursively nested within each other to generate more subtasks

C++ Lambdas

    darma::create_work([=]{
      /* do some work */
    });

This is the C++11 syntax for writing an anonymous function that captures variables by value.

C++ Functors

    struct MyFun {
      void operator()(...) {
        /* do some work */
      }
    };

    darma::create_work<MyFun>(...)

Functors are for larger blocks of code that may be reused and migrated by the back end to another memory space.

Example: Putting tasks and data together

Example Program

    AccessHandle<int> my_data;

    darma::create_work([=]{
      my_data.set_value(29);
    });

    darma::create_work(reads(my_data), [=]{
      cout << my_data.get_value();
    });

    darma::create_work(reads(my_data), [=]{
      cout << my_data.get_value();
    });

    darma::create_work([=]{
      my_data.set_value(31);
    });

[Figure: DAG (Directed Acyclic Graph) produced by sequential semantics: Modify my_data, then two Read my_data tasks, then Modify my_data.]

The two read tasks are concurrent and can be run in parallel by a DARMA back end!
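The way sequential semantics turns program order into a DAG can be sketched for a single piece of data. The rule assumed here is the usual reader/writer one: a read waits for the most recent modifier, and a modify waits for every read of the previous value (or for the previous modifier if nothing read it). The enum `Use` and the function `dependencies` are hypothetical names, and this is an illustration of the idea rather than DARMA's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Use { Read, Modify };

// Given how each task (in program order) uses one piece of data, return for
// each task the earlier tasks it must wait for.
std::vector<std::vector<std::size_t>> dependencies(
    const std::vector<Use>& uses) {
  std::vector<std::vector<std::size_t>> deps(uses.size());
  bool has_writer = false;
  std::size_t last_writer = 0;
  std::vector<std::size_t> readers_since_write;

  for (std::size_t t = 0; t < uses.size(); ++t) {
    if (uses[t] == Use::Read) {
      // Readers wait only for the latest modifier, never for each other.
      if (has_writer) deps[t].push_back(last_writer);
      readers_since_write.push_back(t);
    } else {
      // A modifier waits for all readers of the previous value, or for the
      // previous modifier if that value was never read.
      if (!readers_since_write.empty()) {
        deps[t] = readers_since_write;
      } else if (has_writer) {
        deps[t].push_back(last_writer);
      }
      readers_since_write.clear();
      last_writer = t;
      has_writer = true;
    }
  }
  return deps;
}
```

For the example program (modify, read, read, modify), both reads depend only on task 0 and the final modify depends on tasks 1 and 2. No edge connects the two reads, which is why they may run concurrently.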

Smart pointer collections can be mapped across memory spaces in a scalable manner

AccessHandleCollection<T, R> is an extension to AccessHandle<T> that expresses a collection of data

    AccessHandleCollection<vector<double>, Range1D> mycol =
      darma::initial_access_collection(index_range = Range1D(10));

Range1D is a potentially user-defined (or domain-specific) index range, a C++ object that describes the extents of the collection along with providing a corresponding index class for accessing an element.

Every element in the collection contains a vector<double>

[Figure: mycol with elements Index 0, Index 1, ..., Index 9]

Each indexed element is an AccessHandle<vector<double>>

Tasks can be grouped into collections that make concurrent forward progress together

    create_concurrent_work<MyFun>(index_range = Range1D(5));

This call to create_concurrent_work launches a set of tasks, the size of which is specified by an index range, Range1D, that is passed as an argument. Each element in the task collection is passed an Index1D within the range, used by the programmer to express communication patterns across elements in the collection.

    struct MyFun {
      void operator()(Index1D i) {
        int me = i.value;
        /* do some work */
      }
    };

Task collections are a scalable abstraction to efficiently launch communicating tasks across large-scale distributed systems

Putting task collections and data collections together

Example Program

    auto mycol = initial_access_collection(index_range = Range1D(10));

    create_concurrent_work<MyFun>(mycol, index_range = Range1D(10));

    create_concurrent_work<MyFun>(mycol, index_range = Range1D(10));

[Figure: generated DAG under sequential semantics: two successive Modify mycol task collections, which scalable graph refinement expands into per-index Modify mycol tasks for Index 0, Index 1, ..., Index 9.]

A mapping must exist between the data index ranges and task index range. In this case, since the three ranges are identical in size and type, a one-to-one identity map is automatically applied.

Tasks in different execution streams can communicate via publish/fetch semantics

Execution Stream A

    AccessHandle<int> my_data = initial_access<int>("my_key");

    darma::create_work([=]{
      my_data.set_value(29);
    });

    my_data.publish(version="a");

    darma::create_work([=]{
      my_data.set_value(31);
    });

Execution Stream B

    AccessHandle<int> other_data = read_access("my_key", version="a");

    darma::create_work([=]{
      cout << other_data.get_value();
    });

    other_data = nullptr;

Potential DAG 1

If the read_access is on another node, the data might be sent across the network.

[Figure: DAG with nodes Modify my_data, Modify my_data, Read my_data, Copy my_data, Read my_data.]

Potential DAG 2

If the read_access is on the same node, a back end runtime can generate an alternative DAG without the transfer.

[Figure: DAG with nodes Modify my_data, Modify my_data, Read my_data, Read my_data.]
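The publish/fetch pattern can be mimicked in a single process to show two properties the example relies on: a published version is a snapshot that later modifications do not change, and a fetch is deferred until the requested version exists. `DataStore` and its `publish`/`fetch` signatures are hypothetical stand-ins, not DARMA's API, and the sketch ignores distribution, serialization, and handle lifetimes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-process stand-in for a publish/fetch data store.
class DataStore {
 public:
  using Key = std::pair<std::string, std::string>;  // (key, version)

  // Make a snapshot of `value` visible under (key, version) and run any
  // fetches that were waiting for it.
  void publish(const std::string& key, const std::string& version, int value) {
    const Key k{key, version};
    published_[k] = value;
    auto waiting = pending_.find(k);
    if (waiting == pending_.end()) return;
    std::vector<std::function<void(int)>> callbacks =
        std::move(waiting->second);
    pending_.erase(waiting);
    for (auto& callback : callbacks) callback(value);
  }

  // Deferred read: run `on_ready` now if the version already exists,
  // otherwise when it is published.
  void fetch(const std::string& key, const std::string& version,
             std::function<void(int)> on_ready) {
    const Key k{key, version};
    auto found = published_.find(k);
    if (found != published_.end()) {
      on_ready(found->second);
    } else {
      pending_[k].push_back(std::move(on_ready));
    }
  }

 private:
  std::map<Key, int> published_;
  std::map<Key, std::vector<std::function<void(int)>>> pending_;
};
```

In terms of the slide, stream B may call `fetch("my_key", "a", ...)` before or after stream A publishes; in both cases it observes the value that was current at the publish (29 in the example), not the later 31.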

A mapping between data and task collections determines access permissions between tasks and data

    auto mycol = initial_access_collection<int>(index_range = Range1D(10));
    create_concurrent_work<MyFun>(mycol, index_range = Range1D(10));

    struct MyFun {
      void operator()(Index1D i, AccessHandleCollection<int> col) {
        int me = i.value, mx = i.max_value;
        auto my_elm = col[i].local_access();
        my_elm.publish(version="x");
        auto neighbor = me-1 < 0 ? mx : me-1;
        auto other_elm = col[neighbor].read_access(version="x");
        create_work([=]{
          cout << "neighbor = " << other_elm.get_value() << endl;
        });
      }
    };

Identity map between these data and tasks. Thus, index i has local access to data index i.

Any other index must be read using read_access, which may be a remote or local operation depending on the back end mapping, but is always a deferred operation.
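The neighbor arithmetic in `MyFun` is a periodic (wrap-around) left shift. The sketch below isolates it and simulates the exchange serially, assuming that `max_value` is the largest valid index in the range (the slide does not define it). `left_neighbor` and `read_from_left` are hypothetical helper names, and the loop stands in for the publish and read_access pair.

```cpp
#include <cassert>
#include <vector>

// Left neighbor on a periodic 1D range whose largest index is max_value.
// Mirrors the slide's expression: me-1 < 0 ? mx : me-1.
int left_neighbor(int me, int max_value) {
  return me - 1 < 0 ? max_value : me - 1;
}

// Serial simulation of the exchange: element `me` owns data[me] and receives
// the value owned by its left neighbor.
std::vector<int> read_from_left(const std::vector<int>& data) {
  const int max_value = static_cast<int>(data.size()) - 1;
  std::vector<int> received(data.size());
  for (int me = 0; me <= max_value; ++me)
    received[me] = data[left_neighbor(me, max_value)];
  return received;
}
```

With ten elements, index 0 reads from index 9 and every other index `i` reads from `i - 1`, so each element is read by exactly one other task.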

Sandia ATDM applications drive requirements and developers play active role in informing front end API

§ Application feature requests
  § Sequential semantics
  § MPI interoperability
  § Node-level performance portability layer interoperability (Kokkos)
  § Collectives
  § Runtime-enabled load-balancing schemes

A latency-intolerant benchmark highlights overheads as grain size decreases

At this scale, each iteration is less than 5 ms long.

This benchmark has tight synchronization every iteration.

Increased asynchrony in the application enables the runtime to overlap communication and computation

Scalability improves with asynchronous iterations. Requires only minor changes to application code.

DARMA’s programming model enables runtime-managed, measurement-based load balancing

Load balancing does not require changes to the application code.

Stencil benchmark is not latency tolerant and highlights runtime overheads when task granularity is small

At this scale, each iteration is less than 5 ms long.

Increased asynchrony in application enables runtime to overlap communication and computation

Scalability improves with asynchronous iterations. Requires only minor changes to DARMA code.
