argobots and its application to charm++ · 1.0 1.5 2.0 2.5 3.0 3.5 1 3 5 7 9 11 13 15 17 19 21 23...
TRANSCRIPT
SangminSeo
AssistantComputerScientistArgonneNationalLaboratory
April19,2016
Argobots and its Application to Charm++
Charm++Workshop2016
Argo Concurrency Team
• ArgonneNationalLaboratory(ANL)– Pavan Balaji (co-lead)– SangminSeo– AbdelhalimAmer– MarcSnir– PeteBeckman(PI)
• UniversityofIllinoisatUrbana-Champaign(UIUC)– Laxmikant Kale(co-lead)– Prateek Jindal– JonathanLifflander
• UniversityofTennessee,Knoxville(UTK)– GeorgeBosilca– ThomasHerault– DamienGenet
• PacificNorthwestNationalLaboratory(PNNL)– Sriram Krishnamoorthy
PastTeamMembers:• CyrilBordage (UIUC)• EstebanMeneses
(UniversityofPittsburgh)• Huiwei Lu(ANL)• Yanhua Sun (UIUC)
Charm++Workshop2016 2
Massive On-node Parallelism
• Thenumberofcoresisincreasing• Massiveon-nodeparallelismisinevitable• Existingsolutionsdonoteffectivelydealwithsuchparallelismwith
respecttoon-nodethreading/taskingsystemsorwithrespecttooff-nodecommunicationinthepresenceofsuchtasks/threads
• Howtoexploit?
core
Core-levelParallelism3Charm++Workshop2016
Shortcomings today? Pthreads (1/2)
Nesting
int in[1000][1000], out[1000][1000];
#pragma omp parallel for
for (i = 0; i < 1000; i++) {
petsc_voodoo(i);
}
petsc_voodoo(int x)
{
#pragma omp parallel for
for (j = 0; j < 1000; j++)
out[x][j] = cosine(in[x][j]);
}
Executiontimefor36threadsintheouterloop
WhyistraditionalOpenMP’s performancesobad?Thecompilercannotanalyzepetsc_voodoo toknowwhetherthefunctionmighteverblockoryield,so ithastoassumethatitmight.Thereforeastackisneededtofacilitateit.CreatingadditionalPthreads foreachnestingisthesimplestwaytoachievethis.
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35
Time(s)
#OMPThreads|ArgobotsULTs/tasks(innerloop)
GCC/pthreads GCC/ArgobotsULTs GCC/Argobotstasks
Lowerisbetter
Charm++Workshop2016 4
Shortcomings today? Pthreads (2/2)
TasksofapplicationmappedtoagroupofPthreads
Needlightweightmechanismstoswitchtasks!computation
communication
C
Pthreads
C C C
map&schedule
Howaboutthesecommunications?Waitorcontextswitch?
Workunitsintermixedwithblockingcalls (suchascommunication calls)cancauseidlecores
Charm++Workshop2016 5
Outline
• Background• Argobots• Charm++withArgobots• OtherProgrammingModels• Summary
Charm++Workshop2016 6
User-Level Threads (ULTs)
• Whatisuser-levelthread(ULT)?– Providesthreadsemanticsinuserspace– Executionmodel:cooperativetimesharing
• MorethanoneULTcanbemappedtoasinglekernelthread
• ULTsonthesameOSthreaddonotexecuteconcurrently• Wheretouse?
– Tobetteroverlapcomputationandcommunication/IO– Toexploitfine-grainedtaskparallelism
timeline
Context switch
Context switch
ULT1
ULT2
Core Core Core Core Core Core Core Core
ULTs :
Kernel threads :
Charm++Workshop2016 7
Pthreads vs. ULTs
• Averagetimeforcreatingandjoiningonethread• pthread:6.6us- 21.2us(avg.34,953cycles)• ULT(Argobots):78ns- 130ns(avg.191cycles)• ULTis64x- 233xfaster thanPthread
– HowfastisULT?• L1$access:1.112ns,L2$access:5.648ns,memory access:18.4ns• Context switch (2processes):1.64us
1
10
100
1000
10000
100000
1 2 4 8 16 32 64 128 256 512 1024 2048
Avg.Create&JoinTime/thread
(ns)
NumberofThreads
pthread ULT(Argobots)
*measuredusingLMbench3
Charm++Workshop2016 8
Growing Interests in ULTs
• ULTandtasklibraries– Conversethreads,Qthreads,MassiveThreads,Nanos++,
Maestro,GnuPth,StackThreads/MP,Protothreads,Capriccio,StateThreads,TiNy-threads,etc.
• OSsupports– Windowsfibers,Solaristhreads
• Languageandprogrammingmodels– Cilk,OpenMP task,C++11task,C++17coroutineproposal,
Stackless Python,Gocoroutines,etc.
• Pros– EasytousewithPthreads-likeinterface
• Cons– Runtimetriestodosomethingsmart(e.g.,work-stealing)– Thismayconflictwiththecharacteristicsanddemandsof
applications
Charm++Workshop2016 9
Argobots
Overview• Separationofmechanismsandpolicies• Massiveparallelism
– Exec.Streams guaranteeprogress– WorkUnits executetocompletion
• User-levelthreads(ULTs)vs.Tasklet• Clearlydefinedmemorysemantics
– Consistencydomains• ProvideEventualConsistency
– Softwarecanmanageconsistency
Argobots Innovations• Enablingtechnology,butnotapolicymaker
– High-levellanguages/librariessuchasOpenMP,Charm++havemoreinformationabouttheuserapplication(datalocality,dependencies)
• Explicitmodel:– Enablesdynamism,butalwaysmanaged
byhigh-levelsystems
Argobots
coreProcessor
Programming Models(MPI, OpenMP, Charm++, PaRSEC, …)
U User-Level Thread T TaskletLightweightWork Units
Exe
cutio
n S
tream
Private pool Private poolShared pool
U U
U T
TTU TU
Exe
cutio
n S
tream
Exe
cutio
n S
tream
Alow-levellightweightthreadingandtaskingframework(http://collab.cels.anl.gov/display/argobots/)
*Teammembers:Sangmin Seo,Abdelhalim Amer,Pavan Balaji (ANL),Laxmikant Kale,Prateek Jindal(UIUC)
Charm++Workshop2016 10
Argobots Execution Model
• ExecutionStreams(ES)– Sequentialinstructionstream
• Canconsistofoneormoreworkunits– Mappedefficientlytoahardware
resource– Implicitlymanagedprogresssemantics
• OneblockedEScannotblockotherESs
• User-levelThreads(ULTs)– Independentexecutionunitsinuser
space– AssociatedwithanESwhenrunning– Yieldableandmigratable– Canmakeblockingcalls
• Tasklets– Atomicunitsofwork– Asynchronouscompletionvia
notifications– Notyieldable,migratablebefore
execution– Cannotmakeblockingcalls
S
Scheduler Pool
U
ULT
T
Tasklet
E
Event
ES1 Sched
U
U
E
E
E
E
U
S
S
T
T
T
T
T
Argobots Execution Model
...
ESn
• Scheduler– Stackableschedulerwithpluggable
strategies• Synchronizationprimitives
– Mutex,conditionvariable,barrier, future• Events
– Communicationtriggers
Charm++Workshop2016 11
Explicit Mapping ULT/Tasklet to ES
• TheuserneedstomapworkunitstoESs• Nosmartscheduling,nowork-stealingunlesstheuserwants
touse
ES1
U0
U1
T1
T2
U2
U3
ES2
U4
U5
• Benefits– Allowlocalityoptimization
• ExecuteworkunitsonthesameES
– NoexpensivelockisneededbetweenULTsonthesameES
• Theydonotrunconcurrently• Aflagisenough
Charm++Workshop2016 12
Stackable Scheduler with Pluggable Strategies
• AssociatedwithanES• CanhandleULTsandtasklets• Canhandleschedulers
– Allowstostackschedulershierarchically• Canhandleasynchronousevents• Userscanwriteschedulers
– Providesmechanisms,notpolicies– Replacethedefaultscheduler
• E.g.,FIFO,LIFO,PriorityQueue,etc.• ULTcanexplicitlyyieldto anotherULT
– Avoidscheduleroverhead
Sched
U
U
E
E
E
E
U
S
S
T
T
T
T
T
U S U U U
yield() yield_to(target)
Charm++Workshop2016 13
Performance: Create/Join Time
• Idealscalability– IftheULTruntimeisperfectlyscalable,thetimeshouldbethesame
regardlessofthenumberofESs
10
100
1000
10000
1 2 4 8 16 24 32 36 40 48 56 64 72
Create/Jo
inTimeperU
LT(cycles)
NumberofExecutionStreams(Workers)
Qthreads MassiveThreads (H) MassiveThreads (W)
Argobots(ULT) Argobots(Tasklet)
Charm++Workshop2016 14
JonathanLifflander,Prateek Jindal,Yanhua SunLaxmikant Kale
UniversityofIllinoisatUrbana-Champaign(UIUC)
Charm++ with Argobots
15Charm++Workshop2016
Charm++ with Argobots
• Goals– TestthecompletenessandperformanceofArgobotswith
Charm++programmingmodel– TakeadvantageofArgobots features(tasklets,stackable
schedulers,etc.)withoutmodifyingapplicationcodes– ForCharm++applications,interoperatewithapplications
writteninothermodels(MPI,Cilk,etc.)
16
Mini-apps and real world applications
Charm++ model
Converse runtime(threading, messaging, scheduler)
Communication libraries (MPI, uGNI, PAMI, Verbs)
Intelligent runtime
Argobots(ULTs, Tasks, scheduling, etc.)
Charm++ infrastructure Charm++ with Argobots *Teammembers:LaxmikantKale,JonathanLifflander, PrateekJindal (UIUC)
Charm++Workshop2016
Replacing the Converse Runtime with Argobots
• Converse– TheactivemessaginglayerinCharm++
• Approaches– EachCharm++Pthread insideanode(includingthecommunication
thread)isimplementedasanArgobots ES• CreateanESforeveryConverseinstance
– AcustomArgobots scheduleriscreatedinsteadofusingtheConversescheduler
– Conversemessagesareenqueued intoArgobots poolsastasklets– Conversethreads(CthThread)areimplementedontopofArgobots
ULTs,withconditionalvariablestoimplementsuspend/resume
• Only180linesofcodehadtobechanged!
Charm++Workshop2016 17
Converse runtime(threading, messaging, scheduler)
Argobots(ULTs, Tasks, scheduling, etc.)
LeanMD Performance: Runtime Comparison
Charm++Workshop2016 18
• Evaluationmachine– 2xIntelXeon E5-2699v3(2.30GHz):36cores (72threads)
• LeanMD simulation– Atotalof20stepsonacellarrayofdimensions7x7x7– 1-awayXYZconfigurationand1000atomspercell
Achievedcomparableperformancealthough itisaverysimpleimplementation
LeanMD Performance: Manual Implementation
Charm++Workshop2016 19
• Manualimplementation ofLeanMD usingtheArgobots– ExploitedbothULTsandtasklets
• AULTformanagingacellandatasklet formanagingtheinteractionbetweencells– Usedfuturesforthewaitingmechanism– Workstealingbetweenpools
• Betterperformanceofourmanualimplementation implies thatCharm++withArgobots couldbeimproved
Argobots: Interfaces for Shrink/Expand Events
20
ES0
Sched
ES1
Sched
ES2
Sched NRM
ESn-1
Sched...
E
E
E
Argobots
socket
Charm++ CilkBot PaRSECMPI+Argobots
programmingmodelruntimesandapplications
callbackfunctions
1. [Argobots]ConnecttoNRMusingasocketonABT_init()2. [Runtimes/applications]Registercallbackfunctionsforshrink/expandevents3. [Runtimes/applications]Deregistercallbackfunctionswhentheyterminate4. [Argobots]DisconnectfromNRMonABT_finalize()
ABT_ENV_POWER_EVENT_HOSTNAMEABT_ENV_POWER_EVENT_PORT
ABT_event_add_callback()ABT_event_del_callback()
Charm++Workshop2016
Argobots: Shrink/Expand Event Handling
• Shrinking
21
ES0
Sched
ES1
Sched
ES2
Sched
E E E
prog.model runtime/application
1. ES1 picksanevent,whichrequestsES2 tobestopped2. AsktheruntimeusingcallbackswhetherES2 canbestopped3. IfOK,markES2 toneedtostopsowhenthescheduleronES2 checksevents,
itcanbestopped4. NotifytheruntimethatES2 willbestopped5. CreateaULTonES06. WhenthescheduleronES2 stops,ES2 isterminated7. AfterES2 isterminated,theULTfreesES2 andsendsaresponsetoNRM
- Anyscheduler onanyEScancheckandhandle events- ES0 cannotbestopped
U
12
3
4
5NRM
7
6
Charm++Workshop2016
Argobots: Shrink/Expand Event Handling
• Expanding
22
ES0
Sched
ES1
Sched
ES2
Sched
E E E
prog.model runtime/application
1. ES0 picksanevent,whichrequeststocreateanES22. AsktheruntimeusingcallbackswhetheritcancreateES23. IfOK,invokeacallbackfunctionsotheruntimecreatesES24. CreateES25. SendaresponsetoNRM
1 2 3
5NRM
4
Charm++Workshop2016
Charm++ with Argobots: Implementation
• Shrink/ExpandImplementation– Charm++maintainsasetofpoolsforeachscheduler,mappedtoanES
– RanksinCharm++arevirtualizedbysavingthemappingofpooltoanES
– WhenanESisremoved,theassociatedpoolsareputintoagloballist
• TomaintaincorrectnessinCharm++,therankofanytasks/threadsinthegloballistarederivedfromthepool(ranksarevirtualized)
– Shrink• OtherESsexecuteworkunitsfromorphanedpoolswithsomeaddedsynchronization
– Expand• AnewESiscreatedandtakesoverasetoforphanedpools
23Charm++Workshop2016
LeanMD Results of Shrinking/Expanding ESs
Charm++Workshop2016 24
Argobots Ecosystem
ES1 Sched
U
U
E
E
E
E
U
S
S
T
T
T
T
T
Argobots
...
ESn
MPI+Argobots
ULT
ES
ULT
ES
MPI
Argobots runtime
Communication libraries
Charm++
Applications
Charm++Cilk “Worker”
ArgobotsES
RWSULT
FusedULT1
FusedULT2
FusedULTN
…CilkBots
PO
GE
TR
SYTR TR
PO GE GE
TR TR
SYSY GE
PO
TR
SY
PO
SY SY
PaRSEC
OpenMP MercuryRPC
Origin
Target
RPCproc
RPCproc
OmpSs
GridFTP,Kokkos,RAJA,ROSE,TASCEL,XMP,etc.ExternalConnections
Charm++Workshop2016 25
Summary
• Massiveon-nodeparallelismisinevitable– Needruntimesystemsutilizingsuchparallelism
• Argobots– Alightweightlow-levelthreading/taskingframework– Providesefficientmechanisms,notpolicies,tousers(library
developersorcompilers)• Theycanbuildtheirownsolutions
• Charm++withArgobots– ImplementedbyreplacingtheConverseruntimewithArgobots– AchievedcomparableperformanceonLeanMD– Incorporatedtheshrinking/expandingofESsinArgobots in
ordertorespondtotheexternalevents(e.g.,powercapping)
Charm++Workshop2016 26
Try Argobots
• git repository• http://git.mcs.anl.gov/argo/argobots.git/
• Documentation– Wiki
• https://collab.cels.anl.gov/display/ARGOBOTS/
– Doxygen• http://www.mcs.anl.gov/~sseo/public/argobots/
Charm++Workshop2016 27
Funding AcknowledgmentsFundingGrantProviders
InfrastructureProviders
Charm++Workshop2016
Programming Models and Runtime Systems GroupGroupLead– PavanBalaji(computerscientistandgroup
lead)
CurrentStaffMembers– Abdelhalim Amer (postdoc)– Yanfei Guo (postdoc)– RobLatham(developer)– LenaOden(postdoc)– KenRaffenetti (developer)– Sangmin Seo (assistantcomputerscientist)– MinSi(postdoc)– MinTian(visitingscholar)
PastStaffMembers– AntonioPena(postdoc)– WesleyBland(postdoc)– DariusT.Buntinas (developer)– JamesS.Dinan (postdoc)– DavidJ.Goodell (developer)– Huiwei Lu(postdoc)– YanjieWei(visitingscholar)– Yuqing Xiong (visitingscholar)
– JianYu(visitingscholar)
– Junchao Zhang(postdoc)
– Xiaomin Zhu(visitingscholar)
– AshwinAji (Ph.D.)– Abdelhalim Amer (Ph.D.)– Md.Humayun Arafat(Ph.D.)– AlexBrooks(Ph.D.)– AdrianCastello (Ph.D.)– Dazhao Cheng(Ph.D.)– JamesS.Dinan (Ph.D.)– Piotr Fidkowski (Ph.D.)– PriyankaGhosh(Ph.D.)– SayanGhosh (Ph.D.)– RalfGunter(B.S.)– Jichi Guo (Ph.D.)– Yanfei Guo (Ph.D.)– MariusHorga (M.S.)– John Jenkins(Ph.D.)– Feng Ji (Ph.D.)– PingLai(Ph.D.)– Palden Lama(Ph.D.)– YanLi(Ph.D.)– Huiwei Lu(Ph.D.)– JintaoMeng (Ph.D.)– GaneshNarayanaswamy (M.S.)– Qingpeng Niu (Ph.D.)– Ziaul Haque Olive(Ph.D.)– DavidOzog (Ph.D.)
– Renbo Pang(Ph.D.)– Sreeram Potluri (Ph.D.)– LiRao (M.S.)– Gopal Santhanaraman (Ph.D.)– ThomasScogland (Ph.D.)– MinSi(Ph.D.)– BrianSkjerven (Ph.D.)– RajeshSudarsan (Ph.D.)– LukaszWesolowski (Ph.D.)– Shucai Xiao(Ph.D.)– Chaoran Yang(Ph.D.)– Boyu Zhang(Ph.D.)– Xiuxia Zhang(Ph.D.)– XinZhao(Ph.D.)
Advisory Board– PeteBeckman(seniorscientist)– RustyLusk (retired,STA)– MarcSnir (divisiondirector)– RajeevThakur (deputydirector)
Current andRecentStudents
Charm++Workshop2016
Q&A
§ Thankyouforyourattention!
Questions?
Charm++Workshop2016