TRANSCRIPT

High-Level Abstractions for Irregular Computations
David Padua, Department of Computer Science
University of Illinois at Urbana-Champaign

December 4, 2017

1. INTRODUCTION: THE PROBLEM
Programming for performance
• Is laborious
  – Particularly so when the target machine is parallel.
• Must decide what to execute in parallel
• In what order to traverse the data
• How to partition the data …
• Today, tuning must be done manually.
• Maintenance is difficult
  – and the result is not portable.
Programming for performance
• There is still much room for progress in high-performance programming methodology.
• Our goal is to facilitate (automate) tuning and enable machine-independent programming.
2. IRREGULAR COMPUTATIONS
What are irregular computations?
• Irregular computations are difficult to define.
• One possible definition is:
  – Computations on "irregular" data structures.
  – Irregular data structures = not dense arrays.
    • Pingali et al. The tao of parallelism in algorithms. PLDI '11.
  – And subscript expressions non-linear.
What are irregular computations?
• Examples include
  – Subscripted subscripts:
    • A[C[i]]
  – Linked-list traversal:
    • leftCell = leftCell->ColNext;
    • result += leftCell->Value;
Programming irregular computations for performance
• Today, programming of irregular computations is typically at a low level of abstraction.

G. Hager and G. Wellein. Introduction to High Performance Computing for Scientists and Engineers. CRC Press.
do diag = 1, Nj, 2   ! two-way unroll and jam
  diagLen = min( ( jd_ptr(diag+1) - jd_ptr(diag) ),   &
                 ( jd_ptr(diag+2) - jd_ptr(diag+1) ) )
  offset1 = jd_ptr(diag) - 1
  offset2 = jd_ptr(diag+1) - 1
  do i = 1, diagLen
    C(i) = C(i) + val(offset1+i) * B(col_idx(offset1+i))
    C(i) = C(i) + val(offset2+i) * B(col_idx(offset2+i))
  end do
  ! peeled-off iterations
  offset1 = jd_ptr(diag)
  do i = (diagLen+1), (jd_ptr(diag+1) - jd_ptr(diag))
    C(i) = C(i) + val(offset1+i) * B(col_idx(offset1+i))
  end do
end do
Programming irregular computations for performance
• Programming and optimizing irregular computations are more challenging than for linear-subscript computations.
• We want compiler support for programming at a low level of abstraction to enable
  – Automatically mapping computation onto diverse machine classes
  – Locality enhancement
  – Other compiler optimizations (e.g., common subexpression elimination)
• In addition, often using higher levels of abstraction and applying optimizations at that level is a better solution.
• Both approaches will be discussed in this presentation.
When computations are not irregular
• The compiler can do a good job at transforming programs:
  – Loop interchange
  – Loop distribution
  – Tiling
  – …
• And we have the technology to generate efficient code for a variety of platforms even when the initial code is sequential
  – Vector extensions
  – Multicores
  – Distributed-memory machines
  – GPUs
Example: Automatic transformation of a regular computation
• Consider the loop

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}
Example: Automatic transformation of a regular computation
• Compute dependences (part 1)

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

Unrolled iterations (j = 1..4):

i=1:  a[1][1]=a[1][0]+a[0][1]   a[1][2]=a[1][1]+a[0][2]   a[1][3]=a[1][2]+a[0][3]   a[1][4]=a[1][3]+a[0][4]
i=2:  a[2][1]=a[2][0]+a[1][1]   a[2][2]=a[2][1]+a[1][2]   a[2][3]=a[2][2]+a[1][3]   a[2][4]=a[2][3]+a[1][4]
Example: Automatic transformation of a regular computation
• Compute dependences (part 2)

(Same unrolled iterations as in part 1: each a[i][j] depends on a[i][j-1], computed in the previous j-iteration, and on a[i-1][j], computed in the previous i-iteration.)
Example: Automatic transformation of a regular computation

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

[Figure: the iteration space drawn as an i × j grid starting at (1,1); each point depends on its left and upper neighbors.]
Example: Automatic transformation of a regular computation
• Find parallelism

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}
Example: Automatic transformation of a regular computation
• Transform the code

for (i=1; i<n; i++) {
  for (j=1; j<n; j++) {
    a[i][j] = a[i][j-1] + a[i-1][j];
  }
}

for (k=4; k<2*n; k++)
  forall (i = max(2,k-n) : min(n,k-2))
    a[i][k-i] = ...
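The diagonal (wavefront) transformation can be sketched in Python: iterations on the same anti-diagonal k = i + j are mutually independent, so the inner loop could run as a parallel forall. This is an illustrative sketch with 0-based bounds, which differ cosmetically from the slide's.

```python
def sequential(a, n):
    # original doubly nested loop
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i][j-1] + a[i-1][j]
    return a

def wavefront(a, n):
    # sweep anti-diagonals k = i + j; iterations on one diagonal are
    # independent, so the inner loop could be a parallel forall
    for k in range(2, 2*n - 1):
        for i in range(max(1, k - (n-1)), min(n-1, k-1) + 1):
            j = k - i
            a[i][j] = a[i][j-1] + a[i-1][j]
    return a

n = 5
init = [[i + j for j in range(n)] for i in range(n)]
r1 = sequential([row[:] for row in init], n)
r2 = wavefront([row[:] for row in init], n)
assert r1 == r2
```

Both orders compute identical results because the wavefront order respects every flow dependence on a[i][j-1] and a[i-1][j].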
Analyzing irregular computations
• Analysis of the following code is difficult and sometimes impossible:

for (i=1; i<n; i++) {
  a[c[i]] = a[d[i]] + 1;
}

• Dependences are a function of the values of c and d.
• In the absence of additional information, the compiler must assume the worst.
• And this precludes automatic transformations (parallelization, vectorization, data distribution, …).
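A toy illustration of why dependences are a function of the values of c and d: a hypothetical runtime check (names are mine, not from the talk) that conservatively declares the loop parallelizable only when the index vectors cannot create cross-iteration conflicts.

```python
def parallelizable(c, d):
    # Loop body: a[c[i]] = a[d[i]] + 1.
    # Conservatively safe to run all iterations in parallel when:
    #  - all writes a[c[i]] target distinct locations (no output deps), and
    #  - no write location is read by a *different* iteration (no flow/anti deps)
    if len(set(c)) != len(c):
        return False
    return all(c[i] != d[j] for i in range(len(c))
                            for j in range(len(d)) if i != j)

assert parallelizable([1, 2, 3], [4, 5, 6]) is True
assert parallelizable([1, 2, 3], [0, 1, 2]) is False   # a[1] written by i=0, read by i=1
```

The same loop text is parallel for one pair of index vectors and serial for another, which is exactly what a static compiler cannot know.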
Irregularity is a beast programming language researchers have been fighting for a long time.
It has defeated us sometimes
Ignoring the importance of irregular computations was one of the reasons for the failure of High Performance Fortran.
Taming the irregular beast
• Compilers and runtime (Section 3)
• Better notations (Section 4)
3. COMPILER AND RUNTIME SOLUTIONS FOR LOW-LEVEL IRREGULAR PROGRAMMING
Two approaches to dealing with irregular computations
• Analyzing irregular codes directly
  – Statically
  – Dynamically
• Converting into a higher level of abstraction (dense array computations)
Static analysis
• In some cases, it is possible to analyze irregular algorithms written in conventional notation.
• More effort than for regular (linear-subscript) computations.
Static analysis

Dependence analysis:

do k = 1, n
  q = 0
  do i = 1, p
    if ( x(i) > 0 ) then
      q = q + 1
      ind(q) = i
    end if
  end do
  do j = 1, q
    x(ind(j)) = x(ind(j)) * y(ind(j))
  end do
end do
Static analysis
• Dependence analysis of the code above needs information about ind(j).
• That information can be found by analyzing the program: the first inner loop fills ind(1:q) with distinct indices, so the j loop carries no dependences.
Static analysis

Array privatization:

do i=1, n
  do j=2, iblen(i)
    do k=1, j-1
      x(pptr(i)+k-1) = ...
    end do
  end do
  do j=1, iblen(i)-1
    do k=1, j
      ... = x(iblen(i)+pptr(i)+k-j-1)
    end do
  end do
end do
Static analysis
• To decide if x can be privatized, it is sufficient to show that a write precedes each read within an iteration of the outer loop.
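The write-precedes-read condition can be sketched as a small checker over per-iteration access traces. This is a simplification for illustration: the real analysis proves the property symbolically over the subscript expressions, not over concrete traces.

```python
def privatizable(iter_traces):
    # x can be privatized if, in every iteration of the outer loop,
    # each location of x is written before it is read in that iteration
    for trace in iter_traces:          # one list of ('R'|'W', index) per i-iteration
        written = set()
        for op, idx in trace:
            if op == 'W':
                written.add(idx)
            elif idx not in written:   # read with no preceding write this iteration
                return False
    return True

# writes x(0..1) then reads them back: a write precedes each read
assert privatizable([[('W', 0), ('W', 1), ('R', 0), ('R', 1)]]) is True
# reads x(5) before any write to it: not privatizable
assert privatizable([[('R', 5), ('W', 5)]]) is False
```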
Static analysis
• A possible approach is to query the control-flow graph, bounding the search.

Yuan Lin and David Padua. Compiler analysis of irregular memory accesses. PLDI '00.
Dynamic analysis
• Embedded dependence analysis
• Inspector-executor
• Speculation
Embedded dependence analysis
• This is a clever mechanism due to Zhu and Yew.

Zhu, C. Q., & Yew, P. C. (1987). A scheme to enforce data dependence on large multiprocessor systems. IEEE Transactions on Software Engineering, SE-13(6), 726-739.

for i=1; i<n; i++
  a(k(i)) = c(i) + 1

a_tag(k(:)) = 0
parallel for i=1; i<n; i++
  critical a(k(i))
    if a_tag(k(i)) < i
      a(k(i)) = c(i) + 1
      a_tag(k(i)) = i
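A sequentialized Python sketch of the tag idea (hedged: the real scheme runs iterations concurrently, with the test-and-update protected by a critical section): whatever order iterations arrive in, only the write from the highest-numbered iteration seen so far survives, so the final array matches sequential execution.

```python
def zhu_yew(k, c, order):
    a = [0] * (max(k) + 1)
    a_tag = [-1] * (max(k) + 1)        # a_tag(k(:)) = 0 in the slide; -1 here
    for i in order:                    # iterations may arrive in any order
        # critical section on a(k(i))
        if a_tag[k[i]] < i:            # only a later iteration may overwrite
            a[k[i]] = c[i] + 1
            a_tag[k[i]] = i
    return a

k = [0, 1, 0, 2, 1]
c = [10, 20, 30, 40, 50]
seq = zhu_yew(k, c, range(5))               # sequential order
rev = zhu_yew(k, c, reversed(range(5)))     # a "worst case" arrival order
assert seq == rev == [31, 51, 41]
```

The tag comparison enforces exactly the output dependences of the original loop without ever computing them ahead of time.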
Inspector-executor
• Parallel schedule and communication aggregation computed at execution time
  – By analyzing the memory access pattern of a code segment (typically a loop)
• Overhead is reduced when the code segment (loop) is executed multiple times with the same access pattern

See Encyclopedia of Parallel Computing: Run-Time Parallelization (Saltz and Das).
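A minimal inspector-executor sketch (illustrative names of my own; real systems also aggregate communication): the inspector scans the index arrays once to assign each iteration to a wavefront, and the executor then runs the wavefronts in order, with each wavefront's iterations free to run in parallel. The schedule can be reused as long as the access pattern does not change.

```python
def inspector(w, r):
    # assign each iteration to the earliest stage that follows every
    # earlier iteration touching the same location (conservative: any
    # shared location is treated as a conflict, even read-read)
    last = {}                      # location -> latest stage touching it
    stage = []
    for i in range(len(w)):
        s = 1 + max(last.get(loc, -1) for loc in (w[i], r[i]))
        stage.append(s)
        last[w[i]] = s
        last[r[i]] = s
    return stage

def executor(a, w, r, stage):
    for s in range(max(stage) + 1):            # wavefronts run in order...
        for i in (i for i in range(len(w)) if stage[i] == s):
            a[w[i]] = a[r[i]] + 1              # ...iterations within one could run in parallel
    return a

w, r = [0, 1, 0, 2], [3, 0, 1, 2]
sched = inspector(w, r)
seq = [0] * 4
for i in range(4):
    seq[w[i]] = seq[r[i]] + 1                   # reference: sequential loop
assert executor([0] * 4, w, r, sched) == seq
```

Any two conflicting iterations land in strictly increasing stages, so the executor preserves sequential semantics while exposing the independent iterations of each stage.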
Speculation
• There is an extensive body of literature on speculation.
• A code segment is executed in parallel in the hope that there is no problem.
• If there is, execution backtracks and the code segment is re-executed serially.

See Lawrence Rauchwerger, David A. Padua: The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. PLDI 1995: 218-232.
Encyclopedia of Parallel Computing: Thread-level speculation (Torrellas); Transactional memories (Herlihy).
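A toy, sequentialized sketch of the idea behind tests like LRPD: execute speculatively in an arbitrary order while logging accesses, then validate; on failure, discard the speculative state and re-execute serially. (Much simplified: the real test also handles privatization and reductions.)

```python
def run_serial(a, w, r):
    for i in range(len(w)):
        a[w[i]] = a[r[i]] + 1
    return a

def run_speculative(a, w, r):
    spec = a[:]                          # speculative copy of a
    reads, writes = set(), set()
    for i in reversed(range(len(w))):    # simulate an arbitrary parallel order
        spec[w[i]] = spec[r[i]] + 1
        reads.add(r[i]); writes.add(w[i])
    # validation: the loop was fully parallel iff no location was both
    # read and written, and no location was written twice
    ok = not (reads & writes) and len(writes) == len(w)
    if ok:
        return spec                      # commit the speculative results
    return run_serial(a[:], w, r)        # misspeculation: back off, rerun serially

a0 = [0, 0, 0, 0, 0, 0]
assert run_speculative(a0, [0, 1, 2], [3, 4, 5]) == run_serial(a0[:], [0, 1, 2], [3, 4, 5])
assert run_speculative(a0, [1, 2, 3], [0, 1, 2]) == run_serial(a0[:], [1, 2, 3], [0, 1, 2])
```

The first call passes the test and commits; the second detects the flow dependences and falls back to serial execution, so both return the sequential result.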
Converting into dense array computations
• The objective is to convert sparse computations into equivalent dense form.
• This increases the number of operations involving zeroes (or identity)
  – Sometimes making the computation impossible
• But it allows the application of compile-time transformations.
• Sparsity is recovered by applying a final transformation (see below).
Converting into dense array computations

for (col = 0; col < cols; col++) {
  for (row = 0; row < dimensions; row++) {
    leftCell = left.Rows[row];
    while (leftCell != NULL) {
      result[row][col] += leftCell->Value * right[leftCell->ColIndex][col];
      leftCell = leftCell->ColNext;
    }
  }
}

for (col = 0; col < cols; col++)
  for (row = 0; row < dimensions; row++)
    result[row][col] += A_Valuep[row][:] * right[:][col];
Converting into dense array computations

Harmen L. A. van der Spek, Harry A. G. Wijshoff: Sublimation: Expanding Data Structures to Enable Data Instance Specific Optimizations. LCPC 2010: 106-120.
Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and data transformations for sparse matrix code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15).
Where do we stand?
• There is an extensive body of literature on compilation and runtime techniques to facilitate the parallel programming of irregular computations.
• Although more ideas are important, what we need most urgently is an evaluation of their effectiveness.
  – A means for refinement and progress.
• All compiler techniques suffer from a lack of understanding of their effectiveness.
Need tools to evaluate progress
• Measuring advances in compiler technology is crucial for progress.
• An idea: a publicly available repository
  – Containing extensive and representative code collections.
  – Keeping track of results for compilers/techniques over time.
• Ongoing work on one such repository:
  – LORE: A Loop Repository for the Evaluation of Compilers, by Z. Chen, Z. Gong, J. Szaday, D. Wong, D. Padua, A. Nicolau, A. Veidenbaum, N. Watkinson, Z. Sura, S. Maleki, J. Torrellas, G. DeJong. IISWC 2017.
4. NOTATIONS

4.1 ABSTRACTION
What are programming abstractions?
§ An abstraction is formed by reducing information.
§ In everyday life, abstractions form the world of ideas where we can reason. No (natural language) statement can be made without abstractions.
What are programming abstractions?
§ Abstract machine language / lower-level implementation
§ Scalar computations:
  § a = b*c + d^3.4
  § No loads, stores, register allocation.
§ Scalar loops:
  § for (i = …) {…}
  § No if statements, branch/goto instructions.
  § Gives structure to the program.
§ Abstract algorithm implementation
  § min(A(1:n))
  § it = find(myvec.begin(), myvec.end(), 30);
  § Hide algorithm, data representation.
What are the benefits of using abstractions?
• Programmer productivity (expressiveness)
  – Codes are easier to write
  – To understand, maintain
• Portability
• Optimization (manual and automatic)
  – Programs are easier to analyze
  – Easier to transform
  – Deeper transformations are enabled
Abstractions and irregular computations
• What abstractions should we use for programming irregular computations at a higher level?
• The best answer seems to be to represent irregular algorithms in terms of:

Dense array operations
4. NOTATIONS

4.2 ARRAY NOTATION AND IRREGULAR COMPUTATIONS
Why array notation?
• Could access arrays in scalar mode
• But operations on aggregates are (often) easier to read:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    a[i][j] = b[i][j] + c[i][j];

versus

a = b + c;

• They are well-understood abstractions.
• Powerful.
Array operations and performance
• But the usefulness of array notation goes beyond expressiveness.
• There are at least three reasons to use array abstractions for performance:
  – To represent parallelism
  – To reduce the overhead of interpretation
  – To enable automatic optimizations
Representation of parallelism
• Popular in the early days:

Illiac IV Fortran:

      do 10 i = 1, 100, 2
        M(i) = .true.
        M(i+1) = .false.
10    continue
      A(*) = B(*) + A(*)
      C(M(*)) = B(M(*)) + A(M(*))

• Used today but not widely
  • TensorFlow
  • Futhark
    – T. Henriksen, N. Serup, M. Elsman, F. Henglein, and C. Oancea. Futhark: Purely Functional GPU-Programming with Nested Parallelism and In-Place Array Updates. PLDI 2017.

Cilk Plus:

C[i][j] = __sec_reduce_add( a[i][:] * b[:][j] );
Array operations and optimizations: an example
• Applying arithmetic rules such as associativity of matrix-matrix multiplication to reduce the number of operations.
  – Which one is better?
    • (((A*B)*C)*D)*E
    • (A*((B*C)*D))*E
    • (A*B)*((C*D)*E)
    • …
  – Find the best using dynamic programming
  – Yoichi Muraoka and David J. Kuck. 1973. On the time required for a sequence of matrix products. Commun. ACM 16, 1 (January 1973), 22-26.
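The dynamic-programming search for the best parenthesization (the classic matrix-chain-order recurrence) can be sketched as:

```python
def matrix_chain_cost(dims):
    # dims[i], dims[i+1] are the dimensions of matrix i; returns the
    # minimum number of scalar multiplications over all parenthesizations
    n = len(dims) - 1                      # number of matrices
    m = [[0] * n for _ in range(n)]        # m[i][j]: best cost for chain i..j
    for length in range(2, n + 1):         # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            m[i][j] = min(m[i][k] + m[k+1][j] + dims[i] * dims[k+1] * dims[j+1]
                          for k in range(i, j))
    return m[0][n-1]

# A (10x30) * B (30x5) * C (5x60): (A*B)*C costs 1500 + 3000 = 4500,
# while A*(B*C) costs 9000 + 18000 = 27000
assert matrix_chain_cost([10, 30, 5, 60]) == 4500
```

The recurrence tries every split point k and reuses the optimal cost of each sub-chain, which is exactly the structure exploited in the Muraoka-Kuck analysis.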
Reduce overhead of interpretation
• See Haichuan Wang, David A. Padua, Peng Wu: Vectorization of Apply to Reduce Interpretation Overhead of R. OOPSLA 2015: 400-415.
Array operations and irregular computations
• Array operations are an ideal notation to manipulate sparse arrays.
Sparse arrays in MATLAB

load west0479.mat
A = west0479;
S = A * A' + speye(size(A));
pct = 100 / numel(A);
figure
spy(S)
title('A Sparse Symmetric Matrix')
nz = nnz(S);
xlabel(sprintf('nonzeros = %d (%.3f%%)', nz, nz*pct));

S = sparse(m,n)

• Arrays appear as dense, but are represented internally as sparse.
• The interpreter and libraries handle the sparseness.
Sparse arrays can be abstracted as dense arrays in compiled languages
• Bik and Wijshoff* developed a strategy to translate dense array programs into sparse equivalents.

Step 1 (dense):
do i=1,n
  acc = acc + a(i,:) * <1:n>

Step 2 (guarded):
do i=1,n
  do j=1,n
    if (i,j) ∈ E_a then
      acc = acc + a'(σ_a(i,j)) * j

Step 3 (iterate over stored nonzeros):
do i=1,n
  do ad ∈ PAD_a
    j = π2 σ_a⁻¹(ad)
    acc = acc + a'(ad) * j

Step 4 (compressed storage):
do i=1,n
  do ad = alow(i), ahigh(i)
    j = aind(ad)
    acc = acc + aval(ad) * j

*Aart J. C. Bik, Harry A. G. Wijshoff: Compilation Techniques for Sparse Matrix Computations. International Conference on Supercomputing 1993: 416-424.
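The final compressed form corresponds to a standard CSR-style traversal. A small Python sketch (my own illustration: alow/ahigh are collapsed into one row-pointer array, as in common CSR layouts, and indices are 0-based rather than the slide's 1-based):

```python
def csr_acc(rowptr, aind, aval, n):
    # mirrors the generated loop: for each row i, iterate only over the
    # stored nonzeros, recovering the column index j from aind
    acc = 0
    for i in range(n):
        for ad in range(rowptr[i], rowptr[i+1]):
            j = aind[ad]
            acc += aval[ad] * j
    return acc

# dense a with nonzeros a[0][1]=2, a[1][0]=4, a[1][2]=5  (0-based)
rowptr, aind, aval = [0, 1, 3, 3], [1, 0, 2], [2, 4, 5]
dense = [[0, 2, 0], [4, 0, 5], [0, 0, 0]]
ref = sum(dense[i][j] * j for i in range(3) for j in range(3))
assert csr_acc(rowptr, aind, aval, 3) == ref == 12
```

The compressed loop visits only the stored entries yet computes the same accumulation as the dense form, which is the "sparsity recovery" step of the translation.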
Array operations and irregular computations
• Not only can array notation be used to represent dense computations, it can be used for a number of other domains.
• A wide range of domains.
How wide?
• "At the time (ca. 1959), Illinois, with the first of its 'ILLIAC' supercomputers, was a great centre of computing, and Golub showed his affection for his Alma Mater by endowing a chair there 50 years later. Rumour has it that the funds for the gift came from Google stock acquired in exchange for some advice on linear algebra. Google's PageRank search technology starts from a matrix computation — an eigenvalue problem with dimensions in the billions. Hardly surprising, Golub would have said: everything is linear algebra." (Nature, 2007)
New operators: arrays for graph algorithms
• A simple algorithm for SSSP (Bellman-Ford) can be represented as follows:

for (i=1; i<=n; i++)
  d = d ⨁ A ⨂ d

Jeremy Kepner and John Gilbert. 2011. Graph Algorithms in the Language of Linear Algebra. SIAM Press.
T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, Steve Reinhardt, M. Stonebraker, S. Wallach, A. Yoo. Standards for Graph Algorithm Primitives.
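Over the (min, +) semiring (⊕ = min, ⊗ = +), the update d = d ⊕ A ⊗ d is ordinary Bellman-Ford relaxation. A small sketch (one convention among several: A[i][j] holds the weight of edge i to j, infinity where no edge exists):

```python
INF = float('inf')

def sssp(A, src):
    # d_j' = min( d_j, min_i (d_i + A[i][j]) ): min is the semiring's
    # "addition" (⊕) and + is its "multiplication" (⊗)
    n = len(A)
    d = [INF] * n
    d[src] = 0
    for _ in range(n - 1):                     # the slide's for-loop
        d = [min(d[j], min(d[i] + A[i][j] for i in range(n)))
             for j in range(n)]
    return d

A = [[INF, 5, 10],
     [INF, INF, 2],
     [INF, INF, INF]]
assert sssp(A, 0) == [0, 5, 7]                  # 0->1->2 beats the direct edge
```

Replacing (min, +) with other semirings gives other graph algorithms from the same matrix-vector skeleton, which is the point of the linear-algebra formulation.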
4. NOTATIONS

4.3 EXTENSIONS TO ARRAY NOTATION FOR PARALLELISM
Communication as array operations

repmat(h, [1, 3])
circshift(h, [0, -1])
transpose(h)
Cannon's algorithm

[Figure: 3×3 block matrices A (blocks A00..A22) and B (blocks B00..B22); an initial skew aligns the blocks so each position holds a product pair (A00·B00, A01·B11, A02·B22, A12·B21, A11·B10, A10·B02, A22·B20, A20·B01, A21·B12), followed by repeated shift-multiply-add steps.]
Cannon's algorithm

for k = 1:n
  c = c + a × b;
  a = circshift(a, [0 -1]);
  b = circshift(b, [-1 0]);
end

James C. Brodman, G. Carl Evans, Murat Manguoglu, Ahmed H. Sameh, María Jesús Garzarán, David A. Padua: A Parallel Numerical Solver Using Hierarchically Tiled Arrays. LCPC 2010: 46-61.
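The loop can be simulated in plain Python with scalar "blocks" (a sketch; a real implementation distributes blocks over a processor mesh): after the initial skew, n steps of elementwise multiply-accumulate plus circular shifts reproduce the matrix product.

```python
def cannon(A, B):
    n = len(A)
    # initial skew: rotate row i of A left by i, column j of B up by j
    a = [A[i][i:] + A[i][:i] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]          # c = c + a .* b
        a = [row[1:] + row[:1] for row in a]          # circshift(a, [0 -1])
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]  # circshift(b, [-1 0])
    return c

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert cannon(A, B) == [[19, 22], [43, 50]]
```

After the skew, position (i, j) holds A[i][(i+j) mod n] and B[(i+j) mod n][j], so the k shifts sweep exactly the terms of the inner product.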
Barrier elimination

X(1)=Y(1)*2   X(2)=Y(2)*2   X(3)=Y(3)*2   X(4)=Y(4)*2

Z(1)=X(1)   Z(2)=X(3)
Z(3)=X(2)   Z(4)=X(4)

X(1:4) = Y(1:4)*2;
Z(1:2) = X([1,3]);
Z(3:4) = X([2,4]);
Asynchronous semantics
• In some cases, it could be desirable to delay updates across the whole machine. At some point, a global update is made.
• This is a particular example of communication in asynchronous algorithms, where data assignments suffer delays.
Tiled Linear Algebra: SSSP example

// the distance vector
d = inf; d(1) = 0;
// the mask vector
L = 0; L(1) = 1;
while (L.nnz())
  for i = 1:LOC_ITER              // local computation
    r <- d + (trans(A)*(d.*L));   // partial multiplication
    L <- (r != d);
    d <- r;
  r += d;                          // global reduction
  L <- (r != d);
  d <- r;

Saeed Maleki, G. Carl Evans, David A. Padua: Tiled Linear Algebra a System for Parallel Graph Algorithms. LCPC 2014: 116-130.
4. NOTATIONS

4.4 WHERE DO WE STAND?
Array notation
• Array notation has a long history.
• It is a powerful notation that can be used for a wide range of applications.
• For parallelism, it is an unfulfilled promise.
• The highly mathematical notation is a challenge.
• We see increased use of the notation and expect it to become popular also for irregular algorithms.
• Challenges:
  – Develop the notation
  – Design compiler optimizations
5. EPILOG
• In 1976, not much was known about parallel programming of irregular computations.
• 40 years later, it is impressive how much has been learned about compilation and notations.
• But the development of widely used tools based on these ideas remains a challenge.