readex: a tool suite for dynamic energy tuning · periscope tuning framework (ptf) • automatic...
TRANSCRIPT
READEX: A Tool Suite for Dynamic Energy Tuning
MichaelGerndt
TechnischeUniversitätMünchen
CampusGarching
2
SuperMUC:3Petaflops,3MW
3
READEX
RuntimeExploitationofApplicationDynamismforEnergy-efficienteXascaleComputing
09/2015 to08/2018www.readex.eu
4
Objectives• Tuningforenergyefficiency• Beyondstatictuning:exploit dynamism inapplicationcharacteristics
• Leverage system scenario based tuning
5
HPCAutomaticTuning
EmbeddedSystemScenarios
SystemsScenariobasedMethodology
6
PeriscopeTuningFramework (PTF)
• Automaticapplicationanalysis&tuning• Tuneperformanceandenergy(statically)• Plug-in-basedarchitecture• Evaluatealternativesonline• Scalableanddistributedframework
• Supportvarietyofparallelparadigms• MPI,OpenMP,OpenCL,Parallelpattern
• AutoTuneEU-FP7project
7
Score-P
ScalablePerformanceMeasurementInfrastructureforParallelCodesCommoninstrumentationand measurement infrastructure
8
TuningPluginInterface
Plugin PeriscopeFrontend
Applicationwith
Monitor
Scenario execution§ Tuningactions§ Measurement requests
SearchSpaceExplorationinsideofTuningSteps
TuningPlugins
• MPIparameters• EagerLimit,Bufferspace,collectivealgorithms• ApplicationrestartorMPITToolsInterface
• DVFS• Frequencytuningforenergydelayproduct• Model-basedpredictionoffrequency• Regionleveltuning
• Parallelismcapping• Threadnumbertuningforenergydelayproduct• Exhaustiveandcurvefittingbasedprediction
DynamicTuningwiththeREADEXToolSuite
• READEXextends theconceptoftuning inPeriscope• Dynamictuning
• Instead ofoneoptimalconfiguration,SWITCHbetweendifferentbestconfigurations.
• Dynamicadaptation tochanging programcharacteristics.
Scenario-BasedTuning
12
Design Time Analysis
Tuning Model
Runtime Tuning
PeriscopeTuningFramework(PTF)
READEXRuntimeLibrary(RRL)
Intra-phase Dynamism
PhasePhaseregion
Significantregion Runtimesituation
FREQ=2GHz
FREQ=1.5GHz
Intra-phasedynamism
13
READEXIntra-phase TuningPlugin
Tuningpluginsupporting• Coreanduncorefrequencies,numthreads parameters,
applicationtuningparameters• ConfigurablesearchspaceviaREADEXConfigurationFile• Severalobjectivefunctions:energy,CPUenergy,EDP,EDP2,time• Severalsearchstrategies:exhaustive,individual,random,genetic
Approach1. Experimentwithdefaultconfiguration2. Experimentsforselectedconfigurations
• Configurationsetforphaseregion• Energyandtimemeasuredforallruntimesituations
3. Identificationofstaticbestforphaseandrts specific bestconfigurations
14
Pre-Computation ofTuningModel
PeriscopeTuningFramework
Analysis
PluginControl
PerformanceDatabase
SearchAlgorithms Experiments
Engine
READEXTuningPlugin
DTAManagement
DTAProcessManagement
RTSManagement
RTS
Database
ScenarioIdentification
ApplicationTuningModel
Score-P READEXRuntimeLibrary
OnlineAccessInterface
SubstratePlugin
Interface
Instrumen-tation
MetricPlugin
Interface
EnergyMeasurements
(HDEEM)
15
READEXRuntimeLibrary(RRL)• RuntimeApplicationTuningperformedbytheREADEXRuntimeLibrary.• TuningrequestsduringDesign Time AnalysisaresenttoRRL.• Alightweightlibrary
• Dynamicswitchingbetweendifferentconfigurationsatruntime.• ImplementedasasubstratepluginofScore-P.
• Developedby TUDandNTNU
16
RuntimeScenarioDetectionandSwitchingDecisionduringProductionRun
• DuringRuntimeApplicationTuning
• Scenarioclassification
17
• Switchingdecisioncomponent
• Manipulationoftuningparameters
BEM4I– Dynamicswitching– Energy
bladesummary,energyassemble_k
[J]assemble_v
[J]gmres_solve
[J]print_vtu
[J]main[J]
defaultsettings 1467 1484 2733 1142 6872statictuningonly 1876 1926 1306 402 5537dynamictuningonly 1348 1335 1150 268 4138static+ dynamictuning 1343 1322 1161 265 4125
staticsavings[%] -27.9% -29.8% 52.2% 64.8% +19.4%dynamicsavings[%] 8.4% 10.9% 57.5% 76.8% +40.0%static +dynamicsavings[%] 8.1% 10.0% 57.9% 76.5% +39.8%
"assemble_k":{"FREQUENCY":"23","NUM_THREADS":"24","UNCORE_FREQUENCY":”16”
},
"assemble_v":{"FREQUENCY":”25","NUM_THREADS":"24","UNCORE_FREQUENCY":”14”
},
"gmres_solve":{"FREQUENCY":”17","NUM_THREADS":”8","UNCORE_FREQUENCY":”22”
},
"print_vtu": {"FREQUENCY":"25","NUM_THREADS":”6","UNCORE_FREQUENCY":”24”
}
”static": {
"FREQUENCY": ”25", <--------- 2.5GHz
"NUM_THREADS": ”12", <--------- 12OpenMP threads
"UNCORE_FREQUENCY": ”22” <--------- 2.2GHz
},
18
http://bem4i.it4i.cz/
simpleFoam• strongscalingtest• Motorbike example
• optimumdetectedforeveryrun
• Static:11.7%• Dynamic:4.4%• Total:15.5%
• Dynamicsavingsincreaseswithhighernumberofnodes
Doesnotscaleanymore
ScalabilityTests– OpenFOAM– Analysis
Inter-phaseDynamism
20
All-to-allPerformance2048phases
PEPCBenchmarkoftheDEISABenchmarkSuite
Inter-PhaseAnalysis
• Variationofbehavioramongphases• Group/clusterphases• Selectabestconfigurationforeachclusterofphases
Whatdoweneed?• Identifiersofphasecharacteristics(PhaseIdentifiers)• Providedbyapplicationexpert(??)
21
Inter-PhaseAnalysis– Approach• Developedtheinterphase_tuning plugin• 3tuningsteps:
• Analysisstep:• Randomsearchstrategyisusedtocreatethesearchspace• Don’twanttoexplorethewholetuningspace• Clusterphasesandfindbestconfigurationforeachcluster
• Defaultstep:• Runtheapplicationforthedefaultsetting
• Verificationstep:• Selectthebestconfigurationforeach phase,asdeterminedforitscluster.
• Aggregatethesavings overthephases
22
INDEED
23
• 3clustersidentified• Noisepointsmarkedinred
ClusterPrediction inRRL• Howtohandlephase identifiers topredict clusters?
• Callpath ofanrts now includes theclusternumber• Solution:
• Addthe cluster number as auser parameter• AddPAPIeventstomeasure L3_TCM,Total_Instr andconditionalbranch instructions
• Predict the cluster of the upcomingphase• Iftheclusterwasmispredicted forthephase,correct itattheendofthephase
24
…SCOREP_OA_PHASE_BEGIN()
SCOREP_USER_PARAMETER_INT64(cluster, predict_cluster())…
SCOREP_OA_PHASE_END()
Evaluationofthereadex_interphase plugin
25
• Performedontwoapplications:miniMD,INDEED• ExperimentsconductedontheTaurusHPCsystemattheZIHinDresden• Eachnodecontainstwo12-coreIntelXeonCPUsE5-2680v3(IntelHaswell family)
• RunswithadefaultCPUfrequencyof2.5GHz,uncore frequencyof3GHz
• EnergymeasurementsprovidedonTaurusviaHDEEMmeasurementhardware
• Providesprocessorandbladeenergymeasurements
miniMD
26
• Lightweight,parallelmoleculardynamicssimulationcode
• PerformsmoleculardynamicssimulationofaLennard-JonesEmbeddedAtomModel(EAM)system
• WritteninC++• Providesinputfiletospecifyproblemsize,temperature,timesteps
• EvaluationofDTA:• Hybrid(MPI+OpenMP)AVXvectorized version
• Problemsizeof50fortheLennard-Jonessystem.
miniMD (2)
ParCo'17,September13,Bologna
27
• 6clustersidentified• Noisepointsmarkedinred
INDEED
ParCo'17,September13,Bologna
28
• INDEEDperformssheetmetalformingsimulationsoftoolswithdifferentgeometriesmovingtowardsastationaryworkpiece
• Contactbetweentoolandworkpiece causes:
• Adaptivemeshrefinement• Increaseinnumberoffiniteelementnodes
• Increasingcomputationalcost
• Timeloopcomputesthesolutiontoasystemofequationsuntilequilibriumisreached.
• OpenMP versionevaluated
INDEED(2)
ParCo'17,September13,Bologna
29
• 3clustersidentified• Noisepointsmarkedinred
EnergySavings
30
Application Phasebestfortherts’s(%)
rts best forthe rts’s(%)
miniMD 14.51 0.03
INDEED 9.24 10.45
• miniMDrecordslowerdynamicsavings• miniMD hasonlytwosignificantregions• Oneregioniscalledonlyonceduringtheentireapplicationrun
• Betterstaticanddynamicsavingsfortherts’s ofINDEED• INDEEDhasninesignificantregions• Providesmorepotentialfordynamism
ApplicationTuningParameters(ATP)
• Exploit thedynamism incharacteristics throughtheuseofdifferentcodepaths(e.g.preconditioners)
• Identifythecontrolvariablesresponsibleforcontrolflowswitching.
• providesAPIstoannotatethesourcecode
l FiniteElement(FEM)toolsanddomaindecompositionbasedFiniteElementTearingandInterconnect(FETI)solverl Contains aprojectedconjugategradient(PCG)solver.l Convergencecanbeimprovedbyseveralpreconditioners.
l Evaluatedpreconditioners onastructuralmechanicsproblemwith23millionunknownsl Onasinglecomputenodewith24MPIprocesses.
Preconditioner #iterations 1iteration Solution
None 172 125 ms 31.6 J 21.36 s 5 501.31 JWeightfunction 100 130+2 ms 32.3+0.53 J 12.89 s 3 284.07 J
Lumped 45 130+10 ms 32.3+3.86 J 6.32 s 1 636.11 JLightdirichlet 39 130+10 ms 32.3+3.74 J 5.46 s 1 409.82 J
Dirichlet 30 130+80 ms 32.3+20.62 J 6.34 s 1 594.50 J
15.9s 4091.5j
EvaluationofATP: Espreso
Configuration VariableTuning
• Recently added readex_configuration tuning plugin• ApplicationConfiguration Parameterswithsearchspace
• Replace selected valueininput files andrerun theapplication
33
InputIdentifiers
• Whataboutdifferentinputs?• AnnotationwithInputIdentifiers likeproblem size• Apply DesignTimeAnalysisandmerge generatedtuning models
• Selection atruntimebased ontheinput identifiers
34
Summary
• Energy-efficiencytuning• DesignTimeAnalysis– TuningModel– RuntimeTuning
• Supportfor• Intra-phase dynamism• Inter-phase dynamism• ApplicationTuningParameters• ApplicationConfigurationParameters• Differentinput configurations
• Based on• Periscope TuningFramework• Score-PMonitoring
35
Thankyou!Questions?
36