TRANSCRIPT
Intelligent System for AI
NTHU CS, 周志遠 (Jerry Chou)
2018/5/19 @ AII Workshop
• 周志遠 (Jerry Chou) – Email: [email protected] – Large-Scale System Architecture (LSA) Lab
• Experience:
– Associate Professor, CS, National Tsing Hua University, 2016–present
– Assistant Professor, CS, National Tsing Hua University, 2011–2016
– Engineer, Lawrence National Laboratory (USA), 2010–2011
– Ph.D., University of California, San Diego (UCSD), 2009
• Research interests: cloud computing, distributed systems, high-performance computing, big data processing
AI for System: intelligent resource management & system administration
System for AI: high-throughput & cost-effective infrastructure
• Stack: Service Interface → Resource Orchestration → Hardware Virtualization
• Resource layer: VM, bare metal, container; CPU, GPU, FPGA, Xeon Phi
• Why cost matters: a DGX-1 costs 150,000 USD; ResNet-50 was trained on 256 GPUs in one hour [Facebook 2017]
Systems for AI
• Public cloud
– Pros: managed service, pay-as-you-use, availability, reliability
– Cons: cost (10K TWD for 256 GPU-hours), data privacy and transfer
• Private cloud
– Pros: control & efficiency, security & privacy, customization
– Cons: complex & virtualized HW infrastructure, diverse SW deployment, resource management
Key Challenges of AI Systems
• System infrastructure: VM + CPU → Container + GPU
• Training job execution: static single-instance execution → elastic distributed execution
Container-based GPU Cloud
• Why Container?
– Lightweight, low performance overhead
– High deployment density
– Execution environment isolation
• Benchmark: TensorFlow on varied resource orchestration (bare metal, container, VM) and execution environments (single, distributed, multi-tenant)
• Findings:
– Single-instance & distributed: containers deliver close to bare-metal performance in a dedicated resource environment
– Multi-tenant: containers lack QoS control for PCIe and GPU, and a GPU may not be fully utilized by a single job
Container-based GPU Cloud
• Why Kubernetes (container orchestrator)?
– Automates deployment, scaling, and (lifecycle & resource) management of containerized applications
• Current solutions & limitations
– NVIDIA-Docker: exposes GPU devices to containers; allocation dedicates whole GPUs to a container
– K8s resource limits: control memory and CPU usage, but GPU is not a manageable resource yet
– KubeFlow: a TF-operator that deploys a containerized TF job as a set of K8s applications, with naïve round-robin scheduling and no scaling or management
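The dedicated-GPU allocation model above can be pictured with a pod manifest. This is a minimal sketch, assuming the NVIDIA device plugin is installed so `nvidia.com/gpu` is a schedulable extended resource; the pod name and image are placeholders, not from the talk:

```python
# Minimal Kubernetes pod manifest requesting one dedicated GPU.
# With the NVIDIA device plugin, `nvidia.com/gpu` appears as an
# extended resource: it can only be requested in whole units and,
# unlike CPU/memory, cannot be overcommitted or shared -- which is
# exactly the limitation the talk points out.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tf-worker-0"},                  # placeholder name
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "tensorflow/tensorflow:latest-gpu",  # placeholder image
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 1,   # whole-GPU granularity only
                    "cpu": "4",
                    "memory": "8Gi",
                }
            },
        }]
    },
}
```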
Proposed Solution: Multi-tenant GPU Controller
• Objective
– Treat GPU as a first-class resource like CPU
– Allow users to specify max and min requirements for GPU utilization and memory usage
• Approach
– Intercept CUDA driver & runtime API calls
– Forward requests to a centralized scheduler for GPU utilization and memory control
– Similar to ConVGPU, but focused more on GPU utilization control and GPU assignment
[Figure: per-container GPU usage shares, e.g. Container A 60%/40%, Container B 50%/30%]
[Architecture: an NVIDIA Docker container issues CUDA API calls through a CUDA API wrapper (pass-through for most calls); memory & kernel calls are forwarded as requested/approved API calls to the GPU controller, which reports usage info to the K8s scheduler and handles GPU assignment against the CUDA driver]
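The interception idea can be sketched as a wrapper that sits between the application and the driver, forwarding allocation requests to a central scheduler that enforces per-container quotas. All class and method names here are hypothetical illustrations; the real controller wraps the CUDA driver/runtime C API, not Python calls:

```python
class CentralScheduler:
    """Hypothetical centralized scheduler: tracks per-container GPU quotas."""
    def __init__(self):
        self.quota = {}   # container id -> max GPU memory (bytes)
        self.used = {}    # container id -> currently allocated bytes

    def set_quota(self, cid, max_bytes):
        self.quota[cid] = max_bytes
        self.used.setdefault(cid, 0)

    def request_alloc(self, cid, nbytes):
        # Approve a cudaMalloc-like request only if it stays within quota.
        if self.used[cid] + nbytes > self.quota[cid]:
            return False
        self.used[cid] += nbytes
        return True


class CudaApiWrapper:
    """Intercepts memory/kernel calls and forwards them for approval;
    all other CUDA calls would simply pass through to the driver."""
    def __init__(self, cid, scheduler):
        self.cid, self.scheduler = cid, scheduler

    def cuda_malloc(self, nbytes):
        if not self.scheduler.request_alloc(self.cid, nbytes):
            raise MemoryError("GPU memory quota exceeded for " + self.cid)
        return object()  # stand-in for a device pointer


sched = CentralScheduler()
sched.set_quota("containerA", 6 << 30)   # illustrative 6 GiB cap
wrapper = CudaApiWrapper("containerA", sched)
buf = wrapper.cuda_malloc(4 << 30)       # approved: within quota
```

The same request/approve path would also throttle kernel launches to cap a container's GPU utilization share.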
Proposed Solution: Elastic-KubeFlow
• An enhanced K8s TF-operator over KubeFlow:
– KubeFlow: round-robin deployment scheduling, no auto-scaling
– Elastic-KubeFlow: performance-aware placement, plus auto-scaling (scale up when utilization is low; scale down when wait time is high)
[Figure: TF-operator with a job queue. Under low system load, Jobs 1–2 each run with multiple workers; under high load, their workers are scaled down so Jobs 3–4 also run, while Jobs 5–6 wait in the queue]
• Results: total job runtime 4:11:44 → 3:16:54 (22% reduction); total job wait time 4:45:02 → 2:57:37 (38% reduction)
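The scaling rule ("scale up when utilization is low, scale down when wait time is high") reduces to a small decision function. The thresholds and argument names below are illustrative assumptions, not values from the talk:

```python
def scaling_decision(cluster_util, avg_wait_sec,
                     low_util=0.5, high_wait=300.0):
    """Decide how the operator should adjust elastic jobs' worker counts.

    - Idle GPUs (low cluster utilization): grow running jobs to use them.
    - Long queue wait: shrink running jobs so queued jobs can be admitted.
    - Otherwise: leave the current allocation alone.
    """
    if cluster_util < low_util:
        return "scale-up"
    if avg_wait_sec > high_wait:
        return "scale-down"
    return "hold"


# A controller loop would periodically sample both metrics and act:
action = scaling_decision(cluster_util=0.3, avg_wait_sec=10.0)
```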
Distributed Deep Learning
• Model parallelism
– Within a node: shared memory, auto-managed by the framework
– Across nodes: message passing, model rewritten by developers
• Data parallelism
– Parameter server: asynchronous centralized communication → faster convergence time, but higher network BW requirement; the main strategy in TF
– Allreduce: synchronous P2P communication → higher latency delay, but more balanced network traffic (avoids hotspots); recently optimized implementation by Horovod
[Figures: model parallelism; data parallelism (PS); data parallelism (P2P)]
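The two data-parallel strategies can be contrasted with a toy gradient-aggregation sketch. NumPy arrays stand in for per-worker gradients; there is no real networking here, and the ring version only mimics the result, not the chunked communication pattern:

```python
import numpy as np

def parameter_server(grads):
    """Centralized: every worker sends its gradient to one server,
    which averages them. Simple and asynchronous-friendly, but the
    server's link becomes a bandwidth hotspot as workers scale."""
    return sum(grads) / len(grads)

def ring_allreduce(grads):
    """Decentralized P2P: workers exchange gradient chunks with ring
    neighbors, so per-link traffic stays roughly constant with scale.
    This toy version just computes the same averaged result."""
    total = np.zeros_like(grads[0])
    for g in grads:      # reduce phase (conceptually chunked around the ring)
        total += g
    return total / len(grads)   # allgather phase would broadcast this back

# Four workers with synthetic gradients 0, 1, 2, 3:
grads = [np.ones(4) * i for i in range(4)]
avg_ps = parameter_server(grads)   # -> [1.5, 1.5, 1.5, 1.5]
avg_ar = ring_allreduce(grads)     # identical numerics, different traffic
```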
Distributed Model Training
• Why distributed model training?
– Shorter training time
– Fully utilize computing resources
• But: non-negligible overhead, and more tuning knobs (batch size, learning rate, #PS)
[Reference: Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability]
Proposed Solution: Elastic-TensorFlow
• Why dynamically add/remove workers from a training job without checkpoint-restart?
– Auto-tune the PS/worker ratio at runtime
– Reach the desired performance at minimum cost
– Maximize system utilization & throughput (combined with our Elastic-KubeFlow controller)
[References: Distributed training strategies for a computer vision deep learning algorithm on GPU cluster; http://blog.kubernetes.io/2017/12/paddle-paddle-fluid-elastic-learning.html]
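Checkpoint-free elasticity amounts to changing the job's worker membership between training steps instead of tearing the job down and restoring from a checkpoint. The sketch below is schematic with invented names; a real implementation must also rebalance parameters and data shards when membership changes:

```python
class ElasticJob:
    """Toy model of a training job whose worker set can change at runtime."""
    def __init__(self, ps, workers):
        self.ps = list(ps)            # parameter-server tasks
        self.workers = list(workers)  # worker tasks
        self.step = 0

    def train_step(self):
        # Each currently-live worker processes one (synthetic) mini-batch.
        self.step += 1
        return {w: f"batch-{self.step}" for w in self.workers}

    def add_worker(self, name):
        # Join between steps: no checkpoint-restart, training continues.
        self.workers.append(name)

    def remove_worker(self, name):
        # Leave between steps, e.g. when the cluster scales the job down.
        self.workers.remove(name)


job = ElasticJob(ps=["ps0"], workers=["w0", "w1"])
job.train_step()
job.add_worker("w2")          # scale up without restarting
assigned = job.train_step()   # three workers now share the next step
```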
AI for Systems
• Time prediction for optimizing job execution
– Apply FCN and RNN models to complex parallel DAGs
• Anomaly & failure prediction for minimizing cost
– A DNN alone might not be enough:
• SVM for rare-class classification
• Bayesian networks or decision trees for root-cause diagnosis
• Probability distributions for system-metrics prediction
• Auto-scaling & scheduling for maximizing system performance
– Apply reinforcement learning: A3S, deep Q-learning
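For the rare-class (failure) case, an SVM is typically made useful by reweighting classes so the scarce failure samples are not drowned out by healthy ones. A minimal scikit-learn sketch on synthetic data; all numbers and feature meanings are illustrative, not from the talk:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic system metrics: 95 healthy samples, only 5 failures.
healthy = rng.normal(0.0, 1.0, size=(95, 3))
failure = rng.normal(4.0, 1.0, size=(5, 3))
X = np.vstack([healthy, failure])
y = np.array([0] * 95 + [1] * 5)

# class_weight="balanced" scales the misclassification penalty inversely
# to class frequency, so the 5 failures carry as much weight as the 95
# healthy points -- the standard rare-class trick for SVMs.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
pred = clf.predict([[4.0, 4.0, 4.0], [0.0, 0.0, 0.0]])
```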
Time Prediction of Hadoop Execution
• Challenges: a parallel execution job; over 100 execution configurations; cloud platforms provide varied compute instance types; inexperienced users performing performance optimization
• Goal: learn 𝑓(job profile, resource spec, exec config) = job execution time
• Approach (4 steps):
– Step 1: job profiling (collect job features)
– Step 2: job classification (improve prediction accuracy)
– Step 3: model prediction (fully-connected NN)
– Step 4: optimization (search for optimal configurations)
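Steps 3 and 4 can be sketched with scikit-learn: fit a small fully-connected regressor to map features onto execution time, then search candidate configurations for the fastest prediction. The features and the "ground truth" formula below are synthetic placeholders, not the talk's actual profile data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Synthetic features: [input size, #map slots, #reduce slots, instance speed]
X = rng.uniform(1, 10, size=(200, 4))
# Synthetic runtime: grows with input size, shrinks with slots and speed.
y = 60 * X[:, 0] / (X[:, 1] + X[:, 2]) / X[:, 3]

# Step 3: a small fully-connected NN as the time-prediction model.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000,
                     random_state=0).fit(X, y)

# Step 4: search candidate exec configs for the lowest predicted time.
candidates = rng.uniform(1, 10, size=(50, 4))
best = candidates[np.argmin(model.predict(candidates))]
```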
Evaluation Results
• Workload from HiBench, a Hadoop benchmark suite
• Prediction accuracy: more accurate time prediction than traditional ML methods (error rates: decision tree 16%, SVM 12%, NN 8%)
• Performance improvement: 10~50% by choosing the proper execution configurations
Time Prediction of Hive Query
• Hive: a query engine on Hadoop
– Complex workflow represented by a DAG
[Figure: a query statement is translated into different DAG execution plans (with job dependencies), one of which is executed]
Time Prediction of Hive Query
• RNN model
– Serialized DAG workflow with arbitrary job-sequence length
– Stored state captures job-dependency effects
• Two-level prediction & optimization
– Query level (Hive) and job level (Hadoop)
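Serializing the DAG workflow amounts to flattening the query's job graph into a dependency-respecting sequence (e.g., a topological order), whose per-job feature vectors the RNN consumes one step at a time. A sketch of that serialization step, using a hypothetical Hive plan:

```python
from collections import deque

def serialize_dag(deps):
    """Topologically sort a job DAG given {job: [prerequisite jobs]},
    yielding the arbitrary-length job sequence an RNN can consume."""
    indeg = {j: len(p) for j, p in deps.items()}
    children = {j: [] for j in deps}
    for job, prereqs in deps.items():
        for p in prereqs:
            children[p].append(job)
    # Start from jobs with no prerequisites (sorted for determinism).
    ready = deque(sorted(j for j, d in indeg.items() if d == 0))
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        for c in children[j]:       # release jobs whose deps are all done
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

# Hypothetical plan: two table scans feed a join, which feeds an aggregate.
plan = {"scan_a": [], "scan_b": [], "join": ["scan_a", "scan_b"],
        "agg": ["join"]}
seq = serialize_dag(plan)   # -> ["scan_a", "scan_b", "join", "agg"]
```

Each job name in `seq` would then be replaced by that job's feature vector before being fed to the RNN, whose hidden state carries the dependency effects forward.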
Evaluation Results
• Workload from the TPC-H benchmarks
• Prediction accuracy: RNN has the lowest error rate compared to DNN and other methods
• Performance improvement: over 50% when both Hadoop and Hive configurations are optimized