TRANSCRIPT
Intelligent System for AI
NTHU CS, 周志遠 (Jerry Chou)
2018/5/19 @ AII Workshop
• 周志遠 (Jerry Chou) – Email: [email protected] – Large-Scale System Architecture (LSA) Lab
• Experience:
– Associate Professor, CS, National Tsing Hua University, 2016–present
– Assistant Professor, CS, National Tsing Hua University, 2011–2016
– Engineer, Lawrence National Laboratory (USA), 2010–2011
– Ph.D., University of California, San Diego (UCSD), 2009
• Research interests: cloud computing, distributed systems, high-performance computing, big data processing
AI for System: intelligent resource management & system administration
System for AI: high-throughput & cost-effective infrastructure
• Stack: Service Interface → Resource Orchestration → Hardware Virtualization
• Resource layer: VM, bare metal, container; CPU, GPU, FPGA, Xeon Phi
• Why cost matters: a DGX-1 costs 150,000 USD; ResNet-50 was trained on 256 GPUs in one hour [Facebook 2017]
Systems for AI
• Public cloud
– Pros: managed service, pay-as-you-use, availability, reliability
– Cons: cost (10K TWD for 256 GPU-hours), data privacy and transfer
• Private cloud
– Pros: control & efficiency, security & privacy, customization
– Cons: complex & virtualized HW infrastructure, diverse SW deployment, resource management
Key Challenges of AI Systems
• System infrastructure: VM + CPU → Container + GPU
• Training job execution: static single-instance execution → elastic distributed execution
Container-based GPU Cloud
• Why Container?
– Lightweight, low performance overhead
– High deployment density
– Execution environment isolation
• Benchmark: TensorFlow on varied resource orchestration (bare metal, container, VM) and execution environments (single, distributed, multi-tenant)
• Findings:
– Single-instance & distributed: containers deliver close to bare-metal performance in a dedicated resource environment
– Multi-tenant: containers lack QoS control for PCIe and GPU, and a GPU may not be fully utilized by a single job
Container-based GPU Cloud
• Why Kubernetes (container orchestrator)?
– Automates deployment, scaling, and (lifecycle & resource) management of containerized applications
• Current solutions & limitations
– NVIDIA-Docker: exposes GPU devices to containers; allocation dedicates whole GPUs to a container
– K8s resource limits: control memory and CPU usage, but GPU is not a manageable resource yet
– KubeFlow: a TF-operator that deploys a containerized TF job as a set of K8s applications, with naïve round-robin scheduling and no scaling or management
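The dedicated-GPU allocation model above can be pictured with a pod manifest. This is a minimal sketch, assuming the NVIDIA device plugin is installed so `nvidia.com/gpu` is a schedulable extended resource; the pod name and image are placeholders, not from the talk:

```python
# Minimal Kubernetes pod manifest requesting one dedicated GPU.
# With the NVIDIA device plugin, `nvidia.com/gpu` appears as an
# extended resource: it can only be requested in whole units and,
# unlike CPU/memory, cannot be overcommitted or shared -- which is
# exactly the limitation the talk points out.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tf-worker-0"},                  # placeholder name
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "tensorflow/tensorflow:latest-gpu",  # placeholder image
            "resources": {
                "limits": {
                    "nvidia.com/gpu": 1,   # whole-GPU granularity only
                    "cpu": "4",
                    "memory": "8Gi",
                }
            },
        }]
    },
}
```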
Proposed Solution: Multi-tenant GPU Controller
• Objective
– Treat GPU as a first-class resource like CPU
– Allow users to specify max and min requirements for GPU utilization and memory usage
• Approach
– Intercept CUDA driver & runtime API calls
– Forward requests to a centralized scheduler for GPU utilization and memory control
– Similar to ConVGPU, but focused more on GPU utilization control and GPU assignment
[Figure: per-container GPU usage shares, e.g. Container A 60%/40%, Container B 50%/30%]
[Architecture: an NVIDIA Docker container issues CUDA API calls through a CUDA API wrapper (pass-through for most calls); memory & kernel calls are forwarded as requested/approved API calls to the GPU controller, which reports usage info to the K8s scheduler and handles GPU assignment against the CUDA driver]
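The interception idea can be sketched as a wrapper that sits between the application and the driver, forwarding allocation requests to a central scheduler that enforces per-container quotas. All class and method names here are hypothetical illustrations; the real controller wraps the CUDA driver/runtime C API, not Python calls:

```python
class CentralScheduler:
    """Hypothetical centralized scheduler: tracks per-container GPU quotas."""
    def __init__(self):
        self.quota = {}   # container id -> max GPU memory (bytes)
        self.used = {}    # container id -> currently allocated bytes

    def set_quota(self, cid, max_bytes):
        self.quota[cid] = max_bytes
        self.used.setdefault(cid, 0)

    def request_alloc(self, cid, nbytes):
        # Approve a cudaMalloc-like request only if it stays within quota.
        if self.used[cid] + nbytes > self.quota[cid]:
            return False
        self.used[cid] += nbytes
        return True


class CudaApiWrapper:
    """Intercepts memory/kernel calls and forwards them for approval;
    all other CUDA calls would simply pass through to the driver."""
    def __init__(self, cid, scheduler):
        self.cid, self.scheduler = cid, scheduler

    def cuda_malloc(self, nbytes):
        if not self.scheduler.request_alloc(self.cid, nbytes):
            raise MemoryError("GPU memory quota exceeded for " + self.cid)
        return object()  # stand-in for a device pointer


sched = CentralScheduler()
sched.set_quota("containerA", 6 << 30)   # illustrative 6 GiB cap
wrapper = CudaApiWrapper("containerA", sched)
buf = wrapper.cuda_malloc(4 << 30)       # approved: within quota
```

The same request/approve path would also throttle kernel launches to cap a container's GPU utilization share.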
Proposed Solution: Elastic-KubeFlow
• An enhanced K8s TF-operator over KubeFlow:
– KubeFlow: round-robin deployment scheduling, no auto-scaling
– Elastic-KubeFlow: performance-aware placement, plus auto-scaling (scale up when utilization is low; scale down when wait time is high)
[Figure: TF-operator with a job queue. Under low system load, Jobs 1–2 each run with multiple workers; under high load, their workers are scaled down so Jobs 3–4 also run, while Jobs 5–6 wait in the queue]
• Results: total job runtime 4:11:44 → 3:16:54 (22% reduction); total job wait time 4:45:02 → 2:57:37 (38% reduction)
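The scaling rule ("scale up when utilization is low, scale down when wait time is high") reduces to a small decision function. The thresholds and argument names below are illustrative assumptions, not values from the talk:

```python
def scaling_decision(cluster_util, avg_wait_sec,
                     low_util=0.5, high_wait=300.0):
    """Decide how the operator should adjust elastic jobs' worker counts.

    - Idle GPUs (low cluster utilization): grow running jobs to use them.
    - Long queue wait: shrink running jobs so queued jobs can be admitted.
    - Otherwise: leave the current allocation alone.
    """
    if cluster_util < low_util:
        return "scale-up"
    if avg_wait_sec > high_wait:
        return "scale-down"
    return "hold"


# A controller loop would periodically sample both metrics and act:
action = scaling_decision(cluster_util=0.3, avg_wait_sec=10.0)
```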
Distributed Deep Learning
• Model parallelism
– Within a node: shared memory, auto-managed by the framework
– Across nodes: message passing, model rewritten by developers
• Data parallelism
– Parameter server: asynchronous centralized communication → faster convergence time, but higher network BW requirement; the main strategy in TF
– Allreduce: synchronous P2P communication → higher latency delay, but more balanced network traffic (avoids hotspots); recently optimized implementation by Horovod
[Figures: model parallelism; data parallelism (PS); data parallelism (P2P)]
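The two data-parallel strategies can be contrasted with a toy gradient-aggregation sketch. NumPy arrays stand in for per-worker gradients; there is no real networking here, and the ring version only mimics the result, not the chunked communication pattern:

```python
import numpy as np

def parameter_server(grads):
    """Centralized: every worker sends its gradient to one server,
    which averages them. Simple and asynchronous-friendly, but the
    server's link becomes a bandwidth hotspot as workers scale."""
    return sum(grads) / len(grads)

def ring_allreduce(grads):
    """Decentralized P2P: workers exchange gradient chunks with ring
    neighbors, so per-link traffic stays roughly constant with scale.
    This toy version just computes the same averaged result."""
    total = np.zeros_like(grads[0])
    for g in grads:      # reduce phase (conceptually chunked around the ring)
        total += g
    return total / len(grads)   # allgather phase would broadcast this back

# Four workers with synthetic gradients 0, 1, 2, 3:
grads = [np.ones(4) * i for i in range(4)]
avg_ps = parameter_server(grads)   # -> [1.5, 1.5, 1.5, 1.5]
avg_ar = ring_allreduce(grads)     # identical numerics, different traffic
```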
Distributed Model Training
• Why distributed model training?
– Shorter training time
– Fully utilize computing resources
• But: non-negligible overhead, and more tuning knobs (batch size, learning rate, #PS)
[Reference: Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability]
Proposed Solution: Elastic-TensorFlow
• Why dynamically add/remove workers from a training job without checkpoint-restart?
– Auto-tune the PS/worker ratio at runtime
– Reach the desired performance at minimum cost
– Maximize system utilization & throughput (combined with our Elastic-KubeFlow controller)
[References: Distributed training strategies for a computer vision deep learning algorithm on GPU cluster; http://blog.kubernetes.io/2017/12/paddle-paddle-fluid-elastic-learning.html]
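Checkpoint-free elasticity amounts to changing the job's worker membership between training steps instead of tearing the job down and restoring from a checkpoint. The sketch below is schematic with invented names; a real implementation must also rebalance parameters and data shards when membership changes:

```python
class ElasticJob:
    """Toy model of a training job whose worker set can change at runtime."""
    def __init__(self, ps, workers):
        self.ps = list(ps)            # parameter-server tasks
        self.workers = list(workers)  # worker tasks
        self.step = 0

    def train_step(self):
        # Each currently-live worker processes one (synthetic) mini-batch.
        self.step += 1
        return {w: f"batch-{self.step}" for w in self.workers}

    def add_worker(self, name):
        # Join between steps: no checkpoint-restart, training continues.
        self.workers.append(name)

    def remove_worker(self, name):
        # Leave between steps, e.g. when the cluster scales the job down.
        self.workers.remove(name)


job = ElasticJob(ps=["ps0"], workers=["w0", "w1"])
job.train_step()
job.add_worker("w2")          # scale up without restarting
assigned = job.train_step()   # three workers now share the next step
```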
AI for Systems
• Time prediction for optimizing job execution
– Apply FCN and RNN models to complex parallel DAGs
• Anomaly & failure prediction for minimizing cost
– A DNN alone might not be enough:
• SVM for rare-class classification
• Bayesian networks or decision trees for root-cause diagnosis
• Probability distributions for system-metrics prediction
• Auto-scaling & scheduling for maximizing system performance
– Apply reinforcement learning: A3S, deep Q-learning
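For the rare-class (failure) case, an SVM is typically made useful by reweighting classes so the scarce failure samples are not drowned out by healthy ones. A minimal scikit-learn sketch on synthetic data; all numbers and feature meanings are illustrative, not from the talk:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic system metrics: 95 healthy samples, only 5 failures.
healthy = rng.normal(0.0, 1.0, size=(95, 3))
failure = rng.normal(4.0, 1.0, size=(5, 3))
X = np.vstack([healthy, failure])
y = np.array([0] * 95 + [1] * 5)

# class_weight="balanced" scales the misclassification penalty inversely
# to class frequency, so the 5 failures carry as much weight as the 95
# healthy points -- the standard rare-class trick for SVMs.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
pred = clf.predict([[4.0, 4.0, 4.0], [0.0, 0.0, 0.0]])
```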
Time Prediction of Hadoop Execution
• Challenges: a parallel execution job; over 100 execution configurations; cloud platforms provide varied compute instance types; inexperienced users performing performance optimization
• Goal: learn 𝑓(job profile, resource spec, exec config) = job execution time
• Approach (4 steps):
– Step 1: job profiling (collect job features)
– Step 2: job classification (improve prediction accuracy)
– Step 3: model prediction (fully-connected NN)
– Step 4: optimization (search for optimal configurations)
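Steps 3 and 4 can be sketched with scikit-learn: fit a small fully-connected regressor to map features onto execution time, then search candidate configurations for the fastest prediction. The features and the "ground truth" formula below are synthetic placeholders, not the talk's actual profile data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Synthetic features: [input size, #map slots, #reduce slots, instance speed]
X = rng.uniform(1, 10, size=(200, 4))
# Synthetic runtime: grows with input size, shrinks with slots and speed.
y = 60 * X[:, 0] / (X[:, 1] + X[:, 2]) / X[:, 3]

# Step 3: a small fully-connected NN as the time-prediction model.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=3000,
                     random_state=0).fit(X, y)

# Step 4: search candidate exec configs for the lowest predicted time.
candidates = rng.uniform(1, 10, size=(50, 4))
best = candidates[np.argmin(model.predict(candidates))]
```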
Evaluation Results
• Workload from HiBench, a Hadoop benchmark suite
• Prediction accuracy: more accurate time prediction than traditional ML methods (error rates: decision tree 16%, SVM 12%, NN 8%)
• Performance improvement: 10~50% by choosing the proper execution configurations
Time Prediction of Hive Query
• Hive: a query engine on Hadoop
– Complex workflow represented by a DAG
[Figure: a query statement is translated into different DAG execution plans (with job dependencies), one of which is executed]
Time Prediction of Hive Query
• RNN model
– Serialized DAG workflow with arbitrary job-sequence length
– Stored state captures job-dependency effects
• Two-level prediction & optimization
– Query level (Hive) and job level (Hadoop)
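Serializing the DAG workflow amounts to flattening the query's job graph into a dependency-respecting sequence (e.g., a topological order), whose per-job feature vectors the RNN consumes one step at a time. A sketch of that serialization step, using a hypothetical Hive plan:

```python
from collections import deque

def serialize_dag(deps):
    """Topologically sort a job DAG given {job: [prerequisite jobs]},
    yielding the arbitrary-length job sequence an RNN can consume."""
    indeg = {j: len(p) for j, p in deps.items()}
    children = {j: [] for j in deps}
    for job, prereqs in deps.items():
        for p in prereqs:
            children[p].append(job)
    # Start from jobs with no prerequisites (sorted for determinism).
    ready = deque(sorted(j for j, d in indeg.items() if d == 0))
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        for c in children[j]:       # release jobs whose deps are all done
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

# Hypothetical plan: two table scans feed a join, which feeds an aggregate.
plan = {"scan_a": [], "scan_b": [], "join": ["scan_a", "scan_b"],
        "agg": ["join"]}
seq = serialize_dag(plan)   # -> ["scan_a", "scan_b", "join", "agg"]
```

Each job name in `seq` would then be replaced by that job's feature vector before being fed to the RNN, whose hidden state carries the dependency effects forward.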
Evaluation Results
• Workload from the TPC-H benchmarks
• Prediction accuracy: RNN has the lowest error rate compared to DNN and other methods
• Performance improvement: over 50% when both Hadoop and Hive configurations are optimized