next-generation apache hadoop - usenix · 2019-02-26 · © cloudera, inc. all rights reserved. 6...

43
Open problems in distributed storage and resource management Karthik Kambatla | [email protected] Andrew Wang | [email protected] Next-Generation Apache Hadoop

Upload: others

Post on 14-Mar-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

1©Cloudera,Inc.Allrightsreserved.

Openproblemsindistributedstorageandresourcemanagement

Karthik Kambatla |[email protected]|[email protected]

Next-GenerationApacheHadoop

Page 2: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

2©Cloudera,Inc.Allrightsreserved.

???

Page 3: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

3©Cloudera,Inc.Allrightsreserved.

Page 4: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

4©Cloudera,Inc.Allrightsreserved.

Clouderaperspective

• Hadoopsoftwarestackisrelativelymature• Seenbroaduptakeinmanyindustries•Widervarietyofworkloads• Largerandlargeramountsofdata

• Newdatacenterhardwaretrendsonthehorizon

• Goodtimetorevisitoriginaldesignassumptions• Collaboratewithacademicsontheseproblems

Page 5: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

5©Cloudera,Inc.Allrightsreserved.

Scalability

Page 6: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

6©Cloudera,Inc.Allrightsreserved.

Master

Worker

Worker

Worker

Worker

GlobalviewofclusterCoordinatesworkers

SimpleDoworkScale-out

Page 7: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

7©Cloudera,Inc.Allrightsreserved.

NN

DN

DN

DN

DN

FilemetadataFile->blockmapping

StoreblocksServedatareads/writes

Bottleneck

Page 8: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

8©Cloudera,Inc.Allrightsreserved.

Project Improvement Cost(months)

Multiple volumesperNN Operational 6Splitnamespace andblockmanagementlocking

2xRPC 12

Fine-grained lockingofnamespace 2xRPC 6Pageable namespace 2x objectcount 6Persistentblockspace Operational 6Blockmanagement asaservice 2xobjectcount 12+Volumemigration Operational 12

VerticallyscalingHDFS

Page 9: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

9©Cloudera,Inc.Allrightsreserved.

Project Improvement Cost(months)

Multiple volumesperNN Operational 6Splitnamespace andblockmanagementlocking

2xRPC 12

Fine-grained lockingofnamespace 2xRPC 6Pageable namespace 2x objectcount 6Persistentblockspace Operational 6Blockmanagement asaservice 2xobjectcount 12+Volumemigration Operational 12

Scarychanges

Page 10: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

10©Cloudera,Inc.Allrightsreserved.

Project Improvement Cost(months)

Multiple volumesperNN Operational 6Splitnamespace andblockmanagementlocking

2xRPC 12

Fine-grained lockingofnamespace 2xRPC 6Pageable namespace 2x objectcount 6Persistentblockspace Operational 6Blockmanagement asaservice 2xobjectcount 12+Volumemigration Operational 12

Incremental

Page 11: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

11©Cloudera,Inc.Allrightsreserved.

Project Improvement Cost(months)

Multiple volumesperNN Operational 6Splitnamespace andblockmanagementlocking

2xRPC 12

Fine-grained lockingofnamespace 2xRPC 6Pageable namespace 2x objectcount 6Persistentblockspace Operational 6Blockmanagement asaservice 2xobjectcount 12+Volumemigration Operational 12

Yearsofwork

Page 12: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

12©Cloudera,Inc.Allrightsreserved.

Hardwaretrendsonthehorizon

2006 2016 2021

HDDcapacity(TB) 0.2 2 20HDDspeed(MB/s) 90 110 140Networkspeed(Gb/s) 0.1 10 40

FewerIOPS/GB

HDDlocalityirrelevant

Page 13: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

13©Cloudera,Inc.Allrightsreserved.

Afreshlook

• Designedforanalyticworkloads• Scaleshorizontally(exabyte scale)• Operationallyrobust• Designedforfuturehardwaretrends

Page 14: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

14©Cloudera,Inc.Allrightsreserved.

Blobstore

• Usersthinkindatasets,notdirectoriesandfiles• Spectrumofblobstore vs.filesystemfunctionality•WhatistheequivalentofthePOSIXAPIforascalablestoragesystem?•Whatsetofoperationsarerequired?•Whataretheirsemantics?•Whatcanandcannotbesupportedscalably?

Page 15: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

15©Cloudera,Inc.Allrightsreserved.

Otherconsiderations

• Erasurecoding• Requiredtobecostcompetitive

•Multi-datacenterreplication• Importantforbusiness-criticalanalytics

• 3DXpoint• Newadditiontostoragehierarchy• Couldchangehowwewritesoftwareandthinkaboutpersistence

Page 16: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

16©Cloudera,Inc.Allrightsreserved.

RM

NM

NM

NM

NM

QueueoftasksAvailableresources

Runtasks

Page 17: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

17©Cloudera,Inc.Allrightsreserved.

Oneclustertorulethemall

• Exabyte-scalestoragemeansexabyte-scaleprocessing• Current:10,000nodeYARNclusters• Goal:1,000,000 nodes• Oneclusterforallcompute ataninternet-scalecompany• ThinkMicrosoft orTwitter

Page 18: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

18©Cloudera,Inc.Allrightsreserved.

YarnFederation

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

Router

Policy

Client

Admin

<Tenant,Sub-cluster>

Page 19: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

19©Cloudera,Inc.Allrightsreserved.

Fair-SharingandFederation

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

Andrew

Karthik

50

50

50

50

Page 20: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

20©Cloudera,Inc.Allrightsreserved.

Fair-SharingandFederation

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

Andrew

Karthik

99

1

1

99

FeedbackloopPer-clusterweights

Page 21: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

21©Cloudera,Inc.Allrightsreserved.

Fair-SharingandFederation

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

ResourceManager

NodeManager

NodeManager

NodeManager

NodeManager

99

1

1

99

Page 22: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

22©Cloudera,Inc.Allrightsreserved.

Scheduling

Page 23: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

23©Cloudera,Inc.Allrightsreserved.

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Batchprocessing Mins- hours Seconds < 400,000 Jobs(10Ks) Low

InteractiveSQL Seconds Milliseconds 100s Users (100s) Medium

Streamprocessing Months Minutes 10s Jobs(10s) High

Long-runningservices Months Minutes #Nodes Services(10s) High

Varietyofworkloads

Page 24: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

24©Cloudera,Inc.Allrightsreserved.

Schedulinglatency

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Batchprocessing Mins- hours Seconds < 400,000 Jobs(10Ks) Low

InteractiveSQL Seconds Milliseconds 100s Users (100s) Medium

Streamprocessing Months Minutes 10s Jobs(10s) High

Long-runningservices Months Minutes #Nodes Services(10s) High

Page 25: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

25©Cloudera,Inc.Allrightsreserved.

Lowlatencyschedulingfordistributedsystems

• Stateoftheart• Low-latencyscheduling:Sparrow• Second-levelschedulerthatneedspre-allocatedresources

• Operational• Staticpartitioning:setasideresources• Semi-static:Maintainaper-usercacheofresources• Downside:lowutilization

Canwedesignscalablealgorithmsforlow-latencyscheduling?

Page 26: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

26©Cloudera,Inc.Allrightsreserved.

Schedulinglatency

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Batchprocessing Mins- hours Seconds < 400,000 Jobs(10Ks) Low

InteractiveSQL Seconds Milliseconds 100s Users (100s) Medium

Streamprocessing Months Minutes 10s Jobs(10s) High

Long-runningservices Months Minutes #Nodes Services(10s) High

Page 27: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

27©Cloudera,Inc.Allrightsreserved.

Schedulinglatency

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Batchprocessing Mins- hours Seconds < 400,000 Jobs(10Ks) Low

InteractiveSQL Seconds Minutes 100s Users (100s) Medium

Streamprocessing Months Minutes 10s Jobs(10s) High

Long-runningservices Months Minutes #Nodes Services(10s) High

Page 28: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

28©Cloudera,Inc.Allrightsreserved.

JobsvsServices

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Batchprocessing Mins- hours Seconds < 400,000 Jobs(10Ks) Low

InteractiveSQL Seconds Minutes 100s Users (100s) Medium

Streamprocessing Months Minutes 10s Jobs(10s) High

Long-runningservices Months Minutes #Nodes Services(10s) High

Page 29: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

29©Cloudera,Inc.Allrightsreserved.

JobsvsServices

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Jobs Mins- hours Seconds < 400,000 ~ 10,000 Low

Services Seconds Minutes #Nodes <100 High

Page 30: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

30©Cloudera,Inc.Allrightsreserved.

Scalability- Tenants

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Jobs Mins- hours Seconds < 400,000 ~10,000 Low

Services Seconds Minutes #Nodes <100 High

Page 31: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

31©Cloudera,Inc.Allrightsreserved.

Scalability– TenantsvsNodes

• Schedulingisallocatingresourcesfortenantsonclusternodes•Matching/joinbetweentwosets• Schedulinglatency=|Tenants|x|Nodes|

Canwelowertheboundonschedulinglatency?

Page 32: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

32©Cloudera,Inc.Allrightsreserved.

Qualityofplacement

Duration SchedulingLatency “Tasks” Tenant

ScalePlacementQuality

Jobs Mins- hours Seconds < 400,000 ~ 10,000 Low

Services Seconds Minutes #Nodes <100 High

Page 33: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

33©Cloudera,Inc.Allrightsreserved.

Placementrequirements

SOFTHARDDatalocalitySoftware

e.g.Database

Hardwaree.g.GPU

Intra-jobaffinity

Inter-jobaffinity

Page 34: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

34©Cloudera,Inc.Allrightsreserved.

Multi-tenancyandscalability

Tenants,Schedulingthroughput

Nodes,Placementquality

ServiceScheduler

JobScheduler

UnifiedScheduler

Page 35: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

35©Cloudera,Inc.Allrightsreserved.

Utilization

Page 36: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

36©Cloudera,Inc.Allrightsreserved.

Productionclusters

[1]ApacheYARNatSOCC‘13[2]Anecdotalfromthecommunity

CPUUtilization% MemoryUtilization%

MapReducev1 <20[1] <20[1]

YARN /MapReducev2 50[1] 30[2]

Page 37: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

37©Cloudera,Inc.Allrightsreserved.

Potentialforimprovement

• Atask’sresourceusagevariesovertime.• Resourceusagevariesacrosstasksofthesamejob

020040060080010001200

Terasort Wordcount

Mean Peak Allotted

Page 38: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

38©Cloudera,Inc.Allrightsreserved.

Over-subscribingnodes

• Allocateunusedresourcestopendingtasks• Challenges• Handlesuddenspikesinresourceusagegracefully• Performanceoftaskscannotdeteriorate• Contentiononnon-isolatedresources

Page 39: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

39©Cloudera,Inc.Allrightsreserved.

Conclusion

ApacheHadoopismatureandverywidelydeployed.

Theunderlyingassumptionsare10yearsoldandneedrevisiting.

Lotsofinterestingandhardresearchproblemsinthespace.

Page 40: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

40©Cloudera,Inc.Allrightsreserved.

Thankyou

Page 41: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

41©Cloudera,Inc.Allrightsreserved.

OpenProblems

• Storagescalability• Blobstore APIforanalyticworkloads• GlobalfairnessinafederatedYARNcluster• Low-latencyscheduling• Jobsandservicesonthesamecluster

• Schedulerscalabilityintenantsandnodes• Improvingqualityofplacementwithalatencyupperbound

• Clusterutilizationimprovements• I/OschedulingforpredictabilityandQoS

Page 42: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

42©Cloudera,Inc.Allrightsreserved.

NodeManager

NodeManager

Greedyplacementisnotoptimal

GPU

CPUAndrew

Karthik

NodeManagerCPU

NodeManagerCPU

Page 43: Next-Generation Apache Hadoop - USENIX · 2019-02-26 · © Cloudera, Inc. All rights reserved. 6 Master Worker Worker Worker Worker Global view of cluster Coordinates workers Simple

43©Cloudera,Inc.Allrightsreserved.

Multi-tenancy

43

MapReduce Spark Impala HBase

HDFS