computing at massive-scale: scalability and dependability...

Distributed Systems and Service Group


Renyu Yang and Professor Jie Xu

Computing at Massive-Scale: Scalability and Dependability Challenges

IEEE SOSE 2016, Oxford, March 2016

CoLAB

Cloud Datacenters and Virtualization

• Cloud computing is primarily a business model.

• It provides dynamic computing resources to businesses: computing resources can be rented rather than owned outright.

• Two key characteristics: multi-tenancy and resource elasticity.

Fig: A Cloud Provider's scheduler places Customers A, B and C on shared Machines 1, 2 and 3, each providing CPU, memory and storage

Overview: Computing at Massive-scale

Fig: Frameworks (Fm 1, Fm 2, Fm 3) submit compute tasks through schedulers to compute nodes (Fm = framework)

New trends and characteristics

• Varying request and resource heterogeneity

• Workload diversity and resource sharing

• Increasing scale of request and cluster

• Frequent failure occurrence

• Multiple computing frameworks often have to run on a unified scheduler while handling varying requests. The diverse workloads are usually co-allocated to the shared hardware cluster in order to improve utilization.


Varying request and resource heterogeneity

• “Varying” Features

• Various customers with diverse resource estimation

• Varying resource demand dimensions and attributes

• Various resource usage patterns

1. Architecture

2. Number of Cores

3. Max Disks

4. Min Disks

5. Number of CPUs

6. Kernel Version

7. CPU Clock Speed

8. Ethernet Speed

9. Platform Family

Fig: Different constraints in task resource request

• The request heterogeneity can be attributed to the highly dynamic Cloud environment, where users with different computation purposes co-exist with diverse resource requirements and patterns

• The heterogeneities increase the scheduling complexity, since the system has to pre-filter the candidate target servers for each specific request

Fig: Varying user and resource request
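The pre-filtering step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the scheduler of any particular system: the `Server` class, the attribute names and the small operator table `OPS` are all assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    attrs: dict  # hypothetical attributes, e.g. {"architecture": "x86_64", "cores": 16}

# Supported comparison operators (an assumed, deliberately small set).
OPS = {
    "==": lambda a, b: a == b,
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
}

def prefilter(servers, constraints):
    """Return the servers whose attributes satisfy every (attr, op, value) constraint."""
    candidates = []
    for s in servers:
        ok = all(
            attr in s.attrs and OPS[op](s.attrs[attr], value)
            for attr, op, value in constraints
        )
        if ok:
            candidates.append(s)
    return candidates

servers = [
    Server("s1", {"architecture": "x86_64", "cores": 16, "ethernet_gbps": 10}),
    Server("s2", {"architecture": "x86_64", "cores": 4, "ethernet_gbps": 1}),
    Server("s3", {"architecture": "arm64", "cores": 32, "ethernet_gbps": 10}),
]
# A task request constrained on architecture and number of cores.
request = [("architecture", "==", "x86_64"), ("cores", ">=", 8)]
names = [s.name for s in prefilter(servers, request)]
print(names)
```

Every additional constraint dimension adds a check per server, which is one way to see why heterogeneous requests make scheduling more expensive.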

Workload Diversity and Resource Sharing

• Cluster computing systems are increasingly specialized for particular application domains and purposes

• Offline processing vs. long-running services

• Batch job: MapReduce, Dryad

• In-memory computing: Spark

• Stream processing: Storm and MillWheel

• Interactive SQL queries: Dremel and Hive

• Graph processing: Pregel

• DAG processing: Tez and FuxiJob

• Machine learning: GraphLab

• Virtual Machine or Container: EC2, Docker

• Strong requirements: resource sharing, high server utilization and efficient data sharing

Fig: Static partitioning vs. dynamic sharing of a shared cluster among MapReduce, Spark and VMs


Statistical data during 2015 Alibaba double-eleven shopping festival

Type                                                            Number
Peak order number                                               Over 120,000 per second
Total payment transactions in Alipay (Alibaba Payment system)   710 million
Peak payment transactions in Alipay                             85,900 per second
Peak transactions processed on AliCloud (Alibaba Cloud)         140,000 per second

Statistical data of one production system in Ali-Cloud

Type             Number
Server number    4,830
Job number       91,990
Task number      42,266,899
Worker number    16,295,167

Increasing Request and Cluster Scale

• The growing cluster size also makes cluster management more difficult and increases scheduling complexity

• A transparent user experience is highly desirable during request bursts, without noticeable response latency or service time-outs caused by workloads that exceed the system capacity


Frequent failure occurrence

• As the scale of a cluster increases, the probability of hardware and software failures also rises. Failures have become the norm rather than the exception at massive scale.

• The increased cluster size itself introduces much more uncertainty and reduces the overall system reliability, largely due to the increased failure probability of each node and software component.

• Some root causes:

• OS crash/network disconnection

• disk hang or insufficient memory (OOM)

• bugs in code/excessive system utilization

• performance interference

• network congestion
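A small worked calculation illustrates why failures become the norm at scale. It assumes, purely for illustration, that nodes fail independently with the same probability `p_node` in a given period; the value 0.001 is an assumed figure, not a measurement from the talk. Under that assumption the probability that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** n_nodes

# Assumed per-node failure probability of 0.1% in some period.
for n in (10, 1_000, 10_000):
    print(n, prob_any_failure(0.001, n))
```

Even with a small per-node probability, the chance of seeing some failure grows quickly with cluster size. Correlated failures, which the independence assumption ignores, would change the numbers but not the trend.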

Timeline of cloud service failures, 2009 to 2014:

• 2009.2.24, Gmail failure for 4 hours

• 2009.3, Azure outage for 22 hours

• 2010.1, Salesforce cloud service outage

• 2011.4, Yahoo mailbox outages, affecting 0.25 billion users

• 2011.4, Amazon outage for 4 days

• 2012.7, Azure failure for 2.5 hours

• 2012.10, AliCloud power outage

• 2013.3, Hotmail and Outlook, 17 hours

• 2014.8, Azure system outage

• 2014.11, AWS, Rackspace and SoftLayer rebooting

Challenges and Methodology

Scalability and dependability have become two fundamental challenges for all distributed computing at massive scale.

Data-driven Analysing, Modelling, Problem Finding:

• A good understanding of Cloud Computing workloads helps to:

• Identify the resource demand dimensions and attributes for better datacenter planning

• Identify resource usage inefficiencies

• Identify resource usage patterns to improve the QoS

• Identify relationships between workload parameters and their impact on the productivity of the overall datacenter

• Identifying the workload characteristics from a real production system allows us to:

• Design experimental scenarios to simulate environments and evaluate mechanisms for datacenter operational improvements following realistic conditions

• Find system bottlenecks, and further improve the system performance

Cloud Datacenter Case Studies

• 29 days, 12,500+ servers, 27,000,000 tasks

• 365 days (year), 100+ servers per site, 1000 tasks per site

• 60 days, 5,000+ servers, 185,444 tasks

Comprehensive and Correlation Analysis

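• Task: task length, CPU, memory

• User: submission rate, estimated CPU, estimated memory

User and Task profile definition

$U = \{u_1, u_2, u_3, \ldots, u_i\}$, where $u_i = \{f(\cdot), f(\cdot), f(\cdot)\}$ over the three user dimensions

$T = \{t_1, t_2, t_3, \ldots, t_i\}$, where $t_i = \{f(\cdot), f(\cdot), f(\cdot)\}$ over the three task dimensions

Expectation of User and Task profile

$E(u_i) = u_i \, P(u_i)$

$E(t_i) = t_i \, (P(t_i) \mid P(u_j))$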

Analytics: Workload Models

K = Cluster Number

S_K = Sum of Squares

α = Weighted Variable (Coherence)

d = Number of Dimensions

D.T. Pham “Selection of K in k-means clustering” Proc. Inst. Mech. Eng. C. Mech. Eng. Sci., 219 (1) (2005)

• Workload is a combination of tasks and users (customers).

• Characteristics, behavioural patterns and relationships of workload.
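The symbols K, S_K, α and d listed above belong to the f(K) evaluation function from the cited Pham paper on selecting K for k-means. The sketch below shows one way it could be computed. The formula is written from memory of that paper and should be checked against it, and the distortion values are made-up illustration data: f(1) = 1, f(K) = S_K / (α_K · S_{K-1}) for K > 1 when S_{K-1} ≠ 0, with α_2 = 1 - 3/(4d) and α_K = α_{K-1} + (1 - α_{K-1})/6 for K > 2. Smaller f(K) suggests a better K.

```python
def alpha_weights(k_max: int, d: int) -> dict:
    """Weight factors alpha_K for K = 2..k_max (intended for d > 1)."""
    alphas = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alphas[k] = 1.0 - 3.0 / (4.0 * d)
        else:
            alphas[k] = alphas[k - 1] + (1.0 - alphas[k - 1]) / 6.0
    return alphas

def f_k(distortions: list, d: int) -> list:
    """distortions[i] is S_K for K = i + 1; returns f(K) for K = 1..len(distortions)."""
    alphas = alpha_weights(len(distortions), d)
    scores = [1.0]  # f(1) = 1 by definition
    for k in range(2, len(distortions) + 1):
        prev = distortions[k - 2]          # S_{K-1}
        if prev == 0:
            scores.append(1.0)
        else:
            scores.append(distortions[k - 1] / (alphas[k] * prev))
    return scores

# Hypothetical sums of squares S_1..S_4 from four k-means runs on 2-dimensional data.
sums_of_squares = [100.0, 20.0, 18.0, 17.0]
scores = f_k(sums_of_squares, d=2)
best_k = scores.index(min(scores)) + 1
print(scores, best_k)
```

In practice the S_K values would come from running k-means for each candidate K on the user or task feature vectors.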

Workload model definition

Workload clusterization

Example: Google Workload

Fig: Users (Month, Day 2, Day 18, Day 26)

                        Requested CPU           Requested Memory        Submission Rate (Hourly)
Cluster  Population %   Mean   Stdev.  Cv       Mean   Stdev.  Cv       Mean     Stdev.   Cv
U1       37.03          0.010  0.004   0.388    0.016  0.013   0.854    34.94    94.00    2.691
U2       0.71           0.016  0.011   0.689    0.019  0.013   0.658    2498.21  2034.6   0.814
U3       6.37           0.135  0.048   0.358    0.094  0.136   1.453    4.71     10.82    2.295
U4       6.37           0.025  0.018   0.718    0.092  0.031   0.342    13.49    19.47    1.444
U5       22.64          0.063  0.011   0.168    0.030  0.020   0.648    73.40    170.44   2.322
U6       26.89          0.032  0.006   0.197    0.014  0.010   0.752    43.63    105.18   2.411

                  Month                                Day 2
Param    Cluster  Mean         Stdev.       Cv         Mean        Stdev.      Cv
CPU      T1       0.029        0.028        0.966      0.029       0.025       0.862
         T2       0.095        0.088        0.926      0.071       0.071       1
         T3       0.006        0.012        2          0.007       0.012       1.714
Mem      T1       0.011        0.01         0.909      0.013       0.01        0.769
         T2       0.049        0.031        0.633      0.047       0.021       0.447
         T3       0.002        0.003        1.5        0.003       0.003       1
Length   T1       16,605,683   32,753,760   1.972      9,787,032   15,519,963  1.586
         T2       123,974,450  250,146,79…  2.018      30,932,490  40,683,248  1.315
         T3       739,117      4,056,404    5.488      245,445     655,190     2.669

                  Day 18                               Day 26
Param    Cluster  Mean         Stdev.       Cv         Mean        Stdev.      Cv
CPU      T1       0.028        0.014        0.492      0.006       0.006       1
         T2       0.076        0.051        0.667      0.065       0.04        0.615
         T3       0.005        0.005        0.984      0.026       0.012       0.462
Mem      T1       0.009        0.006        0.632      0.001       0.001       1
         T2       0.040        0.017        0.428      0.031       0.018       0.581
         T3       0.001        0.001        1.075      0.009       0.004       0.444
Length   T1       41,329,800   103,613,335  2.507      13,669,736  16,538,165  1.21
         T2       117,493,568  388,077,476  3.303      82,300,581  54,360,253  0.661
         T3       7,658,844    25,068,810   3.273      613,803     1,450,884   2.364

Task dimension characteristics

Fig: Tasks. Percentage during run-time (axis 0 to 50) for series S-T1, S-T2, S-T3 and R-T1, R-T2, R-T3 across user clusters U1 to U6

Scalability Challenges

• Request handling scalability

• How to enable high cluster throughput with low-latency request handling and allocation decisions?

• Resource scheduling scalability

• How to make prompt scheduling decisions at millisecond rate for interactive query tasks or millions of queued resource requests?

• Communication and message scalability

• How to avoid message flooding whilst guaranteeing timely component communication (resource request/reclaim, heartbeat)?

Fig: Scalability (request handling scalability; resource scheduling scalability; communication and messaging scalability) in relation to request number and frequency, resource dimension and amount, and system scale and complexity

Scalability Solutions (1)

Architectural evolution

• Single-master scheduling

• Every scheduling decision, state monitoring and state update is delegated to a single master node (such as the JobTracker in Hadoop 1.0)

• Overloaded JobTracker, and single point of failure

• Only supports one type of computing paradigm/framework (slots only for Map or Reduce)

• Two-level scheduling

• Decouples resource management and framework- or application-specific scheduling into two separate layers

• The central resource manager is responsible for resource negotiation among different resource requests, and the application master takes charge of job scheduling

• Decentralized scheduling

• Multiple distributed scheduler replicas are adopted via multi-threads or independent processes, and each scheduler can handle requests simultaneously based on its local cached states or global shared states
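The two-level split described above can be sketched as two small classes. This is a toy model under stated assumptions, not the design of YARN, Mesos or Fuxi: the class names, the `request`/`schedule` methods and the "one core per task" rule are all hypothetical.

```python
class ResourceManager:
    """Level 1: negotiates cores among applications; knows nothing about tasks."""

    def __init__(self, total_cores):
        self.free_cores = total_cores

    def request(self, app_name, cores):
        # Grant as much as is currently free (a deliberately naive policy).
        granted = min(cores, self.free_cores)
        self.free_cores -= granted
        return granted


class ApplicationMaster:
    """Level 2: maps its own tasks onto whatever it was granted."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)  # assumption: each task needs exactly one core
        self.running = []

    def schedule(self, rm):
        granted = rm.request(self.name, len(self.pending))
        for _ in range(granted):
            self.running.append(self.pending.pop(0))
        return granted


rm = ResourceManager(total_cores=4)
batch = ApplicationMaster("batch", ["m1", "m2", "m3"])
stream = ApplicationMaster("stream", ["s1", "s2"])
batch.schedule(rm)
stream.schedule(rm)
print(batch.running, stream.running, stream.pending, rm.free_cores)
```

The resource manager never inspects task lists, so a new framework only needs its own application master rather than a change to the central scheduler.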


Incremental scheduling

• Achieving rapid responses and prompt scheduling decisions at such a rate means that the central resource manager cannot recalculate the complete mapping of all machine resources to all application tasks on every scheduling decision

• Only the changed part will be calculated

• Locality-tree based incremental scheduling

• Incremental resource request and allocation protocol

• Resource request is only sent once until the application master releases the resources

• Scheduling tree

• Multi-level waiting queue

• Different priorities and constraint labels

• Quota-group control (access control)

Fig: Scheduling tree example with multi-level waiting queues
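The incremental request protocol and priority-ordered waiting queue can be sketched as follows. This is a simplified illustration with assumed names (`IncrementalScheduler`, `request_delta`, `release`) and a single flat queue instead of a locality tree. The point it shows is that an application states its demand once, and later events only touch the changed part of the state.

```python
import heapq
import itertools

class IncrementalScheduler:
    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.demand = {}    # app -> slots still wanted (kept until satisfied)
        self.granted = {}   # app -> slots currently held
        self.queue = []     # heap of (priority, seq, app); lower priority value first
        self._seq = itertools.count()

    def request_delta(self, app, delta, priority=10):
        """Apply an incremental change (+/-) to an app's outstanding demand."""
        self.demand[app] = max(0, self.demand.get(app, 0) + delta)
        if delta > 0:
            heapq.heappush(self.queue, (priority, next(self._seq), app))
        self._assign()

    def release(self, app, slots):
        """Returned slots are handed to waiting apps; nothing else is recomputed."""
        slots = min(slots, self.granted.get(app, 0))
        self.granted[app] = self.granted.get(app, 0) - slots
        self.free_slots += slots
        self._assign()

    def _assign(self):
        while self.free_slots > 0 and self.queue:
            _, _, app = self.queue[0]
            want = self.demand.get(app, 0)
            if want == 0:
                heapq.heappop(self.queue)  # stale entry, demand already met
                continue
            give = min(want, self.free_slots)
            self.free_slots -= give
            self.demand[app] = want - give
            self.granted[app] = self.granted.get(app, 0) + give
            if self.demand[app] == 0:
                heapq.heappop(self.queue)

sched = IncrementalScheduler(free_slots=3)
sched.request_delta("low", 4, priority=20)   # gets 3, still waits for 1
sched.request_delta("high", 2, priority=1)   # nothing free yet, queued
sched.release("low", 2)                      # freed slots go to "high" first
print(sched.granted, sched.demand)
```

The "low" application never resends its request: its remaining demand of one slot stays registered until resources become available.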


Decentralized scheduling

• Option 1: Local state replica coordinated by central master

• The functionality of the central master can be simplified to only synchronizing all states as a coordinator

• Conflict resolution is critically important

• Option 2: Shared states visible to all schedulers without a central coordinator

• The communal states can be protected using exclusive locking techniques or lock-free optimistic concurrency control based on incremental transactions

• Option 3: Stateless distributed scheduling

• Sampling-based probing for low latency

• Each autonomous scheduler detects servers with fewer queued tasks by probing m random servers and assigns the tasks of its jobs to targeted machines
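The sampling-based probing in Option 3 can be sketched in a few lines. This is a minimal illustration under assumptions: queue lengths are held in a plain list, probing is instantaneous, and the function name `place_task` and the values of `m` are hypothetical.

```python
import random

def place_task(queue_lengths, m, rng):
    """Probe m random servers and enqueue the task on the least-loaded one."""
    probed = rng.sample(range(len(queue_lengths)), m)
    target = min(probed, key=lambda i: queue_lengths[i])
    queue_lengths[target] += 1
    return target

rng = random.Random(0)       # fixed seed so the run is repeatable
queues = [0] * 20            # 20 servers, all idle at the start
for _ in range(200):
    place_task(queues, m=2, rng=rng)
print(sum(queues), max(queues), min(queues))
```

No scheduler holds global state: each placement costs m probes regardless of cluster size, which is what keeps decision latency low.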


Incremental communication

• An incremental request will be sent only when the resource demands are dynamically adjusted:

• reducing frequency of message passing

• improving the whole cluster utilization

• Core techniques in a messenger:

• Message order-preserving

• Message idempotent resending

• Message deduplication
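The three messenger techniques listed above can be sketched with sequence numbers and cumulative acknowledgements. The `max`/`ack` counters mirror the labels in the accompanying figure, but the classes and method names below are assumptions for illustration, not the actual messenger implementation.

```python
class Sender:
    def __init__(self):
        self.max_seq = 0   # highest sequence number assigned ("max")
        self.acked = 0     # highest cumulative ack received ("ack")
        self.buffer = {}   # seq -> payload, kept until acknowledged

    def send(self, payload):
        self.max_seq += 1
        self.buffer[self.max_seq] = payload
        return (self.max_seq, payload)

    def unacked(self):
        """Messages to resend (e.g. after a timeout); resending is idempotent."""
        return [(s, self.buffer[s]) for s in sorted(self.buffer)]

    def on_ack(self, ack):
        for s in [s for s in self.buffer if s <= ack]:
            del self.buffer[s]
        self.acked = max(self.acked, ack)


class Receiver:
    def __init__(self):
        self.delivered_up_to = 0
        self.pending = {}    # out-of-order messages waiting for earlier ones
        self.delivered = []  # payloads handed to the application, in order

    def on_message(self, seq, payload):
        if seq > self.delivered_up_to:   # otherwise a duplicate: drop it
            self.pending[seq] = payload
        while self.delivered_up_to + 1 in self.pending:
            self.delivered_up_to += 1
            self.delivered.append(self.pending.pop(self.delivered_up_to))
        return self.delivered_up_to      # cumulative ack


sender, receiver = Sender(), Receiver()
sender.send("resource-request")          # seq 1, lost in transit
seq2, body2 = sender.send("heartbeat")   # seq 2, arrives first
receiver.on_message(seq2, body2)         # held back: message 1 is missing
for seq, body in sender.unacked():       # timeout: resend 1 and 2
    ack = receiver.on_message(seq, body)
sender.on_ack(ack)
print(receiver.delivered, sender.buffer)
```

Message 2 is received twice but delivered once, and only after message 1, so the application sees each message exactly once and in order.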

Cluster Partition

• A compute cluster can be divided into several area partitions, and each manager replica is responsible for request handling and information delegation of servers within its specified partition

• Consistency is guaranteed by an elected central coordinator (only the coordinator can make changes to the permanent store)

Diagram components: Sender App (RPC call, callback), MessageBuf, sender-side and receiver-side Messengers, Receiver App. Messages 1 and 2 are tracked with states such as {max=1,ack=0} and {max=2,ack=0}, followed by an acknowledgement {ack=1}.

Fig: Message re-sending and de-duplication

Fig: Google Cluster Partition [EuroSys'15]

Dependability Challenges

• Providers are under great pressure to provision uninterrupted reliable service to consumers while trying to reduce their operational costs due to software and hardware failures within the system.

• Faults and handling coverage: Components within the resource manager are likely to experience different types of faults ranging from crash-stop to late timing failure, as well as have different underlying root causes

• Recovery effectiveness and efficiency: consider factors including the full recovery time, the system utilization, the additional resource cost produced by the recovery, and the latent negative impacts on other components or workloads

• User-perceived impact: a user-transparent failover technique to recover the service without noticeable changes to provisioned service perceived by consumers

Fig: Dependability (fault coverage; recovery effectiveness and efficiency; user-perceived impact) in relation to failure MTTF, the amount of workloads and subcomponents, and system complexity

Dependability Solutions (1)

Rapid and Effective Component Failover

• Failover with reduced checkpointing

• Soft-state inference: Collects and exploits states collected from neighboring components instead of solely relying on hard-state periodically collected from dedicated backup systems

• Hybrid recovery techniques: recovery that combines lightweight hard-state with soft-state inference

• Minimized worker eviction

• The master or agent is loosely coupled with its respective workers during execution, so non-faulty workers are not automatically evicted

• State-inference to identify late-timing or inaccessible agents

• Adaptive resource reservation for running/faulty workers
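The hybrid recovery idea above can be sketched as a function that rebuilds a restarted master's view from a small hard-state checkpoint plus soft state reported by agents. This is an illustrative sketch only and does not reproduce Fuxi's actual recovery logic: the function name, the shape of the reports and the identifiers are assumptions.

```python
def recover_master_state(checkpoint_apps, agent_reports, expected_agents):
    """
    checkpoint_apps: set of application ids restored from the hard-state checkpoint
    agent_reports:   {agent_id: [(app_id, worker_id), ...]} soft state received in time
    expected_agents: set of all agent ids known to the cluster
    """
    allocations = {app: [] for app in checkpoint_apps}
    unknown = []
    for agent, workers in agent_reports.items():
        for app, worker in workers:
            if app in allocations:
                # Running workers are re-adopted rather than evicted.
                allocations[app].append((agent, worker))
            else:
                unknown.append((agent, app, worker))
    # Agents that did not report are only suspected (late or inaccessible).
    suspects = expected_agents - set(agent_reports)
    return allocations, suspects, unknown

checkpoint = {"app1", "app2"}
reports = {
    "agent-a": [("app1", "w1"), ("app2", "w2")],
    "agent-b": [("app1", "w3"), ("app9", "w4")],
}
allocations, suspects, unknown = recover_master_state(
    checkpoint, reports, expected_agents={"agent-a", "agent-b", "agent-c"}
)
print(allocations, suspects, unknown)
```

Only the list of applications needs to be checkpointed here; the much larger worker-to-machine mapping is inferred from the agents that are still alive.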

Fig: Soft-state inference applied to Fuxi system

Fig: Fault Recovery Finite-State Machine (FSM)


Optimized Recovery Time vs. Degraded Service Level

• Recovery Time or Information Completeness

• Incomplete information might appear due to timing-out components unable to contribute their states in time. The state-collection time also closely depends on cluster scale, application number, and application-specified configurations etc.

• Insufficient collection time leads to incomplete states and subsequent degraded service level

• Flexible and customizable configurations

• Such flexibility through customization offers adaptive control of recovery overheads and allows possible trade-offs between the full state recovery and various levels of degraded recovery with incomplete state inference.
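The trade-off between recovery time and information completeness can be illustrated with a toy calculation. The arrival times and deadlines below are invented for the example, and `collect_states` is a hypothetical helper, not part of any real system.

```python
def collect_states(arrival_times, deadline):
    """
    arrival_times: {component: seconds until its state report arrives}
    deadline:      configurable state-collection window in seconds
    Returns the components whose state was collected and the completeness ratio.
    """
    collected = {c for c, t in arrival_times.items() if t <= deadline}
    completeness = len(collected) / len(arrival_times) if arrival_times else 1.0
    return collected, completeness

# Assumed report delays for three components after a master restart.
arrivals = {"agent-a": 0.5, "agent-b": 2.0, "agent-c": 9.0}
fast, fast_ratio = collect_states(arrivals, deadline=3.0)    # quicker, degraded
full, full_ratio = collect_states(arrivals, deadline=10.0)   # slower, complete
print(fast_ratio, full_ratio)
```

A shorter window finishes recovery sooner but leaves some state unknown, which is the degraded service level the slide refers to.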


Maximized fault coverage

• Diverse failure types: crash-stop failures, timing failures, etc.

• Different failure combinations

• Failure correlations and simultaneous component failures

• Root-cause analysis

Blacklist and alarm dashboard

• Multi-level blacklist

• cluster level, task-level and job-level
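One possible reading of a multi-level blacklist is an escalation scheme: a node that repeatedly fails one task is avoided for that task, a node avoided by several tasks of a job is avoided for the whole job, and a node avoided by several jobs is avoided cluster-wide. The sketch below follows that reading. The escalation rule, the thresholds and all names are assumptions, since the slide only names the three levels.

```python
from collections import defaultdict

class MultiLevelBlacklist:
    def __init__(self, task_threshold=2, job_threshold=2, cluster_threshold=2):
        self.task_threshold = task_threshold
        self.job_threshold = job_threshold
        self.cluster_threshold = cluster_threshold
        self.failures = defaultdict(int)     # (job, task, node) -> failure count
        self.task_level = defaultdict(set)   # (job, task) -> blacklisted nodes
        self.job_level = defaultdict(set)    # job -> blacklisted nodes
        self.cluster_level = set()

    def report_failure(self, job, task, node):
        self.failures[(job, task, node)] += 1
        if self.failures[(job, task, node)] >= self.task_threshold:
            self.task_level[(job, task)].add(node)
        # Escalate to job level when enough tasks of this job avoid the node.
        tasks_blocking = sum(
            1 for (j, _), nodes in self.task_level.items() if j == job and node in nodes
        )
        if tasks_blocking >= self.job_threshold:
            self.job_level[job].add(node)
        # Escalate to cluster level when enough jobs avoid the node.
        jobs_blocking = sum(1 for nodes in self.job_level.values() if node in nodes)
        if jobs_blocking >= self.cluster_threshold:
            self.cluster_level.add(node)

    def is_allowed(self, job, task, node):
        return not (
            node in self.cluster_level
            or node in self.job_level.get(job, ())
            or node in self.task_level.get((job, task), ())
        )

bl = MultiLevelBlacklist()
for _ in range(2):
    bl.report_failure("j1", "t1", "n7")
task_only = (bl.is_allowed("j1", "t1", "n7"), bl.is_allowed("j1", "t2", "n7"))
for _ in range(2):
    bl.report_failure("j1", "t2", "n7")
job_wide = (bl.is_allowed("j1", "t3", "n7"), bl.is_allowed("j2", "t1", "n7"))
print(task_only, job_wide)
```

Escalating gradually limits the blast radius of a false positive: a node with one unlucky task stays available to the rest of the cluster.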

• System health self-checker and dashboard

• monitors and diagnoses node health, process status and system features

• The right tools can quickly find the root cause, minimizing the duration of the failure

Data-driven Methodology Applied Into Engineering

Future Directions

• Big Data as a Service (BDaaS)

• Data processing API, data sharing, API composition

• Debugging large-scale distributed applications

• Debugging or investigating a distributed application performance issue

• Finding the root causes of problems is time-consuming for engineers and technical staff

• History-Based Optimization (HBO) approach

• Accurate estimation of resource requirements

• User demands, system patterns, etc.

• Simulation of large-scale system behavior

• Cost-effective technique to evaluate system functionality and performance in a simulation environment

• Application in container-based systems

• Lightweight containers, e.g. Docker

• IoE applications

• Cloud-Network-Edge

• Dependable, real-time capability with low latency

Conclusions

• Exploiting the inherent workload heterogeneity that exists in Cloud environments provides an excellent mechanism that helps to improve both the performance of running tasks and the system efficiency

• Large-scale distributed systems may run millions of service instances concurrently, with an increased probability of frequent and simultaneous failures

• Relying on real data is critical to understanding the real challenges in massive-scale computing and formulating assumptions under realistic operational circumstances

• Experience gained from Cloud and distributed computing will facilitate the development of future-generation computing systems that support intelligent human decision-making

Our Main Contributions

Topic 1 - Analysis, Modeling and Simulation

• [1] I. S. Moreno, P. Garraghan, P. Townend, and J. Xu. An approach for characterizing workloads in Google cloud to derive realistic resource utilization models. In Proceedings of IEEE SOSE 2013, Best Paper Award

• [2] R. Yang, I. S. Moreno, J. Xu and T. Wo. An analysis of performance interference effects on energy-efficiency of virtualized cloud environments. In Proceedings of the IEEE CloudCom, 2013

• [3] P.Garraghan, P.Townend, J.Xu, "An Analysis of the Server Characteristics and Resource Utilization in Google Cloud" in the proceedings of the IEEE IC2E, 2013.

• [4] P.Garraghan, P.Townend, J.Xu, "An Empirical Failure-Analysis of a Large-Scale Cloud Computing Environment" in the proceedings of IEEE HASE 2014

• [5] I. S. Moreno, P. Garraghan, P. Townend, and J. Xu. Analysis, modeling and simulation of workload patterns in a large-scale utility cloud, in IEEE Transactions on Cloud Computing, 2014

• [6] P. Garraghan, I. S. Moreno, P. Townend, and J. Xu. An analysis of failure-related energy waste in a large-scale cloud environment, in IEEE Transactions on Emerging Topics in Computing, 2014

• [7] P. Garraghan, D.McKee, X. Ouyang, D. Webster and J. Xu. SEED: A Scalable Approach for Cyber-Physical System Simulation, in IEEE Transactions on Services Computing, 2015


Topic 2 – Scalable Resource Scheduling at Scale

• [1] I. S. Moreno, R. Yang, J. Xu and T. Wo. Improved energy-efficiency in cloud datacenters with interference-aware virtual machine placement. In Proceedings of the IEEE ISADS 2013, Best Paper Award

• [2] Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu. Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. In Proceedings of the VLDB Endowment, 2014

• [3] Y.Wang, R. Yang, T. Wo, W. Jiang and C. Hu. Improving utilization through dynamic VM resource allocation in hybrid cloud environment. In Proceedings of the IEEE ICPADS 2014

• [4] P.Garraghan, X.Ouyang, P.Townend, J.Xu. Timely Long Tail Identification through Agent Based Monitoring and Analytics, In Proceedings of IEEE ISORC, 2015

• [5] R. Yang, T. Wo, C. Hu, J. Xu and M. Zhang. D2PS: a Dependable Data Provisioning Service in Multi-Tenants Cloud Environments, In Proceedings of IEEE HASE, 2016

• [6] X.Ouyang, P.Garraghan, D.McKee, P.Townend, and J.Xu. Straggler Detection in Parallel Computing Systems through Dynamic Threshold Calculation, In Proceedings of IEEE AINA, 2016

• [7] X.Ouyang , P.Garraghan, R.Yang, P.Townend and J.Xu Reducing Late-Timing Failure at Scale: Straggler Causes Analysis and Occurrence Prediction in proceeding of IEEE/IFIP DSN 2016 (under review)


Topic 3 – Dependable and Reliable Computing at Scale

• [1] Y. Zhang, R. Yang, T. Wo, C. Hu, J. Kang and L. Cui. CloudAP: Improving the QoS of Mobile Applications with Efficient VM Migration. In Proceedings of IEEE HPCC, 2013

• [2] L. Cui, J. Li, T. Wo, B. Li, R. Yang, Y. Cao and J. Huai. HotRestore: a fast restore system for virtual machine cluster. In Proceedings of USENIX LISA, 2014

• [3] Y. Huang, R. Yang, L. Cui, T. Wo, C. Hu and B. Li. VMCSnap: Taking Snapshots of Virtual Machine Cluster with Memory Deduplication. In Proceedings of IEEE SOSE, 2014

• [4] J. Li, J. Zheng, L. Cui and R. Yang. ConSnap: Taking continuous snapshots for running state protection of virtual machines. In Proceedings of IEEE ICPADS, 2014

• [5] R.Yang and J.Xu. Computing at Massive Scale: Scalability and Dependability Challenges. In Proceedings of IEEE SOSE 2016, Invited Visionary Paper (In press)

• [6] R.Yang, Y.Zhang, P.Garraghan, Y.Feng, J.Ouyang, J.Xu, Z.Zhang, C.Li. Reliable Compute Service in Massive-scale Systems through Rapid Low-cost Failover. In IEEE Transactions on Services Computing, 2016 (In press)

• [7] P.Garraghan, X.Ouyang, R.Yang and J.Xu. Straggler Root-Cause Analysis and Detection in Massive-scale Cloud Datacenters. In IEEE Transactions on Services Computing, 2016 (under review)


Thanks !

Renyu Yang


Beihang University/Alibaba Cloud Inc, China

http://act.buaa.edu.cn/yangrenyu

Professor Jie Xu


University of Leeds, UK

http://www.comp.leeds.ac.uk/jxu/
