Post on 20-May-2020


IBM® Corporation

An introduction to EGO: An enterprise-ready resource manager

for all workloads

IBM Platform Computing


Contents

1 Abstract
2 Introduction
3 Cluster management in enterprise
4 The EGO resource management framework
    4.1.1 Architecture overview
    4.1.2 System concepts
    4.1.3 Components and interactions
    4.1.4 APIs and protocols
    4.1.5 Resource allocation interface
    4.1.6 Resource provider interface (RPI)
    4.1.7 Scheduling plugin interface
    4.1.8 Activity interface
    4.1.9 Security interface
5 Workload managers integrated with EGO
    5.1.1 Long-running services
    5.1.2 Platform Symphony middleware
    5.1.3 Batch workloads with LSF
    5.1.4 Spark integration
    5.1.5 OpenStack integration
6 Deployment experiences
7 Related work
8 Future directions
9 Conclusion
10 Appendix
    10.1 Allocation Interface
    10.2 Resource Provider Interface
    10.3 Scheduler Plugin Interface
    10.4 Security Plugin Interface


1 Abstract

The growth and scale of public clouds have led various cloud providers to build sophisticated cloud-scale cluster management systems that enable the execution of multiple workloads on a shared pool of resources. The goal of cluster management systems is to drive up resource utilization while satisfying application service-level requirements. Cloud providers, such as Google, have built systems like Borg and Omega to reduce their data center footprints and drive down costs. Microsoft’s Apollo is another internal system that optimizes the execution of DAG-based workloads. Open-source tools, such as Mesos and YARN, have emerged to provide similar capabilities to the broader community.

In the meantime, many large enterprises are building internal cloud-like systems to support requirements that differ from those of cloud providers. In this document, we explore the IBM Platform Enterprise Grid Orchestrator (EGO) technology, which underpins various high performance computing, analytics, and big data grids in a variety of industry verticals including financial services, life sciences, manufacturing, and electronics. We try to shed light on the design choices that make EGO a foundation for enterprise customers looking to build out a shared infrastructure that hosts multiple line-of-business applications. We also discuss the protocols and APIs that are required to support the diverse nature of workloads, the system components of EGO, and how EGO is used to bring together organizational silos within enterprise environments. Lastly, we compare our experiences to other cluster management systems in the industry.

2 Introduction

The history of IBM Platform EGO began around 2004, at about the same time that Google was developing its Borg system, Hadoop was starting up, and cloud computing was not yet popularized. At the time, Platform Computing was the leader in high performance cluster and grid computing with its Platform Load Sharing Facility (LSF) product for batch workloads. Sophisticated customers in financial services were looking to move from batch-based models of risk analytics to more real-time approaches. Applications were being modernized using the model of re-usable parallel analytics services that could perform pricing or risk calculations using market and portfolio data. The need for a sophisticated, high performance middleware that co-existed and shared resources with traditional batch jobs led to the creation of the EGO technology. We realized that supporting multiple application frameworks that managed workloads in different ways required a common resource management layer if the problem of dedicated silos was to be avoided.

Different types of workload managers or application frameworks could use the resource manager services to allocate and release resources as the demand for computation and data increased or decreased. Distributed data services like in-memory caches also needed to be managed and horizontally scaled up and down on the common infrastructure. These applications shared many characteristics with today’s cloud-native workloads: they were elastically scalable, had built-in high availability at the application level, and used fine-grained services with task routing and messaging as part of the infrastructure. Over time, these distributed computing infrastructures grew to be a significant portion of the data center estate of many large enterprises. Now, with the emergence of new application and big data frameworks such as Hadoop, Spark, Storm, Cassandra, MongoDB, Node.js, and so on, enterprises are looking to bring these together under a common resource management layer to improve data center efficiency.


3 Cluster management in enterprise

In delivering solutions to enterprise customers, we have learned a few lessons that may be of some value to the wider cluster management community as it attempts to bridge the gaps between public and private clouds. There is an argument that all workloads will eventually move to public clouds; however, we feel that there will be room for hybrid solutions for the large enterprises that we deal with. The following are a few significant themes that we see in enterprise environments:

Diversity of workloads: Enterprises tend to have a large number of applications with different processing requirements. There will be several hundred smaller-footprint applications, along with a few very large ones. Not all workloads are suitable for running on a scale-out cluster. Many of the more sophisticated enterprises have large software development organizations to meet custom requirements. As such, they demand special features for their use cases that the broader market may not yet value.

Heterogeneity of infrastructure: Not all enterprises, including some cloud providers, have migrated to Linux. These enterprises have a significant footprint of Windows, Power Linux, and other UNIX infrastructures that they want to leverage. Technologies such as GPUs, FPGA accelerators, flash drives, and so on all add complexity to the scheduling decisions. Even desktop or VDI infrastructures based on VMware or KVM have idle capacity that enterprises want to leverage for their scale-out workloads. A cluster management system for the enterprise should be able to integrate a broad range of heterogeneous resources.

End-to-end solution requirements: Although cluster management will be an important element of future data center architectures, it is still only one component of the overall enterprise landscape. Enterprises have legal and regulatory obligations that necessitate complex security features, including pluggable authentication and role-based access control. Management needs reporting, chargeback, integrated workload and infrastructure capacity planning, and forecasting tools. Operations teams need diagnostics and troubleshooting advice. Enterprise IT skills around distributed systems operations are generally weaker than those of a typical cloud provider.

Dealing with organizational silos: Enterprises are not homogeneous entities in which central IT can dictate and enforce standards, though many attempt to do so. Different lines of business within the enterprise will have their own agendas and requirements. Hardware is often “owned” by a line of business (LOB), which is willing to share its resources selectively under certain conditions. It is important that the cluster management system allows such policies to be expressed, and moves the organization towards a shared service model.

Moving to real-time and low latency: Impacted by the cloud and social and mobile trends, analytic and big data applications are moving to a near real-time approach. From the point of view of a cluster management system, it is important to be able to react with low latency to workload and resource requests.


4 The EGO resource management framework

A cluster management system consists of several layers including workload managers or application frameworks, a distributed file system, a resource manager, and cluster-oriented system management tools. EGO focuses on resource management: it interacts with multiple application frameworks and coordinates the sharing of resources across them. A distributed file system is required by scale-out applications, and works alongside the resource manager with integrations for data-aware scheduling and placement. We use the terms workload managers and application frameworks interchangeably.

4.1.1 Architecture overview

When defining EGO back in 2004, we took the distributed operating system model into account. The term data center operating system (DCOS) is very much in tune with the mentality we had at the time, though we framed it in terms of grid computing. EGO treats individual nodes and their operating systems as devices that make up a large, virtual supercomputer. Agents running on the nodes provide information and execution capabilities that are aggregated into a central master process. The central master acts as the kernel of the distributed environment. Multiple clusters, each with their own master, make up the grid. Each cluster can have multiple types of workload managers running on top of the kernel. A single cluster can scale up to 10K nodes, though in practice we find that the largest enterprises have clusters with 4-5K nodes. Increasing core counts per system and GPU accelerators have reduced the demand for very large single clusters in most enterprises.

4.1.2 System concepts

EGO uses resource groups to organize the supply of resources. It manages this supply and allocates resources to different workloads according to policies. Resource groups can be static or dynamic, using attributes of hosts to define membership. An agent running on each node collects dynamic load state information about the node from the operating system and reports it to a central master. Topology information can be associated with resources to represent a logical or a network hierarchy. Resources can also be logical entities that are independent of nodes, such as software licenses or bandwidth capacity. A consumable resource is one whose usage is counted as it is allocated.

A consumer is an organizational entity that requests resources from the system. Consumers are organized hierarchically into a tree to reflect lines of business, departments, projects, and so on. The consumer concept is orthogonal to how a workload manager maps its clients or users to consumers. Multiple workload managers may map different types of workloads into the same consumer so that resource sharing is controlled regardless of the form a workload may take. Resource plans are policies that assign different amounts of resources to different consumers. The policies are implemented as pluggable shared libraries that allow different strategies to be implemented without affecting the framework.
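To illustrate how a resource plan can assign amounts of a resource down a consumer tree, the following Python sketch splits a pool of slots among leaf consumers in proportion to sibling shares. The `Consumer` class, the `share` field, the consumer names, and the proportional split are assumptions made for illustration only; they are not EGO's configuration format or one of its policy plug-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Consumer:
    """Hypothetical node in a consumer tree (not an EGO data structure)."""
    name: str
    share: int = 1                                  # weight among siblings
    children: list = field(default_factory=list)

def plan(consumer, slots, path=""):
    """Split `slots` among leaf consumers in proportion to sibling shares."""
    full = f"{path}/{consumer.name}"
    if not consumer.children:
        return {full: slots}
    total = sum(c.share for c in consumer.children)
    result, remaining = {}, slots
    for i, child in enumerate(consumer.children):
        # The last child takes the remainder so no slot is lost to rounding.
        last = i == len(consumer.children) - 1
        part = remaining if last else slots * child.share // total
        remaining -= part
        result.update(plan(child, part, full))
    return result

tree = Consumer("Bank", children=[
    Consumer("Equities", share=3, children=[
        Consumer("Pricing", share=1), Consumer("Risk", share=1)]),
    Consumer("FixedIncome", share=1),
])
allocation = plan(tree, 100)
```

Because the plan is keyed by consumer path rather than by framework, any workload manager that maps its work to `/Bank/Equities/Pricing` would draw from the same share.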

A client of EGO is typically an application framework that manages workload on behalf of end users or higher order services. This client will register with EGO and make allocation requests, which express the demand for resources. We use the generic term allocation request to emphasize the fact that it is the frameworks running on top that map resources to workload-specific units of execution such as batch jobs, short-lived tasks, long-running services, and so on. The allocation requests capture information about the resource consumer that is associated with the request, units of allocation requested (cpu, mem, disk, entire host), the min/max number of resources, resource requirements that need to be satisfied, topology constraints, and other workload objectives.
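The information carried by an allocation request can be pictured as a simple record. The sketch below is a hypothetical Python rendering; the class name, field names, and validation rules are assumptions for illustration and do not reproduce the EGO C data structures.

```python
from dataclasses import dataclass, field

@dataclass
class AllocationRequest:
    """Illustrative request record; all field names are assumptions."""
    consumer: str                 # path in the consumer tree
    unit: str                     # "cpu", "mem", "disk", or "host"
    min_units: int
    max_units: int
    requirements: dict = field(default_factory=dict)   # e.g. {"os": "linux"}
    topology: str = ""            # e.g. "same-rack"; empty means unconstrained

    def __post_init__(self):
        if self.unit not in {"cpu", "mem", "disk", "host"}:
            raise ValueError(f"unknown allocation unit: {self.unit}")
        if not 0 <= self.min_units <= self.max_units:
            raise ValueError("need 0 <= min_units <= max_units")

req = AllocationRequest("/Bank/Equities/Pricing", "cpu", 4, 64,
                        requirements={"os": "linux"})
```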

The allocation requests are mapped to available resources by a pluggable scheduler, and the scheduling decisions are returned to the client. The scheduler takes the workload objectives and applies an administrator-defined, system-level resource plan that controls how competing consumers should share resources. An asynchronous protocol is used to communicate decisions between the EGO kernel and the application frameworks. Resources are matched and provided to the frameworks as they become available, and are reclaimed from the frameworks as dictated by policies.

Finally, in order to make use of the resources to run workloads, EGO provides an optional execution service for managing operating system level processes. We use activity as the generic term for the launching and control of processes. Depending on the application framework, the processes could represent jobs, tasks, service instances, Docker containers, or JVM processes. The allocation is decoupled from the execution of an activity. One allocation can be used to launch multiple activities over a period of time, as long as they are constrained by the allocation amount. On Linux, we can use control group (cgroup) settings to enforce the allocation limits. Docker integration is available through a container plugin.
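The decoupling of allocation from activity can be sketched as follows. This is a toy Python model, assuming a slot-based allocation; the `Allocation` class and its methods are hypothetical names and are not part of the EGO activity interface.

```python
class Allocation:
    """Hypothetical allocation of slots that can host many activities."""

    def __init__(self, slots):
        self.slots = slots
        self.running = {}          # activity id -> slots in use

    def start_activity(self, activity_id, slots=1):
        used = sum(self.running.values())
        if used + slots > self.slots:
            raise RuntimeError("activity would exceed the allocation")
        self.running[activity_id] = slots

    def finish_activity(self, activity_id):
        self.running.pop(activity_id)

alloc = Allocation(slots=2)
alloc.start_activity("task-1")
alloc.start_activity("task-2")
alloc.finish_activity("task-1")
alloc.start_activity("task-3")   # reuses the slot freed by task-1
```

The allocation is requested once and held; activities come and go inside it without another round trip to the scheduler.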

4.1.3 Components and interactions

The basic architecture, as shown in Figure 1, starts with a couple of lightweight agents written in C, running on the individual nodes. The Load Information Manager (LIM) provides static and dynamic state information about the nodes such as the number of cores, total memory, CPU utilization, run queue lengths, and memory, network, and disk statistics. It is extensible by writing scripts (called ELIMs) that collect other information. The LIM uses UDP to exchange load information with a central master to give a near real-time load state picture of the distributed environment. The load exchange vectors are also used as a heart-beating mechanism to elect the master LIM in the event of node failures.
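One simple way to picture heartbeat-driven master election is shown below. This is a sketch under stated assumptions, not the LIM algorithm: it assumes an ordered list of master candidates and a fixed timeout, and the function name, host names, and timeout value are all hypothetical.

```python
def elect_master(candidates, last_seen, now, timeout=30.0):
    """Return the first candidate whose heartbeat is recent, or None."""
    for host in candidates:
        seen = last_seen.get(host)
        if seen is not None and now - seen <= timeout:
            return host
    return None

candidates = ["node-a", "node-b", "node-c"]          # preference order
last_seen = {"node-a": 100.0, "node-b": 155.0, "node-c": 158.0}
master = elect_master(candidates, last_seen, now=160.0)
```

Here `node-a` has been silent for 60 seconds, so the next candidate with a fresh heartbeat is chosen.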

The master runs the EGO kernel daemon (vemkd), which is both the resource allocator and placement engine, to arbitrate the usage of resources between competing workloads from different tenants. The EGO kernel implements the concept of a hierarchical set of resource consumers that can be mapped onto various structures like organizations, business units, and departments, all the way down to applications or services. The EGO kernel is akin to the memory or CPU manager of an operating system. Through scheduling plugins, it implements policies like fair share, borrow/lend, ownership, or preemption, dynamically adjusting the allocation of the resource supply to meet demand from workloads.

High availability of the central EGO kernel daemon is achieved using the LIM master election algorithm, with state being stored in an external shared filesystem. The state of the EGO kernel, consisting of the resource allocations and execution activities, is kept in a combination of snapshot files and event logs. We are also working on leveraging etcd or ZooKeeper to handle EGO kernel failover requirements. The dynamic load information maintained by LIM is not persisted but re-built from slave nodes upon master failover.
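Recovery from a snapshot plus an event log generally follows a replay pattern, sketched here in Python. The record layout (sequence numbers, `alloc` and `release` operations) is an assumption for illustration and does not describe EGO's on-disk format.

```python
def recover(snapshot, events):
    """Rebuild allocation state from a snapshot plus later log events."""
    state = dict(snapshot["allocations"])
    for seq, op, alloc_id, hosts in events:
        if seq <= snapshot["seq"]:
            continue                      # already reflected in the snapshot
        if op == "alloc":
            state[alloc_id] = hosts
        elif op == "release":
            state.pop(alloc_id, None)
    return state

snapshot = {"seq": 2, "allocations": {"a1": ["host1"], "a2": ["host2"]}}
events = [
    (1, "alloc", "a1", ["host1"]),
    (2, "alloc", "a2", ["host2"]),
    (3, "release", "a1", []),
    (4, "alloc", "a3", ["host1", "host3"]),
]
state = recover(snapshot, events)
```

A newly elected master would run this kind of replay against the shared filesystem before accepting client requests.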


The Process Execution Manager (PEM) provides execution capabilities such as starting and stopping processes or containers. In order to start processes through the PEM, the client framework must go through the EGO kernel, which will authenticate the requests and ensure an appropriate allocation is used. The EGO kernel talks to the PEM via long-lived TCP connections that enable fast parallel start-up of processes. The PEM maintains the local state about the processes it launches on the local filesystem. It can be restarted without affecting running workloads.

Figure 1: EGO component architecture

4.1.4 APIs and protocols

This section outlines some of the key interfaces that are supported by the EGO kernel. The primary goal was to keep the EGO kernel relatively small, acting as a framework into which various functionality could be plugged. Details of the core EGO C interfaces and data structures that are described in this section are provided in the Appendix.

4.1.5 Resource allocation interface

The resource allocation model in EGO requires a framework to register itself as a client and make explicit allocation requests. The resources are provided based on the characteristics of the allocation request, which can consist of multiple sub-demands. The sub-demand identifies a group of similar resources with minimum and maximum instances, resource requirements, or placement constraints such as affinity/anti-affinity within the sub-demand or between sub-demands. The protocol between the EGO kernel and the framework is asynchronous and dynamic. Once the allocation request has been registered in the EGO kernel, the kernel and the client will interact as resources satisfying the allocation request become available or need to be reclaimed. Until the allocation request is cancelled, this two-way interaction allows the framework and EGO kernel to negotiate the allocation and reclaim of resources.
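The two-way interaction can be modelled with callbacks. The Python sketch below is a toy stand-in, not the EGO allocation API: the `Kernel` and `Framework` classes, the `on_added` and `on_reclaim` callbacks, and the host names are hypothetical, and the toy kernel ignores the minimum for brevity.

```python
class Kernel:
    """Toy stand-in for a resource manager; not the EGO kernel API."""

    def __init__(self):
        self.requests = []             # registered (client, min, max) demands

    def register(self, client, min_units, max_units):
        self.requests.append((client, min_units, max_units))

    def hosts_available(self, hosts):
        """Offer newly free hosts to clients that are below their maximum."""
        hosts = list(hosts)
        for client, _min_units, max_units in self.requests:
            want = max(0, max_units - len(client.hosts))
            grant, hosts = hosts[:want], hosts[want:]
            if grant:
                client.on_added(grant)

    def reclaim(self, client, count):
        """Ask a client to give back `count` hosts; the client picks them."""
        return client.on_reclaim(count)

class Framework:
    def __init__(self):
        self.hosts = []

    def on_added(self, hosts):
        self.hosts.extend(hosts)

    def on_reclaim(self, count):
        if count <= 0:
            return []
        released, self.hosts = self.hosts[-count:], self.hosts[:-count]
        return released

kernel, fw = Kernel(), Framework()
kernel.register(fw, min_units=1, max_units=3)
kernel.hosts_available(["h1", "h2"])      # first delivery
kernel.hosts_available(["h3", "h4"])      # only one more is needed
released = kernel.reclaim(fw, 1)
```

The request stays registered between calls, so resources can be delivered or reclaimed at any time until the client cancels it.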

We find that the EGO allocation model can support both “greedy” frameworks, which can consume any type of resource as long as it is free, and “picky” frameworks, which require very specific types and configurations of resources to be available. Greedy frameworks are common in stateless types of applications, such as web services that can run anywhere there is capacity. Picky frameworks, found in big data and analytics systems, require resources to be collocated with data or other components in a network relationship, taking into account communication and data transfer costs. When reclaiming resources from a picky framework, the framework should be allowed to make the decision about which specific resource it is willing to give up.
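As an example of a picky framework deciding what to give up, the sketch below releases the hosts that hold the least locally cached data. The function name, the data-size metric, and the host names are assumptions for illustration; a real framework would apply its own cost model.

```python
def choose_hosts_to_release(held, local_data_gb, count):
    """Picky policy: give up the hosts holding the least local data."""
    ranked = sorted(held, key=lambda h: (local_data_gb.get(h, 0), h))
    return ranked[:count]

held = ["h1", "h2", "h3"]
local_data_gb = {"h1": 500, "h2": 20, "h3": 120}
to_release = choose_hosts_to_release(held, local_data_gb, 2)
```

A greedy framework, by contrast, could return any `count` hosts, since its work can run wherever capacity exists.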

Figure 2: Resource allocation

4.1.6 Resource provider interface (RPI)

The Resource Provider Interface (RPI) is a framework that lets EGO interact with resource providers in order to discover resources and collect resource information. A resource provider is a component that monitors and manages resources and maintains the resource information. LIM is the default resource provider in EGO. However, EGO can be integrated with external resource providers such as monitoring tools, asset databases, or other resource or infrastructure managers.

RPI supports two methods of interacting with the resource provider. One is a push model where the resource provider pushes information via the RPI API. The other is a plugin into the EGO kernel that pulls information from the external provider periodically. The RPI API is a set of APIs that the resource provider can call to feed resource information into EGO. See the Appendix for details. The RPI plugin is a shared object that interacts with the resource provider to gather resource information, and passes it to the EGO RPI framework with appropriate data structures. The RPI framework periodically calls the RPI plugin to get resource information.
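The difference between the push and pull models can be sketched in a few lines of Python. The `ResourceStore` class, its method names, and the example plug-in are hypothetical; they illustrate the two call directions only and are not the RPI functions listed in the Appendix.

```python
class ResourceStore:
    """Toy inventory; stands in for the resource view kept by the kernel."""

    def __init__(self):
        self.hosts = {}
        self.plugins = []

    def push(self, host, attrs):
        """Push model: the provider calls in with new information."""
        self.hosts.setdefault(host, {}).update(attrs)

    def add_plugin(self, poll):
        """Pull model: register a callable that the store will invoke."""
        self.plugins.append(poll)

    def refresh(self):
        """Called periodically; gathers information from every plug-in."""
        for poll in self.plugins:
            for host, attrs in poll().items():
                self.push(host, attrs)

def asset_db_plugin():
    """Hypothetical plug-in that reads an external asset database."""
    return {"h2": {"cores": 16}, "h1": {"rack": "r7"}}

store = ResourceStore()
store.push("h1", {"cores": 8})        # provider-initiated update
store.add_plugin(asset_db_plugin)
store.refresh()                       # store-initiated update
```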


4.1.7 Scheduling plugin interface

In order to support different types of policies, the EGO kernel supports a plug-in scheduler, implemented as a shared object or DLL, that makes the decisions about mapping allocation requests onto available resources. The EGO kernel acts as a controller for the scheduler plug-in, providing information about the current state of resources, the consumer tree, and policy configuration. The scheduler plug-in is expected to return a set of decisions. We have implemented several different scheduling strategies using this plug-in model. One uses the notion of single-dimensional resource slots, which works well for workloads that are homogeneous and fine-grained. We have also extended this to multi-dimensional scheduling, where up to five dimensions of consumable attributes are considered in the scheduling algorithm. For managing the placement of complex application patterns consisting of multiple components with inter-dependent relationships, we have implemented an efficient biased sampling algorithm through another plug-in.
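To make the idea of multi-dimensional scheduling concrete, the following Python sketch places requests with a plain first-fit rule over several consumable dimensions. First-fit is a technique substituted here for illustration; it is not the algorithm used by the EGO plug-ins, and the function signature is an assumption rather than the scheduler plug-in interface.

```python
def schedule(requests, hosts):
    """First-fit placement over several consumable dimensions.

    requests: list of (request_id, demand dict)
    hosts:    dict of host -> free capacity dict (copied, not mutated)
    Returns a list of (request_id, host) decisions.
    """
    free = {h: dict(cap) for h, cap in hosts.items()}
    decisions = []
    for req_id, demand in requests:
        for host in sorted(free):
            cap = free[host]
            if all(cap.get(dim, 0) >= amt for dim, amt in demand.items()):
                for dim, amt in demand.items():
                    cap[dim] = cap.get(dim, 0) - amt   # consume capacity
                decisions.append((req_id, host))
                break
    return decisions

hosts = {"h1": {"cpu": 4, "mem": 8}, "h2": {"cpu": 8, "mem": 64}}
requests = [("r1", {"cpu": 2, "mem": 6}),
            ("r2", {"cpu": 2, "mem": 6}),
            ("r3", {"cpu": 8, "mem": 8})]
decisions = schedule(requests, hosts)
```

Request `r2` cannot share `h1` with `r1` even though CPU is free, because memory is exhausted; this is the case that single-dimensional slots cannot express. Request `r3` stays pending because no host has eight free CPUs left.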

4.1.8 Activity interface

The EGO kernel provides a set of APIs for launching and controlling processes that workload managers can leverage. The activity interface is de-coupled from the allocation interface in the sense that the same allocation can be used to launch multiple activities. This allows a workload manager to hold the allocation and start multiple processes over a time period to run work. Another common use of the activity interface is to launch control processes or daemons that will interact directly with the workload manager to run different types of work. In this case, activities are launched much less frequently. The state diagram of the lifecycle of an activity is shown in Figure 3.

[State diagram not reproduced. States: NULL, START, TENTATIVE, RUN, FINISH, UNKNOWN, ZOMBIE. Transition labels: start; started; started but dependency is not ready; failed to start; activity completes; kill activity; pem kills activity; clean up; host reachable; host unreachable.]

Figure 3: EGO Activity state diagram


4.1.9 Security interface

Security is an important consideration in enterprise environments. Typically, enterprises will have their own security infrastructure that the resource manager must integrate with. Clients of the resource manager must be authenticated, authorized through role-based access control (RBAC), and audited. The EGO kernel supports a security plug-in that allows for interfacing with external authentication systems such as Kerberos. A plug-in implements RBAC for enterprise users to give fine-grained permissions to a variety of read/write/modify actions on various objects. This RBAC mechanism can be used by workload managers to secure access to their own domain-specific objects.
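A role-based check of this kind reduces to a lookup from roles to permitted (object, action) pairs. The sketch below is a minimal Python illustration; the role names, object types, and function name are hypothetical and are not taken from the EGO security plug-in.

```python
ROLE_PERMISSIONS = {            # hypothetical roles and permissions
    "cluster_admin": {("consumer", "read"), ("consumer", "modify"),
                      ("allocation", "read"), ("allocation", "write")},
    "consumer_user": {("consumer", "read"), ("allocation", "read")},
}

def is_allowed(user_roles, obj_type, action):
    """True if any of the user's roles grants `action` on `obj_type`."""
    return any((obj_type, action) in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)

can_write = is_allowed(["consumer_user"], "allocation", "write")
```

A workload manager could register its own object types in the same table to protect domain-specific objects.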

5 Workload managers integrated with EGO

In this section we discuss how EGO is integrated with different workload managers or application frameworks.

5.1.1 Long-running services

The management of long-running services is a basic requirement for a distributed environment. The first framework we built on the EGO kernel was the EGO service controller. The EGO service controller uses the EGO kernel to allocate resources and launch distributed service processes. It provides high availability, health checks, and failover support, ensuring that the minimum number of service instances is always kept running. It makes no assumption about how services are developed, how they communicate, or how they discover each other, although we do provide a DNS-based service discovery mechanism. The EGO service controller is used in our commercial IBM Platform Application Service Controller (Platform ASC) product. Internal services in the IBM Platform product suite, such as GUI, REST API, and reporting services, are all managed under the control of the EGO service controller.
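Keeping a minimum number of instances running is essentially a reconciliation step, sketched here in Python. The `reconcile` function, its arguments, and the host names are assumptions for illustration and do not describe the EGO service controller's implementation.

```python
def reconcile(instances, min_instances, free_hosts):
    """Return the hosts on which to start replacement instances.

    instances: dict of host -> True if the last health check passed
    """
    healthy = [h for h, ok in instances.items() if ok]
    missing = max(0, min_instances - len(healthy))
    # Avoid hosts that already run (or recently ran) an instance.
    candidates = [h for h in free_hosts if h not in instances]
    return candidates[:missing]

instances = {"h1": True, "h2": False, "h3": True}
to_start = reconcile(instances, min_instances=3,
                     free_hosts=["h3", "h4", "h5"])
```

Running this step on every health-check cycle restores the minimum after a failure without any knowledge of what the service itself does.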

5.1.2 Platform Symphony middleware

The IBM Platform Symphony (Symphony) product provides an application middleware layer on top of EGO that enables high performance parallel services to be developed. It supports a programming paradigm in which developers write client applications using session and task APIs. These clients communicate with a broker called the Symphony Session Manager (SSM), which in turn routes the tasks to service instances (SIs) that invoke shared computation service logic (e.g. a Monte Carlo simulation). Each application can have one or more SSMs that act as clients to the EGO kernel, and hundreds of applications can co-exist in a shared environment. For example, a trading desk application may connect to SSMs and submit tasks to execute pricing calculations. The SSMs connect to the EGO kernel and compete with one another for resources, with EGO policies arbitrating between them. The SSM handles the details of the workload execution, receiving the resources required by the application from the EGO kernel.

SIs are started as activities via the EGO kernel; however, the kernel is not involved in the execution of Symphony tasks, which can last on the order of a few hundred milliseconds. By de-coupling the task-level scheduling and workload management in SSM/SI from the more coarse-grained resource allocation in EGO, Symphony is able to handle hundreds of applications while driving up resource utilization to 90+%, and also meeting unique application service levels in terms of response time, latency, or task throughput.

5.1.3 Batch workloads with LSF

The IBM Platform LSF product supports distributed processing of batch jobs. A job in LSF is a single executable that can be launched and controlled on a node. LSF has concepts of job grouping, job arrays, and job flows with sequencing and dependencies to organize more complex workflows around the batch model. While LSF pre-dates EGO, many of the resource management capabilities of EGO were generalized from LSF. LSF supports an SLA-driven scheduling model that interfaces with EGO to drive the allocation of resources based on throughput or response time goals. For example, if the job throughput for a particular queue managed by the SLA policy drops below a threshold, requests are triggered to allocate more resources. Job queues are mapped to EGO consumers, enabling the EGO kernel to arbitrate between different queues with different SLA goals.
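The throughput-triggered request can be illustrated with a small calculation. The Python sketch below assumes that throughput scales linearly with the number of slots, which is a simplification; the function name and the figures in the example are hypothetical and do not describe the LSF SLA scheduler.

```python
import math

def extra_slots_needed(goal_jobs_per_hour, finished_last_hour, slots_in_use):
    """Estimate extra slots needed to reach a throughput goal.

    Assumes throughput scales linearly with the number of slots.
    """
    if finished_last_hour >= goal_jobs_per_hour or slots_in_use == 0:
        return 0
    per_slot = finished_last_hour / slots_in_use
    if per_slot == 0:
        return 0                  # no observed rate to extrapolate from
    needed = math.ceil(goal_jobs_per_hour / per_slot)
    return needed - slots_in_use

extra = extra_slots_needed(goal_jobs_per_hour=1200,
                           finished_last_hour=900, slots_in_use=30)
```

With 30 slots finishing 900 jobs per hour, each slot completes 30 jobs per hour, so a goal of 1200 jobs per hour needs 40 slots, that is, 10 more than are currently held.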

5.1.4 Spark integration

Spark is a next-generation big data framework oriented toward more real-time processing of analytic workloads, and it is integrated with EGO. The Spark driver is modified to call out to the EGO kernel to allocate and release resources, and to start up Spark executor processes. Spark supports both batch mode and interactive analysis. For interactive users running short computations, EGO’s ability to quickly deliver resources with sub-second latency is important. In a multi-tenant Spark cloud environment with users submitting both batch and interactive jobs, the ability to prioritize between paying and free customers becomes critical to delivering appropriate service level guarantees.

5.1.5 OpenStack integration

We have integrated EGO into OpenStack to act as a scheduler and placement engine for any workloads deployed through OpenStack. The integration does not rely on EGO node-level agents by default and uses the EGO RPI interface to pull node-level statistics from OpenStack Nova. An Optimization Service is provided in addition to the placement function that dynamically migrates VMs based on performance thresholds while maintaining the placement constraints.

6 Deployment experiences

Although the EGO technology is used in a variety of IBM Platform deployments across EDA, manufacturing, and financial services, the largest and most sophisticated deployments of EGO are arguably in global investment banks. In this section we highlight how EGO is used as the underpinning of shared infrastructures. Applications from different lines of business for fixed income, equities, derivatives or credit risk analytics need to manage the sharing of resources. These deployments make extensive use of the EGO consumer tree to implement sharing policies that align with business goals. Figure 4 gives a view of the consumer hierarchy and resource plan, where the resource allocations for hundreds of applications are centrally managed by IT to create a multi-tenant cloud. The user-facing applications can use different workload managers or frameworks such as LSF, SOAM, Platform ASC, Spark or Hadoop, all integrated with EGO. The EGO consumer unifies the resource allocation mechanism across frameworks. Ownership, borrow/lend, fairshare, and priority policies are applied against the consumer regardless of which framework the demand originates from.

Figure 4: EGO Consumer Hierarchy and Resource Plan
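To make the idea of framework-independent policy concrete, the following is a minimal sketch of how demand from several frameworks could be summed against one consumer and then capped by ownership and borrowing. The types `consumer_sketch` and `framework_demand`, the function `consumer_grant`, and the simple capping rule are all hypothetical; the actual EGO policies (including fairshare and priority) are richer than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical consumer: owned units are guaranteed; borrow_limit caps
 * how many idle units may be borrowed from other consumers. */
struct consumer_sketch {
    const char *name;
    int owned;
    int borrow_limit;
};

/* Demand raised against the consumer by one framework (LSF, Spark, ...). */
struct framework_demand {
    const char *framework;
    int units;
};

static int min_int(int a, int b) { return a < b ? a : b; }

/* Sum demand across frameworks, then cap it by owned + borrowable units.
 * The framework name plays no part in the decision. */
int consumer_grant(const struct consumer_sketch *c,
                   const struct framework_demand *d, size_t n,
                   int idle_lendable)
{
    assert(c != NULL && (d != NULL || n == 0) && idle_lendable >= 0);

    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += d[i].units;

    int ceiling = c->owned + min_int(c->borrow_limit, idle_lendable);
    return min_int(total, ceiling);
}
```

In this sketch, a consumer that owns 10 units and may borrow 5 is granted the same amount whether its demand of 14 units comes from one framework or is split between two.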

Figure 5 shows the management and monitoring console, built on top of EGO, that is used to manage large-scale distributed environments. The UI lets operators view the health status of a large number of nodes across multiple geographically distributed clusters and drill down to the usage metrics of an individual node. It also supports re-domaining resources between EGO clusters, so that they can be repurposed between Dev, UAT and Production environments.

Figure 5: Multiple distributed EGO clusters managed centrally

To support historical reporting and analytics, customers pull all EGO and workload data into a reporting database, from which they can draw deeper insights and correlate information across consumers, resources, and jobs, tasks or sessions. Figure 6 shows the distribution of resources allocated to multiple consumers over a 4-day period.

Figure 6: Resource allocation report

From a workload perspective, Figure 7 shows the distribution of Symphony SOAM session durations for a particular application. Profiling sessions in this way gives an understanding of how the allocated resources are being used to run tasks.

Figure 7: Workload session duration distribution report

7 Related work

The Apache Mesos project has strong similarities to EGO both in terms of architectural philosophy and implementation. Both systems have a master-slave architecture with a two-level scheduling system. Resource allocation is done in a central kernel and workload scheduling is done in the application frameworks. EGO differs from Mesos in its use of near real-time load information from the nodes to dynamically allocate resources. EGO also offers a highly configurable and efficient central placement service in its kernel rather than requiring each framework to implement its own logic. Also, given the use cases in enterprise accounts, EGO has developed a richer set of multi-tenant resource sharing policies. In Mesos, reclaim of resources is handled for unallocated capacity that is given to a framework. In EGO, we can preempt a workload that is already running, through a co-ordination protocol with the framework, to ensure that the work with the lowest impact on the SLA is stopped or killed so that higher priority work can run.
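As an illustration of the framework side of such a reclaim, the sketch below picks the running unit whose loss is assumed to hurt least. The source does not define how "impact" is measured; here it is assumed that lower-priority work (a larger number, following the appendix convention that 1 is the highest priority) is given up first, with ties broken in favor of the unit that has run for the shortest time. The type `running_unit` and the function `pick_reclaim_victim` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one unit of work running on reclaimed resources. */
struct running_unit {
    int id;
    int priority;      /* 1 = highest priority, larger = less important */
    long elapsed_sec;  /* work that would be lost if the unit is stopped */
};

/* Return the index of the unit to stop, or -1 if nothing is running.
 * Assumed impact order: lowest priority first, then least elapsed time. */
int pick_reclaim_victim(const struct running_unit *u, size_t n)
{
    assert(u != NULL || n == 0);

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (best < 0
            || u[i].priority > u[best].priority
            || (u[i].priority == u[best].priority
                && u[i].elapsed_sec < u[best].elapsed_sec)) {
            best = (int)i;
        }
    }
    return best;
}
```

A real framework would likely weigh other factors as well, such as checkpoint state or SLA deadlines; the point is only that the choice of victim rests with the framework rather than the kernel.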

The Google Borg system is an example of a monolithic scheduler that supports both batch jobs and long-running services. It provides a single RPC interface to support both types of workloads. Each Borg cluster consists of multiple cells and it scales by distributing the master functions among multiple processes and using multi-threading. In the EGO system, we scale by separating coarse-grained allocation from fine-grained workload scheduling. The EGO kernel acts as a wholesaler of resources, with multiple distributed workload managers responsible for scheduling and dispatching work units (e.g. jobs, tasks, sessions, long-running services) to those resources via workload-specific interfaces. The EGO model is similar to Google Omega in the sense that we allow for different types of RPC interfaces to be presented to users through different workload managers. So far, the EGO kernel has not been a major bottleneck for enterprise customers, but in future work, we will look to shared state models similar to Omega.

Apache Hadoop YARN is another cluster resource manager that has similarities to EGO. YARN, like EGO, makes the placement decisions about which resources to allocate to a framework based on the framework’s explicit requirements. The YARN application master functions like the EGO client with explicit primitives to allocate, release, and reclaim resources. The YARN community is working on better support for long-running services with the Slider framework, which is analogous to the EGO service controller. The history of YARN leads it to be applied to more batch-oriented MapReduce workloads, whereas EGO is targeted at sharing resources among a mix of real-time, batch, and long-running services. Also, YARN is implemented in Java, while EGO is mostly implemented in high-performance C code.

8 Future directions

Customers continuously drive us with new requirements for our products, which forces us to re-evaluate the design choices we have made. In looking at the evolution of open-source resource managers, it becomes attractive to leverage the efforts of the wider community. We have begun initial attempts, for example to plug EGO capabilities into YARN, and are working on Mesos and Docker Swarm integrations with EGO. This will enable EGO concepts and techniques to add differentiated value to existing ecosystems.

We are also working with our customers to address next generation requirements around scaling, availability management, cloud bursting, and diagnosis and troubleshooting. We hope to leverage related open-source projects such as the ELK stack, which allow for easier operational monitoring and visualization of the information in EGO. When EGO was first developed, configuration and distributed consensus systems such as ZooKeeper and etcd were not readily available, forcing us to invent our own. Looking forward, we will explore how we can leverage these tools to reduce our maintenance efforts.

9 Conclusion

The IBM EGO technology represents over 10 years of development and production deployment, capturing a wide variety of use cases and experiences around cluster resource management. It provides a sophisticated resource sharing and placement model that enables multiple types of workloads to coexist in a multi-tenant environment. The system is lightweight, consisting primarily of optimized C code, and works in conjunction with its ecosystem of integrated workload managers to deliver a highly performant and scalable enterprise-class cluster management system. The experiences and lessons learned could be of value to other related projects.

10 Appendix

10.1 Allocation Interface

int ego_alloc_md(struct ego_handle *h, struct ego_allocreq_md *r, char **allocId);
int (*addResource_md)(ego_allocreply_md_t *);
int ego_release_md(ego_handle_t *h, ego_releasereq_md_t *release);

typedef struct ego_allocreq_md {
    char *name;                              /* allocation name */
    char *consumer;                          /* consumer name that this allocation should be charged to */
    int flags;                               /* if VEM_ALLOC_ENFORCE_RECLAIM is set, EGO reclaims other
                                              * allocations to enforce its priority and ratio */
    int maxunitsperhost;
    int numsubdemand;                        /* number of sub-demands */
    ego_alloc_sub_demand_md_t **subdemands;  /* sub-demands */
    char *resplan;                           /* resource plan name */
    int priority;                            /* [1, 10], default 5. 1 is highest priority. 0 means using default */
    int share;                               /* [1, 1000], default 10. 0 means using system default and
                                              * EGO will convert it to 10 */
    int numdimensions;                       /* number of dimensions */
    char **dimensions;                       /* names of dimensions */
    ego_reqattr_t *extension;                /* <name, value> pairs of allocation arguments */
} ego_allocreq_md_t;

typedef struct ego_alloc_sub_demand_md {
    char *name;                              /* sub-demand name, should be unique in the allocation */
    int maxunits;                            /* unit number */
    int maxunitsperhost;
    char *resreq;                            /* resource requirement string */
    int numpreferhosts;
    ego_alloc_prefer_host_md_t **preferhosts;
    int flag;                                /* if VEM_ALLOC_ENFORCE_RATIO is set, EGO reclaims other
                                              * allocations to enforce its priority and ratio */
    int priority;                            /* [1, 10], default 5. 1 is highest priority. 0 means using default */
    int ratio;                               /* [1, 1000], default 10. 0 means using system default and
                                              * EGO will convert it to 10 */
    ego_unit_t *unitDefinition;
} ego_alloc_sub_demand_md_t;
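The field comments in the allocation request document ranges and defaults for priority ([1, 10], default 5) and share ([1, 1000], default 10), with 0 meaning "use the default". A client might apply those rules before submitting a request, as in the sketch below. The struct `alloc_req_sketch` is a reduced stand-in with only the fields used here, not the real `ego_allocreq_md_t`, and `alloc_req_normalize` is a hypothetical helper. Rejecting out-of-range values with -1 is an assumption; the source does not say how the kernel treats them.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical stand-in for the allocation request. */
struct alloc_req_sketch {
    const char *name;      /* allocation name */
    const char *consumer;  /* consumer the allocation is charged to */
    int priority;          /* [1, 10], 1 is highest, 0 = use default (5) */
    int share;             /* [1, 1000], 0 = use default (10) */
};

/* Apply the documented defaults, then check the documented ranges.
 * Returns 0 on success, -1 if a field is out of range (assumed behavior). */
int alloc_req_normalize(struct alloc_req_sketch *r)
{
    assert(r != NULL);

    if (r->priority == 0)
        r->priority = 5;
    if (r->share == 0)
        r->share = 10;

    if (r->priority < 1 || r->priority > 10)
        return -1;
    if (r->share < 1 || r->share > 1000)
        return -1;
    return 0;
}
```

A caller would fill in the name and consumer, leave priority and share at 0 to accept the defaults, and normalize the request before handing the equivalent fields to the allocation call.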

10.2 Resource Provider Interface

extern int rpi_init(struct rpiInput *);
extern int rpi_shutdown(void *);
extern struct rpiInfo *rpi_getinfo(void *);
extern struct rpiResource **rpi_getresources(void *);
extern int rpi_freeresourcearray(struct rpiResource **);
extern struct buffer *rpi_status(void *);
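The pairing of `rpi_getresources` with `rpi_freeresourcearray` suggests that the provider allocates the resource array and also releases it. The sketch below shows one way such a pair could be written. The layout of `struct rpiResource` is not given in the source, so `resource_sketch` is a hypothetical stand-in, and the NULL-terminated array convention and the names `sketch_getresources` and `sketch_freeresourcearray` are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct rpiResource. */
struct resource_sketch {
    char *name;   /* host or resource name */
    int slots;    /* capacity reported by the provider */
};

/* Free a NULL-terminated array built by sketch_getresources. */
int sketch_freeresourcearray(struct resource_sketch **arr)
{
    if (arr == NULL)
        return -1;
    for (size_t i = 0; arr[i] != NULL; i++) {
        free(arr[i]->name);
        free(arr[i]);
    }
    free(arr);
    return 0;
}

/* Build a NULL-terminated array describing n resources.
 * Returns NULL if any allocation fails. */
struct resource_sketch **sketch_getresources(const char *const *names,
                                             const int *slots, size_t n)
{
    assert((names != NULL && slots != NULL) || n == 0);

    /* calloc leaves every entry NULL, so a partial array stays terminated. */
    struct resource_sketch **arr = calloc(n + 1, sizeof *arr);
    if (arr == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        arr[i] = malloc(sizeof **arr);
        if (arr[i] == NULL) {
            sketch_freeresourcearray(arr);
            return NULL;
        }
        arr[i]->slots = slots[i];
        arr[i]->name = malloc(strlen(names[i]) + 1);
        if (arr[i]->name == NULL) {
            sketch_freeresourcearray(arr);
            return NULL;
        }
        strcpy(arr[i]->name, names[i]);
    }
    return arr;
}
```

Keeping allocation and release inside the provider means the caller never needs to know how the array was built, which matters when providers are loaded as separate plugins.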

10.3 Scheduler Plugin Interface

struct policyInfo *(*vem_policy_info) (void *);
int (*vem_policy_init) (const struct policyInit *, void *);
int (*vem_policy_fin) (void *);
int (*vem_policy_schedule) (struct policyWorkSpace *, struct list_ *, void *);
int (*vem_policy_housekeep) (struct policyWorkSpace *, struct list_ *, void *);
struct buffer *(*vem_policy_status) (void *);
int (*vem_validate_resreq) (struct ego_resreq *);
int (*vem_migrate_decision) (struct migrateDecision2 *, void *);
int (*vem_policy_addconsumer) (struct policyConsumer *, char **, void *);
int (*vem_policy_delconsumer) (struct policyConsumer *, void *);
int (*vem_policy_getresplan) (struct policyPlan *, void *);
int (*vem_policy_setresplan) (struct policyPlan *, void *);
int (*vem_validate_consumer) (const char *, const char *, void *);
int (*vem_policy_modify_resgroup) (const char *, const int, void *);
int (*vem_resgroup_busy) (const char *, void *);
struct buffer *(*vem_alloc_diagnose) (struct allocation *, vem_allocdiagnose_option_t *, void *);
int (*vem_policy_place) (struct placeWorkspace *, struct list_ *, void *);
int (*vem_get_possible_hosts) (ego_getpossiblehost_req_t *, struct rpiResource ***);
int (*vem_policy_getInSlots) (char *, struct tree_ **);
int (*vem_post_notifyclient) (struct allocation *, void *);
int (*vem_alloc_event) (struct AllocEvent *);
int (*vem_get_decisions) (struct list_ *, char * (*)());
int (*vem_resource_event) (struct ResourceEvent *);
int (*vem_get_group_runtime_info) (char *, char *, struct rpiGroup **);
int (*vem_validate_request) (enum AllocEventType, void *);
int (*vem_policy_delresplan) (char *);
int (*vem_policy_update_decision) (struct UpdateDecision *, void *);

10.4 Security Plugin Interface

int (*finalize)(void);
int (*client_initialize)(sec_context_t *);
int (*client_start)(sec_context_t *, char **, int *);
int (*client_step)(sec_context_t *, const char *, int, char **, int *);
int (*client_finalize)(sec_context_t *);
int (*server_initialize)(sec_context_t *);
int (*server_start)(sec_context_t *, const char *, int, char **, int *);
int (*server_step)(sec_context_t *, const char *, int, char **, int *);
int (*server_finalize)(sec_context_t *);
int (*get_localuser)(sec_context_t *, char **);
int (*get_remoteuser)(sec_context_t *, char **);
int (*encrypt)(sec_context_t *, const char *in, int inlen, char **out, int *outlen);
int (*decrypt)(sec_context_t *, const char *in, int inlen, char **out, int *outlen);
int (*sign)(sec_context_t *, const char *in, int inlen, char **sig, int *siglen);
int (*verify)(sec_context_t *, const char *in, int inlen, const char *sig, int siglen);
int (*get_cred)(sec_context_t *, char **cred, int *credlen);
int (*set_cred)(sec_context_t *, const char *cred, int credlen);
int (*save_cred)(sec_context_t *);
int (*read_cred)(sec_context_t *);
int (*remove_cred)(sec_context_t *);
int (*clear_cred)(sec_context_t *);
int (*failreason)(sec_context_t *, int *);
int (*get_users)(sec_context_t *, const char **userreq, int nuserreq, sec_user_t **userlist, int users);
int (*is_user)(sec_context_t *, const char *username);
int pluginver;
int (*add_user)(sec_context_t *, const sec_user_t *user);
int (*del_user)(sec_context_t *, const char *user);
int (*modify_user)(sec_context_t *, sec_user_t *user);
int (*is_group)(sec_context_t *, const char *groupname);
int (*is_groupmember)(sec_context_t *, const char *username, const char *groupname);
int (*group_get_users)(sec_context_t *, const char *groupname, char ***userlist, int *nusers);
int (*user_get_groups)(sec_context_t *, const char *username, char ***grouplist, int *ngroups);
int (*get_groups)(sec_context_t *, const char *groupnames[], int grpcount, sec_group_t **grouplist, int *ngroups);

© Copyright IBM Corporation 2015

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.