
Int. J. High Performance Computing and Networking, Vol. 9, Nos. 1/2, 2016

Copyright © 2016 Inderscience Enterprises Ltd.

OCReM: OpenStack-based cloud datacentre resource monitoring and management scheme

Zhihui Lu* School of Computer Science, Fudan University, Shanghai, 200433, China and IT Management Research Department, Yokohama Research Laboratory, Hitachi Ltd., Japan Email: [email protected], [email protected] *Corresponding author

Jie Wu Engineering Research Center of Cyber Security Auditing and Monitoring, Ministry of Education, 200433, China Email: [email protected]

Jie Bao School of Computer Science, Fudan University, Shanghai, 200433, China Email: [email protected]

Patrick C.K. Hung Faculty of Business and IT, University of Ontario Institute of Technology, Ontario, Canada Email: [email protected]

Abstract: Managing virtualised computing, network and storage resources at large scale in both public and private cloud datacentres is a challenging task. As an open source cloud operating system, OpenStack needs to be enhanced for managing cloud datacentre resources. In order to improve OpenStack functions to support cloud datacentre resource management, we present OCReM, an OpenStack-based cloud datacentre resource monitoring and management scheme. First, we designed a virtual machine group life-cycle management module. Then, we designed and developed a cloud resource monitoring module based on the Nagios monitoring software and the Libvirt interface. We conducted an integrated experiment to verify the performance improvement of the group-oriented auto scaling and elastic load balancing policy based on real-time resource monitoring data. After that, we implemented the OCReM-EC2 hybrid cloud monitoring and auto scaling model. Finally, we analysed prospective research directions and proposed our future work.

Keywords: OpenStack; cloud; cloud datacentre; cloud resource management; auto scaling; elastic load balancing; ELB; hybrid cloud.

Reference to this paper should be made as follows: Lu, Z., Wu, J., Bao, J. and Hung, P.C.K. (2016) ‘OCReM: OpenStack-based cloud datacentre resource monitoring and management scheme’, Int. J. High Performance Computing and Networking, Vol. 9, Nos. 1/2, pp.31–44.

Biographical notes: Zhihui Lu is an Associate Professor in the School of Computer Science, Fudan University. Currently, he is also a Senior Researcher at Hitachi Yokohama Research Laboratory. His research interests are cloud computing and service computing, system resource management, and cloud data centre resource management. He is an expert member of the China SOA and cloud standard working group led by CESI. He is also an academic alliance member of the DMTF WS-Management Work Group.


Jie Wu is a Professor at the School of Computer Science, Fudan University. His research interests are web service management, cloud computing, and system resource management technology. He is an expert member of the WSSG (Web Service Study Group) and SOA-SG (SOA Study Group) of ISO/IEC JTC1 SC7 and SC38 on behalf of China.

Jie Bao is a Master’s student at the School of Computer Science, Fudan University. His research interests are service computing and cloud computing resource management technology.

Patrick Hung is an Associate Professor at the Faculty of Business and Information Technology, University of Ontario Institute of Technology. He has been working with Boeing Research and Technology in Seattle, Washington on aviation services-related research projects. He owns a US patent on a Mobile Network Dynamic Workflow Exception Handling System with Boeing. His research interests include services computing, business process and security, and cloud computing.

This paper is a revised and expanded version of a paper entitled ‘OCReM: OpenStack-based cloud datacentre resource monitoring and management scheme’ presented at The 2014 Asia-Pacific Services Computing Conference (APSCC 2014), Fuzhou, China, 4–6 December 2014.

1 Introduction

Virtualisation and cloud computing technologies are gaining increasing acceptance from IT companies and institutions for use in the deployment of efficient, flexible, and scalable datacentres (DCs). Managing virtualised computing, network and storage resources at large-scale in both public and private cloud DCs is quite challenging. The success of any cloud management system critically depends on the flexibility, scalability and efficiency with which it can utilise the underlying hardware resources, while providing high-quality performance guarantees for cloud tenants. Customers expect cloud service providers to deliver good quality of service (QoS) for them. Thus, resource management at cloud DCs requires the management scheme to provide a rich set of resource controls that balance the QoS of tenants with overall resource efficiencies.

The core component of cloud architecture – the cloud operating system – is responsible for managing and monitoring the physical and virtual machines, providing abstraction of the underlying infrastructure, and offering various tools and advanced functionality for cloud users. OpenStack is a ubiquitous open source cloud OS for public and private clouds. The project aims to deliver solutions for all types of clouds by being simple to implement, massively scalable, and feature rich (OpenStack, http://www.openstack.org/). Founded by Rackspace Hosting and NASA, OpenStack (http://www.openstack.org/) has grown to be a global software community of developers and cloud computing technologists collaborating on a standard and massively scalable open source cloud operating system. OpenStack’s mission is to enable any organisation to create and offer cloud computing services running on standard hardware. OpenStack controls large pools of computing, storage, and networking resources throughout a DC, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface. However, in order to efficiently manage cloud DC resources, we need to enhance

the functions of the OpenStack-G release as follows:

• OpenStack can only manage single instances; it cannot manage instance groups.

• OpenStack supports manual creation and termination of instances, but does not support group-oriented auto scaling of instances and elastic load balancing (ELB).

• OpenStack manages role rights by backend operations on the physical machines’ OS and needs modifications to configuration files; it does not support modification of roles through the user-friendly dashboard interface.

• OpenStack does not support direct self-registration of users/tenants.

• The OpenStack-G release does not provide a resource status monitoring function.

The key contributions and key findings of this paper are the following.

Based on the main requirements for a cloud DC resource management platform, we present OCReM: an OpenStack-based cloud DC resource monitoring and management scheme. Using the OCReM architecture, we enhanced OpenStack with multiple functions, including user self-registration, multi-tenant management, and virtual machine instance group management. We complemented OpenStack with a cloud resource monitoring function based on the Nagios open source software and the Libvirt interface.

Based on the real-time monitoring function and the VM instance group management function, we conducted an experiment to verify the performance improvement of the OpenStack group-oriented reactive auto scaling and ELB policy. We also designed and carried out an experiment on an OpenStack-EC2 hybrid cloud monitoring and resource auto scaling model built on the OCReM platform, which gives an optimised solution for controlling and improving the VM launching procedure.

The remainder of this paper is organised as follows: In Section 2, we propose an OpenStack-based cloud DC resource management architecture: OCReM. In Section 3, we present the design of a user registration and management module and a virtual machine group life-cycle management module. Then in Section 4, we describe the development of


the OpenStack cloud resource monitoring and supervision module. The OpenStack group-oriented auto scaling and ELB experiment and its results are shown and discussed in Section 5. We introduce the design and implementation of an OCReM-EC2 hybrid cloud monitoring and auto scaling model in Section 6. Cloud DC resource management related work is presented in Section 7. In Section 8, we conclude and propose future work.

2 OCReM: OpenStack-based cloud DC resource monitoring and management architecture

The software architecture of OCReM (OpenStack-based cloud DC resource monitoring and management scheme) is shown in Figure 1. We enhanced the OpenStack-G release modules marked with blue colour, including the user role management, user-view life-cycle management and administrator-view life-cycle management functions. We also added some new function modules to the OpenStack-G release, including user registration and management, instance group management and batch actions, PM&VM monitoring, group-oriented auto scaling and ELB, etc.

Figure 1 OCReM (OpenStack-based cloud DC resource monitoring and management scheme) software architecture (see online version for colours)

The OCReM hardware architecture is shown in Figure 2. The system is composed of one controller server, one NFS server, one monitor server and many physical machines acting as computing nodes in a resource pool.

Figure 2 OCReM (OpenStack-based cloud DC resource monitoring and management scheme) hardware architecture (see online version for colours)

These servers and computing nodes use the Ubuntu server operating system. Several physical machines in the system comprise a shared resource pool; they are all controlled by the controller server. Multiple hypervisors are supported by the system; we mainly use KVM to realise virtualised resource allocation.

The core modules, which run on the controller server, are in charge of providing user interfaces, processing user requirements, managing resources and other related functions. We also use auto scaling and ELB technology to improve the efficiency of resource management based on real-time monitoring data collected by the monitor server.

3 OCReM virtual machine group life-cycle management modules design and implementation

We designed a user registration and management module for multi-tenant usage. Based on this, we designed a VM resource group life-cycle management module integrated into the OpenStack-G release platform, which makes OpenStack closer to customer requirements in public clouds.

3.1 User registration and management module design

Since OpenStack was originally designed for private clouds, a self-registration function for users was not necessary: all users were added by the administrator. We aim to make OpenStack a commercial cloud platform for both private and public clouds, similar to Amazon, so as part of this effort we designed and implemented a user self-registration module. In order to conform to the design style of OpenStack, we developed the following interfaces and functions:

• A register button is added to the login page. The user can register by providing the necessary information, which will be stored in a database in OpenStack.

• A register users’ panel is added to the management interface of the administrator, who can see all the information of the registered users and manage these users in the cloud platform.

The administrator has two management functions: if the user information is valid, the administrator creates the account; on the other hand, if the user information is invalid, the administrator will deny the registration.

1 User registration view

As mentioned previously, a user registration function is added to the user login page of the dashboard (called Horizon). When a user wants to register with OpenStack, he can click the button and is then redirected to the user registration view. The view contains a form for basic user information. Currently, username, password, confirm password and email fields are required. When


the user submits the registration information, the validity check module is executed. Obviously, the password and confirm password must be identical and the email address must have a legal format. Furthermore, the username should not already be present in the list of users who have been accepted into OpenStack (user_list), or in the list of users with pending registration status (user_list_register).
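As an illustration only, this validity check could be sketched as a Django form (Horizon is Django-based); the class name, field names and the user_list/user_list_register lookup helpers below are hypothetical placeholders, not the actual OCReM code:

# Hypothetical sketch of the registration validity check.
# existing_usernames() / pending_usernames() stand in for reads of
# user_list and user_list_register; they are not real OCReM helpers.
from django import forms

class RegisterForm(forms.Form):
    username = forms.CharField(max_length=64)
    email = forms.EmailField()                       # rejects malformed addresses
    password = forms.CharField(widget=forms.PasswordInput)
    confirm_password = forms.CharField(widget=forms.PasswordInput)

    def clean(self):
        cleaned = super(RegisterForm, self).clean()
        if cleaned.get('password') != cleaned.get('confirm_password'):
            raise forms.ValidationError('Password and confirm password must be identical.')
        username = cleaned.get('username')
        if username in existing_usernames() or username in pending_usernames():
            raise forms.ValidationError('Username already accepted or pending registration.')
        return cleaned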

2 Validity check view

This view is designed for the administrator to check whether the user registration information is valid or not. If some fields are blank, the information_incomplete page will be opened to show the incomplete information and require the user to register again. Similarly, if the password is not identical to the confirm password, the password_failure page will inform the user of the problem.

3 Users management panel

When an administrator has logged in to the OpenStack cloud platform through the dashboard, he can see the register users management panel, which informs him of the users who want to register to the platform. The administrator can accept or deny the requests depending on the validity of the data. If the administrator wants to accept the user, the add_user page will be invoked, already containing the basic information of the new user, which reduces the work of the admin. When a user is accepted, the administrator can see him in the OpenStack platform users list, and can allocate some role rights to him.

3.2 Instance group life-cycle management module design

We added a VM group management function to the multi-tenant view in OCReM. This virtual machine group management function is designed to divide instances into several groups, which helps to manage instances more conveniently and also enables auto scaling and load balancing. The Instance_group data table is shown in Table 1.

Table 1 Instance_group data table

Attribute name     Type      Description
ID                 INT       Unique identifiers of group
Group_name         VARCHAR   Name of group
Project_id         VARCHAR   Specifies which project this group is located in
Instance_number    INT       Shows how many instances are in this group
Description        VARCHAR   Comments for this group

3.2.1 Main views and tables

1 Group view

This view lists the basic information of all groups in a local project. It supports adding or removing groups and batch actions of group members. It also provides a link to the group members’ view.

2 Group members view

This view lists all instances in a group. It shows instance name, IP address and current state of each instance. Users can remove instances from a group.

3 Instances not in group view

This view lists all instances which are not in any group. It also shows instance name, IP address and current state of each instance. Users can add these instances to any group.

We added a ‘Group_id’ column to the instances table of the original OpenStack-G version, as shown in Table 2.

Table 2 Modified instances data table

Attribute name   Type   Description
ID               INT    Unique identifier of instance
…                …      Other attributes of instance
Group_id         INT    Specifies which group this instance is located in (by default its value is zero)

Table 3 Main interfaces and functions of the instance group life-cycle management module

API/function()          Description
Group_get               Retrieve a group list
Group_create            Create a new instance group
Group_delete            Delete an existing instance group
Group_members_get       Get a list of group members
Group_members_add       Add an instance into a group
Group_members_remove    Remove an instance from a group
Group_recover           Recover all instances’ group_id
Pause_all               Pause all instances in a group
Unpause_all             Unpause all instances in a group
Suspend_all             Suspend all instances in a group
Resume_all              Resume all instances in a group
Reboot_all              Reboot all instances in a group
Terminate_all           Terminate all instances in a group

3.2.2 Main interface and function design

The main interfaces and functions of the instance group life-cycle management module are shown in Table 3. Now tenants can create their own instance groups and manage them through the dashboard. The module supports instance group life-cycle management batch actions, such as pause/unpause instance group, suspend/resume instance group, etc. We implemented auto scaling based on the instance group functions.
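To make the batch actions concrete, a rough sketch of Pause_all is shown below. It assumes a python-novaclient connection and a hypothetical group_member_ids() helper that returns the instance IDs whose Group_id matches; this is an illustration of the idea, not the actual OCReM code:

# Sketch of a group batch action (Pause_all); group_member_ids() is a
# hypothetical helper reading the Group_id column of the instances table.
from novaclient import client as nova_client

def pause_all(group_id, auth):
    # '2' selects the Nova v2 compute API; the auth fields are illustrative
    nova = nova_client.Client('2', auth['username'], auth['password'],
                              auth['project'], auth['auth_url'])
    for instance_id in group_member_ids(group_id):
        try:
            nova.servers.pause(instance_id)   # unpause/suspend/resume follow the same pattern
        except Exception as exc:
            print('failed to pause %s: %s' % (instance_id, exc))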


4 OCReM cloud resource monitoring module design and implementation

OpenStack was originally designed for private clouds, and its G release lacks a monitoring module. Similarly to the Amazon cloud, which includes a CloudWatch module to monitor VMs, we developed a monitoring module for OpenStack. Using this module, all users (ordinary users and administrators) can get the status information of PMs and VMs through the web interface. The collected monitoring data is stored in an OpenStack database, which is an important basis for implementing the auto scaling strategy. The auto scaling strategy is described in Section 5.

Our monitoring module includes two parts:

1 Using the open source software Nagios (http://www.nagios.org/) to monitor PMs

We integrated Nagios into the OpenStack Dashboard, which can be managed by the administrator. Thus, the administrator can see detailed monitoring information about each PM, which was not possible in the original OpenStack dashboard. The module can also provide monitoring data to support auto scaling decisions in a VM instance group.

2 Using Libvirt to monitor VMs

The Libvirt library exposes Linux virtualisation functions through a common API and supports a variety of hypervisors. Libvirt is installed by default in OpenStack, so we can easily invoke Libvirt to monitor VM instances and store the monitoring data in a database. The OpenStack Dashboard can display the corresponding data plots. The VM instance monitoring data makes it possible to realise the auto scaling functionality in OCReM. Table 4 describes the main functions of the monitoring module.

4.1 Main interface and function design

Table 4 Description of the main functions of the VM monitoring module

API/function()     Description
get_devices()      Get information of devices for the query
get_memory()       Get VM memory information, monitoring content
Daemonize()        Record monitoring data into files for future inquiries
get_node_info()    Obtain all monitoring information, write it into the database
generateimage()    Generate dynamic monitoring images for foreground exhibition

4.2 Monitoring VMs

OCReM uses Libvirt to collect monitoring data. Libvirt provides a hypervisor-agnostic API to safely manage guest operating systems running on the host. Libvirt itself is not a management tool, but an API for building tools that manage guest operating systems. It provides an abstraction layer: it defines a common set of functions through which different hypervisors can expose a common API. Libvirt started as a management API designed for Xen and has been extended to support multiple hypervisors.

Our OCReM platform uses Libvirt components to monitor virtual machines. We implemented a monitoring programme, which runs on the physical machine to monitor the VMs by invoking Libvirt. Every physical node must run the monitoring programme (monitor_vm.py), and have a conf.py file containing the database configuration.
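A minimal sketch of what such a Libvirt-based collector might look like is given below; it samples each running domain once, and save_sample() is a hypothetical placeholder for the database write configured in conf.py:

# Minimal sketch of a Libvirt-based VM collector, in the spirit of monitor_vm.py.
# save_sample() stands in for the database write configured in conf.py.
import time
import libvirt

def collect_once(uri='qemu:///system', interval=1.0):
    conn = libvirt.openReadOnly(uri)
    for dom_id in conn.listDomainsID():
        dom = conn.lookupByID(dom_id)
        # dom.info() returns (state, maxMem KiB, memory KiB, nrVirtCpu, cpuTime ns)
        _, max_mem, mem, ncpu, cpu_t0 = dom.info()
        time.sleep(interval)
        cpu_t1 = dom.info()[4]
        # CPU usage: guest CPU time consumed over wall-clock time, per vCPU
        cpu_pct = (cpu_t1 - cpu_t0) / (interval * 1e9 * ncpu) * 100
        mem_pct = mem * 100.0 / max_mem
        save_sample(dom.name(), cpu_pct, mem_pct)   # hypothetical DB helper
    conn.close()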

We schedule the background programme to automatically run in a specified time interval on the instances on each physical node. The collected monitoring information is written into the database. The VM monitoring data graph is illustrated in Figure 3.

Figure 3 VM Instance monitoring data in OCReM (see online version for colours)

We integrated the monitoring module into OpenStack, in order to get detailed run-time status information about each VM, including CPU usage, MEM usage, etc. Using this data, the scheduler can make decisions such as auto scaling based on the VMs’ health status condition.

5 OCReM group-oriented auto scaling and ELB policy module design and experiment

Auto scaling is a way to automatically scale up or down the number of computing resources allocated to user applications based on their needs at any given time.

Before the cloud computing era, it was very difficult to scale a website, let alone figure out a way to automatically scale a server setup. In a traditional, dedicated hosting environment, users are limited by their hardware resources. Once those server resources are maxed out, the site will inevitably suffer from performance issues and possibly crash, thereby causing users to lose data and/or potential business. Auto scaling allows users to set up and configure the necessary trigger points, called alerts and alert


escalations. Those allow users to create an automated setup that reacts to various monitored conditions when thresholds are exceeded.

We designed a group-oriented auto scaling module for the OpenStack-G release, based on the collected monitoring data. In most systems, group-oriented auto scaling is often used together with ELB. Group-oriented auto scaling allows users to automatically scale a group’s cloud resources up or down, according to conditions defined by the cloud tenant. It can closely cooperate with ELB to improve the utilisation efficiency of cloud resources in a group.

5.1 Main interface and function design

The main functions of the group-oriented auto scaling and ELB module are described in Table 5. The usage of each function, including when to activate these functions and what to do next, is defined by the users themselves through policies. But if we find appropriate algorithms, we can import them as enhancements. Note that an algorithm’s performance is tightly related to the actual applications in the cloud environment.

Table 5 Main function description of the auto scaling and ELB module

Operation name            Description
Auto scaling              Automatically scale user resources, e.g., adding/removing instances, according to user definition and load change.
Elastic load balancing    Balance instance loads by redirecting user requests.

5.2 Group-oriented auto scaling system architecture

There are two important components in the auto scaling and ELB system, as shown in Figure 4. One is the user controller, which enables users to define threshold values to activate the auto scaling module; it also lets users define which action should be executed after auto scaling has been activated. The other component is a request redirector, which is responsible for balancing loads of virtual machine instances.

Figure 4 Group-oriented auto scaling and ELB module architecture in OCReM (see online version for colours)

Note that auto scaling could be independent of ELB. But for some applications such as web servers, it is recommended to use auto scaling together with ELB. In these cases, ELB needs to re-adjust running resource loads according to the auto scaling action.

5.3 Group-oriented auto scaling and ELB main interface and interaction design

Table 6 Description of auto scaling and ELB interface

Operation name           Description
User control interface   Allows users to define when to activate auto scaling or ELB and what action should be executed. This could be integrated into the OpenStack dashboard.
Trigger                  When metrics reach the threshold value defined by the user, auto scaling is activated.
Notify                   Delivers messages between the auto scaling and ELB modules.
Call                     Auto scaling finally calls the Nova API to manage the virtual machine instances.
Redirect requests        ELB balances instance loads by redirecting user requests.

Figure 5 Main interactions of auto scaling and ELB

The main interfaces of auto scaling and ELB are shown in Table 6. The interaction is described in Figure 5. A user first creates an auto scaling rule, and monitoring then starts to check whether the rule is matched. When the rule conditions are met, the monitor will execute the auto scaling action. Auto scaling will accordingly call Nova to increase or decrease virtual machines based on the rule, and then notify the ELB to execute real-time load balancing on the changed resource group.
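A rough sketch of this interaction is given below; latest_metrics(), check_rule(), scale_out() and notify_elb() are hypothetical helpers wrapping the monitoring database, the rule evaluation, the Nova API call and the load balancer, respectively:

# Sketch of the rule-driven interaction: monitor -> trigger -> Nova -> notify ELB.
# All four helpers are hypothetical placeholders, not actual OCReM functions.
def auto_scaling_tick(rules):
    for rule in rules:
        metrics = latest_metrics(rule.target_id)      # read from the monitoring database
        if check_rule(rule, metrics):                 # threshold reached -> trigger
            changed = scale_out(rule)                 # call Nova to add/terminate instances
            notify_elb(rule.target_id, changed)       # ELB rebalances the changed group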


5.4 Group-oriented auto scaling experiment based on real-time monitoring data

We considered two implementation models for auto scaling. One is auto scaling based on real-time monitoring data, which reactively scales out or in according to the current workload of running instances; the other model is based on prediction algorithms, which can realise resource auto scaling in advance.

The main advantage of the real-time monitoring-based reactive auto scaling model is that it reflects the real running status of instances and does not rely on historical records and analysis. Some applications do not have a periodic pattern, or exhibit short-term fluctuations; therefore, it is hard to apply prediction algorithms to them.

Our goal was to develop a user-controlled tool to support dynamic scaling of cloud resources in a group according to their definition. For example, users can decide that once the CPU load of a specific instance is over 80%, a new similar instance should be created to balance the load of the group. This idea is similar to the Amazon auto scaling service. We realised it on our OCReM cloud platform.

Figure 6 Auto scaling system panel in OCReM (see online version for colours)

We added a new system panel to the OpenStack-G dashboard in the user view, as shown in Figure 6. In this panel, users can create or remove auto scaling rules. Each rule defines when and what to do in the respective auto scaling scheme. Each time, after the monitoring module retrieves status data and stores it in the database, the auto scaling module will check the monitoring data related to each rule, and carry out corresponding actions.

The attributes of each rule are stated in Table 7. One problem that needs to be addressed is that we

should avoid the ‘flash crowd’ effect on our action. This means that if a metric, in the above example the CPU load, exceeds the user-defined threshold for a very short moment and falls back quickly, our auto scaling management should not react to it. Reacting to such transient spikes often causes a great waste of computing resources.

Currently, we use a simple mechanism to avoid this problem. We use a status variable to track how long the metrics have been above or below their thresholds. This variable is named status flag and its value is initially 0. At every time point, the auto scaling module checks the monitoring data; if the metric is greater than the user-defined threshold, the status flag is incremented by 1, otherwise it is decremented by 1. Once the status flag reaches 3, the corresponding action is executed and the flag is reset to 0. Figure 7 illustrates a typical example of our auto scaling scheme. In this example, the user defines the following auto scaling rule: if the CPU load of the specific instance is over 80% at least three times in 15 minutes, a new instance will be created. We think 15 minutes is an appropriate value for most web servers, but this value can easily be changed. We selected HAProxy as the load balancer cooperating with the auto scaling function, which balances the workloads of virtual machines by redirecting user requests. We can see from Figure 7 that instance A’s CPU usage exceeded 70% three times from 8:24 to 8:28, within 5 minutes, and thus one new instance C was created at 8:28, which led to a load decrease after 10 minutes due to load balancing.
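A minimal sketch of this status-flag guard is shown below; it assumes the flag is kept per rule and evaluated once per monitoring interval, with an assumed floor at 0 to keep the counter bounded:

# Sketch of the 'flash crowd' guard: the scaling action fires only after the
# threshold has been exceeded three times, so short spikes are ignored.
def update_status_flag(rule, metric_value):
    if metric_value > rule.threshold:
        rule.status_flag += 1
    else:
        rule.status_flag = max(0, rule.status_flag - 1)   # floor at 0 (assumption)
    if rule.status_flag >= 3:
        rule.status_flag = 0
        return True       # execute the corresponding add/terminate action now
    return False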

Table 7 Specification for auto scaling rules

Attribute            Description
ID                   The unique identifier of the auto scaling rule.
Rule type            The type of the rule. There are two kinds of rule types: ‘instance’ for a single instance and ‘groups’ for instance groups.
Target ID            The rule target: for instance rules it is the ID of an instance; for group rules it is the group ID.
CPU threshold        Threshold to trigger rule actions, displayed as a percentage.
CPU sign             Sign used for the CPU threshold; supports ‘>’, ‘<’ and ‘=’.
Memory threshold     Threshold to trigger rule actions, displayed as a percentage.
Memory sign          Sign used for the memory threshold; supports ‘>’, ‘<’ and ‘=’.
Status flag          A value used to avoid the ‘flash crowd’ effect, explained in the text above.
Action               The final action to execute; currently supports ‘add’ or ‘terminate’ instances.
Image ID             The image or snapshot ID for creating new instances; this attribute can be empty if the specified action is ‘terminate’.
Flavor ID            The flavour ID for creating new instances; this attribute can be empty if the specified action is ‘terminate’.


Figure 7 A typical example of the OCReM auto scaling scheme (see online version for colours)

6 OCReM-EC2 hybrid cloud model design and experiment

6.1 OCReM-EC2 hybrid cloud monitoring model

Hybrid cloud computing has been receiving increasing attention recently. In order to realise the full potential of the hybrid cloud platform, an architectural framework for efficiently coupling public and private clouds is necessary. A hybrid cloud is the combination of at least one private cloud with at least one public cloud-based infrastructure, producing an environment that provides transparent user access to the hybrid cloud and is capable of dynamic scalability to manage uneven demand.

In our hybrid cloud model design, we use OCReM as our private cloud platform and connect it to Amazon EC2 as the public cloud platform. Since EC2 provides AWS interfaces for developers, OpenStack needs some middleware to access AWS. We selected Boto and the CloudWatch CLI for this purpose.

Figure 8 OpenStack-EC2 hybrid cloud monitoring model

Boto is a Python package that provides interfaces to Amazon Web Services. We use Boto to connect to EC2 VM instances from OCReM. We chose Boto because it is an open source AWS interface library written in Python, which makes it easy to integrate with OpenStack.

Unlike in a private or public cloud, managing workloads deployed on a hybrid cloud is not straightforward.

Typically, each cloud entity has its own service management interfaces that do not interoperate with one another. Thus, managing workloads in a composite hybrid cloud invariably means managing workloads separately for each cloud component, which can be tedious, error prone, and time consuming. Clearly, to be practical, hybrid clouds need integrated monitoring and other service management functions that operate seamlessly and uniformly across the hybrid cloud, masking the heterogeneity, while providing a single unified management experience.

CloudWatch is a monitoring service provided by AWS to collect EC2 and other monitoring data. It provides command line interface (CLI) tools, written in Java. In the OCReM hybrid cloud environment, we invoke CloudWatch from the OCReM private cloud to monitor the memory workload in EC2. Figure 8 shows the hybrid cloud monitoring model of OpenStack and EC2 AWS implemented using Boto and CloudWatch.

6.2 OCReM-EC2 hybrid cloud monitoring experiment

6.2.1 Experiment environment architecture

Our experiment architecture is described in Figure 9. Our OCReM private cloud connects to Amazon EC2. We run a web application in the VM instances of this hybrid cloud. In the OCReM management interface, the administrator can manage all running instances, including monitoring and control of instances created in EC2.

Figure 9 OCReM-EC2 hybrid cloud experiment environment architecture (see online version for colours)

6.2.2 Using Boto to connect to EC2

In OpenStack, we use Python to connect to EC2 through Boto. The concrete steps are the following:

1 Boto config

A Boto config file is a simple .ini configuration file that specifies values for options controlling the behaviour of the Boto library. Upon startup, the Boto library looks for configuration files in the following locations and in the following order:

• /etc/boto.cfg – for site-wide settings that all users on this machine will use

• ~/.boto – for user-specific settings

First, install Boto via pip: pip install boto

The credentials can then be passed to the methods that create connections. Alternatively, Boto will check for the existence of the following environment variables to ascertain your credentials:

• AWS_ACCESS_KEY_ID – your AWS Access Key ID

• AWS_SECRET_ACCESS_KEY – your AWS Secret Access Key

2 Create connection

>>> import boto.ec2
>>> conn = boto.ec2.connect_to_region('us-east-2')

3 Get config information from EC2 instances

>>> reservations = conn.get_all_instances()
>>> for res in reservations:
...     instances = res.instances
...     for inst in instances:
...         print(inst)   # each instance object exposes all available attributes

4 Create instance

>>> conn.run_instances(
...     '<ami-image-id>',
...     key_name='myKey',
...     instance_type='c1.xlarge',
...     security_groups=['your-security-group-here'])

5 Terminate instance

>>> conn.terminate_instances(instance_ids=['instance-id-1', 'instance-id-2', …])

6 Monitor instance

1 Get the available monitoring metrics:

>>> c = boto.ec2.cloudwatch.connect_to_region('us-east-1')
>>> metrics = c.list_metrics()

2 Select five metrics for monitoring: CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps.

3 Get the monitoring data (taking CPUUtilization as an example):

>>> import datetime
>>> cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')
>>> end = datetime.datetime.utcnow()
>>> start = end - datetime.timedelta(seconds=600)
>>> cw.get_metric_statistics(60, start, end, 'CPUUtilization', 'AWS/EC2', 'Average',
...     dimensions={'InstanceId': ['i-6b1d3a04']})

6.2.3 Using CloudWatch to monitor EC2

6.2.3.1 Monitoring procedure

The OCReM cloud can provide the performance monitoring function for both OpenStack and EC2 instances. Instance monitoring is an important part of maintaining reliability and availability, and of realising scalability, for hybrid cloud solutions. For any given instance, whether it is on-site or in the public cloud, we provide the same user interface to display the monitoring information.

From the OCReM cloud dashboard, we can see the monitoring status data and graph by clicking the ID link in the instance list. We provide the monitoring metrics outlined below:

• CPU utilisation: The percentage of allocated computing units that are currently in use on the instance. This metric identifies the processing power required to run an application upon a selected instance.

• Disk space utilisation: The percentage of allocated disk space units that are currently in use on the instance. (This metric is not provided by CloudWatch by default, but we can produce and consume Amazon CloudWatch custom metrics by running the Amazon CloudWatch Monitoring Scripts for EC2.)

• Network in/out: The number of bytes received/sent out on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to an application on a single instance.

• Memory utilisation: The percentage of allocated memory units that are currently in use on the instance. (This metric is not provided by CloudWatch by default, but we can produce and consume Amazon CloudWatch custom metrics by running the Amazon CloudWatch Monitoring Scripts for EC2.)

The monitoring procedure is illustrated in Figure 10.

6.2.3.2 Implementation mechanism

To get the monitoring data, we need the instance ID, start time, end time and the period between every two data points.

1 CPU utilisation, network I/O

These metrics are provided by Amazon CloudWatch by default, so we can acquire them directly through the jclouds API below:

org.jclouds.cloudwatch.domain.GetMetricStatistics.builder()
    .dimension(Dimension dimension)
    .metricName(String metricName)
    .namespace(String namespace)
    .statistic(Statistic statistic)
    .startTime(Date startTime)
    .endTime(Date endTime)
    .period(int period)
    .unit(Unit unit)
    .build()

Figure 10 EC2 performance monitoring procedure (see online version for colours)

2 Memory utilisation, disk space utilisation

These metrics cannot be acquired from CloudWatch directly. However, the Amazon CloudWatch Monitoring Scripts for Linux are sample Perl scripts that demonstrate how to produce and consume Amazon CloudWatch custom metrics. The scripts comprise a fully functional example that reports memory, swap, and disk space utilisation metrics for an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance. First, we need to log in to the EC2 VM selected for monitoring and install the following packages: libwww-perl and libcrypt-ssleay-perl. Then we download, uncompress, and configure the CloudWatch Monitoring Scripts on the EC2 Linux instance. The download command is:

• wget http://ec2-download.s3.amazonaws.com/cloudwatch-samples/CloudWatchMonitoringScripts-v1.1.0.zip

We should also confirm that an AWS identity and access management (IAM) role is associated with the instance, and that it has permission to perform the Amazon CloudWatch PutMetricData operation. The downloaded CloudWatch Monitoring Scripts contain the following files:

• CloudWatchClient.pm – Shared Perl module that simplifies calling Amazon CloudWatch from other scripts.

• mon-put-instance-data.pl – Collects system metrics on an Amazon EC2 instance (memory, swap, disk space utilisation) and sends them to Amazon CloudWatch.

• mon-get-instance-stats.pl – Queries Amazon CloudWatch and displays the most recent utilisation statistics for the EC2 instance on which this script is executed.

• awscreds.template – File template for AWS credentials that stores your access key ID and secret access key.

• LICENSE.txt – Text file containing the Apache 2.0 license.

• NOTICE.txt – Copyright notice.

The script mon-put-instance-data.pl collects memory, swap, and disk space utilisation data on the current system. It then makes a remote call to Amazon CloudWatch to report the collected data as custom metrics. For example, we can use the command ./mon-put-instance-data.pl --mem-util --mem-used --mem-avail to collect all available memory metrics and send them to CloudWatch. We can also collect the monitoring data automatically and retrieve it on demand, so we set up a cron schedule for the metrics reported to CloudWatch. The following crontab entry reports memory and disk space utilisation to CloudWatch every minute:

• */1 * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --mem-util --disk-space-util --disk-path=/ --from-cron

If the script encounters an error, it writes the error message to the system log.

Finally, in order to handle this with a single program and make it convenient for users, we pre-write a script and set it to run at boot on the target instance. After that, we can acquire these data through the same CloudWatch API as above.
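For illustration, a query for one of these custom metrics through Boto might look roughly like the snippet below; the 'System/Linux' namespace and the 'MemoryUtilization' metric name are assumptions about what the monitoring scripts publish and should be checked against the actual script configuration:

# Hedged example: querying a custom metric published by the monitoring scripts.
# The namespace 'System/Linux' and metric 'MemoryUtilization' are assumptions.
import datetime
import boto.ec2.cloudwatch

cw = boto.ec2.cloudwatch.connect_to_region('us-east-1')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(seconds=600)
points = cw.get_metric_statistics(60, start, end,
                                  'MemoryUtilization', 'System/Linux', 'Average',
                                  dimensions={'InstanceId': ['<instance-id>']})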

After the programme obtains the monitoring data, the data are written into a local database, as shown in Table 8.

Table 8 EC2 instance performance monitoring table

Name               Type       Length   Allow null
ID                 int        11       No
InstanceID         varchar    255      Yes
InstanceName       varchar    255      Yes
CpuUsage           decimal    9        Yes
MemUsage           decimal    9        Yes
DiskUsage          decimal    9        Yes
DiskRead           decimal    9        Yes
DiskWrite          decimal    9        Yes
NetworkIncoming    decimal    9        Yes
NetworkOutgoing    decimal    9        Yes
TimeStamp          datetime   20       Yes
Status             varchar    255      Yes


Figure 11 Launching an EC2 instance through OCReM (see online version for colours)

Figure 11 shows how we can launch a new instance in the public cloud (EC2) from our OCReM platform. We can remotely delete the instance through OCReM, as well. Figure 12 shows EC2 instances launched in the OCReM platform.

Figure 12 EC2 instances in OCReM (see online version for colours)

As described in Figure 13, we run an EC2 instance with an IAM role, and use CloudWatch to get detailed monitoring information.

Figure 13 IAM role and CloudWatch monitoring in OCReM (see online version for colours)

6.3 OCReM-EC2 hybrid cloud auto scaling experiment

6.3.1 Basic auto scaling procedure

Continuing the above experiment, we also carry out an auto scaling experiment, which demonstrates how the OCReM private cloud can be augmented with resources from the EC2 public cloud. When an overload is detected in OCReM, the OCReM platform can automatically intervene and transfer part of the workload to newly added EC2 instances. The auto scaling procedure is illustrated in Figure 14.

With the help of the auto scaling mechanism, when the on-site resources are not enough to support a burst of requests, OCReM cloud automatically extends to the public cloud and gets public resources according to the user’s policy to handle the burst. After the load on the OCReM private cloud instances returns to a low level, the public resources are released automatically.

In order to enable the auto scaling function, the user has to create an auto scaling group and define the auto scaling policy for this group, including the respective parameters. For example, when the average CPU utilisation exceeds 80%, OCReM can create an EC2 instance. Although we can upload a local image to the public cloud and use it to launch public instances, this time-consuming method is not recommended. Instead, it is better to pre-configure a public instance and make an image from it. Then the user can use this image to set the auto scaling rules by defining various conditions and operations. After the user finishes the configuration, the auto scaling function is enabled. The user can also change the configuration by editing it, or disable the auto scaling function by deleting the auto scaling group.

Figure 14 OCReM-EC2 hybrid cloud auto scaling procedure (see online version for colours)

Auto scaling to the public cloud provides a method to handle computing bursts in a cost-effective way. However, it introduces the problem of slow VM launching in the public cloud. Therefore, the following implementation mechanism gives a solution for controlling and improving the VM launching procedure.


6.3.2 Implementation mechanism and experiment result

6.3.2.1 Polling programme

To implement the auto scaling mechanism, first of all, a polling programme is required. It periodically calls the monitoring module to get the relevant metrics, and checks whether the status triggers the user’s policy or not. If the condition is satisfied, it calls the scheduling module to reserve or release the corresponding resources.
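A rough sketch of such a polling loop is shown below, including the 60-second cooldown between scaling activities mentioned in Section 6.3.2.3; get_group_metrics(), violates_policy(), apply_policy() and the polling period are hypothetical placeholders:

# Sketch of the polling programme: check each auto scaling group periodically,
# respect a cooldown, and call the scheduling module when a policy is triggered.
import time

COOLDOWN = 60          # default wait time after a scaling activity (Section 6.3.2.3)
POLL_INTERVAL = 30     # hypothetical polling period

def polling_loop(groups):
    last_action = {}
    while True:
        now = time.time()
        for group in groups:
            if now - last_action.get(group.id, 0) < COOLDOWN:
                continue                                   # still cooling down
            metrics = get_group_metrics(group)             # from the monitoring module
            policy = violates_policy(group, metrics)       # increasing or decreasing policy
            if policy is not None:
                apply_policy(group, policy)                # reserve or release resources
                last_action[group.id] = now
        time.sleep(POLL_INTERVAL)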

6.3.2.2 Launch configuration

Second, a user-defined launch configuration is needed. A launch configuration is a configuration template for new instances, which includes information about the image, instance type, region, security group and other details necessary to launch a public instance. After a launch configuration has been created, it is written to a local database, as shown in Table 9.

Table 9 VM launch configuration table

Name             Type      Length   Allow null
ID               int       11       No
ImageID          varchar   255      No
LocationID       varchar   255      No
FlavorID         varchar   255      No
Name             varchar   255      Yes
KeyPair          varchar   255      No
Security Group   varchar   255      No

6.3.2.3 Auto scaling group

Next, an auto scaling group needs to be created. The group is the target of the polling programme. The user can add launch configurations to this group, and has to define the scaling policy, which consists of an increasing and a decreasing policy. When an auto scaling group is created, its policies are written to a local database, as shown in Table 10. The wait time before another scaling activity is allowed is set to 60 seconds by default.

Table 10 Auto scaling policy table

Name            Type      Length   Allow null
PolicyID        int       11       No
Metric          varchar   255      No
Condition       varchar   255      No
Threshold       int       11       No
Operation       bit       1        Yes
Number          int       11       Yes
GroupID         int       11       Yes
LauchConfigID   int       11       Yes

6.3.2.4 Load balance

In order to share the workload on private and public VM instances, auto scaling groups are also used as load-balance groups. The workload within this group is distributed uniformly among the member instances. Once a new instance is created in the group, it is automatically added to the global load-balance pool. Similarly, when resources are released, they are removed from the load-balance pool. Hence, the new private and public resources can share the workload of the original group.
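As an illustration, regenerating the load-balance pool from the current group membership could look roughly like the sketch below. It assumes an HAProxy-style backend (HAProxy is the balancer used in Section 5.4); member_addresses(), the configuration path and the reload command are hypothetical:

# Sketch: rebuild a load-balance pool from the current group members, covering
# private and public instances alike. member_addresses() is a hypothetical helper.
import subprocess

def rebuild_backend(group):
    lines = ['backend group_%s' % group.id, '    balance roundrobin']
    for name, ip in member_addresses(group):        # e.g. [('vm1', '10.0.0.5'), ...]
        lines.append('    server %s %s:80 check' % (name, ip))
    with open('/etc/haproxy/conf.d/group_%s.cfg' % group.id, 'w') as f:   # assumed path
        f.write('\n'.join(lines) + '\n')
    subprocess.call(['service', 'haproxy', 'reload'])    # assumed reload mechanism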

6.3.2.5 Experiment result

The OCReM-EC2 hybrid cloud auto scaling experiment was carried out in our Shanghai lab data centre. The results verified that we can launch one t1.micro VM instance with 1 vCPU and 0.613 GiB memory in 1.5–2 minutes from the Amazon EC2 Tokyo region.

7 Related work

Cloud computing as a new service model has attracted more and more attention. Large-scale cloud DCs host tens of thousands of diverse applications each day. Managing data centre resources at large-scale in both public and private clouds is quite challenging. There are some related works for cloud DC resource management and auto scaling mechanisms. Christina and Christos (2013) present Paragon, an online and scalable DC scheduler that is heterogeneity and interference-aware. Ching et al. (2013) propose a candidate algorithm, namely to use a workload-based method with trend analysis. Slight sudden workload changes do not affect the workload-based algorithm and it could perform well for web applications. Masum et al. (2012) correlate metrics from three domains to make scaling decisions. For example, CPU load is correlated with response time over network or network device link load.

Cloud DC virtualisation increases resource utilisation and takes advantage of emerging hardware trends in multi-core CPUs, high-speed IO devices, and enhanced memory systems. To best take advantage of these features, the cloud DC management infrastructure must be scalable at all levels, secure, low-overhead, and extensible. When the workload demand is unknown while the requirements on QoE of the internet services are given, the data centre operators need to determine the appropriate amount of resources and design redirection strategies in presence of multiple data centres to guarantee the QoE. For this problem, Haiyang and Deep (2013) propose a hierarchical modeling approach that can easily combine all components in the data centre provisioning environment. The numeric results show that their model serves as a very useful analytical tool for data centre operators to provide appropriate scalable resources. In the paper, Tian et al. (2014) describe Seagull, a system designed to facilitate cloud bursting by determining which applications should be transitioned into the cloud and automating the


movement process at the proper time. DMTF CIMI-Cloud Infrastructure Management Interface (CIMI) (2013) is a work-in-progress standardisation effort within the DMTF consortium that targets management of resources within the IaaS domain. It implements a REST interface over HTTP and defines the REST APIs for both XML and JSON rendering. CIMI attempts to provide a RESTful management interface for common IaaS components, including machines, networks, volumes, etc. Zhiming et al. (2011) present CloudScale, a system that automates fine-grained elastic resource scaling for multi-tenant cloud computing infrastructures. CloudScale employs online resource demand prediction and prediction error handling to achieve adaptive resource allocation without assuming any prior knowledge about the applications running inside the cloud. Huang et al. (2014) introduce a prediction-based dynamic resource scheduling (PDRS) solution to automate elastic resource scaling for virtualised cloud systems. Unlike traditional static consolidation or threshold-driven reactive scheduling, they both consider the dynamic workload fluctuations of each VM and the resource conflict handling problem. Lei et al. (2014) have presented AppRM, a holistic performance management tool that scales virtual machines vertically by adjusting resource settings at the individual VM level or at the resource pool level. We propose some novel cloud DC resource prediction and provisioning schemes in Wei et al. (2012) and Feifei et al. (2013). Anshul et al. (2012a) introduce a dynamic capacity management policy, AutoScale, that greatly reduces the number of servers needed in data centres driven by unpredictable, time-varying load, while meeting response time SLAs. Anshul et al. (2012b) also present SoftScale, a practical approach to handling load spikes in multi-tier data centres without having to over-provision resources. SoftScale works by opportunistically stealing resources from other tiers to alleviate the bottleneck tier, even when the tiers are carefully provisioned at capacity. SoftScale is especially useful during the transient overload periods when additional capacity is being brought online. Importantly, SoftScale can be used in conjunction with existing dynamic server provisioning policies, such as AutoScale. Hui et al. (2014) propose a hybrid cloud computing model which enterprise IT customers can base to design and plan their computing platform for hosting internet-based applications with highly dynamic workload. We also propose a novel load-aware auto scale scheme for private cloud resource management platforms in Jie et al. (2014). Open and dynamic internet environment, multi-period QoS and multiple QoS registry centres increased the difficulty of service selection. To solve difficulties above, a hybrid QoS model is introduced supporting multi-period hybrid QoS aggregating operation (Longchang and Yanhong, 2013). Peng and Ning (2014) propose a novel power-conscious scheduling algorithm for data-intensive precedence-constrained applications in cloud environments. Peng and Dongbo (2014) propose a multi-scheme co-scheduling framework

for high-performance real-time applications in heterogeneous grids.

8 Conclusions and future work

In this paper, we presented OCReM: OpenStack-based Cloud DC Resource Monitoring and Management Scheme. We improved and enhanced some functions of the OpenStack-G release to support efficient management of cloud DC resources, including modules for VM group life-cycle management, PM&VM monitoring, group-oriented auto scaling and load balancing, etc. We used OCReM as a private cloud platform to connect to Amazon EC2 public cloud platform through Boto and CloudWatch. Our achievements and future work are discussed in the following paragraphs.

Cloud resource monitoring and supervision: we integrated Nagios with our system, which provides powerful monitoring functions. At the same time, we provided another lightweight Libvirt-based way to monitor VMs, which offers basic monitoring functions and does not need any agents. The newer OpenStack Havana and Icehouse releases include Ceilometer to support resource monitoring; we are planning to integrate our monitoring function with Ceilometer in OCReM.

Auto scaling and load balancing policy module: we realised group-oriented reactive auto scaling based on real-time monitoring data. We enabled users to define thresholds to drive auto scaling actions and to decide what to do when the metrics reach those thresholds. Users can create auto scaling rules in the dashboard. Heat in the OpenStack H and I releases supports auto scaling of OpenStack clusters based on the workload. We will propose more efficient algorithms to integrate with Heat, in order to further improve auto scaling effects and fully support elastic provisioning of cloud resources.

Hybrid cloud model: we verified the OpenStack-EC2 one-to-one hybrid cloud model. At the same time, we have designed an auto scaling model from a private cloud to a public cloud. As a next step, we will design one-to-multiple hybrid cloud models based on OpenStack’s Heat module, including the federation model, broker model, multi-cloud model, peer alliance model and others.

Acknowledgements

This work is based on the Fudan-Hitachi Innovative Software Technology Joint Laboratory Project – ‘Cloud Resource Operation and Risk Management Technology’. We would like to give our sincere thanks to Hitachi’s researchers for all the support and advice. This work is also supported by 2014 to 2016 PuJiang Program of Shanghai under Grant No. 14PJ1431100.


References

Anshul, G., Mor, H.B., Ram, R. and Michael, A.K. (2012a) ‘AutoScale: dynamic, robust capacity management for multi-tier data centers’, ACM Trans. Comput. Syst., Vol. 30, No. 4, p.14.

Anshul, G., Timothy, Z., Mor, H.B. and Michael, K. (2012b) ‘SOFTScale: stealing opportunistically for transient scaling’, Proc. of Middleware 2012.

Ching, C.L., Jan, J.W., Pangfeng, L. et al. (2013) ‘Automatic resource scaling for web applications in the cloud’, Proc. of GPC 2013, pp.81–90.

Christina, D. and Christos, K. (2013) ‘QoS-aware scheduling in heterogeneous datacenters with paragon’, ACM Trans. Computing Syst., Vol. 31, No. 4, p.12.

Cloud Infrastructure Management Interface (CIMI) (2013) DMTF DSP0263, August [online] http://dmtf.org/standards/cmwg (accessed 12 October 2014).

Feifei, Z., Jie, W. and ZhiHui, L. (2013) ‘PSRPS: a workload pattern sensitive resource provisioning scheme for cloud systems’, Proc. of IEEE SCC 2013, IEEE Computer Society, pp.344–351.

Haiyang, Q. and Deep, M. (2013) ‘Data center resource management with temporal dynamic workload’, Proc. of IM 2013, pp.948–954.

Huang, Q., Shuang, K., Xu, P., Li, J., Liu, X. and Su, S. (2014) ‘Prediction-based dynamic resource scheduling for virtualized cloud systems’, Journal of Networks, Vol. 9, No. 2, pp.375–383.

Hui, Z., Guofei, J. et al. (2014) ‘Proactive workload management in hybrid cloud computing’, IEEE Transactions on Network and Service Management, Vol. 11, No. 1, pp.90–100.

Jie, B., ZhiHui, L., Jie, W. et al. (2014) ‘Implementing a novel load-aware auto scale scheme for private cloud resource management platform’, Proc. of the 2014 IFIP/IEEE Network Operations and Management Symposium (NOMS 2014).

Lei, L., Xiaoyun, Z., Rean, G., Pradeep, P. et al. (2014) ‘Application-driven dynamic vertical scaling of virtual machines in resource pools’, Proc. of NOMS 2014.

Longchang, Z. and Yanhong, Y. (2013) ‘Dynamic web service selection group decision-making method based on hybrid QoS’, International Journal of High Performance Computing and Networking, Vol. 7, No. 3, pp.215–226.

Masum, Z.H., Edgar, M., Alexander, C., Lew, T. and Gudreddi, S.L.D. (2012) ‘Integrated and autonomic cloud resource scaling’, Proc. of 2012 IEEE Network Operations and Management Symposium – NOMS 2012, pp.1327–1334.

Nagios, Nagios – The Industry Standard in IT Infrastructure Monitoring [online] http://www.nagios.org/ (accessed 10 October 2014).

OpenStack [online] http://www.openstack.org/ (accessed 11 October 2014).

Peng, X. and Dongbo, L. (2014) ‘Multi-scheme co-scheduling framework for high-performance real-time applications in heterogeneous grids’, Int. J. of Computational Science and Engineering, Vol. 9, Nos. 1/2, pp.55–63.

Peng, X. and Ning, H. (2014) ‘A novel power-conscious scheduling algorithm for data-intensive precedence-constrained applications in cloud environments’, IJHPCN, Vol. 7, No. 4, pp.299–306.

Tian, G., Upendra, S., Prashant, S. et al. (2014) ‘Cost-aware cloud bursting for enterprise applications’, ACM Transactions on Internet Technology (TOIT), May, Vol. 13, No. 3, pp.1–24.

Wei, F., ZhiHui, L., Jie, W. and ZhenYin, C. (2012) ‘RPPS: a novel resource prediction and provisioning scheme in cloud data center’, Proc. of IEEE SCC 2012, IEEE Computer Society, pp.609–616.

Zhiming, S., Sethuraman, S. et al. (2011) ‘CloudScale: elastic resource scaling for multi-tenant cloud systems’, Proc. of SoCC 2011.