Proceedings of the CMG India 2016 3rd Annual Conference
Mumbai, India
Table of Contents

Foreword: CMG India's 3rd Annual Conference ........ 3

Best Paper Awards at CMG India 2016 ........ 5

Automating Web Application Performance Testing – Challenges and Approaches
Bishnu Priya Nanda et al. ........ 6

Survey of Big Data Frameworks for Different Application Characteristics
Praveen Kumar Singh et al. ........ 16

Performance Engineering Best Practices for Data Warehouse Applications
Gaurav Kodmalwar et al. ........ 21

Performance Monitoring and Management of Microservices on Docker Ecosystem
Sushanta Mahapatra et al. ........ 31

Performance Testing and Tuning of a Web Based Enterprise Scale High Volume Accounts Payable Processing System
Seji Thomas et al. ........ 42

The Docker Container Approach to Build Scalable and Performance Testing Environment
Pankaj Rodge et al. ........ 53

Conflux: Distributed, Real-Time Actionable Insights on High-Volume Data Streams
Vinay Eswara et al. ........ 57

DB Volume Emulator
Rekha Singhal et al. ........ 66

Leveraging Human Capacity Planning for Efficient Infrastructure Provisioning
Shashank Pawnarkar ........ 71

StaticProf: Tool to Statically Measure Performance Metrics of C, C++, Fortran Programs
Barnali Basak ........ 79

Performance Anomaly Detection & Forecasting Model (PADFM) for eRetailer Web Application
Ramya Ramalinga Moorthy ........ 87

Performance Comparison of Virtual Machines and Containers with Unikernels
Nagashree N et al. ........ 99
Foreword: CMG India's 3rd Annual Conference
Abhay Pendse
President, CMG India
CMG India completed 3 years in 2016, and in that time we have come a long way: 25+ regional events and 3 annual conferences in 3 years. This is certainly a significant achievement in a very short period of time.
CMG India continues to offer a great knowledge-sharing platform in the performance engineering and capacity management area, and our growing support from industry and academia, with professionals like you, is a testimony to the same.
We had strong support and presence from all of you at the first two annual conferences - 2014 in Pune and 2015 in Bengaluru. For our 3rd annual conference we chose Mumbai, India's financial capital, as the venue. Our 3rd Annual conference in Mumbai too had an excellent response and was attended by 125+ delegates from across the software services and products industry.
We had two major tracks at the 3rd Annual conference: a technical paper track and a tutorials track.
We had 17 paper and 4 tutorial submissions this year. Our technical program committee, headed by Amol Khanapurkar and co-chaired by Dheeraj Chahal, did an excellent job of screening the papers to select the top 12 paper submissions for the conference.
The technical paper sessions this year were around four broad areas: “Data Engineering”, “Performance Modelling”, “Performance Testing” and “Structural Analysis”. The paper sessions had some top notch technical content, aided by our 6 keynote sessions with topics from “SBI (State Bank of India) Digitization in scale and robustness” to “NSE (National Stock Exchange) Journey over 2 decades”, from “Holistic Optimization of Data Access from Imperative Programs” to “Accelerating Deep Learning and Machine Learning to a New Level”, and from “Rise and Rise of e-Governance in India” to “Architecture of a Real-Time Operational DBMS”.
In the tutorials track we had excellent industry and academia participation, with topics ranging from performance modelling and trustworthy engineering to GPU acceleration and application performance with vectorization.
Thanks to all our keynote speakers - Mr. Shiv Kumar Bhasin, CTO, State Bank of India (SBI); Prof. S. Sudarshan, Dept. of Computer Science and Engg, IITB; Mr. Naveen Gv and Mr. Austin Cherian, Intel Corporation; Mr. G M. Shenoy, CTO Operations, National Stock Exchange (NSE); Mr. Dharmesh Parekh, Senior Vice President, NSDL; and Dr. Srini V. Srinivasan, Founder and Chief Development Officer, Aerospike - and our invited tutorial talk speaker Dr. Rajesh Mansharamani for their time and sessions.
CMG India being a virtual entity with a limited budget, we initially had some challenges in finding the right venue for the Mumbai conference. Our sincere thanks to Tata Consultancy Services (TCS) for agreeing to host the conference at no cost. In addition, our sincere thanks to our sponsors - Intel (Principal Partner), STPI (Diamond Partner) and TCS (Gold Partner) - for the event. Without sponsor support the conference would not have been possible.
The organizing committee, headed by Manoj Nambiar and supported by Mehul Vora and other CMG India members Milind Hanchinmani, Suresh Babu, Prashant Ramakumar and Adarsh Jagadeeshwaran, working with Ismail Emkay and his team from the event management company “Think Happy Everyday”, did an excellent job in ensuring smooth organization of the conference. Kudos to our technical program committee, headed by Amol Khanapurkar and co-chaired by Dheeraj Chahal, and the other TPC members for ensuring good technical paper and tutorial content for the conference. Hats off to their commitment to the conference.
As we enter 2017, we would like to continue this momentum and organize more regional events and conferences with top notch technical content. CMG India is an all-volunteer organization, and the success of such an organization depends on all our members like you. I hope that with your strong support CMG India will continue to grow in the coming years. Thank you.
Conference Technical Program Committee: Amol Khanapurkar (Chair), Dheeraj Chahal (Co-Chair), Abhay Pendse, Adarsh Jagadeeshwaran, Bhargav Lavu, Maheshgopinath Mariappan, Manoj Raghavendra, Milind Hanchinmani, Prajakta Bhatt, Prashant Ramakumar, Sundarraj Kaushik, Suresh Kumar YK

Conference Organizing Committee: Manoj Nambiar (Head), Mehul Vora, Abhay Pendse, Milind Hanchinmani, Suresh Babu, Amol Khanapurkar, Adarsh Jagadeeshwaran
Best Paper Awards at CMG India 2016
Overall, 12 papers were accepted and 9 papers were offered a speaking slot of 40 minutes each. The jury, headed by Mr. Prashant Ramakumar, identified the following papers as the Best Papers.
Best Paper
Conflux: Distributed, real-time actionable insights on high-volume data streams
(Vinay Eswara, Jai Krishna and Gaurav Srivastava, VMWare)
2nd Best Paper
Performance Anomaly Detection & Forecasting Model for eRetailer Web Application
(Ramya Ramalinga Moorthy, Elitesouls.in)
3rd Best Paper
DB Volume Emulator
(Rekha Singhal and Amol Khanapurkar, TCS)
The first 2 CMG India Annual Conference paper award winners get direct entry into the CMG US conference in a subsequent year.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement
Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
AUTOMATING WEB APPLICATION PERFORMANCE TESTING – CHALLENGES AND APPROACHES
Bishnu Priya Nanda
TCS Product Trustworthy CoE - Component Engineering Group, Tata Consultancy Services Ltd, IT/ITES Special Economic Zone, Plot 35, Chandaka Industrial Estate, Patia, Chandrasekharpur, Bhubaneswar - 751024, Orissa, India, Ph: +91 0674 6645333

Seji Thomas
TCS Product Trustworthy CoE - Component Engineering Group, Tata Consultancy Services Ltd, 1st Floor TCS Center, Infopark, Kakkanad, Kochi - 682020, Kerala, India, Ph: +91-484-6187568

Mohan Jayaramappa
TCS Product Trustworthy CoE - Component Engineering Group, Tata Consultancy Services Ltd, SJM Towers, Gandhinagar, Bangalore - 560009, Karnataka, India, Ph: +91-080-67247967
1 ABSTRACT
With the rapid adoption of Agile and DevOps in software development, performance testing done the manual way brings in an impedance mismatch that slows down development progress. To make performance testing agile, we first need to automate it and integrate it with the development and testing toolsets. Making performance testing automatable is not straightforward; in this paper we describe the challenges one faces and how we addressed them to build a successful performance testing automation that brings in significant savings in time and effort.
2 INTRODUCTION
Performance testing is an important part in product development. In traditional software development models using the waterfall method, performance testing is planned after development is complete and, preferably, after functional testing is finished. Performance testing projects are technically complex. They often require expenses on infrastructure and can take several weeks. Moreover, a large part of the time is taken by preparatory work, such as setting up the test environment, developing load scripts and emulators, and preparing test data.
In practice, software development using traditional methods ends on time in less than 30% of cases, so performance testing is frequently cancelled to minimize delays to releasing the software. This often leads to application failure under load, slow execution times, and the operations department’s inability to predict the system’s behavior in the future when the load and volume of data will grow.
Many development teams are beginning to adopt agile methodologies in order to overcome the barriers of traditional methods and release high-quality products on time. Agile makes it possible to significantly shorten development cycles - now they are counted in days. As a result, many organizations and independent development teams are wondering what to do with performance testing, which looks awkward and impedes the idea of short cycles. As in traditional models, this is often the reason why performance testing is not performed, and the reason things end with the same negative results. Automated performance testing enables the rapid release of software with high performance: teams no longer need to wait months to receive performance test results. Automation fits into cycles of continuous software integration and delivery, helping teams receive constant feedback and optimize software performance during current sprints.
3 ABBREVIATIONS
Sl No. Acronyms Full form
1 CPU Central Processing Unit
2 CV Co-efficient of Variance
3 DB Data Base
4 GC Garbage Collection
5 IO Input Output
6 NFR Non Functional Requirement
7 PT Performance Testing
8 RAM Random Access Memory
9 SLA Service Level Agreement
4 NEED FOR PERFORMANCE TESTING AUTOMATION
The following are key reasons for automating PT:
• During design and implementation, many big and small decisions are made that may affect application performance - for good and bad. But since performance can be ruined many times throughout a project, good application performance simply cannot be added as an extra activity at the end of a project. If you have load tests carried out throughout your project development, you can proactively trace how performance is affected.
• Without automation, the process of collecting measurement data and validating it manually is laborious, which reduces productivity and increases turnaround times.
• Through automation, none of the steps done manually will be missed, and the time lost due to tester fatigue will be minimal.
• Developers can get immediate feedback, since they are not heavily dependent on the performance tester.
• Test run comparisons that are automated for making high level inferences, with side by side display of details, help to review and assess differences in performance between different versions of the application.
5 CHALLENGES IN PERFORMANCE AUTOMATION
Unlike in functional test automation, there are multiple entities involved in a load test that need to be
monitored and from where metrics must be collected. These then need to be evaluated against target
metrics and a decision taken on pass/fail. Thus, we may have to collect metrics from the Web Server,
Application Server, DB server, utilization of network, CPU & RAM on each of these servers. Once
collected, the right inferences need to be made by the automation tool.
When performance testing is done manually, metrics such as CPU utilization are charted to visually inspect trends and behavior. The performance tester, on reviewing these charts, decides based on judgment whether things look normal. In contrast, in automated performance testing, we need to identify the right statistical measures upfront, compute them, and apply inference rules on these measures.
6 APPROACH TO AUTOMATE THE PERFORMANCE ASSESSMENT
We used JMeter for generating load tests, used standard tools present in the environment, and automated the evaluation of the following metrics:
• Screen response times as captured by the JMeter load test
• CPU & RAM utilization at the OS level
• GC log analysis of the JVMs
The load test scripting and sanity check remain manual, as usual.
We used Jenkins to orchestrate the automation and created separate python programs to process the different types of metrics. For each metric, a ‘Limit’ file is manually created that specifies the target to be achieved for that metric. At the end of the test, the python program evaluates the observed values against the Limit file to decide whether the metric has passed. Finally, a separate python program evaluates all the different metric results and generates a report on the overall load testing pass/fail. We can extend this approach to calculate a performance score by giving a weightage to each of the metric types and computing a final score at the end. This will help in tracking improvements over multiple runs of the tests (whether scores are improving or not).
Subsequent sections describe the different metrics, the challenges and how the pass/fail inference is made.
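As an illustration of this final aggregation step, the following sketch combines per-metric pass/fail results into an overall verdict and a weighted score. The metric names and weights here are assumptions for illustration, not the paper's actual implementation:

```python
# Illustrative sketch of the final evaluation step: combine the per-metric
# pass/fail outputs of the individual python programs into an overall
# verdict, plus the weighted performance score suggested in the text.
# Metric names and weights below are assumptions, not from the paper.

metric_results = {          # True = metric passed its Limit file check
    "response_time": True,
    "cpu_ram": False,
    "gc": True,
}

weights = {                 # assumed weightage per metric type (sums to 1.0)
    "response_time": 0.5,
    "cpu_ram": 0.3,
    "gc": 0.2,
}

def overall_verdict(results):
    """Overall load test passes only if every metric passed."""
    return "Pass" if all(results.values()) else "Fail"

def performance_score(results, weights):
    """Weighted score in [0, 1] for tracking trends across test runs."""
    return sum(weights[m] for m, ok in results.items() if ok)

print(overall_verdict(metric_results))   # prints "Fail": cpu_ram failed
print(round(performance_score(metric_results, weights), 2))
```

A weight file, like the Limit files, could be maintained per application so that the score reflects which metric types matter most for that workload.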
6.1 Automating the application transaction response time report
Response time automation helps to measure the application page response times during load testing.
High level steps in creating the automation are:
• JMeter scripting is done manually as usual
• Manually create a ‘Limit’ file whose contents vary as per the application requirements. A Limit file contains the response time SLAs for the pages of an application, considering the maximum concurrent users accessing at peak time.
Example of response time Limit file is:
Page/Txn | SLA (in ms) | Concurrent users
Orders | 2000 | 10
Payment By Cash | 3500 | 10
Payment By Card | 2500 | 30
Jenkins is configured to orchestrate the following:
• Start load testing by initiating JMeter through its command line option.
• At the end of the JMeter run, start the python program to evaluate the result. This converts the JMeter raw result to a readable format, compares it with the limit file, and decides whether the response time metric passes or fails.
• If any transaction encounters errors or crosses its SLA, the python program marks the overall PT response time result as fail; otherwise the overall response time result is marked pass.
• The final PT dashboard shows the error % in the final JMeter result and the JMeter script coverage, i.e. what percentage of the transactions listed in the limit file were scripted and tested.
Example of Output produced by the python program for response time is:
Page/Txn | Request Name | SLA (in ms) | Concurrent users | Samples | Min | Avg | 90th Percentile | Max | Pass/Fail
Orders | Orders:64/ABC-WebServices/rest/lookup/itemLookUp | 2000 | 10 | 10966 | 200 | 1005 | 1300 | 1980 | Pass
Payment By Cash | PaymentByCash:7/ABC-WebServices/rest/salereturn/addSaleLineItem | 3500 | 10 | 3071 | 450 | 2057 | 2640 | 3020 | Pass
Payment By Card | PaymentByCard:6/ABC-WebServices/rest/transaction/beginSalesTransaction | 2500 | 30 | 15679 | 245 | 2047 | 2356 | 3457 | Pass
Challenges in Automation and how we addressed them:
• With the default JMeter settings the result comes in XML format without a header, which is difficult to parse and compare with the response SLAs mentioned in the limit file. A few property changes are required in JMeter to get the response time result in CSV format so that it can be parsed and compared with the SLAs easily using the python script.
• For validating transaction response times against the limit file, each and every page response cannot be compared against the limit individually. Also, the same page request can come from various transactions and so will appear multiple times in the JMeter report. The python program therefore summarizes the page responses across multiple occurrences into Min, Max, Average and 90th percentile values. It then picks the 90th percentile value from the summarized data and compares it with the SLA mentioned in the limit file.
• In manual analysis we can ignore some requests having a small error percentage, but in automation only a transaction with zero percent errors is considered for comparison with the SLA.
Note: JMeter, an open source tool, is used for load testing here. JMeter can perform load tests, performance-oriented business (functional) tests, regression tests, etc., on different protocols or technologies. By default the JMeter command line result is produced in XML format. For ease of processing, we used the CSV format by setting the appropriate command line options.
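To make the evaluation step concrete, here is a minimal sketch of the kind of python program described above: it summarizes a JMeter CSV result per transaction label and compares the 90th percentile against the SLA from the limit file. The column names follow JMeter's default CSV output; the function names and the in-memory limit representation are illustrative, not the paper's actual code.

```python
# Sketch: summarize JMeter CSV results per transaction and compare the
# 90th percentile response time with the SLA from the Limit file.
# Assumes JMeter's default CSV columns 'label', 'elapsed' (ms), 'success'.
import csv
from collections import defaultdict

def percentile_90(values):
    """90th percentile by the nearest-rank method on sorted samples."""
    values = sorted(values)
    idx = max(0, int(round(0.9 * len(values))) - 1)
    return values[idx]

def evaluate(jtl_path, limits):
    """limits: {transaction label: SLA in ms}. Returns {label: 'Pass'/'Fail'}."""
    samples = defaultdict(list)   # elapsed times per transaction label
    errors = defaultdict(int)     # error counts per transaction label
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            samples[row["label"]].append(int(row["elapsed"]))
            if row["success"] != "true":
                errors[row["label"]] += 1
    results = {}
    for label, sla in limits.items():
        if label not in samples:
            results[label] = "Fail"   # transaction not covered by the script
        elif errors[label] or percentile_90(samples[label]) > sla:
            results[label] = "Fail"   # errors or SLA breach fail the metric
        else:
            results[label] = "Pass"
    return results
```

Per the paper's rule, a single failing or erroring transaction would then mark the overall response time metric as Fail.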
6.2 Automating the CPU and Memory utilization of the servers
Along with Response Times measurement, it is essential to monitor the CPU and memory usage of the servers and ensure usage is within limits.
High level steps in creating the automation are:
• Manually create a utilization ‘Limit’ file for each server (e.g. Web Server, App Server, DB etc.) that is part of the load test; its contents are typically fixed for most applications. The Limit file contains the upper limits of CPU and RAM utilization for the application, considering the maximum concurrent users accessing at peak time.
• This limit file also specifies the statistical measure ‘coefficient of variance’, which typically should be 10%.
Example of the Resource Utilization limit file:
Measure | Limit | Remarks
CPU | 77 | Upper limit in %age
RAM | 77 | Upper limit in %age of total memory
C.V. | 10 | Coefficient of Variance = (Standard Deviation / Average)
Jenkins is configured to orchestrate the following:
• Execute the Vmstat command on each of the Linux servers in the background before the load testing.
• Once the load test is over, start the python program to analyze the Vmstat output.
• The python program calculates the min, max, average, standard deviation and coefficient of variance (C.V.) for the CPU and RAM Vmstat values. The metric is considered passed if the calculated Average < Limit and C.V. < Limit C.V.
• This can be similarly extended to measure and evaluate IO times as well.
Example of Output produced by the python program for server utilization is:
Measure | Limit | Min | Avg | Max | StdDev | C.V. | Pass/Fail | Overall Pass/Fail
CPU | 77 | 26 | 35.74 | 96 | 20.64 | 57.75 | Fail | Fail
RAM | 77 | 97.12 | 97.54 | 97.61 | 0.1 | 0.1 | Fail | Fail
Challenges in Automation and how we addressed them:
• Automatically transferring the Vmstat files from the application servers to the Jenkins server for analysis requires passwordless authentication between the servers, so we used certificates to authenticate over SSH.
• In manual graphical analysis of server utilization results, it is easy to spot peaks and idle periods and, based on experience, judge whether they are acceptable. For this to work in the automation program, we identified the statistical measure coefficient of variance (C.V.) as a good indicator of the overall utilization. C.V. is the standard deviation divided by the average. A C.V. limit of 10 was found to give satisfactory results; lower values indicate that the variation in utilization is more even and not bunched up.
Note: Vmstat is a Linux command to report information about processes, memory, paging, IO and CPU activity.
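A minimal sketch of this Vmstat analysis step might look like the following. The column position assumes vmstat's default output layout, the limit values follow the example Limit file above, and the function names are illustrative:

```python
# Sketch: derive CPU busy % from vmstat samples and apply the Limit file
# checks (average under the limit, and C.V. = stddev/average under 10%).
import statistics

def cpu_busy_from_vmstat(lines):
    """Extract CPU busy % (100 - idle) from raw vmstat output lines.
    Assumes vmstat's default layout where 'id' (idle %) is the 15th column."""
    busy = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 16 and fields[0].isdigit():   # skip header lines
            busy.append(100 - int(fields[14]))
    return busy

def evaluate_utilization(samples, avg_limit=77.0, cv_limit=10.0):
    """Pass only if both the average and the C.V. (in %) are under limits."""
    avg = statistics.mean(samples)
    cv = 100.0 * statistics.pstdev(samples) / avg if avg else 0.0
    return "Pass" if avg < avg_limit and cv < cv_limit else "Fail"
```

With the paper's CPU example (average 35.74, C.V. 57.75), the C.V. check fails even though the average is under the 77% limit, matching the Fail verdict in the example output.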
6.3 Automating the GC log analysis
Improper GC settings can cause a lot of response time issues and hence need to be tuned well. GC analysis also helps in identifying memory leaks.
High level steps in creating the automation are:
• Manually create a GC Limit file containing the number of Full GCs allowed for the duration of the load test, the GC throughput, and the ‘Slope after Full GC’. This limit file is typically constant for all types of applications.
• Install GCViewer, an open source GC log analysis tool.
Example of the GC Limit file:
Measure | Limit | Remarks
MinTotalTime | 3600 | Seconds
MinFullGCCount | 0 | Min Full GCs that must be present in entire test
MaxFullGCCountPerHour | 10 | Max number of Full GCs per hour
MinThroughput | 95 | In %age
AvgMemAfterFullGC | 5 | % of total memory. If it is below this limit then the overall result is always made PASS
MaxSlopeAfterFullGC | 30000 | Bytes/second increase in used memory after Full GC (i.e. slope of the min used-memory values after Full GCs)
Jenkins is configured to orchestrate the following:
• After the load test is complete, initiate GCViewer to convert the raw GC log file into a GC summary report.
• Start the python program to evaluate the GC output. This compares the GC summary output (e.g. Full GC count, max slope after Full GC, freed memory after Full GC) with the GC Limit file and creates a GC statistics output file marking the result as Pass or Fail.
Example of Output produced by the python program for GC is:
Measure | Limit | Actual | Result | Overall Result | Remarks
MinTotalTime | 3600 | 57506 | Pass | Pass | Seconds
MinFullGCCount | 0 | 0 | Pass | Pass | Min Full GCs that must be present in entire test
MaxFullGCCountPerHour | 10 | 2 | Pass | Pass | Max number of Full GCs per hour
MinThroughput | 95 | 98 | Pass | Pass | In %age
AvgMemAfterFullGC | 5 | 0 | Pass | Pass | % of total memory. If it is below this limit then the overall result is always made PASS
MaxSlopeAfterFullGC | 30000 | 0 | Pass | Pass | Bytes/second increase in used memory after Full GC (i.e. slope of the min used-memory values after Full GCs)
Challenges in Automation and how we addressed them:
• The GC log file format changes as per the GC algorithm used, making it complex to process. We used the GCViewer tool as an intermediate step before analyzing.
• In manual GC analysis, the tester reviews the graphical trends of the GC behavior to decide on leaks and potential issues. For the automation tool, we evaluated many parameters such as the number of GCs, GC throughput etc., and finally settled on the parameter MaxSlopeAfterFullGC (i.e. the maximum memory slope between two Full GCs). If the slope shows a steep rise after two Full GCs and reaches the max limit, then there is a high probability of a memory leak.
Note: GC is the automatic memory management done by the JVM. The JVM reclaims memory/garbage occupied by objects that are no longer in use by the application by performing GC.
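The slope heuristic described above can be sketched as a simple least-squares fit over the used-memory values recorded immediately after each Full GC. The 30000 bytes/second limit follows the example Limit file; the data points and function names are illustrative:

```python
# Sketch of the MaxSlopeAfterFullGC heuristic: fit the trend of used memory
# immediately after each Full GC; a steep positive slope suggests a leak.
# Timestamps are in seconds, memory in bytes.

def slope_after_full_gc(points):
    """points: [(t_seconds, used_bytes_after_full_gc), ...].
    Returns the least-squares slope in bytes/second."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_m = sum(m for _, m in points) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den if den else 0.0

def gc_leak_check(points, max_slope=30000.0):
    """Pass if post-Full-GC memory is not rising faster than the limit."""
    return "Pass" if slope_after_full_gc(points) <= max_slope else "Fail"
```

A flat post-Full-GC memory trend passes, while memory that keeps climbing after every Full GC (i.e. is never fully reclaimed) fails the check, flagging a probable leak.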
7 RUNNING AUTOMATION STEPS USING JENKINS
Jenkins is an open source automation server. With Jenkins, organizations can accelerate the software development process through automation. Jenkins manages and controls development lifecycle processes of all kinds, including build, document, test, package, stage, deployment, static analysis and many more.
A new job needs to be created in the Jenkins server for performance testing. The configuration required for performance automation is done in the Configure section of that job. The steps that need to be configured serially in Jenkins are:
• Configure the Vmstat command to run on all the servers used for application deployment, before load testing.
• Configure the steps involved in connecting to the system where the JMeter setup is done, executing the load test, capturing the result and transferring it to the Jenkins server for parsing and analysis.
• Transfer the Vmstat reports from the hosted servers to the Jenkins local disk for analysis.
• Transfer the GC log from the hosted application server to the Jenkins local disk for parsing and analysis.
Note: Here the configuration is a one-time activity for an application.
The automation steps that run once you trigger or schedule the job are:
• Reboot all the servers involved in application deployment.
• Restart the Application server, DB server and Web server involved in executing the application.
• Run the Vmstat command on each of the servers.
• Trigger load testing through JMeter.
• Transfer the response time result to the Jenkins server.
• Execute the python program to evaluate the JMeter output.
• Transfer the Vmstat output from the App/DB/Web servers to the Jenkins server.
• Execute the script to parse the server utilization result from the Vmstat report, compare it with the limit file meant for server utilization, and produce a server utilization result as pass/fail.
• Transfer the GC log from the Application server to the Jenkins server.
• Execute the script to parse the GC log, compare it with the limit file meant for GC, and produce a GC result as pass/fail.
• Execute a final script to read all the output results; if all results are pass then the performance testing result is declared as pass, otherwise fail.
• The final performance result is displayed in the application's dashboard in Jenkins.
The automated PT process can run as and when required by the application team.
8 DEPLOYMENT DIAGRAM OF PERFORMANCE TESTING AUTOMATION SETUP IN JENKINS SERVER
Steps involved in the setup are:
• The Jenkins server connects with the Linux servers (App server, DB server or Web server) involved in application deployment over SSH remote host, through the configuration management feature of Jenkins.
• The Jenkins server connects with the Windows system where JMeter is hosted using the master-slave feature, where the Jenkins server acts as the master and the Windows system acts as the slave. For this master-slave configuration, the Jenkins Slave Agent needs to be downloaded and installed on Windows.
9 BENEFITS OF AUTOMATING THE PERFORMANCE TESTING
Automated performance testing helps:
• In saving the effort and time spent on manual performance test setup before load testing.
• In avoiding the time spent on manual analysis of each test result to declare the assessment as pass or fail.
• In analyzing and mitigating performance issues before they have a chance to derail the system at the last moment or in production.
• By automating performance testing, organizations can catch performance issues early, before they become much more costly to fix, at any time during the agile cycle. Developers can instantly know that a new feature in the latest build has caused the application to no longer meet SLAs. They can fix the problem then and there, before it becomes exponentially more expensive.
• Automated performance testing helps ensure users get new features, not new performance issues. Integrating performance testing into the continuous build and integration process in DevOps, before new changes are deployed to production, will ensure that a new feature can meet the peak load condition and will not show any new performance issues in production.
10 LIMITATIONS IN PT AUTOMATION
The basic limitations in automating PT are:
• To configure the application servers with Jenkins, passwordless authentication must be available between the Jenkins server and the other servers, which enables automatic file transfer when the process is triggered.
• The limit files need to be created manually for each application, and the thresholds changed manually with changes in workload.
• The JMeter scripts need to be created and edited manually for changing workloads.
• If there is any build change for the application, the JMeter scripts have to be validated manually.
11 PLANNED WORK FOR FUTURE IN PT AUTOMATION
The automation activities planned for future execution in PT are:
• Automatic analysis of thread dumps and alerting the user in case of blocked threads in the application.
• Server utilization automation for Windows systems, considering the different performance counters in the PerfMon tool.
• A script to automatically change the JMeter script with changes in concurrent user load per thread.
• Configuration in Jenkins so that the PT job is triggered automatically and the scripts are validated with changes in the build version.
12 CONCLUSION
Agile development and Continuous Testing are increasing the productivity of teams and the quality of
applications. When adding load and performance testing into those processes, careful planning should occur to
ensure that performance is a priority in every iteration. The goal of load testing at the speed of Agile is to deliver
as much value to the users of an application through evolving features and functionality while ensuring
performance no matter how many users are on the app at any one time.
A load or performance testing solution for agile development using Jenkins helps:
• Both developers and performance testers to design and edit testing scenarios quickly, just by manually creating the test scripts and placing them in JMeter.
• Everyone on the development team to share test scenarios and test results.
• Easily integrate with new servers required for application deployment, and display result trends over several builds for regression testing.
• Produce reports that everyone on the team can understand and act on.
• Run tests that resemble the real world by accounting for users' network conditions and geographic locations.
13 ACKNOWLEDGEMENTS
Our sincere thanks to Mr. Mohan Jayaramappa, Head, SPD Trustworthy CoE, TCS, for encouraging us to write this white paper. Also a note of thanks to Debiprasad Swain, Head, Product Engineering SEG, and Pitabasa Sa, Product Manager of IPRMS and Automation in CEG, for helping us take up this experiment and concept.
Survey of Big Data Frameworks for Different Application Characteristics

Praveen Kumar Singh
TCS Research, Mumbai, India
Email: [email protected]

Rekha Singhal
TCS Research, Mumbai, India
Email: [email protected]
Abstract—Nowadays applications are migrating from traditional 3-tier architectures to Big data platforms, which are widely available in open source and can do parallel data processing on clusters of commodity machines. The challenge is to choose the “right” available Big data framework for an application given the available features of the framework. We have proposed a rule base methodology for making this choice. We first categorize an application into 4 major categories - iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications - based on their features. The proposed rule base stores the level of support given by popular Big data frameworks for each of the application's features. We present the rule base for each of these application categories in this paper.
Keywords—SQL, NoSQL, Machine Learning, Rule Base, Hadoop, Spark, Flink, Iterative Computation, Streaming, Graph, Framework
I. INTRODUCTION
In the world of Big data, enormous volumes of data, on the order of petabytes and yottabytes, are generated by sources such as social networks, satellites, sensors, user devices and search engines. To analyze this data and extract useful information from it, large-scale data processing frameworks have been developed. The open source community provides tools to implement and execute parallel, complex and scientific applications, most of which need parallel and distributed computation. Hadoop is not suitable for iterative computation, yet most scientific computations and machine learning algorithms require iteration. Spark[13], Twister[7], Haloop[6] and Flink[1] have emerged as alternatives to the MapReduce framework, providing better support for applications that require iterative computation.
These frameworks are parallel and distributed in nature: programmers can distribute their data across a cluster and execute their tasks in parallel over the data present in the cluster.
Frameworks like Hadoop and Twister involve disk I/O at each step, whereas Spark reduces I/O activity through its in-memory model. Performance does not depend only on the framework but also on the application, so applications need to be categorized as I/O intensive or CPU intensive.
There are plethora’s of Big data frameworks available forexecuting various types of applications. There is always achallenge for a user to choose the right Big data processing
platform for his/her problem statement. This paper providesa mapping of different type of applications and workloadto possible set of data processing frameworks. We furthercategorize application based on their features such as speed,latency, dependency, language etc. For the perspective of userswe are trying to put comparisons among different computationframework. In order to help users to choose appropriateframework, which can be used to process their real worldproblem. We are categorizing the application to the differentframework based on their suitability and features. In the cat-egorization, we will focus on the application classes like Ma-chine Learning and Iterative computation based Application,SQL Query based Application, Online Transaction ProcessingApplication (OLTP) and Streaming based Application. Someof the open source Big data frameworks like Hadoop[3],Spark[13], Flink[1] with their pros, cons and Rule base forchoosing these frameworks. The contribution of this paper is todefine features for different types of applications and matchingit to appropriate Big data framework based on several studiesavailable in the literature.
The rest of the paper is organized as follows. In the next section we discuss the various application classes; Section 3 explains the different Big data frameworks; Section 4 presents the rule base for choosing these frameworks; and Section 5 concludes.
II. APPLICATION CLASSES
All domains, such as Banking, Media, Health care, Education, Manufacturing, Insurance, Government, Retail and Transportation, face the challenge of processing Big Data. Most of their applications involve the following types of use cases for processing large data sizes.
1) Knowledge Discovery in Databases: applications that identify hidden data or extract useful information from given data, applicable to identifying current market trends and gaining more insight into customer data.
2) Fraud Detection and Prevention: used to identify fraudulent cases, which happen in every sector, so that they can be prevented before they occur.
3) Device-generated Data Analytics: enormous volumes of data are generated by sensors, remote devices, mobiles and satellites. The generated data can be used for weather forecasting and to provide intelligence based on location.
4) Social Networking and Relationship Analytics: social networking sites generate huge amounts of data every second that need to be analyzed in real time, with correct identification of relationships, and the results returned to the user.
5) Recommendation-based Analytics: in the digital world users spend most of their time on their devices; this data needs to be analyzed so that useful information can be recommended to the user, keeping users engaged through recommendations.
In this paper we provide a rule base for choosing a framework, categorizing applications broadly into iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications.
A. Machine Learning and Iterative Computation based Applications
Machine Learning is a part of Artificial Intelligence in which a machine is trained with past data to make decisions about the future; we apply an algorithm to the data to obtain an output. To apply machine learning, a few steps need to be followed:
1) Collect data from the various sources.
2) Prepare the data well by analyzing it and handling outliers and missing values.
3) Split the data into training and testing parts to build the model and test it.
4) Check the accuracy of the model by evaluating the algorithm's outcome.
5) Improve performance where possible, for example by selecting a different model altogether.
Supervised learning is also known as task-driven learning, in which a model is learned from past data to make future predictions. Learning can be defined through the class attribute: if the attribute is discrete the task is classification, and if the attribute is continuous it is regression. Decision Trees, SVMs and regression fall into this category. Unsupervised learning is used for cluster analysis on a given input data set with the help of some objective function. The well-known K-Means clustering algorithm and Principal Component Analysis (PCA) come under this category.
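As an illustration of the unsupervised case mentioned above, the K-Means loop (assign points to the nearest centroid, then recompute centroids as cluster means) can be sketched in a few lines of plain Python. This is a minimal pedagogical sketch, not the implementation used by any of the surveyed frameworks; the data points and cluster count are invented for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-Means on 2-D points: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                          + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster emptied out
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# two well-separated blobs: centroids should land near (0,0) and (10,10)
pts = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.1),
       (10.1, 9.9), (9.8, 10.2), (10.0, 10.0)]
cents = sorted(kmeans(pts, 2))
print(cents)
```

At Big data scale, frameworks such as Spark MLlib run the same assign/recompute iteration in parallel over a partitioned data set, which is exactly why iteration support matters in the rule base below.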
B. SQL Query based Application
Big data frameworks provide functionality for users to run and analyze queries on data stored on the Hadoop Distributed File System (HDFS) or S3. Hive on Hadoop MapReduce, SparkSQL on Spark and many others provide APIs for dealing with SQL queries. Through these frameworks, data can be viewed and analyzed with basic queries such as aggregations and joins, even on the fly for business purposes.
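The aggregation-plus-join query shape these engines target can be illustrated with the stdlib sqlite3 module standing in for a SQL-on-Hadoop engine; the table and column names are invented for the example, and the same SQL would run (with dialect differences) on Hive or SparkSQL over HDFS/S3 data.

```python
import sqlite3

# sqlite3 stands in for a SQL-on-Hadoop engine such as Hive or SparkSQL;
# the join + aggregation shape of the query is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(order_id INTEGER, cust_id INTEGER, amount REAL);
    CREATE TABLE customers(cust_id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'APAC'), (2, 'EMEA');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.cust_id = c.cust_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('APAC', 150.0), ('EMEA', 75.0)]
```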
C. Online Transaction Processing Application (OLTP)
These applications have read, write and update operations, and each transaction's execution is associated with semantics such as ACID or CAP. Most NoSQL frameworks, such as HBase, Neo4j and MongoDB, support OLTP workloads. Some frameworks, such as GraphLab and Pregel, are specialized for graph-based data, which is required for social networking applications. The amount of data generated by social networking sites and other sources is growing at an exponential rate. Graph-based models used for parallel computation on graphs ensure that the data stays consistent, and different techniques are applied to extract useful information from these huge amounts of data. Graph analysis is one way to understand complex relations and to detect the different patterns present in large data. Available frameworks for graph processing include GraphX[8], GraphLab[9], Giraph[2] and Pregel[11]. PageRank, Connected Components, SVD++, Strongly Connected Components and Triangle Count are applications that belong to this category.
D. Streaming based Application
Processing a large amount of data is a challenging task, but taking decisions on the fly in real time is even more challenging; this is where stream processing comes into the picture. Most business firms focus on streaming analysis to take business decisions, such as making recommendations to users/customers/clients in real time. Stream processing mainly performs calculations, statistical analysis and continuous queries over a stream of data in real time. Available frameworks for stream processing include Spark Streaming[14], Apache Storm[4] from Twitter, and IBM InfoSphere Streams[5]. Fraud detection, e-commerce, feeds, page views and trading are streaming based applications.
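The "continuous query" idea above, e.g. a moving aggregate over the most recent events, can be sketched with a stdlib deque. This is an illustrative single-process sketch with invented data; engines like Storm or Spark Streaming distribute the same window semantics across a cluster.

```python
from collections import deque

def windowed_sums(stream, window=3):
    """Continuous-query sketch: for every incoming event, emit the sum of
    the last `window` events, the way a streaming engine evaluates a
    sliding-window aggregate."""
    buf, out = deque(maxlen=window), []
    for event in stream:
        buf.append(event)   # oldest event falls off automatically
        out.append(sum(buf))
    return out

# e.g. transaction amounts arriving one at a time
print(windowed_sums([10, 20, 30, 40], window=3))  # [10, 30, 60, 90]
```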
III. BIG DATA FRAMEWORKS
In this section, we discuss the features of a few of the popular Big data frameworks available in open source.
A. Hadoop
Hadoop is generally used for batch processing and is designed for the parallel algorithms used in scientific applications. Because of the limitations of its single-cluster resource management, an improved version of Hadoop[3] was released as Hadoop YARN[3]. The associated ML tool is Mahout. Hadoop is unsuitable, first, when communication between splits is required; second, for long-running MR jobs; and last but not least, because there is no concept of in-memory processing and caching across MR loops.
B. Spark
Spark is also a programming model used to process large data. According to the Spark community[13], it is much faster than Hadoop; the main reason behind this claim is that instead of placing intermediate data on disk, it stores it in memory, saving the I/O time required for storing and retrieving data from disk. Spark introduces a memory abstraction called resilient distributed datasets (RDDs); functions run in a parallel manner on each record of HDFS or any available storage. Two types of operations are performed on an RDD: transformations, which are lazily evaluated, and actions, which are applied to the transformed RDD. An RDD is a collection of objects partitioned across the cluster and rebuilt when a partition is lost; the lost partition can be rebuilt using lineage[12]. Spark scheduling is done using a directed acyclic graph (DAG): a job has multiple stages, which can be scheduled simultaneously if they are independent. Spark supports a wide range of application categories through machine learning, Spark SQL, Spark Streaming and Spark graph computation. The associated ML tools are MLlib, Mahout and H2O.
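The transformation-versus-action split described above can be imitated in plain Python with generators: like RDD transformations, generators build a pipeline without executing anything until an action-like call forces it. The class below is a toy imitation for illustration only, not the Spark API.

```python
class LazyDataset:
    """Toy imitation of the RDD idea (not the Spark API): transformations
    build a lazy pipeline; only an action forces evaluation."""
    def __init__(self, data):
        self._it = iter(data)

    def map(self, f):            # transformation: nothing executes yet
        return LazyDataset(f(x) for x in self._it)

    def filter(self, pred):      # transformation: still lazy
        return LazyDataset(x for x in self._it if pred(x))

    def collect(self):           # action: forces the whole pipeline to run
        return list(self._it)

result = (LazyDataset(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

In Spark the same laziness lets the DAG scheduler fuse transformations into stages before anything touches the cluster; here it simply means no element is squared or filtered until `collect()` runs.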
C. Flink
Flink is an open source platform for hybrid, interactive, real-time and real-world streaming (out-of-order streams, windowing, back-pressure) and native iterative processing[1]. Its major abstraction is cyclic data flows. The associated ML tools are Flink-ML and SAMOA. Flink is an in-memory, low-latency processing engine.
D. GraphX
The GraphX[8] framework is used for graph processing on top of Spark. It has low-cost fault tolerance, and through its iterative processing of graphs it reduces memory overhead. Different APIs are available in the different graph processing frameworks.
E. Spark Streaming
Hadoop doesn’t have full controlled solution for streamingapplication, even though they are able to process micro-batch job[10]. Spark streaming provides strong consistency,scalability, parallel and fault recovery due to introduction ofdiscretized streams(D-Streams)[14] programming model forintermix streaming, batch and interactive queries.
IV. RULE BASE FOR CHOOSING FRAMEWORK
In this section we provide a rule-base mapping between the applications and frameworks discussed in Sections 2 and 3 respectively. The first column of Table I presents the different features of a SQL query based analytic workload, based on our experience in client projects; the remaining columns represent the different Big data frameworks that support processing of SQL query based analytic workloads. Each cell in the table shows the match between the feature and the framework. Similarly, Table II presents the rule-base mapping for the NoSQL workload categories, Table III the features of frameworks for iterative computation based applications, and Table IV the features of frameworks for streaming applications.
V. CONCLUSIONS AND FUTURE WORK
There is a plethora of Big Data frameworks available in open source for parallel processing of CPU- and/or data-intensive applications, and it is a humongous task for a user to select the right platform for his/her workload. In this paper we have categorized applications broadly into SQL query based analytic workloads, NoSQL workloads, iterative computation based applications and streaming applications, and have specified features for each of these categories. Finally, we have presented a rule-base mapping for each category, specifying the level of support provided by the various available Big data frameworks for each of the specified features. In future work we shall benchmark these Big data platforms for the different types of applications and features mentioned in the paper, and corroborate the rule base with actual measurements. We also plan to integrate this rule-base mapping with our larger project on migrating applications deployed on traditional systems to Big data platforms.
REFERENCES
[1] http://flink.apache.org/.
[2] http://giraph.apache.org/.
[3] http://hadoop.apache.org/.
[4] http://storm.apache.org/.
[5] http://www-03.ibm.com/software/products/en/ibm-streams.
[6] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1):285–296, 2010.
[7] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In S. Hariri and K. Keahey, editors, HPDC, pages 810–818. ACM, 2010.
[8] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In J. Flinn and H. Levy, editors, OSDI, pages 599–613. USENIX Association, 2014.
[9] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716–727, 2012.
[10] S. Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3(4):117–129, 2014.
[11] C. E. Tsourakakis. Pegasus: A system for large-scale graph processing. In S. Sakr and M. M. Gaber, editors, Large Scale and Big Data, pages 255–286. Auerbach Publications, 2014.
[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In S. D. Gribble and D. Katabi, editors, NSDI, pages 15–28. USENIX Association, 2012.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In E. M. Nahum and D. Xu, editors, HotCloud. USENIX Association, 2010.
[14] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In R. Fonseca and D. A. Maltz, editors, HotCloud. USENIX Association, 2012.
TABLE I: Rule Base for SQL Analytic Workload

Feature                            | Hive                  | Presto                                     | Drill                        | Spark
Framework                          | MapReduce             | MapReduce                                  | Pipeline Queries             | MapReduce and DAG
Model                              | Batch Processing      | Real Time Interaction using Query Pipeline | Real Time Interaction        | Micro-Batch Processing, Streaming
Speed                              | Fast                  | 10x faster than Hive                       | Fast                         | 10x faster than MR on disk, 100x faster in memory
Capability of Processing Data Size | Terabytes             | Petabytes                                  | Gigabytes to Petabytes       | Petabytes
ML Support                         | No                    | No                                         | No                           | Yes
Language Support                   | Java API, SQL         | C, Java, PHP, Python, R, Ruby, SQL         | ANSI SQL, Mongo QL, Java API | Scala, Java, Python, SQL
Horizontal Cluster Scalability     | Scalable (100+ nodes) | Scalable (1000+ nodes)                     | Scalable (100+ nodes)        | Scalable (1000+ nodes)
Optimization                       | Query/Cost-Based      | Not Yet                                    | Rule Based/Cost Based        | Catalyst
YARN Support                       | Yes                   | Not Yet                                    | Yes                          | Yes
File System                        | Local/HDFS/S3         | HDFS/S3                                    | Local/HDFS/S3/MapR-FS        | HDFS/Local/S3/HBase/Cassandra
TABLE II: Rule Base for NoSQL Workload
(Key Value Store: Redis, Riak. Document Store: MongoDB, CouchDB. Graph Based: Titan, Neo4j. Wide Column: HBase, Cassandra.)

Feature | Redis | Riak | MongoDB | CouchDB | Titan | Neo4j | HBase | Cassandra
Schema Flexibility | High | High | High | High | High | High | Moderate | Moderate
Implementation Language | C | Erlang | C++ | Erlang | Java | Java | Java | Java
Distribution/Replication | Master-Slave | Selectable replication factor | Master-Slave | Master-Master/Master-Slave | Multi-Master replication factor | Master-Slave replication factor | Selectable replication factor | Selectable replication factor
Design and Features | Sorted sets | Vector clocks | Indexing, GridFS | Partitioning | Pluggable backends (Cassandra, HBase, MapR, Hazelcast) | ACID applicable | MapReduce support | Built-in data compression and MapReduce support
Query Method | Through key commands | MapReduce term matching | Dynamic object-based language and MapReduce | MapReduce of Javascript funcs | Gremlin, SparQL | SparQL, native Java API | Internal API, MapReduce | Internal API, SQL-like (CQL)
Concurrency | In memory | Eventually consistent | Update in place (master-slave with multi-granularity locking) | MVCC (application can select optimistic or pessimistic locking) | ACID/Tunable C | Non-blocking reads; write locks involved nodes/relationships until commit | Optimistic locking with MVCC | Tunable consistency
API and Other Access Methods | Proprietary protocol | HTTP API, native Erlang interface | Proprietary protocol using JSON, binary (BSON) | RESTful HTTP/JSON API | Java API, Blueprints, Gremlin, Python, Clojure | Cypher query language, Java API, RESTful HTTP API | Java API, RESTful HTTP API, Thrift | Thrift, custom binary CQL
Complexity | None | None | Low | Low | High | High | Low | Low
CAP | AP | AP | CP | CP | AP/CP | — | CP | AP/CP
Partitioning Scheme | Consistent hashing | Consistent hashing | Consistent hashing | None | — | — | Range based | Consistent hashing
File System | Volatile memory file system | Bitcask, LevelDB | Volatile memory file system | Volatile memory file system | Volatile memory file system | Volatile memory file system | HDFS | Cassandra File System
Map Reduce | No | Yes | Yes | Yes | Yes | No | Yes | Yes
Horizontally Scalable | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes
TABLE III: Rule Base for Iterative Computation Application

Feature                        | Hadoop                       | Twister                       | Flink                                     | Spark
Framework                      | MapReduce                    | MapReduce                     | PACT-MapReduce                            | MapReduce
Model                          | Batch                        | —                             | Operator-based                            | Micro-batch
Auto-Tuning/Optimization       | Manually/NA                  | Manually/static data caching  | Auto-tuned/caches static data path        | Manually/static data cached in iteration
Latency                        | Low                          | Medium                        | Medium (better than Spark)                | Medium
Fault Tolerance                | Medium                       | Medium                        | High                                      | High (through lineage)
Memory Management              | Manual                       | Manual                        | Automatic                                 | Automatic (Spark 1.6.0+)
Iteration Nature               | No                           | Yes                           | Yes                                       | Yes
General Purpose                | ETL                          | Iterative algorithms          | ETL/Machine Learning/Iterative algorithms | ETL/Machine Learning/Iterative algorithms
Storage Support                | HDFS/local file system/S3    | Local file systems            | HDFS/S3/MapR/Tachyon                      | HDFS/local file system/S3/HBase/Cassandra
Language Support               | Java                         | Java                          | Java, Scala, Python                       | Java, Scala, Python
Horizontal Cluster Scalability | Scalable (100 to 1000 nodes) | Scalable (100 to 1000 nodes)  | Highly scalable (100 to 10000 nodes)      | Highly scalable (100 to 10000 nodes)
TABLE IV: Rule Base for Streaming Application

Feature | Hadoop | Spark Streaming | Storm | Flink | Samza
Processing Framework | Batch | Batch, Streaming | Real-time event-based processing | Batch, real-time event-based processing | Real-time stream processing
Streaming Model | Batch | Micro-batching | Native/micro-batching with Trident API | Native/micro-batching by accumulating stream messages | Native/relies on Kafka for internal messaging
Response Time | Minutes | Seconds | Milliseconds | Milliseconds | Sub-seconds
Functionalities/Supports | YARN, HDFS, MapReduce | YARN, HDFS, Mesos | Runs on YARN, interacts with Hadoop and HDFS | Runs on YARN as an application | Requires YARN and HDFS
Stream Source/Primitive/Computation | NA | Receivers/D-Streams/transformations, window operations | Spouts/Tuples/Bolts | DataStream | Consumers/Messages/Tasks
Delivery Semantics | Exactly once | Exactly once (except in some failures) | At least once (exactly once with Trident) | Exactly once | At least once
State Management | Stateless | Stateful (writes state to storage), dedicated DStream | Not built-in/stateless (roll your own or use Trident) | Stateful operators | Stateful (embedded key-value store)
Language Supported | Java | Scala, Java, Python | JVM languages, Ruby, Python, Javascript, Perl | Java, Scala, Python | Scala, Java (JVM languages only)
API | — | Declarative | Compositional | Declarative | Compositional
Latency | Minutes | Seconds | Milliseconds | Milliseconds | Milliseconds
Throughput | NA | 100k+ records per node per second | 10k+ records per node per second | 100k+ records per node per second | 100k+ records per node per second
Scalability to Large Input Volume (streams) | No | No | Yes | Yes | Yes
Fault Tolerance (completing the computation correctly under failure) | Yes | Yes | Yes | Yes | Yes
Accuracy and Repeatability | No | No | No | Yes | No
Queryable (querying the results inside the stream processor without exporting them to an external database) | No | No | No | No (upcoming feature) | No
In-Memory Processing | No | Yes | Yes | Yes | No
Resource Manager | YARN | YARN, Mesos | YARN, Mesos | YARN | YARN
Supported ML Tools | Mahout | Mahout/MLlib/H2O | SAMOA | Flink-ML/SAMOA | SAMOA
Performance Engineering Best Practices
for Data Warehouse applications
Authors: Gaurav Kodmalwar, Niteen Yemul, SAS R&D India Pvt. Ltd., Pune
Data warehouse applications differ in architecture from regular client-server based applications, and the performance testing and engineering strategy for them is very different as well. This paper describes the basic architecture and concepts of a data warehouse that are required prior to starting performance testing of a data warehouse application. It covers performance testing and engineering strategies for baselining application performance as the volume of data grows at the different layers of the data warehouse. In this paper, references to SMP (Symmetric Multi-Processing) and MPP (Massively Parallel Processing) databases are intentionally avoided, and examples are explained with SAS or SPDS (Scalable Performance Data Server) file systems in comparison with undisclosed SMP and MPP databases.
Introduction: The need for performance testing of data warehouse based applications such as Business Intelligence (BI) and Business Analytics (BA) is increasing due to the growing volume of data in data warehouse and big data environments. Usually the scope of performance testing in such applications is focused only on ETLs (extract, transform, load). ETLs are scheduled to complete during offline hours, unless the application is real time. The performance of data warehouse applications needs to stay within the service level agreement (SLA); otherwise it may even impact end users accessing the application during online hours. This paper is based on practical learnings gathered while conducting performance testing of a data warehouse application. The ETL scenarios mentioned in the paper are kept generic in name, to explain a data warehouse application's architecture and its performance considerations, with actual performance numbers wherever possible. The architecture may vary for similar applications in different domains, and the approach for testing and troubleshooting needs to change accordingly. This paper focuses more on ETL in the data warehouse than on BI reporting.
Experimental Setup
Typically, in a data warehouse based solution, the end user gets business (BI or BA) reports after the completion of multiple ETLs (e.g. an ETL for loading data from the production or OLTP (Online Transaction Processing) system into staging tables, then an ETL for loading data into the mart, and finally an ETL for loading mart data into smaller report/output tables). It is only pragmatic that the SLA (Service Level Agreement) for the first few ETLs will be higher than for later ETLs, since the data volume processed by the initial ETLs is very high. In such circumstances, the scope of performance testing is driven by multiple factors: it may be decided based on data volume, frequency (the scheduled period, i.e. monthly/weekly/daily) and/or entirely on business priority.
This paper is based on the experience gained while testing a banking-related solution which had 8 different ETLs, all of which were tested for performance. After each iteration of testing, certain tuning strategies were applied to improve performance. The best practices stated in this paper cover the various tuning strategies gathered while performance testing the first two ETLs, which were both volume and server-resource intensive. For most of the ETLs, history is maintained in the source and/or target tables. In the case of our solution, 34 months of history was maintained for the first ETL and 11 months of history for the 3rd ETL. The execution times of these two ETLs were measured for the 35th and 12th months respectively; this helps estimate the execution time at the customer end after 1 year or 3 years.
Paper Brief: The following diagram shows a generic data warehouse schema. In this section each database layer is explained at a high level, along with tips for improving performance.
Figure 1: Data Warehouse Data Flow Process (Image Source: Internet)
The integration layer retrieves data from the OLTP or source system at regular intervals (daily/weekly/monthly/yearly). Mostly, the integration layer is the starting point of the data warehouse design. Due to the variety of OLTP/source systems, bringing data into the integration layer from the OLTP/source system is kept outside the scope of data warehouse solutions. As the integration layer accumulates huge amounts of data over time, it may slow down the ETL processes that extract data from the integration layer to load into the data warehouse.
Limitation
Since data in the integration layer keeps accumulating over time, appending new data from the OLTP/source system to the integration layer becomes a time-consuming process. There is additional overhead when indexes exist on the integration layer.
Performance Improvement Guideline (for the above limitation)
To overcome the above limitation, multiple options are available depending upon the data model and business requirements. Capturing only the changed transactional data in the operational systems, and using it as the input data source while loading data into the integration layer, will show a performance improvement. This can be achieved with a Change Data Capture (CDC) object, which is available in most data integration suites.
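The CDC idea, loading only the rows that changed since the last run instead of re-reading the whole source, can be sketched as a snapshot diff in plain Python. This is illustrative only, with invented data; real CDC objects typically read the database's transaction log rather than comparing snapshots.

```python
def capture_changes(previous, current):
    """Snapshot-diff sketch of Change Data Capture: return only the rows
    that were inserted or updated since the last load, keyed by primary
    key. Production CDC reads the transaction log instead of diffing."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("INSERT", key, row))
        elif previous[key] != row:
            changes.append(("UPDATE", key, row))
    return changes

previous = {1: ("Alice", 100), 2: ("Bob", 200)}
current  = {1: ("Alice", 100), 2: ("Bob", 250), 3: ("Carol", 50)}
print(capture_changes(previous, current))
# [('UPDATE', 2, ('Bob', 250)), ('INSERT', 3, ('Carol', 50))]
```

Only the two changed rows flow into the integration layer load; the unchanged row is skipped entirely, which is the source of the performance improvement described above.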
The data warehouse keeps the same operational system data, but in a different structure (a dimensional modeling schema) that is accumulated over a time period. More details about dimensional modeling are provided in the next section. Some data warehouse designs don't contain an integration layer and load data directly into the data warehouse from operational systems using ETL.
For the first time period, the data warehouse load (ETL) process might take longer due to the volume of data being processed and loaded. During the dimension load process for the first month, all the business keys (account/customer etc.) accumulated over all previous time periods get loaded; because of this, the volume of dimension data for the first time period is higher than for the next consecutive time periods. One of the few techniques available to load data faster during the first time period is BULKLOAD. BULKLOAD can be used to load data into the staging area from production/OLTP. A summary of this is as follows:
a) Loading data into staging area tables that have indexes will take a very long time, because of the additional overhead of updating the index during the data loading process.
b) In some databases, if the target tables in the staging area are empty, BULKLOAD can complete the same operation in a smaller amount of time: BULKLOAD inserts into new empty blocks and does not fill unused space in existing blocks that aren't completely full.
c) SAS uses a SAS-specific interface and its methods as part of SAS Access for a database, a seamless interface provided between SAS and other databases. This technology is used to load tables from SAS datasets to databases and vice versa. Thus, reusable code speeds development and, more importantly, makes the code easier to maintain.
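Point (a) can be demonstrated with the stdlib sqlite3 module standing in for the staging database: dropping an index before a large load and recreating it once afterwards avoids per-row index maintenance. The table, column names and row counts below are invented for the sketch; absolute timings will vary by engine.

```python
import sqlite3
import time

# sqlite3 stands in for the staging database; the pattern illustrated is
# drop-index / bulk-load / recreate-index versus loading with a live index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging(k INTEGER, v TEXT)")
conn.execute("CREATE INDEX idx_k ON staging(k)")
rows = [(i, f"row{i}") for i in range(50_000)]

def timed_load(with_index):
    conn.execute("DELETE FROM staging")
    if not with_index:
        conn.execute("DROP INDEX idx_k")
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - t0
    if not with_index:
        conn.execute("CREATE INDEX idx_k ON staging(k)")  # rebuild once, in bulk
    return elapsed

t_indexed = timed_load(with_index=True)
t_bulk = timed_load(with_index=False)
print(f"load with live index: {t_indexed:.3f}s; load then reindex: {t_bulk:.3f}s")
```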
For subsequent loads, only the changed data is tracked and loaded. This shows a performance improvement in later load processes compared to the first time period. Some bottlenecks in the SQL query or code can be discovered in the first-month load process whenever the first time period's ETL is slower than the following time periods. Table#1 shows the response time of the jobs loading different dimensions for the 1st month, 2nd month and 35th month. As mentioned earlier, the 1st month job's response time is high due to the high volume of data, and from the second month onwards the response time increases gradually as the changed data volume grows. A performance issue in a SQL query of a big database was found at higher volume while generating the surrogate key for the first month; after the fix, the response time for the 1st month was brought down to a few minutes.
Table Name  | 1st Month (Response Time / Rows Loaded) | 2nd Month (Response Time / Rows Loaded) | 35th Month (Response Time / Rows Loaded)
Dimension 1 | 1:02:53 / 30,735,406                    | 0:00:04 / 0                             | 0:02:57 / 3,948,846
Dimension 2 | 0:23:55 / 25,181,230                    | 0:00:45 / 11,981,230                    | 0:00:03 / 1
Dimension 3 | 0:22:44 / 20,491,766                    | 0:00:04 / 0                             | 0:00:51 / 2,632,770
Table#1
SQL query for inserting/updating the dimension tables mentioned in Table#1:

proc sql;
  connect to database (SERVER="xxx" AUTHDOMAIN="xxxxxx");
  execute (insert into "dim".APP_DIM2(
             APP_SK, APP_RK, PRODUCT_RK, CUSTOMER_RK, VALID_END_DTTM, PROCESSED_DTTM)
           select FUNCT(1,1) + 0 FUNCT(1,APP_RK),
             APP_RK, PRODUCT_RK, CUSTOMER_RK, XREF_TODATE, PROCESSED_DTTM
           from "spafm_scratch".mrg_APP_DIM
           where ACTION in ('UPDATE','INSERT');) by database;
  execute(commit) by database;
  disconnect from database;
quit;

In the above query, the surrogate key generation logic was changed instead of using the inbuilt function (FUNCT) provided by the MPP database. The FUNCT function was not efficient at high data volumes, although it was easy to use and caused no slowness at low or medium data volumes.
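FUNCT above is a built-in of the undisclosed MPP database, so its replacement cannot be shown verbatim; a common surrogate-key pattern of this kind is to seed from the current maximum key in the dimension and number the incoming changed rows from there. The sketch below illustrates that pattern with stdlib sqlite3 and invented table/column names.

```python
import sqlite3

# Illustrative surrogate-key pattern (not the paper's actual fix): seed from
# the current MAX(key) in the dimension and number the incoming changed rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_dim(app_sk INTEGER PRIMARY KEY, app_rk INTEGER)")
conn.execute("INSERT INTO app_dim VALUES (1, 101), (2, 102)")  # existing rows

incoming_rks = [103, 104, 105]  # changed rows for this load
max_sk = conn.execute("SELECT COALESCE(MAX(app_sk), 0) FROM app_dim").fetchone()[0]
conn.executemany(
    "INSERT INTO app_dim VALUES (?, ?)",
    [(max_sk + i, rk) for i, rk in enumerate(incoming_rks, start=1)],
)
loaded = conn.execute("SELECT app_sk, app_rk FROM app_dim ORDER BY app_sk").fetchall()
print(loaded)  # [(1, 101), (2, 102), (3, 103), (4, 104), (5, 105)]
```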
A star schema is a fact table surrounded by dimensions in the integration layer of a data warehouse. Data is extracted from this schema using a query called a star join query. A star schema stores data in a partially normalized format; the fully normalized variant is called a snowflake schema. In well-designed data warehouse solutions, these queries are used while loading data into the data mart, or for extracting data from the star schema on the fly.
Performance Engineering Best Practices:
Stated below are some of the best practices that can be applied across any data warehouse solution to gain performance improvements. For ease of reading, the tuning guidelines are segregated into Basic and Advanced levels. Basic level guidelines are handy tips which can be applied to bottlenecks uncovered in the preliminary investigation phase; Advanced level guidelines can be applied after detailed investigation.
Basic Level of Tuning Guidelines
1. Indexing
Creating an index on the time column, on top of the existing index on the business key/primary key of the transactional tables, will help improve the performance of the extraction process. However, the 'transaction time' column should not be null; otherwise it will slow down performance due to duplicate values in the index. Also, time indexes may not be needed by other solutions sharing data in the same integration layer, so this suggestion is acceptable only if it benefits all solutions. As shown in Chart#1, the performance of the extract jobs improved after creating indexes on the transaction time of the transactional tables. The improvement will differ depending upon the type of transaction table, its columns, rows and index columns. Note that this may slow down data loads from the OLTP/source system into the integration layer, due to the overhead of index checks.
Graph 1: Extract Job Performance Comparison
2. Verify Join (where clause) against indexes
The data warehouse holds a much larger volume of data than the integration layer; fact and dimension tables can reach sizes of, e.g., 200 gigabytes per table, and queries fetching data from such tables are correspondingly slower. These queries should be written in the intended star join form, with surrogate keys used in the WHERE clause. As shown in Table #2, performance improved after a SQL query was rewritten as a full star join (i.e., the surrogate keys of all relevant dimensions appear in the WHERE clause) instead of using the surrogate keys of only a few dimensions.
Query Type           Execution Time
Partially Star Join  0:10:00
Fully Star Join      0:03:00
Table#2
Data mart tables are extracted from the data warehouse through star join/snowflake queries. Data marts are aggregated tables for a specific business purpose and time period (e.g., monthly or yearly) and are small compared to the data warehouse. Most of the time there are also detail tables, extracted by analyzing data in the marts, used especially for reporting and mining, as well as cubes, which store data aggregated across multiple dimensions and are used for analysis of categorical data selected per business need. Due to the small volume of data in data marts and cubes, obvious performance issues are less likely there than at the integration layer or the data warehouse; the performance issues found at data marts and cubes were mostly due to inefficient SQL queries and were resolved by rewriting them.
Advanced Level of Tuning Guidelines
Focus Area: Design Level Optimization
1. Compression
Compression techniques are used by databases and big data platforms to save disk space and to reduce IO while reading/writing data on disk. Query execution performance varies with the compression technique used; however, modern compression techniques are good enough that adverse effects on query execution are rare. It is still worth testing different SQL queries on compressed tables to rule out any degradation in execution time. The following graphs show the disk space savings and execution-time improvements for a few experimental tables and SQL queries.
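As a rough illustration of why compressed tables can also be faster to scan, the sketch below compresses repetitive, warehouse-style rows with Python's zlib; the achieved ratio stands in for the IO saved per read. This is only a model of the idea, assuming made-up row data, not the MPP database's own compression scheme.

```python
import zlib

# Illustrative sketch only: real database page/column compression differs
# from zlib, but the space-vs-CPU trade-off is the same idea.
def compression_ratio(rows):
    raw = "\n".join(rows).encode()
    packed = zlib.compress(raw, 6)
    return len(raw) / len(packed)

# Repetitive, low-cardinality warehouse-style rows compress very well.
rows = [f"2016-01-{d:02d}|STORE_{d % 5}|PRODUCT_{d % 3}|{d * 10}"
        for d in range(1, 29)]
print(f"compression ratio: {compression_ratio(rows):.1f}x")
```

A scan of the compressed table reads proportionally fewer bytes from disk, which is where the execution-time improvement in Graph 3 comes from.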
Graph 2: Table size in Gigabytes Graph 3: Query Execution Time
2. Partition
Very large fact and dimension tables can both be physically partitioned over a time period for faster query execution: yearly, monthly, or even daily, depending on the data volume. Partitioning is best designed in from the beginning; retrofitting it into an existing design is complex.
In one specific example, tables were partitioned on a date-time column by month or by week, depending on data access requirements: a table whose data is accessed weekly was partitioned by week, and a table accessed monthly was partitioned by month. If a single table serves both weekly and monthly queries, weekly partitioning is usually the better method. In this example, a set of jobs that used to run for 2 days completed in less than 7 hours.
The graph below illustrates the advantage of partitioning. SQL queries were run on tables of varied sizes, first without any partitioning (initial results) and then after partitioning (final results). Queries on large tables gained several hours; for example, a query that used to consume 4 hours 20 minutes took just 10 minutes after partitioning and compression.
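The mechanism behind these gains is partition pruning: a query restricted to one time period touches only that partition's data and skips the rest of the table. A minimal sketch of the idea, with invented table and column names:

```python
from collections import defaultdict

# Toy model of partition pruning: rows are stored per monthly partition,
# so a query restricted to one month scans only that partition.
class PartitionedFact:
    def __init__(self):
        self.partitions = defaultdict(list)  # month -> rows
        self.rows_scanned = 0

    def insert(self, month, row):
        self.partitions[month].append(row)

    def query_month(self, month):
        part = self.partitions[month]  # pruning: other partitions untouched
        self.rows_scanned = len(part)
        return [r for r in part if r["amount"] > 0]

fact = PartitionedFact()
for m in ("2016-01", "2016-02", "2016-03"):
    for i in range(1000):
        fact.insert(m, {"amount": i})

fact.query_month("2016-02")
print(fact.rows_scanned)  # 1000 rows scanned, not 3000
```

A real database does the same with physical partitions on disk, which is why the gain grows with table size.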
Graph 4: SQL Queries in ETL, before and after Partition
3. Parallelism
Although the term ETL looks simple, a deployment typically has many jobs (different ETLs designed for different source systems and business purposes, e.g., ETL for credit cards, sales, savings accounts). Since these ETLs have different source and target tables, they can be grouped and run in parallel, independently and without impacting functionality, which improves overall ETL performance.
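Grouping independent ETLs for parallel execution can be sketched with a thread pool; the group and job names below are hypothetical placeholders for real ETL steps:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: ETL groups with no shared source or target tables run concurrently;
# dependent steps stay sequential inside a group.
def run_job(name):
    # placeholder for the real extract/transform/load work
    return f"{name}: done"

groups = {
    "credit_card": ["cc_extract", "cc_transform", "cc_load"],
    "sales": ["sales_extract", "sales_transform", "sales_load"],
}

def run_group(jobs):
    return [run_job(j) for j in jobs]  # sequential within a group

with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    results = dict(zip(groups, pool.map(run_group, groups.values())))

print(results["sales"][-1])
```

The same grouping discipline applies whatever the scheduler is; the point is that only truly independent groups are safe to parallelize.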
Database-level parallelism is an important and complex technique for improving ETL performance. Query-level parallelism in a database can be enabled at the table level or the SQL query level. Parallelism may improve or degrade performance depending on its degree, the query, and the data. In an MPP database, every SQL query already runs with a high degree of parallelism thanks to the shared-nothing architecture, but in an SMP database the degree of parallelism has to be restricted for some queries. During ETL execution on an SMP database, SQL query parallelism was set to automatic and the degree of parallelism was restricted to 4 or 8, depending on the hardware configuration of the SMP database. This change improved ETL execution compared to no parallelism; a few SQL queries degraded when run in parallel, so overall it was a trade-off between improvement and degradation across the queries running in parallel.
Continuing the previous experiment, it is also important to enable the right performance settings on user-defined functions in any database (SMP or MPP); otherwise, a SQL query that calls such functions may not get the faster execution plan. It was observed many times that user-defined functions need an explicit parallel setting (among other settings) to allow the query to execute with a high degree of parallelism.
proc sql;
  create table src.TargetTable1 as
  select a."SUB_RK", a."AS_OF_DTTM", a."TARGET_VALUE", a."FINAL_DTTM",
         FunctionName(a."AS_OF_DTTM") as max_dttm
  from src.SourceTable a;
quit;
The execution plan of the above SQL query showed that it did not run in parallel until the function itself enabled parallelism:
create or replace FUNCTION FunctionName(vars) PARALLEL; END FunctionName;
4. SQL Pass-through (SAS specific)
In some SAS-based analytics solutions, the underlying database can be any SMP or MPP database. For fast SQL performance it is important to pass all queries (or at least the slow-running ones) through to the database so that the database executes them with its own mechanisms. Otherwise, SAS may pull data from the database to the SAS tier, leaving the query only partially processed, or not processed at all, by the database; in both cases the query will be very slow because the entire table's data moves from the database to the SAS tier. Such issues can be found in the SAS job logs by raising the logging level and verifying that the entire query is passed to the database. With the logging level enabled, one of three message types appears in the log for each query execution:
• SQL_IP_TRACE: None of the SQL was directly passed to the DBMS.
• SQL_IP_TRACE: Some of the SQL was directly passed to the DBMS.
• SQL_IP_TRACE: The SELECT statement was passed to the DBMS.
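A small script can scan the SAS job logs for these SQL_IP_TRACE messages and flag queries that were not fully passed through. The sketch below is a hypothetical helper, not part of SAS:

```python
import re

# Scan a SAS job log for SQL_IP_TRACE lines to verify implicit pass-through.
# "None"/"Some" variants indicate the query was not fully pushed to the DBMS.
TRACE = re.compile(
    r"SQL_IP_TRACE: (None of the SQL|Some of the SQL|The SELECT statement)"
)

def passthrough_status(log_text):
    hits = TRACE.findall(log_text)
    if not hits:
        return "no trace found"
    if any(h.startswith(("None", "Some")) for h in hits):
        return "NOT fully passed to DBMS"
    return "fully passed to DBMS"

sample = (
    "NOTE: PROCEDURE SQL used ...\n"
    "SQL_IP_TRACE: The SELECT statement was passed to the DBMS.\n"
)
print(passthrough_status(sample))
```

Running such a check over every job log after a test run quickly isolates the queries that fell back to SAS-side processing.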
Focus Area: System Level Measurements
1. Disk Speed
In a data warehouse solution, ETL performance depends heavily on IO capacity (both disk read and write speed). UNIX systems have the built-in dd command for checking disk speed (the "dd test"), and similar tools exist on Windows-based systems as well. A dd test helps measure disk capacity before ETL test execution starts, since the performance of high-throughput SQL queries may degrade when disk capacity is the limit. In the two nmon monitoring graphs below, captured during a dd test, disk read and write speeds were ~400 MB/s and ~370 MB/s respectively, apart from read-speed spikes of 1-2 GB/s (which can be ignored, as they come from the memory cache).
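The same kind of sequential-write measurement can be scripted; the sketch below approximates a dd-style write test in Python (fsync is used because, like the read spikes above, the page cache would otherwise inflate the number). The sizes are illustrative defaults, not the paper's test parameters.

```python
import os
import tempfile
import time

# Rough sequential-write throughput check in the spirit of a dd test.
def write_throughput_mb_s(total_mb=64, block_mb=1):
    block = b"\0" * (block_mb * 1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=True) as f:
        start = time.perf_counter()
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before timing stops
        elapsed = time.perf_counter() - start
    return total_mb / elapsed

print(f"{write_throughput_mb_s():.0f} MB/s sequential write")
```

Running it on the temp-data and solution-data volumes before a test round gives the baseline against which later utilization graphs can be judged.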
Graph 5: Disk Write Speed Graph 6: Disk Read Speed
2. RAID setting
Along with disk type and speed, choosing the correct RAID level for each logical disk is crucial and depends on the disk's purpose. In our experimental setup, RAID 0 was chosen for the logical disk holding temporary data, RAID 5 for the logical disk holding solution data, and RAID 1 for the logical disk holding the operating system. These settings matter because RAID 0 is faster than RAID 1 and RAID 5 but offers no data recovery; temporary data may not need recovery, whereas solution data must be recoverable after a disk crash, so RAID 5 is the better fit for solution data.
3. Job Parallelism
One ETL had ~260 jobs covering extract, transform, and load for different business entities/purposes. This ETL ran for ~35 minutes with only 5 jobs running in parallel at a time to avoid hardware bottlenecks. In the following graph (captured through PerfMon), the disk read speed for solution data peaked at about 600 MB/s and the disk write speed for temporary tables stayed below 200 MB/s (except for one spike of 500 MB/s). The disk capacity measured during the dd test was higher than this utilization, so there was no limitation at the disk level. CPU utilization and available memory were also not at capacity, except that CPU utilization on 5 cores touched 100% because the SAS jobs were single-threaded. In fact, this is a good sign: the remaining 7 of the 12 cores can run other SAS jobs if job-level parallelism is raised from 5 toward 12. After increasing job-level parallelism from 5 to 10, ETL execution improved from 35 minutes to 28 minutes, an improvement that was expected since at 5 parallel jobs the hardware was not fully utilized.
27
Graph 7: Operating System Monitoring using PerfMon
4. Parallel IO operation
SPDS is a data storage system optimized for fast delivery by distributing data across multiple disks instead of a single disk. In one experimental setup the data was distributed across multiple virtual disks. When the ETL ran against data in SPDS, the observed disk read and write speeds (900 MB/s and 700 MB/s respectively) were higher than the read/write capacity of any individual virtual disk (350 MB/s and 500 MB/s respectively). This was possible only because the data was distributed across multiple virtual disks and SPDS performs parallel IO operations on them. SPDS shows a clear performance improvement over the SAS filesystem when data is distributed across multiple virtual disks, much like data on an MPP database is distributed across different disks in a shared-nothing architecture. The comparison mirrors SMP vs. MPP databases: SMP shares everything on a single host, whereas MPP shares nothing (disks, CPUs, nodes).
Graph 8: Disk Read Speed Graph 9: Disk Write Speed
Focus Area: Database Level Tuning
1. SQL Query Execution Plan
ETL execution time is driven by the execution times of its jobs, with the slowest job contributing the most. The execution time of the slowest job is in turn driven by the SQL queries or tasks in that job, so understanding why the slowest SQL query is slow is key to improving ETL performance. The FULLSTIMER setting in the SAS configuration only reports the execution time of a SQL query or task; to investigate why a query is slow, one needs its execution plan. The "_method" option shows the execution plan of a SAS SQL query, and other SMP or MPP databases expose similar plans through their monitoring and reporting tools. The execution plan chosen for a database or SAS dataset is determined by many factors: table sizes, indexes, IO speed, server utilization (CPU, memory), parallelism settings, etc.
For the slow SQL query below, FULLSTIMER showed an execution time of about 30 minutes and _method showed that the optimizer chose a sort-merge join. Sort-merge joins tend to be slower
in many databases because large tables (larger than RAM) must be sorted and written to disk, and IO is always slower than memory. Sort-merge is useful when multiple database connections and sessions run their SQL queries at the same time, so that no one gets a "denial of service" type error and everyone can run queries at limited speed; but when there is enough hardware, the optimizer should pick a faster join. The other two join types available in SAS PROC SQL are index and hash joins. If either of the two source tables fits in memory, the optimizer opts for a hash join; being an in-memory join, it is almost always faster than a sort-merge join, at the cost of higher memory utilization, since the entire table is held and processed in memory. If the source table has indexes, the optimizer can opt for an index join, which is even faster because it reads only a limited amount of data instead of the entire table. Index joins are not always fast, though: they help when the output fetches a small fraction of the source table's rows; otherwise the extra index accesses may cancel the gain or even degrade performance. SAS and other databases (SMP or MPP) offer numerous ways to promote a faster execution plan by hinting the optimizer: collecting statistics, partitioning tables (by WHERE clause), or increasing the database's memory allocation.
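To make the join strategies concrete, here is a toy comparison of a hash join and a sort-merge join on small in-memory tables (the column names are invented); it mirrors the optimizer's choices described above without modeling their IO costs:

```python
# Sketch of two of the join strategies discussed above, on plain Python lists.

def hash_join(small, large, key):
    # Build an in-memory hash table on the smaller input, then probe it:
    # fast, but the whole build side must fit in memory.
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    return [(l, s) for l in large for s in table.get(l[key], [])]

def sort_merge_join(a, b, key):
    # Sort both inputs (spilled to disk for huge tables), then merge in one pass.
    a = sorted(a, key=lambda r: r[key])
    b = sorted(b, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][key] < b[j][key]:
            i += 1
        elif a[i][key] > b[j][key]:
            j += 1
        else:
            k = a[i][key]
            ai = [r for r in a[i:] if r[key] == k]  # contiguous: inputs sorted
            bj = [r for r in b[j:] if r[key] == k]
            out.extend((x, y) for x in ai for y in bj)
            i += len(ai)
            j += len(bj)
    return out

dim = [{"rk": 1, "name": "x"}, {"rk": 2, "name": "y"}]
fact = [{"rk": 1, "amt": 10}, {"rk": 2, "amt": 20}, {"rk": 1, "amt": 30}]
print(len(hash_join(dim, fact, "rk")))  # 3 matching pairs
```

Both produce the same pairs; the difference in a real engine is where the work happens: hashing in RAM versus sorting through the temp area on disk.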
proc sql _method;
  create table temp.Table1_EXT as
  select SOURCETABLE1.PRODUCT_RK, SOURCETABLE2.LOAD_END_DTTM,
         SOURCETABLE1.SOURCETABLE1_TYPE_CD, SOURCETABLE1.APR_RT,
         SOURCETABLE1.BASE_APR_RT, SOURCETABLE1.NOMINAL_RT
  from temp.SOURCETABLE2, SourceSchema.SOURCETABLE1
  where SOURCETABLE2.LOAD_END_DTTM >= SOURCETABLE1.VALID_FROM_DTTM
    and SOURCETABLE2.LOAD_END_DTTM <= SOURCETABLE1.VALID_TO_DTTM;
quit;

Execution plan:
sqxslct
  sqxjm
    sqxsort
      sqxsrc(temp.SOURCETABLE2 (alias = TABLE0))
    sqxsort
      sqxsrc(SourceSchema.SOURCETABLE1 (alias = TABLE1))

NOTE: PROCEDURE SQL used (Total process time):
      real time 30:51.52
      cpu time 24:12.46
Other execution plans:

Hash join:
sqxslct
  sqxjhsh
    sqxsrc
    sqxsrc

Index join:
sqxcrta
  sqxjndx
    sqxsrc
    sqxsrc
2. Configuration Settings
In SAS (and in many SMP or MPP databases as well), memory settings are configurable at the database level. A small value for such a parameter can become a performance bottleneck when executing SQL queries on large tables. An experiment was conducted with different SORTSIZE settings in SAS (similar experiments were run on other popular databases with their own system- or query-level memory settings); the SQL query in this experiment created a UNIQUE index on a large table, which always executes via sort-merge because unique values must be found to create the unique index. With SORTSIZE at 12 gigabytes, reads happened in pieces with writes to the temp location; query execution time improved after increasing SORTSIZE to 60 gigabytes, thanks to reduced writes to temp and longer continuous reads into memory. Such memory settings in any database help reduce IO operations and improve query performance.
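The SORTSIZE effect can be reasoned about with a simple model: an external sort spills roughly ceil(data size / sort memory) sorted runs to the temp area, so a 5x larger SORTSIZE means roughly 5x fewer temp writes and re-reads. A back-of-envelope sketch (the 120 GB figure is illustrative, not from the experiment):

```python
import math

# Rough model of external merge sort spills: fewer, larger in-memory runs
# mean fewer temp-area writes and re-reads.
def spill_runs(data_gb, sortsize_gb):
    return math.ceil(data_gb / sortsize_gb)

print(spill_runs(120, 12))  # 10 temp runs with SORTSIZE 12G
print(spill_runs(120, 60))  # 2 temp runs with SORTSIZE 60G
```

This is only the first-order effect; real sorts also merge runs in passes, which amplifies the IO saved by a larger sort memory.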
Graph 10: Sort Size Run 1 and Run 2
Basic Housekeeping Techniques
• Inconsistency of performance numbers between two iterations: ETL executions in a data warehouse differ from UI (user interface) based load testing. Depending on the total number of ETLs and the number of months covered by each, backup and data restoration are needed to re-run an ETL for selected months only. Over time on Windows systems, repeated backup and restoration fragments the disk significantly, slowing IO operations. Running disk defragmentation at regular intervals, or before starting another round of testing, keeps IO operations, and therefore SQL query execution times, consistent, ensuring ETL execution timings remain comparable. As shown in the following table, disk fragmentation was significant on the D drive holding solution data, so table data was not stored contiguously on disk, slowing read/write processes. After defragmentation, performance improved significantly.
Disk Drive   Total size (TB)   Available (TB)   Used (TB)   % of fragmentation
C            1.08              1.04             0.08        19
D            11.4              9.21             2.19        5
E            4.36              3.36             1.0         0

Graph 11: SQL Execution Time          Table #3: Disk Details
• Keep an eye on unnecessary intermediate table creation during ETL processes; using views instead of creating intermediate tables saves IO. Similarly, intermediate tables need to be cleaned up before the same ETL is executed for the next time period. Intermediate tables are small but numerous, created and used at the developers' convenience; when the same tables are reused for the next time period, instead of being recreated, the old tables are loaded with new data and grow in size. For best performance, such temporary tables should usually be cleaned up, and collecting statistics on them may improve performance by letting the optimizer choose a better execution plan.
• As a best practice, measure ETL timings not only for a specified time period (e.g., the 12th month) but for all incremental time periods (months 1 to 12). This helps uncover any inconsistency in data distribution across incremental time periods when all other parameters remain the same. Ideally, for ETLs whose source or target data grows with each time period, execution time also increases gradually; for ETLs whose source or target data volume stays the same, execution time should also remain the same.
Graph 12: ETL Execution Time over Iterative Time Periods
References:
1. [KIMB2004] Ralph Kimball and Joe Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data", first edition, Wiley (2004).
2. [SAS2010] "SAS/ACCESS(R) 9.2 for Relational Databases: Reference, Fourth Edition" (2010), http://support.sas.com/documentation/cdl/en/acreldb/63647/HTML/default/viewer.htm#a001384466.htm
3. [KRENN] Thomas Krenn wiki, "Linux I/O Performance Tests using dd", https://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd
4. [NATARAJAN2010] Ramesh Natarajan, "RAID 0, RAID 1, RAID 5, RAID 10 Explained with Diagrams", August 10, 2010, http://www.thegeekstuff.com/2010/08/raid-levels-tutorial
Acronyms:
ETL (Extract, Transform, Load), OS (Operating System), UI (User Interface), OLTP (Online Transaction Processing), CDC (Change Data Capture), BA (Business Analytics), BI (Business Intelligence)
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right
to publish this paper in CMG India Annual Conference Proceedings.
Performance Monitoring and Management of Microservices on Docker Ecosystem
Sushanta Mahapatra, Sr. Software Specialist – Performance Engineering, SAS R&D India Pvt. Ltd., Pune
Richa Sharma, Sr. Software Specialist – Performance Engineering, SAS R&D India Pvt. Ltd., Pune
Application packaging and deployment using Docker is trending and fast catching on in the infrastructure world. The big players of the industry are either deploying on Docker, integrating with Docker, or developing for Docker, because it is flexible, lightweight, easy to manage, and highly scalable. Docker containers wrap applications and their dependencies, use a shared kernel, and run on the same host alongside other containers; essentially they are isolated processes in user space on the host operating system. Microservice architecture, similarly, enables the decomposition of applications into small services, which improves fault tolerance and manageability. These services are designed to run on their own, to communicate with the outside world via lightweight protocols like HTTP, to be flexible in deployment, and to work independently. One such deployment mechanism is containerization of these microservices via Docker, which makes Docker containers one of the preferred options to build, ship, and run them. Containerization via Docker is highly convenient because Docker automates the deployment of applications inside containers, providing an additional layer of abstraction and automation of operating-system-level virtualization on Linux. Docker containers are built on top of the Linux containers mechanism and are very lightweight. The trade-off, however, is the performance monitoring and management overhead: measuring the system-level utilization and performance characteristics of all the services becomes a challenge.
This paper discusses various options that a performance engineer can leverage to monitor and manage Docker-based microservices, including live monitoring of containers, persistent storage options for offline monitoring and analysis, and integration possibilities with a few industry-standard (commercial) performance testing tools. Additionally, the paper delves into container management tools such as Docker UI and Consul Web UI.
Introduction:
Microservices architecture is the latest buzz among leading software architects and design gurus. Over the last few years a sharp surge toward the microservice way of designing software services has been evident, leaving the monolithic approach behind. There is a very consistent pattern of migration to this architecture using container technology, with Docker as the primary container of choice. Microservice adoption is being driven by the decomposition of applications to a granular level, enabling each service to be self-sufficient. Cloud-based infrastructure and elastic scalability are further drivers of microservice adoption.
Figure 1: Decomposition of Service into microservices
From a performance standpoint, different approaches are needed to monitor microservices deployed on Docker. Typically, various microservices communicate with each other on a production system, and each service is ideally hosted in a Docker container on top of the Linux kernel or on cloud infrastructure. As these containers consume resources in isolation, monitoring approaches should aim at gauging individual containers in isolation, similar to monitoring a single host. There are various ways in which a performance test engineer/tester can monitor and manage such containers, and many platforms and frameworks have been developed, or are under development, to monitor such microservices.
Docker Containers and Resource Isolation: Docker is written in the Go language and makes use of several Linux kernel features (namespaces, control groups, union file systems, and a container format) to deliver the container isolation functionality discussed below. Docker takes advantage of namespaces to provide the isolated workspace called the container. When you run a container, Docker creates a set of namespaces for it, providing a layer of isolation: each aspect of the container runs in its own namespace and has no access outside it, as shown in Figure 2.
Figure 2: Docker Container resource isolation
Monitoring Approaches for Docker Containers: As discussed in the above section, each Docker container has its own subsystem, and its resources can be monitored in isolation as if the container were a separate host. There are several ways to achieve this; listed below are a few of the most helpful approaches.
Approach 1: Control Groups
Linux provides a kernel feature called control groups, commonly cgroups, that allows us to allocate resources such as CPU time, memory, network bandwidth, or combinations of these, among user-defined groups of tasks or processes running on a system. We can configure and monitor cgroups and control user access to them. Using cgroups, system administrators gain fine-grained control over allocating, prioritizing, denying, managing, and monitoring system resources. Figure 3 below shows how cgroups can be used to control resources.
Figure 3: Control Groups and an Analogy with the Real World
Docker containers rely on control groups to isolate the resource usage (CPU, memory, disk I/O, and network) of a collection of processes. This is quite similar to resource metering in a multi-storied building, where consumption (water or electricity) is measured both for the whole building and individually for each unit.
Usage Example: The memory metrics for a running container can be retrieved via cgroups using the container ID, by reading /sys/fs/cgroup/memory/docker/(id) at the command prompt, which shows the specified container's memory usage in bytes:
total_cache 210492
total_rss 311541332
total_rss_huge 234061234
Similarly, any of the performance metrics (CPU, IO, network) can be found for any container in isolation.
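Programmatically, the same counters can be read from the cgroup file and parsed into a dictionary; the sketch below parses the sample output shown above:

```python
# Parse cgroup memory.stat text (as read from
# /sys/fs/cgroup/memory/docker/<id>/memory.stat) into a dict of counters.
def parse_memory_stat(text):
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

sample = """total_cache 210492
total_rss 311541332
total_rss_huge 234061234"""

stats = parse_memory_stat(sample)
print(stats["total_rss"] // (1024 * 1024), "MB resident")  # 297 MB resident
```

In practice the text would come from reading the file for a given container ID; the parsing is the same.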
Approach 2: Using Google's Cadvisor
Google created Cadvisor initially for internal use, but later added Docker container support and released it as open source. Cadvisor, short for Container Advisor, is easy to use and gives a detailed look into the resource usage and the various performance characteristics of all running containers. It includes a simple UI to view live data, a simple API to retrieve the data programmatically, and the ability to store the data in an external InfluxDB.
Cadvisor supports Docker containers natively out of the box. It is a running daemon that gathers, aggregates, and presents information about all running containers; for each container it keeps resource isolation parameters and historical resource usage, plus network statistics container-wide and machine-wide.
Pulling and Running Cadvisor: Using Cadvisor is quite easy; from your host, just pull and run the Cadvisor Docker container with the following command and you are ready to go:

sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

Once the Cadvisor container is running, you can bring up the UI at http://yourhost:8080 (preferably in Firefox or Chrome) and drill down to any container you want to monitor from the Docker Containers link.
Figure 4: Cadvisor UI and typical graphs it shows for a container (image customised for representation).
Additionally, Cadvisor provides REST API endpoints that return all stats in JSON format for further consumption. A few of them are:
http://yourhost:8080/api/v1.2/containers
http://yourhost:8080/api/v1.2/docker
http://yourhost:8080/api/v1.2/machine
http://yourhost:8080/api/v1.2/docker/myinstance (your container instance)
Limitation:
• The data captured is live, and there is no built-in way of storing it for later analysis. For persistent storage, Cadvisor needs to be interfaced with an external time-series database such as InfluxDB or OpenTSDB. This is discussed later in this paper.
Approach 3: Integration with Leading Performance Testing Tools
This section touches upon two approaches for integrating Docker container metrics with leading (commercial) performance testing tools to monitor microservices.
Approach 3.1: Integrating Cadvisor API endpoints with performance testing tools: As mentioned in the previous section, Google's Cadvisor is good for getting performance counters from Docker containers as well as from the services or processes running within them. The Cadvisor REST API endpoints can be called and processed programmatically to extract the different performance metrics, which are then stored as part of the analysis report (the tool's output).
The performance test tool can consume those data points and generate graphs as part of the performance test report, which can be stored permanently.
Limitation(s):
• A single virtual user must run another script at a specified interval (in the background) to collect and process the metrics during the test run (this does not add much overhead).
• A parser for the Cadvisor JSON responses has to be implemented (this can be custom-built with any simple program).
• The raw data values cannot be stored as-is; for further analysis of any problem, the test has to be re-run.
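A minimal version of such a parser might look like the following; the JSON here is a trimmed, hypothetical cAdvisor response, as real payloads carry many more fields (per-CPU usage, filesystem stats, network counters):

```python
import json

# Sketch of the custom parser mentioned above: pull the latest CPU and
# memory counters out of a cAdvisor-style JSON response.
def extract_counters(payload):
    doc = json.loads(payload)
    latest = doc["stats"][-1]  # most recent sample
    return {
        "cpu_total_ns": latest["cpu"]["usage"]["total"],
        "memory_bytes": latest["memory"]["usage"],
    }

# Trimmed, hypothetical response; a real one comes from
# http://yourhost:8080/api/v1.2/docker/<instance>.
sample = json.dumps({
    "name": "/docker/myinstance",
    "stats": [
        {"cpu": {"usage": {"total": 95000000}}, "memory": {"usage": 104857600}},
        {"cpu": {"usage": {"total": 97000000}}, "memory": {"usage": 106954752}},
    ],
})

print(extract_counters(sample))
```

The extracted key/value pairs can then be fed to the test tool's storage, as described in the usage example below.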
Usage Example: Extract each value and store it as a key/value pair, for example:
HashMap<String, String> results = new HashMap<>();
results.put("Counter1", "Value");
Then feed the data to the tool's storage system, or store it on any storage medium for offline analysis, as depicted in the following diagram.
Figure 5: Cadvisor API endpoints and their integration with industry-standard tools.
Advantages: The data received via the API endpoints is stored permanently and can be used for live or offline analysis.
Approach 3.2: Registering custom scripts to leverage cgroups and integrate with industry standard tools. Red Hat Enterprise Linux kernel has a feature called control groups or cgroups that was discussed earlier in this paper. Cgroups can be used to collect the system resources for a specific group in complete isolation. Based on cgroups, one or more custom monitoring scripts can be developed to be registered with Leading Performance monitoring tools and those scripts can be used from within Performance test tool to collect metrics for any of the containers.
The figure below displays a simple process of writing a custom shell script in Linux that leverages cgroups to retrieve container specific metrics, and shows how tools like JMeter and Gatling can consume the script's output. The latest versions of JMeter and Gatling support host monitoring via direct or collectd plugins, which can be configured to use the custom script to pull metrics during the course of the test run and show them as part of the test result.
Figure 6 : Integrating custom cgroups script with tools like JMeter and Gatling
Limitation: Proper understanding of cgroups and how to define the hierarchy and the subsystems is required.
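As a minimal sketch of what the core of such a custom script does, the snippet below reads a single-value cgroup counter file; the /sys/fs/cgroup path layout shown is an assumption that depends on how cgroups are mounted on the host and which container runtime hierarchy is in use:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of reading a single-value cgroup counter file, e.g.
// /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
// (the exact hierarchy is host-dependent and assumed here).
public class CgroupReader {
    // Parse the single numeric value that a cgroup counter file contains.
    public static long readCounter(Path counterFile) throws IOException {
        return Long.parseLong(Files.readAllLines(counterFile).get(0).trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: CgroupReader <container-id>");
            return;
        }
        Path p = Paths.get("/sys/fs/cgroup/memory/docker",
                args[0], "memory.usage_in_bytes"); // assumed mount layout
        System.out.println("memory.usage_in_bytes=" + readCounter(p));
    }
}
```

A shell wrapper around the same idea (cat on the counter file) is what would typically be registered with JMeter or collectd.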
Approach 4: A persistent Monitoring Infrastructure using Cadvisor, InfluxDB and Grafana
As discussed in the previous section, Cadvisor can be used to monitor containers, but that approach has two limitations: the monitoring counters and their values are live and are not stored anywhere, and the performance statistics cannot be customized into meaningful graphs. The following section describes a quick approach to store the performance metrics persistently in a time-series database and use a charting solution to build more meaningful graphs based on the stored metrics.
Components in the Framework:
• Cadvisor (to read performance metrics)
• InfluxDB or OpenTSDB (time-series databases for persistent storage)
• Grafana (charting solution that supports InfluxDB/OpenTSDB)
How are the above components integrated?
Figure 7 : A persistent Docker monitoring Infrastructure using Cadvisor, InfluxDB and Grafana
The integration is quite simple, as shown in Figure 7. Let's see in detail how the components integrate with each other. The objective is to monitor the Docker containers that sit on top of the Linux kernel.
• Run the Cadvisor Docker container which will provide all the resource usage statistics and performance measures for all the running containers.
• In order to store the metrics retrieved by Cadvisor, use either InfluxDB or OpenTSDB. For ease of use, InfluxDB is recommended since Cadvisor can be auto configured to push data to InfluxDB.
• Once the data is persisted, a charting solution such as Grafana can be used; it connects to these databases and helps in building rich and meaningful graphs for better analysis.
• Once the performance dashboards are ready, end users can simply view and monitor the performance characteristics live via these dashboards.
This infrastructure can be set up quickly as all these components are available as Docker containers; if you have a Docker enabled Linux box, it takes just a few minutes to build the dashboard and start gathering performance metrics.
Limitation:
As already mentioned, this is a quick way of building a monitoring infrastructure; once you stop the InfluxDB or Grafana Docker instances, you may lose all the data. To build a robust monitoring infrastructure, it is better to host the InfluxDB/OpenTSDB and Grafana configuration permanently, or to find a way to persist the data these Docker instances use so that it remains reusable.
Managing Docker Containers:
DockerUI:
In addition to resource monitoring, it is equally important to manage the containers. Docker UI is a web interface that helps visualize running containers and manage them easily. It assists users with container lifecycle actions like starting, stopping, pausing, removing and killing a container, making it easy to manage containers and images with simple clicks, without needing to execute commands for small jobs.
Figure 8 : Docker UI (Image Source : Internet)
Consul Web UI: Consul is a tool for discovering and configuring services in your infrastructure. It comes with a user-friendly web UI that can be used for viewing all services and nodes, for viewing all health checks and their current status, and for reading and setting key/value data. It can act as a one-shot dashboard for managing your services.
Figure 9 : Consul web UI (Image Source : https://demo.consul.io)
References:
Definitions of Microservices: http://klangism.tumblr.com/post/80087171446/microservices
RedHat guide on Control Groups: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
Google’s Cadvisor: https://github.com/google/Cadvisor
Grafana Information: http://grafana.org/
InfluxDB Information: https://influxdb.com/
Docker Information: https://www.Docker.com/
Consul Information: https://www.consul.io/
PERFORMANCE TESTING AND TUNING OF A WEB BASED ENTERPRISE
SCALE HIGH VOLUME ACCOUNTS PAYABLE PROCESSING SYSTEM
Seji Thomas Software Product Development CoE –
Intellectual Property & Engineering Group Tata Consultancy Services Ltd, 1st Floor
TCS Center, Infopark, Kakkanad Kochi – 682020, Kerala, India
Ph: +91-484-6187568
Rony Pius Manakkal Software Product Development Engg –
Intellectual Property & Engineering Group Tata Consultancy Services Ltd, 1st Floor
TCS Center, Infopark, Kakkanad Kochi – 682020, Kerala, India
Ph: +91-484-6187607
Abstract
Software Applications are developed catering to the requirements of the
customer. Requirements are expressed as business use cases that are
validated and verified using prototypes that then blend into the mainstream
Application. It is implicit that the outcome of an Application development
project is favorable to the customer with respect to its proper functioning,
optimum performance, lower infrastructure investment and operational
costs, ease of administration, maintenance, and adaptability to changes in
business. To build for success, it is essential to understand what counts as
success. The target Application should bring a successful outcome to the
customer, it should meet current demands and also be flexible for
addressing future growth. While meeting current demands, it should
provide a winning edge for the customer to save time and cost of its
operations, leverage past and current investments for future, and earn a
repute in the market. Engineering with special focus on quality is a
challenge most development teams are trying to address, and the focus of
this paper is to present the steps taken to evaluate the performance
quality of the Application under construction, the challenges that came up
during the testing and tuning phases of development, and how those
challenges transformed as opportunities while scaling up business
operations.
1. Introduction
Business requirements form the central theme on
which an Application is built. When developing an
Application, it is important to know the respective
industry domain, the target user population, the use
case scenarios they will be using often, acceptable
response time, the collective demand that
determines peak usage, offline batch processing,
operational data volumes, long term growth in
demand, and so on, that form the performance
requirements. It is also beneficial to know if the
product is targeted for a niche market, or there is
already one or more competitor/s with demonstrated
credentials in the respective business domain. All
this information is important for responsibly building
the target architecture. When the quality measures to
be achieved are quantified at an early stage, it is
possible to choose the most fitting architectural
elements for building the structure that will suffice the
performance requirements.
During the milestone delivery phase, developmental
activities are frozen to actively test for the
performance of the product, as if being subject to real
use by end users. It is important that time is
scheduled for such an activity, to check the stability of
the release and reduce the risks due to regression.
Performance testing enhances the identification of
system hotspots that are rarely spotted in a functional
test. It also helps in understanding the level of
resources being consumed, determining whether the
system requires further tuning of performance
parameters, gauging the ability of the system to
support a work load of higher magnitude within
acceptable variance of the key performance
characteristics. The exhibited performance
characteristics are crucial for estimating the capacity
demands of the infrastructure required to support the
end user demands over the lifecycle of the product.
Time invested for discerning the architectural and
design alternatives, realizing the tangible performance
savings through code & configuration optimization,
and saving every bit of resources from being
consumed beyond the actual need, are keys to
success in the journey of making a product that
distinguishes well in the market.
2. Business Need
The product being considered for study is an
Enterprise class Web based Accounts Payable (AP)
Application built to replace a legacy system comprised
of several sub systems developed over years, with
limited documented information available on
Application integration and interfacing requirements.
The Accounts Payable application was built on Open
Source frameworks and APIs like Java/J2EE, Tomcat,
Oracle Database, Spring, Spring Batch, ActiveMQ,
and iBatis ORM (Object Relational Mapping). An
in-house Data Quality Management Application was used
for the purpose of mass scale record matching.
2.1 Performance Quality Requirements
The product that has to be built to replace the legacy
system has stringent performance requirements, as
the existing system already supported a large number
of concurrent users and huge volumes of data through
its successful operation in the business domain over
the years. In the business domain it operates in, the
product's customer leads its competitors by a good
margin and holds a significant share of the market. The
key performance requirements are listed below:
• 300 users concurrently active on transactions
like Manual Reconciliation of Invoices (either
at header level or detail level), Receipt Item
Maintenance, and Manual entry of Invoice.
• Several daily batch jobs for execution that have
stringent throughput requirements, like
completing processing of 10 to 12 million
detail level invoices within a 2 hour time
window.
• Web Page Response time of 2 seconds.
• Operating Data Volume: Header Level
Invoice - 40 Million, Detail Level Invoice - 500
Million.
• Daily Data Processing Volume of 150 K
(includes both Header and Detail Invoices),
with transparent audit trail logging.
• CPU (Central Processing Unit) Utilization to
be under 50 % and Memory Utilization to be
between 30 and 40 %
The challenging part of the project was to develop the
system without legroom for mistakes that would derail
the earmarked time schedule and cost. From day 1 of
go-live, without fail, the system has to be operational
for the above user concurrency, data volumes,
response time and data processing time windows,
throughput and resource consumption. It was
apparent that the architecture, design, and
development approaches have to be smart and
efficient enough to build every single element of the
system in a befitting manner, minimizing wastage.
2.2 Performance Driven Engineering
Software Systems are increasing in complexity day by
day. Next generation technologies are arriving on
stage at a pace hardly imagined before, so the
challenges of building something worthy of its class
and adaptable to the business needs are also
growing proportionately. Driven by competition,
Software Development organizations are forced to cut
costs of construction, maintenance, and
customization. In this realm, it is worthwhile to think
how the developing organization can ensure that it
builds a functionally useful product with the greatest
satisfaction for its buyer.
The basic principle underlying the development of a
system that satisfies its customer needs is to address
functional and quality concerns right from the
conceptual stage, and carry on that rigor through all
life cycle stages until the product is phased out.
Taking the right step at its most appropriate time cuts
down wastage and avoids unnecessary costs that can
later grow uncontrolled.
PELC (Performance engineering life cycle) stages
overlap with the SDLC (Software development life
cycle) stages and synergistically enhance the
performance quality of the product.
Within the functional scope it is important to
understand the target users who will benefit from a
business point of view. Some functions are heavily
used and others occasionally. Even at an abstract
level, each function when triggered for execution by
user action, is expected to result in a set of
component interactions occurring in a known pattern,
because of which a certain demand is imposed on the
resources backing the running Application. From the
perspective of real usage demand of the live
Application, the different functions will be invoked at
different rates, and their demands will collectively form
a demand foot print that the Application should be
capable of delivering at the mutually agreed upon SLA
(Service Level Agreement). Figure 1 represents the
discussed idea.
Figure 1 Resource demand induced by transaction
The magnitude of demand imposed at each resource
center can be different and will depend on the actual
implementation, which is unknown at the architecture
development time. Nevertheless, based on the
architecture under evaluation, it is worthwhile to
approximate the total number of demand requests or
the time spent at each resource center to determine
which of them are in high demand. Such a back-of-the-envelope
approximation will help to simplify the
analysis required to determine if the current
architecture introduces any hotspots that will deter
performance. The earlier such an analysis can be
done, the better, because it improves the success rate
of decisions being made later on.
Once the architectural backbone is laid and the design
phase flourishes, early evaluations related to
performance can be done because components along
the transaction pipe-lines are gradually being realized.
Along with the regular unit testing, performance unit
tests are done at this juncture to streamline and
optimize code. Automated Code quality checks save a
lot of effort as it evaluates all engineering debts to be
closed, which among others include architectural and
performance quality issues. Getting a good
compliance rating at this time is important for attaining
the targeted quality objectives.
Once the implementation of the key business
transactions are completed, they can be profiled at
code and SQL (Structured Query Language) level for
additional insights on performance. Method level and
SQL level inefficiencies are brought to fore for re-
factoring. The objective at this point is to save
computing resources and time, without falling in to the
trap of premature performance optimization.
As development progresses and matures through this
phase to reach a state of stable release, the code
base is frozen to avoid testing in flux. The Application is
deployed in a performance test environment
comprising of hardware and network resources at a
scaled down factor of the production levels.
Production scale data if available is loaded in the test
environment. If not, appropriate tools are used to
generate data in volumes and load in the data
repository.
The workload characteristics that were determined at
the requirements gathering time to be the most likely
peak load to occur in production server is taken as
input for formulating the testing strategy. Transactions
are recorded to script automated transaction
invocations. Scripts are further modified through
configuration, and parameter correlation to simulate
user concurrency for the desired transaction mix.
Load ramp up and ramp down times, think time,
delays, and supporting data for various transactions
are also suitably set in the performance test
automation tool/ harness.
Load is triggered to check whether it simulated a
realistic load as originally intended. Once validated for
expected behavior, configurable performance
parameters are tuned sufficiently in the test
environment to support the quantum of load that will
be generated by load testing. While simulating the
load, checks are done to avoid all biases that would
lead to misinterpreting the performance behavior.
When the load is simulated, the environment and the
Application are monitored and all relevant
performance metrics collected for analysis and
reporting. Inferences are made on scientific evaluation
of the metrics. Outcome of the analysis is captured in
the test report, which reports the status of
achievement of the performance objectives stated at
the onset, existence of hotspots if any, and
recommendations for improvement. The testing report
feeds back to engineering, if improvement is required.
If the objectives are met, the Application performance
is benchmarked against the infrastructure capacity for
making capacity planning decisions for customers.
Performance engineering life cycle continues its
journey in production phase, where it takes the form of
continuous monitoring of the live system. The data
collected from monitoring is checked for symptoms of
short term and long term performance failures. All
such symptoms, or in some instances, real failures,
are debugged for identifying the root cause and
correcting as appropriate. Performance monitoring
data is also used for predicting user demands,
understanding the current resource usage levels,
knowing the user experience, and anticipating an
imminent problem like resource crunch that can be
fixed ahead of an issue.
2.3 Deployment architecture for
performance and scalability
At the tail end of the development phase, concrete
decisions related to deployment had to be taken. Load
balancing and fail-over are the main concerns to be
addressed as per the requirements. For each layer,
redundancy is required, hence the concept of Server
pooling was considered. Having a cluster of Servers
allows for scaling out in the event of demand spike,
and in this scenario, the entire Application was scaled
out following that approach. Web Server content
caching for static data was adopted to reduce the
network demand. Figure 2 provides the high level
deployment plan.
Figure 2 Application Deployment Diagram
3. Development time tuning
The high level decisions taken in the previous phases
were actually put to the test during development time. As
the product started taking shape, it became easier to test
and validate for gaps. In this phase of product
development, we encountered several challenging
situations that proved to be good learning. During this
time, the standard engineering practices like
1 Code profiling to find out method bottlenecks, and
optimizing them
2 Slow Query analysis and adding required indexes
3 Sizing down Web pages after analysis using
developer tools
4 Conducting automated and peer code reviews,
and,
other best practices enabled the team to deliver at a
consistent quality. Some key lessons learned during
this phase are described below.
1 Power of parallel computing: As mentioned
earlier, Audit trail logging is a key functionality of
the system. It is a pre-built component built upon
the concept of a queue. By default, items for
processing are put in a queue and consumed by a
single listener. Main invoice and sub items form a
sizable entity, whose accompanied change history
will also be bulky. Change history entries are
pushed to the queue, from where it will be
asynchronously picked for processing or
persisting in the Audit DB (Database) table.
During a high demand time interval, the change
history items generated are numerous and
start filling up the queue. With only a single
listener consumer registered for the queue, the
rate of consumption fell behind, leaving
items accumulating in the queue and thereby creating
memory pressure. The number of listeners was
increased to observe the performance gain as in
Table 1
Queue buffer size: 2 GB
Single Listener (default): Frequent out of memory errors when subjected to high load
Number of Listeners = 3 (optimized): A logger that checks the number of processed queue items every 10 mins reported that several thousand had been processed
Table 1 Queue performance optimization by configuring the appropriate number of Listeners registered against the queue
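The listener scaling described above can be illustrated with an in-memory sketch; the real audit component used a message queue, so the BlockingQueue and thread pool here are simplifying assumptions that only demonstrate why consumption rate scales with the listener count:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: draining a bounded queue with N listener threads.
public class AuditQueueDemo {
    static AtomicInteger processed = new AtomicInteger();

    static void drain(BlockingQueue<String> queue, int listeners)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(listeners);
        for (int i = 0; i < listeners; i++) {
            pool.submit(() -> {
                // Each listener pulls items until the queue is empty;
                // a real listener would persist each item to the Audit DB.
                while (queue.poll() != null) {
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        for (int i = 0; i < 1000; i++) queue.add("change-history-" + i);
        drain(queue, 3); // 3 listeners instead of the default single one
        System.out.println("processed=" + processed.get());
    }
}
```

With three listeners the same 2 GB buffer drains faster than items arrive, which is the balance the tuning in Table 1 achieved.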
2 Fetch the data in a batch and group it in Java
code: There is a requirement to view huge set of
data through the Web Browsers. As the
operations are performed in Web Browser itself
using JS functions, all the required data should be
brought to the screen while loading the page itself.
For example, Detail Level Manual Reconciliation
is an operation in which user works on the Invoice
line items and match them against the
corresponding receipt line items. For the operation
of matching, the screen should be populated with
huge set of Invoice and receipt data. This data is
encapsulated as records in parent and child
tables. Fetching them requires different queries,
which repeat as many times as the number of
line items in the invoice. As a result, loading an
invoice with more than 30 line items was taking
more than the 2 second SLA. It was soon realized
that the performance issue was due to repeated DB
calls.
Code was re-written to fetch the additional details
required for the line items together, and group it in
Java code. Refer to the tables for pseudo code -
sub-optimal (Table 2) and optimized cases (Table
3)
List<InvLineItem> InvLineItems = fetch invoice line items --DB round trip
for each InvLineItems{
fetch and add Detail1 --DB round trip
…
fetch and add Detail5 --DB round trip
}
Table 2 Pseudo Code of the logic causing performance issue
List<InvLineItem> InvLineItems = fetch invoice line items; --DB round trip
List<InvLineId> lineIds = get the list of line ids;
Map<InvLineId, Detail1> = Fetch the Detail1 for every lineIds and create a Map with key as the lineId and
value as Details1; --DB round trip..
Map<InvLineId, Detail5> = Fetch the Detail5 for every lineIds and create a Map with key as the lineId and
value as Details5; --DB round trip
Tag the fetched Details into corresponding InvLineItems
Table 3 Pseudo Code of the logic for resolving performance issue
Performance improvement achieved in this case is given in Table 4
Test Scenario: loading an invoice with 50 line items
Before fix: almost 50 queries fired; response time 7 to 9 seconds
After fix: always 5 queries triggered irrespective of the number of line items; response time < 2 seconds
Table 4 Performance advantage of fetching data in batch and grouping in the calling code
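The batch-fetch-and-group pattern of Table 3 can be sketched as follows; the fetch method is a hypothetical stand-in for the real iBatis-mapped query that returns details for all line ids in one round trip:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the optimized pattern: one query per detail type instead of
// one per line item, then grouping in Java by line id.
public class BatchFetchDemo {
    static class Detail {
        final long lineId;
        final String payload;
        Detail(long lineId, String payload) {
            this.lineId = lineId;
            this.payload = payload;
        }
    }

    // One simulated "DB round trip" returning details for ALL line ids.
    static List<Detail> fetchDetailsForLines(List<Long> lineIds) {
        List<Detail> out = new ArrayList<>();
        for (long id : lineIds) out.add(new Detail(id, "detail-" + id));
        return out;
    }

    // Group the batched result by line id so each line item can be tagged.
    static Map<Long, List<Detail>> groupByLineId(List<Detail> details) {
        Map<Long, List<Detail>> byLine = new HashMap<>();
        for (Detail d : details) {
            byLine.computeIfAbsent(d.lineId, k -> new ArrayList<>()).add(d);
        }
        return byLine;
    }

    public static void main(String[] args) {
        Map<Long, List<Detail>> grouped =
                groupByLineId(fetchDetailsForLines(List.of(1L, 2L, 3L)));
        System.out.println(grouped.get(2L).get(0).payload);
    }
}
```

Repeating this once per detail type (Detail1 through Detail5) gives the fixed five queries reported in Table 4, independent of the line item count.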
3 Custom query to update multiple records:
There are a couple of scenarios in which the user
performs some operation on a parent record and
the system has to update its child records as part
of the operation. In the application, there are
multiple occurrences of such scenarios.
For example, when a user is performing header
level matching of an invoice, all the invoice line
level item records should be updated for fields like
line_item_match_amount and line_match_status.
The logic is to update line_item_match_amount by
multiplying the line_item_qnty and
line_item_unit_price values in the same record,
and to update line_match_status to 'MATCHED'.
As per the design, all the DML (Data Manipulation
Language) operations have to be accomplished
through the ORM framework, and that is how it was
implemented. Table 5 provides the pseudo code of the
operation.
LIST<InvLines> = Fetch the DOM records for LIST<inv_line_id>; -- DB round trip;
for each InvLines{
Update line_item_match_amount = line_item_qnty * line_item_unit_price;
Update line_match_status = 'MATCHED';
Persist the data; -- DB round trip;
}
Table 5 Pseudo code for DML operation through ORM framework
In the above approach, there is a DB call to fetch
the DOM and another DB call for each item to
persist in the DB. Analysis showed that
this approach had performance limitations while
matching multiple invoices, each having
considerable number of line items. To overcome
the performance issue, the ORM based update
process was changed into a custom query based
update as given below.
update invoice_details set line_item_match_amount = (line_item_qnty * line_item_unit_price),
line_match_status = 'MATCHED' where inv_line_id in (LIST<inv_line_id>); -- only one round trip to
DB
As a result, only one DB call is required to update
the line item details irrespective of how many line
items are available for the selected invoices. The
performance advantage in this case is given in
Table 6
Test Scenario: Matching 10 Invoices and 10 receipts with 20 line items each
Prior to Fix: 5-8 seconds
After the Fix: < 2 seconds
Table 6 Performance benefit derived by use of a custom query to update multiple records
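The single multi-row UPDATE can be sketched as follows; in production the statement would be parameterized through the ORM or JDBC, so assembling the literal IN-list here is purely illustrative:

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch: building the one-round-trip UPDATE from a list of line ids.
public class BulkUpdateSql {
    static String buildMatchUpdate(List<Long> lineIds) {
        // Assemble "(id1, id2, ...)" for the IN clause.
        StringJoiner in = new StringJoiner(", ", "(", ")");
        for (Long id : lineIds) in.add(id.toString());
        return "update invoice_details "
             + "set line_item_match_amount = (line_item_qnty * line_item_unit_price), "
             + "line_match_status = 'MATCHED' "
             + "where inv_line_id in " + in;
    }

    public static void main(String[] args) {
        System.out.println(buildMatchUpdate(List.of(101L, 102L, 103L)));
    }
}
```

However many line ids are passed, the result is one statement and therefore one DB call, which is the source of the gain in Table 6.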
4. Performance testing time re-factor
Performance testing phase actually reveals the
behavior of the application when users start using it
concurrently. It is an important activity in the SDLC
phase that validates and verifies the ability of the
Application to operate consistently under load. Prior to
load testing, the following were done
• Pre-Production Environment (replica of
Production) was set up for PT and tuned for
the load.
• Customer PT Tools / Servers were leveraged
for Testing
• Load Runner Tool was used for Load Testing.
• Two Load Generator Servers were used to
simulate 300 Concurrent User Load Test for a
test duration of 1 Hour.
• Application Performance Management Tool
agents were installed on Application and
Batch Servers to analyze Performance
Bottlenecks in Web / App / Batch / Database
Tiers.
• Oracle AWR (Automatic Workload
Repository)/ADDM (Automatic Database
Diagnostic Monitor) Reports were leveraged
to analyze Database related Performance
Bottlenecks.
Some of the key lessons learned in this phase are
described below.
1 Configure Server parameters for performance:
Key performance parameters were configured in
the Web, App, and DB Servers as in Figure 3
Figure 3 Key performance parameter tuning for Servers
2 Configure proper Statement Cache: During the
performance testing phase, it was identified that
there was a considerable amount of waiting for DB
querying. At the onset of analysis, it was mistaken
for a problem due to insufficient Connections in
the pool, but it turned out that statement checkout
behavior in C3P0 was contributing to the wait
during query execution.
The issue was resolved by setting proper
Statement Caching, which is the caching of
executable SQL statements at Connection level.
A proper statement caching prevents
a. The overhead of repeated cursor
creation, and
b. Repeated statement parsing and
creation
Together, it also enables the reuse of data
structures in the client.
From analysis and testing, it was proven that
setting maxStatements to a sufficient number can
improve the Application's overall performance.
The value to be assigned was obtained by
multiplying the number of frequently executed
distinct queries by the maximum number of
Connections configured in the Connection pool
setting.
Performance advantage after the change was
implemented is given in Table 7
Test Scenario: Occurrences of waits for DB querying on checked-out statements
Prior to Fix: Occurring very frequently
After the Fix: Not occurring
Table 7 Performance benefit of proper SQL Statement caching
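The sizing rule described above (frequently executed distinct queries multiplied by the maximum pool size) can be expressed as a small sketch; the query and connection counts below are illustrative assumptions, not figures from the project:

```java
// Sketch of the maxStatements sizing rule described in the text.
public class StatementCacheSizing {
    static int maxStatements(int distinctFrequentQueries, int maxPoolConnections) {
        // Cache enough statements so every pooled connection can hold
        // a prepared copy of each hot query.
        return distinctFrequentQueries * maxPoolConnections;
    }

    public static void main(String[] args) {
        int distinctQueries = 40;  // assumed count of hot queries
        int maxConnections = 25;   // assumed maximum pool size
        System.out.println("maxStatements="
                + maxStatements(distinctQueries, maxConnections));
    }
}
```

With the assumed 40 hot queries and a 25-connection pool, maxStatements would be set to 1000.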
3 Fix Connection closed exception at the
beginning of transactions: During performance
testing, it was noticed that Connection Closed
exception was thrown while executing the first
SQL query of a transaction. This was identified as
being due to stale connections checked out from the
Connection pool. To rectify this, the parameters
testConnectionOnCheckout and
testConnectionOnCheckin were enabled in the C3P0
configuration properties. The issue was resolved after
this configuration change, but it introduced a slight
delay while checking out a Connection from the
pool. To avoid this delay, a combination of
Connection related parameters was used instead:
idleConnectionTestPeriod and
testConnectionOnCheckin. As a result, both the
idle test and the check-in test are performed
asynchronously, which leads to better
performance, both perceived and actual.
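The resulting combination can be expressed as c3p0 configuration properties; the key names follow c3p0's documented property naming, while the 300-second idle test period is an illustrative value, not one taken from the project:

```java
import java.util.Properties;

// Sketch: the asynchronous connection-testing combination as c3p0
// configuration properties (c3p0.properties style keys).
public class C3p0TestConfig {
    static Properties connectionTestProps() {
        Properties props = new Properties();
        // Test connections on check-in rather than synchronously on
        // checkout, avoiding the per-checkout delay described above.
        props.setProperty("c3p0.testConnectionOnCheckin", "true");
        // Periodically test idle connections so stale ones are evicted
        // before they are handed out (interval is an assumed value).
        props.setProperty("c3p0.idleConnectionTestPeriod", "300");
        return props;
    }

    public static void main(String[] args) {
        connectionTestProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The same keys can live in a c3p0.properties file on the classpath instead of being set programmatically.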
4 Fix inordinate delays while deleting records
from the DB: In the online Application, there is no
hard delete. All the records will be soft deleted,
and there is a batch job to purge or archive the
soft deleted records from the DB. The count of
records will be varying from thousands to millions
depending on the business relevance of the table
that represents the entity. The following two issues
were causing the delay.
Issue 1: During the delete of parent table, foreign
key constraints from the child tables are checked
to ensure that a parent with a child record is not
deleted until deleting the child record. If the
columns on which constraints are enforced are
not indexed, the check is done in form of a full
table scan.
The pseudo script for the tables prior to fix is as
below
Create Table Receipt_Parent (Receipt_Id, Receipt_name, Date); -- Parent table
Create Table Receipt_child (Child_id, Receipt_Id); -- Child table
Alter table Receipt_child add constraint receipt_child_chk foreign key (Receipt_Id) references Receipt_Parent (Receipt_Id); -- Foreign key
Delete From Receipt_Parent; --> Full table scan of Receipt_child
To rectify this issue, an index was created for the
foreign key validation as given below.
Create index index_child on Receipt_child (Receipt_Id);
Delete From Receipt_Parent; --> Index scan of Receipt_child using index_child
Issue 2: The foreign key check can be done at
the transaction level or statement level. By
default, the constraint check is immediate – that
means for each row being deleted from parent
table, child table is checked for matching
records. The constraint check can be done at
the transaction level for all the records covering
the transaction in one shot. This allows Oracle to
optimize the performance of the checks. A
performance benefit of 50% of the execution
time was observed while deleting 6 million records.
Alter table Receipt_child add constraint receipt_child_chk foreign key (Receipt_Id) references Receipt_Parent (Receipt_Id)
DEFERRABLE INITIALLY DEFERRED; --> Deferring the constraint
Delete From Receipt_Parent; --> Check child table Receipt_child once for the whole transaction
With these two fixes, the performance improved
as in Table 8
Test Scenario: Delete 6 million records from a parent table with references to other tables
Prior to Fix: More than 24 hours
After the Fix: Less than 2 hours
Table 8 Performance gain while deleting records - by indexing the foreign key and deferring the foreign key check
5 Prior PK Sequence generation for bulk inserts:
The batch program is expected to handle more than
three million records. It was found that two queries
are fired for every single row insertion: one for
generating the sequence, and the other for the row
insertion. Since the use case expects to insert
more than three million records, 2 hits per record
would contribute six million hits to the DB for insertion
alone.
As a solution, sequences are generated in
advance using a custom query. Number of
sequences generated is equal to the commit size,
and generated sequence numbers were applied to
each insert statements which result in only one db
hit for every insert statement, and one additional
hit for every commit interval. Performance gain
Following this approach, a performance gain of
10-12 % was achieved, when tested for 1.2 million
records.
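The pre-generation scheme described above can be sketched as follows. The names (`next_sequence_batch`, `insert_rows`) and the DB stub are illustrative, not the product's actual code; in Oracle, the batch of sequence values would typically come from a single query such as `SELECT seq.NEXTVAL FROM dual CONNECT BY LEVEL <= :n`.

```python
# Sketch: fetch one batch of sequence values per commit interval instead of
# one NEXTVAL round trip per row. All names and the DB stub are illustrative.

def next_sequence_batch(start, size):
    """Stand-in for one DB round trip, e.g.
    SELECT seq.NEXTVAL FROM dual CONNECT BY LEVEL <= :size"""
    return list(range(start, start + size)), start + size

def insert_rows(rows, commit_size=1000):
    db_hits = 0
    next_start = 1
    pending = []
    for row in rows:
        if not pending:
            pending, next_start = next_sequence_batch(next_start, commit_size)
            db_hits += 1                  # one hit to pre-fetch a batch of IDs
        row_id = pending.pop(0)
        db_hits += 1                      # one hit for the INSERT itself
        # ... execute INSERT with row_id ...
    return db_hits

# Naive scheme: 2 hits per row. Batched scheme: 1 hit per row plus 1 per batch.
hits = insert_rows(range(10_000), commit_size=1000)
```

With 10,000 rows and a commit size of 1,000, the batched scheme needs 10,010 hits instead of 20,000, which is the shape of the saving the paper reports.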
5. Production time re-factor
Performance engineering does not stop at go-live. In the production phase, the prime focus is active monitoring of the infrastructure and application to understand usage trends, resource consumption, demand, and impending issues (short term and long term). The observations from monitoring drive further optimization of the application. True validation of non-functional qualities happens in this phase, which makes it very important for a development organization aiming to refine its skills and reach the highest level of maturity. Some of the key lessons learned in this phase are described below.
1 Audit component queue item check-out attempt: It was observed that queue items were not being processed at the expected rate. If processing does not happen, the next attempt picks the same item from the queue, which repeats endlessly, and the queue accumulates items.
The issue was resolved by limiting the number of times the same item can be picked up. If processing still fails, the item is discarded by moving it into a separate queue for deferred processing. Following this approach, a healthy balance is maintained between the number of items left in the queue and its threshold buffer size.
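The bounded-retry scheme can be sketched as follows; the attempt limit, queue shapes, and function names are illustrative assumptions, not the audit component's actual code.

```python
from collections import deque

MAX_ATTEMPTS = 3  # illustrative limit; the real threshold is a tuning choice

def drain(queue, deferred, process):
    """Pop items from the main queue; items that keep failing are moved to a
    deferred-processing queue instead of being retried forever."""
    while queue:
        item, attempts = queue.popleft()
        if process(item):
            continue                            # processed successfully
        if attempts + 1 >= MAX_ATTEMPTS:
            deferred.append(item)               # park for deferred processing
        else:
            queue.append((item, attempts + 1))  # bounded retry

queue = deque([("ok", 0), ("bad", 0)])
deferred = []
drain(queue, deferred, process=lambda item: item == "ok")
# "ok" is consumed; "bad" lands in the deferred queue after MAX_ATTEMPTS tries
```

The main queue thus cannot accumulate permanently failing items, which is the balance the text describes.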
2 Thread blocks resulting in clocking issues for end users at random entry points: It was reported that some transactions take more than 10 minutes and time out. After some time, many transactions on that Application Server (a node in the App Server cluster) start getting blocked.
On deeper analysis, it was observed that the Application Server threads were stuck in the 'C3P0' DB connection pooling component, in WAITING state at sun.misc.Unsafe.park(Native Method).
It was identified that the Application http-nio thread going into a wait condition for 600 seconds or more is correlated to the blocked state of the C3P0PooledConnectionPoolManager-Helper thread, or the Task-Thread-for-com.mchange.v2.async.ThreadPerTaskAsynchronousRunner, when trying to close a Statement underneath a Connection that is still in use. This would not cause an issue for most JDBC (Java Database Connectivity) drivers, which are developed in compliance with the spec requirement that closing a Statement while the parent Connection is in use must be supported. From various sources it was understood that the Oracle JDBC driver does not comply with that part of the spec, and hence can freeze, leading to deadlocks.
The solution adopted was to set the C3P0 configuration flag 'statementCacheNumDeferredCloseThreads', which avoids the prolonged thread wait by deferring Statement close to a dedicated thread.
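For reference, the flag is supplied through ordinary c3p0 configuration; a minimal sketch in `c3p0.properties` form (the value 1 is the setting commonly recommended for this flag, shown here as an assumption rather than the product's verified configuration):

```properties
# Close cached Statements on a dedicated deferred-close thread instead of
# in-line, avoiding the freeze seen with the Oracle JDBC driver.
c3p0.statementCacheNumDeferredCloseThreads=1
```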
6. Conclusion
Following the approach of addressing performance
quality from the early planning phases of Software
Development, it was possible to address Application
performance effectively, and appreciate the close
linkage between architecture and performance quality.
7. Acknowledgements
Our sincere thanks to the product team for their
patience responding to our requests even during tight
development schedules. We would like to thank
Shameer Sulaiman, Lakshmi Hariharan, Ranjit
Poduval, Dhanesh Chakrapani (Product Manager,
IP&E), Krishnakumar G (Engineering Lead, IP&E) and
other key members of the product team who had
interacted with us and shared valuable points during
the study. Special thanks to Debiprasad Swain (SPD
Engineering Services Group Head, IP&E), Mohan
Jayaramappa (SPD CoE Head, IP&E), and Rajashree
Das (SPD CoE Architecture Group Lead, IP&E) for
the guidance and directions provided.
8. References
[1] http://docs.oracle.com/
[2] https://www.oracle.com/index.html
[3] https://projects.spring.io/spring-framework/
[4] http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html
[5] http://www.mchange.com/projects/c3p0/
The Docker Container Approach to Build Scalable and Performance Testing Environment
Author: Pankaj Rodge ([email protected])
ABSTRACT
Testing forms an integral part of the software development life cycle (SDLC). There are two main levels of testing: functional and non-functional. Along with functional testing, scale and performance tests, which are part of non-functional testing, play an important part in the testing phase.
Building test environments to execute scale and performance tests for products like SSL VPN requires emulating SSL VPN client logins and running tools like iperf simultaneously from multiple machines to check the throughput and the maximum number of concurrent connections supported by the SSL VPN gateway. This would require deploying a large number of virtual machines and creating the appropriate networking, increasing the cost tremendously. The same applies to building a scale test environment for a feature like DHCP, which requires testing with a huge number of DHCP clients. Using docker containers, we can overcome these challenges efficiently. This paper compares the advantages of the docker container approach versus the virtual machine based approach for SSL VPN scale and performance testing and DHCP scale testing.
1. INTRODUCTION
Figure 1
SSL VPN-Plus is a service on the NSX Edge gateway. It allows Internet-enabled remote users to securely connect to private networks behind the NSX Edge services gateway. Remote users can access servers and applications in the private networks.
The NSX Edge gateway supports four form factors depending on the system resources, and the maximum number of concurrent SSL VPN connections supported depends on the form factor: a compact edge supports 50 concurrent SSL VPN connections, while an X-large edge supports 1000. To test the maximum number of SSL VPN client connections supported per form factor, we would need to deploy 50 virtual machines for a compact edge and 1000 for an X-large edge. Using docker containers, these tests can be conducted on a single machine.
Figure 2
1.1 Scale Deployments using Virtual Machines
Figure 3
For creating scale, a large number of VMs must be deployed. These VMs take up a lot of system resources, as they require not only a full copy of the operating system but also a virtual copy of the hardware that the operating system needs to run.
1.2 Scale Deployments using Docker Containers
Figure 4
A container environment is created by first deploying a host operating system. A container layer, i.e. docker, is then installed over the host OS. Once that is done, container instances can be provisioned from the system’s available computing resources, and enterprise applications can be installed within the containers.
Container View inside Guest OS
Figure 5
2. Scale for SSL VPN Clients
SSL VPN scale test environments can be created using docker containers with the following workflow:
Figure 6
1. Install Docker.
2. Build the image with all the required dependencies.
3. Create a container.
4. Install the VPN client.
5. Connect to the VPN gateway.
6. Start iperf.
7. Measure the throughput.
8. Repeat steps 3 to 7 for n clients.
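Steps 3 to 7 above can be sketched as a command-generation loop; the image name, container names, gateway address, and the `vpn-connect` helper are hypothetical placeholders, and a real harness would pass each command to `subprocess.run()`.

```python
# Sketch of steps 3-7 as docker command construction. Image, gateway, and the
# vpn-connect helper are hypothetical; only the docker CLI shape is real.

def client_commands(n, image="sslvpn-client", gateway="10.0.0.1"):
    """Return the docker commands needed to run n VPN clients plus iperf."""
    cmds = []
    for i in range(n):
        name = f"vpn-client-{i}"
        cmds.append(f"docker run -d --name {name} {image}")       # step 3
        cmds.append(f"docker exec {name} vpn-connect {gateway}")  # steps 4-5
        cmds.append(f"docker exec {name} iperf -c {gateway}")     # steps 6-7
    return cmds

for cmd in client_commands(2):
    print(cmd)
```

For a compact edge, `client_commands(50)` would enumerate the 50 clients; for an X-large edge, `client_commands(1000)`.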
2.1 Measuring throughput on a container and on a VM
Throughput is measured on the docker VM and on a normal VM using a 1G interface connected to the SSL VPN gateway, which is two hops away from both VMs. Iperf is used to measure throughput.
2.1.1 Throughput on the container
- SSL VPN connection status
- Throughput with a single thread
- Throughput with ten threads in parallel
2.1.2 Throughput on the VM
- SSL VPN connection status
- Throughput with a single thread
- Throughput with ten threads in parallel
It is observed that throughput is almost the same on the container and on the normal VM.
3. Scale for DHCP Client
3.1 Networking within docker host
- Here the ens192 interface of the docker host is connected to the DHCP server (both the DHCP server interface and the ens192 interface of the docker host are in the same port group).
- A bridge interface was created and the ens192 interface was added into it.
Bridge Status
Docker network
Interface Status
3.2 Creating containers with dhcp client installed
It is observed that creating 500 clients took 51 seconds, i.e. roughly 100 ms per container, which is much faster than deploying VMs.
root@ubuntu:~# time ./create_container.sh 500
real    0m51.600s
user    0m8.720s
sys     0m3.272s
4. Future Work
Docker containers can similarly be used to execute scale tests for services like L2 VPN, IPsec VPN, load balancer, etc. Scale tests can also be automated easily using docker containers.
5. REFERENCES
[1] https://docs.docker.com/engine/tutorials/dockerimages
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Conflux: Distributed, real-time actionable insights on high-volume data streams
Vinay Eswara VMware India, Bangalore [email protected]
Jai Krishna VMware India, Bangalore
Gaurav Srivastava VMware India, Bangalore
Abstract
Systems and applications are most useful when they can react to events and data in real time, because the utility of the reaction typically decreases with time. In cloud environments with high data volume, identifying and filtering relevant data, processing it, and reacting in real time becomes challenging.
This paper presents the design and implementation of Conflux, a system for merging high volume streams of data in a scalable, fault tolerant way. It allows users to declaratively specify complex merge functions over data streams and evaluates these functions in real time. It also provides callout facility where applications can specify conditional actions in the context of the merged result stream. This allows applications to delegate stream merging and evaluation and focus on the resultant application specific reactions.
Consider a stock-market use case where news feeds, stock quotes, risk appetite and currency values must be continuously monitored. When a favourable situation occurs, the system must automatically place buy or sell orders to maximize profit within the specified bounds of risk. With Conflux, the operator only needs to express application actions as a function over the current and historical stock quote, risk and currency streams, and provide stubbed code for buying, selling or paging a trader. Similar use cases can be realized in domains such as IoT and fraud detection, where real-time reaction at scale is critical.
Conflux itself is deployed on a cluster to linearly scale with data volume. It tolerates node failures by transparently redistributing streams on the fly using consistent hashing [1] onto survivor nodes while compensating for any data loss.
1. Introduction
Stream processing systems are critical for making accurate decisions in rapidly changing environments. In scenarios with a multitude of disparate data streams that need to be grouped and merged on the fly, the following issues need to be addressed:
Grouping: Users should be able to specify grouping of data streams before any merge functions.
Merging: Streams of data need to be merged based on some pluggable, domain specific rules that compute the result that can be specified by users of Conflux.
Change: Changes to the merge function specified by the user and group updates as individual data sources enter or leave the group should reflect immediately in the result of the merge.
Consider an example of automating Dev-Ops responses when running a critical SaaS application:
Operators’ definition of ‘critical workload’ may vary with time. Additional mid-tier machines may be added to the cluster or decommissioned; completely new application clusters may be marked critical. Conflux needs to track the current snapshot of ‘criticality’ for the automatic DevOps system to be effective. That is, it should allow users to specify a group of ‘critical data sources’ and guarantee that the group is updated in real time.
Developers, maintainers and operators of a system would usually have knowledge about what constitutes a problem. These would be expressed as a relationship between data streams. Since the users of that system know best, it should allow the developers / maintainers / operators to specify the relationship in a way that does not require a recompile / redeploy of Conflux itself. A secondary
objective is that it should be as simple as possible to specify these merge relationships.
If either the grouping or the merge functions were changed, the changes should reflect in the composite merged stream immediately. Unless this is immediate, the reaction would be based on a stale view until the new declarations take effect.
Conflux is designed to address these high volume, low latency stream merging use cases. Typical use cases include IOT, Dev-Ops, faceted dash boarding and monitoring. It keeps users of the system insulated from scale, availability and performance concerns and allows them to focus on their application logic.
Conflux users can specify application-specific aggregates, define thresholds and provide callout actions. This application code is written in JavaScript and runs on the JVM, which allows declarations to be modified dynamically without a system recompile or redeploy.
Conflux does not address the problem of modelling using machine learning or other methods. It focuses on evaluating the results of an existing model function with low latency. It supports adding, deleting or modifying existing models in real time. It is hosted as a SaaS platform and supports a multi-tenant model. It uses consistent hashing to partition data streams, based on the ID of the data source emitting the stream, to parallelize memory and compute resources across the cluster and to cope with node failure.
In this paper, we propose:
i. A scalable, fault-tolerant, low-latency way to evaluate models, expressed as relationships between different high-volume streams.
ii. A way to define new models, redefine existing models and delete models dynamically in real time.
iii. A way to threshold results and invoke external HTTP callouts for application-specific actions.
The rest of the paper is organized as follows: Section 2 presents the background for Conflux and related work. Section 3 presents the system design. Section 4 describes the implementation of Conflux. Section 5 evaluates behavior and performance under different scenarios. Finally, Section 6 contains concluding remarks and directions for future work.
2. Motivation and related work
Conflux originated as a general monitoring system [12] for virtual machines that ran critical workloads. The following requirements in this system led to the conceptualization of Conflux:
Figure 1. The first part of the figure represents routing to conflux nodes using consistent hash of the stream IDs. The second shows ease of deployment/maintenance of conflux versus a typical distributed application.
i. Metrics were collected at the VM level. There was a need to provide rollups in real time for useful real-time dashboarding.
ii. It was useful to allow the customer to synthesize new metrics by merging metric streams from applications and underlying metrics with resource billing and allocation metrics, to provide real-time, low-granularity economic ROI dashboards.
iii. A logical extension was to invoke HTTP callouts, where tenants can embed advisory or remedial actions such as notifications and auto-scaling respectively.
There are several existing stream merging systems available [2,3,4,5,6,7,8,9]. Twitter’s Heron, Google MillWheel and Google Photon are the closest in intent to Conflux, but are closed source. Other systems that have inspired Conflux are Apache Spark, Apache Storm, Apache Samza, etc. The critical reason for building Conflux instead of using these systems was twofold:
I. It was an explicit requirement that all nodes of the cluster processing system be identical with respect to the software stack, i.e. that there are no ‘master’ or ‘slave’ nodes. We have found that this greatly simplifies deployment and upgrades in production, since the VM snapshots used for deployment are identical. The tuning of the different nodes and their CPU, RAM and disk configurations are all identical. This requirement also implies that when the Conflux cluster deployment is under high load, all nodes should be almost equally loaded: there should be no single point of failure or hot spotting. Also, adding new nodes to the cluster should uniformly reduce load across the cluster. This requirement eliminated all systems with master-slave topologies, as well as systems requiring nodes with heterogeneous software components. Homogeneous node types allow us to leverage established technology like VM snapshotting, storage and deployment to monitor and manage the cluster.
II. Another requirement was to be able to truly operate on streams ‘on the fly’. Thus, frameworks that do any form of in-memory caching, e.g. micro-batching in Apache Storm, were not considered.
3. System Design
a) Definitions
Agents: are external systems that push data to Conflux in a format that it understands.
Packet: is the atomic unit of input and output in Conflux. It is a set of (ID, Metric, Timestamp, Value) tuples, which allows for composition and pipelining of data operations. The ID is a universally unique ID corresponding to each data source.
Stream: is a logically unbounded sequence of tuples bearing the same ID.
Routing: is the process of consistent hashing the ID in each packet with the number of live nodes in the conflux cluster to decide which node to deliver the packet to.
Data Source: is the entity being monitored for observable phenomenon. It could be a virtual machine in the case of monitoring, a process in the case of log file streams or a sensor in IoT. A single data source generates a single stream.
Metric: is an individual, time-stamped, measurable property of a phenomenon being observed. A data source can emit multiple metrics; for example, a data source corresponding to a VM can emit CPU, RAM and IOPS metrics identified by the VM ID, and a stock ticker can emit stock quotes identified by the stock symbol as ID. Metric values are not restricted to numbers; strings and Booleans are allowed.
Tags: are empty metrics without timestamp/value tuples. They are used for logical grouping of streams.
Feed forward: when a node receives a packet with some ID ‘X’, it looks up all the groups that ‘X’ belongs to and for each group, retransmits the packet after applying the group specific transformation. Each transformed packet’s ID ‘X’ is overwritten with the respective Group ID. This process of retransmission is called feed forward.
Since the same Group Id is used for routing by all members of a given group, the result of the feed forward would always be homed to a specific node although individual members may be homed to different nodes. This allows for parallelization of pre-processing before merging the stream. [Figure 2]
Merging: is the process of joining multiple streams to create a single composite stream. Comparing
feed forward and merging to MapReduce, feed forward is like map in that it operates over individual streams and can be parallelized, while merging is analogous to reduce, which aggregates multiple values into one metric.
Ingestion: Conflux expects agents to emit streams in a specific format. An ID uniquely identifies each stream. The stream-to-node mapping is decided by a consistent hash of the stream ID against the nodes in the Conflux cluster. This ensures that in steady state a particular stream always lands on a particular node in the cluster. Streams are roughly equally partitioned between the different nodes, so nodes can cache metadata about the stream IDs homed to them. This provides horizontal scale, since the fire hose of streams is split among Conflux nodes in a clustered deployment. When nodes enter or leave the cluster, the streams are consistent-hashed among the current snapshot of cluster nodes. This ensures that the load balancing characteristics of the cluster are retained.
Groups: Conflux is unique in the way groups are treated. It exposes a synchronous API to create a group and add/delete constituent members. The create API expects a group name and a list of stream IDs that belong to the group, and returns the group ID. The add and delete APIs expect the group ID and the list of elements to be added to or deleted from the group.
Any group operation is guaranteed to disseminate across the cluster quickly. This is again done by consistent hashing IDs against nodes. Consider a group created with member IDs A, B and C, returning the group ID G. Apart from persisting this information into the persistent store, for each member ID Conflux sends out a message notifying the member that it now belongs to group ‘G’. The message is routed the same way as an agent message: by consistent hashing the member ID. This ensures that the group membership message reaches the same node where the corresponding stream is homed. Since this message is subject to acknowledgements and retries just like stream messages, it eventually reaches the correct node. To illustrate, consider member ‘X’ being deleted from the group ‘G’. Conflux persists the information and sends a message using ‘X’ as the routing key, specifying that the X-to-G mapping no longer holds. The X routing key ensures that the group membership change message lands on the same node that services ‘X’. The node updates its cache, removing X from group G. The same sequence of events happens
Figure 2. Every node maintains a cached list of groups that a stream homed on it belongs to. All group operations update the list, as shown in the first 2 diagrams. These lead to fast group updates. The 3rd ‘feed forward’ figure shows how this list is used for routing, enabling merging of grouped streams at one node. Since nodes owning groups is determined by consistent hashing and groups have unique IDs, Groups themselves are evenly spread across the cluster.
on member addition and group creation. If many members are added or deleted, Conflux sends the messages out in parallel; the remaining steps are identical.
b) Load Balancing, Scalability, Availability
Load balancing in Conflux is probabilistic, based on consistent hashing. For a given stream, i.e. an unbounded sequence of tuples with the same ID, routing is done by consistent hashing the ID against the live Conflux nodes in the cluster. Once a packet lands on the cluster, the same ID is used for persisting the data into Cassandra. This ensures that writes to Cassandra are always local.
The number of network hops needed for a packet to go from emission by an agent to persistence in Cassandra is exactly one. The probabilistic consistent hash algorithm ensures that all streams being ingested are roughly equally distributed across the Conflux clustered deployment.
If a node enters or leaves a Conflux cluster, changing its state, message routing in RMQ changes and Cassandra selects new nodes to write to, both per the latest cluster state. However, since the method of node selection is the same in both cases, i.e. consistent hashing the ID against the number of live nodes in the cluster, both select the same nodes for the same IDs; thus, writes always remain local. For the other nodes in steady state, which are not part of the cluster state change, the consistent hash algorithm ensures that streams homed to them remain homed as before, so the caches of these nodes are minimally impacted.
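The minimal-remapping property described above can be illustrated with a toy consistent-hash ring. The hash function, virtual-node count, and node names here are illustrative assumptions; Conflux itself routes via the RMQ consistent hash plugin rather than this code.

```python
import hashlib
from bisect import bisect

# Toy consistent-hash ring: each node contributes many virtual points; a
# stream ID routes to the first node point at or after its own hash value.

def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self.points = sorted((_h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))

    def route(self, stream_id):
        keys = [p for p, _ in self.points]
        idx = bisect(keys, _h(stream_id)) % len(self.points)
        return self.points[idx][1]

nodes = ["node-a", "node-b", "node-c"]
before = {sid: Ring(nodes).route(sid) for sid in (f"vm-{i}" for i in range(100))}
# node-c leaves the cluster: only streams homed on node-c are remapped
after = {sid: Ring(["node-a", "node-b"]).route(sid) for sid in before}
moved = sum(before[s] != after[s] for s in before)
```

Streams homed on the surviving nodes keep their home, so only the failed node's share of streams (and cached metadata) has to move.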
c) Deployment Architecture
Conflux is deployed in a 5-node cluster that spans 3 racks, allowing for one rack to fail. Connectivity is through the gigabit backplane IP of the data center to prevent network hops being the bottleneck. [Ref: Figure 3]
Each node is configured with 8 CPU’s (virtual), 32 GB RAM and 1 TB of hard disk.
RMQ is configured for configurable batch acknowledge. I.e. Conflux consumes messages in batches and acknowledges them. Details are provided in the implementation.
Due to the nature of consistent hashing, load tests without groups distribute packets almost identically across nodes, as expected. The standard deviation of percentage CPU is less than 3 when the average CPU is 95% across all 5 nodes in the cluster.
d) Reactive monitoring of individual streams managed by a group
Grouping allows users of Conflux to address and manage a large, transient list of streams with one group ID. Entries to and exits from the group will
Figure 3 A typical deployment of Conflux consists of an odd numbered set of nodes to allow for quorum election in Cassandra. The nodes are always local to a data center and spread across racks to guard against hardware failure. TCP/IP connectivity is through the data center backplane. In practice a set of 5 nodes is used.
be automatically taken into account, provided group updates arrive accurately from an external system watching group memberships. Consider a medical IoT device streaming data into Conflux, where the problem is to page a doctor when there is no heartbeat for a specified time for a given patient. In this example, the Conflux user will form a group consisting of heartbeat data sources (since there would be several other data sources in the system measuring blood pressure, temperature, etc.), formulate a ‘no heartbeat’ condition, and specify the paging callout for notifying a doctor. When any stream in the group of heartbeat data sources satisfies the ‘no heartbeat’ condition, the correct doctor is paged, based on the ID of the data source that caused the problem. This example can be extended to machine-learning-type use cases by specifying cost functions on individual metrics in a packet and calling out when a condition changes. In this example, the operation is on the individual stream; the group merely functions as a ‘tag’ for classifying classes of streams.
e) Merging streams based on groups
Groups also allow for merging streams based on the group ID. This is the more interesting use case of Conflux. Since a node maintains the mapping of the groups a stream belongs to, the node can apply a transform to the metrics and retransmit the data using the group ID as the routing key. Note that all members of a group transmit on the same group ID; as a result, all the retransmitted data from the children lands on the one node on which the group ID is homed. These streams are then merged with the merge function specified for the group. The result of the group merge computation can again be fed forward to compose further. Members of a group are expected to have the same sample frequency. There may be cases where groups would contain merges of many hundreds of streams, which would create hotspots. Conflux deals with this by:
i. Limiting group membership to 20.
ii. Breaking groups larger than 20 into subgroups of 20 elements or fewer, recursively, so very large groups are broken down into a tree-like containment hierarchy.
iii. This works for cases where the merge functions are associative, e.g. SUM, MAX, MIN, AVG etc.
iv. For non-associative merge functions like STDDEV, TOP-10 etc., the group size is restricted to 30.
v. In practice, large non-associative group merges are roughly expressed using a combination of associative merges over the large group and a final non-associative merge. For example, the TOP-10 over a very large group containing many hundreds of elements can be expressed approximately as the TOP-10 of 30 large groups of MAX.
This is a limitation of Conflux, since such a value may not always be accurate.
The other limitation with this design is that load balancing is achieved at the cost of increased network hops of the order of log(group membership limit). This is not very significant in practice.
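The recursive subgrouping for associative merges can be sketched as follows; the function names are illustrative, and the fan-out of 20 follows the membership limit stated above. For an associative function, merging the tree level by level gives the same result as merging the flat group.

```python
# Sketch: break a large group into subgroups of <= FANOUT members recursively
# and apply an associative merge (e.g. MAX) at each level of the tree.

FANOUT = 20  # group membership limit from the text

def tree_merge(values, merge=max, fanout=FANOUT):
    if len(values) <= fanout:
        return merge(values)               # small group: merge directly
    # split into subgroups, merge each, then merge the partial results
    partials = [tree_merge(values[i:i + fanout], merge, fanout)
                for i in range(0, len(values), fanout)]
    return tree_merge(partials, merge, fanout)

# Associative merges (MAX, SUM, MIN, ...) give the same answer over the tree
# as over the flat group; non-associative ones (STDDEV, TOP-10) generally don't.
result = tree_merge(list(range(1234)))
```

This is why the text restricts the scheme to associative functions and only approximates non-associative ones.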
Figure 4: Transforming, Grouping and merging streams
The stream that results from a group merge is indistinguishable from an incoming stream, i.e. group streams, sub-group streams and incoming streams are distinguishable from one another only by their unique IDs. Conflux treats all streams identically by persisting every incoming packet in Cassandra. The intermediate stages of calculation from sub-groups are persisted as well.
Since sub-groups are treated identically to groups, group change notifications travel up the tree and the group membership is correctly reflected.
f) Aligning time in streams
We propose the following approaches for aligning time in feed forward and merge functions:
i. Wall clock window, which produces an aggregate packet every time the wall clock interval is crossed, considering all packets in the interval.
ii. Sliding window, which produces an aggregate packet for every incoming packet, considering all past packets for the same (id, metric).
iii. Timeout per ID: all metrics with timestamps older than this timeout value are not considered for merge or feed forward.
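The wall-clock window approach can be sketched as follows; the window length, class and aggregate function are illustrative assumptions. Note how a late packet simply lands in its own window's bucket and re-triggers that window's aggregate, which is the behaviour the failure-handling section relies on.

```python
from collections import defaultdict

WINDOW = 60  # wall-clock interval in seconds (illustrative)

class WallClockWindows:
    """Bucket (id, metric) samples by wall-clock window and aggregate each
    bucket; a late packet re-triggers its own window's aggregate."""
    def __init__(self, agg=lambda vs: sum(vs) / len(vs)):
        self.buckets = defaultdict(list)
        self.agg = agg

    def ingest(self, stream_id, metric, ts, value):
        window = ts - ts % WINDOW            # start of the wall-clock interval
        self.buckets[(stream_id, metric, window)].append(value)
        return window, self.agg(self.buckets[(stream_id, metric, window)])

w = WallClockWindows()
w.ingest("vm-1", "cpu", 965, 10.0)    # window starting at t=960
w.ingest("vm-1", "cpu", 1010, 30.0)   # same window: running average 20.0
w.ingest("vm-1", "cpu", 975, 50.0)    # late packet re-triggers window 960
```

A sliding window would instead aggregate over all past samples for the (id, metric) pair on every arrival, and the per-ID timeout would drop the late packet entirely if it fell outside the threshold.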
g) Software stack
The software stack consists of:
Web layer: Play Framework
Message bus: RabbitMQ, with the consistent hash plugin for message routing
Persistent store: Cassandra
The unit of deployment of Conflux is a node. A node can be either a physical box or a VM (virtual machine). If VMs are used, care must be taken to deploy them on different physical hardware to mitigate the possibility of catastrophic hardware failure. The physical hardware, however, is assumed to be part of a single data centre, connected by redundant, high-bandwidth cabling; i.e. nodes cannot span geographical locations.
h) Handling failure
We consider two broad classes of failure:
i. Node failures: we rely on consistent hashing to redistribute streams to survivor nodes upon node failure. As described in section 3.b, traffic gets re-routed to available nodes. Batched acknowledgements and corresponding retransmits in RMQ ensure packets are not lost if a node goes down before processing.
ii. Packet re-transmits and delays: these are handled naturally for wall clock time alignment. When a packet arrives late, all calculations are triggered for the wall clock window that the packet belongs to. Feed forward ensures that these changes are reflected in all group aggregates that the ID belongs to. Out-of-sequence packets do not matter, since the aggregate values are eventually all correct. The system can recover from failure by replaying persisted packets from the time of failure.
Packets older than a certain threshold are discarded by the system, since they can trigger cascading re-calculations up to the present time.
i) Tradeoffs and Limitations
It is important to note that since Conflux is designed for low-latency stream operations, a conscious design tradeoff is to assume the absence of network partitions. Conflux is not designed for a multi-site deployment.
4. Implementation
Monitoring Insight [12], a vCloud Air alpha service offering, is in production with a subset of Conflux features.
RMQ [10] is used as the message bus and Cassandra [11] as the persistent NoSQL store.
The system was tested monitoring 80+ metrics from each of 3,000 VMs, collecting data every 5 minutes at 20-second granularity. This was simply too much data. A new 'subscription' feature was introduced whereby customers could subscribe, paying per VM ID; different subscriptions resulted in different granularities, with a default of 5 minutes instead of 20 seconds. This dealt with the data volume well. The subscribe action uses the same feed forward, routing on the VM ID to which the customer subscribes, so it comes into effect immediately. This is currently in production.
A basic JavaScript API for surfacing JMX metrics, log files and accessing the Cassandra database has been tested. This has been invaluable in programmatically running checks to poke around when a problem occurs on a running system.
Data was written into Cassandra with a default TTL of 30 days. A subscription option, just like the one for granularity, made it possible to store metrics for 45 or 60 days. For example, very high-granularity metrics could be stored for a longer time for important monitored elements.
Cassandra compaction was run weekly across the entire cluster to reclaim tombstone data.
5. Evaluation and Performance
These performance numbers are for a single VM that is part of a cluster. The ingestion capacity
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
increases linearly as new nodes are added due to the consistent hashing property. The chart in Figure 5 is based on 20-second granularity data from 2 VMs. Nodes acknowledged each individual packet, which led to poor ingestion performance and increased latency.
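The linear scaling claimed here rests on the standard consistent-hashing property that adding or removing a node remaps only that node's share of streams. A minimal illustrative sketch (not the Conflux implementation; node names, virtual-node count, and hash choice are assumptions):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps stream IDs to nodes; removing a node only remaps its own streams."""
    def __init__(self, nodes, vnodes=64):
        self._ring = []                      # sorted list of (hash, node)
        for n in nodes:
            self.add(n, vnodes)

    def add(self, node, vnodes=64):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def route(self, stream_id: str) -> str:
        h = _hash(stream_id)
        i = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {sid: ring.route(sid) for sid in (f"vm-{i}" for i in range(1000))}
ring.remove("node-b")                        # simulate a node failure
after = {sid: ring.route(sid) for sid in before}
moved = [sid for sid in before if before[sid] != after[sid]]
```

Only the streams previously owned by the failed node move, which is why capacity scales roughly linearly with node count.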
Turning off individual acknowledgement and enabling batch acknowledgement dramatically pushed up ingestion rates as shown in Figure 6.
As expected, failure recovery is batched as well. The pre-fetch size is set to 512, acknowledging 256 packets at a time, so there is always data in memory; this speeds up consumption from 335 packets per second to 1,277 per second.
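The effect of batching acknowledgements (as RabbitMQ supports via a `multiple`-style ack covering all deliveries up to a tag) can be sketched as a toy consumer model; the counts and the cost model are illustrative, not measurements:

```python
# Illustrative sketch of the batched-acknowledgement pattern described above;
# each "ack operation" stands in for one round-trip to the broker.
PREFETCH = 512
ACK_EVERY = 256

def consume(packets, batched):
    """Return the number of ack operations sent to the broker."""
    acks = 0
    unacked = 0
    for _ in packets:
        unacked += 1
        if not batched:
            acks += 1                    # one ack round-trip per packet
            unacked = 0
        elif unacked == ACK_EVERY:
            acks += 1                    # one ack covers the whole batch
            unacked = 0
        assert unacked < PREFETCH        # broker stops delivering otherwise
    if unacked:
        acks += 1                        # flush the final partial batch
    return acks

per_packet = consume(range(10_000), batched=False)
per_batch = consume(range(10_000), batched=True)
```

With a prefetch of 512 and acks every 256 packets, the consumer never stalls waiting for the broker, which is the mechanism behind the throughput jump reported above.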
These figures are from the RMQ monitoring console. The blue lines in the figures indicate messages being consumed and persisted to Cassandra.
6. Conclusion
Conflux is a critical enabler for the data mining services that an increasing number of companies are investing in. It allows users to react in real time to events defined as functions on data streams.
Allowing customers to define actions that monitor and execute on live streaming data is a vital requirement across domains. This could be used to automate DevOps, avert disaster scenarios, monitor IoT, or even execute buy/sell decisions on the stock market.
After historical data has been mined and collated into meaningful models, companies typically do not act on it in an automated manner. Conflux not only provides the means to act fast, but also to update models with the latest learning in real time.
7. Future work
Handling large groups, and handling data sources with very different packet emission frequencies, leads to skew, since all keys do not correspond to the same load. Thus, load balancing is a subject for future work.
Cassandra and RMQ pick different nodes for the same ID, although both use consistent hashing. This leads to an additional network hop. Tuning the hash functions across the two systems so that they select the same node is another subject for future work.
Purely dynamic groups, whose members are defined by a function, are also targeted for future work: if the function returns 'true', the packet at the child is fed forward to the group ID.
Error handling in sliding windows is a further open item.
Figure 7. Data collected at 20-sec. intervals, for 2 VMs, over an hour, displayed in a web UI. This could be any monitored element or IoT device.
Figure 5. Ingestion performance with individual message acknowledgements.
Figure 6. Ingestion performance with batched acknowledgements.
8. References
[1] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 04-06, 1997). STOC '97. ACM Press, New York, NY, 654-663.
[2] Apache Samza. http://samza.incubator.apache.org
[3] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)
[4] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, Shivakumar Venkataraman: Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams. SIGMOD 2013: 577-588
[5] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja: Twitter Heron: Stream Processing at Scale. SIGMOD 2015: 239-250
[6] Kestrel: A Simple, Distributed Message Queue System. http://robey.github.com/kestrel
[7] S4 Distributed Stream Computing Platform. http://incubator.apache.org/s4/
[8] Spark Streaming. https://spark.apache.org/streaming/
[9] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy V. Ryaboy: Storm@Twitter. SIGMOD 2014: 147-156
[10] RabbitMQ: http://www.rabbitmq.com/
[11] Apache Cassandra: http://cassandra.apache.org/
DB Volume Emulator
Rekha Singhal
Tata Consultancy Services Research, Mumbai, India
Amol Khanapurkar
Tata Consultancy Services Research, Mumbai, India
ABSTRACT
Execution of analytic queries on large, growing data volumes has led to the challenge of assuring application performance over time. There is a need to understand a query's execution path at compile time, in the development environment, so that it can be tuned appropriately to assure performance on large data volumes.
Data processing engines such as Oracle and Apache Calcite use cost-based optimizers to decide a SQL query's execution plan at compile time. In this paper, we discuss how these cost-based optimizers can be used to understand SQL query execution paths without generating large data volumes in the development environment; a tool assisting this is referred to as a 'DB Volume Emulator'. We present a working DB volume emulator tool for RDBMSs and extend the idea to new big data processing platforms such as Spark, Apache Calcite and Impala. This tool can be exploited to understand and schedule the SQL queries in a workload for optimal execution.
Keywords
Big Data, SQL, Performance, Prediction, Data Volume
1. What is a DB Volume Emulator
Structured Query Language (SQL) has been widely accepted as a high-level language to formulate analytic queries on data. It is the most popular means of extracting data from underlying data storage. All relational databases, such as Oracle and Postgres, and most of the newly available data processing engines, such as Spark and Hive, support SQL queries for data access.
Data processing applications often witness violation of guaranteed performance over time due to the increasing execution time of SQL queries on large data volumes. In general, a developer can get the execution plan for a SQL query on a small data size in the development environment, but has no knowledge of the execution path the database follows on large data volumes. The naïve approach of generating large data volumes at compile time is very expensive and not viable.
Data processing engines, such as relational databases, use a cost-based optimizer to decide a SQL query's execution path. These cost-based optimizers decide the execution path using knowledge of sample data, in the form of statistics maintained by them. A DB volume emulator exploits this feature of the optimizer: it can emulate the behaviour of the data processing engine on a large data volume and output a SQL query execution plan without actually generating data in the development environment. This execution plan may be used to estimate the SQL execution time in isolation, using information from the plan such as the type of each operator, the input data size of each operator, etc. We have used one implementation of the DB volume emulator to estimate SQL execution time for large data sizes using a sample of data in the development environment, with an average prediction error of 10% [1, 2, 5]. This model may be extended to estimate average execution time for a concurrent workload on larger data sizes; however, the current implementation is limited to SQL query execution in isolation.
A DB volume emulator can be used for the following:
- Understanding SQL query execution steps for large data sizes at compile time; this saves resources and shortens the time to deployment.
- Tuning SQL queries in advance, at compile time, to assure performance over a period of time.
- Integrating with a 'SQL Execution Time Predictor' [1] to estimate SQL query execution time for larger data sizes in the development environment.
- Helping a performance testing team generate small data sets representing future volumes for performance testing.
- Estimating the minimum number of rows required in the testing environment to comply with the performance-testing time budget [3].
CODD [3] is an example of an open-source DB volume emulator. It has been tested on the TPC-H benchmark and emulates only a linear growth pattern. We have also built a DB volume emulator addressing the larger research problem of estimating SQL execution time on big data processing platforms such as Oracle, Hive and Spark, without generating large data sizes in the testing or development environment. We present our approach to building a DB volume emulator in this paper.
2. Architecture of DB Volume Emulator
Volume emulation depends on the configuration of the underlying data processing engine (such as Postgres or MySQL), the schema or metadata of each data object, and how the data is growing in terms of data object size and data value distribution. Growth of data is projected from business growth and calculated using a mapping of business entities to logical entities in the data processing engine. For example, if a retail business has 'X' customers and each customer is represented in the form of two relational tables, then growth to 2X customers may mean an increase to 4X rows in Table 1 and 10X rows in Table 2.
A database volume emulator takes as input the DB server configuration details, the schema of the data, and the projected growth details (sizes, data distribution in the form of histograms, and correlation across data objects) of all data objects, to create an environment of a content-less emulated database with the projected data sizes, as shown in Fig 1. We have built the DB emulator for relational databases, so the further discussion is in the context of RDBMS and relational data objects.
Figure 1: DB Volume Architecture
Database Server Details
1. The type of data processing engine: Postgres, Oracle, MySQL, Apache Spark, Apache Hive, etc.
2. Its API to collect and extrapolate engine-specific metadata statistics.
3. Database configuration parameters, such as working memory size and multi-block read count for an RDBMS. These are huge in number, so a user may set up an instance in the development environment and either export all these details to a file or give the DB volume emulator direct access to the instance.
Meta Data of Data
1. The information about the metadata of the data is stored in the form of statistics in the database.
2. For relational data objects, this means the metadata of the tables storing the data (average row length, number of rows, etc.), the metadata of each column in a table (type of column, distinct values, minimum value, maximum value, data distribution, etc.) and the metadata of each index on a table (type of index, column name(s), rows accessed). Some of these statistics are sensitive to data size and are extrapolated for volume emulation; the details can be seen in our earlier work [4].
3. These details can either be entered using a front end, as in CODD [3], which is quite complex and needs a DBA to understand the meaning of each metadata statistic, especially for indexes; or, alternatively, the tool can be given access to a small-data-size instance in the application development/testing environment and collect all metadata, which is then extrapolated appropriately for the projected data size. Please note that the exported metadata corresponds to the small-size database.
Growth Details
1. The projected number of items/rows in each data object/table in the database.
2. The projected table sizes in the database, with average row length.
3. Each entity/column value domain: for each column in each table, its distinct, minimum and maximum values, and how these values change with an increase in the number of rows in the table.
4. The data value distribution in a column, and how it changes with growth in data:
a. Captured using histograms – frequency or height-balanced.
b. Simple – maximum value constant, uniform growth, linear, or non-linear (in the form of a function such as square).
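Item 4(a) mentions frequency and height-balanced histograms. A height-balanced histogram stores bucket endpoints chosen so that each bucket covers roughly the same number of rows, which makes skew visible to the optimizer. A minimal sketch (illustrative; real CBO histograms also record repeat counts for popular values):

```python
def height_balanced_histogram(values, buckets):
    """Bucket endpoints such that each bucket covers ~len(values)/buckets rows,
    similar in spirit to the height-balanced histograms kept by a CBO."""
    data = sorted(values)
    n = len(data)
    # endpoint i is the value at the (i/buckets)-quantile of the data
    return [data[min(n - 1, (i * n) // buckets)] for i in range(1, buckets + 1)]

# Skewed column: many repeats of one small value, a few large ones.
col = [1] * 90 + list(range(100, 110))
eps = height_balanced_histogram(col, 5)
# Most endpoints land on the frequent value, so the skew is visible.
```

Extrapolating such a histogram for a projected data size is one of the "growth details" the emulator applies.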
Working of DB Volume Emulator
The DB volume emulator exploits a feature of the cost-based optimizer (CBO) used by most SQL query processing engines to choose an optimal execution plan for a SQL query. The CBO stores metadata for each of the objects created in the database in the form of statistics. These statistics are stored in the relational database itself, in the form of catalog tables accessible with DBA permission. These statistics may be gathered automatically to obtain metadata about the data loaded in the database before choosing an execution plan for a SQL query. Most relational databases, such as Oracle, also provide APIs/stored procedures to change the values of these statistics manually. We have exploited this feature of the CBO for volume emulation. We have identified, for all data objects created in a database, the statistics that are sensitive to properties of the data such as data size and data distribution; these have been presented in detail in [4]. For example, an increase in the number of rows of a table will increase the value of NUM_ROWS (a statistic maintained in the DB) for the table. Below we show the metadata statistics sensitive to data size and its properties, as maintained by Oracle.
Table-sensitive statistics: The table statistics of all the user tables in the database are stored in system tables such as user_tables. This table stores all the physical details of the tables; however, variation in query performance with increase in database size is most sensitive to the number of rows in the tables being accessed, the number of blocks allocated to the table, and the average row length.
Column-sensitive statistics: The column statistics of all tables are stored in a system table such as user_tab_columns. The most sensitive column statistics are the number of distinct values in a column, the number of nulls in a column, the density, the low value/high value, and the histogram.
Index-sensitive statistics: The physical storage details of the various indexes created on column(s) are stored in a system table such as user_indexes. The index-sensitive statistics are the number of leaf blocks allocated to an index, the number of levels in the B-tree, the clustering factor, the number of rows on which the index is created, the distinct values, the average number of data blocks per key, and the average number of leaf blocks per key.
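The table-level case can be sketched as scaling the sensitive values and rendering the Oracle `DBMS_STATS.SET_TABLE_STATS` call that writes them back. The linear scaling rule and the sample numbers are illustrative; the real tool supports non-linear rules as well:

```python
def scale_table_stats(stats, growth):
    """Extrapolate size-sensitive table statistics by a growth factor.
    (Sketch: linear rule only; row width is assumed unchanged by growth.)"""
    return {
        "num_rows": int(stats["num_rows"] * growth),
        "blocks":   int(stats["blocks"] * growth),
        "avg_row_len": stats["avg_row_len"],        # per-row width unchanged
    }

def set_stats_call(owner, table, s):
    """Render an Oracle DBMS_STATS.SET_TABLE_STATS call for the scaled values."""
    return (f"EXEC DBMS_STATS.SET_TABLE_STATS(ownname => '{owner}', "
            f"tabname => '{table}', numrows => {s['num_rows']}, "
            f"numblks => {s['blocks']}, avgrlen => {s['avg_row_len']});")

dev = {"num_rows": 150_000, "blocks": 2_048, "avg_row_len": 120}   # sample values
scaled = scale_table_stats(dev, growth=10)
print(set_stats_call("TPCH", "LINEITEM", scaled))
```

Analogous `SET_COLUMN_STATS` and `SET_INDEX_STATS` procedures exist for the column- and index-sensitive statistics listed above.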
A replica of the database instance in the development environment is created, which has no data, only metadata and database configuration details. We also take as input the projected growth pattern for each business object, which is mapped to each data object in the database. Each statistic is either retained as-is, if it is not sensitive to the given growth details, or extrapolated per the instructions given in the growth details. A column value may extrapolate linearly, remain constant, or grow quadratically with the overall growth factor. The extrapolated values are put back into the database statistics using database-specific APIs/stored procedures. Now the CBO perceives the empty database to have the projected data sizes and distributions given in the growth details.
3. DB Volume Emulator Implementation
We have built the DB volume emulator in Java for relational databases. The emulator can run from a Windows or Linux machine and accesses the development and emulated environments over the network. The user provides credentials for both environments as input to the tool. Table growth details are given, in terms of the expected number of rows in each table, as a CSV file. Growth details are provided for each column of every table as linear, constant or non-linear: 'constant' means the column's maximum value remains the same in the emulated environment; 'linear' means the column's maximum value increases linearly with the table-size growth factor; 'non-linear' means the column's maximum value is a function of the table-size growth factor, which is given by the user.
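The three per-column rules can be sketched as follows (a simplification of the tool's behaviour; the example values and the square function are illustrative):

```python
# Sketch of the per-column extrapolation rules described above; the
# non-linear rule takes a user-supplied function of the growth factor.
RULES = {
    "constant":   lambda high, g, f=None: high,
    "linear":     lambda high, g, f=None: high * g,
    "non-linear": lambda high, g, f=None: f(high, g),
}

def extrapolate_high_value(high, growth, rule, fn=None):
    """New column maximum under the given growth rule."""
    return RULES[rule](high, growth, fn)

assert extrapolate_high_value(500, 10, "constant") == 500      # e.g. a status code
assert extrapolate_high_value(500, 10, "linear") == 5000       # e.g. a sequence key
assert extrapolate_high_value(500, 10, "non-linear",
                              fn=lambda h, g: h * g * g) == 50000   # square of factor
```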
The emulated environment is created with credentials given by the user. All the DB server and data metadata statistics are exported from the development environment and imported into the content-less emulated environment.
The sensitive statistics of tables, columns and indexes are extrapolated per the user-given growth details and put back into the emulated environment using the database's stored procedures for setting statistics manually. The tool informs the user on finishing the extrapolation with the message "Emulated Environment Ready".
We tested the tool on the TPC-H benchmark on Oracle 10g. We generated a 1 GB database in the development environment and emulated it to a larger size, such as 10 GB, using the tool. We also generated a 10 GB database and created a real 10 GB environment for validation purposes. We checked the correctness of the tool by matching, for the 22 TPC-H SQL queries, the explain plan, expected rows and cost in both the emulated and the real environments [4], as shown in Fig 2.
Figure 2: SQL Query Explain plan in Oracle 10g
4. Volume Emulation in Big Data Platforms
Large-scale data processing has led to the advent of many new big data platforms for parallel processing of data. Most of them have extended support for optimizer-based SQL query execution. These big data platforms work on a cluster of nodes, where it is more challenging to understand how a job will be executed; this knowledge may help users tune either their job or the platform for optimal execution. Whether it is an RDBMS or a big data product, an optimizer will always have a plan, which will always have a logical and a physical manifestation, plus a hook or API to manipulate the knowledge about the data (metadata) or statistics. The application of volume emulation to big data platforms is a thought experiment.
Most big data platforms with SQL query support do not have explicit tools/methodology for tuning SQL. The majority of the available platforms' optimizers focus only on reducing the number of rows to be processed by an operator in the SQL query DAG, so they end up pushing filter operations down to leaf nodes to propagate fewer rows to parent operators in the DAG. They have not exploited metadata to further optimize the order of operations in the SQL query DAG. Since we have witnessed success with our past work in the context of relational databases, we wish to extend the idea of the volume emulation framework to big data processing engines as well. A volume emulator for these platforms would facilitate SQL query tuning for the specific platform.
We shall discuss the case of DB volume emulators for three new big data platforms: Apache Hive (using the Calcite optimizer), Apache Spark (using the Catalyst optimizer), and Apache Impala for distributed query processing. Figure 3 shows a stack for SQL query processing on big data platforms. Most parallel data processing platforms support the Hadoop distributed file system (HDFS) for managing data stored across all nodes in the cluster. The data processing layer can have its own programming language and API to allow users to write their jobs; e.g., Hadoop uses the Map-Reduce paradigm and expects a user to write map and reduce functions for executing a job on Hadoop. Similarly, Spark supports the Scala language, with its own dictionary of operations, as well as Python and Java programs for data processing. Most of these data processing frameworks have extended their support to SQL query processing with an engine that can parse, translate and optimize a SQL query for the respective platform.
Figure 3: Big Data Processing Stack on cluster of Nodes
Apache Hive
Hive [7, 8] has been developed by the Apache Software Foundation to query data using the SQL language on top of the Hadoop MapReduce framework. It provides functionality similar to a relational database management system (RDBMS), which eases the migration of legacy analytic workloads. The Hadoop MapReduce framework is one of the earliest and most widely used parallel data processing engines in the big data community. It works on a cluster of nodes; data is stored across all the nodes and managed by the Hadoop distributed file system (HDFS). Hive is a data warehousing application on top of Hadoop MapReduce which allows the user to query, insert into, and partition the data stored in it. Using Hive, it is possible to generate tables and indexes (in the form of partitions) on data stored in HDFS. This data can then be accessed using a Hive-specific query language called HiveQL.
ID  Operation                     Name         Rows   Bytes  Cost
0   Select                                     102M   27G    105M
1   Nested Loop
2   Nested Loop                                102M   27G    105M
3   Table Access Full             Supplier     1280K  175M   7314
4   Index Range Scan              Partsupp_sk  80            2
5   Table Access by Index ROWID   Partsupp     80     11520  83
HiveQL is similar to SQL and is therefore easy to learn for people with a relational database background.
The Hive server, or HiveQL processing engine, maintains information about the metadata of user data in the form of statistics stored in relational catalog tables, accessed through the relational database connector specified in the Hive configuration file; the most commonly used are MySQL and Derby. The HiveQL processing engine parses a HiveQL query and translates it into an optimal execution plan as a directed acyclic graph (DAG) of stages, as shown in Fig 4. Each stage may comprise rename or conditional operators, and map-only or map-reduce jobs. A map-reduce or map job is executed, using Hadoop's parallel data processing capability, on data stored on HDFS. For a map-reduce job the data is a flat file; however, for HiveQL it is a table structure defined in Hive. A join may be executed as a map join, where the small table is replicated across all nodes, or as a reduce join, where the join is executed as a reduce job. Hive uses the Apache Calcite optimizer for building the HiveQL execution DAG using database statistics. These statistics are similar to the ones provided by popular RDBMSs such as Oracle and Postgres. If Apache provides an API to change these statistics manually (as in an RDBMS), one can explore how a HiveQL query would execute for larger data sizes with different data distributions, and build a volume emulator. The volume emulator can help in taking proactive actions, such as cluster sizing, for assuring performance. One of our works [13] uses the DAG to estimate HiveQL execution time in the development environment; [9] has used Hive optimizer statistics to further improve HiveQL execution plans.
Figure 4: Hive DAG for a HiveQL
Apache Spark
Hadoop map-reduce forces a linear way of programming for parallel processing of jobs. Apache Spark is a more flexible parallel data processing engine which keeps data in memory, in the form of Resilient Distributed Datasets (RDDs), for iterative processing. Apache Spark SQL combines both relational and procedural systems, which are required for a complete data processing pipeline, from ETL to querying of the data.
SparkSQL [10] is similar in syntax to relational database SQL or HiveQL. Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections (RDDs). Spark SQL uses an extensible optimizer called Catalyst. Catalyst [11] makes it easy to add data sources, optimization rules, and data types for domains such as machine learning.
A SparkSQL query is converted into a DAG of Scala functions supported by the Spark data processing engine, as shown in Fig 5. The Spark Catalyst optimizer currently considers the functional components of SparkSQL and optimizes its DAG by pushing the filter and grouping operators early in the execution DAG. The current versions, 1.6 and 2.0, do not collect or make use of data statistics for further optimization of the SparkSQL DAG. However, according to Ion Stoica (co-founder of Databricks)1, use of statistics, as in an RDBMS, is part of their plan to further optimize the SparkSQL DAG.
Figure 5: Spark DAG for a SparkSQL doing Aggregation
Apache Impala
Apache Impala [6] is an open source distributed massively
parallel SQL query processing engine which brings scalable
parallel database technology to Hadoop, enabling users to issue
low-latency SQL queries to data stored in HDFS and Apache
HBase without requiring data movement or transformation.
Impala is integrated with Hadoop to use the same file and data
formats, metadata, security and resource management frameworks
used by MapReduce, Apache Hive, Apache Pig and other Hadoop
software.
Impala supports a subset of HiveQL. Impala maintains
information about table definitions in a central database known as
the metastore. Impala also tracks other metadata for the low-level
characteristics of data files. Impala runs a catalog demon at each
1 I got chance to interact with him in VLDB2016 held at Delhi.
69
node of the cluster, which stores meta data information about the
data and node. Impala uses this meta data to decide DAG for
distributed execution of a SQL as shown in Fig 6.
Figure 6: Execution plan of an Impala SQL query on HBase
5. Conclusions
In this big data age, most analytic SQL queries process large data in the production environment. Given the complexity of a SQL query, it is imperative for the application manager to understand the SQL execution steps on larger data sizes at compile time, in the development environment. In this paper, we have discussed the design of a DB volume emulator which can emulate the behaviour of a data processing engine for large data sizes. The DB volume emulator can output a SQL query execution plan for a large data size at compile time without generating and loading large data. It exploits features of cost-based optimizers to manually change the metadata statistics to reflect growing data sizes and distributions instead of physically generating and loading the data. We have discussed an implementation which was tested on Oracle 10g with the TPC-H benchmark.
Since we have seen merits in having a volume emulator on statistics-based optimizers, and many big data SQL products have (or should have) a statistics-based optimizer on their product roadmap, building hooks in the product for volume emulation will be of immense value. In fact, having an in-built volume emulator will also serve as a training method to teach developers how to write optimal queries for the big data platform. Finally, we believe a lot of research methods on volume emulation for RDBMSs can be reused for conducting similar experiments on big data product platforms. These experiments could provide further guidance on better ways to add volume emulation to these platforms.
REFERENCES
[1] Rekha Singhal and Manoj Nambiar, Predict Performance of SQL on Large Data Volume, IDEAS, Montreal, Canada, July 2016.
[2] Rekha Singhal and Manoj Nambiar, Measurement based model to study the effect of increase in data size on query response time, Performance and Capacity, CMG 2013, La Jolla, California, November 2013.
[3] Rakshit S. Trivedi, I. Nilavalagan and Jayant R. Haritsa, CODD: Constructing Dataless Databases, in Proceedings of DBTest'12, May 21, 2012, Scottsdale, AZ, USA.
[4] Rekha Singhal and Manoj Nambiar, Extrapolation of SQL Query Elapsed Response Time at Application Development Stage, Indicon 2012, Kerala, India.
[5] Chetan Phalak and Rekha Singhal, Database Buffer Cache Simulator to Study and Predict Cache Behavior for Query Execution, Proceedings of DATA 2016, Portugal.
[6] http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_perf_stats.html
[7] https://cwiki.apache.org/confluence/display/Hive/StatsDev
[8] http://www.cloudera.com/documentation/archive/manager/4-x/4-8-5/Cloudera-Manager-Installation-Guide/cmig_hive_table_stats.html
[9] Anja Gruenheid, Edward Omiecinski and Leo Mark, Query Optimization Using Column Statistics in Hive, IDEAS 2011, September 21-23, Lisbon, Portugal.
[10] https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
[11] https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/sql/execution/ExplainCommand.html
[12] http://trafodion.apache.org/architecture-overview.html, Apache Trafodion
[13] A. Sangroya and R. Singhal, "Performance Assurance Model for HiveQL on Large Data Volume," in Proceedings of the International Workshop on Foundations of Big Data Computing, in conjunction with the 22nd IEEE International Conference on High Performance Computing, HiPC '15, December 2015.
Leveraging Human Capacity Planning for Efficient Infrastructure Provisioning
Shashank Pawnarkar, Tata Consultancy Services
Today, several infrastructure capacity planning techniques are available to make effective use of IT infrastructure and to enhance infrastructure performance. Infrastructure capacity planning is based not only on IT infrastructure technology but also on the business environment and business inputs. An analytical model based on historical business data, past trends, and parameters such as peak business times, user skillsets and transaction volumes is of paramount importance in deciding infrastructure capacity. The Human Capacity Planning technique involves calculating the optimal number of skilled resources required to complete a business operation at a particular time. This paper discusses how business inputs based on Human Capacity Planning help to improve and optimize efficient infrastructure provisioning.
Introduction
In typical infrastructure capacity planning, the inputs are based on sizing, peak load, server utilization and data/transaction volumes, whereas business-driven Human Capacity Planning gives the exact demand pattern, stating the optimal number of resources required and when (time and date), a forecast of volume (number of transactions), and the time taken to complete a transaction. The output of Human Capacity Planning aids in determining the exact count of resources (number of users) required, the volume of transactions, and the demand patterns, which further help to calculate the exact requirement on server infrastructure. Human Capacity Planning provides exact patterns of peak usage during the year, month, day or time of day; thus, its output complements the traditional method of infrastructure capacity planning.
Offshore BPS (Business Process Services, often called BPO – Business Process Outsourcing) typically provides back-office business services to global customers spread across multiple countries and geographies, wherein offshore agents with varying skillsets work in different shifts and deliver the work within stringent SLAs (Service Level Agreements). Even though there are fluctuations in workload demand and volumes, agents have to deliver the work with very high accuracy and within the shortest possible time. Any miscalculation in the demand forecast often leads to a shortage of human resources when work volume is high, or an excess of resources when work volume is low. These kinds of misjudgments in Human Capacity Planning result in heavy loss of productivity and increased operational cost, while at the same time core IT servers are under- or over-utilized. In case of peak demand, server response degrades, resulting in further delayed server response, loss of productivity for agents, and loss of business credibility for the organization.
The average transaction handling time (AHT) is a key success parameter for the agent and, in turn, for BPS operations. While core IT applications and infrastructure play a crucial role in determining response time and AHT, customers are not always formal and clear about how to assert that their system performance is adequate for the BPS to work well. The BPS works under the constraint that the customer owns and controls the IT applications and infrastructure, which are difficult to change according to the needs of the BPS workload; hence the best strategy is to provide minute-level Infrastructure Capacity Planning inputs to the business.
Figure 1: Infrastructure Capacity Planning driven by Human Capacity Planning
We have a leading-edge Human Capacity Planning solution which analyzes structured and semi-structured data, enabling consistent business insights and accurate prediction of human workforce capacity, thus driving productivity savings of millions of dollars in operating costs. The solution leverages analytics, forecasting and optimization techniques to predict the required human resources based on demand and volume. The forecasting is typically based on regression techniques, whereas the optimization is based on linear programming.
Human Capacity Planning Solution In the BPS world, operating discipline is essential to remain competitive, and to gain a leading edge you need a comprehensive Human Capacity Planning solution that caters to typical BPS processing needs, be it human resource planning, allocating work to the available capacity, or monitoring work execution to ensure that SLAs are met in an optimal manner.
In the Human Capacity Planning solution, forecasting and optimization leverage advanced analytics at a granular level (down to 15-minute time periods), taking into account operational and environmental dependencies such as shift timings, skill-wise AHT (Average Handle Time), volume forecast and SLAs (Service Level Agreements).
Challenges in Human Capacity Planning:
1) BPS operations are often challenged to handle high work volumes with the available staff. Staff may be over-utilized during periods of high work volume, or under-utilized during periods of low work volume. Managing this manually is a cumbersome and ineffective process which leads to over-allocating resources when demand is low and under-allocating them when demand is high. Floor managers often rely on spreadsheets to manage Human Capacity Planning; in such a scenario, getting a comprehensive understanding of all the parameters which impact capacity planning is a mammoth task.
2) In the BPS environment, the customer expects the service provider to meet the operational SLAs under varying degrees of workload, at lower cost and with higher accuracy. Without a technology solution, BPS floor managers struggle with under/over-staffing, mismatched skill sets and the inability to predict workload volume, all of which translates into higher cost of operations, missed operational SLAs and customer dissatisfaction. Missed SLAs can also incur financial penalties payable to the customer.
Challenges in Infrastructure Capacity Planning
1) One of the key challenges is to determine the right amount of resources required for the execution of work, in order to minimize financial cost and maximize resource utilization. Infrastructure resources must be utilized properly to reduce cost while maintaining Quality of Service parameters such as throughput, response time and performance, and while adhering to Service Level Agreements (SLAs).
2) BPS agents work on core IT applications and server infrastructure provided and supported by the customer. This means that in cases of peak workload, BPS agents are restricted by the available server infrastructure, which often delays completion of the agents' tasks, resulting in operational inefficiency and missed target SLAs.
Alternative Solutions Evaluated Agent Management System (AMS) and Work Force Management (WFM) tools (internal) were evaluated before building a new solution. Our findings are summarized in the sections below.
1. Agent Management System
The Agent Management System (AMS) has been designed to generate workload/schedules using volume-based SLAs: for example, if 100 transactions are received between 3:00 PM and 3:30 PM, 80% must be finished before 3:30 PM. The key requirement for the Capacity Planning solution, however, is to generate workload/schedules using time-based SLAs. Since time-based SLAs and volume-based SLAs are two different paradigms, AMS does not fit the requirements of the Capacity Planning solution.
2. Work Force Management Tool
The Work Force Management (WFM) tool is a queue management tool. However, in WFM all the key characteristics required for workload generation are coded into the tool and are not user-configurable. Moreover, the tool does not cater to our requirements for country-specific SLAs. Since the key requirement of the Capacity Planning solution was an end-user-configurable solution taking into account process characteristics, BPS people profiles, SLA requirements and other environmental attributes to enable semi-automated queue management, WFM was not deemed fit.
3. Other Commercial Solutions
Other commercially available solutions were also evaluated. None of them had the ability to deal with varying/dynamic cut-offs.
Human Capacity Planning - Solution Description The Capacity Planning solution uses a scientific approach to arrive at the most optimal resource plan to meet the required SLAs. It uses industry-standard mathematical algorithms to analyze, forecast, schedule and monitor resource requirements, using techniques such as regression analysis and linear programming.
Figure 2: High level schematic of Human Capacity Planning Solution
The typical inputs to the Capacity Planning algorithm are: - AHT (Average Handling Time) per product and sub-product
- Skill set
- Work shifts (given that typically BPS business runs 24x7)
- Operational SLAs (which can be both fixed and variable)
- Resource cost
- Target utilization level
- On floor constraints (like shift timings/duration, allowable overtime).
These input parameters are worked upon along with: - Historical workload data (the solution typically needs 3 years of historical data)
- Customer prediction of upcoming volume (could be for the next month or quarter)
Mathematical Models: The heart of the Capacity Planning solution is the mathematical model. The solution uses industry-standard mathematical models to predict future volume from historical volume. It supports five different forecasting models, viz.
1. Time Series Decomposition
2. Single Exponential
3. Double Exponential
4. Triple Exponential (Winters Model)
5. Auto ARIMA (Auto-Regressive Integrated Moving Average Model)
Figure 3: Volume Forecast based on Mathematical Models
Volume forecasts are generated using all five models, and the best-fit model is recommended by the solution on the basis of MAPE (Mean Absolute Percentage Error) and AIC (Akaike Information Criterion)/BIC (Bayesian Information Criterion) parameters. The Auto ARIMA model offers the advantage that data elements are regressed and averaged to fit an approximation to any time series, giving high prediction accuracy. The volume forecast is generated at a granularity of 15-minute intervals throughout the day and can be downloaded as a spreadsheet. In addition to the volume forecast, the Human Capacity Planning solution also generates a shift roster which gives the number of human resources needed to finish the incoming transactions while meeting SLAs.
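The model-selection step described above can be sketched in a few lines. The following is a hypothetical illustration, not the solution's actual code: it fits two of the five models (single exponential smoothing and Holt's double exponential smoothing, with assumed smoothing constants) to a toy 15-minute volume series and recommends the one with the lower MAPE.

```python
def single_exp(series, alpha=0.5):
    """One-step-ahead forecasts via simple exponential smoothing."""
    level = series[0]
    forecasts = []
    for y in series:
        forecasts.append(level)          # forecast made before observing y
        level = alpha * y + (1 - alpha) * level
    return forecasts

def double_exp(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts via Holt's linear (double) smoothing."""
    level, trend = series[0], series[1] - series[0]
    forecasts = []
    for y in series:
        forecasts.append(level + trend)
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# Toy historical volume per 15-minute interval (steadily rising demand).
volume = [100, 110, 122, 135, 149, 164, 180, 198]

errors = {"single_exponential": mape(volume, single_exp(volume)),
          "double_exponential": mape(volume, double_exp(volume))}
best = min(errors, key=errors.get)
print(best)   # the trending series favours the double (Holt) model
```

On a trending series the single-exponential forecast lags behind, so the MAPE criterion picks the trend-aware model, which is exactly the role the best-fit recommendation plays in the solution.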
The mathematical model executes in two stages: - Forecasting
- Optimization
Forecasting pertains to arriving at the resource count needed to manage the projected workload. The number of resources needed for the required time period can have granularity down to minutes, as per the AHT required to process the incoming workload. That is, the forecasting algorithm can, say, recommend the number of resources needed at 15-minute intervals if the AHT/SLA for a specific task demands such fine-grained projection.
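The workload-to-headcount conversion above can be illustrated with a minimal sketch. The AHT, interval length and target-utilization figures below are assumed sample values, not parameters from the actual solution:

```python
import math

# Convert a per-interval volume forecast into a per-interval agent count:
# workload in agent-seconds is volume x AHT, spread over a 900-second
# (15-minute) bucket at a target occupancy below 100% to leave SLA headroom.

INTERVAL_SECONDS = 900        # one 15-minute planning bucket
AHT_SECONDS = 180             # assumed average handling time per transaction
TARGET_UTILIZATION = 0.85     # assumed target occupancy per agent

def agents_needed(volume):
    """Minimum whole agents to clear `volume` transactions in one interval."""
    workload_seconds = volume * AHT_SECONDS
    capacity_per_agent = INTERVAL_SECONDS * TARGET_UTILIZATION
    return math.ceil(workload_seconds / capacity_per_agent)

forecast = [40, 95, 120, 60]   # toy forecast volumes for four intervals
roster = [agents_needed(v) for v in forecast]
print(roster)                  # [10, 23, 29, 15]
```

The ceiling makes the count conservative: a fractional agent requirement always rounds up to a whole person, which is what keeps the SLA safe at the 15-minute granularity.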
The Optimization stage, following Forecasting, then recommends how the resources should work in shifts, taking into account all BPS shift-related nuances (such as shift index, shift start time, shift end time and breaks) in addition to nuances related to workload processing (such as skill-specific AHT and product/sub-product-wise SLAs). The solution can generate the shift schedule on its own or use a pre-defined shift schedule. Forecasting/Optimization is supported by a built-in "What-If Analysis" component, which enables analyzing the impact of changed input parameters on the generated roster.
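The shift-assignment problem of the Optimization stage can be posed as a covering problem: pick a headcount for each shift so that every interval's demand is met at minimum cost. The paper states this is solved with linear programming; the sketch below uses exhaustive search instead, which is enough to illustrate the formulation on a tiny invented instance (the demand vector and shift patterns are assumptions):

```python
from itertools import product

demand = [10, 23, 29, 15]          # agents required per interval (e.g. from forecasting)
# Which intervals each candidate shift covers (toy shift patterns).
shifts = {"early": [1, 1, 0, 0], "mid": [0, 1, 1, 0], "late": [0, 0, 1, 1]}
names = list(shifts)

best_cost, best_plan = None, None
for counts in product(range(0, 31), repeat=len(names)):
    covered = [sum(c * shifts[n][i] for c, n in zip(counts, names))
               for i in range(len(demand))]
    if all(cov >= need for cov, need in zip(covered, demand)):
        cost = sum(counts)         # unit cost per agent per shift
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(names, counts))

print(best_cost, best_plan)
```

A real instance, with breaks, overtime caps and skill constraints, has far too many variables for brute force, which is why a linear-programming solver is the right tool; the objective and constraints, however, are exactly the ones enumerated in this toy version.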
Case Study – Human Capacity Planning for a Global Logistics Major
A global logistics major has outsourced data entry operations for 13 countries. The business has a large seasonality impact, due to which staffing requirements keep varying, and it is imperative for the offshore team to plan and roster accordingly so that they do not carry more or less staff than required. This criticality is compounded by a metric named cut-off; there are 165 cut-offs across the day that need to be adhered to. Hence it was of paramount importance to identify a solution to this complex problem which would help the client meet its objectives.
Inputs to the Human Capacity Solution:
Figure 4: Input Data points to Human Capacity solution
Outcome of the Human Capacity Solution - The Capacity Planning solution provides predicted volume at a granularity of 15 minutes for every type of transaction. The volume forecast can be used to identify monthly/weekly trends in the incoming volume and accordingly plan the infrastructure provisioning needed for each type of transaction.
Figure 5: Output of Human Capacity Solution which provides minute level granular volume details and resource (user) count
Infrastructure Capacity Planning – With 200 or more concurrent users doing data entry operations, server response time often degrades. Human Capacity Planning helps to find the volume forecast and shift roster (number of users) at a granularity of 15 minutes for a given transaction type. Looking at transaction types, we can also find the volume demand for various application-related resources. Combined, these are the inputs for efficient infrastructure planning, such that IT applications can scale up or down depending on the volume and number of users.
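One hypothetical way to close the loop from the Human Capacity Planning output to infrastructure provisioning: take the rostered user count per 15-minute interval and size the application-server pool so each server carries at most a fixed concurrency. The per-server capacity below is an assumed figure, not one from the paper:

```python
import math

USERS_PER_SERVER = 50          # assumed safe concurrency per application server

def servers_needed(concurrent_users):
    """Scale the server pool to the rostered user count, never below one."""
    return max(1, math.ceil(concurrent_users / USERS_PER_SERVER))

rostered_users = [80, 210, 160, 40]     # toy per-interval user counts from the roster
plan = [servers_needed(u) for u in rostered_users]
print(plan)                             # [2, 5, 4, 1]
```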
Benefits of the Human Capacity Planning Solution In the BPS industry, the customer continuously expects the service provider to optimize operational efficiency, which essentially translates into "do the same with less" and then "do more with less".
Qualitative Benefits 1. Effective Resource Management - The solution gives the insight that optimal human resource management requires understanding the nuances of floor operations; it projects the number of resources required to fulfil the goal, thereby providing scalability to meet high-volume workloads.
For example, while managing Human Capacity Planning it is important to understand constraints such as resource caps, acceptable overtime, number of shifts and resource shrinkage. For work allocation, understanding dependencies on cross-skilled resources and the collaboration needed between resources is important. For queue monitoring, it matters how maker/checker headcount can be re-purposed to manage time-stringent SLAs.
2. Dynamic Change Management - The solution is focused on the need to have specific resources at specific times to meet the SLAs, and thus to avoid any delays and inaccuracy in operations. Key decision-making data points are made readily available to BPS floor managers at pre-defined intervals, enabling a right-time view, which helps in managing SLAs under varied workload scenarios. The solution addresses business problems such as systematic tracking of resources, nurturing their growth and improving their overall value to the organization.
3. Risk Management – While carrying out BPS operations, when SLAs are not met, the BPS service provider carries the risk of paying penalties for missing the SLAs. More often than not this risk is linked to financial losses borne by the customer, and can thus be a huge amount. At the same time, if an SLA miss translates into loss of the BPS customer's reputation in the marketplace, the BPS service provider also risks losing the business.
Quantitative Benefits
• Operational Efficiency - the solution helped achieve 99.7% SLA compliance, leading to improved customer satisfaction
• Typically a 40 percent minimum improvement in Human Capacity Planning accuracy
• Typically 70 percent time savings
• Employee satisfaction score improved from 5 to 9 (on a 10-point scale)
• Cost Savings – the solution has helped save USD 150 million
Applicability and Future Roadmap: This solution is applicable wherever schedule and workload generation is required to manage high-volume BPS operations with optimal utilization of infrastructure capacity. The solution adds value in cases where application response time is extremely important but there is little or no control over the server infrastructure to meet the varying workload.
Conclusion
To effectively estimate infrastructure provisioning, inputs from the business environment are critical, and that is where Human Capacity Planning data points help to predict volume demands and the number of users. Business inputs such as past history, skill sets and peak times are leveraged using advanced analytics and mathematical models. The solution not only makes efficient and optimal use of human resources but also allows hardware resources to be gainfully managed at minute-level granularity. The solution demonstrates that hardware or server capacity and performance are enhanced with the right level of business inputs.
References
[1] Brockwell, Peter J., and Richard A. Davis, eds. Introduction to Time Series and Forecasting. Vol. 1. Taylor & Francis, 2002.
[2] Brusco, Michael J., and Larry W. Jacobs. "Optimal models for meal-break and start-time flexibility in continuous tour scheduling." Management Science 46.12 (2000): 1630-1641.
[3] Caggiano, K. E., Jackson, P. L., Muckstadt, J. A., & Rappold, J. A. "Efficient computation of time-based customer service levels in a multi-item, multi-echelon supply chain: A practical approach for inventory optimization." European Journal of Operational Research 199.3 (2009): 744-749.
[4] Timothy Wood, Ludmila Cherkasova, Kivanc Ozonat, Prashant Shenoy. "Predicting Application Resource Requirements in Virtual Environments." HP Laboratories Technical Report, 2008.
[5] Waheed Iqbal, Matthew N. Dailey, David Carrera. "Black-Box Approach to Capacity Identification for Multi-Tier Applications Hosted on Virtualized Platforms." Cloud and Service Computing (CSC), IEEE International Conference, 2011.
STATICPROF: TOOL TO STATICALLY MEASURE PERFORMANCE
METRICS OF C, C++, FORTRAN PROGRAMS
Barnali Basak
Tata Consultancy Services, India
Abstract
In this paper we present a compiler-driven profiler, named StaticProf, which statically estimates both arithmetic and memory operations of C, C++ and Fortran programs. We have developed the tool as an intermediate gimple pass in the GNU Compiler Collection (GCC). We present the results achieved for multiple kernels from the benchmarks Polybench, Lapack and CppLapack, and compare our findings with the numbers profiled by the Intel-provided Vtune profiler. Additionally, we give the statically profiled performance numbers for the GaussSeidel smoother function from the fluid dynamics application OpenFOAM.
Keywords: Performance Bounds, FLOPs, Bandwidth, Static Analysis, Performance Measuring Tool.
1 INTRODUCTION
In the current arena of High Performance Computing (HPC), sequential applications are continuously explored to harness hidden performance. Floating point operations per second (Flops) and bandwidth measurements are the two most important metrics to guide the process of optimizing a program. Multiple tools have been developed to date that dynamically profile a program by counting and sampling various execution events; Gprof, Vtune, PerfSuite and HPC Toolkit, for instance, are a few well-known commercially available tools.
1.1 Motivation
To achieve the optimal performance of an application, the program needs to execute on a suitable architecture. Applications that are highly computation bound are less likely to perform well on architectures with poor computing ability. Similarly, memory-bound applications require architectures with high memory transfer capacity, i.e., bandwidth. Finding the suitable architecture for executing an application may not always be straightforward and may require additional analysis. Here we propose a one-step-closer solution that statically computes the flops and bandwidth requirements of a program. Given such metrics, one may choose a suitable execution platform having sufficient computing and memory transfer abilities.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty-free right to publish this paper in the CMG India Annual Conference Proceedings.
void Laplace3D (int NX, int NY, int NZ, double *u1, double *u2){
  for (int k=0; k<NZ; k++) {
    for (int j=0; j<NY; j++) {
      for (int i=0; i<NX; i++) {
        int ind = i + j*NX + k*NX*NY;
        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
          u2[ind] = u1[ind];
        } else {
S:        u2[ind] = ( u1[ind-1] + u1[ind+1] + u1[ind-NX] + u1[ind+NX]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }}}}}
Figure 1: Motivating example: code snippet of Laplace computation
1.2 Contribution
In this paper we illustrate a tool named StaticProf, implemented as a pass in the GCC compiler. It measures both computational and memory operations of C, C++ and Fortran programs by statically analyzing the source code. To profile flops, it checks for arithmetic operators that involve operands of floating point data type. Similarly, to estimate memory operations (memops), i.e., data loads and stores, our analyzer keeps separate counts for integer and floating point data types.
The performance metrics thus achieved can be used to choose the suitable architecture for execution of the program. Here we present an example which shows the estimation of flops and memops by analyzing the source code.
Example 1. Consider the Laplace3D function depicted in Figure 1. The function, in essence, computes the value at each grid point of a three-dimensional grid by averaging the values of all neighbours. A triply nested loop, with loop induction variables i, j and k, traverses the 3D grid space. Loops i, j and k have symbolic upper bounds NX, NY and NZ respectively, whose values are unknown at compile time. Array variables u1 and u2 store the grid values and are of double data type. Also, let T be the total execution time of the function. The flops and bandwidth requirements of the function can be analyzed as follows.
Flops computation:
- flop per loop iteration: 5 additions, 1 multiplication
- flop for the whole loop: 6 × NX × NY × NZ
- flop per second: 6 × NX × NY × NZ / T
Bandwidth computation:
- memop per loop iteration: 6 reads, 1 write
- memory transfer (in bytes) by the whole loop: 7 × NX × NY × NZ × 8
- bandwidth (bytes/sec): 56 × NX × NY × NZ / T
2 Related Work
Multiple methods have been developed to date which predict the performance of programs by monitoring and sampling the dynamic events generated during execution. Tools like Gprof [GKM82], Vtune [vtu], HPC Toolkit [hpc], PerfSuite [per], PAPI [pap] etc. fall into this category.
In contrast, performance prediction of a program by static analysis of source code is another approach. Gropp et al., in their paper [WDGKD99], presented the analytic computation of memory operations and arithmetic operations by statically analyzing general forms of computing statements. This motivates the automation of the process of performance computation.
As our tool follows the static analysis approach, we will not explain the aforementioned dynamic tools. Instead,
D.1218 = a[1];
D.1219 = b[1];
D.1220 = D.1218 + D.1219;
D.1221 = b[1];
D.1222 = D.1220 * D.1221;
a[1] = D.1222;
(a) Gimple representation (b) AST view [tree with the a[1] and b[1] references at the leaves and the +, * and = nodes above them]
Figure 2: Gimple IR of a source instruction
we will only present the papers most closely related to our work.
Narayanan et al., in paper [NNH10], describe estimating upper bounds of C/C++ programs by static compile-time analysis. They present a tool named PBound, developed in the ROSE compiler. It generates parameterized expressions for memory transfers and arithmetic operations. Such expressions, along with architectural information, are used to finally compute the upper bounds of the program's performance.
Cascaval et al., in paper [CDPR99], explain a performance prediction model that does not rely solely on static analysis of source code, but takes dynamic results into account too. A compiler-driven static analyzer has been developed as part of the Polaris compiler. It generates the performance expressions and automatically instruments the code to obtain values unknown at compile time. Thus, the analytical result achieved at compile time, together with the experimental result received at run time, computes the final performance bound.
3 Implementation Details
StaticProf is implemented as an intermediate gimple pass in GCC. GCC parses the source code, written in C/C++/Fortran etc., and reduces it to the gimple intermediate representation (IR). The gimple IR is a three-address representation of expressions where the operands are first loaded into temporary variables before any computation takes place. It also uses temporary variables to hold intermediate values of any complex computation.
The resultant gimple can be viewed as a directed acyclic graph, generally known as an abstract syntax tree (AST). An AST has all memory-referencing variables at its leaf nodes and mathematical operations at its intermediate nodes. The assignment node, in general, sits at the root.
Example 2. Figure 2 shows both the gimple representation and the corresponding AST view of the high-level language instruction a[1] = (a[1] + b[1]) * b[1]. All variables starting with D are temporaries generated by GCC for internal use.
In Figure 3 we illustrate the modular view of GCC. The gimple AST generated by the parser is processed by a series of gimple optimization passes. The optimized gimple code is further lowered into the register transfer language (RTL) IR, which is again optimized by RTL passes. Finally, the optimized code is passed through the ASM generator to produce the corresponding machine language code.
Our tool processes the gimple IR version of the source code and hence is plugged into GCC as a gimple pass. It can profile a segment of source code whose start and end statements are annotated before and after, respectively, by print statements: printing the START_BLOCK_COMPUTATION and STOP_BLOCK_COMPUTATION variables to annotate a block segment, and printing the START_LOOP_COMPUTATION and STOP_LOOP_COMPUTATION variables for a loop segment.
Figure 3: Modular view of GCC (C/C++/Fortran code → parser → gimple AST → gimple passes, including the StaticProf pass → optimized gimple → RTL passes → optimized RTL → ASM generator → machine language code)
3.1 Algorithmic Details
Our analyzer targets both assignment and computing statements of the form
lhs = val
lhs = rhs1 ⊕ rhs2 ⊕ … ⊕ rhsn ⊕ rhsn+1
All of lhs, rhs1, …, rhsn+1 are either scalar or array references referencing integer, float or double data types; val is the value assigned to lhs, and ⊕ is any arithmetic operation.
The tool traverses the corresponding ASTs in top-down order, starting from the assignment node, and parses each and every node. For each node it decides whether the node is of interest or not. For instance, while parsing a mathematical operation node it checks the data types of its child nodes; if the type of any child is float or double, the tool increases the flop count by one. The memory operation count is kept separately for integer and floating point memory operations: whenever the tool encounters a leaf node during traversal, the memop count increases accordingly.
Example 3. Figure 4 depicts the AST of statement S from the motivating example shown earlier in Figure 1. All accesses to array variables u1 and u2 are leaf nodes, and arithmetic operations are intermediate nodes. Note that the AST does not illustrate the tree formation of the index computations, as the current implementation considers them free, i.e., they do not participate in the performance count.
The scalar variable sixth in the AST is real in type, but its value is constant throughout the program's execution. During compilation, the constant propagation optimization replaces the variable sixth with its value; therefore it does not contribute to the memop computation.
The tool starts analyzing the AST from the assignment node. First it checks the data type of the left child u2[ind]; as the data type is double and it is a store memory operation, the memop count is set to one. Next, the analyzer traverses the right subtree and encounters the multiplication node. Similarly, it checks the data types of the left subtree and the right child of the multiplication node; as the data type of the right child is double, the flop count is set to one and the memop count increases by one. The profiler recursively traverses all subtrees and updates the performance numbers for all relevant nodes in the AST.
Memory references, whether through scalars or arrays, can be either direct, like the array reference A[i] or the scalar reference a, or indirect, like A[B[i]] or the pointer reference *a. Our profiler handles all direct, indirect and pointer references. For an indirect access A[B[i]] it counts memops for both the index array B and the value array A. However, the current implementation does not consider any contribution of loop index variables or operations on such variables; they are considered free.
Our tool profiles any valid basic block or loop of a program. A basic block contains multiple statements in sequence and has a single entry and a single exit of control. Our tool analyzes each statement of the block and simply accumulates the performance numbers.
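The traversal of Example 3 can be mimicked outside GCC with a toy tree. The sketch below is a Python re-implementation for illustration only (the node shapes are invented, not gimple structures): it walks the AST of statement S top-down, adds one flop for each arithmetic node with a double-typed child, and counts each leaf memory reference as a load or store.

```python
class Node:
    def __init__(self, kind, dtype, children=(), role=None):
        self.kind = kind                  # 'assign', 'op', 'ref' or 'const'
        self.dtype = dtype                # 'double', 'int', ...
        self.children = list(children)
        self.role = role                  # 'load' or 'store' for refs

def profile(node, counts):
    """Top-down traversal mirroring the StaticProf counting rules."""
    if node.kind == "ref":
        counts[node.role] += 1            # leaf memory reference
    elif node.kind == "op":
        # flop rule: any child of float/double type makes this op a flop
        if any(c.dtype == "double" for c in node.children):
            counts["flop"] += 1
    for child in node.children:
        profile(child, counts)
    return counts

def load():  return Node("ref", "double", role="load")     # a u1[...] access
def const(): return Node("const", "double")                # sixth after constant propagation

# Build the AST of statement S: five additions over six u1 loads,
# one multiplication by the constant sixth, one store to u2[ind].
expr = load()
for _ in range(5):
    expr = Node("op", "double", [expr, load()])
mul = Node("op", "double", [expr, const()])
root = Node("assign", None, [Node("ref", "double", role="store"), mul])

counts = profile(root, {"flop": 0, "load": 0, "store": 0})
print(counts)   # expect 6 flops, 6 loads, 1 store, matching Figure 4
```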
Figure 4: Flop and bandwidth numbers on the AST of statement S (the six u1[...] leaves are each marked load, the u2[ind] leaf is marked store, and the five + nodes and the * node are each marked flop; the constant sixth contributes nothing)
Input: annotated C/C++/Fortran code
Output: flop and memop count for the annotated segment
Procedure:
1. The input program is parsed by the compiler, which produces gimple ASTs.
2. For the valid segment, block or loop, in the source code, identify the Basic Blocks (BBs) to be analyzed in the gimple pass.
3. For each BB goto 4.
4. For each statement (stmt) goto 5.
5. If stmt is of assignment type, parse the lhs and rhs of the assignment node.
   (i) If lhs or rhs is only a memory access of type float or int, count memop accordingly.
   (ii) If rhs is a unary or binary expression, and the operands are of floating type, count flop accordingly.
6. Once all identified BBs are processed, report the flop and memop count for the annotated segment.
Figure 5: Outline of the performance computation algorithm
Multiple basic blocks together form a loop, which is strongly connected in nature and must have a single entry of control. Here control does not only fall from one block to the next, but may also fall back to a previous one. To profile any valid loop, all concerned basic blocks are profiled first and then the numbers are summed up to get the performance count for a single loop iteration. The count thus generated is multiplied by the upper bounds of the surrounding loop iterations to get the final number.
We have presented the outline of the performance computation algorithm in Figure 5. To profile a valid block or loop segment in the source code, first the basic blocks under consideration need to be identified; then all statements from those basic blocks are profiled, and finally the values are accumulated to get the final performance result.
4 Experimental Results
The command line option -fdump-tree-perf, during GCC compilation, dumps the performance data profiled by StaticProf. We have profiled multiple kernels from various benchmarks to check the correctness of the profiler. For C we have used the benchmark suite Polybench [pol], for C++ programs the CppLapack [cpp] benchmark is used, and for Fortran programs we consider the BLAS routines in Lapack [lap].
In Table 1 we present statically computed parameterized expressions of floating point operations and memory operations for a few selected kernels from the Polybench suite. Note that these kernels operate on floating point data type
kernel            flop                          memop
atax              4 × ny × ny                   nx + 2 × ny + 7 × ny × ny
mvt               4 × n × n                     8 × n × n
gemm              3 × nk × nj × ni + ni × nj    2 × ni × nj + 4 × ni × nj × nk
dmm               2 × num × num × num           4 × num × num × num
symm              5 × n × n × n + 6 × n × n     5 × n × n × n + 4 × n × n
daxpy             2 × n                         2 × n
jacobi_1d_imper   3 × (n − 2) × tsteps          6 × (n − 2) × tsteps
floyd_warshall    n × n × n                     4 × n × n × n
Table 1: Statically measured performance numbers
kernel      loop bounds   flop by StaticProf (gflop)   flop by Vtune on Sandybridge (gflop)
cholesky    1024          2.1506                       2.4378
daxpy       100000000     0.4                          0.48672
symm        1024          5.375                        5.7373
mvt         10000         0.4                          0.5417
dmm         1024          2.1475                       2.9822
Table 2: Comparing StaticProf and Vtune estimated flops
only. The parameterized expressions are functions of symbolic loop bounds like N, M, NK, NJ, TSTEPS etc., whose values are unknown at compile time.
In Table 2 we compare the flop counts of a few BLAS routines computed by both StaticProf and Vtune for statically known loop bounds. Vtune monitors the runtime events generated during program execution and estimates the performance accordingly. For instance, consider the kernel dmm. StaticProf estimates the flop count as 2 × 1024 × 1024 × 1024 = 2.1475 gflop, where the loops iterate 1024 times. Vtune dynamically gathers the value of the event SIMD_FP_256.PACKED_DOUBLE as 745550 uflop. The final flop count by Vtune would be 745550 × 4 × 1000 = 2.9822 gflop.
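The dmm arithmetic above can be reproduced directly. The last line makes an assumption not stated in the paper: that the reported overestimation percentage is taken relative to the Vtune figure, which for dmm lands at the upper end of the range cited for Figure 6.

```python
loop_bound = 1024
static_flop = 2 * loop_bound ** 3                 # StaticProf: 2 x 1024^3 operations
vtune_uflop = 745550                              # SIMD_FP_256.PACKED_DOUBLE event value
vtune_flop = vtune_uflop * 4 * 1000               # 4 packed doubles x sample multiplier

print(static_flop / 1e9, vtune_flop / 1e9)        # 2.147... vs 2.982... gflop
overestimate = (vtune_flop - static_flop) / vtune_flop   # assumed definition
print(round(100 * overestimate, 1))               # ~28 percent for dmm
```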
Figure 6 depicts the overestimation caused by Vtune for each kernel mentioned in Table 2. The overestimation on average ranges between 6% and 28%. This is a known issue of Vtune which occurs due to incorrect counting of executed instructions.¹
We have also tested the profiler on a large application, OpenFOAM [ope], an open source application for computational fluid dynamics. Our tool correctly estimates the flop count and memory operation count for the GaussSeidel smoother function, shown in Figure 7. The outer cell loop performs only a division operation on floating point data type and accesses four array elements of data type double. Each inner face loop performs two arithmetic operations and accesses two array elements of type double and one of type integer. As the exact number of loop iterations is unknown at compile time, our profiler considers the loop iterations as nCells and (fEnd - fStart) for the outer and inner loops respectively. Therefore,
Flop = 1 × nCells + 2 × 2 × (fEnd − fStart) × nCells
As the size of a double element is 8 bytes and the size of an integer element is 4 bytes,
MemOp (in bytes) = 4 × nCells × 8 + 2 × (2 × 8 + 4) × (fEnd − fStart) × nCells
                 = 32 × nCells + 40 × (fEnd − fStart) × nCells
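The GaussSeidel expressions can be evaluated for concrete values to confirm the algebraic simplification. The loop bounds below are assumed samples, since nCells and (fEnd − fStart) are symbolic in the paper:

```python
n_cells = 1000          # assumed number of cells
faces = 6               # assumed (fEnd - fStart) per cell

flop = 1 * n_cells + 2 * 2 * faces * n_cells
memop_bytes = 4 * n_cells * 8 + 2 * (2 * 8 + 4) * faces * n_cells
simplified = 32 * n_cells + 40 * faces * n_cells

print(flop, memop_bytes)
assert memop_bytes == simplified      # the expanded and simplified forms agree
```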
¹ http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops; https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/499193
Figure 6: Overestimation caused by Vtune (bar chart comparing gflop profiled by StaticProf and by Vtune for cholesky, daxpy, symm, mvt and dmm)
for (register label celli=0; celli<nCells; celli++)
{
    fStart = fEnd;
    fEnd = ownStartPtr[celli + 1];
    psii = bPrimePtr[celli];
    for (register label facei=fStart; facei<fEnd; facei++) {
        psii -= upperPtr[facei]*psiPtr[uPtr[facei]];
    }
    psii /= diagPtr[celli];
    for (register label facei=fStart; facei<fEnd; facei++) {
        bPrimePtr[uPtr[facei]] -= lowerPtr[facei]*psii;
    }
    psiPtr[celli] = psii;
}
Figure 7: Code snippet of the GaussSeidel function
5 Conclusions and Future Work
We have automated the theoretical flop and bandwidth computation of C, C++ and Fortran programs by analyzing
source codes at compile time. The main contribution of this work is to develop the tool as part of the license-free
GCC compiler, in the hope that it will be used both commercially and for research purposes.
Analyze the effect of cache: The current implementation does not consider any effect of the cache. The presence of a cache
effectively reduces the number of memory transfer operations, as the cache stores frequently accessed elements. Our
implementation treats each and every load and store operation as an access to memory and counts the quantity of
memory transfers accordingly. As future work we want to revise the memory transfer count in the presence of a cache.
Handle index computation and test on applications: Currently our tool does not take into account any index
computation as a contributor to the memory operation count. In future we may extend our work to handle this. Also, the tool has
only been tested on a few benchmarks and a single real-life application. We want to test it further on several other large
real-life applications, like ROMS, Amber, etc.
Compare with Milepost: The GCC implementation of the tool processes optimized GIMPLE codes to profile
performance bounds. Profiling optimized versions clearly ignores redundant arithmetic and memory operations and
reduces the gap between the static and dynamic profile variants. The static implementation of GCC has a fixed prior sequence of
optimizations, which may or may not be optimal for different source codes. Milepost [MTB+08] GCC is a wrapper
around the static GCC which, in essence, finds the optimal sequence of optimizations for various source codes. We
plan to incorporate our tool into Milepost GCC to check whether different optimization sequences affect the static
performance computations of programs or not.
References
[CDPR99] Calin Cascaval, Luiz DeRose, David A. Padua, and Daniel A. Reed. Compile-time based performance prediction. In Twelfth International Workshop on Languages and Compilers for Parallel Computing, pages 365–379, 1999.
[cpp] http://cpplapack.sourceforge.net/.
[GKM82] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a call graph execution profiler, 1982.
[hpc] http://hpctoolkit.org/.
[lap] http://www.netlib.org/lapack/lug/node71.html.
[MTB+08] Cupertino Miranda, Olivier Temam, Phil Barnard, Elton Ashton, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Edwin Bonilla, John Thomson, Hugh Leather, and Chris Williams. Milepost GCC: machine learning based research compiler. In GCC, 2008.
[NNH10] Sri Hari Krishna Narayanan, Boyana Norris, and Paul D. Hovland. Generating performance bounds from source code. In ICPP Workshops, pages 197–206. IEEE Computer Society, 2010.
[ope] http://openfoam.com/.
[pap] http://icl.cs.utk.edu/papi/.
[per] http://perfsuite.ncsa.illinois.edu/.
[pol] http://web.cse.ohio-state.edu/~pouchet/software/polybench/.
[vtu] https://software.intel.com/en-us/get-started-with-vtune.
[WDGKD99] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Towards realistic performance bounds for implicit CFD codes. In Proceedings of Parallel CFD'99, pages 233–240. Elsevier, 1999.
The copyright of this paper is owned by the author, Ramya Ramalinga Moorthy. The author hereby grants Computer Measurement Group
Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Performance Anomaly Detection & Forecasting Model (PADFM) for eRetailer Web application
Ramya Ramalinga Moorthy EliteSouls Consulting Services LLP, Bangalore, India.
With high performance becoming a mandate, every e-business realizes its impact and the need for sophisticated performance management. A robust performance anomaly detection and forecasting solution is in demand to detect anomalies in the production environment and to provide forecasts on server resource demand to support server sizing. This paper deals with the implementation of PADFM for an online retailer using statistical modeling & machine learning techniques, which has yielded multi-fold benefits to the business. Keywords: Performance Anomaly detection, Forecasting, K-NN, K-Means, Exponential smoothing, ARIMA.
1. Introduction
With the revolution brought by digital transformation, performance management for digital products is no more a luxury. If Amazon experiences a 1% decrease in sales for every additional 100 ms delay in response time, we can imagine the severity of the performance impact on every high-traffic e-business.
Though Application Performance Management (APM) tools are widely used and have brought down the performance problem diagnosis time to a great extent, these tools don't actually help in detecting the anomalies / outliers in the production environment. With the deployment of robust performance management solutions being one of the compliance factors for digital products, the demand for quick & effective performance anomaly detection & forecasting solutions is increasing steadily. The secret sauce for building such a solution is the right choice of statistical modelling & machine learning techniques, as they provide a strong basis for detecting system performance anomalies and for making forecasts.
With several techniques available in the domain of statistics & machine learning, choosing the right technique(s) capable of detecting the maximum possible anomalies while meeting the high accuracy requirements of the business is a great challenge. The choice of technique(s) purely depends on the type of data and its characteristics.
A performance anomaly is an abnormal application behavior that could be due to some unrelated event or an unexpected application behavior that adversely affects the application performance. The performance data used for detecting anomalies & for making forecasts come from the measurements of two different classes of performance metrics, namely application metrics (Example: server hits, # users, application response time, etc) that provide the current state/health of an application, and system metrics (Example: CPU / memory / disk utilization, queue length, etc) that provide the current state of the underlying system.
Providing forecasts on the server resource requirements supports in application capacity planning & infrastructure sizing. Usage of filtered server resource utilization statistical data, free from anomalies & outliers, can result in providing accurate forecasts that can be used for server sizing analysis.
This paper deals with our solution, Performance Anomaly Detection and Forecasting Model (PADFM v1.0) developed for addressing the requirements of e-retailer business application. The paper is organized in eight sections. Section 2 provides the overview of the e-retailer problem space. Section 3 provides overview of the PADFM model & its objectives.
Section 4 details various performance anomaly detection techniques employed by our model through illustrations. Section 5 details various performance forecasting techniques employed by our model through illustrations. Section 6 provides model recommendations & business benefits. Section 7 provides the limitations & future plans. Section 8 concludes the paper along with the references.
2. E-Retailer Context / Problem Space Overview
A US based retailer, who has operated its women's apparel business both in brick-n-mortar shops & online during the last 7 years, plans to expand its online business portfolio in subsequent years. The business is interested in historical data investigation to gain confidence on what kind of analysis can be done to improve their current performance management practices. Also, the business is interested to know whether any recommendations on server resource demands can be made for justifying the need for a server capacity upgrade for the year 2016.
The online ecommerce application, developed using J2EE technologies, is hosted on JBoss & Oracle servers at their private data center and is configured with IBM Tivoli to monitor the production environment health metrics at 5 minute intervals from Jan 2010 till Dec 2015 (though not much analysis is currently done on the collected data). Server traffic monitoring logs are available from Jan 2013 till Dec 2015 (though not analyzed).
A Proof-Of-Concept (POC) model is demanded that can use the historical data for providing recommendations to improve current performance & capacity management practices.
3. PADFM Solution Overview
Our solution, Performance Anomaly Detection & Forecasting Model (PADFM v1.0) designed for POC, helps in identifying the performance anomalies from the available historical data & for forecasting the hardware demand, particularly CPU requirements of web, application & database server for next one year.
Our model is implemented on R statistical platform and uses various statistical & machine learning techniques. The choice of the techniques is based on our study of data characteristics and evaluation of accuracy of various techniques. Our model provides two key solutions - Performance anomaly detection & forecasting (in offline mode).
For anomaly detection, our solution uses the below algorithms.
Twitter's Anomaly Detection algorithm is used for detecting anomalies in the user traffic trend.
Twitter's Breakout Detection algorithm is used for identifying breakouts & shifts in the distribution trends of server resource utilization data.
K-means machine learning algorithm is used to cluster the server monitoring data of web, application & database servers based on similarity, grouping suspicious data points into the outermost cluster to quickly detect anomalies.
K-NN machine learning algorithm is used to store representative server utilization data point samples and to segregate new data points into well defined groups, based on a majority vote of their k neighbours, to quickly detect anomalies.
Pearson correlation analysis is used to correlate all the metrics considered in the model at every anomaly point detected by the model, to help in performing a first level of root cause analysis.
For forecasting, our solution uses three time series analysis techniques, moving average, exponential smoothing (single, double or triple) and ARIMA. The historical server utilization data of web, application & database server are fed to these models to forecast the demand for next year. The best forecasts are recommended based on evaluation of various forecast accuracy measures.
4. Performance Anomaly Detection Solution
The performance anomaly detection techniques are grouped by the type of underlying model they assume for nominal data and the discrimination function used to identify data that doesn’t agree with the model.
Our performance anomaly detection solution is designed to offer the below two types of analysis:
1) Visualize the performance counters over time, classify them to provide an easy view of anomalies or outliers, and report any deviations from the set thresholds.
2) Support root-cause analysis by correlating several performance counters & reporting counters with high correlation coefficients.
The techniques used by our model are discussed below:
4.1 Anomaly Detection Model using Twitter’s AD algorithm
Twitter’s anomaly detection algorithm is based on the S-H-ESD (Seasonal Hybrid Extreme Studentized Deviate) algorithm, an extension of the generalized ESD method that detects both global & local anomalies better than the generalized ESD method.
The web server access log that has the server hit details for the last 3 years (Jan 2013 till Dec 2015) is used for this analysis. By using this anomaly detection algorithm, both global & local anomalies are detected as shown in figure 1.
Figure 1: Anomalies detected using Twitter’s AD
Drill-down analysis for a specific investigation period reported anomalies where, upon verification, it is found that 82% of the anomalies detected during the investigation period correlate with the days when performance tests were conducted on the production environment, due to infrastructure issues on the test environment.
Figure 2: Drill down analysis using Twitter’s AD
The anomalies detected along with its percentage deviation are reported by the model as in figure 3.
Figure 3: Anomalies detected using Twitter’s AD
This algorithm is chosen as it is very popular in detecting both global & local anomalies. Our solution is able to rightly detect one-point sudden increases, unusual noise, sudden growth on a seasonal pattern, etc. on the server hits trend. But there are two limitations, where the algorithm is not able to detect 1) a linearly increasing growth anomaly & 2) a negative anomaly, represented below.
Figure 4: Anomalies not detected using Twitter’s AD
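The deviation-scoring idea behind ESD-style detectors can be illustrated with a much simpler sketch. The Python detector below is not S-H-ESD (it performs no seasonal decomposition); it is a robust median/MAD outlier test on a hypothetical hits series, with 3.5 as a common rule-of-thumb threshold:

```python
import statistics

def mad_outliers(series, threshold=3.5):
    """Flag indices whose modified z-score exceeds `threshold`.

    A deliberately simplified stand-in for S-H-ESD: it uses the median
    and the median absolute deviation (MAD), which are robust to the very
    anomalies being detected, but unlike the Twitter algorithm it does
    not first remove the seasonal component.
    """
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []
    # 0.6745 rescales MAD to the standard deviation of a normal distribution
    return [i for i, x in enumerate(series)
            if abs(0.6745 * (x - med) / mad) > threshold]

hits = [100, 102, 98, 101, 99, 500, 97, 103]   # one injected spike
print(mad_outliers(hits))   # -> [5]
```

On seasonal traffic data, this plain test would flag the normal daily peaks as anomalies, which is exactly the problem the seasonal decomposition in S-H-ESD addresses.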
4.2 Anomaly Detection in data distribution trend using Twitter’s BD algorithm
Twitter’s Breakout Detection algorithm is based on EDM (E-Divisive with Medians), which is comparatively better at detecting breakouts & quickly identifying changes in the distribution trend than the E-Divisive, Moving Average & Moving Median algorithms on our input data.
Our model is able to detect the two major types of anomalies, sudden increase / decrease and gradual increase / decrease in the time series trend of the resource utilization metrics.
A sudden increase in the DB server CPU utilization trend, noticed on a specific day (28th Nov 2014) where the CPU utilization increased from an average of 40% to 60%, detected by the algorithm is shown in figure 5.
Figure 5: Sudden increase detected using Twitter’s BD
A slow increase in web server disk utilization trend, noticed on a specific day (14th Sep 2015) where disk utilization increased from an average of 10% to 40%, detected by the algorithm is shown in figure 6.
Figure 6: Slow increase detected using Twitter’s BD
Though the above types of breakouts are easily detected & very useful, sometimes the algorithm doesn’t detect changes in distribution shift from Poisson to Uniform.
4.3 Anomaly Detection Model using K-Nearest Neighbours Algorithm
K-NN is a simple instance-based supervised learning algorithm that stores all the representative data and classifies new data input by a majority vote of its k neighbours. This algorithm segregates unlabeled data points into well defined groups. Choosing the number of nearest neighbours, i.e. determining the value of k, plays a significant role in determining the efficacy of the model.
The data fed to our model has % CPU & % Disk utilization values with type labels (sample shown in figure 7), collected from the application server during peak hours in 2015. The threshold value (configurable) used for classifying the data is 70% for % CPU utilization & 20% for % Disk utilization.
Figure 7: Input Data Sample used for K-NN analysis
The input data points are labeled using four classes, VALID (where both % CPU & % Disk utilization values are less than set thresholds), H-CPU (where % CPU utilization value is greater than set threshold), H-DISK (where % Disk utilization value is greater than set threshold) and H-CPUnDISK (where both % CPU & % Disk utilization values are greater than set thresholds).
Figure 8: Input data classification used for K-NN analysis
Out of 100 input data points fed to the model, 65% of the provided data is configured to be used for training purposes & the remaining 35% is used for testing purposes. The pictorial view of the input data classification is shown in figure 8.
Upon applying this algorithm, the prediction accuracy provided by our model for the test data values is
shown in figure 9.
Figure 9: Resultant classification accuracy on test data
It is evident from the results that 100% of the predicted values on the test data (35% of the input data) are valid. The developed model can thus be used to classify new unlabelled input data; the model is able to classify input data samples (with 10000+ observations) into the 4 provided input classes with an average accuracy of about 92%.
Figure 10: Resultant classification on new input data sample
A new unlabelled input data set, with about 120 data points, fed to the model has yielded 96% classification accuracy as shown in figure 10.
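A minimal sketch of the labelling-plus-majority-vote scheme described above, in Python (the paper's model itself runs on R). The thresholds match the 70% CPU / 20% disk values stated earlier; the training points and helper names are hypothetical:

```python
import math
from collections import Counter

def label(cpu, disk, cpu_thr=70, disk_thr=20):
    """Rule-based labelling into the paper's four classes."""
    if cpu > cpu_thr and disk > disk_thr:
        return "H-CPUnDISK"
    if cpu > cpu_thr:
        return "H-CPU"
    if disk > disk_thr:
        return "H-DISK"
    return "VALID"

def knn_classify(train, point, k=3):
    """Majority vote among the k nearest labelled (cpu, disk, label) tuples."""
    nearest = sorted(train, key=lambda t: math.dist(t[:2], point))[:k]
    return Counter(lbl for _, _, lbl in nearest).most_common(1)[0][0]

# Hypothetical training sample (the paper used ~100 labelled points)
train = [(c, d, label(c, d)) for c, d in
         [(40, 10), (50, 12), (85, 15), (90, 10),
          (45, 30), (55, 35), (88, 40), (92, 45)]]
print(knn_classify(train, (87, 12)))   # -> H-CPU
```

Because the labels themselves are threshold rules, k-NN here mainly smooths over measurement noise near the thresholds; the k value trades noise tolerance against boundary sharpness.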
4.4 Anomaly Classification Model using K-Means Clustering algorithm
K-Means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. The underlying idea of the algorithm is that a good cluster is one which contains the smallest possible within-cluster variation of all observations in relation to each other. The variation is calculated using the squared Euclidean distance between the data points.
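The within-cluster-variation idea can be sketched with plain Lloyd's iteration on 2-D (CPU, disk) points. This is an illustrative Python stand-in for R's kmeans(); the data points are hypothetical:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points: assign each point to its
    nearest center, recompute centers as cluster means, repeat until stable."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated hypothetical utilization groups
pts = [(30, 5), (32, 6), (31, 7), (90, 40), (92, 42), (91, 41)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))   # -> [3, 3]
```

Points landing in the high-utilization cluster correspond to the "cluster 4" suspects the paper inspects for anomalies.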
The data fed to our model has 100 unlabelled data points of % CPU & % Disk utilization values (sample shown in figure 11), collected from the application server during peak hours in 2015.
Figure 11: Input data sample in K-means
Upon entering the # of clusters input value as four, the resultant clustered view of our model is shown in figure 12. Using this cluster representation, anomalies can be automatically identified by analyzing the cluster 4 (high CPU & Disk utilization cluster) data points.
Figure 12: Resultant cluster view of input data
Minor errors are noticed in the clustering results. Upon comparing the clustered results of this data with the labeled input data (shown in figure 8, used for K-NN analysis), 5 deviations are noticed, as represented in figure 13.
Figure 13: Labeled data versus Cluster results comparison
The optimal number of clusters is calculated using the Elbow method. By plotting the total within-cluster sum of squares for various selected K values, the elbow point can be identified. The graph shown in figure 14 justifies why the K value is chosen as 4 for clustering the input data sample in the discussed case: the curve shows a bend & stops making big gains after this point.
Figure 14: Cluster K value analysis
Another labeled data input fed to our model has 75 labeled data points of % CPU & % Disk utilization data from the web, application & database servers, collected during peak traffic hours (sample shown in figure 15). The K-means clustering result for this input data is shown in figure 16.
Figure 15: Input data sample with server labels
Figure 16: Clustering results of Web, App & DB server data
A comparison of the resultant data clusters with the input data values, showing minor errors, is represented in figure 17. 2 data points of the application server data class (cluster 1) are wrongly clustered into cluster 2, and in the case of the DB server data class (cluster 3), 3 data points are wrongly clustered into cluster 2.
Figure 17: K-Means Clustering results validation
By tuning the right k value, K-means clustering analysis can easily be used for classifying data in both online & offline modes, for analyzing production monitoring data or user traffic data for easy outlier detection.
4.5 Anomaly Root Cause analysis using Pearson Correlation algorithm
Our root-cause analysis solution is based on the Pearson correlation coefficient, which assumes that the two populations have bell-shaped (normal) distributions and that the relationship between them, if any, is approximately a straight line. The correlation between 2 metrics is accepted as real & significant if the p value calculated from the correlation test is less than 0.05.
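The coefficient itself is straightforward to compute; a Python sketch on hypothetical CPU and available-memory series (the p-value used for the 0.05 significance cut-off would additionally require a t-test, omitted here):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two metric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

cpu = [40, 50, 60, 70, 80]
free_mem = [60, 50, 40, 30, 20]            # available memory falls as CPU rises
print(round(pearson_r(cpu, free_mem), 2))  # -> -1.0
```

A strongly negative value, like the -0.83 between % CPU and % Available Memory reported later in this section, is the signature of the inverse relationship the model flags.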
Our model takes into account 10 performance metrics (shown in figure 18) for performing correlation analysis on the server metrics. For every threshold deviation reported on CPU utilization, the below correlation analysis can be performed by our model. In one instance, high CPU utilization confirmed the presence of a memory leak in the application. During correlation analysis, 4 key counters, CPU, Memory (calculated from the Available MBytes counter), % Disk utilization & processor queue length, are correlated & reported as in figure 19. Additionally, any counters of interest from the list can be added to the correlation as required.
Figure 18: Performance Metrics (#10) used by PAD model
The correlation trend analysis graph shows the previous 6 hours data trend (configurable value) from the threshold deviation point.
The Pearson correlation coefficient analysis between % CPU Utilization & % Available Memory metric is reported as -0.83 with the p-value of 2.2e-16 (<0.05).
Figure 19: Server resource trend from our model showing 6 hours before & after the anomaly point that resulted in an exception
The scatter plot is used to visually report the correlations across all counters, which confirms the existence of correlation between the CPU & Memory utilization counters.
Figure 20: Scatter plot representing correlations
The correlation coefficient matrix shown in figure 21 is used to report the severity of the correlation between counters used in analysis.
Figure 21: Correlation Coefficient Matrix & Heat map view
The heat map view shows the correlation values of all selected counters used in the analysis. The dark blue boxes in the figure represent metrics with significant negative correlation values, as the % CPU & % Memory utilization values are inversely proportional.
Various rules are included in the model to conclude on the root cause, even though significant correlation can be seen between multiple counters. The results & graphs reported by our model help in quickly detecting anomalies & performing root cause analysis.
5. Performance Forecasting Solution
Our forecasting model uses three univariate time series analysis techniques to forecast the future behavior of a performance metric according to its past values. Our model deploys Moving Average, Exponential Smoothing (Single, Double or Triple) and ARIMA techniques to forecast the CPU requirements of the web, application & database servers & highlights the best forecasts based on the forecast accuracy measures RMSE & MAPE.
Our forecasting model used in PADFM version 1.0 uses the 95th percentile value of the past 72 months (Jan 2010 till Dec 2015) to make the forecasts for the next year. The % CPU utilization data reported by Tivoli at every 5 minutes interval is used as input for the forecasting model, after removal of anomalies (using the techniques discussed in section 4). The 95th percentile value of every hour is used to calculate the 95th percentile value for every day & this value is then used to calculate the 95th percentile value for every month for the last 6 years.
Though this is not the ideal approach, this technique is followed since the objective of the model during the POC is to show how various forecasting techniques can be applied to the available statistical data to provide recommendations for server sizing. The data derived from this aggregation analysis, approved by the business team & infrastructure team, is used in the forecasting models.
Figure 22: Time Series data input used for forecasting
Figure 22 provides the view of the year-on-year increase in the Application Server % CPU utilization during the last 6 years (Jan 2010 till Dec 2015). The forecasts made from this time series data input by applying 3 statistical modelling techniques are discussed in this section, to make forecasts for the next year, 2016.
Our forecasting model facilitates making forecasts either by using all the data points or for a user-selected timeframe (considered relevant for sizing).
5.1 Moving Average (MA) Model
This method uses the average of the most recent k data values in the time series as the forecast for the next period. In this method, every time a new observation becomes available for the time series, the oldest observation in the equation is replaced and a new average is computed. For tracking the most recent time series values a small value of k is preferred, whereas for smoothing over past values a larger value of k is preferred.
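A one-step-ahead sketch of this method in Python, on a hypothetical monthly series:

```python
def moving_average_forecast(series, k=3):
    """One-step-ahead forecast: mean of the most recent k observations."""
    window = series[-k:]
    return sum(window) / len(window)

# Hypothetical monthly 95th-percentile % CPU values
cpu = [52, 54, 55, 57, 60, 62]
print(round(moving_average_forecast(cpu, k=3), 2))   # -> 59.67
```

Rolling this forward (appending each new observation and dropping the oldest) reproduces the "replace the oldest observation" behavior described above.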
Forecasted trend reported by our solution is shown in figure 23.
Figure 23: Forecasted Trend using MA model with k=3
The forecast accuracy measure values are calculated as RMSE is 1.03 & MAPE is 1.79.
5.2 Exponential Smoothing (ES) Model (Single, Double & Triple techniques)
This method uses a “smoothing constant” to determine how much weight to assign to the actual values. The smoothing parameter should fall into the interval between 0 and 1, so as to produce the smallest sum of squared residuals.
Our solution performs data decomposition analysis to study the type of time series data before performing this analysis. During decomposition analysis, the input time series data is analyzed for the presence of 3 components - level, trend & seasonality. Accordingly, our solution applies one of the below mentioned smoothing techniques to make forecasts.
Figure 24: Decomposed view of input time series
5.2.1 Simple Exponential Smoothing
This method is applicable for time series data with only level and no trend or seasonality. The fitted trend & forecast trend reported by our model are provided in figures 25 & 26.
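A sketch of the level-only recursion, using the alpha = 0.64 value reported in figure 26 (the series itself is hypothetical):

```python
def ses_forecast(series, alpha=0.64):
    """Simple exponential smoothing: maintain a level only.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}; the final level
    is the (flat) forecast for all future periods. alpha = 0.64 mirrors
    the fitted value reported in figure 26.
    """
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

print(round(ses_forecast([50, 52, 51, 53, 52]), 2))   # -> 52.11
```

With a high alpha like 0.64, the forecast leans heavily on the most recent observations; a low alpha would average over a longer history.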
Figure 25: Observed vs Fitted Trend using Simple ES
Figure 26: Forecasted Trend with alpha=0.64
The forecast accuracy measure values are calculated as RMSE is 2.54 & MAPE is 4.65.
5.2.2 Holt’s Exponential Smoothing
This method is used when time series data have level and trend but no seasonality. Forecasts are made where smoothing is controlled by two parameters: alpha, for the estimate of the level at the current time point, and beta, for the estimate of the slope b of the trend component.
Figure 27: Forecasted trend with alpha = 0.48 & Beta = 0.03
The forecasted trend & residual trend are shown in figure 27 & 28. The residual plot shows forecast errors seem to have roughly constant variance over time.
Figure 28: Residuals Trend using Holt’s ES
The forecast accuracy measure values are calculated as RMSE is 2.4 & MAPE is 4.4.
5.2.3 Holt-Winter’s Exponential Smoothing
This method is used when time series data have level, trend and seasonality. This ES technique is the one majorly used, as the server resource utilization time series data had level, trend & seasonality components. Forecasts are made where smoothing is controlled by three parameters: alpha, beta, and gamma, for the estimates of the level, the slope b of the trend component, and the seasonal component, at the current time point.
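A sketch of the additive Holt-Winters recursions with the alpha/beta/gamma values reported in figure 30. The initialisation below (first-cycle averages) is one common choice, not necessarily what R's HoltWinters() does, and the demo series is hypothetical:

```python
def holt_winters_additive(series, period, alpha=0.39, beta=0.006,
                          gamma=0.46, horizon=1):
    """Additive Holt-Winters: smoothed level + trend + seasonal indices."""
    # Initialise from the first two seasonal cycles
    season1, season2 = series[:period], series[period:2 * period]
    level = sum(season1) / period
    trend = (sum(season2) - sum(season1)) / period ** 2
    seasonal = [x - level for x in season1]

    for i, x in enumerate(series):
        last_level = level
        s = seasonal[i % period]
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[i % period] = gamma * (x - level) + (1 - gamma) * s

    return [level + (h + 1) * trend + seasonal[(len(series) + h) % period]
            for h in range(horizon)]

# A perfectly seasonal hypothetical series: the forecast recovers the pattern
f = holt_winters_additive([10, 20] * 4, period=2, horizon=2)
print([round(v, 3) for v in f])   # -> [10.0, 20.0]
```

On real utilization data the three constants would be fitted by minimising the squared one-step errors, which is what produced the alpha = 0.39, beta = 0.006, gamma = 0.46 values quoted above.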
Figure 29: Observed vs Fitted Trend using Holt-Winter’s ES
The fitted trend & forecast trend reported by our model are shown in figure 29 & figure 30.
Figure 30: Forecasted Trend using Holt-Winter’s ES model with alpha = 0.39, beta =0.006 & gamma =0.46
Figure 31: Residuals Trend using Holt Winter’s ES model
The residual plot provided by our model shows that the forecast errors seem to have roughly constant variance over time. As per the Ljung-Box test, the p-value is 0.55 (>0.05), so there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-20, confirming an appropriate fit of the time series model.
To be sure that the predictive model cannot be improved upon, our model provides a histogram representation of the forecast errors. This helps to check whether the forecast errors are normally distributed with mean zero and constant variance. The forecast errors histogram with overlaid normal curve shows that the distribution of forecast errors is centered on zero and approximately normal.
Figure 32: Forecasted Errors Histogram
The forecast accuracy measure values are calculated as RMSE is 0.06 & MAPE is 1.29.
5.3 ARIMA Model
The AutoRegressive Integrated Moving Average (ARIMA) model is the most popular approach to forecasting both stationary & non-stationary time series data. The model, summarized as ARIMA(p,d,q), comprises three parameters: the autoregressive parameter (p), the number of differencing passes (d), and the moving average parameter (q). The method operates in three phases, namely Transformation, Parameter Estimation & Fitting, and Diagnostic Checking.
Our model uses transformation functions to convert the input data to stationary data and then calculates the parameter values to fit the best ARIMA model to make forecasts.
The non-stationary input data shown in figure 33 is differenced until it appears to be stationary in mean and variance, i.e., the level & variance of the series stay roughly constant over time, as shown in figure 34.
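The differencing step (the "I" in ARIMA) can be sketched directly; a hypothetical linear trend becomes stationary after a single pass (d = 1):

```python
def difference(series, d=1):
    """Apply first differencing d times (the 'I' in ARIMA(p,d,q))."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A linear upward trend (non-stationary mean) becomes constant after d=1
trend = [40, 42, 44, 46, 48]
print(difference(trend))        # -> [2, 2, 2, 2]
```

A quadratic trend would need d = 2, which is exactly what a unit root test on the differenced series helps decide.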
Figure 33: Non-stationary input data
Figure 34: Stationary data after transformation (d=1)
A unit root test is performed to test the stationarity of the time series data. The Correlogram (ACF) and Partial Correlogram (PACF) analysis carried out to calculate the values of the ARIMA model parameters p & q is shown in figures 35 & 36.
Figure 35: ACF Correlogram View
Figure 36: PACF Correlogram View
Our model suggests the best ARIMA fit with the least AIC and BIC values. The forecast trend reported by our solution using the recommended ARIMA (1,0,0)(1,0,0) model is provided in figure 37.
Figure 37: Forecasted Trend of ARIMA model
The Ljung-Box test & Normal Q-Q plot are used to validate that there is no pattern in the residuals. The p-values for the Ljung-Box test are well above 0.05, indicating non-significance. The normal Q-Q plot values are normal as they rest on a line and aren't scattered all over the place.
Figure 38: Ljung-Box Test Plot
The forecast accuracy measure values are calculated as RMSE is 2.89 & MAPE is 4.95.
Figure 39: Normal Q-Q Plot
5.4 Forecasting Model Accuracy Analysis Summary
Our model performs the below three types of analysis to confirm the best fit.
The Ljung-Box test (p value > 0.05) is used to check for non-significance of autocorrelations on the residuals.
A histogram plot of the forecast errors is used to check whether the residuals are normally distributed with mean zero and constant variance.
Various forecast accuracy measures (ME, RMSE, MAE, MPE, MAPE, MASE & ACF1) are used to compare the forecast accuracy of the employed techniques. The forecast model recommendation is based on the lowest RMSE & MAPE values, shown in figure 40.
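The two measures driving the recommendation can be sketched directly (the actual/predicted series below are hypothetical; MAPE assumes the actual values are non-zero):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error (requires non-zero actuals)."""
    return 100 * sum(abs((a - p) / a)
                     for a, p in zip(actual, predicted)) / len(actual)

actual    = [50, 60, 70, 80]
predicted = [48, 62, 69, 83]
print(round(rmse(actual, predicted), 2),
      round(mape(actual, predicted), 2))   # -> 2.12 3.13
```

RMSE penalises large individual misses more heavily, while MAPE expresses error relative to the observed level, which is why the model reports both.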
Figure 40: Forecasting Model Accuracy Measures
As shown in the above recommendation, the Exponential Smoothing technique seems to have the best forecast accuracy, with relatively low RMSE (0.06) & MAPE (1.29) values.
5.5 Forecasting Model Results
Though our forecasting model recommends the best forecasting technique, it provides the forecasted trends of all the models for reference. The forecasted data trend for the application server CPU utilization for the year 2016, reported by our model, is shown in figure 41.
Figure 41: Forecasted Trend based on 3 techniques
As per the forecast, the application server maximum % CPU utilization will fluctuate in the 80% to 85% range during 2016. As the forecast values are above the application server % CPU utilization threshold (configured as 70%), the model suggests a capacity upgrade. This justifies the need for an application server capacity upgrade for the next year, 2016. This forecast assumes that no sudden increase in traffic trend is expected due to new promotional deals, marketing campaign activities, etc. As these factors are not considered in the model, extra capacity room needs to be buffered to accommodate the additional requirement for the forecasted year.
Figure 42: Server Capacity Upgrade Recommendation Report
A similar analysis performed for the web server & database server confirmed that there is no need for a server capacity upgrade for the year 2016. As per the forecasts reported by the model, the web server maximum % CPU utilization will fluctuate in the 60% to 65% range (threshold: 80%) and the database server maximum % CPU utilization will fluctuate in the 45% to 50% range (threshold: 75%).
6. PADFM Results & Business Benefits
The anomalies detected in the user traffic trend & server resource utilization trend using the various techniques require manual analysis. For a reported point anomaly, 12 different performance metrics (# server hits & # users details from user traffic data, and the 10 server resource monitoring metrics shown in figure 18) can be graphically viewed & Pearson correlation analysis can be performed to identify the root cause of the anomalous behavior.
Based on the root cause analysis results, server configuration tuning or application code tuning activities can be suggested. This manual analysis of point anomalies can result in any of the 2 actions namely; anomaly data point can be included for sizing analysis, considering it is a false positive detection or can be removed considering it as an outlier / unrelated event for the sizing analysis.
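The correlation step can be sketched as follows; the traffic and CPU samples are hypothetical, and in practice each of the 12 metrics would be correlated against the one showing the anomaly:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical hourly samples around a flagged anomaly
server_hits = [1200, 1350, 1500, 4800, 1400, 1300]
cpu_util    = [45.0, 48.0, 52.0, 93.0, 50.0, 47.0]

# A coefficient near +1 suggests the CPU spike is load-driven,
# pointing root cause analysis toward traffic rather than the application.
print(round(pearson(server_hits, cpu_util), 3))
```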
The time series data, refined by filtering out possible anomalies and outliers, is then used by the forecasting models to make forecasts for the next year. The results of the best forecasting model are used to justify the need for a server capacity upgrade in the forecasted period.
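As a sketch of how the recommended technique produces such a forecast, single exponential smoothing can be implemented as below; the series and the smoothing factor alpha = 0.4 are illustrative, not taken from the model:

```python
def exp_smooth(series, alpha):
    """Single exponential smoothing; returns the one-step-ahead forecasts."""
    forecasts = [series[0]]          # seed with the first observation
    for value in series[1:]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts

# Hypothetical filtered monthly max % CPU utilization
cpu = [68.0, 70.5, 69.8, 72.0, 74.5, 76.0]
smoothed = exp_smooth(cpu, alpha=0.4)

# Single exponential smoothing yields a flat one-step-ahead forecast;
# comparing it against the threshold drives the upgrade recommendation.
print(round(smoothed[-1], 2))
```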
The business benefits achieved through our model include: 1) improved performance SLA management; 2) quick correlation analysis reports that help the business understand system load and utilization behavior during BAU and peak seasons, driven by dynamic business factors such as deals and promotions; 3) anomalous distribution trends that provide feedback to the performance testing team for refining the test workload with realistic usage patterns; 4) point anomalies flagged on server resource trends that bring immediate focus to performance tuning and optimization activities based on first-level root cause analysis results; 5) better forecast accuracy and server sizing, owing to the use of filtered (anomaly/outlier-free) input data; and 6) added intelligence in production monitoring to detect anomalies and perform first-level root cause analysis, reducing the load on the L1 and/or L2 infrastructure support teams on the production environment.
7. Model Limitations & Future Plans
Our PADFM version 1.0 model has several limitations: handling large data volumes, processing speed, rich graphical representation, and automatic workload-versus-server-performance correlation analysis. Also, forecasts are based purely on server utilization data, without considering business factors such as the history of promotional events/deals (whose impact must be assumed for the forecasted period) or server specifications (CPU model, type and speed, memory, etc.). The impact of these business and application factors needs to be correlated with server resource utilization patterns when making forecasts.
Our next version of the model, PADFM v2.0, currently in progress, focuses on addressing the above limitations and includes additional techniques such as the Netflix RAD algorithm (based on Robust Principal Component Analysis) and LOF (Local Outlier Factor) to detect anomaly patterns missed by the version 1.0 model.
Additional features that capture workload fluctuations along with server utilization, including JVM statistics, will feed supervised machine learning techniques (the k-NN algorithm) to quickly classify input data and detect anomalies related to workload fluctuations and patterns.
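The distance-based idea behind such nearest-neighbour detectors can be sketched as follows. Note this is a simplified k-nearest-neighbour distance score, not the full LOF density-ratio computation, and the utilization samples are hypothetical:

```python
def knn_outlier_scores(points, k):
    """Score each 1-D point by the mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical % CPU utilization samples with one spike
cpu = [48.0, 50.0, 49.5, 51.0, 90.0, 50.5]
scores = knn_outlier_scores(cpu, k=2)
print(scores.index(max(scores)))  # prints 4: the 90% sample stands out
```

LOF refines this by comparing each point's local density with that of its neighbours, so anomalies are detected relative to their surroundings rather than by a global threshold.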
The forecasting model will also include server sizing recommendations based on several additional business and application factors, along with industry benchmark data from TPC/SPEC.
To overcome the limitations related to data handling and speed, we have used InfluxDB, an open-source time series database, to store real-time time series data from production servers, and Grafana to better visualize the time series data in the anomaly detection graphs.
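For illustration, InfluxDB ingests samples in its plain-text line protocol; a point such as the one below (host names and measurement names are hypothetical) can be POSTed to the database's write endpoint and then charted in Grafana:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Encode one sample in InfluxDB line protocol: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu_util",
    {"host": "appserver01", "tier": "app"},
    {"max_pct": 82.5},
    1451606400000000000,
)
print(line)
# cpu_util,host=appserver01,tier=app max_pct=82.5 1451606400000000000
```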
97
8. Conclusion
Unidentified performance anomalies or unexpected capacity issues can lead to system failure, with immediate impact on customer experience and resulting business loss. The real success of a performance anomaly detection and forecasting solution lies in choosing the right combination of techniques (with fine tuning) that complement each other, and in accounting for the impact of several other factors to yield better accuracy. The techniques employed by our model for the retailer client have yielded great benefits for the business in detecting anomalies in the production environment and in providing recommendations for server sizing.
98
Performance Comparison of Virtual Machines and Containers
with Unikernels
Nagashree N, Suprabha S, Rajat Bansal VMware Software India Private Limited, Bengaluru.
{nagashreen, suprabhas, rajatb}@vmware.com
In a typical cloud environment that uses a software-defined data center (SDDC), most virtual machines are deployed to run a single application for back-end services, resulting in under-utilization of virtual resources. With virtual machines, legacy applications run on a full-fledged OS. Containers avoid this overhead through OS-level virtualization, where applications are 'caged' to provide secure isolation; however, this adds another layer to the software stack. Unikernels, leveraging the benefits of virtualization, remove these redundant layers by enabling us to build applications with only the required libraries. These applications run as virtual machines on a hypervisor in Ring 0 (kernel) mode. Consequently, the attack surface is minimal, ensuring very high security compared to containers. Privileges are also reduced, building potentially stronger walls between disparate components. The applications can scale massively because of a very small memory footprint. In this paper, we use OSv, a unikernel cloud OS, to build and deploy applications such as Apache, MySQL, etc. We have performed a performance analysis of these applications running in a virtual machine, a container and a unikernel. We have observed that the unikernel VM size is the smallest, its boot time is much faster and its read/write time is significantly lower.
1. INTRODUCTION
Cloud computing arose from the realization that virtualization facilitates efficient use of hardware resources by providing a greater degree of abstraction between hardware and software. A virtual machine is a self-contained computer with a standard operating system running applications. It helps with server consolidation, as there is less hardware to maintain, and it is scalable and secure. However, each VM runs a full copy of an operating system and a virtual copy of all the hardware that the OS requires. Containers provide an alternative by virtualizing the operating system, which reduces resource consumption. A container wraps up applications and all their dependencies in a complete file system. Containers, running as isolated processes in the host operating system's user space, share the kernel, which makes RAM and disk usage more efficient compared to VMs [7]. Despite these benefits, containers are not as secure as virtual machines. Their attack surface is comparatively larger because a vulnerability in the OS's kernel can make its way into the containers. They also do not offer the same level of isolation as hypervisors. The next logical step beyond VMs and containerization is unikernels.
Unikernels are specialized, single address space machine images constructed by using ‘library operating systems’. A developer selects the minimal set of libraries which correspond to the OS constructs required for their application to run. These libraries are then compiled with the application and configuration code to build sealed, fixed-purpose images (unikernels) which run directly on a hypervisor or hardware without an intervening OS. [8] (Fig. 1)
99
Fig. 1: Application Stack in Virtual Machine, Container and Unikernel
Virtualization, in spite of many of its advantages, adds many layers and this overhead consumes more resources. Unikernels significantly reduce this overhead without compromising on any of the advantages of virtualization. Some of the key advantages are:
1. Performance — Less CPU usage & memory footprint.
2. Security — The application is the kernel. Unikernels provide an extremely tiny and specialized runtime footprint that is less vulnerable to attack.
3. Scalability — The image size is very small (less than 100MB), enabling flexible deployment on a large scale.
As with many new technologies, unikernels face a chasm between establishing their technical value and achieving market adoption. With this paper, we attempt to bridge this gap and help unikernels be adopted widely across various industries.
In Section 3 and Section 4, we present evidence for the above by comparing virtual machines and containers (Docker) with unikernels.
2. RELATED WORK
The architecture of library operating system (libOS) has several advantages over more conventional designs [1]. “Unikernels: Rise of the Virtual Library Operating System”, details the limitations of current operating systems and how library operating systems overpower them. It illustrates the design of MirageOS which aims to restructure VMs into more modular components for increased benefits.
There are several unikernels or library operating systems, such as OSv, ClickOS, Clive, HaLVM, LING, MirageOS and Rump Kernels. In "OSv — Optimizing the Operating System for Virtual Machines", Avi Kivity et al. explain that OSv is a better OS for the cloud than traditional operating systems such as Linux.
J(VM)2 [2], a pure Java-based VM encapsulated inside a specialized hypervisor-aware framework, provides many substantial advantages over the traditional JVM.
100
3. PROOF-OF-CONCEPT
Unikernels, by design, do not have open-vm-tools running inside them, as only one process runs within each unikernel. However, each of the unikernel implementations has an instrumentation API to obtain similar data. OSv is the unikernel OS used here for building and deploying applications. OSv [4] is an open source operating system designed for the cloud, built from the ground up for effortless deployment and management, with superior performance.
3.1 Legacy and Cloud-Native Applications
There are not many applications that are ready to be run as unikernel images. Building a unikernel image for an application requires detailed knowledge of the application's runtime and its dependencies, so that only the relevant parts of the OS library are compiled into the image. A few legacy applications such as Java and MySQL, and cloud-native applications such as Apache and Cassandra, are compiled to an OVA and then deployed as virtual machines on a hypervisor. The virtual machine boots the application automatically, in an instant, as soon as it powers on. However, the VM shuts down if the application crashes or fails to start. The automated process of building an OSv application is depicted in Fig. 2.
Fig. 2: OSv Application Build Process
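For illustration, OSv images are commonly built with the Capstan tool, driven by a Capstanfile; a minimal sketch might look like the following (the base image name, command line and paths are assumptions, not taken from this paper):

```yaml
# Capstanfile (illustrative sketch)
base: cloudius/osv-openjdk       # prebuilt OSv base image bundling a JVM
cmdline: /java.so -jar /app.jar  # single process the unikernel boots into
files:
  /app.jar: build/app.jar        # copy the application jar into the image
```

Running the Capstan build step then produces a VM image that boots straight into the application, matching the automated flow of Fig. 2.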
3.2 Performance
For the above applications, a performance evaluation of unikernels against virtual machines and containers (Docker) was performed. A standard benchmark tool was chosen for each application and configured consistently across VM, container and unikernel. The statistics are provided in the next section.
3.3 Security
The two major flaws of containers are isolation and security. Since unikernels are deployed as virtual machines on a hypervisor, these are not a concern for them. Containers do not provide resilient, multi-tenant isolation: malicious code inside a container can attack the host operating system and other containers. The Docker daemon requires root privileges to run containers (applications), which leaves the underlying OS vulnerable. Packaging applications is still tricky because of the uncertainty in their dependencies [3]. Unikernels are more secure because the OS consists of only the libraries relevant to the application. There is no support for device drivers, which are a potential risk. Unikernels offer the highest level of isolation, as there is no privilege switching between user and kernel modes. BTC Piñata, a MirageOS unikernel, is evidence of the level of security a unikernel can provide. It is written in OCaml, runs directly on Xen, and uses native OCaml TLS and X.509 implementations.
3.4 Which Applications to Build as Unikernels?
Not every application is suitable for implementation as a unikernel. Like other technologies, unikernels also have
101
their limits. The key limitations are [14]:
1. Lack of multiple processes (Single Process, Multiple Threads)
2. Single User
3. Limited Debugging
4. Impoverished Library Ecosystem
An application that doesn't violate these limits can be considered for compiling as a unikernel. The application shouldn't fork new processes on the same machine; whenever it requires a new process, a new unikernel VM can be created, since one starts up in under a second. Debugging can be improved by instrumenting the application internally. Unikernels specifically target security and scalability (applications that may need to scale to very high numbers). A unikernel would be appropriate for something that will be exposed to the Internet and therefore needs the highest levels of security [14]. There is no specific set of criteria for whether an application can be built as a unikernel or not, because the technology is new. However, with time unikernels can be widely adopted and many applications can be built as unikernels.
5. RESULTS & DISCUSSION
Unikernels run as conventional virtual machines on a hypervisor. We measure the performance of applications such as the Apache HTTP server and MySQL running in a virtual machine, a container (Docker) and a unikernel. We performed all the tests on a hypervisor with twelve 1-2 GHz Intel Xeon E5-2620 processors for a total of 12 cores, Hyper-Threading and 64 GB RAM; this is a mainstream server configuration for deploying virtual machines, containers and unikernels. We used Ubuntu 15.10 (Wily Werewolf) 64-bit with Linux kernel 2.6.27.56 as the base image for all Docker (v1.8.0) containers and as the cloud image for all virtual machines. We run a single instance of each of the Apache Web Server and MySQL applications in the VM, the container and the unikernel.
a. Apache Web Server
The performance of the Apache Hypertext Transfer Protocol (HTTP) server is measured with a benchmarking tool called ApacheBench (ab). This tool shows, in particular, how many requests per second the Apache installation is capable of serving. We configured the Apache HTTP server and ApacheBench on a single virtual machine with 4 vCPUs and 2 GB RAM. We ran the ApacheBench Docker image in a single container hosted on another VM. We launched the server in the unikernel and ran the benchmarking tool in a VM. For consistency, all of these connect to "https://www.yahoo.com". The total number of bytes received from the server, which is essentially the number of bytes sent over the wire, is 524210; the total number of document bytes received from the server, excluding bytes received in HTTP headers, is 460120. The number of requests is 1000, and the number of concurrent clients used during the test is 100.
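Such a run can be scripted; the sketch below parses the two headline figures from ApacheBench's textual output. The sample lines mirror the output format of `ab -n 1000 -c 100 <url>`; in a live run the text would come from the tool's stdout instead:

```python
import re

def parse_ab(output):
    """Extract requests/sec and mean time/request (ms) from ApacheBench output."""
    rps = float(re.search(r"Requests per second:\s+([\d.]+)", output).group(1))
    tpr = float(re.search(r"Time per request:\s+([\d.]+)", output).group(1))
    return rps, tpr

# Sample of the relevant lines from an `ab -n 1000 -c 100 <url>` run
sample = """\
Requests per second:    28.12 [#/sec] (mean)
Time per request:       3087.29 [ms] (mean)
"""
print(parse_ab(sample))  # (28.12, 3087.29)
```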
The performance of the server running in each of the VM, container and unikernel is measured across three parameters, as shown in TABLE I. The unikernel performs better than the VM and the container, as its mean time to serve a request across all concurrent requests is the lowest. Its rate of requests returned per second is better than the VM's and close to the container's.
Performance Metric                   VM        Container  Unikernel
Requests/Second (higher is better)   16.45     30.12      28.12
Time/Request (ms)                    5480.987  3200.13    3087.29
Total Test Time (s)                  54.23     31.23      28.53
TABLE I: Performance Analysis of Apache HTTP server with VM, Container & Unikernel
102
Fig. 3 shows the maximum time taken to connect to the server, process the request and send the response to the clients. With the VM, the fastest request took 55 ms and the slowest 10900 ms. For the container and the unikernel, the fastest request took 71 ms and 58 ms respectively, while the slowest took almost half the time of the VM's. Fig. 4 shows the percentage of requests served within a certain time. The server running inside the VM takes double the time to serve the last 50% of the requests. The unikernel performs well for 1000 requests compared to the others. In real deployments, where the server has to satisfy thousands of requests, VMs and containers take a very long time; unikernels, however, offer predictable performance.
Fig. 3: Maximum Time Taken to Connect, Process Request & Wait for response
Fig. 4: Time Taken to Service the Requests
103
b. MySQL
MySQL performance is measured with sysbench [13], a benchmark suite that allows one to quickly assess system performance when running a database under intensive load. We configured MySQL and sysbench on a single VM with 4 vCPUs and 2 GB RAM, using Percona Server v5.6, a fork of the MySQL relational database management system created by Percona. We followed the same configuration in the container. We ran only MySQL in the unikernel and ran sysbench inside another VM; this simulates a more "real world" use case. We ran sysbench OLTP in mixed OLTP (R/W) and read-only OLTP modes with a single thread and a table size of 20,000,000.
Fig. 5: Read-Only and R/W Requests per Second
Fig. 6: Time Taken to Service 95% of Read-Only and R/W Requests
Fig. 5 shows the Read-Only and R/W requests per second. The number of Read-Only requests per second is highest in
104
the case of the unikernel and lower for the VM and the container. With R/W requests per second, however, the VM beats the others. Fig. 6 shows the time taken to serve 95% of Read-Only and R/W requests. The unikernel takes nearly half the time to satisfy 95% of the Read-Only requests, but its R/W times are relatively higher, due to the network latency of connecting from a different host. Regardless, the unikernel did not degrade performance much, and it is more suitable for read-intensive applications.
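The 95% figures reported by sysbench are latency percentiles. Under a nearest-rank definition (one common convention, used here as an assumption about the computation), they can be derived from raw per-request latencies as sketched below with hypothetical samples:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: value below which pct% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies (ms) from an OLTP run
latencies = [0.6, 0.7, 0.7, 0.8, 0.9, 1.0, 1.1, 1.3, 1.8, 4.0]
print(percentile(latencies, 95))  # prints 4.0 (nearest-rank 95th percentile)
```

The tail percentile, rather than the mean, is what exposes the occasional slow request that dominates user-perceived latency.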
c. Performance Assessment
The performance analysis covers a database and a web server via the MySQL and Apache Web Server applications. These applications, running in each of the VM, container and unikernel, are measured across standard parameters as shown in TABLE II. With the Apache Web Server, the unikernel performs better than the VM and the container, as the average time taken to serve a request across all concurrent requests is the lowest; its rate of requests returned per second is better than the VM's and close to the container's. With MySQL, the read-only (R) requests are served fastest by the unikernel, and its rate of read-only requests returned per second is better than the VM's and the container's. The read/write (RW) requests per second and the time taken to serve them are almost the same across all three technologies.
                                Requests/Second (higher is better)            95% Service Time (ms)
Scenario     Application        VM            Container     Unikernel         VM         Container   Unikernel
Database     MySQL (sysbench)   7398.52 (R)   7301.21 (R)   8571.43 (R)       2.05 (R)   1.91 (R)    0.72 (R)
                                5588.97 (RW)  4771.01 (RW)  4817.36 (RW)      3.63 (RW)  4.03 (RW)   4.14 (RW)
Web Server   Apache (ab)        16.45         30.12         28.12             8924.72    5801.59     5213.12
TABLE II: Performance Analysis of Applications with VM, Container & Unikernel
Unikernels may not be a panacea; however, they perform much better than containers and virtual machines in some cases, and on par with them in the others.
6. CONCLUSION
In our work, we have compared unikernels with virtual machines and containers to show that they perform better for certain applications. Unikernels allow applications that demand predictable performance to access hardware resources directly. Specialized implementations can be provided by including only the required set of libraries, drivers and interfaces. In addition to performance, security, scalability, isolation and feasibility, unikernels make it possible to provide a Disaster Recovery (DR) solution as a unikernel service: since unikernels boot in an instant, a new unikernel can be summoned whenever an application crashes. They also enable transient microservices in the cloud. Hardware today is fast, compact and cheap, allowing us to move away from the assumption that a VM must be persistent. A single VM need not be multi-functional; a new unikernel VM can be summoned to handle each request over the network. Unikernels, with all these advantages, can serve as a better operating system for the SDDC.
7. ACRONYMS & ABBREVIATIONS
Acronym/Abbreviation Description
ab ApacheBench
BTC Bitcoin
DR Disaster Recovery
HaLVM Haskell Lightweight Virtual Machine
libOS library Operating System
OLTP OnLine Transaction Processing
OVA Open Virtual Appliance or Application
SDDC Software-Defined Data Center
VM Virtual Machine
VMDK Virtual Machine Disk
VMX Virtual Machine Configuration File
105
ACKNOWLEDGEMENT
We would like to extend our gratitude to VMware for supporting this research work and providing adequate resources.
REFERENCES
[1] Anil Madhavapeddy and David J. Scott, "Unikernels: Rise of the Virtual Library Operating System", ACM Queue, Distributed Computing.
[2] Lyor Goldstein, "J(VM)2 - A Virtual Java Virtual Machine (Machine): Making the Case for a Pure-Java Virtual Machine", 2015.
[3] http://robhirschfeld.com/2014/04/18/docker-real-or-hype/
[4] http://osv.io/
[5] http://www.itworld.com/article/2915530/virtualization/containers-vs-virtual-machines-how-to-tell-which-is-the-right-choice-for-your-enterprise.html
[6] http://www.linux.com/news/enterprise/cloud-computing/819993-7-unikernel-projects-to-take-on-docker-in-2015
[7] https://www.docker.com/what-docker
[8] https://en.wikipedia.org/wiki/Unikernel
[9] https://blog.docker.com/2013/08/containers-docker-how-secure-are-they/
[10] Avi Kivity, Dor Laor, Glauber Costa, Pekka Enberg, Nadav Har'El, Don Marti and Vlad Zolotarov, Cloudius Systems, "OSv - Optimizing the Operating System for Virtual Machines", USENIX Annual Technical Conference, 2014.
[11] Wes Felter, Alexandre Ferreira, Ram Rajamony and Juan Rubio, "An Updated Performance Comparison of Virtual Machines and Linux Containers", IBM Research Report, July 21, 2014.
[12] http://www.csoonline.com/article/2984543/vulnerabilities/as-containers-take-off-so-do-security-concerns.html
[13] https://www.howtoforge.com/how-to-benchmark-your-system-cpu-fileio-mysql-with-sysbench
[14] Russell C. Pavlicek, "Unikernels - Beyond Containers to the Next Generation Cloud", First Edition, O'Reilly, October 2016.
106