Proceedings of the CMG India 2016 3rd Annual Conference
Mumbai, India
Table of Contents

Foreword: CMG India's 3rd Annual Conference ........ 3

Best Paper Awards at CMG India 2016 ........ 5

Automating Web Application Performance Testing – Challenges and Approaches
Bishnu Priya Nanda et al. ........ 6

Survey of Big Data Frameworks for Different Application Characteristics
Praveen Kumar Singh et al. ........ 16

Performance Engineering Best Practices for Data Warehouse Applications
Gaurav Kodmalwar et al. ........ 21

Performance Monitoring and Management of Microservices on Docker Ecosystem
Sushanta Mahapatra et al. ........ 31

Performance Testing and Tuning of a Web Based Enterprise Scale High Volume Accounts Payable Processing System
Seji Thomas et al. ........ 42

The Docker Container Approach to Build Scalable and Performance Testing Environment
Pankaj Rodge et al. ........ 53

Conflux: Distributed, Real-Time Actionable Insights on High-Volume Data Streams
Vinay Eswara et al. ........ 57

DB Volume Emulator
Rekha Singhal et al. ........ 66

Leveraging Human Capacity Planning for Efficient Infrastructure Provisioning
Shashank Pawnarkar ........ 71

StaticProf: Tool to Statically Measure Performance Metrics of C, C++, Fortran Programs
Barnali Basak ........ 79

Performance Anomaly Detection & Forecasting Model (PADFM) for eRetailer Web Application
Ramya Ramalinga Moorthy ........ 87

Performance Comparison of Virtual Machines and Containers with Unikernels
Nagashree N et al. ........ 99
Foreword: CMG India's 3rd Annual Conference
Abhay Pendse
President, CMG India
CMG India completed 3 years in 2016, and in that time we have come a long way: 25+ regional events and 3 annual conferences in 3 years. This is certainly a significant achievement in a very short period of time.
CMG India continues to offer a great knowledge-sharing platform in the performance engineering and capacity management area, and our growing support from industry and academia, with professionals like you, is a testimony to the same.
We had strong support and presence from all of you at the first two annual conferences - 2014 in Pune and 2015 in Bengaluru. For our 3rd annual conference we chose Mumbai, India's financial capital, as the venue. Our 3rd Annual conference in Mumbai too had an excellent response and was attended by 125+ delegates from across the software services and products industry.
We had two major tracks at the 3rd Annual conference: a technical paper track and a tutorials track.
We had 17 paper and 4 tutorial submissions this year. Our technical program committee, headed by Amol Khanapurkar and co-chaired by Dheeraj Chahal, did an excellent job of screening the papers to select the top 12 paper submissions for the conference.
The technical paper sessions this year were around four broad areas: “Data Engineering”, “Performance Modelling”, “Performance Testing” and “Structural Analysis”. The paper sessions had some top notch technical content, aided by our 6 keynote sessions with topics from “SBI (State Bank of India) Digitization in scale and robustness” to “NSE (National Stock Exchange) Journey over 2 decades”, from “Holistic Optimization of Data Access from Imperative Programs” to “Accelerating Deep Learning and Machine Learning to a New Level”, and from “Rise and Rise of e-Governance in India” to “Architecture of a Real-Time Operational DBMS”.
In the tutorials track we had excellent industry and academia participation, with topics ranging from performance modelling and trustworthy engineering to GPU acceleration and application performance with vectorization.
Thanks to all our keynote speakers - Mr. Shiv Kumar Bhasin, CTO, State Bank of India (SBI); Prof. S. Sudarshan, Dept. of Computer Science and Engg, IITB; Mr. Naveen Gv and Mr. Austin Cherian, Intel Corporation; Mr. G M. Shenoy, CTO Operations, National Stock Exchange (NSE); Mr. Dharmesh Parekh, Senior Vice President, NSDL; and Dr. Srini V. Srinivasan, Founder and Chief Development Officer, Aerospike - and our invited tutorial talk speaker Dr. Rajesh Mansharamani for their time and sessions.
CMG India being a virtual entity with a limited budget, we initially had some challenges in finding the right venue for the Mumbai conference. Our sincere thanks to Tata Consultancy Services (TCS) for agreeing to host the conference at no cost. In addition, our sincere thanks to our sponsors - Intel (Principal Partner), STPI (Diamond Partner) and TCS (Gold Partner) - for the event. Without sponsor support the conference would not have been possible.
The organizing committee, headed by Manoj Nambiar and supported by Mehul Vora and other CMG India members Milind Hanchinmani, Suresh Babu, Prashant Ramakumar and Adarsh Jagadeeshwaran, working with Ismail Emkay and his team from the event management company “Think Happy Everyday”, did an excellent job in ensuring smooth organization of the conference. Kudos to our technical program committee, headed by Amol Khanapurkar and co-chaired by Dheeraj Chahal, and the other TPC members for ensuring good technical paper and tutorial content for the conference. Hats off to their commitment to the conference.
As we enter 2017, we would like to continue this momentum and organize more regional events and conferences with top notch technical content. CMG India is an all-volunteer organization, and the success of such an organization depends on all our members like you. I hope that with your strong support CMG India will continue to grow in the coming years. Thank you.
Conference Technical Program Committee: Amol Khanapurkar (Chair), Dheeraj Chahal (Co-Chair), Abhay Pendse, Adarsh Jagadeeshwaran, Bhargav Lavu, Maheshgopinath Mariappan, Manoj Raghavendra, Milind Hanchinmani, Prajakta Bhatt, Prashant Ramakumar, Sundarraj Kaushik, Suresh Kumar YK

Conference Organizing Committee: Manoj Nambiar (Head), Mehul Vora, Abhay Pendse, Milind Hanchinmani, Suresh Babu, Amol Khanapurkar, Adarsh Jagadeeshwaran
Best Paper Awards at CMG India 2016
Overall, 12 papers were accepted and 9 papers were offered a speaking slot of 40 minutes each. The jury, headed by Mr. Prashant Ramakumar, identified the following papers as the Best Papers.
Best Paper
Conflux: Distributed, real-time actionable insights on high-volume data streams
(Vinay Eswara, Jai Krishna and Gaurav Srivastava, VMWare)
2nd Best Paper
Performance Anomaly Detection & Forecasting Model for eRetailer Web Application
(Ramya Ramalinga Moorthy, Elitesouls.in)
3rd Best Paper
DB Volume Emulator
(Rekha Singhal and Amol Khanapurkar, TCS)
The first 2 CMG India Annual Conference paper award winners get direct entry into the CMG US conference in a subsequent year.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement
Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
AUTOMATING WEB APPLICATION PERFORMANCE TESTING – CHALLENGES AND APPROACHES
Bishnu Priya Nanda
TCS Product Trustworthy CoE - Component Engineering Group, Tata Consultancy Services Ltd, IT/ITES Special Economic Zone, Plot 35, Chandaka Industrial Estate, Patia, Chandrasekharpur, Bhubaneswar - 751024, Orissa, India, Ph: +91 0674 6645333

Seji Thomas
TCS Product Trustworthy CoE - Component Engineering Group, Tata Consultancy Services Ltd, 1st Floor TCS Center, Infopark, Kakkanad, Kochi - 682020, Kerala, India, Ph: +91-484-6187568

Mohan Jayaramappa
TCS Product Trustworthy CoE - Component Engineering Group, Tata Consultancy Services Ltd, SJM Towers, Gandhinagar, Bangalore - 560009, Karnataka, India, Ph: +91-080-67247967
1 ABSTRACT
With the rapid adoption of Agile and DevOps in software development, performance testing done the manual way brings in an impedance mismatch that slows down development progress. To make performance testing agile, we first need to automate it and integrate it with the development and testing toolsets. Making performance testing automatable is not straightforward; in this paper we describe the challenges one faces and how we addressed them to build a successful performance testing automation that brings in significant savings in time and effort.
2 INTRODUCTION
Performance testing is an important part in product development. In traditional software development models using the waterfall method, performance testing is planned after development is complete and, preferably, after functional testing is finished. Performance testing projects are technically complex. They often require expenses on infrastructure and can take several weeks. Moreover, a large part of the time is taken by preparatory work, such as setting up the test environment, developing load scripts and emulators, and preparing test data.
In practice, software development using traditional methods ends on time in less than 30% of cases, so performance testing is frequently cancelled to minimize delays to releasing the software. This often leads to application failure under load, slow execution times, and the operations department’s inability to predict the system’s behavior in the future when the load and volume of data will grow.
Many development teams are beginning to adopt agile methodologies in order to overcome the barriers of traditional methods and release high-quality products on time. Agile makes it possible to significantly shorten development cycles - now they are counted in days. As a result, many organizations and independent development teams are wondering what to do with performance testing, which looks awkward and impedes the idea of short cycles. As in traditional models, this is often the reason why performance testing is not performed, and the reason things end with the same negative results. Automated performance testing enables the rapid release of software with high performance: teams no longer need to wait months to receive performance test results. Automation fits into cycles of continuous software integration and delivery, helping teams receive constant feedback and optimize software performance during current sprints.
3 ABBREVIATIONS
Sl No. Acronyms Full form
1 CPU Central Processing Unit
2 CV Co-efficient of Variance
3 DB Data Base
4 GC Garbage Collection
5 IO Input Output
6 NFR Non Functional Requirement
7 PT Performance Testing
8 RAM Random Access Memory
9 SLA Service Level Agreement
4 NEED FOR PERFORMANCE TESTING AUTOMATION
The following are key reasons for automating PT:
• During design and implementation, many big and small decisions are made that may affect application performance - for good and bad. But since performance can be ruined many times throughout a project, good application performance simply cannot be added as an extra activity at the end of a project. If you have load tests carried out throughout your project development, you can proactively trace how performance is affected.
• Without automation, the process of collecting measurement data and validating it manually is laborious, which reduces productivity and increases turnaround times.
• Through automation, none of the steps done manually will be missed, and the time lost due to tester fatigue will be minimal.
• Developers can get immediate feedback, since they are not heavily dependent on the performance tester.
• Test run comparisons that are automated for making high level inferences, with side by side display of details, help to review and assess differences in performance between different versions of the application.
5 CHALLENGES IN PERFORMANCE AUTOMATION
Unlike in functional test automation, there are multiple entities involved in a load test that need to be
monitored and from where metrics must be collected. These then need to be evaluated against target
metrics and a decision taken on pass/fail. Thus, we may have to collect metrics from the Web Server,
Application Server, DB server, utilization of network, CPU & RAM on each of these servers. Once
collected, the right inferences need to be made by the automation tool.
When performance testing is done manually, metrics such as CPU utilization are charted to visually inspect trends and behavior. The performance tester, on reviewing these charts, decides based on judgment whether things look normal. In contrast, in automated performance testing, we need to identify the right statistical measures upfront, compute them, and apply inference rules on these measures.
6 APPROACH TO AUTOMATE THE PERFORMANCE ASSESSMENT
We used JMeter for generating load tests, used standard tools present in the environment, and automated the evaluation of the following metrics:
• Screen response times as captured by the JMeter load test
• CPU & RAM utilization at the OS level
• GC log analysis of the JVMs
The load test scripting and sanity check remain manual, as usual.
We used Jenkins to orchestrate the automation and created separate python programs to process the different types of metrics. For each metric, a ‘Limit’ file is manually created that specifies the target to be achieved for that metric. At the end of the test, the python program evaluates the observed values against the Limit file to decide whether the metric has passed. Finally, a separate python program evaluates all the different metric results and generates a report on the overall load testing pass/fail. We can extend this approach to calculate a performance score by giving a weightage to each of the metric types and computing a final score at the end. This will help in tracking improvements over multiple runs of the tests (whether scores are improving or not).
Subsequent sections describe the different metrics, the challenges and how the pass/fail inference is made.
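As an illustration of this final aggregation step, the following sketch combines per-metric pass/fail results into an overall verdict and a weighted score. The metric names and weights here are assumptions for illustration, not the paper's actual implementation:

```python
# Illustrative sketch of the final evaluation step: combine the per-metric
# pass/fail outputs of the individual python programs into an overall
# verdict, plus the weighted performance score suggested in the text.
# Metric names and weights below are assumptions, not from the paper.

metric_results = {          # True = metric passed its Limit file check
    "response_time": True,
    "cpu_ram": False,
    "gc": True,
}

weights = {                 # assumed weightage per metric type (sums to 1.0)
    "response_time": 0.5,
    "cpu_ram": 0.3,
    "gc": 0.2,
}

def overall_verdict(results):
    """Overall load test passes only if every metric passed."""
    return "Pass" if all(results.values()) else "Fail"

def performance_score(results, weights):
    """Weighted score in [0, 1] for tracking trends across test runs."""
    return sum(weights[m] for m, ok in results.items() if ok)

print(overall_verdict(metric_results))   # prints "Fail": cpu_ram failed
print(round(performance_score(metric_results, weights), 2))
```

A weight file, like the Limit files, could be maintained per application so that the score reflects which metric types matter most for that workload.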
6.1 Automating the application transaction response time report
Response time automation helps to measure the application page response times during load testing.
High level steps in creating the automation are:
• JMeter scripting is done manually as usual
• Manually create a ‘Limit’ file whose contents vary as per the application requirements. A Limit file contains the response time SLAs for the pages of an application, considering the maximum concurrent users accessing at peak time.
Example of response time Limit file is:
Page/Txn | SLA (in ms) | Concurrent users
Orders | 2000 | 10
Payment By Cash | 3500 | 10
Payment By Card | 2500 | 30
Jenkins is configured to orchestrate the following:
• Start load testing by initiating JMeter through its command line option.
• At the end of the JMeter run, start the python program to evaluate the result. This converts the JMeter raw result to a readable format, compares it with the limit file, and decides whether the response time metric passes or fails.
• If any transaction encounters errors or crosses its SLA, the python program marks the overall PT response time result as fail; otherwise the overall response time result is marked pass.
• The final PT dashboard shows the error % in the final JMeter result and the JMeter script coverage, i.e. what percentage of the transactions listed in the limit file were scripted and tested.
Example of Output produced by the python program for response time is:
Page/Txn | Request Name | SLA (in ms) | Concurrent users | Samples | Min | Avg | 90th Percentile | Max | Pass/Fail
Orders | Orders:64/ABC-WebServices/rest/lookup/itemLookUp | 2000 | 10 | 10966 | 200 | 1005 | 1300 | 1980 | Pass
Payment By Cash | PaymentByCash:7/ABC-WebServices/rest/salereturn/addSaleLineItem | 3500 | 10 | 3071 | 450 | 2057 | 2640 | 3020 | Pass
Payment By Card | PaymentByCard:6/ABC-WebServices/rest/transaction/beginSalesTransaction | 2500 | 30 | 15679 | 245 | 2047 | 2356 | 3457 | Pass
Challenges in Automation and how we addressed them:
• With the default JMeter settings the result comes in XML format without a header, which is difficult to parse and compare with the response SLAs mentioned in the limit file. A few property changes are required in JMeter to get the response time result in CSV format so that it can be parsed and compared with the SLAs easily using the python script.
• For validating transaction response times against the limit file, each and every page response cannot be compared against the limit individually. Also, the same page request can come from various transactions and so will appear multiple times in the JMeter report. The python program therefore summarizes the page responses across multiple occurrences into Min, Max, Average and 90th percentile values. It then picks the 90th percentile value from the summarized data and compares it with the SLA mentioned in the limit file.
• In manual analysis we can ignore some requests having a small error percentage, but in automation only a transaction with zero percent errors is considered for comparison with the SLA.
Note: JMeter, an open source tool, is used for load testing here. JMeter can perform load tests, performance-oriented business (functional) tests, regression tests, etc., on different protocols or technologies. By default the JMeter command line result is produced in XML format. For ease of processing, we used the CSV format by setting the appropriate command line options.
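To make the evaluation step concrete, here is a minimal sketch of the kind of python program described above: it summarizes a JMeter CSV result per transaction label and compares the 90th percentile against the SLA from the limit file. The column names follow JMeter's default CSV output; the function names and the in-memory limit representation are illustrative, not the paper's actual code.

```python
# Sketch: summarize JMeter CSV results per transaction and compare the
# 90th percentile response time with the SLA from the Limit file.
# Assumes JMeter's default CSV columns 'label', 'elapsed' (ms), 'success'.
import csv
from collections import defaultdict

def percentile_90(values):
    """90th percentile by the nearest-rank method on sorted samples."""
    values = sorted(values)
    idx = max(0, int(round(0.9 * len(values))) - 1)
    return values[idx]

def evaluate(jtl_path, limits):
    """limits: {transaction label: SLA in ms}. Returns {label: 'Pass'/'Fail'}."""
    samples = defaultdict(list)   # elapsed times per transaction label
    errors = defaultdict(int)     # error counts per transaction label
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            samples[row["label"]].append(int(row["elapsed"]))
            if row["success"] != "true":
                errors[row["label"]] += 1
    results = {}
    for label, sla in limits.items():
        if label not in samples:
            results[label] = "Fail"   # transaction not covered by the script
        elif errors[label] or percentile_90(samples[label]) > sla:
            results[label] = "Fail"   # errors or SLA breach fail the metric
        else:
            results[label] = "Pass"
    return results
```

Per the paper's rule, a single failing or erroring transaction would then mark the overall response time metric as Fail.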
6.2 Automating the CPU and Memory utilization of the servers
Along with Response Times measurement, it is essential to monitor the CPU and memory usage of the servers and ensure usage is within limits.
High level steps in creating the automation are:
• Manually create a utilization ‘Limit’ file for each server (e.g. Web Server, App Server, DB etc.) that is part of the load test; its contents are typically fixed for most applications. The Limit file contains the upper limits of CPU and RAM utilization for the application, considering the maximum concurrent users accessing at peak time.
• This limit file also specifies the statistical measure ‘coefficient of variance’, which typically should be 10%.
Example of the Resource Utilization limit file:
Measure | Limit | Remarks
CPU | 77 | Upper limit in %age
RAM | 77 | Upper limit in %age of total memory
C.V. | 10 | Coefficient of Variance = (Standard Deviation / Average)
Jenkins is configured to orchestrate the following:
• Execute the Vmstat command on each of the Linux servers in the background before the load testing.
• Once the load test is over, start the python program to analyze the Vmstat output.
• The python program calculates the min, max, average, standard deviation and coefficient of variance (C.V.) for the CPU and RAM Vmstat values. The metric is considered passed if the calculated Average < Limit and C.V. < Limit C.V.
• This can be similarly extended to measure and evaluate IO times as well.
Example of Output produced by the python program for server utilization is:
Measure | Limit | Min | Avg | Max | StdDev | C.V. | Pass/Fail | Overall Pass/Fail
CPU | 77 | 26 | 35.74 | 96 | 20.64 | 57.75 | Fail | Fail
RAM | 77 | 97.12 | 97.54 | 97.61 | 0.1 | 0.1 | Fail | Fail
Challenges in Automation and how we addressed them:
• Automatically transferring the Vmstat files from the application servers to the Jenkins server for analysis requires passwordless authentication between the servers, so we used certificates to authenticate over SSH.
• In manual graphical analysis of server utilization results, it is easy to spot peaks and idle periods and, based on experience, judge whether they are acceptable. For this to work in the automation program, we identified the statistical measure coefficient of variance (C.V.) as a good indicator of the overall utilization. C.V. is the standard deviation divided by the average. A C.V. limit of 10 was found to give satisfactory results; lower values indicate that the variation in utilization is more even and not bunched up.
Note: Vmstat is a Linux command to report information about processes, memory, paging, IO and CPU activity.
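A minimal sketch of this Vmstat analysis step might look like the following. The column position assumes vmstat's default output layout, the limit values follow the example Limit file above, and the function names are illustrative:

```python
# Sketch: derive CPU busy % from vmstat samples and apply the Limit file
# checks (average under the limit, and C.V. = stddev/average under 10%).
import statistics

def cpu_busy_from_vmstat(lines):
    """Extract CPU busy % (100 - idle) from raw vmstat output lines.
    Assumes vmstat's default layout where 'id' (idle %) is the 15th column."""
    busy = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 16 and fields[0].isdigit():   # skip header lines
            busy.append(100 - int(fields[14]))
    return busy

def evaluate_utilization(samples, avg_limit=77.0, cv_limit=10.0):
    """Pass only if both the average and the C.V. (in %) are under limits."""
    avg = statistics.mean(samples)
    cv = 100.0 * statistics.pstdev(samples) / avg if avg else 0.0
    return "Pass" if avg < avg_limit and cv < cv_limit else "Fail"
```

With the paper's CPU example (average 35.74, C.V. 57.75), the C.V. check fails even though the average is under the 77% limit, matching the Fail verdict in the example output.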
6.3 Automating the GC log analysis
Improper GC settings can cause a lot of response time issues and hence need to be tuned well. GC analysis also helps in identifying memory leaks.
High level steps in creating the automation are:
• Manually create a GC Limit file containing the number of Full GCs allowed for the duration of the load test, the GC throughput, and the ‘Slope after Full GC’. This limit file is typically constant for all types of applications.
• Install GCViewer, an open source GC log analysis tool.
Example of the GC Limit file:
Measure | Limit | Remarks
MinTotalTime | 3600 | Seconds
MinFullGCCount | 0 | Min Full GCs that must be present in entire test
MaxFullGCCountPerHour | 10 | Max number of Full GCs per hour
MinThroughput | 95 | In %age
AvgMemAfterFullGC | 5 | % of total memory. If it is below this limit then the overall result is always made PASS
MaxSlopeAfterFullGC | 30000 | Bytes/second increase in used memory after Full GC (i.e. slope of the min used-memory values after Full GCs)
Jenkins is configured to orchestrate the following:
• After the load test is complete, initiate GCViewer to convert the raw GC log file into a GC summary report.
• Start the python program to evaluate the GC output. This compares the GC summary output (e.g. Full GC count, max slope after Full GC, freed memory after Full GC) with the GC Limit file and creates a GC statistics output file marking the result as Pass or Fail.
Example of Output produced by the python program for GC is:
Measure | Limit | Actual | Result | Overall Result | Remarks
MinTotalTime | 3600 | 57506 | Pass | Pass | Seconds
MinFullGCCount | 0 | 0 | Pass | Pass | Min Full GCs that must be present in entire test
MaxFullGCCountPerHour | 10 | 2 | Pass | Pass | Max number of Full GCs per hour
MinThroughput | 95 | 98 | Pass | Pass | In %age
AvgMemAfterFullGC | 5 | 0 | Pass | Pass | % of total memory. If it is below this limit then the overall result is always made PASS
MaxSlopeAfterFullGC | 30000 | 0 | Pass | Pass | Bytes/second increase in used memory after Full GC (i.e. slope of the min used-memory values after Full GCs)
Challenges in Automation and how we addressed them:
• The GC log file format changes as per the GC algorithm used, making it complex to process. We used the GCViewer tool as an intermediate step before analyzing.
• In manual GC analysis, the tester reviews the graphical trends of the GC behavior to decide on leaks and potential issues. For the automation tool, we evaluated many parameters such as the number of GCs, GC throughput etc., and finally settled on the parameter MaxSlopeAfterFullGC (i.e. the maximum memory slope between two Full GCs). If the slope shows a steep rise after two Full GCs and reaches the max limit, then there is a high probability of a memory leak.
Note: GC is the automatic memory management done by the JVM. The JVM reclaims memory/garbage occupied by objects that are no longer in use by the application by performing GC.
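The slope heuristic described above can be sketched as a simple least-squares fit over the used-memory values recorded immediately after each Full GC. The 30000 bytes/second limit follows the example Limit file; the data points and function names are illustrative:

```python
# Sketch of the MaxSlopeAfterFullGC heuristic: fit the trend of used memory
# immediately after each Full GC; a steep positive slope suggests a leak.
# Timestamps are in seconds, memory in bytes.

def slope_after_full_gc(points):
    """points: [(t_seconds, used_bytes_after_full_gc), ...].
    Returns the least-squares slope in bytes/second."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in points) / n
    mean_m = sum(m for _, m in points) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den if den else 0.0

def gc_leak_check(points, max_slope=30000.0):
    """Pass if post-Full-GC memory is not rising faster than the limit."""
    return "Pass" if slope_after_full_gc(points) <= max_slope else "Fail"
```

A flat post-Full-GC memory trend passes, while memory that keeps climbing after every Full GC (i.e. is never fully reclaimed) fails the check, flagging a probable leak.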
7 RUNNING AUTOMATION STEPS USING JENKINS
Jenkins is an open source automation server. With Jenkins, organizations can accelerate the software development process through automation. Jenkins manages and controls development lifecycle processes of all kinds, including build, document, test, package, stage, deployment, static analysis and many more.
A new job needs to be created in the Jenkins server for performance testing. The configuration required for performance automation is done in the Configure section of that job. The steps that need to be configured serially in Jenkins are:
• Configure the Vmstat command to run on all the servers used for application deployment, before load testing.
• Configure the steps involved in connecting to the system where the JMeter setup is done, executing the load test, capturing the result and transferring it to the Jenkins server for parsing and analysis.
• Transfer the Vmstat reports from the hosted servers to the Jenkins local disk for analysis.
• Transfer the GC log from the hosted application server to the Jenkins local disk for parsing and analysis.
Note: Here the configuration is a one-time activity for an application.
The automation steps that run once you trigger or schedule the job are:
• Reboot all the servers involved in application deployment.
• Restart the Application server, DB server and Web server involved in executing the application.
• Run the Vmstat command on each of the servers.
• Trigger load testing through JMeter.
• Transfer the response time result to the Jenkins server.
• Execute the python program to evaluate the JMeter output.
• Transfer the Vmstat output from the App/DB/Web servers to the Jenkins server.
• Execute the script to parse the server utilization result from the Vmstat report, compare it with the limit file meant for server utilization, and produce a server utilization result as pass/fail.
• Transfer the GC log from the Application server to the Jenkins server.
• Execute the script to parse the GC log, compare it with the limit file meant for GC, and produce a GC result as pass/fail.
• Execute a final script to read all the output results; if all results are pass then the performance testing result is declared as pass, otherwise fail.
• The final performance result is displayed in the application's dashboard in Jenkins.
The automated PT process can run as and when required by the application team.
8 DEPLOYMENT DIAGRAM OF PERFORMANCE TESTING AUTOMATION SETUP IN JENKINS SERVER
Steps involved in the setup are:
• The Jenkins server connects with the Linux servers (App server, DB server or Web server) involved in application deployment over SSH remote host, through the configuration management feature of Jenkins.
• The Jenkins server connects with the Windows system where JMeter is hosted using the master-slave feature, where the Jenkins server acts as the master and the Windows system acts as the slave. For this master-slave configuration, the Jenkins Slave Agent needs to be downloaded and installed on Windows.
9 BENEFITS OF AUTOMATING THE PERFORMANCE TESTING
Automated performance testing helps:
• In saving the effort and time spent on manual performance test setup before load testing.
• In avoiding the time spent on manual analysis of each test result to declare the assessment as pass or fail.
• In analyzing and mitigating performance issues before they have a chance to derail the system at the last moment or in production.
• By automating performance testing, organizations can catch performance issues early, before they become much more costly to fix, at any time during the agile cycle. Developers can instantly know that a new feature in the latest build has caused the application to no longer meet SLAs. They can fix the problem then and there, before it becomes exponentially more expensive.
• Automated performance testing helps ensure users get new features, not new performance issues. Integrating performance testing into the continuous build and integration process in DevOps, before new changes are deployed to production, will ensure that a new feature can meet the peak load condition and will not show any new performance issues in production.
10 LIMITATIONS IN PT AUTOMATION
The basic limitations in automating PT are:
• To configure the application servers with Jenkins, passwordless authentication must be available between the Jenkins server and the other servers, which enables automatic file transfer when the process is triggered.
• The limit files need to be created manually for each application, and the thresholds changed manually with changes in workload.
• The JMeter scripts need to be created and edited manually for changing workloads.
• If there is any build change for the application, the JMeter scripts have to be validated manually.
11 PLANNED WORK FOR FUTURE IN PT AUTOMATION
The automation activities planned for future execution in PT are:
• Automatic analysis of thread dumps and alerting the user in case of blocked threads in the application.
• Server utilization automation for Windows systems, considering the different performance counters in the PerfMon tool.
• A script to automatically change the JMeter script with changes in concurrent user load per thread.
• Configuration in Jenkins so that the PT job is triggered automatically and the scripts are validated with changes in the build version.
12 CONCLUSION
Agile development and Continuous Testing are increasing the productivity of teams and the quality of
applications. When adding load and performance testing into those processes, careful planning should occur to
ensure that performance is a priority in every iteration. The goal of load testing at the speed of Agile is to deliver
as much value to the users of an application through evolving features and functionality while ensuring
performance no matter how many users are on the app at any one time.
A load or performance testing solution for agile development using Jenkins helps:
• Both developers and performance testers to design and edit testing scenarios quickly, just by manually creating the test scripts and placing them in JMeter.
• Everyone on the development team to share test scenarios and test results.
• Easily integrate with new servers required for application deployment, and display result trends over several builds for regression testing.
• Produce reports that everyone on the team can understand and act on.
• Run tests that resemble the real world by accounting for users' network conditions and geographic locations.
13 ACKNOWLEDGEMENTS
Our sincere thanks to Mr. Mohan Jayaramappa, Head, SPD Trustworthy CoE, TCS, for encouraging us to write this white paper. Also a note of thanks to Debiprasad Swain, Head, Product Engineering SEG, and Pitabasa Sa, Product Manager of IPRMS and Automation in CEG, for helping us take up this experiment and concept.
Survey of Big Data Frameworks for Different Application Characteristics

Praveen Kumar Singh
TCS Research, Mumbai, India
Email: [email protected]

Rekha Singhal
TCS Research, Mumbai, India
Email: [email protected]
Abstract—Nowadays applications are migrating from traditional 3-tier architectures to Big data platforms, which are widely available in open source and can do parallel data processing on clusters of commodity machines. The challenge is to choose the “right” available Big data framework for an application given the available features of the framework. We have proposed a rule base methodology for making this choice. We first categorize an application into 4 major categories - iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications - based on their features. The proposed rule base stores the level of support given by popular Big data frameworks for each of the application's features. We present the rule base for each of these application categories in this paper.
Keywords—SQL, NoSQL, Machine Learning, Rule Base, Hadoop, Spark, Flink, Iterative Computation, Streaming, Graph, Framework
I. INTRODUCTION
In the world of Big data, enormous volumes of data, on the order of petabytes and yottabytes, are generated by sources such as social networks, satellites, sensors, user devices and search engines. To analyze this data and extract useful information from it, large-scale data processing frameworks have been developed. The open source community provides tools to implement and execute parallel, complex and scientific applications, most of which need parallel and distributed computation. Hadoop is not suitable for iterative computation, yet most scientific computations and machine learning algorithms require iteration. Spark[13], Twister[7], Haloop[6] and Flink[1] have emerged as alternatives to the MapReduce framework, providing better support for applications that require iterative computation.
These frameworks are parallel and distributed in nature: programmers can distribute their data across a cluster and execute their tasks in parallel over the data present in the cluster.
Frameworks like Hadoop and Twister involve disk I/O at each step, whereas Spark reduces I/O activity through its in-memory model. Performance does not depend only on the framework but also on the application, so applications need to be categorized as I/O intensive or CPU intensive.
There are plethora’s of Big data frameworks available forexecuting various types of applications. There is always achallenge for a user to choose the right Big data processing
platform for his/her problem statement. This paper providesa mapping of different type of applications and workloadto possible set of data processing frameworks. We furthercategorize application based on their features such as speed,latency, dependency, language etc. For the perspective of userswe are trying to put comparisons among different computationframework. In order to help users to choose appropriateframework, which can be used to process their real worldproblem. We are categorizing the application to the differentframework based on their suitability and features. In the cat-egorization, we will focus on the application classes like Ma-chine Learning and Iterative computation based Application,SQL Query based Application, Online Transaction ProcessingApplication (OLTP) and Streaming based Application. Someof the open source Big data frameworks like Hadoop[3],Spark[13], Flink[1] with their pros, cons and Rule base forchoosing these frameworks. The contribution of this paper is todefine features for different types of applications and matchingit to appropriate Big data framework based on several studiesavailable in the literature.
The rest of the paper is organized as follows. In the next section we discuss the various application classes; Section 3 explains the different Big data frameworks; Section 4 presents the rule base for choosing these frameworks; and Section 5 concludes.
II. APPLICATION CLASSES
All domains, such as Banking, Media, Health care, Education, Manufacturing, Insurance, Government, Retail and Transportation, face the challenge of processing Big Data. Most of their applications involve the following types of use cases for processing large data sizes.
1) Knowledge Discovery in Databases: applications that identify hidden data or extract useful information from given data, applicable to identifying current market trends and gaining more insight into customer data.
2) Fraud Detection and Prevention: used to identify fraudulent cases, which happen in every sector, so that they can be prevented before they occur.
3) Device-generated Data Analytics: enormous volumes of data are generated by sensors, remote devices, mobiles and satellites. The generated data can be used for weather forecasting and to provide intelligence based on location.
4) Social Networking and Relationship Analytics: social networking sites generate huge amounts of data every second that need to be analyzed in real time, with correct identification of relationships, and the results returned to the user.
5) Recommendation-based Analytics: in the digital world users spend most of their time on their devices; this data needs to be analyzed so that useful information can be recommended to the user, keeping users engaged through recommendations.
In this paper we provide a rule base for choosing a framework, categorizing applications broadly into iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications.
A. Machine Learning and Iterative Computation based Applications
Machine Learning is a part of Artificial Intelligence in which a machine is trained with past data to make decisions about the future; we apply an algorithm to the data to obtain an output. To apply machine learning, a few steps need to be followed:
1) Collect data from the various sources.
2) Prepare the data well by analyzing it and handling outliers and missing values.
3) Split the data into training and testing parts to build the model and test it.
4) Check the accuracy of the model by evaluating the algorithm's outcome.
5) Improve performance where possible, for example by selecting a different model altogether.
Supervised learning is also known as task-driven learning, in which a model is learned from past data to make future predictions. Learning can be defined through the class attribute: if the attribute is discrete the task is classification, and if the attribute is continuous it is regression. Decision Trees, SVMs and regression fall into this category. Unsupervised learning is used for cluster analysis on a given input data set with the help of some objective function. The well-known K-Means clustering algorithm and Principal Component Analysis (PCA) come under this category.
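As an illustration of the unsupervised case mentioned above, the K-Means loop (assign points to the nearest centroid, then recompute centroids as cluster means) can be sketched in a few lines of plain Python. This is a minimal pedagogical sketch, not the implementation used by any of the surveyed frameworks; the data points and cluster count are invented for the example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-Means on 2-D points: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                          + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster emptied out
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# two well-separated blobs: centroids should land near (0,0) and (10,10)
pts = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.1),
       (10.1, 9.9), (9.8, 10.2), (10.0, 10.0)]
cents = sorted(kmeans(pts, 2))
print(cents)
```

At Big data scale, frameworks such as Spark MLlib run the same assign/recompute iteration in parallel over a partitioned data set, which is exactly why iteration support matters in the rule base below.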
B. SQL Query based Application
Big data frameworks provide functionality for users to run and analyze queries on data stored on the Hadoop Distributed File System (HDFS) or S3. Hive on Hadoop MapReduce, SparkSQL on Spark and many others provide APIs for dealing with SQL queries. Through these frameworks, data can be viewed and analyzed with basic queries such as aggregations and joins, even on the fly for business purposes.
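The aggregation-plus-join query shape these engines target can be illustrated with the stdlib sqlite3 module standing in for a SQL-on-Hadoop engine; the table and column names are invented for the example, and the same SQL would run (with dialect differences) on Hive or SparkSQL over HDFS/S3 data.

```python
import sqlite3

# sqlite3 stands in for a SQL-on-Hadoop engine such as Hive or SparkSQL;
# the join + aggregation shape of the query is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(order_id INTEGER, cust_id INTEGER, amount REAL);
    CREATE TABLE customers(cust_id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'APAC'), (2, 'EMEA');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.cust_id = c.cust_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('APAC', 150.0), ('EMEA', 75.0)]
```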
C. Online Transaction Processing Application (OLTP)
These applications have read, write and update operations, and each transaction's execution is associated with semantics such as ACID or CAP. Most NoSQL frameworks, such as HBase, Neo4j and MongoDB, support OLTP workloads. Some frameworks, such as GraphLab and Pregel, are specialized for graph-based data, which is required for social networking applications. The amount of data generated by social networking sites and other sources is growing at an exponential rate. Graph-based models used for parallel computation on graphs ensure that the data stays consistent, and different techniques are applied to extract useful information from these huge amounts of data. Graph analysis is one way to understand complex relations and to detect the different patterns present in large data. Available frameworks for graph processing include GraphX[8], GraphLab[9], Giraph[2] and Pregel[11]. PageRank, Connected Components, SVD++, Strongly Connected Components and Triangle Count are applications that belong to this category.
D. Streaming based Application
Processing a large amount of data is a challenging task, but taking decisions on the fly in real time is even more challenging; this is where stream processing comes into the picture. Most business firms focus on streaming analysis to take business decisions, such as making recommendations to users/customers/clients in real time. Stream processing mainly performs calculations, statistical analysis and continuous queries over a stream of data in real time. Available frameworks for stream processing include Spark Streaming[14], Apache Storm[4] from Twitter, and IBM InfoSphere Streams[5]. Fraud detection, e-commerce, feeds, page views and trading are streaming based applications.
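The "continuous query" idea above, e.g. a moving aggregate over the most recent events, can be sketched with a stdlib deque. This is an illustrative single-process sketch with invented data; engines like Storm or Spark Streaming distribute the same window semantics across a cluster.

```python
from collections import deque

def windowed_sums(stream, window=3):
    """Continuous-query sketch: for every incoming event, emit the sum of
    the last `window` events, the way a streaming engine evaluates a
    sliding-window aggregate."""
    buf, out = deque(maxlen=window), []
    for event in stream:
        buf.append(event)   # oldest event falls off automatically
        out.append(sum(buf))
    return out

# e.g. transaction amounts arriving one at a time
print(windowed_sums([10, 20, 30, 40], window=3))  # [10, 30, 60, 90]
```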
III. BIG DATA FRAMEWORKS
In this section, we discuss the features of a few of the popular Big data frameworks available in open source.
A. Hadoop
Hadoop is generally used for batch processing and is designed for the parallel algorithms used in scientific applications. Because of the limitations of its single-cluster resource management, an improved version of Hadoop[3] was released as Hadoop YARN[3]. The associated ML tool is Mahout. Hadoop is unsuitable, first, when communication between splits is required; second, for long-running MR jobs; and last but not least, because there is no concept of in-memory processing and caching across MR loops.
B. Spark
Spark is also a programming model used to process large data. According to the Spark community[13], it is much faster than Hadoop; the main reason behind this claim is that instead of placing intermediate data on disk, it stores it in memory, saving the I/O time required for storing and retrieving data from disk. Spark introduces a memory abstraction called resilient distributed datasets (RDDs); functions run in a parallel manner on each record of HDFS or any available storage. Two types of operations are performed on an RDD: transformations, which are lazily evaluated, and actions, which are applied to the transformed RDD. An RDD is a collection of objects partitioned across the cluster and rebuilt when a partition is lost; the lost partition can be rebuilt using lineage[12]. Spark scheduling is done using a directed acyclic graph (DAG): a job has multiple stages, which can be scheduled simultaneously if they are independent. Spark supports a wide range of application categories through machine learning, Spark SQL, Spark Streaming and Spark graph computation. The associated ML tools are MLlib, Mahout and H2O.
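The transformation-versus-action split described above can be imitated in plain Python with generators: like RDD transformations, generators build a pipeline without executing anything until an action-like call forces it. The class below is a toy imitation for illustration only, not the Spark API.

```python
class LazyDataset:
    """Toy imitation of the RDD idea (not the Spark API): transformations
    build a lazy pipeline; only an action forces evaluation."""
    def __init__(self, data):
        self._it = iter(data)

    def map(self, f):            # transformation: nothing executes yet
        return LazyDataset(f(x) for x in self._it)

    def filter(self, pred):      # transformation: still lazy
        return LazyDataset(x for x in self._it if pred(x))

    def collect(self):           # action: forces the whole pipeline to run
        return list(self._it)

result = (LazyDataset(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

In Spark the same laziness lets the DAG scheduler fuse transformations into stages before anything touches the cluster; here it simply means no element is squared or filtered until `collect()` runs.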
C. Flink
Flink is an open source platform for hybrid, interactive, real-time and real-world streaming (out-of-order streams, windowing, back-pressure) and native iterative processing[1]. Its major abstraction is cyclic data flows. The associated ML tools are Flink-ML and SAMOA. Flink is an in-memory, low-latency processing engine.
D. GraphX
The GraphX[8] framework is used for graph processing on top of Spark. It has low-cost fault tolerance, and through its iterative processing of graphs it reduces memory overhead. Different APIs are available in the different graph processing frameworks.
E. Spark Streaming
Hadoop doesn’t have full controlled solution for streamingapplication, even though they are able to process micro-batch job[10]. Spark streaming provides strong consistency,scalability, parallel and fault recovery due to introduction ofdiscretized streams(D-Streams)[14] programming model forintermix streaming, batch and interactive queries.
IV. RULE BASE FOR CHOOSING FRAMEWORK
In this section we provide a rule-base mapping between the applications and frameworks discussed in Sections 2 and 3 respectively. The first column of Table I presents the different features of a SQL query based analytic workload, based on our experience in client projects; the remaining columns represent the different Big data frameworks that support processing of SQL query based analytic workloads. Each cell in the table shows the match between the feature and the framework. Similarly, Table II presents the rule-base mapping for the NoSQL workload categories, Table III the features of frameworks for iterative computation based applications, and Table IV the features of frameworks for streaming applications.
V. CONCLUSIONS AND FUTURE WORK
There is a plethora of Big Data frameworks available in open source for parallel processing of CPU- and/or data-intensive applications, and it is a humongous task for a user to select the right platform for his/her workload. In this paper we have categorized applications broadly into SQL query based analytic workloads, NoSQL workloads, iterative computation based applications and streaming applications, and have specified features for each of these categories. Finally, we have presented a rule-base mapping for each category, specifying the level of support provided by the various available Big data frameworks for each of the specified features. In future work we shall benchmark these Big data platforms for the different types of applications and features mentioned in the paper, and corroborate the rule base with actual measurements. We also plan to integrate this rule-base mapping with our larger project on migrating applications deployed on traditional systems to Big data platforms.
REFERENCES
[1] http://flink.apache.org/.
[2] http://giraph.apache.org/.
[3] http://hadoop.apache.org/.
[4] http://storm.apache.org/.
[5] http://www-03.ibm.com/software/products/en/ibm-streams.
[6] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1):285–296, 2010.
[7] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In S. Hariri and K. Keahey, editors, HPDC, pages 810–818. ACM, 2010.
[8] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In J. Flinn and H. Levy, editors, OSDI, pages 599–613. USENIX Association, 2014.
[9] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716–727, 2012.
[10] S. Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3(4):117–129, 2014.
[11] C. E. Tsourakakis. Pegasus: A system for large-scale graph processing. In S. Sakr and M. M. Gaber, editors, Large Scale and Big Data, pages 255–286. Auerbach Publications, 2014.
[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In S. D. Gribble and D. Katabi, editors, NSDI, pages 15–28. USENIX Association, 2012.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In E. M. Nahum and D. Xu, editors, HotCloud. USENIX Association, 2010.
[14] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In R. Fonseca and D. A. Maltz, editors, HotCloud. USENIX Association, 2012.
TABLE I: Rule Base for SQL Analytic Workload

Feature                            | Hive                  | Presto                                     | Drill                        | Spark
Framework                          | MapReduce             | MapReduce                                  | Pipeline Queries             | MapReduce and DAG
Model                              | Batch Processing      | Real Time Interaction using Query Pipeline | Real Time Interaction        | Micro-Batch Processing, Streaming
Speed                              | Fast                  | 10x faster than Hive                       | Fast                         | 10x faster than MR on disk, 100x faster in memory
Capability of Processing Data Size | Terabytes             | Petabytes                                  | Gigabytes to Petabytes       | Petabytes
ML Support                         | No                    | No                                         | No                           | Yes
Language Support                   | Java API, SQL         | C, Java, PHP, Python, R, Ruby, SQL         | ANSI SQL, Mongo QL, Java API | Scala, Java, Python, SQL
Horizontal Cluster Scalability     | Scalable (100+ nodes) | Scalable (1000+ nodes)                     | Scalable (100+ nodes)        | Scalable (1000+ nodes)
Optimization                       | Query/Cost-Based      | Not Yet                                    | Rule Based/Cost Based        | Catalyst
YARN Support                       | Yes                   | Not Yet                                    | Yes                          | Yes
File System                        | Local/HDFS/S3         | HDFS/S3                                    | Local/HDFS/S3/MapR-FS        | HDFS/Local/S3/HBase/Cassandra
TABLE II: Rule Base for NoSQL Workload
(Key Value Store: Redis, Riak. Document Store: MongoDB, CouchDB. Graph Based: Titan, Neo4j. Wide Column: HBase, Cassandra.)

Feature | Redis | Riak | MongoDB | CouchDB | Titan | Neo4j | HBase | Cassandra
Schema Flexibility | High | High | High | High | High | High | Moderate | Moderate
Implementation Language | C | Erlang | C++ | Erlang | Java | Java | Java | Java
Distribution/Replication | Master-Slave | Selectable replication factor | Master-Slave | Master-Master/Master-Slave | Multi-Master replication factor | Master-Slave replication factor | Selectable replication factor | Selectable replication factor
Design and Features | Sorted sets | Vector clocks | Indexing, GridFS | Partitioning | Pluggable backends (Cassandra, HBase, MapR, Hazelcast) | ACID applicable | MapReduce support | Built-in data compression and MapReduce support
Query Method | Through key commands | MapReduce term matching | Dynamic object-based language and MapReduce | MapReduce of Javascript funcs | Gremlin, SparQL | SparQL, native Java API | Internal API, MapReduce | Internal API, SQL-like (CQL)
Concurrency | In memory | Eventually consistent | Update in place (master-slave with multi-granularity locking) | MVCC (application can select optimistic or pessimistic locking) | ACID/Tunable C | Non-blocking reads; write locks involved nodes/relationships until commit | Optimistic locking with MVCC | Tunable consistency
API and Other Access Methods | Proprietary protocol | HTTP API, native Erlang interface | Proprietary protocol using JSON, binary (BSON) | RESTful HTTP/JSON API | Java API, Blueprints, Gremlin, Python, Clojure | Cypher query language, Java API, RESTful HTTP API | Java API, RESTful HTTP API, Thrift | Thrift, custom binary CQL
Complexity | None | None | Low | Low | High | High | Low | Low
CAP | AP | AP | CP | CP | AP/CP | — | CP | AP/CP
Partitioning Scheme | Consistent hashing | Consistent hashing | Consistent hashing | None | — | — | Range based | Consistent hashing
File System | Volatile memory file system | Bitcask, LevelDB | Volatile memory file system | Volatile memory file system | Volatile memory file system | Volatile memory file system | HDFS | Cassandra File System
Map Reduce | No | Yes | Yes | Yes | Yes | No | Yes | Yes
Horizontally Scalable | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes
TABLE III: Rule Base for Iterative Computation Application

Feature                        | Hadoop                       | Twister                       | Flink                                     | Spark
Framework                      | MapReduce                    | MapReduce                     | PACT-MapReduce                            | MapReduce
Model                          | Batch                        | —                             | Operator-based                            | Micro-batch
Auto-Tuning/Optimization       | Manually/NA                  | Manually/static data caching  | Auto-tuned/caches static data path        | Manually/static data cached in iteration
Latency                        | Low                          | Medium                        | Medium (better than Spark)                | Medium
Fault Tolerance                | Medium                       | Medium                        | High                                      | High (through lineage)
Memory Management              | Manual                       | Manual                        | Automatic                                 | Automatic (Spark 1.6.0+)
Iteration Nature               | No                           | Yes                           | Yes                                       | Yes
General Purpose                | ETL                          | Iterative algorithms          | ETL/Machine Learning/Iterative algorithms | ETL/Machine Learning/Iterative algorithms
Storage Support                | HDFS/local file system/S3    | Local file systems            | HDFS/S3/MapR/Tachyon                      | HDFS/local file system/S3/HBase/Cassandra
Language Support               | Java                         | Java                          | Java, Scala, Python                       | Java, Scala, Python
Horizontal Cluster Scalability | Scalable (100 to 1000 nodes) | Scalable (100 to 1000 nodes)  | Highly scalable (100 to 10000 nodes)      | Highly scalable (100 to 10000 nodes)
TABLE IV: Rule Base for Streaming Application

Feature | Hadoop | Spark Streaming | Storm | Flink | Samza
Processing Framework | Batch | Batch, Streaming | Real-time event-based processing | Batch, real-time event-based processing | Real-time stream processing
Streaming Model | Batch | Micro-batching | Native/micro-batching with Trident API | Native/micro-batching by accumulating stream messages | Native/relies on Kafka for internal messaging
Response Time | Minutes | Seconds | Milliseconds | Milliseconds | Sub-seconds
Functionalities/Supports | YARN, HDFS, MapReduce | YARN, HDFS, Mesos | Runs on YARN, interacts with Hadoop and HDFS | Runs on YARN as an application | Requires YARN and HDFS
Stream Source/Primitive/Computation | NA | Receivers/D-Streams/transformations, window operations | Spouts/Tuples/Bolts | DataStream | Consumers/Messages/Tasks
Delivery Semantics | Exactly once | Exactly once (except in some failures) | At least once (exactly once with Trident) | Exactly once | At least once
State Management | Stateless | Stateful (writes state to storage), dedicated DStream | Not built-in/stateless (roll your own or use Trident) | Stateful operators | Stateful (embedded key-value store)
Language Supported | Java | Scala, Java, Python | JVM languages, Ruby, Python, Javascript, Perl | Java, Scala, Python | Scala, Java (JVM languages only)
API | — | Declarative | Compositional | Declarative | Compositional
Latency | Minutes | Seconds | Milliseconds | Milliseconds | Milliseconds
Throughput | NA | 100k+ records per node per second | 10k+ records per node per second | 100k+ records per node per second | 100k+ records per node per second
Scalability to Large Input Volume (streams) | No | No | Yes | Yes | Yes
Fault Tolerance (completing the computation correctly under failure) | Yes | Yes | Yes | Yes | Yes
Accuracy and Repeatability | No | No | No | Yes | No
Queryable (querying the results inside the stream processor without exporting them to an external database) | No | No | No | No (upcoming feature) | No
In-Memory Processing | No | Yes | Yes | Yes | No
Resource Manager | YARN | YARN, Mesos | YARN, Mesos | YARN | YARN
Supported ML Tools | Mahout | Mahout/MLlib/H2O | SAMOA | Flink-ML/SAMOA | SAMOA
Performance Engineering Best Practices
for Data Warehouse applications
Authors: Gaurav Kodmalwar, Niteen Yemul, SAS R&D India Pvt. Ltd., Pune
Data warehouse applications differ in architecture from regular client-server based applications, and the performance testing and engineering strategy for them is very different as well. This paper describes the basic architecture and concepts of a data warehouse that are required prior to starting performance testing of a data warehouse application. It covers performance testing and engineering strategies for baselining application performance as the volume of data grows at the different layers of the data warehouse. In this paper, references to SMP (Symmetric Multi-Processing) and MPP (Massively Parallel Processing) databases are intentionally avoided, and examples are explained with SAS or SPDS (Scalable Performance Data Server) file systems in comparison with undisclosed SMP and MPP databases.
Introduction: The need for performance testing of data warehouse based applications such as Business Intelligence (BI) and Business Analytics (BA) is increasing due to the growing volume of data in data warehouse and big data environments. Usually the scope of performance testing in such applications is focused only on ETLs (extract, transform, load). ETLs are scheduled to complete during offline hours, unless the application is real time. The performance of data warehouse applications needs to stay within the service level agreement (SLA); otherwise it may even impact end users accessing the application during online hours. This paper is based on practical learnings gathered while conducting performance testing of a data warehouse application. The ETL scenarios mentioned in the paper are kept generic in name, to explain a data warehouse application's architecture and its performance considerations, with actual performance numbers wherever possible. The architecture may vary for similar applications in different domains, and the approach for testing and troubleshooting needs to change accordingly. This paper focuses more on ETL in the data warehouse than on BI reporting.
Experimental Setup
Typically, in a data warehouse based solution, the end user gets business (BI or BA) reports after the completion of multiple ETLs (e.g. an ETL for loading data from the production or OLTP (Online Transaction Processing) system into staging tables, then an ETL for loading data into the mart, and finally an ETL for loading mart data into smaller report/output tables). It is only pragmatic that the SLA (Service Level Agreement) for the first few ETLs will be higher than for later ETLs, since the data volume processed by the initial ETLs is very high. In such circumstances, the scope of performance testing is driven by multiple factors: it may be decided based on data volume, frequency (the scheduled period, i.e. monthly/weekly/daily) and/or entirely on business priority.
This paper is based on the experience gained while testing a banking-related solution which had 8 different ETLs, all of which were tested for performance. After each iteration of testing, certain tuning strategies were applied to improve performance. The best practices stated in this paper cover the various tuning strategies gathered while performance testing the first two ETLs, which were both volume and server-resource intensive. For most of the ETLs, history is maintained in the source and/or target tables. In the case of our solution, 34 months of history was maintained for the first ETL and 11 months of history for the 3rd ETL. The execution times of these two ETLs were measured for the 35th and 12th months respectively; this helps estimate the execution time at the customer end after 1 year or 3 years.
Paper Brief: The following diagram shows a generic data warehouse schema. In this section each database layer is explained at a high level, along with tips for improving performance.
Figure 1: Data Warehouse Data Flow Process (Image Source: Internet)
The integration layer retrieves data from the OLTP or source system at regular intervals (daily/weekly/monthly/yearly). Mostly, the integration layer is the starting point of the data warehouse design. Due to the variety of OLTP/source systems, bringing data into the integration layer from the OLTP/source system is kept outside the scope of data warehouse solutions. As the integration layer accumulates huge amounts of data over time, it may slow down the ETL processes that extract data from the integration layer to load into the data warehouse.
Limitation
Since data in the integration layer keeps accumulating over time, appending new data from the OLTP/source system to the integration layer becomes a time-consuming process. There is additional overhead when indexes exist on the integration layer.
Performance Improvement Guideline (for the above limitation)
To overcome the above limitation, multiple options are available depending upon the data model and business requirements. Capturing only the changed transactional data in the operational systems, and using it as the input data source while loading data into the integration layer, will show a performance improvement. This can be achieved with a Change Data Capture (CDC) object, which is available in most data integration suites.
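The CDC idea, loading only the rows that changed since the last run instead of re-reading the whole source, can be sketched as a snapshot diff in plain Python. This is illustrative only, with invented data; real CDC objects typically read the database's transaction log rather than comparing snapshots.

```python
def capture_changes(previous, current):
    """Snapshot-diff sketch of Change Data Capture: return only the rows
    that were inserted or updated since the last load, keyed by primary
    key. Production CDC reads the transaction log instead of diffing."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("INSERT", key, row))
        elif previous[key] != row:
            changes.append(("UPDATE", key, row))
    return changes

previous = {1: ("Alice", 100), 2: ("Bob", 200)}
current  = {1: ("Alice", 100), 2: ("Bob", 250), 3: ("Carol", 50)}
print(capture_changes(previous, current))
# [('UPDATE', 2, ('Bob', 250)), ('INSERT', 3, ('Carol', 50))]
```

Only the two changed rows flow into the integration layer load; the unchanged row is skipped entirely, which is the source of the performance improvement described above.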
The data warehouse keeps the same operational system data, but in a different structure (a dimensional modeling schema) that is accumulated over a time period. More details about dimensional modeling are provided in the next section. Some data warehouse designs don't contain an integration layer and load data directly into the data warehouse from operational systems using ETL.
For the first time period, the data warehouse load (ETL) process might take longer due to the volume of data being processed and loaded. During the dimension load process for the first month, all the business keys (account/customer etc.) accumulated over all previous time periods get loaded; because of this, the volume of dimension data for the first time period is higher than for the next consecutive time periods. One of the few techniques available to load data faster during the first time period is BULKLOAD. BULKLOAD can be used to load data into the staging area from production/OLTP. A summary of this is as follows:
a) Loading data into staging area tables that have indexes will take a very long time, because of the additional overhead of updating the index during the data loading process.
b) In some databases, if the target tables in the staging area are empty, BULKLOAD can complete the same operation in a smaller amount of time: BULKLOAD inserts into new empty blocks and does not fill unused space in existing blocks that aren't completely full.
c) SAS uses a SAS-specific interface and its methods as part of SAS Access for a database, a seamless interface provided between SAS and other databases. This technology is used to load tables from SAS datasets to databases and vice versa. Thus, reusable code speeds development and, more importantly, makes the code easier to maintain.
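Point (a) can be demonstrated with the stdlib sqlite3 module standing in for the staging database: dropping an index before a large load and recreating it once afterwards avoids per-row index maintenance. The table, column names and row counts below are invented for the sketch; absolute timings will vary by engine.

```python
import sqlite3
import time

# sqlite3 stands in for the staging database; the pattern illustrated is
# drop-index / bulk-load / recreate-index versus loading with a live index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging(k INTEGER, v TEXT)")
conn.execute("CREATE INDEX idx_k ON staging(k)")
rows = [(i, f"row{i}") for i in range(50_000)]

def timed_load(with_index):
    conn.execute("DELETE FROM staging")
    if not with_index:
        conn.execute("DROP INDEX idx_k")
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - t0
    if not with_index:
        conn.execute("CREATE INDEX idx_k ON staging(k)")  # rebuild once, in bulk
    return elapsed

t_indexed = timed_load(with_index=True)
t_bulk = timed_load(with_index=False)
print(f"load with live index: {t_indexed:.3f}s; load then reindex: {t_bulk:.3f}s")
```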
For subsequent loads, only the changed data is tracked and loaded. This shows a performance improvement in later load processes compared to the first time period. Some bottlenecks in the SQL query or code can be discovered in the first-month load process whenever the first time period's ETL is slower than the following time periods. Table#1 shows the response time of the jobs loading different dimensions for the 1st month, 2nd month and 35th month. As mentioned earlier, the 1st month job's response time is high due to the high volume of data, and from the second month onwards the response time increases gradually as the changed data volume grows. A performance issue in a SQL query of a big database was found at higher volume while generating the surrogate key for the first month; after the fix, the response time for the 1st month was brought down to a few minutes.
Table Name  | 1st Month (Response Time / Rows Loaded) | 2nd Month (Response Time / Rows Loaded) | 35th Month (Response Time / Rows Loaded)
Dimension 1 | 1:02:53 / 30,735,406                    | 0:00:04 / 0                             | 0:02:57 / 3,948,846
Dimension 2 | 0:23:55 / 25,181,230                    | 0:00:45 / 11,981,230                    | 0:00:03 / 1
Dimension 3 | 0:22:44 / 20,491,766                    | 0:00:04 / 0                             | 0:00:51 / 2,632,770
Table#1
SQL query for inserting/updating the dimension tables mentioned in Table#1:

proc sql;
  connect to database (SERVER="xxx" AUTHDOMAIN="xxxxxx");
  execute (insert into "dim".APP_DIM2(
             APP_SK, APP_RK, PRODUCT_RK, CUSTOMER_RK, VALID_END_DTTM, PROCESSED_DTTM)
           select FUNCT(1,1) + 0 FUNCT(1,APP_RK),
             APP_RK, PRODUCT_RK, CUSTOMER_RK, XREF_TODATE, PROCESSED_DTTM
           from "spafm_scratch".mrg_APP_DIM
           where ACTION in ('UPDATE','INSERT');) by database;
  execute(commit) by database;
  disconnect from database;
quit;

In the above query, the surrogate key generation logic was changed instead of using the inbuilt function (FUNCT) provided by the MPP database. The FUNCT function was not efficient at high data volumes, although it was easy to use and caused no slowness at low or medium data volumes.
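FUNCT above is a built-in of the undisclosed MPP database, so its replacement cannot be shown verbatim; a common surrogate-key pattern of this kind is to seed from the current maximum key in the dimension and number the incoming changed rows from there. The sketch below illustrates that pattern with stdlib sqlite3 and invented table/column names.

```python
import sqlite3

# Illustrative surrogate-key pattern (not the paper's actual fix): seed from
# the current MAX(key) in the dimension and number the incoming changed rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_dim(app_sk INTEGER PRIMARY KEY, app_rk INTEGER)")
conn.execute("INSERT INTO app_dim VALUES (1, 101), (2, 102)")  # existing rows

incoming_rks = [103, 104, 105]  # changed rows for this load
max_sk = conn.execute("SELECT COALESCE(MAX(app_sk), 0) FROM app_dim").fetchone()[0]
conn.executemany(
    "INSERT INTO app_dim VALUES (?, ?)",
    [(max_sk + i, rk) for i, rk in enumerate(incoming_rks, start=1)],
)
loaded = conn.execute("SELECT app_sk, app_rk FROM app_dim ORDER BY app_sk").fetchall()
print(loaded)  # [(1, 101), (2, 102), (3, 103), (4, 104), (5, 105)]
```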
A star schema is a fact table surrounded by dimensions in the integration layer of a data warehouse. Data is extracted from this schema using a query called a star join query. A star schema stores data in a partially normalized format; the fully normalized variant is called a snowflake schema. In well-designed data warehouse solutions, these queries are used while loading data into the data mart, or for extracting data from the star schema on the fly.
Performance Engineering Best Practices:
Stated below are some of the best practices that can be applied across any data warehouse solution to gain performance improvements. For ease of reading, the tuning guidelines are segregated into Basic and Advanced levels. Basic level guidelines are handy tips which can be applied to bottlenecks uncovered in the preliminary investigation phase; Advanced level guidelines can be applied after detailed investigation.
Basic Level of Tuning Guidelines
1. Indexing
Creating an index on the time column, on top of the existing index on the business key/primary key of the transactional tables, will help improve the performance of the extraction process. However, the 'transaction time' column should not be null; otherwise it will slow down performance due to duplicate values in the index. Also, time indexes may not be needed by other solutions sharing data in the same integration layer, so this suggestion is acceptable only if it benefits all solutions. As shown in Chart#1, the performance of the extract jobs improved after creating indexes on the transaction time of the transactional tables. The improvement will differ depending upon the type of transaction table, its columns, rows and index columns. Note that this may slow down data loads from the OLTP/source system into the integration layer, due to the overhead of index checks.
Graph 1: Extract Job Performance Comparison
2. Verify Join (where clause) against indexes
The data warehouse holds a much larger volume of data than the integration layer; fact and dimension tables can reach sizes of, e.g., 200 gigabytes per table, and queries fetching data from such tables are correspondingly slower. These queries should be written in the intended star join form, with surrogate keys used in the WHERE clause. As shown in Table #2, performance improved after a SQL query was rewritten as a full star join (i.e., the surrogate keys of all relevant dimensions appear in the WHERE clause) instead of using the surrogate keys of only a few dimensions.
Query Type           Execution Time
Partially Star Join  0:10:00
Fully Star Join      0:03:00
Table#2
Data mart tables are extracted from the data warehouse through star join/snowflake queries. Data marts are aggregated tables for a specific business purpose and time period (e.g., monthly or yearly) and are small compared to the data warehouse. Most of the time there are also detail tables, extracted by analyzing data in the marts, used especially for reporting and mining, as well as cubes, which store data aggregated across multiple dimensions and are used for analysis of categorical data selected per business need. Due to the small volume of data in data marts and cubes, obvious performance issues are less likely there than at the integration layer or the data warehouse; the performance issues found at data marts and cubes were mostly due to inefficient SQL queries and were resolved by rewriting them.
Advanced Level of Tuning Guidelines
Focus Area: Design Level Optimization
1. Compression
Compression techniques are used by databases and big data platforms to save disk space and to reduce IO while reading/writing data on disk. Query execution performance varies with the compression technique used; however, modern compression techniques are good enough that adverse effects on query execution are rare. It is still worth testing different SQL queries on compressed tables to rule out any degradation in execution time. The following graphs show the disk space savings and execution-time improvements for a few experimental tables and SQL queries.
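As a rough illustration of why compressed tables can also be faster to scan, the sketch below compresses repetitive, warehouse-style rows with Python's zlib; the achieved ratio stands in for the IO saved per read. This is only a model of the idea, assuming made-up row data, not the MPP database's own compression scheme.

```python
import zlib

# Illustrative sketch only: real database page/column compression differs
# from zlib, but the space-vs-CPU trade-off is the same idea.
def compression_ratio(rows):
    raw = "\n".join(rows).encode()
    packed = zlib.compress(raw, 6)
    return len(raw) / len(packed)

# Repetitive, low-cardinality warehouse-style rows compress very well.
rows = [f"2016-01-{d:02d}|STORE_{d % 5}|PRODUCT_{d % 3}|{d * 10}"
        for d in range(1, 29)]
print(f"compression ratio: {compression_ratio(rows):.1f}x")
```

A scan of the compressed table reads proportionally fewer bytes from disk, which is where the execution-time improvement in Graph 3 comes from.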
Graph 2: Table size in Gigabytes Graph 3: Query Execution Time
2. Partition
Very large fact and dimension tables can both be physically partitioned over a time period for faster query execution: yearly, monthly, or even daily, depending on the data volume. Partitioning is best designed in from the beginning; retrofitting it into an existing design is complex.
In one specific example, tables were partitioned on a date-time column by month or by week, depending on data access requirements: a table whose data is accessed weekly was partitioned by week, and a table accessed monthly was partitioned by month. If a single table serves both weekly and monthly queries, weekly partitioning is usually the better method. In this example, a set of jobs that used to run for 2 days completed in less than 7 hours.
The graph below illustrates the advantage of partitioning. SQL queries were run on tables of varied sizes, first without any partitioning (initial results) and then after partitioning (final results). Queries on large tables gained several hours; for example, a query that used to consume 4 hours 20 minutes took just 10 minutes after partitioning and compression.
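The mechanism behind these gains is partition pruning: a query restricted to one time period touches only that partition's data and skips the rest of the table. A minimal sketch of the idea, with invented table and column names:

```python
from collections import defaultdict

# Toy model of partition pruning: rows are stored per monthly partition,
# so a query restricted to one month scans only that partition.
class PartitionedFact:
    def __init__(self):
        self.partitions = defaultdict(list)  # month -> rows
        self.rows_scanned = 0

    def insert(self, month, row):
        self.partitions[month].append(row)

    def query_month(self, month):
        part = self.partitions[month]  # pruning: other partitions untouched
        self.rows_scanned = len(part)
        return [r for r in part if r["amount"] > 0]

fact = PartitionedFact()
for m in ("2016-01", "2016-02", "2016-03"):
    for i in range(1000):
        fact.insert(m, {"amount": i})

fact.query_month("2016-02")
print(fact.rows_scanned)  # 1000 rows scanned, not 3000
```

A real database does the same with physical partitions on disk, which is why the gain grows with table size.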
Graph 4: SQL Queries in ETL, before and after Partition
3. Parallelism
Although the term ETL looks simple, a deployment typically has many jobs (different ETLs designed for different source systems and business purposes, e.g., ETL for credit cards, sales, savings accounts). Since these ETLs have different source and target tables, they can be grouped and run in parallel, independently and without impacting functionality, which improves overall ETL performance.
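Grouping independent ETLs for parallel execution can be sketched with a thread pool; the group and job names below are hypothetical placeholders for real ETL steps:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: ETL groups with no shared source or target tables run concurrently;
# dependent steps stay sequential inside a group.
def run_job(name):
    # placeholder for the real extract/transform/load work
    return f"{name}: done"

groups = {
    "credit_card": ["cc_extract", "cc_transform", "cc_load"],
    "sales": ["sales_extract", "sales_transform", "sales_load"],
}

def run_group(jobs):
    return [run_job(j) for j in jobs]  # sequential within a group

with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    results = dict(zip(groups, pool.map(run_group, groups.values())))

print(results["sales"][-1])
```

The same grouping discipline applies whatever the scheduler is; the point is that only truly independent groups are safe to parallelize.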
Database-level parallelism is an important and complex technique for improving ETL performance. Query-level parallelism in a database can be enabled at the table level or the SQL query level. Parallelism may improve or degrade performance depending on its degree, the query, and the data. In an MPP database, every SQL query already runs with a high degree of parallelism thanks to the shared-nothing architecture, but in an SMP database the degree of parallelism has to be restricted for some queries. During ETL execution on an SMP database, SQL query parallelism was set to automatic and the degree of parallelism was restricted to 4 or 8, depending on the hardware configuration of the SMP database. This change improved ETL execution compared to no parallelism; a few SQL queries degraded when run in parallel, so overall it was a trade-off between improvement and degradation across the queries running in parallel.
Continuing the previous experiment, it is also important to enable the right performance settings on user-defined functions in any database (SMP or MPP); otherwise, a SQL query that calls such functions may not get the faster execution plan. It was observed many times that user-defined functions need an explicit parallel setting (among other settings) to allow the query to execute with a high degree of parallelism.
proc sql;
  create table src.TargetTable1 as
  select a."SUB_RK", a."AS_OF_DTTM", a."TARGET_VALUE", a."FINAL_DTTM",
         FunctionName(a."AS_OF_DTTM") as max_dttm
  from src.SourceTable a;
quit;
The execution plan of the above SQL query showed that it did not run in parallel until the function itself enabled parallelism:
create or replace FUNCTION FunctionName(vars) PARALLEL; END FunctionName;
4. SQL Pass-through (SAS specific)
In some SAS-based analytics solutions, the underlying database can be any SMP or MPP database. For fast SQL performance it is important to pass all queries (or at least the slow-running ones) through to the database so that the database executes them with its own mechanisms. Otherwise, SAS may pull data from the database to the SAS tier, leaving the query only partially processed, or not processed at all, by the database; in both cases the query will be very slow because the entire table's data moves from the database to the SAS tier. Such issues can be found in the SAS job logs by raising the logging level and verifying that the entire query is passed to the database. With the logging level enabled, one of three message types appears in the log for each query execution:
• SQL_IP_TRACE: None of the SQL was directly passed to the DBMS.
• SQL_IP_TRACE: Some of the SQL was directly passed to the DBMS.
• SQL_IP_TRACE: The SELECT statement was passed to the DBMS.
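A small script can scan the SAS job logs for these SQL_IP_TRACE messages and flag queries that were not fully passed through. The sketch below is a hypothetical helper, not part of SAS:

```python
import re

# Scan a SAS job log for SQL_IP_TRACE lines to verify implicit pass-through.
# "None"/"Some" variants indicate the query was not fully pushed to the DBMS.
TRACE = re.compile(
    r"SQL_IP_TRACE: (None of the SQL|Some of the SQL|The SELECT statement)"
)

def passthrough_status(log_text):
    hits = TRACE.findall(log_text)
    if not hits:
        return "no trace found"
    if any(h.startswith(("None", "Some")) for h in hits):
        return "NOT fully passed to DBMS"
    return "fully passed to DBMS"

sample = (
    "NOTE: PROCEDURE SQL used ...\n"
    "SQL_IP_TRACE: The SELECT statement was passed to the DBMS.\n"
)
print(passthrough_status(sample))
```

Running such a check over every job log after a test run quickly isolates the queries that fell back to SAS-side processing.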
Focus Area: System Level Measurements
1. Disk Speed
In a data warehouse solution, ETL performance depends heavily on IO capacity (both disk read and write speed). UNIX systems have the built-in dd command for checking disk speed (the "dd test"), and similar tools exist on Windows-based systems as well. A dd test helps measure disk capacity before ETL test execution starts, since the performance of high-throughput SQL queries may degrade when disk capacity is the limit. In the two nmon monitoring graphs below, captured during a dd test, disk read and write speeds were ~400 MB/s and ~370 MB/s respectively, apart from read-speed spikes of 1-2 GB/s (which can be ignored, as they come from the memory cache).
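The same kind of sequential-write measurement can be scripted; the sketch below approximates a dd-style write test in Python (fsync is used because, like the read spikes above, the page cache would otherwise inflate the number). The sizes are illustrative defaults, not the paper's test parameters.

```python
import os
import tempfile
import time

# Rough sequential-write throughput check in the spirit of a dd test.
def write_throughput_mb_s(total_mb=64, block_mb=1):
    block = b"\0" * (block_mb * 1024 * 1024)
    with tempfile.NamedTemporaryFile(delete=True) as f:
        start = time.perf_counter()
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data to disk before timing stops
        elapsed = time.perf_counter() - start
    return total_mb / elapsed

print(f"{write_throughput_mb_s():.0f} MB/s sequential write")
```

Running it on the temp-data and solution-data volumes before a test round gives the baseline against which later utilization graphs can be judged.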
Graph 5: Disk Write Speed Graph 6: Disk Read Speed
2. RAID setting
Along with disk type and speed, choosing the correct RAID level for each logical disk is crucial and depends on the disk's purpose. In our experimental setup, RAID 0 was chosen for the logical disk holding temporary data, RAID 5 for the logical disk holding solution data, and RAID 1 for the logical disk holding the operating system. These settings matter because RAID 0 is faster than RAID 1 and RAID 5 but offers no data recovery; temporary data may not need recovery, whereas solution data must be recoverable after a disk crash, so RAID 5 is the better fit for solution data.
3. Job Parallelism
One ETL had ~260 jobs covering extract, transform, and load for different business entities/purposes. This ETL ran for ~35 minutes with only 5 jobs running in parallel at a time to avoid hardware bottlenecks. In the following graph (captured through PerfMon), the disk read speed for solution data peaked at about 600 MB/s and the disk write speed for temporary tables stayed below 200 MB/s (except for one spike of 500 MB/s). The disk capacity measured during the dd test was higher than this utilization, so there was no limitation at the disk level. CPU utilization and available memory were also not at capacity, except that CPU utilization on 5 cores touched 100% because the SAS jobs were single-threaded. In fact, this is a good sign: the remaining 7 of the 12 cores can run other SAS jobs if job-level parallelism is raised from 5 toward 12. After increasing job-level parallelism from 5 to 10, ETL execution improved from 35 minutes to 28 minutes, an improvement that was expected since at 5 parallel jobs the hardware was not fully utilized.
27
Graph 7: Operating System Monitoring using PerfMon
4. Parallel IO operation
SPDS is a data storage system optimized for fast delivery by distributing data across multiple disks instead of a single disk. In one experimental setup the data was distributed across multiple virtual disks. When the ETL ran against data in SPDS, the observed disk read and write speeds (900 MB/s and 700 MB/s respectively) were higher than the read/write capacity of any individual virtual disk (350 MB/s and 500 MB/s respectively). This was possible only because the data was distributed across multiple virtual disks and SPDS performs parallel IO operations on them. SPDS shows a clear performance improvement over the SAS filesystem when data is distributed across multiple virtual disks, much like data on an MPP database is distributed across different disks in a shared-nothing architecture. The comparison mirrors SMP vs. MPP databases: SMP shares everything on a single host, whereas MPP shares nothing (disks, CPUs, nodes).
Graph 8: Disk Read Speed Graph 9: Disk Write Speed
Focus Area: Database Level Tuning
1. SQL Query Execution Plan
ETL execution time is driven by the execution times of its jobs, with the slowest job contributing the most. The execution time of the slowest job is in turn driven by the SQL queries or tasks in that job, so understanding why the slowest SQL query is slow is key to improving ETL performance. The FULLSTIMER setting in the SAS configuration only reports the execution time of a SQL query or task; to investigate why a query is slow, one needs its execution plan. The "_method" option shows the execution plan of a SAS SQL query, and other SMP or MPP databases expose similar plans through their monitoring and reporting tools. The execution plan chosen for a database or SAS dataset is determined by many factors: table sizes, indexes, IO speed, server utilization (CPU, memory), parallelism settings, etc.
For the slow SQL query below, FULLSTIMER showed an execution time of about 30 minutes and _method showed that the optimizer chose a sort-merge join. Sort-merge joins tend to be slower
in many databases because large tables (larger than RAM) must be sorted and written to disk, and IO is always slower than memory. Sort-merge is useful when multiple database connections and sessions run their SQL queries at the same time, so that no one gets a "denial of service" type error and everyone can run queries at limited speed; but when there is enough hardware, the optimizer should pick a faster join. The other two join types available in SAS PROC SQL are index and hash joins. If either of the two source tables fits in memory, the optimizer opts for a hash join; being an in-memory join, it is almost always faster than a sort-merge join, at the cost of higher memory utilization, since the entire table is held and processed in memory. If the source table has indexes, the optimizer can opt for an index join, which is even faster because it reads only a limited amount of data instead of the entire table. Index joins are not always fast, though: they help when the output fetches a small fraction of the source table's rows; otherwise the extra index accesses may cancel the gain or even degrade performance. SAS and other databases (SMP or MPP) offer numerous ways to promote a faster execution plan by hinting the optimizer: collecting statistics, partitioning tables (by WHERE clause), or increasing the database's memory allocation.
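To make the join strategies concrete, here is a toy comparison of a hash join and a sort-merge join on small in-memory tables (the column names are invented); it mirrors the optimizer's choices described above without modeling their IO costs:

```python
# Sketch of two of the join strategies discussed above, on plain Python lists.

def hash_join(small, large, key):
    # Build an in-memory hash table on the smaller input, then probe it:
    # fast, but the whole build side must fit in memory.
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    return [(l, s) for l in large for s in table.get(l[key], [])]

def sort_merge_join(a, b, key):
    # Sort both inputs (spilled to disk for huge tables), then merge in one pass.
    a = sorted(a, key=lambda r: r[key])
    b = sorted(b, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][key] < b[j][key]:
            i += 1
        elif a[i][key] > b[j][key]:
            j += 1
        else:
            k = a[i][key]
            ai = [r for r in a[i:] if r[key] == k]  # contiguous: inputs sorted
            bj = [r for r in b[j:] if r[key] == k]
            out.extend((x, y) for x in ai for y in bj)
            i += len(ai)
            j += len(bj)
    return out

dim = [{"rk": 1, "name": "x"}, {"rk": 2, "name": "y"}]
fact = [{"rk": 1, "amt": 10}, {"rk": 2, "amt": 20}, {"rk": 1, "amt": 30}]
print(len(hash_join(dim, fact, "rk")))  # 3 matching pairs
```

Both produce the same pairs; the difference in a real engine is where the work happens: hashing in RAM versus sorting through the temp area on disk.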
proc sql _method;
  create table temp.Table1_EXT as
  select SOURCETABLE1.PRODUCT_RK, SOURCETABLE2.LOAD_END_DTTM,
         SOURCETABLE1.SOURCETABLE1_TYPE_CD, SOURCETABLE1.APR_RT,
         SOURCETABLE1.BASE_APR_RT, SOURCETABLE1.NOMINAL_RT
  from temp.SOURCETABLE2, SourceSchema.SOURCETABLE1
  where SOURCETABLE2.LOAD_END_DTTM >= SOURCETABLE1.VALID_FROM_DTTM
    and SOURCETABLE2.LOAD_END_DTTM <= SOURCETABLE1.VALID_TO_DTTM;
quit;

Execution plan:
sqxslct
  sqxjm
    sqxsort
      sqxsrc(temp.SOURCETABLE2 (alias = TABLE0))
    sqxsort
      sqxsrc(SourceSchema.SOURCETABLE1 (alias = TABLE1))

NOTE: PROCEDURE SQL used (Total process time):
      real time 30:51.52
      cpu time 24:12.46
Other execution plans:

Hash join:
sqxslct
  sqxjhsh
    sqxsrc
    sqxsrc

Index join:
sqxcrta
  sqxjndx
    sqxsrc
    sqxsrc
2. Configuration Settings
In SAS (and in many SMP or MPP databases as well), memory settings are configurable at the database level. A small value for such a parameter can become a performance bottleneck when executing SQL queries on large tables. An experiment was conducted with different SORTSIZE settings in SAS (similar experiments were run on other popular databases with their own system- or query-level memory settings); the SQL query in this experiment created a UNIQUE index on a large table, which always executes via sort-merge because unique values must be found to create the unique index. With SORTSIZE at 12 gigabytes, reads happened in pieces with writes to the temp location; query execution time improved after increasing SORTSIZE to 60 gigabytes, thanks to reduced writes to temp and longer continuous reads into memory. Such memory settings in any database help reduce IO operations and improve query performance.
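The SORTSIZE effect can be reasoned about with a simple model: an external sort spills roughly ceil(data size / sort memory) sorted runs to the temp area, so a 5x larger SORTSIZE means roughly 5x fewer temp writes and re-reads. A back-of-envelope sketch (the 120 GB figure is illustrative, not from the experiment):

```python
import math

# Rough model of external merge sort spills: fewer, larger in-memory runs
# mean fewer temp-area writes and re-reads.
def spill_runs(data_gb, sortsize_gb):
    return math.ceil(data_gb / sortsize_gb)

print(spill_runs(120, 12))  # 10 temp runs with SORTSIZE 12G
print(spill_runs(120, 60))  # 2 temp runs with SORTSIZE 60G
```

This is only the first-order effect; real sorts also merge runs in passes, which amplifies the IO saved by a larger sort memory.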
Graph 10: Sort Size Run 1 and Run 2
Basic Housekeeping Techniques
• Inconsistency of performance numbers between two iterations: ETL executions in a data warehouse differ from UI (user interface) based load testing. Depending on the total number of ETLs and the number of months covered by each, backup and data restoration are needed to re-run an ETL for selected months only. Over time on Windows systems, repeated backup and restoration fragments the disk significantly, slowing IO operations. Running disk defragmentation at regular intervals, or before starting another round of testing, keeps IO operations, and therefore SQL query execution times, consistent, ensuring ETL execution timings remain comparable. As shown in the following table, disk fragmentation was significant on the D drive holding solution data, so table data was not stored contiguously on disk, slowing read/write processes. After defragmentation, performance improved significantly.
Disk Drive   Total size (TB)   Available (TB)   Used (TB)   % of fragmentation
C            1.08              1.04             0.08        19
D            11.4              9.21             2.19        5
E            4.36              3.36             1.0         0

Graph 11: SQL Execution Time          Table #3: Disk Details
• Keep an eye on unnecessary intermediate table creation during ETL processes; using views instead of creating intermediate tables saves IO. Similarly, intermediate tables need to be cleaned up before the same ETL is executed for the next time period. Intermediate tables are small but numerous, created and used at the developers' convenience; when the same tables are reused for the next time period, instead of being recreated, the old tables are loaded with new data and grow in size. For best performance, such temporary tables should usually be cleaned up, and collecting statistics on them may improve performance by letting the optimizer choose a better execution plan.
• As a best practice, measure ETL timings not only for a specified time period (e.g., the 12th month) but for all incremental time periods (months 1 to 12). This helps uncover any inconsistency in data distribution across incremental time periods when all other parameters remain the same. Ideally, for ETLs whose source or target data grows with each time period, execution time also increases gradually; for ETLs whose source or target data volume stays the same, execution time should also remain the same.
Graph 12: ETL Execution Time over Iterative Time Periods
References:
1. [KIMB2004] Ralph Kimball and Joe Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data", first edition, Wiley (2004).
2. [SAS2010] "SAS/ACCESS(R) 9.2 for Relational Databases: Reference, Fourth Edition" (2010), http://support.sas.com/documentation/cdl/en/acreldb/63647/HTML/default/viewer.htm#a001384466.htm
3. [KRENN] Thomas Krenn wiki, "Linux I/O Performance Tests using dd", https://www.thomas-krenn.com/en/wiki/Linux_I/O_Performance_Tests_using_dd
4. [NATARAJAN2010] Ramesh Natarajan, "RAID 0, RAID 1, RAID 5, RAID 10 Explained with Diagrams", August 10, 2010, http://www.thegeekstuff.com/2010/08/raid-levels-tutorial
Acronyms:
ETL (Extract, Transform, Load), OS (Operating System), UI (User Interface), OLTP (Online Transaction Processing), CDC (Change Data Capture), BA (Business Analytics), BI (Business Intelligence)
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right
to publish this paper in CMG India Annual Conference Proceedings.
Performance Monitoring and Management of Microservices on Docker Ecosystem
Sushanta Mahapatra, Sr. Software Specialist – Performance Engineering, SAS R&D India Pvt. Ltd., Pune
Richa Sharma, Sr. Software Specialist – Performance Engineering, SAS R&D India Pvt. Ltd., Pune
Application packaging and deployment using Docker is trending and fast catching on in the infrastructure world. The big players of the industry are either deploying on Docker, integrating with Docker, or developing for Docker, because it is flexible, lightweight, easy to manage, and highly scalable. Docker containers wrap applications and their dependencies, use a shared kernel, and run on the same host alongside other containers; essentially they are isolated processes in user space on the host operating system. Microservice architecture, similarly, enables the decomposition of applications into small services, which improves fault tolerance and manageability. These services are designed to run on their own, to communicate with the outside world via lightweight protocols like HTTP, to be flexible in deployment, and to work independently. One such deployment mechanism is containerization of these microservices via Docker, which makes Docker containers one of the preferred options to build, ship, and run them. Containerization via Docker is highly convenient because Docker automates the deployment of applications inside containers, providing an additional layer of abstraction and automation of operating-system-level virtualization on Linux. Docker containers are built on top of the Linux containers mechanism and are very lightweight. The trade-off, however, is the performance monitoring and management overhead: measuring the system-level utilization and performance characteristics of all the services becomes a challenge.
This paper discusses various options that a performance engineer can leverage to monitor and manage Docker-based microservices, including live monitoring of containers, persistent storage options for offline monitoring and analysis, and integration possibilities with a few industry-standard (commercial) performance testing tools. Additionally, the paper delves into container management tools such as Docker UI and Consul Web UI.
Introduction:
Microservices architecture is the latest buzz among leading software architects and design gurus. Over the last few years a sharp surge toward the microservice way of designing software services has been evident, leaving the monolithic approach behind. There is a very consistent pattern of migration to this architecture using container technology, with Docker as the primary container of choice. Microservice adoption is being driven by the decomposition of applications to a granular level, enabling each service to be self-sufficient. Cloud-based infrastructure and elastic scalability are further drivers of microservice adoption.
Figure 1: Decomposition of Service into microservices
From a performance standpoint, different approaches are needed to monitor microservices deployed on Docker. Typically, various microservices communicate with each other on a production system, and each service is ideally hosted in a Docker container on top of the Linux kernel or on cloud infrastructure. As these containers consume resources in isolation, monitoring approaches should aim at gauging individual containers in isolation, similar to monitoring a single host. There are various ways in which a performance test engineer/tester can monitor and manage such containers, and many platforms and frameworks have been developed, or are under development, to monitor such microservices.
Docker Containers and Resource Isolation: Docker is written in the Go language and makes use of several Linux kernel features (namespaces, control groups, union file systems, and a container format) to deliver the container isolation functionality discussed below. Docker takes advantage of namespaces to provide the isolated workspace called the container. When you run a container, Docker creates a set of namespaces for it, providing a layer of isolation: each aspect of the container runs in its own namespace and has no access outside it, as shown in Figure 2.
Figure 2: Docker Container resource isolation
Monitoring Approaches for Docker Containers: As discussed in the above section, each Docker container has its own subsystem, and its resources can be monitored in isolation as if the container were a separate host. There are several ways to achieve this; listed below are a few of the most helpful approaches.
Approach 1: Control Groups
Linux provides a kernel feature called control groups, commonly cgroups, that allows us to allocate resources such as CPU time, memory, network bandwidth, or combinations of these, among user-defined groups of tasks or processes running on a system. We can configure and monitor cgroups and control user access to them. Using cgroups, system administrators gain fine-grained control over allocating, prioritizing, denying, managing, and monitoring system resources. Figure 3 below shows how cgroups can be used to control resources.
Figure 3: Control Groups and an Analogy with the Real World
Docker containers rely on control groups to isolate the resource usage (CPU, memory, disk I/O, and network) of a collection of processes. This is quite similar to resource metering in a multi-storied building, where consumption (water or electricity) is measured both for the whole building and individually for each unit.
Usage Example: The memory metrics for a running container can be retrieved via cgroups using the container ID, by reading /sys/fs/cgroup/memory/docker/(id) at the command prompt, which shows the specified container's memory usage in bytes:
total_cache 210492
total_rss 311541332
total_rss_huge 234061234
Similarly, any of the performance metrics (CPU, IO, network) can be found for any container in isolation.
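Programmatically, the same counters can be read from the cgroup file and parsed into a dictionary; the sketch below parses the sample output shown above:

```python
# Parse cgroup memory.stat text (as read from
# /sys/fs/cgroup/memory/docker/<id>/memory.stat) into a dict of counters.
def parse_memory_stat(text):
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

sample = """total_cache 210492
total_rss 311541332
total_rss_huge 234061234"""

stats = parse_memory_stat(sample)
print(stats["total_rss"] // (1024 * 1024), "MB resident")  # 297 MB resident
```

In practice the text would come from reading the file for a given container ID; the parsing is the same.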
Approach 2: Using Google's Cadvisor
Google created Cadvisor initially for internal use, but later added Docker container support and released it as open source. Cadvisor, short for Container Advisor, is easy to use and gives a detailed look into the resource usage and the various performance characteristics of all running containers. It includes a simple UI to view live data, a simple API to retrieve the data programmatically, and the ability to store the data in an external InfluxDB.
Cadvisor supports Docker containers natively out of the box. It is a running daemon that gathers, aggregates, and presents information about all running containers; for each container it keeps resource isolation parameters and historical resource usage, plus network statistics container-wide and machine-wide.
Pulling and Running Cadvisor: Using Cadvisor is quite easy; from your host, just pull and run the Cadvisor Docker container with the following command and you are ready to go:

sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

Once the Cadvisor container is running, you can bring up the UI at http://yourhost:8080 (preferably in Firefox or Chrome) and drill down to any container you want to monitor from the Docker Containers link.
Figure 4: Cadvisor UI and typical graphs it shows for a container (image customised for representation).
Additionally, Cadvisor provides REST API endpoints that return all stats in JSON format for further consumption. A few of them are:
http://yourhost:8080/api/v1.2/containers
http://yourhost:8080/api/v1.2/docker
http://yourhost:8080/api/v1.2/machine
http://yourhost:8080/api/v1.2/docker/myinstance (your container instance)
Limitation:
• The data captured is live, and there is no built-in way of storing it for later analysis. For persistent storage, Cadvisor needs to be interfaced with an external time-series database such as InfluxDB or OpenTSDB. This is discussed later in this paper.
Approach 3: Integration with Leading Performance Testing Tools
This section touches upon two approaches for integrating Docker container metrics with leading (commercial) performance testing tools to monitor microservices.
Approach 3.1: Integrating Cadvisor API endpoints with performance testing tools: As mentioned in the previous section, Google's Cadvisor is good for getting performance counters from Docker containers as well as from the services or processes running within them. The Cadvisor REST API endpoints can be called and processed programmatically to extract the different performance metrics, which are then stored as part of the analysis report (the tool's output).
The performance test tool can consume those data points and generate graphs as part of the performance test report, which can be stored permanently.
Limitation(s):
• A single virtual user must run another script at a specified interval (in the background) to collect and process the metrics during the test run (this does not add much overhead).
• A parser for the Cadvisor JSON responses has to be implemented (this can be custom-built with any simple program).
• The raw data values cannot be stored as-is; for further analysis of any problem, the test has to be re-run.
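A minimal version of such a parser might look like the following; the JSON here is a trimmed, hypothetical cAdvisor response, as real payloads carry many more fields (per-CPU usage, filesystem stats, network counters):

```python
import json

# Sketch of the custom parser mentioned above: pull the latest CPU and
# memory counters out of a cAdvisor-style JSON response.
def extract_counters(payload):
    doc = json.loads(payload)
    latest = doc["stats"][-1]  # most recent sample
    return {
        "cpu_total_ns": latest["cpu"]["usage"]["total"],
        "memory_bytes": latest["memory"]["usage"],
    }

# Trimmed, hypothetical response; a real one comes from
# http://yourhost:8080/api/v1.2/docker/<instance>.
sample = json.dumps({
    "name": "/docker/myinstance",
    "stats": [
        {"cpu": {"usage": {"total": 95000000}}, "memory": {"usage": 104857600}},
        {"cpu": {"usage": {"total": 97000000}}, "memory": {"usage": 106954752}},
    ],
})

print(extract_counters(sample))
```

The extracted key/value pairs can then be fed to the test tool's storage, as described in the usage example below.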
Usage Example: Extract each value and store it as a key/value pair, for example:
HashMap<String, String> results = new HashMap<>();
results.put("Counter1", "Value");
Then feed the data to the tool's storage system, or store it on any storage medium for offline analysis, as depicted in the following diagram.
Figure 5: Cadvisor API endpoints and their integration with industry-standard tools.
Advantages: The data received via the API endpoints is stored permanently and can be used for live or offline analysis.
Approach 3.2: Registering custom scripts to leverage cgroups and integrate with industry standard tools. Red Hat Enterprise Linux kernel has a feature called control groups or cgroups that was discussed earlier in this paper. Cgroups can be used to collect the system resources for a specific group in complete isolation. Based on cgroups, one or more custom monitoring scripts can be developed to be registered with Leading Performance monitoring tools and those scripts can be used from within Performance test tool to collect metrics for any of the containers.
The figure below displays a simple process of writing a custom shell script in Linux that leverages cgroups to retrieve container specific metrics, and shows how tools like JMeter and Gatling can consume the script's output. The latest versions of JMeter and Gatling support host monitoring via direct or collectd plugins, which can be configured to use the custom script to pull metrics during the course of the test run and show them as part of the test result.
Figure 6 : Integrating custom cgroups script with tools like JMeter and Gatling
Limitation: Proper understanding of cgroups and how to define the hierarchy and the subsystems is required.
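As a minimal sketch of what the core of such a custom script does, the snippet below reads a single-value cgroup counter file; the /sys/fs/cgroup path layout shown is an assumption that depends on how cgroups are mounted on the host and which container runtime hierarchy is in use:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of reading a single-value cgroup counter file, e.g.
// /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes
// (the exact hierarchy is host-dependent and assumed here).
public class CgroupReader {
    // Parse the single numeric value that a cgroup counter file contains.
    public static long readCounter(Path counterFile) throws IOException {
        return Long.parseLong(Files.readAllLines(counterFile).get(0).trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: CgroupReader <container-id>");
            return;
        }
        Path p = Paths.get("/sys/fs/cgroup/memory/docker",
                args[0], "memory.usage_in_bytes"); // assumed mount layout
        System.out.println("memory.usage_in_bytes=" + readCounter(p));
    }
}
```

A shell wrapper around the same idea (cat on the counter file) is what would typically be registered with JMeter or collectd.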
Approach 4: A persistent Monitoring Infrastructure using Cadvisor, InfluxDB and Grafana
As discussed in the previous section, Cadvisor can be used to monitor containers, but that approach has two limitations: the monitoring counters and their values are live and are not stored anywhere, and the performance statistics cannot be customized into meaningful graphs. The following section describes a quick approach to store the performance metrics persistently in a time-series database and use a charting solution to build more meaningful graphs based on the stored metrics.
Components in the Framework:
• Cadvisor (to read performance metrics)
• InfluxDB or OpenTSDB (time-series databases for persistent storage)
• Grafana (charting solution that supports InfluxDB/OpenTSDB)
How are the above components integrated?
Figure 7 : A persistent Docker monitoring Infrastructure using Cadvisor, InfluxDB and Grafana
The integration is quite simple, as shown in Figure 7. Let's see in detail how the components integrate with each other. The objective is to monitor the Docker containers that sit on top of the Linux kernel.
• Run the Cadvisor Docker container which will provide all the resource usage statistics and performance measures for all the running containers.
• In order to store the metrics retrieved by Cadvisor, use either InfluxDB or OpenTSDB. For ease of use, InfluxDB is recommended since Cadvisor can be auto configured to push data to InfluxDB.
• Once the data is persisted, a charting solution such as Grafana can be used; it connects to these databases and helps in building rich and meaningful graphs for better analysis.
• Once the performance dashboards are ready, end users can simply view and monitor the performance characteristics live via these dashboards.
This infrastructure can be set up quickly as all these components are available as Docker containers; if you have a Docker enabled Linux box, it takes just a few minutes to build the dashboard and start gathering performance metrics.
Limitation:
As already mentioned, this is a quick way of building a monitoring infrastructure; once you stop the InfluxDB or Grafana Docker instances, you may lose all the data. To build a robust monitoring infrastructure, it is better to host the InfluxDB/OpenTSDB and Grafana configuration permanently, or to find a way to persist the data these Docker instances use so that it remains reusable.
Managing Docker Containers:
DockerUI:
In addition to resource monitoring, it is equally important to manage the containers. Docker UI is a web interface that helps visualize running containers and manage them easily. It assists users with container lifecycle actions like starting, stopping, pausing, removing and killing a container, making it easy to manage containers and images with simple clicks, without needing to execute commands for small jobs.
Figure 8 : Docker UI (Image Source : Internet)
Consul Web UI: Consul is a tool for discovering and configuring services in your infrastructure. It comes with a user-friendly web UI that can be used for viewing all services and nodes, for viewing all health checks and their current status, and for reading and setting key/value data. It can act as a one-shot dashboard for managing your services.
Figure 9 : Consul web UI (Image Source : https://demo.consul.io)
References:
Definitions of Microservices: http://klangism.tumblr.com/post/80087171446/microservices
RedHat guide on Control Groups: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
Google’s Cadvisor: https://github.com/google/Cadvisor
Grafana Information: http://grafana.org/
InfluxDB Information: https://influxdb.com/
Docker Information: https://www.Docker.com/
Consul Information: https://www.consul.io/
PERFORMANCE TESTING AND TUNING OF A WEB BASED ENTERPRISE
SCALE HIGH VOLUME ACCOUNTS PAYABLE PROCESSING SYSTEM
Seji Thomas Software Product Development CoE –
Intellectual Property & Engineering Group Tata Consultancy Services Ltd, 1st Floor
TCS Center, Infopark, Kakkanad Kochi – 682020, Kerala, India
Ph: +91-484-6187568
Rony Pius Manakkal Software Product Development Engg –
Intellectual Property & Engineering Group Tata Consultancy Services Ltd, 1st Floor
TCS Center, Infopark, Kakkanad Kochi – 682020, Kerala, India
Ph: +91-484-6187607
Abstract
Software Applications are developed catering to the requirements of the
customer. Requirements are expressed as business use cases that are
validated and verified using prototypes that then blend into the mainstream
Application. It is implicit that the outcome of an Application development
project is favorable to the customer with respect to its proper functioning,
optimum performance, lower infrastructure investment and operational
costs, ease of administration, maintenance, and adaptability to changes in
business. To build for success, it is essential to understand what counts as
success. The target Application should bring a successful outcome to the
customer, it should meet current demands and also be flexible for
addressing future growth. While meeting current demands, it should
provide a winning edge for the customer to save time and cost of its
operations, leverage past and current investments for future, and earn a
repute in the market. Engineering with special focus on quality is a
challenge most development teams are trying to address, and the focus of
this paper is to present the steps taken to evaluate the performance
quality of the Application under construction, the challenges that came up
during the testing and tuning phases of development, and how those
challenges transformed as opportunities while scaling up business
operations.
1. Introduction
Business requirements form the central theme on
which an Application is built. When developing an
Application, it is important to know the respective
industry domain, the target user population, the use
case scenarios they will be using often, acceptable
response time, the collective demand that
determines peak usage, offline batch processing,
operational data volumes, long term growth in
demand, and so on, that form the performance
requirements. It is also beneficial to know if the
product is targeted for a niche market, or there is
already one or more competitor/s with demonstrated
credentials in the respective business domain. All
this information is important for responsibly building
the target architecture. When the quality measures to
be achieved are quantified at an early stage, it is
possible to choose the most fitting architectural
elements for building the structure that will suffice the
performance requirements.
During the milestone delivery phase, developmental
activities are frozen to actively test for the
performance of the product, as if being subject to real
use by end users. It is important that time is
scheduled for such an activity, to check the stability of
the release and reduce the risks due to regression.
Performance testing enhances the identification of
system hotspots that are rarely spotted in a functional
test. It also helps in understanding the level of
resources being consumed, determining whether the
system requires further tuning of performance
parameters, gauging the ability of the system to
support a work load of higher magnitude within
acceptable variance of the key performance
characteristics. The exhibited performance
characteristics are crucial for estimating the capacity
demands of the infrastructure required to support the
end user demands over the lifecycle of the product.
Time invested for discerning the architectural and
design alternatives, realizing the tangible performance
savings through code & configuration optimization,
and saving every bit of resources from being
consumed beyond the actual need, are keys to
success in the journey of making a product that
distinguishes well in the market.
2. Business Need
The product being considered for study is an
Enterprise class Web based Accounts Payable (AP)
Application built to replace a legacy system comprised
of several sub systems developed over years, with
limited documented information available on
Application integration and interfacing requirements.
The Accounts Payable application was built on Open
Source frameworks and APIs like Java/J2EE, Tomcat,
Oracle Database, Spring, Spring Batch, ActiveMQ,
and iBatis ORM (Object Relational Mapping). An
in-house Data Quality Management Application was used
for the purpose of mass scale record matching.
2.1 Performance Quality Requirements
The product that has to be built to replace the legacy
system has stringent performance requirements, as
the existing system already supported a large number
of concurrent users and huge volumes of data through
its successful operation in the business domain over
the years. In the business domain it operates in, the
product's customer leads its competitors by a good
margin and holds a significant share of the market. The
key performance requirements are listed below:
• 300 users concurrently active on transactions
like Manual Reconciliation of Invoices (either
at header level or detail level), Receipt Item
Maintenance, and Manual entry of Invoice.
• Several daily batch jobs for execution that have
stringent throughput requirements, like
completing processing of 10 to 12 million
detail level invoices within a 2 hour time
window.
• Web Page Response time of 2 seconds.
• Operating Data Volume: Header Level
Invoice - 40 Million, Detail Level Invoice - 500
Million.
• Daily Data Processing Volume of 150 K
(includes both Header and Detail Invoices),
with transparent audit trail logging.
• CPU (Central Processing Unit) Utilization to
be under 50 % and Memory Utilization to be
between 30 and 40 %
The challenging part of the project was to develop the
system without legroom for mistakes that would derail
the earmarked time schedule and cost. From day 1 of
go-live, without fail, the system has to be operational
for the above user concurrency, data volumes,
response time and data processing time windows,
throughput and resource consumption. It was
apparent that the architecture, design, and
development approaches have to be smart and
efficient enough to build every single element of the
system in a befitting manner, minimizing wastage.
2.2 Performance Driven Engineering
Software Systems are increasing in complexity day by
day. Next generation technologies are arriving on
stage at a pace hardly imagined before, so the
challenges of building something worthy of its class
and adaptable to the business needs are also
growing proportionately. Driven by competition,
Software Development organizations are forced to cut
costs of construction, maintenance, and
customization. In this realm, it is worthwhile to think
how the developing organization can ensure that it
builds a functionally useful product with the greatest
satisfaction for its buyer.
The basic principle underlying the development of a
system that satisfies its customer needs is to address
functional and quality concerns right from the
conceptual stage, and carry on that rigor through all
life cycle stages until the product is phased out.
Taking the right step at its most appropriate time cuts
down wastage and avoids unnecessary costs that can
later grow uncontrolled.
PELC (Performance engineering life cycle) stages
overlap with the SDLC (Software development life
cycle) stages and synergistically enhance the
performance quality of the product.
Within the functional scope it is important to
understand the target users who will benefit from a
business point of view. Some functions are heavily
used and others occasionally. Even at an abstract
level, each function when triggered for execution by
user action, is expected to result in a set of
component interactions occurring in a known pattern,
because of which a certain demand is imposed on the
resources backing the running Application. From the
perspective of real usage demand of the live
Application, the different functions will be invoked at
different rates, and their demands will collectively form
a demand foot print that the Application should be
capable of delivering at the mutually agreed upon SLA
(Service Level Agreement). Figure 1 represents the
discussed idea.
Figure 1 Resource demand induced by transaction
The magnitude of demand imposed at each resource
center can be different and will depend on the actual
implementation, which is unknown at the architecture
development time. Nevertheless, based on the
architecture under evaluation, it is worthwhile to
approximate the total number of demand requests or
the time spent at each resource center to determine
which of them are in high demand. Such a back-of-the-envelope
approximation will help to simplify the
analysis required to determine if the current
architecture introduces any hotspots that will deter
performance. The earlier such an analysis can be
done, the better, because it improves the success rate
of decisions being made later on.
Once the architectural backbone is laid and the design
phase flourishes, early evaluations related to
performance can be done because components along
the transaction pipe-lines are gradually being realized.
Along with the regular unit testing, performance unit
tests are done at this juncture to streamline and
optimize code. Automated Code quality checks save a
lot of effort as it evaluates all engineering debts to be
closed, which among others include architectural and
performance quality issues. Getting a good
compliance rating at this time is important for attaining
the targeted quality objectives.
Once the implementation of the key business
transactions are completed, they can be profiled at
code and SQL (Structured Query Language) level for
additional insights on performance. Method level and
SQL level inefficiencies are brought to fore for re-
factoring. The objective at this point is to save
computing resources and time, without falling in to the
trap of premature performance optimization.
As development progresses and matures through this
phase to reach a state of stable release, the code
base is frozen to avoid testing in flux. The Application is
deployed in a performance test environment
comprising of hardware and network resources at a
scaled down factor of the production levels.
Production scale data if available is loaded in the test
environment. If not, appropriate tools are used to
generate data in volumes and load in the data
repository.
The workload characteristics that were determined at
the requirements gathering time to be the most likely
peak load to occur in production server is taken as
input for formulating the testing strategy. Transactions
are recorded to script automated transaction
invocations. Scripts are further modified through
configuration, and parameter correlation to simulate
user concurrency for the desired transaction mix.
Load ramp up and ramp down times, think time,
delays, and supporting data for various transactions
are also suitably set in the performance test
automation tool/ harness.
Load is triggered to check whether it simulated a
realistic load as originally intended. Once validated for
expected behavior, configurable performance
parameters are tuned sufficiently in the test
environment to support the quantum of load that will
be generated by load testing. While simulating the
load, checks are done to avoid all biases that would
lead to misinterpreting the performance behavior.
When the load is simulated, the environment and the
Application are monitored and all relevant
performance metrics collected for analysis and
reporting. Inferences are made on scientific evaluation
of the metrics. Outcome of the analysis is captured in
the test report, which reports the status of
achievement of the performance objectives stated at
the onset, existence of hotspots if any, and
recommendations for improvement. The testing report
feeds back to engineering, if improvement is required.
If the objectives are met, the Application performance
is benchmarked against the infrastructure capacity for
making capacity planning decisions for customers.
Performance engineering life cycle continues its
journey in production phase, where it takes the form of
continuous monitoring of the live system. The data
collected from monitoring is checked for symptoms of
short term and long term performance failures. All
such symptoms, or in some instances, real failures,
are debugged for identifying the root cause and
correcting as appropriate. Performance monitoring
data is also used for predicting user demands,
understanding the current resource usage levels,
knowing the user experience, and anticipating an
imminent problem like resource crunch that can be
fixed ahead of an issue.
2.3 Deployment architecture for
performance and scalability
At the tail end of the development phase, concrete
decisions related to deployment had to be taken. Load
balancing and fail-over are the main concerns to be
addressed as per the requirements. For each layer,
redundancy is required, hence the concept of Server
pooling was considered. Having a cluster of Servers
allows for scaling out in the event of demand spike,
and in this scenario, the entire Application was scaled
out following that approach. Web Server content
caching for static data was adopted to reduce the
network demand. Figure 2 provides the high level
deployment plan.
Figure 2 Application Deployment Diagram
3. Development time tuning
The high level decisions taken in the previous phases
were actually put to the test during development time. As
the product started taking shape, it became easier to test
and validate for gaps. In this phase of product
development, we encountered several challenging
situations that proved to be good learning. During this
time, the standard engineering practices like
1 Code profiling to find out method bottlenecks, and
optimizing them
2 Slow Query analysis and adding required indexes
3 Sizing down Web pages after analysis using
developer tools
4 Conducting automated and peer code reviews,
and,
other best practices enabled the team to deliver at a
consistent quality. Some key lessons learned during
this phase are described below.
1 Power of parallel computing: As mentioned
earlier, Audit trail logging is a key functionality of
the system. It is a pre-built component built upon
the concept of a queue. By default, items for
processing are put in a queue and consumed by a
single listener. Main invoice and sub items form a
sizable entity, whose accompanied change history
will also be bulky. Change history entries are
pushed to the queue, from where it will be
asynchronously picked for processing or
persisting in the Audit DB (Database) table.
During a high demand time interval, the change
history items generated are numerous and
start filling up the queue. With only a single
listener consumer registered for the queue, the
rate of consumption fell behind, leaving
items accumulating in the queue and thereby creating
memory pressure. The number of listeners was
increased to observe the performance gain as in
Table 1
Queue buffer size: 2 GB
Single Listener (default): Frequent out of memory errors when subjected to high load
Number of Listeners = 3 (optimized): A logger that checks the number of processed queue items every 10 mins reported that several thousand had been processed
Table 1 Queue performance optimization by configuring the appropriate number of Listeners registered against the queue
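The listener scaling described above can be illustrated with an in-memory sketch; the real audit component used a message queue, so the BlockingQueue and thread pool here are simplifying assumptions that only demonstrate why consumption rate scales with the listener count:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: draining a bounded queue with N listener threads.
public class AuditQueueDemo {
    static AtomicInteger processed = new AtomicInteger();

    static void drain(BlockingQueue<String> queue, int listeners)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(listeners);
        for (int i = 0; i < listeners; i++) {
            pool.submit(() -> {
                // Each listener pulls items until the queue is empty;
                // a real listener would persist each item to the Audit DB.
                while (queue.poll() != null) {
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        for (int i = 0; i < 1000; i++) queue.add("change-history-" + i);
        drain(queue, 3); // 3 listeners instead of the default single one
        System.out.println("processed=" + processed.get());
    }
}
```

With three listeners the same 2 GB buffer drains faster than items arrive, which is the balance the tuning in Table 1 achieved.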
2 Fetch the data in a batch and group it in Java
code: There is a requirement to view huge set of
data through the Web Browsers. As the
operations are performed in Web Browser itself
using JS functions, all the required data should be
brought to the screen while loading the page itself.
For example, Detail Level Manual Reconciliation
is an operation in which user works on the Invoice
line items and match them against the
corresponding receipt line items. For the operation
of matching, the screen should be populated with
huge set of Invoice and receipt data. This data is
encapsulated as records in parent and child
tables. Fetching them requires different queries,
which repeat as many times as the number of
line items in the invoice. As a result, loading an
invoice with more than 30 line items was taking
more than the 2 second SLA. It was soon realized
that the performance issue was due to repeated DB
calls.
Code was re-written to fetch the additional details
required for the line items together, and group it in
Java code. Refer to the tables for pseudo code -
sub-optimal (Table 2) and optimized cases (Table
3)
List<InvLineItem> InvLineItems = fetch invoice line items --DB round trip
for each InvLineItems{
fetch and add Detail1 --DB round trip
…
fetch and add Detail5 --DB round trip
}
Table 2 Pseudo Code of the logic causing performance issue
List<InvLineItem> InvLineItems = fetch invoice line items; --DB round trip
List<InvLineId> lineIds = get the list of line ids;
Map<InvLineId, Detail1> = Fetch the Detail1 for every lineIds and create a Map with key as the lineId and
value as Details1; --DB round trip..
Map<InvLineId, Detail5> = Fetch the Detail5 for every lineIds and create a Map with key as the lineId and
value as Details5; --DB round trip
Tag the fetched Details into corresponding InvLineItems
Table 3 Pseudo Code of the logic for resolving performance issue
Performance improvement achieved in this case is given in Table 4
Test Scenario: loading an invoice with 50 line items
Before fix: almost 50 queries fired; response time 7 to 9 seconds
After fix: always 5 queries triggered irrespective of the number of line items; response time < 2 seconds
Table 4 Performance advantage of fetching data in batch and grouping in the calling code
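The batch-fetch-and-group pattern of Table 3 can be sketched as follows; the fetch method is a hypothetical stand-in for the real iBatis-mapped query that returns details for all line ids in one round trip:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the optimized pattern: one query per detail type instead of
// one per line item, then grouping in Java by line id.
public class BatchFetchDemo {
    static class Detail {
        final long lineId;
        final String payload;
        Detail(long lineId, String payload) {
            this.lineId = lineId;
            this.payload = payload;
        }
    }

    // One simulated "DB round trip" returning details for ALL line ids.
    static List<Detail> fetchDetailsForLines(List<Long> lineIds) {
        List<Detail> out = new ArrayList<>();
        for (long id : lineIds) out.add(new Detail(id, "detail-" + id));
        return out;
    }

    // Group the batched result by line id so each line item can be tagged.
    static Map<Long, List<Detail>> groupByLineId(List<Detail> details) {
        Map<Long, List<Detail>> byLine = new HashMap<>();
        for (Detail d : details) {
            byLine.computeIfAbsent(d.lineId, k -> new ArrayList<>()).add(d);
        }
        return byLine;
    }

    public static void main(String[] args) {
        Map<Long, List<Detail>> grouped =
                groupByLineId(fetchDetailsForLines(List.of(1L, 2L, 3L)));
        System.out.println(grouped.get(2L).get(0).payload);
    }
}
```

Repeating this once per detail type (Detail1 through Detail5) gives the fixed five queries reported in Table 4, independent of the line item count.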
3 Custom query to update multiple records:
There are a couple of scenarios in which the user
performs some operation on a parent record and
the system has to update its child records as part
of the operation. In the application, there are
multiple occurrences of such scenarios.
For example, when a user is performing header
level matching of an invoice, all the invoice line
level item records should be updated for fields like
line_item_match_amount and line_match_status.
The logic is to update line_item_match_amount by
multiplying the line_item_qnty and
line_item_unit_price values in the same record,
and to update line_match_status to 'MATCHED'.
As per the design, all the DML (Data Manipulation
Language) operations have to be accomplished
through the ORM framework, and that is how it was
implemented. Table 5 provides the pseudo code of the
operation.
LIST<InvLines> = Fetch the DOM records for LIST<inv_line_id>; -- DB round trip;
for each InvLines{
Update line_item_match_amount = line_item_qnty * line_item_unit_price;
Update line_match_status = 'MATCHED';
Persist the data; -- DB round trip;
}
Table 5 Pseudo code for DML operation through ORM framework
In the above approach, there is a DB call to fetch
the DOM and another DB call for each item to
persist in the DB. Analysis showed that
this approach had performance limitations while
matching multiple invoices, each having
considerable number of line items. To overcome
the performance issue, the ORM based update
process was changed into a custom query based
update as given below.
update invoice_details set line_item_match_amount = (line_item_qnty * line_item_unit_price),
line_match_status = 'MATCHED' where inv_line_id in (LIST<inv_line_id>); -- only one round trip to
DB
As a result, only one DB call is required to update
the line item details irrespective of how many line
items are available for the selected invoices. The
performance advantage in this case is given in
Table 6
Test Scenario: Matching 10 Invoices and 10 receipts with 20 line items each
Prior to Fix: 5-8 seconds
After the Fix: < 2 seconds
Table 6 Performance benefit derived by use of a custom query to update multiple records
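The single multi-row UPDATE can be sketched as follows; in production the statement would be parameterized through the ORM or JDBC, so assembling the literal IN-list here is purely illustrative:

```java
import java.util.List;
import java.util.StringJoiner;

// Sketch: building the one-round-trip UPDATE from a list of line ids.
public class BulkUpdateSql {
    static String buildMatchUpdate(List<Long> lineIds) {
        // Assemble "(id1, id2, ...)" for the IN clause.
        StringJoiner in = new StringJoiner(", ", "(", ")");
        for (Long id : lineIds) in.add(id.toString());
        return "update invoice_details "
             + "set line_item_match_amount = (line_item_qnty * line_item_unit_price), "
             + "line_match_status = 'MATCHED' "
             + "where inv_line_id in " + in;
    }

    public static void main(String[] args) {
        System.out.println(buildMatchUpdate(List.of(101L, 102L, 103L)));
    }
}
```

However many line ids are passed, the result is one statement and therefore one DB call, which is the source of the gain in Table 6.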
4. Performance testing time re-factor
Performance testing phase actually reveals the
behavior of the application when users start using it
concurrently. It is an important activity in the SDLC
phase that validates and verifies the ability of the
Application to operate consistently under load. Prior to
load testing, the following were done
• Pre-Production Environment (replica of
Production) was set up for PT and tuned for
the load.
• Customer PT Tools / Servers were leveraged
for Testing
• Load Runner Tool was used for Load Testing.
• Two Load Generator Servers were used to
simulate 300 Concurrent User Load Test for a
test duration of 1 Hour.
• Application Performance Management Tool
agents were installed on Application and
Batch Servers to analyze Performance
Bottlenecks in Web / App / Batch / Database
Tiers.
• Oracle AWR (Automatic Workload
Repository)/ADDM (Automatic Database
Diagnostic Monitor) Reports were leveraged
to analyze Database related Performance
Bottlenecks.
Some of the key lessons learned in this phase are
described below.
1 Configure Server parameters for performance:
Key performance parameters were configured in
the Web, App, and DB Servers as in Figure 3
Figure 3 Key performance parameter tuning for Servers
2 Configure proper Statement Cache: During the
performance testing phase, it was identified that
there was a considerable amount of waiting for DB
querying. At the onset of analysis, it was mistaken
for a problem due to insufficient Connections in
the pool, but it turned out that statement checkout
behavior in C3P0 was contributing to the wait
during query execution.
The issue was resolved by setting proper
Statement Caching, which is the caching of
executable SQL statements at Connection level.
A proper statement caching prevents
a. The overhead of repeated cursor
creation, and
b. Repeated statement parsing and
creation
Together, it also enables the reuse of data
structures in the client.
From analysis and testing, it was proven that
setting maxStatements to a sufficient number can
improve the Application's overall performance.
The value to be assigned was obtained by
multiplying the number of frequently executed
distinct queries by the maximum number of
Connections configured in the Connection pool
setting.
Performance advantage after the change was
implemented is given in Table 7
Test Scenario: Occurrences of waits for DB querying on checked-out statements
Prior to Fix: Occurring very frequently
After the Fix: Not occurring
Table 7 Performance benefit of proper SQL Statement caching
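The sizing rule described above (frequently executed distinct queries multiplied by the maximum pool size) can be expressed as a small sketch; the query and connection counts below are illustrative assumptions, not figures from the project:

```java
// Sketch of the maxStatements sizing rule described in the text.
public class StatementCacheSizing {
    static int maxStatements(int distinctFrequentQueries, int maxPoolConnections) {
        // Cache enough statements so every pooled connection can hold
        // a prepared copy of each hot query.
        return distinctFrequentQueries * maxPoolConnections;
    }

    public static void main(String[] args) {
        int distinctQueries = 40;  // assumed count of hot queries
        int maxConnections = 25;   // assumed maximum pool size
        System.out.println("maxStatements="
                + maxStatements(distinctQueries, maxConnections));
    }
}
```

With the assumed 40 hot queries and a 25-connection pool, maxStatements would be set to 1000.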
3 Fix Connection closed exception at the
beginning of transactions: During performance
testing, it was noticed that Connection Closed
exception was thrown while executing the first
SQL query of a transaction. This was identified as
being due to stale connections checked out from the
Connection pool. To rectify this, the parameters
testConnectionOnCheckout and
testConnectionOnCheckin were enabled in the C3P0
configuration properties. The issue was resolved after
this configuration change, but it introduced a slight
delay while checking out a Connection from the
pool. To avoid this delay, a combination of
Connection related parameters was used instead:
idleConnectionTestPeriod and
testConnectionOnCheckin. As a result, both the
idle test and the check-in test are performed
asynchronously, which leads to better
performance, both perceived and actual.
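The resulting combination can be expressed as c3p0 configuration properties; the key names follow c3p0's documented property naming, while the 300-second idle test period is an illustrative value, not one taken from the project:

```java
import java.util.Properties;

// Sketch: the asynchronous connection-testing combination as c3p0
// configuration properties (c3p0.properties style keys).
public class C3p0TestConfig {
    static Properties connectionTestProps() {
        Properties props = new Properties();
        // Test connections on check-in rather than synchronously on
        // checkout, avoiding the per-checkout delay described above.
        props.setProperty("c3p0.testConnectionOnCheckin", "true");
        // Periodically test idle connections so stale ones are evicted
        // before they are handed out (interval is an assumed value).
        props.setProperty("c3p0.idleConnectionTestPeriod", "300");
        return props;
    }

    public static void main(String[] args) {
        connectionTestProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The same keys can live in a c3p0.properties file on the classpath instead of being set programmatically.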
4 Fix inordinate delays while deleting records
from the DB: In the online Application, there is no
hard delete. All the records will be soft deleted,
and there is a batch job to purge or archive the
soft deleted records from the DB. The count of
records will be varying from thousands to millions
depending on the business relevance of the table
that represents the entity. The following two issues
were causing the delay.
Issue 1: During the delete of parent table, foreign
key constraints from the child tables are checked
to ensure that a parent with a child record is not
deleted until deleting the child record. If the
columns on which constraints are enforced are
not indexed, the check is done in form of a full
table scan.
The pseudo script for the tables prior to fix is as
below
Create Table Receipt_Parent (Receipt_Id, Receipt_name, Date); -- Parent table
Create Table Receipt_child (Child_id, Receipt_Id); -- Child table
Alter table Receipt_child add constraint receipt_child_chk foreign key (Receipt_Id) references Receipt_Parent (Receipt_Id); -- Foreign key
Delete From Receipt_Parent; --> Full table scan of Receipt_child
To rectify this issue, an index was created for the
foreign key validation as given below.
Create index index_child on Receipt_child (Receipt_Id);
Delete From Receipt_Parent; --> Index scan of Receipt_child using index_child
Issue 2: The foreign key check can be done at
the transaction level or statement level. By
default, the constraint check is immediate – that
means for each row being deleted from parent
table, child table is checked for matching
records. The constraint check can be done at
the transaction level for all the records covering
the transaction in one shot. This allows Oracle to
optimize the performance of the checks. A
performance benefit of 50% of the execution
time was observed while deleting 6 million records.
Alter table Receipt_child add constraint receipt_child_chk foreign key (Receipt_Id) references Receipt_Parent (Receipt_Id)
DEFERRABLE INITIALLY DEFERRED; --> Deferring the constraint
Delete From Receipt_Parent; --> Check child table Receipt_child once for the whole transaction
With these two fixes, the performance improved
as in Table 8
Test Scenario: Delete 6 million records from a parent table with references to other tables
Prior to Fix: More than 24 hours
After the Fix: Less than 2 hours
Table 8 Performance gain while deleting records - by indexing the foreign key and deferring the foreign key check
5 Prior PK Sequence generation for bulk inserts:
The batch program is expected to handle more than
three million records. It was found that two queries
are fired for every single row insertion: one for
generating the sequence, and the other for the row
insertion. Since the use case expects to insert
more than three million records, 2 hits per record
would contribute six million hits to the DB for insertion
alone.
As a solution, sequences are generated in
advance using a custom query. Number of
sequences generated is equal to the commit size,
and generated sequence numbers were applied to
each insert statements which result in only one db
hit for every insert statement, and one additional
hit for every commit interval. Performance gain
Following this approach, a performance gain of
10-12 % was achieved, when tested for 1.2 million
records.
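The pre-generation scheme described above can be sketched as follows. The names (`next_sequence_batch`, `insert_rows`) and the DB stub are illustrative, not the product's actual code; in Oracle, the batch of sequence values would typically come from a single query such as `SELECT seq.NEXTVAL FROM dual CONNECT BY LEVEL <= :n`.

```python
# Sketch: fetch one batch of sequence values per commit interval instead of
# one NEXTVAL round trip per row. All names and the DB stub are illustrative.

def next_sequence_batch(start, size):
    """Stand-in for one DB round trip, e.g.
    SELECT seq.NEXTVAL FROM dual CONNECT BY LEVEL <= :size"""
    return list(range(start, start + size)), start + size

def insert_rows(rows, commit_size=1000):
    db_hits = 0
    next_start = 1
    pending = []
    for row in rows:
        if not pending:
            pending, next_start = next_sequence_batch(next_start, commit_size)
            db_hits += 1                  # one hit to pre-fetch a batch of IDs
        row_id = pending.pop(0)
        db_hits += 1                      # one hit for the INSERT itself
        # ... execute INSERT with row_id ...
    return db_hits

# Naive scheme: 2 hits per row. Batched scheme: 1 hit per row plus 1 per batch.
hits = insert_rows(range(10_000), commit_size=1000)
```

With 10,000 rows and a commit size of 1,000, the batched scheme needs 10,010 hits instead of 20,000, which is the shape of the saving the paper reports.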
5. Production time re-factor
Performance engineering does not stop at go-live. In the production phase, the prime focus is active monitoring of the infrastructure and application to understand usage trends, resource consumption, demand, and impending issues (short term and long term). The observations from monitoring drive further optimization of the application. True validation of non-functional qualities happens in this phase, which makes it very important for a development organization aiming to refine its skills and reach the highest level of maturity. Some of the key lessons learned in this phase are described below.
1 Audit component queue item check-out attempt: It was observed that queue items were not being processed at the expected rate. If processing does not happen, the next attempt picks the same item from the queue, which repeats endlessly, and the queue accumulates items.
The issue was resolved by limiting the number of times the same item can be picked up. If processing still fails, the item is discarded by moving it into a separate queue for deferred processing. Following this approach, a healthy balance is maintained between the number of items left in the queue and its threshold buffer size.
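The bounded-retry scheme can be sketched as follows; the attempt limit, queue shapes, and function names are illustrative assumptions, not the audit component's actual code.

```python
from collections import deque

MAX_ATTEMPTS = 3  # illustrative limit; the real threshold is a tuning choice

def drain(queue, deferred, process):
    """Pop items from the main queue; items that keep failing are moved to a
    deferred-processing queue instead of being retried forever."""
    while queue:
        item, attempts = queue.popleft()
        if process(item):
            continue                            # processed successfully
        if attempts + 1 >= MAX_ATTEMPTS:
            deferred.append(item)               # park for deferred processing
        else:
            queue.append((item, attempts + 1))  # bounded retry

queue = deque([("ok", 0), ("bad", 0)])
deferred = []
drain(queue, deferred, process=lambda item: item == "ok")
# "ok" is consumed; "bad" lands in the deferred queue after MAX_ATTEMPTS tries
```

The main queue thus cannot accumulate permanently failing items, which is the balance the text describes.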
2 Thread blocks resulting in clocking issues for end users at random entry points: It was reported that some transactions take more than 10 minutes and time out. After some time, many transactions on that Application Server (a node in the App Server cluster) start getting blocked.
On deeper analysis, it was observed that the Application Server threads were stuck in the 'C3P0' DB connection pooling component, in WAITING state at sun.misc.Unsafe.park(Native Method).
It was identified that the Application http-nio thread going into a wait condition for 600 seconds or more is correlated to the blocked state of the C3P0PooledConnectionPoolManager-Helper thread, or the Task-Thread-for-com.mchange.v2.async.ThreadPerTaskAsynchronousRunner, when trying to close a Statement underneath a Connection that is still in use. This would not cause an issue for most JDBC (Java Database Connectivity) drivers, which are developed in compliance with the spec requirement that closing a Statement while the parent Connection is in use must be supported. From various sources it was understood that the Oracle JDBC driver does not comply with that part of the spec, and hence can freeze, leading to deadlocks.
The solution adopted was to set the C3P0 configuration flag 'statementCacheNumDeferredCloseThreads', which avoids the prolonged thread wait by deferring Statement close to a dedicated thread.
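For reference, the flag is supplied through ordinary c3p0 configuration; a minimal sketch in `c3p0.properties` form (the value 1 is the setting commonly recommended for this flag, shown here as an assumption rather than the product's verified configuration):

```properties
# Close cached Statements on a dedicated deferred-close thread instead of
# in-line, avoiding the freeze seen with the Oracle JDBC driver.
c3p0.statementCacheNumDeferredCloseThreads=1
```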
6. Conclusion
Following the approach of addressing performance
quality from the early planning phases of Software
Development, it was possible to address Application
performance effectively, and appreciate the close
linkage between architecture and performance quality.
7. Acknowledgements
Our sincere thanks to the product team for their
patience responding to our requests even during tight
development schedules. We would like to thank
Shameer Sulaiman, Lakshmi Hariharan, Ranjit
Poduval, Dhanesh Chakrapani (Product Manager,
IP&E), Krishnakumar G (Engineering Lead, IP&E) and
other key members of the product team who had
interacted with us and shared valuable points during
the study. Special thanks to Debiprasad Swain (SPD
Engineering Services Group Head, IP&E), Mohan
Jayaramappa (SPD CoE Head, IP&E), and Rajashree
Das (SPD CoE Architecture Group Lead, IP&E) for
the guidance and directions provided.
8. References
[1] http://docs.oracle.com/
[2] https://www.oracle.com/index.html
[3] https://projects.spring.io/spring-framework/
[4] http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html
[5] http://www.mchange.com/projects/c3p0/
The Docker Container Approach to Build Scalable and Performance Testing Environment
Author: Pankaj Rodge ([email protected])
ABSTRACT
Testing forms an integral part of the software development life cycle (SDLC). There are two main levels of testing: functional and non-functional. Along with functional testing, scale and performance tests, which are part of non-functional testing, play an important part in the testing phase.
Building test environments to execute scale and performance tests for products like SSL VPN requires emulating SSL VPN client logins and running tools like iperf simultaneously from multiple machines to check the throughput and the maximum number of concurrent connections supported by the SSL VPN gateway. This would require deploying a large number of virtual machines and creating the appropriate networking, increasing the cost tremendously. The same applies to building a scale test environment for a feature like DHCP, which requires testing with a huge number of DHCP clients. Using docker containers, we can overcome these challenges efficiently. This paper compares the advantages of the docker container approach versus the virtual machine based approach for SSL VPN scale and performance testing and DHCP scale testing.
1. INTRODUCTION
Figure 1
SSL VPN-Plus is a service on the NSX Edge gateway. It allows Internet-enabled remote users to securely connect to private networks behind the NSX Edge services gateway. Remote users can access servers and applications in the private networks.
The NSX Edge gateway supports four form factors depending on the system resources, and the maximum number of concurrent SSL VPN connections supported depends on the form factor: a compact edge supports 50 concurrent SSL VPN connections, while an X-large edge supports 1000. To test the maximum number of SSL VPN client connections supported per form factor, we would need to deploy 50 virtual machines for a compact edge and 1000 for an X-large edge. Using docker containers, these tests can be conducted on a single machine.
Figure 2
1.1 Scale Deployments using Virtual Machines
Figure 3
For creating scale, a large number of VMs must be deployed. These VMs take up a lot of system resources, as they require not only a full copy of the operating system but also a virtual copy of the hardware that the operating system needs to run.
1.2 Scale Deployments using Docker Containers
Figure 4
A container environment is created by first deploying a host operating system. A container layer, i.e. docker, is then installed over the host OS. Once that is done, container instances can be provisioned from the system’s available computing resources, and enterprise applications can be installed within the containers.
Container View inside Guest OS
Figure 5
2. Scale for SSL VPN Clients
SSL VPN scale test environments can be created using docker containers with the following workflow:
Figure 6
1. Install Docker.
2. Build the image with all the required dependencies.
3. Create a container.
4. Install the VPN client.
5. Connect to the VPN gateway.
6. Start iperf.
7. Measure the throughput.
8. Repeat steps 3 to 7 for n clients.
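Steps 3 to 7 above can be sketched as a command-generation loop; the image name, container names, gateway address, and the `vpn-connect` helper are hypothetical placeholders, and a real harness would pass each command to `subprocess.run()`.

```python
# Sketch of steps 3-7 as docker command construction. Image, gateway, and the
# vpn-connect helper are hypothetical; only the docker CLI shape is real.

def client_commands(n, image="sslvpn-client", gateway="10.0.0.1"):
    """Return the docker commands needed to run n VPN clients plus iperf."""
    cmds = []
    for i in range(n):
        name = f"vpn-client-{i}"
        cmds.append(f"docker run -d --name {name} {image}")       # step 3
        cmds.append(f"docker exec {name} vpn-connect {gateway}")  # steps 4-5
        cmds.append(f"docker exec {name} iperf -c {gateway}")     # steps 6-7
    return cmds

for cmd in client_commands(2):
    print(cmd)
```

For a compact edge, `client_commands(50)` would enumerate the 50 clients; for an X-large edge, `client_commands(1000)`.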
2.1 Measuring throughput on a container and on a VM
Throughput is measured on the docker VM and on a normal VM using a 1G interface connected to the SSL VPN gateway, which is two hops away from both VMs. Iperf is used to measure throughput.
2.1.1 Throughput on the container
- SSL VPN connection status
- Throughput with a single thread
- Throughput with ten threads in parallel
2.1.2 Throughput on the VM
- SSL VPN connection status
- Throughput with a single thread
- Throughput with ten threads in parallel
It is observed that throughput is almost the same on the container and on the normal VM.
3. Scale for DHCP Client
3.1 Networking within docker host
- Here the ens192 interface of the docker host is connected to the DHCP server (both the DHCP server interface and the ens192 interface of the docker host are in the same port group).
- A bridge interface was created and the ens192 interface was added into it.
Bridge Status
Docker network
Interface Status
3.2 Creating containers with dhcp client installed
It is observed that creating 500 clients took 51 seconds, i.e. roughly 100 ms per container, which is much faster than deploying VMs.
root@ubuntu:~# time ./create_container.sh 500
real    0m51.600s
user    0m8.720s
sys     0m3.272s
4. Future Work
Docker containers can similarly be used to execute scale tests for services like L2 VPN, IPsec VPN, load balancer, etc. Scale tests can also be automated easily using docker containers.
5. REFERENCES
[1] https://docs.docker.com/engine/tutorials/dockerimages
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Conflux: Distributed, real-time actionable insights on high-volume data streams
Vinay Eswara VMware India, Bangalore [email protected]
Jai Krishna VMware India, Bangalore
Gaurav Srivastava VMware India, Bangalore
Abstract
Systems and applications are most useful when they can react to events and data in real time, because the utility of the reaction typically decreases with time. In cloud environments with high data volume, identifying and filtering relevant data, processing it, and reacting in real time becomes challenging.
This paper presents the design and implementation of Conflux, a system for merging high volume streams of data in a scalable, fault tolerant way. It allows users to declaratively specify complex merge functions over data streams and evaluates these functions in real time. It also provides callout facility where applications can specify conditional actions in the context of the merged result stream. This allows applications to delegate stream merging and evaluation and focus on the resultant application specific reactions.
Consider a stock-market use case where news feeds, stock quotes, risk appetite and currency values must be continuously monitored. When a favourable situation occurs, the system must automatically place buy or sell orders to maximize profit within the specified bounds of risk. With Conflux, the operator only needs to express application actions as a function over the current and historical stock quote, risk and currency streams, and provide stubbed code for buying, selling or paging a trader. Similar use cases can be realized in domains such as IoT and fraud detection, where real-time reaction at scale is critical.
Conflux itself is deployed on a cluster to linearly scale with data volume. It tolerates node failures by transparently redistributing streams on the fly using consistent hashing [1] onto survivor nodes while compensating for any data loss.
1. Introduction
Stream processing systems are critical for making accurate decisions in rapidly changing environments. In scenarios with a multitude of disparate data streams that need to be grouped and merged on the fly, the following issues need to be addressed:
Grouping: Users should be able to specify grouping of data streams before any merge functions.
Merging: Streams of data need to be merged based on some pluggable, domain specific rules that compute the result that can be specified by users of Conflux.
Change: Changes to the merge function specified by the user and group updates as individual data sources enter or leave the group should reflect immediately in the result of the merge.
Consider an example of automating Dev-Ops responses when running a critical SaaS application:
Operators’ definition of ‘critical workload’ may vary with time. Additional mid-tier machines may be added to the cluster or decommissioned; completely new application clusters may be marked critical. Conflux needs to track the current snapshot of ‘criticality’ for the automatic DevOps system to be effective. That is, it should allow users to specify a group of ‘critical data sources’ and guarantee that the group is updated in real time.
Developers, maintainers and operators of a system would usually have knowledge about what constitutes a problem. These would be expressed as a relationship between data streams. Since the users of that system know best, it should allow the developers / maintainers / operators to specify the relationship in a way that does not require a recompile / redeploy of Conflux itself. A secondary
objective is that it should be as simple as possible to specify these merge relationships.
If either the grouping or the merge functions were changed, the changes should reflect in the composite merged stream immediately. Unless this is immediate, the reaction would be based on a stale view until the new declarations take effect.
Conflux is designed to address these high volume, low latency stream merging use cases. Typical use cases include IOT, Dev-Ops, faceted dash boarding and monitoring. It keeps users of the system insulated from scale, availability and performance concerns and allows them to focus on their application logic.
Conflux users can specify application-specific aggregates, define thresholds and provide callout actions. This application code is written in JavaScript and runs on the JVM, which allows declarations to be modified dynamically without a system recompile or redeploy.
Conflux does not address the problem of modelling using machine learning or other methods. It focuses on evaluating the results of an existing model function with low latency. It supports adding, deleting or modifying existing models in real time. It is hosted as a SaaS platform and supports a multi-tenant model. It uses consistent hashing to partition data streams, based on the ID of the data source emitting the stream, to parallelize memory and compute resources across the cluster and to cope with node failure.
In this paper, we propose:
i. A scalable, fault-tolerant, low-latency way to evaluate models, expressed as relationships between different high-volume streams.
ii. A way to define new models, redefine existing models and delete models dynamically in real time.
iii. A way to threshold results and invoke external HTTP callouts for application-specific actions.
The rest of the paper is organized as follows: Section 2 presents the background for Conflux and related work. Section 3 presents the system design. Section 4 describes the implementation of Conflux. Section 5 evaluates behavior and performance under different scenarios. Finally, Section 6 contains concluding remarks and directions for future work.
2. Motivation and related work
Conflux originated as a general monitoring system [12] for virtual machines that ran critical workloads. The following requirements in this system led to the conceptualization of Conflux:
Figure 1. The first part of the figure represents routing to conflux nodes using consistent hash of the stream IDs. The second shows ease of deployment/maintenance of conflux versus a typical distributed application.
i. Metrics were collected at the VM level. There was a need to provide rollups in real time for useful real-time dashboarding.
ii. It was useful to allow the customer to synthesize new metrics by merging metric streams from applications and underlying metrics with resource billing and allocation metrics, to provide real-time, low-granularity economic ROI dashboards.
iii. A logical extension was to invoke HTTP callouts, where tenants can embed advisory or remedial actions such as notifications and auto-scaling respectively.
There are several existing stream merging systems available [2,3,4,5,6,7,8,9]. Twitter’s Heron, Google MillWheel and Google Photon are the closest in intent to Conflux, but are closed source. Other systems that have inspired Conflux are Apache Spark, Apache Storm, Apache Samza, etc. The critical reason for building Conflux instead of using these systems was twofold:
I. It was an explicit requirement that all nodes of the cluster processing system be identical with respect to the software stack, i.e. that there are no ‘master’ or ‘slave’ nodes. We have found that this greatly simplifies deployment and upgrades in production, since the VM snapshots used for deployment are identical. The tuning of the different nodes and their CPU, RAM and disk configurations are all identical. This requirement also implies that when the Conflux cluster deployment is under high load, all nodes should be almost equally loaded: there should be no single point of failure or hot spotting. Also, adding new nodes to the cluster should uniformly reduce load across the cluster. This requirement eliminated all systems with master-slave topologies, as well as systems requiring nodes with heterogeneous software components. Homogeneous node types allow us to leverage established technology like VM snapshotting, storage and deployment to monitor and manage the cluster.
II. Another requirement was to be able to truly operate on streams ‘on the fly’. Thus, frameworks that do any form of in-memory caching, e.g. micro-batching in Apache Storm, were not considered.
3. System Design
a) Definitions
Agents: are external systems that push data to Conflux in a format that it understands.
Packet: is the atomic unit of input and output in Conflux. It is a set of (ID, Metric, Timestamp, Value) tuples, which allows for composition and pipelining of data operations. The ID is a universally unique ID corresponding to each data source.
Stream: is a logically unbounded sequence of tuples bearing the same ID.
Routing: is the process of consistent hashing the ID in each packet with the number of live nodes in the conflux cluster to decide which node to deliver the packet to.
Data Source: is the entity being monitored for observable phenomenon. It could be a virtual machine in the case of monitoring, a process in the case of log file streams or a sensor in IoT. A single data source generates a single stream.
Metric: is an individual, time-stamped, measurable property of a phenomenon being observed. A data source can emit multiple metrics; for example, a data source corresponding to a VM can emit CPU, RAM and IOPS metrics identified by the VM ID, and a stock ticker can emit stock quotes identified by the stock symbol as ID. Metric values are not restricted to numbers; strings and Booleans are allowed.
Tags: are empty metrics without timestamp/value tuples. They are used for logical grouping of streams.
Feed forward: when a node receives a packet with some ID ‘X’, it looks up all the groups that ‘X’ belongs to and for each group, retransmits the packet after applying the group specific transformation. Each transformed packet’s ID ‘X’ is overwritten with the respective Group ID. This process of retransmission is called feed forward.
Since the same Group Id is used for routing by all members of a given group, the result of the feed forward would always be homed to a specific node although individual members may be homed to different nodes. This allows for parallelization of pre-processing before merging the stream. [Figure 2]
Merging: is the process of joining multiple streams to create a single composite stream. Comparing
feed forward and merging to MapReduce, feed forward is like map in that it operates over individual streams and can be parallelized, while merging is analogous to reduce, which aggregates multiple values into one metric.
Ingestion: Conflux expects agents to emit streams in a specific format. An ID uniquely identifies each stream. The stream-to-node mapping is decided by a consistent hash of the stream ID against the nodes in the Conflux cluster. This ensures that in steady state a particular stream always lands on a particular node in the cluster. Streams are roughly equally partitioned between the different nodes, so nodes can cache metadata about the stream IDs homed to them. This provides horizontal scale, since the fire hose of streams is split among Conflux nodes in a clustered deployment. When nodes enter or leave the cluster, the streams are consistent-hashed among the current snapshot of cluster nodes. This ensures that the load balancing characteristics of the cluster are retained.
Groups: Conflux is unique in the way groups are treated. It exposes a synchronous API to create a group and add/delete constituent members. The create API expects a group name and a list of stream IDs that belong to the group, and returns the group ID. The add and delete APIs expect the group ID and the list of elements to be added to or deleted from the group.
Any group operation is guaranteed to disseminate across the cluster quickly. This is again done by consistent hashing IDs against nodes. Consider a group created with member IDs A, B and C, returning the group ID G. Apart from persisting this information into the persistent store, for each member ID Conflux sends out a message notifying the member that it now belongs to group ‘G’. The message is routed the same way as an agent message: by consistent hashing the member ID. This ensures that the group membership message reaches the same node where the corresponding stream is homed. Since this message is subject to acknowledgements and retries just like stream messages, it eventually reaches the correct node. To illustrate, consider member ‘X’ being deleted from the group ‘G’. Conflux persists the information and sends a message using ‘X’ as the routing key, specifying that the X-to-G mapping no longer holds. The X routing key ensures that the group membership change message lands on the same node that services ‘X’. The node updates its cache, removing X from group G. The same sequence of events happens
Figure 2. Every node maintains a cached list of groups that a stream homed on it belongs to. All group operations update the list, as shown in the first 2 diagrams. These lead to fast group updates. The 3rd ‘feed forward’ figure shows how this list is used for routing, enabling merging of grouped streams at one node. Since nodes owning groups is determined by consistent hashing and groups have unique IDs, Groups themselves are evenly spread across the cluster.
on member addition and group creation. If many members are added or deleted, Conflux sends the messages out in parallel; the remaining steps are identical.
b) Load Balancing, Scalability, Availability
Load balancing in Conflux is probabilistic, based on consistent hashing. For a given stream, i.e. an unbounded sequence of tuples with the same ID, routing is done by consistent hashing the ID against the live Conflux nodes in the cluster. Once a packet lands on the cluster, the same ID is used for persisting the data into Cassandra. This ensures that writes to Cassandra are always local.
The number of network hops needed for a packet to go from emission by an agent to persistence in Cassandra is exactly one. The probabilistic consistent hash algorithm ensures that all streams being ingested are roughly equally distributed across the Conflux clustered deployment.
If a node enters or leaves a Conflux cluster, changing its state, message routing in RMQ changes and Cassandra selects new nodes to write to, both per the latest cluster state. However, since the method of node selection is the same in both cases, i.e. consistent hashing the ID against the number of live nodes in the cluster, both select the same nodes for the same IDs; thus, writes always remain local. For the other nodes in steady state, which are not part of the cluster state change, the consistent hash algorithm ensures that streams homed to them remain homed as before, so the caches of these nodes are minimally impacted.
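The minimal-remapping property described above can be illustrated with a toy consistent-hash ring. The hash function, virtual-node count, and node names here are illustrative assumptions; Conflux itself routes via the RMQ consistent hash plugin rather than this code.

```python
import hashlib
from bisect import bisect

# Toy consistent-hash ring: each node contributes many virtual points; a
# stream ID routes to the first node point at or after its own hash value.

def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self.points = sorted((_h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))

    def route(self, stream_id):
        keys = [p for p, _ in self.points]
        idx = bisect(keys, _h(stream_id)) % len(self.points)
        return self.points[idx][1]

nodes = ["node-a", "node-b", "node-c"]
before = {sid: Ring(nodes).route(sid) for sid in (f"vm-{i}" for i in range(100))}
# node-c leaves the cluster: only streams homed on node-c are remapped
after = {sid: Ring(["node-a", "node-b"]).route(sid) for sid in before}
moved = sum(before[s] != after[s] for s in before)
```

Streams homed on the surviving nodes keep their home, so only the failed node's share of streams (and cached metadata) has to move.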
c) Deployment Architecture
Conflux is deployed in a 5-node cluster that spans 3 racks, allowing for one rack to fail. Connectivity is through the gigabit backplane IP of the data center to prevent network hops being the bottleneck. [Ref: Figure 3]
Each node is configured with 8 CPU’s (virtual), 32 GB RAM and 1 TB of hard disk.
RMQ is configured for configurable batch acknowledge. I.e. Conflux consumes messages in batches and acknowledges them. Details are provided in the implementation.
Due to the nature of consistent hashing, load tests without groups distribute packets almost identically across nodes, as expected. The standard deviation of percentage CPU is less than 3 when the average CPU is 95% across all 5 nodes in the cluster.
d) Reactive monitoring of individual streams managed by a group
Grouping allows users of Conflux to address and manage a large, transient list of streams with one group ID. Entries to and exits from the group will
Figure 3 A typical deployment of Conflux consists of an odd numbered set of nodes to allow for quorum election in Cassandra. The nodes are always local to a data center and spread across racks to guard against hardware failure. TCP/IP connectivity is through the data center backplane. In practice a set of 5 nodes is used.
be automatically taken into account, provided group updates arrive accurately from an external system watching group memberships. Consider a medical IoT device streaming data into Conflux, where the problem is to page a doctor when there is no heartbeat for a specified time for a given patient. In this example, the Conflux user will form a group consisting of heartbeat data sources (since there would be several other data sources in the system measuring blood pressure, temperature, etc.), formulate a ‘no heartbeat’ condition, and specify the paging callout for notifying a doctor. When any stream in the group of heartbeat data sources satisfies the ‘no heartbeat’ condition, the correct doctor is paged, based on the ID of the data source that caused the problem. This example can be extended to machine-learning-type use cases by specifying cost functions on individual metrics in a packet and calling out when a condition changes. In this example, the operation is on the individual stream; the group merely functions as a ‘tag’ for classifying classes of streams.
e) Merging streams based on groups
Groups also allow for merging streams based on the group ID. This is the more interesting use case of Conflux. Since a node maintains the mapping of the groups a stream belongs to, the node can apply a transform to the metrics and retransmit the data using the group ID as the routing key. Note that all members of a group transmit on the same group ID; as a result, all the retransmitted data from the children lands on the one node on which the group ID is homed. These streams are then merged with the merge function specified for the group. The result of the group merge computation can again be fed forward to compose further. Members of a group are expected to have the same sample frequency. There may be cases where groups would contain merges of many hundreds of streams, which would create hotspots. Conflux deals with this by:
i. Limiting group membership to 20.
ii. Breaking groups larger than 20 into subgroups of 20 elements or fewer, recursively, so very large groups are broken down into a tree-like containment hierarchy.
iii. This works for cases where the merge functions are associative, e.g. SUM, MAX, MIN, AVG etc.
iv. For non-associative merge functions like STDDEV, TOP-10 etc., the group size is restricted to 30.
v. In practice, large non-associative group merges are roughly expressed using a combination of associative merges over the large group and a final non-associative merge. For example, the TOP-10 over a very large group containing many hundreds of elements can be expressed approximately as the TOP-10 of 30 large groups of MAX.
This is a limitation of Conflux, since such a value may not always be accurate.
The other limitation with this design is that load balancing is achieved at the cost of increased network hops of the order of log(group membership limit). This is not very significant in practice.
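The recursive subgrouping for associative merges can be sketched as follows; the function names are illustrative, and the fan-out of 20 follows the membership limit stated above. For an associative function, merging the tree level by level gives the same result as merging the flat group.

```python
# Sketch: break a large group into subgroups of <= FANOUT members recursively
# and apply an associative merge (e.g. MAX) at each level of the tree.

FANOUT = 20  # group membership limit from the text

def tree_merge(values, merge=max, fanout=FANOUT):
    if len(values) <= fanout:
        return merge(values)               # small group: merge directly
    # split into subgroups, merge each, then merge the partial results
    partials = [tree_merge(values[i:i + fanout], merge, fanout)
                for i in range(0, len(values), fanout)]
    return tree_merge(partials, merge, fanout)

# Associative merges (MAX, SUM, MIN, ...) give the same answer over the tree
# as over the flat group; non-associative ones (STDDEV, TOP-10) generally don't.
result = tree_merge(list(range(1234)))
```

This is why the text restricts the scheme to associative functions and only approximates non-associative ones.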
Figure 4: Transforming, Grouping and merging streams
The stream that results from a group merge is indistinguishable from an incoming stream, i.e. group streams, sub-group streams and incoming streams are distinguishable from one another only by their unique IDs. Conflux treats all streams identically by persisting every incoming packet in Cassandra. The intermediate stages of calculation from sub-groups are persisted as well.
Since sub-groups are treated identically to groups, group change notifications travel up the tree and the group membership is correctly reflected.
f) Aligning time in streams
We propose the following approaches for aligning time in feed forward and merge functions:
i. Wall clock window, which produces an aggregate packet every time the wall clock interval is crossed, considering all packets in the interval.
ii. Sliding window, which produces an aggregate packet for every incoming packet, considering all past packets for the same (id, metric).
iii. Timeout per ID: all metrics with timestamps older than this timeout value are not considered for merge or feed forward.
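The wall-clock window approach can be sketched as follows; the window length, class and aggregate function are illustrative assumptions. Note how a late packet simply lands in its own window's bucket and re-triggers that window's aggregate, which is the behaviour the failure-handling section relies on.

```python
from collections import defaultdict

WINDOW = 60  # wall-clock interval in seconds (illustrative)

class WallClockWindows:
    """Bucket (id, metric) samples by wall-clock window and aggregate each
    bucket; a late packet re-triggers its own window's aggregate."""
    def __init__(self, agg=lambda vs: sum(vs) / len(vs)):
        self.buckets = defaultdict(list)
        self.agg = agg

    def ingest(self, stream_id, metric, ts, value):
        window = ts - ts % WINDOW            # start of the wall-clock interval
        self.buckets[(stream_id, metric, window)].append(value)
        return window, self.agg(self.buckets[(stream_id, metric, window)])

w = WallClockWindows()
w.ingest("vm-1", "cpu", 965, 10.0)    # window starting at t=960
w.ingest("vm-1", "cpu", 1010, 30.0)   # same window: running average 20.0
w.ingest("vm-1", "cpu", 975, 50.0)    # late packet re-triggers window 960
```

A sliding window would instead aggregate over all past samples for the (id, metric) pair on every arrival, and the per-ID timeout would drop the late packet entirely if it fell outside the threshold.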
g) Software stack
The software stack consists of:
Web layer: Play Framework
Message bus: RabbitMQ, with the consistent hash plugin for message routing
Persistent store: Cassandra
The unit of deployment of Conflux is a node. A node can be either a physical box or a VM (virtual machine). If VMs are used, care must be taken to deploy them on different physical hardware to mitigate the possibility of catastrophic hardware failure. The physical hardware, however, is assumed to be part of a single data centre, connected by redundant, high-bandwidth cabling; i.e. nodes cannot span geographical locations.
h) Handling failure
We consider two broad classes of failure:
i. Node failures: we rely on consistent hashing to redistribute streams to survivor nodes upon node failure. As described in section 3.b, traffic gets re-routed to available nodes. Batched acknowledgements and corresponding retransmits in RMQ ensure packets are not lost if a node goes down before processing.
ii. Packet re-transmits and delays: these are handled naturally for wall clock time alignment. When a packet arrives late, all calculations are triggered for the wall clock window that the packet belongs to. Feed forward ensures that these changes are reflected in all group aggregates that the ID belongs to. Out-of-sequence packets do not matter, since the aggregate values are eventually all correct. The system can recover from failure by replaying persisted packets from the time of failure.
Packets older than a certain threshold are discarded by the system, since they can trigger cascading re-calculations up to the present time.
i) Tradeoffs and Limitations
It is important to note that since Conflux is designed for low-latency stream operations, a conscious design tradeoff is to assume the absence of network partitions. Conflux is not designed for a multi-site deployment.
4. Implementation
Monitoring Insight [12], a vCloud Air alpha service offering, is in production with a subset of Conflux features.
RMQ [10] is used as the message bus and Cassandra [11] as the persistent NoSQL store.
The system was tested monitoring 80+ metrics from each of 3,000 VMs, collecting data every 5 minutes at 20-second granularity. This was simply too much data. A new 'subscription' feature was introduced whereby customers could subscribe, paying per VM ID; different subscriptions resulted in different granularities, with a default of 5 minutes instead of 20 seconds. This dealt with the data volume well. The subscribe action uses the same feed forward, routing on the VM ID to which the customer subscribes, so it comes into effect immediately. This is currently in production.
A basic JavaScript API for surfacing JMX metrics, log files and accessing the Cassandra database has been tested. This has been invaluable in programmatically running checks to poke around when a problem occurs on a running system.
Data was written into Cassandra with a default TTL of 30 days. A subscription option, just like the one for granularity, made it possible to store metrics for 45 or 60 days. For example, very high-granularity metrics could be stored for a longer time for important monitored elements.
Cassandra compaction was run weekly across the entire cluster to reclaim tombstone data.
5. Evaluation and Performance
These performance numbers are for a single VM that is part of a cluster. The ingestion capacity
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
increases linearly as new nodes are added due to the consistent hashing property. The chart in Figure 5 is based on 20-second granularity data from 2 VMs. Nodes acknowledged each individual packet, which led to poor ingestion performance and increased latency.
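The linear scaling claimed here rests on the standard consistent-hashing property that adding or removing a node remaps only that node's share of streams. A minimal illustrative sketch (not the Conflux implementation; node names, virtual-node count, and hash choice are assumptions):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Maps stream IDs to nodes; removing a node only remaps its own streams."""
    def __init__(self, nodes, vnodes=64):
        self._ring = []                      # sorted list of (hash, node)
        for n in nodes:
            self.add(n, vnodes)

    def add(self, node, vnodes=64):
        for i in range(vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def route(self, stream_id: str) -> str:
        h = _hash(stream_id)
        i = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {sid: ring.route(sid) for sid in (f"vm-{i}" for i in range(1000))}
ring.remove("node-b")                        # simulate a node failure
after = {sid: ring.route(sid) for sid in before}
moved = [sid for sid in before if before[sid] != after[sid]]
```

Only the streams previously owned by the failed node move, which is why capacity scales roughly linearly with node count.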
Turning off individual acknowledgement and enabling batch acknowledgement dramatically pushed up ingestion rates as shown in Figure 6.
As expected, failure recovery is batched as well. The pre-fetch size is set to 512, acknowledging 256 packets at a time, so there is always data in memory; this speeds up consumption from 335 packets per second to 1,277 per second.
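The effect of batching acknowledgements (as RabbitMQ supports via a `multiple`-style ack covering all deliveries up to a tag) can be sketched as a toy consumer model; the counts and the cost model are illustrative, not measurements:

```python
# Illustrative sketch of the batched-acknowledgement pattern described above;
# each "ack operation" stands in for one round-trip to the broker.
PREFETCH = 512
ACK_EVERY = 256

def consume(packets, batched):
    """Return the number of ack operations sent to the broker."""
    acks = 0
    unacked = 0
    for _ in packets:
        unacked += 1
        if not batched:
            acks += 1                    # one ack round-trip per packet
            unacked = 0
        elif unacked == ACK_EVERY:
            acks += 1                    # one ack covers the whole batch
            unacked = 0
        assert unacked < PREFETCH        # broker stops delivering otherwise
    if unacked:
        acks += 1                        # flush the final partial batch
    return acks

per_packet = consume(range(10_000), batched=False)
per_batch = consume(range(10_000), batched=True)
```

With a prefetch of 512 and acks every 256 packets, the consumer never stalls waiting for the broker, which is the mechanism behind the throughput jump reported above.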
These figures are from the RMQ monitoring console. The blue lines in the figures indicate messages being consumed and persisted to Cassandra.
6. Conclusion
Conflux is a critical enabler for the data mining services that an increasing number of companies are investing in. It allows users to react in real time to events defined as functions on data streams.
Allowing customers to define actions that monitor and execute on live streaming data is a vital requirement across domains. This could be used to automate DevOps, avert disaster scenarios, monitor IoT, or even execute buy/sell decisions on the stock market.
After historical data has been mined and collated into meaningful models, companies typically do not act on it in an automated manner. Conflux not only provides the means to act fast, but also to update models with the latest learning in real time.
7. Future work
Handling large groups, and handling data sources with very different packet emission frequencies, leads to skew, since all keys do not correspond to the same load. Thus, load balancing is a subject for future work.
Cassandra and RMQ pick different nodes for the same ID, although both use consistent hashing. This leads to an additional network hop. Tuning the hash functions across the two systems so that they select the same node is another subject for future work.
Purely dynamic groups, whose members are defined by a function, are also targeted for future work: if the function returns 'true', the packet at the child is fed forward to the group ID.
Error handling in sliding windows is a further open item.
Figure 7. Data collected at 20-sec. intervals, for 2 VMs, over an hour, displayed in a web UI. This could be any monitored element or IoT device.
Figure 5. Ingestion performance with individual message acknowledgements.
Figure 6. Ingestion performance with batched acknowledgements.
8. References
[1] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., and Lewin, D. 1997. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing (El Paso, Texas, United States, May 04-06, 1997). STOC '97. ACM Press, New York, NY, 654-663.
[2] Apache Samza. http://samza.incubator.apache.org
[3] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)
[4] Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, Shivakumar Venkataraman: Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams. SIGMOD 2013: 577-588
[5] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, Siddarth Taneja: Twitter Heron: Stream Processing at Scale. SIGMOD 2015: 239-250
[6] Kestrel: A Simple, Distributed Message Queue System. http://robey.github.com/kestrel
[7] S4 Distributed Stream Computing Platform. http://incubator.apache.org/s4/
[8] Spark Streaming. https://spark.apache.org/streaming/
[9] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthikeyan Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy V. Ryaboy: Storm@Twitter. SIGMOD 2014: 147-156
[10] RabbitMQ: http://www.rabbitmq.com/
[11] Apache Cassandra: http://cassandra.apache.org/
DB Volume Emulator
Rekha Singhal
Tata Consultancy Services Research, Mumbai, India
Amol Khanapurkar
Tata Consultancy Services Research, Mumbai, India
ABSTRACT
Execution of analytic queries on large, growing data volumes has led to the challenge of assuring application performance over time. There is a need to understand a query's execution path at compile time, in the development environment, so that it can be tuned appropriately to assure performance on large data volumes.
Data processing engines such as Oracle and Apache Calcite use cost-based optimizers to decide a SQL query's execution plan at compile time. In this paper, we discuss how these cost-based optimizers can be used to understand SQL query execution paths without generating large data volumes in the development environment; a tool assisting this is referred to as a 'DB Volume Emulator'. We present a working DB volume emulator tool for RDBMSs and extend the idea to new big data processing platforms such as Spark, Apache Calcite and Impala. This tool can be exploited to understand and schedule the SQL queries in a workload for optimal execution.
Keywords
Big Data, SQL, Performance, Prediction, Data Volume
1. What is a DB Volume Emulator
Structured Query Language (SQL) has been widely accepted as a high-level language to formulate analytic queries on data. It is the most popular means of extracting data from underlying data storage. All relational databases, such as Oracle and Postgres, and most of the newly available data processing engines, such as Spark and Hive, support SQL queries for data access.
Data processing applications often witness violation of guaranteed performance over time due to the increasing execution time of SQL queries on large data volumes. In general, a developer can get the execution plan for a SQL query on a small data size in the development environment, but has no knowledge of the execution path the database follows on large data volumes. The naïve approach of generating large data volumes at compile time is very expensive and not viable.
Data processing engines, such as relational databases, use a cost-based optimizer to decide a SQL query's execution path. These cost-based optimizers decide the execution path using knowledge of sample data, in the form of statistics maintained by them. A DB volume emulator exploits this feature of the optimizer: it can emulate the behaviour of the data processing engine on a large data volume and output a SQL query execution plan without actually generating data in the development environment. This execution plan may be used to estimate the SQL execution time in isolation, using information from the plan such as the type of each operator, the input data size of each operator, etc. We have used one implementation of the DB volume emulator to estimate SQL execution time for large data sizes using a sample of data in the development environment, with an average prediction error of 10% [1, 2, 5]. This model may be extended to estimate average execution time for a concurrent workload on larger data sizes; however, the current implementation is limited to SQL query execution in isolation.
A DB volume emulator can be used for the following:
- Understanding SQL query execution steps for large data sizes at compile time; this saves resources and shortens the time to deployment.
- Tuning SQL queries in advance, at compile time, to assure performance over a period of time.
- Integrating with a 'SQL Execution Time Predictor' [1] to estimate SQL query execution time for larger data sizes in the development environment.
- Helping a performance testing team generate small data sets representing future volumes for performance testing.
- Estimating the minimum number of rows required in the testing environment to comply with the performance-testing time budget [3].
CODD [3] is an example of an open-source DB volume emulator. It has been tested on the TPC-H benchmark and emulates only a linear growth pattern. We have also built a DB volume emulator addressing the larger research problem of estimating SQL execution time on big data processing platforms such as Oracle, Hive and Spark, without generating large data sizes in the testing or development environment. We present our approach to building a DB volume emulator in this paper.
2. Architecture of DB Volume Emulator
Volume emulation depends on the configuration of the underlying data processing engine (such as Postgres or MySQL), the schema or metadata of each data object, and how the data is growing in terms of data object size and data value distribution. Growth of data is projected from business growth and calculated using a mapping of business entities to logical entities in the data processing engine. For example, if a retail business has 'X' customers and each customer is represented in the form of two relational tables, then growth to 2X customers may mean an increase to 4X rows in Table 1 and 10X rows in Table 2.
A database volume emulator takes as input the DB server configuration details, the schema of the data, and the projected growth details (sizes, data distribution in the form of histograms, and correlation across data objects) of all data objects, to create an environment of a content-less emulated database with the projected data sizes, as shown in Fig 1. We have built the DB emulator for relational databases, so the further discussion is in the context of RDBMS and relational data objects.
Figure 1: DB Volume Architecture
Database Server Details
1. The type of data processing engine: Postgres, Oracle, MySQL, Apache Spark, Apache Hive, etc.
2. Its API to collect and extrapolate engine-specific metadata statistics.
3. Database configuration parameters, such as working memory size and multi-block read count for an RDBMS. These are huge in number, so a user may set up an instance in the development environment and either export all these details to a file or give the DB volume emulator direct access to the instance.
Meta Data of Data
1. The information about the metadata of the data is stored in the form of statistics in the database.
2. For relational data objects, this means the metadata of the tables storing the data (average row length, number of rows, etc.), the metadata of each column in a table (type of column, distinct values, minimum value, maximum value, data distribution, etc.) and the metadata of each index on a table (type of index, column name(s), rows accessed). Some of these statistics are sensitive to data size and are extrapolated for volume emulation; the details can be seen in our earlier work [4].
3. These details can either be entered using a front end, as in CODD [3], which is quite complex and needs a DBA to understand the meaning of each metadata statistic, especially for indexes; or, alternatively, the tool can be given access to a small-data-size instance in the application development/testing environment and collect all metadata, which is then extrapolated appropriately for the projected data size. Please note that the exported metadata corresponds to the small-size database.
Growth Details
1. The projected number of items/rows in each data object/table in the database.
2. The projected table sizes in the database, with average row length.
3. Each entity/column value domain: for each column in each table, its distinct, minimum and maximum values, and how these values change with an increase in the number of rows in the table.
4. The data value distribution in a column, and how it changes with growth in data:
a. Captured using histograms – frequency or height-balanced.
b. Simple – maximum value constant, uniform growth, linear, or non-linear (in the form of a function such as square).
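Item 4(a) mentions frequency and height-balanced histograms. A height-balanced histogram stores bucket endpoints chosen so that each bucket covers roughly the same number of rows, which makes skew visible to the optimizer. A minimal sketch (illustrative; real CBO histograms also record repeat counts for popular values):

```python
def height_balanced_histogram(values, buckets):
    """Bucket endpoints such that each bucket covers ~len(values)/buckets rows,
    similar in spirit to the height-balanced histograms kept by a CBO."""
    data = sorted(values)
    n = len(data)
    # endpoint i is the value at the (i/buckets)-quantile of the data
    return [data[min(n - 1, (i * n) // buckets)] for i in range(1, buckets + 1)]

# Skewed column: many repeats of one small value, a few large ones.
col = [1] * 90 + list(range(100, 110))
eps = height_balanced_histogram(col, 5)
# Most endpoints land on the frequent value, so the skew is visible.
```

Extrapolating such a histogram for a projected data size is one of the "growth details" the emulator applies.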
Working of DB Volume Emulator
The DB volume emulator exploits a feature of the cost-based optimizer (CBO) used by most SQL query processing engines to choose an optimal execution plan for a SQL query. The CBO stores metadata for each of the objects created in the database in the form of statistics. These statistics are stored in the relational database itself, in the form of catalog tables accessible with DBA permission. These statistics may be gathered automatically to obtain metadata about the data loaded in the database before choosing an execution plan for a SQL query. Most relational databases, such as Oracle, also provide APIs/stored procedures to change the values of these statistics manually. We have exploited this feature of the CBO for volume emulation. We have identified, for all data objects created in a database, the statistics that are sensitive to properties of the data such as data size and data distribution; these have been presented in detail in [4]. For example, an increase in the number of rows of a table will increase the value of NUM_ROWS (a statistic maintained in the DB) for the table. Below we show the metadata statistics sensitive to data size and its properties, as maintained by Oracle.
Table-sensitive statistics: The table statistics of all the user tables in the database are stored in system tables such as user_tables. This table stores all the physical details of the tables; however, variation in query performance with increase in database size is most sensitive to the number of rows in the tables being accessed, the number of blocks allocated to the table, and the average row length.
Column-sensitive statistics: The column statistics of all tables are stored in a system table such as user_tab_columns. The most sensitive column statistics are the number of distinct values in a column, the number of nulls in a column, the density, the low value/high value, and the histogram.
Index-sensitive statistics: The physical storage details of the various indexes created on column(s) are stored in a system table such as user_indexes. The index-sensitive statistics are the number of leaf blocks allocated to an index, the number of levels in the B-tree, the clustering factor, the number of rows on which the index is created, the distinct values, the average number of data blocks per key, and the average number of leaf blocks per key.
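The table-level case can be sketched as scaling the sensitive values and rendering the Oracle `DBMS_STATS.SET_TABLE_STATS` call that writes them back. The linear scaling rule and the sample numbers are illustrative; the real tool supports non-linear rules as well:

```python
def scale_table_stats(stats, growth):
    """Extrapolate size-sensitive table statistics by a growth factor.
    (Sketch: linear rule only; row width is assumed unchanged by growth.)"""
    return {
        "num_rows": int(stats["num_rows"] * growth),
        "blocks":   int(stats["blocks"] * growth),
        "avg_row_len": stats["avg_row_len"],        # per-row width unchanged
    }

def set_stats_call(owner, table, s):
    """Render an Oracle DBMS_STATS.SET_TABLE_STATS call for the scaled values."""
    return (f"EXEC DBMS_STATS.SET_TABLE_STATS(ownname => '{owner}', "
            f"tabname => '{table}', numrows => {s['num_rows']}, "
            f"numblks => {s['blocks']}, avgrlen => {s['avg_row_len']});")

dev = {"num_rows": 150_000, "blocks": 2_048, "avg_row_len": 120}   # sample values
scaled = scale_table_stats(dev, growth=10)
print(set_stats_call("TPCH", "LINEITEM", scaled))
```

Analogous `SET_COLUMN_STATS` and `SET_INDEX_STATS` procedures exist for the column- and index-sensitive statistics listed above.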
A replica of the database instance in the development environment is created, which has no data, only metadata and database configuration details. We also take as input the projected growth pattern for each business object, which is mapped to each data object in the database. Each statistic is either retained as-is, if it is not sensitive to the given growth details, or extrapolated per the instructions given in the growth details. A column value may extrapolate linearly, remain constant, or grow quadratically with the overall growth factor. The extrapolated values are put back into the database statistics using database-specific APIs/stored procedures. Now the CBO perceives the empty database to have the projected data sizes and distributions given in the growth details.
3. DB Volume Emulator Implementation
We have built the DB volume emulator in Java for relational databases. The emulator can run from a Windows or Linux machine and accesses the development and emulated environments over the network. The user provides credentials for both environments as input to the tool. Table growth details are given, in terms of the expected number of rows in each table, as a CSV file. Growth details are provided for each column of every table as linear, constant or non-linear: 'constant' means the column's maximum value remains the same in the emulated environment; 'linear' means the column's maximum value increases linearly with the table-size growth factor; 'non-linear' means the column's maximum value is a function of the table-size growth factor, which is given by the user.
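The three per-column rules can be sketched as follows (a simplification of the tool's behaviour; the example values and the square function are illustrative):

```python
# Sketch of the per-column extrapolation rules described above; the
# non-linear rule takes a user-supplied function of the growth factor.
RULES = {
    "constant":   lambda high, g, f=None: high,
    "linear":     lambda high, g, f=None: high * g,
    "non-linear": lambda high, g, f=None: f(high, g),
}

def extrapolate_high_value(high, growth, rule, fn=None):
    """New column maximum under the given growth rule."""
    return RULES[rule](high, growth, fn)

assert extrapolate_high_value(500, 10, "constant") == 500      # e.g. a status code
assert extrapolate_high_value(500, 10, "linear") == 5000       # e.g. a sequence key
assert extrapolate_high_value(500, 10, "non-linear",
                              fn=lambda h, g: h * g * g) == 50000   # square of factor
```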
The emulated environment is created with credentials given by the user. All the DB server and data metadata statistics are exported from the development environment and imported into the content-less emulated environment.
The sensitive statistics of tables, columns and indexes are extrapolated per the user-given growth details and put back into the emulated environment using the database's stored procedures for setting statistics manually. The tool informs the user on finishing the extrapolation with the message "Emulated Environment Ready".
We tested the tool on the TPC-H benchmark on Oracle 10g. We generated a 1 GB database in the development environment and emulated it to a larger size, such as 10 GB, using the tool. We also generated a 10 GB database and created a real 10 GB environment for validation purposes. We checked the correctness of the tool by matching, for the 22 TPC-H SQL queries, the explain plan, expected rows and cost in both the emulated and the real environments [4], as shown in Fig 2.
Figure 2: SQL Query Explain plan in Oracle 10g
4. Volume Emulation in Big Data Platforms
Large-scale data processing has led to the advent of many new big data platforms for parallel processing of data. Most of them have extended support for optimizer-based SQL query execution. These big data platforms work on a cluster of nodes, where it is more challenging to understand how a job will be executed; this knowledge may help users tune either their job or the platform for optimal execution. Whether it is an RDBMS or a big data product, an optimizer will always have a plan, which will always have a logical and a physical manifestation, plus a hook or API to manipulate the knowledge about the data (metadata) or statistics. The application of volume emulation to big data platforms is a thought experiment.
Most big data platforms with SQL query support do not have explicit tools/methodology for tuning SQL. The majority of the available platforms' optimizers focus only on reducing the number of rows to be processed by an operator in the SQL query DAG, so they end up pushing filter operations down to leaf nodes to propagate fewer rows to parent operators in the DAG. They have not exploited metadata to further optimize the order of operations in the SQL query DAG. Since we have witnessed success with our past work in the context of relational databases, we wish to extend the idea of the volume emulation framework to big data processing engines as well. A volume emulator for these platforms would facilitate SQL query tuning for the specific platform.
We shall discuss the case of DB volume emulators for three new big data platforms: Apache Hive (using the Calcite optimizer), Apache Spark (using the Catalyst optimizer), and Apache Impala for distributed query processing. Figure 3 shows a stack for SQL query processing on big data platforms. Most parallel data processing platforms support the Hadoop distributed file system (HDFS) for managing data stored across all nodes in the cluster. The data processing layer can have its own programming language and API to allow users to write their jobs; e.g., Hadoop uses the Map-Reduce paradigm and expects a user to write map and reduce functions for executing a job on Hadoop. Similarly, Spark supports the Scala language, with its own dictionary of operations, as well as Python and Java programs for data processing. Most of these data processing frameworks have extended their support to SQL query processing with an engine that can parse, translate and optimize a SQL query for the respective platform.
Figure 3: Big Data Processing Stack on cluster of Nodes
Apache Hive
Hive [7, 8] has been developed by the Apache Software Foundation to query data using the SQL language on top of the Hadoop MapReduce framework. It provides functionality similar to a relational database management system (RDBMS), which eases the migration of legacy analytic workloads. The Hadoop MapReduce framework is one of the earliest and most widely used parallel data processing engines in the big data community. It works on a cluster of nodes; data is stored across all the nodes and managed by the Hadoop distributed file system (HDFS). Hive is a data warehousing application on top of Hadoop MapReduce which allows the user to query, insert into, and partition the data stored in it. Using Hive, it is possible to generate tables and indexes (in the form of partitions) on data stored in HDFS. This data can then be accessed using a Hive-specific query language called HiveQL.
ID  Operation                     Name         Rows   Bytes  Cost
0   Select                                     102M   27G    105M
1   Nested Loop
2   Nested Loop                                102M   27G    105M
3   Table Access Full             Supplier     1280K  175M   7314
4   Index Range Scan              Partsupp_sk  80            2
5   Table Access by Index ROWID   Partsupp     80     11520  83
HiveQL is similar to SQL and is therefore easy to learn for people with a relational database background.
The Hive server, or HiveQL processing engine, maintains information about the metadata of user data in the form of statistics stored in relational catalog tables, accessed through the relational database connector specified in the Hive configuration file; the most commonly used are MySQL and Derby. The HiveQL processing engine parses a HiveQL query and translates it into an optimal execution plan as a directed acyclic graph (DAG) of stages, as shown in Fig 4. Each stage may comprise rename or conditional operators, and map-only or map-reduce jobs. A map-reduce or map job is executed, using Hadoop's parallel data processing capability, on data stored on HDFS. For a map-reduce job the data is a flat file; however, for HiveQL it is a table structure defined in Hive. A join may be executed as a map join, where the small table is replicated across all nodes, or as a reduce join, where the join is executed as a reduce job. Hive uses the Apache Calcite optimizer for building the HiveQL execution DAG using database statistics. These statistics are similar to the ones provided by popular RDBMSs such as Oracle and Postgres. If Apache provides an API to change these statistics manually (as in an RDBMS), one can explore how a HiveQL query would execute for larger data sizes with different data distributions, and build a volume emulator. The volume emulator can help in taking proactive actions, such as cluster sizing, for assuring performance. One of our works [13] uses the DAG to estimate HiveQL execution time in the development environment; [9] has used Hive optimizer statistics to further improve HiveQL execution plans.
Figure 4: Hive DAG for a HiveQL
Apache Spark
Hadoop map-reduce forces a linear way of programming for parallel processing of jobs. Apache Spark is a more flexible parallel data processing engine which keeps data in memory, in the form of Resilient Distributed Datasets (RDDs), for iterative processing. Apache Spark SQL combines both relational and procedural systems, which are required for a complete data processing pipeline, from ETL to querying of the data.
SparkSQL [10] is similar in syntax to relational database SQL or HiveQL. Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections (RDDs). Spark SQL uses an extensible optimizer called Catalyst. Catalyst [11] makes it easy to add data sources, optimization rules, and data types for domains such as machine learning.
A SparkSQL query is converted into a DAG of Scala functions supported by the Spark data processing engine, as shown in Fig 5. The Spark Catalyst optimizer currently considers the functional components of SparkSQL and optimizes its DAG by pushing the filter and grouping operators early in the execution DAG. The current versions, 1.6 and 2.0, do not collect or make use of data statistics for further optimization of the SparkSQL DAG. However, according to Ion Stoica (co-founder of Databricks)1, use of statistics, as in an RDBMS, is part of their plan to further optimize the SparkSQL DAG.
Figure 5: Spark DAG for a SparkSQL doing Aggregation
Apache Impala
Apache Impala [6] is an open source distributed massively
parallel SQL query processing engine which brings scalable
parallel database technology to Hadoop, enabling users to issue
low-latency SQL queries to data stored in HDFS and Apache
HBase without requiring data movement or transformation.
Impala is integrated with Hadoop to use the same file and data
formats, metadata, security and resource management frameworks
used by MapReduce, Apache Hive, Apache Pig and other Hadoop
software.
Impala supports a subset of HiveQL. Impala maintains
information about table definitions in a central database known as
the metastore. Impala also tracks other metadata for the low-level
characteristics of data files. Impala runs a catalog demon at each
1 I got chance to interact with him in VLDB2016 held at Delhi.
69
node of the cluster, which stores meta data information about the
data and node. Impala uses this meta data to decide DAG for
distributed execution of a SQL as shown in Fig 6.
Figure 6: Execution plan of an Impala SQL query on HBase
5. Conclusions
In this big data age, most analytic SQL queries process large data in the production environment. Given the complexity of a SQL query, it is imperative for the application manager to understand the SQL execution steps on larger data sizes at compile time, in the development environment. In this paper, we have discussed the design of a DB volume emulator which can emulate the behaviour of a data processing engine for large data sizes. The DB volume emulator can output a SQL query execution plan for a large data size at compile time without generating and loading large data. It exploits features of cost-based optimizers to manually change the metadata statistics to reflect growing data sizes and distributions instead of physically generating and loading the data. We have discussed an implementation which was tested on Oracle 10g with the TPC-H benchmark.
Since we have seen merits in having a volume emulator on statistics-based optimizers, and many big data SQL products have (or should have) a statistics-based optimizer on their product roadmap, building hooks in the product for volume emulation will be of immense value. In fact, having an in-built volume emulator will also serve as a training method to teach developers how to write optimal queries for the big data platform. Finally, we believe a lot of research methods on volume emulation for RDBMSs can be reused for conducting similar experiments on big data product platforms. These experiments could provide further guidance on better ways to add volume emulation to these platforms.
REFERENCES
[1] Rekha Singhal and Manoj Nambiar, Predict Performance of SQL on Large Data Volume, IDEAS, Montreal, Canada, July 2016.
[2] Rekha Singhal and Manoj Nambiar, Measurement based model to study the effect of increase in data size on query response time, Performance and Capacity, CMG 2013, La Jolla, California, November 2013.
[3] Rakshit S. Trivedi, I. Nilavalagan and Jayant R. Haritsa, CODD: Constructing Dataless Databases, in Proceedings of DBTest'12, May 21, 2012, Scottsdale, AZ, USA.
[4] Rekha Singhal and Manoj Nambiar, Extrapolation of SQL Query Elapsed Response Time at Application Development Stage, Indicon 2012, Kerala, India.
[5] Chetan Phalak and Rekha Singhal, Database Buffer Cache Simulator to Study and Predict Cache Behavior for Query Execution, Proceedings of DATA 2016, Portugal.
[6] http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_perf_stats.html
[7] https://cwiki.apache.org/confluence/display/Hive/StatsDev
[8] http://www.cloudera.com/documentation/archive/manager/4-x/4-8-5/Cloudera-Manager-Installation-Guide/cmig_hive_table_stats.html
[9] Anja Gruenheid, Edward Omiecinski and Leo Mark, Query Optimization Using Column Statistics in Hive, IDEAS 2011, September 21-23, Lisbon, Portugal.
[10] https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
[11] https://spark.apache.org/docs/1.1.0/api/java/org/apache/spark/sql/execution/ExplainCommand.html
[12] http://trafodion.apache.org/architecture-overview.html, Apache Trafodion
[13] A. Sangroya and R. Singhal, "Performance Assurance Model for HiveQL on Large Data Volume," in Proceedings of the International Workshop on Foundations of Big Data Computing, in conjunction with the 22nd IEEE International Conference on High Performance Computing, HiPC '15, December 2015.
Leveraging Human Capacity Planning for Efficient Infrastructure Provisioning
Shashank Pawnarkar, Tata Consultancy Services
Today, several infrastructure capacity planning techniques are available to make effective use of IT infrastructure and to enhance infrastructure performance. Infrastructure capacity planning is based not only on IT infrastructure technology but also on the business environment and business inputs. An analytical model based on historical business data, past trends, and parameters such as peak business times, user skillsets and transaction volumes is of paramount importance in deciding infrastructure capacity. The Human Capacity Planning technique involves calculating the optimal number of skilled resources required to complete a business operation at a particular time. This paper discusses how business inputs based on Human Capacity Planning help to improve and optimize efficient infrastructure provisioning.
Introduction
In typical infrastructure capacity planning, the inputs are based on sizing, peak load, server utilization and data/transaction volumes, whereas business-driven Human Capacity Planning gives the exact demand pattern, stating the optimal number of resources required and when (time and date), a forecast of volume (number of transactions), and the time taken to complete a transaction. The output of Human Capacity Planning aids in determining the exact count of resources (number of users) required, the volume of transactions, and the demand patterns, which further help to calculate the exact requirement on server infrastructure. Human Capacity Planning provides exact patterns of peak usage during the year, month, day or time of day; thus, its output complements the traditional method of infrastructure capacity planning.
Offshore BPS (Business Process Services, often called BPO – Business Process Outsourcing) typically provides back-office business services to global customers spread across multiple countries and geographies, wherein offshore agents with varying skillsets work in different shifts and deliver the work within stringent SLAs (Service Level Agreements). Even though there are fluctuations in workload demand and volumes, agents have to deliver the work with very high accuracy and within the shortest possible time. Any miscalculation in the demand forecast often leads to a shortage of human resources when work volume is high, or an excess of resources when work volume is low. These kinds of misjudgments in Human Capacity Planning result in heavy loss of productivity and increased operational cost, while at the same time core IT servers are under- or over-utilized. In case of peak demand, server response degrades, resulting in further delayed server response, loss of productivity for agents, and loss of business credibility for the organization.
The average transaction handling time (AHT) is a key success parameter for the agent and, in turn, for BPS operations. While core IT applications and infrastructure play a crucial role in determining response time and AHT, customers are not always formal and clear about how to assert that their system performance is adequate for the BPS to work well. The BPS works under the constraint that the customer owns and controls the IT applications and infrastructure, which are difficult to change according to the needs of the BPS workload; hence the best strategy is to provide minute-level Infrastructure Capacity Planning inputs to the business.
Figure 1: Infrastructure Capacity Planning driven by Human Capacity Planning
We have a leading-edge Human Capacity Planning solution which analyzes structured and semi-structured data, enabling consistent business insights and accurate prediction of human workforce capacity, thus driving productivity savings of millions of dollars in operating costs. The solution leverages analytics, forecasting and optimization techniques to predict the required human resources based on demand and volume. The forecasting is typically based on regression techniques, whereas the optimization is based on linear programming.
Human Capacity Planning Solution In the BPS world, operating discipline is essential to remain competitive, and to gain a leading edge you need a comprehensive Human Capacity Planning solution that caters to typical BPS processing needs, be it human resource planning, allocating work to the available capacity, or monitoring work execution to ensure that SLAs are met in an optimal manner.
In the Human Capacity Planning solution, forecasting and optimization leverage advanced analytics at a granular level (down to 15-minute time periods), taking into account operational and environmental dependencies such as shift timings, skill-wise AHT (Average Handle Time), volume forecast and SLAs (Service Level Agreements).
Challenges in Human Capacity Planning:
1) BPS operations are often challenged to handle high work volumes with the available staff. Staff may be over-utilized during periods of high work volume, or under-utilized during periods of low work volume. Managing this manually is a cumbersome and ineffective process which leads to over-allocating resources when demand is low and under-allocating them when demand is high. Floor managers often rely on spreadsheets to manage Human Capacity Planning; in such a scenario, getting a comprehensive understanding of all the parameters which impact capacity planning is a mammoth task.
2) In the BPS environment, the customer expects the service provider to meet the operational SLAs under varying degrees of workload, at lower cost and with higher accuracy. Without a technology solution, BPS floor managers struggle with under/over-staffing, mismatched skill sets and the inability to predict workload volume, all of which translates into higher cost of operations, missed operational SLAs and customer dissatisfaction. Missed SLAs can also incur financial penalties payable to the customer.
Challenges in Infrastructure Capacity Planning
1) One of the key challenges is to determine the right amount of resources required for the execution of work, in order to minimize financial cost and maximize resource utilization. Infrastructure resources must be utilized properly to reduce cost while maintaining Quality of Service parameters such as throughput, response time and performance, and while adhering to Service Level Agreements (SLAs).
2) BPS agents work on core IT applications and server infrastructure provided and supported by the customer. This means that in cases of peak workload, BPS agents are restricted by the available server infrastructure, which often delays completion of the agents' tasks, resulting in operational inefficiency and missed target SLAs.
Alternative Solutions Evaluated Agent Management System (AMS) and Work Force Management (WFM) tools (internal) were evaluated before building a new solution. Our findings are summarized in the sections below.
1. Agent Management System
The Agent Management System (AMS) has been designed to generate workload/schedules using volume-based SLAs: for example, if 100 transactions are received between 3:00 PM and 3:30 PM, 80% must be finished before 3:30 PM. The key requirement for the Capacity Planning solution, however, is to generate workload/schedules using time-based SLAs. Since time-based SLAs and volume-based SLAs are two different paradigms, AMS does not fit the requirements of the Capacity Planning solution.
2. Work Force Management Tool
The Work Force Management (WFM) tool is a queue management tool. However, in WFM all the key characteristics required for workload generation are coded into the tool and are not user-configurable. Moreover, the tool does not cater to our requirements for country-specific SLAs. Since the key requirement of the Capacity Planning solution was an end-user-configurable solution taking into account process characteristics, BPS people profiles, SLA requirements and other environmental attributes to enable semi-automated queue management, WFM was not deemed fit.
3. Other Commercial Solutions
Other commercially available solutions were also evaluated. None of them had the ability to deal with varying/dynamic cut-offs.
Human Capacity Planning - Solution Description The Capacity Planning solution uses a scientific approach to arrive at the most optimal resource plan to meet the required SLAs. It uses industry-standard mathematical algorithms to analyze, forecast, schedule and monitor resource requirements, using techniques such as regression analysis and linear programming.
Figure 2: High level schematic of Human Capacity Planning Solution
The typical inputs to the Capacity Planning algorithm are: - AHT (Average Handling Time) per product and sub-product
- Skill set
- Work shifts (given that typically BPS business runs 24x7)
- Operational SLAs (which can be both fixed and variable)
- Resource cost
- Target utilization level
- On floor constraints (like shift timings/duration, allowable overtime).
These input parameters are worked upon along with: - Historical workload data (the solution typically needs 3 years of historical data)
- Customer prediction of upcoming volume (could be for the next month or quarter)
Mathematical Models: The heart of the Capacity Planning solution is the mathematical model. The solution uses industry-standard mathematical models to predict future volume from historical volume. It supports five different forecasting models, viz.
1. Time Series Decomposition
2. Single Exponential
3. Double Exponential
4. Triple Exponential (Winters Model)
5. Auto ARIMA (Auto-Regressive Integrated Moving Average Model)
Figure 3: Volume Forecast based on Mathematical Models
Volume forecasts are generated using all five models, and the best-fit model is recommended by the solution on the basis of MAPE (Mean Absolute Percentage Error) and AIC (Akaike Information Criterion)/BIC (Bayesian Information Criterion) parameters. The Auto ARIMA model offers the advantage that data elements are regressed and averaged to fit an approximation to any time series, giving high prediction accuracy. The volume forecast is generated at a granularity of 15-minute intervals throughout the day and can be downloaded as a spreadsheet. In addition to the volume forecast, the Human Capacity Planning solution also generates a shift roster which gives the number of human resources needed to finish the incoming transactions while meeting SLAs.
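The model-selection step described above can be sketched in a few lines. The following is a hypothetical illustration, not the solution's actual code: it fits two of the five models (single exponential smoothing and Holt's double exponential smoothing, with assumed smoothing constants) to a toy 15-minute volume series and recommends the one with the lower MAPE.

```python
def single_exp(series, alpha=0.5):
    """One-step-ahead forecasts via simple exponential smoothing."""
    level = series[0]
    forecasts = []
    for y in series:
        forecasts.append(level)          # forecast made before observing y
        level = alpha * y + (1 - alpha) * level
    return forecasts

def double_exp(series, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts via Holt's linear (double) smoothing."""
    level, trend = series[0], series[1] - series[0]
    forecasts = []
    for y in series:
        forecasts.append(level + trend)
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# Toy historical volume per 15-minute interval (steadily rising demand).
volume = [100, 110, 122, 135, 149, 164, 180, 198]

errors = {"single_exponential": mape(volume, single_exp(volume)),
          "double_exponential": mape(volume, double_exp(volume))}
best = min(errors, key=errors.get)
print(best)   # the trending series favours the double (Holt) model
```

On a trending series the single-exponential forecast lags behind, so the MAPE criterion picks the trend-aware model, which is exactly the role the best-fit recommendation plays in the solution.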
The mathematical model executes in two stages: - Forecasting
- Optimization
Forecasting pertains to arriving at the resource count needed to manage the projected workload. The number of resources needed for the required time period can have granularity down to minutes, as per the AHT required to process the incoming workload. That is, the forecasting algorithm can, say, recommend the number of resources needed at 15-minute intervals if the AHT/SLA for a specific task demands such fine-grained projection.
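The workload-to-headcount conversion above can be illustrated with a minimal sketch. The AHT, interval length and target-utilization figures below are assumed sample values, not parameters from the actual solution:

```python
import math

# Convert a per-interval volume forecast into a per-interval agent count:
# workload in agent-seconds is volume x AHT, spread over a 900-second
# (15-minute) bucket at a target occupancy below 100% to leave SLA headroom.

INTERVAL_SECONDS = 900        # one 15-minute planning bucket
AHT_SECONDS = 180             # assumed average handling time per transaction
TARGET_UTILIZATION = 0.85     # assumed target occupancy per agent

def agents_needed(volume):
    """Minimum whole agents to clear `volume` transactions in one interval."""
    workload_seconds = volume * AHT_SECONDS
    capacity_per_agent = INTERVAL_SECONDS * TARGET_UTILIZATION
    return math.ceil(workload_seconds / capacity_per_agent)

forecast = [40, 95, 120, 60]   # toy forecast volumes for four intervals
roster = [agents_needed(v) for v in forecast]
print(roster)                  # [10, 23, 29, 15]
```

The ceiling makes the count conservative: a fractional agent requirement always rounds up to a whole person, which is what keeps the SLA safe at the 15-minute granularity.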
The Optimization stage, following Forecasting, then recommends how the resources should work in shifts, taking into account all BPS shift-related nuances (such as shift index, shift start time, shift end time and breaks) in addition to nuances related to workload processing (such as skill-specific AHT and product/sub-product-wise SLAs). The solution can generate the shift schedule on its own or use a pre-defined shift schedule. Forecasting/Optimization is supported by a built-in "What-If Analysis" component, which enables analyzing the impact of changed input parameters on the generated roster.
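The shift-assignment problem of the Optimization stage can be posed as a covering problem: pick a headcount for each shift so that every interval's demand is met at minimum cost. The paper states this is solved with linear programming; the sketch below uses exhaustive search instead, which is enough to illustrate the formulation on a tiny invented instance (the demand vector and shift patterns are assumptions):

```python
from itertools import product

demand = [10, 23, 29, 15]          # agents required per interval (e.g. from forecasting)
# Which intervals each candidate shift covers (toy shift patterns).
shifts = {"early": [1, 1, 0, 0], "mid": [0, 1, 1, 0], "late": [0, 0, 1, 1]}
names = list(shifts)

best_cost, best_plan = None, None
for counts in product(range(0, 31), repeat=len(names)):
    covered = [sum(c * shifts[n][i] for c, n in zip(counts, names))
               for i in range(len(demand))]
    if all(cov >= need for cov, need in zip(covered, demand)):
        cost = sum(counts)         # unit cost per agent per shift
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(names, counts))

print(best_cost, best_plan)
```

A real instance, with breaks, overtime caps and skill constraints, has far too many variables for brute force, which is why a linear-programming solver is the right tool; the objective and constraints, however, are exactly the ones enumerated in this toy version.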
Case Study – Human Capacity Planning for a Global Logistics Major
A global logistics major has outsourced data entry operations for 13 countries. The business has a large seasonality impact, due to which staffing requirements keep varying, and it is imperative for the offshore team to plan and roster accordingly so that they do not carry more or less staff than required. This criticality is compounded by a metric named cut-off; there are 165 cut-offs across the day that need to be adhered to. Hence it was of paramount importance to identify a solution to this complex problem which would help the client meet its objectives.
Inputs to the Human Capacity Solution:
Figure 4: Input Data points to Human Capacity solution
Outcome of the Human Capacity Solution - The Capacity Planning solution provides predicted volume at a granularity of 15 minutes for every type of transaction. The volume forecast can be used to identify monthly/weekly trends in the incoming volume and accordingly plan the infrastructure provisioning needed for each type of transaction.
Figure 5: Output of Human Capacity Solution which provides minute level granular volume details and resource (user) count
Infrastructure Capacity Planning – With 200 or more concurrent users doing data entry operations, server response time often degrades. Human Capacity Planning helps to find the volume forecast and shift roster (number of users) at a granularity of 15 minutes for a given transaction type. Looking at transaction types, we can also find the volume demand for various application-related resources. Combined, these are the inputs for efficient infrastructure planning, such that IT applications can scale up or down depending on the volume and number of users.
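One hypothetical way to close the loop from the Human Capacity Planning output to infrastructure provisioning: take the rostered user count per 15-minute interval and size the application-server pool so each server carries at most a fixed concurrency. The per-server capacity below is an assumed figure, not one from the paper:

```python
import math

USERS_PER_SERVER = 50          # assumed safe concurrency per application server

def servers_needed(concurrent_users):
    """Scale the server pool to the rostered user count, never below one."""
    return max(1, math.ceil(concurrent_users / USERS_PER_SERVER))

rostered_users = [80, 210, 160, 40]     # toy per-interval user counts from the roster
plan = [servers_needed(u) for u in rostered_users]
print(plan)                             # [2, 5, 4, 1]
```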
Benefits of the Human Capacity Planning Solution In the BPS industry, the customer continuously expects the service provider to optimize operational efficiency, which essentially translates into "do the same with less" and then "do more with less".
Qualitative Benefits 1. Effective Resource Management - The solution gives the insight that optimal human resource management requires understanding the nuances of floor operations; it projects the number of resources required to fulfil the goal, thereby providing scalability to meet high-volume workloads.
For example, while managing Human Capacity Planning it is important to understand constraints such as resource caps, acceptable overtime, number of shifts and resource shrinkage. For work allocation, understanding dependencies on cross-skilled resources and the collaboration needed between resources is important. For queue monitoring, it matters how maker/checker headcount can be re-purposed to manage time-stringent SLAs.
2. Dynamic Change Management - The solution is focused on the need to have specific resources at specific times to meet the SLAs, and thus to avoid any delays and inaccuracy in operations. Key decision-making data points are made readily available to BPS floor managers at pre-defined intervals, enabling a right-time view, which helps in managing SLAs under varied workload scenarios. The solution addresses business problems such as systematic tracking of resources, nurturing their growth and improving their overall value to the organization.
3. Risk Management – While carrying out BPS operations, when SLAs are not met, the BPS service provider carries the risk of paying penalties for missing the SLAs. More often than not this risk is linked to financial losses borne by the customer, and can thus be a huge amount. At the same time, if an SLA miss translates into loss of the BPS customer's reputation in the marketplace, the BPS service provider also risks losing the business.
Quantitative Benefits
• Operational Efficiency - the solution helped achieve 99.7% SLA compliance, leading to improved customer satisfaction
• Typically a 40 percent minimum improvement in Human Capacity Planning accuracy
• Typically 70 percent time savings
• Employee satisfaction score improved from 5 to 9 (on a 10-point scale)
• Cost Savings – the solution has helped save USD 150 million
Applicability and Future Roadmap: This solution is applicable wherever schedule and workload generation is required to manage high-volume BPS operations with optimal utilization of infrastructure capacity. The solution adds value in cases where application response time is extremely important but there is little or no control over the server infrastructure to meet the varying workload.
Conclusion
To effectively estimate infrastructure provisioning, inputs from the business environment are critical, and that is where Human Capacity Planning data points help to predict volume demands and the number of users. Business inputs such as past history, skill sets and peak times are leveraged using advanced analytics and mathematical models. The solution not only makes efficient and optimal use of human resources but also allows hardware resources to be gainfully managed at minute-level granularity. The solution demonstrates that hardware or server capacity and performance are enhanced with the right level of business inputs.
References
[1] Brockwell, Peter J., and Richard A. Davis, eds. Introduction to Time Series and Forecasting. Vol. 1. Taylor & Francis, 2002.
[2] Brusco, Michael J., and Larry W. Jacobs. "Optimal models for meal-break and start-time flexibility in continuous tour scheduling." Management Science 46.12 (2000): 1630-1641.
[3] Caggiano, K. E., Jackson, P. L., Muckstadt, J. A., & Rappold, J. A. "Efficient computation of time-based customer service levels in a multi-item, multi-echelon supply chain: A practical approach for inventory optimization." European Journal of Operational Research 199.3 (2009): 744-749.
[4] Timothy Wood, Ludmila Cherkasova, Kivanc Ozonat, Prashant Shenoy. "Predicting Application Resource Requirements in Virtual Environments." HP Laboratories Technical Report, 2008.
[5] Waheed Iqbal, Matthew N. Dailey, David Carrera. "Black-Box Approach to Capacity Identification for Multi-Tier Applications Hosted on Virtualized Platforms." Cloud and Service Computing (CSC), IEEE International Conference, 2011.
STATICPROF: TOOL TO STATICALLY MEASURE PERFORMANCE
METRICS OF C, C++, FORTRAN PROGRAMS
Barnali Basak
Tata Consultancy Services, India
Abstract
In this paper we present a compiler-driven profiler, named StaticProf, which statically estimates both arithmetic and memory operations of C, C++ and Fortran programs. We have developed the tool as an intermediate gimple pass in the GNU Compiler Collection (GCC). We present the results achieved for multiple kernels from the benchmarks Polybench, Lapack and CppLapack, and compare our findings with the numbers profiled by the Intel-provided Vtune profiler. Additionally, we give the statically profiled performance numbers for the GaussSeidel smoother function from the fluid dynamics application OpenFOAM.
Keywords: Performance Bounds, FLOPs, Bandwidth, Static Analysis, Performance Measuring Tool.
1 INTRODUCTION
In the current arena of High Performance Computing (HPC), sequential applications are continuously explored to harness hidden performance. Floating point operations per second (Flops) and bandwidth measurements are the two most important metrics to guide the process of optimizing a program. Multiple tools have been developed to date that dynamically profile a program by counting and sampling various execution events; Gprof, Vtune, PerfSuite and HPC Toolkit, for instance, are a few well-known commercially available tools.
1.1 Motivation
To achieve the optimal performance of an application, the program needs to execute on a suitable architecture. Applications that are highly computation bound are less likely to perform well on architectures with poor computing ability. Similarly, memory-bound applications require architectures with high memory transfer capacity, i.e., bandwidth. Finding the suitable architecture for executing an application may not always be straightforward and may require additional analysis. Here we propose a one-step-closer solution that statically computes the flops and bandwidth requirements of a program. Given such metrics, one may choose a suitable execution platform having sufficient computing and memory transfer abilities.
The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty-free right to publish this paper in the CMG India Annual Conference Proceedings.
void Laplace3D (int NX, int NY, int NZ, double *u1, double *u2){
  for (int k=0; k<NZ; k++) {
    for (int j=0; j<NY; j++) {
      for (int i=0; i<NX; i++) {
        int ind = i + j*NX + k*NX*NY;
        if (i==0 || i==NX-1 || j==0 || j==NY-1 || k==0 || k==NZ-1) {
          u2[ind] = u1[ind];
        } else {
S:        u2[ind] = ( u1[ind-1] + u1[ind+1] + u1[ind-NX] + u1[ind+NX]
                    + u1[ind-NX*NY] + u1[ind+NX*NY] ) * sixth;
        }}}}}
Figure 1: Motivating example: code snippet of Laplace computation
1.2 Contribution
In this paper we illustrate a tool named StaticProf, implemented as a pass in the GCC compiler. It measures both computational and memory operations of C, C++ and Fortran programs by statically analyzing the source code. To profile flops, it checks for arithmetic operators that involve operands of floating point data type. Similarly, to estimate memory operations (memops), i.e., data loads and stores, our analyzer keeps separate counts for integer and floating point data types.
The performance metrics thus achieved can be used to choose the suitable architecture for execution of the program. Here we present an example which shows the estimation of flops and memops by analyzing the source code.
Example 1. Consider the Laplace3D function depicted in Figure 1. The function, in essence, computes the value at each grid point of a three-dimensional grid by averaging the values of all neighbours. A triply nested loop, with loop induction variables i, j and k, traverses the 3D grid space. Loops i, j and k have symbolic upper bounds NX, NY and NZ respectively, whose values are unknown at compile time. Array variables u1 and u2 store the grid values and are of double data type. Also, let T be the total execution time of the function. The flops and bandwidth requirements of the function can be analyzed as follows.
Flops computation:
- flop per loop iteration: 5 additions, 1 multiplication
- flop for the whole loop: 6 × NX × NY × NZ
- flop per second: 6 × NX × NY × NZ / T
Bandwidth computation:
- memop per loop iteration: 6 reads, 1 write
- memory transfer (in bytes) by the whole loop: 7 × NX × NY × NZ × 8
- bandwidth (bytes/sec): 56 × NX × NY × NZ / T
2 Related Work
Multiple methods have been developed to date which predict the performance of programs by monitoring and sampling the dynamic events generated during execution. Tools like Gprof [GKM82], Vtune [vtu], HPC Toolkit [hpc], PerfSuite [per], PAPI [pap] etc. fall into this category.
In contrast, performance prediction of a program by static analysis of source code is another approach. Gropp et al., in their paper [WDGKD99], presented the analytic computation of memory operations and arithmetic operations by statically analyzing general forms of computing statements. This motivates the automation of the process of performance computation.
As our tool follows the static analysis approach, we will not explain the aforementioned dynamic tools. Instead,
D.1218 = a[1];
D.1219 = b[1];
D.1220 = D.1218 + D.1219;
D.1221 = b[1];
D.1222 = D.1220 * D.1221;
a[1] = D.1222;
(a) Gimple representation (b) AST view [tree with the a[1] and b[1] references at the leaves and the +, * and = nodes above them]
Figure 2: Gimple IR of a source instruction
we will only present the papers most closely related to our work.
Narayanan et al., in paper [NNH10], describe estimating upper bounds of C/C++ programs by static compile-time analysis. They present a tool named PBound, developed in the ROSE compiler. It generates parameterized expressions for memory transfers and arithmetic operations. Such expressions, along with architectural information, are used to finally compute the upper bounds of the program's performance.
Cascaval et al., in paper [CDPR99], explain a performance prediction model that does not rely solely on static analysis of source code, but takes dynamic results into account too. A compiler-driven static analyzer has been developed as part of the Polaris compiler. It generates the performance expressions and automatically instruments the code to obtain values unknown at compile time. Thus, the analytical result achieved at compile time, together with the experimental result received at run time, computes the final performance bound.
3 Implementation Details
StaticProf is implemented as an intermediate gimple pass in GCC. GCC parses the source code, written in C/C++/Fortran etc., and reduces it to the gimple intermediate representation (IR). The gimple IR is a three-address representation of expressions where the operands are first loaded into temporary variables before any computation takes place. It also uses temporary variables to hold intermediate values of any complex computation.
The resultant gimple can be viewed as a directed acyclic graph, generally known as an abstract syntax tree (AST). An AST has all memory-referencing variables at its leaf nodes and mathematical operations at its intermediate nodes. The assignment node, in general, sits at the root.
Example 2. Figure 2 shows both the gimple representation and the corresponding AST view of the high-level language instruction a[1] = (a[1] + b[1]) * b[1]. All variables starting with D are temporaries generated by GCC for internal use.
In Figure 3 we illustrate the modular view of GCC. The gimple AST generated by the parser is processed by a series of gimple optimization passes. The optimized gimple code is further lowered into the register transfer language (RTL) IR, which is again optimized by RTL passes. Finally, the optimized code is passed through the ASM generator to produce the corresponding machine language code.
Our tool processes the gimple IR version of the source code and hence is plugged into GCC as a gimple pass. It can profile a segment of source code whose start and end statements are annotated before and after, respectively, by print statements: printing the START_BLOCK_COMPUTATION and STOP_BLOCK_COMPUTATION variables to annotate a block segment, and printing the START_LOOP_COMPUTATION and STOP_LOOP_COMPUTATION variables for a loop segment.
Figure 3: Modular view of GCC (C/C++/Fortran code → parser → gimple AST → gimple passes, including the StaticProf pass → optimized gimple → RTL passes → optimized RTL → ASM generator → machine language code)
3.1 Algorithmic Details
Our analyzer targets both assignment and computing statements of the form
lhs = val
lhs = rhs1 ⊕ rhs2 ⊕ … ⊕ rhsn ⊕ rhsn+1
All of lhs, rhs1, …, rhsn+1 are either scalar or array references referencing integer, float or double data types; val is the value assigned to lhs, and ⊕ is any arithmetic operation.
The tool traverses the corresponding ASTs in top-down order, starting from the assignment node, and parses each and every node. For each node it decides whether the node is of interest or not. For instance, while parsing a mathematical operation node it checks the data types of its child nodes; if the type of any child is float or double, the tool increases the flop count by one. The memory operation count is kept separately for integer and floating point memory operations: whenever the tool encounters a leaf node during traversal, the memop count increases accordingly.
Example 3. Figure 4 depicts the AST of statement S from the motivating example shown earlier in Figure 1. All accesses to array variables u1 and u2 are leaf nodes, and arithmetic operations are intermediate nodes. Note that the AST does not illustrate the tree formation of the index computations, as the current implementation considers them free, i.e., they do not participate in the performance count.
The scalar variable sixth in the AST is real in type, but its value is constant throughout the program's execution. During compilation, the constant propagation optimization replaces the variable sixth with its value; therefore it does not contribute to the memop computation.
The tool starts analyzing the AST from the assignment node. First it checks the data type of the left child u2[ind]; as the data type is double and it is a store memory operation, the memop count is set to one. Next, the analyzer traverses the right subtree and encounters the multiplication node. Similarly, it checks the data types of the left subtree and the right child of the multiplication node; as the data type of the right child is double, the flop count is set to one and the memop count increases by one. The profiler recursively traverses all subtrees and updates the performance numbers for all relevant nodes in the AST.
Memory references, whether through scalars or arrays, can be either direct, like the array reference A[i] or the scalar reference a, or indirect, like A[B[i]] or the pointer reference *a. Our profiler handles all direct, indirect and pointer references. For an indirect access A[B[i]] it counts memops for both the index array B and the value array A. However, the current implementation does not consider any contribution of loop index variables or operations on such variables; they are considered free.
Our tool profiles any valid basic block or loop of a program. A basic block contains multiple statements in sequence and has a single entry and a single exit of control. Our tool analyzes each statement of the block and simply accumulates the performance numbers.
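The traversal of Example 3 can be mimicked outside GCC with a toy tree. The sketch below is a Python re-implementation for illustration only (the node shapes are invented, not gimple structures): it walks the AST of statement S top-down, adds one flop for each arithmetic node with a double-typed child, and counts each leaf memory reference as a load or store.

```python
class Node:
    def __init__(self, kind, dtype, children=(), role=None):
        self.kind = kind                  # 'assign', 'op', 'ref' or 'const'
        self.dtype = dtype                # 'double', 'int', ...
        self.children = list(children)
        self.role = role                  # 'load' or 'store' for refs

def profile(node, counts):
    """Top-down traversal mirroring the StaticProf counting rules."""
    if node.kind == "ref":
        counts[node.role] += 1            # leaf memory reference
    elif node.kind == "op":
        # flop rule: any child of float/double type makes this op a flop
        if any(c.dtype == "double" for c in node.children):
            counts["flop"] += 1
    for child in node.children:
        profile(child, counts)
    return counts

def load():  return Node("ref", "double", role="load")     # a u1[...] access
def const(): return Node("const", "double")                # sixth after constant propagation

# Build the AST of statement S: five additions over six u1 loads,
# one multiplication by the constant sixth, one store to u2[ind].
expr = load()
for _ in range(5):
    expr = Node("op", "double", [expr, load()])
mul = Node("op", "double", [expr, const()])
root = Node("assign", None, [Node("ref", "double", role="store"), mul])

counts = profile(root, {"flop": 0, "load": 0, "store": 0})
print(counts)   # expect 6 flops, 6 loads, 1 store, matching Figure 4
```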
Figure 4: Flop and bandwidth numbers on the AST of statement S (the six u1[...] leaves are each marked load, the u2[ind] leaf is marked store, and the five + nodes and the * node are each marked flop; the constant sixth contributes nothing)
Input: annotated C/C++/Fortran code
Output: flop and memop count for the annotated segment
Procedure:
1. The input program is parsed by the compiler, which produces gimple ASTs.
2. For the valid segment, block or loop, in the source code, identify the Basic Blocks (BBs) to be analyzed in the gimple pass.
3. For each BB goto 4.
4. For each statement (stmt) goto 5.
5. If stmt is of assignment type, parse the lhs and rhs of the assignment node.
   (i) If lhs or rhs is only a memory access of type float or int, count memop accordingly.
   (ii) If rhs is a unary or binary expression, and the operands are of floating type, count flop accordingly.
6. Once all identified BBs are processed, report the flop and memop count for the annotated segment.
Figure 5: Outline of the performance computation algorithm
Multiple basic blocks together form a loop, which is strongly connected in nature and must have a single entry of control. Here control does not only fall from one block to the next, but may also fall back to a previous one. To profile any valid loop, all concerned basic blocks are profiled first and then the numbers are summed up to get the performance count for a single loop iteration. The count thus generated is multiplied by the upper bounds of the surrounding loop iterations to get the final number.
We have presented the outline of the performance computation algorithm in Figure 5. To profile a valid block or loop segment in the source code, first the basic blocks under consideration need to be identified; then all statements from those basic blocks are profiled, and finally the values are accumulated to get the final performance result.
4 Experimental Results
The command line option -fdump-tree-perf, during GCC compilation, dumps the performance data profiled by StaticProf. We have profiled multiple kernels from various benchmarks to check the correctness of the profiler. For C we have used the benchmark suite Polybench [pol], for C++ programs the CppLapack [cpp] benchmark is used, and for Fortran programs we consider the BLAS routines in Lapack [lap].
In Table 1 we present statically computed parameterized expressions of floating point operations and memory operations for a few selected kernels from the Polybench suite. Note that these kernels operate on floating point data type
kernel            flop                          memop
atax              4 × ny × ny                   nx + 2 × ny + 7 × ny × ny
mvt               4 × n × n                     8 × n × n
gemm              3 × nk × nj × ni + ni × nj    2 × ni × nj + 4 × ni × nj × nk
dmm               2 × num × num × num           4 × num × num × num
symm              5 × n × n × n + 6 × n × n     5 × n × n × n + 4 × n × n
daxpy             2 × n                         2 × n
jacobi_1d_imper   3 × (n − 2) × tsteps          6 × (n − 2) × tsteps
floyd_warshall    n × n × n                     4 × n × n × n
Table 1: Statically measured performance numbers
kernel      loop bounds   flop by StaticProf (gflop)   flop by Vtune on Sandybridge (gflop)
cholesky    1024          2.1506                       2.4378
daxpy       100000000     0.4                          0.48672
symm        1024          5.375                        5.7373
mvt         10000         0.4                          0.5417
dmm         1024          2.1475                       2.9822
Table 2: Comparing StaticProf and Vtune estimated flops
only. The parameterized expressions are functions of symbolic loop bounds like N, M, NK, NJ, TSTEPS etc., whose values are unknown at compile time.
In Table 2 we compare the flop counts of a few BLAS routines computed by both StaticProf and Vtune for statically known loop bounds. Vtune monitors the runtime events generated during program execution and estimates the performance accordingly. For instance, consider the kernel dmm. StaticProf estimates the flop count as 2 × 1024 × 1024 × 1024 = 2.1475 gflop, where the loops iterate 1024 times. Vtune dynamically gathers the value of the event SIMD_FP_256.PACKED_DOUBLE as 745550 uflop. The final flop count by Vtune would be 745550 × 4 × 1000 = 2.9822 gflop.
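The dmm arithmetic above can be reproduced directly. The last line makes an assumption not stated in the paper: that the reported overestimation percentage is taken relative to the Vtune figure, which for dmm lands at the upper end of the range cited for Figure 6.

```python
loop_bound = 1024
static_flop = 2 * loop_bound ** 3                 # StaticProf: 2 x 1024^3 operations
vtune_uflop = 745550                              # SIMD_FP_256.PACKED_DOUBLE event value
vtune_flop = vtune_uflop * 4 * 1000               # 4 packed doubles x sample multiplier

print(static_flop / 1e9, vtune_flop / 1e9)        # 2.147... vs 2.982... gflop
overestimate = (vtune_flop - static_flop) / vtune_flop   # assumed definition
print(round(100 * overestimate, 1))               # ~28 percent for dmm
```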
Figure 6 depicts the overestimation caused by Vtune for each kernel mentioned in Table 2. The overestimation on average ranges between 6% and 28%. This is a known issue of Vtune which occurs due to incorrect counting of executed instructions.¹
We have also tested the profiler on a large application, OpenFOAM [ope], an open source application for computational fluid dynamics. Our tool correctly estimates the flop count and memory operation count for the GaussSeidel smoother function, shown in Figure 7. The outer cell loop performs only a division operation on floating point data type and accesses four array elements of data type double. Each inner face loop performs two arithmetic operations and accesses two array elements of type double and one of type integer. As the exact number of loop iterations is unknown at compile time, our profiler considers the loop iterations as nCells and (fEnd - fStart) for the outer and inner loops respectively. Therefore,
Flop = 1 × nCells + 2 × 2 × (fEnd − fStart) × nCells
As the size of a double element is 8 bytes and the size of an integer element is 4 bytes,
MemOp (in bytes) = 4 × nCells × 8 + 2 × (2 × 8 + 4) × (fEnd − fStart) × nCells
                 = 32 × nCells + 40 × (fEnd − fStart) × nCells
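The GaussSeidel expressions can be evaluated for concrete values to confirm the algebraic simplification. The loop bounds below are assumed samples, since nCells and (fEnd − fStart) are symbolic in the paper:

```python
n_cells = 1000          # assumed number of cells
faces = 6               # assumed (fEnd - fStart) per cell

flop = 1 * n_cells + 2 * 2 * faces * n_cells
memop_bytes = 4 * n_cells * 8 + 2 * (2 * 8 + 4) * faces * n_cells
simplified = 32 * n_cells + 40 * faces * n_cells

print(flop, memop_bytes)
assert memop_bytes == simplified      # the expanded and simplified forms agree
```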
¹ http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops; https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/499193
Figure 6: Overestimation caused by Vtune (bar chart comparing gflop profiled by StaticProf and by Vtune for cholesky, daxpy, symm, mvt and dmm)
for (register label celli=0; celli<nCells; celli++)
{
    fStart = fEnd;
    fEnd = ownStartPtr[celli + 1];
    psii = bPrimePtr[celli];
    for (register label facei=fStart; facei<fEnd; facei++) {
        psii -= upperPtr[facei]*psiPtr[uPtr[facei]];
    }
    psii /= diagPtr[celli];
    for (register label facei=fStart; facei<fEnd; facei++) {
        bPrimePtr[uPtr[facei]] -= lowerPtr[facei]*psii;
    }
    psiPtr[celli] = psii;
}
Figure 7: Code snippet of the GaussSeidel function
5 Conclusions and Future Work
We have automated the theoretical flop and bandwidth computation of C, C++ and Fortran programs by analyzing
source codes at compile time. The main contribution of this work is to develop the tool as part of the license-free
GCC compiler, in the hope that it will be used both commercially and for research purposes.
Analyze the effect of cache: The current implementation does not consider any effect of the cache. The presence of a cache
effectively reduces the number of memory transfer operations, as the cache stores frequently accessed elements. Our
implementation treats each and every load and store operation as an access to memory and counts the quantity of
memory transfers accordingly. As future work we want to revise the memory transfer count in the presence of a cache.
Handle index computation and test on applications: Currently our tool does not take into account any index
computation as a contributor to the memory operation count. In future we may extend our work to handle this. Also, the tool has
only been tested on a few benchmarks and a single real-life application. We want to test it further on several other large
real-life applications, like ROMS, Amber, etc.
Compare with Milepost: The GCC implementation of the tool processes optimized GIMPLE codes to profile
performance bounds. Profiling optimized versions clearly ignores redundant arithmetic and memory operations and
reduces the gap between the static and dynamic profile variants. The static implementation of GCC has a fixed prior sequence of
optimizations, which may or may not be optimal for different source codes. Milepost [MTB+08] GCC is a wrapper
around the static GCC which, in essence, finds the optimal sequence of optimizations for various source codes. We
plan to incorporate our tool into Milepost GCC to check whether different optimization sequences affect the static
performance computations of programs or not.
References
[CDPR99] Calin Cascaval, Luiz DeRose, David A. Padua, and Daniel A. Reed. Compile-time based performance prediction. In Twelfth International Workshop on Languages and Compilers for Parallel Computing, pages 365–379, 1999.
[cpp] http://cpplapack.sourceforge.net/.
[GKM82] Susan L. Graham, Peter B. Kessler, and Marshall K. McKusick. gprof: a call graph execution profiler, 1982.
[hpc] http://hpctoolkit.org/.
[lap] http://www.netlib.org/lapack/lug/node71.html.
[MTB+08] Cupertino Miranda, Olivier Temam, Phil Barnard, Elton Ashton, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Edwin Bonilla, John Thomson, Hugh Leather, and Chris Williams. Milepost GCC: machine learning based research compiler. In GCC, 2008.
[NNH10] Sri Hari Krishna Narayanan, Boyana Norris, and Paul D. Hovland. Generating performance bounds from source code. In ICPP Workshops, pages 197–206. IEEE Computer Society, 2010.
[ope] http://openfoam.com/.
[pap] http://icl.cs.utk.edu/papi/.
[per] http://perfsuite.ncsa.illinois.edu/.
[pol] http://web.cse.ohio-state.edu/~pouchet/software/polybench/.
[vtu] https://software.intel.com/en-us/get-started-with-vtune.
[WDGKD99] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Towards realistic performance bounds for implicit CFD codes. In Proceedings of Parallel CFD'99, pages 233–240. Elsevier, 1999.
The copyright of this paper is owned by the author, Ramya Ramalinga Moorthy. The author hereby grants Computer Measurement Group
Inc a royalty free right to publish this paper in CMG India Annual Conference Proceedings.
Performance Anomaly Detection & Forecasting Model (PADFM) for eRetailer Web application
Ramya Ramalinga Moorthy EliteSouls Consulting Services LLP, Bangalore, India.
With high performance becoming a mandate, every e-business realizes its impact and the need for sophisticated performance management. A robust performance anomaly detection and forecasting solution is in demand to detect anomalies in the production environment and to provide forecasts on server resource demand to support server sizing. This paper deals with the implementation of PADFM for an online retailer using statistical modeling & machine learning techniques, which has yielded multi-fold benefits to the business. Keywords: Performance Anomaly detection, Forecasting, K-NN, K-Means, Exponential smoothing, ARIMA.
1. Introduction
With the revolution brought by digital transformation, performance management for digital products is no more a luxury. If Amazon experiences a 1% decrease in sales for every additional 100 ms delay in response time, we can imagine the severity of the performance impact on every high-traffic e-business.
Though Application Performance Management (APM) tools are widely used and have brought down the performance problem diagnosis time to a great extent, these tools don't actually help in detecting the anomalies / outliers in the production environment. With the deployment of robust performance management solutions being one of the compliance factors for digital products, the demand for quick & effective performance anomaly detection & forecasting solutions is increasing steadily. The secret sauce for building such a solution is the right choice of statistical modelling & machine learning techniques, as they provide a strong basis for detecting system performance anomalies and for making forecasts.
With several techniques available in the domain of statistics & machine learning, choosing the right technique(s) capable of detecting the maximum possible anomalies while meeting the high accuracy requirements of the business is a great challenge. The choice of technique(s) purely depends on the type of data and its characteristics.
A performance anomaly is an abnormal application behavior that could be due to some unrelated event or an unexpected application behavior that adversely affects the application performance. The performance data used for detecting anomalies & for making forecasts come from the measurements of two different classes of performance metrics, namely application metrics (Example: server hits, # users, application response time, etc) that provide the current state/health of an application, and system metrics (Example: CPU / memory / disk utilization, queue length, etc) that provide the current state of the underlying system.
Providing forecasts on the server resource requirements supports in application capacity planning & infrastructure sizing. Usage of filtered server resource utilization statistical data, free from anomalies & outliers, can result in providing accurate forecasts that can be used for server sizing analysis.
This paper deals with our solution, Performance Anomaly Detection and Forecasting Model (PADFM v1.0) developed for addressing the requirements of e-retailer business application. The paper is organized in eight sections. Section 2 provides the overview of the e-retailer problem space. Section 3 provides overview of the PADFM model & its objectives.
Section 4 details various performance anomaly detection techniques employed by our model through illustrations. Section 5 details various performance forecasting techniques employed by our model through illustrations. Section 6 provides model recommendations & business benefits. Section 7 provides the limitations & future plans. Section 8 concludes the paper along with the references.
2. E-Retailer Context / Problem Space Overview
A US based retailer, who has operated its women's apparel business both in brick-n-mortar shops & online during the last 7 years, plans to expand its online business portfolio in subsequent years. The business is interested in historical data investigation to gain confidence on what kind of analysis can be done to improve their current performance management practices. Also, the business is interested to know whether any recommendations on server resource demands can be made for justifying the need for a server capacity upgrade for the year 2016.
The online ecommerce application, developed using J2EE technologies, is hosted on JBoss & Oracle servers at their private data center and is configured with IBM Tivoli to monitor the production environment health metrics at 5 minute intervals from Jan 2010 till Dec 2015 (though not much analysis is currently done on the collected data). Server traffic monitoring logs are available from Jan 2013 till Dec 2015 (though not analyzed).
A Proof-Of-Concept (POC) model is demanded that can use the historical data for providing recommendations to improve current performance & capacity management practices.
3. PADFM Solution Overview
Our solution, Performance Anomaly Detection & Forecasting Model (PADFM v1.0) designed for POC, helps in identifying the performance anomalies from the available historical data & for forecasting the hardware demand, particularly CPU requirements of web, application & database server for next one year.
Our model is implemented on R statistical platform and uses various statistical & machine learning techniques. The choice of the techniques is based on our study of data characteristics and evaluation of accuracy of various techniques. Our model provides two key solutions - Performance anomaly detection & forecasting (in offline mode).
For anomaly detection, our solution uses the below algorithms.
Twitter's Anomaly Detection algorithm is used for detecting anomalies in the user traffic trend.
Twitter's Breakout Detection algorithm is used for identifying breakouts & shifts in the distribution trends of server resource utilization data.
K-means machine learning algorithm is used to cluster the server monitoring data of web, application & database servers based on similarity, grouping suspicious data points into the outermost cluster to quickly detect anomalies.
K-NN machine learning algorithm is used to store representative server utilization data point samples and to segregate new data points into well defined groups, based on a majority vote of their k neighbours, to quickly detect anomalies.
Pearson correlation analysis is used to correlate all the metrics considered in the model at every anomaly point detected by the model, to help in performing a first level of root cause analysis.
For forecasting, our solution uses three time series analysis techniques, moving average, exponential smoothing (single, double or triple) and ARIMA. The historical server utilization data of web, application & database server are fed to these models to forecast the demand for next year. The best forecasts are recommended based on evaluation of various forecast accuracy measures.
4. Performance Anomaly Detection Solution
The performance anomaly detection techniques are grouped by the type of underlying model they assume for nominal data and the discrimination function used to identify data that doesn’t agree with the model.
Our performance anomaly detection solution is designed to offer the below two types of analysis:
1) Visualize the performance counters over time, classify them to provide an easy view of anomalies or outliers, and report any deviations from the set thresholds.
2) Support root-cause analysis by correlating several performance counters & reporting counters with high correlation coefficients.
The techniques used by our model are discussed below:
4.1 Anomaly Detection Model using Twitter’s AD algorithm
Twitter’s anomaly detection algorithm is based on the S-H-ESD (Seasonal Hybrid Extreme Studentized Deviate) algorithm, an extension of the generalized ESD method that detects both global & local anomalies better than the generalized ESD method.
The web server access log that has the server hit details for the last 3 years (Jan 2013 till Dec 2015) is used for this analysis. By using this anomaly detection algorithm, both global & local anomalies are detected as shown in figure 1.
Figure 1: Anomalies detected using Twitter’s AD
Drill-down analysis for a specific investigation period reported anomalies where, upon verification, it is found that 82% of the anomalies detected during the investigation period correlate with the days when performance tests were conducted on the production environment, due to infrastructure issues on the test environment.
Figure 2: Drill down analysis using Twitter’s AD
The anomalies detected along with its percentage deviation are reported by the model as in figure 3.
Figure 3: Anomalies detected using Twitter’s AD
This algorithm is chosen as it is very popular in detecting both global & local anomalies. Our solution is able to rightly detect one-point sudden increases, unusual noise, sudden growth on a seasonal pattern, etc. on the server hits trend. But there are two limitations, where the algorithm is not able to detect 1) a linearly increasing growth anomaly & 2) a negative anomaly, represented below.
Figure 4: Anomalies not detected using Twitter’s AD
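The deviation-scoring idea behind ESD-style detectors can be illustrated with a much simpler sketch. The Python detector below is not S-H-ESD (it performs no seasonal decomposition); it is a robust median/MAD outlier test on a hypothetical hits series, with 3.5 as a common rule-of-thumb threshold:

```python
import statistics

def mad_outliers(series, threshold=3.5):
    """Flag indices whose modified z-score exceeds `threshold`.

    A deliberately simplified stand-in for S-H-ESD: it uses the median
    and the median absolute deviation (MAD), which are robust to the very
    anomalies being detected, but unlike the Twitter algorithm it does
    not first remove the seasonal component.
    """
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []
    # 0.6745 rescales MAD to the standard deviation of a normal distribution
    return [i for i, x in enumerate(series)
            if abs(0.6745 * (x - med) / mad) > threshold]

hits = [100, 102, 98, 101, 99, 500, 97, 103]   # one injected spike
print(mad_outliers(hits))   # -> [5]
```

On seasonal traffic data, this plain test would flag the normal daily peaks as anomalies, which is exactly the problem the seasonal decomposition in S-H-ESD addresses.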
4.2 Anomaly Detection in data distribution trend using Twitter’s BD algorithm
Twitter’s Breakout Detection algorithm is based on EDM (E-Divisive with Medians), which is comparatively better at detecting breakouts & quickly identifying changes in the distribution trend than the E-Divisive, Moving Average & Moving Median algorithms on our input data.
Our model is able to detect the two major types of anomalies, sudden increase / decrease and gradual increase / decrease in the time series trend of the resource utilization metrics.
A sudden increase in the DB server CPU utilization trend, noticed on a specific day (28th Nov 2014) where the CPU utilization increased from an average of 40% to 60%, detected by the algorithm is shown in figure 5.
Figure 5: Sudden increase detected using Twitter’s BD
A slow increase in web server disk utilization trend, noticed on a specific day (14th Sep 2015) where disk utilization increased from an average of 10% to 40%, detected by the algorithm is shown in figure 6.
Figure 6: Slow increase detected using Twitter’s BD
Though the above types of breakouts are easily detected & very useful, sometimes the algorithm doesn’t detect changes in distribution shift from Poisson to Uniform.
4.3 Anomaly Detection Model using K-Nearest Neighbours Algorithm
K-NN is a simple instance-based supervised learning algorithm that stores all the representative data and classifies new data input by a majority vote of its k neighbours. This algorithm segregates unlabeled data points into well defined groups. Choosing the number of nearest neighbours, i.e. determining the value of k, plays a significant role in determining the efficacy of the model.
The data fed to our model has % CPU & % Disk utilization values with type labels (sample shown in figure 7), collected from the application server during peak hours in 2015. The threshold value (configurable) used for classifying the data is 70% for % CPU utilization & 20% for % Disk utilization.
Figure 7: Input Data Sample used for K-NN analysis
The input data points are labeled using four classes, VALID (where both % CPU & % Disk utilization values are less than set thresholds), H-CPU (where % CPU utilization value is greater than set threshold), H-DISK (where % Disk utilization value is greater than set threshold) and H-CPUnDISK (where both % CPU & % Disk utilization values are greater than set thresholds).
Figure 8: Input data classification used for K-NN analysis
Out of 100 input data points fed to the model, 65% of the provided data is configured to be used for training purposes & the remaining 35% is used for testing purposes. The pictorial view of the input data classification is shown in figure 8.
Upon applying this algorithm, the prediction accuracy provided by our model for the test data values is
shown in figure 9.
Figure 9: Resultant classification accuracy on test data
It is evident from the results that 100% of the predicted values on the test data (35% of the input data) are valid. The developed model can thus be used to classify new unlabelled input data; the model is able to classify input data samples (with 10000+ observations) into the 4 provided input classes with an average accuracy of about 92%.
Figure 10: Resultant classification on new input data sample
A new unlabelled input data set, with about 120 data points, fed to the model has yielded 96% classification accuracy as shown in figure 10.
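A minimal sketch of the labelling-plus-majority-vote scheme described above, in Python (the paper's model itself runs on R). The thresholds match the 70% CPU / 20% disk values stated earlier; the training points and helper names are hypothetical:

```python
import math
from collections import Counter

def label(cpu, disk, cpu_thr=70, disk_thr=20):
    """Rule-based labelling into the paper's four classes."""
    if cpu > cpu_thr and disk > disk_thr:
        return "H-CPUnDISK"
    if cpu > cpu_thr:
        return "H-CPU"
    if disk > disk_thr:
        return "H-DISK"
    return "VALID"

def knn_classify(train, point, k=3):
    """Majority vote among the k nearest labelled (cpu, disk, label) tuples."""
    nearest = sorted(train, key=lambda t: math.dist(t[:2], point))[:k]
    return Counter(lbl for _, _, lbl in nearest).most_common(1)[0][0]

# Hypothetical training sample (the paper used ~100 labelled points)
train = [(c, d, label(c, d)) for c, d in
         [(40, 10), (50, 12), (85, 15), (90, 10),
          (45, 30), (55, 35), (88, 40), (92, 45)]]
print(knn_classify(train, (87, 12)))   # -> H-CPU
```

Because the labels themselves are threshold rules, k-NN here mainly smooths over measurement noise near the thresholds; the k value trades noise tolerance against boundary sharpness.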
4.4 Anomaly Classification Model using K-Means Clustering algorithm
K-Means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. The underlying idea of the algorithm is that a good cluster is one which contains the smallest possible within-cluster variation of all observations in relation to each other. The variation is calculated using the squared Euclidean distance between the data points.
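The within-cluster-variation idea can be sketched with plain Lloyd's iteration on 2-D (CPU, disk) points. This is an illustrative Python stand-in for R's kmeans(); the data points are hypothetical:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points: assign each point to its
    nearest center, recompute centers as cluster means, repeat until stable."""
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated hypothetical utilization groups
pts = [(30, 5), (32, 6), (31, 7), (90, 40), (92, 42), (91, 41)]
centers, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))   # -> [3, 3]
```

Points landing in the high-utilization cluster correspond to the "cluster 4" suspects the paper inspects for anomalies.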
The data fed to our model has 100 unlabelled data points of % CPU & % Disk utilization values (sample shown in figure 11), collected from the application server during peak hours in 2015.
Figure 11: Input data sample in K-means
Upon entering the # of clusters input value as four, the resultant clustered view of our model is shown in figure 12. Using this cluster representation, anomalies can be automatically identified by analyzing the cluster 4 (high CPU & Disk utilization cluster) data points.
Figure 12: Resultant cluster view of input data
Minor errors are noticed in the clustering results. Upon comparing the clustered results of this data with the labeled input data (shown in figure 8, used for K-NN analysis), 5 deviations are noticed, as represented in figure 13.
Figure 13: Labeled data versus Cluster results comparison
The optimal number of clusters is calculated using the Elbow method. By plotting the total within-cluster sum of squares for various selected K values, the elbow point can be identified. The graph shown in figure 14 justifies why the K value is chosen as 4 for clustering the input data sample in the discussed case: the curve shows a bend & stops making big gains after this point.
Figure 14: Cluster K value analysis
Another labeled data input fed to our model has 75 labeled data points of % CPU & % Disk utilization data from the web, application & database servers, collected during peak traffic hours (sample shown in figure 15). The K-means clustering result for this input data is shown in figure 16.
Figure 15: Input data sample with server labels
Figure 16: Clustering results of Web, App & DB server data
A comparison of the resultant data clusters with the input data values, showing minor errors, is represented in figure 17. 2 data points of the application server data class (cluster 1) are wrongly clustered into cluster 2, and in the case of the DB server data class (cluster 3), 3 data points are wrongly clustered into cluster 2.
Figure 17: K-Means Clustering results validation
By tuning the right k value, K-means clustering analysis can easily be used for classifying data in both online & offline modes, for analyzing production monitoring data or user traffic data for easy outlier detection.
4.5 Anomaly Root Cause analysis using Pearson Correlation algorithm
Our root-cause analysis solution is based on the Pearson correlation coefficient, which assumes that the two populations have bell-shaped (normal) distributions and that the relationship between them, if any, is approximately a straight line. The correlation between 2 metrics is accepted as real & significant if the p value calculated from the correlation test is less than 0.05.
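The coefficient itself is straightforward to compute; a Python sketch on hypothetical CPU and available-memory series (the p-value used for the 0.05 significance cut-off would additionally require a t-test, omitted here):

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two metric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

cpu = [40, 50, 60, 70, 80]
free_mem = [60, 50, 40, 30, 20]            # available memory falls as CPU rises
print(round(pearson_r(cpu, free_mem), 2))  # -> -1.0
```

A strongly negative value, like the -0.83 between % CPU and % Available Memory reported later in this section, is the signature of the inverse relationship the model flags.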
Our model takes into account 10 performance metrics (shown in figure 18) for performing correlation analysis on the server metrics. For every threshold deviation reported on CPU utilization, the below correlation analysis can be performed by our model. In one instance, high CPU utilization confirmed the presence of a memory leak in the application. During correlation analysis, 4 key counters, CPU, Memory (calculated from the Available MBytes counter), % Disk utilization & processor queue length, are correlated & reported as in figure 19. Additionally, any counters of interest from the list can be added to the correlation as required.
Figure 18: Performance Metrics (#10) used by PAD model
The correlation trend analysis graph shows the previous 6 hours data trend (configurable value) from the threshold deviation point.
The Pearson correlation coefficient analysis between % CPU Utilization & % Available Memory metric is reported as -0.83 with the p-value of 2.2e-16 (<0.05).
Figure 19: Server resource trend from our model showing 6 hours before & after the anomaly point that resulted in an exception
The scatter plot is used to visually report the correlations across all counters, which confirms the existence of correlation between the CPU & Memory utilization counters.
Figure 20: Scatter plot representing correlations
The correlation coefficient matrix shown in figure 21 is used to report the severity of the correlation between counters used in analysis.
Figure 21: Correlation Coefficient Matrix & Heat map view
The heat map view shows the correlation values of all selected counters used in the analysis. The dark blue boxes in the figure represent metrics with significant negative correlation values, as the % CPU & % Memory utilization values are inversely proportional.
Various rules are included in the model to conclude on the root cause, even though significant correlation can be seen between multiple counters. The results & graphs reported by our model help in quickly detecting anomalies & performing root cause analysis.
5. Performance Forecasting Solution
Our forecasting model uses three univariate time series analysis techniques to forecast the future behavior of a performance metric according to its past values. Our model deploys Moving Average, Exponential Smoothing (Single, Double or Triple) and ARIMA techniques to forecast the CPU requirements of the web, application & database servers & highlights the best forecasts based on the forecast accuracy measures RMSE & MAPE.
Our forecasting model used in PADFM version 1.0 uses the 95th percentile value of the past 72 months (Jan 2010 till Dec 2015) to make the forecasts for the next year. The % CPU utilization data reported by Tivoli at every 5 minutes interval is used as input for the forecasting model, after removal of anomalies (using the techniques discussed in section 4). The 95th percentile value of every hour is used to calculate the 95th percentile value for every day & this value is then used to calculate the 95th percentile value for every month for the last 6 years.
Though this is not the ideal approach, this technique is followed since the objective of the model during the POC is to show how various forecasting techniques can be applied to the available statistical data to provide recommendations for server sizing. The data derived from this aggregation analysis, approved by the business team & infrastructure team, is used in the forecasting models.
Figure 22: Time Series data input used for forecasting
Figure 22 provides the view of the year-on-year increase in the Application Server % CPU utilization during the last 6 years (Jan 2010 till Dec 2015). The forecasts made from this time series data input by applying 3 statistical modelling techniques are discussed in this section, to make forecasts for the next year, 2016.
Our forecasting model facilitates making forecasts either by using all the data points or for a user-selected timeframe (considered relevant for sizing).
5.1 Moving Average (MA) Model
This method uses the average of the most recent k data values in the time series as the forecast for the next period. In this method, every time a new observation becomes available for the time series, the oldest observation in the equation is replaced and a new average is computed. For tracking the most recent time series values a small value of k is preferred, whereas for smoothing over past values a larger value of k is preferred.
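A one-step-ahead sketch of this method in Python, on a hypothetical monthly series:

```python
def moving_average_forecast(series, k=3):
    """One-step-ahead forecast: mean of the most recent k observations."""
    window = series[-k:]
    return sum(window) / len(window)

# Hypothetical monthly 95th-percentile % CPU values
cpu = [52, 54, 55, 57, 60, 62]
print(round(moving_average_forecast(cpu, k=3), 2))   # -> 59.67
```

Rolling this forward (appending each new observation and dropping the oldest) reproduces the "replace the oldest observation" behavior described above.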
Forecasted trend reported by our solution is shown in figure 23.
Figure 23: Forecasted Trend using MA model with k=3
The forecast accuracy measure values are calculated as RMSE is 1.03 & MAPE is 1.79.
5.2 Exponential Smoothing (ES) Model (Single, Double & Triple techniques)
This method uses a “smoothing constant” to determine how much weight to assign to the actual values. The smoothing parameter should fall into the interval between 0 and 1, so as to produce the smallest sum of squared residuals.
Our solution performs data decomposition analysis to study the type of time series data before performing this analysis. During decomposition analysis, the input time series data is analyzed for the presence of 3 components - level, trend & seasonality. Accordingly, our solution applies one of the below mentioned smoothing techniques to make forecasts.
Figure 24: Decomposed view of input time series
5.2.1 Simple Exponential Smoothing
This method is applicable for time series data with only level and no trend or seasonality. The fitted trend & forecast trend reported by our model are provided in figures 25 & 26.
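A sketch of the level-only recursion, using the alpha = 0.64 value reported in figure 26 (the series itself is hypothetical):

```python
def ses_forecast(series, alpha=0.64):
    """Simple exponential smoothing: maintain a level only.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}; the final level
    is the (flat) forecast for all future periods. alpha = 0.64 mirrors
    the fitted value reported in figure 26.
    """
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

print(round(ses_forecast([50, 52, 51, 53, 52]), 2))   # -> 52.11
```

With a high alpha like 0.64, the forecast leans heavily on the most recent observations; a low alpha would average over a longer history.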
Figure 25: Observed vs Fitted Trend using Simple ES
Figure 26: Forecasted Trend with alpha=0.64
The forecast accuracy measure values are calculated as RMSE is 2.54 & MAPE is 4.65.
5.2.2 Holt’s Exponential Smoothing
This method is used when time series data have level and trend but no seasonality. Forecasts are made where smoothing is controlled by two parameters: alpha, for the estimate of the level at the current time point, and beta, for the estimate of the slope b of the trend component.
Figure 27: Forecasted trend with alpha = 0.48 & Beta = 0.03
The forecasted trend & residual trend are shown in figure 27 & 28. The residual plot shows forecast errors seem to have roughly constant variance over time.
Figure 28: Residuals Trend using Holt’s ES
The forecast accuracy measure values are calculated as RMSE is 2.4 & MAPE is 4.4.
5.2.3 Holt-Winter’s Exponential Smoothing
This method is used when time series data have level, trend and seasonality. This ES technique is the one majorly used, as the server resource utilization time series data had level, trend & seasonality components. Forecasts are made where smoothing is controlled by three parameters: alpha, beta, and gamma, for the estimates of the level, the slope b of the trend component, and the seasonal component, at the current time point.
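A sketch of the additive Holt-Winters recursions with the alpha/beta/gamma values reported in figure 30. The initialisation below (first-cycle averages) is one common choice, not necessarily what R's HoltWinters() does, and the demo series is hypothetical:

```python
def holt_winters_additive(series, period, alpha=0.39, beta=0.006,
                          gamma=0.46, horizon=1):
    """Additive Holt-Winters: smoothed level + trend + seasonal indices."""
    # Initialise from the first two seasonal cycles
    season1, season2 = series[:period], series[period:2 * period]
    level = sum(season1) / period
    trend = (sum(season2) - sum(season1)) / period ** 2
    seasonal = [x - level for x in season1]

    for i, x in enumerate(series):
        last_level = level
        s = seasonal[i % period]
        level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[i % period] = gamma * (x - level) + (1 - gamma) * s

    return [level + (h + 1) * trend + seasonal[(len(series) + h) % period]
            for h in range(horizon)]

# A perfectly seasonal hypothetical series: the forecast recovers the pattern
f = holt_winters_additive([10, 20] * 4, period=2, horizon=2)
print([round(v, 3) for v in f])   # -> [10.0, 20.0]
```

On real utilization data the three constants would be fitted by minimising the squared one-step errors, which is what produced the alpha = 0.39, beta = 0.006, gamma = 0.46 values quoted above.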
Figure 29: Observed vs Fitted Trend using Holt-Winter’s ES
The fitted trend & forecast trend reported by our model are shown in figure 29 & figure 30.
Figure 30: Forecasted Trend using Holt-Winter’s ES model with alpha = 0.39, beta =0.006 & gamma =0.46
Figure 31: Residuals Trend using Holt Winter’s ES model
The residual plot provided by our model shows that the forecast errors seem to have roughly constant variance over time. As per the Ljung-Box test, the p-value is 0.55 (>0.05), so there is little evidence of non-zero autocorrelations in the in-sample forecast errors at lags 1-20, confirming an appropriate fit of the time series model.
To be sure that the predictive model cannot be improved upon, our model provides a histogram representation of the forecast errors. This helps to check whether the forecast errors are normally distributed with mean zero and constant variance. The forecast errors histogram with overlaid normal curve shows that the distribution of forecast errors is centered on zero and approximately normal.
Figure 32: Forecasted Errors Histogram
The forecast accuracy measure values are calculated as RMSE is 0.06 & MAPE is 1.29.
5.3 ARIMA Model
The AutoRegressive Integrated Moving Average (ARIMA) model is the most popular approach to forecasting both stationary & non-stationary time series data. The model, summarized as ARIMA(p,d,q), comprises three parameters: the autoregressive parameter (p), the number of differencing passes (d), and the moving average parameter (q). The method operates in three phases, namely Transformation, Parameter Estimation & Fitting, and Diagnostic Checking.
Our model uses transformation functions to convert the input data to stationary data and then calculates the parameter values to fit the best ARIMA model to make forecasts.
The non-stationary input data shown in figure 33 is differenced until it appears to be stationary in mean and variance, i.e., the level & variance of the series stay roughly constant over time, as shown in figure 34.
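The differencing step (the "I" in ARIMA) can be sketched directly; a hypothetical linear trend becomes stationary after a single pass (d = 1):

```python
def difference(series, d=1):
    """Apply first differencing d times (the 'I' in ARIMA(p,d,q))."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A linear upward trend (non-stationary mean) becomes constant after d=1
trend = [40, 42, 44, 46, 48]
print(difference(trend))        # -> [2, 2, 2, 2]
```

A quadratic trend would need d = 2, which is exactly what a unit root test on the differenced series helps decide.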
Figure 33: Non-stationary input data
Figure 34: Stationary data after transformation (d=1)
A unit root test is performed to test the stationarity of the time series data. The Correlogram (ACF) and Partial Correlogram (PACF) analysis carried out to calculate the values of the ARIMA model parameters p & q is shown in figures 35 & 36.
Figure 35: ACF Correlogram View
Figure 36: PACF Correlogram View
Our model suggests the best ARIMA fit with the least AIC and BIC values. The forecast trend reported by our solution using the recommended ARIMA (1,0,0)(1,0,0) model is provided in figure 37.
Figure 37: Forecasted Trend of ARIMA model
The Ljung-Box test & Normal Q-Q plot are used to validate that there is no pattern in the residuals. The p-values for the Ljung-Box test are well above 0.05, indicating non-significance. The normal Q-Q plot values are normal as they rest on a line and aren't scattered all over the place.
Figure 38: Ljung-Box Test Plot
The forecast accuracy measure values are calculated as RMSE is 2.89 & MAPE is 4.95.
Figure 39: Normal Q-Q Plot
5.4 Forecasting Model Accuracy Analysis Summary
Our model performs the below three types of analysis to confirm the best fit.
The Ljung-Box test (p value > 0.05) is used to check for non-significance of autocorrelations on the residuals.
A histogram plot of the forecast errors is used to check whether the residuals are normally distributed with mean zero and constant variance.
Various forecast accuracy measures (ME, RMSE, MAE, MPE, MAPE, MASE & ACF1) are used to compare the forecast accuracy of the employed techniques. The forecast model recommendation is based on the lowest RMSE & MAPE values, shown in figure 40.
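The two measures driving the recommendation can be sketched directly (the actual/predicted series below are hypothetical; MAPE assumes the actual values are non-zero):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error (requires non-zero actuals)."""
    return 100 * sum(abs((a - p) / a)
                     for a, p in zip(actual, predicted)) / len(actual)

actual    = [50, 60, 70, 80]
predicted = [48, 62, 69, 83]
print(round(rmse(actual, predicted), 2),
      round(mape(actual, predicted), 2))   # -> 2.12 3.13
```

RMSE penalises large individual misses more heavily, while MAPE expresses error relative to the observed level, which is why the model reports both.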
Figure 40: Forecasting Model Accuracy Measures
As shown in the above recommendation, the Exponential Smoothing technique seems to have the best forecast accuracy, with relatively low RMSE (0.06) & MAPE (1.29) values.
5.5 Forecasting Model Results
Though our forecasting model recommends the best forecasting technique, it provides the forecasted trends of all the models for reference. The forecasted data trend for the application server CPU utilization for the year 2016, reported by our model, is shown in figure 41.
Figure 41: Forecasted Trend based on 3 techniques
As per the forecast, the application server maximum % CPU utilization will fluctuate in the 80% to 85% range during 2016. As the forecast values are above the application server % CPU utilization threshold (configured as 70%), the model suggests a capacity upgrade. This justifies the need for an application server capacity upgrade for the next year, 2016. This forecast assumes that no sudden increase in traffic trend is expected due to new promotional deals, marketing campaign activities, etc. As these factors are not considered in the model, extra capacity room needs to be buffered to accommodate the additional requirement for the forecasted year.
Figure 42: Server Capacity Upgrade Recommendation Report
A similar analysis performed for the web server & database server confirmed that there is no need for a server capacity upgrade for the year 2016. As per the forecasts reported by the model, the web server maximum % CPU utilization will fluctuate in the 60% to 65% range (threshold: 80%) and the database server maximum % CPU utilization will fluctuate in the 45% to 50% range (threshold: 75%).
6. PADFM Results & Business Benefits
The anomalies detected in the user traffic trend & server resource utilization trend using the various techniques require manual analysis. For a reported point anomaly, 12 different performance metrics (# server hits & # users details from user traffic data, and the 10 server resource monitoring metrics shown in figure 18) can be graphically viewed & Pearson correlation analysis can be performed to identify the root cause of the anomalous behavior.
Based on the root cause analysis results, server configuration tuning or application code tuning activities can be suggested. This manual analysis of point anomalies can result in any of the 2 actions namely; anomaly data point can be included for sizing analysis, considering it is a false positive detection or can be removed considering it as an outlier / unrelated event for the sizing analysis.
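The correlation step can be sketched as follows; the traffic and CPU samples are hypothetical, and in practice each of the 12 metrics would be correlated against the one showing the anomaly:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical hourly samples around a flagged anomaly
server_hits = [1200, 1350, 1500, 4800, 1400, 1300]
cpu_util    = [45.0, 48.0, 52.0, 93.0, 50.0, 47.0]

# A coefficient near +1 suggests the CPU spike is load-driven,
# pointing root cause analysis toward traffic rather than the application.
print(round(pearson(server_hits, cpu_util), 3))
```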
The time series data, refined by filtering out possible anomalies and outliers, is then used by the forecasting models to make forecasts for the next year. The results of the best forecasting model are used to justify the need for a server capacity upgrade in the forecasted period.
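As a sketch of how the recommended technique produces such a forecast, single exponential smoothing can be implemented as below; the series and the smoothing factor alpha = 0.4 are illustrative, not taken from the model:

```python
def exp_smooth(series, alpha):
    """Single exponential smoothing; returns the one-step-ahead forecasts."""
    forecasts = [series[0]]          # seed with the first observation
    for value in series[1:]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts

# Hypothetical filtered monthly max % CPU utilization
cpu = [68.0, 70.5, 69.8, 72.0, 74.5, 76.0]
smoothed = exp_smooth(cpu, alpha=0.4)

# Single exponential smoothing yields a flat one-step-ahead forecast;
# comparing it against the threshold drives the upgrade recommendation.
print(round(smoothed[-1], 2))
```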
The business benefits achieved through our model include: 1) improved performance SLA management; 2) quick correlation analysis reports that help the business understand system load and utilization behavior during BAU and peak seasons, driven by dynamic business factors such as deals and promotions; 3) anomalous distribution trends that provide feedback to the performance testing team for refining the test workload with realistic usage patterns; 4) point anomalies flagged on server resource trends that bring immediate focus to performance tuning and optimization activities based on first-level root cause analysis results; 5) better forecast accuracy and server sizing, owing to the use of filtered (anomaly/outlier-free) input data; and 6) added intelligence in production monitoring to detect anomalies and perform first-level root cause analysis, reducing the load on the L1 and/or L2 infrastructure support teams on the production environment.
7. Model Limitations & Future Plans
Our PADFM version 1.0 model has several limitations: handling large data volumes, processing speed, rich graphical representation, and automatic workload-versus-server-performance correlation analysis. Also, forecasts are based purely on server utilization data, without considering business factors such as the history of promotional events/deals (whose impact must be assumed for the forecasted period) or server specifications (CPU model, type and speed, memory, etc.). The impact of these business and application factors needs to be correlated with server resource utilization patterns when making forecasts.
Our next version of the model, PADFM v2.0, currently in progress, focuses on addressing the above limitations and includes additional techniques such as the Netflix RAD algorithm (based on Robust Principal Component Analysis) and LOF (Local Outlier Factor) to detect anomaly patterns missed by the version 1.0 model.
Additional features that capture workload fluctuations along with server utilization, including JVM statistics, will feed supervised machine learning techniques (the k-NN algorithm) to quickly classify input data and detect anomalies related to workload fluctuations and patterns.
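The distance-based idea behind such nearest-neighbour detectors can be sketched as follows. Note this is a simplified k-nearest-neighbour distance score, not the full LOF density-ratio computation, and the utilization samples are hypothetical:

```python
def knn_outlier_scores(points, k):
    """Score each 1-D point by the mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical % CPU utilization samples with one spike
cpu = [48.0, 50.0, 49.5, 51.0, 90.0, 50.5]
scores = knn_outlier_scores(cpu, k=2)
print(scores.index(max(scores)))  # prints 4: the 90% sample stands out
```

LOF refines this by comparing each point's local density with that of its neighbours, so anomalies are detected relative to their surroundings rather than by a global threshold.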
The forecasting model will also include server sizing recommendations based on several additional business and application factors, along with industry benchmark data from TPC/SPEC.
To overcome the limitations related to data handling and speed, we have used InfluxDB, an open-source time series database, to store real-time time series data from production servers, and Grafana to better visualize the time series data in the anomaly detection graphs.
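For illustration, InfluxDB ingests samples in its plain-text line protocol; a point such as the one below (host names and measurement names are hypothetical) can be POSTed to the database's write endpoint and then charted in Grafana:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Encode one sample in InfluxDB line protocol: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu_util",
    {"host": "appserver01", "tier": "app"},
    {"max_pct": 82.5},
    1451606400000000000,
)
print(line)
# cpu_util,host=appserver01,tier=app max_pct=82.5 1451606400000000000
```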
97
8. Conclusion
Unidentified performance anomalies or unexpected capacity issues can lead to system failure, with immediate impact on customer experience and resulting business loss. The real success of a performance anomaly detection and forecasting solution lies in choosing the right combination of techniques (with fine tuning) that complement each other, and in accounting for the impact of several other factors to yield better accuracy. The techniques employed by our model for the retailer client have yielded great benefits for the business in detecting anomalies in the production environment and in providing recommendations for server sizing.
98
Performance Comparison of Virtual Machines and Containers
with Unikernels
Nagashree N, Suprabha S, Rajat Bansal VMware Software India Private Limited, Bengaluru.
{nagashreen, suprabhas, rajatb}@vmware.com
In a typical cloud environment that uses a software-defined data center (SDDC), most virtual machines are deployed to run a single application for back-end services, resulting in under-utilization of virtual resources. With virtual machines, legacy applications run on a full-fledged OS. Containers avoid this overhead through OS-level virtualization, where applications are 'caged' to provide secure isolation; however, this adds another layer to the software stack. Unikernels, leveraging the benefits of virtualization, remove these redundant layers by enabling us to build applications with only the required libraries. These applications run as virtual machines on a hypervisor in Ring 0 (kernel) mode. Consequently, the attack surface is minimal, ensuring very high security compared to containers. Privileges are also reduced, building potentially stronger walls between disparate components. The applications can scale massively because of a very small memory footprint. In this paper, we use OSv, a unikernel cloud OS, to build and deploy applications such as Apache, MySQL, etc. We have performed a performance analysis of these applications running in a virtual machine, a container and a unikernel. We have observed that the unikernel VM size is the smallest, its boot time is much faster and its read/write time is significantly lower.
1. INTRODUCTION
Cloud computing arose from the realization that virtualization facilitates efficient use of hardware resources by providing a greater degree of abstraction between hardware and software. A virtual machine is a self-contained computer with a standard operating system running applications. It helps with server consolidation, as there is less hardware to maintain, and it is scalable and secure. However, each VM runs a full copy of an operating system and a virtual copy of all the hardware that the OS requires. Containers provide an alternative by virtualizing the operating system, which reduces resource consumption. A container wraps up applications and all their dependencies in a complete file system. Containers, running as isolated processes in the host operating system's user space, share the kernel, which makes RAM and disk usage more efficient compared to VMs [7]. Despite these benefits, containers are not as secure as virtual machines. Their attack surface is comparatively larger because a vulnerability in the OS's kernel can make its way into the containers. They also do not offer the same level of isolation as hypervisors. The next logical step beyond VMs and containerization is unikernels.
Unikernels are specialized, single address space machine images constructed by using ‘library operating systems’. A developer selects the minimal set of libraries which correspond to the OS constructs required for their application to run. These libraries are then compiled with the application and configuration code to build sealed, fixed-purpose images (unikernels) which run directly on a hypervisor or hardware without an intervening OS. [8] (Fig. 1)
99
Fig. 1: Application Stack in Virtual Machine, Container and Unikernel
Virtualization, in spite of many of its advantages, adds many layers and this overhead consumes more resources. Unikernels significantly reduce this overhead without compromising on any of the advantages of virtualization. Some of the key advantages are:
1. Performance — Less CPU usage & memory footprint.
2. Security — The application is the kernel. Unikernels provide an extremely tiny and specialized runtime footprint that is less vulnerable to attack.
3. Scalability — The image size is very small (less than 100MB), enabling flexible deployment on a large scale.
As with many new technologies, unikernels face a chasm between establishing their technical value and achieving market adoption. With this paper, we attempt to bridge this gap and help unikernels be adopted widely across various industries.
In Section 3 and Section 4, we present evidence for the above by comparing virtual machines and containers (Docker) with unikernels.
2. RELATED WORK
The architecture of library operating system (libOS) has several advantages over more conventional designs [1]. “Unikernels: Rise of the Virtual Library Operating System”, details the limitations of current operating systems and how library operating systems overpower them. It illustrates the design of MirageOS which aims to restructure VMs into more modular components for increased benefits.
There are several unikernels or library operating systems, such as OSv, ClickOS, Clive, HaLVM, LING, MirageOS and Rump Kernels. In "OSv — Optimizing the Operating System for Virtual Machines", Avi Kivity et al. explain that OSv is a better OS for the cloud than traditional operating systems such as Linux.
J(VM)2 [2], a pure Java-based VM encapsulated inside a specialized hypervisor-aware framework, provides many substantial advantages over the traditional JVM.
100
3. PROOF-OF-CONCEPT
Unikernels, by design, do not have open-vm-tools running inside them, as only one process runs within each unikernel. However, each of the unikernel implementations has an instrumentation API to obtain similar data. OSv is the unikernel OS used here for building and deploying applications. OSv [4] is an open source operating system designed for the cloud, built from the ground up for effortless deployment and management, with superior performance.
3.1 Legacy and Cloud-Native Applications
There are not many applications that are ready to be run as unikernel images. Building a unikernel image for an application requires detailed knowledge of the application's runtime and its dependencies, so that only the relevant parts of the OS library are compiled into the image. A few legacy applications such as Java and MySQL, and cloud-native applications such as Apache and Cassandra, are compiled to an OVA and then deployed as virtual machines on a hypervisor. The virtual machine boots the application automatically, in an instant, as soon as it powers on. However, the VM shuts down if the application crashes or fails to start. The automated process of building an OSv application is depicted in Fig. 2.
Fig. 2: OSv Application Build Process
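For illustration, OSv images are commonly built with the Capstan tool, driven by a Capstanfile; a minimal sketch might look like the following (the base image name, command line and paths are assumptions, not taken from this paper):

```yaml
# Capstanfile (illustrative sketch)
base: cloudius/osv-openjdk       # prebuilt OSv base image bundling a JVM
cmdline: /java.so -jar /app.jar  # single process the unikernel boots into
files:
  /app.jar: build/app.jar        # copy the application jar into the image
```

Running the Capstan build step then produces a VM image that boots straight into the application, matching the automated flow of Fig. 2.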
3.2 Performance
For the above applications, a performance evaluation of unikernels against virtual machines and containers (Docker) was performed. A standard benchmark tool was chosen for each application and configured consistently across VM, container and unikernel. The statistics are provided in the next section.
3.3 Security
The two major flaws of containers are isolation and security. Since unikernels are deployed as virtual machines on a hypervisor, these are not a concern for them. Containers do not provide resilient, multi-tenant isolation: malicious code inside a container can attack the host operating system and other containers. The Docker daemon requires root privileges to run containers (applications), which leaves the underlying OS vulnerable. Packaging applications is still tricky because of the uncertainty in their dependencies [3]. Unikernels are more secure because the OS consists of only the libraries relevant to the application. There is no support for device drivers, which are a potential risk. Unikernels offer the highest level of isolation, as there is no privilege switching between user and kernel modes. BTC Piñata, a MirageOS unikernel, is evidence of the level of security a unikernel can provide. It is written in OCaml, runs directly on Xen, and uses native OCaml TLS and X.509 implementations.
3.4 Which Applications to Build as Unikernels?
Not every application is suitable for implementation as a unikernel. Like other technologies, unikernels also have
101
their limits. The key limitations are [14]:
1. Lack of multiple processes (Single Process, Multiple Threads)
2. Single User
3. Limited Debugging
4. Impoverished Library Ecosystem
An application that doesn't violate these limits can be considered for compiling as a unikernel. The application shouldn't fork new processes on the same machine; whenever it requires a new process, a new unikernel VM can be created, since one starts up in under a second. Debugging can be improved by instrumenting the application internally. Unikernels specifically target security and scalability (applications that may need to scale to very high numbers). A unikernel would be appropriate for something that will be exposed to the Internet and therefore needs the highest levels of security [14]. There is no specific set of criteria for whether an application can be built as a unikernel or not, because the technology is new. However, with time unikernels can be widely adopted and many applications can be built as unikernels.
5. RESULTS & DISCUSSION
Unikernels run as conventional virtual machines on a hypervisor. We measure the performance of applications such as the Apache HTTP server and MySQL running in a virtual machine, a container (Docker) and a unikernel. We performed all the tests on a hypervisor with twelve 1-2 GHz Intel Xeon E5-2620 processors for a total of 12 cores, Hyper-Threading and 64 GB RAM; this is a mainstream server configuration for deploying virtual machines, containers and unikernels. We used Ubuntu 15.10 (Wily Werewolf) 64-bit with Linux kernel 2.6.27.56 as the base image for all Docker (v1.8.0) containers and as the cloud image for all virtual machines. We run a single instance of each of the Apache Web Server and MySQL applications in the VM, the container and the unikernel.
a. Apache Web Server
The performance of the Apache Hypertext Transfer Protocol (HTTP) server is measured with a benchmarking tool called ApacheBench (ab). This tool shows, in particular, how many requests per second the Apache installation is capable of serving. We configured the Apache HTTP server and ApacheBench on a single virtual machine with 4 vCPUs and 2 GB RAM. We ran the ApacheBench Docker image in a single container hosted on another VM. We launched the server in the unikernel and ran the benchmarking tool in a VM. For consistency, all of these connect to "https://www.yahoo.com". The total number of bytes received from the server, which is essentially the number of bytes sent over the wire, is 524210; the total number of document bytes received from the server, excluding bytes received in HTTP headers, is 460120. The number of requests is 1000, and the number of concurrent clients used during the test is 100.
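Such a run can be scripted; the sketch below parses the two headline figures from ApacheBench's textual output. The sample lines mirror the output format of `ab -n 1000 -c 100 <url>`; in a live run the text would come from the tool's stdout instead:

```python
import re

def parse_ab(output):
    """Extract requests/sec and mean time/request (ms) from ApacheBench output."""
    rps = float(re.search(r"Requests per second:\s+([\d.]+)", output).group(1))
    tpr = float(re.search(r"Time per request:\s+([\d.]+)", output).group(1))
    return rps, tpr

# Sample of the relevant lines from an `ab -n 1000 -c 100 <url>` run
sample = """\
Requests per second:    28.12 [#/sec] (mean)
Time per request:       3087.29 [ms] (mean)
"""
print(parse_ab(sample))  # (28.12, 3087.29)
```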
The performance of the server running in each of the VM, container and unikernel is measured across three parameters, as shown in TABLE I. The unikernel performs better than the VM and the container, as its mean time to serve a request across all concurrent requests is the lowest. Its rate of requests returned per second is better than the VM's and close to the container's.
Performance Metric                   VM        Container  Unikernel
Requests/Second (higher is better)   16.45     30.12      28.12
Time/Request (ms)                    5480.987  3200.13    3087.29
Total Test Time (s)                  54.23     31.23      28.53
TABLE I: Performance Analysis of Apache HTTP server with VM, Container & Unikernel
102
Fig. 3 shows the maximum time taken to connect to the server, process the request and send the response to the clients. With the VM, the fastest request took 55 ms and the slowest 10900 ms. For the container and the unikernel, the fastest request took 71 ms and 58 ms respectively, while the slowest took almost half the time of the VM's. Fig. 4 shows the percentage of requests served within a certain time. The server running inside the VM takes double the time to serve the last 50% of the requests. The unikernel performs well for 1000 requests compared to the others. In real deployments, where the server has to satisfy thousands of requests, VMs and containers take a very long time; unikernels, however, offer predictable performance.
Fig. 3: Maximum Time Taken to Connect, Process Request & Wait for response
Fig. 4: Time Taken to Service the Requests
103
b. MySQL
MySQL performance is measured with sysbench [13], a benchmark suite that allows one to quickly assess system performance when running a database under intensive load. We configured MySQL and sysbench on a single VM with 4 vCPUs and 2 GB RAM, using Percona Server v5.6, a fork of the MySQL relational database management system created by Percona. We followed the same configuration in the container. We ran only MySQL in the unikernel and ran sysbench inside another VM; this simulates a more "real world" use case. We ran sysbench OLTP in mixed OLTP (R/W) and read-only OLTP modes with a single thread and a table size of 20,000,000.
Fig. 5: Read-Only and R/W Requests per Second
Fig. 6: Time Taken to Service 95% of Read-Only and R/W Requests
Fig. 5 shows the Read-Only and R/W requests per second. The number of Read-Only requests per second is highest in
104
the case of the unikernel and lower for the VM and the container. With R/W requests per second, however, the VM beats the others. Fig. 6 shows the time taken to serve 95% of Read-Only and R/W requests. The unikernel takes nearly half the time to satisfy 95% of the Read-Only requests, but its R/W times are relatively higher, due to the network latency of connecting from a different host. Regardless, the unikernel did not degrade performance much, and it is more suitable for read-intensive applications.
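The 95% figures reported by sysbench are latency percentiles. Under a nearest-rank definition (one common convention, used here as an assumption about the computation), they can be derived from raw per-request latencies as sketched below with hypothetical samples:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: value below which pct% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies (ms) from an OLTP run
latencies = [0.6, 0.7, 0.7, 0.8, 0.9, 1.0, 1.1, 1.3, 1.8, 4.0]
print(percentile(latencies, 95))  # prints 4.0 (nearest-rank 95th percentile)
```

The tail percentile, rather than the mean, is what exposes the occasional slow request that dominates user-perceived latency.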
c. Performance Assessment
The performance analysis covers a database and a web server via the MySQL and Apache Web Server applications. These applications, running in each of the VM, container and unikernel, are measured across standard parameters as shown in TABLE II. With the Apache Web Server, the unikernel performs better than the VM and the container, as the average time taken to serve a request across all concurrent requests is the lowest; its rate of requests returned per second is better than the VM's and close to the container's. With MySQL, the read-only (R) requests are served fastest by the unikernel, and its rate of read-only requests returned per second is better than the VM's and the container's. The read/write (RW) requests per second and the time taken to serve them are almost the same across all three technologies.
                                Requests/Second (higher is better)            95% Service Time (ms)
Scenario     Application        VM            Container     Unikernel         VM         Container   Unikernel
Database     MySQL (sysbench)   7398.52 (R)   7301.21 (R)   8571.43 (R)       2.05 (R)   1.91 (R)    0.72 (R)
                                5588.97 (RW)  4771.01 (RW)  4817.36 (RW)      3.63 (RW)  4.03 (RW)   4.14 (RW)
Web Server   Apache (ab)        16.45         30.12         28.12             8924.72    5801.59     5213.12
TABLE II: Performance Analysis of Applications with VM, Container & Unikernel
Unikernels may not be a panacea; however, they perform much better than containers and virtual machines in some cases, and on par with them in the others.
6. CONCLUSION
In our work, we have compared unikernels with virtual machines and containers to show that they perform better for certain applications. Unikernels allow applications that demand predictable performance to access hardware resources directly. Specialized implementations can be provided by including only the required set of libraries, drivers and interfaces. In addition to performance, security, scalability, isolation and feasibility, unikernels make it possible to provide a Disaster Recovery (DR) solution as a unikernel service: since unikernels boot in an instant, a new unikernel can be summoned whenever an application crashes. They also enable transient microservices in the cloud. Hardware today is fast, compact and cheap, allowing us to move away from the assumption that a VM must be persistent. A single VM need not be multi-functional; a new unikernel VM can be summoned to handle each request over the network. Unikernels, with all these advantages, can serve as a better operating system for the SDDC.
7. ACRONYMS & ABBREVIATIONS
Acronym/Abbreviation Description
ab ApacheBench
BTC Bitcoin
DR Disaster Recovery
HaLVM Haskell Lightweight Virtual Machine
libOS library Operating System
OLTP OnLine Transaction Processing
OVA Open Virtual Appliance or Application
SDDC Software-Defined Data Center
VM Virtual Machine
VMDK Virtual Machine Disk
VMX Virtual Machine Configuration File
105
ACKNOWLEDGEMENT
We would like to extend our gratitude to VMware for supporting this research work and providing adequate resources.
REFERENCES
[1] Anil Madhavapeddy and David J. Scott, "Unikernels: Rise of the Virtual Library Operating System", ACM Queue, Distributed Computing.
[2] Lyor Goldstein, "J(VM)2 - A Virtual Java Virtual Machine (Machine): Making the Case for a Pure-Java Virtual Machine", 2015.
[3] http://robhirschfeld.com/2014/04/18/docker-real-or-hype/
[4] http://osv.io/
[5] http://www.itworld.com/article/2915530/virtualization/containers-vs-virtual-machines-how-to-tell-which-is-the-right-choice-for-your-enterprise.html
[6] http://www.linux.com/news/enterprise/cloud-computing/819993-7-unikernel-projects-to-take-on-docker-in-2015
[7] https://www.docker.com/what-docker
[8] https://en.wikipedia.org/wiki/Unikernel
[9] https://blog.docker.com/2013/08/containers-docker-how-secure-are-they/
[10] Avi Kivity, Dor Laor, Glauber Costa, Pekka Enberg, Nadav Har'El, Don Marti and Vlad Zolotarov, Cloudius Systems, "OSv - Optimizing the Operating System for Virtual Machines", USENIX Annual Technical Conference, 2014.
[11] Wes Felter, Alexandre Ferreira, Ram Rajamony and Juan Rubio, "An Updated Performance Comparison of Virtual Machines and Linux Containers", IBM Research Report, July 21, 2014.
[12] http://www.csoonline.com/article/2984543/vulnerabilities/as-containers-take-off-so-do-security-concerns.html
[13] https://www.howtoforge.com/how-to-benchmark-your-system-cpu-fileio-mysql-with-sysbench
[14] Russell C. Pavlicek, "Unikernels - Beyond Containers to the Next Generation Cloud", First Edition, O'Reilly, October 2016.
106