
Institute of Architecture of Application Systems

A Container-centric Methodology for Benchmarking Workflow Management Systems

Vincenzo Ferme2, Ana Ivanchikj2, Cesare Pautasso2, Marigianna Skouradaki1, Frank Leymann1

1 Institute of Architecture of Application Systems, University of Stuttgart, Germany, {lastname}@iaas.uni-stuttgart.de
2 Faculty of Informatics, University of Lugano (USI), Switzerland, {firstname.lastname}@usi.ch

This publication and contributions were presented at CLOSER 2016. CLOSER 2016 Web site: http://closer.scitevents.org
© 2016 SciTePress. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from SciTePress.

@inproceedings{INPROC-2016-14,
  author    = {Vincenzo Ferme and Ana Ivanchikj and Cesare Pautasso and Marigianna Skouradaki and Frank Leymann},
  title     = {{A Container-centric Methodology for Benchmarking Workflow Management Systems}},
  booktitle = {Proceedings of the 6th International Conference on Cloud Computing and Service Science (CLOSER 2016)},
  publisher = {SciTePress},
  pages     = {74--84},
  month     = {April},
  year      = {2016}
}


A Container-centric Methodology for Benchmarking Workflow Management Systems

Vincenzo Ferme1, Ana Ivanchikj1, Cesare Pautasso1, Marigianna Skouradaki2, Frank Leymann2

1 Faculty of Informatics, University of Lugano (USI), Switzerland
2 Institute of Architecture of Application Systems, University of Stuttgart, Germany

{vincenzo.ferme,ana.ivanchikj,cesare.pautasso}@usi.ch, {skourama,leymann}@iaas.uni-stuttgart.de

Keywords: Benchmarking, Docker Containers, Workflow Management Systems, Cloud Applications

Abstract: Trusted benchmarks should provide reproducible results obtained following a transparent and well-defined process. In this paper, we show how Containers, originally developed to ease the automated deployment of Cloud application components, can be used in the context of a benchmarking methodology. The proposed methodology focuses on Workflow Management Systems (WfMSs), a critical service orchestration middleware, which can be characterized by its architectural complexity, for which Docker Containers offer a highly suitable approach. The contributions of our work are: 1) a new benchmarking approach taking full advantage of containerization technologies; and 2) the formalization of the interaction process with the WfMS vendors, described clearly in a written agreement. Thus, we take advantage of emerging Cloud technologies to address technical challenges, ensuring the performance measurements can be trusted. We also make the benchmarking process transparent, automated, and repeatable so that WfMS vendors can join the benchmarking effort.

1 Introduction

Performance benchmarking contributes to the constant improvement of technology by clearly positioning the standards in measuring and assessing performance. Benchmarks conducted under the umbrella of research are recognized as a great contribution to software evolution (Sim et al., 2003). For example, the well-known and established TPC-C benchmark (Transaction Processing Council (TPC), 1997) had a meritorious contribution to the performance of Database Management Systems (DBMSs), which improved from 54 transactions/minute in 1992 to over 10 million transactions/minute in 2010. The importance of benchmarking is also recognized by the emergence of organisations, such as the Transaction Processing Performance Council (TPC, https://www.tpc.org/) and the Standard Performance Evaluation Corporation (SPEC, https://www.spec.org/), that are responsible for ensuring the integrity of the benchmarking process and results. They maintain steering committees involving both industry and academic partners.

Widely accepted and trustworthy benchmarks should demonstrate relevance, portability, scalability, simplicity, vendor-neutrality, accessibility, repeatability, efficiency, and affordability (Gray, 1992; Huppler, 2009; Sim et al., 2003). These requirements impose significant technical challenges (Pautasso et al., 2015) regarding the automation of the benchmarking process, in order to facilitate efficiency and scalability, i.e., the benchmarking of different systems; the repeatability of tests under the same initial conditions; the reduction of noise from the environment, etc. Benchmarks also often face logistic challenges, since many vendors (e.g., Oracle: http://www.oracle.com/technetwork/licenses/standard-license-152015.html; Microsoft: http://contracts.onecle.com/aristotle-international/microsoft-eula.shtml) use software licence agreements to tightly control how benchmark results obtained using their products are published. Addressing such challenges is only possible with a carefully designed methodology to manage the entire benchmarking process, from the deployment of the targeted system, through its testing, and up to the publication of the results.

In this paper, we propose a methodology for benchmarking the performance of Workflow Management Systems (WfMSs). As defined by the Workflow Management Coalition (WfMC), a WfMS is “a system that completely defines, manages and executes ‘workflows’ through the execution of software whose order of execution is driven by a computer representation of the workflow logic” (Hollingsworth, 1995). WfMSs can use different languages to execute the modeled workflows (e.g., WS-BPEL, BPMN 2.0). Even though the proposed methodology is not limited to any specific execution language, we use the Business Process Model and Notation (BPMN) 2.0 (Jordan and Evdemon, 2011) as a use case for describing its feasibility. There are several reasons behind this decision: the growing usage of BPMN 2.0 and of the WfMSs that support it (Skouradaki et al., 2015) (e.g., Camunda, Activiti, jBPM, Bonita BPM; a more detailed list is provided at https://en.wikipedia.org/wiki/List_of_BPMN_2.0_engines), the ISO standardization (ISO/IEC 19510:2013) of the language, and above all the fact that BPMN 2.0 supports a superset of the elements supported by BPEL (Leymann, 2011).

Figure 1: Benchmarking Input/Process/Output (IPO) Model

Our methodology aims to offer solutions to the aforementioned: 1) technical challenges, by defining a benchmarking framework which benefits from the emerging containerization technology, initially used for the deployment of Cloud applications (Turnbull, 2014); and 2) logistic challenges, by formalising the interaction between the conductor of the benchmarking process (hereinafter the BenchFlow team) and the vendors of the WfMSs which are being benchmarked. We believe that such formalisation will foster vendor collaboration and public access to the benchmarking results. In our approach, we treat the WfMS as a black box, as the proposed methodology is applicable to any WfMS released in a Docker container. Our current goal is to benchmark the performance of the WfMSs under different loads and observe their behaviour in normal conditions (e.g., we do not inject faults during the workload execution). Furthermore, we assume that the execution bottlenecks lie with the WfMS, and not with the external services invoked during the process instance execution. We ensure this assumption holds when designing the workload, or when running the experiments.

Our work is conducted within the scope of the BenchFlow project (http://benchflow.inf.usi.ch), which has the aim of defining a standard way to assess and compare WfMS performance (Skouradaki et al., 2015).

The remainder of the paper is organized as follows: Section 2 introduces the benchmarking framework, Section 3 presents the process of formal interaction with WfMS vendors and the used artifacts, Section 4 examines the related work, and Section 5 concludes and presents the planned future work.

2 Benchmarking Process

To ensure the benchmark's transparency and repeatability, it is crucial to clearly define the benchmarking process, its input and output artifacts, as well as the implementation of the benchmarking framework.

2.1 Benchmarking Input/Process/Output Model

Regardless of the targeted System Under Test (SUT), the benchmarking process always uses as an input a workload, i.e., an instance of the workload model, whose parameters are influenced by the test type. It is a good practice to explore SUT performance using different SUT configurations and different computational resources. The output of the benchmarking process is the data for calculating different metrics and/or Key Performance Indicators (KPIs). Such data should be recorded only after a warm-up period, necessary to remove system initialization bias. Before the data can be used for computing metrics and KPIs, it must be validated, by running the same tests multiple times, to ensure that the standard deviation in the results is below a predefined limit. In the following subsections we define the elements of the Input/Process/Output (IPO) model (Fig. 1) in the WfMS context.
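As an illustration of the validation step described above, the following minimal sketch (our own illustration, not part of the BenchFlow framework) checks whether the relative standard deviation of a metric across repeated runs of the same test stays below a predefined limit; the metric values and the 5% threshold are assumed inputs.

```python
from statistics import mean, stdev

def results_are_stable(metric_values, max_rel_stdev=0.05):
    """Return True if the relative standard deviation of a metric, measured
    across repeated executions of the same test, is below the predefined
    limit (5% here is an assumed threshold, not a value from the paper)."""
    if len(metric_values) < 2:
        raise ValueError("at least two repeated runs are required")
    avg = mean(metric_values)
    if avg == 0:
        return stdev(metric_values) == 0
    return stdev(metric_values) / avg <= max_rel_stdev

# Example: throughput (process instances/second) of three repeated runs
print(results_are_stable([118.4, 121.0, 119.7]))  # True for the 5% limit
```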

2.1.1 Workload Model

The parameters of the workload model (workload mix, load functions, and test data) vary depending on the objective of the test type (e.g., load test, stress test, soak test (Molyneaux, 2014)). The interactions between the parameters of the workload model are presented in Fig. 2.

The workload mix is the mix of process models to be deployed during a given test. Each process model defines the activities, the order in which they are performed, as well as any external interactions. It is crucial for the benchmark's validity that the workload mix is representative of real-world models and the set of language constructs that they use. Collecting and synthesizing real-world models is one of the challenges faced when benchmarking WfMS performance, to be addressed by static and dynamic analysis of real-world model collections (Pautasso et al., 2015). Given the fact that the workflow execution languages are very rich and complex, the constructs included in the workload mix models need to be carefully selected. They should: 1) stress the performance of the WfMS; 2) be supported by the WfMS (Geiger et al., 2015); 3) be frequently used in real-world processes; and 4) be in line with the targeted performance tests and KPIs. To make sure that our workload mix is representative, we will follow the iterative approach for its development and evaluation (Skouradaki et al., 2015). Since each WfMS uses a different serialization of executable process models, every time we add a new WfMS to the framework, our workload mix needs to be adapted to a compatible executable format.

The load functions describe how the workload is issued to the SUT. For example, in the case of Online Transaction Processing (OLTP) systems, the workload mix consists of transactions against a database (DB) and a load function describes the number of queries that will be executed against the DB per time unit (Transaction Processing Council (TPC), 1997; Transaction Processing Performance Council, 2007). In more complex examples, such as session-based application systems, the workload mix is defined through different types of transactions, and different roles of users that execute these transactions. In these systems, the load functions are more complex as they have to describe the behavior of the users, i.e., the frequency and sequence in which they execute different types of transactions (van Hoorn et al., 2014). The workload model for benchmarking WfMSs can be seen as an extension of the workload model for session-based applications, since the WfMS executes blocks of language constructs as transactions which can be compensated in case of an error. When benchmarking WfMSs we distinguish two types of load functions: 1) Load Start Functions - determined based on the performance test type, they define how often process instances are being initiated during the test. Even though a Load Start Function needs to be defined for each process model in the workload mix, it is possible that multiple or all process models share the same Load Start Function; 2) Load Distribution Functions - contain rules to define the distribution of the executed paths of a given process model in the workload mix and the distribution of the interactions. Examples of such interactions are the completion of a human (user or manual) task and the call of an external Web service. The rules will be derived from the analysis of real-world execution logs and based on the test type. There is a mutual dependency between all the instances of a given process model and the Load Start Function for that process model, marked with a bidirectional arrow in Fig. 2. Namely, the Load Start Function determines the number of instances for a given process model in the workload mix, but at the same time, the Load Start Function needs to know the process model to be able to generate the test data necessary to instantiate a process instance.

Figure 2: Workload Model Parameters' Interactions

The test data are used as an input for starting a process instance, or during the execution of the process instance as part of the performance test. They depend on the definition of the load functions and can refer to, for example, results of evaluating gateway rules, messages for invoking Web services, persistent data required for completing a task, the number of users, etc.

To summarise: the workload applied to the SUT is determined by the load functions, used to generate process instances of all the process models included in the workload mix. The process instances are instantiated and executed using the test data selected in accordance with the load functions. For instance, we could use the process model shown in Fig. 2 as a simplistic workload mix. Consider Task A a human task and Task B a Web service task. We can define the Load Start Function such that 100 process instances are started with the initialization event and then executed by the WfMS. The test data in this case would be the input data that the users enter into the WfMS for Task A. The Load Distribution Function would define the users' think times, and the time it takes for the Web service to complete.
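To make the interplay of the three workload model parameters concrete, the following sketch (our own illustration; names such as load_start_function and the model file are hypothetical, not prescribed by the methodology) encodes the simplistic workload mix described above: one process model, a Load Start Function issuing 100 instances, and a Load Distribution Function providing think times and Web service durations, together with the test data.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class WorkloadItem:
    """One entry of the workload mix: a process model plus its load functions."""
    process_model: str                                  # serialized BPMN 2.0 model (path)
    load_start_function: Callable[[], List[float]]      # returns instance start times (s)
    load_distribution: Dict[str, Callable[[], float]]   # per-interaction delays (s)
    test_data: Dict[str, object] = field(default_factory=dict)

# Load Start Function: 100 instances started at a constant rate of 2 per second
def constant_rate_start(instances: int = 100, rate: float = 2.0) -> List[float]:
    return [i / rate for i in range(instances)]

workload_mix = [
    WorkloadItem(
        process_model="task_a_task_b.bpmn",             # hypothetical model file
        load_start_function=constant_rate_start,
        load_distribution={
            "task_a_think_time": lambda: random.uniform(5.0, 15.0),  # human Task A
            "task_b_service_time": lambda: random.expovariate(1.0),  # Web service Task B
        },
        test_data={"task_a_form": {"approved": True}},  # input entered for Task A
    )
]
```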

2.1.2 System Under Test

The SUT refers to the system that is being tested for performance. In our case it is a WfMS which, according to the WfMC (Hollingsworth, 1995), includes the Workflow Enactment Service (WES) and any environments and systems required for its proper functioning, e.g., application servers, virtual machines (e.g., the Java Virtual Machine) and DBMSs. The WES is a complex middleware component which, in addition to handling the execution of the process models, interacts with users, external applications, and Web services. This kind of SUT offers a large set of configuration options and deployment alternatives. The DBMS configuration as well as the WES configurations (e.g., the granularity level of history logging, the DB connection options, the order of asynchronous jobs acquisition) may affect its performance. Moreover, WfMSs often offer a wide range of deployment alternatives (e.g., standalone, in a clustered deployment, behind a load balancer in an elastic cloud infrastructure) and target different versions of application server stacks (e.g., Apache Tomcat, JBoss).

Table 1: Summary of Core and Non-core APIs to be implemented by the WfMS

Core APIs
  Initialisation APIs:
    - Deploy a process (min response data: Deployed process ID)
    - Start a process instance (min response data: Process instance ID)

Non-core APIs
  User APIs:
    - Create a user (min response data: User ID)
    - Create a group of users (min response data: User group ID)
    - Access pending tasks (min response data: Pending tasks IDs)
    - Claim a task*
    - Complete a task
  Event APIs:
    - Access pending events (min response data: Pending events IDs)
    - Issue events
  Web service APIs:
    - Map tasks to Web service endpoints

* Optional depending on the WfMS implementation

To add a WfMS to the benchmark, we require the availability of certain Core and Non-core APIs. A summary of these APIs is presented in Table 1.

The Core APIs are Initialisation APIs necessary for the automatic issuing of the simplest workload to the SUT in order to test the SUT's performance. They include: 1) deploy a process and return as a response an identifier of the deployed process (PdID); and 2) start a process instance by using the PdID and return as a response the new instance identifier (PiID).
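As a rough sketch of how a benchmarking client could drive these two Core APIs over HTTP, consider the following; the endpoint paths and response field names are hypothetical, since every WfMS exposes its own API, and a thin per-system wrapper like this is assumed rather than specified by the methodology.

```python
import requests  # generic HTTP client used here for illustration

class CoreApiClient:
    """Minimal client for the two Core APIs (deploy a process, start an instance).
    The endpoint paths ('/deployments', '/process-instances') and the response
    fields ('pdId', 'piId') are assumptions, not a vendor-specified API."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def deploy_process(self, model_path: str) -> str:
        # Upload the serialized process model and return the deployed process ID
        with open(model_path, "rb") as model:
            resp = requests.post(f"{self.base_url}/deployments", files={"model": model})
        resp.raise_for_status()
        return resp.json()["pdId"]          # assumed field carrying the PdID

    def start_process_instance(self, pd_id: str, variables: dict) -> str:
        # Start an instance of the deployed process and return the instance ID
        resp = requests.post(f"{self.base_url}/process-instances",
                             json={"pdId": pd_id, "variables": variables})
        resp.raise_for_status()
        return resp.json()["piId"]          # assumed field carrying the PiID
```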

Depending on the execution language of the WfMS and the constructs that it supports, other Non-core APIs might be necessary for testing more complex workloads. For instance, if we are targeting BPMN 2.0 WfMSs we might also require the following APIs:

For applying workloads involving human tasks, the following User APIs are necessary: 1) create a user and return the identifier of the created user (UsID); 2) create a group of users, return the created group identifier (UgID), and enable adding users by using their UsIDs; 3) pending user/manual tasks: access all the pending user/manual task instances of a given user/manual task identified by its id (Jordan and Evdemon, 2011, sec. 8.2.1) as specified in the model serialization. We want to obtain all the pending tasks with the given id of all the process instances (PiIDs) of a given deployed process (PdID). The API has to respond with data enabling the creation of a collection that maps the process instances to the list of their pending tasks <PiID, UtIDs> and <PiID, MtIDs>; 4) claim a user/manual task identified by UtID/MtID, if tasks are not automatically assigned by the WfMS; and 5) complete a user/manual task identified by UtID/MtID by submitting the data required to complete the task.

To issue a workload containing process models with catching external events, the following Event APIs are necessary: 1) pending catching events/receive tasks: access the pending catching event/receive task instances of a given event/task identified by its id (Jordan and Evdemon, 2011, sec. 8.2.1) specified in the model serialization. We want to obtain all the pending catching events/receive tasks with the given id of all the process instances (PiIDs) of a given deployed process (PdID). The API has to respond with data enabling the creation of a collection that maps the process instances to the list of their pending catching events/receive tasks <PiID, CeIDs> and <PiID, RtIDs>; and 2) issue an event to a pending catching event/receive task identified by using CeID/RtID. We require the APIs to accept the data necessary to correlate the issued event to the correct process instance, e.g., a correlation key.

Finally, to be able to issue a workload defining interaction with Web services and/or containing throwing events, the WfMS has to support a binding mechanism to map each Web service task/throwing event to the corresponding Web service/throwing event endpoint. The WfMS should preferably allow specifying the mapping in the serialized version of the model, so that the binding can be added before deploying the process.

Since many WfMSs are offered as a service, it is safe to assume that many WfMSs expose what we call the Core APIs. In our experience with the systems we have evaluated so far (e.g., Activiti, Bonita BPM, Camunda, Imixs Workflow, jBPM), they support not only the Core APIs, but also the Non-core APIs. The exact API may differ among systems; however, the necessary API features were always present.
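Because the exact API differs among systems while the required features are always present, one way to capture this in a benchmarking framework (a design sketch of ours, not necessarily how BenchFlow implements it) is a per-WfMS adapter implementing a common interface that mirrors Table 1; every method name and signature below is an assumption for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class WfMSAdapter(ABC):
    """Common interface mirroring Table 1; each benchmarked WfMS would get a
    concrete adapter translating these calls to its vendor-specific API."""

    # Core (Initialisation) APIs
    @abstractmethod
    def deploy_process(self, model_path: str) -> str: ...                 # returns PdID
    @abstractmethod
    def start_process_instance(self, pd_id: str, data: dict) -> str: ...  # returns PiID

    # Non-core User APIs
    @abstractmethod
    def create_user(self, name: str) -> str: ...                          # returns UsID
    @abstractmethod
    def pending_user_tasks(self, pd_id: str, task_id: str) -> Dict[str, List[str]]: ...
    @abstractmethod
    def claim_task(self, task_instance_id: str, us_id: str) -> None: ...  # optional per Table 1
    @abstractmethod
    def complete_task(self, task_instance_id: str, data: dict) -> None: ...

    # Non-core Event APIs
    @abstractmethod
    def pending_catching_events(self, pd_id: str, event_id: str) -> Dict[str, List[str]]: ...
    @abstractmethod
    def issue_event(self, event_instance_id: str, correlation_data: dict) -> None: ...
```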

2.1.3 Metrics and Key Performance Indicators

The output of the performance testing is collected and later used to calculate a set of metrics and KPIs. To make performance testing meaningful, they must be carefully selected in order to capture the differences among configurations and systems. A metric is defined as a “quantitative measure of the degree to which a system, component or process possesses a given attribute” (Fenton and Bieman, 2014). A KPI, on the other hand, is “a set of measures” (Parmenter, 2010, ch. 1) which focuses on the critical aspects of the SUT's performance. As opposed to metrics, the number of KPIs should be limited, in order to clearly focus the improvement efforts. We propose grouping the metrics and KPIs in three different levels, based on the requirements of their expected users: 1) Engine-level: to help end-users select the most suitable WfMS as per their performance requirements; 2) Process-level: suitable for WfMS vendors to position and evaluate their offerings in different use-case scenarios; and 3) Feature-level: fine-grained measures allowing WfMS developers to deal with the system's internal complexity and to explore its bottlenecks. For each level, we enumerate a non-exhaustive set of metrics to be calculated using process execution, environment monitoring and resource utilization data. Which metrics are calculated when running a specific performance test depends on the objective of the test. To increase transparency, a detailed definition of each of the used metrics will be provided to the WfMS vendors, together with the benchmark results.

The Engine-level metrics measure the WfMS performance based on the execution of the entire test workload. Most metrics defined in the literature for assessing WfMS performance refer to this level and include throughput, latency, and resource utilization (e.g., RAM, CPU, disk and network utilization) (Rock et al., 2014). In addition to these metrics, we propose to also add metrics for assessing the scalability, the capacity (Jain, 1991, p. 123), and the flexibility of the SUT. For example: response time (ms), i.e., the time it takes for the SUT to respond to the service request (Barbacci et al., 1995, ch. 3); endurance, i.e., whether the SUT can sustain a typical production load (Molyneaux, 2014, p. 51); and flexibility to spike, i.e., how the SUT reacts to a sudden increase of the load (Molyneaux, 2014, ch. 3).

The Process-level metrics measure the WfMS performance based on the execution of instances of a specific process model in the test workload. The resource utilization metric is interesting at the process level as well, to analyse how much resource consumption is needed per single process instance execution. We also propose metrics for the duration of a process instance, and the delay introduced by the WfMS in executing the process instance. The delay is computed as the difference between the actual and the expected duration of the process instance, where the expected duration is determined as the aggregation of the expected durations of the constructs included in the executed path of the process instance.

Different languages can support different concepts (e.g., time, parallelism, control and data flow), thus impacting the design of the WfMSs executing them. The Feature-level fine-grained metrics are necessary to determine the impact of specific language constructs or patterns on WfMS performance, and to isolate possible sources of performance issues. We propose measuring the resource usage, the duration, and the delay for each language construct as well as for frequently used patterns (http://www.workflowpatterns.com). For instance, in BPMN 2.0, such constructs can include user task, timer event, exclusive gateway, etc. The ultimate goal is to help developers determine the performance bottlenecks of their WfMS, so that they can work on removing them, and thus improve the WfMS performance in general. We have used some of the mentioned metrics to evaluate the BenchFlow framework in (Ferme et al., 2015).

However, being able to calculate the above-mentioned metrics imposes certain requirements on the granularity and the type of data to be saved in the DBMS. The DBMS persists and recovers the process instance execution status and history. The most important data to be recorded are the following timestamps: the start time and end time of each executed process instance (identified by its PiID), as well as the start time and end time of specific language constructs that are part of the process instances, e.g., activities or events identified by their id, with a possibility to map them to the process instance they belong to. The timestamps should preferably have a precision up to milliseconds.
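Given such millisecond-precision start/end timestamps, the process-level duration and delay defined above reduce to simple offline arithmetic. The sketch below assumes the timestamp records have already been extracted from the WfMS database and that the expected duration of each construct is a known benchmark input; the data shapes are our own illustration.

```python
from typing import Dict, List, Tuple

# start/end timestamps in milliseconds, as extracted from the WfMS database
InstanceTimes = Dict[str, Tuple[int, int]]              # PiID -> (start_ms, end_ms)
ConstructTimes = Dict[str, List[Tuple[str, int, int]]]  # PiID -> [(construct id, start, end)]

def instance_duration_ms(times: InstanceTimes) -> Dict[str, int]:
    """Process-level metric: duration of each process instance."""
    return {pi_id: end - start for pi_id, (start, end) in times.items()}

def instance_delay_ms(times: InstanceTimes,
                      constructs: ConstructTimes,
                      expected_ms: Dict[str, int]) -> Dict[str, int]:
    """Delay introduced by the WfMS: actual duration minus the aggregated
    expected duration of the constructs on the executed path (the expected
    durations per construct are an input of the benchmark, assumed known)."""
    delays = {}
    for pi_id, (start, end) in times.items():
        expected = sum(expected_ms.get(c_id, 0) for c_id, _, _ in constructs.get(pi_id, []))
        delays[pi_id] = (end - start) - expected
    return delays
```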

2.2 Benchmarking Framework Implementation

The goal of the benchmarking framework is twofold: 1) to make our research publicly available and foster collaboration; and 2) to enable reproducibility and full transparency in the benchmarking process. To do so we have released the framework as open source (https://github.com/benchflow). The second goal is very important for inducing WfMS vendors to join our efforts, since it enables them to verify the benchmark results and use the framework with their own infrastructure.

Figure 3: Framework Deployment View

The framework builds on: 1) Faban (Faban, 2014), an established and tested “performance workload creation and execution framework”, to design the load drivers and issue the load; and 2) lightweight software containerization technologies, precisely Docker (Turnbull, 2014) in the current version of the framework, in order to obtain, from the vendors, a ready-to-use setup of the WfMSs to be tested. We have decided to adopt Docker since it is a widely used and accepted containerization solution (Merkel, 2014). Moreover, if carefully configured, it has proved to have a negligible impact on the performance of the containerized application (Felter et al., 2014). The benefits of using containerization technologies are: 1) avoiding the highly error-prone task of installing the WfMS, which can lead to a non-optimal setup impacting its performance results; 2) ensuring the reproducibility of the benchmark results by saving the system's initial status (Boettiger, 2014); and 3) collecting detailed resource usage statistics for each Container by means of the Docker stats API.
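The Docker stats API mentioned above can be queried, for instance, through the Docker SDK for Python; the sketch below takes one memory/CPU snapshot for a single WfMS Container, along the lines of what a monitoring component could do. The container name is a hypothetical example, and the field names follow the JSON returned by the Docker stats endpoint.

```python
import docker  # Docker SDK for Python (pip install docker)

def sample_container_stats(container_name: str) -> dict:
    """Take one snapshot of memory and CPU usage for a WfMS Container
    via the Docker stats API (container name is an assumed example)."""
    client = docker.from_env()
    stats = client.containers.get(container_name).stats(stream=False)

    mem_bytes = stats["memory_stats"].get("usage", 0)

    # CPU usage since the previous sample, following the usual delta formula
    cpu_delta = stats["cpu_stats"]["cpu_usage"]["total_usage"] - \
                stats["precpu_stats"]["cpu_usage"]["total_usage"]
    system_delta = stats["cpu_stats"].get("system_cpu_usage", 0) - \
                   stats["precpu_stats"].get("system_cpu_usage", 0)
    cpus = stats["cpu_stats"].get("online_cpus", 1)
    cpu_percent = (cpu_delta / system_delta) * cpus * 100.0 if system_delta > 0 else 0.0

    return {"memory_bytes": mem_bytes, "cpu_percent": cpu_percent}

# e.g., sample_container_stats("wfms-wes")  # "wfms-wes" is a hypothetical container name
```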

The architecture of the framework has been described in detail in (Ferme et al., 2015), while Fig. 3 presents the framework components and their deployment. Each of the framework's components is deployed inside a different Container. Although we use Docker as a containerization technology, the proposed methodology and framework can be applied to any other containerization technology as well. To scale with the number of performance tests to be included in the benchmark, in order to provide for a reliable comparison of WfMSs, we integrate the deployment of the WfMS in the setup of the performance test. We do so by using a SUT Deployment Descriptor and by extending the Faban Master in the Benchmark Master. The SUT Deployment Descriptor defines the servers on which to deploy the WfMS's Containers. The Benchmark Master automatically deploys the SUT as well as the workload, and all the related entities defined in the load functions, i.e., the Load Start Functions, the users and applications, which are simulated by means of Faban Driver Agents, and the Web services, which are simulated by using techniques for automatic testbed generation, e.g., (Juszczyk et al., 2008). The main deployment requirement for a reliable benchmark is to isolate, as much as possible, the WES from other components, by deploying them on different physical machines on the same local network, with the purpose of minimising the possible noise introduced by their concurrent execution on the same machine (e.g., the DBMS running on the same machine as the WES). To do so, the framework enables the user to select the servers where to deploy the different components of the SUT, so that there are enough computational resources to generate a load that stresses the WfMS performance, and enough computational resources for the WES to sustain that load. For example, less powerful servers can be used to run the WES, while the load functions and the DBMS can be assigned to more powerful servers. While enhancing the reliability of the obtained results, deploying the components on different physical machines adds network costs to the system. We mitigate these costs by connecting the different servers on a high-performance local network.
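The paper does not show the concrete format of the SUT Deployment Descriptor, so the following is only a hypothetical sketch of the kind of information it has to carry: which physical server hosts which Container, so that the WES is isolated from the DBMS and from the load drivers. All names (servers, images, versions) are assumptions.

```python
# Hypothetical SUT Deployment Descriptor: the concrete format used by the
# Benchmark Master is not prescribed here, but it must map each Container
# and load driver to a physical server on the same local network.
sut_deployment = {
    "sut": "example-wfms-1.0",                # assumed WfMS name/version
    "containers": {
        "wes":  {"image": "vendor/wes:1.0", "server": "server-small-1"},
        "dbms": {"image": "mysql:5.6",      "server": "server-large-1"},
    },
    "load_drivers": {
        "faban-agents": {"instances": 2, "server": "server-large-2"},
    },
}

def servers_are_isolated(descriptor: dict) -> bool:
    """Check the main deployment requirement: the WES runs on a machine
    that hosts neither the DBMS nor the load drivers."""
    wes_server = descriptor["containers"]["wes"]["server"]
    others = [c["server"] for name, c in descriptor["containers"].items() if name != "wes"]
    others += [d["server"] for d in descriptor["load_drivers"].values()]
    return wes_server not in others

assert servers_are_isolated(sut_deployment)
```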

The framework also deals with the asynchronous execution of process models by gathering performance data directly from the DB used by the WES. Data cannot be gathered from the client side since the starting of a new process instance from the Core API is done in an asynchronous way. This means that the Core API call to the WfMS finishes as soon as the instance is created, and the Load Start Function is only aware of the start time of a process instance, while the end time needs to be obtained directly from the DB. By using measurements collected from all the WfMS Containers, the framework ensures that the conditions of the environments that host the WfMS are suitable for benchmark execution (e.g., there is neither interference from other running applications nor a bottleneck in the underlying hardware). This safeguards the reliability of the environment in which the performance tests are executed.

3 Formal Interaction and Artifacts

Although a carefully designed framework can lead to a benchmark that systematically satisfies the aforementioned desired properties, its industry acceptance is not always a straightforward process. To address this issue, in addition to the technical framework, as part of our methodology, we also propose a process of formal interaction with WfMS vendors (Fig. 4). Its main purpose is to enhance the transparency of the overall benchmarking process, while promoting the engagement of WfMS vendors. The interaction can be initiated either by the BenchFlow team or by the WfMS vendor. Fig. 4 depicts only the desired path. If the vendor decides not to participate in the benchmark, we can still run tests on its WfMS and publish only anonymised results in research papers, provided it is not prohibited by its licence agreement. The rest of this section describes in detail the artifacts exchanged or produced during the benchmarking process.

3.1 Agreement with Vendors

After the vendor becomes familiar with the proposed benchmarking methodology, its WfMS can only be included in the benchmark after a written agreement is signed. The agreement precisely defines the rights and obligations of both parties: the vendor and the BenchFlow team. The main concerns the vendor needs to agree on are:

• defining which versions of the WfMS will be included in the benchmark. Such versions need to be a production-stable release of the system and provide at least the Core APIs described in Subsection 2.1.2;

• providing the BenchFlow team with a containerized WfMS for each version to be included in the benchmark, or providing the WfMS's installation guide and configuration details to enable the BenchFlow team to prepare the containerized WfMS;

• making the containerized WfMS publicly available on a Container registry or providing the BenchFlow team access to a private registry;

• authorizing the BenchFlow team to publish the obtained results on its website and in research papers using the WfMS's name, after the vendor has verified their correctness.

The framework will be publicly available for free use for non-commercial purposes, however with limited analytics functionalities. Full analytics and different workload models will be available to vendors which sign the agreement and include their WfMS in the benchmark. Thus they would benefit from detailed insights into possible causes of performance bottlenecks, and from performance comparisons to competitors' products or to previous versions of their own product. Namely, they could integrate the benchmarking framework in their performance regression testing framework, in addition to their continuous improvement efforts.

3.2 Containerized WfMS

Figure 4: Benchmarking Methodology Choreography

As discussed in Section 2.2, we deploy the WfMS components in different Containers. To isolate the WES from other components, to access relevant data, and to formalize the WfMS configuration parameters, we define some requirements for the containerized WfMS. The vendor should provide at least two separate Containers, one for the WES and one for the DBMS. Separate Containers can be provided for other WfMS components as well (e.g., load balancers, workload predictors and autonomic controllers) depending on the WfMS's architecture. The DBMS Container can refer to an existing publicly available Container distribution. The containerized WfMS should be publicly available (e.g., at the Docker Hub registry, https://hub.docker.com/), or the BenchFlow team should be granted access to a private registry used by the vendor. The same applies to the Containers' definition file, i.e., the Dockerfile (Turnbull, 2014, ch. 4). While private registries are a solution that can work with vendors of closed-source systems, they impact the reproducibility of the results. For each WfMS version to be included in the benchmark, there must be a default configuration, i.e., the configuration in which the WfMS Containers start without modifying any configuration parameters, except the ones required in order to correctly start the WfMS, e.g., the database connection. However, if vendors want to benchmark the performance of their WfMS with different configurations, for example the configuration provided to users as a “getting started” configuration, or production-grade configurations for real-life usage, they can also provide different configurations. To do that, the Containers must allow issuing the WfMS configurations through additional environment variables (https://docs.docker.com/reference/run/#env-environment-variables), and/or by specifying the volumes to be exposed (https://docs.docker.com/userguide/dockervolumes/#mount-a-host-directory-as-a-data-volume) in order to access the configuration files. Exposing the volumes allows accessing the files defined inside the Containers on the host operating system. Precisely, the WES Container has to, at least, enable the configuration of: 1) the used DB, i.e., the DB driver, url, username and password for the connection; 2) the WES itself; and 3) the logging level of the WES, and of the application stack layers on which it may depend. Alternatively, instead of providing configuration details, the vendors can provide different Containers for different configurations. However, even in that case enabling the DB configuration is required.

In order to access relevant data, all WfMS Containers have to specify the volumes to access the WfMS log files, and to access all the data useful to set up the system. For instance, if the WfMS defines examples as part of the deployment, we may want to remove those examples by overriding the volume containing them. Moreover, the WES Container (or the DBMS Container) has to create the WfMS's DB and populate it with the data required to run the WfMS. The vendors need to provide the authentication configuration details of the WfMS components (e.g., the user with admin privileges for the DBMS). Alternatively, the vendor may simply provide the files needed to create and populate the DB.
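As an illustration of these requirements, the snippet below starts a hypothetical vendor-provided WES Container with the Docker SDK for Python, passing the DB connection as environment variables and mounting host directories over the volumes that expose the configuration files and the logs. The image name, the environment variable names, and the paths are assumptions used for illustration; the actual ones are defined by each vendor's Container.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Start a hypothetical vendor-provided WES Container: DB connection and logging
# level via environment variables, configuration and log files via volumes.
wes = client.containers.run(
    "vendor/wfms-wes:1.0",          # assumed image name
    detach=True,
    name="wfms-wes",
    environment={
        "DB_DRIVER": "com.mysql.jdbc.Driver",
        "DB_URL": "jdbc:mysql://dbms-host:3306/wfms",
        "DB_USER": "benchflow",
        "DB_PASSWORD": "secret",
        "LOG_LEVEL": "INFO",
    },
    volumes={
        "/srv/benchflow/wes-config": {"bind": "/opt/wfms/conf", "mode": "rw"},
        "/srv/benchflow/wes-logs":   {"bind": "/opt/wfms/logs", "mode": "rw"},
    },
)
print(wes.status)
```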


3.3 Draft Benchmark Results

The draft benchmark results delivered to the WfMS vendor consist of two important sets of information: 1) the information necessary for replicating the performance tests, i.e., input and configuration data. This includes a description of the type of tests that have been run, accompanied by all the necessary input for running them, i.e., the parameters of the workload model and the workload as described in Subsection 2.1.1. It also includes data regarding the environment set-up and deployment, and the WfMS configuration; and 2) information and data regarding the metrics and KPIs calculated with the framework, both in terms of the precise definition of their calculation and purpose, as well as in terms of the results of their calculation and a possible interpretation of the same. If the calculation of certain metrics/KPIs is not possible, due to a lack of data or missing system support for given language constructs, the status of the metric calculation attempt will be provided in the draft results, together with an explanation of its meaning. The raw performance data, as obtained during the execution, will also be included in the draft results, to enable WfMS vendors to calculate additional metrics not provided in the framework, should it be necessary.

If the WfMS vendors have agreed on including different versions or configurations of their product in the benchmark, they would receive draft benchmark results for each version/configuration. This would enable them, for example, to draw conclusions on how different configurations impact the performance of their product and, if necessary, replace their default configuration with a better performing one. They could also check the benchmark's website for published results from other WfMS vendors to understand their product's market position.

3.4 Verified Benchmark Results

After receiving the draft benchmark results, the vendors can validate their correctness by replicating the tests and comparing the obtained results. Ideally, no significant differences should be noticed in the results when replicating the tests using the same environment and workload the BenchFlow team had used to perform the initial tests. However, should such differences be identified, the vendor has to report them to the BenchFlow team, which is obliged, by the agreement, not to publish results unless they have been verified for correctness by the vendor. An iterative approach will be followed, as evident in Fig. 4, until the results have been verified and thus authorised for publishing on the benchmark's website and in research papers. A reasonable time for the verification of draft results will be defined in the agreement as per discussion with the vendors.

4 Related Work

To the best of our knowledge, this is the first work proposing a Container-centric benchmarking framework, and providing a full methodology for benchmarking WfMSs. There have been various efforts to outline structured frameworks for the description of benchmarking methodologies applied to diverse applications and environments. Iosup et al. introduce a generic approach for Cloud benchmarking, discuss the challenges in benchmarking Cloud environments, and summarize experiences in the field (Iosup et al., 2014). They also propose a benchmarking tool, SkyMark, for workloads based on the MapReduce model (Dean and Ghemawat, 2008). Baruwal Chhetri et al. propose Smart CloudBench (Chhetri et al., 2015), a platform for automated performance benchmarking of the Cloud. Both of these references propose Infrastructure-as-a-Service (IaaS) benchmarks, thus they do not deal with the challenges introduced by benchmarking applications (e.g., the need for an agreement with vendors). Puri presents the challenges in benchmarking applications, and acknowledges that the benchmarking process requires extensive planning and thus a well-defined methodology (Puri, 2008). While the aim of the mentioned papers is similar to ours, i.e., to formalize a benchmarking process for emerging technologies and attract industry attention to give feedback and share experiences, the focus is different. In our work we focus on benchmarking a precise middleware, the WfMS.

SPEC and TPC propose a different approach to benchmarking software systems. Their approach can be seen as complementary to the one proposed by us, since it allows the vendors to execute the benchmarks on their own hardware and send back the results. The SPEC/TPC committees verify internally the correctness of the results before publication on their website. Their approach has been demonstrated to be effective when a standard benchmark is well defined. Currently we are working on defining such a standard benchmark for WfMSs, thus at this phase, we believe that feedback from different vendors through validation of internally calculated results might be more effective.

The need to benchmark WfMSs is frequently discussed in the literature (Wetzstein et al., 2009; Russell et al., 2007). Gillmann et al. compare the performance of a commercial WfMS with the one developed by the authors, by using a simple e-commerce workflow (Gillmann et al., 2000). The benchmark measures the throughput of each of the systems as well as the impact of the DB work that is forwarded to a dedicated server. Bianculli et al. propose a more systematic approach in SOABench (Bianculli et al., 2010b), where they tackle the automatic generation, execution and analysis of testbeds for testing service-oriented middleware performance. That framework is used to compare the response time of several Web Services Business Process Execution Language (WS-BPEL) WfMSs, i.e., ActiveVOS, jBPM, and Apache ODE (Bianculli et al., 2010a), when using a different number of clients and different think times between subsequent requests. The results have pointed to some scalability limitations of the tested systems.

The limitations of the above projects, and of other work in the area (Rock et al., 2014; Skouradaki et al., 2015; Daniel et al., 2011), lie in the fact that: 1) they target a small number of WfMSs; 2) their workload mix consists of simple, non-representative processes; 3) their performance tests are limited to, e.g., response time, or load testing; 4) the performance metrics and KPIs used are not focused on WfMSs, but on software systems in general; and 5) they focus on the benchmarking framework architecture, while ignoring the methodological issues discussed in this paper. Thus they do not satisfy the requirements of standard benchmarks, but can be seen more as custom benchmarks. Our methodology aims at addressing these limitations, while being applicable to different industry use cases.

There are many commercial and open-source performance framework solutions (Molyneaux, 2014), some dedicated to generic web applications (Faban, 2014; The Apache Software Foundation, 2015; SmartBear, a), and others to specific middleware performance testing (Li et al., 2009; SmartBear, b). However, none of them refers specifically to WfMSs. Hence, this paper aims at filling that gap by offering an open-source solution.

5 Conclusion and Future Work

Benchmarking is a widely accepted approach to assess the current state of a technology and drive its performance improvement (Sim et al., 2003), provided that it meets the main requirements of relevance, trustworthiness, and reproducibility. To address these requirements, in this paper, we propose a framework which takes advantage of Docker Containers and Faban. They enable the automatic and reliable benchmarking of different WfMSs, as long as the latter expose suitable APIs for process model deployment, instance execution, and monitoring. The actual performance metrics are collected and processed offline, after the workload has been completely issued in accordance with the corresponding workload model. By exploiting containerization technologies, we also deal with some of the logistic challenges in handling the interaction with WfMS vendors. To enhance the transparency, our methodology: 1) clearly defines the interaction between the BenchFlow team and the vendors, and formalises it by means of a written agreement; 2) allows the vendors to provide configured WfMS Containers for each version to be included in the benchmark, as well as for each configuration alternative; and 3) defines a verification process in which the vendors can access the draft benchmark results, as well as all the configuration parameters for automatic execution of the tests, in order to validate their correctness.

We have started applying the methodology with two open-source WfMSs (Ferme et al., 2015), which is driving the further development of the complete benchmarking framework that will be released as open source. This will enable interested vendors to apply the framework following the methodology defined in this paper to obtain their performance benchmark results. Only then will we be able to provide a full validation of the proposed methodology's capabilities for real-world benchmarking and a definition of KPIs capable of rightfully reflecting the differences in performance. For the time being, the approach described in this paper has been approved as sound by the management of the Activiti BPM Platform (http://activiti.org).

6 ACKNOWLEDGMENTS

This work is funded by the Swiss National Science Foundation and the German Research Foundation with the BenchFlow project (DACH Grant Nr. 200021E-145062/1).

REFERENCES

Barbacci, M., Klein, M., et al. (1995). Quality attributes. Technical report, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA.

Bianculli, D., Binder, W., and Drago, M. L. (2010a). SOABench: Performance evaluation of service-oriented middleware made easy. In Proc. of the ICSE '10, pages 301–302.

Bianculli, D., Binder, W., et al. (2010b). Automated performance assessment for service-oriented middleware: A case study on BPEL engines. In Proc. of the WWW '10, pages 141–150.

Boettiger, C. (2014). An introduction to Docker for reproducible research, with examples from the R environment. ACM SIGOPS Operating Systems Review, Special Issue on Repeatability and Sharing of Experimental Artifacts, 49(1):71–79.

Chhetri, M. B., Chichin, S., et al. (2015). Smart CloudBench – a framework for evaluating Cloud infrastructure performance. Information Systems Frontiers, pages 1–16.

Daniel, F., Pozzi, G., and Zhang, Y. (2011). Workflow engine performance evaluation by a black-box approach. In Proc. of the International Conference on Informatics Engineering & Information Science (ICIEIS '11), pages 189–203. Springer.

Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.

Faban (2014). Faban. http://faban.org.

Felter, W., Ferreira, A., Rajamony, R., and Rubio, J. (2014). An updated performance comparison of virtual machines and Linux containers. Technical report, IBM.

Fenton, N. and Bieman, J. (2014). Software Metrics: A Rigorous and Practical Approach. CRC Press, 3rd edition.

Ferme, V., Ivanchikj, A., and Pautasso, C. (2015). A framework for benchmarking BPMN 2.0 workflow management systems. In Proc. of the 13th International Conference on Business Process Management, BPM '15. Springer.

Geiger, M., Harrer, S., et al. (2015). BPMN conformance in open source engines. In Proc. of the SOSE 2015, San Francisco Bay, CA, USA. IEEE.

Gillmann, M., Mindermann, R., et al. (2000). Benchmarking and configuration of workflow management systems. In Proc. of the CoopIS '00, pages 186–197.

Gray, J. (1992). The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 2nd edition.

Hollingsworth, D. (1995). Workflow Management Coalition: The workflow reference model. Workflow Management Coalition, 68.

Huppler, K. (2009). The art of building a good benchmark. In Performance Evaluation and Benchmarking (TPCTC 2009), pages 18–30. Springer.

Iosup, A., Prodan, R., et al. (2014). IaaS Cloud benchmarking: Approaches, challenges, and experience. In Cloud Computing for Data-Intensive Applications, pages 83–104. Springer New York.

Jain, R. K. (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation and Modeling. Wiley, NY.

Jordan, D. and Evdemon, J. (2011). Business Process Model And Notation (BPMN) version 2.0. Object Management Group. http://www.omg.org/spec/BPMN/2.0/.

Juszczyk, L., Truong, H.-L., et al. (2008). GENESIS - a framework for automatic generation and steering of testbeds of complex Web Services. In Proc. of the ICECCS 2008, pages 131–140.

Leymann, F. (2011). BPEL vs. BPMN 2.0: Should you care? In Proc. of the BPMN '10, volume 67, pages 8–13. Springer.

Li, X., Huai, J., et al. (2009). SOArMetrics: A toolkit for testing and evaluating SOA middleware. In Proc. of SERVICES-1 '09, pages 163–170.

Merkel, D. (2014). Docker: Lightweight Linux containers for consistent development and deployment. Linux J., 2014(239).

Molyneaux, I. (2014). The Art of Application Performance Testing: From Strategy to Tools. O'Reilly Media, Inc., 2nd edition.

Parmenter, D. (2010). Key Performance Indicators (KPI): Developing, Implementing, and Using Winning KPIs. John Wiley & Sons.

Pautasso, C., Ferme, V., et al. (2015). Towards workflow benchmarking: Open research challenges. In Proc. of the BTW 2015, pages 331–350.

Puri, S. (2008). Recommendations for performance benchmarking. http://www.infosys.com/consulting/architecture-services/white-papers/Documents/performance-benchmarking-recommendations.pdf.

Rock, C., Harrer, S., et al. (2014). Performance benchmarking of BPEL engines: A comparison framework, status quo evaluation and challenges. In Proc. of the SEKE 2014, pages 31–34.

Russell, N., van der Aalst, W. M., and Hofstede, A. (2007). All that glitters is not gold: Selecting the right tool for your BPM needs. Cutter IT Journal, 20(11):31–38.

Sim, S. E., Easterbrook, S., et al. (2003). Using benchmarking to advance research: A challenge to software engineering. In Proc. of the ICSE '03, pages 74–83.

Skouradaki, M., Roller, D. H., et al. (2015). On the road to benchmarking BPMN 2.0 workflow engines. In Proc. of the ICPE '15, pages 301–304.

SmartBear. LoadUI. http://www.loadui.org/.

SmartBear. SoapUI. http://www.soapui.org.

The Apache Software Foundation (2015). JMeter. http://jmeter.apache.org.

Transaction Processing Council (TPC) (1997). TPC Benchmark C (Online Transaction Processing Benchmark), Version 5.11.

Transaction Processing Performance Council (2007). TPC-E. http://www.tpc.org/tpce.

Turnbull, J. (2014). The Docker Book: Containerization is the New Virtualization.

van Hoorn, A., Vögele, C., et al. (2014). Automatic extraction of probabilistic workload specifications for load testing session-based application systems. In Proc. of the VALUETOOLS 2014, pages 139–146. ICST.

Wetzstein, B., Leitner, P., et al. (2009). Monitoring and analyzing influential factors of business process performance. In Proc. of EDOC '09, pages 141–150.