upm scube-dynsys-presentation

Building Dynamic Models of Service CompositionsWith Simulation of Provision Resources

Dragan Ivanovic1, Martin Treiber2,Manuel Carro1, Schahram Dustdar2

1Universidad Politecnica de Madrid

2Technische Universitat Wien

S-Cube Industrial Dissemination Workshop, Thales, Paris, 24/2/2012

SOA Context

Service Orchestrations: centralized control flow, services combinedto achieve more complex tasks.

Exposed as services to the outside world — composability.Cross organizational boundaries — loose coupling.Usually designed around the notion of business processes.Events, asynchronous exchanges, stateful conversations.Often expressed in a specialized language: BPEL, YAWL, etc.

Service-Level Agreements (SLAs): constraints that provider andclient are expected to comply with.

Providers often implement some type of elasticity (of provisionresources) to ensure SLA and meet demand in different scenarios forload/request rates.

From the client perspective: attributing perceived QoS to logic,components, and/or infrastructure resources is difficult.

Motivation

SLA offering based on provider’s assesment of the expected behaviorof the provision system for a given orchestration:

for different input scenarios,with appropriate resource scaling policies.

Goal of this work

Predict dynamic (time-dependent) behavior of service provision with QoSattributes, taking into account characteristics of the particular compositionbeing served.

Given an inflow of requests (data) ⇒ infer how it maps into outflow.

Taking workflow structure into account ⇒ information onbehavior/bottlenecks of orchestration components.

Flows are defined on continuous time domains ⇒ using real (dense)functions of time gives dynamic system simulation.

Overview of the ApproachOverview of the Approach

AbstractBPEL

Exec.BPEL YAWL

WorkflowDefinition

Monitoring Data

Petri Net Model

DynamicWorkflowModel

ProvisionResourceModel

QoS Estimates

What-If Analysis

Policy Testing

Sensitivity etc.

Simulation

Modeling an Orchestration

Petri Nets as intermediate representation:

Used with several front-end languages.Weak constraints on net shape.

Places (circles) stand for conditions inworkflow execution.

Including start and finish.

Tokens (dots) denote running instances

They mark places.

Petri Nets

!"

#"

!$

#$!%

#%

!&!'()*+,+*-.,/0*,/12*-,/1,*3()/'4,1-*,15/617-6,/8(-+7/71-,9:1++7;'4,+/1)0(+/7)(''4,)01+*-<=

#8(-+7/71-+,+*-.,/0*,/12*-,/1,('',1>,/0*78,15/617-6,:'()*+=

Transitions (squares) stand for workflow activities and fire when allinput places (preconditions) are marked (stochastic choice).

A step through a Petri net fires all transitions that can fire andmoves tokens to new places.

Places and Transitions as Flows and Stocks

PT-nets oversimplify behavior: transition times not taken intoaccount (only discrete steps).

In reality:

Transitions (activities) take some non-zero time.Tokens (orchestration instances) stay in places for a negligible time.

To recover that dynamics, we use continuous-time models.

Stock (S)Inflow (fi ) Outflow (fo)

Initial Stock (S0)

S |t=0 = S0,ddt S = fi − fo ⇔ S = S0 +

∫ t0 (fi − fo)dt

Transitions 7→ stocks [number of instances of the activity].

Places 7→ flows [number of instances transiting per second].

Continuous-Time (CT) Model Derivation

Model derivation process mechanical.

Can be fully automated.

CT scheme for a place:non-input places aggregateoutputs from one or moreactivities.

· · ·

m

om

mp(t): a new stock ∀p ∈ •m

qt = argminp∈•m{mp(t)}m(t) = mqt (t)ddt mp(t) = p(t)wpm −om(t) , ∀p ∈ •m

om(t) =

�m(t)/Tm Tm > 0qt(t)wqt m Tm = 0

Fig. 7. ODE scheme for a transition.

· · ·

p p(t) = ∑m∈•p{om(t)}

Fig. 8. ODE scheme for a place.

Figure 7 shows a general ODE scheme for a transition m ∈ M with one or moreinput places. With single input place, the transition continuously accumulates tokensfrom the input place, and discharges them either instantaneously (Tm = 0) or gradually(Tm > 0) through om(t).

When a transition has more than one input place, its execution is driven by thesmallest number of accumulated tokens. At time t, qt ∈ •m denotes the the place fromwhich the smallest number of tokens has been accumulated. Because a transition needsto collect a token from all of its input places to fire, the smallest token accumulationmqt (t) dictates m(t).

When the average execution time Tm > 0, the outflow om(t) = m(t)/Tm correspondsto exponential decay. When Tm = 0, the transition is instantaneous, i.e. m(t) = 0, whichmeans that outflow has to balance inflow qt(t)wqt m from qt , therefore keeping mqt (t) atzero.

Figure 8 shows a general ODE scheme for a place p ∈ P, which is simply a sumof outflows from incoming transitions, assuming that •p is non-empty. Places with anempty set of incoming transitions must be treated as exogenous factors.

4.3 An Example ODE Model

To illustrate the approach to construction of the ODE model, we look at the PT-netrepresentation of our working example, shown in Figure 2. The PT-net model has thestarting place pin and the final place pout. Transitions that correspond to invocations ofpartner services are marked with letters A..H, and we assume that the correspondingaverage execution times TA..TH are non-zero. Other transitions are assumed to be in-stantaneous, and with the exception of the AND-join transition (marked J), they simplypropagate their inflow. Places p4 and p5 are decision nodes, and their outgoing links areannotated with weight factors corresponding to branch probabilities. With reference to

Slightly morecomplex fortransitions withseveral inputplaces.

· · ·

m

om

mp(t): a new stock ∀p ∈ •m

qt = argminp∈•m{mp(t)}m(t) = mqt (t)ddt mp(t) = p(t)wpm −om(t) , ∀p ∈ •m

om(t) =

�m(t)/Tm Tm > 0qt(t)wqt m Tm = 0

Fig. 7. ODE scheme for a transition.

· · ·

p p(t) = ∑m∈•p{om(t)}

Fig. 8. ODE scheme for a place.

Figure 7 shows a general ODE scheme for a transition m ∈ M with one or moreinput places. With single input place, the transition continuously accumulates tokensfrom the input place, and discharges them either instantaneously (Tm = 0) or gradually(Tm > 0) through om(t).

When a transition has more than one input place, its execution is driven by thesmallest number of accumulated tokens. At time t, qt ∈ •m denotes the the place fromwhich the smallest number of tokens has been accumulated. Because a transition needsto collect a token from all of its input places to fire, the smallest token accumulationmqt (t) dictates m(t).

When the average execution time Tm > 0, the outflow om(t) = m(t)/Tm correspondsto exponential decay. When Tm = 0, the transition is instantaneous, i.e. m(t) = 0, whichmeans that outflow has to balance inflow qt(t)wqt m from qt , therefore keeping mqt (t) atzero.

Figure 8 shows a general ODE scheme for a place p ∈ P, which is simply a sumof outflows from incoming transitions, assuming that •p is non-empty. Places with anempty set of incoming transitions must be treated as exogenous factors.

4.3 An Example ODE Model

To illustrate the approach to construction of the ODE model, we look at the PT-netrepresentation of our working example, shown in Figure 2. The PT-net model has thestarting place pin and the final place pout. Transitions that correspond to invocations ofpartner services are marked with letters A..H, and we assume that the correspondingaverage execution times TA..TH are non-zero. Other transitions are assumed to be in-stantaneous, and with the exception of the AND-join transition (marked J), they simplypropagate their inflow. Places p4 and p5 are decision nodes, and their outgoing links areannotated with weight factors corresponding to branch probabilities. With reference to

Composability

Modeling asynchronousmessage exchange betweenservices is easy.

· · ·

As

sA

Ae

rA

· · ·

Ar

Fig. 10. Asynchronous messaging scheme.

· · ·

m

· · ·

m1−φm

Φφm

⇒

Fig. 11. Failure accounting scheme.

tions. Both can be built into the model automatically, when translating from a concreteorchestration language with known formal semantics. Here we discuss a way to dealwith asynchronicity and faults in a general case.

Figure 10 shows a usual pattern of asynchronous communication with a partnerservice A. Transition As sends a message to A via a dedicated place sA, and transition Arreceives the reply through a synchronizing place rA. The same representation applies tosynchronous messaging as well, with Ar directly following As, and rA as its single inputplace.

In Figure 2, we have combined As, sA, Ae, rA, and Ar into a single transition Acharacterized with an overall average execution time TA. The rate sA(t) is the send rate,to be connected with the input rate parameter of the sub-model of Ae, while rA(t) is thereply rate from the sub-model, to be connected with the receive rate in the main OCTmodel. Examples of “wrapper” sub-models for Ae are: {rA(t) = sA(t)} (short circuiting,zero time), and {rA(t) = Ae(t)/TAe ; d

dt Ae(t) = sA(t)− rA(t)} (black box, definite time).Failures can be accounted for by introducing failure probabilities φm for each tran-

sition in the PT-net model, and decorating the transitions as shown in Figure 11. Faulthandling is represented by Φ . In the simplest case of unrecoverable faults, Φ is aninstantaneous transition to a terminal fault place pfail.

4.5 A Sample Resource Model

For a sample resource model, we model threads that execute orchestration activities onthe provider’s infrastructure. In the sample, shown on Figure 12, we assume that ser-vices A, B, G and H from Figure 2 (corresponding to the Formatting, Logging, Addingand Storage services) are “back-end” services hosted by the orchestration provider, sothat their each execution occupies a (logical) thread. The number of occupied threads inthe resource model is shown as X . The current capacity (available number of threads)is shown as X , and γ is the degree of utilization. The blocking factor β is 1 if somecapacity is free, 0 otherwise.

We “open” the transition boxes that represent component or partnerservices and connect their CT-models.

Accounting for failure:introduce failureprobabilities, and asingle failure sink.

· · ·

As

sA

Ae

rA

· · ·

Ar

Fig. 10. Asynchronous messaging scheme.

· · ·

m

· · ·

m1−φm

Φφm

⇒

Fig. 11. Failure accounting scheme.

tions. Both can be built into the model automatically, when translating from a concreteorchestration language with known formal semantics. Here we discuss a way to dealwith asynchronicity and faults in a general case.

Figure 10 shows a usual pattern of asynchronous communication with a partnerservice A. Transition As sends a message to A via a dedicated place sA, and transition Arreceives the reply through a synchronizing place rA. The same representation applies tosynchronous messaging as well, with Ar directly following As, and rA as its single inputplace.

In Figure 2, we have combined As, sA, Ae, rA, and Ar into a single transition Acharacterized with an overall average execution time TA. The rate sA(t) is the send rate,to be connected with the input rate parameter of the sub-model of Ae, while rA(t) is thereply rate from the sub-model, to be connected with the receive rate in the main OCTmodel. Examples of “wrapper” sub-models for Ae are: {rA(t) = sA(t)} (short circuiting,zero time), and {rA(t) = Ae(t)/TAe ; d

dt Ae(t) = sA(t)− rA(t)} (black box, definite time).Failures can be accounted for by introducing failure probabilities φm for each tran-

sition in the PT-net model, and decorating the transitions as shown in Figure 11. Faulthandling is represented by Φ . In the simplest case of unrecoverable faults, Φ is aninstantaneous transition to a terminal fault place pfail.

4.5 A Sample Resource Model

For a sample resource model, we model threads that execute orchestration activities onthe provider’s infrastructure. In the sample, shown on Figure 12, we assume that ser-vices A, B, G and H from Figure 2 (corresponding to the Formatting, Logging, Addingand Storage services) are “back-end” services hosted by the orchestration provider, sothat their each execution occupies a (logical) thread. The number of occupied threads inthe resource model is shown as X . The current capacity (available number of threads)is shown as X , and γ is the degree of utilization. The blocking factor β is 1 if somecapacity is free, 0 otherwise.

Example: Data Cleansing ServiceFormat Service

Check Service

Calibration Service

Data Checker Service

Storage Service Add Service

Search Service

HPS Machine Service

hit no hit

multiple hits

no hit

hit

condition Service invoke

Logging Service

Loop

Parallel Execution

precision not ok

Service Service

Fig. 1. Overview of data cleansing process

pin

A

Formatter service

p1

AND-split

p2 p3

C Search Service

p4 E

Check Servicew4mh p5

F

Calibration Service

w5p

w5nhp6w4nhw4h

G Add Service

w5h

p7

AND-join (J)

BLog Service

p8

p9

H

Storage Service

pout

Fig. 2. A PT-net representation of workflowfrom Fig. 1

result is not satisfying (e.g., low precision of the search result), a re-calibration of thesearch service is done manually and the re-calibrated Service Service is invoked again.In parallel, a Logging Service is executed to log service invocations (execution timesand service precision).

Figure 2 shows a representation of the cleansing process workflow from Figure 1 inthe shape of a PT-net. Services from the workflow are marked as transitions (A . . .H),and additional transitions are introduced for the AND-split/join. Places represent thearrival of an input message (pin), end of the process (pout), and activity starting/ending.The places that represent branches to mutually exclusive activities (p4 and p5) haveoutgoing branches annotated with non-negative weight factors that add up to 1; e.g.,the branches w4h, w4nh and w4mh of p4, which correspond to the single hit, no hits andmultiple hits outcomes of the search service (C), respectively.

As indicated by the example, the actual execution time of the service compositiondepends on various factors, like the execution time of the human-provided services.Based on these observations, we can summarize the challenges of the working exampleas follows:

– Unpredictable content. The content of the customer data varies in terms of datastructure, data size, and quality. For instance, there might be missing and/or wrongparts of addresses or wrong names and misplaced column content which can causeextra efforts.

– Unpredictable manual intervention. Depending on the data size and the data qualitythe precision of the search result differs. This requires manual calibration of searchprofiles during the execution of the task.

d

dtA(t) = pin(t)− p1(t) oF (t) = F (t)/TF p1(t) = A(t)/TA

p6(t) = p4(t)w4nh + p5(t)w5nh p2(t) = p1(t)d

dtG (t) = p6(t)− oG (t)

d

dtB(t) = p2(t)− p8(t) oG (t) = G (t)/TG p8(t) = B(t)/TB

p7(t) = p4(t)w4h + oG (t) + p5(t)w5h p3(t) = p1(t) + oF (t)d

dtJp8(t) = p8(t)− p9(t)

d

dtC (t) = p3(t)− p4(t)

d

dtJp7(t) = p7(t)− p9(t) p4(t) = C (t)/TC

d

dtE (t) = p4(t)w4mh − p5(t) p5(t) = E (t)/TE

d

dtF (t) = p5(t)w5p − oF (t)

d

dtH(t) = p9(t)− pout(t) pout(t) = H(t)/TH p9(t) =

{p8(t) Jp8(t) ≤ Jp7(t)

p7(t) Jp7(t) < Jp8(t)

Format Service

Check Service

Calibration Service

Data Checker Service

Storage Service Add Service

Search Service

HPS Machine Service

hit no hit

multiple hits

no hit

hit

condition Service invoke

Logging Service

Loop

Parallel Execution

precision not ok

Service Service

Fig. 1. Overview of data cleansing process

pin

A

Formatter service

p1

AND-split

p2 p3

C Search Service

p4 E

Check Servicew4mh p5

F

Calibration Service

w5p

w5nhp6w4nhw4h

G Add Service

w5h

p7

AND-join (J)

BLog Service

p8

p9

H

Storage Service

pout

Fig. 2. A PT-net representation of workflowfrom Fig. 1

result is not satisfying (e.g., low precision of the search result), a re-calibration of thesearch service is done manually and the re-calibrated Service Service is invoked again.In parallel, a Logging Service is executed to log service invocations (execution timesand service precision).

Figure 2 shows a representation of the cleansing process workflow from Figure 1 inthe shape of a PT-net. Services from the workflow are marked as transitions (A . . .H),and additional transitions are introduced for the AND-split/join. Places represent thearrival of an input message (pin), end of the process (pout), and activity starting/ending.The places that represent branches to mutually exclusive activities (p4 and p5) haveoutgoing branches annotated with non-negative weight factors that add up to 1; e.g.,the branches w4h, w4nh and w4mh of p4, which correspond to the single hit, no hits andmultiple hits outcomes of the search service (C), respectively.

As indicated by the example, the actual execution time of the service compositiondepends on various factors, like the execution time of the human-provided services.Based on these observations, we can summarize the challenges of the working exampleas follows:

– Unpredictable content. The content of the customer data varies in terms of datastructure, data size, and quality. For instance, there might be missing and/or wrongparts of addresses or wrong names and misplaced column content which can causeextra efforts.

– Unpredictable manual intervention. Depending on the data size and the data qualitythe precision of the search result differs. This requires manual calibration of searchprofiles during the execution of the task.

Preliminary Validation of CT Model

CT models can be used to produce distribution of expected runningtimes by feeding it with single unit spike (a Dirac pulse).

Assessed by comparing discrete simulations with predictions by CTmodels in a range of simple workflow patterns.

Data cleansing examplesimulated and compared withCT model prediction.

Reasonable correspondencebetween medians.Calibration of the model usingexperimental data is key.

Service Min. 1st Qu. Median Mean 3rd Qu. Max.Formatter 15.00 55.50 60.00 63.00 73.00 107.00Searcher 4.00 4.00 5.00 4.89 6.00 6.00Checker 0.00 0.00 1.00 0.53 1.00 1.00Calibrator 59.00 74.00 78.00 81.73 85.50 133.00Logger 1.00 1.00 1.00 1.00 1.00 1.00Adder 0.00 0.00 0.50 0.50 1.00 1.00Storage 1.00 1.00 1.00 1.40 2.00 2.00

Table 1. Statistical parameters of service substitutes.

Fig. 13. Distribution of execution times.

The smooth bold curve in Figure 13 shows pout in comparison to the histogramthat shows experimental execution times. The median value of the predicted executiontime was 142.82 and the average 197.55 (seconds). Although slightly more optimisticthan the experimental runs, on the overall the prediction qualitatively fits well with themeasured data, especially considering that the prediction does not take into accountmessage passing latencies, and that the experimental setting has used a mix of differentstatistical distributions for execution times of individual services.

6 Conclusions and Future Work

The approach proposed in this paper can be used for developing dynamic models ofservice composition provision, based on an automatically derived continuous-time or-dinary differential equation model of the target orchestration. The orchestration modelis calibrated using empirical estimates of average activity execution times and branch-ing probabilities, obtained from log analysis, event signaling, or other monitoring toolsat the infrastructure level. Several dynamic models of orchestrations provided togethercan be composed in a modular way.

The resulting dynamic model of composition provision can be used for exploringhow the provision system reacts to different input rates (requests per unit of time), test-ing and choosing different resource management strategies and their parameters, in the

Simulation Model for Several Orchestrations

Several orchestrations can interact through use of common resources.

Plug CT model(s) of orchestration(s) into simulation framework.

R

Incoming Requestsi

E

Rejects

e = R/Te

Te

S

Successful finishespout

F

Failurespfail

Orch. CT ModelResource

Model

Tin

βpin = β · R/Tin

S


F

Failurespfail

Resource model aggregates values from orchestration CT model tomodel provision infrastructure capacity and load.

Blocking factor (β) quenches request inflow when capacitiesexceeded.

Simulation Model (Cont.)

R

Incoming Requestsi

E

Rejects

e = R/Te

Te

S


F

Failurespfail

Orch. CT ModelResource

Model

Tin

βpin = β · R/Tin

S


F

Failurespfail

Basic accounting for rejects (E ), successes (S) and failures (F ).

Provides for separation of concerns – separated development of:Orchestration model (resource independent).Resource model (orchestration independent).

Interactive Simulation

Simulation model fed to (existing) dynamic simulation tool:

Simulating concurrently served orchestrations under different inputregimes.QoS regarding running time and reliability “for free”.

Others (like cost of component services/resources) derivable.

Typical simulations: what-if, goal-seek, sensitivity.Simulation results for compositions/QoS metrics ⇒ potential inputs forBP/KPI simulation in service networks tools (e.g. SNAPT of UoC).Currently working on simulation tool better tailored for services.

Relation to Monitoring & Adaptation

Prediction from the simulation ⇒ input for proactive adaptationRelated to other S-Cube QoS prediction techniques.Simulation gives the expected time frame for failure occurences.

Continuous monitoring: keep model in step with actual behavior.

E.g. using CEP-based ESPER framework for choreographies(USTUTT).Revising model parameters with empirical evidence

Together: dashboards to provide continuous feedback to humancontrollers.

E.g.: provisioning multi-service applications under varying request ratesin the work of Chi-Hung Chi (Tsinghua University)

Conclusions

Dynamic models of service orchestrations with provision resources: ahandy tool to

Test scenarios,Assess expected QoS,Design and validate infrastructure management policies.

Working on integration with tools for automatic derivation of PT-netmodels from executable orchestration specs.

Also aiming at modeling elasticity in cloud infrastructures.

Challenge: cover other, more advanced and data-dependent servicecomposition descriptions e.g.:

Concrete abstract and executable process specifications (e.g., BPEL),Colored Petri-nets,Other formalisms.

Building Dynamic Models of Service CompositionsWith Simulation of Provision Resources

Dragan Ivanovic1, Martin Treiber2,Manuel Carro1, Schahram Dustdar2

1Universidad Politecnica de Madrid

2Technische Universitat Wien

S-Cube Industrial Dissemination Workshop, Thales, Paris, 24/2/2012

upm scube-dynsys-presentation

Technology