
Approaches to Failure and Recovery in Service Composition

by

Petrus Johannes Steyn

Department of Computer Science University of Pretoria Pretoria, South Africa

November 2006

SPE780 Computer Science Honours Project


Table of Contents

1 Introduction
2 Overview of Web Services
    2.1 What is a Web Service?
    2.2 Some of the Problems
    2.3 Some Standards related to Web Services
3 Failure: An Introduction
    3.1 Availability Failures
    3.2 Concurrency Failures
    3.3 Dependency Failures
    3.4 Inconsistency Failures
    3.5 Composition Failures
    3.6 Partial Failures
    3.7 Failures due to Ambiguous Output
    3.8 Other Failures
4 Possible Recovery Methods
    4.1 Transactional Approach
    4.2 Dynamic Web Services
    4.3 Language Constructs
    4.4 Self-Healing Networks
    4.5 Trivial Recovery Methods
5 Failure Detection
    5.1 Defensive Process Design
    5.2 Service Run-Time Monitoring
    5.3 WofBPEL
6 Three Scenarios
    6.1 Foreign Traveller Information
    6.2 General Entertainment Planner
    6.3 Mall Information System
7 Example Scenario: Shopping Domain
    7.1 Program Demo
8 Related Work
9 Conclusion
10 Acknowledgements


Table of Contents for Figures

Figure 1 – Web Services Stack
Figure 2 – Flow Diagram of an availability failure
Figure 3 – Flow Diagram of a Partial Failure
Figure 4 – Flow Diagram of a process showing Ambiguous Output
Figure 5 – Flow Diagram of Transactional-Based Approach
Figure 6 – Catch Branch in OBPMS
Figure 7 – Flow diagram of Pizza Company
Figure 8 – Flow Diagram of a Trivial Recovery Method
Figure 9 – Foreign Traveller Information
Figure 10 – General Entertainment Planner
Figure 11 – Mall Information System
Figure 12 – Flow Diagram showing where Sub-Goals will be Checked
Figure 13 – Screenshot of Program requesting data
Figure 14 – Busy Searching for Shops
Figure 15 – Results Found
Figure 16 – Displaying Results
Figure 17 – Failure with the possibility of Recovery
Figure 18 – Notification of Failure without the possibility of Recovery

Table of Contents for Examples

Code Example 1 – Service not Found Exception from BPEL Console
Code Example 2 – Error Message produced by server when incorrect types are used as input
Code Example 3 – BPEL Code from Example
Code Example 4 – The Corresponding WSDL description
Code Example 5 – Time out Exception from the BPEL Server
Code Example 6 – Time out Exception shown in the BPEL Console
Code Example 7 – Catch Branch In BPEL


Abstract. Web services have become a vital part of our lives. People do not always know that they are there, but we do notice when something goes wrong. Various problems can occur when using Web Services, ranging from trivial ones, such as a broken connection, to more complicated ones, such as composition problems. These problems, or failures, can be addressed by means of different recovery methods. Common recovery methods being researched today include self-healing networks and transaction-based strategies; most current research goes into self-healing networks and the dynamic composition of services. Many detection methods also exist. Two that are used frequently in self-healing networks are Defensive Process Design (DPD) and Service run-time Monitoring (SrtM), both of which are run-time detection strategies. There are also static, or off-line, detection strategies, of which WofBPEL is a good example.

There are different applications for Web Services. Three example applications are discussed briefly in this document. Each of them can be implemented in the real world and can be of value if implemented successfully.

This document proposes a classification for some of the most common failures that can occur when using Web Services. It also proposes some recovery methods that can be used to recover from these common failures. One of these recovery methods is illustrated by means of a real-world example.

Keywords: Web services, composition failures, recovery methods.


1 Introduction

Web services are becoming a big part of everyday life. We use them all the time without even knowing it and, like most things in life, we only notice them when they stop working.

Web services are very dynamic. Popular web sites use Web Services to find and display information from many different domains. The travel domain is one of the domains that rely heavily on Web Services: travel agencies use them to connect to other companies (such as airlines or bus companies) to obtain their schedules and prices.

Web services can work for months at a time and then suddenly go down for various reasons. When they are down, the system somehow has to recover the data the user requested, and there are various ways in which this can be done. In this document, I try to classify some of the most common failures that can occur when using Web Services. I also look at a few recovery methods and briefly discuss three failure detection strategies that are used today. These failure detection strategies fall into two categories: run-time detection strategies and off-line detection strategies. They are discussed in Section 5. Finally, I use a real-world example to illustrate one of the recovery methods discussed.

The remainder of this document is organised as follows. In Section 2, I give a brief overview of and introduction to web services. In Section 3, I give an overview of the different types of failures that can occur. In Section 4, I describe some methods for recovering from different failures. Section 5 briefly covers some detection methods found in the literature. In Section 6, I introduce a few real-world examples and describe some of the errors that can occur when they are used. Section 7 looks in more depth at one of the examples introduced in Section 6. In Section 8, I discuss some related work, and Section 9 concludes this document.


2 Overview of Web Services

As stated in the introduction, web services are all around us. We interact with them all the time without even knowing that they exist. But what is a web service? And why are they so important today?

2.1 What is a Web Service?

A Web Service is an entity on the web that can provide various kinds of information to clients. Examples of the services offered include weather services, exchange rate services, language translation services and geographical information services. These services are accessible from anywhere in the world, and they are always available (at least in theory). Their use is not limited, although some providers charge clients for the services that they provide. Web Services form part of a greater architecture known as Service Oriented Architecture (SOA). According to Wikipedia [2], SOA is a “software architectural concept that defines the use of services to support the requirements of software users”. Web Services are often identified as the default implementation of SOA, but SOA can also be implemented using various other service-based technologies.

As an example, a Web Service can be compared to a company that provides some service to the community, which people from the community can use to their advantage. Let’s say the company is a supermarket. The supermarket supplies the community with the goods that they want at an affordable rate. Unfortunately, competition is never far away. Another supermarket will soon open, offering the same services at a better price. This forces the old supermarket either to lower its prices or to offer newer services to its customers.

This example describes what happens continuously with web services. One web service, web service A, offers exchange rate information. Later a new web service, web service B, offers the same information, but of better quality (more up to date). In response, web service A starts to offer more services, such as additional stock exchange information. This race can continue until one service provider stops its service completely.

Web services, as said, are always available. The only thing a client has to do is to go out and find them. Finding a service that meets a client’s requirements can be complicated, but things become easier if the service makes use of certain methods to advertise itself.

Services make use of a description language to describe what they do and how a client can get access to them. This description is written in the Web Services Description Language (WSDL) and serves as an interface to the service. A WSDL description supplies the client with the operations needed to invoke the service, and might also include a description of the service’s functionality.
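To make the idea of a WSDL description acting as an interface more concrete, the following minimal Python sketch reads the operations that a WSDL 1.1 document advertises. The sample document, its namespace http://example.org/exchange and the names ExchangeRateService and getRate are hypothetical and serve only as an illustration; they are not taken from the services used in this project.

```python
import xml.etree.ElementTree as ET

# Namespace of WSDL 1.1 documents.
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Hypothetical, minimal WSDL 1.1 document used only for illustration.
SAMPLE_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="http://example.org/exchange"
    targetNamespace="http://example.org/exchange">
  <message name="RateRequest"/>
  <message name="RateResponse"/>
  <portType name="ExchangeRateService">
    <operation name="getRate">
      <input message="tns:RateRequest"/>
      <output message="tns:RateResponse"/>
    </operation>
  </portType>
</definitions>"""


def list_operations(wsdl_text):
    """Return {portType name: [operation names]} for a WSDL 1.1 document."""
    root = ET.fromstring(wsdl_text)
    operations = {}
    for port_type in root.findall(f"{{{WSDL_NS}}}portType"):
        names = [op.get("name")
                 for op in port_type.findall(f"{{{WSDL_NS}}}operation")]
        operations[port_type.get("name")] = names
    return operations


print(list_operations(SAMPLE_WSDL))
```

A client could use such a listing to decide whether a service offers the operation it needs before attempting to invoke it.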

Another method of advertising is the use of ontology-annotated signatures. According to Brogi & Popescu [2005], these signatures describe the semantics of a service. The semantic description of a Web Service describes not only what the Web Service is, but also the context in which to use it (Foggon et al. [2004]). These signatures will eventually be used in the WSDL descriptors to describe the service fully and to expose its interface.

Various other methods and languages have been developed around web services; they are discussed later.

2.2 Some of the Problems

What are the problems facing us when we want to use multiple services to gain useful information? Why can we not just use one service for all our needs?

According to Yu & Lin [2005], services can be upgraded or changed dynamically in response to changes or needs in the environment. This can cause problems if the interfaces to these services change as well. Regarding service discovery, Sahin et al. [2005] state that, although many advances have been made, most service discovery techniques have two major problems: (1) there is usually a centralized server that handles all requests, which creates a single point of failure for the whole system, and (2) many servers offer limited search capabilities, which means that you will not always be able to find the best service.

Once you have access to the services and have retrieved the necessary data from them, the system has to compose the data in a meaningful way. This is known as service composition. During this process, the system must be able to distinguish between data that is useful and data that is unwanted. This is not always easy, and it cannot be guaranteed that the data we receive is the correct data. Yu & Lin [2005] take the approach of using Quality of Service measures to ensure that the retrieved data is correct. The problem with this approach is that you have to compare various services with each other in order to establish which service offers the best quality data.

Another way of making sure that you get the correct data is to always use a trusted and reliable service provider. This should ensure that the data is correct and that you receive a quality service. However, things do go wrong. Service providers might change the services they offer, their servers might crash, or they might shut their servers down. In such cases, any use of the provider’s services will result in a failure being reported by the system.

There are various ways in which we can recover from failure. The system can keep a backup of previous searches (in the form of a cache) and use this data. However, this data will not be up to date and might be invalid. The system can also launch a search for a new service provider, or for a web service that claims to offer the same services. This gives the user the most up-to-date and correct data, but the search might take a while to perform. Different recovery methods are discussed in Section 4.
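The two recovery options described above can be combined, as the following minimal Python sketch shows. It is only an illustration under stated assumptions: providers are modelled as plain callables, the names fetch_with_recovery and ServiceUnavailable are hypothetical, and the order chosen here (fresh data from any provider first, cached data last) is one possible policy rather than the method used in this project.

```python
import time


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) provider stub when the service cannot be reached."""


def fetch_with_recovery(query, primary, alternatives, cache,
                        max_age_seconds=3600.0, now=time.time):
    """Try the primary provider, then alternatives, then a cached result.

    Returns (data, source), where source is 'primary', 'alternative' or 'cache'.
    """
    providers = [("primary", primary)]
    providers += [("alternative", p) for p in alternatives]
    for source, provider in providers:
        try:
            data = provider(query)
        except ServiceUnavailable:
            continue  # this provider failed; try the next one
        cache[query] = (data, now())  # remember the fresh result
        return data, source

    # Every provider failed: fall back to cached data if it is not too old.
    if query in cache:
        data, stored_at = cache[query]
        if now() - stored_at <= max_age_seconds:
            return data, "cache"
    raise ServiceUnavailable(f"no provider or usable cache entry for {query!r}")


def broken_provider(query):
    """Stub that simulates a provider whose server is down."""
    raise ServiceUnavailable(query)


def working_provider(query):
    """Stub that simulates a provider returning some data."""
    return query.upper()


cache = {}
print(fetch_with_recovery("zar", broken_provider, [working_provider], cache))
print(fetch_with_recovery("zar", broken_provider, [], cache))
```

The max_age_seconds limit reflects the concern that cached data might be out of date: an entry that is too old is treated as if it were missing.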

2.3 Some Standards related to Web Services

According to Tartanoglu et al. [2006], the overall definition of the Web Services architecture is still incomplete. The base standards for Web Services have, however, already emerged from the W3C. They define a core middleware that is partly built upon results obtained from object-based and component-based middleware.

The main standards for Web Services architecture, as defined by the W3C Web Service Activity and the OASIS Consortium, are:

• SOAP (Simple Object Access Protocol): A lightweight protocol for information exchange. It sets the rules on how to encode data in XML. It also describes invocation semantics and mappings to other Internet transfer protocols.

• WSDL (Web Services Description Language): An XML-based language that specifies a service’s interface (the type of messages that the service can understand) and the binding information (the protocol-dependent details).

• UDDI (Universal Description, Discovery and Integration): A registry for dynamically locating Web Services. It can also be used to advertise Web Services.
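As an illustration of how SOAP encodes data in XML, the following minimal Python sketch builds a SOAP 1.1 request envelope with the standard library. The operation name getRate, the service namespace http://example.org/exchange and the parameters are hypothetical; only the envelope namespace is taken from the SOAP 1.1 specification.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_request(operation, service_ns, params):
    """Build a SOAP 1.1 envelope whose body holds one operation element."""
    ET.register_namespace("soapenv", SOAP_ENV_NS)
    envelope = ET.Element(f"{{{SOAP_ENV_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV_NS}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        child = ET.SubElement(call, f"{{{service_ns}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical exchange rate request, for illustration only.
print(build_soap_request("getRate", "http://example.org/exchange",
                         {"from": "ZAR", "to": "USD"}))
```

In practice such a message would be sent over a transport such as HTTP, with the endpoint and message formats taken from the service’s WSDL description.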

Figure 1 shows how these standards fit together in the technology stack. The figure is adapted from figures found in Mikalsen et al. [2002], Tartanoglu et al. [2006] and van der Aalst [2003].

Along with these standards, there are a number of languages that are part of Web Services. The de facto standards for Web Services are BPEL and WSDL. Where WSDL describes the service’s interface, BPEL describes the service’s workflow, that is, the interactions that can be performed on the services (interactions such as invoke, reply and receive). These two languages have been used very successfully up until now. Both have their roots in XML, and both make use of several W3C-approved standards.

Figure 1 – Web Services Stack

Van der Aalst [2003] took a pessimistic look at some of the standards that have been developed in and around Web Services and their workflow languages in his contribution “Don’t go with the flow: Web services composition standards exposed”. According to him, the support claimed by some of these languages is unfounded. He is also of the opinion that there are too many so-called ‘standards’. The languages that he inspected include BPEL, Microsoft’s XLANG, IBM’s WSFL and the Workflow Management Coalition’s XPDL. According to his research, BPEL was one of the most comprehensive languages, albeit the most complex one. His research was done in 2003, however, and since then BPEL has become a default standard for describing the workflow of Web Services.

There are also a few other languages (discussed later in this document), but all the examples and code in this document are in BPEL and WSDL.

3 Failure: An Introduction

Different types of failures can occur during the use of web services. These failures can be caused by something as simple as a broken connection or a busy server, but they might also manifest during the composition of services. Most of these failures can be solved with little effort, but sometimes the problem lies much deeper.

Some trivial errors that can happen are broken connections and server downtime. These are caused by external factors most of the time, since the fault can lie on the server side (and in the case of server downtime, the fault definitely lies with the server). Other types of failures range from concurrency problems to dependency problems and availability problems. Most types of failures can be classified as follows:

• Failures caused by availability.

• Failures caused by concurrency.

• Failures caused by dependency.

• Failures caused by inconsistency.

• Failures caused by incorrect composition.

• Partial failures caused by incorrect parallel execution.

• Failures due to ambiguous output.

This classification is not exhaustive, but it covers the most common failures. Even though most tools will not deploy services that contain some of these failures, the failures can still find their way in if services are deployed manually. In the following sections, I describe each class and provide some examples of how these failures can occur.


Where possible, I used Oracle JDeveloper 10g [1] and Oracle BPEL Process Manager Server [1] to simulate the errors in the examples. All the examples were coded in WSDL and BPEL, mainly because of the development environment, but also because the two languages work well together and are popular. Other languages do exist, and their pros and cons are discussed later in this paper. All examples make use of “dummy services” that take only simple inputs and give back simple replies.

3.1 Availability Failures

Failures in this class can almost always be traced back to the server or to the connection to the server. They can present themselves in the form of a ‘time out’ or a ‘service not found’ error.

Figure 2 – Flow Diagram of an availability failure. (Callout in the figure: in OBPMS, this indicates that an error occurred during the execution of the process in question.)


During a Time Out, the client usually stops requesting the service after a certain amount of time because the server is not responding to its requests. This can be caused by a busy server, a broken connection or a lost message. Either way, the service cannot be accessed at that time.

A Service Not Found error can be attributed to a faulty server, a deleted service or a broken connection. In these cases, the client assumes that the service has been deleted because it cannot find the service or the server that hosts it. A Service Not Found error can also be caused by the same conditions that cause a Time Out error.
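The distinction between the two availability errors can be sketched from the client’s side as follows. This is a minimal Python illustration, not the behaviour of OBPMS: the service call is modelled as a plain callable, and ServiceNotFound and invoke_with_timeout are hypothetical names introduced for this sketch.

```python
import concurrent.futures
import time


class ServiceNotFound(Exception):
    """Hypothetical error raised by a client stub when no service answers at the address."""


def invoke_with_timeout(call, timeout_seconds):
    """Run call() and report ('ok', result), ('time out', None) or ('service not found', None)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call)
    try:
        return "ok", future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The client gives up waiting; the server may be busy or the message lost.
        return "time out", None
    except ServiceNotFound:
        return "service not found", None
    finally:
        executor.shutdown(wait=False)  # do not block on a call that is still running


def busy_service():
    """Stub that simulates a server too busy to answer in time."""
    time.sleep(0.5)
    return "late reply"


def deleted_service():
    """Stub that simulates a service that no longer exists."""
    raise ServiceNotFound("service has been removed")


print(invoke_with_timeout(busy_service, 0.05))
print(invoke_with_timeout(deleted_service, 1.0))
```

Note that the client cannot tell from a time out alone whether the service still exists, which is why the same underlying conditions can surface as either error.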

In this example, I invoke a service that no longer exists. The system responds with a Remote Fault (essentially a Service Not Found error) and returns this error to the client. The output below was observed in Oracle’s BPEL Process Manager Server [1] (OBPMS).

In OBPMS, the user has the option of viewing either the flow diagram of the service or the code of the service. Figure 2 is the flow diagram produced by OBPMS for this example.

The code view of the same example shows the following error, as reported by OBPMS.

<process>
  <sequence>
    receiveInput
      [2006/10/16 10:03:42] Received "clientInput" call from partner "client"
    <scope name="shopScope">
      <sequence>
        shopInputAssign
          [2006/10/16 10:03:42] Updated variable "shopInput"
          <shopInput>
            <part xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="payload">
              <shopdef xmlns="http://services.otn.com">CNA,PRETORIA</shopdef>
            </part>
          </shopInput>
        searchShop (faulted)
          [2006/10/16 10:03:42] "remoteFault" has been thrown.
          <remoteFault xmlns="http://schemas.oracle.com/bpel/extension">
            <part name="code">
              <code>WSDLReadingError</code>
            </part>
            <part name="summary">
              <summary>Failed to read wsdl. Failed to read wsdl at
                "http://localhost:9700/orabpel/default/ShopServiceV2/ShopServiceV2?wsdl",
                because "WSDLException: faultCode=INVALID_WSDL: The document:
                http://localhost:9700/orabpel/default/ShopServiceV2/ShopServiceV2?wsdl
                is not a wsdl file or does not have a root element of "definitions" in the
                "http://schemas.xmlsoap.org/wsdl/" namespace or the
                "http://www.w3.org/2004/08/wsdl" namespace.". Make sure wsdl is valid.
                You may need to start the OraBPEL server, or make sure the related bpel
                process is deployed correctly.</summary>
            </part>
          </remoteFault>
      </sequence>
    </scope>
  </sequence>
  [2006/10/16 10:03:42] "BPELFault" has not been caught by a catch block.
  [2006/10/16 10:03:42] BPEL process instance "1105" cancelled
</process>

Code Example 1 – Service not Found Exception from BPEL Console

3.2 Concurrency Failures

With concurrency failures come all the usual concurrency problems that exist in normal computer systems and networks. A service can be used by more than one client at any time, and this can cause concurrency problems if the service is being updated by a client or by the server. Other clients need to be informed about the update; otherwise, clients using the service will receive inconsistent or corrupt data from it. These types of failures are difficult to detect unless the client is actually aware of the updates. Clients using an outdated service will not know the difference as long as the data they receive looks correct to them. These types of failures are not common, but they can cause big problems if not caught in time.

In another scenario, a web service can be used as a resource that first needs to be acquired. This will not happen often, since it would seldom make sense to create such a service. In addition, Tanenbaum et al. [2002] state that trying to lock distributed resources is difficult and can lead to deadlock if not approached correctly.

3.3 Dependency Failures

Services are not limited to supplying us with information. They can also make use of other, external services to gather the required information before passing it on to the client. Many problems can occur when using such a technique. Messages can get lost between services, which can cause a service to deduce that the called service is no longer available. A service can also pass incompatible types to the services it calls (for example, sending string values when integer values are expected). This can cause the receiving service to misinterpret the incoming message and to produce incorrect results.

This type of error can be avoided by using a development environment, but, as stated in the previous sections, service providers can and will update their services periodically. These updates might include changes to the types of the expected input data. Unless these changes are communicated to the clients, or to the services using the updated service, failures will occur.

In the following example, I tried to invoke a service with an input of an incorrect type. The server was the only component that responded to this incorrect input; the service itself did not respond because the server refused to invoke it.

Message handle error.
An exception occurred while attempting to process the message "com.collaxa.cube.engine.dispatch.message.invoke.InvokeInstanceMessage"; the exception is: XPath expression failed to execute.
Error while processing xpath expression, the expression is "((bpws:getVariableData("inputVariable", "payload", "/client:DummyService_3ProcessRequest/client:input") mod 2.0) = 1.0)", the reason is NaN is not an integer.
Please verify the xpath query.

Code Example 2 – Error Message produced by server when incorrect types are used as input

If the service had been set up correctly, in other words if catch branches and exception handlers had been included, the service would have been invoked. When working in a synchronous environment without the necessary exception handlers, the service would eventually throw a time out exception to inform the client that something went wrong. In an asynchronous environment, we would have to include catch branches to catch the exception.
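The type mismatch behind Code Example 2 can be guarded against before the service logic runs. The following minimal Python sketch assumes, based only on the XPath expression in the error message, that the dummy service tests whether its input is odd; the names InvalidInput, parse_integer_input and is_odd_service are hypothetical.

```python
class InvalidInput(Exception):
    """Raised when the payload cannot be read as an integer."""


def parse_integer_input(raw):
    """Return raw as an int, or raise InvalidInput with an explanatory message."""
    try:
        return int(raw.strip())
    except (ValueError, AttributeError) as exc:
        raise InvalidInput(f"expected an integer, got {raw!r}") from exc


def is_odd_service(raw):
    """Hypothetical stand-in for the dummy service: reports whether the input is odd."""
    value = parse_integer_input(raw)  # validate before doing any arithmetic
    return value % 2 == 1


print(is_odd_service("7"))
try:
    is_odd_service("seven")
except InvalidInput as error:
    # The caller receives a specific, catchable fault instead of a time out.
    print("fault:", error)
```

Raising a dedicated fault plays the role that a catch branch plays in BPEL: the caller is told what went wrong instead of having to wait for a time out.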

3.4 Inconsistency Failures

Every now and again, a service provider might decide to change the descriptors of some of its services. These changes can affect access to the services in either a positive or a negative way. On the positive side, the new descriptors might enhance the use of a service. On the negative side, the new descriptors might cause a service to become unavailable.

Changes to a service’s descriptor file can cause one of two major problems. First, if the descriptor is changed at run time, a client that is already using the service might get unexpected results because of the new descriptor file; the change can also cause the service to behave differently from what is intended. Second, changes to the descriptor can break a service completely. This can happen if the descriptor file and the corresponding BPEL file are inconsistent, e.g. the BPEL file uses a variable that is no longer defined in the descriptor file.

The latter error should not happen often, since many software development environments provide checks and tools to prevent it. However, if a service provider chooses to do things manually, these errors can (and most probably will) occur.

In the following code example, the descriptor has been changed, but the workflow file has been kept the same. This is expected to result in a failure. Most development environments will not allow the creation of such erroneous services.

The code segments that cause the inconsistency are the declarations of the input and fault variables in the BPEL code and the corresponding message definitions in the WSDL description. The BPEL process still assumes that the input and fault variables can be accessed through MapServiceRequestMessage and MapServiceFaultMessage respectively, whilst in the description file their names have changed to MapServiceInvokedMessage and MapServiceErrorMessage. The outcome of such an error cannot be tested in the environment setup that I chose to work in, so the resulting behaviour is unknown.


<partnerLinks>
  <partnerLink name="client" partnerLinkType="tns:MapService" myRole="MapServiceProvider"/>
</partnerLinks>
<variables>
  <!-- inconsistent: the WSDL description names this message MapServiceInvokedMessage -->
  <variable name="input" messageType="tns:MapServiceRequestMessage"/>
  <variable name="output" messageType="tns:MapServiceResponseMessage"/>
  <!-- inconsistent: the WSDL description names this message MapServiceErrorMessage -->
  <variable name="fault" messageType="tns:MapServiceFaultMessage"/>
</variables>

Code Example 3 – BPEL Code from Example

<types>
  <schema attributeFormDefault="qualified" elementFormDefault="qualified"
          targetNamespace="http://services.otn.com"
          xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="request" type="string"/>
    <element name="response" type="string"/>
    <element name="error" type="string"/>
  </schema>
</types>
<message name="MapServiceInvokedMessage">
  <part name="payload" element="tns:request"/>
</message>
<message name="MapServiceResponseMessage">
  <part name="payload" element="tns:response"/>
</message>
<message name="MapServiceErrorMessage">
  <part name="payload" element="tns:error"/>
</message>
<portType name="MapService">
  <operation name="process">
    <input message="tns:MapServiceInvokedMessage"/>
    <output message="tns:MapServiceResponseMessage"/>
    <fault name="MapNotFound" message="tns:MapServiceErrorMessage"/>
  </operation>
</portType>

Code Example 4 – The Corresponding WSDL description
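A consistency check of this kind can be sketched in a few lines. The Python sketch below is only an illustration, not the check that any particular development environment performs. It uses trimmed copies of the two fragments above; the wrapper elements `process` and `definitions` are added here so that each fragment parses on its own, and the function name `undefined_message_types` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the BPEL and WSDL fragments. The outer wrapper
# elements are assumptions added so each fragment is a valid document.
BPEL = """<process>
  <variables>
    <variable name="input" messageType="tns:MapServiceRequestMessage"/>
    <variable name="output" messageType="tns:MapServiceResponseMessage"/>
    <variable name="fault" messageType="tns:MapServiceFaultMessage"/>
  </variables>
</process>"""

WSDL = """<definitions>
  <message name="MapServiceInvokedMessage"/>
  <message name="MapServiceResponseMessage"/>
  <message name="MapServiceErrorMessage"/>
</definitions>"""


def undefined_message_types(bpel_xml, wsdl_xml):
    """Return {variable name: message type} for types missing from the WSDL."""
    defined = {m.get("name") for m in ET.fromstring(wsdl_xml).iter("message")}
    missing = {}
    for var in ET.fromstring(bpel_xml).iter("variable"):
        # Drop the namespace prefix ("tns:") before comparing names.
        local_name = var.get("messageType").split(":")[-1]
        if local_name not in defined:
            missing[var.get("name")] = local_name
    return missing


print(undefined_message_types(BPEL, WSDL))
```

For the two fragments shown, the sketch reports the `input` and `fault` variables, which are exactly the two renamed messages.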


3.5 Composition Failures

Failures can also happen during the composition phase. During

composition, different services offering different information are made to work

together. During composition, you need to be able to

roll back from an error (i.e. recover to a point before the request

started), and sometimes these rollbacks are either incorrect or incomplete. In

Section 4.1, I discuss a Transaction Based approach to recovery from these

types of errors.

Services can also be composed incorrectly (they are forced to work

together, but they cannot) and this can also cause a huge problem from a

client’s perspective. These types of errors will not happen often, but it can

happen that an incorrect service gets used due to its incorrect description (in

Section 4.2 I discuss this problem again).

3.6 Partial Failures

Partial failures are closely linked to composition failures, since composition failures can

cause partial failures. A partial failure implies that during a parallel execution

of services, one of the branches cannot find the needed or requested

services. This is not a major problem since parallel execution usually implies

that you only need the output from one branch, but you are working with

incomplete data. From a client’s perspective, it does not matter, since he

would not know the difference (unless all the branches fail), but the goal of a

service is to give the most accurate data to the client invoking it.

As said in the beginning, partial failures are closely linked to composition

failures. Sometimes composition failures can also go unnoticed by the client.

Although these failures will not be noticed, it does not mean they will not have

an effect. As said above, the goal of a service is to give the most accurate

data to the client invoking it. If a service cannot supply that, then the service

will not be good enough to use.

Another form of a partial failure would be if we need the result from all the

branches of the parallel execution. In some cases we might need the results

from all the branches to continue with the execution. If one of the branches

fails the system will still continue to completion, but with incomplete data. This

will cause the returned results to be incorrect or even corrupt. We can force

the execution to stop if we do not have all the necessary information to

continue, but this will be unacceptable to a client using the service.
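The two forms of partial failure can be contrasted with a small sketch. The Python code below is an illustration under stated assumptions, not the behaviour of OBPMS: the branch functions `hotels` and `flights` are hypothetical stand-ins for remote services, and the `require_all` flag decides whether a single failed branch aborts the whole execution or is merely reported.

```python
from concurrent.futures import ThreadPoolExecutor


def run_branches(branches, require_all):
    """Run every branch in parallel and report which ones failed."""
    results, failed = {}, []
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in branches.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                failed.append(name)
    # Abort if all branches are needed, or if no branch produced anything.
    if failed and (require_all or not results):
        raise RuntimeError(f"branches failed: {sorted(failed)}")
    return results, failed


def hotels():  # hypothetical branch that succeeds
    return ["Hotel A", "Hotel B"]


def flights():  # hypothetical branch whose service cannot be found
    raise ConnectionError("flight service not found")


results, failed = run_branches({"hotels": hotels, "flights": flights},
                               require_all=False)
print(results, failed)
```

Returning the list of failed branches alongside the results lets the caller see that the data is incomplete, instead of the failure going unnoticed.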


The following example will clarify this problem. As a client we only have

access to one service or access point to the composite service. The service

we are using is calling other services (in parallel) to gather the needed

information. The following was observed when one of the required services

was not found. Once again I show the resulting flow diagram (shown in Figure

3) and the code (shown in Code Example 5) from the OBPMS.

Figure 3 – Flow Diagram of a Partial Failure

The response from the server and the corresponding code fragments

obtained from OBPMS are given in Code Examples 5 and 6.

In OBPMS, this indicates a time out error. This will only happen when using synchronous services.


com.oracle.bpel.client.delivery.ReceiveTimeOutException: Waiting for response has timed out. The conversation id is 455aa7269f0030c5:149d886:10e5fc38efa:-7ffc. Please check the process instance for detail.

Code Example 5 – Time out Exception from the BPEL Server

<sequence>
  Assign_2
    [2006/10/19 10:52:47] Updated variable "invokeDummy_initiate_InputVariable"
      <invokeDummy_initiate_InputVariable>
        <part xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="payload">
          <DummyService_2ProcessRequest xmlns="http://xmlns.oracle.com/DummyService_2">
            <input>HELO</input>
          </DummyService_2ProcessRequest>
        </part>
      </invokeDummy_initiate_InputVariable>
  invokeDymmy
    [2006/10/19 10:52:48] Invoked 1-way operation "initiate" on partner "Dummy2".
      <invokeDummy_initiate_InputVariable>
        <part xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" name="payload">
          <DummyService_2ProcessRequest xmlns="http://xmlns.oracle.com/DummyService_2">
            <input>HELO</input>
          </DummyService_2ProcessRequest>
        </part>
      </invokeDummy_initiate_InputVariable>
  receiveDummy - pending
    [2006/10/19 10:52:49] Waiting for "onResult" from "Dummy2". Asynchronous callback.

Code Example 6 – Time out Exception shown in the BPEL Console

This example required the output from both branches in order to complete

the execution of the process. In this example I used a synchronous service

instead of an asynchronous service. A Time-out error will only occur when

using synchronous services. An asynchronous service will sit idle and wait

indefinitely for a result without giving us a time out exception. If we include

catch blocks in the service, we can avoid these errors. These methods will be

discussed in Section 4.


3.7 Failures due to Ambiguous Output

In very few cases, services can be composed in such a way so as to

provide a user with more than one response to only one request. This is

undesirable since we only want one unique response from a service, given a

specific input. Even though some tools will not allow this type of service to be

deployed, they can still exist if they are created without the help of a tool.

Figure 4 – Flow Diagram of a process showing Ambiguous Output


As said above, some tools will not allow these types of services to be

deployed, and that is also the case with JDeveloper 10g [1]. These services can

be created, but they are usually riddled with errors. In the diagram

(Figure 4), I try to show how this might look.

This example takes a string as input, and delivers two outputs; the string

all upper-case, and the string all lower-case. This service was deployed onto

the server, but failed to run to completion. In a real world scenario, this

service would be able to run, but the output would be determined by the

speed with which each branch executes. The slowest branch’s output would

be the output that would be displayed, unless two output variables are

defined (which is very difficult to do).
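The "slowest branch wins" behaviour can be modelled without real concurrency. The Python sketch below is a deterministic simulation under an assumption of mine: each branch is given as a hypothetical (duration, transform) pair, and both branches write to the same output variable in the order in which they finish.

```python
def final_output(text, branches):
    """Apply each branch's write in completion order; the last write wins."""
    output = None
    # Sort by duration so the fastest branch writes first.
    for _, transform in sorted(branches, key=lambda branch: branch[0]):
        output = transform(text)  # overwrites the previous branch's result
    return output


# The upper-case branch is slower here, so its result is what remains.
print(final_output("Hello", [(0.2, str.upper), (0.1, str.lower)]))
```

Swapping the two durations changes the result, which is precisely why the output of such a service is ambiguous.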

In addition, according to Ouyang et al. [2005], a BPEL process must not

use two or more receive actions on the same partner link, port type,

operations or correlation sets. This means that we cannot have two or more

input or output ports that are using the same variable. This statement is also

defined in the BPEL specification. However, this type of error sometimes does

still occur in real world services.

3.8 Other Failures

There are other failures that can occur when using Web Services that do

not fall into any of the categories mentioned above. Quality of Service (QoS)

problems and Service Level Agreement problems would be some of the most

common ones that cannot be classified. However, these two types of failures

can be traced back to any of the above mentioned failures.

A problem with the Quality of Service would result in a service just being

slow to react, or giving back results that are correct, but not up to standard. The

only way that this problem can be fixed would be to rebind to a new service.

In Yu & Lin [2005], the authors describe how to rebind to a new service that

will deliver a better Quality of Service.

A Service Level Agreement error will result in the use of an incorrect

service. As described briefly in Section 2.1, services need to advertise

themselves. If these descriptions of their services are incorrect, we might end

up making use of a service that is delivering faulty and incorrect data to us.

With a Service Level Agreement, we enter into a contract that promises us

the correct data, according to the description of the service. If this description


was incorrect to begin with, the agreement is void, and we end up with a

binding to an incorrect service. Once again, the only way to fix this would be

to rebind to another service, but we might end up rebinding to another faulty

service.

There are ways to ensure that the services that we rebind to are correct.

These are discussed in the next section.

4 Possible Recovery Methods

Due to failures that can, and will, occur when using web services, various

methods have been researched to be able to recover from these failures.

Some of these methods include transactional methods, Self-healing networks

and using QoS constraints as a heuristic in dynamic composition of services.

Tanenbaum & van Steen [2002] and Tartanoglu et al. [2006] classify

recovery methods into two categories: backward error recovery and forward

error recovery. Backward error recovery involves rolling back to a safe state,

and retrying the operation. This approach is followed in transactions. Forward

error recovery tries to recover from an erroneous state by transforming it into

a safe state. This approach is followed by Self-healing networks.

In this section I give an overview of some of the proposed methods of

error recovery when using web services. I do not do a classification of

recovery methods however. I will also take a look at some trivial methods (like

caching).

4.1 Transactional Approach

According to Tanenbaum & van Steen [2002], a transaction is an

operation that has an all-or-nothing property. More generally, transactions

are characterised by the four ACID properties: operations that exhibit them are said to

be Atomic, Consistent, Isolated and Durable.

• Atomic : The transaction appears to be indivisible to the outside world.

• Consistent : The transaction will not violate any invariant rules of the

system.

• Isolated : Concurrent transactions appear to happen sequentially

(in other words, they do not interfere with each other).

• Durable : Committed transactions cannot be undone, even if the

system crashes after a transaction has been committed.


They also classify three types of transactions: flat transactions, nested

transactions and distributed transactions. A flat transaction is a normal

transaction that will only commit after the main goal has been reached. This

type of transaction is what is normally referred to when speaking of

transactions in general. Nested and distributed transactions are discussed

later in this section and usually apply to systems spread across a network.

During a transaction, an operation can only be started once all the

resources required for the operation have been acquired. Once these

resources have been acquired, which usually implies that they have been

locked by the acquiring process, the transaction will run to completion before

releasing the resources. It will also only make changes to the acquired

resources permanent once the transaction has completed successfully.

According to Mikalsen et al. [2002], a Transactional Approach can be used

successfully to recover from failures that can occur. A lot of architectures

already support this model since it is quite easy to understand and to use.

The basic idea behind a transactional approach is this: Only commit when

every sub goal has completed successfully. Using a common example of the

travel domain, this is how a transactional approach would work:

A client sends a request to a service, requesting different booking

details from a travel agent. The system then goes and finds the relevant

details for each sub goal (booking a flight, booking a hotel etc.). As soon

as one sub goal cannot be completed, for example a flight cannot be

booked, the system will stop, and roll-back to before the request was

made. This roll-back action will undo all actions performed and thus

reset the state to just before the request. The client can then restart the

request with different parameters.
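The all-or-nothing behaviour of the travel example can be sketched as follows. This Python code is a minimal illustration, not a real transaction manager: each sub goal is a hypothetical (do, undo) pair, and the undo actions stand in for the roll-back that a transactional framework would perform.

```python
class BookingError(Exception):
    """Raised when a sub goal (e.g. booking a flight) cannot be completed."""


def run_transaction(steps):
    """Run (do, undo) pairs; undo everything done so far if one step fails."""
    undo_stack = []
    try:
        for do, undo in steps:
            do()
            undo_stack.append(undo)
    except BookingError:
        # Roll back in reverse order to the state before the request.
        for undo in reversed(undo_stack):
            undo()
        return False
    return True


booked = set()  # stands in for the state held by the booking services


def book(item):
    def do():
        if item == "flight":  # assumption: the flight cannot be booked
            raise BookingError("no flight available")
        booked.add(item)
    return do, lambda: booked.discard(item)


ok = run_transaction([book("hotel"), book("car"), book("flight")])
print(ok, booked)
```

Because the flight booking fails, the hotel and car bookings are undone as well, and the client is back where he started.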

Sometimes this complete roll-back is undesirable. If the flight cannot be

booked due to an unreachable server, we will not be able to complete the

transaction without changing the service we are using. A less strict way of

doing things, would be to commit after certain sub goals have been reached.

This will give the client more power to choose when he wants to commit. In

the example above, we can set the system up in such a way that the system

can commit after each sub goal is reached. If one of the sub goals should fail,

we can still do a partial commit, and complete the transaction in some other

way (to fill in the missing information).


Using the same example, if the system is unable to book a flight, the

client can still commit to the hotel bookings and the car rental bookings.

The client can then choose to do the flight booking manually, or let the

system search for other flights that will also reach his final destination

on time. In this way, the system will only roll-back to the start of the

failed sub goal, instead of rolling back completely.
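The partial commit described here can be sketched in the same spirit. In the Python code below, which is an illustration only, every sub goal commits on its own and a failure affects nothing but that sub goal; the sub goal names and the `no_flight` function are hypothetical.

```python
def run_nested(subgoals):
    """Commit each sub goal on its own; return (committed, failed) names."""
    committed, failed = [], []
    for name, action in subgoals:
        try:
            action()
            committed.append(name)  # this sub goal's result is kept
        except Exception:
            failed.append(name)     # only this sub goal is rolled back
    return committed, failed


def no_flight():  # hypothetical sub goal that cannot be completed
    raise RuntimeError("flight server unreachable")


committed, failed = run_nested([
    ("hotel", lambda: None),
    ("car", lambda: None),
    ("flight", no_flight),
])
print(committed, failed)
```

The client keeps the hotel and car bookings and only has to deal with the flight, either manually or by letting the system search again.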

Such an approach is called a nested transaction (Tanenbaum & van Steen

[2002]). Another approach would be to make use of a distributed transaction,

but this would not be satisfactory.

In a distributed transaction, the transaction is approached as a normal

transaction, with the difference that the resources are spread across a

network. We still lock resources and perform the transaction as if it was a

normal flat transaction on a non-distributed platform, but this can cause

problems and can be difficult to manage for the two reasons mentioned

below.

According to Tartanoglu et al. [2006] a transaction-based approach is not

suited for the composition of Web Services for mainly two reasons.

• Transaction management becomes more difficult over a distributed

system. The main problem is that it requires cooperation among the

transactional supports of Web Services, which may not be compatible

with each other, or may not be willing to cooperate.

• A transaction-based approach usually involves locking resources until

you are done with them. In a Web Services environment, this is not

really feasible.

Overall though, this type of error recovery is a good method to use. It has

been proven to work in many different domains already, and a transactional-

based framework already exists for Web Services.

Using a simple flow diagram, the basics of using a Transaction-Based

approach is illustrated below.


Figure 5 – Flow Diagram of Transactional-Based Approach

4.2 Dynamic Web Services

There are two ways in which services can be composed: static and

dynamic. Static composition is the easiest, and also the most stable method

to compose web services. Services are bound to each other during the

compiling of the service, and the bindings stay the same until the need arises

for them to change.

However, we live in a dynamic world. Services do change periodically to

reflect new information or data. This will cause services that are statically

bound to become useless, unless the new updated service’s interface is still

the same.

To overcome this problem, we can make use of dynamic composition.

This type of composition occurs during run time. Services are bound to other

services on the fly (based on their WSDL descriptions and ontological

annotations). Dependency and composition failures can easily be solved by

this method.

There is a slight problem using this method though. If a service is

advertising itself as something that it is not, using this method might result in

the use of incorrect services. As an example, if we are looking for services

that provide translation services, and we end up using a service that

advertised itself as a translation service, but is actually an exchange rate

service, our resulting feedback will be totally incorrect. When using dynamic

composition, we cannot pick up on such a problem until it is too late. We can


make use of other recovery methods, in conjunction with dynamic

composition, to solve these problems more effectively.
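Dynamic binding can be pictured as a lookup that happens at run time. The Python sketch below is a simplification under several assumptions: the registry, its entries and the `latency_ms` field are all hypothetical, and a real system would match on WSDL descriptions and ontological annotations and not on a single category string.

```python
# Hypothetical advertised descriptions, as a registry might return them.
REGISTRY = [
    {"name": "SlowTranslate", "category": "translation", "latency_ms": 900},
    {"name": "FastTranslate", "category": "translation", "latency_ms": 120},
    {"name": "RatesNow", "category": "exchange-rate", "latency_ms": 80},
]


def bind(category, registry=REGISTRY, exclude=()):
    """Pick the fastest advertised service in a category at run time."""
    candidates = [s for s in registry
                  if s["category"] == category and s["name"] not in exclude]
    if not candidates:
        raise LookupError(f"no service advertises category {category!r}")
    return min(candidates, key=lambda s: s["latency_ms"])


print(bind("translation")["name"])
# Rebinding after a failure: exclude the service that just failed.
print(bind("translation", exclude={"FastTranslate"})["name"])
```

Note that `bind` trusts the advertised category completely. If an exchange rate service advertised itself under "translation", this lookup would select it without complaint, which is the weakness described above.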

4.3 Language Constructs

BPEL and WSDL provide us with some error support. We can include

catch branches and we can catch exceptions as they occur. These constructs

can only catch the exceptions that are defined though, but they are still useful.

We can force a process to complete (even with incomplete data) by using

these catch constructs. As an example, if we are expecting an integer value,

and the service gets a string value, we can use the catch block to substitute

the incoming value with a default value. We can only do so if the value is not

important or needed for the completion of the process, but in most cases,

such a solution just will not do.
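The idea of substituting a default value is not specific to BPEL. As a language-neutral illustration, the Python sketch below does the same with an exception handler; the function name `parse_quantity` and the default of 1 are assumptions made for the example.

```python
def parse_quantity(raw, default=1):
    """Return raw as an integer, or a default if it is not a valid integer."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # The "catch branch": continue with a default instead of failing.
        return default


print(parse_quantity("3"))
print(parse_quantity("three"))
```

The process completes in both cases, but in the second case it completes with a value that the client never supplied, which is why this is only acceptable for unimportant values.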

Figure 6 – Catch Branch in OBPMS



We can however use a catch branch to safely recover from an erroneous

state. Instead of just throwing an exception, we can use the catch branch to

catch the exception, and return a user friendly message to inform the client

that something went wrong. The example in Figure 6 makes use of a catch

branch.

In the code below, you can see where the catch branch is inserted (the

faultHandlers section). If an exception is raised, or the input is incorrect, the

catch branch is invoked, and a default assignment is made. In this example, a

default error message is copied to the output variable.

<scope name="shopScope">
  <faultHandlers>
    <catch faultName="ns1:ShopNotFound">
      <sequence name="Sequence_3">
        <assign name="assignErrorMsg">
          <copy>
            <from expression="'Shop Not Found'"/>
            <to variable="clientOutput" part="payload"
                query="/client:ShopFinderProcessResponse/client:result"/>
          </copy>
        </assign>
      </sequence>
    </catch>
  </faultHandlers>
  <sequence name="Sequence_1">
    <assign name="shopInputAssign">
      <copy>
        <from variable="clientInput" part="payload"
              query="/client:ShopFinderProcessRequest/client:input"/>
        <to variable="shopInput" part="payload" query="/ns1:shopdef"/>
      </copy>
    </assign>
    <invoke name="searchShop" partnerLink="ShopSearch" portType="ns1:ShopServiceV2"
            operation="process" inputVariable="shopInput" outputVariable="shopOutput"/>
  </sequence>
</scope>

Code Example 7 – Catch Branch In BPEL

4.4 Self-healing Networks

This is where most of the research effort has gone so far. Many researchers

try to come up with new ways in which a network can heal itself without user

intervention. Yu & Lin [2005] use some form of a self-healing network in their

paper. They combine it with QoS constraints as a heuristic. Baresi et al.

[2006] also propose to make use of self-healing networks. But what is a

self-healing network?


Self-healing networks are networks that are capable of recovering from

errors by themselves. In a Web Services context, they are networks that can

recover from composition faults by themselves. This is done by making use of

some external heuristic that monitors the network’s behaviour.

Different types of self-healing strategies have already been proposed.

There are strategies that make use of QoS constraints as a way of ensuring

stability when composing Web Services (Yu & Lin [2005]). Baresi et al. [2006]

propose a strategy that is based on design by contract (a construct borrowed

from the Eiffel language). In their strategy, you can set pre- and post-

conditions that have to be met (similar to QoS constraints), but they also

weave in monitoring code that monitors the workflow and checks the pre- and

post-conditions of the services invoked.

Since Web Services live in a very dynamic environment, Self-healing

networks might be the way to go in the future. In Baresi et al. [2006], they use

the example of a Pizza Company to explain their concepts. The flow diagram

in Figure 7 is taken from their paper.

In the example, a client will use a web site or WAP enabled phone to contact

the pizza company. The client then gets authenticated after which his profile

is loaded. This profile holds information regarding the client’s favourite pizzas.

The Pizza Catalogue Service then offers the client a choice of four

different pizzas. Once the client has made his choice, his credit card details are

validated by the Credit Card Validation Web Service. If everything

goes according to plan, the client’s account is debited and the pizza

company’s account is credited. At the same time, the order will appear in the

browser of the pizza chef, informing him of the new order. In conjunction with

this, the address of the client is obtained from the Phone Company

Service, and the GPS Web Service is then called to obtain the precise

coordinates of the address. Once the coordinates are obtained, a map is

retrieved from the Map Web Service. After this has completed successfully,

the map is sent to the delivery boy’s PDA, and an SMS is sent to the client

informing him that his pizza will be delivered in 20 minutes. In this example,

various failures can occur, and because we are making use of dynamic

composition, failures are bound to happen.


Figure 7 – Flow diagram of Pizza Company

In the paper, the authors propose two types of failure detection, and

three types of recovery methods. The two detection methods are briefly

discussed in the next section (Section 5). The three recovery methods

proposed by Baresi et al. [2006] are:

• Retry : if a binding to a service failed, we retry in the hope that it was a

once-off failure.


• Dynamically bind to another service : we rebind to another service

that offers the same functional or non-functional properties as the one

that is unavailable.

• Process reorganization : a dynamic reorganization of the process at

run-time, in order to overcome the problems due to a faulty or

unavailable external service, for which no alternative matching service

can be found.

These methods can be structured in a hierarchical fashion. This implies that if

a service cannot be reached, we first retry the service a few times. If that

strategy doesn’t work, we rebind to another service. If that strategy fails, we

switch over to the most complex recovery method namely process

reorganization.
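The hierarchy of recovery methods can be sketched as a chain of fallbacks. The Python code below is an illustration of the ordering only and is not the implementation of Baresi et al.: services are modelled as plain callables, `reorganize` is a hypothetical placeholder for process reorganization, and an unreachable service is assumed to raise `ConnectionError`.

```python
def invoke_with_recovery(primary, alternatives, reorganize, retries=2):
    """Try retry, then rebinding, then process reorganization, in that order."""
    for _ in range(1 + retries):       # the first call plus the retries
        try:
            return primary(), "primary"
        except ConnectionError:
            continue
    for service in alternatives:       # rebind to an equivalent service
        try:
            return service(), "rebind"
        except ConnectionError:
            continue
    return reorganize(), "reorganize"  # last and most complex resort


def down():  # hypothetical service that can never be reached
    raise ConnectionError("service unavailable")


print(invoke_with_recovery(down, [lambda: "map"], lambda: "reorganized map"))
```

The second element of the returned pair records which strategy finally produced the result, which is useful when the recovery itself needs to be monitored.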

In process reorganization, we can locally reorganize services if we cannot

rebind to another service that can offer the same properties as the

unavailable service. This is done by using graph transformation rules. Using

this strategy, we can split single nodes into parallel and disjoint nodes, and

we can also combine parallel nodes into single nodes. This is done by

ensuring that the pre- and post-conditions are the same for the resulting

nodes after the transformation was applied. As an example, if a single node n

is split up into two nodes n1 and n2, the pre-condition of nodes n and n1 will

be the same. Similarly, the post-condition of nodes n and n2 will also be the

same. This will result in the post-condition of n1 implying the pre-condition of

n2.

As a more concrete example, if the Get Map and Route Service

cannot return a map of correct resolution for the PDA’s, we can split up that

service into two services Get Good Map and Route and Filter Map.

Get Good Map and Route will return a high resolution map, and Filter

Map will scale down the map to the proper resolution for the PDA’s.
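The rule for splitting a node can be checked mechanically. The Python sketch below is a strong simplification and not the graph transformation machinery of the paper: pre- and post-conditions are modelled as sets of facts (an assumption of mine), so "implies" becomes a subset test, and the fact names such as `hires_map` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    pre: frozenset   # facts that must hold before the service runs
    post: frozenset  # facts that hold after the service has run


def valid_split(n, n1, n2):
    """Check that n1 followed by n2 can replace the single node n."""
    return (n1.pre == n.pre          # same pre-condition as n
            and n2.post == n.post    # same post-condition as n
            and n2.pre <= n1.post)   # post(n1) implies pre(n2)


get_map = Node("Get Map and Route",
               frozenset({"address"}), frozenset({"pda_map"}))
get_good_map = Node("Get Good Map and Route",
                    frozenset({"address"}), frozenset({"hires_map"}))
filter_map = Node("Filter Map",
                  frozenset({"hires_map"}), frozenset({"pda_map"}))

print(valid_split(get_map, get_good_map, filter_map))
```

Reversing the order of the two new nodes fails the check, since Filter Map needs a high resolution map that does not exist yet.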

4.5 Trivial Recovery Methods

There are some trivial recovery methods that can be used. A good one would

be to use caching. Clients can cache previously retrieved information, and can

recall it when the service cannot be found (Figure 8), or if some failure

occurred during the request. This would only be useful if the service offers

information that does not change too often (e.g. a service giving

information about bus times). In cases where information will change very


often (e.g. a service that offers the latest stock exchange information), this

type of approach would be useless, since it would not help a client to use

information that is old.
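A client-side cache with an age limit captures both halves of this argument. The Python sketch below is an illustration under assumptions: the class name `CachingClient`, the `max_age_s` parameter and the `bus_times` service are hypothetical, and an unreachable service is assumed to raise `ConnectionError`.

```python
import time


class CachingClient:
    """Fall back to a cached response when the service cannot be reached."""

    def __init__(self, service, max_age_s):
        self.service = service
        self.max_age_s = max_age_s
        self._cache = {}  # request -> (timestamp, response)

    def get(self, request, now=None):
        now = time.time() if now is None else now
        try:
            response = self.service(request)
            self._cache[request] = (now, response)
            return response
        except ConnectionError:
            stamp, response = self._cache.get(request, (None, None))
            if stamp is not None and now - stamp <= self.max_age_s:
                return response  # cached data is still fresh enough
            raise                # nothing usable in the cache


available = {"up": True}


def bus_times(stop):  # hypothetical slowly changing service
    if not available["up"]:
        raise ConnectionError("service not found")
    return ["08:00", "08:30"]


client = CachingClient(bus_times, max_age_s=3600)
first = client.get("Hatfield", now=0)
available["up"] = False
cached = client.get("Hatfield", now=600)
print(first, cached)
```

For a bus timetable a `max_age_s` of an hour may be reasonable; for stock exchange information it would have to be so small that the cache is of little use.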

Another trivial method would be to just keep requesting the information

until it is received, or until a specified time out is reached. This type of error

recovery is the easiest, but it is the most undesirable of all recovery methods,

since clients do not want to wait for a service to respond to a request. Clients

would prefer to use the quickest and most accurate service, which will provide

results in a fast and reliable manner.

Figure 8 – Flow Diagram of a Trivial Recovery Method

5 Failure Detection

Having now classified some of the most common failures and also having

discussed some of the most common recovery methods, to bring the two

together we need some way to detect whether a failure occurred or not.

Failure detection algorithms are used in Self-healing networks to detect

whether or not something went wrong during the composition phase. There

are various ways in which this can be done, but these various techniques can


be split into two main categories: dynamic detection of errors and static

detection of errors.

Dynamic detection implies that the error or failure is detected during

execution or during run-time. Static detection implies that errors or failures are

detected in an offline fashion (in other words, not during run-time).

Baresi et al. [2006] propose two methods called Defensive Process

Design (DPD) and Service run-time Monitoring (SrtM), which are two forms of

dynamic detection. Ouyang et al. [2005] propose an automated analysis using

Petri net techniques which is a form of static detection.

Since this is not the main focus of this document, these methods will be

discussed briefly in the following subsections.

5.1 Defensive Process Design

According to Baresi et al. [2006], Defensive Process Design (DPD)

consists of designing services in such a way so that they can cope with

failures. This is done by using some of the language constructs that are

included in the BPEL standard. By designing services in such a way, we can

detect and gracefully recover from most exceptions and failures.

As an example, a time-out failure can be detected in such a way by

encapsulating the invoke action in a scope that has a timer. Once the timer

has run out, the service can recover from the time-out exception by calling

another service, or rebinding, or even retrying the same service.
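The timer-guarded invocation can be imitated outside BPEL as well. The Python sketch below is an analogy and not BPEL semantics: the service runs in a worker thread, the caller waits for at most `timeout_s` seconds, and a hypothetical `fallback` callable plays the role of the recovery action.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def invoke_with_timeout(service, fallback, timeout_s):
    """Return the service result, or the fallback result after a time-out."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(service)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The timer has run out: recover by calling something else.
        return fallback()
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def slow_service():  # hypothetical service that answers too late
    time.sleep(0.5)
    return "late answer"


print(invoke_with_timeout(slow_service, lambda: "answer from backup", 0.05))
```

The slow call is not cancelled in this sketch; it simply finishes in the background and its result is discarded.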

This type of detection ties in with Section 4.3 since we can use exception

handlers and catch blocks to detect when an error has occurred. BPEL also

provides us with other constructs that will also help with the detection of

failures.

5.2 Service run-time Monitoring

Service run-time Monitoring (SrtM) consists of making use of external

monitoring tools to check whether functional and non-functional contracts are

violated. There are various methods that can be used to monitor services.

Baresi et al. [2006] propose an assertion-based approach.

In their approach, they specify pre- and post-conditions for remote

services. These are checked by a separate tool that will notify the process

engine if anything goes wrong. In the event that a pre- or post-condition has


been violated, the tool will notify the process engine, which will take the

appropriate actions to recover from the error.
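An assertion-based monitor can be pictured as a wrapper around the remote call. The Python sketch below is a generic illustration and not the tool of Baresi et al.: the wrapper name `monitored`, the `ContractViolation` exception and the `gps_lookup` service with its fixed coordinates are all hypothetical.

```python
class ContractViolation(Exception):
    """Signals the process engine that a pre- or post-condition failed."""


def monitored(service, pre, post):
    """Wrap a service so that its contract is checked on every call."""
    def wrapper(request):
        if not pre(request):
            raise ContractViolation("pre-condition violated")
        response = service(request)
        if not post(request, response):
            raise ContractViolation("post-condition violated")
        return response
    return wrapper


def gps_lookup(address):  # hypothetical stand-in for a remote GPS service
    return (-25.75, 28.23)


checked_gps = monitored(
    gps_lookup,
    pre=lambda address: bool(address.strip()),
    post=lambda address, coords: (-90 <= coords[0] <= 90
                                  and -180 <= coords[1] <= 180),
)

print(checked_gps("Lynnwood Road, Pretoria"))
```

Keeping the checks in a wrapper mirrors the separation in the approach: the service itself is untouched, and the monitoring code decides when the process engine has to be notified.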

The ASTRO tool set (Trainotti et al. [2005]) also makes use of a similar

method in its WS-mon component. The only difference is that the monitoring

code gets generated automatically by ASTRO and they use Java code to

monitor the services.

5.3 WofBPEL

Ouyang et al. [2005] propose a technique that is based on Petri net

analysis techniques. They propose the use of an external tool, WofBPEL,

which can analyse composite services once they have been translated into

Petri Net Markup Language (PNML). Unlike the previous two methods, which

can be implemented to analyse service composition dynamically, this

technique analyses service composition statically in an off-line fashion.

A composite service needs to be translated into a secondary language

before it can be analysed for errors. At the time of the article, the tool only

supported three types of error detection: detection of unreachable actions,

detection of conflicting message-consuming activities and metadata

generation for garbage collection of unconsumable messages.
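To give a feeling for the first of these checks, the Python sketch below finds unreachable actions in a small control-flow graph. It is a deliberately simplified stand-in: WofBPEL works on Petri nets, whereas this sketch uses plain directed-graph reachability, and the activity names in `flow` are hypothetical.

```python
from collections import deque


def unreachable(flow, start):
    """Return the activities that can never be reached from the start."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in flow.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # Every activity mentioned anywhere in the graph.
    nodes = set(flow) | {n for targets in flow.values() for n in targets}
    return nodes - seen


# Hypothetical process: "logError" has no incoming edge from the start.
flow = {
    "receive": ["invoke"],
    "invoke": ["reply"],
    "logError": ["reply"],
}

print(unreachable(flow, "receive"))
```

Because the analysis only needs the structure of the process, it can be run before deployment, which is what makes it a static technique.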

6 Three Scenarios

In this section I want to introduce three scenarios where service oriented

computing (SOC) can be used for a real world implementation. There are

many different applications for SOC, some that are very big, and some that

are relatively small. With these scenarios I try to cover a wide spectrum from

the smaller implementation (Foreign Traveller Information) to the large scale

implementation (The General Entertainment Planner).

6.1 Foreign Traveller Information

The idea here is that you are a tourist that just landed in a foreign country.

You want to be able to get various information regarding transport options to

and from your hotel.

In this example, access to information regarding bus times, stations and

prices can be accessed from a mobile device or your laptop. The way this is

done is by making use of different services (one for bus times, another for


geographical information, etc.). The main program will go out and find suitable

services to use, and will compose the received data in a meaningful way for

the client using the program. Many services are involved, but only a small

amount of data is needed from them in the end. See Figure 9.

This can be related to a real-world scenario. A university professor, on his way back from a conference, misses his connecting flight because his previous flight was delayed. He enquires about other flights and finds out that all flights to his final destination are fully booked, and that the next flight is only available the following night. Now the professor has a problem. It is late at night and he needs to book a flight and also a hotel for the night. Thankfully there are various web sites that the professor can visit to make these bookings. These web sites almost always make use of Web Services to gather information. So the professor goes to a web site that will allow him to make a hotel reservation.

Figure 9 – Foreign Traveller Information

The site gathers information from all the local hotels and displays it to the professor so that he can make an informed choice. He also visits a web site to book his flight for the following evening. Thanks to Web Services, the day was saved: the professor got a good night’s rest and got home safely on the later flight that he booked using the web sites.

6.2 General Entertainment Planner

In this example a user can plan his night out by finding information about nearby entertainment complexes. A user will be able to find out, for example, what movies are showing at cinema complexes and at what times they will be showing.

He can also find out how to get to these cinemas from his current location. Other information that users will be able to access includes information about restaurants, pubs, clubs, bars and other entertainment hubs. This means that all of this information must be obtained from various locations so that the user can plan his night. You will need information on each place’s location (geographical information so that the user can get maps to these places), information about the specific places (prices, atmosphere, type of place etc.) and probably some sort of translation service so that you can display the information in various languages. This once again involves many different services from different sources, and in the end the information obtained from these services must be composed in a meaningful way. See Figure 10.

6.3 Mall Information System

In this example the idea is very simple. A user wants to locate the nearest

shop of a specific kind (for example a stationery shop such as CNA) in his area. He also

wants to know whether the shop will have what he is looking for and also how

to get there. The user must be able to access this information from his home

computer, as well as his mobile phone (or other mobile device). This requires

that the system can find information about malls and the shops that they

have. It also needs to find geographical information so that it can give the

user directions to the mall. Instead of giving the user a map, the system must

be able to give the user directions in a descriptive manner.

This example once again needs information from different services, but

this time it is on a smaller scale. The system only needs to provide the user

with a list of shopping malls where the shop can be found, and directions to

the nearest one (or one chosen by the user). See Figure 11.

Figure 10 – General Entertainment Planner

Figure 11 – Mall Information System

7 Example Scenario: Shopping Domain

Shopping centres are being built everywhere nowadays and they are getting bigger and bigger. Many centres, though, do not have all the shops that you would want to visit. Although almost every major shopping centre has a web site with a store directory on it, not many of us take the time to go onto the internet and find out which shopping centre contains a particular store. It would be much simpler to just use your cell phone to get the information about a shopping centre. Furthermore, not many of us know where some of the major shopping centres are.

In a perfect world we would all know the directions to each one of these as well as which stores each one has. But as we all should know by now, that is impossible, firstly because there are too many shopping centres, and secondly because many shopping centres evolve and change. Older stores close to make way for newer ones and thus the store directory keeps changing. The proposed system will help frequent shoppers to know exactly where to go and what they can expect.

The system is, in concept, very simple. The customer will use either his cell

phone or his computer (or any other mobile device) to gather the required

information. A program on each device will connect to the necessary services,

and will return the results in a meaningful way. It will be the responsibility of

the program to do error handling and recovery.

Many different languages exist that can be used to describe a web

service. Almost all of them are derived from XML. Depending on the type of

description we want, we can describe a service using any one of the following

standards:

• Web Services Description Language (WSDL)

• OWL and OWL-S

• DAML and DAML-S

Each one of these languages brings along with them their own unique

method of describing a service. WSDL mainly describes the interface and can

also contain a short description of the service. It describes the interface as a

set of end-points operating on messages. These messages are described

abstractly and are bound to concrete network protocols. OWL describes the

semantics of the service. It is often used to describe the ontology of the

service, in other words, the behaviour of the service. For the example, we will

use WSDL as the description language.

Different work flow languages also exist. Some of the ones that were

proposed are:

• BPEL (Business Process Execution Language)

• WSFL (Web Services Flow Language)

• XLANG (Web Services for Business Process Design)

• WSCI (Web Service Choreography Interface)

• BPML (Business Process Markup Language)

• BPSS (Business Process Specification Schema)

All of these languages have their own characteristics. According to van

der Aalst [2003], XLANG has block-structures with basic control flow

structures. WSFL on the other hand, is not limited to block-structures, and

allows for directed graphs. It mainly describes Web Service composition and

it considers two types of composition: usage patterns and interaction patterns.

Usage patterns are concerned with how to achieve a particular goal and

interaction patterns are concerned with a collection of Web Services. BPEL

builds on both these languages (XLANG and WSFL) and therefore supports

most of the constructs supported by both languages. It provides programming

abstractions that allow developers to compose multiple discrete Web Services

into an end-to-end process flow. The other languages (WSCI, BPML and

BPSS) are quite new and they have not yet caught on as a standard to be

used for Web Services.

We will use BPEL as the flow language. WSDL and BPEL were chosen for their ease of use, and also because my development platform (Oracle JDeveloper 10g [1]) only allows me to use these two languages.

To successfully simulate the use of this system, and its capabilities to

recover from a failure, the services that are used will be fake services,

created by me in JDeveloper 10g [1]. These services will only return the

necessary information to the system. This setup allows me to break a service,

so that the system can then start the recovery process.

[Flow diagram: Get Shopping Center Listing → Shopping Center Listing Retrieved? (Yes/No, check sub-goal here) → Get City Map → City Map Retrieved? (Yes/No, check sub-goal here) → Display Retrieved Data]

Figure 12 – Flow Diagram showing where Sub-Goals will be Checked

Although there are many different recovery methods, the most practical one to use when dealing with Web Services is a transaction-based approach to recovery. With this approach, we can control where and when failures will be detected. We can do this by checking for certain sub-goals that need to be completed before we can continue with the processing of information. Logical places to insert sub-goals are after each call to a service. Once a service is invoked, we can check that the service has responded to our request; if it has, that particular sub-goal is complete. If it has not responded to our request, we can reissue our request, or choose to rebind to another service. Figure 12 shows where the sub-goals will be checked.
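The sub-goal checks described above can be sketched in a few lines. This is only an illustration under stated assumptions, written in Python rather than in the .NET environment used for the actual program. Each service is modelled as a plain function, a missing response is modelled as a hypothetical `ServiceUnavailable` exception, and all function and service names are invented for the example.

```python
class ServiceUnavailable(Exception):
    """Raised when a (fake) service does not respond to a request."""


def achieve_sub_goal(candidates, request, max_retries=1):
    """Return the first response obtained from the candidate services.

    Each service is tried once and then reissued up to `max_retries` times
    before we rebind to the next candidate in the list.
    """
    for service in candidates:                 # rebinding to another service
        for _ in range(1 + max_retries):       # first request plus reissues
            try:
                return service(request)        # response received: sub-goal done
            except ServiceUnavailable:
                continue
    raise ServiceUnavailable("no service could complete the sub-goal")


def plan_shopping_trip(listing_services, map_services, shop, city):
    malls = achieve_sub_goal(listing_services, (shop, city))  # first sub-goal
    city_map = achieve_sub_goal(map_services, city)           # second sub-goal
    return {"malls": malls, "map": city_map}


# Fake services, one of them deliberately broken.
def broken_listing(request):
    raise ServiceUnavailable("listing service is down")


def backup_listing(request):
    shop, city = request
    return [f"{shop} at Example Mall, {city}"]


def map_service(city):
    return f"map of {city}"


result = plan_shopping_trip(
    [broken_listing, backup_listing], [map_service], "CNA", "Pretoria"
)
print(result)
```

If every candidate for a sub-goal fails, the exception propagates to the caller, which corresponds to the case where no recovery is possible.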

For the program, I chose to use .NET for my development environment.

This is mainly due to its ease of use, but also because Web Services can be

easily integrated into the code.

A common way to simulate a transaction-based approach in most programming languages is to use try-catch blocks or if-statements. With try-catch blocks it is easy to detect that an error occurred, and we can recover from it in the catch segment of the try-catch block. The following piece of C#-like pseudo code shows how this would look.

    // C#-like pseudo code; requires System and System.Windows.Forms.
    public void searchServices(string shop, string city, string prov)
    {
        string url = "http://aikon:9700/orabpel/default/DummyService_1/DummyService_1?wsdl";
        try
        {
            string service = invokeMapService(url);
        }
        catch (Exception)
        {
            // The sub-goal failed: ask the user whether to reissue the request.
            DialogResult button = MessageBox.Show(
                "Error occurred during invocation of Service. Retry invocation?",
                "Invocation Error", MessageBoxButtons.RetryCancel, MessageBoxIcon.Error);
            if (button == DialogResult.Retry)
            {
                string service = invokeMapService(url);
            }
        }
    }

    public string invokeMapService(string url)
    {
        string result;
        try
        {
            // Placeholders for the generated proxy calls to the service at url.
            invokeMapservice(parameter1, parameter2);
            result = returnMapserviceresults();
        }
        catch (Exception)
        {
            MessageBox.Show("Error occurred during invocation of Service",
                "Invocation Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
            throw; // rethrow so that the caller can decide how to recover
        }
        return result;
    }

Code Example 8 – Pseudo code for a Transaction-based approach

Transactions can also be simulated in a similar way using if-statements. This looks almost exactly the same as the try-catch example above, but determining whether a failure occurred is more difficult than before, since each result must be inspected explicitly rather than relying on an exception being raised.
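A minimal sketch of the if-statement variant is given below, again in Python and under stated assumptions. The fake service signals failure by returning `None` instead of raising an exception, and the `fail` flag and both function names are hypothetical devices for simulating a broken service.

```python
def invoke_map_service(city, fail=False):
    """Fake service call: returns a map string, or None to signal a failure."""
    return None if fail else f"map of {city}"


def get_city_map(city, attempts):
    """Check the sub-goal with an if-statement after every call.

    `attempts` lists, for each simulated call, whether that call fails.
    Returns the map, or None if every attempt failed.
    """
    for fail in attempts:
        result = invoke_map_service(city, fail=fail)
        if result is not None:   # sub-goal reached, stop retrying
            return result
        # no response: fall through and reissue the request
    return None


print(get_city_map("Pretoria", [True, False]))
```

The caller has to test every return value itself, and a bare `None` carries no information about what went wrong, which illustrates why failures are harder to pin down with this style.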

7.1 Program Demo

In this section I give a demonstration of how the program works and how it copes with failures. When the program is started, the user must enter the requested data into the fields. The data requested are the shop name, province and city. This is shown in Figure 13.

The program then goes out, finds the relevant information and displays it on the screen. Depending on the results found, the user will either get only one response (in which case the system automatically displays the results page for that result), or get the opportunity to choose from a list of results which one to display. Once the user has made his choice, the program responds by displaying the shop name, the mall name, additional information and directions on how to get there. This is shown in Figure 16.

In the event that something goes wrong during the invocation of the service, the program will inform the user and ask how he wants to handle the situation. The user can either retry the invocation or ask the program to handle the error. The program will first retry the invocation, after which it will try to find a new service (if one is available). In the event that something goes wrong during the operations on the services, the program will make use of standard transaction-based rules to recover from the failure. This is shown in Figure 17. It can also happen that there is no possibility of recovery. This situation is shown in Figure 18.

8 Related Work

During my research, I did not come upon any research papers that deal with the classification of faults in Web Services. Many papers do, however, name some common faults that can occur. Baresi et al. [2006] name some of the faulty behaviour that can occur at deployment time and at run time. They do not, however, try to classify these faults into categories.

Tanenbaum & van Steen [2002] do a classification of faults in distributed

systems. Some of these faults are closely related to faults that can occur in

Web Services and they have been included in the classification model in

Section 3, but their work is focussed on distributed systems and not Web

Services.

A great deal of research has also gone into the detection of faults, something I did not cover in detail in this document. Ouyang et al. [2005] use an automated tool to detect a limited set of faults by making use of Petri net analysis techniques. Their tool, WofBPEL, can detect unreachable services, services that make use of ambiguous input or output, and invalid input messages to a service (in other words, messages of the wrong type for the service). Their analysis, however, is done statically, and the BPEL processes have to be converted into another language before they can be analysed. Baresi et al. [2006] use two run-time methods to detect failures. DPD and SrtM can be used to detect failures when using Self-healing networks. Another detection strategy is included in ASTRO (Trainotti et al. [2005]). In ASTRO, monitors are generated automatically in Java. These monitors are used to check predefined properties of the associated processes and they produce feedback in the event of a failure. These properties can be related back to the pre- and post-conditions of a service.

When it comes to recovery methods, a lot of research has gone into this

field. Both Tanenbaum & van Steen [2002] and Tartanoglu et al. [2006]

classify recovery methods into two subfields namely forward and backward

error recovery. Both also mention the use of transactions as a successful way

to recover from failure. However, most of the research focuses on Self-

healing networks, and dynamic composition of services. Other methods are

also discussed, but not as much as the Self-healing Approach. The

Transaction Based approach, however, has been mentioned before in

different papers and textbooks under many different names and guises. It

seems to be the most logical choice when you do not want to make use of a

Self-healing network (even though the two methods can be combined

successfully to produce an even better recovery method).

Various tools and languages have also been created to help with the composition of services. Brogi & Popescu [2005] proposed an aggregation approach built on the workflow language Yet Another Workflow Language (YAWL), which can be used to express not only the basic workflow, but also the behaviour of the composition. YAWL is based on Petri nets, which makes failure detection a bit easier. With their approach, a service using BPEL as the workflow language and OWL as the descriptor first needs to be translated into YAWL. After that, services are expanded to include control-flow constructs. These constructs can then be used in the next phase to make sure that aggregated services do not have processes with unsatisfied inputs. These constructs can be seen as pre- and post-conditions of a service. If they are not met, the composition will fail. Finally, the service is deployed as a normal Web Service. Their proposed strategy is great in theory, but even though it is “semi-automated”, it is still an off-line strategy.

Ponnekanti & Fox [2002] proposed a developer toolkit for the composition

of Web Services called SWORD. Although a developer toolkit isn’t anything

new, their toolkit allows for the composition of services by supplying it with the

necessary pre- and post-conditions. It will also generate rule based plans

using these conditions as a base to work from.

Pautasso & Alonso [2003] created a visual language in which a service’s

workflow can be described using a graphical representation. Their language

called BioOpera Flow Language (BFL) works very much the same as BPEL’s

graphical notation in OBPMS. They have many of the same constructs in

BFL, as well as a development environment specifically designed for BFL.

All in all, a lot of research has gone into recovery and detection methods,

but not a lot of research has gone into failures as such. Many researchers

mention some of the failures they came across in their publications, but they

do not classify them into specific categories.

9 Conclusion

Web Services live in a very dynamic environment. Because of this, many things can go wrong during the lifetime of a single Web Service. This paper tries to classify some of the common failure points when using Web Services. The classification is by no means complete, but serves as a model with which certain failures can be associated. Very little research has gone into the classification of failures. Some papers just name them (Baresi et al. [2006]) and others classify them into their own categories (Tartanoglu et al. [2006]). More research has gone into recovery from failures than into failures themselves.

Different recovery methods have been proposed, but some of the more

popular ones have stayed in the research arena longer. Nowadays more

research is going into self-healing composition of services than any other

recovery method. This is partly due to its success, but also due to the fact that

there are still many areas that can be improved upon in self-healing networks.

Transaction-based approaches have been around for a long time and they

have proven to be successful in the real world already. Some problems do

persist though when using a transaction-based approach in a distributed

fashion, but models have been proposed to solve this (Mikalsen et al. [2002]).

Other methods also exist. Tartanoglu et al. [2006] use the term Forward

Error Recovery to classify all those recovery methods that come from the

workflow language itself (all the exception handling etc.). There also exist

trivial methods that are not suited to Web Services at all, like caching, that

only prove to us why we need all these different recovery methods.

The research field in recovery from failure is far from depleted, and a lot of

research can still be done in various other related areas. Even though it was

not covered in this document, a lot of research is still continuing in service

discovery as well. Discovery and recovery can go hand-in-hand, especially

when we look at Self-healing networks, since Self-healing networks do

recovery by searching (discovering) for other services that can take over from

a service that failed. Various other research fields are opening up in Web

Services, and all of them have to deal with failure and recovery at some point.

This document tries to show how important a formal classification of failures

can be.

10 Acknowledgements

I would like to thank May Chan for her help and all the discussions regarding this topic. I would also like to thank my supervisor, Prof. J. Bishop, for her support in guiding me in the right direction every time.

References

[1] "Oracle BPEL Process Manager Suite 10g," Oracle.

[2] "Service-oriented architecture," Wikipedia, Available:

http://en.wikipedia.org/wiki/Service_Oriented_Architecture. [Accessed:

2006/11/09].

[3] "Microsoft Visual Studio 2005," Professional Edition ed: Microsoft,

2005.

[4] Wil M.P. van der Aalst, "Don't go with the flow: Web services

composition standards exposed," IEEE Intelligent Systems, vol. 18, no.

1, pp. 72-76, 2003.

[5] Wolf-Tilo Balke and Matthias Wagner, "Towards Personalized

Selection of Web Services." in Proceedings of the WWW (Alternate

Paper Tracks), 2003.

[6] Luciano Baresi, Carlo Ghezzi, and Sam Guinea. "Towards Self-healing

Service Compositions." in Contributions to Ubiquitous Computing, vol

42, Springer, 2006.

[7] Antonio Brogi and Razvan Popescu, "Towards Semi-automated

Workflow-Based Aggregation of Web Services." in Proceedings of the

ICSOC, 2005, pp. 214-227.

[8] Robert J. Brunner, Frank Cohen, Francisco Curbera, Darren Govoni,

Steven Haines, Matthias Kloppmann, Benoit Marchal, K. Scott

Morison, Arthur Ryman, Joseph Weber, and Mark Wutka, Java Web

Services Unleashed, Sams Publishing, 2002.

[9] Paul A. Buhler, Christopher Starr, William H. Schroder, and José M.

Vidal, "Preparing for Service-Oriented Computing: A Composite

Design Pattern for Stubless Web Service Invocation." in Proceedings

of the ICWE, 2004, pp. 603-604.

[10] Damian Foggon, Daniel Maharry, Chris Ullman, and Karli Watson,

Programming Microsoft .NET XML Web Services, Microsoft Press,

2004.

[11] Rania Khalaf, Nirmal Mukhi, and Sanjiva Weerawarana, "Service-

Oriented Composition in BPEL4WS." in Proceedings of the WWW

(Alternate Paper Tracks), 2003, pp.

[12] Heiko Ludwig, Henner Gimpel, Asit Dan, and Robert Kearney,

"Template-Based Automated Service Provisioning - Supporting the

Agreement-Driven Service Life-Cycle." in Proceedings of the ICSOC,

2005, pp. 283-295.

[13] Thomas Mikalsen, Stefan Tai, and Isabelle Rouvellou, "Transactional

Attitudes: Reliable Composition of Autonomous Web Services,"

presented at International Conference on Dependable Systems and

Networks, Washington D.C., USA, 2002.

[14] Chun Ouyang, Wil M.P. van der Aalst, Stephan Breutel, Marlon

Dumas, Arthur H.M. ter Hofstede, and Eric Verbeek, "WofBPEL: A

Tool for Automated Analysis of BPEL Processes." in Proceedings of

the ICSOC, 2005, pp. 484-489.

[15] Abhijit A. Patil, Swapna A. Oundhakar, Amit P. Sheth, and Kunal

Verma, "Meteor-s web service annotation framework." in Proceedings

of the WWW, 2004, pp. 553-562.

[16] Cesare Pautasso and Gustavo Alonso, "Visual composition of web

services." in Proceedings of the HCC, 2003, pp. 92-99.

[17] David S. Platt, Introducing Microsoft .NET, Microsoft Press, 2001.

[18] Shankar R. Ponnekanti and Armando Fox, "SWORD: A Developer

Toolkit for Web Service Composition," vol. no. pp. January~01.

[19] Mike Rosen, "BPM and SOA: Where Does One End and the Other

Begin?" Available: http://www.bptrends.com. [Accessed: 2006].

[20] Ozgur D. Sahin, Cagdas Evren Gerede, Divyakant Agrawal, Amr El

Abbadi, Oscar H. Ibarra, and Jianwen Su, "SPiDeR: P2P-Based Web

Service Discovery." in Proceedings of the ICSOC, 2005, pp. 157-169.

[21] Ichiro Satoh, "Location-Based Services in Ubiquitous Computing

Environments." in Proceedings of the ICSOC, 2003, pp. 527-542.

[22] Andrew S. Tanenbaum and Maarten van Steen, Distributed Systems:

Principles and Paradigms, International Edition. Prentice Hall, 2002,

pp. 272-277.

[23] Ferda Tartanoglu, Valerie Issarny, Alexander Romanovsky, and Nicole

Levy, "Dependability in the Web Services Architecture," Available:

http://www-rocq.inria.fr/~tartanog/publi/wads/. [Accessed: 2006/10/05].

[24] Michele Trainotti, Marco Pistore, Gaetano Calabrese, Gabriele Zacco,

Gigi Lucchese, Fabio Barbon, Piergiorgio Bertoli, and Paolo Traverso,

"ASTRO: Supporting Composition and Execution of Web Services." in

Proceedings of the ICSOC, 2005, pp. 495-501.

[25] Tao Yu and Kwei-Jay Lin, "Service Selection Algorithms for

Composing Complex Services with Multiple QoS Constraints." in

Proceedings of the ICSOC, 2005, pp. 130-143.

Figure 13 – Screenshot of Program requesting data

Figure 14 – Busy Searching for Shops

Figure 15 – Results Found

Figure 16 – Displaying Results

Figure 17 – Failure with the possibility of Recovery

Figure 18 – Notification of Failure without the possibility of Recovery