

Discovery of Microservice-based IT Landscapes at Runtime: Algorithms and Visualizations

Martin Kleehaus, TUM, Germany, [email protected]
Nicolas Corpancho Villasana, TUM, Germany, [email protected]
Dominik Huth, TUM, Germany, [email protected]
Florian Matthes, TUM, Germany, [email protected]

Proceedings of the 53rd Hawaii International Conference on System Sciences | 2020
URI: https://hdl.handle.net/10125/64431 · 978-0-9981331-3-3 · (CC BY-NC-ND 4.0)

Abstract

The documentation of IT landscapes is a challenging task which is still performed mainly manually. Technology and software development trends like agile practices and microservice-based architectures exacerbate the effort required to keep documentation up-to-date. Recent research efforts for automating this task have not addressed runtime data for gathering the architecture and remain unclear regarding proper algorithms and visualization support. In this paper, we close this research gap by presenting two algorithms that 1) discover the IT landscape based on historical data and 2) continuously create architecture snapshots based on new incoming runtime data. We especially consider scenarios in which runtime artifacts or communication paths were removed from the architecture, as those cases are challenging to unveil from runtime data. We evaluate our prototype by analyzing 79 days of monitoring data from a large automotive company. The algorithms provided promising results. The implemented prototype allows stakeholders to explore the snapshots in order to analyze the emerging behavior of the microservice-based IT landscape.

1. Introduction

Information Technology (IT) in organizations is evolving rapidly to fulfill fast-changing requirements. The reason is that companies operate in a dynamic and highly competitive marketplace in which the capability to adapt to changing conditions has become fundamental for companies to survive and to successfully compete against their rivals.

Those conditions increasingly require collaboration between different stakeholders to improve the quality of software applications while developing them more quickly and reliably. New software development methodologies such as agile practices [1], DevOps [2] and continuous delivery [3] of containerized applications emerged from this development. One architecture style that has gained popularity in recent years is microservices [4]. The microservice architecture is a variant of the service-oriented architecture (SOA) style that structures an application as a collection of loosely coupled services that run in their own processes. Many well-known companies such as Amazon, Spotify, LinkedIn, and Uber promote microservices. By using this architectural style, these companies claim to have achieved high scalability, agility and reliability [5]. Microservice architectures strongly support the aforementioned collaboration aspect by relaxing the rigid structure of monolithic systems towards independent deployments of single applications. Hence, microservices embed easily into agile environments and the continuous delivery approach.

Enterprise architecture (EA) management (EAM) [6] has been established as an important instrument for managing the complexity of the IT landscape and enabling enterprise-wide transparency. EAM is typically conducted to document and analyze the status quo of the current EA in order to define requirements and plans for transformations to an architecture that enables the business strategy, optimizes the business processes and thereby reduces costs. IT landscape modelling, as a subarea of EAM, aims to discover and visualize the artifacts and relationships of the company's IT landscape. It is typically conducted manually by people within one organization having different technical and non-technical backgrounds. This activity is mandatory for analyzing transformation strategies.

Unfortunately, even though microservices have several advantages over monolithic systems, this architecture style introduces a high level of complexity with regard to IT landscape modelling [7]. For instance, due to agile practices, new microservices or communication paths between microservices can be introduced very quickly into the current infrastructure or removed when they are no longer needed. Hence, it is crucial to keep track of the emerging IT landscape, which raises the documentation overhead.

The consequences are out-of-date EA models, which lead to decisions made on wrong data or bad data quality [8]. Farwick et al. [9] and Hauder et al. [10] identified runtime data as a promising information source for delivering EA-relevant information. Hence, a few research endeavours [11, 12, 13] leverage runtime data for automating IT landscape modelling. However, the presented solutions do not capture the complete IT landscape: an important aspect missing in most of the solutions is the proper identification of communication dependencies from an end-to-end perspective, i.e. the communication paths between runtime artifacts and the interfaces (APIs) used for the information exchange. Although there exists a plethora of commercial and open-source monitoring vendors that provide powerful agent-based instrumentations to gain insights from an end-to-end perspective via tracing [14], those tools were primarily developed for, i.a., monitoring application performance (APM) and not for documentation purposes. For instance, many monitoring solutions are event-based [15], i.e. they provide runtime data as soon as a specific event occurs, like a user request or a system failure. Those events are traced and kept in the database for a certain period of time. Within an event, the communication behavior between IT components is unveiled. Outside of events, only the state (running, paused, down, etc.) of IT components is captured. As a consequence, runtime data provide the IT architecture only at a specific point in time, which does not mean it is complete at all. In order to capture the entire as-is IT landscape with all identified communication paths, a request for the complete runtime history would be required, which is mostly not possible due to performance reasons and resource limitations. In addition, old components and communication paths that were already removed several sprints ago would also be extracted.

In this work, we present a solution that leverages runtime instrumentation for continuously reconstructing and documenting the as-is microservice-based IT landscape at any given point in time. The contribution of our work is threefold: 1) We describe an approach for continuously tracking architecture changes and applying those changes to our maintained architecture in order to ensure an up-to-date IT landscape documentation. 2) We store every change to the architecture as snapshots in our database and visualize this emerging behaviour on a timeline. The visualization of the information exchange dependencies is built upon a graph-based scheme provided by GraphQL as the query language. 3) We allow users to manually refine the reconstructed architecture and save those refinements in our database. We evaluate our prototype in a large automotive company located in Germany.

The remainder of the paper is organized as follows: Section 2 presents related academic work that influenced our design decisions. In Section 3, we describe the concept in more detail, whereas in Section 4, we dive deeper into the implementation aspects. Afterwards, we discuss our evaluation results in Section 5. We finish the paper with our limitations and a conclusion in Sections 6 and 7.

2. Related work

There exist a few concepts on how to integrate runtime data from existing data sources for IT landscape modelling. Holm et al. [13] as well as Alegria et al. [16] make use of network analysis tools in order to infer information about the IT infrastructure. Buschle et al. [12], on the other hand, interpret the configuration of an Enterprise Service Bus (ESB) to include knowledge about communicating information systems in the EA model. These approaches have in common that they are limited to a specific layer of the EA, and the authors do not consider communication paths between runtime artifacts. In addition, they are not appropriate for microservice-based architectures.

O'Brien et al. [17] provide a state-of-the-art report on several architecture recovery techniques and tools. The presented approaches aim to reconstruct software components and their interrelations by analyzing source code and by applying data mining methods.

Cuadrado et al. [18] describe a case study of the evolution of an existing legacy system towards SOA. The proposed process comprises architecture recovery, evolution planning, and evolution execution activities. Similar to our approach, the system architecture is recovered by extracting static and dynamic information from system documentation, source code, and a profiling tool. This approach, however, does not analyze communication dependencies between services, which is an outstanding feature of our prototype.

Van Hoorn et al. [19, 20] propose Kieker, a framework for monitoring and analyzing the runtime behaviour of concurrent or distributed software systems. Although the framework focuses on application-level monitoring, the authors also present a way in which Kieker could be used to recover microservice-based IT landscapes by analyzing the profiled traces. Unlike our approach, Kieker does not store and process architectural changes at runtime. In addition, it remains unclear how communication deletions are processed.



MicroART, an approach for recovering the architecture of microservice-based systems, is presented in [21, 22]. The approach is based on Model-Driven Engineering (MDE) principles and is composed of two main steps: recovering the deployment architecture of the system and semi-automatically refining the obtained model. The architecture recovery phase involves all activities necessary to extract an architecture model of the microservices by finding static and dynamic information about microservices and their interrelations from the GitHub source code repository, the Docker container engine, the Vagrant platform, and the TcpDump monitoring tool. However, this tool also does not track architectural changes and remains imprecise about what happens when communications between microservices are deleted.

3. Architecture discovery concept

3.1. Monitoring techniques

Our algorithm for discovering microservice-based IT landscapes is based on the combination of three monitoring concepts:

A best practice pattern1 for building microservice architectures is the usage of service discovery [23], which serves as a repository to dynamically find the network location of a specific microservice. Microservices frequently change their status and IP address due to reasons like updates, autoscaling or failures. In order for the microservices to still be able to find each other in the network, the service discovery serves as a gateway that always provides the current network locations. In case a change in the architecture (removed service, added service, updated service) is detected, this alteration is reflected in the repository of the service discovery. By retrieving this information, we are able to reveal the current status of each service instance.
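To illustrate how such repository data can be retrieved, the following minimal Python sketch polls a hypothetical registry endpoint; the URL and the response shape are assumptions for illustration, not part of the paper's implementation.

    import requests

    # Hypothetical registry endpoint; the concrete discovery technology
    # (e.g. Eureka, Consul, or the AWS repository used in Section 5) and
    # its response format are assumptions for illustration.
    REGISTRY_URL = "http://registry.example.com/services"

    def fetch_repository_data():
        """Return the currently registered service instances."""
        response = requests.get(REGISTRY_URL, timeout=5)
        response.raise_for_status()
        # Assumed response shape:
        # [{"service": "checkout", "ip": "10.0.0.7", "port": 8080,
        #   "status": "running"}, ...]
        return response.json()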

The service discovery mechanism already provides useful data about the status of the microservices but lacks detailed infrastructure information and interrelationships. For that reason, an additional monitoring agent is required that delivers 1) infrastructure-related data, like host, container type and address, database type and address, operating system or information about the cloud provider, and 2) interrelationships, which unveil schematic connections between microservices and other infrastructure elements, like on which host a microservice is running, in which specific container or operating system it is deployed, or with which database it is communicating. The repository data is enhanced with this information.

1 https://microservices.io

Both concepts discover the status of running microservices at runtime and unveil infrastructure-related information. However, the real communication behaviour between the microservices still remains unknown. For that reason, it is necessary to instrument each microservice with a monitoring probe that tracks request flows through the system. This technique is called distributed tracing [14] and is adopted by many commercial and open-source monitoring solutions2. Distributed tracing tracks all executed HTTP requests in each service by injecting tracing information into the request headers. The main purpose of tracing is to analyze application performance (APM) and to troubleshoot latency problems. In addition, it also provides capabilities to add further information in the form of annotations to each request. These annotations contain additional infrastructure- and software-related information like the executed endpoint address, class and method name, requested port, etc. We leverage distributed tracing in order to unveil the communication behaviour between microservices.
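As a minimal sketch of how communication paths can be derived from such traces, the snippet below pairs each span with its parent span; the flat span format ('spanId', 'parentId', 'service') is a simplifying assumption, as real tracing formats (Zipkin, X-Ray, etc.) differ.

    def edges_from_traces(spans):
        """Derive directed communication paths from tracing spans."""
        by_id = {s["spanId"]: s for s in spans}
        edges = set()
        for span in spans:
            parent = by_id.get(span.get("parentId"))
            # A parent span in another service implies a request flow
            # from the caller service to the callee service.
            if parent and parent["service"] != span["service"]:
                edges.add((parent["service"], span["service"]))
        return edges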

Last but not least, huge microservice infrastructures are load balanced to avoid single points of failure. Tracing data or monitoring agents also provide service instance information. Each instance of a service has the same name but distinguishes itself by IP address and port. Therefore, the uniqueness of a service instance is defined by the service description in combination with the used IP address and the service port. In order to discover all instances that belong to a specific service, we aggregate the information based on the service description.
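A minimal sketch of this aggregation, assuming instance records with 'service', 'ip' and 'port' fields:

    from collections import defaultdict

    def group_instances(instances):
        """Group load-balanced instances under their logical service.

        An instance is uniquely identified by the service description
        in combination with its IP address and port, as described above.
        """
        services = defaultdict(set)
        for inst in instances:
            services[inst["service"]].add((inst["ip"], inst["port"]))
        return services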

It is usually not required to instrument the IT landscape with three different monitoring agents; this would unnecessarily increase the administration and resource overhead. Modern monitoring tools like Dynatrace, AppDynamics or Instana already integrate all mentioned monitoring techniques in one monitoring agent, which is a huge benefit from a DevOps point of view.

3.2. Process description

Even though the combination of the aforementioned monitoring techniques discovers most architecture elements of an IT landscape, it only unveils a snapshot of the architecture elements that created runtime data during a defined period of time. This is mostly enough for analyzing the general existence of an IT element. However, it cannot be ensured that the communication structure is uncovered completely, as that would require all possible communications between the applications to have occurred in the considered period of time.

2 https://openapm.io/landscape


Consequently, we also have to analyze historical runtime data. Hereby, we face the following four challenges: 1) In order to reduce network overhead, most tracing techniques are based on sampling, i.e. only a percentage of requests is traced and forwarded to the monitoring server. In the worst case, specific communication paths are rarely seen. 2) Due to resource limitations, most monitoring tools can only provide a small timeframe of runtime data, e.g. the last 6 hours, depending on the incoming data volume. A request for runtime data over a longer duration would be too resource-intensive and cannot be served. 3) Most monitoring tools store runtime data for only a specific period of time and archive or even delete older data in order to ensure free storage capacity is always available. 4) The history also contains old components and communication paths that were already removed several sprints ago. This legacy data must be filtered out in order to unveil the real architecture. This is easy for the general existence of IT artifacts, as they frequently provide health data: if no health data arrives from a component anymore, it was certainly removed from the architecture. However, it is a different case for communication paths, as they only become visible through request events. An absence of communication only means that no events have been reported.

Considering the listed challenges, we developed a concept that discovers the architecture of the current IT landscape based on a three-step process. First, we reconstruct the architecture by analyzing historical data. We developed the backwardDiscovery algorithm for this purpose. This algorithm runs recursively and retrieves in every iteration historical tracing data (tD) within a timeframe of T = t1 − t0. In case no further data is available, the discovered architecture is returned for manual refinement. In the second step, we support the user with a visualization for adapting the architecture manually, i.e. the user is able to change the structure of the communication paths. Finally, the algorithm forwardDiscovery is triggered at a fixed time interval and consumes new incoming runtime data for continuously adapting the final architecture.

3.3. Algorithm descriptions

We execute two algorithms in chronological order to unveil the complete IT landscape architecture. We assume the architecture A(E, C) is a directed graph with runtime artifacts E and communication paths C, where C ⊆ E × E on a finite set E.
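For the algorithm sketches that follow, we use a minimal Python representation of A(E, C); the attribute names lastSeen and deleted mirror the data model of Section 4, while everything else is an illustrative assumption.

    from dataclasses import dataclass, field

    @dataclass
    class Architecture:
        """Directed graph A(E, C): runtime artifacts E, paths C."""
        elements: set = field(default_factory=set)          # E
        communications: dict = field(default_factory=dict)  # (src, dst) -> metadata

        def add_communication(self, source, target, snapshot):
            meta = self.communications.setdefault(
                (source, target), {"deleted": False})
            meta["lastSeen"] = snapshot  # snapshot/time of last sighting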

The algorithm backwardDiscovery analyzes the available historical runtime data and reconstructs the architecture A′(E′, C′) from the last reported time t−1 until the present time t0. It may exclude communications that have not yet occurred in the regarded history, or include communications between microservices that were already removed in past sprints. Both scenarios must be handled accordingly. Hence, we define A′(E′, C′) as

E′ = E :⟺ ∀e (e ∈ E′ ↔ e ∈ E)

C′ := {(e₁, e₂) ∈ C | e₁ ∈ E′ ∧ e₂ ∈ E′}

The algorithm is executed once. Due to resource limitations, we need to provide a timeframe T that represents the maximum time period accepted by the monitoring tool for going back in history. First, we instantiate the architecture A′(E′, C′) based on the repository data rD(E) (lines 3 and 4), which most monitoring tools provide in order to identify running IT artifacts. We use the function REPOSITORYDATA() for this purpose. The communication paths C remain empty. Next, we retrieve the tracing data tD(E, C) for the last considered timeframe via the call TRACEDATA(t0, t1) (line 7). If the tracing data tD is not empty (line 8), we iterate through all elements e ∈ tD(E) and validate whether each element e is also included in the repository data (lines 9 and 10). If this is the case, we add all communication paths assigned to this runtime element to the architecture (line 11) and start over with the next timeframe (line 12). If no data is received from the monitoring server, the algorithm returns an incomplete architecture (line 14), which can be used as a basis for further refinements. Lines 9 to 11 can also be described as an intersection between the elements e in tD(E, C) and rD(E), but for simplicity reasons we use the imperative representation.

The next algorithm, forwardDiscovery, is executed after backwardDiscovery and the manual refinement. It runs continuously at a defined frequency and eventually returns the complete architecture of the instrumented IT landscape. As input, the forwardDiscovery function consumes 1) a timeframe T for retrieving the monitoring data, 2) the deletion threshold τ, which defines how old a communication path is allowed to be before it gets removed, and 3) the incomplete architecture returned by the backwardDiscovery function or the manual refinement. First, the function fetches both the current content of the repository (line 3) and the trace data (line 4) for a specific period of time. Based on the retrieved data, the architecture A′′ is refined accordingly. For the runtime elements, we apply the intersection (line 5), and for the communication paths, we use the union (line 6) to return the complete architecture, which is eventually consistent in case the missing communication paths were available in the tracing data. However, we still face the issue that removed communications are not recognized without manual input. Hence, we incorporate a threshold τ > 0 that defines the maximum period of time for which communications are allowed to be invisible in the tracing data. In case the threshold is exceeded (line 8), the particular communication path is marked as deleted (line 9). Hereby, we use the last-seen timestamp of each communication. We never remove communications from the current architecture, as we can never be sure that the communication will not appear in future traces again. Finally, we store the current snapshot of the discovered architecture (line 10). The algorithm itself is designed to be idempotent as long as no changes have occurred in the architecture; therefore, running it multiple times has no further impact on the result.

Algorithm 1 Backward Discovery
Require: T > 0
1: function BACKWARDDISCOVERY(A, t0, T)
2:   if A = ∅ then
3:     rD(E) ← REPOSITORYDATA()
4:     A′ ← A(rD(E), C)
5:   t1 ← t0
6:   t0 ← t1 − T
7:   tD(E,C) ← TRACEDATA(t0, t1)
8:   if tD ≠ ∅ then
9:     for all e ∈ tD(E) do
10:      if e ∈ rD(E) then
11:        A′′ ← A′(E, C ∪ tD(C_e))
12:    BACKWARDDISCOVERY(A′′, t0, T)
13:  else
14:    return A′
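The following is a Python transcription of Algorithm 1 under the assumptions above; repository_data() and trace_data(t0, t1) stand in for the monitoring tool's REPOSITORYDATA and TRACEDATA calls, with trace_data assumed to return a set of (caller, callee) pairs.

    def backward_discovery(architecture, t0, T, repository_data, trace_data):
        """Recursive reconstruction from history (cf. Algorithm 1)."""
        if architecture is None:                        # lines 2-4
            architecture = Architecture(elements=set(repository_data()))
        t1, t0 = t0, t0 - T                             # lines 5-6: step back by T
        td = trace_data(t0, t1)                         # line 7
        if not td:                                      # lines 13-14
            return architecture   # incomplete; basis for manual refinement
        for src, dst in td:                             # lines 9-11
            if src in architecture.elements and dst in architecture.elements:
                architecture.add_communication(src, dst, snapshot=t1)
        return backward_discovery(architecture, t0, T,  # line 12
                                  repository_data, trace_data)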

Algorithm 2 Forward Discovery
Require: T > 0, τ > 0
1: function FORWARDDISCOVERY(A, τ, T)
2:   t1 ← t0 + T
3:   rD(E) ← REPOSITORYDATA()
4:   tD(E,C) ← TRACEDATA(t0, t1)
5:   A′ ← A(E ∩ rD(E), C)
6:   A′′ ← A(E′, C ∪ tD(C))
7:   for all c ∈ A′′(C′) do
8:     if c(lastSeen) + τ ≤ t0 then
9:       c(deleted) ← true
10:  V(i+1) ← A′′
11:  return A′′
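A companion sketch of Algorithm 2 under the same assumptions; store_snapshot stands in for the snapshot persistence of line 10, and lastSeen is compared as a timestamp.

    def forward_discovery(architecture, t0, tau, T,
                          repository_data, trace_data, store_snapshot):
        """Continuous refinement (cf. Algorithm 2), run once per interval."""
        t1 = t0 + T                                     # line 2
        rd = set(repository_data())                     # line 3
        td = trace_data(t0, t1)                         # line 4
        architecture.elements &= rd                     # line 5: intersection
        for src, dst in td:                             # line 6: union of paths
            architecture.add_communication(src, dst, snapshot=t1)
        for meta in architecture.communications.values():  # lines 7-9
            if meta["lastSeen"] + tau <= t0:
                meta["deleted"] = True  # marked, never physically removed
        store_snapshot(architecture)                    # line 10
        return architecture                             # line 11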

[Figure 1: Component diagram of the prototype: client application, server, database, monitoring agents, and GraphQL interface. The server discovers the IT landscape architecture based on received monitoring data.]

The forwardDiscovery algorithm can also be adjusted so that it triggers the architecture modification not on defined time intervals but on specific events that occur in the organization. If the trigger becomes aware of changes immediately when they occur, this could enable "real-time" IT landscape architecture documentation. An overview of potential trigger events is shown in the following list; a minimal scheduled-trigger sketch follows it:

• Scheduled trigger: default trigger running on a defined schedule

• Pipeline trigger: triggers the algorithm as soon as changes are deployed to production via a continuous delivery pipeline

• Manual trigger: allows further external tools or a user to trigger the forwardDiscovery algorithm manually
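The sketch below shows the default scheduled trigger; pipeline and manual triggers would call the same entry point from a CI/CD webhook or a UI action. The interval and callback wiring are illustrative assumptions.

    import threading

    def schedule_forward_discovery(run_discovery, interval_seconds):
        """Re-run forwardDiscovery on a fixed schedule (default trigger)."""
        def tick():
            run_discovery()
            threading.Timer(interval_seconds, tick).start()
        tick()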

4. Implementation

The architecture of the prototype is built on four main components: 1) the server receives the runtime data from the monitoring agents and reconstructs the architecture, 2) the database stores the architecture snapshots and the references to the runtime artifacts, 3) the GraphQL interface communicates with the server and provides a query language to traverse the discovered IT landscape architecture, and finally 4) a client application visualizes the architecture and enables manual refinements. The corresponding component diagram is illustrated in Figure 1.

4.1. Data model

[Figure 2: Architecture reconstruction data model.]

The data model of our prototype is depicted in Figure 2. The class "Snapshots" contains all snapshots made for the IT landscape architecture over time. Every snapshot version represents a defined timeframe appropriate for the used monitoring tool, starting at "datefrom" and ending at "dateto". The version number is incremented over time, meaning that the highest version number represents the latest stored IT landscape architecture version.

The tracing data is stored in the classes "TracingComponent" and "TracingEdge". Tracing edges describe the communication paths between the tracing components. For every record in those classes, a hash is generated as the primary ID. The attribute "lastSeen" indicates the snapshot version in which the tracing data was last seen.

The repository data retrieved from the monitoring tool is stored in the class "RepositoryData". Unlike for the tracing data, monitoring tools mostly do not provide a history of the runtime artifacts; hence, we version this information on our own by frequently pulling the data from the monitoring tool. The attribute "lastSeen" contains the version, as an integer, in which the repository items last existed.

Due to performance and implementation reasons, a version class was integrated for every mentioned IT artifact (components, edges and repository items). These version classes (VersionComponents, VersionEdges and VersionRepositoryData) are related to the particular snapshots created.
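A condensed sketch of the data model described above; field names follow the paper where stated ("lastSeen", "datefrom", "dateto"), while the types and the hashing scheme are illustrative assumptions.

    import hashlib
    from dataclasses import dataclass

    def record_id(*fields):
        """Hash over the identifying fields, used as the primary ID."""
        return hashlib.sha256("|".join(map(str, fields)).encode()).hexdigest()

    @dataclass
    class TracingEdge:
        source: str
        target: str
        last_seen: int   # snapshot version of the last sighting ("lastSeen")

    @dataclass
    class Snapshot:
        version: int     # incrementing; highest = latest architecture
        date_from: str   # "datefrom"
        date_to: str     # "dateto"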

4.2. Graph-based visualization

With the support of GraphQL, we are able to provide stakeholders with a query language that enables them to retrieve all information about the IT landscape and to traverse the discovered IT landscape architecture. In order to allow data to be queried, resolvers have to be defined and implemented on the root level. Resolvers such as "database", "microservice", "host", etc., allow querying for an artifact or collections of artifacts and work in a manner similar to Remote Procedure Calls (RPC). The client application primarily calls the resolvers and retrieves a JSON-based response with all IT landscape elements. The IT landscape architecture itself is visualized as a directed graph with nodes and edges. Nodes represent the runtime artifacts. The node types are identified with different colors. Edges visualize the communication paths between runtime artifacts. The direction of the edges indicates request calls via TCP or HTTP. An example of this visualization is depicted in Figure 4: red indicates microservices, blue represents databases and green describes file storages.
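For illustration, a client could query the architecture as sketched below; the endpoint, resolver names and field structure are assumptions that merely mirror the resolvers described above.

    import requests

    # Hypothetical GraphQL endpoint and schema.
    QUERY = """
    {
      microservice(name: "checkout") {
        name
        host { address }
        communicatesWith { name }
      }
    }
    """

    def fetch_architecture(endpoint="http://localhost:4000/graphql"):
        resp = requests.post(endpoint, json={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        return resp.json()["data"]  # JSON-based response, as described above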

4.3. Architecture refinement support

Our frontend application not only visualizes the IT landscape architecture based on stored snapshots but also provides features for a manual refinement of the architecture. The user can adapt the following elements:

• Add new and remove legacy communication paths between runtime artifacts

• Add and remove runtime artifacts in case they are not instrumented

• Add annotations to runtime artifacts and communication paths

• Provide a detailed view of a runtime artifact incorporating more information

Figure 3 presents a screenshot of the refinement menu; it appears when the user clicks on an element or a communication path.

[Figure 3: Refinement menu for manual adaptation.]

4.4. Architecture comparison support

In order to analyze how the IT landscape emerged over time, we integrated a visual comparison between two architecture snapshots. Runtime artifacts and communication paths that occur in both snapshots are highlighted accordingly. The comparison of different snapshots enables a number of use cases: 1) it can be used to get instant feedback about architectural changes, like new or deleted runtime artifacts or communication paths; 2) it enables feedback regarding the fulfilment of architecture-related requirements; and 3) it supports the analysis of the emerging behavior of the architecture. Hence, architects are able to intervene in a timely manner in order to prevent bad design decisions.

[Figure 4: IT landscape visualization with different color codings: microservices in red, databases in blue, file storages in green.]

5. Evaluation

5.1. Environment description

The company that granted us access to its instrumented IT landscape for evaluating our concept operates in the automotive industry. The company hosts its microservice architecture on the cloud provider Amazon Web Services (AWS). The microservices use the NoSQL database DynamoDB for storing their transaction data. Data streams are realized via Kinesis streams. The streams are processed in Kinesis Firehose and forwarded to all subscribed microservices. The architecture itself provides services for other departments, realizing and contributing to various business use cases. The Enterprise Architect keeps the main responsibility for overall architectural design decisions and documentation.

AWS provides monitoring data in three different forms: 1) All runtime artifacts are registered in the AWS repository, and their health status is frequently reported. In order to remove a specific service, the artifact must be unregistered and deleted accordingly. 2) A further monitoring probe (CloudWatch) creates infrastructure and application logs that record failure events or hardware-related data like CPU, memory or network utilization. 3) The tool X-Ray enables tracing for analyzing requests from an end-to-end perspective. For automating the documentation of the microservice architecture, we combine the output of the monitors from the AWS repository and X-Ray via a unique ID.

Due to performance issues and configuration settings, the monitoring tools restrict the access to runtime data to a timeframe of 6 hours, and only the last 30 days can be retrieved. In addition, the repository only stores the set of artifacts that are currently running; no history is kept in the database. The accessed microservice-based IT landscape contains 279 runtime artifacts and 34 communication paths. The artifacts consist of 50 microservices, 46 DynamoDB tables, 8 Kinesis streams and 175 S3 buckets that represent simple data storage. The image on the right in Figure 8 shows the final architecture with all runtime artifacts. Due to space limitations, we only show those runtime artifacts that correspond to one product and ignore the rest.

5.2. Accuracy calculation

After implementing our prototype, we created 316 snapshots representing the last 79 days. The forwardDiscovery algorithm was first executed at snapshot version 203. After the 79 days, our architecture discovery result was validated by our evaluation partner. Modifications had to be performed on the Kinesis streams in order to achieve a complete and accurate model, which we call the base model. The evaluation itself is done in an iterative process. For each iteration, we compare the reconstructed architecture model against the base model and calculate the accuracy acc.

acc = ((TPE_i + TPN_i) − (FPE_i + FPN_i)) / (TPE_i + TPN_i + TNE_i + TNN_i)    (1)

where, for iteration i:
TPE = edges found in both models
TPN = nodes found in both models
FPE = edges found in the reconstructed model but not available in the base model
FPN = nodes found in the reconstructed model but not available in the base model
TNE = edges not found in the reconstructed model but available in the base model
TNN = nodes not found in the reconstructed model but available in the base model
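Computed over node and edge sets, Equation (1) can be implemented as follows; the set-based inputs are an assumption about how the models are compared.

    def accuracy(recon_nodes, recon_edges, base_nodes, base_edges):
        """Accuracy per Equation (1) for one iteration."""
        tpn = len(recon_nodes & base_nodes)  # TPN: nodes in both models
        tpe = len(recon_edges & base_edges)  # TPE: edges in both models
        fpn = len(recon_nodes - base_nodes)  # FPN: nodes only in reconstruction
        fpe = len(recon_edges - base_edges)  # FPE: edges only in reconstruction
        tnn = len(base_nodes - recon_nodes)  # TNN: nodes missed
        tne = len(base_edges - recon_edges)  # TNE: edges missed
        return ((tpe + tpn) - (fpe + fpn)) / (tpe + tpn + tne + tnn)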

Iteration 1: After analyzing the complete history, the backwardDiscovery algorithm discovered in total 279 correct runtime artifacts. 36 communication paths were unveiled, out of which 5 are reconstructed paths that are no longer present in the base model. 3 communications are missing. Hence, the accuracy is acc = 82.4%.

Iteration 2: After receiving the result of the backwardDiscovery, we executed the forwardDiscovery algorithm with a deletion threshold of τ = 168 hours (28 snapshots), representing a sprint length of 1 week. The algorithm thus ran for a total of 113 snapshots, which equals approximately 28 days. Overall, the second algorithm improves the accuracy to acc = 91.2%. After 28 days, 279 correct runtime artifacts were discovered in total. 33 correct communication paths were unveiled. 1 communication that is available in the base model was still not found, and 2 communications were marked as deleted although they are still available. A further 2 communications were correctly marked as removed.

We recognized after the second iteration that one overall threshold for all communication paths cannot be applied, as the communication behavior between runtime artifacts differs significantly. Hence, we modified the forwardDiscovery algorithm to ensure every communication path gets an individual threshold that is recalculated from snapshot to snapshot. We define the threshold as the maximum time in which the communication was not visible within the considered timeframe. As an example, Figure 6 illustrates the profile of two different communication paths: whereas communication 1 is marked as deleted after 11 snapshots (τ1 = N[247; 258]), communication 2 can be removed already after 7 snapshots (τ2 = N[259; 266]). The adapted algorithm could not be executed again due to resource restrictions of the evaluation partner; hence, we had to content ourselves with the last 28 days. Unfortunately, the present data pool did not allow an improvement of the accuracy.
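A minimal sketch of this per-communication threshold: given the sorted snapshot versions in which a path was observed, the threshold is the largest gap between consecutive sightings. The input format is an assumption.

    def individual_threshold(sightings):
        """Maximum number of snapshots a path went unseen (per-path τ)."""
        gaps = (b - a for a, b in zip(sightings, sightings[1:]))
        return max(gaps, default=0)

For example, individual_threshold([247, 258]) yields 11, matching the 11-snapshot threshold of communication 1 above.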

5.3. Deletion threshold discussion

The selection of an appropriate deletion threshold strategy is fundamental to keeping the architecture discovery accuracy high. In the following, we discuss different approaches for defining the threshold for deleting potentially removed communication paths:

Manual definition: The period of time for which the algorithm has to wait until specific communication paths should be highlighted as removed could be based on simple manual input. That means the user defines the number of days as the threshold based on experience. The advantage of this option is its simplicity. However, it is rather inflexible and probably does not conform to the actual development behavior.

Machine learning based: The drawbacks of the manual method could be mitigated by a machine learning approach. Hereby, we create a model that learns the behavior of the developers and predicts future communication removals accordingly. However, the creation of the prediction model is challenging, as it depends on the availability of labelled data, i.e. each communication removal must be recorded.

Event based: Specific events that describe a situation in which selected communication paths must be deleted can be leveraged for defining an event-based threshold. However, the threshold is then not a period of time anymore but rather a boolean value that triggers the deletion workflow. Advantages of this option are resource optimization and near-real-time documentation. On the contrary, the definition of possible events is challenging.

Tool support: In the last option, no threshold calculation is performed at all. The removal of obsolete communications is achieved through tool support. Based on an application that visualizes the IT landscape architecture, the developers can decide which communication path is obsolete and must be removed. That means the decision is outsourced to a manual task, which achieves a high accuracy if developers maintain the communications via the tool. As a disadvantage, no automation is achieved.

5.4. Snapshot comparison

Figure 8 illustrates the microservice-based IT landscape at two different snapshots. Version 145 was created during the execution of the backwardDiscovery algorithm, and version 316 was the last one created during forwardDiscovery. Again, we only visualize the runtime artifacts that correspond to one product. Both images unveil how the architecture has changed after 42 days. In total, 37 new runtime artifacts were added to the landscape.

6. Limitations

In the course of the development of this paper, a few assumptions have been made that necessarily lead to the following limitations: First, every runtime artifact must be instrumented. Otherwise, the IT landscape cannot be discovered completely. To this end, some capabilities of the monitoring tool have been taken as given, especially the reporting of distributed traces and the provision of APIs for reading runtime data.


[Figure 6: Communication behavior of two communication paths (Communication 1 and Communication 2) across snapshots 203-310. The y-axis represents the number of recognized communications within one day, i.e. 4 snapshots; a snapshot is created every 6 hours.]

[Figure 8: Comparison of two different snapshots of the same microservice-based IT landscape. Left: discovered architecture after snapshot 145. Right: final architecture discovered after snapshot 316 with manual refinement.]

Second, we mainly focus on runtime artifacts. If the related process is not running and the monitoring agent no longer provides data, the artifact is interpreted as deleted, which might be wrong.

Third, the evaluation was conducted within a timeframe of 79 days. Some of the concepts explained in Section 3 should therefore be tested over a longer period, especially when it comes to the proper inclusion in the workflows of teams and architects as well as the calculation of the correct deletion threshold.

Fourth, we evaluated the developed algorithms by calculating the related accuracies. However, we did not involve the stakeholders in evaluating the provided visualizations through structured or semi-structured interviews. Hence, the visualizations were not adapted accordingly. This is part of our future work.

Last but not least, we use CloudWatch and X-Ray as the monitoring providers, as those tools are natively integrated into AWS. However, our proposed algorithms were not evaluated on different environment configurations. Hence, in the current research phase, we cannot confirm the global applicability of our concept. Nevertheless, we are convinced that our approach is also applicable in other technological environments. For instance, further APM vendors like Dynatrace, AppDynamics, New Relic, or Instana, to name just a few, also provide powerful instrumentations to gain insights from an end-to-end perspective. Those tools expose several APIs for extracting the application repository and the communication behaviour between microservices, and yet show the same issues regarding IT landscape documentation described in Section 1. Hence, the usage of those tools in other technical environments should not reduce the applicability of the presented algorithms and visualizations.

7. Conclusion

The trend of developing larger applications in the form of microservices, as well as the accompanying agile practices, exposes new challenges to the practice of IT landscape documentation. In order to support this process, we developed two algorithms that continuously discover the IT landscape by analyzing runtime data. The results are stored as snapshots in our database and visualized via a graph library. Each kind of runtime artifact is coloured accordingly. We evaluated our prototype in the automotive industry. In total, we created 316 snapshots within 79 days. Our algorithms were capable of discovering the IT landscape architecture with an accuracy of acc = 91.2%. One of the biggest challenges we faced was the accurate reconstruction of the communication behavior between microservices.

The proposed approach works well if one important prerequisite is fulfilled: each runtime artifact has to be instrumented by a monitoring solution that supports distributed tracing and service discovery. In case one of those tools is not installed, the prototype will not become fully operational, which presents our most significant limitation.

References

[1] T. Dingsøyr, S. Nerur, V. Balijepally, and N. B. Moe, "A decade of agile methodologies: Towards explaining agile software development," Journal of Systems and Software, vol. 85, no. 6, pp. 1213-1221, 2012. Special Issue: Agile Development.

[2] J. Smeds, K. Nybom, and I. Porres, "DevOps: A definition and perceived adoption impediments," in Proceedings of the International Conference on Agile Processes in Software Engineering and Extreme Programming (XP 2015), pp. 166-177, Springer International Publishing, 2015.

[3] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley Signature Series (Fowler), Pearson Education, 2010.

[4] M. Fowler and J. Lewis, "Microservices," tech. rep., ThoughtWorks, 2014.

[5] W. Hasselbring and G. Steinacker, "Microservice architectures for scalability, agility and reliability in e-commerce," in 2017 IEEE International Conference on Software Architecture Workshops (ICSAW), pp. 243-246, April 2017.

[6] I. Hanschke, Enterprise Architecture Management einfach und effektiv: Ein praktischer Leitfaden für die Einführung von EAM. Carl Hanser Verlag GmbH & Co. KG, 2016.

[7] N. Alshuqayran, N. Ali, and R. Evans, "A systematic mapping study in microservice architecture," in International Conference on Service-Oriented Computing and Applications (SOCA), pp. 44-51, IEEE, 2016.

[8] S. Roth, M. Hauder, M. Farwick, R. Breu, and F. Matthes, "Enterprise architecture documentation: Current practices and future directions," in Wirtschaftsinformatik, 2013.

[9] M. Farwick, R. Breu, M. Hauder, S. Roth, and F. Matthes, "Enterprise architecture documentation: Empirical analysis of information sources for automation," in 46th Hawaii International Conference on System Sciences, pp. 3868-3877, Jan 2013.

[10] M. Hauder, F. Matthes, and S. Roth, "Challenges for automated enterprise architecture documentation," in Trends in Enterprise Architecture Research and Practice-Driven Research on Enterprise Transformation, pp. 21-39, Springer Berlin Heidelberg, 2012.

[11] M. Farwick, B. Agreiter, R. Breu, M. Haering, K. Voges, and I. Hanschke, "Towards living landscape models: Automated integration of infrastructure cloud in enterprise architecture management," 2010 IEEE 3rd International Conference on Cloud Computing, pp. 35-42, 2010.

[12] M. Buschle, M. Ekstedt, S. Grunow, M. Hauder, F. Matthes, and S. Roth, "Automating enterprise architecture documentation using an enterprise service bus," in Americas Conference on Information Systems (AMCIS), 2012.

[13] H. Holm, M. Buschle, R. Lagerström, and M. Ekstedt, "Automatic data collection for enterprise architecture models," Software & Systems Modeling, vol. 13, pp. 825-841, May 2014.

[14] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a large-scale distributed systems tracing infrastructure," tech. rep., Google, Inc., 2010.

[15] D. J. Lilja, Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, 2005.

[16] A. Alegria and A. Vasconcelos, "IT architecture automatic verification: A network evidence-based approach," in 2010 Fourth International Conference on Research Challenges in Information Science (RCIS), pp. 1-12, May 2010.

[17] L. O'Brien, C. Stoermer, and C. Verhoef, "Software architecture reconstruction: Practice needs and current approaches," Tech. Rep. CMU/SEI-2002-TR-024, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, 2002.

[18] F. Cuadrado, B. García, J. C. Dueñas, and H. A. Parada, "A case study on software evolution towards service-oriented architecture," in Advanced Information Networking and Applications Workshops (AINAW 2008), 22nd International Conference on, pp. 1399-1404, IEEE, 2008.

[19] A. van Hoorn, M. Rohr, W. Hasselbring, J. Waller, J. Ehlers, S. Frey, and D. Kieselhorst, "Continuous monitoring of software services: Design and application of the Kieker framework," 2009.

[20] A. van Hoorn, J. Waller, and W. Hasselbring, "Kieker: A framework for application performance monitoring and dynamic software analysis," in Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering (ICPE '12), New York, NY, USA, pp. 247-248, ACM, 2012.

[21] G. Granchelli, M. Cardarelli, P. Di Francesco, I. Malavolta, L. Iovino, and A. Di Salle, "MicroART: A software architecture recovery tool for maintaining microservice-based systems," in IEEE International Conference on Software Architecture (ICSA), 2017.

[22] G. Granchelli, M. Cardarelli, P. Di Francesco, I. Malavolta, L. Iovino, and A. Di Salle, "Towards recovering the software architecture of microservice-based systems," in Software Architecture Workshops (ICSAW), 2017 IEEE International Conference on, pp. 46-53, IEEE, 2017.

[23] S. Newman, Building Microservices. O'Reilly Media, Inc., 1st ed., 2015.
