D3.1 Data Virtualization architecture design

Project Acronym: DITAS
Project Title: Data-intensive applications Improvement by moving daTA and computation in mixed cloud/fog environmentS
Project Number: 731945
Instrument: Collaborative Project
Start Date: 01/01/2017
Duration: 36 months
Thematic Priority: ICT-06-2016 Cloud Computing
Website: https://www.ditas-project.eu/
Dissemination level: Public
Work Package: WP3 Data Virtualization
Due Date: M9
Submission Date: 30/09/2017
Version: 1.0
Status: Final
Authors: Vrettos Moulos, Achilleas Marinakis, George Chatzikyriakos (ICCS); David García Pérez, Jose Antonio Sanchez (ATOS); Pierluigi Plebani (POLIMI); David Bermbach, Marco Peise, Sebastian Werner (TUB); Maya Anderson (IBM); Aitor Fernández (IDEKO); Grigor Pavlov, Peter Gray (CS)
Reviewers: David Bermbach, Marco Peise, Sebastian Werner (TUB); Peter Gray (CS)
Licensing Information: This work is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) http://creativecommons.org/licenses/by-sa/3.0/

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 731945.


Version History

Version | Date | Comments, Changes, Status | Authors, contributors, reviewers
0.1 | 23/06/2017 | Initial version | Vrettos Moulos, Achilleas Marinakis, George Chatzikyriakos (ICCS)
0.2 | 27/07/2017 | Updates on sections 2, 3, 3.1, 3.3, 3.5, 4, 5, 6 | Vrettos Moulos, Achilleas Marinakis, George Chatzikyriakos (ICCS)
0.3 | 03/08/2017 | Contributions in section 3.4 | David García Pérez, Jose Antonio Sanchez (ATOS)
0.4 | 01/09/2017 | Contribution in section 6 | Aitor Fernández (IDEKO)
0.5 | 01/09/2017 | Contribution in sections 3.2, 4, 4.2 | David Bermbach, Marco Peise, Sebastian Werner (TUB)
0.6 | 13/09/2017 | Contribution in section 4.3 | Grigor Pavlov, Peter Gray (CS)
0.7 | 20/09/2017 | Contribution in sections 3.2, 4, 4.2, 5 | Pierluigi Plebani (POLIMI)
0.8 | 24/09/2017 | Contribution in sections 5, 6 | Maya Anderson (IBM)
0.9 | 25/09/2017 | Internal review version | Vrettos Moulos, Achilleas Marinakis, George Chatzikyriakos (ICCS)
0.91 | 27/09/2017 | Internal review comments | David Bermbach, Marco Peise, Sebastian Werner (TUB); Grigor Pavlov, Peter Gray (CS)
0.92 | 28/09/2017 | Addressed internal review comments | Vrettos Moulos, Achilleas Marinakis, George Chatzikyriakos (ICCS)
1.0 | 29/09/2017 | Final version for submission | Vrettos Moulos, Achilleas Marinakis, George Chatzikyriakos (ICCS)


Contents

Version History
List of Figures
List of tables
Executive Summary
1 Introduction
1.1 Glossary of Acronyms
2 Abstract VDC Blueprint
2.1 Internal Structure
2.2 Data Utility and Security Dimensions
2.2.1 Data Utility
2.2.2 Security
2.3 Abstract Technical and Business Properties
2.4 Components CookBook Appendix
2.5 Output Interface Details
3 Application Design Process with DITAS
3.1 Initial VDC Resolution Process
3.2 VDC Resolution at Renegotiation Process
3.2.1 Initialization phase
3.2.2 Renegotiation phase
4 Data Sources
5 Conclusions
6 References
ANNEX A: VDC Components
ANNEX A.1: VDC Blueprint Repository
ANNEX A.2: VDC Resolution Engine
ANNEX A.3: VDC Validator
ANNEX A.4: Data Utility Service
ANNEX A.5: Data Utility Evaluator @ VDC (DUE@VDC)
ANNEX A.6: Sample Data Generator
ANNEX A.7: Data Unified Access Functional
ANNEX A.8: Data Access Connector Functional
ANNEX A.9: VDC Editor


List of Figures

Figure 1: Abstract VDC Blueprint
Figure 2: Part of the structure of the JSON Schema that conforms a VDC Blueprint
Figure 3: Example of Node-RED flow
Figure 4: JSON Blueprint Security Example
Figure 5: Part of the structure of the JSON Schema for the Docker Container Specification
Figure 6: JSON CookBook Example
Figure 7: Automatic Deployment System
Figure 8: Part of the structure of the VDC Blueprint
Figure 9: Example of exposed JSON Tuple of a VDC
Figure 10: JSON Schema of the exposed tuple of a VDC
Figure 11: Initial VDC Resolution Process
Figure 12: Renegotiation of VDC Resolution Process
Figure 13: Resolution Engine Phases
Figure 14: Abstract VDC Blueprint Parts Analysis
Figure 15: Initialization process
Figure 16: Renegotiation Process
Figure 17: [Sequence Diagram] Renegotiation process
Figure 18: [Sequence Diagram] Blueprint Generation Process
Figure 19: [Sequence Diagram] Blueprint Retrieval
Figure 20: [Sequence Diagram] Blueprint Resolution Process

List of tables

Table 1. Acronyms
Table 2. Tentative Data Sources for DITAS Use Cases
Table 3. Potential Data Sources for the Use Cases
Table 4. VDC Blueprint Repository Functional Component
Table 5. VDC Blueprint Repository Interface
Table 6. VDC Resolution Engine Functional Component
Table 7. VDC Resolution Engine Interface
Table 8. VDC Validator Functional Component
Table 9. Data Utility Service Component
Table 10. Data Utility Service Interfaces
Table 11. Data Utility Evaluator @ VDC (DUE@VDC) Component
Table 12. Data Utility Evaluator @ VDC (DUE@VDC) Interfaces
Table 13. Sample Data Generator Component
Table 14. Data Unified Access Functional Component
Table 15. Data Access Connector Functional Component
Table 16. VDC Editor Component


Executive Summary

This deliverable describes the architecture design of the functional components that constitute WP3. It analytically describes their functionalities and the APIs they expose, both for interacting and exchanging information internally and for external use. The document puts emphasis on the Virtual Data Container (VDC), which is the key concept of WP3.

An initial outcome is the modelling of the VDC concept by defining the properties and functionalities that Data-Intensive Application (DIA) developers can exploit. As a result, the deliverable describes in detail the abstract VDC Blueprint, a JSON schema file which contains all the information related to the VDC. Additionally, this document defines how a VDC image can be instantiated in the DITAS Execution Environment (EE).

Furthermore, D3.1 includes the preliminary version of the DITAS DIA design architecture, namely the VDC Resolution Process. The aim of this process is to identify the VDC Blueprint that best fits the requirements imposed by the DIA developer, not only before the application is deployed but also during its runtime. Moreover, the deliverable highlights the interaction between the VDC Resolution components and the DITAS EE.


1 Introduction

The VDC provides an abstraction layer that takes care of retrieving, processing and delivering data with the proper quality level, while putting special emphasis on data security, performance, privacy, and data protection. The VDC, acting as a middleware, lets developers simply define the requirements on the needed data, expressed as data utility, and takes the responsibility for providing these data timely, securely, and accurately by hiding the complexity of the underlying infrastructure. The infrastructure could consist of different platforms, storage systems, and network capabilities. The orchestration of these services could be based on the Node-RED programming model1, which DITAS extends by introducing a business logic API.

1 Node-RED is a platform where developers build applications by wiring code blocks together. Each code block, also known as a 'node', is a configurable component which developers connect using a visual programming approach in order to create advanced processes, the 'flows'.

A VDC image is instantiated based on the corresponding VDC Blueprint. This is a JSON schema file that describes, in addition to the technical implementation details, all the abstract properties of the VDC (both business and technical). That part is important, as the VDC can be seen as a product with specific characteristics, which can be used as information metadata by the DIA developer. The Blueprint is published by the data administrator, who provides in it, among others, all the details about where the data sources are located, how to access them, and the interface exposed to the application. All this information is placed inside the five distinct parts that constitute the Blueprint: (i) its internal structure, (ii) the data utility & security dimensions, (iii) its abstract technical & business properties, (iv) its components cookbook appendix and (v) its output interface details. These parts are systematically described in section 2.

Keeping in mind that DITAS has the objective, via the VDC paradigm, to improve the productivity of developers in building, deploying, and managing DIAs, it is worth analyzing the issues they face when selecting data sources. Indeed, DIAs need to consume data coming from various types of data sources that belong to different categories. More specifically, the sources can be defined either as streaming or persistent, and thus the way they provide the data might differ. In one case a source could consistently keep feeding data to a topic where the data consumers are listening, while in another case the consumers are expected to perform queries in order to retrieve the data. In addition, the data sources vary according to the structure of the data that they expose. From this point of view, the data can be unstructured, which is the rawest form of data, like raw text; semi-structured, such as JSON, XML, etc., which have a consistent format; and structured, which is very well defined and supports full query capabilities.

The aforementioned heterogeneous and complex nature of the data sources imposes huge burdens on an application developer who aims to design and develop a DIA. It might be quite hard and time-consuming for the developer to modify and recompile the source code of the application and to implement wrappers in order to support different data source providers. Furthermore, an application that is exposed as a web service in a marketplace would face an increase in its total downtime, being obliged to redeploy each time its coupled data source changes the type, protocol and/or format of the delivered data. Another issue that legacy applications face is the veracity of the data, one of the 5 Vs challenges in the area of Big Data [1].

Indeed, a plethora of data sources are not reliable, accurate or even valid, a phenomenon that can destroy the reputation of the application. Moreover, the uncertainty of utilization in the Big Data era forces software developers to design their applications for extreme scenarios, leading to maximal application requirements [2][3].

Satisfying such requirements would increase the development cost, despite the rising number of data providers. Finally, as multi-source data retrieval is essential for the availability of DIAs, application developers need to be familiar (the so-called know-how) with the relevant APIs of the data source providers.

Having identified the major problems that legacy applications have to deal with from the data source perspective, this is where the VDC kicks in, aiming to address these problems partly or completely. More precisely, the VDC architecture hides the complexity of the data sources, undertaking the burden of retrieving, storing and, if required, processing the data regardless of its format, structure, etc. The VDC offers high agility to the DIA developers by minimizing the changes required when they need a slightly different data structure. This reduces the amount of specific knowledge a developer needs to develop a DIA. The proposed approach also focuses on the quality of the data by collecting real-time values from the application at runtime. This process evaluates the VDC instance that is used, and possibly replaces it with a better one, aiming to protect the reputation of the application from unreliable data source providers.

1.1 Glossary of Acronyms

Acronym Definition

AB Abstract (VDC) Blueprint

API Application Programming Interface

AWS Amazon Web Services

CAF Common Accessibility Framework

CEP Complex Event Processing

CLI Command-Line Interface

CRUD Create Read Update Delete

CSV Comma-Separated Values

DB Data Base

DIA Data-Intensive Application

DUE Data Utility Evaluator

DUS Data Utility Service


EE Execution Environment

ERP Enterprise Resource Planning

GUI Graphical User Interface

HDFS Hadoop Distributed File System

HTTP Hypertext Transfer Protocol

IP Internet Protocol

JSON JavaScript Object Notation

N/A Not Applicable

QoS Quality of Service

RB Resolved (VDC) Blueprint

REST Representational State Transfer

SDK Software Development Kit

SLA Service-Level Agreement

SQL Structured Query Language

TBC To be Confirmed

TBD To Be Decided

TLS Transport Layer Security

VDC Virtual Data Container

VDM Virtual Data Manager

WP Work Package

XML Extensible Markup Language

Table 1. Acronyms


2 Abstract VDC Blueprint

Conceptually, the VDC is the Docker container for Data Virtualization, which aims to perform data cleansing, data correlation and data transformation as data moves from/towards the data sources. The VDC provides the mechanisms (e.g., libraries, tools) to enable an application to access the data, while being oblivious to the runtime concerns of how to access them in a secure and efficient manner. A VDC Blueprint is a JSON file, structured according to the VDC Blueprint JSON schema, that captures all the properties of the VDC and therefore includes, among others, the properties of all the data sources that are represented through this specific VDC.

The Blueprint (figure below) consists of five separate parts that contain all the necessary information for the VDC to transform from a file (Blueprint) to an image (at deployment time) and then to an instance (at runtime) (see section 3 of D4.1 [4]). Each part is described in one of the following sections. Most of these sections begin with a figure that presents an overview of the elements that are analyzed there.

Figure 1: Abstract VDC Blueprint

2.1 Internal Structure

[Figure 2 depicts the Internal Structure part of the Blueprint schema, comprising the elements Flow, Components, Interconnections, Data Input Interfaces and Exposed Data Tuple.]

Figure 2: Part of the structure of the JSON Schema that conforms a VDC Blueprint


The Internal Structure is a detailed description of the Node-RED model that orchestrates the VDC lifecycle.

The Internal Structure contains:

a. General Information (name, type, schema) about the data sources represented by the VDC, as well as all technical details on how to access the sources (e.g., the APIs) and the format in which the data are provided (Data Input Interface node).

b. Information about the Data Transformations, Data Utility and Data Security components. Given that these components are exposed as web services, this field refers to the REST API that is used by the Node-RED server in order to invoke them.

c. The output tuple and the schema of the data exposed by the VDC to the application (further described in section 2.5).

d. The Node-RED model that controls the data flow.

The following figure depicts a possible flow in Node-RED, highlighting the interconnections between the data source, the VDC components and the application:

Figure 3: Example of Node-RED flow
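For illustration only, such a flow, exported in Node-RED's JSON format, might look roughly like the following; the node ids, names and URLs are hypothetical, and editor-specific fields (coordinates, flow ids) are omitted:

[
  { "id": "in",      "type": "http in",       "name": "exposed VDC method",  "url": "/temperature", "method": "get", "wires": [["source"]] },
  { "id": "source",  "type": "http request",  "name": "query data source",   "method": "GET", "url": "https://datasource.example.com/measurements", "wires": [["utility"]] },
  { "id": "utility", "type": "function",      "name": "invoke Data Utility / Security services", "func": "return msg;", "wires": [["out"]] },
  { "id": "out",     "type": "http response", "name": "exposed data tuple" }
]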

2.2 Data Utility and Security Dimensions

The VDC Blueprint data sources are also described in terms of the Data Utility dimensions and Security capabilities already introduced in D2.1 [5].

2.2.1 Data Utility

Data Utility dimensions are dynamic and mainly used to evaluate the relevance of the data offered by the VDC to the applications that are willing to use such a VDC. In its general form, a dimension included in the VDC Blueprint under the Data Utility dimensions is a tuple composed of the following six elements3:

• name: the name of the dimension
• granularity: the level at which the metric is computed; it can be at source level or at attribute level, referring to the data exposed by the VDC as seen by the application
• metric: the formula/expression used to compute the dimension
• value: the latest known value for the dimension
• unit of measure
• timestamped value: when the current value has been computed

To give an example, here is a list of possible dimensions:

● <completeness, source, #values in the source/#expected values, 90, %, 1504259797>
Completeness refers to the set of dimensions concerning the data quality. In this case, the dimension is computed for the entire data source.

● <timeliness, attribute(@name='address'), 1-(currency/volatility), 0.1, -, 1501581365>
Also in this case the dimension concerns the data quality, but here it refers to a specific attribute exposed by the VDC Blueprint.

● <reliability, source, #requests satisfied/#total requests, 99, %, 1582384365>
Differently from the previous examples, reliability belongs to the set of QoS dimensions.

It is worth noting that, when the data administrator creates a VDC Blueprint, only a portion of the dimensions characterizing the VDC can be evaluated. For instance, reliability can be computed only once VDC instances have been created, as it is based on the invocations from the application. Conversely, completeness can be calculated in advance, as it is based on the data sources to which the VDC is connected.
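Purely for illustration, the first of these tuples could be serialized in the Blueprint roughly as follows (the field names are hypothetical and do not prescribe the final schema):

{
  "name": "completeness",
  "granularity": "source",
  "metric": "#values in the source / #expected values",
  "value": 90,
  "unit": "%",
  "timestamp": 1504259797
}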

As described in more detail in D4.1 [4], the "Data Utility Evaluator @ VDM" is in charge of collecting all the information at runtime to update the values of the dimensions included in the set of Data Utility dimensions. As mentioned in section 3.2, a modification of the values of these dimensions could drive the renegotiation phase, in case the values no longer satisfy the application requirements.

3 At this stage, we limit the definition of dimensions to this set of six elements. In the next iteration of the project, possible extensions could also include evaluations of the reliability of the measurement, i.e., confidence.

2.2.2 Security

Security and Monitoring capabilities must be defined by the data administrator as part of the VDC setup. Both will have varying levels of detail regarding what should be secured, monitored and tracked, and what kind of software should be used by that particular VDC. The VDC can also have facilities (components) to define what should happen with data if it leaves a legal boundary and how to detect these boundaries. DITAS will furthermore provide a way to secure data in transit and monitor access to personal information.

Conceptual Perspective:

Most of these security aspects must be configured and set up in advance by the data administrator, as a VDC primarily exposes the data to the application while hiding the complexity of what is happening behind the scenes. The application is not aware of the data sources connected to the VDC; it only sees the output of the VDC.

Conceptually, the security dimensions cannot be measured; we can only observe the concrete configuration options chosen (e.g., a specific TLS cipher suite for communication).

Each data source comes with its own set of security capabilities. These are static and cannot change. What can change is the concrete option chosen from this set of capabilities, or the data source used within the same VDC, and this may differ for each instance of the VDC. As such, the security information for a VDC will likely be split between static information, based on the data sources used, and runtime information, known only after a VDC has been instantiated and is actively used.

Technical Perspective:

Security and Data Monitoring will most likely include configuration or modification of Node-RED4. We believe it will be most efficient to create custom Node-RED components for monitoring. These components then ship relevant data to the respective data sink. We also plan to provide components that allow the data administrator to restrict the output modules to secure components only. Furthermore, we propose a security proxy between any Node-RED application and the consuming application. It must, however, still be decided which protocol these applications will use. This proxy could be a separate component or part of the Common Accessibility Framework (CAF).

4 At this stage, we assume that Node-RED is used as the core framework for the VDC implementation. We created the technical security perspectives based on that assumption. We will have to re-evaluate and modify this section in case we decide to use something else instead.

Configuration Outline:

The format might be a part of the JSON Blueprint (figure below) with the following properties, but might also be subject to change once we have a more defined prototype at a later stage:

{
  ...
  container_information : {
    append*: {...},
    nodeRed*: {...}
  },
  ...
  security : {
    policy: {...},
    LoggingServer: [....],
    <tbd>
    dataPolicy: {...}
  },
  monitor : {
    policy: {...},
    LoggingServer: [....],
    <tbd>
  }
  ...
}

*Optional

Figure 4: JSON Blueprint Security Example

security/monitor:

• append: installation instructions for the Docker container, needed to install our security and monitoring components
• policy: detailed description of the included components
• LoggingServer: addresses of the logging servers
• nodeRed: installation instructions for the Node-RED installation, including configuration instructions
• dataPolicy: instructions for detecting whether data is accessed or moved across legal boundaries, and instructions as to what should happen to the data (anonymized, forbidden, etc.)
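As a purely illustrative sketch (the keys and values are hypothetical and not part of an agreed schema), a dataPolicy entry could look like this:

dataPolicy : {
  "boundary" : "EU",
  "onViolation" : "anonymize"
}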

2.3 Abstract Technical and Business Properties

One of the main goals of the DITAS project is to let the DIA developer simply define application constraints in terms of what he expects from the data sources, and then rely on a VDC to manage (retrieve or store) the needed data. These constraints (abstract properties) will be used as a binding agreement between the data administrator and the DIA developer. After the constraints have been defined, the challenge that arises is how to select the suitable VDC Blueprint from a plethora of available ones. The case becomes more complex if we consider that, for a single data source, several VDCs could be published by different providers. To deal with this challenge, the data administrators include in the Blueprint that they publish all the abstract technical and business properties of the VDC and form it as a product, so that it can be advertised in the DITAS marketplace. These properties are used in the context of the VDC Resolution process during the initialization phase of the DIA lifecycle. In that phase, they are compared with the developer's constraints. Therefore, by their definition, these properties are meant to be static, as they are part of the identity of the VDC "product" that a data administrator is offering within the DITAS ecosystem.

The rationale behind the abstract technical and business properties of the VDC, in contrast to the dimensions of section 2.2 which are dynamic (excluding security), is that these values are produced from tests that the data administrator initially performed, based on scenarios relevant to the market he was focusing on.

The abstract technical and business properties of the VDC are fields in the JSON file of the Blueprint.
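For illustration only, such fields could look like the following (the property names and values are hypothetical and do not prescribe the final Blueprint schema):

{
  ...
  abstract_properties : {
    availability : "99 %",
    responseTime : "200 ms",
    pricing : { model : "pay-per-request", price : "0.001 EUR/request" },
    termsOfUse : "research purposes only"
  },
  ...
}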

2.4 Components CookBook Appendix

[Figure 5 depicts the Docker Container Specification part of the Blueprint schema, comprising the elements Container image base, Runtime dependencies and CookBook type.]

Figure 5: Part of the structure of the JSON Schema for the Docker Container Specification

A CookBook5 is a group of "recipes" that describe a series of applications and utilities (resources) and how they should be configured. In particular, it describes the packages and the libraries that should be installed, and the services/web services/daemons that should be running. Moreover, it contains the configuration files, or any other file in general, that should be overwritten.

5 We are borrowing the term from the Chef Continuous Automation tool [6], but this does not mean that we are going to use that specific format in the project.

VDCs will run in an environment based on containers. That allows the isolation, security, and mobility that we need in order to be in line with the objectives of DITAS. To allow that, a VDC Blueprint will contain requirements about the runtime dependencies that are needed in order to access the different data sources that the VDC exposes.

The format will be a part of the Abstract VDC Blueprint with the following properties:

{
  ...
  container_information : {
    base : <container image base>,
    runtime_dependencies : <recipe with dependencies to install>,
    cookbook_type : <type of the recipe>
  }
  ...
}

Figure 6: JSON CookBook Example

• Container image base: The container6 will have a base, which will be a container image located in a repository accessible to the DITAS SDK.

• Runtime dependencies: On top of the base image, a recipe based on some existing automatic deployment system will be applied to install the rest of the runtime dependencies (be it software or configuration files) to fulfill the runtime requirements needed by the VDC, described in deliverable D4.1 [4].

• CookBook type: Given the myriad of automatic deployment tools, this element indicates which tool the recipe is meant for. It must be a tool type supported by the DITAS SDK.
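Purely as an illustration of these three fields, a filled-in container_information block could look as follows (the image name, recipe location and tool type are invented examples, not prescribed values):

{
  ...
  container_information : {
    base : "ubuntu:16.04",
    runtime_dependencies : "https://repo.example.org/recipes/vdc-connector.tar.gz",
    cookbook_type : "chef"
  }
  ...
}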

With this information, we can generate a container image with the software already present that will be deployed as needed either in the cloud or at the edge.

Figure 7: Automatic Deployment System

When a new VDC instance needs to be instantiated, the Execution Engine will get the image from the VDC Image repository and deploy a new container where needed.

6 As commented in the deliverable D4.1 [4], the execution environment will be based on Docker containers [7].


2.5 Output Interface Details

[Figure 8 depicts the Output Interface part of the Blueprint, comprising the elements Schema, Protocol and Methods.]

Figure 8: Part of the structure of the VDC Blueprint

This part refers mainly to the methods that the VDC exposes. The goal is to help the developer so that he can access data that are coming from a variety of diverse sources through a unified output interface.

Regarding this interface, the fields within the JSON file of the VDC Blueprint describe:

a. The schema of the data tuple offered by the VDC to the application, expressed as a JSON schema. For instance, the following image depicts a JSON example of a tuple containing a table of temperatures, as well as the dates and the location to which the measurements refer (the corresponding JSON schema for this example is shown in the next figure).

Figure 9: Example of exposed JSON Tuple of a VDC
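For illustration, such a tuple might look roughly as follows (the location, dates and values are invented):

{
  "location": "Madrid",
  "measurements": [
    { "date": "2017-08-01", "temperature": 28.5 },
    { "date": "2017-08-02", "temperature": 30.1 },
    { "date": "2017-08-03", "temperature": 29.7 }
  ]
}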


Figure 10: JSON Schema of the exposed tuple of a VDC
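Assuming the illustrative tuple above, the corresponding JSON Schema could be along these lines (again a sketch, not the normative schema):

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "location": { "type": "string" },
    "measurements": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "temperature": { "type": "number" }
        },
        "required": ["date", "temperature"]
      }
    }
  },
  "required": ["location", "measurements"]
}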

b. Details about the output protocol through which the application interacts with the VDC. This interaction will take place via RESTful web services.

c. The methods of the CAF that are exposed by its API.


3 Application Design Process with DITAS

One of the key elements in the DITAS SDK [8] is the VDC Resolution process, which takes place in both the initialization (section 3.1) and the renegotiation (section 3.2) phases of a DIA lifecycle. Its goal is to select, from the VDC Blueprint Repository, the most suitable VDC that matches the application's requirements and constraints.

The VDC Blueprint Repository is a database where all the Abstract Blueprints are stored in advance by the data administrators that have decided to rely on the DITAS platform to publish their data. In order to retrieve and query these Abstract Blueprints, we introduce the Resolution Engine component, which is used to figure out whether there is a candidate VDC that fulfils the criteria of the DIA developer.

As the developer might not know in detail how the performance of a VDC instance is evaluated, or even the exact limitations of the application, the DITAS SDK introduces properties which are abstract enough that the developer can describe them based on the needs he expects the application to have.

During the initialization phase (section 3.1), the properties specified by the DIA developer are compared with the technical and business properties, which are by default abstract. An example is shown below.

Figure 11: Initial VDC Resolution Process
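To make the comparison concrete (the property names and values are invented for illustration), the developer might express constraints such as the following, which the Resolution Engine matches against the abstract technical and business properties advertised in the candidate Blueprints:

{
  "availability" : ">= 99 %",
  "responseTime" : "<= 200 ms",
  "price" : "<= 100 EUR/month"
}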

However, during the renegotiation phase (section 3.2), the properties are re-evaluated by the data utility and security components based on real-time analytics. As these components examine several characteristics, named dimensions (e.g., jitter, cryptographic algorithm, key length, etc.), a relation mapping dimensions to each property has to be used in order to derive the abstract values.

An example is depicted in the figure below, where the dimensions "Latency", "Up-Time" and "Time to Recovery" of the current status of the VDC produce the single abstract property "Availability". Having those abstract property values, the Resolution Engine can match them with the DIA developer's requirements, defined in the initialization phase, as they are of the same type.

Figure 12: Renegotiation of VDC Resolution Process
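A purely illustrative sketch of such a mapping could be the following (the dimension values, the derived value and the field names are hypothetical):

{
  "abstract_property" : "Availability",
  "derived_from" : {
    "Latency" : "35 ms",
    "Up-Time" : "99.2 %",
    "Time to Recovery" : "40 s"
  },
  "value" : "99 %"
}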

3.1 Initial VDC Resolution Process

The process is triggered when the developer, through the DITAS SDK, specifies the application requirements in order to find a suitable VDC. The SDK then forwards the request with the requirements to the Resolution Semaphore, which orchestrates the different phases of resolution (technical and business), as illustrated in the figure below.

Figure 13: Resolution Engine Phases


The DIA developer imposes the threshold values of the QoS parameters and the constraints that are relevant to the DIA. During the resolution process, these values are compared to the abstract technical and business properties of the available VDC Blueprints, modeled as described in section 2. These properties are defined by the data administrators and are used to advertise the VDC, and therefore the data sources that are exposed through it, as a marketable product.

The aforementioned constraints imposed by the developer are categorized by the Resolution Semaphore into business and technical ones. The latter are provided as input to the Technical Resolver (phase A), while the former are handled by the Business Resolver (phase B).

Phase A

The preliminary VDC Blueprint selection is based on the topic to which each VDC is related, e.g., VDCs exposing weather data. The Technical Resolver (phase A) performs a functional analysis and resolution, based on the inserted technical requirements and constraints, as well as on the preliminary VDC Blueprint selection. This process results in the first set of Abstract (VDC) Blueprints (ABs), which will be further narrowed down in order to reach an optimal (or close to optimal) solution. This component has to communicate with the VDC Blueprint Repository to load the ABs' primary solution set. Based on the functional characteristics inserted by the DIA developer, the Technical Resolver reduces this set and also performs a Blueprint dependency control in order to identify existing dependencies, in terms of data source support and elasticity needs. The result of the above control is the list of ABs, which is then returned to the Resolution Semaphore in order to support the next step of the resolution.

Phase B

From a business perspective, the resolution refers to identifying those products that fulfil certain business constraints, especially related to pricing and SLAs. Starting with the ABs' list returned from the Technical Resolver, the Business Resolver must also take into consideration pricing and business constraints (phase B). The final product selection is then made based on the desired performance and characteristics as defined by the DIA developer. When the final selection is made, a "contract" is generated, using as input the DIA developer requirements/constraints that were provided.

Phase C

This contract contains the identification number of the selected product, the description of the terms of use, the pricing model according to which the developer will be charged, as well as the SLA terms (phase C).

There are two main reasons that led to the segregation of the resolution process into technical and business. The first reason is of a computational nature; by segregating these two processes the system becomes more distributable and scales more efficiently. Moreover, the computational complexity of the multi-parameter algorithm that filters the Blueprints is reduced significantly. The second reason behind the segregation is the renegotiation phase, in which the technical filtering is not performed, as only the business properties from the data utility are under consideration.


The interactions among the DIA developer and the different components that have a key role in the search, selection and abstract resolution processes are shown in Figure 13.

Moreover, the Resolution Semaphore triggers the WP4 process, which is the deployment phase, by sending the Abstract Blueprint to the Virtual Data Manager (through the DIA deployment tool) in order to transform the Abstract Blueprint into the Resolved (VDC) Blueprint (RB), which additionally contains all the necessary deployment information (e.g., IP address, port, credentials, etc.). This interaction is described in sections 3.2.1 and 3.2.2.

3.2 VDC Resolution at Renegotiation Process

During the runtime of the DIA, the Data Utility Evaluator @ VDM (a component of the Execution Environment, see deliverable D4.1 [4]) evaluates the performance of the created VDC instances in terms of the Data Utility dimensions described in section 2.2.1. In contrast to the abstract technical and business properties of the VDC (section 2.3), the values of these dimensions can change. For instance, the response time reported in the VDC Blueprint is obtained by analyzing the behavior of different VDC instances created and managed in the past for other applications. Thus, it might happen that the actual QoS differs significantly and the application could require replacing the VDC with a better one in terms of performance.

The goal of the Resolution Process at runtime is to identify candidate VDCs that might better fit the application, always with respect to the requirements/constraints that have been imposed by the DIA developer. Even if another VDC is proposed and thus deployed, the SLA that has already been signed will not change, as it reflects the developer's needs.

The goal of this section is to analyze the interconnections and the interfaces between WP3 and the other WPs' (mainly WP4) functional components that interact during a DIA lifecycle. Before particularizing this interaction depending on the phase of the DIA (initialization or renegotiation), it is useful to recall the parts of the VDC Blueprint, but this time correlating them to the various processes of the DITAS project. This mapping is depicted in the figure below, where the Blueprint parts "Internal Structure", "Components CookBook Appendix" and "Output Interface Details" are used by WP4 for the instantiation of the Docker containers; the "Data Utility & Security Dimensions" part is introduced by WP2, which analyzes the efficiency of the relationship between the application and the VDC; and the final part, named "Abstract Technical and Business Properties", is used by WP3 and the Resolution Engine.


Figure 14: Abstract VDC Blueprint Parts Analysis

3.2.1 Initialization phase

During the initialization phase of the DITAS application lifecycle, the Resolution Engine provides the VDM component with the best candidate VDC Blueprint, along with the necessary deployment details (CookBook, Node-RED model). The Resolution Engine will also deliver the technical & business requirements of the application. These values will be used by the SLA Manager in order to generate the contract. The relevant figure is presented below:

Figure 15: Initialization process


3.2.2 Renegotiation phase

During the renegotiation phase, the Data Utility component interacts with the VDC Resolution Engine. The goal is to evaluate the performance of the VDC that was selected during the initialization phase and is currently in use, and then to propose another VDC (if sensible) that might better fit the application's needs.

Figure 16: Renegotiation Process

Figure 17: [Sequence Diagram] Renegotiation process


4 Data Sources

The scope of this section is to identify the types of data sources with respect to the initial requirement analysis. In the context of DITAS, a data source is managed by the Data Administrator, whose role is to create the VDC Blueprint. Once the VDC image is instantiated within the DITAS EE, it acts as the conduit between the application and the data source, thus making the data available at runtime. From an architectural point of view, we can identify data sources with different characteristics:

1. Data producers: the data could be stored either persistently in a DB (relational or NoSQL), or could be available in real time using message-oriented protocols (publish/subscribe) or streaming-oriented ones.

2. Data storages: the data could be stored either in the cloud or at the edge, and the Application Developer is able to create, read, update, or delete (CRUD) them.

From the requirement analysis [8], the data in the cloud is typically stored in HDFS or an object store in CSV, JSON or efficient Parquet [9] format to be used in data lake scenarios for data intensive analytics using Spark.

Data Sources in DITAS

The following is a tentative list of sources that will be used in the DITAS demonstrators and will therefore be supported within WP3.

Place | Storage Type | Access to the Storage | Response
Cloud | MongoDB | Non-official, custom Streaming API | Semi-Structured JSON
Cloud | MongoDB | Non-official, custom REST API | Semi-Structured JSON
Edge | MySQL | Direct access to DB | Structured Datasets
Cloud | Elasticsearch | REST API | JSON
Cloud | Swift | REST API | JSON
Edge/Cloud | Minio | REST API | JSON

Table 2. Tentative Data Sources for DITAS Use Cases


The following data sources may be added to the use cases once these are more precisely defined:

Place | Storage Type | Access to the Storage | Response
Edge | InfluxDB | Direct access to DB | Structured Datasets
Cloud | Data Lake in AWS S3 | REST API | Raw JSON/CSV Files

Table 3. Potential Data Sources for the Use Cases

The abstraction layer provided by the VDC will make a great difference when developing DIAs, providing the developer with a trustworthy data gathering mechanism and offering a common query interface that relieves him from the burden of developing and maintaining the connection and data gathering for every single source.


5 Conclusions

In order to support the design and development of DIAs, DITAS proposes the VDC paradigm, a middleware between the DIAs and the data sources. Within this document an initial approach to the VDC concept is provided. According to this approach, a VDC, seen as a product in the DITAS marketplace, is advertised through the Abstract VDC Blueprint, which contains its characteristics as well as details about the data sources linked to it.

Having defined the VDC model, the deliverable deals with the description of the DIA design architecture. In particular, most of the components have been identified and then analyzed in terms of functionalities and interfaces. The core component of the DIA design process is the VDC Resolution Engine, which enables the DIA developer to select an initial VDC Blueprint by matching his requirements with the capabilities of the VDC. The VDC Resolution process is of high importance given the fact that, for just a single data source, many VDCs could be published, belonging to different providers.

As one of the DITAS main goals is to develop a framework for accessing data coming from different and heterogeneous data sources, the document attempts to categorize the sources based on various criteria, derived from the requirements and needs imposed by the DITAS real-world use cases.


6 References

[1] Anil Jain. "The 5 Vs of Big Data". https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/. Last visited on 30 September 2017.

[2] Zhao et al. "FusionFS: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems". 2014 IEEE International Conference on Big Data. http://ieeexplore.ieee.org/abstract/document/7004214/?reload=true

[3] Rabl et al. "Solving big data challenges for enterprise application performance management". Proceedings of the VLDB Endowment, Volume 5, Issue 12. August 2012.

[4] Deliverable D4.1 of DITAS. "Execution environment architecture design". DITAS Consortium. September 2017.

[5] Deliverable D2.1 of DITAS. "DITAS Data Management – first release". DITAS Consortium. September 2017.

[6] Chef Continuous Automation for Apps and Infrastructure. https://www.chef.io/. Last visited on 30 September 2017.

[7] Docker, what is a container. https://www.docker.com/what-container. Last visited on 30 September 2017.

[8] Deliverable D1.1 of DITAS. "Initial architecture document with Market Analysis, SotA refresh, and Validation Approach". DITAS Consortium. June 2017.

[9] IBM Redbook. "Data Integration in the Big Data World Using IBM InfoSphere Information Server". http://www.redbooks.ibm.com/technotes/tips1265.pdf. Last accessed 30 September 2017.


ANNEX A: VDC Components

This appendix contains explanations and descriptions of the components related to the VDC.

ANNEX A.1: VDC Blueprint Repository

Component name: VDC Blueprint Repository
Description: This is a repository with existing VDC Blueprints that can be used as the base by the data administrator in order to define VDC Blueprints. A Blueprint captures all the properties of a VDC, including the description of the data source (i.e., name, type, schema, formats, etc.), the data source properties as expressed by the data utility specification (e.g., data quality, availability, QoS, etc.), the data source runtime information (e.g., IP addresses, credentials, etc.) as well as security and privacy related information.
Inputs: Receives information from the VDC Validator and the VDC Resolution Engine
Input mechanism: REST
Outputs: N/A
Output mechanism: N/A
Implementation language (if code): TBC
Requirements: JSON Schema
Storage: Semi-Structured DB

Table 4. VDC Blueprint Repository Functional Component

The VDC Blueprints, which are structured in JSON format, will be stored in the VDC Blueprint Repository.

Sequence Diagrams

Figure 18: [Sequence Diagram] Blueprint Generation Process

The VDC Resolution Engine component interacts with the VDC Blueprint Repository in order to retrieve the best VDC Blueprint candidate, according to the application requirements. This interaction is shown in the following sequence diagram:


Figure 19: [Sequence Diagram] Blueprint Retrieval

Interface

The VDC Blueprint Repository communicates with other components (VDC Validator, VDC Resolution Engine) through an HTTP REST interface, whose details are presented in the following table (the column named “Component” identifies which functional component performs the “Operation” written in the first column):

Operation | Functionality | Component | MongoDB
CREATE | Store a new VDC Blueprint | VDC Validator | insert
READ | Retrieve VDC Blueprints | VDC Resolution Engine | find
UPDATE | Update an existing VDC Blueprint | VDC Validator | update
DELETE | Delete an existing VDC Blueprint | VDC Validator | remove

Table 5. VDC Blueprint Repository Interface
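To make the mapping between the REST operations of Table 5 and the MongoDB calls concrete, the following minimal Java sketch uses the official MongoDB driver. The database and collection names, the identifier field and the driver methods (insertOne, find, replaceOne, deleteOne, corresponding to the insert/find/update/remove operations listed above) are illustrative assumptions, not the definitive repository implementation.

import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import static com.mongodb.client.model.Filters.eq;

public class BlueprintRepositorySketch {

    private final MongoCollection<Document> blueprints;

    public BlueprintRepositorySketch(String mongoUri) {
        MongoClient client = MongoClients.create(mongoUri);
        // "vdc" database and "blueprints" collection are illustrative names only
        this.blueprints = client.getDatabase("vdc").getCollection("blueprints");
    }

    // CREATE: invoked by the VDC Validator once a Blueprint has passed validation
    public void create(String blueprintId, String blueprintJson) {
        blueprints.insertOne(Document.parse(blueprintJson).append("_id", blueprintId));
    }

    // READ: invoked by the VDC Resolution Engine to retrieve a candidate Blueprint
    public Document read(String blueprintId) {
        return blueprints.find(eq("_id", blueprintId)).first();
    }

    // UPDATE: replace an existing Blueprint with a revised version
    public void update(String blueprintId, String blueprintJson) {
        blueprints.replaceOne(eq("_id", blueprintId), Document.parse(blueprintJson));
    }

    // DELETE: remove an obsolete Blueprint
    public void delete(String blueprintId) {
        blueprints.deleteOne(eq("_id", blueprintId));
    }
}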

ANNEX A.2: VDC Resolution Engine

Component name: VDC Resolution Engine

Description: The component enables the application developer to find an initial VDC Blueprint that matches the application requirements with the capabilities embedded in the VDC Blueprint. Furthermore, during the renegotiation phase, its role is to propose a VDC Blueprint that might be better suited to the current situation, based on real-time analytics of the VDC that is currently deployed.

Inputs: Receives information from the Application Developer and the Data Utility Service (Task 2.1)
Input mechanism: REST
Outputs: Best candidate VDC Blueprint
Output mechanism: REST
Implementation language (if code): Java
Requirements: Task 2.1 output
Storage: N/A

Table 6. VDC Resolution Engine Functional Component

Sequence Diagram

Figure 20: [Sequence Diagram] Blueprint Resolution Process

The diagram above depicts how the Application Developer defines the technical and business requirements of the application, using a DITAS User Interface. This is part of the initialization process and is necessary for the Resolution Engine to find the most suitable VDC Blueprint.

Interface

The VDC Resolution Engine communicates with the components/roles mentioned above through an HTTP REST interface, whose details are presented in the following table (the column named “Component/Role” identifies who calls the “Method” written in the first column):

Method | Functionality | Component/Role
PUT/POST | Define Technical & Business Requirements | Application Developer
PUT/POST | Provide the updated values of the Data Utility Dimensions of the VDC instance | Data Utility Service

Table 7. VDC Resolution Engine Interface
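The resolution logic itself is not prescribed by this deliverable; the following Java fragment is only a sketch of how the Resolution Engine could rank candidate Blueprints against the requirements submitted by the Application Developer. The requirement representation (minimum acceptable values per data utility dimension) and the scoring function are hypothetical.

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ResolutionEngineSketch {

    // Hypothetical view of a candidate: the Blueprint id plus the data utility values it advertises
    public record Candidate(String blueprintId, Map<String, Double> utility) {}

    // Requirements are expressed as minimum acceptable values per utility dimension,
    // e.g. {"availability" -> 0.99, "accuracy" -> 0.90}
    public Optional<Candidate> resolve(List<Candidate> candidates, Map<String, Double> requirements) {
        return candidates.stream()
                // keep only Blueprints that satisfy every stated requirement
                .filter(c -> requirements.entrySet().stream()
                        .allMatch(r -> c.utility().getOrDefault(r.getKey(), 0.0) >= r.getValue()))
                // among those, pick the one with the highest overall utility (naive sum as a placeholder score)
                .max(Comparator.comparingDouble(c -> c.utility().values().stream()
                        .mapToDouble(Double::doubleValue).sum()));
    }
}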

ANNEX A.3: VDC Validator

Component name: VDC Validator

Description: This component takes as input the VDC Blueprint in JSON format, as defined by the Data Administrator, and validates it against the JSON schema that has been developed to describe a VDC (chapter 2).

Inputs: Receives the VDC Blueprint, in JSON format, from the Data Administrator

Input mechanism: REST

Outputs: Stores the VDC Blueprint, if it is valid, in the VDC Blueprint Repository

Output mechanism: REST

Implementation language (if code): Java

Requirements: N/A
Storage: N/A

Table 8. VDC Validator Functional Component
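As an illustration of the validation step, the sketch below uses the everit-org JSON Schema library for Java; the library choice and the schema file name are assumptions, since the implementation details are still to be confirmed.

import java.io.InputStream;
import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONObject;
import org.json.JSONTokener;

public class ValidatorSketch {

    public boolean isValid(String blueprintJson) {
        // Load the JSON schema that describes a VDC Blueprint (placeholder file name)
        InputStream schemaStream = getClass().getResourceAsStream("/vdc-blueprint-schema.json");
        Schema schema = SchemaLoader.load(new JSONObject(new JSONTokener(schemaStream)));
        try {
            schema.validate(new JSONObject(blueprintJson));
            return true;  // valid: the Blueprint can be stored in the VDC Blueprint Repository
        } catch (ValidationException e) {
            // invalid: report the violated schema constraints back to the Data Administrator
            e.getAllMessages().forEach(System.err::println);
            return false;
        }
    }
}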

ANNEX A.4: Data Utility Service

Component name: Data Utility Service

Description: Given a VDC Blueprint, it evaluates the potential data utility, i.e. the data utility regardless of the application that is using the VDC. When the VDC instance is created for the first time, the data administrator will call the Data Utility Service. The data administrator may also ask to recompute the values in order to make the VDC Blueprint more precise.

The Data Utility Service also embeds modules able to estimate the values of the dimensions more closely related to QoS, which depend on the application that is going to call the VDC instance.

Inputs: A VDC Blueprint (in particular, it needs to access the data sources listed in the schema)

Input mechanism: REST

Outputs: An updated version of the VDC Blueprint with the values for the data utility dimensions

Output mechanism: REST

Implementation language (if code): Java

Requirements: Task 2.1 output
Storage: N/A

Table 9. Data Utility Service Component

Interface


The Data Utility Service is mainly triggered by the component that fetches the Blueprint, when the VDC image has been created and when the data administrator wants to revise the VDC Blueprint. It exposes an HTTP REST interface, whose details are presented in the following table:

Method | Functionality | Component/Role
GET/@vdcBlueprintID | Update the VDC Blueprint with the new sample data set | –

Table 10. Data Utility Service Interfaces
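The following sketch only illustrates the idea of computing one data utility dimension over a sample data set and writing it back into the Blueprint; the dimension (completeness, measured as the fraction of non-null values), the field path dataUtility.completeness and the record representation are purely hypothetical.

import java.util.List;
import java.util.Map;
import org.bson.Document;

public class DataUtilitySketch {

    // Illustrative "completeness": fraction of non-null cells in the sample data set,
    // where each record is represented as a map from field name to value.
    public double completeness(List<Map<String, Object>> sample) {
        long cells = 0;
        long nonNull = 0;
        for (Map<String, Object> row : sample) {
            for (Object value : row.values()) {
                cells++;
                if (value != null) {
                    nonNull++;
                }
            }
        }
        return cells == 0 ? 0.0 : (double) nonNull / cells;
    }

    // Writes the computed value back into the Blueprint under a hypothetical field path.
    public Document updateBlueprint(Document blueprint, double completeness) {
        Document utility = (Document) blueprint.getOrDefault("dataUtility", new Document());
        utility.put("completeness", completeness);
        blueprint.put("dataUtility", utility);
        return blueprint;
    }
}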

ANNEX A.5: Data Utility Evaluator @ VDC (DUE@VDC)

Component name: Data Utility Evaluator @ VDC (DUE@VDC)

Description: Given a VDC instance, it analyses the data flow between the data sources linked to the VDC and the application and checks whether one or more business constraints are violated. If so, the DUE also informs the DUE at VDM level for further computation. A list of callback interfaces to be called when violations occur is specified by the data administrator. One of the callback interfaces is by default the DUE@VDM (see deliverable D4.1 [4]).

Inputs: A VDC Blueprint (in particular, it needs to access the data sources listed in the schema). Through a REST interface the DUE@VDC can be reconfigured by adding/updating/deleting the callback interfaces.

Input mechanism: data stream + REST

Outputs: The DUE will call the DUE@VDM and the interfaces specified in the input as callbacks

Output mechanism: API calls

Implementation language (if code): TBD. The component will extend a CEP-based core.

Requirements: Task 2.1 output
Storage: log of violations

Table 11. Data Utility Evaluator @ VDC (DUE@VDC) Component

Interface

The DUE@VDC provides a typical REST interface offering the CRUD operations to manage the list of callback interfaces.

Method | Functionality | Component/Role
GET/@vdcBlueprintID | Create/Recompute the VDC Blueprint | –

Table 12. Data Utility Evaluator @ VDC (DUE@VDC) Interfaces
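The callback mechanism can be illustrated with the following simplified Java sketch: callback interfaces are registered and removed (in the real component via the REST interface above) and every registered callback, including the default DUE@VDM one, is notified when a violation is detected. The violation model is a placeholder, and the CEP-based core that actually detects violations is not shown.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

public class DueAtVdcSketch {

    // Minimal violation model: which business constraint was broken and the observed value
    public record Violation(String constraint, double observedValue) {}

    // Registered callback interfaces, keyed by an identifier (managed via the REST CRUD operations)
    private final Map<String, Consumer<Violation>> callbacks = new ConcurrentHashMap<>();

    public DueAtVdcSketch() {
        // The DUE@VDM is registered by default; here it only logs, in DITAS it would be an API call
        callbacks.put("due-at-vdm", v ->
                System.out.println("notify DUE@VDM: " + v.constraint() + " = " + v.observedValue()));
    }

    public void addCallback(String id, Consumer<Violation> callback) {
        callbacks.put(id, callback);
    }

    public void removeCallback(String id) {
        callbacks.remove(id);
    }

    // Invoked by the CEP-based monitoring core whenever a business constraint violation is detected
    public void onViolation(Violation violation) {
        callbacks.values().forEach(cb -> cb.accept(violation));
    }
}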


ANNEX A.6: Sample Data Generator

Component name: Sample Data Generator

Description: Given a VDC Blueprint, it recomputes the sample data set used to calculate the potential data utility.

Inputs: The VDC Blueprint whose related sample data need to be updated

Input mechanism: REST

Outputs: Updated sample data related to a VDC Blueprint

Output mechanism: REST

Implementation language (if code): Java

Requirements: N/A
Storage: N/A

Table 13: Sample Data Generator Component
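A possible way to recompute a bounded sample in a single pass over a data source is reservoir sampling, sketched below; the record type and the way the data source is iterated are abstractions, not the actual implementation.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class SampleDataGeneratorSketch {

    // Reservoir sampling: keeps a uniform random sample of at most sampleSize records
    // while reading the data source only once.
    public <T> List<T> sample(Iterator<T> records, int sampleSize, long seed) {
        List<T> reservoir = new ArrayList<>(sampleSize);
        Random random = new Random(seed);
        long seen = 0;
        while (records.hasNext()) {
            T next = records.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                reservoir.add(next);
            } else {
                long replaceAt = (long) (random.nextDouble() * seen);
                if (replaceAt < sampleSize) {
                    reservoir.set((int) replaceAt, next);
                }
            }
        }
        return reservoir;
    }
}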

ANNEX A.7: Data Unified Access

Functional Component name: Provide Unified Access to the data oblivious of location & Transparent format translations & relation optimization

Description: Provide unified access to data by using the Spark Data Source API in order to enable Spark SQL queries over a variety of data sources.

Inputs: Spark SQL queries
Input mechanism: SQL

Outputs: Spark SQL query results
Output mechanism: SQL

Implementation language (if code): Java

Requirements: N/A
Storage: N/A

Table 14: Data Unified Access Functional Component
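The sketch below shows how the Spark Data Source API lets a single Spark SQL query span heterogeneous sources; the JDBC connection details, table names and file path are placeholders for whatever data sources a concrete VDC exposes.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnifiedAccessSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("vdc-unified-access")
                .master("local[*]")
                .getOrCreate();

        // Relational source read through the JDBC data source (placeholder connection details)
        Dataset<Row> relational = spark.read().format("jdbc")
                .option("url", "jdbc:postgresql://example-host:5432/exampledb")
                .option("dbtable", "example_table")
                .load();
        relational.createOrReplaceTempView("relational_source");

        // File-based source, e.g. a CSV export (placeholder path)
        Dataset<Row> csv = spark.read().option("header", "true").csv("/data/example.csv");
        csv.createOrReplaceTempView("file_source");

        // A single Spark SQL query spanning both sources, oblivious of their location and format
        Dataset<Row> result = spark.sql(
                "SELECT r.id, f.value FROM relational_source r JOIN file_source f ON r.id = f.id");
        result.show();

        spark.stop();
    }
}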

ANNEX A.8: Data Access Connector

Functional Component name: Data Access Connector from Spark to object store, extended with security.

Description: A high-performing connector to object storage for Apache Spark, achieving performance by leveraging object store semantics. It provides unified access to data oblivious of location. The connector should be extended with security mechanisms to enforce the security restrictions defined by the data utility of the VDC instance.

Inputs: Amazon S3 API
Input mechanism: library call

Outputs: Access to data
Output mechanism: N/A

Implementation language (if code): Java

Requirements: Spark, Object Store with S3 API

Storage: N/A

Table 15: Data Access Connector Functional Component
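The concrete connector and its security extension are not fixed here; purely as an illustration, the sketch below points Spark at an S3-API-compatible object store through the generic Hadoop S3A connector. Endpoint, credentials and bucket are placeholders; in DITAS the credentials would be provided according to the security restrictions captured in the VDC Blueprint rather than read from environment variables.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ObjectStoreAccessSketch {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("vdc-object-store-access")
                .master("local[*]")
                .getOrCreate();

        // Point the generic Hadoop S3A connector at an S3-API-compatible object store (placeholder values)
        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3a.endpoint", "https://objectstore.example.org");
        hadoopConf.set("fs.s3a.access.key", System.getenv("VDC_S3_ACCESS_KEY"));
        hadoopConf.set("fs.s3a.secret.key", System.getenv("VDC_S3_SECRET_KEY"));
        hadoopConf.set("fs.s3a.path.style.access", "true");

        // Read a dataset from the object store and expose it to Spark SQL
        Dataset<Row> readings = spark.read().parquet("s3a://example-bucket/readings/");
        readings.createOrReplaceTempView("readings");
        spark.sql("SELECT COUNT(*) FROM readings").show();

        spark.stop();
    }
}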


ANNEX A.9: VDC Editor

Component name: VDC Editor (Optional)

Description: A User Interface that the Data Administrator should use in order to define his/her VDC Blueprint.

Inputs: Receives information from the Data Administrator regarding the VDC properties

Input mechanism: GUI, CLI

Outputs: The VDC Blueprint in JSON format to be validated by the VDC Validator

Output mechanism: REST

Implementation language (if code): TBC

Requirements: N/A
Storage: N/A

Table 16: VDC Editor Component