

ESSnet Big Data II

Grant Agreement Number: 847375-2018-NL-BIGDATA

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata
https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en

Work Package F

Process and architecture

Deliverable F1 part 2

BREAL: Big Data REference Architecture and Layers

Application and information architecture

Version 2020-06-22

Work package leader: Monica Scannapieco (ISTAT, Italy)
[email protected]
phone: +39 06 46733319
mobile phone: +39 366 6322497

Prepared by:

Frederik Bogdanovits (Statistics Estonia, Estonia)
Arnaud Degorre, Frederic Gallois (INSEE, France)
Bernhard Fischer (DESTATIS, Germany)
Kostadin Georgiev (BNSI, Bulgaria)
Remco Paulussen (CBS, Netherlands)
Sónia Quaresma (INE, Portugal)
Monica Scannapieco, Donato Summa (ISTAT, Italy)
Peter Stoltze (Statistics Denmark, Denmark)


Outline

Executive Summary
1. BREAL Application Architecture
1.1 Development services & components
1.2 Support services & components
2. BREAL Generic Information Architecture
3. BREAL framework for defining operational models
4. Solution architecture for implementation work packages
4.1 Solution architectures for WPB
4.1.1 Application architecture for WPB
4.1.2 Information architecture for WPB
4.1.3 Operational model for WPB
4.2 Solution architectures for WPC
4.2.1 Application architecture for WPC
4.2.2 Information architecture for WPC
4.2.3 Operational model for WPC
4.3 Solution architectures for WPD
4.3.1 Application architecture for WPD
4.3.2 Information architecture for WPD
4.3.3 Operational model for WPD
4.4 Solution architectures for WPE
4.4.1 Application architecture for WPE
4.4.2 Information architecture for WPE
4.4.3 Operational model for WPE
5. Conclusions and future work
References


Executive Summary

To be filled.

1. BREAL Application Architecture

The BREAL architecture has adopted the ArchiMate language for describing the different architectures and their relationships. A convenient feature of that language is that different colours are used for the different types of architecture: yellow for the business architecture, blue for the application architecture, and green for the technical architecture.

1.1 Development services & components

The business functions of the BREAL life cycle (yellow) consist of different application services (blue), as explained below.

1. Acquisition and Recording

External structured data retrieval: This service retrieves structured data from external sources. Structured data is data with a well-defined structure, usually in the form of tables. Sources for this service can be APIs, tables on websites, or databases.

External unstructured data retrieval: This service retrieves unstructured data from external sources. Unstructured data is data with very little or no structure, for example text or images. Sources can be websites, satellite images, news articles, social media etc.

Data storing: This service provides storage options for data acquired via the retrieval services. The storage provided can be relational or NoSQL (document-based, key-value, graph-based, etc.).
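To make the interplay concrete, here is a minimal sketch of a structured retrieval service feeding a storing service; the endpoint URL, record layout and table name are hypothetical, and SQLite stands in for whichever relational or NoSQL backend is actually chosen.

```python
# Minimal sketch: external structured data retrieval feeding data storing.
# The API endpoint and table layout are hypothetical.
import json
import sqlite3
import urllib.request

def retrieve_structured(url: str) -> list:
    """Fetch a JSON array of records from an external API."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

def store_records(db_path: str, records: list) -> None:
    """Persist raw records; a document (NoSQL) store would fit the same
    service interface equally well."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS raw_data (payload TEXT)")
    con.executemany("INSERT INTO raw_data (payload) VALUES (?)",
                    [(json.dumps(r),) for r in records])
    con.commit()
    con.close()

# Example (hypothetical endpoint):
# store_records("raw.db", retrieve_structured("https://example.org/api/offers"))
```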

2. Data Wrangling

Data preparation, filtering and deduplication: The service is responsible for bringing data into a machine-usable format, filtering relevant parameters out of unstructured data, and removing duplicate data entries, which commonly occur e.g. with web scraping.


Data standardization: Data is transformed in order to make it comparable to data from other sources: for example, values are scaled to the same units and text is converted to the same encoding.

Data aggregation: Data points are aggregated into larger intervals or groups in order to shrink the amount of data when the detailed resolution of the raw data is not necessary and the available computational power is limited.
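As a toy sketch of these three wrangling services applied to scraped records, consider the following; the field names (text, price_cents, date) are invented for illustration.

```python
# Toy Data Wrangling steps: deduplication, standardization, aggregation.
# All field names are invented.
import hashlib
from collections import Counter

def deduplicate(records: list) -> list:
    """Drop exact duplicates, as commonly produced by web scraping."""
    seen, unique = set(), []
    for rec in records:
        key = hashlib.sha1(rec["text"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def standardize(rec: dict) -> dict:
    """Scale values to common units and normalize text case/encoding."""
    return {**rec,
            "price_eur": rec["price_cents"] / 100,
            "text": rec["text"].casefold()}

def aggregate_monthly(records: list) -> Counter:
    """Shrink daily observations to monthly counts ('YYYY-MM-DD' -> 'YYYY-MM')."""
    return Counter(rec["date"][:7] for rec in records)
```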

3. Data Representation

Data derivation: The service is responsible for deriving additional variables from the data – e.g. deriving a birth date from an age variable.

Structure modeling: The service derives and adds structure to previously weakly structured or unstructured data by modeling categories or other structuring elements.

Data encoding: The service is responsible for adding metadata such as formats or classifications to data.

4. Modeling and Interpretation

Data linking & enriching: The data in the big data pipeline is linked and enriched with data from other sources (conventional statistics, other big data-derived statistics, etc.) in order to add more meaning to it.

Methodology application: Domain-specific methodology (predominantly machine learning, but other methods are also in the scope of this service) is applied to the data in order to derive information relevant for specific statistical products.

Statistical aggregation: Results from the modeling and linking steps are aggregated in order to fit the basis of other statistical outputs. This can, for example, be an aggregation on the timescale so that the data adheres to the same measurement intervals as other data sources.
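A hedged sketch of linking, enrichment and statistical aggregation using pandas follows; the keys, fields and NACE codes are invented, and quarterly aggregation merely stands in for whatever interval other outputs use.

```python
import pandas as pd

# Invented big-data observations and a register extract sharing a unit key.
big = pd.DataFrame({
    "enterprise_id": [1, 1, 2],
    "date": ["2020-01-15", "2020-02-10", "2020-01-20"],
    "signal": [3, 5, 2],
})
register = pd.DataFrame({"enterprise_id": [1, 2], "nace": ["J62", "H49"]})

# Data linking & enriching: join big data records to register variables.
linked = big.merge(register, on="enterprise_id", how="left")

# Statistical aggregation: collapse to the quarterly interval used elsewhere.
linked["quarter"] = pd.to_datetime(linked["date"]).dt.to_period("Q")
aggregated = linked.groupby(["quarter", "nace"])["signal"].sum()
print(aggregated)
```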

5. Shape output

Data exchange: The service is responsible for transforming and shaping the output of the big data pipeline in order to integrate the resulting information and models into statistical production.

1.2 Support services & components

1. Data organization

The implementation of a statistical process based on Big Data requires the mobilization of support services dedicated to the storage of and access to data, which together constitute the data platform function. Data organization services are closely linked with computing and network services, in both technical and logical respects.

Data storage: a set of support services to manage the organization and distribution of data within the Big Data system. Since many Big Data systems are based on a close link between the storage strategy and the computing methods that will be used, these services are key elements to foster


technical effectiveness of the statistical pipelines. As Big Data systems are usually horizontally distributed across multiple infrastructure resources, specific activities related to creating data elements can specify that data will be replicated across a number of nodes and will be eventually consistent when accessed from any node in the cluster. Other activities should describe how data will be accessed and what type of indexing is required to support that access. Two aspects of storage that directly influence suitability for Big Data solutions are capacity and transfer bandwidth. Capacity refers to the ability to handle the data volume, while transfer bandwidth deals with the "flow" of information that can be accessed in a given time span. Many Big Data implementations address these aspects by using distributed file systems within the platform framework.

Access services: these are focused on the communication and interaction with the statistical users and pipelines that require data access. An access service may be a generic service, such as a web server or application server, configured to handle specific requests from authenticated users and pipelines. In addition, the access activity confirms that descriptive and administrative metadata and metadata schemes are captured and maintained for access (see Metadata Application Service).

Data platform services: they provide the logical data organization and distribution, combined with the associated access application programming interfaces (APIs) or methods. The frameworks may also include data registry and metadata services, along with semantic data descriptions such as formal ontologies or taxonomies. The logical data organization may range from simple delimited flat files to fully distributed relational or columnar data stores. Accordingly, the access methods may range from file access APIs to query languages such as Structured Query Language (SQL).

2. Infrastructure: networking and computing

Support services related to infrastructure are dedicated to managing the underlying computing and networking functions required to implement the overall system. Complementary to the data organization services, they reflect the underlying operations performed on stored data within the system, which include transmission on the one hand, and processing and computation on the other. Transmission services include descriptions of data transmission requirements, which define the required throughput and latency. Computing services cover technical settings that deliver specific processing methods, with infrastructures adapted accordingly (for instance, highly parallel processing of large matrices or data may be delivered by Single Instruction Multiple Data computing, which can be supported by an infrastructure based on Graphics Processing Units).

Processing services: these services are dedicated to setting up the relevant technical infrastructures and methods, given how data will be processed in the Big Data statistical pipelines. This processing generally falls on a continuum from long-running batch jobs to responsive processing supporting interactive applications and continuous stream processing. The types of processing services described for a given architecture depend on the characteristics (volume and velocity) of the data processed by the Big Data statistical pipelines and their requirements.


Depending on the type of processing required, processing services might set up MapReduce or Bulk Synchronous Parallel (BSP) solutions.
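As an illustrative sketch of the batch end of that continuum, the MapReduce pattern can be demonstrated with the standard library alone; the word-count task and the in-process pool stand in for a real cluster framework such as Hadoop MapReduce.

```python
# Illustrative batch processing in MapReduce style using only the standard
# library; an in-process pool stands in for a real cluster framework.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk: str) -> Counter:
    """Map: count words in one input split."""
    return Counter(chunk.split())

def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce: merge partial counts."""
    return left + right

if __name__ == "__main__":
    splits = ["big data big", "data pipeline", "big pipeline"]
    with Pool() as pool:
        totals = reduce(reduce_phase, pool.map(map_phase, splits))
    print(totals)  # Counter({'big': 3, 'data': 2, 'pipeline': 2})
```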

Networking: the connectivity of the architecture infrastructure is a crucial part of the support services, as it affects the velocity characteristic and is a precondition for deploying responsive processing for statistical pipelines. While some Big Data implementations may deal solely with data that is already stored in the data center and does not need to be accessed outside the local network, others may need to deploy specific network capacities to be continuously connected to outside data providers (or may expose internal data to outside users).

On the external side, the availability of network connectivity and the latency of the transmission protocol are two key characteristics to be handled by networking services with respect to the primary senders or receivers of Big Data processed by the statistical pipelines.

On the internal side, the volume and velocity characteristics of Big Data are driving factors in the specification of a suitable internal network infrastructure as well. For example, if the implementation requires frequent transfers of large multi-gigabyte files between cluster nodes, then high-speed and low-latency links are required to maintain connectivity to all nodes in the network.

Networking services may also include various activities: dynamic supervision services and service priority handling; name resolution (e.g., Domain Name Server); and encryption, along with firewalls and other perimeter access control capabilities.

3. Security and information assurance

Support services dedicated to security and information assurance implement the core activities supporting the overall security and privacy requirements outlined by the policies and processes of Big Data statistical pipelines. They include authentication services and monitoring services.

Authentication and authorization services: these services must interface and interact with all other components within the Big Data system to support access control to the data and services of the system. This support includes authenticating the user or service attempting to access a system resource in order to validate their identity. For instance, authentication services may take the form of APIs offered to other services and components for collecting identity information and validating that information against a trusted store of identities. Frequently, such authentication services will return an identification token to the invoking component that defines the allowed access for the life of a session. This token can also be used to retrieve authorizations for the users or components, detailing what data and service resources they may access.
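A minimal sketch of the token flow described above follows; the identity store, permissions and hashing scheme are all hypothetical, and a production service would use a hardened identity provider rather than in-memory dictionaries.

```python
# Hypothetical token-issuing authentication service with authorization lookup.
import hashlib
import hmac
import secrets

IDENTITY_STORE = {"analyst1": hashlib.sha256(b"s3cret").hexdigest()}   # hypothetical
AUTHORIZATIONS = {"analyst1": {"read:convergence_layer"}}              # hypothetical
SESSIONS = {}  # token -> user, for the life of a session

def authenticate(user: str, password: str):
    """Validate identity against the trusted store; return a session token."""
    digest = hashlib.sha256(password.encode()).hexdigest()
    if user in IDENTITY_STORE and hmac.compare_digest(digest, IDENTITY_STORE[user]):
        token = secrets.token_hex(16)
        SESSIONS[token] = user
        return token
    return None

def authorized(token: str, permission: str) -> bool:
    """Use the token to retrieve the caller's authorizations."""
    user = SESSIONS.get(token)
    return user is not None and permission in AUTHORIZATIONS.get(user, set())
```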

Monitoring and audit services: monitoring services are responsible for collecting, managing and consolidating information about events across the system that reflect access to, and changes to, data and services. The scope and nature of the events collected is based on the requirements specified by the security policies (see Trust Management). Typically, monitoring services will collect and store this data within a secure centralized repository in the system and manage its retention based on the policies. The data maintained by these components can be leveraged during system operation to provide provenance and pedigree for data to users or application components, as well as for forensic analysis in response to security or data breaches.


4. Manage communication

Stakeholders of big-data-enabled statistical pipelines may include various organizations affecting the success of the process: government authorities and regulators, big data providers, cloud and IT framework providers, the academic community, and even civil society, which sits at both ends of the spectrum (from individual data collection to final dissemination of results). Such a variety of stakeholders makes it necessary to take into account different interests, attitudes and priorities. Effective communication ensures that each kind of stakeholder receives information that is relevant to their needs, and builds positive attitudes towards the statistical project.

The stakeholders must be identified, actively managed, and communicated with to ensure their buy-in to the final product. Hence, communication services contain at least two components:

1. Stakeholder Identification and Analysis
2. Communication Management and Planning

Stakeholder Identification and Analysis: this service centrally collects and stores information about stakeholders, and must at least include the identification of each stakeholder, contact information, a categorization according to a project-adapted taxonomy of stakeholders, links with the projects, and any specific agreements that may have been set up between the NSI and each stakeholder. There are at least three types of stakeholders to identify (a minimal record layout is sketched after this list):

1. Upwards - These are the stakeholders involved with initiating and financing the project. They include the project sponsors and, more generally, public authorities defining the statistical work program and its budget support.

2. Downwards - These stakeholders cover actors who are closely linked to the operational realization of the project work. They include, in addition to the internal staff of the NSI, the suppliers and contractors.

3. Sideways - These include other statistical organizations that may have similar or complementary statistical pipelines, and research centers or academic communities that lend talent and resources to the project.
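A minimal record layout for this service could look as follows; all field names and the sample entry are hypothetical, with the category values mirroring the upwards/downwards/sideways taxonomy above.

```python
# Hypothetical stakeholder record for the identification and analysis service.
from dataclasses import dataclass, field

@dataclass
class Stakeholder:
    name: str
    contact: str
    category: str                 # "upwards" | "downwards" | "sideways"
    projects: list = field(default_factory=list)
    agreements: list = field(default_factory=list)

# Illustrative entry only:
registry = [Stakeholder(name="Project sponsor X", contact="sponsor@example.org",
                        category="upwards", projects=["OJV pilot"])]
```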

Communication management and planning: communication with stakeholders builds dialogue. It may take various forms: newsletters, specific reports, conferences, forums, social media, and so on. By setting up these different forms of feedback, a better understanding of stakeholders' interests and attitudes is targeted, so that communications can be fine-tuned. Using forums or other social media to communicate makes it possible to respond to critical comments or correct misunderstandings.

The communication management and planning service keeps track of how each stakeholder will be communicated with: the type, frequency, and medium. It records the content of the communication and what it intends to accomplish. While it is true that stakeholders fall into groups with similar communication needs, they still have their own unique needs and may be addressed individually. The communication management service may hence record information either at the individual level (for each stakeholder) or at the group level (category of stakeholders),


depending on the communication strategy that has been decided. It can be used to ensure that the Stakeholder Communications Plan is being followed (calendar and content).

5. Provision agreement management

The service to manage provision agreements must store the stipulations in the agreements with information providers to provide data according to the requirements: timeliness, confidentiality, quality, transmission protocol, authorship and pre-processing characteristics. It must be closely related to the Trust and Quality Management services to ensure that obligations are met. In this sense, it must also connect to the Validation services.

The Provision Agreement Management Service should thus comprise the following components:

Storing Agreement database: this service centrally stores in a database all that is stipulated in the agreement, including at least: the date the data must be provided, the confidentiality and sensitiveness of the data, the transmission protocol through which the data must be made available, the structure agreed upon, and any pre-processing that may take place at the provider's premises (a minimal sketch of such a record is given after the reports list below).

Update instance database: this service collects the information of each instance/transmission pertaining to a specific provision agreement. Instances non-conformant with the base provision agreement must trigger alerts. The sharing of confidential information requires that the flows of information are secured and that access and usage are traceable.

Provision Agreement reports: this service accesses the provision agreement and instance databases and creates specific reports that can be destined to users/managers or machines/services:

(i) Trust Management, for appropriately monitoring the flow of information;
(ii) Metadata and Quality Management, to trace how specific outputs have actually been produced and to identify potential improvements;
(iii) Validation services, with information on the structure and content being provided; and
(iv) Acquisition and Recording services, to directly support statistical data production.
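As promised above, here is a minimal sketch of an agreement record and a conformance check that raises alerts; all field names and thresholds are hypothetical.

```python
# Hypothetical provision agreement record plus a per-instance conformance check.
from dataclasses import dataclass
from datetime import date

@dataclass
class ProvisionAgreement:
    provider: str
    due_day_of_month: int          # agreed delivery deadline within each month
    confidentiality: str           # e.g. "public" or "sensitive"
    transmission_protocol: str     # e.g. "sftp"
    agreed_structure: str          # reference to the agreed data structure

def check_instance(agreement: ProvisionAgreement,
                   received: date, protocol: str) -> list:
    """Return alerts for a transmission that does not conform to the agreement."""
    alerts = []
    if received.day > agreement.due_day_of_month:
        alerts.append("late delivery")
    if protocol != agreement.transmission_protocol:
        alerts.append("unexpected transmission protocol")
    return alerts
```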

6. Legislative work participation

Legislative Work Participation is understood as the ability to participate in and influence the legislative work that forms the legislative basis of official statistical production, in a way that supports decision makers and is regarded as useful and important. The exploration of big data sources has the potential to change the data that is available to produce statistics. This change of paradigm may have an enormous impact on society; however, these big data sources pose new access


questions. It is necessary to devise new protocols, agreements and forms of cooperation with the data producers/holders. Despite all the effort put into collaboration, in some cases legislation supporting and enabling official statistical production using big data sources may be required. Information is the key asset of official statistics, and it cannot be compromised by external or internal stakeholders.

To foster legislative work participation supporting the usage of big data sources, the following services must be in place:

Legislation database: compiles the legislation, directives and their provisions related to the possibility of using relevant and reliable data for statistical production.

Updating Legislation database: feeds the database with new legislation and legal initiatives that may provide opportunities for other big data sources, not yet detected but emergent (for example from the digital transformation), and reports them back to the New Data Sources Exploration services.

Detected Legislation reports: reports coordinating the New Data Sources Exploration services with the existing legislation applicable to the new data sources that may be harnessed to produce meaningful statistics.

Legislation Gap reports: reports coordinating the Provision Agreement Management services with lacking or possible legislation, especially where problems are consistently and continuously detected that compromise the timeliness and quality of official statistics production.

7. Method and tool management

Method and tool development identifies possible activities, methods and tools that can accomplish the tasks necessary across statistical subject matter domains, statistical phases and process steps. When a new issue arises, there are mainly three options to address the problem at hand. The first option to cover an identified need should be to reuse an existing generic component (method, definition, package/module/component/service, etc.). If such functionality is not readily available, the second option is to adapt a solution that already exists. These two options focus on reusing components with the goal of saving resources, such as developers and development time. Only if the previous options are not viable should buying an existing package or developing a new one be considered. To make this possible, the traditional IT support service for method and tool development must include a catalogue.

The development and management of such a portfolio of Building Blocks (BB) increases collaboration and reusability inside the NSI and/or across countries, fostering cooperation and creating synergies within the statistical community in support of modernization and innovation.


Method and tool catalogue - contains references to various artefacts and services to support standardization and the efficient sharing and reuse of processes, information and services/applications. The catalogue must contain descriptions that are suitable for distinct types of users. The I3S [1] communication kit for services reuse identifies four main classes of users with different information requirements: subject domain expert, methodologist, IT and management. A tool may implement several methods and, reciprocally, the same method can be used by several tools, leading us to divide such a catalogue into two services:

o Methods Inventory – more research-oriented, collecting evidence from used and proven methodologies to address specific questions. For Big Data, such an inventory should be based on the Methodological Report for the usage of Big Data [2].

o Tools Repository – providing easy-to-deploy-and-test elements such as a generic description and requirements, instructions for implementation, and test data sets. Several such tools for big data have been made publicly available on platforms like GitHub, which provide versioning. An example is the Starter Kit [3] to WebScrape Enterprise Characteristics (developed by WPC in ESSnet Big Data II).

Footnotes:

[1] https://ec.europa.eu/eurostat/cros/content/wp4-create-and-communicate-success-stories_en

[2] https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/d/df/WPK_Deliverable_K5_First_draft_of_the_methodological_report_2020_06_17_final.pdf

[3] https://github.com/EnterpriseCharacteristicsESSnetBigData/StarterKit/releases/tag/v1.1
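To make the two-service split concrete, a hedged sketch of a catalogue structure follows; the entries and the method name are placeholders, and only the Starter Kit repository URL comes from the footnotes above.

```python
# Hypothetical catalogue split into a Methods Inventory and a Tools
# Repository; the many-to-many relation between tools and methods is
# captured by the "implements" links. Entry names are placeholders.
catalogue = {
    "methods": {
        "url-searching": {
            "audience": ["subject domain expert", "methodologist", "IT", "management"],
            "evidence": "proven in enterprise-characteristics use cases",
        },
    },
    "tools": {
        "starter-kit": {
            "implements": ["url-searching"],
            "repository": "https://github.com/EnterpriseCharacteristicsESSnetBigData/StarterKit",
        },
    },
}

def tools_for_method(method: str) -> list:
    """Look up which catalogued tools implement a given method."""
    return [name for name, meta in catalogue["tools"].items()
            if method in meta["implements"]]

print(tools_for_method("url-searching"))  # ['starter-kit']
```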

2. BREAL Generic Information Architecture

The purpose of this Section is to describe the BREAL Generic Information Architecture for Big Data (GIAB).

GIAB consists of three layers, as described by the ‘hourglass model’ proposed for Trusted Smart Statistics, namely a raw data layer, a convergence layer, and a statistical layer:

• The Raw Data Layer includes data that are acquired and stored by the BREAL "Acquisition and Recording" business function. Let us remark that, at this stage, the GIAB just identifies the concepts; no detail is provided on formats or other technical specifications that can be useful for raw data acquisition and storage.


• The Convergence Layer contains data represented as units of interest for the analyses. These data are produced as results of the BREAL life cycle functions of “Data Wrangling” and “Data Representation”.

• The Statistical Layer includes those concepts that are the targets of the analysis. These data are produced by “Modeling and Interpretation”, “Integrate Survey and Register Data”, “Enrich Statistical Registers” and “Shape Output”.

In addition to the data concepts, some metadata concepts are introduced for each of the three layers. In particular, one category of metadata has been selected as highly specific to Big Data, namely provenance metadata. These metadata have been specified for each of the three layers mentioned above.

Moreover, in order to better highlight the integration of the GIAB with respect to existing standards for data modelling in Official Statistics, we included some ‘key’ concepts from GSIM (Generic Statistical Information Model).

The resulting composition of GIAB is shown in Figure 1. Each layer consists of: (i) specific BD data entities (blue); (ii) GSIM entities (pink); and (iii) specific provenance metadata entities (yellow).

The Raw data layer of GIAB consists of:

Specific BD data entities: a Big Data Source entity is introduced with three main characteristics that are considered particularly relevant for characterizing the source, corresponding to volume, variety and velocity. Volume is taken into account by two entities, namely Big Data Size and Non-Big Data Size; variety by the two entities Structured and Unstructured; and the velocity of data in the acquisition phase by the two entities In-Motion and At Rest.

GSIM entities: GSIM Data Resource is introduced. A Big Data Source entity is a specialization of that.

Specific provenance metadata entities: two entities are introduced: Acquired Entities, to record which specific data are selected from the available source, and Provider Agents, to record who provides them.
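One possible way to render these raw-data-layer entities in code is sketched below; the class and enum names mirror the entities above, while the sample instance (a job portal) is purely illustrative.

```python
# Sketch of the raw data layer entities; Big Data Source specializes
# GSIM Data Resource and carries the volume/variety/velocity characteristics.
from dataclasses import dataclass
from enum import Enum

class Volume(Enum):
    BIG_DATA_SIZE = "big data size"
    NON_BIG_DATA_SIZE = "non-big data size"

class Variety(Enum):
    STRUCTURED = "structured"
    UNSTRUCTURED = "unstructured"

class Velocity(Enum):
    IN_MOTION = "in-motion"
    AT_REST = "at rest"

@dataclass
class GSIMDataResource:
    name: str

@dataclass
class BigDataSource(GSIMDataResource):  # specialization of GSIM Data Resource
    volume: Volume
    variety: Variety
    velocity: Velocity

job_portal = BigDataSource("job portal", Volume.BIG_DATA_SIZE,
                           Variety.UNSTRUCTURED, Velocity.AT_REST)
```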


Figure 1: Generic Information Architecture for Big Data


The Convergence data layer of GIAB consists of:

Specific BD data entities: a Big Data Unit Type entity is introduced, composed of BD Variables that can be of three types: Identification, Core and Derived.

GSIM entities: GSIM Unit Data Type is introduced; the Big Data Unit Type entity is a specialization of it. In addition, Statistical Register and Statistical Program (intended to take survey data into account) are introduced as unit types to be potentially integrated with BD unit types.

Specific provenance metadata entities: an entity Throughput Activity is introduced to record the specific operations performed on raw data. It is specialized into two entities: Lower Layer, which includes metadata possibly introduced by the "Data Wrangling" activities on data provided by the lower layer, and Upper Layer, which includes metadata possibly introduced by the "Data Representation" and "Modeling and Interpretation" functions, to be used at the upper layer for data analyses.

Finally, the Statistical data layer of GIAB consists of:

Specific BD data entities: a Big Data Information Set entity is introduced that can be used either for internal or for external output.

GSIM entities: GSIM Information Set is introduced. Big Data Information Set entity is a specialization of that.

Specific provenance metadata entities: an entity Lineage of Output is introduced to record the overall process performed on raw data from the output perspective. This entity takes into account the provenance metadata introduced at the lower layers and is the result of an elaboration of such metadata, contributing to making the whole BD production process transparent.

3. BREAL framework for defining operational models

Operational model

There are various possibilities for NSIs to combine their efforts and work together on a joint solution. The easiest is to share code, for example via GitHub; this is common practice within this ESSnet, and allows us to replicate application services. Another possibility is to share infrastructure, for example Eurostat's DataLab or the UN Global Platform. A third possibility is to share data sources. These three types of sharing are the basis of our operational model:

Sharing application services

Sharing infrastructure platforms

Sharing data sources


Sharing application services

The types of application service follow the definitions of the ESS EARF. Here, the following approaches for consolidation are used:

Autonomous service. The application service is designed and operated without coordination with other NSIs. This can be the case when the situation is country-specific.

Interoperable service. In this case, the interface of the application service is aligned across the NSIs. Thus, the NSIs have similar application services, but the implementation may vary. A reason for this variety might be, for example, that the data differs per NSI.

Replicated service. The same application service is deployed (duplicated) at various NSIs.

Shared service. There is a common, distributed application service, shared by and accessible to all the NSIs.

To indicate the type of application service in the various solution application architectures, the application services are colour-coded as follows:

Sharing infrastructure platforms

The type of platform defines where the platform is located and who supports it. There are three variants:

Local platform (of the NSI). In this case the big data platform is fully operated and/or controlled by the NSI.

Local platform of the data provider. In this case, the data provider has its own big data platform, on which statistical analyses of the NSIs can be run. An example could be the platform of EMSA, where maritime or environmental statistics based on AIS data could be produced. EMSA provides the data and the platform, and NSIs are involved in defining the methodology to be used, producing the statistical outcome.

Shared platform. In this case, a third party operates and controls the big data platform. Examples are the UN Global Platform and Eurostat's DataLab; both can be used by NSIs and other agencies to derive statistics. Another example is when a dedicated party within the country of the NSI provides a platform that is shared between various (governmental) agencies within that country, for example SURFsara in the Netherlands.

To indicate the type of platform in the various solution application architectures, the platforms are colour-coded as follows:


Sharing data sources

The type of coverage of the data determines its usage and whether sharing adds value:

Local data can only be used by the NSI(s) of the country (or countries) to which the data applies. For example, data on smart meters provided by energy companies applies in most cases to only one country. Furthermore, privacy issues might prevent sharing this data, or processing it outside the premises of the NSI.

European data, which has Europe as coverage. An example is EMSA, which has AIS data with European coverage. Another example is the job vacancy data gathered by Cedefop.

Worldwide data, which has global coverage. A good example is the AIS data from exactEarth on the UN Global Platform.

To indicate the type of coverage of the data sources in the various solution application architectures, the coverage of the data is colour-coded as follows:

Combinations of sharing

Not all combinations of sharing add value. In the model below, we show the combinations that are valuable in our opinion. As a basis, we look at the type of platform:

On local platforms of the NSI, we run autonomous, interoperable and/or replicated application services. Of course, an NSI could provide a shared service to another NSI, but in that case the NSI would also become a shared platform, so from that perspective we do not consider that combination logical.

Of course, local data can reside and be used on local platforms. European and worldwide data can also be used. However, these data sources are costly, and it is better to share them on the platform of the data provider itself or on a shared platform.

The main reason to execute a big data solution on the local platform of a data provider is access to its valuable data sources. The coverage of these data sources is in most cases European or worldwide, and the application services executed are mostly shared. Because the provider is an external party, it will be more difficult to share local data (for privacy reasons) or to execute autonomous or interoperable application services (for maintenance and trust reasons).

A shared platform is the solution for sharing valuable resources. Again, the coverage of these data sources is in most cases European or worldwide, and the application services executed are mostly shared. Local data can be used if supported by the platform provider and if allowed


from a privacy perspective. Autonomous or interoperable application services can be used if supported by the platform provider.
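As a compact, hedged encoding of the combinations judged valuable above, the following rule table is a simplification of the prose, not an exhaustive policy; the platform and service-type labels are shorthand.

```python
# Valuable platform/service-type combinations, simplified from the text.
# "Only if supported by the provider" caveats are collapsed into membership.
VALUABLE = {
    "local NSI platform": {"autonomous", "interoperable", "replicated"},
    "data provider platform": {"shared"},
    "shared platform": {"shared", "autonomous", "interoperable"},
}

def is_valuable(platform: str, service_type: str) -> bool:
    """Check whether a sharing combination is considered valuable."""
    return service_type in VALUABLE.get(platform, set())

assert is_valuable("local NSI platform", "replicated")
assert not is_valuable("data provider platform", "autonomous")
```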

4. Solution architecture for implementation work packages

To be filled.

4.1 Solution architectures for WPB

Work package B deals with online job vacancies (OJV), and especially the production of statistical estimates based on techniques and methodologies developed during ESSnet Big Data I. The final objective is to develop and test methodologies and prototypes, as well as to build capacity for integrating them into statistical production at national and European levels.

The next parts describe the application and information architectures of WPB according to the Big Data Reference Architecture and Layers (BREAL):

Application architecture: representation linking the business processes and functions, external application services and application components and services views.

Information architecture: description of the information according to the raw data layer, the convergence layer and the statistical layer representations.

Operational model.


4.1.1 Application architecture for WPB

The next figure represents the application architecture for statistical estimates based on online job vacancies, as the result of the WPB work.

The figure is divided into three horizontal parts: business processes or functions, external application services, and application components. Every business function triggers the next one and is served by application services. The gray application services are those from the BREAL representation that are not used to describe the WPB work.

The first thing to notice concerns two crosscutting components: the workflow manager and the metadata management service. The first is designed to ensure correct triggering between the different steps of the process. The second offers the capability of sharing metadata all along the process.

To realize the first business function (Acquisition and Recording), the process guarantees that data can be collected via structured and unstructured data retrieval, and subsequently stored through the data storing component.

The first task is achieved through a "Scraping and crawling component" that offers either specific or generic scraping, as well as website retrieval.

The first use case relates to a website with a known structure: where is the information? How is the data displayed? The second use case deals with the opposite situation: you do not know how the information is structured across the website. Using only a URL, this use case scrapes the information while crawling the site.

The last type of access to information on a website is made through an "API querying" module. As its name suggests, this is the way to access information about OJVs on websites providing API access.

To complete the Acquisition and Recording step, two components are used: one to create the URL inventory and one to store the scraped data. The URL inventory grows during specific and generic scraping. For example, when using a generic scraping or website retrieval component starting from only a company domain name, the component needs to store the URLs that were analyzed along with some characteristics (visit date, information, etc.).
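A hedged sketch of the specific-scraping path, including growth of the URL inventory, is shown below; the portal URL and the CSS selectors are hypothetical, and requests plus BeautifulSoup stand in for whatever scraping stack is actually used.

```python
# Hypothetical specific scraping: known page structure, fixed CSS selectors,
# each visited URL logged to the inventory. Portal URL and selectors invented.
import datetime

import requests
from bs4 import BeautifulSoup

url_inventory = []  # grows during specific and generic scraping

def scrape_specific(url: str) -> dict:
    """Scrape a job offer from a portal whose structure is known."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    offer = {
        "title": soup.select_one("h1.job-title").get_text(strip=True),
        "location": soup.select_one("span.location").get_text(strip=True),
    }
    url_inventory.append({"url": url,
                          "visit_date": datetime.date.today().isoformat()})
    return offer

# offer = scrape_specific("https://jobs.example.org/offer/123")  # hypothetical
```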


Once data is collected, it can be wrangled. In WPB, this is achieved first through a "Web data preparation" module. The "TD matrix", "Word and sentence embedding" and "Pretrained NLP models" components are then used, before the "Compute aggregates" component.

The "Web data preparation" module intends to realize the first step of preparing the raw data collected. That includes three major parts: ex-post filtering, language detection and finally deduplication and merging. These first three operations realize a first filtering process of the raw data collected. The same "Web data preparation" module is used a second time in order to standardize the data.

"TD matrix", "Word and sentence embedding" and "Pretrained NLP models" components are used in order to achieve a data aggregation to close the Data wrangling business function. After deduplication of the collected raw data (the same offer on several websites), this step will compute some aggregates in order to determine if the job offer should be considered as a valid one, due to its date of publication for example.

The next business function ("Data representation") is implemented through five components: "Compute aggregates" (once more), and then "Build training and test set", Data set labeling", "Train, test and validate" and "Apply ML".

Re-using "Compute aggregates" initiates the "Data representation" function by creating derived data that will be needed further down in the process.

The next four modules build the step that realizes the BREAL "Structure modeling" service using machine learning (ML). Because of the origin of the data, online job vacancies are not published with a usable classification (and where one exists, it does not fulfill all the needs: location, for example). The four components describe the main steps of using machine learning and their cyclic usage (some iterations of "Build training and test set" or "Train, test and validate" may be needed before applying machine learning).
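A hedged illustration of that four-component cycle with scikit-learn follows; the job ads and occupation labels are invented, and the pipeline is only one of many possible model choices.

```python
# Invented job ads and occupation labels; one possible model choice only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

ads = ["python developer", "truck driver wanted", "java engineer", "delivery driver"]
codes = ["ICT", "transport", "ICT", "transport"]

# "Build training and test set"
X_train, X_test, y_train, y_test = train_test_split(ads, codes,
                                                    test_size=0.5, random_state=0)
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)                       # "Train, test and validate"
score = model.score(X_test, y_test)               # iterate if validation is weak
predicted = model.predict(["backend programmer"])  # "Apply ML"
```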

Notice that in this "Data representation" business function of the BREAL, modeling the work of WP B does not imply a projection over "Data encoding" service.

Once data is represented according to the BREAL model, the next business function deals with "Modeling and interpretation". In WPB, this is achieved using just the BREAL service "Data linking and enriching". WPB implements one "Linking data" component whose goal is to bring together offers that seem to deal with the same job vacancy. That is accomplished through deduplication and record linkage methods.

The final "Shape output" business function is implemented by three components : one to "Enrich register units", and the other to build the output : "Small simulations", "Benchmark comparisons" and "Result validation".

The first action to shape the output consists in enriching the register units. In addition, this material will be used by the output-building service. Enriching register units is convenient because


the job offers mention the company name, or it can be a result of previous steps (in the case of publication by a recruitment firm, for example).

To shape the output, WPB finally uses the three components named above. As a single pass through the whole process may not be successful or may yield poor quality, it is useful to rerun parts of it. These three last components provide information and functions to measure quality, in order to validate the output or to modify parameters for some of the business processes.

4.1.2 Information architecture for WPB

This section briefly explains the structure of the information used in WPB, according to the BREAL representation, which distinguishes three layers: the raw data layer, the convergence layer and the statistical layer.

In the online job vacancy process, the raw data layer, in a simplified way, contains only a Job Portal entity as a data resource. The data acquisition, based on web scraping or crawling and API querying, needs only the job portal as a data resource.

The convergence layer shapes the data that comes from the raw data layer but is not yet usable in a statistical approach. In this way, the online job vacancy data can be seen in a convergence state with three parts: identification, core and derived variables.

The identification part deals with an ID number and some information on the portal URLs. The core part mainly relates to the job title and the location. All other elements taken from the job vacancy offer are considered derived variables. Required and optional skills, descriptions of tasks, wages and so on result from the data wrangling and data representation processes, which imply standardization, transformation and other computations.

The statistical layer gives a vision of the data that may be used for statistical purposes. As explained above, this data may need several loops of the process to reach its final state. The statistical layer groups the data needed to produce indicators: skills and statistics on skills, information on education degrees, and the flow and count of online job offers.


4.1.3 Operational model for WPB

The goal of this section is to describe the solution architecture for processing OJA data from an operational point of view, that is, describing where the most-used data are located (local, European or worldwide level) and whether the services are autonomous, interoperable, replicated or shared between the NSIs.


The figure above represents the OJA operational model. The different blue colors indicate the state of the services (autonomous, etc.), while the purple ones represent the data availability level. This figure shows the implementation version that has been achieved within work package B.

Looking to the future, a target should be that the services linked to "Acquisition and recording", "Data wrangling" and "Data representation" are shared across the NSIs. What is notable about these three business functions is their independence from the aim of the process. Whatever the data you want to work on, data acquisition and recording are the same: the data needed could change the parameters used for scraping or website retrieval, but the technical background is the same.

The same logic applies to data wrangling. Data filtering, language detection, standardization and so on are technical phases, independent of the aim of the statistical process, and could be shared and used for other purposes.

For data representation this is less evident, although creating indicators or classifying data are much the same whatever the goal of the statistical process, provided that the parameters can be well adjusted.

This point of view resonates with the Eurostat grant call for the development of a web intelligence network. It could make it possible to share services hosted on the Eurostat web intelligence hub, provided NSIs help test and develop these services, contributing their own experience.

An effective operational model also relies on some organizational aspects. The big data life cycle needs to be implemented in a way that takes collaboration across heterogeneous teams and technologies into account. In addition, using machine learning (ML) effectively implies that ML models are reusable and interoperable.


4.2 Solution architectures for WPC

To be filled.

4.2.1 Application architecture for WPC

The purpose of this section is to describe the generic application solution architecture for OBECs' use cases URLs Inventory of enterprises and Variables in the ICT usage in enterprise survey (Figure 2 and Figure 3, respectively). The description is based on the BREAL Application architecture.

Figure 2 - Generic graphical representation of Application architecture of OBECs’ Use Case URLs Inventory of enterprises.

Figure 3 - Generic graphical representation of Application architecture of OBECs’ Use Case Variables in the ICT usage in enterprise survey.

The Business processes and functions layer describes the OBECs' business functions derived from the BREAL model. Every business function triggers the next one. Every business function is served by one or more application services. Every application service is realized by one or more application components.

The application components could be programs, modules, scripts, classes or functions. They could be stand-alone or part of a system. They could be written in different programming languages. The OBEC application components for the use cases URLs Inventory of enterprises and Variables in the ICT usage in enterprise survey are as follows:

Scraping application component


This application component realizes the extraction of data from websites. The data has a known structure when the sources are few in number (specific sites, search engines or APIs); it has an unknown structure when the sites are numerous. The component provides structured and unstructured data retrieval in the case of the URLs Inventory, and unstructured data retrieval in the case of the ICT Variables. The services realized by this component serve the Acquisition and Recording process of OBEC.

Create URLs inventory application component

This application component realizes the creation of a candidate list of URLs of enterprises. The component provides a data storage service; the data store can be a relational DB or NoSQL. The services realized by this component serve the Acquisition and Recording process of OBEC.

Create Document Base of Scraped Data application component

This application component realizes the storage of scraped data obtained from websites, search engines and APIs. The component provides a data storage service; the data store can be a relational DB or NoSQL. The services realized by this component serve the Acquisition and Recording process of OBEC.

Web Data Preparation application component

This application component realizes the transformation of scraped data into a predefined format for further processing. In addition, in the case of the URLs Inventory, it structures the unstructured scraped data. The services of this component bring data into a machine-usable format, filter relevant parameters out of unstructured data, and standardize the data. In addition, in the case of the URLs Inventory, they derive additional variables from the data and add metadata such as formats or classifications. The services realized by this component serve the Data Wrangling process of OBEC and, in the case of the URLs Inventory, also the Data Representation process.

TD Matrix application component

This component realizes the structuring of unstructured scraped data in the case of the ICT Variables. The services of this component derive additional variables from the data and add metadata such as formats or classifications. The services realized by this component serve the Data Representation process of OBEC.
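One hedged rendering of such a component is a term-document matrix built with scikit-learn, from which candidate ICT variables (e.g. the presence of e-commerce terms) could later be derived; the page texts are placeholders.

```python
# Term-document matrix over placeholder page texts (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer

pages = [
    "we offer web sales and e-commerce",
    "the company provides cloud services",
    "contact us for web design",
]
vectorizer = CountVectorizer()
td_matrix = vectorizer.fit_transform(pages)   # documents x terms, sparse
terms = vectorizer.get_feature_names_out()
print(td_matrix.shape, list(terms)[:5])
```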

Build Training and Test Set application component

This component realizes the building of training and test sets for the machine learning method chosen to process the data. The services of this component apply the machine learning methodology. The services realized by this component serve the Modelling and Interpretation process of OBEC.


Train, Test and Validate application component

This component realizes the preparation and validation of the machine learning model. The services of this component apply the machine learning methodology. The services realized by this component serve the Modelling and Interpretation process of OBEC.

Apply ML application component

This component realizes the application of the prepared machine learning model. The services of this component apply the machine learning methodology. The services realized by this component serve the Modelling and Interpretation process of OBEC.

Perform Linkage application component

This component realizes the linking of data with the Statistical Business Register or other sources in the case of the URLs Inventory. The services of this component enrich the data on enterprises with data from other sources and prepare the results for the production of statistics. The services realized by this component serve the Modelling and Interpretation and Shape Output processes of OBEC.

Enrich Register Units application component

This component realizes the preparation of statistical outputs from the data. The services of this component prepare the results for the production of statistics. The services realized by this component serve the Shape Output process of OBEC.

4.2.2 Information architecture for WPC

The purpose of this section is to describe the generic information solution architecture for the OBEC use cases URLs Inventory of enterprises and Variables in the ICT usage in enterprise survey (Figure 4 and Figure 5, respectively). The description is based on the BREAL Information architecture.


Figure 4 - Information solution architecture for OBECs’ Use Cases URLs Inventory of enterprises.

Figure 5 - Information solution architecture for OBECs’ Use Cases Variables in the ICT usage in enterprise survey.


The Raw data Layer describes the OBECs’ initial data sources in terms of BREAL. It covers the Acquisition and Recording process of OBEC.

The Convergence Layer describes the information entities derived from source data through Data Wrangling and Data Representation processes of OBEC.

The Statistical Layer describes the information entities derived from Convergence Layer information sources through the Modelling and Interpretation and Shape Output processes of OBEC.

Enterprise Website entity

The main data source for OBECs’ use cases are the enterprises’ websites. The websites are structured in HTML format, but due to the large number of sites with different HTML structures and the placement of relevant information in a text, this data source generally provides unstructured information.

Generally, the enterprise characteristics can be found on the first page of the website, but for better coverage it is recommended to look at least at the second-level pages. The web pages can be stored as HTML, text or a term matrix in files, SQL or NoSQL stores. The volume of stored information can vary according to the format or country size, from hundreds of MB to hundreds of GB.

The characteristics of an enterprise on its site are relatively stable from year to year, although the content can change daily or become outdated. As a whole, this OBEC data source is at rest for at least several months or years.

Search Engines, APIs, Yellow Pages entity

These OBECs’ data sources are of well-known structure, due to their small number. They usually provide the information in structured formats like JSON and XML or semi-structured like HTML. The search engines are usually well-known international ones and in some cases, a country specific could be used. The Yellow Pages are usually country specific ones and in some cases, international could be used. Search Engines, Yellow Pages, Government registers or Business organisation registers could provide APIs as OBEC data sources.

The volume of information from these data sources is generally small. It is usually structured and saved in CSV, JSON, XML or SQL format in files, SQL or NoSQL stores.

With respect to velocity, the information from these data sources is at rest for at least several months or years.

Business register entity

This entity describes the identification data source of OBEC. Usually this is the Statistical Business Register (SBR), which stores the information of interest about the enterprises, such as ID, name, postal address, phone, URL and e-mail. In general, the SBR is controlled by the NSI. It is well structured and version-controlled, and it is used for the traditional statistical business surveys.

Survey data entity

This entity describes the core information that we know about the units of interest of OBEC. The core information comes from traditional statistical surveys, administrative sources and classification systems. It is used to give structure and meaning to the OBEC data gathered from the Raw data Layer sources. The information is well structured. It can consist of traditional statistical indicators, classifications like NACE and NUTS, term matrix words, machine learning training sets, blacklists, etc.

OnlineBasedEnterprise entity

This entity describes the information derived from the data we already have about the units of interest of OBEC. This can be score vectors, a term matrix, filtered data sets, etc.

Indicator on Internet presence entity

This entity describes the statistical information obtained through implementation of the process for the OBEC use case URLs Inventory of enterprises. The main answer is whether the enterprise is present on the internet. This information serves as core input for the OBEC use case Variables in the ICT usage in enterprise survey.

The information is processed by implementing the chosen methodologies, such as machine learning methods, and is then aggregated and classified. The results are statistical indicators on enterprises' presence on the internet by enterprise size, NUTS level, NACE level, number of employees, etc.

Indicators on estimated variables entity

This entity describes the statistical information obtained through implementation of the process for the OBEC use case Variables in the ICT usage in enterprise survey.

The information is processed by implementing the chosen methodologies, such as machine learning methods, and is then aggregated and classified. The results are statistical indicators on enterprises' e-commerce, job advertisements, social media and other characteristics of their internet presence, by enterprise size, NUTS level, NACE level, number of employees, etc.

4.2.3 Operational model for WPC

There are two cases of operational models for the application services running on an infrastructure platform, for each of the OBEC use cases URLs Inventory of enterprises and Variables in the ICT usage in enterprise survey (Figure 6, Figure 7, Figure 8, Figure 9).


There are likewise two cases of operational models for the data sources used on an infrastructure platform, for each of the OBEC use cases URLs Inventory of enterprises and Variables in the ICT usage in enterprise survey (Figure 10, Figure 11, Figure 12, Figure 13).

Figure 6 - Sharing application services on a local platform of NSI for OBECs’ Use Case URLs Inventory of enterprises.

All application services of the URLs Inventory could be autonomous on a local platform of an NSI. All application services except Data linking and enriching and Data exchange could be interoperable or replicated on a local platform of an NSI for the OBEC use case URLs Inventory of enterprises.

Figure 7 - Sharing application services on a shared platform for OBECs’ Use Case URLs Inventory of enterprises.


All application services of the URLs Inventory could be autonomous on a shared platform. All application services except Data linking and enriching and Data exchange could be interoperable or shared on a shared platform for the OBEC use case URLs Inventory of enterprises.

Figure 8 - Sharing application services on a local platform of NSI for OBECs’ Use Case Variables in the ICT usage in enterprise survey.

All application services of Variables in the ICT usage in enterprise survey could be autonomous on a local platform of an NSI. All application services except Data exchange could be interoperable or replicated on a local platform of an NSI for the OBEC use case Variables in the ICT usage in enterprise survey.

Figure 9 - Sharing application services on a shared platform for OBECs' Use Case Variables in the ICT usage in enterprise survey.

All application services of Variables in the ICT usage in enterprise survey could be autonomous on a shared platform. All application services except Data exchange could be interoperable or shared on a shared platform for the OBEC use case Variables in the ICT usage in enterprise survey.


Figure 10 - Sharing data sources on a local platform of NSI for OBECs’ Use Case URLs Inventory of enterprises.

The data sources for OBECs’ Use Cases URLs Inventory of enterprises on a local platform of NSI are worldwide on the Raw Data Layer and they are local for the Convergence and Statistical Layers.


Figure 11 - Sharing data sources on a shared platform for OBECs’ Use Case URLs Inventory of enterprises.

The data sources for OBECs’ Use Cases URLs Inventory of enterprises on a shared platform are worldwide on the Raw Data Layer and they are local for the Convergence and Statistical Layers.


Figure 12 - Sharing data sources on a local platform of NSI for OBECs’ Use Case Variables in the ICT usage in enterprise survey.

The data sources for OBECs’ Use Cases Variables in the ICT usage in enterprise survey on a local platform of NSI are worldwide on the Raw Data Layer and they are local for the Convergence and Statistical Layers.


Figure 13 - Sharing data sources on a shared platform for OBECs' Use Case Variables in the ICT usage in enterprise survey.

The data sources for OBECs’ Use Cases Variables in the ICT usage in enterprise survey on a shared platform are worldwide on the Raw Data Layer and they are local for the Convergence and Statistical Layers.

4.3 Solution architectures for WPD

The aim of WPD is to implement the use of smart meter data for producing statistics in different areas, e.g. energy statistics of businesses, census statistics, household costs, tourism seasonality, or impact on environment. The implementation will include linking electricity data with other administrative sources for producing statistics of businesses and households, and identifying vacant living places or seasonal/temporary occupancy of living places.

To describe a solution architecture, five models were created:

General application architecture model of WPD

Specific application architecture model (Estonia's case)

General information architecture model of WPD

Deployment model for the general application model of WPD

Operational model of WPD (covered by the deployment model)


4.3.1 Application architecture for WPD

The purpose of this section is to describe the Smart Energy (WPD) application architecture using building blocks from the BREAL generic application architecture. The first model of this section shows the general business functions, application services and data objects. The second model is an example of a concrete implementation in Estonia, with business processes, application components and infrastructure components added.

Figure 14 - WPD Application Architecture

The WPD application architecture model shows five BREAL business functions and ten chosen application services, which are specialized by work-package-specific services:

Acquisition and Recording: External structured data retrieval

o Copy file

The first step is to receive and copy data from the hub or the electricity provider. Data can be made available through secure FTP, an API, or on an external storage device.

Acquisition and Recording: Data storing

o Load file to database

When all necessary files are in the production environment of the NSI, the next step is to load the data into a relational database (RDBMS).

Data Representation: Structure modeling

o Pseudonymize personal data

Pseudonymization of personal information: removing personal codes from the data set and adding NSI-specific identifiers.
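A minimal sketch of one possible pseudonymization approach, using a keyed hash to map personal codes to NSI-specific identifiers; the key, field names and record are hypothetical, and the actual service may use a lookup table instead.

# Sketch of replacing personal codes with deterministic NSI identifiers.
import hashlib
import hmac

SECRET_KEY = b"nsi-internal-key"  # hypothetical; would be managed securely

def pseudonymize(personal_code: str) -> str:
    """Deterministically map a personal code to an NSI-specific identifier."""
    return hmac.new(SECRET_KEY, personal_code.encode(), hashlib.sha256).hexdigest()[:16]

record = {"personal_code": "38501011234", "metering_point": "MP-42", "kwh": 356.0}
record["nsi_id"] = pseudonymize(record.pop("personal_code"))
print(record)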

Data Representation: Data encoding

o Geographical encoding

Geocoding of the data: adding an NSI-specific address information field to the database.

Data Wrangling: Data preparation, filtering & deduplication

o Filter data


Removal of duplicates and anomalies. If a company is associated with the same metering point several times, the duplicate entries are removed. Companies without an activity code and metering points with negative consumption are also removed.

Data Wrangling: Data aggregation

o Initial aggregation (daily and monthly)

Creating daily and monthly aggregations from the measurement data.
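The following pandas sketch illustrates both the filtering step above and this initial aggregation step; column names and values are hypothetical.

# Sketch of the "Filter data" and "Initial aggregation" services.
import pandas as pd

readings = pd.DataFrame({
    "metering_point": ["MP-1", "MP-1", "MP-2", "MP-2"],
    "company": ["A", "A", "B", "B"],
    "timestamp": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 00:00",
                                 "2020-01-01 01:00", "2020-01-02 00:00"]),
    "kwh": [1.2, 1.2, -0.5, 2.0],
})

# Filtering: drop exact duplicates and negative consumption.
clean = readings.drop_duplicates().query("kwh >= 0")

# Initial aggregation: daily and monthly totals per metering point.
daily = clean.groupby(["metering_point", clean["timestamp"].dt.date])["kwh"].sum()
monthly = clean.groupby(["metering_point", clean["timestamp"].dt.to_period("M")])["kwh"].sum()
print(daily, monthly, sep="\n")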

Data Wrangling: Data standardization

o Split data

The metering data is split into different categories (dwelling, business unit, building, land area).

Modeling and Interpretation: Data linking and enriching

o Weight and distribute co-consumption

Distributing the consumption of companies that are related to the same metering point: the consumption is divided by the number of companies, or weighted by the number of employees.
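A minimal sketch of the employee-weighted distribution, with hypothetical values:

# Sketch of "Weight and distribute co-consumption": the consumption of a
# shared metering point is split in proportion to employee counts.
import pandas as pd

point_kwh = 900.0  # consumption of one shared metering point
companies = pd.DataFrame({
    "company": ["A", "B", "C"],
    "employees": [10, 20, 30],
})

weights = companies["employees"] / companies["employees"].sum()
companies["kwh_share"] = point_kwh * weights
print(companies)  # A: 150.0, B: 300.0, C: 450.0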

Modeling and Interpretation: Statistical aggregation

o Final aggregation (by sector)

Grouping the electricity consumption data of companies by NACE code to find the overall electricity consumption of a business sector.

Shape output: Data exchange

o Compare with survey data

Comparison of companies’ declared consumption with survey data and, if possible, combining smart meters data with survey data to improve survey estimates.

Figure 15 - WPD Application Architecture. Realization in Estonia

The picture above shows the initial (pre-standardization) application architecture model that was created for the Smart Energy process of Statistics Estonia. The BREAL business functions are composed of business process steps, which are served by specific application services. The application components are divided into four main parts: data collection tools, geocoding service, data preparation tools and data processing tools. The components are served by RDBMS and Big Data Environment technology services.

The basis for creating this model is the mapped business process (using the BPMN 2.0 standard).

4.3.2 Information architecture for WPD

In this section, the general information architecture of Smart Energy is described using the BREAL Generic Information Architecture for Big Data (GIAB). There are three defined layers: the raw data layer, the convergence layer and the statistical layer.


Figure 16 - WPD Information Architecture Model

The Raw data Layer contains all necessary data resources that are acquired during the Acquisition and Recording phase. Many of the Nordic countries have adopted a setup where data is gathered within a central institution, which manages a Data Hub. All data related to a metering point is collected and stored centrally. Hub data contains information on metering points, customers, agreements and the consumption/production of energy.


The Convergence Layer contains data represented as the units of interest for the analysis. The main focus objects for WPD are households, business units and dwellings. As additional resources, the business register and weather data are used. The Data Representation and Data Wrangling business functions and the corresponding application services are responsible for creating and moving data in this layer.

The Statistical Layer includes those concepts that are the targets of the analysis. The aim of WPD is to develop functional production prototypes from smart meter data. During the project, implementation procedures for the following statistical products are being produced:

Electricity statistics of businesses, by sector

Electricity statistics of households

Identifying vacant or seasonally vacant dwellings by new estimation models

The Modeling and Interpretation and Shape Output business functions are used to operate on data in this layer.

4.3.3 Operational model for WPD

The current Smart Energy solutions are deployed locally: all of the data is local data and the technological platform is a local platform of the NSI. Half of the application services are autonomous services operated without coordination with other NSIs; they are country-specific implementations. Those services that support the Data Wrangling and Modeling and Interpretation BREAL business functions are aligned between NSIs and are therefore marked as interoperable services.

Figure 17 - WPD Deployment Model

4.4 Solution architectures for WPE

The aim of work package WPE is to develop functional production prototypes using the potential wealth of data from the Automatic Identification System (AIS), an automatic tracking system on ships for safety and management purposes. This data source contains detailed information on different aspects of a ship, such as identity, location, course and speed, providing information to improve current statistics and to generate new statistical products.

WPE sets up procedures and develops technical solutions for the implementation of AIS for statistics on inland waterways, port visits, fishing fleet behaviour, and energy use and emissions. The next paragraphs describe the information architecture, application architectures and deployment models of the various solutions, using the BREAL standard. These architectures and models were created in close cooperation with the people working on WPE.

4.4.1 Application architecture for WPE

As mentioned, work package WPE actually consists of four subprojects and various products. Luckily, all these products can be created using a similar architecture. Of course, each product has its specifics, which need to be supported by the application services.

Inland waterways statistics: application architecture

For the AIS solution to resolve the undercoverage of the Dutch inland waterway statistics, the application architecture is shown below.

Figure 18 - WPE Application Architecture Inland Waterways

As can be seen, the solution architecture can be described by using only six BREAL application services. The specific services derived from these BREAL application services are described below, including a high-level explanation of the process. For more information on the process, please read Work package E: Tracking ships, Deliverable E3: Interim technical report concerning the conditions for using the data, the methodology and the procedures.

Acquisition and Recording: External structured data retrieval

o Receive and copy AIS data


The first step in the process is to receive the AIS data. The national AIS data is provided by Rijkswaterstaat, the part of the Dutch Ministry of Infrastructure and Water Management responsible for the design, construction, management and maintenance of the main infrastructure facilities in the Netherlands. The data received amounts to 22 GB per week and is currently provided on a hard disk; this will of course change in the near future. The file is copied from the hard disk to the production environment of Statistics Netherlands.

Data Wrangling: Data preparation, filtering and deduplication

o Decode and filter AIS data

The next step is to decode and filter the AIS data. The data is in a specific technical format, and there are dedicated Python libraries, such as libAIS, to decode the data into a friendlier machine-readable format. The raw AIS data is decoded and immediately filtered, resulting in two files: a static file (record type 5) and a dynamic file (record types 1, 2, 3 and 21). All other records are ignored. Only the variables MMSI, speed and location are selected.
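The sketch below illustrates this split into static and dynamic files; the decoding itself is only stubbed, since it is delegated to a library such as libAIS, and the exact decoded field names are assumptions.

# Sketch of the "Decode and filter AIS data" step. The decode_message
# function is a hypothetical placeholder for a decoding library call.
# Type 5 goes to the static file; types 1, 2, 3 and 21 go to the dynamic
# file; everything else is dropped, and only MMSI, speed and location kept.

def decode_message(raw: str) -> dict:
    """Placeholder for a library call that decodes one raw AIS sentence."""
    raise NotImplementedError("use an AIS decoding library, e.g. libAIS")

STATIC_TYPES = {5}
DYNAMIC_TYPES = {1, 2, 3, 21}

def split_records(decoded_messages):
    static, dynamic = [], []
    for msg in decoded_messages:
        if msg["type"] in STATIC_TYPES:
            static.append(msg)
        elif msg["type"] in DYNAMIC_TYPES:
            # keep only the variables the process needs (names assumed)
            dynamic.append({k: msg.get(k) for k in ("mmsi", "speed", "lat", "lon")})
    return static, dynamic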

Data Wrangling: Data standardisation

o Split per ship

Now that only the required data has been selected and filtered, the data is reorganised for further processing. The next steps are performed on a per-ship basis, and therefore the static and dynamic files are combined and split per ship.

Modelling and Interpretation: Data linking and enriching and Methodology application

o Derive ship, derive port and derive ship journey

This is the main part of the process. Here, each ship file is processed (this can be done in parallel) and the ship, the journeys and the ports are derived. First, the ship is identified based on the MMSI, and all kinds of characteristics are retrieved from the ship register. Then, the journeys are derived based on the speed: when the speed is below 0.2 knots, it is assumed that the ship has stopped at a port. These stops are checked using the port register. If a stop is not at a port, for example because the ship is lying still for an overnight stay, the journeys are combined. At the end of the process, we know the ship and all the journeys it made from port to port.
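A simplified sketch of this journey-derivation heuristic follows; the record layout and the is_port check are hypothetical, and the real methodology handles more cases.

# Sketch of deriving journeys: a ship counts as stopped when its speed
# drops below 0.2 knots; only stops at ports end a journey, so overnight
# stays outside ports are skipped and the journey continues.
STOP_SPEED_KNOTS = 0.2

def derive_journeys(positions, is_port):
    """positions: time-ordered dicts with 'speed' and 'location' keys."""
    journeys, current = [], None
    for p in positions:
        if p["speed"] < STOP_SPEED_KNOTS:
            if current is not None and is_port(p["location"]):
                current["to"] = p["location"]
                journeys.append(current)
                current = None
            # a stop outside a port does not end the journey
        elif current is None:
            # first movement after a stop: journey starts near the last port
            current = {"from": p["location"]}
    return journeys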

Shape Output


o Match and merge ship journeys

As mentioned, AIS is used to resolve the undercoverage in the current Dutch statistics. The inland waterway (IVS) production system already derives all ships and their journeys using other data sources. All journeys from AIS are compared with the ones in IVS and the missing ones are added. Match and merge may seem a simple process, but the linking is methodologically challenging.

o Match and merge goods

For the inland waterway statistics, we also need the goods that were transported. This information is not derived yet and therefore this step is not yet implemented. Based on the depth of a ship, it might be possible to derive the tonnage transported, and based on the type of ship and the specifics of the port, it might be possible to derive the type of goods transported. Whether this is feasible is still under investigation.

Once all data from AIS has been processed and merged into the IVS production system, the regular process can continue. This is business as usual.

Inland waterways statistics: deployment model

The described AIS solution resolves the undercoverage of the Dutch inland waterway statistics. Because that is a local problem and local data from Rijkswaterstaat is used, the solution is deployed locally. The deployment model is as follows:

Figure 19 - WPE Local Deployment Model Inland Waterways

Even though the solution is deployed locally, many application services are replicated services. These services, shown in dark blue, are generic and can be used in other AIS solutions, as we will see when discussing the port visits and fishing fleet solutions. The match and merge services are local solutions, which depend on the local inland waterways production system.

Port visits statistics: application architecture


The main objective of the port visit solution is to create the so-called F2 table, which is mandatory and contains the number of port visits of ships to load or unload cargo in the main European ports. The number of visits at port level is broken down by port and by type and size of vessel (loading or unloading cargo, embarking or disembarking passengers). The solution architecture is shown below.

Figure 20 - WPE Application Architecture Port Visits

Many of the application services described in the solution architecture of the inland waterways are also used for port visits. Note, however, that the implementation of the application services might differ, because different technology is used (R versus PostgreSQL). The application services introduced for port visits are as follows:

Data Wrangling: Data preparation, filtering and deduplication

o Filter port area (rough)

Because we are only interested in ships loading and unloading, we only need the AIS messages of ship positions that are close to a port (for example, Piraeus in Greece). For each port, the location is specified by a rectangular area. This area is used to filter the AIS messages.
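A minimal sketch of this rectangular filter follows; the example coordinates for Piraeus are illustrative only.

# Sketch of the rough port-area filter: keep only AIS position messages
# that fall inside a port's rectangular area (coordinates hypothetical).
PORT_AREAS = {
    "piraeus": {"lat_min": 37.90, "lat_max": 37.97,
                "lon_min": 23.57, "lon_max": 23.65},
}

def in_port_area(msg: dict, port: str) -> bool:
    a = PORT_AREAS[port]
    return (a["lat_min"] <= msg["lat"] <= a["lat_max"]
            and a["lon_min"] <= msg["lon"] <= a["lon_max"])

messages = [{"mmsi": 123456789, "lat": 37.94, "lon": 23.61, "speed": 3.1}]
near_port = [m for m in messages if in_port_area(m, "piraeus")]
print(near_port)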

o Split per ship category

As mentioned, the F2 table is broken down per type of vessel, specifically cargo and passenger ships. These ships also use different terminals. In this step, the remaining AIS messages are split up into cargo and passenger ships for further processing. Note that this step was introduced to ease the analysis of the results of the prototype (focusing first on cargo and later on passenger ships) and might not be needed in the final solution.

Modelling and Interpretation: Methodology application

o Derive arrival or departure


Per port visit, it needs to be determined whether it is an arrival or a departure. This can be derived using the ship's direction (changes in latitude) and speed (reduction of velocity).
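A minimal sketch of that rule, using two consecutive position reports; this simplifies the actual methodology.

# Sketch of classifying a port visit as arrival or departure: the ship is
# arriving if its latitude is converging on the port and it is slowing down.
def classify_visit(prev: dict, curr: dict, port_lat: float) -> str:
    moving_towards_port = abs(curr["lat"] - port_lat) < abs(prev["lat"] - port_lat)
    slowing_down = curr["speed"] < prev["speed"]
    return "arrival" if moving_towards_port and slowing_down else "departure"

print(classify_visit({"lat": 37.99, "speed": 8.0},
                     {"lat": 37.95, "speed": 2.0}, port_lat=37.94))  # arrival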

Modelling and Interpretation: Statistical aggregation

o Aggregate per ship

Finally, the F2 table is created per port by calculating the arrivals of each vessel, per day, for the whole month.

Port visits statistics: deployment model

The port visit solution can be deployed locally or as a shared solution. Both deployment models will be shown for comparison.

When deployed locally, national AIS data can be used. This data is in general more fine-grained than European or global data: for example, a ship's position is transmitted very frequently (every 10 seconds), but is filtered by the data providers; EMSA filters the data to a 6-minute interval. While developing the prototype, ISTAT used national AIS data provided by the Hellenic Coast Guard. In a shared solution, EMSA data could be used.

When deployed locally, any port or ship register can be used. While developing the prototype, ISTAT used the IHS Markit Ships Register. In a shared solution, a ship register with global coverage is required, like IHS Markit's or Lloyd's. Also, a port register is needed, containing all European ports with their coordinates (the rectangular area needed for filtering port visits), and preferably including information on the terminals, the type of cargo supported (containers, bulk or even people) and their coordinates. Because such a register with global coverage does not yet exist, the required information is gathered locally.

Figure 21 - WPE Local Deployment Model Port Visits

As can be seen in the local and shared deployment models, most application services are shared services and can thus also be used locally as replicated services. Only the application services "Receive and copy AIS data" and "Derive ship" are depicted as integrated services, because they might require some local integration when local data and a local ship register are used.

Figure 22 - WPE Shared Deployment Model Port Visits

Fishing fleet statistics: application architecture

The main objective of the fishing fleet solution is to derive some experimental statistics on the behaviour (activity and traffic) of fishing vessels. The solution architecture is shown below.

Figure 23 - WPE Application Architecture Fishing Fleet

Many of the application services described in the solution architectures of the inland waterways and port visits are also used for the fishing fleet solution. The application services introduced for the fishing fleet solution are as follows:

Acquisition and Recording and Data Wrangling

o Scrape fishing fleet data and Enrich fishing fleet register


A fishing fleet register of good quality is a prerequisite for the fishing fleet solution. Each EU country is obliged to maintain its national fishing fleet register. Even though it is a good basis for maintaining an up-to-date reference frame of the fishing fleet, a lot of information was unfortunately missing. This missing data was collected using scraping techniques and used to enrich the reference frame.

Data Wrangling: Data preparation, filtering and deduplication

o Filter fishing ships and Remove outliers

Only AIS messages of fishing ships are needed. Using the fishing fleet register, all messages of other ships are filtered out. Outliers (like locations on land) are also removed.

Modelling and Interpretation: Methodology application

o Derive fishing port of ship

What is specific to fishery is that usually each fishing boat is "assigned" to one fishing port.

o Measure fishing behaviour

Combining ports’ coordinates and ships’ AIS messages, we are able to derive how many times a day, for how long a fishing vessel is active or in the home port.

The algorithms to measure fishing behaviour at sea are still under development. It is possible to differentiate between fishing and travelling: routes involving fishing activity take the shape of zigzags, while the routes travelled to reach a fishery are generally straight lines. When fishing, a fishing vessel sails at a slow speed but changes course quite often.
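The sketch below illustrates this heuristic as a simple rule over speed and course variability; the thresholds are hypothetical, and a real implementation would use circular statistics for course angles.

# Sketch of the fishing-versus-travelling heuristic: fishing legs are slow
# and zigzagging, travel legs are faster and straighter.
import statistics

def classify_leg(speeds, courses, speed_limit=4.0, course_std_limit=20.0):
    """speeds in knots, courses in degrees, for one track segment."""
    slow = statistics.mean(speeds) < speed_limit
    zigzag = statistics.stdev(courses) > course_std_limit
    return "fishing" if slow and zigzag else "travelling"

print(classify_leg(speeds=[2.5, 3.0, 2.8], courses=[10, 80, 20]))  # fishing
print(classify_leg(speeds=[9.0, 9.5, 9.2], courses=[45, 46, 44]))  # travelling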

Modelling and Interpretation: Statistical aggregation

o Aggregate results

Finally, the activity of the fishing fleet (the time a fishing vessel is out of port) and the traffic of the fishing fleet (the number of fishing vessels in fishing areas) are derived by aggregating the detailed results. Furthermore, several visuals can be created, like heatmaps of the fishing ports, fishing areas and traffic intensity.

Fishing fleet statistics: deployment model

The fishing fleet solution was developed with deployment on a shared environment in mind. The solution uses AIS data from EMSA. A ship register containing the fishing fleet and a port register containing the fishing ports with their geographical coordinates are needed. Both registers should have at least European coverage.


Figure 24 - WPE Shared Deployment Model Fishing Fleet

Emissions and energy used

At the time of writing this document, the solution for the emissions and energy used was still under development. It uses many of the application services already described; however, there are also some specific differences. In order to derive the emissions and energy used, AIS data with worldwide coverage is needed.

Emissions are attributed to the country that produced them. For ships, this is derived from the nationality of the ship (based on the flag or the operating company). So, regardless of where in the world the ship is sailing, the emissions belong to the country of the ship's operating company.

Because the nationality of the ship is so important, we start by deriving the list of applicable ships. Nationality by flag is derived using the ship register; nationality by operating company is derived using the business register. For the Dutch case, a list of around 1,700 ships for 2017 was derived.

Using the list of applicable ships, the AIS data is filtered at an early stage in the process.

The main methodology of the solution is the calculation of the emissions and energy used. These should be calculated based on the mileage, speed, depth, type of engine, and many more variables; possibly, even weather information is needed. This part of the solution is still under development.


4.4.2 Information architecture for WPE

The BREAL information architecture consists of three layers: the raw data layer, the convergence layer, and the statistical layer. In Figure 25 the WPE information architecture is shown, with the three layers and the information objects.

Figure 25 - WPE Information Architecture

Raw data layer

The raw data layer contains all the raw data needed to create the required statistics. In the case of work package WPE, the most important raw data is the AIS (Automatic Identification System) data in its purest form: the messages sent between the AIS transponders and the AIS antennas (both land-based and satellite). This data is gathered by various parties, at national, European and worldwide level.

Note that, as a statistical office, we do not get this purest level of raw data, because the parties involved already pre-process the data. For example, the data at national level is more detailed than at European and worldwide level, because in the latter cases the data is already filtered and combined (grouped). This is not a problem, as long as one knows what type of pre-processing was applied. In the ideal situation, the statistical office gets the data as specified in the convergence layer. There, the data is defined at exactly the right level to produce all required statistics and, preferably, all details only needed by the domain experts in the field are no longer present (hiding the complexity).

In order to produce the fishing fleet statistics and emission statistics, it is ideally also required to receive weather data. One needs to know whether and where the fishing fleet was able to fish, and how much energy was needed to travel given the water and wind conditions. Because at the time of writing we did not yet know how we would receive this data, nor what it would look like, we placed it in the raw data layer. If the data is provided as required, it is actually part of the convergence layer.

When comparing and discussing the information architecture with our colleagues from work package WPD, which also deals with sensor data, namely smart (energy) meters in buildings, we discovered that we both need weather data and that neither of us had this data yet. This shows how the architectures can help, and how they can result in combining our efforts.

Convergence layer

The convergence layer contains the data at a level that is sufficient to create all required statistics. The data is represented as units of interest. In the case of AIS, our main focus is the ships (at least to be able to identify them) and the position and movements (direction, speed and depth) of these ships. For example, looking at the raw AIS data, which consists of 27 different types of messages, only message type 5 is important to identify the ship, and message types 1, 2, 3 and 21 are important to derive the position and movement.

Besides the actual data, we also need some registers to enrich the data. In case of WPE, we need the business register, the ship register and a port register.

The ship register (for example, Lloyd's Register) is needed to get the ship details, like type of ship, type of engine, type of cargo, owning company, and flag. This information is needed to create correct statistics. For example, for the emission statistics, the nationality and/or the flag of a ship determines whether the emission should be counted or not (thus, not whether it travelled in national waters).

The port register is needed to know where the ports are (physical location) and what type of ports they are (what type of cargo is supported). Because such a port register does not exist in all countries, it might be necessary to derive it from the AIS data itself.

Statistical layer

The statistical layer contains all concepts needed for the required statistics. Work package WPE actually consists of four subprojects and various products (concepts):

The purpose of the port visits statistics is to create the mandatory F2 dataset, containing the port visits.

For the inland waterway statistics in the Netherlands, the AIS data is used to resolve the undercoverage in the current statistics. For this, the journeys of ships are needed. Actually, the cargo transported is needed, but this information is not part of AIS as such.

For the emissions and energy used, the emissions of ships need to be derived.

For the fishing fleet statistics, the traffic and activity (behaviour) of fishing ships is needed.

Of course, many more statistics and concepts might be needed in the (near) future. In the ideal case, we only need to derive a new concept in the statistical layer, using the units provided in the convergence layer. This would imply that we have defined the convergence layer at the right level.


Of course, we are not able to look into the future, so there will be cases where we have to extend the units in the convergence layer, or create new ones.

In a more formal notation, the WPE information architecture is as follows:

Figure 26 - WPE Information Architecture, a more formal model

4.4.3 Operational model for WPE

Combining the application services of the various AIS solutions per BREAL application service, we get the following overview. Here, one can see that most application services can be shared, or at least replicated if it is desired to run these services locally. Only two services are local, because their results have to be integrated with the local inland waterway production system. One service requires local integration if local AIS data is used; however, when EMSA data or global data is used, this service can be shared.

Figure 27 - WPE Combined Deployment Model


The combined deployment model shown implies that the AIS solution can be executed for all NSIs on a shared platform, like the DataLab. However, in order to do so, there are some prerequisites:

AIS data with European coverage, like the data from EMSA, should be available on the shared platform. Note, however, that for the emissions and energy used, global data is required.

A ship register with global coverage, like for example Lloyd's, should be available on the shared platform. Each solution needs such a ship register and, for now, each solution has implemented its own approach; in most cases, local registers or registers from local partners were used. For the fishing fleet solution, scraping was used to extend and enrich the ship (fishing fleet) register.

A port register with European coverage should be available on the shared platform. This register should at least contain all European ports with their coordinates, preferably including information on the terminals, the type of cargo supported (containers, bulk or even people) and their coordinates.

5. Conclusions and future work

To be filled. (Monica)

References

To be filled.
