Grant Agreement N°825619
Page 1 of 20
AI4EU Deliverable D2.7
Interoperability design and implementation choices reference
document
WP 2 Platform design and implementation
Task 2.6 Interoperability and AI European partnering platforms
Dissemination level1: PU | Due delivery date: 31/08/2019
Nature2: R | Actual delivery date: 05/09/2019
Lead beneficiary IDI
Contributing beneficiaries KNO, BSC, TAS, IDSA and FHG
Document Version | Date | Author | Comments3
1.0 | Jul 31 | S. Marcel, A. Anjos and S. Gaist (IDI) | BEAT
1.1 | Aug 9 | M. Welß (FHG) | Acumos
1.1 | Aug 9 | D. Kowald, R. Kern and S. Kopeinik (KNO) | Added "European Data for AI" sections
1.1 | Aug 12 | D. Vincente (BSC) | HPC
1.2 | Aug 12 | S. Marcel, A. Anjos and S. Gaist (IDI) | Harmonisation and restructuring
1.3 | Aug 20 | M. Aubrun (TAS) | Mundi
1.3 | Aug 20 | S. Marcel, A. Anjos and S. Gaist (IDI) | Abstract, Introduction and Conclusion
1.3 | Aug 23 | S. Steinbuss (IDSA) | Input from IDSA
1.4 | Aug 27 | M. Aubrun (TAS) | Interoperability with Acumos
1.5 | Sep 3 | S. Marcel and A. Anjos (IDI) | Implementation of comments from reviewer 1
1 Dissemination level: PU = Public, PP = Restricted to other programme participants (including the JU), RE = Restricted to a group
specified by the consortium (including the JU), CO = Confidential, only for members of the consortium (including the JU)
2 Nature of the deliverable: R = Report, P = Prototype, D = Demonstrator, O = Other
3 Creation, modification, final version for evaluation, revised version following evaluation, final
AI4EU_D2.7_M8_vfinal
Page 2 of 20
Deliverable abstract
This deliverable describes the high-level interoperability requirements among the different AI
Resources available within AI4EU. Following a concertation meeting held during a workshop in
March 2019, we converged on a technical solution connecting the main AI4EU platform to different
AI Resources. AI Resource producers will be expected to export containers with data and software,
or brokers to access remote resources. AI Resource consumers shall be able to import these assets
according to End-User needs.
Deliverable Review
Reviewer #1: MICHELA MILANO — Reviewer #2: ..........................................

1. Is the deliverable in accordance with
   (i) the Description of the Action? Reviewer #1: Yes
   (ii) the international State of the Art? Reviewer #1: Yes

2. Is the quality of the deliverable in a status
   (i) that allows it to be sent to the European Commission? Reviewer #1: Yes
   (ii) that needs improvement of the writing by the originator of the deliverable? Reviewer #1: No
   (iii) that needs further work by the Partners responsible for the deliverable? Reviewer #1: No

* Type of comments: M = Major comment; m = minor comment; a = advice
1. Introduction
This deliverable describes the high-level interoperability requirements among the different AI
Resources available within AI4EU. Currently, Europe has a large variety of actors and stakeholders
providing data (e.g. Thales Alenia Space's Mundi, Bonseyes, ELG), software (e.g. Acumos,
IDIAP's BEAT) and computing infrastructures (e.g. BSC's HPC, CINECA). These assets are
referred to as AI Resources in AI4EU and in this deliverable.
This deliverable is complementary to D2.1 on the architecture of the AI4EU platform and D3.2 on
the AI4EU use cases. Therefore, its scope is restricted to presenting high level technical
descriptions of how AI4EU actors intend to interoperate with a common container-based standard.
AI4EU actors can be categorized as resource producers or consumers. Resource producers are
data and software providers that feed the AI4EU ecosystem. Resource consumers are
computing infrastructures that take data and software as input to produce models and predictions.
For example, in image classification (e.g. cats, dogs, cars, pedestrians, …) the input data consist
of a large set of images containing the objects of interest, each assigned a class label. To classify
a new image, an End-User will first create a model from the dataset using dedicated software
(training algorithm). The End-User may then use the trained model to infer the class label of a test
image. The AI Resources involved in this example are a dataset, software for training and
inference, and the trained model. Training and inference are executed at resource consumers while
resource producers provide data and software.
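The training/inference flow described above can be sketched in a few lines of Python. The nearest-centroid classifier and the toy feature vectors below are purely illustrative and are not part of any AI4EU component:

```python
# Minimal sketch of the train/infer split described above.
# "Images" are stand-in feature vectors; the classifier is a
# nearest-centroid model, purely for illustration.

def train(dataset):
    """Producer side: build a model (per-class centroids) from labelled data."""
    sums, counts = {}, {}
    for features, label in dataset:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def infer(model, features):
    """Consumer side: assign the label of the closest centroid."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=dist)

# Toy dataset: two classes separated along both features.
data = [([0.1, 0.2], "cat"), ([0.2, 0.1], "cat"),
        ([0.9, 0.8], "dog"), ([0.8, 0.9], "dog")]
model = train(data)
print(infer(model, [0.15, 0.15]))  # -> cat
print(infer(model, [0.85, 0.85]))  # -> dog
```

Here the dataset, the training algorithm and the trained `model` are exactly the three kinds of AI Resources named in the example.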
At the moment, the European landscape of AI Resources is highly heterogeneous and not
interoperable. As a consequence, the aim of this deliverable is to propose a common path towards
federating AI Resources in Europe. To reach that objective, we first introduce in Section 2 relevant
technical details of the main partner resources involved in WP2 T2.6. Next, in Section 3, we propose
a high-level technical solution connecting the different AI Resources. More precisely, AI Resource
producers will be expected to export containers with data and software, or brokers to access remote
resources. Moreover, AI Resource consumers shall be able to import these assets according to End-User needs.
2. State of the Art
a. Acumos
AI4EU operates its own instance of Acumos4 which can serve as a possible AI Resource Repository
of the AI4EU platform. In the first step, it will contain AI Resources carefully selected by AI4EU
experts such as:
● Trained Models that are ready to deploy,
● High quality datasets,
● Connectors and brokers that can be used in conjunction with the above models and datasets
to visually compose AI pipelines.
Acumos also includes a visual editor (Figure below), in which an AI Resource must expose an
interface description (i.e. in the protobuf format).
Visual editor in Acumos
Resources can either be onboarded as a deployable artefact or as a catalog entry that refers to an
external source. In a second step, the community is encouraged to add further resources, as well
as to comment on and rate the existing ones. All uploaded resources must undergo a well-defined
publication process to ensure the integrity and quality of the content; this comprises technical, legal
and ethical aspects of the resource and its content.
Acumos accepts AI Resources in the form of a Docker container that exposes an interface (i.e. a
protobuf specification5). As part of the onboarding process, the target execution environment (e.g.
x86_64 or HPC) must be specified, so that the resource can eventually be deployed to the cloud or
to the AI4EU HPC playground.
4 https://www.acumos.org/
5 https://developers.google.com/protocol-buffers/
b. Thales Alenia Space (TAS)
Existing Projects
Copernicus
Copernicus6 is the European Union's Earth Observation (EO) programme coordinated and managed
by the European Commission in partnership with the European Space Agency (ESA) and EU
Agencies. The objective of this programme is to provide global, continuous and high-quality EO
data in order to address global challenges in six thematic areas: atmosphere, marine, land,
climate, emergency and security. To achieve this objective, the European Commission has launched
major initiatives to:
● Produce data: the Sentinel programme, which consists in building EO satellites and setting
up ground segments to receive and process EO data. Currently, seven missions are being
developed by ESA, including radar and super-spectral payloads. Note that the Copernicus
Programme has adopted a free, full and open data policy for all information produced in the
framework of Copernicus.
● Access data:
○ Conventional Data Access Hubs: portals that provide free access to Copernicus
satellite data through an interactive graphical user interface
○ Data and Information Access Services (DIAS): Cloud-based platforms that provide
data and information access alongside processing resources, tools and other relevant
data.
Mundi
Mundi is one of the DIAS projects; it is executed by a consortium composed of 9 parties,
including Thales Alenia Space. As mentioned above, Mundi is a cloud-based platform that:
● Gives unlimited, free and complete access to Copernicus data and information, as well as
access to additional commercial satellite or non-space data sets
● Gives access to sophisticated processing tools
● Provides a scalable computing and storage environment for third parties, either individuals
or companies
● Allows third parties to offer advanced value-adding services integrating Copernicus with
their own data and tools to the benefit of their own users
● Provides adapted technical, business and functional support
Mundi platform is accessible via the following link: https://mundiwebservices.com.
Position in AI4EU project
In the AI4EU project, Mundi acts as a provider of EO satellite data. Note that no registration
is required to explore and view the EO satellite data, but users must be registered to download
them. The only condition to register is to have a valid email address.
6 https://www.copernicus.eu/en
c. BEAT platform from IDIAP (IDI)
BEAT7,8 is a framework for the definition, execution and comparison of software-based, data-driven
workflows that can be functionally subdivided into processing blocks. The user provides
the description of data exchange formats, algorithms, data flows (also known as toolchains) and
experimental details (parameters). The framework can execute the experiment locally or in a
computing infrastructure transparently. Results can be shared and compared via traditional
exchange mechanisms or by using a web-based platform.
The BEAT Platform and Framework were created as part of a pan-European project composed of
both academic and industrial partners in which one of the goals was the design and development
of a free, open-source online web-based platform for the development and certification of
reproducible software-based machine learning (ML) and pattern recognition (PR) experiments.
The main intent behind the web platform is to establish a framework for the certification and
performance analysis of such systems while still respecting the privacy and confidentiality of built-
in data and user contributions.
The BEAT Framework is, by design, task-independent, being adaptable to different problem
domains and evaluation scenarios. At the conceptual phase, the platform was set to support a
number of use cases, which we summarize below:
● Benchmarking of ML and PR systems and components: users should be able to program
and execute full systems so as to identify performance and computing requirements for
complete toolchains or individual components;
● Comparative evaluation: it should be possible to run challenges and competitions on the
platform as it is the case in similar systems such as Kaggle9;
● Certification of ML and PR systems: the platform should be able to attest on the operation
and performance of experiments so as to support the work of certification agencies or
publication claims;
● Educational resource: the platform shall be usable as an educational resource for
transmitting know-how about ML and PR applications. It should be possible to set-up
interest groups that share work assignments such as in a teacher-student relationship.
In the context of AI4EU, training and inference pipelines will be exported from the existing BEAT
Platform10, which makes it a producer of AI resources.
d. International Data Space Association (IDSA)
Today, there is a common understanding that data is of high value. Leveraging this value and
trading data creates huge revenues for the large data platform providers. Rarely do the creators of
data benefit from this value in an adequate way; often, only the costs for data creation and
management remain with them. Furthermore, many give their data away for free or pay with it for
the use of a service. Finally, others keep it for themselves without taking advantage of its value.
There is a need for vendor-independent data ecosystems and marketplaces, open to all at low cost
and with low entry barriers. This need is addressed by the International Data Spaces (IDS)
Association, a nonprofit organization with currently about 100 members from various industrial and
7 https://arxiv.org/abs/1704.02319
8 https://www.idiap.ch/software/beat
9 https://www.kaggle.com
10 https://www.beat-eu.org/platform
scientific domains. The IDS Association specified an architecture, interfaces and sample code for
an open, secure data ecosystem of trusted partners. The specification of the IDS Association forms
the basis for a data marketplace based on European values, i.e. data privacy and security, equal
opportunities through a federated design, and ensuring data sovereignty for the creator of the data
and trust among participants. It forms the strategic link between the creation of data in the Internet
of Things on the one hand and the use of this data in machine learning (ML) and artificial
intelligence (AI) algorithms on the other.
Digital responsibility is evolving from a hygiene factor to a key differentiator and source of
competitive advantage. Future data platforms and markets will be built on design principles that go
beyond our traditional understanding of cybersecurity and privacy. Based on strong data ethics
principles, the IDS Reference Architecture Model puts the user at its center, with trustworthiness
in ecosystems and sovereignty over data in the digital age as its key value proposition. IDSA
defines a reference architecture which supports the sovereign exchange and sharing of data
between partners independent of their size and financial power. Thus, it meets the needs of both
large enterprises and small and medium enterprises (SMEs). Further down the road, it may be
taken up by individuals as well. Whether the data concerned comes from IoT devices, on-premise
systems or cloud platforms, the IDSA aims at providing the standard for sharing data between
different endpoints while ensuring data sovereignty.
e. Barcelona Supercomputing Center (BSC)
The high-performance computing (HPC) infrastructures included in the project will be used as
potential execution platforms for selected trial projects that require thousands of processors and/or
accelerators to be completed in a reasonable time. These selected projects will initially be defined
and tested on the Acumos platform and later adapted and executed on the HPC infrastructures.
Most HPC infrastructures have some limitations in terms of supported applications, containers,
and security restrictions; all of these aspects will be evaluated and configured in task
T2.5 to provide an easy environment for porting projects from the AI4EU platform to the HPC
environments.
The HPC centers currently involved in the AI4EU project are the Barcelona Supercomputing Center
(BSC) in Spain and CINECA in Italy. In both cases, the infrastructures available at these two
supercomputing centers include general-purpose processors and accelerated machines with GPUs.
At BSC, the hardware available for the trial projects will be Nord3 for non-accelerated codes;
the hardware description of this machine is available online (Nord3 system configuration).
This cluster runs SUSE Linux Enterprise Server 11 SP3 and uses the LSF batch scheduler
(LSF documentation). The cluster currently supports Singularity containers in
version 2.4.2, but the BSC support team is working to support version 3.2.0 before the end of
September 2019.
For accelerated codes, BSC has two clusters to support different kinds of workflows, depending on
their requirements. One of them uses Power9 processors with NVIDIA V100 GPUs (MN4-Power9
system overview) and provides more than 1 PFlops of compute power; the other uses x86
processors and NVIDIA K80 GPUs (MinoTauro system overview) and provides more than
250 TFlops. On both machines, the supported container technology is Singularity version 3.2.0.
It is important to note that, for the Power9 machine, Docker images need to be built for the ppc64
architecture, which can complicate porting from Acumos; this infrastructure will therefore be used
only for very demanding projects where the porting is really beneficial in terms of performance.
f. Know Center (KNO)
Data-driven services are becoming an increasingly important aspect of the modern economy, with
data markets playing a pivotal role as a broker between stakeholders of data-driven ecosystems. As
one example, the Data Market Austria (DMA)11 is an initiative to create a digital ecosystem, i.e., a
multi-sided market for shared datasets and data-driven algorithms. Specifically, DMA takes the
role of being a central hub for a variety of actors participating in the (Austrian) data economy,
regardless of their industry sector. For successful collaborations in data markets, different actors
need to collaborate to create new solutions. Recommender services and their underlying
models thus take the role of matchmakers that discover and suggest potential new combinations
of users, datasets, and services.
Therefore, DMA is built upon the scalable recommendation-as-a-service framework ScaR12, which
implements important aspects of modern recommender systems. This includes functionality to:
● support different forms of metadata and interaction data.
● process and consider streaming data for the recommendation process in (near) real-time.
● scale the recommender system to be suitable for cloud based environments.
● combine (near) real-time recommender approaches with context dependent data.
To support these functionalities, ScaR follows the Microservice Architecture design pattern13
and uses Apache Zookeeper14 for scalability purposes. Furthermore, the high-performance
enterprise search platform Apache Solr15 is used as a database to allow for (near) real-time
recommendation and search functionality.
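ScaR itself is exposed as a set of microservices; as a purely illustrative sketch (not the ScaR API), the core idea of updating recommendations from streamed interaction data can be expressed as follows:

```python
from collections import Counter, defaultdict

class StreamRecommender:
    """Toy stream-based recommender: ingest interactions one by one and
    recommend the most popular items a user has not yet interacted with.
    Illustrative only; this is not the ScaR interface."""

    def __init__(self):
        self.popularity = Counter()          # global item popularity
        self.seen = defaultdict(set)         # items each user interacted with

    def record(self, user, item):
        """Process one streamed interaction in (near) real time."""
        self.popularity[item] += 1
        self.seen[user].add(item)

    def recommend(self, user, k=3):
        """Return up to k popular items the user has not seen yet."""
        return [item for item, _ in self.popularity.most_common()
                if item not in self.seen[user]][:k]

rec = StreamRecommender()
for user, item in [("u1", "datasetA"), ("u2", "datasetA"),
                   ("u2", "datasetB"), ("u3", "datasetC")]:
    rec.record(user, item)
print(rec.recommend("u1"))  # datasetB and datasetC, but not datasetA
```

A production system like ScaR adds context, metadata features and a scalable index (here Apache Solr) on top of this basic feedback loop.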
The ScaR framework was initially applied and evaluated in the course of DMA to interlink users,
datasets and algorithms16. However, the current implementation of the framework lacks so-called
gatekeeper functionalities that assess technical and scientific properties of potential datasets
prior to the recommendation and search process. The implementation of such gatekeeper
functionalities is Know-Center's contribution to T2.6 of AI4EU; it can be understood as a
controller of AI resources (e.g., datasets), as it creates a "European Data for AI" database with
recommendation and search services (see Section 3).
11 https://datamarket.at/en/
12 http://scar.know-center.tugraz.at/
13 https://microservices.io/patterns/microservices.html
14 https://zookeeper.apache.org/
15 https://lucene.apache.org/solr/
16 https://arxiv.org/abs/1908.04017
3. Results and Analysis
In sections (a) to (f), we provide examples of interoperability plans from AI4EU partners, while in
section (g) we describe a possible component exchange that data producers and consumers may
adhere to if they would like to interoperate.
a. Acumos
Background
Acumos provides a data exchange format based on Docker containers for AI Resources (i.e.
components). We advocate that a similar container-based framework should be used as the
interoperability standard for AI4EU. We briefly introduce its functioning below.
The Acumos platform defines an "onboarding" feature17 allowing AI Resources to be uploaded
to and downloaded from an existing platform instance. Resources are encapsulated and safe-kept
using Docker containers18. Users uploading AI Resources typically describe them by means
of a programming core, data components and an I/O exchange definition, which is later converted
into a Docker container. In recent versions of Acumos, it is also possible to onboard Docker
containers directly19. A Docker image must be created such that, upon start, it exposes the service
defined in the Protobuf specification on port 80, and it must be made available either in a public
Docker registry or in the Acumos registry, so that it can be referenced during the onboarding
process (Figure below).
Screenshot of the onboarding process on Acumos
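Assuming the convention above (the service defined in the Protobuf specification is exposed on port 80 inside the container), a consumer could run an onboarded image locally along these lines; the image name and host port below are hypothetical:

```python
import shlex

def docker_run_command(image, host_port=8080, service_port=80):
    """Build the command that starts an onboarded model container locally,
    publishing the in-container protobuf service port (80, per the
    onboarding convention) on a host port. Names are illustrative."""
    return ["docker", "run", "--rm", "-d",
            "-p", f"{host_port}:{service_port}", image]

cmd = docker_run_command("registry.example.org/ai4eu/demo-model:1.0")
print(shlex.join(cmd))
# docker run --rm -d -p 8080:80 registry.example.org/ai4eu/demo-model:1.0
```

The `-p host:container` mapping is standard Docker port publishing; the registry path would in practice point to the public or Acumos registry referenced during onboarding.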
17 https://wiki.acumos.org/display/AC/Soup-to-Nuts+Example%3A+Onboarding%2C+Downloading%2C+Deploying%2C+and+Using+a+Python-Based+Model+in+Acumos
18 https://www.docker.com/resources/what-container
19 https://wiki.acumos.org/display/LM/Docker+file+using+new+model+runner
Interoperability plan
Acumos is already nearly compliant with the proposed container-based interoperability plan, as it
uses a similar underlying representation model. Should this representation evolve over time,
the exchange of assets with the Acumos platform may have to be adapted.
b. Thales Alenia Space (TAS)
Background
Mundi has a classical architecture (see Figure below) with:
• an IaaS layer, with the specificity of providing a single access point to the entire Copernicus
data archive. On this cloud environment, virtual machines (VMs) with storage and computing
capacities are also available; if the VM resources are not sufficient, it is possible to buy a
tenant, which is guaranteed private, fully secured, and compliant with European privacy
policies.
• a PaaS layer.
• a SaaS layer that contains a Jupyter Notebook to manipulate data and run Docker containers.
On this SaaS layer, it is also possible to offer complementary AI tools (based on Python 3 or
R code) via the Mundi marketplace.
Mundi Solution
Interoperability plan
As EO data are huge, downloading them over the Internet is not an effective way of working.
Moreover, only two EO data products can be downloaded at once. The best way to process EO
data is to do it in the same cloud environment where they are hosted. The considered solution
therefore consists in downloading Docker containers, which contain the AI tools of interest, from
the AI4EU repository platform, and executing these containers on the virtual machines provided
by the Mundi platform (see next Figure). Note that Thales Alenia Space also plans to negotiate
with other members of the consortium to expose a link to the AI4EU repository platform on the
Mundi Marketplace, to promote the integration of AI tools from the AI4EU project. If AI4EU users
want to discover what EO data look like, they will be able to download EO data (no more than two
products at once for a free account) via the semantic search of the AI4EU project. This tool will
also help users select the EO data of interest.
Interoperability between AI4EU and Mundi platforms
c. BEAT platform from IDIAP (IDI)
Background
Essentially, each processing unit in a BEAT workflow is represented by:
● An Algorithm20 object which is composed of
○ a JSON description containing information about the inputs and outputs, the type of
the algorithm, its parameters and some metadata,
○ the actual code that will be executed following a predetermined class format,
○ the documentation of the algorithm.
● one or more DataFormat21 objects which describe the data types (simple or complex) that
must be used to allow the Algorithm to read from its input and write to its outputs.
These processing units are executed in a Docker container that provides a specific Environment
containing an arbitrary number of libraries (e.g. TensorFlow, PyTorch). Environments are
versioned so they can evolve over time, while older versions are kept for reproducibility.
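The structure described above can be sketched as follows; the JSON field names and the Python class shape are simplified illustrations inspired by the BEAT documentation, not the exact BEAT API:

```python
import json

# Illustrative JSON description of a BEAT-style Algorithm: inputs/outputs
# (typed by DataFormats), parameters and a type field. Field names are a
# simplified sketch of what the BEAT docs describe.
description = json.loads("""{
  "type": "sequential",
  "inputs":  {"image": {"type": "my_formats/image/1"}},
  "outputs": {"label": {"type": "system/text/1"}},
  "parameters": {"threshold": {"type": "float64", "default": 0.5}}
}""")

class Algorithm:
    """Sketch of the predetermined class format: the platform instantiates
    the class, configures it, then calls process() once per data unit."""
    def setup(self, parameters):
        self.threshold = parameters.get("threshold", 0.5)
        return True
    def process(self, inputs, outputs):
        # Read from inputs, write to outputs (the logic here is a stub).
        outputs["label"] = "cat" if inputs["score"] > self.threshold else "dog"
        return True

algo = Algorithm()
algo.setup({"threshold": description["parameters"]["threshold"]["default"]})
out = {}
algo.process({"score": 0.9}, out)
print(out)  # {'label': 'cat'}
```

In BEAT itself, the inputs and outputs are typed streams described by DataFormat objects rather than plain dictionaries.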
Interoperability plan
20 https://www.idiap.ch/software/beat/docs/beat/docs/stable/beat/algorithms.html
21 https://www.idiap.ch/software/beat/docs/beat/docs/stable/beat/dataformats.html
Diagram illustrating the export of a BEAT processing pipeline to an Acumos Docker
container
To allow exporting BEAT Algorithms or complex sets of those, we will modify the BEAT
framework to allow the user to arbitrarily create AI4EU-compatible containers from a subset of
Algorithms running in a BEAT Experiment22.
More precisely, we will make the following modifications to the BEAT framework to support
exporting pipelines to the AI4EU platform:
1. A code generation tool to:
a. Convert the DataFormat objects into a Protobuf description,
b. Create a programming core from selected BEAT Algorithms23.
2. An exporter tool to create Docker images such that they can be directly onboarded to the
AI4EU platform.
These containers will be based on original BEAT Environments and will be augmented to contain
the necessary programming core and input/output descriptions.
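Step 1a above could look roughly as follows; the type mapping and function name are assumptions for illustration, not the actual code generation tool:

```python
# Hypothetical sketch of converting a simple BEAT DataFormat (here reduced
# to a name -> scalar-type mapping) into a Protobuf message definition.
# The type table and naming convention are assumptions, not BEAT code.

BEAT_TO_PROTO = {"float64": "double", "int64": "int64",
                 "string": "string", "bool": "bool"}

def dataformat_to_proto(message_name, fields):
    """Emit a proto3-style message for a flat DataFormat."""
    lines = [f"message {message_name} {{"]
    for i, (field, beat_type) in enumerate(fields.items(), start=1):
        lines.append(f"  {BEAT_TO_PROTO[beat_type]} {field} = {i};")
    lines.append("}")
    return "\n".join(lines)

proto = dataformat_to_proto("Prediction", {"label": "string", "score": "float64"})
print(proto)
# message Prediction {
#   string label = 1;
#   double score = 2;
# }
```

Real BEAT DataFormats can be nested and array-valued, so the actual tool would need to handle composite types as well.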
22 https://www.idiap.ch/software/beat/docs/beat/docs/stable/beat/experiments.html
23 https://www.idiap.ch/software/beat/docs/beat/docs/stable/beat/algorithms.html
d. International Data Space Association (IDSA)
Background
The International Data Spaces connect the lower-level architectures for communication and basic
data services with more abstract architectures for smart data services. They therefore support the
establishment of secure data supply chains from data source to data use, while at the same time
making sure data sovereignty is guaranteed for data owners.
Via the IDS Connector, the International Data Spaces' central component, industrial data clouds,
as well as individual enterprise clouds, on-premises applications and individual connected devices,
can be connected to the International Data Spaces.
International Data Spaces connecting different cloud platforms
The IDS Reference Architecture Model describes processes for the provision and consumption of
data and algorithms and provides a semantic data model (IDS Infomodel) to describe offerings for
the data economy. The IDS Connector is responsible for communication and interoperability in the
IDS; it is supported by the Broker, App Store and Identity Provider components.
Interaction of technical components
A distributed network like the International Data Spaces relies on the connection of different
member nodes where Connectors or other core components are hosted (a Connector comprising
one or more Data Endpoints). The Connector is responsible for the exchange of data, or acts as a
proxy in the exchange of data, as it executes the complete data exchange process from and to the
internal data resources and enterprise systems of the participating organizations and the
International Data Spaces. It provides metadata to the Broker as specified in the connector
self-description, e.g. technical interface description, authentication mechanism, exposed data
sources, and associated data usage policies. It is important to note that the data is transferred
directly between the Connectors of the Data Provider and the Data Consumer (peer-to-peer network concept).
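As an illustration, a connector self-description registered with the Broker might carry information along these lines; the keys and values below are a simplified sketch, not the IDS Infomodel vocabulary:

```python
import json

# Illustrative connector self-description as it might be sent to a Broker.
# All identifiers, endpoints and policy names are hypothetical examples.
self_description = {
    "connector": "https://connector.example.org",
    "interface": {"protocol": "https", "endpoint": "/api/data"},
    "authentication": "x509",
    "dataSources": [
        {"id": "urn:example:dataset:42",
         "usagePolicy": "use-only-within-ecosystem"}
    ],
}

# The Broker indexes this metadata so consumers can search for data sources;
# the payload itself is plain JSON here for illustration.
payload = json.dumps(self_description)
restored = json.loads(payload)
print(restored["connector"])  # https://connector.example.org
```

In the actual IDS specification, such descriptions are expressed with the IDS Infomodel (an RDF-based vocabulary) rather than ad hoc JSON keys.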
There may be different types of implementations of the Connector, based on different technologies
and depending on what specific functionality is required for the purpose of the Connector.
Two fundamental variants are the Base Connector and the Trusted Connector, which differ in their
capabilities regarding security and data sovereignty.
The Connector Architecture uses application container management technology to ensure an
isolated and secure environment for individual data services. A data service is a system which
offers an API to store, access or process data. To ensure the privacy of sensitive data, data
processing should take place as close to the data source as possible. Any data preprocessing (e.g.,
filtering, anonymization, or analysis) should be performed by Internal Connectors. Only data
intended for being made available to other participants should be made visible through External
Connectors.
Data Apps are data services encapsulating data processing and/or data transformation functionality
bundled as container images for simple installation by application container management.
Using an integrated index service, the Broker manages the data sources available in the
International Data Spaces and supports publication and maintenance of associated metadata.
Furthermore, the Broker Index Service supports the search for data resources. Both the App Store
and the Broker are based on the Connector architecture (which is described in detail in the
following paragraphs) in order to support secure and trusted data exchange with these services.
Connector Architecture
The details of the IDS Connector Architecture can be found in the IDS Reference Architecture
Model24.
24 https://www.internationaldataspaces.org/wp-content/uploads/2019/03/IDS-Reference-Architecture-Model-3.0.pdf
Interoperability plan
As the IDS Connector is a generic concept based on virtualization and container management,
it is easily adaptable to the AI4EU platform. Most current implementations of IDS Connectors
rely on Docker or Kubernetes.
The Execution Core Container of the IDS Connector can be integrated as a Docker image into the
AI4EU platform and can be used in different scenarios for consuming and providing data, models
and applications in a data-sovereign way. From the IDSA perspective, it is still open how the IDS
Connector components Data Bus and Data Router will be adopted in the AI4EU platform, as they
connect the different containers. The IDS Reference Architecture Model relies on a certification
scheme that includes an assessment of the technical components used and of the participants in the
ecosystem. It is still open how this certification scheme will be adopted within the AI4EU platform
(regarding Core Component Certification) and in general (regarding Participant Certification).
e. Know Center (KNO)
Background
The Know Center (KNO) will focus on technological development, which supports the
interoperability of external datasets managed within T2.7 of AI4EU, i.e., “External Data for AI”.
To this end, a database that holds metadata of potential external datasets will be provided based on
the database created within the DMA project. This database will implement the metadata standards
described in the initial data management plan (D2.9). It will further offer a Web service-based
interface that enables search and recommendation of datasets. Main functionalities will encompass:
● Recommendation algorithms that provide personalized suggestions of datasets based on
prior user interactions (e.g., clicks on datasets, past search queries, etc.). For this, the
ScaR recommendation framework will be used.
● Gatekeeper functionalities that assess technical and scientific properties of potential
datasets prior to their database integration. The research will thus investigate the
application of algorithms that allow an automatic assessment of datasets with respect to
their suitability to cater to specific use cases (e.g., is a dataset suitable for time-series
analyses?). To this end, AI algorithms will be prototypically applied to the data in order to
observe the properties of the learnt models. The outcome of this process will then be used
as an additional set of descriptive features to feed the aforementioned search and
recommendation services.
● Data consumption functionalities via data brokers. Once suitable datasets are found for
specific algorithms, they can be consumed via containerized data brokers. For this, the
data broker type (e.g., CSV file) as well as the source location (e.g., a Web URL) is also
saved in the metadata database.
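One such gatekeeper check, assessing suitability for time-series analyses, could be sketched as follows (a hypothetical heuristic, not Know-Center's implementation):

```python
# Hypothetical gatekeeper check: decide whether a dataset (a list of
# column -> value rows) is a plausible time series by testing for a
# timestamp column with (roughly) regular intervals. Illustrative only.

def suitable_for_timeseries(rows, time_column="timestamp", tolerance=0.1):
    try:
        times = sorted(float(r[time_column]) for r in rows)
    except (KeyError, TypeError, ValueError):
        return False            # no usable timestamp column
    if len(times) < 3:
        return False            # too few points to judge regularity
    gaps = [b - a for a, b in zip(times, times[1:])]
    mean = sum(gaps) / len(gaps)
    return mean > 0 and all(abs(g - mean) <= tolerance * mean for g in gaps)

regular = [{"timestamp": t, "value": 2 * t} for t in range(10)]
irregular = [{"timestamp": t} for t in (0, 1, 2, 50)]
print(suitable_for_timeseries(regular))    # True
print(suitable_for_timeseries(irregular))  # False
```

The result of such checks would be stored as descriptive features alongside the dataset metadata, feeding the search and recommendation services described above.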
Interoperability plan
The main aim of the "External Data for AI" database is to organize metadata of datasets. The
database will be enriched with a set of intelligent services that support structuring, findability,
recommendation and consumption. In the remainder of this section, the framework's different
components, as well as the interoperability with AI4EU (with Acumos as an example), are described:
● External Platforms: Metadata of datasets from external platforms will be collected. This
process will start with the approximately 15,000 datasets identified in DMA.
● Gatekeeper: Prior to the database integration, all datasets will be assessed by a so-called
gatekeeper that determines the suitability of a dataset for AI algorithms. The gatekeeper
will also be responsible for filtering datasets with insufficient metadata quality.
● “European Data for AI” database and services: Apache Solr will serve as a data backend
managing an inverted index. The inverted index stores all relevant features used for
generating recommendations and search results.
● Recommendations and Search: Flexible recommendation and search services will
facilitate access to the dataset metadata. These services can be exploited by the AI4EU
platform using a RESTful interface, i.e., HTTP as the communication protocol and JSON
as the data transmission format. The recommendation and search services will be
developed based on the ScaR framework.
● Interactions: In order to calculate personalized recommendations, the AI4EU platform will
be able to provide interaction data. Thus, if a user of the platform works with a specific
dataset or AI service, this interaction can be stored in the database. Such data also act as a
feedback loop for the personalization services, as it allows tracking whether users have
interacted with recommended datasets or services.
● AI services / algorithms: Similar to the handling of the interaction data described above,
AI4EU might also provide metadata of AI services / algorithms. With this information,
additional recommendation functionality can be offered, going beyond typical item2user
recommendations (e.g., recommending a dataset to a user) to additionally provide
item2item recommendations (e.g., recommending a dataset to an AI service). This should
lead to novel and useful combinations of datasets and AI services.
● AI4EU data broker: By providing an AI4EU data broker, datasets can be consumed by
the platform. To this end, a service definition implemented as a Protobuf file, a license in
JSON format and a container embedding the data broker will be provided. The data broker
type (e.g., CSV file) as well as the source location (e.g., Web URL) are stored in the
“European Data for AI” database.
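To illustrate how the AI4EU platform could consume the search service over the RESTful interface, here is a hedged Python sketch; the host, endpoint path, parameter names and response layout are placeholders chosen for the example, not the final API.

```python
# Hypothetical client for the REST/JSON dataset search interface.
import json
from urllib.parse import urlencode

BASE_URL = "https://data4ai.example.org/api/v1"  # placeholder host

def build_search_url(query, user_id=None, rows=10):
    """Compose a search request URL for dataset metadata."""
    params = {"q": query, "rows": rows}
    if user_id is not None:
        params["user"] = user_id   # enables personalized ranking
    return f"{BASE_URL}/datasets/search?{urlencode(params)}"

def parse_results(payload):
    """Extract (dataset id, title) pairs from a JSON response body."""
    doc = json.loads(payload)
    return [(d["id"], d["title"]) for d in doc.get("results", [])]
```

Passing the optional user identifier is what would let the ScaR-based backend personalize the ranking using the interaction data described above.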
On top of this database, recommendation and search services will be employed that aim to provide
novel and helpful suggestions for combining AI services and datasets, and to support platform
users in targeted and extensive searches for datasets. Recommendation and search thus build on
a data corpus describing offered items (i.e., datasets and AI services), actors (i.e., organizations
and users) and (user) interactions. Interactions of users (e.g., viewing the description of an offer)
and feedback of users (e.g., clicking on a recommendation) are stored separately. Metadata
describes the properties of AI services, datasets, organizations and users.
f. Barcelona Supercomputing Center (BSC)
Background
The HPC environment is usually more restrictive in terms of security and performance than other
environments such as HTC or cloud. One of the limitations in HPC centers is the availability of
containers for executions. The most widely used container technology to date is Docker, but
Docker is currently not supported in most HPC centers; the supported container systems are
mainly Singularity25 and/or Shifter26. Both systems provide mechanisms to transform a Docker
image into the corresponding container type, while limiting some of its features to increase the
security of the final execution.
Interoperability plan
At BSC only Singularity is supported, so the main way to bring assets from AI4EU to HPC
platforms will be to establish guidelines for users to generate containers that can easily be
translated to Singularity for execution on HPC. Another core point to take into account is the
performance of containers. This aspect must be checked before spending massive amounts of
compute hours in a project, as it is very important to ensure the best possible performance and
efficiency of the executions before scaling them up to several thousands of cores. The
performance and portability aspects will be fully managed by Task 2.5, while the interoperability
design between AI4EU and HPC will be handled with the effort allocated to BSC in Task 2.6.
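As a sketch of the translation step such guidelines would target, the helper below composes the standard `singularity build` invocation that converts a Docker image into a Singularity image file (.sif). The image names are illustrative, and the actual adaptor run on BSC infrastructure may differ.

```python
# Compose the Docker-to-Singularity conversion command (sketch).
def singularity_build_cmd(docker_image, sif_name=None):
    """Return the argv that converts a Docker image to a .sif container."""
    if sif_name is None:
        # derive "repo/name:tag" -> "name.sif"
        sif_name = docker_image.split("/")[-1].split(":")[0] + ".sif"
    return ["singularity", "build", sif_name, f"docker://{docker_image}"]
```

For example, `singularity_build_cmd("ai4eu/iris-model:1.0")` yields the command `singularity build iris-model.sif docker://ai4eu/iris-model:1.0`, which Singularity resolves by pulling the Docker layers and flattening them into a single image file.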
g. A container approach towards interoperability
Typical examples of AI Resources are:
● pre-trained machine learning models,
● rule-based models (expert systems, a.k.a. symbolic AI),
● algorithms to perform training or inference,
● datasets (for training or testing),
● data brokers to access remote data (e.g. satellite images).
We recommend that an AI Resource should comprise three objects:
1. A service definition implemented as a Protobuf file (.proto),
2. A license in JSON format (license.json),
3. A container (e.g. Docker image) containing the programming core and/or data components
(e.g. pre-trained models, datasets or data brokers).
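A minimal sanity check for such a bundle could look as follows; the expected file names (any .proto file, license.json, and a .tar or .sif container export) are assumptions based on the recommendation above, not a fixed packaging convention.

```python
# Check an AI Resource directory for the three recommended objects (sketch).
import pathlib

def validate_resource(folder):
    """Return a list of the recommended pieces missing from the bundle."""
    p = pathlib.Path(folder)
    missing = []
    if not list(p.glob("*.proto")):
        missing.append("service definition (.proto)")
    if not (p / "license.json").exists():
        missing.append("license (license.json)")
    if not list(p.glob("*.tar")) and not list(p.glob("*.sif")):
        missing.append("container image")
    return missing
```

An empty return value means all three objects are present and the resource is, structurally at least, ready for onboarding.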
Service definition
25 https://sylabs.io/docs/
26 https://docs.nersc.gov/programming/shifter/overview/
Any AI Resource must define an execution context and a data exchange interface via a Protobuf
file. This data exchange interface must be self-contained in the sense that it contains the definition
of the input and output data structures as well as the service definition.
The following example shows the Protobuf definition for a classifier inferring class labels for the
Iris Flower dataset (a basic benchmark in Machine Learning):

syntax = "proto3";
package kTglehYxRGIPEoXkdoKpXCLzWgLrCbCp;
service Model {
rpc classify (IrisDataFrame) returns (ClassifyOut);
}
message IrisDataFrame {
repeated double sepal_length = 1;
repeated double sepal_width = 2;
repeated double petal_length = 3;
repeated double petal_width = 4;
}
message ClassifyOut {
repeated int64 value = 1;
}
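Note that the repeated fields make the message column-oriented: a batch of N flowers is transmitted as four parallel lists of length N, and ClassifyOut carries one label per row. The plain-Python stand-in below only illustrates these shapes; the threshold rule is a toy decision, not a trained model and not part of the specification.

```python
# Plain-Python stand-in for the Model.classify RPC above, illustrating
# the column-wise (repeated) message layout. The decision rule is a toy.
def classify(sepal_length, sepal_width, petal_length, petal_width):
    """Return one int64-style label per row of the IrisDataFrame batch."""
    labels = []
    for pl, pw in zip(petal_length, petal_width):
        if pl < 2.5:
            labels.append(0)          # setosa
        elif pw < 1.7:
            labels.append(1)          # versicolor
        else:
            labels.append(2)          # virginica
    return labels                     # maps to ClassifyOut.value
```

In the real service, the generated gRPC stubs would marshal these four lists into an IrisDataFrame message and unmarshal the returned ClassifyOut accordingly.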
License
Additionally, the AI Resource must provide a suitable license file in JSON format. The
specification for the format can be found online27. For instance, here is an example of a license file
in JSON format wrapping an Apache 2.0 license:
{
  "modelLicenses": [
    {
      "keyword": "Apache-2.0",
      "intro": "Apache 2.0 License for Company A. Legal Text",
      "copyright": {
        "year": 2019,
        "company": "Company B",
        "suffix": "All rights reserved."
      },
      "swidTag": "Acumos Ai/ML Model|Data",
      "modelId": "AB123456",
      "licenseType": "trythenbuy|purchasemodel|purchaseartifacts",
      "rights": [
        {
          "id": "location",
          "name": "Locations Allowed",
          "desc": "The right to use this software is granted for the specified allowed locations",
          "limit": {
            "type": "location",
            "value": [
              "China,Europe,United States"
            ]
          }
        }
      ],
      "contact": {
        "desc": "Contact Company @ [email protected] To acquire the right to use this software"
      },
      "fullLegalLicense": "All legal text|url"
    }
  ]
}
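A consumer could perform a quick structural check of such a file before onboarding the resource. The required keys below mirror the example above; the full specification linked in the footnote remains authoritative, so this is only a hedged sketch.

```python
# Sanity-check a license.json payload for the keys used in the example above.
import json

def check_license(text):
    """Return the missing keys for the first entry in modelLicenses."""
    doc = json.loads(text)
    entries = doc.get("modelLicenses", [])
    if not entries:
        return ["modelLicenses"]
    required = ("keyword", "copyright", "licenseType", "rights")
    return [k for k in required if k not in entries[0]]
```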
Container
The container can be provided in any suitable format, such as a Docker or Singularity image.
27 https://docs.acumos.org/en/boreas/submodules/security-verification/license-manager-client-library/docs/license-json.html
4. Conclusion
In this deliverable, we proposed a common path towards federating AI Resources in Europe. We
introduced relevant technical details of partner resources and proposed a high-level technical
solution towards interoperability.
● The Acumos platform already defines an “onboarding” feature allowing AI Resources to be uploaded to and downloaded from an existing instance. Resources are encapsulated and safe-kept using Docker containers, by means of a programming core, data components and an I/O exchange definition.
● BEAT will work as an AI Resource producer exporting its Algorithms, or complex sets of
those, allowing End-Users to arbitrarily create compatible containers from a BEAT
Experiment.
● Thales Alenia Space will work as an AI Resource producer/consumer by hosting executions
of compatible containers.
● Know Center will work as an AI Resource provider/controller by providing data brokers to
consume datasets that are referenced in the “European Data for AI” database and
recommendation infrastructure created in Task 2.7.
● The International Data Spaces Association will work as an AI Resource producer/consumer
by providing a reference architecture model for the sovereign exchange of data, models and
algorithms. The Execution Core of the IDS Connector can be made available as a reusable
component in the AI4EU platform. The certification scheme of the IDS Reference
Architecture Model will provide trustworthiness to the platform and the AI4EU ecosystem.
● Barcelona Supercomputing Center will work as an AI Resource consumer by providing an
adaptor between AI4EU containers and supported Singularity images.
We believe that the proposed approach using containers will fit many use cases and is flexible
enough to accommodate the currently heterogeneous ecosystem of AI actors in Europe and beyond.
In conclusion, we expect that this deliverable forms one of the main pillars for the consolidation of
AI Resources.