

On Warehouses, Lakes, and Spaces: The Changing Role of Conceptual Modeling for Data Integration

Matthias Jarke (1,2) and Christoph Quix (1,2)

1 Database and Information Systems, RWTH Aachen University, Germany
{jarke,quix}@dbis.rwth-aachen.de

2 Fraunhofer Institute for Applied Information Technology FIT, Germany

Abstract. The role of conceptual models, their formalization and implementation as knowledge bases, and the related metadata and metamodel management has continuously evolved since their inception in the late 1970s. In this paper, we trace this evolution from traditional database design, to data warehouse integration, to the recent data lake architectures. Concerning future developments, we argue that much of the research has perhaps focused too much on the design perspective of individual companies or strongly managed centralistic company networks, culminating in today's huge oligopolistic web players, and propose a vision of interacting data spaces which seems to offer more sovereignty to small and medium enterprises over their own data.

1 Introduction

Conceptual modeling and metamodeling have been key ingredients for data management tasks such as database design, information systems engineering, data integration, and data analytics since the mid-1970s. However, as data management transcended more and more aspects of our lives and businesses, the roles, and thus also the kinds of models, supporting formalisms, and tools, had to be continuously adapted to these changing needs.

The 1980s saw the emergence of semantic data models and their interpretation as evolving knowledge bases from which databases and data-intensive transactional applications could be derived in a quality-controlled and semi-automatic manner; Antoni Olivé was one of the major contributors to this perspective with his research on consistent knowledge bases, view updates, and the consistent derivation of logical data schemas from conceptual models [43,57,2].

The following decades can, in short, be characterized by the contrasting buzzwords of data warehouses and data lakes, perhaps juxtaposed with the relatively new concept of data spaces for information sovereignty.

In the 1990s, it was noticed that transactional data could not just be used for administering day-to-day business but had in fact built up a collective treasure of historical information which could be exploited for data analytics. To enable analytics without concurrency control problems, data warehouses were set up separately from transactional databases (now called data sources). This raised additional concerns about the semantically correct integration of independently developed databases, e.g., concerning temporal and spatial validity [37]. Metalanguages such as the Meta Object Facility (MOF [53]), Telos [40,27], or even description logics [28] helped formalize the relationships among source schemas and the warehouse schema for homogeneous and increasingly heterogeneous data sources (e.g., including the XML/JSON family).

Shortly after the turn of the century, the Big Data movement, originally proposed by Jim Gray for crowd-analytics of astronomy data, began to take off. Multimedia data and mobile data management added to the volume and variety of data, but also enriched the context information needed for the characterization of data sources. Additionally, the requirements for near real-time analytics grew rapidly, and various kinds of NoSQL databases were introduced to deal with some of these new challenges, adding to the complexity of data systems integration. Since about 2010, data lake architectures have been proposed for the provisioning of data for such richer data analytics tasks, and some huge data lakes have been built up by the leading search engine, eCommerce, and social network vendors.

In this chapter, we review some of the research we and others have done to accompany this evolution, almost exactly in parallel with, and often interacting with, the impressive research career of Antoni Olivé over the past thirty years. Starting with the logic-based metadata management system ConceptBase and its use in the European DWQ project on foundations of data warehouse quality, we arrive via our Generic Role-Based Metamodeling Suite GeRoMeSuite at our present work on the Constance data lake system and its applications. However, these organizationally rather centralistic viewpoints have recently raised some concerns, not just about the personal privacy of users, but also about risks for the security of confidential company knowhow, especially for small and medium enterprises which feel threatened by the huge data lakes both of private market leaders and of state organizations. The paper ends with a brief overview of the Fraunhofer-led Industrial Data Space initiative, in which a large worldwide association from industry and science is investigating innovative solutions for this problem.

2 Conceptual Model Formalization and Model Management for Data Warehouse Integration

Consider the scenario for heterogeneous data integration illustrated in Figure 1. There are two data sources (relational and XML) with information about students, their universities, and courses. The relational source is in addition annotated with an ontology which has a mapping to the desired target data structure in JSON. The XML source has a direct mapping to the target model.

Fig. 1. Motivating scenario for heterogeneous data integration

Within this scenario, several integration challenges have to be addressed:

Schema Matching: The definition of mappings between schemas is often a manual task, as the complete semantics of a schema can only be understood by a human. Schema matching methods [51,55] can support this process by identifying correspondences between schemas.

Schema Mapping: While schema matching produces correspondences that state only a similarity between schema elements, a schema mapping is a formal expression that can be interpreted as a constraint over the sources or as a rule to transform the source data into the target database [34,10,33]. Various mapping languages with different expressive powers and complexities have been proposed, which are usually a variant of tuple-generating dependencies [7,15].

Schema Integration: If the target schema is not given, the challenge is to create a common schema that covers the source schemas and can be used to query both sources in an integrated way [5,49,36,35].

Data & Schema Quality: Data quality problems occur frequently in integration scenarios, as the data is used for another application than initially intended. Data quality is not only defined by the data itself, but also by the schema describing the data and the systems used for processing the data [27,6,2].
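As a toy illustration of the first challenge, a trivial matcher can already propose correspondences from name similarity alone; real matchers [51,55] combine many such heuristics with structural and instance-based evidence. The element names below are loosely based on the running scenario, and the similarity measure and threshold are arbitrary choices for illustration, not the method of any particular system.

```python
from difflib import SequenceMatcher

def match_schemas(source_elems, target_elems, threshold=0.5):
    """Propose correspondences between schema elements whose
    (lower-cased) names are sufficiently similar."""
    correspondences = []
    for s in source_elems:
        for t in target_elems:
            sim = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if sim >= threshold:
                correspondences.append((s, t, round(sim, 2)))
    return correspondences

# Element names loosely based on the scenario of Fig. 1.
relational = ["Student.name", "Student.uni", "University.uname"]
xml = ["student/name", "student/university"]
print(match_schemas(relational, xml))
```

Note that the output is only a set of weighted correspondences; as the next challenge explains, turning them into executable transformations still requires a mapping formalism.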

In the following, we give a short overview of how the techniques to address these challenges have evolved over the last decades. We focus especially on the conceptual modeling aspects in this context.

In parallel to the initial IRDS standardization, a logic-based approach to dealing with an unbounded number of meta levels was investigated in the Telos project, jointly conducted between the University of Toronto and several European projects in the late 1980s [40]. An important feature of Telos is its strong and highly efficient formalization in Datalog with stratified negation [29]. Based on this formalization, our deductive metadatabase system ConceptBase [26] was the first metadata repository to offer effective query optimization, integrity constraint evaluation, and incremental view maintenance at the data level [56], as well as viewpoint resolution [42] and requirements traceability [52] at the meta level simultaneously, thus providing an early example of fully automated model-based code generation.


With the confluence of structured data, text and multimedia capabilities, the World Wide Web, and mobile communications, the range of data models has grown well beyond what could be covered by these early approaches. In [23], an overview of metadata interoperability in heterogeneous media repositories is given which also addresses interoperability between “structural” modeling languages such as UML, XML Schema, and OWL. The authors classify approaches according to the MOF hierarchy and argue that effective interoperability between systems can only be achieved if data transformations at the instance level are also addressed. Furthermore, they distinguish between standardization and mapping approaches. The former propose metadata standards to enable interoperability, whereas the latter build relationships between different metamodels. Mapping approaches are more complex, but are advantageous in open environments such as the Web, as in these cases no central authority can enforce a standard [23].

The increasing complexity of information systems [11] requires techniques for automating the tasks of creating models and mappings. The original vision of model management aimed at providing algebraic operations for these tasks [9], because complete automation was expected to be hard to achieve. The creation of models and mappings was considered a design activity which requires a deep understanding of the semantics of the modeled systems. Another important motivation for the definition of a model management algebra was the observation that many applications that deal with models require a significant amount of code for loading and navigating models in graph-like structures.
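The flavor of such an algebra can be sketched with three of the operators commonly discussed in the model management literature (Match, Merge, Compose). The data structures and operator semantics below are deliberately minimal illustrations of our own, not the definitions of [9]: models are reduced to sets of element names and mappings to sets of correspondences.

```python
from dataclasses import dataclass, field

# Minimal stand-ins for the structures a model management algebra
# operates on: a model is a named set of elements, a mapping a set of
# correspondences between the elements of two models.
@dataclass
class Model:
    name: str
    elements: set = field(default_factory=set)

@dataclass
class Mapping:
    source: Model
    target: Model
    pairs: set = field(default_factory=set)  # (source_elem, target_elem)

def match(m1: Model, m2: Model) -> Mapping:
    """Match: derive a mapping between two models (here: equal names)."""
    return Mapping(m1, m2, {(e, e) for e in m1.elements & m2.elements})

def merge(m1: Model, m2: Model, mapping: Mapping) -> Model:
    """Merge: build a model covering both inputs, unifying mapped elements."""
    unified = {t for (_, t) in mapping.pairs}
    return Model(f"{m1.name}+{m2.name}", m1.elements | m2.elements | unified)

def compose(ab: Mapping, bc: Mapping) -> Mapping:
    """Compose: chain mappings A->B and B->C into a mapping A->C."""
    pairs = {(a, c) for (a, b1) in ab.pairs for (b2, c) in bc.pairs if b1 == b2}
    return Mapping(ab.source, bc.target, pairs)

s = Model("Source", {"student", "uni"})
t = Model("Target", {"student", "university"})
print(merge(s, t, match(s, t)).elements)
```

The shortcomings discussed next, namely that graphs and correspondence sets cannot carry constraints or transformation semantics, are exactly what this toy representation suffers from.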

The first model management systems, such as Rondo [39,38] and COMA [14], applied simple, abstract model representations in which a model is represented as a directed, labeled graph. Other approaches focused on schemas; e.g., there is a huge research area on schema matching [51,54]. Although a graph representation is often sufficient for basic schema matching tasks, semantic details such as constraints cannot be easily represented. Mappings are often just represented as a set of pairwise correspondences between nodes in the graphs. Even though in MISM (Model-Independent Schema Management) [3] schemas are described in a generic way using a multi-level dictionary, the system uses a set-theoretic approach for some model management operators (e.g., Merge), i.e., again only correspondences (equivalence views), as the mapping formalism.

In the original vision of model management [9], mappings had a weak representation and were seen as a special type of model which might include expressions describing the semantics of a mapping in more detail (e.g., by using an SQL query). In order to automate operations on models and mappings, mappings have to be represented in a separate formalism which is more expressive than simple correspondences.

There have been several attempts at combining a rich modeling language with powerful mapping languages. For example, in the European DWQ project (Foundations of Data Warehouse Quality [28]), a semantically rich metamodel [27] was combined with an information integration approach based on description logics [12]. Similarly, the Italian MOMIS system used an object-oriented modeling language to support the integrated querying of heterogeneous information sources [8].

Clio [24,21,16] introduced a more active mapping language based on tuple-generating dependencies (tgds) [7,1]. The well-defined formal basis and the ability to easily translate the mappings into executable code (e.g., queries in SQL or XQuery) proved a significant advantage and caused a rethinking of the whole definition of model management, dubbed Model Management 2.0 [10]. Here, mappings are involved in all model management operations, at the core of any integration approach. Furthermore, model management is no longer only a design-time issue, because mappings eventually have to be executed to perform data transformation tasks. Thus, model management systems must be able to understand the semantics of a mapping in order to enable mapping operations (e.g., composition, inversion) and produce mappings as output (e.g., in match and merge operations). While Clio explored this issue in the context of (nested) relational data models, we have investigated the extension to the management of heterogeneous data models.

3 Generic Modeling Languages for Heterogeneous Sources

The heterogeneity of data management systems and modeling languages used on the web, but also in enterprise information systems, can be addressed by a generic modeling approach which is able to cope with the different modeling formalisms in a single uniform framework. Moreover, this has to be done simultaneously at the model level and at the data level.

3.1 Role-Based Generic Metamodeling

Our generic metamodel GeRoMe [31] is a rich modeling language to represent detailed features of various concrete languages such as OWL, XML Schema, UML, or the Relational Data Model (RDM). In GeRoMe, a mapping states how the data of one model is related to the data of another model, i.e., mappings relate data and not only models. Because of this, mappings need to be very expressive to be able to represent rich data transformations. Executability of a mapping language means that it must be possible to apply a mapping such that it enables automatic code generation that executes the data transformations specified in the mapping [33].

Figure 2 illustrates the generic GeRoMe models for our scenario from Figure 1. The upper part represents the relational schema with two relations Student and University. These elements are described by the role Aggregate (Ag), which marks an element that can have (among other features) attributes. The attributes uname and uni are connected by a foreign key constraint (FKConst). The lower part of the figure shows the GeRoMe representation of the XML schema, where University is a root element having the complex type UniType. Student is a nested element with the complex type StudType. As complex types and attributes in XML have the same semantics as relations and columns in the relational schema, the corresponding elements play the same roles (e.g., Aggregate (Ag) and Attribute (At)).

Fig. 2. GeRoMe representations of a relational and an XML schema

As shown in the example, the generic metamodel unifies different representations and therefore simplifies the implementation of model management operations. For example, a schema matching or integration approach needs to be implemented only once for the generic representation and does not have to be repeated for all the different modeling languages. On the other hand, the examples also show the differences between the original modeling languages. A nesting in XML is a different modeling construct than a foreign key constraint in the RDM, although both can be used to model the same semantics (a relationship between two entity types). If the goal is to translate models from one modeling language into another, then model transformation has to be applied, in which such modeling constructs are translated [4]. We developed a rule-based model transformation for GeRoMe [30] with which we could show that a more detailed, fine-granular representation of models is especially beneficial for model management operations that need an exact representation of the semantics of modeling constructs. More abstract representations such as graphs are useful for schema matching [51], in which a less accurate representation of the schema semantics is sufficient, but are usually not appropriate for model transformation or schema integration.
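The role-based idea can be sketched as follows: a model element has no fixed metaclass but plays a set of roles, so a relation and an XML complex type both play the Aggregate role and can be handled by the same generic operator. The role and element names follow Fig. 2, but the tiny API below is our own invention, not the actual GeRoMe implementation.

```python
# A sketch of role-based generic modeling in the spirit of GeRoMe [31]:
# an element carries a set of roles, each with role-specific features.
class ModelElement:
    def __init__(self, name):
        self.name = name
        self.roles = {}  # role name -> role-specific features

    def play(self, role, **features):
        self.roles[role] = features
        return self

# Relational: the relation Student plays the Aggregate (Ag) role ...
student_rel = ModelElement("Student").play("Aggregate")
uni_attr = ModelElement("uni").play("Attribute", owner=student_rel)

# ... and the XML complex type StudType plays the very same role,
# so a generic operator can treat both schemas uniformly.
stud_type = ModelElement("StudType").play("Aggregate")
name_elem = ModelElement("name").play("Attribute", owner=stud_type)

aggregates = [e for e in (student_rel, uni_attr, stud_type, name_elem)
              if "Aggregate" in e.roles]
print([e.name for e in aggregates])  # the Aggregates from both schemas
```

A generic matcher or integrator written against roles such as Aggregate and Attribute thus needs to be implemented only once, which is the point made in the paragraph above.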

3.2 Generic Mappings

The ultimate goal of our GeRoMe-based mapping language is to realize data integration and transform data between different, possibly heterogeneous data sources [33]. The mappings are expressed as second-order tuple-generating dependencies [18].


∃F, G ∀u, s, i: University(u) ∧ Student(s, i) ∧ Studies(u, s) →
    inst(F(i), Student) ∧ inst(G(i, u), University) ∧ F(s, i, G(i, u)) ∧ G(u)

Fig. 3. Mapping from a relational schema to an XML schema

Second-order dependencies are required for mapping heterogeneous models, because nested structures (or models with different structures) need to be mapped. This is in contrast to mappings between relational models, where usually a tuple on the target side is created for each tuple that satisfies the conditions on the source side of the mapping. In nested data structures, multiple elements on the target side might have to be created, e.g., an outer object and multiple nested objects. As models can be very diverse in structure, the full power of second-order tuple-generating dependencies is required to map between these models.

Figure 3 shows a mapping between a ‘flat’ relational model and the nested XML model from our scenario. The relational model from the running example has been slightly modified to handle also students that study at multiple universities (i.e., the relation Studies represents the many-to-many relationship between students and universities). In the XML schema, the structure has been reversed, i.e., students are now at the top level, and their universities are represented as nested elements. The conditional part of the mapping selects all students from the relational source. On the target side, we create two objects: one for the main element Student and one for the nested element University. These elements are identified by the Skolem terms F(i) and G(i, u). The inst-predicate indicates that the objects represented by the Skolem terms are instances of Student and University, respectively. We use the symbols F and G as function symbols, but also as relation names. Note that the notation presented here is an abbreviated form (inspired by the notation of nested mappings in Clio [19,17]) of our original, more verbose notation presented in [33].
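The effect of such a mapping can be simulated in a few lines: the Skolem terms F(i) and G(i, u) become dictionary keys, so each distinct argument tuple yields exactly one target object, and the flat relational tuples fold into a nested structure. The sample data and the dictionary-based encoding are our own illustration, not generated GeRoMe code.

```python
# Simulating a mapping in the style of Fig. 3: Skolem terms F(i) and
# G(i, u) act as keys so that each distinct argument tuple produces
# exactly one nested target object.
def apply_mapping(universities, students, studies):
    """universities: set of names u; students: set of (s, i) pairs;
    studies: set of (u, s) pairs, as on the mapping's source side."""
    ids = {s: i for (s, i) in students}
    target = {}  # F(i) -> Student object with nested University objects
    for (u, s) in sorted(studies):
        if u not in universities or s not in ids:
            continue  # the conjunctive source condition is not satisfied
        i = ids[s]
        student = target.setdefault(i, {"name": s, "id": i, "universities": []})
        student["universities"].append({"name": u})  # G(i, u)
    return list(target.values())

result = apply_mapping({"RWTH", "UPC"}, {("Ada", 1)},
                       {("RWTH", "Ada"), ("UPC", "Ada")})
print(result)  # one Student element with two nested University elements
```

Because the key F(i) is shared by all tuples with the same student id, the two Studies tuples end up as two University elements nested under a single Student, which is precisely what a plain first-order tgd between flat relations could not express directly.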

We also developed a mapping composition and optimization algorithm for this kind of generic mappings. The optimization exploits key and foreign-key constraints of the schema. Furthermore, the mappings can also be translated to executable code, e.g., SQL queries to retrieve data from a relational data source, or update operations on an XML document.

In our model management system GeRoMeSuite [32], we implemented several model management operators, for example, for schema matching, schema integration, schema mapping, model transformation, and mapping composition. We could show that the detailed representation of models is useful for semantically complex operations, but can also bring a benefit for simpler operations (e.g., schema matching [50]). Despite all these advantages, however, a rich metamodel like GeRoMe rather complicates the mapping procedure. Not all detailed constraints of a data model have to be understood when data has to be transformed from one source to another.

3.3 Conceptual Modeling and Integration for Data Lakes

This is especially true in ‘data lake’ Big Data scenarios, where the data sources often do not expose a full-fledged schema definition. In data lakes, in contrast to the classical ETL (Extract-Transform-Load) processes of data warehouse systems, the transformation step is skipped and data is loaded in its original structure, to avoid upfront integration effort and to make all source data available for later data analysis tasks [13]. Thus, the idea of data lakes as common data repositories, with no need to define an integrated schema or ETL processes beforehand, is in line with the current trend of Schema-on-Read in Big Data and NoSQL systems. Still, to avoid that a data lake turns into a data swamp without any useful, understandable data, metadata management is a key element of a data lake architecture. The conceptual architecture of a data lake [46] which we use as a blueprint for our data lake system is illustrated in Figure 4.

In this context, the term ‘Schema-on-Read’ is frequently used to describe the situation that schemas are not created before data is stored in a data management system, but a schema might be created when the data is read. As this schema creation is usually done automatically, the schemas contain only the core information of the data, i.e., its structure and basic constraints such as keys and data types. The ‘Schema-on-Read’ fashion of data processing is also supported by recent NoSQL database management systems and is, in fact, one of the main reasons for their popularity [20], as no schema design is required before using the database. The agile style of software development also contributes to this trend; in the early phase of a software project, it is more important to have a few running software modules than to deal with conceptual modeling and proper database schema design. Thus, conceptual models and database schemas are often not stated explicitly or lack a formal representation, which leads to the unverifiability of these important artefacts in information systems design [44].
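Such automatic schema creation at read time can be sketched in miniature: given schemaless JSON-like records, field names, observed value types, and optionality are derived only when the data is read. The function and the sample records are invented for illustration; real systems infer considerably richer constraints.

```python
# Schema-on-Read in miniature: no schema exists up front; a rough one
# (field names, value types, optionality) is derived only at read time.
def infer_schema(records):
    schema = {}
    for rec in records:
        for key, value in rec.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    # fields missing from some records are marked as optional
    for key in schema:
        if any(key not in rec for rec in records):
            schema[key].add("absent")
    return {k: sorted(v) for k, v in schema.items()}

rows = [{"name": "Ada", "uni": "UPC"},
        {"name": "Alan", "uni": "RWTH", "year": 1936}]
print(infer_schema(rows))
# {'name': ['str'], 'uni': ['str'], 'year': ['absent', 'int']}
```

As the paragraph above notes, such an inferred schema captures only structure and basic types; the intended semantics of the fields still has to be supplied by a human or a conceptual model.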

Fig. 4. Conceptual architecture of a data lake system

We are currently working on an incremental, interactive integration process in which metadata, including conceptual models and schemas, play a key role [47]. A starting point for the integration process is the metadata extraction from the sources. While some metadata is given explicitly and can be extracted automatically with standard methods (e.g., schemas of relational databases can be extracted with JDBC), other types of metadata have to be extracted with more advanced methods. For example, in the life science informatics project HUMIT (http://humit.de), with the National Center for Neuro-Degenerative Diseases, where researchers store most of their data in semi-structured CSV and Excel files, we developed a method to extract metadata descriptions in a semi-automatic way [48]. The extracted metadata include technical metadata (i.e., the structure of the data) as well as descriptive metadata (i.e., information about by whom and for which project the data has been created). A user can also incrementally add semantics to the metadata of the data lake, e.g., by annotating the raw metadata with elements of a conceptual model or a standard such as schema.org [58]. The workflow of the HUMIT system is illustrated in Fig. 5. The metadata extraction component is part of our data lake system Constance [22].

Despite the delaying of transformation steps, modeling and mapping heterogeneous information remain necessary within a data lake system, as data is stored in its original format, e.g., relational, semi-structured, graph-structured, etc. As the rich metamodel of GeRoMe does not scale to Big Data problems, Constance employs a simple, yet expressive metamodel at its core. It can be seen as a variant of the Entity-Relationship Model, but it is expressive enough to express all data sets in a data lake. It is also self-describing, like JSON and XML. In this model, entities are data objects that can have a type, properties, and relationships to other data objects. The type is just denoted as a string without a particular semantics, but it can be enriched by a user by annotating the type with more semantic information (e.g., a reference to an element in schema.org).


Fig. 5. Workflow of the HUMIT System

Properties can be single- or multi-valued, simple or complex. Relationships are associations to other objects.

This model is independent of a particular storage system and could be implemented with different underlying database systems. In a document-oriented MongoDB storage with JSON as the basic data model, types and properties would be implemented as simple key-value pairs, whereas relationships would be implemented as foreign-key references to other data objects. Adapting the mapping language from GeRoMe to this simpler metadata model is straightforward, as the GeRoMe mappings also used only types, properties, and relationships as the main constructs for any kind of data structure.
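A document-store serialization of this entity model might look as follows. The field names (`_id`, `type_annotation`, `studies_at`) and the DBRef-like `$ref` convention are illustrative assumptions of ours, not the actual Constance storage format; only the ingredients, namely a plain string type, optional schema.org annotation, single- and multi-valued properties, and relationships as references, come from the description above.

```python
import json

# Entity-model sketch, serialized the way a document store such as
# MongoDB could hold it: types and properties as key-value pairs,
# relationships as foreign-key-style references to other data objects.
university = {
    "_id": "uni-1",
    "type": "University",  # type is a plain string, no fixed semantics
    "name": "RWTH Aachen",
}
student = {
    "_id": "stud-1",
    "type": "Student",
    "type_annotation": "https://schema.org/Person",  # user-added semantics
    "name": "Ada",                                   # single-valued property
    "emails": ["ada@example.org"],                   # multi-valued property
    "studies_at": {"$ref": "uni-1"},                 # relationship by reference
}
print(json.dumps(student, indent=2))
```

The documents are self-describing in the sense discussed above: structure, type string, and annotations travel with each data object rather than living in a separate schema.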

4 Conclusion: From Data Lakes to Data Spaces

In small and medium enterprises with long-standing market-leading knowhow, the presence of huge data lakes with unclear data ownership and data protection causes great political concern. In Germany and other European states which host a large number of such “hidden champions”, a debate on data sovereignty has begun, defined as the ability of an organization to decide itself how it wants to use and share its own data. Independent of the fact that data ownership, and thus also data sovereignty, are in themselves extremely controversial terms from legal and social science viewpoints, the Fraunhofer ICT group, which I had the honor to chair at the time, decided in 2014 to propose a so-called Industrial Data Space initiative which investigates the related modeling and system requirements of such a concept, along with related questions of regulations, business models, and the like. The idea was quickly taken up by the German government, the European Commission, as well as several leading players in the European user industries, and led to the funding of an initial design project and the formation of the International Data Space Association, which currently already comprises over 80 organizations on four continents. A first version of the Industrial Data Space Reference Architecture [45] was publicly presented at the Hannover Fair 2017, where already 26 research organizations and companies showed initial demonstrators of solutions based on aspects of the Reference Architecture.

Fig. 6. Components of the Industrial Data Space Reference Architecture

Fig. 6 gives an overview of the most important player roles and components. Participating companies (here: A and B) use a certified Internal IDS Connector to offer or import safely containerized views on their data, with well-defined usage controls. In larger networks, specialized IDS Brokers can offer services for searching, negotiating, and monitoring service and data contracts. Other service providers can bring in data cleaning or analytics services through their App Stores, again connected to the system by IDS Connectors. For confidentiality as well as for performance reasons, service execution should have a choice between policy-based uploading and downloading of data, or the distribution of the service algorithms themselves, which may require additional External IDS Connectors.

The technical elaboration of these and other aspects of the above-mentioned IDS reference architecture is still at an early stage. From a conceptual modeling perspective, we can see that many of the mapping and integration approaches for heterogeneous data presented in the previous sections can be reused. However, they must be augmented by conceptual models of the players and their desires and permissions, as well as of the physical architecture, which explicitly addresses data sharing (or not-sharing) relationships and various optimization strategies from the cloud computing and edge computing literatures. If conceptual modeling is done manually at all in such a setting, it must occur across organizational boundaries and in almost real time, such that near-real-time collaborative modeling [41] is expected to play a major role. Much previous research on the conceptual modeling of distributed systems with conflicting goals, e.g., around the i* framework [25], is expected to contribute towards the ambitious goals of Industrial Data Spaces and its variants, e.g., in the medical and material sciences, and we look forward to many exciting cooperations in the conceptual modeling community.

References

1. S. Abiteboul, R. Hull, V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

2. D. Aguilera, C. Gómez, A. Olivé. Enforcement of Conceptual Schema Quality Issues in Current Integrated Development Environments. In C. Salinesi, M. C. Norrie, O. Pastor (eds.), Proc. 25th Intl. Conf. on Advanced Information Systems Engineering (CAiSE), Lecture Notes in Computer Science, vol. 7908, pp. 626–640. Springer, Valencia, Spain, 2013. URL https://doi.org/10.1007/978-3-642-38709-8_40.

3. P. Atzeni, L. Bellomarini, F. Bugiotti, G. Gianforme. MISM: A Platform for Model-Independent Solutions to Model Management Problems. Journal on Data Semantics, 14:133–161, 2009.

4. P. Atzeni, P. Cappellari, R. Torlone, P. A. Bernstein, G. Gianforme. Model-independent schema translation. VLDB Journal, 17(6):1347–1370, 2008.

5. C. Batini, M. Lenzerini, S. B. Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, 18(4):323–364, 1986.

6. C. Batini, M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer, 2006. URL https://doi.org/10.1007/3-540-33173-5.

7. C. Beeri, M. Y. Vardi. A Proof Procedure for Data Dependencies. Journal of the ACM, 31(4):718–741, 1984.

8. S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano. Semantic Integration of Heterogeneous Information Sources. Data & Knowledge Engineering, 36(3):215–249, 2001.

9. P. A. Bernstein, A. Y. Halevy, R. Pottinger. A Vision for Management of Complex Models. SIGMOD Record, 29(4):55–63, 2000.

10. P. A. Bernstein, S. Melnik. Model Management 2.0: Manipulating Richer Mappings. In L. Zhou, T. W. Ling, B. C. Ooi (eds.), Proc. ACM SIGMOD Intl. Conf. on Management of Data, pp. 1–12. ACM Press, Beijing, China, 2007.

11. M. L. Brodie. Data Integration at Scale: From Relational Data Integration to Information Ecosystems. In Proc. 24th IEEE Intl. Conf. on Advanced Information Networking and Applications (AINA), pp. 2–3. IEEE Computer Society, Perth, Australia, 2010.

12. D. Calvanese, G. D. Giacomo, M. Lenzerini, D. Nardi, R. Rosati. Data Integration in Data Warehousing. International Journal of Cooperative Information Systems (IJCIS), 10(3):237–271, 2001.

13. J. Dixon. Data Lakes Revisited. James Dixon's Blog, September 2014. URL https://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited/.

14. H. H. Do, E. Rahm. COMA - A System for Flexible Combination of Schema Matching Approaches. In Proc. 28th Intl. Conference on Very Large Data Bases (VLDB), pp. 610–621. Morgan Kaufmann, Hong Kong, China, 2002.

15. R. Fagin. Tuple-Generating Dependencies. In L. Liu, M. T. Özsu (eds.), Encyclopedia of Database Systems, pp. 3201–3202. Springer, 2009. URL https://doi.org/10.1007/978-0-387-39940-9_1274.

16. R. Fagin, L. M. Haas, M. A. Hernández, R. J. Miller, L. Popa, Y. Velegrakis. Clio: Schema Mapping Creation and Data Exchange. In Conceptual Modeling: Foundations and Applications, LNCS, vol. 5600, pp. 198–236. Springer, 2009.

17. R. Fagin, L. M. Haas, M. A. Hernández, R. J. Miller, L. Popa, Y. Velegrakis. Clio: Schema Mapping Creation and Data Exchange. In A. Borgida, V. K. Chaudhri, P. Giorgini, E. S. K. Yu (eds.), Conceptual Modeling: Foundations and Applications, Lecture Notes in Computer Science, vol. 5600, pp. 198–236. Springer, 2009.

18. R. Fagin, P. G. Kolaitis, L. Popa, W. C. Tan. Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst., 30(4):994–1055, 2005.

19. A. Fuxman, M. A. Hernández, C. T. H. Ho, R. J. Miller, P. Papotti, L. Popa. Nested Mappings: Schema Mapping Reloaded. In U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, Y.-K. Kim (eds.), Proc. 32nd Intl. Conference on Very Large Data Bases (VLDB), pp. 67–78. ACM Press, 2006.

20. F. Gessert, N. Ritter. Scalable data management: NoSQL data stores in research and practice. In Proc. 32nd IEEE International Conference on Data Engineering (ICDE), pp. 1420–1423. IEEE Computer Society, Helsinki, Finland, 2016. URL https://doi.org/10.1109/ICDE.2016.7498360.

21. L. M. Haas, M. A. Hernández, H. Ho, L. Popa, M. Roth. Clio grows up: from research prototype to industrial tool. In Proc. SIGMOD Conf., pp. 805–810. ACM Press, 2005.

22. R. Hai, S. Geisler, C. Quix. Constance: An Intelligent Data Lake System. In F. Özcan, G. Koutrika, S. Madden (eds.), Proc. Intl. Conf. on Management of Data (SIGMOD), pp. 2097–2100. ACM, San Francisco, CA, USA, 2016. URL http://doi.acm.org/10.1145/2882903.2899389.

23. B. Haslhofer, W. Klas. A survey of techniques for achieving metadata interoperability. ACM Comput. Surv., 42(2), 2010.

24. M. A. Hernández, R. J. Miller, L. M. Haas. Clio: A Semi-Automatic Tool For Schema Mapping. In Proc. ACM SIGMOD, p. 607. 2001.

25. J. Horkoff, D. Barone, L. Jiang, E. S. K. Yu, D. Amyot, A. Borgida, J. Mylopoulos. Strategic business modeling: representation and reasoning. Software and System Modeling, 13(3):1015–1041, 2014. URL https://doi.org/10.1007/s10270-012-0290-8.

26. M. Jarke, R. Gallersdörfer, M. A. Jeusfeld, M. Staudt. ConceptBase - A Deductive Object Base for Meta Data Management. Journal of Intelligent Information Systems, 4(2):167–192, 1995.

27. M. Jarke, M. A. Jeusfeld, C. Quix, P. Vassiliadis. Architecture and Quality in Data Warehouses: An Extended Repository Approach. Information Systems, 24(3):229–253, 1999.

28. M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis (eds.). Fundamentals of Data Warehouses. Springer-Verlag, 2 edn., 2003.

29. M. A. Jeusfeld. Änderungskontrolle in Deduktiven Objektbanken. Ph.D. thesis, Universität Passau, 1992.

30. D. Kensche, C. Quix. Transformation of Models in(to) a Generic Metamodel. In Proc. BTW Workshop on Model and Metadata Management, pp. 4–15. 2007.

31. D. Kensche, C. Quix, M. A. Chatti, M. Jarke. GeRoMe: A Generic Role Based Metamodel for Model Management. Journal on Data Semantics, VIII:82–117, 2007.

32. D. Kensche, C. Quix, X. Li, Y. Li. GeRoMeSuite: A System for Holistic Generic Model Management. In C. Koch, J. Gehrke, M. N. Garofalakis, D. Srivastava, K. Aberer, A. Deshpande, D. Florescu, C. Y. Chan, V. Ganti, C.-C. Kanne, W. Klas, E. J. Neuhold (eds.), Proceedings 33rd Intl. Conf. on Very Large Data Bases (VLDB), pp. 1322–1325. Vienna, Austria, 2007.

33. D. Kensche, C. Quix, X. Li, Y. Li, M. Jarke. Generic schema mappings for composition and query answering. Data Knowl. Eng., 68(7):599–621, 2009.

34. M. Lenzerini. Data Integration: A Theoretical Perspective. In L. Popa (ed.), Proc. 21st ACM Symposium on Principles of Database Systems (PODS), pp. 233–246. ACM Press, Madison, Wisconsin, 2002.

35. X. Li, C. Quix. Merging Relational Views: A Minimization Approach. In M. A. Jeusfeld, L. M. L. Delcambre, T. W. Ling (eds.), Proc. 30th Intl. Conference on Conceptual Modeling (ER 2011), Lecture Notes in Computer Science, vol. 6998, pp. 379–392. Springer, Brussels, Belgium, 2011.

36. X. Li, C. Quix, D. Kensche, S. Geisler. Automatic schema merging using mapping constraints among incomplete sources. In J. Huang, N. Koudas, G. J. F. Jones, X. Wu, K. Collins-Thompson, A. An (eds.), Proc. 19th ACM Conf. on Information and Knowledge Management (CIKM), pp. 299–308. ACM, Toronto, Ontario, Canada, 2010.

37. J. López, A. Olivé. A Framework for the Evolution of Temporal Conceptual Schemas of Information Systems. In Proc. 12th Intl. Conf. on Advanced Information Systems Engineering (CAiSE), pp. 369–386. Stockholm, Sweden, 2000. URL https://doi.org/10.1007/3-540-45140-4_25.

38. S. Melnik, E. Rahm, P. A. Bernstein. Developing metadata-intensive applications with Rondo. Journal of Web Semantics, 1(1):47–74, 2003.

39. S. Melnik, E. Rahm, P. A. Bernstein. Rondo: A Programming Platform for Generic Model Management. In Proc. SIGMOD, pp. 193–204. ACM, 2003.

40. J. Mylopoulos, A. Borgida, M. Jarke, M. Koubarakis. Telos: Representing Knowledge About Information Systems. ACM Transactions on Information Systems, 8(4):325–362, 1990.

41. P. Nicolaescu, M. Rosenstengel, M. Derntl, R. Klamma, M. Jarke. View-Based Near Real-Time Collaborative Modeling for Information Systems Engineering. In Proc. 28th Intl. Conf. on Advanced Information Systems Engineering (CAiSE), pp. 3–17. Ljubljana, Slovenia, 2016. URL https://doi.org/10.1007/978-3-319-39696-5_1.

42. H. W. Nissen, M. Jarke. Repository Support for Multi-Perspective Requirements Engineering. Inf. Syst., 24(2):131–158, 1999. URL https://doi.org/10.1016/S0306-4379(99)00009-5.

43. A. Olivé. On the design and implementation of information systems from deductive conceptual models. In Proc. 15th Intl. Conf. on Very Large Data Bases (VLDB), pp. 3–11. Amsterdam, The Netherlands, 1989. URL http://www.vldb.org/conf/1989/P003.PDF.

44. A. Olivé. Conceptual Modeling in Agile Information Systems Development. In Proc. 16th Intl. Conf. on Enterprise Information Systems (ICEIS), pp. IS–11. Lisbon, Portugal, 2014.

45. B. Otto, S. Lohmann, S. Auer, G. Brost, J. Cirullies, A. Eitel, T. Ernst, C. Haas, M. Huber, C. Jung, J. Jürjens, C. Lange, C. Mader, N. Menz, R. Nagel, H. Pettenpohl, J. Pullmann, C. Quix, J. Schön, D. Schulz, J. Schütte, M. Spiekermann, S. Wenzel. Reference Architecture Model for the Industrial Data Space. Technical report, Fraunhofer-Gesellschaft, 2017. URL http://www.industrialdataspace.de.

46. C. Quix. Data Lakes: A Solution or a new Challenge for Big Data Integration? In Proc. 5th Intl. Conf. Data Management Technologies and Applications (DATA), p. 7. Lisbon, Portugal, 2016. Keynote presentation.

47. C. Quix, T. Berlage, M. Jarke. Interactive Pay-As-You-Go-Integration of Life Science Data: The HUMIT Approach. ERCIM News, 2016(104), 2016. URL http://ercim-news.ercim.eu/en104/special/interactive-pay-as-you-go-integration-of-life-science-data-the-humit-approach.

48. C. Quix, R. Hai, I. Vatov. Metadata Extraction and Management in Data Lakes With GEMMS. Complex Systems Informatics and Modeling Quarterly (CSIMQ), 9:67–83, 2016. URL https://doi.org/10.7250/csimq.2016-9.04.

49. C. Quix, D. Kensche, X. Li. Generic Schema Merging. In J. Krogstie, A. Opdahl, G. Sindre (eds.), Proc. 19th Intl. Conf. on Advanced Information Systems Engineering (CAiSE'07), LNCS, vol. 4495, pp. 127–141. Springer-Verlag, 2007.

50. C. Quix, D. Kensche, X. Li. Matching of Ontologies with XML Schemas using a Generic Metamodel. In R. Meersman, Z. Tari (eds.), Proc. OTM Confederated International Conf. CoopIS/DOA/ODBASE/GADA/IS, Lecture Notes in Computer Science, vol. 4803, pp. 1081–1098. Springer, Vilamoura, Portugal, 2007.

51. E. Rahm, P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, 2001.

52. B. Ramesh, M. Jarke. Toward Reference Models of Requirements Traceability. IEEE Trans. Software Eng., 27(1):58–93, 2001. URL https://doi.org/10.1109/32.895989.

53. R. Raventós, A. Olivé. An object-oriented operation-based approach to translation between MOF metaschemas. Data Knowl. Eng., 67(3):444–462, 2008. URL https://doi.org/10.1016/j.datak.2008.07.003.

54. P. Shvaiko, J. Euzenat. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics, IV:146–171, 2005. LNCS 3730.

55. P. Shvaiko, J. Euzenat. Ontology Matching: State of the Art and Future Challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1):158–176, 2013.

56. M. Staudt, M. Jarke. View Management Support in Advanced Knowledge Base Servers. J. Intell. Inf. Syst., 15(3):253–285, 2000. URL https://doi.org/10.1023/A:1008780430577.

57. E. Teniente, A. Olivé. Updating knowledge bases while maintaining their consistency. The VLDB Journal, 4(2):193–241, 1995.

58. A. Tort, A. Olivé. An approach to website schema.org design. Data Knowl. Eng., 99:3–16, 2015. URL https://doi.org/10.1016/j.datak.2015.06.011.