Ontology-based Conceptual Design of ETL Processes for both Structured and Semi-structured Data
Dimitrios Skoutas, Alkis Simitsis
{dskoutas,asimi}@dblab.ece.ntua.gr
National Technical University of Athens
Dept. of Electrical and Computer Engineering
http://www.dblab.ece.ntua.gr

Outline
- Introduction
- Graph-based Datastore Representation
- Application Ontology Construction and Representation
- Datastore Annotation
- ETL Transformations
- Conclusions

Extract-Transform-Load (ETL)
[Figure: Sources → Extract → DSA (Transform & Clean) → Load → DW]

Problem description
- Conceptual design of ETL processes
  - is a critical task performed at the early stages of a DW project
  - describes the integration of data from heterogeneous sources into the Data Warehouse
- Two main goals
  - specify inter-schema mappings
  - identify appropriate transformations

Motivation
- The problem of heterogeneity in data sources
  - structural heterogeneity
    - data stored under different schemata
  - semantic heterogeneity
    - different naming conventions
      - e.g., homonyms, synonyms
    - different representation formats
      - e.g., units of measurement, currencies, encodings
    - different ranges of values

Overview of our approach
- Key idea
  - an ontology-based approach to facilitate the conceptual design of an ETL scenario
- An ontology is a “formal, explicit specification of a shared conceptualization”
  - describes the knowledge in a domain in terms of classes, properties, and relationships between them
  - machine processable
  - formal semantics
  - reasoning mechanisms
- The Web Ontology Language (OWL) is used as the language for the ontology
  - W3C recommendation
  - based on Description Logics

Overview of our approach
- Method
  - Construct a graph representation for each datastore
    - datastore graph
  - Construct a suitable application ontology
    - ontology graph
  - Annotate the datastores
    - Establish mappings between the datastore graph and the ontology graph
  - Apply reasoning techniques to
    - select relevant sources
    - identify required transformations

Datastore schema
- The schema S_D of a datastore comprises
  - elements containing the actual data
  - elements containing or referring to other elements

Datastore graph
- Each element e defined in the schema S_D is represented by a node v_e ∈ V_D.
- Each containment relationship between elements e_1, e_2 is represented by an edge (v_1, v_2).
- Each reference from element e_1 to element e_2 is represented by an edge (v_1, v_2).
- Each edge is assigned a label of the form [min, max] denoting the corresponding cardinality.
- Elements containing the actual data are represented by leaf nodes.

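
The construction rules for the datastore graph can be illustrated with a minimal Python sketch. The class `DatastoreGraph`, its methods, the relation name `PS` with three attributes, and the `[1, 1]` cardinalities are all assumptions chosen for illustration; the slides do not prescribe an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DatastoreGraph:
    """Hypothetical container: nodes are schema elements, edges carry [min, max] labels."""
    nodes: set = field(default_factory=set)
    # (parent, child) -> (min, max) cardinality; max=None would stand for "unbounded"
    edges: dict = field(default_factory=dict)

    def add_edge(self, parent, child, min_card, max_card):
        """Record a containment or reference edge together with its cardinality label."""
        self.nodes.update((parent, child))
        self.edges[(parent, child)] = (min_card, max_card)

    def leaves(self):
        """Nodes without outgoing edges, i.e. the elements holding actual data."""
        parents = {parent for (parent, _) in self.edges}
        return self.nodes - parents


g = DatastoreGraph()
# Assumed cardinalities: each PS record has exactly one value per attribute.
for attr in ("pid", "sid", "cost"):
    g.add_edge("PS", attr, 1, 1)

print(sorted(g.leaves()))
```

Storing the cardinality on the edge rather than on the node mirrors the rule that labels of the form [min, max] belong to edges.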
Reference example

| Datastore | Schema |
| --- | --- |
| DW | PARTSUP(pkey, supplier, quantity, cost, city, address, date) |
| DS1 | PS(pid, sid, department, address, date, cost, qty) |
| DS2 | |

Reference example (cont’d)
- Datastore graphs

Application Ontology
- A suitable application ontology is constructed to model
  - the concepts of the domain
  - the relationships between those concepts
  - the attributes characterizing each concept
  - the different representation formats and (ranges of) values for each attribute

Application Ontology
- The application ontology comprises
  - a set of classes C = C_C ∪ C_T ∪ C_G
    - C_C : classes representing domain concepts
    - C_T : classes representing value types
    - C_G : classes representing aggregate functions
  - a set of properties P containing
    - P_P : properties representing attributes of concepts or relationships between concepts
    - property: convertsTo
    - property: aggregates
    - property: groups

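
One possible way to hold this structure in code is sketched below. The class `ApplicationOntology`, the triple encoding of properties, and the example names (`Supplier`, `Euros`, `Dollars`, `Sum`) are hypothetical; only the three class sets and the property name `convertsTo` come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationOntology:
    """Hypothetical in-memory form of the class sets and properties."""
    concept_classes: set = field(default_factory=set)    # C_C
    type_classes: set = field(default_factory=set)       # C_T
    aggregate_classes: set = field(default_factory=set)  # C_G
    # Assumed encoding: each property is a (name, domain, range) triple.
    properties: set = field(default_factory=set)

    def classes(self):
        """C = C_C ∪ C_T ∪ C_G."""
        return self.concept_classes | self.type_classes | self.aggregate_classes

    def convertible(self, src, dst):
        """True if a convertsTo property is declared from src to dst."""
        return ("convertsTo", src, dst) in self.properties


onto = ApplicationOntology(
    concept_classes={"Supplier"},
    type_classes={"Euros", "Dollars"},
    aggregate_classes={"Sum"},
)
onto.properties.add(("cost", "Supplier", "Euros"))       # an attribute in P_P
onto.properties.add(("convertsTo", "Dollars", "Euros"))  # a declared conversion

print(sorted(onto.classes()))
print(onto.convertible("Dollars", "Euros"))
```

In this sketch `convertsTo` is directional, so a conversion from `Euros` back to `Dollars` would have to be declared separately.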
Ontology Graph
- A graph representation is specified for the ontology
- Graph nodes represent classes in the ontology
- Graph edges represent properties in the ontology
- Different symbols are used for the different types of classes and properties

Ontology Graph

Reference example (cont’d)
- The application ontology graph

Datastore annotation
- The semantic annotation of each datastore consists in establishing the appropriate mappings between the datastore graph G_S and the ontology graph G_O.
- Each internal node of G_S may be mapped to one concept-node of G_O.
- A leaf node of G_S may be mapped to one or more nodes of G_O of the following types:
  - type-node
  - format-node
  - range-node
  - aggregated-node
- A node may have zero or more mappings.
- Mappings are represented as node labels.

Datastore annotation
- A defined class is created in the ontology for each internal labeled node of the datastore graph.
- The definition for a node is constructed based on its neighbor labeled nodes.
- A neighbor labeled node of n is each node n′ such that:
  - n′ is labeled
  - there is a path p in the datastore graph from node n to node n′
  - p contains no other labeled nodes, except n and n′

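
The definition of a neighbor labeled node can be read as a graph traversal that passes through unlabeled nodes and stops at the first labeled node on each path. The sketch below follows that reading; the function name, the adjacency-dict encoding, and the example element names are assumptions for illustration.

```python
def neighbor_labeled_nodes(children, labeled, n):
    """Return the labeled nodes reachable from n without crossing another labeled node.

    children: dict mapping a node to the list of its successor nodes
    labeled:  set of nodes that carry a mapping (label)
    """
    found = set()
    seen = {n}  # never report n itself, even if the graph contains a cycle
    stack = list(children.get(n, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in labeled:
            # The path ends here: anything below would cross a labeled node.
            found.add(node)
        else:
            # Unlabeled nodes are transparent, so keep following their edges.
            stack.extend(children.get(node, []))
    return found


# Hypothetical datastore graph: "items" and "name" carry no label.
children = {
    "order": ["items", "customer"],
    "items": ["item"],
    "item": ["price"],
    "customer": ["name"],
}
labeled = {"order", "item", "price", "customer"}

print(sorted(neighbor_labeled_nodes(children, labeled, "order")))
```

Here `price` is not a neighbor labeled node of `order`, because every path to it passes through the labeled node `item`.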
Reference example (cont’d)
- Datastore mappings

Reference example (cont’d)
- Datastore definitions

ETL Transformations
- Generic types of ETL transformations

Generating ETL transformations
- Two main steps
  - select relevant sources to populate each DW element
  - identify required data transformations between the sources and the DW

Generating ETL transformations
- Selecting relevant sources
  - a source node n_S, mapped to class c_S
  - a target node n_T, mapped to class c_T
  - n_S is provider for n_T, if
    - c_S and c_T have a common superclass
      - ensures that the integrated data records have the same semantics
    - c_S and c_T are not disjoint
      - prevents data integration between datastores with conflicting constraints

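
The two provider conditions can be sketched over a toy class hierarchy. This is not a Description Logic reasoner; in practice the checks would presumably be delegated to an OWL reasoner. The sketch assumes that a class counts as its own superclass, that a universal top class such as `owl:Thing` is left out of the hierarchy, and that disjointness is inherited from declared pairs. All function and class names are hypothetical.

```python
def ancestors(cls, parents):
    """All superclasses of cls, including cls itself (reflexive closure assumed)."""
    result, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(parents.get(current, ()))
    return result


def is_provider(c_s, c_t, parents, disjoint):
    """Check the two conditions: a common superclass exists and the classes are not disjoint."""
    anc_s, anc_t = ancestors(c_s, parents), ancestors(c_t, parents)
    has_common_superclass = bool(anc_s & anc_t)
    # Assumption: classes are disjoint if any pair of their ancestors is declared disjoint.
    conflicting = any(frozenset((a, b)) in disjoint for a in anc_s for b in anc_t)
    return has_common_superclass and not conflicting


# Hypothetical hierarchy: child class -> list of direct superclasses.
parents = {
    "CheapPart": ["Part"],
    "ExpensivePart": ["Part"],
    "EuropeanSupplier": ["Supplier"],
}
disjoint = {frozenset(("CheapPart", "ExpensivePart"))}

print(is_provider("CheapPart", "Part", parents, disjoint))
print(is_provider("CheapPart", "ExpensivePart", parents, disjoint))
print(is_provider("CheapPart", "EuropeanSupplier", parents, disjoint))
```

The second call fails the disjointness condition and the third fails the common-superclass condition, so only the first pair qualifies as provider and target.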
Generating ETL transformations
- Identifying data transformations (I)
  - a RETRIEVE operation for each provider node n
  - a MERGE operation to combine data from several provider nodes
  - an EXTRACT operation to extract a portion of data from a provider node

Generating ETL transformations
- Identifying data transformations (II)
  - if c_S ≡ c_T or c_S ⊏ c_T, no transformations are required
  - if c_T ⊏ c_S, AGGREGATE, FILTER and/or MINCARD/MAXCARD operations are required
  - else, as previous plus CONVERT operations

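
The three cases can be written as a small decision function. The sketch below approximates class equivalence by name equality and subsumption by reachability in a declared hierarchy, both of which are simplifying assumptions. It returns candidate operations only; which of them are actually needed depends on the class definitions, which this sketch does not inspect. The function names and example classes are hypothetical.

```python
def ancestors(cls, parents):
    """All superclasses of cls, including cls itself."""
    result, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(parents.get(current, ()))
    return result


def candidate_operations(c_s, c_t, parents):
    """Map the relationship between source class c_s and target class c_t to operations."""
    # Case 1: the source is equivalent to, or more specific than, the target.
    if c_t in ancestors(c_s, parents):
        return []
    restricting = ["AGGREGATE", "FILTER", "MINCARD", "MAXCARD"]
    # Case 2: the target is more specific than the source.
    if c_s in ancestors(c_t, parents):
        return restricting
    # Case 3: neither subsumes the other, so a conversion may also be needed.
    return restricting + ["CONVERT"]


# Hypothetical hierarchy: child class -> list of direct superclasses.
parents = {"EuroCost": ["Cost"], "DollarCost": ["Cost"]}

print(candidate_operations("EuroCost", "Cost", parents))
print(candidate_operations("Cost", "EuroCost", parents))
print(candidate_operations("DollarCost", "EuroCost", parents))
```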
Generating ETL transformations
- Identifying data transformations (III)
  - a JOIN operation to combine recordsets from nodes, whose corresponding classes are related by a property.
  - a UNION operation, followed by a DD operation, to combine recordsets from nodes, whose corresponding classes have a common superclass.
  - a STORE operation to denote loading of data to the target datastore.

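
To make these operations concrete, the sketch below applies them to recordsets modeled as lists of dictionaries. The slide does not expand the abbreviation DD; it is read here as duplicate detection, implemented as dropping exact duplicate records, which is an assumption. The function names, the equality join on a single key, and the sample data are likewise hypothetical.

```python
def union(*recordsets):
    """UNION: concatenate recordsets that describe the same kind of entity."""
    return [rec for recordset in recordsets for rec in recordset]


def dd(recordset):
    """DD, read here as duplicate detection: keep the first copy of each record."""
    seen, unique = set(), []
    for rec in recordset:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def join(left, right, key):
    """JOIN: combine records from two recordsets that agree on the given key."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)
    return [{**a, **b} for a in left for b in index.get(a[key], [])]


def store(recordset, target):
    """STORE: load the records into the target datastore (a plain list here)."""
    target.extend(recordset)
    return target


ds1 = [{"pid": 1, "cost": 10}, {"pid": 2, "cost": 20}]
ds2 = [{"pid": 2, "cost": 20}, {"pid": 3, "cost": 30}]
suppliers = [{"pid": 1, "supplier": "A"}, {"pid": 3, "supplier": "B"}]

merged = dd(union(ds1, ds2))
joined = join(merged, suppliers, "pid")
warehouse = store(joined, [])
print(warehouse)
```

Running DD directly after UNION removes the record that both sources contribute before it reaches the join.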
Reference example (cont’d)
- Provider nodes and transformations for DS2

Conclusions
- A graph-based representation, datastore graph, as a common model for the datastores.
- A suitable application ontology and a corresponding graph representation, ontology graph.
- Datastore annotation through mappings from the datastore graph to the ontology graph.
- Reasoning on the mappings to identify relevant sources and required transformations.

Current and Future Work
- Semi-automatic construction of the application ontology
- Semi-automatic annotation of the datastores
- Executable workflow
- Evaluation on real-world ETL scenarios
- Maintenance/adaptation of the ETL workflow

Thank You
Questions