Tracer: A Machine Learning Approach to Data Lineage
TRANSCRIPT
Tracer: A Machine Learning Approach to Data Lineage
by
Felipe Alex Hofmann
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2020
© Massachusetts Institute of Technology 2020. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . Department of Electrical Engineering and Computer Science
May 20, 2020
Certified by . . . . . . . . . . . . . . . . . . . . Kalyan Veeramachaneni
Principal Research Scientist, Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . Katrina LaCurts
Chair, Master of Engineering Thesis Committee
Tracer: A Machine Learning Approach to Data Lineage
by
Felipe Alex Hofmann
Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2020, in partial fulfillment of the
requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science
Abstract
The data lineage problem entails inferring the source of a data item. Unfortunately, most of the existing work in this area relies either on metadata, code analysis, or data annotations. In contrast, our primary focus is to present a machine learning solution that uses the data itself to infer the lineage. This thesis formally defines the data lineage problem, specifies the assumptions under which we solve it, and provides a detailed description of how our system works.
Thesis Supervisor: Kalyan Veeramachaneni
Title: Principal Research Scientist
Acknowledgments
I would like to thank my supervisor, Kalyan, for his insights, ideas, and instruction
during this project; his support and guidance were invaluable for this thesis. I would
also like to thank Kevin and Carles for their help with the design and development
of the system. Finally, I would like to thank my family — Ricardo, Leila, Gustavo,
and Nathália — for their unwavering support over the years.
Contents
1 Introduction 17
1.1 Data Lineage Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.2 Why does it matter? . . . . . . . . . . . . . . . . . . . . . . . 18
1.2 Tracer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.2 Why is Tracer needed? . . . . . . . . . . . . . . . . . . . . . . 20
1.2.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Thesis Road Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Related Work 23
2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3 System Design 25
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Tracer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Tracer Primitives 29
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Primary Key Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 Design choices . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Foreign Key Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Design choices . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Column Map Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4.2 Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.3 Design choices . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Putting It All Together . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Modeling Methods 41
5.1 Primary Key Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Foreign Key Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Column Map Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.1 Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.2 Solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.3 Inverse Transform . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Datasets 47
6.1 Column Mapping Simulation . . . . . . . . . . . . . . . . . . . . . . . 48
6.1.1 Software Design . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.2 Atoms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.1.3 Sieve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7 Experiments 53
7.1 Primary Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.1.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2 Foreign Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.2.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.3 Column Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.3.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8 Conclusion 61
A Column Map without Foreign Keys 63
B Relational Dataset 67
C Column Map Discovery Performance 71
List of Figures
1-1 Example of data lineage under our definitions. Input tables 𝐼1 and 𝐼2
are going through transformations 𝑇1, 𝑇2, 𝑇3, 𝑇4 and generating output
tables 𝑂1, 𝑂2. Therefore, the lineage of 𝑂1, 𝑂2 is the set containing 𝐼1
and 𝐼2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1-2 This figure shows a user who is interested in a statistic and wants to
know the source(s) of the data: in other words, the data lineage. The
goal of this project is to answer this question in the context of tabular
databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3-1 This figure provides a brief overview of how Tracer works and its role
in the system context. Tables without lineage (in the left) go through
the Data Ingestion Engine to be converted into the appropriate format,
and then fed into Tracer — the focus of this thesis — which discovers
the lineage of the tables. . . . . . . . . . . . . . . . . . . . . . . . . . 26
3-2 This figure shows how all the primitives are put together into a sin-
gle pipeline, as well as their standard inputs and outputs. Further
clarification regarding the data types can be found in Table 3.1. . . . 26
4-1 This figure showcases the functionality of the Primary Key Discovery
module. It takes as input a set of tables (on the left) and returns as
output the name of the columns that it predicts are primary keys (on
the right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4-2 This is an example of how the Foreign Key Discovery module works.
It takes a set of tables as input (on the left), as well as their corre-
sponding primary keys (represented in green), and outputs the foreign
key relationships that it predicts, in sorted order from most likely to
least (on the right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4-3 This figure shows an example of the functionality of the Column Map
Discovery primitive, where the module learns that the target column
"Total" was generated by adding “Cost 1” and “Cost 2”. . . . . . . . . 34
4-4 This example shows the functionality of the Column Map Discovery
primitive, where the module discovers that “Purchase Date” was gen-
erated from “Quantity” (the amount of rows for each key in “Purchase
Date” corresponds to the values in “Quantity”). . . . . . . . . . . . . 35
4-5 This figure shows the Column Map Discovery primitive being applied,
where the module learns that “Total” was generated from adding all
values from “Cost 1” and “Cost 2” with the same foreign keys. . . . . 35
4-6 This example shows the functionality of the Column Map Discovery
primitive, where the module discovers that “Min/Max” was generated
from “Prices” (the values in “Min/Max” are the minimum and maxi-
mum values from each key of “Prices”). . . . . . . . . . . . . . . . . . 36
5-1 This figure demonstrates how Column Map Discovery works in three
steps: (1) the model expands the columns of the dataset (e.g. “Date”
becomes “Day”, “Month”, “Year”); (2) the model attempts to predict
which of these columns generated the target column, assigning scores
to each of them; (3) the model aggregates the columns that were ex-
panded, selecting the largest scores as lineage of the target. . . . . . . 43
6-1 This figure shows the schema for the MovieLens dataset [24], where the
arrows represent foreign key relationships, the first column describes
the columns in each table, and the second column informs the data
type of each column. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6-2 A basic example on how to use DataReactor. . . . . . . . . . . . . . . 48
A-1 This figure shows an example of the functionality of the Column Map
Discovery primitive, where the module learns that "Total" was gen-
erated by adding “Cost 1” and “Cost 2” without knowing the foreign
keys. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Tables
2.1 A summary of the research done to each data lineage field, separated
by type (how-lineage or why-lineage) and granularity (schema-level or
instance-level) [19]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 This table defines the standard inputs and outputs of the Primary Key,
Foreign Key and Column Map Discovery primitives. . . . . . . . . . . 28
4.1 This table shows an example of the correct input/output format for
the Primary Key Discovery primitive. . . . . . . . . . . . . . . . . . . 31
4.2 This table shows an example of the correct input/output format for
the Foreign Key Discovery primitive. . . . . . . . . . . . . . . . . . . 33
4.3 This table shows an example of the correct input/output format for
the Column Map Discovery primitive. . . . . . . . . . . . . . . . . . . 37
6.1 Description of the Dataset and DerivedColumn data structures. . . . 50
7.1 Number of tables per dataset, as well as the accuracy of our imple-
mentation of the Primary Key Discovery primitive in three situations,
(1) normally, (2) without utilizing the column names, and (3) without
utilizing the position of the columns. . . . . . . . . . . . . . . . . . . 54
7.2 Example of the Foreign Key Discovery evaluation method, when ground
truth is A, B, C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.3 Number of foreign keys per dataset, F-measure, precision and recall
of our implementation of the Foreign Key Discovery primitive when
tested against eighteen datasets, using the evaluation method described
in Section 7.2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.4 This table shows the performance of our current implementation of
the Column Map Discovery primitive. These results were obtained by
aggregating the values from Table C.1 for each dataset. . . . . . . . . 58
7.5 This table shows the F-measure, precision and recall of our implemen-
tation of the Column Map Discovery primitive. These results were
obtained by aggregating the values from Table C.1 for the two trans-
formations generated by DataReactor. . . . . . . . . . . . . . . . . . 58
B.1 This table provides some details regarding the 73 relational datasets
from [24]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
C.1 This table shows the F-measure, precision and recall of our implemen-
tation of the Column Map Discovery primitive when tested against
eleven relational datasets using the evaluation method proposed in Sec-
tion 7.3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 1
Introduction
In data science, the inability to trace the lineage of particular data is a common
and fundamental problem that undermines both security and reliability [22]. Under-
standing data lineage enables auditing and accountability, and allows users to better
understand the flow of data through a system. However, by the time data makes its
way through a system, its lineage has often been lost. Our goal in this thesis is to
present a machine learning solution to the data lineage problem. We argue that our
approach provides a more general and more extensible solution to this problem than
existing approaches which rely only on handcrafted heuristics [14], code analysis [23],
and/or manual annotation of data [9].
1.1 Data Lineage Tracing
1.1.1 Definition
To approach the problem we need a formal definition of data lineage. For the purposes
of this thesis, we will be using the definition provided in [14]:
Define 𝑂 as a set of output tables, 𝐼 as a set of input tables, 𝑜 as a data item (row
or column) from the output set, and define transformation 𝑇 as a procedure that
takes a set of tables as input and produces another set of tables as output. Then,
given 𝑇 (𝐼) = 𝑂 and an output item 𝑜 ∈ 𝑂, the set 𝐼 ′ ⊆ 𝐼 of input data items that
Figure 1-1: Example of data lineage under our definitions. Input tables 𝐼1 and 𝐼2 are going through transformations 𝑇1, 𝑇2, 𝑇3, 𝑇4 and generating output tables 𝑂1, 𝑂2. Therefore, the lineage of 𝑂1, 𝑂2 is the set containing 𝐼1 and 𝐼2.
contributed to 𝑜’s derivation is the lineage of 𝑜, denoted as 𝐼 ′ = 𝑇 ′(𝑜, 𝐼). Figure 1-1
provides an example of lineage under this definition.
With this, we can define a lineage graph to be a directed acyclic graph of the
lineage, where nodes are entries of tables and edges are the lineage of these entries,
pointing from the original entry to the entry that was generated. Notice that under
this definition, 𝑇 ′ only provides the immediate parents of a data item, and the function
has to be called recursively to fully compute the lineage graph.
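To make the recursion concrete, the following toy sketch (not part of Tracer) builds a lineage graph by repeatedly applying an immediate-parent map; the dictionary of parents is made up for illustration.

def lineage_graph(item, immediate_parents):
    # immediate_parents maps an item to the set of items it was directly derived from,
    # i.e. the role played by T' in the definition above.
    edges = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in immediate_parents.get(current, set()):
            if (parent, current) not in edges:
                edges.add((parent, current))   # edge points from parent to derived item
                stack.append(parent)           # recurse on the parent
    return edges

parents = {"O1": {"I1", "I2"}, "O2": {"I2"}}   # hypothetical immediate-parent map
print(lineage_graph("O1", parents))            # {("I1", "O1"), ("I2", "O1")}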
1.1.2 Why does it matter?
Reliable data is fundamental for effective decision making. However, as data —
specifically big relational data — is sampled, mixed and altered, and passed through
systems and between stakeholders, its lineage is frequently lost [22]. If the original
source of data is obscured, the quality of the derived data may be questioned, along
with the reliability of any decisions based on it. A method for tracing the origin of
data and its movement between people, processes, and systems would help address
this issue.
Furthermore, more control over data is essential for regulatory compliance. As
issues like privacy and control over personal data receive more attention, data protec-
tion agencies around the world are requiring that companies and other data-collecting
entities provide users a higher level of control over their own data. Some recent exam-
ples of legislation are the GDPR (General Data Protection Regulation) [2] which was
implemented in the European Union in 2018, and the CCPA (California Consumer
Figure 1-2: This figure shows a user who is interested in a statistic and wants to know the source(s) of the data: in other words, the data lineage. The goal of this project is to answer this question in the context of tabular databases.
Privacy Act) [1], which takes effect in 2020. With automatic data lineage tracing,
it becomes easier to answer the question “Where did this data come from?” and to
satisfy these regulatory compliance requirements.
Understanding the flow of data through systems has many benefits for stakehold-
ers. Accurate data lineage can help identify gaps in reporting, highlight data quality
issues, and help trace errors back to their root cause. It can help address questions
such as how valuable the data is, whether the data is duplicated or redundant, and
whether it is internal or external. Being able to trace the lineage of data is, overall,
indispensable for enterprise applications.
1.2 Tracer
1.2.1 Overview
In light of the issues described above, we propose Tracer, a machine learning solution
to the data lineage problem. At a high level, our system traces the lineage of data
by (1) taking a tabular database as input, (2) analyzing the tables with statistical
methods, machine learning techniques, and hand-crafted heuristics, and (3) returning
the rows and columns that generated the tables.
An end-to-end solution to the data lineage problem consists of:
1. Identifying parent-child relationships between tables.
2. Understanding how each row/column in a parent table maps to a row/column
in the child table.
3. Presenting this information in a human-readable manner which allows users to
derive actionable insights, as well as in a machine-readable manner which can
be utilized by downstream data science tools.
1.2.2 Why is Tracer needed?
While many solutions to the data lineage problem have been proposed before, all
available attempts at tackling the problem rely either on handcrafted heuristics (such
as [14], which provides algorithms for lineage tracing in data warehouses), code anal-
ysis (such as [23], which provides a system for lineage tracing relying on source code
analysis), and/or manual annotation of data (such as [9], which provides a data an-
notation system for relational databases).
While these are all valid ways of lineage tracing, they all have limitations. Code
analysis requires access to the code, which may be inaccessible when dealing with
external applications or if the code was lost/deprecated. Algorithms require a strict
set of conditions to be able to run, such as knowing ahead of time all the possible
transformations, which cannot always be satisfied. Finally, solutions that require
annotations are great for future lineage tracing, but they were not designed to answer
the question of where the present non-annotated data came from.
Tracer does not have these same limitations. Unlike these approaches, Tracer
learns the lineage from the data itself, and so (1) it doesn't require code access,
(2) it places few constraints on the data, since it uses machine learning instead of
fixed algorithms, and (3) it doesn't need any annotations, since the data alone is
enough for it to learn the lineage.
1.2.3 Scope
Trying to solve the data lineage problem for all types of data would be a massive
undertaking, and we would likely encounter issues achieving high accuracy and speed.
Therefore, we will focus our attention on a subset of more common situations, as
enumerated below:
1. All transformations must be deterministic, since adding randomization to the
problem makes inferring the lineage significantly harder.
2. All transformations must be reducible to operations from columns to columns
or from rows to rows. For example, merging two columns together is valid, but
merging a column with a row is not.
3. The lineage must be unambiguous. For example, let’s assume the input column
[1, 2, 3] generated the output column [4, 8, 12]. If there is another input [2, 4,
6], the model will not be able to infer which input generated the output, since
there are equally simple transformations for both inputs that would result in
the output.
4. All the data must be provided in tables. Tracer currently does not support
conversion of other data types into the appropriate format.
1.3 Thesis Road Map
This thesis is organized as follows:
∙ Chapter 2 introduces related work about the data lineage problem.
∙ Chapter 3 presents the high-level design of the Tracer system.
∙ Chapter 4 describes the API of our system.
∙ Chapter 5 clarifies the technical workings of our system.
∙ Chapter 6 presents the datasets used to test our system.
∙ Chapter 7 demonstrates the performance of our system.
∙ Chapter 8 highlights promising areas for future work on this topic.
Chapter 2
Related Work
In this section, we provide an overview of previous work on the data lineage problem.
2.1 Definition
People have been working to keep track of data lineage for several decades. Back in
1997, Woodruff and Stonebraker [28] defined the lineage of data as its entire processing
history, and provided a novel system to generate lineage without relying on metadata.
Buneman, Khanna and Tan discussed the issue as well [11]. The field was further
advanced when Cui and Widom published their work on the topic [15], [13], [14],
defining data lineage in the context of data warehouses and providing algorithms for
lineage tracing.
2.2 Classification
Research regarding data lineage can be divided into four distinct fields, based on the
lineage type (how-lineage or where-lineage) and granularity (schema-level or instance-
level). Lineage type refers to the question being answered, either “where”
the data comes from (i.e. which tables generated some output) or “how” the data
was transformed (i.e. the specific transformations which generated some output).
Granularity refers to the format of the lineage, where schema-level/coarse-grained
Where-lineage, schema-level: Workflow [17]
Where-lineage, instance-level: Why and Where [12], General Warehouse [14], Warehouse View [15], Non-Answers [18], Approximate [25], Trio [6]
How-lineage, schema-level: Workflow [17], Curated [10]
How-lineage, instance-level: Curated [10]
Table 2.1: A summary of the research done in each data lineage field, separated by type (where-lineage or how-lineage) and granularity (schema-level or instance-level) [19].
lineage means the model looks at whole tables at a time, or even entire datasets,
while instance-level/fine-grained means that each data item is analyzed individually.
This is further discussed by Ikeda and Widom [19], who show some of the existing
body of work for each category, which we reproduce in Table 2.1.
2.3 Setting
Data context determines what type of data lineage tracing can be performed. For
example, in data warehouses, where information is properly organized and well-
annotated, the lineage can be discovered with basic heuristics, such as described
by [14]. On the other hand, in malicious environments, intentional and unintentional
leakages must be taken into consideration, and entirely different frameworks must
be developed, as [7] and [22] discuss. Other examples include lineage in the context
of arrays, as studied in [21], and in the setting of relational databases containing
uncertain data, as described by [8] and [6].
Chapter 3
System Design
At a high level, the Tracer library works by executing pipelines that contain modules
called primitives. Each of these primitives is responsible for solving a particular
subproblem related to the data lineage task, from primary and foreign key discovery
to identifying column mappings. In this chapter, we describe how these pipelines and
primitives interact to solve the data lineage problem.
3.1 Overview
The Tracer library was developed to solve the data lineage problem in the context
of tabular databases. To understand the purpose and functionality of the library,
we must first understand how Tracer interacts with other modules within the same
context. Figure 3-1 provides a general overview of the system design.
On the left side, the figure shows various systems generating tables. We don’t
always know which system generated which tables, which can lead to a loss of data
lineage. These systems can also be both complex and diverse — for example, System
A could be a MySQL database, and System B a Hadoop database, with each one
requiring distinct handling.
The tables are fed into the Data Ingestion Engine. The two components of this
module — the System Connector and the Input Generator — help to transform the
various possible inputs into a standardized format called Metadata.JSON [3].
Figure 3-1: This figure provides a brief overview of how Tracer works and its role in the system context. Tables without lineage (on the left) go through the Data Ingestion Engine to be converted into the appropriate format, and then are fed into Tracer — the focus of this thesis — which discovers the lineage of the tables.
Tracer then takes the datasets, along with the Metadata.JSON file, and attempts
to identify useful properties pertinent to the data lineage problem. It does this by
applying various procedures called primitives which extract information (such as pri-
mary keys and foreign keys) and store it in the metadatabase, metadb. This data is
then processed by the UI/UX library, which provides a graphical interface, the Data
Lineage Dashboard, that facilitates the user’s interaction with Tracer.
3.2 Tracer
Figure 3-2: This figure shows how all the primitives are put together into a single pipeline, as well as their standard inputs and outputs. Further clarification regarding the data types can be found in Table 3.1.
Now that we have an overview of the system at large, let’s take a closer look at
the Tracer module. Tracer is an end-to-end pipeline that takes in a set of data tables and
outputs the lineage of that data. It accomplishes this by running through a series of
steps, called primitives. Each primitive generates relevant information that is then
fed to the next primitive in the pipeline, and on down the line until the data lineage
is discovered.
This interaction is graphically represented in Figure 3-2. As the diagram shows,
we start with a set of tables, which are fed to the Primary Key Discovery primitive to
generate the primary keys of the dataset. Next, the Foreign Key Discovery primitive
uses the tables and the generated primary keys to detect any foreign key relationships
that may exist in the dataset. Finally, the Column Map Discovery primitive uses
these foreign keys, as well as the original tables, to discover which columns in the
dataset generated which others. The design and functionality of each of these primitives is
described in Chapter 4.
The above process, if accurate, is sufficient to provide the lineage of any item in a
database. This is due to one of the constraints we presented previously: Tracer only
deals with transformations on rows and columns. Therefore, if we know the mapping
of every row (through foreign keys), and the mapping of every column (through the
column map), we can pinpoint the exact origin of each cell of a table.
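As a toy illustration of this point (not Tracer code), suppose we already have a row mapping and a column mapping for an output table; combining the two is enough to locate the cells that produced a given output cell. The table and column names below are hypothetical.

# Output row 0 of "totals" was built from rows 0 and 1 of "costs" (row mapping),
# and its "total" column was generated from "cost_1" and "cost_2" (column mapping).
row_map = {("totals", 0): ("costs", [0, 1])}
column_map = {("totals", "total"): [("costs", "cost_1"), ("costs", "cost_2")]}

def cell_lineage(table, row, column):
    src_table, src_rows = row_map[(table, row)]
    src_columns = [c for t, c in column_map[(table, column)] if t == src_table]
    return [(src_table, r, c) for r in src_rows for c in src_columns]

print(cell_lineage("totals", 0, "total"))
# [('costs', 0, 'cost_1'), ('costs', 0, 'cost_2'), ('costs', 1, 'cost_1'), ('costs', 1, 'cost_2')]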
3.3 Terminology
As can be seen in Figure 3-2, each primitive has well-defined inputs and outputs.
These data structures are defined in Table 3.1 and will be used throughout the rest
of this thesis.
Name: Tables
Description: A set of tables, each of which has (1) a unique name, (2) one or more named columns, and (3) zero or more rows. Data Structure: dictionary mapping table names to dataframes.
Example: {table1: DataFrame, table2: DataFrame, ...}

Name: PrimaryKeys
Description: Indicates which columns of a set of tables are the primary keys. Data Structure: dictionary mapping table names to column names.
Example: {table1: user_id, table2: prod_id, ...}

Name: ForeignKeys
Description: All the foreign key mappings that exist within a set of tables, where each referenced table maps to all tables that refer to it. Data Structure: a sorted list of dictionaries, each of which contains five elements: table, column, ref_table, ref_column, score.
Example: [{"table": users, "column": user_id, "ref_table": posts, "ref_column": key, "score": .9}, ...]

Name: ColumnMap
Description: The mapping from a target column to the columns from the parent table that generated it. Data Structure: a sorted list of dictionaries.
Example: [{"table": costs, "column": cost_1, "score": .9}, ...]
Table 3.1: This table defines the standard inputs and outputs of the Primary Key, Foreign Key and Column Map Discovery primitives.
Chapter 4
Tracer Primitives
This chapter aims to provide further context about the building blocks of Tracer: the
primitives. As explained in Chapter 3, the goal of each primitive is to solve a sub-
problem of the data lineage task. When combined, the solutions to these subproblems
allow us to answer any data lineage query on a standard relational database. In this
chapter we will (1) define what a primitive is, (2) explain how each of them works,
and (3) show how they work together to solve a data lineage problem.
4.1 Overview
While Section 3.2 explained what a primitive does, it did not answer the question
of what a primitive is. A primitive is a well-defined Python class which contains
methods aimed at solving a specific part of the data lineage problem. There are
three types of primitives implemented in Tracer: Primary Key Discovery, Foreign
Key Discovery, and Column Map Discovery. Because the output of each successive
primitive is required as an input for the one that follows, as shown in Figure 3-2, they
must be called in this specific order.
As for their actual implementation, each Tracer primitive class satisfies the fol-
lowing properties:
1. Contains a fit function, which is optional and corresponds to training the
model on some external datasets.
2. Contains a solve function, which is required and uses the fitted model to pro-
duce the appropriate output.
3. Has standard inputs/outputs as shown in Figure 3-2.
4. Is easily extensible. In the scenario where a superior method to solve a primitive
task is discovered (we describe our current implementation in Chapter 5), only
the code for that particular primitive has to be substituted, since each of the
primitives is independent of the others.
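The snippet below sketches what this interface could look like in Python. The class names and the toy heuristic are illustrative assumptions, not the actual Tracer implementation.

from abc import ABC, abstractmethod

class Primitive(ABC):
    def fit(self, datasets):
        # Optional: train on external datasets whose lineage is already known.
        return self

    @abstractmethod
    def solve(self, *inputs):
        # Required: use the (optionally fitted) model to produce the primitive's output.
        ...

class PrimaryKeyDiscovery(Primitive):
    def solve(self, tables):
        # Toy heuristic: pick the first column whose values are all unique, if any.
        return {
            name: next((col for col in df.columns if df[col].is_unique), None)
            for name, df in tables.items()
        }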
In the following sections we will describe (1) the three main types of primitives
implemented in the Tracer library, (2) their respective APIs, and (3) the reasoning
behind their design.
4.2 Primary Key Discovery
Figure 4-1: This figure showcases the functionality of the Primary Key Discovery module. It takes as input a set of tables (on the left) and returns as output the names of the columns that it predicts are primary keys (on the right).
The goal of the Primary Key Discovery primitive is to discover all the primary keys
within a dataset. It takes as input a dictionary mapping table names to dataframes
(Tables) and returns as output a dictionary mapping table names to the column name that
describes the primary key (PrimaryKeys). If a table has no primary key, it will map
to "None". Figure 4-1 demonstrates the functionality of this primitive.
Consider the scenario shown in Table 4.1. In this example there are two tables,
“users” and “posts”, with primary key columns “user_id” and “post_id”, respectively.
Input:
{
    users: pd.DataFrame(user_id, user_name, ...),
    posts: pd.DataFrame(post_id, post_title, ...)
}

Output:
{
    users: user_id,
    posts: post_id
}
Table 4.1: This table shows an example of the correct input/output format for the Primary Key Discovery primitive.
Now assume we lost the information indicating which columns are the primary keys.
To rediscover them, we can use the Primary Key Discovery primitive by calling the
solve function on these tables, and the model will return the primary keys for us.
4.2.1 Design choices
The Primary Key Discovery primitive takes a dataset as input, as opposed to only a
single table. We designed it as such because some methods for discovering primary
keys rely on looking at relationships between tables. For example, consider a dataset
(Tables), where the primary key for every table always has the same column name.
Under our implementation, a primary key discovery model would be able to use this
fact to simplify the key tracing. However, this argument does not extend to multiple
datasets: there is little that can be learned across multiple datasets that could not
be done within one, and so the input is a single dataset instead of a list.
In addition, PrimaryKeys only stores one primary key per table, because the
current implementation of Tracer can only discover up to one key per table. This will
be enhanced in a future version of the project.
Another relevant decision concerns why this module only takes Tables as input
and not other potentially useful information, such as foreign keys. There are two
reasons for this. First, the Foreign Key Discovery module takes the primary keys
as input, so if Primary Key Discovery required the foreign keys, we would have a
loop. Second, the current module already achieves near-perfect accuracy, as is shown
in Section 7.1, which means the added complexity of alternative implementations
(e.g. merging both the Key Discovery primitives to solve the looping issue) is not
justifiable.
4.3 Foreign Key Discovery
Figure 4-2: This is an example of how the Foreign Key Discovery module works. It takes a set of tables as input (on the left), as well as their corresponding primary keys (represented in green), and outputs the foreign key relationships that it predicts, in sorted order from most likely to least (on the right).
The goal of the Foreign Key Discovery primitive is to discover all the foreign
keys within some dataset. It takes as input a dictionary mapping table names to
dataframes (Tables) and a dictionary mapping these tables to their primary keys
(PrimaryKeys), and returns a list of dictionaries representing the foreign keys of the
table, sorted by how likely they are according to the model (ForeignKeys). No-
tice that the majority of the keys in ForeignKeys don’t actually exist. Because
ForeignKeys contains all possible foreign key relationships, sorted from most to least
likely to exist based on their “score” (see Table 3.1), only the initial values of the list
correspond to actual foreign keys, while the other values must be thresholded. Figure
4-2 demonstrates the functionality of this primitive.
Consider the scenario shown in Table 4.2. In this example there are two tables,
“users” and “posts,” with a foreign key relationship between “user_id” and “key.”
Now assume we have lost the information about which columns are foreign keys. To
rediscover them, Tracer will first call the Primary Key Discovery primitive, which
will provide the appropriate PrimaryKeys object. Then we will use the Foreign Key
Discovery’s solve function with Tables and PrimaryKeys as input, which will return
all possible foreign keys, sorted by most to least likely based on their “score.” Notice
that the example in Table 4.2 only shows the first foreign key for the output, but the
list contains four items, where the other possible foreign keys — (“user_id,” “id”),
(“user_name,” “id”) and (“user_name,” “key”) — have lower scores.
Input:
{ # Tables
    users: pd.DataFrame(user_id, user_name, ...),
    posts: pd.DataFrame(id, key, nb_posts, ...)
},
{ # PrimaryKeys
    users: user_id,
    posts: key
}

Output:
[
    {"table": users, "column": user_id, "ref_table": posts, "ref_column": key, "score": .95},
    ...
]
Table 4.2: This table shows an example of the correct input/output format for the Foreign Key Discovery primitive.
4.3.1 Design choices
Unlike the Primary Key Discovery primitive, this primitive does allow any number
of keys per table. We decided to prioritize doing so in this module because, from our
observations, most relational databases contain at least one compound foreign key,
while compound primary keys are not as widespread.
Another major decision about this API was the output format. In earlier ver-
sions of Tracer, Foreign Key Discovery generated a dictionary mapping the column
names of each foreign key to a list of tuples containing (“parent_table”, “child_table”,
“child_column”), corresponding to the matching foreign keys. While the dictionary
data structure allowed for easier access to a specific foreign key, it was a very unintu-
itive format. Changing it to a list of objects made it more approachable, while also
facilitating the sorting of the foreign keys by their scores, which simplifies the job of
the Column Map Discovery when it receives the ForeignKeys.
Finally, we need to explain our decision to generate all possible foreign keys and
their respective scores. Initially, the output of this module was solely the thresholded
foreign keys, and so no scores were provided to the next module. While this method
generated cleaner outputs, we had trouble achieving high accuracy in the Column
Map Discovery primitive, because results generated from Foreign Key Discovery were
failing. This is caused by the error compounding multiplicatively in the pipeline: if
the predicted foreign keys are incorrect, then the next module will also be incorrect,
because it was fed the wrong information. To fix this issue, we now provide all
possible foreign keys with their respective scores, which gives the next primitive more
information with which to make decisions.
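A small illustration of this design point follows: the primitive emits every candidate pair with a score, and downstream code (here just a few lines of plain Python with made-up values) decides how to sort and threshold them.

candidates = [
    {"table": "users", "column": "user_name", "ref_table": "posts", "ref_column": "id", "score": 0.10},
    {"table": "users", "column": "user_id", "ref_table": "posts", "ref_column": "key", "score": 0.95},
]

# Sort from most to least likely, then keep only high-confidence candidates.
candidates.sort(key=lambda fk: fk["score"], reverse=True)
likely_foreign_keys = [fk for fk in candidates if fk["score"] >= 0.5]
print(likely_foreign_keys)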
4.4 Column Map Discovery
This primitive type attempts to discover which columns of the dataset generated
some target column chosen by the user. It takes as input a dictionary mapping table
names to dataframes (Tables), a list of dictionaries representing the foreign keys
of the table (ForeignKeys), as well as the target column, and outputs a sorted list
of dictionaries representing the columns that generated the target (ColumnMap), as
shown in Figure 3-2. This primitive is significantly more complex than the other two,
so we will study it by analyzing four distinct examples of problems this module is
capable of solving.
4.4.1 Examples
Figure 4-3: This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that the target column “Total” was generated by adding “Cost 1” and “Cost 2”.
For the first example, consider the situation in Figure 4-3. The image shows the
Column Map Discovery learning that the lineage of the column “Total” consists of the
columns “Cost 1” and “Cost 2.” Note that it doesn’t know the exact transformation
(a combination of columns, where Total = Cost 1 + Cost 2), only that there exists
some correlation between the columns.
Figure 4-4: This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Purchase Date” was generated from “Quantity” (the number of rows for each key in “Purchase Date” corresponds to the values in “Quantity”).
Another type of transformation Column Map Discovery is capable of recognizing
is shown in Figure 4-4. In this example, the elements from column “Purchase Date”
were generated from column “Quantity,” where the number in each cell of “Quantity”
represents how many rows for each key there will be in “Purchase Date.” Note that
in this example the column “Name” serves no purpose. The primitive should be able
to figure out that there is no correlation between “Name” and “Purchase Date.”
Figure 4-5: This figure shows the Column Map Discovery primitive being applied, where the module learns that “Total” was generated by adding all values from “Cost 1” and “Cost 2” that share the same foreign keys.
The third example is shown in Figure 4-5. Similarly to the first example, “Total”
is a function of columns “Cost 1” and “Cost 2.” Unlike in example one,
multiple rows also combine, so that “Total” is generated by the summation of all
elements of both columns that share the same key. Column Map Discovery supports
this type of transformation by converting the multiple rows into one through the
Transform stage, as explained in Section 5.3.
Figure 4-6: This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Min/Max” was generated from “Prices” (the values in “Min/Max” are the minimum and maximum values from each key of “Prices”).
A last example is shown in Figure 4-6. In this situation, the elements from column
“Min/Max” were generated from column “Prices” in the first table by selecting the
smallest and largest values for each key, and then placing the maximum value below
the minimum one. This example really underscores the value of Tracer, seeing as
even very unorthodox transformations can still be discovered by the Column Map
Discovery primitive.
4.4.2 Inputs and Outputs
Consider the scenario shown in Table 4.3. In this example, there are two tables, “users”
and “posts” (where “nb_posts” is generated by combining the columns from “post1”
and “post2”), as well as a foreign key relationship between the columns “user_id” and
“key.” Now assume we lost this information. To rediscover the lineage of the target
column “nb_posts”, Tracer will first call the Primary Key Discovery primitive, which
will feed its output to the Foreign Key Discovery primitive, which then generates the
ForeignKeys object we need. Now we can simply use the Column Map Discovery’s
solve function with Tables and ForeignKeys as input to generate the lineage of
“nb_posts.”
Input:
{ # Tables
    users: pd.DataFrame(user_id, nb_posts, ...),
    posts: pd.DataFrame(key, post1, post2, ...)
},
nb_posts, # target column
[ # ForeignKeys
    {"table": users, "column": user_id, "ref_table": posts, "ref_column": key, "score": .95},
    ...
]

Output:
[
    {"table": posts, "column": post1, "score": .95},
    {"table": posts, "column": post2, "score": .9}
]
Table 4.3: This table shows an example of the correct input/output format for the Column Map Discovery primitive.
4.4.3 Design choices
One of the most important design decisions in this project involved the API of the Col-
umn Map Discovery. While the Primary and Foreign Key Discovery primitives deal
with standard relational database structures, the Column Map Discovery is unique to
Tracer, and very flexible both in implementation and functionality. This was the first
module that was implemented, and the whole architecture of Tracer was designed
around what this module would require to function.
One important decision about the module is what type of inputs it should take.
Theoretically it could work by only taking Tables as input (see Appendix A), but
the lack of row mapping (i.e. foreign key relationship) would make the model im-
practically slow, since allowing uncertainty in both row and column mappings at the
same time is an immense search space.
We dealt with this issue by providing the Column Map Discovery primitive with
the ForeignKeys object. While it is true that this creates the issue of multiplicative
error (i.e. if the Foreign Key primitive is incorrect, the Column Map will also be
incorrect, since it is acting on false data), we concluded that predicting foreign keys
is sufficiently easy and there is enough research on the topic that we will always be
able to achieve sufficiently accurate results on that module.
Another important design choice is how dependent Column Map Discovery should
be on ForeignKeys. In earlier versions of Tracer, Column Map Discovery was actually
a class of four distinct primitives, each one aimed at dealing with a different type of
foreign key. While this division facilitates the creation of primitives, we noticed that
it limited the potential for solutions that use multiple foreign key types at the same time.
Instead, we decided not to limit our framework as such, and in the current version
of Tracer there is no distinction between the types of foreign keys. This means that
any Column Map Discovery primitive must find a way to deal with the various foreign
keys. For an example of how our current implementation accomplishes this, refer to
Section 5.3.
4.5 Putting It All Together
During the fit stage, the pipeline is given a list of databases. Each “database”
contains not only Tables with raw data, but also the metadata, which includes the
foreign keys, primary keys, and constraints. Then, the execution pipeline performs
the following:
1. Fit the Primary Key Discovery module by building a training set using the
given primary keys.
2. Fit the Foreign Key Discovery module by building a training set using the given
primary and foreign keys.
Note that the models can be saved to disk and used for inference (i.e. we can train
the system on a large collection of databases, store the resulting pipeline, and provide
the pre-trained pipelines to users to apply to their databases).
During the solve stage, the pipeline takes a dictionary of tables as well as a
(table_name, column_name) tuple which specifies the target column for which we want
to trace the lineage. The execution pipeline runs the following:
1. Apply the Primary Key Discovery module to the tables and identify the primary
keys.
2. Apply the Foreign Key Discovery module to the tables and primary keys and
identify the foreign keys.
3. Apply the Column Map Discovery module to the tables, foreign keys, and target
column in order to identify the lineage of the target column.
Note that to obtain the lineage for multiple columns, the user doesn’t need to rerun
the first two steps, only the third step which takes as input the target column. This is
implemented by caching the results during pipeline execution — i.e. since the inputs
to the primary and foreign key discovery module do not change, the results will be
retrieved from the cache.
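The caching behaviour can be sketched as follows; the three solver callables are placeholders for the primitives described in Chapter 4, and the exact caching mechanism in Tracer may differ.

def make_tracer(solve_primary_keys, solve_foreign_keys, solve_column_map):
    cache = {}

    def trace(tables, target):
        key = id(tables)
        if key not in cache:
            # Steps 1-2 depend only on the tables, so run them once per table set.
            primary_keys = solve_primary_keys(tables)
            cache[key] = solve_foreign_keys(tables, primary_keys)
        # Step 3 runs for every target column the user asks about.
        return solve_column_map(tables, cache[key], target)

    return trace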
Chapter 5
Modeling Methods
In this chapter, we will delve further into the modeling algorithms used to support
each primitive. While Chapter 4 went through what primitives are and how they are
implemented, this chapter will focus on some of the algorithms Tracer uses to solve
each primitive task.
5.1 Primary Key Discovery
At a high level, the primary key discovery module works by generating a feature
vector for each column in a table, and predicting which column (if any) is most likely
to be the primary key. Note that in this version of Tracer, we do not aim to support
composite primary keys, only single-column primary keys.
The feature vector contains handcrafted features describing the column, decided
based on common patterns observed among primary keys. These features
include:
1. Ordinal position. Primary keys tend to be one of the first columns in the
table.
2. Uniqueness. Primary keys must be unique.
3. Data type. Primary keys are typically strings or integers.
4. Column name. We encode binary features for whether the name of the column
contains common terms used for primary keys such as “_id” or “_key”.
During the fit stage, the Primary Key Discovery module trains a classification
model to predict primary key columns using these features; it performs hyperparam-
eter optimization and cross-validation using the BTB library [16] to identify the best
model. During the solve stage, it uses the model and identifies the most promising
candidate (if any) for the primary key.
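As a rough sketch of what such a feature vector could look like (the feature names and exact computations here are assumptions, not necessarily the precise features Tracer extracts):

import pandas as pd

def column_features(table: pd.DataFrame, column: str) -> dict:
    values = table[column]
    return {
        "ordinal_position": list(table.columns).index(column),  # PKs tend to come first
        "uniqueness": values.nunique() / max(len(values), 1),   # PKs must be unique
        "is_int_or_str": values.dtype.kind in ("i", "u", "O"),  # typical PK data types
        "name_has_id": "_id" in column.lower() or column.lower() == "id",
        "name_has_key": "_key" in column.lower(),
    }

users = pd.DataFrame({"user_id": [1, 2, 3], "age": [31, 25, 31]})
print({col: column_features(users, col) for col in users.columns})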
5.2 Foreign Key Discovery
As discussed in [29] and [26], foreign keys typically satisfy the following six properties:
1. High cardinality.
2. Contain most of the primary keys.
3. Tend to be a primary key of few foreign keys.
4. Are not a subset of too many primary keys.
5. Share average lengths (mostly for strings).
6. Have similar column names.
By using the above properties, we came up with the following implementation to
discover foreign keys. First, the model generates all the possible pairs of columns (i.e.
for every two tables, create all the pairings between any column of one table and a
column of the other), as long as a foreign key is viable (e.g. if the data types don’t
match, then there can be no foreign key relationship, so this pairing is not created).
Then the model has to decide which pairs are actually foreign keys. It accomplishes
this by using the following handcrafted features, based on the six properties described
above:
1. Intersection. Foreign keys should have significant intersection.
2. Inclusion dependency. One of the foreign keys tends to be a subset of the
other.
3. Uniqueness. At least one of the foreign keys tends to be unique (i.e. a primary
key).
4. Column name. We encode binary features for whether the name of either
column contains common terms used for foreign keys, such as “_id” or “_key”.
5. Data type. Foreign keys are typically strings or integers.
Finally, much like for the Primary Key Discovery primitive, the fit procedure of
the Foreign Key Discovery module trains a random forest classifier to predict foreign
keys using these features; it then performs hyperparameter optimization and cross-
validation using the BTB library [16] to identify the best model. During the solve
stage, it then uses the model to identify the most likely foreign keys.
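The feature computation for a single candidate pair could look roughly like the sketch below; the feature names and the toy data are illustrative assumptions rather than Tracer's exact code.

import pandas as pd

def pair_features(col_a: pd.Series, col_b: pd.Series, name_a: str, name_b: str) -> dict:
    set_a, set_b = set(col_a.dropna()), set(col_b.dropna())
    overlap = len(set_a & set_b)
    return {
        "intersection": overlap / max(min(len(set_a), len(set_b)), 1),  # significant overlap
        "inclusion": set_a <= set_b or set_b <= set_a,                  # one side contained in the other
        "either_unique": col_a.is_unique or col_b.is_unique,            # one side is likely a primary key
        "name_hint": any(term in (name_a + name_b).lower() for term in ("_id", "_key")),
        "same_dtype": col_a.dtype == col_b.dtype,
    }

users = pd.DataFrame({"user_id": [1, 2, 3]})
posts = pd.DataFrame({"key": [1, 1, 2, 3]})
print(pair_features(users["user_id"], posts["key"], "user_id", "key"))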
5.3 Column Map Discovery
Figure 5-1: This figure demonstrates how Column Map Discovery works in three steps: (1) the model expands the columns of the dataset (e.g. “Date” becomes “Day”, “Month”, “Year”); (2) the model attempts to predict which of these columns generated the target column, assigning scores to each of them; (3) the model aggregates the columns that were expanded, selecting the largest scores as the lineage of the target.
The Column Map Discovery primitive provides the following functionality: given
a target column, it identifies the columns that the target column is derived from. At
a high level, as can be seen in Figure 5-1, it works by (1) transforming the database
into an (X, y) representation where X and y contain numerical values representing the
table and the target column, respectively, (2) identifying the relationships between X
and y, and (3) reversing the transformation to identify which of the original columns
contributed to the target column.
5.3.1 Transform
The transform step is meant to lead to more useful representations of the data. As
an example, let’s consider the situation in Figure 5-1, where one of the columns is a
date field. In this scenario, the transformation step would expand this one column
into three: day, month and year. If the target column happens to be the year field,
then this transformation has made it much easier for the model to identify that the
date field is related to the target column.
More specifically, the transform step uses reversible data transforms [4] to generate
a new set of columns, which can then be further processed in the next steps. The
functionality of this step can be further improved with the addition of new features,
such as computing the length of a string.
This procedure works for any type of foreign key relationship. If the foreign keys
are many-to-many (i.e. multiple rows of the parent table match to multiple rows of the
child table), for example, the module simply computes various aggregate transforms
(e.g. number of rows in each child table), and the end result is still an (X, y) pair
which contains numerical values. Therefore, the output is indistinguishable from the
one-to-one case, and the next step doesn’t need to be concerned with the foreign keys.
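The following sketch illustrates the flavour of this step with pandas; the column-naming scheme and the particular aggregates are assumptions for illustration, not the exact transforms Tracer applies.

import pandas as pd

child = pd.DataFrame({
    "key": [1, 1, 2],
    "date": pd.to_datetime(["2019-05-03", "2019-05-07", "2019-06-09"]),
    "cost": [10, 20, 5],
})

# Expand the datetime column into numerical components.
expanded = pd.DataFrame({
    "key": child["key"],
    "date__day": child["date"].dt.day,
    "date__month": child["date"].dt.month,
    "date__year": child["date"].dt.year,
    "cost": child["cost"],
})

# Aggregate over the foreign key so that each parent row receives one numeric row.
X = expanded.groupby("key").agg(["sum", "count", "max"])
print(X)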
5.3.2 Solver
This is the step in which the model determines which columns generated
the target. The Solver takes the (X, y) pair generated by the Transform step (i.e.
the expanded table and the corresponding target column) and assigns scores to each
column indicating how likely they are to have generated the target column. It does
so by training a random forest regressor to identify the feature importance of each
column.
To more formally explain why the above procedure works, let us consider the
following situation. If the input table contains two columns 𝐴 and 𝐵 and only column
𝐴 contributed to some output 𝑌 , then given sufficient data and assuming that 𝑌 is not
conditionally independent of 𝐴 given 𝐵 (e.g. the simplest case where this assumption
is violated would be if 𝐴 and 𝐵 are identical/redundant), then a model with sparsity
regularization will only use column 𝐴. Also, if the transformations are deterministic
(e.g. there is no randomness in how the output column is generated) and we have
enough data, then a universal function estimator (e.g. deep learning) will achieve
perfect accuracy on the training set.
It is important to observe that in this context overfitting isn't a problem; it's a
feature. To identify the set of input columns that contributed to the output column,
we simply need to solve:
argmin_S [ min_F ||F(X[S]) − y||^2 ] + λ ||S||_0        (5.1)
where 𝑆 is a subset of columns and 𝐹 is an arbitrary machine learning model.
Intuitively, this corresponds to identifying the smallest set of variables that results in
the best prediction accuracy.
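A minimal sketch of the random-forest-importance approach described in this section is shown below, on made-up data; the column names and sample size are illustrative.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "cost_1": rng.integers(0, 100, size=200),
    "cost_2": rng.integers(0, 100, size=200),
    "noise":  rng.integers(0, 100, size=200),
})
y = X["cost_1"] + X["cost_2"]   # the hidden transformation the solver must recover

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))
# "cost_1" and "cost_2" should receive most of the importance; "noise" should be near zero.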
5.3.3 Inverse Transform
Finally, now that we have identified how useful each column is for generating the
target column, we can reverse our initial transform. For example, if a date field
was transformed into three columns and the columns are assigned importance scores
(0.0, 0.1, 0.4), then we would aggregate it into a total importance score of 0.5 and
assign it to the original date field. We explored two different ways of aggregating
the importance scores — adding them and taking the max over them — and found
addition to work better and more consistently. This results in a score for each column
which indicates how likely it is to have contributed to the target column.
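A few lines suffice to illustrate the aggregation; the double-underscore naming convention that links an expanded column back to its original column is an assumption made for this sketch.

from collections import defaultdict

expanded_scores = {"date__day": 0.0, "date__month": 0.1, "date__year": 0.4, "cost": 0.5}

original_scores = defaultdict(float)
for name, score in expanded_scores.items():
    original = name.split("__")[0]       # map "date__year" back to "date"
    original_scores[original] += score   # sum the scores of the expanded columns

print(dict(original_scores))             # {"date": 0.5, "cost": 0.5}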
Chapter 6
Datasets
To evaluate the performance of our data lineage system, we compiled a suite of rela-
tional datasets from [24]. This repository contains 73 multi-table relational datasets
covering a diverse range of applications, from molecular chemistry to supply chain
management; it also includes standard benchmark datasets that have been used in
related work on data processing systems [5].
movies2actors: movieid (mediumint), actorid (mediumint), cast_num (int)
movies2directors: movieid (mediumint), directorid (mediumint), genre (varchar)
u2base: userid (int), movieid (mediumint), rating (varchar)
directors: directorid (mediumint), d_quality (int), avg_revenue (int)
movies: movieid (mediumint), year (int), isEnglish (enum), country (varchar), runningtime (int)
users: userid (int), age (varchar), u_gender (varchar), occupation (varchar)
actors: actorid (mediumint), a_gender (enum), a_quality (int)
Figure 6-1: The schema for the MovieLens dataset [24]: each table is listed with its columns and the data type of each column, and the tables are related through foreign key relationships.
Each dataset consists of multiple tables which are related through various primary
and foreign key relationships. For example, the schema for the MovieLens dataset is
shown in Figure 6-1. We store these datasets (and the associated metadata) using the
Metadata.JSON format proposed in [3]. We can then use these datasets to evaluate
the performance of our Primary and Foreign Key Discovery modules. For further
information about these datasets, please consult Appendix B.
However, these datasets are not sufficient to evaluate the Column Map Discovery
primitive. To assess this primitive, we developed a system to transform the relational
datasets and generate new “derived” columns where the column mapping is known.
Then, we can use this augmented dataset to evaluate the performance of the Column
Map Discovery primitive.
6.1 Column Mapping Simulation
The DataReactor project aims to take a relational dataset and intelligently generate
derived columns in order to aid in the evaluation of the Tracer library and guide
the design of Metadata.JSON [3]. These derived columns are a function of the other
columns in their table as well as the associated rows in the parent and child tables.
DataReactor is implemented as an independent Python package which can be used
as follows:
from datareactor import DataReactor

reactor = DataReactor()
reactor.transform(
    source="/path/to/dataset",
    destination="/path/to/dataset_new"
)
Figure 6-2: A basic example of how to use DataReactor.
The transform operation takes in a path to a dataset which is stored in a
Metadata.JSON-compatible format and produces a new Metadata.JSON-compatible dataset
where:
1. It has added derived columns to the tables which have useful statistical prop-
erties.
2. It has updated the metadata object with the lineage of these derived columns.
For example, suppose we have a simple database containing users and transac-
tions. One of the derived columns produced by DataReactor might be an additional
column in the users table which stores the number of transactions (i.e. the number
of corresponding rows in the transactions table for each user). Other examples of
derived columns produced by DataReactor include summary statistics (e.g. the aver-
age amount spent per transaction for each user), transforms on datetime fields (e.g.
extracting the day of the week), and transforms on string fields (e.g. the length of the
string).
6.1.1 Software Design
At a high level, DataReactor performs the following operations:
1. Load the tables and metadata.
2. Generate derived columns for each table.
(a) This functionality is provided by the DataReactor.Atoms sub-module.
(b) Each Atom is responsible for generating a type of derived column. The
Atom API is easily extensible, allowing us to quickly add new transforms
in the future.
(c) By randomly applying these Atoms to different tables, we can generate a
wide range of derived columns.
3. Prune the derived columns to remove meaningless values.
(a) This functionality is provided by the DataReactor.Sieve sub-module.
(b) Each Sieve provides a different strategy for filtering derived columns, from
more machine learning-oriented approaches to feature selection to simpler
heuristics such as removing duplicate and constant-valued columns.
4. Write the expanded tables and updated metadata to disk.
We describe the modules responsible for steps 2 and 3 in greater detail in the
following sections; the data structures defined in DataReactor are described in Ta-
ble 6.1.
Class Name Description
Dataset The Dataset object is responsible for reading and writing datasets. It manages both the metadata and the tables. It also provides the ability to add derived columns to a dataset (by updating the metadata and target table).
DerivedColumn The DerivedColumn object represents a derived column and contains (1) the name of the table it belongs to, (2) the actual values in the column, (3) information about the name and type of the column, and (4) a constraint object indicating the lineage of the column.
Table 6.1: Description of the Dataset and DerivedColumn data structures.
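To make this structure concrete, the sketch below shows one possible shape for a DerivedColumn object; the field names and example values are illustrative and do not necessarily match the actual DataReactor implementation.

from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class DerivedColumnSketch:
    # (1) the name of the table the column belongs to
    table: str
    # (2) the actual values in the column
    values: List[Any]
    # (3) information about the name and type of the column
    field: Dict[str, str]
    # (4) a constraint object indicating the lineage of the column
    constraint: Dict[str, Any]

column = DerivedColumnSketch(
    table="users",
    values=[3, 0, 7],
    field={"name": "nb_rows_in_transactions", "type": "integer"},
    constraint={"fields_of_origin": ["transactions"], "transform": "nb_rows"},
)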
6.1.2 Atoms
In this section, we introduce the atom module which implements the logic for generat-
ing derived columns. Each atom inherits from the Atom base class (see Appendix C)
and is required to implement a derive method which takes as input a dataset and tar-
get table and returns a sequence of DerivedColumn objects containing new columns
for that table.
One simple Atom would be the RowCountAtom which creates a derived column
which counts the number of rows in the child table. For example, suppose there is a
foreign key relationship between users and transactions. This atom would propose a
new derived column for the users table which contains the number of transactions for
each user.
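As an illustration only, a RowCountAtom-style derive step might look roughly like the sketch below. The Dataset attributes (tables, foreign_keys), the foreign key dictionary layout, and the dictionary-based output are assumptions made for the example, not the actual DataReactor API; the tables are assumed to be pandas DataFrames.

class RowCountAtomSketch:
    """Illustrative only: counts, for each key in the parent table, how many
    rows in a child table reference it through a foreign key."""

    def derive(self, dataset, table_name):
        derived = []
        for fk in dataset.foreign_keys:
            if fk["parent_table"] != table_name:
                continue
            parent = dataset.tables[table_name]        # pandas DataFrame
            child = dataset.tables[fk["child_table"]]  # pandas DataFrame
            # Count how many child rows reference each parent key.
            counts = child[fk["child_column"]].value_counts()
            values = parent[fk["parent_column"]].map(counts).fillna(0).astype(int)
            derived.append({
                "table": table_name,
                "name": "nb_rows_in_" + fk["child_table"],
                "values": values.tolist(),
            })
        return derived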
Another example of an Atom is the FeatureToolsAtom which uses the FeatureTools
[20] library to generate features, identifies the lineage of the features, and converts
them to a derived column format. By integrating with FeatureTools, we are able
to generate a broad variety of derived columns, from data type-specific transforms
(e.g. converting datetime strings into time zones) to more general aggregate
transforms (e.g. identifying the mode of a categorical column).
6.1.3 Sieve
Finally, the sieve module filters the derived columns. The generative process can
result in columns that are constant-valued — for example, a transform that extracts
the year from a datetime field could always return the same result if all the data
comes from the same year — or even columns that are duplicates of one another.
The sieve module applies filters to remove these edge cases and produce a final set
of derived columns for analysis.
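A minimal sketch of the simpler heuristics (constant-value and duplicate removal) is shown below, assuming the derived columns have been collected into a pandas DataFrame; the machine learning-oriented feature selection strategies are not shown.

import pandas as pd

def sieve_derived_columns(derived: pd.DataFrame) -> pd.DataFrame:
    """Drop constant-valued columns, then drop exact duplicate columns."""
    # Remove constant-valued columns (e.g. a "year" column when every row
    # comes from the same year).
    non_constant = derived.loc[:, derived.nunique(dropna=False) > 1]
    # Remove duplicate columns by de-duplicating the transposed frame.
    return non_constant.T.drop_duplicates().T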
Chapter 7
Experiments
In this chapter, we test each of our primitives using the datasets described in
Chapter 6.
7.1 Primary Key
7.1.1 Evaluation
As discussed in Chapter 4, this primitive is designed to discover all the primary keys
of a dataset. To assess how successful it is at achieving this goal, we will use the
following evaluation method: (1) disregard all tables that contain composite keys or
no keys at all, seeing as our current implementation returns single-column keys; (2)
on this restricted dataset, check what percentage of the predicted keys are actually
correct; (3) return the accuracy.
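A compact sketch of this evaluation follows, assuming the predicted and ground-truth keys are available as dictionaries keyed by table name (an interface assumed for illustration, not the actual one).

def primary_key_accuracy(predicted, ground_truth):
    """predicted, ground_truth: dicts mapping table name -> primary key column.

    Tables whose true key is composite (a list/tuple of columns) or missing
    (None) are disregarded, since the current implementation only returns
    single-column keys.
    """
    evaluated = [table for table, key in ground_truth.items()
                 if key is not None and not isinstance(key, (list, tuple))]
    if not evaluated:
        return None
    correct = sum(predicted.get(table) == ground_truth[table] for table in evaluated)
    return correct / len(evaluated)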
7.1.2 Experiments
Primary keys generally follow very strict rules, as discussed in Section 5.1. These
constraints make the identification of primary keys fairly trivial, and so any reasonable
model should achieve near-perfect accuracy on most datasets. Indeed, as can be seen
in Table 7.1, our model achieves near perfect accuracy under the current evaluation
procedure.
Dataset              Tables   Accuracy   Accuracy w/o Names   Accuracy w/o Ordinals
AustralianFootball      4      1          1                    1
Biodegradability        5      1          1                    1
Dunur                  17      1          1                    1
Elti                   11      1          1                    1
Hepatitis_std           7      0.25       0.25                 1
Mesh                   29      1          1                    1
Mooney_Family          68      1          1                    1
NBA                     4      1          1                    1
PremierLeague           4      1          1                    1
Same_gen                4      1          1                    1
Toxicology              4      1          1                    1
UW_std                  4      1          1                    1
Walmart                 4      1          1                    1
classicmodels           8      1          1                    1
cs                      8      1          1                    1
imdb_MovieLens          7      1          1                    1
pubs                   11      1          1                    1
university              5      1          1                    1
Table 7.1: Number of tables per dataset, as well as the accuracy of our implementation of the Primary Key Discovery primitive in three situations: (1) normally, (2) without utilizing the column names, and (3) without utilizing the position of the columns.
To test whether our model was actually looking at the data rather than overly
depending on the metadata (seeing as around 40% of the column names contain
indicators of the nature of the column, such as id or key), we removed the column
names from the previous datasets and tested again. As we can see in Table 7.1, our
results are invariant, with only one poor performance in the eighteen datasets. This
outcome is positive, since it seems to indicate our model is not overly reliant on the
metadata.
We also tested how well our model performs without the ordinal feature (i.e.
without knowing the order of the columns). As the table shows, the model
achieved perfect performance under this circumstance, which is a little surprising
considering it outperforms the original model. Upon further investigation, the issue
seems to be that for almost all training samples the primary key is the first column,
which causes the model to overestimate the importance of this feature.
7.2 Foreign Key
7.2.1 Evaluation
As discussed in Section 4.3, the Foreign Key Discovery primitive attempts to discover
all the foreign keys of a dataset. As with the Primary Key Discovery primitive, we
have to design a metric to evaluate how successful the model is. In this case, though,
accuracy is not a reasonable choice of evaluation, as some tables may contain multiple
foreign keys while others have none, which wouldn’t be properly captured by such a
simple metric. Instead, we will use precision, recall and F-measure for evaluation, as
recommended in [29].
We compute these values as follows. First, recall that the Foreign Key Discovery
primitive returns a sorted list of foreign keys, from most to least likely. Therefore
we can select the first X% most likely foreign keys which our algorithm predicted,
for various values of X. Then we compute the precision, recall and F-measure of our
method for each value of X, and finally select the best F-measure achieved, as well as
the corresponding value for X.
To clarify this idea, let us go through the example in Table 7.2. Imagine a dataset
containing three foreign key mappings, A, B and C. Now imagine the Foreign Key
Discovery module predicts the following possible foreign keys, sorted from most to
least likely: A, D, B, C, E. To evaluate the performance of the model, we will select
the top 20% of our predictions (i.e. we believe A is the only foreign key
relationship in the dataset) and compute the recall, precision and F-measure for it.
Predicted   X      Recall   Precision   F-Measure
A           20%    0.33     1.00        0.50
D           40%    0.33     0.50        0.40
B           60%    0.66     0.66        0.66
C           80%    1.00     0.75        0.86
E           100%   1.00     0.60        0.80
Table 7.2: Example of the Foreign Key Discovery evaluation method, when the ground truth is A, B, C.
We then repeat this process for multiple values of X, and select the best-performing
result, which in this example is an F-measure of 0.86 at X = 80%.
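The sketch below illustrates this sweep over X. The ranked predictions and ground truth are passed in as plain lists, which is an assumption made for the example rather than the actual interface of the primitive.

def best_f_measure(ranked_predictions, ground_truth):
    """Sweep over the top-X% cut-offs of the ranked predictions and return the
    best (f_measure, precision, recall, x) tuple."""
    ground_truth = set(ground_truth)
    best = (0.0, 0.0, 0.0, 0.0)
    for k in range(1, len(ranked_predictions) + 1):
        selected = set(ranked_predictions[:k])
        true_positives = len(selected & ground_truth)
        precision = true_positives / len(selected)
        recall = true_positives / len(ground_truth)
        if precision + recall == 0:
            continue
        f_measure = 2 * precision * recall / (precision + recall)
        if f_measure > best[0]:
            best = (f_measure, precision, recall, k / len(ranked_predictions))
    return best

# The worked example from Table 7.2:
print(best_f_measure(["A", "D", "B", "C", "E"], ["A", "B", "C"]))
# -> roughly (0.86, 0.75, 1.0, 0.8)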
7.2.2 Experiment
Dataset              Foreign Keys   F-Measure   Precision   Recall
AustralianFootball         5          1.00        1.00       1.00
Biodegradability           5          1.00        1.00       1.00
Dunur                     32          1.00        1.00       1.00
Elti                      20          1.00        1.00       1.00
Hepatitis_std              6          1.00        1.00       1.00
Mesh                      33          0.43        0.77       0.30
Mooney_Family            134          0.70        0.61       0.83
NBA                        5          0.83        0.71       1.00
PremierLeague              5          0.75        1.00       0.60
Same_gen                   6          1.00        1.00       1.00
Toxicology                 5          0.75        1.00       0.60
UW_std                     4          0.86        1.00       0.75
Walmart                    3          1.00        1.00       1.00
classicmodels              8          0.73        1.00       0.57
cs                        10          0.74        0.78       0.70
imdb_MovieLens             6          1.00        1.00       1.00
pubs                      10          0.82        0.75       0.90
university                 4          1.00        1.00       1.00
Table 7.3: Number of foreign keys per dataset, as well as the F-measure, precision and recall of our implementation of the Foreign Key Discovery primitive when tested against eighteen datasets, using the evaluation method described in Section 7.2.1.
Foreign keys are allowed much more flexibility than primary keys. Each table can
have multiple foreign keys or none at all, the column names are much less indicative than
for primary keys, and their overall structure is more complex. As such, discovering
foreign keys is significantly harder than discovering primary keys. Our results for this task,
when using the evaluation method previously discussed, can be seen in Table 7.3.
Overall, the model achieves an average F-measure of 0.87, an average precision of
0.92, and an average recall of 0.85. One thing to note is that the precision and recall are
fairly similar to each other, which suggests we chose appropriate values for X, seeing
as, in general, a higher X increases recall and decreases precision, while a lower X
does the opposite. A balance between the two values is desirable given the harmonic
nature of the F-measure.
When comparing our results in Table 7.3 with the structure of the datasets in
Table B.1, we notice they are highly correlated. For example, if we limit our results
to the 13 datasets containing fewer than 10 tables, our average F-measure, precision
and recall become 0.90, 0.96, and 0.86, respectively, which is a measurable improvement.
This superior result on smaller datasets implies an inferior result on the more complex
ones, something we will continue working to improve in future versions of Tracer.
7.3 Column Map
7.3.1 Evaluation
For the final primitive, the goal is to trace all the columns from the dataset that
helped generate some target column. In order to evaluate the success rate of this
module we will use a fixed threshold value of 0. This means we select all column
maps predicted by our algorithm with feature importance larger than 0. Then we
simply compute the precision, recall and F-measure and report our results.
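A small sketch of this scoring step is given below, assuming the predicted importances and the known lineage are available as a dictionary and a list; these interfaces are illustrative, not the actual ones used by Tracer.

def column_map_scores(importances, true_sources, threshold=0.0):
    """importances: dict mapping candidate source column -> feature importance.
    true_sources: the columns that actually generated the target column."""
    predicted = {column for column, score in importances.items() if score > threshold}
    true_sources = set(true_sources)
    true_positives = len(predicted & true_sources)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_sources) if true_sources else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return precision, recall, f_measure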
7.3.2 Experiments
To evaluate the Column Map Discovery primitive we will need to utilize DataReactor.
As explained in Chapter 6, DataReactor takes a relational dataset and intelligently
generates derived columns. These derived columns are a function of other columns in
their respective datasets, so we can use them to test Column Map Discovery. For the
following experiment, DataReactor was used to generate two types of target columns:
“nb_rows” which is based on foreign key relationships and corresponds to the number
of rows in the child table for each of the keys in the parent table, and “add_numerical”
which corresponds to a random linear combination of the numerical columns in the
target table.
Target columns generated by DataReactor follow a specific naming convention.
To understand it, consider an example where DataReactor generated a target col-
umn called “molecule.nb_rows_in_atom”. This means the column lives in the “molecule”
table and was derived from the “atom” table through the transformation “nb_rows”. This transforma-
tion counts the number of rows in the “atom” table which belong to each row of the “molecule”
table, where the “atom” table is related to the “molecule” table through a foreign key
relationship.
We test the Column Map Discovery primitive by using eleven datasets from the
Relational Dataset Repository [24], with target columns generated by DataReactor
and evaluation as described in Section 7.3.1. Our full results are reported in Table C.1.
We also aggregate the results by dataset (Table 7.4) and by transformation (Table
7.5). Notice that precision and recall are averaged over the individual target columns,
while the F-measure is computed from these averaged values.
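Concretely, letting $\bar{P}$ and $\bar{R}$ denote the precision and recall averaged over the individual target columns, the reported F-measure is the harmonic mean

$$F = \frac{2\,\bar{P}\,\bar{R}}{\bar{P} + \bar{R}}.$$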
Dataset              F-Measure   Precision   Recall
AustralianFootball     0.69        0.61       0.79
Biodegradability       0.60        0.43       1.00
Dunur                  0.29        0.17       1.00
Elti                   0.31        0.18       1.00
Hepatitis_std          0.61        0.51       0.77
Mesh                   0.62        0.45       1.00
Mooney_Family          0.05        0.03       1.00
NBA                    0.23        0.13       0.93
Same_gen               0.50        0.34       1.00
Toxicology             0.35        0.21       1.00
UW_std                 0.67        0.53       0.90
Table 7.4: This table shows the performance of our current implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for each dataset.
Transformation   F-Measure   Precision   Recall
add_numerical      0.78        0.76       0.79
nb_rows            0.38        0.23       1.00
Table 7.5: This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for the two transformations generated by DataReactor.
As shown in Table 7.4, our current implementation of the Column Map Discovery
primitive achieves an average F-measure of 0.45, an average precision of 0.33, and an
average recall of 0.94. Given the extensibility of our system, we expect these numbers
to improve as additional transformations/primitives are implemented in future versions
of Tracer.
Furthermore, we note that the F-measure is only weakly correlated with the real-
world utility provided by our system. With our system, a user can look at the top
X% of the column maps suggested by our model to quickly figure out the lineage of
the target column. This is a significant improvement over manually digging through
thousands of combinations of columns.
Chapter 8
Conclusion
In this thesis we proposed Tracer, a novel machine learning approach to data lineage
discovery. We presented a framework that accomplishes end-to-end lineage tracing by
separating the tracing process into its primitive components.
While the work we have accomplished is extensive, there is much that can be
improved. At present, we mostly assume columns and rows to be completely separate
from each other, while in reality the lineage of each item in a table is highly correlated
with that of other items. Rewriting our primitives to exploit this fact should speed
up the process considerably.
Tracer also does not incorporate past lineage searches into future computations. Given
that the data within a company shares many of the same patterns, a specific module
to aggregate and analyze past searches could be an interesting line of research.
Another topic that needs further work is the addition of new primitives. Cur-
rently, Tracer only contains the most basic primitive modules, and drawing on current
and future research to write new and more sophisticated primitives is essential
for Tracer to achieve its full potential. One simple example is the Primary Key
Discovery module: currently, Tracer does not support composite keys, even though
multi-column primary keys are commonly found in relational datasets.
Better usage of metadata and data annotations would also improve Tracer. While
Tracer does make use of column names for key discovery, the Column Map Discov-
ery module does not. Further exploration of how we can leverage natural language
processing techniques for this problem is essential for obtaining better results.
Another potential line of research centers on multi-parent tables. For example,
suppose we have a users table, a locations table, and a companies table. Each user
works at a company and has a location. Currently, Tracer computes the lineage
for these tables separately: first users/locations, then users/companies. Taking both
tables into consideration at the same time could allow for more accurate results, and
should be implemented in a future version of the project.
Finally, another aspect that could be further improved is how we handle loops
in datasets. Taking the example from above, with a users table, a locations table,
and a companies table, it is possible that each user works at a company, each user has a
location, and each company has a location. This forms a loop, which is a situation we
have not considered sufficiently, and may require further research.
By adding these additional features to the Tracer project, we can further enhance
the system and enable much more general tracing than provided by alternatives.
Appendix A
Column Map without Foreign Keys
Figure A-1: This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that “Total” was generated by adding “Cost 1” and “Cost 2” without knowing the foreign keys.
In this Appendix we describe an alternate algorithm for the Column Map Dis-
covery primitive which discovers the ColumnMap without using ForeignKeys. Notice
that ForeignKeys is still provided, since it is the standard input of the primitive,
but it is not used. This module is useful in situations where the foreign keys can-
not be trusted, so we need to derive ColumnMap directly from Tables. Figure A-1
demonstrates a situation where this module might be useful. This method is purely
theoretical and is not currently implemented in Tracer.
To solve this problem, we propose an iterative procedure which alternates between
(1) learning a distribution over row mappings based on the optimal column mapping
and (2) learning the optimal column mapping based on the distribution over row
mappings. Under weak assumptions about the types of transformations that can be
applied to the data, we can show that, after each iteration, the distribution over the
row mapping converges toward the true row mapping.
At a high level, the iterative estimation procedure works as follows:
1. Sample from a distribution over row mappings.
2. Assuming the row mapping is correct, find the optimal column mapping and
find the loss.
3. If the loss is lower than average, update the distribution over row mappings to
increase the probability of each (input, output) pair in the mapping.
4. Otherwise, update the distribution over row mappings to decrease the proba-
bility of each (in, out) pair in the mapping.
5. Repeat.
In order for this procedure to work, we need to make a few assumptions. Primarily,
we need to assume that the loss associated with the optimal column mapping is a
decreasing function of the number of correctly matched rows. In other words, if our
row mapping has more correct matches, then our model will be more successful at
learning the relationship between the inputs and outputs.
Formally, we can use a regularization-based approach to solve the problem, where
our goal is to find the row mapping $M$ and subset of columns $S$ which minimize the
loss

$$\operatorname*{arg\,min}_{M,\,S}\;\Big[\min_{F}\,\big\lVert F(X[S]) - M(y)\big\rVert^{2}\Big] \;+\; \lambda\,\lVert S\rVert_{0} \tag{A.1}$$

where $X$ is the input table, $y$ is the target column, $S$ is a subset of the columns of $X$, and $F$ is an arbitrary machine learning model.
The biggest obstacle to implementing this procedure is figuring out how to effi-
ciently sample from and iteratively update the distribution over row mappings. In
general, exact solutions to the bipartite matching problem (e.g. finding the maximum
likelihood matching between two sets of rows) are intractable, since given an input and output
table with $N$ rows there are $N!$ possible row mappings. However, given a solution
for sampling from and iteratively updating a distribution over row mappings [27], we
can then concurrently estimate both the row mappings and the column mappings for
a pair of tables, allowing us to identify the set of input cells which contributed to a
particular output cell.
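For illustration only, a deliberately simplified sketch of this loop is given below; it is not implemented in Tracer. The least-squares fit stands in for the arbitrary model $F$ of Equation A.1, the lam term plays the role of the $\lambda\lVert S\rVert_0$ regularizer, and the independent per-row sampling is a simplification of the permutation sampling scheme of [27].

import numpy as np

def estimate_row_mapping(X, y, candidate_subsets, iterations=1000, step=0.05, lam=0.1):
    """X: (n x d) numpy array of input columns; y: length-n target column;
    candidate_subsets: candidate column subsets (lists of column indices)."""
    n = len(y)
    P = np.full((n, n), 1.0 / n)  # distribution over row mappings
    losses = []
    for _ in range(iterations):
        # 1. Sample a row mapping (independent per-row sampling; a proper
        #    scheme would sample permutations as in [27]).
        mapping = [np.random.choice(n, p=P[i] / P[i].sum()) for i in range(n)]
        permuted_y = y[mapping]
        # 2. Find the best column subset under this mapping.
        loss = min(
            float(np.linalg.lstsq(X[:, list(s)], permuted_y, rcond=None)[1].sum())
            + lam * len(s)
            for s in candidate_subsets
        )
        losses.append(loss)
        # 3./4. Reinforce pairs from better-than-average samples, penalize others.
        factor = 1.0 + step if loss <= np.mean(losses) else 1.0 - step
        for i, j in enumerate(mapping):
            P[i, j] *= factor
        P /= P.sum(axis=1, keepdims=True)  # renormalize each row
    return P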
Appendix B
Relational Datasets
Dataset Name Size (MB) Num of Tables Temporal? Numeric? String?
Accidents 234.5 3 Yes Yes Yes
AdventureWorks 233.8 71 Yes Yes Yes
Airline 464.5 19 Yes Yes Yes
Atherosclerosis 3.2 4 No Yes Yes
AustralianFootball 34.1 4 Yes Yes Yes
BasketballMen 18.3 9 No Yes Yes
BasketballWomen 2.3 8 No Yes Yes
Biodegradability 3.7 5 No Yes Yes
Bupa .3 9 No Yes Yes
Carcinogenesis 21 6 No Yes Yes
CCS 15.9 6 Yes Yes Yes
Chess .3 2 Yes Yes Yes
ClassicModels .5 8 Yes Yes Yes
CORA 4.6 3 No Yes Yes
Countries 8.6 4 No Yes Yes
Credit 317.9 9 Yes Yes Yes
CS .3 8 Yes Yes Yes
DCG .3 2 No Yes Yes
Dunur .8 17 No Yes Yes
Elti .6 11 No Yes Yes
Employee 197.4 6 Yes Yes Yes
Facebook 2 2 No Yes No
Financial 78.8 8 Yes Yes Yes
FNHK 129.8 3 Yes Yes Yes
FTP 7.5 2 Yes Yes Yes
Geneea 61.4 19 Yes Yes Yes
Genes 1.8 3 No Yes Yes
Hepatitis 2.2 7 No Yes Yes
Hockey 15.6 22 No Yes Yes
IMDb 477.1 7 No Yes Yes
KRK .1 1 No Yes Yes
Lahman 74.1 25 No Yes Yes
LegalActs 240.2 5 Yes Yes Yes
Mesh 1.1 29 No No Yes
Mondial 3.2 40 Yes Yes Yes
MooneyFamily 3.2 68 No No Yes
MovieLens 154.9 7 No Yes Yes
Musk .4 2 No Yes Yes
Mutagenesis .9 3 No Yes Yes
Nations 1.2 3 No Yes Yes
NBA .3 4 Yes Yes Yes
NCAA 35.8 9 Yes Yes Yes
Northwind 1.1 29 Yes Yes Yes
Pima .7 9 No Yes Yes
PremiereLeague 11.3 4 Yes Yes Yes
PTC 8.1 4 No No Yes
PTE 4.4 38 No Yes Yes
Pubs .4 11 Yes Yes Yes
Pyrimidine ? 2 No Yes No
Restbase 3 3 No Yes Yes
Sakila 6.5 16 Yes Yes Yes
SalesDB 584.3 4 No Yes Yes
SameGen .3 4 No Yes Yes
SAP 246.9 5 Yes Yes Yes
SAT 2.7 37 No Yes Yes
Seznam 146.8 4 Yes Yes Yes
Stats 658.4 8 Yes Yes Yes
StudentLoan .9 10 No Yes Yes
Thrombosis 1.9 3 Yes Yes Yes
TPCC 165.8 9 Yes Yes Yes
TPCD 2500 8 Yes Yes Yes
TPCDS 4800 24 Yes Yes Yes
TPCH 2000 8 Yes Yes Yes
Trains .1 2 No Yes Yes
Triazine .2 2 No Yes No
University .3 5 No Yes Yes
UTube .2 2 No Yes Yes
UW-CSE .2 4 No Yes Yes
VisualGenome 198.1 6 No Yes Yes
VOC 4.7 8 Yes Yes Yes
Walmart 167.3 4 Yes Yes Yes
WebKP 12.8 3 No No Yes
World .7 3 No Yes Yes
Table B.1: This table provides some details regarding the 73 relational datasets from
[24].
Appendix C
Column Map Discovery Performance
Dataset Target Column F1 Prec. Rec.
AustralianFootball matches.add_numerical 0.75 1.00 0.60
AustralianFootball players.add_numerical 0.50 1.00 0.33
AustralianFootball players.nb_rows_in_match_stats 1.00 1.00 1.00
AustralianFootball teams.nb_rows_in_match_stats 0.03 0.02 1.00
AustralianFootball teams.nb_rows_in_matches 0.03 0.02 1.00
Biodegradability atom.nb_rows_in_bond 0.40 0.25 1.00
Biodegradability atom.nb_rows_in_gmember 0.67 0.50 1.00
Biodegradability bond.add_numerical 1.00 1.00 1.00
Biodegradability group.nb_rows_in_gmember 0.67 0.50 1.00
Biodegradability molecule.add_numerical 0.36 0.22 1.00
Biodegradability molecule.nb_rows_in_atom 0.20 0.11 1.00
Dunur person.nb_rows_in_aunt 0.15 0.08 1.00
Dunur person.nb_rows_in_brother 0.22 0.13 1.00
Dunur person.nb_rows_in_daughter 0.17 0.09 1.00
Dunur person.nb_rows_in_dunur 0.25 0.14 1.00
Dunur person.nb_rows_in_father 0.33 0.20 1.00
Dunur person.nb_rows_in_husband 0.29 0.17 1.00
Dunur person.nb_rows_in_husband2 0.22 0.13 1.00
Dunur person.nb_rows_in_mother 0.33 0.20 1.00
Dunur person.nb_rows_in_nephew 0.14 0.08 1.00
Dunur person.nb_rows_in_niece 0.18 0.10 1.00
Dunur person.nb_rows_in_sister 0.20 0.11 1.00
Dunur person.nb_rows_in_son 0.18 0.10 1.00
Dunur person.nb_rows_in_target 0.04 0.02 1.00
Dunur person.nb_rows_in_uncle 0.18 0.10 1.00
Dunur person.nb_rows_in_wife 0.22 0.13 1.00
Dunur person.nb_rows_in_wife2 0.20 0.11 1.00
Dunur target.add_numerical 1.00 1.00 1.00
Elti person.nb_rows_in_brother 0.11 0.06 1.00
Elti person.nb_rows_in_daughter 0.33 0.20 1.00
Elti person.nb_rows_in_elti 0.13 0.07 1.00
Elti person.nb_rows_in_father 0.12 0.06 1.00
Elti person.nb_rows_in_husband 0.25 0.14 1.00
Elti person.nb_rows_in_mother 0.09 0.05 1.00
Elti person.nb_rows_in_sister 0.12 0.06 1.00
Elti person.nb_rows_in_son 0.33 0.20 1.00
Elti person.nb_rows_in_target 0.10 0.05 1.00
Elti person.nb_rows_in_wife 0.25 0.14 1.00
Elti target.add_numerical 1.00 1.00 1.00
Hepatitis_std Bio.add_numerical 0.46 0.30 1.00
Hepatitis_std Bio.nb_rows_in_rel11 0.18 0.10 1.00
Hepatitis_std dispat.add_numerical 0.25 0.25 0.25
Hepatitis_std dispat.nb_rows_in_rel11 0.25 0.14 1.00
Hepatitis_std dispat.nb_rows_in_rel12 0.13 0.07 1.00
Hepatitis_std dispat.nb_rows_in_rel13 0.40 0.25 1.00
Hepatitis_std indis.add_numerical 0.33 0.50 0.25
Hepatitis_std inf.add_numerical 0.67 1.00 0.50
Hepatitis_std rel11.add_numerical 0.67 1.00 0.50
Hepatitis_std rel12.add_numerical 1.00 1.00 1.00
Hepatitis_std rel13.add_numerical 1.00 1.00 1.00
Mesh element.nb_rows_in_circuit 0.67 0.50 1.00
Mesh element.nb_rows_in_circuit_hole 0.50 0.33 1.00
Mesh element.nb_rows_in_cont_loaded 0.67 0.50 1.00
Mesh element.nb_rows_in_equal 0.50 0.33 1.00
Mesh element.nb_rows_in_fixed 0.67 0.50 1.00
Mesh element.nb_rows_in_free 0.67 0.50 1.00
Mesh element.nb_rows_in_half_circuit 0.67 0.50 1.00
Mesh element.nb_rows_in_half_circuit_hole 0.67 0.50 1.00
Mesh element.nb_rows_in_llong 0.50 0.33 1.00
Mesh element.nb_rows_in_long_for_hole 0.40 0.25 1.00
Mesh element.nb_rows_in_mesh 0.13 0.07 1.00
Mesh element.nb_rows_in_mesh_test 0.13 0.07 1.00
Mesh element.nb_rows_in_mesh_test_Neg 0.14 0.08 1.00
Mesh element.nb_rows_in_neighbour_xy 0.67 0.50 1.00
Mesh element.nb_rows_in_neighbour_yz 0.50 0.33 1.00
Mesh element.nb_rows_in_neighbour_zx 0.67 0.50 1.00
Mesh element.nb_rows_in_noload 0.67 0.50 1.00
Mesh element.nb_rows_in_notimportant 0.67 0.50 1.00
Mesh element.nb_rows_in_one_side_fixed 0.67 0.50 1.00
Mesh element.nb_rows_in_one_side_loaded 0.67 0.50 1.00
Mesh element.nb_rows_in_opposite 0.67 0.50 1.00
Mesh element.nb_rows_in_quarter_circuit 0.50 0.33 1.00
Mesh element.nb_rows_in_short_for_hole 0.67 0.50 1.00
Mesh element.nb_rows_in_sshort 0.67 0.50 1.00
Mesh element.nb_rows_in_two_side_fixed 0.67 0.50 1.00
Mesh element.nb_rows_in_two_side_loaded 1.00 1.00 1.00
Mesh element.nb_rows_in_usual 0.67 0.50 1.00
Mesh mesh_test_Neg.add_numerical 1.00 1.00 1.00
Mesh mesh_test.add_numerical 0.67 0.50 1.00
Mesh mesh.add_numerical 1.00 1.00 1.00
Mesh nnumber.add_numerical 0.29 0.17 1.00
Mesh nnumber.nb_rows_in_mesh 0.29 0.17 1.00
Mooney_Family person.nb_rows_in_aunt 0.02 0.01 1.00
Mooney_Family person.nb_rows_in_brother 0.05 0.03 1.00
Mooney_Family person.nb_rows_in_daughter 0.08 0.04 1.00
Mooney_Family person.nb_rows_in_father 0.04 0.02 1.00
Mooney_Family person.nb_rows_in_husband 0.08 0.04 1.00
Mooney_Family person.nb_rows_in_mother 0.04 0.02 1.00
Mooney_Family person.nb_rows_in_nephew 0.02 0.01 1.00
Mooney_Family person.nb_rows_in_niece 0.02 0.01 1.00
Mooney_Family person.nb_rows_in_sister 0.03 0.02 1.00
Mooney_Family person.nb_rows_in_son 0.11 0.06 1.00
Mooney_Family person.nb_rows_in_uncle 0.02 0.01 1.00
Mooney_Family person.nb_rows_in_wife 0.08 0.04 1.00
NBA Actions.add_numerical 0.40 0.40 0.40
NBA Game.add_numerical 0.05 0.03 1.00
NBA Game.nb_rows_in_Actions 0.04 0.02 1.00
NBA Player.add_numerical 0.40 0.25 1.00
NBA Player.nb_rows_in_Actions 0.50 0.33 1.00
NBA Team.add_numerical 0.01 0.00 1.00
NBA Team.nb_rows_in_Actions 0.02 0.01 1.00
NBA Team.nb_rows_in_Game 0.02 0.01 1.00
Same_gen person.nb_rows_in_parent 0.17 0.09 1.00
Same_gen person.nb_rows_in_same_gen 0.25 0.14 1.00
Same_gen person.nb_rows_in_target 0.20 0.11 1.00
Same_gen target.add_numerical 1.00 1.00 1.00
Toxicology atom.nb_rows_in_connected 0.50 0.33 1.00
Toxicology molecule.nb_rows_in_atom 0.29 0.17 1.00
Toxicology molecule.nb_rows_in_bond 0.25 0.14 1.00
UW_std advisedBy.add_numerical 0.80 0.67 1.00
UW_std course.add_numerical 1.00 1.00 1.00
UW_std course.nb_rows_in_taughtBy 0.22 0.13 1.00
UW_std person.add_numerical 0.50 1.00 0.33
UW_std person.nb_rows_in_advisedBy 0.29 0.17 1.00
UW_std person.nb_rows_in_taughtBy 0.15 0.08 1.00
UW_std taughtBy.add_numerical 0.80 0.67 1.00
Table C.1: This table shows the F-measure, precision and recall of our implementation
of the Column Map Discovery primitive when tested against eleven relational datasets
using the evaluation method proposed in Section 7.3.1.
Bibliography
[1] CCPA. https://oag.ca.gov/privacy/ccpa. Accessed: 2020-05-04.
[2] GDPR. https://gdpr-info.eu/. Accessed: 2020-05-04.
[3] Metadata.json. https://github.com/HDI-Project/MetaData.json. Accessed: 2020-05-04.
[4] RDT. https://sdv-dev.github.io/RDT/. Accessed: 2020-05-14.
[5] TPC. http://www.tpc.org/default5.asp. Accessed: 2020-05-03.
[6] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. Proc. of VLDB 2006 (demonstration description), 2006.
[7] Michael Backes, Niklas Grimm, and Aniket Kate. Data lineage in malicious environments. IEEE Transactions on Dependable and Secure Computing, 13(2):178–191, 2015.
[8] Omar Benjelloun, Anish Das Sarma, Alon Halevy, and Jennifer Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 953–964. VLDB Endowment, 2006.
[9] Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An annotation management system for relational databases. The VLDB Journal, 14(4):373–396, 2005.
[10] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 539–550, 2006.
[11] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Data provenance: Some basic issues. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 87–93. Springer, 2000.
[12] Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. Why and where: A characterization of data provenance. In International Conference on Database Theory, pages 316–330. Springer, 2001.
[13] Yingwei Cui and Jennifer Widom. Practical lineage tracing in data warehouses. In Proceedings of 16th International Conference on Data Engineering (Cat. No. 00CB37073), pages 367–378. IEEE, 2000.
[14] Yingwei Cui and Jennifer Widom. Lineage tracing for general data warehouse transformations. The VLDB Journal — The International Journal on Very Large Data Bases, 12(1):41–58, 2003.
[15] Yingwei Cui, Jennifer Widom, and Janet L Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems (TODS), 25(2):179–227, 2000.
[16] Laura Gustafson. Bayesian tuning and bandits: an extensible, open source library for AutoML. PhD thesis, Massachusetts Institute of Technology, 2018.
[17] Thomas Heinis and Gustavo Alonso. Efficient lineage tracking for scientific workflows. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1007–1018, 2008.
[18] Jiansheng Huang, Ting Chen, AnHai Doan, and Jeffrey F Naughton. On the provenance of non-answers to queries over extracted data. Proceedings of the VLDB Endowment, 1(1):736–747, 2008.
[19] Robert Ikeda and Jennifer Widom. Data lineage: A survey. Technical report, Stanford InfoLab, 2009.
[20] James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE, 2015.
[21] Arunprasad P Marathe. Tracing lineage of array data. Journal of Intelligent Information Systems, 17(2-3):193–214, 2001.
[22] Patrick McDaniel. Data provenance and security. IEEE Security & Privacy, 9(2):83–85, 2011.
[23] JP Moresmau, Fabian Schyns, Uta Sommerweiss, Lothar Grabowsky, Jens-Uwe Richter, Henric Gomes, Gerald Csapo, Karsten Baensch, Gunter Wiedemer, Micha Treiber, et al. Intelligent metadata management and data lineage tracing, September 21 2017. US Patent App. 15/457,808.
[24] Jan Motl and Oliver Schulte. The CTU Prague relational learning repository. arXiv preprint arXiv:1511.03086, 2015.
[25] Christopher Ré and Dan Suciu. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment, 1(1):797–808, 2008.
[26] Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. A machine learning approach to foreign key discovery. In WebDB, 2009.
[27] Maksims Volkovs and Richard S Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1313–1321, 2012.
[28] Allison Woodruff and Michael Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In Proceedings 13th International Conference on Data Engineering, pages 91–102. IEEE, 1997.
[29] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc, and Divesh Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment, 3(1-2):805–814, 2010.