
Tracer: A Machine Learning Approach to Data Lineage

by

Felipe Alex Hofmann

Submitted to the Department of Electrical Engineering and Computer Science

in partial fulfillment of the requirements for the degree of

Master of Engineering in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author
Department of Electrical Engineering and Computer Science
May 20, 2020

Certified by
Kalyan Veeramachaneni
Principal Research Scientist
Thesis Supervisor

Accepted by
Katrina LaCurts
Chair, Master of Engineering Thesis Committee


Tracer: A Machine Learning Approach to Data Lineage

by

Felipe Alex Hofmann

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2020, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

The data lineage problem entails inferring the source of a data item. Unfortunately, most of the existing work in this area relies either on metadata, code analysis, or data annotations. In contrast, our primary focus is to present a machine learning solution that uses the data itself to infer the lineage. This thesis will formally define the data lineage problem, specify the underlying assumptions under which we solved it, and provide a detailed description of how our system works.

Thesis Supervisor: Kalyan Veeramachaneni
Title: Principal Research Scientist


Acknowledgments

I would like to thank my supervisor, Kalyan, for his insights, ideas, and instruction during this project; his support and guidance were invaluable for this thesis. I would also like to thank Kevin and Carles for their help with the design and development of the system. Finally, I would like to thank my family — Ricardo, Leila, Gustavo, and Nathália — for their unwavering support over the years.


Contents

1 Introduction
  1.1 Data Lineage Tracing
    1.1.1 Definition
    1.1.2 Why does it matter?
  1.2 Tracer
    1.2.1 Overview
    1.2.2 Why is Tracer needed?
    1.2.3 Scope
  1.3 Thesis Road Map

2 Related Work
  2.1 Definition
  2.2 Classification
  2.3 Setting

3 System Design
  3.1 Overview
  3.2 Tracer
  3.3 Terminology

4 Tracer Primitives
  4.1 Overview
  4.2 Primary Key Discovery
    4.2.1 Design choices
  4.3 Foreign Key Discovery
    4.3.1 Design choices
  4.4 Column Map Discovery
    4.4.1 Examples
    4.4.2 Inputs and Outputs
    4.4.3 Design choices
  4.5 Putting It All Together

5 Modeling Methods
  5.1 Primary Key Discovery
  5.2 Foreign Key Discovery
  5.3 Column Map Discovery
    5.3.1 Transform
    5.3.2 Solver
    5.3.3 Inverse Transform

6 Datasets
  6.1 Column Mapping Simulation
    6.1.1 Software Design
    6.1.2 Atoms
    6.1.3 Sieve

7 Experiments
  7.1 Primary Key
    7.1.1 Evaluation
    7.1.2 Experiments
  7.2 Foreign Key
    7.2.1 Evaluation
    7.2.2 Experiment
  7.3 Column Map
    7.3.1 Evaluation
    7.3.2 Experiments

8 Conclusion

A Column Map without Foreign Keys

B Relational Dataset

C Column Map Discovery Performance

List of Figures

1-1 Example of data lineage under our definitions. Input tables 𝐼1 and 𝐼2 are going through transformations 𝑇1, 𝑇2, 𝑇3, 𝑇4 and generating output tables 𝑂1, 𝑂2. Therefore, the lineage of 𝑂1, 𝑂2 is the set containing 𝐼1 and 𝐼2.

1-2 This figure shows a user who is interested in a statistic and wants to know the source(s) of the data: in other words, the data lineage. The goal of this project is to answer this question in the context of tabular databases.

3-1 This figure provides a brief overview of how Tracer works and its role in the system context. Tables without lineage (on the left) go through the Data Ingestion Engine to be converted into the appropriate format, and then fed into Tracer — the focus of this thesis — which discovers the lineage of the tables.

3-2 This figure shows how all the primitives are put together into a single pipeline, as well as their standard inputs and outputs. Further clarification regarding the data types can be found in Table 3.1.

4-1 This figure showcases the functionality of the Primary Key Discovery module. It takes as input a set of tables (on the left) and returns as output the name of the columns that it predicts are primary keys (on the right).

4-2 This is an example of how the Foreign Key Discovery module works. It takes a set of tables as input (on the left), as well as their corresponding primary keys (represented in green), and outputs the foreign key relationships that it predicts, in sorted order from most likely to least (on the right).

4-3 This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that the target column “Total” was generated by adding “Cost 1” and “Cost 2”.

4-4 This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Purchase Date” was generated from “Quantity” (the amount of rows for each key in “Purchase Date” corresponds to the values in “Quantity”).

4-5 This figure shows the Column Map Discovery primitive being applied, where the module learns that “Total” was generated from adding all values from “Cost 1” and “Cost 2” with the same foreign keys.

4-6 This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Min/Max” was generated from “Prices” (the values in “Min/Max” are the minimum and maximum values from each key of “Prices”).

5-1 This figure demonstrates how Column Map Discovery works in three steps: (1) the model expands the columns of the dataset (e.g. “Date” becomes “Day”, “Month”, “Year”); (2) the model attempts to predict which of these columns generated the target column, assigning scores to each of them; (3) the model aggregates the columns that were expanded, selecting the largest scores as the lineage of the target.

6-1 This figure shows the schema for the MovieLens dataset [24], where the arrows represent foreign key relationships, the first column describes the columns in each table, and the second column informs the data type of each column.

6-2 A basic example of how to use DataReactor.

A-1 This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that “Total” was generated by adding “Cost 1” and “Cost 2” without knowing the foreign keys.

List of Tables

2.1 A summary of the research done in each data lineage field, separated by type (how-lineage or where-lineage) and granularity (schema-level or instance-level) [19].

3.1 This table defines the standard inputs and outputs of the Primary Key, Foreign Key and Column Map Discovery primitives.

4.1 This table shows an example of the correct input/output format for the Primary Key Discovery primitive.

4.2 This table shows an example of the correct input/output format for the Foreign Key Discovery primitive.

4.3 This table shows an example of the correct input/output format for the Column Map Discovery primitive.

6.1 Description of the Dataset and DerivedColumn data structures.

7.1 Number of tables per dataset, as well as the accuracy of our implementation of the Primary Key Discovery primitive in three situations: (1) normally, (2) without utilizing the column names, and (3) without utilizing the position of the columns.

7.2 Example of the Foreign Key Discovery evaluation method, when the ground truth is A, B, C.

7.3 Number of foreign keys per dataset, F-measure, precision and recall of our implementation of the Foreign Key Discovery primitive when tested against eighteen datasets, using the evaluation method described in Section 7.2.1.

7.4 This table shows the performance of our current implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for each dataset.

7.5 This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for the two transformations generated by DataReactor.

B.1 This table provides some details regarding the 73 relational datasets from [24].

C.1 This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive when tested against eleven relational datasets using the evaluation method proposed in Section 7.3.1.

Chapter 1

Introduction

In data science, the inability to trace the lineage of particular data is a common and fundamental problem that undermines both security and reliability [22]. Understanding data lineage enables auditing and accountability, and allows users to better understand the flow of data through a system. However, by the time data makes its way through a system, its lineage has often been lost. Our goal in this thesis is to present a machine learning solution to the data lineage problem. We argue that our approach provides a more general and more extensible solution to this problem than existing approaches, which rely only on handcrafted heuristics [14], code analysis [23], and/or manual annotation of data [9].

1.1 Data Lineage Tracing

1.1.1 Definition

To approach the problem we need a formal definition of data lineage. For the purposes of this thesis, we will be using the definition provided in [14]:

Define 𝑂 as a set of output tables, 𝐼 as a set of input tables, 𝑜 as a data item (row or column) from the output set, and define transformation 𝑇 as a procedure that takes a set of tables as input and produces another set of tables as output. Then, given 𝑇(𝐼) = 𝑂 and an output item 𝑜 ∈ 𝑂, the set 𝐼′ ⊆ 𝐼 of input data items that contributed to 𝑜’s derivation is the lineage of 𝑜, denoted as 𝐼′ = 𝑇′(𝑜, 𝐼). Figure 1-1 provides an example of lineage under this definition.

Figure 1-1: Example of data lineage under our definitions. Input tables 𝐼1 and 𝐼2 are going through transformations 𝑇1, 𝑇2, 𝑇3, 𝑇4 and generating output tables 𝑂1, 𝑂2. Therefore, the lineage of 𝑂1, 𝑂2 is the set containing 𝐼1 and 𝐼2.

With this, we can define a lineage graph to be a directed acyclic graph of the lineage, where nodes are entries of tables and edges are the lineage of these entries, pointing from the original entry to the entry that was generated. Notice that under this definition, 𝑇′ only provides the immediate parents of a data item, and the function has to be called recursively to fully compute the lineage graph.

1.1.2 Why does it matter?

Reliable data is fundamental for effective decision making. However, as data — specifically big relational data — is sampled, mixed and altered, and passed through systems and between stakeholders, its lineage is frequently lost [22]. If the original source of data is obscured, the quality of the derived data may be questioned, along with the reliability of any decisions based on it. A method for tracing the origin of data and its movement between people, processes, and systems would help address this issue.

Furthermore, more control over data is essential for regulatory compliance. As issues like privacy and control over personal data receive more attention, data protection agencies around the world are requiring that companies and other data-collecting entities provide users a higher level of control over their own data. Some recent examples of legislation are the GDPR (General Data Protection Regulation) [2], which was implemented in the European Union in 2018, and the CCPA (California Consumer Privacy Act) [1], which takes effect in 2020. With automatic data lineage tracing, it becomes easier to answer the question “Where did this data come from?” and to satisfy these regulatory compliance requirements.

Figure 1-2: This figure shows a user who is interested in a statistic and wants to know the source(s) of the data: in other words, the data lineage. The goal of this project is to answer this question in the context of tabular databases.

Understanding the flow of data through systems has many benefits for stakeholders. Accurate data lineage can help identify gaps in reporting, highlight data quality issues, and help trace errors back to their root cause. It can help address questions such as how valuable the data is, whether the data is duplicated or redundant, and whether it is internal or external. Being able to trace the lineage of data is, overall, indispensable for enterprise applications.

1.2 Tracer

1.2.1 Overview

In light of the issues described above, we propose Tracer, a machine learning solution to the data lineage problem. At a high level, our system traces the lineage of data by (1) taking a tabular database as input, (2) analyzing the tables with statistical methods, machine learning techniques, and hand-crafted heuristics, and (3) returning the rows and columns that generated the tables.

An end-to-end solution to the data lineage problem consists of:

1. Identifying parent-child relationships between tables.

2. Understanding how each row/column in a parent table maps to a row/column in the child table.

3. Presenting this information in a human-readable manner which allows users to derive actionable insights, as well as in a machine-readable manner which can be utilized by downstream data science tools.

1.2.2 Why is Tracer needed?

While many solutions to the data lineage problem have been proposed before, all available attempts at tackling the problem rely either on handcrafted heuristics (such as [14], which provides algorithms for lineage tracing in data warehouses), code analysis (such as [23], which provides a system for lineage tracing relying on source code analysis), and/or manual annotation of data (such as [9], which provides a data annotation system for relational databases).

While these are all valid ways of lineage tracing, they all have limitations. Code analysis requires access to the code, which may be inaccessible when dealing with external applications or if the code was lost or deprecated. Algorithms require a strict set of conditions to be able to run, such as knowing ahead of time all the possible transformations, which cannot always be satisfied. Finally, solutions that require annotations are great for future lineage tracing, but they were not designed to answer the question of where the present, non-annotated data came from.

Tracer does not have these same limitations. Unlike these approaches, Tracer learns the lineage from the data itself, and so (1) it doesn’t require code access; (2) there are few constraints on the data, since it uses machine learning instead of fixed algorithms; and (3) it doesn’t need any annotations, since it only requires the data itself to learn its lineage.

1.2.3 Scope

Trying to solve the data lineage problem for all types of data would be a massive undertaking, and we would likely encounter issues achieving high accuracy and speed. Therefore, we will focus our attention on a subset of more common situations, as enumerated below:

1. All transformations must be deterministic, since adding randomization to the problem makes inferring the lineage significantly harder.

2. All transformations must be reducible to operations from columns to columns or from rows to rows. For example, merging two columns together is valid, but merging a column with a row is not.

3. The lineage must be unambiguous. For example, let’s assume the input column [1, 2, 3] generated the output column [4, 8, 12]. If there is another input [2, 4, 6], the model will not be able to infer which input generated the output, since there are equally simple transformations for both inputs that would result in the output.

4. All the data must be provided in tables. Tracer currently does not support conversion of other data types into the appropriate format.

1.3 Thesis Road Map

This thesis is organized as follows:

∙ Chapter 2 introduces related work on the data lineage problem.

∙ Chapter 3 presents the high-level design of the Tracer system.

∙ Chapter 4 describes the API of our system.

∙ Chapter 5 clarifies the technical workings of our system.

∙ Chapter 6 presents the datasets used to test our system.

∙ Chapter 7 demonstrates the performance of our system.

∙ Chapter 8 highlights promising areas for future work on this topic.


Chapter 2

Related Work

In this chapter, we provide an overview of previous work on the data lineage problem.

2.1 Definition

People have been working to keep track of data lineage for several decades. Back in 1997, Woodruff and Stonebraker [28] defined the lineage of data as its entire processing history, and provided a novel system to generate lineage without relying on metadata. Buneman, Khanna and Tan discussed the issue as well [11]. The field was further advanced when Cui and Widom published their work on the topic [15], [13], [14], defining data lineage in the context of data warehouses and providing algorithms for lineage tracing.

2.2 Classification

Research regarding data lineage can be divided into four distinct fields, based on the lineage type (how-lineage or where-lineage) and granularity (schema-level or instance-level). Lineage type refers to which question is being answered: either “where” the data comes from (i.e. which tables generated some output) or “how” the data was transformed (i.e. the specific transformations which generated some output). Granularity refers to the format of the lineage, where schema-level/coarse-grained lineage means the model looks at whole tables at a time, or even entire datasets, while instance-level/fine-grained means that each data item is analyzed individually. This is further discussed by Ikeda and Widom [19], who show some of the existing body of work for each category, which we reproduce in Table 2.1.

        | Schema                      | Instance
  Where | Workflow [17]               | Why and Where [12], General Warehouse [14], Warehouse View [15], Non-Answers [18], Approximate [25], Trio [6]
  How   | Curated [10], Workflow [17] | Curated [10]

Table 2.1: A summary of the research done in each data lineage field, separated by type (how-lineage or where-lineage) and granularity (schema-level or instance-level) [19].

2.3 Setting

Data context determines what type of data lineage tracing can be performed. For example, in data warehouses, where information is properly organized and well-annotated, the lineage can be discovered with basic heuristics, such as described by [14]. On the other hand, in malicious environments, intentional and unintentional leakages must be taken into consideration, and entirely different frameworks must be developed, as [7] and [22] discuss. Other examples include lineage in the context of arrays, as studied in [21], and in the setting of relational databases containing uncertain data, as described by [8] and [6].


Chapter 3

System Design

At a high level, the Tracer library works by executing pipelines that contain modules called primitives. Each of these primitives is responsible for solving a particular subproblem related to the data lineage task, from primary and foreign key discovery to identifying column mappings. In this chapter, we describe how these pipelines and primitives interact to solve the data lineage problem.

3.1 Overview

The Tracer library was developed to solve the data lineage problem in the context of tabular databases. To understand the purpose and functionality of the library, we must first understand how Tracer interacts with other modules within the same context. Figure 3-1 provides a general overview of the system design.

On the left side, the figure shows various systems generating tables. We don’t always know which system generated which tables, which can lead to a loss of data lineage. These systems can also be both complex and diverse — for example, System A could be a MySQL database, and System B a Hadoop database, with each one requiring distinct handling.

The tables are fed into the Data Ingestion Engine. The two components of this module — the System Connector and the Input Generator — help to transform the various possible inputs into a standardized format called Metadata.JSON [3].

Figure 3-1: This figure provides a brief overview of how Tracer works and its role in the system context. Tables without lineage (on the left) go through the Data Ingestion Engine to be converted into the appropriate format, and then fed into Tracer — the focus of this thesis — which discovers the lineage of the tables.

Tracer then takes the datasets, along with the Metadata.JSON file, and attempts to identify useful properties pertinent to the data lineage problem. It does this by applying various procedures called primitives, which extract information (such as primary keys and foreign keys) and store it in the metadatabase, metadb. This data is then processed by the UI/UX library, which provides a graphical interface, the Data Lineage Dashboard, that facilitates the user’s interaction with Tracer.

3.2 Tracer

Figure 3-2: This figure shows how all the primitives are put together into a single pipeline, as well as their standard inputs and outputs. Further clarification regarding the data types can be found in Table 3.1.

Now that we have an overview of the system at large, let’s take a closer look at the Tracer module. Tracer is an end-to-end pipeline that takes in a set of data tables and outputs the lineage of that data. It accomplishes this by running through a series of steps, called primitives. Each primitive generates relevant information that is then fed to the next primitive in the pipeline, and on down the line until the data lineage is discovered.

This interaction is graphically represented in Figure 3-2. As the diagram shows, we start with a set of tables, which are fed to the Primary Key Discovery primitive to generate the primary keys of the dataset. Next, the Foreign Key Discovery primitive uses the tables and the generated primary keys to detect any foreign key relationships that may exist in the dataset. Finally, the Column Map Discovery primitive uses these foreign keys, as well as the original tables, to discover which columns generated which others in the dataset. The design and functionality of each of these primitives is described in Chapter 4.

The above process, if accurate, is sufficient to provide the lineage of any item in a database. This is due to one of the constraints we presented previously: Tracer only deals with transformations on rows and columns. Therefore, if we know the mapping of every row (through foreign keys) and the mapping of every column (through the column map), we can pinpoint the exact origin of each cell of a table.

3.3 Terminology

As can be seen in Figure 3-2, each primitive has well-defined inputs and outputs. These data structures are defined in Table 3.1 and will be used throughout the rest of this thesis.

Tables. A set of tables, each of which has (1) a unique name, (2) one or more named columns, and (3) zero or more rows. Data structure: a dictionary mapping table names to dataframes. Example:

    {table1: DataFrame,
     table2: DataFrame,
     ...}

PrimaryKeys. Indicates which columns of a set of tables are the primary keys. Data structure: a dictionary mapping table names to column names. Example:

    {table1: user_id,
     table2: prod_id,
     ...}

ForeignKeys. All the foreign key mappings that exist within a set of tables, where each referenced table maps to all tables that refer to it. Data structure: a sorted list of dictionaries, each of which contains five elements: table, column, ref_table, ref_column, score. Example:

    [{"table": users,
      "column": user_id,
      "ref_table": posts,
      "ref_column": key,
      "score": .9}, ...]

ColumnMap. The mapping between a target column and the columns from the parent table that generated it. Data structure: a sorted list of dictionaries. Example:

    [{"table": costs,
      "column": cost_1,
      "score": .9}, ...]

Table 3.1: This table defines the standard inputs and outputs of the Primary Key, Foreign Key and Column Map Discovery primitives.


Chapter 4

Tracer Primitives

This chapter aims to provide further context about the building blocks of Tracer: the primitives. As explained in Chapter 3, the goal of each primitive is to solve a subproblem of the data lineage task. When combined, the solutions to these subproblems allow us to answer any data lineage query on a standard relational database. In this chapter we will (1) define what a primitive is, (2) explain how each of them works, and (3) show how they work together to solve a data lineage problem.

4.1 Overview

While Section 3.2 explained what a primitive does, it did not answer the question of what a primitive is. A primitive is a well-defined Python class which contains methods aimed at solving a specific part of the data lineage problem. There are three types of primitives implemented in Tracer: Primary Key Discovery, Foreign Key Discovery, and Column Map Discovery. Because the output of each successive primitive is required as an input for the one that follows, as shown in Figure 3-2, they must be called in this specific order.

As for their actual implementation, each Tracer primitive class satisfies the following properties (a minimal sketch of this interface follows the list):

1. Contains a fit function, which is optional and corresponds to training the model on some external datasets.

2. Contains a solve function, which is required and uses the fitted model to produce the appropriate output.

3. Has standard inputs/outputs as shown in Figure 3-2.

4. Is easily extensible. In the scenario where a superior method to solve a primitive task is discovered (we describe our current implementation in Chapter 5), only the code for that particular primitive has to be substituted, since each of the primitives is independent of the others.

In the following sections we will describe (1) the three main types of primitives implemented in the Tracer library, (2) their respective APIs, and (3) the reasoning behind their design.

4.2 Primary Key Discovery

Figure 4-1: This figure showcases the functionality of the Primary Key Discovery module. It takes as input a set of tables (on the left) and returns as output the name of the columns that it predicts are primary keys (on the right).

The goal of the Primary Key Discovery primitive is to discover all the primary keys within a dataset. It takes as input a dictionary mapping table names to dataframes (Tables) and returns as output a dictionary mapping table names to the column name that describes the primary key (PrimaryKeys). If a table has no primary key, it will map to "None". Figure 4-1 demonstrates the functionality of this primitive.

Consider the scenario shown in Table 4.1. In this example there are two tables, “users” and “posts”, with primary key columns “user_id” and “post_id”, respectively. Now assume we lost the information indicating which columns are the primary keys. To rediscover them, we can use the Primary Key Discovery primitive by calling the solve function on these tables, and the model will return the primary keys for us.

Input:
    {users: pd.DataFrame(user_id, user_name, ...),
     posts: pd.DataFrame(post_id, post_title, ...)}

Output:
    {users: user_id,
     posts: post_id}

Table 4.1: This table shows an example of the correct input/output format for the Primary Key Discovery primitive.
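In code, this scenario might look like the following sketch. The class name and call signature follow the hypothetical interface sketched in Section 4.1, not a published Tracer API.

    import pandas as pd

    tables = {
        "users": pd.DataFrame({"user_id": [1, 2], "user_name": ["alice", "bob"]}),
        "posts": pd.DataFrame({"post_id": [10, 11], "post_title": ["a", "b"]}),
    }

    primitive = PrimaryKeyDiscovery()
    primary_keys = primitive.solve(tables)
    # Expected result: {"users": "user_id", "posts": "post_id"}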

4.2.1 Design choices

The Primary Key Discovery primitive takes a dataset as input, as opposed to only a single table. We designed it as such because some methods for discovering primary keys rely on looking at relationships between tables. For example, consider a dataset (Tables) where the primary key for every table always has the same column name. Under our implementation, a primary key discovery model would be able to use this fact to simplify the key tracing. However, this argument does not extend to multiple datasets: there is little that can be learned across multiple datasets that could not be learned within one, and so the input is a single dataset instead of a list.

In addition, PrimaryKeys only stores one primary key per table, because the current implementation of Tracer can only discover up to one key per table. This will be enhanced in a future version of the project.

Another relevant decision concerns why this module only takes Tables as input and not other potentially useful information, such as foreign keys. There are two reasons for this. First, the Foreign Key Discovery module takes the primary keys as input, so if Primary Key Discovery required the foreign keys, we would have a loop. Second, the current module already achieves near-perfect accuracy, as shown in Section 7.1, which means the added complexity of alternative implementations (e.g. merging both Key Discovery primitives to solve the looping issue) is not justifiable.

4.3 Foreign Key Discovery

Figure 4-2: This is an example of how the Foreign Key Discovery module works. It takes a set of tables as input (on the left), as well as their corresponding primary keys (represented in green), and outputs the foreign key relationships that it predicts, in sorted order from most likely to least (on the right).

The goal of the Foreign Key Discovery primitive is to discover all the foreign keys within some dataset. It takes as input a dictionary mapping table names to dataframes (Tables) and a dictionary mapping these tables to their primary keys (PrimaryKeys), and returns a list of dictionaries representing the foreign keys of the table, sorted by how likely they are according to the model (ForeignKeys). Notice that the majority of the keys in ForeignKeys don’t actually exist. Because ForeignKeys contains all possible foreign key relationships, sorted from most to least likely to exist based on their “score” (see Table 3.1), only the initial values of the list correspond to actual foreign keys, while the other values must be thresholded. Figure 4-2 demonstrates the functionality of this primitive.

Consider the scenario shown in Table 4.2. In this example there are two tables, “users” and “posts,” with a foreign key relationship between “user_id” and “key.” Now assume we have lost the information about which columns are foreign keys. To rediscover them, Tracer will first call the Primary Key Discovery primitive, which will provide the appropriate PrimaryKeys object. Then we will use the Foreign Key Discovery’s solve function with Tables and PrimaryKeys as input, which will return all possible foreign keys, sorted from most to least likely based on their “score.” Notice that the example in Table 4.2 only shows the first foreign key of the output, but the list contains four items, where the other possible pairings — (“user_id”, “id”), (“user_name”, “id”) and (“user_name”, “key”) — have lower scores.

Input:
    { # Tables
      users: pd.DataFrame(user_id, user_name, ...),
      posts: pd.DataFrame(id, key, nb_posts, ...)
    },
    { # PrimaryKeys
      users: user_id,
      posts: key
    }

Output:
    [{"table": users,
      "column": user_id,
      "ref_table": posts,
      "ref_column": key,
      "score": .95}, ...]

Table 4.2: This table shows an example of the correct input/output format for the Foreign Key Discovery primitive.
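Continuing the hypothetical API from the previous sketches, the call and the thresholding described above might look like this. The 0.8 cutoff is an arbitrary illustration; the thesis does not prescribe a threshold value.

    fk_primitive = ForeignKeyDiscovery()
    foreign_keys = fk_primitive.solve(tables, primary_keys)

    # foreign_keys is sorted by score; only the top entries are real keys,
    # so downstream consumers can threshold it:
    likely = [fk for fk in foreign_keys if fk["score"] > 0.8]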

4.3.1 Design choices

Unlike the Primary Key Discovery primitive, this primitive does allow any number of keys per table. We decided to prioritize doing so in this module because, from our observations, most relational databases contain at least one compound foreign key, while compound primary keys are not as widespread.

Another major decision about this API was the output format. In earlier versions of Tracer, Foreign Key Discovery generated a dictionary mapping the column names of each foreign key to a list of tuples containing (“parent_table”, “child_table”, “child_column”), corresponding to the matching foreign keys. While the dictionary data structure allowed for easier access to a specific foreign key, it was a very unintuitive format. Changing it to a list of objects made it more approachable, while also facilitating the sorting of the foreign keys by their scores, which simplifies the job of the Column Map Discovery when it receives the ForeignKeys.

Finally, we need to explain our decision to generate all possible foreign keys and their respective scores. Initially, the output of this module was solely the thresholded foreign keys, and so no scores were provided to the next module. While this method generated cleaner outputs, we had trouble achieving high accuracy in the Column Map Discovery primitive, because results generated from Foreign Key Discovery were failing. This is caused by the error compounding multiplicatively in the pipeline: if the predicted foreign keys are incorrect, then the next module will also be incorrect, because it was fed the wrong information. To fix this issue, we now provide all possible foreign keys with their respective scores, which gives the next primitive more information with which to make decisions.

4.4 Column Map Discovery

This primitive type attempts to discover which columns of the dataset generated some target column chosen by the user. It takes as input a dictionary mapping table names to dataframes (Tables), a list of dictionaries representing the foreign keys of the table (ForeignKeys), as well as the target column, and outputs a sorted list of dictionaries representing the columns that generated the target (ColumnMap), as shown in Figure 3-2. This primitive is significantly more complex than the other two, so we will study it by analyzing four distinct examples of problems this module is capable of solving.

4.4.1 Examples

Figure 4-3: This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that the target column “Total” was generated by adding “Cost 1” and “Cost 2”.

For the first example, consider the situation in Figure 4-3. The image shows the Column Map Discovery learning that the lineage of the column “Total” consists of the columns “Cost 1” and “Cost 2.” Note that it doesn’t know the exact transformation (a combination of columns, where Total = Cost 1 + Cost 2), only that there exists some correlation between the columns.

Figure 4-4: This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Purchase Date” was generated from “Quantity” (the amount of rows for each key in “Purchase Date” corresponds to the values in “Quantity”).

Another type of transformation Column Map Discovery is capable of recognizing is shown in Figure 4-4. In this example, the elements from column “Purchase Date” were generated from column “Quantity,” where the number in each cell of “Quantity” represents how many rows for each key there will be in “Purchase Date.” Note that in this example the column “Name” serves no purpose. The primitive should be able to figure out that there is no correlation between “Name” and “Purchase Date.”

Figure 4-5: This figure shows the Column Map Discovery primitive being applied, where the module learns that “Total” was generated from adding all values from “Cost 1” and “Cost 2” with the same foreign keys.

The third example is shown in Figure 4-5. Similarly to the first example, “Total” is a function of columns “Cost 1” and “Cost 2.” Unlike in example one, multiple rows also combine, so that “Total” is generated by the summation of all elements of both columns that share the same key. Column Map Discovery supports this type of transformation by converting the multiple rows into one through the Transform stage, as explained in Section 5.3.

Figure 4-6: This example shows the functionality of the Column Map Discovery primitive, where the module discovers that “Min/Max” was generated from “Prices” (the values in “Min/Max” are the minimum and maximum values from each key of “Prices”).

A last example is shown in Figure 4-6. In this situation, the elements from column “Min/Max” were generated from column “Prices” in the first table by selecting the smallest and largest values for each key, and then placing the maximum value below the minimum one. This example underscores the value of Tracer, since even very unorthodox transformations can still be discovered by the Column Map Discovery primitive.

4.4.2 Inputs and Outputs

Consider the scenario shown in Table 4.3. In this example, there are two tables, “users” and “posts” (where “nb_posts” is generated by combining the columns “post1” and “post2”), as well as a foreign key relationship between the columns “user_id” and “key.” Now assume we lost this information. To rediscover the lineage of the target column “nb_posts”, Tracer will first call the Primary Key Discovery primitive, which will feed its output to the Foreign Key Discovery primitive, which then generates the ForeignKeys object we need. Now we can simply use the Column Map Discovery’s solve function with Tables and ForeignKeys as input to generate the lineage of “nb_posts.”

Input:
    { # Tables
      users: pd.DataFrame(user_id, nb_posts, ...),
      posts: pd.DataFrame(key, post1, post2, ...)
    },
    nb_posts, # target column
    [{ # ForeignKeys
      "table": users,
      "column": user_id,
      "ref_table": posts,
      "ref_column": key,
      "score": .95
    }, ...]

Output:
    [{"table": posts,
      "column": post1,
      "score": .95},
     {"table": posts,
      "column": post2,
      "score": .9}]

Table 4.3: This table shows an example of the correct input/output format for the Column Map Discovery primitive.
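In the same hypothetical API, tracing the lineage of “nb_posts” might look like this; the target argument mirrors the (table_name, column_name) tuple described in Section 4.5.

    cm_primitive = ColumnMapDiscovery()
    column_map = cm_primitive.solve(
        tables,
        target=("users", "nb_posts"),
        foreign_keys=foreign_keys,
    )
    # Expected shape: [{"table": "posts", "column": "post1", "score": ...}, ...]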

4.4.3 Design choices

One of the most important design decisions in this project involved the API of the Column Map Discovery. While the Primary and Foreign Key Discovery primitives deal with standard relational database structures, the Column Map Discovery is unique to Tracer, and very flexible both in implementation and functionality. This was the first module that was implemented, and the whole architecture of Tracer was designed around what this module would require to function.

One important decision about the module is what type of inputs it should take. Theoretically it could work by taking only Tables as input (see Appendix A), but the lack of row mapping (i.e. foreign key relationships) would make the model impractically slow, since allowing uncertainty in both row and column mappings at the same time creates an immense search space.

We dealt with this issue by providing the Column Map Discovery primitive with the ForeignKeys object. While it is true that this creates the issue of multiplicative error (i.e. if the Foreign Key primitive is incorrect, the Column Map will also be incorrect, since it is acting on false data), we concluded that predicting foreign keys is sufficiently easy, and there is enough research on the topic, that we will always be able to achieve sufficiently accurate results on that module.

Another important design choice is how dependent Column Map Discovery should be on ForeignKeys. In earlier versions of Tracer, Column Map Discovery was actually a class of four distinct primitives, each one aimed at dealing with a different type of foreign key. While this division facilitated the creation of primitives, we noticed that it limited the potential for solutions that use multiple foreign key types at the same time.

Instead, we decided not to limit our framework as such, and in the current version of Tracer there is no distinction between the types of foreign keys. This means that any Column Map Discovery primitive must find a way to deal with the various foreign keys. For an example of how our current implementation accomplishes this, refer to Section 5.3.

4.5 Putting It All Together

During the fit stage, the pipeline is given a list of databases. Each “database” contains not only Tables with raw data, but also the metadata, which includes the foreign keys, primary keys, and constraints. Then, the execution pipeline performs the following:

1. Fit the Primary Key Discovery module by building a training set using the given primary keys.

2. Fit the Foreign Key Discovery module by building a training set using the given primary and foreign keys.

Note that the models can be saved to disk and used for inference (i.e. we can train the system on a large collection of databases, store the resulting pipeline, and provide the pre-trained pipelines to users to apply to their databases).

During the solve stage, the pipeline takes a dictionary of tables as well as a (table_name, column_name) tuple which specifies the target column for which we want to trace the lineage. The execution pipeline runs the following:

1. Apply the Primary Key Discovery module to the tables and identify the primary keys.

2. Apply the Foreign Key Discovery module to the tables and primary keys and identify the foreign keys.

3. Apply the Column Map Discovery module to the tables, foreign keys, and target column in order to identify the lineage of the target column.

Note that to obtain the lineage for multiple columns, the user doesn’t need to rerun the first two steps, only the third step, which takes as input the target column. This is implemented by caching the results during pipeline execution — i.e. since the inputs to the primary and foreign key discovery modules do not change, the results will be retrieved from the cache.
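The following is a minimal sketch of such a pipeline, assuming the hypothetical primitive classes from the earlier sketches; the caching field is an illustration of the idea, not Tracer's actual mechanism.

    class TracerPipeline:
        """Hypothetical end-to-end pipeline chaining the three primitives."""

        def __init__(self):
            self.pk = PrimaryKeyDiscovery()
            self.fk = ForeignKeyDiscovery()
            self.cm = ColumnMapDiscovery()
            self._cache = None  # holds (primary_keys, foreign_keys)

        def fit(self, databases):
            # Each database provides tables plus known primary/foreign keys.
            self.pk.fit(databases)
            self.fk.fit(databases)

        def solve(self, tables, target):
            # Steps 1 and 2 run once; later calls reuse the cached results.
            if self._cache is None:
                primary_keys = self.pk.solve(tables)
                foreign_keys = self.fk.solve(tables, primary_keys)
                self._cache = (primary_keys, foreign_keys)
            _, foreign_keys = self._cache
            # Step 3 is the only part that depends on the target column.
            return self.cm.solve(tables, target=target, foreign_keys=foreign_keys)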


Chapter 5

Modeling Methods

In this chapter, we will delve further into the modeling algorithms used to support each primitive. While Chapter 4 went through what primitives are and how they are implemented, this chapter will focus on some of the algorithms Tracer uses to solve each primitive task.

5.1 Primary Key Discovery

At a high level, the primary key discovery module works by generating a feature vector for each column in a table, and predicting which column (if any) is most likely to be the primary key. Note that in this version of Tracer, we do not aim to support composite primary keys, only single-column primary keys.

The feature vector contains handcrafted features describing the column, chosen based on common patterns observed among primary keys. These features include:

1. Ordinal position. Primary keys tend to be one of the first columns in the table.

2. Uniqueness. Primary keys must be unique.

3. Data type. Primary keys are typically strings or integers.

4. Column name. We encode binary features for whether the name of the column contains common terms used for primary keys such as “_id” or “_key”.

During the fit stage, the Primary Key Discovery module trains a classification model to predict primary key columns using these features; it performs hyperparameter optimization and cross-validation using the BTB library [16] to identify the best model. During the solve stage, it uses the model and identifies the most promising candidate (if any) for the primary key.

5.2 Foreign Key Discovery

As discussed in [29] and [26], foreign keys typically satisfy the following six properties:

1. High cardinality.

2. Contain most of the primary keys.

3. Tend to be a primary key of few foreign keys.

4. Are not a subset of too many primary keys.

5. Share average lengths (mostly for strings).

6. Have similar column names.

Using the above properties, we came up with the following implementation to discover foreign keys. First, the model generates all the possible pairs of columns (i.e. for every two tables, create all the pairings between any column of one table and a column of the other), as long as a foreign key is viable (e.g. if the data types don’t match, then there can be no foreign key relationship, so the pairing is not created). Then the model has to decide which pairs are actually foreign keys. It accomplishes this by using the following handcrafted features, based on the six properties described above:

1. Intersection. Foreign keys should have significant intersection.

2. Inclusion dependency. One of the foreign keys tends to be a subset of the other.

3. Uniqueness. At least one of the foreign keys tends to be unique (i.e. a primary key).

4. Column name. We encode binary features for whether the name of either column contains common terms used for foreign keys, such as “_id” or “_key”.

5. Data type. Foreign keys are typically strings or integers.

Finally, much like for the Primary Key Discovery primitive, the fit procedure of the Foreign Key Discovery module trains a random forest classifier to predict foreign keys using these features; it then performs hyperparameter optimization and cross-validation using the BTB library [16] to identify the best model. During the solve stage, it then uses the model to identify the most likely foreign keys.
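As an illustration, the candidate generation and a few of these features might be sketched as follows; the function names and exact feature definitions are assumptions, not Tracer's code.

    import itertools

    import pandas as pd


    def candidate_pairs(tables: dict) -> list:
        """Enumerate viable column pairs across tables, skipping pairs
        whose data types don't match."""
        pairs = []
        for (n1, t1), (n2, t2) in itertools.combinations(tables.items(), 2):
            for c1 in t1.columns:
                for c2 in t2.columns:
                    if t1[c1].dtype == t2[c2].dtype:  # viability check
                        pairs.append((n1, c1, n2, c2))
        return pairs


    def pair_features(s1: pd.Series, s2: pd.Series) -> list:
        """Features mirroring the properties above: intersection,
        inclusion dependency, and uniqueness."""
        v1, v2 = set(s1.dropna()), set(s2.dropna())
        return [
            len(v1 & v2) / max(len(v1 | v2), 1),   # intersection
            float(v1 <= v2 or v2 <= v1),           # inclusion dependency
            float(s1.is_unique or s2.is_unique),   # uniqueness
        ]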

5.3 Column Map Discovery

Figure 5-1: This figure demonstrates how Column Map Discovery works in three steps: (1) the model expands the columns of the dataset (e.g. “Date” becomes “Day”, “Month”, “Year”); (2) the model attempts to predict which of these columns generated the target column, assigning scores to each of them; (3) the model aggregates the columns that were expanded, selecting the largest scores as the lineage of the target.

The Column Map Discovery primitive provides the following functionality: given a target column, it identifies the columns that the target column is derived from. At a high level, as can be seen in Figure 5-1, it works by (1) transforming the database into an (X, y) representation, where X and y contain numerical values representing the table and the target column, respectively, (2) identifying the relationships between X and y, and (3) reversing the transformation to identify which of the original columns contributed to the target column.


5.3.1 Transform

The transform step is meant to lead to more useful representations of the data. As an example, let’s consider the situation in Figure 5-1, where one of the columns is a date field. In this scenario, the transformation step would expand this one column into three: day, month and year. If the target column happens to be the year field, then this transformation has made it much easier for the model to identify that the date field is related to the target column.

More specifically, the transform step uses reversible data transforms [4] to generate a new set of columns, which can then be further processed in the next steps. The functionality of this step can be further improved with the addition of new features, such as computing the length of a string.

This procedure works for any type of foreign key relationship. If the foreign keys are many-to-many (i.e. multiple rows of the parent table match to multiple rows of the child table), for example, the module simply computes various aggregate transforms (e.g. number of rows in each child table), and the end result is still an (X, y) pair which contains numerical values. Therefore, the output is indistinguishable from the one-to-one case, and the next step doesn’t need to be concerned with the foreign keys.
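A sketch of the date-expansion idea: Tracer itself uses the reversible transforms of [4], so this simplified version, which only handles datetime columns and records the mapping needed by the inverse transform of Section 5.3.3, is purely illustrative.

    import pandas as pd


    def expand_dates(df: pd.DataFrame) -> tuple:
        """Expand each datetime column into day/month/year columns and
        remember which new columns came from which original."""
        expanded, mapping = df.copy(), {}
        for col in df.columns:
            if pd.api.types.is_datetime64_any_dtype(df[col]):
                parts = [f"{col}__day", f"{col}__month", f"{col}__year"]
                expanded[parts[0]] = df[col].dt.day
                expanded[parts[1]] = df[col].dt.month
                expanded[parts[2]] = df[col].dt.year
                expanded = expanded.drop(columns=[col])
                mapping[col] = parts
        return expanded, mapping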

5.3.2 Solver

This is the step of the process which determines which columns generated the target. The Solver takes the (X, y) pair generated by the Transform step (i.e. the expanded table and the corresponding target column) and assigns scores to each column indicating how likely it is to have generated the target column. It does so by training a random forest regressor to identify the feature importance of each column.

To explain more formally why the above procedure works, let us consider the following situation. Suppose the input table contains two columns A and B and only column A contributed to some output Y. Then, given sufficient data, and assuming that Y is not conditionally independent of A given B (the simplest violation of this assumption being when A and B are identical or redundant), a model with sparsity regularization will only use column A. Moreover, if the transformations are deterministic (i.e. there is no randomness in how the output column is generated) and we have enough data, then a universal function approximator (e.g. a deep neural network) will achieve perfect accuracy on the training set.

It is important to observe that in this context overfitting is not a problem; it is a

feature. To identify the set of input columns that contributed to the output column,

we simply need to solve:

\[ \operatorname*{argmin}_{S} \left[ \min_{F} \lVert F(X[S]) - y \rVert^{2} \right] + \lambda \lVert S \rVert_{0} \tag{5.1} \]

where S is a subset of columns and F is an arbitrary machine learning model.

Intuitively, this corresponds to identifying the smallest set of variables that results in

the best prediction accuracy.

5.3.3 Inverse Transform

Finally, now that we have identified how useful each column is for generating the

target column, we can reverse our initial transform. For example, if a date field

was transformed into three columns and those columns are assigned importance scores (0.0, 0.1, 0.4), then we would aggregate these into a total importance score of 0.5 and

assign it to the original date field. We explored two different ways of aggregating

the importance scores — adding them and taking the max over them — and found

addition to work better and more consistently. This results in a score for each column

which indicates how likely it is to have contributed to the target column.
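A short sketch of this aggregation, assuming expanded columns are named “<original>.<part>” as in the earlier sketches:

import pandas as pd


def aggregate_scores(scores: pd.Series) -> pd.Series:
    """Sum the importance scores of expanded columns per original column."""
    original = scores.index.map(lambda name: name.split(".")[0])
    return scores.groupby(original).sum()


# e.g. {"date.day": 0.0, "date.month": 0.1, "date.year": 0.4} -> {"date": 0.5}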


Chapter 6

Datasets

To evaluate the performance of our data lineage system, we compiled a suite of rela-

tional datasets from [24]. This repository contains 73 multi-table relational datasets

covering a diverse range of applications, from molecular chemistry to supply chain

management; it also includes standard benchmark datasets that have been used in

related work on data processing systems [5].

movies2actors: movieid (mediumint), actorid (mediumint), cast_num (int)

movies2directors: movieid (mediumint), directorid (mediumint), genre (varchar)

u2base: userid (int), movieid (mediumint), rating (varchar)

directors: directorid (mediumint), d_quality (int), avg_revenue (int)

movies: movieid (mediumint), year (int), isEnglish (enum), country (varchar), runningtime (int)

users: userid (int), age (varchar), u_gender (varchar), occupation (varchar)

actors: actorid (mediumint), a_gender (enum), a_quality (int)

Figure 6-1: This figure shows the schema for the MovieLens dataset [24], where the arrows in the original diagram represent foreign key relationships; each table is listed above with its columns and their data types.

Each dataset consists of multiple tables which are related through various primary

and foreign key relationships. For example, the schema for the MovieLens dataset is

shown in Figure 6-1. We store these datasets (and the associated metadata) using the

Metadata.JSON format proposed in [3]. We can then use these datasets to evaluate


the performance of our Primary and Foreign Key Discovery modules. For further

information about these datasets, please consult Appendix B.

However, these datasets are not sufficient to evaluate the Column Map Discovery

primitive. To assess this primitive, we developed a system to transform the relational

datasets and generate new “derived” columns where the column mapping is known.

Then, we can use this augmented dataset to evaluate the performance of the Column

Map Discovery primitive.

6.1 Column Mapping Simulation

The DataReactor project aims to take a relational dataset and intelligently generate

derived columns in order to aid in the evaluation of the Column Map Discovery primitive and guide

the design of Metadata.JSON [3]. These derived columns are a function of the other

columns in their table as well as the associated rows in the parent and child tables.

DataReactor is implemented as an independent Python package which can be used

as follows:

from datareactor import DataReactor

# Read the dataset at `source`, add derived columns, and write the result.
reactor = DataReactor()
reactor.transform(
    source="/path/to/dataset",
    destination="/path/to/dataset_new"
)

Figure 6-2: A basic example of how to use DataReactor.

The transform operation takes in a path to a dataset which is stored in a Meta-

data.JSON compatible format and produces a new Metadata.JSON compatible dataset

where:

1. It has added derived columns to the tables which have useful statistical prop-

erties.

2. It has updated the metadata object with the lineage of these derived columns.


For example, suppose we have a simple database containing users and transac-

tions. One of the derived columns produced by DataReactor might be an additional

column in the users table which stores the number of transactions (i.e. the number

of corresponding rows in the transactions table for each user). Other examples of

derived columns produced by DataReactor include summary statistics (e.g. the average amount spent per transaction for each user), transforms on datetime fields (e.g. extracting the day of the week), and transforms on string fields (e.g. the length of the string).

6.1.1 Software Design

At a high level, DataReactor performs the following operations:

1. Load the tables and metadata.

2. Generate derived columns for each table.

(a) This functionality is provided by the DataReactor.Atoms sub-module.

(b) Each Atom is responsible for generating a type of derived column. The

Atom API is easily extensible, allowing us to quickly add new transforms

in the future.

(c) By randomly applying these Atoms to different tables, we can generate a

wide range of derived columns.

3. Prune the derived columns to remove meaningless values.

(a) This functionality is provided by the DataReactor.Sieve sub-module.

(b) Each Sieve provides a different strategy for filtering derived columns, from

more machine learning-oriented approaches to feature selection to simpler

heuristics such as removing duplicate and constant-valued columns.

4. Write the expanded tables and updated metadata to disk.


We describe the modules responsible for steps 2 and 3 in greater detail in the

following sections; the data structures defined in DataReactor are described in Ta-

ble 6.1.

Class Name      Description

Dataset         The Dataset object is responsible for reading and writing
                datasets. It manages both the metadata and the tables. It
                also provides the ability to add derived columns to a
                dataset (by updating the metadata and target table).

DerivedColumn   The DerivedColumn object represents a derived column and
                contains (1) the name of the table it belongs to, (2) the
                actual values in the column, (3) information about the name
                and type of the column, and (4) a constraint object
                indicating the lineage of the column.

Table 6.1: Description of the Dataset and DerivedColumn data structures.

6.1.2 Atoms

In this section, we introduce the atom module which implements the logic for generat-

ing derived columns. Each atom inherits from the Atom base class (see Appendix C)

and is required to implement a derive method which takes as input a dataset and tar-

get table and returns a sequence of DerivedColumn objects containing new columns

for that table.

One simple Atom is the RowCountAtom, which creates a derived column counting the number of rows in the child table. For example, suppose there is a

foreign key relationship between users and transactions. This atom would propose a

new derived column for the users table which contains the number of transactions for

each user.
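Below is a simplified sketch of what such an atom could look like; the actual Atom base class is given in Appendix C, and the DerivedColumn stand-in here only approximates the fields listed in Table 6.1.

from dataclasses import dataclass

import pandas as pd


@dataclass
class DerivedColumn:  # simplified stand-in for the class in Table 6.1
    table: str
    values: pd.Series
    name: str
    constraint: tuple


class RowCountAtom:
    def derive(self, tables: dict, foreign_keys: list, table_name: str):
        """Yield one derived column per child table, counting matching rows."""
        parent = tables[table_name]
        # foreign_keys: (child_table, child_column, parent_table, parent_column)
        for child_table, child_col, parent_table, parent_col in foreign_keys:
            if parent_table != table_name:
                continue
            counts = tables[child_table][child_col].value_counts()
            values = parent[parent_col].map(counts).fillna(0).astype(int)
            yield DerivedColumn(
                table=table_name,
                values=values,
                name=f"nb_rows_in_{child_table}",
                constraint=("nb_rows", child_table),  # lineage information
            )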

Another example of an Atom is the FeatureToolsAtom which uses the FeatureTools

[20] library to generate features, identifies the lineage of the features, and converts

them to a derived column format. By integrating with FeatureTools, we are able

to generate a broad variety of derived columns from data type-specific transforms


(e.g. converting datetime strings into time zones) to more general aggregate transforms (e.g. identifying the mode of a categorical column).

6.1.3 Sieve

Finally, the sieve module filters the derived columns. The generative process can

result in columns that are constant-valued — for example, a transform that extracts

the year from a datetime field could always return the same result if all the data

comes from the same year — or even columns that are duplicates of one another.

The sieve module applies filters to remove these edge cases and produce a final set

of derived columns for analysis.
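A minimal pandas sketch of the two heuristic sieves mentioned above (the machine learning-oriented feature selection strategies are omitted):

import pandas as pd


def sieve(derived: pd.DataFrame) -> pd.DataFrame:
    """Drop constant-valued columns, then duplicated columns."""
    kept = derived.loc[:, derived.nunique(dropna=False) > 1]
    return kept.T.drop_duplicates().T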


Chapter 7

Experiments

In this chapter, we test each of our primitives using the datasets described in

Chapter 6.

7.1 Primary Key

7.1.1 Evaluation

As discussed in Chapter 4, this primitive is designed to discover all the primary keys

of a dataset. To assess how successful it is at achieving this goal, we will use the

following evaluation method: (1) disregard all tables that contain composite keys or

no keys at all, seeing as our current implementation returns single-column keys; (2)

on this restricted dataset, check what percentage of the predicted keys are actually

correct; (3) return the accuracy.
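The sketch below implements this evaluation, assuming dictionaries that map table names to true and predicted primary keys, with composite keys represented as tuples (a representation we assume for illustration):

def primary_key_accuracy(true_keys: dict, predicted_keys: dict) -> float:
    """Accuracy over tables whose true primary key is a single column."""
    # (1) disregard tables with composite keys (tuples) or no key (None)
    eligible = {t: k for t, k in true_keys.items() if isinstance(k, str)}
    # (2)-(3) fraction of eligible tables whose predicted key is correct
    correct = sum(predicted_keys.get(t) == k for t, k in eligible.items())
    return correct / len(eligible)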

7.1.2 Experiments

Primary keys generally follow very strict rules, as discussed in Section 5.1. These

constraints make the identification of primary keys fairly trivial, and so any reasonable

model should achieve near-perfect accuracy on most datasets. Indeed, as can be seen

in Table 7.1, our model achieves near-perfect accuracy under the current evaluation

procedure.


Dataset              Number of Tables   Accuracy   Accuracy w/o Names   Accuracy w/o Ordinals
AustralianFootball   4                  1          1                    1
Biodegradability     5                  1          1                    1
Dunur                17                 1          1                    1
Elti                 11                 1          1                    1
Hepatitis_std        7                  0.25       0.25                 1
Mesh                 29                 1          1                    1
Mooney_Family        68                 1          1                    1
NBA                  4                  1          1                    1
PremierLeague        4                  1          1                    1
Same_gen             4                  1          1                    1
Toxicology           4                  1          1                    1
UW_std               4                  1          1                    1
Walmart              4                  1          1                    1
classicmodels        8                  1          1                    1
cs                   8                  1          1                    1
imdb_MovieLens       7                  1          1                    1
pubs                 11                 1          1                    1
university           5                  1          1                    1

Table 7.1: Number of tables per dataset, as well as the accuracy of our implementation of the Primary Key Discovery primitive in three situations: (1) normally, (2) without utilizing the column names, and (3) without utilizing the position of the columns.

To test whether our model was actually looking at the data rather than overly

depending on the metadata (seeing as around 40% of the column names contain

indicators of the nature of the column, such as id or key), we removed the column

names from the previous datasets and tested again. As we can see in Table 7.1, our

results are invariant, with only one poor performance in the eighteen datasets. This

outcome is positive, since it seems to indicate our model is not overly reliant on the

metadata.

We also tested how well our model performs without the ordinal feature (i.e.

without knowing which order the columns are in). As the table shows, the model

achieved perfect performance under this circumstance, which is a little surprising

considering it outperforms the original model. Upon further investigation, the issue

seems to be that, for almost all training samples, the primary key is the first column,

which causes the model to overestimate the importance of this feature.


7.2 Foreign Key

7.2.1 Evaluation

As discussed in Section 4.3, the Foreign Key Discovery primitive attempts to discover

all the foreign keys of a dataset. As with the Primary Key Discovery primitive, we have to design a metric to evaluate how successful the model is. In this case, though,

accuracy is not a reasonable choice of evaluation, as some tables may contain multiple

foreign keys while others have none, which wouldn’t be properly captured by such a

simple metric. Instead, we will use precision, recall and F-measure for evaluation, as

recommended in [29].

We compute these values as follows. First, recall that the Foreign Key Discovery

primitive returns a sorted list of foreign keys, from most to least likely. Therefore

we can select the first X% most likely foreign keys which our algorithm predicted,

for various values of X. Then we compute the precision, recall and F-measure of our

method for each value of X, and finally select the best F-measure achieved, as well as

the corresponding value for X.

To clarify this idea, let us go through the example in Table 7.2. Imagine a dataset

containing three foreign key mappings, A, B and C. Now imagine the Foreign Key

Discovery module predicts the following possible foreign keys, sorted from most to

least likely: A, D, B, C, E. To evaluate the performance of the model, we will select

the top 20% of our predictions (i.e. we believe A is the only foreign key

relationship in the dataset) and compute the recall, precision and F-measure for it.

Predicted   X      Recall   Precision   F-Measure
A           20%    0.33     1.00        0.50
D           40%    0.33     0.50        0.40
B           60%    0.66     0.66        0.66
C           80%    1.00     0.75        0.86
E           100%   1.00     0.60        0.80

Table 7.2: Example of the Foreign Key Discovery evaluation method, when the ground truth is A, B, C.


We then repeat this process for multiple values of X and select the best-performing result, which in this example is an F-measure of 0.86, at X = 80%.
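A short sketch of this sweep, which reproduces Table 7.2 for the running example:

def best_f_measure(ranked_predictions: list, true_keys: set):
    """Sweep over X and return the best (F-measure, X) pair."""
    best = (0.0, 0.0)
    for i in range(1, len(ranked_predictions) + 1):
        selected = set(ranked_predictions[:i])
        true_positives = len(selected & true_keys)
        precision = true_positives / len(selected)
        recall = true_positives / len(true_keys)
        if precision + recall > 0:
            f_measure = 2 * precision * recall / (precision + recall)
            best = max(best, (f_measure, i / len(ranked_predictions)))
    return best


print(best_f_measure(["A", "D", "B", "C", "E"], {"A", "B", "C"}))
# -> (0.857..., 0.8): an F-measure of 0.86 at X = 80%, as in the text above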

7.2.2 Experiments

Dataset              Foreign Keys   F-Measure   Precision   Recall
AustralianFootball   5              1.00        1.00        1.00
Biodegradability     5              1.00        1.00        1.00
Dunur                32             1.00        1.00        1.00
Elti                 20             1.00        1.00        1.00
Hepatitis_std        6              1.00        1.00        1.00
Mesh                 33             0.43        0.77        0.30
Mooney_Family        134            0.70        0.61        0.83
NBA                  5              0.83        0.71        1.00
PremierLeague        5              0.75        1.00        0.60
Same_gen             6              1.00        1.00        1.00
Toxicology           5              0.75        1.00        0.60
UW_std               4              0.86        1.00        0.75
Walmart              3              1.00        1.00        1.00
classicmodels        8              0.73        1.00        0.57
cs                   10             0.74        0.78        0.70
imdb_MovieLens       6              1.00        1.00        1.00
pubs                 10             0.82        0.75        0.90
university           4              1.00        1.00        1.00

Table 7.3: Number of foreign keys per dataset, as well as the F-measure, precision and recall of our implementation of the Foreign Key Discovery primitive when tested against eighteen datasets, using the evaluation method described in Section 7.2.1.

Foreign keys are allowed much more flexibility than primary keys. Each table can

have multiple keys or none at all, the column names are much less indicative than

for primary keys, and their overall structure is more complex. As such, discovering foreign keys is significantly harder than discovering primary keys. Our results for this task, when using the evaluation method previously discussed, can be seen in Table 7.3.

Overall, the model achieves an average F-measure of 0.87, an average precision of

0.92, and an average recall of 0.85. One thing to note is that the precision and recall are

fairly similar to each other, which means we chose an appropriate value for X, seeing

as, in general, a higher X increases recall and decreases precision, while a lower X


does the opposite. A balance between the two values is desirable given the harmonic

nature of the F-measure.

When comparing our results in Table 7.3 with the structure of the datasets in

Table B.1, we notice they are highly correlated. For example, if we limit our results

to the 13 datasets containing fewer than 10 tables, our average F-measure, precision

and recall become 0.90, 0.96, 0.86, respectively, which is a measurable improvement.

This superior result on smaller datasets implies an inferior result on the more complex ones, something we will continue to work on in future versions of Tracer.

7.3 Column Map

7.3.1 Evaluation

For the final primitive, the goal is to trace all the columns from the dataset that

helped generate some target column. In order to evaluate the success rate of this

module we will use a fixed threshold value of 0. This means we select all column

maps predicted by our algorithm with feature importance larger than 0. Then we

simply compute the precision, recall and F-measure and report our results.
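A sketch of this evaluation for a single target column, assuming a dictionary of per-column importance scores and the set of true parent columns:

def column_map_scores(importances: dict, true_parents: set):
    """Precision, recall and F-measure at a fixed importance threshold of 0."""
    predicted = {col for col, score in importances.items() if score > 0}
    true_positives = len(predicted & true_parents)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_parents) if true_parents else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)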

7.3.2 Experiments

To evaluate the Column Map Discovery primitive we will need to utilize DataReactor.

As explained in Chapter 6, DataReactor takes a relational dataset and intelligently

generates derived columns. These derived columns are a function of other columns in

their respective datasets, so we can use them to test Column Map Discovery. For the

following experiment, DataReactor was used to generate two types of target columns:

“nb_rows” which is based on foreign key relationships and corresponds to the number

of rows in the child table for each of the keys in the parent table, and “add_numerical”

which corresponds to a random linear combination of the numerical columns in the

target table.

Target columns generated by DataReactor follow a specific naming convention.


To understand it, consider an example where DataReactor generated a target col-

umn called “molecule.nb_rows_in_atom”. This means the column was derived from

some column in table “atom” through transformation “nb_rows”. This transforma-

tion counts the number of rows in the “atom” table which belong to each row of the “molecule” table, where the “atom” table is related to the “molecule” table through a foreign key

relationship.

We test the Column Map Discovery primitive by using eleven datasets from the

Relational Dataset Repository [24], with target columns generated by DataReactor

and evaluation as described in Section 7.3.1. Our full results are reported in Table C.1.

We also aggregate the results by dataset (Table 7.4) and by transformation (Table

7.5). Notice that the reported precision and recall are averages of the per-column values, while the F-measure is computed from these two averages.

Dataset              F-Measure   Precision   Recall
AustralianFootball   0.69        0.61        0.79
Biodegradability     0.60        0.43        1.00
Dunur                0.29        0.17        1.00
Elti                 0.31        0.18        1.00
Hepatitis_std        0.61        0.51        0.77
Mesh                 0.62        0.45        1.00
Mooney_Family        0.05        0.03        1.00
NBA                  0.23        0.13        0.93
Same_gen             0.50        0.34        1.00
Toxicology           0.35        0.21        1.00
UW_std               0.67        0.53        0.90

Table 7.4: This table shows the performance of our current implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for each dataset.

Transformation   F-Measure   Precision   Recall
add_numerical    0.78        0.76        0.79
nb_rows          0.38        0.23        1.00

Table 7.5: This table shows the F-measure, precision and recall of our implementation of the Column Map Discovery primitive. These results were obtained by aggregating the values from Table C.1 for the two transformations generated by DataReactor.


As shown in Table 7.4, our current implementation of the Column Map Discovery

primitive achieves an average F-measure of 0.45, an average precision of 0.33, and an

average recall of 0.94. Given the extensibility of our system, we expect these numbers

to improve as additional transformations/primitives are implemented in future versions

of Tracer.

Furthermore, we note that the F-measure is only weakly correlated with the real-

world utility provided by our system. With our system, a user can look at the top X% of the column maps suggested by our model to quickly figure out the lineage of

the target column. This is a significant improvement over manually digging through

thousands of combinations of columns.


Chapter 8

Conclusion

In this thesis, we proposed Tracer, a novel machine learning approach to data lineage discovery. We presented a framework that accomplishes end-to-end lineage tracing by separating the tracing process into its primitive components.

While the work we have accomplished is extensive, there is much that can be

improved. At present, we mostly assume columns and rows to be completely separate from each other, while in reality the lineage of each item in a table is highly correlated with that of other items. Rewriting our primitives to exploit this fact should speed up the process considerably.

Tracer also does not incorporate past lineage searches into future computations. Given that data from the same company often shares many of the same patterns, a specific module to aggregate and analyze these searches could be an interesting line of research.

Another topic that needs further work is the addition of new primitives. Cur-

rently Tracer only contains the most basic primitive modules, and utilizing current and future research to write new and more sophisticated primitives is essential

for Tracer to achieve its full potential. One simple example is in the Primary Key

Discovery module. Currently Tracer does not support composite keys, even though

multi-column primary keys can be commonly found in relational datasets.

Better usage of metadata and data annotations would also improve Tracer. While

Tracer does make use of column names for key discovery, the Column Map Discov-

ery module does not. Further exploration of how we can leverage natural language


processing techniques for this problem is essential for obtaining better results.

Another potential line of research centers on multi-parent tables. For example,

suppose we have a users table, a locations table, and a companies table. Each user

works at a company and has a location. Currently, Tracer computes the lineage for these tables separately: first users/locations, then users/companies. Taking both

tables into consideration at the same time could allow for more accurate results, and

should be implemented into a future version of the project.

Finally, another aspect that could be further improved is how we can handle loops

in datasets. Taking the example from above, with a users table, a locations table,

and a companies table, it is possible that each user works at a company, each user has a location, and each company has a location. This forms a loop, which is a situation we

have not considered sufficiently, and may require further research.

By adding these additional features to the Tracer project, we can further enhance

the system and enable much more general tracing than provided by alternatives.


Appendix A

Column Map without Foreign Keys

Key   Cost 1   Cost 2
?     100      200
?     50       150
?     0        100

Key   Total
?     100
?     200
?     300

Figure A-1: This figure shows an example of the functionality of the Column Map Discovery primitive, where the module learns that “Total” was generated by adding “Cost 1” and “Cost 2” without knowing the foreign keys.

In this Appendix we describe an alternate algorithm for the Column Map Dis-

covery primitive which discovers the ColumnMap without using ForeignKeys. Notice

that ForeignKeys is still provided, since it is the standard input of the primitive,

but it is not used. This module is useful in situations where the foreign keys can-

not be trusted, so we need to derive ColumnMap directly from Tables. Figure A-1

demonstrates a situation where this module might be useful. This method is purely

theoretical and is not currently implemented in Tracer.

To solve this problem, we propose an iterative procedure which alternates between

(1) learning a distribution over row mappings based on the optimal column mapping

and (2) learning the optimal column mapping based on the distribution over row

mappings. Under weak assumptions about the types of transformations that can be

applied to the data, we can show that, after each iteration, the distribution over the

row mapping converges toward the true row mapping.


At a high level, the iterative estimation procedure works as follows:

1. Sample from a distribution over row mappings.

2. Assuming the row mapping is correct, find the optimal column mapping and

find the loss.

3. If the loss is lower than average, update the distribution over row mappings to

increase the probability of each (input, output) pair in the mapping.

4. Otherwise, update the distribution over row mappings to decrease the proba-

bility of each (input, output) pair in the mapping.

5. Repeat.

In order for this procedure to work, we need to make a few assumptions. Primarily,

we need to assume that the loss associated with the optimal column mapping is a

decreasing function of the number of correctly matched rows. In other words, if our

row mapping has more correct matches, then our model will be more successful at

learning the relationship between the inputs and outputs.

Formally, we can use a regularization-based approach to solve the problem, where our goal is to find the row mapping M and subset of columns S which minimize the loss

\[ \operatorname*{argmin}_{M,S} \left[ \min_{F} \lVert F(X[S]) - M(y) \rVert^{2} \right] + \lambda \lVert S \rVert_{0} \tag{A.1} \]

where S is a subset of columns and F is an arbitrary machine learning model.

The biggest obstacle to implementing this procedure is figuring out how to effi-

ciently sample from and iteratively update the distribution over row mappings. In

general, exact solutions to the bipartite matching problem (e.g. finding the maximum

likelihood between two sets of rows) is intractable since given an input and output

table with 𝑁 rows, there are 𝑁 ! possible row mappings. However, given a solution

for sampling from and iteratively updating a distribution over row mappings [27], we

can then concurrently estimate both the row mappings and the column mappings for


a pair of tables, allowing us to identify the set of input cells which contributed to a

particular output cell.


Appendix B

Relational Datasets

Dataset Name Size (MB) Num of Tables Temporal? Numeric? String?

Accidents 234.5 3 Yes Yes Yes

AdventureWorks 233.8 71 Yes Yes Yes

Airline 464.5 19 Yes Yes Yes

Atherosclerosis 3.2 4 No Yes Yes

AustralianFootball 34.1 4 Yes Yes Yes

BasketballMen 18.3 9 No Yes Yes

BasketballWomen 2.3 8 No Yes Yes

Biodegradability 3.7 5 No Yes Yes

Bupa .3 9 No Yes Yes

Carcinogenesis 21 6 No Yes Yes

CCS 15.9 6 Yes Yes Yes

Chess .3 2 Yes Yes Yes

ClassicModels .5 8 Yes Yes Yes

CORA 4.6 3 No Yes Yes

Countries 8.6 4 No Yes Yes

Credit 317.9 9 Yes Yes Yes

CS .3 8 Yes Yes Yes

DCG .3 2 No Yes Yes


Dunur .8 17 No Yes Yes

Elti .6 11 No Yes Yes

Employee 197.4 6 Yes Yes Yes

Facebook 2 2 No Yes No

Financial 78.8 8 Yes Yes Yes

FNHK 129.8 3 Yes Yes Yes

FTP 7.5 2 Yes Yes Yes

Geneea 61.4 19 Yes Yes Yes

Genes 1.8 3 No Yes Yes

Hepatitis 2.2 7 No Yes Yes

Hockey 15.6 22 No Yes Yes

IMDb 477.1 7 No Yes Yes

KRK .1 1 No Yes Yes

Lahman 74.1 25 No Yes Yes

LegalActs 240.2 5 Yes Yes Yes

Mesh 1.1 29 No No Yes

Mondial 3.2 40 Yes Yes Yes

MooneyFamily 3.2 68 No No Yes

MovieLens 154.9 7 No Yes Yes

Musk .4 2 No Yes Yes

Mutagenesis .9 3 No Yes Yes

Nations 1.2 3 No Yes Yes

NBA .3 4 Yes Yes Yes

NCAA 35.8 9 Yes Yes Yes

Northwind 1.1 29 Yes Yes Yes

Pima .7 9 No Yes Yes

PremiereLeague 11.3 4 Yes Yes Yes

PTC 8.1 4 No No Yes

PTE 4.4 38 No Yes Yes

Pubs .4 11 Yes Yes Yes


Pyrimidine ? 2 No Yes No

Restbase 3 3 No Yes Yes

Sakila 6.5 16 Yes Yes Yes

SalesDB 584.3 4 No Yes Yes

SameGen .3 4 No Yes Yes

SAP 246.9 5 Yes Yes Yes

SAT 2.7 37 No Yes Yes

Seznam 146.8 4 Yes Yes Yes

Stats 658.4 8 Yes Yes Yes

StudentLoan .9 10 No Yes Yes

Thrombosis 1.9 3 Yes Yes Yes

TPCC 165.8 9 Yes Yes Yes

TPCD 2500 8 Yes Yes Yes

TPCDS 4800 24 Yes Yes Yes

TPCH 2000 8 Yes Yes Yes

Trains .1 2 No Yes Yes

Triazine .2 2 No Yes No

University .3 5 No Yes Yes

UTube .2 2 No Yes Yes

UW-CSE .2 4 No Yes Yes

VisualGenome 198.1 6 No Yes Yes

VOC 4.7 8 Yes Yes Yes

Walmart 167.3 4 Yes Yes Yes

WebKP 12.8 3 No No Yes

World .7 3 No Yes Yes

Table B.1: This table provides some details regarding the 73 relational datasets from

[24].


Appendix C

Column Map Discovery Performance

Dataset Target Column F1 Prec. Rec.

AustralianFootball matches.add_numerical 0.75 1.00 0.60

AustralianFootball players.add_numerical 0.50 1.00 0.33

AustralianFootball players.nb_rows_in_match_stats 1.00 1.00 1.00

AustralianFootball teams.nb_rows_in_match_stats 0.03 0.02 1.00

AustralianFootball teams.nb_rows_in_matches 0.03 0.02 1.00

Biodegradability atom.nb_rows_in_bond 0.40 0.25 1.00

Biodegradability atom.nb_rows_in_gmember 0.67 0.50 1.00

Biodegradability bond.add_numerical 1.00 1.00 1.00

Biodegradability group.nb_rows_in_gmember 0.67 0.50 1.00

Biodegradability molecule.add_numerical 0.36 0.22 1.00

Biodegradability molecule.nb_rows_in_atom 0.20 0.11 1.00

Dunur person.nb_rows_in_aunt 0.15 0.08 1.00

Dunur person.nb_rows_in_brother 0.22 0.13 1.00

Dunur person.nb_rows_in_daughter 0.17 0.09 1.00

Dunur person.nb_rows_in_dunur 0.25 0.14 1.00

Dunur person.nb_rows_in_father 0.33 0.20 1.00

Dunur person.nb_rows_in_husband 0.29 0.17 1.00

Dunur person.nb_rows_in_husband2 0.22 0.13 1.00


Dunur person.nb_rows_in_mother 0.33 0.20 1.00

Dunur person.nb_rows_in_nephew 0.14 0.08 1.00

Dunur person.nb_rows_in_niece 0.18 0.10 1.00

Dunur person.nb_rows_in_sister 0.20 0.11 1.00

Dunur person.nb_rows_in_son 0.18 0.10 1.00

Dunur person.nb_rows_in_target 0.04 0.02 1.00

Dunur person.nb_rows_in_uncle 0.18 0.10 1.00

Dunur person.nb_rows_in_wife 0.22 0.13 1.00

Dunur person.nb_rows_in_wife2 0.20 0.11 1.00

Dunur target.add_numerical 1.00 1.00 1.00

Elti person.nb_rows_in_brother 0.11 0.06 1.00

Elti person.nb_rows_in_daughter 0.33 0.20 1.00

Elti person.nb_rows_in_elti 0.13 0.07 1.00

Elti person.nb_rows_in_father 0.12 0.06 1.00

Elti person.nb_rows_in_husband 0.25 0.14 1.00

Elti person.nb_rows_in_mother 0.09 0.05 1.00

Elti person.nb_rows_in_sister 0.12 0.06 1.00

Elti person.nb_rows_in_son 0.33 0.20 1.00

Elti person.nb_rows_in_target 0.10 0.05 1.00

Elti person.nb_rows_in_wife 0.25 0.14 1.00

Elti target.add_numerical 1.00 1.00 1.00

Hepatitis_std Bio.add_numerical 0.46 0.30 1.00

Hepatitis_std Bio.nb_rows_in_rel11 0.18 0.10 1.00

Hepatitis_std dispat.add_numerical 0.25 0.25 0.25

Hepatitis_std dispat.nb_rows_in_rel11 0.25 0.14 1.00

Hepatitis_std dispat.nb_rows_in_rel12 0.13 0.07 1.00

Hepatitis_std dispat.nb_rows_in_rel13 0.40 0.25 1.00

Hepatitis_std indis.add_numerical 0.33 0.50 0.25

Hepatitis_std inf.add_numerical 0.67 1.00 0.50

Hepatitis_std rel11.add_numerical 0.67 1.00 0.50


Hepatitis_std rel12.add_numerical 1.00 1.00 1.00

Hepatitis_std rel13.add_numerical 1.00 1.00 1.00

Mesh element.nb_rows_in_circuit 0.67 0.50 1.00

Mesh element.nb_rows_in_circuit_hole 0.50 0.33 1.00

Mesh element.nb_rows_in_cont_loaded 0.67 0.50 1.00

Mesh element.nb_rows_in_equal 0.50 0.33 1.00

Mesh element.nb_rows_in_fixed 0.67 0.50 1.00

Mesh element.nb_rows_in_free 0.67 0.50 1.00

Mesh element.nb_rows_in_half_circuit 0.67 0.50 1.00

Mesh element.nb_rows_in_half_circuit_hole 0.67 0.50 1.00

Mesh element.nb_rows_in_llong 0.50 0.33 1.00

Mesh element.nb_rows_in_long_for_hole 0.40 0.25 1.00

Mesh element.nb_rows_in_mesh 0.13 0.07 1.00

Mesh element.nb_rows_in_mesh_test 0.13 0.07 1.00

Mesh element.nb_rows_in_mesh_test_Neg 0.14 0.08 1.00

Mesh element.nb_rows_in_neighbour_xy 0.67 0.50 1.00

Mesh element.nb_rows_in_neighbour_yz 0.50 0.33 1.00

Mesh element.nb_rows_in_neighbour_zx 0.67 0.50 1.00

Mesh element.nb_rows_in_noload 0.67 0.50 1.00

Mesh element.nb_rows_in_notimportant 0.67 0.50 1.00

Mesh element.nb_rows_in_one_side_fixed 0.67 0.50 1.00

Mesh element.nb_rows_in_one_side_loaded 0.67 0.50 1.00

Mesh element.nb_rows_in_opposite 0.67 0.50 1.00

Mesh element.nb_rows_in_quarter_circuit 0.50 0.33 1.00

Mesh element.nb_rows_in_short_for_hole 0.67 0.50 1.00

Mesh element.nb_rows_in_sshort 0.67 0.50 1.00

Mesh element.nb_rows_in_two_side_fixed 0.67 0.50 1.00

Mesh element.nb_rows_in_two_side_loaded 1.00 1.00 1.00

Mesh element.nb_rows_in_usual 0.67 0.50 1.00

Mesh mesh_test_Neg.add_numerical 1.00 1.00 1.00


Mesh mesh_test.add_numerical 0.67 0.50 1.00

Mesh mesh.add_numerical 1.00 1.00 1.00

Mesh nnumber.add_numerical 0.29 0.17 1.00

Mesh nnumber.nb_rows_in_mesh 0.29 0.17 1.00

Mooney_Family person.nb_rows_in_aunt 0.02 0.01 1.00

Mooney_Family person.nb_rows_in_brother 0.05 0.03 1.00

Mooney_Family person.nb_rows_in_daughter 0.08 0.04 1.00

Mooney_Family person.nb_rows_in_father 0.04 0.02 1.00

Mooney_Family person.nb_rows_in_husband 0.08 0.04 1.00

Mooney_Family person.nb_rows_in_mother 0.04 0.02 1.00

Mooney_Family person.nb_rows_in_nephew 0.02 0.01 1.00

Mooney_Family person.nb_rows_in_niece 0.02 0.01 1.00

Mooney_Family person.nb_rows_in_sister 0.03 0.02 1.00

Mooney_Family person.nb_rows_in_son 0.11 0.06 1.00

Mooney_Family person.nb_rows_in_uncle 0.02 0.01 1.00

Mooney_Family person.nb_rows_in_wife 0.08 0.04 1.00

NBA Actions.add_numerical 0.40 0.40 0.40

NBA Game.add_numerical 0.05 0.03 1.00

NBA Game.nb_rows_in_Actions 0.04 0.02 1.00

NBA Player.add_numerical 0.40 0.25 1.00

NBA Player.nb_rows_in_Actions 0.50 0.33 1.00

NBA Team.add_numerical 0.01 0.00 1.00

NBA Team.nb_rows_in_Actions 0.02 0.01 1.00

NBA Team.nb_rows_in_Game 0.02 0.01 1.00

Same_gen person.nb_rows_in_parent 0.17 0.09 1.00

Same_gen person.nb_rows_in_same_gen 0.25 0.14 1.00

Same_gen person.nb_rows_in_target 0.20 0.11 1.00

Same_gen target.add_numerical 1.00 1.00 1.00

Toxicology atom.nb_rows_in_connected 0.50 0.33 1.00

Toxicology molecule.nb_rows_in_atom 0.29 0.17 1.00


Toxicology molecule.nb_rows_in_bond 0.25 0.14 1.00

UW_std advisedBy.add_numerical 0.80 0.67 1.00

UW_std course.add_numerical 1.00 1.00 1.00

UW_std course.nb_rows_in_taughtBy 0.22 0.13 1.00

UW_std person.add_numerical 0.50 1.00 0.33

UW_std person.nb_rows_in_advisedBy 0.29 0.17 1.00

UW_std person.nb_rows_in_taughtBy 0.15 0.08 1.00

UW_std taughtBy.add_numerical 0.80 0.67 1.00

Table C.1: This table shows the F-measure, precision and recall of our implementation

of the Column Map Discovery primitive when tested against eleven relational datasets

using the evaluation method proposed in Section 7.3.1.


Bibliography

[1] CCPA. https://oag.ca.gov/privacy/ccpa. Accessed: 2020-05-04.

[2] GDPR. https://gdpr-info.eu/. Accessed: 2020-05-04.

[3] Metadata.JSON. https://github.com/HDI-Project/MetaData.json. Accessed: 2020-05-04.

[4] RDT. https://sdv-dev.github.io/RDT/. Accessed: 2020-05-14.

[5] TPC. http://www.tpc.org/default5.asp. Accessed: 2020-05-03.

[6] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. Proc. of VLDB 2006 (demonstration description), 2006.

[7] Michael Backes, Niklas Grimm, and Aniket Kate. Data lineage in malicious environments. IEEE Transactions on Dependable and Secure Computing, 13(2):178-191, 2015.

[8] Omar Benjelloun, Anish Das Sarma, Alon Halevy, and Jennifer Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of the 32nd international conference on Very large data bases, pages 953-964. VLDB Endowment, 2006.

[9] Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An annotation management system for relational databases. The VLDB Journal, 14(4):373-396, 2005.

[10] Peter Buneman, Adriane Chapman, and James Cheney. Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 539-550, 2006.

[11] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Data provenance: Some basic issues. In International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 87-93. Springer, 2000.

[12] Peter Buneman, Sanjeev Khanna, and Tan Wang-Chiew. Why and where: A characterization of data provenance. In International conference on database theory, pages 316-330. Springer, 2001.

[13] Yingwei Cui and Jennifer Widom. Practical lineage tracing in data warehouses. In Proceedings of 16th International Conference on Data Engineering (Cat. No. 00CB37073), pages 367-378. IEEE, 2000.

[14] Yingwei Cui and Jennifer Widom. Lineage tracing for general data warehouse transformations. The VLDB Journal, 12(1):41-58, 2003.

[15] Yingwei Cui, Jennifer Widom, and Janet L Wiener. Tracing the lineage of view data in a warehousing environment. ACM Transactions on Database Systems (TODS), 25(2):179-227, 2000.

[16] Laura Gustafson. Bayesian tuning and bandits: an extensible, open source library for AutoML. PhD thesis, Massachusetts Institute of Technology, 2018.

[17] Thomas Heinis and Gustavo Alonso. Efficient lineage tracking for scientific workflows. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1007-1018, 2008.

[18] Jiansheng Huang, Ting Chen, AnHai Doan, and Jeffrey F Naughton. On the provenance of non-answers to queries over extracted data. Proceedings of the VLDB Endowment, 1(1):736-747, 2008.

[19] Robert Ikeda and Jennifer Widom. Data lineage: A survey. Technical report, Stanford InfoLab, 2009.

[20] James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1-10. IEEE, 2015.

[21] Arunprasad P Marathe. Tracing lineage of array data. Journal of Intelligent Information Systems, 17(2-3):193-214, 2001.

[22] Patrick McDaniel. Data provenance and security. IEEE Security & Privacy, 9(2):83-85, 2011.

[23] JP Moresmau, Fabian Schyns, Uta Sommerweiss, Lothar Grabowsky, Jens-Uwe Richter, Henric Gomes, Gerald Csapo, Karsten Baensch, Gunter Wiedemer, Micha Treiber, et al. Intelligent metadata management and data lineage tracing, September 21 2017. US Patent App. 15/457,808.

[24] Jan Motl and Oliver Schulte. The CTU Prague relational learning repository. arXiv preprint arXiv:1511.03086, 2015.

[25] Christopher Ré and Dan Suciu. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment, 1(1):797-808, 2008.

[26] Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. A machine learning approach to foreign key discovery. In WebDB, 2009.

[27] Maksims Volkovs and Richard S Zemel. Efficient sampling for bipartite matching problems. In Advances in Neural Information Processing Systems, pages 1313-1321, 2012.

[28] Allison Woodruff and Michael Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In Proceedings 13th International Conference on Data Engineering, pages 91-102. IEEE, 1997.

[29] Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M Procopiuc, and Divesh Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment, 3(1-2):805-814, 2010.
