
Data Curation with Deep Learning

Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, AnHai Doan‡

Qatar Computing Research Institute, HBKU; ‡University of Wisconsin-Madison
{sthirumuruganathan, ntang, mouzzani}@hbku.edu.qa, [email protected]

ABSTRACT
Data curation – the process of discovering, integrating, and cleaning data – is one of the oldest, hardest, yet unavoidable data management problems. Despite decades of effort from both researchers and practitioners, it remains one of the most time-consuming and least enjoyable parts of a data scientist's work. In most organizations, data curation plays an essential role in fully unlocking the value of big data. Unfortunately, current solutions are not keeping up with the ever-changing data ecosystem, because they often carry a substantially high human cost. Meanwhile, deep learning is making strides, achieving remarkable successes in multiple areas such as image recognition, natural language processing, and speech recognition. In this vision paper, we explore how some of the fundamental innovations in deep learning could be leveraged to improve existing data curation solutions and to help build new ones. We identify interesting research opportunities and dispel common myths. We hope that the synthesis of these two important domains will unleash a series of research activities that will lead to significantly improved solutions for many data curation tasks.

1 INTRODUCTION
Data Curation (DC) [32, 44] – the process of discovering, integrating [46], and cleaning data (Figure 1) for downstream analytic tasks – is critical for any organization to extract real business value from its data. The following are the most important problems that have been extensively studied by the database community under the umbrella of data curation: data discovery, the process of identifying relevant data for a specific task; schema matching and schema mapping, the processes of identifying similar columns and learning the transformation between matched columns, respectively; entity resolution, the problem of identifying pairs of tuples that denote the same entity; and data cleaning, the process of identifying errors in the data and possibly repairing them. These are in addition to problems related to outlier detection, data imputation, and data dependencies.

Due to the importance of data curation, there have been many commercial solutions (for example, Tamr [45] and Trifacta [23]) and academic efforts covering all aspects of DC, including data discovery [8, 19, 33, 36], data integration [12, 27], and data cleaning [7, 16, 42]. An oft-cited statistic is that data scientists spend 80% of their time curating their data [8]. Most DC solutions cannot be fully automated, as they are often ad hoc and require substantial effort to generate artifacts such as features and labeled data, which are used to synthesize rules and train machine learning models. Practitioners need practical and usable solutions that can significantly reduce this human cost.

© 2020 Copyright held by the owner/author(s). Published in Proceedings of the 23rd International Conference on Extending Database Technology (EDBT), March 30-April 2, 2020, ISBN 978-3-89318-083-7 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

Deep Learning (DL) is a successful paradigm within machine learning (ML) that has achieved significant successes in diverse areas, such as computer vision, natural language processing, speech recognition, genomics, and many more. The trifecta of big data, better algorithms, and faster processing power has resulted in DL achieving outstanding performance in many areas. Due to these successes, there has been extensive new research seeking to apply DL to other areas, both inside and outside of computer science.

Data Curation Meets Deep Learning. In this article, we investigate intriguing research opportunities for answering the following questions:

• What does it take to significantly advance a challenging area such as DC?
• How can we leverage techniques from DL for DC?
• Given the many DL research efforts, how can we identify the most promising leads that are most relevant to DC?

We believe that DL brings unique opportunities for DC, which our community must seize upon. We have identified two fundamental ideas with great potential for solving challenging DC problems.

Feature (or Representation) Learning. For both ML-based and non-ML-based DC solutions, domain experts are heavily involved in feature engineering and data understanding so as to define data quality rules. Representation learning is a fundamental idea in DL where appropriate features for a given task are automatically learned instead of being manually crafted. Developing DC-specific representation learning algorithms could dramatically alleviate many of the frustrations domain experts face when solving DC problems. The learned representations must be generic enough that they can be used for multiple DC tasks.

DL Architectures for DC. So far, no DL architecture exists that is cognizant of the characteristics of DC tasks, such as representations for tuples or columns, integrity constraints, and so on. Naively applying existing DL architectures may work well for some, but certainly not all, DC tasks. Instead, designing new DL architectures that are tailored for DC and are cognizant of the characteristics of DC tasks would allow efficient learning, both in terms of training time and the amount of required training data. Given the success of domain-specific DL architectures in computer vision and natural language processing, it behooves us to design an equivalent DL architecture customized for DC tasks.

Contributions and Roadmap. The major elements of our proposed approach to exploiting the huge potential offered by DL to tackle key DC challenges include:

• Representation Learning for Data Curation. We describe several research directions for building representations that are explicitly designed for DC. Developing algorithms for representation learning that can work on multiple modalities of data (structured, unstructured, graphical) and on diverse DC tasks (such as deduplication, error detection, and data repair) is challenging.


Figure 1: A Data Curation Pipeline (data discovery, data integration, and data cleaning, covering tasks such as table search, schema matching, schema mapping, entity resolution, data transformation, and data imputation, turning raw data into business value)

• Deep Learning Architectures for Data Curation Tasks. We identify some of the most common tasks in DC and propose preliminary solutions using DL. We also highlight the importance of designing DC-specific DL architectures.
• Early Successes of DL for DC. In the last couple of years, there have been some promising developments in applying DL to DC problems such as data discovery [19] and entity resolution [14, 35]. We provide a brief overview of these works and the lessons learned.
• Taming DL's Hunger for Data. DL often requires a large amount of training data. However, there are multiple promising techniques which, when adapted for DC, could dramatically reduce the required amount of labeled data.
• Myths and Concerns about DL. DL is still a relatively new concept, especially for database applications, and is not a silver bullet. We thus identify and address a number of myths and concerns about DL.

2 DEEP LEARNING FUNDAMENTALS
We provide a quick overview of the DL concepts needed for the later sections. Please refer to [5, 20, 29] for additional details.

Deep learning is a subfield of ML that seeks to learn meaningful representations from data using computational models composed of multiple processing layers. These representations can then be used to solve the task at hand effectively. The most commonly used DL models are neural networks with many hidden layers. The key insight is that successive layers in such a "deep" neural network learn increasingly useful representations of the data. Intuitively, the input layer takes in the raw features; as the data is forwarded and transformed by each layer, increasingly sophisticated and more meaningful information (representations) is extracted. This incremental manner in which increasingly sophisticated representations are learned layer by layer is one of the defining characteristics of DL.

2.1 Deep Learning Architectures
Fully-connected Neural Networks. The simplest model is a neural network with many hidden layers – a series of layers where each node in a given layer is connected to every node in the next layer. Such networks are also called feed-forward neural networks. They can learn relationships between any two input features or intermediate representations. This generality, however, comes at a cost: one has to learn a large number of weights and parameters, which requires a lot of training data.

Domain-Specific Architectures. A number of neural network architectures are designed to leverage domain-specific characteristics. For example, Convolutional Neural Networks (CNNs) are widely used by the computer vision community. The input is fed through convolutional layers, where neurons connect only to nearby neurons in the next layer (instead of all neurons connecting to all neurons). CNNs excel at identifying spatially local patterns and use them to learn spatial hierarchies such as nose → face → human. Recurrent Neural Networks (RNNs) are widely used in Natural Language Processing (NLP) and speech recognition. An RNN processes its input as a sequence, one step at a time. For example, given the two words "data curation", it first handles "data" and then "curation". Neurons in an RNN receive information not only from the previous layer but also from their own previous pass, which makes them well suited to sequential data such as language.
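To make the contrast concrete, the following minimal sketch (ours, not from the paper; PyTorch assumed, all layer sizes illustrative) defines a generic fully-connected network and a sequence-aware LSTM side by side.

```python
# Minimal sketch: fully-connected (generic) vs. recurrent (sequence-aware) models.
import torch
import torch.nn as nn

# Fully-connected (feed-forward): every unit connects to every unit in the next layer.
mlp = nn.Sequential(
    nn.Linear(300, 128),  # raw input features -> hidden representation
    nn.ReLU(),
    nn.Linear(128, 2),    # task-specific output (e.g., match / non-match)
)

# Recurrent: processes a sequence one step at a time, reusing its hidden state.
rnn = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)

x_flat = torch.randn(8, 300)       # 8 examples, 300 features each
x_seq = torch.randn(8, 5, 300)     # 8 sequences of 5 tokens, 300-dim embeddings
logits = mlp(x_flat)               # shape (8, 2)
outputs, (h_n, c_n) = rnn(x_seq)   # h_n summarizes each sequence
```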

2.2 Distributed Representations
Deep learning is based on learning data representations, and the concept of distributed representations (a.k.a. embeddings) is central to DL. Distributed representations [24] represent one object by many representational elements (i.e., many neurons), and each neuron is associated with more than one represented object. That is, the neurons represent features of objects. Distributed representations are very expressive and can capture semantic similarity. These advantages become especially relevant in domains such as DC. There has been extensive work on developing algorithms for learning distributed representations for different types of entities, such as words and graph nodes.

Distributed Representations of Words (a.k.a. word embeddings) map individual words to a vector space, which helps DL algorithms achieve better performance in NLP tasks by grouping similar words. The dimensionality is often fixed (such as 300) and the representation is dense. Word embeddings are learned from the data in such a way that semantically related words end up close to each other. The geometric relationship between word vectors often also encodes a semantic relationship between the words. An oft-quoted example shows that by adding the vector corresponding to the concept of female to the distributed representation of king, we (approximately) obtain the distributed representation of queen.
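As a small illustration of the king/queen analogy above (ours, not from the paper), pre-trained GloVe vectors can be queried via gensim's downloader; the model name and its availability are assumptions, and the exact neighbors returned depend on the corpus.

```python
# Sketch: vector arithmetic and semantic similarity with pre-trained word embeddings.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads pre-trained GloVe vectors

# vector('king') - vector('man') + vector('woman') lands near vector('queen').
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Semantically related words score as similar even when syntactically dissimilar.
print(vectors.similarity("wage", "salary"))
```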


Distributed Representations of Graphs (a.k.a. network embeddings) are a popular mechanism to learn appropriate features for graphs (see [6] for a survey). Each node is represented as a dense, fixed-length vector. The optimization objective is neighborhood-preserving, whereby two nodes that are in the same neighborhood will have similar representations. Different definitions of neighborhood result in different representations for the nodes.

Distributed Representations of Sets. Sets are a fundamental data structure in which the order of items does not matter. Sets are extensively used in the relational model, where a tuple is a set of atomic values and a relation is a set of tuples. Algorithms for learning set embeddings must ensure permutation invariance, so that any permutation of the set yields the same embedding. There are a number of popular algorithms for processing sets to produce such embeddings, such as [41, 52].

Compositions of Distributed Representations. One can compose the distributed representations of atomic units – words in NLP, or nodes in a graph – into representations of more complex units. For NLP, these could be sentences (i.e., sentence2vec), paragraphs, or even documents (i.e., doc2vec) [28]. In the case of graphs, these could be subgraphs or the entire graph.
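The permutation invariance required of set embeddings can be obtained with a DeepSets-style encoder that pools per-element representations with an order-independent operation such as summation. The sketch below is our own illustration of that idea (PyTorch assumed, dimensions arbitrary), not a specific algorithm from [41, 52].

```python
# Sketch: a permutation-invariant set encoder (DeepSets-style sum pooling).
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    def __init__(self, dim_in=300, dim_hidden=128, dim_out=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.ReLU())  # per-element
        self.rho = nn.Sequential(nn.Linear(dim_hidden, dim_out), nn.ReLU()) # after pooling

    def forward(self, items):                 # items: (set_size, dim_in)
        pooled = self.phi(items).sum(dim=0)   # sum pooling ignores element order
        return self.rho(pooled)               # embedding of the whole set

enc = SetEncoder()
tuple_cells = torch.randn(4, 300)             # e.g., 4 cell embeddings of one tuple
shuffled = tuple_cells[torch.randperm(4)]
# Any permutation of the set yields (numerically) the same embedding.
assert torch.allclose(enc(tuple_cells), enc(shuffled), atol=1e-5)
```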

3 REPRESENTATION LEARNING FOR DATA CURATION

In this section, we discuss potential research opportunities for designing DL architectures and algorithms for learning DC-specific representations.

3.1 The Need for DC-Specific Distributed Representations

Data curation is the process of discovering, integrating, and cleaning data for downstream tasks. There are a number of challenging sub-problems in data curation, many of which can be unified through the prism of "matching".

Data Discovery is the process of identifying relevant data for a specific task. In most enterprises, data is scattered across a large number of tables that can range in the tens of thousands. As an example, a user might be interested in identifying all tables describing user studies involving insulin. The typical approach involves asking an expert or performing manual data exploration [8]. Data discovery can thus be viewed as identifying tables that match a user specification (e.g., a keyword or an SQL query).

Schema Matching is the problem of identifying pairs of columns from different tables that are (semantically) related and provide comparable information about an underlying entity. For example, two columns wage and salary could store similar information. The related problem of schema mapping seeks to learn the transformation between matched columns, such as hourly wage and monthly salary. One can see that this problem also seeks to identify matching columns.

Data Cleaning is the general problem of detecting errors in a dataset and repairing them in order to improve its quality. This includes qualitative data cleaning, which mainly uses integrity constraints, rules, or patterns to detect errors, and quantitative approaches, which are mainly based on statistical methods. Many sub-problems in data cleaning fall under the ambit of matching. These include:

• Entity Resolution is the problem of identifying pairs of tuples that denote the same entity. For example, the tuples ⟨Bill Gates, MS⟩ and ⟨William Gates, Microsoft⟩ refer to the same person. This has a number of applications, such as identifying duplicate tuples between two tables. Once again, we can see that this corresponds to discovering matching entities (or tuples), i.e., those that refer to the same real-world objects.
• Data Dependencies specify how one attribute value depends on other attribute values. They are widely used as integrity constraints for capturing data violations. For example, the social security number determines a person's name, while the country name typically determines its capital. This can be treated as an instance of relevance matching.
• Outlier Detection identifies anomalous data that does not match a group of values, either syntactically, semantically, or statistically. For example, the phone number 123-456-7890 is anomalous when the other values are of the form nnnnnnnnnn.
• Data Imputation is the problem of guessing a missing value, which should match its surrounding evidence.

The main drawback of traditional DC solutions is that they typically rely on syntactic similarity based features. The syntactic approach is often ad hoc, heavily reliant on domain experts, and thus not extensible. While such features are often effective, they cannot leverage semantic similarity. A different approach is needed.

Intuitively, distributed representations, which can interpret one object in many different ways, may provide great help for various DC problems. However, there are a number of challenges to address before they can be used in DC. In domains such as NLP, simple co-occurrence is a sufficient approximation for distributional similarity. In contrast, domains such as DC require much more complex syntactic and semantic relationships between (multiple) attribute values, tuples, columns, and tables. Furthermore, the data itself could be noisy, and the learned representations should be resilient to such errors.

3.2 Distributed Representation of Cells
A cell – an attribute value of a tuple – is the atomic data element in a relational database. Learning distributed representations for cells is already quite challenging and requires a synthesis of representation learning for words and graphs.

Word Embedding based Approaches. An initial approach, inspired by word2vec [31], treats this as equivalent to the problem of learning word embeddings: each tuple corresponds to a document, and the value of each attribute corresponds to a word. Hence, if two cell values often occur together in a similar context, their distributed representations will be similar. For example, if a relation has attributes Country and Capital with many tuples containing (USA, Washington DC) for these two attributes, then their distributed representations would be similar.
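A minimal sketch of this word2vec-style treatment follows (ours, not from [31]; gensim 4.x assumed, toy data only): each tuple is fed to the embedding algorithm as a "sentence" whose tokens are its cell values.

```python
# Sketch: learning cell embeddings by treating tuples as sentences of cell values.
from gensim.models import Word2Vec

tuples = [
    ["USA", "Washington DC", "North America"],
    ["France", "Paris", "Europe"],
    ["USA", "Washington DC", "English"],
]  # toy relation; real data lakes would supply millions of tuples

model = Word2Vec(sentences=tuples, vector_size=50, window=5, min_count=1, epochs=100)

# Cells that co-occur in many tuples end up with similar representations.
print(model.wv.similarity("USA", "Washington DC"))
```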


Figure 2: A Heterogeneous Graph of A Table (an Employee table with attributes Employee ID, Employee Name, Department ID, and Department Name; undirected edges connect values that appear in the same tuple, while directed edges capture the functional dependencies Employee ID → Department ID and Department ID → Department Name)

Limitations of this Approach. The embeddings produced by this approach suffer from a number of issues. First, databases are typically well normalized to reduce redundancy, which also minimizes the frequency with which two semantically related attribute values co-occur in the same tuple. Second, databases have many data dependencies (or integrity constraints), both within tables (e.g., functional dependencies [2] and conditional functional dependencies [17]) and across tables (e.g., foreign keys and matching dependencies). These data dependencies are important hints about semantically related cells that should be captured when learning distributed representations of cells.

Combining Word and Graph Embeddings. To capture the relationships (e.g., integrity constraints) between cells, a more natural way is to treat each relation as a heterogeneous network. Each relation D is modeled as a graph G(V, E), where each node u ∈ V is a unique attribute value, and each edge (u, v) ∈ E represents one of multiple relationships, such as "u and v co-occur in one tuple" or "there is a functional dependency from the attribute of u to the attribute of v". The edges can be directed or undirected and can carry labels and weights. This enriched model can provide a more meaningful distributed representation that is cognizant of both content and constraints.

A sample table and our proposed graph representation of it are shown in Figure 2. There are four distinct Employee ID values (nodes), three distinct Employee Name values, two distinct Department ID values, and three distinct Department Name values. There are two types of edges: undirected edges indicating values appearing in the same tuple, e.g., 0001 and John Doe, and directed edges for functional dependencies, e.g., Employee ID 0001 implies Department ID 1.
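The following sketch (ours; networkx assumed, toy rows mirroring Figure 2) shows one way such a heterogeneous graph could be constructed from a relation plus its known functional dependencies.

```python
# Sketch: building a heterogeneous value graph with co-occurrence and FD edges.
import networkx as nx

rows = [
    {"Employee ID": "0001", "Employee Name": "John Doe", "Department ID": "1", "Department Name": "Human Resources"},
    {"Employee ID": "0002", "Employee Name": "Jane Doe", "Department ID": "1", "Department Name": "Human Resources"},
]
fds = [("Employee ID", "Department ID"), ("Department ID", "Department Name")]

G = nx.MultiDiGraph()  # multigraph: a pair of values may be linked by several relationships
for row in rows:
    values = list(row.items())
    # undirected co-occurrence, modeled here as a pair of directed edges
    for i, (attr_u, u) in enumerate(values):
        for attr_v, v in values[i + 1:]:
            G.add_edge(u, v, label="co-occurs")
            G.add_edge(v, u, label="co-occurs")
    # directed edges for functional dependencies between the corresponding values
    for lhs, rhs in fds:
        G.add_edge(row[lhs], row[rhs], label=f"FD {lhs}->{rhs}")

print(G.number_of_nodes(), G.number_of_edges())
```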

Research Opportunities.

• Algorithms. How can we design an algorithm for learning cell embeddings that takes values, integrity constraints, and other metadata (e.g., a query workload) into consideration?
• Global Distributed Representations. We need to learn distributed representations for cells over an entire data lake, not just one relation. How can we "transfer" the knowledge gained from one relation to another to improve the representations?
• Handling Rare Values. Word embeddings often provide inadequate representations for rare values. How can we ensure that primary keys and other rare values have meaningful representations?

3.3 Hierarchical Distributed Representations
Many DC tasks are performed at a higher level of granularity than cells. The next fundamental question is how to compose the distributed representations of more abstract units from these atomic units. As an example, how can one design an algorithm for tuple embeddings, assuming one already has a distributed representation for each of its attribute values? A common approach is to simply average the distributed representations of all its component values.
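The averaging baseline mentioned above amounts to a single aggregation step; a trivial sketch (ours, numpy assumed) makes that explicit.

```python
# Sketch: composing a tuple embedding as the mean of its cell embeddings.
import numpy as np

def tuple_embedding(cell_vectors):
    """Average the cell representations to obtain one tuple representation."""
    return np.mean(np.stack(cell_vectors), axis=0)

cells = [np.random.rand(50) for _ in range(4)]   # 4 cell embeddings (assumed 50-d)
print(tuple_embedding(cells).shape)              # (50,)
```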

Research Opportunities.

• Tuple, Column, and Table Embeddings (Tuple2Vec, Column2Vec, Table2Vec): Are there more elegant ways to compose representations for tuples than averaging? Should a tuple representation be composed from the representations of its cells, or should it be learned directly? How can one extend this idea to learn representations for columns, which are useful for tasks such as schema matching? Finally, how can one learn a representation for an entire relation that can benefit tasks such as copy detection or data discovery (finding similar relations)?
• Compositional or Direct Learning: Different domains require different ways to create hierarchical representations. In computer vision, hierarchical representations such as shapes are learned by composing simpler individual representations such as edges. In NLP, on the other hand, hierarchical representations are often learned directly [28]. What is an appropriate mechanism for DC?
• Contextual Embeddings for DC. There has been increasing interest in the NLP community in contextual embeddings [10, 38], which provide different embeddings for a word (such as apple) based on the surrounding context. This is often helpful for tasks requiring word disambiguation. DC tasks frequently depend on contextual information as well, so any learned representation must take it into account. What is an appropriate formalization of context, and what are suitable algorithms for contextual embeddings in DC?
• Multi-Modal Embeddings for DC. Enterprises often possess data across various modalities, such as structured data (e.g., relations), unstructured data (e.g., documents), graphical data (e.g., enterprise networks), and even videos, audio, and images. An intriguing line of research is cross-modal representation learning [25], wherein two entities that are similar in one or more modalities, e.g., occurring in the same relation, document, or image, will have similar distributed representations.

4 DEEP LEARNING ARCHITECTURES FOR DATA CURATION TASKS

Representation learning and domain-specific architectures are the two pillars behind the major successes achieved by DL in various domains. While representation learning ensures that the input to the DL model contains all relevant information, the domain-specific architecture processes the input in a meaningful way, requiring less training data than generic architectures.


In this section, we motivate the need for designing DC-specific architectures and describe a promising design space to explore.

4.1 Need for DC-Specific DL Architectures
Recall from Section 2 that a fully connected architecture is the most generic one. It does not make any domain-specific assumptions and hence can be used for arbitrary domains. However, this generality comes at the cost of a lot of training data. One can argue that a major part of DL's success in computer vision and NLP is due to the design of specialized DL architectures – CNNs and RNNs, respectively. A CNN leverages spatial hierarchies of patterns, where complex/global patterns are composed of simpler/local patterns (e.g., curves → mouth → face). Similarly, an RNN processes an input sequence one step at a time while maintaining an internal state. These assumptions allow one to design effective neural architectures for processing images and text that also require less training data. This is especially important in data curation, where there is a persistent scarcity of labeled data. There is a pressing need for new DL architectures that are tailored to DC and are cognizant of the characteristics of DC tasks. This would allow them to learn efficiently, both in terms of training time and the amount of required training data.

Desiderata for DC-DL Architectures.
• Handling Heterogeneity. In both CNNs and RNNs, the input is homogeneous – images or text. A relation, in contrast, can contain a wide variety of data, such as categorical, ordinal, numerical, textual, and image attributes.
• Set/Bag Semantics. While an image can be considered a sequence of pixels and a document a sequence of words, such an abstraction does not work for relational data. Furthermore, relational algebra is based on set and bag semantics, with the major query languages specified as set operators. DL architectures that operate on sets are an under-explored area in DL.
• Long-Range Dependencies. A DC architecture must be able to capture long-range dependencies across attributes and sometimes across relations, since many DC tasks rely on the knowledge of such dependencies. It should also be able to leverage additional domain knowledge and integrity constraints.

4.2 Design Space for DC-DL Architectures
Broadly speaking, there are two areas in which DL architectures could be used in DC. First, novel DL architectures are needed for learning effective representations for downstream DC tasks. Second, we need DL architectures for common DC tasks such as matching, data repair, imputation, and so on. Both are challenging in their own right and require significant innovations.

Architectures for DC Representation Learning. In a number of fields, such as computer vision, deep neural networks learn general features in the early layers and transition to task-specific features in the later layers. The initial layers learn generic features such as edges (which can be used in many tasks), while the later layers learn high-level concepts such as a cat or a dog (which can be used, for instance, to distinguish cats from dogs). Models trained on large datasets such as ImageNet can be used as feature extractors, followed by task-specific models. Similarly, in NLP there have been a number of pre-trained language models, such as word2vec [31], ELMo [38], and BERT [10], that can then be customized for various tasks such as classification, question answering, semantic matching, and coreference resolution. In other words, the pre-trained DL models act as a set of Lego bricks that can be put together to perform the required functionality. It is thus important that a DL architecture for DC follow this property by learning features that are generic and readily generalize to many DC tasks.

While there has been extensive work in linguistics on the hierarchical representations of language, a corresponding investigation in data curation is lacking. Nevertheless, we believe that a generic DC representation learning architecture should be inspired by these pre-trained language models. Once such an architecture is identified, the learned representations could be used for multiple DC tasks.

DL Architectures for DC Tasks. While the generic approach is powerful, it is important that we also work on DL architectures for specific DC tasks. This approach is often much simpler and provides incremental gains while also increasing our understanding of the DL-DC nexus. In fact, this has been the approach taken by the NLP community: they worked on DL architectures for specific tasks (such as text classification or question answering) even while they searched for more expressive generative language models.

A promising approach is to take specific DC tasks, break them into simpler components, and evaluate whether existing DL architectures can be reused. Often, it is possible to "compose" individual DL models to create a more complex model for solving a specific DC task. Many DC tasks can be considered a combination of primitives such as matching, error detection, data repair, imputation, ranking, discovery, and syntactic/semantic transformations. An intriguing research question is to identify DL models for each of these primitives and investigate how to instantiate and compose them.

4.3 Pre-Trained DL Models for DC
In domains such as computer vision and NLP, there is a common tradition of training a DL model on a large dataset and then reusing it (after some tuning) for tasks on other, smaller datasets. One example is ImageNet [9], which contains almost 14M images over 20k categories. The spatial hierarchy learned from this dataset is often a proxy for modeling the visual world [5] and can be applied to a different dataset and even different categories [5]. Similarly, in NLP, word embeddings are an effective approximation for language modeling. The embeddings are often learned from large, diverse corpora such as Wikipedia or PubMed and can be used for downstream NLP tasks. These pre-trained models can be used in two ways: (a) feature extraction, where they are used to extract generic features that are fed to a separate classifier for the task at hand; and (b) fine-tuning, where one adjusts the abstract representations from the last few hidden layers of a pre-trained model to make them more relevant to the targeted task. A promising research avenue is the design of pre-trained DL models for DC that could be used by others with comparatively fewer resources.
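The two reuse modes described above can be sketched as follows (ours; PyTorch assumed, and `pretrained_encoder` is a hypothetical pre-trained tuple encoder standing in for whatever DC model such research would produce).

```python
# Sketch: (a) feature extraction with a frozen encoder vs. (b) partial fine-tuning.
import torch
import torch.nn as nn

# Stand-in for a pre-trained encoder; in practice this would be loaded from a checkpoint.
pretrained_encoder = nn.Sequential(nn.Linear(300, 256), nn.ReLU(), nn.Linear(256, 128))
task_head = nn.Linear(128, 2)          # small task-specific classifier

# (a) Feature extraction: freeze the encoder, train only the task head.
for p in pretrained_encoder.parameters():
    p.requires_grad = False
model = nn.Sequential(pretrained_encoder, task_head)

# (b) Fine-tuning: unfreeze the last encoder layer and train it with a smaller
#     learning rate than the freshly initialized head.
for p in pretrained_encoder[-1].parameters():
    p.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": pretrained_encoder[-1].parameters(), "lr": 1e-5},
    {"params": task_head.parameters(), "lr": 1e-3},
])
```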

5 EARLY SUCCESSES OF DL FOR DC
In this section, we describe the early successes achieved by the DC community in using DL for major DC tasks such as data discovery and entity matching.


5.1 Data Discovery
Large enterprises typically possess hundreds or thousands of databases and relations. The data required for analytic tasks is often scattered across multiple databases, depending on who collected it. This makes the process of finding relevant data for a particular analytic task very challenging. Usually, there is no domain expert who has complete knowledge about the entire data repository. This results in a sub-optimal scenario where the data analyst patches together data from her prior knowledge or limited searches, thereby leaving out potentially useful and relevant data.

As an example of applying word embeddings to data discovery, [19] shows how to discover semantic links between different data units, which are materialized in an enterprise knowledge graph (EKG): a graph whose nodes are data elements such as tables, attributes, and reference data such as ontologies and mapping tables, and whose edges represent different relationships between those nodes. These links assist in data discovery by linking tables to each other, to facilitate navigating the schemas, and by relating data to external data sources such as ontologies and dictionaries, to help explain the schema's meaning. A key component is a semantic matcher based on word embeddings. The key idea is that a group of words is similar to another group of words if the average similarity between all pairs of their embeddings is high. This approach was able to surface links that were previously unknown to the analysts, e.g., isoform, a type of protein, with Protein, and Pcr – polymerase chain reaction, a technique to amplify a segment of DNA – with assay. Any approach using syntactic similarity measures such as edit distance would not have identified these connections.
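A minimal sketch of that group-to-group matching idea follows (ours, not the implementation in [19]); `embed` and `group_similarity` are hypothetical helpers, and the toy random vectors stand in for real pre-trained word embeddings.

```python
# Sketch: two groups of words match if the average pairwise embedding similarity is high.
import numpy as np

rng = np.random.default_rng(0)
_toy_vectors = {w: rng.normal(size=50) for w in ["isoform", "protein", "pcr", "assay"]}

def embed(word):
    """Return a unit-normalized vector; real systems would use pre-trained embeddings."""
    v = _toy_vectors[word.lower()]
    return v / np.linalg.norm(v)

def group_similarity(words_a, words_b):
    sims = [float(np.dot(embed(a), embed(b))) for a in words_a for b in words_b]
    return sum(sims) / len(sims)          # average pairwise cosine similarity

# With real embeddings, pairs such as ("isoform", "Protein") score far higher
# than any edit-distance style measure would suggest.
print(group_similarity(["isoform"], ["Protein"]))
```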

5.2 Entity Matching
Entity matching is a key problem in data integration: determining whether two tuples refer to the same underlying real-world entity [15].

DeepER. A recent work, DeepER [14], applies DL techniques to ER; its overall architecture is shown in Figure 3. DeepER outperforms existing ER solutions in terms of accuracy, efficiency, and ease-of-use. For accuracy, DeepER uses sophisticated compositional methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple into a distributed representation, which is then used to capture similarities between tuples. DeepER uses a locality sensitive hashing (LSH) based approach for blocking; it takes all attributes of a tuple into consideration and produces much smaller blocks compared with traditional methods that consider only a few attributes. For ease-of-use, DeepER requires much less human-labeled data and does not need feature engineering, in contrast with traditional machine learning based approaches, which require handcrafted features and similarity functions along with their associated thresholds.

DeepMatcher. DeepMatcher [35] proposes a template-based architecture for entity matching; Figure 4 shows an illustration. It identifies four major components: attribute embedding, attribute summarization, attribute similarity, and matching. It proposes various choices for each of these components, leading to a rich design space. The four most promising models differ primarily in how attribute summarization is performed and are dubbed SIF, RNN, Attention, and Hybrid. SIF is the simplest and computes a weighted average over word embeddings to obtain the attribute embedding.

Figure 3: DeepER Framework (Figure from [14])

Figure 4: DeepMatcher Framework (Figure from [35])

RNN uses a bi-directional GRU to summarize each attribute efficiently. Attention uses a decomposable attention model that can leverage information from the aligned attributes of both tuples to summarize an attribute. Finally, the Hybrid approach combines RNN and attention. DeepMatcher was evaluated on a diverse set of datasets (structured, unstructured, and noisy) and provides useful rules of thumb on when to use DL for ER.
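To make the overall shape of such matchers concrete, the following schematic sketch (our own simplification, not the DeepER or DeepMatcher code; PyTorch assumed, all sizes illustrative) embeds each attribute's tokens, summarizes them with an LSTM, compares the two tuples' summaries, and classifies the pair as match or non-match.

```python
# Sketch: embed -> summarize per attribute -> compare -> classify (match / non-match).
import torch
import torch.nn as nn

class AttributeSummarizer(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        _, (h_n, _) = self.rnn(self.emb(token_ids))
        return h_n[-1]                               # (batch, hidden) attribute summary

class TupleMatcher(nn.Module):
    def __init__(self, n_attrs=3, hidden=64):
        super().__init__()
        self.summarizers = nn.ModuleList(
            [AttributeSummarizer(hidden=hidden) for _ in range(n_attrs)])
        self.classifier = nn.Sequential(
            nn.Linear(n_attrs * hidden, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, left_attrs, right_attrs):      # lists of (batch, seq_len) tensors
        sims = [torch.abs(s(l) - s(r))               # element-wise difference as similarity features
                for s, l, r in zip(self.summarizers, left_attrs, right_attrs)]
        return self.classifier(torch.cat(sims, dim=-1))

matcher = TupleMatcher()
left = [torch.randint(0, 10_000, (4, 6)) for _ in range(3)]   # 4 pairs, 3 attributes, 6 tokens each
right = [torch.randint(0, 10_000, (4, 6)) for _ in range(3)]
print(matcher(left, right).shape)                              # (4, 2) match / non-match logits
```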

5.3 Representation Learning
A number of recent works, such as DeepER [14] and DeepMatcher [35], have applied a compositional approach, using averaging or RNNs/LSTMs, to obtain tuple embeddings from pre-trained word embeddings. For a number of benchmark datasets in entity resolution, this provides good results. There have also been several recent efforts to learn distributed representations for data curation directly. Freddy [22] incorporated support for semantic similarity based queries inside Postgres by using pre-trained word embeddings. Termite [18] proposed an effective technique to learn a common (distributed) representation for both the structured and unstructured data in an organization. In this approach, various entities such as rows, columns, and paragraphs are all represented as vectors.


This unified representation allows Termite to identify related concepts even if they are spread across structured and unstructured data. Finally, EmbDC [4] proposed an interesting approach to learn embeddings for cells that combines ideas from word and node embeddings. It constructs a tripartite graph and performs random walks over it; the walks correspond to sentences that are passed to a word embedding algorithm. This allows different permutations of the same data to be generated, thereby partially incorporating set semantics. The authors show that this two-step approach provides promising results for unsupervised DC tasks such as schema matching and entity resolution.

5.4 Understanding the Success and What is Missing?

Both DeepER and DeepMatcher provide a mechanism to obtain tuple embeddings from pre-trained word embeddings. The similarity between the embeddings of two tuples is then used to train a DL classifier. In both cases, the classifier was generic, and hence the state-of-the-art performance of these approaches can be attributed to learning effective representations of the tuples. A follow-up work [50] showed that such learned representations can improve the performance of even non-DL classifiers. [35] performed an extensive empirical evaluation and found that deep learning based methods were especially effective for entity resolution involving textual and dirty data. For example, the use of word embeddings allowed the DL based approach to identify that 'Bill' and 'William' are semantically similar, while no string similarity metric could do that. Learning such representations can also enable some unexpected applications, such as performing entity resolution between relations that are in different languages [3].

Something analogous happens in the data discovery scenario as well. The word embedding based approach learned effective representations that could reveal that certain concepts are related (isoform and Protein, or Pcr and assay, as mentioned above) even though no syntactic similarity measure such as edit distance could have identified them. This ability to learn representations that recognize similar pairs of words even when they are syntactically dissimilar also explains the success of DL for schema matching in [34, 43].

From the above discussion, it is clear that learning effective representations is the cornerstone of the improved performance of DL based methods. Such representations capture relationships that are hard to obtain with non-DL methods. However, these early works are only scratching the surface, as they do not yet leverage powerful techniques such as task-specific architectures or transfer learning. As an example, DeepMatcher proposed a template-based architecture with two key layers, composition/aggregation and similarity computation; this can be considered a rudimentary task-specific architecture that reduced the required training data. Recent works such as [50] and [26] seek to use transfer learning to improve ER. However, the days of using pre-trained ER models across multiple datasets are still far off. Even within representation learning, these works have not yet explored the use of contextual embeddings such as ELMo or BERT.

6 TAMING DL'S HUNGER FOR DATA

Deep learning often requires a large amount of training data. Unfortunately, in domains such as DC, it is often prohibitively expensive to obtain high-quality training data. In order for DL to be widely used to solve DC problems, we must overcome this problem of obtaining training data. Fortunately, there are a number of exciting innovations from DL that can be adapted to address this issue. We highlight some of these promising ideas and discuss potential research questions.

6.1 Unsupervised Representation Learning
While the amount of supervised training data is limited, most enterprises have a substantial amount of unlabeled data in various relations and data lakes. This unlabeled data can be used to learn generic patterns and representations that are then used to train a DL model with relatively little labeled data. This technique is widely used in NLP, where word embeddings are learned on a large unlabeled corpus such as Wikipedia and provide good performance for many downstream NLP tasks.

Research Opportunities.

• How can we perform holistic representation learning over enterprise data? How can the learned representations be used as features for downstream DC tasks such as entity matching, schema matching, etc.? While there have been some promising starts in recent work such as [4, 51], more needs to be done.

6.2 Data Augmentation
Data augmentation increases the size of labeled training data without increasing the load on domain experts. The key idea is to apply a series of label-preserving transformations over the existing training data. For an image recognition task, one can apply transformations such as translation (moving the location of the dog/cat within the image), rotation (changing the orientation), shearing, scaling, and changing brightness/color. Each of these operations does not change the label of the image (cat or dog), yet generates additional synthetic training data.

Research Opportunities.

• Label-Preserving Transformations for DC. What do label-preserving transformations mean for DC? Is it possible to design an algebra of such transformations?
• Domain Knowledge Aware Augmentation. To avoid creating erroneous data, we could integrate domain knowledge and integrity constraints into the data augmentation process. This would ensure that we do not create tuples stating that New York City is the capital of the USA.
• Domain-Specific Transformations. There are recent approaches, such as Tanda [40], that seek to learn domain-specific transformations. For example, if one knows that Country → Capital, we can simply swap the (Country, Capital) values of two tuples to generate two new tuples (see the sketch after this list). In this case, even if a new tuple is not necessarily real, its label will be the same as the original tuple's. Is it possible to develop an algebra for such transformations?
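The FD-based swap mentioned in the last bullet can be sketched as follows (ours, not from [40]); `augment_by_fd_swap` and the toy tuples are hypothetical.

```python
# Sketch: label-preserving augmentation by swapping an FD-consistent value group.
import copy
import random

def augment_by_fd_swap(tuples, fd_attrs=("Country", "Capital")):
    """Create extra labeled tuples by swapping a dependent attribute group between tuples."""
    augmented = []
    for _ in range(len(tuples)):
        a, b = random.sample(tuples, 2)
        new_a, new_b = copy.deepcopy(a), copy.deepcopy(b)
        for attr in fd_attrs:                       # swap the whole (Country, Capital) group
            new_a[attr], new_b[attr] = b[attr], a[attr]
        augmented.extend([new_a, new_b])            # labels are copied along unchanged
    return augmented

data = [
    {"Country": "USA", "Capital": "Washington DC", "Continent": "North America", "label": 1},
    {"Country": "France", "Capital": "Paris", "Continent": "Europe", "label": 1},
]
print(augment_by_fd_swap(data))
```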

6.3 Synthetic Data Generation for DC
A seminal event in computer vision was the construction of the ImageNet dataset [9], with many millions of images over thousands of categories. This was a key factor in the development of powerful DL algorithms. We believe that the DC community is in need of such an ImageNet moment.


Research Opportunities.

• Benchmarks for DC. It is paramount to create a similar benchmark to drive research in DC (both DL and non-DL). While there has been some work for data cleaning, such as BART [1], it is often limited to specific scenarios. For example, BART can be used to benchmark data repair algorithms in which the integrity constraints are specified as denial constraints.
• Synthetic Datasets. If it is not possible to create an open-source dataset with realistic data quality issues, a useful fallback is to create synthetic datasets that exhibit representative and realistic data quality issues. The family of TPC benchmarks involves a series of synthetic datasets that are reasonably realistic and widely used for benchmarking database systems. A recent work [48] proposed a variational autoencoder based model for generating synthetic data with statistical properties similar to the original data. Is it possible to extend that approach to encode data quality issues as well?

6.4 Weak Supervision
A key bottleneck in creating training data is the implicit assumption that it must be accurate. However, it is often infeasible to produce sufficient hand-labeled, accurate training data for most DC tasks. This is especially challenging for DL models, which require a huge amount of training data. However, if one relaxes the need for the veracity of the training data, generating it becomes much easier for the expert: the domain expert can specify a high-level mechanism to generate training data without endeavoring to make it perfect.
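A toy sketch of this idea for entity matching follows (ours): the expert writes cheap, imperfect labeling functions whose (possibly conflicting) votes are combined, here with a simple majority vote; frameworks such as Snorkel, mentioned below, learn a better combination.

```python
# Sketch: weak supervision via labeling functions for duplicate detection.
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

def lf_same_zip(pair):
    return MATCH if pair["left"]["zip"] == pair["right"]["zip"] else ABSTAIN

def lf_name_overlap(pair):
    left = set(pair["left"]["name"].lower().split())
    right = set(pair["right"]["name"].lower().split())
    return MATCH if left & right else NON_MATCH

def lf_different_country(pair):
    return NON_MATCH if pair["left"]["country"] != pair["right"]["country"] else ABSTAIN

def weak_label(pair, lfs):
    votes = [v for lf in lfs if (v := lf(pair)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)   # simple majority vote over non-abstaining LFs

pair = {"left": {"name": "Bill Gates", "zip": "98052", "country": "USA"},
        "right": {"name": "William Gates", "zip": "98052", "country": "USA"}}
print(weak_label(pair, [lf_same_zip, lf_name_overlap, lf_different_country]))  # 1 (match)
```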

Research Opportunities.

• Weakly Supervised DL Models. There has been a series of research efforts (such as Snorkel [39]) that seek to weakly supervise ML models and provide a convenient programming mechanism to specify "mostly correct" training data. What are the DC-specific mechanisms and abstractions for weak supervision? Can we automatically create such weak supervision through data profiling?

6.5 Transfer Learning

Another trick to handle limited training data is to “transfer” representations learned in one task to a different yet related task. For example, DL models for computer vision are often trained on ImageNet [9], a dataset built for image recognition; yet these models can be reused for tasks beyond image recognition. This approach of training on a large, diverse dataset followed by fine-tuning on a local dataset and task has been very successful.

Research Opportunities.

• Transfer Learning. What is the equivalent of this approach for DC? Is it possible to train on a single large dataset such that it could be used for many downstream DC tasks? Alternatively, is it possible to train on a single dataset and for a single task (such as entity matching) such that the learned representations could be used for entity matching in similar domains? How can we train a DL model for one task and tune the model for new tasks using the limited labeled data, instead of starting from scratch?
• Pre-trained Models. Is it possible to provide pre-trained models that have been trained on large, generic datasets? These models could then be fine-tuned by individual practitioners in different enterprises; a minimal fine-tuning sketch is given below.
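
At the framework level, the pre-trained-model idea can be sketched as follows, assuming PyTorch; the encoder, dimensions, and classification head are hypothetical placeholders rather than the design of any specific DC system.

import torch
import torch.nn as nn

class MatcherHead(nn.Module):
    def __init__(self, encoder, hidden_dim=300):
        super().__init__()
        self.encoder = encoder                 # pre-trained pair encoder
        for p in self.encoder.parameters():
            p.requires_grad = False            # keep the transferred representations
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, pair_batch):
        with torch.no_grad():
            rep = self.encoder(pair_batch)     # reuse the learned features
        return self.head(rep)

# Only the small head is optimized, so a few hundred labeled pairs from the
# new domain may suffice:
# model = MatcherHead(pretrained_encoder)
# opt = torch.optim.Adam(model.head.parameters(), lr=1e-3)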

7 DEEP LEARNING: MYTHS AND CONCERNS

In the past few years, DL has achieved substantial successes in many areas. However, DC has a number of characteristics that are quite different from prior domains where DL succeeded. We anticipate that applying DL to challenging real-world DC tasks can be messy. We now describe some of the concerns that could be raised by a pragmatic DC practitioner or a skeptical researcher.

7.1 Deep Learning is Computing Heavy

A common concern is that training DL models requires exorbitant computing resources, where model training could take days even on a large GPU cluster. In practice, training time depends on the model complexity, such as the number of parameters to be learned, and on the size of the training data. There are many tricks that can reduce the amount of training time. For example, a task-aware DL architecture often requires substantially fewer parameters to be learned. Alternatively, one can “transfer” knowledge from a model pre-trained on a related task or domain, and the training time is then proportional to the amount of fine-tuning required to customize the model to the new task. For example, DeepER [14] leveraged word embeddings from GloVe (whose training can be time consuming) and built a lightweight DL model that can be trained in a matter of minutes even on a CPU. Finally, while training could be expensive, it can often be done as a pre-processing task; prediction using DL is typically very fast and comparable to that of other ML models.
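
In the spirit of the lightweight setup mentioned above (a sketch, not DeepER's actual code), the following assumes pre-trained GloVe vectors are available as a Python dict named glove mapping tokens to numpy arrays; only a small classifier over per-attribute similarities is trained, which is fast even on a CPU.

import numpy as np
from sklearn.neural_network import MLPClassifier

def embed(text, glove, dim=300):
    """Average pre-trained word vectors for a string; no training involved."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(left, right, glove, attrs):
    """Per-attribute cosine similarities between the two tuples of a pair."""
    feats = []
    for a in attrs:
        u, v = embed(left[a], glove), embed(right[a], glove)
        denom = (np.linalg.norm(u) * np.linalg.norm(v)) or 1.0
        feats.append(float(u @ v) / denom)
    return feats

# X = [pair_features(l, r, glove, attrs) for (l, r) in candidate_pairs]
# clf = MLPClassifier(hidden_layer_sizes=(32,)).fit(X, y)   # minutes, no GPU needed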

7.2 Data Curation Tasks are Too Messy or Too Unique

DC tasks often require substantial domain knowledge and a large dose of “common sense”. Current DL models are very narrow in the sense that they primarily learn from the correlations present in the training data, and it is quite likely that this is not sufficient. Unsupervised representation learning over the entire enterprise data can only partially address this issue. Current DL models are also not readily amenable to encoding domain knowledge in general, nor DC-specific knowledge such as data integrity constraints. As mentioned before, a substantial amount of new research on DC-aware DL architectures is needed. Nevertheless, it is likely that DL, even in its current form, can reduce the work of domain experts.

DC tasks also often exhibit a skewed label distribution. For the task of entity resolution (ER), the number of non-duplicate tuple pairs is orders of magnitude larger than the number of duplicate tuple pairs. If one is not careful, DL models can produce inaccurate results. Similarly, other DC tasks often exhibit an unbalanced cost model where the cost of misclassification is not symmetric. Prior DL work uses a number of techniques to address these issues, such as (a) cost-sensitive models, where the asymmetric misclassification costs are encoded into the objective function, and (b) sophisticated sampling approaches, where certain classes of data are under- or over-sampled. For example, DeepER [14] samples the abundant non-duplicate tuple pairs at a higher rate than duplicate tuple pairs.
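
To make mitigations (a) and (b) concrete, here is a minimal sketch, assuming PyTorch and binary ER labels; the weights and sampling ratio are illustrative assumptions, not values prescribed by any cited system.

import random
import torch
import torch.nn as nn

def undersample(pairs, labels, neg_per_pos=5, seed=0):
    """(b) Keep all duplicate pairs and a capped random subset of non-duplicates."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    neg = rng.sample(neg, min(len(neg), neg_per_pos * len(pos)))
    keep = pos + neg
    return [pairs[i] for i in keep], [labels[i] for i in keep]

# (a) Cost-sensitive loss: misclassifying a rare duplicate costs 10x more.
class_weights = torch.tensor([1.0, 10.0])        # [non-duplicate, duplicate]
loss_fn = nn.CrossEntropyLoss(weight=class_weights)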


7.3 Deep Learning Predictions are Inscrutable

Domain experts could be concerned that the predictions of DL models are often uninterpretable. Deep learning models are often very complex, and their black-box predictions might not be explainable even by DL experts. Yet explaining why a DL model made a particular data repair is very important for a domain expert. Recently, there has been intense interest in developing algorithms for explaining the predictions of DL models, and in designing interpretable DL models in general; see [21] for an extensive survey. Designing algorithms that can explain the predictions of DL models for DC tasks is an intriguing research problem. While some promising approaches exist for specific tasks such as entity resolution [11, 13, 49], much research remains to be done.
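
As an illustration of the LIME-based direction explored in [11], the following hedged sketch explains a single match/non-match prediction in terms of the per-pair similarity features it was made from; the feature names, the stand-in random data, and the random-forest stand-in for the actual matcher are all assumptions for the sake of a runnable example.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["name_sim", "address_sim", "phone_sim", "zip_sim"]

# Stand-in per-pair similarity features and labels; replace with real matcher data.
X_train = np.random.rand(200, 4)
y_train = (X_train.mean(axis=1) > 0.6).astype(int)
model = RandomForestClassifier().fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["non-match", "match"], discretize_continuous=True)
exp = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(exp.as_list())   # which similarity features pushed the prediction and by how much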

7.4 Deep Learning can be Easily Fooled

A series of recent works shows that DL models (especially for image recognition) can be easily fooled by perturbing the images in an adversarial manner. The sub-field of adversarial DL [37, 47] studies the problem of constructing synthetic examples, by slightly modifying real examples from the training data, such that the trained DL model (or any ML model) makes an incorrect prediction with high confidence. While this is indeed a long-term concern, most DC tasks are collaborative and limited to an enterprise. Furthermore, there are research efforts that propose DL models that are more resistant to adversarial attacks, such as [30].

7.5 Building Deep Learning Models for Data Curation is “Just Engineering”

Many DC researchers might look at the process of building DL models for DC and simply dismiss it as a pure engineering effort. And they are indeed correct! Despite its stunning success, DL is still in its infancy, and the theory of DL is still being developed. To a DC researcher used to surveying the well-organized garden of conferences such as VLDB/SIGMOD/ICDE, the DL research landscape might look like the wild west.

In the early stages, researchers might just apply an existing DL model or algorithm for a DC task. Or they might slightly tweak a previous model to achieve better results. We argue that database conferences must provide a safe zone in which these DL explorations are conducted in a principled manner. One could take inspiration from the computer vision community. They created a large dataset (ImageNet [9]) that provided a benchmark by which different DL architectures and algorithms can be compared. They also created one or more workshops focused on applying DL to specific tasks in computer vision (such as image recognition and scene understanding). The database community has its own TPC series of synthetic datasets that have been influential in benchmarking database systems. Efforts similar to TPC are essential for the success of DL-driven DC.

7.6 Deep Learning is Data Hungry

Indeed, this is one of the major issues in adopting DL for DC. Most classical DL architectures have many hidden layers with millions of parameters to learn, which requires a large amount of training data. Unfortunately, the amount of training data available for DC tasks is often small. The good news is that there exists a wide variety of techniques and tricks in DL's arsenal that can help address this issue, including the techniques described in Section 6.

8 A CALL TO ARMS

In this paper, we make two key observations: Data Curation is a long-standing problem that needs novel solutions in order to handle the emerging big data ecosystem, and Deep Learning is gaining traction across many disciplines, both inside and outside computer science. The meeting of these two disciplines will unleash a series of research activities that will lead to usable solutions for many DC tasks. We identified research opportunities in learning distributed representations of database-aware objects such as tuples or columns, and in designing DC-aware DL architectures. We described a number of promising approaches to tame DL's hunger for data. We also discussed early successes in using DL for important DC tasks, such as data discovery and entity matching, that required novel adaptations of DL techniques. We make a call to arms for the database community in general, and the DC community in particular, to seize this opportunity to significantly advance the area, while keeping in mind the risks and mitigations that were also highlighted in this paper.

REFERENCES

[1] P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, and D. Santoro. Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB, 2015.
[2] L. Berti-Equille, H. Harmouch, F. Naumann, N. Novelli, and S. Thirumuruganathan. Discovery of genuine functional dependencies from relational data with missing values. PVLDB, 11(8):880–892, 2018.
[3] Ö. Ö. Çakal, M. Mahdavi, and Z. Abedjan. CLRL: Feature engineering for cross-language record linkage. In EDBT, pages 678–681, 2019.
[4] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Local embeddings for relational data integration. arXiv preprint arXiv:1909.01120, 2019.
[5] F. Chollet. Deep Learning with Python. Manning Publications, 2018.
[6] P. Cui, X. Wang, J. Pei, and W. Zhu. A survey on network embedding. CoRR, abs/1711.08752, 2017.
[7] M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, 2013.
[8] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The data civilizer system. In CIDR, 2017.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] V. Di Cicco, D. Firmani, N. Koudas, P. Merialdo, and D. Srivastava. Interpreting deep learning models for entity resolution: An experience report using LIME. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM '19, pages 8:1–8:4, New York, NY, USA, 2019. ACM.
[12] A. Doan, A. Y. Halevy, and Z. G. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.
[13] A. Ebaid, S. Thirumuruganathan, W. G. Aref, A. Elmagarmid, and M. Ouzzani. Explainer: Entity resolution explanations. In ICDE, pages 2000–2003. IEEE, 2019.
[14] M. Ebraheem, S. Thirumuruganathan, S. R. Joty, M. Ouzzani, and N. Tang. DeepER - deep entity resolution. VLDB, 12, 2018.
[15] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
[16] W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012.
[17] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. TODS, 33(2), 2008.
[18] R. C. Fernandez and S. Madden. Termite: a system for tunneling through heterogeneous data. arXiv preprint arXiv:1903.05008, 2019.
[19] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. Seeping semantics: Linking datasets using word embeddings for data discovery. In ICDE, 2018.
[20] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[21] R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, and F. Giannotti. A survey of methods for explaining black box models. arXiv:1802.01933, 2018.
[22] M. Günther. Freddy: Fast word embeddings in database systems. In Proceedings of the 2018 International Conference on Management of Data, pages 1817–1819. ACM, 2018.
[23] J. Heer, J. M. Hellerstein, and S. Kandel. Predictive interaction for data transformation. In CIDR, 2015.


[24] G. E. Hinton. Learning distributed representations of concepts. In CogSci, volume 1, page 12. Amherst, MA, 1986.
[25] X. Huang, Y. Peng, and M. Yuan. Cross-modal common representation learning by hybrid transfer network. In IJCAI, 2017.
[26] J. Kasai, K. Qian, S. Gurajada, Y. Li, and L. Popa. Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042, 2019.
[27] P. Konda, S. Das, P. S. G. C., A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. F. Naughton, S. Prasad, G. Krishnan, R. Deep, and V. Raghavendra. Magellan: Toward building entity matching management systems. PVLDB, 9(12):1197–1208, 2016.
[28] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
[29] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[30] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083, 2017.
[31] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[32] R. J. Miller. Big data curation. In COMAD, page 4, 2014.
[33] R. J. Miller, F. Nargesian, E. Zhu, C. Christodoulakis, K. Q. Pu, and P. Andritsos. Making open data transparent: Data discovery on open data. IEEE Data Eng. Bull., 41(2):59–70.
[34] M. J. Mior, I. Ororbia, and G. Alexander. Column2vec: Structural understanding via distributed representations of database schemas. arXiv preprint arXiv:1903.08621, 2019.
[35] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, 2018.
[36] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller. Table union search on open data. PVLDB, 11(7):813–825, 2018.
[37] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In EuroS&P, 2016.
[38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[39] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. arXiv:1711.10160, 2017.
[40] A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. In NIPS, 2017.
[41] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. arXiv preprint arXiv:1611.04500, 2016.
[42] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11), 2017.
[43] R. Shraga, A. Gal, and H. Roitman. Deep SMANE - deep similarity matrix adjustment and evaluation to improve schema matching. Technical Report IE/IS-2019-02, Technion–Israel Institute of Technology, 2019.
[44] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.
[45] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.
[46] M. Stonebraker and I. F. Ilyas. Data integration: The current status and the way forward. IEEE Data Eng. Bull., 41(2):3–9, 2018.
[47] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
[48] S. Thirumuruganathan, S. Hasan, N. Koudas, and G. Das. Approximate query processing using deep generative models. arXiv preprint arXiv:1903.10000, 2019.
[49] S. Thirumuruganathan, M. Ouzzani, and N. Tang. Explaining entity resolution predictions: Where are we and what needs to be done? 2019.
[50] S. Thirumuruganathan, S. A. P. Parambath, M. Ouzzani, N. Tang, and S. Joty. Reuse and adaptation for entity resolution through transfer learning. arXiv preprint arXiv:1809.11084, 2018.
[51] R. Wu, S. Chaba, S. Sawlani, X. Chu, and S. Thirumuruganathan. AutoER: Automated entity resolution using generative modelling. arXiv preprint arXiv:1908.06049, 2019.
[52] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391–3401, 2017.
