
Getting Data Right

Tackling the Challenges of Big Data Volume and Variety

Chapters by Michael Stonebraker and Tom Davenport

PREVIEW EDITION

Compliments of Tamr


Fuel your decision-making with all available data

In unifying enterprise data for better analytics, Tamr unifies enterprise organizations, bringing business and IT together on mission-critical questions and the information needed to answer them.

unified data. optimized decisions.


This Preview Edition of Getting Data Right: Tackling the Challenges of Big Data Volume and Variety is a work in progress. The final book is expected in late 2015.


Getting Data Right
Tackling the Challenges of Big Data Volume and Variety

Chapters by Michael Stonebraker and Tom Davenport


ISBN: 978-1-491-93529-3

[LSI]

Getting Data Right
Edited by Shannon Cutt

Copyright © 2015 Tamr, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Shannon Cutt
Cover Designer: Randy Comer

June 2015: First Edition

Revision History for the First Edition
2015-06-17: First Release
2015-08-24: Second Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Getting Data Right, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.


Table of Contents

1. An Alternative Approach to Data Management
     Centralized Planning Approaches
     Common Information
     Information Chaos
     What Is to Be Done?
     Take a Federal Approach to Data Management
     Use All the New Tools at Your Disposal
     Don’t Model, Catalog
     Cataloging Tools
     Keep Everything Simple and Straightforward
     Use an Ecological Approach

2. The Solution: Data Curation at Scale
     Three Generations of Data Integration Systems
     Five Tenets for Success


CHAPTER 1

An Alternative Approach to Data Management

—Thomas H. Davenport

For much of the sixty years or so that organizations have been managing data in electronic form, there has been an overpowering desire to subdue it through centralized planning and architectural initiatives.

These initiatives have had a variety of names over the years, including the most familiar: “information architecture,” “information engineering,” and “master data management.” These initiatives have all had certain key attributes and beliefs in common:

• Data needs to be centrally controlled
• Modeling is an approach to controlling data
• Abstraction is a key to successful modeling
• An organization’s information should all be defined in a common fashion
• Priority is on efficiency in information storage (a given data element should only be stored once)
• Politics, ego, and other common human behaviors are irrelevant to data management (or at least not something that organizations should attempt to manage)

Each of these attributes has at least a grain of truth in it, but taken together and to their full extent, I have come to believe that they simply don’t work. I rarely find business users who believe they work either, and this dissatisfaction has been brewing for a long time. For example, in the 1990s I interviewed a marketing manager at Xerox Corporation who had also spent some time in IT at the same company. He explained that the company had “tried information architecture” for 25 years, but got nowhere; they always thought they were doing it incorrectly.

Centralized Planning Approaches

Most organizations have had similar results from their centralized architecture and planning approaches.

Not only do centralized planning approaches waste time and money, they also drive a wedge between those who are planning them and those who will actually use the information and technology. Regulatory submissions, abstract meetings, and incongruous goals can lead to months of frustration without results.

The complexity and detail of centralized planning approaches often mean that they are never completed, and when they are finished, managers often decide not to implement them. The resources devoted to central data planning are often redeployed into other IT projects of more tangible value. If by chance the plans are implemented, they are hopelessly out of date by the time they go into effect.

To illustrate how the key tenets of centralized information planning are not consistent with real organizational behavior, let’s look at one of them: the assumption that all information needs to be common.

Common Information

Common information—agreement within an organization on how to define and use key data elements—is a useful thing, to be sure. But it’s also helpful to know that uncommon information—information definitions that suit the purposes of a particular group or individual—can also be useful to a particular business function, unit, or work group. Companies need to strike a balance between these two desirable goals.


After speaking with many managers and professionals about common information, and reflecting on the subject carefully, I formulated “Davenport’s Law of Common Information” (you can Google it, but don’t expect a lot of results). If by some strange chance you haven’t heard of Davenport’s Law, it goes like this:

The more an organization knows or cares about a particular business entity, the less likely it is to agree on a common term and meaning for it.

I first noticed this paradoxical observation at American Airlines more than a decade ago. They told me during a research visit that they had eleven different usages of the term “airport.” As a frequent traveler on their planes, I was initially a bit concerned about this, but when they explained it, the proliferation of meanings made sense. They said that the cargo workers at American Airlines viewed anyplace you can pick up or drop off cargo as the airport; the maintenance people viewed anyplace you can fix an airplane as the airport; the people who worked with the International Air Transport Association relied on its list of international airports; and so on.

The next week I was doing some consulting work at Union Pacific Railroad, where they sheepishly admitted at some point during my visit that they had great debates about what constitutes a “train.” For some, it’s an abstract scheduling entity; for others it’s the locomotive; for others it’s the locomotive and whatever railcars it’s pulling at the time. By an amazing twist of fate, shortly thereafter I visited the U.S. Department of Justice; they were going through an information modeling exercise, and ran into a roadblock in creating a common definition of “trial.”

Information Chaos

So just like Newton being hit on the head with an apple and discovering gravity, the key elements of Davenport’s Law hit me like a brick. This was why organizations were having so many problems creating consensus around key information elements. I also formulated a few corollaries to the law, such as:

If you’re not arguing about what constitutes a “customer,” your organization is probably not very passionate about customers.

Davenport’s Law, in my humble opinion, makes it much easier to understand why companies all over the world have difficulty establishing common definitions of key terms around their organizations.

Of course, this should not be an excuse for organizations to allow alternative meanings of key terms to proliferate. Even though there is a good reason why they proliferate, organizations may have to limit—sometimes even stop—the proliferation of meanings and agree on one meaning for each term. Otherwise they will continue to find that when the CEO asks multiple people how many employees a company has, he or she will get different answers. The proliferation of meanings, however justifiable, leads to information chaos.

But Davenport’s Law offers one more useful corollary about how to stop the proliferation of meanings. Here it is:

A manager’s passion for a particular definition of a term will not be quenched by a data model specifying an alternative definition.

If a manager has a valid reason to prefer a particular meaning of a term, he or she is unlikely to be persuaded to abandon it by a complex, abstract data model that is difficult to understand in the first place, and is likely never to be implemented.

Is there a better way to get adherence to a single definition of a term?

Here’s one final corollary:

Consensus on the meaning of a term throughout an organization is achieved not by data architecture, but by data arguing.

Data modeling doesn’t often lead to dialog because it’s simply not comprehensible to most nontechnical people. If people don’t understand your data architecture, it won’t stop the proliferation of meanings.

What Is to Be Done?

There is little doubt that something needs to be done to make data integration and management easier. In my research, I’ve conducted more than 25 extended interviews with data scientists about what they do and how they go about their jobs. I concluded that a more appropriate title for data scientists might actually be “data plumber.” It is often so difficult to extract, clean, and integrate data that data scientists can spend up to 90% of their time doing those tasks. It’s no wonder that big data often involves “small math”—after all the preparation work, there isn’t enough time left to do sophisticated analytics.

This is not a new problem in data analysis. The dirty little secret of the field is that someone has always had to do a lot of data preparation before the data can be analyzed. The problem with big data is partly that there is a large volume of it, but mostly that we are often trying to integrate multiple sources. Combining multiple data sources means that for each source we have to determine how to clean, format, and integrate it. The more sources and types of data, the more plumbing work is required.

So let’s assume that data integration and management are necessary evils. But what particular approaches to them are most effective? Throughout the remainder of this chapter, I’ll describe five approaches to realistic, effective data management, including:

1. Take a federal approach to data management
2. Use all of the new tools at your disposal
3. Don’t model, catalog
4. Keep everything simple and straightforward
5. Use an ecological approach

Take a Federal Approach to Data Management

Federal political models—of which the United States is one example—don’t try to get consensus on every issue. They have some laws that are common throughout the country, and some that are allowed to vary from state to state or by region or city. It’s a hybrid approach to the centralization/decentralization issue that bedevils many large organizations. Its strength is its practicality, in that it’s easier to get consensus on some issues than on all of them. If there is a downside to federalism, it’s that there is usually a lot of debate and discussion about which rights are federal, and which are states’ or other units’ rights. The United States has been arguing about this issue for more than 200 years.

While federalism does have some inefficiencies, it’s a good model for data management. It means that some data should be defined commonly across the entire organization, and some should be allowed to vary. Some should have a lot of protections, and some should be relatively open. That will reduce the overall effort to manage data, simply because not everything will have to be tightly managed.

Your organization will, however, have to engage in some “data arguing.” Hashing things out around a table is the best way to resolve key issues in a federal data approach. You will have to argue about which data should be governed by corporate rights, and which will be allowed to vary. Once you have identified corporate data, you’ll then have to argue about how to deal with it. But I have found that if managers feel that their issues have been fairly aired, they are more likely to comply with a policy that goes against those issues.

Use All the New Tools at Your Disposal

We now have a lot of powerful tools for processing and analyzing data, but we haven’t had them for cleaning, integrating, and “curating” data. (“Curating” is a term often used by librarians, and there are typically many of them in pharmaceutical firms who manage scientific literature.) These tools are sorely needed and are beginning to emerge. One source I’m close to is a startup called Tamr, which aims to help “tame” your data using a combination of machine learning and crowdsourcing. Tamr isn’t the only new tool for this set of activities, and I am an advisor to the company, so I would advise you to do your own investigation. The founders of Tamr are Andy Palmer and Michael Stonebraker. Palmer is a serial entrepreneur and incubator founder in the Boston area.

Stonebraker is the database architect behind INGRES, Vertica, VoltDB, Paradigm4, and a number of other database tools. He’s also a longtime computer science professor, now at MIT. As noted in his chapter of this book, we have a common view of how well centralized information architecture approaches work in large organizations.

In a research paper published in 2013, Stonebraker and several co-authors wrote that they tested “Data-Tamer” (as it was then known) in three separate organizations. They found that the tool reduced the cost of data curation in those organizations by about 90%.

I like the idea that Tamr uses two separate approaches to solving the problem. If the data problem is somewhat repetitive and predictable, the machine learning approach will develop an algorithm that will do the necessary curation. If the problem is a bit more ambiguous, the crowdsourcing approach can ask people who are familiar with the data (typically the owner of that data source) to weigh in on its quality and other attributes. Obviously the machine learning approach is more efficient, but crowdsourcing at least spreads the labor around to the people who are best qualified to do it. These two approaches are, together, more successful than the top-down approaches that many large organizations have employed.

A few months before writing this chapter, I spoke with several managers from companies that are working with Tamr. Thomson Reuters is using the technology to curate “core entity master” data—creating clear and unique identities of companies and their parents and subsidiaries. Previous in-house curation efforts, relying on a handful of data analysts, found that 30%-60% of entities required manual review. Thomson Reuters believed manual integration would take up to six months to complete and would achieve 95% precision (suggested matches that were true duplicates) and 95% recall (true duplicates actually identified).

Thomson Reuters looked to Tamr’s machine-driven, human-guided approach to improve this process. After converting the company’s XML files to CSVs, Tamr ingested three core data sources—factual data on millions of organizations with more than 5.4 million records. Tamr de-duplicated the records and used “fuzzy matching” to find suggested matches, with the goal of achieving high accuracy rates while reducing the number of records requiring review. In order to scale the effort and improve accuracy, Tamr applied machine learning algorithms to a small training set of data and fed guidance from Thomson Reuters’ experts back into the system.
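Tamr’s actual matching models are not described here, so the following is only a toy sketch of what “fuzzy matching” of duplicate organization records can look like, using a normalized edit-distance ratio from Python’s standard library. The record contents and the 0.85 threshold are invented for illustration; a production system would learn its matching rules from training data and expert feedback.

```python
# Toy sketch of fuzzy matching for duplicate organization records.
# Record contents and the similarity threshold are invented for illustration.
from difflib import SequenceMatcher
from itertools import combinations

def normalize(value: str) -> str:
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace()).strip()

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

records = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "ACME Corporation Inc."},
    {"id": 3, "name": "Globex Industries"},
]

THRESHOLD = 0.85  # assumed cutoff between "suggested match" and "not a match"

suggested_matches = [
    (a["id"], b["id"], round(similarity(a["name"], b["name"]), 2))
    for a, b in combinations(records, 2)
    if similarity(a["name"], b["name"]) >= THRESHOLD
]
print(suggested_matches)  # [(1, 2, 0.89)] -- a candidate duplicate routed for review
```

Pairs that clear the threshold become suggested matches; anything below it can be dropped or sent to a human reviewer, which is where the expert feedback described above comes in.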

The “big pharma” company Novartis has many different sources of biomedical data that it employs in research processes, making curation difficult. Mark Schreiber, then an “informatician” at Novartis Institutes for Biomedical Research (he has since moved to Merck), oversaw the testing of Tamr going all the way back to its academic roots at MIT. He is particularly interested in the tool’s crowdsourcing capabilities, as he wrote in a blog post:

“The approach used gives you a critical piece of the workflow bridging the gap between the machine learning/automated data improvement and the curator. When the curator isn’t confident in the prediction or their own expertise, they can distribute tasks to your data producers and consumers to ask their opinions and draw on their expertise and institutional memory, which is not stored in any of your data systems.”

I also spoke with Tim Kasbe, the COO of Gloria Jeans, which is the largest “fast fashion” retailer in Russia and Ukraine. Gloria Jeans has tried out Tamr on several different data problems, and found it particularly useful for identifying and removing duplicate loyalty program records. Here are some results from that project:

We loaded data for about 100,000 people and families and ran our algorithms on them and found about 5,000 duplicated entries. A portion of these represented people or families that had signed up for multiple discount cards. In some cases, the discount cards had been acquired in different locations or different contact information had been used to acquire them. The whole process took about an hour and did not need deep technical staff due to the simple and elegant Tamr user experience. Getting to trustworthy data to make good and timely decisions is a huge challenge this tool will solve for us, which we have now unleashed on all our customer reference data, both inside and outside the four walls of our company.

I am encouraged by these reports that we are on the verge of a breakthrough in this domain. Don’t take my word for it—do a proof of concept with one of these types of tools.

Don’t Model, Catalog

One of the paradoxes of IT planning and architecture is that those activities have made it more difficult for people to find the data they need to do their work. According to Gartner, much of the roughly $3-4 trillion invested in enterprise software over the last 20 years has gone toward building and deploying software systems and applications to automate and optimize key business processes in the context of specific functions (sales, marketing, manufacturing) and/or geographies (countries, regions, states, etc.). As each of these idiosyncratic applications is deployed, an equally idiosyncratic data source is created. The result is that data is extremely heterogeneous and siloed within organizations.

For generations, companies have created “data models,” “master data models,” and “data architectures” that lay out the types, locations, and relationships for all the data that they have now and will have in the future. Of course, those models rarely get implemented exactly as planned, given the time and cost involved. As a result, organizations have no guide to what data they actually have in the present and how to find it. Instead of creating a data model, they should create a catalog of their data—a straightforward listing of what data exists in the organization, where it resides, who’s responsible for it, and so forth.

One other reason why companies don’t create simple catalogs of their data is that the result is often somewhat embarrassing and irrational. Data are often duplicated many times across the organization. Different data are referred to by the same term, and the same data by different terms. A lot of data that the organization no longer needs is still hanging around, and data that the organization could really benefit from is nowhere to be found. It’s not easy to face up to all of the informational chaos that a cataloging effort can reveal.

Perhaps needless to say, however, cataloging data is worth the trouble and initial shock at the outcome. A data catalog that lists what data the organization has, what it’s called, where it’s stored, who’s responsible for it, and other key metadata can easily be the most valuable information offering that an IT group can create.
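To show how little structure such a catalog needs, here is a minimal sketch of a catalog entry and a keyword search over it. The field names, example entries, and search behavior are invented for illustration and are not a prescription for any particular tool.

```python
# Minimal sketch of a data catalog: the fields follow the ones named in the text
# (what the data is called, where it resides, who's responsible) plus assumed extras.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str            # what the data is called
    location: str        # where it resides (system, path, or connection string)
    owner: str           # who's responsible for it
    description: str = ""
    tags: list[str] = field(default_factory=list)   # requires Python 3.9+

catalog: list[CatalogEntry] = [
    CatalogEntry("customer_master", "warehouse.crm.customers", "Jane Doe (Sales Ops)",
                 "Deduplicated customer records", ["customer", "pii"]),
    CatalogEntry("loyalty_signups", "s3://retail-raw/loyalty/", "Retail IT",
                 "Raw loyalty-program signup files", ["customer", "raw"]),
]

def find(term: str) -> list[CatalogEntry]:
    """Naive keyword search over names, descriptions, and tags."""
    term = term.lower()
    return [e for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t for t in e.tags)]

print([e.name for e in find("customer")])  # -> ['customer_master', 'loyalty_signups']
```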

Cataloging Tools

Given that IT organizations have been preoccupied with modeling the future over describing the present, enterprise vendors haven’t really addressed the catalog tool space to a significant degree. There are several catalog tools for individuals and small businesses, and several vendors of ETL (extract, transform, and load) tools have some cataloging capabilities relative to their own tools. Some also tie a catalog to a data governance process, although “governance” is right up there with “bureaucracy” as a term that makes many people wince.

At least a few data providers and vendors are actively pursuing catalog work, however. One company, Enigma, has created a catalog for public data, for example. The company has compiled a set of public databases, and you can simply browse through their catalog (for free if you are an individual) and check out what data you can access and analyze. That’s a great model for what private enterprises should be developing, and I know of some companies (including Tamr, Informatica, Paxata, and Trifacta) that are developing tools to help companies develop catalogs.


In industries such as biotech and financial services, for example, you increasingly need to know what data you have. And it’s not only so you can respond to business opportunities. Industry regulators are also concerned about what data you have and what you are doing with it. In biotech companies, for example, any data involving patients has to be closely monitored and its usage controlled. In financial services firms there is increasing pressure to keep track of your customers’ and partners’ “legal entity identifiers,” and to ensure that dirty money isn’t being laundered.

But if you don’t have any idea of what data you have today, you’re going to have a much tougher time adhering to the demands from regulators. You also won’t be able to meet the demands of your marketing, sales, operations, or HR departments. Knowing where your data is seems perhaps the most obvious tenet of information management, but thus far it has been among the most elusive.

Keep Everything Simple and Straightforward

While data management is a complex subject, traditional information architectures are generally more complex than they need to be. They are usually incomprehensible not only to nontechnical people, but also to the technical people who didn’t have a hand in creating them. From IBM’s Business Systems Planning—one of the earliest architectural approaches—up through Master Data Management, architectures feature complex and voluminous flow diagrams and matrices. Some look like the circuitry diagram for the latest Intel microprocessor. “Master Data Management” has the reasonable objective of ensuring that all important data within an organization comes from a single authoritative source, but it often gets bogged down in discussions about who’s in charge of data and whose data is most authoritative.

It’s unfortunate that information architects don’t emulate architects of physical buildings. While they definitely require complex diagrams full of technical details, good building architects don’t show those blueprints to their clients. For clients, they create simple and easy-to-digest sketches of what the building will look like when it’s done. If it’s an expensive or extensive building project, they may create three-dimensional models of the finished structure.

More than thirty years ago, Michael Hammer and I created a new approach to architecture based primarily on “principles.” These are simple, straightforward articulations of what an organization believes and wants to achieve with information management; they may be the equivalent of a sketch for a physical architect. Here are some examples of the data-oriented principles from that project:

• Data will be owned by its originator but will be accessible to higher levels;

• Critical data items in customer and sales files will conform to standards for name, form, and semantics;

• Applications should be processed where data resides.

We suggested that an organization’s entire list of principles—including those for technology infrastructure, organization, and applications, as well as data management—should take up no more than a single page. Good principles can be the drivers of far more detailed plans, but they should be articulated at a level that facilitates understanding and discussion by business people. In this age of digital businesses, such simplicity and executive engagement are far more critical than they were in 1984.

Use an Ecological Approach

I hope I have persuaded you that enterprise-level models (or really models at any level) are not sufficient to change individual and organizational behavior with respect to data. But now I will go even further and argue that neither models, nor technology, nor policy, nor any other single factor is enough to move behavior in the right direction. Instead, organizations need a broad, ecological approach to data-oriented behaviors.

In 1997 I wrote a book called Information Ecology: Mastering the Information and Knowledge Environment. It was focused on this same idea—that multiple factors and interventions are necessary to move an organization in a particular direction with regard to data and technology management. Unlike engineering-based models, ecological approaches assume that technology alone is not enough to bring about the desired change, and that with multiple interventions an environment can evolve in the right direction. In the book, I describe one organization, a large UK insurance firm called Standard Life, that adopted the ecological approach and made substantial progress on managing its customer and policy data. Of course, no one—including Standard Life—ever achieves perfection in data management; all one can hope for is progress.

In Information Ecology, I discussed the influence on a company’s data environment of a variety of factors, including staff, politics, strategy, technology, behavior and culture, process, architecture, and the external information environment. I’ll explain the lesser-known aspects of this model briefly.

Staff, of course, refers to the types of people and skills that are present to help manage information. Politics refers primarily to the type of political model for information that the organization employs; as noted earlier, I prefer federalism for most large companies. Strategy is the company’s focus on particular types of information and particular objectives for it. Behavior and culture refers to the particular information behaviors (e.g., not creating new data sources and reusing existing ones) that the organization is trying to elicit; in the aggregate they constitute “information culture.” Process involves the specific steps that an organization undertakes to create, analyze, disseminate, store, and dispose of information. Finally, the external information environment consists of information sources and uses outside of the organization’s boundaries that the organization may use to improve its information situation. Most organizations have architectures and technology in place for data management, but they have few, if any, of these other types of interventions.

I am not sure that these are now (or ever were) the only types of interventions that matter, and in any case the salient factors will vary across organizations. But I am quite confident that an approach that employs multiple factors to achieve an objective (for example, to achieve greater use of common information) is more likely to succeed than one focused only on technology or architectural models.

Together, the approaches I’ve discussed in this chapter comprise a common-sense philosophy of data management that is quite different from what most organizations have employed. If for no other reason, organizations should try something new because so many have yet to achieve their desired state of data management.


CHAPTER 2

The Solution: Data Curation at Scale

—Michael Stonebraker, Ph.D.

Integrating data sources isn’t a new challenge. But the challenge has intensified in both importance and difficulty, as the volume and variety of usable data, and enterprises’ ambitious plans for analyzing and applying it, have increased. As a result, trying to meet today’s data integration demands with yesterday’s data integration approaches is impractical.

In this chapter, we look at the three generations of data integration products and how they have evolved. We look at new third-generation products that deliver a vital missing layer in the data integration “stack”: data curation at scale. Finally, we look at five key tenets of an effective system for data curation at scale.

Three Generations of Data Integration Systems

Data integration systems emerged to enable business analysts to access converged data sets directly for analyses and applications.

First-generation data integration systems (data warehouses) arrived on the scene in the 1990s. Led by the major retailers, customer-facing data (e.g., item sales, products, customers) were assembled in a data store and used by retail buyers to make better purchase decisions. For example, pet rocks might be out of favor while Barbie dolls might be “in.” With this intelligence, retailers could discount the pet rocks and tie up the Barbie doll factory with a big order. Data warehouses typically paid for themselves within a year through better buying decisions.

First-generation data integration systems were termed ETL (Extract, Transform, and Load) products. They were used to assemble the data from various sources (usually fewer than 20) into the warehouse. But enterprises underestimated the “T” part of the process - specifically, the cost of data curation (mostly, data cleaning) required to get heterogeneous data into the proper format for querying and analysis. Hence, the typical data warehouse project was usually substantially over budget and late because of the difficulty of data integration inherent in these early systems.

This led to a second generation of ETL systems, whereby the major ETL products were extended with data cleaning modules, additional adapters to ingest other kinds of data, and data cleaning tools. In effect, the ETL tools were extended to become data curation tools.

Data curation involves the following steps (a brief code sketch follows the list):

• ingesting data sources
• cleaning errors from the data (-99 often means null)
• transforming attributes into other ones (for example, Euros to dollars)
• performing schema integration to connect disparate data sources
• performing entity consolidation to remove duplicates
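The list names the steps abstractly; as a minimal sketch (not any vendor’s pipeline), the cleaning, transformation, and schema-integration steps might look like the following on a single toy record. The exchange rate and field names are assumptions made for illustration; only the -99 sentinel and the Euro-to-dollar conversion come from the list itself.

```python
# Sketch of three curation steps on one toy record: cleaning a -99 sentinel to
# null, converting Euros to dollars, and renaming a column during schema
# integration. The rate and field names are assumptions.
EUR_TO_USD = 1.10   # assumed exchange rate for illustration

def clean(record: dict) -> dict:
    """Replace the -99 "missing value" sentinel with None."""
    return {k: (None if v == -99 else v) for k, v in record.items()}

def transform(record: dict) -> dict:
    """Convert a revenue_eur attribute into revenue_usd."""
    out = dict(record)
    if out.get("revenue_eur") is not None:
        out["revenue_usd"] = round(out.pop("revenue_eur") * EUR_TO_USD, 2)
    return out

def integrate(record: dict, column_map: dict) -> dict:
    """Rename source columns to the target schema (a crude form of schema integration)."""
    return {column_map.get(k, k): v for k, v in record.items()}

raw = {"cust_name": "Acme Corporation", "revenue_eur": 1200.0, "employees": -99}
curated = integrate(transform(clean(raw)), {"cust_name": "customer_name"})
print(curated)  # {'customer_name': 'Acme Corporation', 'employees': None, 'revenue_usd': 1320.0}
```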

In general, data curation systems followed the architecture of earlier first-generation systems: toolkits oriented toward professional programmers. In other words, they were programmer productivity tools.

Second-generation data curation tools have two substantial weaknesses:

Scalability
Enterprises want to curate “the long tail” of enterprise data. They have several thousand data sources, everything from company budgets in the CFO’s spreadsheets to peripheral operational systems. There is “business intelligence gold” in the long tail, and enterprises wish to capture it - for example, for cross-selling of enterprise products. Furthermore, the rise of public data on the web leads business analysts to want to curate additional data sources. Anything from weather data to customs records to real estate transactions to political campaign contributions is readily available. However, in order to capture long-tail enterprise data as well as public data, curation tools must be able to deal with hundreds to thousands of data sources rather than the typical few tens of data sources.

Architecture
Second-generation tools typically are designed for central IT departments. A professional programmer does not know the answers to many of the data curation questions that arise. For example, are “rubber gloves” the same thing as “latex hand protectors”? Is an “ICU50” the same kind of object as an “ICU”? Only business people in line-of-business organizations can answer these kinds of questions. However, business people are usually not in the same organization as the programmers running data curation projects. As such, second-generation systems are not architected to take advantage of the humans best able to provide curation help.

These weaknesses led to a third generation of data curation products, which we term scalable data curation systems. Any data curation system should be capable of performing the five tasks noted above. However, first- and second-generation ETL products will only scale to a small number of data sources, because of the amount of human intervention required.

To scale to hundreds or even thousands of data sources, a new approach is needed - one that:

1. Uses statistics and machine learning to make automatic decisions wherever possible.

2. Asks a human expert for help only when necessary.

Instead of an architecture with a human controlling the process with computer assistance, move to an architecture with the computer running an automatic process, asking a human for help only when required. And ask the right human: the data creator or owner (a business expert), not the data wrangler (a programmer).
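A minimal sketch of that inversion follows, assuming a classifier that returns a confidence score and a hypothetical ask_expert callback standing in for the data owner; the 0.95 threshold and the toy stand-ins are arbitrary illustrations, not a recommendation or any product’s API.

```python
# Sketch of a machine-driven, human-guided loop: decide automatically when
# confidence is high, otherwise escalate to the data owner. All names, scores,
# and the threshold are invented for illustration.
AUTO_ACCEPT = 0.95  # assumed confidence above which no human is consulted

def curate(pair, classifier, ask_expert):
    """classifier returns (is_match, confidence); ask_expert returns a yes/no answer."""
    is_match, confidence = classifier(pair)
    if confidence >= AUTO_ACCEPT:
        return is_match, "automatic"
    return ask_expert(pair), "expert-reviewed"  # route to a business expert, not a programmer

def toy_classifier(pair):
    same = pair[0].lower() == pair[1].lower()
    return same, (0.99 if same else 0.60)  # invented confidence scores

def toy_expert(pair):
    return input(f"Are {pair[0]!r} and {pair[1]!r} the same thing? (y/n) ") == "y"

print(curate(("rubber gloves", "Rubber Gloves"), toy_classifier, toy_expert))
# -> (True, 'automatic'); a pair like ("rubber gloves", "latex hand protectors")
#    would fall below the threshold and be sent to the expert instead.
```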


Obviously, enterprises differ in the required accuracy of curation, so third-generation systems must allow an enterprise to make tradeoffs between accuracy and the amount of human involvement. In addition, third-generation systems must contain a crowdsourcing component that makes it efficient for business experts to assist with curation decisions. Unlike Amazon’s Mechanical Turk, however, a data-curation crowdsourcing model must be able to accommodate a hierarchy of experts inside an enterprise as well as various kinds of expertise. Therefore, we call this component an expert sourcing system to distinguish it from the more primitive crowdsourcing systems.

In short: a third-generation data curation product is an automated system with an expert sourcing component. Tamr is an early example of this third generation of systems.

Third-generation systems can co-exist with currently-in-place second-generation systems, which can curate the first tens of data sources to generate a composite result that in turn can be curated with the “long tail” by third-generation systems.

Table 2-1. Evolution of Three Generations of Data Integration Systems

First Generation (1990s)
    Approach: ETL
    Target data environment(s): Data Warehouse
    Users: IT/Programmers
    Integration philosophy: Top-down/rules-based/IT-driven
    Architecture: Programmer productivity tools (task automation)
    Scalability (# of data sources): 10s

Second Generation (2000s)
    Approach: ETL + Data Curation
    Target data environment(s): Data Warehouses or Data Marts
    Users: IT/Programmers
    Integration philosophy: Top-down/rules-based/IT-driven
    Architecture: Programming productivity tools (task automation with machine assistance)
    Scalability (# of data sources): 10s to 100s

Third Generation (2010s)
    Approach: Scalable Data Curation
    Target data environment(s): Data Lakes & Self-Service Data Analytics
    Users: Data Scientists, Data Stewards, Data Owners, Business Analysts
    Integration philosophy: Bottom-up/demand-based/business-driven
    Architecture: Machine-driven, human-guided process
    Scalability (# of data sources): 100s to 1000s+

To summarize: ETL systems arose to deal with the transformation challenges in early data warehouses. They evolved into second-generation data curation systems with an expanded scope of offerings. Third-generation data curation systems, which have a very different architecture, were created to address the enterprise’s need for data source scalability.

Five Tenets for Success

Third-generation scalable data curation systems provide the architecture, automated workflow, interfaces, and APIs for data curation at scale. Beyond this basic foundation, however, are five tenets that are desirable in any third-generation system.

Tenet 1: Data curation is never done

Business analysts and data scientists have an insatiable appetite for more data. This was brought home to me about a decade ago during a visit to a beer company in Milwaukee. They had a fairly standard data warehouse of sales of beer by distributor, time period, brand, and so on. I visited during a year when El Niño was forecast to disrupt winter weather in the US. Specifically, it was forecast to be wetter than normal on the West Coast and warmer than normal in New England. I asked the business analysts: “Are beer sales correlated with either temperature or precipitation?” They replied, “We don’t know, but that is a question we would like to ask.” However, temperature and precipitation were not in the data warehouse, so asking was not an option.
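Had the weather data been curated into the warehouse, the analysts’ question would have been easy to answer; the sketch below shows the shape of it on entirely invented weekly numbers (the figures and the join are illustrative, not real sales data).

```python
# Toy version of the analysts' question, on invented weekly data: once average
# temperature has been joined to beer sales, the correlation is one line.
from statistics import correlation  # Python 3.10+

weekly = [  # (avg_temp_f, cases_sold) for a few hypothetical weeks
    (38, 910), (45, 980), (52, 1040), (61, 1200), (70, 1310), (78, 1420),
]
temps = [t for t, _ in weekly]
sales = [s for _, s in weekly]

print(round(correlation(temps, sales), 2))  # close to 1.0 on this toy data
```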

The demand from warehouse users to correlate more and more data elements for business value leads to additional data curation tasks. Moreover, whenever a company makes an acquisition, it creates a data curation problem (digesting the acquired company’s data). Lastly, the treasure trove of public data on the web (such as temperature and precipitation data) is largely untapped, leading to more curation challenges.

Even without new data sources, the collection of existing data sources is rarely static. Hence, inserts and deletes to these sources generate a pipeline of incremental updates to a data curation system. Between the requirements of new data sources and updates to existing ones, it is obvious that data curation is never done, ensuring that any project in this area will effectively continue indefinitely. Realize this and plan accordingly.

One obvious consequence of this tenet concerns consultants. If you hire an outside service to perform data curation for you, then you will have to rehire them for each additional task. This will give the consultant a guided tour through your wallet over time. In my opinion, you are much better off developing in-house curation competence over time.

Tenet 2: A PhD in AI can’t be a requirement for success

Any third-generation system will use statistics and machine learning to make automatic or semi-automatic curation decisions. Inevitably, it will use sophisticated techniques such as T-tests, regression, predictive modeling, data clustering, and classification. Many of these techniques will entail training data to set internal parameters. Several will also generate recall and/or precision estimates.

These are all techniques understood by data scientists. However, there will be a shortage of such people for the foreseeable future, until colleges and universities produce substantially more than at present. Also, it is not obvious that one can “retread” a business analyst into a data scientist. A business analyst only needs to understand the output of SQL aggregates; in contrast, a data scientist is typically knowledgeable in statistics and various modeling techniques.

As a result, most enterprises will be lacking in data science expertise. Therefore, any third-generation data curation product must use these techniques internally, but not expose them in the user interface. Mere mortals must be able to use scalable data curation products.

Tenet 3: Fully automatic data curation is not likely to be successful

Some data curation products expect to run fully automatically. In other words, they translate input data sets into output without human intervention. Fully automatic operation is very unlikely to be successful in an enterprise for a variety of reasons. First, there are curation decisions that simply cannot be made automatically. For example, consider two records: one stating that restaurant X is at location Y while the second states that restaurant Z is at location Y. This could be a case where one restaurant went out of business and got replaced by a second one, or it could be a food court. There is no good way to know the answer to this question without human guidance.

Second, there are cases where data curation must have high reliability. Certainly, consolidating medical records should not create errors. In such cases, one wants a human to check all (or maybe just some) of the automatic decisions. Third, there are situations where specialized knowledge is required for data curation. For example, in a genomics application one might have two terms: ICU50 and ICE50. An automatic system might suggest that these are the same thing, since the lexical distance between the terms is low. However, only a human genomics specialist can decide this question.

For these reasons, any third-generation data curation system must be able to ask a human expert - the right human expert - when it is unsure of the answer. Therefore, one must have multiple domains in which a human can be an expert. Within a single domain, humans have a variable amount of expertise, from novice level to enterprise expert. Lastly, the system must avoid overloading the humans that it is scheduling. Therefore, when considering a third-generation data curation system, look for an embedded expert system with levels of expertise, load balancing, and multiple expert domains.
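As a toy illustration of that last sentence, expert sourcing with domains, expertise levels, and load balancing can be reduced to a routing rule like the one below. The names, domains, levels, and workload numbers are invented; this is not a description of any product’s expert system.

```python
# Toy sketch of expert sourcing: route a question to the least-overloaded expert
# in the right domain, preferring higher expertise. All values are invented.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    domain: str       # e.g., "genomics", "procurement"
    level: int        # 1 = novice ... 5 = enterprise expert
    open_tasks: int   # current workload, used for load balancing

experts = [
    Expert("Alice", "genomics", 5, open_tasks=7),
    Expert("Bob", "genomics", 3, open_tasks=1),
    Expert("Carol", "procurement", 4, open_tasks=0),
]

def route(question_domain: str, max_open_tasks: int = 5) -> Expert:
    """Prefer the most expert reviewer in the domain who is not overloaded."""
    available = [e for e in experts
                 if e.domain == question_domain and e.open_tasks < max_open_tasks]
    if not available:
        raise LookupError(f"no available expert for domain {question_domain!r}")
    return max(available, key=lambda e: e.level)

# "Is ICU50 the same kind of object as ICE50?" needs a genomics specialist.
print(route("genomics").name)   # -> Bob (Alice is more expert but overloaded)
```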

Tenet 4: Data curation must fit into the enterprise ecosystem

Every enterprise has a computing infrastructure in place. This includes a collection of DBMSs storing enterprise data, a collection of application servers and networking systems, and a set of installed tools and applications. Any new data curation system must fit into this existing infrastructure. For example, it must be able to extract from corporate databases, use legacy data cleaning tools, and export data to legacy data systems. Hence, an open environment is required whereby callouts are available to existing systems. In addition, adapters to common input and export formats are a requirement. Do not use a curation system that is a closed “black box.”


Tenet 5: A scheme for “finding” data sources must be present

A typical question to ask CIOs is “How many operational data systems do you have?” In all likelihood, they do not know. The enterprise is a sea of such data systems connected by a hodgepodge set of connectors. Moreover, there are all sorts of personal datasets, spreadsheets, and databases, as well as datasets imported from public web-oriented sources. Clearly, CIOs should have a mechanism for identifying data resources that they wish to have curated. Such a system must contain a data source catalog with information on a CIO’s data resources, as well as a query system for accessing this catalog. Lastly, an “enterprise crawler” is required to search the corporate intranet to locate relevant data sources. Collectively, this represents a scheme for “finding” enterprise data sources.
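A bare-bones sketch of such a scheme, assuming only a file share to crawl, might look like the following; a real enterprise crawler would also probe databases, applications, and other web-oriented sources, and the extensions, path, and fields here are illustrative guesses rather than a specification.

```python
# Toy sketch of the "finding" scheme: a crawler that walks a file tree, registers
# candidate data sources in a catalog, plus a minimal query over that catalog.
import os

DATA_EXTENSIONS = {".csv", ".xlsx", ".parquet", ".db"}   # assumed indicators of a dataset

def crawl(root: str) -> list[dict]:
    """Walk a directory tree and record anything that looks like a data source."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if os.path.splitext(fname)[1].lower() in DATA_EXTENSIONS:
                path = os.path.join(dirpath, fname)
                found.append({
                    "name": fname,
                    "location": path,
                    "size_bytes": os.path.getsize(path),
                })
    return found

def query(sources: list[dict], term: str) -> list[dict]:
    """Minimal query system: substring match on names and locations."""
    term = term.lower()
    return [s for s in sources
            if term in s["name"].lower() or term in s["location"].lower()]

sources = crawl("/shared")                 # hypothetical corporate share
print(len(sources), "candidate data sources found")
print([s["name"] for s in query(sources, "loyalty")])
```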

Collectively, these five tenets indicate the characteristics of a good third-generation data curation system. If you are in the market for such a product, then look for systems with these characteristics.
