Data Integration for Data-Intensive Science

Philip A. Bernstein

Dear Editor:

To perform analytic computation on data (scientific or otherwise), there are many preliminary, time-consuming steps to identify and integrate the relevant inputs. A typical sequence of steps is the following.

Data discovery: Find the data sets that are relevant to the question at hand. This may involve Web search and networking with other researchers, followed by deeper study to determine whether the data is really appropriate for the intended use.

Obtain access to data: Determine how to access the data in a desired format. This may involve retrieving the data or enabling the invocation of operations on the data in its home location. It may also involve legal issues, such as data ownership and privacy.

Convert data: Transform the data into a format that analytic tools can cope with. This may be as simple as loading a single densely populated numerical table from a spreadsheet into a database. But often the necessary transformations are more complex (e.g., a sparsely populated spreadsheet with many tabs, irregular tables, and a mixture of text and numbers). The transformations may require extracting structured information from unstructured (e.g., text) or semistructured (e.g., XML or HTML) sources.

Semantic integration: Determine the meaning of data fields and transform them so they can be compared or combined. This may involve cleaning the data to eliminate inconsistencies, deal with missing values, and identify erroneous values. This is often the most time-consuming step of data integration.

Labeling data products: The result of each computation should be labeled with metadata, so the provenance of each data product can be traced and the result can be integrated in future scientific analyses. Examples of useful metadata are the version of the data sources and where to find them, the software that performed the computation, and the person responsible for the computation.
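As a rough illustration of the last three steps (conversion, semantic integration, and labeling), the following Python sketch combines two small tables into one labeled data product. Everything in it is an assumption made for illustration: the two inline CSV sources, the field names, the choice of Celsius as the common unit, the plausibility range used to flag erroneous values, and the tool and analyst names are all hypothetical. It is a minimal sketch under those assumptions, not a description of any particular integration tool.

```python
import csv
import hashlib
import io
from datetime import datetime, timezone

# Hypothetical raw inputs: two sources that report the same quantity
# under different field names and in different units.
SOURCE_A = "site,temp_c\nA1,12.5\nA2,\nA3,13.1\n"
SOURCE_B = "station,temp_f\nB1,54.5\nB2,9999\n"


def convert(text):
    """Convert step: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def to_common_schema(rows, id_field, temp_field, unit):
    """Semantic integration step: map fields to one schema and one unit."""
    out = []
    for row in rows:
        raw = row[temp_field].strip()
        if not raw:
            continue  # skip rows with a missing value
        value = float(raw)
        if unit == "F":
            value = (value - 32.0) * 5.0 / 9.0  # Fahrenheit to Celsius
        if not -90.0 <= value <= 60.0:
            continue  # skip implausible values (assumed plausibility range)
        out.append({"station_id": row[id_field], "temp_c": round(value, 2)})
    return out


def label(records, sources, tool, person):
    """Labeling step: attach provenance metadata to the data product."""
    return {
        "records": records,
        "provenance": {
            # A content hash stands in for "the version of the data sources".
            "sources": [
                {"name": name, "sha256": hashlib.sha256(text.encode()).hexdigest()}
                for name, text in sources
            ],
            "software": tool,
            "responsible": person,
            "created": datetime.now(timezone.utc).isoformat(),
        },
    }


def integrate():
    """Run the three steps over both hypothetical sources."""
    merged = to_common_schema(convert(SOURCE_A), "site", "temp_c", "C")
    merged += to_common_schema(convert(SOURCE_B), "station", "temp_f", "F")
    return label(
        merged,
        [("source_a", SOURCE_A), ("source_b", SOURCE_B)],
        tool="integrate_sketch 0.1",       # hypothetical tool name
        person="(hypothetical analyst)",   # hypothetical person
    )
```

Calling integrate() returns a dictionary holding the cleaned records and a provenance entry for each source. Real scientific data would need far richer handling, for example sparse spreadsheets with many tabs or array-valued fields, which this sketch does not attempt.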

The above steps are harder when the data originate from different sources, when they are poorly documented, and when they are so large that moving them is impractical. A lot of technology is available to perform these integration steps; a short survey appears in Bernstein and Haas (2008). However, this technology is mostly targeted at professional information technology staff at large enterprises, not at the science community. That is, much of it is expensive and not open source, is primarily intended for use with relational databases and OLAP data warehouses, and does not support important scientific data types such as sequences and multidimensional arrays.

The problems associated with data integration in the sciences are nicely summarized in a report of a National Academies workshop in 2009 (Weidman and Arrison, 2010). The report includes experiences from major science projects in astronomy, biomedical computing, climate research, and high-energy physics. It also includes experiences of data management researchers who work on generic data-integration technology. The report is a good place to start for an understanding of the problems and promising solutions to data integration in the sciences. Some of the improvements that were suggested are the following:

1. Provide incentives to researchers to make their data reusable by documenting, cleaning, and packaging it in a form that others can access. Incentives might include extra funding for the work and providing recognition to those who do it.

2. To simplify data discovery, develop data registries and better Web search to find structured data.

3. Develop data transformation libraries for data conversion and semantic integration.

4. Develop domain-specific repositories where data is archived, with search tools to browse them.
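To make the idea of a data registry in suggestion 2 concrete, here is a minimal Python sketch of an in-memory registry with keyword search. The class names, fields, example entries, and example.org locations are all hypothetical and chosen only for illustration. A real registry would be a shared, persistent service with much richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One registry record describing where a data set lives (hypothetical schema)."""
    name: str
    description: str
    location: str
    keywords: set = field(default_factory=set)


class Registry:
    """A toy in-memory registry supporting registration and keyword search."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def search(self, *terms):
        """Return entry names ranked by how many search terms they match."""
        wanted = {t.lower() for t in terms}
        hits = []
        for entry in self._entries:
            words = {k.lower() for k in entry.keywords}
            words |= set(entry.description.lower().split())
            score = len(wanted & words)
            if score:
                hits.append((score, entry.name))
        # Highest score first; ties are broken alphabetically by name.
        return [name for _, name in sorted(hits, key=lambda h: (-h[0], h[1]))]


def build_example_registry():
    """Populate a registry with made-up entries for demonstration."""
    registry = Registry()
    registry.register(DatasetEntry(
        name="station-temps",
        description="daily surface temperature readings",
        location="https://example.org/data/station-temps",  # hypothetical location
        keywords={"climate", "temperature"},
    ))
    registry.register(DatasetEntry(
        name="sky-survey",
        description="catalog of observed galaxies",
        location="https://example.org/data/sky-survey",  # hypothetical location
        keywords={"astronomy"},
    ))
    return registry
```

For example, build_example_registry().search("climate", "temperature") returns only the station-temps entry, because the sky-survey entry matches neither term.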

Author Disclosure Statement

The author declares that no conflicting financial interests exist.

References

Bernstein, P.A., and Haas, L. (2008). Information integration in the enterprise. Commun ACM 51, 72–79.

Weidman, S., and Arrison, T. [rapporteurs]. (2010). Steps Toward Large-Scale Data Integration in the Sciences. (The National Academies Press, Washington, DC).

Address correspondence to:
Philip A. Bernstein
Microsoft Research
One Microsoft Way
Redmond, WA 98052

E-mail: [email protected]

Microsoft Research, Redmond, Washington.

OMICS: A Journal of Integrative Biology
Volume 15, Number 4, 2011
© Mary Ann Liebert, Inc.
DOI: 10.1089/omi.2011.0020


OMICS: A Journal of Integrative Biology 2011.15:241-241. Downloaded from online.liebertpub.com by Radboud Univ Nijmegen on 12/02/14. For personal use only.
