

Data Integration for Data-Intensive Science

Philip A. Bernstein

Dear Editor:

To perform analytic computation on data (scientific or otherwise), there are many preliminary, time-consuming steps to identify and integrate the relevant inputs. A typical sequence of steps is the following.

Data discovery: Find the data sets that are relevant to the question at hand. This may involve Web search and networking with other researchers, followed by deeper study to determine whether the data is really appropriate for the intended use.

Obtain access to data: Determine how to access the data in a desired format. This may involve retrieving the data or enabling the invocation of operations on the data in its home location. It may also involve legal issues, such as data ownership and privacy.

Convert data: Transform the data into a format that analytic tools can cope with. This may be as simple as loading a single densely populated numerical table from a spreadsheet into a database. But often the necessary transformations are more complex (e.g., a sparsely populated spreadsheet with many tabs, irregular tables, and a mixture of text and numbers). The transformations may require extracting structured information from unstructured (e.g., text) or semistructured (e.g., XML or HTML) sources.
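As an illustration of the conversion step, the following is a minimal Python sketch that pulls regular records out of an irregular spreadsheet tab exported as CSV. The sheet contents, the column names (station, temp_c, notes), and the helper names are hypothetical and chosen only for this example; real sheets usually need source-specific rules.

```python
import csv
import io

# Hypothetical export of an irregular spreadsheet tab: a title row, blank
# rows, a header row, and cells that mix text and numbers.
RAW_SHEET = """\
Station readings (draft),,
,,
station,temp_c,notes
A1,12.5,ok
A2,,sensor offline
,,
A3,n/a,
A4,9.75,recalibrated
"""


def to_float(cell):
    """Return the cell as a float, or None if it is empty or not numeric."""
    try:
        return float(cell)
    except ValueError:
        return None


def convert_sheet(text, header_marker="station"):
    """Extract regular records from an irregular sheet exported as CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    # Locate the header row instead of assuming it is the first line.
    header_index = next(
        i for i, row in enumerate(rows) if row and row[0] == header_marker
    )
    header = rows[header_index]
    records = []
    for row in rows[header_index + 1:]:
        if not any(cell.strip() for cell in row):
            continue  # skip fully blank rows
        record = dict(zip(header, (cell.strip() for cell in row)))
        record["temp_c"] = to_float(record["temp_c"])
        records.append(record)
    return records


records = convert_sheet(RAW_SHEET)
```

Non-numeric cells are kept as None rather than dropped, so that the decision about missing values is deferred to the later cleaning step.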

Semantic integration: Determine the meaning of data fields and transform them so they can be compared or combined. This may involve cleaning the data to eliminate inconsistencies, deal with missing values, and identify erroneous values. This is often the most time-consuming step of data integration.
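A small Python sketch can make the semantic integration step concrete. It assumes two hypothetical sources that report the same quantity under different field names and units, maps both onto one schema, and marks missing and implausible values. The field names, the sample values, and the plausibility range are assumptions made for illustration only.

```python
# Hypothetical records from two sources that describe the same quantity
# with different field names and units.
SOURCE_A = [{"site": "A1", "temp_c": 12.5}, {"site": "A2", "temp_c": None}]
SOURCE_B = [
    {"station_id": "B7", "temp_f": 50.0},
    {"station_id": "B8", "temp_f": 999.0},
]

# Assumed plausible range for the shared field, in degrees Celsius.
VALID_RANGE_C = (-90.0, 60.0)


def fahrenheit_to_celsius(value):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def integrate(source_a, source_b):
    """Map both sources onto one schema: station, temp_c, status."""
    unified = []
    for rec in source_a:
        unified.append({"station": rec["site"], "temp_c": rec["temp_c"]})
    for rec in source_b:
        temp = rec["temp_f"]
        unified.append({
            "station": rec["station_id"],
            "temp_c": None if temp is None else fahrenheit_to_celsius(temp),
        })
    low, high = VALID_RANGE_C
    for rec in unified:
        if rec["temp_c"] is None:
            rec["status"] = "missing"
        elif not low <= rec["temp_c"] <= high:
            rec["status"] = "suspect"  # likely a sentinel or entry error
        else:
            rec["status"] = "ok"
    return unified


unified = integrate(SOURCE_A, SOURCE_B)
```

Flagging values instead of deleting them keeps the cleaning decision visible to whoever reuses the combined data.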

Labeling data products: The result of each computation should be labeled with metadata, so the provenance of each data product can be traced and the result can be integrated in future scientific analyses. Examples of useful metadata are the version of the data sources and where to find them, the software that performed the computation, and the person responsible for the computation.
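One possible way to attach such metadata is sketched below in Python. The record layout, the function name label_product, and all sample values (source name, location URL, script name, person) are hypothetical. The timestamp and the checksum of the result go beyond the metadata listed above and are included only as an assumption about what might help tracing.

```python
import hashlib
import json
from datetime import datetime, timezone


def label_product(result, sources, software, responsible):
    """Wrap a computed result with provenance metadata."""
    # Serialize deterministically so the checksum is stable across runs.
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return {
        "result": result,
        "provenance": {
            "sources": sources,          # version and location of each input
            "software": software,        # program that produced the result
            "responsible": responsible,  # person accountable for the run
            "created_utc": datetime.now(timezone.utc).isoformat(),
            "result_sha256": hashlib.sha256(payload).hexdigest(),
        },
    }


# Hypothetical example values.
product = label_product(
    result={"mean_temp_c": 11.125},
    sources=[{
        "name": "station-readings",
        "version": "2011-03",
        "location": "https://example.org/data/station-readings",
    }],
    software={"name": "mean_temp.py", "version": "0.1"},
    responsible="J. Researcher",
)
```

Storing the label next to the result, for example as a JSON file with the same base name, keeps the two from drifting apart when the product is shared.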

The above steps are harder when the data originates from different sources, when it is poorly documented, and when it is so large that moving it is impractical. A lot of technology is available to perform these integration steps; a short survey appears in Bernstein and Haas (2008). However, this technology is mostly targeted at professional information technology staff at large enterprises, not at the science community. That is, much of it is expensive and not open source, is primarily intended for use with relational databases and OLAP data warehouses, and does not support important scientific data types such as sequences and multidimensional arrays.

The problems associated with data integration in the sciences are nicely summarized in a report of a National Academies workshop in 2009 (Weidman and Arrison, 2010). The report includes experiences from major science projects in astronomy, biomedical computing, climate research, and high-energy physics. It also includes experiences of data management researchers who work on generic data-integration technology. The report is a good place to start for an understanding of the problems and promising solutions to data integration in the sciences. Some of the improvements that were suggested are the following:

1. Provide incentives to researchers to make their data reusable by documenting, cleaning, and packaging it in a form that others can access. Incentives might include extra funding for the work and providing recognition to those who do it.

2. To simplify data discovery, develop data registries and better Web search to find structured data.

3. Develop data transformation libraries for data conversion and semantic integration.

4. Develop domain-specific repositories where data is archived, with search tools to browse them.

Author Disclosure Statement

The author declares that no conflicting financial interests exist.

References

Bernstein, P.A., and Haas, L. (2008). Information integration in the enterprise. Commun ACM 51, 72–79.

Weidman, S., and Arrison, T. [rapporteurs]. (2010). Steps Toward Large-Scale Data Integration in the Sciences. (The National Academies Press, Washington, DC).

Address correspondence to:
Philip A. Bernstein
Microsoft Research
One Microsoft Way
Redmond, WA 98052

E-mail: [email protected]

Microsoft Research, Redmond, Washington.

OMICS A Journal of Integrative Biology
Volume 15, Number 4, 2011
© Mary Ann Liebert, Inc.
DOI: 10.1089/omi.2011.0020

OMICS: A Journal of Integrative Biology 2011.15:241-241. Downloaded from online.liebertpub.com by Radboud Univ Nijmegen on 12/02/14. For personal use only.
