
CHAPTER 5

Data-Intensive Computing

Reagan W. Moore, Chaitanya Baru, Richard Marciano, Arcot Rajasekar, Michael Wan

Computational grids provide access to distributed compute resources and distributed data resources, creating unique opportunities for improved access to information. When data repositories are accessible from any platform, applications can be developed that support nontraditional uses of computing resources. Environments thus enabled include knowledge networks, in which researchers collaborate on common problems by publishing results in digital libraries, and digital government, in which policy decisions are based on knowledge gleaned from teams of experts accessing distributed data repositories. In both cases, users access data that has been turned into information through the addition of metadata that describes its origin and quality. Information-based computing within computational grids will enable collective advances in knowledge [396].

In this view of the applications that will dominate in the future, application development will be driven by the need to process and analyze information, rather than the need to simulate a physical process. In addition to accessing specific data sets, applications will need to use information discovery interfaces [138] and dynamically determine which data sets to process. In Section 5.1, we discuss how these applications will evolve, and we illustrate their new capabilities by presenting projects now under way that use some concepts implicit within grid environments. Data-intensive applications that will require the manipulation of terabytes of data aggregated across hundreds of files range from comparisons of numerical simulation output, to analyses of satellite observation data streams, to searches for homologous structures for use as input conditions in chemical structure computations. Accessing terabytes of data will require data transfer rates approaching 10 GB/s, implying the ability to manipulate a petabyte of data per day from data repositories distributed across multiple sites.
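As a quick back-of-the-envelope check of that figure (the arithmetic is ours, not from the original text):

\[ 10\ \text{GB/s} \times 86{,}400\ \text{s/day} = 864{,}000\ \text{GB/day} \approx 0.86\ \text{PB/day}, \]

so a sustained 10 GB/s stream does indeed move on the order of a petabyte per day.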

The creation of computational grids implies not only ubiquitous access to computing resources but also uniform access to all data systems. No matter where an application is executed, it needs access to input data sets on local or remote disk systems, in distributed data repositories, or in archival storage systems. For data sets to be remotely accessible, metadata must be provided that describes the data set's location. Providing this information in a metadata catalog constitutes a form of publication because it makes it possible to find the data sets through information discovery interfaces.

Grids will require new mechanisms to support publication and peer review of data. Data is only as good as the degree of assurance about its quality and the validity of the categorization of the data by discipline-specific attributes. If the data is not of high quality, conclusions drawn about the data are suspect. If data sets cannot be located because they are described incorrectly, they will never be used. Peer-reviewed publication of data solves both problems by providing metadata that can be used to access high-quality curated data globally, effectively turning data into information. In grids, publication of peer-reviewed data sets will become as important as publication of peer-reviewed scientific reports: By using information discovery interfaces, the most recently published data can be used directly in subsequent analyses, providing an information feedback loop that nonlinearly advances scientific discoveries.

The implementation of information-based computing will require dramatic extensions to the data support infrastructure of grid systems. Data sets are valuable when they are organized as information and made accessible to other researchers. Several scientific disciplines (e.g., high-energy physics, molecular science, astronomy, and computational fluid dynamics) have recognized this fact and are now aggregating domain-specific results into data repositories. The data is reviewed, annotated, and made accessible through a common, uniform interface. Researchers are then able to make faster progress by accessing all of the curated data related to their discipline. Such data repositories, however, require a variety of services to make the data useful, including support for data publication, data curation and quality assurance, information discovery, and distributed data analysis. The emerging technology that provides these services is based on digital libraries. In Sections 5.2 and 5.3, we discuss how digital library technology can be integrated into grids to enable information-based computing.


Grid software infrastructure must be based on application requirements to be useful to the user community. Postulating grid environments that will not be functional until after the year 2000, though, requires us to postulate similarly how applications are most likely to evolve. We base our projection of future application requirements on current supercomputer systems in the belief that individual nodes in computational grids will have capabilities similar to those of current supercomputers. At the same time, the pressure to analyze information and build discipline-specific data collections will force the development of new technologies. We base our projection of such new technologies on the services that digital libraries now provide for local data repositories. We expect that data-intensive applications will compel the coevolution of supercomputer, digital library, and grid technologies. This conclusion is based on the fact that teraFLOPS-capable computers of the future will generate petabytes of data that must be managed and turned into information. Similarly, the digital libraries of the future will house petabytes of data that will need teraFLOPS-capable computers to support analysis and other services. In Section 5.4, we discuss how grid systems can facilitate the merger of these technologies.

    5.1 EVOLUTION OF DATA-INTENSIVE APPLICATIONS

The term data-intensive computing is used to describe applications that are I/O bound. Such applications devote the largest fraction of execution time to movement of data. They can be identified by evaluating computational bandwidth, the number of bytes of data processed per floating-point operation. On vector supercomputers, for applications that sustain high performance, usually 7 bytes of data are accessed from memory for every floating-point operation [552, 395]. For well-balanced applications, this ratio should match the memory bandwidth divided by the CPU execution rate. When data transmission rates between computer memory and local disk are examined, we find that memory acts as a cache that greatly reduces the disk bandwidth requirements. For vector supercomputers, the computational bandwidth to disk is 1 byte of data accessed per 70 FLOPS, a factor of 490 smaller. For well-balanced applications, this ratio should match the disk bandwidth divided by the CPU execution rate. We can think of data-intensive applications, then, as codes that require data access rates to data storage peripherals that are substantial fractions of their memory data access rates.
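The factor of 490 follows directly from the two ratios just quoted:

\[ \frac{7\ \text{bytes per FLOP (memory)}}{1/70\ \text{bytes per FLOP (disk)}} = 7 \times 70 = 490. \]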

In computational grids, a data-intensive application may need a high-bandwidth data access rate all the way to a remote data repository. Since network bandwidth performance tends to be smaller than local disk bandwidth performance, it appears that data-intensive applications will be difficult to support in grid environments. In practice, when the available bandwidth is less than the required bandwidth, either the CPU is held idle until the data sets are available, or other jobs are executed while waiting for the data to be cached on the local system. One challenge is hence to support both distributed processing, in which the application is moved to the site where the data resides, and distributed caching, in which the data is moved to the supercomputer for analysis. The former method tends to be preferred when data is processed through the CPU once, but the latter if data is read multiple times by the application. The decision between these options depends on determining which one minimizes the total time needed for solution. This calculation is dependent on the network protocol overhead, network latency, computational and network bandwidths, and the total amount of data accessed [150, 394, 397]. This task should be supported within computational grids by a scheduler that tracks the availability of resources and chooses the optimal site to execute each application (see Chapter 12).
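The placement decision can be framed as a simple cost comparison. The sketch below is illustrative only: the function, its parameter names, and the assumption that a single remote run time captures all costs of distributed processing are simplifications of ours, not interfaces defined in the chapter.

```python
def choose_execution_site(data_bytes, net_bandwidth_bytes_per_s, net_latency_s,
                          protocol_overhead_s, local_compute_s, remote_compute_s):
    """Compare distributed caching (move the data to the supercomputer)
    with distributed processing (move the application to the data).

    Returns (strategy, time_caching_s, time_processing_s).
    """
    # Time to ship the data set to the compute site once.
    transfer_s = (protocol_overhead_s + net_latency_s
                  + data_bytes / net_bandwidth_bytes_per_s)

    # Distributed caching: pay the transfer once, then compute locally;
    # repeated read passes over the cached copy cost little extra.
    time_caching = transfer_s + local_compute_s

    # Distributed processing: no bulk transfer, but the repository site
    # may be slower or more heavily loaded, so the job itself runs longer.
    time_processing = remote_compute_s

    strategy = ("distributed caching" if time_caching <= time_processing
                else "distributed processing")
    return strategy, time_caching, time_processing


# Example: a 100 GB input read several times, over a ~19 MB/s effective link.
print(choose_execution_site(data_bytes=100e9, net_bandwidth_bytes_per_s=19e6,
                            net_latency_s=0.1, protocol_overhead_s=5.0,
                            local_compute_s=2 * 3600, remote_compute_s=6 * 3600))
```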

    5.1.1 Data Assimilation

A good example of a data-intensive application is the problem of assimilating remote satellite observations into a comprehensive data set that describes the weather over the entire globe. Satellite observations provide measurements of some of the physical parameters needed to describe the weather, but only for the portion of the earth that is covered by the orbit, and only for the time period that the satellite is over a given area. What is desired instead is a globally consistent data set that describes all of the physical parameters for the entire globe with snapshots of the data at regular intervals in time. At the NASA Data Assimilation Office, the Goddard Earth Observing System (GEOS) Data Assimilation System (DAS) is used to accomplish this task [449]. The analysis requires running a General Circulation Model (GCM) to predict the global weather patterns, comparing the satellite observations with the predicted weather, calculating the discrepancies between the observed and predicted data, then rerunning the model using gridded corrections, or increments, to reproduce the observed data. The assimilation cycle is repeated every 6 hours.

The discrete observations are interpolated onto a regular time and space grid. The gridded data is then used to evaluate global hydrological and energy cycles. The winds derived from the atmospheric circulation are used to transport trace gases in the troposphere and stratosphere. The end products are data sets used by other researchers to support, for example, investigation of greenhouse gases, dust circulation from volcanoes, and global heating.

TABLE 5.1   Upper limits for data transmission.

    Network connection    Bandwidth (Mb/s)    Daily transfer (GB/day)
    T1                             1.4                         15
    T3                              45                        486
    OC-3                           155                      1,670
    OC-12                          622                      6,720
    OC-48                        2,488                     26,870
    OC-192                       9,952                    107,480

The GEOS DAS is a prototype computational grid application. Approximately 2 GB of data per day are collected at the NASA Goddard Space Flight Center Distributed Active Archive Center (DAAC) in Maryland. The raw data is processed at Goddard to generate the input data sets for DAS. The data is then sent to the NASA Ames computer facility in California for analysis by DAS. The results, approximately 6 GB per day, are then sent back to the Goddard DAAC. This process requires moving data on a continual basis across the country between the two sites, turning the raw data into curated information through the use of the GCM simulation, then moving the data products back to Goddard for publication. The data is cached at NASA Ames while the data assimilation is done. DAS requirements are tractable because the total amount of data movement per day is small compared with the available network bandwidth.

As shown in Table 5.1, a T3 network connection (45 Mb/s) can transmit more than 400 GB of data per day. The amount of data transmitted by DAS will grow as higher-resolution grids are used in the GCM weather simulation or as more data is collected from satellites. Once the data movement uses an appreciable fraction of the available bandwidth, computational grids must manage collective effects arising from contention for resources. Eventually, the data assimilation will require scheduling of data transmission and of disk cache space utilization, in addition to scheduling of CPU access. See Chapter 19 for quality-of-service issues related to network use.
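The daily-transfer column of Table 5.1 is just bandwidth multiplied by the number of seconds in a day, converted from bits to bytes; the small script below reproduces it, assuming decimal gigabytes (10^9 bytes), which matches the table's figures.

```python
# Reproduce the "Daily transfer" column of Table 5.1.
# Values are upper limits: link bandwidth in Mb/s sustained for 24 hours.
LINKS_MBPS = {"T1": 1.4, "T3": 45, "OC-3": 155,
              "OC-12": 622, "OC-48": 2488, "OC-192": 9952}

SECONDS_PER_DAY = 86_400

for name, mbps in LINKS_MBPS.items():
    gb_per_day = mbps * 1e6 * SECONDS_PER_DAY / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name:7s} {mbps:8.1f} Mb/s  {gb_per_day:9.0f} GB/day")
```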

While the software systems used to support DAS have been explicitly developed for this purpose, the data-handling steps are quite general:

    1. Identify the required raw data sets.

    2. Retrieve the data from a data repository.


    3. Subdivide the data to generate the required input set.

    4. Cache the data at the site where the computation will take place.

    5. Analyze the data, and generate new data products.

    6. Transmit the results back to the data repository.

    7. Publish (register) the new data sets in the repository.

Grid data-handling environments should provide mechanisms to support all these steps, which would greatly simplify the effort required to develop other data assimilation applications.
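A skeleton of the seven steps as a driver routine might look like the following; every object and method here is a hypothetical placeholder for a grid data-handling service, not an interface defined in the chapter.

```python
def run_assimilation_cycle(catalog, repository, compute_site, model):
    """Generic data-handling pipeline following the seven steps listed above.
    All services (catalog, repository, compute_site, model) are placeholders."""
    # 1. Identify the required raw data sets via attribute-based discovery.
    data_ids = catalog.query(instrument="satellite", window="last_6_hours")

    # 2. Retrieve the data from the data repository.
    raw = [repository.retrieve(d) for d in data_ids]

    # 3. Subdivide the data to generate the required input set.
    inputs = [subset for d in raw for subset in model.make_inputs(d)]

    # 4. Cache the data at the site where the computation will take place.
    cached = [compute_site.cache(x) for x in inputs]

    # 5. Analyze the data and generate new data products.
    products = model.assimilate(cached)

    # 6. Transmit the results back to the data repository.
    stored = [repository.store(p) for p in products]

    # 7. Publish (register) the new data sets in the repository's catalog.
    for s in stored:
        catalog.register(s, lineage=data_ids, method=model.version)
    return stored
```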

A second component of the DAS mission is to support reanalysis of the data. As the physical models for representing weather improve, the GCM will be modified correspondingly. The new models are expected to provide better simulation predictions and improved assimilation of the data. Prior observational data will be reanalyzed to provide higher-quality weather patterns. The reanalyses will be done using data over 10-year periods, requiring the movement of 29 TB of data stored in more than 47,000 files. Such data handling is onerous, unless identification of the data sets and caching of the data can be automated and managed within the application.

Data handling can be automated if general interfaces can be designed that support information discovery from running applications. One difficulty occurs in creating a logical handle for the input file. This handle must be generated dynamically based on the attributes associated with the data set, such as the type of satellite and the time of the observation. A second difficulty occurs in determining where the input file is located, since in general it will not be on local disk. The application must be able to perform caching of that named data set onto the local disk from the remote data repository.

Traditionally, name generation has been automated by embedding the data set's attributes within the UNIX pathname under which the data set is stored. An application-specific algorithm is used to concatenate the attributes into a valid pathname, which is then accessed through a UNIX open statement. This works as long as the application knows the concatenation algorithm. Computational grids should provide a more general support mechanism to identify data sets by querying an information discovery system for the location and name of the data set that corresponds to the desired attributes. This will require users to learn how to invoke information discovery interfaces and interpret the results [262].
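The two naming approaches can be contrasted in a few lines. The pathname layout, the attribute names, and the catalog interface below are hypothetical examples, not conventions taken from the chapter.

```python
from datetime import datetime

def legacy_pathname(satellite: str, observed_at: datetime) -> str:
    """Traditional approach: encode the data set's attributes directly in a
    UNIX pathname using an application-specific concatenation algorithm."""
    return (f"/data/{satellite}/{observed_at:%Y/%m/%d}/"
            f"obs_{observed_at:%H%M}.hdf")

def discover_data_set(catalog, satellite: str, observed_at: datetime):
    """Grid approach: ask an information discovery service for the location
    and name of the data set matching the desired attributes, then let the
    data-handling system cache it locally. `catalog` is a placeholder."""
    return catalog.lookup(satellite=satellite, observation_time=observed_at)

print(legacy_pathname("TIROS-N", datetime(1997, 3, 1, 6, 0)))
# -> /data/TIROS-N/1997/03/01/obs_0600.hdf
```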


    5.1.2 Distributed Data Analysis

The size of the individual data sets required to support the DAS analysis is relatively modest. But what if the total amount of data becomes much larger than the transmission capacity of the underlying grid infrastructure? This situation can occur because either the size of the data collection becomes very large or the sizes of individual data sets become very large. Then additional data-handling systems are needed to support processing of the data within the repository. An example is the Digital Sky project, which will integrate sky surveys that contain data taken at different light wavelengths.

Recent advances in astronomy research have made it feasible to create digital images of large areas of the sky by digitizing existing astronomical photographic plates (Digital Palomar Observatory Sky Survey) or directly recording digital signals from light detection devices attached to a telescope (Two Micron All-Sky Survey). The images are analyzed automatically to identify the location and brightness of each object. The aggregate raw data collections range in size from 3 to 40 TB of pixel data. Since most pixels are black, corresponding to no observable star or galaxy, the size of the data collection can be reduced significantly by saving only those pixels where light is detected. This analysis turns 40 TB of unprocessed data into approximately 2 billion objects, each of which has associated metadata to describe its location and brightness. The size is still large, on the order of 250 GB for the object metadata, and several terabytes for all the nonblack pixels. The pixel images of each object must be saved to allow reanalysis of each object to verify whether it is a star or galaxy.

This project is of great interest to astronomers because it will allow statistical questions to be answered for objects observed at multiple wavelengths of light. It will be possible to find and access subsets of the data representing objects of the same type and to analyze the images corresponding to the objects for morphology.

Such analyses require the ability to generate database queries based on object attributes or image specifications. The Digital Sky project will need to coordinate such queries across multiple databases, including small (even personal) data sets residing in various locations. Through the use of advanced database and archival storage technology, the goal is to do the analyses in days instead of years.

Unique data-handling requirements for this project arise because of the very large number of objects that must be individually accessible, since as the surveys are completed, the aggregate number of objects will grow to billions. The object metadata can be stored in databases, but the object images will be stored in archival storage systems. A user requesting information about an individual star will need to format the request as a query to the database, while statistical analyses may access a large fraction of the image archive. Thus, accessing the data may require execution of methods to obtain the desired result. This implies that metadata to describe objects within the data-handling environment should be augmented with metadata to describe the types of methods or processing algorithms applicable to the data, making it practical to apply algorithms within the data resource and minimizing the amount of data that needs to be transmitted over the network.

A second requirement on the metadata comes from the need to integrate data from multiple digital sky survey repositories. Each of the surveys will be located at the site where the scientific expertise resides for maintaining data from that survey. Such ownership is necessary to support data curation and validation. An example is the automated classification of stellar objects as stars or galaxies. If a meteor transits the sky during an observation, an automated system might classify its track as a string of galaxies. To ensure against this, the data sets need to be checked by experts at each repository to guarantee their validity. In the Digital Sky project, it will be necessary to integrate access to combinations of relational databases, object-oriented databases, and archival storage systems, which can be done only if system-level metadata is kept that describes the access protocol, network location, and local data-set identifiers for each of the data repositories.
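A system-level metadata record of the kind described here might carry fields like the following; the structure, field names, and example values are illustrative assumptions, not drawn from an actual Digital Sky schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RepositoryRecord:
    """System-level metadata needed to federate access to one survey repository."""
    survey: str                 # e.g., "DPOSS" or "2MASS"
    access_protocol: str        # e.g., "sql", "oodb", "archive"
    network_location: str       # host (and port) of the data service
    local_id_scheme: str        # how local data-set identifiers are formed
    methods: List[str] = field(default_factory=list)  # algorithms that can be
                                                       # applied inside the resource

dposs = RepositoryRecord(
    survey="DPOSS",
    access_protocol="sql",
    network_location="skyserver.example.org:1521",
    local_id_scheme="plate/object-number",
    methods=["cutout", "star_galaxy_classification"],
)
```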

A third requirement is the need to support multiple copies of a data set. When joint statistical analyses are done across two or more surveys, copies of each survey will need to be colocated to facilitate the comparisons. The data-handling system should be able to record the location of the copy and use that information to optimize future requests. Having multiple distributed copies of data sets minimizes network traffic, ensures fault tolerance, improves disaster recovery, and allows the data to be stored in different formats to minimize presentation time. The data access scheduling system can determine which copy is closest in terms of transmission time and can use that copy to minimize the time required to support a query. If the copy becomes inaccessible for any reason, the data-handling system can automatically redirect a data access retrieval request to a backup site.
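Replica selection by estimated transmission time, with fallback when a site is unreachable, could be sketched as follows; the replica descriptions and site names are invented for the example.

```python
def select_replica(replicas, data_bytes):
    """Choose the replica with the smallest estimated transfer time and
    fall back to the next candidate if a site is currently unreachable.

    `replicas` is a list of (site, latency_s, bandwidth_bytes_per_s, reachable).
    """
    candidates = sorted(replicas, key=lambda r: r[1] + data_bytes / r[2])
    for site, latency, bandwidth, reachable in candidates:
        if reachable:
            return site
    raise RuntimeError("no reachable replica for the requested data set")

# The faster site is down, so the request is redirected to the backup copy.
site = select_replica(
    [("sdsc.edu", 0.030, 4e6, True),
     ("gsfc.nasa.gov", 0.080, 12e6, False)],
    data_bytes=2e9)
print(site)  # -> sdsc.edu
```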

The need to federate access to multiple data repositories will be quite common. Many scientific disciplines are generating multiple data repositories. Each tends to focus on one aspect of the discipline. Consequently, each may use a different set of unique attributes to describe its data sets. Since general questions may be answered only by accessing all of the repositories within a discipline, mechanisms are needed to publish the discipline-specific metadata at each site [244].


Multiple data repositories are also being created in neuroscience. In another project, detailed images of brains for primates (human, macaque monkey), rats, and crickets are being collected in repositories at UCSD, UCLA, and Washington University. Images from all three sites can be used to make statistically significant claims about anatomical properties relating to both structure and function. Comparisons between two brain images are made by transforming the shape of one brain to match the shape of the second through deformation algorithms, requiring access to the transformation algorithms used to represent the brain structures as well as access to both data sets.

Data sets will continue to grow in size as it becomes possible to achieve ever higher resolutions in observational data and simulation calculations. It will become necessary to support data sets that are partitioned across multiple storage devices, with only the active portion of the data set on the local high-performance disk and the rest stored within a repository such as an archival storage system. In the neuroscience community, the size of a current brain image is on the order of 50 GB. However, rapid advances in technology are expected to enable the capture of brain images at much higher resolutions, with up to 1 TB of data stored per image. Therefore, a collection of 1,000 brain images could be as large as 1 PB. This scenario implies that subsets of given images will be used, rather than the entire image, creating the need for the computational grid's data-handling environments to support replicates of partitions of data sets. In this case, the metadata will have to include information about how the data set is partitioned across multiple storage resources.

    5.1.3 Information Discovery

In addition to supporting scientific analysis of distributed data sets, computational grids will also need to support science-based policy making [237]. Grid users will include citizens who want access to information and decision makers who need access to scientific knowledge. The needs of these users have been discussed in multiple NSF-sponsored workshops including Knowledge Networking [575] and Research and Development Opportunities in Federal Information Services [496]. Both workshops focused on how to turn data into information and how to use information to support predictive modeling, problem analysis, and decision making.

The Knowledge Networking workshops proposed a generalization of grid applications. The term knowledge networks was defined to represent the multiple sets of discipline expertise, information, and knowledge that can be aggregated to analyze a problem of scientific or societal interest, that is, both people and infrastructure. For scientific problems, the people include the application scientists to model the underlying physical systems and the computer scientists to organize and manage the information. For societal problems, the people include not only application and computer scientists, but also planners to develop policy decisions based on the results of scientific models. The infrastructure includes computational grids, an information-based computing environment, models, and applications. Knowledge is presented either as predictions of effects (e.g., the impact of human-released aerosols on global change) or as interpretations of existing data (e.g., analyses of social science surveys). The knowledge can then be used to change previous planning decisions or direct new research activities.

Individual researchers comprise a knowledge network enclave that includes their expertise, data collections, and analysis tools. Researchers form groups to address larger-scale problems whose scope exceeds the capabilities of any individual. Groups, in turn, aggregate themselves into multidisciplinary consortia to tackle Grand Challenge problems. At each level of this hierarchy, information is exchanged within and between enclaves. The enclaves include legacy systems and, consequently, are heterogeneous and distributed. The heterogeneity spans all components of the data and information organization hierarchies (including the data sources themselves, the ways the data is organized, the vocabularies describing the data, and the cultures of the groups of experts). One of the fundamental challenges facing grid systems is to support interoperability between legacy systems, emerging technology, and the multiple cultures that might be connected by a knowledge network [149].

Each knowledge network enclave may impose unique requirements on the data-handling environment of computational grids. Some may keep data private until it can be analyzed for new effects. Some may publish immediately to establish precedence. Some enclaves may organize information for either scientists or the public's use. These enclaves will create new ideas that change the approach to data handling. For instance, an enclave might establish a content standard for all their scientific data objects, implying a world view of that domain. Since the purpose of science is to evolve world views, the content standard within an enclave can also be expected to evolve over time.

Digital library technology addresses some of these issues. For each discipline, an ontology is created that defines how the information is to be structured. Attributes are defined that encapsulate the information as metadata, which are then organized into a schema. Definitions of each attribute are specified by semantics that have a common meaning within the discipline. When world views evolve, the ontology, schema, metadata, and semantics may all need to change. This process implies that the structures used to organize information must themselves evolve over time [219].

Computational grid data-handling environments can be simplified greatly by providing access to persistent objects, with changes to data sets recorded as new versions of the data. Access to prior versions is needed to allow comparisons between versions to determine the impact of modifications. This approach minimizes caching consistency problems. The metadata for a particular repository must be kept consistent with the stored objects, but new versions of a data set can be disseminated lazily to external metadata caches. For example, applications identify the specific version of each data set that is used. The validity of the analysis is then a function of the publishing date of each data set, and the analysis can be rerun if a new version of the data becomes available. Data sets that are highly referenced by the community will become the standards on which analyses are done.

Consequently, grid data-handling environments should provide lineage information about each data set to allow reanalysis of the original data. This information will need to be recorded as part of the system-level metadata, with the lineage preserved through every method invoked on the data. The publication of new data sets should include metadata about the source of all input files and metadata attributes that identify the application or method used to generate the data products.
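Lineage metadata recorded at publication time might look like the minimal sketch below; the attribute names and example values are our assumptions, not a standard from the chapter.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class LineageRecord:
    """System-level lineage metadata attached to a published data product."""
    data_set: str          # identifier of the published product
    version: str           # version under which it was published
    inputs: List[str]      # identifiers (and versions) of all input files
    method: str            # application or method used to generate it
    published: str         # publication date, used to judge analysis validity

record = LineageRecord(
    data_set="geos.assim.199703",
    version="v2",
    inputs=["geos.raw.19970301.v1", "geos.raw.19970302.v1"],
    method="GEOS-DAS increment analysis",
    published="1997-04-15",
)
```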

The concern about semantics, the vocabulary used to describe the metadata, is that the data sets are created within a culture. Persons without that cultural background are at risk because they do not understand the underlying ontology used within the domain or the vocabulary used to convey meaning. The cultures may be discipline driven or socially driven (e.g., users with different levels of education). Thus, mechanisms are needed to provide hierarchies of information from general to domain-specific to satisfy public and discipline-oriented scientific access.

Data quality is critical, implying the need for peer review mechanisms for users of data to provide feedback. Even for high-quality data, errors can be introduced from cross-discipline data exchanges and unintended uses of the data. The underlying organization of the data may be inappropriate and may result in the data being biased with respect to another discipline's usage pattern. An example is a data collection that gives the location of all the hardwood forests. If this is used to represent the location of all of the trees within an area, the data will be inaccurate because nonhardwood trees are not represented.


    5.1.4 Application Requirements Summary

Data-intensive computing will require access to data at rates that may exceed the capabilities of computational grid networks. Hence, data-intensive applications will continue to be run on local resources where data and compute servers can be tightly coupled together. Grid systems will be able to support data-intensive applications when it is possible to cache the data at the compute server.

The more general applications in the future, however, will be as interested in metadata about the data set as in the data set itself. The metadata constitutes information that can be used to determine how the data set should be used. Information-based computing will enable applications to make effective use of computational grids by implementing data access behind information discovery interfaces. Information environments within grids will be established through publication of data sets in data repositories or digital libraries. The most general applications will be based on knowledge networks that combine grids and information-based computing environments with enclaves of experts to enable collective analysis of societally important problems.

The application requirements for information-based computing are summarized in Table 5.2. The requirements have been organized loosely as a function of the evolving data environments needed by future applications. They all assume access is being provided to published data sets. In Section 5.2, we examine how data support software infrastructure has also been evolving to address these requirements.

    5.2 SOFTWARE INFRASTRUCTURE EVOLUTION

The evolution of data-handling environments has been driven by the need to develop storage systems to hold data, information discovery mechanisms to locate data sets, data-handling mechanisms to retrieve data sets, publication mechanisms to populate high-quality data repositories, and systems to support data manipulation services. Each of these areas has experienced a steady increase in the ability to manage and manipulate data. The evolving capabilities are characterized in Table 5.3. Each row illustrates a different capability, which eventually should constitute part of computational grids. Chapters that provide more detailed discussions of the capabilities are also listed.

In each area, we examine the available data-handling software infrastructure and identify the research goals that are needed to enable information-based computing within computational grids.


TABLE 5.2   Application requirements for data-handling environments.

    Data environment               Requirements
    Data-intensive computing       Data-caching system
                                   Attribute-based data set identification
                                   Access to heterogeneous legacy systems
                                   Automated data handling
                                   Data-subsetting mechanisms
    Information-based computing    Data publication mechanisms
                                   Quality assurance mechanisms
                                   Information discovery interfaces
                                   Attribute-based access to data sets
                                   System-level metadata for resources, users, data sets, and methods
                                   Discipline-specific metadata
                                   Replicated data sets for fault tolerance and disaster recovery
                                   Partitioned data sets
    Knowledge networks             Extensible semantics and schemas
                                   Publication mechanisms for semantics and schemas
                                   Interoperability mechanisms for semantics and schemas
                                   Lineage metadata and audit mechanisms

    5.2.1 Data-Naming Systems

Traditionally, applications use input data sets to define the problem of interest and store results in output files written to local disk. The problem of identifying or naming the data sets is handled by specifying a unique UNIX pathname on a given host for each data set. The user maintains a private metadata catalog to equate the pathname with the unique attributes that identify the contents of the data set. This task may be done manually in a notebook or by encoding the attributes in the pathname. In either case, the only way to discover the naming convention is by communicating with the data's originator.

With the advent of the Web, naming conventions have been developed to describe data sets based on their network address. A URL specifies both the Internet address of a server and the pathname of each object. This extends the UNIX pathname convention to include the address of the site where the data object resides.


TABLE 5.3   Evolution of data-handling environments (capability: growth path).

    Data naming:        UNIX pathname -> LDAP -> database metadata (Chapter 11)
    Data storage:       Local disk files -> archival storage (Chapter 17) -> integrated database/archive
    Data handling:      Manual access -> integrated archive/file system -> homogeneous access to file systems, archives, databases
    Data services:      Local applications -> distributed objects (Chapter 9) -> knowledge networks
    Data publication:   Data repositories -> digital libraries -> federated information repositories
    Data presentation:  Application-specific visualization -> user-managed data flow systems -> coordinated presentation, JavaBeans (Chapter 10)

URNs extend the concept of URLs by providing location transparency. URNs are unique names across the Web that can map to multiple URLs. A URN service is required to translate the URN into a URL. Users must still individually learn the URN that corresponds to a given object to build their own metadata catalog of interesting data object names.

One approach to improve the ability to name data sets is to impose a standard structure on the UNIX pathname. The Lightweight Directory Access Protocol (LDAP) [582, 281] organizes entries in a hierarchical treelike structure that reflects political, geographic, and/or organizational boundaries. Typically, entries representing countries appear at the top of the tree, with entries representing states or national organizations hierarchically listed below. A structure may be defined that represents arbitrary metadata attributes. LDAP is a protocol for accessing online directory services [281] and is used by nearly all X.500 directory clients.

The LDAP directory service model is based on entries, which are collections of attributes with a distinguished name. The distinguished name refers to an entry unambiguously by taking the name of the entry itself and concatenating the names of its ancestor entries. This process is similar to using a UNIX pathname to define a data-set name, except in this case the name is defined within the context of the attributes associated with the LDAP directory structure.

Research proposals for the Web and LDAP focus on metadata extensions to facilitate data naming and information discovery. For the Web, the Dublin Core provides mechanisms to associate descriptive fields with every document at a Web URL [324, 564]. The Warwick Framework [323] provides a container architecture to integrate distinct packages of metadata, including the Dublin Core. (See Section 11.4 for additional discussion of LDAP.)

An alternative approach is to use a relational database to store the attributes associated with the data set. As with LDAP, a structure must be designed to organize the attributes, but in this case the relation between the attributes is specified by the database schema. This allows the design of more general relationships and supports more complex queries. As a result, it becomes possible to access a data set by specifying a subset of the attributes associated with the data set instead of the distinguished name. As disciplines identify more attributes to characterize their data, the schema used to describe the data sets will also increase in complexity, implying the need for an extensible database schema.
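With the attributes in a relational database, a data set can be found by querying on any subset of its attributes rather than by a distinguished name. The self-contained illustration below uses SQLite; the table layout, attribute names, and stored locations are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE data_sets (
                  location   TEXT,     -- where the data set is stored
                  satellite  TEXT,
                  obs_date   TEXT,
                  wavelength TEXT)""")
conn.executemany("INSERT INTO data_sets VALUES (?, ?, ?, ?)",
                 [("hpss://archive/ds001", "2MASS", "1997-11-02", "2um"),
                  ("hpss://archive/ds002", "DPOSS", "1997-11-02", "visible")])

# Attribute-based access: no pathname or distinguished name is needed,
# only the attributes that characterize the desired data set.
rows = conn.execute(
    "SELECT location FROM data_sets WHERE satellite = ? AND obs_date = ?",
    ("2MASS", "1997-11-02")).fetchall()
print(rows)   # -> [('hpss://archive/ds001',)]
```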

    5.2.2 Data Storage Systems

At supercomputer centers, archival storage systems are used to maintain copies of data sets on tape robots that back-end large disk caches. Data written to the archive migrates from the cache to the robot based on the frequency of access (e.g., data sets that are never accessed reside on tape, minimizing the cost of storage across the data collection). Archives typically store millions of files and have capacities measured in the hundreds of terabytes. Almost all archives in use today rely on a user-specified pathname to identify each file. A current research topic is how to integrate object-relational database technology with archival storage systems to enable attribute-based access to the data sets within the archive. (Information about data storage system capabilities is given in Chapter 17.)

Within computational grids, the archival storage system will provide persistent storage. But several challenges must be met. The number of data sets will grow into the billions and vastly exceed the name server design specifications for current archives. A second challenge is archive access latencies, which are measured in tens of seconds for data migrated to tape. The retrieval of a large number of small data sets (size less than the tape access latency times the access bandwidth) will be inconveniently long if the data sets are distributed across a large number of tapes. Again, database technology is being considered for its ability to aggregate data into containers, which are the entities stored in the archive. This minimizes the number of entities stored and the access latency for multiple data-set requests. This scenario suggests that storage of data within archives needs to be controlled by clustering algorithms that automatically aggregate jointly accessed data sets in the same database container.
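To make the smallness criterion concrete (the specific numbers are assumed for illustration, not given in the text): with a tape access latency of 20 s and an access bandwidth of 100 MB/s, any data set smaller than

\[ 20\ \text{s} \times 100\ \text{MB/s} = 2\ \text{GB} \]

spends more time waiting for tape positioning than actually transferring, which is why aggregating such data sets into shared containers pays off.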

For data-intensive computing on large data sets, the latency of access to the data peripheral is small compared with the transmission time. If large data sets are accessed whose size is greater than the local disk cache, the data must be paged between the archive and the disk. This is feasible if the archive transmission rate can be increased to a substantial fraction of the local disk access rate. The standard way to do this is to use third-party transfer, in which data is moved from network-attached peripherals to the disk cache or requesting computer. This is possible by separating the data control and data movement functions within the archive [284, 130].

Some archival storage systems support movement of a single data set across parallel I/O channels from tape and disk [562]. This approach allows aggregation of I/O bandwidth up to the capability of the receiving system. Fully parallel implementations make it possible to increase the I/O access rate of the data in the archive in proportion to the size of the archive. It then becomes feasible to construct archives in which the entire data collection can be processed in a single day. The standard figure of merit with current technology is an access rate of 1 GB/s per terabyte of disk cache in the archive. A 10 TB disk cache enables data-intensive problems requiring the movement of a petabyte of data per day. An area of research interest is how to integrate the standard MPI I/O data redistribution primitives [377] on top of third-party transfer, thus enabling dynamic redistribution of the data onto the appropriate nodes of a parallel computer.

    5.2.3 Data-Handling Systems

Data-intensive applications are labor-intensive, requiring manual intervention to identify the data sets and to move the data from a remote repository to a cache on the local disk. In addition, the application must be told the local filenames before execution. When the number of data files is counted in the hundreds or the sizes of the data files are measured in gigabytes, the time expended in manual data-handling support can exceed the CPU execution time by orders of magnitude.

Multiple software infrastructures have been developed to minimize manual intervention in accessing files:

Distributed file systems to provide a global name space. Examples are the Andrew File System, the Distributed File System (DFS), and remotely mounted versions of the Network File System (NFS). In each case, the user must know the unique UNIX pathname for each data set. Data repositories that do not provide an NFS or DFS interface must be accessed separately.

Persistent object computation environments that federate file systems. The Legion environment (see Chapter 9) transparently migrates objects for execution among systems [248]. Although the user is required to know the unique Legion object identifier to access an object and must maintain a list of objects, the manipulation of objects is automated.

Database systems that support queries against local data collections. Retrieving data from a data repository managed by a database typically requires the user to generate SQL syntax to identify the data set of interest. Interfaces are now available that support queries across distributed databases.

Data migration systems that tightly couple file systems with tape storage. The Cray Data Migration Facility (DMF) uses hooks within the UNIX file system to identify data sets that have been migrated to tape and automatically retrieves the data sets when they are requested by the user.

These solutions are characterized by acting on strictly local resources or by requiring the user to identify the data set based on an arbitrarily chosen pathname. In computational grids, these restrictions need to be alleviated. When a user can access data sets anywhere within the grid, it is not reasonable to expect the user to know the unique UNIX pathname that identifies a data set. What is needed is a metadata catalog that maintains the correlation between data set attributes and the data set name.

In addition, the storage systems accessible within computational grids will not have uniform access protocols. What is needed is a storage resource broker (SRB) that supports protocol conversion between the UNIX-style streaming interface used by most applications and the protocols used by the storage resources. Figure 5.1 shows an architecture for such a storage resource broker.

In the SRB, storage interface definitions (SIDs) provide alternate interfaces that the application may choose to access.


[Figure 5.1: Storage resource broker for interoperation between data resource systems. An application calls the SRB APIs; the SRB mediates between file, database BLOB, and object SIDs, drawing on catalog services, authentication and access control, and a scheduling broker, with drivers for UniTree, HPSS, Informix, DB2, and the file system.]

The SRB then converts the data request into the format needed by a particular storage system. The conversion process requires access to system-level metadata to identify the characteristics of the storage resource to decide what type of protocol conversion is required. The SRB provides the homogeneous access to file systems, databases, and archival storage systems needed by grid applications.
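The protocol-conversion role can be pictured as a dispatch on system-level metadata. The driver classes, registry layout, and example mapping below are purely illustrative, not the actual SRB interfaces.

```python
class UnixFileDriver:
    def get(self, local_id):
        with open(local_id, "rb") as f:     # ordinary POSIX streaming access
            return f.read()

class ArchiveDriver:
    def get(self, local_id):
        # A real driver would stage the file from an archive (e.g., HPSS)
        # before reading; omitted in this sketch.
        raise NotImplementedError("archive staging not implemented here")

class StorageResourceBroker:
    """Present one UNIX-style read interface over heterogeneous back ends."""
    drivers = {"file": UnixFileDriver(), "archive": ArchiveDriver()}

    def __init__(self, system_metadata):
        # Maps a logical data-set name to (resource type, local identifier).
        self.system_metadata = system_metadata

    def read(self, logical_name):
        resource_type, local_id = self.system_metadata[logical_name]
        return self.drivers[resource_type].get(local_id)

srb = StorageResourceBroker({"geos.assim.199703": ("file", "/tmp/geos.assim")})
# srb.read("geos.assim.199703")  # same call regardless of the back-end protocol
```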

    5.2.4 Data Service Systems

Data sets may require preprocessing before they are accessed by an application. The preprocessing typically can be expressed as well-defined services that support general data-subsetting operations. A data-handling infrastructure is needed to encapsulate the services as methods that can be applied within computational grids. Two technologies are emerging to support this capability. One is CORBA, in which data sets are encapsulated within objects that provide the required manipulation [424]. This system works very well as long as the requested service is defined and available. CORBA attempts to provide some support for information discovery using its notion of trader services.

Digital libraries provide a more comprehensive and powerful set of tools to manipulate data sets, by supporting services on top of data repositories.


[Figure 5.2: Digital library architecture. An application uses APIs for catalog services, registration and publication, discovery support (MDAs), a storage resource broker, authentication and access control, a scheduling broker, and method execution; the underlying systems include LDAP/X.500, HPSS/DPS/file systems, GSS/SSH/Kerberos, NWS/AppLeS, and Globus/Legion.]

The services can be invoked against any of the data sets stored in the library and registered as methods within object-relational databases. The combination of metadata-based access to data sets through catalog services, with the ability to register methods to manipulate the data sets, provides most of the attributes needed for computational grid data-handling environments. An example of a digital library architecture is shown in Figure 5.2.

Possible services include publication/registration of new data sets, support for information discovery for attribute-based identification of data sets, support for access to heterogeneous data resources through a storage resource broker, support for authentication and access control, support for scheduling of compute and I/O resources, and support for distributed execution of the services that constitute the digital library.

Although the digital library architecture is extensible and capable of scaling to wide area grid environments, current implementations tend to be closely coupled to local resources. The underlying data storage resources are usually the local file system, and the methods are executed locally by the database. Current research topics include generalizing digital library technology to function in a distributed environment with support for executing the methods on nonlocal resources [42].


    5.2.5 Data Publication Systems

Data publication provides the quality assessment needed to turn data into information. This capability is provided by the experts who assemble data repositories for a given discipline. The mechanisms used to assess the quality of the data include statistical analyses of the data to identify outliers (possibly poor data points) and to compute the inherent properties of the data (mean, standard deviation, distribution). For the large data sets accessed by data-intensive applications, research topics include how to generate statistical properties when the size of the data set exceeds the storage capacity of the local resources. For publishing data across multiple repositories, the coordination of data privacy requires cryptographically guaranteed properties for authorship and modification records (discussed in Chapter 16).

Publication also involves developing peer review mechanisms to validate the worth of the data set, which is an active area of research within the library community. The mechanisms will be similar to those employed for review of scientific reports. Indeed, in the chemistry community, some scientific journals require publication of molecular structures in data repositories before reports that analyze the structures can be published in the journal.

The harder research issue is supporting interoperability among multiple data repositories and digital libraries [137]. To enable ubiquitous access to information, mechanisms are needed that support interoperability between schemas. This requires continued research on how to specify schemas and interpret semantics so that a query can be completed across two heterogeneous databases. The schemas and associated semantics must be published for each repository. One approach is to generate a set of global semantics that spans all disciplines. Unfortunately, as noted above, the semantics used within a given discipline are expected to evolve. Thus, the semantics associated with data set attributes must also evolve. Queries for information that require comparing historical and current data will require interoperability between schemas based on different semantics. Current approaches include generalizing the query to access higher-level attributes that can be defined with the same semantics across the heterogeneous schemas.

A unifying approach to enable interoperability is the use of proxies. Proxies act as interpreters between a standard protocol and the native protocols of individual information sources. The Stanford InfoBus protocol is an implementation of such a standard [479]. It uses distributed CORBA objects to support information discovery across multiple digital libraries. Information access and retrieval are accomplished through a Digital Library Interoperation Protocol (DLIOP).


    5.2.6 Data Presentation Systems

Unifying data presentation architectures are needed to enable collaborative examination of results. The data presentation may involve teleinstrumentation (Chapter 4), with data streaming from instruments in real time for display on collaborators' workstations, or dynamic steering of applications as they execute (Chapter 7). The associated data-handling systems require support for asynchronous communication mechanisms, so that processing of the data stream will not impede the executing application or interrupt the instrument-driven data stream. When data sets become very large, the data streams may have to be redirected to archives able to accommodate the entire data set. Subsets of the data may then be redirected for display on the researcher's workstation.

The realtime constraints associated with collaborative examination of data will also affect the design of grid data-caching infrastructure. Multiple representations of the data sets at different resolutions may be needed to maintain interactive response when the collaborations are distributed across a continent. This, in turn, will affect the type of data-subsetting services that the digital library should provide. (Realtime applications are discussed in detail in Chapter 4.)

Visualization of data sets will be as important as the ability to locate and retrieve them. For data-intensive applications, the resolution of the data set can be finer than the resolution of the display device. One approach to this situation is to zoom into a data set through multiple levels of resolution, with each level stored as a reduced data set within the data repository. An alternative approach is to decompose the data set into basis functions that can be used to display the data at different resolutions. Fractal decompositions can minimize the amount of data that must be transmitted as higher-resolution versions of the data are requested. Both approaches require additional metadata to describe the alternate representations of the data that are available.

Data coordination systems are also needed to ensure that the same presentation is provided on all windows active on a display system and across the multiple windows within a distributed collaborative environment. Changes in the visualization control parameters in one window need to be reflected in all other windows. JavaBeans is a platform-neutral technology that can accomplish this process [528]. This requirement poses a major architectural design challenge for computational grid data-support environments. Presentation environments will need to be integrated across a combination of computational grids, CORBA object services, Java presentation services, and digital library publication services. (More information about Java is provided in Chapter 10.)


TABLE 5.4   Evolution of data-handling paradigms.

    Local                   Distributed                           Computational grids
    Data storage system     Distributed data handling             Data-intensive computing on data repositories
    Data analysis system    Information discovery environments    Information-based policy making
    Digital libraries       Information-based computing           Knowledge networking

    5.3 DATA-HANDLING PARADIGMS

Data-handling environments are evolving from local systems that can interact only with local data peripherals to distributed systems that integrate access to multiple heterogeneous data resources. Data access via user-defined data set names is evolving to information access based on data set attributes. Compute environments are evolving from support for local execution of services or methods to distributed execution of services within computational grids.

This shift from local to global resource access will enable new paradigms for data handling. Three such shifts are shown in Table 5.4. They can be characterized as a shift from local resources, to distributed resources, to an environment that supports ubiquitous access to computing and information resources. Each paradigm builds upon the capabilities of the prior one. The long-range goal is to develop infrastructure that supports information-based policy decisions by experts organized into knowledge network enclaves.

The basic software infrastructure to build an information-based computing environment includes the following:

    Persistent object computation environments for grids that support run-time execution of applications and services

Information discovery system that supports attribute-based access to data, metadata mining, semantic interoperability, shared ontologies, and improved data annotation [41, 137]

Digital library technology that supports publication, cataloging, and curation of scientific data sets


Data management system that provides a system-level metadata catalog to support interoperation among objects and resources within computational grids

Storage resource broker that provides a uniform access mechanism to heterogeneous data sources

    Database repositories that support domain-specific data collections

    Archival storage systems that provide permanent data repositories

    5.4 INFORMATION REVOLUTION

A major user of the information discovery environment will be the internal systems comprising computational grids themselves. For ubiquitous computing infrastructure to be accessible, system-level metadata is required to identify available resources. For scheduling systems to be capable of operating within grids, access is needed to resource utilization statistics. To support security access control lists and authentication systems, system-level metadata is needed to describe user privileges. For data sets to be accessible in remote archives, again system-level metadata is needed to determine the protocol that should be used to access the heterogeneous data resources. In short, the technologies needed to implement grids will be driven by the needs of their internal subsystems, with development of support mechanisms for system-level metadata providing the unifying infrastructure.

One consequence of the implementation of an information-based computing environment on top of grid systems will be a revolution in the ability to generate information. Data analysis has been a cottage industry in which researchers develop unique applications that access local data sets. Information-based computing will turn data analysis into a national infrastructure by making it possible to access and manipulate any published data set. The synergy observed within individual disciplines when they integrate access to their data will be made possible across multiple fields of study that choose to work together. Common metadata models are being developed to enable such interchange [185, 171].

The emergence of ubiquitous access to data is revolutionizing the conduct of science [445]. Researchers are publishing scientific results on the Web and providing Web-based access mechanisms to query data repositories and apply analysis algorithms. The opportunity exists to develop scalable information discovery systems that generalize the above capabilities and enable analysis of terabyte-sized data collections. Information-based computing will enable information access from applications running on supercomputers. This in turn will enable automated examination of, and access to, all available information sources for a discipline, including scientific data generated by simulations, observational data, standard reference test case results, published scientific algorithms for analyzing data, published research literature, data collections, and domain-specific databases.

This infrastructure is expected to enable analyses that could not be contemplated before, resulting in faster progress in scientific research through the nonlinear feedback made possible when new information is used to improve data analysis. The publication of the results of computations, followed by the dynamic discovery of information when new applications are run, forms a feedback loop that can rapidly accelerate the generation of knowledge.

Rapid progress in building this infrastructure can be made by building upon existing technologies. Supercomputer centers are evolving from providing support for predominantly numerically intensive computing to also providing support for data-intensive applications. Systems that can manage the movement of terabytes of data per day are in development. Data-handling environments are also evolving from syntactic-based to semantic-based access systems. Digital library technology is evolving to include the capability to analyze data in associated workspaces through application of published algorithms. Finally, user interfaces to these systems are evolving into dynamic collaboration environments in which researchers simultaneously view and interact with data.

The coevolution and integration of computational grids, information-based computing, and digital library technologies promise to create a unique infrastructure that will enable ubiquitous computing and ubiquitous access to information. The resulting synergy between the ability to analyze data to create information and the ability to access information to drive the data analysis will have a profound effect on the conduct of science.

    FURTHER READING

    For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

A white paper prepared for the Workshop on Distributed Heterogeneous Knowledge Networks [575] provides an introduction to the knowledge networking concept.