
For further information

Please contact Joe Buck: [email protected]. This work was supported by the Systems Research Lab at the University of California, Santa Cruz, the Department of Energy, and the HPC-5 group at Los Alamos National Laboratory.

SciHadoop: Processing Array-Based Scientific Data in Hadoop

MapReduce

Figure 1. The MapReduce process takes a logical description of the data, integrates that with a view of the physical layout of the same data, and generates an execution plan for the query that maximizes data locality

Data Model

An array-based data model maps an n-dimensional shape onto a byte-stream for storage in a file, as can be seen in Figure 2.
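As a minimal sketch of such a mapping, the function below computes the byte offset of one array element. It assumes a plain row-major layout (last dimension varying fastest), fixed-size elements, and no file header; the name `row_major_offset` and the 4-byte default element size are illustrative choices, not part of SciHadoop or of any file format library.

```python
from typing import Sequence


def row_major_offset(coord: Sequence[int], shape: Sequence[int], elem_size: int = 4) -> int:
    """Byte offset of the element at `coord` in a row-major array of `shape`.

    Assumption: contiguous row-major storage with no header or padding.
    """
    if len(coord) != len(shape):
        raise ValueError("coord and shape must have the same number of dimensions")
    index = 0
    for c, n in zip(coord, shape):
        if not 0 <= c < n:
            raise IndexError(f"coordinate {c} out of range for dimension of size {n}")
        # Accumulate the linear index; the last dimension varies fastest.
        index = index * n + c
    return index * elem_size


# Element (1, 2) of a 3 x 4 array of 4-byte floats.
offset = row_major_offset((1, 2), (3, 4))
```

Real formats add headers, padding, and sometimes chunking on top of this, which is why the logical and physical views have to be reconciled explicitly.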

Proportional data placement (total logical space / units desired)
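One way to read "total logical space / units desired" is as an even division of the logical array into contiguous pieces. The sketch below divides only the leading dimension; that choice, the function name `proportional_partitions`, and the (corner, shape) representation are assumptions made for illustration, not the SciHadoop implementation.

```python
def proportional_partitions(shape, units):
    """Split the leading dimension of `shape` into up to `units` near-equal ranges.

    Returns a list of (corner, shape) pairs describing each partition.
    Assumption: partitions are cut along the leading dimension only.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    length = shape[0]
    if length == 0:
        return []
    units = min(units, length)          # never create empty partitions
    base, extra = divmod(length, units)
    parts = []
    start = 0
    for i in range(units):
        size = base + (1 if i < extra else 0)   # spread the remainder over the first partitions
        corner = (start,) + (0,) * (len(shape) - 1)
        part_shape = (size,) + tuple(shape[1:])
        parts.append((corner, part_shape))
        start += size
    return parts


parts = proportional_partitions((10, 4, 4), 3)
```

This ignores where the bytes physically live, so it is cheap to compute but gives no locality guarantee.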

Partitioning

One of the goals of MapReduce systems is to reduce remote data access (thereby conserving network resources and accelerating the processing of the data). In order to ensure that a large percentage of reads are done locally, efficient partitioning of the input data is required. We have implemented three methods of partitioning that provide different tradeoffs.

Introduction

Scientific data is growing in size at a rapid rate. Examples include:
1) Large Synoptic Survey Telescope (LSST): 30 TB / night
2) CERN’s Large Hadron Collider (LHC): around 15 PB / year
3) AR4 climate model (2007): 12 TB
4) AR5 climate model (2013): > 300 TB

This increased size is creating issues in terms of data management, storage, and efficient processing.

This data is often stored in highly structured, array-based binary file formats (HDF5, NetCDF3, etc.).

MapReduce, and specifically the Hadoop implementation, has become a popular framework for enabling large-scale, parallel data processing.

Using MapReduce to process scientific data stored in array-based binary formats is not straightforward. These formats expose data through a logical model expressed as n-dimensional arrays, whereas Hadoop’s interface is based on byte-streams. Resolving the logical interface to the byte-stream interface not only makes it possible to execute MapReduce programs over scientific data, but also enables several optimizations.

Traditionally, a MapReduce program creates tasks by analyzing the input file(s) and assigning regions of their byte-streams to map tasks that are then assigned to nodes.

Joe Buck, Noah Watkins, Kleoni Ioannidou, Carlos Maltzahn, Scott Brandt
UCSC – Systems Research Lab

Jeff LeFevre, Neoklis Polyzotis
UCSC – Database Systems Group

John Bent, Meghan Wingate, Gary Grider
LANL

Figure 2. An example of a 2-dimensional data set on the left and the software stack that maps accesses of the array to an underlying byte-stream. The shaded area on the left represents a specific geographical sub-region within the larger data set.

Queries

The purpose of the MapReduce paradigm is to apply a function over a set of data. When processing scientific data it is often desirable to process only the part of the total input that is required to satisfy the query. To accomplish this, our query uses a constraining space to specify the contiguous portion of the data set to process.
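A constraining space can be pictured as an n-dimensional box given by a corner and a shape. Under that assumption, the sketch below computes which part of some other box (for example, a partition of the input) falls inside the constraining space. The (corner, shape) representation and the name `intersect` are illustrative, not taken from SciHadoop.

```python
def intersect(corner_a, shape_a, corner_b, shape_b):
    """Return the (corner, shape) overlap of two n-dimensional boxes, or None if disjoint."""
    corner, shape = [], []
    for ca, sa, cb, sb in zip(corner_a, shape_a, corner_b, shape_b):
        lo = max(ca, cb)                # first index covered by both boxes
        hi = min(ca + sa, cb + sb)      # one past the last index covered by both
        if hi <= lo:
            return None                 # no overlap along this dimension
        corner.append(lo)
        shape.append(hi - lo)
    return tuple(corner), tuple(shape)


# Constraining space: a 10 x 10 region at the origin; partition: 10 x 10 at (5, 8).
overlap = intersect((0, 0), (10, 10), (5, 8), (10, 10))
```

Only the overlapping region needs to be read, which is how a query avoids touching input it does not need.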

Array-based data typically holds values of one type, say floats representing observed temperatures, and the various dimensions represent information describing the recorded data. In Figure 2, the two dimensions represent the latitude and longitude where the given temperature was recorded. A third dimension representing time could be added, resulting in a data set that is a time series of temperature measurements at specified latitude and longitude coordinates.

Figure 3. The MapReduce process takes a logical description of the data, integrates that with a view of the physical layout of the same data, and generates an execution plan for the query that maximizes data locality

A Slab Extraction function is then applied to the constraining space. Given that dimensions represent attributes, such as longitude and latitude, the slab extraction shape dictates which data points will be processed together. For example, if the function is meant to apply to measurements from fixed geographical areas, then the slab extraction function produces Input Sets that have that fixed shape on the latitude and longitude dimensions. Each Input Set is then processed by the function (f in Figure 3) to produce the Result Set.
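The tiling step can be sketched as follows, assuming the constraining space is a box described by a corner and a shape and that the extraction shape is applied as a regular grid starting at that corner. Clipping the tiles at the far edges is an assumption of this sketch; the names `extract_slabs` and `slab_shape` are hypothetical.

```python
from itertools import product


def extract_slabs(corner, shape, slab_shape):
    """Yield (corner, shape) tiles of `slab_shape` covering the constraining space.

    Tiles at the far edge are clipped so that they stay inside the space.
    """
    if any(s < 1 for s in slab_shape):
        raise ValueError("slab_shape entries must be positive")
    # Starting coordinates of the tiles along each dimension.
    ranges = [range(c, c + n, s) for c, n, s in zip(corner, shape, slab_shape)]
    for start in product(*ranges):
        tile_shape = tuple(
            min(s, c + n - st)
            for st, c, n, s in zip(start, corner, shape, slab_shape)
        )
        yield start, tile_shape


# A 4 x 6 constraining space tiled with 2 x 3 slabs.
slabs = list(extract_slabs((0, 0), (4, 6), (2, 3)))
```

Each yielded tile plays the role of one Input Set: the function f would be applied to the values inside it to produce one entry of the Result Set.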

Physical-to-Logical (use sampling to search the logical space for physical boundaries, then produce Input Sets that map precisely to the underlying byte-stream)
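The relation being searched is the one between byte offsets and logical coordinates. As a simplified stand-in for the sampling-based search, the sketch below inverts the mapping directly for the special case of a header-less, contiguous row-major array, so a physical block boundary can be translated into the logical coordinate of the element that contains it. That layout, the name `offset_to_coord`, and the 4-byte default element size are assumptions for illustration only.

```python
def offset_to_coord(byte_offset, shape, elem_size=4):
    """Logical coordinate of the element containing `byte_offset`.

    Assumption: contiguous row-major storage with no header or padding.
    """
    index = byte_offset // elem_size    # linear index of the containing element
    total = 1
    for n in shape:
        total *= n
    if not 0 <= index < total:
        raise IndexError("byte offset lies outside the array")
    coord = []
    for n in reversed(shape):           # peel off the fastest-varying dimension first
        index, c = divmod(index, n)
        coord.append(c)
    return tuple(reversed(coord))


# Where does byte 24 fall in a 3 x 4 array of 4-byte floats?
boundary = offset_to_coord(24, (3, 4))
```

When the layout is not known in closed form, the same question has to be answered by probing the file format library, which is where sampling comes in.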

Chunking & Grouping (use smaller shapes, use sampling to place them, aggregate all shapes that are mostly on the same Input Set)

Holistic Functions

This class of functions, which requires that all elements be processed at the same time, is not readily amenable to efficient application via a MapReduce program. Two optimizations allow holistic functions to be processed more efficiently by SciHadoop: the opportunistic Holistic Combiner and Holistic-aware Partitioning.

Holistic Combiner: determines whether all the elements required happen to be present at a map node and, if so, applies the function

Holistic-aware Partitioning: adjusts partitions to increase the chances that all elements needed by the function are present at a single mapper (thereby increasing the efficacy of the Holistic Combiner)
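The opportunistic behavior can be sketched for a median query. The sketch assumes the combiner knows how many elements each key should receive (`expected_count`) and marks its output as "final" or "partial" so a reducer can tell the two cases apart; both the parameter and the tagging scheme are hypothetical and not taken from SciHadoop or the Hadoop API.

```python
from statistics import median


def holistic_combine(key, values, expected_count):
    """Apply the holistic function on the map side only if every element is present.

    Returns a list of (key, (tag, value)) pairs:
      - one "final" pair holding the median when all elements are local,
      - otherwise every raw value passed through as "partial".
    """
    values = list(values)
    if len(values) == expected_count:
        # All elements for this key are on this mapper: emit one small result.
        return [(key, ("final", median(values)))]
    # Elements are missing, so the function cannot be applied yet.
    return [(key, ("partial", v)) for v in values]


complete = holistic_combine("cell-7", [3.0, 1.0, 2.0], expected_count=3)
incomplete = holistic_combine("cell-7", [3.0, 1.0], expected_count=3)
```

In the complete case three values collapse into one, which is the source of the reduction in intermediate data; Holistic-aware Partitioning aims to make that case more common.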

Figure 5. On the left, data for the desired partition (light gray) is split across two mappers, preventing the Holistic Combiner from executing. All data is sent across the network and stored on both the map and reduce nodes as intermediate data. On the right, the partitions are adjusted by Holistic-aware Partitioning. The Holistic Combiner can now execute, greatly reducing the data passed through the system.

Figure 4. Three different partitioning strategies

The table above shows several interesting results:
1) Both the Chunking & Grouping and the Physical-to-Logical partitioning strategies greatly increase read locality.
2) The Holistic Combiner reduced the amount of intermediate data generated by 20x and, in turn, the execution time of the query by 5-7x.
3) Read locality isn’t the only useful metric. Compare Holistic-aware Partitioning with Chunking & Grouping to Physical-to-Logical (the third-to-last and last tests in the table). The reduction in read locality did not hurt the runtime; rather, the increased efficacy of the Holistic Combiner, seen in the reduction in temporary data, resulted in a ~10% reduction in runtime.

Results

Experiments were executed on a 30-node cluster where each node had two 1.8 GHz dual-core Opterons, 8 GB RAM, and four 250 GB SATA hard drives; all nodes were interconnected via gigabit Ethernet on a single switch. The sample NetCDF file was extrapolated from an environmental dataset and stored 132 GB of data in a variable that represented wind pressure measurements across four dimensions (time, latitude, longitude, elevation). The query calculated the median pressure across two time steps for a fixed latitude x longitude area and a fixed range of elevation.

Future Work

The SciHadoop project is currently being extended to determine how controlling the mapping of data from map tasks to reduce tasks can be leveraged to decrease the amount of intermediate data, reduce network communication, and increase reducer data locality. Work is also being done to leverage structural knowledge of the input to aggressively start reducer execution before all mappers complete. This allows much quicker access to partial results and a reduction in total resource usage. Additionally, support for the HDF5 file format, and exploration of the optimizations that are possible given its more flexible internal structure, are underway.
