CDAS Design Drivers - NASA
Post on 08-Jun-2020
CDAS Design Drivers
• Access data in raw (NetCDF, HDF) format in a POSIX filesystem.
• Avoid supporting additional copies of entire data holdings.
• Cache variables of interest in domain (xyzt) of interest.
• Enable interactive performance on simple operations.
• Lightweight WPS implementation using the Scala Play framework.
• Most operations are highly I/O bound: high-performance data cache.
• Utilize modular, composable compute operations (kernels).
• Link kernels to compose workflows.
• Support existing climate data analysis packages.
• Enable kernel development in a wide range of programming languages.
• Parallelize data, not analysis packages.
• Deploy the ESGF CWT WPS API.
• Leverage existing big data technologies.
• CDAS core developed in Java/Scala using the Apache Spark engine.
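The idea of linking modular kernels into a workflow can be illustrated with a minimal Python sketch. It is not CDAS code: the kernel names, the `KERNELS` registry, and `run_workflow` are hypothetical, and plain lists stand in for data arrays. Each operation names its inputs and its result, and an operation runs as soon as all of its inputs exist.

```python
from typing import Callable, Dict, List

Array = List[float]

def multi_average(inputs: List[Array]) -> Array:
    """Element-wise mean across several equally sized inputs."""
    return [sum(vals) / len(vals) for vals in zip(*inputs)]

def difference(inputs: List[Array]) -> Array:
    """Subtract the second input from the first, element by element."""
    a, b = inputs
    return [x - y for x, y in zip(a, b)]

# Hypothetical kernel registry (illustration only, not the CDAS kernel set).
KERNELS: Dict[str, Callable[[List[Array]], Array]] = {
    "multiAverage": multi_average,
    "difference": difference,
}

def run_workflow(operations: List[dict], data: Dict[str, Array]) -> Dict[str, Array]:
    """Run each operation once all of its inputs are available."""
    results = dict(data)
    pending = list(operations)
    while pending:
        ready = [op for op in pending if all(i in results for i in op["input"])]
        if not ready:
            raise ValueError("unresolvable inputs (missing data or cyclic workflow)")
        for op in ready:
            kernel = KERNELS[op["name"]]
            results[op["result"]] = kernel([results[i] for i in op["input"]])
            pending.remove(op)
    return results

# Operations may be listed in any order; dependencies decide execution order.
ops = [
    {"name": "difference", "input": ["m1", "m2"], "result": "diff"},
    {"name": "multiAverage", "input": ["a", "b"], "result": "m1"},
    {"name": "multiAverage", "input": ["c", "d"], "result": "m2"},
]
data = {"a": [1.0, 3.0], "b": [3.0, 5.0], "c": [0.0, 0.0], "d": [2.0, 2.0]}
out = run_workflow(ops, data)
```

Because every kernel has the same "list of arrays in, one array out" shape in this sketch, any result can feed any later operation, which is what makes the operations composable.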
Dynamic Data Cache
• Data requested by cached fragment ID:
  • Use that fragment directly
• Data requested by collection ID, varName, and ROI:
  • Search cache for fragment that satisfies request:
    • Matches collection ID and varName
    • Overlaps requested ROI
  • If cached fragment is found then subset and return
  • If no fragment is found:
    • Cache requested ROI for the requested variable
      • Can be different from ROI of operation
      • “Precache” with empty operation
    • Return new fragment when done.
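The lookup logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CDAS implementation: `Fragment`, `FragmentCache`, and the ROI representation (axis name mapped to inclusive bounds) are hypothetical, both ROIs are assumed to define the same axes, and only bounds are tracked where a real cache would hold the data itself. "Subset" is modeled here as the intersection of the cached and requested bounds.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]  # axis name -> (low, high), inclusive

@dataclass
class Fragment:
    frag_id: str
    collection: str
    var_name: str
    roi: Bounds

def overlaps(a: Bounds, b: Bounds) -> bool:
    """True if the two regions intersect on every axis (same axes assumed)."""
    return all(a[ax][0] <= b[ax][1] and b[ax][0] <= a[ax][1] for ax in a)

def intersect(a: Bounds, b: Bounds) -> Bounds:
    """Bounds of the region common to a and b (call only if they overlap)."""
    return {ax: (max(a[ax][0], b[ax][0]), min(a[ax][1], b[ax][1])) for ax in a}

class FragmentCache:
    def __init__(self) -> None:
        self.fragments: Dict[str, Fragment] = {}

    def get_by_id(self, frag_id: str) -> Fragment:
        """Request by cached fragment ID: use that fragment directly."""
        return self.fragments[frag_id]

    def request(self, collection: str, var_name: str, roi: Bounds):
        """Request by collection ID, varName, and ROI."""
        for frag in self.fragments.values():
            if (frag.collection == collection and frag.var_name == var_name
                    and overlaps(frag.roi, roi)):
                # Cache hit: return the fragment and the subset bounds.
                return frag, intersect(frag.roi, roi)
        # Cache miss: cache the requested ROI and return the new fragment.
        frag = Fragment(f"frag{len(self.fragments)}", collection, var_name, dict(roi))
        self.fragments[frag.frag_id] = frag
        return frag, dict(roi)

cache = FragmentCache()
first, _ = cache.request("merra", "tas", {"t": (0, 10), "y": (-90, 90)})
again, subset = cache.request("merra", "tas", {"t": (5, 20), "y": (0, 45)})
```

In this sketch a "precache" would simply be a call to `request` whose result is not passed to any operation, so later requests over the same region hit the cache.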
Partitioning and Parallelism
o Data initially partitioned over time axis:
  o Matches data file partitioning.
  o MERRA: ~10K files partitioned by time.
  o Other partition schemes require a reshuffle operation.
  o Each partition represented by a CDArray.
o Streaming parallelism implemented using Spark.
  o In-memory workflow pipelines.
  o Extends Spark's lazy execution model.
  o Kernel computations utilize Map-Reduce style operations.
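A map-reduce style kernel over time partitions can be sketched in plain Python. This is a stand-in for the pattern only: it does not use Spark or CDArray, the function names are hypothetical, and a list of floats stands in for a time series at one grid point. Each partition is mapped to a partial `(sum, count)` pair, and the pairs are reduced to a single time average.

```python
from functools import reduce
from typing import List, Tuple

def partition_by_time(series: List[float], n_parts: int) -> List[List[float]]:
    """Split a non-empty, time-ordered series into contiguous chunks."""
    size = -(-len(series) // n_parts)  # ceiling division
    return [series[i:i + size] for i in range(0, len(series), size)]

def map_partition(part: List[float]) -> Tuple[float, int]:
    """Map step: reduce one partition to a partial (sum, count)."""
    return (sum(part), len(part))

def combine(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Reduce step: merge two partial results; associative, so order-independent."""
    return (a[0] + b[0], a[1] + b[1])

def time_average(series: List[float], n_parts: int) -> float:
    # map() is lazy in Python 3: nothing is computed until reduce() pulls values.
    partials = map(map_partition, partition_by_time(series, n_parts))
    total, count = reduce(combine, partials)
    return total / count

avg = time_average([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_parts=4)
```

Carrying the count alongside the sum is what lets unequal partitions combine correctly; averaging the per-partition averages would weight short partitions too heavily.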
WPS Request
http://localhost:9001/wps?status=True&version=1.0.0&datainputs=[
  variable=[
    {"domain":"d0","uri":"collection:/GISS-E2-R_r3i1p1","id":"tas|vR3"},
    {"domain":"d0","uri":"collection:/GISS-E2-R_r2i1p1","id":"tas|vR2"},
    {"domain":"d0","uri":"collection:/GISS-E2-R_r1i1p1","id":"tas|vR1"},
    {"domain":"d0","uri":"collection:/GISS_r5i1p1","id":"tas|vH5"},
    {"domain":"d0","uri":"collection:/GISS_r4i1p1","id":"tas|vH4"},
    {"domain":"d0","uri":"collection:/GISS_r1i1p1","id":"tas|vH1"} ];
  domain=[{"id":"d0"}];
  operation=[
    {"input":["vR1","vR2","vR3"], "name":"CDSpark.multiAverage", "result":"b2761"},
    {"input":["vH1","vH2","vH3"], "name":"CDSpark.multiAverage", "result":"665c"},
    {"input":["665c"], "crs":"gaussian~128", "name":"CDSpark.regrid", "result":"32235"},
    {"input":["b2761"], "crs":"gaussian~128", "name":"CDSpark.regrid", "result":"12d9f"},
    {"input":["32235","12d9f"], "domain":"d0", "name":"CDSpark.multiAverage", "result":"323a5f"} ]
]&service=WPS&Identifier=CDSpark.multiAverage&request=Execute&store=True
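A request of this shape can be assembled from ordinary Python structures. The sketch below is an assumption-laden illustration, not the ESGF client API: `build_datainputs` and `build_request_url` are hypothetical helpers, and the layout (`variable`, `domain`, and `operation` lists separated by semicolons inside square brackets) is inferred from the request shown on the slide. The string is left unencoded to mirror the slide; a real client would likely need to percent-encode it.

```python
import json
from typing import List

def to_compact_json(items: List[dict]) -> str:
    """Serialize a list of dicts as JSON without extra whitespace."""
    return json.dumps(items, separators=(",", ":"))

def build_datainputs(variables: List[dict], domains: List[dict],
                     operations: List[dict]) -> str:
    """Join the three sections in the bracketed, semicolon-separated layout."""
    return "[variable={};domain={};operation={}]".format(
        to_compact_json(variables),
        to_compact_json(domains),
        to_compact_json(operations),
    )

def build_request_url(base_url: str, identifier: str, datainputs: str) -> str:
    """Assemble an Execute request with the query parameters used on the slide."""
    return (f"{base_url}?service=WPS&version=1.0.0&request=Execute"
            f"&Identifier={identifier}&status=True&store=True"
            f"&datainputs={datainputs}")

variables = [{"domain": "d0", "uri": "collection:/GISS_r1i1p1", "id": "tas|vH1"}]
domains = [{"id": "d0"}]
operations = [{"input": ["vH1"], "name": "CDSpark.multiAverage", "result": "r0"}]

url = build_request_url(
    "http://localhost:9001/wps",
    "CDSpark.multiAverage",
    build_datainputs(variables, domains, operations),
)
```

Each operation's `result` ID is what later operations list under `input`, so chaining kernels amounts to appending further dictionaries to `operations`.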
Code and Documentation
• Compute Engine: https://github.com/nasa-nccs-cds/CDAS2.git
• Web Server: https://github.com/nasa-nccs-cds/CDWPS.git
• Python Client: https://github.com/ESGF/esgf-compute-api