Quest for Value in Big Earth Data Identifying the most cost-effective way of dealing with Big Data in Geoscience Kwo-Sen Kuo 1,2,3 and colleagues Including but not limited to: Michael L Rilee 4 , Lina Yu 5 , Yu Pan 5 , Feiyu Zhu 5 , Hongfeng Yu 5 1. NASA Goddard Space Flight Center, Greenbelt, Maryland, USA 2. University of Maryland, College Park, Maryland, USA 3. Bayesics LLC, Bowie, Maryland, USA 4. Rilee Systems Technologies, Derwood, MD, USA 5. University of Nebraska-Lincoln, Lincoln, NE, USA

Post on 09-Jul-2020

Page 1: Quest for Value in Big Earth Data - University of Iceland · CCL UDO CCL: Connected Component Labeling Non-pleasingly parallel, DMP required UDO: User Defined Operator Used to track

Quest for Value in Big Earth Data

Identifying the most cost-effective way of dealing with Big Data in Geoscience

Kwo-Sen Kuo 1,2,3 and colleagues

Including but not limited to:

Michael L Rilee 4, Lina Yu 5, Yu Pan 5, Feiyu Zhu 5, Hongfeng Yu 5

1. NASA Goddard Space Flight Center, Greenbelt, Maryland, USA
2. University of Maryland, College Park, Maryland, USA
3. Bayesics LLC, Bowie, Maryland, USA
4. Rilee Systems Technologies, Derwood, MD, USA
5. University of Nebraska-Lincoln, Lincoln, NE, USA

The V’s of Big Data challenge

[Figure: the V's of Big Data (Volume, Variety, Velocity, Veracity, Value) and Scale]

Scaling Volume

❖ Parallel processing – There is no other way!

➢ Need to exploit multimodal parallelization

o Shared memory parallelization, SMP

▪ SMP has been incorporated into the fundamentals of modern scripting languages and has become pervasive

• MATLAB, IDL, Python, R, Julia, etc.

▪ Geoscientists can transparently leverage SMP without knowing, say, OpenMP

▪ (GPU, Quantum)

o Distributed memory parallelization, DMP

▪ Memory on a single node is not sufficient for Big Data; a cluster of many nodes is needed

▪ Pervasive systematic utilization of DMP? Very challenging!

▪ We have found a way!

Observations and Insights

Observations on Data Movement Links

~+25% / year

≪ Moore’s Law

~+40% / year (< Moore’s Law)

https://itblog.sandisk.com/cpu-bandwidth-the-worrisome-2020-trend/

Dedicated

~+50% / year

< Moore’s Law

Effective Internet user bandwidth

~+50% / year

< Moore’s Law

https://www.nngroup.com/articles/law-of-bandwidth/
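The annual growth rates quoted on these slides can be compared with Moore's Law by compounding them. The sketch below is illustrative only: the slides do not say which doubling period they mean by Moore's Law, so both the 18-month and the 24-month readings are computed as assumptions, and the function names are hypothetical.

```python
def annual_rate_from_doubling(months):
    """Annual growth rate implied by doubling once every `months` months."""
    return 2 ** (12 / months) - 1


def growth_factor(annual_rate, years):
    """Total growth after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


# Rates quoted on the slides (which link each belongs to is not recoverable here).
SLIDE_RATES = {"+25%/yr": 0.25, "+40%/yr": 0.40, "+50%/yr": 0.50}

# Two common readings of Moore's Law (assumed; the slides do not specify one).
moore_18 = annual_rate_from_doubling(18)
moore_24 = annual_rate_from_doubling(24)

for label, rate in SLIDE_RATES.items():
    print(f"{label}: x{growth_factor(rate, 10):.1f} over 10 years")
print(f"Moore (18-month doubling): {moore_18:.1%}/yr, "
      f"x{growth_factor(moore_18, 10):.1f} over 10 years")
print(f"Moore (24-month doubling): {moore_24:.1%}/yr, "
      f"x{growth_factor(moore_24, 10):.1f} over 10 years")
```

Under the 18-month reading, all three quoted rates fall below Moore's Law, which is consistent with the "<" annotations on the slides; under the 24-month reading the +50% rate would not.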

Summary

Link       BW (MB/s), ~O(n) ≈ 10^n   Remark
DRAM       ~O(5) (max)               Dedicated
Network    ~O(4) (max)               Shared
SSD        ~O(3) (max)               Dedicated
HDD        ~O(2) (max)               Dedicated
Internet   ~O(2) (eff)

❖ Network

➢ High-performance interconnect, e.g. InfiniBand or fiber optics

➢ Usually shared among ~10-1000 nodes

❖ Progressively lower bandwidth further away from the CPUs

Insights

❖ Data must be stored close to CPU

➢ Bandwidth drops by roughly one order of magnitude, ~O(1), per link away from CPU (GPU)

➢ Congestion at any link could decimate the throughput

➢ Shared-nothing architecture! This rules out Cloud, which is based mostly on the traditional, compute-bound HPC architecture

❖ If data must be moved, the volume to be moved must be commensurate with the bandwidth

➢ i.e. move larger volumes over higher-bandwidth links

❖ Data locality is paramount to performance/efficiency

➢ High bandwidth access to all data required for analysis
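To make "commensurate with the bandwidth" concrete, the following back-of-the-envelope sketch estimates how long it takes to move a given volume over each link. The bandwidths are the order-of-magnitude figures from the Summary slide, read as powers of ten in MB/s; the function name and the 1 TB example volume are illustrative assumptions, and real throughput would be lower on shared or congested links.

```python
# Order-of-magnitude bandwidths in MB/s, taken from the Summary slide
# (~O(n) read as 10**n; an interpretation, not a measurement).
BANDWIDTH_MB_PER_S = {
    "DRAM": 1e5,
    "Network": 1e4,
    "SSD": 1e3,
    "HDD": 1e2,
    "Internet": 1e2,
}


def transfer_seconds(volume_tb, link):
    """Ideal time in seconds to move `volume_tb` terabytes over `link`."""
    volume_mb = volume_tb * 1e6  # decimal units: 1 TB = 10**6 MB
    return volume_mb / BANDWIDTH_MB_PER_S[link]


for link in BANDWIDTH_MB_PER_S:
    seconds = transfer_seconds(1, link)
    print(f"1 TB over {link:8s}: {seconds:8.0f} s (~{seconds / 3600:.2f} h)")
```

Each step away from the CPU multiplies the ideal transfer time by about ten, so a volume that moves through DRAM in seconds takes hours from a hard disk or over the Internet.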

A Conclusion

Tightly coupled compute-storage: the only class of approaches that satisfies the requirements

❖ The analysis engine also manages the storage.

❖ A parallel distributed database management system (DBMS) typifies the approach.

➢ Not loosely coupled approaches like Spark or Hadoop

➢ Not non-system tools like Dask

❖ Cloud becomes prohibitively expensive with an IOPS guarantee.

HPC versus SNA Architecture

HPC/Cloud Architecture

❖ Better suited for compute-bound tasks, e.g. model simulations

❖ Facilitating elasticity – spinning up a variable number of VMs

Shared Nothing Architecture

❖ Better suited for I/O-bound tasks, e.g. data analysis

❖ Improved compute-storage affinity (data locality) but reduced elasticity

Data Placement

How data are partitioned and distributed onto cluster nodes

Simplistic Data Placement/Layout

[Figure: simplistic data placement, with axis label t and the annotation “N. America is here!”]

Better Load Balance – Smaller Chunks
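A toy calculation can illustrate why smaller chunks balance load better when an analysis targets one region. The sketch assumes a one-dimensional axis of 360 cells and a round-robin assignment of chunks to nodes; both are simplifications chosen for illustration, not the placement scheme of any particular DBMS, and `region_load` is a hypothetical helper.

```python
from collections import Counter


def region_load(lo, hi, chunk_size, n_nodes):
    """Count how many cells of the region [lo, hi) each node holds.

    Cells are grouped into chunks of `chunk_size`, and chunks are assigned
    to nodes round-robin (an assumed placement rule).
    """
    load = Counter()
    for cell in range(lo, hi):
        chunk = cell // chunk_size
        node = chunk % n_nodes
        load[node] += 1
    return dict(load)


# A region of interest covering cells 10..69 on a 4-node cluster.
print("90-cell chunks:", region_load(10, 70, 90, 4))
print("15-cell chunks:", region_load(10, 70, 15, 4))
```

With 90-cell chunks the whole region sits on one node while the other three idle; with 15-cell chunks the same region is spread evenly over all four.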

Data Placement Alignment

Data Placement is also known as Data Layout

Data Variety Challenges

❖ Spatial aspect (2D or 3D)

➢ Nonuniform data models

o e.g. Grid, Swath, and Point

➢ Nonuniform data resolutions

➢ Decoupling of array indices from geolocations

❖ Temporal aspect (1D)

➢ Nonuniform data resolutions

➢ Decoupling of array index from calendrical time

❖ File-centric practice

➢ Data are packaged into files of nonuniform extents

o i.e. different spatial coverages and temporal durations

[Figure: panels showing SciDB clusters of one Coordinator and several Workers, each instance pairing a SciDB Engine with a Local Store, with numbered data chunks (1-16) distributed among the instances]

6/6/2018 EarthCube 2018 AHM (Poster 154)

Minimize Data Movement:

Keep Analysis Pleasingly Parallel!

We recognize:

❖ A great proportion of Earth Science analysis requires spatiotemporal coincidence

➢ i.e. data for the same location and the same time

➢ e.g. comparisons (for verification and validation) almost always require it

❖ The objective: data placement alignment

❖ Some analyses, e.g. FFT and CCL, cannot be kept pleasingly parallel

To align the placements of data chunks spatiotemporally on the cluster nodes, we must thus

❖ Uniformly index all geoscience data spatiotemporally!

➢ Because array DBMSs use the index for partitioning
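The alignment argument can be sketched in a few lines: if two datasets are partitioned by the same function of a shared index, records for the same place and time land on the same node, and a coincidence join needs no node-to-node traffic. Everything here is a hypothetical illustration; the integer index stands in for a spatiotemporal index, and `node_for`, `place`, and `local_join` are not SciDB functions.

```python
from collections import defaultdict

CHUNK_SIZE = 4  # index values per chunk (assumed)
N_NODES = 3     # cluster size (assumed)


def node_for(index):
    """Map a shared index value to a node (toy rule: round-robin by chunk)."""
    return (index // CHUNK_SIZE) % N_NODES


def place(dataset):
    """Partition a {index: value} dataset onto nodes using the shared index."""
    nodes = defaultdict(dict)
    for index, value in dataset.items():
        nodes[node_for(index)][index] = value
    return nodes


def local_join(nodes_a, nodes_b):
    """Join two placed datasets node by node; no node reads another's data."""
    result = {}
    for node in range(N_NODES):
        a = nodes_a.get(node, {})
        b = nodes_b.get(node, {})
        for index in a.keys() & b.keys():
            result[index] = (a[index], b[index])
    return result


# Two made-up datasets with different coverage and sampling.
temperature = {i: 280 + i for i in range(0, 20)}
rainfall = {i: 0.1 * i for i in range(5, 25, 2)}

joined = local_join(place(temperature), place(rainfall))
expected = {i: (temperature[i], rainfall[i])
            for i in temperature.keys() & rainfall.keys()}
print(joined == expected, sorted(joined))
```

The node-local join reproduces the global join exactly, which is the property that keeps coincidence analysis pleasingly parallel.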

Scaling Variety

❖ Scaling Variety is much harder than scaling Volume.

❖ Homogeneous dataset: parallelization, even DMP, can be developed once and applied many times (to the same homogeneous dataset): reusable and scalable.

➢ NASA Earth Exchange, NEX

❖ Heterogeneous datasets: necessary for integrative analysis, but the above case-by-case approach no longer scales!

➢ Being a system science, Earth science demands interdisciplinary, integrative analysis of diverse datasets from diverse subdisciplines.

➢ If parallelization must be developed for each combination of heterogeneous datasets -> piecemeal scalability!

➢ While SMP can be transparently leveraged by typical Earth scientists using modern scripting languages, it is not the case with DMP.

➢ Typical Earth scientists do not possess the programming skill for DMP.

➢ Without a better system, Earth scientists are doomed to stay in the “disarray of variety”.

STARE

SpatioTemporal Adaptive-Resolution Encoding

Spatial Element of STARE

❖ Hierarchical Triangular Mesh, HTM

Right-justified HTM Encoding

❖ Quadtree hierarchy

➢ Indexes geolocation – a substitute for lat-lon

➢ Contains approximate data resolution

The bit format of the STARE spatial index including geo-position and resolution.

Efficient Set Operations on Regions

Spatial intersection by comparing integers is facilitated by the encoded resolution.
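The idea of testing spatial intersection by comparing integers can be illustrated with a toy quadtree index. The layout below is an assumption made for this sketch and is not the actual STARE bit format: 3 bits select one of 8 root triangles, each further level adds 2 bits, the position bits are padded to a fixed depth, and the low 5 bits store the resolution level. All names and constants are hypothetical.

```python
MAX_LEVEL = 8    # deepest quadtree level in this toy encoding (assumed)
LEVEL_BITS = 5   # low bits reserved for the resolution level (assumed)


def encode(root, path):
    """Encode a root triangle (0..7) and a list of child numbers (0..3)."""
    level = len(path)
    pos = root
    for child in path:
        pos = (pos << 2) | child
    pos <<= 2 * (MAX_LEVEL - level)  # pad position bits to full depth
    return (pos << LEVEL_BITS) | level


def level_of(index):
    """Resolution level stored in the low bits."""
    return index & ((1 << LEVEL_BITS) - 1)


def position_range(index):
    """Range of full-depth positions covered by this triangle."""
    free_bits = 2 * (MAX_LEVEL - level_of(index))
    lo = index >> LEVEL_BITS
    hi = lo | ((1 << free_bits) - 1)
    return lo, hi


def overlaps(a, b):
    """True if the two triangles intersect; uses integer comparisons only."""
    a_lo, a_hi = position_range(a)
    b_lo, b_hi = position_range(b)
    return a_lo <= b_hi and b_lo <= a_hi


coarse = encode(3, [1])          # level-1 triangle
fine = encode(3, [1, 2, 0])      # level-3 triangle inside `coarse`
sibling = encode(3, [2])         # a different level-1 triangle
print(overlaps(coarse, fine), overlaps(coarse, sibling), level_of(fine))
```

Because the level travels with the index, the covered range can be rebuilt from a single integer, which is what makes the comparison cheap in this sketch.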

Temporal Element of STARE

❖ Hierarchical Calendrical Encoding, HCE

Summary on Scaling Volume and Variety

Scaling Volume and Variety

On a parallel distributed array DBMS like SciDB:

❖ STARE homogenizes Variety by guaranteeing spatiotemporal data placement alignment on cluster nodes

➢ Data chunks of the same place and time from different datasets are colocated

❖ Shared Memory Parallelization, SMP

➢ Utilized on each node when analysis is pleasingly parallel

o High bandwidth from local storage to DRAM fully exploited

o Unnecessary node-to-node communication minimized

❖ Distributed Memory Parallelization, DMP

➢ Utilized automatically and systematically when necessary

o The data partitioning/distribution operation performed by DBMSs is akin to (pre-)domain decomposition

o Data are “domain decomposed” consistently using STARE into chunks and stored on disk

o Data chunks are quickly loaded into memory when needed for analysis

o DBMSs coordinate communication and execution of DMP

o SciDB has its own communication protocol but can also use MPI, e.g. ScaLAPACK

A Prototype

Interactive animation

Data and Process Flow

[Figure: data and process flow among heterogeneous geoscience data, STARE, spatiotemporally colocated data, SciDB analytics, and a browser-based GUI]

Prototype Demo Animation

Future Additions

❖ Use STARE to drastically improve (parallel) ingest performance while reducing resource requirements.

❖ Connect to metadata repository(ies)

➢ Seamlessly integrating into existing data discovery practice

❖ Enhance throughput

➢ Further reducing data movement for visualization

➢ Anticipating user analysis needs with behavior prediction

❖ Improve Python API to SciDB

➢ Integrating highly compatible Xarray

o E.g. both support named dimensions

Envisioned Architecture

[Figure: envisioned architecture with a traditional HPC simulation cluster (compute cluster and parallel file system), a data-intensive analysis (GATE) cluster of nodes labeled C and G, a metadata repository, data centers, and users, connected by flows numbered 1-7]

(Profound) Implications

❖ Better software quality, traceability, and reusability

➢ Programming SciDB UDXs, with DMP especially, is not for typical scientists

➢ Professional software engineers are required

➢ Once a UDX is constructed it is immediately reusable for all users

❖ Easier interdisciplinary collaboration

➢ A multiuser system with sophisticated control of roles and permissions

❖ More ensured research reproducibility

➢ More localized provenance collection

❖ Higher cost effectiveness by leveraging existing HPC facilities!

➢ Collocated with simulation hardware -> fast ingest of simulation results for analysis

➢ The collective cost of using Cloud is higher than it appears (since scientists do not pay for Cloud out of their own pockets), especially when the above factors are taken into consideration.

CCL UDO

❖ CCL: Connected Component Labeling

➢ Non-pleasingly parallel, DMP required

❖ UDO: User Defined Operator

❖ Used to track blizzards defined by visibility reduction due to in-air snow mass

➢ Processed 36 years (1980-2015) of MERRA reanalysis hourly data at 0.625°×0.5° resolution in ~30 minutes using a cluster of 28 containerized nodes (~6-yr-old).
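Why CCL is not pleasingly parallel can be seen in a minimal two-chunk sketch: each chunk can be labeled on its own, but a component that crosses the chunk boundary receives two different local labels, so the chunks must exchange boundary information and merge labels. This is a plain-Python illustration under simple assumptions (4-connectivity, two side-by-side chunks, union-find merging); it is not the authors' SciDB operator, and all function names are hypothetical.

```python
def label_chunk(mask, offset):
    """Flood-fill labeling of one chunk; labels start at offset + 1."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = offset
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not labels[r][c]:
                next_label += 1
                labels[r][c] = next_label
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))
    return labels, next_label


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def label_two_chunks(left, right):
    """Label two side-by-side chunks, then merge labels across the boundary."""
    l_labels, n_left = label_chunk(left, 0)
    r_labels, n_total = label_chunk(right, n_left)
    parent = list(range(n_total + 1))
    # Boundary exchange: last column of `left` touches first column of `right`.
    for row in range(len(left)):
        a, b = l_labels[row][-1], r_labels[row][0]
        if a and b:
            ra, rb = find(parent, a), find(parent, b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    merged = [[find(parent, v) if v else 0 for v in l_row + r_row]
              for l_row, r_row in zip(l_labels, r_labels)]
    return merged, n_total


left = [[1, 1], [0, 0], [1, 0]]
right = [[1, 0], [0, 0], [0, 1]]
merged, local_count = label_two_chunks(left, right)
global_count = len({v for row in merged for v in row if v})
print(merged, local_count, global_count)
```

Labeling the chunks independently finds four components, but only three exist once the boundary is reconciled; on a cluster that reconciliation step is where distributed-memory communication comes in.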

Blizzard Track Density

Blizzard Unique Visits

Snowfall Rate

2-m Wind Speed

Momentum Roughness Length

North America 2010 Winter Animation

Phenomenon Hierarchy

❖ Phenomenon hierarchies

➢ Super phenomena containing sub phenomena, like supersets and subsets

❖ Schiermeier, Q., “The real holes in climate science.” Nature News, 2010.

➢ Regional climate projections

➢ Representation of precipitation and cloud

➢ Role of aerosols

➢ Palaeoclimatological data

❖ Process-based diagnostics

➢ The heavy-handed use of univariate averaging has reached its limit of usefulness

➢ Highly contextual/conditional approaches are needed

➢ Phenomenon hierarchies are such high-context conditional features for more targeted model diagnostics

Acknowledgement

This research has been sponsored in part by

❖ National Science Foundation through grants ICER-1541043, ICER-1540542, IIS-1423487 and

❖ Advanced Information Systems Technology (AIST) program of NASA Earth Science Technology Office (ESTO)

with supplemental funding from

❖ Advancing Collaborative Connections for Earth System Science (ACCESS) program of NASA Earth Science Data System program.

Thank you!