Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs

Visiting Faculty: Jianting Zhang, The City College of New York (CCNY), [email protected]; ORNL Host: Dali Wang, Climate Change Science Institute (CCSI), [email protected]

Overview
As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs increase, in-situ and/or post-processing of such large amounts of geospatial data is increasingly becoming a bottleneck in scientific inquiries into Earth systems and their human impacts. Existing geospatial techniques based on outdated computing models (e.g., serial algorithms and disk-resident systems), as implemented in many commercial and open-source packages, cannot process large-scale geospatial data at the desired level of performance. Partially supported by DOE's Visiting Faculty Program (VFP), we are investigating a set of parallel data structures and algorithms that can exploit the massively data-parallel computing power of commodity Graphics Processing Units (GPUs). We have designed and implemented a popular geospatial technique called Zonal Statistics, i.e., deriving histograms of points or raster cells in polygonal zones. Our GPU-based parallel Zonal Statistics technique, applied to 3,000+ US counties over 20+ billion NASA SRTM 30-meter-resolution Digital Elevation Model (DEM) raster cells, achieves impressive end-to-end runtimes: 101 seconds (cold cache) and 46 seconds (hot cache) on a low-end workstation equipped with an Nvidia GTX Titan GPU, 60-70 seconds on a single OLCF Titan compute node, and 10-15 seconds on 8 nodes.

[Workflow figure: high-resolution satellite imagery, in-situ observation sensor data, and global/regional climate model outputs are assimilated and fed to Zonal Statistics over ecological, environmental, and administrative zones on a high-end computing facility, yielding temporal trends; GPU thread blocks are mapped to regions of interest (ROIs).]

Background & Motivation

Zonal Statistics on NASA Shuttle Radar Topography Mission (SRTM) Data


SQL: SELECT COUNT(*) FROM T_O, T_Z WHERE ST_WITHIN(T_O.the_geom, T_Z.the_geom) GROUP BY T_Z.z_id;

Point-in-Polygon Test

For each county, derive its histogram of elevations from raster cells that are in the polygon.

• SRTM: 20×10⁹ (billion) raster cells (~40 GB raw, ~15 GB compressed TIFF)
• Zones: 3,141 counties, 87,097 vertices

Brute-force point-in-polygon test
• RT = (# of points) × (# of vertices) × (ops per point-in-polygon test) / (ops per second) = 20×10⁹ × 87,097 × 20 / (10×10⁹) ≈ 3.5×10⁶ seconds ≈ 40 days
• Using all of Titan's 18,688 nodes: ~200 seconds
• Flops utilization is typically low: it can be <1% for data-intensive applications (typically <10% in HPC)
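To make the cost model concrete, below is a minimal CUDA sketch of the brute-force approach: a standard ray-crossing point-in-polygon test, one thread per raster cell against one polygon. The kernel name, array layout, and the `pnpoly` helper are illustrative assumptions, not the poster's actual implementation.

```cuda
// Minimal brute-force sketch (assumed names/layout, not the poster's code):
// standard ray-crossing point-in-polygon test, one thread per raster cell.
__device__ bool pnpoly(float x, float y,
                       const float* vx, const float* vy, int nv)
{
    bool inside = false;
    for (int i = 0, j = nv - 1; i < nv; j = i++) {
        // Toggle on each polygon edge crossed by a horizontal ray from (x, y).
        if (((vy[i] > y) != (vy[j] > y)) &&
            (x < (vx[j] - vx[i]) * (y - vy[i]) / (vy[j] - vy[i]) + vx[i]))
            inside = !inside;
    }
    return inside;
}

// Every cell is tested against every vertex of one polygon, which is exactly
// the (#points x #vertices x ops-per-test) term in the RT estimate above.
__global__ void brute_force_count(const float* cx, const float* cy, int ncells,
                                  const float* vx, const float* vy, int nv,
                                  unsigned int* count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ncells && pnpoly(cx[i], cy[i], vx, vy, nv))
        atomicAdd(count, 1u);   // COUNT(*) per zone, as in the SQL formulation
}
```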

Hybrid Spatial Databases + HPC Approach
Observation: only points/cells that are close enough to polygons need to be tested.
Question: how do we pair neighboring points/cells with polygons?

Minimum Bounding Boxes (MBRs)

Step 1: divide a raster tile into blocks and generate per-block histograms.
Step 2: derive polygon MBRs and pair them with blocks through a box-in-polygon test (inside/intersect); see the sketch below.
Step 3: aggregate per-block histograms into per-polygon histograms for blocks that lie completely within polygons.
Step 4: for each intersecting polygon/block pair, perform a point(cell)-in-polygon test for all raster cells in the block and update the respective polygon histogram.

Block vs. polygon cases: (A) Intersect, (B) Inside, (C) Outside
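A minimal sketch of the coarse pairing filter in Step 2 is shown below, assuming a simple axis-aligned box representation; a raster block only becomes a candidate for the finer inside/intersect/outside classification if its box overlaps the polygon's MBR. The struct and function names are placeholders for illustration, not the poster's actual code.

```cuda
// Hedged sketch of the Step-2 coarse filter (assumed names).
enum BlockClass { INTERSECT, INSIDE, OUTSIDE };   // cases (A), (B), (C) above

struct Box { float xmin, ymin, xmax, ymax; };     // block extent or polygon MBR

// A block can only be case (A) or (B) if its box overlaps the polygon's MBR;
// non-overlapping blocks are classified (C) without any per-cell work.
__host__ __device__ inline bool overlaps(const Box& a, const Box& b)
{
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}
```

Pairs that survive this filter are then resolved into case (A) or (B) with the corner and edge tests described under the GPU implementation below.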

GPU Implementations

Identifying parallelisms and mapping them to hardware:
(1) Deriving per-block histograms
(2) Block-in-polygon test
(3) Aggregating histograms for "within" blocks
(4) Point-in-polygon test for individual cells and histogram update

[Figure: each GPU thread block processes one raster block; histogram bins are updated with atomicAdd; MBRs (M1, M2, ...) are paired with blocks/cells (C1, C2, C3, ...).]
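For parallelism (1), a hedged CUDA sketch of per-block histogramming is given below, using the parameter values reported later on this poster (thread block size 256, up to 5,000 histogram bins, 360×360-cell raster blocks); bins are accumulated in shared memory with atomicAdd, as in the figure above. The kernel name, element type, and one-bin-per-meter binning are assumptions.

```cuda
// Hedged sketch of Step 1 (assumed names/binning, not the poster's code):
// one CUDA thread block builds the elevation histogram of one raster block.
#define NBINS 5000                 // "maximum histogram bins" parameter
#define BLOCK_CELLS (360 * 360)    // 0.1 x 0.1 degree block at 1 arc-second

__global__ void per_block_histogram(const short* elev,     // cells, block-major
                                    unsigned int* hist)     // NBINS per raster block
{
    __shared__ unsigned int sh[NBINS];                  // ~20 KB of shared memory
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x) sh[b] = 0;
    __syncthreads();

    const short* cells = elev + (size_t)blockIdx.x * BLOCK_CELLS;
    for (int i = threadIdx.x; i < BLOCK_CELLS; i += blockDim.x) {
        int bin = cells[i];                             // assume 1 m elevation bins
        if (bin >= 0 && bin < NBINS) atomicAdd(&sh[bin], 1u);
    }
    __syncthreads();

    unsigned int* out = hist + (size_t)blockIdx.x * NBINS;
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        out[b] = sh[b];                                 // each block owns its slice
}
```

A launch such as `per_block_histogram<<<num_raster_blocks, 256>>>(d_elev, d_hist)` (names assumed) would process all raster blocks of a decompressed chunk in one pass.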

• Point-in-polygon test for each of the cell's 4 corners
• All-pair edge intersection tests between the polygon and the cell
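The edge intersection part of this test can be a standard orientation-based segment predicate, sketched below under assumed names; the poster does not give its exact formulation, and this strict version ignores collinear/touching cases for brevity.

```cuda
// Hedged sketch (assumed names): proper-intersection test between one polygon
// edge (p0-p1) and one cell/block edge (q0-q1), usable from host or device.
__host__ __device__ inline float cross2(float ax, float ay, float bx, float by)
{
    return ax * by - ay * bx;            // z-component of the 2D cross product
}

__host__ __device__ bool segments_intersect(float p0x, float p0y, float p1x, float p1y,
                                            float q0x, float q0y, float q1x, float q1y)
{
    float d1 = cross2(p1x - p0x, p1y - p0y, q0x - p0x, q0y - p0y);
    float d2 = cross2(p1x - p0x, p1y - p0y, q1x - p0x, q1y - p0y);
    float d3 = cross2(q1x - q0x, q1y - q0y, p0x - q0x, p0y - q0y);
    float d4 = cross2(q1x - q0x, q1y - q0y, p1x - q0x, p1y - q0y);
    // Proper crossing: each segment's endpoints lie on opposite sides of the other.
    return ((d1 > 0.0f) != (d2 > 0.0f)) && ((d3 > 0.0f) != (d4 > 0.0f));
}
```

If any polygon edge intersects a cell/block edge, the pair is case (A); if none intersects and all 4 corners are inside the polygon, case (B); otherwise case (C) (modulo the corner case of a polygon wholly contained in the block).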

[Figure: layout of polygon vertices and cells.]
• Perfectly coalesced memory accesses
• Utilizing GPU floating-point power


BPQ-Tree based raster compression to save disk I/O
• Idea: chop an M-bit raster into M binary bitmaps and then build a quadtree for each bitmap (Zhang et al. 2011)

• BPQ-Tree achieves a competitive compression ratio but is much more parallelization-friendly on GPUs
• Advantage 1: compressed data is streamed from disk to the GPU without requiring decompression on CPUs, reducing CPU memory footprint and data transfer time
• Advantage 2: it can be used in conjunction with CPU-based compression to further improve the compression ratio
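As a rough illustration of the bit-plane decomposition behind BPQ-Tree (quadtree construction and bit packing omitted), a hedged CUDA sketch is shown below; the names and the one-byte-per-bit output layout are assumptions made for clarity, and a real encoder would pack 8 cells per byte.

```cuda
// Hedged sketch (assumed names/layout): split an M-bit raster into M binary
// bitplanes, one thread per cell; each plane would then be quadtree-encoded.
__global__ void split_bitplanes(const unsigned short* raster,  // M <= 16 bits/cell
                                unsigned char* planes,         // M planes, 1 byte/cell here
                                int ncells, int M)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ncells) return;
    unsigned short v = raster[i];
    for (int b = 0; b < M; ++b)
        planes[(size_t)b * ncells + i] = (v >> b) & 1;   // bit b of cell i -> plane b
}
```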

Experiments and Evaluations

Data and Pre-processing

Tile #   Dimension        Partition schema
1        54000×43200      2×2
2        50400×43200      2×2
3        50400×43200      2×2
4        82800×36000      2×2
5        61200×46800      2×2
6        68400×111600     4×4
Total    20,165,760,000 cells    36 partitions

Data format                    Volume (GB)
Original (raw)                 38
TIFF compression               15
gzip compression               8.3
BPQ-Tree compression           7.3
BPQ-Tree + gzip compression    5.5

Single-node configuration 1: Dell T5400 workstation
• Intel Xeon E5405 dual quad-core processors (2.00 GHz), 16 GB RAM, PCI-E Gen2, 3×500 GB 7200 RPM disks with 32 MB cache ($5,000)
• Nvidia Quadro 6000 GPU: 448 Fermi cores (574 MHz), 6 GB, 144 GB/s ($4,500)

Single-node configuration 2: do-it-yourself workstation
• Intel Core i5-650 dual-core, 8 GB RAM, PCI-E Gen3, 500 GB 7200 RPM disk with 32 MB cache (recycled) ($1,000)
• Nvidia GTX Titan GPU: 2,688 Kepler cores (837 MHz), 6 GB, 288 GB/s ($1,000)

Parameters:
• Raster chunk size (coding): 4096×4096
• Thread block size: 256
• Maximum histogram bins: 5,000
• Raster block size: 0.1×0.1 degree, i.e., 360×360 cells (resolution is 1×1 arc-second)

GPU cluster: OLCF Titan

Results

Single node   Cold cache   Hot cache
Config 1      180 s        78 s
Config 2      101 s        46 s

Per-step runtimes (seconds)                          Quadro 6000   GTX Titan   8-core CPU
(Step 0) Raster decompression                        16.2          8.30        131.9
Step 1: Per-block histogramming                      21.5          13.4        /
Step 2: Block-in-polygon test                        0.11          0.07        /
Step 3: "Within-block" histogram aggregation         0.14          0.11        /
Step 4: Cell-in-polygon test and histogram update    29.7          11.4        /
Total, major steps                                   67.7          33.3        /
Wall-clock end-to-end                                85            46          /

Titan
# of nodes    1      2      4      8      16
Runtime (s)   60.7   31.3   17.9   10.2   7.6
