Data Parallel Quadtree Indexing and Spatial Query Processing of Complex Polygon Data on GPUs

Jianting Zhang 1,2, Simin You 2, Le Gruenwald 3

1 Department of Computer Science, CUNY City College (CCNY)
2 Department of Computer Science, CUNY Graduate Center
3 School of Computer Science, the University of Oklahoma
CISE/IIS Medium Collaborative Research Grants 1302423/1302439: “Spatial Data and Trajectory Data Management on GPUs”
Outline
• Introduction & Background
• Application: Large-Scale Biodiversity Data Management
• Data Parallel Designs and Implementations
  • Polygon Decomposition
  • Quadtree Construction
  • Spatial Query Processing
• Experiments
• Summary and Future Work
Parallel Computing – Hardware

[Diagram: memory hierarchies of three platforms side by side. A CPU host (CMP) with multiple cores, per-core local caches, a shared cache, DRAM and disk/SSD storage; a GPU with SIMD cores grouped into thread blocks and GDRAM, attached over PCI-E; and a MIC with many 4-thread in-order cores and local caches on a ring bus, also attached over PCI-E.]
16 Intel Sandy Bridge CPU cores + 128GB RAM + 8TB disk + GTX TITAN + Xeon Phi 3120A ~ $9,994

ASCI Red (1997): the first system to sustain 1 teraflops, with 9,298 Intel Pentium II Xeon processors in 72 cabinets.

Nvidia GTX Titan (Feb. 2013):
• 7.1 billion transistors (551 mm²)
• 2,688 processors
• 4.5 TFLOPS SP and 1.3 TFLOPS DP
• Max bandwidth 288.4 GB/s
• PCI-E peripheral device
• 250 W (17.98 GFLOPS/W SP)
• Suggested retail price: $999

What can we do today using a device that is more powerful than ASCI Red was 17 years ago?
GeoTECI@CCNY: CCNY Computer Science LAN

• Microway (“brawny” GPU cluster): dual 8-core, 128GB memory, Nvidia GTX Titan, Intel Xeon Phi 3120A, 8TB storage
• SGI Octane III: dual quad-core, 48GB memory, Nvidia C2050 ×2, 8TB storage
• DIY ×2: dual-core, 8GB memory, Nvidia GTX Titan, 3TB storage
• Dell T5400: dual quad-core, 16GB memory, Nvidia Quadro 6000, 1.5TB storage
• Dell T7500: dual 6-core, 24GB memory, Nvidia Quadro 6000
• Dell T7500: dual 6-core, 24GB memory, Nvidia GTX 480
• Dell T5400: dual quad-core, 16GB memory, Nvidia FX3700 ×2
• DIY: quad-core (Haswell), 16GB memory, AMD/ATI 7970
• HP 8740w ×2: quad-core, 8GB memory, Nvidia Quadro 5000m
• CUNY HPCC; KVM; “brawny” and “wimpy” GPU clusters; web server/Linux app server; Windows app server

...building a highly-configurable experimental computing environment for innovative BigData technologies…
Computer Architecture vs. Spatial Data Management: how to fill the big gap effectively?

(David Wentzlaff, “Computer Architecture”, Princeton University course on Coursera)
Parallel Computing – Languages & Libraries
http://www.macs.hw.ac.uk/cs/techreps/docs/files/HW-MACS-TR-0103.pdf

Libraries: Thrust, Bolt, CUDPP, boost, GNU Parallel Mode
Source: http://parallelbook.com/sites/parallelbook.com/files/SC11_20111113_Intel_McCool_Robison_Reinders.pptx

Data Parallelisms → Parallel Primitives → Parallel Libraries → Parallel Hardware
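Designs on top of parallel primitives compose operations such as map, scan (prefix sum) and scatter instead of hand-writing kernels. As an illustration (not code from the talk), the sketch below builds stream compaction, a staple of such designs, from exactly those primitives on the CPU; Thrust offers the same composition on GPUs (e.g. `thrust::exclusive_scan` plus a scatter, or `thrust::copy_if`).

```python
from itertools import accumulate

def compact(flags, values):
    """Stream compaction via exclusive prefix sum (scan) + scatter:
    the scan of the 0/1 flags gives each surviving element its dense
    output slot, so every element can be written independently."""
    offsets = [0] + list(accumulate(flags))[:-1]  # exclusive scan
    out = [None] * sum(flags)
    for i, keep in enumerate(flags):              # map/scatter, data parallel in spirit
        if keep:
            out[offsets[i]] = values[i]
    return out

# Keep even values: flags mark survivors, the scan assigns positions.
vals = [3, 8, 5, 6, 1, 4]
flags = [1 if v % 2 == 0 else 0 for v in vals]
print(compact(flags, vals))  # [8, 6, 4]
```

Because every output position is known before any write happens, the scatter loop has no ordering dependence, which is what makes the pattern run well on thousands of GPU threads.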
Outline
• Introduction & Background
• Application: Large-Scale Biodiversity Data Management
• Data Parallel Designs and Implementations
  • Polygon Decomposition
  • Quadtree Construction
  • Spatial Query Processing
• Experiments
• Conclusions and Future Work
Managing Large-Scale Biodiversity Data

SELECT aoi_id, sp_id, sum(ST_Area(inter_geom))
FROM (
    SELECT aoi_id, sp_id, ST_Intersection(sp_geom, qw_geom) AS inter_geom
    FROM SP_TB, QW_TB
    WHERE ST_Intersects(sp_geom, qw_geom)
)
GROUP BY aoi_id, sp_id
HAVING sum(ST_Area(inter_geom)) > T;
http://geoteci.engr.ccny.cuny.edu/birds30s/BirdsQuest.html
Indexing “Complex” Polygons
http://en.wikipedia.org/wiki/Simple_polygon http://en.wikipedia.org/wiki/Simple_Features

Problems in indexing MBRs:
• Inexpensive yet inaccurate approximation for complex polygons
• Low pruning power when polygons are highly overlapped

“Complex” polygons:
• Polygons with multiple rings (with holes)
• Highly overlapped
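One reason MBRs approximate such polygons poorly is that holes and multiple rings are invisible to a bounding box. The classic even-odd rule handles multi-ring polygons directly: a point is inside iff a rightward ray crosses the boundary (all rings taken together) an odd number of times. A minimal sketch of this standard test (illustrative, not the paper's code):

```python
def point_in_complex_polygon(pt, rings):
    """Even-odd ray-casting test for a polygon with multiple rings
    (an outer boundary plus holes): cast a horizontal ray to the right
    and count crossings against every edge of every ring."""
    x, y = pt
    inside = False
    for ring in rings:
        n = len(ring)
        for i in range(n):
            (x1, y1), (x2, y2) = ring[i], ring[(i + 1) % n]
            if (y1 > y) != (y2 > y):  # edge straddles the ray's height
                # x-coordinate where the edge crosses the ray
                xc = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if xc > x:
                    inside = not inside
    return inside

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(4, 4), (6, 4), (6, 6), (4, 6)]
print(point_in_complex_polygon((2, 2), [outer, hole]))  # True
print(point_in_complex_polygon((5, 5), [outer, hole]))  # False (inside the hole)
```

Because crossings with a hole's edges flip the parity back, no special casing of rings is needed, which is also why quadrant-based decomposition can treat all rings uniformly.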
http://xlinux.nist.gov/dads/HTML/linearquadtr.html
Indexing “Complex” Polygons
(Zhang et al 2009)
(Zhang 2012)
Hours of runtime on birds range maps when extending GDAL/OGR (serial)
Fang et al. 2008. Spatial indexing in Microsoft SQL Server 2008 (SIGMOD’08): uses a B-Tree to index quadrants, but it is unclear how the quadrants are derived from polygons.
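A common way to make quadrants B-Tree-friendly is a linear quadtree key: interleave the bits of a quadrant's column and row numbers (a Morton/Z-order code), so that a 2-D quadrant becomes a 1-D integer and spatially nearby quadrants tend to get nearby keys. A sketch of the standard encoding follows; the SQL Server paper does not spell out its exact scheme, so this is illustrative only.

```python
def quadrant_key(x, y, level):
    """Linear-quadtree (Morton) key of the cell (x, y) in a
    2**level x 2**level grid: interleave the bits of x and y,
    most significant bit first, two key bits per tree level."""
    key = 0
    for b in range(level - 1, -1, -1):
        key = (key << 2) | (((y >> b) & 1) << 1) | ((x >> b) & 1)
    return key

# The four level-1 quadrants of a 2x2 grid get keys 0..3,
# in the order SW, SE, NW, NE.
print([quadrant_key(x, y, 1) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
print(quadrant_key(3, 5, 3))  # 39
```

Sorting quadrants by this key is the same as a depth-first traversal order of the quadtree, which is what makes a plain B-Tree usable as the spatial index.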
Outline
• Introduction & Background
• Application: Large-Scale Biodiversity Data Management
• Data Parallel Designs and Implementations
  • Polygon Decomposition
  • Quadtree Construction
  • Spatial Query Processing
• Experiments
• Conclusions and Future Work
Data Parallel Designs: Parallel Polygon Decomposition, Parallel Quadtree Construction (DFS vs. BFS) and Parallel Query Processing

Observations:
(1) All operations are data parallel at the quadrant level;
(2) Quadrants may be at different levels and come from the same or different polygons;
(3) Each GPU thread processes a quadrant;
(4) Accesses to GPU memory can be coalesced for neighboring quadrants from the same polygons.
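A minimal CPU sketch of the BFS-style, quadrant-level parallelism described above, using points rather than polygon boundaries for brevity (the point-based split, `capacity` and `max_level` are illustrative parameters, not the paper's implementation): every quadrant in the current level's frontier can be examined independently, which is what lets one GPU thread handle one quadrant.

```python
def build_quadtree_bfs(points, bounds, capacity=1, max_level=8):
    """Level-by-level (BFS) quadtree construction: each quadrant in the
    current frontier is processed independently, and overfull quadrants
    emit their four children for the next level. Empty quadrants are
    pruned. bounds = (x0, y0, x1, y1)."""
    leaves, frontier = [], [(bounds, points)]
    for level in range(max_level):
        nxt = []
        for (x0, y0, x1, y1), pts in frontier:    # one GPU thread per quadrant
            if not pts:
                continue                          # prune empty quadrants
            if len(pts) <= capacity or level == max_level - 1:
                leaves.append(((x0, y0, x1, y1), pts))
                continue
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            for q in ((x0, y0, mx, my), (mx, y0, x1, my),
                      (x0, my, mx, y1), (mx, my, x1, y1)):
                nxt.append((q, [p for p in pts
                                if q[0] <= p[0] < q[2] and q[1] <= p[1] < q[3]]))
        frontier = nxt
        if not frontier:
            break
    return leaves

leaves = build_quadtree_bfs([(1, 1), (9, 9), (9, 8)], (0, 0, 16, 16))
print(len(leaves))  # 3: one leaf per point after enough splits
```

In the primitive-based GPU design the inner loops become maps, scans and scatters over flat quadrant arrays; the sketch keeps Python lists only to show the level-by-level structure.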
Experiment Setup

Species distribution data:
• 4,062 bird species in the Western Hemisphere
• 708,509 polygons
• 77,699,991 vertices

Polygon group   Vertices range     Total polygons   Total points
1               10-100             497,559          11,961,389
2               100-1,000          33,374           8,652,278
3               1,000-10,000       6,719            20,436,931
4               10,000-100,000     1,213            33,336,083

Hardware and software:
• Dual 8-core Sandy Bridge CPU (2.60 GHz)
• 128GB memory
• Nvidia GTX Titan (6GB, 2,688 cores)
• Intel Xeon Phi 3120A (6GB, 57 cores)
• 8TB storage
• CentOS 6.4 with GCC 4.7.2, TBB 4.2, ISPC 1.6, CUDA 5.5

Notes:
• All vector initialization times in Thrust on GPUs are counted (newer versions of Thrust allow uninitialized device vectors)
• Performance can vary among CUDA SDKs
Runtimes (Polygon Decomposition)

[Chart: decomposition runtimes for polygon groups G1 (10-100 vertices), G2 (100-1,000), G3 (1,000-10,000) and G4 (10,000-100,000).]
Comparisons with PixelBox* (Polygon Decomposition)

Runtimes in milliseconds, by vertex-count group:

                   10-100   100-1,000   1,000-10,000   10,000-100,000
Proposed              451       2,303         46,215          193,714
PixelBox*-shared    2,260      15,732        338,695        1,686,879
PixelBox*-global      866       6,023        124,031          948,560
PixelBox*: the PixelBox algorithm [5] modified and extended to decompose single polygons (vs. computing the sizes of intersection areas of pairs of polygons) and to handle “complex” multi-ring polygons.

PixelBox*-shared: CUDA implementation using GPU shared memory for the stack
• DFS traversal with a batch size of N
• N cannot be too big (shared memory capacity) or too small (the GPU is underutilized if N is less than the warp size)

PixelBox*-global: CUDA implementation using GPU global memory for the stack
• DFS traversal with different batch sizes
• Coalesced global GPU memory accesses are efficient

Proposed technique: Thrust data parallel implementation on top of parallel primitives
• BFS traversal with higher degrees of parallelism
• Data parallel designs (using primitives) simplify implementations
• GPU shared memory is not explicitly used, which is more flexible
• Coalesced global GPU memory accesses are efficient
• But: large memory footprint (for the current implementation)
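For contrast with the BFS design, the DFS traversal used by the PixelBox* variants can be caricatured on the CPU as an explicit stack from which a fixed-size batch of quadrants is popped per iteration; on the GPU the stack lives in shared or global memory and the batch matches the warp size. The classifier `disk` and all parameter names below are hypothetical stand-ins for testing a quadrant against a polygon.

```python
import math

def decompose_dfs(bounds, classify, max_level=4, batch=4):
    """DFS quadrant decomposition with an explicit stack, popping up to
    `batch` quadrants per iteration. `classify(x0, y0, x1, y1)` returns
    'in' (fully inside the region), 'out', or 'mixed'."""
    stack = [(bounds, 0)]
    result = []
    while stack:
        work, stack = stack[-batch:], stack[:-batch]   # pop one batch
        for (x0, y0, x1, y1), level in work:
            c = classify(x0, y0, x1, y1)
            if c == 'out':
                continue                               # prune outside quadrants
            if c == 'in' or level == max_level:
                result.append((x0, y0, x1, y1))        # emit a quadrant
                continue
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2      # subdivide 'mixed'
            stack += [((x0, y0, mx, my), level + 1), ((mx, y0, x1, my), level + 1),
                      ((x0, my, mx, y1), level + 1), ((mx, my, x1, y1), level + 1)]
    return result

def disk(x0, y0, x1, y1):
    """Hypothetical classifier against the unit disk: compare the nearest
    and farthest points of the quadrant to the disk's radius."""
    near = math.hypot(max(x0, min(0.0, x1)), max(y0, min(0.0, y1)))
    far = max(math.hypot(cx, cy) for cx in (x0, x1) for cy in (y0, y1))
    if far <= 1.0:
        return 'in'
    return 'out' if near > 1.0 else 'mixed'

quads = decompose_dfs((-1.0, -1.0, 1.0, 1.0), disk, max_level=3)
print(len(quads))  # number of quadrants covering the disk
```

The stack depth here is bounded by `4 * max_level`, which is why the shared-memory variant must cap the batch size: the whole stack has to fit in a few tens of kilobytes per thread block.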
Summary and Future Work
• Diversified hardware makes it challenging to develop efficient parallel implementations for complex domain-specific applications across platforms.
• The framework of data parallel designs on top of parallel primitives appears to be a viable solution in the context of managing and querying large-scale geo-referenced species distribution data.
• Experiments on 4,000+ bird species distribution data have shown up to 190X speedups for polygon decomposition and 27X speedups for quadtree construction over serial implementations on a high-end GPU.
• Comparisons with the PixelBox* variations, which are native CUDA implementations, have shown that efficiency and productivity can be achieved simultaneously with the data parallel framework using parallel primitives.

Future work:
• Further understand the advantages and disadvantages of data parallel designs/implementations on parallel hardware (GPUs, MICs and CMPs) through domain-specific applications.
• More efficient polygon decomposition algorithms (e.g., scanline-based) using parallel primitives.
• System integration and more applications.