large-scale spatial query processing on gpu-accelerated big data...

31
Large-Scale Spatial Query Processing on GPU-Accelerated Big Data Systems Jianting Zhang 1,2 Simin You 2 1 Depart of Computer Science, CUNY City College (CCNY) 2 Department of Computer Science, CUNY Graduate Center http://www-cs.ccny.cuny.edu/~jzhang/

Upload: others

Post on 17-Oct-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Large-Scale Spatial Query Processing

on GPU-Accelerated Big Data Systems

Jianting Zhang1,2 Simin You2

1 Depart of Computer Science, CUNY City College (CCNY)

2 Department of Computer Science, CUNY Graduate Center

http://www-cs.ccny.cuny.edu/~jzhang/

Page 2: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Outline•Introduction

•Spatial data, GIS, BigData and HPC

•Taxi trip data in NYC and Global Biodiversity Applications

•Spatial query processing on GPUs

•ISP-GPU

•Architecture and Implementations

•Experiment Results

•Alternative Techniques

•SpatialSpark

•Lightweight Distributed Execution (LDE) Engine

•Summary an Future Work

Page 3: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Computational

Geometry

Spatial Databases:

data modeling,

indexing, query

processing

Scientific Data/Information Visualization

Statistics/Machine learning

Geographical Information System

GIS

Environmental

Modeling

Air quality

Remote

Sensing

Hydrology

Ecology

Social-

Economic

Modeling Urban planning

Transportation

Census/Taxation

Social Studies

Computer

Graphics

Image Processing/Computer Vision

Page 4: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Big Geospatial Data Challenges

– Event Locations, trajectories and O-D data

• E.g., Taxi trip records (GPS traces or O-D locations)

• 0.5 million in NYC (medallion taxi cab only) and 1.2 million in Beijing per day

• From O-D locations to trajectories to frequent patterns

– Satellite: e.g., from GOES to GOES-R (2015/2016) [$11B]

• http://www.goes-r.gov/downloads/GOES-R-Tri-10-06-09_v7.pdf

• Spectral (3X)*spatial (4X)* temporal (5X)=60X

• 2km*2km*5min*16bands(360*60)*(180*60)*(12*24)*16~ 1+ trillion pixels per day

• Derived thematic data products (vector)

– http://www.goes-r.gov/products/baseline.html

– http://www.goes-r.gov/products/option2.html

– Species distributions

• E.g. 400+ million occurrence records (GBIF)

• E.g. 717,057 polygons and 78,929,697 vertices for 4148 birds distribution data (NatureServe)

Page 5: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Cloud computing+MapReduce+Hadoop

A

B

C

Thread Block

CPU Host (CMP)

Core

Local Cache

Shared Cache

DRAM

HDD SSD

GPU

SIMD

PCI-E

Ring Bus

Local Cache

Core ... Core

Core Core

GDRAM GDRAM

Core Core Core... ...

MIC

PCI-E

T0 T1

T2 T3

4-Threads

In-Order

16 Intel Sandy Bridge CPU cores+ 128GB RAM + 8TB disk + GTX TITAN + Xeon Phi 3120A ~ $9,994

Page 6: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

ASCI Red: 1997 First 1 Teraflops

(sustained) system with 9298 Intel Pentium

II Xeon processors (in 72 Cabinets)

•Feb. 2013

•7.1 billion transistors (551mm²)

•2,688 processors •4.5 TFLOPS SP and 1.3 TFLOPS DP

•Max bandwidth 288.4 GB/s

•PCI-E peripheral device

•250 W (17.98 GFLOPS/W -SP)

• Suggested retail price: $999

What can we do today using a device that is more

powerful than ASCI Red 16 years ago?

Page 7: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Geospatial Technologies and

Environmental CyberInfrastructure

(GeoTECI) Lab

Dr. Jianting Zhang

Department of Computer Science

The City College of New York

Affiliated Institutions

Collaborating Institutions

Students: Simin You (Ph.D. 2009 -), Siyu Liao (Ph.D. 2014-), Costin Vicoveanu (Undergraduate, 2014-)

Bharat Rosanlall (Undergraduate, 2014), Jay Yao (MS-thesis, 2011-2012), Chandrashekar Singh (MS

2013), Agniva Banerjee (MS, 2012), Roger King (MS, 2012), Wahyu Nugroho (MS, 2011), Xiao Quan

Cen Feng (MS 2011), Chetram Dasrat (Undergraduate, 2008)

Page 8: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

GeoTECI@CCNY

•HIGHEST-DB

•HIgh-performance GrapHics units based Engine for Spatial-Temporal data

•Spatial and Spatiotemporal indexing, query processing and optimization

•Trajectory data management on GPUs

•Segmentation/simplification/compression/Aggregation/Warehousing

•Map matching with road networks

•Data mining (moving cluster, convoy, swarm...)

… when yellow cabs,

green cabs and MTA

buses meet with multi-

core CPUs, GPUs and

MICs in NYC …

$449,845/4yr (08/01/2013-07/31/2017)

Page 9: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

GeoTECI@CCNY

High-resolution Satellite ImageryT

In-situ Observation Sensor Data

T

Global and Regional Climate Model Outputs

Data AssimilationZonal Statistics

TEcological, environmental and administrative zones

T

T

V

Temporal Trends

High-End Computing Facility

A

B

C

Thread Block

ROIs

… when GOES-R satellites, extratropical

cyclones and hummingbirds meet with TITAN …

Page 10: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

GeoTECI@CCNYCCNY Computer Science LAN

Microway

Dual 8-core

128GB memory

Nvidia GTX Titan

Intel Xeon Phi 3120A

8 TB storage

DIY

*2

SGI Octane III

Dual Quadcore

48GB memory

Nvidia C2050*2

8 TB storage

Dual-core

8GB memory

Nvidia GTX Titan

3 TB storage

Dell T5400

Dual Quadcore

16GB memory

Nvidia Quadro 6000

1.5 TB storage

Lenovo T400s

Dell T7500

Dual 6-core

24 GB memory

Nvidia Quadro 6000

Dell T7500

Dual 6-core

24 GB memory

Nvidia GTX 480

Dual Quadcore

16GB memory

Nvidia FX3700*2

Dell T5400

DIY

Quadcore (Haswell)

16 GB memory

AMD/ATI 7970

Quadcore

8 GB memory

Nvidia Quadro 5000m

HP 8740w

HP 8740w

CUNY HPCC

KVM

“Brawny” GPU cluster

“Wimmy” GPU cluster

Web Server/

Linux App Server Windows

App Server

...building a highly-configurable experimental computing

environment for innovative BigData technologies…

Page 12: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Taxi trip data in NYC

Count-Distance Distribution

0

5000000

10000000

15000000

20000000

<=

0

.0

( 0

.8,

1.0

]

( 1

.8,

2.0

]

( 2

.8,

3.0

]

( 3

.8,

4.0

]

( 4

.8,

5.0

]

( 5

.8,

6.0

]

( 6

.8,

7.0

]

( 7

.8,

8.0

]

( 8

.8,

9.0

]

( 9

.8,

10

.0]

( 1

0.8

, 1

1.0

]

( 1

1.8

, 1

2.0

]

( 1

2.8

, 1

3.0

]

( 1

3.8

, 1

4.0

]

( 1

4.8

, 1

5.0

]

( 1

5.8

, 1

6.0

]

( 1

6.8

, 1

7.0

]

( 1

7.8

, 1

8.0

]

( 1

8.8

, 1

9.0

]

( 1

9.8

, 2

0.0

]

Trip Distance (mile)

Co

un

t

Count-Time Distribution

0

5000000

10000000

15000000

20000000

<=

0

.0

( 2

.0,

3.0

]

( 5

.0,

6.0

]

( 8

.0,

9.0

]

( 1

1.0

, 1

2.0

]

( 1

4.0

, 1

5.0

]

( 1

7.0

, 1

8.0

]

( 2

0.0

, 2

1.0

]

( 2

3.0

, 2

4.0

]

( 2

6.0

, 2

7.0

]

( 2

9.0

, 3

0.0

]

( 3

2.0

, 3

3.0

]

( 3

5.0

, 3

6.0

]

( 3

8.0

, 3

9.0

]

( 4

1.0

, 4

2.0

]

( 4

4.0

, 4

5.0

]

( 4

7.0

, 4

8.0

]

> 5

0.0

TripTime (Minute)

Co

un

t

Count-Speed Distribution

0

5000000

10000000

15000000

20000000

<=

0

.0

( 1

.0, 2

.0]

( 3

.0, 4

.0]

( 5

.0, 6

.0]

( 7

.0, 8

.0]

( 9

.0, 1

0.0

]

( 1

1.0

, 1

2.0

]

( 1

3.0

, 1

4.0

]

( 1

5.0

, 1

6.0

]

( 1

7.0

, 1

8.0

]

( 1

9.0

, 2

0.0

]

( 2

1.0

, 2

2.0

]

( 2

3.0

, 2

4.0

]

( 2

5.0

, 2

6.0

]

( 2

7.0

, 2

8.0

]

( 2

9.0

, 3

0.0

]

( 3

1.0

, 3

2.0

]

( 3

3.0

, 3

4.0

]

( 3

5.0

, 3

6.0

]

( 3

7.0

, 3

8.0

]

( 3

9.0

, 4

0.0

]

( 4

1.0

, 4

2.0

]

( 4

3.0

, 4

4.0

]

( 4

5.0

, 4

6.0

]

( 4

7.0

, 4

8.0

]

( 4

9.0

, 5

0.0

]

Speed (MPH)

Co

un

t

Count-Fare Distribution

0

5000000

1000000015000000

20000000

25000000

30000000

<=

0.0

( 1

.0,

2.0

]

( 3

.0,

4.0

]

( 5

.0,

6.0

]

( 7

.0,

8.0

]

( 9

.0,

10.0

]

( 11.0

, 12.0

]

( 13.0

, 14.0

]

( 15.0

, 16.0

]

( 17.0

, 18.0

]

( 19.0

, 20.0

]

( 21.0

, 22.0

]

( 23.0

, 24.0

]

( 25.0

, 26.0

]

( 27.0

, 28.0

]

( 29.0

, 30.0

]

( 31.0

, 32.0

]

( 33.0

, 34.0

]

( 35.0

, 36.0

]

( 37.0

, 38.0

]

( 39.0

, 40.0

]

( 41.0

, 42.0

]

( 43.0

, 44.0

]

( 45.0

, 46.0

]

( 47.0

, 48.0

]

( 49.0

, 50.0

]

Fare ($)

Co

un

t

Over all distributions of trip distance, time, speed and fare (2009)

Page 13: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Taxi trip data in NYC

• How to manage taxi trip data?

– Geographical Information System (GIS)

– Spatial Databases (SDB)

– Moving Object Databases (MOD)

• How good are they?

– Pretty good for small amount of data

– But, rather poor for large-scale data

Page 14: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Taxi trip data in NYC

• Example 1: – Loading 170 million taxi pickup locations into PostgreSQL

– UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong","PuLat"),4326);

– 105.8 hours!

• Example 2: – Finding the nearest tax blocks for 170 million taxi pickup locations

using open source libspatiaindex+GDAL

– 30.5 hours!

I do not have time to wait...

Can we do better?

Intel Xeon 2.26 GHz processors with 48G memory

Page 15: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Global

Biodiversity

Data at GBIF

15

SELECT aoi_id, sp_id, sum (ST_area (inter_geom))

FROM

(

SELECT aoi_id, sp_id,

ST_Intersection (sp_geom, qw_geom)

AS inter_geom

FROM SP_TB, QW_TB

WHERE ST_Intersects (sp_geometry, qw_geom)

)

GROUP BY aoi_id, sp_id

HAVING sum(ST_area(inter_geom)) >T;

http://gbif.org

Page 16: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Spatial Data Processing on GPUs

http://www-cs.ccny.cuny.edu/~jzhang/papers/gpu_spatial_tr.pdf

Page 17: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Spatial query processing on GPUs

Single-Level Grid-File based Spatial Filtering

Vertices

(polygon/

polyline)

Points

•Perfect coalesced

memory accesses

•Utilizing GPU floating

point computing power

Nested-Loop based Refinement

J. Zhang, S. You and L. Gruenwald, "Parallel Online Spatial and Temporal Aggregations on Multi-core CPUs and Many-Core GPUs," Information Systems, vol. 44, p. 134–154, 2014.

Page 18: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Spatial query processing on GPUs

Top: grid size =256*256resolution=128 feet Right: grid size =8192*8192resolution=4 feet

Spatial Aggregation

9,424 /326=30X (8192*8192)

Temporal Aggregation

1709/198=8.6X (minute)

1598 /165 = 9.7X (hour)

Page 19: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Spatial query processing on GPUs

P2P-TP2N-D P2P-D

147,011

street

segments

38,794

census

blocks

(470,941

points)

735,488 tax

blocks

(4,698,986

points)

P2N-D P2P-T P2P-D

- 15.2 h 30.5 h

10.9 s 11.2 s 33.1 s

- 4,900X 3,200X

CPU time

GPU Time

Speedup

Algorithmic

improvement: 3.7X

Using main-memory

data structures: 37.4X

GPU Acceleration:

24.3X

Page 20: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Outline•Introduction

•Spatial data, GIS, BigData and HPC

•Taxi trip data in NYC and Global Biodiversity Applications

•Spatial query processing on GPUs

•ISP-GPU

•Architecture and Implementations

•Experiment Results

•Alternative Techniques

•SpatialSpark

•Lightweight Distributed Execution (LDE) Engine

•Summary an Future Work

Page 21: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

Page 22: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

http://www.slideshare.net/hadooparchbook/impala-architecture-presentation

• SQL Frontend: translate SQL

queries into execution plans

• C/C++ backend with SSE4

support (for strings operations)

• Efficient implementations of

hash-joins (partitioned and non-

partitioned)

• LLVM-based JIT

• ….

Attractive Features

Extension is challenging!

Page 23: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

class SpatialJoinNode : public BlockingJoinNode {

public:

SpatialJoinNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);

virtual Status Prepare(RuntimeState* state);

virtual Status GetNext(RuntimeState* state, RowBatch* row_batch, bool* eos);

virtual void Close(RuntimeState* state);

protected:

virtual Status InitGetNext(TupleRow* first_left_row);

virtual Status ConstructBuildSide(RuntimeState* state);

private:

boost::scoped_ptr<TPlanNode> thrift_plan_node_;

RuntimeState* runtime_state_;

}

create_rtree(…)

pip_join(…)

nearest_join(…)

Page 24: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

http://www-cs.ccny.cuny.edu/~jzhang/papers/isp_gpu_tr.pdf

Scalable and Efficient Spatial Data Management on Multi-Core CPU and GPU Clusters.

IEEE HardBD’15 Workshop

Page 25: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters

ISP-GPU ISP-MC+ GPU-Standalone MC-Standalone

taxi-nycb (s) 96 130 50 89

GBIF-WWF(s) 1822 2816 1498 2664

Taxi-nycb: ~170 million points, ~40 thousand polygons (9 vertices/polygon)

GBF-WWF: ~375 million points, ~15 thousand polygons (279 vertices/polygon)

Single-node results: 16core CPU/128GB, GTX Titan

Cluster results: 2-10 nodes each with 8 vCPU cores/15GB, 1536 CUDA cores/4 GB

(50 million species locations used due to memory constraint)

Page 26: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Outline•Introduction

•Spatial data, GIS, BigData and HPC

•Taxi trip data in NYC and Global Biodiversity Applications

•Spatial query processing on GPUs

•ISP-GPU

•Architecture and Implementations

•Experiment Results

•Alternative Techniques

•SpatialSpark

•Lightweight Distributed Execution (LDE) Engine

•Summary an Future Work

Page 27: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Alternative

Techniques

val sc = new SparkContext(conf)

//reading left side data from HDFS and perform pre-processing

val leftData = sc.textFile(leftFile, numPartitions).map(x => x.split(SEPARATOR)).zipWithIndex()

val leftGeometryById = leftData.map(x => (x._2, Try(new WKTReader().read(x._1.apply(leftGeometryIndex)))))

.filter(_._2.isSuccess).map(x => (x._1, x._2.get))

//similarly for right-side data….

//ready for spatial query (broadcast-based)

val joinPredicate =SpatialOperator.Within // NearestD can be applied similarly

var matchedPairs:RDD[(Long, Long)] = BroadcastSpatialJoin(sc, leftGeometryById, rightGeometryById, joinPredicate)

SpatialSpark: Just Open-Sourced

http://simin.me/projects/spatialspark/

http://www-cs.ccny.cuny.edu/~jzhang/papers/spatial_cc_tr.pdf

Large-Scale Spatial Join Query Processing in Cloud (Comparison with ISP-MC)

IEEE CloudDM’15 Workshop

Page 28: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Alternative

Techniques

Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processinghttp://www-cs.engr.ccny.cuny.edu/~jzhang/papers/lde_spatial_tr.pdf

Page 30: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Summary and Future Work

• Designs and implementations of an in-memory

spatial data management system on multi-core

CPU and many-core GPU clusters by

extending Cloudera Impala for distributed

spatial join query processing

• Experiments on the initial implementations

have revealed both advantages and

disadvantages of extending a tightly-coupled

big data system to support spatial data types

and their operations.

• Alternative techniques are being developed to

further improve efficiency, scalability,

extensibility and portability.

Page 31: Large-Scale Spatial Query Processing on GPU-Accelerated Big Data …geoteci.engr.ccny.cuny.edu/temp/GTC_Talk_03172015.pdf · 2018. 10. 30. · Big Geospatial Data Challenges –Event

Q&A

[email protected]

http://www-cs.ccny.cuny.edu/~jzhang/