Large-Scale Spatial Query Processing
on GPU-Accelerated Big Data Systems
Jianting Zhang1,2 Simin You2
1 Depart of Computer Science, CUNY City College (CCNY)
2 Department of Computer Science, CUNY Graduate Center
http://www-cs.ccny.cuny.edu/~jzhang/
Outline•Introduction
•Spatial data, GIS, BigData and HPC
•Taxi trip data in NYC and Global Biodiversity Applications
•Spatial query processing on GPUs
•ISP-GPU
•Architecture and Implementations
•Experiment Results
•Alternative Techniques
•SpatialSpark
•Lightweight Distributed Execution (LDE) Engine
•Summary an Future Work
Computational
Geometry
Spatial Databases:
data modeling,
indexing, query
processing
Scientific Data/Information Visualization
Statistics/Machine learning
Geographical Information System
GIS
Environmental
Modeling
Air quality
Remote
Sensing
Hydrology
Ecology
Social-
Economic
Modeling Urban planning
Transportation
Census/Taxation
Social Studies
Computer
Graphics
Image Processing/Computer Vision
Big Geospatial Data Challenges
– Event Locations, trajectories and O-D data
• E.g., Taxi trip records (GPS traces or O-D locations)
• 0.5 million in NYC (medallion taxi cab only) and 1.2 million in Beijing per day
• From O-D locations to trajectories to frequent patterns
– Satellite: e.g., from GOES to GOES-R (2015/2016) [$11B]
• http://www.goes-r.gov/downloads/GOES-R-Tri-10-06-09_v7.pdf
• Spectral (3X)*spatial (4X)* temporal (5X)=60X
• 2km*2km*5min*16bands(360*60)*(180*60)*(12*24)*16~ 1+ trillion pixels per day
• Derived thematic data products (vector)
– http://www.goes-r.gov/products/baseline.html
– http://www.goes-r.gov/products/option2.html
– Species distributions
• E.g. 400+ million occurrence records (GBIF)
• E.g. 717,057 polygons and 78,929,697 vertices for 4148 birds distribution data (NatureServe)
Cloud computing+MapReduce+Hadoop
A
B
C
Thread Block
CPU Host (CMP)
Core
Local Cache
Shared Cache
DRAM
HDD SSD
GPU
SIMD
PCI-E
Ring Bus
Local Cache
Core ... Core
Core Core
GDRAM GDRAM
Core Core Core... ...
MIC
PCI-E
T0 T1
T2 T3
4-Threads
In-Order
16 Intel Sandy Bridge CPU cores+ 128GB RAM + 8TB disk + GTX TITAN + Xeon Phi 3120A ~ $9,994
ASCI Red: 1997 First 1 Teraflops
(sustained) system with 9298 Intel Pentium
II Xeon processors (in 72 Cabinets)
•Feb. 2013
•7.1 billion transistors (551mm²)
•2,688 processors •4.5 TFLOPS SP and 1.3 TFLOPS DP
•Max bandwidth 288.4 GB/s
•PCI-E peripheral device
•250 W (17.98 GFLOPS/W -SP)
• Suggested retail price: $999
What can we do today using a device that is more
powerful than ASCI Red 16 years ago?
Geospatial Technologies and
Environmental CyberInfrastructure
(GeoTECI) Lab
Dr. Jianting Zhang
Department of Computer Science
The City College of New York
Affiliated Institutions
Collaborating Institutions
Students: Simin You (Ph.D. 2009 -), Siyu Liao (Ph.D. 2014-), Costin Vicoveanu (Undergraduate, 2014-)
Bharat Rosanlall (Undergraduate, 2014), Jay Yao (MS-thesis, 2011-2012), Chandrashekar Singh (MS
2013), Agniva Banerjee (MS, 2012), Roger King (MS, 2012), Wahyu Nugroho (MS, 2011), Xiao Quan
Cen Feng (MS 2011), Chetram Dasrat (Undergraduate, 2008)
GeoTECI@CCNY
•HIGHEST-DB
•HIgh-performance GrapHics units based Engine for Spatial-Temporal data
•Spatial and Spatiotemporal indexing, query processing and optimization
•Trajectory data management on GPUs
•Segmentation/simplification/compression/Aggregation/Warehousing
•Map matching with road networks
•Data mining (moving cluster, convoy, swarm...)
… when yellow cabs,
green cabs and MTA
buses meet with multi-
core CPUs, GPUs and
MICs in NYC …
$449,845/4yr (08/01/2013-07/31/2017)
GeoTECI@CCNY
High-resolution Satellite ImageryT
In-situ Observation Sensor Data
T
Global and Regional Climate Model Outputs
Data AssimilationZonal Statistics
TEcological, environmental and administrative zones
T
T
V
Temporal Trends
High-End Computing Facility
A
B
C
Thread Block
ROIs
… when GOES-R satellites, extratropical
cyclones and hummingbirds meet with TITAN …
GeoTECI@CCNYCCNY Computer Science LAN
Microway
Dual 8-core
128GB memory
Nvidia GTX Titan
Intel Xeon Phi 3120A
8 TB storage
DIY
*2
SGI Octane III
Dual Quadcore
48GB memory
Nvidia C2050*2
8 TB storage
Dual-core
8GB memory
Nvidia GTX Titan
3 TB storage
Dell T5400
Dual Quadcore
16GB memory
Nvidia Quadro 6000
1.5 TB storage
Lenovo T400s
Dell T7500
Dual 6-core
24 GB memory
Nvidia Quadro 6000
Dell T7500
Dual 6-core
24 GB memory
Nvidia GTX 480
Dual Quadcore
16GB memory
Nvidia FX3700*2
Dell T5400
DIY
Quadcore (Haswell)
16 GB memory
AMD/ATI 7970
Quadcore
8 GB memory
Nvidia Quadro 5000m
HP 8740w
HP 8740w
CUNY HPCC
KVM
“Brawny” GPU cluster
“Wimmy” GPU cluster
Web Server/
Linux App Server Windows
App Server
...building a highly-configurable experimental computing
environment for innovative BigData technologies…
Taxi trip data in NYC
11
Taxi trip records
•~170 million trips (300 million
passengers) in 2009
•1/5 of that of subway riders and
1/3 of that of bus riders in NYC
Taxicabs
•13,000 Medallion taxi cabs
•License priced at > $1M
•Car services and taxi services
are separate
Taxi trip data in NYC
Count-Distance Distribution
0
5000000
10000000
15000000
20000000
<=
0
.0
( 0
.8,
1.0
]
( 1
.8,
2.0
]
( 2
.8,
3.0
]
( 3
.8,
4.0
]
( 4
.8,
5.0
]
( 5
.8,
6.0
]
( 6
.8,
7.0
]
( 7
.8,
8.0
]
( 8
.8,
9.0
]
( 9
.8,
10
.0]
( 1
0.8
, 1
1.0
]
( 1
1.8
, 1
2.0
]
( 1
2.8
, 1
3.0
]
( 1
3.8
, 1
4.0
]
( 1
4.8
, 1
5.0
]
( 1
5.8
, 1
6.0
]
( 1
6.8
, 1
7.0
]
( 1
7.8
, 1
8.0
]
( 1
8.8
, 1
9.0
]
( 1
9.8
, 2
0.0
]
Trip Distance (mile)
Co
un
t
Count-Time Distribution
0
5000000
10000000
15000000
20000000
<=
0
.0
( 2
.0,
3.0
]
( 5
.0,
6.0
]
( 8
.0,
9.0
]
( 1
1.0
, 1
2.0
]
( 1
4.0
, 1
5.0
]
( 1
7.0
, 1
8.0
]
( 2
0.0
, 2
1.0
]
( 2
3.0
, 2
4.0
]
( 2
6.0
, 2
7.0
]
( 2
9.0
, 3
0.0
]
( 3
2.0
, 3
3.0
]
( 3
5.0
, 3
6.0
]
( 3
8.0
, 3
9.0
]
( 4
1.0
, 4
2.0
]
( 4
4.0
, 4
5.0
]
( 4
7.0
, 4
8.0
]
> 5
0.0
TripTime (Minute)
Co
un
t
Count-Speed Distribution
0
5000000
10000000
15000000
20000000
<=
0
.0
( 1
.0, 2
.0]
( 3
.0, 4
.0]
( 5
.0, 6
.0]
( 7
.0, 8
.0]
( 9
.0, 1
0.0
]
( 1
1.0
, 1
2.0
]
( 1
3.0
, 1
4.0
]
( 1
5.0
, 1
6.0
]
( 1
7.0
, 1
8.0
]
( 1
9.0
, 2
0.0
]
( 2
1.0
, 2
2.0
]
( 2
3.0
, 2
4.0
]
( 2
5.0
, 2
6.0
]
( 2
7.0
, 2
8.0
]
( 2
9.0
, 3
0.0
]
( 3
1.0
, 3
2.0
]
( 3
3.0
, 3
4.0
]
( 3
5.0
, 3
6.0
]
( 3
7.0
, 3
8.0
]
( 3
9.0
, 4
0.0
]
( 4
1.0
, 4
2.0
]
( 4
3.0
, 4
4.0
]
( 4
5.0
, 4
6.0
]
( 4
7.0
, 4
8.0
]
( 4
9.0
, 5
0.0
]
Speed (MPH)
Co
un
t
Count-Fare Distribution
0
5000000
1000000015000000
20000000
25000000
30000000
<=
0.0
( 1
.0,
2.0
]
( 3
.0,
4.0
]
( 5
.0,
6.0
]
( 7
.0,
8.0
]
( 9
.0,
10.0
]
( 11.0
, 12.0
]
( 13.0
, 14.0
]
( 15.0
, 16.0
]
( 17.0
, 18.0
]
( 19.0
, 20.0
]
( 21.0
, 22.0
]
( 23.0
, 24.0
]
( 25.0
, 26.0
]
( 27.0
, 28.0
]
( 29.0
, 30.0
]
( 31.0
, 32.0
]
( 33.0
, 34.0
]
( 35.0
, 36.0
]
( 37.0
, 38.0
]
( 39.0
, 40.0
]
( 41.0
, 42.0
]
( 43.0
, 44.0
]
( 45.0
, 46.0
]
( 47.0
, 48.0
]
( 49.0
, 50.0
]
Fare ($)
Co
un
t
Over all distributions of trip distance, time, speed and fare (2009)
Taxi trip data in NYC
• How to manage taxi trip data?
– Geographical Information System (GIS)
– Spatial Databases (SDB)
– Moving Object Databases (MOD)
• How good are they?
– Pretty good for small amount of data
– But, rather poor for large-scale data
Taxi trip data in NYC
• Example 1: – Loading 170 million taxi pickup locations into PostgreSQL
– UPDATE t SET PUGeo = ST_SetSRID(ST_Point("PULong","PuLat"),4326);
– 105.8 hours!
• Example 2: – Finding the nearest tax blocks for 170 million taxi pickup locations
using open source libspatiaindex+GDAL
– 30.5 hours!
I do not have time to wait...
Can we do better?
Intel Xeon 2.26 GHz processors with 48G memory
Global
Biodiversity
Data at GBIF
15
SELECT aoi_id, sp_id, sum (ST_area (inter_geom))
FROM
(
SELECT aoi_id, sp_id,
ST_Intersection (sp_geom, qw_geom)
AS inter_geom
FROM SP_TB, QW_TB
WHERE ST_Intersects (sp_geometry, qw_geom)
)
GROUP BY aoi_id, sp_id
HAVING sum(ST_area(inter_geom)) >T;
http://gbif.org
Spatial Data Processing on GPUs
http://www-cs.ccny.cuny.edu/~jzhang/papers/gpu_spatial_tr.pdf
Spatial query processing on GPUs
Single-Level Grid-File based Spatial Filtering
Vertices
(polygon/
polyline)
Points
•Perfect coalesced
memory accesses
•Utilizing GPU floating
point computing power
Nested-Loop based Refinement
J. Zhang, S. You and L. Gruenwald, "Parallel Online Spatial and Temporal Aggregations on Multi-core CPUs and Many-Core GPUs," Information Systems, vol. 44, p. 134–154, 2014.
Spatial query processing on GPUs
Top: grid size =256*256resolution=128 feet Right: grid size =8192*8192resolution=4 feet
Spatial Aggregation
9,424 /326=30X (8192*8192)
Temporal Aggregation
1709/198=8.6X (minute)
1598 /165 = 9.7X (hour)
Spatial query processing on GPUs
P2P-TP2N-D P2P-D
147,011
street
segments
38,794
census
blocks
(470,941
points)
735,488 tax
blocks
(4,698,986
points)
P2N-D P2P-T P2P-D
- 15.2 h 30.5 h
10.9 s 11.2 s 33.1 s
- 4,900X 3,200X
CPU time
GPU Time
Speedup
Algorithmic
improvement: 3.7X
Using main-memory
data structures: 37.4X
GPU Acceleration:
24.3X
Outline•Introduction
•Spatial data, GIS, BigData and HPC
•Taxi trip data in NYC and Global Biodiversity Applications
•Spatial query processing on GPUs
•ISP-GPU
•Architecture and Implementations
•Experiment Results
•Alternative Techniques
•SpatialSpark
•Lightweight Distributed Execution (LDE) Engine
•Summary an Future Work
ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters
ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters
http://www.slideshare.net/hadooparchbook/impala-architecture-presentation
• SQL Frontend: translate SQL
queries into execution plans
• C/C++ backend with SSE4
support (for strings operations)
• Efficient implementations of
hash-joins (partitioned and non-
partitioned)
• LLVM-based JIT
• ….
Attractive Features
Extension is challenging!
ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters
class SpatialJoinNode : public BlockingJoinNode {
public:
SpatialJoinNode(ObjectPool* pool, const TPlanNode& tnode, const DescriptorTbl& descs);
virtual Status Prepare(RuntimeState* state);
virtual Status GetNext(RuntimeState* state, RowBatch* row_batch, bool* eos);
virtual void Close(RuntimeState* state);
protected:
virtual Status InitGetNext(TupleRow* first_left_row);
virtual Status ConstructBuildSide(RuntimeState* state);
private:
boost::scoped_ptr<TPlanNode> thrift_plan_node_;
RuntimeState* runtime_state_;
…
}
create_rtree(…)
pip_join(…)
nearest_join(…)
ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters
http://www-cs.ccny.cuny.edu/~jzhang/papers/isp_gpu_tr.pdf
Scalable and Efficient Spatial Data Management on Multi-Core CPU and GPU Clusters.
IEEE HardBD’15 Workshop
ISP-GPU: Scaling out Geospatial Data Processing to GPU Clusters
ISP-GPU ISP-MC+ GPU-Standalone MC-Standalone
taxi-nycb (s) 96 130 50 89
GBIF-WWF(s) 1822 2816 1498 2664
Taxi-nycb: ~170 million points, ~40 thousand polygons (9 vertices/polygon)
GBF-WWF: ~375 million points, ~15 thousand polygons (279 vertices/polygon)
Single-node results: 16core CPU/128GB, GTX Titan
Cluster results: 2-10 nodes each with 8 vCPU cores/15GB, 1536 CUDA cores/4 GB
(50 million species locations used due to memory constraint)
Outline•Introduction
•Spatial data, GIS, BigData and HPC
•Taxi trip data in NYC and Global Biodiversity Applications
•Spatial query processing on GPUs
•ISP-GPU
•Architecture and Implementations
•Experiment Results
•Alternative Techniques
•SpatialSpark
•Lightweight Distributed Execution (LDE) Engine
•Summary an Future Work
Alternative
Techniques
val sc = new SparkContext(conf)
//reading left side data from HDFS and perform pre-processing
val leftData = sc.textFile(leftFile, numPartitions).map(x => x.split(SEPARATOR)).zipWithIndex()
val leftGeometryById = leftData.map(x => (x._2, Try(new WKTReader().read(x._1.apply(leftGeometryIndex)))))
.filter(_._2.isSuccess).map(x => (x._1, x._2.get))
//similarly for right-side data….
//ready for spatial query (broadcast-based)
val joinPredicate =SpatialOperator.Within // NearestD can be applied similarly
var matchedPairs:RDD[(Long, Long)] = BroadcastSpatialJoin(sc, leftGeometryById, rightGeometryById, joinPredicate)
SpatialSpark: Just Open-Sourced
http://simin.me/projects/spatialspark/
http://www-cs.ccny.cuny.edu/~jzhang/papers/spatial_cc_tr.pdf
Large-Scale Spatial Join Query Processing in Cloud (Comparison with ISP-MC)
IEEE CloudDM’15 Workshop
Alternative
Techniques
Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processinghttp://www-cs.engr.ccny.cuny.edu/~jzhang/papers/lde_spatial_tr.pdf
Spatial Data Processing and IoT
http://www.sensor-networks.org/
• Cell-phone based sensing and querying 3D world (personal navigation)
• Crowd-sourcing 3D urban infrastructure/traffic monitoring using
RGB-D videos
• Emergency response
and disaster relief
• Building
Information
System and
energy control
Summary and Future Work
• Designs and implementations of an in-memory
spatial data management system on multi-core
CPU and many-core GPU clusters by
extending Cloudera Impala for distributed
spatial join query processing
• Experiments on the initial implementations
have revealed both advantages and
disadvantages of extending a tightly-coupled
big data system to support spatial data types
and their operations.
• Alternative techniques are being developed to
further improve efficiency, scalability,
extensibility and portability.