accelerating geospatial data analytics with pivotal ... · pdf filempp shared nothing...

24
1 Copyright © 2014 Pivotal Software, Inc. All rights reserved. Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database Kuien Liu Pivotal, Inc. FOSS4G Seoul 2015

Upload: dotuyen

Post on 17-Feb-2018

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

1 Copyright © 2014 Pivotal Software, Inc. All rights reserved. 1 Copyright © 2014 Pivotal Software, Inc. All rights reserved.

Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database Kuien Liu Pivotal, Inc.

FOSS4G Seoul 2015

Page 2: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

2 Copyright © 2014 Pivotal Software, Inc. All rights reserved.

Warm-up: GeoSpatial on Hadoop vs. P-RDB

� Now

NoSQL Data Systeme.g., Hbase, MangoDB

Dimension Translatore.g., GeoHash, Hilbert Filling

Distributed File Systeme.g., GFS, HDFS

GeoSpatial Query

Range Query

Segment

Network Interconnect

GeoSpatial

SegmentGeoSpatial

SegmentGeoSpatial

MasterGeoSpatial

StandbyGeoSpatial

GeoSpatial Query

(a) GeoSpatial query on NoSQL (b) GeoSpatial query on MPP

We do NOT have to reinvent the wheel !

Page 3: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

3 © Copyright 2014 Pivotal. All rights reserved.

Agenda

� Greenplum Database Overview –  Architecture & Features –  GeoSpatial supporting

� Big Data Analytics based on GPDB –  Big Data Suite –  Open Source Plan

� Data Science Case Studies

� Q&A

Page 4: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

4 © Copyright 2014 Pivotal. All rights reserved.

Greenplum Database

�  Fully ACID Relational Database built for Big Structured Data

�  SQL Standard Compliant

�  Cluster based system running on “Commodity” hardware on top of the Linux OS

�  Available as an EMC appliance

�  10+ years of R&D investment

�  Enterprise product with 1000+ install base

Page 5: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

5 © Copyright 2014 Pivotal. All rights reserved.

POLYMORPHIC STORAGE

HEAP, Append Only, Columnar, External,

Compression

MULTI-VERSION CONCURRENCY

CONTROL (MVCC)

Feature Set Greenplum DB SY

STEM

A

CC

ESS

DA

TA

PRO

CES

SIN

G

DA

TA

STO

RA

GE

CLIENT ACCESS PSQL, ODBC, JDBC

BULK LOAD/UNLOAD GPLoad, GPFdist,

External Tables, GPHDFS

ADMIN TOOLS GP Command Center,

GP Perfmon, GP Support

3rd PARTY TOOLS Compatible with Industry Standard BI & ETL Tools

SQL STANDARD COMPLIANCE

MASSIVELY PARALLEL PROCESSING (MPP)

IN-DATABASE PROGRAMMING

LANGUAGES PL/pgSQL, PL/Python, PL/R, PL/Perl, PL/Java,

PL/C

IN-DATABASE ANALYTICS & EXTENSIONS

MADLib, PostGIS, PGCrypto

FULLY ACID COMPLIANT

TRANSACTIONAL DATABASE

INDEXES

B-Tree, Bitmap, GiST

Page 6: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

6 © Copyright 2014 Pivotal. All rights reserved.

MPP Shared Nothing Architecture

Standby Master

Segment Host with one or more Segment Instances Segment Instances process queries in parallel

Flexible framework for processing large datasets

High speed interconnect for continuous pipelining of data processing …

Master Host

SQL Master Host and Standby Master Host Master coordinates work with Segment Hosts

Interconnect

Segment Host Segment Instance Segment Instance Segment Instance Segment Instance

Segment Hosts have their own CPU, disk and memory (shared nothing)

Segment Host Segment Instance Segment Instance Segment Instance Segment Instance

node1

Segment Host Segment Instance Segment Instance Segment Instance Segment Instance

node2

Segment Host Segment Instance Segment Instance Segment Instance Segment Instance

node3

Segment Host Segment Instance Segment Instance Segment Instance Segment Instance

nodeN

Page 7: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

7 © Copyright 2014 Pivotal. All rights reserved.

Query Execution Example

Segment 1

Slice 3

MotionRedist(b.name)

ScanBarsb

Filterb.city = 'San Francisco'

Segment 1

Slice 2 MotionGather

MotionRedist(b.name)

HashJoinb.name = s.bar

ScanSellss

Projects.beer, s.price

Master Slice 1 MotionGather

Segment n

Slice 3

MotionRedist(b.name)

ScanBarsb

Filterb.city = 'San Francisco'

Segment n

Slice 2 MotionGather

MotionRedist(b.name)

HashJoinb.name = s.bar

ScanSellss

Projects.beer, s.price

SELECT s.beer, s.price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = ‘Seoul’

–  Bars is distributed randomly –  Sells is distributed by bar

Page 8: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

8 © Copyright 2014 Pivotal. All rights reserved.

md

card

memo

provider

provider

cost xforms

stats

search

gpos

memory

unittests

Parallel Query Optimizer �  Cost based optimizer designed for big data �  Allows for rapid development iteration �  Based on Cascades Framework for Optimizers �  Breaking the optimizer into multiple components

–  components can be replaced and configured separately

�  Algebraic model –  Operators input properties determine output

properties –  Separating representation from optimization

�  Separate component integrated into GPDB

Page 9: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

9 © Copyright 2014 Pivotal. All rights reserved.

Loading: Massively-Parallel Ingest

•  Fast Parallel Load & Unload –  No Master Node bottleneck –  10+ TB/Hour per Rack –  Linear scalability

•  Low Latency –  Data immediately available –  No intermediate stores –  No data “reorganization”

•  Load/Unload To & From: –  File Systems –  Any ETL Product –  Hadoop Distributions

Extreme speed and, immediate usability from files, ETL & Hadoop

External Sources

Loading, streaming, etc.

gNet Network Interconnect

... ...

... ... Master Servers

Query planning & dispatch

Segment Servers

Query processing & data storage

SQL

ETL File

Systems

Page 10: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

10 © Copyright 2014 Pivotal. All rights reserved.

Polymorphic Storage™

�  Columnar storage compresses better �  Optimized for retrieving a subset of the

columns when querying �  Compression can be set differently per

column: gzip (1-9), quicklz, delta, RLE

�  Row oriented faster when returning all columns

�  HEAP for many updates and deletes �  Append-Optimized for insert mostly �  Use bitmap indexes on column or

row tables to accelerate scans. �  Use B-Tree indexes on column or

row for drill through queries

TABLE ‘SALES’ Jun

Column-oriented Row-oriented

Oct Year -1

Year -2

External HDFS �  Less accessed partitions

on HDFS with external partitions to seamlessly query all data

�  Text, CSV, Binary, Avro, Parquet format

�  All major HDP Distros

Nov Dec Jul Aug Sep

User Definable Storage Layout

Page 11: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

11 © Copyright 2014 Pivotal. All rights reserved.

Morgan Stanley deployed multiple GPDB clusters for big data use cases including Risk Management, Sales & Trading

Page 12: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

12 © Copyright 2014 Pivotal. All rights reserved.

Focus Areas

Open Source Cloud Enablement

Database Engine Backup And Disaster Recovery

Analytics

Page 13: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

13 © Copyright 2014 Pivotal. All rights reserved.

Reference � Papers

–  MAD Skills: New Analysis Practices for Big Data (VLDB 2009) –  Dynamic prioritization of database queries (ICDE 2011) –  Online Expansion of Large-scale Data Warehouses (VLDB 2011) –  The MADlib analytics library: or MAD skills, the SQL (VLDB 2012) –  HAWQ: a massively parallel processing SQL engine in hadoop (SIGMOD 2014) –  Orca: A Modular Query Optimizer Architecture for Big Data (SIGMOD 2014) –  Optimizing queries over partitioned tables in MPP system (SIGMOD 2014) –  The data revolution: how companies are transforming with big data (SIGIR 2014) –  Unveiling clusters of events for alert and incident management in large-scale

enterprise it (SIGKDD 2014)

� Patents

Page 14: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

14 © Copyright 2014 Pivotal. All rights reserved.

Agenda

� Greenplum Database Overview –  Architecture & Features –  GeoSpatial supporting

� Big Data Analytics on GPDB –  Big Data Suite –  Open Source Plan

� Data Science Case Studies

� Q&A

Page 15: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

15 © Copyright 2014 Pivotal. All rights reserved.

Pivotal Big Data Suite � With it to build the foundational blocks of big data analytic.

–  Pivotal Greenplum Database –  Pivotal GemFire –  Pivotal HD –  Pivotal HAWQ –  OSS Support for Spring XD –  OSS Support for Redis –  OSS Support for RabbitMQ –  Redis for Pivotal Cloud Foundry –  RabbitMQ for Pivotal Cloud Foundry –  Pivotal HD for Pivotal Cloud Foundry * –  Pivotal HAWQ For Pivotal Cloud

Foundry * –  *(Requires Pivotal Cloud Foundry v1.3)

Page 16: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

16 © Copyright 2014 Pivotal. All rights reserved.

Greenplum Database: Analytical Maturity

Shared-Nothing MPP Parallel Query Optimizer

Polymorphic Data Storage™

Parallel Dataflow Engine gNet™ Software Interconnect

Scatter/Gather Streaming™ Data Loading

3rd PARTY TOOLS BI Tools, ETL Tools

Data Mining, etc

ADMIN TOOLS Greenplum Command Center

Greenplum Package Manager

CLIENT ACCESS ODBC, JDBC, OLEDB,

MapReduce, etc.

PRODUCT FEATURES

CLIENT ACCESS & TOOLS

CORE MPP ARCHITECTURE

GREENPLUM DB SERVICES

STORAGE & DATA ACCESS

Hybrid Storage & Execution (Row- & Column-Oriented) In-Database Compression

Multi-Level Partitioning Indexes – Btree, Bitmap, etc.

External Table Support

LANGUAGE SUPPORT Comprehensive SQL Native MapReduce

SQL 2003 OLAP Extensions Programmable Analytics

Analytics Extensions (GeoSpatial, PR/R, PL/Java,

PL/Python, PL/Perl)

LOADING & EXT. ACCESS Petabyte-Scale Loading Trickle Micro-Batching Anywhere Data Access

Multi-Level Fault Tolerance Online System Expansion Workload Management

Page 17: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

17 © Copyright 2014 Pivotal. All rights reserved.

Example: Data-driven, Empirical Methods for Problem Solving

� Data Science

� Statistics

� Machine Learning

� Artificial Intelligence

� Econometrics

� ….

•  For example: –  Linear regression

•  linregr() in MADlib, lm() in R

–  Logistic Regression •  logregr() in MADlib, glm()

in R

–  K-Means Clustering •  kmeanspp( ) in MADlib,

kmeans() in R

Page 18: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

18 © Copyright 2014 Pivotal. All rights reserved. 18 © Copyright 2013 Pivotal. All rights reserved. 18 © Copyright 2014 Pivotal. All rights reserved.

Online

1500

Open Source Meetup, 12 Sept. 2015, Beijing China

Page 19: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

19 © Copyright 2014 Pivotal. All rights reserved.

Agenda

� Greenplum Database Overview –  Architecture –  Features

� Big Data Analytics on Pivotal –  Big Data Suite –  Open Source Plan

� Data Science Case Studies

� Q&A

Page 20: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

20 © Copyright 2014 Pivotal. All rights reserved.

Predicting Delivery Duration of International Shipments

Customer Major postal service provider Business Problem Addressing the predictability of international shipments Challenges •  Many incomplete data rows (information not

always available when shipment is outside US) •  Large volume of data spread across several

tables and identifying trends, gaps, anomalies, and insights was a challenge

•  Solution •  Built a regression model to predict delivery

duration of international items

•  The MPP architecture of GPDB enabled rapid discovery of trends, anomalies, and actionable insights from their data

•  We demonstrated to the customer that their data had great predictive power

Page 21: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

21 © Copyright 2014 Pivotal. All rights reserved.

Customer A major semiconductor company Business Problem Given a set of wafer bin map images, identify random die failures, outlying wafer bin maps and clusters of wafer bin maps •  Challenges Deriving features from the wafer bin map images and feature reduct ion for clustering, outlier detection etc.

Solution §  Denoising techniques applied to wafer bin

map images for noise reduction §  Features extracted from wafer bin map

images. Feature space is reduced for clustering and outlier detection.

Wafer Bin Map Image Analytics

Page 22: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

22 © Copyright 2014 Pivotal. All rights reserved.

Route Optimization Customer

A major courier delivery services company

Business Problem

Optimizing routing decisions while meeting the demand and satisfying the many business constraints to guarantee feasibility and compliance.

Challenges

•  Routing problems are known to be NP-Hard

•  Size of the operation. Delivery of 3 million packages a day with the largest fleet in the US

•  Existing solution takes weeks to roll out monthly routing plans

Solution

•  Avoided expensive data movements by addressing demand forecasting and route optimization in the database

•  Built a fully parallelized approximation algorithm that featured a variation of Floyd Warshall all pairs shortest paths and neighborhood searches

•  Achieved significant reduction in fuel consumption over a greedy initial feasible solution

Page 23: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

23 © Copyright 2014 Pivotal. All rights reserved.

Questions?

Email: Kuien Liu <[email protected]>

Page 24: Accelerating GeoSpatial Data Analytics With Pivotal ... · PDF fileMPP Shared Nothing Architecture Standby ... a massively parallel processing SQL engine in hadoop (SIGMOD 2014)

A NEW PLATFORM FOR A NEW ERA

Additional Line 18 Point Verdana