accelerating geospatial data analytics with pivotal ... · pdf filempp shared nothing...
TRANSCRIPT
1 Copyright © 2014 Pivotal Software, Inc. All rights reserved. 1 Copyright © 2014 Pivotal Software, Inc. All rights reserved.
Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database Kuien Liu Pivotal, Inc.
FOSS4G Seoul 2015
2 Copyright © 2014 Pivotal Software, Inc. All rights reserved.
Warm-up: GeoSpatial on Hadoop vs. P-RDB
� Now
NoSQL Data Systeme.g., Hbase, MangoDB
Dimension Translatore.g., GeoHash, Hilbert Filling
Distributed File Systeme.g., GFS, HDFS
GeoSpatial Query
Range Query
Segment
Network Interconnect
GeoSpatial
SegmentGeoSpatial
SegmentGeoSpatial
MasterGeoSpatial
StandbyGeoSpatial
GeoSpatial Query
(a) GeoSpatial query on NoSQL (b) GeoSpatial query on MPP
We do NOT have to reinvent the wheel !
3 © Copyright 2014 Pivotal. All rights reserved.
Agenda
� Greenplum Database Overview – Architecture & Features – GeoSpatial supporting
� Big Data Analytics based on GPDB – Big Data Suite – Open Source Plan
� Data Science Case Studies
� Q&A
4 © Copyright 2014 Pivotal. All rights reserved.
Greenplum Database
� Fully ACID Relational Database built for Big Structured Data
� SQL Standard Compliant
� Cluster based system running on “Commodity” hardware on top of the Linux OS
� Available as an EMC appliance
� 10+ years of R&D investment
� Enterprise product with 1000+ install base
5 © Copyright 2014 Pivotal. All rights reserved.
POLYMORPHIC STORAGE
HEAP, Append Only, Columnar, External,
Compression
MULTI-VERSION CONCURRENCY
CONTROL (MVCC)
Feature Set Greenplum DB SY
STEM
A
CC
ESS
DA
TA
PRO
CES
SIN
G
DA
TA
STO
RA
GE
CLIENT ACCESS PSQL, ODBC, JDBC
BULK LOAD/UNLOAD GPLoad, GPFdist,
External Tables, GPHDFS
ADMIN TOOLS GP Command Center,
GP Perfmon, GP Support
3rd PARTY TOOLS Compatible with Industry Standard BI & ETL Tools
SQL STANDARD COMPLIANCE
MASSIVELY PARALLEL PROCESSING (MPP)
IN-DATABASE PROGRAMMING
LANGUAGES PL/pgSQL, PL/Python, PL/R, PL/Perl, PL/Java,
PL/C
IN-DATABASE ANALYTICS & EXTENSIONS
MADLib, PostGIS, PGCrypto
FULLY ACID COMPLIANT
TRANSACTIONAL DATABASE
INDEXES
B-Tree, Bitmap, GiST
6 © Copyright 2014 Pivotal. All rights reserved.
MPP Shared Nothing Architecture
Standby Master
Segment Host with one or more Segment Instances Segment Instances process queries in parallel
Flexible framework for processing large datasets
High speed interconnect for continuous pipelining of data processing …
Master Host
SQL Master Host and Standby Master Host Master coordinates work with Segment Hosts
Interconnect
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
Segment Hosts have their own CPU, disk and memory (shared nothing)
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
node1
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
node2
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
node3
Segment Host Segment Instance Segment Instance Segment Instance Segment Instance
nodeN
7 © Copyright 2014 Pivotal. All rights reserved.
Query Execution Example
Segment 1
Slice 3
MotionRedist(b.name)
ScanBarsb
Filterb.city = 'San Francisco'
Segment 1
Slice 2 MotionGather
MotionRedist(b.name)
HashJoinb.name = s.bar
ScanSellss
Projects.beer, s.price
Master Slice 1 MotionGather
Segment n
Slice 3
MotionRedist(b.name)
ScanBarsb
Filterb.city = 'San Francisco'
Segment n
Slice 2 MotionGather
MotionRedist(b.name)
HashJoinb.name = s.bar
ScanSellss
Projects.beer, s.price
…
SELECT s.beer, s.price FROM Bars b, Sells s WHERE b.name = s.bar AND b.city = ‘Seoul’
– Bars is distributed randomly – Sells is distributed by bar
8 © Copyright 2014 Pivotal. All rights reserved.
md
card
memo
provider
provider
cost xforms
stats
search
gpos
memory
unittests
Parallel Query Optimizer � Cost based optimizer designed for big data � Allows for rapid development iteration � Based on Cascades Framework for Optimizers � Breaking the optimizer into multiple components
– components can be replaced and configured separately
� Algebraic model – Operators input properties determine output
properties – Separating representation from optimization
� Separate component integrated into GPDB
9 © Copyright 2014 Pivotal. All rights reserved.
Loading: Massively-Parallel Ingest
• Fast Parallel Load & Unload – No Master Node bottleneck – 10+ TB/Hour per Rack – Linear scalability
• Low Latency – Data immediately available – No intermediate stores – No data “reorganization”
• Load/Unload To & From: – File Systems – Any ETL Product – Hadoop Distributions
Extreme speed and, immediate usability from files, ETL & Hadoop
External Sources
Loading, streaming, etc.
gNet Network Interconnect
... ...
... ... Master Servers
Query planning & dispatch
Segment Servers
Query processing & data storage
SQL
ETL File
Systems
10 © Copyright 2014 Pivotal. All rights reserved.
Polymorphic Storage™
� Columnar storage compresses better � Optimized for retrieving a subset of the
columns when querying � Compression can be set differently per
column: gzip (1-9), quicklz, delta, RLE
� Row oriented faster when returning all columns
� HEAP for many updates and deletes � Append-Optimized for insert mostly � Use bitmap indexes on column or
row tables to accelerate scans. � Use B-Tree indexes on column or
row for drill through queries
TABLE ‘SALES’ Jun
Column-oriented Row-oriented
Oct Year -1
Year -2
External HDFS � Less accessed partitions
on HDFS with external partitions to seamlessly query all data
� Text, CSV, Binary, Avro, Parquet format
� All major HDP Distros
Nov Dec Jul Aug Sep
User Definable Storage Layout
11 © Copyright 2014 Pivotal. All rights reserved.
Morgan Stanley deployed multiple GPDB clusters for big data use cases including Risk Management, Sales & Trading
12 © Copyright 2014 Pivotal. All rights reserved.
Focus Areas
Open Source Cloud Enablement
Database Engine Backup And Disaster Recovery
Analytics
13 © Copyright 2014 Pivotal. All rights reserved.
Reference � Papers
– MAD Skills: New Analysis Practices for Big Data (VLDB 2009) – Dynamic prioritization of database queries (ICDE 2011) – Online Expansion of Large-scale Data Warehouses (VLDB 2011) – The MADlib analytics library: or MAD skills, the SQL (VLDB 2012) – HAWQ: a massively parallel processing SQL engine in hadoop (SIGMOD 2014) – Orca: A Modular Query Optimizer Architecture for Big Data (SIGMOD 2014) – Optimizing queries over partitioned tables in MPP system (SIGMOD 2014) – The data revolution: how companies are transforming with big data (SIGIR 2014) – Unveiling clusters of events for alert and incident management in large-scale
enterprise it (SIGKDD 2014)
� Patents
14 © Copyright 2014 Pivotal. All rights reserved.
Agenda
� Greenplum Database Overview – Architecture & Features – GeoSpatial supporting
� Big Data Analytics on GPDB – Big Data Suite – Open Source Plan
� Data Science Case Studies
� Q&A
15 © Copyright 2014 Pivotal. All rights reserved.
Pivotal Big Data Suite � With it to build the foundational blocks of big data analytic.
– Pivotal Greenplum Database – Pivotal GemFire – Pivotal HD – Pivotal HAWQ – OSS Support for Spring XD – OSS Support for Redis – OSS Support for RabbitMQ – Redis for Pivotal Cloud Foundry – RabbitMQ for Pivotal Cloud Foundry – Pivotal HD for Pivotal Cloud Foundry * – Pivotal HAWQ For Pivotal Cloud
Foundry * – *(Requires Pivotal Cloud Foundry v1.3)
16 © Copyright 2014 Pivotal. All rights reserved.
Greenplum Database: Analytical Maturity
Shared-Nothing MPP Parallel Query Optimizer
Polymorphic Data Storage™
Parallel Dataflow Engine gNet™ Software Interconnect
Scatter/Gather Streaming™ Data Loading
3rd PARTY TOOLS BI Tools, ETL Tools
Data Mining, etc
ADMIN TOOLS Greenplum Command Center
Greenplum Package Manager
CLIENT ACCESS ODBC, JDBC, OLEDB,
MapReduce, etc.
PRODUCT FEATURES
CLIENT ACCESS & TOOLS
CORE MPP ARCHITECTURE
GREENPLUM DB SERVICES
STORAGE & DATA ACCESS
Hybrid Storage & Execution (Row- & Column-Oriented) In-Database Compression
Multi-Level Partitioning Indexes – Btree, Bitmap, etc.
External Table Support
LANGUAGE SUPPORT Comprehensive SQL Native MapReduce
SQL 2003 OLAP Extensions Programmable Analytics
Analytics Extensions (GeoSpatial, PR/R, PL/Java,
PL/Python, PL/Perl)
LOADING & EXT. ACCESS Petabyte-Scale Loading Trickle Micro-Batching Anywhere Data Access
Multi-Level Fault Tolerance Online System Expansion Workload Management
17 © Copyright 2014 Pivotal. All rights reserved.
Example: Data-driven, Empirical Methods for Problem Solving
� Data Science
� Statistics
� Machine Learning
� Artificial Intelligence
� Econometrics
� ….
• For example: – Linear regression
• linregr() in MADlib, lm() in R
– Logistic Regression • logregr() in MADlib, glm()
in R
– K-Means Clustering • kmeanspp( ) in MADlib,
kmeans() in R
18 © Copyright 2014 Pivotal. All rights reserved. 18 © Copyright 2013 Pivotal. All rights reserved. 18 © Copyright 2014 Pivotal. All rights reserved.
Online
1500
Open Source Meetup, 12 Sept. 2015, Beijing China
19 © Copyright 2014 Pivotal. All rights reserved.
Agenda
� Greenplum Database Overview – Architecture – Features
� Big Data Analytics on Pivotal – Big Data Suite – Open Source Plan
� Data Science Case Studies
� Q&A
20 © Copyright 2014 Pivotal. All rights reserved.
Predicting Delivery Duration of International Shipments
Customer Major postal service provider Business Problem Addressing the predictability of international shipments Challenges • Many incomplete data rows (information not
always available when shipment is outside US) • Large volume of data spread across several
tables and identifying trends, gaps, anomalies, and insights was a challenge
• Solution • Built a regression model to predict delivery
duration of international items
• The MPP architecture of GPDB enabled rapid discovery of trends, anomalies, and actionable insights from their data
• We demonstrated to the customer that their data had great predictive power
21 © Copyright 2014 Pivotal. All rights reserved.
Customer A major semiconductor company Business Problem Given a set of wafer bin map images, identify random die failures, outlying wafer bin maps and clusters of wafer bin maps • Challenges Deriving features from the wafer bin map images and feature reduct ion for clustering, outlier detection etc.
Solution § Denoising techniques applied to wafer bin
map images for noise reduction § Features extracted from wafer bin map
images. Feature space is reduced for clustering and outlier detection.
Wafer Bin Map Image Analytics
22 © Copyright 2014 Pivotal. All rights reserved.
Route Optimization Customer
A major courier delivery services company
Business Problem
Optimizing routing decisions while meeting the demand and satisfying the many business constraints to guarantee feasibility and compliance.
Challenges
• Routing problems are known to be NP-Hard
• Size of the operation. Delivery of 3 million packages a day with the largest fleet in the US
• Existing solution takes weeks to roll out monthly routing plans
Solution
• Avoided expensive data movements by addressing demand forecasting and route optimization in the database
• Built a fully parallelized approximation algorithm that featured a variation of Floyd Warshall all pairs shortest paths and neighborhood searches
• Achieved significant reduction in fuel consumption over a greedy initial feasible solution
A NEW PLATFORM FOR A NEW ERA
Additional Line 18 Point Verdana