BUILT FOR THE SPEED OF BUSINESS
© Copyright 2013 Pivotal. All rights reserved.
Pivotal MPP Databases and In-Database Analytics
Shengwen Yang
2013-12-08
Outline
About Pivotal
Pivotal Greenplum Database
The Crown Jewels of Greenplum (HAWQ)
In-Database Analytics (MADlib)
About Pivotal
A new company spun off from EMC and VMware (http://gopivotal.com)
– Established on April 1, 2013
– Investors: EMC, VMware, and GE
– Mission: Build a new platform for a new era
BIG DATA | PaaS | AGILE | OSS
What Matters: Apps. Data. Analytics.
• Apps power businesses, and those apps generate data
• Analytic insights from that data drive new app functionality, which in turn drives new data
• The faster you can move around that cycle, the faster you learn, innovate, and pull away from the competition
Cycle: APPS → DATA → ANALYTICS → APPS
Pivotal One
Data Fabric
– Scale-out storage: HDFS/Object
– Ingest & Query: very high-capacity & in-memory
– Analytics
Application Fabric
– Languages & Frameworks
– Services
Cloud Fabric
– Cloud Abstraction (portability)
– Automation: App Provisioning & Life-cycle
– Service Registry
vFabric GemFire
Pivotal GemFire
• Distributed in-memory data grid
• Designed for dynamic and linear scalability, high performance, and database-like persistence
• Enables real-time application processing and data ingest
Pivotal HD
• World's first true SQL processing for enterprise-ready Hadoop
• 100% Apache Hadoop-based platform
• Virtualization and cloud ready with VMware and Isilon
• Scale tested on the 1,000-node Pivotal Analytics Workbench
• Available as a software-only or appliance-based solution
Pivotal Greenplum Database
• Highly scalable, shared-nothing, MPP analytic database (data warehouse)
• Platform for advanced analytics on any (and all) data
• High-speed processing of highly flexible advanced algorithm libraries
HAWQ - The Crown Jewels of Greenplum
SQL compliant
World-class parallel query optimizer
Interactive query
Horizontal scalability
Robust data management
Common Hadoop formats
Deep analytics
MPP: Performance Through Parallelism
SQL
• Master Servers & Name Nodes: query planning & dispatch
• Network Interconnect: dynamic pipelining
• Segment Servers & Data Nodes: query processing & data storage
• External Sources: loading, streaming, etc.
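The division of labor on this slide (a master that plans and dispatches, segments that store and process their own share of the data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NUM_SEGMENTS`, `segment_for`, `segment_scan`, and `master_query` are hypothetical names, hash distribution by a key is assumed, and threads stand in for separate segment hosts. It is not Greenplum or HAWQ code.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key: str) -> int:
    """Hash-distribute a row to a segment by its distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Loading step: each row is stored on exactly one segment."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for region, amount in rows:
        segments[segment_for(region)].append((region, amount))
    return segments


def segment_scan(local_rows):
    """Work done on one segment: partial SUM and COUNT per group."""
    partial = defaultdict(lambda: [0.0, 0])
    for region, amount in local_rows:
        partial[region][0] += amount
        partial[region][1] += 1
    return dict(partial)


def master_query(segments):
    """Master: dispatch the scan to every segment, then merge the partials."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_scan, segments))
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for region, (s, c) in partial.items():
            totals[region][0] += s
            totals[region][1] += c
    return {
        region: {"sum": s, "count": c, "avg": s / c}
        for region, (s, c) in totals.items()
    }


# Roughly: SELECT region, SUM(amount), COUNT(*), AVG(amount) ... GROUP BY region
rows = [("east", 10.0), ("west", 20.0), ("east", 30.0),
        ("north", 5.0), ("west", 40.0)]
segments = distribute(rows)
result = master_query(segments)
```

The point of the sketch is that each segment touches only its local rows, so the scan cost shrinks as segments are added, while the master handles only small partial results.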
HAWQ Benchmarks
Workload             HAWQ    Compared system    Speedup
User intelligence     4.2      198                47X
Sales analysis        8.7      161                19X
Click analysis        2.0      415               208X
Data exploration      2.7    1,285               476X
BI drill down         2.8    1,815               648X
HAWQ Benchmarks
Workload             HAWQ    Compared system    Speedup
User intelligence     4.2       37                 9X
Sales analysis        8.7      596                69X
Click analysis        2.0       50                25X
Data exploration      2.7       55                20X
BI drill down         2.8       59                21X
In-Database Analytics
Data Access & Query Layer: SQL, ODBC, JDBC
In-Database Analytics (in Greenplum Database)
– Partner: SAS/HPA (High Performance Analytics), SAS Scoring Accelerator, SAS Access, SAS Grid
– Open-Source: MADlib open-source analytical algorithms
– Customized: customized MADlib
– User-written: user-written analytical algorithms
– Embedded: Greenplum DB embedded analytics, Greenplum Spatial, Greenplum Text
MADlib: In-Database Analytic Library
• Scalable in-database analytic library
  – Parallel implementations of mathematical, statistical, and machine-learning methods for structured and unstructured data
• Open source
  – To enable extensibility and growth
• Fully parallelized
  – Algorithms are designed to leverage the MPP architecture
  – Algorithms scale as your data set scales
• Collaboration between industry and academia
• Diverse user interfaces
  – SQL, Python, R, and GUI
http://madlib.net/
MADlib Functions
Generalized Linear Models
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Elastic Net Regularization
– Cox-Proportional Hazards Regression
– Robust Variance
– Clustered Variance
– Marginal Effects
Cross Validation
Association Rules
– Apriori
Clustering
– K-Means
Topic Modeling
– LDA
Time Series Analysis
– ARIMA
Linear Systems
– Sparse/Dense
Matrix Factorization
– Low-rank Matrix Factorization
– SVD
Dimensionality Reduction
– PCA
Descriptive Statistics
– Summary
– Pearson's Correlation
Hypothesis Tests
Cardinality Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)
Support Modules
– Array Operations
– Sparse Vectors
– Probability Functions
– Compatibility
Early Stage Development
– Naive Bayes Classification, Decision Tree, Random Forest, SVM, CRF, Linear Algebra Operations, …
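To give a feel for why a sketch such as CountMin suits a parallel database, here is a minimal Count-Min sketch in Python. It is a generic textbook version, not MADlib's implementation: the class name, the SHA-256 based hashing, and the default `width` and `depth` are all assumptions made for illustration. The useful property is that two sketches of the same shape merge by element-wise addition, so each segment can summarize its own rows independently.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=272, depth=5):
        # width=272, depth=5 correspond to roughly 1% error at 99% confidence
        # under the usual sizing rules (ceil(e/eps), ceil(ln(1/delta))).
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        """Combine a sketch built elsewhere (e.g. on another segment)."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
        return self


# Two hypothetical segments summarize their local values, then merge.
sketch_a = CountMinSketch()
for value in ["apple", "apple", "apple", "pear"]:
    sketch_a.add(value)

sketch_b = CountMinSketch()
for value in ["apple", "apple", "plum", "plum", "plum", "plum"]:
    sketch_b.add(value)

merged = sketch_a.merge(sketch_b)
apple_estimate = merged.estimate("apple")
```

The estimate for an item is always at least its true count and can be higher when hash collisions occur; a wider table makes collisions rarer.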
MADlib SQL Interface
The full interface supports analysis of heteroskedasticity of the linear fit.
SELECT madlib.linregr_train (
    'source_table',               -- name of input table, VARCHAR
    'out_table',                  -- name of output table, VARCHAR
    'dependent_varname',          -- dependent variable, VARCHAR
    'independent_varname',        -- independent variable, VARCHAR
    [group_cols,                  -- columns to group by, VARCHAR[].
                                  -- Default value: NULL
    [heteroskedasticity_option]]  -- whether to analyze
                                  -- heteroskedasticity (Breusch-Pagan), BOOLEAN.
                                  -- Default value: FALSE
);
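A linear fit parallelizes well because it depends only on a handful of sums that can be accumulated per segment and then added together. The Python sketch below illustrates that idea for a single independent variable, using a transition/merge/final structure in the style of a user-defined aggregate. It is an assumption about the general pattern, not code from MADlib; `LinregrState`, `transition`, `merge`, and `final` are hypothetical names.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class LinregrState:
    """Sufficient statistics for y = intercept + slope * x."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0


def transition(state, x, y):
    """Fold one row into a segment-local state."""
    state.n += 1
    state.sum_x += x
    state.sum_y += y
    state.sum_xx += x * x
    state.sum_xy += x * y
    return state


def merge(a, b):
    """Combine the states of two segments."""
    return LinregrState(
        a.n + b.n,
        a.sum_x + b.sum_x,
        a.sum_y + b.sum_y,
        a.sum_xx + b.sum_xx,
        a.sum_xy + b.sum_xy,
    )


def final(state):
    """Solve the normal equations from the merged sums."""
    denom = state.n * state.sum_xx - state.sum_x ** 2
    if state.n < 2 or denom == 0:
        raise ValueError("need at least two rows with distinct x values")
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denom
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return intercept, slope


# Rows of (x, y) spread over three hypothetical segments; here y = 2x + 1.
segments = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0)],
    [(4.0, 9.0), (5.0, 11.0)],
]

partials = []
for rows in segments:
    state = LinregrState()
    for x, y in rows:
        transition(state, x, y)
    partials.append(state)

intercept, slope = final(reduce(merge, partials))
```

Only five numbers per segment travel to the final step, regardless of how many rows each segment holds.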
PivotalR – MADlib R Interface
R Client (no data here)
– PivotalR translates R → SQL
– Connects through RPostgreSQL/ODBC/DBI
– Sends the SQL to execute
Database (all data is here)
– Returns the computation result
Minimal data transfer: SQL strings, data for preview
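The client/database split described here can be imitated in Python to show the principle: the client object holds only a table name and a filter expression, builds SQL text, and pulls back just the computed result or a short preview. This is a loose analogy, not PivotalR's API; `TableProxy` and its methods are hypothetical, an in-memory SQLite database stands in for Greenplum, and the SQL is assembled from trusted strings purely for illustration.

```python
import sqlite3


class TableProxy:
    """Client-side handle to a database table; holds no row data."""

    def __init__(self, conn, table, where=None):
        self.conn = conn
        self.table = table
        self.where = where

    def filter(self, condition):
        """Return a new proxy; only the SQL text changes, nothing runs yet."""
        if self.where is None:
            combined = condition
        else:
            combined = f"({self.where}) AND ({condition})"
        return TableProxy(self.conn, self.table, combined)

    def to_sql(self, select_list, limit=None):
        sql = f"SELECT {select_list} FROM {self.table}"
        if self.where:
            sql += f" WHERE {self.where}"
        if limit is not None:
            sql += f" LIMIT {int(limit)}"
        return sql

    def mean(self, column):
        """The database computes the average; one number comes back."""
        return self.conn.execute(self.to_sql(f"AVG({column})")).fetchone()[0]

    def preview(self, n=3):
        """Fetch only a few rows for inspection."""
        return self.conn.execute(self.to_sql("*", limit=n)).fetchall()


# Stand-in database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (size REAL, price REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(50.0, 100.0), (80.0, 180.0), (120.0, 260.0), (200.0, 500.0)],
)

large = TableProxy(conn, "houses").filter("size > 60")
generated_sql = large.to_sql("AVG(price)")
avg_price = large.mean("price")
sample = large.preview(2)
```

Chaining `filter` calls costs nothing on the client, since work happens in the database only when a result such as `mean` or `preview` is requested.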