BUILT FOR THE SPEED OF BUSINESS

idke.ruc.edu.cn/ccfbigdata/slides/ysw.pdf

© Copyright 2013 Pivotal. All rights reserved.

Pivotal MPP Databases and In-Database Analytics

Shengwen Yang

2013-12-08

Outline

About Pivotal

Pivotal Greenplum Database

The Crown Jewels of Greenplum (HAWQ)

In-Database Analytics (MADlib)

About Pivotal

A new company spun off from EMC & VMware (http://gopivotal.com)

– Established on Apr 1, 2013

– Investors: EMC, VMware, & GE

– Mission: Build a new platform for a new era

BIG DATA, PaaS, AGILE, OSS

What Matters: Apps. Data. Analytics.

Apps power businesses, and those apps generate data.

Analytic insights from that data drive new app functionality, which in turn drives new data.

The faster you can move around that cycle, the faster you learn, innovate & pull away from the competition.

[Figure: cycle linking APPS, DATA, and ANALYTICS]

Pivotal One

[Architecture diagram; recoverable labels:]

Fabrics: Data Fabric, Application Fabric, Cloud Fabric

Scale-out storage: HDFS/Object

Ingest & Query: very high-capacity & in-memory

Analytics

Services

Languages & Frameworks

Cloud Abstraction (portability)

Automation: App Provisioning & Life-cycle

Service Registry

vFabric, GemFire

Pivotal GemFire

Distributed in-memory data grid

Designed for dynamic and linear scalability, high performance, and database-like persistence

Enables real-time application processing and data ingest

Pivotal HD

World's first true SQL processing for enterprise-ready Hadoop

100% Apache Hadoop-based platform

Virtualization and cloud ready with VMware and Isilon

Scale-tested in the 1000-node Pivotal Analytics Workbench

Available as a software-only or appliance-based solution

Pivotal Greenplum Database

Highly scalable, shared-nothing, MPP analytic database (data warehouse)

• Platform for advanced analytics on any (and all) data

• High-speed processing of highly flexible advanced algorithm libraries

HAWQ - The Crown Jewels of Greenplum

SQL compliant

World-class parallel query optimizer

Interactive query

Horizontal scalability

Robust data management

Common Hadoop formats

Deep analytics

MPP: Performance Through Parallelism

[Architecture diagram; recoverable labels:]

SQL

Master Servers & Name Nodes: query planning & dispatch

Network Interconnect, Dynamic Pipelining

Segment Servers & Data Nodes: query processing & data storage

External Sources: loading, streaming, etc.
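The division of labor on this slide (a master that plans and dispatches, segments that process and store) can be illustrated with a toy scatter-gather aggregation. This is only a sketch under stated assumptions: the two-column table, the byte-sum hash, and the function names are hypothetical, and a real MPP database does far more (planning, data motion between segments, pipelining).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (customer_id, amount): a hypothetical two-column table


def distribute(rows: Iterable[Row], num_segments: int) -> List[List[Row]]:
    """Assign each row to one segment by hashing its distribution key."""
    segments: List[List[Row]] = [[] for _ in range(num_segments)]
    for key, amount in rows:
        # Toy stable hash (assumption); Python's built-in str hash is salted per process.
        segment_id = sum(key.encode("utf-8")) % num_segments
        segments[segment_id].append((key, amount))
    return segments


def segment_partial_sum(rows: List[Row]) -> Dict[str, float]:
    """Work done locally on one segment: SUM(amount) GROUP BY customer_id."""
    partial: Dict[str, float] = defaultdict(float)
    for key, amount in rows:
        partial[key] += amount
    return dict(partial)


def master_gather(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Work done on the master: merge the per-segment partial results."""
    total: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


rows = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("carol", 7.0), ("bob", 1.0)]
segments = distribute(rows, num_segments=3)
result = master_gather(segment_partial_sum(seg) for seg in segments)
print(result)
```

Each segment only ever touches its own rows, so adding segments shrinks the work per segment; the master handles nothing but the small partial results.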

HAWQ Benchmarks

Workload            HAWQ   Compared system   Speedup
User intelligence    4.2               198       47X
Sales analysis       8.7               161       19X
Click analysis       2.0               415      208X
Data exploration     2.7             1,285      476X
BI drill down        2.8             1,815      648X
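The multipliers on this slide appear to be the ratio of the two values given for each workload. The following check assumes that the first number is the faster system's time, that the second is the slower system's time, and that the ratio is rounded to a whole number; the slide itself does not state units or the rounding rule.

```python
import math

# (workload, faster time, slower time, multiplier printed on the slide)
benchmarks = [
    ("User intelligence", 4.2, 198, 47),
    ("Sales analysis", 8.7, 161, 19),
    ("Click analysis", 2.0, 415, 208),
    ("Data exploration", 2.7, 1285, 476),
    ("BI drill down", 2.8, 1815, 648),
]


def speedup(fast: float, slow: float) -> int:
    """Ratio of the two times, rounded half-up to a whole number (assumed rule)."""
    return math.floor(slow / fast + 0.5)


computed = {name: speedup(fast, slow) for name, fast, slow, _ in benchmarks}
for name, fast, slow, reported in benchmarks:
    print(f"{name:<18} {slow / fast:8.1f} -> {computed[name]}X (slide: {reported}X)")
```

Under those assumptions the computed values match all five multipliers shown on the slide.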

HAWQ Benchmarks

Workload            HAWQ   Compared system   Speedup
User intelligence    4.2                37        9X
Sales analysis       8.7               596       69X
Click analysis       2.0                50       25X
Data exploration     2.7                55       20X
BI drill down        2.8                59       21X

In-Database Analytics

[Architecture diagram; recoverable labels:]

GREENPLUM DATABASE

Data Access & Query Layer: SQL, ODBC, JDBC

Other labels: SAS Access, SAS Grid

In-Database Analytics, by category:

• Partner: SAS/HPA High Performance Analytics, SAS Scoring Accelerator

• Open-Source: MADlib Open Source Analytical Algorithms

• User-written, Customized: Customized MADlib, User-Written Analytical Algorithms

• Embedded: Greenplum DB Embedded Analytics, Greenplum Spatial, Greenplum Text

MADlib: In-Database Analytic Library

• Scalable in-database analytic library
  – Parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data

• Open-source
  – To enable extensibility and growth

• Fully parallelized
  – Algorithms are designed to leverage MPP architecture
  – Algorithms scale as your data set scales

• Collaboration between industry and academia

• Diverse user interfaces
  – SQL, Python, R, and GUI

http://madlib.net/
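One common way to make an algorithm fit an MPP architecture is to express it as an aggregate with three parts: a transition step that folds rows into a small state on each segment, a merge step that combines states, and a final step that turns the merged state into the answer. The sketch below applies that pattern to simple linear regression. It is an illustration only, not MADlib's code; the class and function names and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class RegrState:
    """Sufficient statistics for simple linear regression y = a + b*x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(state: RegrState, x: float, y: float) -> RegrState:
    """Fold one row into a segment-local state."""
    return RegrState(state.n + 1, state.sx + x, state.sy + y,
                     state.sxx + x * x, state.sxy + x * y)


def merge(a: RegrState, b: RegrState) -> RegrState:
    """Combine two states; plain addition, so merge order does not matter."""
    return RegrState(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                     a.sxx + b.sxx, a.sxy + b.sxy)


def final(state: RegrState) -> Tuple[float, float]:
    """Turn the merged state into (intercept, slope)."""
    denom = state.n * state.sxx - state.sx ** 2
    if denom == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    slope = (state.n * state.sxy - state.sx * state.sy) / denom
    intercept = (state.sy - slope * state.sx) / state.n
    return intercept, slope


def aggregate(rows: Iterable[Tuple[float, float]]) -> RegrState:
    """Run the transition step over all rows held by one segment."""
    state = RegrState()
    for x, y in rows:
        state = transition(state, x, y)
    return state


# Hypothetical data following y = 1 + 2x, split across three "segments".
segments: List[List[Tuple[float, float]]] = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
merged = RegrState()
for seg in segments:
    merged = merge(merged, aggregate(seg))
intercept, slope = final(merged)
print(intercept, slope)
```

The state stays five numbers wide no matter how many rows a segment holds, which is why this style of algorithm scales with the data set.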

MADlib Functions

Generalized Linear Models
– Linear Regression
– Logistic Regression
– Multinomial Logistic Regression
– Elastic Net Regularization
– Cox-Proportional Hazards Regression
– Robust Variance
– Clustered Variance
– Marginal Effects

Cross Validation

Association Rules
– A-priori

Clustering
– K-Means

Topic Modeling
– LDA

Time Series Analysis
– ARIMA

Linear Systems
– Sparse/Dense

Matrix Factorization
– Low-rank Matrix Factorization
– SVD

Dimensionality Reduction
– PCA

Descriptive Statistics
– Summary
– Pearson's Correlation

Hypothesis Tests

Cardinality Estimators
– CountMin (Cormode-Muthukrishnan)
– FM (Flajolet-Martin)
– MFV (Most Frequent Values)

Support Modules
– Array Operations
– Sparse Vectors
– Probability Functions
– Compatibility

Early Stage Development
– Naive Bayes Classification, Decision Tree, Random Forest, SVM, CRF, Linear Algebra Operations, …
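To give a feel for one entry in this list, here is a minimal Count-Min sketch (the Cormode-Muthukrishnan structure named above), which approximates per-item counts in fixed memory and never underestimates. It is a teaching sketch, not MADlib's implementation; the class name, the default width and depth, and the SHA-256-based hashing are assumptions chosen for clarity.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Fixed-size table of counters; estimates are upper bounds on true counts."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting SHA-256 with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other: "CountMinSketch") -> None:
        """Add another sketch of the same shape, e.g. one built on another segment."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


left, right = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a"]:
    left.add(word)
for word in ["a", "c"]:
    right.add(word)
left.merge(right)
print(left.estimate("a"))
```

Because two sketches merge by element-wise addition, each partition of the data can build its own sketch independently, which suits a parallel database.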

MADlib SQL Interface

The full interface supports the analysis of heteroskedasticity of the linear fit.

SELECT madlib.linregr_train (
    'source_table',               -- name of input table, VARCHAR
    'out_table',                  -- name of output table, VARCHAR
    'dependent_varname',          -- dependent variable, VARCHAR
    'independent_varname',        -- independent variable, VARCHAR
    [group_cols,                  -- columns to group by, VARCHAR[].
                                  -- Default value: Null
    [heteroskedasticity_option]]  -- whether to analyze
                                  -- heteroskedasticity (Breusch-Pagan), BOOLEAN.
                                  -- Default value: False
);
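As a usage illustration, the small helper below assembles a call that follows the synopsis above: four required string arguments, then the two optional ones in order. The helper names, the table name `houses`, its columns, and the output table name are all hypothetical; passing the independent variables as an `ARRAY[...]` expression is an assumption about typical usage, and an untyped `NULL` for the grouping argument may need an explicit cast on a real system.

```python
from typing import Optional


def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal (single quotes doubled)."""
    return "'" + text.replace("'", "''") + "'"


def linregr_train_sql(
    source_table: str,
    out_table: str,
    dependent_varname: str,
    independent_varname: str,
    group_cols_sql: Optional[str] = None,
    heteroskedasticity: Optional[bool] = None,
) -> str:
    """Build the SELECT statement; optional arguments are positional and nested."""
    args = [
        sql_literal(source_table),
        sql_literal(out_table),
        sql_literal(dependent_varname),
        sql_literal(independent_varname),
    ]
    # The heteroskedasticity flag is the sixth argument, so the fifth must be present.
    if heteroskedasticity is not None and group_cols_sql is None:
        group_cols_sql = "NULL"
    if group_cols_sql is not None:
        args.append(group_cols_sql)  # raw SQL expression supplied by the caller
    if heteroskedasticity is not None:
        args.append("TRUE" if heteroskedasticity else "FALSE")
    return "SELECT madlib.linregr_train(" + ", ".join(args) + ");"


print(linregr_train_sql("houses", "houses_linregr", "price", "ARRAY[1, size, bath]"))
print(linregr_train_sql("houses", "houses_linregr", "price", "ARRAY[1, size, bath]",
                        heteroskedasticity=True))
```

The helper only formats text; running the statement still requires a database with MADlib installed and a client connection, neither of which is shown here.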

PivotalR – MADlib R Interface

[Architecture diagram; recoverable labels:]

R Client (PivotalR): no data here

Database: all data is here

R → SQL via RPostgreSQL/ODBC/DBI

R client to database: SQL to execute

Database to R client: computation result

Minimal data transfer: SQL strings, data for preview
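The idea on this slide (the client holds no data, sends SQL, and receives only results or a small preview) can be sketched in a few lines. The example uses Python and the standard-library `sqlite3` module as a stand-in for the R client and the Greenplum database; the `RemoteTable` class, its methods, and the `houses` table are hypothetical and are not PivotalR's API.

```python
import sqlite3
from typing import List, Optional, Tuple


class RemoteTable:
    """Client-side handle to a table; it stores a SQL description, not the rows."""

    def __init__(self, conn: sqlite3.Connection, name: str,
                 where: Optional[str] = None) -> None:
        self.conn = conn
        self.name = name
        self.where = where

    def filter(self, condition: str) -> "RemoteTable":
        """Return a new handle with an extra condition; nothing is executed yet."""
        combined = condition if self.where is None else f"({self.where}) AND ({condition})"
        return RemoteTable(self.conn, self.name, combined)

    def _sql(self, select: str, suffix: str = "") -> str:
        clause = f" WHERE {self.where}" if self.where else ""
        return f"SELECT {select} FROM {self.name}{clause}{suffix}"

    def mean(self, column: str) -> float:
        """Computed inside the database; only one number crosses the wire."""
        return self.conn.execute(self._sql(f"AVG({column})")).fetchone()[0]

    def preview(self, n: int = 3) -> List[Tuple]:
        """Fetch a handful of rows for inspection, never the whole table."""
        return self.conn.execute(self._sql("*", f" LIMIT {int(n)}")).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (size REAL, price REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?)",
                 [(50.0, 100.0), (80.0, 160.0), (120.0, 240.0), (200.0, 400.0)])

big = RemoteTable(conn, "houses").filter("size >= 80")
print(big._sql("AVG(price)"))
print(big.mean("price"))
print(big.preview(2))
```

This toy builds SQL by string formatting and does no escaping of conditions or column names, so it is suitable only for trusted input in a demonstration.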
