greenplum & data science - 数字化三部曲€¦ · greenplum architecture data science in...

© Copyright 2018 Pivotal Software, Inc. All rights Reserved. Version 1.0

Hubert Zhang [email protected] Oct 2018

Greenplum & Data Science

Cover w/ Image

Agenda

■  What Is Data Science

■  Greenplum Architecture

■  Data Science In Greenplum

■  PL/Container

■  MADlib

■  Q+A

What is Data Science

●  Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured ----Wikipedia

●  Data scientists are a new breed of analytical data expert who have the technical skills to solve complex problems – and the curiosity to explore what problems need to be solve ----SAS

Greenplum Architecture

Standby Master

…

Master Host

SQL

Interconnect

Segment Host

Node1

Segment Host

Node2

Segment Host

Node3

Segment Host

NodeN

Local Storage

Other RDBMSes Spark GemFire

Cloud Object

Storage HDFS Kafka ETL

Spring Cloud

Data Flow

GPCC Visualize

Cook

Prepare

Shop

Greenplum Big Data Platform

Greenplum Support on Data Science

How to prepare and cook data in greenplum

•  Procedure Language: PL/Python, PL/R, PL/Container

•  GPText

•  Apache MADlib

Procedural Language: PL/Python

User defined function written in Python ●  CREATE TABLE sales (id int, year int, qtr int, day int, region text) DISTRIBUTED BY (id) ;

INSERT INTO sales VALUES (1, 2014, 1,1, 'usa'), (2, 2002, 2,2, 'europe'), (3, 2014, 3,3, 'asia'), (4, 2014, 4,4, 'usa'), (5, 2014, 1,5, 'europe'), (6, 2014, 2,6, 'asia'), (7, 2002, 3,7, 'usa') ;

●  CREATE OR REPLACE FUNCTION pydbaccess(a integer)

RETURNS text AS $$ rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5) region = rv[a]["region"] return region $$ language plpythonu;

●  SELECT pydbaccess(2) ;

Procedural Language: PL/R

User defined function written in R ●  CREATE OR REPLACE FUNCTION r_norm(n integer, mean float8,

std_dev float8) RETURNS float8[ ] AS $$ x<-rnorm(n,mean,std_dev) return(x) $$ LANGUAGE 'plr';

●  CREATE TABLE test_norm_var

AS SELECT id, r_norm(10,0,1) as x FROM (SELECT generate_series(1,30:: bigint) AS ID) foo DISTRIBUTED BY (id);

●  SELECT mypytest(2) ;

Procedural Language: Pain Point How to install data science library of Python and R

●  Only DBA is able to install the third party library

●  Data science bundle pre-installed with 100+ libraries.

Untrusted language for PL/Python and PL/R ●  Only DBA could create function in untrusted language

●  Malicious UDF code may harm DB system.

Data Science Bundle

Only Superuser can Create UDF in Untrusted Languages

System(“rm -rf /data”)

The Problem: Triangle Dependency Data Scientist

DBA

UDF

Review & Create

Run UDF

Greenplum

Package UDF Package

1. Greenplum

2. Operation System

3. Python / R

4. TensorFlow

Resolve The Problem: untrusted -> untrusted Data Scientist

DBA UDF

Review & Create

Run UDF

Create UDF

How to Make untrusted to untrusted?

PL/Container

PL/Container

What is PL/Container?

PL/Container is a customizable, secure runtime for Greenplum Database Procedural Languages. ●  Greenplum Database Extension

●  Stateless

●  Based on Docker Container

●  Customizable

●  Secure

●  Isolated

PL/Container

PL/Container

Build Custom Docker Image

Minimum Requirement:

●  Python or R environment

●  Add location of libpython.so and libR.so to LD_LIBRARY_PATH Customize Your image:

●  Install specific packages

FROM continuumio/anaconda ENV LD_LIBRARY_PATH "/opt/conda/lib:$LD_LIBRARY_PATH”

FROM continuumio/anaconda3 RUN conda install -c conda-forge -y tensorflow ENV LD_LIBRARY_PATH "/opt/conda/lib:/usr/local/lib:$LD_LIBRARY_PATH"

Run PL/Container

Running a simple PL/Container UDF to calculate log10

postgres=# CREATE EXTENSION plcontainer;

postgres=# CREATE OR REPLACE FUNCTION pylog10(Input double precision) RETURNS double

precision AS $$

# container: plc_python_shared

import math

return math.log10(input)

$$ LANGUAGE plcontainer;

postgres=# SELECT pylog10(100); -------------- 2

GPText

•  Enables organizations to process mass quantities of raw text data for large-scale

text analytics

•  Indexing, Search, and Analytics Functions are exposed as SQL UDFs

•  Integrates with Apache Solr enterprise search

•  Computing distributed in segments, can be run in parallel

MADlib

Apache MADlib: Scalable In-Database Machine Learning in SQL

Open source, top level Apache project

For PostgreSQL and Greenplum Database

Powerful machine learning, graph, statistics and analytics for data scientists

•  Open source https://github.com/apache/madlib •  Downloads and docs http://madlib.apache.org/ •  Wiki https://cwiki.apache.org/confluence/display/MADLIB/

MADlib

History ●  Established in 2011

●  EMC/Greenplum Pivotal

●  Joe Hellerstein from Univ. of California, Berkeley.

MADlib

Features of MADlib

●  Rich algorithm libraries: more than 50+ algorithms.

●  Easy to use: SQL based interface.

●  Open source: Apache top project.

●  Distributed: integrated with Greenplum, support MPP architecture.

●  In-database analysis: No additional data movement.

MADlib

Architecture of MADlib

MADlib

MADlib’s Rich Algorithm Libraries ●  Supervised Learning: Linear Regression, Decision Tree, SVM, CRF etc.

●  Unsupervised Learning: K-Means, LDA, PCA, Association Rule etc.

●  Graph: PageRank, HIT, All Pairs Shortest Path, Weakly Connected Components etc.

●  Time Series: ARIMA

●  Model Selection: Cross Validation, Prediction Metrics

MADlib

Linear Regression in MADlib ●  Suppose data follow the linear relationship

●  Cost function: sum of squared residuals(SSR) is used to measure the overall model fit

#Dataset

CREATETABLEhouses(idINT,taxINT,bedroomINT,bathFLOAT,priceINT,sizeINT,lotINT);

#Training

SELECTmadlib.linregr_train('houses','houses_linregr','price','ARRAY[tax,bath,size]');

#Pridicting

SELECThouses.*,madlib.linregr_predict(m.coef,ARRAY[tax,bath,size])aspredict,price-madlib.linregr_predict(m.coef,ARRAY[tax,bath,size])asresidual

FROMhouses,houses_linregrmORDERBYid;

MADlib

PageRank ●  Google’s classical rank algorithm used to rank the importance of the web page.

●  PageRank is based on random walk model which can include the following steps: ○  Initialization step: set initial probability distribution

○  Iteration step: recalculate the score for each web page.

○  Converge step

Image from https://en.wikipedia.org/wiki/PageRank

MADlib

PageRank in MADlib

#Data

CREATETABLEvertex(idINTEGER);CREATETABLEedge(srcINTEGER,destINTEGER,user_idINTEGER);

#Training

SELECTmadlib.pagerank('vertex',--Vertextable'id',--Vertixidcolumn'edge',--Edgetable'src=src,dest=dest',--Edgeargs'pagerank_out');--Outputtable

#ResultSELECT*FROMpagerank_outORDERBYpagerankDESC;

1

3

4 2

3 5

1

1 2

Vertex or node

Edge

Edge Weight

MADlib

Github Analysis ●  Github is one of the biggest code repo for different kinds of cool projects, especially open

source.

●  GH Archive store all the Github history data: 113GB compressed data in 2017

MADlib

Github Analysis ●  Intern project: find the important repo in Github

during 2015-current

●  How to measure importance?

○  Personal PageRank

●  How to represent the node in PageRank?

○  People & Repo

●  How to represent the edge in PageRank?

○  People star a repo (single direction)

○  People contribute to a repo (bidirectional)

greenplum & data science - 数字化三部曲€¦ · greenplum architecture data science in...

Documents