greenplum & data science - 数字化三部曲€¦ · greenplum architecture data science in...
TRANSCRIPT
© Copyright 2018 Pivotal Software, Inc. All rights Reserved. Version 1.0
Hubert Zhang [email protected] Oct 2018
Greenplum & Data Science
Cover w/ Image
Agenda
■ What Is Data Science
■ Greenplum Architecture
■ Data Science In Greenplum
■ PL/Container
■ MADlib
■ Q+A
What is Data Science
● Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured ----Wikipedia
● Data scientists are a new breed of analytical data expert who have the technical skills to solve complex problems – and the curiosity to explore what problems need to be solve ----SAS
Greenplum Architecture
Standby Master
…
Master Host
SQL
Interconnect
Segment Host
Node1
Segment Host
Node2
Segment Host
Node3
Segment Host
NodeN
Local Storage
Other RDBMSes Spark GemFire
Cloud Object
Storage HDFS Kafka ETL
Spring Cloud
Data Flow
GPCC Visualize
Cook
Prepare
Shop
Greenplum Big Data Platform
Greenplum Support on Data Science
How to prepare and cook data in greenplum
• Procedure Language: PL/Python, PL/R, PL/Container
• GPText
• Apache MADlib
Procedural Language: PL/Python
User defined function written in Python ● CREATE TABLE sales (id int, year int, qtr int, day int, region text) DISTRIBUTED BY (id) ;
INSERT INTO sales VALUES (1, 2014, 1,1, 'usa'), (2, 2002, 2,2, 'europe'), (3, 2014, 3,3, 'asia'), (4, 2014, 4,4, 'usa'), (5, 2014, 1,5, 'europe'), (6, 2014, 2,6, 'asia'), (7, 2002, 3,7, 'usa') ;
● CREATE OR REPLACE FUNCTION pydbaccess(a integer)
RETURNS text AS $$ rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5) region = rv[a]["region"] return region $$ language plpythonu;
● SELECT pydbaccess(2) ;
Procedural Language: PL/R
User defined function written in R ● CREATE OR REPLACE FUNCTION r_norm(n integer, mean float8,
std_dev float8) RETURNS float8[ ] AS $$ x<-rnorm(n,mean,std_dev) return(x) $$ LANGUAGE 'plr';
● CREATE TABLE test_norm_var
AS SELECT id, r_norm(10,0,1) as x FROM (SELECT generate_series(1,30:: bigint) AS ID) foo DISTRIBUTED BY (id);
● SELECT mypytest(2) ;
Procedural Language: Pain Point How to install data science library of Python and R
● Only DBA is able to install the third party library
● Data science bundle pre-installed with 100+ libraries.
Untrusted language for PL/Python and PL/R ● Only DBA could create function in untrusted language
● Malicious UDF code may harm DB system.
Data Science Bundle
Only Superuser can Create UDF in Untrusted Languages
System(“rm -rf /data”)
The Problem: Triangle Dependency Data Scientist
DBA
UDF
Review & Create
Run UDF
Greenplum
Package UDF Package
1. Greenplum
2. Operation System
3. Python / R
4. TensorFlow
Resolve The Problem: untrusted -> untrusted Data Scientist
DBA UDF
Review & Create
Run UDF
Create UDF
How to Make untrusted to untrusted?
PL/Container
PL/Container
What is PL/Container?
PL/Container is a customizable, secure runtime for Greenplum Database Procedural Languages. ● Greenplum Database Extension
● Stateless
● Based on Docker Container
● Customizable
● Secure
● Isolated
PL/Container
PL/Container
Build Custom Docker Image
Minimum Requirement:
● Python or R environment
● Add location of libpython.so and libR.so to LD_LIBRARY_PATH Customize Your image:
● Install specific packages
FROM continuumio/anaconda ENV LD_LIBRARY_PATH "/opt/conda/lib:$LD_LIBRARY_PATH”
FROM continuumio/anaconda3 RUN conda install -c conda-forge -y tensorflow ENV LD_LIBRARY_PATH "/opt/conda/lib:/usr/local/lib:$LD_LIBRARY_PATH"
Run PL/Container
Running a simple PL/Container UDF to calculate log10
postgres=# CREATE EXTENSION plcontainer;
postgres=# CREATE OR REPLACE FUNCTION pylog10(Input double precision) RETURNS double
precision AS $$
# container: plc_python_shared
import math
return math.log10(input)
$$ LANGUAGE plcontainer;
postgres=# SELECT pylog10(100); -------------- 2
GPText
• Enables organizations to process mass quantities of raw text data for large-scale
text analytics
• Indexing, Search, and Analytics Functions are exposed as SQL UDFs
• Integrates with Apache Solr enterprise search
• Computing distributed in segments, can be run in parallel
MADlib
Apache MADlib: Scalable In-Database Machine Learning in SQL
Open source, top level Apache project
For PostgreSQL and Greenplum Database
Powerful machine learning, graph, statistics and analytics for data scientists
• Open source https://github.com/apache/madlib • Downloads and docs http://madlib.apache.org/ • Wiki https://cwiki.apache.org/confluence/display/MADLIB/
MADlib
History ● Established in 2011
● EMC/Greenplum Pivotal
● Joe Hellerstein from Univ. of California, Berkeley.
MADlib
Features of MADlib
● Rich algorithm libraries: more than 50+ algorithms.
● Easy to use: SQL based interface.
● Open source: Apache top project.
● Distributed: integrated with Greenplum, support MPP architecture.
● In-database analysis: No additional data movement.
MADlib
Architecture of MADlib
MADlib
MADlib’s Rich Algorithm Libraries ● Supervised Learning: Linear Regression, Decision Tree, SVM, CRF etc.
● Unsupervised Learning: K-Means, LDA, PCA, Association Rule etc.
● Graph: PageRank, HIT, All Pairs Shortest Path, Weakly Connected Components etc.
● Time Series: ARIMA
● Model Selection: Cross Validation, Prediction Metrics
MADlib
Linear Regression in MADlib ● Suppose data follow the linear relationship
● Cost function: sum of squared residuals(SSR) is used to measure the overall model fit
#Dataset
CREATETABLEhouses(idINT,taxINT,bedroomINT,bathFLOAT,priceINT,sizeINT,lotINT);
#Training
SELECTmadlib.linregr_train('houses','houses_linregr','price','ARRAY[tax,bath,size]');
#Pridicting
SELECThouses.*,madlib.linregr_predict(m.coef,ARRAY[tax,bath,size])aspredict,price-madlib.linregr_predict(m.coef,ARRAY[tax,bath,size])asresidual
FROMhouses,houses_linregrmORDERBYid;
MADlib
PageRank ● Google’s classical rank algorithm used to rank the importance of the web page.
● PageRank is based on random walk model which can include the following steps: ○ Initialization step: set initial probability distribution
○ Iteration step: recalculate the score for each web page.
○ Converge step
Image from https://en.wikipedia.org/wiki/PageRank
MADlib
PageRank in MADlib
#Data
CREATETABLEvertex(idINTEGER);CREATETABLEedge(srcINTEGER,destINTEGER,user_idINTEGER);
#Training
SELECTmadlib.pagerank('vertex',--Vertextable'id',--Vertixidcolumn'edge',--Edgetable'src=src,dest=dest',--Edgeargs'pagerank_out');--Outputtable
#ResultSELECT*FROMpagerank_outORDERBYpagerankDESC;
1
3
4 2
3 5
1
1 2
Vertex or node
Edge
Edge Weight
MADlib
Github Analysis ● Github is one of the biggest code repo for different kinds of cool projects, especially open
source.
● GH Archive store all the Github history data: 113GB compressed data in 2017
MADlib
Github Analysis ● Intern project: find the important repo in Github
during 2015-current
● How to measure importance?
○ Personal PageRank
● How to represent the node in PageRank?
○ People & Repo
● How to represent the edge in PageRank?
○ People star a repo (single direction)
○ People contribute to a repo (bidirectional)
Transforming How The World Builds Software
© Copyright 2018 Pivotal Software, Inc. All rights Reserved.