data science amsterdam - massively parallel processing with procedural languages

23
BUILT FOR THE SPEED OF BUSINESS

Upload: ian-huston

Post on 26-Jan-2015

105 views

Category:

Technology


1 download

DESCRIPTION

The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. With Procedural Languages such as PL/Python and PL/R data parallel queries can be run across terabytes of data using not only pure SQL but also familiar Python and R packages. The Pivotal Data Science team have used this technique to create fraud behaviour models for each individual user in a large corporate network, to understand interception rates at customs checkpoints by accelerating natural language processing of package descriptions and to reduce customer churn by building a sentiment model using customer call centre records. http://www.meetup.com/Data-Science-Amsterdam/events/178974942/

TRANSCRIPT

Page 1: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

BUILT FOR THE SPEED OF BUSINESS

Additional Line 18 Point Verdana

Page 2: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 2 © Copyright 2014 Pivotal. All rights reserved. 2 © Copyright 2013 Pivotal. All rights reserved. 2 © Copyright 2013 Pivotal. All rights reserved. 2 © Copyright 2013 Pivotal. All rights reserved. 2 © Copyright 2014 Pivotal. All rights reserved.

Massively Parallel Processing with Procedural Languages

PL/R and PL/Python

Ian Huston, @ianhuston Data Scientist, Pivotal

Page 3: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 3 © Copyright 2014 Pivotal. All rights reserved.

Some Links for this talk

� Simple code examples: https://github.com/ihuston/plpython_examples

�  IPython notebook rendered with nbviewer: http://tinyurl.com/ih-plpython

� More info (especially for PL/R): http://gopivotal.github.io/gp-r/

Page 4: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 4 © Copyright 2014 Pivotal. All rights reserved.

The Pivotal Message

Agile: data driven apps and rapid time to value Data Lake: store everything, analyze anything Enterprise PaaS: revolutionary software development and speed; build the right thing

Page 5: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 5 © Copyright 2014 Pivotal. All rights reserved.

Open Source is Pivotal

Page 6: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 6 © Copyright 2014 Pivotal. All rights reserved.

Introducing Pivotal Data Labs Our Charter: Pivotal Data Labs is Pivotal’s differentiated and highly opinionated data-centric service delivery organization.

Our Goals: Expedite customer time-to-value and ROI, by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.

Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.

Page 7: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 7 © Copyright 2014 Pivotal. All rights reserved. 7 © Copyright 2013 Pivotal. All rights reserved. 7 © Copyright 2013 Pivotal. All rights reserved. 7 © Copyright 2013 Pivotal. All rights reserved. 7 © Copyright 2014 Pivotal. All rights reserved.

Procedural Languages

Page 8: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 8 © Copyright 2014 Pivotal. All rights reserved.

MPP Architectural Overview

Think of it as multiple PostGreSQL servers

Workers

Master

Page 9: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 9 © Copyright 2014 Pivotal. All rights reserved.

User-Defined Functions (UDFs) �  PostgreSQL/Greenplum provide lots of flexibility in defining your own functions.

�  Simple UDFs are SQL queries with calling arguments and return types.

Definition: CREATE  FUNCTION  times2(INT)  RETURNS  INT  AS  $$          SELECT  2  *  $1  $$  LANGUAGE  sql;  

Execution: SELECT  times2(1);    times2    -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐              2  (1  row)  

Page 10: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 10 © Copyright 2014 Pivotal. All rights reserved.

�  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster

•  Data Parallelism: -  PL/X piggybacks on

Greenplum’s MPP architecture

•  Allows users to write Greenplum/PostgreSQL functions in the R/Python/Java, Perl, pgsql or C languages Standby

Master

Master Host

SQL

Interconnect

Segment Host Segment Segment

Segment Host Segment Segment

Segment Host Segment Segment

Segment Host Segment Segment

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

Page 11: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 11 © Copyright 2014 Pivotal. All rights reserved.

Data Parallelism

� Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks.

� Examples: –  Measure the height of each student in a classroom (explicitly

parallelizable by student) –  MapReduce –  Individual machine learning models per customer/region/etc.

Page 12: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 12 © Copyright 2014 Pivotal. All rights reserved.

Intro to PL/Python �  Procedural languages need to be installed on each database used.

�  Syntax is like normal Python function inside a SQL wrapper

�  Need to specify return type

CREATE  FUNCTION  pymax  (a  integer,  b  integer)      RETURNS  integer  AS  $$      if  a  >  b:          return  a      return  b  $$  LANGUAGE  plpythonu;  

SQL wrapper

SQL wrapper

Normal Python

Page 13: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 13 © Copyright 2014 Pivotal. All rights reserved.

Returning Results �  Postgres primitive types (int, bigint, text, float8, double precision, date, NULL etc.) �  Composite types can be returned by creating a composite type in the database:  

CREATE  TYPE  named_value  AS  (      name    text,      value    integer  );  

�  Then you can return a list, tuple or dict (not sets) which reference the same structure as the table: CREATE  FUNCTION  make_pair  (name  text,  value  integer)      RETURNS  named_value  AS  $$      return  [  name,  value  ]      #  or  alternatively,  as  tuple:  return  (  name,  value  )      #  or  as  dict:  return  {  "name":  name,  "value":  value  }      #  or  as  an  object  with  attributes  .name  and  .value  $$  LANGUAGE  plpythonu;  

�  For functions which return multiple rows, prefix “setof” before the return type

Page 14: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 14 © Copyright 2014 Pivotal. All rights reserved.

Parallelisation

Separate data into units using GROUP BY

Aggregate input for function using array_agg

SELECT name , plp.linregr(array_agg(x), array_agg(y))

FROM plp.test_data GROUP BY name ORDER BY name;

Page 15: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 15 © Copyright 2014 Pivotal. All rights reserved.

Benefits of Procedural Languages

� Easy to bring your code to the data.

� When SQL falls short leverage your Python/R/Java/C experience quickly.

� Apply Python/R packages across terabytes of data with minimal overhead or additional requirements.

� Results are already in the database system, ready for further analysis or storage.

Page 16: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 16 © Copyright 2014 Pivotal. All rights reserved.

Text Analytics for Churn Prediction Customer

A major telecom company

Business Problem

Reducing churn through more accurate models

Challenges

�  Existing models only used structured features

�  Call center memos had poor structure and had lots of typos

Solution

�  Built sentiment analysis models to predict churn and topic models to understand topics of conversation in call center memos

�  Achieved 16% improvement in ROC curve for Churn Prediction

Page 17: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 17 © Copyright 2014 Pivotal. All rights reserved.

Predicting Customs Clearance Intercepts Customer

A major courier delivery services company

Business Problem

7% of shipments into the US are intercepted by Customs based on package description.

Challenge

�  Data Volume: Previously not able to build models with rich features for billions of packages.

�  Data Variety: Customer’s prior technology did not allow them to extract features from unstructured shipment documents.

Solution

•  Processed shipment description documents using PL/Python. Tokenized, stemmed, removed stop words, etc.

•  Built features to summarize the shippers’ prior shipment and interception history

•  Built a model which predicts the propensity of a package being intercepted (AUC =0.86)

Intercepted Not Intercepted

Page 18: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 18 © Copyright 2014 Pivotal. All rights reserved.

Network User Behavior Anomaly Detection Customer

A major global financial service enterprise

Business Problem

Detect anomalous user behaviour in the global enterprise computer network

Challenges

10 Billions events in 6 months; 15K+ network devices; No existing SIEM solutions can model user behavioral resource access baseline and enable anomaly detection in an adaptive and scalable architecture.

Solution

An innovative Graph Mining based algorithmic framework with advanced Machine Learning. Network topology and temporal behaviors are both modeled. (Patent pending). Implemented in MPP and PL/R, enabling parallel model training and behavior risk scoring. Successfully identified DLP violating anomalous users.

Page 19: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 19 © Copyright 2014 Pivotal. All rights reserved. 19 © Copyright 2013 Pivotal. All rights reserved. 19 © Copyright 2013 Pivotal. All rights reserved. 19 © Copyright 2013 Pivotal. All rights reserved. 19 © Copyright 2014 Pivotal. All rights reserved.

MADlib

Page 20: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 20 © Copyright 2014 Pivotal. All rights reserved.

MADlib: The Origin

•  First mention of MAD analytics was at VLDB 2009 MAD Skills: New Analysis Practices for Big Data J. Hellerstein, J. Cohen, B. Dolan, M. Dunlap, C. Welton (with help from: Noelle Sio, David Hubbard, James Marca) http://db.cs.berkeley.edu/papers/vldb09-madskills.pdf

•  MADlib project initiated in late 2010:

Greenplum Analytics team and Prof. Joe Hellerstein

UrbanDictionary mad (adj.): an adjective used to enhance a noun. 1- dude, you got skills. 2- dude, you got mad skills

•  Open Source!https://github.com/madlib/madlib

•  Works on Greenplum DB, PostgreSQL and also HAWQ & Impala

•  Active development by Pivotal -  Latest Release: v1.5 (Apr 2014)

•  Downloads and Docs: http://madlib.net/

Page 21: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 21 © Copyright 2014 Pivotal. All rights reserved.

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems •  Sparse and Dense Solvers

Matrix Factorization •  Single Value Decomposition (SVD) •  Low-Rank

Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards •  Regression •  Elastic Net Regularization •  Sandwich Estimators (Huber white,

clustered, marginal effects)

Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Affinity Analysis, Market

Basket) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Ensemble Learners (Random Forests) •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation

Descriptive Statistics

Sketch-based Estimators •  CountMin (Cormode-

Muthukrishnan) •  FM (Flajolet-Martin) •  MFV (Most Frequent

Values) Correlation Summary

Support Modules

Array Operations Sparse Vectors Random Sampling Probability Functions

Page 22: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

@ianhuston 22 © Copyright 2014 Pivotal. All rights reserved.

Get in touch

Feel free to contact me about PL/Python & PL/R, or more generally about Data Science and opportunities available.

@ianhuston

ihuston @ gopivotal.com http://www.ianhuston.net

Page 23: Data Science Amsterdam - Massively Parallel Processing with Procedural Languages

BUILT FOR THE SPEED OF BUSINESS

Additional Line 18 Point Verdana