pattern: pmml for cascading and hadoop

26
19th ACM SIGKDD KDD 2013 PMML Workshop Conference on Knowledge Discovery and Data Mining kdd13pmml.wordpress.com Pattern: PMML for Cascading and Hadoop Paco Nathan Mesosphere, Inc. Girish Kathalagiri AgilOne, Inc. Tuesday, 13 August 13

Upload: paco-nathan

Post on 20-Aug-2015

2.440 views

Category:

Technology


1 download

TRANSCRIPT

19th ACM SIGKDD KDD 2013 PMML WorkshopConference on Knowledge Discovery and Data Miningkdd13pmml.wordpress.com

Pattern: PMML for Cascading and Hadoop

Paco NathanMesosphere, Inc.

Girish KathalagiriAgilOne, Inc.

Tuesday, 13 August 13

Pattern: PMML for Cascading and HadoopP Nathan, G Kathalagiri (2013-08-11)

Chicago Crime Data

Workflow Abstraction

Cascading, Pattern, etc.

Tuesday, 13 August 13

Pattern: Example App

• example integration of PMML and Cascading, using a sample app based on the crime dataset from the City of Chicago Open Data

• sample app implements a predictive model for expected crime rates based on location, hour of day, and month

• modeling performed in R, using the pmml package

• multiple models are captured as PMML, then integrated via Pattern to implement the entire workflow as a single app

• PMML provides a vector for migrating workloads off of SAS, SPSS, etc., onto Hadoop clusters for more cost-effective scaling

Tuesday, 13 August 13

Pattern: Example App

City of Chicago Open Data portalcityofchicago.org/city/en/narr/foia/CityData.html

Pattern open source projectgithub.com/Cascading/pattern

Observed benefits include greatly reduced development costs and less licensing issues at scale, while leveraging the scalability of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff.

Analysts can train predictive models in popular analytics frameworks, such as SAS, Microstrategy, R, Weka, SQL Server, etc., then run those models at scale on Apache Hadoop with little or no coding required.

Tuesday, 13 August 13

API Support for Model Chaining, Transforms, etc.

workflow used for data preparation:

Tuesday, 13 August 13

API Support for Model Chaining, Transforms, etc.

workflow used for model scoring:

Tuesday, 13 August 13

Pattern: PMML for Cascading and HadoopP Nathan, G Kathalagiri (2013-08-11)

Chicago Crime Data

Workflow Abstraction

Cascading, Pattern, etc.

Tuesday, 13 August 13

Enterprise Data Workflows

middleware for Big Data applications is evolving, with commercial examples that include:

Cascading, Lingual, Pattern, etc.

Concurrent

Anaconda, Wakari, IPython Notebook, etc.

Continuum Analytics

ParAccel Big Data Analytics Platform

Actian

ETL dataprep

predictivemodel

datasources

enduses

Tuesday, 13 August 13

Anatomy of an Enterprise app

Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

ETL dataprep

predictivemodel

datasources

enduses

ANSI SQL for ETL

Tuesday, 13 August 13

Anatomy of an Enterprise app

Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

ETL dataprep

predictivemodel

datasources

endusesJ2EE for business logic

Tuesday, 13 August 13

Anatomy of an Enterprise app

Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

ETL dataprep

predictivemodel

datasources

enduses

SAS for predictive models

Tuesday, 13 August 13

Anatomy of an Enterprise app

Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

ETL dataprep

predictivemodel

datasources

enduses

SAS for predictive modelsANSI SQL for ETL most of the licensing costs…

Tuesday, 13 August 13

Anatomy of an Enterprise app

Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

ETL dataprep

predictivemodel

datasources

endusesJ2EE for business logic

most of the project costs…

Tuesday, 13 August 13

ETL dataprep

predictivemodel

datasources

enduses

Lingual:DW → ANSI SQL

Pattern:SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC,Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

a compiler sees it all…

cascading.org

Tuesday, 13 August 13

a compiler sees it all…

ETL dataprep

predictivemodel

datasources

enduses

Lingual:DW → ANSI SQL

Pattern:SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC,Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap ); SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement ); flowDef.addAssemblyPlanner( sqlPlanner );

cascading.org

Tuesday, 13 August 13

a compiler sees it all…

ETL dataprep

predictivemodel

datasources

enduses

Lingual:DW → ANSI SQL

Pattern:SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC,Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields(); flowDef.addAssemblyPlanner( pmmlPlanner );

Tuesday, 13 August 13

Pattern: PMML for Cascading and HadoopP Nathan, G Kathalagiri (2013-08-11)

Chicago Crime Data

Workflow Abstraction

Cascading, Pattern, etc.

Tuesday, 13 August 13

Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010)Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wikigithub.com/twitter/scalding/wikiWhy Adopting the Declarative Programming Practices Will Improve Your Return from TechnologyDan Woods, 2013-04-17 Forbes

forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

Tuesday, 13 August 13

Functional Programming for Big Data

WordCount with token scrubbing…

Apache Hive: 52 lines HQL + 8 lines Python (UDF)

compared to

Scalding: 18 lines Scala/Cascading

functional programming languages help reduce software engineering costs at scale, over time

Tuesday, 13 August 13

Workflow Abstraction – pattern language

Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

Scrubtoken

DocumentCollection

Tokenize

WordCount

GroupBytoken

Count

Stop WordList

Regextoken

HashJoinLeft

RHS

M

R

Data is represented as flows of tuples. Operations in the flows bring functional programming aspects into Java

A Pattern LanguageChristopher Alexander, et al.amazon.com/dp/0195019199

Tuesday, 13 August 13

Workflow Abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

in formal terms, flow diagrams leverage a methodology called literate programming

provides intuitive, visual representations for apps –great for cross-team collaboration

Scrubtoken

DocumentCollection

Tokenize

WordCount

GroupBytoken

Count

Stop WordList

Regextoken

HashJoinLeft

RHS

M

R

Literate ProgrammingDon Knuthliterateprogramming.com

Tuesday, 13 August 13

Workflow Abstraction – business process

following the essence of literate programming, Cascading workflows provide statements of business process

this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)

this is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

Tuesday, 13 August 13

CustomerOrders

Classify ScoredOrders

GroupBytoken

Count

PMMLModel

M R

FailureTraps

Assert

ConfusionMatrix

Pattern – score a model, using pre-defined Cascading app

cascading.org/pattern

Tuesday, 13 August 13

Hadoop Cluster

sourcetap

sourcetap sink

taptraptap

customer profile DBsCustomer

Prefs

logslogs

Logs

DataWorkflow

Cache

Customers

Support

WebApp

Reporting

Analytics Cubes

sinktap

Modeling PMML

Pattern – model scoring

• migrate workloads: SAS,Teradata, etc., exporting predictive models as PMML

• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.

• integrate with other libraries –Matrix API, etc.

• leverage PMML as another kind of DSL

cascading.org/pattern

Tuesday, 13 August 13

public static void main( String[] args ) throws RuntimeException { String inputPath = args[ 0 ]; String classifyPath = args[ 1 ]; // set up the config properties Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );  // create source and sink taps Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath ); Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );  // handle command line options OptionParser optParser = new OptionParser(); optParser.accepts( "pmml" ).withRequiredArg();  OptionSet options = optParser.parse( args );  // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "classify" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );  if( options.hasArgument( "pmml" ) ) { String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlPath ) ) .retainOnlyActiveIncomingFields() .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model flowDef.addAssemblyPlanner( pmmlPlanner ); }  // write a DOT file and run the flow Flow classifyFlow = flowConnector.connect( flowDef ); classifyFlow.writeDOT( "dot/classify.dot" ); classifyFlow.complete(); }

Pattern – score a model, within an app

Tuesday, 13 August 13