KDD 2013 PMML Workshop
19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
kdd13pmml.wordpress.com
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Mesosphere, Inc.
Girish Kathalagiri, AgilOne, Inc.
Tuesday, 13 August 13
Pattern: PMML for Cascading and Hadoop
P Nathan, G Kathalagiri (2013-08-11)
Chicago Crime Data
Workflow Abstraction
Cascading, Pattern, etc.
Pattern: Example App
• example integration of PMML and Cascading, using a sample app based on the crime dataset from the City of Chicago Open Data
• sample app implements a predictive model for expected crime rates based on location, hour of day, and month
• modeling performed in R, using the pmml package
• multiple models are captured as PMML, then integrated via Pattern to implement the entire workflow as a single app
• PMML provides a vector for migrating workloads off of SAS, SPSS, etc., onto Hadoop clusters for more cost-effective scaling
Pattern: Example App
City of Chicago Open Data portal
cityofchicago.org/city/en/narr/foia/CityData.html
Pattern open source project
github.com/Cascading/pattern
Observed benefits include greatly reduced development costs and fewer licensing issues at scale, while leveraging the scalability of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff.
Analysts can train predictive models in popular analytics frameworks, such as SAS, MicroStrategy, R, Weka, SQL Server, etc., then run those models at scale on Apache Hadoop with little or no coding required.
API Support for Model Chaining, Transforms, etc.
workflow used for data preparation: [flow diagram]
API Support for Model Chaining, Transforms, etc.
workflow used for model scoring: [flow diagram]
Enterprise Data Workflows
middleware for Big Data applications is evolving, with commercial examples that include:
• Cascading, Lingual, Pattern, etc. (Concurrent)
• Anaconda, Wakari, IPython Notebook, etc. (Continuum Analytics)
• ParAccel Big Data Analytics Platform (Actian)
[diagram labels: data sources, ETL, data prep, predictive model, end uses]
Anatomy of an Enterprise app
Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
[diagram labels: data sources, ETL, data prep, predictive model, end uses]
• ANSI SQL for ETL
• J2EE for business logic
• SAS for predictive models
• SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…
• J2EE for business logic: most of the project costs…
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
[diagram labels: data sources, ETL, data prep, predictive model, end uses]
• source taps for Cassandra, JDBC, Splunk, etc.
• Lingual: DW → ANSI SQL
• Pattern: SAS, R, etc. → PMML
• business logic in Java, Clojure, Scala, etc.
• sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all…
cascading.org
a compiler sees it all…
Lingual: DW → ANSI SQL
FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
a compiler sees it all…
Pattern: SAS, R, etc. → PMML
FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010)
github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012)
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce software engineering costs at scale, over time
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[flow diagram labels: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce phases]
Data is represented as flows of tuples. Operations in the flows bring functional programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
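The "flows of tuples" idea can be illustrated without any Hadoop dependency. The following is a minimal sketch in plain Java streams, not the Cascading API: the class name `WordCountSketch`, the tokenizing regex, and the scrub rule (trim plus lowercase) are assumptions made for illustration. Each stage mirrors one element named in the diagram labels: tokenize, scrub token, stop word list, group by token, count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java sketch of a WordCount flow; it does NOT use the Cascading API.
class WordCountSketch {

  // "Scrub token" stage (assumed rule): trim whitespace and lowercase.
  static String scrub(String token) {
    return token.trim().toLowerCase(Locale.ROOT);
  }

  // Runs the whole flow and returns token -> count, sorted by token.
  static Map<String, Long> wordCount(List<String> documents, Set<String> stopWords) {
    return documents.stream()
        // "Tokenize": split each document on spaces and basic punctuation.
        .flatMap(doc -> Arrays.stream(doc.split("[ \\[\\]\\(\\),.]")))
        .map(WordCountSketch::scrub)
        .filter(token -> !token.isEmpty())
        // "Stop Word List": drop tokens found in the stop word set.
        .filter(token -> !stopWords.contains(token))
        // "GroupBy token" followed by "Count".
        .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args) {
    List<String> documents = Arrays.asList("A rain shadow is a dry area", "Rain, rain.");
    Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is"));
    System.out.println(wordCount(documents, stopWords));
  }
}
```

In Cascading the same stages would be expressed as pipes connected between source and sink taps, and the flow planner, rather than the JVM stream library, would decide how to run them as parallel jobs.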
Workflow Abstraction – literate programming
Cascading workflows generate their own visual documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology called literate programming
provides intuitive, visual representations for apps – great for cross-team collaboration
Literate Programming
Don Knuth
literateprogramming.com
Workflow Abstraction – business process
following the essence of literate programming, Cascading workflows provide statements of business process
this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
Pattern – score a model, using pre-defined Cascading app
[flow diagram labels: Customer Orders, Classify, PMML Model, Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, Failure Traps; M and R mark the map and reduce phases]
cascading.org/pattern
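The diagram ends in a confusion matrix, which amounts to grouping scored records by their (actual, predicted) pair and counting each group. A minimal sketch in plain Java follows; it does not use Cascading or Pattern, and the class name `ConfusionMatrixSketch` and the two-element record layout (actual label first, predicted label second) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch; it does NOT use Cascading or Pattern.
class ConfusionMatrixSketch {

  // Returns counts.get(actual).get(predicted) = number of records with that pair.
  // Each record is assumed to be { actualLabel, predictedLabel }.
  static Map<String, Map<String, Integer>> confusionMatrix(List<String[]> scored) {
    Map<String, Map<String, Integer>> counts = new TreeMap<>();
    for (String[] record : scored) {
      String actual = record[0];
      String predicted = record[1];
      // Group by the (actual, predicted) pair, then count.
      counts.computeIfAbsent(actual, key -> new TreeMap<>())
            .merge(predicted, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String[]> scored = Arrays.asList(
        new String[] { "yes", "yes" },
        new String[] { "yes", "no" },
        new String[] { "no", "no" },
        new String[] { "yes", "yes" });
    System.out.println(confusionMatrix(scored));
  }
}
```

Off-diagonal cells (actual differs from predicted) are the misclassifications; in a regression test of a migrated model they would be expected to stay at zero when the scored output is compared against the original tool's predictions.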
Pattern – model scoring
[architecture diagram labels: Hadoop Cluster, source taps, sink taps, trap tap, Customer Prefs, customer profile DBs, logs, Data Workflow, Cache, Customers, Support, Web App, Reporting, Analytics Cubes, Modeling, PMML]
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL
cascading.org/pattern
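Treating PMML as a DSL means the model file itself declares which fields it consumes and which it predicts. In PMML this lives in the MiningSchema element, where a MiningField's usageType defaults to "active" when omitted. The following minimal sketch reads that declaration with the JDK's built-in XML parser rather than Pattern's own parser; the class name `MiningSchemaSketch`, the sample document, and its field names (hour, month, crime_rate) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch using only the JDK XML parser; it does NOT use Pattern's PMML parser.
class MiningSchemaSketch {

  // Hypothetical, heavily trimmed PMML document; field names are illustrative.
  static final String SAMPLE_PMML =
      "<PMML version=\"4.1\" xmlns=\"http://www.dmg.org/PMML-4_1\">"
      + "<Header description=\"hypothetical sample\"/>"
      + "<DataDictionary numberOfFields=\"3\">"
      + "<DataField name=\"hour\" optype=\"continuous\" dataType=\"double\"/>"
      + "<DataField name=\"month\" optype=\"continuous\" dataType=\"double\"/>"
      + "<DataField name=\"crime_rate\" optype=\"continuous\" dataType=\"double\"/>"
      + "</DataDictionary>"
      + "<RegressionModel functionName=\"regression\">"
      + "<MiningSchema>"
      + "<MiningField name=\"crime_rate\" usageType=\"predicted\"/>"
      + "<MiningField name=\"hour\"/>"
      + "<MiningField name=\"month\" usageType=\"active\"/>"
      + "</MiningSchema>"
      + "<RegressionTable intercept=\"0.0\"/>"
      + "</RegressionModel>"
      + "</PMML>";

  // Lists MiningField names whose usageType matches, in document order.
  static List<String> fieldsWithUsage(String pmml, String usageType) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
      NodeList fields = doc.getElementsByTagName("MiningField");
      List<String> names = new ArrayList<>();
      for (int i = 0; i < fields.getLength(); i++) {
        Element field = (Element) fields.item(i);
        // A missing usageType attribute means "active" in PMML.
        String usage = field.hasAttribute("usageType") ? field.getAttribute("usageType") : "active";
        if (usage.equals(usageType)) {
          names.add(field.getAttribute("name"));
        }
      }
      return names;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse PMML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("active: " + fieldsWithUsage(SAMPLE_PMML, "active"));
    System.out.println("predicted: " + fieldsWithUsage(SAMPLE_PMML, "predicted"));
  }
}
```

Because the schema travels with the model, an analyst can retrain and re-export from R or another tool, and the scoring workflow can pick up the new inputs without code changes, provided the incoming data carries fields with matching names.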
Pattern – score a model, within an app
public static void main( String[] args ) throws RuntimeException
  {
  String inputPath = args[ 0 ];
  String classifyPath = args[ 1 ];

  // set up the config properties
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

  // handle command line options
  OptionParser optParser = new OptionParser();
  optParser.accepts( "pmml" ).withRequiredArg();
  OptionSet options = optParser.parse( args );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );

  if( options.hasArgument( "pmml" ) )
    {
    String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

    flowDef.addAssemblyPlanner( pmmlPlanner );
    }

  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
  }
Enterprise Data Workflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
Newsletter for analysis, events, updates:
liber118.com/pxn/