Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop
DESCRIPTION
"Pattern" is an open source project which takes models trained in popular analytics frameworks, such as SAS, Microstrategy, SQL Server, etc., and runs them at scale on Apache Hadoop. This machine learning library works by translating PMML -- an established XML standard for predictive model markup -- into data workflows based on the Cascading API in Java. PMML models can be run in a pre-defined JAR file with no coding required. PMML can also be combined with other flows based on ANSI SQL (Lingual), Scala (Scalding), Clojure (Cascalog), etc. Multiple companies have collaborated to implement parallelized algorithms: Random Forest, Logistic Regression, K-Means, Hierarchical Clustering, etc., with more machine learning support being added. Benefits include greatly reduced development costs and fewer licensing issues at scale, while leveraging a combination of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. Sample code in the talk will show apps using predictive models built in SAS and R, e.g., anti-fraud classifiers. In addition, examples will show how to compare variations of models for large-scale customer experiments. Portions of this material come from the O'Reilly book "Enterprise Data Workflows with Cascading", due June 2013.
TRANSCRIPT
![Page 1: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/1.jpg)
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
“Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”
![Page 2: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/2.jpg)
[Diagram: example flow with nodes "employee", "quarterly sales", "leads", "Join", "Count", "PMML classifier", "bonus allocation", and "Failure Traps"]
Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern
![Page 3: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/3.jpg)
Cascading – origins
API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products.
Wensel was following the Nutch open source project – where Hadoop started.
Observation: would be difficult to find Java developers to write complex Enterprise apps in MapReduce – potential blocker for leveraging new open source technology.
![Page 4: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/4.jpg)
Cascading – functional programming
Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:
• leverages JVM and Java-based tools without any need to create new languages
• allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
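To illustrate why data pipelines are "functional in nature", here is a minimal sketch in plain Java (no Hadoop or Cascading involved) that expresses a word count as a chain of map, filter, and group/count steps. The class name `FunctionalWordCount` is hypothetical; the split regex is borrowed from the Word Count example later in the deck.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class FunctionalWordCount {
    /** Counts tokens across all lines using a functional pipeline. */
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            // "map" side: split each line into a stream of tokens
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // drop the empty strings produced by adjacent delimiters
            .filter(token -> !token.isEmpty())
            // "reduce" side: group by token and count each group
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("rain shadow (rain)", "shadow, rain.")));
    }
}
```

Each stage is a pure function over a stream of values, which is the property that lets a planner distribute the same logic across a cluster.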
![Page 5: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/5.jpg)
[Diagram: Enterprise data workflow on a Hadoop cluster, with source taps, sink taps, and a trap tap connecting Customers, Web App, Customer Prefs, customer profile DBs, Logs, Cache, Support, Reporting, Analytics Cubes, and Modeling (PMML)]
Cascading – definitions
• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale
![Page 6: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/6.jpg)
Cascading – usage
• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals
![Page 7: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/7.jpg)
Cascading – integrations
• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera
• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo, JSON, etc.
• topologies: Apache Hadoop, tuple spaces, local mode
![Page 8: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/8.jpg)
Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.
![Page 9: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/9.jpg)
workflow abstraction addresses:
• staffing bottleneck
• system integration
• operational complexity
• test-driven development
![Page 10: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/10.jpg)
![Page 11: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/11.jpg)
Enterprise Data Workflows
Let’s consider a “strawman” architecture for an example app… at the front end
LOB use cases drive demand for apps
![Page 12: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/12.jpg)
Enterprise Data Workflows
Same example… in the back office
Organizations have substantial investments in people, infrastructure, process
![Page 13: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/13.jpg)
Enterprise Data Workflows
Same example… the heavy lifting!
“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out
![Page 14: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/14.jpg)
Cascading workflows – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend a new kind of tap in just a few lines of Java
schema and provenance get derived from analysis of the taps
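The three tap roles named above (source, sink, trap) can be sketched in plain Java. This is an illustration of the idea only, not the Cascading API: `TapSketch`, `Result`, and `run` are hypothetical names, and in-memory lists stand in for real endpoints.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TapSketch {
    /** Holds what reached the sink and what was diverted to the trap. */
    static final class Result {
        final List<String> sink = new ArrayList<>();
        final List<String> trap = new ArrayList<>();
    }

    /** Reads tuples from the source, applies the operation, and routes failures to the trap. */
    static Result run(List<String> source, Function<String, String> operation) {
        Result result = new Result();
        for (String tuple : source) {
            try {
                result.sink.add(operation.apply(tuple));
            } catch (RuntimeException e) {
                // a "data exception": keep the offending input instead of failing the whole flow
                result.trap.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> source = List.of("1\t2.5", "2\tnot-a-number", "3\t4.0");
        Result r = run(source, line -> {
            String[] fields = line.split("\t");
            return fields[0] + "\t" + (Double.parseDouble(fields[1]) * 2);
        });
        System.out.println("sink: " + r.sink);
        System.out.println("trap: " + r.trap);
    }
}
```

The point of the trap is that one malformed record lands in a side channel for later inspection while the rest of the stream keeps flowing.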
![Page 15: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/15.jpg)
Cascading workflows – taps
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

source and sink taps for TSV data in HDFS
![Page 16: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/16.jpg)
Cascading workflows – topologies
• topologies execute workflows on clusters
• flow planner is like a compiler for queries
  - Hadoop (MapReduce jobs)
  - local mode (dev/test or special config)
  - in-memory data grids (real-time)
• flow planner can be extended to support other topologies
blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)
![Page 17: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/17.jpg)
Cascading workflows – topologies
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
...
Flow wcFlow = flowConnector.connect( flowDef );

flow planner for Apache Hadoop topology
![Page 18: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/18.jpg)
Cascading workflows – test-driven development
• assert patterns (regex) on the tuple streams
• adjust assert levels, like log4j levels
• trap edge cases as “data exceptions”
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to remove them
  4. when impl is complete, app has full test coverage
• redirect traps in production to Ops, QA, Support, Audit, etc.
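A stream assertion with an adjustable level might look roughly like the following plain-Java sketch. It does not use Cascading's assertion classes; `StreamAssertionSketch`, `AssertLevel`, and `violations` are hypothetical names chosen for illustration, and the regex is an assumed format for token/count tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StreamAssertionSketch {
    // Hypothetical levels, analogous to logging levels: NONE switches the check off.
    enum AssertLevel { NONE, STRICT }

    /** Returns the tuples that fail the expected pattern; these are candidates for a trap. */
    static List<String> violations(List<String> tuples, Pattern expected, AssertLevel level) {
        List<String> failed = new ArrayList<>();
        if (level == AssertLevel.NONE) {
            return failed; // assertions disabled, e.g. for a production run
        }
        for (String tuple : tuples) {
            if (!expected.matcher(tuple).matches()) {
                failed.add(tuple);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // assumed tuple format: lowercase token, a tab, then an integer count
        Pattern tokenAndCount = Pattern.compile("^[a-z]+\\t\\d+$");
        List<String> stream = List.of("rain\t3", "shadow\t2", "Rain\tthree");
        System.out.println(violations(stream, tokenAndCount, AssertLevel.STRICT));
        System.out.println(violations(stream, tokenAndCount, AssertLevel.NONE));
    }
}
```

Defining one such assertion per transform stage is what lets the finished app double as its own test suite.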
![Page 19: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/19.jpg)
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
[Diagram: Word Count flow with nodes Document Collection, Tokenize, Scrub token, Stop Word List, HashJoin (Left/RHS), Regex token, GroupBy token, Count, and Word Count; map (M) and reduce (R) markers]
Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java.
In formal terms, this provides a pattern language.
![Page 20: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/20.jpg)
Pattern Language
structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
![Page 21: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/21.jpg)
Workflow Abstraction – literate programming
Cascading workflows generate their own visual documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology called literate programming
provides intuitive, visual representations for apps – great for cross-team collaboration
![Page 22: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/22.jpg)
Literate Programming
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
![Page 23: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/23.jpg)
Workflow Abstraction – business process
following the essence of literate programming, Cascading workflows provide statements of business process
this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale
![Page 24: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/24.jpg)
Business Process
by Edgar Codd
“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do:
the process of structuring data
![Page 25: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/25.jpg)
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012): github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/
![Page 26: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/26.jpg)
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce software engineering costs at scale, over time
![Page 27: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/27.jpg)
Two Avenues to the App Layer…
[Diagram: axes labeled "scale ➞" and "complexity ➞"]
Enterprise: must contend with complexity at scale every day…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
![Page 28: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/28.jpg)
![Page 29: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/29.jpg)
PMML – standard
• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997 http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”
wikipedia.org/wiki/Predictive_Model_Markup_Language
![Page 30: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/30.jpg)
PMML – model coverage
• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
ibm.com/developerworks/industry/library/ind-PMML2/
![Page 31: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/31.jpg)
PMML – vendor coverage
![Page 32: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/32.jpg)
![Page 33: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/33.jpg)
Pattern – model scoring
• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML
• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries – Matrix API, etc.
• leverage PMML as another kind of DSL
cascading.org/pattern
![Page 34: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/34.jpg)
Pattern – create a model in R

## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
![Page 35: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/35.jpg)
Pattern – capture model parameters as PMML

<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...
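The `multipleModelMethod="majorityVote"` attribute in the PMML says that each tree segment contributes one predicted label and the most frequent label wins. A minimal plain-Java sketch of that combination step follows; it is not Pattern's implementation, `MajorityVoteSketch` is a hypothetical name, and the tie-breaking rule (first label to reach the top count) is an assumption made here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MajorityVoteSketch {
    /** Returns the label predicted by the largest number of segments. */
    static String majorityVote(List<String> segmentLabels) {
        Map<String, Integer> votes = new HashMap<>();
        String winner = null;
        int best = 0;
        for (String label : segmentLabels) {
            // merge() returns the updated vote count for this label
            int n = votes.merge(label, 1, Integer::sum);
            if (n > best) { // assumption: on a tie, the label that got there first stays the winner
                best = n;
                winner = label;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // labels "0" and "1" as declared in the DataDictionary; one entry per tree
        System.out.println(majorityVote(List.of("1", "0", "1", "1", "0")));
    }
}
```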
![Page 36: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/36.jpg)
Pattern – score a model, within an app

public static void main( String[] args ) throws RuntimeException {
  String inputPath = args[ 0 ];
  String classifyPath = args[ 1 ];

  // set up the config properties
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

  // handle command line options
  OptionParser optParser = new OptionParser();
  optParser.accepts( "pmml" ).withRequiredArg();
  OptionSet options = optParser.parse( args );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );

  if( options.hasArgument( "pmml" ) ) {
    String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

    flowDef.addAssemblyPlanner( pmmlPlanner );
  }

  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
}
![Page 37: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/37.jpg)
Pattern – score a model, using pre-defined Cascading app
[Diagram: flow with nodes Customer Orders, Classify, PMML Model, Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, and Failure Traps; map (M) and reduce (R) markers]
cascading.org/pattern
![Page 38: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/38.jpg)
Pattern – score a model, using pre-defined Cascading app

## run an RF classifier at scale
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml

## run an RF classifier at scale, assert regression test, measure confusion matrix
hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
  --pmml data/sample.rf.xml --assert --measure out/measure

## run a predictive model at scale, measure RMSE
hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
  --pmml data/iris.lm_p.xml --rmse out/measure
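For readers unfamiliar with the RMSE measure named in the last command, the standard definition (square root of the mean squared difference between predicted and observed values) can be sketched in plain Java. This is the textbook formula, not code from the Pattern JAR; `RmseSketch` is a hypothetical name.

```java
public class RmseSketch {
    /** Root mean squared error between predicted and observed values. */
    static double rmse(double[] predicted, double[] observed) {
        if (predicted.length != observed.length || predicted.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            double diff = predicted[i] - observed[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / predicted.length);
    }

    public static void main(String[] args) {
        // illustrative values only: one prediction is off by 4, the rest are exact
        double[] predicted = {1.0, 2.0, 3.0, 4.0};
        double[] observed = {1.0, 2.0, 3.0, 8.0};
        System.out.println(rmse(predicted, observed));
    }
}
```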
![Page 39: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/39.jpg)
Roadmap – existing algorithms for scoring
• Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial
• Support Vector Machines (prepared for release)
also, model chaining and general support for ensembles
cascading.org/pattern
![Page 40: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/40.jpg)
Roadmap – next priorities for scoring
• Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user
cascading.org/pattern
![Page 41: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/41.jpg)
Roadmap – top priorities for creating models at scale
• Random Forest
• Logistic Regression
• K-Means Clustering
• Association Rules
…plus all models which can be trained via sparse matrix factorization (TQSR => PCA, SVD least squares, etc.)
a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…
cascading.org/pattern
![Page 42: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/42.jpg)
![Page 43: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/43.jpg)
Experiments – comparing models
• much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale
• run multiple variants, then measure relative “lift”
• Concurrent runtime – tag and track models
the following example compares two models trained with different machine learning algorithms
this is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment
![Page 44: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/44.jpg)
Experiments – Random Forest model

## train a Random Forest model
## example: http://mkseo.pe.kr/stats/?p=220
f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
print(fit)
saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

OOB estimate of error rate: 14%
Confusion matrix:
    0   1 class.error
0  69  16   0.1882353
1  12 103   0.1043478
![Page 45: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/45.jpg)
Experiments – Logistic Regression model

    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r
    library(pmml)  # pmml()
    library(XML)   # saveXML()

    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
    var0         -1.3755     0.4355  -3.159  0.00159 **
    var2         -3.7742     0.5794  -6.514 7.30e-11 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

NB: this model has “var1” intentionally omitted
![Page 46: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/46.jpg)
Experiments – comparing results
use a confusion matrix to compare results for the classifiers
Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)
assign a cost model to select a winner; for example, in an ecommerce anti-fraud classifier:
FN ∼ chargeback risk
FP ∼ customer support costs
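One way such a cost model could be sketched: weight each error rate by an assumed unit cost and by how often the underlying class occurs, then pick the model with the lower expected cost. The error rates below are the ones quoted on this slide, read here as per-class rates (an assumption). The fraud rate and both unit costs are invented for illustration, and the class and method names are hypothetical.

```java
// Hypothetical sketch: pick a winner between two classifiers with a cost model.
final class CostModel {

    /**
     * Expected cost per scored transaction.
     * fnRate is read as the share of actual fraud that the model misses,
     * fpRate as the share of legitimate transactions that the model flags.
     */
    static double expectedCost(double fraudRate, double fnRate, double fpRate,
                               double costPerFn, double costPerFp) {
        double missedFraud = fraudRate * fnRate * costPerFn;          // chargeback risk
        double falseAlarms = (1.0 - fraudRate) * fpRate * costPerFp;  // customer support costs
        return missedFraud + falseAlarms;
    }

    public static void main(String[] args) {
        // Assumed values, not from the talk.
        double fraudRate = 0.02;
        double costPerFn = 100.0;
        double costPerFp = 5.0;

        // Error rates quoted on the slide.
        double rf = expectedCost(fraudRate, 0.11, 0.14, costPerFn, costPerFp);
        double lr = expectedCost(fraudRate, 0.05, 0.52, costPerFn, costPerFp);

        System.out.println("Random Forest cost per transaction: " + rf);
        System.out.println("Logistic Regression cost per transaction: " + lr);
        System.out.println("winner: " + (rf <= lr ? "Random Forest" : "Logistic Regression"));
    }
}
```

With different assumed costs the ranking can flip: a high enough chargeback cost relative to support cost favors the model with the lower false negative rate.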
![Page 47: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/47.jpg)
![Page 48: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/48.jpg)
Two Cultures
“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L
in other words, seeing the forest for the trees…
this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
![Page 49: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/49.jpg)
Why Do Ensembles Matter?
The World…per Data Modeling
The World…
![Page 50: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/50.jpg)
Algorithmic Modeling
“The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman
circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression
major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.
the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…
![Page 51: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/51.jpg)
Ensemble Models
Breiman: “a multiplicity of data models”
BellKor team: 100+ individual models in 2007 Progress Prize
while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
Ensemble Learning: Better Predictions Through Diversity
Todd Holloway
ETech (2008)
abeautifulwww.com/EnsembleLearningETech.pdf
The Story of the Netflix Prize: An Ensembler's Tale
Lester Mackey
National Academies Seminar, Washington, DC (2011)
stanford.edu/~lmackey/papers/
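As a rough illustration of combining models, here is a minimal sketch of an unweighted majority vote over binary classifiers. This is only one simple combining scheme, chosen for brevity; it is not the Pattern API and not the blending used by the Netflix Prize teams. The class name and the three threshold rules standing in for trained models are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: combine binary classifiers by unweighted majority vote.
final class MajorityVote {

    /** Returns true when more than half of the models vote for the positive class. */
    static <T> boolean classify(List<Predicate<T>> models, T input) {
        if (models.isEmpty()) {
            throw new IllegalArgumentException("need at least one model");
        }
        int positiveVotes = 0;
        for (Predicate<T> model : models) {
            if (model.test(input)) {
                positiveVotes++;
            }
        }
        // Strict majority; a tie counts as negative.
        return 2 * positiveVotes > models.size();
    }

    public static void main(String[] args) {
        // Three made-up threshold rules standing in for trained models.
        List<Predicate<double[]>> models = Arrays.asList(
            x -> x[0] > 0.5,
            x -> x[1] > 0.5,
            x -> x[0] + x[1] > 0.8
        );
        System.out.println(classify(models, new double[] {0.6, 0.3}));
        System.out.println(classify(models, new double[] {0.2, 0.3}));
    }
}
```

The trade-off named on the slide shows up even here: the combined prediction depends on every member, so explaining a single decision means inspecting all of the votes.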
![Page 52: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/52.jpg)
KDD 2013 PMML Workshop
Pattern: PMML for Cascading and Hadoop
Paco Nathan, Girish Kathalagiri
Chicago, 2013-08-11 (accepted)
19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
kdd13pmml.wordpress.com
![Page 53: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/53.jpg)
![Page 54: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/54.jpg)
Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…
[Diagram: data sources, ETL, data prep, predictive model, end uses]
![Page 55: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/55.jpg)
Anatomy of an Enterprise app
ANSI SQL for ETL
![Page 56: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/56.jpg)
Anatomy of an Enterprise app
J2EE for business logic
![Page 57: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/57.jpg)
Anatomy of an Enterprise app
SAS for predictive models
![Page 58: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/58.jpg)
Anatomy of an Enterprise app
SAS for predictive models
ANSI SQL for ETL
most of the licensing costs…
![Page 59: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/59.jpg)
Anatomy of an Enterprise app
J2EE for business logic
most of the project costs…
![Page 60: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/60.jpg)
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source
[Diagram: data sources, ETL, data prep, predictive model, end uses]
source taps for Cassandra, JDBC, Splunk, etc.
Lingual: DW → ANSI SQL
Pattern: SAS, R, etc. → PMML
business logic in Java, Clojure, Scala, etc.
sink taps for Memcached, HBase, MongoDB, etc.
a compiler sees it all…
cascading.org
![Page 61: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/61.jpg)
Anatomy of an Enterprise app
a compiler sees it all…

    // define the ETL flow: two source taps in, one sink tap out
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "etl" )
      .addSource( "example.employee", emplTap )
      .addSource( "example.sales", salesTap )
      .addSink( "results", resultsTap );

    // let Lingual plan the pipe assembly from an ANSI SQL statement
    SQLPlanner sqlPlanner = new SQLPlanner()
      .setSql( sqlStatement );

    flowDef.addAssemblyPlanner( sqlPlanner );
![Page 62: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/62.jpg)
Anatomy of an Enterprise app
a compiler sees it all…

    // define the classifier flow: one source tap in, one sink tap out
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "classifier" )
      .addSource( "input", inputTap )
      .addSink( "classify", classifyTap );

    // let Pattern plan the pipe assembly from a PMML model file
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlModel ) )
      .retainOnlyActiveIncomingFields();

    flowDef.addAssemblyPlanner( pmmlPlanner );
![Page 63: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/63.jpg)
Anatomy of an Enterprise app
visual collaboration for the business logic is a great way to improve how teams work together
[Diagram: example workflow with employee and quarterly sales sources, Join, PMML classifier, Count leads, bonus allocation, and Failure Traps]
![Page 64: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/64.jpg)
Anatomy of an Enterprise app
multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
![Page 65: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/65.jpg)
references…
Enterprise Data Workflows with Cascading
O’Reilly, 2013
amazon.com/dp/1449358721
newsletter updates:
liber118.com/pxn/
![Page 66: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/66.jpg)
Many thanks to others who have contributed code, ideas, suggestions, etc., to Pattern:
Chris Wensel @ Concurrent
Girish Kathalagiri @ AgilOne
Vijay Srinivas Agneeswaran @ Impetus
Chris Severs @ eBay
Ofer Mendelevitch @ Hortonworks
Sergey Boldyrev @ Nokia
Quinton Anderson @ IZAZI Solutions
Chris Gutierrez @ Airbnb
Villu Ruusmann @ JPMML project
acknowledgements…
![Page 67: Pattern - an open source project for migrating predictive models from SAS, etc., onto Hadoop](https://reader034.vdocuments.us/reader034/viewer/2022042814/54c67f414a79597a728b4586/html5/thumbnails/67.jpg)
blog, developer community, code/wiki/gists, maven repo, commercial products, etc.:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
drill-down…