data workflows for machine learning
TRANSCRIPT
-
7/22/2019 Data Workflows for Machine Learning
1/79
Data Workflows forMachine Learning:
SF Bay Area ML Meetup
2014-04-09
Paco Nathan
@pacoidhttp://liber118.com/pxn/
meetup.com/SF-Bayarea-Machine-Learning/events/173759442/
http://www.meetup.com/SF-Bayarea-Machine-Learning/events/173759442/http://bayareadatageeks.com/http://strataconf.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://liber118.com/pxn/https://twitter.com/pacoid -
7/22/2019 Data Workflows for Machine Learning
2/79
BTW, since its April 9th
Happy #IoTDay !!
http://iotday.org/
(any idea of what this means?)
http://iotday.org/ -
7/22/2019 Data Workflows for Machine Learning
3/79
Why is this talk here?
Machine Learning in production apps is less and less about
algorithms (even though that work is quite fun and vital)Performing real work is more about:! socializing a problem within an organization! feature engineering (Beyond Product Managers)! tournaments in CI/CD environments! operationalizing high-ROI apps at scale! etc.
So Ill just crawl out on a limb and state that leveraging great
frameworks to build data workflows is more important than
chasing after diminishing returns on highly nuanced algorithms.Because Interwebs!
-
7/22/2019 Data Workflows for Machine Learning
4/79
Data Workflows for Machine Learning
Middleware has been evolving for Big Data, and there are some
great examples well review several. Process has been evolving
too, right along with the use cases.Popular frameworks typically provide some Machine Learning
capabilities within their core components, or at least among their
major use cases.Lets consider features from Enterprise Data Workflowsas a basis
for whats needed in Data Workflows for Machine Learning.Their requirements for scale, robustness, cost trade-offs,
interdisciplinary teams, etc., serve as guides in general.
-
7/22/2019 Data Workflows for Machine Learning
5/79
Caveat Auditor
I wont claim to be expert with each of the frameworks and
environments described in this talk. Expert with a few of them
perhaps, but more to the point: embroiled in many use cases. This talk attempts to define a scorecard for evaluating
important ML data workflow features:
whats needed for use cases, compare and contrast of whats
available, plus some indication of which frameworks are likely
to be best for a given scenario.Seriously, this is a work in progress.
-
7/22/2019 Data Workflows for Machine Learning
6/79
Outline
Definition: Machine Learning Definition: Data Workflows A whole bunch o examples across several platforms Nine points to discuss, leading up to a scorecard A crazy little thing called PMML Questions, comments, flying tomatoes
-
7/22/2019 Data Workflows for Machine Learning
7/79
Data Workflows forMachine Learning:
Frame the question
A Basis for Whats Needed
http://footage.shutterstock.com/clip-3063934-stock-footage-a-businessman-steps-into-frame-holding-a-sign-with-a-question-mark-in-front-of-his-face-off.html?src=rel/1251676:2 -
7/22/2019 Data Workflows for Machine Learning
8/79
Machine learning algorithms can figure out how to perform important tasks by generalizingfrom examples. This is often feasible and cost-effective where manual programming is not.
As more data becomes available, more ambitious problemscan be tackled. As a result, machine learning is widely usedin computer science and other fields.Pedro Domingos, U Washington A Few Useful Things to Know about Machine Learning
Definition: Machine Learning
overfitting (varian
underfitting (bias)
perfect classifier
[learning as generalization] =
[representation] + [evaluation] + [optimization]
http://homes.cs.washington.edu/~pedrod/ -
7/22/2019 Data Workflows for Machine Learning
9/79
Definition: Machine Learning Conceptually
1. real-world data2. graph theory for representation
3. convert to sparse matrix for production work
leveraging abstract algebra + func programming
4. cost-effective parallel processing for ML appsat scale, live use cases
-
7/22/2019 Data Workflows for Machine Learning
10/79
Definition: Machine Learning Conceptually
1. real-world data2. graph theory for representation
3. convert to sparse matrix for production work
leveraging abstract algebra + func programming
4. cost-effective parallel processing for ML appsat scale, live use cases
[representation]
[evaluation]
[optimization]
[generalization]
f(x)
g(z)
-
7/22/2019 Data Workflows for Machine Learning
11/79
discover
ydisc
overy
modeling
modeling
integrat
ion
integrat
ionappsapps syst
emssyst
ems
data
science
DataScientist
App Dev
Ops
DomainExpert
Definition: Machine Learning Interdisciplinary Teams, Needs "R
-
7/22/2019 Data Workflows for Machine Learning
12/79
Definition: Machine Learning Process, Feature Engineering
evaluationrepresentation optimization use c
feature engineering
datasources
datasources
data prep
pipelinetrain
test
hold-outs
learners
classifiersclassifiers
classifiers
data
sources
sco
-
7/22/2019 Data Workflows for Machine Learning
13/79
Definition: Machine Learning Process, Tournaments
evaluationrepresentation optimization use c
feature engineering tournaments
datasources
datasources
data prep
pipelinetrain
test
hold-outs
learners
classifiersclassifiers
classifiers
data
sources
sco
quantify benefi risk? opera
obtain other data? improve metadata? refine representation? improve optimization?
iterate with stakeholders?
can models (inference) informfirst principles approaches?
-
7/22/2019 Data Workflows for Machine Learning
14/79
monoid(an alien life form, courtesy of abstract algebra):! binary associative operator! closure within a set of objects! unique identity element
what are those good for?! composable functions in workflows! compiler hints on steroids, to parallelize! reassemble results minimizing bottlenecks at scale! reusable components, like Lego blocks! think: Docker for business logic
Monoidify! Monoids as a Design Principle for Efficient
MapReduce AlgorithmsJimmy Lin, U Maryland/Twitterkudos to Oscar Boykin, Sam Ritchie, et al.
Definition: Machine Learning Invasion of the Monoids
http://arxiv.org/pdf/1304.7544v1.pdf -
7/22/2019 Data Workflows for Machine Learning
15/79
Definition: Machine Learning
evaluationrepresentation optimization use cases
feature engineering tournaments
data
sourcesdata
sourcesdata preppipeline
train
test
hold-outs
learners
classifiersclassifiers
classifiers
datasources
scoring
quantify and me benefit? risk? operational c
obtain other data? improve metadata? refine representation?
improve optimization?
iterate with stakeholders?
can models (inference) inform
first principles approaches?
discover
ydisc
overy
modeling
modeling
integrat
ion
integrat
ionappsapps
-
7/22/2019 Data Workflows for Machine Learning
16/79
Can we fit the required
process into generalized
workflow definitions?
-
7/22/2019 Data Workflows for Machine Learning
17/79
Definition: Data Workflows
Middleware, effectively, is evolving for Big Data and Machine Learning The following design pattern a DAG shows up in many places,
via many frameworks:
ETL data
prep
predictive
model
data
sources
end
use
-
7/22/2019 Data Workflows for Machine Learning
18/79
Definition: Data Workflows
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies
ETL data
prep
predictive
model
data
sources
end
use
ANSI SQL for ETL
-
7/22/2019 Data Workflows for Machine Learning
19/79
Definition: Data Workflows
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies
ETL data
prep
predictive
model
data
sources
end
useJava, Pig, etc., for business logic
-
7/22/2019 Data Workflows for Machine Learning
20/79
Definition: Data Workflows
definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies
ETL data
prep
predictive
model
data
sources
end
use
SAS for pr
-
7/22/2019 Data Workflows for Machine Learning
21/79
most of the licensing costs
-
7/22/2019 Data Workflows for Machine Learning
22/79
most of the project costs
-
7/22/2019 Data Workflows for Machine Learning
23/79
Something emerges
to fill these needs,
for instance
-
7/22/2019 Data Workflows for Machine Learning
24/79
ETL data
prep
predictive
model
data
sources
end
use
Lingual:DW!ANSI SQL Pattern:SAS, R, etc.!PMMLbusiness logic in Java,Clojure, Scala, etc.
sink taps fMemcache
MongoDB
source taps forCassandra, JDBC,
Splunk, etc.
Definition: Data Workflows
For example, Cascading and related projects implement the following
components, based on 100% open source:
a compiler sees it all one connected DAG:
troubleshooting
exception handling
notifications
some optimizations
-
7/22/2019 Data Workflows for Machine Learning
25/79
FlowDefflowDef=FlowDef.flowDe
.setName("etl")
.addSource("example.employee"
.addSource("example.sales",s
.addSink("results",resultsTa
SQLPlannersqlPlanner=newSQLP
.setSql(sqlStatement);flowDef.addAssemblyPlanner(sqlP
http://cascading.org/ -
7/22/2019 Data Workflows for Machine Learning
26/79
FlowDefflowDef=FlowDef.flowDef().setName("classifier").addSource("input",inputTap).addSink("classify",classifyTap);
PMMLPlannerpmmlPlanner=newPMMLPlanner()
.setPMMLInput(newFile(pmmlModel))
.retainOnlyActiveIncomingFields();flowDef.addAssemblyPlanner(pmmlPlanner);
-
7/22/2019 Data Workflows for Machine Learning
27/79
Enterprise Data Workflows with CascadingOReilly, 2013
shop.oreilly.com/product/
0636920028536.do
http://shop.oreilly.com/product/0636920028536.dohttp://shop.oreilly.com/product/0636920028536.do -
7/22/2019 Data Workflows for Machine Learning
28/79
This begs an update
http://shop.oreilly.com/product/0636920028536.dohttp://shop.oreilly.com/product/0636920028536.do -
7/22/2019 Data Workflows for Machine Learning
29/79
Because
http://shop.oreilly.com/product/0636920028536.dohttp://shop.oreilly.com/product/0636920028536.dohttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://ipython.org/notebook.htmlhttp://www.knime.org/ -
7/22/2019 Data Workflows for Machine Learning
30/79
http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/https://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://pandas.pydata.org/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://www.paraccel.com/http://www.mlbase.org/https://code.google.com/p/augustus/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://cascalog.org/https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttp://julialang.org/http://www.rstudio.com/ide/docs/r_markdownhttp://ipython.org/notebook.htmlhttp://www.knime.org/http://mahout.apache.org/http://www.zementis.com/on-the-cloud.htm -
7/22/2019 Data Workflows for Machine Learning
31/79
Data Workflows forMachine Learning:
Examples
Neque per exemplum
http://blog.magnatiles.com/play-and-learn/http://www.knime.org/ -
7/22/2019 Data Workflows for Machine Learning
32/79
Example: KNIME
a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. large number of integrations (over 1000 modules) ranked #1 in customer satisfaction among
open source analytics frameworks visual editing of reusable modules leverage prior work in R, Perl, etc. Eclipse integration easily extended for new integrations
knime.org
http://www.knime.org/http://www.knime.org/http://ipython.org/notebook.html -
7/22/2019 Data Workflows for Machine Learning
33/79
Example: Python stack
Python has much to offer ranging across anorganization, no just for the analytics staff ipython.org
pandas.pydata.org
scikit-learn.org
numpy.org
scipy.org
code.google.com/p/augustus continuum.io
nltk.org
matplotlib.org
http://pandas.pydata.org/https://code.google.com/p/augustus/http://ipython.org/notebook.htmlhttp://matplotlib.org/http://nltk.org/http://www.continuum.io/https://code.google.com/p/augustus/http://scipy.org/http://www.numpy.org/http://scikit-learn.org/stable/http://pandas.pydata.org/http://ipython.org/http://julialang.org/ -
7/22/2019 Data Workflows for Machine Learning
34/79
Example: Julia
a high-level, high-performance dynamic programming languagefor technical computing, with syntax that is familiar to users ofother technical computing environments significantly faster than most alternatives built to leverage parallelism, cloud computing still relatively new one to watch!
importallBasetypeBubbleSort
-
7/22/2019 Data Workflows for Machine Learning
35/79
Example: Summingbird
a library that lets you write streaming MapReduce programs that look like native Scala or Java collection transformationsand execute them on a number of well-known distributedMapReduce platforms like Storm and Scalding. switch between Storm, Scalding (Hadoop) Spark support is in progress leverage Algebird, Storehaus, Matrix API, etc.
github.com/
def wordCount[P
toWords(sentence).map(_ -> 1L)}.sumByKey(store)
E l S ldi
https://engineering.twitter.com/opensource/projects/summingbirdhttps://github.com/twitter/summingbird/wikihttps://github.com/twitter/scalding -
7/22/2019 Data Workflows for Machine Learning
36/79
Example: Scalding
importcom.twitter.scalding._classWordCount(args :Args)extendsJob(args){Tsv(args("doc"),
('doc_id,'text),skipHeader =true).read
.flatMap('text ->'token){text :String=>text.split("[ \\[\\]\\(\\),.]")
}.groupBy('token){_.size('count)}.write(Tsv(args("wc"),writeHeader =true))
}
github.com
E l S ldi ith b
https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://github.com/twitter/scaldinghttps://github.com/twitter/scalding -
7/22/2019 Data Workflows for Machine Learning
37/79
Example: Scalding
extends the Scala collections API so that distributed listsbecome pipes backed by Cascading
code is compact, easy to understand nearly 1:1 between elements of conceptual flow diagram
and function calls extensive libraries are available for linear algebra, abstract
algebra, machine learning e.g.,Matrix API,Algebird, etc. significant investments by Twitter, Etsy, eBay, etc. less learning curve than Cascalog build scripts those take a while to run :\
github.com
Example: Cascalog l
https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://github.com/twitter/scaldinghttp://cascalog.org/ -
7/22/2019 Data Workflows for Machine Learning
38/79
Example: Cascalog
(ns impatient.core(:use[cascalog.api]
[cascalog.more-taps:only(hfs-delimited)])(:require[clojure.string:ass]
[cascalog.ops:asc])(:gen-class))
(defmapcatopsplit[line]"reads in a line of string and splits it by regex"(s/splitline#"[\[\]\\\(\),.)\s]+"))
(defn -main[inout&args](??word)(c/count?count)))
; Paul Lam; github.com/Quantisan/Impatient
cascalog.o
Example: Cascalog cascalo o
http://cascalog.org/http://cascalog.org/http://github.com/Quantisan/Impatienthttp://cascalog.org/ -
7/22/2019 Data Workflows for Machine Learning
39/79
Example: Cascalog
implements Datalogin Clojure, with predicates backedby Cascading for a highly declarative language
run ad-hoc queries from the Clojure REPL approx. 10:1 code reduction compared with SQL
composable subqueries, used for test-driven
development (TDD) practices at scale Leiningenbuild: simple, no surprises, in Clojure itself more new deployments than other Cascading DSLs
Climate Corp is largest use case: 90% Clojure/Cascalog has a learning curve, limited number of Clojure
developers aggregators are the magic, and those take effort to learn
cascalog.o
Example: Apache Spark spark-proj
http://cascalog.org/http://cascalog.org/http://spark-project.org/ -
7/22/2019 Data Workflows for Machine Learning
40/79
Example: Apache Spark
in-memory cluster computing,by Matei Zaharia, Ram Venkataraman, et al. intended to make data analytics fast to write and to run load data into memory and query it repeatedly, much
more quickly than with Hadoop APIs in Scala, Java and Python, shells in Scala and Python Shark (Hive on Spark), Spark Streaming (like Storm), etc. integrations with MLbase, Summingbird (in progress)
// word countvalfile=spark.textFile(hdfs://)valcounts=file.flatMap(line=>line.spl
.map(word=>(word,1))
.reduceByKey(_+_)counts.saveAsTextFile(hdfs://)
spark-proj
Example: MLbase mlbase org
http://spark-project.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://www.mlbase.org/ -
7/22/2019 Data Workflows for Machine Learning
41/79
Example: MLbase
distributed machine learning made easy sponsored by UC Berkeley EECS AMP Lab MLlib common algorithms, low-level, written atop Spark MLI feature extraction, algorithm dev atop MLlib ML Optimizer automates model selection,
compiler/optimizer see article:
http://strata.oreilly.com/2013/02/mlbase-scalable-
machine-learning-made-accessible.html
mlbase.org
data = load("hdfs://path/to/als_clinical")// the features are stored in columns 2-10X = data[, 2 to 10]y = data[, 1]model = do_classify(y, X)
Example: Spark SQL
spark-proj
http://www.mlbase.org/http://www.mlbase.org/http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.htmlhttp://spark-project.org/ -
7/22/2019 Data Workflows for Machine Learning
42/79
Example: Spark SQL
blurs the lines between RDDs and relational tablesintermix SQL commands to querying external data,along with complex analytics, in a single app: allows SQL extensions based on MLlib Shark is being migrated to Spark SQL
Spark SQL: Manipulating Structured Data Using Spark Michael Armbrust, Reynold Xin (2014-03-24) databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-
Spark.html people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
// data can be extracted from exist// e.g., Apache HivevaltrainingDataTable =sql("""SELECT e.action
u.age,u.latitude,u.logitude
FROM Users uJOIN Events eON u.userId = e.userId""")
// since `sql` returns an RDD, the //can be used in MLlibvaltrainingData =trainingDataTablvalfeatures =Array[Double](rowLabeledPoint(row(0),features)
}
valmodel =newLogisticRegressionWithSGD().
spark-proj
Example: Titan thinkaureli
http://spark-project.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.htmlhttp://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.htmlhttp://thinkaurelius.github.io/titan/ -
7/22/2019 Data Workflows for Machine Learning
43/79
Example: Titan
distributed graph database,by Matthias Broecheler, Marko Rodriguez, et al. scalable graph database optimized for storing and
querying graphs supports hundreds of billions of vertices and edges transactional database that can support thousands of
concurrent users executing complex graph traversals supports search through Lucene, ElasticSearch can be backed by HBase, Cassandra, BerkeleyDB TinkerPopnative graph stack with Gremlin, etc.
// who is hercules' paternal grandfather?g.V('name','hercules').out('father').out('father'
thinkaureli
Example: MBrace m-brace.ne
http://thinkaurelius.github.io/titan/http://www.tinkerpop.com/http://www.m-brace.net/ -
7/22/2019 Data Workflows for Machine Learning
44/79
Example: MBrace
a .NET based software stack that enables easy large-scale distributed computation and big data analysis. declarative and concise distributed algorithms in
F# asynchronous workflows scalable for apps in private datacenters or public
clouds, e.g., Windows Azure or Amazon EC2 tooling for interactive, REPL-style deployment,
monitoring, management, debugging in Visual Studio leverages monads (similar to Summingbird) MapReduce as a library, but many patterns beyond
m brace.ne
let rec mapReduce (map: (reduce: 'R -> 'R -> Clo(identity: 'R)(input: 'T list) =cloud {match input with| [] -> return identity| [value] -> return! map| _ ->let left, right = List.slet! r1, r2 =(mapReduce map reduce id(mapReduce map reduce idreturn! reduce r1 r2}
D t W kfl f
http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/ -
7/22/2019 Data Workflows for Machine Learning
45/79
Data Workflows forMachine Learning:
Update the definitions
Nine Points to Discuss
Data Workflows
1
http://www.greatdreams.com/crop/99whthrs.htm -
7/22/2019 Data Workflows for Machine Learning
46/79
Formally speaking, workflows include people(cross-dept)
and define oversight for exceptional dataotherwise, data flowor pipelinewould be more aptexamples include Cascadingtraps, with exceptional tuplesrouted to a review organization, e.g., Customer Support
includes peo
oversight fo
HadoopCluster
source
tap
source
tap sinktap
traptap
customerprofile DBsCustomer
Prefs
logslogs
Logs
DataWorkflow
Cache
Customers
Support
WebApp
Reporting
AnalyticsCubes
sink
tap
Modeling PMML
1
Data Workflows 1+
http://www.cascading.org/ -
7/22/2019 Data Workflows for Machine Learning
47/79
Footnote see also an excellent article related to this point
and the next:Therbligs for data science:
A nuts and bolts framework for accelerating data workAbe [email protected]/2014/03/therbligs-for-data-science.html
strataconf.com/strata2014/public/schedule/detail/32291
brighttalk.com/webcast/9059/105119
includes peo
oversight fo
1+
Data Workflows 2
https://www.brighttalk.com/webcast/9059/105119http://strataconf.com/strata2014/public/schedule/detail/32291http://blog.abegong.com/2014/03/therbligs-for-data-science.html -
7/22/2019 Data Workflows for Machine Learning
48/79
Workflows impose a separation of concerns, allowing
for multiple abstraction layersin the tech stackspecify whatis required, not howto accomplish itarticulating the business logic Cascadingleverages pattern languagerelated notions from Knuth, literate programmingnot unlike BPM/BPEL for Big Dataexamples: IPython, R Markdown, etc.
separation o
for literate p
2
Data Workflows 2+
http://www.patternlanguage.com/http://www.literateprogramming.com/http://www.rstudio.com/ide/docs/r_markdownhttp://ipython.org/notebook.htmlhttp://en.wikipedia.org/wiki/Literate_programminghttp://www.cascading.org/ -
7/22/2019 Data Workflows for Machine Learning
49/79
Footnote while discussing separation of concerns and a general
design pattern, lets consider data workflows in a
broader context of probabilistic programming, sincebusiness data is heterogeneous and structured
the data for every domain is heterogeneous.
Paraphrasing Ginsberg:Ive seen the best minds of my generation destroyed by
madness, dragging themselves through quagmires of large
LCD screens filled with Intellij debuggers and databasecommand lines, yearning to fit real-world data into their
preferred deterministic tools
separation o
for literate p
2+
Probabilistic Programming:
Why, What, How, When Beau Cronin
Strata SC (2014) speakerdeck.com/beprobabilistic-programstrata-santa-clara-20
Why Probabilistic ProgramMatters
Rob ZinkovConvex Optimized (201zinkov.com/posts/2012-06-27-why-proprogramming-matte
Data Workflows 3
http://zinkov.com/posts/2012-06-27-why-prob-programming-matters/https://speakerdeck.com/beaucronin/probabilistic-programming-strata-santa-clara-2014 -
7/22/2019 Data Workflows for Machine Learning
50/79
Multiple abstraction layersin the tech stack are
needed, but still emergingfeedback loops based on machine dataoptimizers feed on abstractionmetadata accounting:
track data lineage propagate schema model / feature usage ensemble performance
app history accounting: util stats, mixing workloads heavy-hitters, bottlenecks throughput, latency
multiple abs
for metadat
and optimiz
ClusterScheduler
Planners/
Optimizers
Clusters
DSLs
MachineData
app history
util stats
bottlenecksMixedTopologies
BusinessProcess
ReusableComponents
PortableModels
data lineage
schema propagation
feature selection
tournaments
3
-
7/22/2019 Data Workflows for Machine Learning
51/79
Um, will compilers
ever look like this??
Data Workflows 4
-
7/22/2019 Data Workflows for Machine Learning
52/79
Workflows must provide for test, which is no simple
mattertesting is required on three abstraction layers:
model portability/evaluation, ensembles, tournaments TDD in reusable components and business process continuous integration / continuous deployment
examples: Cascalogfor TDD,PMMLfor model evalstill so much more to improvekeep in mind that workflows involve people, too
testing: mod
TDD, app d
4
ClusterScheduler
Planners/
Optimizers
Clusters
DSLs
M
app h
util st bottleMixed
Topologies
BusinessProcess
Reusable
Components
Portable
Models
d
s fe to
Data Workflows
5
http://www.dmg.org/v4-1/GeneralStructure.htmlhttp://cascalog.org/ -
7/22/2019 Data Workflows for Machine Learning
53/79
Workflows enable system integrationfuture-proof integrations and scale-outbuild componentsand apps, not jobs and command linesallow compiler optimizations across the DAG,i.e., cross-dept contributionsminimize operationalization costs:
troubleshooting, debugging at scale exception handling notifications, instrumentation
examples: KNIME, Cascading, etc.
future-proof
integration,
HadoopCluster
source
tap
source
tap sink
tap
trap
tap
customerprofile DBsCustomer
Prefs
logslogs
Logs
DataWorkflow
Cache
Customers
Support
WebApp
Reporting
AnalyticsCubes
sink
tap
Modeling PMML
5
Data Workflows 6
http://www.cascading.org/http://www.knime.org/ -
7/22/2019 Data Workflows for Machine Learning
54/79
Visualizing workflows, what a great idea.the practical basis for:
collaboration rapid prototyping component reuse
examples: KNIMEwins best in categoryCascadinggenerates flow diagrams,
which are a nice start
visualizing a
to collabora
6
Data Workflows
7
http://www.cascading.org/http://www.knime.org/ -
7/22/2019 Data Workflows for Machine Learning
55/79
Abstract algebra, containerizing workflow metadata
monoids, semigroups, etc., allow for reusable components
with well-defined properties for running in parallel at scalelet business process be agnostic about underlying topologies,with analogy to Linux containers (Docker, etc.)compose functions, take advantage of sparsity (monoids)because data is fully functional FP for s/w eng benefitsaggregators are almost always magic, now we can solve
for common use casesread Monoidify! Monoids as a Design Principle for Efficient
MapReduce Algorithmsby Jimmy LinCascadingintroduced some notions of this circa 2007,
but didnt make it explicitexamples: Algebirdin Summingbird, Simmer, Spark, etc.
abstract alg
functional p
containerize
7
Data Workflows 7+
http://spark.incubator.apache.org/https://github.com/avibryant/simmerhttps://github.com/twitter/summingbird/wikihttps://engineering.twitter.com/opensource/projects/algebirdhttp://www.cascading.org/http://arxiv.org/pdf/1304.7544v1.pdf -
7/22/2019 Data Workflows for Machine Learning
56/79
Footnote see also two excellent talks related to this point:
Algebra for Analyticsspeakerdeck.com/johnynek/algebra-for-analyticsOscar Boykin, Strata SC (2014)Add ALL the Things:Abstract Algebra Meets Analytics
infoq.com/presentations/abstract-algebra-analytics Avi Bryant, Strange Loop (2013)
abstract alg
functional p
containerize
7+
Data Workflows
bl d l
8
http://www.infoq.com/presentations/abstract-algebra-analyticshttps://speakerdeck.com/johnynek/algebra-for-analytics -
7/22/2019 Data Workflows for Machine Learning
57/79
Workflow needs vary in time, and need to blend timesomething something batch, something something low-latencyscheduling batch isnt hard; scheduling low-latency is
computationally brutal (see the Omegapaper)because people like saying Lambda Architecture,
it gives them goosebumps, or somethingbecause real-timemeans so many different thingsbecause batch windows are so 1964examples: Summingbird, Oryx
blend result
time scales:
latency
8
Big Data
Nathan Marz,manning.com
Data Workflows
i i l9
http://www.manning.com/marz/http://www.manning.com/marz/http://www.manning.com/marz/https://github.com/cloudera/oryxhttps://engineering.twitter.com/opensource/projects/summingbirdhttp://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf -
7/22/2019 Data Workflows for Machine Learning
58/79
Workflows may define contexts in which model selection
possibly becomes a compiler problemfor example, see Boyd, Parikh, et al.ultimately optimizing for loss function+ regularization termperhaps not ready for prime-time immediatelyexamples: MLbase
optimize lea
to make mo
potentially a
9
f(x): loss functiong(z): regularization term
Data Workflows bon
http://www.mlbase.org/http://www.stanford.edu/~boyd/papers/admm_distr_stats.html -
7/22/2019 Data Workflows for Machine Learning
59/79
For one of the best papers about what workflows really
truly require, see Out of the Tar Pitby Moseley and Marks
bon
Nine Points for Data Workflows a wish list
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928 -
7/22/2019 Data Workflows for Machine Learning
60/79
1. includes people, defines oversight for exceptional data
2. separation of concerns, allows for literate programming
3. multiple abstraction layers for metadata, feedback, and
optimization4. testing: model evaluation, TDD, app deployment
5. future-proof system integration, scale-out, ops
6. visualizing workflows allows people to collaborate
through code
7. abstract algebra and functional programming
containerize business process
8. blend results from different time scales:
batch plus low-latency
9. optimize learners in context, to make model
selection potentially a compiler problem
ClusterScheduler
Planners/Optimizers
Clusters
DSLs
MixedTopologies
BusinessProcess
ReusableComponents
PortableModels
https://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://ipython.org/notebook.htmlhttp://www.knime.org/ -
7/22/2019 Data Workflows for Machine Learning
61/79
Nine Points for Data Workflows a scorecard
http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/https://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://pandas.pydata.org/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://www.paraccel.com/http://www.mlbase.org/https://code.google.com/p/augustus/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://cascalog.org/https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.c