data workflows for machine learning

Upload: minoru-osorio-garcia

Post on 08-Feb-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/22/2019 Data Workflows for Machine Learning

    1/79

    Data Workflows forMachine Learning:

    SF Bay Area ML Meetup

    2014-04-09

    Paco Nathan

    @pacoidhttp://liber118.com/pxn/

    meetup.com/SF-Bayarea-Machine-Learning/events/173759442/

    http://www.meetup.com/SF-Bayarea-Machine-Learning/events/173759442/http://bayareadatageeks.com/http://strataconf.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://thedataguild.com/http://liber118.com/pxn/https://twitter.com/pacoid
  • 7/22/2019 Data Workflows for Machine Learning

    2/79

    BTW, since its April 9th

    Happy #IoTDay !!

    http://iotday.org/

    (any idea of what this means?)

    http://iotday.org/
  • 7/22/2019 Data Workflows for Machine Learning

    3/79

    Why is this talk here?

    Machine Learning in production apps is less and less about

    algorithms (even though that work is quite fun and vital)Performing real work is more about:! socializing a problem within an organization! feature engineering (Beyond Product Managers)! tournaments in CI/CD environments! operationalizing high-ROI apps at scale! etc.

    So Ill just crawl out on a limb and state that leveraging great

    frameworks to build data workflows is more important than

    chasing after diminishing returns on highly nuanced algorithms.Because Interwebs!

  • 7/22/2019 Data Workflows for Machine Learning

    4/79

    Data Workflows for Machine Learning

    Middleware has been evolving for Big Data, and there are some

    great examples well review several. Process has been evolving

    too, right along with the use cases.Popular frameworks typically provide some Machine Learning

    capabilities within their core components, or at least among their

    major use cases.Lets consider features from Enterprise Data Workflowsas a basis

    for whats needed in Data Workflows for Machine Learning.Their requirements for scale, robustness, cost trade-offs,

    interdisciplinary teams, etc., serve as guides in general.

  • 7/22/2019 Data Workflows for Machine Learning

    5/79

    Caveat Auditor

    I wont claim to be expert with each of the frameworks and

    environments described in this talk. Expert with a few of them

    perhaps, but more to the point: embroiled in many use cases. This talk attempts to define a scorecard for evaluating

    important ML data workflow features:

    whats needed for use cases, compare and contrast of whats

    available, plus some indication of which frameworks are likely

    to be best for a given scenario.Seriously, this is a work in progress.

  • 7/22/2019 Data Workflows for Machine Learning

    6/79

    Outline

    Definition: Machine Learning Definition: Data Workflows A whole bunch o examples across several platforms Nine points to discuss, leading up to a scorecard A crazy little thing called PMML Questions, comments, flying tomatoes

  • 7/22/2019 Data Workflows for Machine Learning

    7/79

    Data Workflows forMachine Learning:

    Frame the question

    A Basis for Whats Needed

    http://footage.shutterstock.com/clip-3063934-stock-footage-a-businessman-steps-into-frame-holding-a-sign-with-a-question-mark-in-front-of-his-face-off.html?src=rel/1251676:2
  • 7/22/2019 Data Workflows for Machine Learning

    8/79

    Machine learning algorithms can figure out how to perform important tasks by generalizingfrom examples. This is often feasible and cost-effective where manual programming is not.

    As more data becomes available, more ambitious problemscan be tackled. As a result, machine learning is widely usedin computer science and other fields.Pedro Domingos, U Washington A Few Useful Things to Know about Machine Learning

    Definition: Machine Learning

    overfitting (varian

    underfitting (bias)

    perfect classifier

    [learning as generalization] =

    [representation] + [evaluation] + [optimization]

    http://homes.cs.washington.edu/~pedrod/
  • 7/22/2019 Data Workflows for Machine Learning

    9/79

    Definition: Machine Learning Conceptually

    1. real-world data2. graph theory for representation

    3. convert to sparse matrix for production work

    leveraging abstract algebra + func programming

    4. cost-effective parallel processing for ML appsat scale, live use cases

  • 7/22/2019 Data Workflows for Machine Learning

    10/79

    Definition: Machine Learning Conceptually

    1. real-world data2. graph theory for representation

    3. convert to sparse matrix for production work

    leveraging abstract algebra + func programming

    4. cost-effective parallel processing for ML appsat scale, live use cases

    [representation]

    [evaluation]

    [optimization]

    [generalization]

    f(x)

    g(z)

  • 7/22/2019 Data Workflows for Machine Learning

    11/79

    discover

    ydisc

    overy

    modeling

    modeling

    integrat

    ion

    integrat

    ionappsapps syst

    emssyst

    ems

    data

    science

    DataScientist

    App Dev

    Ops

    DomainExpert

    Definition: Machine Learning Interdisciplinary Teams, Needs "R

  • 7/22/2019 Data Workflows for Machine Learning

    12/79

    Definition: Machine Learning Process, Feature Engineering

    evaluationrepresentation optimization use c

    feature engineering

    datasources

    datasources

    data prep

    pipelinetrain

    test

    hold-outs

    learners

    classifiersclassifiers

    classifiers

    data

    sources

    sco

  • 7/22/2019 Data Workflows for Machine Learning

    13/79

    Definition: Machine Learning Process, Tournaments

    evaluationrepresentation optimization use c

    feature engineering tournaments

    datasources

    datasources

    data prep

    pipelinetrain

    test

    hold-outs

    learners

    classifiersclassifiers

    classifiers

    data

    sources

    sco

    quantify benefi risk? opera

    obtain other data? improve metadata? refine representation? improve optimization?

    iterate with stakeholders?

    can models (inference) informfirst principles approaches?

  • 7/22/2019 Data Workflows for Machine Learning

    14/79

    monoid(an alien life form, courtesy of abstract algebra):! binary associative operator! closure within a set of objects! unique identity element

    what are those good for?! composable functions in workflows! compiler hints on steroids, to parallelize! reassemble results minimizing bottlenecks at scale! reusable components, like Lego blocks! think: Docker for business logic

    Monoidify! Monoids as a Design Principle for Efficient

    MapReduce AlgorithmsJimmy Lin, U Maryland/Twitterkudos to Oscar Boykin, Sam Ritchie, et al.

    Definition: Machine Learning Invasion of the Monoids

    http://arxiv.org/pdf/1304.7544v1.pdf
  • 7/22/2019 Data Workflows for Machine Learning

    15/79

    Definition: Machine Learning

    evaluationrepresentation optimization use cases

    feature engineering tournaments

    data

    sourcesdata

    sourcesdata preppipeline

    train

    test

    hold-outs

    learners

    classifiersclassifiers

    classifiers

    datasources

    scoring

    quantify and me benefit? risk? operational c

    obtain other data? improve metadata? refine representation?

    improve optimization?

    iterate with stakeholders?

    can models (inference) inform

    first principles approaches?

    discover

    ydisc

    overy

    modeling

    modeling

    integrat

    ion

    integrat

    ionappsapps

  • 7/22/2019 Data Workflows for Machine Learning

    16/79

    Can we fit the required

    process into generalized

    workflow definitions?

  • 7/22/2019 Data Workflows for Machine Learning

    17/79

    Definition: Data Workflows

    Middleware, effectively, is evolving for Big Data and Machine Learning The following design pattern a DAG shows up in many places,

    via many frameworks:

    ETL data

    prep

    predictive

    model

    data

    sources

    end

    use

  • 7/22/2019 Data Workflows for Machine Learning

    18/79

    Definition: Data Workflows

    definition of a typical Enterprise workflow which crosses through

    multiple departments, languages, and technologies

    ETL data

    prep

    predictive

    model

    data

    sources

    end

    use

    ANSI SQL for ETL

  • 7/22/2019 Data Workflows for Machine Learning

    19/79

    Definition: Data Workflows

    definition of a typical Enterprise workflow which crosses through

    multiple departments, languages, and technologies

    ETL data

    prep

    predictive

    model

    data

    sources

    end

    useJava, Pig, etc., for business logic

  • 7/22/2019 Data Workflows for Machine Learning

    20/79

    Definition: Data Workflows

    definition of a typical Enterprise workflow which crosses through

    multiple departments, languages, and technologies

    ETL data

    prep

    predictive

    model

    data

    sources

    end

    use

    SAS for pr

  • 7/22/2019 Data Workflows for Machine Learning

    21/79

    most of the licensing costs

  • 7/22/2019 Data Workflows for Machine Learning

    22/79

    most of the project costs

  • 7/22/2019 Data Workflows for Machine Learning

    23/79

    Something emerges

    to fill these needs,

    for instance

  • 7/22/2019 Data Workflows for Machine Learning

    24/79

    ETL data

    prep

    predictive

    model

    data

    sources

    end

    use

    Lingual:DW!ANSI SQL Pattern:SAS, R, etc.!PMMLbusiness logic in Java,Clojure, Scala, etc.

    sink taps fMemcache

    MongoDB

    source taps forCassandra, JDBC,

    Splunk, etc.

    Definition: Data Workflows

    For example, Cascading and related projects implement the following

    components, based on 100% open source:

    a compiler sees it all one connected DAG:

    troubleshooting

    exception handling

    notifications

    some optimizations

  • 7/22/2019 Data Workflows for Machine Learning

    25/79

    FlowDefflowDef=FlowDef.flowDe

    .setName("etl")

    .addSource("example.employee"

    .addSource("example.sales",s

    .addSink("results",resultsTa

    SQLPlannersqlPlanner=newSQLP

    .setSql(sqlStatement);flowDef.addAssemblyPlanner(sqlP

    http://cascading.org/
  • 7/22/2019 Data Workflows for Machine Learning

    26/79

    FlowDefflowDef=FlowDef.flowDef().setName("classifier").addSource("input",inputTap).addSink("classify",classifyTap);

    PMMLPlannerpmmlPlanner=newPMMLPlanner()

    .setPMMLInput(newFile(pmmlModel))

    .retainOnlyActiveIncomingFields();flowDef.addAssemblyPlanner(pmmlPlanner);

  • 7/22/2019 Data Workflows for Machine Learning

    27/79

    Enterprise Data Workflows with CascadingOReilly, 2013

    shop.oreilly.com/product/

    0636920028536.do

    http://shop.oreilly.com/product/0636920028536.dohttp://shop.oreilly.com/product/0636920028536.do
  • 7/22/2019 Data Workflows for Machine Learning

    28/79

    This begs an update

    http://shop.oreilly.com/product/0636920028536.dohttp://shop.oreilly.com/product/0636920028536.do
  • 7/22/2019 Data Workflows for Machine Learning

    29/79

    Because

    http://shop.oreilly.com/product/0636920028536.dohttp://shop.oreilly.com/product/0636920028536.dohttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://ipython.org/notebook.htmlhttp://www.knime.org/
  • 7/22/2019 Data Workflows for Machine Learning

    30/79

    http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/https://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://pandas.pydata.org/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://www.paraccel.com/http://www.mlbase.org/https://code.google.com/p/augustus/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://cascalog.org/https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttp://julialang.org/http://www.rstudio.com/ide/docs/r_markdownhttp://ipython.org/notebook.htmlhttp://www.knime.org/http://mahout.apache.org/http://www.zementis.com/on-the-cloud.htm
  • 7/22/2019 Data Workflows for Machine Learning

    31/79

    Data Workflows forMachine Learning:

    Examples

    Neque per exemplum

    http://blog.magnatiles.com/play-and-learn/http://www.knime.org/
  • 7/22/2019 Data Workflows for Machine Learning

    32/79

    Example: KNIME

    a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. large number of integrations (over 1000 modules) ranked #1 in customer satisfaction among

    open source analytics frameworks visual editing of reusable modules leverage prior work in R, Perl, etc. Eclipse integration easily extended for new integrations

    knime.org

    http://www.knime.org/http://www.knime.org/http://ipython.org/notebook.html
  • 7/22/2019 Data Workflows for Machine Learning

    33/79

    Example: Python stack

    Python has much to offer ranging across anorganization, no just for the analytics staff ipython.org

    pandas.pydata.org

    scikit-learn.org

    numpy.org

    scipy.org

    code.google.com/p/augustus continuum.io

    nltk.org

    matplotlib.org

    http://pandas.pydata.org/https://code.google.com/p/augustus/http://ipython.org/notebook.htmlhttp://matplotlib.org/http://nltk.org/http://www.continuum.io/https://code.google.com/p/augustus/http://scipy.org/http://www.numpy.org/http://scikit-learn.org/stable/http://pandas.pydata.org/http://ipython.org/http://julialang.org/
  • 7/22/2019 Data Workflows for Machine Learning

    34/79

    Example: Julia

    a high-level, high-performance dynamic programming languagefor technical computing, with syntax that is familiar to users ofother technical computing environments significantly faster than most alternatives built to leverage parallelism, cloud computing still relatively new one to watch!

    importallBasetypeBubbleSort

  • 7/22/2019 Data Workflows for Machine Learning

    35/79

    Example: Summingbird

    a library that lets you write streaming MapReduce programs that look like native Scala or Java collection transformationsand execute them on a number of well-known distributedMapReduce platforms like Storm and Scalding. switch between Storm, Scalding (Hadoop) Spark support is in progress leverage Algebird, Storehaus, Matrix API, etc.

    github.com/

    def wordCount[P

    toWords(sentence).map(_ -> 1L)}.sumByKey(store)

    E l S ldi

    https://engineering.twitter.com/opensource/projects/summingbirdhttps://github.com/twitter/summingbird/wikihttps://github.com/twitter/scalding
  • 7/22/2019 Data Workflows for Machine Learning

    36/79

    Example: Scalding

    importcom.twitter.scalding._classWordCount(args :Args)extendsJob(args){Tsv(args("doc"),

    ('doc_id,'text),skipHeader =true).read

    .flatMap('text ->'token){text :String=>text.split("[ \\[\\]\\(\\),.]")

    }.groupBy('token){_.size('count)}.write(Tsv(args("wc"),writeHeader =true))

    }

    github.com

    E l S ldi ith b

    https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://github.com/twitter/scaldinghttps://github.com/twitter/scalding
  • 7/22/2019 Data Workflows for Machine Learning

    37/79

    Example: Scalding

    extends the Scala collections API so that distributed listsbecome pipes backed by Cascading

    code is compact, easy to understand nearly 1:1 between elements of conceptual flow diagram

    and function calls extensive libraries are available for linear algebra, abstract

    algebra, machine learning e.g.,Matrix API,Algebird, etc. significant investments by Twitter, Etsy, eBay, etc. less learning curve than Cascalog build scripts those take a while to run :\

    github.com

    Example: Cascalog l

    https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://github.com/twitter/scaldinghttp://cascalog.org/
  • 7/22/2019 Data Workflows for Machine Learning

    38/79

    Example: Cascalog

    (ns impatient.core(:use[cascalog.api]

    [cascalog.more-taps:only(hfs-delimited)])(:require[clojure.string:ass]

    [cascalog.ops:asc])(:gen-class))

    (defmapcatopsplit[line]"reads in a line of string and splits it by regex"(s/splitline#"[\[\]\\\(\),.)\s]+"))

    (defn -main[inout&args](??word)(c/count?count)))

    ; Paul Lam; github.com/Quantisan/Impatient

    cascalog.o

    Example: Cascalog cascalo o

    http://cascalog.org/http://cascalog.org/http://github.com/Quantisan/Impatienthttp://cascalog.org/
  • 7/22/2019 Data Workflows for Machine Learning

    39/79

    Example: Cascalog

    implements Datalogin Clojure, with predicates backedby Cascading for a highly declarative language

    run ad-hoc queries from the Clojure REPL approx. 10:1 code reduction compared with SQL

    composable subqueries, used for test-driven

    development (TDD) practices at scale Leiningenbuild: simple, no surprises, in Clojure itself more new deployments than other Cascading DSLs

    Climate Corp is largest use case: 90% Clojure/Cascalog has a learning curve, limited number of Clojure

    developers aggregators are the magic, and those take effort to learn

    cascalog.o

    Example: Apache Spark spark-proj

    http://cascalog.org/http://cascalog.org/http://spark-project.org/
  • 7/22/2019 Data Workflows for Machine Learning

    40/79

    Example: Apache Spark

    in-memory cluster computing,by Matei Zaharia, Ram Venkataraman, et al. intended to make data analytics fast to write and to run load data into memory and query it repeatedly, much

    more quickly than with Hadoop APIs in Scala, Java and Python, shells in Scala and Python Shark (Hive on Spark), Spark Streaming (like Storm), etc. integrations with MLbase, Summingbird (in progress)

    // word countvalfile=spark.textFile(hdfs://)valcounts=file.flatMap(line=>line.spl

    .map(word=>(word,1))

    .reduceByKey(_+_)counts.saveAsTextFile(hdfs://)

    spark-proj

    Example: MLbase mlbase org

    http://spark-project.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://www.mlbase.org/
  • 7/22/2019 Data Workflows for Machine Learning

    41/79

    Example: MLbase

    distributed machine learning made easy sponsored by UC Berkeley EECS AMP Lab MLlib common algorithms, low-level, written atop Spark MLI feature extraction, algorithm dev atop MLlib ML Optimizer automates model selection,

    compiler/optimizer see article:

    http://strata.oreilly.com/2013/02/mlbase-scalable-

    machine-learning-made-accessible.html

    mlbase.org

    data = load("hdfs://path/to/als_clinical")// the features are stored in columns 2-10X = data[, 2 to 10]y = data[, 1]model = do_classify(y, X)

    Example: Spark SQL

    spark-proj

    http://www.mlbase.org/http://www.mlbase.org/http://strata.oreilly.com/2013/02/mlbase-scalable-machine-learning-made-accessible.htmlhttp://spark-project.org/
  • 7/22/2019 Data Workflows for Machine Learning

    42/79

    Example: Spark SQL

    blurs the lines between RDDs and relational tablesintermix SQL commands to querying external data,along with complex analytics, in a single app: allows SQL extensions based on MLlib Shark is being migrated to Spark SQL

    Spark SQL: Manipulating Structured Data Using Spark Michael Armbrust, Reynold Xin (2014-03-24) databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-

    Spark.html people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

    // data can be extracted from exist// e.g., Apache HivevaltrainingDataTable =sql("""SELECT e.action

    u.age,u.latitude,u.logitude

    FROM Users uJOIN Events eON u.userId = e.userId""")

    // since `sql` returns an RDD, the //can be used in MLlibvaltrainingData =trainingDataTablvalfeatures =Array[Double](rowLabeledPoint(row(0),features)

    }

    valmodel =newLogisticRegressionWithSGD().

    spark-proj

    Example: Titan thinkaureli

    http://spark-project.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.htmlhttp://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.htmlhttp://thinkaurelius.github.io/titan/
  • 7/22/2019 Data Workflows for Machine Learning

    43/79

    Example: Titan

    distributed graph database,by Matthias Broecheler, Marko Rodriguez, et al. scalable graph database optimized for storing and

    querying graphs supports hundreds of billions of vertices and edges transactional database that can support thousands of

    concurrent users executing complex graph traversals supports search through Lucene, ElasticSearch can be backed by HBase, Cassandra, BerkeleyDB TinkerPopnative graph stack with Gremlin, etc.

    // who is hercules' paternal grandfather?g.V('name','hercules').out('father').out('father'

    thinkaureli

    Example: MBrace m-brace.ne

    http://thinkaurelius.github.io/titan/http://www.tinkerpop.com/http://www.m-brace.net/
  • 7/22/2019 Data Workflows for Machine Learning

    44/79

    Example: MBrace

    a .NET based software stack that enables easy large-scale distributed computation and big data analysis. declarative and concise distributed algorithms in

    F# asynchronous workflows scalable for apps in private datacenters or public

    clouds, e.g., Windows Azure or Amazon EC2 tooling for interactive, REPL-style deployment,

    monitoring, management, debugging in Visual Studio leverages monads (similar to Summingbird) MapReduce as a library, but many patterns beyond

    m brace.ne

    let rec mapReduce (map: (reduce: 'R -> 'R -> Clo(identity: 'R)(input: 'T list) =cloud {match input with| [] -> return identity| [value] -> return! map| _ ->let left, right = List.slet! r1, r2 =(mapReduce map reduce id(mapReduce map reduce idreturn! reduce r1 r2}

    D t W kfl f

    http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/
  • 7/22/2019 Data Workflows for Machine Learning

    45/79

    Data Workflows forMachine Learning:

    Update the definitions

    Nine Points to Discuss

    Data Workflows

    1

    http://www.greatdreams.com/crop/99whthrs.htm
  • 7/22/2019 Data Workflows for Machine Learning

    46/79

    Formally speaking, workflows include people(cross-dept)

    and define oversight for exceptional dataotherwise, data flowor pipelinewould be more aptexamples include Cascadingtraps, with exceptional tuplesrouted to a review organization, e.g., Customer Support

    includes peo

    oversight fo

    HadoopCluster

    source

    tap

    source

    tap sinktap

    traptap

    customerprofile DBsCustomer

    Prefs

    logslogs

    Logs

    DataWorkflow

    Cache

    Customers

    Support

    WebApp

    Reporting

    AnalyticsCubes

    sink

    tap

    Modeling PMML

    1

    Data Workflows 1+

    http://www.cascading.org/
  • 7/22/2019 Data Workflows for Machine Learning

    47/79

    Footnote see also an excellent article related to this point

    and the next:Therbligs for data science:

    A nuts and bolts framework for accelerating data workAbe [email protected]/2014/03/therbligs-for-data-science.html

    strataconf.com/strata2014/public/schedule/detail/32291

    brighttalk.com/webcast/9059/105119

    includes peo

    oversight fo

    1+

    Data Workflows 2

    https://www.brighttalk.com/webcast/9059/105119http://strataconf.com/strata2014/public/schedule/detail/32291http://blog.abegong.com/2014/03/therbligs-for-data-science.html
  • 7/22/2019 Data Workflows for Machine Learning

    48/79

    Workflows impose a separation of concerns, allowing

    for multiple abstraction layersin the tech stackspecify whatis required, not howto accomplish itarticulating the business logic Cascadingleverages pattern languagerelated notions from Knuth, literate programmingnot unlike BPM/BPEL for Big Dataexamples: IPython, R Markdown, etc.

    separation o

    for literate p

    2

    Data Workflows 2+

    http://www.patternlanguage.com/http://www.literateprogramming.com/http://www.rstudio.com/ide/docs/r_markdownhttp://ipython.org/notebook.htmlhttp://en.wikipedia.org/wiki/Literate_programminghttp://www.cascading.org/
  • 7/22/2019 Data Workflows for Machine Learning

    49/79

    Footnote while discussing separation of concerns and a general

    design pattern, lets consider data workflows in a

    broader context of probabilistic programming, sincebusiness data is heterogeneous and structured

    the data for every domain is heterogeneous.

    Paraphrasing Ginsberg:Ive seen the best minds of my generation destroyed by

    madness, dragging themselves through quagmires of large

    LCD screens filled with Intellij debuggers and databasecommand lines, yearning to fit real-world data into their

    preferred deterministic tools

    separation o

    for literate p

    2+

    Probabilistic Programming:

    Why, What, How, When Beau Cronin

    Strata SC (2014) speakerdeck.com/beprobabilistic-programstrata-santa-clara-20

    Why Probabilistic ProgramMatters

    Rob ZinkovConvex Optimized (201zinkov.com/posts/2012-06-27-why-proprogramming-matte

    Data Workflows 3

    http://zinkov.com/posts/2012-06-27-why-prob-programming-matters/https://speakerdeck.com/beaucronin/probabilistic-programming-strata-santa-clara-2014
  • 7/22/2019 Data Workflows for Machine Learning

    50/79

    Multiple abstraction layersin the tech stack are

    needed, but still emergingfeedback loops based on machine dataoptimizers feed on abstractionmetadata accounting:

    track data lineage propagate schema model / feature usage ensemble performance

    app history accounting: util stats, mixing workloads heavy-hitters, bottlenecks throughput, latency

    multiple abs

    for metadat

    and optimiz

    ClusterScheduler

    Planners/

    Optimizers

    Clusters

    DSLs

    MachineData

    app history

    util stats

    bottlenecksMixedTopologies

    BusinessProcess

    ReusableComponents

    PortableModels

    data lineage

    schema propagation

    feature selection

    tournaments

    3

  • 7/22/2019 Data Workflows for Machine Learning

    51/79

    Um, will compilers

    ever look like this??

    Data Workflows 4

  • 7/22/2019 Data Workflows for Machine Learning

    52/79

    Workflows must provide for test, which is no simple

    mattertesting is required on three abstraction layers:

    model portability/evaluation, ensembles, tournaments TDD in reusable components and business process continuous integration / continuous deployment

    examples: Cascalogfor TDD,PMMLfor model evalstill so much more to improvekeep in mind that workflows involve people, too

    testing: mod

    TDD, app d

    4

    ClusterScheduler

    Planners/

    Optimizers

    Clusters

    DSLs

    M

    app h

    util st bottleMixed

    Topologies

    BusinessProcess

    Reusable

    Components

    Portable

    Models

    d

    s fe to

    Data Workflows

    5

    http://www.dmg.org/v4-1/GeneralStructure.htmlhttp://cascalog.org/
  • 7/22/2019 Data Workflows for Machine Learning

    53/79

    Workflows enable system integrationfuture-proof integrations and scale-outbuild componentsand apps, not jobs and command linesallow compiler optimizations across the DAG,i.e., cross-dept contributionsminimize operationalization costs:

    troubleshooting, debugging at scale exception handling notifications, instrumentation

    examples: KNIME, Cascading, etc.

    future-proof

    integration,

    HadoopCluster

    source

    tap

    source

    tap sink

    tap

    trap

    tap

    customerprofile DBsCustomer

    Prefs

    logslogs

    Logs

    DataWorkflow

    Cache

    Customers

    Support

    WebApp

    Reporting

    AnalyticsCubes

    sink

    tap

    Modeling PMML

    5

    Data Workflows 6

    http://www.cascading.org/http://www.knime.org/
  • 7/22/2019 Data Workflows for Machine Learning

    54/79

    Visualizing workflows, what a great idea.the practical basis for:

    collaboration rapid prototyping component reuse

    examples: KNIMEwins best in categoryCascadinggenerates flow diagrams,

    which are a nice start

    visualizing a

    to collabora

    6

    Data Workflows

    7

    http://www.cascading.org/http://www.knime.org/
  • 7/22/2019 Data Workflows for Machine Learning

    55/79

    Abstract algebra, containerizing workflow metadata

    monoids, semigroups, etc., allow for reusable components

    with well-defined properties for running in parallel at scalelet business process be agnostic about underlying topologies,with analogy to Linux containers (Docker, etc.)compose functions, take advantage of sparsity (monoids)because data is fully functional FP for s/w eng benefitsaggregators are almost always magic, now we can solve

    for common use casesread Monoidify! Monoids as a Design Principle for Efficient

    MapReduce Algorithmsby Jimmy LinCascadingintroduced some notions of this circa 2007,

    but didnt make it explicitexamples: Algebirdin Summingbird, Simmer, Spark, etc.

    abstract alg

    functional p

    containerize

    7

    Data Workflows 7+

    http://spark.incubator.apache.org/https://github.com/avibryant/simmerhttps://github.com/twitter/summingbird/wikihttps://engineering.twitter.com/opensource/projects/algebirdhttp://www.cascading.org/http://arxiv.org/pdf/1304.7544v1.pdf
  • 7/22/2019 Data Workflows for Machine Learning

    56/79

    Footnote see also two excellent talks related to this point:

    Algebra for Analyticsspeakerdeck.com/johnynek/algebra-for-analyticsOscar Boykin, Strata SC (2014)Add ALL the Things:Abstract Algebra Meets Analytics

    infoq.com/presentations/abstract-algebra-analytics Avi Bryant, Strange Loop (2013)

    abstract alg

    functional p

    containerize

    7+

    Data Workflows

    bl d l

    8

    http://www.infoq.com/presentations/abstract-algebra-analyticshttps://speakerdeck.com/johnynek/algebra-for-analytics
  • 7/22/2019 Data Workflows for Machine Learning

    57/79

    Workflow needs vary in time, and need to blend timesomething something batch, something something low-latencyscheduling batch isnt hard; scheduling low-latency is

    computationally brutal (see the Omegapaper)because people like saying Lambda Architecture,

    it gives them goosebumps, or somethingbecause real-timemeans so many different thingsbecause batch windows are so 1964examples: Summingbird, Oryx

    blend result

    time scales:

    latency

    8

    Big Data

    Nathan Marz,manning.com

    Data Workflows

    i i l9

    http://www.manning.com/marz/http://www.manning.com/marz/http://www.manning.com/marz/https://github.com/cloudera/oryxhttps://engineering.twitter.com/opensource/projects/summingbirdhttp://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
  • 7/22/2019 Data Workflows for Machine Learning

    58/79

    Workflows may define contexts in which model selection

    possibly becomes a compiler problemfor example, see Boyd, Parikh, et al.ultimately optimizing for loss function+ regularization termperhaps not ready for prime-time immediatelyexamples: MLbase

    optimize lea

    to make mo

    potentially a

    9

    f(x): loss functiong(z): regularization term

    Data Workflows bon

    http://www.mlbase.org/http://www.stanford.edu/~boyd/papers/admm_distr_stats.html
  • 7/22/2019 Data Workflows for Machine Learning

    59/79

    For one of the best papers about what workflows really

    truly require, see Out of the Tar Pitby Moseley and Marks

    bon

    Nine Points for Data Workflows a wish list

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928
  • 7/22/2019 Data Workflows for Machine Learning

    60/79

    1. includes people, defines oversight for exceptional data

    2. separation of concerns, allows for literate programming

    3. multiple abstraction layers for metadata, feedback, and

    optimization4. testing: model evaluation, TDD, app deployment

    5. future-proof system integration, scale-out, ops

    6. visualizing workflows allows people to collaborate

    through code

    7. abstract algebra and functional programming

    containerize business process

    8. blend results from different time scales:

    batch plus low-latency

    9. optimize learners in context, to make model

    selection potentially a compiler problem

    ClusterScheduler

    Planners/Optimizers

    Clusters

    DSLs

    MixedTopologies

    BusinessProcess

    ReusableComponents

    PortableModels

    https://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://ipython.org/notebook.htmlhttp://www.knime.org/
  • 7/22/2019 Data Workflows for Machine Learning

    61/79

    Nine Points for Data Workflows a scorecard

    http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/http://www.m-brace.net/https://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttps://github.com/cloudera/oryxhttp://pandas.pydata.org/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://thinkaurelius.github.io/titan/http://www.paraccel.com/http://www.mlbase.org/https://code.google.com/p/augustus/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://spark.incubator.apache.org/http://cascalog.org/https://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.com/opensource/projects/scaldinghttps://engineering.twitter.c