large-scale data science on hadoop (intel big data day)

1© Cloudera, Inc. All rights reserved.

Large-Scale Data Scienceon HadoopUri Laserson | Data Scientist | @laserson


About the speaker

• Data Scientist at Cloudera

• PhD in BME at MIT/Harvard

• Committer on ADAM, impyla

• Co-author on Advanced Analytics with Spark

• [email protected]


What is a data scientist?


What is a data science?


What is a data science?

Phase 1. Collect Data Phase 2. Data Science? Phase 3. Profit!


Some things you might do as a data scientist

• Data quality issues

• Data formats/versions

• Data source integration

• Exploration/visualization

• Building/deploy models


Plumbing

Exploratory Operational


1. Data science is data plumbing


Example:

• Sells deep analysis of huge satellite images

• Easy: C++ to analyze images

• Hard: continuously reliablyingesting, transforming

• Expensive: storing, computing

• Hadoop as the data science plumber


2. Data science is investigative analytics


Example: large UK retailer

• Customer Churn

• SAS, Hive

• Path Analysis

• Giraph, MapReduce

• Customer Segmentation

• SAS, Spotfire, Impala

• Hadoop as one hub for investigative tools

• Avoid buying, training for N new tools


3. Data science is operational analytics


Example:

• Real-time Search, ML over Patient Data

• MapReduce for indexing, learning

• HBase for storage and fast access

• Storm for incremental update

• RDBMS for recent derived data

• API façade for input and querying learning

Engineering

Machine Learning


Plumbing

Exploratory Operational


Factors to consider when choosing your tools

• Single-node performance

• Scalability

• Language and tooling familiarity

• Integration with Hadoop

• Libraries / functions / richness of ecosystem

• Integration with data prep / ETL workflows

Pattern

JPMML


Plumbing in a nutshell

Plumbing Apache Kafka

Apache Pig

Apache Crunch


Serialization/RPC frameworks

• Specify schemas/services in user-friendly IDLs

• Code-generation to multiple languages (wire-compatible/portable)

• Compact, binary formats

• Natural support for schema evolution

• Multiple implementations:

• Apache Thrift, Apache Avro, Google’s Protocol Buffers

service Twitter {

void ping();

bool postTweet(1:Tweet tweet);

TweetSearchResult searchTweets(1:string query);

}

struct Tweet {

1: required i32 userId;

2: required string userName;

3: required string text;

4: optional Location loc;

16: optional string language = "english"

}


Log and service oriented architecture

http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


Factory (operational) vs. Laboratory (exploratory)

Programming languages

Systems languages

Latency, throughput

Huge data

Online problems

Automated

Developers, Engineers

Statistical environments, BI tools

High-level languages

Accuracy

Medium-sized data

Offline work

Ad-hoc

Statisticians, Analysts

vs.


Exploratory analytics

• Offline

• Statistical Environment

• Discovery-phase

• Model Building and Tuning

• Accuracy Important

• Medium-scale

• Visualizations

Exploratory


Exploratory: BI/visualization

• Nothing Hadoop-specific

• Take your pick of any 3rd party tool

• Typically connects to Hadoop via SQL interface with Impala


Exploratory: SAS

• Connects to Hadoop data stores

• Can push down some computation to cluster, but requires data movement

• Mature and widely used; large algolibrary

• Ongoing collaborative engineering effort with Cloudera


Exploratory: Python

• Python and JVM don’t play nice

• Hadoop Streaming / mrjob / scikit-learn

• Impyla: Python UDFs on Impala

• PySpark: Spark API in Python


Operational analytics

• Online

• Real-Time

• Cluster Environment

• Model Serving, Update

• QPS, Latency Important

• Large ScaleOperational

Pattern

JPMML


Operational: MLlib (Spark)

• Model building on Spark

• Fast (distributed in-memory)

• Basic algorithms only

• LR, SVM, decision tree

• PCA, SVD

• K-means

• ALS

• Easy integration with Spark-as-ETL


GROUPBY integration with Hadoop

Read Hadoop data Requires data movement


GROUPBY integration with Hadoop

YARN-managed Outside


GROUPBY open source

Open source Closed source


GROUPBY active community

Active community Not


Languages

Java Python R Scala


• Next-generation general processing engine for Hadoop

• APIs in Python, Java, Scala (and early R)

• DAG execution / in-memory

• Interactive REPL

• Batch or streaming

• MLlib, GraphX

• Active community

• Scala-like API


Large scale or real-time?

Large-ScaleOfflineBatch

Real-TimeOnlineStreaming

vs


Large scale or real-time?

Large-ScaleOfflineBatch

Real-TimeOnlineStreaming

vs

Why Don’t We Have Both?

λ!


Lambda architecture

• Tackle in 3 Layers

• Batch Layer: offline, big model build

• Speed Layer: near-real-time, approximate update

• Serving Layer: real-time model query / scoring


PMML

• Predictive Modeling Markup Language

• XML-based format for predictive models

• Standardized by Data Mining Group(www.dmg.org)

• Wide tool support

<PMML xmlns="http://www.dmg.org/PMML-4_1"

version="4.1">

<Header copyright="www.dmg.org"/>

<DataDictionary numberOfFields="5">

<DataField name="temperature"

optype="continuous"

dataType="double"/>

…

</DataDictionary>

<TreeModel modelName="golfing"

functionName="classification">

<MiningSchema>

<MiningField name="temperature"/>

…

</MiningSchema>

<Node score="will play">

<Node score="will play">

<SimplePredicate field="outlook"

operator="equal"

value="sunny"/>

…

</Node>

</Node>

</TreeModel>

</PMML>


Lambda implementation: Oryx 2.x

• Generic lambda-architecture platform

• With ML specializations

• hyperparam selection

• Built on Spark Streaming, Kafka

• With Intel

• 2.x: pre-alpha

github.com/OryxProject/oryx


Lambda implementation: Oryx 2.x

github.com/OryxProject/oryx


HTTP REST API

• Convention for RPC-like request / response

• HTTP verbs, transport

• GET : query

• POST : add input

• Easy from browser, CLI, Java, Python, Scala, etc.

GET /recommend/jwills

HTTP/1.1 200 OK

Content-Type: text/plain

"Ray LaMontagne",0.951

"Fleet Foxes",0.7905

"The National",0.688

"Shearwater",0.3017


Thank [email protected]

large-scale data science on hadoop (intel big data day)

Technology

satellite data

data plumbing

hadoopbig data

data science plumber

data science special

speakerdata scientist

investigative tools

investigative analytics