C-BAG Big Data Meetup, Chennai, Oct. 29, 2014: Hortonworks and Concurrent on Cascading

Big Data Meetup

October 29, 2014

Chennai

C-BAG

C-BAG (Chennai Big Data Analytic Group) is an open group formed in the interest of creating a good Big Data environment.

C-BAG conducts free weekly and monthly sessions, online and offline, to create awareness of Big Data technologies and to support Big Data initiatives. C-BAG's aim is to be a one-stop place for all Big Data queries, discussions and support!

Contact Us : [email protected]

Speakers

About Dhruv Kumar, Solutions Architect, Concurrent Inc.

Dhruv Kumar has over six years of diverse software development experience in Big Data, Web and High Performance Computing applications. Prior to joining Concurrent, he worked at Terracotta as a Software Engineer. He has an MS degree in Computer Engineering from the University of Massachusetts-Amherst.

About Vinay Shukla, Director of Product Management, Hortonworks

Vinay Shukla is a seasoned enterprise software professional with extensive experience in product management, product development and project management. Prior to Hortonworks, Vinay worked as a security architect, product manager, developer and project manager. Vinay admits to being a caffeine addict and spends his free time on a yoga mat and on hikes.

•  Founded in 2011

•  Original 24 architects, developers, operators of Hadoop from Yahoo!

•  Leaders in Hadoop community

•  500+ employees

Customer Momentum
•  300+ customers in seven quarters, growing at 75+/quarter
•  Two thirds of customers come from F1000

Partner Momentum
•  Over 1000 Partners, Hundreds of Certified Solutions
•  Some key partners include:

Hortonworks and Hadoop at Scale
•  HDP in production on largest clusters on planet
•  Most +1000 node clusters

Hortonworks enables adoption of Apache Hadoop with Hortonworks Data Platform

The Forrester Wave™ Big Data Hadoop Solutions Q1 2014

“Hortonworks loves and lives open source innovation” World Class Support and Services. Hortonworks' Customer Support received a maximum score and was significantly higher than both Cloudera and MapR

A Leader in Hadoop

HDP IS Apache Hadoop

There is ONE Enterprise Hadoop: everything else is a vendor derivation

Hortonworks Data Platform 2.2

[Figure: component version chart across three releases: HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014). Components shown: Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, ZooKeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider, Solr. Component groups: Data Management, Data Access, Governance & Integration, Security, Operations.]

* version numbers are targets and subject to change at time of general availability in accordance with ASF release process

The Modern Data Architecture w/ HDP

Enterprise Goals for the Modern Data Architecture

•  Consolidate siloed data sets, structured and unstructured

•  Central data set on a single cluster

•  Multiple workloads across batch, interactive and real time

•  Central services for security, governance and operation

•  Preserve existing investment in current tools and platforms

•  Single view of the customer, product, supply chain

[Figure: Modern Data Architecture with HDP. Applications: Business Analytics, Custom Applications, Packaged Applications. Data system: RDBMS, EDW and MPP alongside HDP, where YARN (Data Operating System) runs Interactive, Real-Time and Batch workloads over HDFS (Hadoop Distributed File System) on nodes 1 to N. Sources: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured.]

1. Unlock New Applications from New Types of Data

Industry use cases by data type (Sentiment & Web; Clickstream & Behavior; Machine & Sensor; Geographic; Server Logs; Structured & Unstructured)

Financial Services
•  New Account Risk Screens ✔ ✔
•  Trading Risk ✔
•  Insurance Underwriting ✔ ✔ ✔

Telecom
•  Call Detail Records (CDR) ✔ ✔
•  Infrastructure Investment ✔ ✔
•  Real-time Bandwidth Allocation ✔ ✔ ✔

Retail
•  360° View of the Customer ✔ ✔ ✔
•  Localized, Personalized Promotions ✔
•  Website Optimization ✔

Manufacturing
•  Supply Chain and Logistics ✔
•  Assembly Line Quality Assurance ✔
•  Crowd-sourced Quality Assurance ✔

Healthcare
•  Use Genomic Data in Medical Trials ✔ ✔ ✔
•  Monitor Patient Vitals in Real-Time

Pharmaceuticals
•  Recruit and Retain Patients for Drug Trials ✔ ✔
•  Improve Prescription Adherence ✔ ✔ ✔ ✔

Oil & Gas
•  Unify Exploration & Production Data ✔ ✔ ✔ ✔
•  Monitor Rig Safety in Real-Time ✔ ✔ ✔

Government
•  ETL Offload/Federal Budgetary Pressures ✔ ✔
•  Sentiment Analysis for Government Programs ✔

..to shift from reactive to proactive interactions

HDP and Hadoop allow organizations to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision):

•  A shift in Advertising: from mass branding to 1x1 Targeting
•  A shift in Financial Services: from Educated Investing to Automated Algorithms
•  A shift in Healthcare: from mass treatment to Designer Medicine
•  A shift in Retail: from static branding to Real-time Personalization
•  A shift in Telco: from break then fix to repair before break

2. Or to realize a dramatic cost savings…

EDW Optimization

Current Reality
•  EDW at capacity: some usage from low value workloads
•  Older data archived, unavailable for ongoing exploration
•  Source data often discarded
•  EDW workload: Operations 50%, Analytics 20%, ETL Process 30%

Augment w/ Hadoop
•  Free up EDW resources from low value tasks
•  Keep 100% of source data and historical data for ongoing exploration
•  Mine data for value after loading it because of schema-on-read
•  EDW workload: Operations 50%, Analytics 50%
•  Hadoop: Parse, Cleanse, Apply Structure, Transform
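The schema-on-read point can be illustrated with a minimal Java sketch, assuming tab-delimited raw log lines. The class name, field names and sample record are hypothetical and do not come from the talk or from any Hadoop API; the sketch only shows that raw lines are stored untouched and a schema is applied when the data is read.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of schema-on-read; not part of HDP or any Hadoop API.
class SchemaOnReadSketch {

    // Interpret one raw tab-delimited line using the field names supplied at read time.
    // Fields missing from the line are mapped to null instead of failing the load.
    static Map<String, String> applySchema(String rawLine, List<String> fieldNames) {
        String[] parts = rawLine.split("\t", -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size(); i++) {
            record.put(fieldNames.get(i), i < parts.length ? parts[i] : null);
        }
        return record;
    }

    public static void main(String[] args) {
        // The raw line is stored as-is; nothing is validated or dropped at load time.
        String rawLine = "2014-10-29\tuser42\t/products/17\t200";

        // Two different (assumed) schemas read the same stored line.
        List<String> clickSchema = Arrays.asList("date", "user", "path", "status");
        List<String> auditSchema = Arrays.asList("date", "user");

        System.out.println(applySchema(rawLine, clickSchema));
        System.out.println(applySchema(rawLine, auditSchema));
    }
}
```

Because the structure is decided by the reader, a new question can be asked of old data by supplying a different field list, without reloading anything.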


[Figure: Fully-loaded Cost Per Raw TB of Data (Min–Max Cost), on a scale from $0 to $180,000, comparing MPP, SAN, Engineered System, NAS, Hadoop and Cloud Storage.]

Commodity Compute & Storage: Hadoop Enables Scalable Compute & Storage at a Compelling Cost Structure

Storage Costs/Compute Costs from $19/GB to $0.23/GB
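As a quick check on the scale of that change, the following Java sketch works through the arithmetic. The two per-GB prices are the ones quoted on the slide; the class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical helper for the arithmetic only; the two prices are the ones quoted on the slide.
class CostReductionSketch {

    // How many times cheaper the new cost is than the old one.
    static double reductionFactor(double oldCostPerGb, double newCostPerGb) {
        return oldCostPerGb / newCostPerGb;
    }

    // Savings expressed as a percentage of the old cost.
    static double savingsPercent(double oldCostPerGb, double newCostPerGb) {
        return (oldCostPerGb - newCostPerGb) / oldCostPerGb * 100.0;
    }

    public static void main(String[] args) {
        double before = 19.00; // USD per GB, from the slide
        double after = 0.23;   // USD per GB, from the slide
        System.out.printf("factor: %.1fx, savings: %.1f%%%n",
                reductionFactor(before, after), savingsPercent(before, after));
    }
}
```

With the quoted figures this works out to a roughly 83-fold reduction, or about 98.8% lower cost per GB.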

3. Data Lake: An architectural shift

[Figure: Unlocking the Data Lake. Axes: Scale and Scope. RDBMS, MPP and EDW systems alongside HDP 2.1 (Governance & Integration, Security, Operations, Data Access, Data Management, YARN), starting from New Analytic Apps or IT Optimization.]

Data Lake Enabled by YARN
•  Single data repository, shared infrastructure
•  Multiple biz apps accessing all the data
•  Enable a shift from reactive to proactive interactions
•  Gain new insight across the entire enterprise

Case Study: 12 month Hadoop evolution at TrueCar

12 months execution plan (vertical axis: Data Platform Capabilities)

•  June 2013: Begin Hadoop Execution
•  July 2013: Hortonworks Partnership
•  Aug 2013: Training & Dev Begins
•  Nov 2013: Production Cluster, 60 Nodes, 2 PB
•  Dec 2013: Three Production Apps (3 total)
•  Jan 2014: 40% Dev Staff Perficient
•  Feb 2014: Three More Production Apps (6 total)
•  May ‘14: IPO

12 Month Results at TrueCar
•  Six Production Hadoop Applications
•  Sixty nodes/2PB data
•  Storage Costs/Compute Costs from $19/GB to $0.23/GB

“We addressed our data platform capabilities strategically as a pre-cursor to IPO.”

DRIVING INNOVATION THROUGH DATA

REDUCING DEVELOPMENT TIME FOR PRODUCTION-GRADE HADOOP APPLICATIONS

Dhruv Kumar, Solutions Architect, Concurrent Inc.

GET TO KNOW CONCURRENT

Leader in Application Infrastructure for Big Data

• Building enterprise software to simplify Big Data application development and management

Products and Technology

• CASCADING: Open source. The most widely used application infrastructure for building Big Data apps, with over 200,000 downloads each month and 8,000 deployments worldwide.

• DRIVEN: Enterprise data application management for Big Data apps

Proven: Simple, Reliable, Robust

• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.

Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com

BIG DATA APPLICATION INFRASTRUCTURE

“It’s all about the apps”

There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications.

Business Strategy / Data & Technology

Challenges: skill sets, systems integration, standard op procedure and operational visibility

Connecting Business and Data

DATA APPLICATIONS - ENTERPRISE NEEDS

Enterprise Data Application Infrastructure

• Need reliable, reusable tooling to quickly build and consistently deliver data products

• Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets

• Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application

• Need operational visibility for entire data application lifecycle

WORD COUNT EXAMPLE WITH CASCADING

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    // configuration
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // integration: create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // processing: specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // scheduling: connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
    wcFlow.complete(); // <<-- Runs jobs on Cluster
  }
}
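To make the data flow concrete, here is a plain-Java sketch of what the word count computes: split each line into tokens on a delimiter regex, group by token, and count each group. It deliberately uses neither Cascading nor Hadoop. The class and method names are hypothetical, the delimiter set (space, brackets, parentheses, comma, period) is an assumption, and dropping empty tokens is a simplification that may differ from the flow's actual output.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-JVM equivalent of the split / group / count steps; not Cascading code.
class WordCountSketch {

    // Assumed delimiter set: space, square brackets, parentheses, comma, period.
    static final String SPLIT_REGEX = "[ \\[\\]\\(\\),.]";

    // Split every line into tokens, then count occurrences per token (sorted by token).
    static Map<String, Long> countTokens(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.split(SPLIT_REGEX)) {
                if (token.isEmpty()) {
                    continue; // simplification: adjacent delimiters yield empty tokens, skipped here
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countTokens(Arrays.asList(
                "rain shadow, rain",
                "shadow (dry)"));
        // Emit tab-delimited "token<TAB>count" rows, similar in shape to the sink format.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The sketch holds everything in one in-memory map, which is exactly the part that a cluster replaces: the grouping step becomes a distributed shuffle and the counting runs per group.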

SOME COMMON PROCESSING PATTERNS

• Functions
• Filters
• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)
• Aggregations
  ‣ Count, Average, etc.

[Figure: topology of a pipeline in which data passes through filters and functions, with Split, Join and Merge steps.]

CASCADING API

• Java API

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

[Figure: Cascading API stack. Top: Enterprise Java, Scripting (Scala, Clojure, JRuby, Jython, Groovy) and Cascading Domain Specific Languages (DSLs). Middle: Cascading, with the Processing API, Integration API and Scheduler API, plus the Process Planner and Scheduler. Bottom: Apache Hadoop and Data Stores.]

FRAMEWORK AND PROGRAMMING LANGUAGE INDEPENDENCE

[Figure: DSLs (Clojure, SQL, Ruby) over Supported Fabrics and Data Stores (Mainframe, DB / DW, Data Stores, Hadoop, In-Memory) and New Fabrics (Storm, Tez).]

• Any JVM language can use Cascading API

• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …

THE STANDARD FOR DATA APPLICATION DEVELOPMENT

www.cascading.org

Proven application development framework for building data apps

Application platform that addresses:

• Build data apps that are scale-free: Design principles ensure best practices at any scale
• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster
• Staffing Bottleneck: Use existing Java, SQL, modeling skill sets
• Operational Complexity: Simple. Package up into one jar and hand to operations
• Application Portability: Write once, then run on different computation fabrics
• Systems Integration: Hadoop never lives alone. Easily integrate to existing systems

CASCADING DATA APPLICATIONS

• Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
• Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
• Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services
• Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders
• Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
• Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
• Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps

STRONG ORGANIC GROWTH

• 200,000+ downloads / month
• 8000+ deployments

BUSINESSES DEPEND ON US

TWITTER

• 30,000 jobs per day
• Makes complex analysis of very large data sets simple
• Machine learning, linear algebra to improve
  ‣ User experience
  ‣ Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding

• Cascading Java API
• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework

• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US

BROAD SUPPORT

Hadoop ecosystem supports Cascading

… AND INCLUDES RICH SET OF EXTENSIONS

http://www.cascading.org/extensions/

WORD COUNT DEMO ON HDP

SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING

• The Cascading framework enables developers to intuitively create data applications that scale and are robust and future-proof, supporting new execution fabrics without requiring a code rewrite

• Driven, an application visualization product, provides rich insights into how your applications execute, improving developer productivity by 10x

• Cascading 3.0 opens up the query planner: write apps once, run on any fabric

Concurrent offers training classes for Cascading & Scalding

CONTACT INFORMATION

Dhruv Kumar
Solutions Architect
Concurrent
[email protected]

DRIVING INNOVATION THROUGH DATA

THANK YOU

Dhruv Kumar

APPENDIX


USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO SPARK

• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps

• Supports integrating with any data source that can be accessed through JDBC: a Cascading Tap can be created for any source supporting JDBC

• Great for migration of data and integrating with non-Big Data assets; extends life of existing IT assets in an organization

[Figure: Lingual stack. Top: CLI / Shell and Enterprise Java. Middle: Lingual, with the JDBC API, Lingual API and Provider API, plus the Query Planner and Catalog. Bottom: Cascading, Apache Hadoop and Data Stores.]

SCALDING

• Scalding is a language binding to Cascading for Scala

• The name Scalding comes from the combining of SCALa and cascaDING

• Scalding is great for Scala developers; can crisply write constructs for matrix math…

• Scalding has very large commercial deployments at:
  ‣ Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
  ‣ Ebay - Use cases include search analytics and other production data pipelines

PATTERN ENABLES MIGRATING YOUR MODELS TO SPARK

• Pattern is an open source project that takes Predictive Model Markup Language (PMML) models and translates them into Cascading apps.

• PMML is a popular XML-based format that allows applications to describe data mining and machine learning algorithms

• PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows
  ‣ Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle
  ‣ Open source frameworks - R, Weka, KNIME, RapidMiner

• Pattern is great for migrating your model scoring to Hadoop from your decision systems

PATTERN: ALGOS IMPLEMENTED

• Hierarchical Clustering
• K-Means Clustering
• Linear Regression
• Logistic Regression
• Random Forest

Algorithms extended based on customer use cases

BUILDING AND RUNNING PMML MODELS

[Figure: PMML workflow. Training (Model Producer): data goes through ETL and preparation (LINGUAL), then exploration and model building using regression, clustering, etc., producing a PMML model. Scoring (Model Consumer): new data goes through ETL and preparation (LINGUAL), is scored against the PMML model (PATTERN), and the scores go to post processing. A feedback step measures and improves the model.]
