cascading concurrent yahoo lunch_nlearn

DRIVING INNOVATION THROUGH DATA BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADING Supreet Oberoi VP Field Engineering, Concurrent Inc


Post on 14-Jul-2015


TRANSCRIPT

Page 1: Cascading concurrent   yahoo lunch_nlearn

DRIVING INNOVATION THROUGH DATA

BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADING

Supreet Oberoi, VP Field Engineering, Concurrent Inc

Page 2: Cascading concurrent   yahoo lunch_nlearn

ABOUT ME

2

• I am a Data Engineer, not a Data Scientist

• I help enterprises make decisions on their “Big Data” roadmap and technical strategy: use cases, products, technology decisions, employee skills

• I design Hadoop applications with the intent to operationalize them in enterprise settings: applications on which businesses depend, and which outlast the technologies underneath them…

• This talk is about learning how to design your BD strategy that leverages best

Page 3: Cascading concurrent   yahoo lunch_nlearn

BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN

3

• Open Language

• Open Data

• Open Hardware

• Open Compute Platform

• Open Development Platform

Page 4: Cascading concurrent   yahoo lunch_nlearn

OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE

4

• Don’t equate architecture with language; develop architecture to support multiple

• Support SQL and SQL-like languages

• Encourage development in proven and scalable languages such as Java

• Develop architecture to support change of programming languages (even for same app)

• Have common performance-management tools across all programming environments

Page 5: Cascading concurrent   yahoo lunch_nlearn

OPEN DATA ENABLES REUSE OF DATA AND APPS

5

• Prevent exclusive access to data sets through proprietary tools

• Promote a common meta-data repository

• Forbid storing data in proprietary formats

• Build seamless integration capabilities

Develop a common operating picture by promoting reuse with open data

Page 6: Cascading concurrent   yahoo lunch_nlearn

OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE

6

• Get commodity hardware — commodity hardware will always cost less than “optimized” specialized hardware (note: definition of “specialized” is up for debate)

• Develop and maintain a cluster that can be reused by different applications and technology stacks — avoid custom software installations on the cluster, or setting up dedicated clusters for given tech stacks

• Harness the collective power of the cluster: avoid fragmenting the cluster if possible

Page 7: Cascading concurrent   yahoo lunch_nlearn

OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM

7

• Ensure that moving your application from one Hadoop compute platform (e.g. MapReduce) to another (e.g., Tez) does not:

• impact application code

• impact production-monitoring tools

• Resist compute platforms that require your enterprise to acquire significantly new skills (even if it is easy) to become productive

• Avoid new platforms that partition the cluster

• Avoid platforms that do not support Open Data

Make tradeoffs between reliability & speed based on your business context

Page 8: Cascading concurrent   yahoo lunch_nlearn

OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY

8

• Invest in picking the correct development platform — open, easy, scalable, popular, tools, …

• Bet on a sustainable open source platform

• Measure the vitality of the community:

• number of downloads, extensions (living ecosystem), extensible architecture, consumers of the technology, code stability…

Development platforms improve developer productivity and operational excellence — picking a correct platform gives you best practices developed by the community, achieving higher quality

A proven platform provides tools to get your apps to production

Page 9: Cascading concurrent   yahoo lunch_nlearn

GET TO KNOW CONCURRENT

9

Leader in Application Infrastructure for Big Data

• Building enterprise software to simplify Big Data application development and management

Products and Technology

• CASCADING (Open Source): The most widely used application infrastructure for building Big Data apps, with over 175,000 downloads each month

• DRIVEN: Enterprise data application management for Big Data apps

Proven — Simple, Reliable, Robust

• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.

Founded: 2008

HQ: San Francisco, CA

CEO: Gary Nakamura

CTO, Founder: Chris Wensel

www.concurrentinc.com

Page 10: Cascading concurrent   yahoo lunch_nlearn

BIG DATA — OPERATIONALIZE YOUR DATA APPS WITH CASCADING

10

“It’s all about the apps”

There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications.

Business Strategy Data & Technology

Challenges: Skill sets, systems integration, standard operating procedure and operational visibility

Connecting Business and Data

Page 11: Cascading concurrent   yahoo lunch_nlearn

DATA APPLICATIONS - ENTERPRISE NEEDS

11

Enterprise Data Application Infrastructure

• Need reliable, reusable tooling to quickly build and consistently deliver data products

• Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets

• Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application

• Need operational visibility for entire data application lifecycle

Page 12: Cascading concurrent   yahoo lunch_nlearn

Cascading Apps

CASCADING - DE-FACTO FOR DATA APPS

12

New Fabrics

Clojure, SQL, Ruby

Storm, Tez

System Integration

Mainframe, DB / DW, Data Stores, Hadoop, In-Memory

• Standard for enterprise data app development

• Your programming language of choice

• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …

Page 13: Cascading concurrent   yahoo lunch_nlearn

WORD COUNT EXAMPLE WITH CASCADING

13

configuration

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

integration

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

processing

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

scheduling

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
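To make clear what the flow above computes, here is a minimal sketch of the same split, group, and count logic in plain Java, using only the JDK and no Cascading classes. The class name `WordCountSketch` and the sample input are made up for illustration, and dropping empty tokens is an assumption of this sketch that the flow on the slide does not explicitly make.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Cascading.
class WordCountSketch {
    // Same delimiter character class as the RegexSplitGenerator on the slide.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split every line into tokens, group equal tokens, and count each group.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(DELIMITERS.split(line)))
            .filter(token -> !token.isEmpty()) // assumption: skip empty tokens
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (rain)", "a rain, a shadow.");
        // Print tab-separated token/count pairs, mirroring the TextDelimited sink.
        count(lines).forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

A local sketch like this can serve as a reference result when checking the output of the real flow on a small input file.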

Page 14: Cascading concurrent   yahoo lunch_nlearn

SOME COMMON PATTERNS

14

• Functions

• Filters

• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical

• Merge (Union)

• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)

• Aggregations
  ‣ Count, Average, etc.

[Diagram: topology patterns (Pipeline, Split, Join, Merge) built from data, filter, and function nodes]
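The patterns on this slide can be sketched in plain Java to show how they compose. This is only an illustration using JDK streams, not the Cascading API; the class `PatternsSketch`, its method names, and the log-line layout (host name first, `#` for comments) are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Cascading.
class PatternsSketch {
    // Merge two inputs, filter out comment lines, apply a function that
    // extracts the host field, then group by host and count (aggregation).
    static Map<String, Long> hitsPerHost(List<String> logA, List<String> logB) {
        return Stream.concat(logA.stream(), logB.stream())     // merge (union)
            .filter(line -> !line.startsWith("#"))              // filter
            .map(line -> line.split(" ")[0])                    // function
            .collect(Collectors.groupingBy(                     // grouping
                host -> host, TreeMap::new, Collectors.counting())); // count
    }

    // Inner join of two keyed data sets: keep only keys present on both sides.
    static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        return left.entrySet().stream()
            .filter(e -> right.containsKey(e.getKey()))
            .map(e -> e.getKey() + ":" + e.getValue() + ":" + right.get(e.getKey()))
            .sorted()
            .collect(Collectors.toList());
    }
}
```

An outer join would differ only in keeping the unmatched keys from one or both sides and filling the missing values with nulls.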

Page 15: Cascading concurrent   yahoo lunch_nlearn

• Java API

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

CASCADING

15

[Diagram labels: Process Planner; Processing API, Integration API, Scheduler API; Scheduler; Apache Hadoop; Cascading; Data Stores; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Enterprise Java]

Page 16: Cascading concurrent   yahoo lunch_nlearn

THE STANDARD FOR DATA APPLICATION DEVELOPMENT

16

www.cascading.org

Proven application development framework for building data apps

Application platform that addresses:

• Build data apps that are scale-free: Design principles ensure best practices at any scale

• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster

• Staffing Bottleneck: Use existing Java, Scala, SQL, modeling skill sets

• Operational Complexity: Simple. Package up into one jar and hand to operations

• Application Portability: Write once, then run on different computation fabrics

• Systems Integration: Hadoop never lives alone. Easily integrate with existing systems

Page 17: Cascading concurrent   yahoo lunch_nlearn

STRONG ORGANIC GROWTH

17

175,000+ downloads / month

7000+ Deployments

Page 18: Cascading concurrent   yahoo lunch_nlearn

CASCADING DATA APPLICATIONS

18

Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis

Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting

Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services

Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders

Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate Rental Listings; Travel Search & Forecast

Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric

Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps

Page 19: Cascading concurrent   yahoo lunch_nlearn

BUSINESSES DEPEND ON US

19

• Cascading Java API

• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts

• Easy to operationalize heavy lifting of data in one framework

Page 20: Cascading concurrent   yahoo lunch_nlearn

BUSINESSES DEPEND ON US

20

• Cascalog (Clojure)

• Weather pattern modeling to protect growers against loss

• ETL against 20+ datasets daily

• Machine learning to create models

• Purchased by Monsanto for $930M US

Page 21: Cascading concurrent   yahoo lunch_nlearn

BUSINESSES DEPEND ON US

21

• Scalding (Scala)

• Makes complex analysis of very large data sets simple

• Machine learning, linear algebra to improve ad quality (matching users and ad effectiveness)

• 30,000 jobs a day: this works at scale

TWITTER

Page 22: Cascading concurrent   yahoo lunch_nlearn

Confidential

CASCADING DEPLOYMENTS

22

Page 23: Cascading concurrent   yahoo lunch_nlearn

BROAD SUPPORT

23

Hadoop ecosystem supports Cascading

Page 24: Cascading concurrent   yahoo lunch_nlearn

… AND INCLUDES RICH SET OF EXTENSIONS

24

http://www.cascading.org/extensions/

Page 25: Cascading concurrent   yahoo lunch_nlearn

CASCADING 3.0

25

“Write once and deploy on your fabric of choice.”

• The Innovation — Cascading 3.0 will allow for data apps to execute on existing and emerging fabrics through its new customizable query planner.

• Cascading 3.0 will support — Local In-Memory, Apache MapReduce and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm

[Diagram: Enterprise Data Applications running on Computation Fabrics: MapReduce, Local In-Memory, Other Custom]
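The "write once, deploy on your fabric of choice" idea can be sketched as application logic written against a small interface, with the execution backend swapped underneath. The sketch below is plain Java and does not use the Cascading 3.0 planner API: the `Fabric` interface, the two in-process implementations, and `LINE_LENGTH` are hypothetical names invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Cascading.
class FabricSketch {
    /** Hypothetical stand-in for a computation fabric; not a Cascading type. */
    interface Fabric {
        <I, O> List<O> run(List<I> input, Function<I, O> logic);
    }

    // Backend 1: process records one after another on the calling thread.
    static final Fabric LOCAL_SEQUENTIAL = new Fabric() {
        @Override
        public <I, O> List<O> run(List<I> input, Function<I, O> logic) {
            return input.stream().map(logic).collect(Collectors.toList());
        }
    };

    // Backend 2: process records on the common fork-join pool.
    static final Fabric LOCAL_PARALLEL = new Fabric() {
        @Override
        public <I, O> List<O> run(List<I> input, Function<I, O> logic) {
            return input.parallelStream().map(logic).collect(Collectors.toList());
        }
    };

    // The application logic is written once and knows nothing about the backend.
    static final Function<String, Integer> LINE_LENGTH = String::length;

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("map", "reduce", "tez");
        System.out.println(LOCAL_SEQUENTIAL.run(lines, LINE_LENGTH));
        System.out.println(LOCAL_PARALLEL.run(lines, LINE_LENGTH));
    }
}
```

The design point is that switching backends changes one reference, not the logic, which is the property the slide claims for moving an application between fabrics.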

Page 26: Cascading concurrent   yahoo lunch_nlearn

USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP

26

• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps

• Supports integrating with any data source that can be accessed through JDBC — Cascading Tap can be created for any source supporting JDBC

• Great for migration of data, integrating with non-Big Data assets — extends life of existing IT assets in an organization

[Diagram labels: Query Planner; JDBC API, Lingual API, Provider API; Cascading; Apache Hadoop; Lingual; Data Stores; CLI / Shell; Enterprise Java; Catalog]

Page 27: Cascading concurrent   yahoo lunch_nlearn

SCALDING

27

• Scalding is a language binding to Cascading for Scala

• The name Scalding comes from combining SCALa and cascaDING

• Scalding is great for Scala developers; can crisply write constructs for matrix math…

• Scalding has very large commercial deployments at:
  • Twitter: use cases such as the revenue quality team, ad targeting and traffic quality
  • eBay: use cases include search analytics and other production data pipelines

Page 28: Cascading concurrent   yahoo lunch_nlearn

PATTERN SCORES MODELS AT SCALE

28

• Pattern is an open source project that lets you take Predictive Model Markup Language (PMML) models and translate them into Cascading apps.

• PMML is a popular XML-based standard that lets applications describe data mining and machine learning models

• PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows

  • Vendor frameworks: SAS, IBM SPSS, MicroStrategy, Oracle
  • Open source frameworks: R, Weka, KNIME, RapidMiner

• Pattern is great for migrating your model scoring to Hadoop from your decision systems
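Scoring, as opposed to training, means applying an already-fitted model to each incoming record. As a rough illustration of that step, the sketch below scores records with a logistic regression model in plain Java. It does not use Pattern or parse PMML: the class `ScoringSketch` and the coefficient values are hypothetical, and in practice the coefficients would come from the exported model.

```java
// Hypothetical illustration class; not part of Pattern or Cascading.
class ScoringSketch {
    // Logistic regression score: sigmoid(intercept + sum(coefficient_i * feature_i)).
    static double score(double intercept, double[] coefficients, double[] features) {
        if (coefficients.length != features.length) {
            throw new IllegalArgumentException("coefficient and feature counts differ");
        }
        double z = intercept;
        for (int i = 0; i < features.length; i++) {
            z += coefficients[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Score a batch of records; each row is one record's feature vector.
    static double[] scoreAll(double intercept, double[] coefficients, double[][] records) {
        double[] scores = new double[records.length];
        for (int r = 0; r < records.length; r++) {
            scores[r] = score(intercept, coefficients, records[r]);
        }
        return scores;
    }

    public static void main(String[] args) {
        double[] coefficients = {0.8, -0.4};          // assumed values, illustration only
        double[][] records = {{1.0, 2.0}, {3.0, 0.5}};
        for (double s : scoreAll(-0.1, coefficients, records)) {
            System.out.println(s);
        }
    }
}
```

Because each record is scored independently, this step parallelizes across a cluster in the same way as any per-record function in a pipeline.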

Page 29: Cascading concurrent   yahoo lunch_nlearn

PATTERN SCORES MODELS AT SCALE

29

Step 1: Train your model with industry-leading tools

Step 2: Score your models at scale with Pattern

Page 30: Cascading concurrent   yahoo lunch_nlearn

OPERATIONAL EXCELLENCE WITH DRIVEN

From Development: Building and Testing

• Design & Development
• Debugging
• Tuning

To Production: Monitoring and Tracking

• Maintain Business SLAs
• Balance & Controls
• Application and Data Quality
• Operational Health
• Real-time Insights

Visibility Through All Stages of App Lifecycle

30

Page 31: Cascading concurrent   yahoo lunch_nlearn

DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS

31

• Understand the anatomy of your Hive app

• Track execution of queries as single business process

• Identify outlier behavior by comparison with historical runs

• Analyze rich operational meta-data

• Correlate Hive app behavior with other events on cluster
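One simple way to picture "identify outlier behavior by comparison with historical runs" is a threshold on how far the latest run's duration sits from the historical mean. The sketch below is plain Java and is not how Driven is known to work: the class `RunHistorySketch`, the use of run duration as the metric, and the k-standard-deviations rule are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Driven.
class RunHistorySketch {
    // Flag the latest run if its duration is more than k population standard
    // deviations away from the mean of the historical durations.
    static boolean isOutlier(List<Double> historySeconds, double latestSeconds, double k) {
        if (historySeconds.isEmpty()) {
            throw new IllegalArgumentException("history must not be empty");
        }
        double mean = historySeconds.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .getAsDouble();
        double variance = historySeconds.stream()
            .mapToDouble(d -> (d - mean) * (d - mean))
            .average()
            .getAsDouble();
        double stdDev = Math.sqrt(variance);
        return Math.abs(latestSeconds - mean) > k * stdDev;
    }

    public static void main(String[] args) {
        List<Double> history = Arrays.asList(100.0, 102.0, 98.0, 101.0, 99.0);
        System.out.println(isOutlier(history, 150.0, 3.0));
        System.out.println(isOutlier(history, 101.0, 3.0));
    }
}
```

A fixed multiple of the standard deviation is a crude rule; it is used here only because it is easy to reason about on a handful of runs.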

Page 32: Cascading concurrent   yahoo lunch_nlearn

DRIVEN ARCHITECTURE

Page 33: Cascading concurrent   yahoo lunch_nlearn

• Easily comprehend, debug, and tune your data applications

• Get rich insights on your application performance

• Monitor applications in real-time

• Compare app performance with historical (previous) iterations

DEEPER VISUALIZATION INTO YOUR HADOOP CODE

33

Debug and optimize your Hadoop applications more effectively with Driven

Page 34: Cascading concurrent   yahoo lunch_nlearn

• Quickly break down how often applications execute based on their tags, teams, or names

• Immediately identify if any application is monopolizing cluster resources

• Understand the utilization of your cluster with a timeline of all applications running

GET OPERATIONAL INSIGHTS WITH DRIVEN

34

Visualize the activity of your applications to help maintain SLAs

Page 35: Cascading concurrent   yahoo lunch_nlearn

• Easily keep track of all your applications by segmenting them with user-defined tags

• Segment your applications for trending analysis, cluster analysis, and developing chargeback models

• Quickly break down how often applications execute based on their tags, teams, or names

ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY

35

Segment your applications for greater insights across all your applications
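Segmenting by user-defined tags for a chargeback model comes down to grouping application runs by tag and summing a usage metric per group. The sketch below shows that aggregation in plain Java. It is not Driven's data model: the class `ChargebackSketch`, the `AppRun` fields, and the "slot-seconds" usage unit are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Driven.
class ChargebackSketch {
    // One execution of an application, labelled with a user-defined tag.
    static final class AppRun {
        final String name;
        final String tag;
        final long slotSeconds; // assumed usage metric

        AppRun(String name, String tag, long slotSeconds) {
            this.name = name;
            this.tag = tag;
            this.slotSeconds = slotSeconds;
        }
    }

    // Total usage per tag, sorted by tag name.
    static Map<String, Long> usageByTag(List<AppRun> runs) {
        return runs.stream().collect(Collectors.groupingBy(
            run -> run.tag, TreeMap::new, Collectors.summingLong(run -> run.slotSeconds)));
    }

    public static void main(String[] args) {
        List<AppRun> runs = Arrays.asList(
            new AppRun("clickstream-etl", "marketing", 1200L),
            new AppRun("fraud-model", "finance", 900L),
            new AppRun("funnel-report", "marketing", 300L));
        usageByTag(runs).forEach((tag, total) -> System.out.println(tag + "\t" + total));
    }
}
```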

Page 36: Cascading concurrent   yahoo lunch_nlearn

• Invite others to view and collaborate on a specific application

• Gain visibility to all the apps and their owners associated with each team

• Simply manage your teams and the users assigned to them

COLLABORATE WITH TEAMS

36

Utilize teams to collaborate and gain visibility over your set of applications

Page 37: Cascading concurrent   yahoo lunch_nlearn

• Identify problematic apps with their owners and teams

• Search for groups of applications segmented by user-defined tags

• Compare specific applications with their previous iterations to ensure that your application can meet its SL

MANAGE PORTFOLIO OF BIG DATA APPLICATIONS

37

Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for

Page 38: Cascading concurrent   yahoo lunch_nlearn

DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS

38

• Understand the anatomy of your Hive app

• Track execution of queries as single business process

• Identify outlier behavior by comparison with historical runs

• Analyze rich operational meta-data

• Correlate Hive app behavior with other events on cluster

Page 39: Cascading concurrent   yahoo lunch_nlearn

• The Cascading framework lets developers intuitively create data applications that scale, are robust, and are future-proof: they support new execution fabrics without requiring a code rewrite

• Scalding, a Scala-based extension to Cascading, provides crisp programming constructs for algorithm developers and data scientists

• Driven, an application visualization product, provides rich insights into how your applications execute, improving developer productivity by 10x

• Cascading 3.0 opens up the query planner — write apps once, run on any fabric

SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING

39

Concurrent offers training classes for Cascading (DEC 9) & Scalding (NOV 4)

Page 40: Cascading concurrent   yahoo lunch_nlearn

CONTACT INFORMATION

Supreet [email protected]

650-868-7675 (m) @supreet_online

Page 41: Cascading concurrent   yahoo lunch_nlearn

DRIVING INNOVATION THROUGH DATA

THANK YOU

Supreet Oberoi

Page 42: Cascading concurrent   yahoo lunch_nlearn

DRIVING INNOVATION THROUGH DATA

BACKUP

Supreet Oberoi

Page 43: Cascading concurrent   yahoo lunch_nlearn

DIFFERENT PHILOSOPHY THAN GUI TOOLS

43

• Cascading is a general-purpose framework to develop data applications; supports development through the entire lifecycle of data - from staging to final data sets

• Developing with an API is more productive and intuitive than with a UI — incorporate best-practices

• Can do in three lines of code what takes 20 clicks in an app (fluent API with IDE makes it even simpler)

• Can test locally and deploy to production without code change

• Because it is based in code: debuggable, extensible, deployable, traceable

• GUI tools do not help in visualizing the must-have insights

• Real-time application visualization, application bottlenecks, anatomy of the application, application dependencies, cost breakdown of an operation (join), bottlenecks due to code, data, network, cluster

Page 44: Cascading concurrent   yahoo lunch_nlearn

PATTERN: ALGOS IMPLEMENTED

44

• Hierarchical Clustering

• K-Means Clustering

• Linear Regression

• Logistic Regression

• Random Forest

algorithms extended based on customer use cases

Page 45: Cascading concurrent   yahoo lunch_nlearn

CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK

45

• Cascading 3.0 will ease application migration to Spark

• Enterprises can standardize on one API to meet business challenges and solve a variety of business problems ranging from simple to complex, regardless of latency or scale

• Third party products, data apps, frameworks and dynamic programming languages on Cascading will immediately benefit from this portability

• Even more operational visibility from development through production with Driven

Page 46: Cascading concurrent   yahoo lunch_nlearn

BUSINESSES DEPEND ON US

46

• Estimate suicide risk from what people write online

• Cascading + Cassandra

• You can do more than optimize ad yields

• http://www.durkheimproject.org