cascading concurrent yahoo lunch_nlearn

DRIVING INNOVATION THROUGH DATA

BUILDING PRODUCTION-GRADE HADOOP APPLICATIONS WITH CASCADING

Supreet Oberoi, VP Field Engineering, Concurrent Inc

ABOUT ME

• I am a Data Engineer, not a Data Scientist

• I help Enterprises make decisions about their “Big Data” roadmap and technical strategy: use cases, products, technology decisions, employee skills

• I design Hadoop applications with the intent to operationalize them in Enterprise settings: applications on which businesses depend, and which last longer than the technologies underneath them…

• This talk is about learning how to design your BD strategy that leverages best

BUILDING AN OPEN PLATFORM IS KEY TO PREVENTING LOCK-IN

• Open Language

• Open Data

• Open Hardware

• Open Compute Platform

• Open Development Platform

OPEN LANGUAGES ALLOW YOU TO HARNESS THE TALENT OF YOUR ENTERPRISE

• Don’t equate architecture with language; develop architecture to support multiple

• Support SQL and SQL-like languages

• Encourage development in proven & scalable languages such as Java

• Develop architecture to support change of programming languages (even for same app)

• Have common performance-management tools across all programming environments

OPEN DATA ENABLES REUSE OF DATA AND APPS

• Prevent exclusive access to data sets through proprietary tools

• Promote a common meta-data repository

• Forbid storing data in proprietary formats

• Build seamless integration capabilities

Develop a common operating picture by promoting reuse with open data

OPEN HARDWARE PROMOTES REUSE OF INFRASTRUCTURE

• Get commodity hardware — commodity hardware will always cost less than “optimized” specialized hardware (note: definition of “specialized” is up for debate)

• Develop and maintain a cluster that can be reused by different applications and technology stacks — avoid custom software installations on the cluster, or setting up dedicated clusters for given tech stacks

• Harness the power of collective from the cluster — avoid fragmenting the cluster if possible

OPEN COMPUTE PLATFORM MAKES YOU SELECT THE RIGHT TOOL FOR THE PROBLEM

• Ensure that moving your application from one Hadoop compute platform (e.g. MapReduce) to another (e.g., Tez) does not:

• impact application code

• impact production-monitoring tools

• Resist compute platforms that require your enterprise to acquire significantly new skills (even if it is easy) to become productive

• Avoid new platforms that partition the cluster

• Avoid platforms that do not support Open Data

Make tradeoffs between reliability & speed based on your business context
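One way to read the first bullet on this slide is as a dependency rule: application logic is written against a small abstraction, and the compute platform is chosen when the app is launched. The sketch below is plain Java and purely illustrative. `Fabric`, `LocalFabric`, `ParallelFabric`, and `totalLength` are hypothetical names invented for this example; they are not Cascading or Hadoop APIs.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical abstraction over a compute platform; not a Cascading or Hadoop API. */
interface Fabric {
    <I, O> List<O> map(List<I> input, Function<I, O> fn);
}

/** Runs the work sequentially in the calling thread. */
final class LocalFabric implements Fabric {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }
}

/** Runs the same work on a parallel stream; stands in for a second platform. */
final class ParallelFabric implements Fabric {
    @Override
    public <I, O> List<O> map(List<I> input, Function<I, O> fn) {
        return input.parallelStream().map(fn).collect(Collectors.toList());
    }
}

final class PortableApp {
    /** Application code: it depends only on the Fabric interface. */
    static int totalLength(Fabric fabric, List<String> lines) {
        return fabric.map(lines, String::length).stream()
            .mapToInt(Integer::intValue)
            .sum();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("open data", "open hardware");
        // The same application method runs unchanged on either fabric.
        System.out.println(totalLength(new LocalFabric(), lines));
        System.out.println(totalLength(new ParallelFabric(), lines));
    }
}
```

Swapping `LocalFabric` for `ParallelFabric` touches only the launch code, which is the property the slide asks of a platform migration.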

OPEN DEVELOPMENT PLATFORM PROVIDES LONG-TERM SUSTAINABILITY

• Invest in picking the correct development platform — open, easy, scalable, popular, tools, …

• Bet on a sustainable open source platform

• Measure the vitality of the community:

• number of downloads, extensions (living ecosystem), extensible architecture, consumers of the technology, code stability…

Development platforms improve developer productivity and operational excellence — picking a correct platform gives you best practices developed by the community, achieving higher quality

A proven platform provides tools to get your apps to production

GET TO KNOW CONCURRENT

Leader in Application Infrastructure for Big Data

• Building enterprise software to simplify Big Data application development and management

Products and Technology

• CASCADING: Open Source - The most widely used application infrastructure for building Big Data apps with over 175,000 downloads each month

• DRIVEN: Enterprise data application management for Big Data apps

Proven — Simple, Reliable, Robust

• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.

Founded: 2008 HQ: San Francisco, CA

CEO: Gary Nakamura CTO, Founder: Chris Wensel

www.concurrentinc.com

BIG DATA — OPERATIONALIZE YOUR DATA APPS WITH CASCADING

“It’s all about the apps”

There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications.

Business Strategy Data & Technology

Challenges: Skill sets, systems integration, standard op procedure and operational visibility

Connecting Business and Data

DATA APPLICATIONS - ENTERPRISE NEEDS

Enterprise Data Application Infrastructure

• Need reliable, reusable tooling to quickly build and consistently deliver data products

• Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets

• Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application

• Need operational visibility for entire data application lifecycle

Cascading Apps

CASCADING - DE-FACTO FOR DATA APPS

New Fabrics

Clojure, SQL, Ruby

Storm, Tez

System Integration

Mainframe, DB / DW, Data Stores, Hadoop, In-Memory

• Standard for enterprise data app development

• Your programming language of choice

• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …

WORD COUNT EXAMPLE WITH CASCADING

configuration

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

integration

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

processing

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

scheduling

// connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
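For readers new to the pipe model, it can help to see what the flow computes without a cluster. The following is a plain-Java, in-memory sketch of the same three steps (split each line into tokens, group by token, count each group). It assumes the split pattern is meant to break on spaces, brackets, parentheses, commas, and periods. `WordCountSketch` is a hypothetical name, and dropping empty tokens is a simplification of this sketch rather than something the slide's flow is shown to do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class WordCountSketch {
    // Assumed delimiter class: space, brackets, parentheses, comma, period.
    private static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    /** Splits each line into tokens, groups by token, and counts each group. */
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            // split step: one line in, many tokens out
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            // simplification in this sketch: drop empty tokens between adjacent delimiters
            .filter(token -> !token.isEmpty())
            // group-by-token and count steps, sorted by token for stable output
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("rain shadow (dry area)", "rain, rain.")));
    }
}
```

The in-memory version has no taps or flow definition; those are the parts that let the same logic read from and write to cluster storage.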

• Functions

• Filters

• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical

• Merge (Union)

• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)

• Aggregations
  ‣ Count, Average, etc.
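A few of these operations can be illustrated on small in-memory collections. The sketch below uses only the Java standard library; `PatternsSketch` and its method names are hypothetical, and the code is meant to show the semantics of filter, function, inner join, merge, and unique, not how Cascading implements them.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PatternsSketch {
    /** Filter then function: keep non-blank names and normalize them. */
    static List<String> cleanNames(List<String> names) {
        return names.stream()
            .filter(n -> !n.isBlank())                      // filter: drops records
            .map(n -> n.trim().toLowerCase(Locale.ROOT))    // function: transforms records
            .collect(Collectors.toList());
    }

    /** Inner join: emit "user:country" only for ids present on both sides. */
    static List<String> innerJoin(Map<Integer, String> users, Map<Integer, String> countries) {
        return users.entrySet().stream()
            .filter(e -> countries.containsKey(e.getKey()))
            .map(e -> e.getValue() + ":" + countries.get(e.getKey()))
            .sorted()
            .collect(Collectors.toList());
    }

    /** Merge (union) of two streams followed by unique (distinct). */
    static List<String> mergeDistinct(List<String> left, List<String> right) {
        return Stream.concat(left.stream(), right.stream())
            .distinct()
            .collect(Collectors.toList());
    }
}
```

An outer join would differ from `innerJoin` only in keeping ids that appear on one side, with a placeholder for the missing value.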

SOME COMMON PATTERNS

[Diagram: data flowing through function and filter steps, arranged as Pipeline, Split, Join, and Topology patterns]

• Java API

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

CASCADING

[Architecture diagram: Enterprise Java and Scripting (Scala, Clojure, JRuby, Jython, Groovy) on top of Cascading (Processing API, Integration API, Scheduler API, Process Planner, Scheduler), which runs on Apache Hadoop and Data Stores]

THE STANDARD FOR DATA APPLICATION DEVELOPMENT

www.cascading.org

Proven application development framework for building data apps

Application platform that addresses:

• Build data apps that are scale-free: Design principles ensure best practices at any scale

• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster

• Staffing Bottleneck: Use existing Java, Scala, SQL, modeling skill sets

• Operational Complexity: Simple. Package up into one jar and hand to operations

• Application Portability: Write once, then run on different computation fabrics

• Systems Integration: Hadoop never lives alone. Easily integrate with existing systems

STRONG ORGANIC GROWTH

175,000+ downloads / month

7000+ Deployments

CASCADING DATA APPLICATIONS

Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis

Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting

Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services

Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders

Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast

Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric

Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps

BUSINESSES DEPEND ON US

• Cascading Java API

• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts

• Easy to operationalize heavy lifting of data in one framework

• Cascalog (Clojure)

• Weather pattern modeling to protect growers against loss

• ETL against 20+ datasets daily

• Machine learning to create models

• Purchased by Monsanto for $930M US

• Scalding (Scala)

• Makes complex analysis of very large data sets simple

• Machine learning, linear algebra to improve

• 30,000 jobs a day — this works @ scale

• Ad quality (matching users and ad effectiveness)

TWITTER

Confidential

CASCADING DEPLOYMENTS

BROAD SUPPORT

Hadoop ecosystem supports Cascading

… AND INCLUDES RICH SET OF EXTENSIONS

http://www.cascading.org/extensions/

CASCADING 3.0

“Write once and deploy on your fabric of choice.”

• The Innovation — Cascading 3.0 will allow for data apps to execute on existing and emerging fabrics through its new customizable query planner.

• Cascading 3.0 will support — Local In-Memory, Apache MapReduce and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm

[Diagram: Enterprise Data Applications running on Computation Fabrics: MapReduce, Local In-Memory, Other, Custom]

USE LINGUAL TO MIGRATE ITERATIVE ETL TASKS TO HADOOP

• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps

• Supports integrating with any data source that can be accessed through JDBC — Cascading Tap can be created for any source supporting JDBC

• Great for migration of data, integrating with non-Big Data assets — extends life of existing IT assets in an organization

[Architecture diagram: CLI / Shell and Enterprise Java on top of Lingual (JDBC API, Lingual API, Provider API, Query Planner, Catalog), which runs on Cascading over Apache Hadoop and Data Stores]

SCALDING

• Scalding is a language binding to Cascading for Scala

• The name Scalding comes from combining SCALa and cascaDING

• Scalding is great for Scala developers; can crisply write constructs for matrix math…

• Scalding has very large commercial deployments at:
  ‣ Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality
  ‣ eBay - Use cases include search analytics and other production data pipelines

PATTERN SCORES MODELS AT SCALE

• Pattern is an open source project that lets you take Predictive Model Markup Language (PMML) models and translate them into Cascading apps.

• PMML is a popular XML-based standard that allows applications to describe data mining and machine learning models

• PMML models from popular analytics frameworks can be reused and deployed within Cascading workflows
  ‣ Vendor frameworks - SAS, IBM SPSS, MicroStrategy, Oracle
  ‣ Open source frameworks - R, Weka, KNIME, RapidMiner

• Pattern is great for migrating your model scoring to Hadoop from your decision systems
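The idea of training a model in one tool and scoring it elsewhere can be illustrated without any PMML tooling. The sketch below assumes a logistic regression model whose intercept and coefficients have already been exported by a training tool. `LogisticScorerSketch`, its fields, and the treatment of missing features as zero are all assumptions made for illustration; this is neither PMML parsing nor Pattern's API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class LogisticScorerSketch {
    private final double intercept;
    private final Map<String, Double> coefficients;

    LogisticScorerSketch(double intercept, Map<String, Double> coefficients) {
        this.intercept = intercept;
        this.coefficients = coefficients;
    }

    /** Returns 1 / (1 + exp(-(intercept + sum of coefficient * feature value))). */
    double score(Map<String, Double> record) {
        double z = intercept;
        for (Map.Entry<String, Double> c : coefficients.entrySet()) {
            // Assumption of this sketch: a missing feature contributes 0.
            z += c.getValue() * record.getOrDefault(c.getKey(), 0.0);
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Scoring is a per-record function, so it can run as an ordinary map step. */
    List<Double> scoreAll(List<Map<String, Double>> records) {
        return records.stream().map(this::score).collect(Collectors.toList());
    }
}
```

Because each record is scored independently, the work spreads across a cluster in the same way as any other per-record function in a pipeline.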


PATTERN SCORES MODELS AT SCALE

Step 1: Train your model with industry-leading tools

Step 2: Score your models at scale with Pattern

OPERATIONAL EXCELLENCE WITH DRIVEN

From Development: Building and Testing

• Design & Development

• Debugging

• Tuning

To Production: Monitoring and Tracking

• Maintain Business SLAs

• Balance & Controls

• Application and Data Quality

• Operational Health

• Real-time Insights

Visibility Through All Stages of App Lifecycle

• Understand the anatomy of your Hive app

• Track execution of queries as single business process

• Identify outlier behavior by comparison with historical runs

• Analyze rich operational meta-data

• Correlate Hive app behavior with other events on cluster

DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
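As a rough illustration of what "comparison with historical runs" can mean, the sketch below flags a run whose duration exceeds the historical mean by more than k sample standard deviations. This is a generic statistical rule chosen for the example, not a description of how Driven detects outliers; `RunOutlierSketch`, `isOutlier`, and the parameter `k` are hypothetical.

```java
import java.util.List;

final class RunOutlierSketch {
    /** Flags a run that is slower than mean + k * standard deviation of past runs. */
    static boolean isOutlier(List<Double> historySeconds, double latestSeconds, double k) {
        if (historySeconds.size() < 2) {
            return false; // not enough history to estimate spread
        }
        double mean = historySeconds.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElse(0.0);
        // Sample variance (divides by n - 1).
        double variance = historySeconds.stream()
            .mapToDouble(d -> (d - mean) * (d - mean))
            .sum() / (historySeconds.size() - 1);
        return latestSeconds > mean + k * Math.sqrt(variance);
    }
}
```

A production monitor would likely also account for input size and cluster load, which this sketch ignores.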

DRIVEN ARCHITECTURE

• Easily comprehend, debug, and tune your data applications

• Get rich insights on your application performance

• Monitor applications in real-time

• Compare app performance with historical (previous) iterations

DEEPER VISUALIZATION INTO YOUR HADOOP CODE

Debug and optimize your Hadoop applications more effectively with Driven

• Quickly break down how often applications execute based on their tags, teams, or names

• Immediately identify if any application is monopolizing cluster resources

• Understand the utilization of your cluster with a timeline of all applications running

GET OPERATIONAL INSIGHTS WITH DRIVEN

Visualize the activity of your applications to help maintain SLAs

• Easily keep track of all your applications by segmenting them with user-defined tags

• Segment your applications for trending analysis, cluster analysis, and developing chargeback models

• Quickly break down how often applications execute based on their tags, teams, or names

ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY

Segment your applications for greater insights across all your applications

• Invite others to view and collaborate on a specific application

• Gain visibility to all the apps and their owners associated with each team

• Simply manage your teams and the users assigned to them

COLLABORATE WITH TEAMS

Utilize teams to collaborate and gain visibility over your set of applications

• Identify problematic apps with their owners and teams

• Search for groups of applications segmented by user-defined tags

• Compare specific applications with their previous iterations to ensure that your application can meet its SLA

MANAGE PORTFOLIO OF BIG DATA APPLICATIONS

Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for


• Cascading framework enables developers to intuitively create data applications that scale and are robust, future-proof, supporting new execution fabrics without requiring a code rewrite

• Scalding — a Scala-based extension to Cascading — provides crisp programming constructs for algorithm developers and data scientists

• Driven, an application visualization product, provides rich insights into how your applications execute, improving developer productivity by 10x

• Cascading 3.0 opens up the query planner — write apps once, run on any fabric

SUMMARY - BUILD ROBUST DATA APPS RIGHT THE FIRST TIME WITH CASCADING

Concurrent offers training classes for Cascading (DEC 9) & Scalding (NOV 4)

CONTACT INFORMATION

Supreet Oberoi

supreet@concurrentinc.com

650-868-7675 (m) @supreet_online

THANK YOU

BACKUP

DIFFERENT PHILOSOPHY THAN GUI TOOLS

• Cascading is a general-purpose framework to develop data applications; supports development through the entire lifecycle of data - from staging to final data sets

• Developing with an API is more productive and intuitive than with a UI — incorporate best-practices

• Can do in three lines of code what takes 20 clicks in an app (fluent API with IDE makes it even simpler)

• Can test locally and deploy to production without code change

• Because it is based in code: debuggable, extensible, deployable, traceable

• GUI tools do not help in visualizing the must-have insights

• Real-time application visualization, application bottlenecks, anatomy of the application, application dependencies, cost breakdown of an operation (join), bottlenecks due to code, data, network, cluster


PATTERN: ALGOS IMPLEMENTED

• Hierarchical Clustering

• K-Means Clustering

• Linear Regression

• Logistic Regression

• Random Forest

Algorithms extended based on customer use cases

CASCADING 3.0 IMPACT - DATA APP DEVELOPMENT FOR SPARK ON ROBUST FRAMEWORK

• Cascading 3.0 will ease application migration to Spark

• Enterprises can standardize on one API to meet business challenges and solve a variety of business problems ranging from simple to complex, regardless of latency or scale

• Third party products, data apps, frameworks and dynamic programming languages on Cascading will immediately benefit from this portability

• Even more operational visibility from development through production with Driven

• Estimate suicide risk from what people write online

• Cascading + Cassandra

• You can do more than optimize ad yields

• http://www.durkheimproject.org
