Accelerate Big Data Application Development with Cascading and HDP


DESCRIPTION

Accelerate Big Data Application Development with Cascading and HDP, webinar hosted by Hortonworks and Concurrent. Visit Hortonworks.com/webinars to access the recording.

TRANSCRIPT

Page 1: Accelerate Big Data Application Development with Cascading and HDP, Hortonworks and Concurrent webinar 4-22-2014


Accelerate Big Data Application Development with Cascading and HDP

April 22, 2014

Page 2

Agenda

•  Take advantage of the latest Hadoop processing frameworks like YARN and Tez in HDP 2.1

•  How developers can create future-proof, data-driven applications built on Apache Hadoop with Cascading

•  How Cascading accelerates Hadoop application development by abstracting the platforms underneath

Page 3

Speakers

Ajay Singh, Director of Technical Channels, Hortonworks

Supreet Oberoi, VP of Field Engineering, Concurrent

Page 4

Our Mission: Enable your Modern Data Architecture by delivering Enterprise Apache Hadoop

Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process

Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind

Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills

Headquartered in Palo Alto, CA; 300+ employees and growing

Page 5

A data architecture under pressure from new data

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)

SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)

•  2.8 ZB in 2012

•  85% from New Data Types

•  15x Machine Data by 2020

•  40 ZB by 2020

Source: IDC

•  OLTP, ERP, CRM Systems

•  Unstructured documents, emails

•  Clickstream

•  Server logs

•  Sentiment, Web Data

•  Sensor, Machine Data

•  Geolocation

Page 6

A Modern Data Architecture

APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications

DEV & DATA TOOLS: BUILD & TEST

OPERATIONAL TOOLS: MANAGE & MONITOR

DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP); ENTERPRISE HADOOP (Governance & Integration, Data Access, Data Management, Security, Operations)

SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

Page 7

Hadoop Value: New types of data

•  Clickstream: Capture and analyze website visitors’ data trails and optimize your website

•  Sensors: Discover patterns in data streaming automatically from remote sensors and machines

•  Server Logs: Research logs to diagnose process failures and prevent security breaches

•  Sentiment: Understand how your customers feel about your brand and products – right now

•  Geographic: Analyze location-based data to manage operations where they occur

•  Unstructured: Understand patterns in files across millions of web pages, emails, and documents

Page 8

Enterprise Hadoop: Core Foundation of Hadoop Applications

Page 9

Core Capabilities of Enterprise Hadoop

•  DATA MANAGEMENT: Store and process all of your Corporate Data Assets

•  DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)

•  GOVERNANCE & INTEGRATION: Load data and manage according to policy

•  SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection

•  OPERATIONS: Deploy and effectively manage the platform

•  PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization

•  ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop

•  DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud

Page 10

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

•  GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)

•  DATA ACCESS: Batch (MapReduce); Script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); Stream (Storm); Search (Solr); Others (In-Memory Analytics, ISV engines)

•  YARN: Data Operating System

•  DATA MANAGEMENT: HDFS (Hadoop Distributed File System), nodes 1 to N

•  SECURITY: Authentication, Authorization, Accounting, Data Protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox

•  OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)

•  Deployment Choice: Linux, Windows, On-Premise, Cloud

Page 11

Hadoop is wholly integrated into the data center

APPLICATIONS: BusinessObjects BI

DEV & DATA TOOLS

OPERATIONAL TOOLS

DATA SYSTEM: RDBMS, EDW, MPP, HANA; HDP 2.1 (Governance & Integration, Data Access, Data Management, Security, Operations)

INFRASTRUCTURE

SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

Page 12

Developing Apps on Hadoop

•  Spring XD Framework

   –  Consistent configuration & Java API across wide range of Hadoop ecosystem projects

•  Microsoft .NET SDK For Hadoop

   –  API access to HDP on Windows and HDInsight service

   –  LINQ libraries for accessing Hive

•  Cascading

   –  Delivers an easy to use abstraction layer for developing Hadoop applications

   –  Supports development in Scala & Clojure

   –  Hortonworks to Certify, Support & Deliver Cascading SDK with Hortonworks Data Platform

Page 13

DRIVING INNOVATION THROUGH DATA

ACCELERATE BIG DATA APPLICATION DEVELOPMENT WITH CASCADING AND HDP

Supreet Oberoi | April 22, 2014

VP Field Engineering, Concurrent Inc

Page 14

HORTONWORKS PARTNERS WITH CONCURRENT

• The Cascading SDK will now be integrated with the Hortonworks Data Platform (HDP)

• Hortonworks will certify and support Cascading™ SDK with HDP

• Cascading will support Apache Tez; companies using Cascading or domain-specific languages on Cascading can seamlessly migrate to HDP supporting Apache Tez

The partnership benefits users by combining the power and simplicity of Cascading with the reliability and stability of HDP.

Page 15

AGENDA

• Who is Concurrent

• What is Cascading

• Where is it used

• What problems does Cascading solve

• What is included in the Cascading kit

Page 16

ABOUT CONCURRENT, INC.

Page 17

GET TO KNOW CONCURRENT

Leader in Application Infrastructure for Big Data

• Building enterprise software to simplify Big Data application development and management

Products and Technology

• CASCADING: The most widely used application infrastructure for building Big Data applications with over 150,000 downloads each month

• DRIVEN: Enterprise Data Application management for Big Data apps

Proven - Simple, Reliable, Robust

• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.

Founded: 2008

HQ: San Francisco, CA

CEO: Gary Nakamura

CTO, Founder: Chris Wensel

www.concurrentinc.com

Page 18

PRODUCTS AND TECHNOLOGY

Open Source

• Big Data Application Development: Simple, Reliable, Repeatable

• Open Source Community: Focused on Data App Development; project home of Cascading; collection of sub-projects / tools

Commercial

• Unmatched Application Insight: Visibility into your Data Applications

• Data App Management: Realtime monitoring, Performance Management, Operational Control, Data Provenance, Compliance, Governance

www.concurrentinc.com/products

Page 19

BUSINESSES DEPEND ON US

• Cascading Java API

• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts

• Easy to operationalize heavy lifting of data

Page 20

BUSINESSES DEPEND ON US

• Cascalog (Clojure)

• Weather pattern modeling to protect growers against loss

• ETL against 20+ datasets daily

• Machine learning to create models

• Purchased by Monsanto for $930M US

Page 21

BUSINESSES DEPEND ON US

• Scalding (Scala)

• Machine learning (linear algebra) to improve

• User experience

• Ad quality (matching users and ad effectiveness)

• All revenue applications are running on Cascading/Scalding

• IPO

TWITTER

Page 22

BUSINESSES DEPEND ON US

• Estimate suicide risk from what people write online

• Cascading + Cassandra

• You can do more than optimize ad yields

• http://www.durkheimproject.org

Page 23

CASCADING DEPLOYMENTS

Page 24

DRIVING ADVANTAGE WITH DATA APPLICATIONS

• Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis

• Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting

• Telecom: Data processing of Open Data; Geospatial Indexing; Consumer Mobile Apps; Location based services

• Marketing / Retail: Mobile, Social, Search Analytics; Funnel analysis; Revenue attribution; Customer experiments; Ad Optimization; Retail recommenders

• Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast

• Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric

• Health / Biotech: Aggregate metrics for Govt; Person biometrics; Veterinary diagnostics; Next-Gen Genomics; Agronomics; Environmental Maps

Page 25

BIG DATA — THE NEXT PHASE OF MATURITY

“It’s all about the Apps”

There needs to be a comprehensive solution for building, deploying, running and managing this new class of enterprise applications

Business Strategy: Loyalty and promotions analysis; Retention campaigns; Marketing campaign optimization; Fraud detection; Risk management; Scientific research; Remote monitoring and diagnosis; and more

Data & Technology: Your Data & Systems (Hadoop, EDW, Mainframe, System Logs, NoSQL DBs, etc.)

Challenges: Leveraging existing skill sets, existing systems, past investments and existing business processes

Connecting Business and Data

Page 26

PRODUCTS OVERVIEW

Page 27

CASCADING

• Java API (alternative to Hadoop MapReduce)

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

Architecture diagram, top to bottom:

• Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy)

• Cascading: Processing API, Integration API, Scheduler API; Process Planner; Scheduler

• Apache Hadoop; Data Stores

Page 28

KEY CASCADING CONCEPTS

Tap

Page 29

KEY CASCADING CONCEPTS

Pipe

Flow

Page 30

SOME COMMON PATTERNS

• Functions

• Filters

• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical

• Merge (Union)

• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)

• Aggregations
  ‣ Count, Average, etc
  ‣ Rolling windows

Topology diagram labels: data, function, filter, Pipeline, Split, Join, Merge
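
As a rough illustration of three of the patterns named on this slide (filter followed by function, merge, and inner join), here is a minimal plain-Java sketch. It does not use the Cascading API; the class `PatternSketch` and its method names are hypothetical, and the join is simplified to at most one value per key on each side.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical analogy in plain Java; not Cascading code.
class PatternSketch {
    // Filter then function: drop blank records, then transform each survivor.
    static List<String> filterThenMap(List<String> records) {
        return records.stream()
            .filter(r -> !r.trim().isEmpty())
            .map(String::toUpperCase)
            .collect(Collectors.toList());
    }

    // Merge (union): append two streams that share the same record shape.
    static List<String> merge(List<String> a, List<String> b) {
        return Stream.concat(a.stream(), b.stream()).collect(Collectors.toList());
    }

    // Inner join on key: keep only keys present on both sides.
    // Simplification: each side holds at most one value per key.
    static List<String> innerJoin(Map<String, String> left, Map<String, String> right) {
        return left.entrySet().stream()
            .filter(e -> right.containsKey(e.getKey()))
            .map(e -> e.getKey() + "," + e.getValue() + "," + right.get(e.getKey()))
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filterThenMap(List.of("a", " ", "b")));
        System.out.println(merge(List.of("x"), List.of("y", "z")));
        System.out.println(innerJoin(
            Map.of("u1", "alice", "u2", "bob"),
            Map.of("u2", "order-7", "u3", "order-9")));
    }
}
```

An outer join would differ only in keeping keys that appear on one side, with a placeholder for the missing value.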

Page 31

WORD COUNT EXAMPLE

// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration: create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing: specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling: connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
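
To make the processing stage of the word count flow easier to follow, the sketch below computes the same kind of result (split each text line on the slide's delimiter regex, group equal tokens, count each group) in plain Java on an in-memory list. It is an analogy only, not Cascading code: the class `WordCountSketch` and method `countTokens` are hypothetical, and dropping empty tokens is an assumption of this sketch that the slide's flow does not make explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical plain-Java analogy of the split -> group -> count steps.
class WordCountSketch {
    // Same delimiter regex as the one passed to RegexSplitGenerator on the slide.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Split every line into tokens, group equal tokens, and count each group.
    static Map<String, Long> countTokens(List<String> textLines) {
        return textLines.stream()
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            .filter(token -> !token.isEmpty()) // assumption: empty tokens are dropped
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("rain shadow (dry area)", "rain, rain.");
        // Prints the tokens in sorted order with their counts.
        System.out.println(countTokens(lines));
    }
}
```

The in-memory list stands in for the source tap and the returned map for the sink tap; in the Cascading version those endpoints are swapped without touching the processing logic.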

Page 32

CASCADING OVERVIEW

Proven application development framework for building Data applications

Build Data Apps that are scale-free: Design principles ensure best practices at any scale

Framework addresses:

• Staffing Bottleneck: Use existing Java, SQL, modeling skill sets

• Test-Driven Development: Efficiently test code and process local files before you deploy on a cluster

• Operational Complexity: Simple - Package up into one jar and hand to operations

• Application Portability: Write once, then run on different computation fabrics.

• Systems Integration: Hadoop never lives alone. Easily integrate to your existing systems

www.cascading.org

Page 33

OPERATIONAL READINESS: DISCIPLINE & ABILITY TO MEASURE

• Visibility into app development

• Business SLA

• Balance & Controls

• Application testing

• Data quality

• Process to “productionalize” apps

• High fidelity execution analysis

• Real-time monitoring

• …

Page 34

PRODUCTS AND TECHNOLOGY

Open Source

• Big Data Application Development: Simple, Reliable, Repeatable

• LINGUAL: Simplifying Systems Integration

• PATTERN: Enabling Machine Scoring Algorithms

Commercial

• Unmatched Application Insight: Visibility into your Data Applications

www.concurrentinc.com/products

Page 35

CASCADING ECOSYSTEM IS MORE THAN CASCADING FRAMEWORK

Lingual, Pattern and other Dynamic Programming Languages such as Scalding are part of the Cascading Ecosystem and are included as part of the Cascading kit

http://www.cascading.org/extensions/

Page 36

LINGUAL

• Lingual is an extension to Cascading that executes ANSI SQL queries as Cascading apps

• Supports integrating with any data source that can be accessed through JDBC; a Cascading Tap can be created for any source supporting JDBC

• Great for migration of data, integrating with non-Big Data assets; extends life of existing IT assets in an organization

Architecture diagram, top to bottom:

• CLI / Shell; Enterprise Java

• Lingual: JDBC API, Lingual API, Provider API; Query Planner; Catalog

• Cascading

• Apache Hadoop; Data Stores

Page 37

SCALDING

• Scalding is a language binding to Cascading for Scala

  - The name Scalding comes from the combining of SCALa and cascaDING

• Scalding is great for Scala developers; can crisply write constructs for matrix math…

• Scalding has very large commercial deployments at:

  - Twitter - Use cases such as the revenue quality team, ad targeting and traffic quality

  - eBay - Use cases include search analytics and other production data pipelines

Page 38

DRIVEN OVERVIEW

What is Driven? The first application performance management product for Big Data applications

Capabilities

• Visualize your Data App: No more black box! Instantly visualize your running app in real-time

• Diagnose App Failures: Identify where and how your app failed… all without sorting through logs!

• Track App Performance: For all your apps, view and compare history of your app’s runtime performance

• Insight into your Applications: At any moment, quickly understand what your app is doing on your cluster

Benefits

• Accelerate Time to Market

• Build Reliable Applications

• Optimize Application Performance

Key Features

• Application visualization

• Dashboard performance view

• Application performance history

• Insights for each application (workflow, telemetry, error types)

• Team collaboration and management

Works with: LINGUAL, PATTERN, SCALDING, CASCALOG

www.cascading.io

Page 39

Driven is free for developer use (cloud)

Page 40

CASCADING AVAILABILITY

              Cascading                        Lingual                          Pattern
Availability  Cascading 2.5, Available Now     Lingual 1.1, Available Now       Pattern 1.0-WIP, Available Now
License       Apache License 2.0               Apache License 2.0               Apache License 2.0
Support       Community Forums & Mailing       Community Forums & Mailing       Community Forums & Mailing
              List, Enterprise Support         List, Enterprise Support         List, Enterprise Support

Cascading, Lingual and Pattern are open source projects freely available to the general public under Apache License 2.0

Page 41

Summary

• APM for Big Data | The first application performance management product for Big Data applications

• For Developers and Operators | Significantly improves developer productivity and operations control by providing an unprecedented level of insight into building and managing enterprise-grade data applications

• Collaboration | Facilitates and encourages user collaboration to build enterprise data applications

• Community Integration | Driven is a free cloud service integrated with the Cascading open source community

• Licensing | Driven is free for development (cloud only) and licensable for production or on-premise deployments

• Deployment Options | Deploy in the cloud or on-premise

Accelerate Time to Market: Process visualization and monitoring capabilities in a rich UI

Build Reliable Apps: Detailed insight into data processing logic and algorithms

Optimize App Performance: Key application behavior metrics with historical data to trend performance

Page 42

GET STARTED WITH CASCADING ON HDP 2.1

1. Download HDP 2.1

2. Take Cascading for a spin by running the Impatient tutorial at http://docs.cascading.org/impatient/

Page 43

CONTACT INFORMATION

Supreet [email protected]

650-868-7675 (m) @supreet_online

Page 44

DRIVING INNOVATION THROUGH DATA

THANK YOU

Supreet Oberoi | April 18, 2014

Page 45

The Largest Hadoop Community Events in Europe and North America

AMSTERDAM April 2-3

SAN JOSE June 3-5

•  6 tracks, 3 days, and 120+ sessions to choose from

•  Community Focused - Sessions voted on by the public and selected by a committee of industry luminaries

•  Deep Dive Technical Content - Including a Committer track with content presented by Apache committers

•  Business and Technical Topics

•  Community Activities - Hadoop Summit will host community meet-ups and birds of a feather sessions

www.hadoopsummit.org

Page 46

Questions? Use the Q/A panel to ask your questions

Download the Hortonworks Sandbox and Cascading

•  Cascading and HDP 2.1 Sandbox

•  Hortonworks Sandbox

•  Cascading Impatient Tutorial