Hadoop 2.0: YARN to Further Optimize Data Processing

Upload: hortonworks

Post on 21-Nov-2014

Category:

Data & Analytics


DESCRIPTION

Data is exponentially increasing in both types and volumes, creating opportunities for businesses. Watch this video and learn from three Big Data experts: John Kreisa, VP Strategic Marketing at Hortonworks; Imad Birouty, Director of Technical Product Marketing at Teradata; and John Haddad, Senior Director of Product Marketing at Informatica. Multiple systems are needed to exploit the variety and volume of data sources, including a flexible data repository.

Learn more about:

- Apache Hadoop 2 and YARN
- Data Lakes
- Intelligent data management layers needed to manage metadata and usage patterns as well as track consumption across these data platforms

TRANSCRIPT

Page 1: Hadoop 2.0: YARN to Further Optimize Data Processing

© Hortonworks Inc. 2014

Quick Housekeeping

•  Q&A box is available for your questions

•  Webinar will be recorded

•  Thank you for joining!

Page 2: Hadoop 2.0: YARN to Further Optimize Data Processing

Hadoop 2.0: YARN to Further Optimize Data Processing

Page 3: Hadoop 2.0: YARN to Further Optimize Data Processing

Your Speakers

John Kreisa, VP Strategic Marketing, Hortonworks

Imad Birouty, Director, Technical Product Marketing, Teradata

John Haddad, Senior Director, Product Marketing, Informatica

Page 4: Hadoop 2.0: YARN to Further Optimize Data Processing

John Kreisa, VP Strategic Marketing, Hortonworks @marked_man

Page 5: Hadoop 2.0: YARN to Further Optimize Data Processing

Big Data Market Trends and Predictions

Big Data Explosion

•  % by which organizations leveraging modern information management systems outperform peers by 2015

•  Hadoop-enabled DBMSs

•  85% from new data types

•  50x data growth, 2010 to 2020

•  1 Zettabyte (ZB) = 1 Billion TBs

•  15x growth rate of machine-generated data by 2020

•  The US has 1/3 of the world’s data

•  Big Data is 1 of 5 US GDP Game Changers: $325 billion incremental annual GDP from big data analytics in retail and manufacturing by 2020
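
The zettabyte equivalence on this slide is plain unit arithmetic. A minimal sketch, assuming decimal (SI) prefixes, which the slide's round "1 billion" figure implies but does not state; the constant and function names are illustrative only:

```python
# Decimal (SI) byte units. This is an assumption: the slide does not say
# whether it means decimal or binary prefixes.
TERABYTE = 10**12   # bytes
ZETTABYTE = 10**21  # bytes

# Number of terabytes in one zettabyte (an exact integer under SI prefixes).
TB_PER_ZB = ZETTABYTE // TERABYTE


def zettabytes_to_terabytes(zb):
    """Convert a size in zettabytes to terabytes."""
    return zb * TB_PER_ZB


if __name__ == "__main__":
    # One zettabyte expressed in terabytes.
    print(zettabytes_to_terabytes(1))
```

Under that assumption one zettabyte comes out to 10^9 terabytes, which matches the slide's "1 Billion TBs".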

Page 6: Hadoop 2.0: YARN to Further Optimize Data Processing

Existing systems under pressure

APPLICATIONS

DATA SYSTEM

REPOSITORIES

SOURCES

Existing Sources (CRM, ERP, Clickstream, Logs)

RDBMS, EDW, NoSQL

Business Analytics

Custom Applications

Packaged Applications

Source: IDC

2.8 ZB in 2012

85% from New Data Types

15x Machine Data by 2020

40 ZB by 2020

OLTP, ERP, CRM Systems

Unstructured documents, emails

Clickstream

Server logs

Sentiment, Web Data

Sensor, Machine Data

Geolocation
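
The two IDC volume figures on this slide (2.8 ZB in 2012, 40 ZB by 2020) imply an average yearly growth rate. A minimal sketch of that calculation, assuming smooth compound growth between the two data points; the growth rate itself is a derived number that does not appear in the slides, and the function name is illustrative:

```python
def compound_annual_growth_rate(start, end, years):
    """Return the constant yearly growth rate that turns `start` into `end`
    over `years` years, as a fraction (0.25 means 25% per year)."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must all be positive")
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    # Figures from the slide: 2.8 ZB in 2012, 40 ZB by 2020.
    rate = compound_annual_growth_rate(2.8, 40.0, 2020 - 2012)
    print(f"Implied growth per year: {rate:.1%}")
```

With these inputs the implied rate lands a little under 40% per year, which gives a sense of how quickly the repositories on this slide come under pressure.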

Page 7: Hadoop 2.0: YARN to Further Optimize Data Processing

Hadoop with YARN Complements Existing Architecture

OPERATIONS TOOLS

Provision, Manage & Monitor

DEV & DATA TOOLS

Build & Test

DATA SYSTEM

REPOSITORIES

SOURCES

RDBMS, EDW, NoSQL

OLTP, ERP, CRM Systems

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

APPLICATIONS

Business Analytics

Custom Applications

Packaged Applications

YARN: Data Operating System

1 ... N

HDFS (Hadoop Distributed File System)

Interactive, Real-Time, Batch

Page 8: Hadoop 2.0: YARN to Further Optimize Data Processing

Hadoop: Typically used for new analytic apps

SCALE

SCOPE

New Analytic Apps: New types of data, LOB-driven

Page 9: Hadoop 2.0: YARN to Further Optimize Data Processing

Unlock Value in New Types of Data

1.  Social: Understand how people are feeling and interacting – right now

2.  Clickstream: Capture and analyze website visitors’ data trails and optimize your website

3.  Sensor/Machine: Discover patterns in data streaming from remote sensors and machines

4.  Geographic: Analyze location-based data to manage operations where they occur

5.  Server Logs: Diagnose process failures and prevent security breaches

6.  Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents

+ Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long-term trends and previously hidden value

Page 10: Hadoop 2.0: YARN to Further Optimize Data Processing

New Analytic Applications on Hadoop

Industry | Use Case | Type of Data
Financial Services | New Account Risk Screens | Text, Server Logs
Financial Services | Trading Risk | Server Logs
Financial Services | Insurance Underwriting | Geographic, Sensor, Text
Telecom | Call Detail Records (CDRs) | Machine, Geographic
Telecom | Infrastructure Investment | Machine, Server Logs
Telecom | Real-time Bandwidth Allocation | Server Logs, Text, Social
Retail | 360° View of the Customer | Clickstream, Text
Retail | Localized, Personalized Promotions | Geographic
Retail | Website Optimization | Clickstream
Manufacturing | Supply Chain and Logistics | Sensor
Manufacturing | Assembly Line Quality Assurance | Sensor
Manufacturing | Crowdsourced Quality Assurance | Social
Healthcare | Use Genomic Data in Medical Trials | Structured
Healthcare | Monitor Patient Vitals in Real-Time | Sensor
Pharmaceuticals | Recruit and Retain Patients for Drug Trials | Social, Clickstream
Pharmaceuticals | Improve Prescription Adherence | Social, Unstructured, Geographic
Oil & Gas | Unify Exploration & Production Data | Sensor, Geographic & Unstructured
Oil & Gas | Monitor Rig Safety in Real-Time | Sensor, Unstructured
Government | ETL Offload in Response to Federal Budgetary Pressures | Structured
Government | Sentiment Analysis for Government Programs | Social

Page 11: Hadoop 2.0: YARN to Further Optimize Data Processing

Hadoop: YARN Driven MDA Leads to a Data Lake

SCALE

SCOPE

A Modern Data Architecture/Data Lake

New Analytic Apps: New types of data, LOB-driven

RDBMS

MPP

EDW

Data Lake: An architectural shift in the data center that uses Hadoop to deliver deeper insight across a large, broad, diverse set of data at efficient scale

YARN: Data Operating System

1 ... N

HDFS (Hadoop Distributed File System)

Interactive, Real-Time, Batch

Page 12: Hadoop 2.0: YARN to Further Optimize Data Processing

Integrating with Existing Investments

APPLICATIONS

DATA SYSTEM

SOURCES

RDBMS, EDW, MPP

Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

BusinessObjects BI

OPERATIONAL TOOLS

DEV & DATA TOOLS

Existing Sources (CRM, ERP, Clickstream, Logs)

INFRASTRUCTURE

YARN: Data Operating System

1 ... N

HDFS (Hadoop Distributed File System)

Interactive, Real-Time, Batch

SOURCES

OLTP, ERP, CRM Systems

Documents, Emails

Web Logs, Click Streams

Social Networks

Machine Generated

Sensor Data

Geolocation Data

Viewpoint

Page 13: Hadoop 2.0: YARN to Further Optimize Data Processing

Imad Birouty, Director, Technical Product Marketing, Teradata

Page 14: Hadoop 2.0: YARN to Further Optimize Data Processing

Analysts Recommend: Shift from a Single Platform to an Ecosystem

“We will abandon the old models based on the desire to implement for high-value analytic applications.”

"Logical" Data Warehouse

Page 15: Hadoop 2.0: YARN to Further Optimize Data Processing

Math and Stats

Data Mining

Business Intelligence

Applications

Languages

Marketing

ANALYTIC TOOLS & APPS

USERS

DISCOVERY PLATFORM

DATA WAREHOUSE

ERP

SCM

CRM

Images

Audio and Video

Machine Logs

Text

Web and Social

SOURCES

DATA PLATFORM

ACCESS MANAGE MOVE

UNIFIED DATA ARCHITECTURE

Marketing Executives

Operational Systems

Frontline Workers

Customers Partners

Engineers

Data Scientists

Business Analysts

Fast Loading

Filtering and Processing

Online Archival

Business Intelligence

Predictive Analytics

Operational Intelligence

Data Discovery

Path, graph, time-series analysis

Pattern Detection

Page 17: Hadoop 2.0: YARN to Further Optimize Data Processing

Data Lake Overview

•  The single source of raw, historical, and real-time operational data

•  The ability to cost-effectively explore data sets of unknown, under-appreciated, or unrecognized value

•  The reduction of LOB-specific big data environments, which reduces costs and analytical discrepancies

•  The co-location of data sets to enable light, on-the-fly integration

Page 18: Hadoop 2.0: YARN to Further Optimize Data Processing

Approaches to Data Integration

Schema on Write (Data Warehouse)

•  Well understood data

•  Relational integrity

•  Storage efficiency

Schema on Read (Data Lake)

•  Dynamic data

•  Reduced coordination

•  Human readable
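
The contrast between the two approaches on this slide can be sketched in a few lines. This is a minimal illustration, not how any particular warehouse or Hadoop tool implements it; the `SCHEMA` fields, the in-memory lists standing in for storage, and all function names are hypothetical:

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: validate before storing, reject anything that does not fit."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"fields do not match schema: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field} must be {expected.__name__}")
    table.append(record)


def write_raw(store, line):
    """Schema on read: store the raw, human-readable text untouched."""
    store.append(line)


def read_with_schema(store, fields):
    """Schema on read: parse and project only when the data is queried.
    Fields missing from a record come back as None instead of failing the load."""
    rows = []
    for line in store:
        doc = json.loads(line)
        rows.append({name: doc.get(name) for name in fields})
    return rows


if __name__ == "__main__":
    warehouse_table = []
    write_with_schema(warehouse_table, {"id": 1, "amount": 9.5})

    lake_store = []
    write_raw(lake_store, '{"id": 2, "amount": 3.0, "note": "new field"}')
    write_raw(lake_store, '{"id": 3}')
    print(read_with_schema(lake_store, ["id", "amount"]))
```

The write path of the first approach does the coordination up front, which is where the relational integrity comes from; the second defers that work to each reader, which is what allows dynamic data to land without prior agreement on structure.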

Page 19: Hadoop 2.0: YARN to Further Optimize Data Processing

The “Capture Everything” Approach

Classic Method: Structured & Repeatable Analysis

“Capture only what’s needed”

•  Business determines what questions to ask

•  IT structures the data to answer those questions

Big Data Method: Multi-structured & Iterative Analysis

“Capture in case it’s needed”

•  IT delivers a platform for storing, refining, and analyzing all data sources

•  Business explores data for questions worth answering

Page 20: Hadoop 2.0: YARN to Further Optimize Data Processing

Value from combining business data with detail data

•  Determine which cars to recall for bad battery lot
   >  Business data held in data warehouse
   >  Detailed sensor data held in data lake
   >  Query combines data
   >  Determine which cars to repair

Automobile Sensor Data Use Case

TERADATA: PRODUCTION DATA

• VINs

• Service records

• Warranty data

• DTC descriptions

HADOOP: RAW MULTI-STRUCTURED DATA

• Battery Temperature Sensor data

Chart: Battery Temperature vs. Air Temperature
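
The combined query described on this slide can be sketched as a join between the two data sets. This is a minimal sketch with made-up sample rows; the field names, the lot identifiers, and the rule "battery runs more than a threshold above air temperature" are all assumptions for illustration, since the slide does not state the actual criterion:

```python
# Hypothetical warehouse rows (business data): which battery lot each VIN received.
WAREHOUSE = [
    {"vin": "VIN001", "battery_lot": "LOT-A"},
    {"vin": "VIN002", "battery_lot": "LOT-B"},
    {"vin": "VIN003", "battery_lot": "LOT-A"},
]

# Hypothetical data-lake readings (detail data), temperatures in Celsius.
SENSOR_READINGS = [
    {"vin": "VIN001", "battery_temp_c": 58.0, "air_temp_c": 25.0},
    {"vin": "VIN001", "battery_temp_c": 61.0, "air_temp_c": 27.0},
    {"vin": "VIN002", "battery_temp_c": 59.0, "air_temp_c": 24.0},
    {"vin": "VIN003", "battery_temp_c": 31.0, "air_temp_c": 26.0},
]


def cars_to_repair(warehouse, readings, bad_lot, max_delta_c):
    """Return VINs from the bad lot whose battery ran hot relative to the air."""
    # Step 1 (warehouse): every car that received a battery from the bad lot.
    recalled = {row["vin"] for row in warehouse if row["battery_lot"] == bad_lot}
    # Step 2 (data lake): every car with at least one reading above the threshold.
    overheating = {
        r["vin"]
        for r in readings
        if r["battery_temp_c"] - r["air_temp_c"] > max_delta_c
    }
    # Step 3 (combine): only cars that are in both sets need the repair.
    return sorted(recalled & overheating)


if __name__ == "__main__":
    print(cars_to_repair(WAREHOUSE, SENSOR_READINGS, "LOT-A", 20.0))
```

Neither data set answers the question alone: the warehouse knows which cars carry the suspect lot, and the sensor data knows which batteries actually misbehave. With the sample rows above, only VIN001 is in both sets.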

Page 21: Hadoop 2.0: YARN to Further Optimize Data Processing

Customer Value Based on Social Influence Use Case

HADOOP

TERADATA ASTER DATABASE

TERADATA DATABASE

• Determine high value customers based on history

• Determine customer value based on social influence

• Determine customer sentiment

• Determine customer sphere of influence

Page 22: Hadoop 2.0: YARN to Further Optimize Data Processing

Data Optimization for the Modern Data Architecture

John Haddad, Senior Director, Product Marketing, Informatica

Page 23: Hadoop 2.0: YARN to Further Optimize Data Processing

The Big Data Journey

Data Warehouse Optimization: Optimize infrastructure for performance, cost, & scalability

Managed Data Lake: A single place to manage the supply and demand of data

Real-Time Customer Analytics: Real-time proactive customer engagement

Big Data business initiatives

IT driven, Business driven

Page 24: Hadoop 2.0: YARN to Further Optimize Data Processing

Proactive Customer Engagement

Web Logs Clickstream Data

Big Data Integration / Analytics

Streaming

Master Data Mgmt

Financial Advisors

Integration & Quality

Customer / Product Master

Customer

Customer Smartphone

Real-Time Event Processing

Visualization

Social Data / Signals

Social Data Connector

FIX, SWIFT, Market Data

Customer Portal

DATA PLATFORM

DISCOVERY PLATFORM

DATA WAREHOUSE

Page 25: Hadoop 2.0: YARN to Further Optimize Data Processing

Proactive Patient Member Engagement

Web Logs Clickstream Data

Big Data Integration / Analytics

Streaming

Care Providers

Integration & Quality

Patient Member

Patient Member Smartphone

Real-Time Event Processing

Visualization

Social Data / Signals

Social Data Connector

RFID, Patient Monitoring

Healthcare & Patient Forums

Master Data Mgmt

Member / Provider Master

DATA PLATFORM

DISCOVERY PLATFORM

DATA WAREHOUSE

Page 26: Hadoop 2.0: YARN to Further Optimize Data Processing

Unified Data Architecture

DATA PLATFORM

DISCOVERY PLATFORM

DATA WAREHOUSE

The Intelligent Data Platform

Role-Based Data Management Tools

Infrastructure Services

Data Intelligence: Metadata Meets Machine Learning

Data Infrastructure

Vibe™ Virtual Data Machine

New

Industry-Leading

Data Lake Infrastructure

Page 27: Hadoop 2.0: YARN to Further Optimize Data Processing

Data Lake Architecture: Informatica Developers are Now Hadoop Developers

Visual Development Environment

Enterprise Repositories

MDM

DATA REFINEMENT

Profile

Parse

ETL

Cleanse

Match

LOAD

SOURCE DATA

Batch

Replicate, Stream, Archive

JMS Queues

Servers & Mainframe

Files

Databases

Sensor data

Social

Apache YARN

Apache MapReduce

1 ... N

HDFS (Hadoop Distributed File System)

Apache Tez

Apache Hive SQL

DELIVER

Batch

Services, Events, Topics

DATA WAREHOUSE

Page 28: Hadoop 2.0: YARN to Further Optimize Data Processing

How do you plan to staff your Big Data projects?

4 weeks 4 days!

2X performance!

Vs.

Hadoop Hand-coders

Informatica developers

Choose tools that leverage existing skills so you can quickly staff Big Data projects

Page 29: Hadoop 2.0: YARN to Further Optimize Data Processing

How do you adopt and minimize the impact of new and rapidly changing technologies?

Choose a platform and tools that minimize the need to rebuild your data pipeline as technologies change

Hadoop

Cloud DI Servers Data Warehouse

Development Deployment

Page 30: Hadoop 2.0: YARN to Further Optimize Data Processing

Time to Deploy

How long does it take you to deploy Big Data projects to production?

Maximize Reuse

Available 24x7

Scale Performance

Flexible to Change

Easy to Maintain

Automatically Deploy

Everything you build in the sandbox should be immediately deployable as enterprise-ready production

Page 31: Hadoop 2.0: YARN to Further Optimize Data Processing

Next Steps

Try the free Informatica Big Data Edition 60-Day Trial: http://marketplace.informatica.com/bdehortonworks

Download the Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/

Download Teradata Express: http://downloads.teradata.com/download/database

Download Aster Express: http://downloads.teradata.com/download/aster/aster-express