Falcon - Data Management Platform on Hadoop (Beyond ETL)

Data Management Platform on Hadoop

Srikanth Sundarrajan

Venkatesh Seetharam

(Incubating)

whoami

Srikanth Sundarrajan

Principal Architect, InMobi

Apache Hadoop Contributor

Hadoop Team @ Yahoo!

Venkatesh Seetharam

Architect/Developer, Hortonworks

Apache Hadoop Contributor

Data Management @ Yahoo!

Agenda

1 Motivation

2 Falcon Overview

3 Case Studies

4 Questions & Answers

MOTIVATION

Data Processing Landscape

External data source

Acquire (Import)

Data Processing (Transform/Pipeline)

Eviction Archive

Replicate(Copy)

Export

Core Services

Process

• Late data management
• Relays

Data management

• Acquisition
• Replication
• Retention

Operability

• SLA
• Lineage

Process Management – Relays

picture courtesy: http://istockphoto.com/

Late Data Management

picture courtesy: http://iwebask.com

Data Retention As Service

picture courtesy: http://vimeo.com/

Data Replication As Service

picture courtesy: http://boylesmedia.com

Data Acquisition As Service

picture courtesy: http://wmpu.org

Operability – Dashboard

picture courtesy: http://www.opentrack.ch/

FALCON OVERVIEW

Holistic Declaration of Intent

picture courtesy: http://bigboxdetox.com

Entity Dependency Graph

[Figure: entity dependency graph linking the Hadoop / HBase … cluster, the external data source, feed, and process entities with "depends" edges]
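The dependency relationships between entities can be sketched as a small graph. This is only an illustration, not Falcon code: the entity names are placeholders, and the edges (a feed depends on a cluster, a process depends on a feed and a cluster) are an assumption read off the "depends" arrows on the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical entity names. The edges are an assumption based on the slide's
# "depends" arrows: feed -> cluster, process -> feed and cluster.
DEPENDS_ON = {
    "cluster": set(),
    "feed": {"cluster"},
    "process": {"feed", "cluster"},
}


def submission_order(depends_on):
    """Return entities ordered so each one follows everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def dependents_of(entity, depends_on):
    """Return the entities that directly depend on `entity`."""
    return sorted(name for name, deps in depends_on.items() if entity in deps)


print(submission_order(DEPENDS_ON))
print(dependents_of("cluster", DEPENDS_ON))
```

An ordering like this shows why a cluster definition has to exist before the feeds that refer to it, and why changing or removing an entity means checking its dependents first.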

High Level Architecture

[Figure: high-level architecture. Components: Apache Falcon, Messaging, HCatalog, Hadoop, Config store. Interfaces and flows: CLI/REST, Entity, Entity status, Process status / notification]

Feed Schedule

[Figure: feed scheduling. Inputs: Cluster xml, Feed xml. Components: Falcon, Falcon config store / Graph, Oozie Scheduler, HDFS, Catalog service, Instance Management. Flows: Retention / Replication workflow, JMS Notification per action]

Process Schedule

[Figure: process scheduling. Inputs: Cluster/feed xml, Process xml. Components: Falcon, Falcon config store / Graph, Oozie Scheduler, HDFS, Catalog service, Instance Management. Flows: Process workflow, JMS Notification per available]

Physical Architecture

[Figure: physical architecture. Labels: Falcon Colo 1, Falcon Colo 2, Falcon Colo 3, Scheduler, Falcon – Prism (global view)]

CASE STUDY Multi Cluster Failover

Multi Cluster – Failover

> Falcon manages workflow, replication or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.

[Figure: Primary Hadoop Cluster with Staged, Cleansed, Conformed and Presented Data; Failover Hadoop Cluster with Staged and Presented Data; BI and Analytics]

Retention Policies

Staged Data: Retain 5 Years

Cleansed Data: Retain 3 Years

Conformed Data: Retain 3 Years

Presented Data: Retain Last Copy Only

> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or for data re-processing.
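As a rough illustration of keeping retention rules in one place, the policies from this slide can be written as a single table plus one eviction check. This is a sketch, not how Falcon implements retention: the dataset keys, the function name, and the 365-day year are assumptions made for the example.

```python
from datetime import date, timedelta

# Retention periods copied from the slide; None marks "retain last copy only".
# The lowercase dataset keys are hypothetical names chosen for this sketch.
RETENTION_YEARS = {
    "staged": 5,
    "cleansed": 3,
    "conformed": 3,
    "presented": None,
}


def instances_to_evict(dataset, instance_dates, today):
    """Return the instance dates of `dataset` that fall outside its policy."""
    years = RETENTION_YEARS[dataset]
    ordered = sorted(instance_dates)
    if years is None:
        # Keep only the most recent instance.
        return ordered[:-1]
    # Assumption: a year is approximated as 365 days.
    cutoff = today - timedelta(days=365 * years)
    return [d for d in ordered if d < cutoff]


today = date(2014, 1, 1)
instances = [date(2008, 6, 1), date(2010, 6, 1), date(2013, 6, 1)]
for name in RETENTION_YEARS:
    print(name, instances_to_evict(name, instances, today))
```

With every policy in the same table, changing a retention period is a one-line edit rather than a change to each pipeline's cleanup script.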

CASE STUDY Distributed Processing

Example: Digital Advertising @ InMobi

Hadoop @ InMobi

About InMobi

World's leading independent mobile advertising company

Hadoop usage at InMobi

~ 6 Clusters
> 1PB of storage
> 5TB new data ingested each day
> 20TB data crunched each day
> 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
> 175K Hadoop jobs / day
> 60K Oozie workflows / day
300+ Falcon feed definitions
100+ Falcon process definitions

Processing – Single Data Center

[Figure: single data center pipeline. Inputs: Ad Request data, Impression render event, Click event, Conversion event. Stages: Continuous Streaming (minutely), Enrichment (minutely/5 minutely), Summarizer, Hourly summary]

Global Aggregation

[Figure: global aggregation. The per data center pipeline (Ad Request data, Click event, Conversion event; Continuous Streaming (minutely), Enrichment (minutely/5 minutely), Summarizer, Hourly summary) is repeated across data centers, leading to a consumable global aggregate]

HIGHLIGHTS

Future

Security

Embed Pig/Hive scripts

Data Acquisition – file-based

Monitoring/Management Dashboard

Summary

Questions?

Apache Falcon

http://falcon.incubator.apache.org

mailto: dev@falcon.incubator.apache.org

Srikanth Sundarrajan
sriksun@apache.org
#sriksun

Venkatesh Seetharam
venkatesh@apache.org
#innerzeal
