Apache Falcon at Hadoop Summit 2013

Data Management Platform on Hadoop Srikanth Sundarrajan Venkatesh Seetharam (Incubating)

Upload: venkatesh-seetharam

Post on 10-May-2015


DESCRIPTION

Presented Apache Falcon at Hadoop Summit 2013, SJC. Delves into the motivation behind Falcon, overview of the architecture, and looking forward into the future.

TRANSCRIPT

Page 1: Apache Falcon at Hadoop Summit 2013

Data Management Platform on Hadoop Srikanth Sundarrajan Venkatesh Seetharam

(Incubating)

Page 2: Apache Falcon at Hadoop Summit 2013

whoami

Principal Architect

InMobi

Apache Hadoop Contributor

Hadoop Team @Yahoo!

Srikanth Sundarrajan

Architect/Developer

Hortonworks

Apache Hadoop Contributor

Data Management @ Yahoo!

Venkatesh Seetharam

Page 3: Apache Falcon at Hadoop Summit 2013

Agenda

1   Motivation

2   Falcon Overview

3   Case Studies

4   Questions & Answers

Page 4: Apache Falcon at Hadoop Summit 2013

MOTIVATION

Page 5: Apache Falcon at Hadoop Summit 2013

Data Processing Landscape

External data source

Acquire (Import)

Data Processing (Transform/Pipeline)

Eviction Archive

Replicate (Copy)

Export

Page 6: Apache Falcon at Hadoop Summit 2013

Core Services

Process

•  Late data management

•  Relays

Data management

•  Acquisition

•  Replication

•  Retention

Operability

•  SLA

•  Lineage

Page 7: Apache Falcon at Hadoop Summit 2013

Process Management – Relays

picture courtesy: http://istockphoto.com/

Page 8: Apache Falcon at Hadoop Summit 2013

Late Data Management

picture courtesy: http://iwebask.com

Page 9: Apache Falcon at Hadoop Summit 2013

Data Retention As Service

picture courtesy: http://vimeo.com/

Page 10: Apache Falcon at Hadoop Summit 2013

Data Replication As Service

picture courtesy: http://boylesmedia.com

Page 11: Apache Falcon at Hadoop Summit 2013

Data Acquisition As Service

picture courtesy: http://wmpu.org

Page 12: Apache Falcon at Hadoop Summit 2013

Operability – Dashboard

picture courtesy: http://www.opentrack.ch/

Page 13: Apache Falcon at Hadoop Summit 2013

FALCON OVERVIEW

Page 14: Apache Falcon at Hadoop Summit 2013

Holistic Declaration of Intent

picture courtesy: http://bigboxdetox.com

Page 15: Apache Falcon at Hadoop Summit 2013

Entity Dependency Graph

Hadoop / HBase … Cluster

External data source

feed Process

depends

depends
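The dependency graph on this slide can be illustrated with a small sketch. It assumes that a feed depends on a cluster and that a process depends on both feeds and clusters, which is how the two "depends" edges are read here. The class `EntityGraph` and the entity names are hypothetical and are not part of Falcon's API.

```python
class EntityGraph:
    """Hypothetical helper that records which entities depend on which."""

    def __init__(self):
        # entity name -> set of entity names it depends on
        self._depends_on = {}

    def submit(self, name, depends_on=()):
        """Accept an entity only if everything it refers to already exists."""
        if name in self._depends_on:
            raise ValueError(f"{name} is already defined")
        missing = [d for d in depends_on if d not in self._depends_on]
        if missing:
            raise ValueError(f"{name} refers to unknown entities: {missing}")
        self._depends_on[name] = set(depends_on)

    def dependents(self, name):
        """Entities that would break if `name` were removed."""
        return sorted(e for e, deps in self._depends_on.items() if name in deps)

    def can_delete(self, name):
        """An entity may be deleted only when nothing depends on it."""
        return not self.dependents(name)


graph = EntityGraph()
graph.submit("cluster:primary")
graph.submit("feed:clicks", depends_on=["cluster:primary"])
graph.submit("process:summarizer", depends_on=["cluster:primary", "feed:clicks"])
```

With this shape, a cluster or feed cannot be removed while a feed or process still refers to it, and an entity cannot be submitted before the entities it names.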

Page 16: Apache Falcon at Hadoop Summit 2013

High Level Architecture

Apache Falcon

Oozie

Messaging

HCatalog

Hadoop

Entity

Entity status

Process status / notification

CLI/REST

JMS

Config store

Page 17: Apache Falcon at Hadoop Summit 2013

Feed Schedule

Cluster xml

Feed xml Falcon

Falcon config store / Graph

Retention / Replication workflow

Oozie Scheduler HDFS

JMS Notification per action

Catalog service

Instance Management
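As a rough illustration of the feed schedule flow above, the sketch below derives retention and replication jobs from a feed definition. It assumes one retention job per cluster the feed lives on and one replication job per source/target pair. `FeedCluster`, `derive_workflows`, and the tuple layout are hypothetical names used only for this example, not Falcon internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedCluster:
    """Hypothetical view of one cluster entry inside a feed definition."""
    name: str
    role: str            # "source" or "target"
    retention_days: int


def derive_workflows(feed_name, clusters):
    """Return the jobs a scheduler would be asked to run for this feed."""
    jobs = []
    # Assumption: retention runs independently on every cluster holding the feed.
    for cluster in clusters:
        jobs.append(("retention", feed_name, cluster.name, cluster.retention_days))
    # Assumption: replication copies from each source to each target.
    sources = [c for c in clusters if c.role == "source"]
    targets = [c for c in clusters if c.role == "target"]
    for source in sources:
        for target in targets:
            jobs.append(("replication", feed_name, source.name, target.name))
    return jobs


clusters = [
    FeedCluster("primary", "source", retention_days=90),
    FeedCluster("failover", "target", retention_days=30),
]
jobs = derive_workflows("clicks", clusters)
```

The point of the sketch is that the user declares the feed once and the derived jobs follow from it, rather than being written by hand per cluster.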

Page 18: Apache Falcon at Hadoop Summit 2013

Process Schedule

Cluster/feed xml

Process xml Falcon

Falcon config store / Graph

Process workflow

Oozie Scheduler HDFS

JMS Notification per available feed

Catalog service

Instance Management

Page 19: Apache Falcon at Hadoop Summit 2013

Physical Architecture

Falcon Colo 1

Falcon Colo 2

Falcon Colo 3

Scheduler

Scheduler

Scheduler

Falcon – Prism Global view

Page 20: Apache Falcon at Hadoop Summit 2013

CASE STUDY Multi Cluster Failover

Page 21: Apache Falcon at Hadoop Summit 2013

Multi Cluster – Failover

>  Falcon manages workflow, replication or both.

>  Enables business continuity without requiring full data reprocessing.

>  Failover clusters require less storage and CPU.

Staged Data

Cleansed Data

Conformed Data

Presented Data

Staged Data

Presented Data

BI and Analytics

Primary Hadoop Cluster

Failover Hadoop Cluster

Replication

Page 22: Apache Falcon at Hadoop Summit 2013

Retention Policies

Staged Data

Retain 5 Years

Cleansed Data

Retain 3 Years

Conformed Data

Retain 3 Years

Presented Data

Retain Last Copy Only

>  Sophisticated retention policies expressed in one place.

>  Simplify data retention for audit, compliance, or for data re-processing.
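The retention limits on this slide can be expressed as a small table plus one eviction rule. The sketch below is illustrative only: the dataset keys, the function `instances_to_evict`, and the 365-day approximation of a year are assumptions made for the example, not Falcon's implementation.

```python
from datetime import datetime, timedelta

# Retention limits from the slide; a year is approximated as 365 days.
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
}
# "Retain Last Copy Only" is modeled as keeping just the newest instance.
KEEP_LAST_ONLY = {"presented"}


def instances_to_evict(dataset, instance_times, now):
    """Return the instance timestamps that fall outside the retention policy."""
    ordered = sorted(instance_times)
    if dataset in KEEP_LAST_ONLY:
        return ordered[:-1]          # everything except the most recent copy
    cutoff = now - RETENTION[dataset]
    return [t for t in ordered if t < cutoff]


now = datetime(2013, 6, 26)
old_staged = instances_to_evict(
    "staged", [datetime(2007, 1, 1), datetime(2012, 1, 1)], now
)
```

Keeping all limits in one table mirrors the slide's point that retention is declared in one place rather than scattered across cleanup scripts.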

Page 23: Apache Falcon at Hadoop Summit 2013

CASE STUDY Distributed Processing

Example: Digital Advertising @ InMobi

Page 24: Apache Falcon at Hadoop Summit 2013

Hadoop @ InMobi

•  About InMobi

   •  World's leading independent mobile advertising company

•  Hadoop usage at InMobi

   •  ~ 6 clusters

   •  > 1PB of storage

   •  > 5TB new data ingested each day

   •  > 20TB data crunched each day

   •  > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase

   •  > 175K Hadoop jobs / day

   •  > 60K Oozie workflows / day

   •  300+ Falcon feed definitions

   •  100+ Falcon process definitions

Page 25: Apache Falcon at Hadoop Summit 2013

Processing – Single Data Center

Ad Request data

Impression render event

Click event

Conversion event

Continuous Streaming (minutely)

Hourly summary

Enrichment (minutely/5 minutely)

Summarizer

Page 26: Apache Falcon at Hadoop Summit 2013

Global Aggregation

Data Center 1

Ad Request data

Impression render event

Click event

Conversion event

Continuous Streaming (minutely)

Hourly summary

Enrichment (minutely/5 minutely)

Summarizer

……..

Data Center N

Ad Request data

Impression render event

Click event

Conversion event

Continuous Streaming (minutely)

Hourly summary

Enrichment (minutely/5 minutely)

Summarizer

Consumable global aggregate
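One way to picture the global aggregate is as a merge of the hourly summaries produced in each data center. The sketch below assumes each summary is a mapping from an hour key to event counts. The function name, the hour format, and the counter names are hypothetical, and the numbers are made up for illustration.

```python
from collections import Counter


def global_aggregate(per_data_center):
    """Sum hourly event counts across data centers, hour by hour."""
    totals = {}
    for hourly_summaries in per_data_center.values():
        for hour, counts in hourly_summaries.items():
            # Counter.update adds counts for keys that already exist.
            totals.setdefault(hour, Counter()).update(counts)
    return {hour: dict(counts) for hour, counts in totals.items()}


summaries = {
    "dc1": {"2013-06-26T10": {"requests": 100, "clicks": 3}},
    "dcN": {"2013-06-26T10": {"requests": 50, "clicks": 1, "conversions": 1}},
}
aggregate = global_aggregate(summaries)
```

In practice the per-data-center summaries would first have to be replicated to the cluster where the merge runs, and the merge for an hour can only start once every data center has delivered that hour.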

Page 27: Apache Falcon at Hadoop Summit 2013

HIGHLIGHTS

Page 28: Apache Falcon at Hadoop Summit 2013

Future

Security

Embed Pig/Hive scripts

Data Acquisition – file-based

Monitoring/Management Dashboard

Page 29: Apache Falcon at Hadoop Summit 2013

Summary

Page 30: Apache Falcon at Hadoop Summit 2013

Questions?

•  Apache Falcon

   •  http://falcon.incubator.apache.org

   •  mailto: [email protected]

•  Srikanth Sundarrajan

   •  [email protected]

   •  #sriksun

•  Venkatesh Seetharam

   •  [email protected]

   •  #innerzeal