Apache Falcon at Hadoop Summit Europe 2014

Data Management Platform on Hadoop
Venkatesh Seetharam
(Incubating)


DESCRIPTION

An overview of Apache Falcon, presented at Hadoop Summit in Amsterdam, April 2, 2014

TRANSCRIPT

Page 1: Apache Falcon at Hadoop Summit Europe 2014

Data Management Platform on Hadoop

Venkatesh Seetharam

(Incubating)

Page 2: Apache Falcon at Hadoop Summit Europe 2014


whoami

Hortonworks Inc.
– Architect/Developer
– Lead Data Management efforts

Apache
– Apache Falcon Committer, IPMC
– Apache Knox Committer
– Apache Hadoop, Sqoop, Oozie Contributor

Part of the Hadoop team at Yahoo! since 2007
– Senior Principal Architect of Hadoop Data at Yahoo!
– Built 2 generations of Data Management at Yahoo!


Page 3: Apache Falcon at Hadoop Summit Europe 2014

Agenda

1 Motivation

2 Falcon Overview

3 Falcon Architecture

4 Case Studies

Page 4: Apache Falcon at Hadoop Summit Europe 2014

MOTIVATION

Page 5: Apache Falcon at Hadoop Summit Europe 2014

Data Processing Landscape

External data source

Acquire (Import)

Data Processing (Transform/Pipeline)

Eviction / Archive

Replicate (Copy)

Export

Page 6: Apache Falcon at Hadoop Summit Europe 2014

Core Services

Process Management
• Relays
• Late data handling
• Retries

Data Management
• Import/Export
• Replication
• Retention

Data Governance
• Lineage
• Audit
• SLA

Page 7: Apache Falcon at Hadoop Summit Europe 2014

FALCON OVERVIEW

Page 8: Apache Falcon at Hadoop Summit Europe 2014

Holistic Declaration of Intent

picture courtesy: http://bigboxdetox.com

Page 9: Apache Falcon at Hadoop Summit Europe 2014

Entity Dependency Graph

Diagram: a Process depends on one or more Feeds and on a Cluster; each Feed in turn depends on a Cluster (Hadoop / HBase …) and on an external data source.
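In entity XML terms, these dependency edges are plain name references: a feed names the cluster(s) it lives on, and a process names its cluster plus its input and output feeds. A minimal sketch, using illustrative entity names rather than ones from the deck:

<!-- feed: depends on the cluster by name -->
<feed name="rawLogs" xmlns="uri:falcon:feed:0.1">
  <clusters>
    <cluster name="prod-cluster" type="source"> ... </cluster>
  </clusters>
  ...
</feed>

<!-- process: depends on the cluster and the feed by name -->
<process name="cleanLogs" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="prod-cluster"> ... </cluster>
  </clusters>
  <inputs>
    <input feed="rawLogs" name="input" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  ...
</process>

Falcon can therefore validate and schedule an entity only after everything it references has been submitted.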

Page 10: Apache Falcon at Hadoop Summit Europe 2014

Cluster Specification

<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/prod-cluster/working"/>
  </locations>
</cluster>

• readonly: needed by distcp for replications
• write: writing to HDFS
• execute: used to submit processes as MR
• workflow: used to submit Oozie jobs
• registry: Hive metastore, to register/deregister partitions and get events on partition availability
• messaging: used for alerts
• locations: HDFS directories used by the Falcon server

Page 11: Apache Falcon at Hadoop Summit Europe 2014

Feed Specification

<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

• Feed run frequency in mins/hrs/days/months
• Late arrival cut-off
• Feeds can belong to multiple groups
• Metadata tagging
• One or more source & target clusters for retention & replication
• Global location across clusters: HDFS paths or Hive tables
• Access permissions

Page 12: Apache Falcon at Hadoop Summit Europe 2014

Process Specification

<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input start="today(0,0)" end="today(0,0)" feed="feed-clicks-raw" name="input"/>
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late"/>
  </late-process>
</process>

• Which cluster the process should run on, and when
• How frequently the process runs, how many instances can run in parallel, and in what order
• Input & output feeds for the process
• The processing logic
• Retry policy on failure
• Handling late input feeds

Page 13: Apache Falcon at Hadoop Summit Europe 2014

Late Data Handling

Defines how late (out-of-band) data is handled.

Each feed can define a late cut-off value:

<late-arrival cut-off="hours(4)"/>

Each process can define how this late data is handled:

<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="input" workflow-path="/apps/clickstream/late"/>
</late-process>

Policies include: backoff, exp-backoff, final
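Reading the example values above: with policy exp-backoff and delay hours(1), attempts to process late-arriving data would be spaced at roughly 1, 2, 4, ... hour intervals until the feed's cut-off expires, while a final policy would process late data only once, at the cut-off. This reading follows from the policy names; the deck itself does not spell the semantics out.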

Page 14: Apache Falcon at Hadoop Summit Europe 2014

Retry Policies

Each process can define a retry policy:

<process name="[process name]">
  ...
  <retry policy=[retry policy] delay=[retry delay] attempts=[attempts]/>
  <retry policy="backoff" delay="minutes(10)" attempts="3"/>
  ...
</process>

Policies include: backoff, exp-backoff
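Taking the concrete line above: a failed instance is retried up to 3 times at 10-minute intervals under backoff; under exp-backoff the same delay would roughly double per attempt (10, 20, 40 minutes). Again, this interpretation follows the policy names rather than anything stated in the deck.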

Page 15: Apache Falcon at Hadoop Summit Europe 2014

Lineage

Page 16: Apache Falcon at Hadoop Summit Europe 2014

Apache Falcon

Provides (data management needs):
• Multi Cluster Management
• Replication
• Scheduling
• Data Reprocessing
• Dependency Management
• Eviction
• Governance

Orchestrates (tools):
• Oozie
• Sqoop
• Distcp
• Flume
• Map/Reduce
• Hive and Pig jobs

Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.

Falcon: One-stop Shop for Data Management

Page 17: Apache Falcon at Hadoop Summit Europe 2014

FALCON ARCHITECTURE

Page 18: Apache Falcon at Hadoop Summit Europe 2014

High Level Architecture

Diagram: entities are submitted to the Falcon server over CLI/REST and persisted in the config store; Falcon schedules them on Oozie, which reports process status and notifications back; JMS messaging carries notifications, and Falcon integrates with HCatalog and HDFS. Entity status is returned to the caller.

Page 19: Apache Falcon at Hadoop Summit Europe 2014

Feed Schedule

Diagram: the cluster xml and feed xml are submitted to Falcon and recorded in the Falcon config store / graph; Falcon generates retention and replication workflows and hands them to the Oozie scheduler running against HDFS, with a JMS notification per action, integration with the catalog service, and instance management.

Page 20: Apache Falcon at Hadoop Summit Europe 2014

Process Schedule

Diagram: the cluster/feed xmls and the process xml are submitted to Falcon and recorded in the Falcon config store / graph; Falcon generates the process workflow and hands it to the Oozie scheduler running against HDFS, with a JMS notification per available feed, integration with the catalog service, and instance management.

Page 21: Apache Falcon at Hadoop Summit Europe 2014

Physical Architecture

• STANDALONE
– Single data center
– Single Falcon server
– Hadoop jobs and relevant processing involve only one cluster

• DISTRIBUTED
– Multiple data centers
– Falcon server per DC
– Multiple instances of Hadoop clusters and workflow schedulers

Diagram: in standalone mode, a Falcon server at one site drives a Hadoop (store & process) cluster, with replication to a second Hadoop cluster. In distributed mode, each site (Site 1, Site 2, …) runs its own standalone Falcon server and Hadoop clusters, and a Falcon Prism server coordinates across the sites; replication flows between the sites' clusters.

Page 22: Apache Falcon at Hadoop Summit Europe 2014

CASE STUDY: Multi-Cluster Failover

Page 23: Apache Falcon at Hadoop Summit Europe 2014

Multi Cluster – Failover

> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.

Diagram: the primary Hadoop cluster carries Staged, Cleansed, Conformed, and Presented data; the Staged and Presented data sets are replicated to the failover Hadoop cluster, which supports BI and Analytics.
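This failover pattern maps directly onto the feed specification shown earlier: declare the primary cluster as source and the failover cluster as target, and Falcon schedules the replication. A minimal sketch, with illustrative cluster names, paths, and retention values:

<feed description="" name="presentedData" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="failover-cluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/presented/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Only the feeds that matter for continuity (here, staged and presented data) need a target cluster, which is why the failover site can get by with less storage and CPU.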

Page 24: Apache Falcon at Hadoop Summit Europe 2014

Retention Policies

• Staged Data: retain 5 years
• Cleansed Data: retain 3 years
• Conformed Data: retain 3 years
• Presented Data: retain last copy only

> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data re-processing.
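Each of these policies is a one-line retention element on the corresponding feed's cluster entry. A sketch, assuming Falcon's frequency expressions (the year-based limits above written as months; feed names are illustrative):

<!-- on the stagedData feed: retain 5 years -->
<retention limit="months(60)" action="delete"/>

<!-- on the cleansedData and conformedData feeds: retain 3 years -->
<retention limit="months(36)" action="delete"/>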

Page 25: Apache Falcon at Hadoop Summit Europe 2014

CASE STUDY: Distributed Processing

Example: Digital Advertising @ InMobi

Page 26: Apache Falcon at Hadoop Summit Europe 2014

Processing – Single Data Center

Diagram: Ad Request data, Impression render events, Click events, and Conversion events stream in continuously (minutely); Enrichment runs minutely/5-minutely and feeds the Summarizer, which produces the Hourly summary.

Page 27: Apache Falcon at Hadoop Summit Europe 2014

Global Aggregation

Diagram: the same pipeline (continuous minutely streaming of Ad Request data, Impression render, Click, and Conversion events; minutely/5-minutely Enrichment; a Summarizer producing an Hourly summary) runs in each of Data Center 1 through Data Center N, and the per-DC hourly summaries are combined into a consumable global aggregate.
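One way to express this with Falcon's building blocks, assuming one hourly-summary feed per data center (all names illustrative): each DC's feed declares its local cluster as source and a shared aggregation cluster as target, so Falcon replicates every per-DC summary to one place where a global aggregation process can consume them.

<feed name="hourlySummary-dc1" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="dc1-cluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
    <cluster name="global-cluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/summaries/dc1/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>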

Page 28: Apache Falcon at Hadoop Summit Europe 2014

HIGHLIGHTS

Page 29: Apache Falcon at Hadoop Summit Europe 2014

Future

1 Data Governance

2 Data Pipeline Designer

3 Data Acquisition – file-based

4 Monitoring/Management Dashboard

Page 30: Apache Falcon at Hadoop Summit Europe 2014

Summary