Apache Falcon at Hadoop Summit Europe 2014

Data Management Platform on Hadoop
Venkatesh Seetharam
(Incubating)


DESCRIPTION

An overview of Apache Falcon, presented at Hadoop Summit in Amsterdam, April 2, 2014

TRANSCRIPT

Page 1: Apache Falcon at Hadoop Summit Europe 2014

Data Management Platform on Hadoop

Venkatesh Seetharam

(Incubating)

Page 2: Apache Falcon at Hadoop Summit Europe 2014


whoami

Hortonworks Inc.
– Architect/Developer
– Lead Data Management efforts

Apache
– Apache Falcon Committer, IPMC
– Apache Knox Committer
– Apache Hadoop, Sqoop, Oozie Contributor

Part of the Hadoop team at Yahoo! since 2007
– Senior Principal Architect of Hadoop Data at Yahoo!
– Built 2 generations of Data Management at Yahoo!


Page 3: Apache Falcon at Hadoop Summit Europe 2014

Agenda

1 Motivation

2 Falcon Overview

3 Falcon Architecture

4 Case Studies

Page 4: Apache Falcon at Hadoop Summit Europe 2014

MOTIVATION

Page 5: Apache Falcon at Hadoop Summit Europe 2014

Data Processing Landscape

External data source

Acquire (Import)

Data Processing (Transform/Pipeline)

Eviction / Archive

Replicate (Copy)

Export

Page 6: Apache Falcon at Hadoop Summit Europe 2014

Core Services

Process Management
• Relays
• Late data handling
• Retries

Data Management
• Import/Export
• Replication
• Retention

Data Governance
• Lineage
• Audit
• SLA

Page 7: Apache Falcon at Hadoop Summit Europe 2014

FALCON OVERVIEW

Page 8: Apache Falcon at Hadoop Summit Europe 2014

Holistic Declaration of Intent

picture courtesy: http://bigboxdetox.com

Page 9: Apache Falcon at Hadoop Summit Europe 2014

Entity Dependency Graph

Diagram: a Process depends on one or more Feeds and on a Cluster; each Feed in turn depends on a Cluster (Hadoop / HBase …) and on an external data source.
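In entity XML terms, these dependency edges are plain name references: a feed names the cluster(s) it lives on, and a process names its cluster plus its input and output feeds. A minimal sketch, using illustrative entity names rather than ones from the deck:

<!-- feed: depends on the cluster by name -->
<feed name="rawLogs" xmlns="uri:falcon:feed:0.1">
  <clusters>
    <cluster name="prod-cluster" type="source"> ... </cluster>
  </clusters>
  ...
</feed>

<!-- process: depends on the cluster and the feed by name -->
<process name="cleanLogs" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="prod-cluster"> ... </cluster>
  </clusters>
  <inputs>
    <input feed="rawLogs" name="input" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  ...
</process>

Falcon can therefore validate and schedule an entity only after everything it references has been submitted.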

Page 10: Apache Falcon at Hadoop Summit Europe 2014

Cluster Specification

<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/prod-cluster/working"/>
  </locations>
</cluster>

• readonly: needed by distcp for replications
• write: writing to HDFS
• execute: used to submit processes as MR
• workflow: used to submit Oozie jobs
• registry: Hive metastore, to register/deregister partitions and get events on partition availability
• messaging: used for alerts
• locations: HDFS directories used by the Falcon server

Page 11: Apache Falcon at Hadoop Summit Europe 2014

Feed Specification

<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

• Feed run frequency in mins/hrs/days/months
• Late arrival cut-off
• Feeds can belong to multiple groups
• Metadata tagging
• One or more source & target clusters for retention & replication
• Global location across clusters: HDFS paths or Hive tables
• Access permissions

Page 12: Apache Falcon at Hadoop Summit Europe 2014

Process Specification

<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input start="today(0,0)" end="today(0,0)" feed="feed-clicks-raw" name="input"/>
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late"/>
  </late-process>
</process>

• Which cluster the process should run on, and when
• How frequently the process runs, how many instances can run in parallel, and in what order
• Input & output feeds for the process
• The processing logic
• Retry policy on failure
• Handling late input feeds

Page 13: Apache Falcon at Hadoop Summit Europe 2014

Late Data Handling

Defines how late (out-of-band) data is handled.

Each feed can define a late cut-off value:

<late-arrival cut-off="hours(4)"/>

Each process can define how this late data is handled:

<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="input" workflow-path="/apps/clickstream/late"/>
</late-process>

Policies include: backoff, exp-backoff, final
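Reading the example values above: with policy exp-backoff and delay hours(1), attempts to process late-arriving data would be spaced at roughly 1, 2, 4, ... hour intervals until the feed's cut-off expires, while a final policy would process late data only once, at the cut-off. This reading follows from the policy names; the deck itself does not spell the semantics out.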

Page 14: Apache Falcon at Hadoop Summit Europe 2014

Retry Policies

Each process can define a retry policy:

<process name="[process name]">
  ...
  <retry policy=[retry policy] delay=[retry delay] attempts=[attempts]/>
  <retry policy="backoff" delay="minutes(10)" attempts="3"/>
  ...
</process>

Policies include: backoff, exp-backoff
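Taking the concrete line above: a failed instance is retried up to 3 times at 10-minute intervals under backoff; under exp-backoff the same delay would roughly double per attempt (10, 20, 40 minutes). Again, this interpretation follows the policy names rather than anything stated in the deck.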

Page 15: Apache Falcon at Hadoop Summit Europe 2014

Lineage

Page 16: Apache Falcon at Hadoop Summit Europe 2014

Apache Falcon

Provides (data management needs):
• Multi Cluster Management
• Replication
• Scheduling
• Data Reprocessing
• Dependency Management
• Eviction
• Governance

Orchestrates (tools):
• Oozie
• Sqoop
• Distcp
• Flume
• Map/Reduce
• Hive and Pig jobs

Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.

Falcon: One-stop Shop for Data Management

Page 17: Apache Falcon at Hadoop Summit Europe 2014

FALCON ARCHITECTURE

Page 18: Apache Falcon at Hadoop Summit Europe 2014

High Level Architecture

Diagram: entities are submitted to the Falcon server over CLI/REST and persisted in the config store; Falcon schedules them on Oozie, which reports process status and notifications back; JMS messaging carries notifications, and Falcon integrates with HCatalog and HDFS. Entity status is returned to the caller.

Page 19: Apache Falcon at Hadoop Summit Europe 2014

Feed Schedule

Diagram: the cluster xml and feed xml are submitted to Falcon and recorded in the Falcon config store / graph; Falcon generates retention and replication workflows and hands them to the Oozie scheduler running against HDFS, with a JMS notification per action, integration with the catalog service, and instance management.

Page 20: Apache Falcon at Hadoop Summit Europe 2014

Process Schedule

Diagram: the cluster/feed xmls and the process xml are submitted to Falcon and recorded in the Falcon config store / graph; Falcon generates the process workflow and hands it to the Oozie scheduler running against HDFS, with a JMS notification per available feed, integration with the catalog service, and instance management.

Page 21: Apache Falcon at Hadoop Summit Europe 2014

Physical Architecture

• STANDALONE
– Single data center
– Single Falcon server
– Hadoop jobs and relevant processing involve only one cluster

• DISTRIBUTED
– Multiple data centers
– Falcon server per DC
– Multiple instances of Hadoop clusters and workflow schedulers

Diagram: in standalone mode, a Falcon server at one site drives a Hadoop (store & process) cluster, with replication to a second Hadoop cluster. In distributed mode, each site (Site 1, Site 2, …) runs its own standalone Falcon server and Hadoop clusters, and a Falcon Prism server coordinates across the sites; replication flows between the sites' clusters.

Page 22: Apache Falcon at Hadoop Summit Europe 2014

CASE STUDY: Multi-Cluster Failover

Page 23: Apache Falcon at Hadoop Summit Europe 2014

Multi Cluster – Failover

> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.

Diagram: the primary Hadoop cluster carries Staged, Cleansed, Conformed, and Presented data; the Staged and Presented data sets are replicated to the failover Hadoop cluster, which supports BI and Analytics.
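This failover pattern maps directly onto the feed specification shown earlier: declare the primary cluster as source and the failover cluster as target, and Falcon schedules the replication. A minimal sketch, with illustrative cluster names, paths, and retention values:

<feed description="" name="presentedData" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="primary-cluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="failover-cluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/presented/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Only the feeds that matter for continuity (here, staged and presented data) need a target cluster, which is why the failover site can get by with less storage and CPU.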

Page 24: Apache Falcon at Hadoop Summit Europe 2014

Retention Policies

• Staged Data: retain 5 years
• Cleansed Data: retain 3 years
• Conformed Data: retain 3 years
• Presented Data: retain last copy only

> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data re-processing.
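Each of these policies is a one-line retention element on the corresponding feed's cluster entry. A sketch, assuming Falcon's frequency expressions (the year-based limits above written as months; feed names are illustrative):

<!-- on the stagedData feed: retain 5 years -->
<retention limit="months(60)" action="delete"/>

<!-- on the cleansedData and conformedData feeds: retain 3 years -->
<retention limit="months(36)" action="delete"/>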

Page 25: Apache Falcon at Hadoop Summit Europe 2014

CASE STUDY: Distributed Processing

Example: Digital Advertising @ InMobi

Page 26: Apache Falcon at Hadoop Summit Europe 2014

Processing – Single Data Center

Diagram: Ad Request data, Impression render events, Click events, and Conversion events stream in continuously (minutely); Enrichment runs minutely/5-minutely and feeds the Summarizer, which produces the Hourly summary.

Page 27: Apache Falcon at Hadoop Summit Europe 2014

Global Aggregation

Diagram: the same pipeline (continuous minutely streaming of Ad Request data, Impression render, Click, and Conversion events; minutely/5-minutely Enrichment; a Summarizer producing an Hourly summary) runs in each of Data Center 1 through Data Center N, and the per-DC hourly summaries are combined into a consumable global aggregate.
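One way to express this with Falcon's building blocks, assuming one hourly-summary feed per data center (all names illustrative): each DC's feed declares its local cluster as source and a shared aggregation cluster as target, so Falcon replicates every per-DC summary to one place where a global aggregation process can consume them.

<feed name="hourlySummary-dc1" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="dc1-cluster" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
    <cluster name="global-cluster" type="target">
      <validity start="2014-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(30)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/summaries/dc1/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>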

Page 28: Apache Falcon at Hadoop Summit Europe 2014

HIGHLIGHTS

Page 29: Apache Falcon at Hadoop Summit Europe 2014

Future

1 Data Governance

2 Data Pipeline Designer

3 Data Acquisition – file-based

4 Monitoring/Management Dashboard

Page 30: Apache Falcon at Hadoop Summit Europe 2014

Summary