Apache Falcon at Hadoop Summit Europe 2014
DESCRIPTION
An overview of Apache Falcon at Hadoop Summit in Amsterdam, Apr 2, 2014
TRANSCRIPT
Data Management Platform on Hadoop
Venkatesh Seetharam
(Incubating)
© Hortonworks Inc. 2011
whoami
Hortonworks Inc.
– Architect/Developer
– Lead Data Management efforts
Apache
– Apache Falcon Committer, IPMC
– Apache Knox Committer
– Apache Hadoop, Sqoop, Oozie Contributor
Part of the Hadoop team at Yahoo! since 2007
– Senior Principal Architect of Hadoop Data at Yahoo!
– Built 2 generations of Data Management at Yahoo!
Architecting the Future of Big Data
Agenda
1 Motivation
2 Falcon Overview
3 Falcon Architecture
4 Case Studies
MOTIVATION
Data Processing Landscape
External data source
Acquire (Import)
Data Processing (Transform/Pipeline)
Eviction
Archive
Replicate (Copy)
Export
Core Services
Process Management
• Relays
• Late data handling
• Retries
Data Management
• Import/Export
• Replication
• Retention
Data Governance
• Lineage
• Audit
• SLA
FALCON OVERVIEW
Holistic Declaration of Intent
picture courtesy: http://bigboxdetox.com
Entity Dependency Graph
Hadoop / HBase … Cluster
External data source
Feed
Process
(a Process depends on its Feeds and Cluster; a Feed depends on its Cluster)
<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
    <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
    <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/apps/falcon/prod-cluster/working"/>
  </locations>
</cluster>
readonly: needed by DistCp for replication
write: writing to HDFS
execute: used to submit processes as MR
workflow: used to submit Oozie jobs
registry: Hive metastore to register/deregister partitions and get events on partition availability
messaging: used for alerts
locations: HDFS directories used by the Falcon server
Cluster Specification
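The interface types above can also be consumed programmatically. A minimal sketch (illustrative Python using the standard library, not Falcon code) that reads the endpoints out of a cluster entity definition:

```python
# Illustrative helper (not part of Falcon): map interface type -> endpoint
# for a Falcon cluster entity, using only the standard library.
import xml.etree.ElementTree as ET

CLUSTER_XML = """<?xml version="1.0"?>
<cluster colo="NJ-datacenter" description="" name="prod-cluster">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
    <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
  </interfaces>
</cluster>"""

def interface_endpoints(cluster_xml: str) -> dict:
    """Return {interface type: endpoint} for a cluster entity."""
    root = ET.fromstring(cluster_xml)
    return {i.get("type"): i.get("endpoint")
            for i in root.find("interfaces")}

endpoints = interface_endpoints(CLUSTER_XML)
print(endpoints["write"])     # hdfs://nn:8020
print(endpoints["workflow"])  # http://os:11000/oozie/
```

The Falcon server itself validates these entities against an XSD on submission; this sketch only shows the shape of the data.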
Feed Specification
<?xml version="1.0"?>
<feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(6)"/>
  <groups>churnAnalysisFeeds</groups>
  <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-secondary" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <retention limit="days(7)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="hdfs" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Feed run frequency in minutes/hours/days/months
Late arrival cutoff
Global location across clusters: HDFS paths or Hive tables
Feeds can belong to multiple groups
One or more source & target clusters for retention & replication
Access permissions
Metadata tagging
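Falcon substitutes the ${YEAR}/${MONTH}/${DAY}/${HOUR} variables in a feed's location path for each feed instance. A hypothetical helper (the function is ours, not Falcon's) showing the idea:

```python
# Illustrative sketch: resolve a feed's templated HDFS location for a
# concrete instance time, the way Falcon expands ${YEAR} etc. variables.
from datetime import datetime

def resolve_feed_path(template: str, instance: datetime) -> str:
    """Expand Falcon-style date variables in a feed location template."""
    return (template.replace("${YEAR}", f"{instance.year:04d}")
                    .replace("${MONTH}", f"{instance.month:02d}")
                    .replace("${DAY}", f"{instance.day:02d}")
                    .replace("${HOUR}", f"{instance.hour:02d}"))

path = resolve_feed_path("/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}",
                         datetime(2012, 1, 1, 5))
print(path)  # /weblogs/2012-01-01-05
```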
Process Specification
<process name="process-test" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="cluster-primary">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input"/>
  </inputs>
  <outputs>
    <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
  </outputs>
  <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/apps/clickstream/late"/>
  </late-process>
</process>
How frequently the process runs, how many instances can run in parallel, and in what order
Which cluster the process should run on and when
The processing logic
Retry policy on failure
Handling late input feeds
Input & output feeds for the process
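Conceptually, Falcon (via Oozie coordinators) materializes one process instance per frequency interval inside the validity window. An illustrative sketch of that expansion (the function and its limit parameter are ours, not Falcon internals):

```python
# Sketch: enumerate nominal instance times for a process with a given
# frequency, bounded by the cluster's validity window.
from datetime import datetime, timedelta

def materialize_instances(start: datetime, end: datetime,
                          frequency: timedelta, limit: int = 5):
    """Return up to `limit` nominal instance times in [start, end)."""
    out, t = [], start
    while t < end and len(out) < limit:
        out.append(t)
        t += frequency
    return out

# validity 2011-11-02T00:00Z .. 2011-12-30T00:00Z, frequency days(1)
instances = materialize_instances(datetime(2011, 11, 2),
                                  datetime(2011, 12, 30),
                                  timedelta(days=1))
print(instances[0])  # 2011-11-02 00:00:00
print(instances[2])  # 2011-11-04 00:00:00
```

The `<parallel>` and `<order>` settings then control how many of these materialized instances may run concurrently and in which order they are picked up.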
Late Data Handling
Defines how late (out-of-band) data is handled.
Each feed can define a late cut-off value:
<late-arrival cut-off="hours(4)"/>
Each process can define how this late data is handled:
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="input" workflow-path="/apps/clickstream/late"/>
</late-process>
Policies include: backoff, exp-backoff, final
Retry Policies
Each process can define a retry policy:
<process name="[process name]">
  ...
  <retry policy=[retry policy] delay=[retry delay] attempts=[attempts]/>
  <retry policy="backoff" delay="minutes(10)" attempts="3"/>
  ...
</process>
Policies include: backoff, exp-backoff
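The delay schedules these policies imply can be sketched as follows (illustrative Python; the exp-backoff formula of doubling the delay per attempt is an assumption for illustration, not Falcon internals):

```python
# Sketch of retry delay schedules: fixed-delay retries vs. exponential
# backoff that doubles the base delay on each successive attempt.
from datetime import timedelta

def retry_delay(policy: str, base: timedelta, attempt: int) -> timedelta:
    """Delay before retry number `attempt` (0-based) under a policy."""
    if policy == "backoff":
        return base                 # same delay every time
    if policy == "exp-backoff":
        return base * (2 ** attempt)  # 1x, 2x, 4x, ...
    raise ValueError(f"unknown policy: {policy}")

# <retry policy="backoff" delay="minutes(10)" attempts="3"/>
fixed = [retry_delay("backoff", timedelta(minutes=10), n) for n in range(3)]
print([d.total_seconds() // 60 for d in fixed])  # [10.0, 10.0, 10.0]

exp = [retry_delay("exp-backoff", timedelta(minutes=10), n) for n in range(3)]
print([d.total_seconds() // 60 for d in exp])    # [10.0, 20.0, 40.0]
```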
Lineage
Apache Falcon
Provides (data management needs): Multi Cluster Management, Replication, Scheduling, Data Reprocessing, Dependency Management, Eviction, Governance
Orchestrates (tools): Oozie, Sqoop, DistCp, Flume, Map/Reduce, Hive and Pig jobs
Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.
Falcon: One-stop Shop for Data Management
FALCON ARCHITECTURE
High Level Architecture
The Falcon server sits between clients and the Hadoop stack (Oozie, JMS messaging, HCatalog, HDFS). Entities are submitted over CLI/REST; Falcon persists them in its config store and returns entity status, while process status/notifications flow back over JMS.
Feed Schedule flow:
cluster xml + feed xml → Falcon → Falcon config store / graph → retention/replication workflow → Oozie scheduler + HDFS → JMS notification per action → catalog service; Falcon performs instance management.
Process Schedule flow:
cluster/feed xml + process xml → Falcon → Falcon config store / graph → process workflow → Oozie scheduler + HDFS → JMS notification per available feed → catalog service; Falcon performs instance management.
Physical Architecture
• STANDALONE
  – Single data center
  – Single Falcon server
  – Hadoop jobs and relevant processing involve only one cluster
• DISTRIBUTED
  – Multiple data centers
  – Falcon server per DC
  – Multiple instances of Hadoop clusters and workflow schedulers
Standalone: one Falcon server per site manages a single Hadoop (store & process) cluster, with replication between Hadoop clusters.
Distributed: a Falcon Prism server coordinates standalone Falcon servers at Site 1 and Site 2, each managing its own Hadoop cluster, with replication between sites.
CASE STUDY Multi Cluster Failover
Multi Cluster – Failover
> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.
Primary Hadoop Cluster: Staged Data → Cleansed Data → Conformed Data → Presented Data → BI and Analytics
Failover Hadoop Cluster: Staged Data, Presented Data (kept in sync via replication from the primary cluster)
Retention Policies
Staged Data: retain 5 years
Cleansed Data: retain 3 years
Conformed Data: retain 3 years
Presented Data: retain last copy only
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data re-processing.
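A retention pass conceptually deletes feed instances whose nominal time falls outside the retention limit. A hedged sketch of that selection (illustrative only; the real eviction runs as a Falcon-generated workflow on the cluster):

```python
# Sketch: select feed instances eligible for eviction under a retention
# limit, i.e. instances older than (now - limit).
from datetime import datetime, timedelta

def evictable(instances, now, limit):
    """Return instance times older than the retention cutoff."""
    cutoff = now - limit
    return [t for t in instances if t < cutoff]

now = datetime(2014, 4, 2)
# ten daily instances, newest first
instances = [now - timedelta(days=d) for d in range(10)]
# <retention limit="days(7)" action="delete"/>
old = evictable(instances, now, timedelta(days=7))
print(len(old))  # 2  (the instances 8 and 9 days old)
```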
CASE STUDY Distributed Processing
Example: Digital Advertising @ InMobi
Processing – Single Data Center
Ad Request data, Impression render event, Click event, Conversion event
→ Continuous Streaming (minutely)
→ Enrichment (minutely/5-minutely)
→ Summarizer
→ Hourly summary
Global Aggregation
Each data center (Data Center 1 … Data Center N) runs the same pipeline (Continuous Streaming → Enrichment → Summarizer → Hourly summary); the per-DC hourly summaries feed a consumable global aggregate.
HIGHLIGHTS
Future
1 Data Governance
2 Data Pipeline Designer
3 Data Acquisition – file-based
4 Monitoring/Management Dashboard
Summary
Questions?
Apache Falcon
http://falcon.incubator.apache.org
mailto: [email protected]
Venkatesh Seetharam
[email protected]
#innerzeal