introduction to apex
TRANSCRIPT
● Project history, current status● Use cases● Guiding principles● Architecture overview● API model● Important components● Key concepts
Agenda
Project history
Jan 2012 Hadoop v2.0 (YARN)
Jun 2012 Project development started at DataTorrent
July 2015 Open-sourced Core
Aug 2015 Apache incubation
April 2016 Apache top level project
Current status
Over 50 committers already…And growing….
Committers
● Advertising● IoT● Finance● Telecoms and
Networks● Security● Ingestion● ...
Use cases
● Highly scalable and performant● Easily operable● Fault tolerant● Hadoop native● Easily integrated● Easily developed
Guiding Principles
Architecture overview
API model
● A DAG is composed of vertices (Operators) and edges (Streams).● A Stream is a sequence of data tuples which connects operators at end-points called Ports● An Operator takes one or more input streams, performs computations & emits one or more output streams
○ Each operator is USER’s business logic, or built-in operator from our open source library○ Operator may have multiple instances that run in parallel
Logical DAG
1
2
3 4 6
5
Description
Op 1, Op 2
Reading from source
Op 3 Custom computation
Op 4 Summarizing results
Op 5 Writing Raw results to store
Op 6 Feeding summarized results to dashboard
● Hadoop(YARN) native
● Can Co-exist with MapReduce
● Operators ⇒ YARN containers
DAG on cluster
● Native YARN application ● Key functions
○ Provisions and monitors Apex Worker Containers for operators
○ Partitions and merges operators for auto-scaling○ Detects failure and restarts failed operators○ Updates DAG topology
● HA○ Checkpoints its state in HDFS○ Leverages YARN’s monitoring and restarting feature
Streaming app master (StrAM)
● Resides in a YARN container● Runs operator instances ● Includes Buffer Server● Manages bookkeeping and
checkpointing● Executes commands from
Apex App Master ● Starts new operator instances● Purges old data● Sends heartbeats with stats to
Apex App Master
Apex Worker Container
Worker Container
Op
BSS BSP
Local Disk
Upstreams of Op
Downstreams of OpBuffer Server
Apache Apex-Malhar
14
● Streaming windows● Checkpointing,
bookkeeping on window boundaries
● No artificial latency● Aggregates on
sliding window, tumbling window
Windowing
● Saving operators’ state for recovery
● Decentralized● Operator serialized state is
asynchronously written to HDFS ● If all operators have checkpointed
a particular window, that window is “committed” and all previous checkpoints will be purged
● The state of Apex App Master is also checkpointed
Fault Tolerance - Checkpointing
1
2
3 4 6
5
180
240
180
60
120 120
Checkpoint window # 60 120 180 240
Checkpoint # 1 2 3 4
Assuming checkpoint frequency = 60 windows
Committed Checkpoint # = 1
● App Master detects failure● Downstream operators ⇒
Re-deployed● Upon restart, Operators recover
from last committed checkpoint● Data is replayed from upstream
buffer-server● Recovery is automatic and
typically takes only few seconds
Fault Tolerance - Recovery
1
2
3 4 6
5
180
240
180
60
120 120
Committed Checkpoint window = 60
● AT_LEAST_ONCE (default): ○ No data loss.
● AT_MOST_ONCE○ Data loss allowed (in case of failures)○ Fastest recovery
● EXACTLY_ONCE○ Checkpoint every window○ Checkpointing becomes blocking○ Heavy bookkeeping
Processing Modes
● RACK_LOCAL● NODE_LOCAL
○ Same node, separate container
● CONTAINER_LOCAL ○ Same container, separate
thread (in-memory queue)● THREAD_LOCAL
○ Same thread (function call)
Default is no locality
Compute locality
Worker Container
Op1
Op2
BSS BSP
Disk
Upstream of Op1
Downstream of Op2Buffer Server
● Multiple physical instance for the logical operator
● Static scaling/ Dynamic scaling
● Unifiers
Partitioning & Auto-scaling
Without restarting the pipeline :● Adding new operator● Changing configuration properties of the
operators
Run-time modifications
● Apache Apex website - http://apex.apache.org/● Subscribe - http://apex.apache.org/community.html● Download - http://apex.apache.org/downloads.html● Twitter : follow @ApacheApex● Facebook : like ApacheApex● Meetup - http://www.meetup.com/topics/apache-apex● Startup Program – Free Enterprise License for Startups,
Educational Institutions, Non-Profits
Resources