introduction to apex

24
Introduction to Apache Apex Yogi devendra [email protected] (Overview, API model, features …)

Upload: datatorrent

Post on 13-Apr-2017

85 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Introduction to Apex

Introduction to Apache Apex

Yogi [email protected]

(Overview, API model, features …)

Page 2: Introduction to Apex

● Project history, current status● Use cases● Guiding principles● Architecture overview● API model● Important components● Key concepts

Agenda

Page 3: Introduction to Apex

Project history

Jan 2012 Hadoop v2.0 (YARN)

Jun 2012 Project development started at DataTorrent

July 2015 Open-sourced Core

Aug 2015 Apache incubation

April 2016 Apache top level project

Page 4: Introduction to Apex

Current status

Page 5: Introduction to Apex

Over 50 committers already…And growing….

Committers

Page 6: Introduction to Apex

● Advertising● IoT● Finance● Telecoms and

Networks● Security● Ingestion● ...

Use cases

Page 7: Introduction to Apex

● Highly scalable and performant● Easily operable● Fault tolerant● Hadoop native● Easily integrated● Easily developed

Guiding Principles

Page 8: Introduction to Apex

Architecture overview

Page 9: Introduction to Apex

API model

● A DAG is composed of vertices (Operators) and edges (Streams).● A Stream is a sequence of data tuples which connects operators at end-points called Ports● An Operator takes one or more input streams, performs computations & emits one or more output streams

○ Each operator is USER’s business logic, or built-in operator from our open source library○ Operator may have multiple instances that run in parallel

Page 10: Introduction to Apex

Logical DAG

1

2

3 4 6

5

Description

Op 1, Op 2

Reading from source

Op 3 Custom computation

Op 4 Summarizing results

Op 5 Writing Raw results to store

Op 6 Feeding summarized results to dashboard

Page 11: Introduction to Apex

● Hadoop(YARN) native

● Can Co-exist with MapReduce

● Operators ⇒ YARN containers

DAG on cluster

Page 12: Introduction to Apex

● Native YARN application ● Key functions

○ Provisions and monitors Apex Worker Containers for operators

○ Partitions and merges operators for auto-scaling○ Detects failure and restarts failed operators○ Updates DAG topology

● HA○ Checkpoints its state in HDFS○ Leverages YARN’s monitoring and restarting feature

Streaming app master (StrAM)

Page 13: Introduction to Apex

● Resides in a YARN container● Runs operator instances ● Includes Buffer Server● Manages bookkeeping and

checkpointing● Executes commands from

Apex App Master ● Starts new operator instances● Purges old data● Sends heartbeats with stats to

Apex App Master

Apex Worker Container

Worker Container

Op

BSS BSP

Local Disk

Upstreams of Op

Downstreams of OpBuffer Server

Page 14: Introduction to Apex

Apache Apex-Malhar

14

Page 15: Introduction to Apex

● Streaming windows● Checkpointing,

bookkeeping on window boundaries

● No artificial latency● Aggregates on

sliding window, tumbling window

Windowing

Page 16: Introduction to Apex

● Saving operators’ state for recovery

● Decentralized● Operator serialized state is

asynchronously written to HDFS ● If all operators have checkpointed

a particular window, that window is “committed” and all previous checkpoints will be purged

● The state of Apex App Master is also checkpointed

Fault Tolerance - Checkpointing

1

2

3 4 6

5

180

240

180

60

120 120

Checkpoint window # 60 120 180 240

Checkpoint # 1 2 3 4

Assuming checkpoint frequency = 60 windows

Committed Checkpoint # = 1

Page 17: Introduction to Apex

● App Master detects failure● Downstream operators ⇒

Re-deployed● Upon restart, Operators recover

from last committed checkpoint● Data is replayed from upstream

buffer-server● Recovery is automatic and

typically takes only few seconds

Fault Tolerance - Recovery

1

2

3 4 6

5

180

240

180

60

120 120

Committed Checkpoint window = 60

Page 18: Introduction to Apex

● AT_LEAST_ONCE (default): ○ No data loss.

● AT_MOST_ONCE○ Data loss allowed (in case of failures)○ Fastest recovery

● EXACTLY_ONCE○ Checkpoint every window○ Checkpointing becomes blocking○ Heavy bookkeeping

Processing Modes

Page 19: Introduction to Apex

● RACK_LOCAL● NODE_LOCAL

○ Same node, separate container

● CONTAINER_LOCAL ○ Same container, separate

thread (in-memory queue)● THREAD_LOCAL

○ Same thread (function call)

Default is no locality

Compute locality

Worker Container

Op1

Op2

BSS BSP

Disk

Upstream of Op1

Downstream of Op2Buffer Server

Page 20: Introduction to Apex

● Multiple physical instance for the logical operator

● Static scaling/ Dynamic scaling

● Unifiers

Partitioning & Auto-scaling

Page 21: Introduction to Apex

Without restarting the pipeline :● Adding new operator● Changing configuration properties of the

operators

Run-time modifications

Page 22: Introduction to Apex

Questions

Image ref [2]

Page 23: Introduction to Apex

● Apache Apex website - http://apex.apache.org/● Subscribe - http://apex.apache.org/community.html● Download - http://apex.apache.org/downloads.html● Twitter : follow @ApacheApex● Facebook : like ApacheApex● Meetup - http://www.meetup.com/topics/apache-apex● Startup Program – Free Enterprise License for Startups,

Educational Institutions, Non-Profits

Resources

Page 24: Introduction to Apex