low-latency ingestion and analytics with apache kafka and apache apex

24
Low-latency ingestion and analytics with Apache Kafka and Apache Apex Thomas Weise, Architect DataTorrent, PPMC member Apache Apex March 28 th 2016

Upload: datatorrent

Post on 08-Jan-2017

111 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Low-latency ingestion and analytics with Apache Kafka and Apache Apex

Thomas Weise, Architect DataTorrent, PPMC member Apache Apex  March 28th 2016

Page 2: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Apache Apex Features• In-memory Stream Processing• Scale out, Distributed, Parallel, High Throughput• Windowing (temporal boundary)• Reliability, Fault Tolerance• Operability• YARN native• Compute Locality• Dynamic updates

2

Page 3: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

3

Apex Platform Overview

Page 4: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

4

Apache Apex Malhar Library

Page 5: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Apache Kafka

5

“A high-throughput distributed messaging system.”“Fast, Scalable, Durable, Distributed”

Kafka is a natural fit to deliver events into Apex for low-latency processing.

Page 6: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Kafka Integration - Consumer

6

• Low-latency, high throughput ingest• Scales with Kafka topics

ᵒ Auto-partitioningᵒ Flexible and customizable partition mapping

• Fault-tolerance (in 0.8 based on SimpleConsumer)ᵒ Metadata monitoring/failover to new brokerᵒ Offset checkpointingᵒ Idempotencyᵒ External offset storage

• Support for multiple clustersᵒ Built for better resource utilization

• Bandwidth controlᵒ Bytes per second

Page 7: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Kafka Integration - Producer

7

• Output operator is a Kafka producer• Exactly once strategy

ᵒ On failure data already sent to message queue should not be re-sentᵒ Sends a key along with data that is monotonically increasingᵒ On recovery operator asks the message queue for the last sent message

• Gets the recovery key from the messageᵒ Ignores all replayed data with key that is less than or equal to the

recovered keyᵒ If the key is not monotonically increasing then data can be sorted on the

key at the end of the window and sent to message queue• Implemented in operator AbstractExactlyOnceKafkaOutputOperator

in apache/incubator-apex-malhar github repository available here

Page 8: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

8

Apex Application Specification

Page 9: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

9

Logical and Physical Plan

Page 10: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

10

PartitioningNxM PartitionsUnifier

0 1 2 3

Logical DAG

0 1 2

1

1 Unifier

1

20

Logical Diagram

Physical Diagram with operator 1 with 3 partitions

0

Unifier

1a

1b

1c

2a

2b

Unifier 3

Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck

Unifier

Unifier0

1a

1b

1c

2a

2b

Unifier 3

Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier

Page 11: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

11

Advanced Partitioning

0

1a

1b

2 3 4Unifier

Physical DAG

0 4

3a2a1a

1b 2b 3b

Unifier

Physical DAG with Parallel Partition

Parallel Partition

Container

uopr

uopr1

uopr2

uopr3

uopr4

uopr1

uopr2

uopr3

uopr4

dopr

dopr

doprunifier

unifier

unifier

unifier

Container

Container

NIC

NIC

NIC

NIC

NIC

Container

NIC

Logical Plan

Execution Plan, for N = 4; M = 1

Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers

Cascading Unifiers

0 1 2 3 4

Logical DAG

Page 12: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

12

Dynamic Scaling

Partitioning change while application is running• Change number of partitions at runtime based on stats• Determine initial number of partitions dynamically

– Kafka operators scale according to number of Kafka partitions• Supports re-distribution of state when number of partitions change• API for custom scaling or partitioning

2b

2c

3

2a

2d

1b

1a1a 2a

1b 2b

3

1a 2b

1b 2c 3b

2a

2d

3a

Unifiers not shown

Page 13: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

13

Fault Tolerance• Operator state is checkpointed to persistent store

ᵒ Automatically performed by engine, no additional coding neededᵒ Asynchronous and distributed ᵒ In case of failure operators are restarted from checkpoint state

• Automatic detection and recovery of failed containersᵒ Heartbeat mechanismᵒ YARN process status notification

• Buffering to enable replay of data from recovered pointᵒ Fast, incremental recovery, spike handling

• Application master state checkpointedᵒ Snapshot of physical (and logical) planᵒ Execution layer change log

Page 14: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

14

Streaming Windows

Application window Sliding window and tumbling window

Checkpoint window No artificial latency

Page 15: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

15

Checkpointing Operator State• Save state of operator so that it can be recovered on failure• Pluggable storage handler• Default implementation

ᵒ Serialization with Kryoᵒ All non-transient fields serializedᵒ Serialized state written to HDFSᵒ Writes asynchronous, non-blocking

• Possible to implement custom handlers for alternative approach to extract state or different storage backend (such as IMDG)

• For operators that rely on previous state for computationᵒ Operators can be marked @Stateless to skip checkpointing

• Checkpoint frequency tunable (by default 30s)ᵒ Based on streaming windows for consistent state

Page 16: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

16

Processing GuaranteesAt-least-once• On recovery data will be replayed from a previous checkpoint

ᵒ No messages lostᵒ Default, suitable for most applications

• Can be used to ensure data is written once to storeᵒ Transactions with meta information, Rewinding output, Feedback

from external entity, Idempotent operationsAt-most-once• On recovery the latest data is made available to operator

ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient

Exactly-onceᵒ At-least-once + idempotency + transactional mechanisms (operator

logic) to achieve end-to-end exactly once behavior

Page 17: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

17

Idempotency with Kafka Consumer

Page 18: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Use Case – Ad TechCustomer:• Leading digital automation software company for publishers• Helps publishers monetize their digital assets• Enables publishers to make smarter inventory decisions and improve revenue

Features: • Reporting of critical metrics from auctions and client logs• Revenue, impression, and click information• Aggregate counters and reporting on top N metrics• Low latency querying using pub-sub model

18

Page 19: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Use Case – Ad Tech

19

User Browser

AdServer

REST proxy

REST proxy

Kafka Cluster

Client logs

Kafka Input(Auction logs)

Kafka Input(Client logs)

CDN(Caching of logs)

ETL ETL

Filter Filter

Dimensions Aggregator

Dimensions Aggregator

Dimensions Store

Query Query Result

Kafka Cluster

Auction Logs

Client logs

Middleware

Auction Logs

Client logs

Kafka Messages Kafka Messages

Decompress & Flatten

Decompress & Flatten

Filtered Events Filtered Events

Aggregates

Query from MW

Query Query Results

Kafka Cluster

Page 20: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Use Case – Ad Tech

20

Page 21: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Use Case – Ad Tech

• 15+ billion impressions per day

• Average data inflow of 200K events/sec

• 64 Kafka Input operators reading from 6 geographically distributed DCs

• 32 instances of in-memory distributed store

• 64 aggregators

• ~150 container processes, 30+ nodes

• 1.2 TB memory footprint @ peak load

21

Page 22: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Resources

22

• Apache Apex - http://apex.apache.org/• Subscribe - http://apex.apache.org/community.html• Download - https://www.datatorrent.com/download/• Twitter

ᵒ @ApacheApex; Follow - https://twitter.com/apacheapexᵒ @DataTorrent; Follow – https://twitter.com/datatorrent

• Meetups - http://www.meetup.com/topics/apache-apex• Webinars - https://www.datatorrent.com/webinars/• Videos - https://www.youtube.com/user/DataTorrent• Slides - http://www.slideshare.net/DataTorrent/presentations • Startup Accelerator Program - Full featured enterprise product

ᵒ https://www.datatorrent.com/product/startup-accelerator/

Page 23: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

We Are Hiring

23

[email protected]• Developers/Architects• QA Automation Developers• Information Developers• Build and Release• Community Leaders

Page 24: Low-latency Ingestion and Analytics with Apache Kafka and Apache Apex

Q&A

24