kafka for data scientists

25
Jennifer Rawlins Real Time Streaming with Kafka - for the data scientist

Upload: jenn-rawlins

Post on 16-Apr-2017

244 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Kafka for data scientists

Jennifer Rawlins

Real Time Streaming with Kafka - for the data scientist

Page 2: Kafka for data scientists

About MeJenn Rawlins has been creating software solutions for 19 years. She began her career at Microsoft as an engineer in test, and as an international program manager. This was followed by management and consultant roles, working with VPs and Directors across multiple industries to create custom software solutions. She then changed her focus to software engineering roles.

Recently Jenn has created Big Data solutions using Hadoop, Yarn, Kafka, and Cassandra, writing real time streaming solutions in Java and Scala. Her current focus is a solution in AWS for IoT devices.

Page 3: Kafka for data scientists

AGENDA

❖ Messaging Systems

❖ Kafka

❖ SparkR

❖ Data Processing Pipelines

Page 4: Kafka for data scientists

What is a message queueing systemMessages are sent to a queue. Messages are read from a queue. The queue is independent of the senders or receivers (Publishers/Subscribers or Producers/Consumers). Fast, Predictable, easy to scale.

Cloud solutions

Amazon SQS - Simple Queue Service

Azure service bus

Server Solutions

Kafka

IBM WebSphere MQ

RabbitMQ

ActiveMQ

Windows Servers MSMQ

Page 5: Kafka for data scientists

KafkaLinkedIn uses Apache Kafka as a central publish-subscribe log for integrating data between applications, stream processing, and Hadoop data ingestion.

REAL-TIME STREAMING

1.Data pipelines that reliably get data between systems or applications.

2.Applications to transform or react to streams of data.

Page 6: Kafka for data scientists

Real Time Process streams of records as they occur. Data in, Data out.

Fault Tolerant Store streams of records in a fault-tolerant way.

Highly Scalable (Horizontal) Nodes can be added and removed from a

Kafka Cluster and the cluster will rebalance itself. High Availability begins at 5 Nodes.

Page 7: Kafka for data scientists

Ordering guaranteed within a partition as it was received

Parallel processing of partitioned topics

Multi publisher (producer) - kafka writes message as received to a specific topic, balancing across multiple partitions.

Multi subscriber (consumer) - Partitions assigned to specific subscriber.

Page 8: Kafka for data scientists

Producer

ProducerProduce

rProducerProduce

rProducer

Consumer

ConsumerConsume

rConsumerConsume

rConsumer

KafkaProducer

Consumer

Consumer

Kafka Cluster

Producer

Producer

Consumer

Page 9: Kafka for data scientists

Record consists of a key, a value, and a timestamp. (message)

Topic kafka stores streams of records in categories called topics.

Cluster Kafka is run as a cluster on one or more servers.

Broker The actual server, and synchronization layer between server instances.

Node The logical kafka entity or ‘worker’ on each server.

Publish and subscribe to streams of records. Similar to a message queue or enterprise messaging system.

Page 10: Kafka for data scientists

Publish and Consume streams of records.

Process streams of records efficiently and in real time.

Store streams of records safely in a distributed, replicated cluster. Fault Tolerant.

Page 11: Kafka for data scientists

A Stream is an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records.

A Stream DSL is stateful, and is a processor topology.

# Example: a record stream for page view events

1 => {"time":1440557383335, "user_id":1, "url":"/home?user=1"}

5 => {"time":1440557383345, "user_id":5, "url":"/home?user=5"}

2 => {"time":1440557383456, "user_id":2, "url":"/profile?user=2"}

1 => {"time":1440557385365, "user_id":1, "url":"/profile?user=1"}

Page 12: Kafka for data scientists

Typical Use Cases

Message Broker ActiveMQ or RabbitMQ

Website Activity Tracking

Metrics - monitoring

Log Aggregation

Stream Processing

Event Sourcing

Page 13: Kafka for data scientists

Website Activity Tracking

TRACKING - Web Site Activity

Add clicks

page views,

searches,

or other actions users may take

Record of each activity is published to central topics, with one topic per activity type.

Page 14: Kafka for data scientists

Application

Connector

RealTimeProcessor

Application

Application

Connector

Kafka Cluster

DataStore

DataStore

Application

Producers (write)

DataStore

Processor

Real Time

Consumers (read)

Connectors

Stream Processors

User Action

Page 15: Kafka for data scientists

Platforms Spark runs on Hadoop Yarn, Apache Mesos, in Standalone cluster mode, or in the on EC2.

Languages Can be used from Scala, Python, and R shells

Processing optimizes jobs running on Hadoop in memory by 100x, or 10X faster on disk.

Page 16: Kafka for data scientists

R limitationsR is a popular statistical programming language used for data processing and machine learning tasks.

Data Analysis is usually limited to a single thread, and the memory available on a single computer.

Page 17: Kafka for data scientists

Developed at the AMPLab, it was accepted and merged into Spark version 1.4

Provides an R frontend to apache Spark

Uses the Sparks data sources API to read from a variety of sources: Hive(Hadoop), Json Files, Parquet Files.

Uses Spark’s distributed computation engine to run large scale data analysis from the R shell on a cluster: Many Cores, Many Machines.

SparkDataFrame (distributed collection of data organized in named columns) inherit optimizations from the computation engine.

SparkR: R package for Apache Spark

Page 18: Kafka for data scientists

MLib and SparkR

Machine Learning algorithms currently supported:

Generalized Linear Model

Accelerated Failure Time (AFT)

Survival Regression Model

Naive Bayes Model

KMeans Model

SparkR uses MLib to training the model.

Page 19: Kafka for data scientists

Real Time Record Processing

Example Real Time Scenario: Serve up related ads to user that are more likely to be clicked

Kafka Data Stream

Spark StreamingWebsiteUser Clicks Ad Record added to

AdClick TopicAdClick run Ad through model to update predictive score

ApplicationLog Click RecordUse AdClick to find related ads to serve to user using predictive scoring.

Display New Ads to User

Page 20: Kafka for data scientists

Real-time process user data using an R model in a Spark job.

Batch process data from Kafka, Hadoop HDFS, SQL, Cassandra, HBase

Model Training multiple times with SparkR from multiple data sources

Page 21: Kafka for data scientists

Historical Record Batch ProcessingSparkRKafka Data Streams

AdClick

HomePageView

Spark job

AdClick topic: run recent records through model

RSpark & SparkHadoopHive

AdClick model training on historical data

Cassandra

SQL

Pull Topics to create stores of data for many related features

AdView Kafka Topic

Page 22: Kafka for data scientists

Language

Kafka is written in Java

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older version. We provide a Java client for Kafka, but clients are available in many languages0

.JavaC/C++PythonGo (AKA golang)Erlang.NETClojureRubyNode.js

Proxy (HTTP REST, etc)Perlstdin/stdoutPHPRustAlternative JavaStormScala DSL Clojure

Page 23: Kafka for data scientists

Kafka http://kafka.apache.org/

Free and Open Source Software under the Apache License

Github code repo: https://github.com/apache/kafka

Confluent http://www.confluent.io/

Open Source offering Consulting, Training, Support, Monitoring Tools

Confluent Docs: http://docs.confluent.io/3.0.0/streams/developer-guide.html

Examples: https://github.com/confluentinc/examples/tree/kafka-0.10.0.0-cp-3.0.0/kafka-streams/src/main/java/io/confluent/examples/streams

Page 24: Kafka for data scientists