Building a high scale machine learning pipeline with Apache Spark and Kafka

Bedő Dániel

https://www.flickr.com/photos/sanjayaprime/5013115478


Post on 16-Oct-2020




• biggest community-driven question & answer website in Germany

• 20 million questions, 70 million answers

• similar to Quora, Yahoo Answers


Google Update Impact


Ordering of answers


Supervised machine learning

• determine the type of the training data

• gather a training set

• find a representation of the data

• pick a learning algorithm

• run the training algorithm

• evaluate the accuracy
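
The steps above can be sketched end to end in a few lines. This is only an illustration on synthetic data, assuming a single numeric feature and a least-squares line as the learning algorithm; the data set, the function names and the choice of mean squared error are assumptions and are not taken from the talk.

```python
import random

def make_dataset(n, seed=0):
    """Synthetic (feature, label) pairs: y = 3x + 2 plus small noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = 3.0 * x + 2.0 + rng.gauss(0.0, 0.5)
        data.append((x, y))
    return data

def fit_linear(train):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in train)
    var = sum((x - mean_x) ** 2 for x, _ in train)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, data):
    """Average squared prediction error of the fitted line on `data`."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)

# Gather a training set, hold out a test set, train, then evaluate.
data = make_dataset(200)
train, test = data[:150], data[150:]
model = fit_linear(train)
test_mse = mean_squared_error(model, test)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test MSE={test_mse:.3f}")
```

Evaluating on held-out data rather than on the training set is what makes the accuracy estimate meaningful.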


Regression Prototype


Identify the problems

• Model not complex enough

• Similar inputs, different outputs?

• Not enough training data


The old ETL pipeline


ETL v2

Spark


Spark ecosystem

• API: Scala, Python, Java, R

• Libraries: Spark Streaming, Spark SQL, MLlib, GraphX


Kafka

[Diagram: two producers write to a Kafka cluster and three consumers read from it. The cluster has three brokers: Broker 1 holds Topic 1 Partition 0, Broker 2 holds Topic 1 Partition 1, Broker 3 holds Topic 1 Partition 2.]


Kafka topic

• scale

• parallelism

[Diagram: a producer appends to a topic with three partitions (partition 0, partition 1, partition 2); each partition is an ordered sequence of offsets (0-9, 0-7, 0-8).]
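
How a topic spreads records over partitions can be shown with a toy in-memory model. This is a sketch under assumptions, not Kafka client code: the `Topic` class, the key names and the CRC32 hash are all hypothetical stand-ins chosen so the example runs without a broker.

```python
import zlib
from collections import defaultdict

class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash of the key, modulo the partition count.
        # CRC32 is a stand-in here, not the hash real Kafka clients use.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append a record and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

topic = Topic(num_partitions=3)
placements = defaultdict(list)
for i in range(9):
    key = f"question-{i % 3}"  # hypothetical record key
    placements[key].append(topic.send(key, f"event-{i}"))

for key, where in sorted(placements.items()):
    print(key, where)
```

Records with the same key always land in the same partition, so their relative order is preserved, while different partitions can be consumed in parallel.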


Parquet

id  cc  votes
1   DE  2
2   DE  3
3   AT  1
4   DE  2

SELECT votes FROM logs WHERE cc = 'AT'

push-down filters
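
The query above can be run against a toy column store to show what a pushed-down filter saves. This is a simplified model, not the Parquet file format: the row-group size, the in-memory layout and the function names are assumptions made for the illustration.

```python
# Toy column store: each row group keeps its columns separately plus min/max stats.
ROWS = [(1, "DE", 2), (2, "DE", 3), (3, "AT", 1), (4, "DE", 2)]
COLUMNS = ("id", "cc", "votes")

def build_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[i] for row in chunk] for i, name in enumerate(COLUMNS)}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups

def select_where_equals(groups, select, where, value):
    """SELECT <select> FROM groups WHERE <where> = value."""
    result, groups_read = [], 0
    for group in groups:
        low, high = group["stats"][where]
        if not (low <= value <= high):
            continue  # filter pushed down: the whole group is skipped unread
        groups_read += 1
        predicate_col = group["columns"][where]
        selected_col = group["columns"][select]  # other columns are never touched
        result.extend(s for p, s in zip(predicate_col, selected_col) if p == value)
    return result, groups_read

groups = build_row_groups(ROWS, group_size=2)
votes, groups_read = select_where_equals(groups, "votes", "cc", "AT")
print(votes, groups_read)
```

Two savings show up here: only the `cc` and `votes` columns are read, and the first row group is skipped entirely because its statistics rule out `cc = 'AT'`.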


ETL v2

[Architecture diagram. Components: Services, RabbitMQ, Tracking, Kafka, Spark Cluster, Streaming Worker, HDFS (Tracking), MySQL Master, MySQL Read Slave, Redis Master, Redis Read Slave, ElasticSearch]


Project Moria


Clean training data?


Project Angmar

• tried lots of different supervised learning methods

• feature engineering - most crucial part

• analyse the domain, chart everything


Features

Content

• length

• syntactic complexity

• number of links

• probability of deletion

Social

• votes

• most helpful answer

• number of comments

• answered by expert

Author

• gained votes

• credibility score

• role

• deleted answer ratio

• number of answers

• number of comments

• reported answer ratio


The structure of the network

[Diagram: answer vector, normalized answer vector (AV normalized), network layers, output]
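
The path from a raw answer vector to an output score can be sketched as a normalization step followed by a feed-forward pass. The talk does not give the layer sizes, the weights, the activation function or the normalization scheme, so everything below is assumed for illustration: min-max scaling, sigmoid units, one hidden layer and made-up numbers.

```python
import math

def min_max_normalize(vector, lows, highs):
    """Scale each feature to [0, 1] using per-feature bounds, clamping outliers."""
    return [
        0.0 if high == low else min(1.0, max(0.0, (x - low) / (high - low)))
        for x, low, high in zip(vector, lows, highs)
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, layers):
    """Feed-forward pass; each layer is a list of (weights, bias) neurons."""
    activations = inputs
    for layer in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations

# Hypothetical raw answer vector: [length, votes, number of links, number of comments]
answer_vector = [210.0, 3.0, 1.0, 2.0]
lows = [0.0, 0.0, 0.0, 0.0]
highs = [1000.0, 10.0, 5.0, 10.0]  # assumed upper bounds per feature
normalized = min_max_normalize(answer_vector, lows, highs)

# Made-up weights: 4 inputs -> 2 hidden neurons -> 1 output.
layers = [
    [([0.5, 1.0, -0.5, 0.2], 0.0), ([-0.3, 0.8, 0.1, 0.4], 0.1)],
    [([1.2, -0.7], 0.0)],
]
score = forward(normalized, layers)[0]
print(normalized, round(score, 3))
```

Normalizing first keeps features with large raw ranges, such as length, from dominating features with small ranges, such as votes.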


The Result


Lambda Architecture

Batch Layer (Spark Batch)

• Calculate features for all answers

• Insert features in Redis

• Calculate Score and store in MySQL

Speed Layer (Spark Streaming)

• Listen for events

• Insert or update Redis

• Calculate Score and store in MySQL

Serving Layer
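
The division of work between the three layers can be sketched with plain dictionaries. This is an assumption-heavy illustration: the dictionaries stand in for Redis and MySQL, the scoring function is a placeholder rather than the trained model, and the event format is invented.

```python
feature_store = {}  # stands in for Redis: answer_id -> feature dict
score_store = {}    # stands in for MySQL: answer_id -> score

def score(features):
    # Placeholder scoring function; the real scorer would be the trained model.
    return features["votes"] + 0.5 * features["comments"]

def batch_layer(all_answers):
    """Recompute features and scores for every answer from the full data set."""
    for answer in all_answers:
        features = {"votes": answer["votes"], "comments": answer["comments"]}
        feature_store[answer["id"]] = features
        score_store[answer["id"]] = score(features)

def speed_layer(event):
    """Apply a single event incrementally, then rescore only that answer."""
    features = feature_store.setdefault(event["answer_id"], {"votes": 0, "comments": 0})
    features[event["field"]] += event["delta"]
    score_store[event["answer_id"]] = score(features)

def serving_layer(answer_ids):
    """Order answers by their stored score, highest first."""
    return sorted(answer_ids, key=lambda a: score_store.get(a, 0.0), reverse=True)

batch_layer([
    {"id": "a1", "votes": 2, "comments": 0},
    {"id": "a2", "votes": 1, "comments": 1},
])
before = serving_layer(["a1", "a2"])
speed_layer({"answer_id": "a2", "field": "votes", "delta": 2})  # two new votes arrive
after = serving_layer(["a1", "a2"])
print(before, after)
```

The batch layer gives a complete but delayed view, the speed layer keeps individual answers current between batch runs, and the serving layer only reads stored scores.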


Back pressure

• Bulk jobs insert too fast

• MySQL: sendQueue size, threads connected

• ElasticSearch: load on the instance creating the new index
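
One way to keep bulk jobs from overrunning a target is to poll a load metric before each batch and pause while it is too high. The sketch below assumes such a metric is available (for MySQL it could be the number of connected threads); the function names, the thresholds and the fake database are hypothetical.

```python
import time

def write_with_back_pressure(batches, write_batch, current_load, max_load,
                             pause_seconds=0.01, max_waits=100):
    """Write batches one by one, pausing while the target reports high load."""
    waits = 0
    for batch in batches:
        attempts = 0
        while current_load() > max_load:
            if attempts >= max_waits:
                raise RuntimeError("target stayed overloaded; giving up")
            time.sleep(pause_seconds)
            attempts += 1
            waits += 1
        write_batch(batch)
    return waits

# Fake target: load rises with each write and drains each time it is polled.
class FakeDatabase:
    def __init__(self):
        self.load = 0
        self.rows = []

    def write_batch(self, batch):
        self.rows.extend(batch)
        self.load += len(batch)

    def current_load(self):
        self.load = max(0, self.load - 2)  # simulate the target catching up
        return self.load

db = FakeDatabase()
batches = [[i, i + 1, i + 2] for i in range(0, 12, 3)]
waits = write_with_back_pressure(batches, db.write_batch, db.current_load,
                                 max_load=2, pause_seconds=0.0)
print(len(db.rows), waits)
```

The writer slows down to the pace the target can sustain instead of failing or degrading reads for other clients.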


Debugging the network

Change individual features (+1)
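
Changing one feature at a time and watching the score is a simple sensitivity check. In the sketch below the model is a hypothetical linear stand-in for the trained network, and the feature names and values are illustrative only.

```python
def sensitivity(model, features, delta=1.0):
    """Score change when each feature alone is increased by `delta`."""
    baseline = model(features)
    changes = {}
    for name in features:
        perturbed = dict(features)  # copy so the original stays untouched
        perturbed[name] += delta
        changes[name] = model(perturbed) - baseline
    return changes

# Hypothetical stand-in for the trained network.
def toy_model(f):
    return 0.4 * f["votes"] - 0.1 * f["number_of_links"] + 0.05 * f["number_of_comments"]

features = {"votes": 3.0, "number_of_links": 1.0, "number_of_comments": 2.0}
changes = sensitivity(toy_model, features)
for name, change in sorted(changes.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {change:+.2f}")
```

A feature whose change moves the score in an unexpected direction, or not at all, is a good place to start looking for problems in the training data or the feature pipeline.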


real world test (deleted vs non-deleted)

[Chart series: deleted, non-deleted]


Switching models

[Chart: amount of questions for a score range; series: Old score, New score]


Insights

[Chart labels: Men, Women]


Learnings

• If your use case is complex, you need a complex model

• If you have a complex model, you need lots of data

• If you have lots of data, you need an ETL pipeline that can process huge amounts of data fast

• Think about your use case first, then design the pipeline


Questions?

You can ask them on gutefrage too :)