War stories with Apache Spark Mate Gulyas

Post on 24-May-2020

Page 1: War stories with Apache Spark - BI Consultingbiconsulting.hu/.../2016budapestdata/...war_stories_with_apache_spa… · War stories with Apache Spark ... Spark TOOLS 0.5-2TB data

War stories with Apache Spark

Mate Gulyas

CTO & Co-Founder

GULYÁS MÁTÉ

@gulyasm

Product placeholder

DATA PLATFORM at Enbritely

What do we do?

DATA COLLECTION

ANALYZE

DATA PROCESSING

ANTI FRAUD

VIEWABILITY

BRAND SAFETY

REPORT + API

HOW DID WE GET HERE?

MONOLITHIC PYTHON ANALYTICS

EVALUATE BIG DATA TECHNOLOGIES

STARTED WORK ON DP

DP PRODUCTION READY

SAAS DP

@gulyasm

DATA COLLECTION

The way to the access log

1. Click event attributes (created by JS tracker)

{
"session_id": "spark_meetup_jsmmmoq",
"timestamp": 1456080915621,
"type": "click"
}

2. Base64-encoded event

eyJzZXNzaW9uX2lkIjoic3Bhcmtfb​WVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=

3. Access log format

TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."
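The Base64 text on this slide decodes to the JSON click event, and the access log carries it as the `event` query parameter. A minimal sketch of recovering the event from one log line could look like the following. The field layout (TS, CLIENT_IP, STATUS, quoted request) follows the slide; the concrete timestamp, IP address, and status values in `LOG_LINE`, and the function name `parse_event`, are hypothetical and only serve the illustration.

```python
import base64
import json
from urllib.parse import parse_qs, urlsplit

# Base64 payload from the slide, joined into a single string.
EVENT_B64 = (
    "eyJzZXNzaW9uX2lkIjoic3Bhcmtfb"
    "WVldHVwX2pzbW1tb3EiLCJ0aW1l"
    "c3RhbXAiOjE0NTYwODA5MTU2M"
    "jEsInR5cGUiOiAiY2xpY2sifQo="
)

# Hypothetical log line in the slide's layout: TS CLIENT_IP STATUS "GET <url>".
# The first three field values are invented for this example.
LOG_LINE = f'1456080915 203.0.113.7 200 "GET https://api.endpoint?event={EVENT_B64}"'


def parse_event(log_line: str) -> dict:
    """Extract and decode the Base64 JSON event from one access log line."""
    # The request is the quoted part of the line: GET <url>
    request = log_line.split('"')[1]
    _method, url = request.split(" ", 1)
    # parse_qs splits each pair at the first "=", so Base64 padding survives.
    # Note: it also turns "+" into a space, so a payload containing "+"
    # would need URL-safe Base64 or percent-encoding on the tracker side.
    payload = parse_qs(urlsplit(url).query)["event"][0]
    return json.loads(base64.b64decode(payload))


if __name__ == "__main__":
    print(parse_event(LOG_LINE))
```

Keeping the decode step in one small function makes it easy to apply per line, for example inside a map over the raw log records.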

DATA PROCESSING

Spark TOOLS

● 0.5-2 TB of data processed daily (1-10B rows)

● Ad-hoc batch queries over 20 TB of data

● 20+ node cluster

● Spent 4 months optimizing it

Luigi TOOLS

Luigi + enbrite.ly extensions = Gabo Luigi

WORKFLOW ENGINE

LESSONS LEARNED

LESSONS LEARNED

YOU WILL SPEND A LOT OF TIME ON TOOLING

Tools we created GABO LUIGI

LESSONS LEARNED

OPTIMIZATION takes a LOT OF TIME

LESSONS LEARNED

OPTIMIZATION NEVER ENDS

LESSONS LEARNED

AUTOMATE PERFORMANCE OPTIMIZATION

PERFORMANCE MEASUREMENTS

● CLUSTER CONFIGURATION

● SPARK JOB CONFIGURATION

● DATA SET VARIATIONS

● IMPACT OF ALGORITHMS
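The four dimensions above can be combined into a measurement grid so that every variation is run the same way. The following is only a sketch under stated assumptions, not the tooling from the talk: the configuration keys are real Spark properties and `--conf` is a real `spark-submit` option, but all values, the application file `job.py`, the data set names, and the `--dataset` / `--algorithm` job options are hypothetical.

```python
import itertools
import shlex

# Hypothetical values for each measurement dimension (illustration only).
EXECUTOR_COUNTS = [10, 20]                     # cluster configuration
JOB_CONFS = [                                  # Spark job configuration
    {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"},
    {"spark.executor.memory": "8g", "spark.sql.shuffle.partitions": "800"},
]
DATASETS = ["sample_small", "sample_large"]    # data set variations
ALGORITHMS = ["baseline", "candidate"]         # impact of algorithms


def build_commands(app="job.py"):
    """Return one spark-submit argument list per combination of the grid."""
    commands = []
    for executors, conf, dataset, algorithm in itertools.product(
        EXECUTOR_COUNTS, JOB_CONFS, DATASETS, ALGORITHMS
    ):
        cmd = ["spark-submit", "--conf", f"spark.executor.instances={executors}"]
        for key, value in sorted(conf.items()):
            cmd += ["--conf", f"{key}={value}"]
        # Everything after the application file is passed to the job itself;
        # --dataset and --algorithm are hypothetical options of that job.
        cmd += [app, "--dataset", dataset, "--algorithm", algorithm]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; a real harness would
    # execute each one and record its wall-clock time.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Generating the commands as data, rather than hand-writing them, keeps the runs reproducible and makes it cheap to add another dimension later.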

PERFORMANCE MEASUREMENTS

MARATHON

LESSONS LEARNED

DATA STORAGE IS THE BIGGEST OPTIMIZATION

LESSONS LEARNED

DON’T START WITH SCALA AND SPARK

LESSONS LEARNED

KEEP ANALYTICS CODE IN ONE REPOSITORY

LESSONS LEARNED

STRUCTURE YOUR CODE

LESSONS LEARNED

START WITH THE SMALLEST BIG DATA PROJECT

HOW DID WE GET HERE?

MONOLITHIC PYTHON ANALYTICS

EVALUATE BIG DATA TECHNOLOGIES

STARTED WORK ON DP

DP PRODUCTION READY

SAAS DP

@gulyasm

LESSONS LEARNED

REUSE CODE

LESSONS LEARNED

REUSE KNOWLEDGE

Unified Data Processing Engine

NOT EVERY USE CASE IS A SPARK USE CASE

MATE GULYAS

@gulyasm @enbritely

THANK YOU!