Pixels Camp 2017 - Stories from the Trenches of Building a Data Architecture

BinaryEdge.io. Be Ready. Be Safe. Be Secure. Florentino Bexiga, Data Engineer / Platform Developer. Stories from the Trenches of Building a Data Architecture.

Upload: tiago-henriques

Post on 21-Jan-2018


Page 1: Pixels Camp 2017 - Stories from the trenches of building a data architecture

BinaryEdge.io

Be Ready. Be Safe. Be Secure.

Florentino Bexiga

Stories from the Trenches of Building a Data Architecture

Data Engineer / Platform Developer


WHO WE ARE AND WHAT WE DO

DATA POINTS (diagram; labels listed in extraction order)

VNC, RDP, Files, People, Social, Company registration, internal, external, Phone, Email, Linked urls
BGP, AS, Whois, AS membership, AS peer, List of IPs, Shared infrastructure, Co-hosted sites
Contact, Geolocation, Office locations, Social networks, Phone
portscan, dns, Screenshots, Web, Services, http, https, Users, Apps, Files, Banners, Image Classifier, Vulnerabilities
metadata, Photos, Family & friends, Behaviour, Likes, Topics, Search, News, Forums, Sub-reddits
Domains, AXFR, MX records, Webserver, Framework, Headers, Cookies
Certificate, Configuration, Authorities, Entities
OCR, SW, ip address, url address, SMB
torrents, peers, torrent name, category, source, hashes of files


AGENDA

01 THE NEED OF A DATA ARCHITECTURE

02 SIMPLE ARCHITECTURE OVERVIEW

03 THE BASIC SURVIVAL KIT

04 MESSAGE QUEUE

05 STREAM PROCESSING

06 BATCH PROCESSING

07 DATABASES

08 BONUS ROUND: MANAGEMENT

09 ARCHITECTURE REVISITED

10 CLOUD-BASED ARCHITECTURES


THE NEED OF A DATA ARCHITECTURE

Rules before building a data architecture
Think about what you need to do with the data
There are no more rules

Typical list of needs
Gather a lot of data coming from different places
Process that data in (close to) real-time
Make data available in multiple formats
Provide ways to easily process that data


SIMPLE ARCHITECTURE OVERVIEW

[Diagram: components]
SENSOR (x3)
DATA SINK
MESSAGE QUEUE
STREAM PROCESSING
FILE STORAGE
BATCH PROCESSING
DATABASES
APIs, PORTALS
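The components on this slide can be wired together in a toy, in-memory form. This is only a sketch: the order in which the boxes connect is an assumption (the diagram's arrows are not in the transcript), and `sensor`, `run_pipeline` and the dictionary keys are hypothetical names chosen for illustration.

```python
import json
import queue
from collections import Counter


def sensor(sensor_id, readings):
    """Yield JSON messages, standing in for one sensor's output."""
    for value in readings:
        yield json.dumps({"sensor": sensor_id, "value": value})


def run_pipeline(sensors):
    message_queue = queue.Queue()  # stands in for the message queue
    file_storage = []              # stands in for file storage (raw archive)
    database = {}                  # stands in for the databases

    # Data sink: collect everything the sensors emit and enqueue it.
    for messages in sensors:
        for message in messages:
            message_queue.put(message)

    # Stream processing: handle each message as soon as it is dequeued.
    while not message_queue.empty():
        record = json.loads(message_queue.get())
        file_storage.append(record)
        database[("latest", record["sensor"])] = record["value"]

    # Batch processing: a job that runs over everything archived so far.
    counts = Counter(record["sensor"] for record in file_storage)
    for sensor_id, count in counts.items():
        database[("count", sensor_id)] = count

    # APIs and portals would read from this database.
    return database
```

Calling `run_pipeline([sensor("a", [1, 2, 3]), sensor("b", [9])])` fills the dictionary with the latest value and the message count per sensor.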


THE BASIC SURVIVAL KIT

Apache Hadoop
MapReduce
HDFS
YARN

Why Apache Hadoop?
Interoperability with many other tools
Great community
Gets the job done
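The MapReduce model named on this slide can be illustrated without a cluster. The sketch below runs the three phases (map, shuffle, reduce) in a single process on a word count; it is plain Python, not the Hadoop API, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network, which is where most of the tuning effort tends to go.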


THE BASIC SURVIVAL KIT

Points of attention

YARN
Available resources per node for processing
Timeouts
Heap, heap...

HDFS
Same as above
Primary/secondary nodes - high availability
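For the YARN points about available resources per node and heap, a little arithmetic helps when sizing containers. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the 0.8 heap fraction is a commonly quoted rule of thumb, not a figure from the talk. In a real cluster the inputs would come from settings such as `yarn.nodemanager.resource.memory-mb`, and the heap would be passed to the JVM through options such as `mapreduce.map.java.opts`.

```python
def containers_per_node(node_memory_mb, reserved_mb, container_mb):
    """How many containers of a given size fit on one node.

    reserved_mb is memory kept back for the OS and daemons (assumed value).
    """
    usable_mb = node_memory_mb - reserved_mb
    if usable_mb < container_mb:
        return 0
    return usable_mb // container_mb


def heap_for_container(container_mb, heap_fraction=0.8):
    """JVM heap to request inside a container, leaving headroom for
    off-heap memory. The default fraction is a rule of thumb only."""
    return int(container_mb * heap_fraction)
```

If the heap is set as large as the container itself, the process can exceed the container limit through off-heap usage and be killed by the node manager, which is one way the "heap, heap..." problem shows up.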


MESSAGE QUEUE

Apache Kafka
Originally developed by LinkedIn
Massively scalable publish/subscribe message queue
High throughput
Low latency

Concepts
Topics
Consumers
Consumer groups
Partitions
Replicas
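The relationship between topics, partitions and consumer groups can be shown with a small in-memory model. This is a sketch, not the Kafka client API: `Topic` and `assign_partitions` are hypothetical names, `zlib.crc32` stands in for whatever hash a real client's partitioner uses, the round-robin assignment is just one possible strategy, and replicas are not modelled.

```python
import zlib


class Topic:
    """A topic is a named set of partitions; each partition is an ordered log."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Messages with the same key always land in the same partition,
        # which preserves their relative order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index


def assign_partitions(num_partitions, consumers):
    """Spread partitions over the consumers of one consumer group.

    Each partition is read by exactly one consumer in the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

One consequence worth noting: with more consumers in a group than partitions, the extra consumers receive nothing, so the partition count caps the parallelism of a group.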


MESSAGE QUEUE

Points of attention
Timeouts
Message sizes
Log retention vs. cleanup interval !!!!

Also, do not, for the love of god, simply delete all the subdirectories in your “kafka-logs” directory; you will cry.
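One reading of the retention versus cleanup interval point is that data is not removed the moment it passes the retention period: it is removed the next time the broker checks, and only a whole segment at a time. The sketch below models that under simplifying assumptions (checks happen at fixed multiples of the interval starting at time 0, and a segment is deletable once its newest record is older than the retention period). The function names are hypothetical; the corresponding broker settings would be along the lines of `log.retention.hours` and `log.retention.check.interval.ms`.

```python
def deletion_time(newest_record_ts, retention_s, check_interval_s):
    """Time of the first cleanup check at which a segment can be deleted."""
    eligible_at = newest_record_ts + retention_s
    # Round up to the next check, using integer arithmetic only.
    checks_needed = -(-eligible_at // check_interval_s)
    return checks_needed * check_interval_s


def oldest_record_age_at_deletion(oldest_record_ts, newest_record_ts,
                                  retention_s, check_interval_s):
    """How old the oldest record in the segment is when it finally goes away."""
    deleted_at = deletion_time(newest_record_ts, retention_s, check_interval_s)
    return deleted_at - oldest_record_ts
```

With a one hour retention and a five minute check, a segment whose records span timestamps 0 to 100 is deleted at 3900 seconds, so disk usage has to be planned for more than the nominal retention period.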


STREAM PROCESSING

[logo] vs. [logo] vs. [logo]


STREAM PROCESSING

The good parts
Very simple programming model and APIs
Multilanguage support
DataFrame API
ML Libraries
Wide community
Wide range of addons

Points of attention
Mini-batch processing, not real stream
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
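The "mini-batch processing, not real stream" point can be made concrete with a small sketch. It groups timestamped events into fixed-interval batches and computes how long each event waits before its batch is processed. This is plain Python with hypothetical function names, not any framework's API, and it assumes a batch is processed exactly when its interval closes.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into fixed-interval batches.

    Returns a list of (batch_start, payloads) sorted by batch start.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batch_start = (timestamp // batch_interval) * batch_interval
        batches[batch_start].append(payload)
    return sorted(batches.items())


def processing_delays(events, batch_interval):
    """Time each event waits until its batch closes and gets processed."""
    return [
        ((timestamp // batch_interval) + 1) * batch_interval - timestamp
        for timestamp, _ in events
    ]
```

A true per-event stream processor would handle each event on arrival; in the mini-batch model the latency floor is tied to the batch interval, which is the trade-off the slide points at.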


STREAM PROCESSING

The good parts
Stream processing
Multilanguage support
Works without much configuration effort
Low resources configuration
Wide community
Lots of connectors and addons
Great performance, like, “The Flash” great

Points of attention
Slightly more complex programming model
Some support for other languages


STREAM PROCESSING

The good parts
Stream processing
Multilanguage support
Simple API (very similar to Spark)
Dataset API
ML Libraries
Good handling of resources
Low configuration/optimisation overhead

Buuuuut.....
Does not have a wide community
Does not have that many connectors and addons


BATCH PROCESSING

The good parts

Apache Spark
Multilanguage support
Simple API
DataFrame API
ML Libraries
Wide community
Wide range of addons

Apache Flink
Multilanguage support
Simple API (very similar)
DataSet API
ML Libraries


BATCH PROCESSING

Points of attention

Apache Spark
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance

Apache Flink
Fewer configuration problems
Better handling of resources
Not a big community
Not many addons


DATABASES

Before committing to a database
01 Think about how you need to access the data
02 Read 1 again
03 Seriously, read 1 again

Select a database, based on your needs, i.e.:
Hardcore read/write workload and not much advanced querying: HBase
Heavy read/write workload and minimally dynamic querying: Cassandra
Advanced text querying and not such heavy read/write workload: something else
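The selection rule on this slide is essentially a lookup from access pattern to database. A minimal sketch of it follows; the function name and the category labels (`"hardcore"`, `"heavy"`, `"moderate"`, and so on) are hypothetical shorthand for the slide's wording, and the fallback simply returns rule 01.

```python
def suggest_database(read_write_load, querying):
    """Map an access pattern to the slide's suggestion.

    read_write_load: "hardcore", "heavy" or "moderate" (assumed labels).
    querying: "basic", "minimally dynamic" or "advanced text" (assumed labels).
    """
    suggestions = {
        ("hardcore", "basic"): "HBase",
        ("heavy", "minimally dynamic"): "Cassandra",
        ("moderate", "advanced text"): "something else",
    }
    fallback = "think about how you need to access the data"
    return suggestions.get((read_write_load, querying), fallback)
```

The point of writing it this way is that the inputs are properties of the workload, not of the database: the access pattern has to be known before the table can be consulted at all.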


BONUS ROUND: MANAGEMENT

Apache Ambari
Provision a Hadoop Cluster
Manage a Hadoop Cluster
Monitor a Hadoop Cluster

Ambari uses Hadoop ecosystem distributions such as:
Hortonworks
Cloudera


ARCHITECTURE REVISITED

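[Diagram: the simple architecture overview with each component named]
SENSOR (x3)
DATA SINK
MESSAGE QUEUE: APACHE KAFKA
STREAM PROCESSING: APACHE STORM
FILE STORAGE: APACHE HDFS
BATCH PROCESSING: APACHE SPARK
DATABASES: APACHE HBASE / CASSANDRA
APIs, PORTALS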


CLOUD BASED ARCHITECTURES

Pros

Less configuration overhead

Less maintenance overhead

Easily scalable

Reliable

Return focus back to data and product

Cons

$$$$$$$$$$


CLOUD BASED ARCHITECTURES

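[Diagram: the simple architecture overview with Google Cloud services]
SENSOR (x3)
DATA SINK
MESSAGE QUEUE: GOOGLE PUBSUB
STREAM PROCESSING: GOOGLE DATAFLOW
FILE STORAGE: GOOGLE CLOUD STORAGE
BATCH PROCESSING: GOOGLE DATAPROC
DATABASES: GOOGLE BIGTABLE / BIGQUERY
APIs, PORTALS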


CLOUD BASED ARCHITECTURES

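[Diagram: the simple architecture overview with Amazon Web Services]
SENSOR (x3)
DATA SINK
MESSAGE QUEUE: AMAZON SIMPLE QUEUE SERVICE
STREAM PROCESSING: AMAZON DATA PIPELINE
FILE STORAGE: AMAZON S3
BATCH PROCESSING: AMAZON ELASTIC MAPREDUCE
DATABASES: AMAZON DYNAMODB / REDSHIFT
APIs, PORTALS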


BE READY. BE SAFE. BE SECURE.

BinaryEdge AG, Freigutstrasse 40, 8001 Zurich, Switzerland


+ 41 78 713 40 00

CONTINGENCY THREAT SAFE IRRELEVANT