pixels camp 2017 - stories from the trenches of building a data architecture

BinaryEdge.io

Be Ready. Be Safe. Be Secure.

Florentino Bexiga

Stories from the Trenches of Building a Data Architecture

Data Engineer/ Platform [email protected]

WHO WE ARE AND WHAT WE DO

[Diagram: DATA POINTS. Labels recovered from the figure: VNC; RDP; SMB; Files; People; Social; Company registration; internal; external; Phone; Email; Linked urls; BGP; AS; Whois; AS membership; AS peer; List of IPs; Shared infrastructure; Co-hosted sites; Contact; Geolocation; Office locations; Social networks; portscan; dns; Screenshots; Web; Services; http; https; Users; Apps; Banners; Image Classifier; Vulnerabilities; metadata; Photos; Family & friends; Behaviour; Likes; Topics; Search; News; Forums; Sub-reddits; Domains; AXFR; MX records; Webserver; Framework; Headers; Cookies; Certificate; Configuration; Authorities; Entities; OCR; SW; ip address; url address; torrents; peers; torrent name; category; source; hashes of files]

AGENDA

01 THE NEED OF A DATA ARCHITECTURE

02 SIMPLE ARCHITECTURE OVERVIEW

03 THE BASIC SURVIVAL KIT

04 MESSAGE QUEUE

05 STREAM PROCESSING

06 BATCH PROCESSING

07 DATABASES

08 BONUS ROUND: MANAGEMENT

09 ARCHITECTURE REVISITED

10 CLOUD-BASED ARCHITECTURES

THE NEED OF A DATA ARCHITECTURE

Rules before building a data architecture

Think about what you need to do with the data

There are no more rules

Typical list of needs

Gather a lot of data coming from different places

Process that data in (close to) real-time

Make data available in multiple formats

Provide ways to easily process that data

SIMPLE ARCHITECTURE OVERVIEW

[Diagram: SENSOR (x3), DATA SINK, MESSAGE QUEUE, STREAM PROCESSING, FILE STORAGE, BATCH PROCESSING, DATABASES, APIs, PORTALS]
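A minimal Python sketch of how the stages named in the overview might fit together. The wiring is an assumption, since the slides only name the components: sensors feed a data sink, the sink publishes to a message queue, and the queue feeds both a stream processor and file storage that a batch job reads later. Every function and variable name here is hypothetical.

```python
from collections import Counter, deque


def sensor(name, readings):
    """Hypothetical sensor: yields one event dict per reading."""
    for value in readings:
        yield {"sensor": name, "value": value}


def run_pipeline(sensors):
    queue = deque()      # stands in for the message queue
    file_storage = []    # stands in for file storage (raw events kept for batch jobs)
    stream_results = []  # what the stream processor emits per event

    # Data sink: collects events from every sensor and publishes them to the queue.
    for events in sensors:
        for event in events:
            queue.append(event)

    # Consumers drain the queue: one path processes each event, one persists it.
    while queue:
        event = queue.popleft()
        stream_results.append((event["sensor"], event["value"] * 2))  # toy transform
        file_storage.append(event)

    # Batch processing: an aggregate computed later over everything in storage.
    batch_counts = Counter(e["sensor"] for e in file_storage)
    return stream_results, dict(batch_counts)


stream, batch = run_pipeline([sensor("a", [1, 2]), sensor("b", [3])])
print(stream)  # per-event results
print(batch)   # events stored per sensor
```

In a real deployment each stand-in (the deque, the list) would be a separate system, which is what the rest of the talk walks through.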


Apache Hadoop

MapReduce

HDFS

YARN

Why Apache Hadoop?

Interoperability with many other tools

Great community

Gets the job done



Points of attention

YARN

Available resources per node for processing

Timeouts

Heap, heap...

HDFS

Same as above

Primary/secondary nodes - high availability
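The "resources per node" and "heap" points can be illustrated with a little arithmetic. This is a rough sketch, not a tuning guide: the function names are hypothetical, and the 0.8 heap fraction is an assumed rule of thumb rather than a YARN default. The node-level inputs play the role of the YARN settings yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores.

```python
def containers_per_node(node_memory_mb, node_vcores,
                        container_memory_mb, container_vcores=1):
    """How many containers of a given size fit on one node.

    The tighter of the two limits (memory or CPU) wins. Whether CPU is
    actually taken into account depends on how the scheduler is configured.
    """
    by_memory = node_memory_mb // container_memory_mb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu)


def jvm_heap_mb(container_memory_mb, heap_fraction=0.8):
    """Heap to give the JVM inside a container.

    heap_fraction is an assumption: the JVM needs room outside the heap,
    so the heap should be smaller than the container, or the container
    risks being killed for exceeding its memory limit.
    """
    return int(container_memory_mb * heap_fraction)


# Example node: 48 GB usable by YARN, 16 vcores, 4 GB containers.
print(containers_per_node(49152, 16, 4096))
print(jvm_heap_mb(4096))
```

With these example numbers memory is the limiting resource, so adding cores alone would not buy more parallelism.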

MESSAGE QUEUE

Apache Kafka

Originally developed by LinkedIn

Massively scalable publish/subscribe message queue

High throughput

Low latency

Concepts

Topics

Consumers

Consumer groups

Partitions

Replicas
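To make the concepts above more concrete, here is a toy model in plain Python (no Kafka client involved). It shows two ideas: a message key decides which partition of a topic a message lands in, and the partitions of a topic are divided among the consumers of one consumer group. The function names are hypothetical, crc32 is only a stand-in for Kafka's own hash, the round-robin assignment is a simplification, and replicas are not modelled.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the message key.

    crc32 is a stand-in here; Kafka's default partitioner uses a different hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread a topic's partitions over the consumers of ONE consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Messages with the same key always land in the same partition,
# which is what preserves per-key ordering.
same = partition_for("host-42", 6) == partition_for("host-42", 6)

# Six partitions shared by four consumers in one group.
group = assign_partitions(6, ["c1", "c2", "c3", "c4"])

# More consumers than partitions: the extra consumer gets nothing to read.
idle = assign_partitions(2, ["a", "b", "c"])
print(same, group, idle)
```

The last case is worth remembering when sizing a topic: within a group, the partition count caps how many consumers can do useful work in parallel.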

MESSAGE QUEUE

Points of attention

Timeouts

Message sizes

Log retention vs. cleanup interval!!!!

Also, do not, for the love of god, simply delete all the subdirectories in your “kafka-logs” directory, you will cry.
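The retention-versus-cleanup point can be made concrete with a rough model. The assumption of this sketch is that a message can stay on disk for about the retention period, plus the time its log segment took to roll, plus one cleanup check interval, because retention applies to closed segments and is only checked periodically. In Kafka broker settings these correspond to log.retention.hours, log.roll.hours and log.retention.check.interval.ms. The function names and example numbers are hypothetical.

```python
def worst_case_age_hours(retention_hours, segment_roll_hours, check_interval_minutes):
    """Rough upper bound on how long a message stays on disk.

    Assumed model: wait for the segment to roll, then for retention to
    expire, then for the next periodic cleanup check.
    """
    return retention_hours + segment_roll_hours + check_interval_minutes / 60


def worst_case_disk_gb(ingest_mb_per_sec, replication_factor, age_hours):
    """Cluster-wide disk needed to hold everything that has not been cleaned up yet."""
    total_mb = ingest_mb_per_sec * 3600 * age_hours * replication_factor
    return total_mb / 1024


age = worst_case_age_hours(retention_hours=24, segment_roll_hours=6,
                           check_interval_minutes=30)
disk = worst_case_disk_gb(ingest_mb_per_sec=2, replication_factor=3, age_hours=age)
print(age, disk)
```

Sizing disks from the retention period alone underestimates the requirement; the gap grows with slow-rolling segments and long check intervals.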

STREAM PROCESSING

Apache Spark vs. Apache Storm vs. Apache Flink

STREAM PROCESSING

Apache Spark

The good parts

Very simple programming model and APIs

Multilanguage support

DataFrame API

ML Libraries

Wide community

Wide range of addons

Points of attention

Mini-batch processing, not real stream

Heavy resource footprint

Prone to timeouts or memory errors

Hard to fine-tune to get the right performance
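The difference between mini-batch processing and real stream processing can be sketched in plain Python (no Spark or other framework involved). Events are hypothetical (timestamp in seconds, value) pairs: the per-event version calls the handler once per event, while the mini-batch version buffers events into fixed time windows and calls the handler once per window. Function names are hypothetical.

```python
def per_event(events, handle):
    """Stream processing: every event is handled on its own, as it arrives."""
    return [handle([event]) for event in events]


def micro_batch(events, handle, batch_interval):
    """Mini-batch processing: events are grouped by time window first,
    and each window is handled as one small batch."""
    batches = {}
    for ts, value in events:
        window = int(ts // batch_interval)
        batches.setdefault(window, []).append((ts, value))
    return [handle(batches[w]) for w in sorted(batches)]


def total(batch):
    """Toy handler: sum the values in whatever it is given."""
    return sum(value for _, value in batch)


events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4), (2.9, 5)]
streamed = per_event(events, total)
batched = micro_batch(events, total, batch_interval=1.0)
print(streamed, batched)
```

In the mini-batch version the event at 0.1 s cannot produce output until its window closes, so the batch interval sets a floor on latency.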

STREAM PROCESSING

Apache Storm

The good parts

Stream processing

Works without much configuration effort

Low-resource configuration

Wide community

Lots of connectors and addons

Great performance, like, “The Flash” great

Points of attention

Slightly more complex programming model

Some support for other languages

STREAM PROCESSING

Apache Flink

The good parts

Stream processing

Simple API (very similar to Spark)

DataSet API

ML Libraries

Good handling of resources

Low configuration/optimisation overhead

Buuuuut.....

Does not have a wide community

Does not have that many connectors and addons

BATCH PROCESSING

The good parts

Apache Spark

Multilanguage support

Simple API

DataFrame API

ML Libraries

Wide community

Wide range of addons

Apache Flink

Simple API (very similar)

DataSet API

ML Libraries

BATCH PROCESSING

Points of attention

Apache Spark

Heavy resource footprint

Prone to timeouts or memory errors

Hard to fine-tune to get the right performance

Apache Flink

Fewer configuration problems

Better handling of resources

Not a big community

Not many addons

DATABASES

Before committing to a database

01 Think about how you need to access the data

02 Read 1 again

03 Seriously, read 1 again

Select a database, based on your needs, i.e.:

Hardcore read/write workload and not much advanced querying: HBase

Heavy read/write workload and minimally dynamic querying: Cassandra

Advanced text querying and not such heavy read/write workload: something else
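Why rule 01 matters can be shown with a toy example. In key-sorted stores, the row key largely decides which reads are cheap. The sketch below assumes a hypothetical access pattern, "give me the most recent results for one IP", and builds the key so that a prefix scan returns the newest rows first. SortedStore is a stand-in written for this example, not the HBase or Cassandra API, and the key layout and MAX_TS bound are assumptions.

```python
import bisect

MAX_TS = 10**10  # assumed upper bound on epoch seconds, used to invert timestamps


def row_key(ip, epoch_seconds):
    """Key layout chosen for the access pattern: newest result per IP sorts first."""
    return f"{ip}#{MAX_TS - epoch_seconds:010d}"


class SortedStore:
    """Toy stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def prefix_scan(self, prefix, limit):
        """Return up to `limit` values whose keys start with `prefix`, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if not key.startswith(prefix) or len(found) == limit:
                break
            found.append(self._rows[key])
        return found


store = SortedStore()
store.put(row_key("10.0.0.1", 1000), "scan-old")
store.put(row_key("10.0.0.1", 2000), "scan-new")
store.put(row_key("10.0.0.2", 1500), "other-host")

latest = store.prefix_scan("10.0.0.1#", limit=1)
history = store.prefix_scan("10.0.0.1#", limit=10)
print(latest, history)
```

A different question, such as "everything seen in the last hour across all IPs", would be expensive with this key, which is why the access pattern has to be known before the database and schema are chosen.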

BONUS ROUND: MANAGEMENT

Apache Ambari

Provision a Hadoop Cluster

Manage a Hadoop Cluster

Monitor a Hadoop Cluster

Ambari uses Hadoop ecosystem distributions such as:

Hortonworks

Cloudera

ARCHITECTURE REVISITED

[Diagram: SENSOR (x3), DATA SINK, APACHE KAFKA, APACHE STORM, APACHE HDFS, APACHE SPARK, APACHE HBASE/CASSANDRA, APIs, PORTALS]

CLOUD-BASED ARCHITECTURES

Pros

Less configuration overhead

Less maintenance overhead

Easily scalable

Reliable

Return focus back to data and product

Cons

$$$$$$$$$$


[Diagram: SENSOR (x3), DATA SINK, GOOGLE PUBSUB, GOOGLE DATAFLOW, GOOGLE CLOUD STORAGE, GOOGLE DATAPROC, GOOGLE BIGTABLE/BIGQUERY, APIs, PORTALS]


[Diagram: SENSOR (x3), DATA SINK, AMAZON SIMPLE QUEUE SERVICE, AMAZON DATA PIPELINE, AMAZON S3, AMAZON ELASTIC MAPREDUCE, AMAZON DYNAMODB/REDSHIFT, APIs, PORTALS]

BE READY. BE SAFE. BE SECURE.

BinaryEdge AG

Freigutstrasse 40, 8001 Zurich

Switzerland

[email protected]

+ 41 78 713 40 00

CONTINGENCY THREAT SAFE IRRELEVANT
