Pixels Camp 2017 - Stories from the Trenches of Building a Data Architecture
BinaryEdge.io
Be Ready. Be Safe. Be Secure.
Florentino Bexiga
Stories from the Trenches of Building a Data Architecture
Data Engineer/ Platform [email protected]
WHO WE ARE AND WHAT WE DO
DATA POINTS
VNC
RDP
Files
People
Social
Company registration
internal
external
Phone
Linked urls
BGP
AS
Whois
AS membership
AS peer
List of IPs
Shared infrastructure
Co-hosted sites
Contact
Geolocation
Office locations
Social networks
Phone
portscan
dns
Screenshots
Web
Services
http
https
Users
Apps
Files
Banners
Image
Classifier
Vulnerabilities
metadata
Photos
Family & friends
Behaviour
Likes
Topics
Search
News
Forums
Sub-reddits
Domains
AXFR
MX records
Webserver
Framework
Headers
Cookies
Certificate
Configuration
Authorities
Entities
OCR
SW
ip address
url address
SMB
torrents
peers
torrent name
category
source
hashes of files
AGENDA
01 THE NEED OF A DATA ARCHITECTURE
02 SIMPLE ARCHITECTURE OVERVIEW
03 THE BASIC SURVIVAL KIT
04 MESSAGE QUEUE
05 STREAM PROCESSING
06 BATCH PROCESSING
07 DATABASES
08 BONUS ROUND: MANAGEMENT
09 ARCHITECTURE REVISITED
10 CLOUD-BASED ARCHITECTURES
THE NEED OF A DATA ARCHITECTURE
Rules before building a data architecture
Think about what you need to do with the data
There are no more rules
Typical list of needs
Gather a lot of data coming from different places
Process that data in (close to) real-time
Make data available in multiple formats
Provide ways to easily process that data
SIMPLE ARCHITECTURE OVERVIEW
SENSOR
SENSOR
SENSOR
DATA SINK
MESSAGE QUEUE
STREAM PROCESSING
FILE STORAGE
BATCH PROCESSING
DATABASES
APIs
PORTALS
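The boxes on this slide can be wired together in a toy, in-memory form to show how data might flow between them. This is only a sketch: the slide does not spell out the arrows, so the wiring below (sink feeds the queue, the stream step archives raw events and updates the database, the batch step recomputes aggregates from storage) is an assumption, and all names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical stand-ins for the boxes on the slide; the wiring is assumed.
message_queue = deque()   # MESSAGE QUEUE
file_storage = []         # FILE STORAGE (raw events, kept for batch jobs)
database = {}             # DATABASES (results served to APIs and portals)


def data_sink(event):
    """DATA SINK: single entry point that accepts events from any sensor."""
    message_queue.append(event)


def stream_processing():
    """STREAM PROCESSING: handle queued events one by one."""
    while message_queue:
        event = message_queue.popleft()
        file_storage.append(event)  # archive the raw event
        database[("latest", event["sensor"])] = event["value"]


def batch_processing():
    """BATCH PROCESSING: recompute aggregates over everything stored so far."""
    totals = defaultdict(int)
    for event in file_storage:
        totals[event["sensor"]] += event["value"]
    for sensor, total in totals.items():
        database[("total", sensor)] = total


for sensor, value in [("portscan", 3), ("dns", 5), ("portscan", 4)]:
    data_sink({"sensor": sensor, "value": value})

stream_processing()
batch_processing()
print(database)
```

The point of the split is that the stream step gives a fast, per-event view, while the batch step can always be re-run over the archived raw data.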
THE BASIC SURVIVAL KIT
Apache Hadoop
MapReduce
HDFS
YARN
Why Apache Hadoop?
Interoperability with many other tools
Great community
Gets the job done
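For readers new to the MapReduce part of Hadoop, the programming model can be sketched in plain Python. This is not Hadoop code and runs on a single machine; it only shows the map, shuffle and reduce phases, using a made-up list of banner strings as input.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (key, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


banners = ["http https http", "ssh http"]  # hypothetical input records
counts = map_reduce(banners)
print(counts)
```

In Hadoop the map and reduce calls run in parallel on different nodes, and the shuffle moves data between them over the network.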
THE BASIC SURVIVAL KIT
Points of attention
YARN
Available resources per node for processing
Timeouts
Heap, heap...
HDFS
Same as above
Primary/secondary nodes - high availability
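The "available resources per node" and "heap" points can be made concrete with a little arithmetic. In YARN, the memory and cores a node offers are capped by `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`. The sketch below estimates how many equally sized containers fit on one node; the node sizes are hypothetical, and the 0.8 heap fraction is a common rule of thumb, not a YARN default.

```python
def containers_per_node(node_memory_mb, node_vcores,
                        container_memory_mb, container_vcores):
    """How many equally sized containers fit on one node (memory or CPU bound)."""
    by_memory = node_memory_mb // container_memory_mb
    by_cpu = node_vcores // container_vcores
    return min(by_memory, by_cpu)


def suggested_heap_mb(container_memory_mb, heap_fraction=0.8):
    """Leave headroom for non-heap JVM memory; 0.8 is an assumed rule of thumb."""
    return int(container_memory_mb * heap_fraction)


# Hypothetical node that hands 48 GB of RAM and 12 vcores to YARN.
n = containers_per_node(node_memory_mb=48 * 1024, node_vcores=12,
                        container_memory_mb=4096, container_vcores=1)
print(n, suggested_heap_mb(4096))
```

Setting the JVM heap equal to the full container size is a typical way to get containers killed for exceeding their memory limit, which is presumably what "Heap, heap..." refers to.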
MESSAGE QUEUE
Apache Kafka
Originally developed by LinkedIn
Massively scalable publish/subscribe message queue
High throughput
Low latency
Concepts
Topics
Consumers
Consumer groups
Partitions
Replicas
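How topics, partitions and consumer groups relate can be sketched without a Kafka client. The code below is a simplified model, not Kafka's implementation: it uses CRC32 to pick a partition (Kafka's default partitioner uses a different hash) and a plain round-robin assignment, and the consumer names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Pick a partition from the message key, so equal keys stay together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


topic_partitions = 6
group = ["consumer-a", "consumer-b", "consumer-c", "consumer-d"]
assignment = assign_partitions(topic_partitions, group)
print(assignment)
print(partition_for("192.0.2.10", topic_partitions))
```

Within one consumer group each partition is read by exactly one consumer, so the partition count is the upper bound on parallelism for that group.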
MESSAGE QUEUE
Points of attention
Timeouts
Message sizes
Log retention vs. cleanup interval!!!!
Also, do not, for the love of god, simply delete all the subdirectories in your "kafka-logs" directory; you will cry.
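One way to read the retention vs. cleanup interval warning: expired data is only removed when the broker next checks for it, so it can outlive the configured retention. In Kafka these two knobs are broker settings such as `log.retention.hours` and `log.retention.check.interval.ms`. The sketch below is a simplified model that assumes checks happen at fixed multiples of the interval and ignores segment rolling; the numbers are hypothetical.

```python
def deletion_time(last_write, retention, check_interval):
    """Return the first cleanup tick at which an expired segment can be removed.

    All values share one unit (minutes here). Cleanup ticks are assumed to
    happen at multiples of check_interval.
    """
    expires_at = last_write + retention
    ticks_needed = -(-expires_at // check_interval)  # ceiling division
    return ticks_needed * check_interval


# Hypothetical numbers: 60 min retention, cleanup check every 45 min.
removed_at = deletion_time(last_write=10, retention=60, check_interval=45)
print(removed_at)
```

With these numbers the data is eligible for deletion at minute 70 but is only removed at minute 90, so disk sizing should account for the check interval as well as the retention period.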
STREAM PROCESSING
Apache Spark vs. Apache Storm vs. Apache Flink
STREAM PROCESSING
Apache Spark
The good parts
Very simple programming model and APIs
Multilanguage support
DataFrame API
ML Libraries
Wide community
Wide range of addons
Points of attention
Mini-batch processing, not real stream
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
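The difference between mini-batch processing and real stream processing can be shown with a toy example. This is plain Python, not Spark or Storm code; the event timestamps and the 5-unit batch interval are made up for illustration.

```python
def per_record(events, handle):
    """True streaming: every event is handled as soon as it arrives."""
    return [handle([event]) for event in events]


def mini_batch(events, handle, batch_interval):
    """Mini-batching: events are buffered per time window, then handled together."""
    batches = {}
    for event in events:
        window = event["t"] // batch_interval  # which window the event falls in
        batches.setdefault(window, []).append(event)
    return [handle(batch) for _, batch in sorted(batches.items())]


def count(batch):
    """Example handler: number of events it was given in one call."""
    return len(batch)


events = [{"t": t} for t in (0, 1, 2, 5, 6, 11)]
streamed = per_record(events, count)
batched = mini_batch(events, count, batch_interval=5)
print(streamed)
print(batched)
```

In the mini-batch case an event arriving at the start of a window waits for the window to close before it is processed, which is where the extra latency comes from.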
STREAM PROCESSING
Apache Storm
The good parts
Stream processing
Multilanguage support
Works without much configuration effort
Low resource configuration
Wide community
Lots of connectors and addons
Great performance, like, "The Flash" great
Points of attention
Slightly more complex programming model
Some support for other languages
STREAM PROCESSING
Apache Flink
The good parts
Stream processing
Multilanguage support
Simple API (very similar to Spark)
Dataset API
ML Libraries
Good handling of resources
Low configuration/optimisation overhead
Buuuuut.....
Does not have a wide community
Does not have that many connectors and addons
BATCH PROCESSING
The good parts
Apache Spark
Multilanguage support
Simple API
DataFrame API
ML Libraries
Wide community
Wide range of addons
Apache Flink
Multilanguage support
Simple API (very similar)
DataSet API
ML Libraries
BATCH PROCESSING
Points of attention
Apache Spark
Heavy resource footprint
Prone to timeouts or memory errors
Hard to fine-tune to get the right performance
Apache Flink
Fewer configuration problems
Better handling of resources
Not a big community
Not many addons
DATABASES
Before committing to a database
01 Think about how you need to access the data
02 Read 1 again
03 Seriously, read 1 again
Select a database, based on your needs, i.e.:
Hardcore read/write workload and not much advanced querying: HBase
Heavy read/write workload and minimally dynamic querying: Cassandra
Advanced text querying and not such a heavy read/write workload: something else
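The three examples above amount to a small decision table keyed on access pattern. The sketch below only restates the slide's examples as code; the function name and the category labels are hypothetical, and anything outside the three listed cases falls back to rule 01.

```python
def suggest_database(workload, querying):
    """Map the slide's three examples to a suggestion.

    workload: "hardcore", "heavy" or "moderate" read/write load
    querying: "basic", "minimally dynamic" or "advanced text"
    """
    if workload == "hardcore" and querying == "basic":
        return "HBase"
    if workload == "heavy" and querying == "minimally dynamic":
        return "Cassandra"
    if workload == "moderate" and querying == "advanced text":
        return "something else"
    # Not covered by the slide: go back to rule 01.
    return "think about how you need to access the data"


print(suggest_database("hardcore", "basic"))
print(suggest_database("heavy", "minimally dynamic"))
```

Both inputs describe how the data will be accessed, not what the data is, which is the point of reading rule 01 three times.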
BONUS ROUND: MANAGEMENT
Apache Ambari
Provision a Hadoop Cluster
Manage a Hadoop Cluster
Monitor a Hadoop Cluster
Ambari uses Hadoop ecosystem distributions such as:
Hortonworks
Cloudera
ARCHITECTURE REVISITED
SENSOR
SENSOR
SENSOR
DATA SINK
APACHE KAFKA
APACHE STORM
APACHE HDFS
APACHE SPARK
APACHE HBASE/CASSANDRA
APIs
PORTALS
CLOUD BASED ARCHITECTURES
Pros
Less configuration overhead
Less maintenance overhead
Easily scalable
Reliable
Return focus back to data and product
Cons
$$$$$$$$$$
CLOUD BASED ARCHITECTURES
SENSOR
SENSOR
SENSOR
DATA SINK
GOOGLE PUBSUB
GOOGLE DATAFLOW
GOOGLE CLOUD STORAGE
GOOGLE DATAPROC
GOOGLE BIGTABLE/BIGQUERY
APIs
PORTALS
CLOUD BASED ARCHITECTURES
SENSOR
SENSOR
SENSOR
DATA SINK
AMAZON SIMPLE QUEUE SERVICE
AMAZON DATA PIPELINE
AMAZON S3
AMAZON ELASTIC MAPREDUCE
AMAZON DYNAMODB/REDSHIFT
APIs
PORTALS
BE READY. BE SAFE. BE SECURE.
BinaryEdge AG
Freigutstrasse 40, 8001 Zurich
Switzerland
+41 78 713 40 00