Quick Dive into the Big Data Pool Without Drowning - Demi Ben-Ari @ Panorays
Posted on 16-Apr-2017
Quick dive into the Big Data pool without drowning
Demi Ben-Ari - VP R&D @ Panorays
About Me
Demi Ben-Ari, Co-Founder & VP R&D @ Panorays
● B.Sc. Computer Science – Academic College Tel-Aviv Yaffo
● Co-Founder of the “Big Things” Big Data Community
In the Past:
● Sr. Data Engineer - Windward
● Team Leader & Sr. Java Software Engineer, Missile Defense and Alert System - “Ofek” – IAF
Interested in almost every kind of technology – A True Geek
Agenda
● Basic Concepts
● Introduction to Big Data frameworks
● Distributed Systems => Problems
● Monitoring
● Conclusions
Say “Distributed”, Say “Big Data”, Say….
Some basic concepts
What is Big Data (IMHO)?
● Systems involving the “3 Vs” - the right questions we want to ask:
○ Volume - How much?
○ Velocity - How fast?
○ Variety - What kind? (Difference)
What is Big Data (IMHO)
● Some define it by the “7 Vs”
○ Variability (constantly changing)
○ Veracity (accuracy)
○ Visualization
○ Value
What is Big Data (IMHO)
● Characteristics
○ Multi-region availability
○ Very fast and reliable response
○ No single point of failure
Why Not Relational Data
● The Relational Model Provides
○ Normalized table schema
○ Cross-table joins
○ ACID compliance (Atomicity, Consistency, Isolation, Durability)
● But at a very high cost
○ Big Data table joins - billions of rows - massive overhead
○ Sharding tables across systems is complex and fragile
● Modern applications have different priorities
○ Needs for speed and availability come before consistency
○ Racks of commodity servers trump massive high-end systems
○ The real-world need for transactional guarantees is limited
What strategies help manage Big Data?
● Distribute data across nodes
○ Replication
● Relax consistency requirements
● Relax schema requirements
● Optimize data to suit actual needs
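Relaxing consistency is usually a tunable trade-off rather than all-or-nothing. A minimal sketch in plain Python (illustrative only, not any specific database's implementation) of the Dynamo-style quorum rule that several of these systems expose:

```python
def quorum_ok(n_replicas, write_quorum, read_quorum):
    """A read is guaranteed to see the latest write when R + W > N:
    every read quorum then overlaps every write quorum in at least
    one replica."""
    return read_quorum + write_quorum > n_replicas

# Classic Dynamo-style tuning with N=3 replicas:
print(quorum_ok(3, 2, 2))  # True  - quorum reads and writes, consistent
print(quorum_ok(3, 1, 1))  # False - fast, but only eventually consistent
```

Dialing W and R per request is exactly how the “relaxed” systems let you pick consistency or latency per use case.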
What is the NoSQL landscape?
● 4 broad classes of non-relational databases (DB-Engines)
○ Graph: data elements each relate to N others in a graph / network
○ Key-Value: keys map to arbitrary values of any data type
○ Document: document sets (JSON) queryable in whole or part
○ Wide Column Store (Column Family): keys mapped to sets of n typed columns
● Three key factors to help understand the subject
○ Consistency: Get identical results, regardless of which node is queried?
○ Availability: Respond to very high read and write volumes?
○ Partition tolerance: Still available when part of the system is down?
What is the CAP theorem?
● In distributed systems, consistency, availability and partition tolerance exist in a mutually dependent relationship - pick any two.
The classic CAP triangle places the databases:
● CA (Consistency + Availability): MySQL, PostgreSQL, Greenplum, Vertica (RDBMS); Neo4J (Graph)
● AP (Availability + Partition tolerance): Cassandra, DynamoDB, Riak, CouchDB, Voldemort (Key-Value / Wide Column)
● CP (Consistency + Partition tolerance): HBase, MongoDB, Redis, BigTable, BerkeleyDB
DB Engines - Comparison
● http://db-engines.com/en/ranking
What does DevOps really mean?
● Development: Software Engineering, UX
● Operations: System Admin, Database Admin
What does DevOps really mean?
DevOps: Cross-functional teams
● Operators automating systems
● Developers operating systems
Introduction to Big Data Frameworks
Characteristics of Hadoop
● A system to process very large amounts of unstructured and complex data at the desired speed
● A system that runs on a large number of machines that don’t share any memory or disk
● A system that runs on a cluster of machines which can be put together at relatively low cost and maintained easily
Hadoop Principles
● “A system that moves the computation to where the data is”
● Key Concepts of Hadoop
○ Flexibility ○ Scalability ○ Low cost ○ Fault Tolerance
Hadoop Core Components
● HDFS - Hadoop Distributed File System
○ Provides a distributed data storage system that stores data in smaller blocks in a fail-safe manner
● MapReduce - Programming framework
○ Has the ability to take a query over a dataset, divide it, and run it in parallel on multiple nodes
● YARN - (Yet Another Resource Negotiator) MRv2
○ Splits the MapReduce Job Tracker’s responsibilities into
■ Resource Manager (Global)
■ Application Manager (Per application)
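The MapReduce idea can be sketched without a cluster. A toy word count in plain Python, simulating the map, shuffle and reduce phases that the framework runs in parallel across nodes (illustrative only - real Hadoop jobs run against HDFS blocks):

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit a (word, 1) pair for every word in the input line
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the partial counts for each word
    return (key, sum(values))

lines = ["big data big problems", "small data small problems"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'problems': 2, 'small': 2}
```

Because each map call only sees its own record and each reduce call only sees one key's values, every phase can be spread over many machines - which is the whole point.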
Hadoop Ecosystem
Hadoop Core: HDFS, MapReduce / YARN, Hadoop Common
Hadoop Applications
Hive, Pig, HBase, Oozie, Zookeeper, Sqoop, Spark
Hadoop (+Spark) Distributions
Elastic MapReduce (AWS), Dataproc (GCP)
New Age BI Applications
● Able to understand various types of data
● Ability to clean the data
● Process data with applied rules, locally and in a distributed environment
● Visualize sizeable data with speed
● Extend results by sharing within the enterprise
Big Data Analytics
● Processing large amounts of data without data movement
● Avoid data connectors if possible (run natively)
● Ability to understand a vast amount of data types and data compressions
● Ability to process data on a variety of processing frameworks
● Distributed data processing
○ In-Memory is a big plus
● Super fast visualization
○ In-Memory is a big plus
When to choose Hadoop?
● Large volumes of data to store and process
● Semi-structured or unstructured data
● Data is not well categorized
● Data contains a lot of redundancy
● Data arrives in streams or large batches
● Complex batch jobs arriving in parallel
● You don’t know how the data might be useful
Distributed Systems => Problems
Monolith Structure
(Diagram: one server - OS, CPU, Memory, Disk - running the Java Application Server, Database and Web Server processes, behind a Load Balancer serving Users and Other Applications, with a Monitoring System and UI alongside)
Many times... all of this was on a single physical server!
Distributed Microservices Architecture
(Diagram: Services A, B and C, each with its own DB, Cache and Queue, plus a Web Server and an Analytics Cluster of a Master and several Slaves - and a big question mark over the Monitoring System)
MongoDB + Spark
(Diagram: a Spark Cluster - Master and Workers 1..N - writing to and reading from a Sharded MongoDB, each shard a Replica Set with its own Master)
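The routing idea behind a hashed shard key can be sketched in a few lines of plain Python (illustrative only - this is not MongoDB's actual implementation, and the replica-set names are hypothetical):

```python
import hashlib

def shard_for(key, shards):
    """Route a document to a shard by hashing its shard key, the same
    idea as a hashed shard key in a sharded MongoDB cluster."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return shards[h % len(shards)]

shards = ["rs0", "rs1", "rs2"]  # one replica set per shard (hypothetical names)
print(shard_for("user:42", shards))  # always the same shard for the same key
```

Spark workers can then read from and write to the shards in parallel, because each key deterministically lives in exactly one place.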
Cassandra + Spark
(Diagram: a Spark Cluster - Workers 1..N - writing to and reading from a Cassandra Cluster)
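Cassandra distributes rows with a token ring: each node owns ranges of the hash space, and the partition key's hash picks the owning node. A toy sketch in plain Python (node addresses are hypothetical, and md5 stands in for Cassandra's real partitioner):

```python
import bisect
import hashlib

class TokenRing:
    """Toy version of Cassandra's token ring: each node owns slices of
    the hash space (virtual nodes), and a row's partition-key hash
    decides which node owns it."""

    def __init__(self, nodes, vnodes=32):
        self._ring = sorted(
            (self._token(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes))

    @staticmethod
    def _token(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, partition_key):
        t = self._token(partition_key)
        i = bisect.bisect(self._ring, (t, ""))
        return self._ring[i % len(self._ring)][1]

ring = TokenRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.owner("sensor-17"))  # one of the three nodes, deterministically
```

This locality is what the Spark-Cassandra connector exploits: tasks can be scheduled on the nodes that already hold the token ranges they read.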
Cassandra + Serving
(Diagram: multiple Web Services writing to and reading from a Cassandra Cluster, serving multiple UI Clients)
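A serving-layer web service typically talks to Cassandra through a driver session. A small sketch in plain Python (table and column names are hypothetical) that builds a parameterized CQL statement; with the DataStax Python driver you would then pass it to `session.execute(stmt, params)`:

```python
def build_insert(table, row):
    """Build a parameterized CQL INSERT for the given row dict.
    (The table and column names used below are hypothetical.)"""
    cols = ", ".join(row)
    placeholders = ", ".join(["%s"] * len(row))
    return (f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
            list(row.values()))

stmt, params = build_insert("events", {"id": 1, "payload": "scan"})
print(stmt)  # INSERT INTO events (id, payload) VALUES (%s, %s)
```

Keeping statements parameterized (rather than string-formatted) lets the driver prepare them once and reuse them across the many small reads and writes a serving layer generates.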
Problems
● Multiple physical servers
● Multiple logical services
● Want Scaling => More Servers
● Even if you had all of the metrics
○ You’ll have an overflow of the data
● Your monitoring becomes a “Big Data” problem itself
This is what “Distributed” really means
The DevOps Guy
(It might be you)
Monitoring is Crucial
Monitoring Operating System Metrics
Some help from “the Cloud”
AWS’s CloudWatch / GCP StackDriver
Report to Where?
● We chose: Graphite (InfluxDB) + Grafana
● Can correlate System and Application metrics in one place :)
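Graphite's plaintext protocol makes reporting application metrics trivial: one `path value timestamp` line per metric, sent over TCP (the default plaintext port is 2003). A minimal sketch - the host name below is hypothetical:

```python
import socket
import time

def graphite_lines(metrics, timestamp=None):
    """Format a {name: value} dict in Graphite's plaintext protocol:
    '<path> <value> <unix-timestamp>\\n' per metric."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "".join(f"{name} {value} {ts}\n" for name, value in metrics.items())

def send_to_graphite(metrics, host="graphite.internal", port=2003):
    # Host is hypothetical; 2003 is Graphite's default plaintext port
    payload = graphite_lines(metrics).encode()
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)

print(graphite_lines({"app.requests.count": 42}, timestamp=1700000000))
# app.requests.count 42 1700000000
```

Because system and application metrics land in the same store, one Grafana dashboard can overlay both - which is exactly the correlation the slide is after.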
Monitoring Cassandra
● OpsCenter - by DataStax
Monitoring Spark
Ways to Monitor Spark
● Grafana-spark-dashboards
○ Blog: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
● Spark UI - Online, for each running application
● Spark History Server - Offline (after the application finishes)
● Spark REST API
○ Querying via inner tools to do ad-hoc monitoring
● Back to the basics: dstat, iostat, iotop, jstack
● Blog post by Tzach Zohar - “Tips from the Trenches”
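The Spark REST API returns JSON from endpoints such as `/api/v1/applications` on the UI port (4040 by default). A small sketch of the ad-hoc querying mentioned above, using only the standard library (the host name is hypothetical):

```python
import json
from urllib.request import urlopen

def running_app_ids(api_json):
    """From the JSON of Spark's /api/v1/applications endpoint, return
    the IDs of applications with no completed attempt yet."""
    apps = json.loads(api_json)
    return [app["id"] for app in apps
            if any(not attempt.get("completed", False)
                   for attempt in app.get("attempts", []))]

def fetch_applications(ui_url="http://spark-master:4040"):
    # Hypothetical host; 4040 is the default Spark UI port
    with urlopen(f"{ui_url}/api/v1/applications") as resp:
        return resp.read().decode()

sample = '[{"id": "app-1", "attempts": [{"completed": false}]}]'
print(running_app_ids(sample))  # ['app-1']
```

A cron job around a script like this is often enough to alert on stuck or long-running applications before anyone opens the UI.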
Monitoring Your Data
Data Questions - what should we measure?
● Did all of the computation occur?
○ Are there any data layers missing?
● How much data do we have? (Volume)
● Is all of the data in the Database?
● Data Quality Assurance
Data Answers!
● The method doesn’t really matter, as long as you:
○ Can follow the results over time
○ Know your data flow, and know what might fail
○ Make it easy for anyone to add more monitoring (for the ones that add the new data each time…)
○ Don’t trust others to add monitoring (it will always end up being the DevOps’s “fault” -> no monitoring will be applied)
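Following the results over time can start as simply as a per-batch check that alerts on missing records or stale layers. A minimal sketch (the thresholds and check shape are illustrative, not a prescribed design):

```python
from datetime import datetime, timedelta, timezone

def check_batch(expected_count, actual_count, last_update,
                max_lag=timedelta(hours=1)):
    """Return a list of alert strings for one computed data layer:
    missing records (volume) and stale data (freshness)."""
    alerts = []
    if actual_count < expected_count:
        alerts.append(f"missing records: {expected_count - actual_count}")
    if datetime.now(timezone.utc) - last_update > max_lag:
        alerts.append("data is stale")
    return alerts

now = datetime.now(timezone.utc)
print(check_batch(1000, 990, now))  # ['missing records: 10']
```

Wiring the returned alerts into the same dashboard as the system metrics keeps data-quality failures as visible as server failures.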
Logging? Monitoring?
ELK - Elasticsearch + Logstash + Kibana
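One habit that pays off with ELK: emit logs as one JSON object per line, so Logstash/Filebeat can ship them to Elasticsearch without fragile text parsing. A minimal sketch using Python's standard logging module (the field names chosen here are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line - a shape
    Logstash/Filebeat can forward to Elasticsearch without extra parsing."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("pipeline")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("batch finished")  # emits a single JSON line
```

Structured fields (`level`, `logger`, etc.) become queryable Kibana facets for free, instead of regex captures maintained per log format.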
Monitoring Stack
● Alerting
● Metrics Collection
● Datastore
● Dashboard
● Data Monitoring
● Log Monitoring
Big Data - Are we there yet?
● “3 Vs” - what are the right questions we want to ask?
○ Volume - How much?
■ Can it run on a single machine in reasonable time?
○ Velocity - How fast?
■ Can a single machine handle the throughput?
○ Variety - What kind? (Difference)
■ Is your data not changing and varying?
● If the answer to most of the previous questions is “Yes” - think again whether you want to add the complexity of “Big Data”
Conclusions
● Think carefully before going into the “Big Data pool”
○ See if you really have a problem that you’re trying to solve
○ It’s not a silver bullet
● Take measures to automate and monitor everything
● Having clusters and distributed frameworks will cost a lot - eventually
● Fit your storage layer(s) to your needs
Questions?
Still feel like you’re drowning?
● LinkedIn
● Twitter: @demibenari
● Blog: http://progexc.blogspot.com/
● demi.benari@gmail.com
● “Big Things” Community: Meetup, YouTube, Facebook, Twitter
● GDG Cloud