big data on the cloud
BIG DATA ON THE CLOUD
@ugurarpaci
@SercanKaraoglu
CONTENTS
3V Model
Development & Operational Challenges
Distributed Processing
Hadoop & Spark
AWS Spot Instance Management
Use Case: Apache Zeppelin, Spark
WHO WE ARE
Financial Data Provider
Merging Different Markets
Applications on Different Platforms (Web, Mobile, Desktop, APIs)
Software Development Team ~50 People, 130 Total
Financial Data Application Management
3V MODEL
90% of the data in the world today has been created over the last two years alone
VOLUME: High
VELOCITY: High; data is generated at high speed
VARIETY: High; data comes in any shape or format
METADATA, EVENTS, ACTIONS ARE BIG DATA
What you see is not the whole picture!
A tweet as shown to the end user looks roughly like this:
{
  text: "This is a 140 chars",
  created_at: date(),
  favourited: boolean
}
OPERATIONAL CHALLENGES
HIDDEN VALUES IN DATA
Automated Decisions
Forecast
Pattern
Data
DISTRIBUTED PROCESSING
Location Transparency
Redundancy
Logical Grouping
Decoupling Storage From Processing
HADOOP - DISTRIBUTED PROCESSING
Hadoop Common: The common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
DISTRIBUTED PROCESSING
MAP REDUCE
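The MapReduce model named on this slide can be illustrated with a small word count. This is a minimal sketch using plain Scala collections on one machine, not Hadoop's MapReduce API; the object name `WordCountSketch` and the phase functions are hypothetical names chosen for illustration.

```scala
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: group the emitted pairs by key (the word).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the counts collected for each word.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, counts) => (word, counts.sum) }

  // Full pipeline: map every line, shuffle by word, reduce per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    val lines = Seq("big data on the cloud", "big data on spark")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, c) => println(s"$w\t$c") }
  }
}
```

In a real cluster the map and reduce phases run on different nodes and the shuffle moves data across the network; here all three are ordinary function calls so the data flow is easy to follow.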
SPARK STACK
SPARK VS HADOOP - PERFORMANCE
SPARK VS HADOOP – DEVELOPER PRODUCTIVITY
RDD - SPARK
Resilient Distributed Dataset
Transformations: map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc.
Actions: collect, count, first, take, foreach, reduce, etc.
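The difference between transformations and actions can be sketched as follows. This assumes Spark (spark-core) is on the classpath and uses a `local[*]` master so that it runs in a single JVM; the object name `RddSketch`, the `run` helper, and the sample numbers are hypothetical and not from the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Returns (count of even squares, the even squares sorted, sum of all inputs).
  def run(sc: SparkContext): (Long, List[Int], Int) = {
    val numbers = sc.parallelize(Seq(1, 2, 2, 3, 4, 4, 5))

    // Transformations: each returns a new RDD and is only recorded, not executed.
    val evenSquares = numbers.distinct().filter(_ % 2 == 0).map(n => n * n)

    // Actions: trigger the computation and return plain values to the driver.
    val howMany = evenSquares.count()
    val values = evenSquares.collect().sorted.toList
    val total = numbers.reduce(_ + _)
    (howMany, values, total)
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for this sketch; a cluster would use its own master URL.
    val conf = new SparkConf().setAppName("RDD Sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Nothing is computed until `count`, `collect`, or `reduce` is called, which is why `reduce` belongs with the actions rather than the transformations.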
RESOURCE MANAGEMENT ON THE CLOUD
Resource Requirement
Orchestrated Cluster Management
Accessibility
CLOUD STORAGE (AMAZON S3)
Separate compute and storage
Resize and shut down Spark instances (EMR, EC2) with no data loss
Point multiple Spark Clusters at the same data in S3
Easily evolve your analytic infrastructure as technology evolves
SPOT INSTANCE PROVISIONING PROCESS
Provisioning
Spinning-up
Service Discovery
Service Registry
Data Persistence
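import org.apache.spark.{SparkConf, SparkContext}
// Configure the application name and the address of the Spark master
val conf = new SparkConf().setAppName("Trading Statistics").setMaster("spark://foreks.sparkcluster.com:18080")
val sc = new SparkContext(conf)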
Figure: tasks are dispatched to three workers in parallel; each worker reads a block from HDFS or S3, processes and caches the data, and returns results.
USE CASE: SPARK + ZEPPELIN + S3
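val logFile = sc.textFile("s3://../../2016/*/*/*.log.gz")
val tradesBySecurity = logFile
  .filter(line => line.startsWith("t;"))  // keep trade lines only
  .map(toTradeObject)
  .groupBy(_.getSecurityName)
println(tradesBySecurity.count())  // count() returns a Long: the number of distinct securities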
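To make the shape of this pipeline concrete, here is a minimal sketch that runs the same filter, map, and groupBy steps on a few in-memory lines using plain Scala collections instead of a Spark cluster. The slides do not show the log format or the class behind `toTradeObject`, so the `Trade` case class, the `t;<security>;<price>;<quantity>` layout, and the sample lines (including the names `ABC` and `XYZ`) are all assumptions made for illustration.

```scala
object TradeLogSketch {
  // Hypothetical trade record; the real class behind toTradeObject is not shown in the slides.
  final case class Trade(securityName: String, price: BigDecimal, quantity: Long) {
    def getSecurityName: String = securityName
  }

  // Assumed line layout: t;<security>;<price>;<quantity>
  def toTradeObject(line: String): Trade = {
    val fields = line.split(";")
    Trade(fields(1), BigDecimal(fields(2)), fields(3).toLong)
  }

  // Same shape as the Spark pipeline: filter, map, groupBy, then count per group.
  def tradesPerSecurity(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(line => line.startsWith("t;"))
      .map(toTradeObject)
      .groupBy(_.getSecurityName)
      .map { case (name, trades) => (name, trades.size) }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines; the "q;" line stands in for any non-trade record.
    val sample = Seq("t;ABC;10.50;100", "q;ABC;10.45;10.55", "t;XYZ;7.25;40", "t;ABC;10.55;20")
    tradesPerSecurity(sample).toSeq.sortBy(_._1).foreach { case (name, n) => println(s"$name: $n") }
  }
}
```

Counting trades within each group, as done here, gives a per-security statistic, whereas calling `count()` on the grouped collection itself only gives the number of groups.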
USE CASE: SPARK + ZEPPELIN + S3
Data engineers write the queries the Marketing Department needs
The Marketing Department views and evaluates the resulting charts and statistics smoothly in Zeppelin
Access logs are uploaded to S3
The Spark cluster pulls the access logs from s3://../../2016/*/*.log.gz
USE CASE: SPARK + ZEPPELIN + S3
THANK YOU