big data on the cloud

BIG DATA ON THE CLOUD

@ugurarpaci

@SercanKaraoglu

CONTENTS

3V Model

Development & Operational Challenges

Distributed Processing

Hadoop & Spark

AWS Spot Instance Management

Use Case: Apache Zeppelin, Spark

WHO WE ARE

Financial Data Provider

Merging Different Markets

Applications on Different Platforms (Web, Mobile, Desktop, APIs)

Software Development Team: ~50 People, 130 Total

Financial Data Application Management

3V MODEL

90% of the data in the world today has been created over the last two years alone

VOLUME: HIGH

VELOCITY: HIGH (data is generated at high speed)

VARIETY: HIGH (data arrives in any shape or format)

METADATA, EVENTS, AND ACTIONS ARE BIG DATA

What you see is not the whole picture!

The tweet that an end user sees looks roughly like this:

{
  text: "This is a 140 chars",
  created_at: date(),
  favourited: boolean
}

OPERATIONAL CHALLENGES

HIDDEN VALUES IN DATA

Automated Decisions

Forecast

Pattern

Data

DISTRIBUTED PROCESSING

Location Transparency

Redundancy

Logical Grouping

Decoupling Storage From Processing

HADOOP - DISTRIBUTED PROCESSING

Hadoop Common: The common utilities that support the other Hadoop modules

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

Hadoop YARN: A framework for job scheduling and cluster resource management

Hadoop MapReduce: A YARN-based system for parallel processing of large datasets

DISTRIBUTED PROCESSING

MAP REDUCE
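
The MapReduce model named on this slide can be sketched with plain Scala collections. This is only an illustration of the three phases (map, shuffle, reduce) on a word count; it does not use Hadoop, and the object and function names (`MapReduceSketch`, `mapPhase`, `shuffle`, `reducePhase`) as well as the sample input are assumptions made for the example.

```scala
// Word count expressed as the three MapReduce phases, using plain Scala
// collections instead of a Hadoop cluster (illustration only).
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in one input line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: bring all values emitted for the same key together.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2)) }

  // Reduce phase: fold the values of one key into a single result.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Full job: map every line, shuffle by key, reduce every key.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (w, c) => reducePhase(w, c) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("buy sell buy", "sell hold") // hypothetical sample input
    println(wordCount(lines))
  }
}
```

On a real cluster the map and reduce phases run in parallel on different nodes and the shuffle moves data across the network; here all three run in one process so the data flow is easy to follow.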

SPARK STACK

SPARK VS HADOOP - PERFORMANCE

SPARK VS HADOOP – DEVELOPER PRODUCTIVITY

RDD - SPARK

Resilient Distributed Dataset

Transformations: map, filter, distinct, union, sample, groupByKey, join, etc.

Actions: collect, count, first, take, reduce, foreach, etc.
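
The split between transformations and actions matters because transformations are lazy: they only describe a computation, and nothing runs until an action asks for a result. The sketch below is an analogy rather than Spark code. It uses a plain Scala collection view in place of an RDD so that it runs without a cluster, and the names `LazyPipelineSketch`, `run`, and the sample prices are assumptions made for the example.

```scala
// Analogy only: a Scala collection view is lazy in the same spirit as an RDD.
// Transformations describe work; nothing runs until an action asks for a result.
object LazyPipelineSketch {
  // Returns (result, elements evaluated before the action,
  // elements evaluated after the action).
  def run(prices: Seq[Int]): (Int, Int, Int) = {
    var evaluated = 0

    // "Transformations": map and filter on a view only record the pipeline.
    val pipeline = prices.view
      .map { p => evaluated += 1; p * 2 }
      .filter(_ > 10)

    val before = evaluated // still 0: nothing has been computed yet

    // "Action": sum forces the pipeline to run over every element.
    val total = pipeline.sum
    (total, before, evaluated)
  }

  def main(args: Array[String]): Unit = {
    val (total, before, after) = run(Seq(3, 6, 9)) // hypothetical sample data
    println(s"total=$total, evaluated before action=$before, after action=$after")
  }
}
```

An RDD goes further than a view: it is partitioned across the cluster and remembers its lineage, so a lost partition can be recomputed. The laziness shown here is the part the two have in common.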

RESOURCE MANAGEMENT ON THE CLOUD

Resource Requirement

Orchestrated Cluster Management

Accessibility

CLOUD STORAGE (AMAZON S3)

Separate compute and storage

Resize and shut down Spark instances (EMR, EC2) with no data loss

Point multiple Spark Clusters at the same data in S3

Easily evolve your analytic infrastructure as technology evolves

SPOT INSTANCE PROVISIONING PROCESS

Provisioning

Spinning-up

Service Discovery

Service Registry

Data Persistence

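import org.apache.spark.{SparkConf, SparkContext}

// Name the application and point it at the cluster master
val conf = new SparkConf().setAppName("Trading Statistics").setMaster("spark://foreks.sparkcluster.com:18080")
val sc = new SparkContext(conf)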

[Diagram: three parallel flows, each one: tasks → read a block from HDFS or S3 → process and cache data → results]

USE CASE: SPARK + ZEPPELIN + S3

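val logFile = sc.textFile("s3://../../2016/*/*/*.log.gz")

// Keep trade lines (prefix "t;"), parse them, and group the trades by security name
val tradesBySecurity = logFile.filter(line => line.startsWith("t;")).map(toTradeObject).groupBy(_.getSecurityName)

// count() is an action that returns a Long (the number of distinct securities), so it has no show()
println(tradesBySecurity.count())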

Sercan KARAOGLU

USE CASE: SPARK + ZEPPELIN + S3

Data engineers write the necessary queries for the Marketing Department

The Marketing Department can easily view and evaluate the analytics charts and statistics displayed in Zeppelin

Access logs are uploaded to S3

The Spark cluster pulls the access logs from s3://../../2016/*/*.log.gz

USE CASE: SPARK + ZEPPELIN + S3

THANK YOU
