big data on the cloud

BIG DATA ON THE CLOUD

@ugurarpaci

@SercanKaraoglu

CONTENTS

3V Model

Development & Operational Challenges

Distributed Processing

Hadoop & Spark

AWS Spot Instance Management

Use Case: Apache Zeppelin, Spark

WHO WE ARE

Financial Data Provider

Merging Different Markets

Applications on Different Platforms (Web, Mobile, Desktop, APIs)

Software Development Team ~50 People, 130 Total

Financial Data Application Management

3V MODEL

90% of the data in the world today has been created over the last two years alone

VOLUME: HIGH

VELOCITY: HIGH (high data generation speed)

VARIETY: HIGH (data arrives in any shape or format)

METADATA, EVENTS, AND ACTIONS ARE BIG DATA

What you see is not the whole picture!

The tweet an end user sees looks roughly like this:

{text: "This is a 140 chars", created_at: date(), favourited: boolean, ...}

OPERATIONAL CHALLENGES

HIDDEN VALUES IN DATA

Automated Decisions

Forecast

Pattern

Data

DISTRIBUTED PROCESSING

Location Transparency

Redundancy

Logical Grouping

Decoupling Storage From Processing

HADOOP - DISTRIBUTED PROCESSING

Hadoop Common: The common utilities that support the other Hadoop modules

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

Hadoop YARN: A framework for job scheduling and cluster resource management

Hadoop MapReduce: A YARN-based system for parallel processing of large datasets

DISTRIBUTED PROCESSING

MAP REDUCE
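The map and reduce phases named in this slide title can be sketched with a word count. This is a minimal illustration on plain Scala collections, assuming nothing about the deck's actual jobs: the object name `MapReduceSketch`, its helper functions, and the sample input are all hypothetical, and nothing here runs on a cluster.

```scala
// Word count expressed as explicit map, shuffle and reduce phases.
// Plain Scala collections stand in for a cluster; nothing here is distributed.
object MapReduceSketch {

  // Map phase: emit a (word, 1) pair for every word in one input line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: bring all values emitted for the same key together.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2)) }

  // Reduce phase: collapse the values of one key into a single result.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Full pipeline: map every line, shuffle by key, reduce each key.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val mapped = lines.flatMap(mapPhase)
    val shuffled = shufflePhase(mapped)
    shuffled.map { case (word, counts) => reducePhase(word, counts) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input.
    val lines = Seq("big data on the cloud", "big data on Spark")
    wordCount(lines).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a real Hadoop job the map and reduce functions run on different nodes and the framework performs the shuffle between them; here all three steps run in one process so the data flow is easy to follow.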

SPARK STACK

SPARK VS HADOOP - PERFORMANCE

SPARK VS HADOOP – DEVELOPER PRODUCTIVITY

RDD - SPARK

Resilient Distributed Dataset

Transformations: map, filter, distinct, union, sample, groupByKey, join, reduceByKey, etc.

Actions: collect, count, first, take, reduce, foreach, etc.
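The difference between the two lists can be sketched as follows: transformations only describe a new RDD, while actions trigger the computation and return a value to the driver. This is an illustrative sketch, not the deck's code. It assumes Spark is on the classpath and runs in `local[*]` mode; the object name `RddSketch`, the helper `highPricedSymbols`, and the sample prices are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSketch {

  // Transformations only: each call returns a new RDD and records what to compute.
  // No job runs until an action is called on the result.
  def highPricedSymbols(prices: RDD[(String, Double)], threshold: Double): RDD[String] =
    prices
      .distinct()
      .filter { case (_, price) => price > threshold }
      .map { case (symbol, _) => symbol }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; an assumption for the sketch, not the deck's cluster.
    val conf = new SparkConf().setAppName("RDD sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data: (symbol, price) pairs with one duplicate row.
      val prices = sc.parallelize(Seq(("AAA", 10.0), ("BBB", 20.0), ("AAA", 12.0), ("AAA", 10.0)))
      val symbols = highPricedSymbols(prices, 10.0)

      // Actions: each one triggers a job and brings a result back to the driver.
      println(s"matching rows: ${symbols.count()}")
      println(s"symbols: ${symbols.collect().sorted.mkString(", ")}")
      println(s"first two input rows: ${prices.take(2).mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```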

RESOURCE MANAGEMENT ON THE CLOUD

Resource Requirement

Orchestrated Cluster Management

Accessibility

CLOUD STORAGE (AMAZON S3)

Separate compute and storage

Resize and shut down Spark instances (EMR, EC2) with no data loss

Point multiple Spark Clusters at the same data in S3

Easily evolve your analytic infrastructure as technology evolves

SPOT INSTANCE PROVISIONING PROCESS

Provisioning

Spinning-up

Service Discovery

Service Registry

Data Persistence

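import org.apache.spark.{SparkConf, SparkContext}

// Connect the application to the Spark master
val conf = new SparkConf()
  .setAppName("Trading Statistics")
  .setMaster("spark://foreks.sparkcluster.com:18080")
val sc = new SparkContext(conf)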

Read HDFS & S3

Process & Cache Data

Results
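The read, process and cache, and results steps can be sketched as one small pipeline. This is a hedged illustration rather than the presenters' code: it assumes Spark on the classpath in `local[*]` mode, uses a temporary local file in place of an `hdfs://` or `s3://` path, and the object name `CachePipelineSketch`, the helper `loadAndCache`, and the sample line layout are hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CachePipelineSketch {

  // Read + process: load the text, drop blank lines, and mark the result for caching.
  // cache() is lazy; the data is materialized by the first action that uses it.
  def loadAndCache(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path).filter(_.trim.nonEmpty).cache()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Cache sketch").setMaster("local[*]"))
    try {
      // A temporary local file stands in for an hdfs:// or s3:// location (assumption).
      val input = Files.createTempFile("sample", ".log")
      Files.write(input, "t;AAA;10.0\n\nq;BBB;20.0\nt;BBB;21.5\n".getBytes("UTF-8"))

      val lines = loadAndCache(sc, input.toString)

      // Results: the first action reads the file and fills the cache,
      // the second one is served from the cached partitions.
      val total = lines.count()
      val tradeLines = lines.filter(_.startsWith("t;")).count()
      println(s"$total non-empty lines, $tradeLines trade lines")
    } finally {
      sc.stop()
    }
  }
}
```

Caching pays off when several results are derived from the same processed dataset, since the source is read and filtered only once.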

USE CASE: SPARK + ZEPPELIN + S3

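// toTradeObject is assumed to be defined elsewhere in the notebook
val logFile = sc.textFile("s3://../../2016/*/*/*.log.gz")

val tradesBySecurity = logFile.filter(line => line.startsWith("t;")).map(toTradeObject).groupBy(_.getSecurityName)

// Print the number of trades per security
tradesBySecurity.mapValues(_.size).collect().foreach(println)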
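The snippet above calls `toTradeObject` and `getSecurityName` without showing them. A possible shape for both is sketched below, purely as an assumption: the `Trade` fields, the `TradeParser` object, and the `t;<security>;<price>;<quantity>` line layout are hypothetical and not taken from the deck.

```scala
// Hypothetical record type; the deck does not show the real trade class.
final case class Trade(securityName: String, price: Double, quantity: Long) {
  // Accessor named to match the getSecurityName call in the slide's snippet.
  def getSecurityName: String = securityName
}

object TradeParser {

  // Assumed line layout: t;<security>;<price>;<quantity>
  def toTradeObject(line: String): Trade = {
    val fields = line.split(";")
    require(fields.length == 4 && fields(0) == "t", s"not a trade line: $line")
    Trade(fields(1), fields(2).toDouble, fields(3).toLong)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample line in the assumed layout.
    val trade = toTradeObject("t;AAA;10.5;100")
    println(s"${trade.getSecurityName} ${trade.price} ${trade.quantity}")
  }
}
```

Because the parser is a plain function with no Spark dependency, it can be passed straight to `map` on an RDD of lines and unit-tested without a cluster.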

USE CASE: SPARK + ZEPPELIN + S3

Data Engineers write the necessary queries for the Marketing Department

The Marketing Department views and evaluates the resulting charts and statistics in Zeppelin

Access logs uploaded to S3

Spark Cluster pulls access logs from s3://../../2016/*/*.log.gz

USE CASE: SPARK + ZEPPELIN + S3

THANK YOU
