big data on the cloud

BIG DATA ON THE CLOUD @ugurarpaci @SercanKaraoglu

Upload: sercan-karaoglu

Post on 22-Jan-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Big Data on the Cloud

BIG DATA ON THE CLOUD

@ugurarpaci @SercanKaraoglu

Page 2: Big Data on the Cloud

CONTENTS

3V Model

Development & Operational Challenges

Distributed Processing

Hadoop & Spark

AWS Spot Instance Management

Use Case: Apache Zeppelin, Spark

Page 3: Big Data on the Cloud

WHO WE ARE

Financial Data Provider

Merging Different Markets

Applications on Different Platforms (Web, Mobile, Desktop, APIs)

Software Development Team ~50 People, 130 Total

Financial Data Application Management

Page 4: Big Data on the Cloud

3V MODEL

90% of the data in the world today has been created over the last two years alone

VOLUME: HIGH

VELOCITY: HIGH (high data generation speed)

VARIETY: HIGH (data comes in any shape or format)

Page 5: Big Data on the Cloud

METADATA, EVENTS, ACTIONS ARE BIG DATA

What you see is not the whole picture!

To the end user, an actual tweet looks roughly like this:

{
  text: "This is 140 chars",
  created_at: date(),
  favourited: boolean
}
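The gap between what the user sees and what is actually stored can be sketched with a few Scala case classes. Only text, created_at and favourited come from the slide; every other type and field name below is a hypothetical placeholder for illustration, not Twitter's real schema.

```scala
import java.time.Instant

// What the end user sees (fields taken from the slide).
final case class VisibleTweet(text: String, createdAt: Instant, favourited: Boolean)

// Hypothetical event record: who did what to the tweet, and when.
final case class TweetEvent(kind: String, userId: Long, at: Instant)

// Hypothetical stored form: the visible tweet plus metadata and events.
final case class StoredTweet(
  visible: VisibleTweet,
  metadata: Map[String, String], // e.g. client, language
  events: Seq[TweetEvent]        // e.g. views, favourites
)

object TweetDemo {
  val now: Instant = Instant.parse("2017-01-22T00:00:00Z")

  val stored: StoredTweet = StoredTweet(
    visible = VisibleTweet("This is 140 chars", now, favourited = false),
    metadata = Map("client" -> "web", "lang" -> "en"),
    events = Seq(TweetEvent("view", 1L, now), TweetEvent("favourite", 2L, now))
  )

  // Number of metadata entries and events stored around the one visible record.
  val hiddenItemCount: Int = stored.metadata.size + stored.events.size
}
```

The visible record stays one small object, while the events list grows with every interaction, which is where the volume comes from.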

Page 6: Big Data on the Cloud

OPERATIONAL CHALLENGES

Page 7: Big Data on the Cloud

HIDDEN VALUES IN DATA

Automated Decisions

Forecast

Pattern

Data

Page 8: Big Data on the Cloud

DISTRIBUTED PROCESSING

Location Transparency

Redundancy

Logical Grouping

Decoupling Storage From Processing

Page 9: Big Data on the Cloud

HADOOP - DISTRIBUTED PROCESSING

Hadoop Common: The common utilities that support the other Hadoop modules

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data

Hadoop YARN: A framework for job scheduling and cluster resource management

Hadoop MapReduce: A YARN-based system for parallel processing of large datasets

Page 10: Big Data on the Cloud

DISTRIBUTED PROCESSING

MAP REDUCE
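The map, shuffle and reduce phases can be illustrated with a word count on plain Scala collections. This is a single-machine sketch of the idea only; it does not use the Hadoop MapReduce API, and the object and function names are made up for the example.

```scala
object WordCountSketch {
  // Map phase: turn each input line into (word, 1) pairs.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: bring together all values that share a key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: combine the values of each key into a single result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))
}
```

Calling WordCountSketch.wordCount(Seq("big data on the cloud", "data on the cloud")) counts "big" once and every other word twice. In a real cluster the map and reduce phases run on many machines in parallel and the shuffle moves data across the network.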

Page 11: Big Data on the Cloud

SPARK STACK

Page 12: Big Data on the Cloud

SPARK VS HADOOP - PERFORMANCE

Page 13: Big Data on the Cloud

SPARK VS HADOOP – DEVELOPER PRODUCTIVITY

Page 14: Big Data on the Cloud

RDD - SPARK

Resilient Distributed Dataset

Transformations: map, filter, distinct, union, sample, groupByKey, join, etc.

Actions: collect, count, first, take, reduce, foreach, etc.
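The difference between transformations and actions can be seen in a small sketch. It assumes spark-core is on the classpath and uses a local[*] master so it runs in a single JVM; the object name and the sample numbers are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSketch {
  // Transformations (distinct, filter, map) only describe the computation:
  // each returns a new RDD and nothing is executed yet.
  def evenSquares(sc: SparkContext, input: Seq[Int]): RDD[Int] =
    sc.parallelize(input).distinct().filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for the demo, not the deck's cluster.
    val conf = new SparkConf().setAppName("RDD sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = evenSquares(sc, Seq(1, 2, 2, 3, 4, 4, 5))
      // Actions (count, collect) trigger execution and return values to the driver.
      println(s"count = ${rdd.count()}")
      println(s"values = ${rdd.collect().sorted.mkString(", ")}")
    } finally {
      sc.stop()
    }
  }
}
```

Because transformations are lazy, Spark can plan the whole chain before running it, and the recorded lineage is what lets a lost partition be recomputed.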

Page 15: Big Data on the Cloud

RESOURCE MANAGEMENT ON THE CLOUD

Resource Requirement

Orchestrated Cluster Management

Accessibility

Page 16: Big Data on the Cloud

CLOUD STORAGE (AMAZON S3)

Separate compute and storage

Resize and shut down Spark instances (EMR, EC2) with no data loss

Point multiple Spark Clusters at the same data in S3

Easily evolve your analytic infrastructure as technology evolves

Page 17: Big Data on the Cloud

SPOT INSTANCE PROVISIONING PROCESS

Provisioning

Spinning-up

Service Discovery

Service Registry

Data Persistence

Page 18: Big Data on the Cloud

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Trading Statistics")
  .setMaster("spark://foreks.sparkcluster.com:18080")
val sc = new SparkContext(conf)

[Diagram: tasks are sent to three workers; each one reads a block from HDFS/S3, processes and caches the data, and returns results]

USE CASE: SPARK + ZEPPELIN + S3

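val logFile = sc.textFile("s3://../../2016/*/*/*.log.gz")

val tradesBySecurity = logFile
  .filter(line => line.startsWith("t;"))
  .map(toTradeObject) // toTradeObject: parser not shown on the slide
  .groupBy(_.getSecurityName)

// count() is an RDD action that returns a Long, so print it; show() exists only on DataFrames/Datasets
println(tradesBySecurity.count())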
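If the goal of the query is the number of trades per security, one way to express it on RDDs is with reduceByKey. The sketch below is an assumption-heavy stand-in: the Trade class, the "t;<security>;<price>" line format, the sample lines and the local[*] master are all hypothetical, since the deck shows neither the log layout nor its toTradeObject parser.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical trade record; the deck does not show the real class.
final case class Trade(securityName: String, price: Double)

object TradeStatsSketch {
  // Hypothetical line format "t;<security>;<price>".
  def toTrade(line: String): Trade = {
    val fields = line.split(";")
    Trade(fields(1), fields(2).toDouble)
  }

  def tradesPerSecurity(sc: SparkContext, lines: Seq[String]): Map[String, Long] =
    sc.parallelize(lines)
      .filter(_.startsWith("t;"))          // keep only lines with the "t;" prefix
      .map(toTrade)
      .map(trade => (trade.securityName, 1L))
      .reduceByKey(_ + _)                  // sums the counts per key
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Trading Statistics").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample lines, not real market data.
      val sample = Seq("t;ABC;8.15", "q;ABC;8.14", "t;XYZ;7.90", "t;ABC;8.16")
      tradesPerSecurity(sc, sample).foreach { case (name, n) => println(s"$name: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

reduceByKey is used here instead of groupBy because it adds up counts within each partition before the shuffle, so the full groups never have to be held in memory. Replacing sc.parallelize(lines) with sc.textFile on the S3 path would apply the same pipeline to the real logs.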

Page 19: Big Data on the Cloud

USE CASE: SPARK + ZEPPELIN + S3

Access logs are uploaded to S3

The Spark cluster pulls the access logs from s3://../../2016/*/*.log.gz

Data engineers write the necessary queries for the Marketing Department

The Marketing Department can easily view and evaluate the analytics charts and statistics shown in Zeppelin

Page 20: Big Data on the Cloud

USE CASE: SPARK + ZEPPELIN + S3

Page 21: Big Data on the Cloud

THANK YOU
