The Future of Big Data Tooling

Alexander Aldev

Upload: datasciencesociety

Post on 12-Feb-2017


Category: Data & Analytics



TRANSCRIPT

Page 1: The future of Big Data tooling

The Future of Big Data Tooling

Alexander Aldev

Page 2: The future of Big Data tooling

Alexander Aldev

About me

Translator | between business and IT

CTO and co-founder | MammothDB

17 years | various shades of analytics, DWH, BI

Nerd | making scaled data infrastructure practical

Page 3: The future of Big Data tooling

This talk

How do Big Data tools work today?
How did they evolve?
What is their environment?
Can we predict the future?
… and some examples

Spoiler Alert!

Yes, we can! It already happened.

Page 4: The future of Big Data tooling

HOW MANY Z’S IN THIS SOUP?

photo Ursus Wehrli

Page 5: The future of Big Data tooling

THE BIG DATA TOOL APPROACH

photo Ursus Wehrli


Page 6: The future of Big Data tooling

MAYBE THIS WOULD HELP…

photo Ursus Wehrli

Page 7: The future of Big Data tooling

Just what’s Big Data?

Working Definition
Datasets so large and/or complex that traditional data processing techniques are inadequate to handle them

Examples
Indexing 100 PB of crawled web content
Providing on-line interactive analytics to 10 million clients

Page 8: The future of Big Data tooling

IT’S RELATIVE

photo Anders Rasmusen

Page 9: The future of Big Data tooling

The Big Data toolset?

Today
… for analytics, this is mostly synonymous with Hadoop

Page 10: The future of Big Data tooling

Hadoop architecture

Cluster of Commodity Servers

Distributed File Store (HDFS)

Resource Management (YARN)

Distributed Compute (MapReduce)

Higher-level Apps
NoSQL Data Store (HBase)
Data Flow (Pig)
Query (Hive)
Machine Learning (Mahout)

Page 11: The future of Big Data tooling

the DFS

Data Node 1
Data Node 2
Data Node 10

File 1
File 2

block replication

High throughput
Linear scalability
Fault tolerance
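
To make block replication concrete, here is a minimal sketch of how a file could be cut into blocks and each block copied to several data nodes. The round-robin placement, the 64 MB block size and the replication factor of 3 are assumptions chosen for illustration; real HDFS placement is rack-aware and more involved. The function names are hypothetical.

```python
def place_blocks(file_size_mb, nodes, block_mb=64, replication=3):
    """Assign every block of a file to `replication` distinct nodes (round-robin)."""
    n_blocks = -(-file_size_mb // block_mb)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable while every block keeps at least one live replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


nodes = [f"node-{i}" for i in range(1, 11)]   # ten data nodes, as on the slide
placement = place_blocks(200, nodes)          # a 200 MB file becomes 4 blocks

survives_two = readable_after_failure(placement, {"node-1", "node-2"})
survives_three = readable_after_failure(placement, {"node-1", "node-2", "node-3"})
```

In this sketch the file survives the loss of any two nodes, but losing all three nodes that hold the same block makes it unreadable, which is the trade-off the replication factor controls.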

Page 12: The future of Big Data tooling

classical workflow

1 MapReduce Job
Input File on DFS
Split
Extract Structure
Shuffle
Aggregate
Output File on DFS

I/O steps: Read from DFS, Store on Local FS, Store on DFS

Analytical Query
Input on DFS (x2)
M/R Job (x4)
Intermediate/DFS (x3)
Output on DFS
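
The single-job workflow (split, extract structure, shuffle, aggregate) can be sketched in a few lines of plain Python. This is an in-memory illustration only: the "city,amount" record layout and all function names are assumptions, and a real job would read and write DFS files between the phases.

```python
from collections import defaultdict


def split(lines, n_splits):
    """Divide the input into chunks, one per map task."""
    return [lines[i::n_splits] for i in range(n_splits)]


def map_task(chunk):
    """Extract structure: parse raw 'city,amount' text into (key, value) pairs."""
    for line in chunk:
        city, amount = line.split(",")
        yield city, float(amount)


def shuffle(mapped_outputs):
    """Bring all values of the same key together, as the shuffle does across nodes."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Aggregate the values collected for one key."""
    return key, sum(values)


def run_job(lines, n_splits=2):
    chunks = split(lines, n_splits)
    mapped = [list(map_task(chunk)) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())


input_file = ["Sofia,10", "Varna,5", "Sofia,7", "Plovdiv,3", "Varna,1"]
totals = run_job(input_file)
```

An analytical query that needs several such aggregations would chain several calls like `run_job`, each one writing its intermediate result back to storage before the next can start.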

Page 13: The future of Big Data tooling

programmability

Map()
Reduce()

complex queries require running many Map/Reduce jobs!!!

JOINs are difficult
WHEREs are difficult

File 1 (k) on node 1
File 2 (k) on node 2
shuffle File 2 (k)
k = k ?
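
Why a JOIN is awkward in this model: both files have to be re-keyed and shuffled so that records with the same k meet in one reducer. A minimal sketch of such a reduce-side join follows; the record contents and function names are invented for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_tagged(records, tag):
    """Map phase: emit (k, (source tag, payload)) so the reducer can tell files apart."""
    for key, payload in records:
        yield key, (tag, payload)


def reduce_side_join(file_1, file_2):
    # Shuffle: all records sharing a key end up in the same group.
    groups = defaultdict(lambda: {"F1": [], "F2": []})
    for key, (tag, payload) in chain(map_tagged(file_1, "F1"), map_tagged(file_2, "F2")):
        groups[key][tag].append(payload)

    # Reduce: pair up the two sides within each key (inner join).
    joined = []
    for key in sorted(groups):
        for left in groups[key]["F1"]:
            for right in groups[key]["F2"]:
                joined.append((key, left, right))
    return joined


orders = [(1, "order-A"), (2, "order-B"), (1, "order-C")]
customers = [(1, "Alice"), (3, "Carol")]
result = reduce_side_join(orders, customers)
```

Keys 2 and 3 appear in only one file, so they drop out of the inner join; only the two orders for key 1 are matched.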

Page 14: The future of Big Data tooling

resource management

1 Task = 1 Core

Split

theoretical max (100 nodes)
400 cores = 100 nodes x 4 cores
2.5 GB/s = 400 tasks x 64 MB/task / 10 sec/task
14.6 GB/s = 100 nodes x 150 MB/s

theoretical max (5 nodes)
20 cores = 5 nodes x 4 cores
128 MB/s = 20 tasks x 64 MB/task / 10 sec/task
0.73 GB/s = 5 nodes x 150 MB/s

in reality, multiple M/R ~ 3 MB/s
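
The theoretical figures follow from simple arithmetic, reproduced below as a sketch. The constants (4 cores per node, 64 MB per task, 10 seconds per task, 150 MB/s of disk bandwidth per node) are the ones on the slide; the small cluster is taken to be the 5 nodes implied by its core count, and the function name is hypothetical.

```python
CORES_PER_NODE = 4
MB_PER_TASK = 64
SECONDS_PER_TASK = 10
DISK_MB_PER_SECOND = 150


def theoretical_max(nodes):
    """Return (cores, task throughput in MB/s, aggregate disk bandwidth in MB/s)."""
    cores = nodes * CORES_PER_NODE
    tasks = cores  # 1 task = 1 core
    task_mb_s = tasks * MB_PER_TASK / SECONDS_PER_TASK
    disk_mb_s = nodes * DISK_MB_PER_SECOND
    return cores, task_mb_s, disk_mb_s


large = theoretical_max(100)  # 400 cores, 2560 MB/s of tasks, 15000 MB/s of disk
small = theoretical_max(5)    # 20 cores, 128 MB/s of tasks, 750 MB/s of disk

for name, (cores, task_mb_s, disk_mb_s) in {"100 nodes": large, "5 nodes": small}.items():
    print(f"{name}: {cores} cores, {task_mb_s / 1024:.2f} GB/s tasks, {disk_mb_s / 1024:.2f} GB/s disk")
```

Dividing by 1024 gives the 2.5 GB/s and 14.6 GB/s quoted for 100 nodes. In both clusters the task throughput stays well below what the disks could deliver.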

Page 15: The future of Big Data tooling

Spark architecture

Cluster of Commodity Servers

Distributed File Store (HDFS)

Resource Management (YARN)

Distributed Compute (Spark)

Higher-level Apps
NoSQL Data Store (HBase)
Data Flow (Scala)
Query (Spark SQL)
Machine Learning (MLlib)

Page 16: The future of Big Data tooling

what’s different?

Optimized Execution
Pipelines for batches of jobs
Memory caching of intermediate results

Programmability
Rich set of high-level data flow operations
Support for popular languages: Scala, Java, Python
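
The two execution ideas, pipelining and in-memory caching of intermediate results, can be imitated with a toy class in plain Python. This is not Spark's API: the `Dataset` class, its methods and the `read_file` source are hypothetical stand-ins that only mirror the behaviour.

```python
class Dataset:
    """A tiny lazy dataset: transformations are chained and run only on collect()."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable returning an iterator
        self._cache = None
        self._cache_enabled = False

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)                 # reuse the in-memory result
        if self._cache_enabled:
            self._cache = list(self._compute())      # materialise once, keep in memory
            return iter(self._cache)
        return self._compute()

    def map(self, fn):
        return Dataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, predicate):
        return Dataset(lambda: (x for x in self._iterate() if predicate(x)))

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        return list(self._iterate())


scan_count = [0]  # counts how often the "file" is actually read


def read_file():
    scan_count[0] += 1
    return iter(["3", "8", "15", "4"])


numbers = Dataset(read_file).map(int).cache()
evens = numbers.filter(lambda n: n % 2 == 0).collect()
total = sum(numbers.collect())
```

Both results are computed from a single scan of the source. Dropping the `cache()` call would make `read_file` run twice, which is what happens when every step goes back to the file store.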

Page 17: The future of Big Data tooling

what’s the same?

Workflow
Scan the file
Interpret data structure in user code
Perform analysis

Philosophy
Ingest and collect all data now
Analyze later

Page 18: The future of Big Data tooling

other improvements

Hadoop Storage
Columnar data formats
Compression

SQL-on-Hadoop
Friendlier interface to analysts and tools
Optimized implementation (Impala, PrestoDB)
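
A small sketch of why columnar layouts and compression go together: once values of one column sit next to each other, repeated values compress well and a query can read only the columns it needs. Run-length encoding is used here as one simple example of such compression; the slide does not name a specific codec, and the sample rows and function names are invented.

```python
from itertools import groupby

rows = [
    ("2017-02-10", "BG", 120),
    ("2017-02-10", "BG", 80),
    ("2017-02-10", "DE", 200),
    ("2017-02-11", "DE", 50),
]


def to_columns(rows, names):
    """Turn a list of row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["day", "country", "amount"])
encoded_day = run_length_encode(columns["day"])
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column never touches the other two.
total_amount = sum(columns["amount"])
```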

Page 19: The future of Big Data tooling

now, an enterprise

Data Sources
A variety of systems covering departmental functions
Mostly structured and transactional
Loose alignment of business terms

Typical Challenges
Data quality
Data integration
Interactive analytics
Business audiences
Client self-service analytics
Significant volumes (10-100 TB range)
Leveraging investment in IT and training
BUDGET!!!

Page 20: The future of Big Data tooling

Big Data tools offer

Scalable Storage and Computation
Reliable and scalable storage for files
Reliable and scalable batch-mode computation
Not efficient at small scale

Unified Data Integration
The data is there
Its quality is up to the user
Its integration is up to the user and difficult / slow
“The user” is a small group of highly qualified data scientists
New programming interfaces
Mounting costs to acquire, extend and run

Page 21: The future of Big Data tooling

Hadoop adoption

Top Uses in 2015 (Gartner)
File storage
Basic analytics
Proof of concept
Next year: Advanced Analytics, DWH

Cluster Size
Average cluster size: 20 nodes
Median cluster size: 32 nodes
50% report under 10 TB of storage

Top Reasons for Slow Adoption
Lack of adequate skill
No business case

Page 22: The future of Big Data tooling

so Hadoop is …

Especially good at …
Batch processing
of web-scale
unstructured data
on large expensive infrastructures

But not that good at …
data integration and unification
concurrent use
interactive querying
accessibility to business users

Page 23: The future of Big Data tooling

sounds familiar…?

Yeah, mainframes of old days…
Batch-processed
Centralized
Users queuing for system access
CODASYL-style programming

What’s the future?
Scale out
Let the data management system manage the data
Optimized structured storage
Declarative syntax for business users
Interfacing data management and presentation tools
Data integration methodologies

Page 24: The future of Big Data tooling

scaled-out DBMS

Cluster of Commodity Servers

Distributed File Store

Resource Management
Distributed Execution & Aggregation

Distributed Database Engine
Partitioned Storage and Querying

Declarative Query Language

Higher-level Apps
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
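
Partitioned storage with distributed execution and aggregation can be sketched with a few in-memory SQLite databases standing in for the nodes. This is a generic scatter-gather illustration, not the design of any particular product: the table, the byte-sum partitioning function and all names are assumptions.

```python
import sqlite3


def make_cluster(n_nodes):
    """Each 'node' is an independent in-memory database with the same schema."""
    nodes = [sqlite3.connect(":memory:") for _ in range(n_nodes)]
    for db in nodes:
        db.execute("CREATE TABLE shipments (lane TEXT, cost REAL)")
    return nodes


def load(nodes, rows):
    """Partitioned storage: a deterministic function of the key picks the node."""
    for lane, cost in rows:
        node = nodes[sum(lane.encode()) % len(nodes)]
        node.execute("INSERT INTO shipments VALUES (?, ?)", (lane, cost))


def avg_cost_per_lane(nodes):
    """Run the same declarative query on every node, then merge the partial results."""
    partials = {}
    query = "SELECT lane, SUM(cost), COUNT(*) FROM shipments GROUP BY lane"
    for db in nodes:
        for lane, total, count in db.execute(query):
            prev_total, prev_count = partials.get(lane, (0.0, 0))
            partials[lane] = (prev_total + total, prev_count + count)
    return {lane: total / count for lane, (total, count) in partials.items()}


cluster = make_cluster(3)
load(cluster, [("SOF-VIE", 100.0), ("SOF-VIE", 140.0), ("VAR-IST", 90.0), ("SOF-BER", 210.0)])
averages = avg_cost_per_lane(cluster)
```

The averages are merged from per-node sums and counts rather than from per-node averages, so the result stays correct even if rows for one key were spread over several nodes.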

Page 25: The future of Big Data tooling

MammothDB architecture

Cluster of Commodity Servers

Resource Management
Interactive Map/Reduce

Columnar RDBMS (per Node)
Partitioned Storage and Querying

SQL

Higher-level Apps
Data Integration
Self-service BI
Advanced Analytics
Machine Learning

Page 26: The future of Big Data tooling

use case logistics

Business Challenge
Predict cost of moving cargo between pairs of cities
Integrate into ERP
Validate at country level globally
Track historical accuracy
Outputs: 3 levels of service, 15,000 trade lanes, 4 charges

Solution
Client DWH
MammothDB Web Portal
E-LT
prediction model
MS SSAS ROLAP cube
Rate Calculator
SAP extract generator

Page 27: The future of Big Data tooling

use case media planning

Business Challenge
Track campaign across different media
Integrate online feeds
Store extended historical data
Load into downstream system
Provide ad-hoc reporting

Solution
Google
Facebook
Gemius
…
MammothDB Web Portal
E-LT pull & consolidate
MS SSAS ROLAP cube
extract generator
QlikView

Page 28: The future of Big Data tooling

Q & A

Thank you!