The Future of Big Data Tooling
TRANSCRIPT
Alexander Aldev
About me
Translator | between business and IT
CTO and co-founder | MammothDB
17 years | various shades of analytics, DWH, BI
Nerd | making scaled data infrastructure practical
Spoiler Alert!
This talk
Can we predict the future?
How do Big Data tools work today?
How did they evolve?
… and some examples
What is their environment?
yes, we can! It already happened.
HOW MANY Z’S IN THIS SOUP?
photo Ursus Wehrli
THE BIG DATA TOOL APPROACH
photo Ursus Wehrli
MAYBE THIS WOULD HELP…
photo Ursus Wehrli
Working Definition
Just what’s Big Data?
Datasets so large and/or complex
that traditional data processing techniques
are inadequate to handle them
Examples
Indexing 100 PB of crawled web content
Providing online interactive analytics to 10 million clients
IT’S RELATIVE
photo Anders Rasmusen
Today
The Big Data toolset?
… for analytics, this is mostly synonymous with Hadoop
Hadoop architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (MapReduce)
Higher-level Apps
NoSQL Data Store (HBase)
Data Flow (Pig)
Query (Hive)
Machine Learning (Mahout)
the DFS
Data Node 1 … Data Node 2 … Data Node 10, with File 1 and File 2 stored as replicated blocks spread across the nodes
High throughput
Linear scalability
Fault tolerance
block replication
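As a rough sketch of what the DFS does with a file: it is cut into fixed-size blocks and each block is copied to several data nodes, which is where the fault tolerance and throughput come from. The block size, replication factor and node names below are illustrative defaults, not taken from any specific cluster.

# Toy illustration of HDFS-style block placement (not the real HDFS code):
# a file is split into fixed-size blocks and every block is replicated
# onto several data nodes.
import math
import random

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB blocks (a common default)
REPLICATION = 3                        # each block stored on 3 nodes
NODES = ["data-node-%02d" % i for i in range(1, 11)]

def place_blocks(file_size_bytes):
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # pick 3 distinct nodes per block; losing any single node loses no data
    return {block: random.sample(NODES, REPLICATION) for block in range(n_blocks)}

print(place_blocks(200 * 1024 * 1024))  # a ~200 MB file -> 4 blocks, 3 copies each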
classical workflow
1 MapReduce Job
Input File on DFS (read from DFS)
Split
Extract Structure
Shuffle (intermediate results stored on local FS)
Aggregate
Output File on DFS (stored on DFS)
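To make the stages concrete, here is a toy, single-process imitation of that workflow in Python: split → map (extract structure) → shuffle (group by key) → reduce (aggregate). On a real cluster each stage runs distributed, the shuffle crosses the network, and intermediate data lands on local disk; the word-count logic is only an example.

from collections import defaultdict

def map_phase(lines):
    # "Extract Structure": turn raw input lines into (key, value) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group all values belonging to the same key (done over the network on a cluster)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # "Aggregate": one output record per key
    for key, values in grouped:
        yield key, sum(values)

input_split = ["big data tools", "big data big clusters"]   # one input split
print(dict(reduce_phase(shuffle(map_phase(input_split)))))
# {'big': 3, 'data': 2, 'tools': 1, 'clusters': 1}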
Analytical Query
(diagram: several inputs on the DFS feed a chain of M/R jobs; every job writes its intermediate result back to the DFS before the next job can read it, and the final job writes the output to the DFS)
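In practice that chain is driven job by job. Below is a hedged sketch of such a driver using Hadoop Streaming from Python; the jar location, DFS paths and mapper/reducer script names are all hypothetical, and the point is only that every stage writes its full intermediate result to the DFS before the next stage can start.

import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"   # path varies per install

def run_job(input_path, output_path, mapper, reducer):
    # Each call is one full MapReduce job; its output lands on the DFS.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ])

# Stage 1: extract structure and filter (the WHERE part of the query)
run_job("/data/sales", "/tmp/query42/stage1", "filter_map.py", "identity_reduce.py")
# Stage 2: aggregate per group (the GROUP BY part), reading stage 1 from the DFS
run_job("/tmp/query42/stage1", "/tmp/query42/stage2", "group_map.py", "sum_reduce.py")
# Stage 3: global sort / top-N, producing the final output on the DFS
run_job("/tmp/query42/stage2", "/results/query42", "identity_map.py", "topn_reduce.py")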
programmability
Map()
Reduce()
Complex queries require running many Map/Reduce jobs!!!
JOINs are difficult
WHEREs are difficult
(diagram: a row with key k from File 1 on node 1 and a row with key k from File 2 on node 2 must be shuffled to the same node before the "= k ?" comparison can be made)
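A toy reduce-side join in Python shows why: the map phase has to tag every row with its source, the shuffle has to move all rows sharing a key k to the same place, and only then can the reduce phase test "= k ?" and pair the rows. The two tiny tables are invented for the example.

from collections import defaultdict

file1 = [("k1", "order-100"), ("k2", "order-200")]   # (join key, fact row)
file2 = [("k1", "Alice"), ("k3", "Bob")]             # (join key, dimension row)

def map_phase():
    # Tag each row with its source so the reducer can tell the sides apart
    for key, value in file1:
        yield key, ("file1", value)
    for key, value in file2:
        yield key, ("file2", value)

# Shuffle: all rows with the same key end up on the same reducer
groups = defaultdict(list)
for key, tagged in map_phase():
    groups[key].append(tagged)

# Reduce: pair up the two sides per key (only k1 matches here)
for key, rows in groups.items():
    left = [v for side, v in rows if side == "file1"]
    right = [v for side, v in rows if side == "file2"]
    for l in left:
        for r in right:
            print(key, l, r)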
resource management
1 Task = 1 Core, one input Split per task
400 cores = 100 nodes x 4 cores
2.5 GB/s = 400 tasks x 64 MB/task / 10 sec/task
14.6 GB/s = 100 nodes x 150 MB/s (theoretical max)
20 cores = 5 nodes x 4 cores
128 MB/s = 20 tasks x 64 MB/task / 10 sec/task
2.9 GB/s = 20 nodes x 150 MB/s (theoretical max)
in reality, multiple M/R ~ 3 MB/s
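Spelled out as a quick back-of-the-envelope calculation (the 64 MB split, ~10 s per task and ~150 MB/s sequential read per node are the assumptions behind the slide's figures):

# Back-of-the-envelope throughput for the 100-node example above.
nodes, cores_per_node = 100, 4
split_mb, secs_per_task = 64, 10     # one 64 MB split processed in ~10 s per task
disk_mb_s = 150                      # sequential scan rate of a single node

tasks = nodes * cores_per_node                   # 400 concurrent tasks (1 task = 1 core)
mr_rate_mb_s = tasks * split_mb / secs_per_task  # 2560 MB/s, i.e. ~2.5 GB/s through M/R
disk_rate_mb_s = nodes * disk_mb_s               # 15000 MB/s, i.e. ~14.6 GB/s of raw disk read

print(mr_rate_mb_s, disk_rate_mb_s)              # 2560.0 15000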
Spark architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (Spark)
Higher-level Apps
NoSQL Data Store (HBase)
Data Flow (Scala)
Query (Spark SQL)
Machine Learning (MLlib)
Optimized Execution
what’s different?
Pipelines for batches of jobs
Memory caching of intermediate results
Programmability
Rich set of high-level data flow operations
Support for popular languages: Scala, Java, Python
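A hedged PySpark sketch of those two differences, using the current DataFrame API: several transformations are pipelined into a single pass over the data, and an intermediate result can be cached in memory and reused instead of being re-read from the DFS. The path and column names are invented for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

events = spark.read.json("hdfs:///data/events")      # lazy: nothing has run yet

# filter + projection are pipelined together with the aggregation into one job
recent = events.filter(F.col("year") >= 2015).select("client_id", "amount")
recent.cache()                                        # keep the intermediate result in memory

totals = recent.groupBy("client_id").agg(F.sum("amount").alias("total"))
totals.show()                                         # first action: reads from the DFS
print(recent.count())                                 # second action: served from the cache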
Workflow
what’s the same?
Scan the file
Interpret data structure in user code
Perform analysis
Philosophy
Ingest and collect all data now
Analyze later
Hadoop Storage
other improvements
Columnar data formats
Compression
SQL-on-Hadoop
Friendlier interface to analysts and tools
Optimized implementation (Impala, PrestoDB)
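For example, data kept in a columnar, compressed format such as Parquet can be queried with plain SQL. The sketch below does it through Spark SQL, but Impala or PrestoDB would run an equivalent statement over the same files; the paths, table and column names are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop").getOrCreate()

# Convert raw JSON once into a columnar, compressed format (Parquet)
spark.read.json("hdfs:///data/sales_raw").write.parquet("hdfs:///data/sales_parquet")

# Declarative, analyst-friendly access to the same data
spark.read.parquet("hdfs:///data/sales_parquet").createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM sales
    WHERE year = 2015
    GROUP BY region
""").show()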
Data Sources
now, an enterprise
A variety of systems covering departmental functions
Mostly structured and transactional
Loose alignment of business terms
Typical Challenges
Data quality
Data integration
Interactive analytics for business audiences
Client self-service analytics
Significant volumes (10-100 TB range)
Leveraging investment in IT and training
BUDGET!!!
Scalable Storage and Computation
Big Data tools offer
Reliable and scalable storage for files
Reliable and scalable batch-mode computation
Not efficient at small scale
Unified Data Integration
The data is there
Its quality is up to the user
Its integration is up to the user and difficult / slow
“The user” is a small group of highly qualified data scientists
New programming interfaces
Mounting costs to acquire, extend and run
Hadoop adoption
Top Uses in 2015 (Gartner)
File storage
Basic analytics
Proof of concept
Next year: Advanced Analytics, DWH
Cluster Size
Average cluster size: 20 nodes
Median cluster size: 32 nodes
50% report under 10 TB of storage
Top Reasons for Slow Adoption
Lack of adequate skill
No business case
so Hadoop is …
Especially good at …
Batch-processing of web-scale unstructured data on large expensive infrastructures
But not that good at …
Data integration and unification
Concurrent use
Interactive querying
Accessibility to business users
Yeah, mainframes of old days…
sounds familiar…?
Batch-processed
Centralized
Users waiting in queues for system access
CODASYL-style programming
What’s the future?
Scale out
Let the data management system manage the data
Optimized structured storage
Declarative syntax for business users
Interfacing data management and presentation tools
Data integration methodologies
scaled-out DBMS
Cluster of Commodity Servers
Distributed File Store
Resource Management
Distributed Execution & Aggregation
Higher-level Apps
Declarative Query Language
Distributed Database Engine
Partitioned Storage and Querying
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
MammothDB architecture
Cluster of Commodity Servers
Resource Management
Interactive Map/Reduce
Higher-level Apps
SQL
Columnar RDBMS (per Node)
Partitioned Storage and Querying
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
use case: logistics
Business Challenge
Predict cost of moving cargo between pairs of cities
Integrate into ERP
Validate at country level globally
Track historical accuracy
Outputs: 3 levels of service, 15’000 tradelanes, 4 charges
Client DWH
Solution
MammothDB Web Portal
E-LT
prediction model
MS SSAS ROLAP cube
Rate Calculator
SAP extract generator
use case: media planning
Business Challenge
Track campaigns across different media
Integrate online feeds
Store extended historical data
Load into downstream system
Provide ad-hoc reporting
Solution
MammothDB Web Portal
E-LT
pull & consolidate
MS SSAS ROLAP cube
extract generator
Gemius, …
QlikView
Q & A
Thank you!