The Future of Big Data Tooling
TRANSCRIPT
Alexander Aldev
About me
Translator | between business and IT
CTO and co-founder | MammothDB
17 years | various shades of analytics, DWH, BI
Nerd | making scaled data infrastructure practical
Spoiler Alert!
This talk
Can we predict the future?
How do Big Data tools work today?
How did they evolve?
… and some examples
What is their environment?
yes, we can! It already happened.
HOW MANY Z’S IN THIS SOUP?
photo Ursus Wehrli
THE BIG DATA TOOL APPROACH
photo Ursus Wehrli
MAYBE THIS WOULD HELP…
photo Ursus Wehrli
Working Definition
Just what’s Big Data?
Datasets so large and/or complex
that traditional data processing techniques
are inadequate to handle them
Examples
Indexing 100 PB of crawled web content
Providing online interactive analytics to 10 million clients
IT’S RELATIVE
photo Anders Rasmusen
Today
The Big Data toolset?
… for analytics, this is mostly synonymous with Hadoop
Hadoop architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (MapReduce)
Higher-level Apps
NoSQL Data Store (HBase)
Data Flow (Pig)
Query (Hive)
Machine Learning (Mahout)
the DFS
Data Node 1 … Data Node 2 … Data Node 10, with File 1 and File 2 stored as replicated blocks spread across the nodes
High throughput
Linear scalability
Fault tolerance
block replication
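As a rough sketch of what the DFS does with a file: it is cut into fixed-size blocks and each block is copied to several data nodes, which is where the fault tolerance and throughput come from. The block size, replication factor and node names below are illustrative defaults, not taken from any specific cluster.

# Toy illustration of HDFS-style block placement (not the real HDFS code):
# a file is split into fixed-size blocks and every block is replicated
# onto several data nodes.
import math
import random

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB blocks (a common default)
REPLICATION = 3                        # each block stored on 3 nodes
NODES = ["data-node-%02d" % i for i in range(1, 11)]

def place_blocks(file_size_bytes):
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # pick 3 distinct nodes per block; losing any single node loses no data
    return {block: random.sample(NODES, REPLICATION) for block in range(n_blocks)}

print(place_blocks(200 * 1024 * 1024))  # a ~200 MB file -> 4 blocks, 3 copies each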
classical workflow
1 MapReduce Job
Input File on DFS (read from DFS)
Split
Extract Structure
Shuffle (intermediate results stored on local FS)
Aggregate
Output File on DFS (stored on DFS)
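To make the stages concrete, here is a toy, single-process imitation of that workflow in Python: split → map (extract structure) → shuffle (group by key) → reduce (aggregate). On a real cluster each stage runs distributed, the shuffle crosses the network, and intermediate data lands on local disk; the word-count logic is only an example.

from collections import defaultdict

def map_phase(lines):
    # "Extract Structure": turn raw input lines into (key, value) pairs
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group all values belonging to the same key (done over the network on a cluster)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # "Aggregate": one output record per key
    for key, values in grouped:
        yield key, sum(values)

input_split = ["big data tools", "big data big clusters"]   # one input split
print(dict(reduce_phase(shuffle(map_phase(input_split)))))
# {'big': 3, 'data': 2, 'tools': 1, 'clusters': 1}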
Analytical Query
(diagram: several inputs on the DFS feed a chain of M/R jobs; every job writes its intermediate result back to the DFS before the next job can read it, and the final job writes the output to the DFS)
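In practice that chain is driven job by job. Below is a hedged sketch of such a driver using Hadoop Streaming from Python; the jar location, DFS paths and mapper/reducer script names are all hypothetical, and the point is only that every stage writes its full intermediate result to the DFS before the next stage can start.

import subprocess

STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"   # path varies per install

def run_job(input_path, output_path, mapper, reducer):
    # Each call is one full MapReduce job; its output lands on the DFS.
    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ])

# Stage 1: extract structure and filter (the WHERE part of the query)
run_job("/data/sales", "/tmp/query42/stage1", "filter_map.py", "identity_reduce.py")
# Stage 2: aggregate per group (the GROUP BY part), reading stage 1 from the DFS
run_job("/tmp/query42/stage1", "/tmp/query42/stage2", "group_map.py", "sum_reduce.py")
# Stage 3: global sort / top-N, producing the final output on the DFS
run_job("/tmp/query42/stage2", "/results/query42", "identity_map.py", "topn_reduce.py")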
programmability
Map()
Reduce()
Complex queries require running many Map/Reduce jobs!!!
JOINs are difficult
WHEREs are difficult
(diagram: a row with key k from File 1 on node 1 and a row with key k from File 2 on node 2 must be shuffled to the same node before the "= k ?" comparison can be made)
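A toy reduce-side join in Python shows why: the map phase has to tag every row with its source, the shuffle has to move all rows sharing a key k to the same place, and only then can the reduce phase test "= k ?" and pair the rows. The two tiny tables are invented for the example.

from collections import defaultdict

file1 = [("k1", "order-100"), ("k2", "order-200")]   # (join key, fact row)
file2 = [("k1", "Alice"), ("k3", "Bob")]             # (join key, dimension row)

def map_phase():
    # Tag each row with its source so the reducer can tell the sides apart
    for key, value in file1:
        yield key, ("file1", value)
    for key, value in file2:
        yield key, ("file2", value)

# Shuffle: all rows with the same key end up on the same reducer
groups = defaultdict(list)
for key, tagged in map_phase():
    groups[key].append(tagged)

# Reduce: pair up the two sides per key (only k1 matches here)
for key, rows in groups.items():
    left = [v for side, v in rows if side == "file1"]
    right = [v for side, v in rows if side == "file2"]
    for l in left:
        for r in right:
            print(key, l, r)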
resource management
1 Task = 1 Core, one input Split per task
400 cores = 100 nodes x 4 cores
2.5 GB/s = 400 tasks x 64 MB/task / 10 sec/task
14.6 GB/s = 100 nodes x 150 MB/s (theoretical max)
20 cores = 5 nodes x 4 cores
128 MB/s = 20 tasks x 64 MB/task / 10 sec/task
2.9 GB/s = 20 nodes x 150 MB/s (theoretical max)
in reality, multiple M/R ~ 3 MB/s
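Spelled out as a quick back-of-the-envelope calculation (the 64 MB split, ~10 s per task and ~150 MB/s sequential read per node are the assumptions behind the slide's figures):

# Back-of-the-envelope throughput for the 100-node example above.
nodes, cores_per_node = 100, 4
split_mb, secs_per_task = 64, 10     # one 64 MB split processed in ~10 s per task
disk_mb_s = 150                      # sequential scan rate of a single node

tasks = nodes * cores_per_node                   # 400 concurrent tasks (1 task = 1 core)
mr_rate_mb_s = tasks * split_mb / secs_per_task  # 2560 MB/s, i.e. ~2.5 GB/s through M/R
disk_rate_mb_s = nodes * disk_mb_s               # 15000 MB/s, i.e. ~14.6 GB/s of raw disk read

print(mr_rate_mb_s, disk_rate_mb_s)              # 2560.0 15000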
Spark architecture
Cluster of Commodity Servers
Distributed File Store (HDFS)
Resource Management (YARN)
Distributed Compute (Spark)
Higher-level Apps
NoSQL Data Store (HBase)
Data Flow (Scala)
Query (Spark SQL)
Machine Learning (MLlib)
Optimized Execution
what’s different?
Pipelines for batches of jobs
Memory caching of intermediate results
Programmability
Rich set of high-level data flow operations
Support for popular languages: Scala, Java, Python
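A hedged PySpark sketch of those two differences, using the current DataFrame API: several transformations are pipelined into a single pass over the data, and an intermediate result can be cached in memory and reused instead of being re-read from the DFS. The path and column names are invented for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

events = spark.read.json("hdfs:///data/events")      # lazy: nothing has run yet

# filter + projection are pipelined together with the aggregation into one job
recent = events.filter(F.col("year") >= 2015).select("client_id", "amount")
recent.cache()                                        # keep the intermediate result in memory

totals = recent.groupBy("client_id").agg(F.sum("amount").alias("total"))
totals.show()                                         # first action: reads from the DFS
print(recent.count())                                 # second action: served from the cache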
Workflow
what’s the same?
Scan the file
Interpret data structure in user code
Perform analysis
Philosophy
Ingest and collect all data now
Analyze later
Hadoop Storage
other improvements
Columnar data formats
Compression
SQL-on-Hadoop
Friendlier interface to analysts and tools
Optimized implementation (Impala, PrestoDB)
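For example, data kept in a columnar, compressed format such as Parquet can be queried with plain SQL. The sketch below does it through Spark SQL, but Impala or PrestoDB would run an equivalent statement over the same files; the paths, table and column names are made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop").getOrCreate()

# Convert raw JSON once into a columnar, compressed format (Parquet)
spark.read.json("hdfs:///data/sales_raw").write.parquet("hdfs:///data/sales_parquet")

# Declarative, analyst-friendly access to the same data
spark.read.parquet("hdfs:///data/sales_parquet").createOrReplaceTempView("sales")
spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM sales
    WHERE year = 2015
    GROUP BY region
""").show()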
Data Sources
now, an enterprise
A variety of systems covering departmental functions
Mostly structured and transactional
Loose alignment of business terms
Typical Challenges
Data quality
Data integration
Interactive analytics for business audiences
Client self-service analytics
Significant volumes (10-100 TB range)
Leveraging investment in IT and training
BUDGET!!!
Scalable Storage and Computation
Big Data tools offer
Reliable and scalable storage for files
Reliable and scalable batch-mode computation
Not efficient at small scale
Unified Data Integration
The data is there
Its quality is up to the user
Its integration is up to the user and difficult / slow
“The user” is a small group of highly qualified data scientists
New programming interfaces
Mounting costs to acquire, extend and run
Hadoop adoption
Top Uses in 2015 (Gartner)
File storage
Basic analytics
Proof of concept
Next year: Advanced Analytics, DWH
Cluster Size
Average cluster size: 20 nodes
Median cluster size: 32 nodes
50% report under 10 TB of storage
Top Reasons for Slow Adoption
Lack of adequate skill
No business case
so Hadoop is …
Especially good at …
Batch-processing of web-scale unstructured data on large expensive infrastructures
But not that good at …
Data integration and unification
Concurrent use
Interactive querying
Accessibility to business users
Yeah, mainframes of old days…
sounds familiar…?
Batch-processed
Centralized
Users waiting in queues for system access
CODASYL-style programming
What’s the future?
Scale out
Let the data management system manage the data
Optimized structured storage
Declarative syntax for business users
Interfacing data management and presentation tools
Data integration methodologies
scaled-out DBMS
Cluster of Commodity Servers
Distributed File Store
Resource Management
Distributed Execution & Aggregation
Higher-level Apps
Declarative Query Language
Distributed Database Engine
Partitioned Storage and Querying
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
MammothDB architecture
Cluster of Commodity Servers
Resource Management
Interactive Map/Reduce
Higher-level Apps
SQL
Columnar RDBMS (per Node)
Partitioned Storage and Querying
Data Integration
Self-service BI
Advanced Analytics
Machine Learning
use case: logistics
Business Challenge
Predict cost of moving cargo between pairs of cities
Integrate into ERP
Validate at country level globally
Track historical accuracy
Outputs: 3 levels of service, 15’000 tradelanes, 4 charges
Client DWH
Solution
MammothDB Web Portal
E-LT
prediction model
MS SSAS ROLAP cube
Rate Calculator
SAP extract generator
use case: media planning
Business Challenge
Track campaigns across different media
Integrate online feeds
Store extended historical data
Load into downstream system
Provide ad-hoc reporting
Solution
MammothDB Web Portal
E-LT
pull & consolidate
MS SSAS ROLAP cube
extract generator
Gemius, …
QlikView
Q & A
Thank you!