future of data intensive applicaitons

of 58 /58
Future of Data Intensive Applications Milind Bhandarkar Chief Scientist, Pivotal @techmilind Thursday, December 12, 2013

Upload: milind-bhandarkar

Post on 11-Aug-2014

880 views

Category:

Data & Analytics


58 download

DESCRIPTION

"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.

TRANSCRIPT

Page 1: Future of Data Intensive Applicaitons

Future of Data Intensive Applications

Milind BhandarkarChief Scientist, Pivotal

@techmilind

Thursday, December 12, 2013

Page 2: Future of Data Intensive Applicaitons

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Thursday, December 12, 2013

Page 3: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 4: Future of Data Intensive Applicaitons

Kryptonite: First Hadoop Cluster At Yahoo!

Thursday, December 12, 2013

Page 5: Future of Data Intensive Applicaitons

M45Thursday, December 12, 2013

Page 6: Future of Data Intensive Applicaitons

OpenCirrusThursday, December 12, 2013

Page 7: Future of Data Intensive Applicaitons

Analytics Workbench

Thursday, December 12, 2013

Page 8: Future of Data Intensive Applicaitons

Analytics Workbench

Thursday, December 12, 2013

Page 9: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 10: Future of Data Intensive Applicaitons

70% of data generated by

customers

80% of data being stored

3% being prepared for

analysis

0.5% being analyzed

<0.5% being operationalized

Average Enterprises

The Big GapThursday, December 12, 2013

Page 11: Future of Data Intensive Applicaitons

Example: Healthcare

• In last 5 years

• 3,573 studies on hospital readmissions

• 9,745 papers on comparative effectiveness

• 39,230 studies on drug interactions

• 132,241 studies on hospital mortality

• Yet, very few models operational

Thursday, December 12, 2013

Page 12: Future of Data Intensive Applicaitons

PowerPoint is where Models go to Die.

- Hulya Farinas, Principal Data Scientist, Pivotal

Thursday, December 12, 2013

Page 13: Future of Data Intensive Applicaitons

ModernizationThursday, December 12, 2013

Page 14: Future of Data Intensive Applicaitons

Building BlocksThursday, December 12, 2013

Page 15: Future of Data Intensive Applicaitons

Structured

BI/Analytics Tools Data Science Applications

Semi-structured Unstructured

High-Speed Integration Data provisioning, shared security, coordinated transformation

(Big) Data Staging Platform

Analytic Data Warehouse

In Memory Data Grid

On-Demand, Self-Service Access

Meta-data driven access control

Modern Data Architecture

Thursday, December 12, 2013

Page 16: Future of Data Intensive Applicaitons

Data Fabric Requirements

• Store massive & diverse data sets economically

• Integrate and Ingest from legacy & disparate sources

• Ability to rapidly analyze massive data sets

• Control, Auditing, Manageability

• Self-Service

Thursday, December 12, 2013

Page 17: Future of Data Intensive Applicaitons

Data Fabric ArchitectureThursday, December 12, 2013

Page 18: Future of Data Intensive Applicaitons

Infrastructure-As-A-Service is the new

“Hardware”

Thursday, December 12, 2013

Page 19: Future of Data Intensive Applicaitons

IAAS: New Hardware

• AWS, GCE, Azure

• vSphere, OpenStack

• Easy Provisioning

• Scalable, Elastic, Ubiquitous

•Needs bundling with Data & Analytics as Services

Thursday, December 12, 2013

Page 20: Future of Data Intensive Applicaitons

App Fabric Requirements

• IAAS Cloud-Agnostic

• Rapid provisioning, Elasticity

•Open, No-Lock-In, Data As-A-Service

• Automation for Application Lifecycle Management

•Developer Agility : Eliminate Infrastructure Wiring

Thursday, December 12, 2013

Page 21: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 22: Future of Data Intensive Applicaitons

EcosystemThursday, December 12, 2013

Page 23: Future of Data Intensive Applicaitons

Broader EcosystemThursday, December 12, 2013

Page 24: Future of Data Intensive Applicaitons

Legacy App Deployment

Thursday, December 12, 2013

Page 25: Future of Data Intensive Applicaitons

!"#$%&%#'()*+(,-#./0(

((12"341()*+(,-#./0(

((!.&5()*+(2!!0(

((6%'/()*+(&4"$%,4&0(

((&,2-4()*+(2!!0(7899(

.!3"2/4()*+(,-#./0(

(

Modern App Deployment

Thursday, December 12, 2013

Page 26: Future of Data Intensive Applicaitons

Infrastructure One

JVM

VM

Infrastructure One

Infrastructure Two

App

Container 1

App Server

JVM

Container 2

App Server

JVM

Dev Framework Dev Framework

App Server

Configurations Manifests, Automations

Infrastructure Two

JVM

VM

Dev Framework

App Server

Configurations

App App App

Application As Unit of Deployment

Thursday, December 12, 2013

Page 27: Future of Data Intensive Applicaitons

Hadoop’s Role in Data Clouds

Thursday, December 12, 2013

Page 28: Future of Data Intensive Applicaitons

Trough of Disillusionment ?

Thursday, December 12, 2013

Page 29: Future of Data Intensive Applicaitons

Or, Hadoop Everywhere?

Thursday, December 12, 2013

Page 30: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 31: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 32: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 33: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 34: Future of Data Intensive Applicaitons

Thursday, December 12, 2013

Page 35: Future of Data Intensive Applicaitons

Game Changing Hadoop Economics

$-

$20,000

$40,000

$60,000

$80,000

2008 2009 2010 2011 2012 2013

Big Data Platform Price/TB

Big Data DB Hadoop

Thursday, December 12, 2013

Page 36: Future of Data Intensive Applicaitons

Storage Options

• HDFS, MapR, Quantcast QFS

• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre

• Amazon S3, EMC Atmos, OpenStack Swift

• GlusterFS, Ceph

• EMC ViPR

Thursday, December 12, 2013

Page 37: Future of Data Intensive Applicaitons

SQL-on-Hadoop

• Pivotal HAWQ

• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger

• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase

•More to come...

Thursday, December 12, 2013

Page 38: Future of Data Intensive Applicaitons

!"#$%&'())'

BATCH

HDFS

!"#$%&'())'

INTERACTIVE

!"#$%&'())'

BATCH

HDFS

!"#$%&'())'

BATCH

HDFS

!"#$%&'())'

ONLINE

Hadoop 1.0(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

Page 39: Future of Data Intensive Applicaitons

MapReduce 1.0

(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

Page 40: Future of Data Intensive Applicaitons

Hadoop 2.0

(Image Courtesy Arun Murthy, Hortonworks)

HADOOP 1.0

!"#$%!"#$%&$'&()*"#+,'-+#*.(/"'0#1*

&'()*+,-*%!2+%.(#"*"#./%"2#*3'&'0#3#&(*

*4*$'('*5"/2#..,&01*

!"#$.%!"#$%&$'&()*"#+,'-+#*.(/"'0#1*

/0)1%!2+%.(#"*"#./%"2#*3'&'0#3#&(1*

2*3%!#6#2%7/&*#&0,&#1*

HADOOP 2.0

456%!$'('*8/91*

!57*%!.:+1*

%89:*;<%!2'.2'$,&01*

*

456%!$'('*8/91*

!57*%!.:+1*

%89:*;<%!2'.2'$,&01*

%

&)%!-'(2;1*

)2%%$9;*'=>%?;'(:%!"#$%&''()$*+,'

*

$*;75-*<%-.*/0'

*

Thursday, December 12, 2013

Page 41: Future of Data Intensive Applicaitons

!""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+

35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2*

9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2***

:!;<3+=>&",04-%0?+

2.;@,!<;2A@+=;0B?+

7;,@!>2.C+=7D(EFG+7HGI?+

C,!J3+=C$E&"K?+

2.L>@>M,9+=7"&EN?+

3J<+>J2+=M"0)>J2?+

M.O2.@+=3:&*0?+

M;3@,+=70&E%K?+=P0&/0I?+

YARN Platform

(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

Page 42: Future of Data Intensive Applicaitons

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*

+"',&-'$)*./.*

+"',&-'$)*0/1*

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*

+"',&-'$)*./0*

+"',&-'$)*./2*

3%*.*

+"',&-'$)*0/0*

+"',&-'$)*0/.*

+"',&-'$)*0/2*

3%0*

+4-$',0*

5$6"7)8$%&'&($)*

98:$#74$)*

YARN Architecture

(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

Page 43: Future of Data Intensive Applicaitons

YARN

• Yet Another Resource Negotiator

• Resource Manager

•Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)

Thursday, December 12, 2013

Page 44: Future of Data Intensive Applicaitons

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

•More to come ...

Thursday, December 12, 2013

Page 45: Future of Data Intensive Applicaitons

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding

Thursday, December 12, 2013

Page 46: Future of Data Intensive Applicaitons

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (lxc Containers)

•ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect

Thursday, December 12, 2013

Page 47: Future of Data Intensive Applicaitons

Resource Manager

Scheduler

AMService

Node Manager Node Manager Node Manager !

Proc/Container

Framework Daemon NS MPI

Scheduler HNP

MPI AM

Proc/Container

! RM-AM

AM-NM

RM-NodeManager Client Client-RM

Aux Srvcs

Proc/Container

Framework Daemon NS

Proc/Container

!

Aux Srvcs RM-

NodeManager

Hamster ArchitectureThursday, December 12, 2013

Page 48: Future of Data Intensive Applicaitons

Hamster Scalability

• Sufficient for small to medium HPC workloads

• Job launch time gated by YARN resource scheduler

Launch WireUp Collectives Monitor

OpenMPI O(logN) O(logN) O(logN) O(logN)

Hamster O(N) O(logN) O(logN) O(logN)

Thursday, December 12, 2013

Page 49: Future of Data Intensive Applicaitons

GraphLab + Hamsteron Hadoop

!

Thursday, December 12, 2013

Page 50: Future of Data Intensive Applicaitons

About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin in CMU in 2009

• Recently founded Graphlab Inc to commercialize Graphlab.org

Thursday, December 12, 2013

Page 51: Future of Data Intensive Applicaitons

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (Pagerank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...

Thursday, December 12, 2013

Page 52: Future of Data Intensive Applicaitons

Only Graphs are not Enough

• Full Data processing workflow required ETL/Postprocessing, Visualization, Data Wrangling, Serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks

Thursday, December 12, 2013

Page 53: Future of Data Intensive Applicaitons

Call To Action

Thursday, December 12, 2013

Page 54: Future of Data Intensive Applicaitons

Prepare for Convergence

• HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks

• “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency

•Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization

Thursday, December 12, 2013

Page 55: Future of Data Intensive Applicaitons

Convergence

• Resource Allocation, Scheduling, Lifecycle Management

• Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs

• Auth & Auth, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering

Thursday, December 12, 2013

Page 56: Future of Data Intensive Applicaitons

New Hardware Platforms

•Mellanox - Hadoop Acceleration through Network-assisted Merge

• RoCE - Brocade, Cisco, Extreme, Arista...

• ARM - Low power Hadoop servers

• SSD - Velobit, Violin, FusionIO, Samsung..

•Niche - Compression, Encryption...

Thursday, December 12, 2013

Page 57: Future of Data Intensive Applicaitons

Data Cloud of Future?

depl

oy

Public Cloud

Private Cloud

On Premise

Thursday, December 12, 2013

Page 58: Future of Data Intensive Applicaitons

Questions?Thursday, December 12, 2013