NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Future of Data Intensive Applications, Milind Bhandarkar, Chief Scientist, Pivotal, @techmilind, Thursday, December 12, 2013


DESCRIPTION

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

TRANSCRIPT

Page 1: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Future of Data Intensive Applications

Milind Bhandarkar

Chief Scientist, Pivotal

@techmilind

Thursday, December 12, 2013

Page 2: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)


Page 3: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 4: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Kryptonite: First Hadoop Cluster At Yahoo!


Page 5: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

M45

Page 6: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

OpenCirrus

Page 7: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Analytics Workbench


Page 8: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Analytics Workbench


Page 9: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 10: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

The Big Gap

Average Enterprises

• 70% of data generated by customers

• 80% of data being stored

• 3% being prepared for analysis

• 0.5% being analyzed

• <0.5% being operationalized

Page 11: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Example: Healthcare

• In the last 5 years

• 3,573 studies on hospital readmissions

• 9,745 papers on comparative effectiveness

• 39,230 studies on drug interactions

• 132,241 studies on hospital mortality

• Yet, very few models operational


Page 12: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

PowerPoint is where Models go to Die.

- Hulya Farinas, Principal Data Scientist, Pivotal


Page 13: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Modernization

Page 14: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Building Blocks

Page 15: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Modern Data Architecture

• Structured, Semi-structured, Unstructured

• BI/Analytics Tools, Data Science, Applications

• High-Speed Integration: Data provisioning, shared security, coordinated transformation

• (Big) Data Staging Platform

• Analytic Data Warehouse

• In Memory Data Grid

• On-Demand, Self-Service Access

• Meta-data driven access control

Page 16: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Data Fabric Requirements

• Store massive & diverse data sets economically

• Integrate and Ingest from legacy & disparate sources

• Ability to rapidly analyze massive data sets

• Control, Auditing, Manageability

• Self-Service


Page 17: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Data Fabric Architecture

Page 18: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Infrastructure-As-A-Service is the new “Hardware”

Page 19: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

IAAS: New Hardware

• AWS, GCE, Azure

• vSphere, OpenStack

• Easy Provisioning

• Scalable, Elastic, Ubiquitous

• Needs bundling with Data & Analytics as Services


Page 20: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

App Fabric Requirements

• IAAS Cloud-Agnostic

• Rapid provisioning, Elasticity

• Open, No-Lock-In, Data As-A-Service

• Automation for Application Lifecycle Management

• Developer Agility: Eliminate Infrastructure Wiring


Page 21: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 22: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Ecosystem

Page 23: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Broader Ecosystem

Page 24: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Legacy App Deployment


Page 25: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Modern App Deployment

provision <my cloud>

  target <my cloud>

  push <my app>

  bind <my services>

  scale <my app> +100

upgrade <my cloud>

Page 26: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Application As Unit of Deployment

[Diagram labels: Infrastructure One, Infrastructure Two; VM, JVM, App Server, Dev Framework, Configurations, App; Container 1, Container 2; Manifests, Automations]

Page 27: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hadoop’s Role in Data Clouds


Page 28: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Trough of Disillusionment?


Page 29: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Or, Hadoop Everywhere?


Page 30: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 31: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 32: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 33: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 34: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


Page 35: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Game Changing Hadoop Economics

[Chart: Big Data Platform Price/TB, 2008 to 2013; y-axis from $0 to $80,000; series: Big Data DB, Hadoop]

Page 36: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Storage Options

• HDFS, MapR, Quantcast QFS

• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre

• Amazon S3, EMC Atmos, OpenStack Swift

• GlusterFS, Ceph

• EMC ViPR


Page 37: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

SQL-on-Hadoop

• Pivotal HAWQ

• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger

• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase

• More to come...


Page 38: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hadoop 1.0

[Figure labels: BATCH, HDFS, INTERACTIVE, ONLINE]

(Image Courtesy Arun Murthy, Hortonworks)

Page 39: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

MapReduce 1.0

(Image Courtesy Arun Murthy, Hortonworks)


Page 40: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hadoop 2.0

(Image Courtesy Arun Murthy, Hortonworks)

HADOOP 1.0

• Pig (data flow), Hive (sql), Others (cascading)

• MapReduce (cluster resource management & data processing)

• HDFS (redundant, reliable storage)

HADOOP 2.0

• Pig (data flow), Hive (sql), Others (cascading)

• MR (batch), RT Stream, Graph, Services

• Tez (execution engine)

• YARN (cluster resource management)

• HDFS2 (redundant, reliable storage)

Page 41: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

YARN Platform

Applications Run Natively IN Hadoop

• BATCH (MapReduce)

• INTERACTIVE (Tez)

• STREAMING (Storm, S4,…)

• GRAPH (Giraph)

• IN-MEMORY (Spark)

• HPC MPI (OpenMPI)

• ONLINE (HBase)

• OTHER (Search) (Weave…)

• YARN (cluster resource management)

• HDFS2 (redundant, reliable storage)

(Image Courtesy Arun Murthy, Hortonworks)

Page 42: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

YARN Architecture

• ResourceManager

• Scheduler

• NodeManager (shown 12 times)

• AM 1, Container 1.1, Container 1.2, Container 1.3

• AM2, Container 2.1, Container 2.2, Container 2.3, Container 2.4

• Client2

(Image Courtesy Arun Murthy, Hortonworks)

Page 43: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Specific to paradigm, e.g. the MR Application Master (takes over the per-job duties of the MapReduce 1.0 JobTracker)
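The division of labor among the three YARN roles listed above can be sketched as a toy model. This is an illustration only, not the YARN API: the class names mirror the slide, while `allocate`, `run`, `free_slots`, and the node and task names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks free container slots on one worker node (toy model)."""
    name: str
    free_slots: int


@dataclass
class ResourceManager:
    """Hands out container slots across nodes; knows nothing about MapReduce."""
    nodes: list

    def allocate(self, wanted: int) -> list:
        """Grant up to `wanted` containers, returned as a list of node names."""
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < wanted:
                node.free_slots -= 1
                granted.append(node.name)
        return granted


@dataclass
class ApplicationMaster:
    """Paradigm-specific logic: requests containers, then places its own tasks."""
    app_name: str
    tasks: list
    placement: dict = field(default_factory=dict)

    def run(self, rm: ResourceManager) -> dict:
        containers = rm.allocate(len(self.tasks))
        # zip() stops at the shorter list, so ungranted tasks stay unplaced.
        self.placement = dict(zip(self.tasks, containers))
        return self.placement


rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2)])
mr_am = ApplicationMaster("mr-job", ["map-0", "map-1", "reduce-0"])
placement = mr_am.run(rm)
print(placement)
```

The point of the sketch is that only `ApplicationMaster` knows what a map or reduce task is. The resource manager deals purely in slots, which is what lets other paradigms share the same cluster.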


Page 44: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

• More to come...


Page 45: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding


Page 46: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (lxc Containers)

• ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect
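Gang scheduling, named in the component list above, means an MPI job receives all of its ranks at once or none at all. A minimal sketch of that all-or-nothing rule follows. It is a generic illustration, not Hamster's scheduler: `gang_allocate` and the node names are hypothetical.

```python
def gang_allocate(free_slots: dict, ranks: int):
    """Return {node: slots_taken} covering all `ranks`, or None if they do not all fit.

    All-or-nothing: `free_slots` is only modified when the whole gang fits.
    """
    if sum(free_slots.values()) < ranks:
        return None  # a partial grant would leave MPI ranks waiting on missing peers
    plan, remaining = {}, ranks
    for node, free in free_slots.items():
        take = min(free, remaining)
        if take:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            break
    # Commit the plan only after it is known to cover every rank.
    for node, take in plan.items():
        free_slots[node] -= take
    return plan


cluster = {"nm1": 2, "nm2": 3}
first = gang_allocate(cluster, 4)   # fits: 5 free slots for 4 ranks
second = gang_allocate(cluster, 2)  # does not fit: only 1 slot is left
print(first, second, cluster)
```

A scheduler that granted containers one at a time could hand out the last free slot to the second job and stall it indefinitely, which is why the request is checked as a whole before any slot is taken.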


Page 47: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hamster Architecture

[Diagram labels: Client; Resource Manager (Scheduler, AMService); Node Manager (Aux Srvcs, Proc/Container); MPI AM (Scheduler, HNP); Framework Daemon, NS, MPI; links: Client-RM, RM-AM, AM-NM, RM-NodeManager]

Page 48: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Hamster Scalability

• Sufficient for small to medium HPC workloads

• Job launch time gated by YARN resource scheduler

          Launch    WireUp    Collectives  Monitor
OpenMPI   O(logN)   O(logN)   O(logN)      O(logN)
Hamster   O(N)      O(logN)   O(logN)      O(logN)
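The only column where the two rows differ is Launch. The gap between O(N) and O(logN) can be made concrete with a small count of launch steps. This is a generic illustration that assumes a doubling tree for the logarithmic case. It is not a model of how Open MPI or Hamster launch processes, and both function names are hypothetical.

```python
def sequential_launch_steps(n: int) -> int:
    """One launcher starts n processes one after another: n steps, i.e. O(N)."""
    return n


def tree_launch_rounds(n: int) -> int:
    """Every running process starts one more per round, so the count doubles.

    Returns the number of rounds until at least n processes are running,
    starting from a single process: O(log N).
    """
    running, rounds = 1, 0
    while running < n:
        running *= 2
        rounds += 1
    return rounds


for n in (16, 1024, 65536):
    print(n, sequential_launch_steps(n), tree_launch_rounds(n))
```

At 1,024 processes the sequential count is 1,024 steps against 10 doubling rounds, which is consistent with the slide's remark that Hamster suits small to medium HPC workloads.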


Page 49: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

GraphLab + Hamster on Hadoop

Page 50: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin at CMU in 2009

• GraphLab Inc. recently founded to commercialize Graphlab.org


Page 51: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (Pagerank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...
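As an example of the graph analytics in the list above, PageRank can be written as a short power iteration. The sketch below is plain single-machine Python, not the GraphLab API. The `pagerank` function, the damping value of 0.85, and the three-node graph are assumptions chosen for illustration, and every link target is assumed to also appear as a key.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank over {node: [outgoing neighbours]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each round only needs a vertex's own rank and its neighbours, which is the access pattern that graph-parallel frameworks distribute across machines.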


Page 52: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Graphs Alone are not Enough

• Full data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks
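The kind of data wrangling that MapReduce handles well can be sketched in a few lines as map, shuffle, and reduce phases. This is an in-memory illustration of the programming model, not Hadoop code. The function names and the "user,bytes" log format are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def parse(line):
    """Mapper: turn a raw 'user,bytes' line into a (user, bytes) pair."""
    user, size = line.split(",")
    yield user, int(size)


logs = ["alice,120", "bob,30", "alice,80"]  # hypothetical raw log lines
totals = reduce_phase(shuffle(map_phase(logs, parse)), lambda user, sizes: sum(sizes))
print(totals)
```

The cleaned per-user totals are the sort of output that could then feed a graph computation or a serving store, in line with the co-existence argument above.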


Page 53: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Call To Action


Page 54: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Prepare for Convergence

• HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks

• “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency

• Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization


Page 55: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Convergence

• Resource Allocation, Scheduling, Lifecycle Management

• Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs

• Authentication & Authorization, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering


Page 56: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

New Hardware Platforms

• Mellanox - Hadoop Acceleration through Network-assisted Merge

• RoCE - Brocade, Cisco, Extreme, Arista...

• ARM - Low power Hadoop servers

• SSD - Velobit, Violin, FusionIO, Samsung...

• Niche - Compression, Encryption...


Page 57: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Data Cloud of Future?

deploy

Public Cloud

Private Cloud

On Premise


Page 58: NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

Questions?