natc 2013 - future of data intensive applications by milind bhandarkar

Post on 26-Jan-2015

110 Views

Category:

Technology

4 Downloads

Preview:

Click to see full reader

DESCRIPTION

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

TRANSCRIPT

Future of Data Intensive Applications

Milind BhandarkarChief Scientist, Pivotal

@techmilind

Thursday, December 12, 2013

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Thursday, December 12, 2013

Thursday, December 12, 2013

Kryptonite: First Hadoop Cluster At Yahoo!

Thursday, December 12, 2013

M45Thursday, December 12, 2013

OpenCirrusThursday, December 12, 2013

Analytics Workbench

Thursday, December 12, 2013

Analytics Workbench

Thursday, December 12, 2013

Thursday, December 12, 2013

70% of data generated by

customers

80% of data being stored

3% being prepared for

analysis

0.5% being analyzed

<0.5% being operationalized

Average Enterprises

The Big GapThursday, December 12, 2013

Example: Healthcare

• In last 5 years

• 3,573 studies on hospital readmissions

• 9,745 papers on comparative effectiveness

• 39,230 studies on drug interactions

• 132,241 studies on hospital mortality

• Yet, very few models operational

Thursday, December 12, 2013

PowerPoint is where Models go to Die.

- Hulya Farinas, Principal Data Scientist, Pivotal

Thursday, December 12, 2013

ModernizationThursday, December 12, 2013

Building BlocksThursday, December 12, 2013

Structured

BI/Analytics Tools Data Science Applications

Semi-structured Unstructured

High-Speed Integration Data provisioning, shared security, coordinated transformation

(Big) Data Staging Platform

Analytic Data Warehouse

In Memory Data Grid

On-Demand, Self-Service Access

Meta-data driven access control

Modern Data Architecture

Thursday, December 12, 2013

Data Fabric Requirements

• Store massive & diverse data sets economically

• Integrate and Ingest from legacy & disparate sources

• Ability to rapidly analyze massive data sets

• Control, Auditing, Manageability

• Self-Service

Thursday, December 12, 2013

Data Fabric ArchitectureThursday, December 12, 2013

Infrastructure-As-A-Service is the new

“Hardware”

Thursday, December 12, 2013

IAAS: New Hardware

• AWS, GCE, Azure

• vSphere, OpenStack

• Easy Provisioning

• Scalable, Elastic, Ubiquitous

•Needs bundling with Data & Analytics as Services

Thursday, December 12, 2013

App Fabric Requirements

• IAAS Cloud-Agnostic

• Rapid provisioning, Elasticity

•Open, No-Lock-In, Data As-A-Service

• Automation for Application Lifecycle Management

•Developer Agility : Eliminate Infrastructure Wiring

Thursday, December 12, 2013

Thursday, December 12, 2013

EcosystemThursday, December 12, 2013

Broader EcosystemThursday, December 12, 2013

Legacy App Deployment

Thursday, December 12, 2013

!"#$%&%#'()*+(,-#./0(

((12"341()*+(,-#./0(

((!.&5()*+(2!!0(

((6%'/()*+(&4"$%,4&0(

((&,2-4()*+(2!!0(7899(

.!3"2/4()*+(,-#./0(

(

Modern App Deployment

Thursday, December 12, 2013

Infrastructure One

JVM

VM

Infrastructure One

Infrastructure Two

App

Container 1

App Server

JVM

Container 2

App Server

JVM

Dev Framework Dev Framework

App Server

Configurations Manifests, Automations

Infrastructure Two

JVM

VM

Dev Framework

App Server

Configurations

App App App

Application As Unit of Deployment

Thursday, December 12, 2013

Hadoop’s Role in Data Clouds

Thursday, December 12, 2013

Trough of Disillusionment ?

Thursday, December 12, 2013

Or, Hadoop Everywhere?

Thursday, December 12, 2013

Thursday, December 12, 2013

Thursday, December 12, 2013

Thursday, December 12, 2013

Thursday, December 12, 2013

Thursday, December 12, 2013

Game Changing Hadoop Economics

$-

$20,000

$40,000

$60,000

$80,000

2008 2009 2010 2011 2012 2013

Big Data Platform Price/TB

Big Data DB Hadoop

Thursday, December 12, 2013

Storage Options

• HDFS, MapR, Quantcast QFS

• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre

• Amazon S3, EMC Atmos, OpenStack Swift

• GlusterFS, Ceph

• EMC ViPR

Thursday, December 12, 2013

SQL-on-Hadoop

• Pivotal HAWQ

• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger

• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase

•More to come...

Thursday, December 12, 2013

!"#$%&'())'

BATCH

HDFS

!"#$%&'())'

INTERACTIVE

!"#$%&'())'

BATCH

HDFS

!"#$%&'())'

BATCH

HDFS

!"#$%&'())'

ONLINE

Hadoop 1.0(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

MapReduce 1.0

(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

Hadoop 2.0

(Image Courtesy Arun Murthy, Hortonworks)

HADOOP 1.0

!"#$%!"#$%&$'&()*"#+,'-+#*.(/"'0#1*

&'()*+,-*%!2+%.(#"*"#./%"2#*3'&'0#3#&(*

*4*$'('*5"/2#..,&01*

!"#$.%!"#$%&$'&()*"#+,'-+#*.(/"'0#1*

/0)1%!2+%.(#"*"#./%"2#*3'&'0#3#&(1*

2*3%!#6#2%7/&*#&0,&#1*

HADOOP 2.0

456%!$'('*8/91*

!57*%!.:+1*

%89:*;<%!2'.2'$,&01*

*

456%!$'('*8/91*

!57*%!.:+1*

%89:*;<%!2'.2'$,&01*

%

&)%!-'(2;1*

)2%%$9;*'=>%?;'(:%!"#$%&''()$*+,'

*

$*;75-*<%-.*/0'

*

Thursday, December 12, 2013

!""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+

35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2*

9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2***

:!;<3+=>&",04-%0?+

2.;@,!<;2A@+=;0B?+

7;,@!>2.C+=7D(EFG+7HGI?+

C,!J3+=C$E&"K?+

2.L>@>M,9+=7"&EN?+

3J<+>J2+=M"0)>J2?+

M.O2.@+=3:&*0?+

M;3@,+=70&E%K?+=P0&/0I?+

YARN Platform

(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*

+"',&-'$)*./.*

+"',&-'$)*0/1*

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*

!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*

+"',&-'$)*./0*

+"',&-'$)*./2*

3%*.*

+"',&-'$)*0/0*

+"',&-'$)*0/.*

+"',&-'$)*0/2*

3%0*

+4-$',0*

5$6"7)8$%&'&($)*

98:$#74$)*

YARN Architecture

(Image Courtesy Arun Murthy, Hortonworks)

Thursday, December 12, 2013

YARN

• Yet Another Resource Negotiator

• Resource Manager

•Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)

Thursday, December 12, 2013

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

•More to come ...

Thursday, December 12, 2013

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding

Thursday, December 12, 2013

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (lxc Containers)

•ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect

Thursday, December 12, 2013

Resource Manager

Scheduler

AMService

Node Manager Node Manager Node Manager !

Proc/Container

Framework Daemon NS MPI

Scheduler HNP

MPI AM

Proc/Container

! RM-AM

AM-NM

RM-NodeManager Client Client-RM

Aux Srvcs

Proc/Container

Framework Daemon NS

Proc/Container

!

Aux Srvcs RM-

NodeManager

Hamster ArchitectureThursday, December 12, 2013

Hamster Scalability

• Sufficient for small to medium HPC workloads

• Job launch time gated by YARN resource scheduler

Launch WireUp Collectives Monitor

OpenMPI O(logN) O(logN) O(logN) O(logN)

Hamster O(N) O(logN) O(logN) O(logN)

Thursday, December 12, 2013

GraphLab + Hamsteron Hadoop

!

Thursday, December 12, 2013

About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin in CMU in 2009

• Recently founded Graphlab Inc to commercialize Graphlab.org

Thursday, December 12, 2013

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (Pagerank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...

Thursday, December 12, 2013

Only Graphs are not Enough

• Full Data processing workflow required ETL/Postprocessing, Visualization, Data Wrangling, Serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks

Thursday, December 12, 2013

Call To Action

Thursday, December 12, 2013

Prepare for Convergence

• HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks

• “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency

•Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization

Thursday, December 12, 2013

Convergence

• Resource Allocation, Scheduling, Lifecycle Management

• Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs

• Auth & Auth, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering

Thursday, December 12, 2013

New Hardware Platforms

•Mellanox - Hadoop Acceleration through Network-assisted Merge

• RoCE - Brocade, Cisco, Extreme, Arista...

• ARM - Low power Hadoop servers

• SSD - Velobit, Violin, FusionIO, Samsung..

•Niche - Compression, Encryption...

Thursday, December 12, 2013

Data Cloud of Future?

depl

oy

Public Cloud

Private Cloud

On Premise

Thursday, December 12, 2013

Questions?Thursday, December 12, 2013

top related