future of data intensive applicaitons
Embed Size (px)
DESCRIPTION
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.TRANSCRIPT
Future of Data Intensive Applications Milind Bhandarkar Chief
Scientist, Pivotal @techmilind Thursday, December 12, 2013 About Me
http://www.linkedin.com/in/milindb Founding member of Hadoop team
atYahoo! [2005-2010] Contributor to Apache Hadoop since v0.1 Built
and led Grid SolutionsTeam atYahoo! [2007-2010] Parallel
Programming Paradigms [1989-today] (PhD cs.illinois.edu) Center for
Development of Advanced Computing (C-DAC), National Center for
Supercomputing Applications (NCSA), Center for Simulation of
Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale
Inc. (acquired by QLogic),Yahoo!, LinkedIn, and Pivotal (formerly
Greenplum) Thursday, December 12, 2013 Thursday, December 12, 2013
Kryptonite: First Hadoop Cluster AtYahoo! Thursday, December 12,
2013 M45 Thursday, December 12, 2013 OpenCirrus Thursday, December
12, 2013 Analytics Workbench Thursday, December 12, 2013 Analytics
Workbench Thursday, December 12, 2013 Thursday, December 12, 2013
70% of data generated by customers 80% of data being stored 3%
being prepared for analysis 0.5% being analyzed @>M,9+
=7"&EN?+ 3JJ2+ =M"0)>J2?+ [email protected]+ =3:&*0?+ M;[email protected],+
=70&E%K?+ =P0&/0I?+ YARN Platform (Image Courtesy Arun
Murthy, Hortonworks) Thursday, December 12, 2013
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
!"#$%&'&($)* +"',&-'$)*./.* +"',&-'$)*0/1*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
!"#$%&'&($)* !"#$%&'&($)* +"',&-'$)*./0*
+"',&-'$)*./2* 3%*.* +"',&-'$)*0/0* +"',&-'$)*0/.*
+"',&-'$)*0/2* 3%0* +4-$',0* 5$6"7)8$%&'&($)*
98:$#74$)* YARN Architecture (Image Courtesy Arun Murthy,
Hortonworks) Thursday, December 12, 2013 YARN Yet Another Resource
Negotiator Resource Manager Node Managers Application Masters
Specic to paradigm, e.g. MR Application master (aka JobTracker)
Thursday, December 12, 2013 Beyond MapReduce Apache Giraph - BSP
& Graph Processing Storm onYarn - Streaming Computation HOYA -
HBase onYarn Hamster - MPI on Hadoop More to come ... Thursday,
December 12, 2013 Hamster Hadoop and MPI on the same cluster
OpenMPI Runtime on Hadoop YARN Hadoop Provides: Resource
Scheduling, Process monitoring, Distributed File System Open MPI
Provides: Process launching, Communication, I/O forwarding
Thursday, December 12, 2013 Hamster Components Hamster Application
Master Gang Scheduler,YARN Application Preemption Resource
Isolation (lxc Containers) ORTE: Hamster Runtime Process
launching,Wireup, Interconnect Thursday, December 12, 2013 Resource
Manager Scheduler AMService Node Manager Node Manager Node Manager
! Proc/ Container Framework Daemon NS MPI Scheduler HNP MPI AM
Proc/ Container !RM-AM AM-NM RM-NodeManagerClient Client-RM Aux
Srvcs Proc/ Container Framework Daemon NS Proc/ Container ! Aux
Srvcs RM- NodeManager Hamster Architecture Thursday, December 12,
2013 Hamster Scalability Sufcient for small to medium HPC workloads
Job launch time gated byYARN resource scheduler Launch WireUp
Collectives Monitor OpenMPI O(logN) O(logN) O(logN) O(logN) Hamster
O(N) O(logN) O(logN) O(logN) Thursday, December 12, 2013 GraphLab +
Hamster on Hadoop ! Thursday, December 12, 2013 About GraphLab
Graph-based, High-Performance distributed computation framework
Started by Prof. Carlos Guestrin in CMU in 2009 Recently founded
Graphlab Inc to commercialize Graphlab.org Thursday, December 12,
2013 GraphLab Features Topic Modeling (e.g. LDA) Graph Analytics
(Pagerank,Triangle counting) Clustering (K-Means) Collaborative
Filtering Linear Solvers etc... Thursday, December 12, 2013 Only
Graphs are not Enough Full Data processing workow required ETL/
Postprocessing,Visualization, Data Wrangling, Serving MapReduce
excels at data wrangling OLTP/NoSQL Row-Based stores excel at
Serving GraphLab should co-exist with other Hadoop frameworks
Thursday, December 12, 2013 CallTo Action Thursday, December 12,
2013 Prepare for Convergence HPC: Cache Coherence, Prefetching,
Zero- copy, Low-contention locks Big Data: Caching, Mirroring,
Sharding (various avors), relaxed consistency Databases: Indexing,
MVCC, Columnar storage/processing, Cost-based optimization
Thursday, December 12, 2013 Convergence Resource Allocation,
Scheduling, Lifecycle Management Compute, Storage, and
Communication isolation, Multi-tenancy, Performance SLAs Auth &
Auth, Data/System Provisioning and Management, Monitoring, Metadata
Management, Metering Thursday, December 12, 2013 New Hardware
Platforms Mellanox - Hadoop Acceleration through Network-assisted
Merge RoCE - Brocade, Cisco, Extreme,Arista... ARM - Low power
Hadoop servers SSD -Velobit,Violin, FusionIO, Samsung.. Niche -
Compression, Encryption... Thursday, December 12, 2013 Data Cloud
of Future? deploy Public Cloud Private Cloud On Premise Thursday,
December 12, 2013 Questions? Thursday, December 12, 2013