spark one platform webinar

27
1 © Cloudera, Inc. All rights reserved. Uniting Spark & Hadoop The One Platform Initiative Doug Cutting | Chief Architect | Cloudera Anand Iyer | Senior Product Manager | Cloudera

Upload: cloudera-inc

Post on 15-Apr-2017

1.506 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Spark One Platform Webinar

1© Cloudera, Inc. All rights reserved.

Uniting Spark & HadoopThe One Platform Initiative

Doug Cutting | Chief Architect | ClouderaAnand Iyer | Senior Product Manager | Cloudera

Page 2: Spark One Platform Webinar

2© Cloudera, Inc. All rights reserved.

Agenda

• Emergence of Spark• Advantages of Spark• Spark replacing MapReduce• The “One Platform Initiative”• Future of Data Processing in Hadoop

Page 3: Spark One Platform Webinar

3© Cloudera, Inc. All rights reserved.

MapReduce: A great tool for its day

The original scalable, general, processing engine of Hadoop ecosystem- Useful across diverse problem domains- Fueled initial ecosystem explosion

MapReduce Execution Engine

Hive Pig Mahout SolrCrunch

Page 4: Spark One Platform Webinar

4© Cloudera, Inc. All rights reserved.

Enter Apache Spark

General purpose computational framework thatsubstantially improves on MapReduce

Key Properties:• Leverages distributed memory

• Full Directed Graph expressions for data parallel computations

• Simpler developer experience

Yet Retains:• Linear scalability

• Fault-tolerance

• Data locality-based computations

Page 5: Spark One Platform Webinar

5© Cloudera, Inc. All rights reserved.

Apache SparkFlexible, in-memory data processing for Hadoop

Easier More Powerful Faster

• Rich APIs for Scala, Java, and Python

• Interactive shell

• APIs for different types of workloads:• Batch • Streaming• Machine Learning• Graph

• In-Memory processing and caching

Page 6: Spark One Platform Webinar

6© Cloudera, Inc. All rights reserved.

Easy DevelopmentHigh Productivity Language Support

• Native support for multiple languages with identical APIs• Scala, Java, Python

• Use of closures, iterations, and other modern language constructs to minimize code• 2-5x less code

Pythonlines = sc.textFile(...)lines.filter(lambda s: “ERROR” in s).count()

Scalaval lines = sc.textFile(...)lines.filter(s => s.contains(“ERROR”)).count()

JavaJavaRDD<String> lines = sc.textFile(...);lines.filter(new Function<String, Boolean>() { Boolean call(String s) { return s.contains(“error”); }}).count();

Page 7: Spark One Platform Webinar

7© Cloudera, Inc. All rights reserved.

Easy DevelopmentUse Interactively

• Interactive exploration of data for data scientists•No need to develop

“applications”

• Developers can prototype application on live system

$ ./bin/spark-shell --master local[*]...Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)Type in expressions to have them evaluated.Type :help for more information....

scala> val words = sc.textFile("file:/usr/share/dict/words")...words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count...res0: Long = 235886

scala>

Page 8: Spark One Platform Webinar

8© Cloudera, Inc. All rights reserved.

Spark Takes Advantage of Memory

Resilient Distributed Datasets (RDD)•Memory caching layer that stores data in a distributed, fault-tolerant

cache• Can fall back to disk when data-set does not fit in memory

• Created by parallel transformations on data in stable storage• Provides fault-tolerance through concept of lineage

Page 9: Spark One Platform Webinar

9© Cloudera, Inc. All rights reserved.

The Spark Ecosystem & Hadoop

Spark Streaming MLlib SparkSQL GraphX Data-

frames SparkR

STORAGEHDFS, HBase

RESOURCE MANAGEMENTYARN

Spark Impala MR OthersSearch

Page 10: Spark One Platform Webinar

10© Cloudera, Inc. All rights reserved.

Cloudera is Driving the Spark Movement

2013 2014 2015 2016

Identified Spark’s early potential

Ships and Supports Spark with CDH 4.4

Added Spark on YARN integration

Announces initiative to make Spark the standard execution engine

Launches first Spark training

Added security integration

Cloudera engineers publish O’Reilly Spark book

Driving effort to further performance, usability, and enterprise-readiness

Page 11: Spark One Platform Webinar

11© Cloudera, Inc. All rights reserved.

Spark at Cloudera• Cloudera was the first Hadoop vendor to ship and support Spark

• Spark is a fully integrated part of Cloudera’s platform• Shared data, metadata, resource management, administration, security, and governance• Complements specialized analytic tools for comprehensive big data platform

• Cloudera is the first Hadoop vendor to offer Spark training• Trained more customers than any other vendor• Most popular training course

• Cloudera has 5x the engineering resources of the next competitor• Most committers on staff and most changes contributed• Well-trained staff across the globe with expertise implementing a broad range of Spark use cases

Page 12: Spark One Platform Webinar

12© Cloudera, Inc. All rights reserved.

Cloudera’s Engineering Commitment to Spark

Cloudera57%Intel

29%

Hortonworks14%

Spark Committers by Hadoop Distribution*

* IBM and MapR have 0 committers

Spark Patches by Hadoop Distribution

Cloudera, 451

Hortonworks, 10

IBM, 29

MapR, 1

Intel, 466

Page 13: Spark One Platform Webinar

13© Cloudera, Inc. All rights reserved.

Cloudera Customers

• More customers running Spark than all other vendors combined•Over 150 customers• Spark clusters as large as 800 nodes•Diverse range of use cases across multiple industries

• Search personalization• Genomics research• Insurance modeling• Advertising optimization• Predictive modeling of disease conditions

Page 14: Spark One Platform Webinar

14© Cloudera, Inc. All rights reserved.

Cloudera Customer Use CasesCore Spark Spark Streaming

• Portfolio Risk Analysis• ETL Pipeline Speed-Up• 20+ years of stock dataFinancial

Services

Health

• Identify disease-causing genes in the full human genome

• Calculate Jaccard scores on health care data sets

ERP

• Optical Character Recognition and Bill Classification

• Trend analysis • Document classification (LDA)• Fraud analyticsData

Services

1010

• Online Fraud DetectionFinancial Services

Health

• Incident Prediction for Sepsis

Retail

• Online Recommendation Systems• Real-Time Inventory Management

Ad Tech

• Real-Time Ad Performance Analysis

Page 15: Spark One Platform Webinar

15© Cloudera, Inc. All rights reserved.

Spark will replace MapReduceas the standard execution engine for Hadoop

Page 16: Spark One Platform Webinar

16© Cloudera, Inc. All rights reserved.

Community Initiative: Spark Supersedes MapReduce

Stage 1• Crunch on Spark• Search on Spark

Stage 2• Hive on Spark (beta)• Spark on HBase (beta)

Stage 3• Pig on Spark (alpha)• Sqoop on Spark

Cloudera is driving community development to port components to Spark:

Page 17: Spark One Platform Webinar

17© Cloudera, Inc. All rights reserved.

Uniting Spark and HadoopThe One Platform Initiative Investment Areas

ManagementLeverage Hadoop-nativeresource management.

SecurityFull support for Hadoop security

and beyond.

ScaleEnable 10k-node clusters.

StreamingSupport for 80% of common stream

processing workloads.

Page 18: Spark One Platform Webinar

18© Cloudera, Inc. All rights reserved.

ManagementLeverage Hadoop-native resource management

Spark-on-YARN• Drove Spark-On-YARN Integration• Improve Spark-On-YARN Integration for

better multi-tenancy, performance and ease of use

Metrics• Improved metrics for debugging and

monitoring• Improve metrics for visibility into resource

utilization• Revamp WebUI for better debugging and

monitoring (especially at high concurrency)

Automation• Dynamic Resource Allocation based on

needs of job• Smart auto-selection and tuning of job

parameters (when data volumes change)

Accessibility• SparkSQL & Hive integration

improvements• Easy Python dependency management

for PySpark

Page 19: Spark One Platform Webinar

19© Cloudera, Inc. All rights reserved.

SecurityFull support for Hadoop security and beyond

Perimeter• Kerberos Integration

Visibility• Audit/Lineage via Cloudera Navigator• Full Spark PCI compliance

Access• HDFS Sync (Apache Sentry)• Enable column- and view-level security

Data Protection• Integration with Intel’s Advanced

Encryption libraries

Page 20: Spark One Platform Webinar

20© Cloudera, Inc. All rights reserved.

ScaleEnable 10K-Node Clusters

Fault-Tolerance• Revamp Scheduler handling of node failure• Dynamic resource utilization &

prioritization

Performance• Task scheduling based on HDFS data

locality & HDFS caching (reduce data movement and enable more jobs)• Integrate with HDFS Discardable

Distributed Memory (reduce memory pressure)• Scheduler improvements for performance

at scale

Stability• Sort Based Shuffle Improvements for

improved stability at scale• Stress test at scale with mixed multi-tenant

workloads• Scale Spark History Server for 1000s of jobs

Page 21: Spark One Platform Webinar

21© Cloudera, Inc. All rights reserved.

StreamingSupport for 80% of common stream processing workloads

Zero Data Loss• Delivered full data resiliency

Management• Streaming application management via

Cloudera Manager for zero downtime

Ingest• Delivered Flume integration and drove

Kafka integration

Accessibility• SQL interfaces and API extensions for

Streaming jobs

Performance• Improved State Management to enable

maintaining a high volume of state information

Page 22: Spark One Platform Webinar

22© Cloudera, Inc. All rights reserved.

The Future of Data Processing on HadoopSpark complemented by specialized fit-for-purpose engines

General Data Processing w/Spark

Fast Batch Processing, Machine Learning, and Stream Processing

Analytic Database w/Impala

Low-LatencyMassively Concurrent

Queries

Full-Text Search w/Solr Querying textual data

On-Disk Processing w/MapReduceJobs at extreme scale and extremely disk IO intensive

Shared:• Data Storage• Metadata• Resource

Management• Administration• Security• Governance

Page 23: Spark One Platform Webinar

23© Cloudera, Inc. All rights reserved.

Cloudera is Built for Production Success

Hadoop delivers:• One place for unlimited data• Unified, multi-framework data access

Cloudera delivers:• Leading Performance• Enterprise Security• Data Management• Simple Administration

Security and Administration

Unlimited Storage

Process Discover Model Serve

DeploymentFlexibility

On-PremisesAppliancesEngineered Systems

Public CloudPrivate CloudHybrid Cloud

A modern data platform plus what the enterprise requires.

Page 24: Spark One Platform Webinar

24© Cloudera, Inc. All rights reserved.

Spark Resources• Learn Spark• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)• Cloudera Developer Blog• cloudera.com/spark

• Get Trained• Cloudera Spark Training

• Try it Out• Cloudera Live Spark Tutorial

Page 25: Spark One Platform Webinar

‹#›© 2015 Cloudera, Inc. All rights reserved.

The conference for and by Data Scientists, from startup to enterprise

wrangleconf.com

Public registration is now open!

Who: Featuring data scientists from Salesforce, Uber, Pinterest, and moreWhen: Thursday, October 22, 2015Where: Broadway Studios, San Francisco

Page 26: Spark One Platform Webinar
Page 27: Spark One Platform Webinar

27© Cloudera, Inc. All rights reserved.

Thank You!