TRANSCRIPT
How to deploy Apache Spark in a multi-tenant, on-premises environment
Adoption of Apache Spark is accelerating
• Spark adoption is growing rapidly – The number of contributors and end users is increasing at a substantial rate
• Spark is expanding beyond Hadoop – Spark is an integral component of new big data platforms, with support for pipelines, streaming, statistical analysis, SQL, and more
• A variety of use cases are being implemented – Use cases include recommendation systems, data warehousing, log processing, and more
• The programming paradigm is expanding – Supported languages include Java, Scala, Python, SQL, R, and more
Source: Spark Survey Report, 2015 (Databricks)
Top roles using Spark in the enterprise
• Data engineers – 41%
• Data scientists – 22.2%
• Architects – 17.2%
• Management – 10.6%
• Academia – 6.2%
• Other – 2.4%
Source: Spark Survey Report, 2015 (Databricks)
Spark infrastructure patterns
• Individual developers or data scientists who build their own infrastructure from VMs or bare-metal machines
• A top-down approach where everyone gets the same infrastructure/platform irrespective of their skill or use case
Developers / data scientists and Spark
• Mostly self-starters who identify a use case
• They build their own systems on laptops, VMs, or servers
• The complexity soon overwhelms them and restricts adoption
• They need help to scale deployment beyond the initial use case
Rigid on-premises infrastructure
• Infrastructure is often built by IT for generic use cases
• Flexibility to cater to different usage scenarios is lost
• Spark users' needs are constantly changing
• Upgrades become a challenge
Common Deployment Patterns
• Standalone mode – 48%
• YARN – 40%
• Mesos – 11%
Most Common Spark Deployment Environments (Cluster Managers)
Source: Spark Survey Report, 2015 (Databricks)
Scalable, self-service infrastructure
• IT controls machines, network, storage, and security
• Users create their own tenants and Spark clusters
• Teams can upgrade and scale their clusters independently
Big Data New Realities
Traditional assumption → New reality → New benefit and value
• Bare-metal → Containers and VMs → Big-Data-as-a-Service
• Data locality → Compute and storage separation → Agility and cost savings
• HDFS on local disks → In-place access on remote data stores (e.g. NFS, object storage) alongside local HDFS → Faster time-to-insights
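In practice, compute/storage separation means a Spark job reads data in place from a remote store rather than from HDFS on local disks. A minimal sketch, assuming an S3-compatible object store reachable via the s3a:// connector (the bucket name, credentials, and job file are illustrative placeholders):

```shell
# Submit a job that reads in place from an S3-compatible object store.
# The hadoop-aws package provides the s3a:// filesystem connector;
# pin the version that matches your Hadoop distribution.
spark-submit \
  --master yarn \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key="$S3_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$S3_SECRET_KEY" \
  etl_job.py s3a://example-bucket/raw-events/
```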
BlueData EPIC Software Platform
IOBoost™ - Extreme performance and scalability
ElasticPlane™ - Self-service, multi-tenant clusters
DataTap™ - In-place access to enterprise data stores
[Architecture diagram: the BlueData EPIC 2.0 platform hosting separate tenants (Marketing, R&D, Sales, Manufacturing, Support), accessed by BI/analytics tools, with in-place access to NFS, Gluster, object stores, remote HDFS, and Ceph]
Deployment flexibility for Spark
• Physical Machines or VMs as hosts
• Docker containers as nodes
• Networking and security enabled
• Standalone or YARN-based deployment
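A single Spark application can target either cluster manager by changing the `--master` URL at submission time. A minimal sketch (host names, ports, and the application file are illustrative placeholders):

```shell
# Standalone mode: point spark-submit at the standalone master's URL
# (7077 is the default standalone master port).
spark-submit \
  --master spark://spark-master.example.com:7077 \
  --deploy-mode client \
  my_app.py

# YARN mode: pass "yarn" as the master; the ResourceManager address is
# resolved from the Hadoop configuration in HADOOP_CONF_DIR.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
```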
Support for all types of Spark users
• Integrated web-based notebook support for data analysts
• Command line support for data engineers and data scientists
• API support for building custom pipelines
• Multiple language support, including SQL, R, and streaming
• JDBC support for business intelligence tools
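Spark's built-in Thrift JDBC/ODBC server is one common way to expose Spark SQL to BI tools. A hedged sketch, assuming a YARN deployment and default ports (the host name and table are placeholders):

```shell
# Start the HiveServer2-compatible Thrift server that ships with Spark
# (listens on port 10000 by default).
"$SPARK_HOME"/sbin/start-thriftserver.sh --master yarn

# Any JDBC client can then query Spark SQL tables; beeline is the
# command-line client bundled with Spark.
beeline -u jdbc:hive2://spark-edge.example.com:10000 \
  -e "SELECT COUNT(*) FROM events"
```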
Simple and easy Spark cluster creation
Instant Spark analysis and visualization
• Web-based notebook with integrated Spark cluster
• Support for various languages and Zeppelin interpreters
• Fully provisioned Hadoop Distributed File System (HDFS)
• Support for persistent tables
• Iterative analysis and visualization
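For context, a comparable open-source setup would pair a Spark cluster with Apache Zeppelin, the notebook and interpreter technology referenced above. Illustrative only; in the EPIC platform the notebook arrives pre-integrated with the cluster:

```shell
# Start a stock Apache Zeppelin notebook server (web UI on port 8080
# by default); %spark, %sql, and other interpreters are then available
# for iterative analysis and visualization.
"$ZEPPELIN_HOME"/bin/zeppelin-daemon.sh start
```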
App Store for Spark and Big Data tools
One-click Big Data app deployment
www.bluedata.com