Post on 19-Aug-2015

Getting Started Running Apache Spark on Apache Mesos, 2014-01-24

Paco Nathan, liber118.com/pxn, @pacoid

Spark on Mesos, 2014-01-24

• what is Apache Mesos?

• launch a Mesos cluster in the cloud

• configure and run Spark on Mesos

• run jobs in Spark

• further resources…

Datacenter Computing

Google has been doing datacenter computing for years, to address the complexities of large-scale data workflows:

• leveraging the modern kernel: isolation in lieu of VMs

• among the top 10 Linux kernel OSS contributors: cgroups

• “most (>80%) jobs are batch jobs, but the majority of resources (55–80%) are allocated to service jobs”

• mixed workloads, multi-tenancy

• relatively high utilization rates

• JVM? not so much…

take-aways: scheduling batch is not so difficult; scheduling services is hard+expensive

Google describes the business case…

Taming Latency Variability
Jeff Dean
plus.google.com/u/0/+ResearchatGoogle/posts/C1dPhQhcDRv

“Return of the Borg”

Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso, Urs Hölzle
research.google.com/pubs/pub35290.html

2011 GAFS Omega
John Wilkes, et al.
youtu.be/0ZFMlO98Jkc

Google describes the technology…

Omega: flexible, scalable schedulers for large compute clusters
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, John Wilkes
eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf

Mesos – open source datacenter computing

a common substrate for cluster computing

mesos.apache.org

heterogeneous assets in your datacenter or cloud made available as a homogeneous set of resources

• top-level Apache project

• scalability to 10,000s of nodes

• obviates the need for virtual machines

• isolation (pluggable) for CPU, RAM, I/O, FS, etc.

• fault-tolerant leader election based on Zookeeper

• APIs in C++, Java, Python, Go

• web UI for inspecting cluster state

• available for Linux, OpenSolaris, Mac OS X

Mesos – architecture

[Diagram: layered stack. Layer labels: Workloads (services, batch); Apps; Frameworks; Kernel; DFS (distributed file system); Cluster (distributed resources: CPU, RAM, I/O, FS, rack locality, etc.). Components shown: Chronos, Marathon, Storm, Spark, Hadoop, MPI, Kafka, JBoss, Django, Rails, Shark, Impala, Scalding, MySQL.]

Mesos – architecture

HDFS, distrib file system

Mesos, distrib kernel

meta-frameworks: Aurora, Marathon

frameworks: Spark, Storm, MPI, Jenkins, etc.

task schedulers: Chronos, etc.

APIs: C++, JVM, Py, Go

apps: HA services, web apps, batch jobs, scripts, etc.

Linux: libcgroup, libprocess, libev, etc.

Mesos – dynamics

Mesos: distrib kernel

Marathon: distrib init.d

Chronos: distrib cron

distrib frameworks

HA services

scheduled apps

Mesos – dynamics

[Diagram: Mesos slaves report available resources to the Mesos masters (the distributed kernel); the masters make resource offers to a distributed framework, which consists of a Scheduler and several Executors.]

Production Deployments (public)

Case Study: Twitter (bare metal / on premise)

“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It’s one of the primary keys to our data center efficiency.”

Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
wired.com/gadgetlab/2013/11/qa-with-chris-fry/

• key services run in production: analytics, typeahead, ads

• Twitter engineers rely on Mesos to build all new services

• instead of thinking about static machines, engineers think about resources like CPU, memory and disk

• allows services to scale and leverage a shared pool of servers across datacenters efficiently

• reduces the time between prototyping and launching

http://elastic.mesosphere.io

launch a Mesos cluster in the Amazon AWS cloud in three simple steps, given:

• AWS credentials

• SSH public key

• email address

http://mesosphere.io/learn/run-spark-on-mesos/

configure and run Spark on a Mesos cluster on AWS, in a seven-step tutorial…

step 1: ssh to master

ssh -l ubuntu <master>

step 2: install git, jdk-7

sudo aptitude -y install git
sudo aptitude -y install openjdk-7-jdk

step 3: download spark

wget http://spark-project.org/download/spark-0.8.0-incubating-bin-cdh4.tgz
tar xzf spark-0.8.0-incubating-bin-cdh4.tgz
cd spark-0.8.0-incubating-bin-cdh4/

step 4: sbt clean assembly

SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.4.0 sbt/sbt clean assembly

step 5: make distro, cp to HDFS

./make-distribution.sh --hadoop 2.0.0-mr1-cdh4.4.0
mv dist spark-0.8.0-2.0.0-mr1-cdh4.4.0
tar czf spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz spark-0.8.0-2.0.0-mr1-cdh4.4.0

hadoop fs -mkdir /tmp
hadoop fs -put spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz /tmp
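The tarball name built in step 5 has to match the SPARK_EXECUTOR_URI set in step 6, since the executors fetch that archive from HDFS. A minimal sketch of deriving both from one pair of version strings follows; the variable names and the `executor_uri` helper are illustrative assumptions, not part of the tutorial.

```shell
# Hypothetical helper (not from the tutorial): derive the distribution name
# and the executor URI from one pair of version strings, so that steps 5
# and 6 cannot drift apart.
SPARK_VERSION="0.8.0"
HADOOP_VERSION="2.0.0-mr1-cdh4.4.0"
DIST_NAME="spark-${SPARK_VERSION}-${HADOOP_VERSION}"

# Print the HDFS URI of the uploaded tarball for a given namenode host.
executor_uri() {
    namenode="$1"
    echo "hdfs://${namenode}/tmp/${DIST_NAME}.tgz"
}
```

Calling `executor_uri <nn>` with the namenode host gives a value of the same shape as the SPARK_EXECUTOR_URI setting in step 6.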

step 6: config env

cd conf/
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

# add these lines to spark-env.sh:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://<nn>/tmp/spark-0.8.0-2.0.0-mr1-cdh4.4.0.tgz
export MASTER=zk://<master>:2181/mesos

cat spark-env.sh
cd ..

./spark-shell

et voilà!
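If the shell fails to register with Mesos, a first thing to rule out is a missing setting from step 6. The following is a minimal sketch of a pre-flight check, not part of the original tutorial: the function name `check_spark_env` is hypothetical, and it only verifies that the three variables are non-empty, not that the library path or URIs are reachable.

```shell
# Hypothetical pre-flight helper (not from the tutorial): reports which of
# the three step-6 variables are unset or empty in the current shell.
check_spark_env() {
    missing=""
    for name in MESOS_NATIVE_LIBRARY SPARK_EXECUTOR_URI MASTER; do
        # Indirect lookup of the variable named by $name, using POSIX eval.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

One way to use it, assuming the current directory is the Spark directory from step 3, is `. conf/spark-env.sh && check_spark_env && ./spark-shell`.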

http://spark.incubator.apache.org/examples.html

run an example job in Spark, to filter an RDD of integers, in two steps at the REPL…

step 1: create an RDD

val data = 1 to 10000
val distData = sc.parallelize(data)

step 2: run the filter

distData.filter(_ < 10).collect()

Join us!

O’Reilly Strata, Santa Clara, Feb 11-13
strataconf.com/strata2014

Mesos tutorial, Tue 2/11 1:30pm
BOF lunch, Wed 2/12 12:10pm
Mesos session, Thu 2/13 2:20pm
office hours, Thu 2/13 3:15pm

More insights…

Monthly newsletter for events, conf summaries, workshops, etc.: liber118.com/pxn/

collected Mesos notes: goo.gl/jPtTP
