Scaling Big Data with Hadoop And Mesos

Upload: discover-pinterest

Post on 19-Aug-2014

Category:

Engineering


DESCRIPTION

As a company starts dealing with large amounts of data, operations engineers are challenged with managing the influx of information while keeping that data resilient. Hadoop HDFS, Mesos and Spark help reduce these issues by providing a scheduler that allows cluster resources to be shared. Together they provide a common ground where data scientists and engineers can meet, develop high-performance data processing applications and deploy their own tools.

TRANSCRIPT

Page 1: Scaling Big Data with Hadoop and Mesos

Scaling Big Data with Hadoop and Mesos

Page 2: Scaling Big Data with Hadoop and Mesos

Bernardo Gomez Palacio

Software Engineer at Guavus Inc

Page 3: Scaling Big Data with Hadoop and Mesos

Beyond Buzz Words

Page 4: Scaling Big Data with Hadoop and Mesos

Mesos and Data Analysis

Yes, you don't need Hadoop to start using Mesos and Spark.

Page 5: Scaling Big Data with Hadoop and Mesos

Now, If You...

- Need to store large files? By default each block is 128MB.

- Have data that is written mainly as new files or by appending to existing ones?
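To make the 128MB block size concrete, here is a minimal sketch of how many HDFS blocks a file of a given size would occupy. The object and method names are made up for illustration; only the 128MB default comes from the slide.

```scala
object HdfsBlocks {
  // Default HDFS block size quoted on the slide: 128 MB.
  val DefaultBlockSizeBytes: Long = 128L * 1024 * 1024

  /** Number of blocks needed to hold `fileSizeBytes`, using ceiling division. */
  def blocksFor(fileSizeBytes: Long, blockSizeBytes: Long = DefaultBlockSizeBytes): Long = {
    require(fileSizeBytes >= 0, "file size must be non-negative")
    require(blockSizeBytes > 0, "block size must be positive")
    (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes
  }
}
```

A 1GB file maps to 8 blocks, while a 1KB file still needs a block entry of its own, which is one reason HDFS suits large files better than many small ones.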

Page 6: Scaling Big Data with Hadoop and Mesos

Convinced you want to jump on the Hadoop bandwagon?

Read:

Sammer, Eric. "Hadoop Operations." Sebastopol, CA: O'Reilly, 2012. Print.

Page 7: Scaling Big Data with Hadoop and Mesos

Welcome to the Jungle

Page 8: Scaling Big Data with Hadoop and Mesos

Version Hell

Page 9: Scaling Big Data with Hadoop and Mesos

Distributions

Apache Bigtop, CDH, HDP, MapR

Page 10: Scaling Big Data with Hadoop and Mesos

Hadoop

HDFS

MRV1

MRV2

Page 11: Scaling Big Data with Hadoop and Mesos

Assuming You Already Have Mesos

- Mesosphere packages: https://mesosphere.io/downloads/

- From source: https://github.com/apache/mesos

Page 12: Scaling Big Data with Hadoop and Mesos

Hadoop MRV1 in Mesos

https://github.com/mesos/hadoop

Page 13: Scaling Big Data with Hadoop and Mesos

Hadoop MRV1 in Mesos

- Requires Hadoop MRV1

- Officially works with CDH5 MRV1

- Apache Hadoop 0.22, 0.23 and 1+

- Apache Hadoop 2+ doesn't come with MRV1!

Page 14: Scaling Big Data with Hadoop and Mesos

Hadoop MRV1 in Mesos

- Requires a JobTracker.

- By default it uses the org.apache.hadoop.mapred.JobQueueTaskScheduler

- You can change it, e.g. to ...mapred.FairScheduler

Page 15: Scaling Big Data with Hadoop and Mesos

Hadoop MRV1 in Mesos

- Requires TaskTracker.

- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker.

- And not org.apache.hadoop.mapred.TaskTracker.java.

Page 16: Scaling Big Data with Hadoop and Mesos

How Does Hadoop MRV1 Run in Mesos?

Page 17: Scaling Big Data with Hadoop and Mesos

How Does Hadoop MRV1 in Mesos Work?

1. The framework's Mesos scheduler creates the JobTracker as part of the driver.

2. The JobTracker will use org.apache.hadoop.mapred.MesosScheduler to launch tasks.

Page 18: Scaling Big Data with Hadoop and Mesos

Mesos Hadoop Task Scheduling

- mapred.mesos.slot.cpus (1)

- mapred.mesos.slot.disk (1024MB)

- mapred.mesos.slot.mem (1024MB)
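As a rough illustration of what these per-slot settings imply, the sketch below estimates how many map/reduce slots fit into a single Mesos resource offer. It is a simplified model, not the framework's actual scheduling code: `SlotSpec`, `Offer` and `slotsIn` are hypothetical names, and the real scheduler likely accounts for more than this (for example the TaskTracker's own overhead).

```scala
object MesosSlots {
  // Per-slot requirements, mirroring the mapred.mesos.slot.{cpus,mem,disk} defaults on the slide.
  final case class SlotSpec(cpus: Double = 1.0, memMb: Long = 1024L, diskMb: Long = 1024L)

  // Hypothetical, simplified view of the resources carried by one Mesos offer.
  final case class Offer(cpus: Double, memMb: Long, diskMb: Long)

  /** Whole slots that fit in an offer; the scarcest resource is the limit. */
  def slotsIn(offer: Offer, slot: SlotSpec = SlotSpec()): Int = {
    require(slot.cpus > 0 && slot.memMb > 0 && slot.diskMb > 0, "slot sizes must be positive")
    val byCpu  = math.floor(offer.cpus / slot.cpus).toLong
    val byMem  = offer.memMb / slot.memMb
    val byDisk = offer.diskMb / slot.diskMb
    math.max(0L, List(byCpu, byMem, byDisk).min).toInt
  }
}
```

With the default slot size, an offer of 4 CPUs, 8192MB of memory and 2048MB of disk yields 2 slots, because disk runs out first.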

Page 19: Scaling Big Data with Hadoop and Mesos

Additional Mesos parameters

- mapred.mesos.checkpoint (false)

- mapred.mesos.role (*)

Page 20: Scaling Big Data with Hadoop and Mesos

Thoughts

What about Hadoop 2.4?

Namenode HA?

MRV2 and YARN?

Page 21: Scaling Big Data with Hadoop and Mesos

Personal Preference

- Use Hadoop 2.4.0 or above.

- NameNode HA through the Quorum Journal Manager.

- Move to Spark if possible.

Page 22: Scaling Big Data with Hadoop and Mesos

Example of a Mesos Data Analysis Stack

1. HDFS stores files.

2. Use the Spark CLI to test ideas.

3. Use Spark Submit for jobs.

4. Use Chronos or Oozie to schedule workflows.

Page 23: Scaling Big Data with Hadoop and Mesos

Spark On Mesos

Page 24: Scaling Big Data with Hadoop and Mesos

Spark On Mesos

https://spark.apache.org/docs/latest/img/cluster-overview.png

Page 25: Scaling Big Data with Hadoop and Mesos

Know that Each Spark Application

1. Has its own driver process.

2. Has its own RDDs

3. Has its own cache.

Page 26: Scaling Big Data with Hadoop and Mesos

Spark Schedulers on Mesos

Fine Grained

Coarse Grained

Page 27: Scaling Big Data with Hadoop and Mesos

Spark Fine Grained Scheduling

- Enabled by default.

- Each Spark task runs as a separate Mesos task.

- Has an overhead in launching each task.

Page 28: Scaling Big Data with Hadoop and Mesos

Spark Coarse Grained Scheduling

- Uses only one long-running Spark task on each Mesos slave.

- Dynamically schedules its own “mini-tasks”, using Akka.

- Lower startup overhead.

- Reserves the cluster resources for the complete duration of the application.

Page 29: Scaling Big Data with Hadoop and Mesos

Beware of...

- Greedy scheduling (Coarse Grained)

- Overcommitting and deadlocks (Fine Grained)

Page 30: Scaling Big Data with Hadoop and Mesos

Using Spark

Understand Parametrization and Usage

- spark.app.name

- spark.executor.memory

- spark.serializer

- spark.local.dir

- ....

Page 31: Scaling Big Data with Hadoop and Mesos

Use Spark Submit

Avoid parametrizing the Spark Context in your code as much as possible.

Leverage the spark-submit arguments, properties files as well as environment variables to configure your application.
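One way to picture this advice is to keep the settings in a properties file and pass them on the spark-submit command line. The sketch below is plain Scala that only renders such a file and argument list. The `SubmitCommand` object, the example values and the paths are hypothetical; the property keys are the ones named in this deck.

```scala
object SubmitCommand {
  /** Render settings one per line as `key value`, the layout Spark reads from a properties file. */
  def propertiesFile(settings: Seq[(String, String)]): String =
    settings.map { case (key, value) => s"$key $value" }.mkString("\n")

  /** Assemble the spark-submit argument list; every value is supplied by the caller. */
  def sparkSubmit(master: String, mainClass: String, propertiesPath: String, appJar: String): Seq[String] =
    Seq("spark-submit",
      "--master", master,
      "--class", mainClass,
      "--properties-file", propertiesPath,
      appJar)

  // Illustrative values only (hypothetical application name, sizes and paths).
  val exampleSettings: Seq[(String, String)] = Seq(
    "spark.app.name" -> "example-app",
    "spark.executor.memory" -> "4g",
    "spark.serializer" -> "org.apache.spark.serializer.KryoSerializer",
    "spark.local.dir" -> "/tmp/spark"
  )
}
```

Because nothing here is hard-coded into the application, the same jar can be pointed at a different master or given different memory settings without recompiling.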

Page 32: Scaling Big Data with Hadoop and Mesos

Using Spark

Accept That Tuning is a Science & an Art

Page 33: Scaling Big Data with Hadoop and Mesos

Understand and Tune Your Applications

- Know your working set.

- Understand Spark partitioning and block management.

- Define your Spark workflow and where to cache/persist.

- If you cache, you will serialize; use Kryo.

Page 34: Scaling Big Data with Hadoop and Mesos

Example Spark API PairRDDFunctions

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)]

Page 35: Scaling Big Data with Hadoop and Mesos

PairRDDFunctions.combineByKey

- Combines the elements for each key using a custom set of aggregation functions.

- Turns an RDD[(K, V)] into an RDD[(K, C)].

Page 36: Scaling Big Data with Hadoop and Mesos

PairRDDFunctions.combineByKey

- createCombiner: turns a V into a C.

- mergeValue: merges a V into a C.

- mergeCombiners: combines two C's into a single one.

The partitioner defaults to HashPartitioner.
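To see how the three functions cooperate, here is a minimal plain-Scala model of combineByKey that runs without Spark. It is a sketch of the semantics, not Spark's implementation: each inner `Seq` stands in for one partition, and the `CombineByKeyModel` object and its `averages` helper are hypothetical names.

```scala
object CombineByKeyModel {
  /** Plain-collections model of combineByKey; each inner Seq stands in for one partition. */
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Step 1: within each partition, start a combiner on a key's first value, then fold in the rest.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        val combined = acc.get(k) match {
          case Some(existing) => mergeValue(existing, v)
          case None           => createCombiner(v)
        }
        acc.updated(k, combined)
      }
    }
    // Step 2: across partitions, merge the combiners that share a key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, partial) =>
      partial.foldLeft(acc) { case (merged, (k, c)) =>
        merged.updated(k, merged.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }
  }

  /** Example: per-key average, using a (sum, count) pair as the combiner type C. */
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](partitions)(
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (k, (sum, count)) => (k, sum / count) }
}
```

The average is a useful test case because V (a Double) and C (a sum and a count) differ, which is exactly the situation combineByKey is designed for.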

Page 37: Scaling Big Data with Hadoop and Mesos

Example Spark API PairRDDFunctions

self: RDD[(K, V)]

def aggregateByKey[U: ClassTag](zeroValue: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]

Uses the default partitioner.
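The roles of zeroValue, seqOp and combOp can be sketched the same way, in plain Scala without Spark. This is a model of the semantics under the assumption that each inner `Seq` is one partition. `AggregateByKeyModel` and `distinctValues` are hypothetical names, and Spark's own implementation differs in detail.

```scala
object AggregateByKeyModel {
  /** Plain-collections model of aggregateByKey; each inner Seq stands in for one partition. */
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]])(zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Within a partition, every key starts from zeroValue and folds its values in with seqOp.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Across partitions, partial results for the same key are merged with combOp.
    perPartition.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }

  /** Example: the set of distinct values seen for each key. */
  def distinctValues(partitions: Seq[Seq[(String, Int)]]): Map[String, Set[Int]] =
    aggregateByKey[String, Int, Set[Int]](partitions)(Set.empty[Int])(_ + _, _ ++ _)
}
```

Compared with combineByKey, the zero value replaces createCombiner, which keeps the call shorter when a natural empty result exists.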

Page 38: Scaling Big Data with Hadoop and Mesos

Understand your Data

Page 39: Scaling Big Data with Hadoop and Mesos

Tune your Data

- For each data source, understand its optimal block size.

- Leverage Avro as the serialization format.

- Leverage Parquet as the storage format.

- Try to keep your Avro & Parquet schemas flat.

Page 40: Scaling Big Data with Hadoop and Mesos

Suggestions

Page 41: Scaling Big Data with Hadoop and Mesos

Each Application

- Instrument the Code.

- Measure Input size in number of records and byte size.

- Measure Output size in the same way.
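A minimal sketch of that kind of instrumentation, assuming text records and UTF-8 encoding: it counts records and bytes in one pass over an iterator. The `SizeMetrics` object is hypothetical, and in a real Spark job the same counts would typically be gathered with accumulators rather than a local fold.

```scala
import java.nio.charset.StandardCharsets

object SizeMetrics {
  // Record count and total encoded size of a data set.
  final case class Size(records: Long, bytes: Long)

  /** Count records and UTF-8 bytes in a single pass over the input. */
  def measure(lines: Iterator[String]): Size =
    lines.foldLeft(Size(0L, 0L)) { (size, line) =>
      Size(size.records + 1, size.bytes + line.getBytes(StandardCharsets.UTF_8).length)
    }
}
```

Running the same measurement on a job's input and on its output gives a simple ratio to track as the data grows.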

Page 42: Scaling Big Data with Hadoop and Mesos

Standardize

- JDK & JRE version across your cluster.

- The Spark version across your cluster.

- The libraries that will be added to the JVM classpath by default.

- A packaging strategy for your application, e.g. an uber jar.

Page 43: Scaling Big Data with Hadoop and Mesos

About YARN and Spark

Page 44: Scaling Big Data with Hadoop and Mesos

Some Differences with YARN

- Execution: cluster vs client modes.

- Isolation: process vs cgroups.

- Docker support? LXC templates?

- Deployment complexity?

Page 45: Scaling Big Data with Hadoop and Mesos

Wrapping Up

Page 46: Scaling Big Data with Hadoop and Mesos

Some Ideas..

Page 47: Scaling Big Data with Hadoop and Mesos

References

1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014.

2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014.

Page 48: Scaling Big Data with Hadoop and Mesos

References

1. Sammer, Eric. Hadoop Operations. Sebastopol, CA: O'Reilly, 2012. Print.

2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.

3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014.

Page 49: Scaling Big Data with Hadoop and Mesos

References

1. Ryza, Sandy. "Managing Multiple Resources in Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014.

Page 50: Scaling Big Data with Hadoop and Mesos

Thank you! ✌