installing hadoop / spark from scratch

Big Data Developer meetup

Installing Apache Hadoop and Spark from scratch

Ljubljana, June 2016

Agenda

Why do you need Hadoop

What do you need before you install Apache Hadoop

Hadoop distributions

Hadoop components you need to know about

About Spark

Installation process walk-through

Adding cluster nodes

Ways to automate

Zero-install options

Why do you need Apache Hadoop

License – free

Scalable

General-purpose MPP engine

Distributed storage

Packed with tools

Backend for your Big Data project

What do you need before you install Hadoop and Spark

A server (or servers)

An installed OS (for the IBM distribution: RHEL 6.5-7 or SUSE 11 SP3)

A Hadoop distribution (more later)

Or avoid all that trouble by using a VM / Docker if you are just playing (more later)

Apache Hadoop Distributions

Hortonworks HDP

Cloudera CDH

IBM IOP (today’s focus)

A number of others

Distributions are similar but not identical, much like Linux distributions

Some are part of ODP (Open Data Platform), some are not

Hadoop components you need to know about

YARN – resource manager

MapReduce

Ambari

ZooKeeper

Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing

– Fast

•Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes even when no jobs are running

•Faster than MapReduce

– General purpose

•Covers a wide range of workloads

•Provides SQL, streaming and complex analytics

– Flexible and easier to use than MapReduce

•Spark is written in Scala, an object-oriented, functional programming language

•Scala, Python and Java APIs

•Scala and Python interactive shells

•Runs on Hadoop, Mesos, standalone or

Logistic regression in Hadoop and Spark (chart from http://spark.apache.org)

Installation process walk-through

Review the requirements

Review the installation docs

Get IOP software: http://www-01.ibm.com/support/docview.wss?uid=swg24040517

Prereqs

Install OS

Set up yum repository

Install prerequisites

• yum install nc

Full list of preparation steps

Make sure your hostname is in /etc/hosts

Tweak some settings (disable Transparent Huge Pages)

• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Generate ssh key and set up passwordless ssh

• ssh-keygen

• chmod 700 ~/.ssh

• Check with ssh localhost
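The ssh steps above leave out one detail: the generated public key has to end up in authorized_keys before "ssh localhost" works without a password. A minimal sketch of that step follows. The function name authorize_key and its optional second argument are hypothetical helpers introduced here for illustration, not part of the original slides.

```shell
# Append a public key to authorized_keys with the permissions sshd expects.
# Arguments: path to the .pub file, then the ssh directory (defaults to ~/.ssh).
# authorize_key is a hypothetical helper name, not a standard tool.
authorize_key() {
    pub_file="$1"
    ssh_dir="${2:-$HOME/.ssh}"
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    # Skip the append if this exact key line is already present.
    if ! grep -qxF -- "$(cat "$pub_file")" "$ssh_dir/authorized_keys"; then
        cat "$pub_file" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}

# Typical use on a real machine (not run here):
#   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
#   authorize_key ~/.ssh/id_rsa.pub
#   ssh localhost true
```

Calling the function twice with the same key leaves a single entry, so it is safe to re-run while preparing a server.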

Prereqs (cont.)

Disable IPv6

Configure ulimit

• /etc/security/limits.conf

Disable SELinux

Set up NTP on all servers
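Before moving on to Ambari it can help to check that the preparation steps actually took effect. The sketch below only reads state and changes nothing. The function names are hypothetical, and it assumes the Transparent Huge Pages control file lives at the RHEL 6 path from the earlier slide or at /sys/kernel/mm/transparent_hugepage/enabled, which is where newer kernels usually keep it.

```shell
# Print the active THP mode, i.e. the bracketed word in the sysfs "enabled" file.
# The file content looks like: "always madvise [never]"
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Print the path of whichever THP control file exists on this host (assumed paths).
thp_file() {
    for f in /sys/kernel/mm/redhat_transparent_hugepage/enabled \
             /sys/kernel/mm/transparent_hugepage/enabled; do
        if [ -f "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Read-only report of a few prerequisites from the slides.
report_prereqs() {
    if f=$(thp_file); then
        echo "THP: $(thp_mode "$f")"
    else
        echo "THP: control file not found"
    fi
    echo "open files limit: $(ulimit -n)"
    if command -v getenforce >/dev/null 2>&1; then
        echo "SELinux: $(getenforce)"
    else
        echo "SELinux: getenforce not available"
    fi
}

report_prereqs
```

On a prepared server the THP line should read "never" and the SELinux line "Disabled" or "Permissive"; the open files limit depends on what was put into /etc/security/limits.conf.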

First step – install Ambari

Install repository

• yum install iop-4.1.0.0-1.<version>.<platform>.rpm

Install Ambari

• yum install ambari-server

Set up Ambari server

• sudo ambari-server setup

Start Ambari server

• ambari-server start

Go to the Ambari interface at http://<your-ip>:8080

• Default user/pass = admin/admin

Launch installation wizard
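Once the server is started, a quick way to confirm it is reachable without opening a browser is to call the Ambari REST API. This is a rough sketch: the helper names and the AMBARI_USER / AMBARI_PASS variables are hypothetical, the port and default credentials are the ones from the slide, and the address in the example is a placeholder.

```shell
# Build the base URL of the Ambari REST API for a host (port 8080 as on the slide).
ambari_api_url() {
    echo "http://$1:${2:-8080}/api/v1"
}

# Request the cluster list and print only the HTTP status code.
# AMBARI_USER and AMBARI_PASS are hypothetical overrides for the default login.
ambari_ping() {
    curl -s -o /dev/null -w '%{http_code}' \
        -u "${AMBARI_USER:-admin}:${AMBARI_PASS:-admin}" \
        "$(ambari_api_url "$1")/clusters"
}

# Example (not run here):
#   ambari_ping 192.0.2.10
```

A status of 200 suggests the server is up and the login works; anything else is a hint to look at "ambari-server status" and the server log.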

Ambari installation

Next-next-next

Provide cluster name

Provide private ssh key

Choose services

Assign masters

Assign slaves and clients

Customize services

Here you would have to set up proper DB server connections in your prod environment

Review and deploy

Validate

Adding a new cluster node

Create a new server with the same prerequisites

Make sure that passwordless ssh works from the Ambari server to the new node

• ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01

And done
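When several nodes are added at once, the key distribution can be looped. The following is a sketch under assumptions: push_key, DRY_RUN and KEY are hypothetical names introduced here, and hostname02 / hostname03 in the example are placeholder hosts. With DRY_RUN=1 the commands are only printed, which is a cheap way to review them first.

```shell
# Copy the Ambari server's public key to each new node, then verify key-based login.
# DRY_RUN=1 prints the commands instead of running them; KEY overrides the key path.
push_key() {
    key="${KEY:-$HOME/.ssh/id_rsa.pub}"
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh-copy-id -i $key root@$host"
            echo "ssh -o BatchMode=yes root@$host hostname"
        else
            # BatchMode makes ssh fail instead of prompting for a password,
            # so a broken key setup shows up as an error right away.
            ssh-copy-id -i "$key" "root@$host" &&
                ssh -o BatchMode=yes "root@$host" hostname
        fi
    done
}

# Example (not run here):
#   DRY_RUN=1 push_key hostname02 hostname03
```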

Extra steps

Install Anaconda / Jupyter for data analysis

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark

Ways to automate - Ansible

Simple automation tool

Infrastructure as code

Agent-less

Easy to learn

Check for examples online: search for “ansible hadoop playbook”
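Because Ansible is agent-less, a first experiment needs little more than an inventory file and ssh access. The sketch below writes a minimal INI inventory; the helper name write_inventory, the group name "hadoop" and the file name hosts.ini are assumptions made for illustration. The commented ad-hoc commands use the standard ping and yum modules to repeat one of the prerequisite steps on every node.

```shell
# Write a minimal INI inventory with a single group.
# The group name "hadoop" is an assumption chosen for this example.
write_inventory() {
    out="$1"
    shift
    {
        echo "[hadoop]"
        for host in "$@"; do
            echo "$host"
        done
    } > "$out"
}

# Example (not run here):
#   write_inventory hosts.ini hostname01 hostname02
#   ansible hadoop -i hosts.ini -m ping
#   ansible hadoop -i hosts.ini --become -m yum -a "name=nc state=present"
```

Ad-hoc commands like these are handy for trying things out; for a repeatable cluster build the same steps would normally move into a playbook.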

Zero-install options

•BigInsights QSE

•BigInsights on cloud (paid)

WRAP-UP
