Installing Hadoop / Spark from scratch

Posted on 15-Apr-2017


© 2016 IBM Corporation

Big Data Developer meetup

Installing Apache Hadoop and Spark from scratch

Ljubljana, June 2016


Agenda

Why do you need Hadoop

What do you need before you install Apache Hadoop

Hadoop distributions

Hadoop components you need to know about

About Spark

Installation process walk-through

Adding cluster nodes

Ways to automate

Zero-install options


Why do you need Apache Hadoop

License – free

Scalable

General purpose MPP engine

Distributed storage

Packed with tools

Backend for your Big Data project


What do you need before you install Hadoop and Spark

A server (or servers)

An installed OS (for the IBM distribution: RHEL 6.5-7 or SUSE 11 SP3)

A Hadoop distribution (more later)

Or avoid all that trouble by using a VM / Docker if you are just playing (more later)


Apache Hadoop Distributions

Hortonworks HDP

Cloudera CDH

IBM IOP (today’s focus)

A number of others

Distributions are similar but not identical, much like Linux distributions

Some are part of ODP (Open Data Platform), some are not


Hadoop components you need to know about

YARN – resource manager

HDFS

MapReduce

Ambari

ZooKeeper

Hive

Pig

Sqoop


Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing

– Fast

• Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes even when no jobs are running

• Faster than MapReduce

– General purpose

• Covers a wide range of workloads

• Provides SQL, streaming and complex analytics

– Flexible and easier to use than MapReduce

• Spark is written in Scala, an object-oriented, functional programming language

• Scala, Python and Java APIs

• Scala and Python interactive shells

• Runs on Hadoop, Mesos, standalone or cloud

[Figure: Logistic regression in Hadoop and Spark, from http://spark.apache.org]


Prereqs

Install OS

Setup yum repository

Install prerequisites

• yum install nc

Full list of preparation steps

Make sure your hostname is in /etc/hosts

Tweak some settings (disable Transparent Huge Pages)

• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Generate ssh key and set up passwordless ssh

• ssh-keygen

• cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

• chmod 700 ~/.ssh

• Check with ssh localhost
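
Two of the checks on this slide can be scripted. The following is a minimal sketch, not part of the original deck: the function names and the optional path arguments are made up for illustration. It assumes that RHEL 6 exposes the `redhat_transparent_hugepage` path shown above, while RHEL 7 uses the upstream `transparent_hugepage` path.

```shell
#!/bin/sh
# Sketch of two pre-install checks. Function names and the optional
# path arguments are illustrative assumptions, not from the slides.

# Succeeds if NAME appears as a whole field on a non-comment line of the hosts file.
host_in_hosts_file() {
    name=$1
    hosts_file=${2:-/etc/hosts}
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            for (i = 2; i <= NF; i++) {        # field 1 is the IP address
                if ($i ~ /^#/) break           # trailing comment starts here
                if ($i == name) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    ' "$hosts_file"
}

# Prints the THP control file that exists under SYSROOT (default: /sys).
# The redhat_-prefixed path is the RHEL 6 one; the other is the upstream path.
thp_control_file() {
    sysroot=${1:-/sys}
    for candidate in \
        "$sysroot/kernel/mm/redhat_transparent_hugepage/enabled" \
        "$sysroot/kernel/mm/transparent_hugepage/enabled"
    do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Possible usage, run as root: `host_in_hosts_file "$(hostname -f)" || echo "hostname missing from /etc/hosts"`, then `echo never > "$(thp_control_file)"`.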


Prereqs (cont.)

Disable IPv6

Configure ulimit

• /etc/security/limits.conf

Disable SELinux

Set up NTP on all servers
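
As a rough illustration of the ulimit and IPv6 items, the sketch below only prints candidate configuration lines so they can be reviewed before being appended to `/etc/security/limits.conf` or a sysctl file. The function names, the `hdfs` account and the numeric limits in the usage line are placeholders, not values from the deck.

```shell
#!/bin/sh
# Prints limits.conf entries (<domain> <type> <item> <value>) for one account.
# USER, NOFILE and NPROC are supplied by the caller; nothing here is a recommended value.
limits_conf_lines() {
    user=$1
    nofile=$2
    nproc=$3
    for kind in soft hard; do
        printf '%s %s nofile %s\n' "$user" "$kind" "$nofile"
        printf '%s %s nproc %s\n' "$user" "$kind" "$nproc"
    done
}

# Prints the sysctl keys that turn IPv6 off for current and future interfaces.
ipv6_disable_sysctl_lines() {
    printf 'net.ipv6.conf.all.disable_ipv6 = 1\n'
    printf 'net.ipv6.conf.default.disable_ipv6 = 1\n'
}
```

Possible usage, run as root after reviewing the output: `limits_conf_lines hdfs 65536 65536 >> /etc/security/limits.conf`.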


First step – install Ambari

Install repository

• yum install iop-4.1.0.0-1.<version>.<platform>.rpm

Install Ambari

• yum install ambari-server

Set up the Ambari server

• sudo ambari-server setup

Start the Ambari server

• ambari-server start

Go to the Ambari interface at <your-ip>:8080

• Default user/pass = admin/admin

Launch the installation wizard
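
After `ambari-server start` the web interface may take a moment to come up. The sketch below is one way to wait for it from a script. `wait_until` and `ambari_url` are hypothetical helper names, and the address in the usage line is a placeholder.

```shell
#!/bin/sh
# Retries a command until it succeeds or the attempt budget runs out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        n=$((n + 1))
        # Sleep only if another attempt is coming.
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Builds the web UI address (<your-ip>:8080 on the slide; port is overridable).
ambari_url() {
    printf 'http://%s:%s\n' "$1" "${2:-8080}"
}
```

Possible usage: `wait_until 30 5 curl -fsS -o /dev/null "$(ambari_url 192.0.2.10)" && echo "Ambari is up"`.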


Ambari installation

Next-next-next

Provide cluster name

Provide private ssh key


Choose services


Assign masters


Assign slaves and clients


Customize services

Here you would have to set up proper DB server connections in your prod environment


Review and deploy


Validate


Adding a new cluster node

Create a new server with the same prerequisites

Make sure that passwordless ssh works from the Ambari server to the node

ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01

And done
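
When several nodes are added at once, the `ssh-copy-id` step can be generated from a host list. This is a sketch only: `copy_id_commands` is an illustrative name, and the host list file in the usage line is hypothetical. It prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Prints one ssh-copy-id command per hostname read from stdin.
# Only the first word of each line is used; blank lines and lines starting with # are skipped.
# KEY defaults to the public key path used on the slide, USER to root.
copy_id_commands() {
    key=${1:-$HOME/.ssh/id_rsa.pub}
    user=${2:-root}
    while read -r host _rest || [ -n "$host" ]; do
        case $host in
            ''|'#'*) continue ;;
        esac
        printf 'ssh-copy-id -i %s %s@%s\n' "$key" "$user" "$host"
    done
}
```

Possible usage: `copy_id_commands < new_nodes.txt | sh`, followed by `ssh -o BatchMode=yes root@hostname01 true` to confirm that no password prompt appears.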


Extra steps

Install Anaconda / Jupyter for data analysis

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
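
The option string above can be built by a small helper so that the port and bind address are easy to change. A minimal sketch, with `notebook_opts` as a made-up name and the defaults taken from the slide:

```shell
#!/bin/sh
# Builds the PYSPARK_DRIVER_PYTHON_OPTS value used on the slide.
# PORT defaults to 12000 and IP to 0.0.0.0, matching the original command.
notebook_opts() {
    port=${1:-12000}
    ip=${2:-0.0.0.0}
    printf "notebook --no-browser --port %s --ip='%s'\n" "$port" "$ip"
}
```

Possible usage: `PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="$(notebook_opts 12000)" ./bin/pyspark`. Binding to 0.0.0.0 makes the notebook listen on all network interfaces; passing 127.0.0.1 as the second argument keeps it local.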


Ways to automate - Ansible

Simple automation tool

Infrastructure as code

Agent-less

Easy to learn

Check for examples online: search for “ansible hadoop playbook”
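
Because Ansible is agent-less, a first experiment needs only an inventory file and working ssh. The sketch below writes an INI-style inventory group. `inventory_group`, the group name and the hostnames are placeholders, not taken from the deck.

```shell
#!/bin/sh
# Prints an INI-style Ansible inventory group:
# a [group] header followed by one host per line.
inventory_group() {
    group=$1
    shift
    printf '[%s]\n' "$group"
    for host in "$@"; do
        printf '%s\n' "$host"
    done
}
```

Possible usage: `inventory_group hadoop_nodes hostname01 hostname02 > inventory.ini`, then `ansible hadoop_nodes -i inventory.ini -m ping` to check that every node is reachable.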


Zero-install options

•BigInsights QSE (Quick Start Edition)

•BigInsights on cloud (paid)



WRAP-UP
