Installing Hadoop / Spark from scratch
© 2016 IBM Corporation
Big Data Developer meetup
Installing Apache Hadoop and Spark from scratch
Ljubljana, June 2016
Agenda
Why do you need Hadoop
What do you need before you install Apache Hadoop
Hadoop distributions
Hadoop components you need to know about
About Spark
Installation process walk-through
Adding cluster nodes
Ways to automate
Zero-install options
Why do you need Apache Hadoop
License – free
Scalable
General purpose MPP engine
Distributed storage
Packed with tools
Backend for your Big Data project
What do you need before you install Hadoop and Spark
A server (or servers)
An installed OS (for the IBM distribution: RHEL 6.5-7 or SUSE 11 SP3)
A Hadoop distribution (more later)
Or avoid all that trouble by using a VM or Docker if you are just experimenting (more later)
Apache Hadoop Distributions
Hortonworks HDP
Cloudera CDH
IBM IOP (today’s focus)
A number of others
Distributions are broadly similar but differ in the details, much like Linux distributions
Some are part of the Open Data Platform (ODP), some are not
Hadoop components you need to know about
YARN – resource manager
HDFS
MapReduce
Ambari
ZooKeeper
Hive
Pig
Sqoop
Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing
– Fast
•Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes that stay alive even when no jobs are running
•Faster than MapReduce
– General purpose
•Covers a wide range of workloads
•Provides SQL, streaming and complex analytics
– Flexible and easier to use than MapReduce
•Spark is written in Scala, an object-oriented, functional programming language
•Scala, Python and Java APIs
•Scala and Python interactive shells
•Runs on Hadoop, Mesos, standalone or in the cloud
Figure: Logistic regression in Hadoop and Spark (from http://spark.apache.org)
Installation process walk-through
Review the requirements
Review the installation docs
Get IOP software: http://www-01.ibm.com/support/docview.wss?uid=swg24040517
Prereqs
Install OS
Set up a yum repository
Install prerequisites
• yum install nc
Full list of preparation steps
Make sure your hostname is in /etc/hosts
Tweak some settings (disable Transparent Huge Pages)
• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled (RHEL 6; on RHEL 7 the path is /sys/kernel/mm/transparent_hugepage/enabled)
Generate an SSH key and set up passwordless SSH
• ssh-keygen
• cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
• chmod 700 ~/.ssh
• chmod 600 ~/.ssh/authorized_keys
• Check with ssh localhost
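The hostname and Transparent Huge Pages checks above can be scripted so they are repeatable on every node. The following is a minimal sketch, not part of any Hadoop or IOP tooling: the function names host_in_hosts_file and thp_disabled are hypothetical helpers, and both take the file to inspect as an argument so they can be tried on a copy first.

```shell
# Hypothetical pre-install checks; the function names are illustrative only.

# Succeeds if NAME appears as a host name or alias (any field after the IP)
# on a non-comment line of the given hosts file (default: /etc/hosts).
host_in_hosts_file() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # stop at an inline comment
                if ($i == n) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Succeeds if the THP control file shows "never" as the active value.
# The kernel marks the active setting with brackets: "always madvise [never]".
thp_disabled() {
    grep -q '\[never\]' "$1"
}
```

Typical use would be `host_in_hosts_file "$(hostname -f)" || echo "hostname missing"` and `thp_disabled /sys/kernel/mm/redhat_transparent_hugepage/enabled`, run on each node before starting the installation.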
Prereqs (cont.)
Disable IPv6
Configure ulimit
• /etc/security/limits.conf
Disable SELinux
Set up NTP on all servers
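The ulimit and SELinux items can be sketched in shell as follows. The helper names limits_entries and selinux_disabled_in are hypothetical, and the user name and limit values in the usage line are placeholders, not values recommended by the slides; take the real numbers from your distribution's installation guide. Each limits.conf entry has the form domain, type, item, value.

```shell
# Hypothetical helpers; names and example values are illustrative only.

# Print limits.conf entries that set the open-file and process limits
# for one user. Line format: <domain> <type> <item> <value>
limits_entries() {
    local user="$1" nofile="$2" nproc="$3"
    printf '%s soft nofile %s\n' "$user" "$nofile"
    printf '%s hard nofile %s\n' "$user" "$nofile"
    printf '%s soft nproc %s\n' "$user" "$nproc"
    printf '%s hard nproc %s\n' "$user" "$nproc"
}

# Succeeds if the given SELinux config file contains SELINUX=disabled.
selinux_disabled_in() {
    grep -Eq '^[[:space:]]*SELINUX=disabled[[:space:]]*$' "$1"
}
```

For example, `limits_entries hdfs 65536 65536 >> /etc/security/limits.conf` appends four entries for a user called hdfs, and `selinux_disabled_in /etc/selinux/config` checks the persistent SELinux setting.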
First step – install Ambari
Install repository
• yum install iop-4.1.0.0-1.<version>.<platform>.rpm
Install Ambari
• yum install ambari-server
Set up the Ambari server
• sudo ambari-server setup
Start the Ambari server
• ambari-server start
Go to the Ambari web interface at http://<your-ip>:8080
• Default user/pass = admin/admin
Launch the installation wizard
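After starting the server it can take a little while before the web interface answers. A small polling loop avoids guessing. This is a sketch under assumptions: ambari_url and wait_for_ambari are hypothetical helper names, curl is assumed to be installed, and the root URL is assumed to return a successful HTTP response once the server is up.

```shell
# Hypothetical helpers; names are illustrative only.

# Build the Ambari web UI URL for a host (8080 is the port from the slide).
ambari_url() {
    local host="$1" port="${2:-8080}"
    printf 'http://%s:%s\n' "$host" "$port"
}

# Poll the URL until it answers or the attempts run out.
# Returns 0 once curl gets a successful HTTP response, 1 otherwise.
wait_for_ambari() {
    local url="$1" attempts="${2:-30}" delay="${3:-5}" i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl --silent --fail --output /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A typical call would be `wait_for_ambari "$(ambari_url 10.0.0.5)" && echo "Ambari is up"`, where 10.0.0.5 stands in for your server's address.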
Ambari installation
Click through the wizard (Next, Next, Next)
Provide a cluster name
Provide the private SSH key (Ambari uses it to register the hosts)
Choose services
Assign masters
Assign slaves and clients
Customize services
In a production environment, this is where you set up proper database server connections
Review and deploy
Validate
Adding a new cluster node
Create a new server with the same prerequisites
Make sure that passwordless SSH works from the Ambari server to the node
• ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01
Add the host through the Ambari web interface, and you are done
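When several nodes are added at once, the key distribution step can be wrapped in a loop. The sketch below is illustrative: copy_key_to_nodes and the DRY_RUN switch are hypothetical names introduced here, and the host names are placeholders. With DRY_RUN=1 the commands are only printed, which makes it easy to review them before touching any server.

```shell
# Hypothetical helper; the function name and DRY_RUN switch are illustrative.

# Copy a public key to each host given after the key path and user name.
# With DRY_RUN=1 the ssh-copy-id commands are printed instead of executed.
copy_key_to_nodes() {
    local key="$1" user="$2" host
    shift 2
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh-copy-id -i $key $user@$host"
        else
            # Stop at the first host that cannot be reached.
            ssh-copy-id -i "$key" "$user@$host" || return 1
        fi
    done
}
```

For example, `copy_key_to_nodes ~/.ssh/id_rsa.pub root hostname01 hostname02` run on the Ambari server prepares two new nodes in one go.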
Extra steps
Install Anaconda / Jupyter for data analysis
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
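The launch command above can be wrapped so that the port and bind address are parameters. This is a sketch only: notebook_opts and start_pyspark_notebook are hypothetical names, and SPARK_HOME is assumed to point at the Spark installation directory. Note that binding to 0.0.0.0 makes the notebook reachable on all network interfaces of the server.

```shell
# Hypothetical helpers; names are illustrative only.

# Build the option string handed to the notebook front end.
# Defaults match the port and address used in the command above.
notebook_opts() {
    local port="${1:-12000}" ip="${2:-0.0.0.0}"
    printf 'notebook --no-browser --port %s --ip=%s\n' "$port" "$ip"
}

# Launch pyspark with the IPython notebook as the driver front end.
# Assumes SPARK_HOME is set to the Spark installation directory.
start_pyspark_notebook() {
    PYSPARK_DRIVER_PYTHON=ipython \
    PYSPARK_DRIVER_PYTHON_OPTS="$(notebook_opts "$@")" \
    "$SPARK_HOME/bin/pyspark"
}
```

Calling `start_pyspark_notebook 12000 127.0.0.1` would keep the notebook local to the machine, which pairs well with an SSH tunnel.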
Ways to automate - Ansible
Simple automation tool
Infrastructure as code
Agent-less
Easy to learn
For examples, search online for “ansible hadoop playbook”
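As a first step toward Ansible, the list of cluster nodes can be turned into an inventory file and tested with ad-hoc commands before writing any playbook. The sketch below assumes Ansible is installed on the control machine; make_inventory is a hypothetical helper, and the group name, host names and the file name hosts.ini are placeholders.

```shell
# Hypothetical helper; the function name is illustrative only.

# Print an INI-style Ansible inventory: one group header followed by
# one host per line.
make_inventory() {
    local group="$1" host
    shift
    printf '[%s]\n' "$group"
    for host in "$@"; do
        printf '%s\n' "$host"
    done
}

# Example use (not executed here):
#   make_inventory hadoop hostname01 hostname02 > hosts.ini
#   ansible hadoop -i hosts.ini -m ping
#   ansible hadoop -i hosts.ini -m yum -a "name=nc state=present" --become
```

Because Ansible is agent-less, the same passwordless SSH setup used for Ambari is all the nodes need for these commands to work.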
Zero-install options
•BigInsights Quick Start Edition (QSE)
•BigInsights on Cloud (paid)
WRAP-UP