[ Lab Session 8 ] CS455 - Introduction To Distributed Systems · 2020-03-13

Page 1

CS455 - Introduction To Distributed Systems
[ Lab Session 8 ]

Jason Stock
Computer Science
Colorado State University

Page 2

Topics Covered in Today’s Lab

● Hadoop setup

● HW3 introduction

● Wordcount example

NOTE: Feel free to bring laptops, code, and questions!

2

Page 3

Hadoop Cluster Setup

● SSH login without password

○ https://www.cs.colostate.edu/~info/faq.html#4.08

○ http://www.linuxproblem.org/art_9.html

● Supplementary setup guide

○ https://www.cs.colostate.edu/~cs455/CS455-Hadoop-Setup-Guide.pdf

3

Page 4

Hadoop Cluster Setup

● Download the configuration files

denver ➜ ~ mkdir ~/hadoop && cd $_

denver ➜ hadoop wget http://www.cs.colostate.edu/~cs455/hadoop-conf.tar

denver ➜ hadoop tar -xvf hadoop-conf.tar

● NOTE: the directory ~/hadoop will be used as an example for this presentation

● Modify your configuration files with the masters, workers, and the associated XML files

● All ports should be unique and non-conflicting

○ Do not select any ports between 56,000 and 57,000

○ These are reserved for the shared HDFS cluster
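Before writing a port into the XML files, a quick shell check can confirm it stays clear of the reserved range. The port value here is an arbitrary example, not a course-assigned number:

```shell
# Sketch: sanity-check a candidate port against the reserved
# 56,000-57,000 range used by the shared HDFS cluster.
# 45123 is an arbitrary example value.
PORT=45123
if [ "$PORT" -ge 56000 ] && [ "$PORT" -le 57000 ]; then
  STATUS=reserved
else
  STATUS=ok
fi
echo "$PORT $STATUS"
```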

4

Page 5

Hadoop Configuration Setup [1/4]

● conf/masters

○ Hostname for Secondary NameNode of local HDFS cluster

● conf/workers

○ List of hostnames for Datanodes of your HDFS cluster

○ The use of 10 - 15 nodes is recommended
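A sketch of what these two files might look like, written to a scratch directory here; all hostnames (boise, tucson, salem, helena) are hypothetical, so substitute lab machines you can SSH into without a password:

```shell
# Sketch: example masters/workers files in a scratch directory.
CONF=$(mktemp -d)
printf 'boise\n' > "$CONF/masters"                   # Secondary NameNode host
printf '%s\n' tucson salem helena > "$CONF/workers"  # one DataNode host per line
cat "$CONF/workers"
```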

5

Page 6

Hadoop Configuration Setup [2/4]

● conf/core-site.xml

○ Update NAME-NODE-HOST and PORT

○ Namenode service will be running on NAME-NODE-HOST and PORT

○ File operations (create/delete/read/write) will be handled by the NameNode service

● conf/hdfs-site.xml

○ Use unique port numbers where the placeholder PORT appears

○ Update the Secondary NameNode host (hostname mentioned in conf/masters)
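For reference, the core-site.xml entry being edited typically looks like the sketch below, using the standard fs.defaultFS property; the placeholders are kept exactly as they appear in the provided files:

```xml
<!-- Sketch of the relevant conf/core-site.xml entry. NAME-NODE-HOST
     and PORT are the placeholders you replace with your own values. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://NAME-NODE-HOST.cs.colostate.edu:PORT</value>
  </property>
</configuration>
```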

6

Page 7

Hadoop Configuration Setup [3/4]

● conf/yarn-site.xml

○ Update the RESOURCE-MANAGER-HOST with hostname of node for YARN Resource Manager

● conf/hadoop-env.sh and conf/yarn-env.sh

○ These scripts do not need to be modified
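For reference, the RESOURCE-MANAGER-HOST update corresponds to an entry like the following sketch, using the standard yarn.resourcemanager.hostname property; the provided conf/yarn-site.xml may structure this differently:

```xml
<!-- Sketch of the conf/yarn-site.xml entry; replace the placeholder
     with the hostname of the node running the YARN Resource Manager. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>RESOURCE-MANAGER-HOST.cs.colostate.edu</value>
  </property>
</configuration>
```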

7

Page 8

Hadoop Configuration Setup [4/4]

● Hadoop will use the local disks of the machines specified in masters and workers

○ /s/${HOSTNAME}/a/nobackup/cs455/${USER}

○ ${HOSTNAME} is the machine, e.g., denver

○ ${USER} is your CS department login, e.g., johnsmith

● Log files will also be available in the tmp directory created in the local disks

○ /s/${HOSTNAME}/a/tmp/cs455/${USER}

○ Separate directories will be used for Hadoop and Yarn logs

8

Page 9

Hadoop Environment Setup [1/2]

● Determine which shell you are using (for most people this is bash)

● For bash, copy the following variables into your ~/.bashrc file

export HADOOP_HOME=/usr/local/hadoop

export HADOOP_CONF_DIR="${HOME}/hadoop/conf"

export JAVA_HOME=/usr/local/jdk1.8.0_51-64

9

Page 10

Hadoop Environment Setup [2/2]

● Add aliases, or put the $HADOOP_HOME/sbin and $HADOOP_HOME/bin directories on your PATH, to start and stop your cluster and use the hadoop command, e.g.,

alias hstartdfs="$HADOOP_HOME/sbin/start-dfs.sh"

alias hstopdfs="$HADOOP_HOME/sbin/stop-dfs.sh"

alias hadoop="$HADOOP_HOME/bin/hadoop"
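An alternative to the aliases, assuming the HADOOP_HOME value from the previous slide, is to put both directories on PATH:

```shell
# Alternative to the aliases: put Hadoop's bin and sbin directories on
# PATH. HADOOP_HOME matches the environment-setup slide.
export HADOOP_HOME=/usr/local/hadoop
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
echo "$PATH" | cut -d: -f1   # first PATH entry is now Hadoop's bin
```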

10

Page 11

Running HDFS

● First time configuration

○ This should only have to be run once!

○ Log in to the NameNode machine specified in your hadoop/conf/core-site.xml

○ Format your NameNode (be careful, this deletes all staged data)

namenode ➜ ~ $HADOOP_HOME/bin/hdfs namenode -format

● Start HDFS

○ Log in to your NameNode machine

namenode ➜ ~ $HADOOP_HOME/sbin/start-dfs.sh

○ Can use the hstartdfs alias set up previously

11

Page 12

Verify HDFS is Running [1/3]

● Using the Java Virtual Machine Process Status Tool (jps)

namenode ➜ ~ jps

[lvmid] NameNode

datanode_1 ➜ ~ jps

[lvmid] DataNode

● Using the NameNode's DFS web UI

○ The port specified in hadoop/conf/hdfs-site.xml for dfs.namenode.http-address

○ http://<namenode>.cs.colostate.edu:<port>

○ You will need to either (i) port forward or (ii) use the VPN to connect to CSU’s network

12

Page 13

Verify HDFS is Running [2/3]

● What if HDFS fails to start?

○ Check Hadoop log directory for any exceptions

■ /s/${HOSTNAME}/a/tmp/cs455/${USER}

○ Most common culprit: java.net.BindException: Address already in use

○ The resolution is to change the PORT in the configuration files and try again

● If the error is something else?

○ Google the error and see what comes up

○ If you cannot find a solution or do not understand it?

■ Visit help desk, post on Piazza, etc.

13

Page 14

Verify HDFS is Running [3/3]

● Additional installation resources can be found on Infospaces

○ https://infospaces.cs.colostate.edu

○ Hadoop: Configuration Files

○ Hadoop: Environment Setup

14

Page 15

Stopping HDFS

● Log in to your NameNode machine

namenode ➜ ~ $HADOOP_HOME/sbin/stop-dfs.sh

● Can use the hstopdfs alias set up previously

15

Page 16

Using YARN Resource Management [1/2]

● Start YARN

○ Log in to the Resource Manager machine specified in your hadoop/conf/yarn-site.xml

resourcemgr ➜ ~ $HADOOP_HOME/sbin/start-yarn.sh

● Verify YARN is running

○ Use the jps command and check the web portal at yarn.resourcemanager.webapp.address

○ Check the YARN log directory for exceptions if the application fails to start

● Stopping YARN

○ Log in to the Resource Manager machine specified in your hadoop/conf/yarn-site.xml

resourcemgr ➜ ~ $HADOOP_HOME/sbin/stop-yarn.sh

16

Page 17

Using YARN Resource Management [2/2]

● Running a job locally

○ Change the value of mapreduce.framework.name to local in hadoop/conf/mapred-site.xml

○ To be safe, restart HDFS and YARN

○ From any node, run the following command:

namenode ➜ ~ $HADOOP_HOME/bin/hadoop jar your.jar main.class.Name args

○ Can use the hadoop alias set up previously

● Running a job using YARN

○ Change the value of mapreduce.framework.name to yarn in hadoop/conf/mapred-site.xml

○ Repeat the steps from above…
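The toggle described above corresponds to a conf/mapred-site.xml fragment roughly like this sketch:

```xml
<!-- Sketch of the conf/mapred-site.xml setting; switch the value
     between "local" (run in-process) and "yarn" (submit to YARN). -->
<property>
  <name>mapreduce.framework.name</name>
  <value>local</value>
</property>
```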

17

Page 18

HW3-PC Introduction [1/3]

● Air Quality Dataset from the Environmental Protection Agency

● Hourly measurements of meteorological temperatures and sulfur dioxide (SO2) readings from various monitors around the United States

● Total of 85 GB, consisting of approximately 360 million records dating between 1980 - 2019

18

Page 19

HW3-PC Introduction [2/3]

19

Index  Field Name      Description
1      State Code      FIPS code of the state
2      County Code     FIPS code of the county within a state
3      Site Num        Unique number within the county identifying the site
4      Parameter Code  AQS code corresponding to the parameter measured
5      POC             Parameter Occurrence Code for different instruments
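To get a feel for the record layout, a small shell sketch that splits a made-up CSV record into the first five fields from the table (the values are illustrative, not real data):

```shell
# Hypothetical record holding the first five fields from the table above.
record='06,037,0002,42401,1'
IFS=, read -r state county site param poc <<EOF
$record
EOF
echo "state=$state county=$county site=$site param=$param poc=$poc"
```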

Page 20

HW3-PC Introduction [3/3]

20

Q1 Which state has the most monitoring sites across the United States?
Note: a site is identified by the combination of the state code, county code, and site number.

Q2 Does the East Coast or West Coast have higher mean levels of SO2?
Note: there are a total of 4 and 16 states in the West Coast and East Coast, respectively.

Q3 What time of day (GMT) has the highest SO2 levels between 2000 - 2019?
Capture the mean SO2 levels for each hour (GMT) over all 20 years to justify your answer.

Q4 Has there been a change in SO2 levels over the last 40 years?
Capture the mean SO2 levels for each year to justify your answer.

Q5 What are the top 10 hottest states for the summer months (June, July, August)?
Capture the mean temperature levels for the summer months (GMT) to justify your answer.

Q6 What are the mean SO2 levels for the hottest states found in Question 5?

Page 21

Accessing Shared Datasets

● Using ViewFS-based federation for sharing datasets due to space constraints

● The aforementioned instructions were specific to your local cluster (that setup is still needed!)

● HW3-PC MapReduce programs will deal with two namenodes

○ NameNode of your local HDFS cluster

■ Used for writing output generated by your programs

○ NameNode of the shared HDFS cluster

■ Hosts the input data and is read-only

■ Do not attempt to write to this cluster. This is only for staging input data!

21

Page 22

Shared Dataset Mounting Points

● Datasets are hosted in the /data directory of the shared file system

● You will get read permissions for this directory

○ This will allow your MapReduce programs to read files in the dataset

22

Mount Point  NameNode  Directory in NameNode
/home        Local     /
/tmp         Local     /tmp
/data        Shared    /data
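In ViewFS terms, the mount table above corresponds roughly to client-side entries like the sketch below; the mount-table name ("cluster") and all hosts/ports are placeholders, and the client-conf distributed for the course may differ:

```xml
<!-- Sketch of ViewFS mount entries matching the table above. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>viewfs://cluster</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.cluster.link./home</name>
    <value>hdfs://NAMENODE_HOST:NAMENODE_PORT/</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.cluster.link./tmp</name>
    <value>hdfs://NAMENODE_HOST:NAMENODE_PORT/tmp</value>
  </property>
  <property>
    <name>fs.viewfs.mounttable.cluster.link./data</name>
    <value>hdfs://SHARED_NAMENODE_HOST:PORT/data</value>
  </property>
</configuration>
```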

Page 23

23

Abstract View

Page 24

Configuration for the Shared Cluster [1/2]

● Specifics for HW3-PC

○ Hadoop configuration for a client is different due to the previously mentioned mount points

○ MapReduce programs will be run using the resource manager of the shared HDFS cluster

○ You will read input and run applications on the shared cluster ⟶ write output to the local cluster

● Keep your local hadoop/conf; use hadoop/client-conf to connect to the shared cluster

24

Page 25

Configuration for the Shared Cluster [2/2]

● Download and extract the client-config

denver ➜ hadoop wget http://www.cs.colostate.edu/~cs455/client-config.tar

denver ➜ hadoop tar -xvf client-config.tar

● Update the hadoop/client-conf/core-site.xml

○ Replace NAMENODE_HOST and NAMENODE_PORT with the values from your local hadoop/conf/core-site.xml

○ Do not change the entry with augusta. It points to the NameNode of the shared cluster

25

Page 26

Accessing the Shared Cluster [1/2]

● Update the environment variable HADOOP_CONF_DIR for the given session

● Do not update the environment variable set in your .*rc file. Your local HDFS relies on it.

● (Important!) Set the variable for the current shell instance

➜ export HADOOP_CONF_DIR=$HOME/hadoop/client-conf

● Easy way to ensure this is done?

○ Create a bash script to do the following:

1. Build with gradle

2. Export client-conf

3. Run application

26
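The three steps can be captured in a small script like the following sketch; the jar name, main class, and paths here are illustrative stand-ins for your own:

```shell
# Sketch: generate the three-step helper script described above.
cat > run.sh <<'EOF'
#!/usr/bin/env bash
set -e
# 1. Build with gradle
gradle build
# 2. Export client-conf for this session only
export HADOOP_CONF_DIR="$HOME/hadoop/client-conf"
# 3. Run application
"$HADOOP_HOME/bin/hadoop" jar build/libs/app.jar \
  cs455.analysis.DeepLossyCount /data/gases /home/out/lossyoutput
EOF
chmod +x run.sh
head -n1 run.sh
```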

Page 27

Accessing the Shared Cluster [2/2]

● Build the application using Gradle

➜ gradle build

● Run your Hadoop program as usual

● Include the paths to the respective mount points, e.g.,

➜ $HADOOP_HOME/bin/hadoop jar build/libs/app.jar \
      cs455.analysis.DeepLossyCount /data/gases /home/out/lossyoutput

27

Page 28

Running Wordcount Locally [1/3]

● First start HDFS and YARN locally, and verify they are running

● Download text copies of at least 3 books from Project Gutenberg: http://www.gutenberg.org

● Create a directory on local HDFS and stage the files there

➜ $HADOOP_HOME/bin/hdfs dfs -mkdir /cs455

➜ $HADOOP_HOME/bin/hdfs dfs -mkdir /cs455/books

➜ $HADOOP_HOME/bin/hdfs dfs -put *.txt /cs455/books

➜ $HADOOP_HOME/bin/hdfs dfs -ls /cs455/books

28

Page 29

Running Wordcount Locally [2/3]

● Download and extract the tarball for WordCount

➜ wget https://www.cs.colostate.edu/~cs455/cs455-wordcount-sp19.tar.gz

➜ tar -xvf cs455-wordcount-sp19.tar.gz

● Build using the cs455-wordcount-sp19/build.gradle file

➜ gradle build

● (Important!) remove the previous output directory before running

➜ $HADOOP_HOME/bin/hdfs dfs -rm -R /cs455/wordcount

29

Page 30

Running Wordcount Locally [3/3]

● Submit the JAR to Hadoop using YARN

➜ $HADOOP_HOME/bin/hadoop jar build/libs/cs455-wordcount-sp19.jar \
      cs455.hadoop.wordcount.WordCountJob /cs455/books /cs455/wordcount

● View output from job using CLI

➜ $HADOOP_HOME/bin/hdfs dfs -cat /cs455/wordcount/part-r-00000

● Or check the local HDFS web portal (specified within hadoop/conf/hdfs-site.xml)
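A common follow-up is sorting the output by count. The sketch below uses a tiny stand-in file in place of part-r-00000 (word, tab, count); on the cluster you would pipe `hdfs dfs -cat` output into sort instead:

```shell
# Sketch: sort wordcount output by count, descending, and show the top
# entry. The stand-in file mimics the word<TAB>count reducer output.
printf 'the\t120\nof\t80\nwhale\t95\n' > part-r-00000
sort -k2,2nr part-r-00000 | head -n 1
```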

30

Page 31

Questions and Discussion…

31