Jian Wang Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das Yahoo! Inc. Bangalore & Apache Software Foundation


Post on 23-Dec-2015


Page 1:

Jian Wang

Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das

Yahoo! Inc. Bangalore & Apache Software Foundation

Page 2:

Need to process 10TB datasets
On 1 node:
◦ scanning @ 50MB/s = 2.3 days
On 1000 node cluster:
◦ scanning @ 50MB/s = 3.3 min

Need Efficient, Reliable and Usable framework
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
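The scan times above follow from simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly parallel scanning with no coordination overhead; the function name `scan_seconds` is made up for illustration:

```python
# Back-of-the-envelope scan time for the figures on the slide.
# Assumption: decimal units and ideal linear speed-up across nodes.

def scan_seconds(dataset_bytes, nodes, mb_per_second_per_node=50):
    """Return the time in seconds to scan the dataset once."""
    bytes_per_second = nodes * mb_per_second_per_node * 10**6
    return dataset_bytes / bytes_per_second

TEN_TB = 10 * 10**12

one_node_days = scan_seconds(TEN_TB, nodes=1) / 86400
cluster_minutes = scan_seconds(TEN_TB, nodes=1000) / 60

print(f"1 node:     {one_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} min")
```

A real cluster will be slower than this ideal figure, since it ignores scheduling, network and failure-recovery costs.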

Page 3:

Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ Current block replication is 3 (configurable)
◦ It cannot be directly mounted by an existing operating system.
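To make the block and replication figures concrete, here is a small sketch that estimates how many blocks a file occupies and how much raw disk space its replicas use. It assumes 64 MB means 64 × 1024 × 1024 bytes and that a partial last block only takes up its actual size; the helper names are hypothetical:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block size, as on the slide
REPLICATION = 3                 # replication factor from the slide

def block_count(file_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (the last block may be partial)."""
    return -(-file_bytes // block_size)   # integer ceiling division

def raw_storage_bytes(file_bytes, replication=REPLICATION):
    """Total bytes stored across the cluster, counting every replica."""
    return file_bytes * replication

one_gb = 1024**3
print(block_count(one_gb))                # blocks in a 1 GB file
print(block_count(one_gb) * REPLICATION)  # block replicas across the cluster
print(raw_storage_bytes(one_gb))          # raw bytes consumed
```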

Once you use the DFS (put something in it), relative paths are resolved from /user/{your user id}. E.g., if your id is jwang30, your “home dir” is /user/jwang30.
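The home-directory rule can be sketched as a small path resolver. This only illustrates the rule stated above; `resolve_dfs_path` is a hypothetical helper, not part of Hadoop:

```python
import posixpath

def resolve_dfs_path(path, user_id):
    """Resolve a DFS path: relative paths are taken relative to
    /user/<user_id>; absolute paths are left as they are."""
    home = posixpath.join("/user", user_id)
    # posixpath.join discards `home` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(home, path))

print(resolve_dfs_path("example", "jwang30"))
print(resolve_dfs_path("/tmp/data", "jwang30"))
```

So a command such as `bin/hadoop dfs -ls example` would, under this rule, list /user/jwang30/example for user jwang30.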

Page 4:

Master-Slave Architecture

Master node (irkm-1): runs the HDFS “Namenode” and the MapReduce “Jobtracker”. The Jobtracker:
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and tasktracker status, re-executes tasks upon failure

Slave nodes (irkm-1 to irkm-6): run the HDFS “Datanodes” and the MapReduce “Tasktrackers”. The Tasktrackers:
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output
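The assign, monitor and re-execute cycle can be illustrated with a toy scheduler. This is a deliberately simplified sketch, not Hadoop's actual scheduling logic: tasks are handed out round-robin, and a task whose tracker fails goes back on the queue. All names (`run_job`, the task labels) are made up:

```python
from collections import deque

def run_job(tasks, trackers, failing_trackers=()):
    """Toy scheduler: assign tasks round-robin and re-queue any task whose
    tracker fails. Returns {task: tracker that completed it}."""
    pending = deque(tasks)
    alive = list(trackers)
    completed = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no trackers left to run the remaining tasks")
        task = pending.popleft()
        tracker = alive[turn % len(alive)]
        if tracker in failing_trackers:
            alive.remove(tracker)   # stop using the failed tracker
            pending.append(task)    # re-execute the task elsewhere
            continue
        completed[task] = tracker
        turn += 1
    return completed

tasks = ["map-0", "map-1", "map-2", "reduce-0"]
done = run_job(tasks, ["irkm-1", "irkm-2", "irkm-3"],
               failing_trackers={"irkm-2"})
for task, tracker in done.items():
    print(task, "ran on", tracker)
```

Every task still completes even though one tracker fails, which is the property the slide attributes to the master.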

Page 6:

Hadoop is locally “installed” on each machine
◦ Version 0.19.2
◦ Installed location is in /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)

Page 7:

If it is the first time that you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format

Most commands follow the same pattern
◦ bin/hadoop “some command” options
◦ If you just type bin/hadoop with no arguments, you get all possible commands (including undocumented ones)

Page 8:

hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
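If these commands need to be driven from a script, one option is a thin wrapper that builds the argument list and hands it to `subprocess`. A minimal sketch, assuming the install directory /home/tmp/hadoop mentioned on the earlier slide; `dfs_command` and `run_dfs` are hypothetical helper names:

```python
import subprocess

HADOOP_BIN = "bin/hadoop"  # relative to the install directory

def dfs_command(subcommand, *args):
    """Build the argument list for `bin/hadoop dfs -<subcommand> <args...>`."""
    return [HADOOP_BIN, "dfs", "-" + subcommand, *args]

def run_dfs(subcommand, *args, cwd="/home/tmp/hadoop"):
    """Run a dfs subcommand from the install directory and return its
    standard output. Raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(dfs_command(subcommand, *args), cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout

# Building the command does not require a cluster:
print(dfs_command("copyFromLocal", "example", "example"))
```

On a machine with Hadoop installed, `run_dfs("ls")` would then return the listing of the user's DFS home directory as text.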

Page 9:

bin/start-all.sh – starts all slave nodes and master node

bin/stop-all.sh – stops all slave nodes and master node

Run jps to check the status
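`jps` prints one line per running Java process, as a process id followed by a class name. The check can be automated with a sketch like the one below, which assumes the usual daemon names for this Hadoop generation (NameNode, SecondaryNameNode, JobTracker on the master; DataNode, TaskTracker on the slaves). The sample output and its process ids are invented for illustration:

```python
MASTER_DAEMONS = {"NameNode", "SecondaryNameNode", "JobTracker"}
SLAVE_DAEMONS = {"DataNode", "TaskTracker"}

def missing_daemons(jps_output, expected):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:        # "<pid> <class name>"
            running.add(parts[1])
    return set(expected) - running

# Hypothetical jps output from a master node (pids are made up).
sample = "4321 NameNode\n4400 JobTracker\n4512 Jps\n"
print(missing_daemons(sample, MASTER_DAEMONS))
```

An empty result means every expected daemon is up; anything else names the daemons that failed to start.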

Page 10:

Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example

After that
bin/hadoop dfs -ls

Page 14:

Mapper.py
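A minimal sketch of what a word-count mapper for Hadoop Streaming could look like. This is an assumption for illustration, not the original mapper.py; the names `map_lines` and `main` are hypothetical. Streaming feeds input lines on standard input and expects tab-separated key/value records on standard output:

```python
import sys

def map_lines(lines):
    """Yield one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def main():
    """Read input lines from stdin and write mapper records to stdout."""
    for record in map_lines(sys.stdin):
        print(record)

# Quick local check without Hadoop:
print(list(map_lines(["hello world", "hello"])))
```

To use this as a Streaming script, drop the local check and end the file with a call to `main()`; the file also needs to be executable and start with a suitable `#!` line.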

Page 15:

Reducer.py
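A minimal sketch of a matching word-count reducer, again an assumption rather than the original reducer.py. It relies on Hadoop Streaming delivering records sorted by key, so all counts for one word arrive consecutively; input is assumed to be well-formed `word<TAB>count` lines. The names `reduce_lines` and `main` are hypothetical:

```python
import sys
from itertools import groupby

def reduce_lines(lines):
    """Sum the counts per word. Input must be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1)
             for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def main():
    """Read sorted records from stdin and write totals to stdout."""
    for record in reduce_lines(sys.stdin):
        print(record)

# Quick local check without Hadoop:
print(list(reduce_lines(["hello\t1\n", "hello\t1\n", "world\t1\n"])))
```

As with any Streaming script, the file would end with a call to `main()` once the local check is removed.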

Page 16:

bin/hadoop dfs -ls

bin/hadoop dfs -copyFromLocal example example

bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
    -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
    -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
    -input example -output java-output

bin/hadoop dfs -cat java-output/part-00000

bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local

Page 17:

Hadoop job tracker
◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp

Hadoop task tracker
◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp

Hadoop dfs checker
◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp
