Training on Hadoop



    Apache Hadoop


    Module 1: Tutorial Introduction


    Introduction

Welcome to the Yahoo! Hadoop tutorial! This series of tutorial documents will walk you through many aspects of the Apache Hadoop system. You will be shown how to set up simple and advanced cluster configurations, use the distributed file system, and develop complex Hadoop MapReduce applications. Other related systems are also reviewed.

    Goals for this Module:

• Understand the scope of problems applicable to Hadoop

• Understand how Hadoop addresses these problems differently from other distributed systems.

    Outline

1. Introduction

2. Goals for this Module

3. Outline

4. Problem Scope

   1. Challenges at Large Scale

   2. Moore's Law

5. The Hadoop Approach

   1. Comparison to Existing Techniques

   2. Data Distribution


   3. MapReduce: Isolated Processes

   4. Flat Scalability

6. The Rest of the Tutorial

    Problem Scope

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems work with. Hundreds of gigabytes of data constitute the low end of Hadoop-scale. Actually Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, it is likely that the input data set will not even fit on a single computer's hard drive, much less in memory. So Hadoop includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible.

    "hallenges at #arge Scale

    Performin" lar"e9scale computation is difficult$ To wor with this volume of data re6uiresdistributin" parts of the problem to multiple machines to handle in parallel$ Whenever multiple

    machines are used in cooperation with one another% the probabilit of failures rises$ +n a sin"le9

    machine environment% failure is not somethin" that pro"ram desi"ners explicitl worr aboutver often7 if the machine has crashed% then there is no wa for the pro"ram to recover anwa$

    +n a distributed environment% however% partial failures are an expected and common occurrence$

     Networs can experience partial or total failure if switches and routers brea down$ Data ma not

    arrive at a particular point in time due to unexpected networ con"estion$ +ndividual computenodes ma overheat% crash% experience hard drive failures% or run out of memor or dis space$

    Data ma be corrupted% or maliciousl or improperl transmitted$ &ultiple implementations or

    versions of client software ma spea sli"htl different protocols from one another$ 1locs ma

     become desnchroni;ed% loc files ma not be released% parties involved in distributed atomictransactions ma lose their networ connections part9wa throu"h% etc$ +n each of these cases% the

    rest of the distributed sstem should be able to recover from the component failure or transienterror condition and continue to mae pro"ress$ (f course% actuall providin" such resilience is ama


Different distributed systems make different trade-offs, as they intend to be used for problems with other requirements (e.g., high security).

In addition to worrying about these sorts of bugs and challenges, there is also the fact that the compute hardware has finite resources available to it. The major hardware resources include processor time, memory, hard drive space, and network bandwidth.


If a failure occurs in the infrastructure, then determining how best to restart the lost computation and propagating this information about the change in network topology may be non-trivial to implement.

Moore's Law

So why use a distributed system at all? They seem like more trouble than they're worth. And with the fast pace of computer hardware design, it seems inevitable that single-chip hardware will be able to "grow up" to handle the larger volumes of data. After all, Moore's Law (named after Gordon Moore, the founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost. But trends in chip design are changing to face new realities. While we can still double the number of transistors per unit area at this pace, this does not necessarily result in faster single-threaded performance. New processors such as Intel Core 2 and Itanium 2 architectures now focus on embedding many smaller CPUs or "cores" onto the same physical device. This allows multiple threads to process twice as much data in parallel, but at the same speed at which they operated previously.

Even if hundreds or thousands of CPU cores are placed on a single machine, it would not be possible to deliver input data to these cores fast enough for processing. Individual hard drives can only sustain read speeds between 60-100 MB/second. These speeds have been increasing over time, but not at the same breakneck pace as processors. Optimistically assuming the upper limit of 100 MB/second, and assuming four independent I/O channels are available to the machine, that provides 400 MB of data every second. A 4 terabyte data set would thus take over 10,000 seconds to read--about three hours.
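To spell out the arithmetic behind that estimate (using the same round numbers as above): 4 terabytes is roughly 4,000,000 MB, and 4,000,000 MB / 400 MB per second = 10,000 seconds, or a little under three hours, spent doing nothing but reading the input.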


Data Distribution

In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition to this, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

Data is conceptually record-oriented in the Hadoop programming framework. Individual input files are broken into lines or into other formats specific to the application logic. Each process running on a node in the cluster then processes a subset of these records. The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data is operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation, allows Hadoop to achieve high data locality, which in turn results in high performance.

Figure 1.1: Data is distributed across nodes at load time.

MapReduce: Isolated Processes

Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from one another. While this sounds like a major limitation at first, it makes the whole framework much more reliable.


Communication between nodes is performed implicitly. Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.

By restricting the communication between nodes, Hadoop makes the distributed system much more reliable. Individual node failures can be worked around by restarting tasks on other machines. Since user-level tasks do not communicate explicitly with one another, no messages need to be exchanged by user programs, nor do nodes need to roll back to pre-arranged checkpoints to partially restart the computation. The other workers continue to operate as though nothing went wrong, leaving the challenging aspects of partially restarting the program to the underlying Hadoop layer.

Flat Scalability

One of the major benefits of using Hadoop is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly impressive performance, and other distributed frameworks may perform much better on two, four, or perhaps a dozen machines. Though the effort of coordinating work among a small number of machines may be better-performed by such systems, the price paid in performance and engineering effort (when adding more hardware as a result of increasing data volumes) increases non-linearly.

A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. This may involve having the program be rewritten several times; fundamental elements of its design may also put an upper bound on the scale to which the application can grow.

Hadoop, however, is specifically designed to have a very flat scalability curve. After a Hadoop program is written and functioning on ten nodes, very little--if any--work is required for that same program to run on a much larger amount of hardware. Orders of magnitude of growth can be managed with little re-work required for your applications. The underlying Hadoop platform will manage the data and hardware resources and provide dependable performance growth proportionate to the number of machines available.

The Rest of the Tutorial

This module of the tutorial has highlighted the major challenges of large-scale distributed computing and the approach Hadoop takes to them. The remaining modules cover the system in more detail:

• Module 2 describes HDFS, the distributed file system in which Hadoop stores vast quantities of information, how to configure HDFS, and how to use it to store and retrieve your data.


• Module 3 shows you how to get started setting up a Hadoop environment to experiment with. It reviews how to install a Hadoop virtual machine (included in this resource CD) so that you can run Hadoop regardless of what operating system you are running.

• Module 4 explains the Hadoop MapReduce programming model itself, and how to write some MapReduce programs.

• Module 5 goes into further detail about the specifics of Hadoop MapReduce, and how to use advanced features for more powerful control over a program's execution.

• Module 6 describes some other components of the Hadoop ecosystem which can add further capabilities to your distributed system.

• Module 7 describes how to configure Hadoop clusters of different sizes. It describes what particular parameters of Hadoop need to be tuned for setting up clusters of various sizes. In addition it describes the various performance monitoring tools available in Hadoop for monitoring the health of your cluster.

• And to expand upon the Pig section described in Module 6, a separate Pig Tutorial is included in this package at the end as Module 8.

Module 2: The Hadoop Distributed File System


    Introduction

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it.

    Goals for this Module:

• Understand the basic design of HDFS and how it relates to basic distributed file system concepts

• Learn how to set up and use HDFS from the command line

• Learn how to use HDFS in your applications

    Outline


1. Introduction

2. Goals for this Module

3. Outline

4. Distributed File System Basics

5. Configuring HDFS

6. Interacting With HDFS

   1. Common Example Operations

   2. HDFS Command Reference

   3. DFSAdmin Command Reference

7. Using HDFS in MapReduce

8. Using HDFS Programmatically

9. HDFS Permissions and Security

10. Additional HDFS Tasks

   1. Rebalancing Blocks

   2. Copying Large Sets of Files

   3. Decommissioning Nodes

   4. Verifying File System Health

   5. Rack Awareness

11. HDFS Web Interface

12. References

Distributed File System Basics

A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways.


NFS, the Network File System, is the most ubiquitous distributed file system. It is one of the oldest still in use. While its design is straightforward, it is also very constrained. NFS provides remote access to a single logical volume stored on a single machine. An NFS server makes a portion of its local file system visible to external clients. The clients can then mount this remote file system directly into their own Linux file system, and interact with it as though it were part of the local drive.

One of the primary advantages of this model is its transparency. Clients do not need to be particularly aware that they are working on files stored remotely. The existing standard library methods like open(), close(), fread(), etc. will work on files hosted over NFS.

But as a distributed file system, it is limited in its power. The files in an NFS volume all reside on a single machine. This means that it will only store as much information as can be stored in one machine, and does not provide any reliability guarantees if that machine goes down (e.g., by replicating the files to other servers). Finally, as all the data is stored on a single machine, all the clients must go to this machine to retrieve their data. This can overload the server if a large number of clients must be handled. Clients must also always copy the data to their local machines before they can operate on it.

    HDF0 is desi"ned to be robust to a number of the problems that other DF03s such as NF0 arevulnerable to$ +n particular7

    • HDF0 is desi"ned to store a ver lar"e amount of information =terabtes or petabtes>$

    This re6uires spreadin" the data across a lar"e number of machines$ +t also supports much

    lar"er file si;es than NF0$

    • HDF0 should store data reliabl$ +f individual machines in the cluster malfunction% data

    should still be available$

    • HDF0 should provide fast% scalable access to this information$ +t should be possible to

    serve a lar"er number of clients b simpl addin" more machines to the cluster$

    • HDF0 should inte"rate well with Hadoop &ap'educe% allowin" data to be read and

    computed upon locall when possible$

    @ut while HDF0 is ver scalable% its hi"h performance desi"n also restricts it to a particular class

    of applications? it is not as "eneral9purpose as NF0$ There are a lar"e number of additionaldecisions and trade9offs that were made with HDF0$ +n particular7

    • #pplications that use HDF0 are assumed to perform lon" se6uential streamin" reads from

    files$ HDF0 is optimi;ed to provide streamin" read performance? this comes at theexpense of random see times to arbitrar positions in files$

    • Data will be written to the HDF0 once and then read several times? updates to files after

    the have alread been closed are not supported$ =#n extension to Hadoop will provide


    support for appendin" new data to the ends of files? it is scheduled to be included in

    Hadoop A$*B but is not available et$>

    • Due to the lar"e si;e of files% and the se6uential nature of reads% the sstem does not

     provide a mechanism for local cachin" of data$ The overhead of cachin" is "reat enou"h

    that data should simpl be re9read from HDF0 source$

    • +ndividual machines are assumed to fail on a fre6uent basis% both permanentl and

    intermittentl$ The cluster must be able to withstand the complete failure of several

    machines% possibl man happenin" at the same time =e$"$% if a rac fails all to"ether>$While performance ma de"rade proportional to the number of machines lost% the sstem

    as a whole should not become overl slow% nor should information be lost$ Data

    replication strate"ies combat this problem$

    The desi"n of HDF0 is based on the desi"n of G0S% the -oo"le File 0stem$ +ts desi"n wasdescribed in a paper   published b -oo"le$

    HDF0 is a bloc9structured file sstem7 individual files are broen into blocs of a fixed si;e$

    These blocs are stored across a cluster of one or more machines with data stora"e capacit$

    +ndividual machines in the cluster are referred to as -ata3odes$ # file can be made of several blocs% and the are not necessaril stored on the same machine? the tar"et machines which hold

    each bloc are chosen randoml on a bloc9b9bloc basis$ Thus access to a file ma re6uire the

    cooperation of multiple machines% but supports file si;es far lar"er than a sin"le9machine DF0?individual files can re6uire more space than a sin"le hard drive could hold$

    +f several machines must be involved in the servin" of a file% then a file could be rendered

    unavailable b the loss of an one of those machines$ HDF0 combats this problem b replicatin"

    each bloc across a number of machines =.% b default>$

    Fi"ure ,$*7 DataNodes holdin" blocs of multiple files with a replication factor of ,$ The NameNode maps the filenames onto the bloc ids$

    &ost bloc9structured file sstems use a bloc si;e on the order of / or G I@$ @ contrast% the

    default bloc si;e in HDF0 is 8/&@ 99 orders of ma"nitude lar"er$ This allows HDF0 to decrease

    the amount of metadata stora"e re6uired per file =the list of blocs per file will be smaller as thesi;e of individual blocs increases>$ Furthermore% it allows for fast streamin" reads of data% b

    eepin" lar"e amounts of data se6uentiall laid out on the dis$ The conse6uence of this decision

    is that HDF0 expects to have ver lar"e files% and expects them to be read se6uentiall$ )nlie afile sstem such as NTF0 or 5JT% which see man ver small files% HDF0 expects to store amodest number of ver lar"e files7 hundreds of me"abtes% or "i"abtes each$ #fter all% a *AA

    &@ file is not even two full blocs$ Files on our computer ma also fre6uentl be accessed

    :randoml%: with applications cherr9picin" small amounts of information from severaldifferent locations in a file which are not se6uentiall laid out$ @ contrast% HDF0 expects to

    read a bloc start9to9finish for a pro"ram$ This maes it particularl useful to the &ap'educe


style of programming described in Module 4. That having been said, attempting to use HDFS as a general-purpose distributed file system for a diverse set of applications will be suboptimal.

Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system. Typing ls on a machine running a DataNode daemon will display the contents of the ordinary Linux file system being used to host the Hadoop services -- but it will not include any of the files stored inside the HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of your local files. The files inside HDFS (or more accurately: the blocks that make them up) are stored in a particular directory managed by the DataNode service, but the files will be named only with block ids. You cannot interact with HDFS-stored files using ordinary Linux file modification tools (e.g., ls, cp, mv, etc). However, HDFS does come with its own utilities for file management, which act very similar to these familiar tools. A later section in this tutorial will introduce you to these commands and their operation.

It is important for this file system to store its metadata reliably. Furthermore, while the file data is accessed in a write once and read many model, the metadata structures (e.g., the names of files and directories) can be modified by a large number of clients concurrently. It is important that this information is never desynchronized. Therefore, it is all handled by a single machine, called the NameNode. The NameNode stores all the metadata for the file system. Because of the relatively low amount of metadata per file (it only tracks file names, permissions, and the locations of each block of each file), all of this information can be stored in the main memory of the NameNode machine, allowing fast access to the metadata.

To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.

Of course, NameNode information must be preserved even if the NameNode machine fails; there are multiple redundant systems that allow the NameNode to preserve the file system's metadata even if the NameNode itself crashes irrecoverably. NameNode failure is more severe for the cluster than DataNode failure. While individual DataNodes may crash and the entire cluster will continue to operate, the loss of the NameNode will render the cluster inaccessible until it is manually restored. Fortunately, as the NameNode's involvement is relatively minimal, the odds of it failing are considerably lower than the odds of an arbitrary DataNode failing at any given point in time.

A more thorough overview of the architectural decisions involved in the design and implementation of HDFS is given in the official Hadoop HDFS documentation. Before continuing in this tutorial, it is advisable that you read and understand the information presented there.

    "onfiguring H-0S


The HDFS for your cluster can be configured in a very short amount of time. First we will fill out the relevant sections of the Hadoop configuration file, then format the NameNode.

Cluster configuration

These instructions for cluster configuration assume that you have already downloaded and unzipped a copy of Hadoop. Module 3 discusses getting started with Hadoop for this tutorial. Module 7 discusses how to set up a larger cluster and provides preliminary setup instructions for Hadoop, including downloading prerequisite software.

    The HDF0 confi"uration is located in a set of J&2 files in the Hadoop confi"uration director?

    conf/ under the main Hadoop install director =where ou un;ipped Hadoop to>$ The

    conf/hadoop-defaults.xml file contains default values for ever parameter in Hadoop$ This

    file is considered read9onl$ You override this confi"uration b settin" new values in

    conf/hadoop-site.xml$ This file should be replicated consistentl across all machines in the

    cluster$ =+t is also possible% thou"h not advisable% to host it on NF0$>

    1onfi"uration settin"s are a set of e9value pairs of the format7

         property-name   property-value

     

    #ddin" the line true inside the property bod will prevent properties from

     bein" overridden b user applications$ This is useful for most sstem9wide confi"uration options$

    The followin" settin"s are necessar to confi"ure HDF07

    e& 'alue e%ample

    fs$default$name   protocol 7EE servername7 port  hdfs7EEalpha$milman$or"7BAAA

    dfs$data$dir    pathname EhomeEusernameEhdfsEdata

    dfs$name$dir    pathname EhomeEusernameEhdfsEname

    These settin"s are described individuall below7

    fs)default)name 9 This is the )'+ =protocol specifier% hostname% and port> that describes the NameNode for the cluster$ 5ach node in the sstem on which Hadoop is expected to operate

    needs to now the address of the NameNode$ The DataNode instances will re"ister with this NameNode% and mae their data available throu"h it$ +ndividual client pro"rams will connect to

    this address to retrieve the locations of actual file blocs$

    dfs)data)dir 9 This is the path on the local file sstem in which the DataNode instance should

    store its data$ +t is not necessar that all DataNode instances store their data under the same local

     path prefix% as the will all be on separate machines? it is acceptable that these machines are

    hetero"eneous$ However% it will simplif confi"uration if this director is standardi;ed


    throu"hout the sstem$ @ default% Hadoop will place this under /tmp$ This is fine for testin"

     purposes% but is an eas wa to lose actual data in a production sstem% and thus must be

    overridden$

    dfs)name)dir 9 This is the path on the local file sstem of the NameNode instance where the

     NameNode metadata is stored$ +t is onl used b the NameNode instance to find its information%and does not exist on the DataNodes$ The caveat above about /tmp applies to this as well? this

    settin" must be overridden in a production sstem$

    #nother confi"uration parameter% not listed above% is dfs)replication$ This is the default

    replication factor for each bloc of data in the file sstem$ For a production cluster% this should

    usuall be left at its default value of .$ =You are free to increase our replication factor% thou"hthis ma be unnecessar and use more space than is re6uired$ Fewer than three replicas impact

    the hi"h availabilit of information% and possibl the reliabilit of its stora"e$>

    The followin" information can be pasted into the hadoop-site.xml file for a sin"le9node

    confi"uration7    fs.default.name

      hdfs://your.server.name.com:9000      dfs.data.dir

      /home/username/hdfs/data      dfs.name.dir

      /home/username/hdfs/name 

    (f course% your.server.name.com needs to be chan"ed% as does username$ )sin" port BAAA for

    the NameNode is arbitrar$

    #fter copin" this information into our conf/hadoop-site.xml file% cop this to the conf/ 

    directories on all machines in the cluster$

    The master node needs to now the addresses of all the machines to use as DataNodes? thestartup scripts depend on this$ #lso in the conf/ director% edit the file slaves so that it contains

    a list of full96ualified hostnames for the slave instances% one host per line$ (n a multi9node

    setup% the master node =e$"$% localhost> is not usuall present in this file$

    Then mae the directories necessar7

      user!ach"achine# m$dir -p #%&"!/hdfs/data


  user@namenode$ mkdir -p $HOME/hdfs/name

The user who owns the Hadoop instances will need to have read and write access to each of these directories. It is not necessary for all users to have access to these directories. Set permissions with chmod as appropriate. In a large-scale environment, it is recommended that you create a user named "hadoop" on each node for the express purpose of owning and running Hadoop tasks. For a single individual's machine, it is perfectly acceptable to run Hadoop under your own username. It is not recommended that you run Hadoop as root.

Starting HDFS

Now we must format the file system that we just configured.
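A minimal sketch of the usual sequence (the prompt, hostname, and username are illustrative) is to format the NameNode once, then start the distributed file system:

  someone@namenode:hadoop$ bin/hadoop namenode -format
  someone@namenode:hadoop$ bin/start-dfs.sh

Formatting should only ever be performed on a new, empty cluster; bin/start-dfs.sh starts the NameNode locally and a DataNode on each machine listed in conf/slaves, and its counterpart bin/stop-dfs.sh (described later in this module) shuts HDFS down.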


Interacting With HDFS

The dfs module, also known as "FsShell," provides basic file manipulation operations. Their usage is introduced here.

A cluster is only useful if it contains data of interest. Therefore, the first operation to perform is loading information into the cluster. For purposes of this example, we will assume an example user named "someone" -- but substitute your own username where it makes sense. Also note that any operation on files in HDFS can be performed from any node with access to the cluster, whose conf/hadoop-site.xml is configured to set fs.default.name to your cluster's NameNode. We will call the fictional machine on which we are operating anynode. Commands are being run from the "hadoop" directory where you installed Hadoop. This may be /home/someone/src/hadoop on your machine, or /home/foo/hadoop on someone else's. These initial commands are centered around loading information into HDFS, checking that it's there, and getting information back out of HDFS.

Listing files

If we attempt to inspect HDFS, we will not find anything interesting there:

  someone@anynode:hadoop$ bin/hadoop dfs -ls
  someone@anynode:hadoop$

The "-ls" command returns silently. Without any arguments, -ls will attempt to show the contents of your "home" directory inside HDFS. Don't forget, this is not the same as /home/$USER (e.g., /home/someone) on the host machine (HDFS keeps a separate namespace from the local files). There is no concept of a "current working directory" or cd command in HDFS.

If you provide -ls with an argument, you may see some initial directory contents:

  someone@anynode:hadoop$ bin/hadoop dfs -ls /
  Found 2 items
  drwxr-xr-x   - hadoop supergroup    0 2008-09-20 19:10 /hadoop
  drwxr-xr-x   - hadoop supergroup    0 2008-09-20 20:08 /tmp

These entries are created by the system. This example output assumes that "hadoop" is the username under which the Hadoop daemons (NameNode, DataNode, etc) were started. "supergroup" is a special group whose membership includes the username under which the HDFS instances were started (e.g., "hadoop"). These directories exist to allow the Hadoop MapReduce system to move necessary data to the different


Any relative paths used as arguments to HDFS, Hadoop MapReduce, or other components of the system are assumed to be relative to this base directory (your HDFS home directory); files can also be addressed with explicit source and destination paths.

Step 1: Create your home directory if it does not already exist.

  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

If there is no /user directory, create that first. It will be automatically created later if necessary, but for instructive purposes, it makes sense to create it manually ourselves this time.

Then we are free to add our own home directory:

  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone

Of course, replace /user/someone with /user/yourUserName.

Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:

  someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

This copies /home/someone/interestingFile.txt from the local file system into /user/yourUserName/interestingFile.txt on HDFS.

Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:

  someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName
  someone@anynode:hadoop$ bin/hadoop dfs -ls

You should see a listing that starts with Found 1 items and then includes information about the file you inserted.

    The followin" table demonstrates example uses of the put command% and their effects7

    "ommand: Assuming: Outcome:

    'in/hadoop dfs -putfoo 'ar

     No fileEdirector named

    /user/#!*/'ar exists in

    HDF0

    )ploads local file foo to a file named/user/#!*/'ar

    'in/hadoop dfs -putfoo 'ar

    /user/#!*/'ar is a

    director)ploads local file foo to a file named/user/#!*/'ar/foo

    'in/hadoop dfs -putfoosomedir/somefile

    /user/#!*/somedir 

    does not exist in HDF0

    )ploads local file foo to a file named

    /user/#!*/somedir/somefile% creatin"

    the missin" director

    'in/hadoop dfs -putfoo 'ar

    /user/#!*/'ar is

    alread a file in HDF0

     No chan"e in HDF0% and an error is

    returned to the user$


When the put command operates on a file, it is all-or-nothing. Uploading a file into HDFS first copies the data onto the DataNodes. When they all acknowledge that they have received all the data and the file handle is closed, it is then made visible to the rest of the system. Thus based on the return value of the put command, you can be confident that a file has either been successfully uploaded, or has "fully failed;" you will never get into a state where a file is partially uploaded and the partial contents are visible externally, but the upload disconnected and did not complete the entire file contents. In a case like this, it will be as though no upload took place.

    Step 9: )ploadin" multiple files at once$ The put command is more powerful than movin" a

    sin"le file at a time$ +t can also be used to upload entire director trees into HDF0$

    1reate a local director and put some files into it usin" the cp command$ (ur example user ma

    have a situation lie the followin"7

      someoneanynode:hadoop# ls -* myfiles  myfiles:  file.txt file,.txt su'dir/

      myfiles/su'dir:  another+ile.txt  someoneanynode:hadoop#

    This entire myfiles/ director can be copied into HDF0 lie so7

      someoneanynode:hadoop# 'in/hadoop -put myfiles /user/myUsername  someoneanynode:hadoop# 'in/hadoop -ls  +ound items  /user/someone/myfiles ,00-01-, ,0:29 rxr-xr-x someonesupergroup  useranynode:hadoop 'in/hadoop -ls myfiles

      +ound 3 items  /user/someone/myfiles/file.txt 143 ,00-01-, ,0:29r-r--r-- someone supergroup  /user/someone/myfiles/file,.txt 1 ,00-01-, ,0:29r-r--r-- someone supergroup  /user/someone/myfiles/su'dir ,00-01-, ,0:29rxr-xr-x someone supergroup

    Thus demonstratin" that the tree was correctl uploaded recursivel$ You3ll note that in addition

    to the file path% ls also reports the number of replicas of each file that exist =the :*: in Lr *M>% the

    file si;e% upload time% permissions% and owner information$

Another synonym for -put is -copyFromLocal. The syntax and functionality are identical.

Retrieving data from HDFS

There are multiple ways to retrieve files from the distributed file system. One of the easiest is to use cat to display the contents of a file on stdout. (It can, of course, also be used to pipe the data into other applications or destinations.)


Step 1: Display data with cat.

If you have not already done so, upload some files into HDFS. In this example, we assume that a file named "foo" has been loaded into your home directory on HDFS.

  someone@anynode:hadoop$ bin/hadoop dfs -cat foo
  (contents of foo are displayed here)
  someone@anynode:hadoop$

Step 2: Copy a file from HDFS to the local file system.

The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.

  someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo
  someone@anynode:hadoop$ ls
  localFoo
  someone@anynode:hadoop$ cat localFoo
  (contents of foo are displayed here)

Like the put command, get will operate on directories in addition to individual files.

Shutting Down HDFS

If you want to shut down the HDFS functionality of your cluster (either because you do not want Hadoop occupying memory resources when it is not in use, or because you want to restart the cluster for upgrading, configuration changes, etc.), then this can be accomplished by logging in to the NameNode machine and running:

  someone@namenode:hadoop$ bin/stop-dfs.sh

This command must be performed by the same user who started HDFS with bin/start-dfs.sh.

    H-0S "ommand /eference

There are many more commands in bin/hadoop dfs than were demonstrated here, although these basic operations will get you started. Running bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.

A table of all operations is reproduced below. The following conventions are used for parameters:

• italics denote variables to be filled out by the user.


• "path" means any file or directory name.

• "path..." means one or more file or directory names.

• "file" means any filename.

• "src" and "dest" are path names in a directed operation.

• "localSrc" and "localDest" are paths as above, but on the local file system. All other file and path names refer to objects inside HDFS.


-mkdir path
    Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).

-setrep [-R] [-w] rep path
    Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)

-touchz path
    Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.

-test -[ezd] path
    Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.

-stat [format] path
    Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

-tail [-f] file
    Shows the last 1KB of file on stdout.

-chmod [-R] mode,mode,... path...
    Changes the file permissions associated with one or more objects identified by path...; performs the change recursively with -R.


DFSAdmin Command Reference

The bin/hadoop dfsadmin command provides administrative operations on HDFS. One of the states it manages is safemode, the read-only state the NameNode enters when it starts up. The NameNode waits until a specific percentage of the blocks are present and accounted-for; this is controlled in the configuration by the dfs.safemode.threshold.pct parameter. After this threshold is met, safemode is automatically exited, and HDFS allows normal operations. The bin/hadoop dfsadmin -safemode what command allows the user to manipulate safemode based on the value of what, described below:

• enter - Enters safemode

• leave - Forces the NameNode to exit safemode

• get - Returns a string indicating whether safemode is ON or OFF

• wait - Waits until safemode has exited and returns
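For instance, an administrator can inspect the current state directly, and a maintenance script can block until the file system is writable again (a sketch; the output lines are illustrative):

  someone@namenode:hadoop$ bin/hadoop dfsadmin -safemode get
  Safe mode is ON
  someone@namenode:hadoop$ bin/hadoop dfsadmin -safemode wait
  Safe mode is OFF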

    "hanging H-0S membership 9 When decommissionin" nodes% it is important to disconnectnodes from HDF0 "raduall to ensure that data is not lost$ 0ee the section on decommissionin" 

    later in this document for an explanation of the use of the -refreshodes dfsadmin command$

Upgrading HDFS versions - When upgrading from one version of Hadoop to the next, the file formats used by the NameNode and DataNodes may change. When you first start the new version of Hadoop on the cluster, you need to tell Hadoop to change the HDFS version (or else it will not mount), using the command: bin/start-dfs.sh -upgrade. It will then begin upgrading the HDFS version. The status of an ongoing upgrade operation can be queried with the bin/hadoop dfsadmin -upgradeProgress status command. More verbose information can be retrieved with bin/hadoop dfsadmin -upgradeProgress details. If the upgrade is blocked and you would like to force it to continue, use the command: bin/hadoop dfsadmin -upgradeProgress force. (Note: be sure you know what you are doing if you use this last command.)

When HDFS is upgraded, Hadoop retains backup information allowing you to downgrade to the original HDFS version in case you need to revert Hadoop versions. To back out the changes, stop the cluster, re-install the older version of Hadoop, and then use the command: bin/start-dfs.sh -rollback. It will restore the previous HDFS state.

Only one such archival copy can be kept at a time. Thus, after a few days of operation with the new version (when it is deemed stable), the archival copy can be removed with the command bin/hadoop dfsadmin -finalizeUpgrade. The rollback command cannot be issued after this point. This must be performed before a second Hadoop upgrade is allowed.
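Taken together, an upgrade cycle is roughly the following sketch (each command is described above; how long to wait before the final step is up to the administrator):

  someone@namenode:hadoop$ bin/start-dfs.sh -upgrade
  someone@namenode:hadoop$ bin/hadoop dfsadmin -upgradeProgress status
  (several days later, once the new version is deemed stable)
  someone@namenode:hadoop$ bin/hadoop dfsadmin -finalizeUpgrade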

Getting help - As with the dfs module, typing bin/hadoop dfsadmin -help cmd will provide more usage information about the particular command.

Using HDFS in MapReduce


The HDFS is a powerful companion to Hadoop MapReduce. By setting the fs.default.name configuration option to point to the NameNode (as was done above), Hadoop MapReduce jobs will automatically read their input from and write their output to HDFS.

Using HDFS Programmatically

This section provides a short tutorial on using the Java-based HDFS API. It will be based on the following code listing:

  import java.io.File;
  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.Path;

  public class HDFSHelloWorld {

    public static final String theFilename = "hello.txt";
    public static final String message = "Hello, world!\n";

    public static void main(String[] args) throws IOException {

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path filenamePath = new Path(theFilename);

      try {
        if (fs.exists(filenamePath)) {
          // remove the file first
          fs.delete(filenamePath);
        }

        FSDataOutputStream out = fs.create(filenamePath);
        out.writeUTF(message);
        out.close();

        FSDataInputStream in = fs.open(filenamePath);
        String messageIn = in.readUTF();
        System.out.print(messageIn);
        in.close();
      } catch (IOException ioe) {
        System.err.println("IOException during operation: " + ioe.toString());
        System.exit(1);
      }
    }
  }

    This pro"ram creates a file named hello.txt% writes a short messa"e into it% then reads it bac

    and prints it to the screen$ +f the file alread existed% it is deleted first$
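One plausible way to compile and run the listing (a sketch: the core jar name varies with the Hadoop release, and HADOOP_CLASSPATH is set so that bin/hadoop can find the compiled class in the current directory):

  someone@anynode:hadoop$ javac -classpath hadoop-0.18.0-core.jar HDFSHelloWorld.java
  someone@anynode:hadoop$ HADOOP_CLASSPATH=. bin/hadoop HDFSHelloWorld
  Hello, world!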

    First we "et a handle to an abstract +ileystem ob


Other operations such as copying, moving, and renaming are equally straightforward operations on Path objects.

HDFS Permissions and Security

Because HDFS files are write-once, a file that has already been closed cannot be appended to, nor can an existing file be written to, although the w bits may still be set.

Security permissions and ownership can be modified using the bin/hadoop dfs -chmod, -chown, and -chgrp operations described earlier in this document; they work in a similar fashion to the POSIX/Linux tools of the same name.
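For instance (a sketch reusing the example user and files from earlier in this module):

  someone@anynode:hadoop$ bin/hadoop dfs -chown someone:supergroup /user/someone/interestingFile.txt
  someone@anynode:hadoop$ bin/hadoop dfs -chmod 644 /user/someone/interestingFile.txt
  someone@anynode:hadoop$ bin/hadoop dfs -chmod -R 755 /user/someone/myfiles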

Determining identity - Identity is not authenticated formally with HDFS; it is taken from an extrinsic source. The Hadoop system is programmed to use the user's current login as their


Hadoop username (i.e., the equivalent of whoami). The user's current working group list (i.e., the output of groups) is used as the group list in Hadoop. HDFS itself does not verify that this username is genuine to the actual operator.

Superuser status - The username which was used to start the Hadoop process (i.e., the username who actually ran bin/start-all.sh or bin/start-dfs.sh) is acknowledged to be the superuser for HDFS. If this user interacts with HDFS, he does so with a special username superuser. This user's operations on HDFS never fail, regardless of permission bits set on the particular files he manipulates. If Hadoop is shutdown and restarted under a different username, that username is then bound to the superuser account.

Supergroup - There is also a special group named supergroup, whose membership is controlled by the configuration parameter dfs.permissions.supergroup.

Disabling permissions - By default, permissions are enabled on HDFS. The permission system can be disabled by setting the configuration option dfs.permissions to false. The owner, group, and permissions bits associated with each file and directory will still be preserved, but the HDFS process does not enforce them, except when using permissions-related operations such as -chmod.
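In conf/hadoop-site.xml this is the property:

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>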

Additional HDFS Tasks

Rebalancing Blocks

New nodes can be added to a cluster in a straightforward manner. On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed. Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster.

But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes. New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.

This can be achieved with the automatic balancer tool included with Hadoop. The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.) Smaller percentages make nodes more evenly balanced, but may require more time to achieve this state. Perfect balancing (0%) is unlikely to actually be achieved.

The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory. The script can be provided a balancing threshold percentage with the -threshold parameter; e.g., bin/start-balancer.sh -threshold 5. The balancer will automatically terminate when it achieves its goal, or when an error occurs, or it cannot find more candidate blocks to move to achieve better balance. The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.


    The balancin" script can be run either when nobod else is usin" the cluster =e$"$% overni"ht>% but

    can also be run in an :online: fashion while man other % the tas should be divided between

    multiple nodes to allow them all to share in the bandwidth re6uired for the process$ Hadoopincludes a tool called distcp for this purpose$

    @ invoin" 'in/hadoop distcp src dest% Hadoop will start a &ap'educe tas to distribute

    the burden of copin" a lar"e number of files from src to dest $ These two parameters ma specif

    a full )'2 for the the path to cop$ e$"$% Chdfs://omeameode:9000/foo/'ar/C andChdfs://&therameode:,000/'a;/Luux/C  will cop the children of /foo/'ar on one cluster

    to the director tree rooted at /'a;/Luux on the other$ The paths are assumed to be directories%

    and are copied recursivel$ 0. )'2s can be specified with s3://bucket-name/key $

Decommissioning Nodes

    +n addition to allowin" nodes to be added to the cluster on the fl% nodes can also be removed

    from a cluster while it is runnin"% without data loss$ @ut if nodes are simpl shut down :hard%:

    data loss ma occur as the ma hold the sole cop of one or more file blocs$

     Nodes must be retired on a schedule that allows HDF0 to ensure that no blocs are entirelreplicated within the to9be9retired set of DataNodes$

    HDF0 provides a decommissionin" feature which ensures that this process is performed safel$

    To use it% follow the steps below7

    Step 1: "luster configuration$ +f it is assumed that nodes ma be retired in our cluster% then before it is started% an e$cludes file must be confi"ured$ #dd a e named dfs.hosts.exclude 

    to our conf/hadoop-site.xml file$ The value associated with this e provides the full path to

    a file on the NameNode3s local file sstem which contains a list of machines which are not permitted to connect to HDF0$

    Step : -etermine hosts to decommission$ 5ach machine to be decommissioned should be

    added to the file identified b dfs.hosts.exclude% one per line$ This will prevent them from

    connectin" to the NameNode$

Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-


updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

Step 4: Shutdown nodes. After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.
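Taken together, steps 2 through 4 boil down to a short command sequence like the following sketch; the host name and excludes-file path are hypothetical examples.

    # Step 2: add each node to retire to the excludes file, one host name per line.
    echo "datanode07.example.com" >> /home/hadoop/excludes

    # Step 3: make the NameNode reread its configuration and begin decommissioning.
    bin/hadoop dfsadmin -refreshNodes

    # Step 4: monitor progress until the node's blocks have been re-replicated.
    bin/hadoop dfsadmin -report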

Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster.

Verifying File System Health

After decommissioning nodes, or at any other time, it can be useful to verify that HDFS is healthy. Hadoop provides the fsck ("file system check") command for this purpose, invoked as bin/hadoop fsck [path] [options]. The options may include two different types of options:

Action options specify what action should be taken when corrupted files are found. This can be -move, which moves corrupt files to /lost+found, or -delete, which deletes corrupted files.

Information options specify how verbose the tool should be in its report. The -files option will list all files it checks as it encounters them. This information can be further expanded by adding the -blocks option, which prints the list of blocks for each file. Adding -locations to these two options will then print the addresses of the DataNodes holding these blocks. Still more information can be retrieved by adding -racks to the end of this list, which then prints the rack topology information for each location. (See the next subsection for more information on configuring network rack awareness.) Note that the latter options do not imply the former; you must use them in conjunction with one another.


The -- argument is not required if you provide a path to start the check from, or if you specify another argument first such as -move.

By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option.
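The following sketch shows a few representative fsck invocations based on the options described above; the paths are only examples.

    # Check the whole file system and report files, blocks, locations, and racks:
    bin/hadoop fsck / -files -blocks -locations -racks

    # Check one directory and move any corrupt files found to /lost+found:
    bin/hadoop fsck /user/hadoop-user -move

    # List files that are currently open for write (skipped by default):
    bin/hadoop fsck / -openforwrite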

Rack Awareness

For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.

For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.

HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.

The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.

To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to the script as several separate command line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key.

Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.

Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).

The following example script performs rack identification based on IP addresses, given a hierarchical IP addressing scheme enforced by the network administrator. This may work directly


for simple installations; more complex network configurations may require a file- or table-based lookup process. Care should be taken in that case to keep the table up-to-date as nodes are physically relocated, etc. This script requires that the maximum number of arguments be set to 1.

#!/bin/bash
# Set rack id based on IP address.
#
# Assumes the network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.
#
# This is invoked with an IP address as its only argument.

# Get the IP address from the first (and only) command-line argument.
ipaddr=$1

# Select "x.y" (the second and third octets) and convert it to "x/y".
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}
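To wire a script like this into the cluster, point topology.script.file.name at it in conf/hadoop-site.xml. A minimal sketch follows; the script path is a hypothetical example, and the argument limit of 1 matches the one-address-at-a-time script above.

    <property>
      <name>topology.script.file.name</name>
      <value>/home/hadoop/rack-id.sh</value>
    </property>
    <property>
      <name>topology.script.number.args</name>
      <value>1</value>
    </property>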

HDFS Web Interface

HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default this is exposed on port 50070 on the NameNode. Accessing http://namenode:50070/ with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).

The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.
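For instance, to listen on all interfaces while keeping the default port, the entry in conf/hadoop-site.xml might look like the sketch below; 50070 is simply the default port value.

    <property>
      <name>dfs.http.address</name>
      <value>0.0.0.0:50070</value>
    </property>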

From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.

References

Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp 29-43. Bolton Landing, NY, USA. 2003. © 2003, ACM.

Borthakur, Dhruba. The Hadoop Distributed File System: Architecture and Design. © 2008, The Apache Software Foundation.

Hadoop DFS User Guide. © 2008, The Apache Software Foundation.



HDFS: Permissions User and Administrator Guide. © 2008, The Apache Software Foundation.

HDFS API Javadoc. © 2008, The Apache Software Foundation.

HDFS source code.

    Module 3: Getting Started With Hadoop

    Previous module  | Table of contents  |  Next module 

    Introduction

Hadoop is an open source implementation of the MapReduce platform and distributed file system, written in Java. This module explains the basics of how to begin using Hadoop to experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it.

    Goals for this Module:

• Set up a pre-configured Hadoop virtual machine

• Verify that you can connect to the virtual machine

• Understand tools available to help you use Hadoop

    "utline

1. Introduction

2. Goals for this Module

3. Outline

4. Prerequisites

5. A Virtual Machine Hadoop Environment

1. Installing VMware Player

2. Setting up the Virtual Environment

3. Virtual Machine User Accounts

4. Running a Hadoop Job

5. Accessing the VM via ssh

6. Shutting Down the VM



6. Getting Started With Eclipse

1. Downloading and Installing

2. Installing the Hadoop MapReduce Plugin

3. Making a Copy of Hadoop

4. Running Eclipse

5. Configuring the MapReduce Plugin

7. Interacting With HDFS

1. Using the Command Line

2. Using the MapReduce Plugin For Eclipse

8. Running a Sample Program

1. Creating the Project

2. Creating the Source Files

3. Launching the Job

9. References & Resources

10. Complete Tools List

Prerequisites

Developing for Hadoop requires a Java programming environment. You can download a Java Development Kit (JDK) for a wide variety of operating systems from http://java.sun.com; this tutorial uses version 6, which is the most current version at the time of this writing.
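If you are unsure whether a suitable JDK is already installed, a quick check from a terminal is sketched below.

    # Both commands should report a 1.6.x (Java 6) version string.
    java -version
    javac -version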

A Virtual Machine Hadoop Environment

This section explains how to configure a virtual machine to run Hadoop within your host computer. After installing the virtual machine software and the virtual machine image, you will learn how to log in and run jobs inside it. Hadoop can also be run directly on Linux machines with no additional software beyond Java. If you are interested in



doing this, there are instructions available on the Hadoop web site in the Getting Started document.

Running Hadoop on top of Windows requires installing cygwin, a Linux-like environment that runs within Windows. Hadoop works reasonably well on cygwin, but it is officially for "development purposes only." Hadoop on cygwin may be unstable, and installing cygwin itself can be cumbersome.

To aid developers in getting started easily with Hadoop, we have provided a virtual machine image containing a preconfigured Hadoop installation. The virtual machine image will run inside of a "sandbox" environment in which we can run another operating system. The OS inside the sandbox does not know that there is another operating environment outside of it; it acts as though it is on its own computer. This sandbox environment is referred to as the "guest machine" running a "guest operating system." The actual physical machine running the VM software is referred to as the "host machine" and it runs the "host operating system." The virtual machine provides other host-machine applications with the appearance that another physical computer is available on the same network. Applications running on the host machine see the VM as a separate machine with its own IP address, and can interact with the programs inside the VM in this fashion.

Figure 3.1: A virtual machine encapsulates one operating system within another. Applications in the VM believe they run on a separate physical host from other applications in the host machine.

Application developers do not need to use the virtual machine to run Hadoop. Developers on Linux typically use Hadoop in their native development environment, and Windows users often



install cygwin for Hadoop development. The virtual machine provided with this tutorial allows users a convenient alternative development platform with a minimum of configuration required.

Another advantage of the virtual machine is its easy reset functionality. If your experiments break the Hadoop configuration or render the operating system unusable, you can always simply copy the virtual machine image from the CD back to where you installed it on your computer, and start from a known-good state.

Our virtual machine will run Linux, and comes preconfigured to run Hadoop in pseudo-distributed mode on this system. (It is configured like a full distributed system, but is actually running on a single machine instance.) We can write Hadoop programs using editors and other applications of the host platform, and run them on our "cluster" consisting of a single virtual machine.


After unzipping the vmware folder zip file, double-click on the hadoop-appliance-0.18.0.vmx file in Windows Explorer to start the virtual machine.

Figure 3.2: When you start the virtual machine for the first time, tell VMware Player that you have copied the VM image.

When you start the virtual machine for the first time, VMware Player will recognize that the virtual machine image is not in the same location it used to be. You should inform VMware Player that you copied this virtual machine image. VMware Player will then generate new session identifiers for this instance of the virtual machine. If you later move the VM image to a different location on your own hard drive, you should tell VMware Player that you have moved the image.

If you ever corrupt the VM image (e.g., by inadvertently deleting or overwriting important files), you can always restore a pristine copy of the virtual machine by copying a fresh VM image off


of this tutorial CD. (So don't be shy about exploring! You can always reset it to a functioning state.)

After you select this option and click OK, the virtual machine should begin booting normally. You will see it perform the standard boot procedure for a Linux system. It will bind itself to an IP address on an unused network segment, and then display a prompt allowing a user to log in.

Virtual Machine User Accounts

The virtual machine comes preconfigured with two user accounts: "root" and "hadoop-user". The hadoop-user account has sudo permissions to perform system management functions, such as shutting down the virtual machine. The vast majority of your interaction with the virtual machine will be as hadoop-user.

Running a Hadoop Job

If the Hadoop daemons are not already running, start them from the Hadoop installation directory:

you@your-machine:~$ cd hadoop
you@your-machine:~/hadoop$ bin/start-all.sh

You will see a set of status messages appear as the services boot. If prompted whether it is okay to connect to the current host, type "yes". Try running an example program to ensure that Hadoop is correctly configured:

hadoop-user@hadoop-desk:~$ cd hadoop
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar hadoop-0.18.0-examples.jar pi 10 1000000

This should provide output that looks something like this:


Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
...
Wrote input for Map #10
Starting Job
INFO mapred.FileInputFormat: Total input paths to process : 10
INFO mapred.JobClient: Running job: job_200012300_0001
INFO mapred.JobClient:  map 0% reduce 0%
INFO mapred.JobClient:  map 10% reduce 0%
...
INFO mapred.JobClient:  map 100% reduce 100%
INFO mapred.JobClient: Job complete: job_200012300_0001
...
Job Finished in 25 seconds
Estimated value of PI is 3.14

This task runs a simulation to estimate the value of pi based on sampling. The test first wrote out a number of points to a list of files, one per map task. It then calculated an estimate of pi based on these points, in the MapReduce task itself. How MapReduce works and how to write such a program are discussed in the next module. The Hadoop client program you used to launch the pi test launched the job, reported its progress, and printed the results when the job completed.


Shutting Down the VM

When you are done with the virtual machine, you can turn it off by logging in as hadoop-user and typing sudo poweroff. The virtual machine will shut itself down in an orderly fashion and the window it runs in will disappear.

Getting Started With Eclipse

A powerful development environment for Java-based programming is Eclipse. Eclipse is a free, open-source IDE. It supports multiple languages through a plugin interface, with special attention paid to Java. Tools designed for working with Hadoop can be integrated into Eclipse, making it an attractive platform for Hadoop development. In this section we will review how to obtain, configure, and use Eclipse.

Downloading and Installing

Note: The most current release of Eclipse is called Ganymede. Our testing shows that Ganymede is currently incompatible with the Hadoop MapReduce plugin. The most recent version which worked properly with the Hadoop plugin is version 3.3.1, "Europa." To download Europa, do not visit the main Eclipse website; it can be found in the archive site http://archive.eclipse.org/eclipse/downloads/ as the "Archived Release (3.3.1)." The Eclipse website has several versions available for download; choose either "Eclipse Classic" or "Eclipse IDE for Java Developers."

Because it is written in Java, Eclipse is very cross-platform. Eclipse is available for Windows, Linux, and Mac OSX.

Installing Eclipse is very straightforward. Eclipse is packaged as a .zip file. Windows itself can natively unzip the compressed file into a directory. If you encounter errors using the Windows decompression tool (see the Eclipse SDK Known Issues page), try using a third-party unzip utility such as 7-zip or WinRAR.

After you have decompressed Eclipse into a directory, you can run it straight from that directory with no modifications or other "installation" procedure. You may want to move it into C:\Program Files\Eclipse to keep consistent with your other applications, but it can reside in the Desktop or elsewhere as well.

Installing the Hadoop MapReduce Plugin

Hadoop comes with a plugin for Eclipse that makes developing MapReduce programs easier. In the hadoop-0.18.0/contrib/eclipse-plugin directory on this CD, you will find a file named hadoop-0.18.0-eclipse-plugin.jar. Copy this into the plugins/ subdirectory of wherever you unzipped Eclipse.
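On Linux or cygwin, the copy might look like the sketch below; the CD mount point and Eclipse directory are assumptions that will differ on your system.

    cp /media/cdrom/hadoop-0.18.0/contrib/eclipse-plugin/hadoop-0.18.0-eclipse-plugin.jar \
       /opt/eclipse/plugins/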



Making a Copy of Hadoop

While we will be running MapReduce programs on the virtual machine, we will be compiling them on the host machine. The host therefore needs a copy of the Hadoop jars to compile your code against.


Step 3: Switch to the MapReduce perspective. In the upper-right corner of the workbench, click the "Open Perspective" button, as shown in Figure 3.4:

Figure 3.4: Changing the Perspective

Select "Other," followed by "Map/Reduce" in the window that opens up. At first, nothing may appear to change. In the menu, choose Window > Show View.


Expand the directory tree to see any files already there. If you inserted files into HDFS yourself, they will be visible in this tree.

Figure 3.6: Files Visible in the HDFS Viewer

Now that your system is configured, the following sections will introduce you to the basic features and verify that they work correctly.

Interacting With HDFS

The VMware image will expose a single-node HDFS instance for your use in MapReduce applications. If you are logged in to the virtual machine, you can interact with HDFS using the command-line tools described in Module 2. You can also manipulate HDFS through the MapReduce plugin.

Using the Command Line

An interesting MapReduce task will require some external data to process: log files, web crawl results, etc. Before you can begin processing with MapReduce, data must be loaded into its distributed file system. In Module 2, you learned how to copy files from the local file system into HDFS. But this will copy files from the local file system of the VM into HDFS - not from the file system of your host computer.

To load data into HDFS in the virtual machine, you have several options available to you:

1. scp the files to the virtual machine, and then use the bin/hadoop fs -put ... syntax to copy the files from the VM's local file system into HDFS;

2. pipe the data from the local machine into a put command reading from stdin;

3. or install the Hadoop tools on the host system and configure it to communicate directly with the guest instance.

We will review each of these in turn.



To load data into HDFS using the command line within the virtual machine, you can first send the data to the VM's local disk, then insert it into HDFS. You can send files to the VM using an scp client, such as the pscp component of PuTTY, or WinSCP.

scp will allow you to copy files from one machine to another over the network. The scp command takes two arguments, both of the form username@hostname:filename. The scp command itself is of the form scp source dest, where source and dest are formatted as described above. By default, it will assume that paths are on the local host, and should be accessed using the current username. You can override the username and hostname to perform remote copies.

So supposing you have a file named foo.txt, and you would like to copy this into the virtual machine which has IP address 192.168.190.128, you can perform this operation with the command:

  $ scp foo.txt hadoop-user@192.168.190.128:foo.txt

If you are using the pscp program, substitute pscp instead of scp above. A copy of the "regular" scp can be run under cygwin by downloading the OpenSSH package. pscp is a utility by the makers of PuTTY and does not require cygwin.

Note that since we did not specify a destination directory, it will go in /home/hadoop-user by default. To change the target directory, specify it after the hostname (e.g., hadoop-user@192.168.190.128:/some/dest/path/foo.txt.) You can also omit the destination filename, if you want it to be identical to the source filename. However, if you omit both the target directory and filename, you must not forget the colon (":") that follows the target hostname. Otherwise it will make a local copy of the file, with the name 192.168.190.128. An equivalent correct command to copy foo.txt to /home/hadoop-user on the remote machine is:

  $ scp foo.txt hadoop-user@192.168.190.128:

Windows users may be more inclined to use a GUI tool to perform scp commands. The free WinSCP program provides an FTP-like GUI interface over scp.

After you have copied files into the local disk of the virtual machine, you can log in to the virtual machine as hadoop-user and insert the files into HDFS using the standard Hadoop commands. For example,

hadoop-user@vm-instance:~/hadoop$ bin/hadoop dfs -put ~/foo.txt \
    /user/hadoop-user/input/foo.txt

A second option available to upload individual files to HDFS from the host machine is to echo the file contents into a put command running via ssh. e.g., assuming you have the cat program (which comes with Linux or cygwin) to echo the contents of a file to the terminal output, you can connect its output to the input of a put command running over ssh like so:



you@host-machine$ cat somefile | ssh hadoop-user@vm-ip-addr \
    "hadoop/bin/hadoop fs -put - destinationfile"

The - as an argument to the put command instructs the system to use stdin as its input file. This will copy somefile on the host machine to destinationfile in HDFS on the virtual machine.

Finally, if you are running either Linux or cygwin, you can copy the hadoop-0.18.0 directory on the CD to your local instance. You can then configure hadoop-site.xml to use the virtual machine as the default distributed file system (by setting the fs.default.name parameter). If you then run bin/hadoop fs -put ... commands on this machine (or any other hadoop commands, for that matter), they will interact with HDFS as served by the virtual machine. See the Hadoop Getting Started documentation for instructions on configuring a Hadoop installation, or Module 2 for a more thorough treatment.
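A minimal sketch of that host-side configuration follows; the IP address is the example VM address used earlier in this module, and the port must match the fs.default.name configured inside the virtual machine (9000 is only an illustrative value).

    <property>
      <name>fs.default.name</name>
      <value>hdfs://192.168.190.128:9000/</value>
    </property>

With this in place, a command such as bin/hadoop fs -put foo.txt /user/hadoop-user/input/foo.txt run on the host will write directly into the VM's HDFS.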

Using the MapReduce Plugin For Eclipse

An easier way to manipulate files in HDFS may be through the Eclipse plugin. In the DFS location viewer, right-click on any folder to see a list of actions available. You can create new subdirectories, upload individual files or whole subdirectories, or download files and directories to the local disk.

If /user/hadoop-user does not exist, create that first. Right-click on the top-level directory and select "Create New Directory". Type "user" and click OK. You will then need to refresh the current directory view by right-clicking and selecting "Refresh" from the pop-up menu. Repeat this process to create the "hadoop-user" directory under "user."

Now, prepare some local files to upload. Somewhere on your hard drive, create a directory named "input" and find some text files to copy there. In the DFS explorer, right-click the "hadoop-user" directory and click "Upload Directory to DFS." Select your new input folder and click OK. Eclipse will copy the files directly into HDFS, bypassing the local drive of the virtual machine. You may have to refresh the directory view to see your changes. You should now have a directory hierarchy containing the /user/hadoop-user/input directory, which has at least one text file in it.

Running a Sample Program

While we have not yet formally introduced the programming style for Hadoop, we can still test whether a MapReduce program will run on our Hadoop virtual machine. This section walks you through the steps required to verify this.

The program that we will run is a word count utility.