Training on Hadoop



    Apache Hadoop


    Module 1: Tutorial Introduction


    Introduction

Welcome to the Yahoo! Hadoop tutorial! This series of tutorial documents will walk you through many aspects of the Apache Hadoop system. You will be shown how to set up simple and advanced cluster configurations, use the distributed file system, and develop complex Hadoop MapReduce applications. Other related systems are also reviewed.

    Goals for this Module:

• Understand the scope of problems applicable to Hadoop

• Understand how Hadoop addresses these problems differently from other distributed systems.

    Outline

1. Introduction

2. Goals for this Module

3. Outline

4. Problem Scope

   1. Challenges at Large Scale

   2. Moore's Law

5. The Hadoop Approach

   1. Comparison to Existing Techniques

   2. Data Distribution


   3. MapReduce: Isolated Processes

   4. Flat Scalability

6. The Rest of the Tutorial

    Problem Scope

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems work with. Hundreds of gigabytes of data constitute the low end of Hadoop-scale. Actually Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, it is likely that the input data set will not even fit on a single computer's hard drive, much less in memory. So Hadoop includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This results in the problem being processed in parallel using all of the machines in the cluster and computes output results as efficiently as possible.

    "hallenges at #arge Scale

    Performin" lar"e9scale computation is difficult$ To wor with this volume of data re6uiresdistributin" parts of the problem to multiple machines to handle in parallel$ Whenever multiple

    machines are used in cooperation with one another% the probabilit of failures rises$ +n a sin"le9

    machine environment% failure is not somethin" that pro"ram desi"ners explicitl worr aboutver often7 if the machine has crashed% then there is no wa for the pro"ram to recover anwa$

    +n a distributed environment% however% partial failures are an expected and common occurrence$

     Networs can experience partial or total failure if switches and routers brea down$ Data ma not

    arrive at a particular point in time due to unexpected networ con"estion$ +ndividual computenodes ma overheat% crash% experience hard drive failures% or run out of memor or dis space$

    Data ma be corrupted% or maliciousl or improperl transmitted$ &ultiple implementations or

    versions of client software ma spea sli"htl different protocols from one another$ 1locs ma

     become desnchroni;ed% loc files ma not be released% parties involved in distributed atomictransactions ma lose their networ connections part9wa throu"h% etc$ +n each of these cases% the

    rest of the distributed sstem should be able to recover from the component failure or transienterror condition and continue to mae pro"ress$ (f course% actuall providin" such resilience is ama


Different distributed systems make different trade-offs, as they intend to be used for problems with other requirements (e.g., high security).

In addition to worrying about these sorts of bugs and challenges, there is also the fact that the compute hardware has finite resources available to it. The major hardware resources include processor time, memory, hard drive space, and network bandwidth.


If a failure occurs in the infrastructure, then determining how best to restart the lost computation and propagating this information about the change in network topology may be non-trivial to implement.

Moore's Law

So why use a distributed system at all? They seem like more trouble than they're worth. And with the fast pace of computer hardware design, it seems inevitable that single-chip hardware will be able to "grow up" to handle the larger volumes of data. After all, Moore's Law (named after Gordon Moore, the founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost. But trends in chip design are changing to face new realities. While we can still double the number of transistors per unit area at this pace, this does not necessarily result in faster single-threaded performance. New processors such as Intel Core 2 and Itanium 2 architectures now focus on embedding many smaller CPUs or "cores" onto the same physical device. This allows multiple threads to process twice as much data in parallel, but at the same speed at which they operated previously.

Even if hundreds or thousands of CPU cores are placed on a single machine, it would not be possible to deliver input data to these cores fast enough for processing. Individual hard drives can only sustain read speeds between 60-100 MB/second. These speeds have been increasing over time, but not at the same breakneck pace as processors. Optimistically assuming the upper limit of 100 MB/second, and assuming four independent I/O channels are available to the machine, that provides 400 MB of data every second. A 4 terabyte data set would thus take over 10,000 seconds to read--about three hours.
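To spell out the arithmetic behind that estimate (using the same round numbers as above): 4 terabytes is roughly 4,000,000 MB, and 4,000,000 MB / 400 MB per second = 10,000 seconds, or a little under three hours, spent doing nothing but reading the input.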


Data Distribution

In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition to this, each chunk is replicated across several machines, so that a single machine failure does not result in any data being unavailable. An active monitoring system then re-replicates the data in response to system failures which can result in partial storage. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.

Data is conceptually record-oriented in the Hadoop programming framework. Individual input files are broken into lines or into other formats specific to the application logic. Each process running on a node in the cluster then processes a subset of these records. The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data is operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation, allows Hadoop to achieve high data locality, which in turn results in high performance.

Figure 1.1: Data is distributed across nodes at load time.

MapReduce: Isolated Processes

Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from one another. While this sounds like a major limitation at first, it makes the whole framework much more reliable.


Communication between nodes is performed implicitly. Pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node. Hadoop internally manages all of the data transfer and cluster topology issues.

By restricting the communication between nodes, Hadoop makes the distributed system much more reliable. Individual node failures can be worked around by restarting tasks on other machines. Since user-level tasks do not communicate explicitly with one another, no messages need to be exchanged by user programs, nor do nodes need to roll back to pre-arranged checkpoints to partially restart the computation. The other workers continue to operate as though nothing went wrong, leaving the challenging aspects of partially restarting the program to the underlying Hadoop layer.

Flat Scalability

One of the major benefits of using Hadoop is its flat scalability curve. Executing Hadoop on a limited amount of data on a small number of nodes may not demonstrate particularly impressive performance, and other distributed frameworks may perform much better on two, four, or perhaps a dozen machines. Though the effort of coordinating work among a small number of machines may be better-performed by such systems, the price paid in performance and engineering effort (when adding more hardware as a result of increasing data volumes) increases non-linearly.

A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. This may involve having the program be rewritten several times; fundamental elements of its design may also put an upper bound on the scale to which the application can grow.

Hadoop, however, is specifically designed to have a very flat scalability curve. After a Hadoop program is written and functioning on ten nodes, very little--if any--work is required for that same program to run on a much larger amount of hardware. Orders of magnitude of growth can be managed with little re-work required for your applications. The underlying Hadoop platform will manage the data and hardware resources and provide dependable performance growth proportionate to the number of machines available.

The Rest of the Tutorial

This module of the tutorial has highlighted the major challenges of large-scale distributed computing and the approach Hadoop takes to them. The remaining modules cover the system in more detail:

• Module 2 describes HDFS, the distributed file system in which Hadoop stores vast quantities of information, how to configure HDFS, and how to use it to store and retrieve your data.


• Module 3 shows you how to get started setting up a Hadoop environment to experiment with. It reviews how to install a Hadoop virtual machine (included in this resource CD) so that you can run Hadoop regardless of what operating system you are running.

• Module 4 explains the Hadoop MapReduce programming model itself, and how to write some MapReduce programs.

• Module 5 goes into further detail about the specifics of Hadoop MapReduce, and how to use advanced features for more powerful control over a program's execution.

• Module 6 describes some other components of the Hadoop ecosystem which can add further capabilities to your distributed system.

• Module 7 describes how to configure Hadoop clusters of different sizes. It describes what particular parameters of Hadoop need to be tuned for setting up clusters of various sizes. In addition it describes the various performance monitoring tools available in Hadoop for monitoring the health of your cluster.

• And to expand upon the Pig section described in Module 6, a separate Pig Tutorial is included in this package at the end as Module 8.

Module 2: The Hadoop Distributed File System


    Introduction

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability to failure and high availability to very parallel applications. This module introduces the design of this distributed file system and instructions on how to operate it.

    Goals for this Module:

• Understand the basic design of HDFS and how it relates to basic distributed file system concepts

• Learn how to set up and use HDFS from the command line

• Learn how to use HDFS in your applications

    Outline


1. Introduction

2. Goals for this Module

3. Outline

4. Distributed File System Basics

5. Configuring HDFS

6. Interacting With HDFS

   1. Common Example Operations

   2. HDFS Command Reference

   3. DFSAdmin Command Reference

7. Using HDFS in MapReduce

8. Using HDFS Programmatically

9. HDFS Permissions and Security

10. Additional HDFS Tasks

   1. Rebalancing Blocks

   2. Copying Large Sets of Files

   3. Decommissioning Nodes

   4. Verifying File System Health

   5. Rack Awareness

11. HDFS Web Interface

12. References

Distributed File System Basics

A distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network. There are a number of distributed file systems that solve this problem in different ways.


NFS, the Network File System, is the most ubiquitous distributed file system. It is one of the oldest still in use. While its design is straightforward, it is also very constrained. NFS provides remote access to a single logical volume stored on a single machine. An NFS server makes a portion of its local file system visible to external clients. The clients can then mount this remote file system directly into their own Linux file system, and interact with it as though it were part of the local drive.

One of the primary advantages of this model is its transparency. Clients do not need to be particularly aware that they are working on files stored remotely. The existing standard library methods like open(), close(), fread(), etc. will work on files hosted over NFS.

But as a distributed file system, it is limited in its power. The files in an NFS volume all reside on a single machine. This means that it will only store as much information as can be stored in one machine, and does not provide any reliability guarantees if that machine goes down (e.g., by replicating the files to other servers). Finally, as all the data is stored on a single machine, all the clients must go to this machine to retrieve their data. This can overload the server if a large number of clients must be handled. Clients must also always copy the data to their local machines before they can operate on it.

    HDF0 is desi"ned to be robust to a number of the problems that other DF03s such as NF0 arevulnerable to$ +n particular7

    • HDF0 is desi"ned to store a ver lar"e amount of information =terabtes or petabtes>$

    This re6uires spreadin" the data across a lar"e number of machines$ +t also supports much

    lar"er file si;es than NF0$

    • HDF0 should store data reliabl$ +f individual machines in the cluster malfunction% data

    should still be available$

    • HDF0 should provide fast% scalable access to this information$ +t should be possible to

    serve a lar"er number of clients b simpl addin" more machines to the cluster$

    • HDF0 should inte"rate well with Hadoop &ap'educe% allowin" data to be read and

    computed upon locall when possible$

    @ut while HDF0 is ver scalable% its hi"h performance desi"n also restricts it to a particular class

    of applications? it is not as "eneral9purpose as NF0$ There are a lar"e number of additionaldecisions and trade9offs that were made with HDF0$ +n particular7

    • #pplications that use HDF0 are assumed to perform lon" se6uential streamin" reads from

    files$ HDF0 is optimi;ed to provide streamin" read performance? this comes at theexpense of random see times to arbitrar positions in files$

    • Data will be written to the HDF0 once and then read several times? updates to files after

    the have alread been closed are not supported$ =#n extension to Hadoop will provide


    support for appendin" new data to the ends of files? it is scheduled to be included in

    Hadoop A$*B but is not available et$>

    • Due to the lar"e si;e of files% and the se6uential nature of reads% the sstem does not

     provide a mechanism for local cachin" of data$ The overhead of cachin" is "reat enou"h

    that data should simpl be re9read from HDF0 source$

    • +ndividual machines are assumed to fail on a fre6uent basis% both permanentl and

    intermittentl$ The cluster must be able to withstand the complete failure of several

    machines% possibl man happenin" at the same time =e$"$% if a rac fails all to"ether>$While performance ma de"rade proportional to the number of machines lost% the sstem

    as a whole should not become overl slow% nor should information be lost$ Data

    replication strate"ies combat this problem$

    The desi"n of HDF0 is based on the desi"n of G0S% the -oo"le File 0stem$ +ts desi"n wasdescribed in a paper   published b -oo"le$

    HDF0 is a bloc9structured file sstem7 individual files are broen into blocs of a fixed si;e$

    These blocs are stored across a cluster of one or more machines with data stora"e capacit$

    +ndividual machines in the cluster are referred to as -ata3odes$ # file can be made of several blocs% and the are not necessaril stored on the same machine? the tar"et machines which hold

    each bloc are chosen randoml on a bloc9b9bloc basis$ Thus access to a file ma re6uire the

    cooperation of multiple machines% but supports file si;es far lar"er than a sin"le9machine DF0?individual files can re6uire more space than a sin"le hard drive could hold$

    +f several machines must be involved in the servin" of a file% then a file could be rendered

    unavailable b the loss of an one of those machines$ HDF0 combats this problem b replicatin"

    each bloc across a number of machines =.% b default>$

    Fi"ure ,$*7 DataNodes holdin" blocs of multiple files with a replication factor of ,$ The NameNode maps the filenames onto the bloc ids$

    &ost bloc9structured file sstems use a bloc si;e on the order of / or G I@$ @ contrast% the

    default bloc si;e in HDF0 is 8/&@ 99 orders of ma"nitude lar"er$ This allows HDF0 to decrease

    the amount of metadata stora"e re6uired per file =the list of blocs per file will be smaller as thesi;e of individual blocs increases>$ Furthermore% it allows for fast streamin" reads of data% b

    eepin" lar"e amounts of data se6uentiall laid out on the dis$ The conse6uence of this decision

    is that HDF0 expects to have ver lar"e files% and expects them to be read se6uentiall$ )nlie afile sstem such as NTF0 or 5JT% which see man ver small files% HDF0 expects to store amodest number of ver lar"e files7 hundreds of me"abtes% or "i"abtes each$ #fter all% a *AA

    &@ file is not even two full blocs$ Files on our computer ma also fre6uentl be accessed

    :randoml%: with applications cherr9picin" small amounts of information from severaldifferent locations in a file which are not se6uentiall laid out$ @ contrast% HDF0 expects to

    read a bloc start9to9finish for a pro"ram$ This maes it particularl useful to the &ap'educe


style of programming described in Module 4. That having been said, attempting to use HDFS as a general-purpose distributed file system for a diverse set of applications will be suboptimal.

Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system. Typing ls on a machine running a DataNode daemon will display the contents of the ordinary Linux file system being used to host the Hadoop services -- but it will not include any of the files stored inside the HDFS. This is because HDFS runs in a separate namespace, isolated from the contents of your local files. The files inside HDFS (or more accurately: the blocks that make them up) are stored in a particular directory managed by the DataNode service, but the files will be named only with block ids. You cannot interact with HDFS-stored files using ordinary Linux file modification tools (e.g., ls, cp, mv, etc). However, HDFS does come with its own utilities for file management, which act very similar to these familiar tools. A later section in this tutorial will introduce you to these commands and their operation.

It is important for this file system to store its metadata reliably. Furthermore, while the file data is accessed in a write once and read many model, the metadata structures (e.g., the names of files and directories) can be modified by a large number of clients concurrently. It is important that this information is never desynchronized. Therefore, it is all handled by a single machine, called the NameNode. The NameNode stores all the metadata for the file system. Because of the relatively low amount of metadata per file (it only tracks file names, permissions, and the locations of each block of each file), all of this information can be stored in the main memory of the NameNode machine, allowing fast access to the metadata.

To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file. These locations identify the DataNodes which hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel. The NameNode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.

Of course, NameNode information must be preserved even if the NameNode machine fails; there are multiple redundant systems that allow the NameNode to preserve the file system's metadata even if the NameNode itself crashes irrecoverably. NameNode failure is more severe for the cluster than DataNode failure. While individual DataNodes may crash and the entire cluster will continue to operate, the loss of the NameNode will render the cluster inaccessible until it is manually restored. Fortunately, as the NameNode's involvement is relatively minimal, the odds of it failing are considerably lower than the odds of an arbitrary DataNode failing at any given point in time.

A more thorough overview of the architectural decisions involved in the design and implementation of HDFS is given in the official Hadoop HDFS documentation. Before continuing in this tutorial, it is advisable that you read and understand the information presented there.

    "onfiguring H-0S


The HDFS for your cluster can be configured in a very short amount of time. First we will fill out the relevant sections of the Hadoop configuration file, then format the NameNode.

Cluster configuration

These instructions for cluster configuration assume that you have already downloaded and unzipped a copy of Hadoop. Module 3 discusses getting started with Hadoop for this tutorial. Module 7 discusses how to set up a larger cluster and provides preliminary setup instructions for Hadoop, including downloading prerequisite software.

    The HDF0 confi"uration is located in a set of J&2 files in the Hadoop confi"uration director?

    conf/ under the main Hadoop install director =where ou un;ipped Hadoop to>$ The

    conf/hadoop-defaults.xml file contains default values for ever parameter in Hadoop$ This

    file is considered read9onl$ You override this confi"uration b settin" new values in

    conf/hadoop-site.xml$ This file should be replicated consistentl across all machines in the

    cluster$ =+t is also possible% thou"h not advisable% to host it on NF0$>

    1onfi"uration settin"s are a set of e9value pairs of the format7

         property-name   property-value

     

    #ddin" the line true inside the property bod will prevent properties from

     bein" overridden b user applications$ This is useful for most sstem9wide confi"uration options$

    The followin" settin"s are necessar to confi"ure HDF07

    e& 'alue e%ample

    fs$default$name   protocol 7EE servername7 port  hdfs7EEalpha$milman$or"7BAAA

    dfs$data$dir    pathname EhomeEusernameEhdfsEdata

    dfs$name$dir    pathname EhomeEusernameEhdfsEname

    These settin"s are described individuall below7

    fs)default)name 9 This is the )'+ =protocol specifier% hostname% and port> that describes the NameNode for the cluster$ 5ach node in the sstem on which Hadoop is expected to operate

    needs to now the address of the NameNode$ The DataNode instances will re"ister with this NameNode% and mae their data available throu"h it$ +ndividual client pro"rams will connect to

    this address to retrieve the locations of actual file blocs$

    dfs)data)dir 9 This is the path on the local file sstem in which the DataNode instance should

    store its data$ +t is not necessar that all DataNode instances store their data under the same local

     path prefix% as the will all be on separate machines? it is acceptable that these machines are

    hetero"eneous$ However% it will simplif confi"uration if this director is standardi;ed


    throu"hout the sstem$ @ default% Hadoop will place this under /tmp$ This is fine for testin"

     purposes% but is an eas wa to lose actual data in a production sstem% and thus must be

    overridden$

    dfs)name)dir 9 This is the path on the local file sstem of the NameNode instance where the

     NameNode metadata is stored$ +t is onl used b the NameNode instance to find its information%and does not exist on the DataNodes$ The caveat above about /tmp applies to this as well? this

    settin" must be overridden in a production sstem$

    #nother confi"uration parameter% not listed above% is dfs)replication$ This is the default

    replication factor for each bloc of data in the file sstem$ For a production cluster% this should

    usuall be left at its default value of .$ =You are free to increase our replication factor% thou"hthis ma be unnecessar and use more space than is re6uired$ Fewer than three replicas impact

    the hi"h availabilit of information% and possibl the reliabilit of its stora"e$>

    The followin" information can be pasted into the hadoop-site.xml file for a sin"le9node

    confi"uration7    fs.default.name

      hdfs://your.server.name.com:9000      dfs.data.dir

      /home/username/hdfs/data      dfs.name.dir

      /home/username/hdfs/name 

    (f course% your.server.name.com needs to be chan"ed% as does username$ )sin" port BAAA for

    the NameNode is arbitrar$

    #fter copin" this information into our conf/hadoop-site.xml file% cop this to the conf/ 

    directories on all machines in the cluster$

    The master node needs to now the addresses of all the machines to use as DataNodes? thestartup scripts depend on this$ #lso in the conf/ director% edit the file slaves so that it contains

    a list of full96ualified hostnames for the slave instances% one host per line$ (n a multi9node

    setup% the master node =e$"$% localhost> is not usuall present in this file$

    Then mae the directories necessar7

      user!ach"achine# m$dir -p #%&"!/hdfs/data


  user@namenode$ mkdir -p $HOME/hdfs/name

The user who owns the Hadoop instances will need to have read and write access to each of these directories. It is not necessary for all users to have access to these directories. Set permissions with chmod as appropriate. In a large-scale environment, it is recommended that you create a user named "hadoop" on each node for the express purpose of owning and running Hadoop tasks. For a single individual's machine, it is perfectly acceptable to run Hadoop under your own username. It is not recommended that you run Hadoop as root.

Starting HDFS

Now we must format the file system that we just configured.
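A minimal sketch of the usual sequence (the prompt, hostname, and username are illustrative) is to format the NameNode once, then start the distributed file system:

  someone@namenode:hadoop$ bin/hadoop namenode -format
  someone@namenode:hadoop$ bin/start-dfs.sh

Formatting should only ever be performed on a new, empty cluster; bin/start-dfs.sh starts the NameNode locally and a DataNode on each machine listed in conf/slaves, and its counterpart bin/stop-dfs.sh (described later in this module) shuts HDFS down.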


Interacting With HDFS

The dfs module, also known as "FsShell," provides basic file manipulation operations. Their usage is introduced here.

A cluster is only useful if it contains data of interest. Therefore, the first operation to perform is loading information into the cluster. For purposes of this example, we will assume an example user named "someone" -- but substitute your own username where it makes sense. Also note that any operation on files in HDFS can be performed from any node with access to the cluster, whose conf/hadoop-site.xml is configured to set fs.default.name to your cluster's NameNode. We will call the fictional machine on which we are operating anynode. Commands are being run from the "hadoop" directory where you installed Hadoop. This may be /home/someone/src/hadoop on your machine, or /home/foo/hadoop on someone else's. These initial commands are centered around loading information into HDFS, checking that it's there, and getting information back out of HDFS.

Listing files

If we attempt to inspect HDFS, we will not find anything interesting there:

  someone@anynode:hadoop$ bin/hadoop dfs -ls
  someone@anynode:hadoop$

The "-ls" command returns silently. Without any arguments, -ls will attempt to show the contents of your "home" directory inside HDFS. Don't forget, this is not the same as /home/$USER (e.g., /home/someone) on the host machine (HDFS keeps a separate namespace from the local files). There is no concept of a "current working directory" or cd command in HDFS.

If you provide -ls with an argument, you may see some initial directory contents:

  someone@anynode:hadoop$ bin/hadoop dfs -ls /
  Found 2 items
  drwxr-xr-x   - hadoop supergroup    0 2008-09-20 19:10 /hadoop
  drwxr-xr-x   - hadoop supergroup    0 2008-09-20 20:08 /tmp

These entries are created by the system. This example output assumes that "hadoop" is the username under which the Hadoop daemons (NameNode, DataNode, etc) were started. "supergroup" is a special group whose membership includes the username under which the HDFS instances were started (e.g., "hadoop"). These directories exist to allow the Hadoop MapReduce system to move necessary data to the different


Any relative paths used as arguments to HDFS, Hadoop MapReduce, or other components of the system are assumed to be relative to this base directory (your HDFS home directory); files can also be addressed with explicit source and destination paths.

Step 1: Create your home directory if it does not already exist.

  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

If there is no /user directory, create that first. It will be automatically created later if necessary, but for instructive purposes, it makes sense to create it manually ourselves this time.

Then we are free to add our own home directory:

  someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone

Of course, replace /user/someone with /user/yourUserName.

Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:

  someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/

This copies /home/someone/interestingFile.txt from the local file system into /user/yourUserName/interestingFile.txt on HDFS.

Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:

  someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName
  someone@anynode:hadoop$ bin/hadoop dfs -ls

You should see a listing that starts with Found 1 items and then includes information about the file you inserted.

    The followin" table demonstrates example uses of the put command% and their effects7

    "ommand: Assuming: Outcome:

    'in/hadoop dfs -putfoo 'ar

     No fileEdirector named

    /user/#!*/'ar exists in

    HDF0

    )ploads local file foo to a file named/user/#!*/'ar

    'in/hadoop dfs -putfoo 'ar

    /user/#!*/'ar is a

    director)ploads local file foo to a file named/user/#!*/'ar/foo

    'in/hadoop dfs -putfoosomedir/somefile

    /user/#!*/somedir 

    does not exist in HDF0

    )ploads local file foo to a file named

    /user/#!*/somedir/somefile% creatin"

    the missin" director

    'in/hadoop dfs -putfoo 'ar

    /user/#!*/'ar is

    alread a file in HDF0

     No chan"e in HDF0% and an error is

    returned to the user$


When the put command operates on a file, it is all-or-nothing. Uploading a file into HDFS first copies the data onto the DataNodes. When they all acknowledge that they have received all the data and the file handle is closed, it is then made visible to the rest of the system. Thus based on the return value of the put command, you can be confident that a file has either been successfully uploaded, or has "fully failed;" you will never get into a state where a file is partially uploaded and the partial contents are visible externally, but the upload disconnected and did not complete the entire file contents. In a case like this, it will be as though no upload took place.

    Step 9: )ploadin" multiple files at once$ The put command is more powerful than movin" a

    sin"le file at a time$ +t can also be used to upload entire director trees into HDF0$

    1reate a local director and put some files into it usin" the cp command$ (ur example user ma

    have a situation lie the followin"7

      someoneanynode:hadoop# ls -* myfiles  myfiles:  file.txt file,.txt su'dir/

      myfiles/su'dir:  another+ile.txt  someoneanynode:hadoop#

    This entire myfiles/ director can be copied into HDF0 lie so7

      someoneanynode:hadoop# 'in/hadoop -put myfiles /user/myUsername  someoneanynode:hadoop# 'in/hadoop -ls  +ound items  /user/someone/myfiles ,00-01-, ,0:29 rxr-xr-x someonesupergroup  useranynode:hadoop 'in/hadoop -ls myfiles

      +ound 3 items  /user/someone/myfiles/file.txt 143 ,00-01-, ,0:29r-r--r-- someone supergroup  /user/someone/myfiles/file,.txt 1 ,00-01-, ,0:29r-r--r-- someone supergroup  /user/someone/myfiles/su'dir ,00-01-, ,0:29rxr-xr-x someone supergroup

    Thus demonstratin" that the tree was correctl uploaded recursivel$ You3ll note that in addition

    to the file path% ls also reports the number of replicas of each file that exist =the :*: in Lr *M>% the

    file si;e% upload time% permissions% and owner information$

Another synonym for -put is -copyFromLocal. The syntax and functionality are identical.

Retrieving data from HDFS

There are multiple ways to retrieve files from the distributed file system. One of the easiest is to use cat to display the contents of a file on stdout. (It can, of course, also be used to pipe the data into other applications or destinations.)


Step 1: Display data with cat.

If you have not already done so, upload some files into HDFS. In this example, we assume that a file named "foo" has been loaded into your home directory on HDFS.

  someone@anynode:hadoop$ bin/hadoop dfs -cat foo
  (contents of foo are displayed here)
  someone@anynode:hadoop$

Step 2: Copy a file from HDFS to the local file system.

The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.

  someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo
  someone@anynode:hadoop$ ls
  localFoo
  someone@anynode:hadoop$ cat localFoo
  (contents of foo are displayed here)

Like the put command, get will operate on directories in addition to individual files.

Shutting Down HDFS

If you want to shut down the HDFS functionality of your cluster (either because you do not want Hadoop occupying memory resources when it is not in use, or because you want to restart the cluster for upgrading, configuration changes, etc.), then this can be accomplished by logging in to the NameNode machine and running:

  someone@namenode:hadoop$ bin/stop-dfs.sh

This command must be performed by the same user who started HDFS with bin/start-dfs.sh.

    H-0S "ommand /eference

There are many more commands in bin/hadoop dfs than were demonstrated here, although these basic operations will get you started. Running bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system. Furthermore, bin/hadoop dfs -help commandName will display a short usage summary for the operation in question, if you are stuck.

A table of all operations is reproduced below. The following conventions are used for parameters:

• italics denote variables to be filled out by the user.


• "path" means any file or directory name.

• "path..." means one or more file or directory names.

• "file" means any filename.

• "src" and "dest" are path names in a directed operation.

• "localSrc" and "localDest" are paths as above, but on the local file system. All other file and path names refer to objects inside HDFS.


-mkdir path
    Creates a directory named path in HDFS. Creates any parent directories in path that are missing (e.g., like mkdir -p in Linux).

-setrep [-R] [-w] rep path
    Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)

-touchz path
    Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.

-test -[ezd] path
    Returns 1 if path exists; has zero length; or is a directory, or 0 otherwise.

-stat [format] path
    Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

-tail [-f] file
    Shows the last 1KB of file on stdout.

-chmod [-R] mode,mode,... path...
    Changes the file permissions associated with one or more objects identified by path...; performs the change recursively with -R.


DFSAdmin Command Reference

The bin/hadoop dfsadmin command provides administrative operations on HDFS. One of the states it manages is safemode, the read-only state the NameNode enters when it starts up. The NameNode waits until a specific percentage of the blocks are present and accounted-for; this is controlled in the configuration by the dfs.safemode.threshold.pct parameter. After this threshold is met, safemode is automatically exited, and HDFS allows normal operations. The bin/hadoop dfsadmin -safemode what command allows the user to manipulate safemode based on the value of what, described below:

• enter - Enters safemode

• leave - Forces the NameNode to exit safemode

• get - Returns a string indicating whether safemode is ON or OFF

• wait - Waits until safemode has exited and returns
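For instance, an administrator can inspect the current state directly, and a maintenance script can block until the file system is writable again (a sketch; the output lines are illustrative):

  someone@namenode:hadoop$ bin/hadoop dfsadmin -safemode get
  Safe mode is ON
  someone@namenode:hadoop$ bin/hadoop dfsadmin -safemode wait
  Safe mode is OFF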

    "hanging H-0S membership 9 When decommissionin" nodes% it is important to disconnectnodes from HDF0 "raduall to ensure that data is not lost$ 0ee the section on decommissionin" 

    later in this document for an explanation of the use of the -refreshodes dfsadmin command$

Upgrading HDFS versions - When upgrading from one version of Hadoop to the next, the file formats used by the NameNode and DataNodes may change. When you first start the new version of Hadoop on the cluster, you need to tell Hadoop to change the HDFS version (or else it will not mount), using the command: bin/start-dfs.sh -upgrade. It will then begin upgrading the HDFS version. The status of an ongoing upgrade operation can be queried with the bin/hadoop dfsadmin -upgradeProgress status command. More verbose information can be retrieved with bin/hadoop dfsadmin -upgradeProgress details. If the upgrade is blocked and you would like to force it to continue, use the command: bin/hadoop dfsadmin -upgradeProgress force. (Note: be sure you know what you are doing if you use this last command.)

When HDFS is upgraded, Hadoop retains backup information allowing you to downgrade to the original HDFS version in case you need to revert Hadoop versions. To back out the changes, stop the cluster, re-install the older version of Hadoop, and then use the command: bin/start-dfs.sh -rollback. It will restore the previous HDFS state.

Only one such archival copy can be kept at a time. Thus, after a few days of operation with the new version (when it is deemed stable), the archival copy can be removed with the command bin/hadoop dfsadmin -finalizeUpgrade. The rollback command cannot be issued after this point. This must be performed before a second Hadoop upgrade is allowed.
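Taken together, an upgrade cycle is roughly the following sketch (each command is described above; how long to wait before the final step is up to the administrator):

  someone@namenode:hadoop$ bin/start-dfs.sh -upgrade
  someone@namenode:hadoop$ bin/hadoop dfsadmin -upgradeProgress status
  (several days later, once the new version is deemed stable)
  someone@namenode:hadoop$ bin/hadoop dfsadmin -finalizeUpgrade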

Getting help - As with the dfs module, typing bin/hadoop dfsadmin -help cmd will provide more usage information about the particular command.

Using HDFS in MapReduce


The HDFS is a powerful companion to Hadoop MapReduce. By setting the fs.default.name configuration option to point to the NameNode (as was done above), Hadoop MapReduce jobs will automatically read their input from and write their output to HDFS.

Using HDFS Programmatically

This section provides a short tutorial on using the Java-based HDFS API. It will be based on the following code listing:

  import java.io.File;
  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.Path;

  public class HDFSHelloWorld {

    public static final String theFilename = "hello.txt";
    public static final String message = "Hello, world!\n";

    public static void main(String[] args) throws IOException {

      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path filenamePath = new Path(theFilename);

      try {
        if (fs.exists(filenamePath)) {
          // remove the file first
          fs.delete(filenamePath);
        }

        FSDataOutputStream out = fs.create(filenamePath);
        out.writeUTF(message);
        out.close();

        FSDataInputStream in = fs.open(filenamePath);
        String messageIn = in.readUTF();
        System.out.print(messageIn);
        in.close();
      } catch (IOException ioe) {
        System.err.println("IOException during operation: " + ioe.toString());
        System.exit(1);
      }
    }
  }

    This pro"ram creates a file named hello.txt% writes a short messa"e into it% then reads it bac

    and prints it to the screen$ +f the file alread existed% it is deleted first$
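One plausible way to compile and run the listing (a sketch: the core jar name varies with the Hadoop release, and HADOOP_CLASSPATH is set so that bin/hadoop can find the compiled class in the current directory):

  someone@anynode:hadoop$ javac -classpath hadoop-0.18.0-core.jar HDFSHelloWorld.java
  someone@anynode:hadoop$ HADOOP_CLASSPATH=. bin/hadoop HDFSHelloWorld
  Hello, world!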

    First we "et a handle to an abstract +ileystem ob


Other operations such as copying, moving, and renaming are equally straightforward operations on Path objects.

HDFS Permissions and Security

Because HDFS files are write-once, a file that has already been closed cannot be appended to, nor can an existing file be written to, although the w bits may still be set.

Security permissions and ownership can be modified using the bin/hadoop dfs -chmod, -chown, and -chgrp operations described earlier in this document; they work in a similar fashion to the POSIX/Linux tools of the same name.
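For instance (a sketch reusing the example user and files from earlier in this module):

  someone@anynode:hadoop$ bin/hadoop dfs -chown someone:supergroup /user/someone/interestingFile.txt
  someone@anynode:hadoop$ bin/hadoop dfs -chmod 644 /user/someone/interestingFile.txt
  someone@anynode:hadoop$ bin/hadoop dfs -chmod -R 755 /user/someone/myfiles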

Determining identity - Identity is not authenticated formally with HDFS; it is taken from an extrinsic source. The Hadoop system is programmed to use the user's current login as their


Hadoop username (i.e., the equivalent of whoami). The user's current working group list (i.e., the output of groups) is used as the group list in Hadoop. HDFS itself does not verify that this username is genuine to the actual operator.

Superuser status - The username which was used to start the Hadoop process (i.e., the username who actually ran bin/start-all.sh or bin/start-dfs.sh) is acknowledged to be the superuser for HDFS. If this user interacts with HDFS, he does so with a special username superuser. This user's operations on HDFS never fail, regardless of permission bits set on the particular files he manipulates. If Hadoop is shutdown and restarted under a different username, that username is then bound to the superuser account.

Supergroup - There is also a special group named supergroup, whose membership is controlled by the configuration parameter dfs.permissions.supergroup.

Disabling permissions - By default, permissions are enabled on HDFS. The permission system can be disabled by setting the configuration option dfs.permissions to false. The owner, group, and permissions bits associated with each file and directory will still be preserved, but the HDFS process does not enforce them, except when using permissions-related operations such as -chmod.
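In conf/hadoop-site.xml this is the property:

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>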

Additional HDFS Tasks

Rebalancing Blocks

New nodes can be added to a cluster in a straightforward manner. On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed. Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster.

But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes. New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.

This can be achieved with the automatic balancer tool included with Hadoop. The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.) Smaller percentages make nodes more evenly balanced, but may require more time to achieve this state. Perfect balancing (0%) is unlikely to actually be achieved.

The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory. The script can be provided a balancing threshold percentage with the -threshold parameter; e.g., bin/start-balancer.sh -threshold 5. The balancer will automatically terminate when it achieves its goal, or when an error occurs, or it cannot find more candidate blocks to move to achieve better balance. The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.


    The balancin" script can be run either when nobod else is usin" the cluster =e$"$% overni"ht>% but

    can also be run in an :online: fashion while man other % the tas should be divided between

    multiple nodes to allow them all to share in the bandwidth re6uired for the process$ Hadoopincludes a tool called distcp for this purpose$

    @ invoin" 'in/hadoop distcp src dest% Hadoop will start a &ap'educe tas to distribute

    the burden of copin" a lar"e number of files from src to dest $ These two parameters ma specif

    a full )'2 for the the path to cop$ e$"$% Chdfs://omeameode:9000/foo/'ar/C andChdfs://&therameode:,000/'a;/Luux/C  will cop the children of /foo/'ar on one cluster

    to the director tree rooted at /'a;/Luux on the other$ The paths are assumed to be directories%

    and are copied recursivel$ 0. )'2s can be specified with s3://bucket-name/key $

Decommissioning Nodes

    +n addition to allowin" nodes to be added to the cluster on the fl% nodes can also be removed

    from a cluster while it is runnin"% without data loss$ @ut if nodes are simpl shut down :hard%:

    data loss ma occur as the ma hold the sole cop of one or more file blocs$

     Nodes must be retired on a schedule that allows HDF0 to ensure that no blocs are entirelreplicated within the to9be9retired set of DataNodes$

    HDF0 provides a decommissionin" feature which ensures that this process is performed safel$

    To use it% follow the steps below7

    Step 1: "luster configuration$ +f it is assumed that nodes ma be retired in our cluster% then before it is started% an e$cludes file must be confi"ured$ #dd a e named dfs.hosts.exclude 

    to our conf/hadoop-site.xml file$ The value associated with this e provides the full path to

    a file on the NameNode3s local file sstem which contains a list of machines which are not permitted to connect to HDF0$

    Step : -etermine hosts to decommission$ 5ach machine to be decommissioned should be

    added to the file identified b dfs.hosts.exclude% one per line$ This will prevent them from

    connectin" to the NameNode$

Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-


updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

Step 4: Shutdown nodes. After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.
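Taken together, steps 2 through 4 boil down to a short command sequence like the following sketch; the host name and excludes-file path are hypothetical examples.

    # Step 2: add each node to retire to the excludes file, one host name per line.
    echo "datanode07.example.com" >> /home/hadoop/excludes

    # Step 3: make the NameNode reread its configuration and begin decommissioning.
    bin/hadoop dfsadmin -refreshNodes

    # Step 4: monitor progress until the node's blocks have been re-replicated.
    bin/hadoop dfsadmin -report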

Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster.

Verifying File System Health

After decommissioning nodes, or at any other time, it can be useful to verify that HDFS is healthy. Hadoop provides the fsck ("file system check") command for this purpose, invoked as bin/hadoop fsck [path] [options]. The options may include two different types of options:

Action options specify what action should be taken when corrupted files are found. This can be -move, which moves corrupt files to /lost+found, or -delete, which deletes corrupted files.

Information options specify how verbose the tool should be in its report. The -files option will list all files it checks as it encounters them. This information can be further expanded by adding the -blocks option, which prints the list of blocks for each file. Adding -locations to these two options will then print the addresses of the DataNodes holding these blocks. Still more information can be retrieved by adding -racks to the end of this list, which then prints the rack topology information for each location. (See the next subsection for more information on configuring network rack awareness.) Note that the latter options do not imply the former; you must use them in conjunction with one another.


The -- argument is not required if you provide a path to start the check from, or if you specify another argument first such as -move.

By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option.
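The following sketch shows a few representative fsck invocations based on the options described above; the paths are only examples.

    # Check the whole file system and report files, blocks, locations, and racks:
    bin/hadoop fsck / -files -blocks -locations -racks

    # Check one directory and move any corrupt files found to /lost+found:
    bin/hadoop fsck /user/hadoop-user -move

    # List files that are currently open for write (skipped by default):
    bin/hadoop fsck / -openforwrite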

Rack Awareness

For small clusters in which all servers are connected by a single switch, there are only two levels of locality: "on-machine" and "off-machine." When loading data from a DataNode's local drive into HDFS, the NameNode will schedule one copy to go into the local DataNode, and will pick two other machines at random from the cluster.

For larger Hadoop installations which span multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.

HDFS can be made rack-aware by the use of a script which allows the master node to map the network topology of the cluster. While alternate configuration strategies can be used, the default implementation allows you to provide an executable script which returns the "rack address" of each of a list of IP addresses.

The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. The input and output order must be consistent.

To set the rack mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml. This provides a command to run to return a rack id; it must be an executable script or program. By default, Hadoop will attempt to send a set of IP addresses to the script as several separate command line arguments. You can control the maximum acceptable number of arguments with the topology.script.number.args key.

Rack ids in Hadoop are hierarchical and look like path names. By default, every node has a rack id of /default-rack. You can set rack ids for nodes to any arbitrary path, e.g., /foo/bar-rack. Path elements further to the left are higher up the tree. Thus a reasonable structure for a large installation may be /top-switch-name/rack-name.

Hadoop rack ids are not currently expressive enough to handle an unusual routing topology such as a 3-d torus; they assume that each node is connected to a single switch which in turn has a single upstream switch. This is not usually a problem, however. Actual packet routing will be directed using the topology discovered by or set in switches and routers. The Hadoop rack ids will be used to find "near" and "far" nodes for replica placement (and in 0.17, MapReduce task placement).

The following example script performs rack identification based on IP addresses, given a hierarchical IP addressing scheme enforced by the network administrator. This may work directly


for simple installations; more complex network configurations may require a file- or table-based lookup process. Care should be taken in that case to keep the table up-to-date as nodes are physically relocated, etc. This script requires that the maximum number of arguments be set to 1.

#!/bin/bash
# Set rack id based on IP address.
#
# Assumes the network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.
#
# This is invoked with an IP address as its only argument.

# Get the IP address from the first (and only) command-line argument.
ipaddr=$1

# Select "x.y" (the second and third octets) and convert it to "x/y".
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}
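To wire a script like this into the cluster, point topology.script.file.name at it in conf/hadoop-site.xml. A minimal sketch follows; the script path is a hypothetical example, and the argument limit of 1 matches the one-address-at-a-time script above.

    <property>
      <name>topology.script.file.name</name>
      <value>/home/hadoop/rack-id.sh</value>
    </property>
    <property>
      <name>topology.script.number.args</name>
      <value>1</value>
    </property>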

HDFS Web Interface

HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations. By default this is exposed on port 50070 on the NameNode. Accessing http://namenode:50070/ with a web browser will return a page containing overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).

The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml. It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.
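For instance, to listen on all interfaces while keeping the default port, the entry in conf/hadoop-site.xml might look like the sketch below; 50070 is simply the default port value.

    <property>
      <name>dfs.http.address</name>
      <value>0.0.0.0:50070</value>
    </property>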

From this interface, you can browse HDFS itself with a basic file-browser interface. Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075. Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.

References

Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp 29-43. Bolton Landing, NY, USA. 2003. © 2003, ACM.

Borthakur, Dhruba. The Hadoop Distributed File System: Architecture and Design. © 2008, The Apache Software Foundation.

Hadoop DFS User Guide. © 2008, The Apache Software Foundation.



HDFS: Permissions User and Administrator Guide. © 2008, The Apache Software Foundation.

HDFS API Javadoc. © 2008, The Apache Software Foundation.

HDFS source code.

    Module 3: Getting Started With Hadoop

    Previous module  | Table of contents  |  Next module 

    Introduction

Hadoop is an open source implementation of the MapReduce platform and distributed file system, written in Java. This module explains the basics of how to begin using Hadoop to experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it.

    Goals for this Module:

• Set up a pre-configured Hadoop virtual machine

• Verify that you can connect to the virtual machine

• Understand tools available to help you use Hadoop

    "utline

1. Introduction

2. Goals for this Module

3. Outline

4. Prerequisites

5. A Virtual Machine Hadoop Environment

1. Installing VMware Player

2. Setting up the Virtual Environment

3. Virtual Machine User Accounts

4. Running a Hadoop Job

5. Accessing the VM via ssh

6. Shutting Down the VM



6. Getting Started With Eclipse

1. Downloading and Installing

2. Installing the Hadoop MapReduce Plugin

3. Making a Copy of Hadoop

4. Running Eclipse

5. Configuring the MapReduce Plugin

7. Interacting With HDFS

1. Using the Command Line

2. Using the MapReduce Plugin For Eclipse

8. Running a Sample Program

1. Creating the Project

2. Creating the Source Files

3. Launching the Job

9. References & Resources

10. Complete Tools List

Prerequisites

Developing for Hadoop requires a Java programming environment. You can download a Java Development Kit (JDK) for a wide variety of operating systems from http://java.sun.com; this tutorial uses version 6, which is the most current version at the time of this writing.
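If you are unsure whether a suitable JDK is already installed, a quick check from a terminal is sketched below.

    # Both commands should report a 1.6.x (Java 6) version string.
    java -version
    javac -version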

A Virtual Machine Hadoop Environment

This section explains how to configure a virtual machine to run Hadoop within your host computer. After installing the virtual machine software and the virtual machine image, you will learn how to log in and run jobs inside it. Hadoop can also be run directly on Linux machines with no additional software beyond Java. If you are interested in



doing this, there are instructions available on the Hadoop web site in the Getting Started document.

Running Hadoop on top of Windows requires installing cygwin, a Linux-like environment that runs within Windows. Hadoop works reasonably well on cygwin, but it is officially for "development purposes only." Hadoop on cygwin may be unstable, and installing cygwin itself can be cumbersome.

To aid developers in getting started easily with Hadoop, we have provided a virtual machine image containing a preconfigured Hadoop installation. The virtual machine image will run inside of a "sandbox" environment in which we can run another operating system. The OS inside the sandbox does not know that there is another operating environment outside of it; it acts as though it is on its own computer. This sandbox environment is referred to as the "guest machine" running a "guest operating system." The actual physical machine running the VM software is referred to as the "host machine" and it runs the "host operating system." The virtual machine provides other host-machine applications with the appearance that another physical computer is available on the same network. Applications running on the host machine see the VM as a separate machine with its own IP address, and can interact with the programs inside the VM in this fashion.

Figure 3.1: A virtual machine encapsulates one operating system within another. Applications in the VM believe they run on a separate physical host from other applications in the host machine.

Application developers do not need to use the virtual machine to run Hadoop. Developers on Linux typically use Hadoop in their native development environment, and Windows users often



install cygwin for Hadoop development. The virtual machine provided with this tutorial allows users a convenient alternative development platform with a minimum of configuration required.

Another advantage of the virtual machine is its easy reset functionality. If your experiments break the Hadoop configuration or render the operating system unusable, you can always simply copy the virtual machine image from the CD back to where you installed it on your computer, and start from a known-good state.

Our virtual machine will run Linux, and comes preconfigured to run Hadoop in pseudo-distributed mode on this system. (It is configured like a full distributed system, but is actually running on a single machine instance.) We can write Hadoop programs using editors and other applications of the host platform, and run them on our "cluster" consisting of a single virtual machine.


After unzipping the vmware folder zip file, double-click on the hadoop-appliance-0.18.0.vmx file in Windows Explorer to start the virtual machine.

Figure 3.2: When you start the virtual machine for the first time, tell VMware Player that you have copied the VM image.

When you start the virtual machine for the first time, VMware Player will recognize that the virtual machine image is not in the same location it used to be. You should inform VMware Player that you copied this virtual machine image. VMware Player will then generate new session identifiers for this instance of the virtual machine. If you later move the VM image to a different location on your own hard drive, you should tell VMware Player that you have moved the image.

If you ever corrupt the VM image (e.g., by inadvertently deleting or overwriting important files), you can always restore a pristine copy of the virtual machine by copying a fresh VM image off


of this tutorial CD. (So don't be shy about exploring! You can always reset it to a functioning state.)

After you select this option and click OK, the virtual machine should begin booting normally. You will see it perform the standard boot procedure for a Linux system. It will bind itself to an IP address on an unused network segment, and then display a prompt allowing a user to log in.

Virtual Machine User Accounts

The virtual machine comes preconfigured with two user accounts: "root" and "hadoop-user". The hadoop-user account has sudo permissions to perform system management functions, such as shutting down the virtual machine. The vast majority of your interaction with the virtual machine will be as hadoop-user.

Running a Hadoop Job

If the Hadoop daemons are not already running, start them from the Hadoop installation directory:

you@your-machine:~$ cd hadoop
you@your-machine:~/hadoop$ bin/start-all.sh

You will see a set of status messages appear as the services boot. If prompted whether it is okay to connect to the current host, type "yes". Try running an example program to ensure that Hadoop is correctly configured:

hadoop-user@hadoop-desk:~$ cd hadoop
hadoop-user@hadoop-desk:~/hadoop$ bin/hadoop jar hadoop-0.18.0-examples.jar pi 10 1000000

This should provide output that looks something like this:


Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
...
Wrote input for Map #10
Starting Job
INFO mapred.FileInputFormat: Total input paths to process : 10
INFO mapred.JobClient: Running job: job_200012300_0001
INFO mapred.JobClient:  map 0% reduce 0%
INFO mapred.JobClient:  map 10% reduce 0%
...
INFO mapred.JobClient:  map 100% reduce 100%
INFO mapred.JobClient: Job complete: job_200012300_0001
...
Job Finished in 25 seconds
Estimated value of PI is 3.14

This task runs a simulation to estimate the value of pi based on sampling. The test first wrote out a number of points to a list of files, one per map task. It then calculated an estimate of pi based on these points, in the MapReduce task itself. How MapReduce works and how to write such a program are discussed in the next module. The Hadoop client program you used to launch the pi test launched the job, reported its progress, and printed the results when the job completed.


Shutting Down the VM

When you are done with the virtual machine, you can turn it off by logging in as hadoop-user and typing sudo poweroff. The virtual machine will shut itself down in an orderly fashion and the window it runs in will disappear.

Getting Started With Eclipse

A powerful development environment for Java-based programming is Eclipse. Eclipse is a free, open-source IDE. It supports multiple languages through a plugin interface, with special attention paid to Java. Tools designed for working with Hadoop can be integrated into Eclipse, making it an attractive platform for Hadoop development. In this section we will review how to obtain, configure, and use Eclipse.

Downloading and Installing

Note: The most current release of Eclipse is called Ganymede. Our testing shows that Ganymede is currently incompatible with the Hadoop MapReduce plugin. The most recent version which worked properly with the Hadoop plugin is version 3.3.1, "Europa." To download Europa, do not visit the main Eclipse website; it can be found in the archive site http://archive.eclipse.org/eclipse/downloads/ as the "Archived Release (3.3.1)." The Eclipse website has several versions available for download; choose either "Eclipse Classic" or "Eclipse IDE for Java Developers."

Because it is written in Java, Eclipse is very cross-platform. Eclipse is available for Windows, Linux, and Mac OSX.

Installing Eclipse is very straightforward. Eclipse is packaged as a .zip file. Windows itself can natively unzip the compressed file into a directory. If you encounter errors using the Windows decompression tool (see the Eclipse SDK Known Issues page), try using a third-party unzip utility such as 7-zip or WinRAR.

After you have decompressed Eclipse into a directory, you can run it straight from that directory with no modifications or other "installation" procedure. You may want to move it into C:\Program Files\Eclipse to keep consistent with your other applications, but it can reside in the Desktop or elsewhere as well.

Installing the Hadoop MapReduce Plugin

Hadoop comes with a plugin for Eclipse that makes developing MapReduce programs easier. In the hadoop-0.18.0/contrib/eclipse-plugin directory on this CD, you will find a file named hadoop-0.18.0-eclipse-plugin.jar. Copy this into the plugins/ subdirectory of wherever you unzipped Eclipse.
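On Linux or cygwin, the copy might look like the sketch below; the CD mount point and Eclipse directory are assumptions that will differ on your system.

    cp /media/cdrom/hadoop-0.18.0/contrib/eclipse-plugin/hadoop-0.18.0-eclipse-plugin.jar \
       /opt/eclipse/plugins/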



Making a Copy of Hadoop

While we will be running MapReduce programs on the virtual machine, we will be compiling them on the host machine. The host therefore needs a copy of the Hadoop jars to compile your code against.


Step 3: Switch to the MapReduce perspective. In the upper-right corner of the workbench, click the "Open Perspective" button, as shown in Figure 3.4:

Figure 3.4: Changing the Perspective

Select "Other," followed by "Map/Reduce" in the window that opens up. At first, nothing may appear to change. In the menu, choose Window > Show View.


Expand the directory tree to see any files already there. If you inserted files into HDFS yourself, they will be visible in this tree.

Figure 3.6: Files Visible in the HDFS Viewer

Now that your system is configured, the following sections will introduce you to the basic features and verify that they work correctly.

Interacting With HDFS

The VMware image will expose a single-node HDFS instance for your use in MapReduce applications. If you are logged in to the virtual machine, you can interact with HDFS using the command-line tools described in Module 2. You can also manipulate HDFS through the MapReduce plugin.

Using the Command Line

An interesting MapReduce task will require some external data to process: log files, web crawl results, etc. Before you can begin processing with MapReduce, data must be loaded into its distributed file system. In Module 2, you learned how to copy files from the local file system into HDFS. But this will copy files from the local file system of the VM into HDFS - not from the file system of your host computer.

To load data into HDFS in the virtual machine, you have several options available to you:

1. scp the files to the virtual machine, and then use the bin/hadoop fs -put ... syntax to copy the files from the VM's local file system into HDFS;

2. pipe the data from the local machine into a put command reading from stdin;

3. or install the Hadoop tools on the host system and configure it to communicate directly with the guest instance.

We will review each of these in turn.



To load data into HDFS using the command line within the virtual machine, you can first send the data to the VM's local disk, then insert it into HDFS. You can send files to the VM using an scp client, such as the pscp component of PuTTY, or WinSCP.

scp will allow you to copy files from one machine to another over the network. The scp command takes two arguments, both of the form username@hostname:filename. The scp command itself is of the form scp source dest, where source and dest are formatted as described above. By default, it will assume that paths are on the local host, and should be accessed using the current username. You can override the username and hostname to perform remote copies.

So supposing you have a file named foo.txt, and you would like to copy this into the virtual machine which has IP address 192.168.190.128, you can perform this operation with the command:

  $ scp foo.txt hadoop-user@192.168.190.128:foo.txt

If you are using the pscp program, substitute pscp instead of scp above. A copy of the "regular" scp can be run under cygwin by downloading the OpenSSH package. pscp is a utility by the makers of PuTTY and does not require cygwin.

Note that since we did not specify a destination directory, it will go in /home/hadoop-user by default. To change the target directory, specify it after the hostname (e.g., hadoop-user@192.168.190.128:/some/dest/path/foo.txt.) You can also omit the destination filename, if you want it to be identical to the source filename. However, if you omit both the target directory and filename, you must not forget the colon (":") that follows the target hostname. Otherwise it will make a local copy of the file, with the name 192.168.190.128. An equivalent correct command to copy foo.txt to /home/hadoop-user on the remote machine is:

  $ scp foo.txt hadoop-user@192.168.190.128:

Windows users may be more inclined to use a GUI tool to perform scp commands. The free WinSCP program provides an FTP-like GUI interface over scp.

After you have copied files into the local disk of the virtual machine, you can log in to the virtual machine as hadoop-user and insert the files into HDFS using the standard Hadoop commands. For example,

hadoop-user@vm-instance:~/hadoop$ bin/hadoop dfs -put ~/foo.txt \
    /user/hadoop-user/input/foo.txt

A second option available to upload individual files to HDFS from the host machine is to echo the file contents into a put command running via ssh. e.g., assuming you have the cat program (which comes with Linux or cygwin) to echo the contents of a file to the terminal output, you can connect its output to the input of a put command running over ssh like so:



you@host-machine$ cat somefile | ssh hadoop-user@vm-ip-addr \
    "hadoop/bin/hadoop fs -put - destinationfile"

The - as an argument to the put command instructs the system to use stdin as its input file. This will copy somefile on the host machine to destinationfile in HDFS on the virtual machine.

Finally, if you are running either Linux or cygwin, you can copy the hadoop-0.18.0 directory on the CD to your local instance. You can then configure hadoop-site.xml to use the virtual machine as the default distributed file system (by setting the fs.default.name parameter). If you then run bin/hadoop fs -put ... commands on this machine (or any other hadoop commands, for that matter), they will interact with HDFS as served by the virtual machine. See the Hadoop Getting Started documentation for instructions on configuring a Hadoop installation, or Module 2 for a more thorough treatment.
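A minimal sketch of that host-side configuration follows; the IP address is the example VM address used earlier in this module, and the port must match the fs.default.name configured inside the virtual machine (9000 is only an illustrative value).

    <property>
      <name>fs.default.name</name>
      <value>hdfs://192.168.190.128:9000/</value>
    </property>

With this in place, a command such as bin/hadoop fs -put foo.txt /user/hadoop-user/input/foo.txt run on the host will write directly into the VM's HDFS.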

Using the MapReduce Plugin For Eclipse

An easier way to manipulate files in HDFS may be through the Eclipse plugin. In the DFS location viewer, right-click on any folder to see a list of actions available. You can create new subdirectories, upload individual files or whole subdirectories, or download files and directories to the local disk.

If /user/hadoop-user does not exist, create that first. Right-click on the top-level directory and select "Create New Directory". Type "user" and click OK. You will then need to refresh the current directory view by right-clicking and selecting "Refresh" from the pop-up menu. Repeat this process to create the "hadoop-user" directory under "user."

Now, prepare some local files to upload. Somewhere on your hard drive, create a directory named "input" and find some text files to copy there. In the DFS explorer, right-click the "hadoop-user" directory and click "Upload Directory to DFS." Select your new input folder and click OK. Eclipse will copy the files directly into HDFS, bypassing the local drive of the virtual machine. You may have to refresh the directory view to see your changes. You should now have a directory hierarchy containing the /user/hadoop-user/input directory, which has at least one text file in it.

Running a Sample Program

While we have not yet formally introduced the programming style for Hadoop, we can still test whether a MapReduce program will run on our Hadoop virtual machine. This section walks you through the steps required to verify this.

The program that we will run is a word count utility.