An Overview of Hadoop
Contributors: Hussain, Barak, Durai, Gerrit and Asif

Uploaded by asif-ali on 20-Jan-2015


DESCRIPTION

This presentation provides an overview of Hadoop and its implementation.

TRANSCRIPT

1. An Overview of Hadoop
Contributors: Hussain, Barak, Durai, Gerrit and Asif

2. Introductions
Target audience: programmers and developers who want to learn about Big Data and Hadoop.

3. What is Big Data?
A technology term for data that has become too large to be managed in a manner that previously worked normally.

4. Apache Hadoop
Is an open source, top-level Apache project based on Google's MapReduce whitepaper.
Is a popular project used by several large companies to process massive amounts of data.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.

5. Hadoop
Is designed to scale!
Uses commodity hardware and can be scaled out using several boxes.
Processes data in batches.
Can process petabyte-scale data.

6. Before you consider Hadoop, define your use case
Before using Hadoop or any other "big data" or "NoSQL" solution, consider defining your use case well.
Regular databases can solve most common use cases if scaled well.
For example, MySQL can handle hundreds of millions of records efficiently if scaled well.

7. Hadoop is best suited when
You have massive data: terabytes of data generated, or generating, every day, and you would like to process it for several group queries in a scalable manner.
You have billions of rows of data that need to be dissected for several different reports.

8. Hadoop, Components and Related Projects
HDFS: Hadoop Distributed File System.
MapReduce: Hadoop distributed execution framework.
Apache Pig: functional dataflow language for Hadoop M/R (Yahoo).
Hive: SQL-like language for Hadoop M/R (Facebook).
ZooKeeper, etc.

9. Our use case and previous solution
Our use case: store and process hundreds of millions of records a day, and run hundreds of queries summing and aggregating the data.
Along the line, we had evaluated different NoSQL technologies for various use cases (not necessarily the one above): MongoDB, Cassandra, Memcached, etc. were implemented for various use cases.

10. Original solution
A sharded MySQL system that could actually process hundreds of millions of records.

11. Issues faced
1. Not a known / preferred design approach.
2. Scalability issues, which we had to address ourselves.
3. Needed a solution that could handle billions of records per day instead of hundreds of millions.
4. Needed a truly proven, scalable solution.

12. Why Hadoop
A proven, open source, highly reliable, distributed data processing platform.
Met our use case of processing hundreds of millions of logs perfectly.
We tuned the deployment to process all data with a maximum latency of 30 minutes.

13. Getting Started with Hadoop
Installation, HDFS, Map/Reduce, Result.

14. Installation
Requirements: Linux; Hadoop 0.20.203.X (current stable version); Java 1.6.x.
Installed using RPM.
Configuration: single node or multiple nodes.

15. Installation Modes
Local (standalone): runs on a single node as a single Java process.
Pseudo-distributed: runs on a single node with separate Java processes.
Fully distributed: runs on a cluster of nodes with separate Java processes.

16. Related links
http://hadoop.apache.org/common/releases.html#News
http://pig.apache.org/docs/r0.7.0/setup.html
http://code.google.com/p/bigstreams/

17. Configurations
/etc/hadoop/conf/hdfs-site.xml
/etc/hadoop/conf/core-site.xml
/etc/hadoop/conf/mapred-site.xml
/etc/hadoop/conf/slaves

18. Commands
yum install hadoop-0.20                 # install Hadoop from the RPM repository
bin/start-all.sh                        # start all Hadoop daemons
bin/stop-all.sh                         # stop all Hadoop daemons
jps                                     # list the running Java daemon processes
bin/hadoop namenode -format             # format a new HDFS namespace
hadoop fs -mkdir /user/demo             # create a directory in HDFS
hadoop fs -ls /user/demo                # list a directory in HDFS
bin/hadoop fs -put conf input           # copy local files into HDFS
bin/hadoop fs -cat output/*             # print HDFS files to the console
hadoop fs -cat /user/demo/sample.txt
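To make the "simple programming model" concrete, here is a minimal word count in Pig Latin (Pig itself is introduced on slide 23). This sketch is not from the original deck: it assumes the 'input' directory created by the hadoop fs -put command above, and the field name 'line' and the output path are illustrative.

-- Count word occurrences across all files in the HDFS 'input' directory
lines  = LOAD 'input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- one tuple per word
grpd   = GROUP words BY word;                                     -- gather identical words
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;  -- tally each group
STORE counts INTO 'wordcount_output' USING PigStorage('\t');      -- hypothetical output path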
19. Sample Physical Architecture
[Diagram: stream agents on app nodes 1 through N send logs to a collector node; the data flows to the namenode and datanodes; a DB node, a Glue + Pig node, and a ZooKeeper node (VMs) complete the deployment.]

20. Sample Logical Architecture
[Diagram: streams from apps 1 through N feed the collector; the namenode and several datanodes store the data; Glue, Pig, ZooKeeper, and a DB round out the logical view.]

21. Implementation of a Hadoop System
BigStreams, a logging framework: Streams is a high-availability, extremely fast, low-resource-usage, real-time log collection framework for terabytes of data. Its key author is Gerrit, our architect. http://code.google.com/p/bigstreams/
Google Protobuf (http://code.google.com/p/protobuf/) is used, with LZO, to compress the data before transferring it from the application nodes to the Hadoop nodes.

22. Implementation
Data logs are compressed by the stream agents and sent to the collector.
The collector informs the namenode of new file arrivals.
The namenode replies with file sizes, how many blocks there will be, and which datanodes each block will be stored on.
The collector then sends the blocks of the file directly to the datanodes.

23. Data Processing with Pig
Once data is saved in the HDFS cluster, it can be processed using Java programs or by using Apache Pig (http://pig.apache.org).
1. Apache Pig is a platform for analyzing large data sets.
2. Pig Latin is its language, which presents a simplified manner of running queries.
3. The Pig platform has a compiler that translates Pig queries into MapReduce programs for Hadoop.

24. Pig Queries
Interactive mode (Grunt shell) or batch mode (Pig script).
Also: local mode or Map/Reduce mode.

25. Requirements
Unix and Windows users need the following:
1. Hadoop 0.20.2 - http://hadoop.apache.org/common/releases.html
2. Java 1.6 - http://java.sun.com/javase/downloads/index.jsp (set JAVA_HOME to the root of your Java installation)
3. Ant 1.7 - http://ant.apache.org/ (optional, for builds)
4. JUnit 4.5 - http://junit.sourceforge.net/ (optional, for unit tests)
Windows users need to install Cygwin and the Perl package: http://www.cygwin.com/
Download link: http://www.gtlib.gatech.edu/pub/apache//pig/pig-0.8.1/

26. Pig Installation Commands
To install Pig on Red Hat systems:
$ rpm -ivh --nodeps pig-0.8.0-1.x86_64
To start the Grunt shell:
$ pig
0   [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to Hadoop file system at: hdfs://localhost:8020
352 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
grunt>

27. Pig Statements: LOAD
Loads data from the file system.
Usage: use the LOAD operator to load data from the file system.
Example: suppose we have a data file called myfile.txt. The fields are tab-delimited and the records are newline-separated:
1	2	3
4	2	1
8	3	4
In this example the default load function, PigStorage, loads data from myfile.txt to form relation A. The two LOAD statements are equivalent. Note that, because no schema is specified, the fields are not named and all fields default to type bytearray.
A = LOAD 'myfile.txt';
A = LOAD 'myfile.txt' USING PigStorage('\t');
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
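Because no schema was specified above, every field is an unnamed bytearray. Here is a small sketch, not from the original deck, showing the same file loaded with an explicit schema so that fields are named and typed; the field names f1..f3 follow the GROUP example on slide 30.

-- Explicit schema: fields get names and types at load time
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
B = FOREACH A GENERATE f1 + f2 AS sum12;   -- typed fields allow arithmetic directly
DUMP B;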
28. Pig Statements: FOREACH
Generates data transformations based on columns of data.
Syntax: alias = FOREACH { gen_blk | nested_gen_blk };
Usage: use the FOREACH...GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).
FOREACH...GENERATE works with relations (outer bags) as well as inner bags.
If A is a relation (outer bag), a FOREACH statement could look like this:
X = FOREACH A GENERATE f1;
If A is an inner bag, a FOREACH statement could look like this:
X = FOREACH B {
    S = FILTER A BY 'xyz';
    GENERATE COUNT(S.$0);
}

29. Pig Statements: GROUP
Groups the data in one or more relations.
Syntax: alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression ...] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
Usage: the GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field; otherwise it will be the same type as the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:
The first field is named "group" (do not confuse this with the GROUP operator) and is the same type as the group key.
The second field takes the name of the original relation and is of type bag.
The names of both fields are generated by the system as shown in the example below.
Note the following about the GROUP/COGROUP and JOIN operators: the GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples, while JOIN creates a flat set of output tuples.

30. Pig Statements: Example for GROUP
Suppose we have relation A:
A = LOAD 'data' AS (f1:chararray, f2:int, f3:int);
DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)
In this example the tuples are grouped using an expression, f2*f3:
X = GROUP A BY f2*f3;
DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})

31. Pig Statements: FILTER
Selects tuples from a relation based on some condition.
Syntax: alias = FILTER alias BY expression;
Usage: use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH...GENERATE operation).
FILTER is commonly used to select the data that you want or, conversely, to filter out (remove) the data you don't want.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3 > f1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)

32. Pig Statements: STORE
Stores or saves results to the file system.
Syntax: STORE alias INTO 'directory' [USING function];
Usage: use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch-mode processing.
Note: to debug scripts during development, you can use DUMP to check intermediate results.

33. Pig Statements: Example for STORE
In this example, data is stored using PigStorage and the asterisk character (*) as the field delimiter:
A = LOAD 'data' AS (a1:int, a2:int, a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
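Taken together, LOAD, FILTER, GROUP, FOREACH, and STORE form a typical Pig pipeline. Here is a minimal end-to-end sketch, not from the original deck, assuming a tab-delimited 'data' file with a three-integer schema like the examples above; the filter condition and output path are illustrative.

A = LOAD 'data' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
B = FILTER A BY f2 > 1;                 -- keep rows where f2 exceeds 1
C = GROUP B BY f1;                      -- one output tuple per distinct f1
D = FOREACH C GENERATE                  -- aggregate each group
        group AS f1,
        COUNT(B) AS row_count,
        SUM(B.f3) AS f3_total;
STORE D INTO 'pipeline_output' USING PigStorage('\t');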
34. Pig Latin Example
ads = LOAD '/log/raw/old/ads/month=09/day=01,/log/raw/old/clicks/month=09/day=01'
      USING com.twitter.elephantbird.pig.proto.LzoProtobuffB64LinePigStore('ad_data');
r1 = FOREACH ads GENERATE
        ad.id AS id,
        device.carrier_name AS carrier_name,
        device.device_name AS device_name,
        device.mobile_model AS mobile_model,
        ipinfo.city AS city_code,
        ipinfo.country AS country,
        ipinfo.ipaddress AS ipaddress,
        ipinfo.region_code AS wifi_Gprs,
        site.client_id AS client_id,
        ad.metrics AS metrics,
        impressions,
        clicks;
g = GROUP r1 BY (id, carrier_name, device_name, mobile_model, city_code, country, client_id, wifi_Gprs, metrics);
r = FOREACH g GENERATE FLATTEN(group), SUM($1.impressions) AS imp, SUM($1.clicks) AS cl;
rmf /tmp/predicttesnul;
STORE r INTO '/tmp/predicttesnul' USING PigStorage('\t');

35. Demos / Contact Us
Some hands-on demos follow.
Need more information? Find us on Twitter: twitter.com/azifali