Big Data with Apache Hadoop


Post on 04-Jun-2015


DESCRIPTION

Slide deck from our seminar about Hadoop (08/10/2014). Topics covered:
- What is Big Data?
- About Apache Hadoop
- HDFS
- MapReduce
- Pig
- Hive
- HBase
- Mahout & Machine Learning
- Other tooling: Sqoop, Oozie, ...
- Hadoop deployment options
- Real-life cases

TRANSCRIPT

1. Data Science Company: Big Data with Apache Hadoop
   Veldkant 33A, Kontich, info@infofarm.be, www.infofarm.be
   8/10/2014

2. Who am I
   Ben Vermeersch, Big Data Consultant, Cloudera Certified Developer for Apache Hadoop
   ben.vermeersch@infofarm.be, @benvermeersch

3. About InfoFarm
   Data Science, Big Data
   Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.

4. About InfoFarm
   2 Data Scientists, 4 Big Data Consultants, 1 Infrastructure Specialist

5. Java, PHP, E-Commerce, Mobile, Web Development

6. (no text)

7. Agenda
   09:30 What is Big Data?
   09:45 Hadoop HDFS & MapReduce
   10:00 HDFS & MapReduce in Practice
   10:30 The Hadoop Ecosystem
   11:30 Examples
   12:00 Wrap up and Lunch

8. What is Big Data?

9. What is Big Data not?

10. What is Big Data not?
   - a technology
   - a solution (certainly not a silver bullet) to any IT problem
   - a replacement for an RDBMS
   - a cloud storage system

11. Big Data definition attempt
   A description of a problem domain with specific challenges and solutions, which has become relevant with increasing volume, velocity and variety in business data and the increasing requirements towards processing of this data.

12. The 3 Vs

13. (no text)

14. Working the (Hadoop) Big Data way
   - Bringing data processing to the data (vs. a centralized DB)
   - Using unstructured or semi-structured data
   - Store first, process later
   - Simple techniques applied at massive scale
   - Your hardware will fail!

15. Hadoop (limited) overview
   Diagram labels: Oozie (Workflow), HDFS (Distributed File System), MapReduce, Amazon S3, Local FS, YARN (Distributed Data Processing), HBase (NoSQL), Hive (Data Mart), Pig (Scripting), Sqoop (SQL Import/Export), Mahout (Machine Learning)

16. HDFS

17. HDFS Rack Topology

18. MapReduce
   - A method for distributing tasks across multiple nodes
   - Data is processed where it is stored (where possible)
   - Two phases: Map and Reduce
   - Both phases have key-value pairs as input and output, which may be chosen by the programmer
   - The output from the mappers is used by the reducers

19. Map & Reduce
   Mapper input, Mapper output, Reducer input, Reducer output

20. Map function
   Input.txt: Block 1, Block 2, Block 3
   Node 1: Block 1, Block 2
   Node 2: Block 2, Block 3
   Node 3: Block 1, Block 3

21. Shuffle and sort
   - Hadoop automatically sorts and merges output from all map tasks
   - This intermediate process is known as the shuffle and sort
   - The result is supplied to reduce tasks

22. Reduce function
   - Reducer input comes from the shuffle and sort process
   - Receives one record at a time
   - Receives all records for a given key
   - Emits zero or more output records
   - Example: a reduce function sums the total per person and emits the employee name (key) and total (value) as output

23. MapReduce under the hood
   Diagram labels: Client, ResourceManager, Node 1, AppMaster, Node 2, Node 3, HDFS

24. HDFS & MapReduce: DEMO

25. Joining
   Input tables:
     User  Name
     1     John
     2     Maria
     3     Jane

     User  Comment
     1     Cool
     2     Nonono
     2     Hi there
     3     Hadoop is awesome

   Mapper output (one mapper per table):
     Key  Value
     1    AJohn
     2    AMaria
     3    AJane

     Key  Value
     1    BCool
     2    BNonono
     2    BHi there
     3    BHadoop is awesome

26. Joining: Shuffle/Sort
   The mapper output of slide 25 after Shuffle/Sort, as input to the Reducer:
     Key  Values
     1    AJohn; BCool
     2    AMaria; BNonono; BHi there
     3    AJane; BHadoop is awesome

27. Joining: Reducer
   Reducer output for the input of slide 26:
     Userid  Name   Comment
     1       John   Cool
     2       Maria  Nonono
     2       Maria  Hi there
     3       Jane   Hadoop is awesome

28. MapReduce Design Patterns
   - More info:
   - Frameworks on top of MapReduce like Hive or Pig make this easier

29. The Hadoop Ecosystem
   Diagram labels: same as slide 15

30. Apache Pig
   - Processing framework for (large) datasets
   - Pig Latin
   - Runs on Hadoop (or local) with MapReduce
   - Extensible with UDFs

31. Apache Pig: DEMO

32. Apache Hive
   - SQL-like querying on Hadoop datasets
   - Translates to MapReduce under the hood
   - Originally developed at Facebook
   - Now an Apache Top Level project

33. Hive vs. Traditional RDBMS
     Hive                                     Traditional RDBMS
     Schema on read                           Schema on write
     Fast initial load                        Slow initial load
     Flexible schema                          Fixed schema
     No update or delete (only insert into)   Updates, deletes, inserts all possible
     HiveQL (subset of SQL)                   SQL compliant

34. Apache Hive: DEMO

35. HBase
   - Column-oriented Data Store
   - Distributed
   - Type of NoSQL-DB
   - Based on Google BigTable

36. HBase
   HBase:
   - Lots and lots of data
   - Large number of clients
   - Single selects
   - Range scan by key
   - Variable schema
   Not:
   - Traditional RDBMS
   - Transactions
   - Group by
   - Join
   - Where
   - Like

37. HBase: DEMO

38. Sqoop
   - Import data from a structured data source (typically an RDBMS) into Hadoop
   - Export data into structured data sources from Hadoop

     sqoop import --connect jdbc:mysql://localhost/salesdb --table orders
     sqoop export --connect jdbc:mysql://localhost/salesdb --table orders --export-dir /user/test/orders --input-fields-terminated-by '\t'

39. Mahout: Scalable Machine Learning
   Recommendation, Classification, Clustering

40. Recommendation

41. Classification
   Mammal, Reptile, Bird

42. Clustering

43. More information
   - Free seminar: Machine Learning in practice
   - Fri 7th of November 2014, 12:00-16:00, Kontich
   - http://www.buzzberry.be/events/

44. Integrating Hadoop in your IT landscape

45. Tools: Big Data IT options
   Hadoop is not a trivial piece of software to manage!
   - On-premise
     - Commodity hardware
     - Advantage: full control & performance
     - Disadvantage: required skills, migrations, backup, ...
   - Cloud: Amazon AWS
     - EMR (Elastic Map Reduce)
     - Storage in S3
     - Very competitive offering financially
     - Manageability and flexibility
   - Cloud: IBM SoftLayer
     - Hardware options (performance)

46. Beyond MapReduce

47. There is more

48. Oak3 Courses
   Data Science, Hadoop, HBase
   http://www.oak3.be/

49. Questions?
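
Example for slides 18 to 22 (Map, Shuffle and sort, Reduce). The sketch below is not Hadoop code; it only simulates the three phases in a single Python process, using the "sum the total per person" example from slide 22. The input records, the `name, amount` line format and all function names (`map_fn`, `shuffle_and_sort`, `reduce_fn`, `run_job`) are assumptions made up for illustration.

```python
from collections import defaultdict


def map_fn(record):
    """Map phase: turn one input record into zero or more (key, value) pairs."""
    name, amount = record.split(",")
    yield name.strip(), int(amount)


def shuffle_and_sort(pairs):
    """Group all mapper output by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce phase: receives all values for one key, emits zero or more records."""
    yield key, sum(values)


def run_job(records):
    """Run map, shuffle/sort and reduce one after the other in a single process."""
    mapped = [pair for record in records for pair in map_fn(record)]
    return [
        output
        for key, values in shuffle_and_sort(mapped)
        for output in reduce_fn(key, values)
    ]


# Hypothetical input: one "name, amount" record per line.
records = ["john, 10", "maria, 5", "john, 7", "jane, 3", "maria, 1"]
totals = run_job(records)
print(totals)  # [('jane', 3), ('john', 17), ('maria', 6)]
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data over the network; here everything is an in-memory list.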
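
Example for slides 25 to 27 (Joining). This is a minimal single-process sketch of the join shown on those slides, using the same two tables and the same `A`/`B` value prefixes to mark which table a value came from. It is an illustration in plain Python, not the Hadoop API; the function names are assumptions chosen for this example.

```python
from collections import defaultdict

# The two input tables from slide 25.
users = [(1, "John"), (2, "Maria"), (3, "Jane")]
comments = [(1, "Cool"), (2, "Nonono"), (2, "Hi there"), (3, "Hadoop is awesome")]


def map_users(row):
    """Mapper for the user table: tag each name with 'A'."""
    user_id, name = row
    yield user_id, "A" + name


def map_comments(row):
    """Mapper for the comment table: tag each comment with 'B'."""
    user_id, comment = row
    yield user_id, "B" + comment


def shuffle_and_sort(pairs):
    """Collect all tagged values per key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_join(user_id, values):
    """Reducer: split values by tag and emit one row per (name, comment) pair."""
    names = [v[1:] for v in values if v.startswith("A")]
    texts = [v[1:] for v in values if v.startswith("B")]
    for name in names:
        for text in texts:
            yield user_id, name, text


mapped = [pair for row in users for pair in map_users(row)]
mapped += [pair for row in comments for pair in map_comments(row)]

joined = [
    row
    for user_id, values in shuffle_and_sort(mapped)
    for row in reduce_join(user_id, values)
]
for row in joined:
    print(row)
```

The reducer does not rely on the order of the values within a key; it separates them by tag first, which is why the prefix is needed at all.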
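
Example for slide 36 (HBase access patterns). The toy class below is not the HBase client API. It is a sketch, under the assumption that rows are kept sorted by row key, of why "single selects" and "range scan by key" are cheap in such a store while joins and group-bys are not offered. The class name `SortedKeyStore`, the row keys and the column names are all hypothetical.

```python
import bisect


class SortedKeyStore:
    """Toy key-ordered store: rows are dicts, so each row may have different columns."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> dict of column name to value

    def put(self, key, row):
        """Insert or overwrite one row."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key):
        """Single select by exact row key; returns None if the key is absent."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Range scan: all rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


store = SortedKeyStore()
store.put("user#002", {"name": "Maria"})
store.put("user#001", {"name": "John", "city": "Kontich"})  # extra column is fine
store.put("order#17", {"total": 40})

print(store.get("user#001"))
print(store.scan("user#", "user#~"))  # every key that starts with "user#" followed by digits
```

Because all lookups go through the row key, the design of that key decides which queries are efficient; anything that needs to combine rows by another attribute has to be done elsewhere, for example in a MapReduce job.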
