Posted on 14-Apr-2017




HADOOP
BIG DATA ANALYTICS WITH HADOOP
By Viplav Mandal
Computer Science and Engg.
Volume 1

AGENDA
- INTRODUCTION
- BIG DATA REQUIREMENTS?
- DATA GROWTH
- HADOOP
- HADOOP HISTORY
- HADOOP DEVELOPMENT HISTORY
- HADOOP TOOLS
- REFERENCES

INTRODUCTION
Big Data: what does it mean?

Volume
- Big data comes at large scale, in terabytes (TB) or even petabytes (PB).
- Records, transactions, tables, files.

Velocity
- Data flows continuously; it is time sensitive and streaming.
- Batch, real time, streams, historic.

Variety
- Big data includes structured, semi-structured, and unstructured data of every variety.
- Text, files, logs, XML, audio, video, streams, flat files, etc.

Veracity
- Quality, consistency, reliability, and provenance of data.
- Good, bad, incomplete, undefined.

Ratio?
- 20% structured
- 80% unstructured

Big Data Requirements

Technology Requirement?
- No technology stack required
- Fresher or experienced doesn't matter
- Better to know (but not required): Java & Linux

Hardware Requirement?
- No need to purchase anything
- Better to have a 64-bit machine

What about growth?

Big data users

Hadoop
- A large-scale distributed processing framework from Apache
- Creator of Hadoop: Doug Cutting
- No high-end server/machine required, only commodity servers
- Open source framework & implementation of Google Map Reduce
- Efficient, reliable, easy to use
- Stores & processes large amounts of data
- Performance, storage, and processing scale linearly
- Simple core, modular and extensible
- Manageable and self-healing
- Hardware cost effective

Hadoop History
- 2002-2004: Doug Cutting starts working on Nutch
- 2003-2004: Google publishes the GFS and Map Reduce white papers
- 2004: Doug Cutting adds DFS and Map Reduce support to Nutch
- Yahoo! hires Doug Cutting and builds a team to develop Hadoop
- 2007: NY Times converts 4 TB of archives using Hadoop on 100 EC2 instances
- Web-scale deployments at Yahoo!, Facebook, Twitter
- May 2009: Yahoo! does the fastest sort of 1 TB, in 62 seconds over 146 nodes

Hadoop development history

HADOOP TOOLS
- APACHE HIVE
- HDFS
- SQOOP
- MAPREDUCE

APACHE HIVE: SQL ON HADOOP
- OSS data warehouse built on top of Hadoop
- First Apache Hive release in 2009
- Initial goal was to write Map Reduce jobs in SQL
  - Most queries ran from minutes to hours
  - Primarily used for batch processing
- Hive: a single tool for all SQL use cases

HDFS: master and slaves
- Slaves (Data Nodes) store blocks of data and serve them to the client
- Master (Name Node) manages all metadata information

SQOOP
- "Couldn't I just do this with a shell script?"
- "What year do you live in, 2001? No, there is a better way."
- Structured data already captured in databases should be used with unstructured data in Hadoop
- Tedious glue code is necessary to wrap database records for consumption in Hadoop
- Large amounts of log data to process

- Apache open source software
- Bulk data transfer tool
- Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
- Populates tables in HDFS, Hive, and HBase
- Integrates with Oozie
- Supports plugins via a connector-based architecture

MAP REDUCE
- Map Reduce is a programming model and software framework first developed by Google (Google's Map Reduce paper was published in 2004)
- Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
  - Petabytes of data
  - Thousands of nodes
- Computational processing occurs on both:
  - Unstructured data: file system
  - Structured data: database

REFERENCES

Queries?

THANK YOU
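
APPENDIX: MAP REDUCE IN MINIATURE
To make the MAP REDUCE slide more concrete, here is a minimal sketch of the programming model in plain Python, using the classic word-count example. It is not Hadoop code and calls no Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions chosen for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key.

    In a real framework this grouping happens between map and reduce;
    here it is simulated with an in-memory dictionary.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored in HDFS.
    sample = ["big data with hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

On a cluster, the map calls would run in parallel on the nodes that hold the input blocks and the reduce calls would run in parallel per key, which is what lets the same two small functions scale to very large inputs.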