
HADOOP

BIG DATA ANALYTICS WITH HADOOP

By Viplav Mandal, Computer Science and Engg., Volume 1

AGENDA

INTRODUCTION
BIG DATA
REQUIREMENTS
DATA GROWTH
HADOOP
HADOOP HISTORY
HADOOP DEVELOPMENT HISTORY
HADOOP TOOLS
REFERENCES

INTRODUCTION
Big Data... what does it mean?
Volume
- Big data comes at large scale: TBs, even PBs
- Records, transactions, tables, files
Velocity
- Data flows continuously; time-sensitive, streaming flow
- Batch, real-time, streams, historic
Variety
- Big data includes structured, semi-structured, and unstructured data of every variety
- Text, files, logs, XML, audio, video, streams, flat files, etc.
Veracity
- Quality, consistency, reliability, and provenance of data
- Good, bad, incomplete, undefined
Ratio?
- 20% structured
- 80% unstructured

Big Data Requirements
Technology Requirement?
- No technology stack required
- Fresher or experienced doesn't matter
- Better to know (but not required): Java & Linux

Hardware Requirement?
- No need to purchase anything
- Better to have a 64-bit machine

What about growth?

Big data users

Hadoop
- A large-scale distributed processing Apache framework
- Creator of Hadoop: Doug Cutting
- No high-end servers/machines required, only commodity servers
- Open-source framework and implementation of Google MapReduce
- Efficient, reliable, easy to use
- Stores and processes large amounts of data
- Performance, storage, and processing scale linearly
- Simple core, modular and extensible
- Manageable and self-healing
- Cost-effective hardware

Hadoop History
- 2002-2004: Doug Cutting starts working on Nutch
- 2003-2004: Google publishes the GFS and MapReduce white papers
- 2004: Doug Cutting adds DFS and MapReduce support to Nutch
- 2006: Yahoo! hires Doug Cutting and builds a team to develop Hadoop
- 2007: The NY Times converts 4 TB of archives using Hadoop on 100 EC2 instances
- Web-scale deployments at Yahoo, Facebook, Twitter
- May 2009: Yahoo does the fastest sort of 1 TB, in 62 seconds over 1,460 nodes

Hadoop development history

HADOOP TOOLS
- APACHE HIVE
- HDFS
- SQOOP
- MAPREDUCE

APACHE HIVE: SQL ON HADOOP
- OSS data warehouse built on top of Hadoop
- First Apache Hive release in 2009
- Initial goal was to write MapReduce jobs in SQL
- Most queries ran from minutes to hours
- Primarily used for batch processing

Hive: a single tool for all SQL use cases, as the sketch below shows
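As a minimal sketch of what "SQL on Hadoop" means in practice (host, port, and the page_views table are hypothetical, and a running HiveServer2 is assumed): a client submits plain HiveQL over JDBC, and Hive compiles the query into batch jobs behind the scenes, so nobody hand-writes mappers and reducers.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", ""); // assumed host/port
             Statement stmt = conn.createStatement()) {
            // Plain SQL; Hive turns this into distributed batch jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page"); // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}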

HDFS: master and slaves
- Slaves (DataNodes) store blocks of data and serve them to clients
- The master (NameNode) manages all metadata and tells clients where each block lives
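A minimal sketch of the client's view of this design, using Hadoop's standard FileSystem API (the cluster address and file path are assumptions): the code only names a path; the NameNode resolves metadata and block locations, and the DataNodes serve the actual bytes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed cluster address
        FileSystem fs = FileSystem.get(conf);

        // Write: the NameNode picks DataNodes; the client streams blocks to them.
        Path file = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations; bytes come from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}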

SQOOP
"Couldn't I just do this with a shell script?" What year are you living in, 2001? No, there is a better way.
The problem:
- Structured data already captured in databases should be usable alongside unstructured data in Hadoop
- Tedious glue code is otherwise needed to wrap database records for consumption in Hadoop
- Large amounts of log data to process
What Sqoop offers (see the sketch below):
- Apache open-source software
- Bulk data transfer tool
- Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
- Populates tables in HDFS, Hive, and HBase
- Integrates with Oozie
- Supports plugins via a connector-based architecture
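A minimal sketch of a bulk import, assuming Sqoop 1.x and a reachable MySQL instance (the connection string, credentials, table, and target directory are all hypothetical): Sqoop can be driven from Java via its runTool entry point, and the same arguments work on the command line as `sqoop import ...`.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` from the shell.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",  // hypothetical database
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",                       // hypothetical source table
            "--target-dir", "/user/hadoop/orders",     // HDFS destination
            "--num-mappers", "4"                       // parallel import tasks
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}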

MAPREDUCE
- MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004)
- Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
- Petabytes of data
- Thousands of nodes
- Computational processing occurs on both:
  - Unstructured data: file system
  - Structured data: database
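The canonical illustration of the model is word count, shown below following the standard Hadoop MapReduce tutorial: map emits a (word, 1) pair per token, the framework shuffles and groups pairs by word, and reduce sums each group. Input and output HDFS paths come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts for each word; the framework groups keys for us.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}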

REFERENCES
https://www.google.co.in/search?q=HADOOP&ie=utf-8&oe=utf-8&gws_rd=cr&ei=13jTVcm2E8m3uQSkzp3wBg
https://hadoop.apache.org/
http://www.slideshare.net/linuxpham/hadoop-at-gnt-2012
http://www.slideshare.net/narangv43/seminar-presentation-hadoop
https://www.google.co.in/search?q=hadoop+books&ie=utf-8&oe=utf-8&gws_rd=cr&ei=nXnTVeSAMY2-uASdoo7ACA
http://www.fromdev.com/2014/07/Best-Hadoop-Books.html

Any Queries?

THANK YOU