
Page 1: BIG DATA ANALYTICS WITH HADOOP

BIG DATA ANALYTICS WITH HADOOP

By Viplav Mandal
Computer Science and Engg.
Volume 1

Page 2: BIG DATA ANALYTICS WITH HADOOP

AGENDA

INTRODUCTION
BIG DATA REQUIREMENTS
DATA GROWTH
HADOOP
HADOOP HISTORY
HADOOP DEVELOPMENT HISTORY
HADOOP TOOLS
REFERENCES

Page 3: BIG DATA ANALYTICS WITH HADOOP

INTRODUCTION

Big Data... what does it mean?

Volume

- Big data comes at large scale: terabytes, even petabytes.

- Records, transactions, tables, files.

Velocity

- Data flows continuously; time-sensitive, streaming flow.

- Batch, real-time, streams, historic.

Variety

- Big data includes structured, semi-structured, and unstructured data in all varieties.

- Text, files, logs, XML, audio, video, streams, flat files, etc.

Veracity

- Quality, consistency, reliability, and provenance of data.

- Good, bad, incomplete, undefined.

Ratio?

- 20% structured

- 80% unstructured

Page 4: BIG DATA ANALYTICS WITH HADOOP

Big Data Requirements

Technology requirements?

- No technology stack required

- Fresher or experienced, it doesn't matter

- Better to know (but not required): Java & Linux

Hardware requirements?

- No need to purchase anything

- Better to have: a 64-bit machine

Page 5: BIG DATA ANALYTICS WITH HADOOP

What about growth?

Page 6: BIG DATA ANALYTICS WITH HADOOP

Big data users

Page 7: BIG DATA ANALYTICS WITH HADOOP

Hadoop

- A large-scale distributed processing Apache framework
- Creator of Hadoop: Doug Cutting
- No high-end servers/machines required, only commodity servers
- Open-source framework and implementation of Google MapReduce
- Efficient, reliable, easy to use
- Stores and processes large amounts of data
- Performance, storage, and processing scale linearly
- Simple core, modular and extensible
- Manageable and self-healing
- Hardware cost-effective

Page 8: BIG DATA ANALYTICS WITH HADOOP

Hadoop History

- 2002-2004: Doug Cutting started working on Nutch
- 2003-2004: Google published the GFS and MapReduce white papers
- 2004: Doug Cutting added DFS and MapReduce support to Nutch
- 2006: Yahoo! hired Doug Cutting and built a team to develop Hadoop
- 2007: The New York Times converted 4 TB of archives using 100 Hadoop nodes on EC2
- Web-scale deployments at Yahoo!, Facebook, Twitter
- May 2009: Yahoo! did the fastest sort of 1 TB, in 62 seconds over 1,460 nodes

Page 9: BIG DATA ANALYTICS WITH HADOOP

Hadoop development history

Page 10: BIG DATA ANALYTICS WITH HADOOP

HADOOP TOOLS

- APACHE HIVE
- HDFS
- SQOOP
- MAPREDUCE

Page 11: BIG DATA ANALYTICS WITH HADOOP

APACHE HIVE: SQL ON HADOOP

- OSS data warehouse built on top of Hadoop

- First Apache Hive release in 2009

- Initial goal was to allow writing MapReduce jobs in SQL

- Most queries ran from minutes to hours

- Primarily used for batch processing
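
To make the "SQL on Hadoop" idea concrete, here is a minimal sketch of running a HiveQL query from Java through the standard Hive JDBC driver (jdbc:hive2://). The host/port, the weblogs table, and its columns are illustrative assumptions, not something from these slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; localhost:10000/default is an assumed local setup
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL into batch jobs on the cluster, which is
             // why even simple aggregations historically took minutes to hours
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}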

Page 12: BIG DATA ANALYTICS WITH HADOOP

Hive: a single tool for all SQL use cases

Page 13: BIG DATA ANALYTICS WITH HADOOP

HDFS: master and slaves

- Master (NameNode): manages all metadata information
- Slaves (DataNodes): store blocks of data and serve them to the client
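
As a rough sketch of how a client interacts with this master/slave split, the Java snippet below writes and reads a file through Hadoop's FileSystem API; the path and contents are made-up examples. The NameNode is consulted only for metadata (block allocation and locations), while the DataNodes stream the actual bytes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // assumed example path
        // Write: the NameNode allocates blocks; DataNodes store the bytes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello HDFS");
        }
        // Read: the NameNode returns block locations; DataNodes serve the data
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}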

Page 14: BIG DATA ANALYTICS WITH HADOOP

SQOOP

Couldn't I just do this with a shell script? What year are you living in, 2001? No, there is a better way.

Structured data already captured in databases should be used with unstructured data in Hadoop

Tedious "glue" code is otherwise necessary to wrap database records for consumption in Hadoop

- Large amounts of log data to process
- Apache open-source software
- Bulk data transfer tool (see the example import below):

– Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
– Populate tables in HDFS, Hive, and HBase
– Integrate with Oozie
– Support plugins via a connector-based architecture
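
As a rough sketch of that "better way", the single Sqoop command below imports a relational table into HDFS as parallel map tasks; the MySQL host, database, user, and table names are illustrative assumptions:

# Import the "orders" table into HDFS; Sqoop generates the "glue" code
# and runs the transfer as parallel map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser \
  --table orders \
  --target-dir /user/demo/orders \
  --num-mappers 4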

Page 15: BIG DATA ANALYTICS WITH HADOOP

MAP REDUCE

MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004). The WordCount sketch below illustrates the model.

Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner

-Petabytes of data

-Thousands of nodes

- Computational processing occurs on both:

Unstructured data: file system

Structured data: database
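
Below is the canonical WordCount program as a minimal sketch of the model: map() emits a (word, 1) pair per token, and reduce() sums the counts for each word; input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts emitted for the same word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is typically launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the jar name is an assumption.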

Page 17: BIG DATA ANALYTICS WITH HADOOP

Any Queries?

THANK YOU