
Page 1: BIG DATA ANALYTICS WITH HADOOP

BIG DATA ANALYTICS WITH HADOOP

By Viplav Mandal
Computer Science and Engg.
Volume 1

Page 2: BIG DATA ANALYTICS WITH HADOOP

AGENDA

INTRODUCTION
BIG DATA REQUIREMENTS
DATA GROWTH
HADOOP
HADOOP HISTORY
HADOOP DEVELOPMENT HISTORY
HADOOP TOOLS
REFERENCES

Page 3: BIG DATA ANALYTICS WITH HADOOP

INTRODUCTION

Big Data... what does it mean?

Volume

- Big data comes at large scale: terabytes, even petabytes.

- Records, transactions, tables, files.

Velocity

- Data flows continuously; time-sensitive, streaming flow.

- Batch, real-time, streams, historic.

Variety

- Big data includes structured, semi-structured, and unstructured data in all varieties.

- Text, files, logs, XML, audio, video, streams, flat files, etc.

Veracity

- Quality, consistency, reliability, and provenance of data.

- Good, bad, incomplete, undefined.

Ratio?

- 20% structured

- 80% unstructured

Page 4: BIG DATA ANALYTICS WITH HADOOP

Big Data Requirements

Technology requirements?

- No technology stack required

- Fresher or experienced, it doesn't matter

- Better to know (but not required): Java & Linux

Hardware requirements?

- No need to purchase anything

- Better to have: a 64-bit machine

Page 5: BIG DATA ANALYTICS WITH HADOOP

What about growth?

Page 6: BIG DATA ANALYTICS WITH HADOOP

Big data users

Page 7: BIG DATA ANALYTICS WITH HADOOP

Hadoop

- A large-scale distributed processing Apache framework
- Creator of Hadoop: Doug Cutting
- No high-end servers/machines required, only commodity servers
- Open-source framework and implementation of Google MapReduce
- Efficient, reliable, easy to use
- Stores and processes large amounts of data
- Performance, storage, and processing scale linearly
- Simple core, modular and extensible
- Manageable and self-healing
- Hardware cost-effective

Page 8: BIG DATA ANALYTICS WITH HADOOP

Hadoop History

- 2002-2004: Doug Cutting started working on Nutch
- 2003-2004: Google published the GFS and MapReduce white papers
- 2004: Doug Cutting added DFS and MapReduce support to Nutch
- 2006: Yahoo! hired Doug Cutting and built a team to develop Hadoop
- 2007: The New York Times converted 4 TB of archives using 100 Hadoop nodes on EC2
- Web-scale deployments at Yahoo!, Facebook, Twitter
- May 2009: Yahoo! did the fastest sort of 1 TB, in 62 seconds over 1,460 nodes

Page 9: BIG DATA ANALYTICS WITH HADOOP

Hadoop development history

Page 10: BIG DATA ANALYTICS WITH HADOOP

HADOOP TOOLS

- APACHE HIVE
- HDFS
- SQOOP
- MAPREDUCE

Page 11: BIG DATA ANALYTICS WITH HADOOP

APACHE HIVE: SQL ON HADOOP

- OSS data warehouse built on top of Hadoop

- First Apache Hive release in 2009

- Initial goal was to allow writing MapReduce jobs in SQL

- Most queries ran from minutes to hours

- Primarily used for batch processing
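
To make the "SQL on Hadoop" idea concrete, here is a minimal sketch of running a HiveQL query from Java through the standard Hive JDBC driver (jdbc:hive2://). The host/port, the weblogs table, and its columns are illustrative assumptions, not something from these slides.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; localhost:10000/default is an assumed local setup
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL into batch jobs on the cluster, which is
             // why even simple aggregations historically took minutes to hours
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}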

Page 12: BIG DATA ANALYTICS WITH HADOOP

Hive: a single tool for all SQL use cases

Page 13: BIG DATA ANALYTICS WITH HADOOP

HDFS: master and slaves

- Master (NameNode): manages all metadata information
- Slaves (DataNodes): store blocks of data and serve them to the client
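
As a rough sketch of how a client interacts with this master/slave split, the Java snippet below writes and reads a file through Hadoop's FileSystem API; the path and contents are made-up examples. The NameNode is consulted only for metadata (block allocation and locations), while the DataNodes stream the actual bytes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // assumed example path
        // Write: the NameNode allocates blocks; DataNodes store the bytes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello HDFS");
        }
        // Read: the NameNode returns block locations; DataNodes serve the data
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}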

Page 14: BIG DATA ANALYTICS WITH HADOOP

SQOOP

Couldn't I just do this with a shell script? What year are you living in, 2001? No, there is a better way.

Structured data already captured in databases should be used with unstructured data in Hadoop

Tedious "glue" code is otherwise necessary to wrap database records for consumption in Hadoop

- Large amounts of log data to process
- Apache open-source software
- Bulk data transfer tool (see the example import below):

– Import/export from/to relational databases, enterprise data warehouses, and NoSQL systems
– Populate tables in HDFS, Hive, and HBase
– Integrate with Oozie
– Support plugins via a connector-based architecture
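
As a rough sketch of that "better way", the single Sqoop command below imports a relational table into HDFS as parallel map tasks; the MySQL host, database, user, and table names are illustrative assumptions:

# Import the "orders" table into HDFS; Sqoop generates the "glue" code
# and runs the transfer as parallel map tasks.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser \
  --table orders \
  --target-dir /user/demo/orders \
  --num-mappers 4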

Page 15: BIG DATA ANALYTICS WITH HADOOP

MAP REDUCE

MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004). The WordCount sketch below illustrates the model.

Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner

-Petabytes of data

-Thousands of nodes

- Computational processing occurs on both:

Unstructured data: file system

Structured data: database
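
Below is the canonical WordCount program as a minimal sketch of the model: map() emits a (word, 1) pair per token, and reduce() sums the counts for each word; input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts emitted for the same word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this is typically launched with hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the jar name is an assumption.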

Page 17: BIG DATA ANALYTICS WITH HADOOP

Any Queries?

THANK YOU