
Page 1: PPT on Hadoop

BY – SHUBHAM PARMAR

Page 2: PPT on Hadoop

What is Hadoop?

• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• It is developed by the Apache Software Foundation; version 1.0 was released in 2011.

• It is written in Java.

Page 3: PPT on Hadoop

Hadoop is open-source software. It provides:

• A framework

• Massive storage

• Processing power

Page 4: PPT on Hadoop

Big Data

• Big data is a term used to describe the very large amounts of unstructured and semi-structured data a company creates.

• The term is used when talking about petabytes and exabytes of data.

• That much data would take too much time and cost too much to load into a relational database for analysis.

• Facebook has almost 10 billion photos, taking up about 1 petabyte of storage.

Page 5: PPT on Hadoop

So what is the problem?

1. Processing such large data sets in a relational database is very difficult.

2. It would take too much time and cost too much to process the data.

Page 6: PPT on Hadoop

We can solve this problem with distributed computing.

But distributed computing has its own problems:

• Hardware failure: there is always a chance that hardware will fail.

• Combining the data after analysis: data from all the disks has to be combined, which is a mess.

Page 7: PPT on Hadoop

Hadoop was created to solve all of these problems.

It has two main parts:

1. The Hadoop Distributed File System (HDFS)

2. A data processing framework, MapReduce

Page 8: PPT on Hadoop

1. Hadoop Distributed File System (HDFS)

It ties many small, reasonably priced machines together into a single cost-effective computer cluster.

Data and application processing are protected against hardware failure.

If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail.

It automatically stores multiple copies of all data.

It provides a simplified programming model that allows users to quickly read and write to the distributed file system.
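
As a rough illustration of that read/write model, here is a minimal Java sketch using the Hadoop FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are placeholder assumptions for illustration, not part of the original slides; in a real cluster they come from the site configuration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt"); // placeholder path

            // Write a small file; HDFS replicates its blocks across nodes automatically.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the same file back.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}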

Page 9: PPT on Hadoop

2. MapReduce

MapReduce is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

• A MAP function processes a key/value pair to generate a set of intermediate key/value pairs.

• A REDUCE function merges all intermediate values associated with the same intermediate key.
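
To make the two functions concrete, here is the classic word-count job written against the standard Hadoop MapReduce Java API, as a sketch; the class name and the input/output paths passed on the command line are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // MAP: for each input line, emit an intermediate (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // REDUCE: sum all intermediate values that share the same key (word).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}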

Page 10: PPT on Hadoop
Page 11: PPT on Hadoop
Page 12: PPT on Hadoop

Pros of Hadoop

1. Computing power
2. Flexibility
3. Fault tolerance
4. Low cost
5. Scalability

Page 13: PPT on Hadoop

Cons of Hadoop

1. Integration with existing systems: Hadoop is not optimised for ease of use. Installing it and integrating it with existing databases might prove difficult, especially since there is no software support provided.

2. Administration and ease of use: Hadoop requires knowledge of MapReduce, while most data practitioners use SQL. This means significant training may be required to administer Hadoop clusters.

3. Security: Hadoop lacks the level of security functionality needed for safe enterprise deployment, especially where sensitive data is concerned.

Page 14: PPT on Hadoop