
Page 1: Hadoop

HADOOP

Page 2: Hadoop

Zubair Arshad (35), M Ahsan (32)

BSCS, Session 2013-2017

NFC Institute of Engineering & Technology, Multan, Pakistan

Page 3: Hadoop

DATA

Page 4: Hadoop

How Big Is Data?

The New York Stock Exchange generates about 1 TB of data per day.

Facebook hosts about 10 billion photos, taking up about 1 petabyte of storage.

CERN's Large Hadron Collider generates about 15 TB of data per day.

And the growth is accelerating every day.

Page 5: Hadoop

Big data

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.

Challenges include analysis, capture, search, sharing, storage, transfer, visualization, querying, and information privacy.

Page 6: Hadoop

Issues!

Data Processing Issues

Data Management Issues

Data Storage Issues

Privacy and Security Issues

Data Complexity Issues

Page 7: Hadoop

Hadoop

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.

It provides massive storage for any kind of data

Enormous processing power

The ability to handle virtually limitless concurrent tasks or jobs.

Page 8: Hadoop

HISTORY

Page 9: Hadoop

History

As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information.

Early search results were returned by humans, but as the web grew, the process had to be automated.

One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers, so that multiple tasks could be accomplished simultaneously.

Page 10: Hadoop

History

Another search engine project, Google, was in progress at the same time. It was based on the same concept: storing and processing data in a distributed, automated way.

In 2006, Cutting joined Yahoo and took with him the Nutch project, as well as ideas based on Google's early work on automating distributed data storage and processing.

The Nutch project was divided: the web crawler portion remained Nutch, and the distributed computing and processing portion became Hadoop.

In 2008, Yahoo released Hadoop as an open-source project.

Today, Hadoop is maintained by the non-profit Apache Software Foundation.

Page 11: Hadoop

Security

Apache Knox

The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster

Apache Ranger

It provides central security policy administration across the core enterprise security requirements of authorization, authentication, audit, and data protection.

Page 12: Hadoop

Why is Hadoop important?

Economical: Hadoop distributes data and processing across clusters of commonly available computers.

Computing power: Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes.

Page 13: Hadoop

Why is Hadoop important?

Scalability: You can easily grow the system by adding more nodes; Hadoop can store and process petabytes, even zettabytes, of data.

Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.

Flexibility: You don't have to preprocess data before storing it. You can store as much data as you want.

Page 14: Hadoop

HADOOP ARCHITECTURE

Page 15: Hadoop

Hadoop Architecture

Hadoop is designed and built on two independent frameworks:

Hadoop Distributed File System (HDFS)

MapReduce

HDFS is a reliable distributed file system that provides high-throughput access to data; storage of data in Hadoop is done through HDFS.

MapReduce is a programming model; programs written in this functional style are automatically parallelized and executed on large clusters of commodity hardware. Processing of data in Hadoop is done through MapReduce, as in the sketch below.
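
As an illustration of the model, here is a minimal sketch of the classic word-count job, written against the org.apache.hadoop.mapreduce API; the class names are the conventional example, not anything specific to these slides.

```java
// Minimal word-count sketch: the map phase emits (word, 1) pairs, and the
// reduce phase sums the counts for each word. Class names are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: split each input line into tokens and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts emitted for the same word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```

The framework handles splitting the input, shuffling the intermediate pairs, and rerunning failed tasks; the programmer supplies only these two functions.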

Page 16: Hadoop

Master and Slave Architecture: Components of HDFS

NameNode is the master of the system. It maintains the namespace (directories and files) and manages the blocks that are present on the DataNodes.

DataNodes store and retrieve blocks when they are told to (by a client or the NameNode), and they report back to the NameNode with the list of blocks that they are storing.

Secondary NameNode is responsible for performing periodic checkpoints, so in the event of a NameNode failure, you can restart the NameNode using a checkpoint.
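
To show how a client sits on top of this division of labor, here is a minimal sketch that writes and then reads a file through the HDFS Java API; the cluster address and file path are illustrative assumptions.

```java
// Minimal HDFS client sketch: metadata operations go to the NameNode, while
// the actual bytes are streamed to and from DataNodes. The fs.defaultFS URI
// and the file path below are placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative address
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/hello.txt");

    // Write: the NameNode allocates blocks; data is streamed to DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: block locations come from the NameNode; bytes come from DataNodes.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}
```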

Page 17: Hadoop

Master and Slave Architecture (diagram)

Page 18: Hadoop

Master and Slave Architecture: Components of MapReduce

JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers.

TaskTrackers run the tasks assigned by the JobTracker and send progress reports back to it.

If a task fails, the JobTracker can reschedule it on a different TaskTracker (see the driver sketch below).
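
From the programmer's side, this scheduling machinery is hidden behind a small driver. Here is a minimal sketch that wires up and submits the word-count classes from the earlier slide; the input and output paths come from the command line, and the class names are illustrative.

```java
// Minimal driver sketch: configures the word-count job and submits it to the
// cluster, where the JobTracker (in classic MapReduce) schedules its tasks
// onto TaskTrackers and reschedules any that fail.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output in HDFS

    // Submits the job and polls progress reports until it completes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In later Hadoop versions (YARN), the JobTracker/TaskTracker roles are split into a ResourceManager and per-application masters, but driver code like this stays essentially the same.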

Page 19: Hadoop

Master and Slave Architecture (diagram)

Page 20: Hadoop

HADOOP ECOSYSTEM

Page 21: Hadoop

Hadoop Ecosystem

The success of the Hadoop framework has led to the development of an array of related software.

Hadoop, together with this set of related software, makes up the Hadoop ecosystem.

The main purpose of these tools is to enhance the functionality and increase the efficiency of the Hadoop framework.

The Hadoop ecosystem comprises:

Apache Pig

Apache HBase

Apache Hive

Apache Sqoop

Apache Flume

Page 22: Hadoop

Functionality of Hadoop Ecosystem Tools

Apache Pig: a platform used to analyze and access large datasets that are stored on clusters of computers.

Apache HBase: a column-oriented database that allows reading and writing of data on HDFS.

Apache Sqoop: an application used to transfer data between Hadoop and any RDBMS.

Apache Hive: data warehouse software that facilitates querying and managing large datasets residing in distributed storage (see the query sketch below).
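
As a taste of how Hive is used in practice, here is a minimal sketch that runs a SQL-like query against HiveServer2 over JDBC; the connection URL, credentials, and the weblogs table are illustrative assumptions.

```java
// Minimal Hive-over-JDBC sketch: HiveServer2 compiles the SQL-like query into
// distributed jobs over data in HDFS. Host, port, credentials, and the
// "weblogs" table are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```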

Page 23: Hadoop
Page 24: Hadoop

Other applications

Log and/or clickstream analysis of various kinds

Marketing analytics

Machine learning and/or sophisticated data mining

Image processing

Processing of XML messages

Web crawling and/or text processing

General archiving, including of relational/tabular data, e.g. for compliance

Page 25: Hadoop