HADOOP
Zubair Arshad (35), M. Ahsan (32)
BSCS, Session 2013-2017
NFC Institute of Engineering & Technology, Multan
Pakistan
DATA
How big is data?
The New York Stock Exchange generates 1 TB of data per day.
Facebook hosts 10 billion photos, taking up 1 petabyte of storage.
CERN's Large Hadron Collider generates 15 TB per day.
And the growth is accelerating every day.
Big data
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.
Challenges include analysis, capture, search, sharing, storage, transfer, visualization, querying and information privacy.
Issues
Data Processing Issues
Data Management Issues
Data Storage Issues
Privacy and Security Issues
Data Complexity Issues
Hadoop
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
It provides massive storage for any kind of data,
enormous processing power,
and the ability to handle virtually limitless concurrent tasks or jobs.
HISTORY
History
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information.
Early search results were returned by humans, but as the web grew, the process became automated.
One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by
distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously.
History
Another search engine project, Google, was in progress. It was based on the same concept: storing and processing data in a distributed, automated way.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google's early work with automating distributed data storage and processing.
The Nutch project was divided: the web crawler portion remained as Nutch, and the distributed computing and processing portion became Hadoop.
In 2008, Yahoo released Hadoop as an open-source project.
Today, Hadoop is maintained by the non-profit Apache Software Foundation.
Security
Apache Knox
The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster
Apache Ranger
It provides central security policy administration across the core enterprise security requirements of authorization, authentication, audit, and data protection.
Why is Hadoop important?
Economical: Hadoop distributes the storage and processing of data across clusters of commonly available computers.
Computing power: Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.
Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes.
Why is Hadoop important?
Scalability: Hadoop can store and process petabytes, even zettabytes, of data.
Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.
Flexibility: You don't have to preprocess data before storing it, and you can store as much data as you want.
HADOOP ARCHITECTURE
Hadoop Architecture
Hadoop is designed and built on two independent frameworks:
Hadoop Distributed File System (HDFS)
MapReduce
HDFS is a reliable distributed file system that provides high-throughput access to data. Storage of data in Hadoop is handled through HDFS.
MapReduce is a programming model. Programs written in this functional style are automatically parallelized and executed on large clusters of commodity hardware. Processing of data in Hadoop is handled through MapReduce.
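The MapReduce model above can be sketched in plain Python as a single-machine illustration (this is not Hadoop's API; in a real cluster, the framework distributes the map, shuffle, and reduce phases across many nodes):

```python
# A minimal in-memory sketch of the MapReduce programming model,
# shown with the classic word-count example.
from collections import defaultdict

def map_phase(record):
    # Map: emit a (word, 1) pair for each word in the input line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts emitted for each word.
    return (key, sum(values))

def mapreduce(records):
    pairs = [p for r in records for p in map_phase(r)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = mapreduce(["hadoop stores data", "hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because map and reduce are pure functions over key-value pairs, the framework can run many map tasks and reduce tasks in parallel on different machines without changing the program.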
Master and slave architecture: components of HDFS
NameNode is the master of the system. It maintains the namespace (directories and files) and manages the blocks that are present on the DataNodes.
DataNodes store and retrieve blocks when they are told to (by a client or the NameNode), and they report back to the NameNode with the list of blocks that they are storing.
Secondary NameNode is responsible for performing periodic checkpoints, so in the event of a NameNode failure, you can restart the NameNode using a checkpoint.
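The NameNode/DataNode division can be illustrated with a toy model (hypothetical classes for illustration only, not Hadoop code; real HDFS uses 128 MB blocks and a default replication factor of 3):

```python
# Toy model of HDFS block placement: the NameNode keeps only the
# file -> block mapping; DataNodes hold the actual block contents.
BLOCK_SIZE = 8      # toy block size; real HDFS defaults to 128 MB
REPLICATION = 2     # each block is stored on this many DataNodes

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}            # block_id -> bytes

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.namespace = {}         # filename -> [block_id, ...]
        self.next_block = 0

    def write(self, filename, data):
        # Split the file into blocks; place replicas on several DataNodes.
        block_ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = self.next_block
            self.next_block += 1
            for j in range(REPLICATION):
                node = self.datanodes[(block_id + j) % len(self.datanodes)]
                node.blocks[block_id] = data[i:i + BLOCK_SIZE]
            block_ids.append(block_id)
        self.namespace[filename] = block_ids

    def read(self, filename):
        # Fetch each block from any DataNode that still holds a replica.
        out = b""
        for block_id in self.namespace[filename]:
            for node in self.datanodes:
                if block_id in node.blocks:
                    out += node.blocks[block_id]
                    break
        return out

nodes = [DataNode(f"dn{i}") for i in range(3)]
nn = NameNode(nodes)
nn.write("/logs/a.txt", b"hello hadoop world!")
nodes[0].blocks.clear()            # simulate one DataNode failing
print(nn.read("/logs/a.txt"))      # file survives via the replicas
```

The last two lines show why replication matters: even after one DataNode loses its blocks, the NameNode can still reconstruct the file from the surviving replicas.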
Master and slave architecture: components of MapReduce
JobTracker coordinates all the jobs run on the system by scheduling tasks to run on TaskTrackers.
TaskTrackers run tasks assigned by the JobTracker and send progress reports back to it.
If a task fails, the JobTracker can reschedule it on a different TaskTracker.
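The rescheduling behavior can be sketched as a toy scheduler (hypothetical names for illustration; real Hadoop tracks heartbeats, data locality, and much more):

```python
# Toy sketch of JobTracker-style rescheduling: a task that fails on
# one tracker is retried on a different one, up to max_attempts times.
def run_job(tasks, trackers, max_attempts=3):
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(max_attempts):
            tracker = trackers[(task_id + attempt) % len(trackers)]
            try:
                results[task_id] = tracker(task)
                break                  # task succeeded; move on
            except RuntimeError:
                continue               # reschedule on another tracker
        else:
            raise RuntimeError(f"task {task_id} failed on all attempts")
    return results

def healthy_tracker(task):
    return task.upper()

def broken_tracker(task):
    raise RuntimeError("tracker lost")

# Task 0 first lands on the broken tracker, then gets rescheduled.
results = run_job({0: "map-a", 1: "map-b"},
                  [broken_tracker, healthy_tracker])
print(results)  # {0: 'MAP-A', 1: 'MAP-B'}
```

The job still completes even though one tracker fails every task it receives, which is the fault-tolerance property described above.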
HADOOP ECOSYSTEMS
Hadoop Ecosystems
The success of the Hadoop framework has led to the development of an array of related software.
Hadoop, together with this set of related software, makes up the Hadoop ecosystem.
The main purpose of this software is to enhance the functionality and increase the efficiency of the Hadoop framework.
The Hadoop ecosystem comprises:
Apache Pig, Apache HBase, Apache Hive, Apache Sqoop, Apache Flume
Functionality of Hadoop Ecosystems
Apache Pig is a platform that is used to analyze and access large datasets that are present on clusters of computers.
Apache HBase is a column-oriented database that allows reading and writing of data on HDFS.
Apache Sqoop is an application that is used to transfer data between Hadoop and any RDBMS.
Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
Other applications
Log and/or clickstream analysis of various kinds
Marketing analytics
Machine learning and/or sophisticated data mining
Image processing
Processing of XML messages
Web crawling and/or text processing
General archiving, including of relational/tabular data, e.g. for compliance