the solution for big data
![Page 1: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/1.jpg)
THE SOLUTION FOR BIG DATA
NAME: SIVAKOTI TARAKA SATYA PHANINDRA
ROLL NO: 15K81D5824
COURSE: CSE M.TECH/SEM-1
HADOOP
![Page 2: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/2.jpg)
CONTENT:
Data – Trends in storing data
BigData – Problems in IT industry
Why BigData?
Introduction to HADOOP
HDFS (Hadoop Distributed File System)
MapReduce
Prominent users of Hadoop
Conclusion
![Page 3: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/3.jpg)
Data – Trends in storing data
What is data? Any real-world symbol (a letter, digit, or special character), or a group of such symbols, is data. It may be visual, audio, textual, images, etc.
File system
Databases
Cloud (internet)
![Page 4: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/4.jpg)
BIG DATA:
What is big data? In IT, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
As of 2016, the limit on the size of data sets that could feasibly be processed in a reasonable time was on the order of exabytes. (KB < MB < GB < TB < PB < EB < ZB)
![Page 5: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/5.jpg)
![Page 6: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/6.jpg)
BIGDATA and problems with it.
Every day, about 0.8 petabytes of updates are made to FACEBOOK, including 50 million photos.
Every day, YOUTUBE receives enough video to be watched continuously for one year.
Large data sets cause limitations in many areas, including meteorology, genomics, complex physics simulations, and biological and environmental research.
They also affect Internet search, finance and business informatics.
The challenges include capture, retrieval, storage, search, sharing, analysis, and visualization.
![Page 7: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/7.jpg)
Why BIG DATA ?
![Page 8: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/8.jpg)
Unstructured DATA growth !
![Page 9: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/9.jpg)
THEN WHAT COULD BE THE SOLUTION FOR BIGDATA ?
![Page 10: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/10.jpg)
Hadoop's Developers:
2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
(Photo: Doug Cutting)
![Page 11: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/11.jpg)
What is Hadoop?
It is open-source software written in Java.
The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
![Page 12: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/12.jpg)
• An Apache top-level project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
• It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
![Page 13: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/13.jpg)
![Page 14: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/14.jpg)
The project includes these modules:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
![Page 15: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/15.jpg)
1. Hadoop Common
It provides access to the filesystems supported by Hadoop.
The Hadoop Common package contains the JAR files and scripts needed to start Hadoop.
The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop community (Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper).
![Page 16: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/16.jpg)
2. Hadoop Distributed File System (HDFS):
Hadoop uses HDFS, a distributed file system based on GFS
(Google File System), as its shared filesystem.
The HDFS architecture divides files into large chunks (about 64 MB; this size is configurable)
that are distributed across data servers.
It has a namenode and datanodes.
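The chunking described above can be illustrated with a small sketch. It assumes the 64 MB chunk size quoted on this slide. The function names `split_into_blocks` and `assign_round_robin` are hypothetical, and the round-robin placement is only an illustrative stand-in, not the placement policy HDFS actually uses.

```python
BLOCK_SIZE_MB = 64  # chunk size quoted on the slide; configurable in a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the chunks a file of this size would be cut into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last chunk only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


def assign_round_robin(num_blocks, datanodes):
    """Hypothetical placement: chunk i is stored on datanode i mod n (illustration only)."""
    return {i: datanodes[i % len(datanodes)] for i in range(num_blocks)}


blocks = split_into_blocks(200)
placement = assign_round_robin(len(blocks), ["datanode-a", "datanode-b"])
print(blocks)
print(placement)
```

A 200 MB file therefore occupies three full chunks plus one partial chunk, and the namenode only needs to remember which datanode holds each chunk.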
![Page 17: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/17.jpg)
What does HDFS contain?
HDFS consists of multiple namenodes (namespaces), and they
are federated.
The datanodes are used as common storage for blocks by all the
namenodes.
Each datanode registers with all the namenodes in the cluster.
Datanodes send periodic heartbeats and block reports and
handle commands from the namenodes.
![Page 18: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/18.jpg)
Structure of Hadoop system:
Master Node:
Name Node
Secondary Name Node
Job Tracker
Slaves:
Data Node
Task Tracker
![Page 19: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/19.jpg)
MASTER NODE:
Master node
Keeps track of the namespace and metadata about items.
Keeps track of MapReduce jobs in the system.
Hadoop is currently configured with centurion064 as the master node.
Hadoop is locally installed on each system.
The installed location is /localtmp/hadoop/hadoop-0.15.3
![Page 20: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/20.jpg)
SLAVE NODES:
Slave nodes manage blocks of data sent from the master node.
In GFS terms, these are the chunkservers.
Currently centurion060 and centurion064 are the two slave nodes being used.
Slave nodes store their data in /localtmp/hadoop/hadoop-dfs (this is automatically created by the DFS).
Once you use the DFS, relative paths are from /usr/{your usr id}
![Page 21: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/21.jpg)
![Page 22: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/22.jpg)
Advantages and Limitations of HDFS:
Advantages:
It reduces traffic on job scheduling.
Files can be accessed through the native Java API or from a language of the user's choice (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml).
Limitations:
It cannot be directly mounted by an existing operating system.
It requires a UNIX or Linux system.
![Page 23: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/23.jpg)
3. Hadoop MAPREDUCE SYSTEM:
The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster.
A MapReduce computation has two phases: a map phase and a reduce phase.
![Page 24: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/24.jpg)
MAP AND REDUCE METHODS USAGE
Map function
Reduce function
Run this program as a MapReduce job
![Page 25: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/25.jpg)
WORD COUNT OVER A GIVEN SET OF STRINGS
Input:
We love India
We Play Tennis
Map output:
We 1
love 1
India 1
We 1
Play 1
Tennis 1
Reduce output:
love 1
India 1
We 2
Tennis 1
Play 1
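The word-count flow on this slide can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Hadoop's Java API: `map_phase`, `reduce_phase` and `word_count` are hypothetical names, and the grouping step stands in for what the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reduce: add up all the 1s emitted for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map over every line, group the pairs by word, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = word_count(["We love India", "We Play Tennis"])
print(result)
```

Only "We" appears in both input strings, so it is the only word whose reduce step receives more than one value.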
![Page 26: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/26.jpg)
MAPREDUCE WITH NO REDUCE TASKS
![Page 27: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/27.jpg)
MAPREDUCE WITH TWO REDUCE TASKS - AUTOMATIC PARALLEL EXECUTION IN MAPREDUCE
![Page 28: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/28.jpg)
Shuffle and sort in MapReduce with multiple reduce tasks
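A minimal sketch of the shuffle-and-sort step, assuming two reduce tasks: each map output pair is routed to one reducer by hashing its key, and every reducer sees its keys in sorted order. The CRC32-based `partition` function is a stand-in chosen here so the result is deterministic in Python. It is not Hadoop's own partitioner, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reduce task for a key (CRC32 is an illustrative stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_output, num_reducers):
    """Group values by key inside each partition, then sort each partition by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(part.items()) for part in partitions]


def run_reducers(shuffled):
    """Each reduce task sums the values for every key it was given."""
    return [[(key, sum(values)) for key, values in part] for part in shuffled]


map_output = [("We", 1), ("love", 1), ("India", 1),
              ("We", 1), ("Play", 1), ("Tennis", 1)]
shuffled = shuffle_and_sort(map_output, num_reducers=2)
reduced = run_reducers(shuffled)
for task_id, output in enumerate(reduced):
    print(task_id, output)
```

Because all pairs with the same key hash to the same partition, both occurrences of "We" reach the same reduce task, which is what lets the reducers run in parallel without coordinating.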
![Page 29: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/29.jpg)
![Page 30: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/30.jpg)
![Page 31: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/31.jpg)
![Page 32: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/32.jpg)
Prominent users of HADOOP
Amazon – 100 nodes
Facebook – two clusters of 8000 and 3000 nodes
Adobe – 80-node system
eBay – 532-node cluster
Yahoo – cluster of about 4500 nodes
IIIT Hyderabad – 30-node cluster
![Page 33: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/33.jpg)
Trending: Hadoop Jobs
![Page 34: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/34.jpg)
Salary Trends in Hadoop:
![Page 35: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/35.jpg)
Achievements:
2008 - Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, compared to the previous record of 297 seconds).
2009 - Avro and Chukwa became new members of the Hadoop framework family.
2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework.
2011 - ZooKeeper completed.
March 2011 - Apache Hadoop takes the top prize at the Media Guardian Innovation Awards.
2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released. Ambari, Cassandra and Mahout have been added.
![Page 36: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/36.jpg)
Conclusion:
Hadoop reduces the load involved in capture, storage, search, sharing, analysis, and visualization.
A huge amount of data can be stored, and large computations can be carried out, within a single system with safety and security at low cost.
BIGDATA and BIGDATA solutions are among the most pressing topics in the present IT industry, so working on them will make you more valuable to that industry.
![Page 37: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/37.jpg)
![Page 38: THE SOLUTION FOR BIG DATA](https://reader036.vdocuments.us/reader036/viewer/2022081517/5873627b1a28abe7648b5f1b/html5/thumbnails/38.jpg)