Hadoop Basics -Venkat Cherukupalli

What is Hadoop?
Open Source
Distributed processing
Large data sets across clusters
Commodity, shared-nothing servers
Local computation and storage

Key Services
Hadoop Distributed File System (HDFS) - reliable data storage
MapReduce - high-performance parallel data processing

HDFS
Splits user data into blocks and spreads them across the servers in a cluster
Replication - each block is copied to several nodes, so losing individual nodes does not cause data loss
Reliable, scalable, and low-cost storage
RAID-like redundancy, but across machines and at massive scale
NameNode (file system metadata) and DataNodes (block storage)
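The block splitting and replication described above can be sketched in a few lines of Python. This is a toy in-memory model, not the HDFS API: the function names, the tiny block size, and the round-robin placement are assumptions made for illustration. Real HDFS defaults to much larger blocks and three replicas, and uses rack-aware placement.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin).

    The returned mapping plays the role of the NameNode's metadata:
    block id -> DataNodes that hold a copy.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(block_map: Dict[int, List[str]],
                           failed: Set[str]) -> bool:
    """True if every block still has at least one replica on a live node."""
    return all(any(node not in failed for node in holders)
               for holders in block_map.values())


# Hypothetical cluster of four DataNodes and a 1000-byte "file".
nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(b"x" * 1000, block_size=256)
block_map = place_replicas(len(blocks), nodes, replication=3)

two_down = readable_after_failure(block_map, {"dn1", "dn2"})
three_down = readable_after_failure(block_map, {"dn1", "dn2", "dn3"})
```

With three replicas per block, any two nodes can fail and every block is still readable; losing three of the four nodes in this model does lose a block.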

HDFS

MapReduce
Parallel distributed processing system
No special parallel-programming techniques required - the framework handles distribution, scheduling, and failure recovery
Many existing algorithms carry over with little change once expressed as map and reduce steps

MapReduce Framework
• Processes large jobs in parallel across many nodes and combines results.
• Eliminates the bottlenecks imposed by monolithic storage systems.
• Results are collated and digested into a single output after each piece has been analyzed.
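The map, combine, and collate flow in the bullets above can be illustrated with a word count run entirely in one Python process. This is only a sketch of the programming model: the function names and sample input are made up here, and a real Hadoop job would typically be written against the Java MapReduce API or run through Hadoop Streaming, with the framework distributing the chunks across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key across every mapper's output."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for a chunk of input handled by a separate node.
chunks = ["hadoop stores data", "hadoop processes data", "data everywhere"]
mapped = [map_phase(chunk) for chunk in chunks]   # independent, parallelizable
word_counts = reduce_phase(shuffle(mapped))       # collated into one output
```

Because each call to `map_phase` depends only on its own chunk, the chunks can be processed on different machines; only the shuffle step needs to bring results together.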

Self-healing
Shifts work to the remaining nodes when a node fails
Creates additional copies of the data from the surviving replicas
Self-healing covers both storage and computation
No sysadmin intervention required
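A rough sketch of both kinds of self-healing, assuming a simple in-memory model rather than anything Hadoop actually exposes: `heal` restores the replica count of each block from a surviving copy, and `reassign_tasks` moves work off dead nodes. All names and the node list are hypothetical.

```python
from typing import Dict, List


def heal(block_map: Dict[int, List[str]], live_nodes: List[str],
         replication: int) -> Dict[int, List[str]]:
    """Storage healing: drop replicas on dead nodes, then re-copy each block
    onto other live nodes until it has `replication` copies again
    (or as many as the live nodes allow)."""
    healed = {}
    for block_id, holders in block_map.items():
        survivors = [n for n in holders if n in live_nodes]
        if not survivors:
            raise RuntimeError(f"block {block_id} has no surviving replica")
        candidates = [n for n in live_nodes if n not in survivors]
        needed = max(0, replication - len(survivors))
        healed[block_id] = survivors + candidates[:needed]
    return healed


def reassign_tasks(assignments: Dict[str, str],
                   live_nodes: List[str]) -> Dict[str, str]:
    """Computation healing: tasks on dead nodes move to live nodes in turn."""
    result, next_slot = {}, 0
    for task, node in assignments.items():
        if node in live_nodes:
            result[task] = node
        else:
            result[task] = live_nodes[next_slot % len(live_nodes)]
            next_slot += 1
    return result


# dn2 has failed; dn5 is a spare node that is still alive.
live_nodes = ["dn1", "dn3", "dn4", "dn5"]
block_map = {0: ["dn1", "dn2", "dn3"],
             1: ["dn2", "dn3", "dn4"],
             2: ["dn3", "dn4", "dn1"]}
healed = heal(block_map, live_nodes, replication=3)

tasks = {"map-0": "dn1", "map-1": "dn2", "map-2": "dn3"}
rescheduled = reassign_tasks(tasks, live_nodes)
```

In this model neither step needs an operator: the replica count and the task assignments are both repaired from the list of live nodes alone.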

What is Sqoop?
Imports individual tables or entire databases into files in HDFS
Generates Java classes that let you interact with your imported data
Can import from SQL databases straight into your Hive data warehouse
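As an illustration of the table import described above, the sketch below assembles a Sqoop 1 `import` command line in Python without running it. The helper name, JDBC URL, table, user, and target directory are all hypothetical; the flags used (`--connect`, `--username`, `-P`, `--table`, `--target-dir`, `--hive-import`) are standard Sqoop 1 options, and whole databases are handled by the separate `import-all-tables` tool.

```python
import shlex
from typing import List, Optional


def sqoop_import_command(jdbc_url: str, table: str, username: str,
                         target_dir: Optional[str] = None,
                         hive_import: bool = False) -> List[str]:
    """Build (but do not run) a `sqoop import` argument list.

    -P makes Sqoop prompt for the password instead of taking it on the
    command line.
    """
    cmd = ["sqoop", "import",
           "--connect", jdbc_url,
           "--username", username, "-P",
           "--table", table]
    if hive_import:
        cmd.append("--hive-import")          # load into Hive instead of plain files
    elif target_dir:
        cmd += ["--target-dir", target_dir]  # HDFS directory for the output files
    return cmd


# Hypothetical connection details, for illustration only.
to_hdfs = sqoop_import_command("jdbc:mysql://db.example.com/sales", "orders",
                               "etl_user", target_dir="/data/sales/orders")
to_hive = sqoop_import_command("jdbc:mysql://db.example.com/sales", "orders",
                               "etl_user", hive_import=True)

# Shell-safe rendering of the command for logging or copy-paste.
printable = " ".join(shlex.quote(arg) for arg in to_hdfs)
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later without shell quoting problems.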

Other Concepts
HBase - an open source, non-relational, distributed database modeled after Google's Bigtable and written in Java
Hive - a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis
Pig - a platform for creating MapReduce programs used with Hadoop
ZooKeeper - reliable distributed coordination