Apache Hadoop Seminar Transcript
Presented by NIKHIL P L
Apache Hadoop
• Developer(s): Apache Software Foundation
• Type: Distributed file system
• License: Apache License 2.0
• Written in: Java
• OS: Cross-platform
• Created by: Doug Cutting (2005)
• Inspired by: Google's MapReduce and GFS
Sub projects
• HDFS (sketch below)
  – Distributed, scalable, and portable file system
  – Stores large data sets
  – Copes with hardware failure
  – Runs on top of the existing file system
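HDFS is accessed programmatically through the org.apache.hadoop.fs.FileSystem API. Below is a minimal write-and-read sketch, assuming a running cluster configured through the usual core-site.xml/hdfs-site.xml files; the path used is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

        Path file = new Path("/tmp/hello.txt");     // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");             // blocks are replicated transparently
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());       // read the record back
        }
        fs.close();
    }
}

Note that the client never touches blocks or replicas directly; the NameNode resolves the path and the DataNodes stream the bytes.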
HDFS - Replication
• Blocks with data are replicated to multiple nodes
• Allows for node failure without data loss (sketch below)
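The replication factor comes from the dfs.replication setting (default 3) and can also be changed per file. A minimal sketch, reusing the hypothetical path from the HDFS example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created through this client;
        // the cluster-wide default normally lives in hdfs-site.xml.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be raised or lowered per file after the fact.
        fs.setReplication(new Path("/tmp/hello.txt"), (short) 5);  // hypothetical path
        fs.close();
    }
}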
Sub projects .
• MapReduce (sketch below)
  – Technology from Google
  – Hadoop's fundamental data processing model
  – Map and Reduce functions
  – Useful in a wide range of applications: distributed pattern-based searching, distributed sorting, web link-graph reversal, machine learning, statistical machine translation
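The canonical illustration of the Map and Reduce functions is word counting: the mapper emits a (word, 1) pair per token, and the reducer sums the counts collected for each word. A sketch against the org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }
}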
MapReduce - Workflow
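The workflow, in outline: a driver program configures and submits a Job; the framework splits the input, runs map tasks, shuffles and sorts the intermediate pairs by key, and runs reduce tasks over the grouped values. A minimal driver sketch wiring up the WordCount classes from the previous slide (input and output paths come from the command line, and the output directory must not already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and block
    }
}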
Hadoop cluster (Terminology)
Types of Nodes
• HDFS nodes
  – NameNode (Master)
  – DataNode (Slaves)
• MapReduce nodes
  – JobTracker (Master)
  – TaskTracker (Slaves)
Types of Nodes .
Sub projects ..
• Hive
  – Provides data summarization, query, and analysis
  – Initially developed by Facebook
• HBase (sketch below)
  – Open-source, non-relational, distributed database
  – Provides Google BigTable-like database capabilities
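A minimal sketch of the HBase client API, storing one cell and reading it back; the table name, column family, and row key are hypothetical, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}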
Sub projects …
• ZooKeeper (sketch below)
  – Distributed configuration service, synchronization service, notification system, and naming registry for large distributed systems
• Pig
  – A language and compiler to generate Hadoop programs
  – Originally developed at Yahoo!
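A minimal sketch of the naming-registry use case with the ZooKeeper Java client: a process registers itself under a well-known path as an ephemeral node, so its entry disappears automatically if the process dies. The connection string and paths are hypothetical, and the parent znode is assumed to exist:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a (hypothetical) local ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Register this process under a well-known path (assumes /services exists).
        zk.create("/services/worker-1",           // hypothetical znode
                  "host:port".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);          // removed when the session ends

        System.out.println(zk.getChildren("/services", false)); // discover workers
        zk.close();
    }
}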
How does Hadoop work? .
• HDFS Works
How does Hadoop work? ..
• MapReduce Works
How does Hadoop work? …
• MapReduce Works
How does Hadoop work? ….
• Managing Hadoop Jobs
Applications
• Marketing analytics
• Machine learning (e.g., spam filters)
• Image processing
• Processing of XML messages
Yahoo!
• World's largest Hadoop production application
• ~20,000 machines running Hadoop
Facebook
• The largest Hadoop cluster in the world, with 100 PB of storage
• 1,200 machines with 8 cores each + 800 machines with 16 cores each
• 32 GB of RAM per machine
• 65 million files in HDFS
• 12 TB of compressed data added per day
Other Users
Thanks