Post on 27-Jan-2015


DESCRIPTION

Apache Hadoop Seminar

TRANSCRIPT


Presented by NIKHIL P L

Apache Hadoop

• Developer(s): Apache Software Foundation
• Type: Distributed file system
• License: Apache License 2.0
• Written in: Java
• OS: Cross-platform
• Created by: Doug Cutting (2005)
• Inspired by: Google's MapReduce and GFS


Subprojects

• HDFS
– A distributed, scalable, and portable file system
– Stores large data sets
– Copes with hardware failure
– Runs on top of the existing file system
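The core HDFS idea of storing a large file as fixed-size blocks can be sketched in a few lines of Python (a hypothetical helper, with a tiny block size for illustration; real HDFS defaults to much larger blocks, e.g. 64–128 MB):

```python
def split_into_blocks(data: bytes, block_size: int = 4) -> list:
    """Split a byte stream into fixed-size blocks, HDFS-style.

    The tiny block size here is purely illustrative; HDFS uses
    blocks of tens to hundreds of megabytes.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghij")
# The last block may be shorter than the block size.
```

Each block can then be stored and replicated independently across the cluster's machines.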


HDFS - Replication

• Blocks of data are replicated to multiple nodes

• Allows for node failure without data loss
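The fault-tolerance argument behind replication can be sketched as follows (a hypothetical round-robin placement policy; real HDFS placement is rack-aware, but the reasoning is the same):

```python
def place_replicas(block_id: int, nodes: list, replication: int = 3) -> list:
    """Pick `replication` distinct nodes for a block, round-robin by id.

    Illustrative policy only: HDFS's actual placement also considers
    racks and node load, but it likewise spreads copies across nodes.
    """
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]
placement = {b: place_replicas(b, nodes) for b in range(8)}

# With 3 replicas on distinct nodes, any single node can fail and
# every block still has at least two surviving copies.
for failed in nodes:
    for replicas in placement.values():
        assert sum(1 for n in replicas if n != failed) >= 2
```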


Subprojects (contd.)

• MapReduce
– A technology from Google
– Hadoop's fundamental data-processing algorithm
– Map and Reduce functions
– Useful in a wide range of applications: distributed pattern-based searching, distributed sorting, web link-graph reversal, machine learning, statistical machine translation
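The Map and Reduce functions above can be sketched with the classic word-count example, simulated in plain Python (the function names are illustrative; in Hadoop the framework itself performs the shuffle between the two phases, in parallel across many machines):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in each input line."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"hadoop": 2, "stores": 1, "data": 2, "processes": 1}
```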


MapReduce - Workflow


Hadoop cluster (Terminology)


Types of Nodes

• HDFS nodes
– NameNode (master)
– DataNodes (slaves)

• MapReduce nodes
– JobTracker (master)
– TaskTrackers (slaves)


Types of Nodes (contd.)


Subprojects (contd.)

• Hive
– Provides data summarization, query, and analysis
– Initially developed by Facebook

• HBase
– An open-source, non-relational, distributed database
– Provides Google BigTable-like database capabilities
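The BigTable-model idea behind HBase (sparse rows keyed by "family:qualifier" columns, where rows need not share the same columns) can be sketched as a toy table in Python; the `put`/`get` helpers here are illustrative, not the HBase API:

```python
from collections import defaultdict

# A toy sparse table in the BigTable/HBase style: each row maps
# "family:qualifier" column keys to values.
table = defaultdict(dict)

def put(row: str, column: str, value: str) -> None:
    table[row][column] = value

def get(row: str, column: str):
    # Missing cells simply return None; rows are sparse by design.
    return table.get(row, {}).get(column)

put("user1", "info:name", "Alice")
put("user1", "info:email", "alice@example.com")
put("user2", "info:name", "Bob")  # user2 has no info:email cell
```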


Subprojects (contd.)

• ZooKeeper
– Provides a distributed configuration service, synchronization services, notification systems, and a naming registry for large distributed systems

• Pig
– A language and compiler for generating Hadoop programs
– Originally developed at Yahoo!


How does Hadoop work?

• HDFS Works


How does Hadoop work? (contd.)

• MapReduce Works


How does Hadoop work? (contd.)

• MapReduce Works


How does Hadoop work? (contd.)

• Managing Hadoop Jobs


Applications

• Marketing analytics
• Machine learning (e.g., spam filters)
• Image processing
• Processing of XML messages


Yahoo!

• World's largest Hadoop production application
• ~20,000 machines running Hadoop


Facebook

• The largest Hadoop cluster in the world, with 100 PB of storage
• 1,200 machines with 8 cores each + 800 machines with 16 cores each
• 32 GB of RAM per machine
• 65 million files in HDFS
• 12 TB of compressed data added per day


Other Users


Thanks
