Hadoop and Big Data - My Presentation to a Selective Audience


DESCRIPTION

My presentation on hadoop and big data

TRANSCRIPT

Page 1: Hadoop And Big Data - My Presentation To Selective Audience

HADOOP
Presented by Chandra Sekhar

BIG DATA


Page 2: Hadoop And Big Data - My Presentation To Selective Audience

What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Page 3: Hadoop And Big Data - My Presentation To Selective Audience

PRESENTATION FLOW

1. How Hadoop STORES data
2. How Hadoop PROCESSES data
3. Architecture of Hadoop
4. ROI
5. Resources

Page 4: Hadoop And Big Data - My Presentation To Selective Audience
Page 5: Hadoop And Big Data - My Presentation To Selective Audience

CHALLENGES AS OPPORTUNITIES:

● Out of all the people who sailed between 1997 and 2005, should I target those who purchased the alcohol package or the spa package?
● Based on the onboard spending of adult men from New York who have ever sailed with us, who can be targeted to sail on Azamara?
● Which first-time guest will be a high roller?

COST SAVINGS:

● On a sailing, who (and how many) will have genuine complaints vs. whining?
● Which propulsion unit will break next?

PRODUCTIVITY:

● Which employee will quit next?

We have answers to most of these questions somewhere in our warehouses.

Page 6: Hadoop And Big Data - My Presentation To Selective Audience
Page 7: Hadoop And Big Data - My Presentation To Selective Audience

What is so great about Hadoop?

● Why all this buzz?
● Is it hype?
● Is it a dot-com bubble?
● How does Hadoop handle big data?

The next slide is a good example.

Page 8: Hadoop And Big Data - My Presentation To Selective Audience

At Yahoo in 2008

Page 9: Hadoop And Big Data - My Presentation To Selective Audience
Page 10: Hadoop And Big Data - My Presentation To Selective Audience

Hadoop is ideal for:

● Write once, read many times operations
● No edits, no updates
● Movie files, music files, flight data recorder files, logs, and XML files are all fine (DB records as well)

Page 11: Hadoop And Big Data - My Presentation To Selective Audience

HOW HADOOP STORES DATA

● Hadoop uses blocks to store files.
● The default block size is 64 MB.
● Every block gets replicated three times.
● A 100 MB file will take up 2 blocks (with a replication factor of 3, that makes 6 blocks).
● A 1 GB file? Not a problem: 16 blocks, or 48 with replication (see the sketch below).
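As a back-of-the-envelope sketch (a hypothetical helper, assuming the default 64 MB block size and replication factor of 3 from above):

import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    # Files are stored as fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

print(hdfs_blocks(100))    # (2, 6): 2 blocks, 6 including replicas
print(hdfs_blocks(1024))   # (16, 48): 16 blocks, 48 including replicas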

Page 12: Hadoop And Big Data - My Presentation To Selective Audience

OLD VS NEW

● You can set replication for older files to 2, and for new files to 3 or even 4 (see the sketch below).

● You can compress the files.
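For illustration only, a sketch of changing per-file replication from Python by shelling out to the standard hdfs CLI (the paths here are hypothetical):

import subprocess

# Lower replication to 2 for an older, colder file; -w waits for completion.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", "/archive/2013/logs"], check=True)

# Raise replication to 4 for a newer, hotter file.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "4", "/current/logs"], check=True)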

Page 13: Hadoop And Big Data - My Presentation To Selective Audience

More on Blocks..

Because the unit of storage is the block, it does not really matter how many files there are, or how big the files are...

But:

Hadoop prefers a few large files over many small files. Why?

Page 14: Hadoop And Big Data - My Presentation To Selective Audience

Why Large Files?

When a block gets created, the address of the block's location gets stored in the NameNode's memory for faster retrieval.

This is not mandated, but it is efficient to keep the number of entries small. Usually multiple files get merged into a single file (for example, all Assignment Manager logs for a day merged into one huge file), as sketched below.
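A minimal sketch of that merge step in Python (the directory layout and file names are hypothetical):

import glob
import shutil

# Concatenate a day's small log files into one large file before loading
# into HDFS, so the NameNode tracks one file instead of thousands.
with open("am_logs_2014-12-18.log", "wb") as merged:
    for path in sorted(glob.glob("logs/2014-12-18/*.log")):
        with open(path, "rb") as part:
            shutil.copyfileobj(part, merged)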

Page 15: Hadoop And Big Data - My Presentation To Selective Audience

Data loss is extremely rare. Here is why:

Page 16: Hadoop And Big Data - My Presentation To Selective Audience

HOW HADOOP PROCESSES DATA

MAP REDUCE

Page 17: Hadoop And Big Data - My Presentation To Selective Audience

MAP REDUCE

Map function:
● Reads the data
● Usually does the preprocessing
● Hands the records over to the Reduce function for further processing (for example, eliminate all records where the age is less than 18, as sketched below)
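A sketch of that map step as a Hadoop Streaming style Python script (the comma-separated record layout, with age in the third column, is an assumption for illustration):

import sys

# Map step: read raw records from stdin, drop anyone under 18,
# and emit tab-separated key/value pairs for the reduce step.
for line in sys.stdin:
    fields = line.rstrip("\n").split(",")   # e.g. id,name,age,...
    if int(fields[2]) >= 18:
        print(fields[0] + "\t" + line.rstrip("\n"))

If the job needs no reduce step at all, the mapper's output can be the final output (in streaming terms, setting mapreduce.job.reduces to 0), which is the "disable reduce" case on the next slide.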

Page 18: Hadoop And Big Data - My Presentation To Selective Audience

More about processing

● A single huge file (e.g., 1 GB) could be processed by several mappers (usually one block = one mapper, so about 16 map tasks for a 1 GB file).
● If the logic is simple, you can disable the reduce function and the map job can apply the logic on its own.
● A MapReduce job can pick up a web log from our website, join it to a Siebel table, and write the output to a TIBCO queue for loading into AS400 (or into MongoDB directly).

Page 19: Hadoop And Big Data - My Presentation To Selective Audience

Hadoop Eco-System

Page 20: Hadoop And Big Data - My Presentation To Selective Audience

Mapreduce Flow

Page 21: Hadoop And Big Data - My Presentation To Selective Audience

KEY-VALUE PAIRS
Hello World Example

File content: "The mouse runs faster than the Cat"

Page 22: Hadoop And Big Data - My Presentation To Selective Audience

Map function output

Map job output (K1, V1):
(The, 1) (mouse, 1) (runs, 1) (faster, 1) (than, 1) (the, 1) (cat, 1)

Page 23: Hadoop And Big Data - My Presentation To Selective Audience

Reducer Function

Reducer job output (K1, V1):
(The, 2) (mouse, 1) (runs, 1) (faster, 1) (than, 1) (cat, 1)

(Counting case-insensitively, "The" and "the" combine into a single key with a count of 2.)

Page 24: Hadoop And Big Data - My Presentation To Selective Audience

Hadoop Programming Languages

Java, any scripting language, Hive, Pig, etc.

Page 25: Hadoop And Big Data - My Presentation To Selective Audience

Sample code in Java

Page 26: Hadoop And Big Data - My Presentation To Selective Audience

Same Code in Python
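A minimal Hadoop Streaming style sketch of the same word count in Python (mapper.py and reducer.py are hypothetical file names):

# mapper.py: emit (word, 1) for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py: input arrives sorted by key, so counts for a word are adjacent
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print(current + "\t" + str(count))

Both scripts would be handed to the streaming jar (hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...); the framework sorts the map output between the two steps, which is why the reducer can rely on adjacent keys.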

Page 27: Hadoop And Big Data - My Presentation To Selective Audience

Same Code in PIG

A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray);

B = foreach A generate FLATTEN(TOKENIZE(data)) as word;

C = group B by word;
D = foreach C generate group, COUNT(B);

store D into '/home/cloudera/Chandra7' using PigStorage(',');

Page 28: Hadoop And Big Data - My Presentation To Selective Audience

Same Code in HIVE

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;

Page 29: Hadoop And Big Data - My Presentation To Selective Audience

More on data processing

● Map function output is always sorted by key.

● Map output is intermediate data, so it is not saved in HDFS; it is kept only on the local node and is deleted after the reducer finishes.

Page 30: Hadoop And Big Data - My Presentation To Selective Audience

ARCHITECTURE.

Page 31: Hadoop And Big Data - My Presentation To Selective Audience
Page 32: Hadoop And Big Data - My Presentation To Selective Audience
Page 33: Hadoop And Big Data - My Presentation To Selective Audience
Page 34: Hadoop And Big Data - My Presentation To Selective Audience

ROI

One study on storing and processing 1 TB:

Traditional RDBMS: $37,000 / year
Data appliance: $5,000 / year
Hadoop cluster: $2,000 / year

Source: HBR, Big Data @ Work, page 60

Page 35: Hadoop And Big Data - My Presentation To Selective Audience

Wikibon Study: Break-even Timeframe

Big data approach: 4 months

Traditional DW appliance approach: 26 months

Page 36: Hadoop And Big Data - My Presentation To Selective Audience

Resources

YouTube: "Stanford University, Amr Awadallah"

Page 37: Hadoop And Big Data - My Presentation To Selective Audience

'Must read' to get certified: http://www.amazon.com/review/R3BSEBI4I4SNUL

Page 38: Hadoop And Big Data - My Presentation To Selective Audience

THANK YOU