Hadoop and Big Data - my presentation to a select audience
HADOOP
BIG DATA
Presented by Chandra Sekhar
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
PRESENTATION FLOW
1. How Hadoop STORES Data
2. How Hadoop PROCESSES Data
3. Architecture of Hadoop
4. ROI
5. Resources
CHALLENGES AS OPPORTUNITIES:
● Out of all the people who sailed between 1997 and 2005, should I target those who purchased the alcohol package or the spa package?
● Based on the onboard spending of adult men from New York who have ever sailed with us, who can be targeted to sail on Azamara?
● Which first-time guest will be a high roller?

COST SAVINGS:
● On a sailing, who (and how many) will have genuine complaints vs. whining?
● Which propulsion unit will break next?

PRODUCTIVITY:
● Which employee will quit next?
We have answers to most of these questions somewhere in our data warehouses.
What is so Great about Hadoop?
● Why all this buzz?
● Is it hype?
● Is it another dot-com?
● How does Hadoop handle it?
The next slide is a good example.
At Yahoo in 2008
Hadoop is Ideal For
● Write once, read many times operations
● No edits, no updates
● Movie files, music files, flight data recorders, logs, and XML files are all fine (DB records as well)
HOW HADOOP STORES DATA
● Hadoop uses blocks to store files.
● The default block size is 64 MB.
● Every block gets replicated three times.
● A 100 MB file will take up 2 blocks (× replication factor of 3 = 6 stored blocks).
● A 1 GB file? Not a problem: 16 blocks (× 3 = 48 stored blocks). The arithmetic is sketched just below.
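A quick sketch of that block arithmetic in Java, using the numbers from this slide (64 MB blocks, replication factor 3):

public class BlockMath {
  public static void main(String[] args) {
    final long BLOCK_SIZE_MB = 64; // default block size per the slide
    final int REPLICATION = 3;     // default replication factor

    long[] fileSizesMb = {100, 1024}; // the 100 MB and 1 GB examples
    for (long size : fileSizesMb) {
      // Round up: a partial block still occupies one block entry
      long blocks = (size + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
      System.out.printf("%d MB -> %d blocks x %d replicas = %d stored blocks%n",
          size, blocks, REPLICATION, blocks * REPLICATION);
    }
  }
}

This prints 2 × 3 = 6 stored blocks for the 100 MB file and 16 × 3 = 48 for the 1 GB file, matching the slide.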
OLD VS NEW
● You can set the replication for older files to 2, and for new files to 3 or even 4 (see the sketch just below).
● You can compress the files.
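A minimal sketch of lowering an older file's replication through the HDFS Java API (the path is hypothetical; setReplication is the real FileSystem method):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerOldFileReplication {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's default file system (HDFS)
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path to an archived file; drop its replication from 3 to 2
    fs.setReplication(new Path("/archive/2005/sailings.log"), (short) 2);
    fs.close();
  }
}

The same thing can be done from the command line with hadoop fs -setrep.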
More on Blocks..
Because the unit of storage is the block, it does not really matter how many files there are, or how big the files are..
But.
Hadoop prefers large files over many small files. Why?
Why Large Files?
When a block gets created, the address of the block's location gets stored in the namenode's memory, for faster retrieval.
It is not mandated, but it is efficient to keep the number of entries small. Usually multiple small files get merged into a single file (e.g., all Assignment Manager logs for a day into a single huge file). The sketch below shows why this matters.
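A back-of-the-envelope sketch of the namenode memory cost, assuming the common rule of thumb that each file entry and each block entry takes roughly 150 bytes of namenode heap (an approximation, not an exact figure):

public class NamenodeMemoryEstimate {
  // Assumption: ~150 bytes of namenode heap per file entry and per block entry
  static final long BYTES_PER_OBJECT = 150;

  static long estimateBytes(long files, long blocksPerFile) {
    long objects = files + files * blocksPerFile; // file entries + block entries
    return objects * BYTES_PER_OBJECT;
  }

  public static void main(String[] args) {
    // 1,000,000 small files of one block each: ~300 MB of namenode heap
    System.out.println(estimateBytes(1_000_000, 1));
    // The same data merged into 1,000 large files of 1,000 blocks each: ~150 MB
    System.out.println(estimateBytes(1_000, 1_000));
  }
}

Merging the small files halves the entry count in this example, and the savings grow as the files get smaller.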
Data loss is extremely rare. Here is why: with every block replicated three times across different nodes (and racks), all three copies would have to be lost at the same time.
HOW HADOOP PROCESSES DATA
MAP REDUCE
Map Function
● Reads the data
● Usually does the preprocessing
● Hands over the records to the Reduce function for further processing (e.g., eliminate all records where the age is less than 18; see the sketch below)
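A minimal sketch of such a map function in Java, assuming hypothetical CSV records shaped like "name,age,city":

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdultFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 2) return; // skip malformed records
    int age = Integer.parseInt(fields[1].trim());
    // The preprocessing step from the slide: drop records where age < 18
    if (age >= 18) {
      context.write(value, NullWritable.get());
    }
  }
}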
More about Processing
● A single huge file (e.g., a 1 GB file) can be processed by several mappers: usually one block = one mapper, so about 16 map tasks.
● If the logic is simple, you can disable the reduce function and let the map job process the logic on its own (see the sketch below).
● A MapReduce job can pick up a web log from our website, join it to a Siebel table, and write the output to a TIBCO queue bound for AS400 (or to MongoDB directly).
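A sketch of wiring up such a map-only job: setting the number of reduce tasks to zero skips the reduce phase entirely, and the map output goes straight to HDFS (AdultFilterMapper is the hypothetical mapper sketched earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter adults");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(AdultFilterMapper.class);
    job.setNumReduceTasks(0); // map-only: no reducer runs at all
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}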
Hadoop Eco-System
Mapreduce Flow
KEY VALUE PAIR: Hello World Example
File content: The mouse runs faster than the Cat

Map Function Output
Map job output (K1, V1), with each word lowercased:
(the,1) (mouse,1) (runs,1) (faster,1) (than,1) (the,1) (cat,1)

Reducer Function
Reducer job output (K1, V1):
(the,2) (mouse,1) (runs,1) (faster,1) (than,1) (cat,1)
Hadoop Programming Languages
Java, any scripting language, HIVE, PIG, etc.
Sample code in Java
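A minimal sketch of the classic WordCount in Java, using the org.apache.hadoop.mapreduce API; this mirrors the stock Hadoop example and may differ from the exact code on the original slide:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the 1s for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}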
Same Code in Python
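A minimal sketch of the same word count in Python, assuming it runs under Hadoop Streaming (the usual way to run Python on Hadoop): mapper.py emits one (word, 1) pair per token, and reducer.py sums the counts, relying on the framework to sort the map output by key in between.

# mapper.py
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print('%s\t1' % word)

# reducer.py
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Both scripts are submitted with the hadoop-streaming jar, e.g. hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <in> -output <out>.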
Same Code in PIG
A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray);
B = foreach A generate FLATTEN(TOKENIZE(data)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into '/home/cloudera/Chandra7' using PigStorage(',');
Same Code in HIVE
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
More on data processing
● Map function output is always sorted by key.
● Map output is intermediate data: it is not saved in HDFS, only on the mapper node's local disk, and it is deleted after the reducers finish.
ARCHITECTURE
ROI
One study: storing and processing 1 TB
● Traditional RDBMS: $37,000 / year
● Data appliance: $5,000 / year
● Hadoop cluster: $2,000 / year
Source: HBR, Big Data @ Work, page 60
Wikibon Study: BREAK-EVEN TIMEFRAME
● Big data approach: 4 months
● Traditional DW appliance approach: 26 months
Resources
● YouTube: "Stanford University, Amr Awadallah"
● 'Must Read' to get certified: http://www.amazon.com/review/R3BSEBI4I4SNUL
THANK YOU