Hadoop and Big Data - my presentation to a select audience
HADOOP
BIG DATA
Presented by Chandra Sekhar
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
PRESENTATION FLOW
1. How Hadoop STORES Data
2. How Hadoop PROCESSES Data
3. Architecture of Hadoop
4. ROI
5. Resources
CHALLENGES AS OPPORTUNITIES:
● Out of all the people who sailed between 1997 and 2005, should I target those who purchased the alcohol package or the spa package?
● Based on the onboard spending of adult men from New York who have ever sailed with us, who can be targeted to sail on Azamara?
● Which first-time guest will be a high roller?

COST SAVINGS:
● On a sailing, who (and how many) will have genuine complaints vs. whining?
● Which propulsion unit will break next?

PRODUCTIVITY:
● Which employee will quit next?
We have answers to most of these questions somewhere in our data warehouses.
What is so Great about Hadoop?
● Why all this buzz?
● Is it hype?
● Is it another dot-com?
● How does Hadoop handle it?
The next slide is a good example.
At Yahoo in 2008
Hadoop is Ideal For
● Write once, read many times operations
● No edits, no updates
● Movie files, music files, flight data recorders, logs, and XML files are all fine (DB records as well)
HOW HADOOP STORES DATA
● Hadoop uses blocks to store files.
● The default block size is 64 MB.
● Every block gets replicated three times.
● A 100 MB file will take up 2 blocks (× replication factor of 3 = 6 stored blocks).
● A 1 GB file? Not a problem: 16 blocks (× 3 = 48 stored blocks). The arithmetic is sketched just below.
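A quick sketch of that block arithmetic in Java, using the numbers from this slide (64 MB blocks, replication factor 3):

public class BlockMath {
  public static void main(String[] args) {
    final long BLOCK_SIZE_MB = 64; // default block size per the slide
    final int REPLICATION = 3;     // default replication factor

    long[] fileSizesMb = {100, 1024}; // the 100 MB and 1 GB examples
    for (long size : fileSizesMb) {
      // Round up: a partial block still occupies one block entry
      long blocks = (size + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
      System.out.printf("%d MB -> %d blocks x %d replicas = %d stored blocks%n",
          size, blocks, REPLICATION, blocks * REPLICATION);
    }
  }
}

This prints 2 × 3 = 6 stored blocks for the 100 MB file and 16 × 3 = 48 for the 1 GB file, matching the slide.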
OLD VS NEW
● You can set the replication for older files to 2, and for new files to 3 or even 4 (see the sketch just below).
● You can compress the files.
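A minimal sketch of lowering an older file's replication through the HDFS Java API (the path is hypothetical; setReplication is the real FileSystem method):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerOldFileReplication {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's default file system (HDFS)
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical path to an archived file; drop its replication from 3 to 2
    fs.setReplication(new Path("/archive/2005/sailings.log"), (short) 2);
    fs.close();
  }
}

The same thing can be done from the command line with hadoop fs -setrep.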
More on Blocks..
Because the unit of storage is the block, it does not really matter how many files there are, or how big the files are..
But.
Hadoop prefers large files over many small files. Why?
Why Large Files?
When a block gets created, the address of the block's location gets stored in the namenode's memory, for faster retrieval.
It is not mandated, but it is efficient to keep the number of entries small. Usually multiple small files get merged into a single file (e.g., all Assignment Manager logs for a day into a single huge file). The sketch below shows why this matters.
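A back-of-the-envelope sketch of the namenode memory cost, assuming the common rule of thumb that each file entry and each block entry takes roughly 150 bytes of namenode heap (an approximation, not an exact figure):

public class NamenodeMemoryEstimate {
  // Assumption: ~150 bytes of namenode heap per file entry and per block entry
  static final long BYTES_PER_OBJECT = 150;

  static long estimateBytes(long files, long blocksPerFile) {
    long objects = files + files * blocksPerFile; // file entries + block entries
    return objects * BYTES_PER_OBJECT;
  }

  public static void main(String[] args) {
    // 1,000,000 small files of one block each: ~300 MB of namenode heap
    System.out.println(estimateBytes(1_000_000, 1));
    // The same data merged into 1,000 large files of 1,000 blocks each: ~150 MB
    System.out.println(estimateBytes(1_000, 1_000));
  }
}

Merging the small files halves the entry count in this example, and the savings grow as the files get smaller.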
Data loss is extremely rare. Here is why: with every block replicated three times across different nodes (and racks), all three copies would have to be lost at the same time.
HOW HADOOP PROCESSES DATA
MAP REDUCE
Map Function
● Reads the data
● Usually does the preprocessing
● Hands over the records to the Reduce function for further processing (e.g., eliminate all records where the age is less than 18; see the sketch below)
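A minimal sketch of such a map function in Java, assuming hypothetical CSV records shaped like "name,age,city":

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdultFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 2) return; // skip malformed records
    int age = Integer.parseInt(fields[1].trim());
    // The preprocessing step from the slide: drop records where age < 18
    if (age >= 18) {
      context.write(value, NullWritable.get());
    }
  }
}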
More about Processing
● A single huge file (e.g., a 1 GB file) can be processed by several mappers: usually one block = one mapper, so about 16 map tasks.
● If the logic is simple, you can disable the reduce function and let the map job process the logic on its own (see the sketch below).
● A MapReduce job can pick up a web log from our website, join it to a Siebel table, and write the output to a TIBCO queue bound for AS400 (or to MongoDB directly).
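A sketch of wiring up such a map-only job: setting the number of reduce tasks to zero skips the reduce phase entirely, and the map output goes straight to HDFS (AdultFilterMapper is the hypothetical mapper sketched earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filter adults");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(AdultFilterMapper.class);
    job.setNumReduceTasks(0); // map-only: no reducer runs at all
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}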
Hadoop Eco-System
Mapreduce Flow
KEY VALUE PAIR: Hello World Example
File content: The mouse runs faster than the Cat

Map Function Output
Map job output (K1, V1), with each word lowercased:
(the,1) (mouse,1) (runs,1) (faster,1) (than,1) (the,1) (cat,1)

Reducer Function
Reducer job output (K1, V1):
(the,2) (mouse,1) (runs,1) (faster,1) (than,1) (cat,1)
Hadoop Programming Languages
Java, any scripting language, HIVE, PIG, etc.
Sample code in Java
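A minimal sketch of the classic WordCount in Java, using the org.apache.hadoop.mapreduce API; this mirrors the stock Hadoop example and may differ from the exact code on the original slide:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the 1s for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}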
Same Code in Python
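A minimal sketch of the same word count in Python, assuming it runs under Hadoop Streaming (the usual way to run Python on Hadoop): mapper.py emits one (word, 1) pair per token, and reducer.py sums the counts, relying on the framework to sort the map output by key in between.

# mapper.py
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print('%s\t1' % word)

# reducer.py
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Both scripts are submitted with the hadoop-streaming jar, e.g. hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <in> -output <out>.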
Same Code in PIG
A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray);
B = foreach A generate FLATTEN(TOKENIZE(data)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into '/home/cloudera/Chandra7' using PigStorage(',');
Same Code in HIVE
SELECT word, COUNT(*) FROM input
LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
More on data processing
● Map function output is always sorted by key.
● Map output is intermediate data: it is not saved in HDFS, only on the mapper node's local disk, and it is deleted after the reducers finish.
ARCHITECTURE
ROI
One study: storing and processing 1 TB
● Traditional RDBMS: $37,000 / year
● Data appliance: $5,000 / year
● Hadoop cluster: $2,000 / year
Source: HBR, Big Data @ Work, page 60
Wikibon Study: BREAK-EVEN TIMEFRAME
● Big data approach: 4 months
● Traditional DW appliance approach: 26 months
Resources
● YouTube: "Stanford University, Amr Awadallah"
● 'Must Read' to get certified: http://www.amazon.com/review/R3BSEBI4I4SNUL
THANK YOU