Apache Hadoop - A Deep Dive (Part 1 - HDFS)
DESCRIPTION
This is the next tech talk in our series diving deep into the Apache Hadoop framework. Hadoop is currently the industry leader in Big Data implementations. This tech talk covers core Hadoop and how it works. Part 1 explains HDFS; the next tech talk, Part 2, will explain MapReduce.
TRANSCRIPT
HADOOP - A DEEP DIVE
Debarchan Sarkar
Sunil Kumar Chakrapani

The call will start soon; please stay on mute. Thanks for your time and patience.
AGENDA
• Recap - What is Big Data?
• Problems Introduced
• Traditional Architecture
• Cluster Architecture
• Where It All Started
• How Does It Work (1 & 2) - A 50,000-Feet Overview
• Hadoop Distributed Architecture
• HDFS Architecture
WHAT IS BIG DATA?

Data sources, grouped by era:
• ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking
• Web 2.0 / Mobile: Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations
• Internet of Things: Audio / Video, Log Files, Text / Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis / Blogs, Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates

Scale: Gigabytes (10^9) → Terabytes (10^12) → Petabytes (10^15) → Exabytes (10^18)
Dimensions: Volume, Velocity, Variety, Variability

Storage cost per GB: $190,000 (1980) → $9,000 (1990) → $15 (2000) → $0.07 (2010)
STORAGE CAPACITY VS ACCESS SPEED

1990: a typical drive stores 1,370 MB of data and reads at a 4.4 MB/s transfer rate. Scanning the full drive takes about 5 minutes.
2010: 1 TB is the norm, read at a 100 MB/s transfer rate. Scanning the full drive takes about 2.5 hours.
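The "5 minutes vs 2.5 hours" figures are simple division (capacity over transfer rate). As a quick sanity check, a sketch of my own rather than something from the deck:

```javascript
// Full-drive scan time = capacity / transfer rate.
function scanMinutes(capacityMB, mbPerSec) {
  return capacityMB / mbPerSec / 60;
}

// 1990: a 1,370 MB drive read at 4.4 MB/s
console.log(scanMinutes(1370, 4.4).toFixed(1) + " minutes"); // ~5.2 minutes

// 2010: a 1 TB (~1,000,000 MB) drive read at 100 MB/s
console.log((scanMinutes(1000000, 100) / 60).toFixed(1) + " hours"); // ~2.8 hours
```

Capacity grew roughly 700x while transfer rate grew only about 20x, which is exactly the gap the next slide attacks with parallel reads.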
READ 1 TB OF DATA

1 machine, 4 I/O channels, each channel at 100 MB/s: ~45 minutes
10 machines, 4 I/O channels each, each channel at 100 MB/s: ~4.5 minutes
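The scaling here is that aggregate throughput is machines × channels × per-channel rate, so read time drops linearly with machine count. A small sketch (my own, not from the deck; it ignores coordination overhead, which is why the slide's real-world figures are slightly higher):

```javascript
// Time to read a dataset when reads are spread over several machines,
// each with multiple independent I/O channels.
function readMinutes(totalMB, machines, channels, mbPerSecPerChannel) {
  var aggregate = machines * channels * mbPerSecPerChannel; // total MB/s
  return totalMB / aggregate / 60;
}

var oneTB = 1000000; // MB, approximately
console.log(readMinutes(oneTB, 1, 4, 100).toFixed(1));  // ~41.7 (slide: ~45 min)
console.log(readMinutes(oneTB, 10, 4, 100).toFixed(1)); // ~4.2 (slide: ~4.5 min)
```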
HARDWARE FAILURE
A common way of avoiding data loss is through replication
TRADITIONAL ARCHITECTURE
[Diagram: servers connected to shared SAN storage]
CLUSTER ARCHITECTURE

[Diagram: a rack of ten 1U commodity servers]
NUTCH IS WHERE IT ALL STARTED

Google File System → HDFS: Hadoop Distributed File System
MapReduce (Google's paper) → Hadoop MapReduce
HOW DOES IT WORK - 1
HOW DOES IT WORK - 2
RUNTIME
// Map and Reduce functions in JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
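The word-count map and reduce functions can be exercised locally with a tiny in-memory harness. This is an illustrative sketch of my own: the `context` objects and the `values` iterator here are stand-ins for what the real MapReduce runtime would supply, and the shuffle step is reduced to simple grouping by key:

```javascript
// The word-count map/reduce pair from the slide.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Minimal local harness: map, group by key (the "shuffle"), then reduce.
function runWordCount(text) {
  var groups = {};
  map(null, text, {
    write: function (k, v) { (groups[k] = groups[k] || []).push(v); }
  });

  var result = {};
  Object.keys(groups).forEach(function (k) {
    var i = 0, vals = groups[k];
    var iter = {
      hasNext: function () { return i < vals.length; },
      next: function () { return vals[i++]; }
    };
    reduce(k, iter, { write: function (key, v) { result[key] = v; } });
  });
  return result;
}

console.log(runWordCount("the quick fox and the lazy dog"));
// { the: 2, quick: 1, fox: 1, and: 1, lazy: 1, dog: 1 }
```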
HADOOP DISTRIBUTED ARCHITECTURE (Master / Slave)

MapReduce Layer
HDFS Layer

Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png
HDFS ARCHITECTURE

File metadata (held by the NameNode):
/user/kc/data01.txt – Blocks 1, 2, 3, 4
/user/apb/data02.txt – Blocks 5, 6

[Diagram: blocks 1-6 replicated across DataNodes in RACK 1 and RACK 2]

Block replica locations:
Block 1: R1DN01, R1DN02, R2DN01
Block 2: R1DN01, R1DN02, R2DN03
Block 3: R1DN02, R1DN03, R2DN01
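The two mappings on this slide (file → block list, block → replica locations) are the essence of the NameNode's metadata. As a toy illustration, reusing the file paths and DataNode names from the slide (the `locateBlocks` helper is my own, not a real HDFS API):

```javascript
// Toy model of NameNode metadata: files map to block IDs, and each
// block ID maps to the DataNodes holding a replica of that block.
var fileToBlocks = {
  "/user/kc/data01.txt": [1, 2, 3, 4],
  "/user/apb/data02.txt": [5, 6]
};
var blockToDataNodes = {
  1: ["R1DN01", "R1DN02", "R2DN01"],
  2: ["R1DN01", "R1DN02", "R2DN03"],
  3: ["R1DN02", "R1DN03", "R2DN01"]
};

// Given a file path, list where each of its blocks can be read from.
function locateBlocks(path) {
  return (fileToBlocks[path] || []).map(function (id) {
    return { block: id, replicas: blockToDataNodes[id] || [] };
  });
}

console.log(locateBlocks("/user/kc/data01.txt"));
```

Note that each block on the slide keeps one replica in the second rack, so losing an entire rack never loses data.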
BLOCK SIZE AND REPLICATION

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

(134,217,728 bytes = 128 MB per block; each block is stored on 3 DataNodes.)
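These two settings determine how a file is split and how much raw storage it consumes. A quick arithmetic sketch of my own using the values above:

```javascript
// Block and storage arithmetic for the dfs.block.size / dfs.replication
// values shown above.
var BLOCK_SIZE = 134217728; // 128 MB
var REPLICATION = 3;

// Number of HDFS blocks a file of the given size occupies.
function blockCount(fileBytes) {
  return Math.ceil(fileBytes / BLOCK_SIZE);
}

// Raw cluster storage consumed: every byte is stored on 3 DataNodes.
function rawBytes(fileBytes) {
  return fileBytes * REPLICATION;
}

var oneGB = 1024 * 1024 * 1024;
console.log(blockCount(oneGB));          // 8 blocks
console.log(rawBytes(oneGB) / oneGB);    // 3 GB of raw storage per 1 GB file
```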
NAMENODE & SECONDARY NAMENODE

NameNode:
• On startup, reads the fsimage and edits files
• Transactions in edits are merged with fsimage, and edits is emptied
• When a client application creates a new file in HDFS, the NameNode logs that transaction in the edits file

Checkpoint (Secondary NameNode):
• The Secondary NameNode periodically creates checkpoints of the namespace
• It downloads fsimage and edits from the active NameNode
• Merges fsimage and edits locally
• Uploads the new image back to the active NameNode
• Checkpoint frequency is controlled by fs.checkpoint.period and fs.checkpoint.size
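The merge step can be pictured as replaying a log of operations into a snapshot. A toy sketch of my own (the real fsimage and edits are binary on-disk structures, and the operation names here are invented for illustration):

```javascript
// Toy model of a checkpoint: replay the edits log into the fsimage
// snapshot, then empty the log. Illustrative only.
var fsimage = { "/user/kc/data01.txt": { blocks: [1, 2, 3, 4] } };
var edits = [
  { op: "create", path: "/user/apb/data02.txt", blocks: [5, 6] },
  { op: "delete", path: "/user/kc/data01.txt" }
];

function checkpoint(image, log) {
  log.forEach(function (e) {
    if (e.op === "create") image[e.path] = { blocks: e.blocks };
    else if (e.op === "delete") delete image[e.path];
  });
  log.length = 0; // edits is emptied once merged
  return image;
}

checkpoint(fsimage, edits);
console.log(Object.keys(fsimage)); // [ '/user/apb/data02.txt' ]
console.log(edits.length);         // 0
```

This is why the checkpoint matters operationally: without it, edits grows unboundedly and the next NameNode restart has to replay the whole log.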
SAFE MODE

During startup the NameNode loads the file system state from the fsimage and the edits log file, then waits for DataNodes to report their blocks. During this time the NameNode stays in Safemode. Safemode is essentially a read-only mode for the HDFS cluster: it does not allow any modifications to the file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available.
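"Most blocks are available" is a concrete threshold check. A sketch of the exit condition, assuming the Hadoop 1 setting `dfs.safemode.threshold.pct` (default 0.999; treat the exact property name and default as an assumption here):

```javascript
// Safemode exit condition: leave once the fraction of blocks reported by
// DataNodes reaches the configured threshold. The 0.999 default is an
// assumption based on dfs.safemode.threshold.pct in Hadoop 1.
var THRESHOLD = 0.999;

function canLeaveSafemode(reportedBlocks, totalBlocks) {
  if (totalBlocks === 0) return true; // empty namespace: nothing to wait for
  return reportedBlocks / totalBlocks >= THRESHOLD;
}

console.log(canLeaveSafemode(9990, 10000)); // true  (99.9% reported)
console.log(canLeaveSafemode(9500, 10000)); // false (95% reported)
```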
HDFS WRITES

Step 1: The HDFS client caches the file data into a temporary local file.
Steps 2-5: [illustrated in the slide diagram, involving the NameNode and DataNodes]
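The slide only spells out step 1, the client-side caching. As a rough sketch of that idea (my own, not from the deck): writes accumulate in a local buffer, and only a full block's worth of data at a time is handed off for placement. The `flushBlock` callback is a hypothetical stand-in for the NameNode/DataNode pipeline in the later steps:

```javascript
// Sketch of HDFS client-side write caching: buffer locally, hand off one
// full block at a time. "flushBlock" stands in for the NameNode/DataNode
// interaction of steps 2-5; the tiny block size is for illustration only.
var BLOCK_SIZE = 8; // bytes

function makeWriter(flushBlock) {
  var buffer = "";
  return {
    write: function (data) {
      buffer += data;
      while (buffer.length >= BLOCK_SIZE) {
        flushBlock(buffer.slice(0, BLOCK_SIZE)); // a full block is ready
        buffer = buffer.slice(BLOCK_SIZE);
      }
    },
    close: function () {
      if (buffer.length > 0) flushBlock(buffer); // final partial block
      buffer = "";
    }
  };
}

var blocks = [];
var w = makeWriter(function (b) { blocks.push(b); });
w.write("hello hdfs ");
w.write("write path");
w.close();
console.log(blocks); // [ 'hello hd', 'fs write', ' path' ]
```

Buffering client-side keeps small writes off the network and lets the NameNode allocate DataNodes only when a whole block is ready.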
FEEDBACK

Support Team's blog: http://blogs.msdn.com/b/bigdatasupport/
Facebook Page: https://www.facebook.com/MicrosoftBigData
Facebook Group: https://www.facebook.com/groups/bigdatalearnings/
Twitter: @debarchans

Read more:
http://en.wikipedia.org/wiki/Hadoop
http://en.wikipedia.org/wiki/Big_data

Next Session: Apache Hadoop – MapReduce