Big Data and Hadoop - Introduction, Architecture (HDFS and MapReduce), Ecosystem
DESCRIPTION
This is the first, introductory session on Big Data & Hadoop. It clears up many myths and confusions about Big Data: how exactly handling Big Data is a challenge, and how Hadoop addresses those problems. It is designed as a 3-hour session. The first half talks about Big Data and the related challenges, and how Hadoop solves them. The second half covers the Hadoop component stack, HDFS architecture, and the MapReduce paradigm and architecture. By the end of this presentation you should be fairly clear about Hadoop and Big Data. To watch the video or learn more about the course, please visit http://www.knowbigdata.com/page/big-data-and-hadoop
TRANSCRIPT
Sandeep Giri - Hadoop
Session 1 - Introduction
Welcome to the Big Data & Hadoop Session
[email protected] | +91-9538998962
Use Q&A to Communicate With Instructor
Interact - Ask Questions
WELCOME - KNOWBIGDATA
• Lifetime access to content
• Class recordings
• Cluster access
• 24x7 support
• Real-life project
• Quizzes & certification test
• 10 x (3-hour classes)
• Socio-pro visibility
• Mock interviews
ABOUT ME
2014 - KnowBigData: Founded KnowBigData
2012 - Amazon: Built high-throughput systems for the Amazon.com site using in-house NoSQL
2011 - InMobi: Built a recommender after churning 200 TB of data
2006 - tBits Global: Founded tBits Global; built an enterprise-grade document management system
2002 - D.E. Shaw: Built big data systems before the term was coined
2002 - IIT Roorkee: Finished B.Tech somehow
COURSE CONTENT
I. Understanding Big Data, Hadoop Architecture
II. Cluster Setup, ETL, Project Environment
III. MapReduce Framework
IV. Advanced MapReduce & Testing
V. Analytics using Pig
VI. Analytics using Hive
VII. NoSQL, HBase
VIII. ZooKeeper, Oozie, Mahout
IX. Apache Flume, Apache Spark
X. YARN, Big Data Sets & Project Assignment
TODAY'S CLASS
• What/why of Big Data?
• Why now?
• Example customers
• What is Hadoop?
• Hadoop components
• HDFS architecture
• NameNode
• Secondary NameNode
• Job Tracker
• File read/write
• Replication & rack awareness
• Further reading/assignment
WHAT IS BIG DATA?
Simply: Data of Very Big Size
Can’t process with usual tools
Distributed Architecture Needed
Structured / Unstructured
WHAT IS BIG DATA?
• Facebook: 500 TB/day
• Boeing 737: 240 TB/flight
• Clickstreams: ~1M events/sec
• Geospatial data, 3D data
• Audio & video, unstructured text
How many bytes in a petabyte?
How many bytes in a petabyte?
Kilo = 1024 Bytes = 1024^1 Bytes
Mega = 1024 KB = 1024^2 Bytes
Giga = 1024 MB = 1024^3 Bytes
Tera = 1024 GB = 1024^4 Bytes
Peta = 1024 TB = 1024^5 Bytes
Exa = 1024 PB = 1024^6 Bytes
Zeta = 1024 EB = 1024^7 Bytes
Yotta = 1024 ZB = 1024^8 Bytes
1 byte = 8 bits = can store 256 states
1 PB = 1024^5 bytes ≈ 1.1259 x 10^15 bytes
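The unit table above can be checked with a few lines of Python; every value is an exact power of 1024:

```python
# Each unit is 1024x the previous one, so 1 PB = 1024^5 bytes.
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for power, unit in enumerate(units, start=1):
    print(f"1 {unit} = 1024^{power} = {1024 ** power} bytes")

petabyte = 1024 ** 5
print(petabyte)  # 1125899906842624, i.e. ~1.1259 x 10^15
```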
WHY IS IT IMPORTANT NOW?
• Smart phones: 4.6 billion mobile phones; 1-2 billion people accessing the internet
• Connectivity - Internet of Things
• Connectivity - Social networks: Facebook has 1.06 bn monthly active users sharing 30 billion pieces of content monthly; ~175 million tweets every day
• Moore's Law is no longer keeping up: the data bus is the bottleneck
BIG DATA PROBLEM
To process & store data we need:
1. CPU speed
2. RAM speed & size
3. Disk size & speed
4. Network bandwidth
BIG DATA PROBLEM
To process & store data we need CPU, RAM, disk, and network, and at least one of these becomes the bottleneck.
BIG DATA PROBLEM - STORAGE
Q: If you have 100 TB of data, how would you store it?
A: Get 100 1 TB drives, mount them, and spread the data across 100 subfolders.
Challenges?
• What about failover and backups?
• How would you distribute the data uniformly?
• Is this the best value for money?
• Is this the best use of resources? We might already have hundreds of smaller drives.
• What about increasing accessibility?
Then?
Hadoop Distributed File System, or HDFS
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data?
(1 TB ≈ 10 billion names of 100 characters each; RAM = 2 GB)
A: Around 5 hours
Google, 8 Sept 2011: sorting 10 PB took 6.5 hours on 8,000 computers.
We need: 1. Faster sorting  2. Sorting of bigger data  3. More often
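As a rough sanity check on the "hours" claim, here is a back-of-envelope estimate in Python. The disk throughput (~100 MB/s) and the number of passes (an external merge sort that reads and writes the data twice, since only 2 GB of RAM is available) are assumptions for the sketch, not figures from the session:

```python
# Back-of-envelope: why sorting 1 TB on one machine takes hours.
# Assumed: ~100 MB/s sequential disk throughput; an external merge sort
# does 4 full scans of the data (read + write for run generation,
# read + write for the merge).
DATA_BYTES = 10**12          # 1 TB
DISK_MBPS = 100              # assumed sequential throughput
scan_seconds = DATA_BYTES / (DISK_MBPS * 10**6)
total_seconds = 4 * scan_seconds
print(f"one full scan: {scan_seconds / 3600:.1f} h")        # ~2.8 h
print(f"external sort (4 scans): {total_seconds / 3600:.1f} h")  # ~11.1 h
```

Even under generous assumptions, a single disk is I/O-bound for hours, which is the motivation for spreading both storage and computation across many machines.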
Break for 5 Minutes. Come back at 10:08am IST.
EXAMPLE BIG DATA CUSTOMERS
Web and e-tailing: • Recommendation Engines • Ad Targeting • Search Quality • Sentiment Analysis • Abuse and Click-Fraud Detection
Telecommunications: • Customer Churn Prevention • Network Performance Optimization • Call Data Record (CDR) Analysis • Analyzing the Network to Predict Failure
EXAMPLE BIG DATA CUSTOMERS
Government: • Fraud Detection • Cyber Security • Welfare • Justice
Healthcare & Life Sciences • Health information exchange • Gene sequencing • Healthcare improvements • Drug Safety
AND MANY MORE…kaggle.com
11 COMMON MYTHS ABOUT BIG DATA
• Always means data above or in the range of terabytes
• Is always about social media; doesn't apply to me
• Will replace the EDW
• Is just a buzzword with no practical applications
• Is a new concept
• Will be the future
• Is expensive
• Is only for data scientists, or is magic
• We have enough hardware; we don't need any more
• We will build it when we need it
• Big Data is about Hadoop
Q1: How is analysis of Big Data saving money for businesses? (Souvik)
A1:
• Examine our assumptions before building a product
• Evaluate sales/marketing strategies
• Retain existing customers and attract new ones
• A/B testing
• Historical data analysis works better than surveys
A2: What is the expense of "Big Data" handling?
• Cost: all the tools are free/open source
• Only cheap hardware is needed
Q2: How important is it to know Big Data for a fast-growing IT career?
Short answer: very important.
Number of jobs (indeed.com / naukri.com):
Hadoop: 1,102 / 2,312
Big Data: 1,255 / 659
Analyst reports:
Gartner Top 10, 2014: points #1, #4, #9, #10
Forbes top 10 tech trends: points #5, #6, #8, #9
WHAT IS HADOOP?
A. Cute little yellow toy elephant
B. Framework to handle Big Data
C. Open source (Apache)
D. Powerful, popular & supported
E. For reliable, scalable, distributed computing
F. Created by Doug Cutting (of Yahoo) and Mike Cafarella
G. Built for the Nutch search engine project
H. Written in Java
CORE OF HADOOP?
MapReduce engine: executes your logic on multiple computers in parallel.
Hadoop Distributed File System (HDFS): stores multiple copies of file parts on multiple machines.
COMPONENTS
• Workflow
• SQL Interface
• New Language
• Machine Learning / Stats
• NoSQL Database
• Compute Engine
• Main Component
CORE COMPONENTS
Machine1 Machine 2 Machine 3 Machine 4 Machine 5
CORE OF HADOOP?
HDFS - Hadoop Distributed File System (Storage)
1. Distributed across "nodes"
2. Fault-tolerant & high throughput
3. Low-cost hardware
4. NameNode tracks block locations
MapReduce (Processing)
1. Splits a task across processors/nodes, "near" the data, & assembles the results
2. Self-healing, high bandwidth
3. Clustered storage
4. JobTracker manages the TaskTrackers
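As an illustration of the storage side (not Hadoop's actual code), this Python sketch splits a file into fixed-size blocks and assigns each block to several DataNodes. The 128 MB block size and replication factor of 3 mirror common HDFS defaults; the node names and round-robin placement are made up for the example (real HDFS placement is rack-aware):

```python
# Sketch: how HDFS-style storage splits a file into blocks and
# assigns each block to multiple DataNodes.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # common default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical nodes

def split_into_blocks(file_size):
    """Return the size of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_replicas(num_blocks):
    """Round-robin placement; real HDFS placement is rack-aware."""
    ring = itertools.cycle(datanodes)
    return [[next(ring) for _ in range(REPLICATION)] for _ in range(num_blocks)]

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                # 3 blocks: 128 MB + 128 MB + 44 MB
print(place_replicas(len(blocks)))
```

Losing any single node still leaves two replicas of every block, which is what makes the cheap-hardware strategy workable.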
WHAT IS HDFS?
A distributed file system designed to run on commodity hardware.
Design goals:
1. Tolerate hardware failure
2. Streaming data access
3. Large data sets
4. Simple coherency model
5. Moving computation is cheaper than moving data
6. Portable (written in Java)
MAIN COMPONENTS OF HDFS?
NameNode:
1. Master of the system
2. Maintains and manages the blocks which are present on the DataNodes
DataNodes:
1. Slaves which are deployed on each machine and provide the actual storage
2. Responsible for serving read and write requests from the clients
ANATOMY OF A FILE WRITE
ANATOMY OF A FILE READ
NAMENODE METADATA
Metadata in memory:
1. The entire metadata is in main memory
2. No demand paging of FS metadata
Types of metadata:
1. List of files
2. List of blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor
A transaction log:
1. Records file creations, file deletions, etc.
NameNode (FileName, numReplicas, block-ids, …):
/users/sgiri/data/part-0, r:2, {1,3}, …
/users/sgiri/data/part-1, r:3, {2,4,5}, …
The NameNode keeps track of the overall file directory structure and the placement of each data block.
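The metadata above can be pictured as two in-memory maps. This Python sketch mirrors the slide's example entries (file name, replication factor, block ids); the DataNode names attached to each block are invented for illustration:

```python
# Sketch of the NameNode's in-memory view: one map from file path to
# (replication factor, block ids), one map from block id to the
# DataNodes holding a replica of that block.
file_metadata = {
    "/users/sgiri/data/part-0": {"replication": 2, "blocks": [1, 3]},
    "/users/sgiri/data/part-1": {"replication": 3, "blocks": [2, 4, 5]},
}
block_locations = {   # block id -> DataNodes (hypothetical names)
    1: ["dn1", "dn4"],
    3: ["dn2", "dn5"],
    2: ["dn1", "dn2", "dn3"],
    4: ["dn2", "dn4", "dn5"],
    5: ["dn1", "dn3", "dn5"],
}

def locate(path):
    """Answer a client's question: which DataNodes hold each block of path?"""
    meta = file_metadata[path]
    return [(b, block_locations[b]) for b in meta["blocks"]]

print(locate("/users/sgiri/data/part-0"))
```

Because both maps live entirely in the NameNode's RAM, lookups are fast, but the NameNode's memory also caps how many files the cluster can hold.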
JOB TRACKER
JOB TRACKER (DETAILED)
JOB TRACKER (CONT.)
JOB TRACKER (CONT.)
MAP REDUCE
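The MapReduce paradigm can be simulated in a few lines of single-process Python: map emits (word, 1) pairs, shuffle groups the pairs by key, and reduce sums each group. Real Hadoop runs these same phases in parallel across many TaskTrackers:

```python
# Minimal single-process simulation of MapReduce word count.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, value) pair per word.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The key property is that map calls are independent per record and reduce calls are independent per key, which is exactly what lets Hadoop split the work across nodes "near" the data.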
FURTHER READING
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
ASSIGNMENT / PRE-WORK
• Set up the Hadoop environment based on the virtual machines present in the LMS & cluster
• Access the cluster; details are in the LMS
• Finish the quiz in the LMS
• See the Assignment section in the LMS
www.KnowBigData.com
• Second class onwards: from 16 Aug, 7:00 am
• Every Saturday-Sunday, 3 hours
• 30 hours: 3 hr x 10 classes
• ₹19,995 ₹14,996 (25% off, incl. taxes)
• ₹1,880 per class
• Cluster for Hands-On Experiments
FULL COURSE
COLLECT YOUR CERTIFICATE
www.KnowBigData.com/dell
• Permanent URL/link • Crawl-able • Verifiable • Display on LinkedIn • Put in social profile