big data and hadoop - introduction, architecture - hdfs and mapreduce, ecosystem

56
Sandeep Giri Hadoop Session 1 - Introduction Welcome to Big Data & Hadoop Session [email protected] +91-9538998962 Use Q&A to Communicate With Instructor

Upload: knowbigdata

Post on 10-May-2015

1.658 views

Category:

Technology


1 download

DESCRIPTION

This is the first introductory session on Big Data & Hadoop. It clears a lot of myths and confusion about Big Data. How exactly Big Data handling is a challenge and how does Hadoop addresses those problems. This is typically for a 3 hours session. The first half of the session talks about Big Data and related challenges and how does Hadoop solves those challenges. Second half of the presentation talks about the Hadoop components stack, HDFS architecture, MapReduce Paradigm and Architecture. By the end of this presentation you should be fairly clear about Hadoop and Big Data. To watch the video or know more about the course, please visit http://www.knowbigdata.com/page/big-data-and-hadoop

TRANSCRIPT

Page 1: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Session 1 - Introduction

Welcome to Big Data & Hadoop

Session

[email protected]+91-9538998962

Use Q&A to Communicate With Instructor

Page 2: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Interact - Ask Questions

Lifetime access of content

Class Recording

Cluster Access

24x7 support

WELCOME - KNOWBIGDATA

Real Life Project

Quizzes & Certification Test

10 x (3hr class)

Socio-Pro Visibility

Mock Interviews

Page 3: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

ABOUT ME2014 KnowBigData Founded2014

AmazonBuilt High Throughput Systems for Amazon.com site using in-house NoSql.

20122012 InMobi Built Recommender after churning 200 TB2011

tBits GlobalFounded tBits Global Built an enterprise grade Document Management System

2006

D.E.ShawBuilt the big data systems before the term was coined

20022002 IIT Roorkee Finished B.Tech somehow.

Page 4: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

COURSE CONTENTCOURSE CONTENTI Understanding BigData, Hadoop Architecture

II Cluster Setup, ETL, Project Environment

III MapReduce framework

IV Adv MapReduce & Testing

V Analytics using Pig

VI Analytics using Hive

VII NoSQL, HBASE

VIII Zookeeper, Oozie, Mahout

IX Apache Flume, Apache Spark

X YARN, Big Data Sets & Project Assignment

Page 5: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

What/why of Big Data?

Why Now?

Examples Customers

What is Hadoop?

Components Hadoop

HDFS Architecture

TODAY’S CLASS

NameNode

Secondary NameNode

Job Tracker

File Read/Write

Replication & Rack Awareness

Further Reading/Assignment

Page 6: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHAT IS BIG DATA?

Page 7: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHAT IS BIG DATA?

Simply: Data of Very Big Size

Can’t process with usual tools

Distributed Architecture Needed

Structured / Unstructured

Page 8: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHAT IS BIG DATA?

Facebook: 500TB /day Boeing737: 240 TB / flight

. Clickstreams: ~ 1m events / sec

Geospatial data 3D data

audio & video Unstructured text

Page 9: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

How many bytes in a petabyte?

Page 10: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

How many bytes in petabytes?

1.1259x10^15

Page 11: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

How many bytes in petabytes?

Kilo 1024 Bytes 1024 Bytes

Mega 1024 KB 1024 Bytes

Giga 1024 MB 1024 Bytes

Tera 1024 GB 1024^4 Bytes

Peta 1024 Tera 1024 Bytes

Exa 1024 Peta 1024 Bytes

Zeta 1024 Exa 1024 Bytes

Yotta 1024 Zeta 1024 Bytes

1 byte = 8 bit = can store 256 states

1.1259x10^15

Page 12: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHY BIG DATA

Page 13: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHY IS IT IMPORTANT NOW?

Smart Phones

4.6 billion mobile-phones. 1 - 2 billion people accessing the internet.

Facebook:1.06 bn monthly active users, 30 billion pieces shared monthly.

~175 million tweets every day

Connectivity: Internet Of Things

Connectivity: Social Networks

Moores Law has failed. data-bus is the bottle neck

Page 14: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEMTo process & store data

we need

3. Disk Size + Speed2. RAM - Speed & Size

1. CPU Speed4. Network

Page 15: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEMTo process & store data

we need

1. Disk Size + Speed3.RAM - Speed & Size

2. Speed

And at least one of these become bottle neck

Page 16: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - STORAGEQ: If you have 100TB data, How would you store it?

Page 17: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - STORAGEQ: If you have 100TB data, How would you store it?

A: Have 100 1TB drives and make 100 subfolders mount these.

Challenges?

Page 18: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - STORAGEQ: If you have 100TB data, How would you store it?

A: Have 100 1TB drives and make 100 subfolders mount these.

• What about fail overs & Backups? • How would distribute the data uniformly? • Is this best value for money? • is this best use of resources? We might have hundreds of smaller drives already. • What about Increasing accessibility?

Then?

Challenges?

Page 19: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - STORAGEQ: If you have 100TB data, How would you store it?

A: Have 100 1TB drives and make 100 subfolders mount these.

• What about failovers & Backups? • How would distribute the data uniformly? • Is this best value for money? • is this best use of resources? We might have hundreds of smaller drives already. • What about Increasing accesibility?

Then?

Challenges?

Hadoop Distributed File System or HDFS

Page 20: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - PROCESSINGQ: How fast can 1GHz processor to sort the 1TB data?

1TB == 10 billion names having 100 characters each

RAM = 2GB

Page 21: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - PROCESSINGQ: How fast can 1GHz processor to sort the 1TB data?A: Around 5 hours

Page 22: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

BIG DATA PROBLEM - PROCESSINGQ: How fast can 1GHz processor to sort the 1TB data?A: Around 5 hours

Google, 8 Sept, 2011: Sorting10PB took 6.5 hrs on 8000 computers

We need 1. Faster Sort 2. Bigger Data Sorting 3. More often

Page 23: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Components

Break for 5 Minutes. Come back at 10:08am IST.

Page 24: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

EXAMPLE BIG DATA CUSTOMERSWeb and e-tailing • Recommendation Engines • Ad Targeting • Search Quality • Sentiment Analyses • Abuse and Click Fraud Detection

Telecommunications • Customer Churn Prevention • Network Performance Optimization • Calling Data Record (CDR) Analysis • Analyzing Network to Predict Failure

Page 25: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

EXAMPLE BIG DATA CUSTOMERS

Government • Fraud Detection • Cyber Security Welfare • Justice

Healthcare & Life Sciences • Health information exchange • Gene sequencing • Healthcare improvements • Drug Safety

Page 26: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

AND MANY MORE…kaggle.com

Page 27: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

11 COMMON MYTHSBIG DATA • Always means data above or in range of TB • Is always about social media. Doesn't apply to me. • Will replace EDW • Is just a buzz word. No Practical Applications

!• Is New Concept • Will be future. • Is Expensive • Is only for data scientists. Or is magic. • We have enough hardware. Don't need any more. • We will build it when we need it. • Big Data is about Hadoop.

Page 28: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Q1:How is analysis of Big Data saving money for businesses? —Souvik

A1:How is analysis of Data saving money for business? • Examine our assumptions before building product. • Evaluate sale / marketing strategies • Retain / Attain existing customers • A/B Testing • Historical Data Analysis better than Surveys

A2:What is the expense of "Big Data" handling? • Cost: All tools are free/open. • Cheap H/W needed.

Page 29: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Q2:How important it is to know big data to make a fast growing IT career?

Short Answer: Very Important

Number of Jobsindeed.com naukri.com

Hadoop 1,102 2312

Big Data 1,255 659

Analyst Reports Gartner Top 10, 2014: Point #1, #4, #9, #10,

!Forbes top 10 tech trends: Point #5, #6, #8, #9

Page 30: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHAT IS HADOOP?

A. Cute Little Yellow Toy Elephant B. Framework to handle Big Data C. Open Source - Apache D. Power, Popular & Supported E. For reliable, scalable, distributed computing F. Created by Doug Cutting (of Yahoo) and Mike Cafarella G. Built for Nutch search engine project H. Written in Java

Page 31: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

CORE OF HADOOP?

MapReduce engine Execute your logic on multiple computers in parallel.

Hadoop Distributed File System (HDFS) Stores multiple copies of file parts on multiple machines.

Page 32: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Components

Main Component

Page 33: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Components

NoSQL Database

Compute Engine

Page 34: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

ComponentsWorkflow

SQL Inteface

New Language

Machine learning /

STATS

Page 35: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

ComponentsWorkflow

SQL Inteface

New Language

Machine learning /

STATS

NoSQL Database

Compute Engine

Main Component

Page 36: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

CORE COMPONENTS

Machine1 Machine 2 Machine 3 Machine 4 Machine 5

Page 37: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

CORE OF HADOOP?HDFS – Hadoop Distributed File System (Storage) 1. Distributed across “nodes” 2. Fault Tolerant & High Throughput 3. Low Cost Hardware 4. NameNode tracks locations.

MapReduce (Processing) 1. Splits a task across processors / nodes 2. “near” the data & assembles results 3. Self-Healing, High Bandwidth 4. Clustered storage 5. JobTracker manages the TaskTrackers

Page 38: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

WHAT IS HDFS?Distributed file system to run on commodity hardware

Design Goals 1. Hardware Failure 2. Streaming Data Access 3. Large Data Sets 4. Simple Coherency Model 5. Moving Code is Cheaper than Data 6. Portable - Java

Page 39: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Page 40: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

MAIN COMPONENTS OF HDFS?

NameNode: 1. Master of the system 2. Maintains and manages the blocks which are present on the

DataNodes

DataNodes: 1. Slaves which are deployed on each machine and provide

the actual storage 2. Responsible for serving read and write requests for the

clients

Page 41: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

ANATOMY OF A FILE WRITE

Page 42: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

ANATOMY OF A FILE READ

Page 43: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

NAMENODE METADATA

Meta-data in Memory 1. The entire metadata is in main memory. 2. No demand paging of FS meta-data !Types of Metadata 1. List of files 2. List of Blocks for each file 3. List of DataNode for each block 4. File attributes, e.g. access time, replication

factor !A Transaction Log 1. Records file creations, file deletions. etc

NameNode (File Name, numReplicas, block-ids,…)

/users/sgiri/data/part-0,r:2, {1,3},… /users/sgiri/data/part-1,r:3, {2,4,5},…

NameNode Keeps track of overall file directory structure and the placement of Data

Block

Page 44: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

JOB TRACKER

Page 45: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

CORE COMPONENTS

Machine1 Machine 2 Machine 3 Machine 4 Machine 5

Page 46: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

JOB TRACKER (DETAILED)

Page 47: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

JOB TRACKER (CONT.)

Page 48: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

JOB TRACKER (CONT.)

Page 49: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

MAP REDUCE

Page 50: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

FURTHER READING

http://en.wikipedia.org/wiki/Apache_Hadoop !http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Page 51: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

ASSIGNMENT / PRE-WORK

• Setup The Hadoop Environment based on the Virtual machines Present in LMS & cluster

• Access the cluster. Details are in LMS. • Finish the quiz from LMS • See Assignment section on LMS

Page 52: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

www.KnowBigData.com

• Second Class Onwards from 16 Aug 7:00am • Every Saturday-Sunday - 3 hours • 30 hours - 3 hr x 10 classes • ₹ 19995 ₹ 14996 (25% off) (Incl. Taxes)

• ₹ 1880/- per class

• Cluster for Hands-On Experiments

FULL COURSE

[email protected]

Page 53: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Thank you.

Big Data & Hadoop

[email protected]+91-9538998962

Page 54: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

COLLECT YOUR CERTIFICATEwww.KnowBigData.com/dell

• Permanent URL / Link • Crawl-able • Verifiable • Display on LinkedIn • Put in Social Profile

Page 55: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Session 1 - Introduction

Big Data & Hadoop

[email protected]+91-9538998962

Page 56: Big Data and Hadoop - Introduction, Architecture - HDFS and MapReduce, Ecosystem

Sandeep GiriHadoop

Session 1 - Introduction

Big Data & Hadoop

[email protected]+91-9538998962