Big Data and Hadoop - Introduction, Architecture (HDFS and MapReduce), Ecosystem
DESCRIPTION
This is the first, introductory session on Big Data & Hadoop. It clears up many myths and confusions about Big Data: how exactly handling Big Data is a challenge, and how Hadoop addresses those problems. It is designed as a 3-hour session. The first half talks about Big Data and the related challenges, and how Hadoop solves them. The second half covers the Hadoop component stack, HDFS architecture, and the MapReduce paradigm and architecture. By the end of this presentation you should be fairly clear about Hadoop and Big Data. To watch the video or learn more about the course, please visit http://www.knowbigdata.com/page/big-data-and-hadoop
TRANSCRIPT
Sandeep Giri - Hadoop
Session 1 - Introduction
Welcome to the Big Data & Hadoop Session
[email protected] | +91-9538998962
Use Q&A to Communicate With Instructor
Interact - Ask Questions
WELCOME - KNOWBIGDATA
• Lifetime access to content
• Class recordings
• Cluster access
• 24x7 support
• Real-life project
• Quizzes & certification test
• 10 x (3-hour classes)
• Socio-pro visibility
• Mock interviews
ABOUT ME
2014 - KnowBigData: Founded KnowBigData
2012 - Amazon: Built high-throughput systems for the Amazon.com site using in-house NoSQL
2011 - InMobi: Built a recommender after churning 200 TB of data
2006 - tBits Global: Founded tBits Global; built an enterprise-grade document management system
2002 - D.E. Shaw: Built big data systems before the term was coined
2002 - IIT Roorkee: Finished B.Tech somehow
COURSE CONTENT
I. Understanding Big Data, Hadoop Architecture
II. Cluster Setup, ETL, Project Environment
III. MapReduce Framework
IV. Advanced MapReduce & Testing
V. Analytics using Pig
VI. Analytics using Hive
VII. NoSQL, HBase
VIII. ZooKeeper, Oozie, Mahout
IX. Apache Flume, Apache Spark
X. YARN, Big Data Sets & Project Assignment
TODAY'S CLASS
• What/why of Big Data?
• Why now?
• Example customers
• What is Hadoop?
• Hadoop components
• HDFS architecture
• NameNode
• Secondary NameNode
• Job Tracker
• File read/write
• Replication & rack awareness
• Further reading/assignment
WHAT IS BIG DATA?
Simply: Data of Very Big Size
Can’t process with usual tools
Distributed Architecture Needed
Structured / Unstructured
WHAT IS BIG DATA?
• Facebook: 500 TB/day
• Boeing 737: 240 TB/flight
• Clickstreams: ~1M events/sec
• Geospatial data, 3D data
• Audio & video, unstructured text
How many bytes in a petabyte?
How many bytes in a petabyte?
Kilo = 1024 Bytes = 1024^1 Bytes
Mega = 1024 KB = 1024^2 Bytes
Giga = 1024 MB = 1024^3 Bytes
Tera = 1024 GB = 1024^4 Bytes
Peta = 1024 TB = 1024^5 Bytes
Exa = 1024 PB = 1024^6 Bytes
Zeta = 1024 EB = 1024^7 Bytes
Yotta = 1024 ZB = 1024^8 Bytes
1 byte = 8 bits = can store 256 states
1 PB = 1024^5 bytes ≈ 1.1259 x 10^15 bytes
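The unit table above can be checked with a few lines of Python; every value is an exact power of 1024:

```python
# Each unit is 1024x the previous one, so 1 PB = 1024^5 bytes.
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for power, unit in enumerate(units, start=1):
    print(f"1 {unit} = 1024^{power} = {1024 ** power} bytes")

petabyte = 1024 ** 5
print(petabyte)  # 1125899906842624, i.e. ~1.1259 x 10^15
```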
WHY IS IT IMPORTANT NOW?
• Smart phones: 4.6 billion mobile phones; 1-2 billion people accessing the internet
• Connectivity - Internet of Things
• Connectivity - Social networks: Facebook has 1.06 bn monthly active users sharing 30 billion pieces of content monthly; ~175 million tweets every day
• Moore's Law is no longer keeping up: the data bus is the bottleneck
BIG DATA PROBLEM
To process & store data we need:
1. CPU speed
2. RAM speed & size
3. Disk size & speed
4. Network bandwidth
BIG DATA PROBLEM
To process & store data we need CPU, RAM, disk, and network, and at least one of these becomes the bottleneck.
BIG DATA PROBLEM - STORAGE
Q: If you have 100 TB of data, how would you store it?
A: Get 100 1 TB drives, mount them, and spread the data across 100 subfolders.
Challenges?
• What about failover and backups?
• How would you distribute the data uniformly?
• Is this the best value for money?
• Is this the best use of resources? We might already have hundreds of smaller drives.
• What about increasing accessibility?
Then?
Hadoop Distributed File System, or HDFS
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data?
(1 TB ≈ 10 billion names of 100 characters each; RAM = 2 GB)
A: Around 5 hours
Google, 8 Sept 2011: sorting 10 PB took 6.5 hours on 8,000 computers.
We need: 1. Faster sorting  2. Sorting of bigger data  3. More often
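As a rough sanity check on the "hours" claim, here is a back-of-envelope estimate in Python. The disk throughput (~100 MB/s) and the number of passes (an external merge sort that reads and writes the data twice, since only 2 GB of RAM is available) are assumptions for the sketch, not figures from the session:

```python
# Back-of-envelope: why sorting 1 TB on one machine takes hours.
# Assumed: ~100 MB/s sequential disk throughput; an external merge sort
# does 4 full scans of the data (read + write for run generation,
# read + write for the merge).
DATA_BYTES = 10**12          # 1 TB
DISK_MBPS = 100              # assumed sequential throughput
scan_seconds = DATA_BYTES / (DISK_MBPS * 10**6)
total_seconds = 4 * scan_seconds
print(f"one full scan: {scan_seconds / 3600:.1f} h")        # ~2.8 h
print(f"external sort (4 scans): {total_seconds / 3600:.1f} h")  # ~11.1 h
```

Even under generous assumptions, a single disk is I/O-bound for hours, which is the motivation for spreading both storage and computation across many machines.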
Break for 5 Minutes. Come back at 10:08am IST.
EXAMPLE BIG DATA CUSTOMERS
Web and e-tailing: • Recommendation Engines • Ad Targeting • Search Quality • Sentiment Analysis • Abuse and Click-Fraud Detection
Telecommunications: • Customer Churn Prevention • Network Performance Optimization • Call Data Record (CDR) Analysis • Analyzing the Network to Predict Failure
EXAMPLE BIG DATA CUSTOMERS
Government: • Fraud Detection • Cyber Security • Welfare • Justice
Healthcare & Life Sciences • Health information exchange • Gene sequencing • Healthcare improvements • Drug Safety
AND MANY MORE…kaggle.com
11 COMMON MYTHS ABOUT BIG DATA
• Always means data above or in the range of terabytes
• Is always about social media; doesn't apply to me
• Will replace the EDW
• Is just a buzzword with no practical applications
• Is a new concept
• Will be the future
• Is expensive
• Is only for data scientists, or is magic
• We have enough hardware; we don't need any more
• We will build it when we need it
• Big Data is about Hadoop
Q1: How is analysis of Big Data saving money for businesses? (Souvik)
A1:
• Examine our assumptions before building a product
• Evaluate sales/marketing strategies
• Retain existing customers and attract new ones
• A/B testing
• Historical data analysis works better than surveys
A2: What is the expense of "Big Data" handling?
• Cost: all the tools are free/open source
• Only cheap hardware is needed
Q2: How important is it to know Big Data for a fast-growing IT career?
Short answer: very important.
Number of jobs (indeed.com / naukri.com):
Hadoop: 1,102 / 2,312
Big Data: 1,255 / 659
Analyst reports:
Gartner Top 10, 2014: points #1, #4, #9, #10
Forbes top 10 tech trends: points #5, #6, #8, #9
WHAT IS HADOOP?
A. Cute little yellow toy elephant
B. Framework to handle Big Data
C. Open source (Apache)
D. Powerful, popular & supported
E. For reliable, scalable, distributed computing
F. Created by Doug Cutting (of Yahoo) and Mike Cafarella
G. Built for the Nutch search engine project
H. Written in Java
CORE OF HADOOP?
MapReduce engine: executes your logic on multiple computers in parallel.
Hadoop Distributed File System (HDFS): stores multiple copies of file parts on multiple machines.
COMPONENTS
• Workflow
• SQL Interface
• New Language
• Machine Learning / Stats
• NoSQL Database
• Compute Engine
• Main Component
CORE COMPONENTS
Machine1 Machine 2 Machine 3 Machine 4 Machine 5
CORE OF HADOOP?
HDFS - Hadoop Distributed File System (Storage)
1. Distributed across "nodes"
2. Fault-tolerant & high throughput
3. Low-cost hardware
4. NameNode tracks block locations
MapReduce (Processing)
1. Splits a task across processors/nodes, "near" the data, & assembles the results
2. Self-healing, high bandwidth
3. Clustered storage
4. JobTracker manages the TaskTrackers
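As an illustration of the storage side (not Hadoop's actual code), this Python sketch splits a file into fixed-size blocks and assigns each block to several DataNodes. The 128 MB block size and replication factor of 3 mirror common HDFS defaults; the node names and round-robin placement are made up for the example (real HDFS placement is rack-aware):

```python
# Sketch: how HDFS-style storage splits a file into blocks and
# assigns each block to multiple DataNodes.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default
REPLICATION = 3                  # common default replication factor
datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical nodes

def split_into_blocks(file_size):
    """Return the size of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_replicas(num_blocks):
    """Round-robin placement; real HDFS placement is rack-aware."""
    ring = itertools.cycle(datanodes)
    return [[next(ring) for _ in range(REPLICATION)] for _ in range(num_blocks)]

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                # 3 blocks: 128 MB + 128 MB + 44 MB
print(place_replicas(len(blocks)))
```

Losing any single node still leaves two replicas of every block, which is what makes the cheap-hardware strategy workable.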
WHAT IS HDFS?
A distributed file system designed to run on commodity hardware.
Design goals:
1. Tolerate hardware failure
2. Streaming data access
3. Large data sets
4. Simple coherency model
5. Moving computation is cheaper than moving data
6. Portable (written in Java)
MAIN COMPONENTS OF HDFS?
NameNode:
1. Master of the system
2. Maintains and manages the blocks which are present on the DataNodes
DataNodes:
1. Slaves which are deployed on each machine and provide the actual storage
2. Responsible for serving read and write requests from the clients
ANATOMY OF A FILE WRITE
ANATOMY OF A FILE READ
NAMENODE METADATA
Metadata in memory:
1. The entire metadata is in main memory
2. No demand paging of FS metadata
Types of metadata:
1. List of files
2. List of blocks for each file
3. List of DataNodes for each block
4. File attributes, e.g. access time, replication factor
A transaction log:
1. Records file creations, file deletions, etc.
NameNode (FileName, numReplicas, block-ids, …):
/users/sgiri/data/part-0, r:2, {1,3}, …
/users/sgiri/data/part-1, r:3, {2,4,5}, …
The NameNode keeps track of the overall file directory structure and the placement of each data block.
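The metadata above can be pictured as two in-memory maps. This Python sketch mirrors the slide's example entries (file name, replication factor, block ids); the DataNode names attached to each block are invented for illustration:

```python
# Sketch of the NameNode's in-memory view: one map from file path to
# (replication factor, block ids), one map from block id to the
# DataNodes holding a replica of that block.
file_metadata = {
    "/users/sgiri/data/part-0": {"replication": 2, "blocks": [1, 3]},
    "/users/sgiri/data/part-1": {"replication": 3, "blocks": [2, 4, 5]},
}
block_locations = {   # block id -> DataNodes (hypothetical names)
    1: ["dn1", "dn4"],
    3: ["dn2", "dn5"],
    2: ["dn1", "dn2", "dn3"],
    4: ["dn2", "dn4", "dn5"],
    5: ["dn1", "dn3", "dn5"],
}

def locate(path):
    """Answer a client's question: which DataNodes hold each block of path?"""
    meta = file_metadata[path]
    return [(b, block_locations[b]) for b in meta["blocks"]]

print(locate("/users/sgiri/data/part-0"))
```

Because both maps live entirely in the NameNode's RAM, lookups are fast, but the NameNode's memory also caps how many files the cluster can hold.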
JOB TRACKER
JOB TRACKER (DETAILED)
JOB TRACKER (CONT.)
JOB TRACKER (CONT.)
MAP REDUCE
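The MapReduce paradigm can be simulated in a few lines of single-process Python: map emits (word, 1) pairs, shuffle groups the pairs by key, and reduce sums each group. Real Hadoop runs these same phases in parallel across many TaskTrackers:

```python
# Minimal single-process simulation of MapReduce word count.
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, value) pair per word.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The key property is that map calls are independent per record and reduce calls are independent per key, which is exactly what lets Hadoop split the work across nodes "near" the data.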
FURTHER READING
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
ASSIGNMENT / PRE-WORK
• Set up the Hadoop environment based on the virtual machines present in the LMS & cluster
• Access the cluster; details are in the LMS
• Finish the quiz in the LMS
• See the Assignment section in the LMS
www.KnowBigData.com
• Second class onwards: from 16 Aug, 7:00 am
• Every Saturday-Sunday, 3 hours
• 30 hours: 3 hr x 10 classes
• ₹19,995 ₹14,996 (25% off, incl. taxes)
• ₹1,880 per class
• Cluster for Hands-On Experiments
FULL COURSE
COLLECT YOUR CERTIFICATE
www.KnowBigData.com/dell
• Permanent URL/link • Crawl-able • Verifiable • Display on LinkedIn • Put in social profile