by: shrikant gawande · twitter example twitter has over 500 million registered users. the usa,...

By: Shrikant Gawande (Cloudera Certified )

Post on 25-Dec-2019

TRANSCRIPT

Page 1: By: Shrikant Gawande · Twitter Example Twitter has over 500 million registered users. The USA, whose 141.8 million accounts represents 27.4 percent of all Twitter users, good enough

By: Shrikant Gawande (Cloudera Certified )

Page 2:

What is Big Data?

For every 30 minutes of flying time, an airline jet collects 10 terabytes of sensor data.

The NYSE generates about one terabyte of new trade data per day, which is used for stock trading analytics to determine trends for optimal trades.

Page 3:

Facebook Example

Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network.

An average of 3.2 billion likes and comments are posted on Facebook every day.
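The "almost 20,000 years" figure can be checked with a quick conversion. The snippet below is only a back-of-the-envelope sketch; it assumes a 365-day year, and the function name is made up for illustration.

```python
# Assumption: a year is 365 days (leap years ignored).
MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600 minutes


def minutes_to_years(minutes: float) -> float:
    """Convert a duration in minutes to years."""
    return minutes / MINUTES_PER_YEAR


facebook_minutes = 10.5e9  # figure quoted on the slide
years_online = minutes_to_years(facebook_minutes)
print(f"{years_online:,.0f} years")
```

The result lands just under 20,000 years, which agrees with the slide.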

Page 4:

Twitter Example

Twitter has over 500 million registered users.

The USA has 141.8 million accounts, which represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia.

79% of US Twitter users are more likely to recommend brands they follow.

67% of US Twitter users are more likely to buy from brands they follow.

57% of all companies that use social media for business use Twitter.
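As a sanity check, the US account figure and its share should imply a total consistent with "over 500 million". A minimal sketch, using only the two numbers from the slide (the variable names are illustrative):

```python
us_accounts_millions = 141.8  # US accounts, from the slide
us_share = 0.274              # 27.4 percent of all users, from the slide

# If 141.8 million is 27.4% of the total, the total is accounts / share.
implied_total_millions = us_accounts_millions / us_share
print(f"Implied total: {implied_total_millions:.1f} million users")
```

The implied total is a little above 500 million, so the two statements on the slide are consistent with each other.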

Page 5:

Hadoop is being used across industries

Industries using Hadoop

Source: Karmasphere

Page 6:

Why Learn Big Data?

Page 7:

What Big Companies Have to Say

Page 8:

Data Volume Is Growing Exponentially

Estimated Global Data Volume:

2011: 1.8 ZB

2015: 7.9 ZB

The world's information doubles every two years.

Over the next 10 years:

The number of servers worldwide will grow by 10x.

The amount of information managed by enterprise data centers will grow by 50x.

The number of “files” enterprise data centers handle will grow by 75x.

Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study
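The two volume estimates and the "doubles every two years" rule can be compared directly. The sketch below assumes smooth exponential growth between 2011 and 2015, which is a simplification rather than anything stated in the source. The function names are illustrative.

```python
import math


def doubling_time_years(v_start: float, v_end: float, years: float) -> float:
    """Doubling time implied by exponential growth from v_start to v_end."""
    return years * math.log(2) / math.log(v_end / v_start)


def project(v_start: float, years: float, doubling_time: float) -> float:
    """Volume after `years` if it doubles every `doubling_time` years."""
    return v_start * 2 ** (years / doubling_time)


# Figures from the slide, in zettabytes.
implied = doubling_time_years(1.8, 7.9, 2015 - 2011)
projected_2015 = project(1.8, 2015 - 2011, 2.0)

print(f"Implied doubling time: {implied:.2f} years")
print(f"2015 volume at a strict 2-year doubling: {projected_2015:.1f} ZB")
```

A strict two-year doubling gives 7.2 ZB for 2015, so the quoted 7.9 ZB corresponds to a doubling time slightly under two years.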

Page 9:

IBM’s Definition

IBM’s definition: Big Data Characteristics

http://www-1.ibm.com/software/data/bigdata/

Big Data is a collection of large and complex data sets that are difficult to process using common database management tools or traditional data processing applications.

Big Data is an amount of data that is beyond the storage and processing capabilities of a single physical machine.

Data that has extra-large volume, comes from a variety of sources in a variety of formats, and arrives with great velocity is normally referred to as Big Data.

Page 10:

It is more unstructured data than structured data

Page 11:

A Traditional Approach Under Pressure

Page 12:

Why Big Data?

Enterprise data

ERP

CRM data (a few TBs)

Data we have been adding in the last 3-4 years

Customer experience

Click streams

Online campaigns

Banner ads: capturing every click (100s of TBs)

User-entered data

Search: in-product search

Social media: to understand general sentiment

Page 13:

Common Business Applications

Industry            Use Case                                     Types of Data
Financial Services  New Account Risk Screens                     Text, Server Logs
                    Trading Risk                                 Server Logs
                    Insurance Underwriting                       Geographic, Sensor, Text
Telecom             Call Detail Records (CDR)                    Machine, Geographic
                    Infrastructure Investment                    Machine, Server Logs
                    Real-Time Bandwidth Allocation               Server Logs, Text, Social
Retail              360 Degree View of Customer                  ClickStream, Text
                    Localized, Personal Promotion                Geographic
                    Website Optimization                         ClickStream
Manufacturing       Supply Chain and Logistics                   Sensor
                    Assembly Line Quality Assurance              Sensor
                    Crowd sourced Quality Assurance              Social
HealthCare          Use Genomic Data in Medical Trials           Structured
                    Monitor Patient Vitals in Real-Time          Sensor
Pharmaceuticals     Recruit and Retain Patients for Drug Trials  Social, Clickstream
                    Improve Prescription Adherence               Social, Unstructured, Geographic
Oil and Gas         Unify Exploration and Production Data        Sensor, Unstructured, Geographic
                    Monitor Rig Safety in Real Time              Sensor, Unstructured

Page 14:

How can we find products that customers are interested in BUT DON’T BUY?

Page 15:

Leveraging ALL Business Data

How to Extract Insights from 9 TB of Web Logs?

How do you make sense of this?

Page 16:

Leveraging ALL Business Data

How to Extract Insights from 9 TB of Web Logs?

What did users do when they came to our website?

Which products did they view?

Which products were viewed but not purchased, and why? Can we make new offerings based on past data?

In the first log line, the user has viewed a product with a particular ID.

Page 17:

Leveraging ALL Business Data

How to Extract Insights from 9 TB of Web Logs? (Contd.)

The visitor views a second product. We want to do this not just for one customer but for all customers.
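The question "which products were viewed but not purchased?" can be sketched on a tiny sample. The log format below (`<visitor_id> <action> <product_id>`) is hypothetical and far simpler than real web server logs, and the function name is invented for illustration. At the 9 TB scale discussed here, the same per-visitor logic would have to run in parallel across a cluster rather than in one process.

```python
from collections import defaultdict

# Hypothetical, simplified log lines: "<visitor_id> <action> <product_id>".
SAMPLE_LOG = """\
v1 VIEW p100
v1 VIEW p200
v1 BUY p200
v2 VIEW p100
v2 VIEW p300
v2 BUY p300
"""


def viewed_not_purchased(lines):
    """Return, per visitor, the set of products viewed but never bought."""
    viewed = defaultdict(set)
    bought = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        visitor, action, product = parts
        if action == "VIEW":
            viewed[visitor].add(product)
        elif action == "BUY":
            bought[visitor].add(product)
    return {visitor: viewed[visitor] - bought[visitor] for visitor in viewed}


result = viewed_not_purchased(SAMPLE_LOG.splitlines())
for visitor, products in sorted(result.items()):
    print(visitor, sorted(products))
```

In this sample both visitors looked at product `p100` without buying it, which is the kind of signal the slides describe.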

Page 18:

Hidden Treasure

Insight into data can provide Business Advantage.

Some key early indicators can mean fortunes to a business.

More Precise Analysis with more data

New offerings to the customer

Page 19:

Limitations of Existing Data Analytics Architecture

Page 20:

Solution: A Combined Storage and Compute Layer

Page 21:

Differentiating factors

Page 22:

Some of the Hadoop Users

Page 23:

Why DFS?

Page 24:

What is Hadoop?

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.

Page 25:

Hadoop Key Characteristics

Page 26:

Hadoop History

Page 27:

Hadoop Eco-System

Page 28:

Hadoop Core Components

HDFS: Hadoop Distributed File System (Storage)

Self-healing, high-bandwidth clustered storage

Distributed across “nodes”

Natively redundant

NameNode tracks block locations

MapReduce (Processing)

Splits a task across processors “near” the data and assembles the results
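The "split a task, then assemble results" idea behind MapReduce can be illustrated with a single-process simulation. This is not the Hadoop API: the function names and sample data are made up, and each inner list merely stands in for a block of data that a real cluster would process on a separate node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each inner list stands in for one input split (assumed sample data).
splits = [["big data big"], ["data hadoop"]]

mapped = chain.from_iterable(
    map_phase(record) for split in splits for record in split
)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The map step only needs the records in its own split, which is why it can run close to where the data is stored. Only the grouped intermediate pairs have to move between machines.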

Page 29:

Hadoop Core Components (contd.)

Page 30:

HDFS Architecture

Page 31:

Main Components of HDFS

NameNode

master of the system

maintains and manages the blocks which are present on the DataNodes

DataNodes

slaves which are deployed on each machine and provide the actual storage

responsible for serving read and write requests for the clients
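To make the master/slave split concrete, the sketch below cuts a file into fixed-size blocks and assigns each block to several DataNodes. The block size, replication factor, node names and the round-robin placement are all assumptions chosen for illustration. Real HDFS uses its own configurable values and a more sophisticated placement policy.

```python
def split_into_blocks(file_size: int, block_size: int) -> list:
    """Return the sizes of the blocks a file is cut into (last may be smaller)."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks: int, datanodes: list, replication: int) -> dict:
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB, 64 * MB)  # assumed file and block sizes
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)

for block, nodes in placement.items():
    print(f"block {block} ({blocks[block] // MB} MB) -> {nodes}")
```

In this picture the mapping returned by `place_blocks` is what the NameNode would keep track of, while the DataNodes hold the block contents and serve them to clients.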

Page 32:

NameNode and Datanode

Page 33:

NameNode Meta Data

Meta-data in Memory

• The entire metadata is in main memory

• No demand paging of FS meta-data

Types of Metadata

• List of files

• List of Blocks for each file

• List of DataNodes for each block

• File attributes, e.g. access time, replication factor

A Transaction Log

• Records file creations, file deletions, etc.
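The metadata types listed above map naturally onto a few in-memory dictionaries plus an append-only log. The class below is a toy model under that assumption. `ToyNameNode`, `FileEntry` and their methods are invented names, not part of Hadoop, and persistence and failure handling are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    blocks: list = field(default_factory=list)  # ordered block ids for the file
    replication: int = 3                        # example file attribute


class ToyNameNode:
    """Keeps all metadata in memory and records changes in a transaction log."""

    def __init__(self):
        self.files = {}            # path -> FileEntry (files and their blocks)
        self.block_locations = {}  # block id -> list of DataNode names
        self.edit_log = []         # transaction log of namespace changes

    def create(self, path, block_ids, replication=3):
        self.files[path] = FileEntry(list(block_ids), replication)
        self.edit_log.append(("CREATE", path))

    def report_block(self, block_id, datanode):
        self.block_locations.setdefault(block_id, []).append(datanode)

    def locate(self, path):
        entry = self.files[path]
        return [(b, self.block_locations.get(b, [])) for b in entry.blocks]

    def delete(self, path):
        entry = self.files.pop(path)
        for block_id in entry.blocks:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("DELETE", path))


nn = ToyNameNode()
nn.create("/logs/a.log", ["blk_1", "blk_2"])
nn.report_block("blk_1", "dn1")
nn.report_block("blk_1", "dn2")
nn.report_block("blk_2", "dn3")
located = nn.locate("/logs/a.log")
nn.delete("/logs/a.log")
print(located)
print(nn.edit_log)
```

Because every lookup is a dictionary access in main memory, answering "where are the blocks of this file?" is fast. The cost is that the amount of metadata is bounded by the memory of a single machine.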

Page 34:

Storage: Name-Node and Data-Nodes

Processing: Job-Tracker and Task-Trackers

H1 H2 H3 H4

Page 35:

Poll - 01

Page 36:

Poll - 02

Page 37:

Poll - 03

Page 38:

Poll - 04

Page 39:

Poll - 05

Page 40:

Page 41:

Hadoop courses and their fees across major training institutes…

Page 42:

Hadoop Course fee at Cloudera

Cloudera Hadoop Training:

Page 43:

Hadoop Course fee at HortonWorks and Edureka

$ 2,795 = Rs. 1,73,290

Page 44:

My Contact Details:

Page 45:

Thank You …