Introduction to Big Data & Hadoop



Page 1: Introduction to Big Data & Hadoop


Introduction to Big Data and Hadoop

Page 2: Introduction to Big Data & Hadoop


Objectives

At the end of this session, you will understand:

Big Data Introduction

Use Cases of Big Data in Multiple Industry Verticals

Hadoop and Its Eco-System

Hadoop Architecture

Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists

Page 3: Introduction to Big Data & Hadoop


Unstructured Data is Exploding

Source: Twitter

Page 4: Introduction to Big Data & Hadoop


IBM’s Definition of Big Data

IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

Page 5: Introduction to Big Data & Hadoop


Annie’s Introduction

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.

Page 6: Introduction to Big Data & Hadoop


Annie’s Question

Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)

Page 7: Introduction to Big Data & Hadoop


Annie’s Answer

Ans.
» XML files, e-mail body – Semi-structured data
» Audio, Video, Images, Archived documents – Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) – Structured data

Page 8: Introduction to Big Data & Hadoop


Further Reading

More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Page 9: Introduction to Big Data & Hadoop


Common Big Data Customer Scenarios

Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

Telecommunications

» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Page 10: Introduction to Big Data & Hadoop


Common Big Data Customer Scenarios (Contd.)

Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice

Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Page 11: Introduction to Big Data & Hadoop


Common Big Data Customer Scenarios (Contd.)

Banks and Financial services

» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

Retail

» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Page 12: Introduction to Big Data & Hadoop


Why DFS?

Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel 100 MB/s
» 10 Machines: 4 I/O channels each, each channel 100 MB/s

Page 13: Introduction to Big Data & Hadoop


Why DFS? (Contd.)

Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel 100 MB/s

Page 14: Introduction to Big Data & Hadoop


Why DFS? (Contd.)

Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel 100 MB/s → 4.3 minutes
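As a quick back-of-the-envelope check of the figures above (assuming 1 TB = 2^20 MB and that all channels read in parallel at full speed):

1 machine:  4 channels x 100 MB/s = 400 MB/s, so 1,048,576 MB / 400 MB/s ≈ 2,621 s ≈ 43 minutes
10 machines: 10 x 400 MB/s = 4,000 MB/s, so 1,048,576 MB / 4,000 MB/s ≈ 262 s ≈ 4.3 minutes

This is exactly the motivation for a Distributed File System: aggregate read throughput scales with the number of machines reading in parallel.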

Page 15: Introduction to Big Data & Hadoop


Page 16: Introduction to Big Data & Hadoop


Hadoop Cluster: A Typical Use Case

Active NameNode / StandBy NameNode
» RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

Secondary NameNode
» RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

DataNode
» RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS

Page 17: Introduction to Big Data & Hadoop


Hidden Treasure

Insight into data can provide Business Advantage.

Some key early indicators can mean Fortunes to Business.

More Precise Analysis with more data.

*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.

Case Study: Sears Holding Corporation

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Page 18: Introduction to Big Data & Hadoop


Limitations of Existing Data Analytics Architecture

[Diagram: Instrumentation → Collection → Storage-only Grid (Original Raw Data; storage) and ETL Compute Grid (processing) → RDBMS (Aggregated Data, mostly append) → BI Reports + Interactive Apps]

» A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived
» 1. Can't explore original high-fidelity raw data
» 2. Moving data to compute doesn't scale
» 3. Premature data death

Page 19: Introduction to Big Data & Hadoop


Solution: A Combined Storage + Compute Layer

[Diagram: Instrumentation → Collection → Hadoop: Storage + Compute Grid (both storage and processing) → RDBMS (Aggregated Data, mostly append) → BI Reports + Interactive Apps]

» The entire ~2 PB of data is available for processing; no data archiving
» 1. Data Exploration & Advanced Analytics
» 2. Scalable throughput for ETL & aggregation
» 3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.

Page 20: Introduction to Big Data & Hadoop


Annie’s Question

Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets

Page 21: Introduction to Big Data & Hadoop


Annie’s Answer

Ans. Large Data Sets. Hadoop can also process small data sets, but to experience its true power you need data on the order of terabytes. This is where an RDBMS takes hours (or fails outright), whereas Hadoop does the same work in a couple of minutes.

Page 22: Introduction to Big Data & Hadoop


Hadoop Ecosystem

Hadoop 2.0 stack (top to bottom):
» Apache Oozie (Workflow)
» Pig Latin (Data Analysis) | Hive (DW System) | Mahout (Machine Learning)
» MapReduce Framework | HBase | Other YARN Frameworks (MPI, Graph)
» YARN (Cluster Resource Management)
» HDFS (Hadoop Distributed File System)

Data ingestion: Flume (unstructured or semi-structured data) and Sqoop (structured data)
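All of the tools above ultimately read from and write to HDFS. As a minimal sketch of that interaction (not code from this deck), the Java snippet below writes a small file into HDFS and reads it back using the org.apache.hadoop.fs.FileSystem API; the NameNode address and the /user/edureka/demo path are illustrative placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; normally picked up from core-site.xml on the classpath
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/edureka/demo/hello.txt");

        // Write a small file into HDFS (overwrite if it already exists)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back and print its first line
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}

Higher-level components such as Hive, Pig, and MapReduce use this same FileSystem abstraction under the hood.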

Page 23: Introduction to Big Data & Hadoop


Hadoop Cluster: Facebook

Facebook

We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.

Currently we have 2 major clusters:

» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

» A 300-machine cluster with 2400 cores and about 3 PB raw storage.

» Each (commodity) node has 8 cores and 12 TB of storage.

» We are heavy users of both streaming as well as the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
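The "Java APIs" mentioned in the quote are the classic MapReduce APIs. As a hedged illustration (this is not Facebook's code, just the standard word-count pattern), a minimal mapper and reducer using org.apache.hadoop.mapreduce look like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class would wire these into a Job and point it at input and output paths in HDFS; the job is then submitted to the cluster, which schedules the map and reduce tasks across the DataNodes.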

Page 24: Introduction to Big Data & Hadoop


YARN – Moving beyond MapReduce

Applications that run on YARN:
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 25: Introduction to Big Data & Hadoop


Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
» No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.

Pseudo-Distributed Mode
» Hadoop daemons run on the local machine.

Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines.
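As a hedged sketch of what switching to pseudo-distributed mode typically involves (the values below follow the Apache single-node setup guide; the port and replication factor may differ on your cluster), the two minimal configuration files look roughly like this:

core-site.xml:

<configuration>
  <property>
    <!-- NameNode URI for the single-node (pseudo-distributed) cluster -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <!-- Only one DataNode, so keep a single replica of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In standalone mode neither file needs to be set, and in fully-distributed mode fs.defaultFS points at the cluster's NameNode instead of localhost.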

Page 26: Introduction to Big Data & Hadoop


Big Data Learning Path

Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R

Related courses: Big Data and Hadoop, MapReduce Design Patterns, Apache Spark & Scala, Apache Cassandra, Linux Administration, Hadoop Administration, Data Science, Business Analytics Using R, Advance Predictive Modelling in R, Talend for Big Data, Data Visualization Using Tableau

Page 27: Introduction to Big Data & Hadoop


Learning Path to Certification

Course:
» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module-wise Quiz and Assignment
» Project Work
» Verifiable Certificate

1. Assistance from Peers and Support team
2. Review for Certification

Page 28: Introduction to Big Data & Hadoop


Further Reading

Apache Hadoop and HDFS

http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

Apache Hadoop HDFS Architecture

http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Page 29: Introduction to Big Data & Hadoop


Assignment

Referring to the documents present in the LMS under Assignment, solve the problem below.

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?

Page 30: Introduction to Big Data & Hadoop


Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!

Please spare a few minutes to take the survey after the webinar.


Survey

Page 31: Introduction to Big Data & Hadoop