Introduction to Big Data and Hadoop

Edureka Contact: [email protected] www.edureka.co/big-data-and-hadoop Introduction to Big Data and Hadoop

Upload: edureka

Post on 21-Mar-2017

Category:

Technology


Page 1: Introduction to Big Data and Hadoop

Introduction to Big Data and Hadoop

Page 2: Introduction to Big Data and Hadoop

Objectives

At the end of this session, you will understand:

Big Data Introduction

Use Cases of Big Data in Multiple Industry Verticals

Hadoop and Its Eco-System

Hadoop Architecture

Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists

Page 3: Introduction to Big Data and Hadoop

Unstructured Data is Exploding

Source: Twitter

Page 4: Introduction to Big Data and Hadoop

IBM’s Definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

IBM’s Definition of Big Data

Page 5: Introduction to Big Data and Hadoop

Annie’s Introduction

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Page 6: Introduction to Big Data and Hadoop

Annie’s Question

Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM, etc.)

Page 7: Introduction to Big Data and Hadoop

Annie’s Answer

Ans.
» XML files, e-mail body: Semi-structured data
» Audio, Video, Images, Archived documents: Unstructured data
» Data from Enterprise systems (ERP, CRM, etc.): Structured data

Page 8: Introduction to Big Data and Hadoop

Further Reading

More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Page 9: Introduction to Big Data and Hadoop

Common Big Data Customer Scenarios

Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

Telecommunications

» Customer Churn Prevention
» Network Performance Optimization
» Call Detail Record (CDR) Analysis
» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Page 10: Introduction to Big Data and Hadoop

Government

» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice

Healthcare and Life Sciences

» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

Common Big Data Customer Scenarios (Contd.)

http://wiki.apache.org/hadoop/PoweredBy

Page 11: Introduction to Big Data and Hadoop

Common Big Data Customer Scenarios (Contd.)

Banks and Financial services

» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

Retail

» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Page 12: Introduction to Big Data and Hadoop

Why DFS?

Read 1 TB Data

1 Machine
» 4 I/O Channels
» Each Channel – 100 MB/s

10 Machines
» 4 I/O Channels
» Each Channel – 100 MB/s

Page 13: Introduction to Big Data and Hadoop

Why DFS? (Contd.)

1 Machine
» 4 I/O Channels
» Each Channel – 100 MB/s

10 Machines
» 4 I/O Channels
» Each Channel – 100 MB/s

43 Minutes

Read 1 TB Data

Page 14: Introduction to Big Data and Hadoop

Why DFS? (Contd.)

1 Machine
» 4 I/O Channels
» Each Channel – 100 MB/s

10 Machines
» 4 I/O Channels
» Each Channel – 100 MB/s

1 Machine: 43 Minutes
10 Machines: 4.3 Minutes

Read 1 TB Data
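
The 43-minute and 4.3-minute figures follow from simple throughput arithmetic. Here is a minimal sketch, assuming 1 TB is counted as 1024 x 1024 MB and that the read is split evenly across machines and channels with no coordination overhead. The function name is illustrative and not part of Hadoop.

```python
def read_time_minutes(data_tb, machines, channels_per_machine=4, mb_per_s_per_channel=100):
    """Idealised time, in minutes, to read data_tb terabytes in parallel."""
    data_mb = data_tb * 1024 * 1024  # assumption: binary units, 1 TB = 1024 * 1024 MB
    aggregate_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return data_mb / aggregate_mb_per_s / 60


for machines in (1, 10):
    print(f"{machines} machine(s): {read_time_minutes(1, machines):.1f} minutes")
```

Under these assumptions the results come to roughly 43.7 and 4.4 minutes, which the slides shorten to 43 and 4.3. The point is that read time falls in proportion to the number of machines reading in parallel.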

Page 15: Introduction to Big Data and Hadoop

Page 16: Introduction to Big Data and Hadoop

Hadoop Cluster: A Typical Use Case

Active NameNode

RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply

Secondary NameNode

RAM: 32 GB
Hard disk: 1 TB
Processor: Xeon with 4 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply

DataNode

RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS

DataNode

RAM: 16 GB
Hard disk: 6 x 2 TB
Processor: Xeon with 2 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS

Standby NameNode

RAM: 64 GB
Hard disk: 1 TB
Processor: Xeon with 8 cores
Ethernet: 3 x 10 Gb/s
OS: 64-bit CentOS
Power: Redundant Power Supply

Page 17: Introduction to Big Data and Hadoop

Hidden Treasure

Insight into data can provide Business Advantage.

Some key early indicators can mean fortunes to a business.

More Precise Analysis with more data.

Case Study: Sears Holdings Corporation

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process customer activity and sales data.

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Page 18: Introduction to Big Data and Hadoop

Limitations of Existing Data Analytics Architecture

Diagram: Instrumentation -> Collection -> Storage only Grid (Original Raw Data; Mostly Append) -> ETL Compute Grid -> RDBMS (Aggregated Data) -> BI Reports + Interactive Apps

Storage and processing sit in separate grids. A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.

1. Can’t explore original high fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death

Page 19: Introduction to Big Data and Hadoop

Solution: A Combined Storage and Compute Layer

Diagram: Instrumentation -> Collection -> Hadoop: Storage + Compute Grid (Mostly Append) -> RDBMS (Aggregated Data) -> BI Reports + Interactive Apps

Hadoop handles both storage and processing. The entire ~2 PB of data is available for processing, with no data archiving.

1. Data Exploration & Advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with its existing non-Hadoop solutions.

Page 20: Introduction to Big Data and Hadoop

Annie’s Question

Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets

Page 21: Introduction to Big Data and Hadoop

Annie’s Answer

Ans. Large Data Sets. Hadoop can also process small data sets, but its real strength shows when the data runs into terabytes. At that scale an RDBMS may take hours or fail, whereas Hadoop can complete the same job in a couple of minutes.

Page 22: Introduction to Big Data and Hadoop

Hadoop Ecosystem

Hadoop 2.0 (diagram components):
» Apache Oozie (Workflow)
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» Other YARN Frameworks (MPI, GRAPH)
» MapReduce Framework
» HBase
» YARN (Cluster Resource Management)
» HDFS (Hadoop Distributed File System)
» Flume (Unstructured or Semi-structured Data)
» Sqoop (Structured Data)

Page 23: Introduction to Big Data and Hadoop

Hadoop Cluster: Facebook

Facebook

We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.

Currently we have 2 major clusters:

» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

» A 300-machine cluster with 2400 cores and about 3 PB raw storage.

» Each (commodity) node has 8 cores and 12 TB of storage.

» We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
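
The "streaming" API mentioned above refers to Hadoop Streaming, which lets any program that reads lines from standard input and writes tab-separated key/value lines to standard output act as a mapper or reducer. The sketch below imitates that contract locally in Python for a word count. The function names, the sample input, and the in-process sort that stands in for Hadoop's shuffle are illustrative assumptions; this is not Facebook's code.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per word; input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Stand-in for the shuffle phase: sort mapper output, then reduce."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["big data and hadoop", "hadoop stores big data"]))
```

On a real cluster the mapper and reducer would be separate scripts, and the framework would handle splitting the input, sorting by key, and distributing the work across nodes.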

Page 24: Introduction to Big Data and Hadoop

YARN – Moving beyond MapReduce

Application types running on YARN:
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search, Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 25: Introduction to Big Data and Hadoop

Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode

No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.

Pseudo-Distributed Mode

Hadoop daemons run on the local machine.

Fully-Distributed Mode

Hadoop daemons run on a cluster of machines.
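
In practice, the mode is chosen through configuration rather than through a separate installation. The sketch below summarises, as a plain Python dictionary, the property values that typically distinguish the three modes. `fs.defaultFS` and `mapreduce.framework.name` are real Hadoop property names, and `hdfs://localhost:9000` is the value used in the Apache single-node setup guide. The host `namenode.example.com`, its port, and the helper function are hypothetical.

```python
# Typical settings per mode (an illustrative summary, not a complete configuration).
CLUSTER_MODES = {
    "standalone": {
        "fs.defaultFS": "file:///",           # local file system, no HDFS daemons
        "mapreduce.framework.name": "local",  # jobs run inside a single JVM
    },
    "pseudo-distributed": {
        "fs.defaultFS": "hdfs://localhost:9000",  # all daemons on one machine
        "mapreduce.framework.name": "yarn",
    },
    "fully-distributed": {
        # Hypothetical NameNode address; a real cluster uses its own host and port.
        "fs.defaultFS": "hdfs://namenode.example.com:9000",
        "mapreduce.framework.name": "yarn",
    },
}


def uses_hdfs(mode):
    """Return True if the mode's default file system is HDFS."""
    return CLUSTER_MODES[mode]["fs.defaultFS"].startswith("hdfs://")


for mode in CLUSTER_MODES:
    print(f"{mode}: HDFS = {uses_hdfs(mode)}")
```

In a real deployment these values live in `core-site.xml` and `mapred-site.xml`; the dictionary only makes the differences between modes easy to compare side by side.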

Page 26: Introduction to Big Data and Hadoop

Big Data Learning Path

• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

• Hadoop Essentials
• Expertise in R

Developer/Testing

Administration

• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst

• Statistics Skills
• Machine Learning

Big Data and Hadoop

MapReduce Design Patterns

Apache Spark & Scala

Apache Cassandra

Linux Administration

Hadoop Administration

Data Science

Business Analytics Using R

Advanced Predictive Modelling in R

Talend for Big Data

Data Visualization Using Tableau

Page 27: Introduction to Big Data and Hadoop

Learning Path to Certification

Course

LIVE Online Class

Class Recording in LMS

24/7 Post Class Support

Module Wise Quiz and Assignment

Verifiable Certificate

Project Work

1. Assistance from Peers and Support team

2. Review for Certification

Page 28: Introduction to Big Data and Hadoop

Further Reading

Apache Hadoop and HDFS

http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

Apache Hadoop HDFS Architecture

http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Page 29: Introduction to Big Data and Hadoop

Assignment

Referring to the documents in the LMS under Assignment, solve the problem below.

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
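
One way to approach this is sketched below, under the assumption that each DataNode matches the machines in the "Why DFS?" slides (4 I/O channels at 100 MB/s each) and that reads scale perfectly across nodes. The LMS documents may specify different figures, in which case only the parameters change. The function name is illustrative.

```python
import math


def datanodes_needed(data_tb, minutes, channels_per_node=4, mb_per_s_per_channel=100):
    """Smallest number of DataNodes whose combined throughput meets the deadline."""
    data_mb = data_tb * 1024 * 1024  # assumption: 1 TB = 1024 * 1024 MB
    required_mb_per_s = data_mb / (minutes * 60)
    per_node_mb_per_s = channels_per_node * mb_per_s_per_channel
    return math.ceil(required_mb_per_s / per_node_mb_per_s)


print(datanodes_needed(100, 5))
```

The calculation divides the throughput the deadline demands by the throughput one node can supply, then rounds up because a fraction of a node cannot be provisioned.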

Page 30: Introduction to Big Data and Hadoop

Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!

Please spare a few minutes to take the survey after the webinar.

Survey

Page 31: Introduction to Big Data and Hadoop