Introduction to Big Data & Hadoop



Page 1: Introduction to Big Data & Hadoop


Introduction to Big Data and Hadoop

Page 2: Introduction to Big Data & Hadoop


Objectives

At the end of this session, you will understand:

Big Data Introduction

Use Cases of Big Data in Multiple Industry Verticals

Hadoop and Its Eco-System

Hadoop Architecture

Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists

Page 3: Introduction to Big Data & Hadoop


Unstructured Data is Exploding

Source: Twitter

Page 4: Introduction to Big Data & Hadoop


IBM’s Definition of Big Data

IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/

Page 5: Introduction to Big Data & Hadoop


Annie’s Introduction

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.

Page 6: Introduction to Big Data & Hadoop


Annie’s Question

Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)

Page 7: Introduction to Big Data & Hadoop


Annie’s Answer

Ans.
» XML files, e-mail body – Semi-structured data
» Audio, Video, Images, Archived documents – Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) – Structured data

Page 8: Introduction to Big Data & Hadoop


Further Reading

More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Page 9: Introduction to Big Data & Hadoop


Common Big Data Customer Scenarios

Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

Telecommunications

» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Page 10: Introduction to Big Data & Hadoop


Common Big Data Customer Scenarios (Contd.)

Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice

Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Page 11: Introduction to Big Data & Hadoop


Common Big Data Customer Scenarios (Contd.)

Banks and Financial services

» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

Retail

» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Page 12: Introduction to Big Data & Hadoop


Why DFS?

Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel 100 MB/s
» 10 Machines: 4 I/O channels each, each channel 100 MB/s

Page 13: Introduction to Big Data & Hadoop


Why DFS? (Contd.)

Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel 100 MB/s

Page 14: Introduction to Big Data & Hadoop


Why DFS? (Contd.)

Read 1 TB of data
» 1 Machine: 4 I/O channels, each channel 100 MB/s → 43 minutes
» 10 Machines: 4 I/O channels each, each channel 100 MB/s → 4.3 minutes
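As a quick back-of-the-envelope check of the figures above (assuming 1 TB = 2^20 MB and that all channels read in parallel at full speed):

1 machine:  4 channels x 100 MB/s = 400 MB/s, so 1,048,576 MB / 400 MB/s ≈ 2,621 s ≈ 43 minutes
10 machines: 10 x 400 MB/s = 4,000 MB/s, so 1,048,576 MB / 4,000 MB/s ≈ 262 s ≈ 4.3 minutes

This is exactly the motivation for a Distributed File System: aggregate read throughput scales with the number of machines reading in parallel.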

Page 15: Introduction to Big Data & Hadoop


Page 16: Introduction to Big Data & Hadoop


Hadoop Cluster: A Typical Use Case

Active NameNode / StandBy NameNode
» RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

Secondary NameNode
» RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

DataNode
» RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS

Page 17: Introduction to Big Data & Hadoop


Hidden Treasure

Insight into data can provide Business Advantage.

Some key early indicators can mean Fortunes to Business.

More Precise Analysis with more data.

*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.

Case Study: Sears Holding Corporation

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Page 18: Introduction to Big Data & Hadoop


Limitations of Existing Data Analytics Architecture

[Diagram: Instrumentation → Collection → Storage-only Grid (Original Raw Data; storage) and ETL Compute Grid (processing) → RDBMS (Aggregated Data, mostly append) → BI Reports + Interactive Apps]

» A meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived
» 1. Can't explore original high-fidelity raw data
» 2. Moving data to compute doesn't scale
» 3. Premature data death

Page 19: Introduction to Big Data & Hadoop


Solution: A Combined Storage + Compute Layer

[Diagram: Instrumentation → Collection → Hadoop: Storage + Compute Grid (both storage and processing) → RDBMS (Aggregated Data, mostly append) → BI Reports + Interactive Apps]

» The entire ~2 PB of data is available for processing; no data archiving
» 1. Data Exploration & Advanced Analytics
» 2. Scalable throughput for ETL & aggregation
» 3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.

Page 20: Introduction to Big Data & Hadoop


Annie’s Question

Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets

Page 21: Introduction to Big Data & Hadoop


Annie’s Answer

Ans. Large Data Sets. Hadoop can also process small data sets, but to experience its true power you need data on the order of terabytes. This is where an RDBMS takes hours (or fails outright), whereas Hadoop does the same work in a couple of minutes.

Page 22: Introduction to Big Data & Hadoop


Hadoop Ecosystem

Hadoop 2.0 stack (top to bottom):
» Apache Oozie (Workflow)
» Pig Latin (Data Analysis) | Hive (DW System) | Mahout (Machine Learning)
» MapReduce Framework | HBase | Other YARN Frameworks (MPI, Graph)
» YARN (Cluster Resource Management)
» HDFS (Hadoop Distributed File System)

Data ingestion: Flume (unstructured or semi-structured data) and Sqoop (structured data)
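All of the tools above ultimately read from and write to HDFS. As a minimal sketch of that interaction (not code from this deck), the Java snippet below writes a small file into HDFS and reads it back using the org.apache.hadoop.fs.FileSystem API; the NameNode address and the /user/edureka/demo path are illustrative placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI; normally picked up from core-site.xml on the classpath
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/edureka/demo/hello.txt");

        // Write a small file into HDFS (overwrite if it already exists)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back and print its first line
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}

Higher-level components such as Hive, Pig, and MapReduce use this same FileSystem abstraction under the hood.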

Page 23: Introduction to Big Data & Hadoop


Hadoop Cluster: Facebook

Facebook

We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.

Currently we have 2 major clusters:

» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

» A 300-machine cluster with 2400 cores and about 3 PB raw storage.

» Each (commodity) node has 8 cores and 12 TB of storage.

» We are heavy users of both streaming as well as the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
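The "Java APIs" mentioned in the quote are the classic MapReduce APIs. As a hedged illustration (this is not Facebook's code, just the standard word-count pattern), a minimal mapper and reducer using org.apache.hadoop.mapreduce look like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

A driver class would wire these into a Job and point it at input and output paths in HDFS; the job is then submitted to the cluster, which schedules the map and reduce tasks across the DataNodes.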

Page 24: Introduction to Big Data & Hadoop


YARN – Moving beyond MapReduce

Applications that run on YARN:
» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 25: Introduction to Big Data & Hadoop


Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
» No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.

Pseudo-Distributed Mode
» Hadoop daemons run on the local machine.

Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines.
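As a hedged sketch of what switching to pseudo-distributed mode typically involves (the values below follow the Apache single-node setup guide; the port and replication factor may differ on your cluster), the two minimal configuration files look roughly like this:

core-site.xml:

<configuration>
  <property>
    <!-- NameNode URI for the single-node (pseudo-distributed) cluster -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <!-- Only one DataNode, so keep a single replica of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In standalone mode neither file needs to be set, and in fully-distributed mode fs.defaultFS points at the cluster's NameNode instead of localhost.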

Page 26: Introduction to Big Data & Hadoop


Big Data Learning Path

Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R

Related courses: Big Data and Hadoop, MapReduce Design Patterns, Apache Spark & Scala, Apache Cassandra, Linux Administration, Hadoop Administration, Data Science, Business Analytics Using R, Advance Predictive Modelling in R, Talend for Big Data, Data Visualization Using Tableau

Page 27: Introduction to Big Data & Hadoop


Learning Path to Certification

Course:
» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module-wise Quiz and Assignment
» Project Work
» Verifiable Certificate

1. Assistance from Peers and Support team
2. Review for Certification

Page 28: Introduction to Big Data & Hadoop


Further Reading

Apache Hadoop and HDFS

http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

Apache Hadoop HDFS Architecture

http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Page 29: Introduction to Big Data & Hadoop


Assignment

Referring to the documents present in the LMS under Assignment, solve the problem below.

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?

Page 30: Introduction to Big Data & Hadoop


Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!

Please spare a few minutes to take the survey after the webinar.


Survey

Page 31: Introduction to Big Data & Hadoop