basics of big data analytics hadoop

Basics of Big Data

Analytics &

Hadoop

Ambuj Kumar

[email protected]

http://ambuj4bigdata.blogspot.in

http://ambujworld.wordpress.com

mailto:[email protected]

http://ambuj4bigdata.blogspot.in/

http://ambuj4bigdata.blogspot.in/

http://ambujworld.wordpress.com/


Agenda

Big Data –

Concepts overview

Analytics –

Concepts overview

Hadoop –

Concepts overview

HDFS

Concepts overview

Data Flow - Read & Write Operation

MapReduce

Concepts overview

WordCount Program

Use Cases

Landscape

Hadoop Features & Summary

What is Big Data?

Big data is data which is too large, complex and dynamic for any conventional data tools to capture,

store, manage and analyze.

Challenges of Big Data

• Storage (~ Petabytes) 1

• Processing (Timely manner) 2

• Variety of Data (Structured, Semi Structured, Un-structured) 3

• Cost 4

Big Data Analytics

Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information.

Big Data Analytics Solutions

There are many different Big Data Analytics Solutions out in the market.

Tableau – visualization tools

SAS – Statistical computing

IBM and Oracle – They have a range of tools for Big Data Analysis

Revolution – Statistical computing

R – Open source tool for Statistical computing

What is Hadoop?

Open-source data storage and processing API

Massively scalable, automatically parallelizable

Based on work from Google

GFS + MapReduce + BigTable

Current Distributions based on Open Source and Vendor Work

Apache Hadoop

Cloudera – CDH4

Hortonworks

MapR

AWS

Windows Azure HDInsight

Why Use Hadoop?

Cheaper Scales to Petabytes

or more

Faster

Parallel data processing

Better Suited for particular

types of BigData problems

Hadoop History

In 2008, Hadoop became Apache Top Level Project

Comparing: RDBMS vs. Hadoop

Traditional RDBMS Hadoop / MapReduce

Data Size Gigabytes (Terabytes) Petabytes (Hexabytes)

Access Interactive and Batch Batch – NOT Interactive

Updates Read / Write many times Write once, Read many times

Structure Static Schema Dynamic Schema

Integrity High (ACID) Low

Scaling Nonlinear Linear

Query

Response Time

Can be near immediate Has latency (due to batch

processing)

Where is Hadoop used?

Industry

Technology

Use Cases

Search People you may know

Movie recommendations

Banks Fraud Detection

Regulatory Risk management

Media Retail

Marketing analytics Customer service

Product recommendations

Manufacturing Preventive maintenance

Companies Using Hadoop

Search

Yahoo, Amazon, Zvents

Log Processing

Facebook, Yahoo, ContextWeb.Joost, Last.fm

Recommendation Systems

Facebook, Linkedin

Data Warehouse

Facebook, AOL

Video & Image Analysis

New York Times, Eyealike

------- Almost in every domain!

Hadoop is a set of Apache

Frameworks and more…

Data storage (HDFS)

Runs on commodity hardware (usually Linux)

Horizontally scalable

Processing (MapReduce)

Parallelized (scalable) processing

Fault Tolerant

Other Tools / Frameworks

Data Access

HBase, Hive, Pig, Mahout

Tools

Hue, Sqoop

Monitoring

Greenplum, Cloudera

Hadoop Core - HDFS

MapReduce API

Data Access

Tools & Libraries

Monitoring & Alerting

Core parts of Hadoop distribution

HDFS Storage

Redundant (3 copies)

For large files – large blocks

64 or 128 MB / block

Can scale to 1000s of nodes

MapReduce API

Batch (Job) processing

Distributed and Localized to clusters (Map)

Auto-Parallelizable for huge amounts of data

Fault-tolerant (auto retries)

Adds high availability and more

Other Libraries

Pig

Hive

HBase

Others

Hadoop Cluster HDFS (Physical)

Storage

Name Node

Data Node 1 Data Node 2 Data Node 3

Secondary Name Node

• Contains web site to view cluster information

• V2 Hadoop uses multiple Name Nodes for HA

One Name Node

• 3 copies of each node by default

Many Data Nodes

• Using common Linux shell commands

• Block size is 64 or 128 MB

Work with data in HDFS

MapReduce Job – Logical View

Hadoop Ecosystem

Common Hadoop Distributions

Open Source

Apache Commercial

Cloudera Hortonworks MapR AWS MapReduce Microsoft HDInsight

HDFS : Architecture

Master

NameNode

Slave

Bunch of DataNodes

HDFS Layers

NameNode

Storage

…………

NS

Block Management

NameNode

DataNode

DataNode DataNode DataNode DataNode DataNode

DataNode

Nam

e

Sp

ace

Blo

ck

Sto

rag

e

HDFS : Basic Features

Highly fault-tolerant

High throughput

Suitable for applications with large data sets

Streaming access to file system data

Can be built out of commodity hardware

HDFS Write (1/2)

Client Name Node

1

2

Data Node

A Data Node

B

Data Node

C

Data Node

D

A2 A3 A4 A1

3

Client contacts NameNode to write data

NameNode says write it to these nodes

Client sequentially writes

blocks to DataNode

HDFS Write (2/2)

Client Name Node

Data Node

A Data Node

B

Data Node

C

Data Node

D

A1

DataNodes replicate data

blocks, orchestrated

by the NameNode A2

A4

A2 A1

A3

A3 A2

A4

A4 A1

A3

HDFS Read

Client Name Node

1

2

Data Node

A Data Node

B

Data Node

C

Data Node

D

A1

3

Client contacts NameNode to read data

NameNode says you can find it here

Client sequentially

reads blocks from

DataNode A2

A4

A2 A1

A3

A3 A2

A4

A4 A1

A3

HA (High Availability) for

NameNode

NameNode (StandBy)

DataNode

NameNode (Active)

Active NameNode

Do normal namenode’s operation

Standby NameNode

Maintain NameNode’s data

Ready to be active NameNode

DataNode DataNode DataNode DataNode

MapReduce

MapReduce job consist of two tasks

Map Task

Reduce Task

Blocks of data distributed across several machines are

processed by map tasks parallel

Results are aggregated in the reducer

Works only on KEY/VALUE pair

MapReduce: Word Count

Deer 1

Bear 1

River 1

Car 1

Car 1

River 1

Deer 1

Car 1

Bear 1

Bear 2

Car 3

Deer 2

River 2

Can we do word count in parallel?

Deer Bear River

Car Car River

Deer Car Bear

MapReduce: Word Count Program

Data Flow in a MapReduce Program in Hadoop

Mapper Class Package ambuj.com.wc;

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends

Mapper<LongWritable, Text, Text, LongWritable> {

private final static LongWritable one = new LongWritable(1);

private Text word = new Text();

@Override

public void map(LongWritable inputKey, Text inputVal, Context context)

throws IOException, InterruptedException {

String line = inputVal.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

context.write(word, one);

}

}

}

Reducer Class

package ambuj.com.wc;

import java.io.IOException;



import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends

Reducer<Text, LongWritable, Text, LongWritable> {

@Override

public void reduce(Text key, Iterable<LongWritable> listOfValues,

Context context) throws IOException, InterruptedException {

long sum = 0;

for (LongWritable val : listOfValues) {

sum = sum + val.get();

}

context.write(key, new LongWritable(sum));

}

}

Driver Class package ambuj.com.wc;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.fs.Path;



import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

@Override

public int run(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = new Job(conf, "WordCount");

job.setJarByClass(WordCountDriver.class);

job.setMapperClass(WordCountMapper.class);

job.setReducerClass(WordCountReducer.class);

job.setInputFormatClass(TextInputFormat.class);

job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(LongWritable.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(LongWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);

return 0;

}

public static void main(String[] args) throws Exception {

ToolRunner.run(new WordCountDriver(), args);

}

}

A view of Hadoop Client Job

Data Node

Task

Tracker

Task

Task

Task

Job Tracker Name Node

Data Node

Task

Tracker

Task

Task

Task

Data Node

Task

Tracker

Task

Task

Task

Ma

ste

r S

lave

Blocks HDFS

MapReduce

Use Cases

Utilities want to predict power consumption

Banks and insurance companies want to understand risk

Fraud detection

Marketing departments want to understand customers

Recommendations

Location-Based Ad Targeting

Threat Analysis

Big Data Landscape

Hadoop Features & Summary

Distributed frame work for processing and storing data generally on commodity hardware. Completely open source and written in Java.

Store anything

Unstructured or semi structured data,

Storage capacity

Scale linearly, cost in not exponential.

Data locality and process in your way.

Code moves to data

In MR you specify the actual steps in processing the data and drive the out put.

Stream access: Process data in any language.

Failure and fault tolerance:

Detect Failure and Heals itself.

Reliable, data replicated, failed task are rerun , no need maintain backup of data

Cost effective: Hadoop is designed to be a scale-out architecture operating on a cluster of commodity PC machines.

The Hadoop framework transparently for customization to provides applications both reliability, adaption and data motion.

Primarily used for batch processing, not real-time/ transactional user applications.

References - Hadoop

Hadoop: The Definitive Guide, Third Edition by Tom White.

http://hadoop.apache.org

http://www.cloudera.com

http://ambuj4bigdata.blogspot.com

http://ambujworld.wordpress.com

http://hadoop.apache.org/

http://hadoop.apache.org/

http://www.cloudera.com/

http://www.cloudera.com/

http://ambuj4bigdata.blogspot.com/

http://ambuj4bigdata.blogspot.com/


Thank You

basics of big data analytics hadoop

Data & Analytics