Basics of Big Data Analytics & Hadoop
Ambuj Kumar
http://ambuj4bigdata.blogspot.in
http://ambujworld.wordpress.com
Agenda
Big Data – Concepts overview
Analytics – Concepts overview
Hadoop – Concepts overview
HDFS – Concepts overview; Data Flow (Read & Write Operation)
MapReduce – Concepts overview; WordCount Program
Use Cases
Landscape
Hadoop Features & Summary
What is Big Data?
Big data is data which is too large, complex and dynamic for any conventional data tools to capture,
store, manage and analyze.
Challenges of Big Data
• Storage (~ petabytes)
• Processing (in a timely manner)
• Variety of data (structured, semi-structured, unstructured)
• Cost
Big Data Analytics
Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other useful information.
Big Data Analytics Solutions
There are many different Big Data Analytics Solutions out in the market.
Tableau – visualization tools
SAS – statistical computing
IBM and Oracle – each offers a range of tools for Big Data analysis
Revolution Analytics – statistical computing
R – open-source tool for statistical computing
What is Hadoop?
Open-source data storage and processing API
Massively scalable, automatically parallelizable
Based on work from Google
GFS + MapReduce + BigTable
Current Distributions based on Open Source and Vendor Work
Apache Hadoop
Cloudera – CDH4
Hortonworks
MapR
AWS
Windows Azure HDInsight
Why Use Hadoop?
Cheaper: scales to petabytes or more
Faster: parallel data processing
Better: suited for particular types of Big Data problems
Hadoop History
In 2008, Hadoop became an Apache top-level project.
Comparing: RDBMS vs. Hadoop
                     Traditional RDBMS          Hadoop / MapReduce
Data Size            Gigabytes (Terabytes)      Petabytes (Exabytes)
Access               Interactive and Batch      Batch – NOT Interactive
Updates              Read / Write many times    Write once, Read many times
Structure            Static Schema              Dynamic Schema
Integrity            High (ACID)                Low
Scaling              Nonlinear                  Linear
Query Response Time  Can be near immediate      Has latency (due to batch processing)
Where is Hadoop used?
Industry        Use Cases
Technology      Search; People you may know; Movie recommendations
Banks           Fraud detection; Regulatory; Risk management
Media, Retail   Marketing analytics; Customer service; Product recommendations
Manufacturing   Preventive maintenance
Companies Using Hadoop
Search: Yahoo, Amazon, Zvents
Log Processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
Recommendation Systems: Facebook, LinkedIn
Data Warehouse: Facebook, AOL
Video & Image Analysis: New York Times, Eyealike
In almost every domain!
Hadoop is a set of Apache Frameworks and more…
Data storage (HDFS)
Runs on commodity hardware (usually Linux)
Horizontally scalable
Processing (MapReduce)
Parallelized (scalable) processing
Fault Tolerant
Other Tools / Frameworks
Data Access
HBase, Hive, Pig, Mahout
Tools
Hue, Sqoop
Monitoring
Greenplum, Cloudera
Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting
Core parts of Hadoop distribution
HDFS Storage
Redundant (3 copies)
For large files – large blocks
64 or 128 MB / block
Can scale to 1000s of nodes
MapReduce API
Batch (Job) processing
Distributed and Localized to clusters (Map)
Auto-Parallelizable for huge amounts of data
Fault-tolerant (auto retries)
Adds high availability and more
Other Libraries
Pig
Hive
HBase
Others
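The storage figures above (3 copies, 64 or 128 MB blocks) lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch in plain Java, not part of any Hadoop API; the class and method names are illustrative assumptions, and the 1 GB file is a made-up example.

```java
/** Back-of-the-envelope HDFS sizing. All names here are illustrative, not Hadoop APIs. */
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks a file is split into: the ceiling of fileBytes / blockBytes. */
    static long blockCount(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Raw bytes consumed across the cluster: every byte is stored once per replica. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 1024 * MB; // hypothetical 1 GB file
        System.out.println("Blocks at 128 MB: " + blockCount(file, 128 * MB));
        System.out.println("Blocks at 64 MB:  " + blockCount(file, 64 * MB));
        System.out.println("Raw MB with 3 copies: " + rawBytes(file, 3) / MB);
    }
}
```

Larger blocks mean fewer blocks per file, which keeps the amount of metadata the NameNode has to track small for very large files.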
Hadoop Cluster HDFS (Physical)
Storage
[Diagram: Name Node, Secondary Name Node, Data Node 1, Data Node 2, Data Node 3]
One Name Node
• Hosts a web UI for viewing cluster information
• V2 Hadoop uses multiple Name Nodes for HA
Many Data Nodes
• 3 copies of each block by default
Work with data in HDFS
• Using common Linux shell commands
• Block size is 64 or 128 MB
MapReduce Job – Logical View
Hadoop Ecosystem
Common Hadoop Distributions
Open Source: Apache
Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight
HDFS : Architecture
Master: NameNode
Slaves: a set of DataNodes
HDFS Layers
[Diagram: the NameNode holds the Namespace (NS) and Block Management; the DataNodes provide Storage. Block Management and Storage together form the Block Storage layer.]
HDFS : Basic Features
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
HDFS Write (1/2)
[Diagram: Client, Name Node, and Data Nodes A, B, C and D; the file is split into blocks A1, A2, A3 and A4]
1. Client contacts NameNode to write data
2. NameNode says write it to these nodes
3. Client sequentially writes blocks to DataNode
HDFS Write (2/2)
DataNodes replicate data blocks, orchestrated by the NameNode
[Diagram: Client, Name Node, and Data Nodes A to D after replication]
Data Node A: A1, A2, A4
Data Node B: A2, A1, A3
Data Node C: A3, A2, A4
Data Node D: A4, A1, A3
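The write path on these two slides can be sketched as a small simulation. This is a toy model in plain Java, not HDFS code: the class and method names are hypothetical, and the round-robin placement is a deliberate simplification (real HDFS uses a rack-aware policy), so the resulting layout differs from the one in the diagram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the HDFS write path. Names and placement policy are assumptions. */
final class ToyHdfsWrite {

    /** "NameNode" step: pick `replication` distinct DataNodes for one block, round-robin. */
    static List<String> chooseTargets(List<String> dataNodes, int blockIndex, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for the replication factor");
        }
        List<String> targets = new ArrayList<>();
        for (int r = 0; r < replication; r++) {
            targets.add(dataNodes.get((blockIndex + r) % dataNodes.size()));
        }
        return targets;
    }

    /** "Client" step: write blocks one after another; returns DataNode -> stored blocks. */
    static Map<String, List<String>> write(List<String> blocks, List<String> dataNodes,
                                           int replication) {
        Map<String, List<String>> stored = new TreeMap<>();
        for (String dn : dataNodes) {
            stored.put(dn, new ArrayList<>());
        }
        for (int i = 0; i < blocks.size(); i++) {
            for (String dn : chooseTargets(dataNodes, i, replication)) {
                stored.get(dn).add(blocks.get(i)); // one replica of the block per target
            }
        }
        return stored;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("A1", "A2", "A3", "A4");
        List<String> nodes = Arrays.asList("A", "B", "C", "D");
        write(blocks, nodes, 3).forEach((dn, held) ->
                System.out.println("Data Node " + dn + ": " + held));
    }
}
```

Whatever the placement policy, the invariant is the same as on the slide: every block ends up on three different DataNodes.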
HDFS Read
[Diagram: Client, Name Node, and Data Nodes A to D holding replicated blocks]
1. Client contacts NameNode to read data
2. NameNode says you can find it here
3. Client sequentially reads blocks from DataNode
Data Node A: A1, A2, A4
Data Node B: A2, A1, A3
Data Node C: A3, A2, A4
Data Node D: A4, A1, A3
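The read path can be sketched the same way. This toy model is plain Java, not the HDFS client API; the names are hypothetical, and the block locations are an assumption taken from the replica layout shown in the diagram (A1 on A, B, D; A2 on A, B, C; A3 on B, C, D; A4 on A, C, D). It also shows why replication helps reads: if a DataNode is down, the client falls back to another replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the HDFS read path. Names and data are illustrative assumptions. */
final class ToyHdfsRead {

    /** Block -> DataNodes holding a replica, as read off the diagram. */
    static Map<String, List<String>> diagramLocations() {
        Map<String, List<String>> locations = new LinkedHashMap<>();
        locations.put("A1", Arrays.asList("A", "B", "D"));
        locations.put("A2", Arrays.asList("A", "B", "C"));
        locations.put("A3", Arrays.asList("B", "C", "D"));
        locations.put("A4", Arrays.asList("A", "C", "D"));
        return locations;
    }

    /** For each block, in order, choose the first live DataNode that holds a replica. */
    static List<String> planRead(List<String> blocks, Map<String, List<String>> locations,
                                 Set<String> deadNodes) {
        List<String> plan = new ArrayList<>();
        for (String block : blocks) {
            String chosen = null;
            for (String dn : locations.getOrDefault(block, Collections.<String>emptyList())) {
                if (!deadNodes.contains(dn)) {
                    chosen = dn;
                    break;
                }
            }
            if (chosen == null) {
                throw new IllegalStateException("no live replica for block " + block);
            }
            plan.add(block + "@" + chosen);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("A1", "A2", "A3", "A4");
        System.out.println(planRead(blocks, diagramLocations(), Collections.<String>emptySet()));
        // Same read with Data Node A unavailable: other replicas are used instead.
        System.out.println(planRead(blocks, diagramLocations(), Collections.singleton("A")));
    }
}
```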
HA (High Availability) for NameNode
Active NameNode
Performs normal NameNode operations
Standby NameNode
Maintains a copy of the NameNode's data
Ready to become the active NameNode
[Diagram: an active NameNode and a standby NameNode, both connected to the DataNodes]
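The active/standby idea can be illustrated with a very small model. This is a sketch under strong simplifying assumptions, not how HDFS implements HA: here the standby mirrors every namespace edit directly, and all names and the example path are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy active/standby NameNode pair. A simplification; names are assumptions. */
final class ToyNameNodeHa {
    private final Map<String, List<String>> active = new HashMap<>();
    private final Map<String, List<String>> standby = new HashMap<>();
    private boolean standbyPromoted = false;

    /** Records a file's block list; the standby always receives the same edit. */
    void recordFile(String path, List<String> blocks) {
        if (!standbyPromoted) {
            active.put(path, new ArrayList<>(blocks));
        }
        standby.put(path, new ArrayList<>(blocks));
    }

    /** Simulates losing the active NameNode: the standby takes over. */
    void failover() {
        active.clear();
        standbyPromoted = true;
    }

    /** Looks up a file's blocks on whichever NameNode is currently serving. */
    List<String> lookup(String path) {
        return (standbyPromoted ? standby : active).get(path);
    }

    public static void main(String[] args) {
        ToyNameNodeHa ha = new ToyNameNodeHa();
        ha.recordFile("/logs/a.txt", Arrays.asList("A1", "A2")); // hypothetical path
        ha.failover();
        System.out.println("After failover: " + ha.lookup("/logs/a.txt"));
    }
}
```

Because the standby already holds the namespace, clients can keep resolving files to blocks after the failover without waiting for the metadata to be rebuilt.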
MapReduce
A MapReduce job consists of two kinds of tasks
Map Task
Reduce Task
Blocks of data distributed across several machines are processed by map tasks in parallel
Results are aggregated in the reducer
Works only on key/value pairs
MapReduce: Word Count
Can we do word count in parallel?
Input (one line per map task):
Deer Bear River
Car Car River
Deer Car Bear
Map output:
Deer 1, Bear 1, River 1
Car 1, Car 1, River 1
Deer 1, Car 1, Bear 1
Reduce output:
Bear 2
Car 3
Deer 2
River 2
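Before turning to the Hadoop classes, the three-line example above can be traced in plain Java. This is a minimal single-process sketch with no Hadoop dependency; the class and method names are illustrative assumptions. The map calls run one after another here, whereas Hadoop would run them in parallel on the nodes that hold each input split.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Single-process walk-through of map, shuffle and reduce. Names are assumptions. */
final class LocalWordCount {

    /** Map: turn one input line into (word, 1) pairs. */
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1L));
        }
        return out;
    }

    /** Shuffle: group all values by key, with keys in sorted order. */
    static SortedMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        SortedMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce: sum the values collected for one key. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases over all input lines. */
    static SortedMap<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // in Hadoop, each split is mapped by its own task
        }
        SortedMap<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear");
        System.out.println(run(lines)); // prints {Bear=2, Car=3, Deer=2, River=2}
    }
}
```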
MapReduce: Word Count Program
Data Flow in a MapReduce Program in Hadoop
Mapper Class
package ambuj.com.wc;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset of the line, line text). Output: (word, 1) for every word.
public class WordCountMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    // Reused across calls to avoid allocating a new object per word.
    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable inputKey, Text inputVal, Context context)
            throws IOException, InterruptedException {
        String line = inputVal.toString();
        // Split the line on whitespace and emit (word, 1) for each token.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
Reducer Class
package ambuj.com.wc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, list of counts for that word). Output: (word, total count).
public class WordCountReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> listOfValues,
            Context context) throws IOException, InterruptedException {
        // Add up every count emitted by the mappers for this word.
        long sum = 0;
        for (LongWritable val : listOfValues) {
            sum = sum + val.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Driver Class
package ambuj.com.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration supplied by ToolRunner so that generic
        // command-line options (for example -D key=value) take effect.
        Configuration conf = getConf();
        // Job.getInstance (Hadoop 2+) replaces the deprecated new Job(conf, name).
        Job job = Job.getInstance(conf, "WordCount");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // args[0] is the input path; args[1] is the output path (must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Return a non-zero status if the job fails.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Propagate the job status as the process exit code.
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}
A view of Hadoop
[Diagram: a Client submits a Job to the master, which runs the Job Tracker (MapReduce) and the Name Node (HDFS). Each slave is a Data Node that stores HDFS blocks and runs a Task Tracker managing several Tasks.]
Use Cases
Utilities want to predict power consumption
Banks and insurance companies want to understand risk
Fraud detection
Marketing departments want to understand customers
Recommendations
Location-Based Ad Targeting
Threat Analysis
Big Data Landscape
Hadoop Features & Summary
Distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java.
Store anything
Unstructured or semi-structured data
Storage capacity
Scales linearly; cost is not exponential.
Data locality and processing your way
Code moves to the data
In MR you specify the actual steps for processing the data and drive the output.
Stream access: process data in any language.
Failure and fault tolerance:
Detects failures and heals itself.
Reliable: data is replicated and failed tasks are rerun, so there is no need to maintain a separate backup of the data.
Cost effective: Hadoop is designed to be a scale-out architecture operating on a cluster of commodity PC machines.
The Hadoop framework transparently provides applications with reliability, adaptability and data motion, and allows for customization.
Primarily used for batch processing, not real-time/transactional user applications.
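The "failed tasks are rerun" point can be illustrated with a small retry loop. This is a minimal sketch of the general idea in plain Java, not Hadoop's scheduler; the class name, method name and the `maxAttempts` parameter are hypothetical.

```java
import java.util.function.Supplier;

/** Sketch of rerunning a failed task a bounded number of times. Names are assumptions. */
final class TaskRetry {

    /** Runs the task until it succeeds or maxAttempts is reached; rethrows the last failure. */
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw lastFailure;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // A task that fails twice and then succeeds, standing in for a flaky node.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated task failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real cluster the rerun would typically be scheduled on a different node, which is possible because the input block is replicated elsewhere.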
References - Hadoop
Hadoop: The Definitive Guide, Third Edition by Tom White.
http://hadoop.apache.org
http://www.cloudera.com
http://ambuj4bigdata.blogspot.com
http://ambujworld.wordpress.com
Thank You