INTRODUCTION TO BIG DATA & HADOOP

Upload: amir-shaikh

Post on 20-Mar-2017


TRANSCRIPT

Page 1: Introduction to Big Data and Hadoop

INTRODUCTION TO BIG DATA & HADOOP

Page 2: Introduction to Big Data and Hadoop

OUTLINE

• Data Generation Sources
• Per-minute Data Evaluation
• What is Big Data?
• Limitations of RDBMS
• What is Hadoop?
• History of Hadoop
• Hadoop Core Components
• Hadoop Architecture
• Hadoop Ecosystem

Page 3: Introduction to Big Data and Hadoop

OUTLINE

• Hadoop V1 vs. Hadoop V2
• Hadoop Distributions
• Who uses Hadoop?
• Overview of Data Lake

Page 4: Introduction to Big Data and Hadoop

DATA GENERATION IN LAST FEW DECADES

Page 5: Introduction to Big Data and Hadoop

DATA GENERATING NOW

Page 6: Introduction to Big Data and Hadoop

“32 BILLION DEVICES PLUGGED IN & GENERATING DATA BY 2020”

❑ The EMC Digital Universe Study released its seventh edition. According to the study, the amount of data in our digital universe is expected to grow from 4.4 trillion GB to 44 trillion GB by 2020.

- 11th April 2014

❑ According to computer giant IBM, "2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. That's big by anyone's standards. About 75% of data is unstructured, coming from sources such as text, voice and video."

- Mr. Miles

Page 7: Introduction to Big Data and Hadoop
Page 8: Introduction to Big Data and Hadoop

WHAT IS BIG DATA?

Gartner: “Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Page 9: Introduction to Big Data and Hadoop

BIG DATA: Volume, Velocity and Variety

Volume
• Refers to the vast amounts of data generated every second.

Velocity
• Refers to the speed at which new data is generated and the speed at which data moves around.

Variety
• Refers to the different types of data generated from different sources.

Page 10: Introduction to Big Data and Hadoop

BIG DATA HAS ALSO BEEN DEFINED BY THE FIVE V’s

1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

Page 11: Introduction to Big Data and Hadoop

BIG DATA Veracity

Veracity

• Veracity refers to the biases, noise and abnormality in data.

• It asks whether the data being stored and mined is meaningful to the problem being analyzed.

Page 12: Introduction to Big Data and Hadoop

BIG DATA Value

Value

• Data has intrinsic value, but it must be discovered.
• There is a range of quantitative and investigative techniques to derive value from Big Data.
• Technological breakthroughs make much more accurate and precise decisions possible.
• Exploring the value in Big Data requires experimentation and exploration, whether creating new products or looking for ways to gain competitive advantage.

Page 13: Introduction to Big Data and Hadoop

BIG DATA

Page 14: Introduction to Big Data and Hadoop

Turning Big Data into Value (diagram)

Data sources (ERP, CRM, inventory, finance, conversations, voice, social media, browser logs, photos, videos, logs, sensors, etc.) supply data characterized by Volume, Velocity, Variety and Veracity. Analysing this Big Data with techniques such as predictive analysis, text analytics, sentiment analysis, image processing, voice analytics and movement analytics turns it into Value.

Page 15: Introduction to Big Data and Hadoop

Comparison between Traditional RDBMS and Hadoop

Page 16: Introduction to Big Data and Hadoop

LIMITATIONS OF RDBMS TO SUPPORT “BIG DATA”

• Designed and structured to accommodate structured data.
• As data size has increased tremendously, RDBMS finds it challenging to handle such huge data volumes.
• Lacks support for high velocity, because it is designed for steady data retention rather than rapid growth.
• Not designed for distributed computing.
• Many issues while scaling up for massive datasets.
• Requires expensive specialized hardware.
• Even if an RDBMS is used to handle and store “Big Data”, it turns out to be very expensive.

Page 17: Introduction to Big Data and Hadoop

WHAT IS HADOOP?

Page 18: Introduction to Big Data and Hadoop

WHAT IS HADOOP?

• Hadoop is an open-source software framework.
• It allows the distributed storage and processing of large data sets across clusters of commodity hardware.
• It uses simple programming models for processing.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
• It stores files in the form of blocks.

Page 19: Introduction to Big Data and Hadoop

WHAT IS HADOOP? (cont.)

• Hadoop is an open-source implementation of Google's MapReduce and GFS (Google File System).

• Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library.

Page 20: Introduction to Big Data and Hadoop

HISTORY OF HADOOP

• 2002-2003 - Doug Cutting and Mike Cafarella start the Nutch project to handle billions of searches and index millions of web pages.

• Oct 2003 - Google publishes the GFS (Google File System) paper.

• Dec 2004 - Google publishes the MapReduce paper.

• 2005 - Nutch adopts implementations of GFS and MapReduce to perform its operations.

• 2006 - Hadoop is created at Yahoo! by Doug Cutting and team, based on the GFS and MapReduce papers.

• 2007 - Yahoo! starts using Hadoop on a 1000-node cluster.

• Jan 2008 - Hadoop becomes a top-level Apache project.

• Jul 2008 - A 4000-node cluster is tested successfully with Hadoop.

• 2009 - Hadoop successfully sorts a petabyte of data in less than 17 hours.

Page 21: Introduction to Big Data and Hadoop

HADOOP CORE COMPONENTS

HDFS (storage) and MapReduce (processing) are the two core components of Apache Hadoop.

HDFS
• HDFS is a distributed file system that provides high-throughput access to data.
• It provides a limited interface for managing the file system, to allow it to scale and provide high throughput.
• HDFS creates multiple replicas of each data block and distributes them on computers throughout a cluster, to enable reliable and rapid access.

Page 22: Introduction to Big Data and Hadoop

HADOOP CORE COMPONENTS

MapReduce

• MapReduce is a framework for performing distributed data processing using the MapReduce programming paradigm.

• Each job has a user-defined map phase and a user-defined reduce phase, in which the output of the map phase is aggregated.

• HDFS is the storage system for both input and output of the MapReduce jobs.
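The map and reduce phases described above can be sketched in plain Python. This is a conceptual illustration of the paradigm only, not the Hadoop Java API; the function names and the in-memory "shuffle" step are simplifications of what the framework does across a cluster:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key (here, sum the counts)."""
    return (key, sum(values))

# Two "input splits" of a toy dataset.
splits = ["big data and hadoop", "hadoop stores big data"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'and': 1, 'hadoop': 2, 'stores': 1}
```

In real Hadoop the splits live in HDFS blocks, map tasks run on the nodes that hold those blocks, and the shuffle moves data over the network; the word-count logic itself is the same.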

Page 23: Introduction to Big Data and Hadoop

HDFS OVERVIEW

• Based on Google’s GFS (Google File System).
• Provides redundant storage of massive amounts of data, using commodity hardware.
• Data is distributed across all nodes at load time.
• Provides for efficient MapReduce processing; operates on top of an existing filesystem.
• Files are stored as ‘blocks’; each block is replicated across several DataNodes.
• The NameNode stores metadata and manages access.
• No data caching, due to the large datasets.
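The block storage and replication above can be illustrated with a small simulation. The block size, node names and round-robin placement below are toy assumptions for the example; real HDFS defaults to 128 MB blocks (64 MB in older versions) and uses a rack-aware placement policy:

```python
import itertools

BLOCK_SIZE = 4            # bytes per block (toy value for the demo)
REPLICATION = 3           # copies of each block (the HDFS default)
DATANODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file contents into fixed-size blocks, as HDFS does at load time."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=DATANODES, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes.
    Round-robin here; real HDFS placement is rack-aware."""
    placement = {}
    ring = itertools.cycle(nodes)
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop")     # 12 bytes -> 3 blocks
placement = place_replicas(blocks)
print(len(blocks), placement[0])                # 3 ['node1', 'node2', 'node3']
```

The point of the sketch: a file is just a sequence of blocks, and losing one DataNode loses at most one replica of any block, so the data stays available.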

Page 24: Introduction to Big Data and Hadoop

HADOOP ARCHITECTURE

Master-Slave

Page 25: Introduction to Big Data and Hadoop

HADOOP ARCHITECTURE

NameNode
• Stores all metadata: filenames, locations of each block on the DataNodes, file attributes, etc.
• Keeps metadata in RAM for fast lookup.
• Filesystem metadata size is therefore limited to the amount of RAM available on the NameNode.

DataNode
• Stores file contents as blocks.
• Different blocks of the same file are stored on different DataNodes.
• Periodically sends a report of all existing blocks to the NameNode.
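The NameNode's in-RAM metadata and the DataNodes' periodic block reports can be sketched as simple dictionaries. The class and method names below are illustrative, not the actual Hadoop internals:

```python
class NameNode:
    """Toy model of the NameNode's in-memory filesystem metadata."""

    def __init__(self):
        self.file_to_blocks = {}   # filename -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNodes holding a replica

    def add_file(self, filename, block_ids):
        """Record which blocks make up a file (the namespace half of the metadata)."""
        self.file_to_blocks[filename] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        """Update the block->location map from one DataNode's periodic report."""
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, filename):
        """Answer a client lookup: which DataNodes hold each block of this file?"""
        return [sorted(self.block_locations.get(b, set()))
                for b in self.file_to_blocks[filename]]

nn = NameNode()
nn.add_file("/logs/part-0", ["blk_1", "blk_2"])
nn.receive_block_report("node1", ["blk_1"])
nn.receive_block_report("node2", ["blk_1", "blk_2"])
print(nn.locate("/logs/part-0"))  # [['node1', 'node2'], ['node2']]
```

Because every lookup is a dictionary access in RAM, reads are fast, but, as the slide notes, the total metadata the cluster can hold is bounded by the NameNode's memory.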

Page 26: Introduction to Big Data and Hadoop

COMPONENTS(DAEMONS) OF HADOOP

• NameNode
• DataNode
• Secondary NameNode
• JobTracker
• TaskTracker

Page 27: Introduction to Big Data and Hadoop

HADOOP ARCHITECTURE

(Diagram: a Client talks to the master node, which runs the JobTracker and NameNode; the master coordinates several slave nodes, each running a TaskTracker and a DataNode.)

Page 28: Introduction to Big Data and Hadoop

APACHE HADOOP ECOSYSTEM

Page 29: Introduction to Big Data and Hadoop

APACHE HADOOP ECOSYSTEM (CONT.)

❑ Pig
A high-level data-flow language and execution framework for parallel computation.

❑ Hive
A data warehouse infrastructure that provides data summarization and ad hoc querying.

❑ Sqoop
A tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

❑ HBase
A scalable, distributed database that supports structured data storage for large tables.

Page 30: Introduction to Big Data and Hadoop

HADOOP 1.0 VS HADOOP 2.0

Page 31: Introduction to Big Data and Hadoop

MR v1 vs. MR v2

Page 32: Introduction to Big Data and Hadoop

HADOOP DISTRIBUTIONS

➢ Let's say we download Apache Hadoop and MapReduce from http://hadoop.apache.org/.
➢ At first it works great, but then we decide to start using HBase.
➢ No problem: just download HBase from http://hadoop.apache.org/ and point it to your existing HDFS installation.
➢ But we find that HBase can only work with a previous version of HDFS, so we downgrade HDFS and everything still works great.
➢ Later on we decide to add Pig.
➢ Unfortunately, this version of Pig doesn't work with our version of HDFS; it wants us to upgrade.
➢ But if we upgrade, we will break HBase.

Page 33: Introduction to Big Data and Hadoop

HADOOP DISTRIBUTIONS

Hadoop distributions aim to resolve version incompatibilities.

Distribution vendors will:
• Integration-test a set of Hadoop products.
• Package Hadoop products in various installation formats: Linux packages, tarballs, etc.
• Possibly provide additional scripts to run Hadoop.
• Possibly backport features and bug fixes made by Apache.
• Typically employ Hadoop committers, so the bugs they find make it back into the Apache repository.

Page 34: Introduction to Big Data and Hadoop

DISTRIBUTION VENDORS

• Cloudera Distribution for Hadoop (CDH)
• MapR Distribution
• Hortonworks Data Platform (HDP)
• Greenplum
• IBM BigInsights

Page 35: Introduction to Big Data and Hadoop

CLOUDERA DISTRIBUTION FOR HADOOP (CDH)

• Cloudera has taken the lead in providing a Hadoop distribution.
• Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles.
• The most popular distribution: http://cloudera.com/hadoop - 100% open source.
• Cloudera employs a large percentage of core Hadoop committers.
• CDH is provided in various formats: Linux packages, virtual machine images and tarballs.
• Integrates the majority of popular Hadoop products: HDFS, MapReduce, HBase, Hive, Oozie, Pig, Sqoop, ZooKeeper, Flume, etc.

Page 36: Introduction to Big Data and Hadoop

SUPPORTED OPERATING SYSTEMS

• Each distribution supports its own list of operating systems.
• Commonly supported OSes:
  • Red Hat Enterprise Linux
  • CentOS
  • Oracle Linux
  • Ubuntu
  • SUSE Linux Enterprise Server

Page 37: Introduction to Big Data and Hadoop

WHO USES HADOOP?

➢ Amazon/A9
➢ Facebook
➢ Google
➢ IBM
➢ Joost
➢ LinkedIn
➢ New York Times
➢ PowerSet
➢ Yahoo!

Now It’s Our Turn

Page 38: Introduction to Big Data and Hadoop

OVERVIEW OF DATA LAKE

“A Data Lake is a large storage repository and processing engine. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.”

Page 39: Introduction to Big Data and Hadoop

DATA LAKE

Page 40: Introduction to Big Data and Hadoop

REFERENCE

• https://hadoop.apache.org/

• http://www.cloudera.com/hadoop-and-big-data.html

• http://hortonworks.com/hadoop/

• Hadoop: The Definitive Guide, 4th Edition - O'Reilly Media

Page 41: Introduction to Big Data and Hadoop

Any BIGGER Question?

Page 42: Introduction to Big Data and Hadoop

Thank You

Presenter: Amir R. Shaikh, Hadoop Administrator

Thank you for your attention and your time. Have a good day ahead.

Signing off