
Page 1: Big Data – Hadoop - Paramesh

BIG DATA – HADOOP

Governance Team 6 Dec 16

Page 2: Big Data – Hadoop - Paramesh

Today’s Overview

• Big Data Fundamentals
• Hadoop and Components
• Q&A

Page 3: Big Data – Hadoop - Paramesh

Agenda – Big Data Fundamentals

• What is Big Data?
• Basic Characteristics of Big Data
• Sources of Big Data
• V’s of Big Data
• Processing of Data – Traditional Approach vs. Big Data Approach

Page 4: Big Data – Hadoop - Paramesh

What is Big Data?

Page 5: Big Data – Hadoop - Paramesh

What is Big Data? – cont’d

• Big Data is a collection of data sets so large that they cannot be processed using the traditional approach. It includes the following:
– Structured data – traditional relational data
– Semi-structured data – e.g. XML
– Unstructured data – images, PDFs, media, etc.

Page 6: Big Data – Hadoop - Paramesh

Various V’s of Big Data

Page 7: Big Data – Hadoop - Paramesh

Processing of Data

• Traditional Approach

• Big Data Approach

Page 8: Big Data – Hadoop - Paramesh

Hadoop Fundamentals

• What is Hadoop?
• Key Characteristics
• Components
– HDFS
– MapReduce
– YARN
• Benefits of Hadoop

Page 9: Big Data – Hadoop - Paramesh
Page 10: Big Data – Hadoop - Paramesh

Key Characteristics – Hadoop

• Reliable
• Flexible
• Scalable
• Economical

Page 11: Big Data – Hadoop - Paramesh

Components

• Common Libraries
• High-volume distributed data storage system – HDFS
• High-volume distributed data processing framework – MapReduce
• Resource and metadata management – YARN

Page 12: Big Data – Hadoop - Paramesh
Page 13: Big Data – Hadoop - Paramesh

HDFS

• What is HDFS?
• Architecture
• Components
• Basic Features

Page 14: Big Data – Hadoop - Paramesh

What is HDFS?

HDFS holds very large amounts of data and provides easy access to it. To store such huge data sets, files are spread across multiple machines and stored redundantly, so the system can recover from data loss when a machine fails. HDFS also makes the data available to applications for parallel processing.
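As a concrete illustration (not part of the slides), the sketch below writes and then reads a small file through Hadoop’s Java FileSystem API. The NameNode address and the /demo path are assumptions made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address

        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the NameNode for metadata, then streams the
        // blocks to DataNodes, which replicate them for fault tolerance.
        Path file = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read the same file back through the file system namespace.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}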

Page 15: Big Data – Hadoop - Paramesh
Page 16: Big Data – Hadoop - Paramesh

Components – HDFS

Master/slave architecture: an HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of DataNodes, usually one per node in the cluster.

The DataNodes manage the storage attached to the nodes they run on.

Page 17: Big Data – Hadoop - Paramesh

Components – HDFS

HDFS exposes a file system namespace and allows user data to be stored in files.

A file is split into one or more blocks, and the set of blocks is stored on DataNodes.

DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the NameNode.

Page 18: Big Data – Hadoop - Paramesh

Features

• Highly fault-tolerant
• High throughput
• Suitable distributed storage for large amounts of data
• Streaming access to file system data
• Can be built out of commodity hardware

Page 19: Big Data – Hadoop - Paramesh

MapReduce

• What is MapReduce?
• Tasks / Components
• Basic Features
• Demo

Page 20: Big Data – Hadoop - Paramesh

What is MapReduce?

• It is a framework used to process large amounts of data in parallel on large clusters of commodity hardware.

• It is based on the divide-and-conquer principle, which provides built-in fault tolerance and redundancy.

• It is a batch-oriented parallel processing engine for large volumes of data.

Page 21: Big Data – Hadoop - Paramesh

MapReduce

– Map stage: the mapper’s job is to process the input data. Generally the input data is a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

– Reduce stage: this stage is the combination of the Shuffle stage and the Reduce stage. The reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS. A minimal code sketch of both stages follows below.
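To make the two stages concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the class names are illustrative, not from the slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input line arrives as (byte offset, line text);
// the mapper emits (word, 1) for every word on the line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce stage: after shuffle and sort, all counts for a given word
// arrive together; the reducer sums them and the totals land in HDFS.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}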

Page 22: Big Data – Hadoop - Paramesh

Stages of Each Task

• The Map task has the following stages:
– Map
– Combine
– Partition (see the sketch below)

• The Reduce task has the following stages:
– Shuffle and Sort
– Reduce
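As a sketch of the partition stage (assuming the word-count types from the earlier sketch), a custom Partitioner decides which reduce task receives each key; this one simply reproduces Hadoop’s default hash partitioning.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition stage: route each (word, count) pair to one of the
// configured reduce tasks, keeping all values for a key together.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Non-negative hash modulo the reducer count, exactly as the
        // default HashPartitioner does.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}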

Page 23: Big Data – Hadoop - Paramesh

Demo

• Refer to the PDF attachment
• Mainly reads the text and counts the number of words (a driver sketch follows below)
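As a stand-in for the attachment, a minimal driver for such a word-count job might look like the sketch below, reusing the Mapper/Reducer classes sketched earlier; the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        // Combine stage: run the reducer locally on each mapper's output.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class); // the default input format
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}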

Page 24: Big Data – Hadoop - Paramesh

YARN

• What is YARN?
• Architecture and Components

Page 25: Big Data – Hadoop - Paramesh
Page 26: Big Data – Hadoop - Paramesh

QUESTIONS?

Page 27: Big Data – Hadoop - Paramesh

APPENDIX

Page 28: Big Data – Hadoop - Paramesh

CAP

• CAP Theorem
– Consistency: reads from all nodes always return consistent data
– Availability: every read/write is always acknowledged as either a success or a failure
– Partition Tolerance: the system tolerates communication outages that split the cluster into multiple silos / data sets

A distributed data system can provide at most two of the above properties. Distributed data storage is designed based on this theorem.

Page 29: Big Data – Hadoop - Paramesh

ACID

• ACID
– Atomicity
– Consistency
– Isolation
– Durability

Page 30: Big Data – Hadoop - Paramesh

BASE

• BASE
– Basically Available
– Soft state
– Eventual consistency

The above properties are mainly used in distributed databases for non-transactional data.

Page 31: Big Data – Hadoop - Paramesh

SCV

• SCV
– Speed
– Consistency
– Volume

High-volume data processing is based on this principle: a data processing system can satisfy at most two of the above properties.

Page 32: Big Data – Hadoop - Paramesh

Sharding

• Sharding is the process of horizontally partitioning a large volume of data into smaller, more manageable data sets (a sketch follows below).
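A minimal, hypothetical Java sketch of hash-based sharding: each record key is mapped to one of N shards, so one large data set becomes N smaller partitions. The shard count and key type are assumptions for illustration.

import java.util.Arrays;

public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Non-negative hash modulo the shard count picks the target shard,
    // so the same key always routes to the same partition.
    public int shardFor(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % shardCount;
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        for (String key : Arrays.asList("alice", "bob", "carol")) {
            System.out.println(key + " -> shard " + router.shardFor(key));
        }
    }
}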

Page 33: Big Data – Hadoop - Paramesh

Replication

• Replication stores multiple copies of a data set, known as replicas.
• It provides high availability, scalability, and fault tolerance, since the data is stored on multiple nodes.
• Replicas are implemented in the following ways:
– Master-slave
– Peer-to-peer

Page 34: Big Data – Hadoop - Paramesh

HDFS

Page 35: Big Data – Hadoop - Paramesh

HDFS

Page 36: Big Data – Hadoop - Paramesh

HDFS Commands

• https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

Page 37: Big Data – Hadoop - Paramesh

HDFS

• Blocks
– In HDFS a file is split into small segments used to store the data; each segment is called a block.
– The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x; it can be changed in the HDFS configuration, and 128 MB is the advisable setting (a sketch follows below).
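For illustration, a client can override the block size for the files it creates via the dfs.blocksize property (a byte count; Hadoop 2.x property name). A minimal sketch, assuming default configuration files are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 128 MB blocks for files this client creates.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
    }
}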

Page 38: Big Data – Hadoop - Paramesh

Types of File Formats – MR

• TextInputFormat – the default
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat

Page 39: Big Data – Hadoop - Paramesh

Reader and Writer

• RecordReader
– Reads records from the file line by line; each line in the file is treated as a record.
– Runs before the mapper function.

• RecordWriter
– Writes the output content into a file.
– Runs after the reducer.

Page 40: Big Data – Hadoop - Paramesh

Reducer

• IdentityReducer – passes each (key, value) pair through unchanged, with no aggregation logic of its own.

• Custom Reducer – implements aggregation over the values that arrive after the shuffle and sort.

Page 41: Big Data – Hadoop - Paramesh

BoxClasses in MR

• Equivalent to wrapper classes in Java
• IntWritable
• FloatWritable
• LongWritable
• DoubleWritable
• Text
• Mainly used for (K,V) pairs in MR

Page 42: Big Data – Hadoop - Paramesh

WHAT NEXT…
Tools of the Hadoop ecosystem: PIG / SQOOP / HBASE / HIVE / SPARK…