
Page 1: Big data  Hadoop

BIG DATA – HADOOP

Governance Team 6 Dec 16

Page 2: Big data  Hadoop

Today's Overview

• Big Data Fundamentals
• Hadoop and Components
• Q&A

Page 3: Big data  Hadoop

Agenda – Big Data Fundamentals

• What is Big Data?
• Basic Characteristics of Big Data
• Sources of Big Data
• V's of Big Data
• Processing of Data – Traditional Approach vs. Big Data Approach

Page 4: Big data  Hadoop

What is Big Data

Page 5: Big data  Hadoop

What is Big Data – cont'd

• Big Data is a collection of data sets so large that they cannot be processed using a traditional approach. It contains the following:
– Structured data – traditional relational data
– Semi-structured data – XML
– Unstructured data – images, PDFs, media, etc.

Page 6: Big data  Hadoop

Various V's – Big Data

Page 7: Big data  Hadoop

Processing of Data

• Traditional Approach

• Big Data Approach

Page 8: Big data  Hadoop

Hadoop Fundamentals

• What is Hadoop?
• Key Characteristics
• Components
• HDFS
• MapReduce
• YARN
• Benefits of Hadoop

Page 9: Big data  Hadoop
Page 10: Big data  Hadoop

What is Hadoop

• Hadoop is an open-source software framework for storing large amounts of data and for processing/querying that data on a cluster with multiple nodes of commodity hardware (i.e. low-cost hardware).

Page 11: Big data  Hadoop

Key Characteristics – Hadoop

• Reliable
• Flexible
• Scalable
• Economical

Page 12: Big data  Hadoop

Components

• Common Libraries
• High-Volume Distributed Data Storage System – HDFS
• High-Volume Distributed Data Processing Framework – MapReduce
• Resource and Metadata Management – YARN

Page 13: Big data  Hadoop
Page 14: Big data  Hadoop

HDFS

• What is HDFS?
• Architecture
• Components
• Basic Features

Page 15: Big data  Hadoop

What is HDFS?

HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. The files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
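Below is a minimal sketch (hypothetical paths and class names, not from the deck) of how an application stores and reads a file through the HDFS Java API; the block splitting and redundant replication described above happen transparently behind these calls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");  // hypothetical HDFS path

    // Write: HDFS splits the file into blocks and replicates them
    // across DataNodes behind this single call.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("hello hdfs\n");
    }

    // Read the file back from whichever DataNodes hold its blocks.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}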

Page 16: Big data  Hadoop
Page 17: Big data  Hadoop

Components – HDFS

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

There are a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes that they run on.

Page 18: Big data  Hadoop

Components – HDFS

HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and those blocks are stored in DataNodes.

DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the NameNode.

Page 19: Big data  Hadoop

Features

• Highly fault-tolerant
• High throughput
• Suitable distributed storage for large amounts of data
• Streaming access to file system data
• Can be built out of commodity hardware

Page 20: Big data  Hadoop

MapReduce

• What is MapReduce?
• Tasks/Components
• Basic Features
• Demo

Page 21: Big data  Hadoop

What is MapReduce

• It is a framework used to process large amounts of data in parallel on large clusters of commodity hardware.

• It is based on the divide-and-conquer principle, which provides built-in fault tolerance and redundancy.

• It is a batch-oriented parallel processing engine for processing large volumes of data.

Page 22: Big data  Hadoop

MapReduce

– Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

– Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
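The classic word-count example illustrates the two stages; this is a minimal sketch (class and variable names are illustrative, not taken from the deck): the mapper emits a (word, 1) pair for every word in its input line, and the reducer sums the counts that the shuffle stage grouped by word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map stage: invoked once per input line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE); // emit (word, 1)
        }
      }
    }
  }

  // Reduce stage: receives all counts for one word after shuffle/sort.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(word, new IntWritable(sum)); // emit (word, total)
    }
  }
}

The driver that wires these two classes together is sketched in the appendix under "MapReduce Programs".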

Page 23: Big data  Hadoop

Stages of Each Task

• The Map task has the following stages:
– Map
– Combine
– Partition

• The Reduce task has the following stages:
– Shuffle and Sort
– Reduce

Page 24: Big data  Hadoop

Demo

• Refer to the PDF attachment
• It mainly reads the text and counts the number of words

Page 25: Big data  Hadoop

YARN

• What is YARN?
• Architecture and Components

Page 26: Big data  Hadoop

YARN

• YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management

Page 27: Big data  Hadoop
Page 28: Big data  Hadoop

Hive

• What is Hive?
• Architecture of Hive
• Flow in Hive
• Data Types
• Sample Query
• What Hive Is Not
• Demo

Page 29: Big data  Hadoop

What is Hive

• It is a data warehouse infrastructure tool for processing structured data on the Hadoop platform.

• It was originally developed by Facebook and later moved under the Apache umbrella.

• When large volumes of data are retrieved from multiple sources and an RDBMS no longer fits as a solution, we move to Hive.

Page 30: Big data  Hadoop

What is Hive – cont'd

• It is a query-engine wrapper on top of Hadoop used to perform OLAP.
• It provides HiveQL, which is similar to SQL.
• It is targeted at users/developers with a SQL background.
• It stores the schema in a database and processes the data in HDFS.
• Data is stored in HDFS/HBase, and every table should reference a file on HDFS/HBase.

Page 31: Big data  Hadoop

Architecture – Hive

• Components
– User Interface – infrastructure tool used for interaction between the user and HDFS/HBase
– Metastore – stores schemas, tables, etc.; mainly used to store the metadata information
– SerDe – libraries used to serialize/deserialize data in its own format; reads and writes the rows from/to the tables
– Query Processor – generates and runs the underlying MapReduce (or Spark) programs for a query

Page 32: Big data  Hadoop

Architecture – Hive

Page 33: Big data  Hadoop

Data Types

• Integral Types – TINYINT, SMALLINT, INT, BIGINT
• Floating-Point Types – DOUBLE, DECIMAL
• String Types – CHAR, VARCHAR
• Misc Types – BOOLEAN, BINARY
• Date/Time Types – TIMESTAMP, DATE
• Complex Types – STRUCT, MAP, ARRAY

Page 34: Big data  Hadoop

Sample Query

• Create Table
• Drop Table
• Alter Table
• Rename Table – rename the table
• Load Data – insert
• Create View
• Select

Page 35: Big data  Hadoop

Operators and Built-in Functions

• Arithmetic operators
• Relational operators
• Logical operators
• Aggregate and built-in functions
• Supports Index/Order/Join

Page 36: Big data  Hadoop

Disadvantages of HIVE

• Not for real-time queries
• Supports ACID only from version 0.14 onwards
• Poor performance – processing takes more time, since Hive internally generates and runs a MapReduce or Spark program each time it processes a set of records

Page 37: Big data  Hadoop

Disadvantages of HIVE

• It can only process large volumes of structured data, not the other categories

Page 38: Big data  Hadoop

Hive Interface Options

• CLI
• HUE (Hadoop User Experience) – www.gethue.com
• JDBC/ODBC – Java
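A minimal sketch of the JDBC option (host, credentials, table name, and HDFS path are hypothetical): it connects to HiveServer2 from Java and runs a few of the statements listed on the Sample Query slide.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 listens on port 10000 by default.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "user", "");
         Statement stmt = conn.createStatement()) {

      // Create Table
      stmt.execute("CREATE TABLE IF NOT EXISTS employees "
          + "(id INT, name STRING, salary DOUBLE) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

      // Load Data (the file is assumed to already be in HDFS)
      stmt.execute("LOAD DATA INPATH '/data/employees.csv' "
          + "INTO TABLE employees");

      // Select
      try (ResultSet rs = stmt.executeQuery(
          "SELECT name, salary FROM employees LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
      }
    }
  }
}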

Page 39: Big data  Hadoop

QUESTIONS?

Page 40: Big data  Hadoop

APPENDIX

Page 41: Big data  Hadoop

CAP

• CAP Theorem
– Consistency
• Reads from all the nodes are always consistent
– Availability
• Every read/write is always acknowledged as either success or failure
– Partition Tolerance
• The system can tolerate a communication outage that splits the cluster into multiple silos/data sets

A distributed data system can provide only two of the above properties. Distributed data storage is designed based on this theorem.

Page 42: Big data  Hadoop

ACID

• ACID
– Atomicity
– Consistency
– Isolation
– Durability

Page 43: Big data  Hadoop

BASE

• BASE
– Basic availability
– Soft state
– Eventual consistency

The above properties are mainly used in distributed databases for non-transactional data.

Page 44: Big data  Hadoop

SCV

• SCV
– Speed
– Consistency
– Volume

High-volume data processing is based on the above principle; data processing can satisfy at most two of the above properties.

Page 45: Big data  Hadoop

Sharding

• Sharding is the process of horizontally partitioning a large volume of data into smaller, more manageable data sets.
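As a toy illustration (not from the deck), a hash-based shard router maps each record key to one of N shards, so every shard holds a smaller, more manageable slice of the data:

public class ShardRouter {
  private final int numShards;

  public ShardRouter(int numShards) {
    this.numShards = numShards;
  }

  // Mask off the sign bit so the shard index is always non-negative.
  public int shardFor(String key) {
    return (key.hashCode() & Integer.MAX_VALUE) % numShards;
  }

  public static void main(String[] args) {
    ShardRouter router = new ShardRouter(4);
    System.out.println(router.shardFor("customer-42")); // prints 0..3
  }
}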

Page 46: Big data  Hadoop

Replication

• Replication stores multiple copies of a data set, known as replicas.
• It provides high availability, scalability, and fault tolerance, since the data is stored on multiple nodes.
• Replicas are implemented in the following ways:
– Master-slave
– Peer-to-peer

Page 47: Big data  Hadoop

HDFS

Page 48: Big data  Hadoop

HDFS

Page 49: Big data  Hadoop

HDFS Commands

• https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

Page 50: Big data  Hadoop

HDFS

• Blocks
– In HDFS, a file is split into small segments that are used to store the data. Each segment is called a block.
– The default block size is 64 MB (Hadoop 1.x); you can change it in the HDFS configuration, e.g. to 128 MB (the Hadoop 2.x default and the advisable approach).
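As a sketch, the block size can be set through the standard dfs.blocksize property, either in hdfs-site.xml or programmatically (the value below is 128 MB expressed in bytes; treat the property name as an assumption to verify against your distribution's docs):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // 128 MB in bytes, the Hadoop 2.x default block size.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    System.out.println("Block size: " + conf.getLong("dfs.blocksize", 0));
  }
}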

Page 51: Big data  Hadoop

Types of Input File Format – MR

• TextInputFormat – default
• KeyValueTextInputFormat
• SequenceFileInputFormat
• SequenceFileAsTextInputFormat

Page 52: Big data  Hadoop

Reader and Writer

• RecordReader
– Reads records from the file line by line; each line in the file is treated as a record
– Runs before the mapper function

• RecordWriter
– Writes content to a file as output
– Runs after the reducer

Page 53: Big data  Hadoop

Reducer

• IdentityReducer – does not have shuffle capability
• Custom Reducer – has shuffle and sorting capability

Page 54: Big data  Hadoop

Box Classes in MR

• They are the equivalent of wrapper classes in Java
• IntWritable
• FloatWritable
• LongWritable
• DoubleWritable
• Text
• Mainly used for (K,V) pairs in MR
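A small sketch of the box classes in use: Text boxes a String and IntWritable boxes an int, much as Integer wraps int in plain Java; MapReduce uses them as (K,V) types because they serialize efficiently across the cluster.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
  public static void main(String[] args) {
    Text key = new Text("hadoop");          // boxed String
    IntWritable value = new IntWritable(1); // boxed int
    value.set(value.get() + 1);             // unbox, modify, re-box
    System.out.println(key + " -> " + value); // hadoop -> 2
  }
}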

Page 55: Big data  Hadoop

Schema on Read/Write

• Hadoop – schema-on-read approach
• RDBMS – schema-on-write approach

Page 56: Big data  Hadoop

Key Steps in Big Data Solution

• Ingesting Data
• Storing Data
• Processing Data

Page 57: Big data  Hadoop

HDFS

Page 58: Big data  Hadoop

Hadoop Tools

• There are 15+ frameworks and tools, such as Sqoop, Flume, Kafka, Pig, Hive, Spark, and Impala, to ingest data into HDFS, store and process data within HDFS, and query data from HDFS for business intelligence and analytics. Some tools, like Pig and Hive, are abstraction layers on top of MapReduce, whilst other tools, like Spark and Impala, have an improved architecture/design over MapReduce, giving much better latencies to support near-real-time (NRT) and real-time processing.

Page 59: Big data  Hadoop

NRT

• Near real-time processing is when speed is important, but a processing time in minutes is acceptable in lieu of seconds.

Page 60: Big data  Hadoop

Heartbeat – HDFS

• A heartbeat is the signal exchanged between a DataNode and the NameNode, and between a TaskTracker and the JobTracker.

Page 61: Big data  Hadoop

MapReduce – Partition

• All the values of a single key go from the mapper to the same reducer, which helps distribute the map output evenly over the reducers.
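This behavior comes from the partitioner; the sketch below mirrors the logic of Hadoop's default hash partitioner, so every value of a given key hashes to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitionSketch extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Same key -> same hash -> same reducer; the mask keeps the
    // index non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}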

Page 62: Big data  Hadoop

HDFS vs NAS (Network Attached Storage)

• HDFS data blocks are distributed across the local drives of all machines in a cluster.
• NAS data is stored on dedicated hardware.
• In HDFS there is data redundancy because of the replication protocol.
• In NAS there is no data redundancy.

Page 63: Big data  Hadoop

Commodity Hardware

• Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware does include RAM, because there are specific services that need to be executed in RAM.

Page 64: Big data  Hadoop

Port Numbers

• NameNode – 50070
• JobTracker – 50030
• TaskTracker – 50060

Page 65: Big data  Hadoop

Combiner – MapReduce

• A combiner is a mini reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help enhance the efficiency of MapReduce by reducing the quantum of data that has to be sent to the reducers.

Page 66: Big data  Hadoop

MapReduce Programs

• Driver – the main-method class, which is invoked by the scheduler
• Mapper
• Reducer
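A minimal driver sketch (input/output paths come from the command line; it assumes the WordCount mapper and reducer sketched earlier): it also reuses the reducer as a combiner, as described on the Combiner slide above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.TokenMapper.class);
    job.setCombinerClass(WordCount.SumReducer.class); // local mini-reduce
    job.setReducerClass(WordCount.SumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}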

Page 67: Big data  Hadoop

JobTracker – Functionality

– Client applications submit MapReduce jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data.
– The JobTracker locates TaskTracker nodes with available slots at or near the data.
– The JobTracker submits the work to the chosen TaskTracker nodes.
– The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
– When the work is completed, the JobTracker updates its status.
– Client applications can poll the JobTracker for information.

Page 68: Big data  Hadoop

DW – Data Warehouse

• A database built specifically for analysis and reporting purposes

Page 69: Big data  Hadoop

Hive-Supported File Formats

• Text file (plain raw data)
• Sequence file (key-value pairs)
• RCFile (Record Columnar file, which stores the columns of the table in a columnar fashion)

Page 70: Big data  Hadoop

NameNode vs Metastore

• NameNode – stores the metadata information about the files in Hadoop
• Metastore – stores the metadata information about the tables/databases in Hive

Page 71: Big data  Hadoop

Tez – Hive

• Executes complex directed acyclic graphs of general data processing tasks
• It performs better than MapReduce

Page 72: Big data  Hadoop

Bucketing – Hive

• Bucketing provides a mechanism to query and examine random samples of data.

• Bucketing offers the capability to execute queries on a random subset of the data.

Page 73: Big data  Hadoop

Reference – Hive

• http://dl.farinsoft.com/files/94/Ultimate-Guide-Programming-Apache-Hive-ebook.pdf

Page 74: Big data  Hadoop

WHAT NEXT… TOOLS OF HADOOP
PIG / SQOOP / HBASE / HIVE / SPARK…