Introduction to Hadoop
TRANSCRIPT
Workshop on Data Analytics Using Big Data Tools (WDABT) 2016 – Bharathiar University
K.Santhiya, Ph.D. Research Scholar; Dr. V.Bhuvaneswari, Assistant Professor, Dept. of Computer Applications, Bharathiar University – WDABT 2016
INTRODUCTION TO HADOOP
Presented by K.Santhiya, Ph.D. Research Scholar, Department of Computer Applications, Bharathiar University
Under the guidance of Dr. V.Bhuvaneswari, Assistant Professor, Department of Computer Applications, Bharathiar University
AGENDA
• WORLD OF DATA – A Few Instances
• CONVENTIONAL APPROACHES – Limitations
• HADOOP FRAMEWORK – Terminology Review
• HADOOP COMPONENTS – HDFS & MAPREDUCE
• HDFS – IN DETAIL
• HADOOP ECOSYSTEM
DATA EXPLOSION
2.5 quintillion bytes of data are created every day.
WORLDWIDE DATA
• Data created since the beginning of time
• Data created in the last two years
[Infographic: per-day data-generation figures – 2.9 million, 375 MB, 20 hrs, 24 PB, 50 million, 700 billion, 1.3 exabytes, 72 items; the source labels for each figure are not recoverable from the transcript]
THE WORLD OF DATA
In this context, a dataset is typically considered "Big Data" once it reaches the terabyte scale –
1 Terabyte or more
CONVENTIONAL APPROACHES
• RDBMS – SQL queries
• OS file system – custom frameworks (C / C++, Perl, Python)
ISSUES IN LEGACY SYSTEMS
• Limited storage capacity
• Limited processing capacity
• No scalability
• Single point of failure
• Sequential processing
• RDBMSs can handle only structured data
• Data requires preprocessing
• Information is collected according to current business needs only
How do we mine (and mind) all this data?
HOW DO WE RESOLVE ALL THESE ISSUES?
Mr. HADOOP says he has a solution to our BIG problem!
Companies Using Hadoop
WHAT IS HADOOP?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Concept: moving computation is more efficient than moving large data.
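The "simple programming model" referred to here is MapReduce. As a rough illustration in plain Python (this is not the Hadoop API, just a sketch of the idea), a word count can be phrased as a map step that emits (word, 1) pairs and a reduce step that sums them per word:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each distinct word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big tools", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'tools': 1}
```

In real Hadoop, the framework runs many map tasks in parallel on the nodes that already hold the data blocks, which is exactly the "move computation, not data" concept above.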
STORAGE COMPUTATION COMPLEXITY
TWO DAEMONS OF HADOOP
ARCHITECTURE
TERMINOLOGY REVIEW
• A Rack contains multiple Nodes (Node 1, Node 2, …, Node n)
• A Cluster contains multiple Racks (Rack 1, Rack 2, …)
HADOOP CLUSTER ARCHITECTURE
HADOOP CORE SERVICES
i. Name Node
ii. Data Node
iii. Resource Manager
iv. Application Master
v. Node Manager
vi. Secondary Name Node
HDFS – REAL LIFE CONNECT
• A college library was gifted a massive collection of books by a patron. The books were very popular titles. The librarian decided to arrange the books in a small rack, and distribute multiple copies of each book in other racks, so that students can find the books easily. Similarly, HDFS creates multiple copies of a data block, and keeps them in separate systems for easy access.
WHAT IS HDFS?
• Hadoop Distributed File System
• A highly fault-tolerant, distributed, reliable, scalable file system for data storage
• Stores multiple copies of data on different nodes
• A file is split into blocks and stored on multiple machines
• A Hadoop cluster typically has a single NameNode and a number of DataNodes
HDFS BLOCKS
• Files are broken into large blocks, typically 128 MB in size
• Blocks are replicated for reliability: one replica on the local node, another on a node in a remote rack, and a third on a different node in that same remote rack
• Additional replicas are placed randomly
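The default placement policy can be mimicked with a toy function; this is an illustrative sketch only (not Hadoop code), and the rack and node names are hypothetical:

```python
import random

def place_replicas(writer_node, nodes_by_rack, replication=3):
    # Toy model of the default HDFS placement policy (simplified):
    # 1st replica on the writer's node, 2nd and 3rd replicas on two
    # different nodes of a single, randomly chosen remote rack.
    local_rack = next(rack for rack, nodes in nodes_by_rack.items()
                      if writer_node in nodes)
    remote_rack = random.choice(
        [rack for rack in nodes_by_rack if rack != local_rack])
    replicas = [writer_node] + random.sample(nodes_by_rack[remote_rack], 2)
    return replicas[:replication]

# Hypothetical two-rack cluster
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", cluster))
```

With only two racks, a block written from n1 always keeps one replica on n1 and places the other two on the nodes of rack2, so losing a whole rack never loses all copies.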
HDFS BLOCKS (contd.)
ADVANTAGES OF HDFS BLOCKS
• Fixed size
• A chunk of a file smaller than the block size uses only the space it needs
• E.g., a 420 MB file is split into three 128 MB blocks plus one 36 MB block; the final block occupies only 36 MB, not a full 128 MB
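Assuming the 128 MB default block size mentioned above, the split is simple integer arithmetic; a minimal sketch:

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    # Full blocks, followed by one final partial block when the file
    # size is not an exact multiple of the block size
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks

print(split_into_blocks(420))  # [128, 128, 128, 36]
```

The last block stores only the remaining 36 MB, which is the "only needed space is used" advantage described above.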
HDFS Operation Principle
NAME NODE
DATA NODE
SECONDARY NAME NODE
HDFS ARCHITECTURE
HDFS – BLOCK REPLICATION ARCHITECTURE
NAMENODE IN HA MODE
Name Node HA Architecture
BUSINESS SCENARIO
Olivia Tyler is the EVP of IT Operations with Nutri Worldwide, Inc., and she has decided to use HDFS for storing big data. She will use the HDFS shell to store the data in a Hadoop file system and execute various commands on it.
HADOOP SHELL COMMANDS
hadoop fs -mkdir /learning                     (create a directory in HDFS)
hadoop fs -copyFromLocal test.txt /learning    (copy a local file into HDFS)
hadoop fs -ls /learning                        (list the directory contents)
hadoop fs -cat /learning/test.txt              (print the file contents)
HADOOP ECOSYSTEM COMPONENTS
DATA TRANSFER COMPONENTS
DATA STORE COMPONENTS
• Following are the data store components of the Hadoop Ecosystem:
• HBase – a distributed, scalable, big data store
• Cassandra – a scalable, consistent, distributed, structured key-value store
• Accumulo – a sorted, distributed key-value data storage and retrieval system
Serialization Components
• The serialization components are Avro, Trevni, and Thrift.
• Avro is a data serialization system.
• Trevni is a column file format used to permit compatible, independent implementations that read and/or write files in this format.
• Thrift is a framework for scalable, cross-language services development.
JOB EXECUTION COMPONENTS
• Following are the job execution components:
WORK MANAGEMENT COMPONENTS
CONCLUSION