Introduction to Hadoop
TRANSCRIPT
Workshop on Data Analytics Using Big Data Tools (WDABT) 2016 – Bharathiar University
K.Santhiya, Ph.D. Research Scholar; Dr. V.Bhuvaneswari, Assistant Professor, Dept. of Computer Applications, Bharathiar University – WDABT 2016
INTRODUCTION TO HADOOP
Presented by K.Santhiya, Ph.D. Research Scholar, Department of Computer Applications, Bharathiar University
Under the guidance of Dr. V.Bhuvaneswari, Assistant Professor, Department of Computer Applications, Bharathiar University
AGENDA
• WORLD OF DATA – A Few Instances
• CONVENTIONAL APPROACHES – Limitations
• HADOOP FRAMEWORK – Terminology Review
• HADOOP COMPONENTS – HDFS & MAPREDUCE
• HDFS – IN DETAIL
• HADOOP ECOSYSTEM
DATA EXPLOSION
2.5 quintillion bytes of data are created every day.
WORLDWIDE DATA
• Data created since the beginning of time
• Data created in the last two years
[Infographic: per-day data-generation figures – 2.9 million, 375 MB, 20 hrs, 24 PB, 50 million, 700 billion, 1.3 exabytes, 72 items; the source labels for each figure are not recoverable from the transcript]
THE WORLD OF DATA
In this context, a dataset is typically considered "Big Data" once it reaches the terabyte scale –
1 Terabyte or more
CONVENTIONAL APPROACHES
• RDBMS – SQL queries
• OS file system – custom frameworks (C / C++, Perl, Python)
ISSUES IN LEGACY SYSTEMS
• Limited storage capacity
• Limited processing capacity
• No scalability
• Single point of failure
• Sequential processing
• RDBMSs can handle only structured data
• Data requires preprocessing
• Information is collected according to current business needs only
How do we mine (and mind) all this data?
HOW DO WE RESOLVE ALL THESE ISSUES?
Mr. HADOOP says he has a solution to our BIG problem!
Companies Using Hadoop
WHAT IS HADOOP?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Concept: moving computation is more efficient than moving large data.
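The "simple programming model" referred to here is MapReduce. As a rough illustration in plain Python (this is not the Hadoop API, just a sketch of the idea), a word count can be phrased as a map step that emits (word, 1) pairs and a reduce step that sums them per word:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts emitted for each distinct word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big tools", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'tools': 1}
```

In real Hadoop, the framework runs many map tasks in parallel on the nodes that already hold the data blocks, which is exactly the "move computation, not data" concept above.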
STORAGE COMPUTATION COMPLEXITY
TWO DAEMONS OF HADOOP
ARCHITECTURE
TERMINOLOGY REVIEW
• A Rack contains multiple Nodes (Node 1, Node 2, …, Node n)
• A Cluster contains multiple Racks (Rack 1, Rack 2, …)
HADOOP CLUSTER ARCHITECTURE
HADOOP CORE SERVICES
i. Name Node
ii. Data Node
iii. Resource Manager
iv. Application Master
v. Node Manager
vi. Secondary Name Node
HDFS – REAL LIFE CONNECT
• A college library was gifted a massive collection of books by a patron. The books were very popular titles. The librarian decided to arrange the books in a small rack, and distribute multiple copies of each book in other racks, so that students can find the books easily. Similarly, HDFS creates multiple copies of a data block, and keeps them in separate systems for easy access.
WHAT IS HDFS?
• Hadoop Distributed File System
• A highly fault-tolerant, distributed, reliable, scalable file system for data storage
• Stores multiple copies of data on different nodes
• A file is split into blocks and stored on multiple machines
• A Hadoop cluster typically has a single NameNode and a number of DataNodes
HDFS BLOCKS
• Files are broken into large blocks, typically 128 MB in size
• Blocks are replicated for reliability: one replica on the local node, another on a node in a remote rack, and a third on a different node in that same remote rack
• Additional replicas are placed randomly
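The default placement policy can be mimicked with a toy function; this is an illustrative sketch only (not Hadoop code), and the rack and node names are hypothetical:

```python
import random

def place_replicas(writer_node, nodes_by_rack, replication=3):
    # Toy model of the default HDFS placement policy (simplified):
    # 1st replica on the writer's node, 2nd and 3rd replicas on two
    # different nodes of a single, randomly chosen remote rack.
    local_rack = next(rack for rack, nodes in nodes_by_rack.items()
                      if writer_node in nodes)
    remote_rack = random.choice(
        [rack for rack in nodes_by_rack if rack != local_rack])
    replicas = [writer_node] + random.sample(nodes_by_rack[remote_rack], 2)
    return replicas[:replication]

# Hypothetical two-rack cluster
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", cluster))
```

With only two racks, a block written from n1 always keeps one replica on n1 and places the other two on the nodes of rack2, so losing a whole rack never loses all copies.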
HDFS BLOCKS (contd.)
ADVANTAGES OF HDFS BLOCKS
• Fixed size
• A chunk of a file smaller than the block size uses only the space it needs
• E.g., a 420 MB file is split into three 128 MB blocks plus one 36 MB block; the final block occupies only 36 MB, not a full 128 MB
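Assuming the 128 MB default block size mentioned above, the split is simple integer arithmetic; a minimal sketch:

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    # Full blocks, followed by one final partial block when the file
    # size is not an exact multiple of the block size
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks

print(split_into_blocks(420))  # [128, 128, 128, 36]
```

The last block stores only the remaining 36 MB, which is the "only needed space is used" advantage described above.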
HDFS Operation Principle
NAME NODE
DATA NODE
SECONDARY NAME NODE
HDFS ARCHITECTURE
HDFS – BLOCK REPLICATION ARCHITECTURE
NAMENODE IN HA MODE
Name Node HA Architecture
BUSINESS SCENARIO
Olivia Tyler is the EVP of IT Operations with Nutri Worldwide, Inc., and she has decided to use HDFS for storing big data. She will use the HDFS shell to store the data in a Hadoop file system and execute various commands on it.
HADOOP SHELL COMMANDS
hadoop fs -mkdir /learning                     (create a directory in HDFS)
hadoop fs -copyFromLocal test.txt /learning    (copy a local file into HDFS)
hadoop fs -ls /learning                        (list the directory contents)
hadoop fs -cat /learning/test.txt              (print the file contents)
HADOOP ECOSYSTEM COMPONENTS
DATA TRANSFER COMPONENTS
DATA STORE COMPONENTS
• Following are the data store components of the Hadoop Ecosystem:
• HBase – a distributed, scalable, big data store
• Cassandra – a scalable, consistent, distributed, structured key-value store
• Accumulo – a sorted, distributed key-value data storage and retrieval system
Serialization Components
• The serialization components are Avro, Trevni, and Thrift.
• Avro is a data serialization system.
• Trevni is a column file format used to permit compatible, independent implementations that read and/or write files in this format.
• Thrift is a framework for scalable, cross-language services development.
JOB EXECUTION COMPONENTS
• Following are the job execution components:
WORK MANAGEMENT COMPONENTS
CONCLUSION