

HADOOP AND ITS ECOSYSTEM

Big Data with Cloudera, Talend, Tableau | Santanu B

QUICKITDOTNET, KHARADHI, PUNE

10/10/2018


HADOOP with Cloudera, Talend (ETL tool), Tableau (Data Visualization), Informatica, Teradata

About the trainer: Santanu has 20 years of overall experience in the industry. He has worked with HCL Tech, Amdocs, and Cognizant, and was with Cognizant until Apr '17 as a Senior Manager (Emp Id: 274292) in the Data Warehouse team. He has worked in the DWBI domain with the Informatica, Teradata, Hadoop, Talend, and Tableau tools, and has long experience in the industry with big clients like AT&T, Novertise, AstraZeneca, CS, etc. He was attached to the recruitment teams at Amdocs and Cognizant for a long time, which will definitely help the students with interview preparation!

(Santanu Bhattacharjee: 9823569371)


Java (basic understanding before starting Big Data)

Basic concepts, classes, objects, methods, loops, decisions, arrays, variables, OOPS concepts (abstraction, polymorphism, encapsulation, etc.), command-line input, exception handling, etc.

SQL (basic understanding before starting Big Data)

Introduction to RDBMS concepts and architecture, introduction to SQL, DDL, DML, the SELECT statement (WHERE clause, ORDER BY / DISTINCT clause), SQL functions (scalar and aggregate), GROUP BY / HAVING clause, self join, inner join, outer join, LEFT JOIN, RIGHT JOIN, FULL JOIN, UNION, subqueries, views, indexes.

Linux (basic understanding before starting Big Data)

Overview of the Linux OS and its importance, various commands, the vim editor, shell scripts (arithmetic operators, file test operators, command-line parameters, conditions, loops, executing scripts).

Introduction to Big Data

Overview, history, and today's challenges which can be handled by Big Data.
Goals of the HDFS system.
The Hadoop ecosystem, its different components, and their usages.
Architecture of the Hadoop system.
MapReduce and how it works.
Discussion on the different installation modes (Standalone, Pseudo, Fully Distributed).
Installation of Hadoop (on Ubuntu 14) in Pseudo-Distributed mode.
Discussion on the important daemons running in the background of the Hadoop system.
Discussion on the different important HDFS commands (see the sketch after this list).
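The HDFS shell commands discussed in this module (mkdir, put, ls, etc.) can also be driven from Java through Hadoop's FileSystem API. The short sketch below is only illustrative; the NameNode address and the /user/training paths are assumptions for a pseudo-distributed setup, not values from the course material.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBasics {
        public static void main(String[] args) throws Exception {
            // Assumed NameNode address for a pseudo-distributed cluster.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            FileSystem fs = FileSystem.get(conf);

            // Equivalent of: hdfs dfs -mkdir -p /user/training/input
            Path input = new Path("/user/training/input");
            fs.mkdirs(input);

            // Equivalent of: hdfs dfs -put census.csv /user/training/input
            fs.copyFromLocalFile(new Path("census.csv"), input);

            // Equivalent of: hdfs dfs -ls /user/training/input
            for (FileStatus status : fs.listStatus(input)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
            fs.close();
        }
    }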


Hadoop Eco-system

What is Hadoop?
Hadoop's Key Characteristics
Hadoop Eco-system & Core Components
Where Does Hadoop Fit?
Traditional vs. Hadoop Data Analytics Architecture
When to Use & When Not to Use Hadoop
Apache Hadoop & Its Distributions
Hadoop Job Trends

HDFS Architecture

Introduction to the Hadoop Distributed File System
HDFS Architecture and Features
Files and Data Blocks
Anatomy of a File Read/Write on HDFS
Replication & Rack Awareness

Hadoop Setup

Hadoop Deployment Modes
Setting up a Pseudo-Distributed Cluster
Cloudera Sandbox Installation & Configuration
Linux Terminal Commands
Configuration Parameters and Values

MapReduce Basics

What is MapReduce?
MapReduce Framework, Architecture and Use Cases
Input Splits
Hands-on with MapReduce Programming (see the sketch after this list)
Packaging MapReduce Jobs in a JAR
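As a concrete starting point for the hands-on MapReduce programming, here is a minimal word-count job in Java, the classic first MapReduce example; the class names and input/output paths are illustrative only and are not taken from the course material.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in its input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sums the counts emitted for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);      // needed when the job is packaged in a JAR
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a JAR, it would typically be launched on the cluster with hadoop jar wordcount.jar WordCount <input dir> <output dir>, which ties the "Packaging MapReduce Jobs in a JAR" topic to the Input Splits discussion above.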

Using Pig

Background
Pig Architecture
Understanding and Installation of 'Pig'
Pig Latin Basics
Pig Execution Modes
Pig Processing – Loading and Transforming Data
Pig Built-in Functions
Filtering, Grouping, Sorting Data


Relational Join Operators
Pig User Defined Functions
Sample exercise on 'Pig' with the data visualization tool 'Zeppelin'.
Create Talend jobs to execute Pig tasks.
Web Log Report Analytics by 'Pig'.

Using Hive

Background of Hive
Hive Architecture
Warehouse Directory & Metastore
Data Processing – Loading Data into Tables
Using Hive Built-in Functions, UDFs
Using Joins in Hive
Partitioning Data using Hive – Static & Dynamic
Bucketing in Hive
ETL by Talend and visualization by Tableau (see the JDBC sketch after this list)
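The Hive tables built in this module can also be queried programmatically over HiveServer2 using JDBC, which is the same interface Talend and Tableau connect through later. The sketch below is a minimal illustration; the host name quickstart.cloudera, the user, and the table store_sales are hypothetical, and the Hive JDBC driver JAR must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; host, port and database are assumptions for a Cloudera VM.
            String url = "jdbc:hive2://quickstart.cloudera:10000/default";
            try (Connection con = DriverManager.getConnection(url, "cloudera", "");
                 Statement stmt = con.createStatement()) {
                // Aggregate over a hypothetical table loaded earlier by a Talend job.
                ResultSet rs = stmt.executeQuery(
                        "SELECT store_id, SUM(amount) AS total_sales "
                      + "FROM store_sales GROUP BY store_id "
                      + "ORDER BY total_sales DESC LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString("store_id") + "\t" + rs.getDouble("total_sales"));
                }
            }
        }
    }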

Application: Store Data Analytics

Case Study: Store data analytics and reporting using Hive & Zeppelin (data visualization)

Working with HBase

HBase Overview
HBase Data Model
Row-Oriented vs. Column-Oriented Storage
HBase Architecture
HBase Shell Commands
Bulk Loading Data into HBase
Loading data into HBase by Talend (see the client-API sketch after this list)
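Besides the HBase shell and Talend-based loading, individual rows can be written and read with the HBase Java client API. In this minimal sketch the table name, column family, row key and ZooKeeper quorum are hypothetical; the table would need to be created first (for example from the HBase shell).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "localhost");   // assumed ZooKeeper quorum

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("census"))) {

                // Write one row: row key = state code, column family 'stats'.
                Put put = new Put(Bytes.toBytes("MH"));
                put.addColumn(Bytes.toBytes("stats"), Bytes.toBytes("population"), Bytes.toBytes("112374333"));
                table.put(put);

                // Read the same cell back.
                Result result = table.get(new Get(Bytes.toBytes("MH")));
                byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("population"));
                System.out.println("population = " + Bytes.toString(value));
            }
        }
    }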

Impala

Overview and Environment
Impala Architecture
Database creation and deletion
Table- and view-specific statements like CREATE, INSERT, DESCRIBE, ALTER, DROP, etc.
Impala clauses – ORDER BY, GROUP BY, HAVING, LIMIT, OFFSET, UNION, DISTINCT
ETL by Talend and visualization by Tableau


Sqoop & Flume

Why Flume?
Setup of MySQL RDBMS & Sqoop
Sqoop Connectors, Commands
Sqoop Options File
Importing Data – to HDFS & Hive
Exporting Data to MySQL
Data Ingestion using Flume
Flume Architecture
Ingesting Weblog Data into HDFS using Flume

Flume, ZooKeeper & Kafka

Overview of ZooKeeper and how it helps a cluster with coordination activity.
Installation of ZooKeeper in the Ubuntu VM.
Discussion on Kafka with real-life scenarios.
Discussion on Point-to-Point and Publish-Subscribe messaging systems.
Discussion on Kafka's architecture: Producer, Consumer, Topic category, Broker.
Installation of Kafka in the Ubuntu VM.
Discussion on the architecture of Flume.
The Flume Agent and its components.
Various types of Sources, Channels and Sinks supported by Flume.
Discussion on how to configure a Flume Agent in the configuration file.
Run the Consumer (subscriber) and publish a Topic with the Producer (see the producer sketch after this list)!
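For the last step above, a minimal Kafka producer in Java looks like the sketch below; the broker address and the topic name weblog-events are assumptions, not values from the course. The subscriber side can be a console consumer or a matching KafkaConsumer program listening on the same topic.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleKafkaProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker on the Ubuntu VM
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish a few messages; any consumer subscribed to the topic will receive them.
                for (int i = 0; i < 5; i++) {
                    producer.send(new ProducerRecord<>("weblog-events", "key-" + i, "page view #" + i));
                }
                producer.flush();
            }
        }
    }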


MongoDB

Concept of the NoSQL database 'MongoDB' and where to use it.
Installation of 'MongoDB' in the VM.
Mapping 'SQL' to 'MongoDB' queries – sample examples.
Create, Insert, Update and Delete operations.
Complex aggregations; equal, less-than, greater-than operators.
Join, Group By, Having, etc.
Case Study on 'MongoDB' (see the driver sketch after this list).
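To give a feel for the SQL-to-MongoDB mapping from code, here is a minimal sketch with the MongoDB Java (sync) driver; the database, collection and field names are hypothetical, and mongod is assumed to be running locally in the VM on its default port.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.Filters;
    import org.bson.Document;

    public class MongoCrudDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("training");
                MongoCollection<Document> census = db.getCollection("census");

                // SQL: INSERT INTO census (state, population) VALUES ('Maharashtra', 112374333)
                census.insertOne(new Document("state", "Maharashtra").append("population", 112374333L));

                // SQL: SELECT * FROM census WHERE population > 100000000
                for (Document doc : census.find(Filters.gt("population", 100000000L))) {
                    System.out.println(doc.toJson());
                }

                // SQL: UPDATE census SET population = 120000000 WHERE state = 'Maharashtra'
                census.updateOne(Filters.eq("state", "Maharashtra"),
                                 new Document("$set", new Document("population", 120000000L)));

                // SQL: DELETE FROM census WHERE state = 'Maharashtra'
                census.deleteOne(Filters.eq("state", "Maharashtra"));
            }
        }
    }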

Spark and Scala

Overview of Apache Spark and its importance.
Resilient Distributed Datasets (RDDs)
Components of Spark
Configuration to access tables in MySQL, Hive.
Spark SQL – DataFrames, SQLContext, HiveContext.
Loading data in HDFS, Hive, MySQL.
Create a report in Zeppelin.
Overview of Scala and its importance.
Compiling a Scala program and executing it in Spark.
Scala Data Types, Variables, Access Modifiers, Operators, Logical statements, Loops, Functions, Closures, Collections, Classes & Objects, Exception Handling.
Singleton objects.
(A Spark SQL sketch follows this list.)
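A minimal Spark SQL example is sketched below. It is written in Java for consistency with the other sketches in this outline, although the module itself uses Scala; the Hive table census and the report table census_by_state are hypothetical names, not from the course material.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkHiveReport {
        public static void main(String[] args) {
            // enableHiveSupport lets Spark SQL read the tables created in the Hive module.
            SparkSession spark = SparkSession.builder()
                    .appName("census-report")
                    .enableHiveSupport()
                    .getOrCreate();

            // DataFrame over a hypothetical Hive table loaded earlier by Talend.
            Dataset<Row> report = spark.sql(
                    "SELECT state, SUM(population) AS total_population FROM census GROUP BY state");

            report.show(10);                        // quick look in the console
            report.write().mode("overwrite")
                  .saveAsTable("census_by_state");  // persist the aggregate back to Hive

            spark.stop();
        }
    }

The Scala version taught in the module uses essentially the same calls (spark.sql(...).write.saveAsTable(...)), typically run from spark-shell or a compiled Scala program submitted with spark-submit.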

Using Oozie

Overview, Features and Challenges of Oozie.
Setting up the Database & Oozie Configuration.
Creating Workflows.
Submitting, Monitoring and Managing Oozie Jobs (see the client sketch after this list).
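Oozie workflows are usually submitted with the oozie command-line tool, but they can also be submitted and monitored from Java through the Oozie client API, as in this minimal sketch; the Oozie URL, HDFS paths and ports are assumptions for a local VM.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitDemo {
        public static void main(String[] args) throws Exception {
            // Assumed Oozie server URL on the VM.
            OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

            // These properties mirror the job.properties file used on the command line.
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/training/workflows/census-wf");
            conf.setProperty("nameNode", "hdfs://localhost:9000");
            conf.setProperty("jobTracker", "localhost:8032");

            // Submit and start the workflow, then check its status once.
            String jobId = oozie.run(conf);
            System.out.println("Submitted workflow: " + jobId);

            WorkflowJob job = oozie.getJobInfo(jobId);
            System.out.println("Status: " + job.getStatus());
        }
    }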


ETL tool 'Talend' with Cloudera:

Talend & Cloudera

Configuration of Talend with the Cloudera VM.
How to automate HDFS jobs in Talend.
Insert data into HBase by a Talend job.
Load data from various source systems into Cloudera Hive by Talend.
Create jobs in Talend to check aggregation, filters, and left/right/outer joins in Pig.
ETL job creation, loading into Hive & Impala, and visualization by Tableau.

Reporting by 'Tableau':

Tableau

Case Study: We have a census dataset which is available as a .CSV file. We need to load it into Cloudera's Hive or Impala by creating Talend jobs and analyse it with a Tableau report.

Informatica & Teradata (Additional):

Informatica

Concept of the Warehouse and ETL layers; architecture.
How to configure the Source Analyzer and Target Designer.
Mappings, Workflow Monitor, different Transformations.
Verification points while doing ETL testing.
Common defects arising in ETL projects.

Teradata

Teradata architecture, SQL in Teradata, utilities like SQL Assistant, BTEQ, FastLoad, MultiLoad.

Note: Although the intention of this course is to learn Big Data with Talend & Tableau, basic Informatica & Teradata will additionally be discussed, as these are widely used for data acquisition with Big Data in the industry. A VM (Virtual Machine) will be provided for practice purposes!


Sample Case Study

Census data analysis:

The following is an example of a job in Talend which creates a table in Impala before loading the census data. It takes the data in .csv format from the host system and loads it into Cloudera's Impala table.

(Load data from Census.csv into Impala by Talend)

(Load data from MySQL to Hive by Talend)


Check in Cloudera's Hue that the table has been created and the data has been loaded into Hive:

(Check the data loaded into the Hive/Impala table)

We need to configure Tableau with Cloudera by selecting the IP of the Cloudera VM and the type of connection, such as HiveServer2 or Impala. Reports designed in Tableau will then point to Cloudera's Hive/Impala table, and we can start analysing the census data:

(Connect Tableau with HiveServer2 of the Cloudera server)

Whenever required, we can use formulas to extract data and create reports in Tableau. Here we are creating the density of health care centres with respect to population, taking states as the dimension.


Density of Health Care Centres:

(Horizontal bar graph in Tableau)

Average literacy rate as a treemap report in Tableau:

(Treemap report)

Web Log Report Analytics by Pig:

Log files are very important for analysis purposes. We can get lots of information from them, such as the metrics listed below, and we will analyse them with a Pig script.


Total bytes consumed by each URL
Most visited URLs
Rank of the IPs based on total visits

(Architecture of Log File analysis)

Two tables are joined in Talend's mapping to create a target table in Hive/Impala:

(Talend Mapper)