Cloudera Sessions - Clinic 1 - Getting Started with Hadoop


TRANSCRIPT

1

From Zero to Hadoop

April 18, 2023

2

Agenda

• Hadoop Ecosystem Overview
• Hadoop Core Technical Overview
  • HDFS
  • MapReduce
• Hadoop in the Enterprise
  • Cluster Planning
  • Cluster Management with Cloudera Manager

3

What Are All These Things?

Hadoop Ecosystem Overview

4

Hadoop Ecosystem

[Diagram: the Hadoop ecosystem, arranged as a pipeline (ingest, store, explore, process, analyze, serve):
• Storage: HDFS (Hadoop DFS), HBase
• Resource management & coordination: YARN, ZooKeeper
• Batch processing: MapReduce, MapReduce2, Hive, Pig, Mahout, DataFu
• Real-time access & compute: HBase, Impala
• Integration/connectors (to BI, ETL, and RDBMS systems): Sqoop, Flume, FUSE-DFS (file), WebHDFS/HttpFS (REST), ODBC/JDBC (SQL), Metastore, Access
• User interface: Hue
• Workflow management: Oozie
• Cloud: Whirr
Management software & technical support subscription options: Cloudera Navigator (audit v1.0, access v1.0, lineage, lifecycle, explore) and Cloudera Manager (core, required, plus RTD, RTQ, and BDR options).]

5

Sqoop

Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver.
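A sketch of what such a transfer might look like on the command line; the connect string, credentials, table names, and paths here are hypothetical:

```
# Import a table from MySQL into HDFS (hypothetical database and table).
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# Export computed results from HDFS back into the database.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username dbuser -P \
  --table order_totals \
  --export-dir /data/order_totals
```

Sqoop runs each transfer as a MapReduce job, so `--num-mappers` controls how many parallel connections hit the database.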

6

Flume NG

[Diagram: multiple clients sending events to multiple Flume agents]

A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
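Flume NG agents are wired together in a properties file as source → channel → sink flows. A sketch of a syslog-to-HDFS flow; the agent name `a1`, port, and HDFS path are illustrative:

```
# One agent: a syslog UDP source feeding an HDFS sink through a memory channel.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = syslogudp
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/syslog/%Y-%m-%d
a1.sinks.k1.channel = c1
```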

7

HBase

• A low-latency, distributed, NoSQL database built on HDFS
• A “columnar database”
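In the HBase shell, a table is created with one or more column families, and cells are addressed by row key and column. A small illustrative session; the table and values are made up:

```
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Alice'
put 'users', 'row1', 'info:email', 'alice@example.com'
get 'users', 'row1'
scan 'users'
```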

8

Hive

• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;

9

Pig

• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';

10

Oozie

A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster.
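An Oozie workflow is defined in XML as a graph of actions with explicit success and failure transitions. A minimal sketch; the workflow name and mapper class are hypothetical:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="etl"/>
  <action name="etl">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>com.example.MyMapper</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```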

11

Zookeeper

• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access semantics:
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes
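Leader election in ZooKeeper is typically built on ephemeral sequential znodes: each client creates one, and the client holding the lowest sequence number is the leader. A toy in-memory mock of just that rule (not a real ZooKeeper client; the class and paths are invented for illustration):

```python
# Toy illustration of ZooKeeper-style leader election using
# sequential nodes (in-memory mock, no real ZooKeeper involved).
import itertools

class MockZk:
    """Hypothetical stand-in for a ZooKeeper namespace."""
    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}  # path -> owner

    def create_sequential(self, prefix, owner):
        # Real ZooKeeper appends a monotonically increasing,
        # zero-padded sequence number to the node name.
        path = f"{prefix}{next(self._seq):010d}"
        self.nodes[path] = owner
        return path

    def leader(self):
        # The client holding the lowest sequence number is the leader.
        return self.nodes[min(self.nodes)]

zk = MockZk()
for client in ["alpha", "beta", "gamma"]:
    zk.create_sequential("/election/n-", client)

print(zk.leader())  # alpha registered first -> lowest sequence -> leader
```

If the leader's node disappears (in real ZooKeeper, because its ephemeral node is deleted on session loss), the next-lowest sequence number takes over.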

12

Mahout

A machine learning library with algorithms for:
• Recommendation based on users' behavior
• Clustering of related documents into groups
• Classification from existing categorized documents
• Frequent item-set mining (shopping cart contents)

13

Hadoop Security

• Authentication is secured by MIT Kerberos v5 and integrated with LDAP
• Provides identity, authentication, and authorization
• Useful for multitenancy or secure environments

14

Only the Good Parts

Hadoop Core Technical Overview

15

Components of HDFS

• NameNode – holds all metadata for HDFS
  • Needs to be a highly reliable machine
  • RAID drives – typically RAID 10
  • Dual power supplies
  • Dual network cards – bonded
  • The more memory the better – typically 36GB to 64GB
• Secondary NameNode – provides checkpointing for the NameNode; the same hardware as the NameNode should be used

16

Components of HDFS – Contd.

• DataNodes – hardware will depend on the specific needs of the cluster
  • No RAID needed; JBOD (just a bunch of disks) is used
  • Typical ratio is:
    • 1 hard drive
    • 2 cores
    • 4GB of RAM
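The 1 drive : 2 cores : 4GB guideline can be sanity-checked against a candidate node spec; a small sketch with illustrative numbers:

```python
# Check a candidate DataNode spec against the rule-of-thumb ratio
# of 1 hard drive : 2 cores : 4GB RAM (illustrative numbers only).

def balanced(drives, cores, ram_gb):
    """True if the spec matches the 1 drive : 2 cores : 4GB guideline."""
    return cores == 2 * drives and ram_gb == 4 * drives

print(balanced(12, 24, 48))  # a 12-drive node with 24 cores, 48GB -> True
print(balanced(12, 16, 48))  # same node under-provisioned on cores -> False
```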

17

HDFS Architecture Overview

[Diagram: NameNode on Host 1, Secondary NameNode on Host 2, DataNodes on Hosts 3, 4, 5, … n]

18

HDFS Block Replication

Block Size = 64MB
Replication Factor = 3

[Diagram: a file divided into blocks 1–5; each block is stored on three of the five DataNodes, so every node holds a different subset of blocks]
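The arithmetic behind the diagram: with a 64MB block size, a file occupies ceil(size / 64MB) blocks, and replication factor 3 means three copies of each block are stored. A quick sketch:

```python
# Block and replica counts for a file under the defaults shown above.
import math

BLOCK_SIZE_MB = 64
REPLICATION = 3

def hdfs_blocks(file_size_mb):
    """Return (number of HDFS blocks, total block replicas stored)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

print(hdfs_blocks(300))  # a 300MB file -> (5, 15): 5 blocks, 15 replicas
```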

19

MapReduce – Map

• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.

[Diagram: a map task emits (key, value) pairs; the shuffle phase groups intermediate values by key; a reduce task produces the final (key, values) output]

20

MapReduce – Reduce

• After the map phase is over, all the intermediate values for a given output key are combined together into a list.
• reduce() combines those intermediate values into one or more final values for that same output key.


21

MapReduce – Shuffle and Sort

22

How It Works In The Real World

Hadoop In the Enterprise

24

Networking

• One of the most important things to consider when setting up a Hadoop cluster
• Typically a top-of-rack switch is used with Hadoop, along with a core switch
• Be careful about oversubscribing the backplane of the switch!
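A sketch of the oversubscription arithmetic with hypothetical numbers: a rack of 40 nodes with 1Gbps NICs behind 2×10Gbps uplinks can offer 40Gbps of ingress against 20Gbps of uplink, a 2:1 oversubscription:

```python
# Oversubscription ratio of a top-of-rack switch: total bandwidth
# the nodes can offer vs. uplink bandwidth toward the core switch.

def oversubscription(nodes, nic_gbps, uplinks, uplink_gbps):
    return (nodes * nic_gbps) / (uplinks * uplink_gbps)

print(oversubscription(40, 1, 2, 10))  # 40Gbps in vs 20Gbps up -> 2.0
```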

25

Hadoop Typical Data Pipeline

[Diagram: original source data flows from data sources into HDFS via Sqoop and Flume; Pig, Hive, and MapReduce process it inside Hadoop, orchestrated by Oozie; result or calculated data is exported via Sqoop to a data warehouse and data marts]

26

Hadoop Use Cases

Industry        Advanced Analytics              Data Processing
Web             Social Network Analysis         Clickstream Sessionization
Media           Content Optimization            Clickstream Sessionization
Telco           Network Analytics               Mediation
Retail          Loyalty & Promotions Analysis   Data Factory
Financial       Fraud Analysis                  Trade Reconciliation
Federal         Entity Analysis                 SIGINT
Bioinformatics  Genome Mapping                  Sequencing Analysis

27

Hadoop in the Enterprise

[Diagram: data sources (logs, files, web data, relational databases) flow into Hadoop; IDEs, BI/analytics and enterprise reporting tools, an enterprise data warehouse, web applications, and management tools connect to it, serving operators, engineers, analysts, business users, and customers]

28

Cloudera Manager
End-to-End Administration for CDH

1. Manage – easily deploy, configure & optimize clusters
2. Monitor – maintain a central view of all activity
3. Diagnose – easily identify and resolve issues
4. Integrate – use Cloudera Manager with existing tools

29

Install a Cluster in 3 Simple Steps

1. Find Nodes – Enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
2. Install Components – Cloudera Manager automatically installs the CDH components on the hosts you specified.
3. Assign Roles – Verify the roles of the nodes within your cluster. Make changes as necessary.

Cloudera Manager Key Features

• View Service Health & Performance
• Monitor & Diagnose Cluster Workloads
• Visualize Health Status With Heatmaps
• Rolling Upgrades

?

35
