Hadoop ABI Insight
Dell - Internal Use - Confidential
Hadoop @ ABI
Insight Into The Ecosystem
Tuesday, October 07, 2014
Enterprise Business Intelligence (EBI) Analytics and BI (ABI) | Dell IT
TEAM - POC
• Deepak Gattala
– Hadoop Administrator
– DW Architect, ABI/EBI
• Spike White
– Linux System Administrator
– Kerberos Specialist
• Will O’Brian
– Active Directory and Identity
– Security Analyst
• Note: Special thanks to Bart Crider, Attila Finta, Mike Porreca, Feargal Tobin, and Alisha Worsham for supporting the effort.
Agenda [ Next 1 Hour ]
• Deepak Gattala [15 minutes]
– Get Familiar with Hadoop (Cloudera). [5 minutes]
– HDFS & MapReduce tour. [5 minutes]
– Hadoop Family and Ecosystem. [5 minutes]
• Spike White and Will O’Brian [15 minutes]
– Integration and Security. [5 minutes]
– Kerberos. [5 minutes]
– AD Forest and OU. [5 minutes]
• Deepak Gattala [15 minutes]
– Understanding Hive and Impala. [5 minutes]
– Cloudera Manager. [5 minutes]
– Dell IT/Services Use Cases and Interest. [5 minutes]
• Deepak Gattala, Spike White and Will O’Brian [15 minutes]
– Product Demo. [5 minutes]
– Questions & Answers. [10 minutes]
Deepak Gattala - Architect
Photography
What is Hadoop?
• Hadoop is an open source software framework.
• It is an Apache top-level project, but the underlying ideas come from Google white papers (on the Google File System and MapReduce) written to support indexing rich textual and structural information.
• Architected to run on a large number of machines that don’t share any memory or disks.
• Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it scales out by adding more servers.
• Designed to solve problems with large data while running analytics that are deep and computationally intensive.
Prerequisites
• The Hadoop framework consists mainly of two components:
– HDFS (Hadoop Distributed File System)
– The MapReduce paradigm
• HDFS is a file system written in Java and used for storage. It plays a role similar to ext3 or ext4 on Linux, but spreads the data across many machines.
• MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• MapReduce is the paradigm used to process data on HDFS; the processing is moved to where the data is located.
• Basic Linux commands, e.g. ls, cat, etc.
HDFS
• Every piece of data is split into blocks and distributed across the cluster.
• Blocks are typically 64 MB or 128 MB; the default replication factor is 3.
[Diagram: the client splits the input file into blocks 1 to 5. The Name Node keeps the metadata, and Data Nodes DN1 to DN5 store the blocks, connected over TCP/IP networking. Each block is stored on three Data Nodes.]
Data Node    File Blocks
DataNode1    1, 4, 5
DataNode2    2, 3, 4
DataNode3    1, 2, 5
DataNode4    3, 4, 5
DataNode5    1, 2, 3
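The block arithmetic on this slide can be sketched in a few lines of plain Java. This is a minimal illustration, not HDFS code: the class name, the 600 MB file size, and the round-robin placement rule are all assumptions made for the example. Real HDFS placement is decided by the Name Node and takes rack topology into account, so the layout below will not match the table above.

```java
import java.util.ArrayList;
import java.util.List;

class HdfsBlockSketch {
    // Number of blocks needed for a file: fileSize / blockSize, rounded up.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Hypothetical round-robin placement (NOT the real HDFS policy):
    // replica r of block b goes to Data Node ((b + r) mod dataNodes) + 1.
    // Replicas land on distinct nodes as long as replication <= dataNodes.
    static List<List<Integer>> placeReplicas(int blocks, int dataNodes, int replication) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((b + r) % dataNodes + 1); // 1-based Data Node number
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example: a 600 MB file with 128 MB blocks needs 5 blocks.
        long blocks = blockCount(600 * mb, 128 * mb);
        System.out.println("blocks = " + blocks);
        List<List<Integer>> placement = placeReplicas((int) blocks, 5, 3);
        for (int b = 0; b < placement.size(); b++) {
            System.out.println("block " + (b + 1) + " -> DataNodes " + placement.get(b));
        }
    }
}
```

With a replication factor of 3, the cluster stores three copies of every block, so the 600 MB file in this example occupies roughly 1800 MB of raw disk.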
MapReduce - Example
• MapReduce has 5 different stages.
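The slide's diagram is not reproduced in this transcript, so the following is only a rough single-process sketch of the core flow, using word count as the example. It models three phases (map, shuffle/sort, reduce) in plain Java collections; input splitting and output writing are reduced to an in-memory list and a return value. The class and method names are hypothetical and no Hadoop API is used.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Shuffle/sort phase: group all values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence over the in-memory input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {bear=2, car=3, deer=2, river=2}
        System.out.println(wordCount(List.of("deer bear river", "car car river", "deer car bear")));
    }
}
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network between the map and reduce tasks.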
Hadoop Distributions
• Even though Hadoop is an open source project, several vendors package compatible versions of the components together, add operations tools, and provide greater flexibility.
• Below are the top 3 vendors:
– Cloudera
– Hortonworks
– MapR
• The underlying code remains bare-bones Apache open source; however, some vendors attach commercial products and services to their specific distributions.
Eco-System
Hadoop Configuration
• Daemons of the Hadoop ecosystem:
– NameNode (master): block information.
– Secondary NameNode (master): checkpoint of the NameNode.
– DataNode (slave): where the data resides.
– TaskTracker (slave): workers.
– JobTracker (master): checks and keeps the job status.
• By default, Hadoop replicates each block of data three times for redundancy and failover.
Spike White - System Sr. Engineer
Hardware Configuration
• Three types of node configuration matter in the architecture for optimal performance.
• For a small to medium-size cluster (fewer than 1000 nodes):
– Master Nodes (generally 2 or 3 in a cluster)
– Slave Nodes (can scale from 1 to N nodes)
– Edge Nodes (normally 2, for load balancing)
• Each category of node has a specific configuration with respect to both the hardware and the Hadoop software.
• The Dell reference architecture is available at:
– http://files.cloudera.com/pdf/Dell_Cloudera_Solution_for_Apache_Hadoop_Reference_Architecture.pdf
Solution Center Rack Diagram
• Rack location: RR8 EBI Lab.
• CM 5.1 and CDH 5.1.2.
– 2 Name Nodes (R720s)
– 6 Data Nodes (R720 XDs)
– 2 Edge Nodes (R720s)
– 1G network cards (due for upgrade)
– Force10 S60 switch, 1G (due for upgrade)
• If a system crashes, take it offline and fix it; there is no impact.
• If a hard drive crashes, replace it and recreate the mount.
QUEST VINTELA Authentication
• Vintela Authentication Services (VAS) implements Kerberos and LDAP functionality on UNIX and Linux systems, and integrates fully with Active Directory (AD).
• The benefits of using VAS include the following:
– UNIX and Linux users and computers are managed through the Active Directory Users and Computers Microsoft Management Console (MMC) snap-in.
– Kerberos is the protocol used to secure LDAP traffic.
– Performance is tuned to work effectively with Active Directory.
Kerberos
• The Kerberos protocol is a standard designed to provide strong authentication within a client/server network environment.
• Kerberos network messages are encrypted and decrypted using algorithms that make them very difficult to decode into their original form without the key.
• Kerberos defines a number of terms:
– Principal: all entities within Kerberos, including users, computers, and services, are known as principals. Principal names are unique.
– Realm: every principal is a member of a realm.
– Ticket: the fundamental unit of Kerberos authentication. It is a carefully constructed message containing the authentication information that is passed between computers.
– Key Distribution Center (KDC): made up of three components:
  – a database of principals containing users, computers, and services;
  – an authentication server that issues Ticket Granting Tickets (TGTs);
  – a Ticket Granting Service (TGS) that issues service tickets granting clients access to specific services.
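To make the principal and realm terms concrete, here is a small sketch that splits a principal name of the conventional form primary[/instance]@REALM into its parts. The class name is hypothetical, and the example principal is an assumption built for illustration from the host and domain names that appear elsewhere in this deck; it is not taken from the actual cluster configuration. The parser is deliberately simplified and does not handle escaped '/' or '@' characters.

```java
class KerberosPrincipalSketch {
    final String primary;   // user or service name
    final String instance;  // optional qualifier such as a host name; empty if absent
    final String realm;     // realm the principal belongs to

    private KerberosPrincipalSketch(String primary, String instance, String realm) {
        this.primary = primary;
        this.instance = instance;
        this.realm = realm;
    }

    // Parses "primary[/instance]@REALM" into its three parts.
    static KerberosPrincipalSketch parse(String name) {
        int at = name.lastIndexOf('@');
        if (at <= 0 || at == name.length() - 1) {
            throw new IllegalArgumentException("expected primary[/instance]@REALM: " + name);
        }
        String left = name.substring(0, at);
        int slash = left.indexOf('/');
        String primary = slash < 0 ? left : left.substring(0, slash);
        String instance = slash < 0 ? "" : left.substring(slash + 1);
        if (primary.isEmpty()) {
            throw new IllegalArgumentException("empty primary in: " + name);
        }
        return new KerberosPrincipalSketch(primary, instance, name.substring(at + 1));
    }

    public static void main(String[] args) {
        // Illustrative principal only; the real principal names in the POC may differ.
        KerberosPrincipalSketch p = parse("sentry/ausgtmhadoop07@US-POCLAB.DELLPOC.COM");
        System.out.println(p.primary + " | " + p.instance + " | " + p.realm);
    }
}
```

A user principal usually has no instance part (for example a hypothetical `alice@US-POCLAB.DELLPOC.COM`), while service principals typically carry the host name as the instance.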
Kerberos
Will O’Brian - Active Directory Services
Active Directory
• The US-POCLAB.DELLPOC.COM Active Directory (AD) Domain was utilized for the Cloudera setup.
• A Hadoop Organizational Unit (OU) was manually created under us-poclab.dellpoc.com/Unix/Servers.
• A “parent” Service Account (Servicegtminf) was manually created under the us-poclab.dellpoc.com/Service Accounts OU.
Active Directory
• CM uses the service account “servicegtminf”.
• The Servicegtminf account was given rights to create and delete accounts within the us-poclab.dellpoc.com/Hadoop OU, as well as Full Control rights over any descendant objects (accounts).
• Service accounts are created by CM by changing the user principal name.
• Account “serviceARFSqfwFob” is configured to be utilized for the “sentry” service running on ausgtmhadoop07.
Active Directory
Deepak Gattala - Architect
Word Count Quiz: Which would you choose?
Using MapReduce
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Using Hive
SELECT word, COUNT(*)
FROM file_table
GROUP BY word;
Using Pig
a = load '/user/hue/word_count_text.txt';
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
c = group b by word;
d = foreach c generate COUNT(b), group;
store d into '/user/hue/pig_wordcount';
Hive
• Facebook, which uses Hadoop extensively, was looking for a way to give non-Java programmers access to the data in its Hadoop clusters.
– Data analysts, statisticians, data scientists, etc.
• Hive is a SQL SELECT statement => MapReduce translator.
– Takes Hive queries, turns them into MapReduce jobs, and submits them to the cluster.
– Displays the results back to the user. Note: not all SQL works!
• Hive is much easier to learn than Java-based MapReduce.
– Writing HiveQL queries is much faster than writing the equivalent Java code.
– Many people already know SQL and can rapidly start using Hive to query and manipulate data in the cluster.
Hive - Authorization
• CREATE ROLE role_name;
• DROP ROLE role_name;
• GRANT ROLE role_name [, role_name] TO GROUP <groupName> [,GROUP <groupName>];
• REVOKE ROLE role_name [, role_name] FROM GROUP <groupName> [,GROUP <groupName>];
• GRANT <PRIVILEGE> [, <PRIVILEGE> ] ON <OBJECT> <object_name> TO ROLE <roleName> [,ROLE <roleName>];
• REVOKE <PRIVILEGE> [, <PRIVILEGE> ] ON <OBJECT> <object_name> FROM ROLE <roleName> [,ROLE <roleName>];
• The POC uses these groups:
– gtm_hdp_inf_dev: Hive group used for the POC
– gtm_hdp_inf_adm: Cloudera Manager admin group
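As a worked instance of the statement templates above, the sketch below fills them in for the POC's Hive group. Only the group name gtm_hdp_inf_dev comes from the slide; the role name `analyst_role`, the database name `poc_db`, and the helper class itself are hypothetical. The code only builds statement strings and does not connect to Hive.

```java
class GrantStatementSketch {
    // Accept only simple identifiers so the generated statements stay well formed.
    private static String ident(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("not a simple identifier: " + name);
        }
        return name;
    }

    static String createRole(String role) {
        return "CREATE ROLE " + ident(role) + ";";
    }

    static String grantRoleToGroup(String role, String group) {
        return "GRANT ROLE " + ident(role) + " TO GROUP " + ident(group) + ";";
    }

    static String grantPrivilege(String privilege, String objectType, String objectName, String role) {
        return "GRANT " + ident(privilege) + " ON " + ident(objectType) + " " + ident(objectName)
                + " TO ROLE " + ident(role) + ";";
    }

    public static void main(String[] args) {
        // Hypothetical role and database names; the group name is the POC's Hive group.
        System.out.println(createRole("analyst_role"));
        System.out.println(grantRoleToGroup("analyst_role", "gtm_hdp_inf_dev"));
        System.out.println(grantPrivilege("SELECT", "DATABASE", "poc_db", "analyst_role"));
    }
}
```

The order matters when the statements are run: the role has to exist before it can be granted to a group or given privileges.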
Hive - Authorization
• Security can be applied at each level of the object hierarchy, as granular as shown below:
[Diagram: object hierarchy]
DELL ABI Use Cases
• SAIE (Support Assist Intelligence Engine) [Design & Architecture]
– Teradata Appliance (PROD) & home-grown (Hortonworks 2.1) - DEV/SIT
• DCCMT/NGMT Hadoop Reporting [POC]
– Cloudera CDH 5.1.2 [due for upgrade to CDH 5.2 soon]
• Server log analysis POC on Hadoop [POC]
– Cloudera CDH 5.1.2 [due for upgrade to CDH 5.2 soon]
• Big Data Edition ETL use case [POC]
– Informatica 9.6.1 & Cloudera CDH 5.1.2 [due for upgrade to CDH 5.2 soon]
• MAW (Marketing Analytics Workbench) [Beta Production]
– Teradata Appliance HDP 1.3.2 [due for upgrade to HDP 2.1]
• Rainstor, Archival Strategy [In Production]
– Cloudera CDH 4.2
Cloudera Manager
• Cloudera Manager provides a web interface for cluster management.
Questions???
Cloudera Hadoop Demo