HDInsight Essentials: Hadoop on Microsoft Platform
DESCRIPTION
This book opens with a quick introduction to Hadoop-like problems and a primer on the real value of HDInsight. It then shows how to set up your HDInsight cluster and takes you through the four stages: collect, process, analyze, and report. For each stage, you will see a practical example with working code.
TRANSCRIPT
HDInsight Essentials ISBN : 1849695369 / ISBN 13 : 9781849695367
Rajesh Nadipalli 05/01/2014
Goals of this Book
• Focus on Microsoft's new Hadoop distribution
• Serve as Quick Reference
• Provide an Overview of Hadoop
• Address both cloud and on-premise setup for HDInsight
• Highlight HDInsight differentiator
• Provide Practical & Real-world examples
Book Table of Contents
• Chapter 1: HDInsight in a Heartbeat
• Chapter 2: Deployment HDInsight on premise
• Chapter 3: HDInsight Azure cloud service
• Chapter 4: Administer your cluster
• Chapter 5: Ingest data to your cluster
• Chapter 6: Transform data in your cluster
• Chapter 7: Analyze & Report data from cluster
• Chapter 8: Project Planning & Architectural Considerations
CHAPTER 1 HIGHLIGHTS: HDINSIGHT IN A HEARTBEAT
Big Data Problem Characteristics
Hadoop Overview
Self Healing Distributed Storage
Fault Tolerant Distributed Computing
+ Abstraction for Parallel Processing
CORE HADOOP COMPONENTS
• HDFS: Distributed Storage – replicated, self-healing and scalable
• MapReduce: Parallel Processing, process local data for efficiency
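To make "replicated, self-healing" concrete, here is a toy Python model of block replication. It is only a sketch under stated assumptions: the node names, the round-robin placement, and the `place_blocks` / `heal` helpers are hypothetical and are not HDFS code. In a real cluster the NameNode tracks replicas through DataNode heartbeats and block reports and schedules re-replication itself.

```python
import itertools

REPLICATION = 3  # HDFS's default replication factor, used here as an assumption


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), so consecutive picks are distinct.
    """
    ring = itertools.cycle(nodes)
    return {block: {next(ring) for _ in range(replication)} for block in blocks}


def heal(placement, failed, live_nodes, replication=REPLICATION):
    """Drop replicas held by the failed node, then copy to live nodes
    until every block is back at the target replica count."""
    for holders in placement.values():
        holders.discard(failed)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # no-op if this node already holds a replica
    return placement


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNodes
    placement = place_blocks(["blk_1", "blk_2"], nodes)
    heal(placement, failed="dn2", live_nodes=["dn1", "dn3", "dn4"])
    for block, holders in placement.items():
        print(block, sorted(holders))
```

After the simulated failure of `dn2`, each block is again held by three live nodes, which is the behavior the "self-healing" label refers to.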
Hadoop Nodes Layout (diagram)
• Master Node: NameNode, Secondary NameNode, JobTracker
• Slave Nodes: a DataNode and a TaskTracker on each
• MapReduce Layer: JobTracker, TaskTrackers
• Distributed File System Layer: NameNode, Secondary NameNode, DataNodes
Hadoop Eco System (diagram)
• Data Sources: RDBMS databases; audio, images; log files; sensors, RFID; social media, feeds; business data feeds
• Flume, Sqoop
• Hadoop Data Store: HDFS, HBase (NoSQL DB)
• Data Processing: MapReduce
• Data Access: Hive, Pig, Mahout (machine learning), Excel
• ZooKeeper (distributed process management)
• HCatalog (metadata on Pig, Hive, MapReduce)
• Oozie (workflow, scheduler)
• Infrastructure, Operations (monitoring, configuration)
End to End Solution on Hadoop
• Collect & Import to HDFS
• Process (MapReduce)
• Analyze (BI Tools)
• Report & Publish
Popular Hadoop Distributions
• Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
• Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
• EMC Pivotal HD (http://gopivotal.com/)
• Hortonworks HDP (http://hortonworks.com/)
• MapR (http://mapr.com/)
• Microsoft HDInsight (cloud, http://www.windowsazure.com/)
HDInsight Differentiator
• Enterprise-ready Hadoop backed by Microsoft
• Analytics using Excel
• Integration with Active Directory
• Integration with .NET and JavaScript
• Connectors to RDBMS
• Scale using cloud offering: the Azure HDInsight service lets customers scale quickly and provides a seamless interface between HDFS and Azure Storage Vault
• JavaScript Console
WordCount in HDInsight
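The slide itself shows WordCount running on HDInsight. As a rough illustration of what such a job computes, the sketch below imitates the map, shuffle, and reduce steps locally in plain Python. It is not the code from the book and does not use any HDInsight API; the function names and sample lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data on hadoop", "hadoop on azure"]  # made-up input
    print(word_count(sample))
```

On a cluster the mapper calls would run in parallel on the nodes that hold the input blocks, and the shuffle would move data between nodes; here everything runs in one process.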
CHAPTER 2 HIGHLIGHTS: HDINSIGHT INSTALL ON PREMISE
Apache Hadoop
• Open Source Software
• Community Development
Hortonworks Data Platform
• Enterprise Hadoop Platform (HDP)
• Leaders in Hadoop
• Code committers to Hadoop
Microsoft HDInsight
• Built on top of HDP
• Integration with ASV, Excel, Power View, SQL Server, Active Directory
HDInsight Distribution
Physical Install Options
NN SNN JT
DN / TT
Single node for development/test
Multi node for production
Multi Node Install Steps
• Prerequisites
• Networking Setup
• Remote Scripting
• Firewall Setup
• Software Install (each node)
• Hadoop Configuration
• Verification
CHAPTER 3 HIGHLIGHTS: HDINSIGHT AZURE SERVICE
Azure Cloud Service
Create Storage
Create HDInsight cluster
CHAPTER 4 HIGHLIGHTS: ADMINISTER YOUR CLUSTER
HDInsight Cluster Management
HDInsight Dashboard
NameNode Status
JobTracker Status
CHAPTER 5 HIGHLIGHTS: INGEST DATA INTO YOUR CLUSTER
Loading Data into your Cluster
You have the following options…
• Loading data using Hadoop commands
• Loading data using Azure Storage Vault
• Loading data using Interactive JavaScript
• Shipping data to your Cluster
• Loading data from RDBMS via Sqoop
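For the first option, loading with Hadoop commands, the sketch below builds the `hadoop fs -put` and `hadoop fs -ls` command lines in Python without running them. The file name and HDFS directory are hypothetical, and the helper functions are invented for this example; on a machine with the Hadoop client installed, each list could be passed to `subprocess.run`.

```python
def hdfs_put_command(local_path, hdfs_dir):
    """Build the argv that copies one local file into an HDFS directory."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def hdfs_ls_command(hdfs_dir):
    """Build the argv that lists an HDFS directory, e.g. to verify a load."""
    return ["hadoop", "fs", "-ls", hdfs_dir]


if __name__ == "__main__":
    # Hypothetical local file and target directory, for illustration only.
    commands = [
        hdfs_put_command("sales_jan2012.csv", "/data/sales"),
        hdfs_ls_command("/data/sales"),
    ]
    for argv in commands:
        print(" ".join(argv))
```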
Loading via Azure Storage Explorer
CHAPTER 6 HIGHLIGHTS: TRANSFORM YOUR DATA
Transforming Data
You have the following options…
• MapReduce
• Hive
• Pig
• Others
Processing Data in Cluster (diagram)
• One map task per monthly input: Map for Jan2012, Map for Feb2012, …, Map for Apr2013
• One Reducer
• HDFS
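The layout in this diagram, one map task per monthly input feeding a single reducer, can be sketched locally as follows. The record format (a key and an amount), the sample data, and the function names are assumptions made for illustration; the slide does not say what the job computes. Summing inside the map task is a simplification in the spirit of a combiner.

```python
from collections import Counter


def map_month(records):
    """Map task for one month's input: emit (key, partial_total) pairs."""
    totals = Counter()
    for key, amount in records:
        totals[key] += amount
    return list(totals.items())


def single_reducer(map_outputs):
    """The one reducer: merge the partial totals from every map task."""
    merged = Counter()
    for output in map_outputs:
        for key, partial in output:
            merged[key] += partial
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical monthly inputs keyed by the labels used on the slide.
    monthly_inputs = {
        "Jan2012": [("east", 10), ("west", 5)],
        "Feb2012": [("east", 7)],
    }
    map_outputs = [map_month(records) for records in monthly_inputs.values()]
    print(single_reducer(map_outputs))
```

With a single reducer, all partial results meet in one place, which yields one output file at the cost of making that reducer the bottleneck for large key spaces.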
Hive (diagram)
• Interfaces: Command Line, Web GUI, JDBC/ODBC
• Thrift Server
• Driver (Parser, Planner, Executor)
• Metastore
• MapReduce
Raw Data in HDFS
• Distributed Storage
• Reliable
Data Processing via Pig
• Pipelines
• Iterative Processing
• Research
Data Warehouse via Hive
• BI Tools
• Analysis
Hive or Pig?
CHAPTER 7 HIGHLIGHTS: ANALYZE & REPORT
Analyze using Excel
CHAPTER 8: PROJECT PLANNING & ARCHITECTURAL CONSIDERATIONS
• Executive & Stakeholder Buy-in
• Discovery & Analysis
• Design
• Implementation
• User Acceptance
• Production Operations
• Feedback, New Requirements