hdinsight essentials hadoop on microsoft platform

37
HDInsight Essentials ISBN : 1849695369 / ISBN 13 : 9781849695367 Rajesh Nadipalli 05/01/2014

Upload: nvvrajesh

Post on 05-Dec-2014

184 views

Category:

Data & Analytics


4 download

DESCRIPTION

This book gives a quick introduction to Hadoop-like problems, and gives a primer on the real value of HDInsight. Next, it will show how to set up your HDInsight cluster. Then, it will take you through the four stages: collect, process, analyze, and report. For each of these stages you will see a practical example with the working code.

TRANSCRIPT

Page 1: HdInsight essentials Hadoop on Microsoft Platform

HDInsight  Essentials  ISBN  :  1849695369    /  ISBN  13  :  9781849695367  

Rajesh  Nadipalli  05/01/2014  

Page 2: HdInsight essentials Hadoop on Microsoft Platform

Goals  of  this  Book  • Focus  on  Microso'’s  new  Hadoop  distribu=on  • Serve  as  Quick  Reference  • Provide  an  Overview  of  Hadoop  • Address  both  cloud  and  on-­‐premise  setup  for  HDInsight  • Highlight  HDInsight  differen:ator    • Provide  Prac=cal  &  Real  world  examples  

Page 3: HdInsight essentials Hadoop on Microsoft Platform

Book  Table  of  Contents  • Chapter  1:    HDInsight  in  a  Heartbeat  • Chapter  2:    Deployment  HDInsight  on  premise  • Chapter  3:    HDInsight  Azure  cloud  service  • Chapter  4:    Administer  your  cluster  • Chapter  5:    Ingest  data  to  your  cluster  • Chapter  6:    Transform  data  in  your  cluster  • Chapter  7:    Analyze  &  Report  data  from  cluster  • Chapter  8:    Project  Planning  &                                              Architectural  Considera=ons  

Page 4: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  1  HIGHLIGHTS:    HDINSIGHT  IN  A  HEARTBEAT  

Page 5: HdInsight essentials Hadoop on Microsoft Platform

Big  Data  Problem  Characteristics    

Page 6: HdInsight essentials Hadoop on Microsoft Platform

Hadoop  Overview  

Self Healing Distributed Storage

Fault Tolerant Distributed Computing

+ Abstraction for

Parallel Processing

CORE HADOOP COMPONENTS •  HDFS:  Distributed  Storage  –  replicated,  self-­‐healing  and  scalable    

•  MapReduce:    Parallel  Processing,  process  local  data  for  efficiency    

 

Page 7: HdInsight essentials Hadoop on Microsoft Platform

NameNode

JobTracker TaskTracker  

 TaskTracker  

 TaskTracker  

 MapReduce  Layer  

Distributed    File  System  

Layer   Secondary NameNode

Master  Node   Slaves  Nodes  

DataNode    

DataNode    

DataNode    

Hadoop  Nodes  Layout  

Page 8: HdInsight essentials Hadoop on Microsoft Platform

Data  Sources        

RDBMS    Databases  

Audio,    Images   Log  Files   Sensors,    

RFID  Social    

Media,  Feeds  

 Hadoop  Data  Store  

       

HDFS  

Hbase    (NOSQL  DB)  

 Data  Processing  

     

Mapreduce  

 Data  Access  

     

Hive   Pig   Mahout    Machine  Learning  

Flume,  Sqoop  

Excel  

Business    Data  Feeds  

Zook

eepe

r  (Distrib

uted  Process  M

anag

ement)  

Hcatalog  (M

etad

ata  on

 Pig,  H

ive,  M

apRe

duce  )  

Oozie    Workflow,  Scheduler  

Infrastructure  ,  Ope

ra:o

ns  

(Mon

itorin

g,  Con

figura<

on)  

Hadoop  Eco  System  

Page 9: HdInsight essentials Hadoop on Microsoft Platform

Collect & Import to HDFS

Process (MapReduce)

Analyze (BI Tools) Report & Publish

End  to  End  Solution  on  Hadoop  

Page 10: HdInsight essentials Hadoop on Microsoft Platform

Popular  Hadoop  Distributions  •  Amazon  Elas=c  MapReduce  (cloud,  hbp://aws.amazon.com/elas=cmapreduce/)    

•  Cloudera  (hbp://www.cloudera.com/content/cloudera/en/home.html)    

•  EMC  PivitolHD  (hbp://gopivotal.com/)    

•  Hortonworks  HDP  (hbp://hortonworks.com/)    

•  MapR  (hbp://mapr.com/)    

•  Microsod  HDInsight  (cloud,  hbp://www.windowsazure.com/)  

Page 11: HdInsight essentials Hadoop on Microsoft Platform

HDInsight  Differenciator  •  Enterprise-­‐ready  Hadoop  backed  by  Microsod    

•  Analy:cs  using  Excel  

•  Integra=on  with  Ac=ve  Directory.      

•  Integra=on  with  .NET  and  Javascript    

•  Connectors  to  RDBMS    

•  Scale  using  cloud  offering:    Azure  HDInsight  service  enables  customers  to  scale  quickly  and  has  seamless  interface  between  HDFS  and  Azure  Storage  Vault    

•  JavaScript  Console  

Page 12: HdInsight essentials Hadoop on Microsoft Platform

WordCount  in  HDInsight  

Page 13: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  2  HIGHLIGHTS:    HDINSIGHT  INSTALL  ON  PREMISE  

Page 14: HdInsight essentials Hadoop on Microsoft Platform

Apache  Hadoop        

•  Open  Source  Sodware  •  Community  Development      

Hortonworks  Data  PlaSorm        

•  Enterprise  Hadoop  Plagorm  (HDP)  •  Leaders  in  Hadoop  •  Code  commibers  to  Hadoop  

Microso'  HDInsight        

•  Built  on  top  of  HDP  •  Integra=on  with  ASV,  Excel,  Powerview,  

SQLServer,  Ac=ve  Directory      

HDInsight  Distribution  

Page 15: HdInsight essentials Hadoop on Microsoft Platform

Physical  Install  Options  

NN          SNN            JT  

DN    /  TT  

Single  node  for  development/test      

Mul=  node  for  produc=on      

Page 16: HdInsight essentials Hadoop on Microsoft Platform

Multi  Node  Install  Steps  •  Pre-­‐requisites  •  Networking  Setup  •  Remote  Scrip=ng  •  Firewall  Setup  •  Sodware  Install  (each  node)  •  Hadoop  Configura=on  •  Verifica=on  

Page 17: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  3  HIGHLIGHTS:    HDINSIGHT  AZURE  SERVICE  

Page 18: HdInsight essentials Hadoop on Microsoft Platform

Azure  Cloud  Service  

Create  Storage  

Create  HDInsight  cluster  

Page 19: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  4  HIGHLIGHTS:    ADMINISTER  YOUR  CLUSTER  

Page 20: HdInsight essentials Hadoop on Microsoft Platform

HDInsight  Cluster  Management  

Page 21: HdInsight essentials Hadoop on Microsoft Platform

HDInsight  Dashboard  

Page 22: HdInsight essentials Hadoop on Microsoft Platform

HDInsight  Dashboard  

Page 23: HdInsight essentials Hadoop on Microsoft Platform

NameNode  Status  

Page 24: HdInsight essentials Hadoop on Microsoft Platform

Jobtracker  Status  

Page 25: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  5  HIGHLIGHTS:    INGEST  DATA  INTO  YOUR  CLUSTER  

Page 26: HdInsight essentials Hadoop on Microsoft Platform

Loading  Data  into  your  Cluster  You  have  following  op=ons…    •  Loading  data  using  Hadoop  commands  •  Loading  data  using  Azure  Storage  Vault  •  Loading  data  using  Interac:ve  JavaScript    •  Shipping  data  to  your  Cluster  •  Loading  data  from  RDBMS  via  Sqoop  

Page 27: HdInsight essentials Hadoop on Microsoft Platform

Loading  via  Azure  Storage  Explorer  

Page 28: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  6  HIGHLIGHTS:    TRANSFORM  YOUR  DATA  

Page 29: HdInsight essentials Hadoop on Microsoft Platform

Transforming  Data  You  have  following  op=ons…    •  MapReduce  •  Hive  •  Pig  •  Others  

Page 30: HdInsight essentials Hadoop on Microsoft Platform

Processing  Data  in  Cluster  Map for Jan2012

Map for Feb2012

Map for Apr2013

…  

One Reducer

Page 31: HdInsight essentials Hadoop on Microsoft Platform

HDFS  

Hive  JDBC/OBDC

Metastore

Thrift Server

Command Line Web GUI

Driver (Parser, Planner, Executor)

MapReduce  

Hive  

Page 32: HdInsight essentials Hadoop on Microsoft Platform

Raw  Data  in  HDFS  •  Distributed  

Storage  •  Reliable  

Data  Processing  via  Pig  •  Pipelines  •  Itera=ve  Processing  •  Research  

Data  Warehouse  

HDFS  

Data  Warehouse  via  Hive  •  BI  Tools  •  Analysis  

Hive  or  Pig?  

Page 33: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  7  HIGHLIGHTS:    ANALYZE  &  REPORT  

Page 34: HdInsight essentials Hadoop on Microsoft Platform

Analyze  using  Excel  

Page 35: HdInsight essentials Hadoop on Microsoft Platform

Analyze  using  Excel  

Page 36: HdInsight essentials Hadoop on Microsoft Platform

CHAPTER  8:    PROJECT  PLANNING  &  ARCHITECTURAL  CONSIDERATIONS  

Page 37: HdInsight essentials Hadoop on Microsoft Platform

Execu:ve  &  Stakeholder    

Buy-­‐in  

Discovery  &  Analysis  

Design  

Implementa:on  User  Acceptance  

Produc:on  Opera:ons  

Feedback,  New  Requirements