Post on 29-May-2020

Page 1:

DATA

A Microsoft IT pros guide

to the HDInsight and the

swarm of open source

solutions it comes with

Michael Jonsson

Cloud Solutions Architect – Avanade Denmark

P-CSA – Microsoft (Finalist in the Microsoft Partner of the Year awards 2018 in the category “Partner Seller Excellence in Technology, Sales and/or Licensing Award”)

Twitter: Michael_Jonsson

Blog: Azurefabric.com

Page 2:

What to expect from this session

• War stories implementing HDInsight

• Subtle complaints about solutions that are not ready

• Fast pace and bad jokes

• Guidance toward a partially data-driven org

• Some demos, but maybe just screenshots

• Pitfall info ... (and bad PowerPointing)

Page 3:

Feedback please ☺

https://feedback.expertslive.nl

Page 4:

The IT state is evolving

Page 5:

Customer project

Goals

• Information management

• Advanced analytics

• Future platform (data driven, self service)

• Machine Learning (AI)

Page 6:

If the Infrastructure is already in Place!!

Page 7:

The second 80/20 dilemma

Page 8:

What is wrong with the current OLD methods

1: Cleaning to relational data = deleting some of it

2: ETL with defined schemas = hard to change

3: DevOps (everything as code) does not fit the method

Page 9:

The Cloud era!

The modern way!

• For modern big data analytics: consider all data as valuable and therefore keep it unstructured (schemaless), stored in its native format in a low-cost location

• Ingest, analyze and then structure (schema on read, or ELT)

• Schema-free data makes it possible to add additional and new data to the mix

• Infra as Code = no core infra knowledge needed for the data scientist (it's in the code!)

• PaaS services

• Infra as Code

• CI/CD

• DevOps
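
A rough sketch of what "schema on read" can look like in practice, written in Python purely for illustration (it is not taken from the session). The raw records stay untouched in their native format, and a schema is applied only when a question is asked. The field names, sample values and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw events kept exactly as they arrived (native format, no upfront schema).
# The field names and values are made up for illustration.
RAW_LINES = [
    '{"device": "pump-1", "temp": "21.5", "ts": "2018-06-01T10:00:00"}',
    '{"device": "pump-2", "temp": "19.0", "ts": "2018-06-01T10:00:05", "vibration": 0.3}',
    '{"device": "pump-1", "temp": "22.1", "ts": "2018-06-01T10:01:00"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields and cast them.

    `schema` maps field name -> conversion function. Fields missing from a
    record become None, so new fields can show up in the data without
    breaking readers that do not know about them.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# Two different questions, two different schemas, same untouched raw data.
temperature_view = read_with_schema(RAW_LINES, {"device": str, "temp": float})
vibration_view = read_with_schema(RAW_LINES, {"device": str, "vibration": float})
```

The second record carries a `vibration` field that the others lack. Nothing had to be migrated to accept it, which is the "add additional and new data to the mix" point from the slide.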

Page 11:

WHY?

Data Driven business! (Goal)

• Decisions based on Data (facts)! – Netflix

• Democratize data – Available for all – (Microsoft services goal)

• Enable AI to get insights not natural to humans – Analysis, analytics etc.

Page 14:

The Azure BIG Data Landscape

[Diagram; services shown:]

• Azure ML
• ML Server
• Azure Databricks
• Azure Stream Analytics
• Azure HDInsight
• Azure Data Lake Analytics
• Azure SDK
• Azure Data Factory
• Azure Import/Export Service
• Azure CLI
• Azure IoT Hub
• Azure Event Hubs
• Azure ExpressRoute
• Azure Network Security Groups
• Azure Functions
• Visual Studio
• Operations Management Suite
• Azure Search
• Azure Active Directory
• Cognitive Services
• Bot Service
• Azure Data Catalog
• Azure Key Management Service
• Azure Storage Blobs
• Azure Data Lake Store
• Kafka on Azure HDInsight
• Azure SQL Data Warehouse
• Azure SQL DB
• Azure Cosmos DB
• Azure Analysis Services
• Power BI

Page 16:

Demo: HDInsight (to keep it interesting)

Page 17:

Pointers on HDInsight

• When integrating HDInsight with Data Lake through an AAD service principal, how do you control access to data on a per-user level from the cluster? (This is where the Apache Ranger discussion comes in.)

• Domain-joined premium HDInsight clusters are difficult (at least in preview, when documentation was severely lacking)

• Limitations in the process for domain joining HDInsight nodes may be unacceptable for large enterprises (e.g., must put all nodes in the same OU and cannot control their hostnames)

• Parameters in ARM templates are poorly documented and change frequently (don’t assume you can reuse stuff from GitHub)

• Forced tunneling is not supported according to the HDInsight documentation, so how do you get it to work with e.g. ExpressRoute via a customer firewall?

• If multiple HDInsight clusters are deployed in the same VNet, the first 6 characters of the cluster names must be different (this may be a problem depending on the corporate naming convention)
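
The last pointer is easy to check before anything gets deployed. The following is a minimal Python sketch, not an official tool: it flags planned cluster names in the same VNet that share their first 6 characters. The `find_prefix_collisions` helper and the example names are hypothetical, and treating the comparison as case-insensitive is an assumption.

```python
from collections import defaultdict

PREFIX_LENGTH = 6  # per the pointer above: the first 6 characters must differ


def find_prefix_collisions(cluster_names, prefix_length=PREFIX_LENGTH):
    """Group cluster names that share the same leading characters.

    Returns a dict mapping each colliding prefix to the names that use it.
    The comparison is case-insensitive, which is an assumption made here.
    """
    by_prefix = defaultdict(list)
    for name in cluster_names:
        by_prefix[name[:prefix_length].lower()].append(name)
    return {prefix: names for prefix, names in by_prefix.items() if len(names) > 1}


# Hypothetical names following a "<company>-<env>-<workload>" convention.
planned = ["contoso-prod-spark", "contoso-prod-kafka", "sprk01-contoso-prod"]
collisions = find_prefix_collisions(planned)
```

A naming convention that starts with the company or environment name collides immediately, while moving the workload-specific part to the front avoids the problem.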

Page 18:

Start with the Data

• For modern big data analytics: consider all data as valuable and therefore keep it unstructured (schemaless), stored in its native format in a low-cost location

• Ingest, analyze and then structure (schema on read, or ELT)

• Schema-free data makes it possible to add additional and new data to the mix

Page 19:

On-demand analytics job service powering intelligent action across a hyper-scale repository for Big Data analytics workloads

Azure Data Lake

• Start in seconds,

• Develop

• Leverage open source

• Debug & optimize

• Petabyte size files

• Virtualize your analytics

• Provide I/O capacity

• Always encrypted,

Page 20:

Data Lake

Page 21:

Use the built-in capabilities

• Azure Data Lake Analytics: U-SQL

• Ingest, analyze and then structure (schema on read, or ELT)

• Schema-free data makes it possible to add additional and new data to the mix

Page 22:

U-SQL: a language that unifies

• Programming models: Declarative + Imperative

• Data: Structured + Semi-structured + Unstructured

• Workloads: Batch + Interactive + Streaming + Machine Learning

Page 23:

Develop massively parallel programs with simplicity

A simple U-SQL script can scale from Gigabytes to Petabytes without learning complex big data programming techniques.

U-SQL automatically generates a scaled out and optimized execution plan to handle any amount of data.

Execution nodes are rapidly allocated to run the program.

Error handling, network issues, and runtime optimization are handled automatically.

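// Read the tab-separated search log and apply column names and types on read.
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Write the rowset back out as a tab-separated file.
OUTPUT @searchlog
TO @"/Samples/Output/SearchLog_output.tsv"
USING Outputters.Tsv();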
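
For readers without a Data Lake Analytics account, the shape of that script can be tried locally. The sketch below is a single-machine Python analogy, not how U-SQL executes (U-SQL plans and distributes the work for you): it reads tab-separated rows, applies the same column names and types as the EXTRACT clause, and writes the rows back out. The sample rows are made up, since the slides do not show the contents of SearchLog.tsv, and the `extract_tsv` and `output_tsv` helpers are hypothetical names.

```python
import csv
import io
from datetime import datetime

# Column names and types mirror the EXTRACT clause in the U-SQL script.
SCHEMA = [
    ("UserId", int),
    ("Start", datetime.fromisoformat),
    ("Region", str),
    ("Query", str),
    ("Duration", int),
    ("Urls", str),
    ("ClickedUrls", str),
]


def extract_tsv(stream):
    """Read tab-separated rows and cast each column using SCHEMA."""
    rows = []
    for raw in csv.reader(stream, delimiter="\t"):
        rows.append({name: cast(value) for (name, cast), value in zip(SCHEMA, raw)})
    return rows


def output_tsv(rows, stream):
    """Write rows back out as tab-separated text, in SCHEMA column order."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name, _ in SCHEMA])


# Made-up sample rows standing in for the real input file.
sample = io.StringIO(
    "399266\t2012-02-15T11:53:16\ten-us\thow to make nachos\t73\t"
    "www.nachos.com;www.wikipedia.com\tNULL\n"
    "382045\t2012-02-15T11:53:18\ten-gb\tbest ski resorts\t614\t"
    "skiresorts.com;ski-europe.com\tski-europe.com\n"
)
searchlog = extract_tsv(sample)
result = io.StringIO()
output_tsv(searchlog, result)
```

As in the U-SQL version, the types live in the reading code rather than in the file, so the same raw file can be read again later with a different schema.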

Page 24:

DEMO

Azure Data Factory -> Azure Data Lake -> Azure Data Lake Analytics: U-SQL -> Azure Data Lake -> Power BI

Page 25:

What to use when?

Page 28:

[Diagram: big data options ranging from CONTROL to EASE OF USE, with reduced administration toward ease of use]

BIG DATA ANALYTICS

• IaaS Clusters: Azure Marketplace (HDP | CDH | MapR). Any Hadoop technology, any distribution

• Managed Clusters: Azure HDInsight. Workload optimized, managed clusters

• Azure Databricks: Frictionless & Optimized Spark clusters

• Big Data as-a-service: Azure Data Lake Analytics. Data Engineering in a Job-as-a-service model

BIG DATA STORAGE

• Azure Data Lake Store

• Azure Storage

Page 29:

Demo: Databricks (to keep it interesting)

Page 30:

What to use when?

• Less is more

• Start with the data store (Data Lake) and see how far you get with the native capabilities

• U-SQL, Data Factory

• Use VSTS (DevOps, early)

Page 31:

THANK YOU!

https://feedback.expertslive.nl

Twitter: Michael_Jonsson

Blog: Azurefabric.com

Page 32:

Do you want to gain more knowledge about Microsoft technology?

The Future Ready Skills program offers online courseware, online labs, live Q&As and expert sessions, so you can acquire your official Microsoft Certificate in the most efficient way.

For more information: aka.ms/frsblog

FUTURE READY SKILLS