Azure Offerings for Big Data

In Kee Paek, Cloud Data Solution Architect, Microsoft Korea, October 2016

Agenda

1. Integrated Big Data Platform - Cortana Intelligence Suite

2. Scalable Machine Learning - R & Spark on HDInsight

3. GPUs in Azure for Compute & Visualization

Stay ahead of the curve with Cortana Intelligence Suite

Data: Business apps, Custom apps, Sensors and devices

Intelligence: Cortana Intelligence

Action: People, Automated systems, Apps

Cortana Intelligence combines the services you already know

Transform data into intelligent action

Data → Intelligence → Action

Data Sources
• Apps
• Sensors and devices

Information Management
• Data Factory
• Data Catalog
• Event Hubs

Big Data Stores
• Data Lake Store
• SQL Data Warehouse

Machine Learning and Analytics
• Machine Learning
• Data Lake Analytics
• HDInsight (Hadoop and Spark)
• Stream Analytics

Intelligence
• Cognitive Services
• Bot Framework
• Cortana

Dashboards & Visualizations
• Power BI

Action
• People
• Automated Systems
• Apps (Web, Mobile, Bots)

Information Management

Data Factory: Compose and orchestrate data services at scale

[Diagram: INGEST from DATA SOURCES (SQL and other source icons)]

• Create, schedule, orchestrate, and manage data pipelines

• Visualize data lineage

• Connect to on-premises and cloud data sources

• Monitor data pipeline health

• Automate cloud resource management

• Move relational data for Hadoop processing

• Transform with Hive, Pig, or custom code
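The first bullet (create, schedule, orchestrate pipelines) can be pictured as running activities in dependency order. The following is a minimal sketch of that idea only; it does not use the Data Factory API, and the activity names and the `run_pipeline` helper are hypothetical. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def run_pipeline(activities, dependencies):
    """Run each activity after all of its upstream activities have finished.

    activities:   dict mapping activity name -> zero-argument callable
    dependencies: dict mapping activity name -> set of upstream activity names
    Returns the activity names in the order they were executed.
    """
    graph = {name: dependencies.get(name, set()) for name in activities}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        activities[name]()
    return order


# Hypothetical activities mirroring the bullets: move relational data,
# transform it with Hive, then publish the result.
log = []
activities = {
    "copy_sql_to_hadoop": lambda: log.append("copied relational rows"),
    "transform_with_hive": lambda: log.append("ran Hive transform"),
    "publish_to_warehouse": lambda: log.append("published results"),
}
dependencies = {
    "transform_with_hive": {"copy_sql_to_hadoop"},
    "publish_to_warehouse": {"transform_with_hive"},
}

order = run_pipeline(activities, dependencies)
print(order)
```

A real orchestrator would add scheduling, retries, and monitoring on top of this ordering step.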

Data Catalog: Get more value from your enterprise data assets

• Spend less time looking for data, and more time getting value from it

• Register enterprise data sources, discover data assets and unlock their potential, and capture tribal knowledge to make data understandable

• Bridge the gap between IT and the business, allowing everyone to contribute their insights, tags, and descriptions

• Intuitive search and filtering to understand the data sources and their purpose

• Let your data live where you want; connect using tools you choose

• Integrate into existing tools and processes with open REST APIs

Event Hubs: Ingest events from websites, apps and devices at cloud scale

• Log millions of events per second in near real time

• Connect devices using flexible authorization and throttling

• Use time-based event buffering

• Get a managed service with elastic scale

• Reach a broad set of platforms using native client libraries

• Pluggable adapters for other cloud services
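As an illustration of the "time-based event buffering" bullet, here is a minimal sketch of the general idea: events accumulate in a buffer and are released as one batch once a time window has elapsed. This is not the Event Hubs client API; the `TimeBuffer` class and its parameters are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class TimeBuffer:
    """Collect events and release them as one batch once window_s seconds have passed."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.batch = []
        self.batch_start = None

    def add(self, timestamp, event):
        """Add an event; return the previous batch if its window has closed, else None."""
        flushed = None
        if self.batch_start is not None and timestamp - self.batch_start >= self.window_s:
            flushed = self.batch
            self.batch = []
            self.batch_start = None
        if self.batch_start is None:
            self.batch_start = timestamp
        self.batch.append(event)
        return flushed


buf = TimeBuffer(window_s=5.0)
batches = []
for ts, ev in [(0.0, "a"), (1.0, "b"), (4.9, "c"), (5.0, "d"), (11.0, "e")]:
    out = buf.add(ts, ev)
    if out is not None:
        batches.append(out)

print(batches)    # batches that were flushed
print(buf.batch)  # events still waiting in the open window
```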

[Diagram: Event Hubs in context. Labels: Data sources (Apps, Sensors and devices), Azure API Management, Backend Services, Event Hubs, SQL Database, Machine Learning, HDInsight, Storage, Power BI, Stream Analytics]

Big Data Stores

Data Lake Store: A hyper-scale repository for big data analytics workloads

• A Hadoop Distributed File System for the cloud

• No fixed limits on file size

• No fixed limits on account size

• Unstructured and structured data in their native format

• Massive throughput to increase analytic performance

• High durability, availability, and reliability

• Azure Active Directory access control

[Diagram: ADL Store in context. Source labels: LOB Applications, Social, Devices, Clickstream, Sensors, Video, Web, Relational. Analytics labels: HDInsight, ADL Analytics, Machine Learning, Spark, R]

SQL Data Warehouse: Elastic data warehouse as a service with enterprise-class features

• Petabyte scale with massively parallel processing

• Independent scaling of compute and storage—in seconds

• Transact-SQL queries across relational and non-relational data

• Full enterprise-class SQL Server experience

• Works seamlessly with Power BI, Machine Learning, HDInsight, and Data Factory

[Diagram: Intelligent App. Labels: Power BI, App Service, SQL Database, SQL Data Warehouse, Machine Learning, Hadoop]

Machine Learning and Analytics

Machine Learning: Easily build, deploy, and share predictive analytics solutions

• Simple, scalable, cutting edge. A fully managed cloud service that enables you to easily build, deploy, and share predictive analytics solutions.

• Deploy in minutes. Azure Machine Learning means business. You can deploy your model into production as a web service that can be called from any device, anywhere and that can use any data source.

• Publish, share, monetize. Share your solution with the world in the Gallery or on the Azure Marketplace.

Big data analytics made easy

• Analyze data of any kind and size

• Develop faster, debug and optimize smarter

• Interactively explore patterns in your data

• No learning curve—use U-SQL, Spark, Hive, HBase and Storm

• Managed and supported with an enterprise-grade SLA

• Dynamically scales to match your business priorities

• Enterprise-grade security with Azure Active Directory

• Built on YARN, designed for the cloud

[Diagram: Data Lake Analytics with data sources SQL DW, SQL DB, Storage Blobs, Data Lake Store, SQL DB in a VM]

HDInsight: Comprehensive set of managed Apache big data projects

• Scale to petabytes on demand

• Process unstructured and semi-structured data

• Develop in Java, .NET, and more

• Skip buying and maintaining hardware

• Deploy in Windows or Linux

• Spin up an Apache Hadoop cluster in minutes

• Visualize your Hadoop data in Excel

• Easily integrate on-premises Hadoop clusters

Core Engine
• Batch: MapReduce
• Script: Pig
• SQL: Hive
• NoSQL: HBase
• Streaming: Storm
• In-Memory: Spark

Stream Analytics: Real-time stream processing in the cloud

• Perform real-time analytics for your Internet of Things solutions

• Stream millions of events per second

• Get mission-critical reliability and performance with predictable results

• Create real-time dashboards and alerts over data from devices and applications

• Correlate across multiple streams of data

• Use familiar SQL-based language for rapid development
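Real-time dashboards over device data usually rest on windowed aggregation. The sketch below shows a fixed, non-overlapping (tumbling) window average in plain Python. It illustrates the concept only and is not the Stream Analytics query language; the function name, field layout, and sample readings are all hypothetical.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_s):
    """Average value per (window start, device) over fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, device_id, value)
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts, device, value in events:
        window_start = int(ts // window_s) * window_s
        acc = sums[(window_start, device)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Made-up sensor readings for illustration.
readings = [
    (1, "sensor-a", 20.0),
    (4, "sensor-a", 22.0),
    (7, "sensor-b", 30.0),
    (12, "sensor-a", 26.0),
]
result = tumbling_window_avg(readings, window_s=10)
print(result)
```

Grouping by device within each window is also the starting point for correlating several streams: once the streams share window keys, they can be joined on them.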

[Diagram: inputs (Event Hubs, Blob Storage) → Stream Analytics → outputs (SQL Database, Event Hubs, Power BI, Blob Storage, Table Storage)]

Intelligence

• Cortana
• Bot Framework
• Cognitive Services

Agenda

1. Integrated Big Data Platform - Cortana Intelligence Suite

2. Scalable Machine Learning - R & Spark on HDInsight

3. GPUs in Azure for Compute & Visualization

Cortana Intelligence Suite

Infinite world of scalable machine learning

• logistic regression, linear models, basic statistics, hypothesis testing, k-means, decision trees

• page rank, collaborative filtering, graph processing, SVD, PCA, Bayesian models, …

• deep learning over various types of networks

Use cases of scalable machine learning

product recommendations

intelligent search

routing

robotics

ad placement

predictive maintenance

image, video recognition

sentiment analysis

text comprehension

natural language processing

robotics

bots

augmented reality

predictive maintenance

Retail Financial services Healthcare Manufacturing

loyalty programs

customer acquisition

pricing strategy

supply chain management

customer churn

fraud detection

risk & compliance

cross-sell & upsell

personalization

bill collection

operational efficiency

patient demographics

pay for performance

demand forecasting

pricing strategy

supply chain optimization

predictive maintenance

remote monitoring

Scalable machine learning offerings in HDInsight

What is R

Language Platform
• The most popular statistical programming language
• A data visualization tool
• Open source

Community
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• New and recent graduates use it

Ecosystem
• 8000+ contributed packages
• Rich application & platform integration

R from Microsoft brings

• Peace of mind
• Speed and scalability
• Flexibility

Open Source R

"http://www.ats.ucla.edu/stat/data/binary.csv"

Microsoft R Server

“/data/binary.csv”

R Server Parallelized by Spark

“/data/binary.csv”

R Server on HDInsight

[Diagram: R Server with multiple parallel R processes]

R Server and Spark resource sharing

[Diagram labels: YARN, Livy server, Thrift server, Jupyter notebooks, Default Queue, Thrift Queue, IntelliJ IDEA, BI Tools, Head node, Edge node, R server, DeployR, R Tools for VS, R Studio]

Parallelized and Distributed Algorithms

ETL
• Data import: Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user-defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Predictions/scoring for models
• Residuals for all models

Clustering
• K-Means

Machine Learning
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Custom Parallelization
• rxDataStep
• rxExec
• PEMA-R API

Variable Selection
• Stepwise Regression
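Many of the statistics listed here can be parallelized with the same pattern: summarize each chunk of data independently, merge the small summaries, and finalize once at the end. The sketch below shows that pattern for mean and sample variance. It is written in plain Python rather than with the R Server functions, the helper names are hypothetical, and it assumes "PEMA" refers to parallel external memory algorithms, which the slide does not spell out.

```python
from functools import reduce


def summarize(chunk):
    """Per-chunk sufficient statistics: count, sum, and sum of squares."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Merge two partial summaries; the merge order does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def finalize(summary):
    """Turn the merged statistics into a mean and a sample variance."""
    n, total, total_sq = summary
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Each inner list stands in for a block of rows read from disk or held by one worker.
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, variance = finalize(reduce(combine, map(summarize, chunks)))
print(mean, variance)
```

Because `summarize` only ever sees one chunk, the data never has to fit in memory at once, and the chunks can be processed on different nodes. The sum-of-squares formula is chosen for brevity; a production implementation would likely use a numerically more stable merge.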

Microsoft R Server: scale-out R, Enterprise Class!

Apache Spark

Spark as a Platform

Data Sources

Spark MLlib algorithms

Spark MLlib algorithms in R language

Spark MLlib algorithms in Python language

[Chart: KDnuggets poll (2014) comparing R, SAS, Python, SQL, and Java; axis from 0% to 60%]

R and Python are two dominant languages

R and Spark are better together

Brief history of Deep learning

Deep Learning is about big models and big data

Deep Neural Network training

Scalable Machine Learning offerings in HDInsight

R language Python language Scala/Java

Server

Server

+ R Ecosystem + Python Ecosystem + Spark Ecosystem

+ Spark Ecosystem

Agenda

1. Integrated Big Data Platform - Cortana Intelligence Suite

2. Scalable Machine Learning - R & Spark on HDInsight

3. GPUs in Azure for Compute & Visualization

GPU Virtualization Vision

• Deliver accelerated graphics & compute capabilities in Azure infrastructure

• High end performance

• Not “Swiss-army knife” offering

• Helps achieve true “HPC in the Cloud”

• Close partnership with NVIDIA

Media

• Stream high fidelity video games

• Encoding and transcoding

• Image processing

• Social media sentiment analysis

Rendering

• Visual Effects (VFX)

• Ray-Tracing rendering

• Advertising & Marketing

• CAD Applications in Architecture

• Simulations

GPU Virtualization Technology

• DDA (Discrete Device Assignment)

Entire device is mapped into the VM just as it would be running on bare metal

Allows for full access to capabilities of that device as well as allowing the device’s native driver to be used

• Introduced in Windows Server 2016 as part of Hyper-V

• Pass-through PCIe devices directly to a Guest VM

• Allows for close to bare-metal performance

Compute Virtual Machines

Tesla K80 – “It’s fast…”

[Chart: Tesla K80 vs. CPU on quantum chemistry, molecular dynamics, and physics benchmarks; axis from 0x to 15x]

Rate of Improvement

[Chart: accuracy by year, 2010 to 2015: 72%, 74%, 84%, 88%, 93%, 96%; series label: GPU; axis from 65% to 100%]

[Chart: top-score accuracy, 11-2013 to 1-2016; data labels: 39%, 45%, 55%, 62%, 66%, 72%, 75%, 79%, 83%, 86%, 87.5%; axis from 30% to 100%]

Visualization Virtual Machines

Collaboration with CNTK

• First class citizen

• Scalability – multi-GPU-multi-VM

• Performance

• Internal use-cases across various Microsoft properties and products

• (DSVM) Data Science VM by Azure Machine Learning

• N-Series and CNTK work really well together

CNTK Performance on DDA

Avg. samples per second by resource (the chart also shows a linear trend line):

• CPU: 2670
• 1 GPU: 10560
• 2 GPUs: 18755
• 3 GPUs: 27575
• 4 GPUs: 35750
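The throughput figures from this chart can be turned into speedup and scaling-efficiency numbers with a few lines of arithmetic. The sketch below uses only the five values shown on the slide; the variable names and the definition of efficiency (speedup over one GPU divided by GPU count) are assumptions made for illustration.

```python
# Average samples per second, as read from the chart.
throughput = {
    "CPU": 2670,
    "1 GPU": 10560,
    "2 GPUs": 18755,
    "3 GPUs": 27575,
    "4 GPUs": 35750,
}

one_gpu = throughput["1 GPU"]
scaling = {}
for gpu_count, label in enumerate(["1 GPU", "2 GPUs", "3 GPUs", "4 GPUs"], start=1):
    speedup = throughput[label] / one_gpu  # relative to a single GPU
    efficiency = speedup / gpu_count       # 1.0 would be perfectly linear scaling
    scaling[label] = (speedup, efficiency)
    print(f"{label}: {speedup:.2f}x vs 1 GPU, {efficiency:.0%} efficiency")

cpu_speedup_4gpu = throughput["4 GPUs"] / throughput["CPU"]
print(f"4 GPUs vs CPU: {cpu_speedup_4gpu:.1f}x")
```

By this measure four GPUs deliver roughly 3.4 times the throughput of one GPU, or about 85% of perfectly linear scaling, and roughly 13 times the CPU figure.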

NV = Tesla M60 and supports OpenGL & DirectX

NC = Tesla K80 and supports CUDA & OpenCL

Sign up @ http://gpu.azure.com

Azure Batch “recipes” @ aka.as/tryazurehpc

Teradici trial @ http://teradici.com/4azuregpus

NVIDIA GRID @ http://nvidia.com/grid

© 2016 Microsoft Corporation. All rights reserved.