Posted on 13-Apr-2017

A Practical Guidance to the Enterprise Machine Learning Platform Ecosystem

About Us

• Helping great companies become great software companies

• Building software solutions powered by disruptive enterprise software trends

-Machine learning and data science

-Cyber-security

-Enterprise IOT

-Powered by Cloud and Mobile

• Bringing innovation from startups and academic institutions to the enterprise

• Award winning agencies: Inc 500, American Business Awards, International Business Awards

About This Webinar

• Research that brings together big enterprise software trends, exciting startups and academic research

• Best practices based on real-world implementation experience

• No sales pitches

Agenda

• Cloud vs. On-Premise machine learning

• Cloud machine learning platforms

-Azure Machine Learning

-AWS Machine Learning

-Databricks

-Watson Developer Cloud

-Others…

• On-premise machine learning platforms

-Revolution Analytics

-Dato

-Spark MLlib

-TensorFlow

-Others…

Enterprise Data Science

“data science”

Modern Machine Learning

• Advances in storage, compute and data science research are making machine learning part of mainstream technology platforms

• Big data movement

• Machine learning platforms are optimized with developer-friendly interfaces

• Platform as a service providers have drastically lowered the entry point for machine learning applications

• R and Python are leading the charge

Cloud vs. On-Premise Machine Learning Platforms

Cloud Machine Learning Platforms: Benefits

• Service abstraction layer over the machine learning infrastructure

• Rich visual modeling tools

• Rich monitoring and tracking interfaces

• Combine multiple platforms: R, Python, etc

• Enable programmatic access to ML models

Cloud Machine Learning Platforms: Challenges

• Integration with on-premise data stores

• Extensibility

• Security and privacy

On-Premise Machine Learning Platforms: Benefits

• Control

• Security

• Integration with on-premise data stores

• Integrated with R and Python machine learning frameworks

On-Premise Machine Learning Platforms: Challenges

• Code-based modeling interfaces

• Scalability

• Tightly coupled with Hadoop distributions

• Monitoring and management

• Data quality and curation

Cloud Machine Learning Platforms

• Azure Machine Learning

• AWS Machine Learning

• Databricks

• Watson Developer Cloud

The Leaders

Azure Machine Learning

• Native machine learning capabilities as part of the Azure cloud

• Elastic infrastructure that scales based on the model requirements

• Supports more than 30 supervised and unsupervised machine learning algorithms

• Integration with R and Python machine learning libraries

• Expose machine learning models via programmable interfaces

• Integrated with the Cortana Analytics suite

• Integrated with PowerBI

• Supports both supervised and unsupervised models

• Integrated with Azure HDInsight

• Large library of models and sample gallery

• Support for R and Python code

Visual Model Creation

• Visual dashboard to track the execution of ML models

• Track execution of different steps within a ML model

• Integrated monitoring experience with other Azure services

Rich Monitoring and Management Interface

• Expose machine learning models as Web Services APIs

• Integrate ML Models with Azure API Gateway

• Retrain and extend models via ML APIs

Programmatic Access to ML Models
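
As a rough illustration of what calling a model exposed as a web service can look like, the sketch below builds (but does not send) an HTTP scoring request with the Python standard library. The endpoint URL, the API key, the payload layout and the `build_scoring_request` helper are all hypothetical placeholders, not the actual Azure ML contract; the real request format is defined by the published service itself.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and key of a published service.
SCORING_URL = "https://example.invalid/ml/score"
API_KEY = "replace-with-your-key"


def build_scoring_request(rows, column_names):
    """Build a POST request carrying feature rows as JSON (assumed layout)."""
    payload = {"columns": column_names, "rows": rows}
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + API_KEY,
    }
    return urllib.request.Request(SCORING_URL, data=body, headers=headers, method="POST")


request = build_scoring_request([[34, "engineer"]], ["age", "occupation"])
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes the integration easy to unit test before a real endpoint exists.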

AWS Machine Learning

• Native machine learning service in AWS

• Provide data exploration and visualization tools

• Supports supervised and unsupervised algorithms

• Integrated data transformation models

• APIs for dynamically creating machine learning models

• Programmatic creation of machine learning models

• Large number of algorithms and recipes

• Data transformation models included in the language

Sophisticated ML Model Authoring

• Sophisticated monitoring for evaluating ML models

• Integrated with Amazon CloudWatch

• KPIs that evaluate the efficiency of ML models

Monitoring ML Model Execution

• Optimized DSL for data transformation

• Recipes that abstract common transformations

• Reuse transformation recipes across ML models

Embedded Data Transformation
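
The idea of defining a transformation recipe once and reusing it across models can be illustrated in plain Python. This is not the AWS recipe DSL; the helper names (`lowercase`, `bin_by_edges`, `apply_recipe`) and the sample records are made up for the example.

```python
def lowercase(value):
    """Normalize a text feature."""
    return value.lower()


def bin_by_edges(edges):
    """Return a transformation mapping a number to the index of its bin."""
    def transform(value):
        return sum(1 for edge in edges if value >= edge)
    return transform


def apply_recipe(recipe, record):
    """Apply each named transformation to the matching field of a record."""
    out = dict(record)
    for field, transform in recipe.items():
        out[field] = transform(record[field])
    return out


# One recipe, defined once...
customer_recipe = {"city": lowercase, "age": bin_by_edges([18, 35, 60])}

# ...and reused to prepare inputs for two different models.
churn_input = apply_recipe(customer_recipe, {"city": "Boston", "age": 42, "plan": "pro"})
upsell_input = apply_recipe(customer_recipe, {"city": "MIAMI", "age": 17, "spend": 120.0})
print(churn_input)
print(upsell_input)
```

Fields the recipe does not mention pass through unchanged, so the same recipe can serve datasets with different extra columns.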

Databricks

Databricks Machine Learning

• Scaling Spark machine learning pipelines

• Integrated data visualization tools

• Sophisticated ML monitoring tools

• Combine Python, Scala and R in a single platform

• Implementing machine learning models using Notebooks

• Publishing notebooks to a centralized catalog

• Leverage Python, Scala or R to implement machine learning models

Notebooks Based Authoring

• Integrate data visualization into machine learning pipelines

• Reuse data visualization notebooks across applications

• Evaluate the efficiency of machine learning pipelines using visualizations

Machine Learning Data Visualization

• Monitor the execution of machine learning pipelines

• Run machine learning pipelines manually

• Rapidly modify and deploy machine learning pipelines

Monitoring and Management

Watson Developer Cloud

• Personality Insights

• Tradeoff Analytics

• Relationship Extraction

• Concept Insights

• Speech to Text

• Text to Speech

• Visual Recognition

• Natural Language Classifier

• Language Identification

• Language Translation

• Question and Answer

• Concept Expansion

• Message Resonance

• AlchemyAPI Services

Large Variety of Cognitive Services

• Access services via REST APIs

• SDKs available for different languages

• Integration with different services in the Bluemix platform

Rich Developer Interfaces

• Relationship Extraction

• Concept Expansion

• Message Resonance

• User Modeling

Complex Algorithms – Simple Interfaces

Other Interesting Platforms

• Microsoft’s Project Oxford https://www.projectoxford.ai/

• BigML https://bigml.com/

On-Premise Machine Learning Platforms

The Leaders

• Revolution Analytics (Microsoft)

• Spark MLlib + SparkR

• Dato

• TensorFlow

• Others: PredictionIO, Scikit-learn…

Revolution Analytics

All of Open Source R plus:

• Big Data scalability

• High-performance analytics

• Development and deployment tools

• Data source connectivity

• Application integration framework

• Multi-platform architecture

• Support, Training and Services

Revolution Analytics (Microsoft)

Platform components:

• R + CRAN

• RevoR

• DistributedR

• ScaleR

• ConnectR

• DeployR

• DevelopR

• In the Cloud: Amazon AWS

• Workstations & Servers: Windows, Red Hat and SUSE Linux

• Clustered Systems: IBM Platform LSF, Microsoft HPC

• EDW: IBM Netezza, Teradata

• Hadoop: Hortonworks, Cloudera

Write Once, Deploy Anywhere

• DeployR does not provide any application UI.

• Three integration modes embed real-time R results into existing interfaces: web app, mobile app, desktop app, BI tool, Excel, …

• RBroker Framework: a simple, high-performance API for Java, .NET and JavaScript apps. Supports transactional, on-demand analytics on a stateless R session.

• Client Libraries: flexible control of R services from Java, .NET and JavaScript apps. Also supports stateful R integrations (e.g. complex GUIs).

• DeployR Web Services API: integrate R using almost any client language.

Integrate R Scripts Into Third Party Applications

Spark MLlib + SparkR

• It is built on Apache Spark, a fast and general engine for large-scale data processing

• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

• Write applications quickly in Java, Scala, or Python.

Spark MLlib

• Integrated with Spark SQL for data queries and transformations

• Integrated with Spark GraphX for graph processing

• Integrated with Spark Streaming for real time data processing

Beyond Machine Learning

• Run R and machine learning models using the same infrastructure

• Leverage R scripts from Spark MLlib models

• Scale R models as part of a Spark cluster

• Execute R models programmatically using Java APIs

Spark MLlib + SparkR

Dato

• Makes Python machine learning enterprise-ready

• GraphLab Create

• Dato Distributed

• Dato Predictive Services

Dato

Principles:

• Get started fast

• Rapidly iterate

• Combine for new apps

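import graphlab as gl

# Load the ratings; the CSV is expected to have 'user', 'movie' and 'rating' columns.
data = gl.SFrame.read_csv('my_data.csv')
# Train a recommender on the rating data.
model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')
# Top 5 recommended items per user.
recommendations = model.recommend(k=5)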

Recommender, Image Search, Sentiment Analysis, Data Matching, Auto Tagging, Churn Predictor, Click Prediction, Product Sentiment, Object Detector, Search Ranking, Summarization, …

Sophisticated ML made easy - Toolkits

TensorFlow

• Powers deep learning capabilities on dozens of Google’s products

• Interfaces for modeling machine and deep learning algorithms

• Platform for executing those algorithms

• Scales from mobile devices to a cluster with thousands of nodes

• Became one of the most popular projects on GitHub in less than a week

Google’s TensorFlow

• Based on the principle of a dataflow graph

• Nodes can perform data operations but also send or receive data

• Python and C++ libraries. NodeJS, Go and others in the pipeline

TensorFlow Programming Model
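
The dataflow-graph principle (describe the computation as a graph first, then execute it with concrete inputs) can be sketched in a few lines of plain Python. This is a toy model, not TensorFlow's API; `Node`, `placeholder` and `run` are illustrative names chosen for this example.

```python
import operator


class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name


def placeholder(name):
    """A node with no operation; its value is supplied when the graph runs."""
    return Node(op=None, name=name)


def run(node, feed, cache=None):
    """Evaluate a node by first evaluating the nodes it depends on."""
    cache = {} if cache is None else cache
    if id(node) in cache:
        return cache[id(node)]
    if node.op is None:
        value = feed[node.name]
    else:
        value = node.op(*(run(n, feed, cache) for n in node.inputs))
    cache[id(node)] = value
    return value


# Build the graph first: y = (a + b) * b. Nothing is computed yet.
a = placeholder("a")
b = placeholder("b")
total = Node(operator.add, [a, b])
y = Node(operator.mul, [total, b])

# Then execute it with concrete inputs.
print(run(y, {"a": 2, "b": 3}))  # (2 + 3) * 3 = 15
```

Because the graph is data rather than immediate computation, an engine is free to decide where and in what order each node runs, which is what makes placement across devices possible.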

# Excerpt: x, y_, y_conv, keep_prob, mnist and sess are assumed to be defined by earlier model-building code.
import tensorflow as tf

cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.initialize_all_variables())
for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        # Every 100 steps, report accuracy on the current batch (keep_prob fixed at 1.0).
        train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g" % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
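
The two quantities computed in the listing, the summed cross-entropy and the argmax-based accuracy, can be checked by hand on a tiny batch. The following is a plain-Python restatement of those formulas, not TensorFlow code; the labels and probabilities are made up for illustration.

```python
import math


def cross_entropy(labels, probs):
    """Sum over examples and classes of -label * log(prob)."""
    return -sum(
        y * math.log(p)
        for y_row, p_row in zip(labels, probs)
        for y, p in zip(y_row, p_row)
    )


def accuracy(labels, probs):
    """Fraction of examples whose most probable class matches the label."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    hits = [argmax(y) == argmax(p) for y, p in zip(labels, probs)]
    return sum(hits) / len(hits)


y_true = [[1.0, 0.0], [0.0, 1.0]]      # one-hot labels
y_pred = [[0.8, 0.2], [0.25, 0.75]]    # model probabilities (made up)

print(cross_entropy(y_true, y_pred))   # equals -(log 0.8 + log 0.75)
print(accuracy(y_true, y_pred))
```

With one-hot labels only the probability assigned to the correct class contributes to the loss, so the loss shrinks as those probabilities approach 1.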

• Scales from a single device to a large cluster of nodes

• TensorFlow uses a heuristic placement algorithm to assign the tasks in a graph to the available nodes

• The execution engine assigns tasks for fault tolerance

• Linear scalability model

TensorFlow Implementation

• TensorFlow includes an engine that enables the visual representation of the execution graph

• Visualizations include summary statistics of the different states of the model

• The visualization engine is included in the current open source release

TensorFlow Graph Visualization

Other Interesting Projects

• H2O.ai

• PredictionIO

• Scikit-Learn

• Microsoft’s DMTK

Machine Learning in the Enterprise

• Enable foundational building blocks

-Data quality

-Data discovery

-Functional and integration testing

• Predictions are tempting but classification and clustering are easier

• Run multiple models at once

• Enable programmatic interfaces to interact with ML models

• Start small, deliver quickly, iterate…

Machine Learning in the Enterprise

• Machine learning is becoming one of the most important elements of modern enterprise solutions

• Innovation in machine learning is happening in both the on-premise and cloud space

• Cloud machine learning innovators include: Azure ML, AWS ML, Databricks and IBM Watson

• On-premise machine learning innovators include: Spark MLlib, Microsoft’s Revolution R, Dato, TensorFlow

• Enterprise machine learning solutions should include elements such as data quality, data governance, etc.

• Start small and use real use cases

Summary

Appendix A: Scikit-Learn

• Extensions to SciPy (Scientific Python) are called SciKits. SciKit-Learn provides machine learning algorithms.

• Algorithms for supervised & unsupervised learning

• Built on SciPy and Numpy

• Standard Python API interface

• Sits on top of C libraries, LAPACK, LibSVM, and Cython

• Open Source: BSD License (packaged in many Linux distributions)

• Probably the best general ML framework out there.

Scikit-Learn

Load & Transform Data

Raw Data Feature Extraction

Build Model

Feature Evaluation

Very Simple Prediction Model

Evaluate Model

Assess how model will generalize to independent data set (e.g. data not in the training set).

1. Divide data into training and test splits

2. Fit model on training, predict on test

3. Determine accuracy, precision and recall

4. Repeat k times with different splits, then average the scores (e.g. F1) across the k runs

              Predicted Class A    Predicted Class B
Actual A      True A               False B              #A
Actual B      False A              True B               #B
              #P(A)                #P(B)                total

Simple Programming Model: Cross-Validation (Classification)
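
The cross-validation steps above can be sketched without any library. This is a minimal pure-Python illustration, not scikit-learn's API: the data is made up, the "model" is a toy one-dimensional threshold classifier, and every function name is chosen for this example only.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        stop = start + fold_size if fold < k - 1 else n_items
        test = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        yield train, test


def fit_threshold(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    mean_a = sum(x for x, y in zip(xs, ys) if y == "A") / ys.count("A")
    mean_b = sum(x for x, y in zip(xs, ys) if y == "B") / ys.count("B")
    return (mean_a + mean_b) / 2.0


def predict(threshold, x):
    """Values below the threshold are class A, the rest class B."""
    return "A" if x < threshold else "B"


def scores(actual, predicted, positive="A"):
    """Precision, recall and F1 for one class, from confusion-matrix counts."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up data: class A clusters around low values, class B around high values.
xs = [1.0, 9.0, 2.0, 8.0, 1.5, 7.5, 6.0, 2.5, 3.0, 9.5, 5.5, 0.5]
ys = ["A", "B", "A", "B", "A", "B", "B", "A", "A", "B", "A", "A"]

f1_per_fold = []
for train, test in k_fold_splits(len(xs), k=3):
    # Fit on the training split only, then predict on the held-out split.
    threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
    predicted = [predict(threshold, xs[i]) for i in test]
    actual = [ys[i] for i in test]
    f1_per_fold.append(scores(actual, predicted)[2])

mean_f1 = sum(f1_per_fold) / len(f1_per_fold)
print(f1_per_fold, mean_f1)
```

Each fold's score comes only from data the model did not see during fitting, which is what makes the averaged figure an estimate of how the model generalizes.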

How to evaluate clusters? Visualization (but only in 2D)

Data Visualization

Appendix B: PredictionIO

• Developer friendly machine learning platform

• Completely open source

• Based on Apache Spark

PredictionIO

• PredictionIO platform: a machine learning stack for building, evaluating and deploying engines with machine learning algorithms.

• Event Server: an open source machine learning analytics layer for unifying events from multiple platforms

• Template Gallery: engine templates for different types of machine learning applications

A Simple Architecture

• Execute models asynchronously via the event interface

• Query data programmatically via REST interface

• Various SDKs provided as part of the platform

Model Execution
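
A hedged sketch of the two interactions listed above: recording an event, then querying a deployed engine over REST. The requests are only built, not sent. The ports, URL paths, access key and JSON field names below are assumptions recalled from the project's documentation and may differ between versions and engine templates; the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed local defaults and a placeholder key; adjust for a real deployment.
EVENT_SERVER = "http://localhost:7070"
ENGINE_SERVER = "http://localhost:8000"
ACCESS_KEY = "replace-with-your-access-key"


def json_post(url, payload):
    """Build (not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rating_event(user_id, item_id, rating):
    """Request that records a user-rates-item event (assumed event schema)."""
    url = EVENT_SERVER + "/events.json?" + urllib.parse.urlencode({"accessKey": ACCESS_KEY})
    payload = {
        "event": "rate",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {"rating": rating},
    }
    return json_post(url, payload)


def recommendation_query(user_id, count):
    """Request that asks a deployed engine for recommendations (assumed query schema)."""
    return json_post(ENGINE_SERVER + "/queries.json", {"user": user_id, "num": count})


event_request = rating_event("u1", "i42", 4)
query_request = recommendation_query("u1", 5)
print(event_request.full_url)
print(query_request.data.decode("utf-8"))
```

Events and queries go to separate endpoints, so event collection can continue independently of when the engine is trained or queried.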

• Visual model for model creation

• Integrated with a template gallery

• Ability to test and validate engines

Rich Model Creation Interface
