Page 1: 308471 CH5 Machine Learning using Microsoft Azure

ICT@PSU 308-471 Data Warehousing and Data Mining 1 of 52

M5: Machine Learning using MS Azure

The only way to do great work is to love what you do. -- Steve Jobs --

WORAPOT JAKKHUPAN, PHD
worapot.j@psu.ac.th
ROOM BSC.0406/7

Information and Communication Technology Programme, Faculty of Science, PSU

Page 2: 308471 CH5 Machine Learning using Microsoft Azure

What is learning?

• “Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time.” –Herbert Simon

• “Learning is constructing or modifying representations of what is being experienced.” –Ryszard Michalski

• “Learning is making useful changes in our minds.” –Marvin Minsky

Page 3: 308471 CH5 Machine Learning using Microsoft Azure

Why learn?

• Understand and improve efficiency of human learning

  • Use to improve methods for teaching and tutoring people (e.g., better computer-aided instruction)

• Discover new things or structure that were previously unknown to humans

  • Examples: data mining, scientific discovery

• Fill in skeletal or incomplete specifications about a domain

  • Large, complex AI systems cannot be completely derived by hand and require dynamic updating to incorporate new information.

  • Learning new characteristics expands the domain or expertise and lessens the “brittleness” of the system

• Build software agents that can adapt to their users or to other software agents

Page 4: 308471 CH5 Machine Learning using Microsoft Azure

What is machine learning?

• Machine learning detects patterns in large amounts of data to predict what happens when you get new information.

• Machine learning uses computers to run predictive models that learn from existing data in order to forecast future behaviors, outcomes, and trends.

• These forecasts or predictions from machine learning can make apps and devices smarter.

  • When you shop online, machine learning helps recommend other products you might like based on what you've purchased.

  • When your credit card is swiped, machine learning compares the transaction to a database of transactions and helps the bank detect fraud.

Page 5: 308471 CH5 Machine Learning using Microsoft Azure

Traditional Programming: Data + Program → Computer → Output

Machine Learning: Data + Output → Computer → Program

Page 6: 308471 CH5 Machine Learning using Microsoft Azure

Key ML terminology and concepts

• Data exploration is the process of gathering information about a large and often unstructured data set in order to find characteristics for focused analysis. Data mining refers to automated data exploration.

• Descriptive analytics is the process of analyzing a data set in order to summarize what happened. The vast majority of business analytics, such as sales reports, web metrics, and social network analysis, is descriptive.

• Predictive analytics is the process of building models from historical or current data in order to forecast future outcomes.

Page 7: 308471 CH5 Machine Learning using Microsoft Azure

Types of Learning

• Supervised (inductive) learning

  • Training data includes desired outputs

• Unsupervised learning

  • Training data does not include desired outputs

• Semi-supervised learning

  • Training data includes a few desired outputs

• Reinforcement learning

  • Rewards from sequence of actions

Page 8: 308471 CH5 Machine Learning using Microsoft Azure

Supervised and unsupervised learning

• Supervised learning algorithms are trained with labeled data - in other words, data comprised of examples of the answers wanted. For instance, a model that identifies fraudulent credit card use would be trained from a data set in which data points indicating known fraudulent and valid charges were labeled. Most machine learning is supervised.

• Unsupervised learning is used on data with no labels, and the goal is to find relationships in the data. For instance, you might want to find groupings of customer demographics with similar buying habits.

Page 9: 308471 CH5 Machine Learning using Microsoft Azure

1. Supervised Learning

• In supervised learning, the algorithm is given a data set in which the “right answer” is provided for each example.

• The algorithm uses a known data set (called the training data set) to make predictions.

• The training data set includes input data and response values.

• From this data set, the supervised learning algorithm seeks to build a model that can predict the response values for a new data set.

• Supervised learning includes two categories of algorithms, namely, regression and classification.

Page 10: 308471 CH5 Machine Learning using Microsoft Azure

1.1 Regression

• Regression is used for continuous response values, for example, predicting the price of a house based on its size.

• From an existing data set, the algorithm plots a graph of houses and their respective prices.

• From this data set, we now want to predict the price of a house of 900 square feet.

• The algorithm detects the trend in the data and represents it as a straight line, which is then used to make the forecast.
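The straight-line forecast described on this slide can be sketched with an ordinary least-squares fit. This is a minimal illustration, not the algorithm Azure ML uses internally; the house sizes and prices are invented for the example, and `fit_line` is a hypothetical helper name.

```python
# Hypothetical training data: (size in square feet, price in thousands).
houses = [(500, 100.0), (700, 140.0), (1000, 200.0), (1200, 240.0)]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(houses)

# Forecast the price of a 900 square foot house from the fitted line.
predicted_price = slope * 900 + intercept
print(f"Predicted price for 900 sq ft: {predicted_price:.1f}")
```

With real data the points would not lie exactly on a line, and the fitted line would be the one that minimizes the squared vertical distances to the points.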

Page 11: 308471 CH5 Machine Learning using Microsoft Azure

1.2 Classification

• Classification is used for categorical response values, where the data can be separated into specific “classes”.

• Classification is used to predict a discrete output value (e.g., 0/1, Yes/No).

• Consider a case where we need to determine whether a tumor is malignant or not based on its size.

• The algorithm relates tumor size to tumor type in the training data. If we then need to forecast whether a tumor of size Z is dangerous, the algorithm assigns it to a class; in this example, it finds that the tumor is not harmful.
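One very simple way to picture a two-class classifier on a single feature is a learned cut-off on tumor size. The sketch below is an assumption-laden illustration: the sizes and labels are invented, and the threshold rule (`fit_threshold`, a hypothetical name) stands in for whichever classification algorithm is actually chosen.

```python
# Hypothetical labelled data: (tumor size in cm, 1 = malignant, 0 = benign).
training = [(1.0, 0), (1.5, 0), (2.0, 0), (3.5, 1), (4.0, 1), (5.0, 1)]

def fit_threshold(samples):
    """Pick the size cut-off that misclassifies the fewest training samples."""
    sizes = sorted(size for size, _ in samples)
    # Candidate cut-offs are the midpoints between neighbouring sizes.
    candidates = [(a + b) / 2 for a, b in zip(sizes, sizes[1:])]

    def errors(cut):
        return sum((size > cut) != bool(label) for size, label in samples)

    return min(candidates, key=errors)

def classify(size, cut):
    """Discrete output: one of two classes."""
    return "malignant" if size > cut else "benign"

cut = fit_threshold(training)
print(cut, classify(1.8, cut), classify(4.5, cut))
```

The output is a class label rather than a number on a continuous scale, which is the distinction from regression.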

Page 12: 308471 CH5 Machine Learning using Microsoft Azure

2. Unsupervised learning

• In unsupervised machine learning, the algorithm tries to identify structure in a given data set.

• The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or groupings in data.

• Example 1 -- Market Research

  • Market researchers use cluster analysis to partition consumers into market segments and to better understand the relationships between different groups, for use in market segmentation, product positioning, new product development, and selecting test markets.

Page 13: 308471 CH5 Machine Learning using Microsoft Azure

2. Unsupervised learning

• Example 2 -- Social Network Analysis

  • Clustering may be used to recognize communities within large groups of people.

• Example 3 -- Crime Analysis

  • Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime.

  • By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.
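To make the "hot spot" idea concrete, here is a minimal k-means sketch, one common clustering method. The incident coordinates are invented, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic; it is not how a production clustering module would necessarily initialize.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical incident locations (x, y) on a city grid; no labels are given.
incidents = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids, clusters = kmeans(incidents, k=2)
print(centroids)
```

Each centroid can be read as the center of one hot spot, found without any labelled examples.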

Page 14: 308471 CH5 Machine Learning using Microsoft Azure

Other common ML terms (1)

• algorithm: A self-contained set of rules used to solve problems through data processing, calculation, or automated reasoning.

• categorical data: Data that is organized by categories and that can be divided into groups. For example, a categorical data set for autos could specify year, make, model, and price.

• classification: A model for organizing data points into categories based on a data set for which category groupings are already known.

• feature engineering: The process of extracting or selecting features related to a data set in order to enhance the data set and improve outcomes. For instance, airfare data could be enhanced by days of the week and holidays.
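The airfare example of feature engineering can be sketched as follows. The record layout, the `add_date_features` helper, and the one-entry holiday calendar are all hypothetical; the point is only that new columns are derived from an existing one.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical holiday calendar used only for this illustration.
HOLIDAYS = {date(2017, 12, 25)}

def add_date_features(record):
    """Return a copy of the record with day-of-week and holiday columns added."""
    flight_date = record["flight_date"]
    enriched = dict(record)
    enriched["day_of_week"] = DAY_NAMES[flight_date.weekday()]
    enriched["is_holiday"] = flight_date in HOLIDAYS
    return enriched

fare = {"flight_date": date(2017, 12, 25), "price": 320.0}
enriched_fare = add_date_features(fare)
print(enriched_fare)
```

A learning algorithm can then use `day_of_week` and `is_holiday` as inputs, which a raw date value would not expose directly.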

Page 15: 308471 CH5 Machine Learning using Microsoft Azure

Other common ML terms (2)

• module: A functional element in a Machine Learning Studio model, such as the Enter Data module, which enables entering and editing small data sets. An algorithm is also a type of module in Machine Learning Studio.

• model: For supervised learning, a model is the product of a machine learning experiment comprised of a training data set, an algorithm module, and functional modules, such as a Score Model module.

• numerical data: Data that has meaning as measurements (continuous data) or counts (discrete data). Also referred to as quantitative data.

• partition: The method by which you divide data into samples.

Page 16: 308471 CH5 Machine Learning using Microsoft Azure

Other common ML terms (3)

• prediction: A prediction is a forecast of a value or values from a machine learning model. You might also see the term "predicted score"; however, predicted scores are not the final output of a model. An evaluation of the model follows the score.

• regression: A model for predicting a continuous value based on independent variables, such as predicting the price of a car based on its year and make.

• score: A predicted value generated from a trained classification or regression model, using the Score Model module in Machine Learning Studio. Classification models also return a score for the probability of the predicted value. Once you've generated scores from a model, you can evaluate the model's accuracy using the Evaluate Model module.

• sample: A part of a data set intended to be representative of the whole. Samples can be selected randomly or based on specific features of the data set.

Page 17: 308471 CH5 Machine Learning using Microsoft Azure

Machine Learning on Azure

• Azure Machine Learning is a powerful cloud-based predictive analytics service that makes it possible to quickly create and deploy predictive models as analytics solutions.

• Azure Machine Learning not only provides tools to model predictive analytics, but also provides a fully-managed service you can use to deploy your predictive models as ready-to-consume web services.

• Azure Machine Learning provides tools for creating complete predictive analytics solutions in the cloud: Quickly create, test, operationalize, and manage predictive models.

Page 18: 308471 CH5 Machine Learning using Microsoft Azure

What is predictive analytics?

• Predictive analytics uses various statistical techniques - in this case, machine learning - to analyze collected or current data for patterns or trends in order to forecast future events.

• Azure Machine Learning is a particularly powerful way to do predictive analytics:

  • You can work from a ready-to-use library of algorithms, create models on an internet-connected PC, and deploy your predictive solution quickly.

• You can also find ready-to-use examples and solutions in the Microsoft Azure Marketplace or Cortana Analytics Gallery.

Page 19: 308471 CH5 Machine Learning using Microsoft Azure

Azure ML: Basic workflow

Page 20: 308471 CH5 Machine Learning using Microsoft Azure
Page 21: 308471 CH5 Machine Learning using Microsoft Azure
Page 22: 308471 CH5 Machine Learning using Microsoft Azure

List of ML Azure modules and algorithms

• The modules in this section provide tools for the final phases of machine learning, in which you apply an algorithm to data to train a model, generate scores, and then evaluate the accuracy and usefulness of the model.

  • Initialize Model: Choose from a variety of customizable machine learning algorithms, including clustering, regression, classification, and anomaly detection models.

  • Train: Provide your data to the configured model to learn from patterns and create statistics that can be used for predictions.

  • Score: Create predictions using the trained models.

  • Evaluate: Measure the accuracy of a trained model or compare multiple models.

https://msdn.microsoft.com/en-us/library/azure/dn905870.aspx

Page 23: 308471 CH5 Machine Learning using Microsoft Azure

Model training and evaluation

• A machine learning model is an abstraction of the question you are trying to answer or the outcome you want to predict. Models are trained and evaluated from existing data.

• Training from data

  • In Azure Machine Learning, a model is built from an algorithm module that processes training data and functional modules, such as a scoring module.

  • In supervised learning, if you're training a fraud detection model, you'll use a set of transactions that are labeled as either fraudulent or valid. You'll split your data set randomly, and use part to train the model and part to test or evaluate the model.

• Evaluation data

  • Once you have a trained model, evaluate the model using the remaining test data. You use data you already know the outcomes for, so that you can tell whether your model predicts accurately.
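The train-then-evaluate cycle described on this slide can be sketched in a few lines. Everything here is illustrative: the transactions are invented, `train_cutoff` is a deliberately trivial stand-in for a real learning algorithm, and the fixed seed only makes the random split repeatable.

```python
import random

def split_rows(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_cutoff(rows):
    """Stand-in learner: midpoint between the largest valid and smallest fraudulent amount."""
    largest_valid = max(amount for amount, label in rows if label == "valid")
    smallest_fraud = min(amount for amount, label in rows if label == "fraud")
    return (largest_valid + smallest_fraud) / 2

def accuracy(cutoff, rows):
    """Fraction of rows whose known label matches the model's prediction."""
    hits = sum(("fraud" if amount > cutoff else "valid") == label
               for amount, label in rows)
    return hits / len(rows)

# Hypothetical labelled transactions: (amount, known outcome).
transactions = [(20, "valid"), (35, "valid"), (900, "fraud"), (15, "valid"),
                (1200, "fraud"), (60, "valid"), (750, "fraud"), (40, "valid")]

train_rows, test_rows = split_rows(transactions, train_fraction=0.75)
cutoff = train_cutoff(train_rows)           # learn only from the training part
test_accuracy = accuracy(cutoff, test_rows)  # judge on rows the model never saw
print(len(train_rows), len(test_rows), test_accuracy)
```

Because the test rows were held back during training, their known outcomes give an honest check of how well the model predicts.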

Page 24: 308471 CH5 Machine Learning using Microsoft Azure

Five steps to create an experiment

• In this machine learning tutorial, you'll follow five basic steps to build an experiment in Machine Learning Studio in order to create, train, and score your model:

• Create a model

  • Step 1: Get data

  • Step 2: Preprocess data

  • Step 3: Define features

• Train the model

  • Step 4: Choose and apply a learning algorithm

• Score and test the model

  • Step 5: Predict new automobile prices

Page 25: 308471 CH5 Machine Learning using Microsoft Azure

1. Get Data (loan_hist.csv)

• Adding a new data set

  • To upload a new data set, go to Experiments → New → Data set → From local file.

• Creating a new Experiment

Page 26: 308471 CH5 Machine Learning using Microsoft Azure

2. Drag and place the data on the canvas

• Clicking New Experiment opens a new canvas where you can add all the elements needed for your experiment.

• From here, you may drag objects from the left pane and place them on the canvas.

• From the Saved Datasets tab, you may browse both the sample data sets and the data sets you uploaded.

Page 27: 308471 CH5 Machine Learning using Microsoft Azure

3. Visualize the data

• Once you select your data set, you may view its contents by clicking Visualize.

Page 28: 308471 CH5 Machine Learning using Microsoft Azure

4. Split the data (test and training)

• The next step is to split the data into test and training sets.

• Select the Split module on the left and drag it onto the canvas. Next, specify which percentage of the data will be used for training and which percentage for testing.

• The test data will be used to evaluate the accuracy of the trained model.

Page 29: 308471 CH5 Machine Learning using Microsoft Azure

5. Train model

• The Train Model module is where the learning occurs. It takes two inputs: the data set and an algorithm.

• Since we need to answer a two-class question, which falls under classification, we shall use a classification algorithm.

Page 30: 308471 CH5 Machine Learning using Microsoft Azure

6. Feature selection

• The next step is to configure the Train Model module to specify which field to predict. To do so, click the Train Model module, then click "launch column selector" on the right to select the required field.

• In our case, we need to predict the field "Loan Paid?".

Page 31: 308471 CH5 Machine Learning using Microsoft Azure

7. Score & Evaluate Model

• The Score Model module takes two inputs: the trained model and the test data.

• We'll use the scoring data that was separated out by the Split module to score our trained models.

• To evaluate the two scoring results, we'll use the Evaluate Model module.

• The Evaluate Model module can take the output of up to two Score Model modules as input for comparison.

• We can then compare the results of the two models to see which generated better results.

Page 32: 308471 CH5 Machine Learning using Microsoft Azure

8. Adding another algorithm

If we want to compare more algorithms, we may add them to the experiment. The steps to add another algorithm are:

1. Copy and paste the existing Train Model and Score Model modules.

2. Remove the algorithm connector from the copied Train Model module.

3. Add a new predictive algorithm to the new Train Model module.

4. Connect the new Score Model module to the existing Evaluate Model module.

Page 33: 308471 CH5 Machine Learning using Microsoft Azure

9. Running and evaluating the results

• Hit Run.

• Once completed, click on the output port of the Evaluate Model module and click Visualize.

• The Evaluate Model module produces a pair of curves and metrics that allow you to compare the results of the two scored models.

• You can view the results as Receiver Operating Characteristic (ROC) curves, Precision/Recall curves, or Lift curves.
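To show what an ROC comparison of two scored models involves, the sketch below builds ROC points from scores and summarizes each curve by its area (AUC). The labels and both sets of scores are invented, the function names are hypothetical, and the code assumes no two rows share the same score.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the decision threshold one scored row at a time, highest score first.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical test labels (1 = positive class) and scores from two models.
labels = [1, 1, 0, 1, 0, 0]
scores_model_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_model_b = [0.6, 0.3, 0.7, 0.8, 0.4, 0.2]

auc_a = auc(roc_points(labels, scores_model_a))
auc_b = auc(roc_points(labels, scores_model_b))
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}")
```

A curve that stays closer to the top-left corner, and therefore has the larger area, indicates the model that ranks positive cases above negative ones more reliably.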

Page 34: 308471 CH5 Machine Learning using Microsoft Azure

10. Publishing (1)

Once we know which algorithm to use, we can put the experiment into production and create a web service so that applications can connect and pass data to it.

1. Right-click the model we need and click Save as Trained Model.

2. Create a new model for publishing.

Page 35: 308471 CH5 Machine Learning using Microsoft Azure

10. Publishing (2)

3. Add the Data Source, Train Model and Score Model. Also, add a Missing Values Scrubber to replace all missing values in the data set.
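What a missing-value replacement step does can be illustrated with a mean fill, which is one possible strategy among several. The column name, the sample rows, and the `fill_missing_with_mean` helper are hypothetical and do not reflect the module's actual configuration options.

```python
def fill_missing_with_mean(rows, column):
    """Replace None in one numeric column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [{**row, column: mean if row[column] is None else row[column]}
            for row in rows]

# Hypothetical loan records; the second applicant's income is missing.
loans = [{"income": 40000.0}, {"income": None}, {"income": 60000.0}]
cleaned = fill_missing_with_mean(loans, "income")
print(cleaned)
```

Cleaning the input this way matters for a published web service, because incoming requests may omit values that the trained model expects.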

Page 36: 308471 CH5 Machine Learning using Microsoft Azure

10. Publishing (3)

4. Define the input and output ports of the score model.

5. Run the experiment and click Publish Web Service.

6. The web service is now created. To place it into production, go to the web service, click on settings and set it as ready for production.

Page 37: 308471 CH5 Machine Learning using Microsoft Azure

Case study: Developing a Recommender Solution with Azure Machine Learning

This article demonstrates how Azure Machine Learning can be used to develop a Recommender Solution.

Ever wondered how websites like Amazon and eBay provide you with useful suggestions and recommendations? This article is for you!

http://social.technet.microsoft.com/wiki/contents/articles/31842.developing-a-recommender-solution-with-azure-machine-learning.aspx

Page 38: 308471 CH5 Machine Learning using Microsoft Azure

1. Add the dataset

In Azure Machine Learning, an existing dataset can be used, or a new one can be loaded from an Azure Database, Azure Blob Storage, Data Feed Reader, Web Service or a Hive Query.

• In this example the Movie Ratings Sample Data shall be used.

• The Movie Rating sample has the following columns:

Page 39: 308471 CH5 Machine Learning using Microsoft Azure

2. Exclude the columns that are not needed

• To do so, the Project Columns module can be used. Add it to the experiment.

• Now, from the right menu, select "launch column selector" to select the fields that are needed. Here, the TimeStamp column shall be excluded.
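The effect of excluding a column can be pictured as a projection over the rows. In this sketch only the TimeStamp column name comes from the slide; the other column names, the sample row, and the `exclude_columns` helper are assumptions made for illustration.

```python
def exclude_columns(rows, excluded):
    """Return copies of the rows without the excluded columns."""
    return [{name: value for name, value in row.items() if name not in excluded}
            for row in rows]

# Hypothetical rating rows; only TimeStamp is named in the walkthrough.
ratings = [{"UserId": 1, "MovieId": 10, "Rating": 4, "TimeStamp": 1365292800}]
projected = exclude_columns(ratings, excluded={"TimeStamp"})
print(projected)
```

Dropping columns that carry no signal for the task keeps the recommender's input limited to user, item, and rating.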

Page 40: 308471 CH5 Machine Learning using Microsoft Azure

3. Split the data (1)

Drag the Split module and connect it as below.

Now, the data shall be partitioned into 2 distinct sets:

• Train Data: Used to “train” the recommender. That is, the algorithm shall use this data to "learn" and make predictions.

• Test Data: Used to validate the results of the recommender.

Page 41: 308471 CH5 Machine Learning using Microsoft Azure

3. Split the data (2)

Deciding how much data to use for training and for testing is subjective.

The ratio should be typed as a decimal number between 0 and 1 to represent the fraction of rows sent to the first output dataset.

For example, if you type 0.75 as the value, the dataset would be split by using a 75:25 ratio, with 75% of the rows sent to the first output dataset, and 25% sent to the second output dataset.

Page 42: 308471 CH5 Machine Learning using Microsoft Azure

4. Add the Train Matchbox Recommender

• Train a recommendation model based on the Matchbox recommender engine.

• It has the ability to learn about people’s preferences from observing how they rate items such as movies, content, or other products.

Page 43: 308471 CH5 Machine Learning using Microsoft Azure

5. Add the Score Matchbox Recommender

• The Score Matchbox Recommender scores predictions for a dataset using the Matchbox recommender.

• It generates results based on a trained recommendation model.

Page 44: 308471 CH5 Machine Learning using Microsoft Azure

6. Add the Evaluate Recommender

• The Evaluate Recommender tests the accuracy of recommender model predictions

Page 45: 308471 CH5 Machine Learning using Microsoft Azure

7. Run the experiment

• At this point, the solution looks like the one below and can be executed by clicking the Run button.

Page 46: 308471 CH5 Machine Learning using Microsoft Azure

8. Visualize the output

• After the experiment has run, click the output of the Score Matchbox Recommender and then click Visualize. All the movie IDs, together with their respective "related" movies, will be displayed as shown below.
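To give a feel for what a table of "related" items contains, here is a simple co-occurrence sketch: two movies count as related when the same user rated both highly. This is explicitly not the Matchbox algorithm, which is a different and more sophisticated model; the ratings, the `liked_at` cut-off, and the function name are all assumptions.

```python
from collections import Counter
from itertools import permutations

def related_items(ratings, liked_at=4):
    """For each item, find the item most often liked by the same users."""
    liked = {}
    for user, item, rating in ratings:
        if rating >= liked_at:
            liked.setdefault(user, set()).add(item)
    counts = {}
    for items in liked.values():
        # Count every ordered pair of items liked by the same user.
        for a, b in permutations(sorted(items), 2):
            counts.setdefault(a, Counter())[b] += 1
    return {item: neighbours.most_common(1)[0][0]
            for item, neighbours in counts.items()}

# Hypothetical (user, movie, rating) rows.
ratings = [("u1", "Thor", 5), ("u1", "Iron Man", 5),
           ("u2", "Thor", 4), ("u2", "Iron Man", 5), ("u2", "Up", 2),
           ("u3", "Up", 5), ("u3", "Thor", 2)]

related = related_items(ratings)
print(related)
```

Items that no user liked together with another item simply have no related entry in this simplified version.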

Page 47: 308471 CH5 Machine Learning using Microsoft Azure

9. Add the IMDB Movie Title Sample

• However, this is not very useful for analysis purposes. It would be more meaningful to have the movie names instead of the movie IDs.

• Fortunately, the Join operator can be used as shown below.

• This sample has all the Movie Names and their respective Movie IDs.

Page 48: 308471 CH5 Machine Learning using Microsoft Azure

10. Add the Meta Data Editor and make it treat the values as String

• This can be done by selecting all the columns in the column selector and setting the data type to String in the right pane.

Page 49: 308471 CH5 Machine Learning using Microsoft Azure

11. Join the Movie IDs from the Meta Data editor

• In the column selector, select "Item" from the left column and select "Movie Id" from the right column selector.

• This will join the Item column from the Score Matchbox Recommender to the Movie ID from the IMDB Movie Titles. So, if the experiment is executed again, the Movie Name and all the related Movie IDs shall be listed as below.

Page 50: 308471 CH5 Machine Learning using Microsoft Azure

12. Add another Join operator, to join the result from the previous join (result) with the Movie Titles sample

• In the left column selector, select related item 1 and in the right column selector, select Movie ID.

• This will join the related movie id 1 with the Movie Titles sample to return the name of the related movie.

• Run the experiment to obtain a list of movies and their related movies.
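The two joins in steps 11 and 12 can be sketched as lookups against a title table. The IDs, the output column names, and the `join_titles` helper are hypothetical; the IDs are kept as strings on both sides, mirroring the step that sets the column type to String so that the join keys compare equal.

```python
# Hypothetical Movie ID -> Movie Name table (IDs stored as strings).
titles = {"101": "Thor", "102": "Iron Man"}

# Hypothetical recommender output: an item and its first related item.
scored = [{"Item": "101", "Related Item 1": "102"}]

def join_titles(rows, key, titles, out_column):
    """Inner join: keep rows whose key has a title and attach that title."""
    return [{**row, out_column: titles[row[key]]}
            for row in rows if row[key] in titles]

# First join: name of the item itself. Second join: name of the related item.
with_names = join_titles(scored, "Item", titles, "Movie Name")
with_related = join_titles(with_names, "Related Item 1", titles, "Related Movie Name")
print(with_related)
```

If one side held the IDs as integers and the other as strings, no keys would match and the join would return nothing, which is why the type conversion comes first.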

Page 51: 308471 CH5 Machine Learning using Microsoft Azure

• For example, we can deduce that people who liked Thor also liked Iron Man.

Page 52: 308471 CH5 Machine Learning using Microsoft Azure

Reference

• http://social.technet.microsoft.com/wiki/contents/articles/26689.predictive-analytics-with-microsoft-azure-machine-learning.aspx

• https://msdn.microsoft.com/en-us/library/azure/dn905846.aspx

• http://social.technet.microsoft.com/wiki/contents/articles/31842.developing-a-recommender-solution-with-azure-machine-learning.aspx

• https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/