Machine learning with R
AMIS Day, April 3rd 2017
Maarten Smeets
MACHINE LEARNING WITH R
WHAT IS MACHINE LEARNING
USE CASES FOR MACHINE LEARNING
SUPERVISED LEARNING
UNSUPERVISED LEARNING
INTRODUCING R
COOL FEATURES OF R
R AND ORACLE
MACHINE LEARNING
• Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.
MACHINE LEARNING: USE CASES
• E-mail categorization: spam, news, personal, orders, …
• Anomaly detection: fraud detection, behavior which does not fit known classifications well
• Optical character recognition (OCR)
• Genetics: will you have a high chance of relapse when you have this cancer type and these genes?
MACHINE LEARNING: USE CASES
• Log file analysis: which entries are rare? Which are the variables in a log line? Intruder detection
• IoT: self-learning thermostats
• Predict weather: based on environmental measures like humidity, air pressure, satellite images
• Detect trends: the number of cases present in the KEI system at Spir-it, and performance
• Image recognition: self-driving cars like Tesla, BMW
• Predict stock prices: find correlations between stocks and try to find features which can predict future prices
WHAT IS MACHINE LEARNING
1. Supervised learning
2. Unsupervised learning
SUPERVISED LEARNING
• The computer is presented with input and desired output
• The goal is to derive a general rule set that maps input to output
• This rule set can then be used to predict output for new input
SUPERVISED LEARNING: EXAMPLES
• Linear regression
• Support Vector Regression
• Random forest
• Artificial Neural Networks (ANN)
SUPERVISED LEARNING: LINEAR REGRESSION
[Figures: data, statistics, and plot]
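A minimal sketch of the data, statistics, and plot steps using base R. The data set is made up for illustration (the slides' own data is not part of this transcript), and the names dat, fit, and new_x are assumptions.

```r
# Hypothetical data: y depends linearly on x, plus some noise
set.seed(42)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)
dat <- data.frame(x = x, y = y)

# Fit the model: learn intercept and slope from input (x) and desired output (y)
fit <- lm(y ~ x, data = dat)

# Statistics: estimated coefficients with standard errors and p-values
print(summary(fit)$coefficients)

# Use the learned rule to predict output for input the model has not seen
new_x <- data.frame(x = c(60, 70))
pred <- predict(fit, newdata = new_x)
print(pred)

# Plot the data together with the fitted line
plot(dat$x, dat$y, xlab = "x", ylab = "y")
abline(fit, col = "red")
```

The estimated slope should land close to the value 2 used to generate the data.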
SUPERVISED LEARNING: SUPPORT VECTOR REGRESSION
http://www.svm-tutorial.com/2014/10/support-vector-regression-r/
[Figure: prediction with tuned model]
SUPERVISED LEARNING: RANDOM FOREST
• Features are used to classify data
• A set of decision trees is generated based on 2 sets of random features
• Every tree sees a subset of the data
• Splits in the tree are determined by training data values: where does a split add the most information?
• To make predictions, features are put through all decision trees and the resulting classifications are given a weight
SUPERVISED LEARNING: RANDOM FOREST
[Figure: variable importance plot]
Mainly Y was used in the decision trees to determine the outcome; i (a counter) was not important.
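A sketch of how such a variable importance result could be produced, assuming the randomForest package from CRAN is installed (the example is skipped otherwise). The data is made up: the class depends only on Y, X is pure noise, and i is a row counter. The column names and parameter values are assumptions and not taken from the slides.

```r
set.seed(1)
n <- 200

# Hypothetical training data: the class is fully determined by Y
train <- data.frame(i = seq_len(n), X = rnorm(n), Y = rnorm(n))
train$class <- factor(ifelse(train$Y > 0, "A", "B"))

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Grow 100 trees; each tree sees a bootstrap sample of the rows and
  # considers a random subset of the features at every split
  rf <- randomForest::randomForest(class ~ i + X + Y, data = train,
                                   ntree = 100, importance = TRUE)

  # Importance per feature, as a table and as a plot
  print(randomForest::importance(rf))
  randomForest::varImpPlot(rf)

  # Prediction: the observation is put through all trees and the votes are combined
  new_obs <- data.frame(i = 201, X = 0.3, Y = 1.5)
  print(predict(rf, newdata = new_obs))
} else {
  message("Package 'randomForest' is not installed; skipping the example.")
}
```

With data constructed like this, Y is expected to dominate the importance measures while the counter i contributes little.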
SUPERVISED LEARNING: RANDOM FOREST
• Why is it very useful?
  • Data does not have many requirements
  • Can deal with multiple dimensions
  • Predicts well in a lot of cases
  • Fast
  • Variable importance can easily be determined
    If many features are correlated, a single representative feature can be used
• Large black box performing magic
SUPERVISED LEARNING: ARTIFICIAL NEURAL NETWORKS (ANN)
[Figure: a network between input and output, consisting of input nodes, hidden nodes, and output nodes]
ARTIFICIAL NEURAL NETWORKS (ANN): EXAMPLE BACKPROPAGATION
• Backpropagation
  1. Nodes have connections and connections have a randomly assigned weight
  2. Provide input and let the network generate output
  3. Compare generated output with desired output
  4. Go from output nodes back to input and adjust the weight of the node connections.
     Adjusting a little bit at a time increases learning time and accuracy
  5. Repeat from step 2 until the desired error rate is reached
• Can be done with weights or with node activation thresholds
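The backpropagation steps can be sketched in base R roughly as follows. This is an illustration under assumptions, not the presenter's code: the XOR training set, the network size (2 inputs, 3 hidden nodes, 1 output), the sigmoid activation, the learning rate, and the stopping threshold are all chosen for the example.

```r
set.seed(123)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical training set: XOR of two binary inputs
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
target <- matrix(c(0, 1, 1, 0), ncol = 1)

n_hidden <- 3
rate <- 0.5  # how much the weights are adjusted per iteration

# Step 1: connections get randomly assigned weights (the last row holds bias weights)
W1 <- matrix(runif(3 * n_hidden, -1, 1), nrow = 3)  # input + bias -> hidden
W2 <- matrix(runif(n_hidden + 1, -1, 1), ncol = 1)  # hidden + bias -> output

errors <- numeric(5000)
for (step in seq_along(errors)) {
  # Step 2: provide input and let the network generate output
  A0 <- cbind(X, 1)
  H <- sigmoid(A0 %*% W1)
  A1 <- cbind(H, 1)
  out <- sigmoid(A1 %*% W2)

  # Step 3: compare generated output with desired output
  err <- out - target
  errors[step] <- mean(err^2)

  # Step 5 (stop condition): desired error rate reached
  if (errors[step] < 0.01) break

  # Step 4: go from the output back to the input and adjust the weights
  delta_out <- err * out * (1 - out)
  delta_hidden <- (delta_out %*% t(W2[1:n_hidden, , drop = FALSE])) * H * (1 - H)
  W2 <- W2 - rate * t(A1) %*% delta_out
  W1 <- W1 - rate * t(A0) %*% delta_hidden
}
errors <- errors[seq_len(step)]

print(c(first = errors[1], last = errors[length(errors)]))
print(round(out, 2))
```

A smaller rate makes each adjustment more careful at the cost of more iterations, which is the trade-off named in step 4.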
ARTIFICIAL NEURAL NETWORKS (ANN): SOME PERSONAL THOUGHTS (AS NEUROBIOLOGIST)
• Most examples of artificial neural networks do not take into account several properties of biological neural networks
  • Signals take time to go from A to B
  • Neurons are not arranged in layers
    Biological neural networks have a 3D structure with specialized areas
  • Once trained, most artificial neural networks are static and don't learn anymore
  • Biological neural networks implement a wide range of signaling mechanisms per node (neurotransmitters)
• Learning algorithms are not only internal to the neural network. Natural selection also plays a role
SUPERVISED LEARNING: CHALLENGES
• Requires learning set of inputs and desired outputs
• Training data should be balanced
  • Correlated features cause biases
  • Outputs should be distributed as evenly as possible
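A quick way to check how evenly the desired outputs are distributed, sketched in base R. The labels are made up, and down-sampling to the smallest class is only one possible remedy, chosen here for brevity.

```r
# Hypothetical training labels: class A is heavily over-represented
train_labels <- factor(c(rep("A", 8), rep("B", 2)))

# Count and proportion per class
counts <- table(train_labels)
print(counts)
print(prop.table(counts))

# Down-sample every class to the size of the smallest class
set.seed(7)
min_n <- min(counts)
idx <- unlist(lapply(split(seq_along(train_labels), train_labels),
                     function(rows) rows[sample.int(length(rows), min_n)]))
balanced <- train_labels[idx]
print(table(balanced))
```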
SUPERVISED LEARNING
[Figure: training data and test data, each mapping input to output labels A and B]
UNSUPERVISED LEARNING
• Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data: a classification or categorization is not included in the observations
• Examples
  • Clustering
  • Anomaly detection
  • Neural networks (Self Organizing Map)
HIERARCHICAL CLUSTERING
Every point starts a cluster
Clusters merge as they go up the tree
HIERARCHICAL CLUSTERING (HCL): A: MEAN 2,2 STDEV 2; B: MEAN 6,6 STDEV 2
[Figure: original groups next to predicted clusters]
HIERARCHICAL CLUSTERING (HCL): A: MEAN 2,2 STDEV 1; B: MEAN 6,6 STDEV 1
[Figure: original groups next to predicted clusters]
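The comparison of original groups and predicted clusters can be reproduced approximately with base R. This is a sketch: the group size of 50, the seed, and the helper names make_data and cluster_two are assumptions, and hclust is used with its default settings.

```r
set.seed(2017)
n <- 50

# Two hypothetical 2-D groups: A around (2,2) and B around (6,6)
make_data <- function(stdev) {
  data.frame(
    x = c(rnorm(n, mean = 2, sd = stdev), rnorm(n, mean = 6, sd = stdev)),
    y = c(rnorm(n, mean = 2, sd = stdev), rnorm(n, mean = 6, sd = stdev)),
    original = rep(c("A", "B"), each = n)
  )
}

cluster_two <- function(dat) {
  # Every point starts as its own cluster; clusters merge going up the tree
  tree <- hclust(dist(dat[, c("x", "y")]))
  # Cut the tree at the height where two clusters remain
  cutree(tree, k = 2)
}

wide <- make_data(stdev = 2)
narrow <- make_data(stdev = 1)
wide$predicted <- cluster_two(wide)
narrow$predicted <- cluster_two(narrow)

# Cross-tabulate the original group against the predicted cluster
print(table(wide$original, wide$predicted))
print(table(narrow$original, narrow$predicted))
```

The labels were never shown to the algorithm; they are only used afterwards to compare. The more the two groups overlap, the more points tend to end up in the "wrong" cluster.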
INTRODUCING R
1. History
2. Installation
3. Basics
R: A SHORT HISTORY
• Conceived August 1993: an implementation of the S programming language (S was conceived in 1976)
• Open sourced June 1995
• Main competitors: SPSS and SAS
• A lot of (mostly statistical) libraries available: the CRAN package repository features 10366 available packages.
R INSTALLATION
• Download and install R: https://www.r-project.org/
RSTUDIO INSTALLATION
• Download and install RStudio: https://www.rstudio.com/
R BASICS
• R is a functional programming (FP) language
• It provides many tools for the creation and manipulation of functions.
• You can do anything with functions that you can do with vectors: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.
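A short illustration of treating functions as ordinary values. All names (square, ops, make_multiplier, times_two) are made up for this example.

```r
# Assign a function to a variable
square <- function(x) x^2

# Store functions in a list
ops <- list(square = square, negate = function(x) -x)

# Pass a function as an argument to another function
squares <- sapply(c(1, 2, 3), square)

# Create a function inside a function and return it as the result
make_multiplier <- function(factor) {
  function(x) x * factor
}
times_two <- make_multiplier(2)

print(squares)        # 1 4 9
print(ops$negate(5))  # -5
print(times_two(21))  # 42
```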
R BASICS: SOME FEATURES
• Git integration
• Interpreted; does not require compilation. Execute a line in your script and look at the result in the console
• Has its own markdown variant for documentation. Especially useful if you want to have graphs
• R Shiny allows you to generate and host scripts / graphs and make them available from a browser
R BASICS: SOME FEATURES
• Code completion
• Allows multi threaded execution
• Can be run remotely on an R-server
• Great at reading / writing datasets. For example web site scraping for data
• Of course great at statistics
• Great at generating plots. Especially when using the ggplot2 library
R BASICS: SOME TIPS TO GET STARTED
• ?ggplot
• help(package="ggplot2")
R DATATYPES: THE VECTOR
• Vector
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
a <- c(1,2,5.3,6,-2,4)
b <- a * 2 # arithmetic is applied to every element
b
[1] 2.0 4.0 10.6 12.0 -4.0 8.0
R DATATYPES: THE MATRIX. ALL COLUMNS HAVE THE SAME TYPE AND LENGTH
# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
# accessing matrix values
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
R DATATYPES: THE DATA.FRAME. LIKE A MATRIX, BUT COLUMN TYPES CAN VARY
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") # variable names
# accessing data frame columns
mydata[2:3] # columns 2 and 3 of data frame
mydata[c("ID","Color")] # columns ID and Color from data frame
mydata$Passed # variable Passed in the data frame
R DATATYPES: THE LIST
• An ordered collection of objects (components)
# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Maarten", mynumbers=a, mymatrix=y, age=36)
# example of combining two existing lists (list1, list2) into one list
v <- c(list1,list2)
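A self-contained sketch of building a list and reading its components back. The contents of list1 and list2 are made up, and the variable nested is added here only to contrast combining lists with nesting them.

```r
a <- c(1, 2, 5.3, 6, -2, 4)
y <- matrix(1:20, nrow = 5, ncol = 4)
w <- list(name = "Maarten", mynumbers = a, mymatrix = y, age = 36)

print(w$name)            # component by name
print(w[["mynumbers"]])  # the same kind of access using double brackets
print(w[[3]][2, 4])      # component by position, then row 2, column 4 of the matrix
print(w["age"])          # single brackets return a list with one component

list1 <- list(x = 1)
list2 <- list(y = "two")
v <- c(list1, list2)          # one flat list holding the components of both
nested <- list(list1, list2)  # a list whose two components are themselves lists
print(names(v))
print(length(nested))
```

Note the difference between the last two constructions: c() merges the components into a single list, while list() keeps the two lists intact as components.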
COOL FEATURES OF R
1. Hosting plots: Shiny, Plot.ly
2. R markdown
3. Web site crawling
COOL FEATURES OF R: SHINY
[Figure: Shiny example, UI and server]
COOL FEATURES OF R: PLOT.LY INTERACTIVE GRAPHS
COOL FEATURES OF R: R MARKDOWN
COOL FEATURES OF R: WEB SITE CRAWLING
• Sector to Industry, Industry to Company
http://chart.finance.yahoo.com/table.csv?s=ABT.AX&a=1&b=28&c=2017&d=2&e=28&f=2017&g=d&ignore=.csv
ORACLE AND R
1. What does Oracle do with R
2. Using data from an Oracle DB in R
3. Using functions from R in the Oracle DB
ORACLE R ENTERPRISE: USING DATABASE DATA IN R
ORACLE R ENTERPRISE: USING R SCRIPTS DIRECTLY IN SQL STATEMENTS
https://github.com/MaartenSmeets/R