Data mining with caret package
Kai Xiao and Vivian Zhang @Supstat Inc.
Outline
· Introduction of data mining and caret
· before model training
  - visualization
  - pre-processing
  - data splitting
· building model
  - model training and tuning
  - model performance
  - variable importance
· advanced topics
  - feature selection
  - parallel processing
· exercise
Cross-industry standard process for data mining (CRISP-DM)
Introduction of caret
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:
· data splitting
· pre-processing
· feature selection
· model tuning using resampling
· variable importance estimation
A very simple example
    library(caret)
    str(iris)
    set.seed(1)

    # preprocess
    process <- preProcess(iris[, -5], method = c('center', 'scale'))
    dataScaled <- predict(process, iris[, -5])

    # data splitting
    inTrain <- createDataPartition(iris$Species, p = 0.75)[[1]]
    length(inTrain)
    trainData <- dataScaled[inTrain, ]
    trainClass <- iris[inTrain, 5]
    testData <- dataScaled[-inTrain, ]
    testClass <- iris[-inTrain, 5]
A very simple example
    # model tuning
    set.seed(1)
    fitControl <- trainControl(method = "cv", number = 10)
    tunedf <- data.frame(.cp = c(0.01, 0.05, 0.1, 0.3, 0.5))
    treemodel <- train(x = trainData,
                       y = trainClass,
                       method = 'rpart',
                       trControl = fitControl,
                       tuneGrid = tunedf)
    print(treemodel)
    plot(treemodel)

    # prediction and performance assessment
    treePred <- predict(treemodel, testData)
    confusionMatrix(treePred, testClass)
visualizations
The featurePlot function is a wrapper for different lattice plots to visualize the data.
· Scatterplot Matrix
    featurePlot(x = iris[, 1:4],
                y = iris$Species,
                plot = "pairs",
                ## Add a key at the top
                auto.key = list(columns = 3))
· boxplot
    featurePlot(x = iris[, 1:4],
                y = iris$Species,
                plot = "box",
                ## Add a key at the top
                auto.key = list(columns = 3))
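Beyond the two plot types above, featurePlot also accepts plot = "density" for classification data. The sketch below is one possible call, assuming the lattice scales argument is passed through so that each panel gets its own axis range.

```r
library(caret)

# Overlaid per-class density curves, one panel per predictor.
# Each panel gets its own axis ranges because the predictors differ in scale.
densityPlot <- featurePlot(x = iris[, 1:4],
                           y = iris$Species,
                           plot = "density",
                           scales = list(x = list(relation = "free"),
                                         y = list(relation = "free")),
                           auto.key = list(columns = 3))
print(densityPlot)
```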
pre-processing
Creating Dummy Variables
    when <- data.frame(time = c("afternoon", "night", "afternoon",
                                "morning", "morning", "morning",
                                "morning", "afternoon", "afternoon"))
    when
    # set the level order with factor(); assigning a plain vector to levels()
    # would relabel the values instead of reordering them
    when$time <- factor(when$time, levels = c("morning", "afternoon", "night"))
    mainEffects <- dummyVars(~ time, data = when)
    predict(mainEffects, when)
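A full set of dummy columns is perfectly collinear, because the columns sum to one in every row. A minimal sketch of the less-than-full-rank versus full-rank encoding follows, using the fullRank argument of dummyVars; the shortened data vector is only for illustration.

```r
library(caret)

when <- data.frame(time = factor(
  c("afternoon", "night", "afternoon", "morning", "morning"),
  levels = c("morning", "afternoon", "night")
))

# One column per level (three columns, which always sum to 1 per row).
full <- predict(dummyVars(~ time, data = when), newdata = when)

# fullRank = TRUE drops the first level to avoid perfect collinearity.
reduced <- predict(dummyVars(~ time, data = when, fullRank = TRUE), newdata = when)

dim(full)
dim(reduced)
```

The full-rank form is usually the safer choice for models that fit an intercept, such as linear regression.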
pre-processing
Zero- and Near Zero-Variance Predictors
    data <- data.frame(x1 = rnorm(100),
                       x2 = runif(100),
                       x3 = rep(c(0, 1), times = c(2, 98)),
                       x4 = rep(3, length = 100))
    nzv <- nearZeroVar(data, saveMetrics = TRUE)
    nzv
    nzv <- nearZeroVar(data)
    dataFilted <- data[, -nzv]
    head(dataFilted)
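The saveMetrics table reports a frequency ratio and a percentage of unique values for each column. The base R sketch below recomputes both for the x3 column. The helper names freq_ratio and percent_unique are hypothetical, and the thresholds are assumed to be the nearZeroVar defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hand-rolled versions of the two metrics (assumed to mirror nearZeroVar).
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(0)   # a single distinct value: no second count
  as.numeric(counts[1] / counts[2])
}
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

x3 <- rep(c(0, 1), times = c(2, 98))   # same column as in the slide
freq_ratio(x3)        # 98 / 2 = 49
percent_unique(x3)    # 2 distinct values in 100 rows = 2

# Default thresholds assumed here: freqCut = 95/5 and uniqueCut = 10.
is_nzv <- freq_ratio(x3) > 95 / 5 && percent_unique(x3) < 10
is_nzv
```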
pre-processing
Identifying Correlated Predictors
    set.seed(1)
    x1 <- rnorm(100)
    x2 <- x1 + rnorm(100, 0.1, 0.1)
    x3 <- x1 + rnorm(100, 1, 1)
    data <- data.frame(x1, x2, x3)
    corrmatrix <- cor(data)
    highlyCor <- findCorrelation(corrmatrix, cutoff = 0.75)
    dataFilted <- data[, -highlyCor]
    head(dataFilted)
pre-processing
Identifying Linear Dependencies
    set.seed(1)
    x1 <- rnorm(100)
    x2 <- x1 + rnorm(100, 0.1, 0.1)
    x3 <- x1 + rnorm(100, 1, 1)
    x4 <- x2 + x3
    data <- data.frame(x1, x2, x3, x4)
    comboInfo <- findLinearCombos(data)
    dataFilted <- data[, -comboInfo$remove]
    head(dataFilted)
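A linear dependency shows up as a numerical rank smaller than the number of columns. The base R sketch below checks this with a QR decomposition on the same simulated data. It only illustrates the idea and is not claimed to be how findLinearCombos is implemented.

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, 0.1, 0.1)
x3 <- x1 + rnorm(100, 1, 1)
x4 <- x2 + x3                       # exact linear combination
data <- data.frame(x1, x2, x3, x4)

# The numerical rank is smaller than the number of columns,
# which signals at least one linear dependency.
rank_full <- qr(as.matrix(data))$rank
rank_full
ncol(data)

# Dropping x4 restores full column rank.
rank_reduced <- qr(as.matrix(data[, c("x1", "x2", "x3")]))$rank
rank_reduced
```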
pre-processing
Centering and Scaling
    set.seed(1)
    x1 <- rnorm(100)
    x2 <- 3 + 3*x1 + rnorm(100)
    x3 <- 2 + 2*x1 + rnorm(100)
    data <- data.frame(x1, x2, x3)
    summary(data)
    preProc <- preProcess(data, method = c("center", "scale"))
    dataProced <- predict(preProc, data)
    summary(dataProced)
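Because preProcess stores the estimated means and standard deviations, the same object can be applied to new data. The sketch below estimates the parameters on a training subset and reuses them on held-out rows. The 80/20 row split is a hypothetical choice for illustration.

```r
library(caret)

set.seed(1)
x1 <- rnorm(100)
data <- data.frame(x1 = x1, x2 = 3 + 3 * x1 + rnorm(100))

train <- data[1:80, ]    # hypothetical split, for illustration only
test  <- data[81:100, ]

# Means and standard deviations are estimated on the training rows only...
preProc <- preProcess(train, method = c("center", "scale"))

# ...and then reused unchanged for both sets.
trainScaled <- predict(preProc, train)
testScaled  <- predict(preProc, test)

round(colMeans(trainScaled), 10)      # zero by construction
round(apply(trainScaled, 2, sd), 10)  # one by construction
colMeans(testScaled)                  # close to zero, but not exactly
```

Estimating the parameters on the training rows alone keeps information from the test rows out of the model.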
pre-processing
Imputation: bagImpute/knnImpute
    data <- iris[, -5]
    data[1, 2] <- NA
    data[2, 1] <- NA
    # note: knnImpute also centers and scales the predictors
    impu <- preProcess(data, method = 'knnImpute')
    dataProced <- predict(impu, data)
pre-processing
transformation: BoxCox/PCA
    data <- iris[, -5]
    pcaProc <- preProcess(data, method = 'pca')
    dataProced <- predict(pcaProc, data)
    head(dataProced)
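The number of components kept by the PCA step can be controlled explicitly. The sketch below uses the pcaComp and thresh arguments of preProcess; the values 2 and 0.99 are arbitrary choices for illustration.

```r
library(caret)

data <- iris[, -5]

# Keep exactly two components (pcaComp overrides the variance threshold).
pca2 <- preProcess(data, method = "pca", pcaComp = 2)
scores2 <- predict(pca2, data)
head(scores2)

# Keep as many components as needed to explain 99% of the variance.
pca99 <- preProcess(data, method = "pca", thresh = 0.99)
scores99 <- predict(pca99, data)
ncol(scores99)
```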
data splitting
create balanced splits of the data
    set.seed(1)
    trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE, times = 1)
    head(trainIndex)
    irisTrain <- iris[trainIndex, ]
    irisTest <- iris[-trainIndex, ]
    summary(irisTest$Species)
· createResample can be used to make simple bootstrap samples
· createFolds can be used to generate balanced cross-validation groupings from a set of data.
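The two helper functions mentioned above can be tried on the same iris data. This is a minimal sketch; the choices of k = 10 folds and 5 bootstrap samples are arbitrary.

```r
library(caret)

set.seed(1)

# 10 folds of held-out row indices, stratified by Species.
folds <- createFolds(iris$Species, k = 10, list = TRUE, returnTrain = FALSE)
sapply(folds, length)

# Every row is held out exactly once across the folds.
heldOut <- sort(unlist(folds, use.names = FALSE))

# 5 bootstrap samples: row indices drawn with replacement, same size as the data.
boots <- createResample(iris$Species, times = 5, list = TRUE)
sapply(boots, function(idx) length(unique(idx)))
```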
Model Training and Parameter Tuning
The train function can be used to
· evaluate, using resampling, the effect of model tuning parameters on performance
· choose the "optimal" model across these parameters
· estimate model performance from a training set
Model Training and Parameter Tuning
prepare data
    data(PimaIndiansDiabetes2, package = 'mlbench')
    data <- PimaIndiansDiabetes2
    library(caret)

    # scale and center
    preProcValues <- preProcess(data[, -9], method = c("center", "scale"))
    scaleddata <- predict(preProcValues, data[, -9])

    # YeoJohnson transformation
    preProcbox <- preProcess(scaleddata, method = c("YeoJohnson"))
    boxdata <- predict(preProcbox, scaleddata)
Model Training and Parameter Tuning
prepare data
    # bagimpute
    preProcimp <- preProcess(boxdata, method = "bagImpute")
    procdata <- predict(preProcimp, boxdata)
    procdata$class <- data[, 9]

    # data splitting
    inTrain <- createDataPartition(procdata$class, p = 0.75)[[1]]
    length(inTrain)
    trainData <- procdata[inTrain, 1:8]
    trainClass <- procdata[inTrain, 9]
    testData <- procdata[-inTrain, 1:8]
    testClass <- procdata[-inTrain, 9]
Model Training and Parameter Tuning
define sets of model parameter values to evaluate
    tunedf <- data.frame(.cp = seq(0.001, 0.2, length.out = 10))
Model Training and Parameter Tuning
define the type of resampling method
· k-fold cross-validation (once or repeated)
· leave-one-out cross-validation
· bootstrap (simple estimation or the 632 rule)
    fitControl <- trainControl(method = "repeatedcv",
                               # 10-fold cross validation
                               number = 10,
                               # repeated 3 times
                               repeats = 3)
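The other resampling schemes in the list above are selected through the method argument of trainControl. The sketch below shows one control object per scheme. The object names and the choice of 25 bootstrap resamples are illustrative assumptions.

```r
library(caret)

# Plain 10-fold cross-validation, run once.
cvControl <- trainControl(method = "cv", number = 10)

# Leave-one-out cross-validation (number and repeats are not used).
looControl <- trainControl(method = "LOOCV")

# Bootstrap with 25 resamples: simple estimate and the 632 rule.
bootControl    <- trainControl(method = "boot", number = 25)
boot632Control <- trainControl(method = "boot632", number = 25)

sapply(list(cvControl, looControl, bootControl, boot632Control),
       function(ctrl) ctrl$method)
```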
Model Training and Parameter Tuning
start training
    treemodel <- train(x = trainData,
                       y = trainClass,
                       method = 'rpart',
                       trControl = fitControl,
                       tuneGrid = tunedf)
Model Training and Parameter Tuning
look at the final result
    treemodel
    plot(treemodel)
The trainControl Function
· method: the resampling method.
· number and repeats: number sets the number of folds in K-fold cross-validation, or the number of resampling iterations for bootstrapping and leave-group-out cross-validation; repeats applies only to repeated K-fold cross-validation.
· verboseIter: a logical for printing a training log.
· returnData: a logical for saving the data into a slot called trainingData.
· classProbs: a logical value determining whether class probabilities should be computed for held-out samples during resampling.
· summaryFunction: a function to compute alternate performance summaries.
· selectionFunction: a function to choose the optimal tuning parameters.
· returnResamp: a character string containing one of the following values: "all", "final" or "none". This specifies how much of the resampled performance measures to save.
Alternate Performance Metrics
Default performance metrics:
· regression: RMSE and R2
· classification: accuracy and Kappa
Another built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve. It needs class probabilities, so classProbs must be TRUE.
    fitControl <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 3,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)
    treemodel <- train(x = trainData,
                       y = trainClass,
                       method = 'rpart',
                       trControl = fitControl,
                       tuneGrid = tunedf,
                       metric = "ROC")
    treemodel
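A summary function can also be written by hand. The sketch below defines a hypothetical balancedAccuracy function with the (data, lev, model) signature that caret expects, and checks it on a tiny made-up data frame rather than inside train.

```r
# Hypothetical custom summary: balanced accuracy for a two-class problem.
# caret calls the function with a data frame holding `obs` and `pred`
# columns, the class levels in `lev`, and the model name in `model`.
balancedAccuracy <- function(data, lev = NULL, model = NULL) {
  positive <- lev[1]
  sens <- mean(data$pred[data$obs == positive] == positive)
  spec <- mean(data$pred[data$obs != positive] != positive)
  c(BalAcc = (sens + spec) / 2, Sens = sens, Spec = spec)
}

# Tiny made-up example to check the function outside of train().
lev <- c("neg", "pos")
toy <- data.frame(
  obs  = factor(c("neg", "neg", "neg", "pos", "pos", "pos"), levels = lev),
  pred = factor(c("neg", "neg", "pos", "pos", "pos", "neg"), levels = lev)
)
result <- balancedAccuracy(toy, lev = lev)
result
```

To use it for tuning, the function would be passed as summaryFunction in trainControl, with metric = "BalAcc" in the call to train.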
Extracting Predictions
Predictions can be made from these objects as usual.
    pre <- predict(treemodel, testData)
    pre <- predict(treemodel, testData, type = "prob")
Evaluating Test Sets
caret also contains several functions that can be used to describe the performance of classification models
    testPred <- predict(treemodel, testData)
    testPred.prob <- predict(treemodel, testData, type = 'prob')
    postResample(testPred, testClass)
    confusionMatrix(testPred, testClass)
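For classification, postResample reports accuracy and Kappa. The base R sketch below shows how the two numbers relate to a confusion table, using made-up labels rather than the model predictions above.

```r
# Made-up predictions and reference labels, for illustration only.
lev  <- c("neg", "pos")
obs  <- factor(c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"), levels = lev)
pred <- factor(c("neg", "neg", "neg", "pos", "pos", "pos", "neg", "neg"), levels = lev)

tab <- table(Prediction = pred, Reference = obs)
n   <- sum(tab)

# Observed agreement: the share of cases on the diagonal.
accuracy <- sum(diag(tab)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(tab) * colSums(tab)) / n^2

# Kappa rescales accuracy so that 0 means chance-level agreement.
kappaStat <- (accuracy - expected) / (1 - expected)

c(Accuracy = accuracy, Kappa = kappaStat)
```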
Exploring and Comparing Resampling Distributions
· Within-Model Comparing
    densityplot(treemodel, pch = "|")
Exploring and Comparing Resampling Distributions
· Between-Models Comparing
· let's build an nnet model and compare the performance of the two models
    tunedf <- expand.grid(.decay = 0.1, .size = 1:8, .bag = TRUE)
    nnetmodel <- train(x = trainData,
                       y = trainClass,
                       method = 'avNNet',
                       trControl = fitControl,
                       trace = FALSE,
                       linout = FALSE,
                       metric = "ROC",
                       tuneGrid = tunedf)
    nnetmodel
Exploring and Comparing Resampling Distributions
Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using resamples.
    resamps <- resamples(list(tree = treemodel, nnet = nnetmodel))
    bwplot(resamps)
    densityplot(resamps, metric = 'ROC')
We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models.
    difValues <- diff(resamps)
    summary(difValues)
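The idea behind the test on the differences can be sketched in base R. The per-resample ROC values below are simulated and are not results from the models above. The sketch assumes both models were scored on the same resamples, which is what makes a paired test appropriate.

```r
set.seed(1)

# Simulated per-resample ROC values for two models (NOT real results).
# Both models are scored on the same 30 resamples, so the values are paired.
rocTree <- runif(30, min = 0.70, max = 0.80)
rocNnet <- rocTree + rnorm(30, mean = 0.03, sd = 0.02)

# Paired t-test on the per-resample differences;
# the null hypothesis is that the mean difference is zero.
pairedTest <- t.test(rocTree, rocNnet, paired = TRUE)
pairedTest$estimate
pairedTest$p.value
```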
Variable importance evaluation
Variable importance evaluation functions can be separated into two groups:
· model-based approach
· Model Independent approach
  - For classification, ROC curve analysis is conducted on each predictor.
  - For regression, the relationship between each predictor and the outcome is evaluated
    # model-based approach
    treeimp <- varImp(treemodel)
    plot(treeimp)

    # Model Independent approach
    RocImp <- varImp(treemodel, useModel = FALSE)
    plot(RocImp)
    # or
    RocImp <- filterVarImp(x = trainData, y = trainClass)
    plot(RocImp)
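The per-predictor ROC analysis can be illustrated in base R. The sketch below computes the area under the ROC curve for each predictor through the rank-sum identity. The helper aucOne is hypothetical, the two-class iris subset is only a convenient example, and the result is not claimed to match filterVarImp exactly.

```r
# Area under the ROC curve for one predictor, via the rank-sum identity.
# `x` is numeric, `y` is a two-level factor.
aucOne <- function(x, y) {
  r  <- rank(x)
  n1 <- sum(y == levels(y)[1])
  n2 <- sum(y == levels(y)[2])
  u  <- sum(r[y == levels(y)[2]]) - n2 * (n2 + 1) / 2
  auc <- u / (n1 * n2)
  max(auc, 1 - auc)   # direction of the predictor does not matter
}

# Two-class subset of iris, used only as a convenient example.
twoClass <- droplevels(iris[iris$Species != "setosa", ])
importance <- sapply(twoClass[, 1:4], aucOne, y = twoClass$Species)
sort(importance, decreasing = TRUE)
```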
feature selection
· Many models do not necessarily use all the predictors
· Feature Selection Using Search Algorithms ("wrapper" approach)
· Feature Selection Using Univariate Filters ("filter" approach)
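The filter approach can be sketched by hand in base R: score each predictor on its own, then keep those that pass a threshold. The helper anovaP, the added noise column and the 0.05 cutoff are assumptions for illustration. caret's own filter tooling (sbf) is not used here.

```r
# Score each predictor by the p-value of a one-way ANOVA against the class.
anovaP <- function(x, y) {
  fit <- anova(lm(x ~ y))
  fit[["Pr(>F)"]][1]
}

set.seed(1)
predictors <- iris[, 1:4]
predictors$noise <- rnorm(nrow(iris))   # pure-noise column, added for contrast

pValues <- sapply(predictors, anovaP, y = iris$Species)
signif(pValues, 3)

# Keep predictors whose p-value clears a (hypothetical) 0.05 threshold.
selected <- names(pValues)[pValues < 0.05]
selected
```

In practice the filter should be applied inside each resample so that the performance estimate accounts for the selection step.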
feature selection: wrapper approach
feature selection: wrapper approach
feature selection based on random forest model
pre-defined sets of functions: linear regression (lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs)
    ctrl <- rfeControl(functions = rfFuncs,
                       method = "repeatedcv",
                       number = 10,
                       repeats = 3,
                       verbose = FALSE,
                       returnResamp = "final")
    Profile <- rfe(x = trainData,
                   y = trainClass,
                   sizes = 1:8,
                   rfeControl = ctrl)
    Profile
feature selection: wrapper approach
feature selection based on custom model
    tunedf <- data.frame(.cp = seq(0.001, 0.2, length.out = 5))
    fitControl <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 3,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary)
    customFuncs <- caretFuncs
    customFuncs$summary <- twoClassSummary
    ctrl <- rfeControl(functions = customFuncs,
                       method = "repeatedcv",
                       number = 10,
                       repeats = 3,
                       verbose = FALSE,
                       returnResamp = "final")
    Profile <- rfe(x = trainData,
                   y = trainClass,
                   sizes = 1:8,
                   method = 'rpart',
                   rfeControl = ctrl,
                   # the slide is cut off after rfeControl; the two arguments
                   # below are an assumed completion using the objects defined above
                   tuneGrid = tunedf,
                   trControl = fitControl)
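parallel processing
    # rebuild the avNNet grid; tunedf was reassigned for rpart on the previous slide
    tunedf <- expand.grid(.decay = 0.1, .size = 1:8, .bag = TRUE)
    system.time({
      library(doParallel)
      registerDoParallel(cores = 2)
      nnetmodel.para <- train(x = trainData,
                              y = trainClass,
                              method = 'avNNet',
                              trControl = fitControl,
                              trace = FALSE,
                              linout = FALSE,
                              metric = "ROC",
                              tuneGrid = tunedf)
    })
    # compare the sequential and parallel timings
    nnetmodel$times
    nnetmodel.para$times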
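When the parallel work is finished, the workers can be released explicitly. The sketch below uses an explicit cluster object instead of the cores argument so that it can be stopped afterwards. The worker count of 2 is an arbitrary choice.

```r
library(doParallel)

# Create an explicit cluster with two workers and register it with foreach.
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
workers <- getDoParWorkers()
workers

# ... calls to train() placed here would run their resamples on the workers ...

# Release the workers and fall back to sequential execution.
stopCluster(cl)
registerDoSEQ()
getDoParWorkers()
```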
exercise-1
use knn method to train model
    library(caret)
    fitControl <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 3)
    tunedf <- data.frame(.k = seq(3, 20, by = 2))
    knnmodel <- train(x = trainData,
                      y = trainClass,
                      method = 'knn',
                      trControl = fitControl,
                      tuneGrid = tunedf)
    plot(knnmodel)