SMS Spam Filter Design Using R:
A Machine Learning Approach
Reza Rahimi, Ph.D. Candidate,
School of Information and Computer Science, University of California, Irvine.
Introduction
• In basic terms, Machine Learning (ML) is about the construction of systems that can learn from data.
• It is used as a tool for knowledge discovery.
• Several important classes of problems can be solved using machine learning techniques, such as:
  – Classification (Prediction):
    • Given a collection of records as a training set.
    • Each record contains a set of attributes, one of which is called the class.
    • The problem is to find a model for the class attribute as a function of the other attributes.
    – Example: Spam or Ham, Handwriting Recognition,…
  – Clustering (Description):
    • Given a set of data points, with some attributes, and a similarity measure (metric) among them.
    • The goal is to find clusters such that data points in one cluster are more similar to one another than to points in other clusters.
    – Example: Document Clustering, people categorization,…
  – Association (Description):
    • Given a set of records, each of which contains some items from a given collection.
    • The goal is to produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
    – Example: user habit pattern recognition,…
  – Regression (Prediction):
    • Predict the value of a given continuous variable based on the values of other variables.
    • The dependency model can be linear or nonlinear.
    – Example: Stock prediction
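As a rough illustration of three of the problem classes above, the sketch below uses only base R and the built-in `iris` and `cars` data sets. The data sets and model choices (`glm`, `kmeans`, `lm`) are my assumptions for illustration and are not part of the SMS example; association rules are left out because they need an extra package.

```r
# Illustrative sketch only: built-in data sets stand in for real problems.
set.seed(1)

# Classification: model a binary class attribute as a function of other attributes.
flowers <- iris[iris$Species != "setosa", ]
flowers$is_virginica <- as.numeric(flowers$Species == "virginica")
clf <- glm(is_virginica ~ Petal.Length + Petal.Width,
           data = flowers, family = binomial)
predicted <- as.numeric(predict(clf, type = "response") > 0.5)
accuracy <- mean(predicted == flowers$is_virginica)  # accuracy on the training data

# Clustering: group points by similarity (Euclidean distance) without using labels.
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

# Regression: predict a continuous variable (stopping distance) from another (speed).
reg <- lm(dist ~ speed, data = cars)

print(accuracy)
print(table(clusters$cluster))
print(coef(reg))
```

The classification accuracy here is measured on the same records used for fitting, so it is optimistic; the SMS example later in the slides evaluates on a separate test set.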
Problem Solving Using Machine Learning Framework
• ML is a mature and well-developed area.
• For each of the problem classes mentioned, it offers a rich set of tools, techniques and algorithms.
• These tools are available in different languages and frameworks such as R, Matlab, Java, C++, Mahout,…
• The following procedure can be considered the general methodology for problem solving in this framework:
1. Get a sense of the data: feature extraction, dimension reduction, noise cancellation,…
2. Problem modeling: classification, clustering, association, regression,…
3. Run standard ML algorithms: check the errors according to the standard ML metrics.
4. Select the methods that satisfy your performance criteria and metrics.
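The steps above can be sketched on a small scale as follows. This is a minimal illustration on the built-in `mtcars` data, assuming a regression problem, two candidate linear models, and RMSE as the metric; none of these choices come from the SMS example.

```r
# Sketch of the four steps on the built-in mtcars data (an assumption for illustration).
set.seed(42)

# 1. Get a sense of the data: keep a few features and check for missing values.
cars_data <- mtcars[, c("mpg", "wt", "hp")]
stopifnot(!anyNA(cars_data))

# 2. Problem modeling: predicting mpg (continuous) is a regression problem.
train_rows <- sample(nrow(cars_data), size = 22)
train <- cars_data[train_rows, ]
test  <- cars_data[-train_rows, ]

# 3. Run standard algorithms and measure the error with a standard metric (RMSE).
rmse <- function(model, newdata) {
  sqrt(mean((newdata$mpg - predict(model, newdata))^2))
}
candidates <- list(
  weight_only      = lm(mpg ~ wt, data = train),
  weight_and_power = lm(mpg ~ wt + hp, data = train)
)
errors <- sapply(candidates, rmse, newdata = test)

# 4. Select the method that best satisfies the performance criterion.
best <- names(which.min(errors))
print(errors)
print(best)
```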
• In the next section I describe the design of an SMS spam filter in the R language, based on the methodology above.
SMS Spam Filter using R

    # This file contains the SMS spam filter code with different classifiers in the R language
    # Written by: Reza Rahimi
    # Initialization: raw data (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection),
    # loading required packages, libraries and function declarations

    # required package for text mining
    if(!require("tm"))
      install.packages("tm")

    # required package for SVM
    if(!require("e1071"))
      install.packages("e1071")

    # required package for KNN
    if(!require("RWeka"))
      install.packages("RWeka", dependencies = TRUE)

    # required package for Adaboost
    if(!require("ada"))
      install.packages("ada")

    library("tm")
    library("e1071")
    library(RWeka)
    library("ada")
R Codes (Cont.)

    # Initialize random generator
    set.seed(1245)

    # This function builds a vector (Vector Space Model) from a text message
    # by counting how often each highly repeated word occurs in it
    vsm<-function(message,highlyrepeatedwords){

      tokenizedmessage<-strsplit(message, "\\s+")[[1]]

      # making vector; seq_along() keeps the loops safe for empty inputs
      v<-rep(0, length(highlyrepeatedwords))
      for(i in seq_along(highlyrepeatedwords)){
        for(j in seq_along(tokenizedmessage)){
          if(highlyrepeatedwords[i]==tokenizedmessage[j]){
            v[i]<-v[i]+1
          }
        }
      }
      return (v)
    }

    # loading data. Original data is from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
    # The file is expected to start with a header line naming the columns type and sms.
    print("Loading SMS Spams and Hams!")
    # quote = "" keeps quotation marks inside messages from being read as field delimiters
    smstable<-read.csv("C:/tmp/smsspamcollection.txt", header = TRUE, sep = "\t", quote = "",
                       colClasses=c("type"="character","sms"="character"))
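The nested loops in `vsm` compare every vocabulary word with every token. As a possible alternative, the sketch below counts all words in one pass with `factor` and `table`. The name `vsm_fast` and the example vocabulary are hypothetical and not part of the original code.

```r
# Hypothetical helper (not in the original slides): a vectorized variant of vsm().
vsm_fast <- function(message, highlyrepeatedwords) {
  tokens <- strsplit(message, "\\s+")[[1]]
  # factor() with fixed levels keeps one slot per vocabulary word, in order;
  # tokens outside the vocabulary become NA and are not counted by table().
  as.numeric(table(factor(tokens, levels = highlyrepeatedwords)))
}

vocabulary <- c("free", "call", "win")
v <- vsm_fast("free prize call now free", vocabulary)
print(v)  # 2 1 0
```

The result has the same length and ordering as the vocabulary, so it can replace `vsm` wherever a count vector is needed, provided the vocabulary contains no duplicates.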
R Codes (Cont.)

    smstabletmp<-smstable

    print("Extracting Ham and Spam Basic Statistics!")

    # encode ham as 1 and spam as 0
    smstabletmp$type[smstabletmp$type=="ham"] <- 1
    smstabletmp$type[smstabletmp$type=="spam"] <- 0

    # Convert character data into numeric
    tmp<-as.numeric(smstabletmp$type)

    # Basic statistics: mean and variance of the ham indicator
    # (the mean is the fraction of messages that are ham)
    hamavg<-mean(tmp)
    print("Average Ham is :")
    print(hamavg)

    hamvariance<-var(tmp)
    print("Var of Ham is :")
    print(hamvariance)

    print("Extract average token of Hams and Spams!")

    # counters for the number of messages and tokens per class
    nohamtokens<-0
    noham<-0
    nospamtokens<-0
    nospam<-0
R Codes (Cont.)

    # count messages and whitespace-separated tokens per class
    for(i in 1:length(smstable$type)){
      if(smstable[i,1]=="ham"){
        nohamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nohamtokens
        noham<-noham+1
      }else{
        nospamtokens<-length(strsplit(smstable[i,2], "\\s+")[[1]])+nospamtokens
        nospam<-nospam+1
      }
    }

    totaltokens<-nospamtokens+nohamtokens
    print("Total number of tokens is:")
    print(totaltokens)

    avgtokenperham<-nohamtokens/noham
    print("Average number of tokens per ham message")
    print(avgtokenperham)

    avgtokenperspam<-nospamtokens/nospam
    print("Average number of tokens per spam message")
    print(avgtokenperspam)

    print("Make two different sets, training data and test data!")
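The same per-class token averages can also be computed without an explicit loop. The sketch below is one way to do it in base R, using a three-message toy table that I made up to stand in for `smstable`.

```r
# Toy data standing in for smstable (illustrative messages, not from the data set).
toy <- data.frame(
  type = c("ham", "spam", "ham"),
  sms  = c("see you soon", "win a free prize now", "ok"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated tokens in each message.
tokens_per_message <- lengths(strsplit(toy$sms, "\\s+"))

# Average token count per class, replacing the explicit loop and counters.
avg_tokens <- tapply(tokens_per_message, toy$type, mean)

print(sum(tokens_per_message))  # total number of tokens: 9
print(avg_tokens)               # ham: 2, spam: 5
```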
R Codes (Cont.)

    # select the fraction of the data to use as the training set
    trdatapercent<-0.3

    # training data set
    trdata=NULL

    # test data set
    tedata=NULL

    # each message goes to the training set with probability trdatapercent,
    # otherwise to the test set; messages are lower-cased on the way
    for(i in 1:length(smstable$type)){
      if(runif(1)<trdatapercent){
        trdata=rbind(trdata,c(smstable[i,1],tolower(smstable[i,2])))
      }
      else{
        tedata=rbind(tedata,c(smstable[i,1],tolower(smstable[i,2])))
      }
    }

    print("Training data size is:")
    print(dim(trdata))

    print("Test data size is:")
    print(dim(tedata))
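Growing a matrix with `rbind` inside a loop copies it on every iteration, which gets slow for large data sets. A possible vectorized form of the same random split is sketched below; the row count `n` is a hypothetical stand-in for the number of messages.

```r
set.seed(1245)
n <- 1000                 # hypothetical number of messages
trdatapercent <- 0.3

# One uniform draw per row; rows with a draw below the threshold go to training.
in_training   <- runif(n) < trdatapercent
training_rows <- which(in_training)
test_rows     <- which(!in_training)

# With a data frame such as smstable, the two sets would then be
# smstable[in_training, ] and smstable[!in_training, ].
print(length(training_rows))
print(length(test_rows))
```

As in the loop version, the training share is only approximately 30 percent because each row is assigned independently.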
R Codes (Cont.)

    # Text feature extraction using tm package

    trsmses<-Corpus(VectorSource(trdata[,2]))
    trsmses<-tm_map(trsmses, stripWhitespace)
    # recent tm versions need base functions such as tolower wrapped in content_transformer()
    trsmses<-tm_map(trsmses, content_transformer(tolower))
    trsmses<-tm_map(trsmses, removeWords, stopwords("english"))

    dtm <- DocumentTermMatrix(trsmses)

    # terms that occur at least 80 times in the training messages
    highlyrepeatedwords<-findFreqTerms(dtm, 80)

    # These highly used words are used as an index to make VSM
    # (vector space model) for training data and test data

    # vectorized training data set
    vtrdata=NULL

    # vectorized test data set
    vtedata=NULL
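To make the idea behind the frequent-term selection concrete, here is a minimal base R sketch of the same step: count how often each term occurs across all messages and keep those at or above a threshold. The toy messages, the two-word stop list and the threshold of 2 are assumptions for illustration, and the sketch does not reproduce the tokenizer of the tm package.

```r
# Toy corpus (illustrative); the slides use the training messages instead.
messages   <- c("free prize call now", "call me now", "free free entry", "see you now")
stop_words <- c("me", "you")   # tiny stand-in for stopwords("english")
min_count  <- 2                # plays the role of the 80 in findFreqTerms(dtm, 80)

# Lower-case, split on whitespace, and drop stop words.
tokens <- unlist(strsplit(tolower(messages), "\\s+"))
tokens <- tokens[!(tokens %in% stop_words)]

# Count each term over the whole corpus and keep the frequent ones.
term_counts    <- table(tokens)
frequent_terms <- sort(names(term_counts[term_counts >= min_count]))
print(frequent_terms)  # "call" "free" "now"
```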
R Codes (Cont.)

    # each row is the class label (1 = ham, 0 = spam) followed by the word counts
    for(i in 1:length(trdata[,2])){
      if(trdata[i,1]=="ham"){
        vtrdata=rbind(vtrdata,c(1,vsm(trdata[i,2],highlyrepeatedwords)))
      }
      else{
        vtrdata=rbind(vtrdata,c(0,vsm(trdata[i,2],highlyrepeatedwords)))
      }
    }

    for(i in 1:length(tedata[,2])){
      if(tedata[i,1]=="ham"){
        vtedata=rbind(vtedata,c(1,vsm(tedata[i,2],highlyrepeatedwords)))
      }
      else{
        vtedata=rbind(vtedata,c(0,vsm(tedata[i,2],highlyrepeatedwords)))
      }
    }
R Codes (Cont.)

    # Run different classification algorithms
    # different SVMs with different kernels
    # the labels are numeric (1 = ham, 0 = spam), so classification is requested
    # explicitly; without type, svm() would fit a regression for a numeric y
    print("----------------------------------SVM-----------------------------------------")
    print("Linear Kernel")
    svmlinmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C-classification', kernel='linear')
    summary(svmlinmodel)
    predictionlin <- predict(svmlinmodel, vtedata[,2:length(vtedata[1,])])
    # rows are predictions, columns are true labels
    tablinear <- table(pred = predictionlin , true = vtedata[,1])
    print(tablinear)
    # fraction of correctly classified test messages (overall accuracy)
    precisionlin<-sum(diag(tablinear))/sum(tablinear)
    print("General Error using Linear SVM is (in percent):")
    print((1-precisionlin)*100)
    # ham classified as spam, relative to all true ham
    print("Ham Error using Linear SVM is (in percent):")
    print((tablinear[1,2]/sum(tablinear[,2]))*100)
    # spam classified as ham, relative to all true spam
    print("Spam Error using Linear SVM is (in percent):")
    print((tablinear[2,1]/sum(tablinear[,1]))*100)

    print("Polynomial Kernel")
    svmpolymodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C-classification', kernel='polynomial',
                        probability=FALSE)
    summary(svmpolymodel)
    predictionpoly <- predict(svmpolymodel, vtedata[,2:length(vtedata[1,])])
    tabpoly <- table(pred = predictionpoly , true = vtedata[,1])
    print(tabpoly)

    print("Radial Kernel")
    svmradmodel <- svm(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],type='C-classification', kernel = "radial",
                       gamma = 0.09, cost = 1, probability=FALSE)
    summary(svmradmodel)
    predictionrad <- predict(svmradmodel, vtedata[,2:length(vtedata[1,])])
    tabrad <- table(pred = predictionrad, true = vtedata[,1])
    print(tabrad)
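The three error figures printed for the linear kernel are read off the confusion table by position. The sketch below wraps the same arithmetic in a small function so it can be reused for the other kernels; the function name `error_rates` and the example label vectors are hypothetical.

```r
# Hypothetical helper: error rates from predicted and true labels (1 = ham, 0 = spam).
error_rates <- function(pred, true) {
  # Fixing the levels keeps the table 2x2 even if one class is never predicted.
  tab <- table(pred = factor(pred, levels = c(0, 1)),
               true = factor(true, levels = c(0, 1)))
  c(general = 100 * (1 - sum(diag(tab)) / sum(tab)),
    ham     = 100 * tab["0", "1"] / sum(tab[, "1"]),  # ham flagged as spam
    spam    = 100 * tab["1", "0"] / sum(tab[, "0"]))  # spam let through as ham
}

# Ten made-up test labels: six ham, four spam, with one mistake in each class.
true  <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 1)
pred  <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 1)
rates <- error_rates(pred, true)
print(rates)
```

In this toy case the general error is 20 percent, the ham error is 1 in 6 and the spam error is 1 in 4, which shows why the per-class errors are worth reporting separately when the classes are unbalanced.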
R Codes (Cont.)

    print("----------------------------------KNN-----------------------------------------")
    # IBk takes a formula and a data frame, and needs a factor class attribute for classification
    data<-data.frame(sms=vtrdata[,2:length(vtrdata[1,])],type=factor(vtrdata[,1], levels=c(0,1)))
    classifier <- IBk(type ~ ., data = data, control = Weka_control(K = 20, X = TRUE))
    summary(classifier)
    evaluate_Weka_classifier(classifier, newdata =
        data.frame(sms=vtedata[,2:length(vtedata[1,])],type=factor(vtedata[,1], levels=c(0,1))))

    print("---------------------------------Adaboost-------------------------------------")
    # the test set is passed in so that test errors are tracked while boosting
    adaptiveboost<-ada(x=vtrdata[,2:length(vtrdata[1,])],y=vtrdata[,1],
                       test.x=vtedata[,2:length(vtedata[1,])], test.y=vtedata[,1],
                       loss="logistic", type="gentle", iter=100)
    summary(adaptiveboost)
    # plot the importance of each word feature
    varplot(adaptiveboost)
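For readers who want to see what a nearest-neighbour classifier does with the word-count vectors, here is a minimal k-NN sketch in base R. It is an illustration of the technique only, not Weka's `IBk` implementation; the function name `knn_predict` and the two-feature toy data are my assumptions.

```r
# Minimal k-nearest-neighbour classifier in base R (illustration only, not Weka's IBk).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  apply(new_x, 1, function(row) {
    # Euclidean distance from this point to every training point.
    distances <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- order(distances)[seq_len(k)]
    # Majority vote among the k nearest labels.
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Hypothetical word-count features: column 1 = count of "free", column 2 = count of "ok".
train_x <- rbind(c(3, 0), c(2, 1), c(3, 1), c(0, 2), c(0, 3), c(1, 3))
train_y <- c("spam", "spam", "spam", "ham", "ham", "ham")
new_x   <- rbind(c(3, 0.5), c(0.5, 3))

predictions <- knn_predict(train_x, train_y, new_x, k = 3)
print(predictions)  # "spam" "ham"
```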
Conclusions
• In these slides I gave a broad overview of ML and the different problems that can be solved in this framework.
• I reviewed in detail one way of implementing an SMS spam filter using ML techniques in the R language.
• ML provides a strong framework for solving problems in the Big Data domain.