Introduction to RapidMiner Studio V7
Dublin R Lightning Talks Event
Introduction to RapidMiner
Geraldine Gray, PhD
March 24th 2016
Introductions
Geraldine is a lecturer in Institute of Technology Blanchardstown (ITB)
Coordinator for ITB's MSc in Applied Data Science and Analytics
https://ie.linkedin.com/in/geraldine-gray-9b2b187
@GGrayITB geraldine.gray.itb
Overview
Objective:
• Introduction to RapidMiner Studio for data analytics
Agenda:
1. Overview of RapidMiner Studio interface
2. Importing a dataset
3. Descriptive statistics and visualisation
4. Data modelling
5. Model evaluation
6. Data cleaning
7. Adding R script
G. Gray 3
Topic 1: Overview of RapidMiner Studio
Installing RapidMiner on your own machine
The latest version of RapidMiner Studio is V7; it can be downloaded from https://rapidminer.com/products/comparison/
• For Windows: download the rapidminer-install.exe and install. Defaults install it to C:\Program Files, and add it to the Start > Programs menu.
• For Mac: download the .dmg and add it to your Applications folder.
Background
RapidMiner comes with:
• Over 125 mining algorithms
• Over 100 data cleaning and preparation functions
• Over 30 charts for data visualisation
• A selection of metrics to evaluate model performance
Each function is available as an OPERATOR (which is implemented as a Java class). A process is built by connecting operators together, with the output of one operator passing as input to the next. This is all done by drag and drop.
Creating a repository
• All processes created in RapidMiner are saved to a repository. The repository will also store other objects, including datasets and prediction models.
• A repository maps to a folder on your machine created specifically for RapidMiner work.
Before starting RapidMiner Studio for the first time, create a folder somewhere on your machine that will store your processes and datasets from today's workshop.
• The folder can be local to the machine, on an external drive/USB, or in the cloud.
Start up RapidMiner
When you start RapidMiner Studio, you are presented with an initial introduction window. Close this window to see the main interface.
RapidMiner GUI
• Process design window
• Parameter settings for selected operation
• Log of activities, including errors. If this is missing, add from View / Show Panel
• Available operators
• Explanation of the selected operator
• Navigate repositories
RapidMiner toolbars
• Run process
• Stop process
• Automatically connect operators
• Undo / redo
• Save
• New / open
• Add/remove breakpoints
• Show and alter the order in which operators run
• Resize the process window
• Process design view
• View process results
• Add a note / comment
• Enable/disable an operator
Right click options:
Processes and Datasets
• Your RapidMiner repository (folder) will contain different types of objects, most commonly:
• Datasets: the actual data itself
  – The symbol is a blue cylinder
• Processes: a series of operators that are applied to a dataset to analyse it.
  – The symbol is two cogwheels
  – A process will read in a dataset, carry out various tasks on it, and output the results. A process does NOT change the original dataset.
Repositories
• RapidMiner comes with a repository called samples, which has a number of datasets and example processes.
  – You cannot edit the samples repository
To create your own repository, select the drop down box on the repository window, select ‘create repository’, and browse to the folder you created.
Finding an operator
• RapidMiner comes with many operators, so finding the one you want can be daunting at first.
• Once you get familiar with operator names, you can find them more easily using the filter at the top of the operator window.
List all operators that start with ‘read’
List all operators whose first word starts with ‘dec’, and 2nd word starts with ‘t’.
Topic 2: Importing a dataset
Reading in a dataset
There are two options for accessing a dataset:
1. You can use one of the many Read operators to read data into RapidMiner temporarily for a particular process.
2. You can import a dataset into your repository, where it will be available to all processes via the Retrieve operator. This is the most efficient method, as meta data is stored with the dataset.
• RapidMiner ships with a number of datasets already loaded in the SAMPLES repository.
Once a dataset is in a repository, you can access it using the Retrieve operator.
Wine Quality Dataset
We are first going to import the WINE QUALITY dataset from the UCI repository: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
Attributes:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Download the winequality-red.csv file from the UCI website. Take a look at the dataset in Excel or Notepad/Textpad. The first row is column headings. Columns are separated by ‘;’
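For an R audience, a minimal sketch of reading a file in this layout, assuming only what the slide states (a header row, fields separated by ';'). The three data rows and the reduced set of three columns are made-up stand-ins so that the example runs without a download; with the real file, its path would replace the `text =` argument.

```r
# Hypothetical sample in the same layout as the UCI file:
# the first row holds column headings, fields are separated by ';'.
sample_text <- paste(
  '"fixed acidity";"alcohol";"quality"',
  "7.0;9.5;5",
  "8.1;10.2;6",
  "6.5;11.0;7",
  sep = "\n"
)

# sep = ";" overrides the comma default of read.csv();
# check.names = FALSE keeps the spaces in the column headings.
wine <- read.csv(text = sample_text, sep = ";", check.names = FALSE)

str(wine)            # one row per wine, one column per attribute
table(wine$quality)  # distribution of the output variable
```

Getting the delimiter wrong would leave each line in a single column, which is the same problem the import preview in RapidMiner helps to spot.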
Google: UCI repository, and look for wine quality (not wine)
Importing the wine dataset into RapidMiner
1. Return to RapidMiner
2. Select ‘add data’; then ‘my computer’ and browse to the downloaded file.
3. You are presented with a number of screens to set the meta data for this dataset as follows...
Importing the wine dataset into RapidMiner
The first screen specifies import settings, including the column delimiter. A preview at the bottom tells you if the settings are correct.
Importing the wine dataset into RapidMiner
• The second screen specifies the data type for each attribute, and its role in the data analytics process
Most data types are intuitive.
Binominal: a binary attribute; it can only have two values. RapidMiner will assume binominal if an attribute has just two distinct values in the first 100 rows scanned. This is not always correct.
Polynominal: a non-numeric attribute with multiple values.
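The reason the guess "is not always correct" can be shown with a small sketch. This is not RapidMiner's code: `guess_type` and the `port` column are hypothetical, and the only rule taken from the slide is that two distinct values in the first 100 rows leads to binominal.

```r
# Sketch of the scanning rule described above (not RapidMiner's actual code):
# only the first n_scan values are inspected when guessing the type.
guess_type <- function(values, n_scan = 100) {
  if (is.numeric(values)) return("numeric")
  scanned <- head(values, n_scan)
  n_distinct <- length(unique(scanned[!is.na(scanned)]))
  if (n_distinct == 2) "binominal" else "polynominal"
}

# Hypothetical column: two values in the first 100 rows, a third one in row 101.
port <- c(rep(c("S", "C"), times = 50), "Q")

guess_type(port)                # "binominal": the third value was never scanned
guess_type(port, n_scan = 101)  # "polynominal" once row 101 is included
```

When a wrong guess like this shows up on the import screen, the data type can be changed there by hand.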
Importing the wine dataset into RapidMiner
ROLE
• Attributes without a role are used by mining algorithms to identify patterns in the dataset.
• Prediction models will attempt to predict the attribute with the role of LABEL.
• The attribute with the role of ID is a primary key, used in JOIN operations.
• You can specify other, user defined, roles for attributes to be ignored by mining algorithms.
Change the role of the final attribute, quality, to label.
Importing the wine dataset into RapidMiner
In the final screen, specify the name of the dataset, i.e. wine, and browse to the repository folder where it is to be stored. The dataset will now appear in your repository window.
Topic 3: Descriptive Statistics and Visualisation
Exploring a dataset
In the samples/data repository there are a number of datasets already imported (i.e. in the RM format). Click on the TITANIC dataset to open it. This automatically brings you to the results view.
Within the results view, there are five tabs on the left hand side. We will look at the first three:
1. Data: View the data in the dataset
2. Statistics: View summary statistics on the dataset
3. Charts: A range of visualizations of the dataset
The data view
• The data view lists all the rows in the dataset, and reports on the number of rows (examples) and columns (attributes) in the dataset.
• The filters on the right hand side allow you to investigate rows with missing values.
The statistics view
The statistics view gives meta data on each attribute, specifically:
– Data types
– Number of missing values
– Min, max, average for numeric attributes
– Least, Most and a list of values for non-numeric attributes
Clicking on an attribute will show a histogram for that attribute.
This is a good view for an initial quality assessment of:
1. Missing values
2. Outlier values
3. Attributes whose distribution of values is not as expected, indicating the dataset is not representative of the population of interest.
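The same kind of summary can be sketched in base R. The `passengers` data frame below is a made-up miniature, not the titanic sample data; it only illustrates the quantities the statistics view reports.

```r
# Hypothetical mini dataset with some missing values (NA).
passengers <- data.frame(
  age = c(22, NA, 30, 38, NA),
  sex = c("male", "female", "female", NA, "male"),
  stringsAsFactors = FALSE
)

# Number of missing values per attribute.
missing_counts <- colSums(is.na(passengers))

# Min, max and average for a numeric attribute, ignoring missing values.
age_stats <- c(
  min = min(passengers$age, na.rm = TRUE),
  max = max(passengers$age, na.rm = TRUE),
  average = mean(passengers$age, na.rm = TRUE)
)

# Value counts for a non-numeric attribute, least frequent first.
sex_counts <- sort(table(passengers$sex))

print(missing_counts)
print(age_stats)
print(sex_counts)
```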
The charts view
• The charts view gives you access to a range of visualisations for your dataset.
Go to the chart view of the titanic dataset. Under chart style, select ‘Histogram Color’. Set Histogram to ‘age’; Color to ‘Survived’; and reduce the Opaqueness of the histogram.
a) Does it appear that priority was given to children?
b) Instead of ‘age’ plot ‘sex’. Does it appear that priority was given to women?
c) Looking at a histogram of ‘class’, which class of passenger was most likely to survive?
The charts view
We are going to look at one more dataset, the iris dataset, which has its own Wikipedia page: https://en.wikipedia.org/wiki/Iris_flower_data_set
Attributes:
a1: Sepal Length
a2: Sepal Width
a3: Petal Length
a4: Petal Width
Class label:
Iris-setosa
Iris-versicolor
Iris-virginica
The charts view
• Navigate to the IRIS dataset in the samples/data repository. Double click to open it in the results view.
• In the charts view, select ‘Scatter Matrix’. This shows a scatter plot of all pairs of attributes, colour coded by class label.
a) Are the three classes well separated?
b) Select a Scatter 3-D Color plot. By default it colour codes by class label. Use your mouse to rotate the plot and so view it from different perspectives.
Close all tabs in the results view
Topic 4: Building a predictive model
Classification
A classification algorithm trains a model to predict a class label, which is one of the attributes in the dataset.
This class label defines groups in the dataset.
The algorithm learns what differentiates these groups from each other.
Class Label      A1   A2   A3   A4
Iris-setosa      5.1  3.5  1.4  0.2
Iris-setosa      5    3.6  1.4  0.2
Iris-setosa      5.7  3.8  1.7  0.3
Iris-setosa      4.6  3.6  1    0.2
Iris-setosa      5    3.3  1.4  0.2
Iris-versicolor  6    2.2  4    1
Iris-versicolor  6.7  3.1  4.4  1.4
Iris-versicolor  6.8  2.8  4.8  1.4
Iris-versicolor  5.7  3    4.2  1.2
Iris-versicolor  5.7  2.9  4.2  1.3
Iris-virginica   7.1  3    5.9  2.1
Iris-virginica   7.2  3.6  6.1  2.5
Iris-virginica   6.5  3.2  5.1  2
Iris-virginica   6.7  3.3  5.7  2.1
Iris-virginica   6    3    4.8  1.8
Classification algorithms
Classification algorithms use labeled data to learn how to identify instances of each class.
Will it be easy to train a model to differentiate between the three types of iris below?
[Images: Iris virginica, Iris versicolor, Iris setosa]
Classification algorithms
There are many classification algorithms implemented in RapidMiner, under modeling/predictive. We will look at one such algorithm: a Decision Tree
Starting a process...
• So far in RapidMiner, we have just looked at datasets; we haven't actually done anything with the data.
• In this section we will create a RapidMiner process that trains a classification model...
Return to the Design View
The process window should be empty
Starting a process
The process will start by retrieving a dataset.
– We will use the iris dataset
Navigate to the iris dataset in the samples/data repository, and drag it into the process window.
– This adds a Retrieve operator, which retrieves a dataset from the repository.
Building a model
– Drag ‘Decision Tree’ from the operators window onto the process window, after ‘Retrieve’.
– Connect the ‘out’ port from Retrieve (click on the semicircle) to the ‘tra’ port of the ‘Decision Tree’ (click on the semicircle)
– Connect both output ports of the Decision Tree to the process output port
About ports...
• Process input port
• Process output ports
• Operator input ports
• Operator output ports
• Mandatory input port
• Optional input port
• Output port has a value
• Output port does not have a value
Ports represent inputs to an operator, and outputs from an operator. Data and other objects are passed from one operator to the next in a process, as indicated by ports that are connected. Colours are used to indicate the type of data/object, e.g.:
purple: dataset
green: model
brown: model performance
Hover over a port to see the type of object required.
Connect matching colours
Run the process to build the model
• Run the process. RapidMiner will automatically bring you to the results view.
• There are two tabs in the results view (because we had two outputs from the process):
  – The dataset itself
  – The decision tree classification model
• Click on the Decision Tree tab
Classification model
The text on leaf nodes is the predicted class label.
Attributes: a1: Sepal Length, a2: Sepal Width, a3: Petal Length, a4: Petal Width
The height of the bar indicates the number of rows that matched this branch. Hover over the node to get the actual numbers.
A mix of colours indicates that not all rows matching this branch were in the same class.
Branches on the decision tree represent if..then.. rules, e.g. if a3 <= 2.450 then the flower is Iris Setosa
Which attributes were most predictive of the class label?
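The if..then.. rule can be checked directly in R. This sketch assumes that R's built-in `iris` data frame holds the same measurements as RapidMiner's sample iris dataset, with `Petal.Length` playing the role of a3.

```r
# R's built-in iris data; Petal.Length is assumed to correspond to a3.
matches_rule <- iris$Petal.Length <= 2.45

# Class labels of the rows that satisfy "a3 <= 2.450".
table(iris$Species[matches_rule])

# Number of rows on this branch of the tree.
sum(matches_rule)
```

Every row that satisfies the rule is a setosa, which is why that leaf of the tree shows a single colour.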
Topic 5: Model accuracy (and building blocks)
Model accuracy
A decision tree produces a nice visualisation of the rules that predict class membership. It can be used as a way to explore historic data (descriptive modeling).
However, the decision tree itself does not tell us how accurate the model will be when applied to new data (i.e. data that was not available to it during training), i.e. can we rely on the accuracy of its predictions? (predictive modeling)
To determine model accuracy when making predictions on new data, we do the following:
1. Split the dataset into a training dataset and a test dataset
2. Train a model on the training dataset
3. Apply the model to the test dataset
4. Calculate how many rows were predicted correctly.
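The four steps can be sketched in base R. To keep the example free of extra packages, a nearest-centroid classifier is swapped in for the decision tree, so this illustrates the evaluation procedure rather than the talk's model. The 70/30 split, the seed, and the helper name `predict_centroid` are all assumptions.

```r
set.seed(42)  # arbitrary seed so the split is repeatable

# 1. Split the dataset into training and test rows (assumed 70/30 split).
n <- nrow(iris)
test_idx <- sample(n, size = round(0.3 * n))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# 2. "Train": store the mean of each attribute per class (nearest-centroid model).
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
centroids <- aggregate(train[features], by = list(Species = train$Species), FUN = mean)

# 3. Apply the model: predict the class whose centroid is closest to the row.
predict_centroid <- function(row) {
  distances <- apply(centroids[features], 1, function(centre) sum((centre - row)^2))
  as.character(centroids$Species[which.min(distances)])
}
predicted <- apply(test[features], 1, predict_centroid)

# 4. Calculate the proportion of test rows predicted correctly.
accuracy <- mean(predicted == as.character(test$Species))
print(accuracy)
```

The point of the split is that `accuracy` is measured only on rows the model never saw during training.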
Model accuracy
Labeled data
Training data
Label            A1   A2   A3   A4
Iris-versicolor  6    2.2  4    1
Iris-setosa      4.6  3.6  1    0.2
Iris-versicolor  5.7  2.9  4.2  1.3
Iris-versicolor  5.7  3    4.2  1.2
Iris-virginica   7.1  3    5.9  2.1
Iris-virginica   6    3    4.8  1.8
Iris-versicolor  6.7  3.1  4.4  1.4
Iris-virginica   6.5  3.2  5.1  2
Iris-setosa      5    3.3  1.4  0.2
Iris-virginica   6.7  3.3  5.7  2.1
Iris-setosa      5.1  3.5  1.4  0.2
Test data
Label            A1   A2   A3   A4   Predicted value
Iris-setosa      5    3.6  1.4  0.2  ?
Iris-versicolor  6.8  2.8  4.8  1.4  ?
Iris-virginica   7.2  3.6  6.1  2.5  ?
Iris-setosa      5.7  3.8  1.7  0.3  ?
Classification algorithm
Train model
Classification model
Apply model
True Label       Predicted label
Iris-setosa      Iris-setosa
Iris-versicolor  Iris-virginica
Iris-virginica   Iris-versicolor
Iris-setosa      Iris-setosa
Accuracy: 50%
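The 50% figure is simply the share of test rows where the predicted label equals the true label. A short R check, using the four true/predicted pairs from the slide:

```r
# True and predicted labels for the four test rows on the slide.
true_label <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa")
predicted  <- c("Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa")

# A prediction is correct when it equals the true label.
correct <- true_label == predicted

# Accuracy = number of correct predictions / number of test rows.
accuracy <- sum(correct) / length(correct)
accuracy  # 0.5, i.e. 50%
```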
Model accuracy in RM
• Return to the Design View
• Right click on the Decision Tree operator and delete it
• Right click anywhere in the process window, select Insert Building Block, and then Nominal X-Validation.
• A Validation operator is added to the process window. Move it to the right of the Retrieve operator and connect the ports.
Building blocks are groups of operators frequently used together. You can define your own, or use the 5 predefined building blocks.
The icon on the bottom right corner of the operator indicates there are other operators embedded within this operator. Click on the operator to view its sub-processes.
Model accuracy
1. The validation operator splits the dataset into partitions: some are used for training while others are used for testing
2. Train a Decision Tree on the training portion of the dataset
3. Apply the decision tree model to the test portion of the dataset
4. Calculate how many predictions were correct
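A rough R sketch of how a dataset can be cut into partitions that take turns as test data. The number of partitions (`k = 10`) and the plain random assignment are assumptions for illustration; the talk does not state the operator's settings, and RapidMiner's own sampling may differ.

```r
set.seed(1)          # arbitrary seed for a repeatable assignment
k <- 10              # assumed number of partitions
n <- nrow(iris)      # 150 rows in R's built-in iris data

# Assign every row to one of k partitions, in random order.
fold <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  test_rows  <- which(fold == i)   # this partition is used for testing
  train_rows <- which(fold != i)   # all other partitions are used for training
  # A model would be trained on iris[train_rows, ] and applied to iris[test_rows, ].
  cat("partition", i, ": train on", length(train_rows),
      "rows, test on", length(test_rows), "rows\n")
}
```

Each row is used for testing exactly once, so the reported accuracy is based on every row in the dataset.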
Model accuracy
• Return up to the root level.
• Output the model (mod) and the performance (ave) port.
• Run the process
Model accuracy: confusion matrix
The performance operator gives the overall model accuracy, and accuracy within each class, depicted as a confusion matrix:
pred.: refers to the class label predicted by the decision tree
true: refers to the actual class label in the original dataset
4 rows in the dataset were predicted as being Iris-virginica, but were actually Iris-versicolor
5 rows in the dataset were predicted as being Iris-versicolor, but were actually Iris-virginica
The diagonal represents correct predictions
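A confusion matrix with the same pred./true layout can be built in R with `table()`. The sketch reuses the four-row true/predicted example from the earlier accuracy slide rather than the full cross-validation result, so the counts are much smaller than in the RapidMiner output.

```r
classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Four-row toy example: true labels and the labels the model predicted.
true_label <- factor(
  c("Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"),
  levels = classes
)
predicted <- factor(
  c("Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"),
  levels = classes
)

# Rows are predicted labels, columns are true labels.
confusion <- table(pred = predicted, true = true_label)
print(confusion)

# The diagonal holds the correct predictions.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of each true class that was predicted correctly.
class_recall <- diag(confusion) / colSums(confusion)
print(accuracy)
print(class_recall)
```

Off-diagonal cells show which classes are confused with each other; here the two errors are between versicolor and virginica.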
Topic 6: Data cleaning
Creating a RapidMiner process to:
1. Remove attributes
2. Remove rows
3. Fill missing values
Data cleaning
• The iris dataset is a clean dataset, with classes that are easy to distinguish.
• Datasets are not usually so clean, or easy to model.
• The next section will build a RapidMiner process to clean a dataset and then train a classification model...
• Return to the Design View.
• Save your current process to your repository, and call it DT-IRIS
• Start a new process
• Choose a blank template
Data cleaning
• The process will start by retrieving a dataset.
  – We will use the titanic dataset, and sort out the missing values
• Navigate to the titanic dataset in the samples/data repository, and drag it into the process window.
  – This adds a Retrieve operator, which retrieves a dataset from the repository.
• The titanic dataset has 1309 rows. 5 attributes have missing values:
Attribute            Number missing  % missing
Passenger Fare       1               0.08%
Port of Embarkation  2               0.15%
Age                  263             20.09%
Life Boat            823             62.87%
Cabin                1014            77.46%
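The percentages follow from the missing counts and the 1309 rows. A short R check, which also picks out the attributes above the 40% threshold used in Step 1 of the cleaning process:

```r
total_rows <- 1309

# Number of missing values per attribute, as listed in the table.
missing <- c("Passenger Fare" = 1, "Port of Embarkation" = 2, "Age" = 263,
             "Life Boat" = 823, "Cabin" = 1014)

# Percentage of rows with a missing value, rounded to two decimals.
pct_missing <- round(100 * missing / total_rows, 2)
print(pct_missing)

# Attributes with more than 40% missing are candidates for removal.
to_remove <- names(pct_missing)[pct_missing > 40]
print(to_remove)
```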
Data cleaning
Step 1: Remove attributes with >40% missing
– Drag ‘Select Attributes’ onto the process window after ‘Retrieve’.
– Connect the output from Retrieve (click on the semicircle) to the input of ‘Select Attributes’ (click on the semicircle)
– Click on ‘Select Attributes’ to view its parameters on the right hand pane. We must specify what attributes to include/exclude in the process.
• Set attribute filter to ‘subset’; click on ‘select attributes’, and double click on Cabin and Lifeboat to move them to the right hand list. Click apply.
• Click on ‘invert selection’ as these are the attributes we do NOT want to select.
RUN THE PROCESS
Data cleaning
Step 2: Replace missing values in AGE
– Drag ‘Replace Missing Values’ onto the process window after ‘Select Attributes’.
– Connect the ‘exa’ output from Select Attributes to the ‘exa’ input of ‘Replace Missing Values’
– Click on ‘Replace Missing Values’ to view its parameters on the right hand pane.
• Set attribute filter to ‘single’; click the drop down box below, and select ‘age’
• The default is that missing values will be replaced by the average value for age
RUN THE PROCESS
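What replacing missing values by the average amounts to, sketched in R on a made-up `age` vector (the values are illustrative, not taken from the titanic data):

```r
# Hypothetical ages with two missing values.
age <- c(22, NA, 30, 38, NA)

# The average is computed from the known values only.
average_age <- mean(age, na.rm = TRUE)

# Every missing value is replaced by that average; known values are untouched.
age_filled <- ifelse(is.na(age), average_age, age)
age_filled  # 22 30 30 38 30
```

Filling with the average keeps all the rows, which matters here because roughly a fifth of the ages are missing.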
Data cleaning
Step 3: Remove rows for attributes with <5% missing
– The only attributes left with missing values are Passenger Fare and Port of Embarkation. Removing ALL rows with missing values will handle the remaining missing values
– Drag Filter Examples onto the process window after Replace Missing Values. Select Filter Examples to view its parameters:
• Click the custom_filters drop down box in the operator's parameters, and select no_missing_attributes
RUN THE PROCESS
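The R counterpart of keeping only rows with no missing attributes is `complete.cases()`. The small `passengers` data frame is a hypothetical stand-in for the titanic data after Steps 1 and 2.

```r
# Hypothetical rows: one is missing a fare, another is missing a port.
passengers <- data.frame(
  fare = c(7.25, NA, 8.05, 53.10),
  port = c("S", "C", NA, "S"),
  age  = c(22, 38, 26, 35)
)

# Keep only the rows in which no attribute is missing.
cleaned <- passengers[complete.cases(passengers), ]

nrow(cleaned)  # 2 of the 4 rows remain
```

Dropping rows is reasonable here because so few rows are affected; for Age, with about 20% missing, it would discard far more data.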
Build a predictive model on the cleaned data
• Right click on the process window, and add a Nominal X-Validation block to the end of the process.
• Connect the ports, ensuring the model and the accuracy (ave) are output from the process.
A red port indicates there may be an error. Run the process to check . . .
Build a predictive model on the cleaned data
• Look for the Set Role operator, and drop it onto the process window.
• Connect it in between Retrieve and Select Attributes.
• Click on Set Role to view its parameters. Set attribute name to survived, and target role to label. The dataset now has a class label.
How accurate is the Decision Tree? Which attributes were most predictive of the class label?
RUN THE PROCESS
Topic 7: Adding R code
Running R script within RapidMiner
• There are a number of extensions to RapidMiner Studio available free from their marketplace, including an extension to run R script within RapidMiner. Installed packages are listed under the extensions folder.
• The operator to run R scripts is ‘Execute R’. The operator's parameters provide the editor for the R script; inputs are the parameters to a mandatory main function; a return statement defines the outputs from the operator.
Running R script within RapidMiner
The operator's help gives a link to the example process. The Polynomial dataset is split into two partitions. Learn Model contains R script to train a linear model; Apply R Model contains R script to apply the model and record its performance. The script for both is on the next slide...
Running R script within RapidMiner
• Learn Model
# train a linear model on the training data and return the learned model
rm_main = function(data) {
  linearModel <- lm(formula = label ~ ., data = data)
  return(linearModel)
}
• Apply R Model
## load the trained model and apply it on the test data
rm_main = function(model, data) {
  # apply the model and build a prediction
  result <- predict(model, data)
  # add the prediction to the example set
  data$prediction <- result
  # update the meta data
  metaData$data$prediction <<- list(type = "real", role = "prediction")
  return(data)
}
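To try the same logic outside RapidMiner, the two scripts can be run as ordinary R functions. In this sketch the names `learn_main` and `apply_main` are hypothetical (inside RapidMiner both must be called `rm_main`), and the data frame is a made-up stand-in for the Polynomial dataset. The `metaData` line is left out because that object is assumed to exist only when the script runs inside the Execute R operator.

```r
# Hypothetical stand-in for the example set RapidMiner would pass in.
set.seed(7)
data <- data.frame(a1 = runif(40), a2 = runif(40))
data$label <- 3 * data$a1 - 2 * data$a2 + rnorm(40, sd = 0.01)

# Same body as the Learn Model script: fit a linear model for label.
learn_main <- function(data) {
  lm(formula = label ~ ., data = data)
}

# Same body as the Apply R Model script, minus the RapidMiner meta data update.
apply_main <- function(model, data) {
  data$prediction <- predict(model, data)
  data
}

# Two partitions: the first 30 rows for training, the last 10 for testing.
train <- data[1:30, ]
test  <- data[31:40, ]

model  <- learn_main(train)
scored <- apply_main(model, test)

# Root mean squared error of the predictions on the test partition.
rmse <- sqrt(mean((scored$prediction - scored$label)^2))
print(rmse)
```

Testing the functions this way first makes it easier to tell R errors apart from port or meta data problems once the script is pasted into the operator.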
Learning more...
We have just touched on a few of the operators in RapidMiner.
• The samples/processes repository in RapidMiner has many more examples.
• The RapidMiner website has training material.
• The RapidMiner Resources website also has training material, some of which is free.
• Neural Market Trends (Thomas Ott) also has good videos on RapidMiner.
Books:
1. RapidMiner: Data Mining Use Cases and Business Analytics Applications. Editors: Dr. Markus Hofmann & Ralf Klinkenberg
2. Exploring Data with RapidMiner by Andrew Chisholm (free to download)