Introduction to RapidMiner Studio v7

Post on 11-Apr-2017


Dublin R Lightning Talks Event

Introduction to RapidMiner
Geraldine Gray, PhD

March 24th 2016

Introductions

Geraldine is a lecturer in the Institute of Technology Blanchardstown (ITB).

Coordinator for ITB’s MSc in Applied Data Science and Analytics

geraldine.gray@itb.ie https://ie.linkedin.com/in/geraldine-gray-9b2b187

@GGrayITB geraldine.gray.itb

Overview

Objective:
•  Introduction to RapidMiner Studio for data analytics

Agenda:
1.  Overview of the RapidMiner Studio interface
2.  Importing a dataset
3.  Descriptive statistics and visualisation
4.  Data modelling
5.  Model evaluation
6.  Data cleaning
7.  Adding R script

G. Gray 3

Topic 1: Overview of RapidMiner Studio

Installing RapidMiner on your own machine

The latest version of RapidMiner Studio is V7. It can be downloaded from https://rapidminer.com/products/comparison/

•  For Windows: download rapidminer-install.exe and install it. By default it installs to C:\Program Files and adds itself to the Start > Programs menu.

•  For Mac: download the .dmg and add it to your Applications folder.

Background

RapidMiner comes with:

•  Over 125 mining algorithms
•  Over 100 data cleaning and preparation functions
•  Over 30 charts for data visualisation
•  A selection of metrics to evaluate model performance

Each function is available as an OPERATOR (which is implemented as a Java class). A process is built by connecting operators together, with the output of one operator passing as input to the next. This is all done by drag and drop.

Creating a repository

•  All processes created in RapidMiner are saved to a repository. The repository also stores other objects, including datasets and prediction models.

•  A repository maps to a folder on your machine created specifically for RapidMiner work.

Before starting RapidMiner Studio for the first time, create a folder somewhere on your machine to store your processes and datasets from today's workshop.

•  The folder can be local to the machine, on an external drive/USB, or in the cloud.

Start up RapidMiner

When you start RapidMiner Studio, you are presented with an initial introduction window. Close this window to see the main interface.

RapidMiner GUI

[Screenshot of the main interface, with these panels labelled:]
•  Process design window
•  Parameter settings for the selected operator
•  Log of activities, including errors. If this is missing, add it from View / Show Panel
•  Available operators
•  Explanation of the selected operator
•  Navigate repositories

RapidMiner toolbars

[Screenshot of the toolbars, with these controls labelled:]
•  New, open, save
•  Undo, redo
•  Run process
•  Stop process
•  Process design view
•  View process results
•  Automatically connect operators
•  Add/remove breakpoints
•  Show and alter the order in which operators run
•  Resize the process window
•  Add a note / comment
•  Enable/disable an operator

[The screenshot also shows the right-click options.]

Processes and datasets

•  Your RapidMiner repository (folder) will contain different types of objects, most commonly:

•  Datasets: the actual data itself
    •  The symbol is a blue cylinder

•  Processes: a series of operators that are applied to a dataset to analyse it
    •  The symbol is two cogwheels
    •  A process will read in a dataset, carry out various tasks on it, and output the results. A process does NOT change the original dataset.

Repositories

•  RapidMiner comes with a repository called Samples, which has a number of datasets and example processes.
    –  You cannot edit the Samples repository

To create your own repository, select the drop-down box on the repository window, select ‘create repository’, and browse to the folder you created.

Finding an operator

•  RapidMiner comes with many operators, so finding the one you want can be daunting at first.
•  Once you get familiar with operator names, you can find them more easily using the filter at the top of the operator window.

Examples (from the screenshots):
•  List all operators that start with ‘read’
•  List all operators whose first word starts with ‘dec’, and 2nd word starts with ‘t’
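The two filter examples above can be pictured as word-by-word prefix matching. The sketch below is only an illustration of that idea: how RapidMiner's filter box is actually implemented is not stated in the slides, the query string `"dec t"` is an assumption, and `matches_filter` and the short `OPERATORS` list are hypothetical names used for the example.

```python
def matches_filter(operator_name: str, query: str) -> bool:
    """Return True if each query word is a prefix of the matching word in the name."""
    name_words = operator_name.lower().split()
    query_words = query.lower().split()
    if len(query_words) > len(name_words):
        return False
    return all(n.startswith(q) for n, q in zip(name_words, query_words))


# A small, hand-picked list of operator names (not the full operator tree).
OPERATORS = [
    "Read CSV",
    "Read Excel",
    "Retrieve",
    "Decision Tree",
    "Decision Stump",
    "Select Attributes",
]

# Operators that start with 'read'.
read_ops = [op for op in OPERATORS if matches_filter(op, "read")]

# First word starts with 'dec', second word starts with 't' (assumed query text).
dec_t_ops = [op for op in OPERATORS if matches_filter(op, "dec t")]
```

With this matching rule, `read_ops` holds the two Read operators and `dec_t_ops` holds only Decision Tree.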

Topic 2: Importing a dataset

Reading in a dataset

There are two options for accessing a dataset:

1.  You can use one of the many Read operators to read data into RapidMiner temporarily for a particular process.

2.  You can import a dataset into your repository, where it will be available to all processes via the Retrieve operator. This is the most efficient method, as metadata is stored with the dataset.

•  RapidMiner ships with a number of datasets already loaded in the SAMPLES repository.

Once a dataset is in a repository, you can access it using the Retrieve operator.

Wine Quality Dataset

We are first going to import the WINE QUALITY dataset from the UCI repository: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

(Google: UCI repository, and look for wine quality, not wine.)

Attributes:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol

Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Download the winequality-red.csv file from the UCI website. Take a look at the dataset in Excel or Notepad/Textpad. The first row is column headings. Columns are separated by ‘;’
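For readers who want to inspect the file outside Excel or Notepad, here is a minimal Python sketch of reading a semicolon-separated file whose first row holds the column headings. It is an illustration only: the two data rows in `SAMPLE` are illustrative values laid out like the wine file rather than a verified extract, and `read_semicolon_csv` is a hypothetical helper name.

```python
import csv
import io

# Illustrative sample in the same layout as the wine file:
# first row is column headings, columns are separated by ';'.
SAMPLE = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
"""


def read_semicolon_csv(handle):
    """Read ';'-delimited rows into dictionaries of floats keyed by column heading."""
    reader = csv.DictReader(handle, delimiter=";")
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = read_semicolon_csv(io.StringIO(SAMPLE))
columns = list(rows[0].keys())
```

To read the downloaded file instead, the same function could be passed an open file handle, for example `open(path, newline="")`, where `path` points at wherever the file was saved.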

Importing the wine dataset into RapidMiner

1.  Return to RapidMiner.
2.  Select ‘Add Data’, then ‘My Computer’, and browse to the downloaded file.
3.  You are presented with a number of screens to set the metadata for this dataset, as follows...

The first screen specifies import settings, including the column delimiter. A preview at the bottom tells you if the settings are correct.

•  The second screen specifies the data type for each attribute, and its role in the data analytics process.

Most data types are intuitive.
Binominal: a binary attribute; it can only have two values. RapidMiner will assume binominal if an attribute has just two distinct values in the first 100 rows scanned. This is not always correct.
Polynominal: a non-numeric attribute with multiple values.
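The caveat about the first 100 rows is easier to see with a small sketch. The code below is not RapidMiner's import logic; it is a simplified stand-in, assuming only the rule stated above (two distinct values in the scanned rows means binominal). `guess_type` and the made-up `port` column are hypothetical.

```python
from itertools import islice


def guess_type(values, scan_rows=100):
    """Guess an attribute type from the first `scan_rows` values only."""
    sample = [v for v in islice(values, scan_rows) if v is not None]
    distinct = set(sample)
    if len(distinct) == 2:
        return "binominal"
    if all(isinstance(v, (int, float)) for v in sample):
        return "numeric"
    return "polynominal"


# Made-up column: only two values appear in the first 100 rows,
# but a third value turns up in row 101.
port = ["S", "C"] * 50 + ["Q"]

guess_from_first_100 = guess_type(port)
guess_from_all_rows = guess_type(port, scan_rows=len(port))
```

Scanning only the first 100 rows labels the column binominal, while scanning every row shows it has three values, which is why the guessed type should be checked on the second import screen.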

ROLE
•  Attributes without a role are used by mining algorithms to identify patterns in the dataset.
•  Prediction models will attempt to predict the attribute with the role of LABEL.
•  The attribute with the role of ID is a primary key, used in JOIN operations.
•  You can specify other, user-defined, roles for attributes to be ignored by mining algorithms.

Change the role of the final attribute, quality, to label.

In the final screen, specify the name of the dataset, i.e. wine, and browse to the repository folder where it is to be stored. The dataset will now appear in your repository window.

Topic 3: Descriptive Statistics and Visualisation

Exploring a dataset

In the samples/data repository there are a number of datasets already imported (i.e. in the RM format). Click on the TITANIC dataset to open it. This automatically brings you to the results view. Within the results view, there are five tabs on the left-hand side. We will look at the first three:
1.  Data: view the data in the dataset
2.  Statistics: view summary statistics on the dataset
3.  Charts: a range of visualisations of the dataset

The data view

•  The data view lists all the rows in the dataset, and reports the number of rows (examples) and columns (attributes) in the dataset.

•  The filters on the right-hand side allow you to investigate rows with missing values.

The statistics view

The statistics view gives metadata on each attribute, specifically:
–  Data types
–  Number of missing values
–  Min, max, average for numeric attributes
–  Least, Most and a list of values for non-numeric attributes

Clicking on an attribute will show a histogram for that attribute.

This is a good view for an initial quality assessment of:
1.  Missing values
2.  Outlier values
3.  Attributes whose distribution of values is not as expected, indicating the dataset is not representative of the population of interest.
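As a rough picture of what the statistics view reports, the following sketch computes the same kinds of summaries by hand. It is an assumption-laden illustration, not how RapidMiner computes them: the function names are hypothetical and the small `ages` and `sexes` lists are made up rather than taken from the Titanic dataset.

```python
from collections import Counter


def summarise_numeric(values):
    """Missing count, min, max and average for a numeric attribute (None = missing)."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "average": sum(present) / len(present),
    }


def summarise_nominal(values):
    """Missing count, most and least frequent value, and value counts (None = missing)."""
    counts = Counter(v for v in values if v is not None)
    ordered = counts.most_common()
    return {
        "missing": sum(1 for v in values if v is None),
        "most": ordered[0],
        "least": ordered[-1],
        "values": dict(counts),
    }


# Made-up example columns.
ages = [22.0, None, 38.0, 26.0, None, 35.0]
sexes = ["male", "female", "female", "male", "male"]

age_summary = summarise_numeric(ages)
sex_summary = summarise_nominal(sexes)
```

A large `missing` count or a `min`/`max` far from the average is the kind of signal the quality assessment above is looking for.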

The charts view

•  The charts view gives you access to a range of visualisations for your dataset.

Go to the chart view of the Titanic dataset. Under chart style, select ‘Histogram Color’. Set Histogram to ‘age’ and Color to ‘Survived’, and reduce the Opaqueness of the histogram.
a)  Does it appear that priority was given to children?
b)  Instead of ‘age’, plot ‘sex’. Does it appear that priority was given to women?
c)  Looking at a histogram of ‘class’, which class of passenger was most likely to survive?

We are going to look at one more dataset, the iris dataset, which has its own Wikipedia page: https://en.wikipedia.org/wiki/Iris_flower_data_set

Attributes:
a1: Sepal Length
a2: Sepal Width
a3: Petal Length
a4: Petal Width

Class label:
Iris-setosa
Iris-versicolor
Iris-virginica

•  Navigate to the IRIS dataset in the samples/data repository. Double click to open it in the results view.

•  In the charts view, select ‘Scatter Matrix’. This shows a scatter plot of all pairs of attributes, colour coded by class label.

a)  Are the three classes well separated?
b)  Select a Scatter 3-D Color plot. By default it colour codes by class label. Use your mouse to rotate the plot and so view it from different perspectives.

Close all tabs in the results view.

Topic 4: Building a predictive model

Classification

A classification algorithm trains a model to predict a class label, one of the attributes in the dataset.
This class label defines groups in the dataset.
The algorithm learns what differentiates these groups from each other.

Class Label       A1    A2    A3    A4
Iris-setosa       5.1   3.5   1.4   0.2
Iris-setosa       5     3.6   1.4   0.2
Iris-setosa       5.7   3.8   1.7   0.3
Iris-setosa       4.6   3.6   1     0.2
Iris-setosa       5     3.3   1.4   0.2
Iris-versicolor   6     2.2   4     1
Iris-versicolor   6.7   3.1   4.4   1.4
Iris-versicolor   6.8   2.8   4.8   1.4
Iris-versicolor   5.7   3     4.2   1.2
Iris-versicolor   5.7   2.9   4.2   1.3
Iris-virginica    7.1   3     5.9   2.1
Iris-virginica    7.2   3.6   6.1   2.5
Iris-virginica    6.5   3.2   5.1   2
Iris-virginica    6.7   3.3   5.7   2.1
Iris-virginica    6     3     4.8   1.8

Classification algorithms

Classification algorithms use labelled data to learn how to identify instances of each class.

Will it be easy to train a model to differentiate between the three types of iris below?

[Images: Iris virginica, Iris versicolor, Iris setosa]

There are many classification algorithms implemented in RapidMiner, under modeling/predictive. We will look at one such algorithm: a Decision Tree.

Starting a process...

•  So far in RapidMiner, we have just looked at datasets; we haven’t actually done anything with the data.

•  In this section we will create a RapidMiner process that trains a classification model...

Return to the Design View. The process window should be empty.

The process will start by retrieving a dataset.
–  We will use the iris dataset

Navigate to the iris dataset in the samples/data repository, and drag it into the process window.
–  This adds a Retrieve operator, which retrieves a dataset from the repository.

Building a model

–  Drag ‘Decision Tree’ from the operators window onto the process window, after ‘Retrieve’.
–  Connect the ‘out’ port from Retrieve (click on the semicircle) to the ‘tra’ port of the ‘Decision Tree’ (click on the semicircle).
–  Connect both output ports of the Decision Tree to the process output port.

About ports...

[Diagram of a process, with these items labelled: process input port; process output ports; operator input ports; operator output ports; mandatory input port; optional input port; output port has a value; output port does not have a value]

Ports represent inputs to an operator and outputs from an operator. Data and other objects are passed from one operator to the next in a process, as indicated by ports that are connected. Colours are used to indicate the type of data/object, e.g.:

purple: dataset
green: model
brown: model performance

Hover over a port to see the type of object required.

Connect matching colours.

Run the process to build the model

•  Run the process. RapidMiner will automatically bring you to the results view.
•  There are two tabs in the results view (because we had two outputs from the process):
    –  The dataset itself
    –  The decision tree classification model
•  Click on the Decision Tree tab.

Classification model

[Screenshot of the decision tree, annotated as follows:]

The text on leaf nodes is the predicted class label.

Attributes:
a1: Sepal Length
a2: Sepal Width
a3: Petal Length
a4: Petal Width

The height of the bar indicates the number of rows that matched this branch. Hover over the node to get the actual numbers.

A mix of colours indicates that not all rows matching this branch were in the same class.

Branches on the decision tree represent if..then.. rules, e.g. if a3 <= 2.450 then the flower is Iris Setosa.

Which attributes were most predictive of the class label?
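To make the if..then.. reading of a branch concrete, the sketch below applies the one rule quoted above (a3 <= 2.450 means Iris Setosa) to the fifteen iris rows shown earlier in the deck. Only that single branch comes from the slides; the rest of the tree is not given, so the hypothetical `first_branch` function returns `None` for rows that would need further branches.

```python
# The fifteen rows from the classification slide: (label, a1, a2, a3, a4).
ROWS = [
    ("Iris-setosa", 5.1, 3.5, 1.4, 0.2),
    ("Iris-setosa", 5.0, 3.6, 1.4, 0.2),
    ("Iris-setosa", 5.7, 3.8, 1.7, 0.3),
    ("Iris-setosa", 4.6, 3.6, 1.0, 0.2),
    ("Iris-setosa", 5.0, 3.3, 1.4, 0.2),
    ("Iris-versicolor", 6.0, 2.2, 4.0, 1.0),
    ("Iris-versicolor", 6.7, 3.1, 4.4, 1.4),
    ("Iris-versicolor", 6.8, 2.8, 4.8, 1.4),
    ("Iris-versicolor", 5.7, 3.0, 4.2, 1.2),
    ("Iris-versicolor", 5.7, 2.9, 4.2, 1.3),
    ("Iris-virginica", 7.1, 3.0, 5.9, 2.1),
    ("Iris-virginica", 7.2, 3.6, 6.1, 2.5),
    ("Iris-virginica", 6.5, 3.2, 5.1, 2.0),
    ("Iris-virginica", 6.7, 3.3, 5.7, 2.1),
    ("Iris-virginica", 6.0, 3.0, 4.8, 1.8),
]


def first_branch(a3):
    """The single rule quoted on the slide; None means 'follow the other branches'."""
    return "Iris-setosa" if a3 <= 2.45 else None


# True labels of the rows that this one branch classifies as Iris-setosa.
matched = [label for (label, a1, a2, a3, a4) in ROWS if first_branch(a3) == "Iris-setosa"]
```

On these rows the rule picks out exactly the five setosa flowers, which suggests why petal length (a3) sits near the top of the tree.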

Topic 5: Model accuracy (and building blocks)

Model accuracy

A decision tree produces a nice visualisation of the rules that predict class membership. It can be used as a way to explore historic data (descriptive modelling). However, the decision tree itself does not tell us how accurate the model will be when applied to new data (i.e. data that was not available to it during training).

i.e. can we rely on the accuracy of its predictions? (predictive modelling)

To determine model accuracy when making predictions on new data, we do the following:

1.  Split the dataset into a training dataset and a test dataset
2.  Train a model on the training dataset
3.  Apply the model to the test dataset
4.  Calculate how many rows were predicted correctly

[Diagram: the labelled data is split into training data and test data; a classification algorithm trains a classification model on the training data; the model is applied to the test data; predicted labels are compared with true labels.]

Training data

Label             A1    A2    A3    A4
Iris-versicolor   6     2.2   4     1
Iris-setosa       4.6   3.6   1     0.2
Iris-versicolor   5.7   2.9   4.2   1.3
Iris-versicolor   5.7   3     4.2   1.2
Iris-virginica    7.1   3     5.9   2.1
Iris-virginica    6     3     4.8   1.8
Iris-versicolor   6.7   3.1   4.4   1.4
Iris-virginica    6.5   3.2   5.1   2
Iris-setosa       5     3.3   1.4   0.2
Iris-virginica    6.7   3.3   5.7   2.1
Iris-setosa       5.1   3.5   1.4   0.2

Test data

Label             A1    A2    A3    A4    Predicted value
Iris-setosa       5     3.6   1.4   0.2   ?
Iris-versicolor   6.8   2.8   4.8   1.4   ?
Iris-virginica    7.2   3.6   6.1   2.5   ?
Iris-setosa       5.7   3.8   1.7   0.3   ?

After applying the model

True label        Predicted label
Iris-setosa       Iris-setosa
Iris-versicolor   Iris-virginica
Iris-virginica    Iris-versicolor
Iris-setosa       Iris-setosa

Accuracy: 50%
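The 50% figure can be checked by hand: accuracy is the number of rows predicted correctly divided by the number of rows in the test data. A minimal sketch using the four true/predicted pairs from the slide (the function name `accuracy` is just an illustrative choice):

```python
# True and predicted labels for the four test rows on the slide.
true_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
predicted_labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]


def accuracy(true, predicted):
    """Fraction of rows whose predicted label equals the true label."""
    if len(true) != len(predicted):
        raise ValueError("true and predicted must have the same length")
    correct = sum(t == p for t, p in zip(true, predicted))
    return correct / len(true)


test_accuracy = accuracy(true_labels, predicted_labels)
```

Two of the four rows (the two setosa flowers) are predicted correctly, so `test_accuracy` is 0.5.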

Model accuracy in RM

•  Return to the Design View.
•  Right click on the Decision Tree operator and delete it.
•  Right click anywhere in the process window, select Insert Building Block, and then Nominal X-Validation.
•  A Validation operator is added to the process window. Move it to the right of the Retrieve operator and connect the ports.

Building blocks are groups of operators frequently used together. You can define your own, or use the 5 predefined building blocks.

The icon on the bottom right corner of the operator indicates there are other operators embedded within this operator. Click on the operator to view its sub-processes.

Model accuracy

1.  The validation operator splits the dataset into partitions: some are used for training while others are used for testing
2.  Train a Decision Tree on the training portion of the dataset
3.  Apply the decision tree model to the test portion of the dataset
4.  Calculate how many predictions were correct
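One common way to split a dataset into partitions for training and testing is k-fold cross-validation, where each partition takes one turn as the test set. The sketch below shows only the index bookkeeping; it assumes a simple round-robin assignment of rows to folds, whereas the X-Validation block's real sampling settings (number of folds, shuffling, stratification) are not described in the slides. `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    # Round-robin assignment: row i goes to fold i % k.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train, test


# 15 rows (as in the small iris table earlier) split into 3 partitions.
splits = list(k_fold_indices(15, 3))
```

Each row is used for testing exactly once and for training in the other rounds; steps 2 to 4 above would then be repeated for every `(train, test)` pair and the accuracies averaged.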

Model accuracy

•  Return up to the root level.
•  Output the model (mod) and the performance (ave) ports.
•  Run the process.

Model accuracy: confusion matrix

The performance operator gives the overall model accuracy, and the accuracy within each class, depicted as a confusion matrix:

[Screenshot of the confusion matrix, annotated as follows:]

pred.: refers to the class label predicted by the decision tree

true: refers to the actual class label in the original dataset

4 rows in the dataset were predicted as being Iris-virginica, but were actually Iris-versicolor

5 rows in the dataset were predicted as being Iris-versicolor, but were actually Iris-virginica

The diagonal represents correct predictions
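A confusion matrix is just a count of (predicted, true) label pairs. The sketch below builds one for the small four-row example used earlier in this topic, not for the full iris result in the screenshot, whose cell counts are not all given in the slides. `confusion_matrix` is a hypothetical helper name.

```python
from collections import Counter

# The four test rows from the earlier accuracy example.
true_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
predicted_labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]


def confusion_matrix(true, predicted):
    """Count rows for each (predicted label, true label) combination."""
    return Counter(zip(predicted, true))


matrix = confusion_matrix(true_labels, predicted_labels)

# Cells on the diagonal have the same predicted and true label.
diagonal_total = sum(n for (pred, true), n in matrix.items() if pred == true)
overall_accuracy = diagonal_total / len(true_labels)
```

Here the only diagonal cell is (Iris-setosa, Iris-setosa) with a count of 2, so the overall accuracy is 2 out of 4.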

Topic 6: Data cleaning

Creating a RapidMiner process to:
1.  Remove attributes
2.  Remove rows
3.  Fill missing values

Data cleaning

•  The iris dataset is a clean dataset, with classes that are easy to distinguish.
•  Datasets are not usually so clean, or so easy to model.
•  The next section will build a RapidMiner process to clean a dataset and then train a classification model...
•  Return to the Design View.
•  Save your current process to your repository, and call it DT-IRIS.
•  Start a new process.
•  Choose a blank template.

Data cleaning

•  The process will start by retrieving a dataset.
    –  We will use the Titanic dataset, and sort out the missing values
•  Navigate to the Titanic dataset in the samples/data repository, and drag it into the process window.
    –  This adds a Retrieve operator, which retrieves a dataset from the repository.
•  The Titanic dataset has 1309 rows. 5 attributes have missing values:

Attribute             Number missing   % missing
Passenger Fare        1                0.08%
Port of Embarkation   2                0.15%
Age                   263              20.09%
Life Boat             823              62.87%
Cabin                 1014             77.46%
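The percentages in the table follow directly from the counts: each is the number of missing values divided by the 1309 rows. A short sketch that recomputes them and flags the attributes above a 40% threshold (the threshold is the one used in Step 1 of the cleaning process; the variable names are illustrative):

```python
TOTAL_ROWS = 1309

# Number of missing values per attribute, from the table above.
MISSING = {
    "Passenger Fare": 1,
    "Port of Embarkation": 2,
    "Age": 263,
    "Life Boat": 823,
    "Cabin": 1014,
}

# Percentage missing, rounded to two decimal places as in the table.
percent_missing = {name: round(100 * n / TOTAL_ROWS, 2) for name, n in MISSING.items()}

# Attributes with more than 40% of their values missing.
mostly_missing = [name for name, pct in percent_missing.items() if pct > 40]
```

Only Life Boat and Cabin exceed 40%, which is why they are the two attributes removed first.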

Data cleaning

Step 1: Remove attributes with > 40% missing
–  Drag ‘Select Attributes’ onto the process window after ‘Retrieve’.
–  Connect the output from Retrieve (click on the semicircle) to the input of ‘Select Attributes’ (click on the semicircle).
–  Click on ‘Select Attributes’ to view its parameters on the right-hand pane. We must specify what attributes to include/exclude in the process.

•  Set attribute filter to ‘subset’; click on ‘select attributes’, and double click on Cabin and Life Boat to move them to the right-hand list. Click apply.
•  Click on ‘invert selection’ as these are the attributes we do NOT want to select.

RUN THE PROCESS

Step 2: Replace missing values in AGE
–  Drag ‘Replace Missing Values’ onto the process window after ‘Select Attributes’.
–  Connect the ‘exa’ output from Select Attributes to the ‘exa’ input of ‘Replace Missing Values’.
–  Click on ‘Replace Missing Values’ to view its parameters on the right-hand pane.

•  Set attribute filter to ‘single’; click the drop-down box below, and select ‘age’.
•  The default is that missing values will be replaced by the average value for age.

RUN THE PROCESS

Step 3: Remove rows for attributes with < 5% missing
–  The only attributes left with missing values are Passenger Fare and Port of Embarkation. Removing ALL rows with missing values will handle the remaining missing values.
–  Drag Filter Examples onto the process window after Replace Missing Values. Select Filter Examples to view its parameters:

•  Click the custom_filters drop-down box in the operator's parameters, and select no_missing_attributes.

RUN THE PROCESS
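The three cleaning steps can also be read as three small transformations applied one after another, each returning a new dataset and leaving its input untouched. The sketch below mirrors that chain in plain Python; it is not what the RapidMiner operators do internally, the function names are hypothetical, and the three rows are made up for illustration.

```python
def drop_attributes(rows, names):
    """Step 1: remove the named attributes from every row."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def replace_missing_with_average(rows, name):
    """Step 2: replace missing values (None) of one attribute with its average."""
    present = [row[name] for row in rows if row[name] is not None]
    average = sum(present) / len(present)
    return [{**row, name: average if row[name] is None else row[name]} for row in rows]


def drop_rows_with_missing(rows):
    """Step 3: keep only rows that have no missing values at all."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Made-up rows with the same kinds of gaps as the Titanic attributes.
rows = [
    {"Age": 20.0, "Passenger Fare": 7.25, "Cabin": None, "Life Boat": None},
    {"Age": None, "Passenger Fare": 8.05, "Cabin": "C85", "Life Boat": "4"},
    {"Age": 40.0, "Passenger Fare": None, "Cabin": None, "Life Boat": None},
]

step1 = drop_attributes(rows, {"Cabin", "Life Boat"})
step2 = replace_missing_with_average(step1, "Age")
cleaned = drop_rows_with_missing(step2)
```

The order matters: filling Age before filtering rows means the second row is kept with the average age, and only the row with a missing fare is dropped.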

Build a predictive model on the cleaned data

•  Right click on the process window, and add a Nominal X-Validation block to the end of the process.
•  Connect the ports, ensuring the model and the accuracy (ave) are output from the process.

A red port indicates there may be an error. Run the process to check...

•  Look for the Set Role operator, and drop it onto the process window.
•  Connect it in between Retrieve and Select Attributes.
•  Click on Set Role to view its parameters. Set attribute name to survived, and target role to label. The dataset now has a class label.

RUN THE PROCESS

How accurate is the Decision Tree? Which attributes were most predictive of the class label?

Topic 7: Adding R code

Running R script within RapidMiner

•  There are a number of extensions to RapidMiner Studio available free from the marketplace, including an extension to run R script within RapidMiner. Installed packages are listed under the extensions folder.

•  The operator to run R scripts is ‘Execute R’. The operator's parameters provide the editor for the R script; inputs are the parameters to a mandatory main function (rm_main); a return statement defines the outputs from the operator.

The operator's help gives a link to the example process. The Polynomial dataset is split into two partitions. Learn Model contains R script to train a linear model; Apply R Model contains R script to apply the model and record its performance. The script for both follows...

Running R script within RapidMiner

•  Learn Model

    # train a linear model on the training data and return the learned model
    rm_main = function(data) {
      linearModel <- lm(formula = label ~ ., data = data)
      return(linearModel)
    }

•  Apply R Model

    ## load the trained model and apply it on the test data
    rm_main = function(model, data) {
      # apply the model and build a prediction
      result <- predict(model, data)
      # add the prediction to the example set
      data$prediction <- result
      # update the metadata
      metaData$data$prediction <<- list(type = "real", role = "prediction")
      return(data)
    }

Learning more...

We have just touched on a few of the operators in RapidMiner.
•  The samples/processes repository in RapidMiner has many more examples.
•  The RapidMiner website has training material.
•  The RapidMiner Resources website also has training material, some of which is free.
•  Neural Market Trends (Thomas Ott) also has good videos on RapidMiner.

Books:
1.  RapidMiner: Data Mining Use Cases and Business Analytics Applications. Editors: Dr. Markus Hofmann & Ralf Klinkenberg
2.  Exploring Data with RapidMiner by Andrew Chisholm (free to download)