en tanagra gain chart
DESCRIPTION
ID3 - Mineria de datosTRANSCRIPT
-
Tanagra
21/01/09 1/26
1 Introduction
ComputingaGainChart.Comparing thecomputation timeofdatamining toolsona largedatasetunder
Linux.
Thegainchartisanalternativetoconfusionmatrixfortheevaluationofaclassifier.Itsnameissometimes
different according the tools (e.g. lift curve, lift chart, cumulative gain chart, etc.). Themain idea is to
elaborateascatterplotwheretheXcoordinates isthepercentofthepopulationandtheYcoordinates is
the percent of the positive value of the class attribute. The gain chart is usedmainly in themarketing
domainwherewewanttodetectpotentialcustomers,butitcanbeusedinothersituations.
The construction of the gain chart is already outlined in a previous tutorial (see http://datamining
tutorials.blogspot.com/2008/11/liftcurvecoilchallenge2000.html). In this tutorial, we extend the
descriptiontootherdataminingtools(Knime,RapidMiner,WekaandOrange).Thesecondoriginalityofthis
tutorial is thatwe lead the experiment under Linux (French version ofUbuntu 8.10 see http://data
miningtutorials.blogspot.com/2009/01/tanagraunderlinux.html for the installation and the utilization of
TanagraunderLinux).Thethirdoriginalityisthatwehandlealargedatasetwith2.000.000examplesand
41variables.Itwillbeveryinterestingtostudythebehaviorofthesetoolsinthisconfiguration,especially
becauseourcomputerisnotreallypowerful(Celeron,2.53GHz,1MBRAM).
Weadoptthesamewayforeachtool.Inafirsttime,wedefinethesequenceofoperationsandthesettings
-
Tanagra
21/01/09 2/26
onasampleof2.000examples.Then, inasecondtime,wemodifythedatasource,wehandlethewhole
dataset.Wemeasurethecomputationtimeandthememoryoccupation.Wenotethatsometoolsfailed
theanalysisonthecompletedataset.
About the learningmethod,we use a linear discriminant analysiswith a variable selection process for
Tanagra.For theother tools, thisapproach isnotavailable.So,weuseaNaiveBayesmethodwhich isa
linearclassifieralso.
2 Dataset
WeuseamodifiedversionoftheKDDCUP99dataset.Theaimoftheanalysisisthedetectionofanetwork
intrusion (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). We want to detect a normal
connection (binaryproblem "normalnetwork connectionornot"). Moreprecisely,weask the following
problem:"ifwesetascoretoexamplesandrankthemaccordingtothisscore,what istheproportionof
positive(normal)connectiondetectedifweselectonly20percentofthewholeconnections?"
Wehavetwodatafiles.Thefirstcontains2.000.000examples (full_gain_chart.arff). It isusedtocompare
the ability of the tools to handle a large dataset. The second is a sample with 2.000 examples
(sample_gain_chart.arff).Itisusedforspecifyingallthesequenceofoperations.Thesefilesaresettogether
inanarchivehttp://eric.univlyon2.fr/~ricco/tanagra/fichiers/dataset_gain_chart.zip
3 Tanagra
3.1 Defining the diagram
Creatingadiagramand importingthedataset.Afterwe launchTanagra,weclickonFILE/NEWmenu in
order to create a diagram and import the dataset. In the dialog settings, we select the data file
SAMPLE_GAIN_CHART.ARFF(Wekafileformat).WesetthenameofthediagramasGAIN_CHART.TDM.
-
Tanagra
21/01/09 3/26
41variablesand2.000casesareloaded.
-
Tanagra
21/01/09 4/26
Partitioning the dataset. We want to use 10 percent of the dataset for the learning phase, and the
remainderforthetestphase.10percentcorrespondto200examplesonoursample.Itseemsverylimited.
Butwemusttorememberthatthisfirstanalysis isusedonlyasapreparationphase.Inthenextstep,we
applytheanalysisonthewholedataset.Sothetruelearningsetsizeis10percentof2millionsi.e.200.000
examples.Itseemsenoughtocreateareliableclassifier.
Inordertodefinethetrainandtestsets,weusetheSAMPLINGcomponent(INSTANCESELECTIONtab).We
clickonthePARAMETERSmenu.WesetPROPORTIONSIZEas10%.
WhenweclickonthecontextualVIEWmenu,Tanagraindicatesthat200examplesarenowselected.
Definingthevariablestypes.WeinserttheDEFINESTATUScomponent(usingtheshortcutintothetoolbar)
inordertodefinetheTARGETattribute(CLASSE)andtheINPUTattributes(theotherones).
-
Tanagra
21/01/09 5/26
Automatic feature selection. Some of the INPUT attributes are irrelevant or redundant. We insert the
STEPDISC (FEATURE SELECTION tab) component in order to detect the relevant subset of predictive
variables.Settingthesizeofthesubsetisveryhard.Statisticalcutvaluesarenotreallyusefulherebecause
weapplytheprocessonalargedataset,allthevariablesseemsignificant.
So, the simplestway is the trial and error approach.We observe the decreasing of theWILKS Lambda
statisticwhenweaddanewpredictivevariable.ThegoodchoiceseemsSUBSETSIZE=6.Whenweadda
new variable after this step, the decreasing of the Lambda is comparatively weak. This corresponds
approximately to theelbowof the curveunderlining the relationshipbetween thenumberofpredictive
variableandtheWilks'lambda.Wesetthesettingsaccordingtothispointofview.
-
Tanagra
21/01/09 6/26
Learningphase.WeaddtheLINEARDISCRIMINANTANALYSIScomponent(SPVLEARNINGtab).Weobtain
thefollowingresults.
-
Tanagra
21/01/09 7/26
Computingthescorecolumn.Wecomputenowthescore(thescorevalueisnotreallyaprobability,butit
allowstoranktheexamplesastheconditionalprobabilityP[CLASS=positive/ INPUTattributes])ofeach
exampletobeapositiveone.Itiscomputedonthewholedataseti.e.boththetrainandthetestset.
WeaddtheSCORINGcomponent(SCORINGtab).Wesetthefollowingparametersi.e.the"positive"value
oftheclassattributeis"normal".AnewcolumnSCORE_1isaddedtothedataset.
Creatingthegainchart.InordertocreatetheGAINCHART,wemustdefinetheTARGET(CLASSE)attribute
andtheSCOREattributes(INPUT).WeusetheDEFINESTATUScomponentfromthetoolbar.Wenotethat
we can set several SCORE columns, this optionwill be usefulwhenwewant to compare various score
columns(e.g.whenwewanttocomparetheefficiencyofvarious learningalgorithmswhichcomputethe
score).
-
Tanagra
21/01/09 8/26
Then,weaddtheLIFTCURVEcomponent(SCORINGtab).Thesettingscorrespondtothecomputationofthe
curveonthe"normal"valueoftheclassattribute,andontheunselectedexamplesi.e.thetestset.
-
Tanagra
21/01/09 9/26
Weobtainthefollowingchart.
In the HTML tab, we have the details of the results. We see that among the first 20 percent of the
populationwehaveatruepositiverateof97.80%i.e.97.80%ofthepositiveexamplesofthedataset.
-
Tanagra
21/01/09 10/26
3.2 Running the diagram on the complete dataset
Thisfirststepallowsustodefinethesequenceofdataanalysis.Now,wewanttoapplythesameframework
onthecompletedataset.Wecreatetheclassifieronalearningsetwith200,000instances,andevaluateits
performance,usingagainchart,onatestsetwith1,800,000examples.
WithTanagra,thesimplestwayistosavethediagram(FILE/SAVE)andclosetheapplication.Then,weopen
thefilediagramGAIN_CHART.TDMinatexteditor.Wereplacethedatasourcereference.Wesetthename
ofthecompletedataseti.e.full_gain_chart.arff.
We launchTanagraagain.Weopenthemodifieddiagramgain_chart.tdm (FILE/OPENmenu).Thenew
datasetisautomaticallyloaded.Tanagraisreadytoexecuteeachnodeofthediagram.
-
Tanagra
21/01/09 11/26
Theprocessingtimeis180seconds(3minutes).
-
Tanagra
21/01/09 12/26
Weexecutealloperationswith theDIAGRAM /EXECUTEmenu.Weget the tableassociated to thegain
chart.Amongthe20%firstindividuals(accordingtothescorevalue),wefind98.68%ofpositives(thetrue
positiverateis98.68%).
Hereisthegainchart.
-
Tanagra
21/01/09 13/26
Weconsiderbelowthecomputationtime.WeseemainlythatthememoryoccupationofTanagraincreased
slightlywiththecalculations.Thesearethedataloadedintomemorywhichmainlydeterminethememory
occupation.
4 Knime
4.1 Defining the workflow on the sample
The installation and the implementation of Knime (http://www.knime.org/, version 2.0.0) are very easy
underUbuntu.Wecreateanewworkflow(FILE/NEW).WeaskforaKnimeProject.
Aswesayabove,theLinearDiscriminantmethod isnotavailable.SoweusetheNaiveBayesClassifier. It
corresponds to a linear classifier. We expect to obtain similar results. Unfortunately, Knime does not
proposeafeatureselectionprocessadaptedtothisapproach.Thus,weevaluatetheclassifiercomputedon
allthepredictivevariables.
Wecreatethefollowingworkflow:
-
Tanagra
21/01/09 14/26
Wedescribebelowthesettingsofsomecomponents.
ARFF READER: The dialog box appears with the CONFIGURE menu. We set the file name i.e.sample_gain_chart.arff.AnotherveryimportantparameterisintheGENRALNODESETTINGStab.
Ifthedatasetistoolarge(higherthan100,000examplesaccordingtothedocumentation),itcanbe
swappedonthedisk.Itistheonlysoftwareofthistutorialwhichdoesnotloadtheentiredataset
inmemoryduringtheprocessing.Inthispointofview,itissimilartothecommercialtoolsthatcan
handle very large databases. Certainly, it is an advantage. We see the consequences of this
technicalchoicewhenwehavetohandlethecompletedatabase.
-
Tanagra
21/01/09 15/26
ForPARTITIONNINGnode,weselect10%oftheexamples. ForNAIVEBAYESPREDICTORnode,weaskthatthecomponentproducesthescorecolumnforthe
valuesoftheclassattribute.LikeTanagra,weneedtothiscolumnforrankingtheexamplesduring
theconstructionofthegainchart.
Last, for the LIFTCHART,we specify the class attribute (CLASSE), thepositive valueof the class
-
Tanagra
21/01/09 16/26
attribute(NORMAL).Thewidthoftheinterval(INTERVALWIDTH)fortheconstructionofthetable
is10%oftheexamples.
Whenwe click on the EXECUTE ANDOPEN VIEW contextualmenu,we obtain the following gain chart
(CUMULATIVEGAINCHARTtab).
-
Tanagra
21/01/09 17/26
4.2 Calculations on the complete database
Wearereadytoworkonthecompletebase.WeactivateagainthedialogsettingsoftheARFFREADERnode
(CONFIGURE menu). We select the full_gain_chart.arff data file. We can execute the workflow. We
obtainthefollowinggainchart.
Theresultsareverysimilartothoseobtainedonthesample(2,000examples).
Thecomputationtimehoweverisnotreallythesame.ItisveryhighcomparedtothatofTanagra.Westudy
indetailsbelowthecomputationtimeofeachstep.
Aboutthememoryoccupation,thesystemmonitorshowsthatKnimeuses"only"193Mo(Figure1). It is
clearlylowerthanTanagra;itwillbelowerthanallothertoolsweanalyzeinthistutorial.
Wecanclearlyseetheadvantagesanddisadvantagesofswappingdataonthedisk:thememory isunder
control,but the computation time increases. Ifweare in thephaseof finalproductionof thepredictive
model, it isnotaproblem. Ifweare in theexploratoryphase,whereweuse trialanderror inorder to
definethebeststrategy,prohibitivecomputationtimeisdiscouraging.
Inthisconfiguration,theonlysolutionistoworkonsample.Inourstudy,wenotethattheresultsobtained
-
Tanagra
21/01/09 18/26
on2,000casesareveryclose to the resultsobtainedon2,000,000examples.Ofcourse, it isanunusual
situation.Theunderlyingconceptbetween theclassattributeand thepredictiveattribute isverysimple.
But,itisobviousthat,ifwecandeterminethesufficientsamplesizefortheproblem,thisstrategyallowsus
toexploredeeplythevarioussolutionswithoutaprohibitivecalculationstime.
Figure1MemoryoccupationofKnimeduringtheprocessing
5 RapidMiner
The installation and the implementation of RapidMiner (http://rapidi.com/content/blogcategory/38/69/,
CommunityEdition,version4.3)areverysimpleunderUbuntu.Weaddon thedesktopashortcutwhich
runsautomaticallytheJavaVirtualMachinewithrapidminer.jarfile.
5.1 Working on the sample
As the other tools,we set the sequence of operations on the sample (2,000 examples). There is little
documentationonthesoftware.However,itcomeswithmanypredefinedexamples.Weoftenfindquickly
anexampleofanalysiswhichissimilartoours.
About thegainchart,weget theexample filesample/06_Visualization/16_LiftChar.xlm.Weadapt the
-
Tanagra
21/01/09 19/26
sequenceofoperationsonourdataset.
Idonotreallyunderstandall.TheoperatorsIORetrieverandIOStorerseemquitemysterious.Anyway,the
importantthingisthatRapidMinerprovidesthedesiredresults.Anditis.
-
Tanagra
21/01/09 20/26
Weseeintothischart:thereare1,800examplesintothetestset;thereare324positiveexamples(CLASSE
=NORMAL).Amongthefirst20%ofthedataset(360examples)rankedaccordingtothescorevalue,there
are298positiveexamplesi.e.thetruepositiverateis298/324#92%.
Weusethefollowingsettingsforthisanalysis:
SIMPLEVALIDATION:SPLIT_RATIO=0.1; LIFTPARETOCHART:NUMBEROFBINS=10
WeusetheNaiveBayesclassifierforRapidminerbecausethelineardiscriminantanalysisisnotavailable.
-
Tanagra
21/01/09 21/26
5.2 Calculations on the complete database
To run the calculations on the complete database, we modify the settings of ARFF EXAMPLE SOURCE
component.Then,weclickonthePLAYbuttonintothetoolbar.
Thecalculationwasfailed.Thebasewas loaded in17minutes;thememory is increasedto700MBwhen
thecrashoccurred.Themessagesentbythesystemisthefollowing
-
Tanagra
21/01/09 22/26
6 Weka
Wekaisawellknownmachinelearningsoftware(http://www.cs.waikato.ac.nz/ml/weka/,Version360).It
runsunderaJRE(JavaRuntime).Theinstallationandtheimplementationareeasy.
There are variousways touseWeka.We choose theKNOWLEDGEFLOWmode in this tutorial. It is very
similartoKnime.
6.1 Working on the sample
As the other tools, we work on the sample in a first time in order to define the entire sequence of
operations.
We seebelow thediagram.We voluntarily separate the components.Thuswe see, inblue, the typeof
information transmitted from an icon to another. It is very important. It determines the behavior of
components.
-
Tanagra
21/01/09 23/26
Wedescribeheresomenodesandsettings:
ARFFLOADERloadsthedatafilesample_gain_chart.arff; CLASSASSIGNERspecifiestheclassattributeclasse,alltheothervariablesarethedescriptors; CLASSVALUEPICKERdefinesthepositivevalueoftheclassattributei.e.CLASSE=NORMAL; TRAINTESTSPLITMAKERsplitsthedataset,weselect10%forthetrainset; NAIVEBAYES is the learningmethod,weconnect to thiscomponentboth theTRAININGSETand the
TESTSETconnections;
CLASSIFIERPERFORMANCEEVALUATORevaluatestheclassifierperformance; MODELPERFORMANCECHARTcreatesvariouschartsfortheclassifierevaluationonthetestset.Tostartthecalculations,weclickontheSTARTLOADINGmenuoftheARFFLOADERcomponent.Theresults
areavailablebyclickingontheSHOWCHARTmenuoftheMODELPERFORMANCECHARTcomponent.
Wekaoffers a clever tool for visualizing charts.We can select interactively the xcoordinates and the y
-
Tanagra
21/01/09 24/26
coordinates.But,unfortunately, ifwecanchoosetherightvalues fortheYcoordinates,wecannotselect
thepercentofthepopulationforthexcoordinates.Weusethefalsepositiveratebelowinordertodefine
theROCcurve.
6.2 Calculations on the complete database
Toprocessthecompletedatabase,weconfiguretheARFFLOADER.Weselectthefull_gain_chart.arffdata
file.ThenweclickontheSTARTLOADINGmenu.
After 20minutes, the calculation is not completed.Weka disappears suddenly.We tried tomodify the
settingsoftheJRE(JavaRuntime)(e.g.setXmx1024mfortheheapsize).Butthisisnotfruitful.
WekaunderWindowshasthesameprobleminthesamecircumstance.Wecannothandleadatabasewith
41variablesand2,000,000examples.
-
Tanagra
21/01/09 25/26
7 Conclusion
Ofcourse,maybeIhavenotsettheoptimalsettingsforthevarioustools.Buttheyarefree,thedatasetis
availableon line,anyonecanmakethesameexperiments,andtryothersettings inorderto increasethe
efficiency of the processing, both about the computation time and the memory occupation (e.g. the
settingsoftheJREJavaRuntimeEnvironment).
Computationtimeandmemoryoccupationarereportedinthefollowingtable.
Tanagra Knime Weka RapidMinerLoad the dataset 3 mn 6 mn 40 sec > 20 mn (erreur) 17 mnLearning phase + Creating the gain chart 25 sec 30 mn - erreurMemory occupation (MB) 342 MB 193 MB > 700 MB (erreur) 700 MB (erreur)
Tool
Thefirstresultwhichdrawsourattention isthequicknessofTanagra.Although itusesWineunderLinux,
Tanagraisstillveryfast.Thecomparisonisespeciallyvalidforloadingdata.Forthelearningphase,because
wedonotusethesameapproach,thecomputationtimesarenotreallycomparable.
Thisresultseemscuriousbecausewemadethesamecomparison(loadingalargedataset)underWindows
(http://dataminingtutorials.blogspot.com/2008/11/decisiontreeandlargedataset.html).Thecomputation
timesweresimilaratthisoccasion.Perhaps,itistheJREunderLinuxwhichisnotefficient?
ThesecondimportantresultofthiscomparisonistheexcellentmemorymanagementofKnime.Ofcourse,
weuseaverysimplisticmethod (NaiveBayesClassifier).However,wenote that thememoryoccupation
remainsverystableduring thewholeprocess.Obliviously,Knimecanhandlewithoutanyproblemavery
largedatabase.
Incompensation,thecomputationtimeofKnimeisclearlyhigher.Incertaincircumstance,duringthestep
wherewetrytodefinetheoptimalsettingsusingatrialanderrorapproach,itcanbeadrawback.
Finally,tobequitecomprehensive,animportanttoolmissesinthiscomparison.It isOrangewhichIlikea
lot because it is very powerful and user friendly (http://www.ailab.si/orange/). Unfortunately, even if I
followedattentively thedescriptionof the installationprocess, Icouldnot install (compile)Orangeunder
Ubuntu8.10(http://www.ailab.si/orange/downloadslinux.asp#orange1.0ubuntu).
TheWindowsversionworkswithoutanyproblems.We showbelow the resultof theprocessunder the
Windows version.Wedonot give the computation time and thememoryoccupationherebecause the
systemsarenotcomparable.
-
Tanagra
21/01/09 26/26