
Tanagra, 21/01/09

1 Introduction

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

The gain chart is an alternative to the confusion matrix for evaluating a classifier. Its name varies across tools (e.g. lift curve, lift chart, cumulative gain chart). The main idea is to build a curve where the X coordinate is the percentage of the population, ranked by decreasing score, and the Y coordinate is the percentage of the positive values of the class attribute retrieved so far. The gain chart is used mainly in the marketing domain, where we want to detect potential customers, but it can be used in other situations.
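The construction can be sketched in a few lines: rank the examples by decreasing score, then, for each fraction of the population, count the fraction of all positives already covered. A minimal illustration (the function name and the toy data are ours, not from the tutorial):

```python
def gain_chart_points(labels, scores, positive="normal"):
    """Return (population fraction, fraction of positives retrieved) pairs.

    labels: class value of each example; scores: classifier score
    used to rank the examples (higher = more likely positive).
    """
    # Rank the class labels by decreasing score.
    ranked = [l for _, l in sorted(zip(scores, labels),
                                   key=lambda p: -p[0])]
    n = len(ranked)
    n_pos = sum(1 for l in ranked if l == positive)
    points, found = [(0.0, 0.0)], 0
    for i, label in enumerate(ranked, start=1):
        if label == positive:
            found += 1
        points.append((i / n, found / n_pos))
    return points

# Toy example: 4 positives out of 6 examples.
pts = gain_chart_points(
    ["normal", "attack", "normal", "normal", "attack", "normal"],
    [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
# After scanning half of the population (the 3 best-ranked examples),
# we have retrieved 2 of the 4 positives, i.e. 50%.
```

Plotting these pairs gives exactly the curve the tools below produce, with the diagonal (random ranking) as the baseline.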

The construction of the gain chart was already outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second originality of this tutorial is that we run the experiment under Linux (French version of Ubuntu 8.10; see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and use of Tanagra under Linux). The third originality is that we handle a large dataset with 2,000,000 examples and 41 variables. It is very interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful (Celeron, 2.53 GHz, 1 GB RAM).

We adopt the same approach for each tool. First, we define the sequence of operations and the settings on a sample of 2,000 examples. Then, in a second step, we change the data source and handle the whole dataset. We measure the computation time and the memory occupation. We note that some tools fail on the complete dataset.

For the learning method, we use a linear discriminant analysis with a variable selection process in Tanagra. This approach is not available in the other tools, so for them we use a Naive Bayes method, which is also a linear classifier.
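The claim that Naive Bayes is also a linear classifier can be checked directly: for binary (0/1) features, the log-odds of the Naive Bayes posterior is a sum of one term per feature, i.e. a linear function of the inputs. A small sketch (function names, probabilities and data are ours, for illustration only):

```python
import math

def nb_as_linear(p_pos, p_feat_pos, p_feat_neg):
    """Turn a binary Naive Bayes model into linear weights.

    p_pos: prior P(+); p_feat_pos[j] = P(x_j=1 | +);
    p_feat_neg[j] = P(x_j=1 | -).  Returns (bias, weights) such that
    log-odds(x) = bias + sum_j weights[j] * x_j.
    """
    bias = math.log(p_pos / (1 - p_pos))
    weights = []
    for pp, pn in zip(p_feat_pos, p_feat_neg):
        # Contribution of x_j = 0 goes into the bias ...
        bias += math.log((1 - pp) / (1 - pn))
        # ... and switching x_j to 1 adds a constant weight.
        weights.append(math.log(pp / pn) - math.log((1 - pp) / (1 - pn)))
    return bias, weights

def log_odds_direct(x, p_pos, p_feat_pos, p_feat_neg):
    """Naive Bayes log-odds computed the usual way, for comparison."""
    s = math.log(p_pos / (1 - p_pos))
    for xj, pp, pn in zip(x, p_feat_pos, p_feat_neg):
        s += math.log(pp / pn) if xj else math.log((1 - pp) / (1 - pn))
    return s

# The two computations agree, so the decision boundary is linear.
bias, w = nb_as_linear(0.3, [0.8, 0.6], [0.2, 0.5])
x = [1, 0]
linear = bias + sum(wj * xj for wj, xj in zip(w, x))
direct = log_odds_direct(x, 0.3, [0.8, 0.6], [0.2, 0.5])
```

This is why comparing linear discriminant analysis against Naive Bayes remains a comparison between two linear classifiers.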

    2 Dataset

We use a modified version of the KDD CUP 99 dataset. The aim of the analysis is the detection of network intrusions (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). We want to detect a normal connection (binary problem: "normal network connection or not"). More precisely, we ask the following question: "if we assign a score to each example and rank the examples according to this score, what proportion of the positive (normal) connections do we detect if we select only 20 percent of all the connections?"

We have two data files. The first contains 2,000,000 examples (full_gain_chart.arff); it is used to compare the ability of the tools to handle a large dataset. The second is a sample with 2,000 examples (sample_gain_chart.arff); it is used for specifying the whole sequence of operations. These files are bundled in an archive: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/dataset_gain_chart.zip
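ARFF is a simple text format, so outside these GUI tools it can be inspected with a few lines of code. The minimal reader below handles only the plain subset of ARFF used here (no quoting, no sparse rows); the miniature file content is invented for illustration, it is not the tutorial's dataset:

```python
def read_arff(lines):
    """Very small ARFF reader: returns (attribute names, data rows).

    Handles only plain comma-separated @data rows -- enough for a
    quick look at files like sample_gain_chart.arff.
    """
    attrs, rows, in_data = [], [], False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):      # comments / blanks
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attrs.append(line.split(None, 2)[1])  # attribute name
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attrs, rows

# Miniature ARFF document (invented for illustration).
sample = """@relation demo
@attribute duration numeric
@attribute classe {normal,attack}
@data
12,normal
3,attack
""".splitlines()

attrs, rows = read_arff(sample)
```

A real toolchain would use a full ARFF parser (Weka's own reader, or scipy.io.arff in Python); this sketch only shows how little structure the file format has.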

    3 Tanagra

    3.1 Defining the diagram

Creating a diagram and importing the dataset. After launching Tanagra, we click on the FILE / NEW menu in order to create a diagram and import the dataset. In the dialog settings, we select the data file SAMPLE_GAIN_CHART.ARFF (Weka file format). We set the name of the diagram to GAIN_CHART.TDM.

41 variables and 2,000 cases are loaded.

Partitioning the dataset. We want to use 10 percent of the dataset for the learning phase, and the remainder for the test phase. On our sample, 10 percent corresponds to 200 examples, which seems very limited. But we must remember that this first analysis is only a preparation phase. In the next step, we apply the analysis to the whole dataset, so the true learning set size is 10 percent of 2 million, i.e. 200,000 examples. That seems enough to build a reliable classifier.

In order to define the train and test sets, we use the SAMPLING component (INSTANCE SELECTION tab). We click on the PARAMETERS menu and set PROPORTION SIZE to 10%. When we click on the contextual VIEW menu, Tanagra indicates that 200 examples are now selected.

Defining the variable types. We insert the DEFINE STATUS component (using the shortcut in the toolbar) in order to define the TARGET attribute (CLASSE) and the INPUT attributes (all the other ones).

Automatic feature selection. Some of the INPUT attributes are irrelevant or redundant. We insert the STEPDISC component (FEATURE SELECTION tab) in order to detect the relevant subset of predictive variables. Setting the size of the subset is very hard. Statistical cut values are not really useful here: because we apply the process to a large dataset, all the variables seem significant.

So, the simplest way is a trial and error approach. We observe the decrease of the Wilks' Lambda statistic as each new predictive variable is added. The good choice seems to be SUBSET SIZE = 6: when we add a new variable after this step, the decrease of the Lambda is comparatively weak. This corresponds approximately to the elbow of the curve relating the number of predictive variables to the Wilks' Lambda. We set the parameters according to this point of view.
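The quantity monitored at each STEPDISC step is the Wilks' Lambda, the ratio det(W)/det(T) of the within-group scatter matrix to the total scatter matrix for the currently selected variables; the smaller it is, the better the classes are separated. A small numpy sketch of the criterion (the data are invented; this is not Tanagra's implementation):

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' Lambda = det(W) / det(T) for the variables in X.

    X: (n, p) matrix of the candidate predictive variables;
    y: group (class) label of each row.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered                  # total scatter matrix
    W = np.zeros_like(T)                       # within-group scatter
    for g in np.unique(y):
        Xg = X[y == g]
        dg = Xg - Xg.mean(axis=0)
        W += dg.T @ dg
    return np.linalg.det(W) / np.linalg.det(T)

# Two well-separated groups on the first variable: Lambda is near 0.
X = [[0.0, 1.0], [0.2, 0.4], [5.0, 0.9], [5.2, 0.5]]
y = ["a", "a", "b", "b"]
lam = wilks_lambda(X, y)
```

Evaluating this quantity for subsets of growing size reproduces the elbow curve used above to pick SUBSET SIZE = 6.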

Learning phase. We add the LINEAR DISCRIMINANT ANALYSIS component (SPV LEARNING tab). We obtain the following results.

Computing the score column. We now compute the score of each example, i.e. its propensity to be a positive one. The score value is not really a probability, but it ranks the examples in the same way as the conditional probability P[CLASS = positive / INPUT attributes] would. It is computed on the whole dataset, i.e. both the train and the test set.
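For a two-class linear discriminant analysis, such a score can be obtained from the discriminant direction w = S⁻¹(μ₊ − μ₋): projecting an example onto w ranks it like the posterior probability would, without being a probability itself. A hedged numpy sketch (our own function and toy data, not Tanagra's actual code):

```python
import numpy as np

def lda_scores(X_train, y_train, X_all, positive):
    """Score every row of X_all with a two-class Fisher discriminant.

    The score is x . w with w = S^-1 (mu_pos - mu_neg), where S is the
    pooled within-class covariance; it ranks examples like
    P[CLASS = positive | inputs] without being a probability itself.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    pos = X_train[y_train == positive]
    neg = X_train[y_train != positive]
    # Pooled within-class covariance matrix.
    S = (np.cov(pos.T) * (len(pos) - 1)
         + np.cov(neg.T) * (len(neg) - 1)) / (len(X_train) - 2)
    w = np.linalg.solve(S, pos.mean(axis=0) - neg.mean(axis=0))
    return np.asarray(X_all, dtype=float) @ w

X = [[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.4]]
y = ["normal", "normal", "attack", "attack"]
scores = lda_scores(X, y, X, positive="normal")
# The two "normal" rows receive the highest scores.
```

Any monotone transformation of this score (e.g. the logistic of an affine function of it) leaves the gain chart unchanged, which is why a raw discriminant score is enough here.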

We add the SCORING component (SCORING tab). We set the following parameter: the "positive" value of the class attribute is "normal". A new column, SCORE_1, is added to the dataset.

Creating the gain chart. In order to create the GAIN CHART, we must define the TARGET attribute (CLASSE) and the SCORE attribute (as INPUT). We use the DEFINE STATUS component from the toolbar. We note that we can set several SCORE columns; this option is useful when we want to compare various score columns (e.g. when we want to compare the efficiency of various learning algorithms which compute a score).

Then, we add the LIFT CURVE component (SCORING tab). The settings correspond to the computation of the curve for the "normal" value of the class attribute, on the unselected examples, i.e. the test set.

We obtain the following chart. In the HTML tab, we have the details of the results. We see that among the first 20 percent of the population, the true positive rate is 97.80%, i.e. we retrieve 97.80% of the positive examples of the dataset.


    3.2 Running the diagram on the complete dataset

This first step allowed us to define the sequence of the data analysis. Now we want to apply the same framework to the complete dataset: we create the classifier on a learning set with 200,000 instances, and evaluate its performance, using a gain chart, on a test set with 1,800,000 examples.

With Tanagra, the simplest way is to save the diagram (FILE / SAVE) and close the application. Then, we open the diagram file GAIN_CHART.TDM in a text editor and replace the data source reference with the name of the complete dataset, i.e. full_gain_chart.arff.
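Since the .tdm diagram is a plain text file, this substitution can also be scripted. A hedged sketch (we assume only that the diagram stores the dataset path as ordinary text, which is what makes the text-editor trick work; the miniature diagram content is invented):

```python
def switch_data_source(diagram_text, old_name, new_name):
    """Replace the dataset reference inside a saved Tanagra diagram.

    Works because the .tdm file stores the path of the data source
    as plain text, exactly as edited by hand in the tutorial.
    """
    if old_name not in diagram_text:
        raise ValueError("data source %r not found in diagram" % old_name)
    return diagram_text.replace(old_name, new_name)

# Invented miniature diagram content, for illustration only.
tdm = "dataset_filename=sample_gain_chart.arff\n"
tdm = switch_data_source(tdm, "sample_gain_chart.arff",
                         "full_gain_chart.arff")
```

The same one-line substitution could of course be done with sed; the point is that no GUI interaction is needed to repoint the diagram at the full dataset.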

We launch Tanagra again and open the modified diagram gain_chart.tdm (FILE / OPEN menu). The new dataset is automatically loaded, and Tanagra is ready to execute each node of the diagram.

The processing time is 180 seconds (3 minutes).

We execute all the operations with the DIAGRAM / EXECUTE menu. We get the table associated with the gain chart. Among the first 20% of individuals (ranked according to the score value), we find 98.68% of the positives (the true positive rate is 98.68%).

Here is the gain chart.

We consider the computation time below. We see mainly that the memory occupation of Tanagra increased only slightly with the calculations: it is chiefly the data loaded into memory which determines the memory occupation.

    4 Knime

    4.1 Defining the workflow on the sample

The installation and the implementation of Knime (http://www.knime.org/, version 2.0.0) are very easy under Ubuntu. We create a new workflow (FILE / NEW) and ask for a Knime Project.

As we said above, the linear discriminant method is not available, so we use the Naive Bayes classifier, which is also a linear classifier; we expect to obtain similar results. Unfortunately, Knime does not provide a feature selection process adapted to this approach, so we evaluate the classifier computed on all the predictive variables.

We create the following workflow:

We describe below the settings of some components.

• ARFF READER: the dialog box appears with the CONFIGURE menu. We set the file name, i.e. sample_gain_chart.arff. Another very important parameter is in the GENERAL NODE SETTINGS tab: if the dataset is too large (more than 100,000 examples, according to the documentation), it can be swapped to disk. Knime is the only software in this tutorial which does not load the entire dataset in memory during processing. From this point of view, it is similar to the commercial tools that can handle very large databases. Certainly, it is an advantage. We will see the consequences of this technical choice when we handle the complete database.

• For the PARTITIONING node, we select 10% of the examples.
• For the NAIVE BAYES PREDICTOR node, we ask the component to produce the score column for the values of the class attribute. As with Tanagra, we need this column for ranking the examples during the construction of the gain chart.
• Last, for the LIFT CHART, we specify the class attribute (CLASSE) and the positive value of the class attribute (NORMAL). The width of the interval (INTERVAL WIDTH) for the construction of the table is 10% of the examples.
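The table Knime builds with INTERVAL WIDTH = 10% can be reproduced by cutting the ranked test set into deciles and counting the positives covered at each cut. A minimal sketch (toy data and function name are ours):

```python
def cumulative_gain_table(labels, scores, positive, n_bins=10):
    """Percent of positives retrieved at each population decile."""
    # Rank labels by decreasing score, then cut at each decile.
    ranked = [l for _, l in sorted(zip(scores, labels),
                                   key=lambda p: -p[0])]
    n = len(ranked)
    n_pos = sum(l == positive for l in ranked)
    table = []
    for b in range(1, n_bins + 1):
        cut = round(n * b / n_bins)
        found = sum(l == positive for l in ranked[:cut])
        table.append((100 * b // n_bins, 100.0 * found / n_pos))
    return table

# 4 positives among 10 examples; three of them are ranked on top.
labels = ["normal"] * 4 + ["attack"] * 6
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.5, 0.4, 0.2, 0.15, 0.1]
table = cumulative_gain_table(labels, scores, "normal")
# table[1] == (20, 50.0): the top 20% contains half of the positives.
```

Each (population %, gain %) row corresponds to one line of the table behind Knime's cumulative gain chart.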

When we click on the EXECUTE AND OPEN VIEW contextual menu, we obtain the following gain chart (CUMULATIVE GAIN CHART tab).


    4.2 Calculations on the complete database

We are now ready to work on the complete base. We activate again the dialog settings of the ARFF READER node (CONFIGURE menu) and select the full_gain_chart.arff data file. We can then execute the workflow. We obtain the following gain chart. The results are very similar to those obtained on the sample (2,000 examples).

The computation time, however, is not really the same: it is very high compared to that of Tanagra. We study in detail below the computation time of each step. About the memory occupation, the system monitor shows that Knime uses "only" 193 MB (Figure 1). It is clearly lower than Tanagra, and it will be lower than all the other tools we analyze in this tutorial.

We can clearly see the advantages and disadvantages of swapping data to disk: the memory is under control, but the computation time increases. If we are in the final production phase of the predictive model, this is not a problem. If we are in the exploratory phase, where we use trial and error in order to define the best strategy, a prohibitive computation time is discouraging.

In this configuration, the only solution is to work on a sample. In our study, we note that the results obtained on 2,000 cases are very close to the results obtained on 2,000,000 examples. Of course, this is an unusual situation: the underlying concept relating the class attribute to the predictive attributes is very simple. But it is obvious that, if we can determine a sufficient sample size for the problem, this strategy allows us to explore the various solutions deeply without a prohibitive calculation time.

Figure 1: Memory occupation of Knime during the processing

    5 RapidMiner

The installation and the implementation of RapidMiner (http://rapid-i.com/content/blogcategory/38/69/, Community Edition, version 4.3) are very simple under Ubuntu. We add a shortcut on the desktop which automatically runs the Java Virtual Machine with the rapidminer.jar file.

    5.1 Working on the sample

As with the other tools, we set up the sequence of operations on the sample (2,000 examples). There is little documentation on the software. However, it comes with many predefined examples, and we can often quickly find an example of analysis similar to ours.

For the gain chart, we use the example file sample/06_Visualization/16_LiftChart.xml and adapt its sequence of operations to our dataset.

I do not really understand everything: the operators IORetriever and IOStorer seem quite mysterious. Anyway, the important thing is that RapidMiner provides the desired results. And it does.


We see in this chart that there are 1,800 examples in the test set, among which 324 positive examples (CLASSE = NORMAL). Among the first 20% of the dataset (360 examples) ranked according to the score value, there are 298 positive examples, i.e. the true positive rate is 298/324 ≈ 92%.

We use the following settings for this analysis:
• SIMPLE VALIDATION: SPLIT_RATIO = 0.1
• LIFT PARETO CHART: NUMBER OF BINS = 10
We use the Naive Bayes classifier for RapidMiner because linear discriminant analysis is not available.


    5.2 Calculations on the complete database

To run the calculations on the complete database, we modify the settings of the ARFF EXAMPLE SOURCE component. Then, we click on the PLAY button in the toolbar. The calculation failed: the base was loaded in 17 minutes, and the memory had grown to 700 MB when the crash occurred. The message sent by the system is the following:


    6 Weka

Weka is a well-known machine learning software (http://www.cs.waikato.ac.nz/ml/weka/, version 3.6.0). It runs under a JRE (Java Runtime Environment). The installation and the implementation are easy. There are various ways to use Weka; we choose the KNOWLEDGE FLOW mode in this tutorial, which is very similar to Knime.

    6.1 Working on the sample

As with the other tools, we first work on the sample in order to define the entire sequence of operations.

We see the diagram below. We voluntarily spaced out the components so that we can see, in blue, the type of information transmitted from one icon to another. This is very important: it determines the behavior of the components.


We describe here some nodes and settings:
• ARFF LOADER loads the data file sample_gain_chart.arff;
• CLASS ASSIGNER specifies the class attribute classe; all the other variables are the descriptors;
• CLASS VALUE PICKER defines the positive value of the class attribute, i.e. CLASSE = NORMAL;
• TRAIN TEST SPLIT MAKER splits the dataset; we select 10% for the train set;
• NAIVE BAYES is the learning method; we connect both the TRAINING SET and the TEST SET connections to this component;
• CLASSIFIER PERFORMANCE EVALUATOR evaluates the classifier performance;
• MODEL PERFORMANCE CHART creates various charts for the classifier evaluation on the test set.

To start the calculations, we click on the START LOADING menu of the ARFF LOADER component. The results are available by clicking on the SHOW CHART menu of the MODEL PERFORMANCE CHART component.

Weka offers a clever tool for visualizing charts: we can interactively select the X and Y coordinates. But, unfortunately, while we can choose the right values for the Y coordinates, we cannot select the percentage of the population for the X coordinates. Below, we use the false positive rate instead, in order to draw the ROC curve.

    6.2 Calculations on the complete database

To process the complete database, we configure the ARFF LOADER: we select the full_gain_chart.arff data file, then we click on the START LOADING menu.

After 20 minutes, the calculation is not completed and Weka disappears suddenly. We tried to modify the settings of the JRE (e.g. setting -Xmx1024m for the heap size), but this was not fruitful. Weka under Windows has the same problem in the same circumstances: we cannot handle a database with 41 variables and 2,000,000 examples.


    7 Conclusion

Of course, maybe I have not set the optimal settings for the various tools. But they are free and the dataset is available online; anyone can repeat the same experiments and try other settings in order to increase the efficiency of the processing, both in terms of computation time and memory occupation (e.g. the settings of the JRE, the Java Runtime Environment).

Computation time and memory occupation are reported in the following table.

Tool                                       Tanagra   Knime         Weka               RapidMiner
Load the dataset                           3 mn      6 mn 40 sec   > 20 mn (error)    17 mn
Learning phase + creating the gain chart   25 sec    30 mn         -                  error
Memory occupation                          342 MB    193 MB        > 700 MB (error)   700 MB (error)

The first result which draws our attention is the quickness of Tanagra. Although it runs through Wine under Linux, Tanagra is still very fast. The comparison is especially valid for loading the data; for the learning phase, because we do not use the same approach in every tool, the computation times are not really comparable.

This result seems curious, because we made the same comparison (loading a large dataset) under Windows (http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html) and the computation times were similar on that occasion. Perhaps the JRE under Linux is not efficient?

The second important result of this comparison is the excellent memory management of Knime. Of course, we use a very simplistic method (the Naive Bayes classifier). However, we note that the memory occupation remains very stable during the whole process. Obviously, Knime can handle a very large database without any problem.

On the other hand, the computation time of Knime is clearly higher. In certain circumstances, e.g. during the step where we try to define the optimal settings using a trial and error approach, this can be a drawback.

Finally, to be quite comprehensive, an important tool is missing from this comparison: Orange, which I like a lot because it is very powerful and user-friendly (http://www.ailab.si/orange/). Unfortunately, even though I followed the description of the installation process attentively, I could not install (compile) Orange under Ubuntu 8.10 (http://www.ailab.si/orange/downloads-linux.asp#orange1.0ubuntu).

The Windows version works without any problem. We show below the result of the process under the Windows version. We do not give the computation time and the memory occupation here, because the systems are not comparable.
