
Tanagra, 21/01/09

1 Introduction

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

The gain chart is an alternative to the confusion matrix for evaluating a classifier. Its name varies across tools (e.g. lift curve, lift chart, cumulative gain chart). The main idea is to build a curve where the X coordinate is the percentage of the population, ranked by decreasing score, and the Y coordinate is the percentage of the positive values of the class attribute retrieved so far. The gain chart is used mainly in the marketing domain, where we want to detect potential customers, but it can be used in other situations.
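The construction can be sketched in a few lines: rank the examples by decreasing score, then, for each fraction of the population, count the fraction of all positives already covered. A minimal illustration (the function name and the toy data are ours, not from the tutorial):

```python
def gain_chart_points(labels, scores, positive="normal"):
    """Return (population fraction, fraction of positives retrieved) pairs.

    labels: class value of each example; scores: classifier score
    used to rank the examples (higher = more likely positive).
    """
    # Rank the class labels by decreasing score.
    ranked = [l for _, l in sorted(zip(scores, labels),
                                   key=lambda p: -p[0])]
    n = len(ranked)
    n_pos = sum(1 for l in ranked if l == positive)
    points, found = [(0.0, 0.0)], 0
    for i, label in enumerate(ranked, start=1):
        if label == positive:
            found += 1
        points.append((i / n, found / n_pos))
    return points

# Toy example: 4 positives out of 6 examples.
pts = gain_chart_points(
    ["normal", "attack", "normal", "normal", "attack", "normal"],
    [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
# After scanning half of the population (the 3 best-ranked examples),
# we have retrieved 2 of the 4 positives, i.e. 50%.
```

Plotting these pairs gives exactly the curve the tools below produce, with the diagonal (random ranking) as the baseline.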

The construction of the gain chart was already outlined in a previous tutorial (see http://data-mining-tutorials.blogspot.com/2008/11/lift-curve-coil-challenge-2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second originality of this tutorial is that we run the experiment under Linux (French version of Ubuntu 8.10; see http://data-mining-tutorials.blogspot.com/2009/01/tanagra-under-linux.html for the installation and use of Tanagra under Linux). The third originality is that we handle a large dataset with 2,000,000 examples and 41 variables. It is very interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful (Celeron, 2.53 GHz, 1 GB RAM).

We adopt the same approach for each tool. First, we define the sequence of operations and the settings on a sample of 2,000 examples. Then, in a second step, we change the data source and handle the whole dataset. We measure the computation time and the memory occupation. We note that some tools fail on the complete dataset.

For the learning method, we use a linear discriminant analysis with a variable selection process in Tanagra. This approach is not available in the other tools, so for them we use a Naive Bayes method, which is also a linear classifier.
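The claim that Naive Bayes is also a linear classifier can be checked directly: for binary (0/1) features, the log-odds of the Naive Bayes posterior is a sum of one term per feature, i.e. a linear function of the inputs. A small sketch (function names, probabilities and data are ours, for illustration only):

```python
import math

def nb_as_linear(p_pos, p_feat_pos, p_feat_neg):
    """Turn a binary Naive Bayes model into linear weights.

    p_pos: prior P(+); p_feat_pos[j] = P(x_j=1 | +);
    p_feat_neg[j] = P(x_j=1 | -).  Returns (bias, weights) such that
    log-odds(x) = bias + sum_j weights[j] * x_j.
    """
    bias = math.log(p_pos / (1 - p_pos))
    weights = []
    for pp, pn in zip(p_feat_pos, p_feat_neg):
        # Contribution of x_j = 0 goes into the bias ...
        bias += math.log((1 - pp) / (1 - pn))
        # ... and switching x_j to 1 adds a constant weight.
        weights.append(math.log(pp / pn) - math.log((1 - pp) / (1 - pn)))
    return bias, weights

def log_odds_direct(x, p_pos, p_feat_pos, p_feat_neg):
    """Naive Bayes log-odds computed the usual way, for comparison."""
    s = math.log(p_pos / (1 - p_pos))
    for xj, pp, pn in zip(x, p_feat_pos, p_feat_neg):
        s += math.log(pp / pn) if xj else math.log((1 - pp) / (1 - pn))
    return s

# The two computations agree, so the decision boundary is linear.
bias, w = nb_as_linear(0.3, [0.8, 0.6], [0.2, 0.5])
x = [1, 0]
linear = bias + sum(wj * xj for wj, xj in zip(w, x))
direct = log_odds_direct(x, 0.3, [0.8, 0.6], [0.2, 0.5])
```

This is why comparing linear discriminant analysis against Naive Bayes remains a comparison between two linear classifiers.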

    2 Dataset

We use a modified version of the KDD CUP 99 dataset. The aim of the analysis is the detection of network intrusions (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). We want to detect a normal connection (binary problem: "normal network connection or not"). More precisely, we ask the following question: "if we assign a score to each example and rank the examples according to this score, what proportion of the positive (normal) connections do we detect if we select only 20 percent of all the connections?"

We have two data files. The first contains 2,000,000 examples (full_gain_chart.arff); it is used to compare the ability of the tools to handle a large dataset. The second is a sample with 2,000 examples (sample_gain_chart.arff); it is used for specifying the whole sequence of operations. These files are bundled in an archive: http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/dataset_gain_chart.zip
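ARFF is a simple text format, so outside these GUI tools it can be inspected with a few lines of code. The minimal reader below handles only the plain subset of ARFF used here (no quoting, no sparse rows); the miniature file content is invented for illustration, it is not the tutorial's dataset:

```python
def read_arff(lines):
    """Very small ARFF reader: returns (attribute names, data rows).

    Handles only plain comma-separated @data rows -- enough for a
    quick look at files like sample_gain_chart.arff.
    """
    attrs, rows, in_data = [], [], False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):      # comments / blanks
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attrs.append(line.split(None, 2)[1])  # attribute name
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attrs, rows

# Miniature ARFF document (invented for illustration).
sample = """@relation demo
@attribute duration numeric
@attribute classe {normal,attack}
@data
12,normal
3,attack
""".splitlines()

attrs, rows = read_arff(sample)
```

A real toolchain would use a full ARFF parser (Weka's own reader, or scipy.io.arff in Python); this sketch only shows how little structure the file format has.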

    3 Tanagra

    3.1 Defining the diagram

Creating a diagram and importing the dataset. After launching Tanagra, we click on the FILE / NEW menu in order to create a diagram and import the dataset. In the dialog settings, we select the data file SAMPLE_GAIN_CHART.ARFF (Weka file format). We set the name of the diagram to GAIN_CHART.TDM.

41 variables and 2,000 cases are loaded.

Partitioning the dataset. We want to use 10 percent of the dataset for the learning phase, and the remainder for the test phase. On our sample, 10 percent corresponds to 200 examples, which seems very limited. But we must remember that this first analysis is only a preparation phase. In the next step, we apply the analysis to the whole dataset, so the true learning set size is 10 percent of 2 million, i.e. 200,000 examples. That seems enough to build a reliable classifier.

In order to define the train and test sets, we use the SAMPLING component (INSTANCE SELECTION tab). We click on the PARAMETERS menu and set PROPORTION SIZE to 10%. When we click on the contextual VIEW menu, Tanagra indicates that 200 examples are now selected.

Defining the variable types. We insert the DEFINE STATUS component (using the shortcut in the toolbar) in order to define the TARGET attribute (CLASSE) and the INPUT attributes (all the other ones).

Automatic feature selection. Some of the INPUT attributes are irrelevant or redundant. We insert the STEPDISC component (FEATURE SELECTION tab) in order to detect the relevant subset of predictive variables. Setting the size of the subset is very hard. Statistical cut values are not really useful here: because we apply the process to a large dataset, all the variables seem significant.

So, the simplest way is a trial and error approach. We observe the decrease of the Wilks' Lambda statistic as each new predictive variable is added. The good choice seems to be SUBSET SIZE = 6: when we add a new variable after this step, the decrease of the Lambda is comparatively weak. This corresponds approximately to the elbow of the curve relating the number of predictive variables to the Wilks' Lambda. We set the parameters according to this point of view.
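The quantity monitored at each STEPDISC step is the Wilks' Lambda, the ratio det(W)/det(T) of the within-group scatter matrix to the total scatter matrix for the currently selected variables; the smaller it is, the better the classes are separated. A small numpy sketch of the criterion (the data are invented; this is not Tanagra's implementation):

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' Lambda = det(W) / det(T) for the variables in X.

    X: (n, p) matrix of the candidate predictive variables;
    y: group (class) label of each row.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered                  # total scatter matrix
    W = np.zeros_like(T)                       # within-group scatter
    for g in np.unique(y):
        Xg = X[y == g]
        dg = Xg - Xg.mean(axis=0)
        W += dg.T @ dg
    return np.linalg.det(W) / np.linalg.det(T)

# Two well-separated groups on the first variable: Lambda is near 0.
X = [[0.0, 1.0], [0.2, 0.4], [5.0, 0.9], [5.2, 0.5]]
y = ["a", "a", "b", "b"]
lam = wilks_lambda(X, y)
```

Evaluating this quantity for subsets of growing size reproduces the elbow curve used above to pick SUBSET SIZE = 6.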

Learning phase. We add the LINEAR DISCRIMINANT ANALYSIS component (SPV LEARNING tab). We obtain the following results.

Computing the score column. We now compute the score of each example, i.e. its propensity to be a positive one. The score value is not really a probability, but it ranks the examples in the same way as the conditional probability P[CLASS = positive / INPUT attributes] would. It is computed on the whole dataset, i.e. both the train and the test set.
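For a two-class linear discriminant analysis, such a score can be obtained from the discriminant direction w = S⁻¹(μ₊ − μ₋): projecting an example onto w ranks it like the posterior probability would, without being a probability itself. A hedged numpy sketch (our own function and toy data, not Tanagra's actual code):

```python
import numpy as np

def lda_scores(X_train, y_train, X_all, positive):
    """Score every row of X_all with a two-class Fisher discriminant.

    The score is x . w with w = S^-1 (mu_pos - mu_neg), where S is the
    pooled within-class covariance; it ranks examples like
    P[CLASS = positive | inputs] without being a probability itself.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    pos = X_train[y_train == positive]
    neg = X_train[y_train != positive]
    # Pooled within-class covariance matrix.
    S = (np.cov(pos.T) * (len(pos) - 1)
         + np.cov(neg.T) * (len(neg) - 1)) / (len(X_train) - 2)
    w = np.linalg.solve(S, pos.mean(axis=0) - neg.mean(axis=0))
    return np.asarray(X_all, dtype=float) @ w

X = [[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.4]]
y = ["normal", "normal", "attack", "attack"]
scores = lda_scores(X, y, X, positive="normal")
# The two "normal" rows receive the highest scores.
```

Any monotone transformation of this score (e.g. the logistic of an affine function of it) leaves the gain chart unchanged, which is why a raw discriminant score is enough here.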

We add the SCORING component (SCORING tab). We set the following parameter: the "positive" value of the class attribute is "normal". A new column, SCORE_1, is added to the dataset.

Creating the gain chart. In order to create the GAIN CHART, we must define the TARGET attribute (CLASSE) and the SCORE attribute (as INPUT). We use the DEFINE STATUS component from the toolbar. We note that we can set several SCORE columns; this option is useful when we want to compare various score columns (e.g. when we want to compare the efficiency of various learning algorithms which compute a score).

Then, we add the LIFT CURVE component (SCORING tab). The settings correspond to the computation of the curve for the "normal" value of the class attribute, on the unselected examples, i.e. the test set.

We obtain the following chart. In the HTML tab, we have the details of the results. We see that among the first 20 percent of the population, the true positive rate is 97.80%, i.e. we retrieve 97.80% of the positive examples of the dataset.


    3.2 Running the diagram on the complete dataset

This first step allowed us to define the sequence of the data analysis. Now we want to apply the same framework to the complete dataset: we create the classifier on a learning set with 200,000 instances, and evaluate its performance, using a gain chart, on a test set with 1,800,000 examples.

With Tanagra, the simplest way is to save the diagram (FILE / SAVE) and close the application. Then, we open the diagram file GAIN_CHART.TDM in a text editor and replace the data source reference with the name of the complete dataset, i.e. full_gain_chart.arff.
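Since the .tdm diagram is a plain text file, this substitution can also be scripted. A hedged sketch (we assume only that the diagram stores the dataset path as ordinary text, which is what makes the text-editor trick work; the miniature diagram content is invented):

```python
def switch_data_source(diagram_text, old_name, new_name):
    """Replace the dataset reference inside a saved Tanagra diagram.

    Works because the .tdm file stores the path of the data source
    as plain text, exactly as edited by hand in the tutorial.
    """
    if old_name not in diagram_text:
        raise ValueError("data source %r not found in diagram" % old_name)
    return diagram_text.replace(old_name, new_name)

# Invented miniature diagram content, for illustration only.
tdm = "dataset_filename=sample_gain_chart.arff\n"
tdm = switch_data_source(tdm, "sample_gain_chart.arff",
                         "full_gain_chart.arff")
```

The same one-line substitution could of course be done with sed; the point is that no GUI interaction is needed to repoint the diagram at the full dataset.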

We launch Tanagra again and open the modified diagram gain_chart.tdm (FILE / OPEN menu). The new dataset is automatically loaded, and Tanagra is ready to execute each node of the diagram.

The processing time is 180 seconds (3 minutes).

We execute all the operations with the DIAGRAM / EXECUTE menu. We get the table associated with the gain chart. Among the first 20% of individuals (ranked according to the score value), we find 98.68% of the positives (the true positive rate is 98.68%).

Here is the gain chart.

We consider the computation time below. We see mainly that the memory occupation of Tanagra increased only slightly with the calculations: it is chiefly the data loaded into memory which determines the memory occupation.

    4 Knime

    4.1 Defining the workflow on the sample

The installation and the implementation of Knime (http://www.knime.org/, version 2.0.0) are very easy under Ubuntu. We create a new workflow (FILE / NEW) and ask for a Knime Project.

As we said above, the linear discriminant method is not available, so we use the Naive Bayes classifier, which is also a linear classifier; we expect to obtain similar results. Unfortunately, Knime does not provide a feature selection process adapted to this approach, so we evaluate the classifier computed on all the predictive variables.

We create the following workflow:

We describe below the settings of some components.

• ARFF READER: the dialog box appears with the CONFIGURE menu. We set the file name, i.e. sample_gain_chart.arff. Another very important parameter is in the GENERAL NODE SETTINGS tab: if the dataset is too large (more than 100,000 examples, according to the documentation), it can be swapped to disk. Knime is the only software in this tutorial which does not load the entire dataset in memory during processing. From this point of view, it is similar to the commercial tools that can handle very large databases. Certainly, it is an advantage. We will see the consequences of this technical choice when we handle the complete database.

• For the PARTITIONING node, we select 10% of the examples.
• For the NAIVE BAYES PREDICTOR node, we ask the component to produce the score column for the values of the class attribute. As with Tanagra, we need this column for ranking the examples during the construction of the gain chart.
• Last, for the LIFT CHART, we specify the class attribute (CLASSE) and the positive value of the class attribute (NORMAL). The width of the interval (INTERVAL WIDTH) for the construction of the table is 10% of the examples.
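The table Knime builds with INTERVAL WIDTH = 10% can be reproduced by cutting the ranked test set into deciles and counting the positives covered at each cut. A minimal sketch (toy data and function name are ours):

```python
def cumulative_gain_table(labels, scores, positive, n_bins=10):
    """Percent of positives retrieved at each population decile."""
    # Rank labels by decreasing score, then cut at each decile.
    ranked = [l for _, l in sorted(zip(scores, labels),
                                   key=lambda p: -p[0])]
    n = len(ranked)
    n_pos = sum(l == positive for l in ranked)
    table = []
    for b in range(1, n_bins + 1):
        cut = round(n * b / n_bins)
        found = sum(l == positive for l in ranked[:cut])
        table.append((100 * b // n_bins, 100.0 * found / n_pos))
    return table

# 4 positives among 10 examples; three of them are ranked on top.
labels = ["normal"] * 4 + ["attack"] * 6
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.5, 0.4, 0.2, 0.15, 0.1]
table = cumulative_gain_table(labels, scores, "normal")
# table[1] == (20, 50.0): the top 20% contains half of the positives.
```

Each (population %, gain %) row corresponds to one line of the table behind Knime's cumulative gain chart.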

When we click on the EXECUTE AND OPEN VIEW contextual menu, we obtain the following gain chart (CUMULATIVE GAIN CHART tab).


    4.2 Calculations on the complete database

We are now ready to work on the complete base. We activate again the dialog settings of the ARFF READER node (CONFIGURE menu) and select the full_gain_chart.arff data file. We can then execute the workflow. We obtain the following gain chart. The results are very similar to those obtained on the sample (2,000 examples).

The computation time, however, is not really the same: it is very high compared to that of Tanagra. We study in detail below the computation time of each step. About the memory occupation, the system monitor shows that Knime uses "only" 193 MB (Figure 1). It is clearly lower than Tanagra, and it will be lower than all the other tools we analyze in this tutorial.

We can clearly see the advantages and disadvantages of swapping data to disk: the memory is under control, but the computation time increases. If we are in the final production phase of the predictive model, this is not a problem. If we are in the exploratory phase, where we use trial and error in order to define the best strategy, a prohibitive computation time is discouraging.

In this configuration, the only solution is to work on a sample. In our study, we note that the results obtained on 2,000 cases are very close to the results obtained on 2,000,000 examples. Of course, this is an unusual situation: the underlying concept relating the class attribute to the predictive attributes is very simple. But it is obvious that, if we can determine a sufficient sample size for the problem, this strategy allows us to explore the various solutions deeply without a prohibitive calculation time.

Figure 1: Memory occupation of Knime during the processing

    5 RapidMiner

The installation and the implementation of RapidMiner (http://rapid-i.com/content/blogcategory/38/69/, Community Edition, version 4.3) are very simple under Ubuntu. We add a shortcut on the desktop which automatically runs the Java Virtual Machine with the rapidminer.jar file.

    5.1 Working on the sample

As with the other tools, we set up the sequence of operations on the sample (2,000 examples). There is little documentation on the software. However, it comes with many predefined examples, and we can often quickly find an example of analysis similar to ours.

For the gain chart, we use the example file sample/06_Visualization/16_LiftChart.xml and adapt its sequence of operations to our dataset.

I do not really understand everything: the operators IORetriever and IOStorer seem quite mysterious. Anyway, the important thing is that RapidMiner provides the desired results. And it does.


We see in this chart that there are 1,800 examples in the test set, among which 324 positive examples (CLASSE = NORMAL). Among the first 20% of the dataset (360 examples) ranked according to the score value, there are 298 positive examples, i.e. the true positive rate is 298/324 ≈ 92%.

We use the following settings for this analysis:
• SIMPLE VALIDATION: SPLIT_RATIO = 0.1
• LIFT PARETO CHART: NUMBER OF BINS = 10
We use the Naive Bayes classifier for RapidMiner because linear discriminant analysis is not available.


    5.2 Calculations on the complete database

To run the calculations on the complete database, we modify the settings of the ARFF EXAMPLE SOURCE component. Then, we click on the PLAY button in the toolbar. The calculation failed: the base was loaded in 17 minutes, and the memory had grown to 700 MB when the crash occurred. The message sent by the system is the following:


    6 Weka

Weka is a well-known machine learning software (http://www.cs.waikato.ac.nz/ml/weka/, version 3.6.0). It runs under a JRE (Java Runtime Environment). The installation and the implementation are easy. There are various ways to use Weka; we choose the KNOWLEDGE FLOW mode in this tutorial, which is very similar to Knime.

    6.1 Working on the sample

As with the other tools, we first work on the sample in order to define the entire sequence of operations.

We see the diagram below. We voluntarily spaced out the components so that we can see, in blue, the type of information transmitted from one icon to another. This is very important: it determines the behavior of the components.


We describe here some nodes and settings:
• ARFF LOADER loads the data file sample_gain_chart.arff;
• CLASS ASSIGNER specifies the class attribute classe; all the other variables are the descriptors;
• CLASS VALUE PICKER defines the positive value of the class attribute, i.e. CLASSE = NORMAL;
• TRAIN TEST SPLIT MAKER splits the dataset; we select 10% for the train set;
• NAIVE BAYES is the learning method; we connect both the TRAINING SET and the TEST SET connections to this component;
• CLASSIFIER PERFORMANCE EVALUATOR evaluates the classifier performance;
• MODEL PERFORMANCE CHART creates various charts for the classifier evaluation on the test set.

To start the calculations, we click on the START LOADING menu of the ARFF LOADER component. The results are available by clicking on the SHOW CHART menu of the MODEL PERFORMANCE CHART component.

Weka offers a clever tool for visualizing charts: we can interactively select the X and Y coordinates. But, unfortunately, while we can choose the right values for the Y coordinates, we cannot select the percentage of the population for the X coordinates. Below, we use the false positive rate instead, in order to draw the ROC curve.

    6.2 Calculations on the complete database

To process the complete database, we configure the ARFF LOADER: we select the full_gain_chart.arff data file, then we click on the START LOADING menu.

After 20 minutes, the calculation is not completed and Weka disappears suddenly. We tried to modify the settings of the JRE (e.g. setting -Xmx1024m for the heap size), but this was not fruitful. Weka under Windows has the same problem in the same circumstances: we cannot handle a database with 41 variables and 2,000,000 examples.


    7 Conclusion

Of course, maybe I have not set the optimal settings for the various tools. But they are free and the dataset is available online; anyone can repeat the same experiments and try other settings in order to increase the efficiency of the processing, both in terms of computation time and memory occupation (e.g. the settings of the JRE, the Java Runtime Environment).

Computation time and memory occupation are reported in the following table.

Tool                                       Tanagra   Knime         Weka               RapidMiner
Load the dataset                           3 mn      6 mn 40 sec   > 20 mn (error)    17 mn
Learning phase + creating the gain chart   25 sec    30 mn         -                  error
Memory occupation                          342 MB    193 MB        > 700 MB (error)   700 MB (error)

The first result which draws our attention is the quickness of Tanagra. Although it runs through Wine under Linux, Tanagra is still very fast. The comparison is especially valid for loading the data; for the learning phase, because we do not use the same approach in every tool, the computation times are not really comparable.

This result seems curious, because we made the same comparison (loading a large dataset) under Windows (http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html) and the computation times were similar on that occasion. Perhaps the JRE under Linux is not efficient?

The second important result of this comparison is the excellent memory management of Knime. Of course, we use a very simplistic method (the Naive Bayes classifier). However, we note that the memory occupation remains very stable during the whole process. Obviously, Knime can handle a very large database without any problem.

On the other hand, the computation time of Knime is clearly higher. In certain circumstances, e.g. during the step where we try to define the optimal settings using a trial and error approach, this can be a drawback.

Finally, to be quite comprehensive, an important tool is missing from this comparison: Orange, which I like a lot because it is very powerful and user-friendly (http://www.ailab.si/orange/). Unfortunately, even though I followed the description of the installation process attentively, I could not install (compile) Orange under Ubuntu 8.10 (http://www.ailab.si/orange/downloads-linux.asp#orange1.0ubuntu).

The Windows version works without any problem. We show below the result of the process under the Windows version. We do not give the computation time and the memory occupation here, because the systems are not comparable.
