fril - sourceforgefril.sourceforge.net/fril-tutorial-3.2.pdf · 4 general architecture before we...

43
FRIL Fine-grained Record Integration and Linkage Tool Tutorial V 3.2 08/07/2009 Author: Pawel Jurczyk Copyright: Emory University, Math&CS Department, 2009 Document history: Date Version Date Version 12/22/2008 1.0 08/07/2009 3.2 12/29/2008 2.0 06/29/2009 3.0 07/02/2009 3.1

Upload: dangmien

Post on 23-Sep-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

FRILFine­grainedRecordIntegrationandLinkageTool

TutorialV3.2

08/07/2009

Author:PawelJurczyk

Copyright:EmoryUniversity,Math&CSDepartment,2009

Documenthistory:

Date Version Date Version12/22/2008 1.0 08/07/2009 3.212/29/2008 2.0 06/29/2009 3.0 07/02/2009 3.1

2

Tableofcontents

Tableofcontents ............................................................................................................................................................ 2

Introduction..................................................................................................................................................................... 3

Installation ....................................................................................................................................................................... 3Generalarchitecture..................................................................................................................................................... 4

Mainwindow ................................................................................................................................................................... 4Configurationofdatasources.................................................................................................................................... 7

Configurationoflinkage ............................................................................................................................................18

Resultsavers/Resultsdeduplication ....................................................................................................................24Runningthelinkage,linkageconfidence .............................................................................................................26

Howconfidenceleveliscalculated? ......................................................................................................................28Distancemetrics...........................................................................................................................................................30

Datadeduplication ......................................................................................................................................................38

References ......................................................................................................................................................................43

3

IntroductionThisdocumentprovidesatutorialfortoolcalledFRIL(Fine‐grainedRecordsIntegrationandLinkageTool).FRILisaJava‐basedtoolthatincorporatesarichcollectionofrecorddistancemetrics,searchmethods,andanalysistools.Alongitsworkflow,FRILprovidesarichsetofuser‐tunableparametersaugmentedwithgraphicvisualizationtoolstoassistusersinunderstandingtheeffectsofparameterchoices.InthistutorialthereaderwillobtainknowledgeaboutallthecomponentsavailableinFRIL.

InstallationFRILcanbedownloadedfromitshomepage.Thefileavailablefordownloadisazippedarchivethatcontainsallthefilesrequiredforrunningthesoftwareandexampledatafilesusedinthistutorial.Ifyouhavenotobtainedthesoftwaresofar,doitnow:

1. Gotohttp://www.mathcs.emory.edu/Research/Area/datainfo/FRIL2. GotoDownloadsection3. Downloadthemostrecentbinarydistribution(orsourcedistributionifyouprefer)4. Unzipdownloadedarchive5. OpentheFRILfolder(itshouldbecreatedafterunzippingthefile)6. Runthesoftware.Usefril.bat(Windows)/fril.sh(Linux/Unix)/fril.sh.command(Mac)depending

onoperatingsystemyouuse.7. Ifthesoftwaredoesnotstart,makesureyouhavejava1.6installed.FRILisdevelopedinjavaand

requiresthisframeworktorun.Youcanobtainthelatestjavaframeworkfromhttp://java.com.ForfurtherinformationyoumightalsowanttorefertothefileREADME.txtthatislocatedinaFRILdirectory.

4

GeneralarchitectureBeforewegetstartedwithFRIL,itisimportanttounderstandthearchitectureofthesystem.TheworkflowofFRILisshowninFigure1.Theuserspecifiestheinitialinputfiles.Eachrunrequirestheusertospecifythesearchmethod,thedistancefunctionintheattributecomparisonmoduleandthedecisionmodel.OutputconsistsofsetsM,U,Pandvarioussummarystatistics.SetsUandPmaybefedbackintoFRILwithadifferentsetofparameters.

Figure1:ArchitectureofFRIL

InthistutorialweguideuserthroughallthecomponentsofFRIL,presentinghowtoconfigure,debugandrunefficientandaccuraterecordlinkage.

MainwindowOnceyoustartthesoftware,themainwindowopens.ThescreenshotofapplicationispresentedinFigure2.Notethatthereare5majorcomponentsintheapplication:

• Configurationofthefirstdatasource• Configurationoftheseconddatasource• Configurationofthelinkageprocess• Configurationoftheresultssaver• Linkageprocesspanel

TheFRILmainwindowcontainsthefollowingspecificcomponents(thenumberscorrespondtonumbersinlabelsinFigure2):

1. Configurationofthefirstdatasourcebutton2. Viewofnotjoineddatainthefirstdatasourcebutton3. Indicatorsoffiltering/deduplicationinthefirstdatasource4. Configurationoftheseconddatasourcebutton5. Viewofnotjoineddataintheseconddatasourcebutton6. Indicatorsoffiltering/deduplicationintheseconddatasource

5

7. Configurationoflinkageprocessbutton8. Linkageprocessstartbutton9. Configurationofresultsavers/resultdeduplicationbutton10. Viewoflinkedrecordsbutton11. Indicatorofdeduplicationoflinkageresults12. IndicatorofFRILmode(linkage/deduplication)

ToconfigurethelinkageusingFRILyouneedtogothroughallthemajorcomponents(configuredatasources,linkageandresultsavers).Onceyoucompletethatyoucanstartthelinkageprocessusingthestartlinkagebutton.Intheremainingpartofthistutorialwewillgoovereachofthestepsabovetolinksampledatafiles.

FRILhastwomodes:linkageanddeduplication.ThelinkagemodeisthedefaultoneandispresentedinFigure1.ThededuplicationmodecanbeenabledintheModemenu.PleaseseesectionDatadeduplicationformoredetails.

IfyouhavealreadystartedFRILbefore,youmightbeaskedifyouwanttoopenyourpreviousconfigurationwhenFRILstarts.FRILremembersitsconfigurationandattemptstorestoreituponrestart.

Ifyouarenotsurehowtoobtainorstartthesoftware,gotoInstallationsection.

WorkingwithsavedconfigurationWhenyouworkwithFRIL,saveyourconfigurationoften.Inthatcaseyouwillbesafeifanythingunexpectedhappens(e.g.,FRILhangsorcrashes–yes,itmighthappen,althoughweareworkingtomakethesoftwarestable).Tosaveyourconfiguration,clickon“File‐>Saveas”or“File‐>Save”.Later,whenyouopenFRILagain,youwillhaveanoptionofloadingthesavedconfiguration.Youcanalsoloadconfigurationatanytime.Todoso,click“File‐>Open”.

AutosaveoptionFRILwasrecentlyupdatedtoincludeanautosaveoption(thisoptionisenabledbydefault,youcanchangeitintheToolsmenuitem).Theautosaveoptionautomaticallysavestheconfigurationasyouworkwiththetool.Ifyouclosetheapplicationimproperlyorifitbreaks,youcanbeaskedwhetheryouwanttoloadanautosavedconfigurationwhenyourestartFRIL.

6

Figure2:MainwindowofFRIL

7

ConfigurationofdatasourcesDatasourcesprovidedatathatwillbelinkedinFRIL.Youhaveanoptionofthefollowingdatasources:

1. CSVfileComma‐separatedvaluesfile.Thisisatextfilewheredataisseparatedbycomma,semicolonorotherseparatorcharacter.TheconfigurationofCSVdatasourcerequiresspecificationofaninputfileandacolumnseparator.

2. ExcelfileFilesavedinMicrosoftExcelformat.Note:FRILworkswithExcelfilessavedinOffice1997‐2003fileformat.SupportforOffice2007isonitsway.TheconfigurationofExceldatasourcerequiresspecificationofafilenameandasheetname.Bothparametersarerequired.

3. Fixedwidthcolumnsinputfile(Textfile)Plaintextdatafile.Eachrowinthefileisaseparaterecord,andeveryattributeshouldhavefixedlength.Requiresaschemafile(oradatamodelfile)thatdescribesattributesavailableinthefileandlengthofthoseattributes.Thedatamodelfilenameisoptional,andifnotprovidedFRILwilllookforafilewithextension“.fields”.Forinstance,ifyouuseaninputfiledata.txt,andyoudonotspecifythedatamodelfile,FRILwillexpectthefiledata.fieldstobelocatedinthesamedirectoryasthefiledata.txt.The“.fields”filecouldlookasfollows:

Theexampleabovedefinesatextfilewith3attributes,name,ageandaddress.Eachlineinthetextfileisexpectedtohave83characters(20+3+60).Thefirst20charactersineachlinearedecodedasaname,next3charactersasanageandnext60charactersasanaddress.

4. DatabaseThedatawillbepulledfromdatabase.RequiresJDBClibraries.Currently,thelibrarieshavetobedownloadedmanually,butinthefuturethisprocesscanbeautomated.Note:JDBCgivesalotofoptionsfordatasources.Usingthistechnology,youcanaccessforinstanceMSAccessdatabase,MSSQL,MySQL,PostgreSQL,Oracleandmanymoredatabases.

Let’snowadddatatoFRIL.Followthestepsoutlinedbelowtofullyconfigurethefirstdatasource.Wewillworkwiththesampledatathatshouldbelocatedinthedata‐sampledirectorylocatedinyourFRILhomedirectory.First,clickontheconfigurationofupperdatasourceinmainwindow(clickonabuttonnexttothelabelwithnumber1inFigure2).Awizardfordatasourceshouldappear.InDatasourcetypechooseaCSVfileoption.Namethedatasource“FirstFile”andclickon“…”buttonontherightfromthe“Filename”textfield.FilechoosershouldopeninyourFRILdirectory.Ifnot,navigatethere.Doubleclickondata‐sampledirectoryandclickon“generated‐data‐error.csv”.Click“Open”.YoushouldbebackintheDatasourceconfigurationwizard.In“Columnseparator”optionchooseComma.YourwindowshouldlookliketheonepresentedinFigure3.NowyoushouldbereadytoclickNext.Whenyoudothat,thenextstepinconfigurationofdatasourceappears.Youshouldnowseeafewboxeswiththefollowingnamesonthem:first_name,last_name,DOB,height,weight.

IfyouchooseawrongColumnseparator,FRILwillnotbeabletoreadyourdatainaproperway.Youwilllikelyseeonlyoneboxinthesecondstepofthedatasourceconfigurationthatreadssomethinglike:“first_name,last_na…“.

name20age3address60

8

Figure3:Firststepofdatasourceconfiguration

Tryclickingon“Showfieldssummary”buttonlocatedatthetopwhenyouareinthesecondstepofadatasourceconfiguration.Youwillseethedatasummarywindow.Trytomoveyourmouseoveroneofthe“See”buttons–afewexamplesofdatavaluesshouldbepresented.Thisoptioncanbeusefulwhenexploringthedata.

ThenextstepinconfigurationistochoosecolumnsfromtheinputfilethatyouwillactuallyuseinFRIL.Yourinputfilecouldpotentiallyhavehundredsofdatacolumns,andyoumightreallyneedonlyfewofthemintheapplication.Bychoosingthecolumnsyoureallyneedyoumakethetooltoworkmuchfaster.Inourcase,however,ourdatafilehasonly5columns,andallthecolumnswillbeusedinthelinkage.Let’snowaddattributesthatwillbeusedlaterbyFRIL.Followthesesteps:

1. Wedonotwanttoworkwithfirstandlastnameseparately,butwewanttotreatthosetwoattributesasone(orasname).Clickontheboxthatreadsfirst_name.ChooseUseconverter‐>Mergeconverter.Anewconfigurationwindowshouldappear.Asthe“Outattributename”type“name”.ChooseSpacefortheoption“Joinfieldsusing”.Inthe“Joinedcolumns”clicklast_name@FirstFile.YourwindowshouldlookliketheoneinscreenshotfromFigure5.Wonderingwhatyouweredoing?Clickonmagnifyingglassicon.Youshouldseeadynamicanalysiswindowthatshowsthreecolumns(Figure6).Youseethatyouhavetwoinputcolumns(first_nameandlast_name)andoneoutputcolumn(name).Nowyouseewhatmergeconverterdoes–ittakestwo(ormore)attributesfrominputfileandcreatesanewattribute.Thevalueofthenewattributeiscalculatedbymergingvaluesfromfewcolumns.

Beforeyouclosetheconverterconfiguration,trytochangetheoption“Joinfieldsusing”or“Joinedcolumns”whilekeepingthedynamicanalysiswindowopen.Theanalysiswindowisupdatedeverytimeyoumakeanychangeintheconfigurationofconverters.Makesure,however,toselectvaluesofparametersasdescribedinstep1beforeyouclickOK.

2. Nowyoushouldbereadytoclosetheconfigurationofconverter.ClickOK.3. Clickon“Showfieldssummary”buttonandmoveyourmouseover“See…”buttonfortheattribute

DOB.YouwillseethatthevaluesaredatescodedasMMDDYYYY.Wewillnowchangethe4‐digityearformatinto2‐digitformat.OurgoalistoproducevaluesthathaveformatMMDDYY.

9

4. Closethesummarywindowandclickonboxthatreads“DOB”andselectUseconverter‐>Trimconverter.

5. Intheconfigurationofconverter,asthe“Outputattributename”type“Date‐MMDD”.Selectthecheckboxnextto“Modifyendofstring”.UsetheoptionforCutandsetnumberofcutcharactersto4.MakesurethatyourconfigurationisconsistentwithFigure7.

6. Clickonthemagnifyingglassbutton.Theconverteryoujustcreatedleavesfirst4charactersofeveryDOBvalue.ThosecharactersareMMDDpartofthedate.ClosedynamicanalysiswindowwhenyouaredoneandclickOKintheconverterconfigurationwindow.

7. Clickagainontheboxthatreads“DOB”,chooseUseconverter‐>Trimconverter.8. Thistimespecify“Date‐YY”asan“Outputattributename”,selectthecheckboxnextto“Modifyend

ofstring”andselecttheradiobuttonnextto“Leave”.Type2intheenabledtextfield(matchyourconfigurationwithFigure8).Clickagainonthemagnifyingglassbutton.Thisconverterleavesonlytwolastdigitsofyearfromanyinputdate.Applythechanges(closedynamicanalysisandclickOK).

9. WehaveseparateattributesforMMDDpartofdateandYYpartofdate.Thenextstepistomakeoneattributefromthosetwocomponents.Clickonboxthatreads“Date‐MMDD”andchooseUseconverter‐>Mergeconverter.MatchtheconfigurationwithFigure9.MakesuretoselectNonefortheoption“Joinfieldsusing”.Asusually,clickonmagnifyingglasstoseetheresultsandonceyouaredoneapplythechanges.

10. Clickonboxthatreads“height”.Clickonoption“Addtooutmodel”.Followthesamestepsfor“weight”.

11. YourdatasourceconfigurationshouldlooksimilarlytotheonepresentedinFigure10.Ifnot,adjustyourconfigurationaccordingly.

Tryclickingoneachoftheboxesyouseeinyourconfigurationwindow.Youwillseethateverytypeofboxhasdifferentoptions.E.g.,byclickingonanorangecolorbox(converter),youcanmodifyitssettings.Ontheotherhand,byclickingonabluebox(outputcolumn),youcanrenamethecolumnorremoveit.

Observethatyoucanbuildasequenceofconverterstoachievemorecomplextasksindatamanipulation.

12. Click“Next”tofollowtothelaststepofdatasourceconfiguration(presentedinFigure4).Inthelaststepacceptthedefaultconfigurationandclick“Finish”

Thelaststepofdatasourceconfigurationisadatapreprocessingstepandrequiresabitofexplanation.Youcanconfigureheretwomajortasksforadatasource:deduplicationandfiltering.Datadeduplicationisataskofremovingduplicaterecordsfrominputdata(pleaseseesectionDatadeduplicationformoredetails).Filteringisataskofspecifyingwhichrecordsshouldbeusedinthelinkageprocess.Notethatifyoudonotuseanyfilter,alltherecordsprovidedbygivendatasourcewillbeused.Ontheotherhand,ifyouspecifyanyfilter,onlytherecordsthatsatisfythefilterexpressionwillbeconsidered.Otherrecordswillnotbetakenintoaccount.Thefilterexpressioncanbeanybooleanexpressionthatinvolvesattributesfromadatasource.AssumingthatyourdatasourceprovidesattributesNAME,AGEandZIP_CODE,thefollowingtablepresentcorrectfilterexpressions:AGE=32 OnlyrecordsthathaveAGEequalto32willbe

considered.AGE<=32andZIP_CODE=“30030”andNAME=“SMITH”

OnlyrecordsthathaveNAMEequalto“SMITH”andZIP_CODEequalto30030andAGElessorequal32willbeconsidered.

10

ZIP_CODE=“30030”orAGE>64 OnlyrecordsthathaveZIP_CODEequalto30030orAGEgreaterthan64willbeconsidered.

Notethatthefilterexpressioneditorhassometoolsthatcanhelpyouwithaproperfiltercondition.The“Addfilterexpression”buttonopensawindowthathelpstoaddanewexpressiontothefilter,the“Clearfilter”buttonclearsthefilter(emptyfilterisequivalenttonofilter),andthe“Validatefilter”buttoncheckswhethertheselectedfilterexpressioniscorrect.Inaneventoffilterexpressionbeingincorrect,theerrormessagegivesanexplanationoferror.

Filterexpressionscaninvolveanarbitrarynumberofconditions.Youcanuse“and”and“or”asbooleanoperatorsandparenthesis(“(“,“)”)todefinetheorderoftheoperations.Bothoperators(“and”and“or”)canappearinanexpressionatthesametime,e.g.,thefollowingfilterexpressioniscorrect:NAME=“JOHN”or(AGE=33andZIP_CODE=“30030”)

Notethatanypreprocessingofadatasourcehappensbeforetheactuallinkage.Itmeansthatifyoudecidetodeduplicateadatasource,thedatasourcewillbededuplicatedbeforeprovidingdataforthelinkageprocess.Thesameruleappliestofiltering(recordswillbefilteredbeforeprovidinganydataforthelinkage).Thisfactmayhaveanimpactontimeittakestocompletetheoveralllinkageprocess:deduplicationmayincreasethistimeasitaddstocomputationoverhead,whilstfilteringwillreduceitmostoftimesasitlimitsanumberofrecordsinthelinkagephase.

Figure4:Configurationofdatasourcepreprocessing

11

Figure5:Converterconfiguration

Figure6:Dynamicanalysiswindow

12

Figure7:TrimconverterextractingMMDDfromdate

Figure8:TrimconverterextractingYYfromdate

13

Figure9:Mergeconvertermergingdate

Figure10:Completeddatasourceconfiguration

Nowyoushouldbedonewiththeconfigurationofthefirstdatasource.Savetheconfiguration.Usethesamestepstoconfiguretheseconddatasource.Clickonconfigurationoflowerdatasourceinthemainwindow(clickonbuttonnexttolabelwithnumber2inFigure2).ChooseCSVfileandnamethesource“SecondFile”choosethefile“generated‐data.csv”asaninputfile(locatedinthesamedirectoryasthefileusedbefore)andcommaasaseparator.Thisfileprovides4attributes:patient_name,DOB,height,weight.Makesuretoanalyzethedatausing“Showfieldssummary”button.Donotuseanyconverters;justaddallthecolumnstoOutmodel(tryoption“Addallcolumnstooutmodel”).Don’tforgettosaveyourconfigurationwhenyoucompletethistask.

14

Thinkwhywechosetousemergeconverterforfirstandlastnameandwhywedidallthethingswiththedateswhenthefirstdatasourcewasconfigured.Thesecondfileprovidesjustnamesofpatients,withnodistinctionbetweenfirstandlastnames.Also,theformatofdateinthesecondfileisMMDDYY.Byapplyingtheconvertersasdescribedaboveweresolvedschemadiscrepancybetweenthetwoinputfilesweareusing.

Atthispointyoushouldhaveafullconfigurationofbothdatasources.Youcannowmovetoaconfigurationofthelinkageprocess.Beforewedothat,however,let’sgooveralltheconvertersthatareavailableinthesystemasyoumightfindthosequiteusefulinyourprojects.

MergeconverterSampleusage:mergefirstnameandlastnamestoredinseparateattributesintooneattribute.Thisconverterisusedtomergetwoormoreattributesintooneattribute.Wehavealreadyseenhowtouseitbefore.

SplitconverterSampleusage:splitnameintofirstnameandlastname.Thisconverterdoestheoppositetothemergeconverter.Youcansplitoneattributeintotwoormoreattributes.Totryit,opentheconfigurationoftheseconddatasource(theoneyouconfiguredbyyourself).Gotothesecondstepoftheconfiguration(clickNextinthefirststep).Clickontheboxthatreadspatient_name(eitherofthetwo).ChooseUseconverter‐>Splitconverter.Youseethatyoucanconfigurehowmanyofattributeswillbegeneratedfromtheinputattribute(defaultvalueis2).Youmightalsorenametheattributesandchoosethesplitcharacter.MaketheconfigurationofthisconverterconsistentwithFigure11andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Remember,youcanalwaysplaywithattributeswhiledynamicanalysiswindowisopentoseewhatyourconverterisactuallydoing.TryclickingOK.NowyouseehowtheSplitconverterisrepresentedintheconfigurationwindow.Click“Cancel”onthewizardtopreventsavingthechangesyoujustmade.

15

Figure11:Samplesplitconverterconfiguration

TrimconverterSampleusage:cutfirsttwodigitsfromyearsothatyouhaveYYformatinsteadofYYYY.Thisconverterisusedtotrimsomecharactersfromattributevalues.Wehavealreadyseenhowtouseitbefore.

AppendconverterSampleusage:appendstring“inch”toeveryvalueofheightattribute,sothatyourattributehasaformof“XXXinch”whereXXXaresomenumbers.Thisconvertercanappendagivenstringtothefrontorendoftheattributevalue.Theinputvaluesaretreatedasstrings,andyoucanappendtofrontand/orendofthestring.Totryit,opentheconfigurationofthefirstdatasource(theoneweconfiguredtogether).Gotothesecondstepoftheconfiguration(clickNextinthefirststep).ClickontheboxthatreadsDOB.ChooseUseconverter‐>Appendconverter.MaketheconfigurationofthisconverterconsistentwithwhatyouseeinFigure12andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Closethedynamicanalysiswindowwhenyouaredoneandclick“Cancel”intheconfigurationofAppendconverter.

16

Figure12:Sampleappendconverterconfiguration

ReplacestringconverterSampleusage:Replaceabbreviationsinstreetnameswithfullnames(like“Dr.”‐>“Drive”or“Ave.”‐>“Avenue”).Thisconvertercanreplacespecifiedvaluesingiveninputstringwithsomeothervalues.Totryit,opentheconfigurationofthefirstdatasource(theoneweconfiguredtogether).Gotosecondstepoftheconfiguration(clickNextinthefirststep).ClickontheboxthatreadsDOB.ChooseUseconverter‐>Replacestringconverter.MaketheconfigurationofthisconverterconsistentwithFigure13(youwillneedtouse“+”buttonthreetimestoconfigurethereplacementrules,notealsothatyoucanedittheconfigurationbypressingthebuttonbetween“+”and“‐“)andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Closethedynamicanalysiswindowwhenyouaredoneandclick“Cancel”intheconfigurationofAppendconverter.

Replacestringconverteriscasesensitive.Thismeansthatifyourinputvalueis“10FirstST”andtheonlyreplacementruleis“St”‐>”Street”,the“ST.”intheinputvaluewillnotbereplaced.Tohaveitreplaced,youwillneedtoaddsecondreplacementrulethatreads“ST”‐>”Street”.

Replacestringconverteristricky.Itsreplacementrulesareregularexpressions.Forinstance,“.”doesnotmatchdot,butanycharacter.Areplacementrule“A.B”‐>”CD”willnotreplaceonlystrings“A.B”with“CD”butanycharacterinform“AxB”,wherexisanycharacteras“.”inregularexpressionmeansanycharacter.Tocodetheruleaboveproperlyyouwouldneedtousethefollowingsyntax:“A\.B”‐>”CD”(“\”tellstotreat“.”asadot,notasanycharacter).Formoreinformationonregularexpressionsrefertohttp://www.regular‐expressions.info/reference.html.

17

Figure13:Replacestringconverter

18

ConfigurationoflinkageOnceyouaredonewiththeconfigurationofthedatasources,youcanmovetoconfigurationofthelinkageprocess.Tostartthelinkageconfiguration,clickonthebuttonnexttothelabelwithnumber3inFigure2.Youshouldseethelinkagewizard.AcceptthedefaultconfigurationforDatastratificationandclick“Next”.NowyouareabouttotellFRILhowyouwanttoidentifymatchesintheinputdatafiles.

Theconfigurationoflinkagestronglydependsontheconfigurationofdatasources.Ifyouhavenotconfigureddatasourcessofar,gobacktotheprevioussection.Otherwiseyouwillnotbeabletofollowthestepsoutlinedinthissection.

ThemostimportantpartinthelinkageconfigurationisconfigurationofsocalledJoincondition.InJoinconditionyouneedtospecifywhichattributesfromthefirstdatasourcewillbecomparedwithwhichattributesfromtheseconddatasources.Basingonthesecomparisons,FRILwillcalculatematchscoreforeachofpairofrecordsthataretested,andwilldecidewhetherthepairisamatchorpotentialmatch.Toconfigureyourfirstjoincondition,followthesesteps:

1. Click“+”locatedinthepanel“Joincondition”2. Youshouldseeanewwindow.In“Leftcolumn”[email protected]“Rightcolumn”click

[email protected]. Nowyouneedtospecifyhowtheattributeschosenaboveshouldbecompared.Youknowthatthe

attributesarenames,andnamestendtohavemisspellingsorerrors.Probablysomecomparisonmethodthattakesthatintoaccountwilldothebestinthiscase.Let’sthenchoose“Editdistance”asa“Distancemetric”.

4. Nowconfigurationpanelforthedistancemetricshouldbeupdated.Detaileddescriptionofthepanelcanbefoundbelow(seesectionDistancemetrics).Atthemoment,let’sjustacceptthedefaultconfigurationoftheeditdistanceconfiguration.

5. Settheweightvalueto40.MakesureyouconfigurationmatchesFigure14.ClickOK.

Theconfigurationofdistancemetric(Figure14)containstheparametercalled“Emptyvaluescore”.ThisparameterallowssettingthebehaviorofFRILwhenatleastoneofinputvaluesisempty.Recallthateachdistancemetricreturnsvaluebetween0.0and1.0,andthevaluerepresentshowsimilararetheinputvalues.Ifanyoftheinputvalues,however,isempty,thesimilaritybetweenthemisdefinedusingthevalueofparameter“Emptyvaluescore”insteadofusingthedistancemetric.Forinstance,ifyousetthisparameterto0.6,thenthescoreformatchinganemptyvaluewithanyothervaluewillalwaysbe0.6.

6. Windowcloses,andnowyoushouldseeonerowinJoincondition.ObservealsotheCurrentsumofweights–thevalueshouldbe40anditshouldbered.ThereasonisthatallweightsinFRILcurrentlyhavetoaddupto100.

7. Click“+”againandnowchooseDate@[email protected]’suseagaintheEditdistance(despitethisisadate).Changethe“Approvelevel”ineditdistanceconfigurationto0.1and“Disapprovelevel”to0.3(seeFigure15).Setweightoftheconditionto30.ClickOK.

8. Click“+”againandnowchooseheight@[email protected]’susetheNumericdistance.

9. Inthenumericdistanceconfiguration,selectoption“Range(percentage)”andinsert10intobothactivetextfields(seeFigure16).Wearesayingthatheightswillbetreatedasnumbers,andthetolerancebetweenvaluescanbeupto10%.

10. Specify15asaweightandclickOK.

19

11. Repeatsteps7‐10forattributesweight@[email protected]. Changethevalueofacceptancelevelto71.13. Click“+”inthe“Outputcolumns”panel.Selectalltheavailablecolumns(left‐clickonfirstoptionin

thelist,thenpressShiftandwiththeshiftpressedleft‐clickonlastoptioninthelist).ClickOK.

Youcanarrangecolumnsinthe“Outputcolumns”asyoulike.Clickonanyiteminthelistandusearrowsontherighttochangetheorderofattributes.Forinstance,itmightbeusefultohavecolumnsthatarebeingcomparedinjoincondition(likenameandpatirnt_name)nexttoeachother.Also,youdonotneedtoaddallthecolumnsto“Outputcolumns”.Forinstance,Date‐MMDDandDate‐YYareonlytemporaryattributesandprobablycouldberemovedfromthelist.

14. YourresultingconfigurationshouldlooksimilartooneinFigure17.

Theconfigurationoflinkage(Figure17)hasabutton“Manualreview”.Usethisoptionifyouwanttomanuallyapproveorrejectcertainlinkages.Onceyouclickonthisbutton,youcanenabletheprocessofmanualreview,andsetthelevelsformanuallyreviewedlinkages.Specifically,youwillseeawindowsimilartotheonepresentedinFigure18.Notethatyoucanslidetheredlinesothatthemanualreviewlevelwillbeadjustedtoyourneeds.

Oncethemanualreviewprocessisenabled,itwillbesignalizedbythegreen“bulb”belowthe“Manualreview”.

15. Click“Next”.16. Inthelaststepofjoinconfigurationchoosethe“Sortedneighborhoodmethod”asajoinmethod

type.17. Intheconfigurationofthisjoinmethod,setwindowsizeto8andmakesurethatthesortorderis

asfollows:name,Date,height,weight(seeFigure19).Ifnecessary,clickonentriesinsortorderandusearrowsontherighttosettheappropriateorder.Checkbothcheckboxesatthebottom(summarycreation).

18. MakesureyourconfigurationmatchesFigure19.Ifso,clickFinishinthelinkagewizard.19. Savetheconfiguration(File‐>Save).

Thearrowsontherightthatallowyoutochangethesortorderinsortedneighborhoodmethodwillbedisabledifyoudonothaveanyrowintheordertableselected.

Formoredetailsondistancemetricsseesection“Distancemetrics”.Descriptionofdifferentjoinmethodsisoutsidethescopeofthistutorial.Fordetailsonthatrefertothepublication“FRIL:AToolforComparativeRecordLinkage”.LinktothepublicationcanbefoundonFRILhomepageinthesectionDocumentation/Publications.

ThesummaryfornotjoineddatamaynotbeavailableforallthesearchmethodsthatareavailableinFRIL.Forinstance,ifyouchooseNestedloopjoinasasearchmethod,onlyonesummarybuttonisavailable.Thisiscausedbydetailsofimplementationofeachofsearchmethods.Simplyput,forsomecasesitisextremelyhardtoimplementefficientlythisfeature(e.g.,withoutextendingsignificantlyatimeofthelinkageorCPU/memoryconsumption).

Atthispointyouhavecompleteaconfigurationofdatasourcesandthelinkageprocess.Youarereadytoproceedtotheconfigurationofresultssavers.

20

Figure14:Conditionusedtocomparenames

21

Figure15:Editdistanceconfigurationfordates

Figure16:Numericdistanceconfiguration

22

Figure17:Completedsecondstepoflinkageconfiguration

Figure18:Configurationofmanualreview

23

Figure19:Laststepoflinkageconfiguration

24

Resultsavers/ResultsdeduplicationConfigurationofresultssaversistheeasiestpartoftheconfiguration.Allyouneedtodoistospecifywheretosaveyourdata.Clickonthebuttonnexttothelabelwithnumber9inFigure2.Youshouldseetheresultssaverswizard(seeFigure20).Followthesteps:

1. Clickcreatenewbutton(“+”).2. Anewwindowshouldappear.AsaresultsavertypechooseCSVfiledatasaver.3. Asanoutputfileleave“results.csv”.4. ClickOK.5. Leavethesettingsofresultsdeduplicationfornow(seedetailsbelow)andpressFinish.6. Savethecurrentconfiguration.

Figure20:Resultsaver/Resultsdeduplicationconfiguration

That’sit,youaredonewiththeconfigurationoflinkage.Beforewemoveontothenextsection,however,wewilltakeacloserlookattheoptionofdeduplicationoflinkageresults.

Thefirstquestionweneedtoansweriswhydoyouneedtheresultsdeduplicationatall.It’sallbecauseofduplicatesyoucangetwhenyoulinkyourfiles.Imagineasituationwhereyouhavetworecordsinthefirstfile,XandY.Therecords(let’sassumethedatacontainsinformationaboutpatients)canhavesuchattributesasfirstandlastnames,DOB,heightandweight.Ourtworecords,XandYcouldactuallycontain

25

informationabouttwins.Insuchacase,youcanassumethesamelastnames,thesameDOB,similarheightandweight.Whataboutthefirstnames?Well,theycanalsobeverysimilar,asparentssometimesliketogivetwinssimilarnames.Now,let’sassumethatyouarelinkingthisfileagainstsomeotherfile.FRILcanfindthatbothrecordsXandYmatchtothesamerecord,sayZ,fromthesecondfile.ThereasonisthattheXandYareverysimilarandthustheycanmeetthelinkageconditionandacceptancelevelyouspecified.Intheconsequence,wehavetwolinkages:X‐ZandY‐Z.Asyousee,theZrecordislinkedtwice,andthuswecallthissituationaduplicateinresults.

FRILgivesaneasy‐to‐useoptiontosolvetheproblemdescribedabove.Whatyoucandoistoenableresultsdeduplication.Ifyoudoso,FRILwilllookforsituationsasaboveandwillattempttochooseoneofthelinkages,rejectingthesecondonebeforesavingtheresults.Ifthedecisioncannotbemadeautomatically,thetoolcanaskyouwhattodo.

Toconfiguretheresultsdeduplicationyouhavetochooseoneofthethreeoptionsyouhavein“Resultsdeduplication”inFigure20.Theoptionsyouhavespecifythetypesofduplicatesyoumightwanttolookfor:

1. Youcanrequirethateveryrecordfromthefirstdatasourcewillbelinkedwithexactlyonerecordfromtheothersource.Inthiscase,however,youcanhaveonerecordfromtheotherdatasourcetobelinkedwithmultiplerecordsfromthefirstsource.

2. Youcanrequirethateveryrecordfromtheseconddatasourcewillbelinkedwithexactlyonerecordfromthefirstsource.Inthiscase,however,youcanhaveonerecordfromthefirstdatasourcetobelinkedwithmultiplerecordsfromtheothersource.

3. Youcanrequireboth,e.g.,exactlyonerecordfromthefirstsourcetobelinkedwithexactlyonerecordfromthesecondsource,andexactlyonerecordfromthesecondsourcetobelinkedwithexactlyonerecordfromthefirstsource.

Ifyouhavesomeexperiencewithdatabasesystems,youshouldseethattheoptionsinresultsdeduplicationrelatetotherelationtypesyoucanhavebetweentablesinarelationaldatabase:many‐to‐one,one‐to‐manyandone‐to‐onerelations.

Insomecasesyoumightnotwanttoremoveallduplicates,butyoumightwanttouseoneofthefirsttwooptionsyouhave.Considertwofiles,oneprovidingdataaboutchildrenandtheotherprovidingdataaboutmothers.Itisperfectlyfinetohaveonemotherlinkedwithtwoormorechildren.However,itisnotacceptabletohaveonechildtobelinkedwithtwoormoremothers.

Thesecondoptionthatyouhaveavailableintheconfigurationofresultsdeduplicationis“Deduplicationconflicts”.ThisoptionaddressestheduplicatesthatcannotbedecidedautomaticallybyFRIL.NotethatFRILwillchoosethebestlinkagefromduplicatesbasedonalinkagescore.E.g.,ifyouhavetwoduplicatesasdiscussedabove,X‐ZandY‐Z,andifyourconfigurationprohibitsthistypeofduplicates,thenthededuplicationwillchoosethelinkagewiththehighermatchscore.However,ifthetwolinkageshavethesamescore,anautomaticlegitimatedecisioncannotbemadebyFRIL.Inthiscase,youcanconfigureFRILtodooneofthefollowing:

1. Donothing–thisoptionwillleavebothduplicatesintheresults

2. Chooseonelinkagerandomly–thisoptionwillchooserandomlyoneandrejecttheotherlinkagesbeingduplicates.Onlythechosenlinkagewillbeincludedintheresults.

26

3. Askyouaboutthedecision–inthiscaseFRILwillpopupawindowinwhichyouwillbeabletodecidewhichlinkagesarefalsepositivesandwhicharetruepositives.

Runningthelinkage,linkageconfidenceTorunthelinkage,clickonabuttonnexttothelabelwithnumber8inFigure2.Youshouldseealinkageprogresswindow(seeFigure21).Notethatyoucanclickonthe“arrow”or“chart”buttons.Tryit:

1. Arrowbuttonopensalinkagedebuggingwindow.Asyouopenit,youcanseewhatiscomparedwithwhatinthelinkageprocess.Youcananalyzetheprocessandseeifyoumadeanyerrorsintheconfiguration.Ifapairofrecordsisjoined,itappearsgreen.Ifitisnotjoined,itappearsred.

2. Chartbuttonopensahistogramofthelinkageresults.

Whenlinkagedebuggingisenabled,thelinkageprocessismuchslowerthanwhendebuggingisdisabled.Makesuretodisableitonceyoufeeleverythingrunswell.

Figure21:Linkageprogresswindow

Oncethelinkageprocesscompletes,youshouldhaveinformationonnumberofrecordsthathavebeenlinked(inourcase2972).Onceyouseethatnotification,clickOK.Notethatyoucouldeasilyrerunthelinkagebypressingagainthestartbutton.InyourfilebrowsernavigatetothedirectoryonyourdiskwhereyouhaveFRIL.Nowyoushouldseethreenewfilesinthatdirectory:

1. results.csvThisfilecontainsresultsofyourlinkage.

Nameofresultsfilecouldbedifferentifyouchosedifferentconfigurationofresultsaver(stepnumber3atthebeginningofthissection).

2. minus‐FirstFile.csvListofrecordsfromtheFirstFilesourcethathavenotbeenjoined.

3. minus‐SecondFile.csvListofrecordsfromtheSecondFilesourcethathavenotbeenjoined.

27

Nowlet’sopentheresults.csvfile.UseExceloranyotherspreadsheetapplicationlikeCalcfromOpenOffice.Youcanseethatyourresultsfilecontainslinkages.ThelastcolumnintheresultsfileiscalledConfidence.ThisnumbertellsyouhowconfidentFRILwasaboutgivenlinkage(valuebetween0and100).Sortontheconfidencesothatrecordsareinascendingorder.Youshouldseethattherewasonelinkagewithconfidence75.Observethatthislinkagehasdifferencesindatesandheight.Inspectsomelinkageswithotherconfidencelevels.MovetothenextsectiontoseedetailsonhowFRILcalculatesthoseconfidencescores.

Alternatively,youcanchooseaFRIL‐integratedresultsviewertoseethelinkageresultsandnotjoineddatarecords.Touseit,clickononeofthebuttonswithlabels2,5or10inFigure2.Pleasenotethattheminusviewercanbedisabledifyoudidnotconfigurethelinkageprocesstocreateasummaryofnotjoineddata.TheFRIL‐integratedrecordsviewerispresentedinFigure22.

Figure22:FRIL­integratedrecordsviewer

Themaincomponentsoftherecordsviewerincludeatoolbar(topofthewindow),contentsandastatusbar(bottomofthewindow).Themajorelementsare(thenumberscorrespondtothenumbersinlabelsinFigure22):

1. SavebuttonThisbuttonopensadialogthatallowstosaverecordsvisibleintheviewertoafile.Youcaneither

28

savealltherecordsreturnedbythecurrentquery,oronlyrecordsvisibleoncurrentpage.Notethattherecordsviewerdividesrecordsintopages,witheachpageshowing200recordsbydefault.

2. FilterbuttonThisbuttonenablesfilteringofthevisiblerecords.Theconfigurationoffilteringisthesameastheconfigurationoffilteringforadatasourcediscussedinprevioussections.

3. SortbuttonThisbuttonenablesrecordssorting.Youcansortrecordsbyanyattribute(s).Inthecaseoflinkages,youcanalsoenablesortingbasedonconfidenceofthelinkages.

4. DetailsbuttonThisbuttonopensadetailswindow.Thedetailswindowpresentsvaluesofalltheattributesofrecordsintheview.Toseethedetails,youhavetomoveyourmousepointeroverarecordyouwanttoseethedetailsof.

5. PreferencesbuttonThisbuttonopensapreferenceswindow.Youcanconfiguresuchpreferencesasvisibleattributesorcolors.

6. Statusoffilter/sortThiscomponentgivesquickinformationaboutsort/filterbeingenabledordisabled.Ifsortand/orfilteris(are)enabled,thecolorofthecorrespondingdotturnstogreen.

7. Forward/backwardbuttonsandcurrentpagenumberThebuttonsinthiscomponentallowtomovebetweenthepagesofrecords.

Notethatyoucanopenalinkagesviewwindowandnotjoinedrecordsviewwindow(s)atthesametime.Thismighthelpinyouranalysisofthelinkageresults.

Howconfidenceleveliscalculated?CalculationofaconfidencescoreinFRILissolelybasedonconfigurationyouprovided.RefertoFigure17.Youalreadyknowthatthesumofallweightsinjoinconditionhastobeequalto100.Observethatwehave4sub‐conditions:fornames,dates,heightandweight.Theweightsforeachofsub‐conditionsare40,30,15and15,respectively.InFRIL,eachofdistancefunctionsreturnsvaluebetween0and1.Tocomputethetotalscoreforcomparedpair,FRILmultipliestheweightofeachsub‐conditionbythevaluereturnedbychosendistancefunction.Then,thosenumbersareaddedtogethertocalculateatotalscoreforgivenpairofrecords.IfthetotalscoreisequalorgreaterthantheApprovelevel(refertoFigure17),thepairofrecordsisconsideredtobeamatchandissavedintheresultsfile.Tofindmoreinformationontheprocessofscorecalculation,pleaserefertopublications[1,2].

Togiveanexample,let’sconsiderwehavethefollowingrecordfromsourceFirstFile:

Name=JohnSmith,Date=101182,Height=80,Weight=180

Let’sconsideranotherrecordfromsourceSecondFile:

Patient_name=JohnSmith,DOB=101182,Height=85,Weight=160

29

Now,let’sassumeagainthatconfigurationofFRIListhesameastheonepresentedinFigure17.Thenamesanddatesarecomparedusinganeditdistance,andasthosevaluesarethesame,theeditdistancereturns1.Valuesofheightinbothrecordsaredifferent.However,tocomparetheheightsweareusingaNumericdistancewithparameters+/‐10%(refertoFigure16).Inthatcase,theallowablerangeforheightisfrom(80–10%of80)to(80+10%of80).Thisgivesarangeof72‐88.ThevalueofheightinthecomparedrecordfromtheSecondFilefallswithintherangeabove,andthescorereturnedbythenumericdistanceforheightwillbe(88‐85)/(88‐80)=3/8=0.375.Asimilarsituationismetforweight:anumericdistancewith+/‐10%isusedforcomparison.Thistime,however,allowablerangeis162‐198(180+/‐10%),andintheconsequenceofcomparingthevalue180with160thedistancefunctionreturns0.Thetotalmatchscoreofthosetworecordswillthusbe(rememberthevaluesofourweights):1*40+1*30+0.375*15+0*15=75.6≈76

Thescoresyouaregettingcanbeinfluencedbyvalueoftheparametercalled“Emptyvaluescore”ifyourdatacontainsemptyvalues.

Thenextsectiongivesmoredetailsonalltheavailabledistancemetrics.

30

DistancemetricsDistancemetricsareoneofthemostimportantcomponentsavailableinFRIL.Byspecifyingdistance

metricbetweentwoattributes,youareprovidinginformationonhowtheattributesshouldbetreated(asstrings,numbers,dates)andwhatareallowablediscrepanciesbetweenvaluesoftheattributes.Rememberthateachdistancefunctionmapssimilaritybetweenvaluesofattributesontovalueintherange[0…1],where0means“arenotsimilarintermsofthisdistancefunction”and1means“areexactlythesameintermsofthisdistancefunction”.Inthereminderofthissectionwewilldescribeinmoredetailsthe6coredistancefunctions.

Equalfieldsbooleandistance

Inputvaluesaretreatedasrawstrings.Thisdistancefunctionreturnsonlytwovalues,0or1.0isreturnedifcomparedvaluesaredifferent,1ifareexactlythesame.Noadditionalconfigurationisrequired.

Examples:

Firstvalue Secondvalue Score“ABC” “ABC” 1“ABC” “ABB” 0“1.0” “1” 0“1.0” “1.0” 1“10/12/08” “10/12/2008” 0“10/12/2008” “10/12/2008” 1

Editdistance

Inputvaluesaretreatedasrawstrings.Thedistancetestshowmanyoperationsneedtobeappliedtothefirststringsothatitisconvertedtothesecondstring.Possibleoperationsincludedeletingacharacter,insertingacharacterorswitchingtwocharactersthatarenexttoeachother.Youneedtospecifyapprovelevelanddisapprovelevelwhenthisdistancemetricisused(pleaseseeFigure23).Approveanddisapprovelevelintheuserinterfacearealsoboundtothesmallplotwithaccept/rejectvalues.Onceyouchangeeitherofthevaluesinthetextfields,theplotisupdated,andonceyoudragthelineontheplot,thevaluesintextboxesarealsoupdated.Bothparametersthatneedtobeconfiguredcanhavevaluesfromrange[0…1].Thescoreofeditdistancefunctioniscalculatedusingthefollowingformula(strAisthefirststring,strBisthesecondstring,e(strA,strB)istheeditdistancebetweentwostrings,aisanapprovelevelanddisadisapprovelevel):

score(strA,strB) =

0,if e(strA,strB) > d *max(length(strA),length(strB))1,if e(strA,strB) < a*max(length(strA),length(strB))d *max(length(strA),length(strB)) − e(strA,strB)

(d − a) *max(length(strA),length(strB)), otherwise

31

Figure23:Editdistanceconfiguration

Examplesofscorevaluewhenaneditdistanceisused:

Firstvalue Secondvalue e(strA,strB) Approvelevel

Disapprovelevel

Score

“A” “A” 0 0.1 0.3 1“A” “B” 1 0.1 0.3 0“ADAM” “ADAMS” 1 0.1 0.3 0.5“JACOBDOBBS” “JAKOBHOBBS” 2 0.1 0.3 0.59“JASPERCISNEROS” “ADENCISNEROS” 4 0.1 0.3 0.17“BREANNAROBISON” “BRENNAROBINSON” 2 0.1 0.3 0.83

Don’tforgetthatyoucanusethemagnifyingglasstoseehowcurrentconfigurationofeditdistancecalculatesscores.Withtheanalysiswindowopen,youcanchangetheparametersandtheresultspresentedinthedynamicanalysiswindowwillbeupdated.

Jaro­Winklerdistance

Inputvaluesaretreatedasrawstrings.ThisdistanceusesamethodproposedbyWinkler.Itissuitableforshorterstrings,suchasnames.

Thevalueofthisdistancemetriciscalculatedasfollows:

score(s1,s2) =max(100,d j + lp(1− d j ))

wheredjisthevalueofJaro‐Winklerdistancebetweenstringss1ands2(seebelow),listhelengthofcommonprefixbetweenstringsandpisascalingfactor.ThelandpareparametersthatcanbesetinFRIL‐thesystemusesscaleptogivemorefavorableratingstostringsthatmatchfromthebeginningforasetprefixlengthl.Thedefaultvalueforpis0.1andthevalueforlshouldnotexceed4.

TheJaro‐Winklerdistanceisdefinedas:

32

d j =13( m| s1 |

+m| s2 |

+m − tm

)

wheremisanumberofmatchingcharacters,andtisthenumberoftranspositions.

Twocharactersfroms1ands2arematching,iftheyarenotfartherthan

max(| s1 |,| s2 |)2

−1positions

apart.Thenumberofmatching(butdifferentsequenceorder)charactersdividedby2definesthenumberoftranspositions.

Examples(p=0.1andl=4):

s1 s2 Matchingstringfroms1

Matchingstringfroms2

m t dj score(s1,s2)

“CRATE” “TRACE” “RATE” “RACE” 3 0 0.867 0.867“DWAYNE” “DUANE” “DANE” “DANE” 4 0 0.223 0.84

Q­gramsdistance

Inputvaluesaretreatedasrawstrings.Thedistanceoperatesonso‐calledq‐grams,substringsfrominputstringsoflengthq.Bothinputstringsarefirstdividedintoq‐grams.Thisoperationproducestwosetsofso‐calledgrams.Next,acountofitemsthatdonotappearintheintersectionofthosetwosetsiscalculated.Theresultingnumberhasanimpactonthescore.Configurationofthisdistancerequiresspecificationofq,approvelevelanddisapprovelevel(seeFigure24).Approveanddisapprovelevelvalueshaveasimilarmeaningasintheeditdistance.Thefinalscoreiscalculatedasfollows(strAisthefirststring,strBisthesecondstring,gramsreturnsq‐gramsofgivenstring,diffreturnsacountofitemsthatdonotappearintheintersectionofsets,aanddareapproveanddisapprovelevels,respectively):

score(strA,strB) =

1, if diff (grams(strA),grams(strB)) < round(a* (| grams(strA) | + | grams(strB) |))0, if diff (grams(strA),grams(strB)) > round(d * (| grams(strA) | + | grams(strB) |))

1− diff (grams(strA),grams(strB))− round(a* (| grams(strA) | + | grams(strB) |))round(d * (| grams(strA) | + | grams(strB) |)) − round(a* (| grams(strA) | + | grams(strB) |))

otherwise

33

Figure24:Q­gramsdistanceconfiguration

Examples(allexamplesareforq=3):

Firstvalue Secondvalue Firstvaluegrams Secondvaluegrams

Approve/Disapprovelevel

Score

“WILLIAMYOUNG”

“WILLIAMYOUNG”

WIL,ILL,LLI,LIA,IAM,AM,MY,YO,YOU,OUN,UNG,NG,G

WIL,ILL,LLI,LIA,IAM,AM,MY,YO,YOU,OUN,UNG,NG,G

0.3/0.1 1

“DONOVANBLACK”

“DONOVANPACK”

DON,ONO,NOV,OVA,VAN,AN,NB,BL,BLA,LAC,ACK,CK,K

DON,ONO,NOV,OVA,VAN,AN,NP,PA,PAC,ACK,CK,K

0.3/0.1 0.2

“AJYHUDDLESTON”

“JAYHUDDLESTON”

AJY,JY,YH,HU,HUD,UDD,DDL,DLE,LES,EST,STO,TON,ON,N

JAY,AY,YH,HU,HUD,UDD,DDL,DLE,LES,EST,STO,TON,ON,N

0.3/0.1 0.8

Notethattheq‐gramsdistanceworkswithsets,andorderofgramsdoesnotmatter.Thismeansthatthedistancewillgiveveryhighscoretothepairsofstringslike“MICHAELADAMS”and“ADAMSMICHAEL”.Therefore,thisdistanceminimizeserrorsduetoswitching,forinstance,first,secondandlastnames.

Don’tforgetthatyoucanusemagnifyingglasstoseehowcurrentconfigurationofeditdistancecalculatesscores.Withtheanalysiswindowopen,youcanchangetheparametersandtheresultspresentedindynamicanalysiswindowwillbeupdated.

34

Soundex

Inputvaluesaretreatedasrawstrings.ImplementstheSoundexalgorithmtocalculateascorebetweentwostrings(usuallynames).Requiresthefollowingparameters:soundexlength,approveanddisapprovelevels(seeFigure25).Soundexlengthprovideslengthofsoundexcodethatwillbecomputed.Editdistanceisusedtocomparethesoundexcodesoftwostrings,andapprove/disapprovelevelsaretheparametersofthiseditdistance.Notethatdefaultvaluesof0forapproveanddisapprovelevelsmeanthatonlysoundexcodesthatareexactlythesamewillgetanon‐zeroscore.FormoreinformationonSoundexseehttp://en.wikipedia.org/wiki/Soundex.

Figure25:ConfigurationofSoundexdistance

Numericdistance

Inputvaluesaretreatedasnumbers(realvalues).Thisdistanceallowsuserstospecifyarangeofvaluesthatwillhaveanon‐zeromatchscore.Twooptionsforspecifyingtherangeareavailable(seeFigure26):

1. RangeasfixedvalueRangeisspecifiedusingfixedvalue.Forinstance,onecansaythatarangewillhaveradiusof2.Then,ifaninputvalueis10,therangewillbe[8…12].

2. RangeaspercentageofvalueRangeasapercentageofinputvalue.Forinstance,onecansaythattherangewillhavearadiusof30%ofinputvalue.Then,ifaninputvaluewas10,theresultingrangewouldbe[7…13].

Notethatyoucanspecifydifferentvaluesforleftandrightsideoftherange.

Besidesthespecificationofarange,usercanspecifywhetherthescoreshouldbecalculatedusinglinearapproximation.Iflinearapproximationisdisabled,thefinalscoreiscalculatedusingthefollowingformula(lowerBoundandupperBoundreturntheboundsoftherange):

score(v1,v2) =1, if v2∈ [lowerBound(v1)...upperBound(v1)]0 otherwise

35

Ifthelinearapproximationisenabled,thefollowingformulaisused:

score(v1,v2) =

upperBound(v1) − v2upperBound(v1) − v1

if v2∈ [v1...upperBound(v1)]

v2 − lowerBound(v1)v1− lowerBound(v1)

if v2∈ [lowerBound(v1)...v1)

0 otherwise

Figure26:Configurationofnumericdistance

Examples:

Firstvalue Secondvalue Linearapproximation

Lowerbound Upperbound Score

10 9 On 6(value‐4) 12(value+2) 0.7510 9 Off 6(value‐4) 12(value+2) 110 11 On 6(value‐4) 12(value+2) 0.510 11 Off 6(value‐4) 12(value+2) 110 12 On 6(value‐4) 12(value+2) 050 48 On 40(value‐20%) 60(value+20%) 0.850 54 On 40(value‐20%) 60(value+20%) 0.6

Datedistance

Inputvaluesaretreatedasdates.ThisdistanceworksverysimilartotheNumericdistance.Insteadofworkingwithnumbers,however,ittreatsinputvaluesasdates.Whenrangesarecomputed,numberofyears,days,hours,minutesorsecondsbetweendatescanbespecified.Optionoflinearapproximationisavailable(thisoptionworksinthesamewayasforthenumericdistance).Inaddition,userhastospecifyaformatofinputdates(seeFigure27).Notethatthedatedistancewillsuggestanappropriateformat.Theformatcanusethefollowingcharacters:

Character Meaning Examplevalue

36

y Year 1996;96

M Monthinyear July;Jul;07

w Weekinyear 27

W Weekinmonth 2

D Dayinyear 189

d Dayinmonth 10

F Dayofweekinmonth 2

E Dayinweek Tuesday;Tue

a Am/pmmarker PM

H Hourinday(0‐23) 0

k Hourinday(1‐24) 24

K Hourinam/pm(0‐11) 0

h Hourinam/pm(1‐12) 12

m Minuteinhour 30

s Secondinminute 55

S Millisecond 978

z Timezone PacificStandardTime;PST;GMT‐08:00

Z Timezone ‐0800

Sometimesyoumightnotwanttouseeditdistanceforattributesthataredates(likewedidinthistutorial).Considercaseswhenhumanscopydatesfrompaperformstoacomputer.Digits1and7lookfamiliar,sothedate10/22/2007couldbetypedinas10/22/2001.Inthosecases,datedistancewillnotperformthebest.Probablybetterchoiceisaneditdistancethatcancapturesucherrors.

37

Figure27:Configurationofdatedistance

Ifformatofdateisnotspecifiedcorrectly,thedatedistancewillnotworkinpredictableway.Usedynamicanalysiswindowtomakesurethattheconfigurationisappropriate.

38

DatadeduplicationDatadeduplicationallowsidentifyingduplicatesingivendatasource.Aduplicateisdefinedastwoormorerecordsthatsatisfyduplicateconditionconfiguredbyauser.TherearetwoaspectsofthedatadeduplicationinFRIL:

1. DeduplicationofdatasourcebeforelinkageDeduplicationbeforelinkageallowsonetoconfigurededuplicationrulesinthelinkageconfiguration.Inthatcase,beforetheactuallinkageisperformed,FRILwilldeduplicateselecteddatasources.Thededuplicationistransparentforuser,andisanintegralpartofthelinkageprocess.Theconfigurationofthededuplicationofdatasourcebeforelinkageisperformedduringthelaststepoftheconfigurationofadatasource.

2. DeduplicationasaseparateprocessFRILcandeduplicatedatasourcewithoutperforminganylinkage.Inthatcase,ausercanchooseadeduplicationmodewhereonlyoneinputfileisrequired(toswitchbetweenmodesuseModemenu).Oncetheconfigurationiscomplete,ausercanrundeduplication.Asaresult,FRILwillsavededuplicateddataand,ifenabledduringtheconfiguration,afilecontainingidentifiedduplicateswillbegenerated.AscreenshotofthededuplicationmodeispresentedinFigure28.Notethattheconfigurationofdeduplicationmodeisquitesimilartotheconfigurationoflinkage,anditrequiresdatasourceconfiguration,deduplicationconfigurationandresultsaverconfiguration.Thedetailsofeachstepoftheconfigurationarediscussedbelow.

Theconfigurationofdeduplicationinbothcasesdescribedaboveisidenticalandusesthesameuserinterface.Toconfigurethededuplicationinadatasourcebeforelinkage,enabledatasourcededuplicationinthelaststepofwizardandpressthePreferencesbutton.Toconfigurededuplicationinthededuplicationmode,press“Configure”buttoninthededuplicationboxinthescreenshotpresentedinFigure28.

39

Figure28:DeduplicationmodeinFRIL

Theconfigurationofdeduplicationmode,asmentionedbefore,hasthreemajorsteps:aconfigurationofdatasource,aconfigurationofdeduplicationprocessandaconfigurationofresultsaver.Belowwedescribedetailsofeachofthesesteps:

1. DatasourceconfigurationDatasourceconfigurationinthedatadeduplicationmodefollowsthesamestructureasadatasourceconfigurationforlinkage.Thesameapproachanduserinterfaceareused.Inthelaststepoftheconfiguration,userscanchoosewhethertofiltertheinputdataandwhethertogenerateasummaryofaduplicatedata.Notethatyoucannotdisablethededuplicationofadatasource,asthiswouldcontradictwiththepurposeofadeduplicationmode.AscreenshotpresentingthelaststepofdatasourceconfigurationindeduplicationmodeispresentedinFigure29.

2. DeduplicationconfigurationInthisstepyouneedtoconfigurearulethatwillidentifyduplicates.Theconfigurationhastwosteps:aconfigurationofcomparedattributesandaconfigurationofblockingattributeusedfordeduplication.AscreenshotofthefirstofconfigurationstepsispresentedinFigure30.Asyoucansee,similarlyasforthelinkageconfiguration,youneedtoidentifyaconditionthatwillbeusedforduplicatesidentification.Theconditioncanuseanyattributesprovidedbydatasourceand

40

selectedasavailableinFRIL.Notethat,unlikeforthelinkageconfiguration,youarechoosingonlyattributesfromonedatasourceasonlyonedatasourceisavailable.Whenaddingnewattributestothecondition,youcanchooseanyoftheavailabledistancemetrics(seethesectionDistancemetricsforadetaileddescriptionofthedistancemetricsavailableinFRIL).Onceyouaredonewiththefirststepoflinkageconfiguration,youcanproceedtothenextone(Figure31).Inthesecondstepyouneedtoconfigureablockingvariable.NotethatFRILwilllookforduplicatesonlyamongtherecordsthatendupinthesameblock.Therefore,youneedtothinkwhatisthebestwaytoblockrecordsinordertoidentifyasmanyduplicatesaspossible.Somegoodideasforblockingvariableare:soundexcodeoflastname,birthyear(orbirthdate)orZIPcode.

Theavailableoptionsforblockingmethoddependondistancemetricusedforchosenblockingattribute.IfEqualbooleandistance,DatedistanceorNumericdistancemetricisused,suchanattributewillonlybeallowedtobuildblocksthathavethesamevalueofattribute.IfEditdistance,Jaro‐Winklerdistance,Q‐gramsdistance,SoundexorStreetaddressdistanceisused,onlytheothertwooptionswillbeenabled.

3. ResultsaverThisisthelaststepofdeduplicationconfigurationinadeduplicationmode.Notethatthisstepisnotavailableindeduplicationofdatasourcebeforelinkage,asthededuplicateddataisnotsavedintoanyfile,butratherusedasaninputofthelinkageprocess.Inthisstepallyouneedtodoistochooseafilenamewherethededuplicatedresultswillbesaved.

Figure29:Laststepofdatasourceconfiguration(Deduplicationmode)

41

Soundexcodeforblockingvariableisavailableonlyforattributesthatarecomparedusingastringdistancemetric(e.g.,editdistanceorsoundexdistance).

Usealwaysanappropriatedistancemetricforgivenfieldtype.Ifyouarecomparingnumericvalues(likeweightorheight),useanumericdistance.Ifyouarecomparingnames,useoneofthestringdistancemetrics.Thiswillallowyoutofindmoreduplicates,forinstancethosecontainingmisspellingsorerrorsinmeasurements.

Howtheduplicatesareidentifiedgiventhededuplicationcondition?ItisveryimportanttounderstandwhatrecordsareconsideredduplicatesbyFRIL.Onceyouprovideyourduplicatesidentificationcondition,tworecordsareconsideredduplicatesifthesumofproductsofdistancemetricvalueforcomparedattributebetweengiventworecordsmultipliedbyweightfortheattributeisgreaterthantheduplicateacceptancelevel.Notethisisverysimilartohowthelinkagesareidentified(seesectionHowconfidenceleveliscalculated?formoredetails).Inthededuplicationyouonlyhave,however,datafromonedatasource.

Emptyvaluescorehasthesameimpactforfindingduplicatesasforfindinglinkages.Refertosection“Howconfidenceleveliscalculated?”formoredetails.

Figure30:Firststepofdeduplicationconfiguration

42

Figure31:Deduplicationconfiguration­step2

Figure32:Secondstepofdeduplicationconfiguration

43

References1. P.Jurczyk,J.J.Lu,L.Xiong,J.D.CraganandA.Correa,“FRIL:AToolforComparativeRecord

Linkage”,AmericanMedicalInformaticsAssociation(AMIA)2008AnnualSymposium2. P.Jurczyk,J.J.Lu,L.Xiong,J.D.CraganandA.Correa,“Fine‐grainedrecordintegrationandlinkage

tool”,Birthdefectsresearch,PartA,Clinicalandmolecularteratology,2008Nov;82(11):822‐9.