fril - sourceforgefril.sourceforge.net/fril-tutorial-3.2.pdf · 4 general architecture before we...
TRANSCRIPT
FRILFinegrainedRecordIntegrationandLinkageTool
TutorialV3.2
08/07/2009
Author:PawelJurczyk
Copyright:EmoryUniversity,Math&CSDepartment,2009
Documenthistory:
Date Version Date Version12/22/2008 1.0 08/07/2009 3.212/29/2008 2.0 06/29/2009 3.0 07/02/2009 3.1
2
Tableofcontents
Tableofcontents ............................................................................................................................................................ 2
Introduction..................................................................................................................................................................... 3
Installation ....................................................................................................................................................................... 3Generalarchitecture..................................................................................................................................................... 4
Mainwindow ................................................................................................................................................................... 4Configurationofdatasources.................................................................................................................................... 7
Configurationoflinkage ............................................................................................................................................18
Resultsavers/Resultsdeduplication ....................................................................................................................24Runningthelinkage,linkageconfidence .............................................................................................................26
Howconfidenceleveliscalculated? ......................................................................................................................28Distancemetrics...........................................................................................................................................................30
Datadeduplication ......................................................................................................................................................38
References ......................................................................................................................................................................43
3
IntroductionThisdocumentprovidesatutorialfortoolcalledFRIL(Fine‐grainedRecordsIntegrationandLinkageTool).FRILisaJava‐basedtoolthatincorporatesarichcollectionofrecorddistancemetrics,searchmethods,andanalysistools.Alongitsworkflow,FRILprovidesarichsetofuser‐tunableparametersaugmentedwithgraphicvisualizationtoolstoassistusersinunderstandingtheeffectsofparameterchoices.InthistutorialthereaderwillobtainknowledgeaboutallthecomponentsavailableinFRIL.
InstallationFRILcanbedownloadedfromitshomepage.Thefileavailablefordownloadisazippedarchivethatcontainsallthefilesrequiredforrunningthesoftwareandexampledatafilesusedinthistutorial.Ifyouhavenotobtainedthesoftwaresofar,doitnow:
1. Gotohttp://www.mathcs.emory.edu/Research/Area/datainfo/FRIL2. GotoDownloadsection3. Downloadthemostrecentbinarydistribution(orsourcedistributionifyouprefer)4. Unzipdownloadedarchive5. OpentheFRILfolder(itshouldbecreatedafterunzippingthefile)6. Runthesoftware.Usefril.bat(Windows)/fril.sh(Linux/Unix)/fril.sh.command(Mac)depending
onoperatingsystemyouuse.7. Ifthesoftwaredoesnotstart,makesureyouhavejava1.6installed.FRILisdevelopedinjavaand
requiresthisframeworktorun.Youcanobtainthelatestjavaframeworkfromhttp://java.com.ForfurtherinformationyoumightalsowanttorefertothefileREADME.txtthatislocatedinaFRILdirectory.
4
GeneralarchitectureBeforewegetstartedwithFRIL,itisimportanttounderstandthearchitectureofthesystem.TheworkflowofFRILisshowninFigure1.Theuserspecifiestheinitialinputfiles.Eachrunrequirestheusertospecifythesearchmethod,thedistancefunctionintheattributecomparisonmoduleandthedecisionmodel.OutputconsistsofsetsM,U,Pandvarioussummarystatistics.SetsUandPmaybefedbackintoFRILwithadifferentsetofparameters.
Figure1:ArchitectureofFRIL
InthistutorialweguideuserthroughallthecomponentsofFRIL,presentinghowtoconfigure,debugandrunefficientandaccuraterecordlinkage.
MainwindowOnceyoustartthesoftware,themainwindowopens.ThescreenshotofapplicationispresentedinFigure2.Notethatthereare5majorcomponentsintheapplication:
• Configurationofthefirstdatasource• Configurationoftheseconddatasource• Configurationofthelinkageprocess• Configurationoftheresultssaver• Linkageprocesspanel
TheFRILmainwindowcontainsthefollowingspecificcomponents(thenumberscorrespondtonumbersinlabelsinFigure2):
1. Configurationofthefirstdatasourcebutton2. Viewofnotjoineddatainthefirstdatasourcebutton3. Indicatorsoffiltering/deduplicationinthefirstdatasource4. Configurationoftheseconddatasourcebutton5. Viewofnotjoineddataintheseconddatasourcebutton6. Indicatorsoffiltering/deduplicationintheseconddatasource
5
7. Configurationoflinkageprocessbutton8. Linkageprocessstartbutton9. Configurationofresultsavers/resultdeduplicationbutton10. Viewoflinkedrecordsbutton11. Indicatorofdeduplicationoflinkageresults12. IndicatorofFRILmode(linkage/deduplication)
ToconfigurethelinkageusingFRILyouneedtogothroughallthemajorcomponents(configuredatasources,linkageandresultsavers).Onceyoucompletethatyoucanstartthelinkageprocessusingthestartlinkagebutton.Intheremainingpartofthistutorialwewillgoovereachofthestepsabovetolinksampledatafiles.
FRILhastwomodes:linkageanddeduplication.ThelinkagemodeisthedefaultoneandispresentedinFigure1.ThededuplicationmodecanbeenabledintheModemenu.PleaseseesectionDatadeduplicationformoredetails.
IfyouhavealreadystartedFRILbefore,youmightbeaskedifyouwanttoopenyourpreviousconfigurationwhenFRILstarts.FRILremembersitsconfigurationandattemptstorestoreituponrestart.
Ifyouarenotsurehowtoobtainorstartthesoftware,gotoInstallationsection.
WorkingwithsavedconfigurationWhenyouworkwithFRIL,saveyourconfigurationoften.Inthatcaseyouwillbesafeifanythingunexpectedhappens(e.g.,FRILhangsorcrashes–yes,itmighthappen,althoughweareworkingtomakethesoftwarestable).Tosaveyourconfiguration,clickon“File‐>Saveas”or“File‐>Save”.Later,whenyouopenFRILagain,youwillhaveanoptionofloadingthesavedconfiguration.Youcanalsoloadconfigurationatanytime.Todoso,click“File‐>Open”.
AutosaveoptionFRILwasrecentlyupdatedtoincludeanautosaveoption(thisoptionisenabledbydefault,youcanchangeitintheToolsmenuitem).Theautosaveoptionautomaticallysavestheconfigurationasyouworkwiththetool.Ifyouclosetheapplicationimproperlyorifitbreaks,youcanbeaskedwhetheryouwanttoloadanautosavedconfigurationwhenyourestartFRIL.
7
ConfigurationofdatasourcesDatasourcesprovidedatathatwillbelinkedinFRIL.Youhaveanoptionofthefollowingdatasources:
1. CSVfileComma‐separatedvaluesfile.Thisisatextfilewheredataisseparatedbycomma,semicolonorotherseparatorcharacter.TheconfigurationofCSVdatasourcerequiresspecificationofaninputfileandacolumnseparator.
2. ExcelfileFilesavedinMicrosoftExcelformat.Note:FRILworkswithExcelfilessavedinOffice1997‐2003fileformat.SupportforOffice2007isonitsway.TheconfigurationofExceldatasourcerequiresspecificationofafilenameandasheetname.Bothparametersarerequired.
3. Fixedwidthcolumnsinputfile(Textfile)Plaintextdatafile.Eachrowinthefileisaseparaterecord,andeveryattributeshouldhavefixedlength.Requiresaschemafile(oradatamodelfile)thatdescribesattributesavailableinthefileandlengthofthoseattributes.Thedatamodelfilenameisoptional,andifnotprovidedFRILwilllookforafilewithextension“.fields”.Forinstance,ifyouuseaninputfiledata.txt,andyoudonotspecifythedatamodelfile,FRILwillexpectthefiledata.fieldstobelocatedinthesamedirectoryasthefiledata.txt.The“.fields”filecouldlookasfollows:
Theexampleabovedefinesatextfilewith3attributes,name,ageandaddress.Eachlineinthetextfileisexpectedtohave83characters(20+3+60).Thefirst20charactersineachlinearedecodedasaname,next3charactersasanageandnext60charactersasanaddress.
4. DatabaseThedatawillbepulledfromdatabase.RequiresJDBClibraries.Currently,thelibrarieshavetobedownloadedmanually,butinthefuturethisprocesscanbeautomated.Note:JDBCgivesalotofoptionsfordatasources.Usingthistechnology,youcanaccessforinstanceMSAccessdatabase,MSSQL,MySQL,PostgreSQL,Oracleandmanymoredatabases.
Let’snowadddatatoFRIL.Followthestepsoutlinedbelowtofullyconfigurethefirstdatasource.Wewillworkwiththesampledatathatshouldbelocatedinthedata‐sampledirectorylocatedinyourFRILhomedirectory.First,clickontheconfigurationofupperdatasourceinmainwindow(clickonabuttonnexttothelabelwithnumber1inFigure2).Awizardfordatasourceshouldappear.InDatasourcetypechooseaCSVfileoption.Namethedatasource“FirstFile”andclickon“…”buttonontherightfromthe“Filename”textfield.FilechoosershouldopeninyourFRILdirectory.Ifnot,navigatethere.Doubleclickondata‐sampledirectoryandclickon“generated‐data‐error.csv”.Click“Open”.YoushouldbebackintheDatasourceconfigurationwizard.In“Columnseparator”optionchooseComma.YourwindowshouldlookliketheonepresentedinFigure3.NowyoushouldbereadytoclickNext.Whenyoudothat,thenextstepinconfigurationofdatasourceappears.Youshouldnowseeafewboxeswiththefollowingnamesonthem:first_name,last_name,DOB,height,weight.
IfyouchooseawrongColumnseparator,FRILwillnotbeabletoreadyourdatainaproperway.Youwilllikelyseeonlyoneboxinthesecondstepofthedatasourceconfigurationthatreadssomethinglike:“first_name,last_na…“.
name20age3address60
8
Figure3:Firststepofdatasourceconfiguration
Tryclickingon“Showfieldssummary”buttonlocatedatthetopwhenyouareinthesecondstepofadatasourceconfiguration.Youwillseethedatasummarywindow.Trytomoveyourmouseoveroneofthe“See”buttons–afewexamplesofdatavaluesshouldbepresented.Thisoptioncanbeusefulwhenexploringthedata.
ThenextstepinconfigurationistochoosecolumnsfromtheinputfilethatyouwillactuallyuseinFRIL.Yourinputfilecouldpotentiallyhavehundredsofdatacolumns,andyoumightreallyneedonlyfewofthemintheapplication.Bychoosingthecolumnsyoureallyneedyoumakethetooltoworkmuchfaster.Inourcase,however,ourdatafilehasonly5columns,andallthecolumnswillbeusedinthelinkage.Let’snowaddattributesthatwillbeusedlaterbyFRIL.Followthesesteps:
1. Wedonotwanttoworkwithfirstandlastnameseparately,butwewanttotreatthosetwoattributesasone(orasname).Clickontheboxthatreadsfirst_name.ChooseUseconverter‐>Mergeconverter.Anewconfigurationwindowshouldappear.Asthe“Outattributename”type“name”.ChooseSpacefortheoption“Joinfieldsusing”.Inthe“Joinedcolumns”clicklast_name@FirstFile.YourwindowshouldlookliketheoneinscreenshotfromFigure5.Wonderingwhatyouweredoing?Clickonmagnifyingglassicon.Youshouldseeadynamicanalysiswindowthatshowsthreecolumns(Figure6).Youseethatyouhavetwoinputcolumns(first_nameandlast_name)andoneoutputcolumn(name).Nowyouseewhatmergeconverterdoes–ittakestwo(ormore)attributesfrominputfileandcreatesanewattribute.Thevalueofthenewattributeiscalculatedbymergingvaluesfromfewcolumns.
Beforeyouclosetheconverterconfiguration,trytochangetheoption“Joinfieldsusing”or“Joinedcolumns”whilekeepingthedynamicanalysiswindowopen.Theanalysiswindowisupdatedeverytimeyoumakeanychangeintheconfigurationofconverters.Makesure,however,toselectvaluesofparametersasdescribedinstep1beforeyouclickOK.
2. Nowyoushouldbereadytoclosetheconfigurationofconverter.ClickOK.3. Clickon“Showfieldssummary”buttonandmoveyourmouseover“See…”buttonfortheattribute
DOB.YouwillseethatthevaluesaredatescodedasMMDDYYYY.Wewillnowchangethe4‐digityearformatinto2‐digitformat.OurgoalistoproducevaluesthathaveformatMMDDYY.
9
4. Closethesummarywindowandclickonboxthatreads“DOB”andselectUseconverter‐>Trimconverter.
5. Intheconfigurationofconverter,asthe“Outputattributename”type“Date‐MMDD”.Selectthecheckboxnextto“Modifyendofstring”.UsetheoptionforCutandsetnumberofcutcharactersto4.MakesurethatyourconfigurationisconsistentwithFigure7.
6. Clickonthemagnifyingglassbutton.Theconverteryoujustcreatedleavesfirst4charactersofeveryDOBvalue.ThosecharactersareMMDDpartofthedate.ClosedynamicanalysiswindowwhenyouaredoneandclickOKintheconverterconfigurationwindow.
7. Clickagainontheboxthatreads“DOB”,chooseUseconverter‐>Trimconverter.8. Thistimespecify“Date‐YY”asan“Outputattributename”,selectthecheckboxnextto“Modifyend
ofstring”andselecttheradiobuttonnextto“Leave”.Type2intheenabledtextfield(matchyourconfigurationwithFigure8).Clickagainonthemagnifyingglassbutton.Thisconverterleavesonlytwolastdigitsofyearfromanyinputdate.Applythechanges(closedynamicanalysisandclickOK).
9. WehaveseparateattributesforMMDDpartofdateandYYpartofdate.Thenextstepistomakeoneattributefromthosetwocomponents.Clickonboxthatreads“Date‐MMDD”andchooseUseconverter‐>Mergeconverter.MatchtheconfigurationwithFigure9.MakesuretoselectNonefortheoption“Joinfieldsusing”.Asusually,clickonmagnifyingglasstoseetheresultsandonceyouaredoneapplythechanges.
10. Clickonboxthatreads“height”.Clickonoption“Addtooutmodel”.Followthesamestepsfor“weight”.
11. YourdatasourceconfigurationshouldlooksimilarlytotheonepresentedinFigure10.Ifnot,adjustyourconfigurationaccordingly.
Tryclickingoneachoftheboxesyouseeinyourconfigurationwindow.Youwillseethateverytypeofboxhasdifferentoptions.E.g.,byclickingonanorangecolorbox(converter),youcanmodifyitssettings.Ontheotherhand,byclickingonabluebox(outputcolumn),youcanrenamethecolumnorremoveit.
Observethatyoucanbuildasequenceofconverterstoachievemorecomplextasksindatamanipulation.
12. Click“Next”tofollowtothelaststepofdatasourceconfiguration(presentedinFigure4).Inthelaststepacceptthedefaultconfigurationandclick“Finish”
Thelaststepofdatasourceconfigurationisadatapreprocessingstepandrequiresabitofexplanation.Youcanconfigureheretwomajortasksforadatasource:deduplicationandfiltering.Datadeduplicationisataskofremovingduplicaterecordsfrominputdata(pleaseseesectionDatadeduplicationformoredetails).Filteringisataskofspecifyingwhichrecordsshouldbeusedinthelinkageprocess.Notethatifyoudonotuseanyfilter,alltherecordsprovidedbygivendatasourcewillbeused.Ontheotherhand,ifyouspecifyanyfilter,onlytherecordsthatsatisfythefilterexpressionwillbeconsidered.Otherrecordswillnotbetakenintoaccount.Thefilterexpressioncanbeanybooleanexpressionthatinvolvesattributesfromadatasource.AssumingthatyourdatasourceprovidesattributesNAME,AGEandZIP_CODE,thefollowingtablepresentcorrectfilterexpressions:AGE=32 OnlyrecordsthathaveAGEequalto32willbe
considered.AGE<=32andZIP_CODE=“30030”andNAME=“SMITH”
OnlyrecordsthathaveNAMEequalto“SMITH”andZIP_CODEequalto30030andAGElessorequal32willbeconsidered.
10
ZIP_CODE=“30030”orAGE>64 OnlyrecordsthathaveZIP_CODEequalto30030orAGEgreaterthan64willbeconsidered.
Notethatthefilterexpressioneditorhassometoolsthatcanhelpyouwithaproperfiltercondition.The“Addfilterexpression”buttonopensawindowthathelpstoaddanewexpressiontothefilter,the“Clearfilter”buttonclearsthefilter(emptyfilterisequivalenttonofilter),andthe“Validatefilter”buttoncheckswhethertheselectedfilterexpressioniscorrect.Inaneventoffilterexpressionbeingincorrect,theerrormessagegivesanexplanationoferror.
Filterexpressionscaninvolveanarbitrarynumberofconditions.Youcanuse“and”and“or”asbooleanoperatorsandparenthesis(“(“,“)”)todefinetheorderoftheoperations.Bothoperators(“and”and“or”)canappearinanexpressionatthesametime,e.g.,thefollowingfilterexpressioniscorrect:NAME=“JOHN”or(AGE=33andZIP_CODE=“30030”)
Notethatanypreprocessingofadatasourcehappensbeforetheactuallinkage.Itmeansthatifyoudecidetodeduplicateadatasource,thedatasourcewillbededuplicatedbeforeprovidingdataforthelinkageprocess.Thesameruleappliestofiltering(recordswillbefilteredbeforeprovidinganydataforthelinkage).Thisfactmayhaveanimpactontimeittakestocompletetheoveralllinkageprocess:deduplicationmayincreasethistimeasitaddstocomputationoverhead,whilstfilteringwillreduceitmostoftimesasitlimitsanumberofrecordsinthelinkagephase.
Figure4:Configurationofdatasourcepreprocessing
13
Figure9:Mergeconvertermergingdate
Figure10:Completeddatasourceconfiguration
Nowyoushouldbedonewiththeconfigurationofthefirstdatasource.Savetheconfiguration.Usethesamestepstoconfiguretheseconddatasource.Clickonconfigurationoflowerdatasourceinthemainwindow(clickonbuttonnexttolabelwithnumber2inFigure2).ChooseCSVfileandnamethesource“SecondFile”choosethefile“generated‐data.csv”asaninputfile(locatedinthesamedirectoryasthefileusedbefore)andcommaasaseparator.Thisfileprovides4attributes:patient_name,DOB,height,weight.Makesuretoanalyzethedatausing“Showfieldssummary”button.Donotuseanyconverters;justaddallthecolumnstoOutmodel(tryoption“Addallcolumnstooutmodel”).Don’tforgettosaveyourconfigurationwhenyoucompletethistask.
14
Thinkwhywechosetousemergeconverterforfirstandlastnameandwhywedidallthethingswiththedateswhenthefirstdatasourcewasconfigured.Thesecondfileprovidesjustnamesofpatients,withnodistinctionbetweenfirstandlastnames.Also,theformatofdateinthesecondfileisMMDDYY.Byapplyingtheconvertersasdescribedaboveweresolvedschemadiscrepancybetweenthetwoinputfilesweareusing.
Atthispointyoushouldhaveafullconfigurationofbothdatasources.Youcannowmovetoaconfigurationofthelinkageprocess.Beforewedothat,however,let’sgooveralltheconvertersthatareavailableinthesystemasyoumightfindthosequiteusefulinyourprojects.
MergeconverterSampleusage:mergefirstnameandlastnamestoredinseparateattributesintooneattribute.Thisconverterisusedtomergetwoormoreattributesintooneattribute.Wehavealreadyseenhowtouseitbefore.
SplitconverterSampleusage:splitnameintofirstnameandlastname.Thisconverterdoestheoppositetothemergeconverter.Youcansplitoneattributeintotwoormoreattributes.Totryit,opentheconfigurationoftheseconddatasource(theoneyouconfiguredbyyourself).Gotothesecondstepoftheconfiguration(clickNextinthefirststep).Clickontheboxthatreadspatient_name(eitherofthetwo).ChooseUseconverter‐>Splitconverter.Youseethatyoucanconfigurehowmanyofattributeswillbegeneratedfromtheinputattribute(defaultvalueis2).Youmightalsorenametheattributesandchoosethesplitcharacter.MaketheconfigurationofthisconverterconsistentwithFigure11andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Remember,youcanalwaysplaywithattributeswhiledynamicanalysiswindowisopentoseewhatyourconverterisactuallydoing.TryclickingOK.NowyouseehowtheSplitconverterisrepresentedintheconfigurationwindow.Click“Cancel”onthewizardtopreventsavingthechangesyoujustmade.
15
Figure11:Samplesplitconverterconfiguration
TrimconverterSampleusage:cutfirsttwodigitsfromyearsothatyouhaveYYformatinsteadofYYYY.Thisconverterisusedtotrimsomecharactersfromattributevalues.Wehavealreadyseenhowtouseitbefore.
AppendconverterSampleusage:appendstring“inch”toeveryvalueofheightattribute,sothatyourattributehasaformof“XXXinch”whereXXXaresomenumbers.Thisconvertercanappendagivenstringtothefrontorendoftheattributevalue.Theinputvaluesaretreatedasstrings,andyoucanappendtofrontand/orendofthestring.Totryit,opentheconfigurationofthefirstdatasource(theoneweconfiguredtogether).Gotothesecondstepoftheconfiguration(clickNextinthefirststep).ClickontheboxthatreadsDOB.ChooseUseconverter‐>Appendconverter.MaketheconfigurationofthisconverterconsistentwithwhatyouseeinFigure12andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Closethedynamicanalysiswindowwhenyouaredoneandclick“Cancel”intheconfigurationofAppendconverter.
16
Figure12:Sampleappendconverterconfiguration
ReplacestringconverterSampleusage:Replaceabbreviationsinstreetnameswithfullnames(like“Dr.”‐>“Drive”or“Ave.”‐>“Avenue”).Thisconvertercanreplacespecifiedvaluesingiveninputstringwithsomeothervalues.Totryit,opentheconfigurationofthefirstdatasource(theoneweconfiguredtogether).Gotosecondstepoftheconfiguration(clickNextinthefirststep).ClickontheboxthatreadsDOB.ChooseUseconverter‐>Replacestringconverter.MaketheconfigurationofthisconverterconsistentwithFigure13(youwillneedtouse“+”buttonthreetimestoconfigurethereplacementrules,notealsothatyoucanedittheconfigurationbypressingthebuttonbetween“+”and“‐“)andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Closethedynamicanalysiswindowwhenyouaredoneandclick“Cancel”intheconfigurationofAppendconverter.
Replacestringconverteriscasesensitive.Thismeansthatifyourinputvalueis“10FirstST”andtheonlyreplacementruleis“St”‐>”Street”,the“ST.”intheinputvaluewillnotbereplaced.Tohaveitreplaced,youwillneedtoaddsecondreplacementrulethatreads“ST”‐>”Street”.
Replacestringconverteristricky.Itsreplacementrulesareregularexpressions.Forinstance,“.”doesnotmatchdot,butanycharacter.Areplacementrule“A.B”‐>”CD”willnotreplaceonlystrings“A.B”with“CD”butanycharacterinform“AxB”,wherexisanycharacteras“.”inregularexpressionmeansanycharacter.Tocodetheruleaboveproperlyyouwouldneedtousethefollowingsyntax:“A\.B”‐>”CD”(“\”tellstotreat“.”asadot,notasanycharacter).Formoreinformationonregularexpressionsrefertohttp://www.regular‐expressions.info/reference.html.
18
ConfigurationoflinkageOnceyouaredonewiththeconfigurationofthedatasources,youcanmovetoconfigurationofthelinkageprocess.Tostartthelinkageconfiguration,clickonthebuttonnexttothelabelwithnumber3inFigure2.Youshouldseethelinkagewizard.AcceptthedefaultconfigurationforDatastratificationandclick“Next”.NowyouareabouttotellFRILhowyouwanttoidentifymatchesintheinputdatafiles.
Theconfigurationoflinkagestronglydependsontheconfigurationofdatasources.Ifyouhavenotconfigureddatasourcessofar,gobacktotheprevioussection.Otherwiseyouwillnotbeabletofollowthestepsoutlinedinthissection.
ThemostimportantpartinthelinkageconfigurationisconfigurationofsocalledJoincondition.InJoinconditionyouneedtospecifywhichattributesfromthefirstdatasourcewillbecomparedwithwhichattributesfromtheseconddatasources.Basingonthesecomparisons,FRILwillcalculatematchscoreforeachofpairofrecordsthataretested,andwilldecidewhetherthepairisamatchorpotentialmatch.Toconfigureyourfirstjoincondition,followthesesteps:
1. Click“+”locatedinthepanel“Joincondition”2. Youshouldseeanewwindow.In“Leftcolumn”[email protected]“Rightcolumn”click
[email protected]. Nowyouneedtospecifyhowtheattributeschosenaboveshouldbecompared.Youknowthatthe
attributesarenames,andnamestendtohavemisspellingsorerrors.Probablysomecomparisonmethodthattakesthatintoaccountwilldothebestinthiscase.Let’sthenchoose“Editdistance”asa“Distancemetric”.
4. Nowconfigurationpanelforthedistancemetricshouldbeupdated.Detaileddescriptionofthepanelcanbefoundbelow(seesectionDistancemetrics).Atthemoment,let’sjustacceptthedefaultconfigurationoftheeditdistanceconfiguration.
5. Settheweightvalueto40.MakesureyouconfigurationmatchesFigure14.ClickOK.
Theconfigurationofdistancemetric(Figure14)containstheparametercalled“Emptyvaluescore”.ThisparameterallowssettingthebehaviorofFRILwhenatleastoneofinputvaluesisempty.Recallthateachdistancemetricreturnsvaluebetween0.0and1.0,andthevaluerepresentshowsimilararetheinputvalues.Ifanyoftheinputvalues,however,isempty,thesimilaritybetweenthemisdefinedusingthevalueofparameter“Emptyvaluescore”insteadofusingthedistancemetric.Forinstance,ifyousetthisparameterto0.6,thenthescoreformatchinganemptyvaluewithanyothervaluewillalwaysbe0.6.
6. Windowcloses,andnowyoushouldseeonerowinJoincondition.ObservealsotheCurrentsumofweights–thevalueshouldbe40anditshouldbered.ThereasonisthatallweightsinFRILcurrentlyhavetoaddupto100.
7. Click“+”againandnowchooseDate@[email protected]’suseagaintheEditdistance(despitethisisadate).Changethe“Approvelevel”ineditdistanceconfigurationto0.1and“Disapprovelevel”to0.3(seeFigure15).Setweightoftheconditionto30.ClickOK.
8. Click“+”againandnowchooseheight@[email protected]’susetheNumericdistance.
9. Inthenumericdistanceconfiguration,selectoption“Range(percentage)”andinsert10intobothactivetextfields(seeFigure16).Wearesayingthatheightswillbetreatedasnumbers,andthetolerancebetweenvaluescanbeupto10%.
10. Specify15asaweightandclickOK.
19
11. Repeatsteps7‐10forattributesweight@[email protected]. Changethevalueofacceptancelevelto71.13. Click“+”inthe“Outputcolumns”panel.Selectalltheavailablecolumns(left‐clickonfirstoptionin
thelist,thenpressShiftandwiththeshiftpressedleft‐clickonlastoptioninthelist).ClickOK.
Youcanarrangecolumnsinthe“Outputcolumns”asyoulike.Clickonanyiteminthelistandusearrowsontherighttochangetheorderofattributes.Forinstance,itmightbeusefultohavecolumnsthatarebeingcomparedinjoincondition(likenameandpatirnt_name)nexttoeachother.Also,youdonotneedtoaddallthecolumnsto“Outputcolumns”.Forinstance,Date‐MMDDandDate‐YYareonlytemporaryattributesandprobablycouldberemovedfromthelist.
14. YourresultingconfigurationshouldlooksimilartooneinFigure17.
Theconfigurationoflinkage(Figure17)hasabutton“Manualreview”.Usethisoptionifyouwanttomanuallyapproveorrejectcertainlinkages.Onceyouclickonthisbutton,youcanenabletheprocessofmanualreview,andsetthelevelsformanuallyreviewedlinkages.Specifically,youwillseeawindowsimilartotheonepresentedinFigure18.Notethatyoucanslidetheredlinesothatthemanualreviewlevelwillbeadjustedtoyourneeds.
Oncethemanualreviewprocessisenabled,itwillbesignalizedbythegreen“bulb”belowthe“Manualreview”.
15. Click“Next”.16. Inthelaststepofjoinconfigurationchoosethe“Sortedneighborhoodmethod”asajoinmethod
type.17. Intheconfigurationofthisjoinmethod,setwindowsizeto8andmakesurethatthesortorderis
asfollows:name,Date,height,weight(seeFigure19).Ifnecessary,clickonentriesinsortorderandusearrowsontherighttosettheappropriateorder.Checkbothcheckboxesatthebottom(summarycreation).
18. MakesureyourconfigurationmatchesFigure19.Ifso,clickFinishinthelinkagewizard.19. Savetheconfiguration(File‐>Save).
Thearrowsontherightthatallowyoutochangethesortorderinsortedneighborhoodmethodwillbedisabledifyoudonothaveanyrowintheordertableselected.
Formoredetailsondistancemetricsseesection“Distancemetrics”.Descriptionofdifferentjoinmethodsisoutsidethescopeofthistutorial.Fordetailsonthatrefertothepublication“FRIL:AToolforComparativeRecordLinkage”.LinktothepublicationcanbefoundonFRILhomepageinthesectionDocumentation/Publications.
ThesummaryfornotjoineddatamaynotbeavailableforallthesearchmethodsthatareavailableinFRIL.Forinstance,ifyouchooseNestedloopjoinasasearchmethod,onlyonesummarybuttonisavailable.Thisiscausedbydetailsofimplementationofeachofsearchmethods.Simplyput,forsomecasesitisextremelyhardtoimplementefficientlythisfeature(e.g.,withoutextendingsignificantlyatimeofthelinkageorCPU/memoryconsumption).
Atthispointyouhavecompleteaconfigurationofdatasourcesandthelinkageprocess.Youarereadytoproceedtotheconfigurationofresultssavers.
24
Resultsavers/ResultsdeduplicationConfigurationofresultssaversistheeasiestpartoftheconfiguration.Allyouneedtodoistospecifywheretosaveyourdata.Clickonthebuttonnexttothelabelwithnumber9inFigure2.Youshouldseetheresultssaverswizard(seeFigure20).Followthesteps:
1. Clickcreatenewbutton(“+”).2. Anewwindowshouldappear.AsaresultsavertypechooseCSVfiledatasaver.3. Asanoutputfileleave“results.csv”.4. ClickOK.5. Leavethesettingsofresultsdeduplicationfornow(seedetailsbelow)andpressFinish.6. Savethecurrentconfiguration.
Figure20:Resultsaver/Resultsdeduplicationconfiguration
That’sit,youaredonewiththeconfigurationoflinkage.Beforewemoveontothenextsection,however,wewilltakeacloserlookattheoptionofdeduplicationoflinkageresults.
Thefirstquestionweneedtoansweriswhydoyouneedtheresultsdeduplicationatall.It’sallbecauseofduplicatesyoucangetwhenyoulinkyourfiles.Imagineasituationwhereyouhavetworecordsinthefirstfile,XandY.Therecords(let’sassumethedatacontainsinformationaboutpatients)canhavesuchattributesasfirstandlastnames,DOB,heightandweight.Ourtworecords,XandYcouldactuallycontain
25
informationabouttwins.Insuchacase,youcanassumethesamelastnames,thesameDOB,similarheightandweight.Whataboutthefirstnames?Well,theycanalsobeverysimilar,asparentssometimesliketogivetwinssimilarnames.Now,let’sassumethatyouarelinkingthisfileagainstsomeotherfile.FRILcanfindthatbothrecordsXandYmatchtothesamerecord,sayZ,fromthesecondfile.ThereasonisthattheXandYareverysimilarandthustheycanmeetthelinkageconditionandacceptancelevelyouspecified.Intheconsequence,wehavetwolinkages:X‐ZandY‐Z.Asyousee,theZrecordislinkedtwice,andthuswecallthissituationaduplicateinresults.
FRILgivesaneasy‐to‐useoptiontosolvetheproblemdescribedabove.Whatyoucandoistoenableresultsdeduplication.Ifyoudoso,FRILwilllookforsituationsasaboveandwillattempttochooseoneofthelinkages,rejectingthesecondonebeforesavingtheresults.Ifthedecisioncannotbemadeautomatically,thetoolcanaskyouwhattodo.
Toconfiguretheresultsdeduplicationyouhavetochooseoneofthethreeoptionsyouhavein“Resultsdeduplication”inFigure20.Theoptionsyouhavespecifythetypesofduplicatesyoumightwanttolookfor:
1. Youcanrequirethateveryrecordfromthefirstdatasourcewillbelinkedwithexactlyonerecordfromtheothersource.Inthiscase,however,youcanhaveonerecordfromtheotherdatasourcetobelinkedwithmultiplerecordsfromthefirstsource.
2. Youcanrequirethateveryrecordfromtheseconddatasourcewillbelinkedwithexactlyonerecordfromthefirstsource.Inthiscase,however,youcanhaveonerecordfromthefirstdatasourcetobelinkedwithmultiplerecordsfromtheothersource.
3. Youcanrequireboth,e.g.,exactlyonerecordfromthefirstsourcetobelinkedwithexactlyonerecordfromthesecondsource,andexactlyonerecordfromthesecondsourcetobelinkedwithexactlyonerecordfromthefirstsource.
Ifyouhavesomeexperiencewithdatabasesystems,youshouldseethattheoptionsinresultsdeduplicationrelatetotherelationtypesyoucanhavebetweentablesinarelationaldatabase:many‐to‐one,one‐to‐manyandone‐to‐onerelations.
Insomecasesyoumightnotwanttoremoveallduplicates,butyoumightwanttouseoneofthefirsttwooptionsyouhave.Considertwofiles,oneprovidingdataaboutchildrenandtheotherprovidingdataaboutmothers.Itisperfectlyfinetohaveonemotherlinkedwithtwoormorechildren.However,itisnotacceptabletohaveonechildtobelinkedwithtwoormoremothers.
Thesecondoptionthatyouhaveavailableintheconfigurationofresultsdeduplicationis“Deduplicationconflicts”.ThisoptionaddressestheduplicatesthatcannotbedecidedautomaticallybyFRIL.NotethatFRILwillchoosethebestlinkagefromduplicatesbasedonalinkagescore.E.g.,ifyouhavetwoduplicatesasdiscussedabove,X‐ZandY‐Z,andifyourconfigurationprohibitsthistypeofduplicates,thenthededuplicationwillchoosethelinkagewiththehighermatchscore.However,ifthetwolinkageshavethesamescore,anautomaticlegitimatedecisioncannotbemadebyFRIL.Inthiscase,youcanconfigureFRILtodooneofthefollowing:
1. Donothing–thisoptionwillleavebothduplicatesintheresults
2. Chooseonelinkagerandomly–thisoptionwillchooserandomlyoneandrejecttheotherlinkagesbeingduplicates.Onlythechosenlinkagewillbeincludedintheresults.
26
3. Askyouaboutthedecision–inthiscaseFRILwillpopupawindowinwhichyouwillbeabletodecidewhichlinkagesarefalsepositivesandwhicharetruepositives.
Runningthelinkage,linkageconfidenceTorunthelinkage,clickonabuttonnexttothelabelwithnumber8inFigure2.Youshouldseealinkageprogresswindow(seeFigure21).Notethatyoucanclickonthe“arrow”or“chart”buttons.Tryit:
1. Arrowbuttonopensalinkagedebuggingwindow.Asyouopenit,youcanseewhatiscomparedwithwhatinthelinkageprocess.Youcananalyzetheprocessandseeifyoumadeanyerrorsintheconfiguration.Ifapairofrecordsisjoined,itappearsgreen.Ifitisnotjoined,itappearsred.
2. Chartbuttonopensahistogramofthelinkageresults.
Whenlinkagedebuggingisenabled,thelinkageprocessismuchslowerthanwhendebuggingisdisabled.Makesuretodisableitonceyoufeeleverythingrunswell.
Figure21:Linkageprogresswindow
Oncethelinkageprocesscompletes,youshouldhaveinformationonnumberofrecordsthathavebeenlinked(inourcase2972).Onceyouseethatnotification,clickOK.Notethatyoucouldeasilyrerunthelinkagebypressingagainthestartbutton.InyourfilebrowsernavigatetothedirectoryonyourdiskwhereyouhaveFRIL.Nowyoushouldseethreenewfilesinthatdirectory:
1. results.csvThisfilecontainsresultsofyourlinkage.
Nameofresultsfilecouldbedifferentifyouchosedifferentconfigurationofresultsaver(stepnumber3atthebeginningofthissection).
2. minus‐FirstFile.csvListofrecordsfromtheFirstFilesourcethathavenotbeenjoined.
3. minus‐SecondFile.csvListofrecordsfromtheSecondFilesourcethathavenotbeenjoined.
27
Nowlet’sopentheresults.csvfile.UseExceloranyotherspreadsheetapplicationlikeCalcfromOpenOffice.Youcanseethatyourresultsfilecontainslinkages.ThelastcolumnintheresultsfileiscalledConfidence.ThisnumbertellsyouhowconfidentFRILwasaboutgivenlinkage(valuebetween0and100).Sortontheconfidencesothatrecordsareinascendingorder.Youshouldseethattherewasonelinkagewithconfidence75.Observethatthislinkagehasdifferencesindatesandheight.Inspectsomelinkageswithotherconfidencelevels.MovetothenextsectiontoseedetailsonhowFRILcalculatesthoseconfidencescores.
Alternatively,youcanchooseaFRIL‐integratedresultsviewertoseethelinkageresultsandnotjoineddatarecords.Touseit,clickononeofthebuttonswithlabels2,5or10inFigure2.Pleasenotethattheminusviewercanbedisabledifyoudidnotconfigurethelinkageprocesstocreateasummaryofnotjoineddata.TheFRIL‐integratedrecordsviewerispresentedinFigure22.
Figure22:FRILintegratedrecordsviewer
Themaincomponentsoftherecordsviewerincludeatoolbar(topofthewindow),contentsandastatusbar(bottomofthewindow).Themajorelementsare(thenumberscorrespondtothenumbersinlabelsinFigure22):
1. SavebuttonThisbuttonopensadialogthatallowstosaverecordsvisibleintheviewertoafile.Youcaneither
28
savealltherecordsreturnedbythecurrentquery,oronlyrecordsvisibleoncurrentpage.Notethattherecordsviewerdividesrecordsintopages,witheachpageshowing200recordsbydefault.
2. FilterbuttonThisbuttonenablesfilteringofthevisiblerecords.Theconfigurationoffilteringisthesameastheconfigurationoffilteringforadatasourcediscussedinprevioussections.
3. SortbuttonThisbuttonenablesrecordssorting.Youcansortrecordsbyanyattribute(s).Inthecaseoflinkages,youcanalsoenablesortingbasedonconfidenceofthelinkages.
4. DetailsbuttonThisbuttonopensadetailswindow.Thedetailswindowpresentsvaluesofalltheattributesofrecordsintheview.Toseethedetails,youhavetomoveyourmousepointeroverarecordyouwanttoseethedetailsof.
5. PreferencesbuttonThisbuttonopensapreferenceswindow.Youcanconfiguresuchpreferencesasvisibleattributesorcolors.
6. Statusoffilter/sortThiscomponentgivesquickinformationaboutsort/filterbeingenabledordisabled.Ifsortand/orfilteris(are)enabled,thecolorofthecorrespondingdotturnstogreen.
7. Forward/backwardbuttonsandcurrentpagenumberThebuttonsinthiscomponentallowtomovebetweenthepagesofrecords.
Notethatyoucanopenalinkagesviewwindowandnotjoinedrecordsviewwindow(s)atthesametime.Thismighthelpinyouranalysisofthelinkageresults.
Howconfidenceleveliscalculated?CalculationofaconfidencescoreinFRILissolelybasedonconfigurationyouprovided.RefertoFigure17.Youalreadyknowthatthesumofallweightsinjoinconditionhastobeequalto100.Observethatwehave4sub‐conditions:fornames,dates,heightandweight.Theweightsforeachofsub‐conditionsare40,30,15and15,respectively.InFRIL,eachofdistancefunctionsreturnsvaluebetween0and1.Tocomputethetotalscoreforcomparedpair,FRILmultipliestheweightofeachsub‐conditionbythevaluereturnedbychosendistancefunction.Then,thosenumbersareaddedtogethertocalculateatotalscoreforgivenpairofrecords.IfthetotalscoreisequalorgreaterthantheApprovelevel(refertoFigure17),thepairofrecordsisconsideredtobeamatchandissavedintheresultsfile.Tofindmoreinformationontheprocessofscorecalculation,pleaserefertopublications[1,2].
Togiveanexample,let’sconsiderwehavethefollowingrecordfromsourceFirstFile:
Name=JohnSmith,Date=101182,Height=80,Weight=180
Let’sconsideranotherrecordfromsourceSecondFile:
Patient_name=JohnSmith,DOB=101182,Height=85,Weight=160
29
Now,let’sassumeagainthatconfigurationofFRIListhesameastheonepresentedinFigure17.Thenamesanddatesarecomparedusinganeditdistance,andasthosevaluesarethesame,theeditdistancereturns1.Valuesofheightinbothrecordsaredifferent.However,tocomparetheheightsweareusingaNumericdistancewithparameters+/‐10%(refertoFigure16).Inthatcase,theallowablerangeforheightisfrom(80–10%of80)to(80+10%of80).Thisgivesarangeof72‐88.ThevalueofheightinthecomparedrecordfromtheSecondFilefallswithintherangeabove,andthescorereturnedbythenumericdistanceforheightwillbe(88‐85)/(88‐80)=3/8=0.375.Asimilarsituationismetforweight:anumericdistancewith+/‐10%isusedforcomparison.Thistime,however,allowablerangeis162‐198(180+/‐10%),andintheconsequenceofcomparingthevalue180with160thedistancefunctionreturns0.Thetotalmatchscoreofthosetworecordswillthusbe(rememberthevaluesofourweights):1*40+1*30+0.375*15+0*15=75.6≈76
Thescoresyouaregettingcanbeinfluencedbyvalueoftheparametercalled“Emptyvaluescore”ifyourdatacontainsemptyvalues.
Thenextsectiongivesmoredetailsonalltheavailabledistancemetrics.
30
DistancemetricsDistancemetricsareoneofthemostimportantcomponentsavailableinFRIL.Byspecifyingdistance
metricbetweentwoattributes,youareprovidinginformationonhowtheattributesshouldbetreated(asstrings,numbers,dates)andwhatareallowablediscrepanciesbetweenvaluesoftheattributes.Rememberthateachdistancefunctionmapssimilaritybetweenvaluesofattributesontovalueintherange[0…1],where0means“arenotsimilarintermsofthisdistancefunction”and1means“areexactlythesameintermsofthisdistancefunction”.Inthereminderofthissectionwewilldescribeinmoredetailsthe6coredistancefunctions.
Equalfieldsbooleandistance
Inputvaluesaretreatedasrawstrings.Thisdistancefunctionreturnsonlytwovalues,0or1.0isreturnedifcomparedvaluesaredifferent,1ifareexactlythesame.Noadditionalconfigurationisrequired.
Examples:
Firstvalue Secondvalue Score“ABC” “ABC” 1“ABC” “ABB” 0“1.0” “1” 0“1.0” “1.0” 1“10/12/08” “10/12/2008” 0“10/12/2008” “10/12/2008” 1
Editdistance
Inputvaluesaretreatedasrawstrings.Thedistancetestshowmanyoperationsneedtobeappliedtothefirststringsothatitisconvertedtothesecondstring.Possibleoperationsincludedeletingacharacter,insertingacharacterorswitchingtwocharactersthatarenexttoeachother.Youneedtospecifyapprovelevelanddisapprovelevelwhenthisdistancemetricisused(pleaseseeFigure23).Approveanddisapprovelevelintheuserinterfacearealsoboundtothesmallplotwithaccept/rejectvalues.Onceyouchangeeitherofthevaluesinthetextfields,theplotisupdated,andonceyoudragthelineontheplot,thevaluesintextboxesarealsoupdated.Bothparametersthatneedtobeconfiguredcanhavevaluesfromrange[0…1].Thescoreofeditdistancefunctioniscalculatedusingthefollowingformula(strAisthefirststring,strBisthesecondstring,e(strA,strB)istheeditdistancebetweentwostrings,aisanapprovelevelanddisadisapprovelevel):
€
score(strA,strB) =
0,if e(strA,strB) > d *max(length(strA),length(strB))1,if e(strA,strB) < a*max(length(strA),length(strB))d *max(length(strA),length(strB)) − e(strA,strB)
(d − a) *max(length(strA),length(strB)), otherwise
31
Figure23:Editdistanceconfiguration
Examplesofscorevaluewhenaneditdistanceisused:
Firstvalue Secondvalue e(strA,strB) Approvelevel
Disapprovelevel
Score
“A” “A” 0 0.1 0.3 1“A” “B” 1 0.1 0.3 0“ADAM” “ADAMS” 1 0.1 0.3 0.5“JACOBDOBBS” “JAKOBHOBBS” 2 0.1 0.3 0.59“JASPERCISNEROS” “ADENCISNEROS” 4 0.1 0.3 0.17“BREANNAROBISON” “BRENNAROBINSON” 2 0.1 0.3 0.83
Don’tforgetthatyoucanusethemagnifyingglasstoseehowcurrentconfigurationofeditdistancecalculatesscores.Withtheanalysiswindowopen,youcanchangetheparametersandtheresultspresentedinthedynamicanalysiswindowwillbeupdated.
JaroWinklerdistance
Inputvaluesaretreatedasrawstrings.ThisdistanceusesamethodproposedbyWinkler.Itissuitableforshorterstrings,suchasnames.
Thevalueofthisdistancemetriciscalculatedasfollows:
€
score(s1,s2) =max(100,d j + lp(1− d j ))
wheredjisthevalueofJaro‐Winklerdistancebetweenstringss1ands2(seebelow),listhelengthofcommonprefixbetweenstringsandpisascalingfactor.ThelandpareparametersthatcanbesetinFRIL‐thesystemusesscaleptogivemorefavorableratingstostringsthatmatchfromthebeginningforasetprefixlengthl.Thedefaultvalueforpis0.1andthevalueforlshouldnotexceed4.
TheJaro‐Winklerdistanceisdefinedas:
32
€
d j =13( m| s1 |
+m| s2 |
+m − tm
)
wheremisanumberofmatchingcharacters,andtisthenumberoftranspositions.
Twocharactersfroms1ands2arematching,iftheyarenotfartherthan
€
max(| s1 |,| s2 |)2
−1positions
apart.Thenumberofmatching(butdifferentsequenceorder)charactersdividedby2definesthenumberoftranspositions.
Examples(p=0.1andl=4):
s1 s2 Matchingstringfroms1
Matchingstringfroms2
m t dj score(s1,s2)
“CRATE” “TRACE” “RATE” “RACE” 3 0 0.867 0.867“DWAYNE” “DUANE” “DANE” “DANE” 4 0 0.223 0.84
Qgramsdistance
Inputvaluesaretreatedasrawstrings.Thedistanceoperatesonso‐calledq‐grams,substringsfrominputstringsoflengthq.Bothinputstringsarefirstdividedintoq‐grams.Thisoperationproducestwosetsofso‐calledgrams.Next,acountofitemsthatdonotappearintheintersectionofthosetwosetsiscalculated.Theresultingnumberhasanimpactonthescore.Configurationofthisdistancerequiresspecificationofq,approvelevelanddisapprovelevel(seeFigure24).Approveanddisapprovelevelvalueshaveasimilarmeaningasintheeditdistance.Thefinalscoreiscalculatedasfollows(strAisthefirststring,strBisthesecondstring,gramsreturnsq‐gramsofgivenstring,diffreturnsacountofitemsthatdonotappearintheintersectionofsets,aanddareapproveanddisapprovelevels,respectively):
€
score(strA,strB) =
1, if diff (grams(strA),grams(strB)) < round(a* (| grams(strA) | + | grams(strB) |))0, if diff (grams(strA),grams(strB)) > round(d * (| grams(strA) | + | grams(strB) |))
1− diff (grams(strA),grams(strB))− round(a* (| grams(strA) | + | grams(strB) |))round(d * (| grams(strA) | + | grams(strB) |)) − round(a* (| grams(strA) | + | grams(strB) |))
otherwise
33
Figure24:Qgramsdistanceconfiguration
Examples(allexamplesareforq=3):
Firstvalue Secondvalue Firstvaluegrams Secondvaluegrams
Approve/Disapprovelevel
Score
“WILLIAMYOUNG”
“WILLIAMYOUNG”
WIL,ILL,LLI,LIA,IAM,AM,MY,YO,YOU,OUN,UNG,NG,G
WIL,ILL,LLI,LIA,IAM,AM,MY,YO,YOU,OUN,UNG,NG,G
0.3/0.1 1
“DONOVANBLACK”
“DONOVANPACK”
DON,ONO,NOV,OVA,VAN,AN,NB,BL,BLA,LAC,ACK,CK,K
DON,ONO,NOV,OVA,VAN,AN,NP,PA,PAC,ACK,CK,K
0.3/0.1 0.2
“AJYHUDDLESTON”
“JAYHUDDLESTON”
AJY,JY,YH,HU,HUD,UDD,DDL,DLE,LES,EST,STO,TON,ON,N
JAY,AY,YH,HU,HUD,UDD,DDL,DLE,LES,EST,STO,TON,ON,N
0.3/0.1 0.8
Notethattheq‐gramsdistanceworkswithsets,andorderofgramsdoesnotmatter.Thismeansthatthedistancewillgiveveryhighscoretothepairsofstringslike“MICHAELADAMS”and“ADAMSMICHAEL”.Therefore,thisdistanceminimizeserrorsduetoswitching,forinstance,first,secondandlastnames.
Don’tforgetthatyoucanusemagnifyingglasstoseehowcurrentconfigurationofeditdistancecalculatesscores.Withtheanalysiswindowopen,youcanchangetheparametersandtheresultspresentedindynamicanalysiswindowwillbeupdated.
34
Soundex
Inputvaluesaretreatedasrawstrings.ImplementstheSoundexalgorithmtocalculateascorebetweentwostrings(usuallynames).Requiresthefollowingparameters:soundexlength,approveanddisapprovelevels(seeFigure25).Soundexlengthprovideslengthofsoundexcodethatwillbecomputed.Editdistanceisusedtocomparethesoundexcodesoftwostrings,andapprove/disapprovelevelsaretheparametersofthiseditdistance.Notethatdefaultvaluesof0forapproveanddisapprovelevelsmeanthatonlysoundexcodesthatareexactlythesamewillgetanon‐zeroscore.FormoreinformationonSoundexseehttp://en.wikipedia.org/wiki/Soundex.
Figure25:ConfigurationofSoundexdistance
Numericdistance
Inputvaluesaretreatedasnumbers(realvalues).Thisdistanceallowsuserstospecifyarangeofvaluesthatwillhaveanon‐zeromatchscore.Twooptionsforspecifyingtherangeareavailable(seeFigure26):
1. RangeasfixedvalueRangeisspecifiedusingfixedvalue.Forinstance,onecansaythatarangewillhaveradiusof2.Then,ifaninputvalueis10,therangewillbe[8…12].
2. RangeaspercentageofvalueRangeasapercentageofinputvalue.Forinstance,onecansaythattherangewillhavearadiusof30%ofinputvalue.Then,ifaninputvaluewas10,theresultingrangewouldbe[7…13].
Notethatyoucanspecifydifferentvaluesforleftandrightsideoftherange.
Besidesthespecificationofarange,usercanspecifywhetherthescoreshouldbecalculatedusinglinearapproximation.Iflinearapproximationisdisabled,thefinalscoreiscalculatedusingthefollowingformula(lowerBoundandupperBoundreturntheboundsoftherange):
€
score(v1,v2) =1, if v2∈ [lowerBound(v1)...upperBound(v1)]0 otherwise
35
Ifthelinearapproximationisenabled,thefollowingformulaisused:
€
score(v1,v2) =
upperBound(v1) − v2upperBound(v1) − v1
if v2∈ [v1...upperBound(v1)]
v2 − lowerBound(v1)v1− lowerBound(v1)
if v2∈ [lowerBound(v1)...v1)
0 otherwise
Figure26:Configurationofnumericdistance
Examples:
Firstvalue Secondvalue Linearapproximation
Lowerbound Upperbound Score
10 9 On 6(value‐4) 12(value+2) 0.7510 9 Off 6(value‐4) 12(value+2) 110 11 On 6(value‐4) 12(value+2) 0.510 11 Off 6(value‐4) 12(value+2) 110 12 On 6(value‐4) 12(value+2) 050 48 On 40(value‐20%) 60(value+20%) 0.850 54 On 40(value‐20%) 60(value+20%) 0.6
Datedistance
Inputvaluesaretreatedasdates.ThisdistanceworksverysimilartotheNumericdistance.Insteadofworkingwithnumbers,however,ittreatsinputvaluesasdates.Whenrangesarecomputed,numberofyears,days,hours,minutesorsecondsbetweendatescanbespecified.Optionoflinearapproximationisavailable(thisoptionworksinthesamewayasforthenumericdistance).Inaddition,userhastospecifyaformatofinputdates(seeFigure27).Notethatthedatedistancewillsuggestanappropriateformat.Theformatcanusethefollowingcharacters:
Character Meaning Examplevalue
36
y Year 1996;96
M Monthinyear July;Jul;07
w Weekinyear 27
W Weekinmonth 2
D Dayinyear 189
d Dayinmonth 10
F Dayofweekinmonth 2
E Dayinweek Tuesday;Tue
a Am/pmmarker PM
H Hourinday(0‐23) 0
k Hourinday(1‐24) 24
K Hourinam/pm(0‐11) 0
h Hourinam/pm(1‐12) 12
m Minuteinhour 30
s Secondinminute 55
S Millisecond 978
z Timezone PacificStandardTime;PST;GMT‐08:00
Z Timezone ‐0800
Sometimesyoumightnotwanttouseeditdistanceforattributesthataredates(likewedidinthistutorial).Considercaseswhenhumanscopydatesfrompaperformstoacomputer.Digits1and7lookfamiliar,sothedate10/22/2007couldbetypedinas10/22/2001.Inthosecases,datedistancewillnotperformthebest.Probablybetterchoiceisaneditdistancethatcancapturesucherrors.
37
Figure27:Configurationofdatedistance
Ifformatofdateisnotspecifiedcorrectly,thedatedistancewillnotworkinpredictableway.Usedynamicanalysiswindowtomakesurethattheconfigurationisappropriate.
38
DatadeduplicationDatadeduplicationallowsidentifyingduplicatesingivendatasource.Aduplicateisdefinedastwoormorerecordsthatsatisfyduplicateconditionconfiguredbyauser.TherearetwoaspectsofthedatadeduplicationinFRIL:
1. DeduplicationofdatasourcebeforelinkageDeduplicationbeforelinkageallowsonetoconfigurededuplicationrulesinthelinkageconfiguration.Inthatcase,beforetheactuallinkageisperformed,FRILwilldeduplicateselecteddatasources.Thededuplicationistransparentforuser,andisanintegralpartofthelinkageprocess.Theconfigurationofthededuplicationofdatasourcebeforelinkageisperformedduringthelaststepoftheconfigurationofadatasource.
2. DeduplicationasaseparateprocessFRILcandeduplicatedatasourcewithoutperforminganylinkage.Inthatcase,ausercanchooseadeduplicationmodewhereonlyoneinputfileisrequired(toswitchbetweenmodesuseModemenu).Oncetheconfigurationiscomplete,ausercanrundeduplication.Asaresult,FRILwillsavededuplicateddataand,ifenabledduringtheconfiguration,afilecontainingidentifiedduplicateswillbegenerated.AscreenshotofthededuplicationmodeispresentedinFigure28.Notethattheconfigurationofdeduplicationmodeisquitesimilartotheconfigurationoflinkage,anditrequiresdatasourceconfiguration,deduplicationconfigurationandresultsaverconfiguration.Thedetailsofeachstepoftheconfigurationarediscussedbelow.
Theconfigurationofdeduplicationinbothcasesdescribedaboveisidenticalandusesthesameuserinterface.Toconfigurethededuplicationinadatasourcebeforelinkage,enabledatasourcededuplicationinthelaststepofwizardandpressthePreferencesbutton.Toconfigurededuplicationinthededuplicationmode,press“Configure”buttoninthededuplicationboxinthescreenshotpresentedinFigure28.
39
Figure28:DeduplicationmodeinFRIL
Theconfigurationofdeduplicationmode,asmentionedbefore,hasthreemajorsteps:aconfigurationofdatasource,aconfigurationofdeduplicationprocessandaconfigurationofresultsaver.Belowwedescribedetailsofeachofthesesteps:
1. DatasourceconfigurationDatasourceconfigurationinthedatadeduplicationmodefollowsthesamestructureasadatasourceconfigurationforlinkage.Thesameapproachanduserinterfaceareused.Inthelaststepoftheconfiguration,userscanchoosewhethertofiltertheinputdataandwhethertogenerateasummaryofaduplicatedata.Notethatyoucannotdisablethededuplicationofadatasource,asthiswouldcontradictwiththepurposeofadeduplicationmode.AscreenshotpresentingthelaststepofdatasourceconfigurationindeduplicationmodeispresentedinFigure29.
2. DeduplicationconfigurationInthisstepyouneedtoconfigurearulethatwillidentifyduplicates.Theconfigurationhastwosteps:aconfigurationofcomparedattributesandaconfigurationofblockingattributeusedfordeduplication.AscreenshotofthefirstofconfigurationstepsispresentedinFigure30.Asyoucansee,similarlyasforthelinkageconfiguration,youneedtoidentifyaconditionthatwillbeusedforduplicatesidentification.Theconditioncanuseanyattributesprovidedbydatasourceand
40
selectedasavailableinFRIL.Notethat,unlikeforthelinkageconfiguration,youarechoosingonlyattributesfromonedatasourceasonlyonedatasourceisavailable.Whenaddingnewattributestothecondition,youcanchooseanyoftheavailabledistancemetrics(seethesectionDistancemetricsforadetaileddescriptionofthedistancemetricsavailableinFRIL).Onceyouaredonewiththefirststepoflinkageconfiguration,youcanproceedtothenextone(Figure31).Inthesecondstepyouneedtoconfigureablockingvariable.NotethatFRILwilllookforduplicatesonlyamongtherecordsthatendupinthesameblock.Therefore,youneedtothinkwhatisthebestwaytoblockrecordsinordertoidentifyasmanyduplicatesaspossible.Somegoodideasforblockingvariableare:soundexcodeoflastname,birthyear(orbirthdate)orZIPcode.
Theavailableoptionsforblockingmethoddependondistancemetricusedforchosenblockingattribute.IfEqualbooleandistance,DatedistanceorNumericdistancemetricisused,suchanattributewillonlybeallowedtobuildblocksthathavethesamevalueofattribute.IfEditdistance,Jaro‐Winklerdistance,Q‐gramsdistance,SoundexorStreetaddressdistanceisused,onlytheothertwooptionswillbeenabled.
3. ResultsaverThisisthelaststepofdeduplicationconfigurationinadeduplicationmode.Notethatthisstepisnotavailableindeduplicationofdatasourcebeforelinkage,asthededuplicateddataisnotsavedintoanyfile,butratherusedasaninputofthelinkageprocess.Inthisstepallyouneedtodoistochooseafilenamewherethededuplicatedresultswillbesaved.
Figure29:Laststepofdatasourceconfiguration(Deduplicationmode)
41
Soundexcodeforblockingvariableisavailableonlyforattributesthatarecomparedusingastringdistancemetric(e.g.,editdistanceorsoundexdistance).
Usealwaysanappropriatedistancemetricforgivenfieldtype.Ifyouarecomparingnumericvalues(likeweightorheight),useanumericdistance.Ifyouarecomparingnames,useoneofthestringdistancemetrics.Thiswillallowyoutofindmoreduplicates,forinstancethosecontainingmisspellingsorerrorsinmeasurements.
Howtheduplicatesareidentifiedgiventhededuplicationcondition?ItisveryimportanttounderstandwhatrecordsareconsideredduplicatesbyFRIL.Onceyouprovideyourduplicatesidentificationcondition,tworecordsareconsideredduplicatesifthesumofproductsofdistancemetricvalueforcomparedattributebetweengiventworecordsmultipliedbyweightfortheattributeisgreaterthantheduplicateacceptancelevel.Notethisisverysimilartohowthelinkagesareidentified(seesectionHowconfidenceleveliscalculated?formoredetails).Inthededuplicationyouonlyhave,however,datafromonedatasource.
Emptyvaluescorehasthesameimpactforfindingduplicatesasforfindinglinkages.Refertosection“Howconfidenceleveliscalculated?”formoredetails.
Figure30:Firststepofdeduplicationconfiguration
43
References1. P.Jurczyk,J.J.Lu,L.Xiong,J.D.CraganandA.Correa,“FRIL:AToolforComparativeRecord
Linkage”,AmericanMedicalInformaticsAssociation(AMIA)2008AnnualSymposium2. P.Jurczyk,J.J.Lu,L.Xiong,J.D.CraganandA.Correa,“Fine‐grainedrecordintegrationandlinkage
tool”,Birthdefectsresearch,PartA,Clinicalandmolecularteratology,2008Nov;82(11):822‐9.