ReMaDDer Software Tutorial (v2.0) - Fuzzy Match Record Linkage And Data Deduplication

Upload: matalab

Post on 06-Jul-2018


  • 8/17/2019 ReMaDDer Software Tutorial (v2.0) - Fuzzy Match Record Linkage And Data Deduplication


    Homepage: http://ReMaDDersoft.wix.com/ReMaDDer  

    ReMaDDer Software Tutorial

    How to use ReMaDDer software for successful records matching, data cleansing and data deduplication projects

    11/20/2016 

    Revision 2.0. 



    Table of Contents

    Introduction
      What Is ReMaDDer Software
      Fuzzy Match
      Records Linkage
      Data Deduplication
      ReMaDDer Software Advantages
      Prerequisites
      Revision History
    Projects
      Projects Page
      Concept of “Left” and “Right” Dataset
      Record Matching Project vs. Data Deduplication Projects
      Copy A Project
    Raw Data Import
      “Left” and “Right” datasets
      Import Raw Data
        Browse And Choose CSV files
        Register CSV Files
        Determine And Convert CSV File To UTF-8
        Edit Raw Datasource Schema Information
        Pre-process Raw Datasource
        Import Data From Raw Datasources
    Solution Definition
      How ReMaDDer performs record linkage and data deduplication
      Solution Definition Header
        Solution Basic Information
        Machine Learning Strictness
        Join Type
        Return Only Best Matching Records
      Solution Definition Details
        Fields Picker
        Solution Constraints
    Solution Execution
      Solution Execution In One Step
      Solution Execution In Two Major Steps
      Solution Execution In Several Minor Steps
    Data Retrieving And Storing
      Execute Resultset Retrieval SQL Query
      Solution Status Info
      Save And Load Resultset
      Review And Edit Resultset
        Resultset Browsing
        Resultset Edit And Review
        Exporting Resultset
    Customize Data Grids
    Customize Splitters
    ReMaDDer Software Trial
    Commercial Release Code Purchase And Activation


    Introduction

    What Is ReMaDDer Software

    ReMaDDer is record linkage and data cleansing software with powerful fuzzy record matching and data deduplication capabilities, based on state-of-the-art machine learning and data processing techniques.

    As a client-server application, ReMaDDer consists of two parts: a client front-end and a server-side part. The client front-end provides a user-friendly graphical interface with intuitive means for project creation, raw data import and solution definition, while the server-side part provides a powerful data processing engine that can solve even the most complex fuzzy match analyses in reasonable time.

    By combining advanced artificial intelligence with clever blocking techniques and multiple string similarity metrics, ReMaDDer provides a unique solution for fully automatic record matching and data deduplication projects.

    Traditionally, fuzzy record matching software requires substantial human intervention, either to provide various parameters and threshold values, or to perform extensive clerical review and supervised machine learning training. A unique property of the ReMaDDer software is that it does not require any such human assistance beyond project definition. There are no thresholds or other input parameters that the user must provide in order to enable the software to distinguish between matches and non-matches. The ReMaDDer software is capable of inferring and learning everything by itself.

    As far as we are aware, ReMaDDer might be the only software currently available that is capable of performing fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of human clerical review. This is accomplished by utilizing various advanced machine learning techniques and approaches.

    The name “ReMaDDer” is an acronym for “Records Matching and Data Deduplication Software”.

    Homepage: http://ReMaDDersoft.wix.com/ReMaDDer  

    Fuzzy Match

    The term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It is used in cases where no unique identifier or exact match relation exists between two sets of data.

    Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below the threshold are considered to be non-matches.


    Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold

    matching percentage set by the application.
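The threshold idea described above can be illustrated with a minimal Python sketch. It is not ReMaDDer's implementation: the field names, the weights and the 0.85 threshold are hypothetical, and the standard library's difflib.SequenceMatcher ratio merely stands in for a real match probability.

```python
from difflib import SequenceMatcher

# Hypothetical field weights: how much each field contributes to the score.
WEIGHTS = {"name": 0.6, "city": 0.4}


def field_similarity(a: str, b: str) -> float:
    """Similarity of two strings in [0, 1], ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(left: dict, right: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-field similarities for a pair of records."""
    total = sum(weights.values())
    return sum(
        weight * field_similarity(left[field], right[field])
        for field, weight in weights.items()
    ) / total


def is_match(left: dict, right: dict, threshold: float = 0.85) -> bool:
    """Pairs scoring at or above the threshold are treated as matches."""
    return match_score(left, right) >= threshold
```

For example, a record with name "Jon Smith" and city "Boston" scores well above the threshold against "John Smith" in "Boston", while an unrelated record falls far below it.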

    Records Linkage

    Record linkage refers to the task of finding records that refer to the same entity across different data sources, i.e. identifying related records in two separate data sets.

    Record linkage is necessary when joining data sets is based on entities that may or may not share a common identifier, as may be the case due to differences in record shape, storage location, and/or curator style or preference.

    There are many business cases where record linkage has to be performed. Some typical examples are

    product price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.

    Data Deduplication

    Data deduplication refers to identifying duplicate records in a dataset and cleansing datasets of redundant information.

    ReMaDDer Software Advantages

    Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic papers. Some researchers even build their own software, but those programs suffer from their complexity and from the need to understand advanced mathematics and algorithms in order to use them. This cannot be expected from an average user who faces a data linkage problem and urgently needs to solve it in a matter of hours or days.

    On the other hand, there are huge corporate entity resolution framework solutions, produced by big

    software companies, oriented towards huge corporate customers. These solutions are often very complex

    and affordable only to big companies and corporate users.

    ReMaDDer places itself in the middle and provides a powerful fuzzy match records linkage solution for mere mortals and regular office users.

    By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints in a visual and intuitive way, all the complexity of the fuzzy match analysis is hidden from the user, who can focus on the business case rather than on technical issues. That is where ReMaDDer software really shines and clearly distinguishes itself from the competition.

    Traditionally, fuzzy record matching software suffers from requiring immense user involvement in project parameterization and clerical review. The user is required either to provide various input parameters and threshold values, or to perform machine learning training and provide examples of matches and non-matches. In both cases, considerable user involvement and expertise is a prerequisite for successful analysis.

    By contrast, the ReMaDDer software does not require such heavy user involvement, since it can figure out optimal parameter values automatically, all by itself. This is accomplished by advanced artificial intelligence utilizing various state-of-the-art machine learning techniques.


    To summarize: utilization of advanced artificial intelligence, accompanied by an intuitive graphical user interface and low pricing, is what makes ReMaDDer a superb fuzzy match records linkage solution.

    Prerequisites

    The major prerequisite for using ReMaDDer is an active internet connection, since the raw data is imported to a remote server where it is processed. After the trial period expires, you are required to purchase a commercial release code in order to continue using the remote server.

    However, project and solution creation and editing can be performed even without an established connection or a purchased release code, since these data are stored locally on your computer.

    The ReMaDDer front-end client is available as an executable for Windows and Linux systems. Executables for other systems can be provided on demand.

    ReMaDDer does not operate directly on original data sources, but requires data to be imported from CSV (comma-separated values) flat files to the server, where the corresponding “left” and “right” database tables are then created and processed. Therefore, you will have to provide the source datasets as flat CSV files, encoded in UTF-8, preferably with comma (“,”) or semicolon (“;”) field separators.

    Revision History

    Revision  Date        Change Description

    1.0.      3/20/2016   Initial release. Tutorial covers ReMaDDer version 1.0.

    1.1.      5/10/2016   Document is updated to reflect changes and improvements brought by ReMaDDer version 1.1.

                          The new version brings many improvements and simplifies solution definition. Instead of separately choosing and defining thresholds for the trigram similarity and Levenshtein distance functions, a new, combined, common similarity function (ReMaDDer_similarity) is introduced that combines both trigram and Levenshtein similarity properties. This reduces complexity and uncertainty in solution definition creation, while retaining ReMaDDer's strengths and advantages.

                          The previous ReMaDDer version output all columns from the left and right datasets into the resultset. Now you can choose which fields are to be included in the resultset.

                          The raw data import process is also much improved, especially regarding importing data from Excel files (in CSV format) where column names contain non-ASCII characters and blanks.

                          There are many small performance improvements and several bugfixes that will improve the user experience when using the ReMaDDer software for data match analysis.

    2.0.      11/20/2016  Document is updated to reflect major changes and improvements brought by ReMaDDer version 2.0. The main changes are:

                          - Instead of using only the Levenshtein and trigram similarity functions, multiple other similarity metrics are added to the server engine.

                          - Matches and non-matches are no longer based on similarity thresholds. Instead, ReMaDDer now utilizes machine learning techniques. Advanced algorithms infer and automatically detect duplicates and record matches.

                          - Threshold parameters are removed as obsolete.

                          - The “Use composite field” parameter is removed as obsolete.

                          - The “Use inclusive OR” parameter is removed as obsolete.

                          - A new parameter, “Machine Learning Strictness”, is introduced. The parameter defines how strictly the artificial intelligence will distinguish between matches and non-matches. The options are: match, strict match and potential match.

                          - A new parameter, “Join Type”, is introduced. The Join Type attribute determines how SQL joins between the left and right tables will be established, via the solution base table. There are three join options: a) inner join, b) left outer join, c) right outer join.
                            The "inner join" option is the default behavior, meaning that the resultset will contain all rows from the left and right datasets which meet the matching criteria.
                            With the "left outer join" option, the resultset will contain all rows from the left dataset and only those rows from the right dataset that satisfy the matching criteria.
                            With the "right outer join" option, the resultset will contain all rows from the right dataset and only those rows from the left dataset that satisfy the matching criteria.

                          - A new parameter, “Return Only Best Match”, is introduced. The parameter can have a True or False value and determines whether the SQL query will return only the best matching record or multiple records satisfying the similarity criteria.
                            Check this option if you wish to return only the best matching record for each left or right record, when using the corresponding left or right outer join.
                            If this option is unchecked (default), multiple matching rows will be returned.
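The join types and the best-match option from the revision 2.0 notes can be illustrated with a small Python sketch. It only mimics the visible behavior described above and is not the SQL that ReMaDDer generates; the function name join_matches and the (left_id, right_id, score) tuples are hypothetical.

```python
def join_matches(left_ids, right_ids, matches, join_type="inner", best_only=False):
    """Build result rows from matched pairs.

    matches: list of (left_id, right_id, score) tuples already judged to match.
    join_type: "inner", "left" (left outer) or "right" (right outer).
    best_only: keep only the highest-scoring partner per outer-side record.
    """
    pairs = list(matches)
    if best_only and join_type in ("left", "right"):
        key_index = 0 if join_type == "left" else 1
        best = {}
        for pair in pairs:
            key = pair[key_index]
            if key not in best or pair[2] > best[key][2]:
                best[key] = pair
        pairs = list(best.values())

    rows = [(left, right) for left, right, _score in pairs]
    if join_type == "left":
        matched = {left for left, _ in rows}
        # Unmatched left records are kept, with no right partner.
        rows += [(left, None) for left in left_ids if left not in matched]
    elif join_type == "right":
        matched = {right for _, right in rows}
        # Unmatched right records are kept, with no left partner.
        rows += [(None, right) for right in right_ids if right not in matched]
    return rows
```

With left records 1, 2, 3 and matches (1, "a", 0.9) and (1, "b", 0.7), an inner join yields two rows, a left outer join adds rows for the unmatched records 2 and 3, and enabling best_only keeps only the pair (1, "a") for record 1.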


    Projects

    Projects Page

    A project is the basic entity in ReMaDDer software. Each project contains the definition of two source datasets to be imported and analyzed (the so-called "left dataset" and "right dataset"), as well as a variable number of corresponding solutions, which are stored definitions of how to perform fuzzy match analysis.

    On creation, each project is assigned a unique project tag. During raw data import to the server, the corresponding input tables get that tag appended to their names. This way, imported tables are always tagged by the project name, which ensures their uniqueness.

    The “Projects” page consists of two sections separated by a movable splitter. In the upper section there is a datagrid view where you can browse and edit projects, while in the lower section there is a form view of the currently selected project. The same concept of datagrids and form views is implemented throughout the application.


     You can easily create new projects, edit and browse existing projects, by using navigator buttons.

    Concept of “Left” and “Right” Dataset

    Throughout the ReMaDDer application and this manual, we use the terms “left” and “right” dataset or table.

    In every fuzzy match project, we always compare two tables, i.e. two datasets, inspecting the similarity of their rows. For convenience, we call them the “left” and “right” table.

    The purpose of entity resolution framework software, such as ReMaDDer, is to identify which records from the “left” dataset correspond to which records from the “right” dataset.

    ReMaDDer does not operate on original data sources directly, but requires data to be imported from source

    CSV (comma separated values) flat files to server, where corresponding left and right database tables are

    then created and processed.

    Record Matching Project vs. Data Deduplication Projects

    In ReMaDDer software, there is no fundamental difference between data deduplication and records matching projects. In both cases we compare two datasets, trying to infer which records from the “left” dataset correspond to which records in the “right” dataset.

    The only difference between the two is that in the case of a records matching project we have two different input datasets to be compared, while in the case of a data deduplication project we have to compare a dataset with itself, in order to identify duplicate records in the dataset.


    Since the ReMaDDer software always compares two datasets (left and right), in the case of a data deduplication project we need to import the same original CSV file twice: first as the left dataset and then as the right dataset. The ReMaDDer software will thus create two identical tables with different names in the underlying database.
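Deduplication as a comparison of a dataset with itself can be sketched in a few lines of Python. This is an illustrative assumption only: a fixed difflib similarity threshold stands in for ReMaDDer's machine learning decision, and the function name find_duplicates is hypothetical.

```python
from difflib import SequenceMatcher


def find_duplicates(records, threshold=0.85):
    """Compare a list of strings with itself and return likely duplicate pairs."""
    left, right = records, records  # the same dataset plays both roles
    duplicates = []
    for i, left_value in enumerate(left):
        for j, right_value in enumerate(right):
            if i >= j:
                # Skip a record compared with itself and mirrored pairs (j, i).
                continue
            score = SequenceMatcher(None, left_value.lower(), right_value.lower()).ratio()
            if score >= threshold:
                duplicates.append((i, j))
    return duplicates
```

Skipping pairs where i >= j matters in the self-comparison case: otherwise every record would trivially match itself and each duplicate pair would be reported twice.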

    Copy A Project

    Instead of manually entering all the parameters for a new project, ReMaDDer allows you to copy an existing project into another project. This action copies raw data import specifications as well as solution definitions.

    Raw Data Import

    Datasets to be analyzed are called "left" and "right" datasets and can be easily imported from source CSV

    files, encoded in UTF-8.

    The CSV file format ("Comma Separated Values") is chosen due to its ubiquity and because databases, spreadsheet editors and most other data sources can easily export to a CSV file.

    The source data CSV files, however, must be UTF-8 encoded. Otherwise, import will most likely fail. Therefore, you must first ensure that the source data CSV files are properly UTF-8 encoded. ReMaDDer has embedded tools for charset encoding detection and conversion, but you can also use Notepad++ (https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other powerful text editors which are capable of performing encoding detection and conversion of files.
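If you prefer to verify the encoding outside ReMaDDer, a strict decode is a simple check. The following Python sketch is an independent illustration, not part of ReMaDDer; the function name is_valid_utf8 is hypothetical.

```python
from pathlib import Path


def is_valid_utf8(path) -> bool:
    """Return True if the whole file decodes as UTF-8 without errors."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8.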

    ReMaDDer provides a simple and intuitive tool for importing CSV files. It automatically detects the field delimiter and column schema information. You can then edit the retrieved schema and finally import the files to the server for further processing.


    “Left” and “Right” datasets

    In each data deduplication or record matching project, we always compare two datasets for matching records. In the case of record matching projects, these two datasets correspond to two different input CSV files, while in the case of data deduplication projects, these two datasets are imported from the same input CSV file.


    Nevertheless, we always have a so-called “left dataset” and “right dataset” to be compared. Think of this like comparing fingers on the left and right hands. You can easily identify the thumb on the left hand as related to the thumb on the right hand, since they share a similar shape. It is obvious due to their physical similarity.

    It is the same with fuzzy match analysis, where we compare fields from the left and right datasets in order to identify string similarities. ReMaDDer internally uses various functions to measure string similarities, the results of which are then processed by artificial intelligence to infer whether two records represent the same entity or not.
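Two of the string similarity measures named in this tutorial, trigram similarity and Levenshtein distance, can be sketched in plain Python as follows. The padding convention for trigrams and the plain average in combined_similarity are assumptions made for illustration; the actual formulas used by the ReMaDDer server are not documented here.

```python
def trigrams(text: str) -> set:
    """Set of 3-character substrings, padded so short strings still yield trigrams."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def combined_similarity(a: str, b: str) -> float:
    """Hypothetical combination: plain average of the two measures."""
    return 0.5 * (trigram_similarity(a, b) + levenshtein_similarity(a.lower(), b.lower()))
```

Trigram similarity is tolerant of reordered words, while Levenshtein similarity reacts to every single-character edit, which is one reason to look at more than one metric.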

    Import Raw Data

    The process of importing raw data into the server database consists of several logical phases. First, we need to identify the source CSV files for the “left” and “right” datasets. After the source files are identified, we need to ensure that the CSV files are properly UTF-8 encoded. Once proper encoding is ensured, we need to retrieve and specify schema information about the CSV files. In the last phase, we actually perform the import from the source files, according to the previously defined schema. The result of the last step is that the source files are imported into the server-side database, where they can be processed according to various solution definitions.

    On the “Data Import” page, there are two sub-pages: “Left Dataset Specification” and “Right Dataset Specification”, in which we separately define input dataset specifications for the “left” and “right” datasets.

    Import can be executed separately for the left and right datasets, or both can be imported in a batch, at once.

    Browse And Choose CSV files

    The first step in importing input CSV files is to choose the CSV files to be imported.

    In the upper part of the “Left Dataset Specification” or “Right Dataset Specification” sub-page, there is a CSV file browser dialog box.

    You can browse CSV files on your computer by clicking the browse button. This opens a file browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.

    Register CSV Files

    Next step is to define CSV file schema specification. We call this process “registering CSV file”. 


    When you click the “Register CSV file” button near the file browser, the browsed CSV file is examined for its columns and its schema information is then inserted into the corresponding list of fields (columns).

    ReMaDDer determines the field delimiter in the CSV file (normally either “;” or “,”) and retrieves information about the columns.

    If a column name has upper case characters, it is converted to lower case.

    Currently, ReMaDDer treats all columns as text fields of various lengths. This is due to the fact that the comparison is performed by using string comparison functions, so other data types (e.g. datetime, integer, real etc.) would not make sense for string comparisons.
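The registration step described above (detect the delimiter, read the column names, lower-case them, treat every column as text) can be approximated with Python's standard csv module. This is a sketch under assumptions, not ReMaDDer's code: the function name register_csv and the shape of the returned schema are hypothetical.

```python
import csv


def register_csv(path, sample_size=4096):
    """Detect the delimiter and derive a text-only schema from a UTF-8 CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        sample = handle.read(sample_size)
        # Only the two separators mentioned in the tutorial are considered.
        dialect = csv.Sniffer().sniff(sample, delimiters=";,")
        handle.seek(0)
        rows = list(csv.reader(handle, dialect))

    header, data = rows[0], rows[1:]
    schema = []
    for index, name in enumerate(header):
        max_length = max(
            (len(row[index]) for row in data if index < len(row)), default=0
        )
        # Every column is registered as text; names are converted to lower case.
        schema.append({"name": name.lower(), "type": "text", "max_length": max_length})
    return dialect.delimiter, schema
```

csv.Sniffer raises csv.Error when it cannot decide on a delimiter, for example for a single-column file, so real code would need a fallback.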

    Determine And Convert CSV File To UTF-8

    In the previous ReMaDDer version, the program detected the encoding and converted the file to UTF-8 automatically during CSV file registration. Although very convenient, this could lead to wrong results, since the encoding detection function is not 100% reliable and sometimes guesses the encoding wrongly. This is due to the fact that charset detection is an inherently difficult task and there is no 100% sure method. It is always a kind of educated guess based on content inspection.
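A short Python sketch shows why detection can only be a guess: the same bytes are valid text in more than one single-byte encoding. The candidate list and the function name plausible_decodings are chosen for illustration only.

```python
def plausible_decodings(raw: bytes, candidates=("utf-8", "latin-1", "cp1251")):
    """Return every candidate encoding under which the bytes decode without error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, so it is ruled out.
            pass
    return results
```

For the bytes of "café" saved as Latin-1, UTF-8 is ruled out, but both Latin-1 and Windows-1251 decode cleanly and give different text. Only knowledge of the content can tell which one is right.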

    Therefore, we decided to remove automatic charset detection and conversion to UTF-8. You will have to do it yourself and ensure that the source files are properly UTF-8 encoded. Charset detection, as well as file encoding conversion to UTF-8, is still present as a ReMaDDer feature (and is even improved), but you will have to trigger it manually with the respective buttons, or by choosing it from the menu.

    Another option is to use the embedded spreadsheet editor “Spready” to open and convert source files.

    Alternatively, you can use various established tools, such as the Notepad++ text editor, that are capable of recognizing file encoding and performing the required conversion to UTF-8.

    Determine And Convert CSV File Encoding, with embedded tool

    After a CSV file is registered as the left or right dataset source, it can be analyzed with the embedded tool for detecting charset encoding.

    When you click the button “Determine Encoding of Left Dataset CSV File” or the button “Determine Encoding of Right Dataset CSV File”, the respective CSV file is analyzed for its encoding type by two different embedded procedures. The result of the encoding analysis is displayed in a corresponding pop-up window.


    If both functions agree that the encoding is UTF-8 (utf8), as in the example above, then the CSV file is in

    appropriate format for import.

    However, if the result is not UTF-8, the CSV file must be converted to UTF-8 before importing!

    You can convert the CSV file encoding to UTF-8 by clicking the button “Convert Encoding Of Left Dataset CSV File” or “Convert Encoding Of Right Dataset CSV File”.

    When the conversion action is triggered, ReMaDDer first backs up the original CSV file and then converts the file encoding to UTF-8.
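The same back-up-then-convert sequence can be reproduced outside ReMaDDer with a few lines of Python. This is a minimal sketch, assuming you already know the source encoding; the function name convert_to_utf8 and the ".bak" backup suffix are hypothetical and not taken from ReMaDDer.

```python
import shutil
from pathlib import Path


def convert_to_utf8(path, source_encoding):
    """Back up a file, then rewrite it in place as UTF-8. Returns the backup path."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")  # hypothetical backup name
    shutil.copy2(source, backup)                     # keep the original bytes safe first

    # Decode and re-encode as bytes so that line endings are left untouched.
    text = backup.read_bytes().decode(source_encoding)
    source.write_bytes(text.encode("utf-8"))
    return backup
```

If the stated source encoding is wrong, decode raises UnicodeDecodeError (or silently produces wrong characters for single-byte encodings), which is why the backup is made before anything is overwritten.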

    Determine And Convert CSV File Encoding, with embedded spreadsheet editor “Spready” 

    Besides the above-mentioned embedded encoding detection and conversion tool, ReMaDDer has the embedded “Spready” spreadsheet editor (http://wiki.lazarus.freepascal.org/FPSpreadsheet), which can also be used for file encoding conversion.


    Determine And Convert CSV File Encoding, with external tools

    Charset detection with the embedded tool is not 100% reliable, which is also true for any tool that infers the charset.

    If you encounter difficulties with the embedded charset detection and conversion tools, or you already know the file encoding, you might try various external tools, of which I would recommend the well-established Notepad++ text editor (https://notepad-plus-plus.org/).


    Another interesting alternative is the CudaText text editor (http://uvviewsoft.com/cudatext/), which is also capable of charset detection and conversion.


    Edit Raw Datasource Schema Information

    Once you have retrieved the schema information from a CSV file, you might decide that you don’t want to import all columns, but only a subset of fields.

    You can edit the schema by using the corresponding data grid navigator buttons.

    If you wish to delete the currently selected field from the schema, just click the delete button.

    If you wish to restore the original column schema, just click the “Get Fields Schema” button and the column list will be repopulated from the CSV file.
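
    Conceptually, editing the schema amounts to reading the header row and then keeping only the chosen columns. A minimal Python sketch of that idea, with hypothetical function names and in-memory CSV text instead of ReMaDDer's data grid:

```python
import csv
import io


def read_schema(csv_text):
    """Return the column names from the header row of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


def select_fields(csv_text, keep):
    """Yield one dict per row, restricted to the columns listed in keep."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield {name: row[name] for name in keep}
```

    Calling read_schema again on the unchanged file always returns the full column list, which mirrors what the “Get Fields Schema” button does.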

    Pre-process Raw Datasource


    While defining the import schema specification, you might realize that the input data need some pre-processing before being imported to the server for further analysis.

    Of course, you can edit input CSV files with any spreadsheet editor (such as LibreOffice or OpenOffice Calc, Gnumeric or Microsoft Excel) or text editor (such as Notepad, Notepad++, ConText, Gedit, CudaText, Geany or Leafpad), but you can also use the embedded spreadsheet editor “Spready”.

    You can launch the external default spreadsheet editor by clicking the button “Open CSV File in Ext. Editor”.

    You can launch the embedded spreadsheet editor by clicking the button “Open CSV File In Int. Editor”. This will open the embedded spreadsheet editor “Spready” (http://wiki.lazarus.freepascal.org/FPSpreadsheet).


    Import Data From Raw Datasources

    The final step in source data import is executing the import procedure, either by clicking the appropriate button or by triggering the action from the respective menu.


    We can execute the import separately for the left and right datasets by clicking the corresponding buttons, “Import left dataset CSV file” or “Import right dataset CSV file”, or import both at once by clicking the button “Import both CSV files to server”.

    When you click the import button, ReMaDDer will automatically open the “Import Log” page, where you can watch the progress of the import.


    Import speed depends on the file size and, most importantly, on the quality of the internet connection.

    Solution Definition

    A solution definition is the set of parameters for performing record linkage or data deduplication analysis. Each project can have many solutions with different specifications, so you can test which combination of parameters leads to the best results.

    Each solution definition consists of a solution header specification and a solution constraints specification.

    The solution header specification contains general info about the solution and defines important parameters which determine how record matching analysis will be performed. These parameters are: “machine learning strictness”, “join type” and “return only best match”.

    The solution constraints specification consists of: the exact match relations section, the fuzzy match relations section and the other constraints section.

    Solution definition page (page “Record Matching Analysis”, sub-page “Solution Definition”):

    As with other pages, the “Solution” page is divided into two sections: datagrid view and form view.

    For a better user experience, the form view is additionally divided into several tabs and sub-tabs. The main tabs are: “Solution Definition” and “Solution Result”.

    The “Solution Definition” tab is further divided into: “Solution Header”, “Solution Fields Picker” and “Solution Constraints”.


    “Solution Header” tab is divided into several sub-tabs: “Common”, “Solution Base Table Creation Query

    Info” and “Solution Resultset Retrieval Query Info”.

    “Solution Constraints” tab is divided into sub-tabs: “Exact Match Constraints”, “Fuzzy Match Constraints”

    and “Other Constraints”.

    How ReMaDDer performs record linkage and data deduplication

    For each project we can define one or more solutions. A solution consists of a solution definition and a solution resultset.

    The solution definition is a specification which instructs ReMaDDer how to perform record linkage or data deduplication analysis in order to retrieve the resultset.

    We can define three types of solution constraints: exact match constraints, fuzzy match constraints and other constraints.

    Fuzzy match constraints define field pairs from the left and right datasets to be compared for fuzzy string similarity. To infer record similarity, ReMaDDer utilizes various string similarity metrics, along with powerful machine learning algorithms.

    Advanced artificial intelligence automatically infers record linkage or duplicates and creates the solution base table.

    The final step is resultset retrieval, in which the database engine creates and executes an SQL query which joins the left and right datasets with the solution base table, outputting the resultset. The retrieved resultset can be exported to a spreadsheet or flat file.

    Solution Definition Header

    The solution definition header contains general solution definition parameters and info about solution execution status.

    Solution definition header (whole page):


    Solution definition header (datagrid view):

    Solution definition header (form view):


    The solution definition header can be entered either through the datagrid or through the form view, which shows the currently selected solution.

    Solution Basic Information

    Basic information about a solution is shown in the fields: “Solution Name”, “Solution Tag”, “Solution Base Table Name”, “Tag Assigned”, “Solution Status” and “Solution Comment”.

    Solution Tag is an automatically generated designation which is appended to each solution name by default and is also used in forming the Solution Base Table name.

    Solution Base Table Name is automatically formed from the Solution Name and the Solution Tag. The Solution Tag ensures the uniqueness of the created solution base table on the server.

    Solution Status and Solution Comment are fields in which the user can enter additional arbitrary information.


    Machine Learning Strictness

    The parameter “Machine Learning Strictness” defines how strictly the artificial intelligence will distinguish between matches and non-matches during fuzzy matching. Possible values are: a) match, b) strict match, c) potential match.

    "Match" is the default option. The retrieved resultset will contain a balanced ratio of true positives to false positives. It tends to include all true positives, with some false positives and very few false negatives.

    "Strict match" is the strictest option. The resultset will tend to contain only true positives, but due to a higher incidence of false negatives, it might fail to recognize some matches.

    "Potential match" is the weakest option. The resultset will tend to contain all true positives, but many false positives as well.

    Join Type

    The “Join Type” attribute determines how SQL joins between the left and right tables will be established, via the solution base table. There are three join options: a) inner join, b) left outer join, c) right outer join.


    The "inner join" option is the default, meaning that the resultset will contain only those rows from the left and right datasets which meet the matching criteria.

    With the "left outer join" option, the resultset will contain all rows from the left dataset and only those rows from the right dataset that satisfy the matching criteria.

    With the "right outer join" option, the resultset will contain all rows from the right dataset and only those rows from the left dataset that satisfy the matching criteria.
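
    The three join types can be illustrated with a small in-memory SQLite database. The table and column names (left_ds, right_ds, base, left_id, right_id) are hypothetical stand-ins chosen for this sketch; the SQL that ReMaDDer actually generates is not documented here and will differ. The right outer join is written as a left join starting from the right table, which gives the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_ds  (id INTEGER, name TEXT);
    CREATE TABLE right_ds (id INTEGER, name TEXT);
    CREATE TABLE base     (left_id INTEGER, right_id INTEGER);  -- matched pairs
    INSERT INTO left_ds  VALUES (1, 'Acme Ltd'), (2, 'Globex');
    INSERT INTO right_ds VALUES (10, 'ACME Limited'), (20, 'Initech');
    INSERT INTO base     VALUES (1, 10);
""")

QUERIES = {
    # Only rows that take part in a match.
    "inner": """
        SELECT l.name, r.name FROM base b
        JOIN left_ds l ON l.id = b.left_id
        JOIN right_ds r ON r.id = b.right_id
        ORDER BY l.id""",
    # Every left row, with the matching right row or NULL.
    "left outer": """
        SELECT l.name, r.name FROM left_ds l
        LEFT JOIN base b ON b.left_id = l.id
        LEFT JOIN right_ds r ON r.id = b.right_id
        ORDER BY l.id""",
    # Every right row, with the matching left row or NULL.
    "right outer": """
        SELECT l.name, r.name FROM right_ds r
        LEFT JOIN base b ON b.right_id = r.id
        LEFT JOIN left_ds l ON l.id = b.left_id
        ORDER BY r.id""",
}


def run_join(join_type):
    """Run one of the three example queries and return its rows."""
    return conn.execute(QUERIES[join_type]).fetchall()
```

    Here run_join("inner") returns only the matched pair, while the two outer variants also return the unmatched 'Globex' or 'Initech' row with None on the other side.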

    Return Only Best Matching Records

    The parameter “Return Only Best Match” can be True or False and determines whether the SQL query will return only the best matching record or multiple records satisfying the similarity criteria. It is used as a modifier to a left outer join or right outer join.

    If this option is unchecked (default), multiple matching rows will be returned. If it is checked, only the best matching item from the slave dataset will be joined to the corresponding record in the master dataset.

    Check this option if you wish to return only the best matching record for each left or right record, when using left or right outer joins and the datasets are in a master/slave relation.

    With the “inner join” join type, this parameter has no meaning and is ignored.

    A typical use case for a left or right outer join with the “Return Only Best Match” option is matching two product price lists, one of which is the master list.
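
    The effect of “Return Only Best Match” can be sketched as keeping, for each master record, only the candidate with the highest similarity score. The numeric scores and the tuple layout are assumptions made for illustration; how ReMaDDer ranks candidates internally is not described in this tutorial.

```python
def best_matches(scored_pairs):
    """Keep, for each master record, only its highest-scoring candidate.

    scored_pairs is an iterable of (master_id, other_id, score) tuples.
    """
    best = {}
    for master_id, other_id, score in scored_pairs:
        if master_id not in best or score > best[master_id][1]:
            best[master_id] = (other_id, score)
    return {master: other for master, (other, _) in best.items()}
```

    For a price list, this turns "product p1 resembles items a and b" into "product p1 is item b", which is usually what a master list needs.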

    Solution Definition Details

    While the solution definition header defines general parameters for performing fuzzy match analysis, solution definition details are set in the Field Picker sub-page and in the Solution Constraints sub-page, which has three sections defining solution constraints: the Exact Match Relations section, the Fuzzy Match Relations section and the Other Constraints section.


    Fields Picker

    ReMaDDer provides a simple yet very powerful visual tool for adding field pairs to the exact match relations section, the fuzzy match section or the other constraints section.

    With the fields of the input datasets ("left" and "right") listed side by side, you can easily browse the two lists, visually establish field pairs and send them to the appropriate constraints definition section by clicking the appropriate button.


    You can add the selected field pair to the exact match section by clicking the button “Add Fields Pair To Exact Match Relations Section”.

    You can add the selected field pair to the fuzzy match section by clicking the button “Add Fields Pair To Fuzzy Match Relations Section”.

    You can add a left or right dataset field to the other constraints section by clicking the respective button.

    By eliminating the need for tedious manual input and letting you build solution constraints visually instead, ReMaDDer simplifies solution definition creation and boosts your productivity.

    Starting from ReMaDDer version 1.1, the checkbox column “Output Field to Resultset?” is added to the Field Picker datagrid. It is used to include fields in, or exclude them from, the resultset. By default, all fields are included in the resultset.


    Solution Constraints

    There are three types of constraints that you can define for a solution: exact match relations, fuzzy match relations and other constraints.

    Exact Match Relations

    In the exact match relations section, we can add field pairs from the "left" and "right" imported datasets and define their equality (=) or inequality (<>).


    If we can define an exact match relation on one or more field pairs, we can tremendously increase the speed of analysis by narrowing down the number of record pair combinations to be analyzed for fuzzy match.

    Therefore, it is recommended to use exact match relations whenever possible.
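
    Why an exact match relation speeds things up can be seen in a short Python sketch: records are paired only when they agree on the exact-match field, so far fewer pairs reach the expensive fuzzy comparison. The record layout and the function name are hypothetical; this illustrates the general idea, not ReMaDDer's server-side code.

```python
from collections import defaultdict


def candidate_pairs(left, right, key):
    """Pair only those left/right records that agree exactly on key."""
    by_key = defaultdict(list)
    for rec in right:
        by_key[rec[key]].append(rec)
    # Each left record meets only the right records sharing its key value.
    return [(l, r) for l in left for r in by_key.get(l[key], [])]
```

    With three left and three right records, an unrestricted comparison covers 3 x 3 = 9 pairs, while requiring the same city may leave only a handful.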

    Fuzzy Match Relations

    In the solution header section we can set various general parameters that determine how fuzzy match analysis will be performed: we can choose the machine learning strictness, the join type and whether all matches or just the best matches will be returned.


    In the fuzzy match relations section we provide the details for fuzzy match comparison analysis. We can list the field pairs to be compared and further define how fuzzy match analysis will be performed.

    Relative Field Weight


    For each field pair that will be compared for similarity, we have to define its relative weight. The bigger the weight, the greater the importance of that field pair’s similarity in the final decision whether two records match or not.

    The weight for a particular field pair is entered as an arbitrary integer value in the field “Field Weight (integer)”, and ReMaDDer then calculates its relative weight. The sum of relative weights is always 1.

    When a new field pair is added to the fuzzy match relations section, it gets the default integer weight of one (1). You can change this value to any bigger integer and ReMaDDer will recalculate the relative weights so that their sum remains 1.

    Notice that there is an additional graphical indicator, which shows the relative weight of the currently selected field pair.

    There are two buttons provided: “Recalculate Weights” and “Reset Weights”.


    The button “Reset Weights” resets all integer field weights to 1, which is the same as not using relative weights at all. In that case, all field pairs are treated as equally important.

    The button “Recalculate Weights” recalculates the relative weights from the integer values entered in the field “Field Weight (integer)”. You don’t need to trigger this action manually, since it is triggered automatically on each change of an integer value and on each field pair addition.
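
    The recalculation itself is simple arithmetic: each integer weight is divided by the sum of all integer weights. The sketch below assumes exactly that rule (the text only states that relative weights sum to 1) and uses exact fractions so the sum is exactly 1; the field names are made up.

```python
from fractions import Fraction


def relative_weights(integer_weights):
    """Divide each integer weight by the total so the results sum to 1."""
    total = sum(integer_weights.values())
    return {pair: Fraction(w, total) for pair, w in integer_weights.items()}
```

    For example, integer weights of 3, 1 and 1 give relative weights of 3/5, 1/5 and 1/5, while the default of 1 for every field pair gives each pair the same share.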

    Other Constraints

    As with exact match relations, it is desirable to limit the analysis to a particular subset of data. Such constraints can greatly increase the speed of record linkage or data deduplication analysis.

    We can define any custom constraint to be applied on a particular field from the "left" or "right" dataset.


    Normally, the condition is: sometable.somefield = ‘some string’, but other operators such as LIKE can be used as well.
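
    Both kinds of condition can be tried out against an in-memory SQLite table. The helper below is a hypothetical sketch: it reuses the placeholder names sometable and somefield from the text, and it says nothing about which SQL dialect the ReMaDDer server uses.

```python
import sqlite3


def filter_rows(rows, condition, params=()):
    """Load (id, somefield) rows into a scratch table and apply a WHERE condition.

    The condition string is inserted into the SQL as-is, so it must come from
    a trusted source; values should be passed separately through params.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sometable (id INTEGER, somefield TEXT)")
    conn.executemany("INSERT INTO sometable VALUES (?, ?)", rows)
    sql = f"SELECT id FROM sometable WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]
```

    With rows for Zagreb, Zadar and Split, the condition sometable.somefield = ? with the value 'Zagreb' keeps one row, while sometable.somefield LIKE ? with the pattern 'Za%' keeps two.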

    Solution Execution

    Once a solution definition is prepared by setting global parameters, exact match, fuzzy match and other constraints, you can execute the solution on the remote server and retrieve the resultset. Solution execution has two consequences: the solution base table is created on the server and the resultset is retrieved to the client.

    Solution execution is actually a sequence of two different steps, which can be executed in batch or separately. The first step is solution base table creation on the server, which is a prerequisite for the next step, resultset retrieval on the client.

    The first step, in which the Solution Base Table is created, is the most critical point in the ReMaDDer application (and the most resource and time demanding, too). In this step, a sequence of several critical underlying procedures is triggered, which determines the solution space from which the final resultset is retrieved.

    This step is actually composed of several discrete sub-steps.

    The first sub-step is the so-called “blocking” procedure, which is a method of reducing the space of combinations that will be further analyzed for string similarity. This step is of great importance, since fuzzy match analysis is inherently time-consuming and analyzing all possible combinations would take an extremely long time to complete.

    In the next sub-step, string similarity is calculated between left and right dataset records. ReMaDDer utilizes multiple string similarity functions. Some of them are quite resource demanding.
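
    To give a feel for what "multiple string similarity functions" means, here is a sketch with two simple metrics from the Python standard library. These two metrics are this example's own choice; the tutorial does not say which similarity functions ReMaDDer actually uses.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity in [0, 1] from difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Share of distinct lower-cased words that the two strings have in common."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def similarity_features(left_value, right_value):
    """Return one score per metric for a single field pair."""
    return {
        "char": char_similarity(left_value, right_value),
        "token": token_jaccard(left_value, right_value),
    }
```

    Different metrics catch different kinds of variation: "John Smith" and "Smith John" are a perfect match by word overlap but not by character sequence, which is one reason to compute several scores per field pair.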


    After string similarity is established for all combinations in the solution space, advanced machine learning algorithms take the results from the previous step and infer record linkage or detect duplicates. This is the heart of the inventive and unique approach that ReMaDDer software uses to perform the entity resolution job.

    Unlike competing software, ReMaDDer does not require any user involvement in this step. There is no need to provide examples of matches and non-matches, nor to provide any threshold value that would distinguish matches from non-matches. ReMaDDer will acquire knowledge and determine record linkage automatically, without the need for a human domain expert or clerical review.

    As far as we are aware, there is no other software currently available on the market that is capable of performing such automatic record linkage inference by artificial intelligence, with accuracy reaching that of human clerical review.

    Technically, a solution can be executed in three different ways:

    A) In one step

    In this scenario, both major steps (solution base table creation and resultset retrieval) are executed at once.

    B) In two major steps

    In this scenario, the major steps (solution base table creation and resultset retrieval) are executed one by one, in consecutive order.

    C) In several minor steps

    In this scenario, both major steps are executed as a sequence of several distinct minor steps.

    In the simplest scenario, you can execute the solution in one step. On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there are two buttons. The button “Execute Solution” executes both steps at once, in batch, while the button “Prepare And Execute Result SQL Query” executes only the last step, i.e. resultset retrieval.


    Obviously, you must trigger the button “EXECUTE SOLUTION” at least once in order to create the underlying Solution Base Table on the server, which is a prerequisite for the second step, resultset retrieval.

    The first step, solution base table creation, might be extremely resource and time demanding. Depending on the record counts in the left and right datasets, the number of field pairs to be compared for string similarity and so on, it can take anything from 30 seconds to 24 hours or even more (!).

    Be aware that the time required does not grow linearly with record count. Before blocking, the number of possible record pairs is the product of the left and right record counts, so doubling both datasets roughly quadruples the number of candidate comparisons. The number of field pairs matters as well: analyzing one field pair is not the same as comparing 9 field pairs for fuzzy match. Fuzzy match analysis is inherently complex and time consuming.
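
    A rough upper bound on the amount of work can be computed by hand. The sketch below assumes the worst case in which every left record is compared with every right record on every field pair; blocking and exact match relations exist precisely to stay far below this bound, so the real figures for a ReMaDDer solution will be lower.

```python
def comparisons(left_count, right_count, field_pairs=1):
    """Upper bound on field comparisons when no blocking is applied."""
    return left_count * right_count * field_pairs
```

    Two datasets of 1,000 records give at most 1,000,000 record pairs; two datasets of 10,000 records give at most 100,000,000, a hundredfold increase for a tenfold increase in size.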

    Once the solution base table is created, you can easily change the machine learning strictness or join type, or choose whether to return only the best match. For these changes, you don’t need to re-trigger the tedious and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is exactly why the button “Prepare And Execute Solution Result SQL Query” is provided.

    Besides the default division into major steps, there is also a fine-grained division into sub-steps, which is available in the “Solution Definition” menu entry.


    In the fine-grained division of solution execution steps, we distinguish the following separate actions:

    • “Prepare Solution Base Table SQL Query” --> this action will prepare the SQL query for the solution base table, but will not execute it.

    • “Execute Solution Base Table SQL Query (Create Solution Base Table)” --> this action will execute solution base table creation.

    • “Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> this will trigger recreation of the SQL query for recreation of the solution base table on the server and then retrieve the resultset.

    • “Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> this will trigger an action that checks whether the solution base table has to be recreated. The solution base table will be recreated only if necessary. Then the resultset will be retrieved.

    • “Prepare Solution Result SQL Query” --> just prepare the SQL query that will retrieve the resultset, but don’t actually trigger its execution.

    • “Prepare And Execute Solution Result SQL Query” --> prepare and execute the SQL query that will retrieve the resultset.

    • “Execute Solution Result SQL Query (Retrive Resultset)” --> execute the already prepared SQL query that will retrieve the resultset.


    These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For a regular user, it is only relevant to remember that the solution base table must be created first in order to be able to retrieve the resultset.

    If the solution base table is already created, you don’t need to recreate it for a different combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is enough to use just the “Prepare And Execute Solution Result SQL Query” button.

    If you’re in doubt and don’t know what to do, the simplest and safest way to execute the solution and retrieve the resultset is to click the “EXECUTE SOLUTION” button.

    Solution Execution In One Step

    The simplest way to execute a solution is to execute the analysis in one step, by clicking the button “EXECUTE SOLUTION” or by choosing the corresponding menu item.

    This action will force (re)creation of the solution base table on the server, from scratch, and then prepare and execute the resultset retrieval SQL query.

    Be aware that solution base table (re)creation is a costly action and might take considerable time to complete! If the left or right dataset contains millions of records, this might take an extremely long time.

    Therefore, it is preferable to execute solution base table (re)creation only when necessary.


    Solution Execution In Two Major Steps

    Besides simple solution execution in one step, it is possible to execute a solution in two major steps.

    In this scenario, the first step is solution base table creation on the server, which is a prerequisite for the next step, resultset retrieval on the client.

    Once the solution base table is created, you can easily change the machine learning strictness or join type, or choose whether to return only the best match. For these changes, you don’t need to re-trigger the tedious and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is exactly why the button “Prepare And Execute Solution Result SQL Query” is provided.

    On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there is the button “Prepare And Execute Result SQL Query”, which executes only the last step, i.e. resultset retrieval. You can use it if a proper solution base table is already created on the server.
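
    The relation between the two major steps can be modelled with a small toy class: an expensive step that builds the base table, and a cheap step that reads from it with different retrieval options. Everything here (the class, its method names, the score tuples) is a hypothetical model for illustration and not ReMaDDer's API.

```python
class Solution:
    """Toy model of the two major execution steps."""

    def __init__(self, scored_pairs):
        self.scored_pairs = scored_pairs  # (left_id, right_id, score) tuples
        self.base_table = None
        self.base_table_builds = 0

    def create_base_table(self):
        # Stand-in for the expensive server-side step.
        self.base_table = list(self.scored_pairs)
        self.base_table_builds += 1

    def retrieve_resultset(self, best_only=False):
        # Cheap step: requires the base table, but never rebuilds it.
        if self.base_table is None:
            raise RuntimeError("create the solution base table first")
        if not best_only:
            return [(l, r) for l, r, _ in self.base_table]
        best = {}
        for l, r, score in self.base_table:
            if l not in best or score > best[l][1]:
                best[l] = (r, score)
        return [(l, r) for l, (r, _) in best.items()]

    def execute(self, best_only=False):
        # One-step execution: always rebuilds, then retrieves.
        self.create_base_table()
        return self.retrieve_resultset(best_only)
```

    After one call to execute, retrieve_resultset can be called repeatedly with different options while base_table_builds stays at 1, which is the saving that the second button offers.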

    Solution Execution In Several Minor Steps

    If an appropriate solution base table is not yet created, or the solution definition has changed so that it needs to be recreated, you have to (re)create the solution base table first and then execute the resultset retrieval query.

    Besides executing everything in one step, there is also a fine-grained division into sub-steps, presented in the “Solution Definition” menu entry.


    In the fine-grained division of steps, we distinguish the following separate actions:

    • “Prepare Solution Base Table SQL Query” --> this action will prepare the SQL query for the solution base table, but will not execute it.

    • “Execute Solution Base Table SQL Query (Create Solution Base Table)” --> this action will execute solution base table creation.

    • “Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> this will trigger recreation of the SQL query for recreation of the solution base table on the server.

    • “Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> this will trigger an action that checks whether the solution base table has to be recreated. The solution base table will be recreated only if necessary.

    • “Prepare Solution Result SQL Query” --> just prepare the SQL query that will retrieve the resultset, but don’t actually trigger its execution.

    • “Prepare And Execute Solution Result SQL Query” --> prepare and execute the SQL query that will retrieve the resultset.

    • “Execute Solution Result SQL Query (Retrive Resultset)” --> execute the already prepared SQL query that will retrieve the resultset.

    These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For a regular user, it is only relevant to remember that the solution base table must be created first in order to be able to retrieve the resultset.


    If the solution base table is already created, you don’t need to recreate it for a different combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is enough to use just the “Prepare And Execute Solution Result SQL Query” button.

    If you’re in doubt and don’t know what to do, the simplest and safest way to execute the solution and retrieve the resultset is to click the “EXECUTE SOLUTION” button.

    Data Retrieving And Storing

    You can launch previously prepared solution SQL queries and return resultsets by clicking the button “Prepare And Execute Result SQL Query”.

    Alternatively, you can execute the solution in one step, which includes both solution base table creation and execution of the resultset retrieval SQL query, with the button “EXECUTE SOLUTION”.

    In both cases, once the resultset is retrieved, it is stored locally on your computer and you can load it afterwards, anytime you wish.


     You can easily browse, edit and analyze results in many different ways, including datasheet forms with

    sophisticated data searching, filtering and navigation capabilities.

    Execute Resultset Retrieval SQL Query

    The resultset retrieval query is executed by clicking the button “Execute Solution” or by clicking the button “Prepare And Execute Solution”, which can be used if the solution base table has already been created.

    The difference is that the “Execute Solution” action (re)creates the underlying solution base table and then executes the SQL query which joins the left and right datasets with the solution base table, while the action “Prepare And Execute Results SQL Query” performs just the last step. Obviously, the prerequisite for using the latter is that the solution base table has already been created.

    When the action is triggered, the previously prepared SQL query text is sent to the server for execution. The progress of query execution can be monitored on the “Solution Log” page.

    The retrieved resultset is automatically opened in a separate form.


    Solution Status Info

    ReMaDDer automatically updates the solution status upon preparation and execution of the solution base table creation query and the resultset retrieval query. This solution status information is shown both in the solution header data grid and in the form view, in the respective tabs.


    You get various information about the solution base table creation process, such as: whether the solution base table is created or not, whether the solution creation query has already been executed or not, whether the solution base table is empty or not, and what the query execution times are.

    You also get various information about the resultset retrieval query execution process, such as: whether the resultset retrieval SQL query is generated (prepared) or not, whether the SQL query has already been executed or not, whether the resultset is retrieved or not and, if retrieved, whether it was empty or not. It is also shown whether the resultset is stored locally and in which file. There is information about execution times and the number of executions performed.

    Save And Load Resultset

    Once a solution is executed and the results are retrieved, the resultset is automatically saved as a local file in the “/data/results” subfolder of the ReMaDDer installation folder.

    The resultset can be loaded into the “Solution Result” subpage of the main form by clicking the button “Load Solution Resultset”, or into a separate form by clicking the button “Load Solution Resultset In Separate Window”.

    Review And Edit Resultset

    There are various ways you can post-process and review the retrieved resultset.

    Resultset Browsing

    You can easily browse, edit and analyze the loaded resultset in the data grid form. The datasheet provides sophisticated data searching, filtering and navigation capabilities.

    You can scroll using the mouse, the vertical and horizontal sliders, and the arrows.

    You can also browse records using the navigation buttons.

    Resultset Searching

    You can easily search for any particular value in any column. In the upper left corner of the datagrid there is a small button marked with an orange double arrow. This button opens a pop-up dialog with various search, filter and customization options, one of which is “Find data”.

    When you click the “Find data” button, a search dialog box appears. You can search for any value in any column.

    Resultset Filtration

    You can easily filter by any column. In the upper left corner of the datagrid there is a small button marked with an orange double arrow.

    This button opens a pop-up dialog with various search, filter and customization options, including “Filter data” and “Filter in table”, which are two different ways to filter a datagrid.

    Filter Data

    When you click the button “Filter data”, a dialog box appears in which you can build your filtering conditions. This way you can define complex multicolumn filters.

    The filter is then applied by clicking the “Apply” button.
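
    To make the idea of a multicolumn filter concrete, here is a minimal sketch of the same logic applied outside the application, for example to rows loaded from an exported file. The column names and sample rows are hypothetical, and the sketch assumes all conditions are combined with a logical AND; the dialog itself may offer other combinations.

```python
def apply_filter(rows, conditions):
    """Return the rows for which every (column, predicate) pair holds (logical AND)."""
    return [
        row
        for row in rows
        if all(predicate(row[column]) for column, predicate in conditions)
    ]


# Hypothetical resultset rows; the column names are illustrative only.
rows = [
    {"left_name": "Acme Ltd", "right_name": "ACME Limited", "city": "Zagreb"},
    {"left_name": "Globex", "right_name": "Globex Corp", "city": "Split"},
    {"left_name": "Acme Inc", "right_name": "Acme", "city": "Split"},
]

# Two conditions on two different columns, both of which must be satisfied.
conditions = [
    ("left_name", lambda value: value.lower().startswith("acme")),
    ("city", lambda value: value == "Split"),
]

filtered = apply_filter(rows, conditions)
```

    Here only the row with left_name "Acme Inc" satisfies both conditions.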

    Filter In Table

    Another option for filtration is the button “Filter in table”, which activates a filtration combobox placed just below each column’s title. When you click a filtration combobox cell, a list appears showing all possible values for the respective column. When you select a value, the column is automatically filtered by the chosen value.
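
    The “Filter in table” behaviour amounts to listing the distinct values of a column and keeping only the rows that equal the selected one. A minimal sketch, with hypothetical column names and sample rows, and with the value list sorted as an assumption (the order used by the actual combobox is not specified here):

```python
def column_values(rows, column):
    """Distinct values of a column, sorted, as a drop-down list might present them."""
    return sorted({row[column] for row in rows})


def filter_by_value(rows, column, value):
    """Keep only the rows whose value in the given column equals the chosen value."""
    return [row for row in rows if row[column] == value]


# Hypothetical resultset rows; the column names are illustrative only.
rows = [
    {"name": "Acme", "city": "Split"},
    {"name": "Globex", "city": "Zagreb"},
    {"name": "Initech", "city": "Split"},
]

choices = column_values(rows, "city")                  # values offered for "city"
selected = filter_by_value(rows, "city", choices[0])   # pick the first offered value
```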

    Resultset Sorting

    You can sort by any column, in ascending or descending order, by clicking the column title.

    Resultset Edit And Review

    You can easily edit the resultset in the datagrid. You can delete a row by using the delete button, or edit a record by clicking the edit button.

    Exporting Resultset

    Besides using the datagrid controls, another option for resultset post-processing is to export the resultset to a spreadsheet and then review and edit it in a spreadsheet editor.

    ReMaDDer offers several ways to export the resultset to spreadsheets.

    Exporting Resultset To Spreadsheet

    The resultset can be exported to a CSV file by clicking the button “Export To CSV File”.

    The resultset can be exported to an XLSX file by clicking the button “Export To XLSX File”.

    The resultset can be exported to an XLS file by clicking the button “Export To XLS File”.

    The resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft Excel, by clicking the button “Load In Ext. Spreadsheet Editor”.
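
    An exported CSV file can also be reviewed with a short script instead of a spreadsheet editor. The sketch below is only an illustration: it assumes a comma-delimited file with a header row, and the column names (left_name, right_name) are hypothetical, so check the delimiter, encoding and headers of your actual export before relying on it.

```python
import csv
import io


def load_exported_csv(text_stream):
    """Read an exported resultset into a list of dicts, one per row."""
    return list(csv.DictReader(text_stream))


# Stand-in for a file produced by "Export To CSV File"; the layout is assumed.
sample = io.StringIO(
    "left_name,right_name\n"
    "Acme Ltd,ACME Limited\n"
    "Globex,Globex Corp\n"
    "Initech,initech\n"
)

rows = load_exported_csv(sample)

# Pairs that differ by more than letter case are candidates for manual review.
needs_review = [
    row for row in rows
    if row["left_name"].casefold() != row["right_name"].casefold()
]
```

    With a real export, the stream would come from open(path, newline="", encoding="utf-8") rather than from io.StringIO, where the encoding is again an assumption.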

    ReMaDDer also has its own embedded spreadsheet editor, which can be used for resultset post-processing.

    The resultset can be loaded into the embedded spreadsheet editor by clicking the button “Load As Spreadsheet”.

    Exporting Datagrid To Spreadsheet

    Another way to export the resultset to a spreadsheet file is to use the datagrid’s export feature.

    You have to browse to the destination folder for the export, enter the exported file name and extension, and enter the page name (sheet name). If you forget to specify the page name, you will get an error.

    Customize Data Grids

    ReMaDDer enables you to customize the user interface to a certain extent. You can shrink or stretch columns, rearrange their order and hide/unhide columns.

    Resize columns by dragging vertical splitters between columns.

    Rearrange columns by pressing the left mouse button on a column’s title and dragging the column while the button is held down. Once the column is in its new position, release the mouse button.

     You can define which columns are shown and which are hidden, by clicking on the button “Select visible

    columns”.

    When you close the application, your customizations are saved (in the remadder_props.xml file), and when you open the application again, they are loaded.

    Customize Splitters

    You will notice that various sections are divided by splitters, which you can easily drag to resize the adjacent sections.

    The customization you make is saved on application close and reloaded on application start.

    ReMaDDer Software Trial

    The ReMaDDer client application is distributed as shareware with a 15-day trial period.

    The trial period is initialized the first time the application starts on your computer.

    Commercial Release Code Purchase And Activation

    After the trial period expires, you are required to purchase a commercial release code in order to continue using server features, such as raw data import and query execution.

     You can, however, continue creating and editing projects and solution definitions, as well as loading and

    editing previously acquired resultsets.

    When purchasing a release code, you are required to enter the MachineID in the purchase form. The MachineID is a tag generated by ReMaDDer software and is unique to your hardware. The purchased commercial release code is thus machine-specific and valid only for your hardware.

    Once you have purchased a release code, activate it by clicking the button “Activate Commercial Release Code”.

     You are asked to enter the release code.
