remadder software tutorial (v2.0) - fuzzy match record linkage and data deduplication

8/17/2019 ReMaDDer Software Tutorial (v2.0) - Fuzzy Match Record Linkage And Data Deduplication


Homepage: http://ReMaDDersoft.wix.com/ReMaDDer

ReMaDDer Software Tutorial

How to use ReMaDDer software for successful records matching, data cleansing and data deduplication projects

11/20/2016

Revision 2.0.


Table of Contents

Introduction
    What Is ReMaDDer Software
    Fuzzy Match
    Records Linkage
    Data Deduplication
    ReMaDDer Software Advantages
    Prerequisites
    Revision History
Projects
    Projects Page
    Concept of “Left” and “Right” Dataset
    Record Matching Project vs. Data Deduplication Projects
    Copy A Project
Raw Data Import
    “Left” and “Right” datasets
    Import Raw Data
        Browse And Choose CSV files
        Register CSV Files
        Determine And Convert CSV File To UTF-8
        Edit Raw Datasource Schema Information
        Pre-process Raw Datasource
        Import Data From Raw Datasources
Solution Definition
    How ReMaDDer performs record linkage and data deduplication
    Solution Definition Header
        Solution Basic Information
        Machine Learning Strictness
        Join Type
        Return Only Best Matching Records
    Solution Definition Details
        Fields Picker
        Solution Constraints
Solution Execution
    Solution Execution In One Step
    Solution Execution In Two Major Steps
    Solution Execution In Several Minor Steps
Data Retrieving And Storing
    Execute Resultset Retrieval SQL Query
    Solution Status Info
    Save And Load Resultset
    Review And Edit Resultset
        Resultset Browsing
        Resultset Edit And Review
        Exporting Resultset
Customize Data Grids
Customize Splitters
ReMaDDer Software Trial
Commercial Release Code Purchase And Activation



Introduction

What Is ReMaDDer Software

ReMaDDer is record linkage and data cleansing software with powerful fuzzy record matching and data deduplication capabilities, based on state-of-the-art machine learning and data processing techniques.

As a client-server application, ReMaDDer consists of two parts: a client front-end and a server-side part. The client front-end provides a user-friendly graphical interface with intuitive means for creating projects, importing raw data and defining solutions, while the server-side part provides a powerful data processing engine that can solve even the most complex fuzzy match analysis in reasonable time.

By combining advanced artificial intelligence with clever blocking techniques and multiple string similarity metrics, ReMaDDer provides a unique solution for fully automatic records matching and data deduplication projects.

Traditionally, fuzzy record matching software requires substantial human intervention, either to provide various parameters and threshold values, or to perform extensive clerical review and supervised machine learning training. A unique property of the ReMaDDer software is that it does not require any such human assistance beyond project definition. There are no thresholds or other input parameters that the user must provide to enable the software to distinguish between matches and non-matches; the ReMaDDer software is able to infer and learn everything by itself.

As far as we are aware, ReMaDDer might be the only software currently available that can perform fully automatic fuzzy record matching without human expert intervention while attaining the accuracy of human clerical review. This is accomplished by utilizing various advanced machine learning techniques and approaches.

The name “ReMaDDer” is an acronym for “Records Matching and Data Deduplication Software”.

Homepage: http://ReMaDDersoft.wix.com/ReMaDDer

Fuzzy Match

The term “fuzzy match” refers to methods of identifying related records by measuring how similar they are. It is used in cases where no unique identifier or exact match relation exists between two sets of data.

Fuzzy matching uses weights to calculate the probability that two given records refer to the same entity. Record pairs with probabilities above a certain threshold are considered to be matches, while pairs with probabilities below the threshold are considered to be non-matches.


Fuzzy matching attempts to find a match which, although not a 100 percent match, is above the threshold

matching percentage set by the application.
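The threshold idea described above can be illustrated with a minimal Python sketch. It is not ReMaDDer's algorithm: the standard library's difflib.SequenceMatcher is used here only as a stand-in similarity score, and the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Classify a pair as a match when its score reaches the threshold."""
    return similarity(a, b) >= threshold


# Hypothetical record pairs: one near-duplicate, one unrelated.
pairs = [
    ("Jonathan Smith", "Jonathon Smith"),
    ("Jonathan Smith", "Maria Garcia"),
]
for left, right in pairs:
    score = similarity(left, right)
    print(f"{left} | {right} | {score:.2f} | match={is_match(left, right)}")
```

Raising the threshold makes the classification stricter (fewer false matches, more missed matches), which is exactly the kind of manual tuning that threshold-based tools require.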

Records Linkage

Record linkage refers to the task of finding records that refer to the same entity across different data sources, i.e. identifying related records in two separate data sets.

Record linkage is necessary when joining data sets is based on entities that may or may not share a common

identifier, as may be the case due to differences in record shape, storage location, and/or curator style or

preference.

There are many business cases where record linkage has to be performed. Some typical examples are

product price lists, partner lists, book and movie catalogs, customer loyalty databases, medical records etc.

Data Deduplication

Data deduplication refers to identifying duplicate records in a dataset and cleansing the dataset of redundant information.

ReMaDDer Software Advantages

Due to its inherent complexity, fuzzy match analysis is a popular subject of scientific research and academic papers. Some researchers even build their own software, but such programs tend to be complex and require an understanding of advanced mathematics and algorithms before they can be used. This cannot be expected from an average user who faces a data linkage problem and needs to solve it in a matter of hours or days.

On the other hand, there are huge corporate entity resolution framework solutions, produced by big

software companies, oriented towards huge corporate customers. These solutions are often very complex

and affordable only to big companies and corporate users.

ReMaDDer places itself in the middle and provides a powerful fuzzy match records linkage solution for mere mortals and regular office users.

By allowing users to define exact matching constraints, fuzzy matching constraints and all other constraints in a visual and intuitive way, all the complexity of the fuzzy match analysis is hidden from the user, who can focus on the business case rather than technical issues. That is where ReMaDDer software really shines and clearly distinguishes itself from the competition.

Traditionally, fuzzy record matching software suffers from requiring immense user involvement in project parameterization and clerical review. The user is required either to provide various input parameters and threshold values, or to perform machine learning training and provide examples of matches and non-matches. In both cases, considerable user involvement and expertise is a prerequisite for successful analysis.

In contrast, the ReMaDDer software does not require such heavy user involvement, since it can figure out optimal parameter values automatically, all by itself. This is accomplished by advanced artificial intelligence utilizing various state-of-the-art machine learning techniques.



To summarize: the use of advanced artificial intelligence, combined with an intuitive graphical user interface and low pricing, is what makes ReMaDDer a superb fuzzy match records linkage solution.

Prerequisites

The major prerequisite for using ReMaDDer is an active internet connection, since the raw data is imported to a remote server where it is processed. After the trial period expires, you are required to purchase a commercial release code in order to continue using the remote server.

However, project and solution creation and editing can be performed even without established connection

and purchased release code, since these data are stored locally on your computer.

ReMaDDer front-end client is available as executable for Windows and Linux systems. It is possible to

provide executables for various other systems, on demand.

ReMaDDer does not operate directly on original data sources, but requires data to be imported from CSV (comma separated values) flat files to the server, where corresponding “left” and “right” database tables are then created and processed. Therefore, you will have to provide source datasets as flat CSV files, encoded in UTF-8, preferably with comma (“,”) or semicolon (“;”) field separators.

Revision History

Revision 1.0 (3/20/2016)

Initial release. Tutorial covers ReMaDDer version 1.0.

Revision 1.1 (5/10/2016)

Document is updated to reflect changes and improvements brought by ReMaDDer version 1.1.

New version brings many improvements and simplifies solution definition. Instead of separately choosing and defining thresholds for trigram similarity and levenshtein distance functions, a new, combined, common similarity function (ReMaDDer_similarity) is now introduced that combines both trigram and levenshtein similarity properties. This reduces complexity and uncertainty in solution definition creation, retaining ReMaDDer strength and advantages.

Previous ReMaDDer version has been outputting all columns from left and right dataset into resultset. Now, you can choose which fields are to be included in resultset.

Raw data import process is also much improved, especially regarding importing data from Excel files (in CSV format) where column names contain non-ascii characters and blanks.

There are many small performance improvements and several bugfixes that will improve user experience when using the ReMaDDer software for data match analysis.

Revision 2.0 (11/20/2016)

Document is updated to reflect major changes and improvements brought by ReMaDDer version 2.0. The main changes are:

Instead of using only Levenshtein and Trigram similarity functions, multiple other similarity metrics are added to the server engine.

Matches and non-matches are not based on similarity thresholds any more. Instead, ReMaDDer now utilizes machine learning techniques. Advanced algorithms infer and automatically detect duplicates and record matches.

Threshold parameters are removed as obsolete.

“Use composite field” parameter is removed as obsolete.

“Use inclusive OR” parameter is removed as obsolete.

New parameter “Machine Learning Strictness” is introduced. The parameter defines how strictly artificial intelligence will distinguish between matches and non-matches. The options are: match, strict match and potential match.

New parameter “Join Type” is introduced. Join Type attribute determines how SQL joins between left and right tables will be established, via solution base table. There are three options of joining: a) inner join, b) left outer join, c) right outer join.
The "inner join" option is default behavior, meaning that the resultset will contain all rows from left and right datasets which meet matching criteria.
In case of "left outer join" option, resultset will contain all rows from left dataset and only those rows from right dataset that satisfy matching criteria.
In case of "right outer join" option, resultset will contain all rows from right dataset and only those rows from left dataset that satisfy matching criteria.

New parameter “Return Only Best Match” is introduced. The parameter can have True or False value and determines whether SQL query will return only best matching record or multiple records satisfying similarity criteria.
Check this option if you wish to return only the best matching records for each left or right record, when using corresponding left or right outer joins.
If this option is unchecked (default), multiple matching rows will be returned.



Projects

Projects Page

A project is the basic entity in ReMaDDer software. Each project contains the definition of two source datasets to be imported and analyzed (the so-called "left dataset" and "right dataset"), as well as a variable number of corresponding solutions, which are stored definitions of how to perform fuzzy match analysis.

On creation, each project is assigned a unique project tag. During raw data import to the server, the corresponding input tables get that tag appended to their names. This way, imported tables are always tagged by the project name, which ensures their uniqueness.

The “Projects” page consists of two sections separated by a movable splitter. In the upper section there is a datagrid view where you can browse and edit projects, while in the lower section there is a form view of the currently selected project. The same concept of datagrids and form views is implemented throughout the application.



You can easily create new projects, edit and browse existing projects, by using navigator buttons.

Concept of “Left” and “Right” Dataset

Throughout the ReMaDDer application and this manual, we will use the terms “left” and “right” dataset or table. In every fuzzy match project, we always compare two tables, i.e. two datasets, inspecting the similarity of their rows. For convenience, we call them the “left” and “right” table.

The purpose of entity resolution framework software, such as ReMaDDer, is to identify which records from the “left” dataset correspond to which records from the “right” dataset.

ReMaDDer does not operate on original data sources directly, but requires data to be imported from source

CSV (comma separated values) flat files to server, where corresponding left and right database tables are

then created and processed.

Record Matching Project vs. Data Deduplication Projects

In ReMaDDer software, there is no fundamental difference between data deduplication and records matching projects. In both cases we compare two datasets, trying to infer which records from the “left” dataset correspond to which records in the “right” dataset.

The only difference between the two is that in case of records matching project we have two different input

datasets to be compared, while in case of data deduplication project we have to compare a dataset with

itself, in order to identify duplicate records in the dataset.



Since ReMaDDer software always compares two datasets, left and right, in a data deduplication project we need to import the same original CSV file twice: first as the left dataset and then as the right dataset. The ReMaDDer software will thus create two identical tables with different names in the underlying database.
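When a dataset is compared with itself, every row trivially matches itself and every real duplicate pair shows up twice (once in each direction). How ReMaDDer handles this internally is not described here; the following Python sketch only illustrates the general idea, under the assumption that candidate pairs are restricted to row indexes i < j. The normalize function is a hypothetical stand-in for a real fuzzy comparison.

```python
from itertools import combinations


def candidate_pairs(rows):
    """Return index pairs (i, j) with i < j, so no row meets itself or is paired twice."""
    return combinations(range(len(rows)), 2)


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace (hypothetical stand-in for fuzzy comparison)."""
    return " ".join(text.lower().split())


def find_duplicates(rows):
    """Return index pairs of rows that are equal after normalization."""
    return [
        (i, j)
        for i, j in candidate_pairs(rows)
        if normalize(rows[i]) == normalize(rows[j])
    ]


rows = ["Acme Corp", "acme  corp", "Globex Inc", "ACME CORP"]
print(find_duplicates(rows))
```

For the four sample rows, the three spellings of "Acme Corp" are reported as the pairs (0, 1), (0, 3) and (1, 3), and no row is paired with itself.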

Copy A Project

Instead of manually entering all the parameters for a new project, ReMaDDer allows you to copy an existing project into another project. This action copies raw data import specifications as well as solution definitions.

Raw Data Import

Datasets to be analyzed are called "left" and "right" datasets and can be easily imported from source CSV

files, encoded in UTF-8.

The CSV file format ("Comma Separated Values") is chosen due to its ubiquity and because data from virtually all databases, spreadsheet editors and other data sources can be easily exported to a CSV file.

The source data CSV files, however, must be UTF-8 encoded. Otherwise, import will most likely fail.

Therefore, you must first ensure that the source data CSV files are properly UTF-8 encoded. ReMaDDer has embedded tools for charset encoding detection and conversion, but you can also use the well-known Notepad++ (https://notepad-plus-plus.org/), CudaText (http://uvviewsoft.com/cudatext/) and other powerful text editors that can detect and convert file encodings.

ReMaDDer provides a simple and intuitive tool for importing CSV files. It will automatically detect the field delimiter and column schema information. You can then edit the retrieved schema and finally import the files to the server for further processing.


“Left” and “Right” datasets

In each data deduplication or record matching project, we always compare two datasets for matching records. In record matching projects, these two datasets correspond to two different input CSV files, while in data deduplication projects, both datasets are imported from the same input CSV file.



Nevertheless, we always have a so-called “left dataset” and “right dataset” to be compared. Think of this as comparing the fingers of the left and right hand. You can easily identify the thumb on the left hand as related to the thumb on the right hand, since they share a similar shape. Their relation is obvious from their physical similarity.

It is the same with fuzzy match analysis, where we compare fields from the left and right dataset in order to identify string similarities. ReMaDDer internally uses various functions to measure string similarity, the results of which are then processed by artificial intelligence to infer whether two records represent the same entity or not.
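To give a feel for what such string similarity functions compute, here is a minimal Python sketch of two metrics named in the revision history: Levenshtein distance and trigram similarity. It is an illustration only and not ReMaDDer's implementation; the normalization to a 0..1 score, the padding with two leading and one trailing space, and all function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or exact match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the edit distance to a score from 0.0 (different) to 1.0 (equal)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(text: str) -> set:
    """Return the set of 3-character substrings of the padded, lower-cased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Return shared trigrams divided by all distinct trigrams (Jaccard index)."""
    set_a, set_b = trigrams(a), trigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


for left, right in [("Smith", "Smyth"), ("kitten", "sitting")]:
    print(left, right,
          round(levenshtein_similarity(left, right), 2),
          round(trigram_similarity(left, right), 2))
```

The two metrics react differently to the same change: edit distance counts individual character edits, while trigram similarity penalizes every three-character window that an edit touches. That difference is one reason several metrics are commonly combined.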

Import Raw Data

The process of importing raw data into the server database consists of several logical phases. First, we need to identify the source CSV files for the “left” and “right” dataset. After the source files are identified, we need to ensure that the CSV files are properly UTF-8 encoded. Once proper encoding is ensured, we need to retrieve and specify schema information about the CSV files. In the last phase we actually perform the import from the source files, according to the previously defined schema. The result of the last step is that the source files are imported into the server-side database, where they can be processed according to various solution definitions.

On the “Data Import” page, there are two sub-pages: “Left Dataset Specification” and “Right Dataset Specification”, in which we separately define input dataset specifications for the “left” and “right” dataset.

Import can be executed separately for the left and right dataset, or both can be imported at once, in a batch.

Browse And Choose CSV files

The first step in importing input CSV files is to choose the CSV files to be imported.

On the upper part of the “Left Dataset Specification” or “Right Dataset Specification” sub-page, there is a CSV file browser dialog box.

You can browse CSV files on your computer by clicking the browse button. This opens a file browser in which you can choose a CSV file. The absolute file path is then copied to the edit box.

Register CSV Files

Next step is to define CSV file schema specification. We call this process “registering CSV file”.



By clicking the “Register CSV file” button near the file browser, the browsed CSV file is examined for its columns, and its schema information is then inserted into the corresponding list of fields (columns).

ReMaDDer determines the field delimiter in the CSV file (normally either “;” or “,”) and retrieves information about the columns.

If a column name has upper case characters, it is converted to lower case.

Currently, ReMaDDer treats all columns as text fields of various lengths. This is due to the fact that the comparison is performed by string comparison functions, so other data types (e.g. datetime, integer, real etc.) would not make sense for string comparisons.
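The registration step described above (guess the delimiter, read the column names, lower-case them) can be approximated with Python's standard csv module. This is a sketch under assumptions, not ReMaDDer's code: the function name register_csv and the sample data are hypothetical, and the delimiter guess is limited to “,” and “;”.

```python
import csv
import io


def register_csv(text: str):
    """Return (delimiter, lower-cased column names) guessed from CSV text."""
    # csv.Sniffer guesses the dialect; candidates are restricted to "," and ";".
    dialect = csv.Sniffer().sniff(text, delimiters=",;")
    reader = csv.reader(io.StringIO(text), dialect)
    header = next(reader)
    # Column names are lower-cased; every column is treated as plain text.
    columns = [name.strip().lower() for name in header]
    return dialect.delimiter, columns


# Hypothetical sample with a semicolon delimiter and mixed-case column names.
sample = "Customer ID;Full Name;City\n1;Ann Lee;Oslo\n2;Bob Ray;Rome\n"
delimiter, columns = register_csv(sample)
print(repr(delimiter), columns)
```

Like any content-based guess, csv.Sniffer can be wrong on unusual files, so the detected delimiter and column list are worth checking before import.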

Determine And Convert CSV File To UTF-8

In the previous ReMaDDer version, the program detected the encoding and converted it to UTF-8 automatically during CSV file registration. Although very convenient, this could lead to wrong results, since the encoding detection function is not 100% reliable and sometimes guesses the encoding wrongly. This is due to the fact that charset detection is an inherently difficult task and there is no 100% sure method. It is always a kind of educated guess based on content inspection.

Therefore, we decided to remove automatic charset detection and conversion to UTF-8. You will have to do it yourself and ensure that the source files are properly UTF-8 encoded. Charset detection, as well as file encoding conversion to UTF-8, is still present as a ReMaDDer feature (and even improved), but you will have to trigger it manually with the respective buttons, or by choosing it from the menu.

Another option is to use the embedded spreadsheet editor “Spready” to open and convert source files.

Alternatively, you can use various established tools, such as the Notepad++ text editor, that can recognize file encoding and perform the required conversion to UTF-8.

Determine And Convert CSV File Encoding, with embedded tool

After a CSV file is registered as left or right dataset source, it can be analyzed with embedded tool for

detecting charset encoding.

When you click the button “Determine Encoding of Left Dataset CSV File” or “Determine Encoding of Right Dataset CSV File”, the respective CSV file will be analyzed for its encoding type by two different embedded procedures. The result of the encoding analysis will be displayed in a corresponding pop-up window.



If both functions agree that the encoding is UTF-8 (utf8), as in the example above, then the CSV file is in

appropriate format for import.

But, if result is not UTF-8, then the CSV file must be converted to UTF-8 before importing!

You can convert CSV file encoding to UTF-8 by clicking button “Convert Encoding Of Left Dataset

CSV File” or “Convert Encoding Of Right Dataset CSV File”.

When the conversion action is triggered, ReMaDDer will first back up the original CSV file and then convert

the file encoding to UTF-8.
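If you prefer to check and convert a file outside ReMaDDer, the same two steps (test for valid UTF-8, then back up and convert) can be sketched in Python. This is an illustration under assumptions and does not reproduce ReMaDDer's embedded tools: the function names and the ".bak" backup suffix are hypothetical, and the source encoding (cp1252 in the demo) must already be known, because the sketch does not guess it.

```python
import shutil
import tempfile
from pathlib import Path


def is_utf8(path: Path) -> bool:
    """Return True when the whole file decodes as strict UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path: Path, source_encoding: str) -> Path:
    """Back up the file next to the original, then rewrite it as UTF-8."""
    backup = path.with_name(path.name + ".bak")  # hypothetical backup naming
    shutil.copy2(path, backup)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
    return backup


# Demo on a temporary file written in Windows-1252 (an assumed source encoding).
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "left.csv"
    csv_path.write_bytes("name;city\nMüller;Köln\n".encode("cp1252"))
    print("UTF-8 before conversion:", is_utf8(csv_path))
    backup_path = convert_to_utf8(csv_path, "cp1252")
    print("UTF-8 after conversion:", is_utf8(csv_path), "| backup:", backup_path.name)
```

Note that a pure ASCII file also passes the UTF-8 check, since ASCII is a subset of UTF-8. The strict decode can prove that a file is not UTF-8, but it cannot tell which other encoding was used.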

Determine And Convert CSV File Encoding, with embedded spreadsheet editor “Spready”

Besides the above-mentioned embedded encoding detection and conversion tool, ReMaDDer has an embedded “Spready” spreadsheet editor (http://wiki.lazarus.freepascal.org/FPSpreadsheet), which can also be used for file encoding conversion.


Determine And Convert CSV File Encoding, with external tools

Charset detection with the embedded tool is not 100% reliable, which is also true of any tool that infers charsets.

If you encounter difficulties with the embedded charset detection and conversion tools, or you already know the file encoding, you might try various external tools, of which I would recommend the well-established Notepad++ text editor (https://notepad-plus-plus.org/).


Another interesting alternative is the CudaText text editor (http://uvviewsoft.com/cudatext/), which is also capable of charset detection and conversion.


Edit Raw Datasource Schema Information

Once you have retrieved schema information from a CSV file, you might conclude that you don’t want to import all columns, but only a subset of fields.

You can edit the schema by using the corresponding data grid navigator buttons.

If you wish to delete the currently selected field from the schema, just click the delete button.

If you wish to restore the original column schema, just click the “Get Fields Schema” button and the columns list will be repopulated from the CSV file.

Pre-process Raw Datasource



While defining the import schema specification, you might realize that the input data need some pre-processing before importing to the server for further analysis.

Of course, you can edit input CSV files by using any spreadsheet editor (such as LibreOffice or OpenOffice Calc, Gnumeric or Microsoft Excel) or text editor (such as Notepad, Notepad++, ConText, Gedit, CudaText, Geany or Leafpad), but you can also use the embedded spreadsheet editor “Spready”.

You can launch the external default spreadsheet editor by clicking the button “Open CSV File in Ext. Editor”.

You can launch the embedded spreadsheet editor by clicking the button “Open CSV File In Int. Editor”. This will open the embedded spreadsheet editor “Spready” (http://wiki.lazarus.freepascal.org/FPSpreadsheet).


Import Data From Raw Datasources

The final step in source data import is executing the import procedure, by clicking the appropriate button or triggering the action from the respective menu.



We can execute import separately for left and right datasets, by clicking corresponding buttons “Import

left dataset CSV file” or “Import right dataset CSV file” or we can import them both at once by

clicking the button “Import both CSV files to server”.

When you click the import button, ReMaDDer will automatically open the “Import Log” page, where you

can watch import process progress.



Import speed depends on the file size and most importantly, internet connection quality.

Solution Definition

A solution definition is the set of parameters for performing record linkage or data deduplication analysis. Each project can have many solutions with different specifications, so you can test which combination of parameters leads to the best results.

Each solution definition consists of solution header specification and solution constraints

specification.

Solution header specification contains general info about the solution and defines important parameters

which determine how record matching analysis will be performed. These parameters are: “machine

learning strictness”, “join type” and “return only best match”.

Solution constraints specification consists of: exact match relations section, fuzzy match relations

section and other constraints section.

Solution definition page (page “Record Matching Analysis”, sub-page “Solution Definition”):

As with other pages, the “Solution” page is also divided into two sections: datagrid view and form view.

For better user experience, the form view is additionally divided into several tabs and sub-tabs. The main tabs are: “Solution Definition” and “Solution Result”.

The “Solution Definition” tab is further divided into: “Solution Header”, “Solution Fields Picker” and “Solution Constraints”.



“Solution Header” tab is divided into several sub-tabs: “Common”, “Solution Base Table Creation Query

Info” and “Solution Resultset Retrieval Query Info”.

“Solution Constraints” tab is divided into sub-tabs: “Exact Match Constraints”, “Fuzzy Match Constraints”

and “Other Constraints”.

How ReMaDDer performs record linkage and data deduplication

For each project we can define one or more solutions. A solution consists of a solution definition and a solution resultset.

A solution definition is a specification that instructs ReMaDDer how to perform record linkage or data deduplication analysis in order to retrieve the resultset.

We can define three types of solution constraints: exact match constraints, fuzzy match constraints and other constraints.

Fuzzy match constraints define field pairs from left and right dataset to be compared for fuzzy string

similarity. In order to infer records similarity, ReMaDDer utilizes various string similarity metrics, along

with powerful machine learning algorithms.
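To give a feel for what a string similarity metric does, here is a minimal Python sketch. It is an illustration only: the tutorial does not say which metrics ReMaDDer uses, so the standard-library `difflib.SequenceMatcher` stands in for them, and the function name `similarity` and the sample values are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(left: str, right: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring case and outer whitespace."""
    a = left.strip().lower()
    b = right.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical field values taken from a "left" and a "right" record.
pairs = [
    ("Jonathan Smith", "Jonathon Smith"),  # likely the same person, one typo
    ("Jonathan Smith", "Maria Garcia"),    # clearly different
]
for left_value, right_value in pairs:
    score = similarity(left_value, right_value)
    print(f"{left_value} | {right_value} | {score:.2f}")
```

A near-duplicate pair scores close to 1, while unrelated values score much lower. A real record comparison would compute such a score for every fuzzy match field pair.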

Artificial intelligence then automatically infers record linkage or duplicates and creates the solution base table.

The final step is resultset retrieval, in which the database engine creates and executes an SQL query that joins the left and right datasets with the solution base table and outputs the resultset. The retrieved resultset can be exported to a spreadsheet or flat file.

Solution Definition Header

Solution definition header contains general solution definition parameters and info about solution execution status.

Solution definition header (whole page):



Solution definition header (datagrid view):

Solution definition header (form view):



The solution definition header can be entered either through the datagrid or through the form view, which shows the currently selected solution.

Solution Basic Information

Basic information about a solution is shown in fields: “Solution Name”, “Solution Tag”, “Solution Base Table

Name”, “Tag Assigned”, “Solution Status” and “Solution Comment”.

Solution Tag is an automatically generated designation which is appended to each solution name by default and is also used to form the Solution Base Table name.

Solution Base Table Name is automatically formed from the Solution Name and the Solution Tag. The Solution Tag ensures that the created solution base table is unique on server.

Solution Status and Solution Comment are fields in which the user can enter additional arbitrary information.



Machine Learning Strictness

The parameter “Machine Learning Strictness” defines how strictly the artificial intelligence will distinguish between matches and non-matches, that is, how strictly fuzzy matching will be determined. Possible values are: a) match, b) strict match, c) potential match.

"Match" is the default option. The retrieved resultset will contain a balanced ratio of true positives to false positives. It tends to include all true positives, with some degree of false positives and very few false negatives.

"Strict match" is the strictest option. The resultset will tend to contain only true positives, but due to a higher incidence of false negatives, it might fail to recognize some matches.

"Potential match" is the weakest option. The resultset will tend to contain all true positives, but many false positives as well.
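The trade-off between the three options can be expressed with the standard precision and recall measures. The following Python sketch is illustrative only: the counts are invented for the example and are not ReMaDDer measurements, and the tutorial does not describe how strictness is implemented internally.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Compute precision and recall from confusion counts (0.0 when undefined)."""
    returned = true_pos + false_pos      # everything the resultset contains
    actual = true_pos + false_neg        # every real match that exists
    precision = true_pos / returned if returned else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up counts (true positives, false positives, false negatives),
# chosen only to mirror the qualitative description of each option.
illustrative_counts = {
    "strict match": (80, 2, 20),
    "match": (95, 10, 5),
    "potential match": (99, 40, 1),
}
for option, counts in illustrative_counts.items():
    precision, recall = precision_recall(*counts)
    print(f"{option}: precision={precision:.2f}, recall={recall:.2f}")
```

A stricter option raises precision (fewer false positives) at the cost of recall (more missed matches), and a weaker option does the opposite.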

Join Type

“Join Type” attribute determines how SQL joins between left and right tables will be established, via solution base table. There are three options of joining: a) inner join, b) left outer join, c) right outer join.



The "inner join" option is the default, meaning that the resultset will contain only those rows from the left and right datasets that meet the matching criteria.

In case of "left outer join" option, resultset will contain all rows from left dataset and only those rows

from right dataset that satisfy matching criteria.

In case of "right outer join" option, resultset will contain all rows from right dataset and only those rows

from left dataset that satisfy matching criteria.
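The difference between the join types can be tried out on a toy database. The sketch below uses Python's built-in `sqlite3` module; the table names `left_ds`, `right_ds` and `base`, their columns and the sample rows are all hypothetical, and the `base` table is only a simplified stand-in for a solution base table (a list of matched id pairs). The SQL that ReMaDDer actually generates is not shown in this tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_ds  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE right_ds (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical stand-in for the solution base table: matched id pairs.
    CREATE TABLE base (left_id INTEGER, right_id INTEGER);
    INSERT INTO left_ds  VALUES (1, 'Acme Ltd'), (2, 'Globex');
    INSERT INTO right_ds VALUES (10, 'ACME Limited'), (20, 'Initech');
    INSERT INTO base VALUES (1, 10);
""")

# Inner join: only rows that take part in a match.
inner_rows = conn.execute("""
    SELECT l.name, r.name
    FROM left_ds l
    JOIN base b ON b.left_id = l.id
    JOIN right_ds r ON r.id = b.right_id
    ORDER BY l.id
""").fetchall()

# Left outer join: every left row, with NULL where no match exists.
left_outer_rows = conn.execute("""
    SELECT l.name, r.name
    FROM left_ds l
    LEFT JOIN base b ON b.left_id = l.id
    LEFT JOIN right_ds r ON r.id = b.right_id
    ORDER BY l.id
""").fetchall()

print(inner_rows)
print(left_outer_rows)
```

The inner join returns only the matched pair, while the left outer join also returns the unmatched left row with an empty right side. A right outer join is the mirror image: start from `right_ds` and left-join towards `left_ds`.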

Return Only Best Matching Records

The parameter “Return Only Best Match” can be True or False and determines whether the SQL query will return only the best matching record or multiple records satisfying the similarity criteria. It is used as a modifier to left outer join or right outer join.

If this option is unchecked (default), multiple matching rows will be returned. If it is checked, only best

matching item from slave dataset will be joined to corresponding record in master dataset.

Check this option if you wish to return only the best matching records for each left or right record, when

using left or right outer joins and datasets are in master/slave relation.

In case of “inner join” join type, this parameter has no meaning and is ignored.

Typical use case for left or right outer join with “return only best matching” option is when we want to match

two product price lists of which one is master list.
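The effect of the “return only best match” option can be sketched in a few lines of Python. This is an assumption-laden illustration: the tutorial does not say how ReMaDDer ranks candidate matches, so the numeric `score`, the record ids and the function name `best_matches` are hypothetical.

```python
def best_matches(candidates):
    """Keep, for each master id, only the candidate with the highest score.

    `candidates` is an iterable of (master_id, slave_id, score) tuples.
    On a tie, the candidate seen first is kept.
    """
    best = {}
    for master_id, slave_id, score in candidates:
        current = best.get(master_id)
        if current is None or score > current[1]:
            best[master_id] = (slave_id, score)
    return best


# Hypothetical candidate matches between a master and a slave price list.
candidates = [
    ("M1", "S7", 0.91),
    ("M1", "S9", 0.84),
    ("M2", "S3", 0.77),
]
print(best_matches(candidates))
```

With the option unchecked, all three candidate rows would be returned. With it checked, master record M1 keeps only its best candidate.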

Solution Definition Details

While solution definition header defines general parameters for performing fuzzy match analysis, solution definition details are set in the Field Picker sub-page and the Solution Constraints sub-page, which has three sections defining solution constraints: Exact Match Relations section, Fuzzy Match Relations section and Other Constraints section.



Fields Picker

ReMaDDer provides a simple yet very powerful visual tool for adding field pairs to the exact match relations section, the fuzzy match relations section or the other constraints section.

With the fields of the input datasets ("left" and "right") listed side by side, you can easily browse the two lists, visually establish field pairs and send them to the appropriate constraints definition section by clicking the appropriate button.



You can add selected fields pair to exact match section by clicking the button “Add Fields Pair To Exact Match Relations Section”.

You can add selected fields pair to fuzzy match section by clicking the button “Add Fields Pair To Fuzzy

Match Relations Section”.

You can add left or right dataset field to other constraints section by clicking the respective button.

By eliminating the need for tedious manual input and letting you build solution constraints visually instead, ReMaDDer simplifies solution definition creation and boosts your productivity.

Starting from ReMaDDer version 1.1, a checkbox column “Output Field to Resultset?” is added to the Field Picker datagrid. It is used to include fields in, or exclude them from, the resultset. By default, all fields are included in the resultset.



Solution Constraints

There are three types of constraints that you can define for a solution: exact match relations, fuzzy match relations and other constraints.

Exact Match Relations

In the exact match relations section, we can add field pairs from the "left" and "right" imported datasets and define whether they must be equal (=) or not equal (<>).



If we can define an exact match relation on one or more field pairs, we can tremendously increase the speed of analysis by narrowing down the number of record pair combinations to be analyzed for fuzzy match.

Therefore, it is recommended to use exact match relations whenever possible.
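The following Python sketch shows why an exact match relation narrows the search so effectively. It is a simplified illustration with made-up records and a hypothetical `city` field, not a description of ReMaDDer's internal procedure: only records that agree on the exact-match field are paired for further comparison.

```python
from collections import defaultdict


def candidate_pairs(left, right, key):
    """Pair only those left/right records whose `key` values are equal."""
    right_by_key = defaultdict(list)
    for right_rec in right:
        right_by_key[right_rec[key]].append(right_rec)
    return [
        (left_rec, right_rec)
        for left_rec in left
        for right_rec in right_by_key.get(left_rec[key], [])
    ]


# Hypothetical sample records.
left = [
    {"name": "Acme Ltd", "city": "Zagreb"},
    {"name": "Globex", "city": "Split"},
]
right = [
    {"name": "ACME Limited", "city": "Zagreb"},
    {"name": "Initech", "city": "Zagreb"},
    {"name": "Globex Corp", "city": "Split"},
]

all_pairs = len(left) * len(right)               # every left/right combination
blocked = candidate_pairs(left, right, "city")   # only pairs with equal city
print(all_pairs, len(blocked))
```

Here the exact match on `city` halves the number of pairs to compare. On large datasets with a selective field, the reduction is far greater.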

Fuzzy Match Relations

In solution header section we can set various general parameters that determine how fuzzy match analysis

will be performed: we can choose machine learning strictness, join type and whether all matches or just

best matches will be returned.



In the fuzzy match relations section we provide details for the fuzzy match comparison analysis. We can list the field pairs to be compared and further define how the fuzzy match analysis will be performed.

Relative Field Weight



For each field pair that will be compared for similarity, we have to define its relative weight. The bigger the weight, the greater the importance of that field pair's similarity in the final decision on whether two records match.

The weight for a particular field pair is entered as an arbitrary integer value in the field “Field Weight (integer)”, and ReMaDDer then calculates its relative weight. The sum of relative weights is always 1.

When a new field pair is added to the fuzzy match relations section, it gets the default integer weight of one (1). You can change this value to any bigger integer and ReMaDDer will recalculate the relative weights so that their sum remains 1.

Notice that there is an additional graphical indicator of relative weights. It shows graphically relative weight

for currently selected fields pair.

There are two buttons provided: “Recalculate Weights” and “Reset Weights”.



The button “Reset Weights” resets all integer weights to 1, which is the same as if relative weights were not used at all. In that case, all field pairs are treated as equally important.

The button “Recalculate Weights” recalculates the relative weights according to the integer values entered in the field “Field Weight (integer)”. You don’t need to trigger this action manually, since this procedure is triggered automatically on each change of an integer value or addition of a field pair.
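The relationship between integer weights and relative weights can be sketched as a simple normalization. This assumes that each relative weight is the integer weight divided by the sum of all integer weights, which is the most natural reading of the description above but is not confirmed by the tutorial; the field names are hypothetical.

```python
def relative_weights(integer_weights):
    """Scale positive integer weights so that the results sum to 1."""
    total = sum(integer_weights.values())
    return {field: weight / total for field, weight in integer_weights.items()}


# Hypothetical field pairs with user-entered integer weights.
weights = relative_weights({"name": 3, "address": 2, "city": 1})
for field, weight in weights.items():
    print(f"{field}: {weight:.3f}")

# Resetting every integer weight to 1 makes all field pairs equally important.
print(relative_weights({"name": 1, "address": 1, "city": 1}))
```

Under this assumption, a field pair with integer weight 3 out of a total of 6 carries half of the overall importance.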

Other Constraints

As with exact match relations, it is desirable to limit the analysis to a particular subset of data. Such constraints can greatly increase the speed of record linkage or data deduplication analysis.

We can define any custom constraint to be applied to a particular field from the "left" or "right" dataset.



Normally, the condition is: sometable.somefield = 'some string', but other operators such as LIKE can be used as well.
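The two kinds of condition can be compared on a toy table. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table and field names are the placeholders from the sentence above, the rows are made up, and the exact SQL dialect accepted by the ReMaDDer server is not specified in this tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (somefield TEXT)")
conn.executemany(
    "INSERT INTO sometable VALUES (?)",
    [("some string",), ("some other string",), ("unrelated",)],
)

# Equality: only rows whose field is exactly 'some string'.
equal_rows = conn.execute(
    "SELECT somefield FROM sometable WHERE somefield = 'some string'"
).fetchall()

# LIKE with a % wildcard: every row whose field starts with 'some'.
like_rows = conn.execute(
    "SELECT somefield FROM sometable "
    "WHERE somefield LIKE 'some%' ORDER BY somefield"
).fetchall()

print(equal_rows)
print(like_rows)
```

Either condition reduces the rows that enter the analysis. LIKE is the looser of the two, since it accepts any value matching the pattern.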

Solution Execution

Once a solution definition is prepared by setting global parameters and exact match, fuzzy match and other constraints, you can execute the solution on the remote server and retrieve the resultset. Solution execution has two outcomes: the solution base table is created on server and the resultset is retrieved to the client.

Solution execution is actually a sequence of two different steps, which can be executed in batch or separately.

The first step is solution base table creation on server, which is a prerequisite for the next step, resultset retrieval on client.

The first step, in which the Solution Base Table is created, is the most critical point in the ReMaDDer application (and the most resource and time demanding, too). In this step, a sequence of several underlying procedures is triggered; these determine the solution space from which the final resultset is retrieved.

This step is actually composed of several discrete sub-steps.

The first sub-step is the so-called “blocking” procedure, a method of reducing the space of combinations that will be further analyzed for string similarity. This step is of great importance, since fuzzy match analysis is inherently time-consuming and analyzing all possible combinations would take an extremely long time.

In the next sub-step, string similarity is calculated between left and right dataset records. ReMaDDer utilizes multiple string similarity functions. Some of them are quite resource demanding.



After string similarity is established for all combinations in the solution space, machine learning algorithms take the results from the previous step and infer record linkage or detect duplicates. This is the heart of the approach that ReMaDDer uses to perform entity resolution.

Unlike other competing software, ReMaDDer does not require any user involvement in this step. There is no need to provide examples of matches and non-matches, nor to provide any threshold value that would distinguish matches from non-matches. ReMaDDer acquires knowledge and determines record linkage automatically, without the need for a human domain expert or clerical review.

As far as we are aware, no other software currently available on the market is capable of performing such automatic record linkage inference by artificial intelligence, with accuracy approaching that of human clerical review.

Technically, solution can be executed in three different ways:

A) In one step

In this scenario, both major steps (solution base table creation and resultset retrieval) are executed at once.

B) In two major steps

In this scenario, major steps (solution base table creation and resultset retrieval) are executed one by one in consecutive order.

C) In several minor steps

In this scenario, both major steps are executed in sequence of several distinct minor steps.

In the simplest scenario, you can execute the solution in one step. On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there are two buttons. The button “Execute Solution” executes both steps at once, in batch, while the button “Prepare And Execute Result SQL Query” executes only the last step, i.e. resultset retrieval.



Obviously, you must click the button “EXECUTE SOLUTION” at least once in order to create the underlying Solution Base Table on server, which is a prerequisite for the second step, resultset retrieval.

The first step, solution base table creation, might be extremely resource and time demanding. Depending on the record counts in the left and right datasets, the number of field pairs to be compared for string similarity and so on, it can take anything from 30 seconds to 24 hours or even more (!).

Be aware that the solution complexity, and with it the time required for the solution base table to be created, grows much faster than linearly with the record counts: without further constraints, the number of left/right record combinations equals the product of the two record counts. The number of field pairs to be compared matters as well. It is not the same whether you analyze only one field pair or compare 9 field pairs for fuzzy match. Fuzzy match analysis is inherently complex and time consuming.

Once the solution base table has been created, you can easily change machine learning strictness or join type, or choose whether to return only the best match. For these changes, you don’t need to re-trigger the tedious and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is exactly why the button “Prepare And Execute Solution Result SQL Query” is provided.

Besides the default differentiation into major steps, there is also a fine-grained differentiation into sub-steps, which is available in the “Solution Definition” menu entry.



In fine grained differentiation of solution execution steps, we distinguish following separate actions:

“Prepare Solution Base Table SQL Query” --> this action will prepare SQL query for solution

base table, but will not execute it.

“Execute Solution Base Table SQL Query (Create Solution Base Table)” --> this action will

execute solution base table creation.

“Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> this will trigger recreation of the SQL query for (re)creating the solution base table on server and then retrieve the resultset.

“Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> this

will trigger action that will check whether solution base table has to be recreated. The solution base

table will be recreated only if necessary. Then resultset will be retrieved.

“Prepare Solution Result SQL Query” --> just prepare the SQL query that will retrieve the resultset, but don’t actually trigger its execution.

“Prepare And Execute Solution Result SQL Query” --> prepare and execute SQL query that

will retrieve resultset.

“Execute Solution Result SQL Query (Retrive Resultset)” --> execute already prepared SQL

query that will retrieve resultset.



These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For a regular user, it is only necessary to remember that the solution base table must be created before the resultset can be retrieved.

If the solution base table has already been created, you don’t need to recreate it for a different combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is enough to use the “Prepare And Execute Solution Result SQL Query” button.

If you’re in doubt and don’t know what to do, the simplest and safest way to execute the solution and retrieve the resultset is to click the “EXECUTE SOLUTION” button.
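The rule of thumb above can be summarized as a small decision function. The Python sketch below is a hypothetical model of that rule, not ReMaDDer code: the setting names and the function `plan_execution` are invented for the illustration.

```python
# Settings that, according to this tutorial, affect only resultset retrieval.
# The identifiers themselves are hypothetical.
RETRIEVAL_ONLY = {
    "machine_learning_strictness",
    "join_type",
    "return_only_best_match",
}


def plan_execution(base_table_exists: bool, changed_settings: set) -> list:
    """Return the major steps to run, in order."""
    # Any change outside the retrieval-only settings (for example to the
    # constraints) means the base table must be built again.
    needs_rebuild = (not base_table_exists) or bool(changed_settings - RETRIEVAL_ONLY)
    steps = []
    if needs_rebuild:
        steps.append("create solution base table")
    steps.append("retrieve resultset")
    return steps


print(plan_execution(False, set()))
print(plan_execution(True, {"join_type"}))
print(plan_execution(True, {"fuzzy_match_constraints"}))
```

Only the first and third calls include the costly base table step. Changing just the join type reuses the existing base table.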

Solution Execution In One Step

The simplest way to execute a solution is to run the analysis in one step, by clicking the button “EXECUTE SOLUTION” or by choosing the corresponding menu item.

This action forces (re)creation of the solution base table on server, from scratch, and then prepares and executes the resultset retrieval SQL query.

Be aware that solution base table (re)creation is a costly action and might take considerable time to complete! If the left or right dataset contains millions of records, it might take an extremely long time.

Therefore, it is preferable to execute solution base table (re)creation only when necessary.



Solution Execution In Two Major Steps

Besides simple solution execution in one step, it is possible to execute a solution in two major steps.

In this scenario, the first step is solution base table creation on server, which is a prerequisite for the next step, resultset retrieval on client.

Once the solution base table has been created, you can easily change machine learning strictness or join type, or choose whether to return only the best match. For these changes, you don’t need to re-trigger the tedious and time-consuming solution base table recreation; it is enough to re-trigger only the second step. That is exactly why the button “Prepare And Execute Solution Result SQL Query” is provided.

On the Solution definition header, as well as in the corresponding “Solution Header” menu entry, there is a button “Prepare And Execute Result SQL Query”, which executes only the last step, i.e. resultset retrieval. You can use it if a proper solution base table has already been created on server.

Solution Execution In Several Minor Steps

If an appropriate solution base table has not yet been created, or the solution definition has changed so that the table needs to be recreated, you have to (re)create the solution base table first and then execute the resultset retrieval query.

Besides executing everything in one step, there is also a fine-grained differentiation of these sub-steps, presented in the “Solution Definition” menu entry.



In fine grained differentiation of steps, we distinguish following separate actions:

“Prepare Solution Base Table SQL Query” --> this action will prepare SQL query for solution

base table, but will not execute it.

“Execute Solution Base Table SQL Query (Create Solution Base Table)” --> this action will

execute solution base table creation.

“Prepare Solution Result SQL Query With Forced Base Table (Re)creation” --> this will trigger recreation of the SQL query for (re)creating the solution base table on server.

“Prepare Solution Result SQL Query With Check Whether To Create Base Table” --> this

will trigger action that will check whether solution base table has to be recreated. The solution base

table will be recreated only if necessary.

“Prepare Solution Result SQL Query” --> just prepare the SQL query that will retrieve the resultset, but don’t actually trigger its execution.

“Prepare And Execute Solution Result SQL Query” --> prepare and execute SQL query that

will retrieve resultset.

“Execute Solution Result SQL Query (Retrive Resultset)” --> execute already prepared SQL query that will retrieve resultset.

These fine-grained actions are accessible only from the menu, because a casual user will rarely need them. For a regular user, it is only necessary to remember that the solution base table must be created before the resultset can be retrieved.

If the solution base table has already been created, you don’t need to recreate it for a different combination of the “machine learning strictness”, “join type” and “return only best match” parameters. It is enough to use the “Prepare And Execute Solution Result SQL Query” button.

If you’re in doubt and don’t know what to do, the simplest and safest way to execute the solution and retrieve the resultset is to click the “EXECUTE SOLUTION” button.

Data Retrieving And Storing

You can launch previously prepared solution SQL queries and return resultsets, by clicking the button

“Prepare And Execute Result SQL Query”.

Alternatively, you can execute solution in one step, which includes both solution base table creation and

resultset retrieval SQL query execution in one step, with button “EXECUTE SOLUTION”.

In both cases, once resultset is retrieved, it is stored locally on your computer and you can load it afterwards,

anytime you wish.



You can easily browse, edit and analyze results in many different ways, including datasheet forms with

sophisticated data searching, filtering and navigation capabilities.

Execute Resultset Retrieval SQL Query

The resultset retrieval query is executed by clicking the button “Execute Solution” or by clicking the button “Prepare And Execute Solution”, which can be used if the solution base table has already been created.

The difference is that “Execute solution” action (re)creates underlying solution base table and then executes

SQL query, which joins left and right datasets with the solution base table, while action “Prepare And

Execute Results SQL Query” just performs the last step. Obviously, prerequisite to use the latter is that the

solution base table has already been created.

When action is triggered, previously prepared SQL query text is sent to server for execution. The progress

of query execution can be monitored in “Solution Log” page.

The retrieved resultset is automatically opened in a separate form.



Solution Status Info

ReMaDDer automatically updates the solution status whenever the solution base table creation query or the resultset retrieval query is prepared or executed. This solution status information is shown both in the solution header data grid and in the form view, in the respective tabs.



You get various information about solution base table creation process, such as: whether solution base table

is created or not, whether solution creation query has already been executed or not, whether solution base

table is empty or not, what are query execution times.

You also get various information about the resultset retrieval query execution process, such as: whether the resultset retrieval SQL query has been generated (prepared), whether the SQL query has already been executed, whether the resultset has been retrieved and, if so, whether it was empty. It is also shown whether the resultset is stored locally and in which file. There is information about execution times and the number of executions performed.



Save And Load Resultset

Once a solution is executed and results retrieved, the resultset is automatically saved as a locally stored file in the ReMaDDer installation folder, subfolder “/data/results”.

Resultset can be loaded into the subpage “Solution Result” of the main form, by clicking the button “Load

Solution Resultset” or in a separate form, by clicking the button “Load Solution Resultset In

Separate Window”.



Review And Edit Resultset

There are various ways you can post-process and review the retrieved resultset.

Resultset Browsing

You can easily browse, edit and analyze loaded resultset in data grid form. Datasheet contains sophisticated

data searching, filtering and navigation capabilities.



You can scroll by using mouse, vertical and horizontal sliders and arrows.

You can also browse records by using navigation buttons.

Resultset Searching

You can easily search for any particular value in any column. In the upper left corner of the datagrid there is a small button represented by an orange double arrow. This button opens a pop-up dialog with various search, filter and customization options, one of which is “Find data”.



When you click on the “Find data” button, a search dialog box appears. You can search any value on

any column.

Resultset Filtration



You can easily filter by any column. In the upper left corner of the datagrid there is a small button represented by an orange double arrow.

This button opens a pop-up dialog with various search, filter and customization options, including “Filter data” and “Filter in table”, which are two different ways to filter a datagrid.

Filter Data

When you click the button “Filter data”, a dialog box appears on which you can build your filtering

conditions. This way you can define complex multicolumn filters.



The filtering is then applied by clicking “Apply” button.

Filter In Table

Another option for filtration is to use the button “Filter in table”, which activates a filtration

combobox, which is placed just below each column’s title. When you click on the filtration combobox cell,

a combobox list appears, listing all possible values for respective column. When you select a value, the

respective column is automatically filtered by the chosen value.



Resultset Sorting

You can sort ascending or descending on any column by clicking column title.

Resultset Edit And Review

You can edit the resultset in datagrid easily. You can delete a row by using the delete button, or edit a record by clicking the edit button.



Exporting Resultset

Besides using datagrid controls, another option for resultset post-processing is to export the resultset into

a spreadsheet and then perform reviewing and editing in a spreadsheet editor.

ReMaDDer offers several ways of exporting a resultset to spreadsheets.

Exporting Resultset To Spreadsheet

Resultset can be exported to a CSV file by clicking the button “Export To CSV File”.

Resultset can be exported to an XLSX file by clicking the button “Export To XLSX File”.

Resultset can be exported to an XLS file by clicking the button “Export To XLS File”.

Resultset can be loaded directly into your default spreadsheet editor, e.g. LibreOffice Calc or Microsoft

Excel, by clicking the button “Load In Ext. Spreadsheet Editor”.
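An exported CSV file can also be post-processed with a script instead of a spreadsheet editor. The following Python sketch is only an illustration of that idea: it assumes a comma-delimited file with a header row, and the column names and sample rows are hypothetical. An in-memory text stream is used here in place of a real exported file.

```python
import csv
import io


def rows_matching(csv_stream, column, value):
    """Return the rows (as dicts) whose `column` equals `value`."""
    reader = csv.DictReader(csv_stream)
    return [row for row in reader if row.get(column) == value]


# Hypothetical content standing in for an exported resultset file.
sample = io.StringIO(
    "left_name,right_name,city\n"
    "Acme Ltd,ACME Limited,Zagreb\n"
    "Globex,Globex Corp,Split\n"
)
for row in rows_matching(sample, "city", "Zagreb"):
    print(row["left_name"], "<->", row["right_name"])
```

To process a real export, pass an opened file instead, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the exported CSV file.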



ReMaDDer also has its own embedded spreadsheet editor which can be used for resultset post-processing.

Resultset can be loaded into the embedded spreadsheet editor by clicking the button “Load As

Spreadsheet”.

Exporting Datagrid To Spreadsheet

Another possibility for exporting resultset into a spreadsheet file is to use datagrid’s exporting feature.



You have to browse the destination folder for export and enter exported file name and extension, as well

as to enter page name (sheet name). If you forget to specify “page name”, you will get an error.



Customize Data Grids

ReMaDDer enables you to customize your user interface to a certain extent. You can shrink or stretch columns, rearrange their order and hide/unhide columns.

Resize columns by dragging the vertical splitters between columns.

Rearrange columns by pressing the left mouse button on a column’s title and dragging the column while the mouse button is still held down. After the column is moved to another position, release the mouse button.

You can define which columns are shown and which are hidden, by clicking on the button “Select visible

columns”.



When you close the application, your customization is saved (remadder_props.xml file) and when you

open the application again, your customizations will be loaded as well.

Customize Splitters

You will notice that various sections are divided by splitters, which you can easily drag to resize the corresponding sections.

The customization you make is saved on application close and reloaded on application start.

ReMaDDer Software Trial

ReMaDDer client application is distributed as shareware with a 15-day trial period.

The trial period is initialized the first time the application is started on your computer.



Commercial Release Code Purchase And Activation

After the trial period expires, you are required to purchase a commercial release code in order to continue using server features, such as raw data import and query execution.

You can, however, continue creating and editing projects and solution definitions, as well as loading and editing previously acquired resultsets.

When purchasing a release code, you are required to enter your MachineID in the purchase form. The MachineID is a tag generated by ReMaDDer software and is unique to your hardware. The purchased commercial release code is thus machine-specific and valid only for your hardware.

Once you have purchased the release code, activate it by clicking the button “Activate Commercial Release Code”.

You will be asked to enter the release code.

