
Data Quality and Master Data Management with Microsoft SQL Server 2008 R2 Dejan Sarka, Davide Mauri

Advanced Techniques & Tools for Storing, Maintaining, Cleansing, and Merging Master Data


Table of Contents

Foreword
Acknowledgements
About the Authors

Chapter 1: Master Data Management
    Types of Data
    What Is Master Data?
    Master Data Management
    MDM Challenges
    Data Models
    Relational Model
    Dimensional Model
    Other Data Formats and Storages
    Data Quality
    Data Quality Dimensions
    Completeness
    Accuracy
    Information
    Consistency
    Data Quality Activities
    Master Data Services and Other SQL Server Tools
    Master Data Services
    Entities
    Attributes
    Members
    Hierarchies
    Collections
    Versions
    Other SQL Server Tools for MDM
    Summary
    References

Chapter 2: Master Data Services Concepts and Architecture
    Master Data Services Setup
    Installation of MDS Components and Tools
    Setup of MDS Database
    Setup of MDS Web Application
    Master Data Manager Web Application
    Explorer
    Version Management
    Integration Management
    System Administration
    User and Groups Permissions
    Models
    Entities and Attributes
    Hierarchies
    Derived Hierarchy
    Explicit Hierarchy
    Collections
    Business Rules
    Importing, Exporting and Managing Data
    Import Data
    Managing Data
    Export Data
    Multiple Versions of Data
    MDS Database Schema
    Staging Tables
    Summary
    References

Chapter 3: Data Quality and SQL Server 2008 R2 Tools
    Measuring the Completeness
    Attribute Completeness
    XML Data Type Attribute Completeness
    Simple Associations among NULLs
    Tuple and Relation Completeness
    Multivariate Associations among NULLs
    Profiling the Accuracy
    Numeric, Date and Discrete Attributes Profiling
    Strings Profiling
    Other Simple Profiling
    Multivariate Accuracy Analysis
    Measuring Information
    Using Other SQL Server Tools for Data Profiling
    SSAS Cubes
    PowerPivot for Excel 2010
    SSIS Data Profiling Task
    Excel Data Mining Add-Ins
    Clean-Up
    Summary
    References

Chapter 4: Identity Mapping and De-Duplicating
    Identity Mapping
    Problems
    T-SQL and MDS String Similarity Functions
    Preparing the Data
    Testing the String Similarity Functions
    Optimizing Mapping with Partitioning
    Optimizing Mapping with nGrams Filtering
    Comparing nGrams Filtering with Partitioning
    Microsoft Fuzzy Components
    Fuzzy Algorithm Description
    Configuring SSIS Fuzzy Lookup
    Testing SSIS Fuzzy Lookup
    Fuzzy Lookup Add-In for Excel
    De-Duplicating
    Preparing for Fuzzy Grouping
    SSIS Fuzzy Grouping Transformation
    Testing SSIS Fuzzy Grouping
    Clean-Up
    Summary
    References

Index


Foreword

Dejan Sarka

“If all men were just, there would be no need of valor,” said Agesilaus, Spartan King, 444 BC-360

BC. Just from this quote we can realize that Agesilaus was not too keen on fighting. Actually,

Agesilaus never hurt his enemies without just cause, and he never took any unjust advantages.

Nevertheless, the ancient world was just as imperfect as the contemporary world is, and

Agesilaus had to fight his share of battles.

If everyone always inserted correct data into a system, there would be no need for proactive constraints or for reactive data cleansing. We could store our data in text files, and

maybe the only application we would need would be Notepad. Unfortunately, in real life, things

go wrong. People are prone to make errors. Sometimes our customers do not provide us with

accurate and timely data. Sometimes an application has a bug and makes errors in the data.

Sometimes end users unintentionally make a transposition of letters or numbers. Sometimes we

have more than one application in an enterprise, and in each application we have slightly

different definitions of the data. (We could continue listing data problems forever.)

A good and suitable data model, like the Relational Model, enforces data integrity through the

schema and through constraints. Unfortunately, many developers still do not understand the

importance of a good data model. Nevertheless, even with an ideal model, we cannot enforce

data quality. Data integrity means that the data is in accordance with our business rules; it does

not mean that our data is correct.

Not all data is equally important. In an enterprise, we can always find the key data, such as

customer data. This key data is the most important asset of a company. We call this kind of data

master data.


This book deals with master data. It explains how we can recognize our master data. It stresses

the importance of a good data model for data integrity. It shows how we can find areas of bad

or suspicious data. It shows how we can proactively enforce better data quality and make an

authoritative master data source through a specialized Master Data Management application. It

also shows how we can tackle the problems with duplicate master data and the problems with

identity mapping from different databases in order to create a unique representation of the

master data.

For all the tasks mentioned in this book, we use the tools that are available in the Microsoft SQL

Server 2008 R2 suite. In order to achieve our goal, good quality of our data, nearly any part of the suite turns out to be useful. This is not a beginner's book. We, the authors, assume that you, the readers, have a good knowledge of the SQL Server Database Engine, .NET, and other tools from the SQL Server suite.

Achieving good quality of your master data is not an easy task. We hope this book will help you

with this task and serve you as a guide for practical work and as a reference manual whenever

you have problems with master data.


Acknowledgements

Dejan Sarka

This book would never have been finished without the help and support of several people. I

need to thank them for their direct and indirect contributions, their advice, their

encouragement, and other kinds of help.

In the first place, I have to mention my coauthor and colleague from SolidQ, Davide Mauri. As an

older guy, I have followed his career over several years. I am amazed by the amount of

knowledge he gained in the past few years. He has become a top speaker and recognized author

of several books and articles. Nevertheless, he retained all the vigor of youth and is still full of

ideas. Davide, I am so proud of you. I always enjoy working with you, and I am looking forward

to our further cooperation.

With three other colleagues from SolidQ, Itzik Ben-Gan, Herbert Albert, and Gianluca Hotz, we form a gang of four inside the company, called the Quartet. It is not just an unofficial group; our official duty in the company is to certify and confirm places for company parties. Our endless

discussions during conferences, hikes, or time spent in pubs are an invaluable source of insight

and enlightenment. Besides general help through our socializing, all three of them have made a

concrete contribution to this book.

Herbert helped with technical review. Gianluca didn’t review the book officially; nevertheless,

he read it. His old man’s grumbling was always a sign to me that I wrote something inaccurate

or even wrong. Itzik was not directly involved in this book. However, this book would never have

been written without him. This is the ninth book I have contributed to so far. I would never even

have started writing if Itzik hadn’t pushed me and involved me in coauthoring his book 7 years

ago. Itzik has invited me to contribute to four of his books so far, and we are already starting the

fifth one together. To my friends from the Quartet, thank you for all of the precious time we are

spending together!


SolidQ is not just a regular company. First of all, we are all friends. Even more, we have the best

CEO in the world, Fernando Guerrero. Fernando, thank you for inviting me to become a part of

this wonderful group of people, and thank you for all of your patience with me, and for all of your advice, not only technical but also about life! And thanks to all other members of this company; because together we are more than 150 worldwide experts in SQL Server and related technologies joining our efforts, I simply cannot list every single colleague.

Besides concrete work with SQL Server, I am also involved in theoretical research in The Data

Quality Institute. Dr. Uroš Godnov helped me with my first steps in data quality, and he continues to educate me. Although he has some problems with his health, he is always

available to me. Uroš, forget what Agesilaus said! We need courage not just because of our

enemies; sometimes we need courage because of ourselves. Stay as brave as you are forever!

I cannot express enough how much I appreciate being a member of the Microsoft Most

Valuable Professional (MVP) program. Through this program, MVPs have direct contact with the

SQL Server team. The help of the team in general, and of the Master Data Services part of the team in particular, is extraordinary. No matter how busy they are with developing a new version of SQL

Server, they always take time to respond to our questions.

Finally, I have to thank my family and friends. Thank you for understanding the reduced time I

could afford to spend with you! However, to be really honest, I did not miss too many parties

and beers.


Davide Mauri

Dejan already wrote a lot about our friendship, our company, and everything that allows us to

continue to enjoy our work every day. But even at the cost of seeming repetitious, I’d also like

to thank all my SolidQ colleagues, who are not only colleagues but friends above all. I learn a lot

from you, and each discussion is mind-opening for me. I cannot thank you enough for this!

I would like to say a big thank you to Dejan, Itzik, Gianluca, and Fernando for being examples, truly mentors, not only through words but also through facts: You really give me inspiration and you

show me each day, with your excellent work, your determination, your honesty, and your

integrity, the path one has to follow to be someone who can make a difference, from a

professional and ethical point of view. I couldn’t have found better colleagues, friends, and

partners. Thanks!

I’d like also to thank specifically “my” Italian team. Guys, we’re really creating something new,

setting new standards, finding clever ways to solve business problems, making customers

happy, and giving them the maximum quality the market can offer, while being happy ourselves

at the same time. This is not easy, but we’re able to work together as a well-trained team,

enjoying what we do each day. Thanks!

Last but not least, of course, a big thanks also to my wife Olga and my newborn son Riccardo:

You really are the power behind me. Olga, thanks for your patience and for your support!

Riccardo, thanks for your smiles that allow me to see the world with different eyes! Thank you!


About the Authors

Dejan Sarka

Dejan Sarka focuses on the development of database and business intelligence applications. Besides projects, he spends about half of his time on training and mentoring. He

is the founder of the Slovenian SQL Server and .NET Users Group. Dejan is the main author or

coauthor of nine books about databases and SQL Server so far. Dejan also developed three

courses for SolidQ: Data Modeling Essentials, Data Mining, and Master Data Management. As an

MCT, Dejan speaks at many international events, including conferences such as PASS, TechEd,

and DevWeek. He is also indispensable at regional Microsoft events. In addition, he is a co-

organizer of a top-level technical conference named Bleeding Edge.

Davide Mauri

Davide Mauri is a SolidQ Mentor and a member of the Board of Directors of SolidQ Italia. A well-

known Microsoft SQL Server MVP, MCP, MCAD, MCDBA, and MCT, as well as an acclaimed

speaker at international SQL Server conferences, Davide enjoys working with T-SQL and

Relational Modeling and studying the theory behind these technologies. In addition, he is well-

grounded in Reporting Services, .NET, and object-oriented principles, and he has deep knowledge of Integration Services and Analysis Services. This gives him well-rounded expertise around the Microsoft Data Platform, along with the vision and experience needed to handle the development of complex business intelligence solutions. He is a course

author for SolidQ, including seminars about Upgrading to SQL Server 2008, co-author of the

book Smart Business Intelligence Solutions with SQL Server 2008, and author of the well-known

DTExec replacement tool DTLoggedExec (http://dtloggedexec.davidemauri.it).


Chapter 1: Master Data Management

Dejan Sarka

Master Data Management (MDM), the process of creating and maintaining master data, is one

of the most challenging tasks in IT for an enterprise. In this chapter we will define what exactly

the term master data means and what the most important challenges are, including data quality problems and ways to address them.

When talking about data quality, we cannot skip data integrity and data models. The Relational Model is still the best model currently in use by line-of-business applications for enforcing data integrity. Nevertheless, it is not very suitable for analyzing data. Additional models for analytical systems have evolved; the most widely used is the Dimensional Model. We are going to briefly introduce both the Relational and the Dimensional models, and show where the master data is in these models. In addition to data integrity, we are also going to deal with other data quality issues.

After all these introductions, we are going to put MDS in the picture. We are going to see that MDS is an integral part of MDM and that there are also other SQL Server tools that can help us with MDM. SQL Server Integration Services (SSIS) and SQL Server Analysis Services (SSAS) can play an important role in MDM activity. We are also going to define how MDS can help improve data in different kinds of applications, including Online Transaction Processing (OLTP), Online Analytical Processing (OLAP), and Customer Relationship Management (CRM) applications.

In this chapter, we are going to introduce the following:

Master data and Master Data Management;

Data models;

Data quality;

Master Data Services and other SQL Server tools.


Types of Data

In an average company, many different types of data appear. These types include:

Metadata – this is data about data. Metadata includes database schemas for

transactional and analytical applications, XML document schemas, report

definitions, additional database table and column descriptions stored by using SQL Server extended properties or in custom tables, application configuration data, and the like (a short example follows this list).

Transactional data – maintained by line-of-business OLTP applications. In this context, we are using the term transactional to mean business transactions, not database management system transactions. This data includes, for example, customer orders, invoices, insurance claims, data about manufacturing stages, deliveries, monetary transactions, and the like. In short, this is OLTP data about events.

Hierarchical data – typically appears in analytical applications. Relationships between data are represented in hierarchies. Some hierarchies represent the intrinsic structure of the data; they are natural for the data. An example is a product taxonomy: products belong to subcategories, which in turn belong to categories. We can have multiple levels of hierarchies. Such hierarchies are especially useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems. This is a very common way of analyzing data in OLAP applications.

Semi-structured data – typically in XML form. XML data can appear in standalone files, or as part of a database (a column in a table). Semi-structured data is useful where the metadata, i.e. the schema, changes frequently, or when you do not need a detailed relational schema. In addition, XML is widely used for data exchange.

Unstructured data – involves all kinds of documents with instructions, company portals, magazine articles, e-mails, and the like. This data can appear in a database, in a file, or even in printed material.
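To make two of these types concrete, the following short T-SQL sketch (the table, column, and property names are illustrative only) stores semi-structured product specifications in an xml column and attaches a piece of metadata, a column description, by using an extended property:

-- Semi-structured data: an xml column whose internal schema may vary from row to row.
CREATE TABLE dbo.ProductInfo
(
    ProductId   int          NOT NULL PRIMARY KEY,
    ProductName nvarchar(50) NOT NULL,
    Specs       xml          NULL
);
GO
-- Metadata (data about data): describe the Specs column with an extended property.
EXEC sys.sp_addextendedproperty
    @name = N'Description',
    @value = N'Product specifications exchanged with suppliers as XML.',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'ProductInfo',
    @level2type = N'COLUMN', @level2name = N'Specs';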


Finally, we have master data.

What Is Master Data?

Intuitively, everybody has an idea of what kind of data master data would be. Let us give you

some more precise definitions and examples.

In the Relational Model, a table represents a predicate, and a row represents a proposition

about a subject, object or event from the real world. We start building a relational database by

describing the business problems in sentences. Nouns in the sentences describe real world

entities that are of interest for the business, like customers, products, employees, and similar. Verbs in the sentences describe relationships between nouns, or more precisely, the roles the nouns play in the relationships. Nouns define our master data. Verbs lead to transactional data. From a

description of a relational database, you can easily find nouns. These critical nouns typically

define one of the following:

People, including customers, suppliers, employees, and sales representatives

Things, including products, equipment, assets, and stores

Concepts, including contracts, licenses, and bills of material

Places, including company locations, and customer geographic divisions

Any of these entity sets can be further divided into specialized subsets. For example, a company can segment its customers into premier and other customers based on previous sales, or based on customer type, such as persons and companies.

For analytical applications, data is often organized in a Dimensional Model. A popular name for the Dimensional Model is Star Schema (although, to be precise, a dimensionally modeled database can include multiple star and snowflake schemas). This is because it has a central Fact table and surrounding dimensional tables, or Dimensions. Fact tables hold the data we are measuring, namely Measures. Dimension attributes are used for pivoting fact data; they give

measures some meaning. Dimensions give context to measures. Fact tables are populated from


transactional data. Dimensions are populated from entities that represent nouns in a relational

database description. Therefore, in a Dimensional Model, dimensions are the master data.

As we already mentioned, OLAP analyses typically involve drilling down over hierarchies.

Dimensions can include multiple hierarchies.

Master data appears in probably every single application in an enterprise. ERP applications

include products, bills of material, customers, suppliers, contracts and similar. Customer

Relationship Management (CRM) applications deal, of course, with customers. Human

Resources Management (HRM) applications are about employees. Analytical applications many

times include all master data that appears in an enterprise. We can easily imagine that master

data is a very important part of data. It is crucial that this master data is known and correct.

Data quality issues are mostly about master data. It is easy to imagine that having the same master data in multiple sources can immediately lead to problems with keeping the same definitions, the same identifications, and to duplication. Master data typically changes at a much slower rate than transactional data. Customers, for example, do not change addresses frequently; however, they interact with your company through orders, services, or even complaints, probably on a daily basis.

Nevertheless, although master data is less volatile than transactional data, its life cycle is still a classical CRUD cycle: Create, Read, Update, and Destroy. The question that arises is whether any data about people, things, places, or concepts is really master data for every company.

If a company sells only five different products, then products data is not master data for this

company. Although technically it is master data, the company does not need any specific

management of this data. Every single attribute of these five products has a correct value in the

system, and it is unlikely that an inaccurate value will appear. Therefore, for this company,

products data is not considered master data. Cardinality influences whether we consider specific data to be master data or not.

If a company does not collect many attributes for product categories, that is, if the categories entity is quite simple in terms of the number of attributes (for example, it could have only two attributes, category id and category name), then again that company probably does not consider this master data. The complexity of the data is another criterion that helps us decide which data needs special treatment and which data is master data.

Another factor to consider when defining master data is volatility. We already mentioned that

master data tends to change less frequently than transactional data. Now imagine that some

data does not change at all. For example, let us use geographic locations. Data about geographic

locations changes quite infrequently; in some systems, we can even consider this data as static.

As soon as we have cleansed this data, it does not need any further treatment, and therefore

we do not consider this to be master data either.

For some data, we need to maintain history. In transactional applications, we need to know how

the data came to its current state. Government or other authorities prescribe formal auditing for some business areas. In analytical applications, we frequently analyze over time, for

example, we compare sales of this year with sales of the previous year in a geographic region. In

order to make proper analyses, we need to take into account possible movements of customers

from region to region; therefore, we need to maintain history again. Data that needs versioning, auditing, or any other kind of history maintenance is typically master data.

Finally, we give more attention to data that we reuse repeatedly. Reuse increases the value of the data for us. The value of the data can increase because of other factors as well. For example, in the pharmaceutical industry, an error in a bill of materials can lead to huge damage. Therefore, the more valuable

the data is for us, the more likely we are going to define it as master data.

Master Data Management

We discussed that data that does not need special treatment is not master data. Clearly, we can

conclude that master data needs some special treatment, which is called Master Data

Management. In a more formal definition, Master Data Management (MDM) is a set of coordinated processes, policies, tools, and technologies used to create and maintain accurate

master data.


Even the formal definition of MDM is still very broad. We do not have a single tool for MDM. We can consider anything we use to improve the quality of master data as an MDM tool. Any activity that improves master data quality, whether formal, scheduled, repeated, or ad hoc, is an MDM process. Any technology that enforces data integrity, like a Relational Database Management System (RDBMS), is part of MDM technology. However, depending on the approach we use to maintain master data, we might consider using a specialized tool that clearly defines the process for managing master data.

Some of the most important goals of MDM include:

Unifying, or at least harmonizing, master data between different transactional (operational) systems

Maintaining multiple versions of master data for different needs in different

operational systems

Integrating master data for analytical and CRM systems

Maintaining history for analytical systems

Capturing information about hierarchies in master data, especially useful for

analytical applications

Supporting compliance with government regulations (e.g., Sarbanes-Oxley) through auditing and versioning

Having a clear CRUD process through prescribed workflow

Maximizing Return on Investment (ROI) through reuse of master data

Please note the last bullet. Master Data Management can be quite costly, and very intensive in

terms of resources used, including man hours. For a small company with a single operational

system, probably no specific MDM tool is needed. Such a company can maintain master data in

the operational system. The more we reuse master data (in multiple operational, CRM and

analytical applications), the bigger ROI we get.


An RDBMS and an application can and should enforce data integrity. Data integrity means that the data conforms to business rules. Business rules can be quite simple, such as: order numbers must be unique, there should be no order without a known customer, and the quantity ordered must be greater than zero. Business rules can also be more complicated. For example, a company can define that in a single Web order, no customer should order a product in a quantity that is more than half of the quantity of that product in stock. If the database and the application do not enforce data integrity, we can expect dirty data.
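As a minimal sketch of how such simple business rules map to declarative constraints (the table names here are illustrative only and are not reused elsewhere in the book):

CREATE TABLE dbo.Customers
(
    CustomerId   int          NOT NULL PRIMARY KEY,
    CustomerName nvarchar(50) NOT NULL
);
CREATE TABLE dbo.Orders
(
    OrderId    int  NOT NULL PRIMARY KEY,                          -- order numbers must be unique
    CustomerId int  NOT NULL REFERENCES dbo.Customers(CustomerId), -- no order without a known customer
    OrderDate  date NOT NULL
);
CREATE TABLE dbo.OrderDetails
(
    OrderId   int NOT NULL REFERENCES dbo.Orders(OrderId),
    ItemId    int NOT NULL,
    ProductId int NOT NULL,
    Quantity  int NOT NULL CHECK (Quantity > 0),                   -- quantity must be greater than zero
    PRIMARY KEY (OrderId, ItemId)
);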

Nevertheless, even if the database and the application do enforce data integrity, we still should

not take data accuracy for granted. How can we prevent typo errors? For example, an operator

could write “42 Hudson Avenue” instead of “24 Hudson Avenue”; both addresses are valid from a data integrity perspective. Another issue arises if we have multiple systems. Do all operators enter data in a consistent way? Some operators could write the correct address, but in a slightly different form, like “24 Hudson Ave.”.

We could resolve data quality issues with occasional data cleansing. With cleansing, data quality rises; however, over time, the quality falls again. This is a reactive approach. A proactive approach, which prevents entering low-quality data in the first place, is even better. What we need is explicit data governance. We must know who is responsible for the data, and we must have a clearly defined process for maintaining the data. Data governance sets the policies, the rules for

master data. Data governance rules can prescribe data requirements, such as which information is required, how we derive values, data retention periods, and the like. Data governance can also

prescribe security policies, like which master data needs encryption, and which part has to be

audited. It can prescribe versioning and workflows. It defines how to address data

inconsistencies between different source systems; for example, it can define the authoritative

source. Data governance policies should also define how to bring new systems online without

breaking existing master data processes and quality. Finally, data governance can define the need for explicit roles in an enterprise: roles responsible for maintaining master data and implementing data governance.


In MDM terminology, we have to define the Data stewards, the people responsible for their

part of master data. Data stewards are the governors. They should work independently of any

specific source or destination system for master data, in an objective way. Data stewards must

have deep knowledge about the data they govern. Commonly, one data steward covers one business area of master data; we have one steward for customers, one for products, and so on. We should define data stewardship roles and designate data stewards early in the process of implementing an MDM solution.

From what we have seen so far, we can conclude that there are different approaches to master

data management. Here is a list of possible approaches:

No central master data management – we have systems that do not

communicate at all. When we need any kind of cross-system interaction, like

doing analysis over data from multiple systems, we do ad-hoc merging and

cleansing. This approach is very cheap at the beginning; however, it turns out to be the most expensive over time. From this book’s perspective, we really should

not treat this approach as a real MDM approach.

Central metadata storage – with this approach, we have at least unified,

centrally maintained definitions for master data. Different systems should follow

and implement these central definitions. Ad-hoc merging and cleansing becomes somewhat simpler. In this scenario, for the central metadata storage, we typically

do not use a specialized solution. This central storage of metadata is probably in

an unstructured form – in documents, worksheets, or even on paper only.

Central metadata storage with identity mapping – besides unified, centrally maintained definitions for master data, we also store key mapping tables in our MDM solution (a sketch of such a mapping table follows this list). Data integration applications can be developed much faster and more easily. Although this solution seems quite appealing, it has many problems with

maintaining master data over time. We have only keys from different systems in

our MDM database; we do not have any other attributes. All attributes in source

systems change over time, and we have no versioning or auditing in place to


follow the changes. This approach is viable for a limited time only. It is useful, for

example, during upgrading, testing and the initial usage of a new ERP system to

provide mapping back to the old ERP system.

Central metadata storage and central data that is continuously merged – we have metadata as well as master data stored in a dedicated MDM system. However, we do not insert or update master data here; we merge (and cleanse) master data from source systems continuously, on a daily basis. There are multiple issues with this approach, and continuous merging can become expensive. The only viable possibility for this approach is that we can find out what has changed in source systems since the last merge, enabling us to merge only the delta, only the

new and updated data. This approach is frequently used for analytical systems.

We prepare a Data Warehouse (DW) as a central storage for analytical data

(which includes transactional and master data). We populate the DW overnight, and during population, we merge data and resolve cleansing issues. Although we typically do not create the DW with MDM as the main goal, we can treat the DW as an authoritative source of master data.

Central MDM, single copy – with this approach, we have a specialized MDM

application, where we maintain master data, together with its metadata, in a

central location. All existing applications are consumers of this master data. This

approach seems preferable at first glance. However, it has its own drawbacks. We have to upgrade all existing applications to consume master data from the central storage instead of maintaining their own copies. This can be quite costly, and maybe even impossible with some legacy systems. In addition, our central master metadata should be the union of the metadata from all source systems. Finally, the process of creating and updating master data could simply be too slow. It could happen, for example, that a new customer would have to wait for a couple of days before submitting the first order, because the process of inserting customer data with all possible attributes involves contacting all source systems and simply takes too

long.


Central MDM, multiple copies – in this approach, we have a central storage of

master data and its metadata. However, the metadata here includes only an

intersection of common metadata from source systems. Each source system

maintains its own copy of master data, with additional attributes that pertain to

this system only. After we insert master data in the central MDM system, we

replicate it (preferably automatically) to source systems, where we update the

source-specific attributes. This approach gives us a good compromise between

cost, data quality, and effectiveness of the CRUD process. Still, there is no free

lunch. As different systems can also update the common data, we can have update conflicts. Therefore, this approach involves continuous merging as well. However, as at least part of the data is updated centrally, this approach means less work with continuous merging than the central metadata storage and central data that is continuously merged approach does.
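As referenced in the identity mapping approach above, the core of such a solution is a set of key mapping tables. A minimal sketch of such a table for customers (the table and column names are illustrative only) could look like this:

-- One row per source system key that maps to a master customer.
CREATE TABLE dbo.CustomerKeyMap
(
    MasterCustomerId int          NOT NULL,  -- key in the central MDM storage
    SourceSystem     nvarchar(30) NOT NULL,  -- e.g. N'ERP', N'CRM'
    SourceCustomerId nvarchar(50) NOT NULL,  -- key as known in that source system
    PRIMARY KEY (MasterCustomerId, SourceSystem),
    UNIQUE (SourceSystem, SourceCustomerId)  -- a source key maps to exactly one master key
);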

For the last two approaches, we need a special MDM application. A specialized MDM solution could also be useful for the central metadata storage with identity mapping approach and the central metadata storage and central data that is continuously merged approach. SQL Server 2008 R2

Master Data Services (MDS) is a specialized MDM application. We could also write our own

application. Other SQL Server tools, like SSIS and SSAS, are helpful in the MDM process as well.

However, for the last two approaches to MDM, MDS is the most efficient solution.

MDM Challenges

For a successful MDM project, we have to tackle all challenges we meet. These challenges

include:

Different definitions of master metadata in source systems

We can have different coding schemes, data types, collations, and the like. We have to unify the metadata definitions.

Data quality issues


This is something we always have to expect. In short, if we do not have data

quality issues, then we probably do not need a specialized MDM solution

anyway. We have to improve the data quality; otherwise, the MDM project fails

to accomplish its most important goal.

Authority

Who is responsible for master data? Different departments want to be

authoritative for their part of master data, and the authority for master data can

overlap in an enterprise. We have to define policies for master data, with an explicit data stewardship process prescribed. We also define data ownership as part of resolving authority issues.

Data conflicts

When we prepare the central master data database, we have to merge data from

our sources. We have to resolve data conflicts during the project, and, depending

on the MDM approach we take, replicate the resolved data back to the source

systems.

Domain knowledge

We should include domain experts in an MDM project.

Documentation

We have to take care that we properly document our master data and metadata.

No matter which approach we take, MDM projects are always challenging. However, tools like

MDS can efficiently help us resolve possible issues.


Data Models

It is crucial that we have a basic understanding of the data models used in an enterprise before we start an MDM project. Details of data modeling are beyond the scope of this book; only the minimum needed is covered here. We are going to introduce the Relational Model, the Dimensional

Model, and briefly other models and storages.

Relational Model

The relational model was conceived in the 1960s by Edgar F. Codd, who worked for IBM. It is a

simple, yet rigorously defined conceptualization of how users perceive and work with data. The

most important definition is the Information Principle, which states that all information in a

relational database is expressed in one (and only one) way, as explicit values in columns within

rows of a table. In the relational model, a table is called a relation, and a row is called a tuple,

which consists of attributes.

Each relation represents some real-world entity, such as a person, place, thing, or event. An

entity is a thing that can be distinctly identified and is of business interest. Relationships are

associations between entities.

A row in a relation is a proposition, such as: the employee with identification equal to 17, with full name Davide Mauri, lives in Milan.

The relation header, or the schema of the relation, is the predicate for its propositions.

A predicate is a generalized form of a proposition, like: the employee with identification EmployeeId(int) and full name EmployeeName(string) lives in City(CitiesCollection).

Note the name/domain pairs of placeholders for concrete values. The domain, or the data type, is the first point where an RDBMS can start enforcing data integrity. In the previous example, we cannot insert an EmployeeId that is not an integral number. We cannot insert a city that is not in our collection of allowed cities. Please also note that without a good naming convention, it is hard to reconstruct predicates and propositions from a database schema.
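A minimal T-SQL sketch of this predicate (the table definitions are illustrative; the CitiesCollection domain is represented here by a lookup table) shows how the data types and a foreign key already enforce part of the integrity:

CREATE TABLE dbo.Cities                                 -- the collection of allowed cities
(
    CityId int          NOT NULL PRIMARY KEY,
    City   nvarchar(50) NOT NULL UNIQUE
);
CREATE TABLE dbo.Employees
(
    EmployeeId   int          NOT NULL PRIMARY KEY,     -- must be an integral number
    EmployeeName nvarchar(50) NOT NULL,
    CityId       int          NOT NULL
                 REFERENCES dbo.Cities (CityId)          -- must be one of the allowed cities
);
-- The following INSERT would fail: 'ABC' cannot be converted to the int domain of EmployeeId.
-- INSERT INTO dbo.Employees (EmployeeId, EmployeeName, CityId) VALUES ('ABC', N'Davide Mauri', 1);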

With different source systems, we typically do not have influence on the source schema and naming conventions; however, we should be aware that a worse schema and worse naming mean more problems for our MDM project, and worse data quality. One of the most important tools for enforcing data integrity in a relational model is normalization. As mentioned, we are going to introduce only the basics here; for more details on the relational model, please refer to data modeling books, for example An Introduction to Database Systems by C. J. Date.

Tables are normalized when they represent propositions about entities of one type - in other

words, when they represent a single set. This means that entities do not overlap in tables, and

that tables are orthogonal, or normal, in mathematical terms. When a table meets a certain prescribed set of conditions, it is in a particular normal form. A database is normalized when

all tables are normalized.

Normalization is a redesign process to unbundle the entities. The process involves

decomposition. Normalization is achieved by applying a sequence of rules to create what are called normal forms. The goal is to eliminate redundancy and incompleteness.

Many normal forms are defined. The most important ones are first, second, and third. If a

database is in third normal form, it is usually already fully normalized. We are going to introduce

the first three normal forms only. If a database is not normalized, you can experience data

manipulation anomalies. Let us start with the following example of a model for customers’

orders, shown in figure 1.


Orders (PK: OrderId; OrderDate, CustomerId, CustomerName, Address, City, Country, OrderDetails{ProductId, ProductName, Quantity})

Figure 1: ORDERS MODEL BEFORE 1ST NF

Note that the last column (OrderDetails) is actually a collection of items ordered. This model

leads to many CRUD anomalies. How do you insert a potential customer, a customer without an

order? How do you insert a product that is not ordered yet? If you delete the last order for a

customer, you lose all information about that customer. If you update the name of a product, you have to take care to update it in all rows where this product appears on an order. Even reading this data is hard; how do you find, for example, the total quantity ordered for a single product?

The first normal form says that a table is in first normal form if all columns are atomic, or

indivisible. No multivalued columns are allowed. Decomposition has to start with the

OrderDetails column. We need a single row per item in an order, and every atomic piece of data

of a single item (ProductId, ProductName, Quantity) must have its own column. However, after

the decomposition, we have multiple rows for a single order. OrderId by itself cannot be the key

anymore. The new key is composed of the OrderId and ItemId columns. Figure 2 shows our

Orders table in first normal form.


Orders (PK: OrderId, ItemId; OrderDate, CustomerId, CustomerName, Address, City, Country, ProductId, ProductName, Quantity)

Figure 2: ORDERS MODEL IN 1ST NF

Queries are now simplified; it is easy to find totals for products. However, update anomalies are still possible. You still cannot insert a customer without an order, or a product that is not ordered yet, and you have to change a single piece of information, like the order date, in multiple places, because it now repeats for the same order in multiple rows.

To achieve the second normal form, a table must be in first normal form, and every non-key

column must be functionally dependent on the entire key. This means that no non-key column

can depend on a part of the key only. In the Orders model in first normal form, we need only OrderId to get the customer data and the order date; we don’t need ItemId, which is also part of the key. To achieve the second normal form, we need to decompose the table into two tables,

shown in figure 3.


Orders (PK: OrderId; OrderDate, CustomerId, CustomerName, Address, City, Country)
OrderDetails (PK: OrderId, ItemId; OrderId FK to Orders; ProductId, ProductName, Quantity)

Figure 3: ORDERS MODEL IN 2ND NF

With the model in the second normal form, we resolved the problem of updating the order date. For a single order, we now maintain it in a single place. However, some update anomalies are still possible. There is still some redundancy in the model. The customer address, city, and country repeat across orders; the product name repeats across order details.

To achieve the third normal form, a table must be in the second normal form, and every non-

key column must be non-transitively dependent on the key. In other words, non-key columns

must be mutually independent. In our previous model in second normal form, we can find

CustomerId from OrderId, and from CustomerId we can transitively find the customer name and address. From the address, we can find the city, and from the city, we can find the country. In addition, in the OrderDetails table, the ProductId and ProductName columns are not mutually independent. To

achieve the third normal form, we must create new tables for dependencies between non-key

columns, as shown in figure 4.


Customers (PK: CustomerId; CustomerName, Address, CityId FK to Cities)
Orders (PK: OrderId; CustomerId FK to Customers, OrderDate)
OrderDetails (PK: OrderId, ItemId; OrderId FK to Orders, ProductId FK to Products; Quantity)
Products (PK: ProductId; ProductName)
Cities (PK: CityId; City, CountryId FK to Countries)
Countries (PK: CountryId; Country)

Figure 4: ORDERS MODEL IN 3RD NF

After a database is in third normal form, there are usually no update anomalies. However, queries for retrieving data are more complicated, as they involve multiple joins. A properly implemented Relational Model helps maintain data integrity and data quality in general.
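For example, assuming the tables exist exactly as drawn in figure 4 (with the dbo schema assumed here), even a simple question such as the total quantity ordered per country and product already requires joining all six tables:

SELECT co.Country, p.ProductName, SUM(od.Quantity) AS TotalQuantity
FROM dbo.OrderDetails AS od
    INNER JOIN dbo.Orders    AS o  ON od.OrderId   = o.OrderId
    INNER JOIN dbo.Customers AS c  ON o.CustomerId = c.CustomerId
    INNER JOIN dbo.Cities    AS ci ON c.CityId     = ci.CityId
    INNER JOIN dbo.Countries AS co ON ci.CountryId = co.CountryId
    INNER JOIN dbo.Products  AS p  ON od.ProductId = p.ProductId
GROUP BY co.Country, p.ProductName;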

Besides data types and normalization, we can use many additional tools provided by a modern RDBMS like SQL Server. We have declarative constraints, for example the Check constraint, which further narrows down the possible values of an attribute. We can define whether all values have to be known; we can forbid NULLs (NULL is a standard placeholder for an unknown or inapplicable value). We can implement integrity rules programmatically, with triggers and stored procedures. Of course, we can implement them programmatically in the middle tier or client tier of an application as well. The important thing for an MDM project is that we understand our source data. Realizing whether or not the design of the operational databases, which are the sources where the data is introduced into an enterprise for the first time, is proper helps us a lot when evaluating the cost of possible approaches to an MDM solution.
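As a sketch of a programmatically enforced rule, the more complicated business rule mentioned earlier (a single order line should not exceed half of the quantity of the product in stock) could be implemented with a trigger. The ProductStock table is illustrative, and OrderDetails is assumed to exist as sketched earlier:

CREATE TABLE dbo.ProductStock
(
    ProductId       int NOT NULL PRIMARY KEY,
    QuantityInStock int NOT NULL
);
GO
CREATE TRIGGER dbo.trg_OrderDetails_CheckStock
ON dbo.OrderDetails
AFTER INSERT, UPDATE
AS
BEGIN
    -- Reject the whole statement if any inserted or updated line exceeds half of the stock.
    IF EXISTS (SELECT 1
               FROM inserted AS i
                   INNER JOIN dbo.ProductStock AS s
                       ON i.ProductId = s.ProductId
               WHERE i.Quantity > s.QuantityInStock / 2)
    BEGIN
        RAISERROR (N'Ordered quantity exceeds half of the quantity in stock.', 16, 1);
        ROLLBACK TRANSACTION;
    END;
END;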


Before explaining the Dimensional Model, we have to understand another relational design

technique – specialization and generalization. We can have NULLs in our data because the value

is unknown, or because an attribute is not applicable for a subset of rows. In our example, we could have both persons and companies in the Customers table. For persons, the birth date makes sense; for companies, the number of employees might be valuable information. In order to prevent NULLs, we can introduce subtype tables, or subtypes for short.

Two entities are of distinct, or primitive, types if they have no attributes in common. Some

relations can have both common and distinct attributes. If they have a common identifier, we

can talk about a special supertype-subtype relationship. Supertypes and subtypes are helpful for

representing different levels of generalization or specialization. In a business problem

description, the verb is (or explicitly is a kind of) leads to a supertype-subtype relationship.

Specialization leads to additional decomposition of tables, as we can see in the example in

figure 5.


Customers (PK CustomerId; CustomerName, Address, FK CityId)

Orders (PK OrderId; FK CustomerId, OrderDate)

Cities (PK CityId; City, FK CountryId)

Countries (PK CountryId; Country)

Products (PK ProductId; ProductName)

OrderDetails (PK, FK OrderId; PK ItemId; FK ProductId, Quantity)

Persons (PK, FK CustomerId; BirthDate)

Companies (PK, FK CustomerId; NumberOfEmployees)

Figure 5: ORDERS MODEL IN 3RD NF WITH SPECIALIZATION

As you can see, even in this small example, our relational design became quite complicated. Queries for analyses involve multiple joins; they are hard to write, and many times do not perform well enough for analyzing on-line, in real time. Analyzing on-line means that we can change the structure of the aggregations we are looking for in real time. For example, instead of aggregating companies' sales over countries and years, we could decide to aggregate over products and the age of individual people. Because of these problems with analyses, the Dimensional Model evolved. The Dimensional Model is the base building block of OLAP systems.


Dimensional Model

The Dimensional Model of a database has a more deterministic schema than the Relational Model. We use it for Data Warehouses (DW). In a DW, we store merged and cleansed data from different source systems, with historical data included, in one or more star schemas. A single star schema covers one business area, like sales, inventory, production, or human resources. As we already know, we have one central (Fact) table and multiple surrounding (Dimension) tables in a star schema. Multiple star schemas are connected through shared dimensions. An explicit Time (or Date, depending on the level of granularity we need) dimension is always present, as we always include historical data in a DW. The star schema was introduced by Ralph Kimball in his famous book The Data Warehouse Toolkit.

The star schema is deliberately denormalized. Lookup tables, like Cities and Countries in our example, are flattened back into the original table, and attributes from those lookup tables form

natural hierarchies. In addition, we can also flatten specialization tables. Finally, we can add

multiple derived attributes and custom-defined hierarchies. An example of a star schema

created from our normalized and specialized model is illustrated in figure 6.


FactOrders (PK, FK CustomerId; PK, FK ProductId; PK, FK DateId; Quantity)

DimCustomers (PK CustomerId; CustomerName, Address, City, Country, PersonCompanyFlag, PersonAge, CompanySize)

DimDates (PK DateId; Date, Month, Quarter, Year)

DimProducts (PK ProductId; ProductName)

Figure 6: ORDERS IN A STAR SCHEMA MODEL

With more dimension tables, the schema would look even more like a star. We can see derived columns (PersonAge and CompanySize) in the DimCustomers table. This table is also flattened from the original Customers table, the Cities and Countries lookup tables, and the Persons and Companies subtype tables. In DimCustomers, we can see a natural hierarchy, Country ⇨ City ⇨ Customer; in the DimDates table, we can see a natural hierarchy, Year ⇨ Quarter ⇨ Month ⇨ Date. For analytical purposes, we could also define ad-hoc hierarchies, for example Country ⇨ CompanySize.
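A typical analytical query against this star schema joins the fact table to the dimensions and aggregates along the natural hierarchies. The following sketch, which assumes the table and column names from figure 6 live in the dbo schema, aggregates ordered quantities by country and year:

SELECT  c.Country,
        d.[Year],
        SUM(f.Quantity) AS TotalQuantity
FROM    dbo.FactOrders AS f
        INNER JOIN dbo.DimCustomers AS c ON c.CustomerId = f.CustomerId
        INNER JOIN dbo.DimDates     AS d ON d.DateId     = f.DateId
GROUP BY c.Country, d.[Year]
ORDER BY c.Country, d.[Year];

Aggregating over products and customer age instead would only require joining DimProducts and swapping the GROUP BY columns, which is exactly the kind of on-line restructuring described above.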

While relational databases are typically sources of master data, dimensional databases are usually destinations of master data. As with the Relational Model, there is much more to say about the Dimensional Model. However, this brief introduction should be enough for understanding the two most widely used models, and how they impact our master data management solution.


Other Data Formats and Storages

Besides relational and dimensional databases, data can appear in an enterprise in many other formats and storages.

Some companies implement an Operational Data Store (ODS). By the definition of Bill Inmon, who introduced the concept, an ODS is a:

Subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in

support of an organization's need for up-to-the-second, operational, integrated, collective

information.

Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)

The data in an ODS has limited history, and is updated more frequently than the data in a DW. However, the data is already merged from multiple sources. An ODS is often part of a CRM application; typically, it holds data about customers. An MDM solution can actually replace, or integrate with, an existing ODS.

Some data is semi-structured. Either the structure is not prescribed in so many details, or the

structure itself is volatile. Nowadays semi-structured data usually appears in XML format. We

can have XML data in files in a file system or in a database. Modern relational systems support

XML data types.

XML instances can have a schema. For defining XML schemas, we use XML Schema Definition (XSD) documents. An XSD is an XML document with defined namespaces, elements and attributes that expresses a set of rules to which an XML instance must conform. If an XML instance conforms to its XSD, we say it is schema validated. Here is an example of an XML schema:

<xsd:schema targetNamespace="ResumeSchema" xmlns:schema="ResumeSchema"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sqltypes="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
            elementFormDefault="qualified">
  <xsd:import namespace="http://schemas.microsoft.com/sqlserver/2004/sqltypes"
              schemaLocation="http://schemas.microsoft.com/sqlserver/2004/sqltypes/sqltypes.xsd" />
  <xsd:element name="Resume">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="skills" minOccurs="0" maxOccurs="1">
          <xsd:simpleType>
            <xsd:restriction base="sqltypes:nvarchar" sqltypes:localeId="1033"
                             sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth"
                             sqltypes:sqlSortId="52">
              <xsd:maxLength value="1000" />
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
        <xsd:element name="previousemployment" minOccurs="0">
          <xsd:simpleType>
            <xsd:restriction base="sqltypes:nvarchar" sqltypes:localeId="1033"
                             sqltypes:sqlCompareOptions="IgnoreCase IgnoreKanaType IgnoreWidth"
                             sqltypes:sqlSortId="52">
              <xsd:maxLength value="100" />
            </xsd:restriction>
          </xsd:simpleType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

Reading XML is not much fun. Nevertheless, we can extract some useful information. From the element names in the code, we can conclude that this is a schema for resumes, probably for job candidates. It allows two elements, judging by their names, for describing skills and previous employment. Skills can appear only once, while previous employment can appear multiple times.

In order to maintain data quality for XML data, we should enforce validation of XML instances

against XML schemas. SQL Server supports XML schema validation for columns of XML data

type.
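As a minimal sketch of how this works, we could register a simplified version of the resume schema (the sqltypes restrictions are replaced with plain xsd:string for brevity, and the table and collection names are made up for this example) and bind it to an XML column; SQL Server then rejects any instance that does not validate:

CREATE XML SCHEMA COLLECTION dbo.ResumeSchemaCollection AS
N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="Resume">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="skills" type="xsd:string" minOccurs="0" maxOccurs="1" />
          <xsd:element name="previousemployment" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>';

CREATE TABLE dbo.Candidates
(
    CandidateId INT NOT NULL PRIMARY KEY,
    Resume      XML (dbo.ResumeSchemaCollection) NULL   -- typed XML: every instance is schema validated
);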

Every company also has to deal with unstructured data. This data is in documents,

spreadsheets, other computer formats, or even on paper only. If this data is important, if it is


part of master data, we should include it in our MDM solution. Of course, the problem is how to

do it.

One possibility is simply to store all kinds of files in a database. SQL Server supports a couple of large object (LOB) data types. In addition, SQL Server supports FileStream storage for binary large objects of data type varbinary(max). FileStream integrates the SQL Server Database Engine with

an NTFS file system by storing varbinary(max) object data as files on the file system. This way,

we get unlimited storage for unstructured data inside our relational database.
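A minimal sketch of such a table follows; it assumes FILESTREAM has been enabled on the instance and that the database already has a FILESTREAM filegroup (table and column names are illustrative):

CREATE TABLE dbo.Documents
(
    DocumentId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),  -- FILESTREAM requires a unique ROWGUIDCOL
    DocumentName NVARCHAR(200)    NOT NULL,        -- a regular attribute describing the unstructured content
    Document     VARBINARY(MAX)   FILESTREAM NULL  -- the content itself, stored as a file on the NTFS file system
);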

Usually we can find interesting properties of the unstructured data. If we store unstructured

data in our database, it makes a lot of sense to store these interesting properties in additional,

regular non-LOB attributes. We can classify documents before storing them in a database. We

can do the classification manually, or with the help of a tool. Text mining tools, for example, can extract terms from text. We can use the extracted terms to classify texts.


Data Quality

Data quality is indivisibly interleaved with master data management. The most important goal

of an MDM solution is to raise the quality of master data. We should tackle data quality issues in

any MDM project. Nevertheless, data quality activities, such as data profiling, finding root causes for poor data, and improving quality, can be independent of an MDM project as well. An enterprise can define data quality policies and processes through existing applications only. However, a specialized MDM solution can ease the implementation of those policies a great deal.

Before we describe data quality activities, we have to decide for which aspects we are going to

measure and improve the quality of our data. Data quality dimensions capture a specific aspect

included in the general term of data quality. Measuring data quality, also known as data profiling,

should be an integral part of the implementation of an MDM solution. We should always get a

thorough comprehension of source data before we start merging it. We should also include

measuring improvements of data quality over time to understand and explain the impact of the

MDM solution. Let us start with data quality dimensions, to show what and how we can

measure data quality.

Data Quality Dimensions

Data quality dimensions can refer to data values or to their schema. We introduced the most

popular schemas earlier in this chapter. We are going to start with pure data quality dimensions,

and return to schema quality dimensions later in this section.

There is no exact definition of which data quality dimensions we should inspect. Different tools and different books list different sets of dimensions. Nevertheless, some dimensions are analyzed more frequently than others. We are going to focus on the most important, i.e. the most frequently used ones.


We can measure some data quality dimensions with tools, like Transact-SQL queries. The

measurable dimensions are also called hard dimensions. For some dimensions, we depend on

the perception of the users of data. These are soft dimensions. We cannot measure soft

dimensions directly; we can measure them indirectly through interviews with users of data, or

through any other kind of communication with users. Note that this communication can

unfortunately include unpleasant events we would like to prevent, such as customer complaints. Let us

start with hard dimensions.

Completeness

Completeness is the dimension that is easiest to measure. We can start measuring it at the population level. In the closed world assumption, we can state that no other values except the

values actually present in a relational table represent facts in the real world. If the relation does

not have unknown values (NULLs), the relation is complete from the population perspective. In

the open world assumption, we cannot state the population completeness, even if our relation

does not contain null values. In order to evaluate the completeness of a relation in the open

world assumption, we need to get a reference relation that contains the entire population. With

a reference relation, we can define the completeness as the proportion of the number of tuples

presented in the relation and the number of tuples in the reference relation. Because of privacy

and law constraints, it is commonly not possible to acquire the reference relation. However,

usually we can get at least the number of tuples in a reference relation. For example, we can

easily get the number of citizens of a country. From a technical point of view, it is very easy to

measure the completeness of our relation once we have the number of tuples in a reference

relation.

In a closed world assumption, in a relational database, the presence of the NULLs is what

defines the completeness. We could measure attribute completeness, i.e. the number of NULLs

in a specific attribute, tuple completeness, i.e. the number of unknown values of the attributes

in a tuple, and relation completeness, i.e. the number of tuples with unknown attribute values

in the relation. Finally, we could also measure value completeness, which makes sense for


complex, semi-structured columns, namely for XML data type columns. In an XML instance, a

complete element or attribute can be missing. In addition, XML standards also define a special

xsi:nil attribute as a placeholder for missing values; this is similar to relational NULLs.
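A simple Transact-SQL sketch of measuring completeness in the closed world assumption follows; the Customers table and its columns are illustrative. COUNT(column) skips NULLs, which is exactly what we need here:

SELECT  COUNT(*)                          AS NumRows,
        COUNT(*) - COUNT(Address)         AS NumUnknownAddresses,   -- attribute completeness
        SUM(CASE WHEN Address IS NULL OR CityId IS NULL
                 THEN 1 ELSE 0 END)       AS NumIncompleteRows,     -- relation completeness
        1.0 * COUNT(Address) / COUNT(*)   AS AddressCompletenessRatio
FROM    dbo.Customers;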

Accuracy

Accuracy is a complicated dimension. Firstly, we have to determine what is inaccurate.

Accuracy is stricter than just conforming to business rules; the latter should be enforced with

data integrity. For data that should be unique, duplicate values are inaccurate. Finding duplicate

values might be quite easy, with simple queries, or very hard, if we have to find duplicates

across different systems. Finding other inaccurate data might involve some manual work. With

different algorithms, you can extract data that is potentially inaccurate only.

Here is some advice on how to isolate inaccurate data. For discrete data values, we can use the frequency distribution of values. A value with very low frequency is probably incorrect. For strings, we can examine the string length distribution. A string with a very untypical length is potentially incorrect. For strings, we can also try to find patterns, and then create a pattern distribution. Patterns with low frequency probably denote wrong values. For continuous attributes, we can use descriptive statistics. Just by looking at the minimal and maximal values, we can easily spot potentially problematic data. No matter how we find inaccurate data, we can flag it, and then measure the level of accuracy. We can measure this for columns and tables.
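The following Transact-SQL sketch shows two of these profiling techniques, a value frequency distribution and a string length distribution, against an illustrative Customers table; values and lengths with very low frequencies are the accuracy suspects:

-- Frequency distribution of a discrete column
SELECT  Country, COUNT(*) AS Frequency
FROM    dbo.Customers
GROUP BY Country
ORDER BY Frequency;

-- String length distribution of a character column
SELECT  LEN(CustomerName) AS NameLength, COUNT(*) AS Frequency
FROM    dbo.Customers
GROUP BY LEN(CustomerName)
ORDER BY Frequency;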

Information

Another measurable dimension is information. Information Theory, a branch of applied mathematics, defines entropy as the quantification of information in a system. We can measure entropy at the column and table level. The more dispersed the values are, and the more equally the frequency distribution of a discrete column is spread among the values, the more information we have in the column. Information is not a direct data quality dimension; however, it can tell us whether our data is suitable for analyses or not.
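As a sketch, the entropy of a discrete column can be computed directly in Transact-SQL from the relative frequencies of its values (H = -SUM(p * log2(p)); the column name is illustrative):

WITH Frequencies AS
(
    SELECT  Country,
            COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS p   -- relative frequency of each value
    FROM    dbo.Customers
    GROUP BY Country
)
SELECT  -SUM(p * LOG(p) / LOG(2.0)) AS EntropyInBits      -- LOG is the natural logarithm, hence the division
FROM    Frequencies;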


Consistency

Consistency measures the equivalence of information stored in various databases. We can find a

lot of inconsistent data by comparing values with a predefined set of possible values; we can

find some inconsistencies by comparing data among systems. We can find some inconsistencies

manually. No matter how we find inconsistencies, we can flag them, and then again measure

the level of inconsistency on column or table level.

We can measure soft dimensions indirectly, through interaction with users. Questionnaires, quick polls, user complaints, or any other communication with data users are our tools for measuring the quality of soft data dimensions. The following list includes some typical soft

dimensions:

Timeliness tells us the degree to which data is current and available when

needed. There is always some delay between change in the real world and the

moment when this change is entered into a system. Although stale data can

appear in any system, this dimension is especially important for Web applications

and sites. A common problem on the Web is that owners do not update sites in a

timely manner; we can find a lot of obsolete information on the Web.

Ease of use is a very typical dimension that relies on user perception. This

dimension depends on the application and its user interface. In addition, users of data can also perceive usage as complex because they are undereducated.

Intention – is the data the right data for intended usage? Sometimes we do not

have the exact data we need; however, we can substitute the data needed with

data with similar information. For example, we can use phone area codes instead

of ZIP codes in order to locate customers approximately. Although phone

numbers were not intended for analyses, they can give us reasonable results.

Another, worse example of unintended usage is using a column in a table for storing unrelated information, like using the product name to store the product taxonomy. This is unintended usage of the schema, which leads to many problems with data cleansing and integration.

Trust – we have to ask users whether they trust the data. This is a very important

dimension. If users do not trust data in operational systems, they will create their

own little, probably unstructured, databases. Integration of master data from

unstructured sources is very challenging. If users do not trust data from analytical

applications, they will simply stop using them.

Presentation quality is another dimension that depends on user perception.

When presenting data, format and appearance should support appropriate use of

information. In operational systems, this dimension is closely related to ease of

use dimension, and depends a lot on user interface. For example, an application

can force users to enter dates manually, or guide them through calendar control.

In analytical systems, presentation is probably even more important. Do we show

data in graphs or in tables? How much interactivity should we put in our reports?

Questions like these, and the answers to them, can have a big influence on the success or failure of analytical systems.

Finally, we can describe some schema quality dimensions. A common perception is that schema

quality cannot be measured automatically. Well, this is true for some dimensions; we cannot

measure them without digging into a business problem. Nevertheless, it is possible to find

algorithms and create procedures that help us in measuring some part of schema quality. The

following list shows the most important schema quality dimensions with a brief description of

how to measure them when applicable.

Completeness tells us to what extent the schema covers the business problem. We cannot measure this dimension without an in-depth analysis of the business problem and its needs.

Correctness of the model concerns the correct representation of real-world

objects in the schema and the correct representation of requirements. For

example, the first name could be modeled as an entity with one-to-one


relationship to Customers entity, or as an attribute of the Customers entity. Of

course, the latter is correct. The first name cannot be uniquely identified, and is

therefore not an entity. This is the problem of correctly representing real world

objects. An example of incorrectness considering requirements is a model of

departments and managers. If a requirement says that a manager can manage a

single department, and that each department has a single manager, then the

relationship between managers and departments should be one-to-one.

Modeling this relationship as many-to-many means an incorrect schema. We can

measure this dimension manually, by investigating business requirements and

object definitions and their representation in the schema.

Documentation tells us whether the schema is properly documented. Schema

diagrams, like Entity-Relationship diagrams, should always be part of

documentation. In addition, all business rules that cannot be represented in a

diagram should be documented in textual format. In short, we should have

complete documentation of conceptual schema. We can check the quality of this

dimension manually, with schema documentation overview.

Compliance with theoretical models – is the database schema for transactional applications a properly normalized and specialized relational schema, and does the schema for the data warehouse consist of star schemas? This dimension has to be measured partially manually, with an overview of the data model. However, we can find some problematic areas in the schema procedurally. For example, we can check correlations between non-key attributes; attributes with a very high correlation in a table lead to the conclusion that the schema is not normalized, or at least not in third normal form. In addition, many NULLs lead to the conclusion that there is not enough specialization, i.e. not enough subtypes in the schema. This is especially true if some values of an attribute always lead to NULLs in another attribute; we can measure this with database queries, as in the sketch after this list.

Minimalization – abstraction is a very important modeling technique. It means that we should have in a model only the objects that are important for the business problems we are solving. The schema should be minimal, without objects that are not pertinent to the problem it is solving. Again, we can measure this dimension with a manual overview and comparison to business requirements.
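As an illustration of the procedural checks mentioned for the compliance dimension, the following sketch (with made-up column names) counts, per value of one attribute, how often another attribute is NULL; if a value such as 'Company' always produces NULL birth dates, the schema probably lacks Person/Company subtypes:

SELECT  CustomerType,
        COUNT(*)                                                    AS NumRows,
        SUM(CASE WHEN BirthDate         IS NULL THEN 1 ELSE 0 END)  AS NumNullBirthDates,
        SUM(CASE WHEN NumberOfEmployees IS NULL THEN 1 ELSE 0 END)  AS NumNullEmployeeCounts
FROM    dbo.Customers
GROUP BY CustomerType;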

There are many more data and schema quality dimensions defined in the literature that deals specifically with data quality problems. Nevertheless, we mentioned the most important ones;

now it is time to describe the activities needed to understand and improve data quality.

Data Quality Activities

Data quality projects are very intensive in terms of resources needed. In order to execute a

successful project, we have to show the possible benefits to key stakeholders. Firstly, we need

to understand business needs. We can use interviews, overviews of organizational charts,

analyses of existing practices in an enterprise etc. We have to prioritize the business issues, and

make a clear project plan. For a successful project, it is important to start with a business area

that is either very painful for the stakeholders, or is quite simple to solve. We should always

implement data quality projects step by step.

Before we start any MDM project, we have to understand sources and destinations of master

data. Therefore, data quality activities must include overviews. We have to make an extensive

overview of all schemas of databases that pertain to master data. We should interview domain

experts and users of data. This is especially important for getting a comprehension of the quality of the schema dimensions. In addition, after this step, we should also have a clear understanding of the technology that the enterprise is using. If needed, we have to plan to include appropriate technology experts in the project. During the overview of the data, we should also focus on the data life cycle, to understand retention periods and the like.

The next important step is the data quality assessment. We can assess hard dimensions with

procedural analysis of the data, i.e. data profiling. There are many different tools for data

profiling. We should exploit all knowledge we have and all tools available for this task. We

should measure soft data quality dimensions in this step as well. We can get a lot of insight on


the state of the company by comparing hard and soft dimensions. If we evaluate hard

dimensions as bad, and get good evaluations for soft dimensions, then the company does not

realize they have problems with data. In such a case, additional assessment of potential damage

caused by bad data can help key stakeholders understand data quality problems. If both soft and hard dimensions are evaluated low, the company is ready for a data quality and/or MDM project. If hard dimensions get good evaluations while soft ones get bad evaluations, then this means that for some reason domain experts and users do not trust the data. Usually the reason lies in previous systems, in previous versions of a system, or in insufficient user education. If both hard and soft dimensions get good evaluations, then the company does not need a special data quality project; however, the company could still decide on an MDM project in order to minimize the expenses of master data maintenance.

After finishing with data assessment, we can re-assess the business impact of low quality data.

We should meet again with key stakeholders in order to review the priorities and elaborate the

improvements part of the project plan in detail.

Before we start improving data quality, we have to find the root causes for bad data. Finding root causes can substantially narrow down the amount of work needed for data quality improvements. For finding root causes, we can use the five whys method, introduced by Sakichi Toyoda and first used at the Toyota Motor Company. With this method, we simply ask why five times. Imagine the problem is duplicate records for customers.

We can ask why there are duplicate records. An answer might be because

operators frequently insert new customer records instead of using existing ones.

We should ask the second why – why are operators creating new records for

existing customers? The answer might be because operators do not search for

existing records of a customer.

We then ask the third why – why don’t they search? The answer might be

because the search would take too long.


The next why is, of course, why does it take so long? The answer might be because it is very clumsy to search for existing customers.

Now we ask the final, the fifth why – why is searching so clumsy? The answer might be because one can search only for exact values, not for approximate strings, and an operator does not have the exact name, address, or phone number of a customer in memory. We found the root cause of the duplication: for this example, it is the application, specifically its user interface. Now we know where to put effort in order to lower the number of duplicates.

Of course, five whys is not the only technique for finding root causes. We can also just track a

piece of information through its life cycle. By tracking it, we can easily spot the moment when it

becomes inaccurate. We can find some root causes for some problems procedurally as well. For

example, we can find with quite simple queries that NULLs are in the system because of a lack of subtypes. No matter how we find root causes, we have to use this information to prepare a detailed improvement plan.

An improvement plan should include two parts: correcting existing data and, even more important, preventing future errors. If we focus on correcting only, we will have to repeat the correcting part of the data quality activities regularly. Of course, we have to spend some time correcting existing data; however, we should not forget the prevention part. When we have both parts of the improvement plan, we can start implementing it.

Implementation of corrective measures involves automatic and manual cleansing methods.

Automatic cleansing methods can include our own procedures and queries. If we have known logic for how to correct the data, we should use it. We can solve consistency problems by defining a single way of representing the data in all systems, and then replacing inconsistent representations with the newly defined ones. For example, if gender is represented in some system with the numbers 1 and 2, while we define that it should be represented with the letters F and M, we can replace the numbers with letters in a single update statement. For de-duplication and merging from different sources, we can use string-matching algorithms. For correcting addresses, we can use validating and cleansing tools that already exist on the market and use registries of valid addresses. However, we should always be prepared for the fact that part of data cleansing has to be done manually.
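The gender example could be handled with a single statement like the following sketch (the column name and the direction of the mapping are illustrative):

UPDATE  dbo.Customers
SET     Gender = CASE Gender WHEN '1' THEN 'F'
                             WHEN '2' THEN 'M'
                             ELSE Gender
                 END
WHERE   Gender IN ('1', '2');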

Preventing new inserts of inaccurate data involves different techniques as well. The most

important one is to implement an MDM solution. For an MDM solution, we need an MDM application. SQL Server Master Data Services is an example of a tool we could use. Besides enabling central storage for master data and its metadata, the MDM application has to support explicit data stewardship, versioning, auditing and data governance workflows. In addition to the MDM solution, we should also focus on the source systems. It is very unlikely that our MDM solution will ever cover all possible master entities with all of their attributes. As we already mentioned, an approach with a central MDM and a single copy of data is impractical most of the time. Therefore, part of the master data is still going to be maintained in source,

operational applications. We have to enforce proper data models, constraints, and good user

interfaces wherever possible.

After we have implemented data quality corrective and preventive solutions, we should measure how they perform. Even better, we should prepare the measurement infrastructure in advance, before we start implementing the solutions. By measuring improvements, we can easily show the value of the solutions to key stakeholders. In addition, we can control our own work as well, since our improvements can fail and even lead to worse data. We can measure soft dimensions with ongoing interviews with end users and domain experts. We can store the results of the interviews in a special data quality data warehouse, to track the data quality over time. We can measure hard dimensions automatically on a predefined schedule, and again store the results in the data quality data warehouse. Figure 7 shows a potential schema for a data quality data warehouse, for measuring completeness and accuracy.


FactColumns (PK, FK ColumnId; PK, FK DateId; PK, FK EmployeeId; NumValues, NumUnknownValues, NumErroneousValues)

FactTables (PK, FK TableId; PK, FK DateId; PK, FK EmployeeId; NumRows, NumUnknownRows, NumErroneousRows)

DimDates (PK DateId; Date, Month, Quarter, Year)

DimStewards (PK, FK EmployeeId; EmployeeName, ManagerId, Department, Title, Age)

DimColumns (PK ColumnId; ColumnName, FK TableId)

DimTables (PK TableId; TableName, SchemaName, DatabaseName, ServerName, ApplicationName)

Figure 7: DATA QUALITY DATA WAREHOUSE
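An automated measurement that feeds this warehouse could look like the following sketch, which profiles one column and writes a row into FactColumns; the surrogate key values, the profiled table, the dbo schema, and the accuracy rule are all illustrative:

INSERT INTO dbo.FactColumns
        (ColumnId, DateId, EmployeeId, NumValues, NumUnknownValues, NumErroneousValues)
SELECT  42                                              AS ColumnId,    -- hypothetical key of the profiled column
        CONVERT(INT, CONVERT(CHAR(8), GETDATE(), 112))  AS DateId,      -- e.g. 20110315
        17                                              AS EmployeeId,  -- hypothetical key of the responsible steward
        COUNT(*)                                        AS NumValues,
        COUNT(*) - COUNT(EmailAddress)                  AS NumUnknownValues,
        SUM(CASE WHEN EmailAddress NOT LIKE '%_@_%._%'
                 THEN 1 ELSE 0 END)                     AS NumErroneousValues
FROM    dbo.Customers;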

Although not mentioned as an explicit data quality activity step, it is very important to

communicate actions and results throughout the project. The more communication we have,

the better. Domain experts can always help us with their knowledge. IT professionals and end

users put much more effort in success if they are involved, if they feel the project is their

project. Key stakeholders should always know how the project progresses and have to be

involved in all decision milestones.


Master Data Services and Other SQL Server Tools

With SQL Server 2008 R2, Microsoft released an MDM solution called Master Data Services as part of the suite. Besides this specialized solution, other parts of SQL Server are extremely useful in all data quality and master data management activities. In this section, we are going to briefly introduce Master Data Services and its architecture, as well as other useful SQL Server tools. Of course, a detailed presentation of Master Data Services and instructions on how to work with it fill the rest of this book.

Master Data Services

SQL Server Master Data Services is a platform for master data management. It provides a

location for master data and metadata, enables processes for data stewardship and workflow,

and has connection points we can use to exchange data with sources and destinations directly

or through application interfaces, and even to extend the functionality of MDS. It consists of a

database, Web application, configuration application, Web service as an interface to

applications, and .NET class libraries that can be used to develop and maintain MDS applications

programmatically.

The MDS database, called MDS Hub, is central to any MDS deployment. The MDS Hub stores:

Schema and database objects for master data and metadata

Versioning settings and business rules

Information for starting workflow and e-mail notifications

Database objects, settings and data for MDS application

Staging tables for importing and processing data from source systems

Subscription views for systems that can retrieve master data directly from the

Hub


Figure 8 graphically represents the MDS Hub, the processes in the Hub, and the connections to outer systems, i.e. to data sources and destinations.

[Diagram: the MDS Hub holds metadata, entities and hierarchies, and supports data stewardship, versioning, workflow and business rules; a Web service and synchronization connect the Hub to ERP, CRM, SharePoint, DW, and other systems.]

Figure 8: MDS HUB

In an MDS database, models are the topmost level of master data organization. Models consist of:

Entities, which map to real-world entities or DW dimensions

Attributes, which are parts of entities and serve as containers for entity properties

Members, which are parts of entities and can be thought of as the rows in an entity table

Hierarchies, defined on entities, which group members in a tree structure

Collections, which give users the possibility to create ad-hoc groups of hierarchies and members

Entities

Entities are central objects in the model structure. Typical entities include customers, products,

employees and similar. An entity is a container for its members, and members are defined by

attributes. We can think of entities as tables. In a model, we can have multiple entities.

Attributes

Attributes are objects within entities; they are containers for values. Values describe properties

of entity members. Attributes can be combined in attribute groups. We can have domain-based

attributes; this means that the pool of possible values for the attribute comes from another entity, which acts as a lookup table related to the attribute.

Members

Members are the master data. We can think of members as rows of master data entities. A

member is a product, a customer, an employee and similar.

Hierarchies

Another key concept in MDS is Hierarchy. Hierarchies are tree structures that either group

similar members for organizational purposes or consolidate and summarize members for

analyses. Hierarchies are extremely useful for data warehouse dimensions, because typical

analytical processing involves drilling down through hierarchies.

Derived hierarchies are based on domain-based attributes, i.e. on relationships that exist in the

model. In addition, we can create explicit hierarchies, which we can use for consolidating

members any way we need.


Collections

We can also consolidate members in collections. Collections are not as fixed as hierarchies; we

can create them ad-hoc for grouping members and hierarchies for reporting or other purposes.

Versions

MDS supports versioning. We can have multiple versions of master data within a model.

Versions contain members, attributes and attribute values, hierarchies and hierarchy

relationships, and collections of a model.

Master Data Manager is an ASP.NET application which is the data stewardship portal. In this

application, we can create models, manage users and permissions, create MDS packages for

deployment on different MDS servers, manage versions, and integrate master data with source

and destination systems. Figure 9 shows the Master Data Manager home page.


Figure 9: MASTER DATA MANAGER HOME PAGE

Other SQL Server Tools for MDM

The SQL Server Database Engine, the RDBMS itself, is the system that hosts the MDS database, the MDS Hub. Of course, as an RDBMS, it can host many transactional and analytical databases, and we can use it for many other master data management and data quality related activities. Transact-SQL is a very powerful language for manipulating and querying data that can be used to discover and correct incorrect data. From version 2005 on, we can enhance Transact-SQL capabilities with CLR functions and procedures. For example, Transact-SQL does not support validating strings against regular expressions out of the box. However, we can write Visual Basic or Visual C# functions for this task and use them inside SQL Server.
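Once such a CLR function is deployed (the name dbo.RegexIsMatch below is purely hypothetical), it can be used from Transact-SQL like any other scalar function, for example to flag phone numbers that do not match an expected pattern:

SELECT  CustomerId, Phone
FROM    dbo.Customers
WHERE   dbo.RegexIsMatch(Phone, N'^\(\d{3}\) \d{3}-\d{4}$') = 0;   -- hypothetical CLR scalar function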


SQL Server Integration Services (SSIS) is a tool for developing Extract – Transform – Load (ETL)

applications. An ETL process is always part of an analytical solution. We have to populate data

warehouses somehow. Of course, we can use the ETL process and ETL tools for preparing

master data for our MDS application as well. SSIS includes many procedures called tasks

and transformations that are useful for analyzing, cleansing and merging data. For example, the

Data Profiling task quickly gives us a good overview of the quality of our data. The Data Profiling

task is shown in figure 10.

Figure 10: THE DATA PROFILING TASK

Besides the Data Profiling task, there are quite a few additional tasks and transformations that help us with the ETL process. For example, the Fuzzy Lookup transformation can join data from different sources based on string similarities. The Fuzzy Grouping transformation helps find duplicates, again based on string similarities.


SQL Server Analysis Services (SSAS) is an analytical tool. It enables OLAP and Data Mining

analyses. As an analytical tool, it does not seem useful for master data management at first glance. However, we can use OLAP cubes efficiently for data overview. Data Mining comprises a set of advanced algorithms that try to find patterns and rules in a data set. Some of these algorithms are very useful for finding bad data. For example, the Clustering algorithm tries to group rows into groups, or clusters, based on the similarity of their attribute values. After we find clusters, we can analyze how well a row fits some cluster. A row that does not fit well in any cluster is a row with suspicious attribute values.

Finally, we have SQL Server Reporting Services (SSRS) in the SQL Server suite. SSRS is useful for master data management indirectly. We can present our activities and data quality improvements through reports. We can publish the reports on the default SSRS portal site, or on any other intranet portal site we have in our enterprise, including SharePoint sites.


Summary

At the beginning of this book about SQL Server Master Data Services, we started with some

theoretical introductions. We defined what master data is. Then we described master data

management in general. We mentioned explicit actions about data governance, and operators

who manage the master data. The name of those operators is data stewards. We also

introduced different approaches to master data management.

Then we switched to master data sources and destinations. For sources, it is crucial that they take care of data integrity. We briefly discussed the Relational Model, the most important model for transactional databases. We also mentioned normalization as the process of unbundling relations, a formal process that leads to the desired state of a database, in which each table represents exactly one entity.

Dimensional Model, a model used for analytical systems.

Master data management always has to deal with data quality. Investments in an MDM solution make no sense if the quality of the data in our company does not improve. In order to

measure the data quality, we need to understand which data quality dimensions we have. In

addition, we introduced the most important activities dedicated to data quality improvements.

In the last part of the chapter, we introduced SQL Server Master Data Services. We mentioned

MDS key concepts. We also explained how we can use other elements of SQL Server suite for

master data management and data quality activities. It is time now to start working with MDS.


References

Bill Inmon: Building the Operational Data Store, 2nd Edition (John Wiley & Sons, 1999)

C. J. Date: An Introduction to Database Systems, Eighth Edition (Pearson Education, Inc.,

2004)

Ralph Kimball and Margy Ross: The Data Warehouse Toolkit: The Complete Guide to

Dimensional Modeling, Second Edition (John Wiley & Sons, 2002)

Danette McGilvray: Executing Data Quality Projects (Morgan Kaufmann, 2008)

5 Whys on Wikipedia


Chapter 2: Master Data Services Concepts and Architecture

Davide Mauri

We previously introduced master data, master data management and Master Data Services in the first chapter of the book; now it is time to show how we can work with this new product in the SQL Server suite. This chapter shows how we can create an MDS model with entities, attributes, hierarchies and collections. In addition, it shows how we can import and export data to and from the MDS Hub. We will also learn how to implement business rules for our attributes, and the concept of versioning master data.

The chapter also starts with a quick introduction to Master Data Services setup, so that it can serve you as a walk-through guide if you want to get your hands on MDS and do not have it installed yet. Topics covered in this chapter include:

Installation of Master Data Services (MDS);

Using MDS Web application;

Defining models, entities, attributes and hierarchies;

Constraining inputs with business rules;

Importing, exporting and managing data;

Managing versions.


Master Data Services Setup

This section provides an overview of installing Master Data Services. The aim is to give you just a short description of the steps necessary to have a machine with MDS up and running, without going too much into detail. Complete coverage of an MDS installation is provided in SQL Server 2008 R2 Books Online.

The installation and setup of Master Data Services is divided into three main steps:

Installation of MDS components and tools

Setup of MDS database

Setup of MDS Web Application

Installation of MDS Components and Tools

Installing MDS components and tools is a straightforward operation. It’s done using a typical Windows installer package, and you can simply accept all default options.

MDS Setup is included only in the 64-bit versions of SQL Server 2008 R2 Developer, Enterprise or

Datacenter editions. Operating system support is determined mainly by the edition of SQL

Server you choose. For example, the Developer edition can be installed starting with Windows

Vista (Business, Enterprise and Ultimate) and Windows 7 (Professional, Enterprise and

Ultimate).

MDS components and tools require .NET Framework 3.5 Service Pack 1 to be installed. If it’s not

already installed, MDS Setup will install it before proceeding with the main installation.

Furthermore, the Windows user account of the logged-on user executing MDS Setup needs to be a member of the local Administrators Windows group.


Although not strictly necessary to setup the components and tools, the PowerShell feature

needs to be installed before proceeding with the remaining configuration steps.

The other two installation steps require using a specialized configuration program called Master

Data Services Configuration Manager that is installed along with the other components and

tools in the first step.

Figure 1: MASTER DATA SERVICES CONFIGURATION MANAGER

Setup of MDS Database

Setting up a database for an MDS solution requires an already installed SQL Server instance and

the provisioning of two Windows users that will be granted the appropriate permissions to login

to the instance and access the database.


The first Windows user will be used to provide the identity of the web application pool and the

second one will be used as the MDS System Administrator. The MDS System Administrator can

access and update all models and all data in all functional areas for all web applications that use

an MDS database. In this book, we assume that the MDS database created is named MDSBook.

MDS Configuration Manager can then be used to connect to the instance and create the

database, as shown in Figure 2.

Figure 2: DATABASE SETUP WITH MDS CONFIGURATION MANAGER

Setup of MDS Web Application

Setting up an MDS web application requires the installation of the Web Server (IIS) role along with some specific role services. As a quick guideline, the following role services are required:

Common HTTP Features


o Static Content

o Default Document

o Directory Browsing

o HTTP Errors

Application Development

o ASP.NET

o .NET Extensibility

o ISAPI Extensions

o ISAPI Filters

Health and Diagnostics

o HTTP Logging

o Request Monitor

Security

o Windows Authentication

o Request Filtering

Performance

o Static Content Compression

Management Tools

o IIS Management Console

Furthermore, the following features are required:

.NET Framework Features

o WCF Activation

HTTP Activation

Non-HTTP Activation

Windows PowerShell

Windows Process Activation Service

o Process Model

o .NET Environment


o Configuration APIs

MDS Configuration Manager can then be used to create and configure the web application as

shown in Figure 3.

Figure 3: CONFIGURING WEB APPLICATION WITH MDS CONFIGURATION MANAGER


Master Data Manager Web Application

The entry point of any kind of human interaction with Master Data Services is the Master Data

Manager Web Application.

The web application is installed during the configuration of Master Data Services as part of the

initial system deployment, and allows managing everything related to a Master Data

Management solution. From data model definition to system and security management, passing through data verification and manipulation, this is the central point from which Master Data can be managed.

Figure 4 shows the five main functional areas, which are the entry points to all available features.


Figure 4: MASTER DATA MANAGER WEB APPLICATION

Explorer

Explorer is where all data stored inside Master Data Services can be managed. Models, entities,

hierarchies and collections can be navigated and their data can be validated with the help of

defined business rules.

Direct updates or additions to the Members, the actual data stored in entities, can be done

here. It’s also possible to annotate such data to explain why a change has been made, so that

everything can be tracked and kept safe for future reference. It’s also possible to reverse

transactions if some changes made to data have to be undone.

Version Management

Once the process of validating Master Data is finished, external applications can start to use the

verified data stored in Master Data Services. Since applications will rely on that reference data, it’s vital that no modification at all can be made to it; still, data will need to change in order to satisfy business requirements. By creating different versions of data, it’s possible to manage

all these different situations, keeping track of changes and creating stable and immutable

versions of Master Data.

Integration Management

As the name implies, Integration Management allows managing the integration of the Master

Data with the already existing data ecosystem. With Integration Management, it’s possible to

batch import data that has been put into staging tables and monitor the import results. It is also

possible to define how data can be exposed to external applications via defined Subscription

Views.


System Administration

System Administration is the central place where the work of defining a Master Data Model

takes place. With System Administration it is possible to create models, entities, attributes,

hierarchies, business rules and everything offered by Master Data Services in terms of data

modeling.

User and Groups Permissions

Security plays a key role when a system manages all the data that represents a core asset of any

business. With User and Groups Permissions, all topics regarding security can be administered

and managed, configuring all the security clearances needed to access and modify Master Data.


Models

As explained in Chapter 1, Master Data Services organizes its data with the aid of different

concepts to deal with different aspects of data. All these concepts allow the definition of a Data

Model. To start getting confident with Master Data Services concepts and technology, in the next paragraphs we’ll create a model to hold customer data. By following a walk-through of all the steps needed, we will get an example of a functional Master Data Services model.

Models

The highest level of data organization in Master Data Services is the Model. A model is the

container of all other Master Data Services objects and can be thought of as the equivalent of a database in the Relational Model.

As a result, the first step in creating any solution based on Master Data Services is to create a Model. From the System Administration section of the Master Data Manager portal, select the Model element from the Manage menu.

Figure 5: OPENING THE MODEL MAINTENANCE PAGE


From the Model Maintenance page, it’s possible to create a new Model by simply clicking the Add Model button (the green “plus” icon). This will open the Add Model page, where it is possible to specify the name and other options that will automatically create some default objects for us. These options will be useful for saving some work once we have a solid background on how Master Data Services works, so for now we’ll uncheck them.

Figure 6: ADDING A MODEL

Since a Model is just a simple container, all that’s needed to create a Model is its name. For our

sample we’re going to create customer Master Data, so we call our Model Customer.

Entities and Attributes

After having created the Customer Model, Entities of that model need to be defined. Entities

are the objects a customer is made of, and they are conceptually very similar to the tables of a standard relational database.

In our sample Customer Model, each customer, obviously, will have a specific address. An

address exists in a City, which has a specific Province (or State), within a Country.


Cities, Provinces and Countries will then be entities on their own since they represent reference

data too. In this way, Master Data will contain one and only one definition of their member

values, so that it will be possible to have a single version of the truth, avoiding human errors such as misspelled or incorrect data that hinder the achievement of high data quality standards.

The process of defining which objects have to be put into a separate Entity and which do not is similar to the normalization process for a relational database. The aim is to avoid redundancy and data duplication, which lead to data anomalies and inconsistencies.

Back to our sample, to hold customer reference data, we’ll create the following entities:

Customer

City

StateProvince

CountryRegion

This sample is based on the usage of the AdventureWorksLT2008R2 sample database,

which can be downloaded from CodePlex.

Entities are created through the Model -> Entities menu item from the System Administration

Page. After having defined to which model an Entity belongs, only the name has to be specified in order to create it.


Figure 7: ADDING AN ENTITY

We haven’t defined what hierarchies and collections are, so we don’t want to have them in our model for now. It will be possible to change this option later on, when we have a deeper understanding of these objects.

After creating all the aforementioned Entities, you’ll see a page like the following:


Figure 8: ENTITY MAINTENANCE PAGE AFTER ADDING ALL ENTITIES

From here it’s possible to enrich and complete the entity definitions with the aid of Attributes: entities are made of Attributes, which are the analogue of columns in a relational database. Any Entity has at least two attributes, which are mandatory and cannot be removed or renamed:

Name

Code

Attributes can be of three different types. The type defines which kind of values an Attribute can handle:

Free-Form: Allows the input of free-form text, numbers, dates and links.

File: Allows you to store any generic file, for example documents, images or any kind of binary object.

Domain-Based: The values of a Domain-Based attribute are the values stored in another Entity. This allows the creation of relationships between Entities, by imposing the rule that each value used here must be a value that exists in another specific Entity. As a result, this forces the user not to enter free-form data but to choose from a list of defined values, which helps a lot in avoiding errors and redundancy. The entity that stores the domain of possible values for the attribute is similar to a lookup table in the classical Relational Model.

To manage attributes for an entity, we first need to select that entity from the Entity

Maintenance Page, so that a set of icons will become visible:

Figure 9: SELECTING AN ENTITY IN THE ENTITY MAINTENANCE PAGE

Clicking on the pencil icon will open the Attribute Management page for the selected Entity.


Figure 10: OPENING THE ATTRIBUTE MAINTENANCE PAGE

In the Leaf Attributes section it’s possible to add as many attributes as we need. Attributes are called “Leaf” to distinguish them from the Consolidated Attributes used in Explicit Hierarchies. In brief, a Leaf Attribute represents the lowest level of detail used to define the Master Data model.

Moving forward in our Customer sample, we need to create an attribute for the StateProvince entity: a State or a Province is contained within a Country, and so we need to model this situation in our solution.

This kind of relationship between the StateProvince and Country entities can be represented

using a Domain-Based Attribute, like shown in Figure 11.


Figure 11: CREATING A DOMAIN-BASED ATTRIBUTE

It will probably now be obvious that a Domain-Based Attribute also has to be defined for the City Entity, so that it is related to the state or province in which the city is located.

For the Customer Entity, we will need to define several attributes. Besides the mandatory Code and Name, we'll create the following:

Firstname

Lastname

EmailAddress

Phone

Address

City


For simplicity we'll create them all – with the exception of the City attribute – as Free-Form attributes with a maximum length of 100 characters. City has to be a Domain-Based attribute whose values are taken from the City entity.

When a Free-Form attribute is created, it's possible to define the length of the data and the width – in pixels – of the textbox that allows a user to see and modify the attribute's data.

Figure 12: CREATING A FREE-FORM ATTRIBUTE

As the previous image shows, it's also possible to flag the option Enable Change Tracking. All kinds of attributes support this option. As the name suggests, if Change Tracking is enabled, Master Data Services will monitor the attribute's data for changes. If a change happens, a Business Rule can be configured to fire so that the necessary validation of data can take place.


The attribute's data type determines which constraints can be placed on the attribute itself to ensure data consistency. For the Number data type, it's possible to define how many decimal places have to be used and the input mask.

Figure 13: NUMBER DATA TYPE OPTIONS

The DateTime data type also needs to have an input mask defined. A File attribute can constrain the type of handled files by specifying a file extension. Just be aware that this is not meant to be a security mechanism: nothing checks that the actual file content matches what its extension says it should be.

After creating all entities and attributes, we have defined the model in which we'll store customer data. It should be clear at this point that creating Entities and Attributes is the core of modeling Master Data. As with relational database modeling, it's vital for the success of a project to have a clear idea of all the entities that are needed, along with their attributes, so that correct relationships can be created: changing them once data has been imported, though still possible, is a time- and resource-consuming task.

Some similarities with a relational database emerge here. This is not unexpected. Master Data Services uses the power of the SQL Server relational database engine to enforce constraints that assure data quality. Each time an Entity is created, a corresponding table gets created too, and each time a Domain-Based attribute puts two Entities in a relationship, a corresponding Foreign Key constraint gets created between the two underlying tables. All entities and attributes are listed in the mdm.tblEntity and mdm.tblAttribute tables of the configured Master Data Services database. Members of an entity are stored in specific tables created ad hoc; two tables are created per entity, following a specific naming convention: mdm.tbl_<model_id>_<entity_id>_<table_type>. The EN table type holds entity members, while the MS table type holds security information. The mapping between these tables and the Entity that uses them can be found in the mdm.tblEntity table.

As can easily be supposed, in a real-world solution an entity can have dozens of attributes, each one, or each group of them, describing a specific aspect of the entity. In this case it may make sense to group logically connected attributes together, so that it's easier for the end user to focus on a specific group of data.

The ability to group attributes is supplied by the Attribute Group feature. In our example it makes sense to group the Address and City attributes together, since they hold address information. The management page for Attribute Groups can be opened from the Model -> Attribute Groups menu. For each entity, Attribute Groups can be created and populated with the desired attributes. The Attribute Groups Maintenance page is shown in Figure 14.


Figure 14: ATTRIBUTE GROUPS MAINTENANCE PAGE

The Name and Code attributes are not visible since they are always available in any attribute group, so they don't need to be added manually.

Another Attribute Group we may want to create here is one that groups contact information together (FirstName, LastName, Phone, EmailAddress), to isolate it from the address data.

Besides helping users to manage Master Data more easily, Attribute Groups are also a powerful security feature. Please refer to Books Online for more details on how effective permissions are determined from the permissions assigned to users, groups and MDS Model elements.


Hierarchies

Hierarchies are tree-based structures that allow you to organize master data in a coherent way, which also facilitates the navigation and aggregation of the data itself.

Figure 15: HIERARCHY EXAMPLE

A hierarchy tells the user, or the consuming system that receives Master Data from Master Data Services, that it makes sense to aggregate data at all levels except the leaf level, which represents the actual un-aggregated data.

For example, Customers may be the leaf level. Since customers live and work in cities, it makes sense to aggregate all the customers on a per-city basis, so that we can analyze how strong the presence of our company is in different cities. Cities are located in countries, so aggregating data by country is interesting for analyzing worldwide business.
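As an illustration only, here is a hedged sketch of that per-city and per-country aggregation, written directly against the AdventureWorksLT2008R2 source tables (SalesLT.Customer addresses) rather than against MDS itself:

SELECT a.CountryRegion
      ,a.City
      ,COUNT(DISTINCT ca.CustomerID) AS CustomerCnt
FROM SalesLT.CustomerAddress AS ca
JOIN SalesLT.[Address] AS a
  ON a.AddressID = ca.AddressID
GROUP BY a.CountryRegion, a.City
ORDER BY a.CountryRegion, a.City;

A hierarchy defined once in Master Data Services ensures that every application aggregates along the same City -> StateProvince -> CountryRegion path, instead of each application inventing its own grouping.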

Having a common place where hierarchies are defined, such as Master Data Services models, is key to having a standard way of aggregating data across all the applications used in a company.


For anyone who has already developed a Business Intelligence solution, the concept of a hierarchy in a SQL Server Analysis Services Dimension is very close to the one used by Master Data Services. In fact, it is possible to create and populate an Analysis Services Dimension using just the data stored in a Master Data Services hierarchy.

Derived Hierarchy

A Derived Hierarchy is a hierarchy created using the relationships that exist between entities

that use Domain-Based attributes. In our Customer sample, we have a relationship between

City, StateProvince and CountryRegion as follows:

Figure 16: DERIVED HIERARCHY EXAMPLE

A Derived Hierarchy can be created using the Derived Hierarchy Maintenance page, reachable via the Manage menu. After specifying the name – Geography, in our case – composing the hierarchy is simply a matter of dragging the available entities from the left into the Current Level section in the middle of the page. After an entity is dropped, Master Data Services automatically detects at which level of the hierarchy it has to be placed, by analyzing the relationships between it and the already placed entities.


Figure 17: DERIVED HIERARCHY MAINTENANCE PAGE

The Preview pane shows how Members are organized in the created hierarchy. This feature only becomes useful once data has been imported into Master Data Services, to check that the hierarchy organizes values in the expected way.

In a Derived Hierarchy, all members of all the entities used take part in the hierarchy. When a member of an entity is added, deleted or updated, Master Data Services automatically updates the related hierarchies so that data is always consistent. An entity member can be used only once in a hierarchy, and an entity cannot be placed in more than one level. Of course, an entity can be used in more than one hierarchy.

A special case of Derived Hierarchy is the Recursive Hierarchy. This occurs when an Entity has a Domain-Based attribute that uses values coming from the entity itself. A typical scenario is the organizational chart of a company, where an employee can also be the manager of other employees. There is no particular difference between this kind of hierarchy and a standard Derived Hierarchy, but there are a few limitations:

Only one Recursive Hierarchy can exist per Entity

Member Permission cannot be assigned

Circular References are not allowed

Explicit Hierarchy

It is also possible to define a hierarchy without using the relationships between entities: we may need to create a hierarchy using additional data not available in any source system, by adding new data manually, directly into the Master Data Services database.


These special hierarchies are called Explicit Hierarchies. They can organize data coming from one entity only (as opposed to what happens with Derived Hierarchies) and allow the creation of Consolidated Members, which define the levels of the hierarchy itself.

In our Customer example, we have CountryRegion data, which contains information about the

nation where a customer resides. For business purposes we may decide that it makes sense to

have the data grouped into Continents and Geographic Zones.

Figure 18: EXPLICIT HIERARCHY EXAMPLE

Explicit Hierarchies can be ragged, meaning that not all branches have to be equally deep. For example, we may decide to have a higher level of detail under North America, adding the levels Country and State, while omitting them for other elements.

To create an Explicit Hierarchy, we first have to enable this feature at the Entity level. This can be done from the Entity Maintenance page, together with the definition of the hierarchy name.


Figure 19: ENABLING EXPLICIT HIERARCHIES

If the “Include all leaf members in a mandatory hierarchy” flag is checked, the newly created Explicit Hierarchy will contain all of the entity’s members. By default they will be placed under the “Root” node, so that they can be moved into other levels later on. If the flag is not checked, not all of the entity’s members must be used in the hierarchy; the unused members will be placed in a special node named “Unused”.

After enabling the Explicit Hierarchy, the Entity Maintenance page will look like the one in Figure 20.


Figure 20: ENTITY MAINTENANCE PAGE AFTER ENABLING EXPLICIT HIERARCHIES

Consolidated Attributes are the attributes used by Consolidated Members. As shown, it’s possible to have more than one Explicit Hierarchy per Entity.

Collection Attributes are attributes used by Collections, which will be discussed later in this chapter.

After defining an Explicit Hierarchy, the Explorer functional area of Master Data Services has to be used to add Consolidated Members to it and build its levels. From there, through the Hierarchies menu item, it’s possible to access the Maintenance Page of the chosen Explicit Hierarchy. The page is divided into two main sections. On the left, the hierarchy is visible, and it’s possible to navigate its levels or move items from one level to another (along with their sublevels) by dragging and dropping them or by using the cut-and-paste feature.

Figure 21: MANAGING EXPLICIT HIERARCHIES – LEFT SIDE

On the right side of the page is the area where it’s possible to create or delete items at each level. In the following sample, under the Europe level we need to add the Western, Central and Eastern zones. After selecting the Europe item on the left side of the page, it is possible to add Consolidated Members to that hierarchy level.

Figure 22: MANAGING EXPLICIT HIERARCHIES – RIGHT SIDE

It is also possible to add Leaf Members from this page. Remember that, since a Leaf Member is the lowest item of a hierarchy, a Leaf Member cannot have any child elements.

After clicking on the Add button (again, the green “plus” icon) a new page will pop up. Here, all the attributes for the chosen item can be entered, and it’s also possible to select under which level the new member will be placed, by clicking on the “Find Parent Node” button (indicated by a magnifying glass icon).


Figure 23: ADDING A CONSOLIDATED MEMBER

The manual addition of Consolidated Members is just one option. As we’ll see next, it’s possible to batch-import such members, just as it is for Leaf Members.

Collections

Collections are nothing more than groups of Explicit Hierarchies and Collection Members. They are useful for creating groups of members that are not necessarily related from a business perspective, but are grouped simply to make life easier for the user. For example, if someone is in charge of managing Customers from Spain, Portugal and South America, they can do a more efficient job if they can find all of their customers in a single group. This is where Collections come into play.

Collections can be created on a per-Entity basis and, just like Explicit Hierarchies, they can also have their own Consolidated Members. Again, member management is done in the Explorer functional area. After creating a Consolidated Member to represent the collection, it’s possible to add members by clicking on the “Edit Members” menu item, visible after clicking on the drop-down arrow near the collection we want to manage.


Figure 24: EDITING A COLLECTION

A new page will be displayed where, using the same drag-and-drop or cut-and-paste techniques used to manage Hierarchies, members of the collection can be added, removed or changed.

Business Rules

Business Rules are one of the most powerful features of Master Data Services. They allow the definition of rules that assure the quality of data. A business rule is basically an If...Then statement that specifies what has to be done when certain conditions are met.

For example, let’s say that an email address, to be considered valid, needs to contain at least the “at” (@) character.

Business Rules are created in the System Administration functional area, by accessing the Business Rule Maintenance page through the “Manage” menu item. New rules can be created by clicking on the usual “Add” button, and the rule name can be set by double-clicking on the existing name. After selecting the entity for which the Business Rule has to be created, the rule can be defined by clicking on the third icon from the left:


Figure 25: BUSINESS RULE MAINTENANCE PAGE

The Edit Business Rule page will be displayed. This page has four main areas.

In the top left section are the Components of the business rule. Each component represents a piece of a sentence that can be put together with others to form a business rule. Components are of three types: conditions, used for the If part of the rule; actions, used for the Then part; and logical operators, which connect two or more conditions. As said before, in our case we just want to be sure that the EmailAddress attribute values contain the “@” sign. Under the Validation Actions there is a “must contain the pattern” action. We don’t need any conditions, since we always want to verify that an email address is formally correct.

To put the desired elements in the proper section, we just have to drag and drop the element we want to use into the right place: Conditions or Actions.


Figure 25: EDITING A BUSINESS RULE

As soon as we drop the item, the Edit section of the page, in the bottom right, will require us to configure the Condition or Action being used. Based on the selected item, we may be asked to specify which attribute we want to use and some other configuration options. In our example, we’ll need to specify that the rule is applied to the EmailAddress attribute and that the pattern we’re looking for is %@%.

Figure 26: SPECIFYING DETAILS OF THE “MUST CONTAIN THE PATTERN” ACTION
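The %@% pattern is an ordinary SQL wildcard pattern. As a hedged side note (this queries the AdventureWorksLT2008R2 source data directly, not the MDS rule engine), a quick way to preview which rows would fail such a rule before the data is imported is a LIKE check:

-- Source rows whose EmailAddress would fail a "must contain the pattern %@%" rule.
SELECT CustomerID, EmailAddress
FROM SalesLT.Customer
WHERE EmailAddress NOT LIKE N'%@%'
   OR EmailAddress IS NULL;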


Again, to select the attribute we want to use, we just have to drag and drop it under the “Edit Action” item. Once the rule is complete, it needs to be published before users can start using it to validate data. From the Business Rules Maintenance page, select the rule that needs to be published and click on the “Publish Business Rules” button (second from the left).

Figure 27: PUBLISHING A BUSINESS RULE

After the Business Rule has been published, it can be used to validate Master Data.


Importing, Exporting and Managing Data

Master Data Services needs to collect data from other systems in order to consolidate it, validate it and make it Master Data. Though data can be managed directly using the Master Data Manager portal, a common scenario is to import and export data from and to external systems that produce and consume reference data.

Import Data

Importing data into Master Data Services is a batch process based on three main tables:

mdm.tblStgMember

mdm.tblStgMemberAttribute

mdm.tblStgRelationship

The first table – mdm.tblStgMember – is used to import members and their Code and Name system attributes. All other user-defined attributes have to be imported using the mdm.tblStgMemberAttribute table. If we need to move members into an Explicit Hierarchy or add members to a Collection, we’ll use the mdm.tblStgRelationship table too.

Without going into deep technical details, to import member data we first need to create the members, and then populate all the user-defined attributes, in that sequence.

For the first attempt, we’ll load data for the CountryRegion Entity, taking the values from the AdventureWorksLT2008R2 sample database.

To extract the source values and load them into mdm.tblStgMember, we’ll use the following T-SQL code:

USE AdventureWorksLT2008R2;
GO

WITH cte_source AS
(
    SELECT DISTINCT CountryRegion
    FROM SalesLT.[Address]
)
INSERT INTO MDSBook.mdm.tblStgMember
    (ModelName, HierarchyName, EntityName,
     MemberType_ID, MemberName, MemberCode)
SELECT
    ModelName     = 'Customer',
    HierarchyName = NULL,
    EntityName    = 'CountryRegion',
    MemberType_ID = 1,
    MemberName    = CountryRegion,
    MemberCode    = ROW_NUMBER() OVER (ORDER BY CountryRegion)
FROM cte_source;
GO

After executing this T-SQL batch from SQL Server Management Studio, we’ll have three rows in the member staging table of our MDSBook sample database. To notify Master Data Services that we want to start the import process, we have to use the Import feature available in the “Integration Management” functional area.

Figure 28: IMPORTING CUSTOMER DATA

By clicking on the Process button (the only one above the Model label) we start the batch process, which queues the rows to be imported. Batches are processed every 60 seconds by default.


After that time the process will be completed and we’ll be able to see the result, using the same

page:

Figure 29: AFTER CUSTOMER DATA IS IMPORTED

If errors occurred, we can see detailed information about them on a separate page, simply by clicking on the “Batch Details” icon, the first one on the left.
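User-defined attribute values are staged in the same way, through mdm.tblStgMemberAttribute. The following is a hedged sketch only: the column list is taken from the MDS 2008 R2 staging-table documentation and the member code is hypothetical, so verify both against your own MDS database before using it.

-- Stage a value for a user-defined attribute of an existing leaf member.
INSERT INTO MDSBook.mdm.tblStgMemberAttribute
    (ModelName, EntityName, MemberType_ID,
     MemberCode, AttributeName, AttributeValue)
VALUES
    (N'Customer', N'Customer', 1,        -- leaf member (MemberType_ID = 1)
     N'1',                               -- hypothetical member code
     N'EmailAddress', N'someone@example.com');

The row is then processed by the same batch mechanism described above.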

Managing Data

Now that we have our data inside Master Data Services, we can use the Explorer functional area to explore and manage it. Here we can select the Entity we want to manage simply by choosing the desired one from the “Entities” menu.

Since we’ve just imported CountryRegion data we may want to check it. The CountryRegion

Entity page will be as shown in Figure 30.


Figure 30: THE COUNTRYREGION ENTITY PAGE

Here it is possible to add or remove members, or to change existing ones simply by double-clicking on the attribute that we want to change.

The yellow question mark visible to the left of a member’s name indicates that the data has not yet been validated, which means that we have to execute business rules in order to validate the data. To do this, we simply have to select the members we want to validate by checking the corresponding checkboxes, and then click on the “Apply Business Rules” button, the third from the right, just below the “CountryRegion” title.

We haven’t defined any business rule for CountryRegion, so the data will surely pass validation, changing the yellow question mark to a green tick mark.

Since CountryRegion has an Explicit Hierarchy defined, we may also want to check Consolidated

Members. As the image shows, we can do this by selecting the “Consolidated” option, or we can

use the Hierarchies menu to access the Explicit Hierarchy and put the imported Leaf Members

under the correct levels:


Figure 31: THE COUNTRYREGION EXPLICIT HIERARCHY PAGE

By using the drag-and-drop or cut-and-paste features, we can move the “Canada” and “United States” leaf members under the “North America” Consolidated Member, and the “United Kingdom” leaf member under the “Western Europe” Consolidated Member.

Export Data

Now that we have imported, validated and consolidated our data, we may need to make it

available to external applications. This can be done by creating a Subscription View from the

“Integration Management” functional area.

Figure 32: CREATING A SUBSCRIPTION VIEW

A Subscription View needs to be defined with a name, the version of the data that has to be exported (we’ll introduce versioning in the next section of this chapter) and, of course, the Entity or the Derived Hierarchy we want to export. The Format option allows us to define which kind of data we want to export: Leaf Members, Consolidated Members, the levels of an Explicit Hierarchy and so on.

We want to export the CountryRegion Master Data, along with the information from its Explicit Hierarchy: that’s why we have to choose “Explicit Levels” in the Format combo-box and specify how many levels we want to export. Three levels are enough, since our Explicit Hierarchy is made up of the Continent – Zone – Country levels.

After saving the Subscription View, Master Data Services creates a SQL Server view for us. In the MDSBook database, we created the Customer_CountryRegion view, which is ready to be used by external applications to get that Master Data. Here is an example of the Transact-SQL code used to query the view:

SELECT * FROM [MDSBook].[mdm].[Customer_CountryRegion];

The view, along with other columns, has “flattened” the hierarchy using three levels, as

requested:

Figure 33: DATA FROM A SUBSCRIPTION VIEW


Multiple Versions of Data

Now that Master Data has been made available to other systems via Subscription Views, we must guarantee to those systems that the Master Data won’t change suddenly and without notice. To “freeze” Master Data so that its values won’t change, we have to introduce the concept of versions.

Versions provide a way to create logical snapshots of the Master Data; this way, we can still make changes to Master Data while keeping track of the changes over time, and while providing to the subscribing systems the version they expect to get.

Versioning is completely integrated into the system and it is always active. At the very

beginning, only one version of data exists and the status is set to Open. This first version is

automatically created by Master Data Services as soon as the Model is created.

“Open” means that everyone (who has the required permission) can make changes to the

model and to its data.

Figure 34: THE FIRST VERSION OF DATA IS OPEN

Once the model is complete and has been filled with data, the administrator needs to start the validation process. Of course, to validate data, it has to be stable. To be sure that no one can change data while someone is validating it, the version must be set to Locked. This operation is done in the Version Management functional area, shown in Figure 35.


Figure 35: LOCKING A VERSION

Once the version has been locked, validation can take place. To validate all the data at once, the Validate Version page, reachable via the menu item with the same name, runs all the defined Business Rules on all the data:

Figure 36: VALIDATING A VERSION

After the data has been validated, it’s finally possible to set the version to the Committed status, using the initially greyed-out button (the second one in the top left of the image above).


Once the version has reached this status, its data cannot be changed anymore. If we need to change the Master Data again, we’ll also need to create a new version of it. New versions are created from the Manage Versions page, by selecting which committed version of the data we want to start working from.

Figure 37: THE MANAGE VERSIONS PAGE

Now that we have a new version, we can start to work on it freely, sure that all the reference data stored in the previous version will remain untouched, allowing the systems that rely upon it to keep working correctly and without problems.


MDS Database Schema

In the last part of this chapter, we are going to take a look at the MDS database schema. This should help us understand how it is structured and how it works, since an intimate knowledge of the final store of our Master Data can help us create better solutions to handle it.

Each time a new MDS solution is created using the MDS Configuration Manager tool, a new database is created too.

This database will contain all the models we define in this specific MDS solution, besides all the system objects needed by MDS for its own functioning. All these objects are located in the mdm schema. Among all the tables contained here, several are particularly interesting and deserve more attention.

The first of these tables is mdm.tblModel. Each time a new model is created within the selected MDS Application, a new row is inserted here. This action also generates an ID value that uniquely identifies the model and that is used throughout the database. From now on, this ID will be referred to as model_id.

A model’s Entities also reside in tables; the main table, where the list of all Entities is stored, is mdm.tblEntity. Obviously, in this case too each entity has its own ID value, which will be called entity_id. The table contains all the Entities defined in the MDS Application, and each Entity is related to the Model in which it has been defined through the model_id value.

Each time a new Entity is added to the model, at least two dedicated tables are also created. The names of these tables are generated using this rule:

mdm.tbl_<model_id>_<entity_id>_<member_type>


The first two placeholders, model_id and entity_id, have already been defined; member_type can have six different values:

EN: table that holds entity data

MS: table that stores security data

HR: table that stores Explicit Hierarchy data

HP: table that stores the relationships in the Explicit Hierarchy

CN: table that holds the Collections defined on the Entity

CM: table that keeps track of which Entity Member is in which Collection

Allowed member types are stored in the mdm.tblEntityMemberType table.

As soon as an entity is created, only the “EN” and “MS” tables get created. All the other tables will be created only if the functionality they support is used. The following query shows an example of how to find all the entities in an MDS database by querying the mdm.tblEntity table, with the results shown in the figure right after the query.

SELECT ID, Model_ID, Name, EntityTable, SecurityTable
FROM mdm.tblEntity;

Figure 38: ENTITIES IN THE MDM.TBLENTITY TABLE

As we know, each Entity can have its own set of Attributes. These attributes are stored as columns in the “EN” entity table. Each time a new attribute is added to an entity, a new column is also added to the related table. The name of the created column follows a naming convention similar to the one already seen for table names:


uda_<model_id>_<attribute_id>

The only exception to this rule is for the Code and Name attributes. Since they are mandatory for any entity, they are mapped directly to columns with the same names. In the following figure, we can see sample results of a query over an example “EN” entity table.

Figure 39: ATTRIBUTES OF AN “EN” ENTITY TABLE

The attribute_id, like the other IDs mentioned so far, is generated automatically by the system each time a new row is inserted. The table that keeps track of all the attributes in the database is mdm.tblAttribute. This table stores all the metadata needed by attributes, such as the DisplayName and the mapping to the column name used in the entity table:

Figure 40: PARTIAL CONTENT OF THE MDM.TBLATTRIBUTE TABLE

When an entity has a Domain-Based attribute, a Foreign Key constraint is automatically created

between the referenced and the referencing entities in order to support the relationship.
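As a hedged side note, one way to see these generated constraints from SQL Server itself is to query the standard catalog views (ordinary SQL Server metadata, not an MDS-specific API):

-- Foreign keys between the generated entity tables in the mdm schema.
SELECT fk.name                               AS ForeignKeyName
      ,OBJECT_NAME(fk.parent_object_id)      AS ReferencingTable
      ,OBJECT_NAME(fk.referenced_object_id)  AS ReferencedTable
FROM sys.foreign_keys AS fk
WHERE OBJECT_SCHEMA_NAME(fk.parent_object_id) = N'mdm';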


Staging Tables

As we learned in the previous section, each time a new entity is created, an associated database table gets created as well. In such a dynamic, ever-changing environment, creating a standard solution to import data into the Master Data Services database, so that it fits into the defined model, could be a nightmare.

Luckily, Master Data Services comes to the rescue, giving us three standard tables, all in the mdm schema, that have to be used to import data into entities, to populate their attributes and to define the relationships of hierarchies.

These three tables are called staging tables and are the following:

mdm.tblStgMember

mdm.tblStgMemberAttribute

mdm.tblStgRelationship

By using them, it’s possible to import data into any model we have in our Master Data Services database. A fourth staging table exists, but it isn’t used actively in the import process; it just reports the status and the result of the batch process that moves data from the aforementioned staging tables into entities, attributes and hierarchies. This table is mdm.tblStgBatch.

You can populate the staging tables with regular T-SQL inserts and bulk inserts. After the staging tables are populated, you can invoke the batch staging process from the Master Data Manager Web application. After the batch process is finished, you should check for possible errors. Please refer to Books Online for more details about populating your MDS database through the staging tables.
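A minimal sketch of that check, deliberately selecting all columns because the exact column names of the batch log are not listed here (verify them in your own MDS database or in Books Online):

-- Inspect the staging batch log after processing.
SELECT *
FROM MDSBook.mdm.tblStgBatch;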

By following this process, with the aid of the three staging tables, it is possible not only to populate entities with fresh data (which means creating new members), but also to:


Update member values

Delete and reactivate members

Create collections

Add members to collections

Delete and reactivate collections

Update attribute values

Designate relationships in explicit hierarchies


Summary

In this chapter, the quick walk-through of the Master Data Manager Web application gave us an overview of MDS capabilities. We also learned how to work with Master Data Manager. Of course, not everything is as simple as we have shown in this chapter. Nevertheless, this chapter helps us match the elements of a master data management solution with a practical implementation using Master Data Services.

The first part of this chapter is just a quick guide to installing MDS. This should help readers start testing it and use this chapter as a quick walk-through of MDS. We then defined MDS Models and the elements of a Model. We have also shown how to import and export data. Finally, we discussed versioning of our master data.

We have to mention that for a real-life solution, we should expect much more work on importing data. Before importing data into an MDS model, we have to profile it, check it for quality, and cleanse it. Typical actions before importing data also include merging the data from multiple sources and de-duplicating it.

In the next chapter, we are going to explain how we can check data quality in our existing systems by exploiting tools and programming languages included in and supported by the SQL Server suite. The last chapter of this book is dedicated to merging and de-duplicating data.

We are not going to spend more time on the Master Data Services application in this book. This is a general MDM book, and MDS is just one part of a general MDM solution. In addition, the MDS that comes with SQL Server 2008 R2 is just a first version and is, in our opinion, not yet suitable for production use in an enterprise. We suggest waiting for the next version of MDS for real-life scenarios. Nevertheless, the understanding of how MDS works in the current version that we gained in this chapter should definitely help us with successful deployment and usage of the next version.


References

Installing and Configuring Master Data Services on MSDN

MSFT database samples on CodePlex

How Permissions Are Determined (Master Data Services) on MSDN

Master Data Services Team Blog

Master Data Services Database (Master Data Services) on MSDN


Chapter 3: Data Quality and SQL Server 2008 R2 Tools

Dejan Sarka

Before we start implementing any centralized MDM solution, we should understand the quality of our existing data. If we transfer and merge bad data into our MDM solution, we will have bad data from the start, and over time the situation can get worse. Therefore, intensive data profiling should always start an MDM project.

Besides getting an impression of our data quality, we should also try to find the root cause for

bad data. In chapter one of this book, we have already seen that the five whys is a known

technique used to find the root cause. However, the five whys technique works well for soft

data quality dimensions.

In this chapter, we are going to focus on hard data quality dimensions, i.e. on measurable

dimensions. Namely, we are going to deal with completeness, accuracy and information. We are

going to show how we can use SQL Server 2008 R2 tools for measuring data quality. In addition,

we are going to show how we can use these tools to find the root cause for bad data. We will

also see that through data profiling we can make some conclusions about schema quality.

In this chapter, we are going to introduce the following:

Measuring the completeness;

Profiling the accuracy;

Measuring information;

Using other SQL Server tools for data quality.


Measuring the Completeness

In order to get some value out of the data, our data has to be complete. This means that the

data should be deep enough to cover all of the business needs, and scoped for the business

tasks.

Completeness starts with schema completeness. Of course, if our schema does not have a place

for a specific piece of data provided, we cannot insert that piece of data. Normalization of a

relational schema is a formal mathematical process that guarantees the completeness of the

schema. In addition, normalization eliminates redundancy. Nevertheless, we are not going to

talk about data modeling here. We are going to show how we can measure the completeness

data quality dimension, and find the root cause for incomplete data.

Population completeness is the first kind of completeness that can easily be measured. We can use two different assumptions here:

Closed world assumption: all tuples that satisfy the relation predicates are in the relation;

Open world assumption: the relation may be incomplete, so population completeness is defined against a reference relation.

If we have the reference relation, we can measure population completeness by simply comparing the number of rows in our relation and in the reference relation. In fact, just knowing the number of rows in the reference relation is sufficient to measure our population completeness.
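A minimal sketch of that ratio, assuming a hypothetical reference table dbo.ProductReference that holds the reference population for our products:

SELECT 100.0 *
       (SELECT COUNT(*) FROM Production.Product) /
       (SELECT COUNT(*) FROM dbo.ProductReference) AS PctPopulationComplete;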

In a relational database, the presence of NULLs is what defines completeness. NULLs are the standard placeholders for unknown values. We can measure attribute completeness, i.e. the number of NULL values in a specific attribute; tuple completeness, i.e. the number of unknown attribute values in a tuple; and relation completeness, i.e. the number of tuples with unknown attribute values in the relation.


In the relational model, we have only a single placeholder for unknown attribute values, the NULL. However, we can have NULLs either because attributes are not applicable for a subset of rows, or because the values are really unknown. Not-applicable values can also be a signal that the schema is not of the best possible quality; namely, they can tell us that some subtypes should be introduced in the schema. Inadequate schema quality is one possible root cause for NULLs in our database.

In XML data, we do not have NULLs. XML has a special xsi:nil placeholder for unknown values. In addition, a complete element or attribute can be missing from an XML instance, and this is missing data as well. We have to use the XQuery language inside Transact-SQL (T-SQL), through the XML data type methods, in order to find incomplete data inside XML.

When we start data profiling, we usually meet a lot of suspicious or even bad data right away. In order to find the root cause, we have to narrow down our data profiling activities and searches. The techniques shown in this chapter help us not only with data profiling, but also with narrowing down the problem in order to find the root cause.

Attribute Completeness

Let us start with a simple example – finding the number of NULLs in an attribute. We are going

to analyze attributes from the Production.Product table from the AdventureWorks2008R2 demo

database.

The samples are based on the usage of the SQL 2008 R2 sample databases, which can be

downloaded from CodePlex.

You can execute queries in SQL Server Management Studio (SSMS), the tool shipped with SQL

Server. If you are not familiar with this tool yet, do please refer to Books Online.

First, let’s find which columns are nullable, i.e. allow null values, in the Production.Product table

with the following query that uses the INFORMATION_SCHEMA.COLUMNS view:


SELECT COLUMN_NAME, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = N'Production'
  AND TABLE_NAME = N'Product';

The nullable columns are Color, Size, SizeUnitMeasureCode, WeightUnitMeasureCode, Weight, ProductLine, Class, Style, ProductSubcategoryID, ProductModelID, SellEndDate and DiscontinuedDate. The next query shows an overview of the first 100 rows, with only the nullable columns and the columns that help you identify rows, namely ProductID and Name:

SELECT TOP 100
       ProductId
      ,Name
      ,Color
      ,Size
      ,SizeUnitMeasureCode
      ,WeightUnitMeasureCode
      ,Weight
      ,ProductLine
      ,Class
      ,Style
      ,ProductSubcategoryID
      ,ProductModelID
      ,SellEndDate
      ,DiscontinuedDate
FROM Production.Product;

Partial results, showing only a couple of columns and rows, are here:

ProductId  Name                   Color  Size  SizeUnitMeasureCode  ...
1          Adjustable Race        NULL   NULL  NULL                 ...
2          Bearing Ball           NULL   NULL  NULL                 ...
3          BB Ball Bearing        NULL   NULL  NULL                 ...
4          Headset Ball Bearings  NULL   NULL  NULL                 ...
316        Blade                  NULL   NULL  NULL                 ...
...        ...                    ...    ...   ...                  ...

You can easily see that there are many unknown values in this table. With a simple GROUP BY

query, we can find the number of nulls, for example in the Size column. By dividing the number

of NULLs by the total number of rows in the Production.Product table, we can also get the

proportion of NULLs.


The following queries show the absolute number of NULLs in the Size column, and the

proportion of NULLs:

SELECT Size
      ,COUNT(*) AS cnt
FROM Production.Product
WHERE Size IS NULL
GROUP BY Size;

SELECT 100.0 *
       (SELECT COUNT(*)
        FROM Production.Product
        WHERE Size IS NULL
        GROUP BY Size) /
       (SELECT COUNT(*)
        FROM Production.Product) AS PctNullsOfSize;

By running these two queries, we find out that there are 293 NULLs in the Size column, which is 58% of all values. This is a huge proportion; from this percentage alone, we can conclude that size is not applicable for all products. For products, size is a common attribute, so we would expect mostly known values. We could continue checking the other nullable attributes with similar queries; however, as we will see later, the SQL Server Integration Services (SSIS) Data Profiling task is well suited to finding NULLs in a column.

XML Data Type Attribute Completeness

So far we have shown how to check whether a complete value is missing. However, if we have a column of the XML data type, a single element could be missing inside the XML value. Of course, we can then say that the XML value is not complete. The question is how to find the rows with a specific element missing. In SQL Server, the XML data type supports some useful methods. These methods are:

methods. These methods are:

query() method, which returns part of the xml data in xml format;

value() method, which returns a scalar value of an element or an attribute of an

element;

exist() method, which returns 1, if an element or an attribute exists in an xml

instance, 0 if an element or an attribute does not exist, and null if the xml data

type instance contains null;

Page 111: DQMDM_SQL2008R2

Chapter 3: Data Quality and SQL Server 2008 R2 Tools

Page 110

modify() method, which allows you to insert an element or an attribute, delete

an element or an attribute, or update value of an element or an attribute;

nodes() method, which allows you to shred an xml data type instance to

relational data.

All of the XML data type methods accept XQuery as an argument. XQuery expressions allow us to traverse the nodes of an XML instance to find a specific element or attribute. The value() method accepts an additional parameter, the target SQL Server scalar data type. For the modify() method, XQuery is extended to allow modifications, and is called XML Data Modification Language, or XML DML. You can learn more about XQuery expressions and XML DML in Books Online.
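As a small, self-contained illustration of the modify() method (a hedged sketch working on a local variable, not on the Production.Product data), XML DML can be used to add an element that is missing from an instance:

DECLARE @x xml = N'<Product><Name>Blade</Name></Product>';
-- Insert a Color element at the end of the Product element.
SET @x.modify('insert <Color>Silver</Color> as last into (/Product)[1]');
SELECT @x AS FixedXml;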

For checking whether an element is present in an XML instance, the exist() method is the right one. First, let us create an XML instance in a query over the Production.Product table by using the FOR XML clause:

SELECT p1.ProductID
      ,p1.Name
      ,p1.Color
      ,(SELECT p2.Color
        FROM Production.Product AS p2
        WHERE p2.ProductID = p1.ProductID
        FOR XML AUTO, ELEMENTS, TYPE) AS ColorXml
FROM Production.Product AS p1
WHERE p1.ProductId < 319
ORDER BY p1.ProductID;

The subquery in the SELECT clause generates an XML data type column from the Color attribute. For the sake of brevity, the query is limited to returning seven rows only. The first five have NULLs in the Color attribute, and the XML column returned does not include the Color element. The last two rows include the Color element.

We are going to use the previous query inside a CTE in order to simulate a table with an XML column where the Color element is missing in some rows. The outer query uses the exist() method to check for the presence of the Color element.


WITH TempProducts AS
(
    SELECT p1.ProductID
          ,p1.Name
          ,p1.Color
          ,(SELECT p2.Color
            FROM Production.Product AS p2
            WHERE p2.ProductID = p1.ProductID
            FOR XML AUTO, ELEMENTS, TYPE) AS ColorXml
    FROM Production.Product AS p1
    WHERE p1.ProductId < 319
)
SELECT ProductID
      ,Name
      ,Color
      ,ColorXml.value('(//Color)[1]','nvarchar(15)') AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color') = 0;

The outer query correctly finds the first five rows from the CTE. The XML specification also allows an element to be present but have no value. In such a case, a special attribute, xsi:nil, should appear inside the nillable (the XML term for nullable) element. Therefore, in order to find all incomplete XML instances, we also have to check for the xsi:nil attribute. In order to check for the xsi:nil attribute, we have to create it in some rows first. We are going to slightly change the last query. In the CTE, we are going to include the XSINIL keyword in the FOR XML clause of the subquery. This generates the Color element for every row; however, when the color is missing, this element has an additional xsi:nil attribute. Then, with the outer query, we have to check whether this attribute appears in the Color element:

WITH TempProducts AS
(
    SELECT p1.ProductID
          ,p1.Name
          ,p1.Color
          ,(SELECT p2.Color
            FROM Production.Product AS p2
            WHERE p2.ProductID = p1.ProductID
            FOR XML AUTO, ELEMENTS XSINIL, TYPE) AS ColorXml
    FROM Production.Product AS p1
    WHERE p1.ProductId < 319
)
SELECT ProductID
      ,Name
      ,Color
      ,ColorXml.value('(//Color)[1]','nvarchar(15)') AS ColorXmlValue
FROM TempProducts
WHERE ColorXml.exist('//Color[@xsi:nil]') = 1;

Of course, the CTE query returns the same small sample of seven rows of products with a ProductID lower than 319. The outer query correctly finds the first five rows from the CTE.


Simple Associations among NULLs

It would be too easy if we could finish our research by just calculating the proportion of unknown values. In order to improve the quality of our data, we have to discover the cause of these NULLs. The first thing we can check is whether NULLs in one column lead to NULLs in another one. A relation between columns could indicate two things about the schema: the schema might not be in third normal form, i.e. we have a functional dependency between two non-key columns, or the values of a column are not applicable for all rows, and thus we should probably introduce subtypes. Let’s first check whether NULLs in one column lead to NULLs in another.

We need to decide which columns need checking. The answer depends on the information we can get from the names of the columns and from the business experts. If the naming convention in the database is descriptive, we can easily find the potential candidate columns for checking. If the naming convention is bad, we have to rely on business experts to narrow down our checking. If we do not have a business expert at hand and we cannot get a clue from the column names, then we have a problem: we have to check all pairs of columns.

Fortunately, in the AdventureWorks2008R2 demo database, the naming convention is quite

good. If we take a look at the Size and SizeUnitMeasureCode columns, their names tell us they

are somehow related. Therefore, let’s check whether NULLs are related in these two columns

with the following query:

SELECT 1 AS ord
      ,Size
      ,SizeUnitMeasureCode
      ,COUNT(*) AS cnt
FROM Production.Product
GROUP BY Size, SizeUnitMeasureCode
UNION ALL
SELECT 2 AS ord
      ,Size
      ,SizeUnitMeasureCode
      ,COUNT(*) AS cnt
FROM Production.Product
GROUP BY SizeUnitMeasureCode, Size
ORDER BY ord, Size, SizeUnitMeasureCode;

Before showing the results of this query, let us comment on it. The result set unions two result sets: the first SELECT groups rows by Size and then by SizeUnitMeasureCode, while the second SELECT does the opposite and groups rows by SizeUnitMeasureCode first and then by Size. The first column of the result is just a constant used to get a guaranteed order in the combined result set.

In SQL Server 2008 and 2008 R2, we can rewrite this query in a shorter way using the new GROUPING SETS subclause of the GROUP BY clause. It is not just that the query is shorter; SQL Server is usually able to find a more efficient execution plan as well, and thus execute the query faster. However, in order to guarantee the order of the result, the query with GROUPING SETS becomes more complicated, and also less efficient. As this is not a T-SQL programming book, we are going to use the query above, written without GROUPING SETS. Although that query is not as efficient as it could be, it is more readable, and thus more suitable for explaining the concepts of narrowing down the search for the reasons for NULLs. A sketch of the GROUPING SETS syntax is shown below.
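This is an illustration of the syntax only, not an exact equivalent of the UNION ALL query above (it adds single-column subtotals, and guaranteeing the ordering would still need extra handling):

SELECT Size
      ,SizeUnitMeasureCode
      ,COUNT(*) AS cnt
FROM Production.Product
GROUP BY GROUPING SETS
(
    (Size, SizeUnitMeasureCode),  -- the column pair, as in both branches above
    (Size),                       -- optional extra subtotals
    (SizeUnitMeasureCode)
);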

The abbreviated results of the UNION ALL query are as follows:

ord  Size  SizeUnitMeasureCode  cnt
1    NULL  NULL                 293
1    38    CM                   12
...  ...   ...                  ...
1    62    CM                   11
1    70    NULL                 1
1    L     NULL                 11
1    M     NULL                 11
1    S     NULL                 9
1    XL    NULL                 3
2    NULL  NULL                 293
2    38    CM                   12
...  ...   ...                  ...
2    62    CM                   11
2    70    NULL                 1
2    L     NULL                 11
2    M     NULL                 11
2    S     NULL                 9
2    XL    NULL                 3

Note the pattern in the results: whenever Size is NULL, SizeUnitMeasureCode is also NULL; however, the opposite is not true. In addition, SizeUnitMeasureCode is NULL for all sizes expressed as character codes, like L and M, and not NULL for numeric sizes, except for size 70. We can conclude that there is a strong relation between the NULLs in these two columns. Of course, the size unit measure code tells us in which unit the size is measured when the size is numeric; for example, in the second row of the result set, we can see that size 38 is measured in centimeters. When the size is expressed as a character code, the measure unit makes no sense. However, we can see that something is wrong with size 70: this one should have a measure unit. Either the measure unit is missing, or size 70 should not be in the relation, as it is potentially erroneous. By researching unknown values, we can find potential errors. In addition, when looking for the root cause of the unknown values, we can omit the SizeUnitMeasureCode column; we already know where the NULLs in this column come from. Therefore, from this pair we can limit our research to the Size column only.

If we do the same analysis for the Weight and WeightUnitMeasureCode columns, we will find that we can omit the WeightUnitMeasureCode column from further research as well. Finally, we can do the same thing for the ProductSubcategoryID and ProductModelID columns, and we will find out that whenever ProductSubcategoryID is NULL, ProductModelID is NULL as well. Therefore, we can also omit ProductModelID from further completeness checking.

How can we prevent missing size measure units? The answer lies in the schema. We can introduce a check constraint on the SizeUnitMeasureCode column that would not accept NULL values for numeric sizes, or we can create a trigger on the Production.Product table that checks whether the measure unit is known for numeric sizes. Finally, we can further normalize the schema. It is obvious that the SizeUnitMeasureCode and Size columns are functionally dependent. Figure 12 shows the normalized version of the design for size and size unit measure.

[Diagram: three related tables – Products (PK ProductId; Name, FK SizeId, other columns), Sizes (PK SizeId; Size, FK SizeTypeId) and SizeTypes (PK SizeTypeId; SizeTypeName, MeasureUnit).]

Figure 12: SIZE AFTER NORMALIZATION

In a design like the one in the image above, once we insert the correct measure units for the appropriate size types, we cannot have erroneous or missing size measure units anymore. Of course, if the values in the lookup tables, Sizes and SizeTypes, are incorrect, the measure units would be wrong for all products of a specific size. However, it is easier to maintain small lookup tables and keep only correct values in them. The question here is whether it is worth changing the database design because of the problem with a missing size measure unit. Changing the database design could lead to upgrading one or more applications, which could be very expensive. We could prevent new incorrect values with a check constraint or a trigger instead.
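A minimal sketch of such a check constraint on the existing table (the constraint name is hypothetical, and WITH NOCHECK is used so that the existing row with the missing unit for size 70 does not block its creation):

ALTER TABLE Production.Product WITH NOCHECK
ADD CONSTRAINT CK_Product_NumericSizeHasUnit
CHECK (Size IS NULL
       OR ISNUMERIC(Size) = 0           -- character sizes (L, M, ...) need no unit
       OR SizeUnitMeasureCode IS NOT NULL);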

Even with a check constraint or a trigger, we could end up with problems in our applications. If our applications do not implement exception handling, we might end up with a crashed application when an erroneous size measure unit is inserted. Preventing errors should be our ultimate goal; however, in many cases, especially when we have to deal with legacy applications, it might be cheapest to simply find the errors and correct them on a regular schedule.

Here, we just checked whether a NULL in one column leads to a NULL in another column. What if a known value in one column leads to a NULL in another column? This would mean that the second attribute is not applicable for all values of the first attribute, and subtype tables would probably be needed. In the Production.Product table, the name of the ProductSubcategoryID column makes us suspect that it might be a potential candidate for introducing subtypes based on its values. Let's check whether the Class attribute is applicable for all product subcategories using the following query:

WITH Classes AS
(
    SELECT ProductSubcategoryID
          ,Class
          ,COUNT(*) AS NRowsInClass
    FROM Production.Product
    GROUP BY ProductSubcategoryId, Class
)
SELECT c.ProductSubcategoryID
      ,c.Class
      ,c.NRowsInClass
      ,CAST(ROUND(100.0 * c.NRowsInClass /
                  SUM(c.NRowsInClass)
                      OVER(PARTITION BY c.ProductSubcategoryId), 0)
            AS decimal(3,0)) AS PctOfSubCategory
FROM Classes c
ORDER BY PctOfSubCategory DESC;

This query uses a Common Table Expression (CTE) to calculate the number of rows per subcategory and class. The outer query uses the CTE to calculate, with the OVER clause, the percentage of rows of a class within a specific subcategory out of the total number of rows of that subcategory. The result set is ordered by descending percentage. If the percentage is close to 100, it means that one class is prevalent in a subcategory. If the value of the class is NULL, it means that the class is probably not applicable for the whole subcategory. Here are partial results of this query.

ProductSubcategoryID  Class  NRowsInClass  PctOfSubCategory
6                     NULL   2             100
7                     NULL   1             100
9                     NULL   2             100
18                    NULL   3             100
...                   ...    ...           ...

We can easily see that the Class attribute is not applicable for all subcategories. When researching the reasons for NULLs in the Class attribute further, we can exclude the rows where the values are not applicable. We can also check which other attributes are not applicable for some subcategories. Note that the subcategories leading to NULLs in Class are not necessarily the same as the subcategories leading to NULLs in other attributes. For example, subcategory 4 leads to NULLs in Color, but not to NULLs in the Class attribute. Therefore, in order to continue researching the reason for NULLs on a narrowed-down number of rows, excluding the rows where the values are not applicable, we should run a separate investigation for each nullable attribute. We will focus on the Class attribute and continue with the rows where Class is applicable.
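As a quick illustration of such a per-attribute check, the following hedged sketch lists the subcategories in which every product has a NULL Color, i.e. the subcategories for which Color is most likely not applicable (the same pattern we use for Class below):

  -- Subcategories with no non-NULL Color at all: Color is probably not applicable there.
  SELECT ProductSubcategoryId
  FROM Production.Product
  WHERE ProductSubcategoryId IS NOT NULL
  GROUP BY ProductSubcategoryId
  HAVING COUNT(DISTINCT Color) = 0;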

The following query limits the rows to those where the Class attribute is applicable, and limits the column list to the interesting columns only.

SELECT ProductId,
       Name,
       Color,
       Size,
       Weight,
       ProductLine,
       Class,
       Style
FROM Production.Product
WHERE (Color IS NULL OR Size IS NULL OR Weight IS NULL
       OR ProductLine IS NULL OR Class IS NULL OR Style IS NULL)
  AND (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0));

The ProductModelId column is excluded because it is NULL whenever ProductSubcategoryId is NULL; WeightUnitMeasureCode and SizeUnitMeasureCode are not needed, as they are NULL whenever Weight or Size is NULL. In addition, SellEndDate and DiscontinuedDate are not interesting for completeness checking, because from their names we can expect that most rows should have NULLs in these two columns, and we can easily imagine that there is a business reason behind those NULLs. Of course, these two columns could be interesting from the accuracy perspective. Six interesting columns are left from the completeness perspective.

The WHERE clause limits the result to rows that have an unknown value in any of the six interesting columns, while excluding the rows where the Class column is not applicable. The final subquery in the WHERE clause finds all subcategories for which Class is not applicable; in such a subcategory, there is not a single distinct non-NULL class. Note that the COUNT(DISTINCT Class) aggregate function, unlike COUNT(*), eliminates NULLs from the aggregation. Therefore, if the count of distinct classes in a subcategory is zero, the subcategory contains only NULL classes.

Tuple and Relation Completeness

For tuple completeness, we can count how many columns have unknown values. For relation completeness, we can use the proportion of rows with NULLs to the total number of rows. Before we start measuring, let's create a scalar user-defined function that accepts a parameter of the sql_variant data type and returns one if the parameter is NULL and zero if it is not.

CREATE FUNCTION dbo.ValueIsNULL (@checkval sql_variant)
RETURNS tinyint
AS
BEGIN
  DECLARE @retval int;
  IF (@checkval IS NULL)
    SET @retval = 1
  ELSE
    SET @retval = 0;
  RETURN(@retval)
END;

With this function, it is easy to write a query that calculates the number of NULLs in each interesting row. Note that the following query refers to the interesting columns only, and limits the result set to rows with an applicable Class attribute only.

SELECT ProductId,
       Name,
       dbo.ValueIsNULL(Color) + dbo.ValueIsNULL(Size) +
       dbo.ValueIsNULL(Weight) + dbo.ValueIsNULL(ProductLine) +
       dbo.ValueIsNULL(Class) + dbo.ValueIsNULL(Style)
         AS NumberOfNULLsInRow
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0))
ORDER BY NumberOfNULLsInRow DESC;

The abbreviated result from the query is as follows:

ProductId Name NumberOfNULLsInRow

802 LL Fork 5

803 ML Fork 5

804 HL Fork 5

… … …

We can say that tuples with more NULLs are less complete than tuples with fewer NULLs. We can even export this data to a staging table, repeat the tuple completeness measurement on a schedule, and compare the measures to track improvement over time. Of course, this makes sense only if we can join the measures on some common identification; we need something that uniquely identifies each row. In the Production.Product table, there is a primary key on the ProductId column. In a relational database, every table should have a primary key, and the key should not change if you want to make comparisons over time.
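A minimal sketch of such a staging table, with names of our own choosing, could look like the following; each scheduled run stamps the tuple completeness measure with the measurement date, so that snapshots can later be joined on ProductId and compared:

  -- Hypothetical staging table for scheduled tuple-completeness snapshots.
  CREATE TABLE dbo.TupleCompletenessLog
  (
    MeasureDate        date    NOT NULL,
    ProductId          int     NOT NULL,
    NumberOfNULLsInRow tinyint NOT NULL,
    CONSTRAINT PK_TupleCompletenessLog PRIMARY KEY (MeasureDate, ProductId)
  );

  INSERT INTO dbo.TupleCompletenessLog (MeasureDate, ProductId, NumberOfNULLsInRow)
  SELECT CAST(GETDATE() AS date),
         ProductId,
         dbo.ValueIsNULL(Color) + dbo.ValueIsNULL(Size) + dbo.ValueIsNULL(Weight) +
         dbo.ValueIsNULL(ProductLine) + dbo.ValueIsNULL(Class) + dbo.ValueIsNULL(Style)
  FROM Production.Product;  -- the applicability filter from the query above can be added as well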

For relation completeness, we can use two measures: the total number of NULLs in the relation and the number of rows with a NULL in any of the columns. The following query performs both calculations for the Production.Product table, limited to rows with an applicable Class attribute only.

SELECT 'Production' AS SchemaName,
       'Product' AS TableName,
       COUNT(*) AS NumberOfRowsMeasured,
       SUM(dbo.ValueIsNULL(Color) + dbo.ValueIsNULL(Size) +
           dbo.ValueIsNULL(Weight) + dbo.ValueIsNULL(ProductLine) +
           dbo.ValueIsNULL(Class) + dbo.ValueIsNULL(Style)
          ) AS TotalNumberOfNULLs,
       SUM(CASE WHEN (dbo.ValueIsNULL(Color) + dbo.ValueIsNULL(Size) +
                      dbo.ValueIsNULL(Weight) + dbo.ValueIsNULL(ProductLine) +
                      dbo.ValueIsNULL(Class) + dbo.ValueIsNULL(Style)
                     ) > 0
                THEN 1 ELSE 0
           END) AS NumberOfRowsWithNULLs
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0));

The result of the relation completeness query is as follows:

SchemaName  TableName  NumberOfRowsMeasured  TotalNumberOfNULLs  NumberOfRowsWithNULLs
Production  Product    237                   222                 61

We can continue with such measurements for each table in the database. We can also store the results in a data quality data warehouse, as proposed in chapter 1 of this book. This way, we can measure improvements over time. After all, one of the most important goals when implementing an MDM solution is data quality improvement.
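The exact schema of such a data quality data warehouse depends on the design from chapter 1; a hedged, minimal sketch of a measurement table for relation completeness could be as simple as the following, with the previous query's result inserted on every scheduled run:

  -- Hypothetical measurement table for relation-completeness results.
  CREATE TABLE dbo.RelationCompletenessLog
  (
    MeasureDate           date    NOT NULL,
    SchemaName            sysname NOT NULL,
    TableName             sysname NOT NULL,
    NumberOfRowsMeasured  int     NOT NULL,
    TotalNumberOfNULLs    int     NOT NULL,
    NumberOfRowsWithNULLs int     NOT NULL,
    CONSTRAINT PK_RelationCompletenessLog
      PRIMARY KEY (MeasureDate, SchemaName, TableName)
  );
  -- On each run, INSERT the result of the relation completeness query above,
  -- stamped with CAST(GETDATE() AS date).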

Multivariate Associations among NULLs

We presented a way to find whether there is an association between missing values in two columns. A slightly more advanced question is whether there is an association among missing values in multiple columns; for example, if a value is missing in all or any of the Color, Class, and Style columns of the Production.Product table, does this lead to a missing value in the Weight column?


There are many multivariate analyses known from statistics and data mining. SQL Server Analysis Services (SSAS) brings very strong data mining support; all of the most popular algorithms are implemented. For the problem mentioned, Naïve Bayes is an appropriate algorithm. It checks associations between pairs of variables, and then calculates combined associations, i.e. associations between multiple input variables and a target variable. The target variable is called the predictable variable, because we can use the model to predict its values in new cases when we have values for the input variables.

In SSAS, we can declare all variables as both input and predictable, thus getting one Naïve Bayes analysis per variable. Microsoft also ships data mining viewers, controls that graphically show us the knowledge learned by the algorithm. The Naïve Bayes Dependency Network viewer shows all links between all variables on a single screen; it shows all models simultaneously. This way we can easily find the most important links and thus understand which target variable has the strongest associations with which input variables.

Let’s start working. First, we need a view in the AdventureWorks2008R2 database that selects

only the interesting columns for applicable values of the Class column of the Production.Product

table, as in the following code:

CREATE VIEW dbo.ProductsMining
AS
SELECT ProductId,
       Name,
       dbo.ValueIsNULL(Color) AS Color,
       dbo.ValueIsNULL(Size) AS Size,
       dbo.ValueIsNULL(Weight) AS Weight,
       dbo.ValueIsNULL(ProductLine) AS ProductLine,
       dbo.ValueIsNULL(Class) AS Class,
       dbo.ValueIsNULL(Style) AS Style
FROM Production.Product
WHERE (ProductSubcategoryId NOT IN
        (SELECT ProductSubcategoryId
         FROM Production.Product
         WHERE ProductSubcategoryId IS NOT NULL
         GROUP BY ProductSubcategoryId
         HAVING COUNT(DISTINCT Class) = 0));

Next, we need to create an Analysis Services project in Business Intelligence Development

Studio (BIDS). We can do this by completing the following steps:


1. Open the BIDS.

2. In BIDS, create a new SSAS project. Name the solution MDSBook_Ch03, the project MDSBook_Ch03_SSAS, and save the solution to any folder you want (we suggest the C:\MDSBook\Chapter03 folder).

3. In Solution Explorer, right-click on the Data Sources folder. Create a new Data

Source for the AdventureWorks2008R2 database. Connect to the instance where

you have deployed the demo databases. Select the AdventureWorks2008R2

database. Use the service account impersonation. Use the default name for the

data source (Adventure Works2008R2).

4. In Solution Explorer, right-click on the Data Source Views folder. Create a new

Data Source View based on the data source from the previous step. From the

available objects, select only the ProductsMining (dbo) view you just created. Use

the default name for the data source view (Adventure Works2008R2).

5. In Solution Explorer, right-click on the Mining Structures folder and select New

Mining Structure.

6. Use the existing relational database or data warehouse.

7. Select the Microsoft Naïve Bayes technique.

8. Use Adventure Works2008R2 data source view.

9. Specify ProductsMining as a case table.

10. Use ProductId as a key column (selected by default), and Class, Color,

ProductLine, Size, Style and Weight as input and predictable columns, like in

figure 2.


Figure 2: TRAINING DATA COLUMNS

11. In the Specify Columns’ Content and Data Type screen, make sure the content

type is discrete for all input and target variables. If needed, change the content to

Discrete (in SQL 2008 R2 RTM, the wizard usually detects the content of these

columns inappropriately as Discretized).

12. Use the default percentage (30%) of data for the test set (this is not important for

the usage you are performing, as you have a single model only; this is important

if you want to compare the accuracy of predictions of multiple models).

13. Name the structure Products Mining Completeness, and the model Products

Mining Completeness Naïve Bayes.

14. Click Finish.

15. Save, deploy and process the project by right-clicking on the project in the

Solution Explorer window and selecting Deploy.


16. When the processing is finished, click on the Mining Model Viewer tab. The

default viewer is the Dependency Network viewer.

17. Lower the slider on the left to keep only the more important links. For example, put the slider on the left side of the viewer on the seventh position from the bottom, and then click on the Size attribute. You should see something like what figure 3 shows.

Figure 3: DEPENDENCY NETWORK VIEWER

18. You can easily see that ProductLine, Style and Weight predict Size, while Size also

predicts Class. Color has no strong associations with other attributes.

19. Check the other viewers, for example the Attribute Discrimination viewer. Select the Size attribute, and compare value 0 with value 1, like in figure 4.


Figure 4: ATTRIBUTE DISCRIMINATION VIEWER

20. You can see that the Style attribute is strongly linked with Size; i.e., if Style is

NULL, it is highly probable that Size is NULL as well (value 1 in the picture, as the

dbo.ValueIsNULL function returns 1 for NULLs).

As we can see from this example, we can easily find associations between missing values by using data mining. Data mining can help a lot in searching for the root cause. Before finishing with this example, we have to add a quick note: we can also use the Data Source View (DSV) for a quick overview of the data by completing the following steps:

21. In the DSV, right-click on the ProductsMining view, and select Explore Data.

22. Use the Chart tab and select Class, Color, ProductLine, Size, Style and Weight

columns.

23. Check the other views as well. When finished, close the Explore Data view. Do

not close BIDS yet, if you want to follow further examples in this chapter.


Profiling the Accuracy

Similarly to completeness, accuracy can also be defined at the attribute, tuple, or relation level. However, measuring accuracy is more complicated than measuring completeness.

Sometimes we can easily define which data is inaccurate. The data can be duplicated, out of a predefined range, or inaccurate in some other way that can be easily spotted and discovered with simple queries. For example, it is clear that something is wrong if the birth date is later than the employment date for an employee. However, is there something wrong if a twenty-year-old employee has a Ph.D. in medicine? This may be an error; nevertheless, we cannot say so without checking further.

With code alone, we cannot always find inaccurate data. Many times we can only find suspicious data, and then check this data manually. Therefore, it makes a lot of sense to find suspicious data programmatically. If we can narrow down manual checking from ten million rows to only ten thousand suspicious rows, we have made fantastic progress.
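As a hedged aside, here is a minimal sketch of a check for clearly impossible combinations; it assumes the standard AdventureWorks2008R2 employee schema and lists employees whose birth date is not earlier than their hire date:

  -- Any returned row is certainly wrong: nobody can be hired before being born.
  SELECT BusinessEntityID,
         BirthDate,
         HireDate
  FROM HumanResources.Employee
  WHERE BirthDate >= HireDate;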

Finding inaccurate data differs depending on whether we are dealing with numbers, strings, or other data types. For completeness, we showed examples dealing with products.

Another aspect of master data is customers. For profiling accuracy, we are going to deal with customers, for whom we typically gather many different types of data. We are going to base our examples on the dbo.vTargetMail view from the AdventureWorksDW2008R2 demo database. This database, also part of the product samples, represents an analytical data warehouse; it is already denormalized, using a star schema. The dbo.vTargetMail view joins customers' demographic data from multiple tables from the AdventureWorks2008R2 database, and is thus more suitable for demonstrating techniques for finding inaccurate data than the source tables, where we would have to deal with multiple joins.

Because the sample data is accurate, or at least intended to be accurate, we are going to create a view based on the mentioned vTargetMail view. Our view will select a subset of rows from the original view and add some erroneous rows. We have worked with the AdventureWorks2008R2 database so far; we can continue working in this context, in order to have less clean-up work after we finish with this chapter. The following code creates the view we are going to use in the AdventureWorks2008R2 database.

CREATE VIEW dbo.vTargetMailDirty
AS
SELECT TOP 500
       CustomerKey, FirstName, MiddleName, LastName, BirthDate,
       MaritalStatus, Gender, EmailAddress, YearlyIncome, TotalChildren,
       NumberChildrenAtHome, EnglishEducation AS Education,
       EnglishOccupation AS Occupation, HouseOwnerFlag, NumberCarsOwned,
       AddressLine1 AS Address, Phone, DateFirstPurchase, CommuteDistance,
       Region, Age, BikeBuyer
FROM AdventureWorksDW2008.dbo.vTargetMail
WHERE CustomerKey < 12000
UNION
-- misspelled MiddleName, repeated CustomerKey 11000
SELECT 99000, N'Jon', N'VeryLongMiddleName', N'Yang', '19660408', N'M', N'M',
       N'[email protected]', 90000, 2, 0, N'Bachelors', N'Professional', 1, 0,
       N'3761 N. 14th St', N'1 (11) 500 555-0162', '20010722', N'1-2 Miles',
       N'Pacific', 42, 1
UNION
-- duplicate PK, repeated CustomerKey 11000
SELECT 99000, N'Jon', N'V', N'Yang', '19660408', N'M', N'M',
       N'[email protected]', 90000, 2, 0, N'Bachelors', N'Professional', 1, 0,
       N'3761 N. 14th St', N'1 (11) 500 555-0162', '20010722', N'1-2 Miles',
       N'Pacific', 42, 1
UNION
-- wrong EmailAddress, repeated CustomerKey 11001
SELECT 99001, N'Eugene', N'L', N'Huang', '19650514', N'S', N'M',
       N'eugene10#adventure-works.com', 60000, 3, 3, N'Bachelors', N'Professional', 0, 1,
       N'2243 W St.', N'1 (11) 500 555-0110', '20010718', N'0-1 Miles',
       N'Pacific', 42, 1
UNION
-- BirthDate out of range, repeated CustomerKey 11001
SELECT 99002, N'Eugene', N'L', N'Huang', '18650514', N'S', N'M',
       N'[email protected]', 60000, 3, 3, N'Bachelors', N'Professional', 0, 1,
       N'2243 W St.', N'1 (11) 500 555-0110', '20010718', N'0-1 Miles',
       N'Pacific', DATEDIFF(YY, '18650514', GETDATE()), 1
UNION
-- misspelled Occupation, repeated CustomerKey 11002
SELECT 99003, N'Ruben', NULL, N'Torres', '19650812', N'M', N'M',
       N'[email protected]', 60000, 3, 3, N'Bachelors', N'Profesional', 1, 1,
       N'5844 Linden Land', N'1 (11) 500 555-0184', '20010710', N'2-5 Miles',
       N'Pacific', 42, 1
UNION
-- Phone written as 'Phone: ' + number
SELECT CustomerKey, FirstName, MiddleName, LastName, BirthDate,
       MaritalStatus, Gender, EmailAddress, YearlyIncome, TotalChildren,
       NumberChildrenAtHome, EnglishEducation AS Education,
       EnglishOccupation AS Occupation, HouseOwnerFlag, NumberCarsOwned,
       AddressLine1 AS Address, N'Phone: ' + Phone, DateFirstPurchase,
       CommuteDistance, Region, Age, BikeBuyer
FROM AdventureWorksDW2008.dbo.vTargetMail
WHERE CustomerKey > 12000
  AND CustomerKey % 500 = 3;

Numeric, Date and Discrete Attributes Profiling

We will start by finding inaccuracies in a single attribute. As we already mentioned, the techniques differ for different data types. Let's begin with dates. For finding potentially inaccurate dates, the MIN and MAX T-SQL aggregate functions are very useful. The following query finds the oldest and the youngest person among our customers.

SELECT CustomerKey,
       FirstName,
       LastName,
       BirthDate
FROM dbo.vTargetMailDirty
WHERE BirthDate = (SELECT MIN(BirthDate) FROM dbo.vTargetMailDirty)
   OR BirthDate = (SELECT MAX(BirthDate) FROM dbo.vTargetMailDirty);

The results reveal suspicious data: the oldest person was born in the year 1865.

CustomerKey FirstName LastName BirthDate

99002 Eugene Huang 1865-05-14

11132 Melissa Richardson 1984-10-26

Finding suspicious data mostly translates to finding outliers, i.e. rare values that are far out of bounds. We can use a similar technique for continuous numeric values. With a couple of standard T-SQL aggregate functions, we can easily get an idea of the distribution of values, and then compare the minimal and maximal values with the average. In addition, the standard deviation tells us how spread out the distribution is in general; the less spread out it is, the more likely the outliers are inaccurate. The following query calculates these basic descriptive statistics for the Age column, which is calculated from the BirthDate column.

SELECT MIN(Age) AS AgeMin,
       MAX(Age) AS AgeMax,
       AVG(CAST(Age AS float)) AS AgeAvg,
       STDEV(Age) AS AgeStDev
FROM dbo.vTargetMailDirty;

The result shows us something we already knew: the oldest person in the data probably has the wrong Age, and since Age is calculated, the BirthDate has to be wrong as well.

AgeMin AgeMax AgeAvg AgeStDev

26 146 50.1018518518519 13.1247332792511

Before we move on to discrete attributes, let us mention how to interpret these basic descriptive statistics. We would expect a normal, Gaussian distribution of ages around the average age. In a normal distribution, around 68% of the data should lie within one standard deviation on either side of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations. The minimal age is less than two standard deviations from the mean (50 - 2 * 13 = 24 years), while the maximal age is more than seven standard deviations from the mean. There is a very, very low probability of data lying more than seven standard deviations from the average value; already outside three standard deviations from the mean we expect less than 0.5% of the data. Thus, we can conclude that there is something wrong with the maximal value in the Age column.
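Based on this reasoning, a follow-up query (a minimal sketch) can list the rows that deserve a manual check, i.e. the rows whose Age lies more than three standard deviations from the mean:

  -- Rows with Age more than three standard deviations from the average are suspicious.
  WITH AgeStats AS
  (
    SELECT AVG(CAST(Age AS float)) AS AgeAvg,
           STDEV(Age) AS AgeStDev
    FROM dbo.vTargetMailDirty
  )
  SELECT d.CustomerKey,
         d.FirstName,
         d.LastName,
         d.BirthDate,
         d.Age
  FROM dbo.vTargetMailDirty AS d
  CROSS JOIN AgeStats AS s
  WHERE ABS(d.Age - s.AgeAvg) > 3 * s.AgeStDev;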

How do we find outliers in discrete columns? No matter whether they are numeric, dates, or strings, they can take values from a discrete pool of possible values only. Let's say we do not know the pool in advance. We can still try to find suspicious values by measuring the frequency distribution of all values of an attribute; a value with a very low frequency is potentially erroneous. The following query finds the frequency distribution for the Occupation column.

WITH freqCTE AS
(
  SELECT v.Occupation,
         AbsFreq = COUNT(v.Occupation),
         AbsPerc = CAST(ROUND(100. * (COUNT(v.Occupation)) /
                              (SELECT COUNT(*) FROM vTargetMailDirty), 0) AS int)
  FROM dbo.vTargetMailDirty v
  GROUP BY v.Occupation
)
SELECT c1.Occupation,
       c1.AbsFreq,
       CumFreq = (SELECT SUM(c2.AbsFreq)
                  FROM freqCTE c2
                  WHERE c2.Occupation <= c1.Occupation),
       c1.AbsPerc,
       CumPerc = (SELECT SUM(c2.AbsPerc)
                  FROM freqCTE c2
                  WHERE c2.Occupation <= c1.Occupation),
       Histogram = CAST(REPLICATE('*', c1.AbsPerc) AS varchar(50))
FROM freqCTE c1
ORDER BY c1.Occupation;

The query uses a CTE to calculate the absolute frequency and absolute percentage, and then calculates the cumulative values in the outer query with correlated subqueries. Note again that this is not the most efficient query; however, we want to show the techniques we can use, and performance is not our main goal here. In addition, we typically do not execute these queries very frequently.

In the result of the previous query, we can see the suspicious occupation. The “Profesional”

value is present in a single row only. As it is very similar to the “Professional” value, we can

conclude this is an error.

Occupation      AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
Clerical        68       68       13       13       *************
Management      134      202      25       38       *************************
Manual          64       266      12       50       ************
Profesional     1        267      0        50
Professional    157      424      29       79       *****************************
Skilled Manual  116      540      21       100      *********************

Strings Profiling

Catching errors in unconstrained strings, like names and addresses, is one of the most challenging data profiling tasks. Because there are no constraints, it is not possible to say in advance what is correct and what is incorrect. Still, the situation is not hopeless. We are going to show a couple of queries we can use to find string inconsistencies.

In a character column, strings have different lengths in different rows; however, the lengths typically follow either a nearly uniform or a normal distribution. In both cases, strings that are extremely long or short might be errors. Therefore, we are going to start our profiling by calculating the distribution of string lengths. The following example checks the lengths of middle names.

SELECT LEN(MiddleName) AS MNLength,
       COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY LEN(MiddleName)
ORDER BY Number;

The vast majority of middle names are either unknown or one character long.

MNLength Number

18 1

2 1

NULL 235

1 303

Of course, it is easy to find middle names that are more than one character long with the next

query.


SELECT CustomerKey,
       FirstName,
       LastName,
       MiddleName
FROM dbo.vTargetMailDirty
WHERE LEN(MiddleName) > 1;

We see that one middle name is definitely wrong (of course, this is the one we added

intentionally). In addition, the middle name that is two characters long might be written

inconsistently as well. It looks like middle names should be written with a single letter, without a

dot after it.

CustomerKey FirstName LastName MiddleName

11377 David Robinett R.

99000 Jon Yang VeryLongMiddleName

Sometimes we know what strings should look like; this means we have patterns for the strings. We can check for basic patterns with the T-SQL LIKE operator. For example, we would not expect any letters in phone numbers. Let's check them with the following query.

SELECT CustomerKey,
       FirstName,
       LastName,
       Phone
FROM dbo.vTargetMailDirty
WHERE Phone LIKE '%[A-Z]%';

From the abbreviated results, we can see that there are some phone numbers that include letters. It seems that some operator consistently uses the prefix "Phone" when entering phone numbers.

CustomerKey  FirstName  LastName   Phone
12003        Audrey     Munoz      Phone: 1 (11) 500 555-0124
12503        Casey      Shen       Phone: 1 (11) 500 555-0148
13003        Jill       Hernandez  Phone: 1 (11) 500 555-0197
13503        Theodore   Gomez      Phone: 1 (11) 500 555-0167
14003        Angel      Ramirez    Phone: 488-555-0166
…            …          …          …

More advanced pattern matching can be done with regular expressions, which can be treated as the LIKE operator on steroids. T-SQL does not support regular expressions out of the box; however, we are not powerless. Since version 2005, SQL Server supports CLR objects, including functions, stored procedures, triggers, user-defined types, and user-defined aggregates. A simple .NET function, written in either Visual C# or Visual Basic, can do the work. Going into the details of using CLR code inside SQL Server is out of the scope of this book. Nevertheless, the CLR project with this function is included in the accompanying code, and we can use it. Here is the Visual C# code for the function.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;

public partial class CLRUtilities
{
    [SqlFunction(DataAccess = DataAccessKind.None)]
    public static SqlBoolean IsRegExMatch(
        SqlString inpStr, SqlString regExStr)
    {
        return (SqlBoolean)Regex.IsMatch(inpStr.Value, regExStr.Value,
                                         RegexOptions.CultureInvariant);
    }
};

The Boolean function accepts two parameters: the string to check and the regular expression. It returns true if the string matches the pattern and false otherwise. Before we can use the function, we have to import the assembly into the SQL Server database, and create and register the function. The CREATE ASSEMBLY command imports an assembly, and CREATE FUNCTION registers the CLR function. After the function is registered, we can use it like any built-in T-SQL function. The T-SQL code for importing the assembly and registering the function is as follows:

CREATE ASSEMBLY MDSBook_Ch03_CLR
FROM 'C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\MDSBook_Ch03_CLR.dll'
WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.IsRegExMatch
  (@inpstr AS nvarchar(max), @regexstr AS nvarchar(max))
RETURNS BIT
WITH RETURNS NULL ON NULL INPUT
EXTERNAL NAME MDSBook_Ch03_CLR.CLRUtilities.IsRegExMatch;

Note that this code expects the assembly to be in the "C:\MDSBook\Chapter03\MDSBook_Ch03\MDSBook_Ch03_CLR\bin\Debug\" folder. If you have it in some other folder, please change the path accordingly. After the function is registered in SQL Server, we can use it like any other T-SQL function. We are going to use it for checking e-mail addresses, as in the following code.

SELECT CustomerKey,
       FirstName,
       LastName,
       EmailAddress,
       dbo.IsRegExMatch(EmailAddress,
         N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})') AS IsEmailValid
FROM dbo.vTargetMailDirty
WHERE dbo.IsRegExMatch(EmailAddress,
        N'(\w[-._\w]*\w@\w[-._\w]*\w\.\w{2,3})') = CAST(0 AS bit);

In the results, we can see that we found an incorrect e-mail address.

CustomerKey  FirstName  LastName  EmailAddress                  IsEmailValid
99001        Eugene     Huang     eugene10#adventure-works.com  0

You can learn more about regular expressions on the MSDN. In addition, there are many sites

on the Web where developers freely exchange regular expressions.

Other Simple Profiling

It is also very simple to check for the uniqueness of rows in a table. If we do not use Primary Key and Unique constraints in our database, we can get duplicate rows. We have to check which columns can potentially form a key, i.e. we have to check the candidate keys. If they are unique, they are useful as keys. Please note that keys should not only be unique, but also known; therefore, checking for completeness should be a part of candidate key checks as well (a simple completeness check follows the uniqueness example below).


In the following example, we check the uniqueness of rows over a single column only, the CustomerKey column. Nevertheless, the query would not be much more complicated if we were to check the uniqueness of a composite candidate key; we would only have to use all of the columns that form the candidate key in every place we use CustomerKey.

SELECT CustomerKey,
       COUNT(*) AS Number
FROM dbo.vTargetMailDirty
GROUP BY CustomerKey
HAVING COUNT(*) > 1
ORDER BY Number DESC;

Of course, we get the duplicated CustomerKey in the result:

CustomerKey Number

99000 2
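Since a candidate key must be known as well as unique, a minimal companion sketch simply counts the unknown values in the candidate key column; both the duplicate count above and this count should be zero for a usable key.

  -- Completeness check of the candidate key: the count should be zero.
  SELECT COUNT(*) AS NumberOfUnknownKeys
  FROM dbo.vTargetMailDirty
  WHERE CustomerKey IS NULL;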

Let us briefly mention how we can profile XML data. In the completeness part of this chapter, we mentioned the XML data type methods and showed how we can use them in T-SQL queries. For accuracy, we can use the .value() method of the XML data type to extract element and attribute values and represent them as scalar values of the built-in SQL Server data types. After that, we can use the same methods we used for finding inaccuracies in attributes of built-in scalar data types. Therefore, dealing with XML data type attributes does not differ much from what we have seen so far in the accuracy part of this chapter.

Finally, we have to mention that SQL Server supports validating a complete XML instance against an XML schema collection. An XML schema collection can contain multiple XML schemas in XSD format. An XML instance has to validate against one of the schemas from the collection; otherwise, SQL Server rejects the instance. Similarly to what we mentioned for check constraints, we should also try to use XML schema collection validation in our databases, in order to enforce data integrity rules.
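A minimal sketch of this kind of validation, with a schema and table of our own invention, could look like the following; SQL Server then rejects any XML instance that does not validate against the collection.

  -- Hypothetical schema collection and typed XML column.
  CREATE XML SCHEMA COLLECTION dbo.ProductSpecSchemas AS
  N'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="Spec">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="Weight" type="xsd:decimal" />
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>';
  GO
  CREATE TABLE dbo.ProductSpecs
  (
    ProductId int NOT NULL PRIMARY KEY,
    Spec      xml(dbo.ProductSpecSchemas) NOT NULL  -- typed XML column
  );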


Multivariate Accuracy Analysis

As with completeness, we also want to find the root cause of inaccurate data. We should ask ourselves whether specific values in some attributes lead to errors in other attributes. If we can find the attributes that influence errors in other attributes, we can focus just on them and make huge improvements in data quality with minimal effort. Again, we are going to use data mining for this task.

We could use the Naïve Bayes algorithm for finding the root cause of inaccurate data, as we used it for finding the causes of incompleteness. However, this time we are going to introduce another useful algorithm, the Decision Trees algorithm.

Decision Trees work on discrete data. Note that SQL Server Decision Trees accept continuous attributes; however, for those the algorithm changes to Regression Trees. In the following example, we are going to use discrete attributes only, in order to make sure we get Decision Trees. Before we start exploring the data, let's briefly explain how Decision Trees work.

Decision Trees is a directed technique. A target variable is the one that holds information about a particular decision, divided into a few discrete and broad categories (yes / no; liked / partially liked / disliked; etc.).

We are trying to explain this decision using other gleaned information saved in other variables (demographic data, purchasing habits, etc.). The process of building the tree is recursive partitioning. Initially, the data is in one big box. It is split into partitions using a certain value of one of the explaining variables; the partitions are then split again and again. The algorithm tries all possible splits of all input (explaining) variables for the initial split. The goal is to get purer partitions in terms of the target variable. The purity is related to the distribution of the target variable: we want the distribution of the target variable in the branches of the tree to be as pure as possible. The tree continues to grow, using the new partitions as separate starting points and splitting them further.


We then have to stop the process somewhere; otherwise, we could get a completely fitting tree with only one case in each leaf. Such a leaf would, of course, be absolutely pure. This would not make any sense; the results could not be used for any meaningful prediction. This phenomenon is called "over-fitting". We are not going into details here; let's just mention that we can control the growth of the tree with algorithm parameters.

We are now going to analyze what leads to an incorrect phone number. The following view filters out the rows with a duplicate primary key (data mining models require a key in the source data) and adds a flag showing whether the phone number is valid or not.

CREATE VIEW dbo.TargetMailMining
AS
SELECT CustomerKey,
       FirstName,
       LastName,
       MaritalStatus,
       Gender,
       YearlyIncome,
       TotalChildren,
       NumberChildrenAtHome,
       Education,
       Occupation,
       HouseOwnerFlag,
       NumberCarsOwned,
       CommuteDistance,
       Region,
       Age,
       BikeBuyer,
       CASE WHEN Phone LIKE '%[A-Z]%'
            THEN N'False'
            ELSE N'True'
       END AS PhoneValid
FROM dbo.vTargetMailDirty
WHERE CustomerKey <> 99000;

To create the Decision Trees model, complete the following steps:

1. If you have already closed the project with Naïve Bayes model, reopen it in BIDS.

2. In Solution Explorer, double-click on the Adventure Works2008R2 data source view in the Data Source Views folder to open the Data Source View designer.

3. Right-click on the empty space in the designer window and select the Add/Remove Tables option. Find the TargetMailMining view and add it to the Data Source View.


4. In Solution Explorer, right-click on the Mining Structures folder, and select New

Mining Structure.

5. Use existing relational database or data warehouse.

6. Select the Microsoft Decision Trees technique.

7. Use Adventure Works2008R2 data source view.

8. Specify TargetMailMining as a case table.

9. Use CustomerKey as a key column and PhoneValid as predictable column. Use

Age, BikeBuyer, CommuteDistance, Education, Gender, HouseOwnerFlag,

MaritalStatus, NumberCarsOwned, NumberChildrenAtHome, Occupation,

Region, TotalChildren and YearlyIncome as input columns. Check whether your

selection matches figure 5.


Figure 5: COLUMNS USAGE FOR THE DECISION TREES ANALYSIS

10. In the Specify Columns' Content and Data Type screen, click on the Detect button.

11. Be sure that the content of all attributes is discrete, except Age and YearlyIncome, which should be continuous, and CustomerKey, which is the key. In particular, check that the Bike Buyer attribute is detected as discrete.

12. In the Create Testing Set window, accept the default split option to use 30% of

data for testing.

13. Name the structure Target Mail Mining Accuracy, and model Target Mail Mining

Accuracy Decision Trees. Click Finish.

14. In a real-life project, you should discretize continuous columns in advance and define the discretization buckets from a business perspective. Nevertheless, you can discretize in SSAS as well. Discretize the Age and Yearly Income mining structure columns into five groups using the EqualAreas method: in the Properties window of each column in the Mining Structure tab of the Data Mining Designer, change the Content property to Discretized, the DiscretizationBucketCount property to 5, and the DiscretizationMethod property to Automatic.

15. In the demo data, there is no strong rule for invalid phones based on other attributes. Although the absence of a pattern is a piece of information as well, for the following example let's force a deeper tree through the algorithm parameters. In the Mining Model Designer window, click on the Mining Models tab.

16. Right-click on the Decision Trees model and select Set Algorithm Parameters. Change the COMPLEXITY_PENALTY parameter to 0.1, and the MINIMUM_SUPPORT parameter to 5, as in figure 6.


Figure 6: SETTING THE ALGORITHM PARAMETERS FOR THE DECISION TREES ANALYSIS

17. Save, deploy and process the model (or the whole project).

18. Click on the Mining Model Viewer tab. Change the Background drop-down list to

False. Now you can easily spot what leads to invalid phone numbers. Darker blue

nodes mean nodes with more invalid phone numbers. In the example in figure 7

it seems that having House Owner Flag set to 0, Yearly Income between 75,650

and 98,550, and Commute Distance not between zero and one mile, is a good

way to have an invalid phone number.


Figure 7: DECISION TREE MINING MODEL VIEWER

19. Check the Dependency Network viewer as well.

20. Save the project. Do not close BIDS, if you want to try the next exercise.

Although for this example we needed to tweak the Decision Trees parameters in order to force the growth of the tree and get at least some results, the example still demonstrates the process of finding root causes of inaccurate data.


Measuring Information

The next hard data quality dimension is information. The amount of information in data is not important from a correctness point of view; however, it makes sense to know it if we want to use our data for business intelligence, for analyses.

From mathematics, specifically from Information Theory, the measure for the amount of information is Entropy. Before quoting the formula, let's explain the idea behind it.

In real life, information is the same thing as surprise. If a friend tells you something and you are not surprised, this means you already knew it. Now, let's say we have a discrete attribute. How often can we be surprised by its value, i.e. its state? For a start, let's say we have two possible states with equal distribution. We would expect one state, and in 50% of cases we would be surprised, because we would get the other state. Now imagine we have a skewed distribution, and one state has 90% probability. We would expect this state, and be surprised in 10% of cases only. If the distribution is totally skewed, so that one state has 100% probability, we would never be surprised.

Now let's say the attribute has three possible states with equal distribution. We still expect one specific state; however, in this case we would be surprised in 67% of the cases. With four possible states and equal distribution, we would be surprised in 75% of the cases. We can conclude the following: the more possible states a discrete variable has, the higher the maximal possible information in this attribute. The more equally the probabilities are spread across the possible values, the more actual information the attribute carries. The Entropy formula reflects this:

   H(X) = - SUM( P(x) * log2( P(x) ) ), summed over all possible states x of the attribute X

For example, for two equally likely states, H = -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1 bit, which is also the maximal Entropy for two states.

We do not have out-of-the-box support for Entropy calculation in T-SQL. However, we can calculate the probabilities, i.e. the frequencies of the states, and then the Entropy. The following query calculates the Entropy for the Education, Region, and Gender columns:

WITH DistinctValues AS
(
  SELECT COUNT(DISTINCT Education) AS DistinctEducation,
         COUNT(DISTINCT Region) AS DistinctRegion,
         COUNT(DISTINCT Gender) AS DistinctGender
  FROM dbo.vTargetMailDirty
)
SELECT N'H(Education)' AS Measure,
       (SELECT DistinctEducation FROM DistinctValues) AS DistinctVals,
       SUM((-1.0) * Px * (LOG(Px) / LOG(2))) AS Actual,
       (1.0 * LOG((SELECT DistinctEducation FROM DistinctValues)) / LOG(2)) AS Maximal,
       SUM((-1.0) * Px * (LOG(Px) / LOG(2))) /
         (1.0 * LOG((SELECT DistinctEducation FROM DistinctValues)) / LOG(2)) AS Normalized
FROM (SELECT Education,
             1.0 * COUNT(Education) /
               (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
      FROM dbo.vTargetMailDirty
      GROUP BY Education) AS Prob
UNION
SELECT N'H(Region)' AS Measure,
       (SELECT DistinctRegion FROM DistinctValues) AS DistinctVals,
       SUM((-1.0) * Px * (LOG(Px) / LOG(2))) AS Actual,
       (1.0 * LOG((SELECT DistinctRegion FROM DistinctValues)) / LOG(2)) AS Maximal,
       SUM((-1.0) * Px * (LOG(Px) / LOG(2))) /
         (1.0 * LOG((SELECT DistinctRegion FROM DistinctValues)) / LOG(2)) AS Normalized
FROM (SELECT Region,
             1.0 * COUNT(Region) /
               (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
      FROM dbo.vTargetMailDirty
      GROUP BY Region) AS Prob
UNION
SELECT N'H(Gender)' AS Measure,
       (SELECT DistinctGender FROM DistinctValues) AS DistinctVals,
       SUM((-1.0) * Px * (LOG(Px) / LOG(2))) AS Actual,
       (1.0 * LOG((SELECT DistinctGender FROM DistinctValues)) / LOG(2)) AS Maximal,
       SUM((-1.0) * Px * (LOG(Px) / LOG(2))) /
         (1.0 * LOG((SELECT DistinctGender FROM DistinctValues)) / LOG(2)) AS Normalized
FROM (SELECT Gender,
             1.0 * COUNT(Gender) /
               (SELECT COUNT(*) FROM dbo.vTargetMailDirty) AS Px
      FROM dbo.vTargetMailDirty
      GROUP BY Gender) AS Prob
ORDER BY Normalized;

The query calculates the number of distinct states in a CTE; this information is used to calculate the maximal possible Entropy. Then, in the derived tables in the FROM part of each outer query, the actual distribution is calculated for each of the analyzed columns separately. In the SELECT part of each outer query, the actual Entropy is calculated. In addition, a normalized Entropy is calculated as the actual Entropy divided by the maximal Entropy, to show which variable has relatively more information, or, better said, which variable has the most equal distribution. Finally, all the outer queries are combined into a single query with UNION operators. The result is as follows:

Measure DistinctVals Actual Maximal Normalized

H(Education) 5 2.14581759656851 2.32192809488736 0.924153336743447

H(Region) 3 1.54525421415102 1.58496250072116 0.974946860539558

H(Gender) 2 0.999841673765157 1 0.999841673765157

From the results, we can see that we have the most information in the Education column. However, the Gender column has the most equal distribution.

For continuous attributes, we can calculate the Entropy in the same way if we discretize them first. We have to discretize them into classes (or bins) of equal width, in order to preserve the shape of the original distribution function as much as possible. We can try different numbers of bins.
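A hedged sketch of equal-width binning for the continuous Age column is shown below; the resulting bin frequencies can then be plugged into the same Entropy calculation as above.

  DECLARE @bins int = 5;

  -- Integer division assigns each Age to one of @bins equal-width bins (0 .. @bins - 1).
  SELECT (Age - MinAge) * @bins / (MaxAge - MinAge + 1) AS AgeBin,
         COUNT(*) AS AbsFreq
  FROM dbo.vTargetMailDirty
  CROSS JOIN (SELECT MIN(Age) AS MinAge,
                     MAX(Age) AS MaxAge
              FROM dbo.vTargetMailDirty) AS r
  GROUP BY (Age - MinAge) * @bins / (MaxAge - MinAge + 1)
  ORDER BY AgeBin;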


Using Other SQL Server Tools for Data Profiling

So far, we have used T-SQL queries, enhanced with a CLR function, and data mining models in our data profiling activities. There are more useful tools in the SQL Server suite for this task. SSAS OLAP cubes are very useful for an overview of discrete columns, SSIS has a specialized Data Profiling task, and we can also use the free Office Data Mining Add-Ins. We are now going to show how we can use all of the tools mentioned for data profiling.

SSAS Cubes

SSAS cubes are typically built on a star schema, using multiple dimensions and fact tables, with business measures like quantities and amounts. Nevertheless, we can create an SSAS cube based on a single table, and define part of the columns for a dimension and part for a fact table. In fact, this way we are creating a star schema inside the Analysis Services cube, although we have a flattened (single-table) model in our data warehouse.

For data profiling, the only measure we need is the count of rows. We can then use all the other attributes for a dimension, and use them for analyzing the counts, i.e. frequencies, of the different states of the discrete attributes.

We are going to create a cube based on the TargetMailMining view we used for the Decision Trees analysis. In order to do this, you will need to complete the following steps:

1. If you closed the BI project from this chapter, reopen it in BIDS.

2. In Solution Explorer, right-click on the Cubes folder, and select New Cube.

3. Use the existing tables in the Select Creation Method window.

4. Use the Adventure Works2008R2 data source view. Select TargetMailMining for the measure group table.


5. In the Select Measures window, clear all of the selected measures except the Target Mail Mining Count measure. The wizard uses all numeric columns as measures and automatically creates an additional measure with the COUNT aggregate function; this is exactly the measure we need.

6. Use the dimension selected by default, i.e. Target Mail Mining (it is, of course,

the only dimension you can use).

7. Specify CustomerKey as the dimension key.

8. Name the cube Target Mail Cube and click Finish. The Cube Designer window

should open.

9. In Solution Explorer, expand the Dimensions folder. Double-click on the Target

Mail Mining dimension to open the Dimension Designer. The Cube wizard did not

add any attributes to the dimension.

10. In Dimension Designer, drag and drop columns from the rightmost of the three

panes (the Data Source View pane) to the leftmost (the Attributes) pane. Drag all

columns we used as input columns for the Decision Trees model: Age, BikeBuyer,

CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus,

NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren

and YearlyIncome. Your dimension should be like the one in figure 8.


Figure 8: TARGET MAIL MINING DIMENSION

11. You have to discretize the Age and Yearly Income attributes again. Use the Properties window to change the DiscretizationBucketCount property to 5 and the DiscretizationMethod property to Automatic for both attributes.

12. Save the project and deploy it.

13. You can browse the cube directly from BIDS using the Browser tab in the Cube Designer. However, it is much nicer to use Microsoft Office Excel 2007 or 2010 as an OLAP client.

14. If you have access to Excel 2007 or 2010, open it.

15. Click on Data tab to open the Data ribbon.

16. In the From Other Sources drop-down list, select From Analysis Services.

17. In the Data Connection Wizard, enter the name of your Analysis Services instance in the Name textbox. Use Windows authentication. Select the MDSBook_Ch03_SSAS database and the Target Mail Cube cube. Finish the Data Connection Wizard. Click OK to import the data into a PivotTable report.


18. In the PivotTable Field List window, select the Target Mail Mining Count measure and the Occupation attribute. You should already see the distribution for Occupation in the pivot table. Click somewhere in the pivot table to make sure that the PivotTable Tools ribbon is accessible.

19. Click on the Options tab of the PivotTable Tools ribbon. On the right side of the

ribbon, click on PivotChart. Click OK to select the default chart options.

20. Your Excel worksheet should look similar to the one in figure 9.

Figure 9: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM SSAS CUBE

21. Play with other distributions. Try also to group over columns.

22. When finished, close Excel. Do not close BIDS yet.

As you can see, SSAS together with Excel is a powerful data profiling tool. This was an example of using a tool for a purpose other than its originally intended one. Later in this chapter, we are going to introduce a tool that is intended specifically for data profiling: the SSIS Data Profiling task.


PowerPivot for Excel 2010

Microsoft SQL Server PowerPivot for Excel is a free add-in that became available at the same time as SQL Server 2008 R2. Besides the Excel version, a SharePoint version of PowerPivot exists as well. This tool provides so-called self-service BI: from Excel, advanced end users can create their own analytical solutions without the intervention of database specialists.

You can download PowerPivot for Excel 2010 from the PowerPivot site.

PowerPivot is a completely new technology for data warehousing. In this version, as an add-in for Excel 2010, it is an in-memory relational database with column-oriented storage. Because it is relational, we work with tables that represent entity sets. Because it is in-memory, it is fast. And because the data, although we work with rows, is stored sorted and compressed, column by column, it is even faster. Because the interface for creating the database, importing the data, and creating PivotTable and PivotChart reports is available in Excel, it is simple to use.

We can use PowerPivot instead of SSAS cubes for our data profiling tasks. Assuming you have already downloaded and installed PowerPivot for Excel 2010, complete the following steps to prepare an analysis of the TargetMailMining view similar to the one we did with SSAS:

1. Open Excel 2010. Click on the PowerPivot tab (this tab appears after the

PowerPivot is installed).

2. From the Launch group on the left side of the ribbon, click on the PowerPivot

Window button. You just launched your database development and

administration interface for PowerPivot databases.

3. Using the Home tab, from the Get External data group, click on the From

Database drop-down list, and select From SQL Server.

4. In the Table Import Wizard, Connect to a Microsoft SQL Server Database page,

connect to your SQL Server instance with the AdventureWorks2008R2 sample

database where the TargetMailMining view is created. Click Next.


5. In the Choose How to Import the Data page, use the default option – Select from

a list of tables and views to choose the data to import. Click Next.

6. In the Select Tables and Views page, choose the TargetMailMining view. Do not

click Finish yet – click on the Preview & Filter button first.

7. The Preview Selected Table window opens. Now you can select only the columns

we need for analysis and the key: CustomerKey, Age, BikeBuyer,

CommuteDistance, Education, Gender, HouseOwnerFlag, MaritalStatus,

NumberCarsOwned, NumberChildrenAtHome, Occupation, Region, TotalChildren

and YearlyIncome. In practice, you have to de-select the other columns (FirstName, LastName, and PhoneValid), because by default all columns are selected. Click OK. After you are returned to the Select Tables and Views page, click Finish.

Figure 10: THE PREVIEW SELECTED TABLE WINDOW

8. After the data is imported, on the Importing page, click Close. Your data is now in

the PowerPivot database, in memory.

9. Discretizing continuous columns is not as simple in PowerPivot as it is in Analysis Services. You can create derived columns using expressions in the Data Analysis Expressions (DAX) language, the language that PowerPivot supports. Please refer to Books Online or to the PowerPivot book mentioned in the References section of this chapter for details about PowerPivot and DAX. For this brief overview, we are simply going to skip the Age and YearlyIncome continuous columns.

10. Let’s start analyzing the data. In the Reports group of the PowerPivot Home

window, click on the PivotTable drop-down list, and select PivotTable.

11. In the Create PivotTable pop-up window, select the Existing Worksheet option,

and click OK.

12. In the PowerPivot Field List, select the CustomerKey and Occupation columns. CustomerKey should appear in the Values area, and Occupation in the Axis Fields (Categories) area. Note that the default aggregation function is Sum; therefore, the report shows the sum of CustomerKey over Occupation, which, of course, makes no sense.

13. In the Values area, click on the down arrow near the Sum of CustomerKey field.

Select Edit Measure option.

14. In the Measure Settings window, select the Count aggregate function. Click OK.


Figure 11: SETTING THE COUNT AGGREGATE FUNCTION

15. We can add a chart to the worksheet, as we did when we analyzed the SSAS data. However, this chart is not automatically connected to the PivotTable we already have in the sheet, so we will have to select the fields for analysis again.

16. In the Reports group, click on the PivotTable drop-down list, and select

PivotChart. In the Create PivotChart pop-up window, select Existing Worksheet,

and change the Location to 'Sheet1'!$A$10.

17. In the PowerPivot Field List, select the CustomerKey and Occupation columns.

18. In the Values area, click on the down arrow near the Sum of CustomerKey field.

Select Edit Measure option.

19. In the Measure Settings window, select the Count aggregate function. Click OK.

20. Now you should have the same analysis of the distribution of the Occupation column as you had when analyzing the SSAS cube.

Figure 12: DISTRIBUTION OF THE OCCUPATION ATTRIBUTE IN EXCEL, DATA FROM POWERPIVOT

21. Play with other distributions. Try also to group over columns.


22. When finished, close Excel.

SSIS Data Profiling Task

With the SSIS Data Profiling task available, one could ask why we developed all of the queries, mining models, and cubes in this chapter so far. Well, everything we learned so far is still valid and useful. The Data Profiling task can do limited profiling only, and we have already developed many more advanced profiling techniques. In addition, the Data Profiling task saves its result in XML form, which is not directly usable; we would have to write custom code in the SSIS Script task in order to consume the results in the same package. For an overview of the results, an application called the Data Profile Viewer is included in the SQL Server suite. Let's explore what the Data Profiling task can do by completing the following steps.

1. If you closed BIDS, reopen the solution we created so far.

2. Use File – Add – New Project menu option to add a project to the solution. Use the

Integration Services project template. Name the project MDSBook_Ch03_SSIS.

3. In Solution Explorer, in SSIS Packages folder, right-click on the default package called

Package.dtsx and rename it to SSISProfiling.dtsx. Confirm renaming of the package

object.

4. From the Toolbox, drag and drop the Data Profiling Task to the control flow area. Right-click on it and select Edit.

5. In the General tab, use the Destination property drop-down list, and select New File

Connection.

6. In the File Connection Manager Editor window, change the usage type to Create file. In the File text box, type the file name you want to create (you can select the folder graphically with the Browse button). To follow the placement and naming we have used so far, type C:\MDSBook\Chapter03\SSISProfiling.xml.

7. When you are back in the Data Profiling Task Editor, General tab, change the OverwriteDestination property to True, in order to make it possible to re-execute the package multiple times (otherwise, the next execution of the package would fail with an error saying that the destination file already exists).

8. In the bottom-right corner of the Data Profiling Task Editor, General tab, click on the

Quick Profile button.

9. In the Simple Table Quick Profiling Form, click on the New button to create a new

ADO.NET connection. The Data Profiling Task accepts only ADO.NET connections.

10. Connect to your SQL Server and select the AdventureWorks2008R2 database. Click

OK to return to the Simple Table Quick Profiling Form.

11. Select the dbo.TargetMailMining view in the Table or View drop-down list. Leave all check boxes that are selected by default checked, and also select the Column Pattern Profile and Functional Dependency Profile check boxes. Click OK.

12. In the Data Profiling Task Editor window, in the Profile Type list on the right, select

the Functional Dependency Profile Request. In the Request Properties window,

select BikeBuyer as dependent column, and CommuteDistance, Education and

NumberCarsOwned as determinant columns. Deselect star (all columns) as

determinant columns. Your Data Profiling Task Editor should look like the one in

figure 13. When finished, click OK to close the Data Profiling Task Editor window.


Figure 13: THE DATA PROFILING TASK CONFIGURED

13. Save the project. Right-click on the Data Profiling task and select the Execute Task option. Wait until the execution finishes. The task color should turn green after a successful execution.

14. From the Debug menu, select the Stop Debugging option. Check whether the XML file appeared in the folder.

15. From the All Programs, SQL Server 2008 R2, Integration Services menu, start the Data Profile Viewer application.

16. Click on the Open button. Navigate to the SSISProfiling.xml file and open it. Now you can start harvesting the results.

17. In the Profiles pane on the left, select, for example, the Column Value Distribution Profiles. In the top-right pane, select the Occupation column. In the middle-right window, you should see the distribution of the Occupation attribute. Click on the value that has a very low frequency (the Proffesional value). At the top-right corner of the middle-right window, you will find the drill-down button. Click on it. In the


bottom-right window, you get the row with this value for the Occupation column, as in figure 14.

Figure 14: THE OCCUPATION COLUMN VALUE DISTRIBUTION PROFILE

18. Check the other generated profiles.

The SSIS Data Profiling task can be a nice, automated basic tool for quick profiling of our data. Besides some simple profiling, it has some more sophisticated options. For example, the Functional Dependency profile uses the Naïve Bayes algorithm internally to find the dependency between a single target variable and multiple determinant variables. Data mining algorithms are really useful for data profiling.
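As promised earlier, if you want to look at the profiling results outside the Data Profile Viewer, a minimal sketch like the following loads the XML file produced above into T-SQL for ad-hoc inspection. It assumes the file path chosen in step 6 and deliberately avoids relying on the exact DataProfile XML schema; it only lists element names generically.

DECLARE @profile xml;
SELECT @profile = CAST(BulkColumn AS xml)
FROM OPENROWSET(BULK N'C:\MDSBook\Chapter03\SSISProfiling.xml', SINGLE_BLOB) AS f;
-- List the element names found a few levels below the document root,
-- just to see which profile types the file contains
SELECT T.c.value('local-name(.)', 'nvarchar(200)') AS ProfileElement
FROM @profile.nodes('/*/*/*') AS T(c);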

Excel Data Mining Add-Ins

The Microsoft Office 2007 Data Mining Add-Ins are a free download from the Microsoft SQL Server 2008 R2 Feature Pack site (http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=ceb4346f-657f-4d28-83f5-aae0c5c83d52). The same add-ins work for Office 2010 as well. However, please note that these add-ins currently exist in a 32-bit version only. After you install them, you should get a new Data Mining tab in Excel.

With the data mining add-ins, Excel does not become a data mining engine. It can, however, utilize or leverage an existing Analysis Services instance. Excel sends data from its worksheets or tables to the SSAS instance, and then gets the result back and displays it. If the data is formatted as an Excel table, data mining is simpler, through more business-oriented options that appear in the Table Tools ribbon. If the data is organized in worksheets only, we can still utilize the Data Mining ribbon and mine it from a more technical perspective.

The Clustering data mining algorithm is especially useful for finding complex outliers. By complex outliers we mean rows that are outliers not because of the value of a single column; we are searching for outliers based on a combination of values from multiple columns. Therefore, with Clustering, we can perform multivariate analysis for outliers.

The Clustering algorithm tries to group rows into clusters. In the Expectation-Maximization variant of the algorithm, which is the default in SSAS, each case belongs to each cluster with some probability. A case can belong to one cluster with a probability of 0.3, and to another with a probability of 0.05. However, the highest probability could be very low as well. For example, a case can belong to one cluster with a probability of 0.001, to another with 0.00003, and to other clusters with even lower probabilities. This means that the case does not fit well into any cluster, or, said differently, that the case is an exception. This is how we can find outliers with Clustering: we find the cases for which even the highest probability of belonging to a cluster is still very low.

With the data mining add-ins, Excel becomes a very powerful data mining tool. We can use it for nearly any data mining activity Analysis Services can perform. Excel can serve as a data mining client, because with the data mining add-ins we also get, in Excel, the data mining viewers that are built into BIDS and SSMS. In addition, the add-ins also bring a couple of new templates for Visio. Nevertheless, this introduction to the Office Data Mining Add-Ins should be enough for a Master Data Services book. Rather than continuing with theory and descriptions, let's use the add-ins! To get started, you will need to complete the following steps:

1. In order to use Excel as a data mining tool, your SSAS instance has to allow session

mining models. In SSMS, in Object Explorer, connect to your SSAS instance. Right-

click on it and select Properties. Change the Data Mining \ Allow Session Mining

Models property to True, as shown in figure 15. Click OK.

Figure 15: ALLOW SSAS SESSION MINING MODELS

2. Open Excel. In the Data ribbon, click on From Other Sources drop-down list. Select

the From SQL Server option.

3. Connect to your SQL Server and select the AdventureWorks2008R2 database. In the

Select Database and Table window of the Data Connection Wizard, make sure that

the Connect to a specific table check box is checked. Select the

dbo.TargetMailMining view and finish the wizard.


4. In the Import Data window, click OK. You should get all of the data in tabular format

in your worksheet.

5. If part of your worksheet is formatted as a table in Excel, you get an additional Table Tools – Analyze ribbon with the data mining add-ins.

6. As Excel needs Analysis Services to perform data mining tasks, you need to click on

the Connection button (by default it should show No connection). Connect to your

Analysis Services, to the MDSBook_Ch02_SSAS database. Your screen should look

like the one in figure 16.

Figure 16: TABLE ANALYSIS TOOLS AFTER CONNECTION TO SSAS

7. Click on the Highlight Exceptions button. In the pop-up window, click Run.

8. After the wizard finishes, you should get an additional sheet with summary information about the exceptions. If you select your original worksheet, the worksheet with the data, you will notice that some rows are highlighted with a dark yellow color, as shown in figure 17. In addition, some columns of the highlighted rows are highlighted with a lighter yellow color.


Figure 17: EXCEPTIONS HIGHLIGHTED IN THE ORIGINAL SHEET

9. These highlights show you suspicious rows. In addition, you can see which column

makes a row an exception. Now it is easy to find errors.

10. After you finish, close Excel and BIDS.

Clean-Up

As this was the last profiling option introduced in this chapter, we can clean up the

AdventureWorks2008R2 database with the following code:

USE AdventureWorks2008R2;
IF OBJECT_ID(N'dbo.ValueIsNULL', N'FN') IS NOT NULL
  DROP FUNCTION dbo.ValueIsNULL;
IF OBJECT_ID(N'dbo.ProductsMining', N'V') IS NOT NULL
  DROP VIEW dbo.ProductsMining;
IF OBJECT_ID(N'dbo.vTargetMailDirty', N'V') IS NOT NULL
  DROP VIEW dbo.vTargetMailDirty;
IF OBJECT_ID(N'dbo.IsRegExMatch', N'FS') IS NOT NULL
  DROP FUNCTION dbo.IsRegExMatch;
DROP ASSEMBLY MDSBook_Ch03_CLR;
IF OBJECT_ID(N'dbo.TargetMailMining', N'V') IS NOT NULL
  DROP VIEW dbo.TargetMailMining;
GO


Summary

The old rule "garbage in – garbage out" is absolutely valid for MDM projects as well. Before we start implementing a centralized MDM solution, such as a SQL Server Master Data Services solution, we should have an in-depth comprehension of the quality of our data. In addition, we should find the root causes of bad data.

We have shown in this chapter how we can use tools from the SQL Server 2008 R2 suite for data profiling and for finding the root cause. We have used Transact-SQL queries. We have used XQuery and CLR code for controlling the data quality of XML data and strings. We used SQL Server Analysis Services intensively. The Unified Dimensional Model, or, if we prefer this expression, the OLAP cube, is a nice way to get a quick, graphical overview of the data. In addition, PowerPivot for Excel 2010 gives us the opportunity to achieve the same graphical overview even without Analysis Services. Data mining helps us find interesting patterns and is thus a useful tool for finding the root causes of bad data. With the Office Data Mining Add-Ins, Excel 2007 and 2010 become powerful data mining tools as well. The SQL Server Integration Services Data Profiling task is another quick tool for finding bad or suspicious data.

One of the most challenging tasks in preparing and maintaining master data is merging it from

multiple sources when we do not have the same identifier in all sources. This means we have to

do the merging based on similarities of column values, typically on similarity of string columns

like names and addresses. Even if we have a single source of master data, we can have duplicate

rows for the same entity, like duplicate rows for the same customer. In the next chapter, we are

going to tackle these two problems, merging and de-duplicating.


References

Erik Veerman, Teo Lachev, Dejan Sarka: MCTS Self-Paced Training Kit (Exam 70-448): Microsoft SQL Server 2008 - Business Intelligence Development and Maintenance (Microsoft Press, 2009)

Thomas C. Redman: Data Quality - The Field Guide (Digital Press, 2001)

Tamraparni Dasu, Theodore Johnson: Exploratory Data Mining and Data Cleaning (John Wiley & Sons, 2003)

Arkady Maydanchik: Data Quality Assessment (Technics Publications, 2007)

Itzik Ben-Gan, Lubor Kollar, Dejan Sarka, Steve Kass: Inside Microsoft SQL Server 2008: T-SQL Querying (Microsoft Press, 2009)

Itzik Ben-Gan, Dejan Sarka, Roger Wolter, Greg Low, Ed Katibah, Isaac Kunen: Inside Microsoft SQL Server 2008: T-SQL Programming (Microsoft Press, 2010)

Marco Russo, Alberto Ferrari: PowerPivot for Excel 2010 (Microsoft Press, 2011)


Chapter 4: Identity Mapping and De-Duplicating

Dejan Sarka

Two of the most challenging problems with maintaining master data are identity mapping and

de-duplication.

In an enterprise, we frequently have more than one source of master data. Sometimes we have

to import master data from outer sources. Different sources can include legacy applications;

relational databases used by OLTP applications; analytical databases; semi-structured data, such

as XML data from some Web service; and even non-structured data in Microsoft Excel

worksheets and Microsoft Word documents. Typically, we do not have unique identification of

entities in all these different sources. However, we would like to get a complete picture of an

entity (e.g., of a customer) in order to properly support our applications, such as a Customer

Relationship Management (CRM) application. Therefore, we have to identify which objects (e.g.,

rows in a table, XML instances) represent the same entity. We need to match the identities

based on similarities of entity attributes.

Even if we have a single source of data, we are not without problems. Master data is often

duplicated. For example, multiple operators could insert the same customer multiple times,

each time with slightly differently written name or address. Again, we would like to identify such

duplicates and get rid of them.

In this chapter, we introduce details of both problems and show possible solutions. We focus on

identity mapping, and we show that de-duplicating can be actually translated to the same

problem. We discuss the following topics:

Problems with identity mapping


String similarity functions

Reducing the search space

Performing identity mapping with T-SQL

Using SSIS Fuzzy Lookup transformation for identity mapping

De-duplicating

Using SSIS Fuzzy Grouping transformation for de-duplicating


Identity Mapping

The problem with identity mapping arises when data is merged from multiple sources that can

update data independently. Each source has its own way of entity identification, or its own keys.

There is no common key to make simple joins. Data merging has to be done based on

similarities of strings, using names, addresses, e-mail addresses, and similar attributes. Figure 1

shows the problem: we have three rows in the first table in the top left corner, and two rows in

the second table in the top right corner. Keys of the rows from the left table (Id1 column) are

different from keys of the rows in the right table (Id2 column). The big table in the bottom

shows the result of approximate string matching. Note that each row from the left table is

matched to each row from the right table; similarities are different for different pairs of rows.

Figure 1: IDENTITY MAPPING PROBLEM


Problems

Many problems arise with identity mapping. First, there is no way to get a 100 percent accurate

match programmatically; if you need a 100 percent accurate match, you must match entities

manually. But even with manual matching, you cannot guarantee 100 percent accurate matches

at all times, such as when you are matching people. For example, in a database table, you might

have two rows for people named John Smith, living at the same address; we cannot know

whether this is a single person or two people, maybe a father and son. Nevertheless, when you

perform the merging programmatically, you would like to get the best matching possible. You

must learn which method to use and how to use it in order to get the best possible results for

your data. In addition, you might even decide to use manual matching on the remaining

unmatched rows after programmatic matching is done. Later in this chapter, we compare a

couple of public algorithms that are shipped in SQL Server 2008 R2, in the Master Data Services

(MDS) database; we also add SSIS Fuzzy Lookup transformation to the analysis.

The next problem is performance. For approximate merging, any row from one side, from one

source table, can be matched to any row from the other side. This creates a cross join of two

tables. Even small data sets can produce huge performance problems, because cross join is an

algorithm with quadratic complexity. For example, cross join of 18,484 rows with 18,484 rows of

the AdventureWorksDW2008R2 vTargetMail view to itself (we will use this view for testing),

means dealing with 341,658,256 rows after the cross join! We discuss techniques for optimizing

this matching (i.e., search space reduction techniques) later in this chapter.

Another problem related to identity mapping is de-duplicating. We deal briefly with this

problem at the end of the chapter.

T-SQL and MDS String Similarity Functions

The T-SQL SOUNDEX() and DIFFERENCE() functions allow searches on similar sounding character

strings. SOUNDEX() takes a word, such as a person’s name, as input and produces a character

string that identifies a set of words that are (roughly) phonetically alike. Phonetically alike in this


case means US English phonetically alike. DIFFERENCE() returns an integer indicating the

difference between the SOUNDEX() values. DIFFERENCE() returns a value between 0 and 4. The

integer returned is the number of characters in the SOUNDEX() values that are the same.

Therefore, value 0 indicates weak or no similarity, and value 4 indicates strong similarity or the

same values. SOUNDEX() is a very basic function, not suitable for solving real-life problems.
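As a quick illustration of the two functions (a minimal example; the values shown in the comments are what these particular inputs produce):

SELECT SOUNDEX(N'Smith')              AS SoundexSmith   -- S530
      ,SOUNDEX(N'Smyth')              AS SoundexSmyth   -- S530
      ,DIFFERENCE(N'Smith', N'Smyth') AS SoundexDiff;   -- 4: all four characters of the codes match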

There are many better algorithms that compare strings phonetically. In addition, all phonetic

algorithms are language dependent. We focus on language-independent algorithms in this

chapter, because we cannot say in which language we will have to do the matching in advance.

If we install MDS in SQL Server 2008 R2, we get the following string matching algorithms:

Levenshtein distance (also called edit distance)

Jaccard index

Jaro-Winkler distance

Simil (longest common substring; also called Ratcliff/Obershelp)

All of these algorithms are well-known and publically documented (e.g., on Wikipedia). They are

implemented through a CLR function. Note that MDS comes only with SQL Server 2008 R2

Enterprise and Datacenter 64-bit editions. If you are not running either of these editions, you

can use any edition of SQL Server 2005 or later and implement these algorithms in the CLR

functions. Anastasios Yalanopoulos’s “Beyond SoundEx - Functions for Fuzzy Searching in MS

SQL Server” (http://anastasiosyal.com/archive/2009/01/11/18.aspx#soundex) provides a link to

a publically available library of CLR string matching functions for SQL Server.

Levenshtein (edit) distance measures the minimum number of edits needed to transform one

string into the other. It is the total number of character insertions, deletions, or substitutions

that it takes to convert one string to another. For example, the distance between kitten and

sitting is 3:

• kitten → sitten (substitution of 's' for 'k')

• sitten → sittin (substitution of 'i' for 'e')


• sittin → sitting (insert 'g' at the end)

Similarity is then normalized between 0 and 1.

The Jaccard index (similarity coefficient) measures similarity between sample sets. When used for strings, it defines strings as sets of characters. The size of the intersection is divided by the size of the union of the sample sets, as the following formula shows. Note that the coefficient is normalized between 0 and 1 by its definition.

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaro-Winkler distance is a variant of the Jaro string similarity metric. The Jaro distance combines matches and transpositions for two strings s1 and s2:

dj = 1/3 * ( m / |s1| + m / |s2| + (m - t) / m )

The symbol m means the number of matching characters and the symbol t is the number of transpositions, while |s| denotes the length of a string. In order to define characters as matching, their positions in the strings must be close together, which is defined by the character position distance (CPD); they should be no farther apart than the following formula calculates:

CPD = floor( max(|s1|, |s2|) / 2 ) - 1

The Jaro-Winkler distance uses a prefix scale p, which gives more favorable ratings to strings that match from the beginning:

dw = dj + ℓ * p * (1 - dj)

In this formula, the meaning of the signs is:


• dj is the Jaro distance

• ℓ is the common-prefix length at the start of the string, up to a maximum of 4

characters

• p is a scaling factor for common prefixes; p should not exceed 0.25, otherwise the

distance can become larger than 1 (usually p is equal to 0.1)

Finally, the Simil algorithm looks for the longest common substring in two strings. Then it

removes this substring from the original strings. After that, it searches for the next longest

common substring in remainders of the two original strings from the left and the right. It

continues this process recursively until no common substrings of a defined length (e.g., two

characters) are found. Finally, the algorithm calculates a coefficient between 0 and 1 by dividing

the sum of the lengths of the substrings by the lengths of the strings themselves.
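Before running the algorithms against real data, a minimal sketch such as the following can help build intuition; it reuses the mdq.Similarity call pattern from Listing 5, in which algorithm index 0 through 3 selects Levenshtein, Jaccard, Jaro-Winkler, and Simil, respectively, on a single pair of strings. It has to be run in the database where the MDS mdq functions are installed, and the exact scores depend on the implementation, so they are not shown here.

SELECT mdq.Similarity(N'kitten', N'sitting', 0, 0.85, 0.00) AS Levenshtein
      ,mdq.Similarity(N'kitten', N'sitting', 1, 0.85, 0.00) AS Jaccard
      ,mdq.Similarity(N'kitten', N'sitting', 2, 0.85, 0.00) AS JaroWinkler
      ,mdq.Similarity(N'kitten', N'sitting', 3, 0.85, 0.00) AS Simil;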

Preparing the Data

In order to compare functions’ efficiency, we need some data. The code in Listing 1 prepares

sample data. First, it prepares two tables. The CustomersMaster table will be the master table,

which is a table with keys that we want to keep and transfer to the target table. The

CustomersTarget table is the target of identity mapping; it will receive the keys from the

CustomersMaster table. We fill both tables from the

AdventureWorksDW2008R2.dbo.vTargetMail view. In the target table, we keep the original

keys, multiplied by -1, in order to control the efficiency of merging. The target table also has an

empty column to store the key from the master table (MasterCustomerId) and one additional

column (Updated) that will be used only in the code that produces errors in the data. Besides

common columns (Fullname, StreetAddress, and CityRegion), the two tables also have some

different columns, to show that identical structures are not necessary. All we need is at least

one character column in common, a column that is used to compare values.

Listing 1: Code to Prepare Sample Data

-- Assuming that MDSBook database exists
USE MDSBook;
GO
-- Creating sample data
-- Customers primary (master) table
IF OBJECT_ID(N'dbo.CustomersMaster', N'U') IS NOT NULL
  DROP TABLE dbo.CustomersMaster;
GO
CREATE TABLE dbo.CustomersMaster
(
 CustomerId int NOT NULL PRIMARY KEY,
 FullName nvarchar(200) NULL,
 StreetAddress nvarchar(200) NULL,
 CityRegion nvarchar(200) NULL,
 NumberCarsOwned tinyint NULL
);
GO
INSERT INTO dbo.CustomersMaster
SELECT c.CustomerKey AS CustomerId
      ,c.FirstName + ' ' + c.LastName AS FullName
      ,c.AddressLine1 AS StreetAddress
      ,g.City + ', ' + g.StateProvinceName + ', ' +
       g.EnglishCountryRegionName AS CityRegion
      ,c.NumberCarsOwned
FROM AdventureWorksDW2008R2.dbo.DimCustomer AS c
 INNER JOIN AdventureWorksDW2008R2.dbo.DimGeography AS g
  ON c.GeographyKey = g.GeographyKey;
GO
-- Customers secondary (merge target) table
IF OBJECT_ID(N'dbo.CustomersTarget', N'U') IS NOT NULL
  DROP TABLE dbo.CustomersTarget;
GO
CREATE TABLE dbo.CustomersTarget
(
 CustomerId int NOT NULL PRIMARY KEY,
 FullName nvarchar(200) NULL,
 StreetAddress nvarchar(200) NULL,
 CityRegion nvarchar(200) NULL,
 MaritalStatus nchar(1) NULL,
 Updated int NULL,
 MasterCustomerId int NULL
);
GO
INSERT INTO dbo.CustomersTarget
SELECT c.CustomerKey * (-1) AS CustomerId
      ,c.FirstName + ' ' + c.LastName AS FullName
      ,c.AddressLine1 AS StreetAddress
      ,g.City + ', ' + g.StateProvinceName + ', ' +
       g.EnglishCountryRegionName AS CityRegion
      ,c.MaritalStatus
      ,0 AS Updated
      ,NULL AS MasterCustomerId
FROM AdventureWorksDW2008R2.dbo.DimCustomer AS c
 INNER JOIN AdventureWorksDW2008R2.dbo.DimGeography AS g
  ON c.GeographyKey = g.GeographyKey;
GO

We can perform matching based on FullName, StreetAddress, and CityRegion and get 100

percent accurate results. Of course, we have the same data in both tables. In order to test the

functions, we have to perform some updates in the target table and produce some errors in the

Page 172: DQMDM_SQL2008R2

Chapter 4: Identity Mapping and De-Duplicating

Page 171

common columns. The code in Listing 2 performs these changes. This code makes changes with

a slightly controlled randomness, in a loop that repeats three times. In every loop, the code

makes three updates on 17 percent, 17 percent, and 16 percent (combined 50 percent) rows.

Rows to be updated are selected with Bernoulli’s formula for sampling, which provides

statistically good randomness. In the first two passes, the 50 percent of rows are selected from

all rows; in the third pass, we select 50 percent of rows from the rows that were updated in the

previous two passes only. This way, we get more errors on fewer rows; if we selected rows

randomly from all rows in each pass, we would get fewer errors on more rows. From

experience, we know that errors tend to pile in the same rows and are not spread completely

randomly in a table. For example, we nearly always get an error with the last name Sarka in

Slovenian (it is a Hungarian last name, uncommon in Slovenia), written as Sraka (which is a

Slovenian word), whereas we nearly never get an error with the last name Novak (the most

common last name in Slovenian).

Listing 2: Code to Update the Target Table

DECLARE @i AS int = 0, @j AS int = 0;
WHILE (@i < 3)   -- loop more times for more changes
BEGIN
 SET @i += 1;
 SET @j = @i - 2; -- control here in which step you want to update
                  -- only already updated rows
 WITH RandomNumbersCTE AS
 (
  SELECT CustomerId
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
        ,FullName
        ,StreetAddress
        ,CityRegion
        ,MaritalStatus
        ,Updated
  FROM dbo.CustomersTarget
 )
 UPDATE RandomNumbersCTE SET
   FullName = STUFF(FullName,
     CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int), 1,
     CHAR(CEILING(RandomNumber2 * 26) + 96))
  ,StreetAddress = STUFF(StreetAddress,
     CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int), 2, '')
  ,CityRegion = STUFF(CityRegion,
     CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int), 0,
     CHAR(CEILING(RandomNumber2 * 26) + 96) +
     CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,Updated = Updated + 1
 WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.17
   AND Updated > @j;

 WITH RandomNumbersCTE AS
 (
  SELECT CustomerId
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
        ,FullName
        ,StreetAddress
        ,CityRegion
        ,MaritalStatus
        ,Updated
  FROM dbo.CustomersTarget
 )
 UPDATE RandomNumbersCTE SET
   FullName = STUFF(FullName,
     CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int), 0,
     CHAR(CEILING(RandomNumber2 * 26) + 96))
  ,StreetAddress = STUFF(StreetAddress,
     CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int), 2,
     CHAR(CEILING(RandomNumber2 * 26) + 96) +
     CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,CityRegion = STUFF(CityRegion,
     CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int), 2, '')
  ,Updated = Updated + 1
 WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.17
   AND Updated > @j;

 WITH RandomNumbersCTE AS
 (
  SELECT CustomerId
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber1
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber2
        ,RAND(CHECKSUM(NEWID()) % 1000000000 + CustomerId) AS RandomNumber3
        ,FullName
        ,StreetAddress
        ,CityRegion
        ,MaritalStatus
        ,Updated
  FROM dbo.CustomersTarget
 )
 UPDATE RandomNumbersCTE SET
   FullName = STUFF(FullName,
     CAST(CEILING(RandomNumber1 * LEN(FullName)) AS int), 1, '')
  ,StreetAddress = STUFF(StreetAddress,
     CAST(CEILING(RandomNumber1 * LEN(StreetAddress)) AS int), 0,
     CHAR(CEILING(RandomNumber2 * 26) + 96) +
     CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,CityRegion = STUFF(CityRegion,
     CAST(CEILING(RandomNumber1 * LEN(CityRegion)) AS int), 2,
     CHAR(CEILING(RandomNumber2 * 26) + 96) +
     CHAR(CEILING(RandomNumber3 * 26) + 96))
  ,Updated = Updated + 1
 WHERE RAND(CHECKSUM(NEWID()) % 1000000000 - CustomerId) < 0.16
   AND Updated > @j;
END;

After the update, we can check how many rows are different in the target table for the original

rows, which are still available in the master table. The maximum number of updates per row is

9; the more times a row was updated, the more common attribute values differ from the

original (and correct) ones. The probability that a row is updated many times drops quite quickly

with higher numbers of updates. The query in Listing 3 compares full names and addresses after

the update, selecting only rows with some changes and sorting them by the number of updates

in descending order, so we get the rows with maximal number of updates on the top.

Listing 3: Query to Compare Full Names and Addresses after Update

SELECT m.FullName
      ,t.FullName
      ,m.StreetAddress
      ,t.StreetAddress
      ,m.CityRegion
      ,t.CityRegion
      ,t.Updated
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
WHERE m.FullName <> t.FullName
   OR m.StreetAddress <> t.StreetAddress
   OR m.CityRegion <> t.CityRegion
ORDER BY t.Updated DESC;

The partial result in Figure 2 shows that three rows were updated 7 times (note that in order to

fit the number of updates of a row into the figure, CityRegion columns before and after the

update are omitted from the figure). Altogether, 7,790 rows were updated in this test. You

should get different results every time you run this test, because the updates are done

randomly (or better, with a controlled randomness, no matter how paradoxical this sounds).

You can also see that the values in the rows that were updated many times differ quite a lot

from the original values; therefore, our string matching algorithms are going to have a hard time

finding similarities.


Figure 2: RESULTS OF NAME AND ADDRESS COMPARISON

Testing the String Similarity Functions

Although we mentioned that we are not going to use T-SQL SOUNDEX() and DIFFERENCE()

functions because they are language dependent, it makes sense to start with a quick test of

these functions. This test shows another reason for not using the functions: They are very

inefficient. Let’s look at how they perform on full names. Because we retained the original key

(although multiplied by -1) in the target table, we can make an exact join and compare the

original (master) and changed (target) names, as Listing 4 shows.

Listing 4: SOUNDEX() and DIFFERENCE() Test on Full Names

SELECT m.CustomerId
      ,m.FullName
      ,t.FullName
      ,DIFFERENCE(m.FullName, t.Fullname) AS SoundexDifference
      ,SOUNDEX(m.FullName) AS SoundexMaster
      ,SOUNDEX(t.FullName) AS SoundexTarget
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
ORDER BY SoundexDifference;

The results in Figure 3 show that the DIFFERENCE() function based on the SOUNDEX() code did

not find any similarity for quite a few full names. However, as you can see from the highlighted

row, the name Zoe Rogers was not changed much; there should be some similarity in the strings

“Zoe Rogers” and “rZoeRogers.” This proves that the two functions included in T-SQL are not

efficient enough for a successful identity mapping.


Figure 3: RESULTS OF TESTING THE SOUNDEX AND DIFFERENCE FUNCTIONS ON FULL NAMES

You could continue with checking the two T-SQL functions using the street address and city and

region strings.

All of the four algorithms implemented in the mdq.Similarity function in the MDS database

return similarity as a number between zero and one. A higher number means better similarity.

The query in Listing 5 checks how algorithms perform on full names. Again, because we retained

the original key (although multiplied by -1) in the target table, we can make an exact join,

compare the original (master) and changed (target) names, and visually evaluate which

algorithm gives the highest score.

Listing 5: Code to Check Algorithm Performance on Full Names

SELECT m.CustomerId
      ,m.FullName
      ,t.FullName
      ,mdq.Similarity(m.FullName, t.Fullname, 0, 0.85, 0.00) AS Levenshtein
      ,mdq.Similarity(m.FullName, t.Fullname, 1, 0.85, 0.00) AS Jaccard
      ,mdq.Similarity(m.FullName, t.Fullname, 2, 0.85, 0.00) AS JaroWinkler
      ,mdq.Similarity(m.FullName, t.Fullname, 3, 0.85, 0.00) AS Simil
FROM dbo.CustomersMaster AS m
 INNER JOIN dbo.CustomersTarget AS t
  ON m.CustomerId = t.CustomerId * (-1)
ORDER BY Levenshtein;

For more information about the mdq.Similarity parameters, see SQL Server Books Online.


The results (in Figure 4) are sorted by Levenshtein coefficient, in ascending order (we could

choose any of the four), in order to get the rows with the maximal dissimilarity on the top.

Figure 4: RESULTS OF CHECKING ALGORITHM PERFORMANCE ON FULL NAMES FOR THE MDQ.SIMILARITY FUNCTION

From the results, we can see that the Jaro-Winkler algorithm gives the highest scores in our

example. You should check the algorithms on street address and city and region strings as well.

Although we cannot say that the Jaro-Winkler algorithm would always perform the best, and

although you should always check how algorithms perform on your data, we can say that this is

not a surprise. Jaro-Winkler is one of the most advanced public algorithms for string matching.

In addition, it forces higher scores for strings with the same characters in the beginning of the

string. From experience, we noticed that errors in data, produced by humans, typically do not

appear in the first few characters. Therefore, it seems that the Jaro-Winkler algorithm is the

winner in this case, and we will use it to do the real matching. However, before we do the

match, we need to discuss optimizing the matching by trying to avoid a full cross join.

Optimizing Mapping with Partitioning

As mentioned, identity matching is a quadratic problem. The search space dimension is equal to the cardinality of the Cartesian product A x B of the sets included in the match. There are multiple search space reduction techniques.

A partitioning or blocking technique partitions at least one set involved in the matching into blocks. For example, we can take the target rows in batches and match each batch with the full master table.


With 10,000 rows in each table, we could have 10 iterations of joins with 1,000 x 10,000 = 10,000,000 rows instead of one big join with 10,000 x 10,000 = 100,000,000 rows. The important thing here is that we do not bloat the memory and exceed our hardware resources. In order to prevent this, we can adjust the batch sizes appropriately.

A sorting neighborhood technique sorts both sets and then moves a window of a fixed size on

both sets. We do the matching on that window only, one window after another. The problem

with such a technique lies in the fact that we can have objects from one set in a window that is

not compared to the window from another set where the objects that should actually be

matched reside. The same problem could arise with blocking techniques implemented on both

sets. We do not want to lose the matches we might find by optimization of the mapping

methods.
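A minimal sketch of the sorted neighborhood idea is shown below; it is not used in the rest of the chapter and assumes the CustomersMaster and CustomersTarget tables created in Listing 1 later in this section. Both sets are sorted and numbered by FullName, and only pairs whose positions differ by at most a fixed window size (here 5) are compared.

WITH SortedMaster AS
(
 SELECT CustomerId, FullName,
        ROW_NUMBER() OVER (ORDER BY FullName) AS RowNo
 FROM dbo.CustomersMaster
),
SortedTarget AS
(
 SELECT CustomerId, FullName,
        ROW_NUMBER() OVER (ORDER BY FullName) AS RowNo
 FROM dbo.CustomersTarget
)
SELECT m.CustomerId AS MCid
      ,t.CustomerId AS TCid
      ,m.FullName AS MasterFullName
      ,t.FullName AS TargetFullName
FROM SortedMaster AS m
 INNER JOIN SortedTarget AS t
  ON ABS(m.RowNo - t.RowNo) <= 5;   -- compare only pairs within the fixed-size window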

Let’s test the partitioning method. The first partition is, of course, the partition with exact

matches. The query in Listing 6 updates the MasterCustomerId column in the target table with

CustomerId from the master table based on exact matches on all three common string columns.

Listing 6: Query to Update MasterCustomerId Column

WITH Matches AS
(
 SELECT t.MasterCustomerId
       ,m.CustomerId AS MCid
       ,t.CustomerId AS TCid
 FROM dbo.CustomersTarget AS t
  INNER JOIN dbo.CustomersMaster AS m
   ON t.FullName = m.FullName
      AND t.StreetAddress = m.StreetAddress
      AND t.CityRegion = m.CityRegion
)
UPDATE Matches
SET MasterCustomerId = MCid;

In the next step, we select (nearly) randomly 1,000 rows from the target table and perform a

match with all rows from the master table. We measure the efficiency of the partitioning

technique and of the Jaro-Winkler string similarity algorithm. We select rows from the target

(nearly) randomly in order to prevent potential bias that we might get by selecting, for example,

rows based on CustomerId. We use the NEWID() function to simulate randomness. The values


generated by NEWID() actually do not satisfy statistical randomness, which is why we used the term "(nearly) randomly." In a real-life example, we would iterate through all rows block by block

anyway, so it is not that important to use real statistical randomness. NEWID() should give us

enough randomness to simulate one iteration (i.e., one block from the target table matched

with the complete master table).

We measure the efficiency of the matching on the fly, while we are doing the match. The first

part of the code simply declares two variables to store the number of rows updated in this pass

and the start time of the UPDATE query. The actual UPDATE query starts with a CTE that selects

1,000 rows from the target table, as we said, nearly randomly, using the TOP operator on

NEWID() values, as Listing 7 shows.

Listing 7: UPDATE Query

-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Select (nearly) randomly 1000 rows from the target table
WITH CustomersTarget1000 AS
(
 SELECT TOP 1000
        t.CustomerId
       ,t.MasterCustomerId
       ,t.FullName
       ,t.StreetAddress
       ,t.CityRegion
 FROM CustomersTarget AS t
 WHERE t.MasterCustomerId IS NULL
 ORDER BY NEWID()
),

The next CTE, which Listing 8 shows, performs a cross join between the 1,000-row block from

the target table with the full master table. It also compares the strings and adds the Jaro-

Winkler similarity coefficient to the output. In addition, it adds the row number sorted by the

Jaro-Winkler similarity coefficient, in descending order and partitioned by the target table key,

to the output. A row number equal to 1 will mean the highest Jaro-Winkler similarity coefficient

for a specific target table key. This way, we marked the master table row that, according to Jaro-

Winkler algorithms, is the most similar to the target row. The last part of the UPDATE


statement, also shown in Listing 8, is the actual update. It updates target table rows with the

key from the master table row with the highest similarity.

Listing 8: Code to Perform Cross Join and Actual Update

-- Full cross join
-- Adding Jaro-Winkler coefficient and row number
MasterTargetCross AS
(
 SELECT t.CustomerId AS TCid
       ,m.CustomerId AS MCid
       ,t.MasterCustomerId
       ,mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                       t.FullName + t.StreetAddress + t.CityRegion,
                       2, 0.85, 0.00) AS JaroWinkler
       ,ROW_NUMBER() OVER (PARTITION BY t.CustomerId
                           ORDER BY mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                                                   t.FullName + t.StreetAddress + t.CityRegion,
                                                   2, 0.85, 0.00) DESC) AS RowNo
 FROM CustomersMaster AS m
  CROSS JOIN CustomersTarget1000 AS t
)
-- Actual update
UPDATE MasterTargetCross
SET MasterCustomerId = MCid
WHERE RowNo = 1;

Finally, in the last statements of the batch, we use the fact that we know what key from the

master table we should receive in the target table. We are just counting how many rows got a

wrong key and comparing this number to the number of rows we updated in this pass. In

addition, we are also measuring the time needed to execute this update, as Listing 9 shows.

Listing 9: Code to Measure Efficiency

-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
      ,100.0 * @RowsUpdated / 7790 AS PctUpdated
      ,COUNT(*) AS NumErrors
      ,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
      ,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);

After we execute the update, we can harvest the results of the efficiency measurement, as

Figure 5 shows.


Figure 5: RESULTS OF EFFICIENCY MEASUREMENT

The results are not too promising in this case. We updated only 1,000 out of the 7,790 rows we have to update, which is approximately 12.84 percent. With more than 6 percent errors on that small subset, it seems that a pure Jaro-Winkler algorithm does not satisfy our needs, and it doesn't make sense to apply it to the full set. In addition, 133 seconds for only a small subset of

rows is not too encouraging. This is the price of the cross join. Apparently, we need to improve

both the accuracy of the matching and the performance. Before developing a better procedure,

let’s reset the MasterCustomerId column in the target table:

UPDATE dbo.CustomersTarget SET MasterCustomerId = NULL;

Optimizing Mapping with nGrams Filtering

In this section, we focus on a pruning (or filtering) method. The question is whether we can

intelligently pre-select a subset of rows from the master table for a match with a single pre-

selected batch from the target table. With intelligent selection, we would like to improve the

accuracy and the performance of the matching procedure.

The problem here is how to pre-select these rows. We need to be sure that we do not exclude a

row that should be matched with one or more target rows. Besides that, we need reasonably

sized subsets; if we pre-select a subset of 9,000 rows from a set of 10,000 rows, we gain almost nothing, because we would get a nearly full cross join again. In addition, we would like to

build some intelligence in selecting batches from the target table; let’s select the rows that have

more chances to match well with master rows first, and then rows with a little fewer chances,

and then rows with even fewer chances, and so on. What we need is, of course, a procedure

that accepts parameters. Finally, we could leave some rows for manual matching. Of course, we

would start with exact matching.


Without any further hesitation, let’s introduce the pre-selecting algorithm we will use. We will

tokenize strings from the master table to substrings of length n. These tokens are called

nGrams. In the MDS database, there is already a function mdq.NGrams that does this for us. If

we do not use MDS, we can write our own CLR or T-SQL function. A T-SQL solution is already

written and can be found in SQL Server MVP Deep Dives.
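To see what the tokenizer produces, here is a minimal sketch of an mdq.NGrams call on an arbitrary example string, using the same call pattern as Listing 10 below (4 is the token length; the remaining argument simply follows that listing):

SELECT Token, Sequence
FROM mdq.NGrams(UPPER(N'Ruben Torres'), 4, 0);
-- returns the 4-character substrings ('RUBE', 'UBEN', 'BEN ', ...) together with their sequence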

After we tokenize the strings from the master table and store these tokens together with keys in

a new table, we calculate the overall (absolute) frequency of tokens and store this in a token

statistics table. With these statistics, we can select for comparison rows with strings that have at least m common nGrams with less than p frequency in each pass. The m, n, and p are the

parameters we use to select rows to be matched in each pass. We can start by comparing the

rows that have at least 3 (m) common 4(n)Grams with less than 20 (p) absolute frequency. The

p parameter helps us to start comparing strings with rare tokens first, with tokens that have low

absolute frequency. Then we can use lower values for m and n and higher values for p in the

next pass, and so on.

Of course, this solution comes at a price. We have to create nGrams for both the source and

target tables. In our example, we will store nGrams for the master (source) table and create

nGrams for a batch of target rows on the fly. If we need to do the identity mapping continuously

(e.g., from day to day), we could store nGrams for both tables permanently and update them

with the help of triggers on the master and target tables. We also need to keep the token statistics table updated. This could be done either with another trigger on the master table or with the help

of an indexed view on the tokens table (i.e., the token statistics table could be just an indexed

view). Triggers and indexes would slow down updates on the master and target tables.

Nevertheless, because both tables hold master data, we do not expect too frequent changes

there. In addition, the tokens and token statistics tables occupy some space. However, we do

not consider this as a huge problem. We need only a couple of tables like this, because we

usually merge only a couple of entity sets using string matching (typically customers).
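If the token statistics need to stay current as master rows change, one option mentioned above is to implement the statistics as an indexed view over the tokens table instead of the table built in Listing 11. The following is only a minimal sketch under the assumption that the matching queries need just the Token and Cnt columns (an indexed view cannot include the ROW_NUMBER() column from Listing 11), and the view name here is purely illustrative:

CREATE VIEW dbo.CustomersMasterNGramsFrequencyIV
WITH SCHEMABINDING
AS
SELECT Token
      ,COUNT_BIG(*) AS Cnt
FROM dbo.CustomersMasterNGrams
GROUP BY Token;
GO
-- Materializing the view; SQL Server then maintains the counts automatically
CREATE UNIQUE CLUSTERED INDEX IX_CustomersMasterNGramsFrequencyIV
 ON dbo.CustomersMasterNGramsFrequencyIV (Token);
GO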


In summary, pre-selecting comes with a price. The price is quite low if we do the merging

continuously. For a single merge, it might be simpler to use a blocking technique to reduce the

search space. But we don’t need to guess, because we have developed a whole infrastructure

for measuring the efficiency.

As mentioned, in our example we will show a solution that stores nGrams for the master table

only and calculates nGrams for the target table on the fly. In addition, we will store the nGrams statistics in a table as well.

The code in Listing 10 creates a table to store nGrams from the master table and then populates

the new table. Note that the n parameter in this case is 4—we are using 4Grams. In addition,

before actually extracting nGrams, we are standardizing all strings to upper case.

Listing 10: Code to Create Table to Store nGrams from the Master Table

CREATE TABLE dbo.CustomersMasterNGrams
(
 CustomerId int NOT NULL,
 Token char(4) NOT NULL,
 PRIMARY KEY (Token, CustomerId)
);
GO
INSERT INTO dbo.CustomersMasterNGrams
SELECT DISTINCT       -- Store one NGram for one customer only once
       m.CustomerId
      ,g.Token
FROM dbo.CustomersMaster AS m
 CROSS APPLY
  (SELECT Token, Sequence
   FROM mdq.NGrams(UPPER(m.FullName + m.StreetAddress + m.CityRegion), 4, 0)) AS g
WHERE CHARINDEX(',', g.Token) = 0
  AND CHARINDEX(' ', g.Token) = 0;

We do not show the code and results of checking the content of this table here. Let’s directly

create the nGrams frequency table, and store the frequency there, with the code in Listing 11.

Listing 11: Code to Create the nGrams Frequency Table

CREATE TABLE dbo.CustomersMasterNGramsFrequency
(
 Token char(4) NOT NULL,
 Cnt int NOT NULL,
 RowNo int NOT NULL,
 PRIMARY KEY (Token)
);
GO
INSERT INTO dbo.CustomersMasterNGramsFrequency
SELECT Token
      ,COUNT(*) AS Cnt
      ,ROW_NUMBER() OVER(ORDER BY COUNT(*)) AS RowNo
FROM dbo.CustomersMasterNGrams
GROUP BY Token;

The following query’s results display which nGrams values are the most frequent, as Figure 6

shows.

SELECT * FROM dbo.CustomersMasterNGramsFrequency ORDER BY RowNo DESC;

Figure 6: NGRAMS FREQUENCY

In order to completely comprehend the impact of storing the tokens and their frequency, let’s

check the space used by the original and tokens tables. The following code checks the space

used; Figure 7 shows the results.

EXEC sp_spaceused 'dbo.CustomersMaster';
EXEC sp_spaceused 'dbo.CustomersTarget';
EXEC sp_spaceused 'dbo.CustomersMasterNGrams';
EXEC sp_spaceused 'dbo.CustomersMasterNGramsFrequency';


Figure 7: SPACE USED

From the results, we can see that the nGrams table actually occupies much more space,

approximately ten times more space, than the original table. Of course, our original

CustomersMaster table does not have many columns. In a real-world situation, the tokens table

would probably use approximately the same or even less space than the source table.

Now we have the entire infrastructure we need to start doing the actual matching.

Again, we are starting with exact matches. The UPDATE code in Listing 12 performs the exact

matching on all three common columns; this is the same query we used in the partitioning

procedure already.

Listing 12: Code to Perform Matching on Common Columns

WITH Matches AS
(
 SELECT t.MasterCustomerId
       ,m.CustomerId AS MCid
       ,t.CustomerId AS TCid
 FROM dbo.CustomersTarget AS t
  INNER JOIN dbo.CustomersMaster AS m
   ON t.FullName = m.FullName
      AND t.StreetAddress = m.StreetAddress
      AND t.CityRegion = m.CityRegion
)
UPDATE Matches
SET MasterCustomerId = MCid;

Now we can filter out exact matches in all of the following steps. If we have a large number of

rows, we can also use only a batch of still unmatched rows of the target table combined with

nGrams pre-selecting. This means we could actually combine the nGrams filtering with the

partitioning technique. However, because the number of rows to match is already small enough

in this proof of concept project, we do not split the unmatched rows in the target table in

batches here. We will measure the efficiency of the matching on the fly, while we are doing the

match. The first part of the code, as in the partitioning technique, declares two variables to

store the number of rows updated in this pass and the start time.

The actual merging query uses four CTEs. In the first one, we are tokenizing the target table

rows into 4Grams on the fly. Because we are using 4Grams, the n parameter is equal to 4 in this

pass. This part of code is shown in Listing 13.

Listing 13: Variable Declaration and the First CTE

-- Variables to store the number of rows updated in this pass
-- and start time
DECLARE @RowsUpdated AS int, @starttime AS datetime;
SET @starttime = GETDATE();
-- Tokenize target table rows
WITH CustomersTargetNGrams AS
(
 SELECT t.CustomerId AS TCid
       ,g.Token AS TToken
       ,t.MasterCustomerId
 FROM dbo.CustomersTarget AS t
  CROSS APPLY
   (SELECT Token, Sequence
    FROM mdq.NGrams(UPPER(t.FullName + t.StreetAddress + t.CityRegion), 4, 0)) AS g
 WHERE t.MasterCustomerId IS NULL
   AND CHARINDEX(',', g.Token) = 0
   AND CHARINDEX(' ', g.Token) = 0
),

The next CTE selects only target rows with 4Grams, with absolute frequency less than or equal

to 20. Thus, the p parameter is 20 in this case. The code is shown in Listing 14.

Listing 14: The Second CTE


-- Target rows with 4Grams, with absolute frequency less than or equal to 20
NGramsMatch1 AS
(
 SELECT tg.TCid
       ,tg.MasterCustomerId
       ,mg.CustomerId AS MCid
       ,tg.TToken
       ,f.Cnt
 FROM CustomersTargetNGrams AS tg
  INNER JOIN dbo.CustomersMasterNGramsFrequency AS f
   ON tg.TToken = f.Token
      AND f.Cnt <= 20
  INNER JOIN dbo.CustomersMasterNGrams AS mg
   ON f.Token = mg.Token
),

In the third CTE, we are selecting only matches that have in common at least three less frequent

4Grams. Thus, the m parameter is 3 in this example, as shown in Listing 15.

Listing 15: The Third CTE

-- Matches that have in common at least three less frequent 4Grams
NGramsMatch2 AS
(
 SELECT TCid
       ,MCid
       ,COUNT(*) AS NMatches
 FROM NGramsMatch1
 GROUP BY TCid, MCid
 HAVING COUNT(*) >= 3
),

The last CTE then compares the strings and adds the Jaro-Winkler similarity coefficient to the

output. In addition, it adds the row number sorted by Jaro-Winkler similarity coefficient, in

descending order and partitioned by the target table key, to the output. Row number equal to 1

means the row with the highest Jaro-Winkler similarity coefficient for a specific target table key.

This way, we marked the master table row that, according to Jaro-Winkler algorithms, is the

most similar to the target row. The code for the fourth CTE that calculates the Jaro-Winkler

coefficient and the row number is shown in Listing 16. Fortunately, this is the last CTE in this

long query; after this CTE, the only thing left to do is the actual UPDATE. It updates the target

table rows with the key from the master table row with the highest similarity. The UPDATE

statement is also shown in Listing 16.

Listing 16: The Fourth CTE and the UPDATE Statement


-- Adding Jaro-Winkler coefficient and row number
NGramsMatch3 AS
(
 SELECT t.CustomerId AS TCid
       ,m.CustomerId AS MCid
       ,t.MasterCustomerId
       ,mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                       t.FullName + t.StreetAddress + t.CityRegion,
                       2, 0.85, 0.00) AS JaroWinkler
       ,ROW_NUMBER() OVER (PARTITION BY t.CustomerId
                           ORDER BY mdq.Similarity(m.FullName + m.StreetAddress + m.CityRegion,
                                                   t.FullName + t.StreetAddress + t.CityRegion,
                                                   2, 0.85, 0.00) DESC) AS RowNo
       ,ngm2.NMatches
 FROM NGramsMatch2 AS ngm2
  INNER JOIN dbo.CustomersTarget AS t
   ON ngm2.TCid = t.CustomerId
  INNER JOIN dbo.CustomersMaster AS m
   ON ngm2.MCid = m.CustomerId
)
-- Actual update
UPDATE NGramsMatch3
SET MasterCustomerId = MCid
WHERE RowNo = 1;

We use the code in Listing 17 to measure the efficiency, and show the results in Figure 8.

Listing 17: Measuring the Efficiency

-- Measuring the efficiency
SET @RowsUpdated = @@ROWCOUNT;
SELECT @RowsUpdated AS RowsUpdated
      ,100.0 * @RowsUpdated / 7790 AS PctUpdated
      ,COUNT(*) AS NumErrors
      ,100.0 * COUNT(*) / @RowsUpdated AS PctErrors
      ,DATEDIFF(S, @starttime, GETDATE()) AS TimeElapsed
FROM dbo.CustomersTarget
WHERE MasterCustomerId <> CustomerId * (-1);

Figure 8: RESULTS OF MEASURING THE EFFICIENCY

The results are much better in this case. We updated 6,847 out of the 7,790 rows we have to update, which is approximately 87.89 percent. With only 0.48 percent errors, we can see now that the complication with intelligent pre-selecting makes sense. In addition, the performance is orders of magnitude better than we achieved with the partitioning procedure. We needed only 7 seconds to match 6,847 rows, whereas we needed 133 seconds to match only 1,000 rows with the partitioning procedure. But is the nGrams filtering method always that efficient?

Comparing nGrams Filtering with Partitioning

In order to make a fair comparison, we need to play with different parameters of the nGrams filtering method. We admit that in the section where we described the method, we used parameters that, according to many tests, give us very good results. Some of the most important results of multiple tests are aggregated in Table 1.

Table 1: Results of nGrams Filtering

Method            n    p    m    Rows to match   Rows matched   Percent matched   Number of errors   Percent of errors   Elapsed time (s)
Partitions        NA   NA   NA   7790            1000           12.83             63                 6.30                133
Partitions        NA   NA   NA   7790            2000           25.67             93                 4.65                262
nGrams filtering  3    20   3    7790            7391           94.88             232                3.14                223
nGrams filtering  3    20   2    7790            7717           99.06             322                4.17                264
nGrams filtering  4    50   3    7790            7705           96.39             48                 0.64                11
nGrams filtering  4    50   2    7790            7716           99.05             106                1.37                14
nGrams filtering  4    20   3    7790            6847           87.89             33                 0.48                7
nGrams filtering  4    20   2    7790            7442           95.53             145                1.95                7
nGrams filtering  4    10   2    7790            6575           84.40             188                2.86                4
nGrams filtering  4    10   3    7790            5284           67.83             28                 0.53                4

As you can see from the table, it is important to choose the correct values for the parameters of the nGrams filtering method. For example, when using two common 3Grams with an absolute frequency less than or equal to 20, neither the performance (elapsed time of 264 seconds) nor the accuracy (4.17 percent errors) was impressive. Nevertheless, the table proves that the nGrams filtering method works. In all cases in the table, it is more efficient than the partitioning method. With proper values for the parameters, it gives quite astonishing results.

In addition, note that in a real example, not as many errors would be present in the target table

rows, and the results could be even better. We should also try matching with other algorithms,

to check which one is the most suitable for our data. In addition, we could start with more strict

values for the three parameters and have even fewer errors in the first pass. So, did we find the

best possible method for identity mapping? Before making such a strong conclusion, let’s try the

last option for matching we have in SQL Server out of the box, the SSIS Fuzzy Lookup data flow

transformation.


Microsoft Fuzzy Components

The Fuzzy Lookup and Fuzzy Grouping transformations, which are included only in the Enterprise, Datacenter, and Developer editions of SQL Server, use a proprietary Microsoft matching

algorithm. It is partially documented in SQL Server Books Online; more information can be found

in MSDN and TechNet articles. In addition, Microsoft recently published a Fuzzy Lookup add-in

for Excel 2010. This add-in’s documentation includes a good explanation of the Fuzzy algorithm.

Even more details can be found in the documentation for the Microsoft Research Data Cleaning

project.

Fuzzy Algorithm Description

Fuzzy transformations use a quite advanced algorithm for approximate string matching. It

actually comprises some other algorithms that we already know. It starts by using the Jaccard

similarity coefficient. However, the version used by the Fuzzy transformations is much more advanced: it is actually a weighted Jaccard similarity computed over tokens.

For example, the sets {a, b, c} and {a, c, d} have a Jaccard similarity of 2/4 = 0.5 because the

intersection is {a, c} and the union is {a, b, c, d}. You can assign weights to each item in a set and

define the weighted Jaccard similarity as the total weight of the intersection divided by the total

weight of the union. For example, we added arbitrary weights to the elements of the sets from

the previous example to get the weighted sets {(a, 2), (b, 5), (c, 3)}, {(a, 2), (c, 3), (d, 7)}. For

these two sets, the weighted Jaccard similarity is (2 + 3) / (2 + 3 + 5 +7) = 5/17 = 0.294. Tokens

are substrings of original strings. Fuzzy transformations convert strings to sets before they

calculate the weighted Jaccard similarity. The internal component used for converting strings to sets is called a tokenizer. For example, the row {“Ruben Torres”, “5844 Linden Land”}

might be tokenized into the set {“Ruben”, “Torres”, “5844”, “Linden”, “Land”}. The default

tokenizer is for English text. You can change the LocaleId property in component properties.

Note that this is an advanced property, and you need to open the Advanced Editor for the


transformation in order to get to this property (right-click on either Fuzzy Lookup or Fuzzy

Grouping transformation in SSIS Designer and select the Show Advanced Editor option).
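To make the arithmetic concrete, the following T-SQL sketch computes the weighted Jaccard similarity for the two example sets. It is only an illustration of the formula, not part of the Fuzzy Lookup implementation; the table variables and weights simply reproduce the example above (and assume that a shared element carries the same weight in both sets).

-- Weighted Jaccard similarity for {(a,2), (b,5), (c,3)} and {(a,2), (c,3), (d,7)}
DECLARE @Set1 TABLE (Element nvarchar(10), Weight decimal(10,2));
DECLARE @Set2 TABLE (Element nvarchar(10), Weight decimal(10,2));
INSERT INTO @Set1 VALUES (N'a', 2), (N'b', 5), (N'c', 3);
INSERT INTO @Set2 VALUES (N'a', 2), (N'c', 3), (N'd', 7);
SELECT
  -- total weight of the intersection
  (SELECT SUM(s1.Weight)
   FROM @Set1 AS s1
   WHERE EXISTS (SELECT * FROM @Set2 AS s2 WHERE s2.Element = s1.Element))
  /
  -- total weight of the union (shared elements counted once)
  (SELECT SUM(u.Weight)
   FROM (SELECT Element, Weight FROM @Set1
         UNION
         SELECT Element, Weight FROM @Set2) AS u) AS WeightedJaccard;

Running this query returns approximately 0.294, the same value as in the example.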

Fuzzy transformations assign weights to tokens. Tokens get higher weights if they occur

infrequently and lower weights if they occur frequently. In database texts, for example,

frequent words such as “database” might be given a lower weight, whereas less frequent words

such as “broker” might be given a higher weight. In the Excel version of Fuzzy Lookup, you can

even override the default token weights by supplying your own table of token weights.
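The exact weighting formula is internal to the Fuzzy components, but the idea of frequency-based token weights can be sketched in T-SQL. The dbo.Tokens(RowId, Token) table and the logarithmic formula below are hypothetical illustrations, not the Microsoft implementation.

-- Illustrative frequency-based token weights: rare tokens get high weights.
-- dbo.Tokens(RowId, Token) is a hypothetical table of tokenized reference rows.
SELECT Token,
       COUNT(*) AS TokenFrequency,
       LOG((SELECT COUNT(DISTINCT RowId) FROM dbo.Tokens) * 1.0 / COUNT(*)) AS TokenWeight
FROM dbo.Tokens
GROUP BY Token
ORDER BY TokenWeight DESC;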

Fuzzy components are additionally enhanced with token transformations, which convert a token from one string to another. There are many classes of such transformations that Fuzzy

components handle automatically, such as spelling mistakes, string prefixes, and string

merge/split operations. In the Excel version, you can also define a custom transformation table

to specify conversions from one token to another. For example, you can specify that the “Inc” token has to be converted to the “Incorporated” token.

The Jaccard coefficient is further enhanced to Jaccard similarity under transformations. This is

the maximum Jaccard similarity between any two transformations of each set. With a given set

of transformation rules (either from your table in Excel, or only built-in rules in SSIS), all possible

transformations of the set are considered. For example, for the sets {a, b, c} and {a, c, d} and the

transformation rules {b=>d, d=>e}, the Jaccard similarity is computed as follows:

Variations of {a, b, c}: {a, b, c}, {a, d, c}

Variations of {a, c, d}: {a, c, d}, {a, c, e}

Maximum Jaccard similarity between all pairs:

J({a, b, c}, {a, c, d}) = 2/4 = 0.5

J({a, b, c}, {a, c, e}) = 2/4 = 0.5

J({a, d, c}, {a, c, d}) = 3/3 = 1.0

J({a, d, c}, {a, c, e}) = 2/4 = 0.5

The maximum is 1.0.


Fuzzy components also use the Edit (Levenshtein) distance. The algorithm for Edit distance was

described earlier in this chapter. Fuzzy components include an additional internal

transformation provider called EditTransformationProvider, which generates specific

transformations for each string and creates a transformation from the token to all words in its

dictionary that are within a given edit distance. The normalized edit distance is the edit distance

divided by the length of the input string.
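As a sketch of the normalization step only: assuming a scalar function that returns the edit distance between two strings (dbo.EditDistance below is hypothetical; earlier in this chapter we used the similarity functions from the MDS database instead), the normalized value is simply the distance divided by the length of the input string.

-- Normalized edit distance = edit distance / length of the input string.
-- dbo.EditDistance is a hypothetical scalar function (for example, a CLR
-- implementation of the Levenshtein algorithm described earlier).
DECLARE @input nvarchar(200) = N'Ruben Torres';
DECLARE @reference nvarchar(200) = N'Ruban Tores';
SELECT dbo.EditDistance(@input, @reference) AS EditDistance,
       dbo.EditDistance(@input, @reference) * 1.0 / LEN(@input) AS NormalizedEditDistance;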

As you can see, Fuzzy transformations use quite an advanced algorithm, which combines many

public algorithms and some internal components. But can we expect better results than we

obtained with the public algorithms provided in the MDS database? Let’s run some tests, starting with the SSIS Fuzzy Lookup transformation.

Configuring SSIS Fuzzy Lookup

To test the Fuzzy Lookup transformation, we need to create a new empty target table, as shown

in Listing 18.

Listing 18: A New Table for Fuzzy Lookup Matching

CREATE TABLE dbo.FuzzyLookupMatches
(
 CustomerId int,
 FullName nvarchar(200),
 StreetAddress nvarchar(200),
 CityRegion nvarchar(200),
 MaritalStatus nvarchar(1),
 Updated int,
 MasterCustomerId int,
 MCid int
);

Of course, even Fuzzy Lookup cannot perform magic. If we try to match every row from the left

table with every row from the right table, we still get a full cross join. Therefore, it is wise to do

some optimization here as well.

In the SSIS package, we start with exact matches again, using the Lookup transformation. The

target table is the source, and the master table is the lookup table for the Lookup

transformation. The rows that did not match (i.e., the “Lookup no Match Output”) are then


driven to the Fuzzy Lookup transformation. After exact and approximate matches are done, we

make a union of all rows and send them to the output. If you want to test Fuzzy Lookup by

yourself, follow these steps.

1. Open Business Intelligence Development Studio (BIDS).

2. In BIDS, create a new SSIS project. Name the solution MDSBook_Ch04, the

project MDSBook_Ch04_SSIS, and save the solution to any folder you want (we

suggest C:\MDSBook\Chapter04).

3. In Solution Explorer, right-click on the default package with the name

Package.dtsx. Rename it to FuzzyLookup.dtsx. Click OK in the message box

window to rename the package object.

4. From the Toolbox, drag the Data Flow Task to the Control Flow working area. In

SSIS Designer, click on the Data Flow tab.

5. From the Toolbox, drag the OLE DB Source to the Data Flow working area. Right-

click on it, then select the Rename option, and rename it to CustomersTarget.

6. Double-click on the source to open the OLE DB Source Editor. In the Connection

Manager tab, create a new connection to your SQL Server system with the

MDSBook database. Select the dbo.CustomersTarget in the Name of the table or

the view drop-down list. Your data source should be similar to the one shown in

Figure 9.


Figure 9: CONFIGURING THE DATA SOURCE

7. Click on the Columns tab to get the columns mapping. Click OK to close the OLE DB Source Editor.

8. Drag the Lookup transformation to the canvas. Rename it to Exact Matches.

Connect it with the source using the green arrow from the CustomersTarget data

source. Double-click on it to open the editor.

9. In the Lookup Transformation Editor window, in the General tab, use the Specify

how to handle rows with no matching entries drop-down list to select the option

Redirect rows to no match output.

10. Use the Connection tab to specify the lookup table. Use the same connection

manager as you created for the data source. Using the Use a table or view option,

select the dbo.CustomersMaster table.


11. In the Columns tab, drag and drop the FullName column from the left table to the

FullName column in the right table. Repeat this for StreetAddress and CityRegion

columns. These three columns will be used for matching rows. Click on the check

box near the CustomerId column in the right table. This is the column we are

adding to our CustomersTarget table. In the Output Alias cell, rename the

CustomerId column to MCid. Your Columns tab of the Lookup Transformation

Editor should be similar to the one in Figure 10. Click OK to close the editor.

Figure 10: CONFIGURING THE LOOKUP TRANSFORMATION

12. Drag and drop the Fuzzy Lookup transformation to the canvas. Rename it to

Approximate Matches. Connect it with the Lookup transformation using Lookup

no Match Output. Double-click on it to open the Fuzzy Lookup Transformation

Editor.


13. In the Reference Table tab, use the same connection manager as for the Lookup

transformation reference table, and again select the dbo.CustomersMaster table

for the reference table name.

14. Check the Store new index check box. By storing the index, you can optimize

subsequent executions (see Books Online and the sidebar “Storing an Index”).

Your Reference Table should resemble the one in Figure 11.

Figure 11: CONFIGURING THE REFERENCE TABLE TAB

Storing an Index

When the package first runs the transformation, the transformation copies the reference

table, adds a key with an integer data type to the new table, and builds an index on the

key column. Next, the transformation builds an index, called a match index, on the copy

of the reference table. The match index stores the results of tokenizing the values in the


transformation input columns, and the transformation then uses the tokens in the

lookup operation. The match index is a table in a SQL Server 2000 or later database.

When the package runs again, the transformation can either use an existing match index

or create a new index. If the reference table is static, the package can avoid the

potentially expensive process of rebuilding the index for repeat sessions of data

cleaning.

15. Click on the Columns tab. Delete the connection between left and right

CustomerId columns. Check the box near the right CustomerId column. Rename

the output column name to MCid. If connections between FullName,

StreetAddress, and CityRegion columns from left and right do not exist, create

them. These settings are the same as columns settings for the Lookup

transformation.

16. Click on the Advanced tab. Set the similarity threshold to 0.50 (see the sidebar

“Similarity Threshold”). Check that Maximum number of matches to output per

lookup is set to 1. We are selecting the best match according to the Fuzzy Lookup

only, like we did when we matched rows using the Jaro-Winkler algorithm. Your

Advanced tab should be similar to the one in Figure 12. Click OK.


Figure 12: THE ADVANCED TAB OF THE FUZZY LOOKUP TRANSFORMATION EDITOR

Similarity Threshold

The closer the value of the similarity threshold is to 1, the closer the resemblance of the

lookup value to the source value must be to qualify as a match. Increasing the threshold

can improve the speed of matching because fewer candidate records need to be

considered. You can optimize the matching by using multiple Fuzzy Lookups: first you match rows with a high similarity threshold, then you lower the threshold, then lower it again, and so on. Because each pass has fewer rows left to match, you can control the performance.

17. Drag and drop the Union All transformation to the working area. Connect it with

the Lookup Match Output from the Lookup transformation and with the default

output from the Fuzzy Lookup transformation.


18. Drag and drop OLE DB Destination to the working area. Rename it to

FuzzyLookupMatches. Connect it with the green arrow from the Union All

transformation. Double-click on it to open the editor.

19. In the Connection Manager tab, use the same connection manager to the

MDSBook database as in all sources and transformations in this exercise. Select

the dbo.FuzzyLookupMatches table in the Name of the table or the view drop-

down list.

20. Click on the Mappings tab to check whether the column mappings are correct.

Because we used the same column names throughout the package and we used

the same names in the destination table, the automatic mapping should be

correct. Click OK to close the OLE DB Destination Editor. The data flow of your package should be similar to the one that Figure 13 shows.


Figure 13: THE COMPLETE DATA FLOW FOR FUZZY LOOKUP

21. Save the project. Do not exit BIDS.

Testing SSIS Fuzzy Lookup

Now it is time to test how the Fuzzy Lookup transformation works. At this point, we need to

execute the package. After execution, we need to measure the results. Let’s start by calculating

the number of matched rows, or, more simply, the number of unmatched rows:

SELECT COUNT(*) FROM dbo.FuzzyLookupMatches

WHERE MCid IS NULL;

From this number, it is easy to calculate the number of matched rows. In my case, the number of unmatched rows was 608. Because I had 7,790 rows to match, this means that Fuzzy Lookup actually matched 92.20 percent of the rows to match. With the following query, we can measure the number of errors:

SELECT * FROM dbo.FuzzyLookupMatches

WHERE CustomerId <> MCid * (-1);

In my case, the result was fantastic: no errors at all. At first glance, it looked too good to be true to me. Therefore, I ran many additional tests and only occasionally got an error. Out of more than 20 tests, I got two errors only once!

What about the execution time? After you execute the package, you can read execution time

using the Execution Results tab of the SSIS Designer. In my case, execution (or elapsed) time was

15 seconds. However, this time included the time needed to read the data, to perform exact

matches with the Lookup transformation, to run the Fuzzy Lookup transformation, union the

data, and write the data to the destination.
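If you prefer to gather these measurements in a single query, in the style of Listing 17, something along the following lines works against the dbo.FuzzyLookupMatches table. This is only a convenience sketch, not part of the package; the constant 7,790 is, as before, the number of rows to match.

-- Summary of the Fuzzy Lookup results
SELECT COUNT(MCid) AS RowsMatched
 ,100.0 * COUNT(MCid) / 7790 AS PctMatched
 ,SUM(CASE WHEN CustomerId <> MCid * (-1) THEN 1 ELSE 0 END) AS NumErrors
 ,100.0 * SUM(CASE WHEN CustomerId <> MCid * (-1) THEN 1 ELSE 0 END)
        / NULLIF(COUNT(MCid), 0) AS PctErrors
FROM dbo.FuzzyLookupMatches;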


Let’s compare the Fuzzy Lookup results with the best results we got using public algorithms.

With public algorithms, we matched 87.89 percent of rows to match with 0.48 percent of errors

in 7 seconds (approximate match only). With Fuzzy Lookup, we matched 92.20 percent of rows

to match with 0.00 percent of errors in 15 seconds (exact and approximate match).

As we could expect, Microsoft’s Fuzzy algorithm really works. However, you should still test

other algorithms on your data. It is impossible to guarantee that Fuzzy Lookup will outperform all other algorithms on any kind of data. That’s why we developed different methods using

different algorithms for identity mapping.

Fuzzy Lookup Add-In for Excel

After you download and install the Fuzzy Lookup add-in for Excel, you get this power of identity

mapping for Excel tables as well. There is nothing much to add from the algorithm perspective—

of course, this is the same Fuzzy Lookup algorithm as we described earlier. Exhaustive

instructions on how to use the add-in are provided in a PDF document and in a demo Excel file.

Figure 14 shows a quick overview of the Excel demo file with Fuzzy Lookup prepared for

configuration.


Figure 14: EXCEL FUZZY LOOKUP


De-Duplicating

De-duplicating is a very similar problem to identity mapping; more accurately, it is actually the same problem. SSIS has a separate Fuzzy Grouping transformation for this task. It groups rows based on string similarities. Let’s see how we can use either of the two Fuzzy transformations for both problems.

Let’s start with using Fuzzy Grouping for identity mapping. This is a very simple task. We can just

make a not distinct union (in T-SQL, use UNION ALL rather than the UNION operator) of all rows

from both the master and target tables. Then we can perform the grouping on this union. The

result is the same as we would get with identity mapping.
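For example, the input for such a grouping could be prepared with a query along these lines (a sketch only, using the common columns of the two tables from this chapter):

-- Not distinct union of master and target rows as the input for Fuzzy Grouping
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersMaster
UNION ALL
SELECT CustomerId, FullName, StreetAddress, CityRegion
FROM dbo.CustomersTarget;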

To turn the problem around: how could we perform de-duplication with Fuzzy Lookup or another string-similarity matching method? We should use the same table twice, once as the master and

once as the target table. We would immediately exclude exact matches for all the character

columns and with the same keys (i.e., matches of a row with itself). In the next step, we would

perform the identity mapping of exact matches of character columns of rows with different

keys. Then we would perform approximate matching on the rest of the rows. Finally, we could

delete all the rows that got a match (i.e., the same identification) except one. The Excel Fuzzy

Lookup add-in can actually be used for de-duplicating as well. For the left and the right table for

matching, you can define the same Excel table.
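The exact-match step of this procedure can be sketched in T-SQL as a self join; the query below uses the dbo.CustomersMaster table purely as an example and keeps the row with the higher key, anticipating the choice we make in the next section.

-- Exact duplicates: same character columns, different keys (self-matches excluded)
SELECT d.CustomerId AS DuplicateId
 ,k.CustomerId AS KeptId
FROM dbo.CustomersMaster AS d
 INNER JOIN dbo.CustomersMaster AS k
  ON d.FullName = k.FullName
 AND d.StreetAddress = k.StreetAddress
 AND d.CityRegion = k.CityRegion
 AND d.CustomerId < k.CustomerId; -- keep the row with the higher Id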

Nevertheless, you might find it easier to perform the de-duplicating by using the SSIS Fuzzy

Grouping transformation. Therefore, let’s test it. Before testing, let’s prepare some demo data.

We will add all the rows from the CustomersTarget table to the CustomersMaster table and

then try to de-duplicate them.

We can add the rows with an INSERT query, as shown in Listing 19.

Listing 19: Adding Duplicate Rows to the CustomersMaster Table


INSERT INTO dbo.CustomersMaster
 (CustomerId
 ,FullName
 ,StreetAddress
 ,CityRegion
 ,NumberCarsOwned)
SELECT t.CustomerId
 ,t.FullName
 ,t.StreetAddress
 ,t.CityRegion
 ,NULL
FROM dbo.CustomersTarget AS t

Preparing for Fuzzy Grouping

Before we start using the Fuzzy Grouping transformation, we have to think about the algorithm.

We need to optimize it similarly to how we optimized the identity mapping process. We have to

perform exact matches first. We will use the Aggregate transformation for this task. We will

group by FullName, StreetAddress, and CityRegion columns. However, we need to decide which

CustomerId to keep. For this example, let’s say we keep the higher Id. This is quite a plausible scenario, because a higher Id could mean that the row came to our system later than the one with the lower Id and might therefore be cleaner. Because we have

negative CustomerId values in the CustomersTarget table, this means we will retain the Id from

the CustomersMaster table.

After we have the Id, we need to find the correct value for the NumberCarsOwned column. We

will use the Lookup transformation and perform a lookup on the same table we are de-

duplicating (i.e., the CustomersMaster table). Because we will have only the higher Ids in the

aggregated table, we will get the correct match and read the value of the NumberCarsOwned

attribute from the correct row.
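Expressed in T-SQL, the combination of the Aggregate and the Lookup transformations corresponds roughly to the following query; this is only a sketch of the logic we are about to implement with data flow components.

-- Exact matches: group by the character columns, keep the higher CustomerId,
-- then look up NumberCarsOwned from the row that owns that Id
SELECT a.CustomerId, a.FullName, a.StreetAddress, a.CityRegion, m.NumberCarsOwned
FROM
 (SELECT MAX(CustomerId) AS CustomerId, FullName, StreetAddress, CityRegion
  FROM dbo.CustomersMaster
  GROUP BY FullName, StreetAddress, CityRegion) AS a
 INNER JOIN dbo.CustomersMaster AS m
  ON m.CustomerId = a.CustomerId;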

First, we have to prepare the destination table. Besides the original columns, we are adding

columns that the Fuzzy Grouping transformation provides. We are using the default Fuzzy

Grouping names, as shown in Listing 20.

Listing 20: Table for Fuzzy Grouping Matches

CREATE TABLE dbo.FuzzyGroupingMatches


(
 _key_in int NULL,
 _key_out int NULL,
 _score real NULL,
 CustomerId int NULL,
 FullName nvarchar(200) NULL,
 StreetAddress nvarchar(200) NULL,
 CityRegion nvarchar(200) NULL,
 NumberCarsOwned tinyint NULL,
 FullName_clean nvarchar(200) NULL,
 StreetAddress_clean nvarchar(200) NULL,
 CityRegion_clean nvarchar(200) NULL,
 _Similarity_FullName real NULL,
 _Similarity_StreetAddress real NULL,
 _Similarity_CityRegion real NULL
);

There will be some additional work after Fuzzy Grouping finishes. The transformation adds the

following columns to the output:

_key_in, a column that uniquely identifies each row for the transformation;

_key_out, a column that identifies a group of duplicate rows;

The _key_out column has the value of the _key_in column in the canonical data row. The canonical row is the row that Fuzzy Grouping identified as the most plausible correct row and used for comparison (i.e., the row used for standardizing data). Rows with the same value in _key_out belong to the same group, and that value corresponds to the _key_in of the group’s canonical row. Presumably, we would keep only the canonical row from each group (a query sketch for this follows the column descriptions below);

_score, a value between 0 and 1 that indicates the similarity of the input row to the

canonical row. For the canonical row, the _score has a value of 1.

In addition, Fuzzy Grouping adds columns used for approximate string comparison with clean

values. Clean values are the values from the canonical row. In our example, these columns are

the FullName_clean, StreetAddress_clean, and CityRegion_clean columns. Finally, the

transformation adds columns with similarity scores for each character column used for

approximate string comparison. In our example, these are the _Similarity_FullName,

_Similarity_StreetAddress, and _Similarity_CityRegion columns.
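For example, once the package has run, keeping only the canonical row from each group comes down to a simple filter on the destination table; in a sketch against the dbo.FuzzyGroupingMatches table from Listing 20, the canonical rows are exactly those whose _key_in equals their _key_out.

-- Keep only the canonical row of each group
SELECT CustomerId, FullName, StreetAddress, CityRegion, NumberCarsOwned
FROM dbo.FuzzyGroupingMatches
WHERE _key_in = _key_out;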


SSIS Fuzzy Grouping Transformation

We will create a new SSIS package for the de-duplicating problem. Use the following steps:

1. If you closed it, open BIDS and re-open the MDSBook_Ch04_SSIS project.

2. In Solution Explorer, right-click on the Packages folder and add a new package.

Rename it to FuzzyGrouping.dtsx. Click OK in the message box window to rename

the package object as well.

3. From the Toolbox, drag the Data Flow Task to the Control Flow working area. In

SSIS Designer, click on the Data Flow tab.

4. From the Toolbox, drag the OLE DB Source to the Data Flow working area. Right-

click on it, then select the Rename option and rename it to CustomersMaster.

5. Double-click on the source to open the OLE DB Source Editor. In the Connection

Manager tab, create a new connection to your SQL Server system with the

MDSBook database. Select dbo.CustomersMaster in the Name of the table or the

view drop-down list.

6. Click on the Columns tab to get the columns mapping. Click OK.

7. Drag the Aggregate transformation to the working area. Rename it to Exact

Matches. Connect it with the source using the green arrow from the

CustomersMaster data source. Double-click on it to open the editor.

8. Configure the FullName, StreetAddress, and CityRegion columns for the Group by operation. Use the Maximum operation for the CustomerId column. Click OK. Figure

15 shows the Aggregate transformation configuration.


Figure 15: CONFIGURING THE AGGREGATE TRANSFORMATION

9. Drag the Lookup transformation to the canvas. Rename it to Adding

NumberCarsOwned. Connect it with the Aggregate transformation, using the green arrow coming out of the Aggregate (Exact Matches) transformation. Double-click on it to open

the editor.

10. In the Lookup Transformation Editor window, in the General tab, use the default

settings.

11. Use the Connection tab to specify the lookup table. Use the same connection

manager as you created for the data source. Using the Use a table or view option,

select the dbo.CustomersMaster table.

12. Click on the Columns tab. Make sure that the only connection is between the

CustomerId columns from the left and right. Use NumberCarsOwned as the


lookup column, and make sure you use the same name for the output alias.

Figure 16 shows the correct configuration.

Figure 16: COLUMN USAGE FOR THE LOOKUP TRANSFORMATION

13. Drag the Fuzzy Grouping transformation to the working area. Rename it to

Approximate Matches. Connect it to the Lookup transformation using the Lookup Match Output. Double-click on it to open the Fuzzy Grouping Transformation

editor.

14. In the Connection Manager tab, use the same connection manager as in previous

steps.

15. In the Columns tab, use FullName, StreetAddress, and CityRegion as the grouping

columns, and CustomerId and NumberCarsOwned as pass-through columns, as

Figure 17 shows.


Figure 17: COLUMN USAGE FOR THE FUZZY GROUPING TRANSFORMATION

16. Click on the Advanced tab and check the advanced settings, especially the Input,

Output, and Similarity Score column names. Make sure you are using the same

names as in the dbo.FuzzyGroupingMatches table. Set the similarity threshold to

0.50 (as we did for the Fuzzy Lookup transformation when we did identity

mapping). Click OK to close the Fuzzy Grouping Editor.

17. Drag and drop OLE DB Destination to the working area. Rename it to

FuzzyGroupingMatches. Connect it with the green arrow from the Fuzzy

Grouping transformation. Double-click on it to open the editor.

18. In the Connection Manager tab, use the same connection manager to the

MDSBook database as in all sources and transformations in this exercise. Select


the dbo.FuzzyGroupingMatches table in the Name of the table or the view drop-

down list.

19. Click on the Mappings tab to check whether the column mappings are correct.

Because we used the same column names throughout the package and the same

names in the destination table, the automatic mapping should be correct. Click

OK to close the OLE DB Destination Editor. Your data flow of the package should

be similar to that in Figure 18.

Figure 18: THE COMPLETE DATA FLOW FOR FUZZY GROUPING

20. Save the project. Do not exit BIDS.


Testing SSIS Fuzzy Grouping

Now it’s time to execute the package! After the execution, we have to measure the results. Let’s

start with execution time. In my test, it was about 57 seconds. Therefore, it seems like it was

less efficient than Fuzzy Lookup. In addition, we are not finished yet. We have to check the content

of the destination table. A very simple query gives us an overview of the results, which Figure 19

shows.

SELECT *

FROM dbo.FuzzyGroupingMatches

Figure 19: RESULTS OF THE FUZZYGROUPINGMATCHES QUERY

At first glance, the results are quite satisfying. Rows with _key_out (the second column)

equal to 1 (the first two rows) were correctly identified as duplicates, and the row with the

positive CustomerId (the first row in the output, with CustomerId 14348) was identified as the

canonical row (_key_in equal to 1). However, let’s run some additional tests.

In our next query, let’s count the number of duplicate rows in each group (i.e., the number of

rows with the same _key_out value).

SELECT _key_out
 ,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
ORDER BY NumberOfDuplicates DESC;


From our data, we would expect that the maximum number of rows in a group would be 2.

However, the results in Figure 20 show a different picture.

Figure 20: RESULTS OF QUERYING THE NUMBER OF _KEY_OUT DUPLICATES

In this test, we got as many as 19 rows identified as duplicates in a single group! Of course, we could set a higher similarity threshold and get more accurate matches for duplicates. However, we set it to 0.50 to have a direct comparison of the efficiency of Fuzzy Grouping with the Fuzzy Lookup transformation. Let’s visually check the group with the highest number of identified duplicates, with the query shown in Listing 21.

Listing 21: Checking Rows with High Number of Duplicates

WITH NumberOfDuplicatesCTE AS
(
SELECT _key_out
 ,COUNT(_key_out) AS NumberOfDuplicates
FROM dbo.FuzzyGroupingMatches
GROUP BY _key_out
)
SELECT *
FROM dbo.FuzzyGroupingMatches
WHERE _key_out =
 (SELECT TOP 1 _key_out
  FROM NumberOfDuplicatesCTE
  ORDER BY NumberOfDuplicates DESC);


Figure 21 shows the results.

Figure 21: RESULTS OF CHECKING THE ROWS WITH HIGH NUMBER OF DUPLICATES

From the results, we can see that the canonical row was not identified properly for all the rows.

For example, the canonical row for the sixth row (i.e., the row with CustomerId -25321) should

be the second row (i.e., the row with CustomerId 25321). Many correct rows with positive

CustomerId values were identified incorrectly as duplicates of the first row (the row with

CustomerId equal to 25320), which was identified as the canonical row for this group.

Apparently, we would have to perform more manual work in order to finish de-duplicating by

using Fuzzy Grouping than by using the Fuzzy Lookup transformation. Of course, we should experiment more with Fuzzy Grouping, using different similarity threshold settings. We could de-duplicate in consecutive passes, with a high similarity threshold first, then a slightly lower one, then a lower one still, and so on. Nevertheless, it seems that the Fuzzy Lookup

transformation could be more suitable for de-duplicating than Fuzzy Grouping. Not only did it

give us better results, but it also easily managed to outperform Fuzzy Grouping.

Clean-Up

To clean up the MDSBook database, use the code from Listing 22.

Listing 22: Clean-Up Code

USE MDSBook;


IF OBJECT_ID(N'dbo.CustomersMaster', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersMaster;
IF OBJECT_ID(N'dbo.CustomersTarget', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersTarget;
IF OBJECT_ID(N'dbo.CustomersMasterNGrams', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersMasterNGrams;
IF OBJECT_ID(N'dbo.CustomersMasterNGramsFrequency', N'U') IS NOT NULL
 DROP TABLE dbo.CustomersMasterNGramsFrequency;
IF OBJECT_ID(N'dbo.FuzzyLookupMatches', N'U') IS NOT NULL
 DROP TABLE dbo.FuzzyLookupMatches;
IF OBJECT_ID(N'dbo.FuzzyGroupingMatches', N'U') IS NOT NULL
 DROP TABLE dbo.FuzzyGroupingMatches;

GO


Summary

As you can see in this chapter, identity mapping and de-duplicating are not simple. We

developed a custom algorithm for identity mapping using the functions from the MDS database.

We tested it on data from the AdventureWorksDW2008R2 demo database. We made errors in

the data in a controllable way, so we could measure the results of the tests throughout the

chapter. Through the tests, we realized that the quality of the results and the performance of

our de-duplicating algorithm are highly dependent on proper selection of parameters.

After testing the manual procedure, we used the SSIS Fuzzy Lookup transformation for identity mapping. We also introduced the Fuzzy Lookup add-in for Excel, which brings the power of

this transformation to advanced users on their desktops. We showed that de-duplicating is

actually the same problem as identity mapping. Nevertheless, we also tested the SSIS Fuzzy

Grouping transformation.

According to the tests in this chapter, the Fuzzy Lookup transformation is a clear winner. It gave

us better results than any other option, including Fuzzy Grouping, and with quite astonishing performance. Nevertheless, this does not mean you should always use Fuzzy Lookup

for identity mapping and de-duplicating. You should test other possibilities on your data as well.

The tests presented in this chapter are quite exhaustive, so much of the heavy work of identity mapping and de-duplicating should already be mitigated for you.

This is the last chapter in this version of the book. However, stay tuned; there are many exciting

new features coming with the next release of SQL Server, code-named Denali. Microsoft has already announced rewritten Master Data Services, improved Integration Services, and a completely new application called Data Quality Services. We will update this book to incorporate these new

technologies when they become available.


References

53 Microsoft SQL Server MVPs: SQL Server MVP Deep Dives (Manning, 2010)

Carlo Batini, Monica Scannapieco: Data Quality – Concepts, Methodologies and

Techniques (Springer-Verlag, 2006)

Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server

Levenshtein distance on Wikipedia

Jaccard index on Wikipedia

Jaro-Winkler distance on Wikipedia

Ratcliff/Obershelp pattern recognition on National Institute of Standards and Technology

Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005 article on

MSDN describing more details about Fuzzy transformations than Books Online

Fuzzy Lookup Add-In for Excel

Microsoft Research Data Cleaning Project


Index

.NET, 7, 11, 47, 57, 60, 133

_key_in, 205, 211

_key_out, 205, 211, 212

a composite candidate key, 135

accuracy, 18, 38, 45, 105, 118, 123, 126, 135, 136,

139, 180, 189

accurate, 6, 16, 126, 166, 170, 212

administrator, 95

AdventureWorks2008R2, 122, 138

AdventureWorks2008R2, 107, 112, 121, 122, 126,

127, 149, 154, 158, 160

aggregate functions, 128

Aggregate transformation, 204, 206, 207

aggregation, 30

Alberto Ferrari, 162

algorithm, 38, 40, 44, 53, 121, 136, 156, 166, 167,

173, 175, 176, 178, 186, 189, 190, 192, 201

Analysis Services, 11, 12, 53, 78, 121, 145, 147, 150,

157, 159, 161

analytical application, 13, 14, 16, 17, 40

Anastasios Yalanopoulos, 167

application, 6, 7, 13, 15, 18, 20, 21, 28, 33, 39, 40, 44,

45, 47, 50, 52, 56, 59, 61, 62, 63, 101, 103, 115,

153, 155, 163, 215

Arkady Maydanchik, 162

assessment, 42, 43

association, 23, 120, 121, 124, 125

Attribute Discrimination, 124, 125

Attribute Groups, 75, 76

Attribute Groups Maintenance, 75, 76

Attribute Management, 70

attributes, 48, 49, 66, 69, 71, 74, 75, 82, 99, 100, 128,

146

auditing, 16, 17, 19, 45

authoritative source, 18, 20

authority, 22

Axis Fields (Categories), 151

bias, 177

BIDS, 121, 122, 125, 137, 141, 145, 147, 148, 153,

157, 160, 193, 200, 206, 210

Bill Inmon, 33, 55

bill of material, 14, 15

bill of materials, 16

blocking, 176, 177, 182

Boolean, 133

business intelligence, 11, 142

business interest, 23

business needs, 42, 106

business problem, 10, 14, 29, 40, 42

business requirements, 41, 42, 63

Business Rule Maintenance, 85, 86

business rules, 6, 18, 38, 41, 47, 56, 63, 64, 92

Business Rules, 85, 88, 92, 96

C. J. Date, 24, 55

canonical row, 205, 211, 213

cardinality, 176

Carlo Batini, 216

Cartesian product, 176

central storage, 19, 20, 21, 45

Check constraint, 28

Circular References, 79

class libraries, 47

classify, 35


cleansing, 6, 18, 19, 20, 40, 44, 52

client tier, 28

closed world assumption, 37

Closed world assumption, 106

CLR, 51, 133, 134, 145, 160, 161, 167, 181

Clustering, 53, 157

Code attribute, 76

collation, 21

Collection Attributes, 82

collections, 50, 56, 63, 68, 102

Collections, 49, 50, 84, 99

committed version, 97

common columns, 169, 171, 184

Common Table Expressions, 116

common-prefix length, 169

communicate, 19, 46

completeness, 37, 40, 45, 105, 106, 107, 109, 114,

117, 118, 119, 120, 123, 126, 134, 135, 136

complexity, 16

COMPLEXITY_PENALTY, 139

compliance, 17, 41

components, 57, 58, 191, 192

compromise, 21

concept, 14

Concepts, 56, 216

Configuration Manager, 58, 59, 61, 98

Connection Manager, 153, 193, 199, 206, 208, 209

consistency, 39

consistent, 18, 79

Consolidated Attributes, 82

constant, 113

constraint, 6, 28, 37, 45, 74, 131, 134, 135

container, 49, 65, 66

continuous, 20, 21, 38, 128, 136, 139, 144, 150

continuous merge, 20, 21

contract, 14, 15

coordinated, 16

correcting, 44

corrective measures, 44

correctness, 40

CRM, 12, 15, 17, 33, 163

cross-system, 19

CTE, 110, 111, 116, 130, 143, 178, 185, 186

cumulative values, 130

current, 16, 33, 39, 103

customers, 6, 10, 14, 15, 16, 19, 24, 33, 39, 43, 44,

49, 65, 66, 67, 77, 84, 126, 128, 181

Danette McGilvray, 55

Data conflict, 22

Data Connection Wizard, 147, 158

Data Flow Task, 193, 206

data governance, 18, 45, 54

data integrity, 6, 7, 12, 17, 18, 23, 24, 28, 38, 54, 135

Data Mining, 11, 53, 139, 145, 156, 157, 158, 161,

162

data model, 6, 7, 12, 41, 62

data profiling, 36, 42, 105, 107, 131, 145, 148, 149,

156, 161

Data Profiling task, 52, 109, 145, 148, 153, 155, 161

Data Profiling Task Editor, 153, 154

data quality, 6, 7, 9, 12, 15, 17, 18, 21, 22, 24, 28, 34,

36, 37, 38, 42, 43, 44, 45, 46, 47, 51, 53, 54, 67,

74, 103, 105, 106, 120, 136, 142, 161

Data Source Views, 122, 137

data steward, 19

data stewardship, 19, 22, 45, 47, 50

data type, 21, 23, 33, 34, 35, 38, 74, 107, 109, 110,

118, 126, 128, 135, 196

data warehouse, 41, 45, 46, 49, 120, 122, 126, 138,

145


database, 11, 13, 14, 18, 19, 22, 24, 28, 31, 33, 35,

41, 47, 48, 51, 54, 57, 58, 59, 65, 67, 75, 79, 89,

90, 94, 98, 99, 100, 101, 104, 107, 112, 115, 120,

121, 122, 126, 127, 133, 134, 147, 149, 150, 154,

158, 159, 160, 166, 169, 175, 181, 191, 192, 193,

197, 199, 206, 209, 213, 215

DateTime, 74

Decision Trees, 136, 137, 138, 139, 140, 141, 145,

146

decomposition, 24, 25, 29

dedicated, 20, 54, 98, 103

de-duplicating, 164, 203

de-duplication, 163

definition, 15, 16, 17, 23, 33, 36, 62, 65, 67, 69, 80,

85, 168

Dejan Sarka, 6, 8, 11, 12, 56, 105, 162, 163

denormalized, 31, 126

Dependency Network, 121, 124, 141

deployment, 47, 50, 62, 103

derived columns, 32, 150

derived hierarchies, 49

Derived Hierarchy, 78, 79, 94

Derived Hierarchy Maintenance, 78

descriptive statistics, 38, 129

destinations, 32, 42, 47, 48, 54

DIFFERENCE(), 166, 174

Dimension, 14, 15, 31, 36, 146

Dimensional Model, 12, 14, 15, 23, 29, 30, 31, 32, 54,

161

discretize, 139, 144, 147

distinct states, 143

documentation, 22, 41

domain, 22, 23, 42, 43, 45, 49, 70

domain knowledge, 22

Domain-Based, 69, 71, 72, 73, 74, 78, 79, 100

duplicate primary key, 137

duplicate records, 43

duplication, 15, 44, 67, 163, 203

DW, 20, 31, 33, 48

ease of use, 39

Ed Katibah, 162

education, 43

efficiency, 169, 177, 178, 179, 180, 182, 185, 187,

212

employees, 14, 15, 29, 49, 79

Enable Change Tracking, 73

enforce, 6, 7, 18, 45, 74, 135

enforcing, 12, 23, 24

enterprise, 6, 12, 15, 18, 22, 23, 28, 33, 36, 42, 53,

103, 163

entities, 14, 15, 23, 24, 29, 45, 48, 49, 56, 63, 64, 67,

69, 71, 74, 75, 78, 79, 99, 100, 101, 163, 166

entity, 14, 15, 23, 40, 48, 49, 54, 70, 71, 73, 75, 78,

79, 80, 81, 85, 98, 99, 100, 101, 149, 161, 163,

165, 181

Entity Maintenance Page, 70

entropy, 38

Entropy formula, 142

entry point, 62

Erik Veerman, 162

erroneous, 114, 115, 127, 130

ETL, 52

evaluating the cost, 28

event, 14, 23

exact matching, 180

Excel, 147, 148, 149, 152, 153, 156, 157, 158, 159,

160, 161, 162, 163, 190, 191, 201, 202, 203, 215,

216

execution time, 200, 211

exist(), 109, 110


explicit hierarchies, 49, 102

Explicit Hierarchies, 80

Explicit Hierarchy, 79, 80, 81, 82, 89, 92, 93, 94, 99

Explorer, 63, 82, 84, 91, 122, 123, 137, 138, 145, 146,

153, 158, 193, 206

exporting, 56

Fact table, 14

Feature Pack, 156

File, 69, 74, 153

filtering, 180, 185, 188, 189

first pass, 189

five whys, 43, 44, 105

flag, 38, 39, 73, 81, 137

Free-Form, 69

Free-Text, 73

functional areas, 59, 62

functional dependency, 112

functionally dependent, 26, 115

Fuzzy Grouping, 52, 164, 190, 191, 203, 204, 205,

206, 208, 209, 210, 211, 212, 213, 215, 216

Fuzzy Lookup, 52, 164, 166, 189, 190, 191, 192, 193,

195, 197, 198, 200, 201, 202, 203, 204, 209, 211,

212, 213, 215, 216

Fuzzy transformations, 190, 191, 192, 216

generalization, 29

government, 17

governor, 19

Greg Low, 162

GROUPING SETS, 113

hard dimensions, 37, 42, 45

harmonizing, 17

header, 23

hierarchical data, 13

Hierarchies, 49, 71, 77, 79, 80, 81, 82, 83, 84, 85, 92,

99

historical data, 31

history, 16, 17, 33

HRM, 15

human resources, 31

identification, 15, 23, 119, 163, 165, 203

identity mapping, 7, 19, 21, 163, 164, 165, 166, 169,

174, 181, 189, 201, 203, 204, 209, 215

importing, 47, 103, 133, 149

Importing, 56, 89, 90, 150

improvement, 36, 43, 44, 45, 53, 54, 120, 136

improvement plan, 44

inaccuracies, 128, 135

inaccurate, 8, 15, 38, 44, 45, 126, 128, 129, 136, 141

incompleteness, 24, 136

inconsistent, 39, 44

independently, 19, 165

influence, 15, 24, 40, 136

information, 17, 18, 23, 25, 26, 29, 33, 34, 38, 39, 40,

44, 75, 76, 80, 91, 94, 105, 106, 112, 136, 139,

142, 143, 144, 159, 175, 190

Information Principle, 23

Information Theory, 38, 142

input, 69, 74, 121, 122, 123, 136, 138, 146, 166, 192,

197, 205

installation, 57, 58, 59

installer package, 57

installing, 57, 103

instance, 33, 38, 58, 59, 107, 109, 110, 122, 135, 149,

157, 158

integrating, 17

integration, 19, 40, 63

Integration Management, 63, 90, 93

Integration Services, 11, 12, 52, 109, 153, 155, 161,

215, 216

integrity rules, 28


interaction, 19, 39, 62

interactivity, 40

intersection, 21, 168, 190

interview, 37, 42, 45

Isaac Kunen, 162

Itzik Ben-Gan, 8, 162

Jaccard, 167, 168, 175, 190, 191, 216

Jaro-Winkler, 167, 168, 176, 177, 178, 179, 180, 186,

187, 197, 216

join, 28, 30, 126, 165, 177

key data, 6

keys mapping tables, 19

Leaf Attributes, 71

leaf level, 77

legacy applications, 115, 163

Levenshtein, 167, 175, 176, 192, 216

license, 14

life cycle, 15, 42, 44

LIKE, 132, 133, 137

list of defined values, 70

location, 14, 16

longest common substring, 167, 169

lookup table, 49, 70, 192, 194, 207

Lookup transformation, 192, 194, 195, 196, 197, 198,

200, 204, 207, 208, 213

low frequency, 38, 129, 155

Lubor Kollar, 162

magnifying glass, 83

Management Studio, 90, 107

managing data, 56

Mappings, 199, 210

Marco Russo, 162

master data, 6, 7, 12, 14, 15, 16, 17, 18, 19, 20, 21,

22, 32, 35, 36, 40, 42, 43, 45, 47, 48, 49, 50, 51,

52, 53, 54, 56, 77, 103, 126, 161, 163, 181

Master Data Management, 7, 11, 12, 16, 17, 62, 89

Master Data Manager, 50, 62, 63, 65, 101, 103

Master Data Services, 9, 12, 21, 45, 47, 54, 56, 57, 58,

62, 63, 64, 65, 66, 73, 74, 77, 78, 79, 82, 85, 89,

90, 91, 94, 95, 101, 103, 104, 158, 161, 166, 215

matching, 133, 165, 166, 167, 168, 170, 173, 176,

177, 178, 180, 181, 184, 185, 189, 190, 194, 195,

198, 203

matching characters, 168

MDM solution, 19, 21, 22, 28, 33, 35, 36, 45, 47, 54,

103, 105, 120, 161

mdm.StgBatch, 101

mdm.tblAttribute, 75, 100

mdm.tblEntity, 75, 98, 99

mdm.tblEntityMemberType, 99

mdm.tblModel, 98

mdm.tblStagingMember, 89, 101

mdm.tblStgMemberAttribute, 89, 101

mdm.tblStgRelationship, 89, 101

mdq.NGrams, 181, 182, 185

mdq.Similarity, 175, 176, 179, 187

MDS, 12, 21, 22, 47, 48, 49, 50, 51, 52, 54, 56, 57, 58,

59, 61, 76, 98, 99, 101, 103, 166, 167, 175, 181,

192, 215

MDS Hub, 47, 51, 56

member, 49, 63, 75, 79, 80, 82, 83, 84, 92, 94

member values, 67

merging, 19, 21, 36, 44, 52, 103, 161, 165, 166, 169,

182, 185, 203

metadata, 13, 19, 20, 21, 22, 45, 47, 100

Microsoft, 7, 9, 11, 47, 121, 122, 133, 138, 147, 149,

156, 162, 163, 190, 201, 215, 216

Microsoft Research, 190, 216

middle tier, 28

minimalization, 41


MINIMUM_SUPPORT, 139

Mining Structures, 122, 138

missing, 38, 107, 109, 110, 111, 114, 115, 120, 125

Model, 14, 30, 60, 64, 65, 66, 67, 75, 76, 90, 95, 98,

99, 103, 124, 140, 141

modify(), 110

Monica Scannapieco, 216

mutually independent, 27

MVP, 9, 11, 181, 216

Naïve Bayes, 121, 122, 123, 136, 137, 156

Name attribute, 100

naming convention, 24, 75, 99, 112

natural hierarchies, 31

navigation, 77

NEWID(), 171, 172, 173, 177, 178

nGrams, 180, 181, 182, 183, 184, 185, 188, 189

nodes(), 110

non-transitively dependent, 27

normal distribution, 129, 131

normal form, 24, 25, 26, 27, 41, 112

normalization, 24, 28, 54, 106, 115

Normalization, 24, 67, 106

notification, 47

noun, 14, 15

NULL, 28, 29, 37, 38, 41, 44, 90, 106, 107, 108, 109,

110, 112, 113, 114, 115, 116, 117, 118, 119, 120,

121, 125, 127, 131, 134, 160, 170, 178, 180, 182,

183, 185, 200, 204, 205, 214

nullable, 107, 108, 109, 111, 117

object, 11, 14, 35, 41, 67, 69, 153, 193, 206

ODS, 33

OLAP, 12, 13, 15, 30, 53, 145, 147, 161

OLAP cubes, 53

OLE DB Destination, 199, 209, 210

OLE DB Source, 193, 206

OLTP, 12, 13, 163

open world assumption, 37

Open world assumption, 106

operating system, 57

operational system, 17

optimization, 177, 192

order, 13, 15, 24, 27

OVER, 90, 116, 179, 183, 187

over-fitting, 137

package, 153, 154, 192, 193, 196, 197, 199, 200, 206,

210, 211

partitioning, 176, 177, 184, 185, 187, 189

pattern, 38

people, 14

People, 6

PivotChart, 148, 149, 152

PivotTable, 147, 148, 149, 151, 152

place, 14

policy, 16, 18, 22, 36

population, 20, 37, 106

portal, 50, 53, 65, 89

PowerPivot, 149, 150, 151, 152, 161, 162

PowerShell, 58, 60

predicate, 14, 23

predictable, 121, 122, 138

pre-select, 180

presentation quality, 40

preventing, 44

Preview, 79, 150

priority, 43

proactive, 6, 18

probability, 129, 142, 157, 173

Process Activation Service, 60

processing, 47, 49, 124

product categories, 15


products, 14, 15, 19, 26, 30, 49, 109, 111, 115, 126

proposition, 14, 23

pruning, 180

quality, 7, 10, 17, 18, 22, 36, 39, 40, 41, 42, 43, 45,

52, 54, 85, 103, 105, 107, 112, 117, 120, 161, 215

query(), 109

Ralph Kimball, 31, 55

randomness, 171, 173, 177

Ratcliff/Obershelp, 167, 216

RDBMS, 17, 18, 23, 28, 51

reactive, 6, 18

real world, 14, 37, 39, 41, 48, 75

Recursive Hierarchies, 79

recursive partitioning, 136

redundancy, 24, 27, 67, 70, 106

reference data, 63, 67, 89, 97

reference relation, 37, 106

Reference Table, 196

regular expressions, 51, 133, 134

relation, 23, 37, 106, 112, 114, 118, 119, 120, 126

relational database, 14, 15, 23, 35, 37, 67, 69, 74,

106, 119, 122, 138, 149

relational design, 29, 30

Relational Model, 6, 11, 12, 14, 23, 28, 31, 32, 54, 65,

70

relationship, 13, 14, 23, 29, 41, 49, 50, 69, 71, 74, 75,

78, 79, 99, 100, 101, 102

retention, 18, 42

re-usage, 17

reuse, 16, 17

Roger Wolter, 162

root cause, 36, 43, 44, 105, 106, 107, 114, 125, 136,

141, 161

Sakichi Toyoda, 43

sales representative, 14

sample data, 126, 169, 170

schema, 6, 13, 23, 24, 31, 32, 33, 34, 36, 40, 41, 42,

45, 98, 101, 105, 106, 107, 112, 114, 126, 135, 145

schema quality, 36, 40, 107

search space, 164, 166, 176, 182

search space reduction, 176

security, 18, 60, 62, 64, 74, 75, 76, 99

segment, 14

SELECT, 89, 90, 94, 99, 108, 109, 110, 111, 112, 116,

117, 118, 119, 120, 121, 127, 128, 129, 130, 131,

132, 134, 135, 137, 143, 144, 170, 171, 172, 173,

174, 175, 177, 178, 179, 182, 183, 184, 185, 186,

187, 200, 204, 211, 212

semi-structured data, 13

service, 15, 59

set, 14, 18, 36, 112, 149, 166, 168, 176, 177, 181,

190, 191

shared dimensions, 31

Simil, 167, 169, 175

Similarity Threshold, 197, 198

single copy of data, 45

soft dimensions, 37, 39, 43, 45

SolidQ, 8, 9, 10, 11

Solution Explorer, 122

sorting neighborhood, 177

SOUNDEX(), 166, 174

sources, 15, 22, 28, 32, 33, 40, 42, 44, 47, 48, 52, 54,

103, 161, 163, 165, 199, 209

specialization, 29, 30, 31, 41

spreadsheet, 34

SQL 2008 R2 sample databases, 107

SQL Server, 7, 9, 11, 12, 13, 21, 28, 34, 35, 45, 47, 51,

52, 53, 54, 56, 57, 58, 74, 78, 90, 94, 103, 105,

107, 109, 110, 113, 121, 133, 134, 135, 136, 145,


149, 153, 154, 155, 156, 158, 161, 162, 166, 167,

175, 181, 189, 190, 193, 197, 206, 215, 216

SSAS, 21, 53, 121, 122, 139, 145, 147, 148, 149, 152,

157, 158, 159

SSIS, 12, 21, 52, 109, 145, 148, 153, 156, 164, 166,

189, 191, 192, 193, 200, 203, 206, 211, 215

SSRS, 53

staging tables, 47

stakeholder, 42, 43, 45, 46

standard deviation, 128, 129

Star Schema, 14, 145

static, 16, 133, 197

Steve Kass, 162

string lengths, 131

string similarity, 164

string-matching, 44

strings profiling, 131

subject, 14

Subscription View, 93, 94

subscription views, 47

Subscription Views, 63, 95

subtype, 29, 41, 44, 107, 112, 116

System Administration Page, 67

Tamraparni Dasu, 162

taxonomy, 13, 40

Teo Lachev, 162

Testing Set, 139

text mining, 35

Theodore Johnson, 162

thing, 14

third normal form, 27

Thomas C. Redman, 162

timeliness, 39

timely, 6, 39

token, 181, 183, 184, 190, 191, 197

tokenizer, 190

Toolbox, 153, 193, 206

transactional, 13, 14, 15, 16, 17, 20, 41, 51, 54

transactional data, 13, 15

Transact-SQL, 37, 51, 94, 107, 161

transposition, 6

transpositions, 168

treatment, 16

trust, 40

T-SQL, 11, 89, 90, 101, 107, 113, 128, 132, 133, 134,

135, 143, 145, 162, 164, 166, 174, 175, 181, 203

tuple, 23, 37, 106, 118, 119, 126

unified, 19

uniform distribution, 142

UNION, 112, 127, 143, 144, 203

UNION ALL, 112, 203

Union All transformation, 198, 199

unique, 7, 18, 38, 134, 163

uniqueness, 134, 135

unstructured data, 13

UPDATE, 171, 172, 177, 178, 179, 180, 184, 186, 187

update anomalies, 26, 27, 28

User and Groups Permissions, 64

user-defined attributes, 89

validate, 33, 63, 92, 93, 96

validate data, 88, 95, 96

Validation Actions, 86

value(), 109, 135

verb, 14

version, 9, 50, 51, 67, 93, 95, 96, 97, 103, 115, 133,

149, 157, 190, 191, 215

versioning, 16, 17, 18, 19, 45, 50, 56, 93, 103

volatility, 16

vTargetMail, 126, 127, 128, 166, 169


Web, 18, 39, 47, 56, 57, 59, 61, 62, 63, 101, 103, 134,

163

WHERE, 108, 109, 110, 111, 117, 118, 119, 120, 121,

127, 128, 130, 132, 134, 137, 172, 173, 178, 179,

182, 185, 187, 200, 212

Wikipedia, 55, 167, 216

Windows, 57, 58, 59, 60, 147

workflow, 17, 18, 45, 47

worksheet, 148, 152, 159

XML, 13, 33, 34, 38, 107, 109, 110, 111, 135, 153,

155, 161, 163

XQuery, 107, 110, 161

XSD, 33, 135

xsi:nil, 107, 111