data quality class 6. this week homework questions data standardization

Data Quality

Class 6

This Week

Homework Questions Data Standardization

Data Standardization

What is a standard? Benefits of Standardization Defining Data standards Testing for standard form Transforming into standard form

What is a Standard?

a standard is something set up and established by authority, custom, or general consent as a model or example

a model to which all objects of the same class must conform.

What is a Standard? 2

conforms to a predefined expected format which may be defined by:

– an organization with some official authority (e.g., government)– some recognized authoritative board (such as a standards

committee)– negotiated agreement (such as electronic data interchange

(EDI) agreements)– de facto convention (e.g., telephone number formats)

Benefits of Standardization

conformity for comparison (as well as aggregation and analysis purposes)

an audit trail for data error accountability a streamlined means for the transfer and

sharing of information

Defining Standards

Find representative body Identify a simple set of rules that completely

specify the valid structure and meaning of a correct data value

Present the standard to the committee (or even the community as a whole) for comments

Document and publish standard

Testing for Standard Form

If there is a standard, there should be a way to test to see if data is in standard form

Example: US Telephone numbers– Defined by Industry Numbering Committee (INC)– NPA: Numbering Plan Area code– NXX: Central Office Code

Test for format conformance (e.g., 1-XXX-YYY-ZZZZ for telephone numbers)

Test for validity (e.g., is XXX a valid NPA, is YYY a valid NXX for the NPA XXX)

Transforming into Standard Form

Given a good standard, it should be straightforward to transform data into that form

Must be able to recognize data components to be able to place them in proper locations

Error Paradigms

How are errors introduced into data?– Attribute Granularity– Finger Flubs– Format Conformance– Semi-structured form– Transcription Errors– Transformation Flubs– Misfielded Data– Floating Data– Overloaded Attributes

Attribute Granularity

Data granularity is not at the proper level– Example: “name” vs. last name, first name

Creates confusion when more than one entity can be represented in the same attribute

Finger Flubs

This happens whem the incorrect letter is typed on the keybpard

Also, sometimes mnore than one letter is hit by mistake

Also, a leter might be missing

Format Conformance

When the format is too restrictive, the user may not be able to properly enter the data– Example: First name, middle initial, last name

Some people go by their middle name

Semi-structured form

There may be multiple “valid” formats that appear in free-form– Example: corporate structure laid out at web sites– Example:

(first name) (middle initial) (last name) or (last name), (first name)

Transcription Errors

Data is collected through “fuzzy” media and is not properly transcribed– Mispronounced data– Incorrect spellings

Transformation Flubs

Automated processing may introduce errors– We’ve already seen this example:

a database of names was found to have an inordinately large number of high-frequency word fragments, such as “INCORP,” “ATIONAL,” “COMPA.”

Text spanned multiple fields, which were not concatenated properly on extraction

Misfielded Data

Data that is placed in the wrong field Example:street addresses

– Fields may not be big enough– Text spills over to next field

Floating Data

Information that belongs in one field is contained in different fields in different records in the database

See examples in housing authority database

Overloaded Attributes

More than one entity shows up in data Example:

– John and Mary Smith, TTES, Smith Foundation

Record Parsing

Tokenizing data elements within an attribute Assign meaning to tokens

– Domain membership– Patterns– Context

Record Parsing 2

In order to do this, we need:– The names and types of the data components expected to be

found in the field– The set of valid values for each data component type– The acceptable forms that the data may take– A means for tagging records that have unidentified data

components

We can do this with domains, mappings, and rules!

Data Correction

If we can automatically recognize data as not conforming to a standard, can we automate its correction?

If we have translation rules or mappings from incorrect values to correct values

This is how many data cleansing applications work– example: InternatinalInternational

Data Correction 2

Correction by consolidation Makes use of record linkage

– Find a pivot attribute across which to link– The pivot should be unique (such as social security

number)– Link records together and consolidate “correct”

name based on other factors, such as data source, timestamp, etc.

Data Standardization

Use standard form as a pivot for linkage and consolidation

Example– Elizabeth R. Johnson, 123 Main St – Beth R. Johnson, 123 Main St

It’s a good hunch that these records represent the same person

We can standardize components based on nicknames, abbreviations, etc.

Data Standardization 2

Examples:– Robert, Rob, Bob, Robby, Bobby– Elizabeth, Elisabeth, Liz, Lizzie, Beth– International, Intl, Int’l, Intrntnl

Make use of a standard form, even if it is not necessarily correct– In other words, “change” all Roberts, Robs, Bobs,

Robbys, and Bobbys to Robert– Use standard form for linkage

Data Standardization 3

Again, this concept sounds familiar– Many to one mapping– Maintain the standardization mapping as metadata– Apply mapping to get standard form

Abbreviation Expansion

Rule/mapping oriented Translates common abbreviations to a

standard form Types:

– Shortenings (INC for INCORPORATED)– Compression (INTL for INTERNATIONAL)– Acronyms (IBM for you know what)

Transformation Rules

Standardization is a process of transforming nonconforming forms to conforming forms

Use mappings/transformation rules Create a rule engine instance and integrate the

rules Engine becomes a filter

Transformation Engine

Application of context-sensitive consistency and derivation rules transforms a data instance into appropriate form

In this case, derivation rules act on non-standard values

Also referred to as “edits” Rule base grows as violations are noted

Transformation Engine 2

1. Determine validity expectations (=what should the data instance look like if it were in standard form)

2. Create a validity filter with invalid records forwarded to a domain expert

3. For each violation, (or set of violations) the domain expert determines if a general rule can be applied to transform the bad record into its standard form

4. Merge transformation rules into validity filter

Example:Address Standardization

United States Postal Service (USPS) has done a very good job of presenting their addressing standard

Their goal: increase readability of mail to increase deliverability

Benefits are given to postal customers when data is in correct form

USPS Address Standard

Multiple address lines– Recipient line– Delivery Address line– Last line

Standard Address Block

Recipient Line

Person or entity to whom mail is to be delivered

First line of standard address block

Delivery Address Line

Contains location information Includes street address Broken down into:

– Primary address number– Predirectional and/or Postdirectional– Street name– Suffix (RD, ST, etc.)– Secondary address designator

Last Line

City State ZIP+4 code

Standard Abbreviations

USPS expects addresses to be represented in a reduced form, using standard abbreviations

This can be represented using a mapping See example (pub. 28)

Encoding of geographical data Actually, the ZIP code is an overloaded data

value It contains state information as well as delivery

location focus

Address Standardization

First: Is the address already in standard form?– This can be checked by making sure that the

address conforms to the address block layout– Some special cases need addressing (East West

Hwy)– Are real city names used, or vanity names?– Is correct ZIP+4 used?

Address Standardization 2

More…– Identify all addressing elements– Make sure placement is correct; if not, correct it– Is the street specified a valid street name? (USPS

provides database)– Is the address number valid within the street

address ranges?

Address Standardization 3

Next: Correct if necessary– Identify all address elements– Look up proper city name– Look up correct ZIP+4

If the right one cannot be used, use the ZIP+4 centroid

– Move elements to proper location in address block– Transform elements into standard abbreviated form– Generate bar code (if needed)

Business Data Elements

USPS standard is a nice source for business rules

Elements are broken down into element classes:

Business Elements

Secondary unit indicator Secondary number Company name PO box number City State ZIP/ZIP+4 Carrier Route code Operational Endorsement Key line code POSTNET barcode POSTNET address barcode

Name Prefix (Mr., Mrs.) First name Middle name or initial Surname Suffix title (e.g., JR, PHD) Professional title (PROJECT

MANAGER) Division/Department Mailstop code Street number Predirectional Street name Street suffix

Acronym for Coding Accuracy Support System– Provides a platform to measure the quality of

address matching and standardization software– Addresses are CASS certified if they pass USPS

provided tests (I.e., they are standardized)– Only mail that is CASS certified can qualify for

postage savings

~20% of population changes addresses each year

NCOA: National Change of Address

Other Standards

Telephone industry Financial industry (SWIFT, FIX) HTML SIC codes GIS standards

www.xml.org Incredible growth of defined format DTDs and

schemas for data interchange Review some of these and look for data quality

rules!

Next Week

Data cleansing Record linkage Similarity and distance

data quality class 6. this week homework questions data standardization

standard slide

standard form slide

data example

standard form example

data standards

attribute slide

data components

information slide

Documents

data standardization and classification

standardization in the data center transcript

standardization guidance for weather data climate

tuberculosis data standardization and harmonization a... ·...

public health data element standardization - a framework for...

open data platform for dpm standardization · technical...

analysis of data virtualization & enterprise data...

health data & informatics (hdi) data standardization toolset

topic report: open data and standardization

standardization of spend analysis data - lean ohio

faa data standardization handbook

data standardization planning for clinical … ·...

definition and standardization of data elements

data standardization-what why and how

data standardization interoperability

standardization and harmonization of data and data...

pharmacy data standardization (pds) local user overview

‘clinical data management (cdm) process standardization

oracle sql developer data modeler naming standardization ·...

wildlife data standardization and sharing: environmental