![Page 1: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/1.jpg)
Data Quality
Class 6
![Page 2: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/2.jpg)
This Week
Homework Questions Data Standardization
![Page 3: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/3.jpg)
Data Standardization
What is a standard? Benefits of Standardization Defining Data standards Testing for standard form Transforming into standard form
![Page 4: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/4.jpg)
What is a Standard?
a standard is something set up and established by authority, custom, or general consent as a model or example
a model to which all objects of the same class must conform.
![Page 5: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/5.jpg)
What is a Standard? 2
conforms to a predefined expected format which may be defined by:
– an organization with some official authority (e.g., government)– some recognized authoritative board (such as a standards
committee)– negotiated agreement (such as electronic data interchange
(EDI) agreements)– de facto convention (e.g., telephone number formats)
![Page 6: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/6.jpg)
Benefits of Standardization
conformity for comparison (as well as aggregation and analysis purposes)
an audit trail for data error accountability a streamlined means for the transfer and
sharing of information
![Page 7: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/7.jpg)
Defining Standards
Find representative body Identify a simple set of rules that completely
specify the valid structure and meaning of a correct data value
Present the standard to the committee (or even the community as a whole) for comments
Document and publish standard
![Page 8: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/8.jpg)
Testing for Standard Form
If there is a standard, there should be a way to test to see if data is in standard form
Example: US Telephone numbers– Defined by Industry Numbering Committee (INC)– NPA: Numbering Plan Area code– NXX: Central Office Code
Test for format conformance (e.g., 1-XXX-YYY-ZZZZ for telephone numbers)
Test for validity (e.g., is XXX a valid NPA, is YYY a valid NXX for the NPA XXX)
![Page 9: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/9.jpg)
Transforming into Standard Form
Given a good standard, it should be straightforward to transform data into that form
Must be able to recognize data components to be able to place them in proper locations
![Page 10: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/10.jpg)
Error Paradigms
How are errors introduced into data?– Attribute Granularity– Finger Flubs– Format Conformance– Semi-structured form– Transcription Errors– Transformation Flubs– Misfielded Data– Floating Data– Overloaded Attributes
![Page 11: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/11.jpg)
Attribute Granularity
Data granularity is not at the proper level– Example: “name” vs. last name, first name
Creates confusion when more than one entity can be represented in the same attribute
![Page 12: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/12.jpg)
Finger Flubs
This happens whem the incorrect letter is typed on the keybpard
Also, sometimes mnore than one letter is hit by mistake
Also, a leter might be missing
![Page 13: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/13.jpg)
Format Conformance
When the format is too restrictive, the user may not be able to properly enter the data– Example: First name, middle initial, last name
Some people go by their middle name
![Page 14: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/14.jpg)
Semi-structured form
There may be multiple “valid” formats that appear in free-form– Example: corporate structure laid out at web sites– Example:
(first name) (middle initial) (last name) or (last name), (first name)
![Page 15: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/15.jpg)
Transcription Errors
Data is collected through “fuzzy” media and is not properly transcribed– Mispronounced data– Incorrect spellings
![Page 16: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/16.jpg)
Transformation Flubs
Automated processing may introduce errors– We’ve already seen this example:
a database of names was found to have an inordinately large number of high-frequency word fragments, such as “INCORP,” “ATIONAL,” “COMPA.”
Text spanned multiple fields, which were not concatenated properly on extraction
![Page 17: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/17.jpg)
Misfielded Data
Data that is placed in the wrong field Example:street addresses
– Fields may not be big enough– Text spills over to next field
![Page 18: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/18.jpg)
Floating Data
Information that belongs in one field is contained in different fields in different records in the database
See examples in housing authority database
![Page 19: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/19.jpg)
Overloaded Attributes
More than one entity shows up in data Example:
– John and Mary Smith, TTES, Smith Foundation
![Page 20: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/20.jpg)
Record Parsing
Tokenizing data elements within an attribute Assign meaning to tokens
– Domain membership– Patterns– Context
![Page 21: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/21.jpg)
Record Parsing 2
In order to do this, we need:– The names and types of the data components expected to be
found in the field– The set of valid values for each data component type– The acceptable forms that the data may take– A means for tagging records that have unidentified data
components
We can do this with domains, mappings, and rules!
![Page 22: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/22.jpg)
Data Correction
If we can automatically recognize data as not conforming to a standard, can we automate its correction?
If we have translation rules or mappings from incorrect values to correct values
This is how many data cleansing applications work– example: InternatinalInternational
![Page 23: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/23.jpg)
Data Correction 2
Correction by consolidation Makes use of record linkage
– Find a pivot attribute across which to link– The pivot should be unique (such as social security
number)– Link records together and consolidate “correct”
name based on other factors, such as data source, timestamp, etc.
![Page 24: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/24.jpg)
Data Standardization
Use standard form as a pivot for linkage and consolidation
Example– Elizabeth R. Johnson, 123 Main St – Beth R. Johnson, 123 Main St
It’s a good hunch that these records represent the same person
We can standardize components based on nicknames, abbreviations, etc.
![Page 25: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/25.jpg)
Data Standardization 2
Examples:– Robert, Rob, Bob, Robby, Bobby– Elizabeth, Elisabeth, Liz, Lizzie, Beth– International, Intl, Int’l, Intrntnl
Make use of a standard form, even if it is not necessarily correct– In other words, “change” all Roberts, Robs, Bobs,
Robbys, and Bobbys to Robert– Use standard form for linkage
![Page 26: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/26.jpg)
Data Standardization 3
Again, this concept sounds familiar– Many to one mapping– Maintain the standardization mapping as metadata– Apply mapping to get standard form
![Page 27: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/27.jpg)
Abbreviation Expansion
Rule/mapping oriented Translates common abbreviations to a
standard form Types:
– Shortenings (INC for INCORPORATED)– Compression (INTL for INTERNATIONAL)– Acronyms (IBM for you know what)
![Page 28: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/28.jpg)
Transformation Rules
Standardization is a process of transforming nonconforming forms to conforming forms
Use mappings/transformation rules Create a rule engine instance and integrate the
rules Engine becomes a filter
![Page 29: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/29.jpg)
Transformation Engine
Application of context-sensitive consistency and derivation rules transforms a data instance into appropriate form
In this case, derivation rules act on non-standard values
Also referred to as “edits” Rule base grows as violations are noted
![Page 30: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/30.jpg)
Transformation Engine 2
1. Determine validity expectations (=what should the data instance look like if it were in standard form)
2. Create a validity filter with invalid records forwarded to a domain expert
3. For each violation, (or set of violations) the domain expert determines if a general rule can be applied to transform the bad record into its standard form
4. Merge transformation rules into validity filter
![Page 31: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/31.jpg)
Example:Address Standardization
United States Postal Service (USPS) has done a very good job of presenting their addressing standard
Their goal: increase readability of mail to increase deliverability
Benefits are given to postal customers when data is in correct form
![Page 32: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/32.jpg)
USPS Address Standard
Multiple address lines– Recipient line– Delivery Address line– Last line
Standard Address Block
![Page 33: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/33.jpg)
Recipient Line
Person or entity to whom mail is to be delivered
First line of standard address block
![Page 34: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/34.jpg)
Delivery Address Line
Contains location information Includes street address Broken down into:
– Primary address number– Predirectional and/or Postdirectional– Street name– Suffix (RD, ST, etc.)– Secondary address designator
![Page 35: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/35.jpg)
Last Line
City State ZIP+4 code
![Page 36: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/36.jpg)
Standard Abbreviations
USPS expects addresses to be represented in a reduced form, using standard abbreviations
This can be represented using a mapping See example (pub. 28)
![Page 37: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/37.jpg)
ZIP+4
Encoding of geographical data Actually, the ZIP code is an overloaded data
value It contains state information as well as delivery
location focus
![Page 38: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/38.jpg)
Address Standardization
First: Is the address already in standard form?– This can be checked by making sure that the
address conforms to the address block layout– Some special cases need addressing (East West
Hwy)– Are real city names used, or vanity names?– Is correct ZIP+4 used?
![Page 39: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/39.jpg)
Address Standardization 2
More…– Identify all addressing elements– Make sure placement is correct; if not, correct it– Is the street specified a valid street name? (USPS
provides database)– Is the address number valid within the street
address ranges?
![Page 40: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/40.jpg)
Address Standardization 3
Next: Correct if necessary– Identify all address elements– Look up proper city name– Look up correct ZIP+4
If the right one cannot be used, use the ZIP+4 centroid
– Move elements to proper location in address block– Transform elements into standard abbreviated form– Generate bar code (if needed)
![Page 41: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/41.jpg)
Business Data Elements
USPS standard is a nice source for business rules
Elements are broken down into element classes:
![Page 42: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/42.jpg)
Business Elements
Secondary unit indicator Secondary number Company name PO box number City State ZIP/ZIP+4 Carrier Route code Operational Endorsement Key line code POSTNET barcode POSTNET address barcode
Name Prefix (Mr., Mrs.) First name Middle name or initial Surname Suffix title (e.g., JR, PHD) Professional title (PROJECT
MANAGER) Division/Department Mailstop code Street number Predirectional Street name Street suffix
![Page 43: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/43.jpg)
CASS
Acronym for Coding Accuracy Support System– Provides a platform to measure the quality of
address matching and standardization software– Addresses are CASS certified if they pass USPS
provided tests (I.e., they are standardized)– Only mail that is CASS certified can qualify for
postage savings
![Page 44: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/44.jpg)
NCOA
~20% of population changes addresses each year
NCOA: National Change of Address
![Page 45: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/45.jpg)
Other Standards
Telephone industry Financial industry (SWIFT, FIX) HTML SIC codes GIS standards
![Page 46: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/46.jpg)
XML
www.xml.org Incredible growth of defined format DTDs and
schemas for data interchange Review some of these and look for data quality
rules!
![Page 47: Data Quality Class 6. This Week Homework Questions Data Standardization](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d3f5503460f94a184d8/html5/thumbnails/47.jpg)
Next Week
Data cleansing Record linkage Similarity and distance