data quality testing generic (
DESCRIPTION
http://www.geektester.blogspot.com/TRANSCRIPT
Microsoft Confidential 12/27/06 1
“Virtually everything in business today is an undifferentiated commodity, except how a company manages its information. How you manage information determines whether you win or lose.” Bill Gates
-Narendra Parihar- Bhoomika Goyal
- Raj Kamal (rajkamal)
Data Quality Testing
Microsoft Confidential 12/27/06 2
Data Quality Overview
Testing :: DQ Categories / Checks
Testing :: DQ Case Study
DQ Test Management
DQ Benefits & Challenges
Q & A
Data Quality Overview
Testing :: DQ Categories / Checks
Testing :: DQ Case Study
DQ Test Management
DQ Benefits & Challenges
Q & A
Agenda
2
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 3
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J.M. Juran).
The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.
DQ Impact: Organizations with poor data quality spend time working with conflicting reports and flawed business plans, resulting in erroneous decisions that are made with outdated, inconsistent, and invalid data
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J.M. Juran).
The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.
DQ Impact: Organizations with poor data quality spend time working with conflicting reports and flawed business plans, resulting in erroneous decisions that are made with outdated, inconsistent, and invalid data
Overview: DQ Definition
3
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 4
“End users spend as much as 40-50% of a typical IT budget reworking data in one application to make it work with another”. The high cost of low data quality.
The Data Warehouse Institute estimates that bad customer data costs American companies upwards of $600billion dollars per year By Wayne W. Eckerson
POOR Data Quality can kill your POOR Data Quality can kill your business!!!!business!!!!
“End users spend as much as 40-50% of a typical IT budget reworking data in one application to make it work with another”. The high cost of low data quality.
The Data Warehouse Institute estimates that bad customer data costs American companies upwards of $600billion dollars per year By Wayne W. Eckerson
POOR Data Quality can kill your POOR Data Quality can kill your business!!!!business!!!!
Overview: DQ Stats
4
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 5
Testing :: DQ CheatSheet
5
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 6
Count of records at Source and Target should be same at a given point of time.
Count of records at Source and Target should be same at a given point of time.
Rule #1: Row Counts
6
Overview DQ Testing Case Study CloseDQ
Management
Missing Records
ExtraRecords
Microsoft Confidential 12/27/06 7
# Example 1
7
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName DeptStartDate
1 HR 22-Aug-2007
2 Finance 12-June-1988
4 Admin 1-May-1999
5 IT 2-June-1997
Source_Dept
DeptID DeptName DeptStartDate
1 Human Resource 22-Aug-2007
2 Finance 12-June-1978
3 Operations 11-May-1752
Target_Dept
Microsoft Confidential 12/27/06 8
Missing Records: Records which are only present at Source
Extra Records: Records which are only present at Target
Missing Records: Records which are only present at Source
Extra Records: Records which are only present at Target
Rule #1: Row Counts
8
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName DeptStartDate
4 Admin 1-May-1999
5 IT 2-June-1997
DeptID DeptName DeptStartDate
3 Operations 11-May-1752
Microsoft Confidential 12/27/06 9
All the data under consideration at the Source and Target should be same at a given point of time satisfying the business rules.
All the data under consideration at the Source and Target should be same at a given point of time satisfying the business rules.
Rule #2: Completeness
9
Overview DQ Testing Case Study CloseDQ
Management
SourceTable
TargetTable
Microsoft Confidential 12/27/06 10
Missing Records: Records which are only present at Source
Extra Records: Records which are only present at Target
Mismatched Records: Which contain at least one different value for the same record between Source and Target
Missing Records: Records which are only present at Source
Extra Records: Records which are only present at Target
Mismatched Records: Which contain at least one different value for the same record between Source and Target
Rule #2: Completeness
10
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName DeptStartDate
4 Admin 1-May-1999
5 IT 2-June-1997
DeptID DeptName DeptStartDate
3 Operations 11-May-1752
DeptID DeptName DeptStartDate DifferenceType
2 Finance 12-June-1988 At Source
2 Finance 12-June-1978 At Target
Microsoft Confidential 12/27/06 11
This ensures that each user observes a consistent view of the data, including changes made by transactions
There is data inconsistency between the Source & Target if the same data is stored in different formats or contain different values at different places.
This ensures that each user observes a consistent view of the data, including changes made by transactions
There is data inconsistency between the Source & Target if the same data is stored in different formats or contain different values at different places.
Rule #3: Consistency
11
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 12
# Example 2
12
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName Revenue ($)
DeptStartDate
1 HR 100 22-Aug-2007
2 Finance 200 12-June-1988
Source_Dept
DeptID DeptName Revenue (Euro)
DeptStartDate
1 HR 70 22/08/2007
2 Finance 140 12/06/1978
Warehouse_Dept
Data Mart_Dept
DeptID DeptName Revenue (Euro)
DeptStartDate
1 Human Resource
70 22/08/2007
2 Finance 999999 12/06/1978
Microsoft Confidential 12/27/06 13
Example #1: Zip code / Date / Currency formatsa)
b)
Example #1: Zip code / Date / Currency formatsa)
b)
Rule #3: Consistency
13
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point
1 HR 100 22-Aug-2007 Same data, Inconsistent due to Revenue & Currency format
1 HR 70 22/08/2007 Same data, Inconsistent due to Revenue & Currency format
DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point
1 HR 100 22-Aug-2007 Same data, Inconsistent due to different format of Department name
1 Human Resource
70 22/08/2007 Same data, Inconsistent due to different format for department name
Microsoft Confidential 12/27/06 14
Example #2: Regional Setting e.g. Language
Example #3: Different values at different points
Example #2: Regional Setting e.g. Language
Example #3: Different values at different points
Rule #3: Consistency
14
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point
1 Human Resource
100 22/08/2007 Same data, Inconsistent due to different language used
1 人的資源 100 22/08/2007 Same data, Inconsistent due to different language used
DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point
2 Finance 140 12/06/1978 Same data, Inconsistent value for Revenue between Warehouse & Mart
2 Finance 999999 12/06/1978 Same data, Inconsistent value for Revenue between Warehouse & Mart
Microsoft Confidential 12/27/06 15
Validity is the correctness and reasonableness of data.
A valid measure must be reliable, but a reliable measure need not be valid.
Questions:
-> Is Information Reliable?
-> How is Information measured ?
Validity is the correctness and reasonableness of data.
A valid measure must be reliable, but a reliable measure need not be valid.
Questions:
-> Is Information Reliable?
-> How is Information measured ?
Rule #4: Validity
15
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 16
Example #1: Measuring “Unemployment” in a country
-> Statistics are collected reliably month-on-month
-> Definition of collecting “Unemployment” remains same.
e.g. Definition of “unemployment” has changed in past 25 years hence we can’t compare old data with current data as comparison is not valid
Example #2: Values falling outside a range
Example #1: Measuring “Unemployment” in a country
-> Statistics are collected reliably month-on-month
-> Definition of collecting “Unemployment” remains same.
e.g. Definition of “unemployment” has changed in past 25 years hence we can’t compare old data with current data as comparison is not valid
Example #2: Values falling outside a range
Rule #4: Validity
16
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName Revenue (Euro)
DeptStartDate
1 Human Resource
70 22/08/2255
2 Finance 999999 12/06/1752
Microsoft Confidential 12/27/06 17
Example #3: Dates having valid MM, DD, YYYY
Example #4: Birth date > Death Date
Example #3: Dates having valid MM, DD, YYYY
Example #4: Birth date > Death Date
Rule #4: Validity
17
Overview DQ Testing Case Study CloseDQ
Management
DeptID DeptName Revenue (Euro)
DeptStartDate
1 Human Resource
70 13/13/2007
EmpId EmpName DOB DOE
1 Jack 13/01/2008 24/11/1996
Microsoft Confidential 12/27/06 18
Physical Duplicates: All the columns values repeating for at least 2 records in a table
Logical Duplicates: Business Key (list of column) values are repeating for at least 2 records in a table
Physical Duplicates: All the columns values repeating for at least 2 records in a table
Logical Duplicates: Business Key (list of column) values are repeating for at least 2 records in a table
Rule #5: Redundancy
18
Overview DQ Testing Case Study CloseDQ
Management
Logical Dups
Physical Dups
Microsoft Confidential 12/27/06 19
# Example 3
19
Overview DQ Testing Case Study CloseDQ
Management
EmpID EmpName EmpAddress Age DeptID
1 Jim #22, Jackson St., NY 23 1
2 Sam A302, Woodsvilla, WA 28 2
4 Samuel No. AA, Andrew Street, Redmond, WA 22 999
5 Jim #22, Jackson St., NY 23 1
2 Sam A302, Woodsvilla, WA 28 2
7 Jack #23, Jackson St., NY 41 NULL
Employee
Example #1: Physical Duplicates
Example #2: Logical Duplicates
EmpID EmpName EmpAddress Age DeptID
2 Sam A302, Woodsvilla, WA 28 2
2 Sam A302, Woodsvilla, WA 28 2
EmpID EmpName EmpAddress Age DeptID
1 Jim #22, Jackson St., NY 23 1
5 Jim #22, Jackson St., NY 23 1
Microsoft Confidential 12/27/06 20
If there are child records for which no corresponding parent records existing then they are called “Orphan Records”
Logical relationship rules between parent & child tables should be defined by business.
Rule #6: RI
20
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 21
# Example 4
21
Overview DQ Testing Case Study CloseDQ
Management
EmpID EmpName EmpAddress Age DeptID (FK)
1 Jim #22, Jackson St., NY 23 1
2 Sam A302, Woodsvilla, WA 28 2
4 Samuel No. AA, Andrew Street, Redmond, WA 22 999
5 Jim #22, Jackson St., NY 23 1
7 Jack #23, Jackson St., NY 41 NULL
Child Table:: Employee
DeptID (PK) DeptName DeptStartDate
1 HR 22-Aug-2007
2 Finance 12-June-1988
3 Operations 11-May-1752
Parent Table:: Department
EmpID EmpName EmpAddress Age DeptID
4 Samuel No. AA, Andrew Street, Redmond, WA 22 999
7 Jack #23, Jackson St., NY 41 NULL
Orphan Records
Microsoft Confidential 12/27/06 22
Possible values that can be allowed in a data element.
Possible values that can be allowed in a data element.
Rule #7: Domain Integrity
22
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 23
Example #1: Invalid Lookup Table Values (Valid:: HR, Finance, Operations)
Example #2: Truncation::Data Types, Data Length etc
Example #1: Invalid Lookup Table Values (Valid:: HR, Finance, Operations)
Example #2: Truncation::Data Types, Data Length etc
Rule #7: Domain Integrity
23
Overview DQ Testing Case Study CloseDQ
Management
DeptID (PK) DeptName
1 HR
2 Finance
3 Operations
4 Invalid Dept
DeptID (PK) DeptName (Varchar(50))
1 HR
2 Finance
3 Operations
DeptID (PK) DeptName (Varchar (2))
1 HR
2 Fi
3 Op
Source Table Target Table
Microsoft Confidential 12/27/06 24
Example #3: Constraints: NOT NULL, CHECK, PK, UK etc
Example #3: Constraints: NOT NULL, CHECK, PK, UK etc
Rule #7: Domain Integrity
24
Overview DQ Testing Case Study CloseDQ
Management
DeptID (PK) DeptName (NOT NULL)
1 HR
2 Finance
3 Operations
4 Invalid Dept
DeptID (PK) DeptName (NOT NULL)
1 HR
2 Finance
3 NULL
4 NULL
Source Table Target Table
Microsoft Confidential 12/27/06 25
Degree to which data reflects Real World objects
Accuracy is generally measured by comparing against something defined as “true” source of information
Degree to which data reflects Real World objects
Accuracy is generally measured by comparing against something defined as “true” source of information
Rule #8: Accuracy
25
Overview DQ Testing Case Study CloseDQ
Management
Accuracy
Microsoft Confidential 12/27/06 26
Describes the relevance and the meaning of data
Example #:
Denotes the ease with which data can be used
Describes the relevance and the meaning of data
Example #:
Denotes the ease with which data can be used
Rule #9: Usability
26
Overview DQ Testing Case Study CloseDQ
Management
DeptID (PK) DeptName
1 HR
2 Fin
3 Ops
DeptID (PK) DeptName
1 Human Resources
2 Finance
3 Operations
Represented As
Mart Table ReportingTable
Microsoft Confidential 12/27/06 27
Defines if data required is available when required as per SLA
Example #1: Data Freshness
If everyday data is pulled 24 times and target doesn’t get even for one cycle, “data freshness” get impacted and users see old data which can impact business decisions.
For decision making & mission critical system, timely availability of information is must.
Defines if data required is available when required as per SLA
Example #1: Data Freshness
If everyday data is pulled 24 times and target doesn’t get even for one cycle, “data freshness” get impacted and users see old data which can impact business decisions.
For decision making & mission critical system, timely availability of information is must.
Rule #10: Timeliness
27
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 28
ADQC (Automated Data Quality Check) v2.0
ADQC (Automated Data Quality Check) v2.0
Testing :: DQ Case Study
28
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 29
DQ Test Management:DQ Test Management:
DQ Test Management
29
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 30
DQ Test Management: Planning
1. Ensure DQ Requirement are covered in following documents:
BRDFSDTest Plan
2. Ensure DQ Requirements are clarified by Business / PDMs
DQ Test Management: Planning
1. Ensure DQ Requirement are covered in following documents:
BRDFSDTest Plan
2. Ensure DQ Requirements are clarified by Business / PDMs
DQTM: Test Planning
30
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 31
DQ Test Management: Test Case Design
1. Ensure DQ Requirement are covered in Test Scenarios and Test Cases
2. Ensure DQ Test cases are automated.
DQ Test Management: Test Case Design
1. Ensure DQ Requirement are covered in Test Scenarios and Test Cases
2. Ensure DQ Test cases are automated.
DQTM: Test Design
31
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 32
DQ Test Management: Test Execution
1. Ensure Test Cases related to DQ Requirements are executed in Test cycles
2. Ensure DQ Test results & DQ Bugs are
shared with the Business / PDM in the triage meeting to understand the correct priority based on the impact.
DQ Test Management: Test Execution
1. Ensure Test Cases related to DQ Requirements are executed in Test cycles
2. Ensure DQ Test results & DQ Bugs are
shared with the Business / PDM in the triage meeting to understand the correct priority based on the impact.
DQTM: Test Execution
32
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 33
DQ Test Management: Test Monitoring
1. Regularly collect DQ Metrics to depict the trend
2. If DQ Issues Trend is upward, immediate
action need to be taken
DQ Test Management: Test Monitoring
1. Regularly collect DQ Metrics to depict the trend
2. If DQ Issues Trend is upward, immediate
action need to be taken
DQTM: Test Monitoring
33
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 34
DQ Challenges
34
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 35
DQ Best Practices Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 36
DQ Jargons
DATA GOVERNANCE
• Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise
• Data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures
DATA STEWARDS
• Data Stewards are those individuals ultimately responsible for the definition, management, control, integrity or maintenance of Enterprise data.
DATA INTEGRITY
• Data integrity is the assurance that data is correct and consistent--that the data correctly reflects the "real" world.
DATA GOVERNANCE
• Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise
• Data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures
DATA STEWARDS
• Data Stewards are those individuals ultimately responsible for the definition, management, control, integrity or maintenance of Enterprise data.
DATA INTEGRITY
• Data integrity is the assurance that data is correct and consistent--that the data correctly reflects the "real" world.
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 37
References
• www.infoimpact.com• http://www.idma.org/valuePropositionGeneral.pdf• http://www.intelligententerprise.com/showArticle.jhtml?
articleID=17701630• http://www.sociology.org.uk/p1mc5n1a.htm• http://blogs.sun.com/emmyp/entry/ensuring_the_validity_of_your• http://www.dmreview.com/dmdirect/20021108/6019-1.html
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 38
Questions & Answers
38
Overview DQ Testing Case Study CloseDQ
Management
Microsoft Confidential 12/27/06 39
Thank you
Overview DQ Testing Case Study CloseDQ
Management