data quality testing generic (

39
Microsoft Confidential 12/27/06 1 “Virtually everything in business today is an undifferentiated commodity, except how a company manages its information. How you manage information determines whether you win or lose.” Bill Gates [email protected] -Narendra Parihar - Bhoomika Goyal - Raj Kamal (rajkamal) Data Quality Testing

Upload: rajkamal13

Post on 15-Nov-2014

9.136 views

Category:

Technology


4 download

DESCRIPTION

http://www.geektester.blogspot.com/

TRANSCRIPT

Page 1: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 1

“Virtually everything in business today is an undifferentiated commodity, except how a company manages its information. How you manage information determines whether you win or lose.” Bill Gates

[email protected]

-Narendra Parihar- Bhoomika Goyal

- Raj Kamal (rajkamal)

Data Quality Testing

Page 2: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 2

Data Quality Overview

Testing :: DQ Categories / Checks

Testing :: DQ Case Study

DQ Test Management

DQ Benefits & Challenges

Q & A

Data Quality Overview

Testing :: DQ Categories / Checks

Testing :: DQ Case Study

DQ Test Management

DQ Benefits & Challenges

Q & A

Agenda

2

Overview DQ Testing Case Study CloseDQ

Management

Page 3: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 3

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J.M. Juran).

The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.

DQ Impact: Organizations with poor data quality spend time working with conflicting reports and flawed business plans, resulting in erroneous decisions that are made with outdated, inconsistent, and invalid data

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J.M. Juran).

The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.

DQ Impact: Organizations with poor data quality spend time working with conflicting reports and flawed business plans, resulting in erroneous decisions that are made with outdated, inconsistent, and invalid data

Overview: DQ Definition

3

Overview DQ Testing Case Study CloseDQ

Management

Page 4: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 4

“End users spend as much as 40-50% of a typical IT budget reworking data in one application to make it work with another”. The high cost of low data quality.

The Data Warehouse Institute estimates that bad customer data costs American companies upwards of $600billion dollars per year By Wayne W. Eckerson

POOR Data Quality can kill your POOR Data Quality can kill your business!!!!business!!!!

“End users spend as much as 40-50% of a typical IT budget reworking data in one application to make it work with another”. The high cost of low data quality.

The Data Warehouse Institute estimates that bad customer data costs American companies upwards of $600billion dollars per year By Wayne W. Eckerson

POOR Data Quality can kill your POOR Data Quality can kill your business!!!!business!!!!

Overview: DQ Stats

4

Overview DQ Testing Case Study CloseDQ

Management

Page 5: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 5

Testing :: DQ CheatSheet

5

Overview DQ Testing Case Study CloseDQ

Management

Page 6: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 6

Count of records at Source and Target should be same at a given point of time.

Count of records at Source and Target should be same at a given point of time.

Rule #1: Row Counts

6

Overview DQ Testing Case Study CloseDQ

Management

Missing Records

ExtraRecords

Page 7: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 7

# Example 1

7

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName DeptStartDate

1 HR 22-Aug-2007

2 Finance 12-June-1988

4 Admin 1-May-1999

5 IT 2-June-1997

Source_Dept

DeptID DeptName DeptStartDate

1 Human Resource 22-Aug-2007

2 Finance 12-June-1978

3 Operations 11-May-1752

Target_Dept

Page 8: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 8

Missing Records: Records which are only present at Source

Extra Records: Records which are only present at Target

Missing Records: Records which are only present at Source

Extra Records: Records which are only present at Target

Rule #1: Row Counts

8

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName DeptStartDate

4 Admin 1-May-1999

5 IT 2-June-1997

DeptID DeptName DeptStartDate

3 Operations 11-May-1752

Page 9: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 9

All the data under consideration at the Source and Target should be same at a given point of time satisfying the business rules.

All the data under consideration at the Source and Target should be same at a given point of time satisfying the business rules.

Rule #2: Completeness

9

Overview DQ Testing Case Study CloseDQ

Management

SourceTable

TargetTable

Page 10: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 10

Missing Records: Records which are only present at Source

Extra Records: Records which are only present at Target

Mismatched Records: Which contain at least one different value for the same record between Source and Target

Missing Records: Records which are only present at Source

Extra Records: Records which are only present at Target

Mismatched Records: Which contain at least one different value for the same record between Source and Target

Rule #2: Completeness

10

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName DeptStartDate

4 Admin 1-May-1999

5 IT 2-June-1997

DeptID DeptName DeptStartDate

3 Operations 11-May-1752

DeptID DeptName DeptStartDate DifferenceType

2 Finance 12-June-1988 At Source

2 Finance 12-June-1978 At Target

Page 11: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 11

This ensures that each user observes a consistent view of the data, including changes made by transactions

There is data inconsistency between the Source & Target if the same data is stored in different formats or contain different values at different places.

This ensures that each user observes a consistent view of the data, including changes made by transactions

There is data inconsistency between the Source & Target if the same data is stored in different formats or contain different values at different places.

Rule #3: Consistency

11

Overview DQ Testing Case Study CloseDQ

Management

Page 12: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 12

# Example 2

12

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName Revenue ($)

DeptStartDate

1 HR 100 22-Aug-2007

2 Finance 200 12-June-1988

Source_Dept

DeptID DeptName Revenue (Euro)

DeptStartDate

1 HR 70 22/08/2007

2 Finance 140 12/06/1978

Warehouse_Dept

Data Mart_Dept

DeptID DeptName Revenue (Euro)

DeptStartDate

1 Human Resource

70 22/08/2007

2 Finance 999999 12/06/1978

Page 13: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 13

Example #1: Zip code / Date / Currency formatsa)

b)

Example #1: Zip code / Date / Currency formatsa)

b)

Rule #3: Consistency

13

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point

1 HR 100 22-Aug-2007 Same data, Inconsistent due to Revenue & Currency format

1 HR 70 22/08/2007 Same data, Inconsistent due to Revenue & Currency format

DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point

1 HR 100 22-Aug-2007 Same data, Inconsistent due to different format of Department name

1 Human Resource

70 22/08/2007 Same data, Inconsistent due to different format for department name

Page 14: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 14

Example #2: Regional Setting e.g. Language

Example #3: Different values at different points

Example #2: Regional Setting e.g. Language

Example #3: Different values at different points

Rule #3: Consistency

14

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point

1 Human Resource

100 22/08/2007 Same data, Inconsistent due to different language used

1 人的資源 100 22/08/2007 Same data, Inconsistent due to different language used

DeptID DeptName Revenue ($ or Euro ) DeptStartDate Difference Point

2 Finance 140 12/06/1978 Same data, Inconsistent value for Revenue between Warehouse & Mart

2 Finance 999999 12/06/1978 Same data, Inconsistent value for Revenue between Warehouse & Mart

Page 15: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 15

Validity is the correctness and reasonableness of data.

A valid measure must be reliable, but a reliable measure need not be valid.

Questions:

-> Is Information Reliable?

-> How is Information measured ?

Validity is the correctness and reasonableness of data.

A valid measure must be reliable, but a reliable measure need not be valid.

Questions:

-> Is Information Reliable?

-> How is Information measured ?

Rule #4: Validity

15

Overview DQ Testing Case Study CloseDQ

Management

Page 16: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 16

Example #1: Measuring “Unemployment” in a country

-> Statistics are collected reliably month-on-month

-> Definition of collecting “Unemployment” remains same.

e.g. Definition of “unemployment” has changed in past 25 years hence we can’t compare old data with current data as comparison is not valid

Example #2: Values falling outside a range

Example #1: Measuring “Unemployment” in a country

-> Statistics are collected reliably month-on-month

-> Definition of collecting “Unemployment” remains same.

e.g. Definition of “unemployment” has changed in past 25 years hence we can’t compare old data with current data as comparison is not valid

Example #2: Values falling outside a range

Rule #4: Validity

16

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName Revenue (Euro)

DeptStartDate

1 Human Resource

70 22/08/2255

2 Finance 999999 12/06/1752

Page 17: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 17

Example #3: Dates having valid MM, DD, YYYY

Example #4: Birth date > Death Date

Example #3: Dates having valid MM, DD, YYYY

Example #4: Birth date > Death Date

Rule #4: Validity

17

Overview DQ Testing Case Study CloseDQ

Management

DeptID DeptName Revenue (Euro)

DeptStartDate

1 Human Resource

70 13/13/2007

EmpId EmpName DOB DOE

1 Jack 13/01/2008 24/11/1996

Page 18: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 18

Physical Duplicates: All the columns values repeating for at least 2 records in a table

Logical Duplicates: Business Key (list of column) values are repeating for at least 2 records in a table

Physical Duplicates: All the columns values repeating for at least 2 records in a table

Logical Duplicates: Business Key (list of column) values are repeating for at least 2 records in a table

Rule #5: Redundancy

18

Overview DQ Testing Case Study CloseDQ

Management

Logical Dups

Physical Dups

Page 19: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 19

# Example 3

19

Overview DQ Testing Case Study CloseDQ

Management

EmpID EmpName EmpAddress Age DeptID

1 Jim #22, Jackson St., NY 23 1

2 Sam A302, Woodsvilla, WA 28 2

4 Samuel No. AA, Andrew Street, Redmond, WA 22 999

5 Jim #22, Jackson St., NY 23 1

2 Sam A302, Woodsvilla, WA 28 2

7 Jack #23, Jackson St., NY 41 NULL

Employee

Example #1: Physical Duplicates

Example #2: Logical Duplicates

EmpID EmpName EmpAddress Age DeptID

2 Sam A302, Woodsvilla, WA 28 2

2 Sam A302, Woodsvilla, WA 28 2

EmpID EmpName EmpAddress Age DeptID

1 Jim #22, Jackson St., NY 23 1

5 Jim #22, Jackson St., NY 23 1

Page 20: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 20

If there are child records for which no corresponding parent records existing then they are called “Orphan Records”

Logical relationship rules between parent & child tables should be defined by business.

Rule #6: RI

20

Overview DQ Testing Case Study CloseDQ

Management

Page 21: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 21

# Example 4

21

Overview DQ Testing Case Study CloseDQ

Management

EmpID EmpName EmpAddress Age DeptID (FK)

1 Jim #22, Jackson St., NY 23 1

2 Sam A302, Woodsvilla, WA 28 2

4 Samuel No. AA, Andrew Street, Redmond, WA 22 999

5 Jim #22, Jackson St., NY 23 1

7 Jack #23, Jackson St., NY 41 NULL

Child Table:: Employee

DeptID (PK) DeptName DeptStartDate

1 HR 22-Aug-2007

2 Finance 12-June-1988

3 Operations 11-May-1752

Parent Table:: Department

EmpID EmpName EmpAddress Age DeptID

4 Samuel No. AA, Andrew Street, Redmond, WA 22 999

7 Jack #23, Jackson St., NY 41 NULL

Orphan Records

Page 22: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 22

Possible values that can be allowed in a data element.

Possible values that can be allowed in a data element.

Rule #7: Domain Integrity

22

Overview DQ Testing Case Study CloseDQ

Management

Page 23: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 23

Example #1: Invalid Lookup Table Values (Valid:: HR, Finance, Operations)

Example #2: Truncation::Data Types, Data Length etc

Example #1: Invalid Lookup Table Values (Valid:: HR, Finance, Operations)

Example #2: Truncation::Data Types, Data Length etc

Rule #7: Domain Integrity

23

Overview DQ Testing Case Study CloseDQ

Management

DeptID (PK) DeptName

1 HR

2 Finance

3 Operations

4 Invalid Dept

DeptID (PK) DeptName (Varchar(50))

1 HR

2 Finance

3 Operations

DeptID (PK) DeptName (Varchar (2))

1 HR

2 Fi

3 Op

Source Table Target Table

Page 24: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 24

Example #3: Constraints: NOT NULL, CHECK, PK, UK etc

Example #3: Constraints: NOT NULL, CHECK, PK, UK etc

Rule #7: Domain Integrity

24

Overview DQ Testing Case Study CloseDQ

Management

DeptID (PK) DeptName (NOT NULL)

1 HR

2 Finance

3 Operations

4 Invalid Dept

DeptID (PK) DeptName (NOT NULL)

1 HR

2 Finance

3 NULL

4 NULL

Source Table Target Table

Page 25: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 25

Degree to which data reflects Real World objects

Accuracy is generally measured by comparing against something defined as “true” source of information

Degree to which data reflects Real World objects

Accuracy is generally measured by comparing against something defined as “true” source of information

Rule #8: Accuracy

25

Overview DQ Testing Case Study CloseDQ

Management

Accuracy

Page 26: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 26

Describes the relevance and the meaning of data

Example #:

Denotes the ease with which data can be used

Describes the relevance and the meaning of data

Example #:

Denotes the ease with which data can be used

Rule #9: Usability

26

Overview DQ Testing Case Study CloseDQ

Management

DeptID (PK) DeptName

1 HR

2 Fin

3 Ops

DeptID (PK) DeptName

1 Human Resources

2 Finance

3 Operations

Represented As

Mart Table ReportingTable

Page 27: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 27

Defines if data required is available when required as per SLA

Example #1: Data Freshness

If everyday data is pulled 24 times and target doesn’t get even for one cycle, “data freshness” get impacted and users see old data which can impact business decisions.

For decision making & mission critical system, timely availability of information is must.

Defines if data required is available when required as per SLA

Example #1: Data Freshness

If everyday data is pulled 24 times and target doesn’t get even for one cycle, “data freshness” get impacted and users see old data which can impact business decisions.

For decision making & mission critical system, timely availability of information is must.

Rule #10: Timeliness

27

Overview DQ Testing Case Study CloseDQ

Management

Page 28: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 28

ADQC (Automated Data Quality Check) v2.0

ADQC (Automated Data Quality Check) v2.0

Testing :: DQ Case Study

28

Overview DQ Testing Case Study CloseDQ

Management

Page 29: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 29

DQ Test Management:DQ Test Management:

DQ Test Management

29

Overview DQ Testing Case Study CloseDQ

Management

Page 30: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 30

DQ Test Management: Planning

1. Ensure DQ Requirement are covered in following documents:

BRDFSDTest Plan

2. Ensure DQ Requirements are clarified by Business / PDMs

DQ Test Management: Planning

1. Ensure DQ Requirement are covered in following documents:

BRDFSDTest Plan

2. Ensure DQ Requirements are clarified by Business / PDMs

DQTM: Test Planning

30

Overview DQ Testing Case Study CloseDQ

Management

Page 31: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 31

DQ Test Management: Test Case Design

1. Ensure DQ Requirement are covered in Test Scenarios and Test Cases

2. Ensure DQ Test cases are automated.

DQ Test Management: Test Case Design

1. Ensure DQ Requirement are covered in Test Scenarios and Test Cases

2. Ensure DQ Test cases are automated.

DQTM: Test Design

31

Overview DQ Testing Case Study CloseDQ

Management

Page 32: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 32

DQ Test Management: Test Execution

1. Ensure Test Cases related to DQ Requirements are executed in Test cycles

2. Ensure DQ Test results & DQ Bugs are

shared with the Business / PDM in the triage meeting to understand the correct priority based on the impact.

DQ Test Management: Test Execution

1. Ensure Test Cases related to DQ Requirements are executed in Test cycles

2. Ensure DQ Test results & DQ Bugs are

shared with the Business / PDM in the triage meeting to understand the correct priority based on the impact.

DQTM: Test Execution

32

Overview DQ Testing Case Study CloseDQ

Management

Page 33: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 33

DQ Test Management: Test Monitoring

1. Regularly collect DQ Metrics to depict the trend

2. If DQ Issues Trend is upward, immediate

action need to be taken

DQ Test Management: Test Monitoring

1. Regularly collect DQ Metrics to depict the trend

2. If DQ Issues Trend is upward, immediate

action need to be taken

DQTM: Test Monitoring

33

Overview DQ Testing Case Study CloseDQ

Management

Page 34: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 34

DQ Challenges

34

Overview DQ Testing Case Study CloseDQ

Management

Page 35: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 35

DQ Best Practices Overview DQ Testing Case Study CloseDQ

Management

Page 36: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 36

DQ Jargons

DATA GOVERNANCE

• Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise

• Data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures

DATA STEWARDS

• Data Stewards are those individuals ultimately responsible for the definition, management, control, integrity or maintenance of Enterprise data.

DATA INTEGRITY

• Data integrity is the assurance that data is correct and consistent--that the data correctly reflects the "real" world.

DATA GOVERNANCE

• Data governance (DG) refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise

• Data governance program includes a governing body or council, a defined set of procedures, and a plan to execute those procedures

DATA STEWARDS

• Data Stewards are those individuals ultimately responsible for the definition, management, control, integrity or maintenance of Enterprise data.

DATA INTEGRITY

• Data integrity is the assurance that data is correct and consistent--that the data correctly reflects the "real" world.

Overview DQ Testing Case Study CloseDQ

Management

Page 37: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 37

References

• www.infoimpact.com• http://www.idma.org/valuePropositionGeneral.pdf• http://www.intelligententerprise.com/showArticle.jhtml?

articleID=17701630• http://www.sociology.org.uk/p1mc5n1a.htm• http://blogs.sun.com/emmyp/entry/ensuring_the_validity_of_your• http://www.dmreview.com/dmdirect/20021108/6019-1.html

Overview DQ Testing Case Study CloseDQ

Management

Page 38: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 38

Questions & Answers

38

Overview DQ Testing Case Study CloseDQ

Management

Page 39: Data Quality Testing Generic (

Microsoft Confidential 12/27/06 39

Thank you

Overview DQ Testing Case Study CloseDQ

Management