
TRANSCRIPT

Page 1:

Practical Considerations for Rapidly Improving Quality in Large Data Collections

How to get management to stop focusing on symptom fixing, and address the underlying data management problems responsible for all the IT wrongs in their organizations

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Peter [email protected] +1 804 382 5957

1

Page 2:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Peter Aiken
• Full time in information technology since 1981
• IT engineering research and project background
• University teaching experience since 1979
• Seven books and dozens of articles
• Research Areas
– reengineering, data reverse engineering, software requirements engineering, information engineering, human-computer interaction, systems integration/systems engineering, strategic planning, and DSS/BI
• Director
– George Mason University/Hypermedia Laboratory (1989-1993)
• DoD Computer Scientist
– Reverse Engineering Program Manager/Office of the Chief Information Officer (1992-1997)
• Visiting Scientist
– Software Engineering Institute/Carnegie Mellon University (2001-2002)
• Published Papers
– Communications of the ACM, IBM Systems Journal, InformationWEEK, Information & Management, Information Resources Management Journal, Hypermedia, Information Systems Management, Journal of Computer Information Systems, and IEEE Computer & Software
• DAMA International President (http://dama.org)
– 2001 DAMA International Individual Achievement Award (with Dr. E. F. "Ted" Codd)
– 2005 DAMA Community Award
• Founding Advisor/International Association for Information and Data Quality (http://iaidq.org)
• Founding Advisor/Meta-data Professionals Organization (http://metadataprofessional.org)
• Founding Director, Data Blueprint (1999)

2

Page 3:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Practical Considerations for Rapidly Improving Quality in Large Data Collections

3

3. "Solution" Considerations

1. Motivation (3 of them)

2. Root Cause

Analysis

Page 4:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data Inflation

Unit | Size | What it means
Bit (b) | 1 or 0 | Short for "binary digit", after the binary code (1 or 0) computers use to store and process data
Byte (B) | 8 bits | Enough information to create an English letter or number in computer code. It is the basic unit of computing
Kilobyte (KB) | 1,000, or 2^10, bytes | From "thousand" in Greek. One page of typed text is 2KB
Megabyte (MB) | 1,000KB; 2^20 bytes | From "large" in Greek. The complete works of Shakespeare total 5MB. A typical pop song is about 4MB
Gigabyte (GB) | 1,000MB; 2^30 bytes | From "giant" in Greek. A two-hour film can be compressed into 1-2GB
Terabyte (TB) | 1,000GB; 2^40 bytes | From "monster" in Greek. All the catalogued books in America's Library of Congress total 15TB
Petabyte (PB) | 1,000TB; 2^50 bytes | All letters delivered by America's postal service this year will amount to around 5PB. Google processes around 1PB every hour
Exabyte (EB) | 1,000PB; 2^60 bytes | Equivalent to 10 billion copies of The Economist
Zettabyte (ZB) | 1,000EB; 2^70 bytes | The total amount of information in existence this year is forecast to be around 1.2ZB
Yottabyte (YB) | 1,000ZB; 2^80 bytes | Currently too big to imagine

The prefixes are set by an intergovernmental group, the International Bureau of Weights and Measures. Yotta and Zetta were added in 1991; terms for larger amounts have yet to be established. Source: The Economist

4
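The table mixes the decimal convention (powers of 1,000) with the binary convention (powers of 2), which diverge as the units grow: about 2.4% at the kilobyte, over 20% at the yottabyte. The short Python sketch below is illustrative only (it is not from the original deck) and simply prints both interpretations side by side:

```python
# Illustrative sketch (not from the deck): compare decimal (SI) and binary sizes per prefix.
PREFIXES = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

for i, name in enumerate(PREFIXES, start=1):
    decimal = 1000 ** i      # SI convention: 10^(3i) bytes
    binary = 2 ** (10 * i)   # binary convention: 2^(10i) bytes
    gap = (binary - decimal) / decimal * 100
    print(f"{name}: decimal={decimal:.3e}  binary={binary:.3e}  gap={gap:.1f}%")
```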

Page 5:

The situation is getting worse!

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!5

http://offthemark.com/search-results/key/charts/

Page 6:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Famous Words?
• Question:
– Why haven't organizations taken a more proactive approach to data quality?
• Answer:
– Fixing data quality problems is not easy
– It is dangerous -- they'll come after you
– Your efforts are likely to be misunderstood
– You could make things worse
– Now you get to fix it
• A single data quality issue can grow into a significant, unexpected investment

6

Page 7:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

5% & 10% Data Growth versus Sales

7

[Two charts: data growth at 5% and at 10% plotted against sales, 1980-2012]

Page 8:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!8

High Cost of Poor Quality Government Information

Organization | Loss Amount
Australia Department of Defense | $2,900,000,000
Fannie Mae | $12,000,000,000
Federal Agency | $2,300,000,000
FEMA | $1,000,000,000
Federal Government | $2,500,000,000
Government Contractors | $160,000,000
Halliburton | $1,400,000,000
Hubble Telescope | $700,000,000
L. A. County | $1,200,000,000
Medicare | $358,000,000
NASA-Mars Lander | $125,000,000
Nashville Metro Government | $57,000,000
Pentagon | $13,000,000,000
State of Tennessee | $833,000
U. K. Government | $19,564,000,000
U. S. Government | $30,800,000,000
IRS | $640,000,000,000
Total | $728,064,833,000

Excerpted from Table 1-1, Information Quality Applied by Larry English, 2009, pages 4-7
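As a quick sanity check (mine, not part of the original slide), the line items above do add up to the stated total:

```python
# Check that the table's line items sum to the stated total of $728,064,833,000.
losses = [
    2_900_000_000,    # Australia Department of Defense
    12_000_000_000,   # Fannie Mae
    2_300_000_000,    # Federal Agency
    1_000_000_000,    # FEMA
    2_500_000_000,    # Federal Government
    160_000_000,      # Government Contractors
    1_400_000_000,    # Halliburton
    700_000_000,      # Hubble Telescope
    1_200_000_000,    # L. A. County
    358_000_000,      # Medicare
    125_000_000,      # NASA-Mars Lander
    57_000_000,       # Nashville Metro Government
    13_000_000_000,   # Pentagon
    833_000,          # State of Tennessee
    19_564_000_000,   # U. K. Government
    30_800_000_000,   # U. S. Government
    640_000_000_000,  # IRS
]
total = sum(losses)
assert total == 728_064_833_000
print(f"Total documented loss: ${total:,}")
```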

Page 9:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Objective comparison across four major (anonymous) DoD data management programs indicates that some DoD efforts outperform average private sector organizations, whose performance is roughly indicated by the dotted line.

9

[Bar chart (scale 0-3): maturity scores for Data Program Coordination, Organizational Data Integration, Data Stewardship, and Data Development across the assessed programs]

Page 10:

[Chart of reported data quality levels: Excellent 4%, Very Good 11%, Good 51%, Poor 27%, Very Poor 4%, Disastrous 1%]

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

State of Data Quality: Information Indifference

10

The Sad State of Data Quality: Results from the Information Difference survey document initiatives and the state of data quality today • Information Management Magazine, 11/01/2009 by David Waddington

• 4% excellent
• 51% data quality good
• 32% rate data quality poor
• 49% actively measuring

[Pie chart: "Do you actively measure data quality?" Yes, at department level: 31%; Yes, at enterprise level: 18%; No: 42%; Don't know: 9%]

Page 11:

[Chart: estimated cost of poor data quality by bracket: <$50k, $51k-$100k, $101k-$500k, $500k-$1m, $1m-$10m, $11m-$50m, $51m-$100m, >$100m; reported percentages 6%, 6%, 10%, 1%, 3%, 6%, 3%, 3%]

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Insight into the cost of poor data is lacking

11

"The Sad State of Data Quality" • Information Management Magazine, 11/01/2009 by David Waddington and PriceWaterhouseCoopers, Global Risk Management Solutions, Global Data Management Survey 2001: the new economy is the data economy

[Pie chart: cost of poor data quality not calculated 63%, calculated 37%]

• Only one in three companies are very confident in the quality of their own data

• Only 15% of companies are very confident of the data received from other organizations

Page 12:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Most organizations are still approaching data quality problems from stove-piped perspectives!

12

Page 13:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

The Blind Men and the Elephant

• It was six men of Indostan,
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.

• The First approached the Elephant,
And happening to fall
Against his broad and sturdy side,
At once began to bawl:
"God bless me! but the Elephant
Is very like a wall!"

• The Second, feeling of the tusk,
Cried, "Ho! what have we here,
So very round and smooth and sharp?
To me 'tis mighty clear
This wonder of an Elephant
Is very like a spear!"

• The Third approached the animal,
And happening to take
The squirming trunk within his hands,
Thus boldly up he spake:
"I see," quoth he, "the Elephant
Is very like a snake!"

• The Fourth reached out an eager hand,
And felt about the knee:
"What most this wondrous beast is like
Is mighty plain," quoth he;
"'Tis clear enough the Elephant
Is very like a tree!"

• The Fifth, who chanced to touch the ear,
Said: "E'en the blindest man
Can tell what this resembles most;
Deny the fact who can,
This marvel of an Elephant
Is very like a fan!"

• The Sixth no sooner had begun
About the beast to grope,
Than, seizing on the swinging tail
That fell within his scope.
"I see," quoth he, "the Elephant
Is very like a rope!"

• And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!

(Source: John Godfrey Saxe's (1816-1887) version of the famous Indian legend)

13

Page 14:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

No universal conception of data quality exists; instead, many differing perspectives compete.
• Problem:

– Most organizations approach data quality problems in the same way that the blind men approached the elephant - people tend to see only the data that is in front of them

– Little cooperation across boundaries, just as the blind men were unable to convey their impressions about the elephant to recognize the entire entity.

– Leads to confusion, disputes, and narrow views
• Solution:

– Data quality engineering can help achieve a more complete picture and facilitate cross boundary communications

14

Page 15:

• Three Motivations:

1. It is getting worse

2. Approaching Data Quality from Stove-piped perspectives

3. No shared understanding of data quality exists

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data Quality: State of the Art

15

Page 16:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Practical Considerations for Rapidly Improving Quality in Large Data Collections

16

3. "Solution" Considerations

1. Motivation (3 of them)

2. Root Cause

Analysis

Page 17:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Many DQ challenges are unique and/or context specific!

17

Page 18:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Ishikawa Fishbone Diagrams

18

Page 19:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!19

Root Cause Analysis

1. Prevention is more cost-effective than treating the symptoms
2. Data quality problems are more unique than similar

Page 20:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Why Data Projects Fail by Joseph R. Hudicka

• Assessed 1,200 migration projects!
– Surveyed only experienced migration specialists who had done at least four migration projects
• The median project cost over 10 times the amount planned!
• Biggest Challenges: Bad Data; Missing Data; Duplicate Data

• The survey did not consider projects that were cancelled largely due to data migration difficulties

• "… problems are encountered rather than discovered"

[Chart: median project expense (planned) versus median project cost (actual), on a $0 to $500,000 scale]

Joseph R. Hudicka "Why ETL and Data Migration Projects Fail" Oracle Developers Technical Users Group Journal June 2005 pp. 29-3120

Page 21:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Educational institutions are not addressing the challenge!

21

Page 22:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

[Diagram: data management framework linking organizational strategies and goals to business value. Practice areas: Data Program Coordination, Organizational Data Integration, Data Stewardship, Data Development, Data Support Operations. Products and flows include standard data, integrated models, business data, application models & designs, data asset use, direction, guidance, implementation, and feedback.]

22

Page 23:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

[Diagram] Data Management comprises five practice areas:
• Data Program Coordination – Manage data coherently.
• Organizational Data Integration – Share data across boundaries.
• Data Stewardship – Assign responsibilities for data.
• Data Development – Engineer data delivery systems.
• Data Support Operations – Maintain data availability.

23

Page 24:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Assessment Components

Data Management Practice Areas
• Data program coordination – DM is practiced as a coherent and coordinated set of activities
• Organizational data integration – Delivery of data in support of organizational objectives – the currency of DM
• Data stewardship – Designating specific individuals as caretakers for certain data
• Data development – Efficient delivery of data via appropriate channels
• Data support – Ensuring reliable access to data

Capability Maturity Model Levels (examples of practice maturity)
1 – Initial: Our DM practices are ad hoc and dependent upon "heroes" and heroic efforts
2 – Repeatable: We have DM experience and have the ability to implement disciplined processes
3 – Documented: We have standardized DM practices so that all in the organization can perform them with uniform quality
4 – Managed: We manage our DM processes so that the whole organization can follow our standard DM guidance
5 – Optimizing: We have a process for improving our DM capabilities

24

Page 25:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Objective comparison across four major (anonymous) DoD data management programs indicates that some DoD efforts outperform average private sector organizations, whose performance is roughly indicated by the dotted line.

25

[Bar chart (scale 0-3): maturity scores for Data Program Coordination, Organizational Data Integration, Data Stewardship, and Data Development across the assessed programs]

Page 26:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Vendors are incented to not address the challenges proactively!

26

Page 27:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data quality is now acknowledged as a major source of organizational risk by certified risk professionals!

27

Page 28:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Risk Response

“Risk response development involves defining enhancement steps for opportunities and threats.” (Page 119, Duncan, W., A Guide to the Project Management Body of Knowledge, PMI, 1996)

"The go-live date may need to be extended due to certain critical path deliverables not being met. This extension will require additional tasks and resources. The decision of whether or not to extend the go-live date should be made by Monday, November 3, 20XX so that resources can be allocated to the additional tasks."

Tasks | Hours
New Year Conversion | 120
Tax and payroll balance conversion | 120
General Ledger conversion | 80
Total | 320

Resource | Hours
G/L Consultant | 40
Project Manager | 40
Receivables Consultant | 40
HRMS Technical Consultant | 40
Technical Lead Consultant | 40
HRMS Consultant | 40
Financials Technical Consultant | 40
Total | 280

Delay | Weekly Resources | Weeks | Tasks | Cumulative
January (5 weeks) | 280 | 5 | 320 | 1,720
February (4 weeks) | 280 | 4 | | 1,120
Total | | | | 2,840

28
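As a quick check (mine, not on the original slide), the delay hours are internally consistent: each month's figure is the weekly resource hours times the number of weeks, plus the one-time task hours.

```python
# Sanity check of the delay table: hours = weekly resource hours * weeks + one-time task hours.
weekly_resource_hours = 280   # total from the resource table
task_hours = 320              # total from the task table

january = weekly_resource_hours * 5 + task_hours   # 5-week month; conversion tasks land here
february = weekly_resource_hours * 4               # 4-week month

print(january, february, january + february)       # 1720 1120 2840
```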

Page 29:

• Root Causes:

1. It is difficult to generalize about data quality

2. It requires specialized KSAs

3. We provide incorrect incentives

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data Quality: State of the Art

29

Page 30:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Practical Considerations for Rapidly Improving Quality in Large Data Collections

30

3. "Solution" Considerations

1. Motivation (3 of them)

2. Root Cause

Analysis

Page 31:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

A body of knowledge has been developed!

31

• Published by DAMA International
– the professional association for Data Managers (40 chapters worldwide)
• DM BoK organized around
– primary data management functions focused on data delivery to the organization
– several environmental elements
• Amazon
– http://www.amazon.com/DAMA-Guide-Management-Knowledge-DAMA-DMBOK/dp/0977140083
– Or enter the terms "dama dm bok" at the Amazon search engine

Data Management Functions Environmental Elements

1/26/2010 © Copyright this and previous years by Data Blueprint - all rights reserved!

Page 32:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

32

Data Management Functions

Page 33:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!33

Environmental Elements

Page 34:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Much more analysis is required before we can implement repeatable solutions to today's data quality challenges!

34

Page 35:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Traditional Quality Life Cycle

35

[Diagram: Data Acquisition Activities, Data Storage, Data Usage Activities]

Page 36:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data Life Cycle Model: Products

[Diagram: the life cycle activities (Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Utilization, Data Manipulation, Data Storage, Data Assessment, Data Refinement) annotated with the products that flow between them: data architecture & models, populated data models and storage locations, data values, restored data, value defects, structure defects, architecture refinements, and model refinements.]

Page 37:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data Life Cycle Model: Quality Focus

[Diagram: the same life cycle activities (Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Utilization, Data Manipulation, Data Storage, Data Assessment, Data Refinement) annotated with the quality dimension relevant at each point: architecture quality, architecture & model quality, model quality, value quality, and representation quality.]

Page 38:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

[Diagram: data life cycle activities surrounding Metadata & Data Storage. Labels: starting point for new system development; starting point for existing systems. Products exchanged include data performance metadata, data architecture, data architecture and data models, shared data, updated data, corrected data, architecture refinements, and facts & meanings.]
• Metadata Creation – Define Data Architecture; Define Data Model Structures
• Metadata Structuring – Implement Data Model Views; Populate Data Model Views
• Metadata Refinement – Correct Structural Defects; Update Implementation
• Data Creation – Create Data; Verify Data Values
• Data Utilization – Inspect Data; Present Data
• Data Manipulation – Manipulate Data; Update Data
• Data Assessment – Assess Data Values; Assess Metadata
• Data Refinement – Correct Data Value Defects; Re-store Data Values

Extended data life cycle model with metadata sources and uses

38

Page 39:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Quality Attributes

39

Page 40:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Resistance is Futile

40

Page 41:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Frozen Falls

41

Page 42:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

"Our understanding of the nature of this

socio-technical challenge is evolving!"

42

Page 43:

• National Stock Number (NSN) Discrepancies
– If NSNs in LUAF, GABF, and RTLS are not present in the MHIF, these records cannot be updated in SASSY
– Additional overhead is created to correct data before performing the real maintenance of records
• Serial Number Duplication
– If multiple items are assigned the same serial number in RTLS, the traceability of those items is severely impacted
– Approximately $531 million of SAC 3 items have duplicated serial numbers
• On-Hand Quantity Discrepancies
– If the LUAF O/H QTY and the number of items serialized in RTLS conflict, there can be no clear answer as to how many items a unit actually has on hand
– Approximately $5 billion of equipment does not tie out between the LUAF and RTLS

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Our toolset is improving!
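Checks like the ones listed above are straightforward to automate. The sketch below is purely illustrative: the field names and sample records are assumptions, not the actual LUAF/RTLS/MHIF layouts. It shows how a referential check of NSNs against a master file and a duplicate-serial-number check might look.

```python
# Illustrative only: field names and sample records are assumptions, not the real system schemas.
from collections import Counter

mhif_nsns = {"1005-01-123-4567", "2320-01-765-4321"}          # hypothetical master NSN list
rtls_records = [
    {"nsn": "1005-01-123-4567", "serial": "A001"},
    {"nsn": "1005-01-123-4567", "serial": "A001"},             # duplicate serial number
    {"nsn": "9999-00-000-0000", "serial": "B002"},             # NSN missing from the master file
]

# Referential check: NSNs that cannot be updated because they are absent from the master file.
orphan_nsns = {r["nsn"] for r in rtls_records} - mhif_nsns

# Duplicate check: serial numbers assigned to more than one item.
dup_serials = [s for s, n in Counter(r["serial"] for r in rtls_records).items() if n > 1]

print("NSNs not in master file:", sorted(orphan_nsns))
print("Duplicated serial numbers:", dup_serials)
```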

Page 44:

Academic Research Findings

[Bar chart (scale 0%-50%): values 49%, 39%, 21%, 20%, 20%, 20%, 19%, 18%, 18%, and 17% across the industries Retail, Consulting, Air Transportation, Food Products, Construction, Steel, Automobile, Publishing, Industrial Instruments, and Telecommunications]

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!44

Impact of a 10% improvement in data usability on productivity: increased sales per employee by 14.4%, or $55,900

Measuring the Business Impacts of Effective Data, by Anitesh Barua, Deepa Mani, and Rajiv Mukherjee
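As a rough consistency check (mine, not the study's), the two figures imply a baseline of roughly $388,000 in sales per employee:

```python
# If a 14.4% gain equals $55,900 per employee, the implied baseline is about $388k.
gain_fraction = 0.144
gain_dollars = 55_900

baseline = gain_dollars / gain_fraction
print(f"Implied baseline sales per employee: ${baseline:,.0f}")   # ~$388,194
```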

Page 45:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Academic Research Findings

Projected increase in sales (in $M) due to a 10% improvement in data usability on productivity (sales per employee)

Measuring the Business Impacts of Effective Data, by Anitesh Barua, Deepa Mani, and Rajiv Mukherjee

45

Page 46:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Projected impact of a 10% improvement in data quality and sales mobility on Return on Equity

Measuring the Business Impacts of Effective Data, by Anitesh Barua, Deepa Mani, and Rajiv Mukherjee

Academic Research Findings

46

Page 47:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Projected impact of a 10% increase in intelligence and accessibility of data on Return on Assets

Measuring the Business Impacts of Effective Data, by Anitesh Barua, Deepa Mani, and Rajiv Mukherjee

Academic Research Findings

47

Page 48:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

The best approaches combine manual and automated reconciliation!

48

Humans Generally Better
• Sense low-level stimuli
• Detect stimuli in noisy background
• Recognize constant patterns in varying situations
• Sense unusual and unexpected events
• Remember principles and strategies
• Retrieve pertinent details without a priori connection
• Draw upon experience and adapt decisions to the situation
• Select alternatives if the original approach fails
• Reason inductively; generalize from observations
• Act in unanticipated emergencies and novel situations
• Apply principles to solve varied problems
• Make subjective evaluations
• Develop new solutions
• Concentrate on important tasks when overload occurs
• Adapt physical response to changes in situation

Machines Generally Better
• Sense stimuli outside the human range
• Count or measure physical quantities
• Store quantities of coded information accurately
• Monitor prespecified events, especially infrequent ones
• Make rapid and consistent responses to input signals
• Recall quantities of detailed information accurately
• Retrieve pertinent details without a priori connection
• Process quantitative data in prespecified ways
• Perform repetitive preprogrammed actions reliably
• Exert great, highly controlled physical force
• Perform several activities simultaneously
• Maintain operations under heavy operational load
• Maintain performance over extended periods of time

Page 49:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Improving Data Quality during System Migration

49

• Challenge
– Millions of NSN/SKUs maintained in a catalog
– Key and other data stored in clear text/comment fields
– Original suggestion was a manual approach to text extraction
– Left the data structuring problem unsolved
• Solution (see the sketch below)
– Proprietary, improvable text extraction process
– Converted non-tabular data into tabular data
– Saved a minimum of $5 million
– Literally person-centuries of work
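The deck does not show the proprietary extraction process itself. Purely as an illustration of the general idea (pulling key data buried in free-text comment fields into tabular columns), a minimal sketch might look like the following; the patterns, field names, and sample comments are invented for the example.

```python
# Minimal, hypothetical sketch of turning free-text comment fields into tabular data.
# Patterns, field names, and sample comments are invented; the actual process was proprietary.
import re

PATTERNS = {
    "nsn":    re.compile(r"\bNSN[:\s]*([0-9]{4}-[0-9]{2}-[0-9]{3}-[0-9]{4})"),
    "serial": re.compile(r"\bS/?N[:\s]*([A-Z0-9-]+)"),
    "qty":    re.compile(r"\bQTY[:\s]*([0-9]+)"),
}

def extract(comment: str) -> dict:
    """Return whichever structured fields can be recognized in one comment string."""
    row = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(comment)
        if match:
            row[field] = match.group(1)
    return row

comments = [
    "NSN: 1005-01-123-4567 SN A001 QTY 12 stored bldg 7",
    "replacement part, QTY 3, see NSN 2320-01-765-4321",
]
for c in comments:
    print(extract(c))
```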

Page 50:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

An Iterative Approach

50

Rev# | Unmatched Items | (% Total) | Ignorable NSNs | (% Total) | Items Matched | Avg Extracted Per Item | Items Matched (% Total) | Items Extracted
1 | 329,948 | 31.47% | 14,034 | 1.34% | N/A | N/A | N/A | 264,703
2 | 222,474 | 21.22% | 73,069 | 6.97% | N/A | N/A | N/A | 286,675
3 | 216,552 | 20.66% | 78,520 | 7.49% | N/A | N/A | N/A | 287,196
4 | 340,514 | 32.48% | 125,708 | 11.99% | 582,101 | 1.1000222 | 55.53% | 640,324
… | … | … | … | … | … | … | … | …
14 | 94,542 | 9.02% | 237,113 | 22.62% | 716,668 | 1.1142914 | 68.36% | 798,577
15 | 94,929 | 9.06% | 237,118 | 22.62% | 716,276 | 1.1139282 | 68.33% | 797,880
16 | 99,890 | 9.53% | 237,128 | 22.62% | 711,305 | 1.1153008 | 67.85% | 793,319
17 | 99,591 | 9.50% | 237,128 | 22.62% | 711,604 | 1.1154392 | 67.88% | 793,751
18 | 78,213 | 7.46% | 237,130 | 22.62% | 732,980 | 1.2072812 | 69.92% | 884,913

Page 51:

Time needed to review all NSNs once over the life of the project:
NSNs: 2,000,000
Average time to review & cleanse (in minutes): 5
Total time (in minutes): 10,000,000

Time available per resource over a one-year period of time:
Work weeks in a year: 48
Work days in a week: 5
Work hours in a day: 7.5
Work minutes in a day: 450
Total work minutes/year: 108,000

Person-years required to cleanse each NSN once prior to migration:
Minutes needed: 10,000,000
Minutes available per person/year: 108,000
Total person-years: 92.6

Resource cost to cleanse NSNs prior to migration:
Avg salary for an SME per year (not including overhead): $60,000.00
Projected years required to cleanse / total DLA person-years saved: 93
Total cost to cleanse / total DLA savings to cleanse NSNs: $5.5 million
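The arithmetic in the table is easy to reproduce; a small sketch using the slide's own values gives the same person-year and cost figures:

```python
# Reproduce the slide's person-year and cost estimate for manual NSN cleansing.
nsns = 2_000_000
minutes_per_nsn = 5
minutes_needed = nsns * minutes_per_nsn                      # 10,000,000

work_minutes_per_year = 48 * 5 * 7.5 * 60                    # 48 weeks * 5 days * 7.5 hours = 108,000
person_years = minutes_needed / work_minutes_per_year        # ~92.6

avg_sme_salary = 60_000
total_cost = person_years * avg_sme_salary                   # ~$5.56M; the slide reports $5.5 million

print(f"{person_years:.1f} person-years, about ${total_cost:,.0f}")
```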

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Quantitative Data Quality Benefits

51

Page 52:

• Solution Considerations:

1. Body of knowledge now exists

2. Better understand DQ lifecycle/attributes

3. Technical tools are improving

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Data Quality: State of the Art

52

Page 53:

• Motivation
– The situation is getting worse!
– Most organizations are still approaching data quality problems from stove-piped perspectives!
• Root Cause Analysis
– Many DQ challenges are unique and/or context specific!
– Educational institutions are not addressing the challenge!
– Vendors are incented to not address the challenges proactively!
– Much more analysis is required before we can implement repeatable solutions to today's data quality challenges!
• "Solution" Considerations
– A body of knowledge has been developed!
– Our understanding of the nature of this socio-technical challenge is evolving!
– Our toolset is improving!
– The best approaches combine manual and automated reconciliation!
– Data quality must be approached as a specialized discipline!

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Practical Considerations for Rapidly Improving Quality in Large Data Collections

53

Page 54:

• Unique and/or context specific challenges require development of specialized disciplines with unique knowledge, skills, and abilities!

• The best approaches combine manual and automated reconciliation!

• Educational institutions are not addressing the challenge!

• Vendors are incented to not address the challenges proactively!

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Take Aways!

54

Page 55:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

Practical Considerations for Rapidly Improving Quality in Large Data Collections

55

3. "Solution" Considerations

1. Motivation (3 of them)

2. Root Cause

Analysis

Page 56:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

… well, I'll be darned, I guess he does have a license to do that …

56

Data quality must be approached as a specialized discipline!

Page 57:

- datablueprint.com 5/12/2011 © Copyright this and previous years by Data Blueprint - all rights reserved!

http://peteraiken.net

Contact Information:

Peter Aiken, Ph.D.

Department of Information Systems, School of Business
Virginia Commonwealth University
Snead Hall Room B4217
301 West Main Street
Richmond, Virginia 23284-4000

Data Blueprint
10124C West Broad Street
Glen Allen VA 23060
804.521.4056
http://datablueprint.com

office: +1.804.883.7594
cell: +1.804.382.5957
e-mail: [email protected]
web: http://peteraiken.net

57

Page 58:

Practical Considerations for Rapidly Improving Quality in Large Data Collections

by

Peter Aiken • VCU/Data Blueprint

August 2010

Abstract: While data quality has been a subject of interest for many years, only recently has the research output begun to converge with the needs of organizations faced with these challenges. This paper addresses fundamental issues in existing approaches to improving DoD data quality (DQ). It briefly discusses our collective motivation and examines three root causes preventing more rapid DQ improvement progress. An examination of "newly perceived" realities in this area leads to a discussion of several considerations that will improve related efforts.

Motivation

The situation is getting worse!

A recent, voluminous book on the subject has documented more than $13 billion in costs of poor-quality government information attributed directly to the Pentagon, and more than $700 billion attributed to governmental challenges overall [English, 2009]. When we couple these costs with recent attempts to determine how much DQ measurement is occurring, the results indicate that these two numbers are probably very low. This is in spite of the fact that DoD has been objectively determined to be on the relative forefront of these types of efforts (see Figure 1).

Figure 1: Objective comparison across four major (anonymous) DoD data management programs indicates that some DoD efforts outperform average private sector organizations, whose performance is roughly indicated by the dotted line.

Page 59:


Figure 2 indicates the results of a 2009 survey from Information Management Magazine. Highlights from this and other recent survey data include:

• One-third of respondents rate their data quality as poor at best and only 4 percent as excellent.

• Forty-two percent of organizations make no attempt to measure the quality of their data.

• Only 15% of organizations are very confident of the data received from other organizations.

The only reasonable conclusion is that, absent a formal data quality assessment effort, all data in an organization is of unknown quality!

Figure 2: Percentage of organizations reporting various levels of data quality (bars) and percentage of organizations proactively measuring the quality of their own data (pie chart).

With the advent of truly big data challenges the problem continues to worsen. Recent articles such as this year's special report from The Economist have helped to increase awareness of the challenges of dealing with yottabytes of data [Economist 2010].

Most organizations are still approaching data quality problems from stove-piped perspectives!

Figure 3: No universal conception of data quality exists; instead, many differing perspectives compete.

In spite of this, many are still dealing with these challenges from various stove-piped perspectives. It has been the classic case of the blind men and the elephant illustrated in Figure 3. Most organizations approach data quality problems in the same way that the blind men approached the elephant: people tend to see only the data that is in front of them. Little cooperation exists across boundaries, just as the blind men were unable to convey their impressions about the elephant to recognize the entire entity.

In order to be effective, data quality engineering must achieve a more complete picture and facilitate cross boundary communications. Whether you believe that the solution

Page 60:


should come in the form of TQM, six-sigma, standards-related work, or tiger teams, it remains clear that one solution cannot satisfy all aspects of the challenge.

Root Cause Analysis

Three root causes do seem common to DQ problems.

Many DQ challenges are unique and/or context specific!

After dealing with data quality problems for more than 25 years, I hold two strong opinions:

• First, prevention is more cost-effective than treating the symptoms. This is Tom Redman's well-repeated story about eliminating the sources of water pollution for any given "lake" of data as opposed to attempting to continue to clear the data lake of the polluted data. It should be obvious that correcting data quality problems at the source will be less expensive than fixing them forever.

• Second, most data quality problems are more unique than similar. This prevents the resolution of these challenges from following programmatic solution development practices, and it mandates the development of data quality engineering specialists within organizations (more on this in the solutions section of this paper).

Particular evidence of this second point can be seen when we examine the practices of "experienced" data migration specialists – experienced here meaning that those surveyed individually had accomplished four or more data migrations. Collectively this group of experienced professionals underestimated the cost of future data migration projects by a factor of 10 as shown in Figure 4 [Hudicka 2005].

Figure 4: Experienced IT professionals are not yet able to use past expertise to accurately forecast project costs!

Educational institutions are not addressing the challenge!

Computer engineering/information systems/computer science (CEISCS) students are not being taught data quality concepts, and non-CEISCS students (such as business majors) receive virtually no exposure to data concepts at all. With a few notable exceptions (including MIT's and UALR's data quality programs), university-level programs are not addressing data quality in CEISCS curricula. Indeed, the most prevalent data-related skill taught by these programs is how to develop new databases, probably the very least desired skill set when considering organizational legacy systems environments. At the research level, there is also a short history: it was only in 2006 that the first academic journal dedicated to data quality was created.

Page 61:


Vendors are incented to not address the challenges proactively!

When contracting for a highway project (at least in the Commonwealth of Virginia), the contractor is offered a bonus for completing the project ahead of schedule, the contracted amount for finishing the project on time, and a penalty for completing the project behind schedule. In DoD systems contracting, vendors actually plan on cost overruns and are rewarded when those overruns are achieved. Anecdotal evidence indicates that data is the primary area where these overruns occur.

I have spent considerable time serving as an expert witness or otherwise providing litigation support. Virtually all IT upgrades, migrations, and/or consolidations involve movement of data. When new systems don't work, one party blames the problems on poor-quality data from the source system. Without a baseline assessment of the quality of the data before the movement/consolidation/transformation, it is impossible to defend against this charge. Yet data quality is typically not addressed, formally or informally, as part of IT contracts.

Vendors currently are incented to "discover" data quality problems after contracts are signed – a practice that is literally indefensible, wasteful, and costly.

New Realities

Data quality is now acknowledged as a major source of organizational risk by certified risk professionals!

Data quality is now widely acknowledged as a major source of corporate risk. The DoD should take note of the advent of two new C-level executives in private industry: the Chief Risk Officer (CRO) and the Chief Data Officer (CDO). The CDO is an acknowledgement that the CIO concept has been hijacked to focus on areas far beyond the original focus of "corporate information as an asset." Indeed, many organizations are properly relabeling these individuals as Chief Technology Officers (CTOs) in light of their more broadly technology-focused roles, and refocusing the data assets under the control of a CDO. From the business side, CROs are being groomed to understand how all aspects of risk play into strategic failures. These professionals understand the role that data quality plays in risk mitigation and can often be the best allies of CDOs in the business management hierarchy.

A body of knowledge has been developed!

While this paper has focused on several challenges that relate to the relative immaturity of data quality engineering as a professional discipline, there is some hopeful news. In 2009, DAMA International released A Guide to the Data Management Body of Knowledge [DAMA 2009]. While it isn't as detailed as a Body of Knowledge (BOK) focused specifically on data quality, it does elevate the field of data management to the status enjoyed by the Project Management discipline (PM BOK) and Software Engineering (SW BOK). Also, there is much reference material in the DM BOK that focuses specifically on data quality.

Page 62:


Much more analysis is required before we can implement repeatable solutions to today's data quality challenges!

Similar to the point noted above (that experienced IT professionals cannot reliably predict data migration costs), those of us experienced with developing data quality engineering solutions understand that the relative newness of this discipline precludes implementation of repeatable (much less optimized) solutions. Indeed, it is amazing how fast progress has been made in this area. Consider, for example, our concept of the data life cycle. As originally proposed in [Redman 1993], it consisted of three phases: data acquisition, data storage, and data use (see Figure 5).

Figure 5 Original Data Acquisition and Use Cycle [Levitan 1993]

Just five years later, we acknowledged the data life cycle as more complex (Figure 6).

[Figure 6 diagram: the extended data life cycle (Metadata Creation, Metadata Structuring, Metadata Refinement, Data Creation, Data Utilization, Data Manipulation, Data Assessment, Data Refinement, centered on Metadata & Data Storage); see the slide 38 reconstruction above for the full activity detail.]

Figure 6 Refined Data Life Cycle [Finkelstein 1999]

Page 63:


Another relatively recent development focuses on the expansion of the canonical list of data quality attributes. Again, an original formulation of these consisted of a list of terms (such as completeness, conformity, consistency, accuracy, duplication, and integrity). We now know (see Figure 7) that data quality attributes extend to the data models that produce and govern production datasets, and even to organizational data architectures [GAO 2007].

Figure 7: A complete list of data quality attributes includes data model and data architecture attributes as well as data representation and data value quality attributes [Yoon 1999]

Finally, I'm reminded of events that occurred more than 15 years ago with the DoD. The Office of the Secretary of Defense (OSD) would routinely send out requests to the various branches and services for information. These were referred to then as "data calls." One data call might request of various organizations "how many employees do you have?"

On the surface this might seem a simple and innocent query. But as I observed the mechanics of the response patterns, they were generally of the form, "What do you mean by an employee?" As a data person, this was a reasonable clarifying question. Since the 37 systems that paid civilians at the time were not designed to maintain the same information types, they did not. A careful respondent might ask this question to ensure valid comparisons could be made of the responses. After all, in those days it was somewhat common for a service member to work part time for another agency at night or when otherwise off duty to earn vacation money or contribute a needed source of expertise.

After seeing the various response patterns repeated, I became aware that data quality was a socio-technical discipline. The various respondents had no intention of providing the OSD with any information, and the various questions, while legitimate, were also designed to ensure that no numbers were provided back to the head office. If no numbers were provided, then OSD couldn't tell the respondents to take any action based on the numbers. So we had to incorporate some social engineering into our future data calls.

Page 64:


"Solution" Considerations Our understanding of the nature of this socio-technical challenge is evolving!

It is the relative velocity of the developments outlined above that forces us to acknowledge that right now we know just a bit, and we still don't know what we don't know about data quality. We are on the discovery curve, and attempts to over-formalize our various approaches will result in the development of brittle solutions. We do know that application of scientific and engineering disciplines can produce better data quality solutions than previous attempts. But for now, it is better to concentrate our efforts on high-level application of policies and principles as opposed to detailed specifications.

Our toolset is improving!

Since the development of formalized data reverse engineering and the invention of data profiling (both DoD-funded initiatives [Aiken 1996]) in the early 1990s, our collective data quality engineering tool kit has matured considerably. A multitude of products are now available to help with various analyses and tasks. The most common problem now facing DoD is the widespread perception that tools alone will accomplish data quality improvements and that the purchase of a package will solve data quality problems. This, of course, has always been and will always be false.

The best approaches combine manual and automated reconciliation!

As we continue to learn more about data quality, solutions engineering, and related issues, one thing remains clear: the best data quality engineering solutions will continue to be a combination of selected tools and specific analysis tasks, and the primary challenge as we attempt to improve will be determining the proper mix of human and automated solutions. While Figure 8 below was developed by one of my heroes, J. C. R. Licklider, his insight into the relative capabilities of human and machine was prescient and is as correct now as it was when it was published in 1960.

Humans generally better:
- Sense low-level stimuli
- Detect stimuli in a noisy background
- Recognize constant patterns in varying situations
- Sense unusual and unexpected events
- Remember principles and strategies
- Retrieve pertinent details without a priori connection
- Draw upon experience and adapt decisions to the situation
- Select alternatives if the original approach fails
- Reason inductively; generalize from observations
- Act in unanticipated emergencies and novel situations
- Apply principles to solve varied problems
- Make subjective evaluations
- Develop new solutions
- Concentrate on important tasks when overload occurs
- Adapt physical response to changes in situation

Machines generally better:
- Sense stimuli outside the human range
- Count or measure physical quantities
- Store quantities of coded information accurately
- Monitor prespecified events, especially infrequent ones
- Make rapid and consistent responses to input signals
- Recall quantities of detailed information accurately
- Retrieve pertinent details without a priori connection
- Process quantitative data in prespecified ways
- Perform repetitive preprogrammed actions reliably
- Exert great, highly controlled physical force
- Perform several activities simultaneously
- Maintain operations under heavy operation load
- Maintain performance over extended periods of time

Figure 8 Licklider’s Relative Capabilities

A simple example will illustrate this point. At one point in the Defense Logistics Agency's business modernization program, someone realized that much of their data was poorly stored in the clear text/comment fields of the old SAMMS system. It was thought that a manual approach would be required to clean and restructure the data to prepare it for use in the new SAP system. A simple set of calculations indicated that the time required to implement this manual approach to data quality engineering for approximately 2 million NSN/SKUs (a subset of the entire inventory) would run into person-centuries (see Figure 9).

Time needed to review all NSNs once over the life of the project:
- NSNs: 2,000,000
- Average time to review & cleanse (in minutes): 5
- Total time (in minutes): 10,000,000

Time available per resource over a one-year period:
- Work weeks in a year: 48
- Work days in a week: 5
- Work hours in a day: 7.5
- Work minutes in a day: 450
- Total work minutes/year: 108,000

Person-years required to cleanse each NSN once prior to migration:
- Minutes needed: 10,000,000
- Minutes available per person/year: 108,000
- Total person-years: 92.6

Resource cost to cleanse NSNs prior to migration:
- Average salary for an SME per year (not including overhead): $60,000.00
- Projected years required to cleanse / total DLA person-years saved: 93
- Total cost to cleanse / total DLA savings to cleanse NSNs: $5.5 million

Figure 9 Illustration of how data cleansing of 2 million NSN/SKUs would require 93 person years if the task took only 5 minutes per NSN/SKU – real estimates were much greater.
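
For readers who wish to adapt this estimate to a different inventory size or review rate, the arithmetic behind Figure 9 is easy to reproduce. The short Python sketch below simply restates the figure's numbers; the variable names are mine and not part of the original figure.

    # Back-of-the-envelope check of the Figure 9 estimate for a fully manual effort.
    nsn_count = 2_000_000                  # NSN/SKUs to review
    minutes_per_nsn = 5                    # average review-and-cleanse time
    total_minutes = nsn_count * minutes_per_nsn             # 10,000,000 minutes

    work_minutes_per_year = 48 * 5 * 7.5 * 60               # 108,000 minutes per analyst-year
    person_years = total_minutes / work_minutes_per_year    # about 92.6

    sme_salary = 60_000                    # per analyst-year, excluding overhead
    total_cost = person_years * sme_salary                  # roughly the $5.5 million in Figure 9
    print(f"{person_years:.1f} person-years, about ${total_cost:,.0f}")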

Instead, as Figure 10 illustrates, a combination of automated processing and targeted manual work was able to reduce the "problem space" from a 100% manual approach to one requiring manual attention for less than 7.5% of the original NSN/SKU inventory. Of perhaps equal importance, we were able to demonstrate that we could objectively identify the point of diminishing returns – the point at which more work on the automated approach did not produce greater time/effort savings. This kind of synergistic approach is common to most data quality engineering challenges.


Figure 10 Semi-automating the data cleansing of DLA's SAMMS data saved literally person-centuries – not to mention millions of taxpayer dollars.
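
Although the figure itself does not reproduce here, the mechanics it summarizes are easy to sketch. The Python fragment below is purely illustrative and is not DLA's actual tooling; the rules, record fields, and sample data are hypothetical. The idea is that each automated rule removes some portion of the remaining records from the manual queue, and the marginal yield reported for each additional rule is what exposes the point of diminishing returns described above.

    # Illustrative sketch of semi-automated triage; rules and data are hypothetical.
    def parses_to_known_code(record):
        # e.g., the free-text comment field parses cleanly into a structured value
        return record.get("parsed_code") is not None

    def matches_reference_catalog(record):
        # e.g., the item resolves against an authoritative reference table
        return record.get("catalog_match", False)

    def triage(records, rules):
        """Apply automated rules in order; return the records still needing manual review."""
        remaining = list(records)
        for rule in rules:
            before = len(remaining)
            remaining = [r for r in remaining if not rule(r)]
            print(f"{rule.__name__}: automated {before - len(remaining)} records; "
                  f"{len(remaining)} left for manual review")
        return remaining

    sample = [
        {"parsed_code": "A1"},
        {"parsed_code": None, "catalog_match": True},
        {"parsed_code": None, "catalog_match": False},   # stays in the manual queue
    ]
    manual_queue = triage(sample, [parses_to_known_code, matches_reference_catalog])

When an additional rule automates away only a handful of records, further automation effort costs more than it saves; that is the objective stopping point referred to above.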

Data quality must be approached as a specialized discipline!

Given all of the above, it remains clear that the best approach to resolving some of DoD's data quality challenges is to form specialized data quality teams dedicated to resolving challenges wherever and whenever they occur. Only in this manner can DoD effectively concentrate its strengths into a process that can be matured from heroic, to repeatable, to documented, to managed, and finally to improvable. Failure to do so will dilute the intellectual strength of data quality engineers with respect to their subject matter knowledge, their tools expertise, and their ability to select and apply appropriate automated solutions to appropriate challenges.

References

[Aiken 1996] Aiken, P. Data Reverse Engineering: Slaying the Legacy Dragon. McGraw-Hill, 1996.

[DAMA 2009] DAMA International. The DAMA Guide to the Data Management Body of Knowledge, 2009.

[Economist 2010] "Data, Data, Everywhere." Special report on managing information, The Economist, February 27, 2010.

[English 2009] English, L. Information Quality Applied, 2009.


[Finkelstein 1999] Finkelstein, C. and Aiken, P. H. Building Corporate Portals Using XML. McGraw-Hill, 1999. 530 pages. ISBN 0-07-913705-9.

[GAO 2007] U.S. Government Accountability Office. DHS Enterprise Architecture Continues to Evolve but Improvements Needed. GAO-07-564, 2007.

[GAO 2008] U.S. Government Accountability Office. Key Navy Programs' Compliance with DOD's Federated Business Enterprise Architecture Needs to Be Adequately Demonstrated. GAO-08-972, 2008.

[Hudicka 2005] Hudicka, J. R. "Why ETL and Data Migration Projects Fail." Oracle Developers Technical Users Group Journal, June 2005, pp. 29-31.

[Waddington 2009] Waddington, D. "The Sad State of Data Quality: Results from the Information Difference survey document initiatives and the state of data quality today." Information Management Magazine, November 1, 2009.

[Yoon 1999] Yoon, Y., Aiken, P., and Guimaraes, T. "Managing Organizational Data Resources: Quality Dimensions." Information Resources Management Journal 13(3), July-September 2000, pp. 5-13.