© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
Statistisches Bundesamt
SDE – Budapest
Towards a Generic Approach to Validation: the ValiDat Foundation Project
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 2
ValiDat - Foundation A bit of history
n Once upon a time (actually now) .. n .. 28 Member states in the European Statistical System
(ESS) sent data to Eurostat
n .. and got a lot of problems: n The gatekeeper at the central fortress refused entry and
sent the data back: They did not comply to the rules! n The poor member states tried again and again .. n .. until finally some brave knights decided to fight the
dragon of uncertainty and non-standards
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 3
ValiDat - Foundation A bit of history
More prosaic: n Already in 2009, Eurostat launched a project to harmonize data validation policies in the ESS n In 2013 that project became the ESS Vision implementing Project „Validation“ n In 2014, a member states became more actively involved through a so-called ESSnet project n Italy, the Netherlands, Germany and Lithuania started working on methodological and technological questions (as well as questions of standardization) n In a truly interdisciplinary approach, the team engages methodologists and technicians from several fields
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 4
ValiDat - Foundation What are we talking about?
n Right from the start, we had serious problems agreeing on a common definition of validation
n Specifically, the relationship between data validation and editing was not clear to some of us ..
n .. but we had wise guys from the methodological side in our team:
Data Validation is an activity verifying whether a combination of values is a member of a set of acceptable
combinations n As we see it: data validation is part one of the data editing
process
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 5
ValiDat - Foundation if employment status == “old-‐age pensioner” and age < 35 then error “Too young!” 0.5 < turnover(curMonth)/turnover(prevMonth) < 2 WENN ANZAHL VON Familie[ALLE].Person[MIT Alter < 18] > 0 DANN ...
ENDE IF maritalstate=married THEN
Age>15 “Too young to be married”
ENDIF
profit <= 0.6*revenue
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 6
ValiDat - Foundation Is there a business case?
n When we started our survey on data validation in the ESS we were not completely aware of the scale of the „problem“:
n Effort: The amount of effort put into data validation (and editing) in the five sample domains was estimated by the member states to make up 40 to 60 % of the total effort
n Relevance: We have not asked for the impact of data validation on data quality (non-sampling errors) but assume that it is generally of paramount importance
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 7
ValiDat - Foundation Business case - implications:
n If validation has such a high impact on data quality and consumes so many resources, then it should be n well understood, n fairly wide standardized n and as far as possible automated
n Sequence: Understanding is the a) methodological foundation of b) standardization which in turn will be the base for c) technical innovation (and process enhancements)
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 8
ValiDat - Foundation The Base Line: Methodology
n A central part of the methodological work of the ESSnet project is writing a „handbook“ i.e. compiling from the work of others and make it available (pragmatically) for a general audience of statisticians
n Why are we doing validation (remember the business case!)? n Enhance data quality dimensions:
n Directly (like accuracy, coherence and compatability)
n Indirectly (timeliness) as restrictions
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 9
ValiDat - Foundation The Base Line: Methodology
n Content of handbook: n What n Why n How n When
Table of contents 1.......................................................................................................................................................................... 1
1 Introduction ............................................................................................................................................... 2
2 Data validation ........................................................................................................................................... 2
2.1 What is data validation. ..................................................................................................................... 3
2.2 Why data validation. Relationship between validation and quality .................................................. 5
2.3 How to do data validation: validation levels and validation rules ..................................................... 7
2.3.1 Validation levels from a business perspective ........................................................................... 8
2.3.2 Validation rules ........................................................................................................................ 13
2.4 Generic framework for validation levels and validation rules ......................................................... 17
2.4.1 Validation levels based on decomposition of metadata .......................................................... 17
2.4.2 A formal typology of data validation functions........................................................................ 19
2.4.3 Validation levels ....................................................................................................................... 20
2.4.4 Relation between validation levels from a business and a formal perspective ...................... 21
2.4.5 Applications and examples ...................................................................................................... 23
3 Data validation as a process ..................................................................................................................... 24
3.1 Data validation in a statistical production process (GSBPM) ........................................................... 24
3.2 The informative objects of data validation (GSIM) .......................................................................... 27
4 The data validation process life cycle ...................................................................................................... 30
4.1 Design phase .................................................................................................................................... 32
4.2 Implementation phase ..................................................................................................................... 33
4.3 Execution phase ............................................................................................................................... 34
4.4 Review phase ................................................................................................................................... 35
References ....................................................................................................................................................... 37
Appendix 2.3.2: List of validation rules ............................................................................................................ 38
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 10
ValiDat - Foundation The Base Line: Methodology – What?
n The handbook provides classification schemes for validation rules: n Levels n Pragmatic typology n Formal typology
n All have their merits and help communicate about validation
Class Description of input Example function Description of example
Single data point
Univariate comparison with constant
Multivariate (in-‐record)
Linear restriction
Multi-‐element (single variable)
Condition on aggregate of single variable
Multi-‐element multivariate
Condition on ratio of aggregates of two variables
Multi-‐measurement
Condition on difference between current and previous observation.
Multi-‐measurement multivariate
Condition on ratio of sums of two currently and preciously observed observations.
Multi-‐measurement multi-‐element
Condition on ratio of current and previously observed aggregate.
Multi-‐measurement multi-‐element, multivariate
Condition on difference between ratios of previous and currently observed aggregates.
Multi-‐universe multi-‐element multivariate
Condition on ratio of aggregates over different variables of different object types.
Multi-‐universe multi-‐measurement multi-‐element multi-‐time
Condition on difference between ratios of aggregates of different object types measured at different times.
Typology dimension Types of checks
1
Identity checks Range checks • bounds fixed • bounds depending on entries in other
fields
2 Simple checks, based directly on the entry of a target field
More “complex” checks, combining more than one field by functions (like sums, differences, ratios)
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 11
ValiDat - Foundation The Base Line: Methodology – What?
n Levels and rule types are building blocks to discuss other important concepts like: n Structural vs. content based validation n Simple vs. complex rule types n Soft vs. hard checks n Micro data vs. macro data validation
n They can be used as a framework for metrics, languages and technologies
n One of the results of our survey was that there is no common understanding on these methodological issues yet!
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 12
ValiDat - Foundation The Base Line: Methodology – When?
HERE
HERE
HERE HERE
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 13
ValiDat - Foundation The Base Line: Methodology – How?
n Validation Life Cycle
Simon et al. 2015
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 14
ValiDat - Foundation The Base Line: Methodology – How?
n How do we know that we have struck the right balance between n Improving data quality n At acceptable costs
n Our solution: use metrics! n Analyse the internal consistency of validation rule sets n Analyse the value of validation rules on observed data n Analyse validation rule sets in comparison to observed
and expected data
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 15
ValiDat - Foundation Language
n The future validation language has two main goals: n It should provide an unambigous
communication channel for specialists (humans!)
n It should feed different IT-systems with the necessary specific information about a particular survey
n These might be conflicting aims!
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 16
ValiDat - Foundation Language: A new Sta(nda)r(d) is born
n VTL - Validation and Transformation Language has been specified by the SDMX community
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 17
ValiDat - Foundation Language: A new Sta(nda)r(d) is born
n Language is currently under review in our project n Different Aspects:
n Correctness and coherence n Completeness n Usability (by human users) n Feasibility (for machine-to-machine communication)
n Evaluation will be publicly available soon ..
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 18
ValiDat - Foundation From Language to Tools and Services
n Proof of Concept being launched n From real world to .. n .. VTL to .. n .. national Systems
n CBS (flexible R solution) n Destatis (mighty IT-infrastructural approach)
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 19
ValiDat - Foundation Tools and services
n Heterogeneity: Tools beeing used for validation are varied. A large number of general purpose and specialized tools are applied in the NSI of the ESS (compare survey results)
n Costs: Variety causes costs! n Common Infrastructure: As a solution to secure
homogeneous (high quality) output at reasonable (and lower than now) costs
n Preconditions: Methodology and language have to be standardized before
n Is it worth it?
n We believe: Yes!
Statistisches Bundesamt
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
14/09/15 Folie 20
ValiDat - Foundation Your input is needed!
n In this presentation we covered a lot of ground without going into detail
n We would appreciate if you could give us feedback on the methodological findings as stated in the handbook (publicly available by the end of this year!)
n And we would be very happy, if you (or colleagues from your institution) would discuss the implications and proposals made in our ESSnet at a
Workshop in Wiesbaden, November 10 -11!
n .. and help fight the Data Ping-Pong!
© Statistisches Bundesamt, Gruppe C3 – IT-Unterstützung des Geschäftsprozesses
Statistisches Bundesamt
Köszönöm szépen