standard operating protocol (sop) on data quality
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 731065
Project Title: AQUACOSM: Network of Leading European AQUAtic MesoCOSM Facilities Connecting Mountains to Oceans from the Arctic to the Mediterranean
Project number: 731065
Project Acronym: AQUACOSM
Proposal full title: Network of Leading European AQUAtic MesoCOSM Facilities Connecting Mountains to Oceans from the Arctic to the Mediterranean
Type: Research and innovation actions
Work program topics addressed: H2020-INFRAIA-2016-2017: Integrating and opening research infrastructures of European interest
Standard Operating Protocol (SOP) on Data Quality Assurance and Quality Control
Version: V1.0; 29 May 2020
Main Authors: Thomas Davidson, Daphne Buijert-de Gelder, Lisette de Senerpont Domis, Johan Wikner
Abstract: This deliverable is a Standard Operating Protocol (SOP) that describes the methods for data quality assurance and quality control (QA/QC). It defines terms and sets out guidelines for workflow. It then describes practical processes for quality assurance and a range of tests for quality control, including suggestions for flagging systems and data handling.
Keywords: Quality assurance, quality control, flagging
Table of Contents
I. Cross References
II. Dissemination activities related to the Deliverable
1. Data Quality Assurance and Quality Control
1.1 Definitions and terms
1.2 Cross reference
1.3 Health and safety regulation
1.4 Environmental indications
1.5 Quality Assurance and Quality Control Workflow
1.5.1 Quality assurance of raw data collection
1.5.2 Quality Control
1.5.3 Aggregated, summarized data
1.6 References 1 – QA & QC
Co-funded by the European Union D4.1 Standard Operating Procedures| I Cross References
AQUACOSM – INFRA-01-2016-2017- N. 731065 Page 4 of 14
I. Cross References
The SOPs provided by AQUACOSM will be listed here in subsequent versions, once the individual SOPs are
completed.
The SOPs that will be provided by AQUACOSM will be for:
1. Phytoplankton (Deliverable 4.1.1)
2. Zooplankton (Deliverable 4.1.2)
3. Microbial Plankton (Deliverable 4.1.3)
4. Periphyton (Phytobenthos) (Deliverable 4.1.4)
5. Water Chemistry (Physical and Chemical Elements of Water) (Deliverable 4.1.5)
6. High-Frequency Data Collection (Deliverable 4.1.6)
7. QA/QC (Deliverable 4.1.7, this SOP)
A general description for water sampling will be covered under the Water Chemistry SOP.
II. Dissemination activities related to the Deliverable
The SOPs will be made available to all users of Transnational Access (TA) in AQUACOSM, and will also be
publicly available to any user through the AQUACOSM webpage (https://www.aquacosm.eu/project-information/deliverables/).
1. Data Quality Assurance and Quality Control
1.1 Definitions and terms
TERM: DEFINITION(S)
Flags: A system to identify the quality of data which preserves the original data and indicates the degree of manipulation.
Metadata: Contextual information to describe, understand and use a set of data [2].
QA & QC: Quality Assurance and Quality Control, a two-stage process aiming to identify and filter data in order to assure their utility and reliability for a given purpose.
QA: Quality Assurance. Process oriented; encompasses a set of processes, procedures or tests covering planning, implementation, documentation and assessment to ensure that the process generating the data meets a set of defined quality objectives.
QC: Quality Control. Product oriented; consists of technical activities to measure the attributes and performance of a variable to assess whether it passes some pre-defined criteria of quality.
1.2 Cross reference
All other SOPs provided by AQUACOSM in which data are collected should refer to this SOP in relation to QA
procedures.
Materials and Reagents
● Software such as Excel, R, SPSS, SAS, Systat or other statistical programs. QC procedures may also be built
into database functionality.
1.3 Health and safety regulation
Not relevant.
1.4 Environmental indications
Not relevant.
1.5 Quality Assurance and Quality Control Workflow
The purpose of quality assurance and quality control is to ensure the reliability and validity of the information
content of the data. Many QA & QC measures can be undertaken; the crucial requirement common to all of them
is that each step is documented, described and repeatable. Documentation is the key to making
data reliable, valuable [3] and reusable [4]. The figure below describes the recommended workflow for
AQUACOSM data collection.
Figure 1-1: Suggested QA and QC workflow
Based on the quality assurance and control steps the raw data have undergone, we distinguish five data levels:
level 0 - raw data
level 1 - automated QC, large obvious errors removed
level 2 - manual QC
level 3 - gap-filled or interpolated data
level 4 - aggregated and summarized data
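As an illustration, the data level can be stored with each dataset so that its processing state is always explicit. The following is a minimal Python sketch; the names `DataLevel` and `meets_level` are assumptions made for this example and are not part of any AQUACOSM tool.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """Data levels as listed in this SOP (member names are illustrative)."""
    RAW = 0            # level 0 - raw data
    AUTOMATED_QC = 1   # level 1 - automated QC, large obvious errors removed
    MANUAL_QC = 2      # level 2 - manual QC
    GAP_FILLED = 3     # level 3 - gap-filled or interpolated data
    AGGREGATED = 4     # level 4 - aggregated and summarized data


def meets_level(current: DataLevel, required: DataLevel) -> bool:
    """Hypothetical helper: has the dataset reached at least the required level?"""
    return current >= required


# Example: an analysis that requires manually checked data
print(meets_level(DataLevel.AUTOMATED_QC, DataLevel.MANUAL_QC))  # False
```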
1.5.1 Quality assurance of raw data collection
The quality of the data collected in a mesocosm experiment improves when the preparation for data collection
and the mesocosm sampling undergo several quality assurance steps. These steps differ with the type of data
collected; more detailed information on quality assurance for specific sampling procedures can be found in
the SOPs for manual data collection (zooplankton, water chemistry, phytoplankton, periphyton, protozoa,
bacteria, archaea, viruses etc.). Examples, shown in the workflow above, include sensor calibration and
adequate labelling of sampling containers.
In addition to the method-specific quality assurance steps described in the specific SOPs, standard names for
common objects should be defined before the data are collected. This should be done for at least the
following cases:
● Parameter name: use standardized names that describe the content, and describe each parameter in
the metadata [1, 3, 5]. As best practice, use the vocabulary of standardized names developed during
the AQUACOSM project.
● Formats: choose a format for each parameter, describe this format in the metadata and use it
throughout the whole dataset. Important formats to consider are dates, times, spatial coordinates and
significant digits [1, 5].
● Taxonomic nomenclature: follow an international species list.
● Measurement units: use SI units (and the AQUACOSM vocabulary) and document these
units in the metadata [1, 5].
● Codes: a “standardized list of predefined values”. Determine which codes to use, describe the codes
and use them consistently. Every change made to the codes should be documented [1, 5].
● Metadata: data about data, with the goal of helping scientists to understand and use the data [1]. The
mesocosm metadatabase developed in AQUACOSM (link) is currently built on the Ecological Metadata
Language (EML, see Fegraus et al. 2005), specifically adapted to mesocosm data.
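The conventions listed above can be checked automatically before data are entered. The following is a minimal Python sketch under stated assumptions: the parameter dictionary `PARAMETERS`, its names and units, the timestamp format and the function `check_record` are hypothetical placeholders, not the actual AQUACOSM vocabulary or metadatabase.

```python
from datetime import datetime

# Hypothetical parameter dictionary; names and units are placeholders only.
PARAMETERS = {
    "water_temperature": {"unit": "degC"},
    "chlorophyll_a": {"unit": "ug/L"},
}
# One date-time format, documented in the metadata and used throughout the dataset.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def check_record(record: dict) -> list:
    """Return a list of problems found in one record (an empty list means it conforms)."""
    problems = []
    spec = PARAMETERS.get(record.get("parameter"))
    if spec is None:
        problems.append("unknown parameter name")
    elif record.get("unit") != spec["unit"]:
        problems.append("unit differs from the documented unit")
    try:
        datetime.strptime(str(record.get("timestamp", "")), TIMESTAMP_FORMAT)
    except ValueError:
        problems.append("timestamp not in the agreed format")
    return problems


record = {"parameter": "Temp", "unit": "F", "timestamp": "29/05/2020"}
print(check_record(record))
```

A check of this kind only verifies that the agreed conventions are followed; the conventions themselves still need to be described in the metadata.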
Another important point of QA is to assign the responsibility for data quality to a person or persons who have
some experience with QA & QC procedures [1].
1.5.2 Quality Control
Raw or primary data should not be removed or changed unless there is solid evidence that they are erroneous. In
the first instance, questionable data should be flagged according to an international code system (e.g. ICES or
similar). If primary data are altered, the original values must be saved and a justification for the action
added to the same record. It is essential that the raw, unmanipulated form of the data is saved so that any
subsequent procedures performed on the data can be repeated [6]. Instead of removing or deleting data, it is
preferable to flag them during the QA processes and steps; QC can then be carried out by filtering
the data based on the flags before further analysis.
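The principle of flagging rather than deleting can be sketched as follows. This is a minimal Python illustration, not a prescribed implementation: the record layout `Observation` and the helpers `alter` and `filter_by_flags` are assumptions, and the flag codes used in the example are the ICES values listed in Table 1-3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    raw_value: Optional[float]  # original measurement, never overwritten
    value: Optional[float]      # value used in analysis (may be corrected)
    flag: str = "0"             # quality flag, here an ICES code (0 = not checked)
    reason: str = ""            # justification, stored in the same record


def alter(obs: Observation, new_value: float, flag: str, reason: str) -> Observation:
    """Return a new record with an altered value; the raw value is carried over."""
    if not reason:
        raise ValueError("a justification is required when data are altered")
    return Observation(obs.raw_value, new_value, flag, reason)


def filter_by_flags(observations, accepted_flags):
    """Quality control step: keep only observations whose flag is accepted."""
    return [o for o in observations if o.flag in accepted_flags]


a = Observation(7.2, 7.2, "1")       # checked, appears correct
b = Observation(-99.0, -99.0, "4")   # checked, appears to be wrong
c = alter(b, 7.0, "5", "sensor fault noted in log book; value interpolated")
print(filter_by_flags([a, b, c], {"1", "5"}))
```

Because the raw value and the justification travel with each record, any later processing step can be repeated or reversed.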
The general QC checks that should be done are described in detail in [7]. They can be carried out manually or
automatically, the latter being more applicable to high-frequency data:
● Gap or missing value check: do you have all the expected results?
● Control samples: Are they within expected range and variability?
● Calibration curves: Are coefficients within expected range and variability?
● Spurious / impossible results check: do you have negative results or results with an unreasonable
magnitude?
● Outlier check: do the results fall outside an expected distribution of the data?
● Range check: do the calculated values fall within the expected range and variability?
● Climatology check: are the results reasonable compared with historical results (range and patterns)?
● Neighbour check: are the results reasonable compared to results of the same site, same day or
different depths?
● Seasonality check: do results reflect seasonal processes or effects or are they extraordinarily
different?
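Three of the checks listed above (gap check, impossible results check and outlier check) can be sketched in Python as follows. The function names, the default thresholds and the use of the median absolute deviation (MAD) for the outlier check are assumptions chosen for illustration; reference [7] should be consulted for the checks actually recommended.

```python
from datetime import datetime, timedelta
from statistics import median


def gap_check(timestamps, start, end, step):
    """Return the expected sampling times that are missing from `timestamps`."""
    present = set(timestamps)
    missing = []
    t = start
    while t <= end:
        if t not in present:
            missing.append(t)
        t += step
    return missing


def impossible_check(values, lower=0.0):
    """Return indices of values below a physical lower limit (e.g. negative concentrations)."""
    return [i for i, v in enumerate(values) if v < lower]


def outlier_check(values, k=3.0):
    """Return indices of values more than k scaled MADs away from the median."""
    med = median(values)
    scaled_mad = 1.4826 * median([abs(v - med) for v in values])
    if scaled_mad == 0:
        return []  # no spread in the data, so the test cannot be applied
    return [i for i, v in enumerate(values) if abs(v - med) > k * scaled_mad]


t0 = datetime(2020, 5, 29, 0, 0)
times = [t0, t0 + timedelta(hours=1), t0 + timedelta(hours=3)]
print(gap_check(times, t0, t0 + timedelta(hours=3), timedelta(hours=1)))
print(outlier_check([5.1, 5.0, 5.2, 4.9, 5.1, 12.0]))
```

The indices returned by such checks are candidates for flagging and evaluation, not for automatic deletion.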
For a sensor network, QC should include [6]:
● Date and time: check if each data point has the right date and time.
● Range: check if data fall within established upper and lower bounds.
● Persistence: check if the same value is recorded repeatedly; this can indicate problems such as a sensor
error or system failure.
● Change in slope: check the change in slope to see if the rate of change is realistic for the type of data
collected.
● Internal consistency: evaluate differences between related parameters.
● Spatial consistency: check replicate sensors or compare the sensors with identical sensors from
another site.
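Several of the sensor checks listed above can be expressed in a few lines of code. The following Python sketch is illustrative only: the function names and thresholds are assumptions, times are taken as seconds since the start of the series, and the internal consistency example assumes a pair of parameters in which one is a fraction of the other (e.g. a dissolved fraction that cannot exceed the total).

```python
def datetime_check(timestamps):
    """Indices where a timestamp does not increase strictly from the previous one."""
    return [i for i in range(1, len(timestamps)) if timestamps[i] <= timestamps[i - 1]]


def persistence_check(values, max_repeats=3):
    """Indices of readings that extend a run of identical values beyond max_repeats."""
    flagged, run = [], 1
    for i in range(1, len(values)):
        run = run + 1 if values[i] == values[i - 1] else 1
        if run > max_repeats:
            flagged.append(i)
    return flagged


def slope_check(times, values, max_rate):
    """Indices where the absolute change per unit time from the previous reading exceeds max_rate."""
    flagged = []
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        if dt > 0 and abs(values[i] - values[i - 1]) / dt > max_rate:
            flagged.append(i)
    return flagged


def internal_consistency_check(part, total):
    """Indices where a component exceeds the total it belongs to."""
    return [i for i, (p, t) in enumerate(zip(part, total)) if p > t]


print(persistence_check([1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0]))
print(slope_check([0, 60, 120, 180], [10.0, 10.1, 15.0, 10.2], max_rate=0.01))
```

Note that the slope check flags both the jump into a spike and the return from it, so flagged points should be reviewed before any value is altered.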
After the QC check, data points that did not pass should be evaluated. Compare them with other
variables, and check calibration curves and other indicators of sampling and instrument performance. Check log
books for comments by the responsible person. Only when there is evidence of sampling error, contamination or
instrument failure should a data point be removed or replaced, for example with an interpolated value.
The decision and its justification must be documented and the data point flagged accordingly.
1.5.2.1 Flagging
To establish successful QC procedures, it is crucial to flag the data in order to explain differences
between the raw data and the processed data. “Flags or qualifiers convey information about individual data
values, typically using codes that are stored in a separate field to correspond with each value” [6].
We recommend using the flag values suggested by Hook et al. [5], as they make it possible to retrieve the QC
steps the data have undergone.
Comment: other possible flagging systems are those of Campbell et al. [6], the Marine Waters website [8], the
QARTOD flagging system [9], the IODE flagging system as documented by Reiner Schlitzer [10] and protocols
produced by EU infrastructure projects such as the Copernicus Marine Environment Monitoring Service (CMEMS) [11].
Table 1-1 suggests a flagging system that may be useful for mesocosm applications. It is more complicated than
some systems, but it contains information on why a data point is questionable and allows the data to be
filtered with different stringency of quality control.
Table 1-1: Recommended flag values [5] for the AQUACOSM project
Flag Value Description
V0 Valid value
V1 Valid value but comprised wholly or partially of below detection limit data
V2 Valid estimated value
V3 Valid interpolated value
V4 Valid value despite failing to meet some QC or statistical criteria
V5 Invalid value but flagged because of possible contamination (e.g., pollution source, laboratory contamination source)
V6 Invalid value due to non-standard sampling conditions (e.g., instrument malfunction, sample handling)
V7 Valid value but set equal to the detection limit (DL) because the measured value was below the DL
M1 Missing value because not measured
M2 Missing value because invalidated by data originator
H1 Historical data that have not been assessed or validated
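Filtering with different stringency, using the flag values of Table 1-1, might look as follows in Python. The flag codes are those of the table; the three groupings (`STRICT`, `MODERATE`, `LENIENT`) and the function `select` are assumptions made for illustration, and each project should decide and document its own groupings.

```python
# Flag codes are taken from Table 1-1; the groupings are illustrative assumptions.
STRICT = {"V0"}                                  # only fully valid values
MODERATE = STRICT | {"V1", "V2", "V3", "V7"}     # also estimated, interpolated or DL-based values
LENIENT = MODERATE | {"V4"}                      # also values that failed some QC criteria


def select(records, accepted):
    """Keep the (value, flag) pairs whose flag belongs to the accepted set."""
    return [(value, flag) for value, flag in records if flag in accepted]


records = [(1.2, "V0"), (0.05, "V7"), (3.3, "V4"), (None, "M1"), (9.9, "V5")]
print(select(records, STRICT))    # [(1.2, 'V0')]
print(select(records, LENIENT))   # [(1.2, 'V0'), (0.05, 'V7'), (3.3, 'V4')]
```

The same dataset can thus serve analyses with different quality requirements without any value being deleted.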
Table 1-2: The ARGO data quality flagging system
Flag Value Description
0 No data quality control on data
1 Data passed all tests
2 Data probably good
3 Data probably bad. Failed minor tests
4 Data bad. Failed major tests
7 Averaged value
8 Interpolated value
9 Missing data
Table 1-3: ICES Data Quality Flag
Flag Value Description
0 data are not checked
1 data are checked and appear correct
2 data are checked and appear inconsistent but correct
3 data are checked and appear doubtful
4 data are checked and appear to be wrong
5 data are checked and the value has been altered
1.5.2.2 Automated QC
The large increase in the number of studies using multiple sensors recording at high frequency has generated
large volumes of data, making manual QC extremely difficult and undesirable. Visual checks of such data are
relatively straightforward but time-consuming, and a number of automated methods have therefore been
developed for quality control. Ideally these are integrated into database functionality, for example through
R or Python code adapted to the particular dataset. These QC steps can be applied in real time or after data
collection. Real-time quality control (RTQC) has the advantage that an alarm system can be integrated to
indicate suspect values, providing an early warning of sub-optimal sensor performance.
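A real-time check with an alarm can be sketched as a generator that labels each reading as it arrives. This is a minimal Python illustration under stated assumptions: the function `realtime_qc`, the labels "pass" and "suspect", the bounds and the rule of raising the alarm after three consecutive suspect readings are all hypothetical choices.

```python
def realtime_qc(stream, lower, upper, on_alarm, max_consecutive=3):
    """Yield (value, label) pairs as readings arrive.

    `on_alarm` is called once a run of suspect readings reaches max_consecutive.
    """
    suspect_run = 0
    for value in stream:
        ok = lower <= value <= upper
        suspect_run = 0 if ok else suspect_run + 1
        if suspect_run == max_consecutive:
            on_alarm(f"{max_consecutive} consecutive readings outside [{lower}, {upper}]")
        yield value, ("pass" if ok else "suspect")


alarms = []
readings = [5, 6, 50, 51, 52, 7]
for value, label in realtime_qc(readings, 0, 20, alarms.append):
    print(value, label)
print(alarms)
```

In practice `on_alarm` could send a notification to the person responsible for the sensor; here it simply collects the messages in a list.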
Automated methods apply many of the same checks as manual QC, in particular:
● Date and time: checks if each data point has the right date and time.
● Range: checks if data fall within established upper and lower bounds. Range tests can be divided into global
range tests (possible distribution) and local range tests (probable distribution).
● Persistence or frozen value test: checks if the same value is recorded repeatedly; this can
indicate problems such as a sensor error or system failure.
● Change in slope or spike test: checks the change in slope to see if the rate of change is realistic
for the type of data collected.
● Internal consistency: evaluates differences between related parameters.
● Spatial consistency: checks replicate sensors or compares the sensors with identical sensors from
another site.
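The distinction between global and local range tests, and a simple spatial consistency test, can be sketched in Python as follows. The function names, the labels "good", "suspect" and "bad", and the example bounds are assumptions for illustration; appropriate bounds depend on the parameter and the site.

```python
def range_test(value, global_bounds, local_bounds):
    """Label a value using a global (possible) and a local (probable) range.

    Outside the global range the value is 'bad'; inside the global range but
    outside the local range it is 'suspect'; otherwise it is 'good'.
    """
    g_lo, g_hi = global_bounds
    l_lo, l_hi = local_bounds
    if not g_lo <= value <= g_hi:
        return "bad"
    if not l_lo <= value <= l_hi:
        return "suspect"
    return "good"


def spatial_consistency(sensor_a, sensor_b, tolerance):
    """Indices where two replicate sensors differ by more than the tolerance."""
    return [i for i, (a, b) in enumerate(zip(sensor_a, sensor_b)) if abs(a - b) > tolerance]


# Example bounds for water temperature in degrees Celsius (illustrative values only)
for temperature in (12.0, 30.0, 60.0):
    print(temperature, range_test(temperature, (-2.0, 40.0), (5.0, 25.0)))
print(spatial_consistency([10.0, 10.2, 10.1], [10.1, 12.0, 10.0], tolerance=0.5))
```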
Parameters measured by autonomous sensors can vary on a range of scales, and a number of sensor types,
in particular optical sensors, can be affected by non-negligible noise. A number of methods can be used to
identify outliers or spurious data, which can then be flagged accordingly and filtered out of
subsequent analysis. One approach [11] has been to apply a procedure that tests the statistical entropy caused
by each progressive measurement, using a 2-step estimation of the Akaike information
criterion; details can be found in [12]. This approach to outlier or spike identification is highly dependent on
the number of sample points considered in the estimate of statistical entropy. Including too few
samples risks making the test too sensitive, so that ‘good’ data are excluded. Including too
many samples may make the test too insensitive (Figure 1-2). In addition, the designation
of an observation as an outlier depends on the selection of a cut-off or critical value of variation
above which a value is designated as an outlier. The selection of these two parameters for the test
depends upon the type of data being collected. For example, the data generated by the Ferrybox [11] are
collected 400 m apart, and in this case it was necessary to relax the criteria used for identifying outliers, as
the natural variation between samples 400 m apart is greater than that between samples measured at the same
location.
A similar approach to the identification of outliers in sensor data has been developed at the University of
Waikato, New Zealand (https://www.lernz.co.nz/tools-and-resources/b3) and is now also available as an R
package (https://github.com/kohjim/rB3), under the aegis of the GLEON network. B3 is a freely available,
downloadable programme that can be used for data processing after data collection. It is an integrated
programme that can carry out many of the QC steps outlined above, e.g. range check, missing
data and repeated values. It also identifies outliers or spikes by two methods. The first is similar to the
method described above: a running mean is calculated, and observations that fall outside a critical
standard deviation are identified as potential outliers. Both the period over which the running mean is
calculated and the cut-off value of the standard deviation can be altered to suit the dataset in question. The
second method uses a rate of change (ROC) analysis to identify jumps in the data that are outside the normal
range. The number of data points used to identify the ‘normal’ range can be altered, as can the critical rate
of change above which a value is deemed a potential outlier. Each of these methods has different
sensitivities to outliers, depending on the kind of data analysed, and there is a balance to strike between
excluding ‘good’ data and including ‘bad’ data. For example, with data that have a large diurnal range
(e.g. dissolved oxygen (DO) data), running mean methods may exclude good data at the high and low ends of the
diurnal cycle. Thus, it may be necessary to tailor the cut-off values for ROC analysis and running
mean analysis for each dataset, or even each mesocosm, and some form of manual QC is highly
recommended.
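The two ideas described above, a running mean with a standard deviation cut-off and a rate of change test, can be sketched in Python as follows. This is not the B3 or rB3 implementation: the function names, the use of a trailing window, and the default window length and cut-off values are assumptions chosen to show the principle and the two tunable parameters.

```python
from statistics import mean, stdev


def running_mean_outliers(values, window=5, n_sd=3.0):
    """Indices of values more than n_sd standard deviations from the mean
    of the preceding `window` values (trailing window, an assumption)."""
    flagged = []
    for i in range(window, len(values)):
        reference = values[i - window:i]
        m, sd = mean(reference), stdev(reference)
        if sd > 0 and abs(values[i] - m) > n_sd * sd:
            flagged.append(i)
    return flagged


def rate_of_change_outliers(values, window=5, factor=3.0):
    """Indices where the absolute step from the previous value exceeds
    `factor` times the mean absolute step over the preceding `window` steps."""
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    flagged = []
    for i in range(window, len(steps)):
        typical = mean(steps[i - window:i])
        if typical > 0 and steps[i] > factor * typical:
            flagged.append(i + 1)  # steps[i] is the jump into values[i + 1]
    return flagged


series = [8.0, 8.1, 8.0, 8.2, 8.1, 8.0, 12.0, 8.1, 8.2]
print(running_mean_outliers(series))     # [6]
print(rate_of_change_outliers(series))   # [6, 7]
```

In this example the rate of change test flags both the spike (index 6) and the good value that follows it (index 7), because the return from the spike is itself a large jump. This illustrates why the cut-off values need tailoring and why manual review of flagged points is recommended.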
Figure 1-2: Ferry box turbidity data with spike point analysis described in section 7.2.2 from [11,12] showing the effects of including different numbers of observations (5, 10 and 100). The red points are identified as outliers and n=5 and n=10 appear too sensitive and exclude a large amount of good data.
1.5.3 Aggregated, summarized data
As a final step, data can be aggregated and summarized for publication purposes. This includes integration
over, for example, depth (space) and time. It should be remembered, however, that integration reduces the
degrees of freedom, which may limit the statistical tests that can be performed and possibly the statistical
power. Aggregation is preferably done in a database environment with the integrating function defined in a
single place. This function should be validated by manual calculation with selected data from the same set.
The start and end values for the range of integration should be clearly defined, as should the methods for
inter- and extrapolation where applicable.
The number of samples included in the integrated value (n) and its standard deviation (±SD) can be provided to
assess the extent of data behind the aggregated value.
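As an illustration of a single, validated aggregation function, the following Python sketch integrates a profile over depth and reports the mean together with n and SD. The trapezoidal rule is used here as one possible integration method, not one prescribed by this SOP, and the function names `depth_integrate` and `summarize` are assumptions.

```python
from statistics import mean, stdev


def depth_integrate(depths, values):
    """Trapezoidal integration of values over depth.

    `depths` must be increasing; the first and last depths define the
    start and end of the integration range.
    """
    total = 0.0
    for i in range(1, len(depths)):
        total += 0.5 * (values[i] + values[i - 1]) * (depths[i] - depths[i - 1])
    return total


def summarize(values):
    """Return the mean with the number of samples (n) and the sample standard deviation (SD)."""
    n = len(values)
    return {"mean": mean(values), "sd": stdev(values) if n > 1 else None, "n": n}


# Manual validation: 0.5*(2+4)*1 + 0.5*(4+4)*1 + 0.5*(4+2)*2 = 3 + 4 + 6 = 13
print(depth_integrate([0, 1, 2, 4], [2.0, 4.0, 4.0, 2.0]))  # 13.0
print(summarize([2.0, 4.0, 6.0]))
```

The manual calculation in the comment is the kind of check recommended above for validating the integrating function with selected data.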
1.6 References 1 – QA & QC
1. Michener, W.K. and M.B. Jones, Ecoinformatics: supporting ecology as a data-intensive science. Trends in Ecology & Evolution, 2012. 27(2): p. 85-93.
2. Jones, M.B., et al., The New Bioinformatics: Integrating Ecological Data from the Gene to the Biosphere. Annual Review of Ecology, Evolution, and Systematics, 2006. 37(1): p. 519-544.
3. Rüegg, J., et al., Completing the data life cycle: using information management in macrosystems ecology research. Frontiers in Ecology and the Environment, 2014. 12(1): p. 24-30.
4. Michener, W.K., Ecological data sharing. Ecological Informatics, 2015. 29: p. 33-44.
5. Hook, L.A., et al., Best practices for preparing environmental data sets to share and archive. 2010, Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. p. 40.
6. Campbell, J.L., et al., Quantity is Nothing without Quality: Automated QA/QC for Streaming Environmental Sensor Data. BioScience, 2013. 63(7): p. 574-585.
7. Bos, J., C. Krembs, and W.R. Kammin, EAP088 Marine Waters Data Quality Assurance and Quality Control V 1.0 5/30/2015. 2015: p. 35.
8. Department of Ecology - State of Washington. Marine Waters Data Quality Codes. [webpage] 2017 [cited 2017 October 17]; Available from: http://www.ecy.wa.gov/programs/eap/mar_wat/datacodes.html.
9. Integrated Ocean Observing System. Manual for Real-Time Oceanographic Data Quality Control Flags. 2017 [cited 2017 October]; Available from: https://ioos.noaa.gov/wp-content/uploads/2017/06/QARTOD-Data-Flags-Manual_Final_version1.1.pdf.
10. Schlitzer, R. Oceanographic quality flag schemes and mappings between them. 2013-05-24 [cited 2017 October]; version 1.4. Available from: https://odv.awi.de/fileadmin/user_upload/odv/misc/ODV4_QualityFlagSets.pdf.
11. Jaccard Pierre, Hjemann Dag Oystein, Ruohola Jani, Ledang Anna Birgitta, Marty Sabine, Kristiansen Trond, Kaitala Seppo, Mangin Antoine (2018). Quality Control of Biogeochemical Measurements. CMEMS-INS-BGC-QC. https://doi.org/10.13155/36232
12. Ueda, T. 2009. A simple method for the detection of outliers. Electronic Journal of Applied Statistical Analysis 2:67-76.