bwc technical team 24 july 2007. background carbon-climate researchers from around the world have...
Post on 28-Mar-2015
212 Views
Preview:
TRANSCRIPT
bwc technical team24 July 2007
BackgroundCarbon-climate researchers from
around the world have contributed data toward global scale synthesis analyses. Over 700 site-years of data available
from over 200 sites around the world. The carbon flux and meteorological
data are being processed to assign quality, gap-fill, and derive additional scientific data products.
To be eligible for this processing, the data must meet certain minimal criteria.
This data qualification has proved to be a significant part of the effort necessary to process the data.
Can a database and/or datacube help?
The existing qualification processing uses a combination of scripts, MatLab plots and direct data viewing of each site-year.
We think that running a few database queries and simple data cube visualizations could reduce that effort.
Goal is to identify the most obvious problem site-years well enough that direct data viewing would be unnecessary.
Review by a scientist will still be necessary for the more subtle problems
Let’s try….To test this idea, we ran some
simple queries and data cube reports on the L2 data available on the Ameriflux web site in May 2007. 372 site-years from 84 sites in the
Americas The database contains the same
half-hourly flux-met data used for the Fluxnet qualification processing.
The datacube contains daily values derived from the half-hourly data.
Each of the quality criteria we understand was considered in turn.
ftp://cdiac.ornl.gov/pub/ameriflux/data/Level2/standardized_files/with_gaps/
The following variables are mandatory. CO2 flux (FC) should not be gap-filled or ustar filtered
CO2 concentration at the top of the tower (CO2_top) and/or storage flux (SFC or SCO2); not required for grassland or cropland sites
Ustar (UST) or momentum (TAU)
Global incoming radiation (RG_in) or photosynthetic radiation (PPFD or PAR_in)
Air temperature (TA)
Water vapour concentration (H2O) and/or relative humidity (RH
Roughly 40% of the site-years do not meet these criteria
FC reporting by site by year
Plot shows fraction of half-hourly FC data reported for site-year by site Values of 1 (no gaps) likely are investigator gap filled (instruments fail!). Values around .5 are likely hourly rather than half-hourly reporting. Values of .25 are likely too sparse to support synthesis analysis
Of 372 potential site-years of data, only 285 have .3 < FC reporting < .999
Required variable reporting by site by year Each row corresponds to one
site-year Each cell corresponds to one
site year of (FC, CO2 or SCO2, UST, PAR or Rg, TA, and Rh or H2O).
Color indicates: Red – likely not enough for
processing - % < .3 reported (roughly less than 5K of 17.5K)
Green – likely enough for processing .3<%<.999
Yellow – may not be good for processing due to gap-filling - % > .999
Red CO2 (second column) can be ignored for cropland/grassland sites
Sites shown are just a sampleOf the 285 site years with good FC, 50 site years are missing one of (UST, PAR/Rg, and TA) and 79 sites have likely gap-filled data.
Air (TA) and Soil (TS) temperatures are reported in degrees CelciusRadiation is measured in W/m2 (in)Relative humidity (RH) and soil water content (SWC) are reported in % (0:100)Air pressure (PRESS) is measured in kPA
Obviously bad annual averages
Data cube used to browse average yearly Rg values across all site-years16 additional likely problematic site-years at 5 sites
Drill down to consistent (bad) daily values
Data cube used to browse 2005 Rg values shows consistently high reporting (not just a few very large spikes) at Duke Forest sites
Too many values out of range (TS)
Database used to find sites with TA or TS outside the range of -30 °C to 50°C more than 200 times per yearData cube used to browse daily values of offending and nearby sites
Too many values out of range (Rg)
Database used to find sites with Rg > 1100 W/m2 more than 200 times per yearData cube used to browse daily values in MJ/m2/d of offending and nearby site
Locates site years with consistently bad as well as highly variable reporting
Incorrect minimum values
Data cube used to browse annual minimum Rg values looking for site-years during which Rg does not return to zero at night
Incorrect units and/or sign
FC at each site averaged over all reported years plotted by day of year
Drill down to Mead sites
Daily average FC at each site shows likely units and/or sign issues
Drill down to other potential problems
Four additional sites also have like reporting/units problems, but not across all reported years
CO2 flux (FC) measurements should not be “discrete”Differences between successive measurements should vary
There should be a wide distribution in the deltas between successive values. Suggests lack of instrumentation precision, errors when converting units, or other data pre-processing error.
Flux should not remain constant across many measurementsSuccessive values should change.Suggests problems with instrumentation, bad gap-filling
Simple check: the same FC value reported too often
Database query returns count by (site, year, value) where count >= 500
Graph shows sum of (returned counts) by site and year
Sites with high sums likely report very few unique FC values
Better check: the same difference between successive FC values reported too often
Database query returns count by (site, year, value) where count >= 500
Graph shows sum of (returned counts) by site and year
Sites with high deltas likely discrete
Best check: count runs of constant values
Database query returns (site, year, run start time, run length)Graph shows sum of (count(runs > 60 minuts)) by site and
year
Sites with many runs VERY likely discrete (or poorly filled)
Drill down on Duke Forest-loblolly pine
Many long runs that tend to start at dusk (17:30) and run through the night
Year 1998 1999 2000 2001 2002 2003 2004 2005
Number of values with repeat >= 10 11 28 10 11 15 12 12 12
% Data due to repeated values 0.45 0.45 0.46 0.47 0.45 0.44 0.43 0.45
Count of runs with a specific valueby length of run in minutes
Drill down on Shidler
Long constant runs are not the problemThe same values appear too frequently
Year 1997 1998 1999 2000Number of values with repeat >= 10 110 53 149 167% Data due to repeated values 0.29 0.99 0.99 0.26Number of values with repeat >= 100 0 11 32 26% Data due to repeated values 0.22 0.76 0.7 0.25
Count of runs with a specific valueby length of run in minutes
What we learned
Our observations Data availability is the first requirement Sensor, bias, units, and calculation issues can often be identified
by comparison with other data Running fast, simple queries that can rapidly find sites which
clearly do not meet specific simple criteria seems viable The specific queries we used can be generalized as stored
procedures and applied to different measurement types The data availability, units, and discrete data checks can be run in
roughly one hour over the existing 285 site-years of data These (and more?) checks could be run at data upload to the
ORNL site Catching data reporting errors and missing data early in the
process is always best Gives reporting scientists rapid feedback on potential for wider
data usage The results could be tabulated on the ORNL web site
This enables scientists without intimate knowledge of the data from various sites to know more about the data before downloading the actual data and repeating this sort of data quality check
All of this is essential to curating a usable scientific data set
Technical Details and SQL queries
Spot sites with high TA/TS valuesselect * from (select (select name from site where site.siteid = data.siteid) as sitename, (select shortname from datumtype where data.datumid = datumtype.datumid) as variablename, exdatumid, offsetid, datepart(yy,time) as yearOfMeasurement, count(*) as badcountfrom data where datumid in (select datumid from datumtype where shortname = 'TS' orshortname = 'TA')anddata.value > 50. group by siteid, datumid, exdatumid, offsetid, datepart(yy,time)) foowhere badcount > 200order by sitename, variablename, yearOfMeasurement, exdatumid, offsetid, badcount
La Selva TS 1 1 1997 214Sky Oaks-Old Stand TS 1 1 2000 549Sky Oaks-Old Stand TS 1 1 2001 511Sky Oaks-Old Stand TS 1 1 2002 720Sky Oaks-Old Stand TS 1 46 2002 506Sky Oaks-Old Stand TS 1 67 2002 274Sky Oaks-Old Stand TS 1 46 2003 345Sky Oaks-Old Stand TS 1 67 2003 412Sky Oaks-Young Stand TS 1 26 2004 423Sky Oaks-Young Stand TS 1 1 2006 881
Spot runs of constant valuesTemporary table used for
simplicity and speed (4.6M rows)
Table generation takes about 10 minute
insert into fcboomCvIselect siteid, datumid, value, time, exdatumid, offsetid from data as dwhere repeatid = 0and d.sitesetid in (select sitesetid from siteset where description = 'L2 ORNL May 2007')and d.datumid = (select datumid from datumtype where shortname = 'FC')and d.exdatumid in (select exdatumid from exdatumtype where debris = '(none)' or debris = '_unadj')
insert into fcboomCvIselect siteid, datumid, value, time, exdatumid, offsetid from data as dwhere repeatid = 0and d.sitesetid in (select Sitesetid from siteset where description = 'L2 ORNL May 2007')and d.datumid = (select datumid from datumtype where shortname = 'Other')and d.exdatumid in (select exdatumid from exdatumtype where debris like 'FC%'
Identify gaps and bin deltasnew30 column holds value
of next 30 minute measurement (if any)
Takes about 2 minutes
update fcboomCvI set new30 = (select value from fcboomCvI as twowhere one.siteid = two.siteidand one.datumid = two.datumidand one.exdatumid = two.exdatumidand one.offsetid = two.offsetidand two.time = dateadd(mi, 30, one.time)), new60 = (select value from fcboomCvI as threewhere one.siteid = three.siteidand one.datumid = three.datumidand one.exdatumid = three.exdatumidand one.offsetid = three.offsetidand three.time = dateadd(mi, 60, one.time))from fcboomCvI as one
Difference between successive measurements binned to look for common values
Takes about 2 minues
select (select name from site where foo.siteid = site.siteid)
as sitename,datumid, delta30, SiteYear, exdatumid, offsetid,valuecountfrom (select siteid, datumid, (value - new30) as delta30,
datepart(yy,time) as siteYear, exdatumid, offsetid, count(*) as valuecountfrom fcboomCvI where new30 is not null group by siteid, datumid, (value - new30),
datepart(yy,time),exdatumid, offsetid) foowhere valuecount > 499
Spot runs of constant valuesTag first measurement in a
run of constant values with 1 and any additional measurements with 0
Takes about 2 minutes
update fcboomcviset run30 = CASE WHEN one.value = (select value from fcboomCvI
as two where one.siteid = two.siteid and one.datumid = two.datumid and one.exdatumid = two.exdatumid and one.offsetid = two.offsetid and two.time = dateadd(mi, -30, one.time)) THEN 0 ELSE 1END from fcboomCvI as onewhere one.new30 is not NULL and one.new30 = one.value
Find all runs for export to spreadsheet
Takes about 10 minutesselect (select name from site where mum.siteid =
site.siteid) as sitename, datumid, value, time1, time2, datediff (mi, time1,
time2) as runtime, exdatumid, offsetid, running30from (select siteid, datumid, value, min(starttime) as
time1, max(endtime) as time2, exdatumid, offsetid, running30from (select one.siteid, one.datumid, one.value, one.time as starttime, one.time as endtime,one.exdatumid, one.offsetid, one.new30,one.run30, (select sum(two.run30) from fcboomcvi as
twowhere one.siteid = two.siteidand one.datumid = two.datumidand one.exdatumid = two.exdatumidand one.offsetid = two.offsetidand two.run30 is not nulland two.time <= one.time) as running30from fcboomcvi as onewhere one.run30 is not null) foo group by running30, siteid, datumid, value, exdatumid, offsetid) mum where time2 <> time1
top related