TDWG- Lisbon Oct 2003
Data Cleaning Tools and Methodologies
Arthur D. Chapman
Australia / Brazil
Centro de Referência em Informação Ambiental
Background
• ERIN/CRIA
• speciesLink
• FAPESP/Biota
Species Data
• Museum/Herbarium
• Observation
• Survey
Data Error
• Names
• Geocode
• Altitude
• Collectors
• Dates
Data quality - fitness for use
Methods for geocode validation
• Internal Database Checks
• Outliers in Geographic Space - GIS
• Outliers in Environmental Space - Models
• Statistical outliers
Internal Database Checks
• Internal inconsistencies
• Checking one field against another
  – Text location vs geocode
• Checking one database against another
  – Gazetteers
  – DEM
  – Collectors
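A minimal Python sketch of an internal consistency check and a field-against-field check. The field names (`lat`, `lon`, `country`) and the bounding boxes are hypothetical and only roughly drawn for illustration; a real check would use a proper gazetteer or polygon layer.

```python
# Rough, illustrative bounding boxes (min_lat, max_lat, min_lon, max_lon).
# These values are assumptions for the example, not gazetteer data.
COUNTRY_BOXES = {
    "Australia": (-44.0, -10.0, 112.0, 154.0),
    "Brazil": (-34.0, 6.0, -74.0, -34.0),
}


def check_record(rec):
    """Return a list of problems found in one specimen record (a dict)."""
    lat, lon = rec.get("lat"), rec.get("lon")
    if lat is None or lon is None:
        return ["missing geocode"]

    problems = []
    # Internal inconsistency: values outside the valid coordinate range.
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")

    # Field against field: stated country vs geocode.
    box = COUNTRY_BOXES.get(rec.get("country"))
    if box and not problems:
        min_lat, max_lat, min_lon, max_lon = box
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            problems.append("geocode outside stated country")
    return problems
```

A record with a dropped minus sign on the latitude, a common transcription error, is caught by the country check even though the coordinates themselves are valid.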
Geographic outliers - GIS
• Country, State, named district, etc.
Geographic Outliers - GIS
• Collectors – location vs date
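One way to check a collector's locations against dates is to compute the travel rate implied by consecutive records. The sketch below is an assumption about how such a check might look: the record layout `(date, lat, lon)` and the default limit of 100 km per day are hypothetical and would need tuning to the collector and the era.

```python
import math
from datetime import date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def suspicious_legs(records, max_km_per_day=100.0):
    """records: list of (date, lat, lon) tuples for one collector.

    Returns pairs of consecutive records whose implied travel rate
    exceeds max_km_per_day (a hypothetical limit).
    """
    ordered = sorted(records, key=lambda r: r[0])
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        days = max((cur[0] - prev[0]).days, 1)  # same-day records count as one day
        km = haversine_km(prev[1], prev[2], cur[1], cur[2])
        if km / days > max_km_per_day:
            flagged.append((prev, cur))
    return flagged
```

A flagged pair does not say which of the two records is wrong; either the date or the geocode of one of them needs checking.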
Environmental Outliers
• Cumulative Frequency Curves
Acacia orites - 19 records - 9 Temperature parameters
[Chart: Temperature (C), scale 0 to 35, for parameters tann, tmncm, tmxwm, tspan, tclq, twmq, twetq, tdryq]
Reverse Jack-knife
Outliers in climate space
T = 0.95(√n) + 0.2
where ‘n’ is the number of records
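The critical value T above can be computed directly. The sketch below covers only that threshold and the comparison against it; how the per-record test statistic is derived is not given on this slide, so `statistic` is a hypothetical input supplied by the caller.

```python
import math


def critical_value(n):
    """Critical value T = 0.95 * sqrt(n) + 0.2 for n records."""
    if n < 1:
        raise ValueError("n must be a positive number of records")
    return 0.95 * math.sqrt(n) + 0.2


def exceeds_critical(statistic, n):
    """True if a precomputed test statistic (assumed input) is above T."""
    return statistic > critical_value(n)
```

For the 19 Acacia orites records, `critical_value(19)` is about 4.34.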
FloraMap
• CIAT (Colombia)
• PCA
• Cluster Analysis
• $US100
• Modelling
• 10-minute grids
Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data.
A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.
Cluster Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Cluster Analysis to identify an outlier in Rauvolfia littoralis specimen data.
A. Cluster Analysis. B. Principal Components Analysis. C. Mapped specimen. D. Climate profile. E. Specimen record.
Diva-GIS
• Free
• Simple GIS
• Modelling (BIOCLIM/Domain)
• Data Cleaning Tools
Diva-GIS – Coordinate Check
Using Diva-GIS to check coordinates by comparing a file of point specimen records (red) against a polygon of Bolivian provinces. Input dialogue box is shown at A, where it can be seen that “STATE” in the point file has been set to the equivalent “DEPARTMENT” in the polygon file (Hijmans et al. 2003).
Points outside Polygon – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing point records that fall outside all polygons in the Bolivian provinces polygon file. The highlighted record shows the link between the results dialogue box and the mapped record.
Mismatched Provinces – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing point records that do not match the set relationships between the specimen point file and the polygon file of Bolivian provinces. The highlighted record is one where the geocoding on the specimen record causes it to fall in the wrong province.
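The two checks shown in these results (points outside all polygons, and points in a polygon other than the one named on the record) can be sketched with a simple ray-casting test. This is an illustration only and not how Diva-GIS is implemented; the record fields and the `provinces` mapping are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray running east from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def classify(record, provinces):
    """record: dict with 'lon', 'lat', 'province'.
    provinces: dict mapping province name -> polygon (hypothetical layout).
    """
    hits = [name for name, poly in provinces.items()
            if point_in_polygon(record["lon"], record["lat"], poly)]
    if not hits:
        return "outside all polygons"
    if record["province"] not in hits:
        return "mismatch: falls in " + hits[0]
    return "ok"
```

Points lying exactly on a shared boundary are not handled specially here; a production check would use a GIS library and allow a tolerance for geocode precision.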
Assign Coordinates – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing point records with geocodes automatically assigned. A. Unambiguous geocodes found by the program and assigned. B. Ambiguous geocodes identified. C. Appropriate geocodes not found.
Multiple possibilities – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing alternate geocodes for a record where use of the Gazetteer has produced a number of credible alternatives.
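The three outcomes of gazetteer-based assignment (unambiguous, ambiguous, not found) can be sketched as a lookup. The gazetteer structure, a dict from lower-cased place name to a list of candidate coordinates, is an assumption for illustration and not the Diva-GIS data format.

```python
def assign_geocode(locality, gazetteer):
    """Look up a locality name in a gazetteer.

    gazetteer: dict mapping lower-cased place name -> list of (lat, lon)
    candidates (hypothetical structure).
    Returns (status, result) where status is 'assigned', 'ambiguous'
    or 'not found'.
    """
    candidates = gazetteer.get(locality.strip().lower(), [])
    if len(candidates) == 1:
        return ("assigned", candidates[0])
    if len(candidates) > 1:
        # Several places share the name: leave the choice to a person.
        return ("ambiguous", candidates)
    return ("not found", None)
```

Ambiguous matches are returned in full so that other fields on the record, such as province or altitude, can be used to choose between them.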
Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5 percentile.
Bioclimatic Envelope – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Bioclimatic Envelope from BIOCLIM to identify outliers in climate space. In this case the percentile cut-off is set at 95. Red points on the envelope correspond with red points on the map; green points on the envelope correspond with yellow points on the map.
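A percentile envelope of this kind can be sketched as follows. The code assumes that a cut-off of 95 means the central 95% of values for each climate variable (bounds at the 2.5 and 97.5 percentiles); that reading, the linear-interpolation percentile and the record layout are assumptions, not the Diva-GIS or BIOCLIM implementation.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in 0..100) of a sorted list."""
    if not sorted_values:
        raise ValueError("no values")
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def outside_envelope(records, cutoff=95.0):
    """records: list of dicts mapping climate variable -> value.

    Returns the indices of records that fall outside the central
    `cutoff` percent of values on at least one variable.
    """
    tail = (100.0 - cutoff) / 2.0
    flagged = set()
    for var in records[0]:
        values = sorted(r[var] for r in records)
        low = percentile(values, tail)
        high = percentile(values, 100.0 - tail)
        for i, r in enumerate(records):
            if r[var] < low or r[var] > high:
                flagged.add(i)
    return sorted(flagged)
```

Because the bounds come from the data itself, the most extreme values are usually flagged; the result is a list of candidates for inspection, not a list of confirmed errors.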
ANUCLIM
• $AUD1000 (with data files)
• Modelling (BIOCLIM / ESOCLIM)
• Cumulative Frequency Curves
• Parameter Extremes
Cumulative Frequency - ANUCLIM
Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the species accumulation curve with an identified outlier (labelled “bad”). Information from the “bad” record is displayed at the top of the log file (from Houlder et al. 2000).
Parameter extremes - ANUCLIM
Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the parameter extremes (top) and associated species accumulation curve (bottom) (from Houlder et al. 2000).
Statistical Tests
• Outliers in Latitude
• Outliers in Altitude
• Outliers in collector's range per day or week
  – Especially 17th, 18th and 19th Century collections
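The slide does not name a particular statistical test. As one possible illustration for a single numeric field such as latitude or altitude, the sketch below uses Tukey's interquartile-range fences; the choice of test and the multiplier `k=1.5` are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the
    first or third quartile (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```

For example, an altitude entered in feet among records in metres, or with an extra digit, would typically stand out under this kind of test.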
Thank You…
Questions?