Experiences of an Earth Science Data User
Confessions of a Data Hoarder
Rob Carver, The Weather Company
“Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”
–Andrew S. Tanenbaum
Open Data and The Weather Company
❖ Our business model is taking open data and using it to tell interesting stories that engage our users.
❖ Over the years, we’ve archived over 100 TB of data
❖ Formats: GRIB1, GRIB2, NIDS, shapefiles, netCDF, HDF5
❖ Sources: NWS/NCEP, NCDC, FEMA, Census Bureau, NASA DAACs
100+ TB of Weather Models
❖ Most data arrives through Unidata’s LDM and FTP pull scripts. ECMWF pushes data to our FTP site. (All GRIB2/1)
❖ Data is ingested into the forecast system, and GrADS handles the model visualization
❖ Archived to local disk arrays and Amazon S3
Level-III NIDS Archive
❖ NCDC maintains an archive of the WSR-88D radar network’s products from 1995 to present (>10 TB)
❖ Datasets are ordered from a tape-based archive
❖ It took two years to acquire using a set of PHP scripts
❖ Acquiring the entire archive was easier than figuring out which subset to acquire
❖ We already had a NIDS parser for visualization
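As a rough illustration of why the Tanenbaum quote applies here, the average transfer rate implied by the figures above can be worked out in a few lines. This is only back-of-the-envelope arithmetic: it assumes decimal terabytes and 365-day years, and because the slide gives the archive size as a lower bound (>10 TB), the result is a lower bound too.

```python
# Back-of-the-envelope throughput for the Level-III NIDS acquisition.
# Assumptions (not stated on the slides): decimal terabytes, 365-day years.
ARCHIVE_BYTES = 10 * 10**12            # ">10 TB", so this is a lower bound
SECONDS = 2 * 365 * 24 * 60 * 60       # "two years to acquire"


def effective_rate_mbit_per_s(total_bytes: int, seconds: int) -> float:
    """Average transfer rate in megabits per second."""
    return total_bytes * 8 / seconds / 1e6


rate = effective_rate_mbit_per_s(ARCHIVE_BYTES, SECONDS)
print(f"{rate:.2f} Mbit/s")
```

Under those assumptions the average works out to a bit under 1.3 Mbit/s, which is the kind of rate a box of tapes or disks in the mail would comfortably beat.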
FEMA Flood Maps
❖ Data Acquisition Method: DVD for each state
❖ Format: ESRI Shapefiles (1 shapefile of a feature class per state)
❖ Data Display: Split state shapefiles by county and then pre-render tiles for moderate to coarse zoom levels on a map mashup.
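The pre-rendering step can be sketched as follows. The slides do not say which tiling scheme the map mashup used, so this example assumes the common XYZ ("slippy map") Web Mercator scheme; the function names and the bounding-box input are hypothetical and only illustrate how one might enumerate the tiles to render for a single county.

```python
import math


def lonlat_to_tile(lon_deg: float, lat_deg: float, zoom: int) -> tuple[int, int]:
    """Return the XYZ tile (x, y) containing a lon/lat point (assumed scheme)."""
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west: float, south: float, east: float, north: float,
                   zoom: int) -> list[tuple[int, int, int]]:
    """List every (zoom, x, y) tile that overlaps a bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)   # north-west corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)   # south-east corner
    return [(zoom, x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]


# Hypothetical usage: tiles covering one county's bounding box at coarse zooms.
county_bbox = (-85.0, 33.0, -84.0, 34.0)  # west, south, east, north (made up)
for z in range(4, 9):
    print(z, len(tiles_for_bbox(*county_bbox, z)))
```

Because the tile count roughly quadruples with each zoom level, limiting pre-rendering to moderate and coarse zooms keeps the rendered set small.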
Suggestions
❖ Data in a difficult or proprietary format just wastes disk space
❖ Please use data formats that are well supported by open-source software packages (e.g., GDAL/OGR)
❖ netCDF, TIFF, ESRI shapefiles, HDF5, GeoJSON
❖ Instead of complex CSV or fixed-width text files, use self-describing formats (JSON, XML, SQLite)
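To make the last suggestion concrete, here is a minimal sketch contrasting a fixed-width record with a self-describing JSON one. The station record, column layout, and field names are all made up for illustration; the point is only that the fixed-width reader needs out-of-band knowledge of the layout, while the JSON record carries its own field names and units.

```python
import json

# Hypothetical fixed-width record: the column layout lives only in the parser.
FIXED_WIDTH_LINE = "KATL 2013010112  283  1013"


def parse_fixed_width(line: str) -> dict:
    """Parse the made-up layout above; offsets must match the producer exactly."""
    return {
        "station": line[0:4].strip(),
        "valid_time": line[5:15].strip(),
        "temperature_k": int(line[15:20]),
        "pressure_hpa": int(line[20:26]),
    }


# The same (made-up) observation in a self-describing form.
record = {
    "station": "KATL",
    "valid_time": "2013-01-01T12:00:00Z",
    "temperature": {"value": 283, "units": "K"},
    "pressure": {"value": 1013, "units": "hPa"},
}

encoded = json.dumps(record, indent=2)
decoded = json.loads(encoded)  # any JSON reader recovers names and units

print(parse_fixed_width(FIXED_WIDTH_LINE))
print(decoded["temperature"])
```

If the producer ever widens a column, the fixed-width parser silently breaks, whereas a consumer of the JSON record keeps working as long as the field names are stable.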
Suggestions (cont.)
❖ Data and navigation files should use the same naming conventions and sequences
❖ Don’t use overly large archive files
❖ Data pools and FTP servers attached to large disk arrays are awesome data providers (as long as limits are in place)
❖ For really large, static datasets (>10 GB), BitTorrent would be really useful