confessions of a data hoarder

Experiences of an Earth Science Data User

Confessions of a Data Hoarder

Rob Carver, The Weather Company

“Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”

–Andrew S. Tanenbaum

Open Data and The Weather Company

❖ Our business model is taking open data and using it to tell interesting stories that engage our users.

❖ Over the years, we’ve archived over 100 TB of data

❖ GRIB1, GRIB2, NIDS, shapefiles, netCDF, HDF5

❖ NWS/NCEP, NCDC, FEMA, Census Bureau, NASA DAACs

Locating Data

1. Google and literature searches

2. ???

3. Data!

100+ TB of Weather Models

❖ Most data arrives through Unidata’s LDM and FTP pull scripts. ECMWF pushes data to our FTP site. (All of it is GRIB1 or GRIB2.)

❖ Ingested into the forecast system; GrADS handles the model visualization

❖ Archived to local disk arrays and Amazon S3
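As a rough illustration of what an "FTP pull script" might look like, here is a minimal Python sketch using only the standard library. The host, directory names, and file suffixes are hypothetical placeholders, not the scripts or servers actually used; the idea shown is simply to list the remote directory, skip files already on disk, and download the rest.

```python
import ftplib
import os


def files_to_fetch(remote_names, local_dir,
                   suffixes=(".grib", ".grib2", ".grb", ".grb2")):
    """Return remote GRIB file names that are not yet present in local_dir."""
    have = set(os.listdir(local_dir)) if os.path.isdir(local_dir) else set()
    return sorted(
        name for name in remote_names
        if name.lower().endswith(suffixes) and name not in have
    )


def pull(host, remote_dir, local_dir, user="anonymous", password=""):
    """Download any new GRIB files from remote_dir on host into local_dir.

    host and remote_dir are assumptions supplied by the caller; nothing here
    is specific to a particular data provider.
    """
    os.makedirs(local_dir, exist_ok=True)
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        for name in files_to_fetch(ftp.nlst(), local_dir):
            tmp = os.path.join(local_dir, name + ".part")
            with open(tmp, "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
            # Rename only after a complete transfer so that partial
            # files are never picked up by a downstream ingest step.
            os.rename(tmp, os.path.join(local_dir, name))
```

A script like this would typically be run on a schedule (for example from cron), with a separate step copying completed files to longer-term storage.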

Level-III NIDS Archive

❖ NCDC maintains an archive of the WSR-88D radar network’s products from 1995 to present (>10 TB)

❖ Order datasets from a tape-based archive

❖ Two years to acquire it using a set of PHP scripts

❖ Easier to acquire the entire archive than to figure out which subset to acquire

❖ Already had a NIDS parser for visualization

FEMA Flood Maps

❖ Data Acquisition Method: DVD for each state

❖ Format: ESRI Shapefiles (one shapefile per feature class per state)

❖ Data Display: Split state shapefiles by county and then pre-render tiles for moderate to coarse zoom levels on a map mashup.
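To make the pre-rendering step more concrete, the sketch below enumerates which map tiles cover a bounding box at a given zoom level. It assumes the common Web Mercator XYZ ("slippy map") tiling scheme, which the slides do not specify; the function names and any bounding box passed in are illustrative only.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Convert a WGS84 longitude/latitude to XYZ tile indices (assumed scheme)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so points on the eastern or southern edge stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_bbox(west, south, east, north, zoom):
    """List every (zoom, x, y) tile that intersects a bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # top-left corner
    x_max, y_max = lonlat_to_tile(east, south, zoom)  # bottom-right corner
    return [(zoom, x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]
```

Because the tile count roughly quadruples with each zoom level, restricting pre-rendering to coarse and moderate zooms keeps the number of tiles per county manageable.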

Suggestions

❖ Data in a difficult/proprietary format just wastes disk space

❖ Please use data formats that are well-supported by open-source software packages (e.g., GDAL/OGR)

❖ netCDF, TIFF, ESRI shapefiles, HDF5, GeoJSON

❖ Instead of complex CSV or fixed-width text files, use self-describing formats (JSON, XML, SQLite)
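A small sketch of why self-describing formats help: with a fixed-width file, the reader needs an external column layout to make sense of each record, whereas the JSON version carries its field names with it. The layout and field names below are hypothetical and exist only for illustration.

```python
import json

# Hypothetical fixed-width layout: (field name, start column, end column, type).
# Without this table, a line of the file is just an opaque run of characters.
LAYOUT = [
    ("station_id", 0, 5, str),
    ("temp_c", 5, 11, float),
    ("wind_kt", 11, 14, int),
]


def parse_fixed_width(line):
    """Slice one fixed-width record into a dict using LAYOUT."""
    return {name: cast(line[start:end].strip())
            for name, start, end, cast in LAYOUT}


def to_json(line):
    """Emit the record as JSON, where every value carries its own field name."""
    return json.dumps(parse_fixed_width(line), sort_keys=True)
```

Once the data is published as JSON, a consumer can load it with any standard parser and no longer has to track down (or guess) the column layout.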

Suggestions (cont.)

❖ Data and navigation files should use the same naming conventions and sequences

❖ Don’t use overly large archive files

❖ Data pools/FTP servers attached to large disk arrays are awesome data providers (as long as limits are in place)

❖ For really large, static datasets (>10 GB), BitTorrent would be really useful

Questions/Comments/Answers?

❖[email protected]
