Accessibility Considerations for Very Large Datasets

Puneet Kishor, University of Wisconsin-Madison and Creative Commons

Monday, October 25, 2010

Upload: psi-session-at-codata-2010

Post on 10-Nov-2014


DESCRIPTION

Accessibility is a complex property of information, a composite of several indicators that measure the ease with which one can get to and work with information from another source. These indicators scale differently with the size and scope of the information: as datasets get bigger, some indicators remain constant, some scale linearly, and others may scale faster than linearly. At the University of Wisconsin Carbon Modeling group, we are building a very large, spatially accessible U.S. national climate dataset for ecosystem process modeling, which is raising unique accessibility challenges and lessons for multi-terabyte datasets. This paper describes the problem, the information accessibility challenges, and the lessons learned while putting together a solution.

TRANSCRIPT

Page 1: 3 kishor

Accessibility Considerations for Very Large Datasets

Puneet Kishor, University of Wisconsin-Madison and Creative Commons

Monday, October 25, 2010

Page 2: 3 kishor

Acknowledgments to CODATA for inviting me, Creative Commons for funding my trip, the University of Wisconsin-Madison for paying my salary, and most importantly, the US Federal Government for making all the data available to anyone, anywhere, without any preconditions.

Page 3: 3 kishor

Research context: ecosystem process modeling of very large terrestrial ecosystems


Page 4: 3 kishor

Information by numbers


Page 5: 3 kishor

7 daily variables: tmax, tmin, tave, prcp, srad, vpd, dayl

Page 6: 3 kishor

1 km² cell (1 km × 1 km), each holding tmax, tmin, tave, prcp, srad, vpd, dayl

Page 7: 3 kishor

13.25 million cells (4587 × 2889)

Page 8: 3 kishor

8401 days

8400


Page 9: 3 kishor

111.32 billion septets (tmax, tmin, tave, prcp, srad, vpd, dayl)

Page 10: 3 kishor

725.78 raw gigabytes
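The figures on the preceding pages can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the grid dimensions, day count, and variable count come from the slides, but the storage width is not stated anywhere in them. `BYTES_PER_VALUE = 1` and the reading of "gigabyte" as 2^30 bytes are assumptions, chosen because together they happen to reproduce the 725.78 figure.

```python
# Figures taken from the slides.
GRID_DIMS = (4587, 2889)   # grid dimensions shown on the "13.25 million cells" page
DAYS = 8401                # daily time steps
VARIABLES = 7              # tmax, tmin, tave, prcp, srad, vpd, dayl

# ASSUMPTION (not stated in the slides): one byte per stored value.
BYTES_PER_VALUE = 1

cells = GRID_DIMS[0] * GRID_DIMS[1]
septets = cells * DAYS                       # one septet per cell per day
raw_bytes = septets * VARIABLES * BYTES_PER_VALUE
raw_gib = raw_bytes / 2**30                  # ASSUMPTION: 1 gigabyte = 2**30 bytes

print(f"{cells / 1e6:.2f} million cells")
print(f"{septets / 1e9:.2f} billion septets")
print(f"{raw_gib:.2f} raw gigabytes")
```

Under these assumptions the septet count rounds to 111.33 billion, so the 111.32 on the slide appears to be truncated and not rounded.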

Page 11: 3 kishor

10 times as much in a database

Page 12: 3 kishor

84 GB of NetCDF files in tar-gzipped archives

Page 13: 3 kishor

4˚ square chunks

[Figure: map of the chunks, numbered 1 to 14]

Page 14: 3 kishor

“½” incomplete documentation

Page 15: 3 kishor

0 ways to query the data

Page 16: 3 kishor

“10” times the work to unpack the data

1. Acquire NetCDF file of lat/lon values for each cell from the weather data 1 km² estimates
2. Dump lat/lon values to CSV with Panoply
3. Import into ArcMap as XY data
4. Export as shapefile
5. Assign WGS84 datum to shapefile in ArcCatalog
6. Reproject to Lambert Spherical (“US National Atlas Equal Area”)
7. Separate by 2x2 degree tile using the "tile_num" attribute (so the grid will match the NetCDF met files), using a definition query in ArcMap and exporting to individual shapefiles (256 tiles) as "mask"
8. Open Lambert points in QGIS and make a 1 km grid (shapefile) for each 2x2 tile
9. Assign projection to output (EPSG:2163)
10. Add each new grid shapefile (one at a time) to ArcMap with 2x2 Grid as separate layer
11. Select by location (select from grid x that intersect mask x)
12. Export selected features of grid x (now will be numbered sequentially by record in a way that matches the met NetCDF “ncells”)
13. Clean up: delete extra fields from QGIS (ID, MAXX, MINX, MAXY, MINY); add ncell_id (FID+1), block_id, block_name
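The reprojection in steps 6 and 9 was done with desktop GIS tools. As a rough illustration of what that step computes, here is a minimal sketch of the forward spherical Lambert azimuthal equal-area projection. It assumes the commonly published EPSG:2163 parameters (a sphere of radius 6,370,997 m centered at 45°N, 100°W) and, as a simplification, treats WGS84 latitude and longitude directly as spherical coordinates. The function name `laea_forward` is hypothetical and is not part of the original workflow.

```python
import math

# ASSUMED EPSG:2163 parameters: sphere radius in meters and projection center.
R = 6370997.0
LAT0 = math.radians(45.0)
LON0 = math.radians(-100.0)


def laea_forward(lat_deg, lon_deg):
    """Project lat/lon (degrees) to x/y (meters), spherical Lambert azimuthal equal-area.

    Simplification: no datum shift is applied to the input coordinates.
    """
    lat = math.radians(lat_deg)
    dlon = math.radians(lon_deg) - LON0
    denom = (1.0
             + math.sin(LAT0) * math.sin(lat)
             + math.cos(LAT0) * math.cos(lat) * math.cos(dlon))
    k = math.sqrt(2.0 / denom)  # equal-area scale factor
    x = R * k * math.cos(lat) * math.sin(dlon)
    y = R * k * (math.cos(LAT0) * math.sin(lat)
                 - math.sin(LAT0) * math.cos(lat) * math.cos(dlon))
    return x, y


center = laea_forward(45.0, -100.0)      # projection center maps to the origin
north = laea_forward(46.0, -100.0)       # one degree north along the central meridian
```

A production workflow would normally rely on a projection library in place of hand-written formulas; the sketch only shows why the projected points can be snapped to a regular 1 km grid in meters.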

Page 17: 3 kishor

Many kinds of queries

f<variable> <location> <point in time>

avg(srad) at x,y on Dec 2, 2001

tmin for area on May 19, 1992

tmax at x,y on May 19, 1992

f<variable> <point location> <duration of time>

tave at x,y during the first quarter of 1983

sum(vpd) at x,y during the last week of Mar, 2003
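The two query shapes above can be sketched against a toy in-memory store. This is not the storage or API described in the talk: the class `DailyGrid`, its method names, the start date, and all values are made up for illustration, and a real multi-terabyte dataset would need an on-disk layout instead of nested lists.

```python
from datetime import date
from statistics import mean


class DailyGrid:
    """Hypothetical toy store laid out as values[variable][day][row][col]."""

    def __init__(self, start, values):
        self.start = start      # calendar date of day index 0
        self.values = values

    def _day(self, d):
        # Convert a calendar date to a day index.
        return (d - self.start).days

    def at_point(self, var, row, col, day):
        """<variable> <point location> <point in time>"""
        return self.values[var][self._day(day)][row][col]

    def over_area(self, var, cells, day, agg=mean):
        """f<variable> <area> <point in time>; cells is a list of (row, col)."""
        grid = self.values[var][self._day(day)]
        return agg(grid[r][c] for r, c in cells)

    def over_period(self, var, row, col, first, last, agg=mean):
        """f<variable> <point location> <duration of time>, both ends inclusive."""
        days = range(self._day(first), self._day(last) + 1)
        return agg(self.values[var][d][row][col] for d in days)


# Synthetic 3-day, 2x2 example; the numbers are invented.
toy = DailyGrid(
    start=date(1992, 5, 18),
    values={"tmin": [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]],
        [[9.0, 10.0], [11.0, 12.0]],
    ]},
)

point = toy.at_point("tmin", 0, 1, date(1992, 5, 19))
area = toy.over_area("tmin", [(0, 0), (1, 1)], date(1992, 5, 19), min)
total = toy.over_period("tmin", 1, 0, date(1992, 5, 18), date(1992, 5, 20), sum)
```

Passing the aggregate as a function keeps avg, sum, min, and similar queries on the same code path, which is one way to cover the "many kinds of queries" with a small interface.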

Page 18: 3 kishor

accessible |akˈsesəbəl| adjective

1 (of a place) able to be reached or entered: the town is accessible by bus | the building has been made accessible to disabled people.

• (of an object, service, or facility) able to be easily obtained or used: making learning opportunities more accessible to adults.

• easily understood: his Latin grammar is lucid and accessible.

• able to be reached or entered by people in wheelchairs: it provides specialized features such as nonslip floors and accessible entrances.

2 (of a person, typically one in a position of authority or importance) friendly and easy to talk to; approachable: he is more accessible than most tycoons.

Page 19: 3 kishor

Accessible information is easy to: find, determine what one can do with it, acquire, and use

Page 20: 3 kishor

Factors that affect accessibility: law; technology; culture; semantics; and economics


Page 21: 3 kishor

Law makes sharing permissible; technology makes it possible; culture makes it acceptable; semantics make it understandable; and economics makes it affordable

Page 22: 3 kishor

It is permissible, acceptable, and affordable to access public sector information, but not necessarily possible or understandable

Page 23: 3 kishor

Goals of the new storage: make the information technologically and semantically accessible

Page 24: 3 kishor

Allow access by providing a user interface, an application programming interface, and documentation