Using DCO Data (Infrastructure, Management, Analysis, Visualization, …) Peter Fox @taswegian, [email protected] (Marshall Ma) and the Data Science Team Tetherless World Constellation Rensselaer Polytechnic Institute DCO Summer School, July 14, 2014. Big Sky, MT Data Science https://deepcarbon.net/group/dco-summer-school-2014


Page 1: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Using DCO Data (Infrastructure, Management, Analysis, Visualization, …)

Peter Fox @taswegian, [email protected] (Marshall Ma) and the Data Science Team

Tetherless World Constellation, Rensselaer Polytechnic Institute

DCO Summer School, July 14, 2014. Big Sky, MT

Data Science

https://deepcarbon.net/group/dco-summer-school-2014

Page 2: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Deep Carbon Observatory

Global community of ‘Carbon’ scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising:

• Global Earth Mineral Laboratory
• Global Census of Deep Fluids
• Global Volcano Gas Emissions
• Global Census of Deep Microbial Life
• Global State of High Pressure and Temperature Carbon and Related Materials
• Global Inventory of Diamonds with Inclusions
• …

Page 3: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Data Science is …

• Doing science with someone else’s data …
  – across datasets
  – with models
  – multi-dimensional, multi-scale, multi-mode
  – complex data-types
  – needing new analytic and visual approaches
• Especially in multiple “dimensions” (functional)
  – E.g. Detection/attribution methods/algorithms
  – Visual exploration


Page 4: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

You may see many diagrams like this: [diagram]

Page 5: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Physical quantity versus measured as quantity

Value and units?

Reference frame?

Reference units?

Value and units?
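
The questions on this slide (value and units? reference frame? reference units?) can be made concrete with a small record type. This is only an illustrative sketch: the class name `Measurement`, its fields, the `TO_PASCAL` table and the example numbers are assumptions made for the example, not part of any DCO system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A measured value that carries its own units and reference frame."""
    quantity: str         # the physical quantity, e.g. "pressure"
    value: float          # the number as recorded
    units: str            # units of the recorded value
    reference_frame: str  # what the value is measured relative to

    def describe(self) -> str:
        return f"{self.quantity} = {self.value} {self.units} ({self.reference_frame})"


# Conversion factors to a reference unit (pascal); hypothetical helper table.
TO_PASCAL = {"Pa": 1.0, "kPa": 1e3, "MPa": 1e6, "GPa": 1e9, "bar": 1e5}


def to_reference_units(m: Measurement) -> Measurement:
    """Return the same measurement expressed in pascals."""
    if m.units not in TO_PASCAL:
        raise ValueError(f"unknown pressure unit: {m.units!r}")
    return Measurement(m.quantity, m.value * TO_PASCAL[m.units], "Pa", m.reference_frame)


# Made-up example value.
sample = Measurement("pressure", 2.5, "GPa", "absolute")
converted = to_reference_units(sample)
print(converted.describe())
```

Keeping units and reference frame next to the number means a value cannot be passed around without the context needed to interpret it.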

Page 6: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Use case: How DCO Finds Out About Data (Fleischer, 2011)

Diagram labels:

• Data: Spreadsheet, Diagram, Digital Map, Report
• A scientist bringing new data
• A data manager transforming data
• Transformed data ready for import
• Repository staff/Data librarian
• Importing tool
• A data repository
• Internet

Page 7: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Data-Information-Knowledge “Ecosystem”

Data, Information, Knowledge

Producers, Consumers

Context

Presentation, Organization

Integration, Conversation

Creation, Gathering

Experience

Page 8: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)


Producers Consumers

Quality Control

Fitness for Purpose Fitness for Use

Quality Assessment

Trustee Trustor

Page 9: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Spreadsheets

• E.g. Excel – import data

Page 10: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Documentation?


Page 11: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

• Substantial metadata – how to visualize THIS?

Census of Deep Life

Page 12: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

• To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster]

• A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception]

• For acquisition – sampling bias is your enemy

• Cognitive bias is (due to) YOU!
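
A small simulation can show why sampling bias matters at acquisition time. The sketch below uses entirely synthetic numbers (not DCO data) and assumes a hypothetical detection threshold of 55.0: when only values above the threshold are ever recorded, the sample mean no longer reflects the population.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic "population": 10,000 made-up measurements.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Unbiased acquisition: every member has the same chance of being sampled.
random_sample = random.sample(population, 500)

# Biased acquisition: only values above a detection threshold are recorded.
threshold = 55.0  # hypothetical threshold chosen for illustration
biased_sample = [x for x in population if x > threshold][:500]

pop_mean = statistics.fmean(population)
random_mean = statistics.fmean(random_sample)
biased_mean = statistics.fmean(biased_sample)

print(f"population mean:    {pop_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample lands close to the population mean, while the thresholded sample is shifted upward no matter how many values it contains.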


Page 13: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Provenance*

• Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
  – Internal
  – External
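
One way to make this definition tangible is to treat provenance as a record with a source, an intended use, and a history of who did what and when. The sketch below is a minimal illustration; the class names `ProvenanceRecord` and `ProvenanceStep`, the field names, and the example entries are all hypothetical and do not represent a DCO or standards-body schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class ProvenanceStep:
    agent: str      # who or what performed the step
    activity: str   # what was done
    timestamp: str  # when, as an ISO 8601 string


@dataclass
class ProvenanceRecord:
    source: str        # origin of the data
    intended_use: str  # why it was generated
    history: list = field(default_factory=list)

    def add_step(self, agent: str, activity: str) -> None:
        """Append one step to the history, stamped with the current UTC time."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.history.append(ProvenanceStep(agent, activity, stamp))


# Made-up example entries.
record = ProvenanceRecord(
    source="field notebook, site A (hypothetical)",
    intended_use="gas flux comparison",
)
record.add_step("instrument operator", "collected raw readings")
record.add_step("cleaning script v0.1", "removed duplicate rows")

print(json.dumps(asdict(record), indent=2))
```

Serializing the record to JSON lets it travel with the data it describes.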

Page 14: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

How you find DCO data…?

• http://deepcarbon.net/dco_datasets
  – Will soon be a window into community-based sources
• http://metpetdb.rpi.edu
• http://earthchem.org/
• http://www.earthchem.org/petdb
• http://vamps.mbl.edu/portals/deep_carbon/cdl.php
• …

Page 15: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Browser

Page 16: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

All information is linked and traceable!


Page 17: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)
Page 18: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

E.g. Deep Life (CoDL)

New tools: R (statistics, visualization, modeling), D3.js (visualization). NOT just of the data, but of all types of information, knowledge! iPython Notebooks?

Page 19: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

When You Use Data – Science 2.0

• Version/subsetting and converting to a format you are familiar with is very common but mysterious
  – Take notes – document – provenance
• Software – what did you use and how?
• Derived products – what did you create, how, why, etc.
• Use the metadata every chance you get, e.g. filenames!
• Place them in a Web-accessible folder, consider getting an identifier
• Use social media, blogs, etc. to discuss it.
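
The "take notes" advice can be partly automated by writing a small note next to each derived file. The following sketch assumes a JSON sidecar file; the function names `sha256_of` and `write_usage_note`, the `.provenance.json` suffix, and the demo file are hypothetical choices for illustration, not a DCO convention.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum that identifies exactly which version of a file was used."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_usage_note(data_file: Path, derived_from: str, steps: list) -> Path:
    """Write a JSON note next to the data file describing how it was produced."""
    note = {
        "file": data_file.name,
        "sha256": sha256_of(data_file),
        "derived_from": derived_from,
        "steps": steps,
        "software": {"python": platform.python_version(), "platform": sys.platform},
    }
    note_path = data_file.with_name(data_file.name + ".provenance.json")
    note_path.write_text(json.dumps(note, indent=2), encoding="utf-8")
    return note_path


# Demonstration on a throwaway file; the file name and contents are made up.
workdir = Path(tempfile.mkdtemp())
subset = workdir / "subset_2014-07-14.csv"
subset.write_bytes(b"sample_id,value\nA1,0.5\n")

note_path = write_usage_note(
    subset,
    derived_from="original download (URL recorded separately)",
    steps=["selected rows for site A", "converted to CSV"],
)
print(note_path.read_text(encoding="utf-8"))
```

The checksum ties the note to one exact version of the file, and the software section answers "what did you use?" without relying on memory.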

Page 20: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

4 R’s … Goble and others

Page 21: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)
Page 22: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Exercise 1

• Search for and access a dataset that you are not familiar with:
• Can you read it?
• Can you make sense of it?
• Can you assess quality, uncertainty?
• Any sources of bias?
• What would you need to do to make it useful?
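
A quick first pass at "can you read it?" and "can you assess quality?" is to profile the file before analyzing it. The sketch below uses only the Python standard library on a made-up CSV; the column names, the `MISSING` set of missing-value markers, and the `profile` function are assumptions for illustration.

```python
import csv
import io

# Made-up CSV text standing in for a downloaded file.
raw = """sample_id,depth_m,co2_ppm
A1,120,410.5
A2,,398.0
A3,95,n/a
A4,130,
"""

# Strings treated as "no value"; an assumed list, extend as needed.
MISSING = {"", "n/a", "na", "nan", "null"}


def profile(csv_text: str) -> dict:
    """Count rows, missing entries and non-numeric entries per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    report = {name: {"missing": 0, "non_numeric": 0} for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name, cell in row.items():
            text = (cell or "").strip()
            if text.lower() in MISSING:
                report[name]["missing"] += 1
                continue
            try:
                float(text)
            except ValueError:
                report[name]["non_numeric"] += 1
    return {"rows": rows, "columns": report}


summary = profile(raw)
print("rows:", summary["rows"])
for column, counts in summary["columns"].items():
    print(column, counts)
```

A column that is entirely non-numeric (here `sample_id`) is probably an identifier, while scattered missing entries in a numeric column are a prompt to look for documentation on how gaps were recorded.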

Page 23: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

When You Generate Data – Science 2.0

• How the data was generated, why, for what, when and in what format
  – Take notes – document – provenance
• Software – what did you use and how?
• Derived products – what did you create, how, why, etc.
• Use the metadata every chance you get, e.g. filenames!
• Place them in a Web-accessible folder, consider getting an identifier
• Use social media, blogs, etc. to discuss it.
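
The point about filenames can be illustrated with a naming scheme that is both written and read by code. The scheme below (project, site, variable, date, version, joined by underscores) is a hypothetical convention invented for this sketch, as are the function names and example values.

```python
import re
from datetime import date


def build_filename(project: str, site: str, variable: str,
                   day: date, version: int, ext: str = "csv") -> str:
    """Compose a filename whose parts are themselves metadata."""
    parts = [project, site, variable, day.isoformat(), f"v{version:02d}"]
    for part in parts:
        if "_" in part or " " in part:
            raise ValueError(f"field {part!r} must not contain '_' or spaces")
    return "_".join(parts) + "." + ext


PATTERN = re.compile(
    r"(?P<project>[^_]+)_(?P<site>[^_]+)_(?P<variable>[^_]+)_"
    r"(?P<day>\d{4}-\d{2}-\d{2})_v(?P<version>\d+)\.(?P<ext>\w+)"
)


def parse_filename(name: str) -> dict:
    """Recover the metadata fields from a filename built by build_filename."""
    match = PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"filename does not follow the naming scheme: {name!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    return fields


# Made-up example values.
name = build_filename("dco", "siteA", "co2flux", date(2014, 7, 14), 1)
print(name)
print(parse_filename(name))
```

Because the same scheme is used to build and to parse names, the metadata survives even when a file is copied away from its folder or repository.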

Page 24: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Make it visible to DCO (can be private)

https://deepcarbon.net/dco/dco-open-access-and-data-policies

https://deepcarbon.net/page/submit-community-data

You get an identifier! DCO-ID, can be cited, rewarded and much more…

Share…

Page 25: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

DCO checklist: what people have to do (courtesy UC3)

Your data management plan

Funding agency requirements

Creating your data

Organizing your data

Managing your data

Sharing your data

Domain Scientist

Data manager

Repository staff

Data Scientist

Curation

Services & Tools

Domain scientists often also take up these two roles, which, however, is neither efficient nor effective (i.e., the 80-20 rule).

Page 26: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

DCO checklist: a service & tool perspective

Your data management plan

AP Sloan requirements+

Creating your data

Organizing your data

Managing your data

Sharing your data

e.g., NSF New Proposal and Award Policies and Procedures Guide (effective January 14, 2013)

Object Modeling

Identity Services

Storage Services

Ingest Services

Discovery Service

Characterization Services

Access Services

CKAN, community

CKAN, community

Faceted search and Drupal etc.

DCO-ID (Handle+DOI)

+

Linked Data, community

Schema.org, etc.

Use cases, info. model
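
Since the slide lists Schema.org among the building blocks, here is a minimal sketch of what a Schema.org `Dataset` description looks like when written as JSON-LD. It is not the DCO information model: every value (name, identifier, creator, URLs) is a placeholder, and only the property names come from the Schema.org vocabulary.

```python
import json

# All values are placeholders for illustration.
dataset_description = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example gas emission measurements (hypothetical)",
    "description": "Placeholder description for illustration only.",
    "identifier": "PLACEHOLDER-ID",  # a real identifier would go here
    "creator": {"@type": "Person", "name": "A. Researcher"},
    "keywords": ["deep carbon", "example"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/example.csv",
    },
}

json_ld = json.dumps(dataset_description, indent=2)
print(json_ld)
```

A description like this can be embedded in a dataset's web page so that search and discovery services can read the same metadata that people see.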

Page 27: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Exercise 2

• Begin with a recent dataset that you generated or were involved in generating
• Can someone else read it?
• Can someone make sense of it?
• Have you asserted quality, uncertainty?
• Have you described known sources of bias?
• What else would you now do to make it more useful?

Page 29: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Breakout Session Today

• Exercises 1 and 2
• Discussion

Page 30: Using DCO  Data  ( Infrastructure , Management ,  Analysis, Visualization, …)

Friday

• Marshall (Xiaogang) Ma will round out the data discussion
• DCO goal for data: in the interim,
  – help you become data scientists (as well as your specialty)
• Then, in time…
  – you can drop “data” because you will handle data as easily as you do field work, use instruments, etc…