Using DCO Data (Infrastructure, Management, Analysis, Visualization, …)
Peter Fox (@taswegian), [email protected] (Marshall Ma) and the Data Science Team
Tetherless World Constellation, Rensselaer Polytechnic Institute
DCO Summer School, July 14, 2014. Big Sky, MT
https://deepcarbon.net/group/dco-summer-school-2014
Deep Carbon Observatory
Global community of ‘Carbon’ scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising:
• Global Earth Mineral Laboratory
• Global Census of Deep Fluids
• Global Volcano Gas Emissions
• Global Census of Deep Microbial Life
• Global State of High Pressure and Temperature Carbon and Related Materials
• Global Inventory of Diamonds with Inclusions
• …
Data Science is …
• Doing science with someone else’s data …
– across datasets
– with models
– multi-dimensional, multi-scale, multi-mode
– complex data-types
– needing new analytic and visual approaches
• Especially in multiple “dimensions” (functional)
– E.g. detection/attribution methods/algorithms
– Visual exploration
You may see many diagrams like
Physical quantity versus measured as quantity
[Diagram labels: Value and units? Reference frame? Reference units?]
Use case: How DCO Finds Out About Data
[Diagram labels (Fleischer, 2011): Data; A scientist bringing new data; Spreadsheet; Diagram; Digital Map; Report; A data manager transforming data; Transformed data ready for import; Repository staff/Data librarian; Importing tool; A data repository; Internet]
Data-Information-Knowledge “Ecosystem”
[Diagram labels: Data, Information, Knowledge; Producers, Consumers; Context; Presentation, Organization; Integration, Conversation; Creation, Gathering; Experience]
[Diagram labels: Producers, Consumers; Quality Control; Fitness for Purpose, Fitness for Use; Quality Assessment; Trustee, Trustor]
Spreadsheets
• E.g. Excel – import data
Documentation?
• Substantial metadata – how to visualize THIS?
Census of Deep Life
Bias
• To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster]
• A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception]
• For acquisition – sampling bias is your enemy
• Cognitive bias is (due to) YOU!
Provenance*
• Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
– Internal
– External
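The elements of provenance listed on this slide can be written down as a small structured record. The sketch below is only one possible layout: the class name ProvenanceRecord, its field names, and the placeholder values are all assumptions made for illustration, not a DCO schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical record; fields mirror the provenance elements on the slide."""
    origin: str        # origin or source from which the data comes
    intended_use: str  # intention for use
    generated_by: str  # who/what generated it
    method: str        # manner of manufacture, production or discovery
    place: str         # sense of place
    time: str          # sense of time (free text or ISO 8601)
    history: list = field(default_factory=list)  # subsequent owners and actions

    def add_event(self, who: str, what: str) -> None:
        """Append one step to the history of subsequent owners."""
        self.history.append({"who": who, "what": what})

    def to_json(self) -> str:
        """Serialize the record so it can travel with the data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values only; replace with real details for a real dataset.
record = ProvenanceRecord(
    origin="example field campaign",
    intended_use="example analysis",
    generated_by="example instrument",
    method="example sampling protocol",
    place="example site",
    time="2014-07-14",
)
record.add_event("example data manager", "converted spreadsheet to CSV")
print(record.to_json())
```

Keeping the record as plain JSON means it can sit next to the data file and be read without any special software.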
How you find DCO data…?
• http://deepcarbon.net/dco_datasets
– Will soon be a window into community-based sources
• http://metpetdb.rpi.edu
• http://earthchem.org/
• http://www.earthchem.org/petdb
• http://vamps.mbl.edu/portals/deep_carbon/cdl.php
• …
Browser
All information is linked and traceable!
E.g. Deep Life (CoDL)
New tools: R (statistics, visualization, modeling), D3.js (visualization). NOT just of the data, but of all types of information, knowledge! IPython Notebooks?
When You Use Data – Science 2.0
• Versioning, subsetting and converting to a format you are familiar with are very common but mysterious
– Take notes – document – provenance
• Software – what did you use and how?
• Derived products – what did you create, how, why, etc.
• Use the metadata every chance you get, e.g. filenames!
• Place them in a Web-accessible folder, consider getting an identifier
• Use social media, blogs, etc. to discuss it.
4 R’s … Goble and others
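One way to make the "take notes" advice on this slide routine is to write a small log entry each time you subset or convert someone else's data. The following is a minimal sketch using only the Python standard library; the function names, the entry layout, and the sample values are assumptions for illustration.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Checksum that ties a note to one exact version of the data."""
    return hashlib.sha256(data).hexdigest()


def describe_step(source_name, source_bytes, derived_name, derived_bytes,
                  action, reason):
    """Build one log entry: what was done, to what, with what, and why."""
    return {
        "action": action,
        "reason": reason,
        "source": {"name": source_name, "sha256": sha256_of(source_bytes)},
        "derived": {"name": derived_name, "sha256": sha256_of(derived_bytes)},
        "software": {"python": platform.python_version()},
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up sample content standing in for a downloaded file.
raw = b"sample,depth_m,value\nA1,10,400\nA2,25,415\n"
subset = b"\n".join(raw.splitlines()[:2]) + b"\n"  # header plus first row

step = describe_step("original.csv", raw, "subset.csv", subset,
                     action="subset", reason="keep shallowest sample only")
print(json.dumps(step, indent=2))
```

Appending entries like this to a text file beside the derived product answers "what did you use and how?" without relying on memory.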
Exercise 1
• Search for and access a dataset that you are not familiar with:
• Can you read it?
• Can you make sense of it?
• Can you assess quality, uncertainty?
• Any sources of bias?
• What would you need to do to make it useful?
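A quick first pass at "can you read it?" and "can you assess quality?" is to count, per column, how many values are missing, numeric, or something else. The sketch below assumes the dataset is CSV text with a header row; the function name profile_csv, the list of missing-value markers, and the sample rows are all assumptions, so adjust them to the dataset's own documentation.

```python
import csv
import io


def profile_csv(text, missing=("", "NA", "NaN", "-999")):
    """Count missing, numeric and other values in each column of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    report = {name: {"missing": 0, "numeric": 0, "other": 0} for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = (row.get(name) or "").strip()
            if value in missing:
                report[name]["missing"] += 1
                continue
            try:
                float(value)
                report[name]["numeric"] += 1
            except ValueError:
                report[name]["other"] += 1
    return rows, report


# Made-up sample rows, not real measurements.
sample = "site,depth_m,temp_c\nS1,120,4.5\nS2,,5.1\nS3,-999,n/a\n"
n_rows, summary = profile_csv(sample)
print(n_rows, summary)
```

A column that mixes numeric and other values, or uses an undocumented marker such as -999, is a good place to start asking about quality and bias.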
When You Generate Data – Science 2.0
• How the data was generated, why, for what, when and in what format
– Take notes – document – provenance
• Software – what did you use and how?
• Derived products – what did you create, how, why, etc.
• Use the metadata every chance you get, e.g. filenames!
• Place them in a Web-accessible folder, consider getting an identifier
• Use social media, blogs, etc. to discuss it.
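"Use the metadata every chance you get, e.g. filenames!" can be made concrete with a naming convention that a program can both write and read back. The convention below (project_site_variable_date_version) is purely an assumption for illustration, as are the function names and example values; any consistent, documented pattern would serve.

```python
import re
from datetime import date, datetime

# Hypothetical convention: project_site_variable_YYYYMMDD_vN.ext
_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<site>[a-z0-9]+)_(?P<variable>[a-z0-9]+)"
    r"_(?P<collected>\d{8})_v(?P<version>\d+)\.(?P<ext>[a-z0-9]+)$"
)


def slug(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def build_filename(project, site, variable, collected: date, version: int,
                   ext="csv") -> str:
    """Encode the key metadata in the filename itself."""
    return (f"{slug(project)}_{slug(site)}_{slug(variable)}"
            f"_{collected:%Y%m%d}_v{version}.{ext}")


def parse_filename(name: str) -> dict:
    """Recover the metadata from a filename that follows the convention."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"filename does not follow the convention: {name!r}")
    info = match.groupdict()
    info["collected"] = datetime.strptime(info["collected"], "%Y%m%d").date()
    info["version"] = int(info["version"])
    return info


# Example values are placeholders.
name = build_filename("CoDL", "Site 7", "CO2 flux", date(2014, 7, 14), 2)
print(name)
print(parse_filename(name))
```

Because the name can be parsed again, a folder listing alone tells a later user what each file contains and which version it is.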
Make it visible to DCO (can be private)
https://deepcarbon.net/dco/dco-open-access-and-data-policies
https://deepcarbon.net/page/submit-community-data
You get an identifier! DCO-ID, can be cited, rewarded and much more…
Share…
DCO checklist: what people have to do (courtesy UC3)
Your data management plan
Funding agency requirements
Creating your data
Organizing your data
Managing your data
Sharing your data
Domain Scientist
Data manager
Repository staff
Data Scientist
Curation Services & Tools
Domain scientists often also take up these two roles, which however is neither efficient nor effective (i.e., the 80-20 rule).
DCO checklist: a service & tool perspective
Your data management plan
AP Sloan requirements+
Creating your data
Organizing your data
Managing your data
Sharing your data
e.g., NSF New Proposal and Award Policies and Procedures Guide (effective January 14, 2013)
Object Modeling
Identity Services
Storage Services
Ingest Services
Discovery Service
Characterization Services
Access Services
CKAN, community
CKAN, community
Faceted search and Drupal etc.
DCO-ID (Handle+DOI)
+
Linked Data, community
Schema.org, etc.
Use cases, info. model
Exercise 2
• Begin with a recent dataset that you generated or were involved in generating
• Can someone else read it?
• Can someone make sense of it?
• Have you asserted quality, uncertainty?
• Have you described known sources of bias?
• What else would you now do to make it more useful?
Further reading
• Data Science course at RPI: http://tw.rpi.edu/web/Courses/DataScience/2013
• Fourth Paradigm: http://research.microsoft.com/en-us/collaboration/fourthparadigm/
• Data Management Planning tools:
– http://tw.rpi.edu/web/project/DCO-DS/WorkingGroups/DMP
– http://www.iedadata.org/compliance/plan
– https://dmp.cdlib.org/
Breakout Session Today
• Exercises 1 and 2
• Discussion
Friday
• Marshall (Xiaogang) Ma will round out the data discussion
• DCO goal for data: in the interim,
– help you become data scientists (as well as your specialty)
• Then, in time…
– you can drop “data” because you will handle data as easily as you do field work, use instruments, etc…