
Research Coordination Network

Workshop I Report

27-28 August 2015, University of Hawai‘i at Mānoa, Honolulu, HI

Report: 24 November 2015

Table of Contents

Executive Summary

ECOGEO – a community focused on solutions

Workshop Goals

Workshop Outcomes

I. Summary of workshop structure

II. Grand ‘omics science challenges

III. Specific cyberinfrastructure needs

IV. Leveraging and expanding existing infrastructures

V. Enabling and encouraging big data best practices

Recommendations

Building Environmental ‘Omics Infrastructure for Earth Sciences

Upcoming events

Appendices

Appendix I: Workshop I Agenda

Appendix II: Participant List

Appendix III: Participant Use Cases and Use Case Template

Appendix IV: Community Survey and Summary

Resources

ECOGEO RCN website: http://earthcube.org/group/ecogeo

Workshop website: http://cmore.soest.hawaii.edu/rcn2015/

Workshop agenda: http://cmore.soest.hawaii.edu/rcn2015/agenda.htm (also Appendix I)

Workshop webinar: https://vimeo.com/uhcmore/review/138035693/5223a63a63

Environmental ‘omics Resource Viewer: http://pivots.azurewebsites.net/ecogeo.html

Workshop conveners: Edward DeLong, Elisha Wood-Charlson, ECOGEO Steering Committee

Report prepared by: Edward DeLong, Elisha Wood-Charlson, and ECOGEO Workshop I

participants (Appendix II)

24 November 2015

Executive Summary

The aim of ECOGEO’s first workshop was to enable domain scientists and cyberinfrastructure

experts to collaboratively discuss grand challenges in ‘omics science (Outcome II) and explore

use cases that translate those challenges into cyberinfrastructure needs (Outcome III). The

group also worked to outline existing resources, brainstormed how to best leverage and expand

those resources (Outcome IV), and discussed ways to better establish best practices in the

community (Outcome V).

The workshop hosted over 50 participants from more than 20 universities, three national labs,

four cyberinfrastructure centers, two NSF funded data resources, and three EarthCube funded

projects. At the conclusion of the workshop, participant comments were unified and optimistic

about how to best move forward as a community. Scientists and cyberinfrastructure experts

were able to identify common ground and consensus on how to address some of the core

‘omics science challenges. This report summarizes discussions and synthesis activities that

generated the following recommendations, aimed at overcoming cyberinfrastructure challenges

in environmental ‘omics research.

In summary, the community’s recommendations include:

I. EarthCube-based solutions should integrate science drivers and challenges (e.g. end

user discussions and use case scenarios) with technological and engineering solutions

via continuous, iterative discussion/development cycles, from start to finish.

II. Existing ‘omics-orientated cyberinfrastructures should be extensively leveraged and

integrated into EarthCube systems, with further development and support, to better meet

current and future needs of the ‘omics community.

III. Data centers, databases, and analytical tools that address issues of data discovery,

scalability, and community HPC access should be further developed.

IV. Data visualization and statistical analytical frameworks should be integrated into

standard ‘omics analyses workflows and software.

V. The cyberinfrastructure should enable and encourage “big data” best practices and

standards for the community.

VI. The ‘omics and cyberinfrastructure communities should enable and provide a platform

for future ‘omics research via streamlined, accessible, state-of-the-art education, training

tools, and best practices.

This report aims to provide a foundation of community support for a federated platform of

interoperable cyberinfrastructures for oceanography and geobiology environmental ‘omics

research. With support for improved cyberinfrastructure, interdisciplinary collaborations, and

best practices and training, the ‘omics community (domain scientists and cyberinfrastructure

experts) is very well positioned to move environmental ‘omics research into the future.

ECOGEO – a community focused on solutions

The EarthCube Oceanography and Geobiology Environmental ‘Omics (ECOGEO) project is a

two-year, NSF funded Research Coordination Network (RCN) designed to bring together

domain and cyberinfrastructure scientists and engineers with the goal of articulating needs,

challenges, solutions, and required software and hardware infrastructures for enabling and

advancing current and future ‘omics research in the Geosciences, and in particular Ocean

Science.

The ECOGEO research community spans an array of disciplines, but is united in developing

and applying ‘omics technologies and bioinformatic approaches to address core questions

relating the interplay of biological, geological, and chemical processes. Investigations range

from high-throughput sequencing of microbial community DNA to assess taxon, gene, and

metabolic pathway distributions across samples (metagenomics), to monitoring the expression of

genes and/or proteins in a variety of environmental settings (metatranscriptomics and

proteomics, respectively), to measuring the distribution and significance of metabolites and

lipids in organisms and the environment (metabolomics, lipidomics). These methods enable

researchers in biological oceanography, biogeochemistry, organic geochemistry, microbial

oceanography, and geobiology to explore and inter-relate the biological, geological, and

chemical (biogeochemical) world in hitherto unprecedented depth and detail. This general

approach requires considerable computational hardware and software infrastructures, which

rely on high performance computing, advanced networking and database capabilities, and

collaboration with computer scientists, bioinformaticians, software engineers, computational

biologists, and interdisciplinary support from both government and private funding agencies

and foundations.

Overall goals

Create and sustain a strategic network and community of field and cyber scientists to

explore new facets of ‘omics data.

Articulate needs, challenges, and practical solutions that address: 1) development of

cyberinfrastructure, 2) integration and implementation of workflows, and 3) database and

resource sustainability to support ocean and geobiology environmental ‘omics research.

Develop a community-based framework that integrates best practices for sharing, curation,

and analysis of ‘omics data, with associated “metadata”, and facilitates collaboration and

training among environmental microbiology, geobiology, and computer science disciplines.

Workshop Goals

Highlight core science and technology drivers for research using environmental ‘omics

The field of environmental ‘omics requires close collaboration between domain science and

technology/cyberinfrastructure. Therefore, one of ECOGEO’s main workshop goals was to

discuss key drivers from the perspective of current challenges and cyberinfrastructure needs.

These drivers were identified during review of the original end-user workshop documents,

participant use cases developed for this workshop (Appendix III), and community feedback

from the ECOGEO survey (Appendix IV).

http://earthcube.org/document/2014/ocean-omics-report-genomic-sci-pub

Identify solutions: Based on these challenge-focused science and technology drivers, the

workshop participants also discussed ways to leverage existing resources and described gaps

where new solutions could be built.

EarthCube Context

The workshop leveraged numerous EarthCube resources to enhance our discussions. Several

active members of EarthCube governance and funded projects were in attendance, including

(alphabetical by last name, also available in Appendix II) Emma Aronson, Science Committee;

Basil Gomez, Leadership Council; Danie Kinkade, Leadership Council; Ouida Meier,

CReSCyNT RCN; Ken Rubin, Science Committee; Elisha Wood-Charlson, Engagement and

Liaison Teams, Science Committee, ECOGEO RCN; and Ilya Zaslavsky, GEAR Conceptual

Design, Technology and Architecture Committee, CINERGI Building Block.

In addition, ECOGEO has contributed, and will continue to contribute, to EarthCube’s vision. The 2nd year of

our RCN will focus on integrating the ‘omics community and our collective resources into the

broader EarthCube infrastructure, with the goal of creating sustainable contributions through

future EarthCube funded projects. Thus far, we have compiled a dozen domain science use

cases, which will be refined at a follow-up working group meeting in early 2016 (see Path Forward)

and have been actively engaged with CINERGI to expand our environmental ‘omics resource

viewer. Throughout year 2, ECOGEO will continue to have representation in EarthCube

governance, including revision of best practices and core documents, as well as future ideas for

EarthCube funded project recommendations. Finally, we will communicate outcomes from our

workshops to the EarthCube community through dissemination of reports, and continue to inform

the broader ‘omics community of EarthCube through society town hall sessions.

Workshop Outcomes

I. Summary of workshop structure

This workshop focused on understanding the key science and technology drivers in the field and

the development of use cases to identify resource gaps, as well as highlight their potential for

training. The workshop was organized into several breakout groups, with time for reporting and

discussion amongst all participants (please see Appendix I for the full agenda). The first series

of breakouts on science/tech drivers focused on grand ‘omics science challenges:

Geospatial & temporal registry for ‘omics data across scales, led by D. Kinkade, V. Orphan

Tracking synoptic ‘omics data products, led by B. Hurwitz, N. Kyrpides

Integrated modeling of organisms (‘omics) and environmental dynamics, led by M. Follows,

N. Levine

The next set of breakouts focused on a subset of participant submitted use cases with the aim

of extracting the overarching science challenges and cyberinfrastructure needs in ‘omics

research. Representative use cases were grouped by theme and are available in Appendix III.

“Google Earth” ‘omics, led by E. Allen. Use cases by H. Alexander and B. Jenkins.

Linking function to biogeochemical cycling in space/time, led by M. Saito. Use cases by R.

Morris and J. Waldbauer.

Using ‘omics for evolution/trait-based studies, led by E. Aronson. Use cases by D. Chivian

and J. Gilbert.

The final discussion focused on the potential for and limitations of existing cyberinfrastructure.

The aim of this session was to move beyond just identifying gaps in existing resources to

proposing solutions as a community to address specific current and future cyberinfrastructure

needs in ‘omics research. Breakout leads included J. Heidelberg, B. Jenkins, and D. Kinkade.

In addition to collective brainstorming, the workshop offered several presentations on resources

that could be integrated into the cyberinfrastructure “ecosystem”, or at least provide fodder for

on-going discussions. Presentations (listed in agenda, Appendix I) included several plenary

talks by NSF-funded Science and Technology Centers and their approaches to handling big

data, as well as a presentation by Jason Leigh, who leads the Laboratory for Advanced

Visualization and Application (LAVA) at UH Mānoa. Jason and his team also hosted a Q&A and

informal brainstorming session with the workshop participants, who were encouraged to explore

advancements in visualization. We also had a presentation by B. Hurwitz and D. Kinkade on

linking the main environmental oceanographic database, BCO-DMO, to the iMicrobe ‘omics

data commons integrated in the iPlant Cyberinfrastructure, both supported by NSF. B. Tully and

I. Zaslavsky presented the ECOGEO resource viewer, which was enabled by the EarthCube

Building Block CINERGI. Finally, L. Teytelman described Protocols.io as a mechanism for

researchers to share, modify, comment, and collaborate on laboratory and bioinformatics

protocols, and F. Chavez presented MBARI’s collaboration with NOAA on a new eDNA study

concept and proposed data analysis workflow.

II. Grand ‘omics science challenges

One of the inherent challenges in environmental ‘omics research is that the focus, by default,

occurs at the level of microbes, since they make up most of the environmental biomass on

Earth. However, many of our research questions span from sub-micron scales (viruses) to

global ecosystems and modeling. How do you sample a micron-level 3-D space (e.g. an algal

bloom in the ocean) over a 4th dimension (time) and then extrapolate the micro-changes and

environmental interactions to ecosystem biogeochemical cycling? The current answer is – “the

technology is not there quite yet”, but these sorts of analyses may be within reach in the very

near future. The workshop participants discussed how to create geospatial and temporal

registries for ‘omics data across different scales (Use Case: Jenkins), and how they could be

visualized through a Google Earth data discovery model (Use Case: Alexander, Morris). For

example, data layers could be expanded from latitude/longitude to include nutrient

concentrations and global temperatures, with a “street-view” like function that would allow for

micro-to-meter scale visualization, such as activity on sinking particles in the ocean. When the

main challenges extend beyond conventional scale boundaries, one must reach

across scales in order to enable the next generation of research questions.

Within the issue of scaling lies the fundamental challenge of understanding how biological

processes interact with these scalable environmental data layers. For example, from an

environmental (microbial) ‘omics perspective, the classic ecological metrics of alpha and beta

species diversity may no longer be entirely relevant. In part, microbes don’t follow typical

speciation criteria. Therefore, the context of environmental interactions and functional diversity

(ability to fix nitrogen, utilize low abundance iron, etc.) may be more relevant from a

biogeochemist’s perspective than defining which strain of microbe is present (Use Case:

Chivian, Gilbert, Waldbauer). In particular, trait-based metrics as opposed to taxonomic criteria

may be more important when considering globally significant and/or societally relevant

questions such as, “how are microbe-microbe and microbe-environment interactions impacting

global biogeochemical cycles?” and “how can that information be used to improve climate

models and projections of change?”. Currently, there is a disconnect between model predictions

and data driven observations. Therefore, we need new ways to enable more iterative

observations, hypothesis generation, and hypothesis testing cycles. These big picture

challenges require a fundamental change in the structure, availability, and scale of ‘omics-

enabling cyberinfrastructures.

III. Specific cyberinfrastructure needs

Although the ‘omics community is diverse in focus and techniques, we share several common

challenges that prevent the field from moving forward. These challenges were highlighted in the

workshop use cases (Appendix III) as well as the community survey (Appendix IV), which was

administered to the ‘omics community in late 2014.

One of the most acutely articulated needs of the ECOGEO community is a new model of

sequence data repository that facilitates data sharing and data discovery of primary data and

any associated environmental and sample processing data (“metadata”), as well as links to

other data products (and their provenance), analytical software and workflows, and the

infrastructures required to implement them. Such a repository would store sequence read data

with its associated metadata in a manner that would allow seamless and simultaneous queries

of metadata fields and their corresponding sequence reads (and vice versa). A similar repository

for environmental proteomic mass spectrometry data was also identified as a core need. Such

repositories would be invaluable tools for data discovery. They would promote efficient computation,

reduce the need to transfer large data sets, and facilitate

development of downstream analysis pipelines that could be shared and standardized. Beyond

repositories for raw sequence reads and mass spectrometry data, the community also needs a

federated, searchable repository for data products being used as a basis for biological inference

in publications, which would greatly enhance comparative analyses between studies. These

products include such things as phylogenetically resolved population level genome fragments

assembled from metagenomes, gene/protein expression data from metatranscriptome/proteome

analyses, and sequence alignments underlying phylogenetic inferences. In addition to

searchable data and data product repositories, analysis platforms that enable large-scale

metagenomic data comparisons were identified as another core “ocean ‘omics”

cyberinfrastructure requirement. Such platforms would support data searches driven by

taxonomy and physicochemistry, thus making such large-scale comparisons feasible.
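
The combined metadata and sequence query described above can be illustrated with a minimal sketch. It assumes a toy relational layout with a `samples` table and a `reads` table in an in-memory SQLite database; the table names, column names, sample values, and the helper functions `build_demo_repository` and `reads_matching` are all hypothetical and are not drawn from any existing repository.

```python
import sqlite3


def build_demo_repository():
    """Create an in-memory database holding made-up sample metadata and reads."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE samples (
            sample_id  TEXT PRIMARY KEY,
            latitude   REAL,
            longitude  REAL,
            depth_m    REAL,
            collected  TEXT,   -- ISO 8601 date
            nitrate_um REAL    -- nitrate concentration, micromolar
        );
        CREATE TABLE reads (
            read_id   TEXT PRIMARY KEY,
            sample_id TEXT REFERENCES samples(sample_id),
            sequence  TEXT
        );
        """
    )
    # Illustrative values only; they do not describe real samples.
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("S1", 22.75, -158.00, 25.0, "2015-08-01", 0.01),
            ("S2", 22.75, -158.00, 500.0, "2015-08-01", 35.0),
        ],
    )
    conn.executemany(
        "INSERT INTO reads VALUES (?, ?, ?)",
        [
            ("r1", "S1", "ACGTACGTAA"),
            ("r2", "S1", "TTGACCGTAC"),
            ("r3", "S2", "GGGACGTACC"),
        ],
    )
    return conn


def reads_matching(conn, max_depth_m, motif):
    """Return (read_id, sample_id, depth_m) for reads containing `motif`
    that come from samples collected at or above `max_depth_m`."""
    query = """
        SELECT r.read_id, s.sample_id, s.depth_m
        FROM reads AS r
        JOIN samples AS s ON r.sample_id = s.sample_id
        WHERE s.depth_m <= ? AND instr(r.sequence, ?) > 0
        ORDER BY r.read_id
    """
    return conn.execute(query, (max_depth_m, motif)).fetchall()


conn = build_demo_repository()
print(reads_matching(conn, 100.0, "ACGTAC"))
```

The point of the sketch is only that one query filters on a metadata field (depth) and on sequence content at the same time; a production repository would need indexing and storage strategies well beyond what a single relational table can offer.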

However, this vision of the next generation of sequence data repositories goes beyond aggregating

disparate data sets currently housed in dispersed data resources (e.g. NCBI, IMG, iPlant, EBI,

MG-RAST, BCO-DMO, etc.). It must also consider “dark data”, data already available in the

public domain but not readily discoverable via the commonly accessed databases. Data

mining through web crawls focused on primary scientific literature would significantly extend the

volume of data gathered to promote comparative ‘omics investigations.

Once ‘omics data are available in a suitable federated infrastructure with query-able and

standardized metadata, it should be possible to pose a striking diversity of hypotheses. Analysis

tools, workflows, visualization, and statistics are critical to making sense of ‘omics data. Many

analysis tools are disparate (developed by individual research groups), which can make them

difficult to capture in a single workflow. Furthermore, most tools require computational

experience to run, and are not well vetted by the community (i.e. which is the best tool for certain

data types and why) or maintained once a developer moves on. This computational climate

prevents the continual improvement and vetting of existing tools by the community and leads to an

ever-expanding collection of programs that are difficult to maintain and come up to speed on. This

strategy is particularly problematic for researchers without a bioinformatics team. In addition,

programs are often developed independently without workflows in mind. This leads to disparate

output and formats that are not easily bridged between tools without scripting skills to reformat

resulting data and therefore cannot be easily merged into user-defined workflows.
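
As an illustration of the kind of "glue" scripting referred to above, the following minimal sketch converts FASTQ records to FASTA so that the output of one tool can feed another. It assumes simple, unwrapped four-line FASTQ records; the function name `fastq_to_fasta` is hypothetical, and the sketch is not a replacement for established format-conversion libraries.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines from four-line FASTQ records
    (header, sequence, '+' separator, quality string)."""
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not record:
            continue  # tolerate blank lines between records
        record.append(line)
        if len(record) == 4:
            header, sequence, separator, qualities = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            if len(sequence) != len(qualities):
                raise ValueError(f"sequence/quality length mismatch: {header!r}")
            yield ">" + header[1:]  # FASTA headers start with '>'
            yield sequence          # quality scores are dropped
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


example = ["@read1 sample=S1\n", "ACGT\n", "+\n", "IIII\n"]
for fasta_line in fastq_to_fasta(example):
    print(fasta_line)
```

Even a conversion this small requires knowledge of both formats and of basic programming, which is the barrier that standardized, inter-operable output formats would remove.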

The ideal platform would promote a federated toolkit with inter-operable and standardized

output formats to enable domain scientists to answer their science questions, as well as

encourage continued technology and cyberinfrastructure development. Quantitative Insights Into

Microbial Ecology (QIIME) and the associated QIITA database have been recently adopted by a

large community as platforms for federated ribosomal RNA (rRNA) tag studies, with open

access tools and a well supported community run helpdesk. QIIME provides online tutorials that

allow for community adoption, resulting in a large user base. However, the rRNA tag data

set analyses for which QIIME is designed have low complexity, and require comparatively less

processing than metagenomic data sets (rendering development and use of these analytical

tools much more straightforward). For large-scale meta- ‘omic analyses, different cyber

solutions will be required. A few other analysis platforms exist, such as DNAnexus, which

supports the biomedical community, and bioinformatics Apps for the life sciences that exist in the

iPlant “ecosystem”.

Finally, statistical and visualization tools are necessary for researchers to explore data and draw

conclusions related to their core science questions. During the workshop, we were able to start

this conversation with big data statistics and visualization experts. The main take-home

message was that domain scientists should not struggle with these challenges alone. Just as

our community has grown to educate and include bioinformaticians in the development of our

research plans, it is evident that we need to expand our collaborations to also include the big

data statistics and visualization experts.

IV. Leveraging and expanding existing infrastructures

The ‘omics community has many distinct layers of cyberinfrastructure requirements, ranging

from the physical hardware to house and serve the data, to software that allows users to

process, analyze, and interface with the data. The ECOGEO workshop focused extensively on

discussing these needs and identifying existing resources that might be leveraged to

accomplish our research goals.

One of the core issues with ‘omics data is the size. Moving large-scale sequence data sets

requires significant network bandwidth and access to network platforms, such as Globus and

Internet2. It is estimated that the data we have today represent only 10% of the total data

that will be available within the next 5 years, given improvements in sequencing capacity and advances in the

throughput of non-nucleic acid data products. While data storage and networking and

communications infrastructures will continue to evolve to help meet these “big data” needs, they

also need to serve a variety of end-users: from raw novices to domain experts, as well as

specialized requirements of the educational, survey and monitoring, and policy driven programs

and communities.
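
To put the 10% estimate above in planning terms, the short sketch below computes the annual growth factor it implies. It assumes smooth compound growth over the 5 years, which is a simplifying assumption made here and not a claim from the workshop; the function name `implied_annual_growth` is hypothetical.

```python
import math


def implied_annual_growth(current_fraction: float, years: float) -> float:
    """Annual multiplicative growth factor implied if today's data volume is
    `current_fraction` of the volume expected `years` from now, assuming
    smooth compound growth."""
    if not 0 < current_fraction <= 1:
        raise ValueError("current_fraction must be in (0, 1]")
    if years <= 0:
        raise ValueError("years must be positive")
    total_growth = 1.0 / current_fraction  # e.g. 10x for a 10% fraction
    return total_growth ** (1.0 / years)


factor = implied_annual_growth(0.10, 5)
doubling_time_years = math.log(2) / math.log(factor)
print(f"{factor:.2f}x per year, doubling every {doubling_time_years:.1f} years")
```

Under that assumption, a tenfold increase over 5 years corresponds to roughly 58% more data each year, or a doubling of holdings about every 1.5 years.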

Analyzing and interpreting large collections of data will likely require a collaborative and

federated approach. Cloud-based computing approaches hold great promise, but there are a

number of issues that need to be addressed by the community, and the current economics of

commercial cloud solutions do not appear scalable for a large and diverse community. For the

larger and more well established institutions, such as the Joint Genome Institute (JGI),

iPlant/iMicrobe, Broad, J. Craig Venter Institute (JCVI), Sanger, etc., maintaining dedicated

computing resources makes sense, while for small labs it probably doesn’t. For most

intermediate size groups, a hybrid approach may be optimal, with some dedicated resources

coupled with access to Cloud-based resources.

In this context, federated infrastructure virtual machines - including lightweight containers - will

likely be a central avenue to provide easy-to-use analysis tools. These have the potential to

democratize access to software suites that may be too complex for researchers, without

dedicated computational support staff, to install. These researchers may be best served by

access to online analysis tools offered by groups such as JGI, KBase, and iPlant/iMicrobe.

Common APIs and architectures with such virtual machines will help forge the links for an

interoperable and federated infrastructure. In addition, existing EarthCube “dark data” discovery

projects, such as DeepDive, can be used to identify published data that are not in a public

repository. The imagined next generation data repositories will likely be built on a federated

cross-agency structure that ties together data from different providers into a common

framework. Data collections from public resources, such as iMicrobe, Integrated Microbial

Genomes (IMG), and the International Nucleotide Sequence Database Collaboration (INSDC),

which includes the Sequence Read Archive (SRA), GenBank, and the European Nucleotide Archive

(ENA), are currently the most used, robust, and sustainable cyber and meta- ‘omics resources.

These should definitely be integral, federated players in the context of any proposed meta-

‘omics “cyber superstructure”.

Presently, researchers in the ECOGEO community deposit raw sequence data into the SRA as

part of the National Center for Biotechnology Information’s (NCBI) GenBank service, which

currently houses over 19,000 environmental genomic data sets totaling > 15 TB. This resource,

however, is not easily searchable and thus prevents the integration of data sets across projects

and limits the possibility of ecosystem level analyses. Further, SRA files at NCBI often do not

contain a sufficient description of a sample’s “metadata”, which ideally includes information on

the sampled environment, sample collection, processing, and data generation. This contextual

metadata, in addition to the oceanographic data in BCO-DMO, is essential if data sets are to be

intercomparable. The Genomic Standards Consortium (GSC) has established baselines for

describing genomic, metagenomic, metatranscriptomic, and amplicon sequence data

(discussed in the next section). Previously, the ocean ‘omics community relied heavily on the

Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis

(CAMERA) database for data discoverability, but this platform was discontinued in 2014. The

CAMERA data sets have been transferred into iMicrobe, a sub-portal within iPlant. In addition,

IMG at JGI and MG-RAST at Argonne National Laboratory are genomic and metagenomic

resources that have been heavily leveraged by the larger community.

Many data repositories, such as GenBank, are also limited in that they do not accept processed

data products and are not properly formatted for non-nucleic acid ‘omics data, such as

proteomic, glycomic, metabolomic, and lipidomic data. Currently, the largest available

metagenomics data integrations are provided through the JGI’s IMG system. Both systems

support the integration and analysis of a number of different ‘omics data sets, and support the

general community by annotating and analyzing user submitted data. Due to their long term

funding scheme, volume of existing data, and their position within the international community,

these systems are “pre-adapted” to be an integral part of a federated meta- ‘omics “cyber

superstructure”. A unique capability provided by IMG is the scale of processed data publicly

available. While IMG is not a centralized resource for raw data storage, it has potential to serve

as a central resource of assembled metagenomics data sets. There is currently an effort to

assemble and annotate a large part of the raw metagenomics data available through SRA and

integrate them with the metadata curation effort through the Genomes OnLine Database

(GOLD). This provides a good example of how current data centers could serve as specialized

hubs in a federated, interoperable alliance, each providing different data products – in this case

with IMG serving as a central repository for metagenome assemblies.

Another example of existing cyberinfrastructure that is “pre-adapted” to be an integral part of a

federated meta- ‘omics “cyber superstructure” is the iMicrobe project, built on the iPlant

cyberinfrastructure. This collaboration is emerging as a viable solution for storing user-

generated and defined data sets in a community data commons. The iMicrobe project provides

a query-able interface for data sets in iPlant by linking to BCO-DMO’s data and mapping

appropriate metadata to GSC’s MiXS compliant terminology and other standardized ontologies

to enhance data discovery and re-use. iPlant also provides the capacity for users to develop

and distribute tools for use by the community. These tools are tied to freely available, high

performance computing resources at iPlant and Texas Advanced Computing Center (TACC).

Presently, over 500 bioinformatics tools are available within iPlant’s discovery environment, and

the iMicrobe project is developing tools specific to microbial ‘omics analyses, including

metagenomic and metatranscriptomic data sets, and new analysis pipelines for uncultured

viruses. iPlant and TACC are NSF-funded programs that, at the request of NSF, have expanded

their scope to the broader (non-plant) life science community, which is compatible with

ECOGEO-related research questions.

V. Enabling and encouraging big data best practices

The GSC has already established a foundation for the minimal information that should

accompany sequence data sets. This community-led initiative has spent 10 years creating

consensus based standard languages and formats for describing the metadata associated with

a sequence data set. This includes physical, chemical, and biological data that accompany the

physically sampled environment. These environmental metadata standards include formats

developed for marine, soil, human, host-associated, built-environment, and many other

systems. This provides a crib sheet that helps educate people on the kinds of information they

should include when they submit their data to a public database. The format promotes a

standard that makes incoming data compliant with other data sets in the databases, and also

makes the data machine readable and hence searchable. This includes the use of standardized

ontologies (e.g. country of origin defined as USA, instead of U.S.A., US, United States, or

United States of America). Variations in descriptors confuse searching and make data retrieval

extremely difficult. Standard ontologies allow for communication between the searcher and the

submitter. In addition, the ‘omics community has been asked to compile a list of requests that

would make the SRA database more useful to the community, in alignment with the NIH

microbiome database needs. The primary difficulty is not getting people to agree that these

standards should be used; it is getting them to use them. Formatting data appropriately requires

effort from the submitter. Therefore, databases, journals, and funding agencies are finding it

difficult to reach consensus on the best way to motivate the community to employ such

standards. GenBank and EBI have adopted the GSC ‘gold star’ standard, making databases

that comply with the minimal information standards (e.g. MIMS, MIxS) more data rich and the

search for data sets easier. The hope is that if data sets comply with this standard, and are

given a gold star, they will be more frequently used, more regularly cited, and hence will

encourage more researchers to employ these standards.
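The effect of a controlled vocabulary on searchability can be illustrated with a minimal Python sketch. The synonym table, the record layout, and the function names `normalize_country` and `find_samples` are hypothetical and chosen only for illustration; they are not part of any GSC standard or database API.

```python
# Hypothetical synonym table; a real one would be derived from the controlled
# vocabulary published with the metadata standard in use.
COUNTRY_SYNONYMS = {
    "usa": "USA",
    "u.s.a.": "USA",
    "us": "USA",
    "united states": "USA",
    "united states of america": "USA",
}


def normalize_country(raw: str) -> str:
    """Map a free-text country value onto the controlled term, if known."""
    key = " ".join(raw.strip().lower().split())
    # Unknown values are returned stripped but otherwise unchanged, so they
    # can be flagged for curation instead of being silently rewritten.
    return COUNTRY_SYNONYMS.get(key, raw.strip())


def find_samples(records, country):
    """Return records whose normalized country matches the controlled term."""
    return [r for r in records if normalize_country(r["country"]) == country]


# Illustrative records only; not real submissions.
records = [
    {"sample": "S1", "country": "U.S.A."},
    {"sample": "S2", "country": "United States of America"},
    {"sample": "S3", "country": "Canada"},
    {"sample": "S4", "country": " us "},
]
matches = find_samples(records, "USA")
print([r["sample"] for r in matches])
```

Without the normalization step, an exact-match search for "USA" would miss all three of the matching samples, which is the retrieval problem described above.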

Recommendations

Building Environmental ‘Omics Infrastructure for Earth Sciences

Enabling our community to build the necessary data discovery repositories, with federated and

efficient frameworks for data integration and interoperability, the establishment of best practices

and workflows, and the development of functional platforms for analysis, visualization, and

statistics.

Recommendation I. EarthCube-based solutions should integrate science drivers and

challenges (e.g. end user discussions and use case scenarios) with technological and

engineering solutions via continuous, iterative discussion/development cycles, from

start to finish.

Without frequent communication regarding needs of the larger end user community, EarthCube

funded projects will not be able to adapt, limiting their usefulness and sustainability.

Development cycles should strive for rapid beta testing, end user feedback, and engineering

redesign and build iterations.


Recommendation II. Existing ‘omics-orientated cyberinfrastructures should be

extensively leveraged and integrated into EarthCube systems, with further development

and support, to better meet current and future needs of the ‘omics community. The true

potential of ‘omics-based science is limited by the community’s need for a well developed,

federated, interoperable, and distributed “cyber superstructure”. Currently, IMG and iMicrobe

are the most extensive, robust, and sustainable meta-‘omic cyber resources. These platforms

should be integrated with other data projects (e.g. BCO-DMO, EarthCube funded projects) and

supported as leaders in the development of a federated and interoperable meta-‘omics

“cyber superstructure”.

Recommendation III. Data centers, databases, and analytical tools that address issues of

data discovery, scalability, and community HPC access should be further developed.

Data discovery through accessible repositories, semantic integration of associated metadata,

and scalable analyses are crucial for the ‘omics community to address many of the globally

significant and/or societally relevant questions. This level of data integration will require

ingenuity and collaboration between domain scientists, cyberinfrastructure developers,

statisticians, and visualization experts. As resources, tools, and experts become available, the

‘omics community should support the development of innovative ideas.

Recommendation IV. Data visualization and statistical analytical frameworks should be

integrated into standard ‘omics analysis workflows and software. Conversations with data

visualization and statistical experts have already started, but stronger integration, through the

development of collaborations and interdisciplinary projects, will be necessary to

expand and enable the full potential of ‘omics data.

Recommendation V. The cyberinfrastructure should enable and encourage “big data”

best practices and standards for the community. The on-going efforts of the GSC to create

a ‘gold star’ standard for user submitted data sets should continue to be supported and adopted

by the community. With a concerted effort by leaders in this field, from single investigators to the

established institutions, these standards can be adopted by the ‘omics community over time.

This level of cohesion should also be extended to the development of workflows describing data

processing and data products. As a community, we should explore existing tools, such as

Galaxy, Protocols.io, and EarthCube’s GeoSoft project for curating and disseminating

protocols, software, and scripts. In addition, international collaboration should be further

developed and encouraged, with the assistance of the EarthCube Liaison Team. For example,

European efforts in metagenomics and microbiome studies (e.g. Marine Ecological Genomics

and EBI Metagenomics) have parallel goals and objectives to those described here.

Recommendation VI. The ‘omics and cyberinfrastructure communities should enable

and provide a platform for future ‘omics research via streamlined, accessible, state-of-

the-art education, training tools, and best practices. The complex network of ‘omics

research requires individuals to be savvy in field, laboratory, and computer-based techniques. In

order to continue pushing the science forward, the ‘omics community should strive to develop

http://img.jgi.doe.gov/

http://imicrobe.us/


http://earthcube.org/info/about/funded-projects

https://galaxyproject.org/

https://www.protocols.io/

http://www.isi.edu/ikcap/geosoft

http://mb3is.megx.net/

https://www.ebi.ac.uk/metagenomics/


and disseminate educational training tools, such as training workflows, demonstration videos,

interactive workshops, and training courses. Effective knowledge transfer to the next generation

of ‘omics researchers, developers, and innovators will be necessary to position them to take

‘omics science into the future. Through EarthCube, ECOGEO will develop a foundation of

training videos, but proper development, assessment, and improvements will require support

and a large community effort.

Upcoming events

The ECOGEO RCN has several activities planned for the remaining year of NSF funding

(through August 2016). In addition to hosting a second workshop (late Spring, early Summer

2016) as funded in the original award (1440066), supplementary funds were granted by the NSF

Division of Ocean Sciences (OCE) for additional activities. In January/February, the ECOGEO

RCN will run a small working group focused on creating 12 complete EarthCube use cases.

Prior to Workshop I, participants were asked to submit use cases to be reviewed and discussed

during the workshop. Due to time constraints, we were only able to review six use cases, but we

are keen to work with the TAC Use Case Working Group to flesh out all 12 use cases, including

integration into EarthCube resources where possible, and then contribute them to the

EarthCube use case repository. In addition, the ECOGEO RCN will be hosting a Town Hall at

the 2016 ASLO/AGU/TOS Ocean Sciences Meeting in New Orleans, LA. The Town Hall will be

held on 25 February from 12:45-13:45 in the Ernest N. Morial Convention Center (217-219). The

Town Hall is intended to introduce the OSM community to EarthCube and the on-going efforts of

the ECOGEO RCN. Because we already have representation on the EarthCube Engagement

Team, several “Introduction to EarthCube” resources are already under development. Our final

workshop will focus on creating instructional webinars that demonstrate ‘omics tools and data

portals, as well as implementing the developed use cases. The main goal is to train the next

generation of ‘omics researchers and develop ways for them to integrate their research with

EarthCube’s on-going mission to enable data science through cyberinfrastructure.

Appendices (pages 12 – 40)

Appendix I: Workshop I Agenda

Appendix II: Participant List

Appendix III: Participant Use Cases and Use Case Template

Appendix IV: Community Survey and Summary

http://osm.agu.org/2016/

(updated 15 Sep 2015)

Workshop I – Agenda See also - http://cmore.soest.hawaii.edu/rcn2015/agenda.htm

27 August 2015, East-West Center (EWC), UH Manoa

0745-0800 Depart hotel for EWC

0800-0830 Morning coffee, light breakfast at EWC

0830-1200 Big Data, Big Ideas - Joint session w/ STC directors meeting (Keoni Auditorium)

0830-0845 Advanced Networking Critical Infrastructure for Big Data and Global

Collaborative Science – David Lassner, President – University of Hawaii

0845-0945 See the Angel, and Other Thoughts on Breakthrough Science

Hon. Daniel S. Goldin, Founder & Chairman, Intellisis Corporation;

9th NASA Administrator, Retired

0945-1015 Coffee Break

1015-1035 Geophysical Sensors, Ice Sheets – Prasad Gogineni, Director CReSIS

1035-1055 Life Science Applications – Ananth Grama, Associate Director CSoI

1055-1115 The Role of STC's and BIO Centers in the Face of Big Data – Erik

Goodman, Director BEACON

1115-1135 Big Data Visualization – Jason Leigh, Information and Computer

Sciences, UH Manoa

1135-1200 Open mic – Moderator Ed DeLong, ECOGEO PI & C-MORE Co-Director

1200-1300 Lunch at EWC (w/ STCs for group discussion, Bottom Floor)

1300-1330 ECOGEO only (Asia Room, 2nd floor): Opening Remarks; Goals and Agenda

1330-1430 Breakout I (Asia, Sarimanok, Kaniela): Science and CI drivers

1430-1500 Report I: Science/CI drivers (Asia)


1530-1630 Breakout II (Asia, Sarimanok, Kaniela): Use cases

1630-1700 Report II: Use cases (Asia)

1700 Depart EWC for hotel

1800 Dinner at Waikiki Aquarium (w/ STC meeting) – Bring your name tag!


28 August 2015, Morning – C-MORE Hale, Late-morning – EWC

0745-0800 Depart hotel for C-MORE Hale

0800-0830 Morning coffee, light breakfast at C-MORE Hale

0830-0900 Debrief for Community Telecom

0900-1030 Community Telecom – live video stream, interactive Q&A webinar

• Overview of workshop

• Panel presentation of breakouts: Science/CI drivers, Use cases

• Open community forum for discussion and feedback

1030-1100 Coffee Break at EWC (Asia)

1100-1130 Discussion re: Community Telecom, remaining agenda items (Asia)

1130-1220 Presentation and discussion: Linking environmental and sequence databases

Bonnie Hurwitz, iMicrobe; Danie Kinkade, BCO-DMO

1230-1300 Lunch with presentation on ECOGEO Resource Viewer (Bottom Floor)

Benjamin Tully, USC/C-DEBI; Ilya Zaslavsky, UCSD/EarthCube CINERGI BB

1300-1400 Discussion and brainstorm on data visualization (Asia)

Jason Leigh, UH; Khairi Reda, UH/ Argonne NL; Madhi Belcaid, UH/ HIMB

1400-1500 Breakout III (Asia, Sarimanok, Kaniela): Final list-CI needs, potential solutions


1530-1630 Report III: Final list of CI needs, potential solutions - addressed in the final report

1630-1700 Final Presentations

Lenny Teytelman, ZappyLab – Protocols.io

Francisco Chavez, MBARI – eDNA workflow

1700-1715 Outline of workshop report, ECOGEO RCN’s next steps

1715 Depart EWC for hotel – Aloha and Mahalo!

Workshop Participant List (updated 14 Sep 2015)

Last            First       Institution
Alexander       Harriet     MIT
Allen           Andrew      JCVI, UCSD
Allen           Eric        SIO, UCSD
Alm             Eric        MIT
Amend           Jan         USC, CDEBI
Aronson         Emma        UC Riverside, EarthCube
Belcaid         Madhi       UH (HIMB)
Bender          Sara        GBMF
Buchan          Alison      U Tenn
Chavez          Francisco   MBARI
Chivian         Dylan       LBNL
Cleveland       Sean        UH (ITS)
Crump           Byron       Oregon State
DeLong          Edward      UH (ECOGEO Lead PI)
Dyhrman         Sonya       Columbia
Follows         Michael     MIT
Gomez           Basil       UH, EarthCube
Grethe          Jeffrey     UCSD
Hallam          Steven      UBC
Heidelberg      John        USC, C-DEBI
Hurwitz         Bonnie      U Arizona, iMicrobe
Jacobs          Gwen        UH (ITS)
Jenkins         Bethany     U Rhode Isl
Kinkade         Danie       BCO-DMO, EarthCube
Kyrpides        Nikos       JGI
Leigh           Jason       UH (Visualization)
Levine          Naomi       USC
Mackey          Katherine   UC Irvine
Matsen          Frederick   Hutchinson
Meier           Ouida       UH, EarthCube - CReSCyNT
Merrill         Ron         UH (ITS)
Moran           Mary Ann    U Georgia
Murray          Alison      DRI/U Nevada
Nahorniak       Jasmine     Oregon State
Neuer           Susanne     Arizona State
Orphan          Victoria    Cal Tech
Polson          Shawn       U Delaware
Reda            Khairi      UH, Argonne NL
Rubin           Ken         UH, EarthCube
Saito           Mak         WHOI
Schanzenbach    David       UH (ITS)
Seracki         Michael     NSF
Stanzione       Dan         TACC
Teske           Andreas     UNC
Teytelman       Lenny       ZappyLab, Protocols.io
Tully           Ben         USC, C-DEBI
Waldbauer       Jacob       U Chicago
Wood-Charlson   Elisha      UH (ECOGEO Communications)
Zaslavsky       Ilya        UCSD
Zeigler Allen   Lisa        JCVI, UCSD
Zinser          Erik        U Tenn


Aloha 2015 ECOGEO Workshop Participants,

We are really looking forward to having you join us in Hawai‘i on the 27-28 August for the first

ECOGEO RCN workshop. In order to prepare for the workshop, the organizers (Ed, Elisha, and

the ECOGEO Steering Committee) would greatly appreciate having your research group

contribute a single Use Case related to your work in environmental ‘omics.

As our first workshop is focused on core issues in ‘omics research, many of the invited

participants (list available on the website) represent the senior research/PI level. Therefore, we

ask that you use this Use Case development opportunity to involve your research group in the

conversation. Below are a few points to help provide some direction, but don’t hesitate to

contact Elisha if you have questions or would like feedback.

1. Please draft a Use Case that highlights a current challenge/limitation for your ‘omics

research (see the provided Use Case as an example).

2. The provided Use Case represents a current big picture ‘omics question/challenge.

Depending on your Use Case, this may or may not be appropriate. Any level of focus

and/or complexity is welcome.

3. We encourage input from all research groups: Science and Tech/CI !

Please submit your Use Case NO LATER than 10 August 2015!

(Elisha: [email protected])

Prior to the workshop, we will review the submitted Use Cases with the aim of collecting and

preparing representative examples for 1) focused discussion on solving challenges and 2)

progressing each Use Case towards functionality in research and training.

Mahalo, looking forward to seeing you all very soon! Please refer to the website for logistics and

documents related to the workshop.

Cheers!

Elisha Wood-Charlson and Ed DeLong





http://cmore.soest.hawaii.edu/rcn2015/files/ECOGEO_UseCase_example.pdf


Use Case Template (revised from EarthCube version 1.1)

Summary Information Section

Use Case Name

Contact(s)

Overarching Science Driver (these can be refined during the workshop)

Science Objectives, Outcomes, and/or Measures of Success

Key people and their roles

Basic Flow
Describe steps to be followed. Document as a list here and/or as a diagram (see use case example)

1. 2. 3. 4. 5. 6. 7. …

Critical Existing Cyberinfrastructure
o
o
o

Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.

Critical Cyberinfrastructure Not in Existence
o
o
o

Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI)
Please list particulars that come to mind, but don’t focus on completing the story. This can be expanded during the workshop.

Problems/Challenges (any barriers to successful completion of use case)
For each one, list
- The challenge
- What, if any, efforts have been undertaken to fix these problems?
- What recommendations do you have for tackling this problem?

1. 2. 3. 4. …

References (links to background or useful source material)

Notes (any additional information that does not fit in a previous category)




Use Case Name Cosmopolitan species physiological response and strain variability across ecological gradients

Contact(s) Harriet Alexander ([email protected])

Overarching Science Driver (these can be refined during the workshop) Understand the role of species-level genome variability in the success of a species complex across environmental gradients.

Science Objectives, Outcomes, and/or Measures of Success

Aggregate available meta-omic datasets that contain an organism or sequence of interest

Create analysis workflow for pulling out target species from within meta-omic datasets

Key people and their roles

Sonya Dyhrman (lead PI)

Harriet Alexander


Basic Flow

1. Select all available metagenomic/metatranscriptomic datasets based on location within the water column (euphotic zone 1% surface irradiance)

2. Query selected datasets for the presence of organism of interest (based on query sequence, genome, or transcriptome) within the omic dataset.

3. Extract metadata, sequences associated with your taxonomic query, and information associated with the sequences (e.g. relative sequence abundance)

4. Run expression, statistical, and alignment analyses locally

5. Visualize data locally
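Steps 1 to 3 of this flow can be sketched in Python under heavy simplifying assumptions. The dataset records, the 200 m depth cutoff, and the function names are hypothetical, and exact k-mer sharing stands in for the read mappers named in this use case (BWA, Bowtie); it is not how those tools work and is used here only to keep the example self-contained.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_by_depth(datasets, max_depth_m):
    """Step 1: keep datasets sampled at or above a depth cutoff (assumed here)."""
    return [d for d in datasets if d["depth_m"] <= max_depth_m]


def relative_abundance(reads, query, k=8):
    """Steps 2-3: fraction of reads sharing at least one k-mer with the query."""
    if not reads:
        return 0.0
    query_kmers = kmers(query, k)
    hits = [r for r in reads if kmers(r, k) & query_kmers]
    return len(hits) / len(reads)


# Illustrative data only; identifiers, depths, and reads are made up.
query = "ATGGCGTACGTTAGC"
datasets = [
    {"id": "D1", "depth_m": 5, "reads": ["TTATGGCGTACGAA", "CCCCCCCCCCCC"]},
    {"id": "D2", "depth_m": 1000, "reads": ["ATGGCGTACGTTAGC"]},
    {"id": "D3", "depth_m": 75, "reads": ["GGGGGGGGGGGG", "TTTTTTTTTTTT"]},
]

results = {
    d["id"]: relative_abundance(d["reads"], query)
    for d in select_by_depth(datasets, max_depth_m=200)
}
print(results)
```

At realistic scale the read sets would not fit in local memory, which is the motivation for the remote query portal requested under "Critical Cyberinfrastructure Not in Existence".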

Critical Existing Cyberinfrastructure

o iMicrobe, NCBI SRA, IMG, EBI, JGI

o Python, IPython, Amazon cloud for HPC, virtual machines

o Bioinformatic tools for mapping sequences (BWA, Bowtie), assembling sequences (Trinity, Velvet, ABySS), clustering sequences (CD-HIT), taxonomically binning sequences (ClaMS, PhyloPythia, ESOM)

Critical Cyberinfrastructure Not in Existence

o Portal similar to JGI or EBI that can be used to browse meta-omic data without having to download it locally

o Standardized data format for environmental sequence data collected on different platforms

o Some means of linking omic datasets based on organisms/genes present



Problems/Challenges

1. How can we unify the type of sequence data that is made available from environmental studies? What types of data should be required?

a. We should decide upon what types of data should be

2. Querying against tens to hundreds of large omic data sets requires more computational time and memory than is feasible locally.

a. Many groups have started to use a combination of cloud computing (e.g. Amazon cloud) and virtual machines to perform analyses. If the databases provided through EarthCube could be made to stream into such a platform, analyses might be made easier.

3. In an ideal world, every time a meta-omic dataset was added to the overarching database, that dataset would be queried against all other environmental/culture datasets. For example, genes would be clustered with like genes from other environments, species common across environments would be highlighted, and patterns of k-mer abundance might be tracked and correlated. The goal here would be to create a synthetic , this might place the data within the new dataset into greater context and consequently make further analyses more streamlined.

a. This particular challenge is still a bit far off from being solved. I think that work needs to be done to improve the actual computational tools that we currently have available to make such computational efforts more tractable.






Use Case Name Trait-based modelling of community response to changing conditions. (Overall goal: integrate time-series biogeochemical measurements, meta-transcriptomics, 16S profiling, metagenomic assemblies, isolate genomes, isolate metabolite dependencies, and sometimes isolate metabolomics and meta-metabolomics.) (Note: this is not one of my own experiments, but rather I have several collaborations pursuing questions using such rich data. Example systems include Desert Crust and Mediterranean Grassland Rhizosphere, but similar experimental designs are also used in aquatic environments).

Overarching Science Driver (these can be refined during the workshop) Understand key functional genes and the roles of the trait-guild member species in adaptation to a perturbed environment.

Science Objectives, Outcomes, and/or Measures of Success 1. Identify key functional genes in perturbation response. 2. Link key functional genes to species. 3. Model trait-guild member species and their interactions.

Key people and their roles Dylan Chivian, Ulas Karaoz - Science and CI Eoin Brodie - PI Trent Northen - PI


Contact(s) Dylan Chivian ([email protected])

Basic Flow

1. Assembly and annotation of isolate genomes.

2. Assembly, annotation, binning, and assessment of MG-derived genomes.

3. Meta-transcriptomic abundance calculations against isolate and MG-derived genomes.

4. Trait-guild member assignment.

5. Integration of metabolomic and meta-metabolomic data into species models.

6. Time-series models of community adaptation.

7. Stats and visualization.

Critical Existing Cyberinfrastructure

o KBase/RAST/ModelSEED/MG-RAST, M-suite, QIIME, IMG, IMG/M, ggKbase, iMicrobe, PathwayTools, MicrobesOnline, metaMicrobesOnline

o R, MeV, SparCC, kallisto, bowtie, Cytoscape (analysis and viz)

Critical Cyberinfrastructure Not in Existence

o Easy access to rapid metagenomic assembly and binning.

o Easy integration of metabolomics data into metabolic modeling.

o Meaningful compartmentalized metabolic models and interaction networks.

o Easy trait-guild modeling and viz.

o Easy time-series trait-based modeling.

Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI) Please list particulars that come to mind, but don’t focus on completing the story. This can be expanded during the workshop. This will be done during the workshop.


Problems/Challenges

1. Tools and data formats are sometimes inconsistent. One-stop shopping would be nice.

2. Ease of use for non-coder biologists is desirable.

3. Information-rich but clear data and analysis viz is hard to make. Need to make it more available to biologists.






Use Case Name Using population genomes to analyse taxon specific functional constraints

Contact(s) Jack A. Gilbert: [email protected] Naseer Sangwan: [email protected] Chris Marshall: [email protected] Melissa Dsouza: [email protected] Pamela Weisenhorn: [email protected]

Overarching Science Driver To understand how translational fine-tuning shapes microbial genome evolution in natural environments

Science Objectives, Outcomes, and/or Measures of Success

(i) Create a habitat-specific database of population-level orthologous genes with pre-calculated metrics, i.e. codon bias and dN/dS.

(ii) Create new workflows and analysis pipelines to compute codon bias and dN/dS values across fragmented metagenome assemblies representing complex environments, e.g. soil/sediment.

(iii) Create new normalization methods for accurate correlation between dN/dS and codon bias values of population-level genes.

Key people and their roles

Jack A. Gilbert: Lead PI

Naseer Sangwan: Postdoctoral researcher

Chris Marshall: Postdoctoral researcher

Pamela B. Weisenhorn: Postdoctoral researcher

Melissa Dsouza: Postdoctoral researcher

Basic Flow

1. Quality trimming and de-novo assembly of shot-gun metagenome datasets

2. Binning metagenome contigs into population genomes (pan-genomes)

3. Gene calling on contig bins representing population genomes

4. Identification of orthologous genes between population genomes

5. Cross validation of orthologous genes (i.e. length cut-off, sequencing errors)

6. Calculating pairwise dN/dS and codon bias values

7. Normalization and calculation of pairwise correlation between dN/dS and codon bias profiles

8. Demarcate & functionally characterize protein pairs w/ positive and/or negative selection
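The correlation part of step 7 can be sketched as follows. This is a minimal illustration that assumes per-gene dN/dS and codon bias values have already been computed and normalized upstream; the Pearson coefficient is used here only as a familiar example, since the use case does not name a specific statistic, and the input values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical per-gene values for one pair of population genomes.
dnds = [0.10, 0.25, 0.40, 0.80]
codon_bias = [0.70, 0.55, 0.50, 0.20]

r = pearson(dnds, codon_bias)
print(f"r = {r:.3f}")
```

A rank-based statistic may be more appropriate if the relationship is monotonic but not linear; that choice, like the normalization itself, is left open by the use case.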

Critical Existing Cyberinfrastructure

o Alignable Tight Genome Clusters (ATGC) database of prokaryote genomes (has genomes of cultured isolates)

o Integrated Microbial Genomes (IMG) (e.g. can be used to pull orthologous genes)

o MicroScope pipeline (e.g. *has size limit for annotation*)

Critical Cyberinfrastructure Not in Existence

o Central database of population genomes, i.e. reconstructed from metagenomes

o Unique algorithms for calculating codon bias and dN/dS across short protein sequences.

o Accurate normalization method that can handle the average genome size variation across populations

Activity Diagram This can be targeted during the workshop

Problems/Challenges

1. How to access the habitat-specific gene pool information?

Recommendation: create a comprehensive portal that can store such datasets.

2. High-throughput methods to screen orthologous genes across multiple population genomes

a. Some methods exist, but they are specific to genome sequences of cultured microbes.

b. Recommendation: develop new methods or modify the existing methods to target genome bins representing a mix of strains or species.

3. How to calculate accurate rates of evolution and codon bias on short protein sequences.

a. There are some methods, but they are not validated for errors and bias introduced during metagenome data analysis, e.g. length variation, average genome size variation, etc.

b. Recommendation: develop a new method to calculate and normalize the dN/dS and codon bias profiles of population genomes, e.g. one that considers average genome size variation.

References

Ran W, Kristensen DM, Koonin EV. (2014). Coupling Between Protein Level Selection and Codon Usage Optimization in the Evolution of Bacteria and Archaea. mBio 5:e00956–14.

Nielsen R. (2005). Molecular signatures of natural selection. Annu Rev Genet. 39:197-218.

Notes




Use Case Name Linking global models of nutrient limitation to gene expression of nutrient-specific responses in diatoms

Contact(s) Bethany Jenkins University of Rhode Island, Joselynn Wallace PhD candidate University of Rhode Island

Overarching Science Driver (these can be refined during the workshop) Linking global biogeochemical models to in-situ measurements and meta-omics

Science Objectives, Outcomes, and/or Measures of Success Compile micro (trace metals, vitamins) and macro (N, P, Si) nutrient concentration measurements, CTD depth profiles, measures of biodiversity and metagenomics, and gene-specific expression or metatranscriptome data into a queryable database.

Key people and their roles GEOTRACES, PDC, K. Buck – trace metal concentration and distribution, Fe speciation BDJ, PDC, K. Thamatrakoln? – gene-specific expression (genetic markers of Si and Fe limitation of diatoms), metagenomics and transcriptomics

Basic Flow

1. Use global models predicting the role of nutrient limitation on primary production of key phytoplankton taxa to select oceanic region of interest.

2. Filter by depth horizon

3. Retrieve historical macro and micronutrient measurements collected from this region and filter data by concentration of a given nutrient

4. Retrieve ‘omics datasets from this region (this is the crux of this pipeline: matching the nutrient data with the ‘omics data and finding relevant ‘omics data)

5. Compile locations of nutrient measurements at a range of selected values with ‘omic data: availability of metagenomes and metatranscriptomes

6. Determine from metagenomics data if target organisms or taxa are present at target nutrient values

7. Filter metatranscriptome data by taxonomy to only retrieve transcripts from target taxonomic group (2nd crux of pipeline: need to interface with phylogenetics infrastructure).

8. Use downstream measures to search for specific genes (e.g. BLAST)

Critical Existing Cyberinfrastructure

o World Ocean Database (Atlas) (https://www.nodc.noaa.gov/OC5/indprod.html)

o BCO-DMO (http://www.bco-dmo.org/)

o GEOTRACES International Data Assembly Center

o PANGAEA archive (http://doi.pangaea.de/10.1594/PANGAEA.840721)

o iMicrobe (http://imicrobe.us/)

o EBI metagenomics (https://www.ebi.ac.uk/metagenomics/)

o European Nucleotide Archive (http://www.ebi.ac.uk/ena)

o NCBI (http://www.ncbi.nlm.nih.gov/)

o QIIME (http://qiime.org/)


Critical Cyberinfrastructure Not in Existence

o Centralized or cross referenced queryable repository of global model/map overlays, nutrient and in-situ measurements, and associated ‘omics data.

o Integrated taxonomic pipelines for ‘omics data


Activity Diagram

1. Global map of nutrient limitation for diatoms

2. Define region of Fe limitation in N equatorial Atlantic

3. Query db of mixed layer depth samples from specified region with measured Fe values below specified level. Return data with Fe and all other measured nutrient and profiling data (e.g. temp, salinity etc).

4. Query database (same or different) with metagenomics information that is cross referenced to samples

5. Apply taxonomic filtering to data (requires integrated pipeline for taxonomic classification)

6. Retrieve metatranscriptome data for sample containing taxonomic targets



Problems/Challenges

1. Cross referencing of data (BCO-DMO): having an “accession number” for each sample that is recapitulated through all data records, so that records can be housed in different databases but search engines can query by record and then for specific types of associated data.

2. Discoverability of ‘omics data: data currently live in a variety of repositories (NCBI, EBI, iMicrobe), and submissions don’t presently contain links to metadata records. ‘Omics data may need to live in a separate mirrored repository to facilitate retrieval.
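The shared accession number proposed in challenge 1 is what would make a cross-repository query possible. A minimal Python sketch of that idea follows; the field names (`sample_accession`, `fe_nM`, `omics_id`, `data_type`), the identifiers, and the Fe threshold are all hypothetical, and the two lists stand in for records held by separate repositories.

```python
def link_by_accession(nutrient_records, omics_records, max_fe_nM):
    """Pair low-Fe nutrient records with 'omics records sharing a sample accession."""
    # Index the 'omics records once so each lookup is a dictionary access.
    omics_by_sample = {}
    for rec in omics_records:
        omics_by_sample.setdefault(rec["sample_accession"], []).append(rec)

    linked = []
    for nut in nutrient_records:
        if nut["fe_nM"] >= max_fe_nM:
            continue  # keep only samples below the Fe threshold
        for rec in omics_by_sample.get(nut["sample_accession"], []):
            linked.append({
                "sample_accession": nut["sample_accession"],
                "fe_nM": nut["fe_nM"],
                "omics_id": rec["omics_id"],
                "data_type": rec["data_type"],
            })
    return linked


# Illustrative records only; none of these identifiers are real.
nutrients = [
    {"sample_accession": "SAMP-001", "fe_nM": 0.08},
    {"sample_accession": "SAMP-002", "fe_nM": 0.90},
    {"sample_accession": "SAMP-003", "fe_nM": 0.05},
]
omics = [
    {"sample_accession": "SAMP-001", "omics_id": "MG-17", "data_type": "metagenome"},
    {"sample_accession": "SAMP-001", "omics_id": "MT-04", "data_type": "metatranscriptome"},
    {"sample_accession": "SAMP-002", "omics_id": "MG-18", "data_type": "metagenome"},
]

linked = link_by_accession(nutrients, omics, max_fe_nM=0.2)
for row in linked:
    print(row)
```

SAMP-003 meets the Fe criterion but returns nothing because no ‘omics record carries its accession, which mirrors the discoverability gap described in challenge 2.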

References (links to background or useful source material) Global model images from J. Keith Moore, Keith Lindsay, Scott C. Doney, Matthew C. Long, and Kazuhiro Misumi, 2013: Marine Ecosystem Dynamics and Biogeochemical Cycling in the Community Earth System Model [CESM1(BGC)]: Comparison of the 1990s with the 2090s under the RCP4.5 and RCP8.5 Scenarios. J. Climate, 26, 9291–9312.





Use Case Name Systems analysis linking information from metagenomic, metatranscriptomic, and metaproteomic datasets with key physical and chemical parameters.

Contact(s) Robert M. Morris, University of Washington ([email protected])

Overarching Science Driver (these can be refined during the workshop) To identify the key environmental parameters controlling the activities of microbial communities across an ocean gradient in organic and inorganic nutrients

Science Objectives, Outcomes, and/or Measures of Success

Synchronize community “omics” datasets (ID, location, time, replicate, annotations!!!)

Extract information across datasets (genes, transcripts, proteins with same annotations)

Key people and their roles

Virginia Armbrust, Adam Martini (genomics)

Mary Ann Moran (transcriptomics)

Robert Morris (proteomics)


Basic Flow

1. Identify samples with matching datasets (physical, chemical, biological)

2. Download and retrieve appropriate datasets (omics, metals, nutrients, etc.)

3. Synchronize biological omics datasets (annotate using standard annotations)

4. Identify categories for comparison (KEGG paths, EC numbers, taxonomy, etc.)

5. Extract data for comparative analyses

6. Determine genetic potential, gene regulation, and expressed protein functions

7. Multivariate analysis of biological activity with physical and chemical parameters

Critical Existing Cyberinfrastructure

o **Standard annotation database developed by Mary Ann Moran

o Data archives (BCO-DMO, NCBI, MG-RAST, SILVA-RDP-Greengenes for 16S)

o Comet: An open source MS/MS sequence database search tool

o KBase: A systems biology knowledge base (mostly genomic at this point)

Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.

Critical Cyberinfrastructure Not in Existence

o Database to host datasets

o Tools for comparative analyses of “omics” datasets (establishes links)

o File conversion and export capabilities

Activity Diagram (more detail than basic flow, including inputs/outputs, incorporating tech/CI) Please list particulars that come to mind, but don’t focus on completing the story. This can be expanded during the workshop. Will be done at the workshop


1. What data are available?
   A) BCO-DMO does some of this, but is not specialized for large biological datasets generated by genomics, transcriptomics and proteomics.
   B) The host site should have a fairly uniform summary diagram with links to available data, data that is coming, and data available through other sources.

2. Data are in different formats (raw files, processed files, annotated/unannotated)
   A) Existing data archives (above) do this, but they can be very difficult to navigate and the file formats are not always consistent (for meta “omics” data).
   B) Some standards regarding data formats should be established.

3. Some datasets have been deduplicated and some have not
   A) Many sites offer both versions.
   B) This is particularly challenging when annotations don’t match. Decisions about annotation (in addition to sequence similarity) will impact this.

4. Multiple files are available (size fractionated, replicates, etc.)
   A) This is often done, but naming schemes are not uniform.
   B) Develop some standards for naming when data are deposited so that the user will know if there are replicates, different size fractions, etc.

5. Processed files from published results are often unavailable
   A) Not always required.
   B) Should be able to save and export data at different stages of analyses.
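To illustrate what a deposit naming standard (item 4) could make possible, the sketch below parses a purely hypothetical file naming convention of the form `<project>_<sample>_<size fraction>_rep<N>_<stage>.<ext>`. The convention, the field names, and the list of processing stages are assumptions invented for this example; the use case does not propose a specific scheme.

```python
import re

# Hypothetical convention: <project>_<sample>_<low>-<high>um_rep<N>_<stage>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)_"
    r"(?P<sample>[A-Za-z0-9]+)_"
    r"(?P<fraction>[0-9.]+-[0-9.]+um)_"
    r"rep(?P<replicate>\d+)_"
    r"(?P<stage>raw|qc|assembled|annotated)"
    r"\.(?P<ext>[A-Za-z0-9.]+)$"
)


def parse_name(filename):
    """Return the fields encoded in a conforming filename, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    fields["replicate"] = int(fields["replicate"])
    return fields


conforming = parse_name("CRUISE01_S04_0.2-1.6um_rep2_annotated.tsv")
nonconforming = parse_name("reads_final_v2.fastq")
print(conforming)
print(nonconforming)
```

With a scheme of this kind, a user could tell replicates, size fractions, and processing stages apart from the filename alone, which also supports exporting data at different stages of analysis (item 5).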






Use Case Name Increasing identification rates for peptide mass spectra from ocean metaproteome datasets

Contact(s) Jacob Waldbauer

Overarching Science Driver (these can be refined during the workshop) Develop clearer picture of protein-level gene expression patterns and regulation for quantitative understanding of metabolic & biogeochemical processes

Science Objectives, Outcomes, and/or Measures of Success
o Develop ‘Ocean Metaproteome Atlas’ for comparative analysis of protein-level expression in oceanographic context
o Compare community spatiotemporal gene expression patterns between transcript & protein levels, and examine relationships with activities of biogeochemical interest
o Ultimately, develop a sufficiently mechanistic & quantitative picture of expression regulation & consequent metabolic activity in marine microbes to contribute to predictive biogeochemical models of ocean carbon & nutrient cycling

Key people and their roles


1. Collate & integrate ocean metaproteome datasets
2. Extract potentially informative peptide fragmentation spectra
3. Develop refined sequence databases for PSM searching
4. Sequence peptides by database searching, de novo, spectral library and/or hybrid methods
5. Control FDR on putative sequence IDs in integrated statistical framework
6. Assign gene ID, function and/or taxon to identified peptides
7. Compare & visualize identified peptides across metaproteome samples
8. Contribute identified spectra to community spectrum library
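The FDR control step in the flow above is not tied to a particular method in this use case. As one possible illustration, the sketch below uses the widely used target-decoy approach, in which the false discovery rate at a score threshold is estimated as the number of decoy matches divided by the number of target matches at or above that threshold. The input format, the scores, and the function names are assumptions made for the example.

```python
def target_decoy_qvalues(psms):
    """Estimate q-values for peptide-spectrum matches (PSMs).

    `psms` is a list of (score, is_decoy) pairs, higher score = better match
    (hypothetical format). Returns (score, is_decoy, q_value) tuples sorted
    by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR if the threshold were placed at this PSM.
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at which this PSM would still be accepted.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvalues)]


def accepted_target_scores(psms, max_q):
    """Scores of target PSMs whose q-value is at or below `max_q`."""
    return [s for s, d, q in target_decoy_qvalues(psms) if not d and q <= max_q]


# Hypothetical scored PSMs: (score, is_decoy).
example_psms = [
    (9.0, False), (8.0, False), (7.0, False),
    (6.0, True), (5.0, False), (4.0, True),
]
strict = accepted_target_scores(example_psms, max_q=0.01)
relaxed = accepted_target_scores(example_psms, max_q=0.25)
print(strict, relaxed)
```

An integrated framework as described in step 5 would need to combine scores from database, de novo, and spectral library searches before a calculation like this one; that combination is not attempted here.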

Critical Existing Cyberinfrastructure
o Peptide-spectrum matching, spectral library searching and de novo sequencing algorithms (of varying speed/parallelizability)**

Sidenote: Please identify (**) ‘omics tool(s) listed here that are not easily accessible and may be good candidate(s) for a community CI application.

Critical Cyberinfrastructure Not in Existence
o Sharing/integration platform for raw metaproteome data in open format(s)
o Automated pipeline/expert system for generating/optimizing proteome search databases
o Integrated system for linking peptide IDs with annotation/taxonomy systems
o Community spectral library of confident peptide IDs



1. Sharing metaproteomic mass spec data
   a. Currently, PRIDE and MassIVE repositories active, but little ability to integrate oceanographically-relevant metadata
   b. Recommendation: work with proteomeXchange and/or MassIVE (CCMS, UCSD) to develop ocean-specific metaproteome repository

2. Focusing on most (potentially) informative spectra
   a. Recommendation: develop generalized criteria (via machine learning?) for sequence-information content of fragmentation spectra – will cull large amounts of uninformative data

3. Arriving at consensus, FDR-controlled sequence ID & annotation from multiple sequencing methods/annotation streams
   a. Recommendation: Allow Metaproteome Atlas to maintain multiple scored ID candidates for given spectrum, apply parsimony and/or pathway logic at protein and/or organism levels
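One common reading of "parsimony" at the protein level is to report the smallest set of proteins that explains all identified peptides. The sketch below uses a greedy set-cover approximation of that idea; it is an illustration under that assumption, not the method proposed for the Metaproteome Atlas, and the peptide and protein names are invented.

```python
def greedy_parsimony(peptide_to_proteins):
    """Pick a small set of proteins that together explain all peptides.

    `peptide_to_proteins` maps each peptide to its candidate proteins
    (hypothetical format). Greedy set cover: repeatedly choose the protein
    that explains the most still-unexplained peptides; ties are broken
    alphabetically so the result is deterministic.
    """
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    chosen = []
    while unexplained and protein_to_peptides:
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & unexplained),
        )
        covered = protein_to_peptides[best] & unexplained
        if not covered:
            break  # remaining peptides have no candidate protein
        chosen.append(best)
        unexplained -= covered
    return chosen


# Hypothetical candidate assignments: peptide -> proteins it could come from.
candidates = {
    "PEPA": ["protX", "protY"],
    "PEPB": ["protX"],
    "PEPC": ["protZ", "protY"],
}
minimal_set = greedy_parsimony(candidates)
print(minimal_set)
```

Keeping the full candidate map alongside the chosen set preserves the multiple scored ID candidates per spectrum that the recommendation calls for, so the parsimony decision can be revisited as annotations change.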



Summary of ECOGEO’s community survey


EarthCube’s Oceanography and Geobiology Environmental ‘Omics (ECOGEO) Research Coordination Network (RCN) created a survey to assess current community needs and challenges with respect to ‘omics research. The survey was available from Nov 2014 – Jan 2015 and had a total of 105 respondents. Of those, ~90 gave feedback on a majority of questions, while 30-60 responded to the open-ended questions. Results from this survey are summarized below.

Overview

The main areas of ‘omics research currently being explored by our community are metagenomics, 16S/18S taxonomy, and correlating ‘omics data with environmental data (Figure 1). In addition, the majority of our research community regularly collects samples for processing (~85%), conducts in-depth analysis on the output data (~72%), and uses the data for comparative ‘omics (62%) (n=96, with more than one selection possible). However, our community’s engagement with ‘omics data ranges from doing limited analysis (~47%) to using the data to develop workflows (~40%).

Figure 1. Areas of 'omics research (n=97, more than one selection possible)

Accessing data

Most ‘omics users are able to submit data sets and associated metadata for archival and to search reference databases by sequence similarity or annotation (Figure 2). However, we struggle to search by associated metadata/project characteristics, and we definitely face challenges in accessing unique data sets not in the main reference databases (a.k.a. “dark data”).

Figure 2. Resources already in use or would like to use. (Bar chart; axis from 0% to 100%; series: Already use, Would like to use.)


Community Workflows

Feedback on idealized workflows provided good fodder for use case development, the need for which was highlighted in Figure 2 (“would like to use” – case studies, interactive webinars). Therefore, we are asking the 2015 workshop participants to submit a use case prior to the workshop, one that highlights a current challenge in ‘omics research. We have provided a template form and an example use case focused on using metadata to retrieve targeted data sets for further exploration. During the workshop, we will discuss several representative use cases to 1) highlight areas that we, as a community, need to focus on to move our research forward, and 2) establish a repository of training tools for the next generation of ‘omics researchers.

Barriers to Research

The general consensus on the barriers to ‘omics research moving forward is summarized in a few key, big-picture points (below). During the 2015 ECOGEO workshop, we will tackle these at a finer scale in an attempt to move solutions forward.

1. Data standards – including quality measures and a way to index data sets that will link samples to environmental metadata, across different types of ‘sequencing’, and throughout various sequence analysis and annotation stages.

2. Central repository of raw and processed data (see #1) that is searchable (see Figure 2) and downloadable with compatible/standardized output, while also having online tools and compute power for processing (and archiving) assemblies, comparative analyses, annotations, visualizations, and statistics.

3. Regular annotation updates on existing databases with potential to request notifications if data sets of interest gain new information.

4. Training – use cases/workflows, training webinars, user-friendly GUI.

5. Last, but far from least – longevity!
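Barriers 1 and 2 come down to a searchable index that links each sample to its environmental metadata. The sketch below shows, under stated assumptions, what a minimal index record and a metadata search could look like. The record fields follow the metadata examples named in the survey (latitude/longitude, date collected, lead PI); the class, the function, and all values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical index entry linking a sample to metadata and data types."""
    sample_id: str
    latitude: float
    longitude: float
    collected: date
    lead_pi: str
    data_types: tuple


def search_by_metadata(records, lat_range, lon_range, start, end):
    """Return records inside a latitude/longitude box and a collection date window.

    Ranges are inclusive. Boxes that cross the antimeridian are not handled.
    """
    return [
        r for r in records
        if lat_range[0] <= r.latitude <= lat_range[1]
        and lon_range[0] <= r.longitude <= lon_range[1]
        and start <= r.collected <= end
    ]


# Invented example records.
index = [
    SampleRecord("S1", 22.75, -158.0, date(2014, 6, 1), "PI A", ("metagenome",)),
    SampleRecord("S2", 45.0, -125.0, date(2014, 7, 15), "PI B", ("metaproteome",)),
    SampleRecord("S3", 22.5, -157.5, date(2012, 3, 3), "PI A", ("metagenome", "16S")),
]

hits = search_by_metadata(
    index,
    lat_range=(20.0, 25.0),
    lon_range=(-160.0, -155.0),
    start=date(2014, 1, 1),
    end=date(2014, 12, 31),
)
hit_ids = [r.sample_id for r in hits]
print(hit_ids)
```

The `data_types` field is where a shared index could record which ‘omics data sets exist for a sample, so that one query returns both the environmental context and the available sequence or proteome data.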

EarthCube Oceanography and Geobiology Environmental 'Omics

ECOGEO is a brand-new, NSF-funded Research Coordination Network (RCN) housed within the EarthCube platform. Please visit http://workspace.earthcube.org/ecogeo for more information and to join our listserv! The mission of this RCN is to identify community needs and develop necessary plans to create a federated cyberinfrastructure to enable ocean and geobiology environmental ‘omics. This survey is designed to address the first part of our mission. We are gathering information regarding the current usage of and community needs for 'omics research in the oceanography and geobiology communities.

This brief research survey should take 5-15 minutes of your time, depending on your level of feedback. Your participation is greatly appreciated, but also voluntary, and you can choose not to answer any question. This survey is anonymous, and there are no foreseeable risks to you for taking part. Please do not include any personal information in your responses.

If you have any questions or concerns regarding this survey, please contact Dr. Elisha Wood-Charlson at the University of Hawai'i at Manoa ([email protected]). If you have questions regarding your rights as a participant, please contact the University of Hawai'i at Manoa Human Studies Program ([email protected]). This study has been reviewed and approved by the University of Hawaii Institutional Review Board (#...).

1. By selecting "Yes", you are indicating your consent to participate in this survey. *

o Yes
o No


2. What area(s) of ‘omics research do you typically work in? (select all that apply)

o Genomics
o Single cell genomics
o Metagenomics
o Transcriptomics
o Metatranscriptomics
o Proteomics
o Metaproteomics
o Metabolomics
o Correlating ‘omics data with environmental data
o Phylogenetics
o 16S, 18S; Taxonomy
o Modeling
o Other (please specify)

3. What area(s) of ‘omics sample and data processing do you typically engage in? (select all that apply)

o Collect samples and process for sequencing
o Limited analysis of processed ‘omics data (e.g. post-QC/QA)
o In-depth analysis (e.g. single data set assembly, annotation, pathways, etc.)
o Workflow development
o Analytical and/or statistical tool development
o Use ‘omics data in modeling
o Comparative ‘omics (e.g. across ‘omic types, complex data sets, integration with metadata)


4. Please indicate ‘omics-associated RESOURCES you use or would like to use. (Response options for each item: Already use / Would like to use)

o Submission of sequence data and metadata for archival services
o Access to unique data sets not available in other sequence repositories
o Search for user-submitted samples by description or project characteristics
o Search for user-submitted samples by sequence similarity (e.g. BLAST, RapSearch)
o Search for user-submitted samples by annotation (e.g. gene function, taxonomy)
o Search for data sets by metadata (e.g. latitude/longitude, date collected, lead PI)
o Access to reference datasets (e.g. non-redundant and RefSeq from NCBI)
o Case studies for training
o Interactive webinars
o Other resources, or additional comments (please specify)

5. Please indicate ‘omics-associated TOOLS you use or would like to use. (Response options for each item: Already use / Would like to use)

o Initial data processing (e.g. QC/QA, trimming)
o BLAST and BLAST-like workflows (e.g. RapSearch)
o Assembly tools (e.g. RayMeta, Newbler)
o Annotation tools (e.g. Pfam, COG/KOG, TIGRFAM, NCBI’s PRK)
o Phylogenetically based annotation services (e.g. MEGAN)
o Workflow pipelines (e.g. Clustering, RAMMCAP, Redundancy filter)
o Comparative pathway analysis (e.g. KEGG, Pfam)
o Statistical tools
o Visualization tools
o Other resources, or additional comments (please specify)

6. If you currently have favorite tools/resources, please list them and explain why they are working for you.

7. To put the previous questions in a research context, please describe your idealized data analysis workflow that would best achieve your main science goals using ‘omics data sets. What do you want ‘omics data to do in order to answer your scientific questions?


8. Please identify the community needs for storage, management, analysis, sharing, integration, and visualization of ‘omic data that you feel are immediate vs. should be considered in future development with a longer-term vision. (Response options for each item: Immediate / Long-term)

o Storage of raw data (akin to the NCBI Short Read Archive for sequence data)
o Storage of processed data (e.g., translated proteins or assembled contigs)
o Storage of data used for biological inference (e.g., differential gene/protein expression)
o Linking different ‘omics for single sample
o Sustainable curation
o Access to high-performance computational resources
o Access to user-submitted data
o Analysis workflows
o Annotation tools
o Comparative pathway tools
o Comparative ‘omics tools
o Statistical tools
o Visualization tools
o Case studies for training

9. Any additional thoughts regarding Question 8?

10. Please comment on what you perceive to be the PRIMARY NEEDS surrounding ‘omics research for the oceanography and geobiology communities.

11. Please comment on what you perceive to be the MAJOR INFRASTRUCTURE BARRIERS for improving ‘omics research in the oceanography and geobiology communities.
