Data Landscapes: The Neuroscience Information Framework

neuinfo.org

Maryann E. Martone, Ph.D.
University of California, San Diego

Organization

• Introduction
• The Neuroscience Information Framework
• A tour of NIF
• The NIF Framework
  – Ontologies
  – NIF Analytics: What can we learn from the data space?
• Where do we go from here?
  – Resource Identification Initiative
  – Conclusions

• NIF is an initiative of the NIH Blueprint consortium of institutes
  – What types of resources (data, tools, materials, services) are available to the neuroscience community?
  – How many are there?
  – What domains do they cover? What domains do they not cover?
  – Where are they?
    • Web sites
    • Databases
    • Literature
    • Supplementary material
  – Who uses them?
  – Who creates them?
  – How can we find them?
  – How can we make them better in the future?

http://neuinfo.org

• PDF files
• Desk drawers

Old Model: Single type of content; single mode of distribution

Diagram labels: Scholar, Library, Publisher

FORCE11.org: Future of research communications and e-scholarship

Diagram labels (participants): Scholar, Consumer, Libraries, Data Repositories, Code Repositories, Community databases/platforms, OA, Curators, Social Networks, Peer Reviewers

Diagram labels (content types): Narrative, Workflows, Data, Models, Multimedia, Nanopublications, Code

Solving the large problems of science?

• Observation
• Experimentation
• Modeling
• Cooperative data-intensive science

“An unaided human’s ability to process large data sets is comparable to a dog’s ability to do arithmetic, and not much more valuable.” –Michael Nielsen, Reinventing Discovery, 2012.

NIF: A New Type of Entity for New Modes of Scientific Dissemination

• NIF’s mission is to maximize the awareness of, access to and utility of digital resources produced worldwide to enable better science and promote efficient use
  – NIF unites neuroscience information without respect to domain, funding agency, institute or community
  – NIF is a library for scholarly output that is a web enabled resource and not a paper
  – Aggregates all the different databases, tools and resources now produced by the scientific community
  – Makes them searchable from a single interface
  – A practical approach to the data deluge
  – Educate neuroscientists and students about effective data sharing

Surveying the resource landscape

NIF resource registry: listing of > 12000 databases, tools, materials, services, websites (> 2500 databases)

NIF data federation: PubMed Central for data

NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views

200 sources; > 800 M records

Registry vs Federation: Metadata about resource vs metadata/data in database

What resources are available for Addiction and GRM1?

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
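The registry/federation distinction above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, field names and the example records are hypothetical and are not NIF's actual data model. It shows why a query such as "GRM1" can miss at the registry level (metadata about the resource) yet hit at the federation level (records inside the database).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegistryEntry:
    """Descriptive metadata ABOUT a resource (field names are hypothetical)."""
    name: str
    description: str
    keywords: List[str] = field(default_factory=list)

    def matches(self, term: str) -> bool:
        # Search only the descriptive metadata, never the database contents.
        haystack = " ".join([self.name, self.description, *self.keywords])
        return term.lower() in haystack.lower()


@dataclass
class FederatedSource:
    """Records held IN a database, exposed for deep query (hypothetical)."""
    entry: RegistryEntry
    records: List[Dict[str, str]] = field(default_factory=list)

    def deep_matches(self, term: str) -> List[Dict[str, str]]:
        # Search the values of every record inside the source.
        term = term.lower()
        return [r for r in self.records
                if any(term in value.lower() for value in r.values())]


# Made-up example data; not taken from any real NIF source.
entry = RegistryEntry(
    name="ExampleExpressionDB",
    description="Gene expression changes reported in addiction studies",
    keywords=["addiction", "gene expression"],
)
source = FederatedSource(
    entry=entry,
    records=[
        {"gene": "GRM1", "condition": "chronic morphine"},
        {"gene": "EXAMPLE2", "condition": "saline"},
    ],
)

print(entry.matches("addiction"))   # registry-level hit on the description
print(entry.matches("GRM1"))        # registry metadata does not mention the gene
print(source.deep_matches("GRM1"))  # federation-level hit inside the records
```

In this toy setup the gene name is only discoverable through the deep query, which is the point the slide makes about descriptive metadata alone.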

How do resources get added to the NIF?

• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry
• Requires no special skills
• Site map available for local hosting

NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr

Low barrier to entry; incremental refinement

What about my data?

• Best practice:
  – Put it in a repository
• What repository?
  – Community repository for your data type, e.g., GEO
  – General repository:
    • Dryad
    • FigShare
  – Institutional repository
    • Research libraries are setting up repositories to manage their “digital assets”

NIF can help you find a place for your data

Requirements for effective data sharing

• Discoverability
  – Data can be found
• Accessibility
  – Data can be accessed and access rights are clear
  – Links to data are stable
• Assessability
  – The reliability of the data can be determined
• Understandability
  – The data can be understood
• Usability
  – The data are in a usable form

Duality of modern scholarship: A machine and human dimension to each

• Publishing data on your website or as supplemental material is not the best way to make it available

But we have Google!

• Current web is designed to share documents
  – Documents are unstructured data

• Much of the content of digital resources is part of the “hidden web”

• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.

What do you mean by data?

Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD); gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data, e.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies

Which databases do you use?

• Mouse Genome Database
• ClinicalTrials.gov
• PubMed
• dbGaP
• GEO
• NIH RePORTER
• OMIM
• BioNumbers
  – A database of numerical values extracted from literature
• Epigenomics
  – Human epigenomic data to catalyze basic biology and disease-oriented research
• Antibody Registry
  – 2M antibodies
• BioGRID
  – An interaction repository of protein and genetic interactions

Most resources are largely unknown and underutilized

NIF unifies look, feel and access

Making it easier to access and understand distributed databases

Each resource implements a different, though related, model; in many cases the systems are complex and difficult to learn

Exploring the data space

Facets and filters: Progressive refinement of search

More effective to start with a general query and use the navigation to refine search
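As a rough sketch of progressive refinement with facets and filters: start from a broad result set, look at how the results distribute over each facet, then narrow step by step. The records, facet names and helper functions below are hypothetical stand-ins, not the NIF search interface.

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

# Made-up records standing in for search results; facet names are hypothetical.
RESULTS: List[Record] = [
    {"source": "SourceA", "organism": "mouse", "region": "striatum"},
    {"source": "SourceA", "organism": "rat", "region": "striatum"},
    {"source": "SourceB", "organism": "mouse", "region": "hippocampus"},
    {"source": "SourceC", "organism": "mouse", "region": "striatum"},
]


def facet_counts(records: List[Record], facet: str) -> Counter:
    """Count how many records fall under each value of one facet."""
    return Counter(r[facet] for r in records)


def refine(records: List[Record], **filters: str) -> List[Record]:
    """Keep only the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(key) == value for key, value in filters.items())]


# Step 1: general query, then inspect a facet to see what is available.
print(facet_counts(RESULTS, "organism"))

# Step 2: narrow by one facet, then by another.
mouse = refine(RESULTS, organism="mouse")
mouse_striatum = refine(mouse, region="striatum")
print([r["source"] for r in mouse_striatum])
```

Each refinement works on the previous result set, so the user never has to guess the full query up front.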

Some NIF(ty) Features

Current challenge: With so much available, how do I find what I need?

• “What genes are upregulated by chronic morphine?”
  – It depends
• Most often, use cases require connecting a researcher to relevant data sets and appropriate tools
  – Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support

Exploration of NIF: 1. Progressive refinement of search

More effective to start with a general query and use the navigation to refine search

2. “Data trails”: Linking data and analysis tools

Same data: different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID

• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
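The baseline issue above can be made concrete with a small sketch. It assumes a simple two-condition comparison (chronic morphine vs saline), so that switching the reference condition just flips the reported direction; the function name and the example values are hypothetical and not drawn from Gemma or DRG.

```python
FLIP = {"up": "down", "down": "up"}


def relative_to(direction: str, reported_baseline: str, target_baseline: str) -> str:
    """Re-express a direction of change against a common baseline.

    Assumes exactly two conditions, so changing the baseline flips the direction.
    """
    if direction not in FLIP:
        raise ValueError(f"unknown direction: {direction!r}")
    if reported_baseline == target_baseline:
        return direction
    return FLIP[direction]


# Hypothetical statements about the same gene from two sources.
source_1 = {"direction": "down", "baseline": "chronic morphine"}
source_2 = {"direction": "up", "baseline": "saline"}

d1 = relative_to(source_1["direction"], source_1["baseline"], "saline")
d2 = relative_to(source_2["direction"], source_2["baseline"], "saline")
print(d1, d2, d1 == d2)
```

Once both statements are expressed against the same baseline, an apparent contradiction can turn out to be agreement.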

Chronic vs acute morphine in striatum

• Analysis:
  – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  – Results for 1 gene were opposite in DRG and Gemma
  – 45 did not have enough information provided in the paper to make a judgment
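A quick check of the "over half" statement, using only the two counts given above. Treating every statement that was not consistent with DRG as "not confirmed" is an assumption made for this calculation.

```python
total_statements = 1370  # Gemma statements on chronic morphine (from the slide)
consistent = 617         # statements consistent with DRG (from the slide)

# Assumption: everything not counted as consistent is treated as "not confirmed".
not_confirmed = total_statements - consistent

share_consistent = consistent / total_statements
share_not_confirmed = not_confirmed / total_statements

print(f"consistent:    {consistent} ({share_consistent:.1%})")
print(f"not confirmed: {not_confirmed} ({share_not_confirmed:.1%})")
```

Under that reading, roughly 45% of the statements were consistent and roughly 55% were not confirmed.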

NIF is working to make it easier to find where data has gone and what has been done with it

3. SciCrunch: A social network for data and tools

• NIF platform has been adapted to create SciCrunch
  – Beta release: http://scicrunch.com

• Create more narrow community-based portals based on common data platform

• Select your data; organize it as you wish

• Cost effective: a data portal can be set up in a few hours

• Connects communities through data and tools

• Shared curation-shared knowledge

Diagram: SciCrunch shared resources

Community portals: Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia (Faster Cures), Model Organism Databases, Community Outreach

Community Built Uniform Resource Layer

Other portals: Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube

Breaking down silos: Community enrichment

Phases of NIF

• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
  – NIF Registry vs NIF data federation
  – Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
  – Effective search across semantically diverse sources
    • NIFSTD ontologies
• 2009-2011: Strategy for data integration
  – Unified views across common sources
  – Mapping of content to NIF vocabularies
• 2011-present: Data analytics and linking data
  – Uniform external data references
• 2013-present: SciCrunch: unified biomedical resource services
  – “Data trails”

NIF provides a strategy and set of tools applicable to all biomedical science

INFORMATION FRAMEWORKS

– A tool for analyzing and structuring information (“a reduction of uncertainty”)

What is an effective information framework for neuroscience?

Knowledge in space and spatial relationships (the “where”)

Knowledge in words, terminologies and logical relationships (the “what”)

NIF Semantic Framework: NIFSTD ontology

• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology

Diagram: NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure

Ontologies provide the universals for integrating across disparate data by linking them to human knowledge models

Diagram labels: Purkinje Cell, Axon Terminal, Axon, Dendritic Tree, Dendritic Spine, Dendrite, Cell body, Cerebellar cortex

Space limitations: Multiscale integration is not obvious

There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent

Neurolex: > 1 million triples

Dr. Yi Zeng: Chinese neural knowledge base

NIF Cell Graph

This is your brain on computers

NIF “translates” common concepts through ontology and annotation standards

What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)

Query concepts: Morphine; Increased expression; Adult Mouse

Another search tip: Custom query syntax

Ontologies as a data integration framework

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  – Brain Architecture Management System (rodent)
  – Temporal-lobe.com (rodent)
  – Connectome Wiki (human)
  – Brain Maps (various)
  – CoCoMac (primate cortex)
  – UCLA Multimodal database (human fMRI)
  – Avian Brain Connectivity Database (bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
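The three kinds of matches counted above (exact, synonym, first-order partonomy) can be sketched as a simple classifier over pairs of brain-region terms. This is an illustrative sketch only: the function, the tiny synonym table and the part-of table are hypothetical and are not taken from NIFSTD or from the seven databases.

```python
from typing import Dict, Optional, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def match_type(term_a: str, term_b: str,
               synonyms: Dict[str, Set[str]],
               part_of: Dict[str, str]) -> Optional[str]:
    """Classify how two terms relate: exact, synonym, partonomy, or None."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    if b in synonyms.get(a, set()) or a in synonyms.get(b, set()):
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if part_of.get(a) == b or part_of.get(b) == a:
        return "partonomy"
    return None


# Illustrative tables only; a real ontology would supply these relations.
SYNONYMS = {"striatum": {"neostriatum"}}
PART_OF = {"caudate nucleus": "striatum"}

print(match_type("Striatum", "striatum ", SYNONYMS, PART_OF))
print(match_type("Neostriatum", "Striatum", SYNONYMS, PART_OF))
print(match_type("Caudate nucleus", "Striatum", SYNONYMS, PART_OF))
print(match_type("Amygdala", "Striatum", SYNONYMS, PART_OF))
```

The ordering of the checks matters: cheaper, stricter matches are tried first, and the ontology is consulted only when the strings themselves do not agree.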

NIF ANALYTICS

What can we learn from the data space?

Data Federation Growth

NIF searches the largest collation of neuroscience-relevant data on the web

Definition: “The long tail of small data”

• Long tail data: large numbers of small data sets

http://en.wikipedia.org/wiki/Long_tail

Estimate: ~50% of long tail data is “Dark data”: data not available for search

NIF Analytics: The Neuroscience Landscape

Ontologies provide a semantic framework for understanding the data/resource landscape

Where are the data?

Heat map: brain region (e.g., Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) by data source. Legend bins: 0, 1-10, 11-100, >101. Credit: Vadim Astakhov, Kepler Workflow Engine

Data and knowledge gaps

NIF lets us ask: where isn’t there data? What isn’t studied? Why?

Heat map: Forebrain, Midbrain and Hindbrain regions by data source. Legend bins: 0, 1-10, 11-100, >101

SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

“The Data Homunculus”

Beware of biases in the data space...

WHERE ARE WE GOING?

The Encyclopedia of Life

A…

Access to data has changed over the years

Tim Berners-Lee: Web of data

Wikipedia defines Linked Data as “a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.” http://linkeddata.org/

GenBank

PDB

“Whichever technology wins broad adoption will become, by default, the data web. That’s why we don’t need to know which technological vision of the data web will win to conclude that the data web is inevitable” -Michael Nielsen

I am a number: ORCID ID

The web of data runs on the ability to uniquely identify all the relevant entities

• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  – Machine processable (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers
• Goal: Proof of principle
  – What infrastructure would be needed?
  – Could authors perform the task?
  – Would authors perform the task?
  – Will it be useful?

Resource Identification Initiative

http://www.force11.org/resource_identification_initiative

What studies used ...?

• 100 articles have appeared to date
• 15 journals
• Data set being made available to community
• > 600 RRIDs
• ~10% disappeared after copyediting
• 5% were in error
• 14% false negative rate
• > 200 antibodies were added
• > 75 software tools/databases were added

Database available at: https://www.force11.org/node/5635

RRID:AB_90755
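Because an RRID is meant to be machine processable, identifiers of this shape can be pulled out of article text with a simple pattern. The sketch below is modeled only on the single example shown above (a prefix, an underscore, then an alphanumeric accession); the regular expression, the function name and the sample sentence are assumptions, and real RRIDs may use forms this pattern does not cover.

```python
import re
from typing import List

# Assumed shape, based on the one example above: RRID:<prefix>_<accession>.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text: str) -> List[str]:
    """Return the distinct RRIDs found in text, in order of first appearance."""
    found: List[str] = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in found:
            found.append(rrid)
    return found


# Hypothetical methods-section text for illustration.
methods = ("Sections were incubated with the primary antibody (RRID:AB_90755). "
           "The same antibody, RRID: AB_90755, was used for the controls.")

print(extract_rrids(methods))
print(extract_rrids("No identifiers were reported in this paragraph."))
```

A uniform syntax like this is what makes a question such as "what studies used this antibody?" answerable by search rather than by reading every paper.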

Diagram labels: Article, Code, Blogs, Workflows, Data, Portals, Persistent Identifiers

Unique and persistent identifiers and a system for referencing them allow a scholarly ecosystem to function

An ecosystem for research objects

Diagram labels: Data, Code, Blogs, Workflows, Portals, Search engines

Taking a global view on data: microculture to ecosystem

• Several powerful trends should change the way we think about our data: One → Many
  – Many data
    • Generation of data is getting easier → shared data
    • Data space is getting richer: more -omes every day
    • But... compared to the biological space, still sparse
  – Many eyes
    • Wisdom of crowds
    • More than one way to interpret data
  – Many algorithms
    • Not a single way to analyze data
  – Many analytics
    • “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting

One data set, one algorithm, one paper???

How you can contribute

• Register your tools/data to NIF
• Let us help you with your use cases
• Use RRIDs in your publications
  – http://scicrunch.com/resources
• Get your ORCID ID!
• Put your data in a repository
  – NIF can help you find one; NIF is one
• If you are planning on building your own data resources, talk to us!

Future of Research Communications and e-Scholarship (FORCE11.org)

Join us! http://force11.org

NIF team (past and present)

Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11
