How do we know what we don't know? Exploring the data and knowledge space through the Neuroscience Information Framework

Maryann E. Martone, Ph.D., University of California, San Diego

Building Analytics for Integrated Neuroscience Data, Ontario Brain Institute, May 28-29, 2014

We say this to each other all the time, but we set up systems for scholarly advancement and communication that are the antithesis of integration

• Whole brain data (20 um microscopic MRI)
• Mosaic LM images (1 GB+)
• Conventional LM images
• Individual cell morphologies
• EM volumes & reconstructions
• Solved molecular structures

No single technology serves these all equally well.

Multiple data types; multiple scales; multiple databases

A data integration problem

• NIF is an initiative of the NIH Blueprint consortium of institutes
  – What types of resources (data, tools, materials, services) are available to the neuroscience community?
  – How many are there?
  – What domains do they cover? What domains do they not cover?
  – Where are they?
    • Web sites
    • Databases
    • Literature
    • Supplementary material
    • PDF files
    • Desk drawers
  – Who uses them?
  – Who creates them?
  – How can we find them?
  – How can we make them better in the future?

http://neuinfo.org

NIF has been surveying, cataloging and tracking the neuroscience resource landscape since before 2008

Old Model: Single type of content; single mode of distribution

[Diagram labels: Scholar, Library, Scholar, Publisher]

Systems for cataloging, metadata standards, and citation in place

[Diagram labels, participants: Scholar, Consumer, Libraries, Data Repositories, Code Repositories, Community databases/platforms, OA, Curators, Social Networks, Peer Reviewers]

[Diagram labels, content types: Narrative, Workflows, Data, Models, Multimedia, Nanopublications, Code]

The duality of modern scholarship

Observation: Those who build information systems from the machine side don’t understand the requirements of the human very well

Those who build information systems from the human side don’t understand the requirements of machines very well

Scholarship requires the ability to cite and track usage of scholarly artifacts. In our current mode of working, there is no way to track artifacts as they move through the ecosystem; no way to incrementally add human expertise

NIF: A New Type of Entity for New Modes of Scientific Dissemination

• NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
  – NIF unites neuroscience information without respect to domain, funding agency, institute or community
  – NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
  – Makes them searchable from a single interface
  – Practical and cost-effective; tries to be sensible
  – Learned a lot about effective data sharing

The Neuroscience Information Framework provides a rich data source for understanding the current resource landscape

But we have Google!

• Current web is designed to share documents
  – Documents are unstructured data

• Much of the content of digital resources is part of the “hidden web”

• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.


Surveying the resource landscape

~3000 databases and datasets

Populate broadly and quickly with minimum overhead to resource providers

• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines

NIF Registry
• Requires no special skills
• Site map available for local hosting

NIF Data Federation
• DISCO interop (Yale)
• Requires some programming skill
• But designed for quick ingestion

Bandrowski et al., Database, 2012

Data Federation: Deep search

http://neuinfo.org

With the thousands of databases and other information sources available, simple descriptive metadata will not suffice

Subthalamus

Data about the subthalamus

http://neuinfo.org

NIF unifies look, feel and access

What do you mean by data?

Databases come in many shapes and sizes

• Primary data
  – Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
  – Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
  – Claims and assertions about the meaning of data
    • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries
  – Metadata
  – Pointers to data sets or materials stored elsewhere
• Data aggregators
  – Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
  – Data acquired within a single context, e.g., Allen Brain Atlas

Researchers are producing a variety of information artifacts using a multitude of technologies; many duplicate effort and content

Data Federation Growth

[Figure: number of federated records (millions) and number of federated databases over time, Jun-08 to May-13]

NIF searches the largest collation of neuroscience-relevant data on the web

DISCO

[Diagram labels: Purkinje cell, axon terminal, axon, dendritic tree, dendritic spine, dendrite, cell body, cerebellar cortex]

Bringing knowledge to data: Ontologies as framework

There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent

NIF Semantic Framework: NIFSTD ontology

• NIF uses ontologies to help navigate across and unify neuroscience resources
• Ontologies are built from community ontologies, which allows cross-integration with other domains

NIFSTD modules: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure

NIF Ontologies provide standards for integration of diverse data; available through NIF vocabulary services

NIF links neuroscience to other domains via community ontologies

• NIF Subcellular = Gene Ontology Cell Component
• NIF Anatomy = UBERON cross-species ontology (includes FMA and Neuronames)
• NIF Disease = Disease Ontology
• NIF Organism = NCBI Taxonomy
• NIF Molecule = Chemical Entities of Biological Interest (ChEBI); Protein Ontology
• NIF Cell/Investigation/Function = Developed largely by neuroscience community

Use of ontology identifiers within data sources creates linkage across databases and across domains; the more they are used, the better they become
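A minimal sketch of how shared identifiers create linkage, assuming two independent sources that annotate their records with the same identifier field. The identifiers (`EX:0001` etc.), field names, values and the helper `link_by_identifier` are all made up for illustration and are not NIF's actual schema or API.

```python
from collections import defaultdict

# Hypothetical records from two independent sources. Labels differ between
# sources, but records about the same structure carry the same identifier.
volume_records = [
    {"region_id": "EX:0001", "label": "Hippocampus", "volume_mm3": 3500.0},
    {"region_id": "EX:0002", "label": "Amygdala", "volume_mm3": 1700.0},
]
expression_records = [
    {"region_id": "EX:0001", "label": "hippocampal formation", "gene": "GeneA"},
    {"region_id": "EX:0003", "label": "Striatum", "gene": "GeneB"},
]


def link_by_identifier(*sources):
    """Group records from any number of (name, records) sources by identifier."""
    linked = defaultdict(list)
    for source_name, records in sources:
        for record in records:
            linked[record["region_id"]].append((source_name, record))
    return dict(linked)


linked = link_by_identifier(
    ("volumes", volume_records),
    ("expression", expression_records),
)

# Identifiers that appear in more than one source are the cross-database links.
shared = {
    region_id: [name for name, _ in hits]
    for region_id, hits in linked.items()
    if len({name for name, _ in hits}) > 1
}
print(shared)
```

The join works even though the two sources use different labels for the same structure, which is the point of annotating with identifiers rather than free-text names.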

CNeurolex: > 1 million triples (Dr. Yi Zeng: Chinese neural knowledge base)

NIF Cell Graph

This is your brain on computers

Concept-based search: Query by meaning

NIF provides formal definitions of many neuroscience terms

= brain region without a blood brain barrier

Ontologies as a data integration framework

• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  – Brain Architecture Management System (rodent)
  – Temporal lobe.com (rodent)
  – Connectome Wiki (human)
  – Brain Maps (various)
  – CoCoMac (primate cortex)
  – UCLA Multimodal database (Human fMRI)
  – Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
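The three kinds of match counted above can be sketched as follows. This is an illustration only: the term lists, the synonym table and the part-of table are hypothetical stand-ins for what an ontology would supply, and lowercasing terms before comparison is an assumption about what counts as an "exact" match.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivially different spellings agree."""
    return " ".join(term.lower().split())


# Hypothetical term lists for three connectivity databases.
databases = {
    "db_a": ["Hippocampus", "Subthalamic nucleus", "Striatum"],
    "db_b": ["hippocampus", "Nucleus subthalamicus", "Caudate nucleus"],
    "db_c": ["Cerebellar cortex", "Striatum"],
}
# Hypothetical ontology content: synonym -> preferred term, part -> parent.
synonyms = {"nucleus subthalamicus": "subthalamic nucleus"}
part_of = {"caudate nucleus": "striatum"}


def databases_per_term(databases, mapping=None):
    """Map each (optionally synonym-resolved) term to the databases using it."""
    mapping = mapping or {}
    usage = {}
    for db_name, terms in databases.items():
        for term in terms:
            key = normalize(term)
            key = mapping.get(key, key)
            usage.setdefault(key, set()).add(db_name)
    return usage


def shared_terms(usage):
    """Terms that occur in more than one database."""
    return sorted(term for term, dbs in usage.items() if len(dbs) > 1)


exact = shared_terms(databases_per_term(databases))
usage = databases_per_term(databases, synonyms)
with_synonyms = shared_terms(usage)

# First-order partonomy match: a term whose direct parent is used by a
# database that does not use the term itself.
partonomy = sorted(
    child
    for child, parent in part_of.items()
    if child in usage and parent in usage and usage[parent] - usage[child]
)
print(exact, with_synonyms, partonomy)
```

Each step relaxes the matching criterion, which mirrors why the counts on the slide grow from exact matches to synonym matches to partonomy matches.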

Building a knowledge space for neuroscience: Neurolex.org

http://neurolex.org

• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Light weight semantics
• Community based:
  – Anyone can contribute their terms, concepts, things
  – Anyone can edit
  – Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
  – 33,000 concepts
  – 200,000 edits
  – 150 contributors

Larson and Martone Frontiers in Neuroinformatics, 2013

“When I use a word...it means what I choose it to mean”

Formalization lets us develop metrics for the precision of the terms we use

Mapping the known unknowns

Comprehensive ontologies provide an accounting of what we think we know

Where are the data relative to what we think we know?

[Figure: heat map of brain region (Brain, Cerebral cortex, Olfactory bulb, Hypothalamus, Striatum) by data source; legend: 0, 1-10, 11-100, >101]

Open World-Closed World: Mapping the knowledge - data space

Data Sources

NIF lets us ask: where isn’t there data? What isn’t studied? Why?
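A small sketch of the idea behind these maps, assuming record counts per (region, source) pair are already available. The region list stands in for an ontology's accounting of "what we think we know"; the counts and source names are hypothetical, and the bins only loosely follow the figure legend.

```python
def bin_label(count):
    """Bucket a record count into coarse bins similar to the figure legend."""
    if count == 0:
        return "0"
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    return ">100"


def coverage(regions, sources, record_counts):
    """Region-by-source matrix of binned counts; absent pairs count as 0."""
    return {
        region: {
            source: bin_label(record_counts.get((region, source), 0))
            for source in sources
        }
        for region in regions
    }


def uncovered(regions, sources, record_counts):
    """Regions the ontology knows about but no source has data for."""
    return [
        region
        for region in regions
        if all(record_counts.get((region, source), 0) == 0 for source in sources)
    ]


# Hypothetical inputs: regions come from the ontology, not from the data.
regions = ["Cerebral cortex", "Striatum", "Hypothalamus"]
sources = ["source_a", "source_b"]
record_counts = {
    ("Cerebral cortex", "source_a"): 250,
    ("Cerebral cortex", "source_b"): 12,
    ("Striatum", "source_a"): 4,
}

matrix = coverage(regions, sources, record_counts)
gaps = uncovered(regions, sources, record_counts)
print(matrix)
print(gaps)
```

Iterating over the ontology's regions rather than over the data is what makes the empty cells visible: a region with no records never shows up if you only count what the sources contain.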

[Figure: heat map of brain region (Forebrain, Midbrain, Hindbrain) by data source; legend: 0, 1-10, 11-100, >101]

Junk brain regions?

SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186

Adult mouse brain connectivity matrix: revenge of the midbrain

The tale of the tail

“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature”

Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.

fMRI Cerebellum

When results contradict a current theory, they may be ignored

“The Data Homunculus”

Funding drives representation in the data space

NIF Reports: Male vs Female circa 2012

Gender bias

When data is not made available, the data space is an incomplete record of what is available

How much information makes it into the data space?

∞

What is easily machine processable and accessible

What is potentially knowable

What is known: literature, images, human knowledge

• Unstructured; natural language processing, entity recognition, image processing and analysis; paywalls; file drawers
• Abstracts vs full text vs tables etc.

Estimates that > 50% of scientific output is not recovered (Chan et al., Lancet, 383, 2014)

Data sharing in the long tail of neurosciences

A place for my data

NIF lists over 350 data repositories, i.e., resources that accept data contributions from the community

“Empty Archives”

Repository | Type of data | Date started | Host | Public data | Comments
CARMEN | neuroscience / electrophysiology | 2008 | Newcastle University; United Kingdom | 100 | Requires account
INCF Dataspace | various | 2012 | International Neuroinformatics Coordinating Facility | ? |
Open Source Brain | models | 2014 | University College London | 47 | Cells and Networks; 23 (Technology - showcases)
XNAT Central | Neuroimaging | 2010 | Washington University School of Medicine in St. Louis; Missouri; USA | 34 | States 370 projects, 3804 subjects, and 5172 imaging sessions. 123 were visible but do not all appear to be public. 34 public data were listed under “Recent”
Open Connectome | Serial electron microscopy and magnetic resonance (graphs) | 2011 | Johns Hopkins University; Maryland; USA | 9 | 9, 7 - image projects; 19 - graphs
UCSF DataShare | biomedical including neuroimaging, MRI, cognitive impairment, dementia, aging | 2011 | University of California at San Francisco; California; USA | 15 |
BrainLiner | various functional data | 2011 | ATR; Kyoto; Japan | 10 |
ModelDB | neuron models | 1996 | Yale University; Connecticut; USA | 875 |
NeuroMorpho | digitally reconstructed neurons | 2006 | George Mason University; Virginia; USA | 10004 |
Cell Image Library/Cell Centered Database | images, videos, and animations of cell | 2002 (CCDB); 2010 (CIL) | American Society for Cell Biology / University of California at San Diego; California; USA | 10,360 | The CCDB had 450 data sets when it merged with CIL. CIL also contains large imaging data sets that are not counted as separate images
CRCNS | computational neuroscience datasets | 2008 | University of California at Berkeley; California; USA | 38 |
OpenfMRI | fMRI | 2012 | University of Texas at Austin; Texas; USA | 22 |

NeuroMorpho.org = 10,000 neuronal reconstructions from ~200 labs

Cell Image Library = 10,000 image sets from 1500 individuals

“I finally gave NeuroMorpho my data so they would stop bothering me.”

Attitudes towards data sharing

[Spectrum from Always to Never: “Done”; “You can have it if you really want”; “Pry it from my cold, dead fingers”]

Reasons for not making data available
• Lack of time and resources
• Lack of incentives
• Fear of being scooped
• Fear of being criticized
• Fear that data will be misused
• Data sharing is a waste of time

Tenopir, C. et al. Data sharing by scientists: practices and perceptions. PLoS One 6, e21101, doi:10.1371/journal.pone.0021101 (2011)

Many make data available via web sites or via supplementary material

Multivariate analysis of the SCI syndrome using data from two research sites.

Ferguson AR, Irvine K-A, Gensel JC, Nielson JL, et al. (2013) Derivation of Multivariate Syndromic Outcome Metrics for Consistent Testing across Multiple Models of Cervical Spinal Cord Injury in Rats. PLoS ONE 8(3): e59712. doi:10.1371/journal.pone.0059712

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059712

Incentives: New solutions

• New journals for data, where focus is on data not results
• Data must be deposited in a recognized repository
  – Persistent identifier assigned
• Standards for metadata and data types

Nature Scientific Data

Incentives: Data citations

• Many groups are developing guidelines for creating a system of citation for data used in a study
• First step for providing an incentive system for data sharing
• Currently, very difficult to track use of data in articles

http://www.force11.org/datacitation

“Sound, reproducible scholarship rests upon a foundation of robust, accessible data. Data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice.”

-Joint Declaration of Data Citation Principles

Future of Research Communications and e-Scholarship; FORCE11

1. Importance
2. Credit and attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and verifiability
8. Interoperability and flexibility

Unique ID’s for all! Resource Identification Initiative

• It is currently impossible to query the biomedical literature to find out what research resources have been used to produce the results of a study

-authors don’t provide enough information to unambiguously identify key research resources

• Impossible to find all studies that used a resource

• Critical for reproducibility and data mining

• Critical for trouble-shooting

http://www.force11.org/resource_identification_initiative

Faulty Antibodies Continue to Enter US and European Markets, Warns Top Clinical Chemistry Researcher-Genome Web Daily, October 11, 2013

Resource Identification Initiative

• Have authors supply appropriate identifiers for key resources used within a study such that they are:
  – Machine processable (i.e., unique identifier that resolves to a single resource)
  – Outside of the paywall
  – Uniform across journals and publishers

Launched February 2014: > 30 journals participating

Anita Bandrowski, Nicole Vasilevsky, Matthew Brush, Melissa Haendel and the RINL group

Pilot Project

• Have authors identify 3 different types of research resources:
  – Software tools and databases
  – Antibodies
  – Genetically modified animals
• Include RRID in methods section
• RRID=RRID:Accession number
  – Just a string at this point
• Voluntary for authors
• Journals did not have to modify their submission system
• Journals have flexibility in implementation. Send request to author at:
  – Submission
  – During review
  – After acceptance
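Because the RRID is "just a string" of the form RRID:Accession number, it can be pulled out of a methods section with a plain pattern match. The sketch below is an assumption about what such a matcher might look like, not the initiative's resolver: the character set allowed in an accession, the helper name `extract_rrids` and the sample sentence are illustrative, and `AB_0000000` is a made-up accession.

```python
import re

# "RRID:" followed by an accession made of letters, digits, "_", "-" or ":".
# The allowed character set is an assumption for this sketch.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z0-9][A-Za-z0-9_:\-]*)")


def extract_rrids(text):
    """Return accession strings for every RRID mention, in order of appearance."""
    # Strip trailing separators that belong to the sentence, not the accession.
    return [m.group(1).rstrip(":-_") for m in RRID_PATTERN.finditer(text)]


# Hypothetical methods text; only the first identifier appears in the slides.
methods = (
    "Images were analyzed with an imaging tool (RRID:nif-0000-30467). "
    "Sections were stained with a primary antibody (RRID: AB_0000000)."
)

found = extract_rrids(methods)
print(found)
```

A fixed prefix plus an accession is what makes the question "What studies used X?" answerable with ordinary full-text search.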

Sources: NIF Registry, NIF Antibody Registry, Model Organism Databases

Resource Identification Portal: Aggregates accession numbers from >10 different databases that are the authorities for registering research resources

First results are in the literature

Google Scholar: Search RRID; select since 2014

What studies used X?

To date:
• 30 articles have appeared
• 2 articles have disappeared, i.e., the RRIDs were removed at copyediting
• 195 RRIDs were reported
• 14 were in error (14/195 ≈ 7%)
• > 200 antibodies were added
• > 75 software tools/databases were added
• A resolver service has been created
• 3rd party tools are being created to provide linkage between resources and papers

RRID:nif-0000-30467

Authors did not deliberately leave out identifying information; they just hadn’t thought about it

What have we learned?

Utopia plug-in: Steve Pettifer

• Authors are willing to adopt new types of citations and citation styles; you just have to ask
• RRID = usage of research resource
• Ideal: resolved by search engines without requiring specialized citation services
• Citation drives registration
• Clear role for repositories as authorities

Digital objects are a new beast

RRID: Provides foundation for establishing an alerting service for research resources

Trust: Not just who produced it but what produced it

Community database: beginning

Community database: end

Register your resource to NIF!

“How do I share my data/tool?”

“There is no database for my data”


[Diagram labels: Institutional repositories, Cloud, INCF: Global infrastructure, Government, Education, Industry]

NIF provides the “glue” for a functioning ecosystem of data and tools

[Diagram labels: Tool repositories, Standards, Brokering, Archiving, Article, Code, Blogs, Workflows, Data, Portals, Persistent Identifiers]

Unique and persistent identifiers and a system for referencing them allow an ecosystem to function

An ecosystem for research objects: the social network of research resources

[Diagram labels: Data, Code, Blogs, Workflows, Portals, Search engines]

Musings from the NIF

• Analytics let us take a global view of data
  – By bringing in a knowledge framework, we can look at positive and negative space
• Well-populated data resources are critical to moving analytics forward
  – Comprehensive, i.e. they have most of the data that are available
  – Much can be learned even from messy data, but reasonable standards help
  – Active outreach is required
• Technological barriers to widespread data sharing are diminishing
  – Best practices are emerging
  – General and focused repositories are available, although sustainability of these is a problem
• There is a lot of neuroscience data available, but a culture of routine data sharing does not yet exist in neuroscience
  – But encouraging signs that it is largely due to lack of time and means, not lack of desire
  – It is up to us to change the incentive system to support the best science possible
• Most scientists are not adept at managing or curating their own data
  – Role for repositories and data curators
• Pieces of a functioning ecosystem are in place
  – Think about how you fit into the ecosystem

NIF team (past and present)

Jeff Grethe, UCSD, Co Investigator, Co-PI; Amarnath Gupta, UCSD, Co Investigator; Anita Bandrowski, NIF Project Leader; Gordon Shepherd, Yale University; Perry Miller; Luis Marenco; Rixin Wang; David Van Essen, Washington University; Erin Reid; Paul Sternberg, Cal Tech; Arun Rangarajan; Hans Michael Muller; Yuling Li; Giorgio Ascoli, George Mason University; Sridevi Polavarum; Yueling Li, UCSD; Trish Whetzel, UCSD

Fahim Imam; Larry Lui; Andrea Arnaud Stagg; Jonathan Cachat; Svetlana Sulima; Burak Ozyrt; Davis Banks; Vadim Astakhov; Xufei Qian; Chris Condit; Mark Ellisman; Stephen Larson; Willie Wong; Tim Clark, Harvard University; Paolo Ciccarese; Karen Skinner, NIH, Program Officer (retired); Jonathan Pollock, NIH, Program Officer

And my colleagues in Monarch, dkNet, 3DVC, Force 11

Melissa Haendel, OHSU**; Nicole Vasilevsky; Matthew Brush

**Monarch and Resource Identification Initiative

Creating an on-line knowledge space for neuroscience

Pages are related through properties

Red Links: Information is missing (or misspelled)

Neurolex Neuron

• Led by Dr. Gordon Shepherd

• > 30 world wide experts

• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro

Location of Cell Soma

Location of dendrites

Location of local axon arbor

Analysis of Red Links in the Neuron Registry

• INCF Project
  – Neuron Registry
  – > 30 experts worldwide
  – Fill out neuron pages in Neurolex Wiki
  – Led by Dr. Gordon Shepherd

[Figure: bar chart of number of red links (total, easy fixes, hard fixes) for soma location, dendrite location and axon location]

Social networks and community sites let us learn things from the collective behavior of contributors

INCF/HBP Knowledge Space

https://neuinfo.org/mynif/search.php?q=*&t=indexable&nif=nlx_144509-3&b=0&r=20

Structural Lexicon in Neurolex

Brain Region
• Trans-species
• “Stateless”, i.e. no universal defining criteria
• General structures and partonomies based on Neuroanatomy 101
• e.g., Hippocampus, Dentate gyrus

Partially overlaps

Brain Parcel
• Species specific
• Specific reference
• Defining criteria
• Sometimes partonomy; sometimes not
• e.g., Hippocampus of ABA2009

Standards support diversity

Is there a framework for neuroscience?

• Of the ~4000 columns that NIF queries, ~1300 map to one of our core categories:
  – Organism
  – Anatomical structure
  – Cell
  – Molecule
  – Function
  – Dysfunction
  – Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
  – Basic information models/semantic models exist for certain types of entities

Biomedical science does have a conceptual framework

What would a 21st century platform for scholarship look like?

[Diagram labels: D, K, Macroinformatics]

NIF: Sensors and monitors for the resource ecosystem

Exposing knowledge to the web

Because wiki pages have static URLs, wikis are searchable by Google

NIF provides a rich source of information on digital resources

• Analytics let us take a global view of data
  – By bringing in a knowledge framework, we can look at positive and negative space
• Well-populated data resources are critical to moving analytics forward
  – Comprehensive, i.e. they have most of the data that are available
  – Much can be learned even from messy data, but reasonable standards help
  – Active outreach is required
• Technological barriers to widespread data sharing are diminishing
  – Best practices are emerging
  – General and focused repositories are available, although sustainability of these is a problem
• There is a lot of neuroscience data available, but a culture of routine data sharing does not yet exist in neuroscience
  – But encouraging signs that it is largely due to lack of time and means, not lack of agreement
• Most scientists are not adept at managing or curating their own data
  – Role for repositories and data curators
• Pieces of a functioning ecosystem are in place; think globally

Not just science, but data policy should be data driven

Same data: different analysis

• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases

Chronic vs acute morphine in striatum

• Analysis:
  – 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  – 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  – Results for 1 gene were opposite in DRG and Gemma
  – 45 did not have enough information provided in the paper to make a judgment

Relatively simple standards would make it easier to perform comparisons across the ecosystem
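One such simple standard is to record the baseline explicitly alongside the direction of change. The sketch below assumes a hypothetical `ExpressionClaim` record with condition, baseline and direction fields; neither Gemma nor DRG is claimed to store data this way, and the gene name is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionClaim:
    gene: str
    condition: str   # the state being described
    baseline: str    # the state it is compared against
    direction: int   # +1 = higher in condition than baseline, -1 = lower


def relative_to(claim, baseline):
    """Re-express a claim against the requested baseline, flipping sign if needed."""
    if claim.baseline == baseline:
        return claim
    if claim.condition == baseline:
        # Swapping condition and baseline reverses the direction of change.
        return ExpressionClaim(
            claim.gene, claim.baseline, claim.condition, -claim.direction
        )
    raise ValueError(f"claim does not involve baseline {baseline!r}")


def consistent(a, b, baseline):
    """True when two claims agree once both use the same baseline."""
    return relative_to(a, baseline) == relative_to(b, baseline)


# Hypothetical claims: one source reports saline relative to chronic morphine,
# the other reports chronic morphine relative to saline.
claim_a = ExpressionClaim("GeneA", "saline", "chronic morphine", +1)
claim_b = ExpressionClaim("GeneA", "chronic morphine", "saline", -1)

agree = consistent(claim_a, claim_b, baseline="saline")
print(agree)
```

The raw directions look opposite (+1 vs -1), yet the claims agree once both are expressed against saline. Without the baseline field, that comparison requires going back to the paper.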

Musings from the NIF

• Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
  – If the market can support 11 MRI databases, fine
  – Some consolidation, coordination is warranted
  – How can industry help support the data space? How can they take them even further?
  – Don’t let the data space become fractured
• Big, broad and messy beats small, narrow and neat
  – Without trying to integrate a lot of data, we will not know what needs to be done
  – Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic: assume all will change
  – A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
  – No source, not even NIF, is THE source; we are all a source
  – System and culture to be able to learn from everything
  – Cooperative model for biomedicine