Also includes results from the Resource Identification Initiative
How do we know what we don't know? Exploring the data and knowledge space through the
Neuroscience Information Framework
Maryann E. Martone, Ph.D.
University of California, San Diego
Building Analytics for Integrated Neuroscience Data
Ontario Brain Institute, May 28-29, 2014
We say this to each other all the time, but we set up systems for scholarly advancement and communication that are the antithesis of integration
Whole brain data (20 um microscopic MRI)
Mosaic LM images (1 GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstructions
Solved molecular structures
No single technology serves these all equally well.
Multiple data types; multiple scales; multiple databases
A data integration problem
• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
  • Web sites
  • Databases
  • Literature
  • Supplementary material
  • PDF files
  • Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been surveying, cataloging and tracking the neuroscience resource landscape since < 2008
Old Model: Single type of content; single mode of distribution
Scholar
Library
Scholar
Publisher
Systems for cataloging, metadata standards, and citation in place
Scholar
Consumer
Libraries
Data Repositories
Code Repositories
Community databases/platforms
OA
Curators
Social Networks
Peer Reviewers
Narrative
Workflows
Data
Models
Multimedia
Nanopublications
Code
The duality of modern scholarship
Observation: Those who build information systems from the machine side don’t understand the requirements of the human very well
Those who build information systems from the human side, don’t understand requirements of machines very well
Scholarship requires the ability to cite and track usage of scholarly artifacts. In our current mode of working, there is no way to track artifacts as they move through the ecosystem; no way to incrementally add human expertise
NIF: A New Type of Entity for New Modes of Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to and utility of research resources produced worldwide to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about effective data sharing
The Neuroscience Information Framework provides a rich data source for understanding the current resource landscape
But we have Google!
• Current web is designed to share documents
– Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
Surveying the resource landscape
~3000 databases and datasets
Populate broadly and quickly with minimum overhead to resource providers
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry
• Requires no special skills
• Site map available for local hosting
NIF Data Federation
• DISCO interop (Yale)
• Requires some programming skill
• But designed for quick ingestion
Bandrowski et al., Database, 2012
Data Federation: Deep search
http://neuinfo.org
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
Subthalamus
Data about the subthalamus
http://neuinfo.org
NIF unifies look, feel and access
What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
  • E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies; many duplicate effort and content
Data Federation Growth
(Figure: number of federated records, in millions, and number of federated databases, June 2008 to May 2013)
NIF searches the largest collation of neuroscience-relevant data on the web
DISCO
Purkinje Cell
Axon Terminal
Axon
Dendritic Tree
Dendritic Spine
Dendrite
Cell body
Cerebellar cortex
Bringing knowledge to data: Ontologies as framework
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
NIF Semantic Framework: NIFSTD ontology
• NIF uses ontologies to help navigate across and unify neuroscience resources
• Ontologies are built from community ontologies; cross integration with other domains
NIFSTD
Organism
NS Function
Molecule
Investigation
Subcellular structure
Macromolecule
Gene
Molecule Descriptors
Techniques
Reagent
Protocols
Cell
Resource
Instrument
Dysfunction
Quality
Anatomical Structure
NIF Ontologies provide standards for integration of diverse data; available through NIF vocabulary services
NIF links neuroscience to other domains via community ontologies
• NIF Subcellular = Gene Ontology Cell Component
• NIF Anatomy = UBERON cross-species ontology (includes FMA and Neuronames)
• NIF Disease = Disease Ontology
• NIF Organism = NCBI Taxonomy
• NIF Molecule = Chemical Entities of Biological Interest (ChEBI); Protein Ontology
• NIF Cell/Investigation/Function = Developed largely by neuroscience community
Use of ontology identifiers within data sources creates linkage across databases and across domains; the more they are used, the better they become
CNeurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
Concept-based search: Query by meaning
NIF provides formal definitions of many neuroscience terms
= brain region without a blood brain barrier
Ontologies as a data integration framework
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
  • Brain Architecture Management System (rodent)
  • Temporal lobe.com (rodent)
  • Connectome Wiki (human)
  • Brain Maps (various)
  • CoCoMac (primate cortex)
  • UCLA Multimodal database (Human fMRI)
  • Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
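The three kinds of match counted above (exact term, synonym, first-order partonomy) can be sketched roughly as follows. This is only an illustration under stated assumptions: the term lists, the synonym table and the part-of table are toy data invented for the example, and `classify_match` and `count_matches` are hypothetical helpers, not NIF's actual pipeline.

```python
from typing import Optional

# Toy vocabularies, invented for illustration only.
SYNONYMS = {
    # preferred label -> known synonyms
    "caudate nucleus": {"caudate"},
    "ammon's horn": {"cornu ammonis"},
}
PART_OF = {
    # part -> immediate parent (first-order partonomy)
    "dentate gyrus": "hippocampal formation",
    "caudate nucleus": "striatum",
}


def normalize(term: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def canonical(term: str) -> str:
    """Map a term to its preferred label if it is a known synonym."""
    t = normalize(term)
    for preferred, synonyms in SYNONYMS.items():
        if t == preferred or t in synonyms:
            return preferred
    return t


def classify_match(a: str, b: str) -> Optional[str]:
    """Return 'exact', 'synonym', 'partonomy', or None for two terms."""
    if normalize(a) == normalize(b):
        return "exact"
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "synonym"
    if PART_OF.get(ca) == cb or PART_OF.get(cb) == ca:
        return "partonomy"
    return None


def count_matches(terms_a, terms_b):
    """Count each kind of match over all pairs of terms from two databases."""
    counts = {"exact": 0, "synonym": 0, "partonomy": 0}
    for a in terms_a:
        for b in terms_b:
            kind = classify_match(a, b)
            if kind is not None:
                counts[kind] += 1
    return counts


db_a = ["Caudate nucleus", "Dentate gyrus", "Striatum"]
db_b = ["caudate", "Hippocampal formation", "striatum"]
print(count_matches(db_a, db_b))
```

Even in this toy case most of the linkage comes from synonyms and part-of relations rather than identical strings, which is the pattern the counts on the slide point to.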
Building a knowledge space for neuroscience: Neurolex.org
http://neurolex.org
• Semantic MediaWiki
• Provide a simple interface for defining the concepts required
• Light weight semantics
• Community based:
  • Anyone can contribute their terms, concepts, things
  • Anyone can edit
  • Anyone can link
• Accessible: searched by Google
• Growing into a significant knowledge base for neuroscience
• 33,000 concepts
200,000 edits
150 contributors
Larson and Martone, Frontiers in Neuroinformatics, 2013
“When I use a word...it means what I choose it to mean”
Formalization lets us develop metrics for the precision of the
terms we use
Mapping the known unknowns
Comprehensive ontologies provide an accounting of what we think we know
Where are the data relative to what we think we know?
(Figure: brain region vs. data source; regions shown include Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex and Brain; legend bins: 0, 1-10, 11-100, >101)
Open World-Closed World: Mapping the knowledge - data space
NIF lets us ask: where isn’t there data? What isn’t studied? Why?
(Figure: Forebrain, Midbrain and Hindbrain regions vs. data sources; legend bins: 0, 1-10, 11-100, >101)
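A minimal sketch of how a region-by-source map with these legend bins might be computed, and how the "negative space" (regions with no data in any source) might be listed. It assumes, hypothetically, that the counts per brain region and data source are already available as a dictionary; the source names and numbers are invented, and the legend's ">101" is taken here to mean "more than 100".

```python
def bin_count(n: int) -> str:
    """Map a count to a legend bin (top bin assumed to mean n > 100)."""
    if n < 0:
        raise ValueError("count must be non-negative")
    if n == 0:
        return "0"
    if n <= 10:
        return "1-10"
    if n <= 100:
        return "11-100"
    return ">100"


def coverage_map(counts, regions, sources):
    """Build {region: {source: bin}}; missing pairs count as zero."""
    return {
        region: {s: bin_count(counts.get((region, s), 0)) for s in sources}
        for region in regions
    }


def unstudied(counts, regions, sources):
    """Regions the ontology knows about but no source has data for."""
    return [
        region
        for region in regions
        if all(counts.get((region, s), 0) == 0 for s in sources)
    ]


# Invented example data: (region, source) -> count.
regions = ["Forebrain", "Midbrain", "Hindbrain"]
sources = ["source_a", "source_b"]
counts = {
    ("Forebrain", "source_a"): 250,
    ("Forebrain", "source_b"): 12,
    ("Hindbrain", "source_a"): 3,
}

grid = coverage_map(counts, regions, sources)
for region in regions:
    print(region, grid[region])
print("No data anywhere:", unstudied(counts, regions, sources))
```

The list of regions comes from the ontology rather than from the data, which is what makes the empty cells visible at all.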
Junk brain regions?
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
fMRI Cerebellum
When results contradict a current theory, they may be ignored
“The Data Homunculus”
Funding drives representation in the data space
NIF Reports: Male vs Female circa 2012
Gender bias
When data is not made available, the data space is an incomplete record of what is available
How much information makes it into the data space?
∞
What is easily machine processable and accessible
What is potentially knowable
What is known: Literature, images, human knowledge
Unstructured; natural language processing, entity recognition, image processing and analysis; paywalls; file drawers
Abstracts vs full text vs tables etc.
Estimates that > 50% of scientific output is not recovered
Chan et al. Lancet, 383, 2014
Data sharing in the long tail of neurosciences
A place for my data
NIF lists over 350 data repositories, i.e., resources that accept data contributions from the community
“Empty Archives”
Repository | Type of Data | Date started | Host | Public data | Comments
CARMEN | neuroscience / electrophysiology | 2008 | Newcastle University; United Kingdom | 100 | Requires account
INCF Dataspace | various | 2012 | International Neuroinformatics Coordinating Facility | ? |
Open Source Brain | models | 2014 | University College London | 47 | Cells and Networks; 23 (Technology showcases)
XNAT Central | Neuroimaging | 2010 | Washington University School of Medicine in St. Louis; Missouri; USA | 34 | States 370 projects, 3804 subjects, and 5172 imaging sessions. 123 were visible but do not all appear to be public. 34 public data were listed under “Recent”
Open Connectome | Serial electron Microscopy and Magnetic Resonance | 2011 | Johns Hopkins University; Maryland; USA | (graphs) 9 | 9, 7 - image projects; 19 - graphs
UCSF DataShare | biomedical including neuroimaging, MRI, cognitive impairment, dementia, aging | 2011 | University of California at San Francisco; California; USA | 15 |
BrainLiner | various functional data | 2011 | ATR; Kyoto; Japan | 10 |
ModelDB | neuron models | 1996 | Yale University; Connecticut; USA | 875 |
NeuroMorpho | digitally reconstructed neurons | 2006 | George Mason University; Virginia; USA | 10004 |
Cell Image Library/Cell Centered Database | images, videos, and animations of cell | 2002 CCDB; 2010 CIL | American Society for Cell Biology / University of California at San Diego; California; USA | 10,360 | The CCDB had 450 data sets when it merged with CIL. CIL also contains large imaging data sets that are not counted as separate images
CRCNS | computational neuroscience datasets | 2008 | University of California at Berkeley; California; USA | 38 |
OpenfMRI | fMRI | 2012 | University of Texas at Austin; Texas; USA | 22 |
NeuroMorpho.org = 10,000 neuronal reconstructions from ~200 labs
Cell Image Library = 10,000 image sets
from 1500 individuals
“I finally gave NeuroMorpho my data so they would stop bothering me.”
Attitudes towards data sharing
“Pry it from my cold, dead fingers”
“Done”
“You can have it if you really want”
• Lack of time and resources
• Lack of incentives
• Fear of being scooped
• Fear of being criticized
• Fear that data will be misused
• Data sharing is a waste of time
Always / Never
Reasons for not making data available
Tenopir, C. et al. Data sharing by scientists: practices and perceptions. PLoS One 6, e21101, doi:10.1371/journal.pone.0021101 (2011)
Many make data available via web sites or via supplementary material
Multivariate analysis of the SCI syndrome using data from two research sites.
Ferguson AR, Irvine K-A, Gensel JC, Nielson JL, et al. (2013) Derivation of Multivariate Syndromic Outcome Metrics for Consistent Testing across Multiple Models of Cervical Spinal Cord Injury in Rats. PLoS ONE 8(3): e59712. doi:10.1371/journal.pone.0059712
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059712
Incentives: New solutions
• New journals for data, where focus is on data not results
• Data must be deposited in a recognized repository
– Persistent identifier assigned
• Standards for metadata and data types
Nature Scientific Data
Incentives: Data citations
• Many groups are developing guidelines for creating a system of citation for data used in a study
• First step for providing an incentive system for data sharing
• Currently, very difficult to track use of data in articles
http://www.force11.org/datacitation
“Sound, reproducible scholarship rests upon a foundation of robust, accessible data. Data should be considered legitimate, citable products of research. Data citation, like the citation of other evidence and sources, is good research practice.”
-Joint Declaration of Data Citation Principles
Future of Research Communications and e-Scholarship; FORCE11
1. Importance
2. Credit and attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and verifiability
8. Interoperability and flexibility
Unique ID’s for all! Resource Identification Initiative
• It is currently impossible to query the biomedical literature to find out what research resources have been used to produce the results of a study
-authors don’t provide enough information to unambiguously identify key research resources
• Impossible to find all studies that used a resource
• Critical for reproducibility and data mining
• Critical for trouble-shooting
http://www.force11.org/resource_identification_initiative
Faulty Antibodies Continue to Enter US and European Markets, Warns Top Clinical Chemistry Researcher
- Genome Web Daily, October 11, 2013
Resource Identification Initiative
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
Launched February 2014: > 30 journals participating
Anita Bandrowski, Nicole Vasilevsky, Matthew Brush, Melissa Haendel and the RINL group
Pilot Project
• Have authors identify 3 different types of research resources:
– Software tools and databases
– Antibodies
– Genetically modified animals
• Include RRID in methods section
• RRID=RRID:Accession number
– Just a string at this point
• Voluntary for authors
• Journals did not have to modify their submission system
• Journals have flexibility in implementation. Send request to author at:
– Submission
– During review
– After acceptance
Sources: NIF Registry, NIF Antibody Registry, Model Organism Databases
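Because an RRID is "just a string" of the form RRID:accession number, it can in principle be pulled out of a methods section with ordinary text matching and inverted to answer "what studies used X?". The sketch below is an assumption-laden illustration, not an official parser: the character set allowed in an accession number is a guess, the paper texts are invented, and `RRID:AB_0000001` is a made-up accession (only `RRID:nif-0000-30467` appears in the slides).

```python
import re
from collections import defaultdict

# Assumed shape: "RRID:" then runs of letters/digits joined by "_", ":" or "-".
# Trailing punctuation such as ")" or "." is deliberately left out of the match.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z0-9]+(?:[_:\-][A-Za-z0-9]+)*)")


def extract_rrids(text: str) -> list:
    """Return the distinct RRIDs in a text, in order of first appearance."""
    found = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)
        if rrid not in found:
            found.append(rrid)
    return found


def studies_using(papers: dict) -> dict:
    """Invert {paper_id: methods_text} into {rrid: [paper_id, ...]}."""
    index = defaultdict(list)
    for paper_id in sorted(papers):
        for rrid in extract_rrids(papers[paper_id]):
            index[rrid].append(paper_id)
    return dict(index)


# Invented methods-section snippets.
papers = {
    "paper-1": "Resources were located through the registry (RRID:nif-0000-30467).",
    "paper-2": "Sections were stained with an antibody (RRID:AB_0000001) "
               "and records were retrieved via RRID: nif-0000-30467.",
}

index = studies_using(papers)
print(index["RRID:nif-0000-30467"])
```

A resolver is still needed to turn each matched string into a single resource; the pattern only finds candidates.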
Resource Identification Portal: Aggregates accession numbers from >10 different databases that are the authorities for registering research resources
First results are in the literature
Google Scholar: Search RRID; select since 2014
What studies used X?
To date:
• 30 articles have appeared
• 2 articles have disappeared, i.e., the RRIDs were removed at copyediting
• 195 RRIDs were reported
• 14 were in error (14/195, about 7%)
• > 200 antibodies were added
• > 75 software tools/databases were added
• A resolver service has been created
• 3rd party tools are being created to provide linkage between resources and papers
RRID:nif-0000-30467
Authors did not deliberately leave out identifying information; they just hadn’t thought about it
What have we learned?
Utopia plug-in: Steve Pettifer
• Authors are willing to adopt new types of citations and citation styles; you just have to ask
• RRID = usage of research resource
• Ideal: resolved by search engines without requiring specialized citation services
• Citation drives registration
• Clear role for repositories as authorities
Digital objects are a new beast
RRID: Provides foundation for establishing an alerting service for research resources
Trust: Not just who produced it but what produced it
Community database: beginning
Community database: end
Register your resource to NIF!
“How do I share my data/tool?”
“There is no database for my data”
Institutional repositories
Cloud
INCF: Global infrastructure
Government
Education
Industry
NIF provides the “glue” for a functioning ecosystem of data and tools
Tool repositories
Standards
Brokering
Archiving
Article
Code
Blogs
Workflows
Data
Persistent Identifiers
Portals
Unique and persistent identifiers and a system for referencing them allow an ecosystem to function
An ecosystem for research objects: the social network of research resources
Data
Code
Blogs
Workflows
Portals
Search engines
Musings from the NIF
• Analytics let us take a global view of data
– By bringing in a knowledge framework, we can look at positive and negative space
• Well-populated data resources are critical to moving analytics forward
– Comprehensive, i.e. they have most of the data that are available
– Much can be learned even from messy data, but reasonable standards help
– Active outreach is required
• Technological barriers to widespread data sharing are diminishing
– Best practices are emerging
– General and focused repositories are available, although sustainability of these is a problem
• There is a lot of neuroscience data available, but a culture of routine data sharing does not yet exist in neuroscience
– But encouraging signs that it is largely due to lack of time and means, not lack of desire
– It is up to us to change the incentive system to support the best science possible
• Most scientists are not adept at managing or curating their own data
– Role for repositories and data curators
• Pieces of a functioning ecosystem are in place
– Think about how you fit into the ecosystem
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Co-PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Yueling Li, UCSD
Trish Whetzel, UCSD
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Svetlana Sulima
Burak Ozyrt
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
Melissa Haendel, OHSU**
Nicole Vasilevsky
Matthew Brush
**Monarch and Resource Identification Initiative
Creating an on-line knowledge space for neuroscience
Pages are related through properties
Red Links: Information is missing (or misspelled)
Neurolex Neuron
• Led by Dr. Gordon Shepherd
• > 30 world wide experts
• Simple set of properties
• Consistent naming scheme
• Integrated with Structural Lexicon
• Used for annotation in other resources, e.g., NeuroElectro
Location of Cell Soma
Location of dendrites
Location of local axon arbor
Analysis of Red Links in the Neuron Registry
• INCF Project
– Neuron Registry
– > 30 experts worldwide
– Fill out neuron pages in Neurolex Wiki
– Led by Dr. Gordon Shepherd
(Figure: number of red links, split into total red links, easy fixes and hard fixes, for Soma location, Dendrite location and Axon location)
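A red link is a property value that points at a page that does not exist, because the information is missing or the name is misspelled. The sketch below shows one way such links could be found and triaged. It is hypothetical throughout: the wiki is modeled as a plain dictionary rather than a real Semantic MediaWiki query, the page contents are invented, and the "easy"/"hard" split is an assumed rule (a target that matches an existing page once case and spacing are ignored counts as easy), since the slides do not define those categories.

```python
def norm(title: str) -> str:
    """Ignore case and extra whitespace when comparing page titles."""
    return " ".join(title.lower().split())


def red_links(pages: dict) -> list:
    """Find property values that point at non-existent pages.

    pages maps a page title to {property name: [linked page titles]}.
    Returns tuples (page, property, target, difficulty, suggestion).
    """
    existing = set(pages)
    by_norm = {norm(title): title for title in existing}
    report = []
    for page, properties in pages.items():
        for prop, targets in properties.items():
            for target in targets:
                if target in existing:
                    continue  # a normal (blue) link
                suggestion = by_norm.get(norm(target))
                difficulty = "easy" if suggestion else "hard"
                report.append((page, prop, target, difficulty, suggestion))
    return report


# Invented mini-wiki using the property names shown on the slides.
pages = {
    "Cerebellum Purkinje cell": {
        "Location of Cell Soma": ["purkinje cell layer"],
        "Location of dendrites": ["Molecular layer of cerebellar cortex"],
    },
    "Purkinje cell layer": {},
}

for row in red_links(pages):
    print(row)
```

Counting the rows per property and difficulty would give the kind of tally the figure summarizes.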
Social networks and community sites let us learn things from the collective behavior of contributors
INCF/HBP Knowledge Space
Structural Lexicon in Neurolex
Brain Region
Brain Parcel
• Trans-species
• “Stateless”, i.e. no universal defining criteria
• General structures and partonomies based on Neuroanatomy 101
Partially overlaps
e.g., Hippocampus, Dentate gyrus
• Species specific
• Specific reference
• Defining criteria
• Sometimes partonomy; sometimes not
e.g., Hippocampus of ABA2009
Standards support diversity
Is there a framework for neuroscience?
• Of the ~ 4000 columns that NIF queries, ~1300 map to one of our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• 30-50% of NIF’s queries autocomplete
• When NIF combines multiple sources, a set of common fields emerges
– Basic information models/semantic models exist for certain types of entities
Biomedical science does have a conceptual framework
What would a 21st century platform for scholarship look like?
D
K
Macroinformatics
NIF: Sensors and monitors for the resource ecosystem
Exposing knowledge to the web
Because their pages have static URLs, wikis are searchable by Google
NIF provides a rich source of information on digital resources
• Analytics let us take a global view of data
– By bringing in a knowledge framework, we can look at positive and negative space
• Well-populated data resources are critical to moving analytics forward
– Comprehensive, i.e. they have most of the data that are available
– Much can be learned even from messy data, but reasonable standards help
– Active outreach is required
• Technological barriers to widespread data sharing are diminishing
– Best practices are emerging
– General and focused repositories are available, although sustainability of these is a problem
• There is a lot of neuroscience data available, but a culture of routine data sharing does not yet exist in neuroscience
– But encouraging signs that it is largely due to lack of time and means, not lack of agreement
• Most scientists are not adept at managing or curating their own data
– Role for repositories and data curators
• Pieces of a functioning ecosystem are in place; think globally
Not just science, but data policy should be data driven
Same data: different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
•Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis:
  • 1370 statements from Gemma regarding gene expression as a function of chronic morphine
  • 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
  • Results for 1 gene were opposite in DRG and Gemma
  • 45 did not have enough information provided in the paper to make a judgment
Relatively simple standards would make it easier to perform comparisons across the ecosystem
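The baseline problem described above (one database reporting change relative to chronic morphine, the other relative to saline) suggests what a "relatively simple standard" could buy: if every statement carried its baseline explicitly, directions could be re-expressed against a common baseline before comparison. The sketch below is a hypothetical illustration with invented gene labels and records, and it assumes only two conditions, so that swapping the baseline simply flips the direction of change.

```python
FLIP = {"up": "down", "down": "up"}


def to_baseline(direction: str, baseline: str, target: str) -> str:
    """Re-express a direction of change relative to the target baseline.

    Assumes a two-condition comparison, so a different baseline means
    the reported direction is inverted.
    """
    if direction not in FLIP:
        raise ValueError(f"unknown direction: {direction!r}")
    return direction if baseline == target else FLIP[direction]


def compare(source_a: dict, source_b: dict, target: str = "saline") -> dict:
    """Compare {gene: (direction, baseline)} statements from two sources."""
    result = {"consistent": [], "opposite": [], "missing": []}
    for gene in sorted(source_a):
        if gene not in source_b:
            result["missing"].append(gene)
            continue
        dir_a = to_baseline(*source_a[gene], target)
        dir_b = to_baseline(*source_b[gene], target)
        key = "consistent" if dir_a == dir_b else "opposite"
        result[key].append(gene)
    return result


# Invented statements; GENE1..GENE3 are placeholders, not real results.
gemma_like = {
    "GENE1": ("down", "chronic morphine"),
    "GENE2": ("up", "chronic morphine"),
    "GENE3": ("up", "chronic morphine"),
}
drg_like = {
    "GENE1": ("up", "saline"),
    "GENE2": ("up", "saline"),
}

print(compare(gemma_like, drg_like))
```

Without the baseline field, GENE1 would look contradictory and GENE2 would look confirmed, which is the opposite of what the normalized comparison shows.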
Musings from the NIF
• Every resource is resource limited: few have enough time, money, staff or expertise required to do everything they would like
– If the market can support 11 MRI databases, fine
– Some consolidation, coordination is warranted
– How can industry help support the data space? How can they take them even further?
– Don’t let the data space become fractured
• Big, broad and messy beats small, narrow and neat
– Without trying to integrate a lot of data, we will not know what needs to be done
– Progressive refinement; addition of complexity through layers
• Be flexible and opportunistic: assume all will change
– A single optimal technology/container for all types of scientific data and information does not exist; technology is changing
• Think globally; act locally:
– No source, not even NIF, is THE source; we are all a source
– System and culture to be able to learn from everything
– Cooperative model for biomedicine