data landscapes: the neuroscience information framework
DESCRIPTION
Overview of how to use the Neuroscience Information Framework for data discovery, presented at the Genetics of Addiction Workshop, held at Jackson Lab, Aug 28 - Sept 1, 2014.
TRANSCRIPT
Data Landscapes: The Neuroscience Information Framework
neuinfo.org
Maryann E. Martone, Ph.D.
University of California, San Diego
Organization
• Introduction
• The Neuroscience Information Framework
• A tour of NIF
• The NIF Framework
– Ontologies
– NIF Analytics: What can we learn from the data space?
• Where do we go from here?
– Resource Identification Initiative
– Conclusions
• NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
Old Model: Single type of content; single mode of distribution
[Diagram labels: Scholar, Library, Publisher]
FORCE11.org: Future of research communications and e-scholarship
[Diagram labels: Scholar, Consumer, Libraries, Data Repositories, Code Repositories, Community databases/platforms, OA, Curators, Social Networks, Peer Reviewers, Narrative, Workflows, Data, Models, Multimedia, Nanopublications, Code]
Solving the large problems of science?
• Observation
• Experimentation
• Modeling
• Cooperative data intensive science
“An unaided human’s ability to process large data sets is comparable to a dog’s ability to do arithmetic, and not much more valuable.” –Michael Nielsen, Reinventing Discovery, 2012.
NIF: A New Type of Entity for New Modes of Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to and utility of digital resources produced worldwide to enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain, funding agency, institute or community
– NIF is a library for scholarly output that is a web enabled resource and not a paper
– Aggregates all the different databases, tools and resources now produced by the scientific community
– Makes them searchable from a single interface
– A practical approach to the data deluge
– Educate neuroscientists and students about effective data sharing
Surveying the resource landscape
NIF resource registry: listing of > 12000 databases, tools, materials, services, websites (> 2500 databases)
NIF data federation: PubMed Central for data
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data resources, providing deep query of the contents and unified views
200 sources; > 800 M records
Registry vs Federation: Metadata about resource vs metadata/data in database
What resources are available for Addiction and GRM1?
With the thousands of databases and other information sources available, simple descriptive metadata will not suffice
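The registry/federation distinction can be pictured as two kinds of records with two kinds of search. The following is a minimal sketch only: the class names, field names, and example rows are invented for illustration and are not NIF's actual schema or content.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Metadata ABOUT a resource (hypothetical fields, not NIF's schema)."""
    name: str
    resource_type: str            # e.g. "database", "tool", "material"
    keywords: list = field(default_factory=list)


@dataclass
class FederatedRecord:
    """Metadata/data found IN a database (hypothetical fields)."""
    source: str                   # the registered resource the row came from
    values: dict = field(default_factory=dict)


def search_registry(entries, term):
    """Registry search: only the descriptive metadata is matched."""
    term = term.lower()
    return [e for e in entries
            if term in e.name.lower()
            or any(term in k.lower() for k in e.keywords)]


def search_federation(records, term):
    """Federation search: the contents of the records are matched."""
    term = term.lower()
    return [r for r in records
            if any(term in str(v).lower() for v in r.values.values())]


# Invented example data.
entries = [RegistryEntry("ExampleGeneDB", "database", ["gene expression", "mouse"])]
records = [FederatedRecord("ExampleGeneDB", {"gene": "GRM1", "region": "striatum"})]

registry_hits = search_registry(entries, "GRM1")
federation_hits = search_federation(records, "GRM1")
```

In this toy case the registry search for "GRM1" finds nothing, because the gene name is not part of the resource description, while the federation search finds the row inside the database. That is the sense in which simple descriptive metadata does not suffice.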
How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
• NIF Registry
– Requires no special skills
– Site map available for local hosting
• NIF Data Federation
– DISCO interop
– Requires some programming skill
– Open Source Brain < 2 hr
Low barrier to entry; incremental refinement
What about my data?
• Best practice:
– Put it in a repository
• What repository?
– Community repository for your data type, e.g., GEO
– General repository:
• Dryad
• FigShare
– Institutional repository
• Research libraries are setting up repositories to manage their “digital assets”
NIF can help you find a place for your data
Requirements for effective data sharing
• Discoverability
– Data can be found
• Accessibility
– Data can be accessed and access rights are clear
– Links to data are stable
• Assessability
– The reliability of the data can be determined
• Understandability
– The data can be understood
• Usability
– The data are in a usable form
Duality of modern scholarship: A machine and human dimension to each
• Publishing data on your website or as supplemental material is not the best way to make it available
But we have Google!
• Current web is designed to share documents
– Documents are unstructured data
• Much of the content of digital resources is part of the “hidden web”
• Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g., microarray data sets from GEO; brain images from XNAT; microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through data processing and sometimes normalization, e.g., brain structure volumes (IBVD), gene expression levels (Allen Brain Atlas); brain connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the meaning of data
• E.g., gene upregulation/downregulation, brain activation as a function of task
• Registries:
– Metadata
– Pointers to data sets or materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of information artifacts using a multitude of technologies
Which databases do you use?
• Mouse Genome Database
• ClinicalTrials.gov
• PubMed
• dbGaP
• GEO
• NIH RePORTER
• OMIM
• Bionumbers:
– a database of numerical values extracted from literature
• Epigenomics
– human epigenomic data to catalyze basic biology and disease-oriented research
• Antibody Registry
– 2M antibodies
• BioGrid
– an interaction repository of protein and genetic interactions
Most resources are largely unknown and underutilized
NIF unifies look, feel and access
Making it easier to access and understand distributed databases
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
Exploring the data space
Facets and filters: Progressive refinement of search
More effective to start with a general query and use the navigation to refine search
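Progressive refinement can be sketched as a broad keyword query followed by facet counts and filters applied to the current result set. This is a minimal illustration under stated assumptions: the records, field names, and helper functions are invented and do not reflect how the NIF interface is implemented.

```python
from collections import Counter

# Invented example records; the field names are assumptions for illustration.
RECORDS = [
    {"text": "morphine striatum gene expression", "source": "SourceA", "organism": "mouse"},
    {"text": "morphine analgesia dose response", "source": "SourceB", "organism": "rat"},
    {"text": "morphine chronic exposure microarray", "source": "SourceA", "organism": "rat"},
    {"text": "cocaine striatum dopamine", "source": "SourceC", "organism": "mouse"},
]


def search(records, query):
    """General keyword query: keep records whose text contains the query."""
    return [r for r in records if query.lower() in r["text"].lower()]


def facet_counts(records, facet):
    """Count how many of the current results fall under each facet value."""
    return Counter(r[facet] for r in records)


def refine(records, facet, value):
    """Narrow the current result set to a single facet value."""
    return [r for r in records if r[facet] == value]


hits = search(RECORDS, "morphine")           # start with a general query
by_source = facet_counts(hits, "source")     # navigation offered to the user
narrowed = refine(hits, "organism", "rat")   # then refine step by step
```

The point of the design is that each refinement operates on the previous result set, so the user sees what is available before committing to a narrow query.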
Some NIF(ty) Features
Current challenge: With so much available, how do I find what I need?
• “What genes are upregulated by chronic morphine?”
– It depends
• Most often use cases require connecting a researcher to relevant data sets and appropriate tools
– Depending upon the data and tools, the answers may differ
• Many databases have tool bases and workflows that they support
Exploration of NIF: 1. Progressive refinement of search
More effective to start with a general query and use the navigation to refine search
2. “Data trails”: Linking data and analysis tools
Same data: different analysis
• Gemma: Gene ID + Gene Symbol
• DRG: Gene name + Probe ID
• Gemma presented results relative to baseline chronic morphine; DRG with respect to saline, so direction of change is opposite in the 2 databases
Chronic vs acute morphine in striatum
• Analysis:
– 1370 statements from Gemma regarding gene expression as a function of chronic morphine
– 617 were consistent with DRG; over half of the claims of the paper were not confirmed in this analysis
– Results for 1 gene were opposite in DRG and Gemma
– 45 did not have enough information provided in the paper to make a judgment
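The baseline issue noted for Gemma and DRG can be illustrated with a small sketch. It assumes a two-condition comparison (saline vs chronic morphine) reported as a log2 fold change; the function names, record layout, and numbers are invented and are not taken from either database. Swapping the baseline inverts the ratio, so the log fold change changes sign and must be put on a common baseline before directions are compared.

```python
def to_common_baseline(log2_fc, baseline, target_baseline="saline"):
    """Express a log2 fold change relative to target_baseline.

    With only two conditions, swapping the baseline inverts the ratio,
    and log2(1/x) == -log2(x), so the sign flips.
    """
    return log2_fc if baseline == target_baseline else -log2_fc


def compare(stmt_a, stmt_b):
    """Return 'consistent' or 'opposite' for two statements about one gene."""
    a = to_common_baseline(stmt_a["log2_fc"], stmt_a["baseline"])
    b = to_common_baseline(stmt_b["log2_fc"], stmt_b["baseline"])
    return "consistent" if (a > 0) == (b > 0) else "opposite"


# Invented values for a hypothetical gene.
source_1 = {"gene": "GeneX", "log2_fc": -1.2, "baseline": "chronic morphine"}
source_2 = {"gene": "GeneX", "log2_fc": 0.9, "baseline": "saline"}

naive_agreement = (source_1["log2_fc"] > 0) == (source_2["log2_fc"] > 0)
verdict = compare(source_1, source_2)
```

Compared naively, the two invented statements disagree in sign; once both are expressed relative to saline they agree. A real comparison would also need to reconcile the different gene identifiers each source uses, which this sketch does not attempt.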
NIF is working to make it easier to find where data has gone and what has been done with it
3. SciCrunch: A social network for data and tools
• NIF platform has been adapted to create SciCrunch
– Beta release: http://scicrunch.com
• Create more narrow community-based portals based on common data platform
• Select your data; organize it as you wish
• Cost effective: a data portal can be set up in a few hours
• Connects communities through data and tools
• Shared curation-shared knowledge
[Diagram: SciCrunch. Labels: Shared Resources, Undiagnosed Disease Program, Phenotype RCN, One Mind for Research, Consortia-Pedia, Faster Cures, Model Organism Databases, Community Outreach, Community Built Uniform Resource Layer, Resource Identification Portal, Aging, Neuroscience, dkNET, Phenotypes, NSF Earthcube]
Breaking down silos: Community enrichment
Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs relational vs RDF
– Effective search across semantically diverse sources
• NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics and Linking data
– Uniform external data references
• 2013-present: SciCrunch: unified biomedical resource services
• “data trails”
NIF provides a strategy and set of tools applicable to all biomedical science
INFORMATION FRAMEWORKS
- a tool for analyzing and structuring information (“a reduction of uncertainty”)
What is an effective information framework for neuroscience?
Knowledge in space and spatial relationships (the “where”)
Knowledge in words, terminologies and logical relationships (the “what”)
NIF Semantic Framework: NIFSTD ontology
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
[Diagram: NIFSTD. Labels: Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure]
Ontologies provide the universals for integrating across disparate data by linking them to human knowledge models
[Diagram labels: Purkinje Cell, Axon Terminal, Axon, Dendritic Tree, Dendritic Spine, Dendrite, Cell body, Cerebellar cortex]
Space limitations: Multiscale integration is not obvious
There is little obvious connection between data sets taken at different scales using different microscopies without an explicit representation of the biological objects that the data represent
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on computers
NIF “translates” common concepts through ontology and annotation standards
What genes are upregulated by drugs of abuse in the adult mouse? (show me the data!)
[Query terms: Morphine; Increased expression; Adult Mouse]
Another search tip: Custom query syntax
Ontologies as a data integration framework
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (Human fMRI)
– Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
– Number of exact terms used in > 1 database: 42
– Number of synonym matches: 99
– Number of 1st order partonomy matches: 385
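The three kinds of matches counted above (exact, synonym, first-order partonomy) can be sketched as a small classifier over brain-region terms. This is an illustration under stated assumptions: the synonym and part-of tables are a toy vocabulary invented for the example, not an excerpt of NIFSTD, and the function names are hypothetical.

```python
# Toy vocabulary, invented for illustration; not taken from NIFSTD.
SYNONYMS = {
    "caudoputamen": "striatum",
    "neostriatum": "striatum",
}
PART_OF = {
    "caudate nucleus": "striatum",
    "putamen": "striatum",
}


def normalize(term):
    """Lowercase and trim a term so that case differences do not matter."""
    return term.strip().lower()


def canonical(term):
    """Map a term to its preferred label through the synonym table."""
    t = normalize(term)
    return SYNONYMS.get(t, t)


def match_type(term_a, term_b):
    """Classify two terms as 'exact', 'synonym', 'partonomy', or 'none'."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return "synonym"
    # First-order partonomy: one term is the direct parent of the other.
    if PART_OF.get(ca) == cb or PART_OF.get(cb) == ca:
        return "partonomy"
    return "none"
```

For example, `match_type("Caudoputamen", "Striatum")` falls under the synonym case and `match_type("Putamen", "Neostriatum")` under the partonomy case. Without the ontology tables, neither pair would be linked by a plain string comparison.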
NIF ANALYTICS
What can we learn from the data space?
Data Federation Growth
NIF searches the largest collation of neuroscience-relevant data on the web
Definition: “The long tail of small data”
• Long tail data: large numbers of small data sets
http://en.wikipedia.org/wiki/Long_tail
Estimate: ~50% of long tail data is “Dark data”: data not available for search
NIF Analytics: The Neuroscience Landscape
Ontologies provide a semantic framework for understanding data/resource landscape
Where are the data?
[Heat map: Brain region (Striatum, Hypothalamus, Olfactory bulb, Cerebral cortex, Brain) vs. Data source; legend: 0, 1-10, 11-100, >101. Vadim Astakhov, Kepler Workflow Engine]
Data and knowledge gaps
NIF lets us ask: where isn’t there data? What isn’t studied? Why?
[Heat map: Forebrain, Midbrain, Hindbrain vs. Data Sources; legend: 0, 1-10, 11-100, >101]
SW Oh et al. Nature 000, 1-8 (2014) doi:10.1038/nature13186
Adult mouse brain connectivity matrix: revenge of the midbrain
The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis. However, for several reasons tail of the caudate activity can easily be missed.
• One reason is limitations in the normalization algorithms, that typically are optimized to maximize accuracy for cortical rather than subcortical structures. ...
• A second reason is that standard neuroimaging atlases such as the Harvard-Oxford structural atlas used with neuroimaging analysis programs such as FreeSurfer truncate the caudate at the body, and completely exclude the tail...
• A final reason is that the tail of the caudate is close to the hippocampus, and could be misidentified as such especially in tasks involving learning and memory.
Therefore, the tail of the caudate may be recruited in additional cognitive tasks, but yet not have been properly identified and reported in the neuroimaging literature”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
“The Data Homunculus”
Beware of biases in the data space...
WHERE ARE WE GOING?
The Encyclopedia of Life
A…
Access to data has changed over the years
Tim Berners-Lee: Web of data
Wikipedia defines Linked Data as “a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.” http://linkeddata.org/
GenBank
PDB
“Whichever technology wins broad adoption will become, by default, the data web. That’s why we don’t need to know which technological vision of the data web will win to conclude that the data web is inevitable” –Michael Nielsen
I am a number: ORCID ID
The web of data runs on the ability to uniquely identify all the relevant entities
• Have authors supply appropriate identifiers for key resources used within a study such that they are:
– Machine processible (i.e., unique identifier that resolves to a single resource)
– Outside of the paywall
– Uniform across journals and publishers
• Goal: Proof of principle
– What infrastructure would be needed
– Could authors perform the task
– Would authors perform the task
– Will it be useful?
Resource Identification Initiative
http://www.force11.org/resource_identification_initiative
What studies used ...?
• 100 articles have appeared to date
• 15 journals
• Data set being made available to community
• > 600 RRIDs
– ~10% disappeared after copyediting
– 5% were in error
– 14% false negative rate
• > 200 antibodies were added
• > 75 software tools/databases were added
Database available at: https://www.force11.org/node/5635
RRID:AB_90755
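Because an identifier such as RRID:AB_90755 is machine processible, it can be pulled out of article text automatically. The following is a minimal sketch: the regular expression is an assumption modeled on that single example (a prefix, an underscore, an accession), real identifiers may need a broader pattern, and the sample methods text is invented.

```python
import re

# Assumed pattern, modeled only on the example RRID:AB_90755.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")


def extract_rrids(text):
    """Return the unique RRIDs found in text, in order of first appearance."""
    seen = []
    for match in RRID_PATTERN.finditer(text):
        rrid = "RRID:" + match.group(1)   # normalize away optional whitespace
        if rrid not in seen:
            seen.append(rrid)
    return seen


# Invented methods-section text for illustration.
methods = (
    "Sections were incubated with an example antibody (RRID:AB_90755) "
    "and imaged; the same antibody (RRID: AB_90755) was used for blots."
)
found = extract_rrids(methods)
```

A query such as "What studies used ...?" then reduces to running this kind of extraction over many articles and grouping the articles by identifier.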
[Diagram labels: Article, Code, Blogs, Workflows, Data, Portals, Persistent Identifiers]
Unique and persistent identifiers and a system for referencing them allow a scholarly ecosystem to function
An ecosystem for research objects
[Diagram labels: Data, Code, Blogs, Workflows, Portals, Search engines, Persistent Identifiers]
Taking a global view on data: microculture to ecosystem
• Several powerful trends should change the way we think about our data: One → Many
– Many data
• Generation of data is getting easier; shared data
• Data space is getting richer: more –omes every day
• But...compared to the biological space, still sparse
– Many eyes
• Wisdom of crowds
• More than one way to interpret data
– Many algorithms
• Not a single way to analyze data
– Many analytics
• “Signatures” in data may not be directly related to the question for which they were acquired but tell us something really interesting
One data set one algorithm one paper???
How you can contribute
• Register your tools/data to NIF
• Let us help you with your use cases
• Use RRIDs in your publications
– http://scicrunch.com/resources
• Get your ORCID ID!
• Put your data in a repository
– NIF can help you find one; NIF is one
• If you are planning on building your own data resources, talk to us!
Future of Research Communications and e-Scholarship (FORCE11.org)
Join us! http://force11.org
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11