data cycle health
DATA LIFE CYCLE HEALTH: CONSTRAINTS, IMMEDIATE AND LONG TERM
28TH OCTOBER
EMBL-ABR
Challenges for data
• Management *
• Annotation
• Analysis
• Storage
• Sharing *
Long term planning and maintenance
• Project Funding and continuity of data availability and management
• Funders/institutional requirements: data from public resources must be in the public domain.
• Availability of storage facilities
• Availability of analysis facilities
DMP
Data sharing: going beyond what is required
• Granting authorities
• Journal requirements
• Improved health care
Facilitation by:
• Institutional sharing
• Availability of repositories/archives: GitHub, EBI repositories, NCBI repositories, institutional and national data archives
• Analysis, annotation and advertisement of resource
• Data publication
Analysis workflow metagenomics*
UniRule
InterPro2GO
Resource list
• Submission: GEO, SRA, ArrayExpress, ENA/GenBank/DDBJ; PRIDE, MetaboLights
• Annotation: UniProt, HPO, GO, InterPro, Reactome, PDB, Interactome …
• Visualisation: Ensembl, Networks, Structures …
• External annotations
Importance of meta-data
• Data valuation by addition of meta-data
• Incorrect/inadequate meta-data affects analysis and rediscovery
• No meta-data makes a set impossible to find, and of no value. Tagging helps.
• Student exercise – Male Breast Cancer – before they start a submission
• If you use resources, enrich them for use by yourself or others through submissions and annotation
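A minimal sketch of a pre-submission meta-data check, in the spirit of the student exercise above. The field names in REQUIRED_FIELDS are hypothetical; each repository publishes its own checklist, which should replace them.

```python
from typing import Dict, List

# Hypothetical required fields (an assumption for illustration only);
# substitute the checklist of the repository you are submitting to.
REQUIRED_FIELDS = ("organism", "tissue", "sex", "disease", "assay_type")


def missing_metadata(record: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or blank in a sample record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field, "")).strip()
    ]


# Example record: "sex" is blank and "assay_type" is absent, so a search for
# male breast cancer samples would not be able to find this set.
sample = {
    "organism": "Homo sapiens",
    "tissue": "breast",
    "sex": "",
    "disease": "breast carcinoma",
}

if __name__ == "__main__":
    print("Missing meta-data fields:", missing_metadata(sample))
```

Running a check like this before submission is cheaper than repairing a record after it has been archived.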
Ontologies and standards
Interoperability, searching and reasoning
• HPO – Phenotype Ontology
• EFO – experimental meta-data, more terms may be needed
• OBIB ontology – BioBanking
• SNOMED-CT
• dm+d
• ICD
Resource catalogue
BioSamples – deposit and reference study details for omics experiments
Omics experiments – access using the OMICS Discovery Index (http://www.omicsdi.org)
Annotation transfer
Limited biochemical resources, limited number of manual curators to transfer data into databases (UniProtKB/Swiss-Prot, GO)
Annotation transfer – Gene Ontology
• InterPro2GO
• EC2GO
• UniProt-keywords2GO
• Ensembl Compara
• UniProt-subcellular locations2GO
• HAMAP2GO
• UniPathway2GO
Annotation transfer – TrEMBL annotation
• UniProt UniRules
All based on InterPro family/domain matches
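The idea of mapping-based annotation transfer can be sketched as follows: every protein that matches an InterPro entry inherits the GO terms mapped to that entry. This is an illustrative sketch, not the production pipeline. The mapping lines follow the general "external2go" text layout, but all identifiers (IPR900001, GO:9900001, PROT_A and so on) are invented placeholders, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Placeholder lines in an "external2go"-style layout. The identifiers are
# invented for illustration and do not correspond to real entries.
MAPPING_LINES = [
    "! comment lines start with an exclamation mark",
    "InterPro:IPR900001 Example kinase domain > GO:example kinase activity ; GO:9900001",
    "InterPro:IPR900001 Example kinase domain > GO:example phosphorylation ; GO:9900002",
    "InterPro:IPR900002 Example DNA-binding domain > GO:example DNA binding ; GO:9900003",
]


def parse_mapping(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Parse mapping lines into {entry accession: set of GO identifiers}."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        left, _, right = line.partition(" > ")
        if not right or ";" not in right:
            continue  # skip lines that do not carry a GO identifier
        accession = left.split()[0].split(":", 1)[1]
        go_id = right.rsplit(";", 1)[1].strip()
        mapping[accession].add(go_id)
    return dict(mapping)


def transfer_annotations(
    matches: Dict[str, List[str]], mapping: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Give each protein the GO terms mapped to the entries it matches."""
    result: Dict[str, List[str]] = {}
    for protein, accessions in matches.items():
        terms: Set[str] = set()
        for accession in accessions:
            terms |= mapping.get(accession, set())
        result[protein] = sorted(terms)
    return result


# Hypothetical domain matches for two unreviewed proteins.
matches = {
    "PROT_A": ["IPR900001", "IPR900002"],
    "PROT_B": ["IPR900003"],  # no mapping exists, so nothing is transferred
}

if __name__ == "__main__":
    print(transfer_annotations(matches, parse_mapping(MAPPING_LINES)))
```

A protein with no mapped match simply receives no terms, which is why coverage depends on the quality of both the domain matches and the curated mapping.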
ICO: code of practice
• Processed in a lawful, fair and transparent manner
• Collected for a specified purpose
• Adequate, relevant and limited to what is necessary
• Accuracy is maintained
• Data subject will not be identifiable for longer than necessary
• Processed in a secure manner and protected against unexpected loss or destruction
• Rights of the individual will be protected
ICO code of practice
This is handled in four ways:
• Ethical approval for study
• Explicit consent from individual
• Follow security guidelines from ICO
• Develop strong governance around this data
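One technical measure that can support the points above is replacing direct identifiers with keyed pseudonyms before data leaves the clinical setting. The sketch below is an assumption about how that might be done, using a keyed hash (HMAC-SHA256) from the Python standard library. It is not taken from ICO guidance, the function name and sample identifier are hypothetical, and it does not replace ethical approval, consent or governance.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a stable keyed pseudonym for a participant identifier.

    The same identifier and key always give the same pseudonym, so records
    can still be linked, but the pseudonym cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical key and identifier. In practice the key would be generated
    # randomly and stored separately from the research data set.
    key = b"replace-with-a-randomly-generated-secret"
    print(pseudonymise("PATIENT-0001", key))
```

Destroying the key once linkage is no longer needed is one way to limit how long a data subject stays re-identifiable.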
Proprietary data
• Any data generation funded by a commercial entity may have data restrictions associated with it.
• Any data generation involving proprietary organisms/environs may have data restrictions on them.
• Data withdrawal – obsolete vs destroy
Software
• Software graveyard and compute
• Costing for IT and sustainability of software resources
• Software publication
• Recognition and peer review
• Reproducibility in omics research
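As one small, hedged illustration of what supports reproducibility, the sketch below records a checksum for each input file together with the interpreter version, so that a later re-analysis can confirm it starts from the same data and a comparable environment. The manifest layout and function names are assumptions, not an established standard.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Union


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: Iterable[Union[str, Path]]) -> Dict[str, object]:
    """Record the Python version and a checksum for every input file."""
    return {
        "python_version": platform.python_version(),
        "files": {str(p): sha256_of(Path(p)) for p in paths},
    }


if __name__ == "__main__":
    # Demonstration with a throwaway file standing in for a real data file.
    with tempfile.TemporaryDirectory() as workdir:
        reads = Path(workdir) / "reads.fastq"
        reads.write_bytes(b"@read1\nACGT\n+\nIIII\n")
        print(json.dumps(build_manifest([reads]), indent=2))
```

Storing such a manifest alongside the deposited data lets other users verify their copy before repeating the analysis.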
Data life-cycle
Sequence/assembly/Annotation/RNA seq
Public domain data deposition
Update annotation
In-house data resource
Summary
• Identify potential issues early on in the project life-cycle – spend time identifying issues and planning how to address them
• Prepare to share data as early as possible – what information would you like to see if you were the data user?
• Think beyond the lifetime of the grant: what are your long-term plans for the sustainability of the data?
• If you have issues with access, do feed back. If the data is not the primary submission, this should be sorted out. Add bug reports as well.