EVS Data Curation The processing and publication of data for web browsing and programmatic access

Upload: evelyn-thomas

Post on 18-Jan-2018


Page 1

EVS Data Curation

The processing and publication of data for web browsing and programmatic access

Page 2

Data Curation Flowchart

Page 3

Gene Ontology and Zebrafish

Downloaded as OBO from web sites
Processed with C++ program into Ontylog XML (OBO2TDE.exe)
Processed with C++ program into OWL (ontyxToOWL.exe)
Loaded using LoadNCIThesOWL.sh
Metadata loaded using LoadMetadata
Hierarchy and Sources manually edited
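
To make the input of this pipeline concrete, here is a minimal Python sketch that reads [Term] stanzas from OBO text. It is not OBO2TDE.exe, whose internals are not described in these slides; the function name parse_obo_terms and the SAMPLE text are illustrative assumptions, and only the id, name, and is_a tags are handled.

```python
def parse_obo_terms(text):
    """Return {term_id: {"name": ..., "is_a": [...]}} for each [Term] stanza."""
    terms = {}
    current = None  # dict for the [Term] stanza being read, else None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A stanza header such as [Term] or [Typedef]; keep only terms.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or not line or line.startswith("!"):
            continue  # file header, non-term stanza, blank line, or comment
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a tag: value line
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! parent name" comment, keep the parent id.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


# Illustrative sample only; a real download is far larger.
SAMPLE = """format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE)
```

The is_a lists carry the hierarchy that the later steps would need to preserve when converting to Ontylog XML and OWL.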

Page 4

HL7 and VA_NDFRT

Retrieved from sources
Processed by Apelon into Ontylog XML
Loaded into LexBIG using LoadNCIThesOwl and manifest
Metadata loaded using LoadMetadata

Page 5

MGED

OWL file downloaded from source web site
Loaded into Protégé
Classified
Inferred version exported as OWL file
Loaded into LexBIG using LoadNCIThesOwl
Metadata loaded using LoadMetadata
Hierarchy and Sources manually edited

Page 6

SNOMED, MedDRA and LOINC

Extracted from the UMLS into RRF files
Loaded into LexBIG using LoadUMLSFiles
Metadata loaded using LoadMetadata

Page 7

UMLS Semnet

Downloaded from UMLS Semnet web site
Loaded using LoadUMLSSemnet
Metadata loaded using LoadMetadata

Page 8

Metathesaurus

Loaded from UMLS into MEME
NCI Thesaurus imported monthly
Other vocabularies added or removed
NCI-specific edits made to data and relations
Exported as RRF
Imported to LexBIG using LoadNCIMeta
Metadata loaded using LoadMetadata

Page 9

Preparing TDE Thesaurus for MEME

Thesaurus Ontylog XML baseline is processed through C++ app publishMEME.exe
Current baseline compared to previous to get summary of new properties or roles
Summary used to create import configuration file
Baseline imported into MEME
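
The baseline comparison step can be sketched as a set difference per kind of name. The slides do not say how the comparison is actually implemented or what the import configuration file looks like, so the function summarize_new, the dictionary layout, and the sample names below are all assumptions for illustration. The sketch also assumes the property and role names have already been extracted from the two Ontylog XML baselines.

```python
def summarize_new(previous, current):
    """List names present in the current baseline but not in the previous one.

    Both arguments map a kind ("properties" or "roles") to an iterable of
    names already extracted from a baseline (extraction is out of scope here).
    """
    summary = {}
    for kind in ("properties", "roles"):
        old = set(previous.get(kind, ()))
        new = set(current.get(kind, ()))
        summary[kind] = sorted(new - old)
    return summary


# Hypothetical inputs, not taken from a real baseline.
previous = {"properties": ["Preferred_Name", "DEFINITION"],
            "roles": ["Role_A"]}
current = {"properties": ["Preferred_Name", "DEFINITION", "New_Property"],
           "roles": ["Role_A", "Role_B"]}

summary = summarize_new(previous, current)
```

Anything listed in the summary is a name the import configuration would have to account for before the baseline is imported.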

Page 10

Preparing Thesaurus for MEME

Page 11

NCI Thesaurus from TDE

Edited in TDE and exported to Ontylog XML by name
Run through publishTDE to remove unpublishable properties
Run through OntyxToOwl.exe to create OWL file by code
Loaded into LexBIG using LoadNCIThesOWL
Metadata loaded using LoadMetadata
History generated from TDE baseline
History loaded using LoadNCIHistory

Page 12

NCI Thesaurus from TDE

Page 13

NCI Thesaurus from Protégé

Run OWL through application to get Ontylog XML by name
Run Ontylog XML through publishTDE to remove unpublishable properties
Run through OntylogtoOWL to get OWL by code
Do history using the Ontylog XML

Page 14

NCI Thesaurus History Processing

evs_history records concept modifications made in editor
These records are extracted monthly to consolidate and to remove identifying information
Cleaned records are loaded into concept_history
Full concept_history loaded into LexBIG for NCI Thesaurus
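
A rough sketch of the monthly cleaning step follows. The nine-field pipe-delimited layout (record id, code, name, action, timestamp, modeler, host, reference, flag) is inferred from the sample records shown in log.out later in this deck, not from a documented schema. Treating "consolidate" as collapsing same-day duplicates is also an assumption, as are the function name clean_history and the second sample record.

```python
def clean_history(lines):
    """Drop modeler and host fields and collapse same-day duplicate actions."""
    seen = set()
    cleaned = []
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) != 9:
            continue  # skip malformed or truncated lines
        _rec_id, code, _name, action, timestamp, _modeler, _host, reference, _flag = fields
        # Keep only the date part; modeler and host are never copied over.
        key = (code, action.lower(), timestamp.split(" ")[0], reference)
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


records = [
    "668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0",
    # Hypothetical second edit on the same day by another modeler.
    "668630|C3824|Lesion|Modify|2008-03-06 14:10:00.0|modeler_b|host.example|(null)|0",
    "671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0",
]

cleaned = clean_history(records)
```

The two same-day Modify records for C3824 collapse into one entry, and no modeler or host name survives in the output.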

Page 15

History

Page 16

TDE to DTS

Page 17

log.out

New concepts created through Create or Split actions:
C72675|Feet_First
.
Concepts merged into other concepts:
C17841|Oncologic_Surgeon
.
Retired concepts (including merged):
C17841|Oncologic_Surgeon
.
New concepts not found in BSLN2:
C73140|Ethaverine_
.
Retired concepts not found in BSLN2
C73401|Maqui_Berry_Flavor
.
Modify records correponding to Retired_Kind are discarded:
667487|C62920|Medical_Device_Unsafe_to_Use|Modify|2008-03-05 …
.
Modify records correponding to new codes are discarded:
666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 …
.
Modify records correponding to merged codes are discarded:
668629|C3824|Lesion|Modify|2008-03-06 11:03:49.0|remennik|6116otsaremennl.nci.nih.gov|(null)|0
.
Records correponding to codes not found in BSLN2 are discarded:
671933|C73140|Ethaverine_|New|2008-03-19 12:03:01.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
.
WARNING: New codes created, then retired, but still found in BSLN2: (to be edited manually)
C72675|Feet_First
.
List of all remaining records
.
List of all discarded records:
666753|C72831|Pramiracetam_Hydrochloride|Modify|2008-02-29 09:02:56.0|shaiu|MSDCorp-Mesh001.inside.msdinc.com|(null)|0
.
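
The discard rules named in this log can be expressed as a small filter. The sketch below is a reading of the log headings only: Modify records for new, retired, or merged codes are dropped, and so is any record whose code is missing from the second baseline (BSLN2). The function name split_records, the dictionary record shape, and the sample sets are assumptions, and the actual tool may apply further rules not visible here.

```python
def split_records(records, new_codes, retired_codes, merged_codes, baseline_codes):
    """Return (kept, discarded) following the discard headings in log.out."""
    skip_modify = new_codes | retired_codes | merged_codes
    kept, discarded = [], []
    for rec in records:
        code, action = rec["code"], rec["action"]
        if code not in baseline_codes:
            discarded.append(rec)  # code not found in BSLN2
        elif action == "Modify" and code in skip_modify:
            discarded.append(rec)  # Modify on a new, retired, or merged code
        else:
            kept.append(rec)
    return kept, discarded


# Codes come from the log above; the set memberships are illustrative.
records = [
    {"code": "C62920", "action": "Modify"},
    {"code": "C72831", "action": "New"},
    {"code": "C72831", "action": "Modify"},
    {"code": "C3824", "action": "Modify"},
    {"code": "C73140", "action": "New"},
    {"code": "C72446", "action": "Modify"},
]
kept, discarded = split_records(
    records,
    new_codes={"C72831"},
    retired_codes={"C62920"},
    merged_codes={"C3824"},
    baseline_codes={"C62920", "C72831", "C3824", "C72446"},
)
```

In this sample the New record for C72831 and the Modify record for C72446 are kept, and the other four records are discarded.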

Page 18

tde_history_report.txt

Spilanthes_oleracea (Code: C72446)

Number of modelers: 3
Modeler: shaiu
Modeler: thomas
Modeler: creech

Modeler: shaiu
Action: modify time: 2008-03-05 05:03:58.0

Modeler: thomas
Action: modify time: 2008-03-06 02:03:05.0
Action: modify time: 2008-03-14 10:03:06.0

Modeler: creech
Action: modify time: 2008-03-06 02:03:06.0

------------------------------------------------------------------
.

Edited actions for the following concepts are discarded:

Concept codes requiring manual review:

Page 19

DTS_history

DTS_history_script.sql
insert into concept_history(concept, editaction, editdate, reference) values ('C72675', 'create', '28-MAR-08', null);
insert into concept_history(concept, editaction, editdate, reference) values ('C72676', 'create', '28-MAR-08', null);
..

DTS_history_out.txt
666540|C72675|create|28-MAR-08|(null)
666541|C72676|create|28-MAR-08|(null)
666542|C62171|modify|28-MAR-08|(null)
..
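
The two files above carry the same fields in different shapes. The sketch below shows that correspondence by turning one line of DTS_history_out.txt into the matching insert statement. Whether the real script is generated this way is not stated in the slides, and the function name to_insert is hypothetical.

```python
def to_insert(line):
    """Build a concept_history insert statement from one pipe-delimited line."""
    _rec_id, concept, action, date, reference = line.strip().split("|")

    def quote(value):
        # "(null)" in the text file corresponds to SQL null in the script.
        if value == "(null)":
            return "null"
        return "'" + value.replace("'", "''") + "'"

    return (
        "insert into concept_history(concept, editaction, editdate, reference) "
        f"values ({quote(concept)}, {quote(action)}, {quote(date)}, {quote(reference)});"
    )


statement = to_insert("666540|C72675|create|28-MAR-08|(null)")
```

String building is used here only to mirror the script file. When loading through a database driver, bind parameters would be the safer choice.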

Page 20

DTS_history_out.out

Lists complete contents of both baselines
.
Number of codes in {baseline A} : 65265
Number of codes in {baseline B} : 66022

Concepts found in {baseline B}: but not in {baseline A}
C72675
C72676
.
Concepts found in {baseline A}: but not in {baseline B} (should be empty)
.
Verify DTS_history_out.txt against baseline data
.
New Concepts: 757
(1) C72675
(2) C72676
.
Concepts created through Split: 0
Split Concepts: 0
Retired Concepts: 4
(1) C20920
(2) C62920
Concepts retired through Merge: 5
(1) C14142
Merge Concepts: 5
(1) C1363
Modified Concepts: 1364
Invalid actions: 0
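
The baseline check in this report amounts to two set differences plus a count comparison. Note that 66022 - 65265 = 757, which matches the New Concepts figure when nothing from baseline A is missing in baseline B. The following is a minimal sketch of that check; the function name compare_baselines, the result keys, and the tiny sample code lists are assumptions, not the actual report generator.

```python
def compare_baselines(codes_a, codes_b):
    """Compare the concept codes of an older baseline A and a newer baseline B."""
    a, b = set(codes_a), set(codes_b)
    only_in_b = sorted(b - a)  # candidate new concepts
    only_in_a = sorted(a - b)  # should be empty, per the report
    return {
        "only_in_b": only_in_b,
        "only_in_a": only_in_a,
        # Counts agree only if no code vanished between baselines.
        "counts_consistent": len(b) - len(a) == len(only_in_b) - len(only_in_a),
    }


# Tiny illustrative baselines; the real ones hold tens of thousands of codes.
baseline_a = ["C1363", "C14142", "C62171"]
baseline_b = ["C1363", "C14142", "C62171", "C72675", "C72676"]

result = compare_baselines(baseline_a, baseline_b)
```

A non-empty only_in_a list would flag codes that disappeared between baselines and need manual review.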

Page 21

Tiered Deployments

NCICB uses 4-tiered deployments
Dev tier: used internally by EVS team to test software and data
QA tier: used by QA and other software teams to test against new EVS software or data
Stage tier: used to test software deployments in a near-production environment
Production: available to outside users
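
For scripts that need to refer to these tiers, one option is an ordered enumeration. The sketch below assumes releases are promoted in the order the tiers are listed (Dev, QA, Stage, Production), which the slide implies but does not state. The names Tier and next_tier are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The four deployment tiers, numbered in assumed promotion order."""
    DEV = 1
    QA = 2
    STAGE = 3
    PRODUCTION = 4


def next_tier(tier):
    """Return the tier a release would move to next, or None after Production."""
    if tier is Tier.PRODUCTION:
        return None
    return Tier(tier + 1)


promotion_path = [Tier.DEV]
while next_tier(promotion_path[-1]) is not None:
    promotion_path.append(next_tier(promotion_path[-1]))
```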