Genome and Proteome data integration in RDF

Nadia Anwar, Ela Hunt, Walter Kolch and Andy Pitt. Semantic Web Applications and Tools for Life Sciences, November 2008.

Upload: nadia-anwar

Post on 11-May-2015


Page 1: Genome and Proteome data integration in RDF

Genome and Proteome data integration in RDF

Nadia Anwar, Ela Hunt, Walter Kolch and Andy Pitt

Semantic Web Applications and Tools for Life Sciences, November 2008

Data Discovery

Genome, Transcripts, Proteins, Metabolites

Outline

• Data Integration in Bioinformatics

• Semantic data integration

• Francisella

• Integrating genome annotations with experimental proteomics data in RDF

• Further work

Data Integration is not a solved problem

Information discovery is not integrated

[Figure: data silos in bioinformatics. Labels recovered from the slide:]

Proteomics: Peptide Profiles, Peptide Abundance, Protein Identification, Protein Interactions, PT-Modifications, LIMS

Gene Expression: Transcript Profile, Transcript Abundance, LIMS

Genomics: Sequence, ORF Prediction, Genome Comparisons, LIMS

Metabolomics: LIMS

Other labels: Genome, Metabolic Pathways, Microarray experiments, Proteomics experiments, High TP Sequencing, Computational analysis, Systems Biology, Synthetic Networks, Pathways, Predictions, Regulatory Networks, Translational Medicine

Semantic Data Integration across 'omes data silos

[Figure labels: Data, Information, Genes, Transcripts, Peptides, Metabolites, Genotype, Data Discovery]

Proof of concept: Francisella tularensis

[Images: ulceroglandular tularaemia, respiratory tularaemia, oculoglandular tularaemia]

Bioterrorism

• Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tularaemia is the most acute form of the disease)

• low infectious dose (10-50 bacteria, compared to anthrax, which requires 8000-15000 spores)

• weaponisation fears

Data sources: Genome

RDF

(4) IMG:gene_oid=639752258

FTN_0209 (3) IMG_S:locus_tag

229107 (3) IMG_S:genomic_location_start

229976 (3) IMG_S:genomic_location_end

+ (3) IMG_S:genomic_location_strand

TPR (2) RDFS:comment

RDF:description

(1) RDF:type

http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=639633024 (export)
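The annotation values on this slide can be written out as N-Triples with a few lines of code. The following is a minimal sketch in Python, not the conversion script used in the talk: the `http://example.org/...` namespaces and the `gene_to_ntriples` helper are hypothetical stand-ins, and only the values (639752258, FTN_0209, 229107, 229976, +, TPR) come from the slide.

```python
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
# Hypothetical namespaces: the real IMG schema IRIs are not recoverable here.
GENE_NS = "http://example.org/img/gene/"
SCHEMA_NS = "http://example.org/img/schema#"


def literal(value):
    """Format a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def gene_to_ntriples(gene_oid, record):
    """Return one N-Triples line per annotation of a gene record."""
    subject = f"<{GENE_NS}{gene_oid}>"
    lines = []
    for key, value in record.items():
        # The free-text description is stored as rdfs:comment.
        predicate = RDFS_COMMENT if key == "comment" else SCHEMA_NS + key
        lines.append(f"{subject} <{predicate}> {literal(value)} .")
    return lines


record = {
    "locus_tag": "FTN_0209",
    "genomic_location_start": 229107,
    "genomic_location_end": 229976,
    "genomic_location_strand": "+",
    "comment": "TPR",
}
triples = gene_to_ntriples("639752258", record)
for line in triples:
    print(line)
```

Each spreadsheet-style field becomes one statement about the gene, which is why a flat export of this kind converts to RDF so directly.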

Data sources: Genome annotations

Francisella SuperFamily Data

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616

RDF:description, RDF:type

SUPERFAMILY:cgi-bin/model.cgi?model=0040419, http://purl.uniprot.org/core/Protein_Family

155-367 SUPERFAMILY:Assignment_Region

5.1e-39 SUPERFAMILY:Score

SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540 SUPERFAMILY:SCOP_ID

P-loop containing nucleoside triphosphate hydrolases SUPERFAMILY:SCOP_Fold

81269 SUPERFAMILY:Family_ID

7.33e-06 SUPERFAMILY:Evalue

Extended AAA-ATPase domain SUPERFAMILY:Family_Description

1l8q A:77-289 SUPERFAMILY:Similar_Structure

http://supfam.cs.bris.ac.uk

Data sources: Genome annotations - KEGG

http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010

http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298

http://img.jgi.doe.gov/schema/gene

glpX

http://img.jgi.doe.gov/schema/gene_name

fructose

rdfs:comment

http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]

rdfs:seeAlso

http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]

rdfs:seeAlso

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616

RDF:description

RDF:type

YP_897666.1

RDF:id symbol

http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e

RDFS:seeAlso

chromosomal

http://purl.uniprot.org/Annotation

Genome annotations - NCBI protein

http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112

http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida

Data sources: Genome annotations - GO

http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277

RDF:description, RDF:type

http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749 mgla:GO_AnnotationID

glutathione mgla:GO_AnnotationTerm

biological_process mgla:GO_AnnotationOntology

7 mgla:GO_AnnotationLevel

0.879989490261963 http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore

57273821328517 http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore

Poson annotations - COGs

https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435

http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508 mgla:cogNumber

AceF mgla:cogDomain

Pyruvate/2-oxoglutarate mgla:cogDescription

dihydrolipoamide mgla:cogCategory

Data sources - experiments: Transcriptomics

Data sources - experiments: Proteomics

Proteomics: WT vs MglA Mutant

Francisella tularensis novicida U112

[Figure: experimental design]

WildType: Whole Cell (3), Soluble (3), Membrane (3)

MglA mutant: Whole Cell (3), Soluble (3), Membrane (3)

(4) (4) (4) (4) (4) (4)

Sequest, DRAGON (under each of the six fractions)

Relative Abundance, Identification

Two-sided t-test

P val < 0.01

RDF - excel conversion

[Figure: RDF graph of the converted spreadsheet. Labels: Identified Peptide, analysis, Peptide sequence, mgla:poson, mgla:experiment, abundance, Pval, Pval-1, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP, Genome, rdfs:seeAlso. Legend: subject, predicate, object]

Data integration: Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2

(WashU-B) PSNV3

(COGs) COGID

(Gene Ontology) GOID

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo

(ENZYME) ECNo

(IMG) GENEID

(NCBI) PROTEINID
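Reconciling these identifiers amounts to following rdfs:seeAlso links from one identifier to the next. The sketch below illustrates that idea in Python under stated assumptions: the links are held as an in-memory list of pairs, the `reachable` helper is hypothetical, and every identifier value is a made-up placeholder rather than a record from the Francisella data.

```python
from collections import deque

# Hypothetical rdfs:seeAlso links (source identifier -> target identifier).
SEE_ALSO = [
    ("PSN:0001", "PSNV2:0001"),
    ("PSNV2:0001", "PSNV3:0001"),
    ("PSNV3:0001", "FTN:0001"),
    ("FTN:0001", "EC:0.0.0.0"),
    ("FTN:0001", "GO:0000000"),
]


def reachable(start, links):
    """Breadth-first walk over seeAlso links from one identifier."""
    outgoing = {}
    for source, target in links:
        outgoing.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in outgoing.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)  # report only the other identifiers
    return seen


aliases = reachable("PSN:0001", SEE_ALSO)
print(sorted(aliases))
```

Because the links are directed, starting from an annotation identifier such as an EC number returns nothing here; a store that needs lookups in both directions would have to add the reverse links or query both ways.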

Data Integration: Adding new experiments

[Figure: Experiment 1, Experiment 2, Experiment 3 and Experiment 4 shown with the identifier graph (PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, AC No, linked by rdfs:seeAlso) and public domain data]

Data integration: Sesame

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc, posc)

Data File                                time (s)    triples
francisella_locus_tag.nt                 8.93        1767
interact-prot.nt                         88.51       20682
interact-prot-peptides.nt                            248647
mgla search dbfastablastp4 ypURL.n3      9.7         1719
NC_008601.nt                             43.14       12781
Ft_novicidaU112go.nt                     359.14      2548
francisellardf2.nt                       43.41       10434
francisellaSUPERFAMILY.nt                57.88       16110
francisellaPROTEINfasta.nt               13.63       5160
Soluble.nt                               588.87      336761
WholeCell.nt                             469.02      112625
Membranes.nt                             1003.19     298771

Data Integration: Mgla data (ftnRepoNative)

[Figure: RDF graph. Labels: Identified Peptide, analysis, Peptide sequence, Experiment, mgla:poson, abundance, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP, rdfs:seeAlso]

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>
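The first query is a three-way join: an analysis points to a poson, the poson points to an FTN identifier, and the FTN identifier points to an EC link. As a rough illustration of what the store has to do, the join can be sketched over an in-memory triple list in Python. The triples, the `objects` and `posons_with_ec` helpers and all identifier values are hypothetical, and the substring test only approximates the LIKE filter.

```python
SEE_ALSO = "rdfs:seeAlso"
POSON = "mgla:poson"

# Hypothetical triples; none of these identifiers come from the real dataset.
TRIPLES = [
    ("analysis1", POSON, "PSN:0001"),
    ("PSN:0001", SEE_ALSO, "FTN:0001"),
    ("FTN:0001", SEE_ALSO, "wgetz?-e+[EC:0.0.0.0]"),
    ("FTN:0001", SEE_ALSO, "GO:0000000"),
    ("analysis2", POSON, "PSN:0002"),
    ("PSN:0002", SEE_ALSO, "FTN:0002"),
]


def objects(subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def posons_with_ec():
    """Rows (psn, ftn, ec) for posons whose FTN entry links to an EC record."""
    rows = []
    for _analysis, predicate, psn in TRIPLES:
        if predicate != POSON:
            continue
        for ftn in objects(psn, SEE_ALSO):
            for ec in objects(ftn, SEE_ALSO):
                if "[EC" in ec:  # plays the role of the LIKE filter
                    rows.append((psn, ftn, ec))
    return rows


rows = posons_with_ec()
print(rows)
```

The nested loops make the cost of each extra path expression visible; a real triple store answers the same pattern through its indexes rather than by scanning.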

Data Integration: Mgla data (ftnRepoNative)

[Figure: RDF graph. Labels: Identified Peptide, analysis, Peptide sequence, mgla:poson, mgla:sequence, mgla:experiment, rdf:about, abundance, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP, rdfs:seeAlso]

Really easy! But:

• Simple excel to RDF conversion does not enable all queries

• Not a simple conversion - Data needs to be "modelled"

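[Figure: RDF graph from the simple conversion. Labels: Identified Peptide, analysis, Peptide sequence, mgla:poson, mgla:sequence, mgla:experiment, rdf:about, abundance, PSN]

[Figure: modelled data. Labels: Peptide Sequence, Experiment Replicate, abundance, identifiedIn, hasAbundance]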
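Why a cell-by-cell conversion falls short can be seen with a peptide measured in two replicates. The sketch below is an illustration only: the peptide, replicate names and abundance values are invented, and the "observation node" model is one possible way to tie a replicate to its abundance (the talk itself goes on to use reified statements).

```python
# Hypothetical peptide measured in two replicates; values are made up.
ROWS = [
    {"peptide": "PEPTIDEK", "replicate": "wt_wc_01", "abundance": 2594},
    {"peptide": "PEPTIDEK", "replicate": "wt_wc_02", "abundance": 310},
]


def naive_triples(rows):
    """One triple per spreadsheet cell, all hanging off the peptide."""
    triples = set()
    for row in rows:
        triples.add((row["peptide"], "identifiedIn", row["replicate"]))
        triples.add((row["peptide"], "hasAbundance", row["abundance"]))
    return triples


def modelled_triples(rows):
    """One observation node per row ties replicate and abundance together."""
    triples = set()
    for index, row in enumerate(rows):
        obs = f"observation{index}"
        triples.add((obs, "peptide", row["peptide"]))
        triples.add((obs, "identifiedIn", row["replicate"]))
        triples.add((obs, "hasAbundance", row["abundance"]))
    return triples


def abundance_in(triples, replicate):
    """Abundances that can be attributed to one replicate."""
    subjects = {s for s, p, o in triples if p == "identifiedIn" and o == replicate}
    return sorted(o for s, p, o in triples if p == "hasAbundance" and s in subjects)


print(abundance_in(naive_triples(ROWS), "wt_wc_01"))     # both values: ambiguous
print(abundance_in(modelled_triples(ROWS), "wt_wc_01"))  # only the matching value
```

In the naive graph both abundance values attach to the same peptide node, so the question "what was the abundance in this replicate" has no unique answer.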

Data Integration: Reified statements

[Figure: RDF graph with reified statements. Labels: Identified Peptide, analysis, Peptide sequence, Experiment Replicate, mgla:poson, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP, rdfs:seeAlso, analysis data, rdf:type, rdf:Statement, rdf:subject, rdf:predicate, rdf:object, InExperimentReplicate, abundance, mgla:PeptideAbundance]

Sesame: Reified Data load - native-RDFS (spoc, posc, posc)

Data File                         time (s)     time (mins)   triples
FnU112Version3.nt                 383.44       6.3           58474
PosonMappings.nt                  84.56        1.4           13760
francisella_locus_tag.nt          16.73        0.3           1767
ConstructHasGeneID.nt             23.00        0.4           1719
interact-prot.nt                  124.95       2.1           20682
interact-prot-pepteides.nt        1127.97      18.7          248647
interact-protSeeAlsoisbURL.nt     10.67        0.2           1528
goAnnotation_URLID.nt             74.14        1.2           20501
NC_008601.nt                      75.84        1.3           12781
Membranes_CogNumberURL.nt         8.60         0.1           2548
Ft_novicida_U112_go.nt            561.38       9.3           2548
francisellardf2.nt                46.19        0.8           10602
francisellaSUPERFAMILY.nt         66.67        1.1           16110
francisellaPROTEINfasta.nt        15.27        0.3           5160
SolubleReifeid_3.rdf              1392.98      23.2          580873
WholeCellReified_3.rdf            941.16       15.6          184221
Membranes_3.rdf                   1026.66      17.111        416086
fnU112_draftRDFschemaV4.nt        21.501098    0.35835       501

select ftn, psn, exp, abundance
from {psn} rdfs:seeAlso {psnv2},
     {psnv2} rdfs:seeAlso {psnv3},
     {psnv3} rdfs:seeAlso {ftn},
     {analysis} fnu112:poson {psn},
     {analysis} rdf:type {rdf:Statement},
     {analysis} rdf:object {exp},
     {analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
and ftn LIKE "*FTN*"
using namespace
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>,
     fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries: which posons have the most highly abundant peptides?

Queries: which experiments have the most highly abundant peptides?

Reified statements

• Reified mgla data are much bigger (4 more statements per abundance)

• The really interesting queries return a Java out of memory error (-Xms1024M -Xmx1536M)

• Haven't yet tested the shortcut path expression:

{{reifSubj} reifPred {reifObj}} pred {obj}

{{seq} identifiedIn {ExpRep}} hasAbundance {abd}

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .
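The cost of reification is easy to see when the statements for one measurement are generated programmatically. The following is a minimal sketch in Python, assuming a hypothetical `http://example.org/mgla#` namespace and a hypothetical `reify_abundance` helper; the local names and the value 2594 are taken from the example above.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace standing in for the mgla schema used in the talk.
MGLA = "http://example.org/mgla#"


def reify_abundance(statement_id, peptide, replicate, abundance):
    """Return the N-Triples lines describing one reified abundance measurement."""
    stmt = f"<{MGLA}{statement_id}>"
    return [
        # Four statements describe the triple (peptide InExperimentReplicate replicate).
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{MGLA}{peptide}> .",
        f"{stmt} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{stmt} <{RDF}object> <{MGLA}{replicate}> .",
        # One statement attaches the abundance to that reified triple.
        f'{stmt} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


lines = reify_abundance(
    "WholeCell_Lvl7_0212", "WholeCell_Lvl7_021", "wildtype01_wc_01", 2594
)
for line in lines:
    print(line)
```

Every abundance value therefore costs five triples instead of one, which is consistent with the growth in triple counts between the plain and reified loads.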

[Figure labels: Peptide Sequence, Experiment Replicate, abundance, identifiedIn, hasAbundance]

Comparison of integrated experimental data: Distinct and overlapping posons identified within each biological fraction (> 20000)

[Venn diagram: sol and mem, with regions sol MINUS mem, INTERSECT and mem MINUS sol. Counts as extracted: 171, 146, 185]

INTERSECT:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
using namespace
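The INTERSECT and MINUS comparisons correspond directly to set intersection and set difference. A small sketch in Python makes the three Venn regions explicit; the observations, the `posons` helper and the replicate names containing "sol" and "mem" are hypothetical, with only the 20000 threshold taken from the slide.

```python
THRESHOLD = 20000

# Hypothetical observations: (poson, experiment replicate, abundance).
OBSERVATIONS = [
    ("PSN:0001", "wildtype01_sol_01", 25000),
    ("PSN:0001", "wildtype01_mem_01", 31000),
    ("PSN:0002", "wildtype01_sol_02", 45000),
    ("PSN:0003", "wildtype01_mem_01", 22000),
    ("PSN:0004", "wildtype01_sol_01", 1500),
]


def posons(fraction, threshold=THRESHOLD):
    """Distinct posons seen above the threshold in replicates of one fraction."""
    return {
        psn
        for psn, experiment, abundance in OBSERVATIONS
        if fraction in experiment and abundance > threshold
    }


sol = posons("sol")
mem = posons("mem")
both = sol & mem       # INTERSECT
sol_only = sol - mem   # sol MINUS mem
mem_only = mem - sol   # mem MINUS sol
print(sorted(both), sorted(sol_only), sorted(mem_only))
```

The three result sets are disjoint and together cover every poson above the threshold, which is what lets the counts be drawn as a two-circle Venn diagram.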

Comparison of integrated experimental data: Distinct and overlapping posons identified within each biological fraction (< 5000)

[Venn diagram: sol and mem, with regions sol MINUS mem, INTERSECT and mem MINUS sol. Counts as extracted: 219, 125, 245]

INTERSECT:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
using namespace

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding transcriptomic experiment: Wt vs mglA mutant (GEO AC: GSE5468)

• RDF-S inferencing

Acknowledgements

• Funding: BBSRC - Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow

• Prof Walter Kolch

• Dr Andy Pitt

• University of Strathclyde, Glasgow

• Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle

• Prof Dave Goodlett (Scientific Advisor)

• Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer

• Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
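Until such aggregates are available in the query language, the thresholds can be computed on the client after fetching the abundance values. A minimal sketch in Python, assuming the values have already been retrieved into a list; the numbers and the `split_by_median` helper are hypothetical.

```python
from statistics import median

# Hypothetical abundance values, e.g. the result column of a query.
ABUNDANCES = [120, 950, 2594, 4800, 21000, 98000, 150000]


def split_by_median(values):
    """Return the median plus the low and high halves (both bounds inclusive)."""
    mid = median(values)
    lowest, highest = min(values), max(values)
    low = [v for v in values if lowest <= v <= mid]    # BETWEEN MIN AND MEDIAN
    high = [v for v in values if mid <= v <= highest]  # BETWEEN MEDIAN AND MAX
    return mid, low, high


mid, low, high = split_by_median(ABUNDANCES)
print(mid, low, high)
```

Because BETWEEN is inclusive at both ends, a value equal to the median falls into both halves, as it would with the wished-for WHERE clauses.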

Page 2: Genome and Proteome data integration in RDF

Outline

bull Data Integration in Bioinformatics

bull Semantic data integration

bull Francisella

bull Integrating genome annotations with experimental proteomics data in RDF

bull Further work

Data Integration is not a solved problem

Information discovery is not Integrated

ProteomicsPeptide Profiles

Peptide AbundanceProtein IdentificationProtein Interactions

PT-ModificationsLIMS

Gene ExpressionTranscript Profile

Transcript Abundance

LIMS

GenomicsSequence

ORF PredictionGenome

Comparisons

LIMS

Genome Metabolic Pathways

Microarrayexperiments

Computationalanalysis Systems Biology

Synthetic NetworksPathways

Predictions

MetabolomicsLIMS

Translational Medicine

Regulatory Networks

Proteomicsexperiments

Computationalanalysis

High TPSequencing

Semantic Data Integration across omes data silos

Data Information Genes Transcripts Peptides Metabolites Genotype

Data Discovery

Proof of conceptFrancisella tularensis

ulceroglandular tularaemia

respiratory tularaemia

oculoglandular tularaemia

Bioterrorism

bull Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tulareamia is the most acute form of the disease)

bull low infectious dose (10-50 bacterium compared to anthrax which requires 8000-15000 spores)

bull weaponisation fears

Data sourcesGenome

RDF

(4)IMGgene_oid=639752258 FTN_0209 (3)IMG_Slocus_tag

229107

(3)IMG_Sgenomic_location_start

229976

(3)IMG_Sgenomic_location_end

+

(3)IMG_Sgenomic_location_strand

TPR

(2)RDFScomment

RDFdescription

(1)RDFtype

httpimgjgidoegovcgi-binpubmaincgisection=TaxonDetailamppage=taxonDetailamptaxon_oid=639633024export

Data sourcesGenome annotations

Francisella SuperFamily Data

httpwwwncbinlmnihgoventrezviewerfcgidb=proteinampid=118496616

RDFdescriptionRDFtype

SUPERFAMILYcgi-binmodelcgimodel=0040419httppurluniprotorgcoreProtein_Family

155-367SUPERFAMILYAssignment_Region

51e-39SUPERFAMILYScore

SUPERFAMILYcgi-binscopcgisunid=52540SUPERFAMILYSCOP_ID

P-loop containing nucleoside triphosphate hydrolases

SUPERFAMILYSCOP_Fold

81269

SUPERFAMILYFamily_ID

733e-06

SUPERFAMILYEvalue

Extended AAA-ATPase domain

SUPERFAMILYFamily_Description

1l8q A77-289

SUPERFAMILYSimilar_Structure

httpsupfamcsbrisacuk

Data sourcesGenome annotations - KEGG

httpwwwgenomejpdbget-binwww_bgetpathway+ftn00010

httpwwwgenomejpdbget-binwww_bgetftnFTN_0298

httpimgjgidoegovschemagene

glpX

httpimgjgidoegovschemagene_name

fructose

rdfscomment

httpsrsebiacuksrsbincgi-binwgetz-e+[EC31311]

rdfsseeAlso

httpsrsebiacuksrsbincgi-binwgetz-e+[SPA0Q4N9_FRATN]

rdfsseeAlso

httpwwwncbinlmnihgoventrezviewerfcgidb=proteinampid=118496616

RDFdescription

RDFtype

YP_8976661

RDFidsymbol

httpsrsebiacuksrsbincgi-binwgetz[refseqp-SeqVersionYP_8976661]+-e

RDFSseeAlso

chromosomal

httppurluniprotorgAnnotation

Genome annotations - NCBI protein

httpwwwgenomejpdbget-binwww_bfindFtularensis_U112

httpwwwncbinlmnihgovsitesgqueryterm=Francisella+tularensis+novicida

Data sourcesGenome annotations - GO

httpwwwgenomejpdbget-binwww_bgetftnFTN_0277

RDFdescriptionRDFtype

httpamigogeneontologyorgcgi-binamigogocgiview=detailsampquery=0006749mglaGO_AnnotationID

glutathionemglaGO_AnnotationTerm

biological_processmglaGO_AnnotationOntology

7

mglaGO_AnnotationLevel

0879989490261963

httpwwwcompbiodundeeacukSoftwareGOtchaiscore

57273821328517

httpwwwcompbiodundeeacukSoftwareGOtchacscore

Poson annotations - Cogs

httpstoolsnwrceorgcgi-binfnu112posoncgiposon=PSN082435

httpwwwncbinlmnihgovsitesentrezdb=cddampcmd=searchampterm=COG0508mglacogNumber

AceFmglacogDomain

Pyruvate2-oxoglutarate

mglacogDescription

dihydrolipoamide

mglacogCategory

Data sources - experimentsTranscriptomics

Data sources - experimentsProteomics

Proteomics WT vs Mgla Mutant

Francisella tularensis novicida U112

Whole Cell(3)

Soluble(3)

Membrane(3)

Whole Cell(3)

Soluble(3)

Membrane(3)

WildType MglA mutant

(4) (4) (4) (4) (4) (4)

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Relative AbundanceIdentification

Two-sided t-test

P val lt001

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

RDF - excel conversion

Genome

mglaexperiment

subject

object

predicate

Pval

Pval-1

Data integration Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2(COGs) COGID

(Gene Ontology) GOID

(WashU-B) PSNV3

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo(ENZYME) ECNo

(IMG) GENEID(NCBI) PROTEINID

Data IntegrationAdding new experiments

Experiment 1

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECAC No

Experiment 4

Experiment 2

Experiment 3

Public domain data

NadiaAnwar~ nadia$ openrdf-sesame-21binconsolesh Connected to default data directory

Commands end with at the end of a lineType help for helpgt connect http1270018080openrdf-sesameDisconnecting from default data directoryConnected to http1270018080openrdf-sesamegt show r+----------|SYSTEM (System configuration repository)|ftnRepoNative (Francisella Test)|FrancisellaNative (FrancisellaTestStore)|FrancisellaReified (Native store with RDF Schema inferencing)|FrancisellaReified_index2 (Native store with RDF Schema inferencing)|Francisella (Native store with RDF Schema inferencing)+----------gt open FrancisellaReified_index2Opened repository FrancisellaReified_index2

Data integration Sesame

SesameData load (ftnRepoNative) - native (spocposc)

Data File time (s) triples

francisella_locus_tagnt 893 1767

interact-protnt 8851 20682

interact-prot-peptidesnt 248647

mgla search dbfastablastp4 ypURLn3 97 1719

NC_008601nt 4314 12781

Ft_novicidaU112gont 35914 2548

francisellardf2nt 4341 10434

francisellaSUPERFAMILYnt 5788 16110

francisellaPROTEINfastant 1363 5160

Solublent 58887 336761

WholeCellnt 46902 112625

Membranesnt 100319 298771

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

Experiment

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

SELECT psn ftn ec FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

SELECT abundance psn ec ftn FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnanalysis mglaexperiment abundanceWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

mglasequencemglaexperiment

rdfabout

Really easy Butbull Simple excel to RDF conversion does not enable all queries

bull Not a simple conversion - Data needs to be ldquomodelledrdquo

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNmglasequence

mglaexperiment

rdfabout

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

Data IntegrationReified statements

Identified Peptideanalysis

Peptide sequence

Experiment Replicate

rdftype

mglaposon

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

analysis datardfStatement

analysis data

InExperimentReplicate

rdfobject

rdftype

rdfsubject

rdfpredicateabundance

mglaPeptideAbundance

DDBID

rdfsseeAlso

GO ECSP

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (gt20000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solusing namespace

171 146185

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (lt5000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solusing namespace

219 125245

Further work

bull Queries are slow in the native repository database repositories are probably faster

bull Adding transcriptomic experiment

Wt Vs mglA mutant

GEO AC GSE5468

bull RDF-S inferencing

Acknowledgements

bull Funding BBSRC -Radical Solutions for Researching the Proteome

bull University of Glasgow Glasgow

bull Prof Walter Kolch

bull Dr Andy Pitt

bull University of Strathclyde Glasgow

bull Dr Ela Hunt (Scientific Advisor)

bull University of Washington Seattle

bull Prof Dave Goodlett (Scientific Advisor)

bull Dr Mitch Brittnacher Mathew Radey Laurence Rohmer

bull Dr Tina Guina (MglA experiment)

Abundance thresholdsbull SeRQL aggregate functions would be nice to have

bull Queries to find low and high abundance values

bull WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

bull WHERE abundance BETWEEN MIN(abundance) and MEDIAN(abundance)

Page 3: Genome and Proteome data integration in RDF

Data Integration is not a solved problem

Information discovery is not Integrated

ProteomicsPeptide Profiles

Peptide AbundanceProtein IdentificationProtein Interactions

PT-ModificationsLIMS

Gene ExpressionTranscript Profile

Transcript Abundance

LIMS

GenomicsSequence

ORF PredictionGenome

Comparisons

LIMS

Genome Metabolic Pathways

Microarrayexperiments

Computationalanalysis Systems Biology

Synthetic NetworksPathways

Predictions

MetabolomicsLIMS

Translational Medicine

Regulatory Networks

Proteomicsexperiments

Computationalanalysis

High TPSequencing

Semantic Data Integration across omes data silos

Data Information Genes Transcripts Peptides Metabolites Genotype

Data Discovery

Proof of conceptFrancisella tularensis

ulceroglandular tularaemia

respiratory tularaemia

oculoglandular tularaemia

Bioterrorism

bull Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tulareamia is the most acute form of the disease)

bull low infectious dose (10-50 bacterium compared to anthrax which requires 8000-15000 spores)

bull weaponisation fears

Data sourcesGenome

RDF

(4)IMGgene_oid=639752258 FTN_0209 (3)IMG_Slocus_tag

229107

(3)IMG_Sgenomic_location_start

229976

(3)IMG_Sgenomic_location_end

+

(3)IMG_Sgenomic_location_strand

TPR

(2)RDFScomment

RDFdescription

(1)RDFtype

httpimgjgidoegovcgi-binpubmaincgisection=TaxonDetailamppage=taxonDetailamptaxon_oid=639633024export

Data sourcesGenome annotations

Francisella SuperFamily Data

httpwwwncbinlmnihgoventrezviewerfcgidb=proteinampid=118496616

RDFdescriptionRDFtype

SUPERFAMILYcgi-binmodelcgimodel=0040419httppurluniprotorgcoreProtein_Family

155-367SUPERFAMILYAssignment_Region

51e-39SUPERFAMILYScore

SUPERFAMILYcgi-binscopcgisunid=52540SUPERFAMILYSCOP_ID

P-loop containing nucleoside triphosphate hydrolases

SUPERFAMILYSCOP_Fold

81269

SUPERFAMILYFamily_ID

733e-06

SUPERFAMILYEvalue

Extended AAA-ATPase domain

SUPERFAMILYFamily_Description

1l8q A77-289

SUPERFAMILYSimilar_Structure

httpsupfamcsbrisacuk

Data sourcesGenome annotations - KEGG

httpwwwgenomejpdbget-binwww_bgetpathway+ftn00010

httpwwwgenomejpdbget-binwww_bgetftnFTN_0298

httpimgjgidoegovschemagene

glpX

httpimgjgidoegovschemagene_name

fructose

rdfscomment

httpsrsebiacuksrsbincgi-binwgetz-e+[EC31311]

rdfsseeAlso

httpsrsebiacuksrsbincgi-binwgetz-e+[SPA0Q4N9_FRATN]

rdfsseeAlso

httpwwwncbinlmnihgoventrezviewerfcgidb=proteinampid=118496616

RDFdescription

RDFtype

YP_8976661

RDFidsymbol

httpsrsebiacuksrsbincgi-binwgetz[refseqp-SeqVersionYP_8976661]+-e

RDFSseeAlso

chromosomal

httppurluniprotorgAnnotation

Genome annotations - NCBI protein

httpwwwgenomejpdbget-binwww_bfindFtularensis_U112

httpwwwncbinlmnihgovsitesgqueryterm=Francisella+tularensis+novicida

Data sourcesGenome annotations - GO

httpwwwgenomejpdbget-binwww_bgetftnFTN_0277

RDFdescriptionRDFtype

httpamigogeneontologyorgcgi-binamigogocgiview=detailsampquery=0006749mglaGO_AnnotationID

glutathionemglaGO_AnnotationTerm

biological_processmglaGO_AnnotationOntology

7

mglaGO_AnnotationLevel

0879989490261963

httpwwwcompbiodundeeacukSoftwareGOtchaiscore

57273821328517

httpwwwcompbiodundeeacukSoftwareGOtchacscore

Poson annotations - COGs

https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435

Graph labels (value, then property):

http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508    mgla:cogNumber
AceF    mgla:cogDomain
Pyruvate/2-oxoglutarate    mgla:cogDescription
dihydrolipoamide    mgla:cogCategory

Data sources - experiments: Transcriptomics

Data sources - experiments: Proteomics

Proteomics: WT vs MglA mutant

Francisella tularensis novicida U112

WildType: Whole Cell (3), Soluble (3), Membrane (3)
MglA mutant: Whole Cell (3), Soluble (3), Membrane (3)
Each fraction: (4), analysed with Sequest and DRAGON
Outputs: Identification, Relative Abundance
Two-sided t-test, P val < 0.01

[Diagram: RDF graph of the proteomics data. An Identified Peptide "analysis" node has a Peptide sequence and an abundance, and is linked by mgla:poson to a poson (PSN); PSN, PSNV2, PSNV3, FTN and DDBID are linked by rdfs:seeAlso, leading to GO, EC and SP annotations.]

RDF - Excel conversion

[Diagram labels: subject, predicate, object; Genome; mgla:experiment; Pval; Pval-1]
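The conversion from spreadsheet rows to subject, predicate, object statements can be illustrated with a minimal Python sketch. It is not the converter used for these slides: the namespace `http://example.org/mgla#`, the column names and the sample row are all hypothetical, and every cell is simply written as a plain literal.

```python
import csv
import io

# Hypothetical namespace, used only for illustration.
MGLA = "http://example.org/mgla#"


def nt_literal(value):
    """Quote a plain literal for N-Triples, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(row_id, row):
    """Turn one spreadsheet row into N-Triples lines, one triple per column."""
    subject = f"<{MGLA}{row_id}>"
    return [
        f"{subject} <{MGLA}{column}> {nt_literal(value)} ."
        for column, value in row.items()
        if value != ""  # skip empty cells
    ]


def sheet_to_ntriples(csv_text, id_column):
    """Convert CSV text (for example an exported Excel sheet) to N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_id = row.pop(id_column)  # the identifier column becomes the subject
        lines.extend(row_to_triples(row_id, row))
    return lines


if __name__ == "__main__":
    # Hypothetical sheet: one identified peptide with its poson and abundance.
    sheet = "peptide_id,sequence,poson,abundance\npep1,AEK,PSN0001,2594\n"
    for line in sheet_to_ntriples(sheet, "peptide_id"):
        print(line)
```

A row-per-subject conversion like this is quick to write, but it flattens every column into a property of the row, which is why some questions about the data cannot be asked of the result without further modelling.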

Data integration: Reconciled identifiers

(WashU-B) PSNV1
(WashU-B) PSNV2
(COGs) COGID
(Gene Ontology) GOID
(WashU-B) PSNV3
(Fn ORF ID) FTN
(WashU-P) DDB
(Refseq) ACNo
(Uniprot) ACNo
(ENZYME) ECNo
(IMG) GENEID
(NCBI) PROTEINID
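One way to picture identifier reconciliation is as a walk along rdfs:seeAlso links from one identifier to all the others it maps to. The Python sketch below works on in-memory triples rather than a Sesame repository, and the identifiers in the example are made up for illustration.

```python
from collections import defaultdict

SEE_ALSO = "rdfs:seeAlso"


def build_index(triples):
    """Index rdfs:seeAlso links by subject; other predicates are ignored."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == SEE_ALSO:
            index[subject].add(obj)
    return index


def reachable(index, start):
    """Return every identifier reachable from start by following seeAlso links."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for target in index.get(node, ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


if __name__ == "__main__":
    # Hypothetical identifiers: a poson mapped through its versions to an ORF and an EC number.
    triples = [
        ("PSN_v1", SEE_ALSO, "PSN_v2"),
        ("PSN_v2", SEE_ALSO, "PSN_v3"),
        ("PSN_v3", SEE_ALSO, "FTN_x"),
        ("FTN_x", SEE_ALSO, "EC_x"),
    ]
    print(sorted(reachable(build_index(triples), "PSN_v1")))
```

In a triple store the same walk is expressed as a chain of path expressions, one per seeAlso hop, so each extra identifier scheme adds one more join to the query.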

Data Integration: Adding new experiments

[Diagram: Experiments 1 to 4 and public domain data drawn around the shared identifier graph: PSN, PSNV2, PSNV3, FTN and DDBID linked by rdfs:seeAlso, with GO, EC and AC No.]

Data integration: Sesame

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc, posc)

Data file                               time (s)    triples
francisella_locus_tag.nt                    8.93       1767
interact-prot.nt                           88.51      20682
interact-prot-peptides.nt                            248647
mgla search dbfastablastp4 ypURL.n3          9.7       1719
NC_008601.nt                               43.14      12781
Ft_novicidaU112go.nt                      359.14       2548
francisellardf2.nt                         43.41      10434
francisellaSUPERFAMILY.nt                  57.88      16110
francisellaPROTEINfasta.nt                 13.63       5160
Soluble.nt                                588.87     336761
WholeCell.nt                              469.02     112625
Membranes.nt                             1003.19     298771

Data Integration: Mgla data (ftnRepoNative)

[Diagram: an Identified Peptide "analysis" node with a Peptide sequence, an Experiment and an abundance, linked by mgla:poson to PSN; PSN, PSNV2, PSNV3, FTN and DDBID are linked by rdfs:seeAlso, leading to GO, EC and SP annotations.]

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

Data Integration: Mgla data (ftnRepoNative)

[Diagram: the Identified Peptide graph with its edges labelled: rdf:about (analysis node), mgla:sequence (Peptide sequence), mgla:experiment (abundance) and mgla:poson (PSN); PSN, PSNV2, PSNV3, FTN and DDBID are linked by rdfs:seeAlso to GO, EC and SP.]

Really easy, but:

• Simple Excel-to-RDF conversion does not enable all queries

• Not a simple conversion: the data needs to be "modelled"

[Diagram: in the simple conversion, the Identified Peptide analysis node (rdf:about) carries mgla:sequence (Peptide sequence), mgla:experiment (abundance) and mgla:poson (PSN). In the modelled version, a Peptide Sequence is identifiedIn an Experiment Replicate, and that statement hasAbundance an abundance.]

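Data Integration: Reified statements

[Diagram: the statement "Peptide sequence InExperimentReplicate Experiment Replicate" is reified as an "analysis data" node of rdf:type rdf:Statement, with rdf:subject (Peptide sequence), rdf:predicate (InExperimentReplicate) and rdf:object (Experiment Replicate); the abundance is attached to that node with mgla:PeptideAbundance. The Identified Peptide analysis is linked by mgla:poson to PSN, and PSN, PSNV2, PSNV3, FTN and DDBID are linked by rdfs:seeAlso to GO, EC and SP.]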

Sesame: Reified data load - native-RDFS (spoc, posc, posc)

Data file                         time (s)    time (mins)    triples
FnU112Version3.nt                   383.44        6.3          58474
PosonMappings.nt                     84.56        1.4          13760
francisella_locus_tag.nt             16.73        0.3           1767
ConstructHasGeneID.nt                23.00        0.4           1719
interact-prot.nt                    124.95        2.1          20682
interact-prot-pepteides.nt         1127.97       18.7         248647
interact-protSeeAlsoisbURL.nt        10.67        0.2           1528
goAnnotation_URLID.nt                74.14        1.2          20501
NC_008601.nt                         75.84        1.3          12781
Membranes_CogNumberURL.nt             8.60        0.1           2548
Ft_novicida_U112_go.nt              561.38        9.3           2548
francisellardf2.nt                   46.19        0.8          10602
francisellaSUPERFAMILY.nt            66.67        1.1          16110
francisellaPROTEINfasta.nt           15.27        0.3           5160
SolubleReifeid_3.rdf               1392.98       23.2         580873
WholeCellReified_3.rdf              941.16       15.6         184221
Membranes_3.rdf                    1026.66       17.111       416086
fnU112_draftRDFschemaV4.nt        21501.098     358.35           501

select ftn, psn, exp, abundance
from {psn} rdfs:seeAlso {psnv2},
     {psnv2} rdfs:seeAlso {psnv3},
     {psnv3} rdfs:seeAlso {ftn},
     {analysis} fnu112:poson {psn},
     {analysis} rdf:type {rdf:Statement},
     {analysis} rdf:object {exp},
     {analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
  and ftn LIKE "*FTN*"
using namespace
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>,
     fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries: which posons have the most highly abundant peptides?

Queries: which experiments have the most highly abundant peptides?

Reified statements

• Reified mgla data are much bigger (4 more statements per abundance)

• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)

• Haven't yet tested the shortcut path expression:

     { {reifSubj} reifPred {reifObj} } pred {obj}
     { {seq} identifiedIn {ExpRep} } hasAbundance {abd}

Example of one reified abundance statement:

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .

[Diagram: Peptide Sequence identifiedIn Experiment Replicate; that statement hasAbundance abundance.]
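The size increase from reification can be seen in a small Python sketch that emits the statements for a single abundance value: four reification triples plus the abundance itself. This is an illustration only, not the script used to build the repository; the `http://example.org/mgla#` namespace and the identifiers in the example are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_iri, subject, predicate, obj, annotations):
    """Return N-Triples lines that reify (subject, predicate, obj) as statement_iri
    and attach each annotation to the statement node as a plain literal."""
    node = f"<{statement_iri}>"
    lines = [
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{subject}> .",
        f"{node} <{RDF}predicate> <{predicate}> .",
        f"{node} <{RDF}object> <{obj}> .",
    ]
    for prop, value in annotations.items():
        lines.append(f'{node} <{prop}> "{value}" .')
    return lines


if __name__ == "__main__":
    ns = "http://example.org/mgla#"  # hypothetical namespace
    triples = reify(
        ns + "pep1_stmt",                 # statement node
        ns + "pep1",                      # peptide sequence
        ns + "InExperimentReplicate",     # reified predicate
        ns + "wildtype_rep1",             # experiment replicate
        {ns + "PeptideAbundance": 2594},  # value attached to the statement
    )
    print("\n".join(triples))
```

Each abundance value therefore costs five triples instead of one, which is consistent with the larger triple counts and longer load times of the reified files.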

Comparison of integrated experimental data

Distinct and overlapping posons identified within each biological fraction (abundance > 20000)

sol INTERSECT mem:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "*mem*"
using namespace ...

sol MINUS mem and mem MINUS sol: the same two sub-selects, combined with MINUS in the stated order.

Venn diagram counts (sol, mem): 171, 146, 185
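The INTERSECT and MINUS comparisons correspond to ordinary set operations on the posons found in each fraction. The Python sketch below shows that correspondence on in-memory records; the record layout (poson, experiment name, abundance), the substring test standing in for LIKE, and the sample values are all assumptions made for illustration.

```python
def posons_above(records, fraction, threshold):
    """Distinct posons with at least one peptide above threshold in a fraction.

    records is an iterable of (poson, experiment_name, abundance) tuples;
    a fraction matches when its tag occurs anywhere in the experiment name.
    """
    return {
        poson
        for poson, experiment, abundance in records
        if fraction in experiment and int(abundance) > threshold
    }


def compare_fractions(records, threshold):
    """Return the three regions of the sol/mem Venn diagram as sets of posons."""
    sol = posons_above(records, "sol", threshold)
    mem = posons_above(records, "mem", threshold)
    return {
        "sol MINUS mem": sol - mem,
        "INTERSECT": sol & mem,
        "mem MINUS sol": mem - sol,
    }


if __name__ == "__main__":
    # Hypothetical records, not taken from the experiment.
    records = [
        ("PSN_a", "wildtype01_sol_01", "25000"),
        ("PSN_b", "wildtype01_sol_01", "30000"),
        ("PSN_b", "wildtype01_mem_01", "41000"),
        ("PSN_c", "wildtype01_mem_02", "22000"),
        ("PSN_d", "wildtype01_mem_02", "1500"),
    ]
    for region, posons in compare_fractions(records, 20000).items():
        print(region, sorted(posons))
```

Computing each fraction's set once and combining the sets afterwards avoids repeating the same sub-select for every region of the diagram.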

Comparison of integrated experimental data

Distinct and overlapping posons identified within each biological fraction (abundance < 5000)

sol INTERSECT mem:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "*mem*"
using namespace ...

sol MINUS mem and mem MINUS sol: the same two sub-selects, combined with MINUS in the stated order.

Venn diagram counts (sol, mem): 219, 125, 245

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding a transcriptomic experiment: Wt vs mglA mutant (GEO AC: GSE5468)

• RDF-S inferencing

Acknowledgements

• Funding: BBSRC - Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow
   • Prof Walter Kolch
   • Dr Andy Pitt

• University of Strathclyde, Glasgow
   • Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle
   • Prof Dave Goodlett (Scientific Advisor)
   • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
   • Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values
   • WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
   • WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
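Until such aggregate functions are available in the query language, the median split can be done on the client after fetching the abundance values. The Python sketch below assumes the values have already been retrieved as strings or integers; the sample numbers are hypothetical.

```python
from statistics import median


def split_by_median(abundances):
    """Split abundance values into low and high groups around the median.

    Mirrors BETWEEN MIN AND MEDIAN / BETWEEN MEDIAN AND MAX: both ranges are
    inclusive, so a value equal to the median appears in both groups.
    """
    values = [int(a) for a in abundances]
    mid = median(values)
    low = [v for v in values if v <= mid]
    high = [v for v in values if v >= mid]
    return mid, low, high


if __name__ == "__main__":
    # Hypothetical abundance values, as they might come back from a query.
    mid, low, high = split_by_median(["100", "5000", "20000", "2594", "120000"])
    print("median:", mid)
    print("low:", low)
    print("high:", high)
```

A data-driven cut-off like the median avoids fixing thresholds such as 5000 or 20000 in the query text.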

Page 5: Genome and Proteome data integration in RDF

Semantic Data Integration across omes data silos

Data Information Genes Transcripts Peptides Metabolites Genotype

Data Discovery

Proof of conceptFrancisella tularensis

ulceroglandular tularaemia

respiratory tularaemia

oculoglandular tularaemia

Bioterrorism

bull Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tulareamia is the most acute form of the disease)

bull low infectious dose (10-50 bacterium compared to anthrax which requires 8000-15000 spores)

bull weaponisation fears

Data sourcesGenome

RDF

(4)IMGgene_oid=639752258 FTN_0209 (3)IMG_Slocus_tag

229107

(3)IMG_Sgenomic_location_start

229976

(3)IMG_Sgenomic_location_end

+

(3)IMG_Sgenomic_location_strand

TPR

(2)RDFScomment

RDFdescription

(1)RDFtype

httpimgjgidoegovcgi-binpubmaincgisection=TaxonDetailamppage=taxonDetailamptaxon_oid=639633024export

Data sourcesGenome annotations

Francisella SuperFamily Data

httpwwwncbinlmnihgoventrezviewerfcgidb=proteinampid=118496616

RDFdescriptionRDFtype

SUPERFAMILYcgi-binmodelcgimodel=0040419httppurluniprotorgcoreProtein_Family

155-367SUPERFAMILYAssignment_Region

51e-39SUPERFAMILYScore

SUPERFAMILYcgi-binscopcgisunid=52540SUPERFAMILYSCOP_ID

P-loop containing nucleoside triphosphate hydrolases

SUPERFAMILYSCOP_Fold

81269

SUPERFAMILYFamily_ID

733e-06

SUPERFAMILYEvalue

Extended AAA-ATPase domain

SUPERFAMILYFamily_Description

1l8q A77-289

SUPERFAMILYSimilar_Structure

httpsupfamcsbrisacuk

Data sourcesGenome annotations - KEGG

httpwwwgenomejpdbget-binwww_bgetpathway+ftn00010

httpwwwgenomejpdbget-binwww_bgetftnFTN_0298

httpimgjgidoegovschemagene

glpX

httpimgjgidoegovschemagene_name

fructose

rdfscomment

httpsrsebiacuksrsbincgi-binwgetz-e+[EC31311]

rdfsseeAlso

httpsrsebiacuksrsbincgi-binwgetz-e+[SPA0Q4N9_FRATN]

rdfsseeAlso

httpwwwncbinlmnihgoventrezviewerfcgidb=proteinampid=118496616

RDFdescription

RDFtype

YP_8976661

RDFidsymbol

httpsrsebiacuksrsbincgi-binwgetz[refseqp-SeqVersionYP_8976661]+-e

RDFSseeAlso

chromosomal

httppurluniprotorgAnnotation

Genome annotations - NCBI protein

httpwwwgenomejpdbget-binwww_bfindFtularensis_U112

httpwwwncbinlmnihgovsitesgqueryterm=Francisella+tularensis+novicida

Data sourcesGenome annotations - GO

httpwwwgenomejpdbget-binwww_bgetftnFTN_0277

RDFdescriptionRDFtype

httpamigogeneontologyorgcgi-binamigogocgiview=detailsampquery=0006749mglaGO_AnnotationID

glutathionemglaGO_AnnotationTerm

biological_processmglaGO_AnnotationOntology

7

mglaGO_AnnotationLevel

0879989490261963

httpwwwcompbiodundeeacukSoftwareGOtchaiscore

57273821328517

httpwwwcompbiodundeeacukSoftwareGOtchacscore

Poson annotations - Cogs

httpstoolsnwrceorgcgi-binfnu112posoncgiposon=PSN082435

httpwwwncbinlmnihgovsitesentrezdb=cddampcmd=searchampterm=COG0508mglacogNumber

AceFmglacogDomain

Pyruvate2-oxoglutarate

mglacogDescription

dihydrolipoamide

mglacogCategory

Data sources - experimentsTranscriptomics

Data sources - experimentsProteomics

Proteomics WT vs Mgla Mutant

Francisella tularensis novicida U112

Whole Cell(3)

Soluble(3)

Membrane(3)

Whole Cell(3)

Soluble(3)

Membrane(3)

WildType MglA mutant

(4) (4) (4) (4) (4) (4)

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Relative AbundanceIdentification

Two-sided t-test

P val lt001

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

RDF - excel conversion

Genome

mglaexperiment

subject

object

predicate

Pval

Pval-1

Data integration Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2(COGs) COGID

(Gene Ontology) GOID

(WashU-B) PSNV3

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo(ENZYME) ECNo

(IMG) GENEID(NCBI) PROTEINID

Data IntegrationAdding new experiments

Experiment 1

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECAC No

Experiment 4

Experiment 2

Experiment 3

Public domain data

NadiaAnwar~ nadia$ openrdf-sesame-21binconsolesh Connected to default data directory

Commands end with at the end of a lineType help for helpgt connect http1270018080openrdf-sesameDisconnecting from default data directoryConnected to http1270018080openrdf-sesamegt show r+----------|SYSTEM (System configuration repository)|ftnRepoNative (Francisella Test)|FrancisellaNative (FrancisellaTestStore)|FrancisellaReified (Native store with RDF Schema inferencing)|FrancisellaReified_index2 (Native store with RDF Schema inferencing)|Francisella (Native store with RDF Schema inferencing)+----------gt open FrancisellaReified_index2Opened repository FrancisellaReified_index2

Data integration Sesame

SesameData load (ftnRepoNative) - native (spocposc)

Data File time (s) triples

francisella_locus_tagnt 893 1767

interact-protnt 8851 20682

interact-prot-peptidesnt 248647

mgla search dbfastablastp4 ypURLn3 97 1719

NC_008601nt 4314 12781

Ft_novicidaU112gont 35914 2548

francisellardf2nt 4341 10434

francisellaSUPERFAMILYnt 5788 16110

francisellaPROTEINfastant 1363 5160

Solublent 58887 336761

WholeCellnt 46902 112625

Membranesnt 100319 298771

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

Experiment

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

SELECT psn ftn ec FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

SELECT abundance psn ec ftn FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnanalysis mglaexperiment abundanceWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

mglasequencemglaexperiment

rdfabout

Really easy Butbull Simple excel to RDF conversion does not enable all queries

bull Not a simple conversion - Data needs to be ldquomodelledrdquo

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNmglasequence

mglaexperiment

rdfabout

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

Data IntegrationReified statements

Identified Peptideanalysis

Peptide sequence

Experiment Replicate

rdftype

mglaposon

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

analysis datardfStatement

analysis data

InExperimentReplicate

rdfobject

rdftype

rdfsubject

rdfpredicateabundance

mglaPeptideAbundance

DDBID

rdfsseeAlso

GO ECSP

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Queries: which posons have the most highly abundant peptides?

Queries: which experiments have the most highly abundant peptides?

Reified statements

• Reified mgla data are much bigger (4 more statements per abundance)

• The really interesting queries fail with a Java out-of-memory error (-Xms1024M -Xmx1536M)

• Haven't yet tested the shortcut path expression

{ {reifSubj} reifPred {reifObj} } pred {obj}

{ {seq} identifiedIn {ExpRep} } hasAbundance {abd}
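What the shortcut path expression asks for can be illustrated outside Sesame. The sketch below is an assumption-laden stand-in, not SeRQL's implementation: `match_reified` is a hypothetical helper that scans an in-memory list of triples for reified statements with a given rdf:predicate and an extra property, and it assumes each statement node has a single value per property.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def match_reified(triples, reif_pred, pred):
    """Yield (reifSubj, reifObj, obj) for every reified statement whose
    rdf:predicate is reif_pred and which also carries the property pred."""
    by_node = {}
    for s, p, o in triples:
        # Assumes one value per property on each statement node.
        by_node.setdefault(s, {})[p] = o
    for props in by_node.values():
        if (props.get(RDF + "type") == RDF + "Statement"
                and props.get(RDF + "predicate") == reif_pred
                and RDF + "subject" in props
                and RDF + "object" in props
                and pred in props):
            yield props[RDF + "subject"], props[RDF + "object"], props[pred]


# Invented data: one reified "seqA identifiedIn rep1" with an abundance.
data = [
    ("s1", RDF + "type", RDF + "Statement"),
    ("s1", RDF + "subject", "seqA"),
    ("s1", RDF + "predicate", "identifiedIn"),
    ("s1", RDF + "object", "rep1"),
    ("s1", "hasAbundance", 2594),
]
rows = list(match_reified(data, "identifiedIn", "hasAbundance"))
print(rows)
```

The five conditions in the `if` correspond to the five basic triple patterns that the one-line shortcut stands for.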

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .


Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance > 20000)

[Venn diagram: sol and mem, with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 171, 146, 185]

sol INTERSECT mem:

  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) > 20000
    AND experiment LIKE "*sol*"
  INTERSECT
  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) > 20000
    AND experiment LIKE "*mem*"
  USING NAMESPACE

sol MINUS mem:

  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) > 20000
    AND experiment LIKE "*sol*"
  MINUS
  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) > 20000
    AND experiment LIKE "*mem*"
  USING NAMESPACE

mem MINUS sol:

  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) > 20000
    AND experiment LIKE "*mem*"
  MINUS
  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) > 20000
    AND experiment LIKE "*sol*"
  USING NAMESPACE

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance < 5000)

[Venn diagram: sol and mem, with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 219, 125, 245]

sol INTERSECT mem:

  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) < 5000
    AND experiment LIKE "*sol*"
  INTERSECT
  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) < 5000
    AND experiment LIKE "*mem*"
  USING NAMESPACE

sol MINUS mem:

  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) < 5000
    AND experiment LIKE "*sol*"
  MINUS
  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) < 5000
    AND experiment LIKE "*mem*"
  USING NAMESPACE

mem MINUS sol:

  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) < 5000
    AND experiment LIKE "*mem*"
  MINUS
  SELECT DISTINCT psn
  FROM {x} fns:poson {psn},
       {x} fn:InExperimentReplicate {experiment},
       {analysis} rdf:subject {x},
       {analysis} rdf:object {exp},
       {analysis} fn:PeptideAbundance {abundance}
  WHERE xsd:integer(abundance) < 5000
    AND experiment LIKE "*sol*"
  USING NAMESPACE
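The INTERSECT and MINUS comparisons between the soluble and membrane fractions correspond to ordinary set operations. The following Python sketch uses invented observation rows and a hypothetical `posons_above` helper; it only illustrates the logic of the three Venn regions and does not reproduce the counts on the slides.

```python
def posons_above(observations, fraction, threshold):
    """Set of posons with at least one peptide above `threshold` in `fraction`."""
    return {psn for psn, frac, abundance in observations
            if frac == fraction and abundance > threshold}


# Invented (poson, fraction, abundance) rows, for illustration only.
observations = [
    ("PSN1", "sol", 25000),
    ("PSN1", "mem", 30000),
    ("PSN2", "sol", 40000),
    ("PSN3", "mem", 21000),
    ("PSN4", "mem", 100),
]

sol = posons_above(observations, "sol", 20000)
mem = posons_above(observations, "mem", 20000)

both = sol & mem       # sol INTERSECT mem
sol_only = sol - mem   # sol MINUS mem
mem_only = mem - sol   # mem MINUS sol

print(sorted(both), sorted(sol_only), sorted(mem_only))
```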

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding a transcriptomic experiment: WT vs mglA mutant (GEO accession GSE5468)

• RDFS inferencing

Acknowledgements

• Funding: BBSRC, Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow
  • Prof Walter Kolch
  • Dr Andy Pitt

• University of Strathclyde, Glasgow
  • Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle
  • Prof Dave Goodlett (Scientific Advisor)
  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
  • Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
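Until such aggregates are available in the query language, the same thresholds can be computed on the client after fetching the abundance values. A minimal Python sketch, assuming the values are already in a list; `split_by_median` is a hypothetical helper and the sample numbers are invented. Both ranges are inclusive, as BETWEEN would be, so the median value falls in both.

```python
from statistics import median


def split_by_median(abundances):
    """Return (median, low, high), where low covers MIN..MEDIAN and
    high covers MEDIAN..MAX, both inclusive."""
    m = median(abundances)
    lo, hi = min(abundances), max(abundances)
    low = [a for a in abundances if lo <= a <= m]
    high = [a for a in abundances if m <= a <= hi]
    return m, low, high


m, low, high = split_by_median([100, 2594, 5000, 20000, 100000])
print(m, low, high)
```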


Data sources: Genome

RDF description of an IMG gene record:

  subject                              (4) IMG:gene_oid=639752258
  (1) RDF:type                         RDF:description
  (2) RDFS:comment                     TPR
  (3) IMG_S:locus_tag                  FTN_0209
  (3) IMG_S:genomic_location_start     229107
  (3) IMG_S:genomic_location_end       229976
  (3) IMG_S:genomic_location_strand    +

Source: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=639633024 (export)

Data sources: Genome annotations

Francisella SUPERFAMILY data

  subject                                        http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616
  RDF:type                                       RDF:description
  http://purl.uniprot.org/core/Protein_Family    SUPERFAMILY:cgi-bin/model.cgi?model=0040419
  SUPERFAMILY:Assignment_Region                  155-367
  SUPERFAMILY:Score                              5.1e-39
  SUPERFAMILY:SCOP_ID                            SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540
  SUPERFAMILY:SCOP_Fold                          P-loop containing nucleoside triphosphate hydrolases
  SUPERFAMILY:Family_ID                          81269
  SUPERFAMILY:Evalue                             7.33e-06
  SUPERFAMILY:Family_Description                 Extended AAA-ATPase domain
  SUPERFAMILY:Similar_Structure                  1l8q A:77-289

Source: http://supfam.cs.bris.ac.uk

Data sources: Genome annotations - KEGG

  http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010
  http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298
  http://img.jgi.doe.gov/schema/gene
  http://img.jgi.doe.gov/schema/gene_name    glpX
  rdfs:comment                               fructose
  rdfs:seeAlso                               http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]
  rdfs:seeAlso                               http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]

Genome annotations - NCBI protein

  subject                                    http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616
  RDF:type                                   RDF:description
  RDF:id/symbol                              YP_897666.1
  RDFS:seeAlso                               http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e
  http://purl.uniprot.org/Annotation         chromosomal

Sources:
  http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112
  http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida

Data sources: Genome annotations - GO

  subject                                                    http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277
  RDF:type                                                   RDF:description
  mgla:GO_AnnotationID                                       http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749
  mgla:GO_AnnotationTerm                                     glutathione
  mgla:GO_AnnotationOntology                                 biological_process
  mgla:GO_AnnotationLevel                                    7
  http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore     0.879989490261963
  http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore     57273821328517

Poson annotations - COGs

  subject                 https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435
  mgla:cogNumber          http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508
  mgla:cogDomain          AceF
  mgla:cogDescription     Pyruvate/2-oxoglutarate
  mgla:cogCategory        dihydrolipoamide

Data sources - experiments: Transcriptomics

Data sources - experiments: Proteomics

Proteomics: WT vs MglA mutant

Francisella tularensis novicida U112

  WildType:      Whole Cell (3), Soluble (3), Membrane (3)
  MglA mutant:   Whole Cell (3), Soluble (3), Membrane (3)
  (4) for each of the six fractions

  Sequest and DRAGON on each fraction: identification and relative abundance

  Two-sided t-test, P value < 0.01

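RDF - Excel conversion

[Diagram: each analysis node is the subject for one identified peptide. Its predicates point to the objects: the peptide sequence, mgla:experiment (abundance) and mgla:poson (PSN). PSN, PSNV2, PSNV3 and FTN are chained by rdfs:seeAlso, and rdfs:seeAlso also links to DDB ID, GO, EC and SP in the genome annotations. Further labels: Pval, Pval-1.]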
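A direct spreadsheet-to-RDF conversion of the kind shown on this slide can be sketched as follows. This is only an illustration under stated assumptions: the column names, the `MGLA` namespace URI, the `rows_to_triples` helper and the peptide sequence are invented, and the spreadsheet is represented as CSV text so the example needs only the Python standard library.

```python
import csv
import io

MGLA = "http://example.org/mgla#"  # hypothetical namespace, for illustration only


def rows_to_triples(csv_text):
    """Turn each spreadsheet row into one analysis node with three properties."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        analysis = f"{MGLA}analysis{i}"
        triples.append((analysis, MGLA + "sequence", row["sequence"]))
        triples.append((analysis, MGLA + "poson", row["poson"]))
        # The abundance hangs directly off the analysis node, so the link
        # between a peptide, a specific replicate and its abundance is implicit.
        triples.append((analysis, MGLA + "experiment", row["abundance"]))
    return triples


# One invented row; the column names are assumptions.
sheet = "sequence,poson,abundance\nAAK,PSN082435,2594\n"
triples = rows_to_triples(sheet)
for t in triples:
    print(t)
```

Because every cell simply becomes a property of the row node, questions that relate a peptide to a particular experiment replicate are awkward to ask of this layout.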

Data integration: Reconciled identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2

(WashU-B) PSNV3

(COGs) COGID

(Gene Ontology) GOID

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo

(ENZYME) ECNo

(IMG) GENEID

(NCBI) PROTEINID
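In the model diagrams the poson identifier versions and the FTN ORF identifier are chained by rdfs:seeAlso, so reconciling an identifier amounts to walking that chain. A minimal Python sketch, assuming the rdfs:seeAlso links have been loaded into a dictionary; `follow_see_also` and all identifiers below are hypothetical.

```python
def follow_see_also(start, see_also, max_hops=10):
    """Follow rdfs:seeAlso links from `start` and return the path visited.

    Stops at a dead end, on a repeated node, or after `max_hops` links.
    """
    path = [start]
    current = start
    for _ in range(max_hops):
        nxt = see_also.get(current)
        if nxt is None or nxt in path:
            break
        path.append(nxt)
        current = nxt
    return path


# Invented identifiers standing in for PSN -> PSNV2 -> PSNV3 -> FTN.
see_also = {
    "PSN_a": "PSNV2_a",
    "PSNV2_a": "PSNV3_a",
    "PSNV3_a": "FTN_a",
}
path = follow_see_also("PSN_a", see_also)
print(" -> ".join(path))
```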

Data Integration: Adding new experiments

[Diagram: Experiment 1, Experiment 2, Experiment 3 and Experiment 4 all attach to the same reconciled identifiers (PSN, PSNV2, PSNV3 and FTN, chained by rdfs:seeAlso), which link by rdfs:seeAlso to public domain data: DDB ID, GO, EC and AC No.]

Data integration: Sesame

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory

Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc, posc)

Data file                                time (s)    triples
francisella_locus_tag.nt                     8.93       1767
interact-prot.nt                            88.51      20682
interact-prot-peptides.nt                               248647
mgla search dbfastablastp4 ypURL.n3          9.7        1719
NC_008601.nt                                43.14      12781
Ft_novicidaU112go.nt                       359.14       2548
francisellardf2.nt                          43.41      10434
francisellaSUPERFAMILY.nt                   57.88      16110
francisellaPROTEINfasta.nt                  13.63       5160
Soluble.nt                                 588.87     336761
WholeCell.nt                               469.02     112625
Membranes.nt                              1003.19     298771

Data Integration: Mgla data (ftnRepoNative)

[Diagram: each analysis node (one identified peptide) has a peptide sequence, an experiment (abundance) and mgla:poson (PSN). PSN, PSNV2, PSNV3 and FTN are chained by rdfs:seeAlso, with further rdfs:seeAlso links to DDB ID, GO, EC and SP.]

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
  mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
  mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

Page 8: Genome and Proteome data integration in RDF

Data sourcesGenome

RDF

(4)IMGgene_oid=639752258 FTN_0209 (3)IMG_Slocus_tag

229107

(3)IMG_Sgenomic_location_start

229976

(3)IMG_Sgenomic_location_end

+

(3)IMG_Sgenomic_location_strand

TPR

(2)RDFScomment

RDFdescription

(1)RDFtype

httpimgjgidoegovcgi-binpubmaincgisection=TaxonDetailamppage=taxonDetailamptaxon_oid=639633024export

Data sources: Genome annotations

Francisella SuperFamily Data

Subject: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616

    RDF:type                                     RDF:description
    http://purl.uniprot.org/core/Protein_Family  SUPERFAMILY:cgi-bin/model.cgi?model=0040419
    SUPERFAMILY:Assignment_Region                155-367
    SUPERFAMILY:Score                            51e-39
    SUPERFAMILY:SCOP_ID                          SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540
    SUPERFAMILY:SCOP_Fold                        P-loop containing nucleoside triphosphate hydrolases
    SUPERFAMILY:Family_ID                        81269
    SUPERFAMILY:Evalue                           733e-06
    SUPERFAMILY:Family_Description               Extended AAA-ATPase domain
    SUPERFAMILY:Similar_Structure                1l8q A77-289

Source: http://supfam.cs.bris.ac.uk

Data sources: Genome annotations - KEGG

Nodes: http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010
       http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298
       http://img.jgi.doe.gov/schema/gene

    http://img.jgi.doe.gov/schema/gene_name   glpX
    rdfs:comment                              fructose
    rdfs:seeAlso                              http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]
    rdfs:seeAlso                              http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]

Genome annotations - NCBI protein

Subject: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616

    RDF:type                             RDF:description
    RDFidsymbol                          YP_897666.1
    RDFS:seeAlso                         http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e
    http://purl.uniprot.org/Annotation   chromosomal

Sources: http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112
         http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida

Data sources: Genome annotations - GO

Subject: http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277

    RDF:type                    RDF:description
    mgla:GO_AnnotationID        http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749
    mgla:GO_AnnotationTerm      glutathione
    mgla:GO_AnnotationOntology  biological_process
    mgla:GO_AnnotationLevel     7
    http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore   0879989490261963
    http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore   57273821328517

Poson annotations - COGs

Subject: https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435

    mgla:cogNumber        http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508
    mgla:cogDomain        AceF
    mgla:cogDescription   Pyruvate/2-oxoglutarate
    mgla:cogCategory      dihydrolipoamide

Data sources - experiments: Transcriptomics

Data sources - experiments: Proteomics

Proteomics: WT vs MglA mutant

Francisella tularensis novicida U112

• WildType: Whole Cell (3), Soluble (3), Membrane (3); (4) for each fraction
• MglA mutant: Whole Cell (3), Soluble (3), Membrane (3); (4) for each fraction
• Each fraction analysed with Sequest and DRAGON
• Outputs: Identification, Relative Abundance
• Two-sided t-test, P val < 0.01

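RDF - excel conversion

Figure: each spreadsheet row becomes an "Identified Peptide" analysis node with a peptide sequence, an abundance, Pval and Pval-1 values, an mgla:experiment link and an mgla:poson link. The poson joins the genome annotations through rdfs:seeAlso links (PSN, PSNV2, PSNV3, FTN, DDBID), which in turn lead to GO, EC and SP entries. Figure labels: Genome; subject, predicate, object.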

Data integration: Reconciled Identifiers

• (WashU-B) PSNV1
• (WashU-B) PSNV2
• (WashU-B) PSNV3
• (WashU-P) DDB
• (Fn ORF ID) FTN
• (COGs) COGID
• (Gene Ontology) GOID
• (Refseq) ACNo
• (Uniprot) ACNo
• (ENZYME) ECNo
• (IMG) GENEID
• (NCBI) PROTEINID
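Because the reconciled identifiers are connected by rdfs:seeAlso links, translating one identifier scheme into another amounts to following a chain of links. The sketch below illustrates that traversal in Python over a toy in-memory link table. All identifiers ending in "example", the SEE_ALSO table and the function name resolve are hypothetical and are not taken from the dataset.

```python
from collections import deque

# Hypothetical rdfs:seeAlso links (subject -> objects); placeholder values only.
SEE_ALSO = {
    "PSNV1:example": ["PSNV2:example"],
    "PSNV2:example": ["PSNV3:example", "COG:example"],
    "PSNV3:example": ["FTN:example"],
    "FTN:example": ["EC:example", "GO:example"],
}


def resolve(start, wanted_prefix, links=SEE_ALSO):
    """Breadth-first walk over seeAlso links.

    Returns the first identifier reachable from `start` whose name begins
    with `wanted_prefix`, or None if no such identifier can be reached.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node != start and node.startswith(wanted_prefix):
            return node
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


print(resolve("PSNV1:example", "FTN:"))
print(resolve("PSNV1:example", "EC:"))
```

In a triple store the same walk is expressed as a path of rdfs:seeAlso edges in the query; the point of the sketch is only that every additional identifier scheme adds one more hop.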

Data Integration: Adding new experiments

Figure: Experiment 1, Experiment 2, Experiment 3 and Experiment 4 each attach to the same identifier backbone (PSN, PSNV2, PSNV3 and FTN joined by rdfs:seeAlso, plus DDBID), which links on to public domain data (GO, EC, AC No).

Data integration: Sesame

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory

Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc, posc)

Data File                                time (s)    triples
francisella_locus_tag.nt                     8.93       1767
interact-prot.nt                            88.51      20682
interact-prot-peptides.nt                               248647
mgla search dbfastablastp4 ypURL.n3          0.97       1719
NC_008601.nt                                43.14      12781
Ft_novicidaU112go.nt                       359.14       2548
francisellardf2.nt                          43.41      10434
francisellaSUPERFAMILY.nt                   57.88      16110
francisellaPROTEINfasta.nt                  13.63       5160
Soluble.nt                                 588.87     336761
WholeCell.nt                               469.02     112625
Membranes.nt                              1003.19     298771

Data Integration: Mgla data (ftnRepoNative)

Figure: an "Identified Peptide" analysis node with a peptide sequence, an Experiment, an abundance and an mgla:poson link; the poson is joined to PSNV2, PSNV3, FTN and DDBID by rdfs:seeAlso, leading on to GO, EC and SP.

Query 1: posons and ORFs that carry an EC cross-reference

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
    mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

Query 2: the same join, also returning the abundance

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
    mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>
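The EC query is a join of three path expressions on shared variables (analysis to poson, poson to ORF, ORF to cross-reference) followed by a string filter. As a rough illustration of what the repository has to do, here is the same join written in Python over a small list of triples. The triples, the example.org URLs and the function names are hypothetical, and the substring test stands in for a LIKE pattern with wildcards on both sides.

```python
SEE_ALSO = "rdfs:seeAlso"
POSON = "mgla:poson"

# Hypothetical triples; identifiers and URLs are placeholders, not dataset values.
TRIPLES = [
    ("analysis1", POSON, "PSN_a"),
    ("analysis2", POSON, "PSN_b"),
    ("PSN_a", SEE_ALSO, "FTN_a"),
    ("PSN_b", SEE_ALSO, "FTN_b"),
    ("FTN_a", SEE_ALSO, "http://example.org/wgetz?-e+[EC:0.0.0.0]"),
    ("FTN_b", SEE_ALSO, "http://example.org/wgetz?-e+[SP:EXAMPLE]"),
]


def match(triples, predicate):
    """Return (subject, object) pairs for all triples with this predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


def posons_with_ec(triples):
    """Join analysis -> poson -> ORF -> cross-reference, keeping EC links only."""
    see_also = match(triples, SEE_ALSO)
    rows = []
    for _analysis, psn in match(triples, POSON):
        for s1, ftn in see_also:
            if s1 != psn:
                continue
            for s2, ec in see_also:
                # Substring test plays the role of the LIKE filter on `ec`.
                if s2 == ftn and "[EC" in ec:
                    rows.append((psn, ftn, ec))
    return rows


for row in posons_with_ec(TRIPLES):
    print(row)
```

The nested loops make the cost of each extra rdfs:seeAlso hop visible; a triple store avoids the full scan by using its indexes, which is why the choice of index configuration matters for load and query time.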

Data Integration: Mgla data (ftnRepoNative)

Figure: the same graph with its predicates labelled. The "Identified Peptide" analysis node (rdf:about) has mgla:sequence (peptide sequence), mgla:experiment (abundance) and mgla:poson (PSN); PSN, PSNV2, PSNV3, FTN and DDBID are joined by rdfs:seeAlso and lead on to GO, EC and SP.

Really easy. But:

• A simple Excel-to-RDF conversion does not enable all queries
• It is not a simple conversion: the data needs to be "modelled"

Figure (simple conversion): the "Identified Peptide" analysis node (rdf:about) has mgla:sequence (peptide sequence), mgla:experiment (abundance) and mgla:poson (PSN).

Figure (modelled): Peptide Sequence, identifiedIn, Experiment Replicate; that statement in turn hasAbundance abundance.

Data Integration: Reified statements

Figure: the "Identified Peptide" analysis node has a peptide sequence and an mgla:poson link (PSN, PSNV2, PSNV3, FTN and DDBID joined by rdfs:seeAlso, leading on to GO, EC and SP). The analysis data node has rdf:type rdf:Statement, with rdf:subject (the identified peptide), rdf:predicate InExperimentReplicate and rdf:object (the Experiment Replicate); the abundance is attached to that statement through mgla:PeptideAbundance.

Sesame: Reified data load - native-RDFS (spoc, posc, posc)

Data File                          time (s)    time (mins)   triples
FnU112Version3.nt                    383.44        6.3         58474
PosonMappings.nt                      84.56        1.4         13760
francisella_locus_tag.nt              16.73        0.3          1767
ConstructHasGeneID.nt                 23.00        0.4          1719
interact-prot.nt                     124.95        2.1         20682
interact-prot-pepteides.nt          1127.97       18.7        248647
interact-protSeeAlsoisbURL.nt         10.67        0.2          1528
goAnnotation_URLID.nt                 74.14        1.2         20501
NC_008601.nt                          75.84        1.3         12781
Membranes_CogNumberURL.nt              8.60        0.1          2548
Ft_novicida_U112_go.nt               561.38        9.3          2548
francisellardf2.nt                    46.19        0.8         10602
francisellaSUPERFAMILY.nt             66.67        1.1         16110
francisellaPROTEINfasta.nt            15.27        0.3          5160
SolubleReifeid_3.rdf                1392.98       23.2        580873
WholeCellReified_3.rdf               941.16       15.6        184221
Membranes_3.rdf                     1026.66       17.111      416086
fnU112_draftRDFschemaV4.nt           215.01098     3.5835        501

Queries: which posons have the most highly abundant peptides?

SELECT ftn, psn, exp, abundance
FROM {psn} rdfs:seeAlso {psnv2},
     {psnv2} rdfs:seeAlso {psnv3},
     {psnv3} rdfs:seeAlso {ftn},
     {analysis} fnu112:poson {psn},
     {analysis} rdf:type {rdf:Statement},
     {analysis} rdf:object {exp},
     {analysis} mgla:PeptideAbundance {abundance}
WHERE xsd:integer(abundance) > 100000
  AND ftn LIKE "*FTN*"
USING NAMESPACE
    mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>,
    fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries: which experiments have the most highly abundant peptides?

Reified statements

• Reified mgla data are much bigger (4 more statements per abundance value)
• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)
• Haven't yet tested the shortcut path expression
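The size increase can be read off the two Sesame data-load tables by comparing the triple counts of the three fraction files before and after reification. The short Python calculation below does that arithmetic. It is only indicative, since it assumes the plain and reified files cover the same observations, which the slides do not state.

```python
# Triple counts as reported in the two data-load tables: (plain, reified).
COUNTS = {
    "Soluble": (336761, 580873),
    "WholeCell": (112625, 184221),
    "Membranes": (298771, 416086),
}


def growth(plain, reified):
    """Return (extra triples, growth factor) for one file."""
    return reified - plain, reified / plain


for name, (plain, reified) in COUNTS.items():
    extra, ratio = growth(plain, reified)
    print(f"{name}: +{extra} triples, x{ratio:.2f}")
```

All three files grow by a factor of less than two, which is still enough to push the larger join queries past the configured heap size.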

SeRQL shortcut path expression for reified statements:

    { {reifSubj} reifPred {reifObj} } pred {obj}
    { {seq} identifiedIn {ExpRep} } hasAbundance {abd}

Example of one reified abundance value (N-Triples):

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .
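The reified pattern is regular enough to generate mechanically: one rdf:Statement node, its subject, predicate and object, and the abundance attached to the statement node. The Python sketch below emits those five lines for a single observation. The MGLA namespace string (in particular the "#" before local names) and the function name reify_abundance are assumptions; the statement identifier is left as a relative name to mirror the slide, whereas strict N-Triples would need an absolute IRI.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace: the separator before local names is a guess.
MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#"


def reify_abundance(statement_id, peptide, replicate, abundance):
    """Return the five N-Triples-style lines describing one reified
    'peptide InExperimentReplicate replicate' statement and its abundance."""
    stmt = f"<{statement_id}>"
    return [
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{MGLA}{peptide}> .",
        f"{stmt} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{stmt} <{RDF}object> <{MGLA}{replicate}> .",
        f'{stmt} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


lines = reify_abundance(
    "WholeCell_Lvl7_0212", "WholeCell_Lvl7_021", "wildtype01_wc_01", 2594
)
for line in lines:
    print(line)
```

Each abundance value costs five triples in this layout, which is where the "4 more statements" per abundance comes from.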

Figure: Peptide Sequence, identifiedIn, Experiment Replicate; that statement in turn hasAbundance abundance.

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance > 20000)

Venn diagram of the sol and mem fractions with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 171, 146, 185.

Sub-select for one fraction (shown for sol; the mem sub-select differs only in the LIKE pattern):

SELECT DISTINCT psn
FROM {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
WHERE xsd:integer(abundance) > 20000
  AND experiment LIKE "*sol*"

The three queries combine the two sub-selects as follows:

• INTERSECT: (sol sub-select) INTERSECT (mem sub-select) USING NAMESPACE ...
• sol MINUS mem: (sol sub-select) MINUS (mem sub-select) USING NAMESPACE ...
• mem MINUS sol: (mem sub-select) MINUS (sol sub-select) USING NAMESPACE ...

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance < 5000)

Venn diagram of the sol and mem fractions with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 219, 125, 245.

Sub-select for one fraction (shown for sol; the mem sub-select differs only in the LIKE pattern):

SELECT DISTINCT psn
FROM {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
WHERE xsd:integer(abundance) < 5000
  AND experiment LIKE "*sol*"

The three queries combine the two sub-selects as follows:

• INTERSECT: (sol sub-select) INTERSECT (mem sub-select) USING NAMESPACE ...
• sol MINUS mem: (sol sub-select) MINUS (mem sub-select) USING NAMESPACE ...
• mem MINUS sol: (mem sub-select) MINUS (sol sub-select) USING NAMESPACE ...
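The INTERSECT and MINUS comparisons correspond directly to set intersection and set difference over the posons found in each fraction. The Python sketch below shows that correspondence on a handful of made-up observations. The poson names, replicate names and abundance values are hypothetical, and matching the fraction by substring is an assumption about how replicate names encode the fraction.

```python
# Hypothetical observations: (poson, experiment replicate, abundance).
OBSERVATIONS = [
    ("PSN_a", "wildtype01_sol_01", 25000),
    ("PSN_a", "wildtype01_mem_01", 31000),
    ("PSN_b", "wildtype01_sol_02", 40000),
    ("PSN_c", "wildtype01_mem_02", 22000),
    ("PSN_d", "wildtype01_mem_03", 1200),
]


def posons(observations, fraction, keep):
    """Distinct posons seen in a fraction whose abundance passes `keep`."""
    return {psn for psn, exp, abd in observations if fraction in exp and keep(abd)}


def compare(observations, keep):
    """Return the three Venn regions for the sol and mem fractions."""
    sol = posons(observations, "sol", keep)
    mem = posons(observations, "mem", keep)
    return {
        "sol_only": sol - mem,  # sol MINUS mem
        "both": sol & mem,      # INTERSECT
        "mem_only": mem - sol,  # mem MINUS sol
    }


high = compare(OBSERVATIONS, lambda abd: abd > 20000)
low = compare(OBSERVATIONS, lambda abd: abd < 5000)
print(high)
print(low)
```

Running the comparison client-side like this is one way around the memory errors, at the cost of pulling every (poson, replicate, abundance) row out of the repository first.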

Further work

• Queries are slow in the native repository; database repositories are probably faster
• Adding a transcriptomic experiment: Wt vs mglA mutant (GEO AC: GSE5468)
• RDF-S inferencing

Acknowledgements

• Funding: BBSRC - Radical Solutions for Researching the Proteome
• University of Glasgow, Glasgow
  • Prof Walter Kolch
  • Dr Andy Pitt
• University of Strathclyde, Glasgow
  • Dr Ela Hunt (Scientific Advisor)
• University of Washington, Seattle
  • Prof Dave Goodlett (Scientific Advisor)
  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
  • Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have
• Queries to find low and high abundance values:
  • WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
  • WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
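Until aggregate functions are available in the query language, the thresholds can be computed outside the repository after retrieving the abundance values. The Python sketch below splits a list of abundances into the two bands described above, using inclusive bounds as BETWEEN would. The sample values and the function name abundance_bands are hypothetical.

```python
from statistics import median


def abundance_bands(values):
    """Return ((min, median, max), low band, high band) for a list of
    abundances; both bands include the median, as BETWEEN is inclusive."""
    lo, mid, hi = min(values), median(values), max(values)
    low_band = [v for v in values if lo <= v <= mid]
    high_band = [v for v in values if mid <= v <= hi]
    return (lo, mid, hi), low_band, high_band


sample = [1200, 2594, 22000, 25000, 31000]  # hypothetical abundance values
stats, low_band, high_band = abundance_bands(sample)
print(stats)
print(low_band)
print(high_band)
```

The computed minimum, median and maximum can then be substituted as literal bounds into the WHERE clause of the abundance queries.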


Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (lt5000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solusing namespace

219 125245

Further work

bull Queries are slow in the native repository database repositories are probably faster

bull Adding transcriptomic experiment

Wt Vs mglA mutant

GEO AC GSE5468

bull RDF-S inferencing

Acknowledgements

bull Funding BBSRC -Radical Solutions for Researching the Proteome

bull University of Glasgow Glasgow

bull Prof Walter Kolch

bull Dr Andy Pitt

bull University of Strathclyde Glasgow

bull Dr Ela Hunt (Scientific Advisor)

bull University of Washington Seattle

bull Prof Dave Goodlett (Scientific Advisor)

bull Dr Mitch Brittnacher Mathew Radey Laurence Rohmer

bull Dr Tina Guina (MglA experiment)

Abundance thresholdsbull SeRQL aggregate functions would be nice to have

bull Queries to find low and high abundance values

bull WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

bull WHERE abundance BETWEEN MIN(abundance) and MEDIAN(abundance)

Page 11: Genome and Proteome data integration in RDF

Data sources: Genome annotations - KEGG

[Figure: RDF graph for the KEGG gene entry http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298, shown together with the pathway entry http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010. Labels shown: the IMG schema properties gene and gene_name (img.jgi.doe.gov) with the value glpX; rdfs:comment (fructose); rdfs:seeAlso to the SRS entries [EC:3.1.3.11] and [SP:A0Q4N9_FRATN] at srs.ebi.ac.uk.]

Genome annotations - NCBI protein

[Figure: RDF graph for the NCBI protein record http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616. Labels shown: rdf:Description, rdf:type, rdf:ID, symbol, YP_897666.1, chromosomal, the Annotation term at purl.uniprot.org, and rdfs:seeAlso to the SRS entry [refseqp-SeqVersion:YP_897666.1].]

Sources: http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112 and http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida

Data sources: Genome annotations - GO

[Figure: RDF graph for http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277 (rdf:Description, rdf:type) with the properties mgla:GO_AnnotationID (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749), mgla:GO_AnnotationTerm (glutathione), mgla:GO_AnnotationOntology (biological_process), mgla:GO_AnnotationLevel (7), and the GOtcha iscore (0.879989490261963) and cscore (57.273821328517) properties from www.compbio.dundee.ac.uk/Software/GOtcha.]

Poson annotations - COGs

[Figure: RDF graph for the poson https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435 with the properties mgla:cogNumber (http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508), mgla:cogDomain (AceF), mgla:cogDescription (Pyruvate/2-oxoglutarate) and mgla:cogCategory (dihydrolipoamide).]

Data sources - experiments: Transcriptomics

Data sources - experiments: Proteomics

Proteomics: WT vs MglA mutant, Francisella tularensis novicida U112

[Figure: experimental design. WildType and MglA mutant samples are each divided into Whole Cell (3), Soluble (3) and Membrane (3) fractions, each marked (4). Every fraction is processed with Sequest and DRAGON for identification and relative abundance, followed by a two-sided t-test (P val < 0.01).]

RDF - Excel conversion

[Figure: RDF model of the spreadsheet data. An Identified Peptide (analysis) node carries the peptide sequence, mgla:poson, mgla:experiment, abundance, Pval and Pval-1. The poson (PSN) links by rdfs:seeAlso through PSNV2 and PSNV3 to the FTN identifier, which links by rdfs:seeAlso to the genome annotations DDBID, GO, EC and SP. The labels subject, predicate and object mark the triple structure.]

Data integration: Reconciled identifiers

(WashU-B) PSNV1
(WashU-B) PSNV2
(WashU-B) PSNV3
(WashU-P) DDB
(Fn ORF ID) FTN
(COGs) COGID
(Gene Ontology) GOID
(Refseq) ACNo
(Uniprot) ACNo
(ENZYME) ECNo
(IMG) GENEID
(NCBI) PROTEINID

Data integration: Adding new experiments

[Figure: Experiments 1, 2, 3 and 4 each attach to the identifier chain PSN, PSNV2, PSNV3, FTN (joined by rdfs:seeAlso). FTN links by rdfs:seeAlso to DDBID, GO, EC and AC No in the public domain data.]
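The identifier chain above can be read as a small graph walk: starting from a poson, follow rdfs:seeAlso links until no new identifiers appear. The sketch below illustrates that idea in plain Python. The identifiers and the edge table are hypothetical placeholders, not values from the Francisella data, and the real store was queried with SeRQL rather than with code like this.

```python
from collections import deque

# Hypothetical rdfs:seeAlso edges; every identifier here is illustrative only.
SEE_ALSO = {
    "PSN_example": ["PSNV2_example"],
    "PSNV2_example": ["PSNV3_example"],
    "PSNV3_example": ["FTN_example"],
    "FTN_example": ["EC_example", "GO_example"],
}


def reachable(start, edges):
    """Return identifiers reachable from start via seeAlso links, in BFS order."""
    found = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in edges.get(node, []):
            if target not in visited:
                visited.add(target)
                found.append(target)
                queue.append(target)
    return found


linked = reachable("PSN_example", SEE_ALSO)
print(linked)
```

A new experiment only needs to reference one identifier on the chain; the remaining annotations are then reachable through the existing links.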

Data integration: Sesame

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc, posc)

Data File                              time (s)   triples
francisella_locus_tag.nt               893        1767
interact-prot.nt                       8851       20682
interact-prot-peptides.nt                         248647
mgla search dbfastablastp4 ypURL.n3    97         1719
NC_008601.nt                           4314       12781
Ft_novicidaU112go.nt                   35914      2548
francisellardf2.nt                     4341       10434
francisellaSUPERFAMILY.nt              5788       16110
francisellaPROTEINfasta.nt             1363       5160
Soluble.nt                             58887      336761
WholeCell.nt                           46902      112625
Membranes.nt                           100319     298771

Data Integration: Mgla data (ftnRepoNative)

[Figure: RDF model. An Identified Peptide (analysis) node links to its peptide sequence, Experiment and abundance, and via mgla:poson to a PSN. The PSN links by rdfs:seeAlso through PSNV2 and PSNV3 to FTN, and FTN links by rdfs:seeAlso to DDBID, GO, EC and SP.]

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC*"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

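Data Integration: Mgla data (ftnRepoNative)

[Figure: RDF model with property names. The Identified Peptide (analysis) node, identified by rdf:about, carries mgla:sequence (peptide sequence), mgla:experiment, abundance and mgla:poson. The PSN links by rdfs:seeAlso through PSNV2 and PSNV3 to FTN, and FTN links by rdfs:seeAlso to DDBID, GO, EC and SP.]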

Really easy, but

• A simple Excel to RDF conversion does not enable all queries

• It is not a simple conversion: the data needs to be "modelled"

[Figure: in the simple conversion, an Identified Peptide (analysis) node identified by rdf:about carries mgla:sequence (peptide sequence), mgla:poson (PSN), mgla:experiment and abundance as separate properties. A second diagram shows Peptide Sequence identifiedIn Experiment Replicate, with hasAbundance pointing to the abundance.]
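To make the limitation concrete, here is a minimal sketch of a naive row-by-row spreadsheet conversion, written in Python with only the standard library. The column names, the example row and the http://example.org/mgla# namespace are assumptions made for illustration; the original conversion tool and its exact output are not shown in the slides.

```python
import csv
import io

MGLA = "http://example.org/mgla#"  # hypothetical namespace, not the project's

# Hypothetical spreadsheet export with one analysis row.
SHEET = (
    "analysis,sequence,poson,experiment,abundance\n"
    "a1,PEPTIDEK,PSN000001,wildtype01_wc_01,2594\n"
)


def row_to_ntriples(row):
    """Emit one flat triple per column, all hanging off the analysis node."""
    subject = f"<{MGLA}{row['analysis']}>"
    triples = []
    for column in ("sequence", "poson", "experiment", "abundance"):
        triples.append(f'{subject} <{MGLA}{column}> "{row[column]}" .')
    return triples


lines = [
    triple
    for row in csv.DictReader(io.StringIO(SHEET))
    for triple in row_to_ntriples(row)
]
for line in lines:
    print(line)
```

In this flat form the abundance is a property of the analysis row, so nothing states that it belongs to the pairing of one peptide with one experiment replicate. That is the gap the modelling step has to close.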

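Data Integration: Reified statements

[Figure: reified RDF model. The statement "Peptide sequence InExperimentReplicate Experiment Replicate" is reified as an analysis data node with rdf:type rdf:Statement, rdf:subject (the peptide sequence), rdf:predicate (InExperimentReplicate) and rdf:object (the experiment replicate); mgla:PeptideAbundance attaches the abundance to that node. The Identified Peptide (analysis) node links via mgla:poson to a PSN, which links by rdfs:seeAlso through PSNV2 and PSNV3 to FTN, and FTN links by rdfs:seeAlso to DDBID, GO, EC and SP.]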

Sesame: Reified data load - native-RDFS (spoc, posc, posc)

Data File                          time (s)    time (mins)   triples
FnU112Version3.nt                  38344       63            58474
PosonMappings.nt                   8456        14            13760
francisella_locus_tag.nt           1673        03            1767
ConstructHasGeneID.nt              2300        04            1719
interact-prot.nt                   12495       21            20682
interact-prot-pepteides.nt         112797      187           248647
interact-protSeeAlsoisbURL.nt      1067        02            1528
goAnnotation_URLID.nt              7414        12            20501
NC_008601.nt                       7584        13            12781
Membranes_CogNumberURL.nt          860         01            2548
Ft_novicida_U112_go.nt             56138       93            2548
francisellardf2.nt                 4619        08            10602
francisellaSUPERFAMILY.nt          6667        11            16110
francisellaPROTEINfasta.nt         1527        03            5160
SolubleReifeid_3.rdf               139298      232           580873
WholeCellReified_3.rdf             94116       156           184221
Membranes_3.rdf                    102666      17111         416086
fnU112_draftRDFschemaV4.nt         21501098    35835         501
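The triple counts for the three experimental fractions appear in both load tables, so the growth caused by reification can be worked out directly. The short Python sketch below does that arithmetic. It assumes that Soluble.nt, WholeCell.nt and Membranes.nt in the first table correspond to SolubleReifeid_3.rdf, WholeCellReified_3.rdf and Membranes_3.rdf in the reified table, which the slides suggest but do not state.

```python
# Triple counts copied from the two load tables.
FLAT = {"Soluble": 336761, "WholeCell": 112625, "Membranes": 298771}
REIFIED = {"Soluble": 580873, "WholeCell": 184221, "Membranes": 416086}

# Ratio of reified to flat triple counts, rounded to two decimals.
growth = {name: round(REIFIED[name] / FLAT[name], 2) for name in FLAT}

# Absolute number of extra triples introduced by the reified model.
extra = {name: REIFIED[name] - FLAT[name] for name in FLAT}

for name in FLAT:
    print(name, growth[name], extra[name])
```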

Queries: which posons have the most highly abundant peptides

select ftn, psn, exp, abundance
from {psn} rdfs:seeAlso {psnv2},
     {psnv2} rdfs:seeAlso {psnv3},
     {psnv3} rdfs:seeAlso {ftn},
     {analysis} fnu112:poson {psn},
     {analysis} rdf:type {rdf:Statement},
     {analysis} rdf:object {exp},
     {analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
  and ftn LIKE "*FTN*"
using namespace
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>,
     fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries: which experiments have the most highly abundant peptides

Reified statements

• Reified mgla data are much bigger (4 more statements per abundance value)

• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)

• Haven't yet tested the shortcut path expression:

{ {reifSubj} reifPred {reifObj} } pred {obj}

{ {seq} identifiedIn {ExpRep} } hasAbundance {abd}

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .

[Figure: Peptide Sequence identifiedIn Experiment Replicate; hasAbundance points to the abundance.]
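The reified pattern above always produces the same five triples per abundance value: four for the rdf:Statement and one for the abundance itself. As a rough illustration, the Python sketch below generates them as N-Triples strings. The function name and the http://example.org/mgla# namespace are assumptions for the example; only the rdf: vocabulary URIs and the local names taken from the slide are real.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MGLA = "http://example.org/mgla#"  # hypothetical namespace, not the project's


def reify_abundance(statement_id, peptide, replicate, abundance):
    """Return the five N-Triples lines for one reified abundance measurement."""
    node = f"<{MGLA}{statement_id}>"
    return [
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{MGLA}{peptide}> .",
        f"{node} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{node} <{RDF}object> <{MGLA}{replicate}> .",
        f'{node} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


triples = reify_abundance(
    "WholeCell_Lvl7_0212", "WholeCell_Lvl7_021", "wildtype01_wc_01", 2594
)
for triple in triples:
    print(triple)
```

The fixed per-measurement cost is what makes the reified files so much larger than the flat ones.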

Comparison of integrated experimental data: distinct and overlapping posons identified within each biological fraction (> 20000)

[Figure: Venn diagram of the sol and mem fractions with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 171, 146, 185.]

INTERSECT:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
using namespace

Comparison of integrated experimental data: distinct and overlapping posons identified within each biological fraction (< 5000)

[Figure: Venn diagram of the sol and mem fractions with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 219, 125, 245.]

INTERSECT:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
using namespace
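The INTERSECT and MINUS comparisons correspond to ordinary set intersection and set difference over the posons found in each fraction. The Python sketch below mirrors that logic on a handful of made-up records. The poson names, replicate names and abundance values are hypothetical, and matching the fraction by substring is only a stand-in for the LIKE filter in the queries.

```python
# Hypothetical (poson, experiment replicate, abundance) records.
RECORDS = [
    ("PSN_a", "wildtype01_sol_01", 25000),
    ("PSN_b", "wildtype01_sol_01", 30000),
    ("PSN_b", "wildtype01_mem_01", 41000),
    ("PSN_c", "wildtype01_mem_01", 22000),
    ("PSN_d", "wildtype01_mem_01", 1200),
]


def posons_above(records, fraction, threshold):
    """Distinct posons seen in a fraction with abundance above the threshold."""
    return {
        poson
        for poson, experiment, abundance in records
        if fraction in experiment and abundance > threshold
    }


sol = posons_above(RECORDS, "sol", 20000)
mem = posons_above(RECORDS, "mem", 20000)

overlap = sol & mem    # counterpart of INTERSECT
sol_only = sol - mem   # counterpart of sol MINUS mem
mem_only = mem - sol   # counterpart of mem MINUS sol

print(sorted(sol_only), sorted(overlap), sorted(mem_only))
```

The three resulting sets correspond to the three regions of the Venn diagrams.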

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding a transcriptomic experiment: Wt vs mglA mutant (GEO AC GSE5468)

• RDF-S inferencing

Acknowledgements

• Funding: BBSRC - Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow
  • Prof Walter Kolch
  • Dr Andy Pitt

• University of Strathclyde, Glasgow
  • Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle
  • Prof Dave Goodlett (Scientific Advisor)
  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
  • Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values:

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
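Until such aggregates are available in the query language, the thresholds could be computed on the client after fetching the abundance values. The following Python sketch shows one way to do that with the standard library. The abundance list is invented for illustration, and the inclusive bounds copy the BETWEEN semantics of the wished-for clauses.

```python
from statistics import median

# Hypothetical abundance values, as if returned by a plain SELECT.
abundances = [1200, 2594, 4800, 22000, 150000]

lowest = min(abundances)
highest = max(abundances)
middle = median(abundances)

# BETWEEN MIN(abundance) AND MEDIAN(abundance), bounds inclusive.
low = [value for value in abundances if lowest <= value <= middle]

# BETWEEN MEDIAN(abundance) AND MAX(abundance), bounds inclusive.
high = [value for value in abundances if middle <= value <= highest]

print(middle, low, high)
```

Because both ranges are inclusive, a value equal to the median falls into both groups.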

Page 12: Genome and Proteome data integration in RDF

Data sourcesGenome annotations - GO

httpwwwgenomejpdbget-binwww_bgetftnFTN_0277

RDFdescriptionRDFtype

httpamigogeneontologyorgcgi-binamigogocgiview=detailsampquery=0006749mglaGO_AnnotationID

glutathionemglaGO_AnnotationTerm

biological_processmglaGO_AnnotationOntology

7

mglaGO_AnnotationLevel

0879989490261963

httpwwwcompbiodundeeacukSoftwareGOtchaiscore

57273821328517

httpwwwcompbiodundeeacukSoftwareGOtchacscore

Poson annotations - Cogs

httpstoolsnwrceorgcgi-binfnu112posoncgiposon=PSN082435

httpwwwncbinlmnihgovsitesentrezdb=cddampcmd=searchampterm=COG0508mglacogNumber

AceFmglacogDomain

Pyruvate2-oxoglutarate

mglacogDescription

dihydrolipoamide

mglacogCategory

Data sources - experimentsTranscriptomics

Data sources - experimentsProteomics

Proteomics WT vs Mgla Mutant

Francisella tularensis novicida U112

Whole Cell(3)

Soluble(3)

Membrane(3)

Whole Cell(3)

Soluble(3)

Membrane(3)

WildType MglA mutant

(4) (4) (4) (4) (4) (4)

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Relative AbundanceIdentification

Two-sided t-test

P val lt001

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

RDF - excel conversion

Genome

mglaexperiment

subject

object

predicate

Pval

Pval-1

Data integration Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2(COGs) COGID

(Gene Ontology) GOID

(WashU-B) PSNV3

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo(ENZYME) ECNo

(IMG) GENEID(NCBI) PROTEINID

Data IntegrationAdding new experiments

Experiment 1

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECAC No

Experiment 4

Experiment 2

Experiment 3

Public domain data

NadiaAnwar~ nadia$ openrdf-sesame-21binconsolesh Connected to default data directory

Commands end with at the end of a lineType help for helpgt connect http1270018080openrdf-sesameDisconnecting from default data directoryConnected to http1270018080openrdf-sesamegt show r+----------|SYSTEM (System configuration repository)|ftnRepoNative (Francisella Test)|FrancisellaNative (FrancisellaTestStore)|FrancisellaReified (Native store with RDF Schema inferencing)|FrancisellaReified_index2 (Native store with RDF Schema inferencing)|Francisella (Native store with RDF Schema inferencing)+----------gt open FrancisellaReified_index2Opened repository FrancisellaReified_index2

Data integration Sesame

SesameData load (ftnRepoNative) - native (spocposc)

Data File time (s) triples

francisella_locus_tagnt 893 1767

interact-protnt 8851 20682

interact-prot-peptidesnt 248647

mgla search dbfastablastp4 ypURLn3 97 1719

NC_008601nt 4314 12781

Ft_novicidaU112gont 35914 2548

francisellardf2nt 4341 10434

francisellaSUPERFAMILYnt 5788 16110

francisellaPROTEINfastant 1363 5160

Solublent 58887 336761

WholeCellnt 46902 112625

Membranesnt 100319 298771

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

Experiment

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

SELECT psn ftn ec FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

SELECT abundance psn ec ftn FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnanalysis mglaexperiment abundanceWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

mglasequencemglaexperiment

rdfabout

Really easy Butbull Simple excel to RDF conversion does not enable all queries

bull Not a simple conversion - Data needs to be ldquomodelledrdquo

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNmglasequence

mglaexperiment

rdfabout

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

Data IntegrationReified statements

Identified Peptideanalysis

Peptide sequence

Experiment Replicate

rdftype

mglaposon

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

analysis datardfStatement

analysis data

InExperimentReplicate

rdfobject

rdftype

rdfsubject

rdfpredicateabundance

mglaPeptideAbundance

DDBID

rdfsseeAlso

GO ECSP

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (gt20000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solusing namespace

171 146185

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (lt5000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solusing namespace

219 125245

Further work

bull Queries are slow in the native repository database repositories are probably faster

bull Adding transcriptomic experiment

Wt Vs mglA mutant

GEO AC GSE5468

bull RDF-S inferencing

Acknowledgements

bull Funding BBSRC -Radical Solutions for Researching the Proteome

bull University of Glasgow Glasgow

bull Prof Walter Kolch

bull Dr Andy Pitt

bull University of Strathclyde Glasgow

bull Dr Ela Hunt (Scientific Advisor)

bull University of Washington Seattle

bull Prof Dave Goodlett (Scientific Advisor)

bull Dr Mitch Brittnacher Mathew Radey Laurence Rohmer

bull Dr Tina Guina (MglA experiment)

Abundance thresholdsbull SeRQL aggregate functions would be nice to have

bull Queries to find low and high abundance values

bull WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

bull WHERE abundance BETWEEN MIN(abundance) and MEDIAN(abundance)

Page 13: Genome and Proteome data integration in RDF

Data sources - experimentsTranscriptomics

Data sources - experimentsProteomics

Proteomics WT vs Mgla Mutant

Francisella tularensis novicida U112

Whole Cell(3)

Soluble(3)

Membrane(3)

Whole Cell(3)

Soluble(3)

Membrane(3)

WildType MglA mutant

(4) (4) (4) (4) (4) (4)

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Relative AbundanceIdentification

Two-sided t-test

P val lt001

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

RDF - excel conversion

Genome

mglaexperiment

subject

object

predicate

Pval

Pval-1

Data integration Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2(COGs) COGID

(Gene Ontology) GOID

(WashU-B) PSNV3

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo(ENZYME) ECNo

(IMG) GENEID(NCBI) PROTEINID

Data IntegrationAdding new experiments

Experiment 1

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECAC No

Experiment 4

Experiment 2

Experiment 3

Public domain data

NadiaAnwar~ nadia$ openrdf-sesame-21binconsolesh Connected to default data directory

Commands end with at the end of a lineType help for helpgt connect http1270018080openrdf-sesameDisconnecting from default data directoryConnected to http1270018080openrdf-sesamegt show r+----------|SYSTEM (System configuration repository)|ftnRepoNative (Francisella Test)|FrancisellaNative (FrancisellaTestStore)|FrancisellaReified (Native store with RDF Schema inferencing)|FrancisellaReified_index2 (Native store with RDF Schema inferencing)|Francisella (Native store with RDF Schema inferencing)+----------gt open FrancisellaReified_index2Opened repository FrancisellaReified_index2

Data integration Sesame

SesameData load (ftnRepoNative) - native (spocposc)

Data File time (s) triples

francisella_locus_tagnt 893 1767

interact-protnt 8851 20682

interact-prot-peptidesnt 248647

mgla search dbfastablastp4 ypURLn3 97 1719

NC_008601nt 4314 12781

Ft_novicidaU112gont 35914 2548

francisellardf2nt 4341 10434

francisellaSUPERFAMILYnt 5788 16110

francisellaPROTEINfastant 1363 5160

Solublent 58887 336761

WholeCellnt 46902 112625

Membranesnt 100319 298771

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

Experiment

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

SELECT psn ftn ec FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

SELECT abundance psn ec ftn FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnanalysis mglaexperiment abundanceWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

mglasequencemglaexperiment

rdfabout

Really easy Butbull Simple excel to RDF conversion does not enable all queries

bull Not a simple conversion - Data needs to be ldquomodelledrdquo

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNmglasequence

mglaexperiment

rdfabout

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

Data IntegrationReified statements

Identified Peptideanalysis

Peptide sequence

Experiment Replicate

rdftype

mglaposon

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

analysis datardfStatement

analysis data

InExperimentReplicate

rdfobject

rdftype

rdfsubject

rdfpredicateabundance

mglaPeptideAbundance

DDBID

rdfsseeAlso

GO ECSP

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statements

• Reified mgla data are much bigger (4 more statements per abundance)

• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)

• Haven’t yet tested the SeRQL shortcut path expression for reified statements:

    { {reifSubj} reifPred {reifObj} } pred {obj}

    { {seq} identifiedIn {ExpRep} } hasAbundance {abd}

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .

[Diagram: Peptide Sequence, identifiedIn, Experiment Replicate; the statement hasAbundance abundance.]
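What a query over the reified data has to join can be sketched in Python: group the triples by statement node, then read off subject, predicate, object and abundance as one row. This is a conceptual sketch only. It does not use Sesame, and the `http://example.org/mgla#` URIs and the function name are assumptions made for the example.

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MGLA = "http://example.org/mgla#"  # assumed namespace, for illustration only
ABUNDANCE = MGLA + "PeptideAbundance"


def collapse_reified(triples):
    """Group triples by statement node; return (subj, pred, obj, abundance) rows."""
    by_statement = defaultdict(dict)
    for s, p, o in triples:
        by_statement[s][p] = o
    rows = []
    for props in by_statement.values():
        # Keep only nodes typed as rdf:Statement that carry an abundance
        if props.get(RDF + "type") != RDF + "Statement" or ABUNDANCE not in props:
            continue
        rows.append((
            props.get(RDF + "subject"),
            props.get(RDF + "predicate"),
            props.get(RDF + "object"),
            int(props[ABUNDANCE]),
        ))
    return rows


stmt = "WholeCell_Lvl7_0212"
example = [
    (stmt, RDF + "type", RDF + "Statement"),
    (stmt, RDF + "subject", MGLA + "WholeCell_Lvl7_021"),
    (stmt, RDF + "predicate", MGLA + "InExperimentReplicate"),
    (stmt, RDF + "object", MGLA + "wildtype01_wc_01"),
    (stmt, ABUNDANCE, "2594"),
]
rows = collapse_reified(example)
print(rows)
```

Each reified observation collapses to a single row, which is the view the comparison queries work with.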

Comparison of integrated experimental data: distinct and overlapping posons identified within each biological fraction (abundance > 20000)

[Venn diagram: sol and mem, with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 171, 146, 185]

INTERSECT:

    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000
      and experiment LIKE "*sol*"
    INTERSECT
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000
      and experiment LIKE "*mem*"
    using namespace

sol MINUS mem:

    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000
      and experiment LIKE "*sol*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000
      and experiment LIKE "*mem*"
    using namespace

mem MINUS sol:

    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000
      and experiment LIKE "*mem*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000
      and experiment LIKE "*sol*"
    using namespace
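The INTERSECT and MINUS comparisons correspond to ordinary set operations, which can be illustrated in Python once the query results are in memory. The observations below are invented for the example, and the `_sol_` / `_mem_` replicate naming is an assumption modelled on the `wildtype01_wc_01` identifier in the data.

```python
def posons_above(observations, fraction, threshold):
    """Distinct posons whose abundance exceeds threshold in replicates of one fraction."""
    return {
        psn
        for psn, replicate, abundance in observations
        if fraction in replicate and abundance > threshold
    }


# Hypothetical (poson, experiment replicate, abundance) rows
observations = [
    ("PSN_A", "wildtype01_sol_01", 25000),
    ("PSN_A", "wildtype01_mem_01", 31000),
    ("PSN_B", "wildtype01_sol_02", 40000),
    ("PSN_C", "wildtype01_mem_02", 22000),
    ("PSN_D", "wildtype01_mem_01", 1500),
]

sol = posons_above(observations, "sol", 20000)
mem = posons_above(observations, "mem", 20000)

both = sol & mem       # INTERSECT
sol_only = sol - mem   # sol MINUS mem
mem_only = mem - sol   # mem MINUS sol
print(both, sol_only, mem_only)
```

The three resulting sets are the three regions of the Venn diagram.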

Comparison of integrated experimental data: distinct and overlapping posons identified within each biological fraction (abundance < 5000)

[Venn diagram: sol and mem, with regions sol MINUS mem, INTERSECT and mem MINUS sol; counts shown: 219, 125, 245]

INTERSECT:

    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000
      and experiment LIKE "*sol*"
    INTERSECT
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000
      and experiment LIKE "*mem*"
    using namespace

sol MINUS mem:

    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000
      and experiment LIKE "*sol*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000
      and experiment LIKE "*mem*"
    using namespace

mem MINUS sol:

    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000
      and experiment LIKE "*mem*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000
      and experiment LIKE "*sol*"
    using namespace

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding a transcriptomic experiment: Wt vs mglA mutant (GEO accession GSE5468)

• RDF-S inferencing

Acknowledgements

• Funding: BBSRC - Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow
  • Prof Walter Kolch
  • Dr Andy Pitt

• University of Strathclyde, Glasgow
  • Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle
  • Prof Dave Goodlett (Scientific Advisor)
  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
  • Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values:

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
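Until such aggregates are available in the query language, the thresholds can be computed on the client after fetching the abundance values. A minimal sketch in Python, assuming the values have already been retrieved as integers; the numbers and the function name are made up for the example.

```python
from statistics import median


def split_by_median(abundances):
    """Return the median plus the low and high halves (both inclusive, like BETWEEN)."""
    mid = median(abundances)
    low = [a for a in abundances if a <= mid]    # MIN .. MEDIAN
    high = [a for a in abundances if a >= mid]   # MEDIAN .. MAX
    return mid, low, high


# Hypothetical abundance values returned by a query
abundances = [2594, 1200, 48000, 20500, 103000]
mid, low, high = split_by_median(abundances)
print(mid, low, high)
```

The median could then be substituted into the WHERE clause in place of a fixed cut-off such as 20000 or 5000.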


Data sources - experiments: Proteomics

Proteomics: WT vs MglA mutant

Francisella tularensis novicida U112

[Diagram: experimental design]

• WildType: Whole Cell (3), Soluble (3), Membrane (3)

• MglA mutant: Whole Cell (3), Soluble (3), Membrane (3)

• (4) under each of the six fractions

• Sequest and DRAGON applied to each fraction: identification and relative abundance

• Two-sided t-test, P val < 0.01

[Diagram: an identified peptide analysis with its peptide sequence and abundance; mgla:poson links it to a PSN, and rdfs:seeAlso links PSN, PSNV2, PSNV3 and FTN, and on to DDBID, GO, EC and SP.]

RDF - Excel conversion

[Diagram: spreadsheet data converted to subject, predicate, object triples; labels shown: Genome, mgla:experiment, Pval, Pval-1.]
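A flat spreadsheet-to-RDF conversion of this kind can be sketched as follows: every row becomes a subject, every column a predicate, and every cell a literal object. The column names, the row identifier scheme and the `http://example.org/mgla#` namespace are assumptions made for the illustration; the actual spreadsheet layout is not given in the slides.

```python
import csv
import io

MGLA = "http://example.org/mgla#"  # assumed namespace, for illustration only


def row_to_triples(row_id, row):
    """Turn one spreadsheet row into N-Triples: one triple per non-empty cell."""
    subject = f"<{MGLA}{row_id}>"
    return [
        f'{subject} <{MGLA}{column}> "{value}" .'
        for column, value in row.items()
        if value != ""
    ]


# Hypothetical two-line sheet: a header and one analysis row
sheet = io.StringIO("poson,sequence,wildtype01_wc_01\nPSN_A,PEPTIDEK,2594\n")
rows = list(csv.DictReader(sheet))
flat = [
    triple
    for i, row in enumerate(rows, start=1)
    for triple in row_to_triples(f"analysis_{i}", row)
]
print("\n".join(flat))
```

In this flat form the experiment replicate exists only as a column name turned into a predicate, so it cannot itself be described or queried as a resource.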

Data integration: Reconciled identifiers

• (WashU-B) PSNV1
• (WashU-B) PSNV2
• (WashU-B) PSNV3
• (COGs) COGID
• (Gene Ontology) GOID
• (Fn ORF ID) FTN
• (WashU-P) DDB
• (Refseq) ACNo
• (Uniprot) ACNo
• (ENZYME) ECNo
• (IMG) GENEID
• (NCBI) PROTEINID
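Reconciling identifiers through rdfs:seeAlso means that a query walks a chain of links, for example from a PSN through PSNV2 and PSNV3 to an FTN locus tag. A small Python sketch of that traversal, using an in-memory dictionary in place of the triple store; the link table and the PSN identifiers are hypothetical.

```python
def follow_see_also(see_also, start, max_hops=10):
    """Return every identifier reachable from start by following seeAlso links."""
    seen = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for target in see_also.get(node, ()):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        if not next_frontier:
            break  # nothing new was reached
        frontier = next_frontier
    seen.discard(start)
    return seen


# Hypothetical seeAlso links between identifier versions
links = {
    "PSN_1": ["PSNV2_1"],
    "PSNV2_1": ["PSNV3_1"],
    "PSNV3_1": ["FTN_0209"],
}
reachable = follow_see_also(links, "PSN_1")
print(sorted(reachable))
```

The hop limit is a design choice for the sketch; it simply guards against cycles in the link table.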

Data Integration: Adding new experiments

[Diagram: Experiment 1, Experiment 2, Experiment 3 and Experiment 4 and public domain data all attach to the same reconciled identifiers: PSN, PSNV2, PSNV3 and FTN linked by rdfs:seeAlso, and on to DDBID, GO, EC and AC No.]

Data integration: Sesame

    NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
    Connected to default data directory
    Commands end with '.' at the end of a line
    Type 'help.' for help
    > connect http://127.0.0.1:8080/openrdf-sesame.
    Disconnecting from default data directory
    Connected to http://127.0.0.1:8080/openrdf-sesame
    > show r.
    +----------
    |SYSTEM (System configuration repository)
    |ftnRepoNative (Francisella Test)
    |FrancisellaNative (FrancisellaTestStore)
    |FrancisellaReified (Native store with RDF Schema inferencing)
    |FrancisellaReified_index2 (Native store with RDF Schema inferencing)
    |Francisella (Native store with RDF Schema inferencing)
    +----------
    > open FrancisellaReified_index2.
    Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc,posc)

Data file                               time (s)   triples
francisella_locus_tag.nt                8.93       1767
interact-prot.nt                        88.51      20682
interact-prot-peptides.nt                          248647
mgla search dbfastablastp4 ypURL.n3     9.7        1719
NC_008601.nt                            43.14      12781
Ft_novicidaU112go.nt                    359.14     2548
francisellardf2.nt                      43.41      10434
francisellaSUPERFAMILY.nt               57.88      16110
francisellaPROTEINfasta.nt              13.63      5160
Soluble.nt                              588.87     336761
WholeCell.nt                            469.02     112625
Membranes.nt                            1003.19    298771
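The triple counts in this table can be set against those in the reified load table to see how much reification inflates each experimental data set. The short Python sketch below does that arithmetic. It assumes that Soluble.nt, WholeCell.nt and Membranes.nt correspond to SolubleReifeid_3.rdf, WholeCellReified_3.rdf and Membranes_3.rdf respectively, which the slides suggest but do not state.

```python
# Triple counts from the two load tables (pairing of files is an assumption)
plain = {"Soluble": 336761, "WholeCell": 112625, "Membranes": 298771}
reified = {"Soluble": 580873, "WholeCell": 184221, "Membranes": 416086}

# Ratio of reified to plain triple counts per data set
growth = {name: reified[name] / plain[name] for name in plain}

for name, factor in growth.items():
    extra = reified[name] - plain[name]
    print(f"{name}: {factor:.2f}x ({extra} additional triples)")
```

Every data set grows under reification, though by different factors.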

Data Integration: Mgla data (ftnRepoNative)

[Diagram: an identified peptide analysis with its peptide sequence, experiment and abundance; mgla:poson links it to a PSN, and rdfs:seeAlso links PSN, PSNV2, PSNV3 and FTN, and on to DDBID, GO, EC and SP.]

    SELECT psn, ftn, ec
    FROM {ftn} rdfs:seeAlso {ec},
         {psn} rdfs:seeAlso {ftn},
         {analysis} mgla:poson {psn}
    WHERE ec LIKE "*[EC*"
    USING NAMESPACE
         mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

    SELECT abundance, psn, ec, ftn
    FROM {ftn} rdfs:seeAlso {ec},
         {psn} rdfs:seeAlso {ftn},
         {analysis} mgla:poson {psn},
         {analysis} mgla:experiment {abundance}
    WHERE ec LIKE "*[EC*"
    USING NAMESPACE
         mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>
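Besides the console, SeRQL queries like these can be sent to a Sesame server over its HTTP interface, where the `queryLn` parameter selects the query language. The Python sketch below only builds such a request and does not send it. The helper name, the repository choice and the `http://example.org/mgla#` namespace in the sample query are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_serql_request(server, repository, query):
    """Build (but do not send) an HTTP GET request for a SeRQL query."""
    params = urlencode({"query": query, "queryLn": "serql"})
    url = f"{server}/repositories/{repository}?{params}"
    # Ask for tuple results in the SPARQL JSON results format
    return Request(url, headers={"Accept": "application/sparql-results+json"})


query = (
    "SELECT psn FROM {analysis} mgla:poson {psn} "
    "USING NAMESPACE mgla = <http://example.org/mgla#>"
)
request = build_serql_request(
    "http://127.0.0.1:8080/openrdf-sesame", "ftnRepoNative", query
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it against a running server.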

Data Integration: Mgla data (ftnRepoNative)

[Diagram: the same model with its predicates labelled: the analysis node (rdf:about) carries mgla:poson (PSN), mgla:sequence (peptide sequence) and mgla:experiment (abundance); rdfs:seeAlso links PSN, PSNV2, PSNV3 and FTN, and on to DDBID, GO, EC and SP.]

Really easy. But:

• A simple Excel to RDF conversion does not enable all queries

• Not a simple conversion: the data needs to be “modelled”



Page 16: Genome and Proteome data integration in RDF

Francisella tularensis novicida U112

Whole Cell(3)

Soluble(3)

Membrane(3)

Whole Cell(3)

Soluble(3)

Membrane(3)

WildType MglA mutant

(4) (4) (4) (4) (4) (4)

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Sequest DRAGONSequest DRAGON

Relative AbundanceIdentification

Two-sided t-test

P val lt001

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

RDF - excel conversion

Genome

mglaexperiment

subject

object

predicate

Pval

Pval-1

Data integration Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2(COGs) COGID

(Gene Ontology) GOID

(WashU-B) PSNV3

(Fn ORF ID) FTN

(WashU-P) DDB

(Refseq) ACNo

(Uniprot) ACNo(ENZYME) ECNo

(IMG) GENEID(NCBI) PROTEINID

Data IntegrationAdding new experiments

Experiment 1

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECAC No

Experiment 4

Experiment 2

Experiment 3

Public domain data

NadiaAnwar~ nadia$ openrdf-sesame-21binconsolesh Connected to default data directory

Commands end with at the end of a lineType help for helpgt connect http1270018080openrdf-sesameDisconnecting from default data directoryConnected to http1270018080openrdf-sesamegt show r+----------|SYSTEM (System configuration repository)|ftnRepoNative (Francisella Test)|FrancisellaNative (FrancisellaTestStore)|FrancisellaReified (Native store with RDF Schema inferencing)|FrancisellaReified_index2 (Native store with RDF Schema inferencing)|Francisella (Native store with RDF Schema inferencing)+----------gt open FrancisellaReified_index2Opened repository FrancisellaReified_index2

Data integration Sesame

SesameData load (ftnRepoNative) - native (spocposc)

Data File time (s) triples

francisella_locus_tagnt 893 1767

interact-protnt 8851 20682

interact-prot-peptidesnt 248647

mgla search dbfastablastp4 ypURLn3 97 1719

NC_008601nt 4314 12781

Ft_novicidaU112gont 35914 2548

francisellardf2nt 4341 10434

francisellaSUPERFAMILYnt 5788 16110

francisellaPROTEINfastant 1363 5160

Solublent 58887 336761

WholeCellnt 46902 112625

Membranesnt 100319 298771

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

Experiment

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

SELECT psn ftn ec FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

SELECT abundance psn ec ftn FROM ftn rdfsseeAlso ec psn rdfsseeAlso ftn analysis mglaposon psnanalysis mglaexperiment abundanceWHERE ec LIKE ldquo[ECrdquoUSING NAMESPACEmgla =lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagt

Data IntegrationMgla data (ftnRepoNative)

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

DDBID

rdfsseeAlso

GO ECSP

mglasequencemglaexperiment

rdfabout

Really easy Butbull Simple excel to RDF conversion does not enable all queries

bull Not a simple conversion - Data needs to be ldquomodelledrdquo

Identified Peptide

analysis

Peptide sequence

mglaposon

abundance PSNmglasequence

mglaexperiment

rdfabout

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

Data IntegrationReified statements

Identified Peptideanalysis

Peptide sequence

Experiment Replicate

rdftype

mglaposon

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

analysis datardfStatement

analysis data

InExperimentReplicate

rdfobject

rdftype

rdfsubject

rdfpredicateabundance

mglaPeptideAbundance

DDBID

rdfsseeAlso

GO ECSP

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance > 20000)

sol INTERSECT mem

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "sol"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "mem"
using namespace

sol MINUS mem

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "sol"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "mem"
using namespace

mem MINUS sol

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "mem"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "sol"
using namespace

Venn diagram counts: 171, 146, 185

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance < 5000)

sol INTERSECT mem

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "sol"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "mem"
using namespace

sol MINUS mem

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "sol"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "mem"
using namespace

mem MINUS sol

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "mem"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "sol"
using namespace

Venn diagram counts: 219, 125, 245
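The INTERSECT and MINUS comparisons between the soluble and membrane fractions behave like ordinary set operations on the posons returned for each fraction. The sketch below shows that idea on a few made-up rows. The poson names, replicate names, and the helper `posons_above` are hypothetical and stand in for query results; this is not the authors' code.

```python
def posons_above(rows, fraction, threshold):
    """Distinct posons with a peptide abundance above threshold in a fraction."""
    return {
        psn
        for psn, experiment, abundance in rows
        if fraction in experiment and abundance > threshold
    }


# Hypothetical (poson, experiment replicate, abundance) rows.
rows = [
    ("PSN_A", "wildtype01_sol_01", 25000),
    ("PSN_A", "wildtype01_mem_01", 30000),
    ("PSN_B", "wildtype01_sol_01", 41000),
    ("PSN_C", "wildtype01_mem_01", 22000),
    ("PSN_D", "wildtype01_mem_01", 1500),
]

sol = posons_above(rows, "sol", 20000)
mem = posons_above(rows, "mem", 20000)

both = sol & mem       # corresponds to sol INTERSECT mem
sol_only = sol - mem   # corresponds to sol MINUS mem
mem_only = mem - sol   # corresponds to mem MINUS sol

print(sorted(both), sorted(sol_only), sorted(mem_only))
```

The three resulting sets correspond to the three regions of the Venn diagram on the slide.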

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding a transcriptomic experiment

WT vs mglA mutant

GEO accession: GSE5468

• RDF-S inferencing

Acknowledgements

• Funding: BBSRC - Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow

• Prof Walter Kolch

• Dr Andy Pitt

• University of Strathclyde, Glasgow

• Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle

• Prof Dave Goodlett (Scientific Advisor)

• Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer

• Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
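Without aggregate functions in the query language, one workaround is to fetch the abundance values and compute the median on the client. The sketch below is one possible way to do that in Python. The function name `split_by_median` and the sample values are illustrative assumptions, and the inclusive bounds mirror the BETWEEN clauses in the bullets above.

```python
from statistics import median


def split_by_median(abundances):
    """Split abundance values into a low and a high group around the median.

    Both groups include the median itself, as an inclusive BETWEEN would.
    """
    m = median(abundances)
    low = [a for a in abundances if a <= m]    # MIN .. MEDIAN
    high = [a for a in abundances if a >= m]   # MEDIAN .. MAX
    return m, low, high


# Hypothetical abundance values as they might come back from a query.
values = [2594, 100, 50000, 20000, 7000]
m, low, high = split_by_median(values)
print(m, low, high)
```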


RDF - Excel conversion

[Figure: graph of the converted mgla spreadsheet. Labels: Identified Peptide (analysis), Peptide sequence, mgla:poson, mgla:experiment, abundance, Pval, Pval-1, subject, predicate, object; PSN, PSNV2, PSNV3 and FTN joined by rdfs:seeAlso; Genome annotations DDBID, GO, EC, SP.]

Data integration: Reconciled Identifiers

(WashU-B) PSNV1

(WashU-B) PSNV2

(COGs) COGID

(Gene Ontology) GOID

(WashU-B) PSNV3

(Fn ORF ID) FTN

(WashU-P) DDB

(RefSeq) ACNo

(UniProt) ACNo

(ENZYME) ECNo

(IMG) GENEID

(NCBI) PROTEINID
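Identifier reconciliation of this kind can be pictured as following rdfs:seeAlso links from one identifier scheme to the next until the wanted scheme is reached. The sketch below assumes, for simplicity, that each identifier has a single rdfs:seeAlso target. The PSN identifiers are made up and `resolve` is a hypothetical helper, not part of the original system.

```python
def resolve(identifier, see_also, target_prefix):
    """Follow seeAlso links until an identifier with target_prefix is found."""
    seen = set()
    current = identifier
    while current is not None and current not in seen:
        if current.startswith(target_prefix):
            return current
        seen.add(current)                 # guard against cycles
        current = see_also.get(current)   # None when the chain ends
    return None


# Hypothetical chain: PSN -> PSNV2 -> PSNV3 -> FTN locus tag.
see_also = {
    "PSN_0001": "PSNV2_0001",
    "PSNV2_0001": "PSNV3_0001",
    "PSNV3_0001": "FTN_0209",
}

print(resolve("PSN_0001", see_also, "FTN"))
```

In a real store an identifier can carry several rdfs:seeAlso links, so a full version would search a graph instead of walking a single chain.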

Data Integration: Adding new experiments

[Figure: Experiment 1, Experiment 2, Experiment 3 and Experiment 4 attached to the shared identifiers PSN, PSNV2, PSNV3 and FTN (joined by rdfs:seeAlso), which link to public domain data: DDBID, GO, EC, AC No.]

Data integration: Sesame

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository FrancisellaReified_index2

Sesame: Data load (ftnRepoNative) - native (spoc,posc)

Data File                              time (s)   triples
francisella_locus_tag.nt                    893      1767
interact-prot.nt                           8851     20682
interact-prot-peptides.nt                          248647
mgla search dbfastablastp4 ypURL.n3          97      1719
NC_008601.nt                               4314     12781
Ft_novicidaU112go.nt                      35914      2548
francisellardf2.nt                         4341     10434
francisellaSUPERFAMILY.nt                  5788     16110
francisellaPROTEINfasta.nt                 1363      5160
Soluble.nt                                58887    336761
WholeCell.nt                              46902    112625
Membranes.nt                             100319    298771

Data Integration: Mgla data (ftnRepoNative)

[Figure: graph of the mgla data. Labels: Identified Peptide (analysis), Peptide sequence, Experiment, mgla:poson, abundance; PSN, PSNV2, PSNV3 and FTN joined by rdfs:seeAlso; DDBID, GO, EC, SP.]

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "[EC"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "[EC"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>
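To make the join in the first query concrete, the sketch below evaluates the same three path expressions over a small in-memory list of triples. It is only an illustration of the matching logic, not how Sesame evaluates SeRQL. The identifiers, the EC value, and the function `ec_annotated_posons` are hypothetical, and the LIKE filter is approximated by a substring test.

```python
SEE_ALSO = "rdfs:seeAlso"
POSON = "mgla:poson"


def ec_annotated_posons(triples, pattern="[EC"):
    """Return (psn, ftn, ec) rows for posons seen in an analysis with an EC link."""
    see_also = [(s, o) for s, p, o in triples if p == SEE_ALSO]
    analysed = {o for s, p, o in triples if p == POSON}  # {analysis} mgla:poson {psn}
    rows = []
    for psn, ftn in see_also:                # {psn} rdfs:seeAlso {ftn}
        if psn not in analysed:
            continue
        for subject, ec in see_also:         # {ftn} rdfs:seeAlso {ec}
            if subject == ftn and pattern in ec:
                rows.append((psn, ftn, ec))
    return rows


# Hypothetical triples.
triples = [
    ("analysis1", "mgla:poson", "PSN_0001"),
    ("PSN_0001", "rdfs:seeAlso", "FTN_0209"),
    ("FTN_0209", "rdfs:seeAlso", "[EC:1.1.1.1]"),
    ("FTN_0209", "rdfs:seeAlso", "GO:0008152"),
]

rows = ec_annotated_posons(triples)
print(rows)
```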

Data Integration: Mgla data (ftnRepoNative)

[Figure: graph of the mgla data with predicates shown. Labels: Identified Peptide (analysis, rdf:about), Peptide sequence (mgla:sequence), abundance (mgla:experiment), mgla:poson; PSN, PSNV2, PSNV3 and FTN joined by rdfs:seeAlso; DDBID, GO, EC, SP.]

Really easy. But:

• Simple Excel to RDF conversion does not enable all queries

• Not a simple conversion - data needs to be "modelled"

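[Figure: the simple conversion, with labels Identified Peptide (analysis, rdf:about), Peptide sequence (mgla:sequence), PSN (mgla:poson), abundance (mgla:experiment), shown next to the modelled form: Peptide Sequence, identifiedIn, Experiment Replicate, hasAbundance, abundance.]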
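A simple spreadsheet to RDF conversion of the kind described above can be sketched as one subject per row and one predicate per column. The column names, the `analysisN` subject naming, and the function `rows_to_triples` are assumptions made for illustration. The original conversion script is not shown in the slides.

```python
import csv
import io


def rows_to_triples(csv_text):
    """Naive conversion: every cell becomes (row subject, column predicate, value)."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        subject = f"analysis{i}"
        for column, value in row.items():
            triples.append((subject, f"mgla:{column}", value))
    return triples


# Hypothetical spreadsheet export with one identified peptide.
csv_text = "sequence,poson,experiment\nPEPTIDEK,PSN_0001,2594\n"
naive = rows_to_triples(csv_text)
for triple in naive:
    print(triple)
```

In this naive form the abundance is just a literal hanging off the analysis node. One reading of the slide is that this is why the data had to be modelled, for example with reified statements, before questions about experiment replicates could be asked.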

Data Integration: Reified statements

[Figure: reified model. Labels: Identified Peptide (analysis), Peptide sequence, Experiment Replicate, InExperimentReplicate, mgla:poson; analysis data node with rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object and mgla:PeptideAbundance (abundance); PSN, PSNV2, PSNV3 and FTN joined by rdfs:seeAlso; DDBID, GO, EC, SP.]

Sesame: Reified data load - native-RDFS (spoc,posc,posc)

Data File                          time (s)   time (mins)   triples
FnU112Version3.nt                     38344            63     58474
PosonMappings.nt                       8456            14     13760
francisella_locus_tag.nt               1673            03      1767
ConstructHasGeneID.nt                  2300            04      1719
interact-prot.nt                      12495            21     20682
interact-prot-pepteides.nt           112797           187    248647
interact-protSeeAlsoisbURL.nt          1067            02      1528
goAnnotation_URLID.nt                  7414            12     20501
NC_008601.nt                           7584            13     12781
Membranes_CogNumberURL.nt               860            01      2548
Ft_novicida_U112_go.nt                56138            93      2548
francisellardf2.nt                     4619            08     10602
francisellaSUPERFAMILY.nt              6667            11     16110
francisellaPROTEINfasta.nt             1527            03      5160
SolubleReifeid_3.rdf                 139298           232    580873
WholeCellReified_3.rdf                94116           156    184221
Membranes_3.rdf                      102666         17111    416086
fnU112_draftRDFschemaV4.nt         21501098         35835       501

Page 20: Genome and Proteome data integration in RDF

NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory

Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame
> show r.
+----------
|SYSTEM (System configuration repository)
|ftnRepoNative (Francisella Test)
|FrancisellaNative (FrancisellaTestStore)
|FrancisellaReified (Native store with RDF Schema inferencing)
|FrancisellaReified_index2 (Native store with RDF Schema inferencing)
|Francisella (Native store with RDF Schema inferencing)
+----------
> open FrancisellaReified_index2.
Opened repository 'FrancisellaReified_index2'

Data integration Sesame

Sesame: Data load (ftnRepoNative) - native (spoc, posc)

Data File                               time (s)   triples
francisella_locus_tag.nt                    8.93      1767
interact-prot.nt                           88.51     20682
interact-prot-peptides.nt                           248647
mgla search dbfastablastp4 ypURL.n3          9.7      1719
NC_008601.nt                               43.14     12781
Ft_novicidaU112go.nt                      359.14      2548
francisellardf2.nt                         43.41     10434
francisellaSUPERFAMILY.nt                  57.88     16110
francisellaPROTEINfasta.nt                 13.63      5160
Soluble.nt                                588.87    336761
WholeCell.nt                              469.02    112625
Membranes.nt                             1003.19    298771

Data Integration: Mgla data (ftnRepoNative)

[Figure: RDF graph. Nodes: Identified Peptide (analysis), Peptide sequence, Experiment, abundance, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP. Edge labels: mgla:poson, rdfs:seeAlso.]

SELECT psn, ftn, ec
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn}
WHERE ec LIKE "[EC"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>

SELECT abundance, psn, ec, ftn
FROM {ftn} rdfs:seeAlso {ec},
     {psn} rdfs:seeAlso {ftn},
     {analysis} mgla:poson {psn},
     {analysis} mgla:experiment {abundance}
WHERE ec LIKE "[EC"
USING NAMESPACE
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>
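The join expressed by these queries can be sketched in Python over a small in-memory triple list. This is only an illustration: the identifiers (`analysis1`, `PSN1`, `FTN_0001`, the EC and GO strings) are hypothetical, the prefixed names are plain strings rather than real URIs, and the `LIKE` filter is approximated by a substring test.

```python
# Hypothetical triples for illustration; not taken from the Francisella repository.
SEE_ALSO = "rdfs:seeAlso"
POSON = "mgla:poson"

triples = [
    ("analysis1", POSON, "PSN1"),
    ("PSN1", SEE_ALSO, "FTN_0001"),
    ("FTN_0001", SEE_ALSO, "[EC:1.1.1.1]"),
    ("analysis2", POSON, "PSN2"),
    ("PSN2", SEE_ALSO, "FTN_0002"),
    ("FTN_0002", SEE_ALSO, "GO:0000001"),
]

def posons_with_ec(triples):
    """Follow analysis -> poson -> FTN -> annotation and keep EC annotations."""
    # Index rdfs:seeAlso edges by subject so each hop is a dictionary lookup.
    see_also = {}
    for s, p, o in triples:
        if p == SEE_ALSO:
            see_also.setdefault(s, []).append(o)

    rows = []
    for _analysis, p, psn in triples:
        if p != POSON:
            continue
        for ftn in see_also.get(psn, []):
            for ec in see_also.get(ftn, []):
                if "[EC" in ec:  # assumed stand-in for the LIKE filter
                    rows.append((psn, ftn, ec))
    return rows

rows = posons_with_ec(triples)
print(rows)
```

Each path expression in the FROM clause becomes one hop in the nested loops; the WHERE clause becomes the final filter.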

Data Integration: Mgla data (ftnRepoNative)

[Figure: the same RDF graph with the peptide edges labelled. Nodes: Identified Peptide (analysis), Peptide sequence, abundance, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP. Edge labels: rdf:about, mgla:sequence, mgla:experiment, mgla:poson, rdfs:seeAlso.]

Really easy, but
• Simple Excel to RDF conversion does not enable all queries
• Not a simple conversion - data needs to be “modelled”
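A minimal Python sketch of why a row-by-row spreadsheet conversion is not enough. The namespace, column names and rows are hypothetical. When the replicate and the abundance are both attached directly to the peptide, a peptide observed in two replicates can no longer be matched to the correct abundance, so a query derives combinations that were never measured.

```python
# Hypothetical namespace and spreadsheet rows, for illustration only.
EX = "http://example.org/mgla#"

rows = [
    {"peptide": "pep1", "replicate": "sol_01", "abundance": 2594},
    {"peptide": "pep1", "replicate": "mem_01", "abundance": 120000},
]

def naive_triples(rows):
    """One subject per peptide: replicate and abundance hang off the peptide."""
    triples = set()
    for r in rows:
        s = EX + r["peptide"]
        triples.add((s, EX + "experiment", EX + r["replicate"]))
        triples.add((s, EX + "abundance", r["abundance"]))
    return triples

def abundance_pairs(triples):
    """Every (replicate, abundance) pair a query can derive for a peptide."""
    reps = [(s, o) for s, p, o in triples if p == EX + "experiment"]
    abds = [(s, o) for s, p, o in triples if p == EX + "abundance"]
    return {(rep, abd) for s1, rep in reps for s2, abd in abds if s1 == s2}

pairs = abundance_pairs(naive_triples(rows))
print(len(pairs))  # more pairs than the two measurements in the spreadsheet
```

Two measurements go in, but four replicate/abundance pairs come out, which is the motivation for modelling the identification itself as a resource that carries the abundance.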

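[Figure: two graphs. Spreadsheet-style graph: Identified Peptide (analysis) with edge labels rdf:about, mgla:sequence (Peptide sequence), mgla:experiment, mgla:poson (PSN) and abundance. Intended model: Peptide Sequence, Experiment Replicate and abundance, with edge labels identifiedIn and hasAbundance.]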

Data Integration: Reified statements

[Figure: RDF graph of the reified model. Nodes: Identified Peptide (analysis), Peptide sequence, Experiment Replicate, analysis data (rdf:Statement), abundance, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP. Edge labels: rdf:type, rdf:subject, rdf:predicate (InExperimentReplicate), rdf:object, mgla:PeptideAbundance, mgla:poson, rdfs:seeAlso.]

Sesame: Reified Data load - native-RDFS (spoc, posc, posc)

Data File                          time (s)   time (mins)   triples
FnU112Version3.nt                    383.44           6.3     58474
PosonMappings.nt                      84.56           1.4     13760
francisella_locus_tag.nt              16.73           0.3      1767
ConstructHasGeneID.nt                 23.00           0.4      1719
interact-prot.nt                     124.95           2.1     20682
interact-prot-pepteides.nt          1127.97          18.7    248647
interact-protSeeAlsoisbURL.nt         10.67           0.2      1528
goAnnotation_URLID.nt                 74.14           1.2     20501
NC_008601.nt                          75.84           1.3     12781
Membranes_CogNumberURL.nt              8.60           0.1      2548
Ft_novicida_U112_go.nt               561.38           9.3      2548
francisellardf2.nt                    46.19           0.8     10602
francisellaSUPERFAMILY.nt             66.67           1.1     16110
francisellaPROTEINfasta.nt            15.27           0.3      5160
SolubleReifeid_3.rdf                1392.98          23.2    580873
WholeCellReified_3.rdf               941.16          15.6    184221
Membranes_3.rdf                     1026.66        17.111    416086
fnU112_draftRDFschemaV4.nt        21501.098        358.35       501

select ftn, psn, exp, abundance
from {psn} rdfs:seeAlso {psnv2},
     {psnv2} rdfs:seeAlso {psnv3},
     {psnv3} rdfs:seeAlso {ftn},
     {analysis} fnu112:poson {psn},
     {analysis} rdf:type {rdf:Statement},
     {analysis} rdf:object {exp},
     {analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
  and ftn LIKE "FTN"
using namespace
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>,
     fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries: which posons have the most highly abundant peptides?

Queries: which experiments have the most highly abundant peptides?

Reified statements
• Reified mgla data are much bigger (4 more statements per abundance)
• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)
• Haven't yet tested the shortcut path expression

reifSubj   reifPred       reifObj   pred           obj
seq        identifiedIn   ExpRep    hasAbundance   abd

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .
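The reification pattern above (one rdf:Statement resource per identification, with the abundance attached to that statement) can be generated mechanically. The following Python sketch is an illustration under stated assumptions: the `MGLA` namespace is a hypothetical stand-in, the statement identifier is given a full URI rather than a relative one, and the abundance is written as a plain literal.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MGLA = "http://example.org/mgla#"  # hypothetical stand-in namespace

def reify(stmt_id, peptide, replicate, abundance):
    """Return N-Triples lines for one reified identification plus its abundance."""
    stmt = f"<{MGLA}{stmt_id}>"
    return [
        # Four statements describe the reified triple itself ...
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{MGLA}{peptide}> .",
        f"{stmt} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{stmt} <{RDF}object> <{MGLA}{replicate}> .",
        # ... and one attaches the abundance to that statement.
        f'{stmt} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]

lines = reify("WholeCell_Lvl7_0212", "WholeCell_Lvl7_021", "wildtype01_wc_01", 2594)
print("\n".join(lines))
```

Every abundance value costs five triples in this layout, which is consistent with the growth in triple counts between the plain and reified data loads.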

[Figure: Peptide Sequence identifiedIn Experiment Replicate; the statement hasAbundance abundance.]

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (>20000)

[Figure: Venn diagram of the sol and mem fractions with regions sol MINUS mem, INTERSECT and mem MINUS sol. Counts shown: 171, 146, 185.]

INTERSECT:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "sol"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "mem"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "sol"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "mem"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "mem"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
  and experiment LIKE "sol"
using namespace

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (<5000)

[Figure: Venn diagram of the sol and mem fractions with regions sol MINUS mem, INTERSECT and mem MINUS sol. Counts shown: 219, 125, 245.]

INTERSECT:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "sol"
INTERSECT
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "mem"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "sol"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "mem"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "mem"
MINUS
select distinct psn
from {x} fns:poson {psn},
     {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x},
     {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
  and experiment LIKE "sol"
using namespace
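The INTERSECT and MINUS combinations used in these comparisons are ordinary set operations on the poson identifiers returned for each fraction. A minimal Python sketch, with hypothetical poson identifiers standing in for real query results:

```python
def compare_fractions(sol, mem):
    """Split poson identifiers into shared and fraction-specific sets."""
    sol, mem = set(sol), set(mem)
    return {
        "INTERSECT": sol & mem,      # posons found in both fractions
        "sol MINUS mem": sol - mem,  # posons found only in the soluble fraction
        "mem MINUS sol": mem - sol,  # posons found only in the membrane fraction
    }

# Hypothetical identifiers, for illustration only.
result = compare_fractions(["PSN1", "PSN2", "PSN3"], ["PSN2", "PSN3", "PSN4"])
for region, posons in result.items():
    print(region, len(posons))
```

The three region sizes correspond to the three counts of a two-set Venn diagram.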

Further work
• Queries are slow in the native repository; database repositories are probably faster
• Adding a transcriptomic experiment
  • Wt vs mglA mutant
  • GEO AC: GSE5468
• RDF-S inferencing

Acknowledgements
• Funding: BBSRC - Radical Solutions for Researching the Proteome
• University of Glasgow, Glasgow
  • Prof Walter Kolch
  • Dr Andy Pitt
• University of Strathclyde, Glasgow
  • Dr Ela Hunt (Scientific Advisor)
• University of Washington, Seattle
  • Prof Dave Goodlett (Scientific Advisor)
  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
  • Dr Tina Guina (MglA experiment)

Abundance thresholds
• SeRQL aggregate functions would be nice to have
• Queries to find low and high abundance values
  • WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
  • WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
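Without aggregate functions in the query language, one workaround is to fetch the abundance values and compute the thresholds on the client. The Python sketch below assumes the values have already been retrieved as integers; the sample numbers are hypothetical. Both ranges are inclusive, mirroring BETWEEN, so the median value falls in both.

```python
from statistics import median

def split_by_median(abundances):
    """Return (low, high, (min, median, max)) using inclusive BETWEEN ranges."""
    values = sorted(abundances)
    lo, mid, hi = values[0], median(values), values[-1]
    low = [v for v in values if lo <= v <= mid]    # MIN .. MEDIAN
    high = [v for v in values if mid <= v <= hi]   # MEDIAN .. MAX
    return low, high, (lo, mid, hi)

# Hypothetical abundance values, for illustration only.
low, high, thresholds = split_by_median([2594, 150000, 5000, 100000, 20000])
print(thresholds)
print(low)
print(high)
```

The computed minimum, median and maximum could then be substituted as literal bounds into the WHERE clause of a follow-up query.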

Page 24: Genome and Proteome data integration in RDF

Really easy, but
• Simple Excel-to-RDF conversion does not enable all queries
• Not a simple conversion: the data needs to be "modelled"

[Figure: data model. Node labels: Identified Peptide, analysis, Peptide sequence, Experiment Replicate, abundance, PSN. Property labels: mgla:poson, mgla:sequence, mgla:experiment, rdf:about, identifiedIn, hasAbundance.]

Data Integration: Reified statements

[Figure: reified data model. Node labels: Identified Peptide, analysis, Peptide sequence, Experiment Replicate, analysis data, abundance, PSN, PSNV2, PSNV3, FTN, DDBID, GO, EC, SP. Property labels: rdf:type, rdf:Statement, rdf:subject, rdf:predicate, rdf:object, rdfs:seeAlso, mgla:poson, mgla:PeptideAbundance, InExperimentReplicate.]

Sesame: reified data load - native-RDFS (spoc,posc,posc)

Data file                          time (s)   time (mins)   triples
FnU112Version3.nt                    383.44           6.3     58474
PosonMappings.nt                      84.56           1.4     13760
francisella_locus_tag.nt              16.73           0.3      1767
ConstructHasGeneID.nt                 23.00           0.4      1719
interact-prot.nt                     124.95           2.1     20682
interact-prot-pepteides.nt          1127.97          18.7    248647
interact-protSeeAlsoisbURL.nt         10.67           0.2      1528
goAnnotation_URLID.nt                 74.14           1.2     20501
NC_008601.nt                          75.84           1.3     12781
Membranes_CogNumberURL.nt              8.60           0.1      2548
Ft_novicida_U112_go.nt               561.38           9.3      2548
francisellardf2.nt                    46.19           0.8     10602
francisellaSUPERFAMILY.nt             66.67           1.1     16110
francisellaPROTEINfasta.nt            15.27           0.3      5160
SolubleReifeid_3.rdf                1392.98          23.2    580873
WholeCellReified_3.rdf               941.16          15.6    184221
Membranes_3.rdf                     1026.66        17.111    416086
fnU112_draftRDFschemaV4.nt        215010.98        3583.5       501

Queries: which posons have the most highly abundant peptides?

select ftn, psn, exp, abundance
from {psn} rdfs:seeAlso {psnv2},
     {psnv2} rdfs:seeAlso {psnv3},
     {psnv3} rdfs:seeAlso {ftn},
     {analysis} fnu112:poson {psn},
     {analysis} rdf:type {rdf:Statement},
     {analysis} rdf:object {exp},
     {analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
  and ftn LIKE "*FTN*"
using namespace
     mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#>,
     fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries: which experiments have the most highly abundant peptides?

Reified statements
• Reified mgla data are much bigger (4 more statements per abundance)
• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)
• Haven't yet tested the shortcut path expression:

  { {reifSubj} reifPred {reifObj} } pred {obj}
  { {seq} identifiedIn {ExpRep} } hasAbundance {abd}

<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#WholeCell_Lvl7_021> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#InExperimentReplicate> .
<WholeCell_Lvl7_0212> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#wildtype01_wc_01> .
<WholeCell_Lvl7_0212> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla#PeptideAbundance> "2594" .

[Figure: Peptide Sequence, identifiedIn, Experiment Replicate, hasAbundance, abundance]
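The reified pattern above can be sketched in a few lines of Python. This is only an illustration: the `MGLA` namespace string and the helper name `reify_abundance` are assumptions made for the example, and only the `rdf:` and `xsd:` URIs are standard.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical namespace, used here only so the example is self-contained.
MGLA = "http://example.org/mgla#"


def reify_abundance(statement_id, peptide_id, replicate_id, abundance):
    """Return the N-Triples lines for one reified statement
    'peptide InExperimentReplicate replicate', annotated with its abundance."""
    stmt = f"<{MGLA}{statement_id}>"
    return [
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{MGLA}{peptide_id}> .",
        f"{stmt} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{stmt} <{RDF}object> <{MGLA}{replicate_id}> .",
        f'{stmt} <{MGLA}PeptideAbundance> "{abundance}"^^<{XSD}integer> .',
    ]


if __name__ == "__main__":
    for line in reify_abundance(
        "WholeCell_Lvl7_0212", "WholeCell_Lvl7_021", "wildtype01_wc_01", 2594
    ):
        print(line)
```

Each measurement yields five triples instead of the single triple a direct abundance property would need, which is where the extra four statements per abundance come from.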

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (> 20000)

sol INTERSECT mem:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
using namespace

[Figure: Venn diagram of the sol and mem fractions; counts shown: 171 146185]

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (< 5000)

sol INTERSECT mem:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
INTERSECT
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
using namespace

sol MINUS mem:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
using namespace

mem MINUS sol:

select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
MINUS
select distinct psn
from {x} fns:poson {psn}, {x} fn:InExperimentReplicate {experiment},
     {analysis} rdf:subject {x}, {analysis} rdf:object {exp},
     {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
using namespace

[Figure: Venn diagram of the sol and mem fractions; counts shown: 219 125245]
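The INTERSECT and MINUS comparisons amount to set operations on the posons selected for each fraction. A minimal Python sketch of that logic follows; the records and the names `posons_in_fraction` and `compare_fractions` are hypothetical and are not data or code from the talk.

```python
def posons_in_fraction(records, fraction, keep):
    """Distinct posons with at least one peptide in `fraction`
    whose abundance passes the `keep` predicate."""
    return {
        poson
        for poson, experiment, abundance in records
        if fraction in experiment and keep(abundance)
    }


def compare_fractions(records, keep):
    """Overlapping and distinct posons between the sol and mem fractions."""
    sol = posons_in_fraction(records, "sol", keep)
    mem = posons_in_fraction(records, "mem", keep)
    return {
        "sol INTERSECT mem": sol & mem,
        "sol MINUS mem": sol - mem,
        "mem MINUS sol": mem - sol,
    }


# Hypothetical (poson, experiment replicate, abundance) rows for illustration.
RECORDS = [
    ("PSN_A", "wt_sol_01", 25000),
    ("PSN_A", "wt_mem_01", 31000),
    ("PSN_B", "wt_sol_02", 40000),
    ("PSN_C", "wt_mem_01", 22000),
    ("PSN_D", "wt_sol_01", 1200),
]

if __name__ == "__main__":
    high = compare_fractions(RECORDS, lambda a: a > 20000)
    for name, posons in high.items():
        print(name, sorted(posons))
```

Passing a different predicate, for example `lambda a: a < 5000`, gives the low-abundance comparison with the same code.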

Further work
• Queries are slow in the native repository; database repositories are probably faster
• Adding a transcriptomic experiment: wt vs mglA mutant (GEO accession GSE5468)
• RDF-S inferencing

Acknowledgements
• Funding: BBSRC, Radical Solutions for Researching the Proteome
• University of Glasgow, Glasgow
  • Prof Walter Kolch
  • Dr Andy Pitt
• University of Strathclyde, Glasgow
  • Dr Ela Hunt (Scientific Advisor)
• University of Washington, Seattle
  • Prof Dave Goodlett (Scientific Advisor)
  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
  • Dr Tina Guina (MglA experiment)

Abundance thresholds
• SeRQL aggregate functions would be nice to have
• Queries to find low and high abundance values:
  • WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
  • WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
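Without aggregates in the query language, one possible workaround is to fetch the abundance values and compute the cut-off on the client. A minimal sketch using Python's `statistics` module; the function name `split_by_median` and the sample values are assumptions for illustration only.

```python
from statistics import median


def split_by_median(abundances):
    """Return (low, high): values in [min, median] and values in [median, max].

    Mirrors the two BETWEEN ranges; a value equal to the median lands in
    both lists, as it would with an inclusive BETWEEN.
    """
    if not abundances:
        return [], []
    mid = median(abundances)
    low = [a for a in abundances if a <= mid]
    high = [a for a in abundances if a >= mid]
    return low, high


if __name__ == "__main__":
    # Hypothetical abundance values, not measurements from the talk.
    values = [2594, 4800, 12000, 20500, 150000]
    low, high = split_by_median(values)
    print("low:", low)
    print("high:", high)
```

This trades one aggregate query for a full fetch of the abundance column, so it only suits result sets that fit comfortably in memory.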

Page 25: Genome and Proteome data integration in RDF

Data IntegrationReified statements

Identified Peptideanalysis

Peptide sequence

Experiment Replicate

rdftype

mglaposon

PSNV3 FTNPSN PSNV2rdfsseeAlso rdfsseeAlso rdfsseeAlso

analysis datardfStatement

analysis data

InExperimentReplicate

rdfobject

rdftype

rdfsubject

rdfpredicateabundance

mglaPeptideAbundance

DDBID

rdfsseeAlso

GO ECSP

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (gt20000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solusing namespace

171 146185

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (lt5000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solusing namespace

219 125245

Further work

bull Queries are slow in the native repository database repositories are probably faster

bull Adding transcriptomic experiment

Wt Vs mglA mutant

GEO AC GSE5468

bull RDF-S inferencing

Acknowledgements

bull Funding BBSRC -Radical Solutions for Researching the Proteome

bull University of Glasgow Glasgow

bull Prof Walter Kolch

bull Dr Andy Pitt

bull University of Strathclyde Glasgow

bull Dr Ela Hunt (Scientific Advisor)

bull University of Washington Seattle

bull Prof Dave Goodlett (Scientific Advisor)

bull Dr Mitch Brittnacher Mathew Radey Laurence Rohmer

bull Dr Tina Guina (MglA experiment)

Abundance thresholdsbull SeRQL aggregate functions would be nice to have

bull Queries to find low and high abundance values

bull WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

bull WHERE abundance BETWEEN MIN(abundance) and MEDIAN(abundance)

Page 26: Genome and Proteome data integration in RDF

SesameReified Data load - native-RDFS (spocposcposc)

Data File time (s) time(mins) triples

FnU112Version3nt 38344 63 58474

PosonMappingsnt 8456 14 13760

francisella_locus_tagnt 1673 03 1767

ConstructHasGeneIDnt 2300 04 1719

interact-protnt 12495 21 20682

interact-prot-pepteidesnt 112797 187 248647

interact-protSeeAlsoisbURLnt 1067 02 1528

goAnnotation_URLIDnt 7414 12 20501

NC_008601nt 7584 13 12781

Membranes_CogNumberURLnt 860 01 2548

Ft_novicida_U112_gont 56138 93 2548

francisellardf2nt 4619 08 10602

francisellaSUPERFAMILYnt 6667 11 16110

francisellaPROTEINfastant 1527 03 5160

SolubleReifeid_3rdf 139298 232 580873

WholeCellReified_3rdf 94116 156 184221

Membranes_3rdf 102666 17111 416086

fnU112_draftRDFschemaV4nt 21501098 35835 501

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (gt20000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solusing namespace

171 146185

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (lt5000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solusing namespace

219 125245

Further work

bull Queries are slow in the native repository database repositories are probably faster

bull Adding transcriptomic experiment

Wt Vs mglA mutant

GEO AC GSE5468

bull RDF-S inferencing

Acknowledgements

bull Funding BBSRC -Radical Solutions for Researching the Proteome

bull University of Glasgow Glasgow

bull Prof Walter Kolch

bull Dr Andy Pitt

bull University of Strathclyde Glasgow

bull Dr Ela Hunt (Scientific Advisor)

bull University of Washington Seattle

bull Prof Dave Goodlett (Scientific Advisor)

bull Dr Mitch Brittnacher Mathew Radey Laurence Rohmer

bull Dr Tina Guina (MglA experiment)

Abundance thresholdsbull SeRQL aggregate functions would be nice to have

bull Queries to find low and high abundance values

bull WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

bull WHERE abundance BETWEEN MIN(abundance) and MEDIAN(abundance)

Page 27: Genome and Proteome data integration in RDF

select ftn psn exp abundance from psn rdfsseeAlso psnv2psnv2 rdfsseeAlso psnv3psnv3 rdfsseeAlso ftnanalysis fnu112poson psnanalysis rdftype rdfStatementanalysis rdfobject expanalysis mglaPeptideAbundance abundancewhere xsdinteger(abundance) gt 100000and ftn LIKE FTNusing namespace mgla=lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglagtfnu112=lthttpwwwfrancisellaorgnovicidafnu112schemafnu112experimentsmglagt

Querieswhich posons have the most highly abundant peptides

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (gt20000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solusing namespace

171 146185

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (lt5000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) lt 5000and experiment LIKE solusing namespace

219 125245

Further work

bull Queries are slow in the native repository database repositories are probably faster

bull Adding transcriptomic experiment

Wt Vs mglA mutant

GEO AC GSE5468

bull RDF-S inferencing

Acknowledgements

bull Funding BBSRC -Radical Solutions for Researching the Proteome

bull University of Glasgow Glasgow

bull Prof Walter Kolch

bull Dr Andy Pitt

bull University of Strathclyde Glasgow

bull Dr Ela Hunt (Scientific Advisor)

bull University of Washington Seattle

bull Prof Dave Goodlett (Scientific Advisor)

bull Dr Mitch Brittnacher Mathew Radey Laurence Rohmer

bull Dr Tina Guina (MglA experiment)

Abundance thresholdsbull SeRQL aggregate functions would be nice to have

bull Queries to find low and high abundance values

bull WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

bull WHERE abundance BETWEEN MIN(abundance) and MEDIAN(abundance)

Page 28: Genome and Proteome data integration in RDF

Querieswhich posons have the most highly abundant peptides

Querieswhich experiments have the most highly abundant peptides

Reified statementsbull Reified mgla data are much bigger (4 more statementsabundance)

bull The really interesting queries return Java out of memory error (-Xms-1024M -Xmx 1536M)

bull Havenrsquot yet tested shortcut path expression

reifSubj reifPred reifObj pred obj

seq identifiedIn ExpRep hasAbundance abd

ltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nstypegt lthttpwwww3org19990222-rdf-syntax-nsStatementgtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nssubjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaWholeCell_Lvl7_021gtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nspredicategt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaInExperimentReplicategtltWholeCell_Lvl7_0212gt lthttpwwww3org19990222-rdf-syntax-nsobjectgt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglawildtype01_wc_01gtltWholeCell_Lvl7_0212gt lthttpwwwfrancisellaorgnovicidaschemafnu112experimentsmglaPeptideAbundancegt 2594

Peptide SequenceExperiment Replicate

abundance

identifiedIn

hasAbundance

sol

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solINTERSECTselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

Comparison of integrated experimental dataDistinct and overlapping posons identified within each biological fraction (gt20000)

mem

INTERSECT

sol MINUS memmem MINUS solselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memusing namespace

select distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE memMINUSselect distinct psn fromx fnsposon psnx fnInExperimentReplicate experiment analysis rdfsubject xanalysis rdfobject expanalysis fnPeptideAbundance abundancewhere xsdinteger(abundance) gt 20000and experiment LIKE solusing namespace

171 146185

sol

Queries

Which experiments have the most highly abundant peptides?
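One way to read this question is "which experiment replicates contain the most peptides above a chosen abundance cut-off". The Python sketch below illustrates that reading on the client side, after query results have been fetched. The records, the function name `rank_by_high_abundance`, and the replicate labels are hypothetical; the 20000 cut-off is borrowed from the high-abundance comparison in this deck.

```python
from collections import defaultdict

# Hypothetical (peptide, experiment replicate, abundance) records; the values
# are illustrative and are not taken from the Francisella data set.
records = [
    ("pep1", "sol_01", 25000),
    ("pep2", "sol_01", 4000),
    ("pep3", "mem_01", 31000),
    ("pep4", "mem_01", 22000),
    ("pep5", "wc_01", 900),
]


def rank_by_high_abundance(rows, threshold=20000):
    """Count peptides above `threshold` per experiment, highest count first."""
    counts = defaultdict(int)
    for _peptide, experiment, abundance in rows:
        if abundance > threshold:
            counts[experiment] += 1
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_high_abundance(records)
```

Experiments with no peptide above the cut-off are simply absent from the ranking.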

Reified statements

• Reified mglA data are much bigger (4 more statements per abundance)

• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)

• Haven't yet tested the shortcut path expression

reifSubj   reifPred       reifObj   pred           obj
seq        identifiedIn   ExpRep    hasAbundance   abd

Example, with rdf: standing for http://www.w3.org/1999/02/22-rdf-syntax-ns# and fn: for the francisella.org novicida schema namespace of the mglA experiments:

<WholeCell_Lvl7_0212> rdf:type rdf:Statement
<WholeCell_Lvl7_0212> rdf:subject fn:WholeCell_Lvl7_021
<WholeCell_Lvl7_0212> rdf:predicate fn:InExperimentReplicate
<WholeCell_Lvl7_0212> rdf:object fn:wildtype01_wc_01
<WholeCell_Lvl7_0212> fn:PeptideAbundance 2594

Diagram: Peptide Sequence identifiedIn Experiment Replicate; the reified statement hasAbundance abundance.
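The cost of reification noted on this slide can be made concrete with a small Python sketch. It builds the four standard RDF reification triples for one "peptide identified in experiment replicate" statement and then attaches the abundance to the reified statement. The helper name `reify_with_abundance` and the `fn:` placeholder prefix are assumptions made for illustration; only the RDF namespace and the example identifiers come from the slide.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FN = "fn:"  # placeholder prefix; the full schema namespace is an assumption left unexpanded


def reify_with_abundance(statement_id, subject, predicate, obj, abundance):
    """Return the 4 reification triples plus 1 abundance triple as tuples."""
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
        # The abundance hangs off the reified statement, not off the peptide.
        (statement_id, FN + "PeptideAbundance", abundance),
    ]


triples = reify_with_abundance(
    "WholeCell_Lvl7_0212",
    FN + "WholeCell_Lvl7_021",
    FN + "InExperimentReplicate",
    FN + "wildtype01_wc_01",
    2594,
)
# Number of triples that exist only because the statement was reified.
overhead = sum(1 for _s, p, _o in triples if p.startswith(RDF))
```

Every abundance value therefore brings four extra triples with it, which is consistent with the growth in repository size described in the bullets.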

Comparison of integrated experimental data

Distinct and overlapping posons identified within each biological fraction (abundance > 20000)

Venn diagram of the sol and mem fractions, with regions sol MINUS mem, INTERSECT and mem MINUS sol. Counts shown on the slide: 171, 146, 185

sol INTERSECT mem

select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
INTERSECT
select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace

sol MINUS mem

select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
MINUS
select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace

mem MINUS sol

select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
MINUS
select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace
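The three comparisons on this slide are ordinary set operations over the posons found in each fraction. As a rough illustration, the Python sketch below reproduces the same logic with built-in sets: `&` plays the role of INTERSECT and `-` the role of MINUS. The rows, the `posons` helper, and the replicate names are hypothetical, and the substring test only approximates the LIKE filter on the experiment name.

```python
# Hypothetical (poson, experiment replicate, abundance) rows for illustration.
rows = [
    ("p1", "sol_01", 25000),
    ("p2", "sol_01", 30000),
    ("p2", "mem_01", 21000),
    ("p3", "mem_01", 40000),
    ("p4", "sol_02", 1000),
]


def posons(data, fraction, keep):
    """Distinct posons seen in `fraction` whose abundance passes `keep`."""
    return {
        poson
        for poson, experiment, abundance in data
        if fraction in experiment and keep(abundance)
    }


def is_high(abundance):
    """Mirror of the > 20000 filter used in the queries."""
    return abundance > 20000


sol = posons(rows, "sol", is_high)
mem = posons(rows, "mem", is_high)

shared = sol & mem    # sol INTERSECT mem
sol_only = sol - mem  # sol MINUS mem
mem_only = mem - sol  # mem MINUS sol
```

The sizes of `sol_only`, `shared` and `mem_only` correspond to the three regions of the Venn diagram; swapping `is_high` for a `< 5000` predicate gives the low-abundance comparison.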

Comparison of integrated experimental data

Distinct and overlapping posons identified within each biological fraction (abundance < 5000)

Venn diagram of the sol and mem fractions, with regions sol MINUS mem, INTERSECT and mem MINUS sol. Counts shown on the slide: 219, 125, 245

sol INTERSECT mem

select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
INTERSECT
select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
using namespace

sol MINUS mem

select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
MINUS
select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
using namespace

mem MINUS sol

select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
MINUS
select distinct psn from
  {x} fns:poson {psn},
  {x} fn:InExperimentReplicate {experiment},
  {analysis} rdf:subject {x},
  {analysis} rdf:object {exp},
  {analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
using namespace

Further work

• Queries are slow in the native repository; database repositories are probably faster

• Adding transcriptomic experiment

  Wt vs mglA mutant

  GEO accession: GSE5468

• RDF-S inferencing

Acknowledgements

• Funding: BBSRC, Radical Solutions for Researching the Proteome

• University of Glasgow, Glasgow

  • Prof Walter Kolch

  • Dr Andy Pitt

• University of Strathclyde, Glasgow

  • Dr Ela Hunt (Scientific Advisor)

• University of Washington, Seattle

  • Prof Dave Goodlett (Scientific Advisor)

  • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer

  • Dr Tina Guina (MglA experiment)

Abundance thresholds

• SeRQL aggregate functions would be nice to have

• Queries to find low and high abundance values

• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)

• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
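Until aggregates such as MEDIAN are available in the query language, the same thresholds can be derived on the client side once the abundance values have been retrieved. The following Python sketch shows one possible approach using the standard library; the abundance values and variable names are hypothetical, and the inclusive comparisons are meant to mirror BETWEEN.

```python
from statistics import median

# Hypothetical abundance values, as if returned by a plain SELECT query.
abundances = [900, 2594, 4000, 21000, 31000]

low_bound = min(abundances)
mid_point = median(abundances)
high_bound = max(abundances)

# BETWEEN is inclusive at both ends, so the median falls in both groups.
high_abundance = [a for a in abundances if mid_point <= a <= high_bound]
low_abundance = [a for a in abundances if low_bound <= a <= mid_point]
```

The computed `mid_point` could then be substituted as a literal into the WHERE clause of a follow-up query.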
