hyque: evaluating scientific hypotheses using semantic web technologies


Upload: michel-dumontier

Post on 10-May-2015


Page 1: HyQue: Evaluating scientific Hypotheses using semantic web technologies

HYQUE: EVALUATING SCIENTIFIC HYPOTHESES USING SEMANTIC WEB

TECHNOLOGIES

MICHEL DUMONTIER, PHD

ASSOCIATE PROFESSOR OF BIOINFORMATICS, DEPARTMENT OF BIOLOGY, INSTITUTE OF BIOCHEMISTRY AND SCHOOL OF COMPUTER SCIENCE @ CARLETON UNIVERSITY

PROFESSEUR ASSOCIÉ, DÉPARTEMENT D’INFORMATIQUE ET DE GÉNIE LOGICIEL, UNIVERSITÉ LAVAL


HYQUE IS A COLLABORATIVE WORK

Work performed by Alison Callahan, a PhD student under my supervision @ Carleton University

Partnership with Dr. Nigam Shah, Assistant Professor at Stanford University


Source: http://kentsimmons.uwinnipeg.ca/cm1504/introscience.htm


WITH UNPARALLELED GROWTH IN RESEARCH OUTPUTS, UNCOVERING ALL THE EVIDENCE TO SUPPORT/REFUTE A HYPOTHESIS IS BECOMING INCREASINGLY DIFFICULT

Citations added to Medline 1995-2009

Source: http://www.nlm.nih.gov/bsd/stats/cit_added.html


HYBROW

Computationally augmented method for hypothesis evaluation

• developed by Racunas et al. [1]
• minimum event-based vocabulary
• uses consistency checking to evaluate hypotheses
• constraints to ensure valid claims
• rules to evaluate evidence
• compares hypotheses using neighborhood functions
• incremental hypothesis improvement

[1] Racunas S. A., Shah N. H., Albert I. and Fedoroff N. V. (2004). HyBrow: A prototype system for computer-aided hypothesis evaluation. Bioinformatics 20(S. 1): i1-i8.


THE GAL GENE NETWORK IN YEAST

• Genes that encode proteins that transport and metabolize galactose

• permease – gal2p – transports galactose into cells

• galactokinase – gal1p
• uridylyltransferase – gal7p
• epimerase – gal10p
• phosphoglucomutase – gal5p

• Regulation – whether the pathway is on or off

• gal3p
• gal4p
• gal80p


Source: Ostergaard et al. (2000). Nature Biotechnology 18: 1283 - 1286


HYPOTHESIS

h1 (simple event-based expression):
  e1: Gal4p induces expression of GAL1

h2 (conjunctive hypothesis: must satisfy two expressions):
  e2: Gal3p induces expression of GAL2
  AND e3: Gal4p induces expression of GAL7

h3 (conjunctive hypothesis with conditional expression):
  e4: Gal4p induces expression of GAL7
  AND e5: Gal80p inhibits production of Gal4p when GAL3 is over-expressed
  AND e6: Gal80p induces expression of GAL7


HYBROW

• small, manually generated knowledge base

• hard coded Perl rules

• challenging to apply to a new domain

• needs access to a greater KB


SEMANTIC WEB TECHNOLOGIES FOR KNOWLEDGE MANAGEMENT?

Semantic Web technologies are promising for automating hypothesis evaluation

• Languages for formal knowledge representation
• Automated reasoning
• Querying over distributed resources
• Growing number of biological resources available in SW formats
  • Ontologies
  • Data

Bio2RDF is one of the largest resources of linked life data on the Web

• ~40 data sets available
• Globally distributed
• Dataset-specific SPARQL endpoints


BIO2RDF IS PART OF A GROWING WEB OF LINKED DATA

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”


The Semantic Web is a web of knowledge

It is about standards for publishing, sharing and querying knowledge drawn from diverse sources

It enables the answering of sophisticated questions


Ontology as a strategy to formally represent knowledge


The Web Ontology Language (OWL) Has Explicit Semantics

Can therefore be used to capture knowledge in a machine understandable way


HYBROW → HYQUE

• Hypothesis query and evaluation system

• Built on Semantic Web technologies

• Background knowledge encoded as OWL ontologies

• Queries against SPARQL endpoints
• Context-specific rules that consider experimental conditions
• Consumes and produces RDF
• Can be accessed via web or semantic web services


HYQUE IS COMPOSED OF …

• HyQue hypothesis ontology

• Describes generic input hypothesis and output hypothesis evaluation classes

• Uses upper level classes e.g. ‘proposition’, ‘measurement value’, ‘event’

• HyQue Data

• Experimentally determined interactions between the GAL proteins (GAL knowledge base from HyBrow project)

• Literature-based evidence (citations)
• Knowledge about cellular localization and biological processes (GO)
• Types of evidence supporting these interactions (ECO)
• Yeast gene/protein/function data (SGD)


A HYQUE HYPOTHESIS IS A COLLECTION OF PROPOSITIONS

• HyQue hypotheses are composed of one or more propositions connected using logical operators (AND, OR, XOR…)

• proposition: “a statement expressing something true or false”

• HyQue propositions only specify events

HyQue hypothesis ≡ ‘proposition’ that ‘specifies’ only ‘event’

HyQue hypothesis ≡ ‘proposition’ that ‘has component part’ only (‘proposition’ that ‘specifies’ only ‘event’)


HYQUE EVENTS

1. protein-protein binding

2. protein-nucleic acid binding

3. molecular activation

4. molecular inhibition

5. gene induction

6. gene repression

7. transport


HYQUE EVENTS

Events are composed of conditional assertions on a relation between ‘actor’ and ‘target’

induces(agent, target, context, location)

For decidable logic (OWL), an n-ary object is used

Event
  ‘has agent’ agent
  ‘has target’ target
  ‘has context’ context
  ‘is located in’ location
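The n-ary pattern above can be sketched in Python: the four-argument relation induces(agent, target, context, location) becomes one event node with a binary statement per argument. This is only an illustration. The `Event` class, its field names, and the example context value "galactose present" are assumptions made for the sketch and do not come from the HyQue ontology.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Event:
    """Hypothetical holder for one n-ary event such as induces(agent, target, context, location)."""
    event_id: str
    event_type: str
    agent: str
    target: str
    context: str
    location: str

    def to_triples(self) -> List[Triple]:
        """Reify the n-ary relation as binary statements about a single event node."""
        return [
            (self.event_id, "rdf:type", self.event_type),
            (self.event_id, "has agent", self.agent),
            (self.event_id, "has target", self.target),
            (self.event_id, "has context", self.context),
            (self.event_id, "is located in", self.location),
        ]


# The context value is invented for illustration only.
e1 = Event(":e1", "induces", "Gal4p", "GAL1", "galactose present", "nucleus")
triples = e1.to_triples()
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Because every argument hangs off the event node, further arguments can be added without changing the arity of any relation, which is what keeps the representation within binary OWL properties.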


ALL DATA ARE REPRESENTED USING RDF

[Diagram: a hypothesis ‘has component part’ a proposition, which ‘specifies’ the event “Gal4p positively regulates the expression of GAL1”]

RDF’s basic representation unit is the “triple”

<subject> <predicate> <object>

:h rdf:type hyque:Hypothesis .

:h hyque:has-component-part :p1 .

:p1 rdf:type hyque:Proposition .


ALL DATA ARE REPRESENTED USING RDF

[Diagram: a hypothesis ‘specifies’ the event “Gal4p positively regulates the expression of GAL1”]

:h a hyque:Hypothesis ;
    hyque:specifies :e1 .

:e1 a <http://bio2rdf.org/go:0010628> ;   # positive regulation of gene expression
    hyque:is_negated "0" ;
    hyque:agent <http://bio2rdf.org/sgd:Gal4p> ;
    hyque:target <http://bio2rdf.org/sgd:GAL1> ;
    ….


USER INTERFACE FACILITATES DESIGNING THE HYPOTHESIS


TEMPLATE SPARQL QUERIES COMPLETED BASED ON EVENT PROPERTIES

:e1 a go:0010628 ;
    hyque:is_negated "0" ;
    hyque:agent sgd:Gal4p ;
    hyque:target sgd:GAL1 .

construct { … }
where {
  ?event hyque:is_negated ?negated .
  ?event hyque:logical_operator ?logical_operator .
  ?event hyque:agent <http://bio2rdf.org/sgd:Gal4p> .
  ?event hyque:target <http://bio2rdf.org/sgd:GAL1> .
  …
}

binding

Hypothesis + SPARQL Template => SPARQL query
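The binding step (hypothesis + template => query) might look roughly like the following Python sketch. It is not HyQue's implementation. The slide elides the CONSTRUCT body and part of the WHERE clause, so this example swaps in a plain SELECT over only the four patterns shown. The namespace in `HYQUE_NS`, the template text, and the helper `build_query` are all assumptions made for illustration.

```python
from string import Template

# Placeholder namespace: the real hyque: prefix IRI is not given in the slides.
HYQUE_NS = "http://example.org/hyque#"

# Hypothetical template restricted to the triple patterns visible on the slide.
QUERY_TEMPLATE = Template(
    "PREFIX hyque: <$hyque_ns>\n"
    "SELECT ?event ?negated ?logical_operator\n"
    "WHERE {\n"
    "  ?event hyque:is_negated ?negated .\n"
    "  ?event hyque:logical_operator ?logical_operator .\n"
    "  ?event hyque:agent <$agent> .\n"
    "  ?event hyque:target <$target> .\n"
    "}\n"
)


def build_query(event: dict) -> str:
    """Bind the agent and target IRIs of one event into the query template."""
    for key in ("agent", "target"):
        # Reject characters that could break out of the <...> IRI delimiters.
        if any(ch in event[key] for ch in '<>"{} \n'):
            raise ValueError(f"unsafe IRI for {key}: {event[key]!r}")
    return QUERY_TEMPLATE.substitute(
        hyque_ns=HYQUE_NS, agent=event["agent"], target=event["target"]
    )


e1 = {
    "agent": "http://bio2rdf.org/sgd:Gal4p",
    "target": "http://bio2rdf.org/sgd:GAL1",
}
query = build_query(e1)
print(query)
```

The resulting string could then be sent to a SPARQL endpoint. The IRI check is a small precaution, since the values are spliced into the query text rather than passed as parameters.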


SPARQL QUERY RESULTS RETRIEVED

hybrow_data:f0957524deecae38945736737cc07d45
    hyque:logical_operator <http://bio2rdf.org/go:0010628> ;      # positive regulation of gene expression
    hyque:is_negated "0" ;
    hyque:agent <http://bio2rdf.org/sgd:Gal4p> ;
    hyque:target <http://bio2rdf.org/sgd:GAL1> ;
    hyque:agent_type <http://bio2rdf.org/chebi:36080> ;           # Protein
    hyque:target_type <http://bio2rdf.org/so:0000236> ;           # Gene
    hyque:location <http://bio2rdf.org/go:0005634> ;              # Nucleus
    hyque:agent_function_type <http://bio2rdf.org/go:0003700> .   # Transcription factor activity


QUERY RESULTS EVALUATED BASED ON RULE SETS

‘induce’ rule (maximum score: 5):

• Is event negated?
  • If yes, subtract 2
• Is logical operator ‘induce’ (GO:0010628)?
  • If yes, add 1; if no, subtract 1
• Is agent of type ‘protein’ (CHEBI:36080) or ‘RNA’?
  • If yes, add 1; if of type ‘gene’, subtract 1
• Is target of type ‘gene’ (SO:0000236)?
  • If yes, add 1; if no, subtract 1
• Does agent have known ‘transcription factor activity’ (GO:0003700)?
  • If yes, add 1
• Is event located in the ‘nucleus’ (GO:0005634)?
  • If yes, add 1; if no, subtract 1


EVALUATING HYPOTHESES

e1 (Gal4p induces expression of GAL1)

e1 describes the induction of GAL1 gene expression by Gal4p and is therefore an event of type ‘induce’.

Evaluation:

• Agent of type ‘protein’: yes -> +1
• Target of type ‘gene’: yes -> +1
• Agent has function ‘transcription factor activity’: no -> 0
• Event location is ‘nucleus’: yes -> +1
• Logical operator is ‘induce’: yes -> +1
• Event negated in published literature: no -> 0

Thus, the e1 event obtains 4 out of a maximum of 5 points, and receives a score of 0.8.
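The ‘induce’ rule set and the e1 evaluation above can be sketched as a small Python function. The names `EventEvidence` and `score_induce` are hypothetical, and the sketch assumes that any case the rule set does not mention (for example, an agent that is neither protein, RNA, nor gene) contributes 0 points. The e1 values follow the evaluation listed above.

```python
from dataclasses import dataclass

MAX_INDUCE_SCORE = 5  # maximum score stated for the 'induce' rule set


@dataclass
class EventEvidence:
    """Hypothetical summary of the query results for one event."""
    negated: bool
    operator_is_induce: bool
    agent_type: str                 # e.g. 'protein', 'RNA', 'gene'
    target_is_gene: bool
    agent_is_transcription_factor: bool
    located_in_nucleus: bool


def score_induce(ev: EventEvidence) -> float:
    """Apply the 'induce' rules and normalise the points by the maximum score."""
    points = 0
    if ev.negated:
        points -= 2
    points += 1 if ev.operator_is_induce else -1
    if ev.agent_type in ("protein", "RNA"):
        points += 1
    elif ev.agent_type == "gene":
        points -= 1
    points += 1 if ev.target_is_gene else -1
    if ev.agent_is_transcription_factor:
        points += 1
    points += 1 if ev.located_in_nucleus else -1
    return points / MAX_INDUCE_SCORE


# e1 as evaluated above: four rules fire positively, two contribute nothing.
e1 = EventEvidence(
    negated=False,
    operator_is_induce=True,
    agent_type="protein",
    target_is_gene=True,
    agent_is_transcription_factor=False,
    located_in_nucleus=True,
)
print(score_induce(e1))  # 0.8
```

Dividing by the maximum keeps scores from different rule sets on a comparable scale, which matters once several events are combined into one hypothesis score.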


EVALUATING HYPOTHESES

Events e2, e3, and e4 are also ‘induce’ events and are evaluated using the ‘induce’ rule set, each obtaining a score of 0.8.

e5 is undecidable: the HKB contains no data to support that Gal80p inhibits Gal4p when GAL3 is over-expressed

-> the entire third event set is deemed undecidable.

Overall hypothesis score is selected from e1 (0.8) and e2 + e3 (0.8 + 0.8 = 1.6)

Final hypothesis score is 1.6: events e2 + e3 have the strongest experimental support.

  e1: Gal4p induces expression of GAL1

OR

  e2: Gal3p induces expression of GAL2
  AND e3: Gal4p induces expression of GAL7

OR

  e4: Gal4p induces expression of GAL7
  AND e5: Gal80p inhibits production of Gal4p when GAL3 is over-expressed
  AND e6: Gal80p induces expression of GAL7
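The aggregation described on this slide can be sketched as follows. Two points are inferred from the worked numbers rather than stated outright, so treat them as assumptions: event scores within an AND-connected set are summed, and the hypothesis score is the highest score among the decidable OR-connected sets. The function names are hypothetical, and `None` stands for an event with no supporting data.

```python
from typing import Dict, List, Optional


def score_event_set(event_ids: List[str], scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Sum the scores of AND-connected events; the set is undecidable if any event is."""
    total = 0.0
    for event_id in event_ids:
        score = scores.get(event_id)
        if score is None:  # no data for this event
            return None
        total += score
    return total


def score_hypothesis(event_sets: List[List[str]], scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Select the best-supported decidable set among the OR-connected alternatives."""
    decidable = []
    for event_ids in event_sets:
        set_score = score_event_set(event_ids, scores)
        if set_score is not None:
            decidable.append(set_score)
    return max(decidable) if decidable else None


# Scores from the slide; e5 is undecidable and no score is reported for e6.
scores = {"e1": 0.8, "e2": 0.8, "e3": 0.8, "e4": 0.8, "e5": None}
event_sets = [["e1"], ["e2", "e3"], ["e4", "e5", "e6"]]

final = score_hypothesis(event_sets, scores)
print(final)  # 1.6
```

Under this summing scheme a set with more events can outscore a single well-supported event, which may be one reason the future directions mention finer-grained scoring.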

 


HYPOTHESIS EVALUATION REPRESENTED AS RDF


BROWSE HYPOTHESIS AND EVALUATION AS LINKED DATA


http://sadiframework.org

Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB

The Semantic Automated Discovery and Integration (SADI) framework makes it easy to create Semantic Web services using OWL classes as service inputs and outputs

Users can post a hypothesis in RDF and receive the hypothesis evaluation RDF

HyQue can become part of a workflow for investigations


FUTURE DIRECTIONS

• Investigate alternative, finer-grained scoring systems

• Expand beyond the GAL network with network reconstructions and NLP-facilitated data curation

• Collaborative social environment to engineer, share, compare and evaluate hypotheses, and format the results


CONCLUSION

HyQue is a new system to construct and evaluate (automatically obtain support for) hypotheses using formalized background knowledge and data on the Semantic Web


Acknowledgements

Alison Callahan (developing HyQue)

Nigam Shah (key collaborator)

Stephen Racunas and Amar Das for helpful discussions

Bio2RDF: Peter Ansell, Francois Belleau, Alison Callahan, Jacques Corbeil, Jose Cruz-Toledo, Alex De Leon, Steve Etlinger, James Hogan, Nichealla Keith, Jean Morissette, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Paul Roe

SADI: Christopher Baker, Melanie Courtot, Jose Cruz-Toledo, Steve Etlinger, Nichealla Keith, Artjom Klein, Luke McCarthy, Silvane Paixao, Ben Vandervalk, Natalia Villanueva-Rosales, Mark Wilkinson