Chemical named entity recognition and literature mark-up
Colin Batchelor
Informatics Department
Royal Society of Chemistry
[email protected]
Overview
Project Prospect: what we find and how we find it.
RDF: How should we be disseminating it?
Next steps: Basics for a chemical ontology.
Project Prospect: What do we find?
Chemical compounds
Chemical terms from the IUPAC Gold Book
Gene products: function, process, location
Nucleotide and polypeptide sequence terms
Cell types
Project Prospect: How do we find it?
For compound names:
~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
~20% PubChem
~20% ChemDraw
For compound numbers:
~70% author ChemDraw
~30% editors
RDF in an RSS reader
RDF: how we do it now
Content module from RSS 1.0
http://web.resource.org/rss/1.0/modules/content
In what sense does an article “contain” pyridine or base pairs?
We would much rather have proper RDF predicates, e.g. “is_about”, “mentions”.
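As a minimal sketch of what that could look like, the Python below emits N-Triples that link an article to entities through explicit predicates. The `http://example.org/prospect#` namespace, the predicate IRIs built from it, the compound IRI, and the choice of which entity is "about" and which is merely "mentioned" are all hypothetical, chosen only for illustration.

```python
# Hypothetical namespace for illustration; not a published vocabulary.
EX = "http://example.org/prospect#"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one N-Triples statement whose three terms are all IRIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def annotate(article: str, about: list[str], mentions: list[str]) -> list[str]:
    """Link an article to entities with explicit predicates instead of 'contains'."""
    lines = [triple(article, EX + "is_about", entity) for entity in about]
    lines += [triple(article, EX + "mentions", entity) for entity in mentions]
    return lines


article = "http://xlink.rsc.org/?DOI=b716356h&RSS=1"
statements = annotate(
    article,
    about=["http://purl.org/obo/owl/SO#SO:0000028"],
    mentions=["http://example.org/compound/pyridine"],  # placeholder IRI
)
print("\n".join(statements))
```

A consumer can then tell "the article is about X" from "the article mentions X", which the content module's bag of items cannot express.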
RDF: what it looks like now
<item rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <title> [… title] </title>
  <link>http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1</link>
  <description> [… blah] </description>
  <content:encoded> [… human-readable stuff] </content:encoded>
  [… dublin core stuff …]
  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C22H22NO4/c1-13-16-11-21(26-4)20(25-3)10-15(16)8-18-17-12-22(27-5)19(24-2)9-14(17)6-7-23(13)18/h6-12H,1-5H3/q+1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>
</item>
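To show how a feed consumer might read these annotations back, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample is a cut-down item with the namespace declarations the slide omits (the usual RDF, RSS 1.0 and content-module namespace URIs); the InChI identifier is replaced by a short placeholder, and the function name `entity_ids` is an assumption.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT = "http://purl.org/rss/1.0/modules/content/"

# Cut-down item; the InChI identifier is replaced by a short placeholder.
SAMPLE = f"""<item xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="{RDF}" xmlns:content="{CONTENT}"
      rdf:about="http://xlink.rsc.org/?DOI=b716356h&amp;RSS=1">
  <content:items>
    <rdf:Bag>
      <rdf:li><content:item rdf:about="info:inchi/PLACEHOLDER"/></rdf:li>
      <rdf:li><content:item rdf:about="http://purl.org/obo/owl/SO#SO:0000028"/></rdf:li>
    </rdf:Bag>
  </content:items>
</item>"""


def entity_ids(item_xml: str) -> list[str]:
    """Return the rdf:about value of every content:item in the item."""
    root = ET.fromstring(item_xml)
    about = f"{{{RDF}}}about"  # ElementTree spells qualified names as {namespace}local
    return [el.attrib[about] for el in root.iter(f"{{{CONTENT}}}item")]


ids = entity_ids(SAMPLE)
print(ids)
```

The result is a flat list of identifiers. Nothing in the markup says how the article relates to each one.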
Basics for a chemical ontology
1. Unambiguous representation of objects of chemical discourse
2. Proper parthood relations
Basics for a chemical ontology: 1. Objects of chemical discourse
Must be able to represent and clearly distinguish:
Compounds
Classes of compound
Parts of molecules
Mixtures
Would be nice to have:
Disambiguation cues for the first three
Imidazole
An imidazole
The imidazole side-chain/group/ring
Can ChEBI handle this?
Imidazoles (!) (CHEBI:24780)
Imidazole (CHEBI:16069)
Imidazole ring: not yet
Imidazolyl group: not yet (but methyl, benzyl, etc.)
… and there are no disambiguation cues
Disambiguation
One Sense per Discourse (Gale et al. 1992)
… this doesn’t hold at all
One Sense per Collocation (Yarowsky 1993)
… matches our intuitions
Disambiguation: What a one sense per collocation feature set might look like
CLASS:
w(–1) = a, an, the, this
w(0) plural (bit of a cheat, as not a collocation)
PART:
w(–1) = bridging, terminal
w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
w(+1)w(+2) = “building block”, “protecting group”, “side chain”
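The feature set above can be sketched as a small rule-based tagger. This is an illustration only, not the Oscar implementation: the function names, the crude ends-in-"s" plural test, the rule that PART cues outrank CLASS cues, and the COMPOUND fallback are all assumptions. A real system would presumably learn weights for these features instead.

```python
import re

# Cue words taken from the slide (the PART list is abbreviated there too).
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def tokenize(text):
    """Split on anything that is not a letter or digit, so 'side-chain' gives two tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def features(tokens, i):
    """Collect the cues that fire for the chemical name at position i."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""
    fired = set()
    if prev in CLASS_PREV:
        fired.add("CLASS:w(-1)")
    if tokens[i].lower().endswith("s"):  # crude plural test (assumption)
        fired.add("CLASS:plural")
    if prev in PART_PREV:
        fired.add("PART:w(-1)")
    if nxt in PART_NEXT:
        fired.add("PART:w(+1)")
    if (nxt, nxt2) in PART_NEXT_BIGRAMS:
        fired.add("PART:w(+1)w(+2)")
    return fired


def guess_sense(tokens, i):
    """Assumed tie-break: PART cues win over CLASS cues; no cue means COMPOUND."""
    fired = features(tokens, i)
    if any(f.startswith("PART") for f in fired):
        return "PART"
    if any(f.startswith("CLASS") for f in fired):
        return "CLASS"
    return "COMPOUND"


print(guess_sense(tokenize("the imidazole side-chain"), 1))
print(guess_sense(tokenize("an imidazole was added"), 1))
```

The tie-break matters because "the imidazole side-chain" fires both a CLASS cue (the preceding "the") and a PART cue ("side chain").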
Basics for a chemical ontology: 2. Parthood relations
Parthood in ChEBI means at least three things:
Is necessarily chemically part of:
carbonyl group part_of carbonyl compounds
Is possibly chemically part of:
Lead(2+) part_of lead diacetate
(most lead(2+) isn’t)
Electron part_of muonium (!)
Is part of a mixture
Kanamycin A part_of kanamycin
Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y (Smith et al., “Relations in biomedical ontologies”, 2005).
Carbonyl compound has_part carbonyl group
Lead diacetate has_part lead(2+) (?!)
Muonium has_part electron
Kanamycin has_part kanamycin A (?!)
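The pattern can be made concrete with a minimal sketch: a class-level relation holds only if every instance of X is related to at least one instance of Y. The instance data is invented for illustration, and the function name `all_some` is an assumption.

```python
def all_some(instances, related, in_target_class):
    """True if every instance is related to at least one member of the target class."""
    return all(
        any(in_target_class(other) for other in related.get(x, []))
        for x in instances
    )


# Toy instance data (invented): the wholes each lead(2+) ion is part of ...
part_of = {
    "Pb2+ ion #1": ["lead diacetate sample"],
    "Pb2+ ion #2": ["lead dichloride sample"],
}
# ... and the parts of each lead diacetate sample.
has_part = {
    "lead diacetate sample": ["Pb2+ ion #1", "acetate ion #1", "acetate ion #2"],
}

lead_ions = ["Pb2+ ion #1", "Pb2+ ion #2"]
diacetates = ["lead diacetate sample"]

# "Lead(2+) part_of lead diacetate" fails: not every lead(2+) ion sits in lead diacetate.
print(all_some(lead_ions, part_of, lambda w: w.startswith("lead diacetate")))  # False
# The inverse direction holds on this data.
print(all_some(diacetates, has_part, lambda p: p.startswith("Pb2+")))  # True
```

This is why the assertions are restated in the has_part direction: the all-some reading makes the direction of each relation significant.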
Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships
Carbonyl compound molecule has_part carbonyl substituent
Muonium atom has_part electron
Kanamycin has_component kanamycin A
Lead diacetate has_component lead(2+) (?!)
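One possible way to enforce that distinction in data is sketched below: each entity carries a level, and each relation is only accepted when its subject sits at the matching level. The `Level` and `Entity` types, the `check` function, and the rule that the subject alone decides are assumptions made for illustration, not part of the proposal itself.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    MOLECULAR = "molecular"  # single molecules, atoms, groups
    SAMPLE = "sample"        # bulk substances and mixtures


@dataclass(frozen=True)
class Entity:
    name: str
    level: Level


# Assumed rule: the level the subject of each relation must sit at.
SUBJECT_LEVEL = {"has_part": Level.MOLECULAR, "has_component": Level.SAMPLE}


def check(subject: Entity, relation: str, obj: Entity) -> bool:
    """Accept a statement only if the subject's level matches the relation."""
    return SUBJECT_LEVEL[relation] is subject.level


carbonyl_molecule = Entity("carbonyl compound molecule", Level.MOLECULAR)
carbonyl_group = Entity("carbonyl substituent", Level.MOLECULAR)
kanamycin = Entity("kanamycin", Level.SAMPLE)
kanamycin_a = Entity("kanamycin A", Level.SAMPLE)

print(check(carbonyl_molecule, "has_part", carbonyl_group))  # True
print(check(kanamycin, "has_component", kanamycin_a))        # True
print(check(kanamycin, "has_part", kanamycin_a))             # False
```

The sketch leaves open the case the slide flags with "(?!)": whether a sample of lead diacetate should be related to lead(2+) at the sample level at all.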
Open questions
How do we represent the relationship between named entities and documents?
How do we integrate ontologies and word-sense disambiguation?
What is the best way of distinguishing molecules and samples?
Acknowledgements
University of Cambridge: Peter Corbett
OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)
www.projectprospect.org