Mining and meaning in the chemical sciences
Overview
• Why are we doing this?
• The conventional text-mining paradigm
• How we do it
• Where text-mining and annotation could happen in future
• Standards
• Challenges
Why are we doing this?
A solution looking for many problems
• Enhanced reader experience
• Current awareness
• Information retrieval (pre-indexing)
Enhanced HTML
Conventional text-mining paradigm
There is a corpus of text (PubMed abstracts, internal reports, PDFs).
There is a resource (WordNet, FrameNet, the NTU Sentiment Dictionary).
Text-mining software is trained, using the resource, on a subset of the corpus and tested on the remainder.
This all happens after publication.
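The paradigm above can be sketched in a few lines of Python. This is a toy illustration only, not how any particular text-mining system works: the function names, the substring matching, and the "keep a term if it is right more often than wrong" training rule are all assumptions made for the example.

```python
import random
from collections import Counter


def split_corpus(docs, train_fraction=0.8, seed=0):
    """Shuffle the corpus and split it into a training part and a held-out test part."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(resource, train_docs):
    """Hypothetical training rule: keep the resource terms that are
    correct more often than wrong on the training documents."""
    hits, misses = Counter(), Counter()
    for text, gold_terms in train_docs:
        lowered = text.lower()
        for term in resource:
            if term in lowered:  # naive substring match
                if term in gold_terms:
                    hits[term] += 1
                else:
                    misses[term] += 1
    return {term for term in resource if hits[term] > misses[term]}


def tag(model, text):
    """Return the set of model terms that occur in the text."""
    lowered = text.lower()
    return {term for term in model if term in lowered}


def evaluate(model, test_docs):
    """Precision and recall of the tagger on held-out documents."""
    tp = fp = fn = 0
    for text, gold_terms in test_docs:
        predicted = tag(model, text)
        tp += len(predicted & gold_terms)
        fp += len(predicted - gold_terms)
        fn += len(gold_terms - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Each document here is a `(text, gold_terms)` pair, where `gold_terms` is the set of terms a human annotator marked as correct. The point of the held-out split is that the reported precision and recall come from text the software never saw during training.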
Resources, conventionally
• Static
• Probably developed for a single use case
• Possibly inconveniently licensed
• Developed by a single institution
The kind of resources we want
• Dynamic
• Multiple use cases
• Open
• Developed by multiple institutions
Text mining (Oscar)
http://www.sciborg.org.uk/
http://oscar3-chem.sourceforge.net/
Manual QA
Enhanced HTML
Enhanced RSS
Database
Resources we use
Static: IUPAC Gold Book
Dynamic: OBO biomedical ontologies, especially ChEBI
RSC ontologies (http://www.rsc.org/ontologies)
CMO, RXNO, MOP (and more to come)
InChI Identifier
Live resource update (stage one)
Integr. Biol., 2009, doi:10.1039/b905580k
affinity chromatography (CMO:0001006)
A chromatography method where the separation is caused by differing analyte–ligand interactions.
(source: IUPAC Orange Book 9.2.1.5)
Live resource update (stage two)
immobilized metal affinity chromatography (CMO:0002255)
A chromatography method where the separation is caused by differing analyte–ligand interactions. Proteins containing amino acids with a specific affinity for metal ions (e.g. His, which has an affinity for Co and Zn ions) are retained by the column.
metal oxide affinity chromatography (CMO:0002256)
A chromatography method where the separation is caused by differing analyte–ligand interactions. Phosphorylated proteins and peptides are retained by metal oxide particles because of their affinity for the phosphate group.
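A minimal sketch of why a live resource update changes annotations: the same sentence is annotated before and after the more specific terms are added. The CMO identifiers and labels are the ones quoted above. The longest-match rule, the `annotate` function, and the label-to-identifier dictionaries are assumptions made for illustration, not a description of the RSC pipeline.

```python
def annotate(text, ontology):
    """Annotate text with ontology terms, preferring the longest label.

    `ontology` maps a lower-case label to a term identifier. Matching is
    naive substring search; a span already claimed by a longer label is
    not annotated again with a shorter one.
    """
    lowered = text.lower()
    taken = [False] * len(lowered)
    found = []
    for label in sorted(ontology, key=len, reverse=True):
        start = lowered.find(label)
        while start != -1:
            end = start + len(label)
            if not any(taken[start:end]):
                found.append((start, label, ontology[label]))
                for i in range(start, end):
                    taken[i] = True
            start = lowered.find(label, start + 1)
    # Report annotations in reading order.
    return [(label, term_id) for _, label, term_id in sorted(found)]


# Stage one: only the general term exists.
STAGE_ONE = {"affinity chromatography": "CMO:0001006"}

# Stage two: more specific terms have been added to the ontology.
STAGE_TWO = {
    **STAGE_ONE,
    "immobilized metal affinity chromatography": "CMO:0002255",
    "metal oxide affinity chromatography": "CMO:0002256",
}
```

Running `annotate` over the same text with `STAGE_ONE` and then `STAGE_TWO` shows the annotations moving from the general term to the specific ones, which is one reason annotations to a living ontology cannot be treated as fixed.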
But beware of ambiguity
distribution (noun)
Does this mean:
(a) Spreading something out (a process)?
(b) The way something is spread out (a quality)?
RSC ontology development
Annotations to a particular ontology are a moving target.
And we can’t guarantee completeness for any given resource–corpus combination.
(Unless we build a corpus-specific resource, which is bad.)
Compounds?
All kinds of problems...
• Different names, systematic and common
• No names
• Images (specific and generic)
Best dictionary wins for names
Stephen Arnold
Search: The Three Curves of Despair
March 2008
ChemMantis
Deposit structures…build dictionaries
A free-to-access online database for chemists
Website and web services
A link farm for over 22 million compounds integrated to 200 data sources
A curation platform for the public to improve the quality of data online
A deposition platform for the public to annotate and extend the data
Annotation: where and when?
• Pre-publication? (by authors): ?
• At publication? (by editors): Prospect
• After publication? (by the crowd): ChemMantis
Authoring: Word ontology plugin
http://ucsdbiolit.codeplex.com/
<?xml version="1.0" ?><cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema"><molecule id="m1"><atomArray><atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" /><atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" /><atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" /><atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" /><atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" /><atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" /><atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" /><atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
</atomArray><bondArray><bond atomRefs2="a1 a2" order="1" /><bond atomRefs2="a2 a3" order="1" /><bond atomRefs2="a2 a4" order="2" /><bond atomRefs2="a1 a5" order="1" /><bond atomRefs2="a1 a6" order="1" /><bond atomRefs2="a1 a7" order="1" /><bond atomRefs2="a3 a8" order="1" />
</bondArray></molecule>
</cml>
Authoring: Chem4Word – Chemistry Drawing in Word
Author and edit 1D and 2D chemistry.
Relationships: Navigate and link referenced chemistry
Data: Semantics stored in Chemistry Markup Language
Intent: Recognizes chemical dictionary and ontology terms
Intelligence: Verifies validity of authored chemistry
• Peter Murray-Rust
• Joe Townsend
• Jim Downing
Available soon: http://research.microsoft.com/chem4word/
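To make "semantics stored in Chemistry Markup Language" concrete, here is a minimal sketch that reads a CML molecule with Python's standard `xml.etree.ElementTree` and derives a molecular formula and bond counts. The embedded CML is an abbreviated copy of the fragment shown on this slide, with the 2D coordinates omitted for brevity. The function names and the returned tuple are assumptions made for the example, and this is not how Chem4Word itself processes CML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace declared by the CML document on the slide.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Abbreviated copy of the slide's CML (x2/y2 coordinates left out).
CML_TEXT = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
<molecule id="m1">
<atomArray>
<atom id="a1" elementType="C" /><atom id="a2" elementType="C" />
<atom id="a3" elementType="O" /><atom id="a4" elementType="O" />
<atom id="a5" elementType="H" /><atom id="a6" elementType="H" />
<atom id="a7" elementType="H" /><atom id="a8" elementType="H" />
</atomArray>
<bondArray>
<bond atomRefs2="a1 a2" order="1" /><bond atomRefs2="a2 a3" order="1" />
<bond atomRefs2="a2 a4" order="2" /><bond atomRefs2="a1 a5" order="1" />
<bond atomRefs2="a1 a6" order="1" /><bond atomRefs2="a1 a7" order="1" />
<bond atomRefs2="a3 a8" order="1" />
</bondArray>
</molecule>
</cml>"""


def hill_formula(counts):
    """Format element counts in Hill order: C, then H, then the rest alphabetically."""
    order = []
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
    order += sorted(element for element in counts if element not in order)
    return "".join(
        element + (str(counts[element]) if counts[element] > 1 else "")
        for element in order
    )


def summarise(cml_text):
    """Return (formula, number of bonds, number of double bonds) for a CML document."""
    root = ET.fromstring(cml_text)
    atoms = root.findall(".//cml:atom", CML_NS)
    bonds = root.findall(".//cml:bond", CML_NS)
    counts = Counter(atom.get("elementType") for atom in atoms)
    double_bonds = sum(1 for bond in bonds if bond.get("order") == "2")
    return hill_formula(counts), len(bonds), double_bonds
```

For the fragment above, the atoms and bonds describe a two-carbon acid with one C=O double bond, so the structure an author draws is available to software as data rather than as a picture.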
oreChem
TREC Chemistry
• Combination of 1m+ patents and 36k RSC articles
• Test runs on defined tasks, prior art
• 8 runs in year one (none from UK)
InfoChem
• Chemisches Zentralblatt
• Digitised
• Structure searchable
• 98k unique names, 48k unique structures
• Unique PROBLEMS
• OCR interpretations
• Text searchable from FIZ Chemie
Role:
• To fund development and support of the IUPAC InChI standard
• Working groups set up by IUPAC Subcommittee: reactions, organometallics, polymers, Markush, business rules for structure input, Resolver protocol
Current members of the Trust:
ACD/Labs
ChemAxon
Elsevier
FIZ Chemie
Informa / Taylor & Francis
NPG
OpenEye
RSC
Symyx Technologies
Thomson-Reuters
Wiley-Blackwell
RInChI
• Reactions
• Jonathan Goodman
• http://www-rinchi.ch.cam.ac.uk/
inchis.chemspider.com
NCI Resolver
• So we need a Resolver Protocol
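As a rough illustration of what a resolver does, the sketch below maps different names for the same compound to a single InChI and splits that InChI into its layers. It works entirely offline against a hypothetical in-memory table; `NAME_TABLE`, `resolve`, and `split_inchi` are invented for this example and do not reflect the NCI resolver, the ChemSpider service, or any agreed Resolver Protocol.

```python
# Standard InChI for acetic acid.
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"

# Hypothetical lookup table: systematic and common names share one identifier.
NAME_TABLE = {
    "acetic acid": ACETIC_ACID,
    "ethanoic acid": ACETIC_ACID,
}


def resolve(identifier):
    """Return an InChI for a name or an InChI string, or None if unknown."""
    if identifier.startswith("InChI="):
        return identifier  # already in the target form
    return NAME_TABLE.get(identifier.strip().lower())


def split_inchi(inchi):
    """Split an InChI into version, formula and single-letter-prefixed layers.

    Simplification: if a prefix occurs more than once (as it can after a
    fixed-hydrogen layer), only the last occurrence is kept.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    return {
        "version": version,
        "formula": formula,
        "layers": {layer[0]: layer[1:] for layer in layers if layer},
    }
```

Two services that both return the same InChI for "acetic acid" and "ethanoic acid" can be linked without sharing a name dictionary, which is the kind of interoperability a common protocol would aim to guarantee.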
Semantic Enrichment of the Scientific Literature (SESL)
• Pistoia-funded
• EBI
• Elsevier, NPG, OUP, RSC
• Oct 2009 – Oct 2010
[Diagram: Biomedical Knowledge Service Framework. Labels: Content Suppliers; Corpus 1, Db 2, Db 3, Db 4, Corpus 5; Supplier Firewall; Service Layer; Assertion & Meta Data Mgmt; Transform / Translate; Integrator; Common Service Broker; 'Consumer' Firewall; Multiple Consumers; Knowledge Applications; Std Public Vocabularies; Business Rules; Open Stds; Effort required to fit DBs to service layer.]
SESL deliverables
• Pilot to deliver target-disease assertions
• Publication of data, application and web service standards
• So: to deliver standards for semantic delivery
Standards - are we there yet?
• InChI Trust
  • Compound standards
  • Reaction InChIs
  • Resolver Protocol
• Pistoia/EBI
  • Semantic standards for web services
• Microsoft/Academia
  • oreChem
  • Chem4Word
• Semantic markup by publishers
Challenges for RSC
Open problems
• Chemical structures from images
• Productive identifiers for productively-named entities
Putting ChemMantis and Prospect together
• Backfile (to 1841)
• Microsoft Word as well as XML
• Name to structure conversion
...and for text miners and repositories
Inputs and outputs
• Who’s putting the data in?
• Who’s curating it?
• Who wants to use it?
• ...and what for?
Standards implementation
• Compelling use cases now here