New Software Developments on
Chemical Information Extraction from Patent
Documents and Markush Structure Analysis
Wei Deng (David)
PIUG Meeting
May 2nd, 2012
Denver CO
ChemAxon’s Naming Technology
• Name to structure
– IUPAC, traditional and common names
– A library of existing drugs
– Support CAS Registry number
– Homology group: alkyl, aryl …
– Future: Biological names
• Structure to Name
– IUPAC Name, traditional names
• Accuracy and coverage constantly improving
• Also available from the command line
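CAS Registry Number support, listed above, is one part of name-to-structure input that can be illustrated without any vendor API. The sketch below is not ChemAxon's implementation; it only shows the publicly documented CAS check-digit rule, and the function names `is_valid_cas` and `find_cas_numbers` are hypothetical.

```python
import re

# A CAS Registry Number has 2-7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left and sum the products.
    total = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == check


def find_cas_numbers(text: str) -> list[tuple[str, int]]:
    """Return (CAS number, character offset) pairs for valid numbers in text."""
    return [
        (m.group(0), m.start())
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]


if __name__ == "__main__":
    print(find_cas_numbers("Aspirin (50-78-2) and a mistyped 50-78-3"))
```

Validating the check digit before any lookup filters out many strings that merely look like CAS numbers, such as part numbers.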
ChemAxon’s “Document to Structure”
• Extract chemical information from documents
– Names: powered by the Naming Technology
– Also imports SMILES, InChI, CAS numbers …
– Images: OSRA
– Returns structures and their locations in the document
• Works with scanned PDF since 5.8 (Feb 2012)
– Great for patent mining
• OCR and syntax correction under continuous development
– OCR output: 3-rnethyl-l-me- thoxynaphthalene
– Corrected: 3-methyl-1-methoxynaphthalene
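The OCR example above can be reproduced with a few rule-based substitutions. This is a minimal sketch under stated assumptions, not the correction logic used in Document to Structure: the function `correct_ocr_name` and the table `FRAGMENT_FIXES` are hypothetical, and the rules cover only the confusions visible in the example ("rn" read for "m", "l" read for "1", and a word hyphenated across a line break).

```python
import re

# Hypothetical, hand-picked fragment fixes for the "rn" vs "m" confusion.
# A real system would need a much larger, validated list.
FRAGMENT_FIXES = {
    "rneth": "meth",
    "arnino": "amino",
}


def correct_ocr_name(raw: str) -> str:
    """Apply a few illustrative OCR repairs to a chemical name."""
    # Rejoin a word that was hyphenated across a line break ("me- thoxy").
    name = re.sub(r"-\s+", "", raw)
    # Repair known fragments where "rn" was read in place of "m".
    for wrong, right in FRAGMENT_FIXES.items():
        name = name.replace(wrong, right)
    # A lone "l" used as a locant (not preceded by a letter, followed by a
    # hyphen) is almost certainly the digit "1".
    name = re.sub(r"(?<![A-Za-z])l(?=-)", "1", name)
    return name


if __name__ == "__main__":
    print(correct_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
```

One known limitation of this sketch: a genuine hyphen that happens to fall at a line break is removed along with the line-break hyphen, so the output would still need to be checked against a name parser.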
From Document to Structures
Non-searchable patent (50 pages) Structure (text + image) + location
Search by Structure or Text
Non-searchable PDF is now Searchable
ChemAxon’s “Document to Structure”
• New Features in 5.9 (Mar 2012)
– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …
– Embedded structure objects (ChemDraw, Symyx, Marvin …)
– Progressively display result
– Speed improvement
– Instant JChem integration; simplified API
• Currently in development for 5.10 (May 2012)
– Image-to-structure “Confidence”
– Fragment groups integration with Markush generation
– Biologic names
Free Online Service Chemicalize.org
• Extract chemical information from web pages and documents
• Interactively display all structures and their predicted properties
• Search all structures extracted
• Gather links of interest to chemists for post processing (search,
analysis, reporting, fun…)
• Recently reviewed in the Journal of Chemical Information and
Modeling
Webpage - chemicalized
• All chemical names are highlighted with dotted line
• Mousing over a name pops up the structure image
• Clicking the image opens the data page
• Links are “respected”
Data Page: Extensive Predicted Properties
• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit
Webpage - chemicalized
• All structures are summarized above the chemicalized page
• Click on a structure to highlight all occurrences. Click again to
navigate to the next occurrence
• All structures can be downloaded as MRV or SDF (useful with
online patent full text)
PDF File - chemicalized
Aspirin: query highlighted in results
Searching Chemicalize.org – Structure Search
• Aspirin; web page hits - “show” related structures
• Autosuggest while typing
Searching Chemicalize.org – Keyword Search
Everything is Published
• Recently viewed
– Webpages
– Structures
– Documents
– Searched queries (structure and keyword)
Availability and Customization
• Source code available
• Only minor changes to the example code are required
for customization, such as
– Importing extracted structures into other databases
– Post-process filtering according to properties
– Batch processing of multiple documents
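The customizations listed above (post-process filtering by property, batch handling of several documents) can be pictured with a small data model. The sketch below does not use the Chemicalize.org or Document to Structure source code; the `ExtractedStructure` record, its fields, and both helper functions are hypothetical, and the sample records are illustrative input rather than real extraction output.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedStructure:
    """Hypothetical record for one structure extracted from a document."""
    name: str
    smiles: str
    source: str  # document the structure was found in
    properties: dict[str, float] = field(default_factory=dict)


def filter_by_property(records, key, low, high):
    """Keep records whose property `key` exists and lies in [low, high]."""
    return [
        r for r in records
        if key in r.properties and low <= r.properties[key] <= high
    ]


def group_by_source(records):
    """Group records by source document, as a batch run would report them."""
    grouped: dict[str, list[ExtractedStructure]] = {}
    for r in records:
        grouped.setdefault(r.source, []).append(r)
    return grouped


if __name__ == "__main__":
    # Illustrative records; document names are placeholders.
    records = [
        ExtractedStructure("aspirin", "CC(=O)Oc1ccccc1C(=O)O", "doc_a.pdf", {"mass": 180.16}),
        ExtractedStructure("caffeine", "Cn1cnc2c1c(=O)n(C)c(=O)n2C", "doc_a.pdf", {"mass": 194.19}),
        ExtractedStructure("ethanol", "CCO", "doc_b.pdf", {"mass": 46.07}),
    ]
    kept = filter_by_property(records, "mass", 100.0, 190.0)
    print([r.name for r in kept])
    print({src: len(rs) for src, rs in group_by_source(records).items()})
```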
MARKUSH TECHNOLOGY
UPDATE
ChemAxon - Thomson Reuters
Markush project history
1987 Thomson Scientific (Derwent) starts indexing Markush
structures (in collaboration with Questel & INPI)
1998 INPI & Derwent Markush databases merge to form MMS
(Merged Markush Service)
2000 ChemAxon launches first version of JChem Base
2005 ChemAxon starts working on Markush technology
2008 Markush search & enumeration first release in JChem 5.0
2010 Markush DARC file format support in JChem 5.3
2011 Full MMS searchable with JChem 5.5-5.8
Search the Full Patent Database
• Complete patent database from Thomson Reuters (Markush +
exemplified + non-structural information, dating back to 1987)
• Data internally hosted or on Amazon Cloud
• Powerful virtual machine, secure connection and confidential
search
• Useful new features:
– Export exemplified structures
– Retrieve patent document
– Enumerate Markush structures and output result
– Notation
• Batch search of multiple queries
• Constantly improving search performance
New Interface, New Buttons, New Features
• All information in one place
Export Exemplified Structures
Retrieve Patent Document
Add Notes
Notation Overview
Search in Instant JChem
• Search in both exemplified and
Markush structures
• Various structure search options:
– Substructure or full
– Broad translation
– Stereochemistry, tautomer
– Atom/bond matching
• Text search (including dates)
• Multiple search results can be
creatively combined
• Flexible visualization functions to
display results, with scripting support
• Integrated with new interfaces for
navigation and enumeration
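Combining several search results, as mentioned in the list above, amounts to set operations on the hit lists. The sketch below is an illustration only and does not use the Instant JChem API; the function names and the placeholder identifiers "P1" to "P5" are hypothetical.

```python
from itertools import combinations


def combine_all(results: dict[str, set[str]]) -> set[str]:
    """Hits returned by every search (logical AND)."""
    sets = list(results.values())
    return set.intersection(*sets) if sets else set()


def combine_any(results: dict[str, set[str]]) -> set[str]:
    """Hits returned by at least one search (logical OR)."""
    return set().union(*results.values())


def pairwise_overlap(results: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Shared hits for each pair of searches."""
    return {
        (a, b): results[a] & results[b]
        for a, b in combinations(results, 2)
    }


if __name__ == "__main__":
    # Placeholder hit lists keyed by the kind of search that produced them.
    results = {
        "substructure": {"P1", "P2", "P3"},
        "text": {"P2", "P3", "P4"},
        "date": {"P3", "P5"},
    }
    print(sorted(combine_all(results)))
    print(sorted(combine_any(results)))
    print(pairwise_overlap(results))
```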
Improved R-group Hit Visualization
Integrated with Markush Viewer
• Substructure hit visualization
– Scaffold only
– Scaffold + relevant R-groups
– Scaffold + all R-groups, with relevant R-groups
colored
Query
Result in original Markush
Reduced result
Hit Expansion
Hit alignment
Hit colouring
Structure cleaning
Batch Search of Multiple Queries
OTHER MARKUSH-RELATED
ANALYSIS
Markush Structure Features I
• Atom lists, bond lists
• Position variation bond
• Link nodes and repeating units
• R-groups
– Multiple attachment points
– Up to thousands of R-group definitions
– Nested to any depth
Markush Structure Features II
• Homology groups (“Superatoms”, “Generic definitions”)
(properties)
• Easy to understand graphical representation
• All supported in MRV file
Markush Viewer
Markush Enumeration
Functionality: Full
Sequential
Random
Calculate library size
Scaffold alignment
and coloring
Markush code
Homology group enumeration
Post filter
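The enumeration modes listed above (full, sequential, random, library size) can be shown on a toy model. The sketch below is not JChem's Markush enumeration: it treats a Markush structure as a string template with named R-group slots, does not handle nesting or homology groups, and does not validate the resulting strings as chemistry. All function names and the example scaffold are hypothetical.

```python
import itertools
import math
import random


def library_size(r_groups: dict[str, list[str]]) -> int:
    """Number of enumerated structures: product of the R-group list sizes."""
    return math.prod(len(options) for options in r_groups.values())


def enumerate_full(scaffold: str, r_groups: dict[str, list[str]]):
    """Lazily yield every combination; take a prefix for sequential mode."""
    names = list(r_groups)
    for combo in itertools.product(*(r_groups[n] for n in names)):
        yield scaffold.format(**dict(zip(names, combo)))


def enumerate_random(scaffold, r_groups, count, seed=None):
    """Draw `count` random members (with replacement) without full enumeration."""
    rng = random.Random(seed)
    return [
        scaffold.format(**{n: rng.choice(opts) for n, opts in r_groups.items()})
        for _ in range(count)
    ]


if __name__ == "__main__":
    # Toy scaffold written as a SMILES-like template with two slots.
    scaffold = "c1ccc({R1})cc1{R2}"
    r_groups = {"R1": ["C", "CC", "F"], "R2": ["O", "N"]}
    print(library_size(r_groups))
    print(list(itertools.islice(enumerate_full(scaffold, r_groups), 2)))
    print(enumerate_random(scaffold, r_groups, count=3, seed=0))
```

Computing the library size first is cheap and tells the user whether full enumeration is practical or whether sequential or random sampling is the better choice.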
Re-designed Markush Enumeration Interface
• Markush reduction (hit expansion) according to query
• Query aligned and colored in enumerated structures
• Post-filtering and structure export
• Improved enumeration speed
Other Features
• R-group decomposition in molecule tables
• Creation of Markush structure from selected rows.
Future Work
• Improve search speed and accuracy
• Additional query variations
• Better visualization
• Integration with Document to Structure to
extract chemical information from patent
documents
• Collaboration with Linguamatics
Hunting for Hidden Treasures
• A CINF Symposium regarding “chemical
information in patents and other documents”
• ACS meeting in Philadelphia, August 19-23,
2012.
• Current speakers from
– Content providers
– Software providers
– Pharmaceutical users
• One open slot on Markush structure
analysis
Acknowledgements
JChem base, Markush and IJC
Helpers - JChem WS, Cartridge,
Core & Marvin, Marketing
• Steve Hajkowski
• Brian Larner
• Don Walter
• Gez Cross
• Tony Ferns
• Tim Miller
Backup Slides
Markush User Community
• Markush user community
– IP experts / patent searchers / information scientists
– Patent lawyers / Patent agents
– Medicinal chemists
– Computational chemists/Cheminformaticians
• Goal – Bring Markush search to a wider
community
Import from Thomson Reuters
Query atom
any, metal, hetero ...
Atom topology (ring, chain)
Stereochemistry (E/Z, tetrahedral)
Aromatic, aliphatic atoms
Substitution count
Block substitution (s*)
H count
Explicit H full support
Ring bond count
Isolate ring on atoms (rb*)
Additional Markush Query Features I
on Atoms
Additional Markush Query Features II
Bond topology (chain/ring)
Equal homology translation
Broad translation switchable
Simple R-group queries
on Bonds
and other ...
Multiple Queries: Overlap Analysis