New Software Developments on Chemical Information Extraction from Patent Documents and Markush Structure Analysis. Wei Deng (David). PIUG Meeting, May 2nd, 2012, Denver CO

Page 1: New Software Developments on Chemical Information

New Software Developments on Chemical Information Extraction from Patent Documents and Markush Structure Analysis

Wei Deng (David)

PIUG Meeting

May 2nd, 2012

Denver CO

Page 2: New Software Developments on Chemical Information

ChemAxon’s Naming Technology

• Name to structure

– IUPAC, traditional and common names

– A library of existing drugs

– Support CAS Registry number

– Homology group: alkyl, aryl …

– Future: Biological names

• Structure to Name

– IUPAC Name, traditional names

• Accuracy and coverage constantly improving

• Also available from command-line

Page 3: New Software Developments on Chemical Information

ChemAxon’s “Document to Structure”

• Extract chemical information from documents

– Names: powered by the Naming Technology

– Also imports SMILES, InChI, CAS numbers …

– Images: OSRA

– Returns structures and their locations in the document

• Works with scanned PDF since 5.8 (Feb 2012)

– Great for patent mining

• OCR and syntax correction under continuous development

– OCR output: 3-rnethyl-l-me- thoxynaphthalene

– Corrected: 3-methyl-1-methoxynaphthalene
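
The OCR example above can be reproduced with a few text rules. The sketch below is only an illustration of the idea, not ChemAxon's correction algorithm: the function name `normalize_ocr_name` and the three rules are assumptions chosen to cover this one example (rejoining a word split at a line break, reading a lone "l" between hyphens as the locant "1", and reading "rn" as "m" in "methyl").

```python
import re


def normalize_ocr_name(raw: str) -> str:
    """Apply three illustrative OCR fixes to a chemical name (hypothetical rules)."""
    # Rule 1: rejoin a word broken across a line, e.g. "me- thoxy" -> "methoxy".
    name = re.sub(r"(?<=[a-z])-\s+(?=[a-z])", "", raw)
    # Rule 2: a lone "l" that is not part of a word and is followed by a hyphen
    # is treated as the locant "1".
    name = re.sub(r"(?<![A-Za-z])l(?=-)", "1", name)
    # Rule 3: "rn" is a common OCR misreading of "m"; fix it in "methyl" only.
    name = name.replace("rnethyl", "methyl")
    return name


print(normalize_ocr_name("3-rnethyl-l-me- thoxynaphthalene"))
# prints 3-methyl-1-methoxynaphthalene
```

A production corrector would need a much larger rule set or a dictionary of name fragments, and would have to check that the corrected name actually converts to a structure.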

Page 4: New Software Developments on Chemical Information

From Document to Structures

Non-searchable patent (50 pages) → Structure (text + image) + location

Page 5: New Software Developments on Chemical Information

Search by Structure or Text

Page 6: New Software Developments on Chemical Information

Non-searchable PDF is now Searchable

Page 7: New Software Developments on Chemical Information

ChemAxon’s “Document to Structure”

• New Features in 5.9 (Mar 2012)

– MS Office document: doc, docx, ppt, pptx, xls, xlsx, odt …

– Embedded structure objects (ChemDraw, Symyx, Marvin …)

– Progressive display of results

– Speed improvement

– Instant JChem Integration; Simplified API

• Currently in development for 5.10 (May 2012)

– Image-to-structure “Confidence”

– Fragment groups integration with Markush generation

– Biologic names

Page 8: New Software Developments on Chemical Information

Free Online Service Chemicalize.org

• Extract chemical information from web pages and documents

• Interactively display all structures and their predicted properties

• Search all structures extracted

• Gather links of interest to chemists for post processing (search, analysis, reporting, fun…)

• Recently reviewed in the Journal of Chemical Information and Modeling

Page 9: New Software Developments on Chemical Information

Webpage - chemicalized

• All chemical names are highlighted with a dotted line

• Mousing over a name pops up the structure image

• Clicking on the image leads to the data page

• Links are “respected”

Page 10: New Software Developments on Chemical Information

Data Page: Extensive Predicted Properties

• Customizable report layout for calculation results. Users can move, open, close, and expand calculation boxes, and the layout is remembered on the next visit

Page 11: New Software Developments on Chemical Information

Webpage - chemicalized

• All structures are summarized above the chemicalized page

• Click on a structure to highlight all occurrences. Click again to navigate to the next occurrence

• All structures can be downloaded as MRV or SDF (useful with online patent full text)

Page 12: New Software Developments on Chemical Information

PDF File - chemicalized

Page 13: New Software Developments on Chemical Information

Searching Chemicalize.org – Structure Search

Aspirin: query highlighted in results

Page 14: New Software Developments on Chemical Information

Searching Chemicalize.org – Keyword Search

• Aspirin; web page hits - “show” related structures

• Autosuggest while typing

Page 15: New Software Developments on Chemical Information

Everything is Published

• Recently viewed

– Webpages

– Structures

– Documents

– Searched queries (structure and keyword)

Page 16: New Software Developments on Chemical Information

Availability and Customization

• Source code available

• Minor changes to the example code are required for customization, such as

– Import extracted structures to other databases

– Post-process filtering according to properties

– Batch processing of multiple documents
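
The customizations listed above (property-based post-filtering and batch handling of several documents) might look roughly like the sketch below. It does not use ChemAxon's API: the `ExtractedHit` record, the helper functions and the file names are all hypothetical, and the molecular weights are supplied by hand rather than calculated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedHit:
    """One extracted structure (hypothetical record layout)."""
    document: str      # source file the structure came from
    name: str          # chemical name as found in the text
    smiles: str        # structure written as SMILES
    mol_weight: float  # property used for filtering, supplied by hand here


def filter_by_weight(hits, low, high):
    """Keep hits whose molecular weight lies in the closed range [low, high]."""
    return [h for h in hits if low <= h.mol_weight <= high]


def group_by_document(hits):
    """Collect hits per source document, preserving input order."""
    grouped = {}
    for h in hits:
        grouped.setdefault(h.document, []).append(h)
    return grouped


# Hypothetical extraction results from two documents.
hits = [
    ExtractedHit("patent_a.pdf", "aspirin", "CC(=O)Oc1ccccc1C(=O)O", 180.16),
    ExtractedHit("patent_a.pdf", "benzene", "c1ccccc1", 78.11),
    ExtractedHit("patent_b.pdf", "caffeine", "Cn1cnc2c1c(=O)n(C)c(=O)n2C", 194.19),
]

kept = filter_by_weight(hits, 150.0, 500.0)
for document, doc_hits in group_by_document(kept).items():
    print(document, [h.name for h in doc_hits])
```

The same grouped result could then be written to another database, which is the first customization on the list.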

Page 17: New Software Developments on Chemical Information

MARKUSH TECHNOLOGY UPDATE

Page 18: New Software Developments on Chemical Information

ChemAxon - Thomson Reuters Markush project history

1987 Thomson Scientific (Derwent) starts indexing Markush structures (in collaboration with Questel & INPI)

1998 INPI & Derwent Markush databases merge to form MMS (Merged Markush Service)

2000 ChemAxon launches first version of JChem Base

2005 ChemAxon starts working on Markush technology

2008 Markush search & enumeration first release in JChem 5.0

2010 Markush DARC file format support in JChem 5.3

2011 Full MMS searchable with JChem 5.5-5.8

Page 19: New Software Developments on Chemical Information

Search the Full Patent Database

• Complete patent database from Thomson Reuters (Markush + exemplified + non-structural information, dating back to 1987)

• Data internally hosted or on Amazon Cloud

• Powerful virtual machine, secure connection and confidential search

• Useful new features:

– Export exemplified structures

– Retrieve patent document

– Enumerate Markush structures and output result

– Notation

• Batch search of multiple queries

• Constantly improving search performance

Page 20: New Software Developments on Chemical Information

New Interface, New Buttons, New Features

• All information in one place

Page 21: New Software Developments on Chemical Information

Export Exemplified Structures

Page 22: New Software Developments on Chemical Information

Retrieve Patent Document

Page 23: New Software Developments on Chemical Information

Add Notes

Page 24: New Software Developments on Chemical Information

Notation Overview

Page 25: New Software Developments on Chemical Information

Search in Instant JChem

• Search in both exemplified and Markush structures

• Various structure search options:

– Substructure or full

– Broad translation

– Stereochemistry, tautomer

– Atom/bond matching

• Text search (including dates)

• Multiple search results can be creatively combined

• Flexible visualization functions to display results, with scripting support

• Integrated with new interfaces for navigation and enumeration
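
Combining multiple search results comes down to set operations on hit lists. The sketch below is an illustration in plain Python, not Instant JChem's scripting interface; the query labels, the patent identifiers `P1` to `P5` and the helper `pairwise_overlap` are all made up for the example.

```python
from itertools import combinations


def pairwise_overlap(hits_by_query):
    """Return the hits shared by every pair of queries, keyed by (query_a, query_b)."""
    return {
        (a, b): hits_by_query[a] & hits_by_query[b]
        for a, b in combinations(sorted(hits_by_query), 2)
    }


# Hypothetical hit lists: each query maps to the set of patents it matched.
hits_by_query = {
    "substructure": {"P1", "P2", "P3"},
    "tautomer": {"P2", "P3", "P4"},
    "text_2010_onward": {"P3", "P4", "P5"},
}

in_all = set.intersection(*hits_by_query.values())  # matched by every query
in_any = set.union(*hits_by_query.values())         # matched by at least one
structure_only = hits_by_query["substructure"] - hits_by_query["text_2010_onward"]
overlaps = pairwise_overlap(hits_by_query)

print(sorted(in_all))          # prints ['P3']
print(sorted(structure_only))  # prints ['P1', 'P2']
```

The pairwise table is the same idea as an overlap analysis across several queries: each cell lists the patents two queries have in common.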

Page 26: New Software Developments on Chemical Information

Improved R-group Hit Visualization

Integrated with Markush Viewer

Page 27: New Software Developments on Chemical Information

• Substructure hit visualization

– Scaffold only

– Scaffold + relevant R-groups

– Scaffold + all R-groups, with relevant R-groups colored

Figure labels: Query; Result in original Markush; Reduced result; Hit expansion; Hit alignment; Hit colouring; Structure cleaning

Page 28: New Software Developments on Chemical Information

Batch Search of Multiple Queries

Page 29: New Software Developments on Chemical Information

Batch Search of Multiple Queries

Page 30: New Software Developments on Chemical Information

OTHER MARKUSH-RELATED ANALYSIS

Page 31: New Software Developments on Chemical Information

Markush Structure Features I

• Atom lists, bond lists

• Position variation bond

• Link nodes and repeating units

• R-groups

– Multiple attachment points

– Up to thousands of R-group definitions

– Nested to any depth

Page 32: New Software Developments on Chemical Information

Markush Structure Features II

• Homology groups (“Superatoms”, “Generic definitions”)

(properties)

• Easy to understand graphical representation

• All supported in MRV file

Page 33: New Software Developments on Chemical Information

Markush Viewer

Page 34: New Software Developments on Chemical Information

Markush Enumeration

Functionality:

• Full

• Sequential

• Random

• Calculate library size

• Scaffold alignment and coloring

• Markush code

• Homology group enumeration

• Post filter
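
The enumeration modes listed above (full, sequential, random, library size) can be illustrated on a toy Markush structure. This is a minimal sketch under strong assumptions, not the JChem implementation: the scaffold is a SMILES string with `{R1}` and `{R2}` placeholders, each R-group is a flat list of fragments, and nested R-groups, homology groups and position variation are not handled. All function names are hypothetical.

```python
import itertools
import math
import random


def library_size(rgroups):
    """Number of enumerated structures: the product of the R-group list lengths."""
    return math.prod(len(members) for members in rgroups.values())


def enumerate_full(scaffold, rgroups):
    """Lazily yield every combination of R-group members, in a fixed order."""
    names = list(rgroups)
    for combo in itertools.product(*(rgroups[n] for n in names)):
        yield scaffold.format(**dict(zip(names, combo)))


def enumerate_sequential(scaffold, rgroups, count):
    """Return only the first `count` structures of the full enumeration."""
    return list(itertools.islice(enumerate_full(scaffold, rgroups), count))


def enumerate_random(scaffold, rgroups, count, seed=0):
    """Draw `count` structures, picking each R-group member independently."""
    rng = random.Random(seed)
    return [
        scaffold.format(**{n: rng.choice(members) for n, members in rgroups.items()})
        for _ in range(count)
    ]


# Toy Markush: a benzene ring with two variable substituents.
scaffold = "c1ccc({R1})cc1{R2}"
rgroups = {"R1": ["C", "CC", "O"], "R2": ["F", "Cl"]}

print(library_size(rgroups))                       # prints 6
print(enumerate_sequential(scaffold, rgroups, 2))  # prints ['c1ccc(C)cc1F', 'c1ccc(C)cc1Cl']
```

Calculating the library size first matters in practice: a Markush structure with thousands of R-group definitions can describe far more structures than can be enumerated in full, which is where sequential and random enumeration become useful.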

Page 35: New Software Developments on Chemical Information

Re-designed Markush Enumeration Interface

• Markush reduction (hit expansion) according to query

• Query aligned and colored in enumerated structures

• Post-filtering and structure export

• Improved enumeration speed

Page 36: New Software Developments on Chemical Information

Other Features

• R-group decomposition in molecule tables

• Creation of Markush structure from selected rows.

Page 37: New Software Developments on Chemical Information

Future Work

• Improve search speed and accuracy

• Additional query variations

• Better visualization

• Integration with Document to Structure to extract chemical information from patent documents

• Collaboration with Linguamatics

Page 38: New Software Developments on Chemical Information

Hunting for Hidden Treasures

• A CINF Symposium regarding “chemical information in patents and other documents”

• ACS meeting in Philadelphia, August 19-23, 2012.

• Current speakers from

– Content providers

– Software providers

– Pharmaceutical users

• One open slot on Markush structure analysis

Page 39: New Software Developments on Chemical Information

Acknowledgements

JChem Base, Markush and IJC

Helpers - JChem WS, Cartridge, Core & Marvin, Marketing

• Steve Hajkowski

• Brian Larner

• Don Walter

• Gez Cross

• Tony Ferns

• Tim Miller

Page 40: New Software Developments on Chemical Information

Backup Slides

Page 41: New Software Developments on Chemical Information

Markush User Community

• Markush user community

– IP experts / patent searchers / Information scientists

– Patent lawyers / Patent agents

– Medicinal chemists

– Computational chemists/Cheminformaticians

• Goal

– Bring Markush search to a wider community

Page 42: New Software Developments on Chemical Information

Import from Thomson Reuters

Page 43: New Software Developments on Chemical Information

Additional Markush Query Features I: on Atoms

• Query atom: any, metal, hetero ...

• Atom topology (ring, chain)

• Stereochemistry (E/Z, tetrahedral)

• Aromatic, aliphatic atoms

• Substitution count

• Block substitution (s*)

• H count

• Explicit H full support

• Ring bond count

• Isolate ring on atoms (rb*)

Page 44: New Software Developments on Chemical Information

Additional Markush Query Features II: on Bonds and other ...

• Bond topology (chain/ring)

• Equal homology translation

• Broad translation switchable

• Simple R-group queries

Page 45: New Software Developments on Chemical Information

Multiple Queries: Overlap Analysis

Page 46: New Software Developments on Chemical Information

Multiple Queries: Overlap Analysis
