
Information Extraction from Chemical Images

Discovery Knowledge & Informatics

April 24th, 2006

Dr. Marc Zimmermann

Available Chemical Information

Textbooks

Reports

Patents

Databases

Scientific journals and publications

Websites

Representations of Chemical Compounds

Name (trivial, trade, brand, INN, USAN)

Registration numbers (CAS, NCI, Beilstein)

Formal description (sum formula, SMILES)

Chemical nomenclature (IUPAC, CAS, InChI)

Depictions

Example: Aspirin

Name: Acetylsalicylic acid, Aspirin, Bayer, Colfarit, Dolean PH 8, Duramax, Ecotrin, …

CAS: 50-78-2, SID: 35870, Formula: C9H8O4

IUPAC Name: 2-acetoxybenzoic acid

SMILES: CC(=O)OC1=CC=CC=C1C(=O)O

InChI: 1.12Beta/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12)

Depiction: [structure drawing]
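The sum formula is the simplest of these representations to process by machine. As a minimal sketch (not part of the original slides), the following Python splits a flat formula such as C9H8O4 into element counts and derives an approximate molar mass. The function names and the small table of atomic weights are assumptions made for illustration; nested formulas with brackets are not handled.

```python
import re

# Approximate standard atomic weights in g/mol (illustrative subset only).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007}

def parse_formula(formula):
    """Split a flat sum formula such as 'C9H8O4' into element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing number means a single atom of that element.
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts

def molar_mass(formula):
    """Sum the atomic weights of all atoms in the formula."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in parse_formula(formula).items())

aspirin_counts = parse_formula("C9H8O4")
aspirin_mass = molar_mass("C9H8O4")
```

A sum formula does not identify a compound uniquely (isomers share it), which is why the richer representations listed on the slide are needed.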

Information Extraction Methods

Names: dictionary based

Registration numbers: databases

Formal descriptions: rule based

Depictions: chemical OCR
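As an illustration of a rule-based extractor (a sketch, not the method used by the authors), CAS registry numbers can be found in text with a regular expression and then filtered by the CAS check digit. The check digit is the weighted sum of the preceding digits, counted from the right with weights 1, 2, 3, ..., taken modulo 10. The function names are hypothetical.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")

def is_valid_cas(candidate):
    """Check the format and the check digit of a CAS registry number."""
    match = CAS_PATTERN.fullmatch(candidate)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

def extract_cas_numbers(text):
    """Return all substrings that look like CAS numbers and pass the check."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text)
            if is_valid_cas(m.group(0))]

sample = "Aspirin (CAS 50-78-2) was compared with the mistyped entry 50-78-3."
found = extract_cas_numbers(sample)
```

The checksum step removes many false positives that a pattern alone would accept, such as mistyped numbers.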

Representing a Chemical Compound

How much information do you want to include?

Atoms present

Connections between atoms

bond types

Isotopes

Charges

Stereochemical configuration

[Figure: example structure drawing showing an isotope label (14C) and charged groups (NH3+, O-)]
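The levels of detail listed above map naturally onto optional fields of a data structure. The following is a minimal sketch under that assumption; the class and field names are hypothetical, and the example molecule (a glycine zwitterion with an arbitrarily chosen carbon-14 label, hydrogens left implicit) is for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Atom:
    symbol: str
    isotope: Optional[int] = None   # mass number, e.g. 14 for carbon-14
    charge: int = 0                 # formal charge
    stereo: Optional[str] = None    # e.g. "R" or "S"; None if unspecified

@dataclass
class Bond:
    begin: int                      # index into Molecule.atoms
    end: int
    order: int = 1                  # 1 = single, 2 = double, 3 = triple

@dataclass
class Molecule:
    atoms: List[Atom] = field(default_factory=list)
    bonds: List[Bond] = field(default_factory=list)

    def net_charge(self):
        return sum(atom.charge for atom in self.atoms)

    def neighbors(self, index):
        """Indices of atoms bonded to the atom at the given index."""
        result = []
        for bond in self.bonds:
            if bond.begin == index:
                result.append(bond.end)
            elif bond.end == index:
                result.append(bond.begin)
        return result

# Glycine zwitterion, heavy atoms only: N(+)-C-C(=O)-O(-)
glycine = Molecule(
    atoms=[Atom("N", charge=1), Atom("C"), Atom("C", isotope=14),
           Atom("O"), Atom("O", charge=-1)],
    bonds=[Bond(0, 1), Bond(1, 2), Bond(2, 3, order=2), Bond(2, 4)],
)
```

Each optional field corresponds to one item of the list: leaving it at its default means the information is simply not recorded.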

Modeling of Chemicals as Graphs

Why use graph theory?

Established mathematical field

Graphs can be easily represented in computers

Existing algorithms for comparison, searching, etc.

Unlike humans, computers aren’t very good at pattern recognition

Similar or same?
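The "similar or same?" question is a graph isomorphism test: two drawings describe the same compound if some renumbering of the atoms maps one labelled graph onto the other. A brute-force sketch in Python, assuming atoms are given as a list of element symbols and bonds as index pairs (the function name is hypothetical, and the approach is only practical for very small graphs):

```python
from itertools import permutations

def same_graph(labels_a, edges_a, labels_b, edges_b):
    """Return True if the two labelled graphs are isomorphic."""
    n = len(labels_a)
    if n != len(labels_b) or len(edges_a) != len(edges_b):
        return False
    target = {frozenset(edge) for edge in edges_b}
    for perm in permutations(range(n)):
        # The renumbering must preserve the element of every atom ...
        if any(labels_a[i] != labels_b[perm[i]] for i in range(n)):
            continue
        # ... and map the bond set of A exactly onto the bond set of B.
        if {frozenset((perm[u], perm[v])) for u, v in edges_a} == target:
            return True
    return False

# Ethanol numbered in two different ways, and dimethyl ether (same formula).
ethanol_1 = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_2 = (["O", "C", "C"], [(0, 1), (1, 2)])
ether = (["C", "O", "C"], [(0, 1), (1, 2)])
```

Here `same_graph(*ethanol_1, *ethanol_2)` succeeds, while ethanol and dimethyl ether, which share the formula C2H6O, are told apart. The factorial cost of trying every renumbering is one reason dedicated graph algorithms are used in practice.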

Computer Representation

A typical example: MDL MOL file (SDF)

For more information on MDL formats, see http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
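To give an impression of the format, here is a minimal sketch that writes a V2000 MOL block: three header lines, a counts line, one fixed-width line per atom and per bond, and the terminating "M  END". The function name, the three-atom example and its coordinates are assumptions for illustration; charges, stereo flags and property lines are all left at zero or omitted.

```python
def to_molblock(atoms, bonds, name=""):
    """Write a minimal V2000 MOL block.

    atoms: list of (symbol, x, y) tuples
    bonds: list of (begin, end, order) tuples with 1-based atom numbers
    """
    lines = [name, "", ""]  # header: name, program/timestamp, comment
    lines.append(f"{len(atoms):3d}{len(bonds):3d}"
                 "  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y in atoms:
        # x, y, z in 10.4 fixed width, a space, the symbol padded to three
        # characters, a two-character field and eleven three-character fields.
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3} 0" + "  0" * 11)
    for begin, end, order in bonds:
        lines.append(f"{begin:3d}{end:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines)

# Hypothetical C-C-O fragment with made-up 2D coordinates.
molblock = to_molblock(
    atoms=[("C", 0.0, 0.0), ("C", 1.5, 0.0), ("O", 2.25, 1.3)],
    bonds=[(1, 2, 1), (2, 3, 1)],
)
mol_lines = molblock.split("\n")
```

An SD file is a sequence of such blocks, each optionally followed by data fields and closed by a line containing `$$$$`.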

Disadvantages of Using Graphs

Many graph algorithms are inherently slow

Analogy between chemical structures and graphs is not perfect

Realities of chemical structures cause problems:

aromaticity

stereochemistry

tautomerism

inorganic compounds

macromolecules and polymers

incompletely-defined substances

Good News

There is only a limited number of chemical drawing tools (and these use templates):

ChemDraw (CambridgeSoft)

ChemSketch (ACD)

ISIS/Draw (MDL)

Java applets (ChemAxon)

...

Reduced complexity

chemOCR: Reconstruction of Chemical Compounds

1 Document

2 Depiction

3 Reconstruction

4 SDF file

  -ISIS-  09230315072D

 27 29  0  0  0  0  0  0  0  0999 V2000
   -0.9348   -0.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9359   -1.2274    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2211   -1.6402    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4953   -1.2269    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4925   -0.3964    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2229    0.0128    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0750   -1.8084    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0708   -2.6334    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7875   -1.3917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5042   -1.8034    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2162   -1.3874    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2120   -0.5611    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4899   -0.1526    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7808   -0.5709    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0042   -0.3417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.0083   -1.5959    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
    4.4125   -1.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2375   -1.0542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.4792   -3.2167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2167   -2.3917    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
    5.0125   -2.1750    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.4167   -2.6042    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.4292   -3.1875    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    5.2250   -3.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0125   -3.9000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3458   -3.2125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.8875   -3.9292    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
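The SDF excerpt on this slide uses the fixed-width V2000 layout, so it can be read back by slicing columns rather than splitting on whitespace. A minimal sketch, assuming well-formed input; the function names are hypothetical and only the fields visible on the slide (counts, coordinates, element symbol, charge code, stereo parity) are decoded:

```python
def parse_counts_line(line):
    """Return (number of atoms, number of bonds) from a V2000 counts line."""
    return int(line[0:3]), int(line[3:6])

def parse_atom_line(line):
    """Decode one V2000 atom line into a dictionary."""
    return {
        "x": float(line[0:10]),
        "y": float(line[10:20]),
        "z": float(line[20:30]),
        "symbol": line[31:34].strip(),
        "charge_code": int(line[36:39]),    # 0 means uncharged
        "stereo_parity": int(line[39:42]),  # 3 = either or unmarked centre
    }

# Lines taken from the SDF excerpt on the slide.
counts = parse_counts_line(" 27 29  0  0  0  0  0  0  0  0999 V2000")
first_atom = parse_atom_line(
    "   -0.9348   -0.4000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0")
flagged_atom = parse_atom_line(
    "    4.0083   -1.5959    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0")
```

The counts line announces 27 atoms and 29 bonds; only the atom block is visible in the excerpt.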

CSR (Compound Structure Reconstruction)

raster images

common fragments module

molecule database

chemical cartridge

connected components

manual curation tool

machine learning tool

chemical rules module

page segmentation

image preprocessing

vectorizer

OCR

component classifier

s-atom database

approx. graph matcher

molecular graph converter

super-atoms

machine learning tool

Preprocessing Steps

Page segmentation

Image extraction

Image conversion (image restoration, adaptive binarization, ...)
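Adaptive binarization compares each pixel with its local surroundings instead of a single global threshold, which helps with scans that have uneven background brightness. The slides do not say which method CSR uses; the following is a generic local-mean sketch in plain Python, with hypothetical parameter names `radius` and `offset`.

```python
def adaptive_binarize(image, radius=1, offset=10):
    """Mark a pixel as ink (1) if it is darker than its local mean by `offset`.

    image: list of rows of grey values (0 = black, 255 = white)
    """
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            local_mean = sum(window) / len(window)
            if image[y][x] < local_mean - offset:
                result[y][x] = 1
    return result

# Background brightens from left to right; column 2 holds a dark stroke.
scan = [[100, 120, 20, 160, 180] for _ in range(5)]
binary = adaptive_binarize(scan)
```

On this toy scan only the stroke column is marked as ink, whereas a global threshold of 128 would also mark the two darker background columns on the left.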

Connected Component Analysis

Building an image tree

Using adaptive nested TreeMaps
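The basic step behind this analysis is connected component labelling: grouping touching ink pixels into components and recording their bounding boxes. The sketch below shows only that step with a breadth-first search and 8-connectivity; it does not reproduce the image tree or the nested TreeMaps mentioned on the slide, and the function names are hypothetical.

```python
from collections import deque

def connected_components(binary):
    """Return a list of components, each a list of (row, col) ink pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first search over the 8-neighbourhood.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a component."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))

image = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
boxes = [bounding_box(c) for c in connected_components(image)]
```

Bounding boxes that contain one another could then be arranged into a tree, which is one plausible reading of "building an image tree".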

Component Classification

Single bonds

Double bonds

Thick chirals

Dotted chirals

Text

1 Raster image

2 Extract features

3 Classify as...

4 Manual curation
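Steps 2 and 3 can be sketched as a feature extractor followed by a simple classifier. The slides do not name the features or the learning method, so everything below is an assumption for illustration: two shape features (elongation and fill density), a nearest-centroid classifier, and made-up training values for two of the classes.

```python
import math

def shape_features(pixels):
    """Return (elongation, density) for a component of (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    elongation = max(height, width) / min(height, width)
    density = len(pixels) / (height * width)
    return (elongation, density)

def class_centroids(training):
    """Average the feature vectors of each class."""
    return {label: tuple(sum(column) / len(column) for column in zip(*vectors))
            for label, vectors in training.items()}

def classify(feature, centroids):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(feature, centroids[label]))

# Made-up training data: (elongation, density) per labelled component.
TRAINING = {
    "single bond": [(8.0, 1.0), (10.0, 1.0)],
    "text": [(1.0, 0.6), (1.5, 0.7)],
}
model = class_centroids(TRAINING)

stroke = [(0, c) for c in range(9)]                # thin horizontal line
blob = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1)]    # compact glyph-like shape
```

Components that lie far from every centroid are natural candidates for step 4, manual curation. A real classifier would also scale the features so that no single one dominates the distance.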

Atomtype Reconstruction

1 Train new characters

2 Define new superatoms

3 Expand superatoms

Need for a chemically intelligent OCR
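Superatoms are abbreviations such as Me or Ph that stand for a whole fragment. Defining and expanding them can be sketched as a lookup table from the recognised label to a fragment, here written as SMILES with the attachment point at the first atom. The table contents and function names are assumptions for illustration, not the contents of the CSR s-atom database.

```python
# Illustrative abbreviation table: label -> SMILES fragment.
SUPERATOMS = {
    "Me": "C",            # methyl
    "Et": "CC",           # ethyl
    "Ph": "c1ccccc1",     # phenyl
    "OMe": "OC",          # methoxy
    "COOH": "C(=O)O",     # carboxylic acid
}

def expand_superatom(label, table=SUPERATOMS):
    """Return the fragment for a label, or None if the label is unknown."""
    return table.get(label)

def define_superatom(label, fragment, table=SUPERATOMS):
    """Register a new abbreviation, e.g. after manual curation."""
    table[label] = fragment

unknown_before = expand_superatom("Boc")
define_superatom("Boc", "C(=O)OC(C)(C)C")   # tert-butoxycarbonyl
known_after = expand_superatom("Boc")
```

An unknown label returns `None`, which is the point where a curator would define the new superatom.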

Vectorization

Fixing vectorization errors using relative neighborhood graphs

Need for a chemically intelligent vectorizer

Disconnections

Dubious links

Antiparallel double bonds

Fixing bond lengths
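In a relative neighborhood graph, two points p and q are connected unless some third point r is closer to both of them than they are to each other, that is unless max(d(p, r), d(q, r)) < d(p, q). The sketch below computes this graph for line endpoints and proposes short edges that are not yet strokes as candidate links for repairing disconnections. How CSR actually uses the graph is not stated on the slide; the tolerance value and variable names are assumptions.

```python
from itertools import combinations
from math import dist

def relative_neighborhood_graph(points):
    """Return the edges (i, j), i < j, of the relative neighborhood graph."""
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])
        # The edge is blocked if a third point is closer to both endpoints.
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in range(len(points)) if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges

# Endpoints of two vectorized strokes separated by a small gap.
endpoints = [(0, 0), (10, 0), (11, 1), (20, 1)]
strokes = {(0, 1), (2, 3)}          # index pairs joined by a detected line
TOLERANCE = 2.0                     # hypothetical maximum gap length

rng_edges = relative_neighborhood_graph(endpoints)
candidate_links = [
    edge for edge in rng_edges
    if edge not in strokes and dist(endpoints[edge[0]], endpoints[edge[1]]) < TOLERANCE
]
```

For these four endpoints the only candidate is the short edge between points 1 and 2, which bridges the gap between the two strokes.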

Graph Matching

Using a line graph representation

Searching for subgraph isomorphism

Database with common fragments

Decomposition network for fragments

Recognizing new fragments

Graph matching: a solution for mapping bridged ring systems
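In a line graph every bond becomes a node, and two nodes are adjacent when the bonds share an atom, which suits vectorized drawings where lines are the primary objects. The following sketch builds a line graph and searches for a fragment in it with plain backtracking. It illustrates the idea only and is not the approximate matcher or decomposition network of CSR; the function names are hypothetical.

```python
def line_graph(bonds):
    """Return edges (i, j) between bond indices whose bonds share an atom."""
    edges = set()
    for i in range(len(bonds)):
        for j in range(i + 1, len(bonds)):
            if set(bonds[i]) & set(bonds[j]):
                edges.add((i, j))
    return edges

def find_subgraph(n_pattern, pattern_edges, n_target, target_edges):
    """Map pattern nodes to distinct target nodes so that every pattern
    edge lands on a target edge. Returns the mapping as a list or None."""
    p_adj = {frozenset(e) for e in pattern_edges}
    t_adj = {frozenset(e) for e in target_edges}

    def extend(mapping):
        k = len(mapping)            # next pattern node to place
        if k == n_pattern:
            return mapping
        for candidate in range(n_target):
            if candidate in mapping:
                continue
            # Every pattern edge to an already placed node must be preserved.
            if all(frozenset((mapping[u], candidate)) in t_adj
                   for u in range(k) if frozenset((u, k)) in p_adj):
                found = extend(mapping + [candidate])
                if found is not None:
                    return found
        return None

    return extend([])

ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]   # six-membered ring
chain = [(0, 1), (1, 2), (2, 3)]                          # three bonds in a row
triangle = [(0, 1), (1, 2), (2, 0)]                       # three-membered ring

ring_lg = line_graph(ring)
chain_match = find_subgraph(3, line_graph(chain), 6, ring_lg)
triangle_match = find_subgraph(3, line_graph(triangle), 6, ring_lg)
```

The chain is found in the ring, while the triangle is not. The search is exponential in the worst case, which is one motivation for a database of common fragments.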

Manual Curation of Errors

Reconstruction score

Editing bonds

Post Processing

Workflow plugin technology

2D beautify

File format conversion

2D to 3D conversion

Name generation

Property calculation / prediction

A Real Challenge

Data set with ~7,600 depictions of natural products

to get new scaffolds and super atoms

to incorporate the CSR workflow into a grid service

to add a database interface

But we need more real training sets (i.e. pictures and the solved structures)…

Current status: ~3,400 fully reconstructed!

Future Work

Incompletely-defined substances:

unknown stereochemistry

unknown attachment position

unknown repetition

[Figure: example structures with substituents OH, NH2 and Cl and a repeating unit marked n]

Markush (“Generic”) Structures and Reaction Schemes

shorthand for describing sets of structures with common features

structures with R-groups

very important in chemical patents

can be used to describe combinatorial libraries

can be used as queries in database searches

[Figure: Markush structure bearing OH, R1 and R2, where R1 = Br, I or Cl and R2 = alkyl chains of increasing length; * marks the attachment point]
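A Markush structure describes a combinatorial library: every choice of R-groups yields one concrete compound. A minimal enumeration sketch using string templates follows. The scaffold SMILES and the R2 list are hypothetical stand-ins (the scaffold in the slide's figure is not recoverable from the transcript); only the halogens for R1 follow the slide.

```python
from itertools import product

def enumerate_markush(scaffold, r_groups):
    """Fill every combination of R-group fragments into the scaffold.

    scaffold: SMILES template with placeholders such as {R1}
    r_groups: dict mapping placeholder names to lists of fragments
    """
    names = sorted(r_groups)
    return [scaffold.format(**dict(zip(names, combo)))
            for combo in product(*(r_groups[name] for name in names))]

# Hypothetical phenol scaffold with two substitution sites.
scaffold = "Oc1ccc({R1})c({R2})c1"
r_groups = {
    "R1": ["Br", "I", "Cl"],
    "R2": ["C", "CC", "CCC"],   # assumed alkyl chains
}
library = enumerate_markush(scaffold, r_groups)
```

Three choices for each of the two R-groups give nine structures. The size of such a library grows multiplicatively with every additional R-group, which is why generic structures in patents can cover very large compound sets.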

The Mission: Combination of CSR and Text Mining

Image Analysis / Structure Reconstruction: -CH3, -CH2-CH3, -CH2-CNHS, -COOH

Text Analysis / Entity Recognition: Cytochrome inhibition, PPAR activation, Stability in serum, Side effect, Blood-brain barrier

Reconstruction of Published Chem-, Pharm- and Patent Space

The Team (in the order of appearance)

Marc Zimmermann

Tanja Fey

Le Thuy Bui Thi

Christoph Friedrich

Yuan Wang

Maria-Elena Algorri

Miguel Alvarez

Wei Wang

CSR Software Demo available

CSR can extract chemical depictions from various image sources and convert them into SD files, which can be used in nearly all chemical software. It allows reconstructed molecules to be modified in a structure editor, it maintains superatom and bond (single, double, triple, or chiral) information, and it accepts user curation and scoring schemes to improve its performance.