
Chemical Structure Representation in the DuPont Chemical Information Management Solutions Database: Challenges Posed by Complex Materials in a Diversified Science Company

Mark A. Andrews and Edward S. Wilks

Information & Computing Technologies Division, DuPont CR&D

Boston ACS Meeting, Chemical Structure Representations Symp.

August 25, 2010

Outline of Talk: Chemical Structure Representation in the DuPont Chemical Information Management Solutions (CIMS) Database

Background

Structure Classes

Structure vs. Sample

Data Integrity for Registrations

Summary

DuPont: A Diversified Science Company (2009)

[Pie chart; segments as labeled]

• Agriculture / Food: 34%
• Motor Vehicles: 19%
• Other: 16%
• Construction Materials: 10%
• Electronics: 8%
• Plastics & Chemicals: 8%
• Textiles / Home Furnishings: 3%
• Aircraft & Aerospace: 2%

Structure Representation: The DuPont Problem

Cheminformatics in a Diversified Science Company

• Chemically intelligent structures: Needed for all types of materials
  Organics, inorganics, polymers, mixtures, formulations, incompletely defined substances (IDS), multi-layer films, composites, even devices

• Commercial chem. info. software tools: DuPont ≠ Pharma
  Could we push the envelope and vendors far enough?

• Chemical composition: Structure vs. Sample characteristics
  Which details of a material’s composition should be part of the chemical structure vs. which details are better viewed as characteristics of a specific sample?

• Structure registrations: Advanced mistake-proofing aids required
  Not just standardization of, e.g., nitro groups for end users but also sophisticated duplicate checking for trained registrars

Structure Representation: The Vendor Problem

DuPont’s Situation in 2000-2004

• DuPont had to bring its proprietary chem. info. DB in house from CAS
  Chance to reinvent the system to go beyond chemical indexing of proprietary documents to a full chem. info. registration system for diverse research groups

• Only a single vendor appeared up to the challenge
  Built-in support for representation and searching of polymers, multiple bracket types, and user-definable attached data

Vendor Situation in 2010 and beyond

• All chem. info. vendors have radically enhanced, or are radically enhancing, their products

• Middle-tier software offers plug-and-play options
  Supports multiple drawing tools and database cartridges, provides tools for very facile in-house customization

Structure Representation: The DuPont Solution

DuPont specialized structure identifiers

• Push attached data functionality to the limit
  - 7 polymer attached data types
  - 3 mixture attached data types
  - 3 ratio attached data types
  - 2 incompletely defined substance (IDS) attached data types
  - 5 special purpose attached data types

• Other structure registration features
  - ca. 75 controlled vocabulary values for above attached data fields
  - 4 Molecular Formula (MF) fields – computed and user entered, summed
  - Push some structural details down to the sample “history” level:
    -- You may not want to bog structure down with minor details, or
    -- Some “structural” aspects of materials are “product by process”

Structure Classes

Incompletely Described Substances (IDS)

Problem: What if there is no structure?

• ISIS No-Strucs provide no material identification information

• ISIS No-Strucs cannot be components of mixtures, etc.

Solution:

• Use * atom tagged with DuPont ID#

• Add “ID: xxx” descriptor tag

• Include User Entered molecular formula if known

Examples (structure figure; CAS = No-Strucs or Tabular Inorganics):

• ID: Kaolinite, *123456H, User MF = Al2.(OH)4.(Si2O5)1

• ID: Mineral Oil, *34567E, User MF = (null)

• ID: Al2O3 (generic), *98765F, and ID: Al2O3 (alpha), *765432D, User MF = Al2O3
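The three parts of the IDS solution (a * placeholder atom carrying an ID#, an “ID: xxx” descriptor, and an optional user-entered MF) can be pictured as one small record. The following is only an illustrative sketch: the class name `IdsRecord`, its field names, and the `label` format are assumptions made for this example and do not describe how CIMS stores the data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdsRecord:
    """Hypothetical record for an incompletely described substance (IDS)."""
    star_atom_id: str               # ID# carried by the * placeholder atom
    descriptor: str                 # text of the "ID: xxx" descriptor tag
    user_mf: Optional[str] = None   # user-entered MF; None when unknown

    def label(self) -> str:
        """Render the record in a form close to the slide examples."""
        mf = self.user_mf if self.user_mf is not None else "(null)"
        return f"ID: {self.descriptor} | *{self.star_atom_id} | User MF = {mf}"


# Values copied from the slide examples above.
kaolinite = IdsRecord("123456H", "Kaolinite", "Al2.(OH)4.(Si2O5)1")
mineral_oil = IdsRecord("34567E", "Mineral Oil")
```

Because the record carries its own identifier, it could be referenced as a component of a mixture, which is the capability the slide says plain No-Strucs lack.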

Incompletely Described Substances – Cont’d

Problem:

• Sometimes even (nominal) structures may not fully define a material

Solution:

• Add “ID: xxx” descriptor tag

Examples (structure figure):

• C(0) with ID: Graphite, ID: Charcoal, or ID: Diamond

• OH2 with ID: deionized or ID: reverse osmosis

Incompletely Described Substances – Cont’d

Problem:

• Substituents in unknown locations, etc.

Solution:

• Add “unknown xxx” tags

Examples (structure figures):

• Chlorotoluene: representations compared for CIMS, CAS, and Symyx Draw 3.1

• PCB 1254: representations compared for CIMS, Chem Draw 12, and Marvin

• Labels shown in the figures: Cl tagged “U = unknown locant”; “U = unknown degree of substitution”; “ID: 54 wt% Cl”; D1-Cl; D1-Me; “* Cl”

Incompletely Described Substances – Cont’d

Problem:

• Variable chain lengths, etc.

Solution:

• Use “tagged” polymers or Markush structures

*** Note: Markush-type features are currently supported only on query in some software packages, though most vendors are now also targeting them for registration

Example (structure figures): ID: Alcohols, C4-6

• CAS: “No Structure Available”

• CIMS: tagged polymer with D and OH end groups and repeat count n, attached data “mol” with values 4 and 6 and “U = unknown branching” (Note: D = H)

• Symyx Draw 3.1: R1-OH, with R1 = *C4H9, *C5H11, or *C6H13

Query example (structure figure): tagged polymer with D and OH end groups and repeat count n (Note: D = H)

In CIMS a “flexmatch” query with the above structure finds all variable chain length alcohols, e.g., C4-6, C10-14

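Mixtures

Example (structure figure): a mixture (mix) of type “emulsion” with three components (c). The figure calls out the ratio tags (“Ratios”) and the component role tags (“Roles”).

• 30 wt%: ID: Olive Oil, *13579H

• 70 wt%: OH2

• 0.1 wt%, role “surfactant”: Na+ salt of an S- and O-containing anion, tagged “U = unknown branching”, “U = unknown locant”, and “U = unknown chain length”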
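The mixture slide combines three kinds of attached data: a mixture type, a ratio on each component, and an optional role on each component. A minimal sketch of that shape follows; the class and field names (`Mixture`, `Component`, `role`, and so on) are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    substance: str                # structure or IDS label of the component
    amount: float                 # ratio value attached to the component
    unit: str = "wt%"             # ratio unit, e.g. "wt%" or "mol"
    role: Optional[str] = None    # optional role tag, e.g. "surfactant"


@dataclass
class Mixture:
    mixture_type: str             # e.g. "emulsion" or "composite"
    components: List[Component] = field(default_factory=list)

    def total(self, unit: str = "wt%") -> float:
        """Sum the ratio values of all components given in `unit`."""
        return sum(c.amount for c in self.components if c.unit == unit)

    def by_role(self, role: str) -> List[Component]:
        """Return the components that carry the given role tag."""
        return [c for c in self.components if c.role == role]


# The emulsion example from the slide.
emulsion = Mixture("emulsion", [
    Component("ID: Olive Oil (*13579H)", 30.0),
    Component("OH2", 70.0),
    Component("Na+ salt, unknown branching/locant/chain length", 0.1,
              role="surfactant"),
])
```

A registration check could, for instance, compare `emulsion.total()` with 100 and flag large deviations; what tolerance would be appropriate is not stated in the talk.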

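Cross-linked and Branched Polymers

Examples (structure figures; polymer tags shown include co, eu, ran, mon, mod, xl, br):

• Surlyn® 6000 series Mg Ionomers: polymer tagged co, eu with mon components (one bearing an acid group) and a mod / xl component in which Mg is bonded to O atoms. Cross-linked by after-treatment.

• Second polymer tagged co, eu with mon components, one of them tagged br / xl. Difunctional monomer gives branching and/or cross-linking.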

Complex Copolymers

Examples (structure figures):

• Jeffamine® ED-2003: MW=2000; NH2-terminated structure with three repeat units (n) carrying “mol” ratio tags (values 39 and 2 shown), with “U = unknown locant” and “U = main locant” tags

• Poly(THF/3-MeTHF) (Me in one of only 2 positions): polymer tagged co, eu, with H and OH end groups (eg) and “U = unknown locant” tags

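Multi-Layer Thin Films = Ordered “Mixtures”

(Aluminized PET)

Example (structure figure): film (f) with two ordered components

• c1: Al(0), 5 µm film

• c2: PET, drawn as a copolymer (co, eu) of a diol monomer and a diacid monomer (mon), 200 µm film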

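Composite Polymers

(Glass Reinforced Nylon-6,6)

Example (structure figure): mixture (mix) of type “composite” with two components (c)

• 90 wt%: nylon-6,6, drawn as a copolymer (co, eu) of a diacid monomer and a diamine monomer (mon)

• 10 wt%, role “filler”: ID: Pyrex, an inorganic (inorg) mixture (mix) of
  - 80-90 wt%: ID: SiO2 (generic), *434343A
  - 5-15 wt%: ID: B2O3, *343434A
  - 1-5 wt%: ID: Al2O3 (generic), *98765F
  - 1-5 wt%: Na2O, drawn as Na+ O2- Na+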

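Miscellaneous

Examples (structure figures):

• Quadruple Bonds: Mo-Mo complex with eight Cl ligands, tagged 4 and chrg 4-

• Delocalized Radical Anions: tagged chrg 1-.

• H end groups and bridging Hs require special treatment: a BH / BH structure with bridging atoms drawn as D, and a polymer (n) with DO and OH end groups (Note: D = H)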

Structure vs. Sample

Structure vs. Sample Detail

Problem:

• How do you handle minor or variable components in a material?

• When are two materials different if the amounts of their components are only slightly different?

• What if structural details depend on how material was made?

Solution:

• Register the different but closely related samples of the material against the same relatively generic structure

AND THEN

• Relegate the variable / difference information to “sample histories”

OR

• Capture the minor components as analytical results on the sample

“Structure” Details via Sample History Concept

Process Steps
• Defines processing steps and their sequence that led to sample
  e.g., synthesis, polymerization, spinning, fabrication, …

Process Components
• Defines materials used in a processing step
  including amounts, specific lot #s / sample IDs, and roles (e.g., monomer, catalyst, additive, …)

Process Conditions
• Defines conditions used in processing step
  Very versatile and detailed, sequential tracking possible

Sample Genealogy
• Allows parent-child relationships to be tracked
• Permits viewing total process history through sequential samples

Goes well beyond reaction registration (in some ways), though that can be a part of it
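The sample history concept (process steps with their components and conditions, linked by parent-child sample relationships) can be sketched as a simple linked structure. Everything below is a hypothetical illustration: the class names, the sample IDs, the material name, and the temperature value are invented for the example and are not taken from CIMS.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProcessStep:
    name: str                                                  # e.g. "polymerization"
    components: Dict[str, str] = field(default_factory=dict)   # material -> role
    conditions: Dict[str, str] = field(default_factory=dict)   # condition -> value


@dataclass
class Sample:
    sample_id: str
    step: ProcessStep                  # the step that produced this sample
    parent: Optional["Sample"] = None  # sample this one was made from

    def history(self) -> List[str]:
        """Process step names from the earliest ancestor to this sample."""
        steps: List[str] = []
        node: Optional[Sample] = self
        while node is not None:
            steps.append(node.step.name)
            node = node.parent
        return list(reversed(steps))


# Hypothetical three-sample genealogy.
polymer = Sample("S-001", ProcessStep("polymerization",
                                      components={"monomer A": "monomer"},
                                      conditions={"temperature": "250 C"}))
fiber = Sample("S-002", ProcessStep("spinning"), parent=polymer)
fabric = Sample("S-003", ProcessStep("fabrication"), parent=fiber)
```

Calling `fabric.history()` walks the parent links and returns the steps in process order, which mirrors the idea of viewing the total process history through sequential samples.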

Structure vs. Sample Detail (cont’d)

Example (structure figures): the same polymer composite registered two ways

• Component detail in Structure (both filler type and amt): mixture (mix, composite) of 90 wt% copolymer (co, eu; diacid and diamine monomers) and 10 wt% filler, ID: SiO2 (generic), *434343A

• Component detail in Sample (filler type and/or amt): mixture (mix, composite) of x copolymer (co, eu; same monomers) and x filler, ID: filler (generic), *121212C

Note: Choice may change from sample-based to structure-based as research project moves from scouting to optimization phases!

Illustrative CIMS Sample History*

[Screenshot of a CIMS sample history]

* Data fabricated to protect confidentiality

Data Integrity for Registrations

Novelty Checking: Structure Standardization

• ca. 20,000 lines of standardizer code
  - Standard items (nitro groups, …)
  - Salts (ionic vs. covalent)
  - Validation of attached data information, including context
  - Componentization for auto-registration of components (old and new)
  - Can add new attached data types and values with minimal edits to standardizer code in many cases
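One way to picture “validation of attached data information, including context” is a table-driven check, where adding a new type or value means editing a table rather than the logic. The sketch below is an assumption-laden illustration, not the DuPont standardizer: the values are tags visible on earlier slides, but their grouping into fields (`mixture_type`, `component_role`, `ratio_unit`) and the context rule (which fields may sit on which bracket) are invented for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical controlled vocabulary. The values appear as tags on the
# slides; the grouping into named fields is an assumption.
VOCABULARY: Dict[str, Set[str]] = {
    "mixture_type": {"emulsion", "composite"},
    "component_role": {"surfactant", "filler"},
    "ratio_unit": {"wt%", "mol"},
}

# Assumed context rule: which fields may be attached to which bracket type.
ALLOWED_CONTEXT: Dict[str, Set[str]] = {
    "mix": {"mixture_type"},
    "c": {"component_role", "ratio_unit"},
}


def validate(bracket: str, attached: List[Tuple[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the data passed."""
    problems: List[str] = []
    for field_name, value in attached:
        if field_name not in VOCABULARY:
            problems.append(f"unknown field {field_name!r}")
        elif value not in VOCABULARY[field_name]:
            problems.append(f"{value!r} is not a valid {field_name}")
        elif field_name not in ALLOWED_CONTEXT.get(bracket, set()):
            problems.append(f"{field_name} not allowed on {bracket!r} bracket")
    return problems
```

For example, `validate("c", [("component_role", "filler"), ("ratio_unit", "wt%")])` returns an empty list, while attaching a ratio unit directly to a `mix` bracket is reported as a context error under the assumed rule.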

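CIMS Novelty Checking: Beyond Flexmatch

• Broad ISIS flexmatch, including components of multi-component structures
• MF search for IDS and “non-routine” compositions (e.g., organometallic)

Example (structure figure): mixture (mix) with two components (c), each with ratio x; one component carries a Cl tagged “U = unknown locant”, the other contains O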
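The MF search for IDS can be illustrated with a small duplicate-candidate check that compares atom counts rather than formula strings, so that differently ordered formulas still match. This is a sketch under stated assumptions: the function names are hypothetical, the parser handles only flat formulas (no parentheses or dots, so not the kaolinite-style MF shown earlier), and the registry contents are illustrative, with the SiO2 formula assumed rather than taken from the slides.

```python
import re
from collections import Counter
from typing import Dict, List

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(mf: str) -> Counter:
    """Count atoms in a flat formula such as 'Al2O3' (no parentheses or dots)."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", mf):
        raise ValueError(f"unsupported formula: {mf!r}")
    counts: Counter = Counter()
    for element, digits in _TOKEN.findall(mf):
        counts[element] += int(digits) if digits else 1
    return counts


def mf_candidates(new_mf: str, registry: Dict[str, str]) -> List[str]:
    """Return registry IDs whose formula has the same atom counts as new_mf."""
    target = parse_formula(new_mf)
    return [reg_id for reg_id, mf in registry.items()
            if parse_formula(mf) == target]


# Illustrative registry: ID# -> user-entered MF (SiO2 entry is assumed).
registry = {"98765F": "Al2O3", "765432D": "Al2O3", "434343A": "SiO2"}
candidates = mf_candidates("O3Al2", registry)
```

Here `candidates` contains both Al2O3 entries (generic and alpha). An MF hit therefore only narrows the field; deciding whether a new registration duplicates one of the hits would still fall to the registrar and the “ID: xxx” descriptors.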

Summary

Successes
• Have been able to represent all materials of interest to date
• One database can serve as both the corporate “chemical data standard” and research data index / repository
• Can customize the CIMS DB interface for a new group in just 40-60 hours
• Performance is decent, even with complex structures and low-level database access control security

Difficulties
• Changing structure standards over time means lots of “data cleansing”
• Determination and registration of compositions / structures of commercial materials can be very challenging and time consuming

Future
• Incorporation of improved vendor software capabilities
• Alternate structure representations (e.g., automatic equivalencing of structure- and source-based polymers)

Acknowledgments

DuPont CIMS Team Members and Key I&CT Colleagues, Past and Present

Mark Andrews
Jiazhong Chen
Karen Cousin
Sandip Dutta
Elena Kazakova
Ja Kim
Keith Kunitsky
Sheena Sinclair
Jon Timko
Wen-Ying Wang
Ted Wilks
Jane Zhou