Chemical Structure Representation in the DuPont Chemical Information Management Solutions Database: Challenges Posed by Complex Materials in a Diversified Science Company
Mark A. Andrews and Edward S. Wilks
Information & Computing Technologies Division, DuPont CR&D
Boston ACS Meeting, Chemical Structure Representations Symp.
August 25, 2010
8/25/2010
2
Outline of Talk: Chemical Structure Representation in the DuPont
Chemical Information Management Solutions (CIMS) Database
Background
Structure Classes
Structure vs. Sample
Data Integrity for Registrations
Summary
DuPont: A Diversified Science Company
2009 (pie chart segments):
• Agriculture / Food: 34%
• Construction Materials: 10%
• Electronics: 8%
• Textiles / Home Furnishings: 3%
• Motor Vehicles: 19%
• Other: 16%
• Aircraft & Aerospace: 2%
• Plastics & Chemicals: 8%
Structure Representation: The DuPont Problem
Cheminformatics in a Diversified Science Company
• Chemically intelligent structures: Needed for all types of materials
  Organics, inorganics, polymers, mixtures, formulations, incompletely defined substances (IDS), multi-layer films, composites, even devices
• Commercial chem. info. software tools: DuPont ≠ Pharma
  Could we push the envelope and vendors far enough?
• Chemical composition: Structure vs. Sample characteristics
  Which details of a material’s composition should be part of the chemical structure vs. which details are better viewed as characteristics of a specific sample?
• Structure registrations: Advanced mistake-proofing aids required
  Not just standardization of, e.g., nitro groups for end users but also sophisticated duplicate checking for trained registrars
Structure Representation: The Vendor Problem
DuPont’s Situation in 2000-2004
• DuPont had to bring its proprietary chem. info. DB in house from CAS
  Chance to reinvent the system, going beyond chemical indexing of proprietary documents to a full chem. info. registration system for diverse research groups
• Only a single vendor appeared up to the challenge
  Built-in support for representation and searching of polymers, multiple bracket types, and user-definable attached data
Vendor Situation in 2010 and beyond
• All chem. info. vendors have radically enhanced, or are radically enhancing, their products
• Middle-tier software offers plug-and-play options
  Supports multiple drawing tools and database cartridges, provides tools for very facile in-house customization
Structure Representation: The DuPont Solution
DuPont specialized structure identifiers
• Push attached data functionality to the limit
  - 7 polymer attached data types
  - 3 mixture attached data types
  - 3 ratio attached data types
  - 2 incompletely defined substance (IDS) attached data types
  - 5 special purpose attached data types
• Other structure registration features
  - ca. 75 controlled vocabulary values for above attached data fields
  - 4 Molecular Formula (MF) fields – computed and user entered, summed
  - Push some structural details down to the sample “history” level:
    -- You may not want to bog structure down with minor details, or
    -- Some “structural” aspects of materials are “product by process”
Outline of Talk: Chemical Structure Representation in the DuPont
Chemical Information Management Solutions (CIMS) Database
Background
Structure Classes
Structure vs. Sample
Data Integrity for Registrations
Summary
Incompletely Described Substances (IDS)
Problem – What if there is no structure?
• ISIS No-Strucs provide no material identification information
• ISIS No-Strucs cannot be components of mixtures, etc.
Solution:
• Use * atom tagged with DuPont ID#
• Add “ID: xxx” descriptor tag
• Include User Entered molecular formula if known
Examples:
ID: Kaolinite
*123456H
User MF = Al2.(OH)4.(Si2O5)1
ID: Mineral Oil
*34567E
User MF = (null)
ID: Al2O3 (generic)
*98765F
ID: Al2O3 (alpha)
*765432D
User MF = Al2O3
CAS = No-Strucs or Tabular Inorganics
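The three-part solution above (placeholder atom with an ID#, descriptor tag, optional user-entered MF) can be pictured as a small record type. The following is only a minimal sketch: the class name `IdsRecord` and its field names are hypothetical and are not taken from CIMS, and the pairing of the kaolinite formula with the kaolinite entry is inferred from the chemistry rather than stated on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdsRecord:
    """Hypothetical record for an incompletely described substance (IDS)."""

    dupont_id: str                 # tag on the '*' placeholder atom, e.g. "123456H"
    descriptor: str                # text of the "ID: xxx" descriptor tag
    user_mf: Optional[str] = None  # user-entered molecular formula, if known

    def label(self) -> str:
        """Render the three pieces of information on one line."""
        mf = self.user_mf if self.user_mf is not None else "(null)"
        return f"ID: {self.descriptor} | *{self.dupont_id} | User MF = {mf}"


# Example values copied from the slide; the name/formula pairing is an assumption.
kaolinite = IdsRecord("123456H", "Kaolinite", "Al2.(OH)4.(Si2O5)1")
mineral_oil = IdsRecord("34567E", "Mineral Oil")

print(kaolinite.label())
print(mineral_oil.label())
```

Because the record always carries an identifier and a descriptor, it can in principle be referenced as a mixture component, which is the capability the slide says ISIS No-Strucs lack.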
Incompletely Described Substances – Cont’d
Problem:
• Sometimes even (nominal) structures may not fully define a material
Solution:
• Add “ID: xxx” descriptor tag
Examples:
• C(0) with ID: Graphite, ID: Charcoal, or ID: Diamond
• OH2 with ID: deionized or ID: reverse osmosis
Incompletely Described Substances – Cont’d
Problem:
• Substituents in unknown locations, etc.
Solution:
• Add “unknown xxx” tags
[Figure: chlorotoluene with Cl at an unknown ring position, compared as drawn in CIMS, CAS, Symyx Draw 3.1, ChemDraw 12, and Marvin. Labels appearing in the drawings: “U = unknown locant”, “D1-Cl”, “D1-Me”, “* Cl”.]
[Figure: PCB 1254 in CIMS, tagged “ID: 54 wt% Cl”, “U = unknown locant”, and “U = unknown degree of substitution”.]
Incompletely Described Substances – Cont’d
Problem:
• Variable chain lengths, etc.
Solution:
• Use “tagged” polymers or Markush structures
Note: Markush-type features are currently supported only on query in some software packages, though most vendors are now also targeting them for registration.
[Figure: “ID: Alcohols, C4-6” as represented in CAS (“No Structure Available”), in CIMS (a tagged polymer bracket with repeat count n, OH end group, “Note: D = H”, a mol ratio tag with the values 4 and 6, and “U = unknown branching”), and in Symyx Draw 3.1 (Markush structure R1-OH with R1 = C4H9, C5H11, or C6H13).]
In CIMS a “flexmatch” query with the generic tagged structure (OH end group, repeat count n, D = H) finds all variable chain length alcohols, e.g., C4-6, C10-14.
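Mixtures
[Figure: an emulsion drawn as a “mix” bracket with three components (“c”):
• ID: Olive Oil, *13579H: 30 wt%
• OH2: 70 wt%
• Component with the role “surfactant”: 0.1 wt%, drawn as a sodium (Na+) salt of a sulfur-oxygen acid and tagged “U = unknown branching”, “U = unknown locant”, and “U = unknown chain length”
Callouts highlight the Ratios (wt% values) and the Roles (e.g., surfactant).]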
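The emulsion example combines a mixture type, per-component ratios, and per-component roles. As a rough illustration of how those three kinds of attached data fit together, here is a minimal sketch; the class and field names (`Component`, `Mixture`, `role`, and so on) are hypothetical and do not come from CIMS or any vendor product, and the surfactant name is a generic stand-in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    """One bracketed component ("c") of a mixture; field names are hypothetical."""

    name: str
    amount: float
    unit: str = "wt%"
    role: Optional[str] = None


@dataclass
class Mixture:
    """A "mix" bracket with a type (e.g. emulsion) and its components."""

    mixture_type: str
    components: List[Component]

    def total(self, unit: str = "wt%") -> float:
        """Sum the amounts of all components reported in the given unit."""
        return sum(c.amount for c in self.components if c.unit == unit)

    def with_role(self, role: str) -> List[Component]:
        """Return the components tagged with the given role."""
        return [c for c in self.components if c.role == role]


# Values taken from the emulsion slide; the surfactant name is a placeholder.
emulsion = Mixture(
    mixture_type="emulsion",
    components=[
        Component("Olive Oil (*13579H)", 30.0),
        Component("OH2", 70.0),
        Component("sodium salt surfactant", 0.1, role="surfactant"),
    ],
)

print(emulsion.total())
print([c.name for c in emulsion.with_role("surfactant")])
```

Note that the slide's nominal amounts add up to slightly more than 100 wt%, so a registration check on totals would presumably need a tolerance rather than an exact comparison.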
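Cross-linked and Branched Polymers
[Figure: two copolymer structures drawn with “co” / “eu” brackets and “mon” monomer brackets.
• Surlyn® 6000 series Mg Ionomers: copolymer including a carboxylic acid monomer, with a “mod” / “xl” bracket containing a Mg carboxylate. Callout: Cross-linked by after-treatment.
• Random (“ran”) copolymer in which one monomer carries “br” / “xl” tags. Callout: Difunctional monomer gives branching and/or cross-linking.]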
Complex Copolymers
[Figure: two copolymer structures.
• Jeffamine® ED-2003: polyether with NH2 end groups, MW=2000, three repeat brackets (n), and mol ratio tags (39 and 2).
• Poly(THF/3-MeTHF) (Me in one of only 2 positions): “co” / “eu” copolymer with ether repeat units (n) and OH and H end groups (“eg”).
Tags used in the drawings: “U = unknown locant” and “U = main locant”.]
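Multi-Layer Thin Films = Ordered “Mixtures”
(Aluminized PET)
[Figure: an “f” bracket containing two ordered film components:
• c1: Al(0), film, 5 µm
• c2: PET drawn as a “co” / “eu” copolymer of diol and diacid “mon” units, film, 200 µm]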
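Composite Polymers
(Glass Reinforced Nylon-6,6)
[Figure: a “mix” / “composite” bracket with two components (“c”):
• Nylon-6,6 drawn as a “co” / “eu” copolymer of diacid and diamine “mon” units: 90 wt%
• Filler: “ID: Pyrex”, 10 wt%, itself an “inorg” “mix” of:
  - Na2O (drawn as Na+ O2- Na+): 1-5 wt%
  - ID: Al2O3 (generic), *98765F: 1-5 wt%
  - ID: SiO2 (generic), *434343A: 80-90 wt%
  - ID: B2O3, *343434A: 5-15 wt%]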
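Miscellaneous
[Figure: three special cases.
• Quadruple Bonds: Mo-Mo dimer with eight Cl ligands, a bond order label of 4, and a “chrg” tag of 4-.
• Delocalized Radical Anions: bracketed structure with a “chrg” tag of 1-.
• H end groups and bridging Hs require special treatment: BH structures with bridging H drawn as D, and a polymer with end groups drawn as DO and OH (repeat count n), each with “Note: D = H”.]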
Outline of Talk: Chemical Structure Representation in the DuPont
Chemical Information Management Solutions (CIMS) Database
Background
Structure Classes
Structure vs. Sample
Data Integrity for Registrations
Summary
Structure vs. Sample Detail
Problem:
• How do you handle minor or variable components in a material?
• When are two materials different if the amounts of their components are only slightly different?
• What if structural details depend on how the material was made?
Solution:
• Register the different but closely related samples of the material against the same relatively generic structure
AND THEN
• Relegate the variable / difference information to “sample histories”
OR
• Capture the minor components as analytical results on the sample
“Structure” Details via Sample History Concept
Process Steps
• Defines processing steps and their sequence that led to sample, e.g., synthesis, polymerization, spinning, fabrication, …
Process Components
• Defines materials used in a processing step, including amounts, specific lot #s / sample IDs, and roles (e.g., monomer, catalyst, additive, …)
Process Conditions
• Defines conditions used in processing step
  Very versatile and detailed, sequential tracking possible
Sample Genealogy
• Allows parent-child relationships to be tracked
• Permits viewing total process history through sequential samples
  - Goes well beyond reaction registration (in some ways), though that can be a part of it
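The sample history concept (steps, components with roles, conditions, and parent-child genealogy) maps naturally onto a linked record structure. The sketch below is one possible way to model it, under stated assumptions: the class names, field names, sample IDs, and the example materials and conditions are all hypothetical and are not drawn from the CIMS schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ProcessStep:
    """One processing step; components map a material to its role."""

    name: str
    components: Dict[str, str] = field(default_factory=dict)
    conditions: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sample:
    """A sample with its own steps and an optional parent sample."""

    sample_id: str
    steps: List[ProcessStep] = field(default_factory=list)
    parent: Optional["Sample"] = None

    def full_history(self) -> List[str]:
        """Walk up the genealogy, then list every step from oldest to newest."""
        lineage = []
        node: Optional[Sample] = self
        while node is not None:
            lineage.append(node)
            node = node.parent
        history = []
        for sample in reversed(lineage):
            for step in sample.steps:
                history.append(f"{sample.sample_id}: {step.name}")
        return history


# Hypothetical two-generation example: a resin sample and a fiber spun from it.
resin = Sample(
    "S-001",
    steps=[
        ProcessStep(
            "polymerization",
            components={"monomer lot A": "monomer", "catalyst lot B": "catalyst"},
            conditions={"temperature": "example value"},
        )
    ],
)
fiber = Sample("S-002", steps=[ProcessStep("spinning")], parent=resin)

print(fiber.full_history())
```

Keeping the parent link on the child sample means the total process history can be reconstructed from any sample without storing it redundantly, which matches the "viewing total process history through sequential samples" idea on the slide.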
Structure vs. Sample Detail (cont’d)
[Figure: two alternative registrations of a filled nylon composite (“mix” / “composite” bracket; the polymer is drawn as a “co” / “eu” copolymer of diacid and diamine “mon” units).
• Component detail in Structure (both filler type and amt): polymer component at 90 wt%; filler component “ID: SiO2 (generic)”, *434343A, at 10 wt%.
• Component detail in Sample (filler type and/or amt): polymer component at amount x; filler component “ID: filler (generic)”, *121212C, at amount x.]
Note: Choice may change from sample-based to structure-based as research project moves from scouting to optimization phases!
Outline of Talk: Chemical Structure Representation in the DuPont
Chemical Information Management Solutions (CIMS) Database
Background
Structure Classes
Structure vs. Sample
Data Integrity for Registrations
Summary
Novelty Checking: Structure Standardization
• ca. 20,000 lines of standardizer code
• Standard items (nitro groups, …)
• Salts (ionic vs. covalent)
• Validation of attached data information, including context
• Componentization for auto-registration of components (old and new)
• Can add new attached data types and values with minimal edits to standardizer code in many cases
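One of the standardizer tasks listed above is validating attached data, including its context. A table-driven check is one plausible way to keep such validation extensible, so that adding a type or value is a data change rather than a logic change. The sketch below is not the DuPont standardizer: the field names, the two small vocabularies (built from values that appear on the earlier slides), and the context rule are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical controlled vocabularies, using values seen on the slides.
VOCAB: Dict[str, Set[str]] = {
    "mixture_type": {"emulsion", "composite"},
    "component_role": {"surfactant", "filler"},
}

# Hypothetical context rule: which bracket type may carry which field.
ALLOWED_CONTEXT: Dict[str, Set[str]] = {
    "mixture_type": {"mix"},
    "component_role": {"c"},
}


def validate_attached_data(items: List[Tuple[str, str, str]]) -> List[str]:
    """Check (bracket_type, field, value) triples; return a list of error messages."""
    errors = []
    for bracket, field_name, value in items:
        if field_name not in VOCAB:
            errors.append(f"unknown field: {field_name}")
        elif value not in VOCAB[field_name]:
            errors.append(f"value '{value}' not in vocabulary for {field_name}")
        elif bracket not in ALLOWED_CONTEXT[field_name]:
            errors.append(f"{field_name} not allowed on '{bracket}' bracket")
    return errors


good = [("mix", "mixture_type", "emulsion"), ("c", "component_role", "filler")]
bad = [("mix", "component_role", "filler"), ("c", "component_role", "solvent")]

print(validate_attached_data(good))
print(validate_attached_data(bad))
```

With this layout, supporting a new attached data value only requires extending `VOCAB`, which is in the spirit of the "minimal edits" point above.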
CIMS Novelty Checking: Beyond Flexmatch
• Broad ISIS flexmatch, including components of multi-components
• MF search for IDS and “non-routine” compositions (e.g., organometallic)
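[Figure: example “mix” structure with two components (“c”), each with amount x; one component carries a Cl substituent tagged “U = unknown locant”, the other is an oxygen-containing structure.]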
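For IDS and other compositions where a structure match is not meaningful, the slide relies on molecular formula (MF) search. How CIMS normalizes formulas is not described, so the following is only a sketch of one approach: parse each formula into element counts and compare the counts, so that differently written but equivalent formulas collide. The function names are hypothetical, dots are treated as plain separators (leading multipliers such as "5H2O" are not supported), and the registry contents, including assigning Al2O3 to both alumina entries, are assumptions.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

ELEMENT = re.compile(r"[A-Z][a-z]?")
DIGITS = re.compile(r"\d+")


def _read_count(text: str, pos: int) -> Tuple[int, int]:
    """Read an optional integer at pos; default to 1 if none is present."""
    m = DIGITS.match(text, pos)
    if m:
        return int(m.group()), m.end()
    return 1, pos


def parse_mf(formula: str) -> Counter:
    """Parse a formula with parentheses into element counts."""
    text = formula.replace(".", "")  # assumption: '.' is only a separator
    stack = [Counter()]
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            stack.append(Counter())
            pos += 1
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            group = stack.pop()
            count, pos = _read_count(text, pos + 1)
            for element, n in group.items():
                stack[-1][element] += n * count
        else:
            m = ELEMENT.match(text, pos)
            if not m:
                raise ValueError(f"unexpected character at position {pos}")
            count, pos = _read_count(text, m.end())
            stack[-1][m.group()] += count
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


def mf_key(formula: str) -> Tuple[Tuple[str, int], ...]:
    """Order-independent key for comparing formulas."""
    return tuple(sorted(parse_mf(formula).items()))


def find_mf_matches(new_formula: str, registry: Dict[str, str]) -> List[str]:
    """Return IDs of registered entries whose formula matches the new one."""
    key = mf_key(new_formula)
    return [rid for rid, mf in registry.items() if mf_key(mf) == key]


# Hypothetical registry; IDs are from the IDS slide, formula pairing is assumed.
registry = {
    "98765F": "Al2O3",
    "765432D": "Al2O3",
    "123456H": "Al2.(OH)4.(Si2O5)1",
}
print(find_mf_matches("O3Al2", registry))
```

A match on formula alone only flags candidates; as the generic and alpha Al2O3 entries show, a registrar would still need the “ID:” descriptor to decide whether a new material is truly a duplicate.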
Summary
Successes
• Have been able to represent all materials of interest to date
• One database can serve as both the corporate “chemical data standard” and research data index / repository
• Can customize the CIMS DB interface for new group in just 40-60 hours
• Performance is decent, even with complex structures and low-level database access control security
Difficulties
• Changing structure standards over time means lots of “data cleansing”
• Determination and registration of compositions / structures of commercial materials can be very challenging and time consuming
Future
• Incorporation of improved vendor software capabilities
• Alternate structure representations (e.g., automatic equivalencing of structure- and source-based polymers)
Acknowledgments
DuPont CIMS Team Members and Key I&CT Colleagues, Past and Present
• Mark Andrews
• Jiazhong Chen
• Karen Cousin
• Sandip Dutta
• Elena Kazakova
• Ja Kim
• Keith Kunitsky
• Sheena Sinclair
• Jon Timko
• Wen-Ying Wang
• Ted Wilks
• Jane Zhou