TRANSCRIPT
Contents
• Trend in Computer-Aided Materials Discovery
• High-Throughput Computational Screening & Exhaustive Enumeration
• Deep-Learning-based Evolutionary Design
• Deep-Learning-based Inverse Design
• Efficacy of Computer-Aided Materials Discovery
Trend in Computer-Aided Materials Discovery

[Figure: evolution of materials discovery approaches]
• Conventional: trial-and-error (high cost) – iterative experiments
• 1st Gen.: simulation (low throughput) – pre-validation
• 2nd Gen.: virtual screening (low hit-rate) – high throughput
• 3rd Gen.: targeted design (high hit-rate) – right solutions with minimum effort

For accelerated materials discovery: first-principles quantum chemistry (rationalization), high-performance computing (efficiency), and machine learning (intelligence).

Prediction of materials properties based on machine learning
– Build-up of a materials-vs.-property DB → Materials Informatics
Trend in Computer-Aided Materials Discovery

History of machine learning in chemistry:
• QSAR* ('62, Hansch & Fujita) – cheminformatics
• ANN** in chemistry ('71) – introduction stage of machine learning
• SMILES*** ('87, Weininger)
• Graph kernels ('05 @ UC Irvine) – kernel methods
• Bayesian modeling ('09 @ MIT) – Bayesian approaches
• Deep learning ('16 @ Stanford; '18 @ Harvard) – development stage

Process of Machine Learning @ Materials Research: Descriptor → Training → Analysis
• Molecules are encoded as descriptor vectors, e.g.:
  SMILES: CC(C)NCC(O)COC1=CC(CC2=CC=CC=C2)=C(CC(N)=O)C=C1
  Fingerprint: 011100011111101010010100100000101010001001010…
• Other input representations: graphs, images

* QSAR: Quantitative Structure–Activity Relationship
** ANN: Artificial Neural Network
*** SMILES: Simplified Molecular-Input Line-Entry System
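As a minimal sketch of the descriptor step above (SMILES string → fixed-length bit string), the following toy function hashes character substrings of a SMILES string into a bit vector. It is only an illustration: real ECFP hashing operates on atom neighborhoods of the molecular graph, not raw substrings, and the function name and parameters here are invented for the example.

```python
import hashlib

def toy_fingerprint(smiles, n_bits=64, radius=3):
    """Toy substring-hash fingerprint (illustrative stand-in for ECFP)."""
    bits = [0] * n_bits
    for size in range(1, radius + 1):              # substrings of length 1..radius
        for i in range(len(smiles) - size + 1):
            fragment = smiles[i:i + size]
            h = int(hashlib.md5(fragment.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1                   # set the hashed bit position
    return "".join(map(str, bits))

fp = toy_fingerprint("CC(C)NCC(O)COC1=CC(CC2=CC=CC=C2)=C(CC(N)=O)C=C1")
print(len(fp))  # 64
```

The resulting bit string can be fed to any vector-based learner, which is exactly the role the fingerprint plays in the slide's descriptor → training → analysis pipeline.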
Trend in Computer-Aided Materials Discovery

Materials design based on machine learning
– Inverse QSAR → Inverse Design

Timeline of autonomous molecular generation ([In] targets → [Out] molecules):
• Inverse QSAR (late '80s~)
• Genetic algorithms ('92 @ Purdue) – evolutionary
• Exhaustive generation ('12 @ Tokyo) – combinatorial
• SMILES autoencoder ('16 @ Harvard) – autoencoder
• Inverse design ('16 @ SAIT)
• GAN* for molecules ('17 @ Harvard) – deep learning / generative models
*GAN: Generative Adversarial Network

Materials discovery methodologies and their elemental technologies:
• HTCS (High-Throughput Computational Screening) + molecular enumeration
• Evolutionary design
• Inverse design
built on automated simulation, databases, machine learning, and materials informatics.
Trend in Computer-Aided Materials Discovery

In-silico technologies for materials discovery → target molecules
High-Throughput Computational Screening & Exhaustive Enumeration

"Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)-complexes predicted by a graph-based enumeration and deep learning", GI01.02.02, 2018 MRS Fall Meeting
High-Throughput Computational Screening

Property prediction with high-performance computing for large-scale exploration of materials candidates:
seed fragments → combination → candidate pool (large amounts of candidates) → simulation & verification against the database → target materials
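The combination step above can be sketched as a Cartesian product over fragment sets. The fragment names below are placeholders for illustration, not the fragments actually used in the talk.

```python
from itertools import product

# Hypothetical seed-fragment sets (placeholder names).
cores = ["carbazole", "fluorene", "dibenzofuran"]
linkers = ["phenyl", "biphenyl"]
substituents = ["H", "CN", "OMe", "tBu"]

# Candidate pool = every combination of core, linker, and two substituents.
candidate_pool = [
    f"{s1}-{core}-{linker}-{s2}"
    for core, linker, s1, s2 in product(cores, linkers, substituents, substituents)
]
print(len(candidate_pool))  # 3 * 2 * 4 * 4 = 96
```

Even a handful of fragments multiplies into a large pool, which is why the subsequent simulation/verification stage dominates the cost of HTCS.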
High-Throughput Computational Screening

ML (machine learning)-assisted HTCS for higher efficiency:
(1) Simulation + ML
(2) Prioritizing calculations based on active learning
The pipeline is the same (seed fragments → combination → candidate pool → target materials, with database and verification steps), but ML replaces part of the simulation and active learning decides which candidates to simulate first.
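One common way to "prioritize calculations based on active learning", as in point (2), is to rank candidates by the disagreement of a cheap surrogate ensemble and simulate the most uncertain ones first. The slide does not specify the acquisition function, so the ensemble-variance criterion below is an assumption; the linear models are toy stand-ins for retrained DNNs.

```python
import random
from statistics import pstdev

random.seed(0)

def cheap_model(x, w):
    """One member of a surrogate ensemble (a toy linear model)."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Ensemble of slightly different models, standing in for retrained DNNs.
ensemble = [[random.gauss(1.0, 0.2) for _ in range(4)] for _ in range(5)]

# Unlabeled candidate pool (toy 4-dimensional descriptors).
candidates = [[random.random() for _ in range(4)] for _ in range(100)]

def uncertainty(x):
    """Disagreement across the ensemble, used as the acquisition score."""
    return pstdev(cheap_model(x, w) for w in ensemble)

# Send the most uncertain candidates to the expensive simulator first.
queue = sorted(candidates, key=uncertainty, reverse=True)[:10]
print(len(queue))  # 10
```

Each round of simulation then yields new labels, the ensemble is retrained, and the queue is re-ranked, closing the active-learning loop.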
High-Throughput Computational Screening

Exhaustive enumeration based on graph theory
– "Graphs"
• Mathematical structures used to model pairwise relations between objects.
• Made up of nodes and edges.
• In chemistry, a graph is used to model a molecule, where nodes represent atoms and edges represent bonds.
※ Exhaustive enumeration: systematic enumeration of all possible molecules for optimal-solution search
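The enumeration of non-isomorphic connected graphs used in the following slides can be reproduced by brute force for small node counts: generate every edge set, keep the connected ones, and deduplicate by a canonical form taken over all node relabelings. This is a sketch, not the authors' implementation; a permutation-based canonical form is only feasible for a handful of nodes.

```python
from itertools import combinations, permutations

def connected(n, edges):
    """Check connectivity of an n-node graph given as a set of edges."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w); stack.append(w)
    return len(seen) == n

def canonical(n, edges):
    """Canonical form: lexicographically smallest edge set over relabelings."""
    best = None
    for p in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((p[a], p[b]))) for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

def count_connected_graphs(n):
    """Count connected non-isomorphic simple graphs on n nodes."""
    all_edges = list(combinations(range(n), 2))
    seen = set()
    for k in range(n - 1, len(all_edges) + 1):   # need >= n-1 edges to connect
        for edges in combinations(all_edges, k):
            if connected(n, edges):
                seen.add(canonical(n, edges))
    return len(seen)

# Connected non-isomorphic simple graphs on 3, 4, 5 nodes: 2, 6, 21.
print([count_connected_graphs(n) for n in (3, 4, 5)])  # [2, 6, 21]
```

The counts 2 + 6 + 21 = 29 match the "total 29 graphs" for 3–5 rings quoted on the Approach slide.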
High-Throughput Computational Screening

Complete list of non-isomorphic graphs (http://www.cadaeic.net/graphpics.htm)
[Figure: each graph labeled with an ID, its number of edges, and the number of edges at each node]
High-Throughput Computational Screening

Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)-complex core structures
– Ir(III)-complexes
• Widely used as phosphorescent OLED dopants.
• Figuring out the full landscape of emission color is important for discovering high-performing molecules in target color regions.

References: New J. Chem., 39, 246 (2015); ACS Appl. Mater. Interfaces, 10, 1888–1896 (2018); Organic Electronics, 63, 244–249 (2018)
High-Throughput Computational Screening

Approach
– Consider the nodes in a graph as rings and the edges as ring-connections.
– Limit the total number of rings to between 3 and 5.
– Exclude the non-planar type (5-21) and structures invalid as dopants.
→ Only 11 of the total 29 graphs are valid.
High-Throughput Computational Screening

Enumeration
1. Graphs → 2. Skeletons → 3. Set iridium positions → 4. Substitute some carbon atoms with nitrogen atoms
– Build skeletons from 5- and 6-membered rings (405 skeletons in total).
– Substitute some carbons of each molecule with nitrogen atoms (max. five).
→ 9,919,469 (~10M) core structures in total
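The nitrogen-substitution step in point 4 is a combinatorial choice: pick up to five carbon positions to replace. The sketch below counts and generates those choices; the 12-position example is hypothetical, and a real enumeration would additionally deduplicate symmetry-equivalent substitutions.

```python
from itertools import combinations
from math import comb

def n_substitutions(n_carbons, max_n=5):
    """Number of ways to replace up to max_n carbons with nitrogen
    (including the unsubstituted molecule), ignoring symmetry."""
    return sum(comb(n_carbons, k) for k in range(max_n + 1))

def substituted_variants(positions, max_n=5):
    """Yield each choice of carbon positions to turn into nitrogen."""
    for k in range(max_n + 1):
        yield from combinations(positions, k)

# e.g. a skeleton with 12 substitutable carbon positions:
print(n_substitutions(12))  # 1 + 12 + 66 + 220 + 495 + 792 = 1586
```

Multiplying per-skeleton counts of this size across 405 skeletons and the iridium-position choices is how the space grows to the ~10M core structures quoted above.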
High-Throughput Computational Screening

Property prediction
– Trained a deep-neural-network model with simulated T1 data
• Input: ECFP (Extended-Connectivity FingerPrints) of molecular structures
• Output: T1 energy (phosphorescent light-emitting wavelength)

[Figure: mean absolute error of the DNN's T1 prediction (eV) vs. size of the training dataset, 10K–80K; with 80k training data, the average prediction error was less than 0.1 eV]

80k / 10M = 0.8%
By simulating the properties of only 0.8% of the molecules, we can fully scan the chemical space of 10M!
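The simulate-a-fraction, predict-the-rest workflow can be sketched end to end with stand-ins: the `simulate_T1` stub replaces the quantum-chemistry calculation, and a 1-nearest-neighbour model on Hamming distance replaces the DNN. All names and the 5% labeling fraction here are illustrative (the slide's actual ratio is 0.8%).

```python
import random

random.seed(0)

def simulate_T1(fp):
    """Stub for an expensive quantum-chemistry calculation: here, a
    cheap deterministic function of the fingerprint bits."""
    return 1.5 + sum(fp) / len(fp)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# 1. Enumerate a toy chemical space of bit-string fingerprints.
space = [tuple(random.randint(0, 1) for _ in range(32)) for _ in range(1000)]

# 2. Run the expensive "simulation" on only a small fraction.
labeled = {fp: simulate_T1(fp) for fp in space[:50]}

# 3. Surrogate model: 1-nearest-neighbour on Hamming distance
#    (a stand-in for the DNN in the talk).
def predict_T1(fp):
    nearest = min(labeled, key=lambda ref: hamming(fp, ref))
    return labeled[nearest]

# 4. Scan the full space with the cheap surrogate.
t1_map = {fp: predict_T1(fp) for fp in space}
print(len(t1_map))  # 1000
```

The point is structural: the full space gets a predicted property at the cost of simulating only the labeled subset, which is exactly the 0.8% argument above.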
High-Throughput Computational Screening

Results
– Distribution of T1 values
– Blue-color-emitting materials are rare compared with red and green

[Figure: histogram of the number of molecules (×100,000) vs. predicted T1 (eV) over 0.05–2.95 eV; fractions in the target color regions: Blue 0.4%, Green 4.3%, Red 18.4%]
Conclusions

In materials discovery, deep-learning-based HTCS is a good alternative to the conventional trial-and-error approach. Moreover, exhaustive enumeration makes it possible to systematically explore the whole chemical space.

With the proposed exhaustive enumeration method based on graph theory and deep learning, the whole landscape of 10M phosphorescent Ir-dopants could be scanned at just 0.8% of the computational cost of a pure simulation-based approach.
Deep-Learning-based Evolutionary Design

"Evolutionary design of organic molecules based on deep learning and genetic algorithm", COMP, ACS Fall 2018 National Meeting
Evolutionary Design

A generic population-based metaheuristic optimization technique that uses bio-inspired operators to reach near-optimal solutions; in the case of a genetic algorithm: mutation, crossover, and selection.

[Figure: fitness landscape (https://en.wikipedia.org/wiki/Fitness_landscape); average fitness rises with generation]

GA loop: initial population → calculate fitness → selection → crossover / mutation → new population → satisfy constraints? (yes: done; no: iterate)
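The GA loop above can be sketched in a few lines on bit strings. The fitness function here is a toy (counting 1-bits); in the talk's setting it would be a DNN property prediction. Genome length, population size, and operator rates are illustrative choices.

```python
import random

random.seed(1)

GENOME_LEN, POP, GENS = 16, 30, 40

def fitness(bits):
    """Toy fitness: count of 1-bits. A real run would call a
    property-prediction model here."""
    return sum(bits)

def select(pop):
    """Tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return max(a, b, key=fitness)

def crossover(p1, p2):
    """Single-point crossover of two parent bit strings."""
    cut = random.randrange(1, GENOME_LEN)
    return p1[:cut] + p2[cut:]

def mutate(bits, rate=0.05):
    """Flip each bit independently with the given probability."""
    return [b ^ (random.random() < rate) for b in bits]

# Initial population → repeated selection, crossover, mutation.
pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))
```

Under selection pressure the population's average fitness climbs generation by generation, which is the behavior the fitness-landscape figure illustrates.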
Deep-Learning-Based Evolutionary Design

Proposed approach (conventional → proposed, with expectations):
• Molecular descriptor: graph or ASCII string → bit string (ECFP*)
• Molecular evolution: heuristic → random bit operations + RNN decoding to SMILES (prevents heuristic bias; secures chemical validity)
• Fitness evaluation: simple assessment → DNN (versatile evaluation is possible)

Workflow: seed molecule (ECFP) → mutation (n=50) → decoding to SMILES (RNN) → inspection of chemical validity → fitness evaluation (DNN) → selection → evolution (crossover → mutation) → decoding to SMILES (RNN) → inspection of chemical validity → fitness evaluation (DNN) → iteration → best-fit molecule (results stored in a DB)
[Figure: bit-string crossover and mutation of parent fingerprints]

*ECFP (Extended Connectivity FingerPrint), DNN (Deep Neural Network), RNN (Recurrent Neural Network), SMILES (Simplified Molecular-Input Line-Entry System)
Deep Learning-Based Evolutionary Design

Deep learning models
• [DNN] Input: ECFP* → Output: properties; 3 hidden layers, 500 hidden units in each layer
• [RNN] Input: ECFP* → Output: SMILES; 3 hidden layers, 500 long short-term memory units
  – Decoding proceeds token by token: <start> → y1='CCC' (t=1) → y2='CCC' (t=2) → y3='CC(' (t=3) → … → yT=')=O' (t=T) → <end> (t=T+1)
  – y = ('CCC', 'CCC', 'CC(', …, ')=O') → 'CCCC(N)=O'

*ECFP (dimension=5,000, neighbor size=6)
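The token-by-token decoding described above can be sketched with a greedy loop: starting from <start>, repeatedly feed the previous token (plus the fixed-length input vector) to the model until <end> is produced. The `step` function below is a hypothetical lookup-table stub standing in for the 3-layer LSTM; a real model would output a probability distribution over the token vocabulary at each step.

```python
def step(context, prev_token):
    """Stub transition: returns the next token from a fixed lookup table.
    A real RNN would condition on `context` (the encoded ECFP) and all
    previously emitted tokens."""
    table = {"<start>": "C", "C": "C(", "C(": "N)", "N)": "<end>"}
    return table[prev_token]

def decode(context, max_len=20):
    """Greedy autoregressive decoding until <end> or max_len tokens."""
    tokens, prev = [], "<start>"
    for _ in range(max_len):
        prev = step(context, prev)
        if prev == "<end>":
            break
        tokens.append(prev)
    return "".join(tokens)

print(decode(context=[0.1, 0.9]))  # CC(N)
```

The concatenation of emitted tokens yields the SMILES string, mirroring the y = ('CCC', 'CCC', 'CC(', …) → string mapping on the slide.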
Deep Learning-Based Evolutionary Design

Validation test
• Design target: change the S1 (light-absorbing wavelength) of seed molecules
• Training data: M.W. 200~600 g/mol from PubChem (10,000~50,000 molecules)
※1. No. of test data = no. of training data / 10
※2. Chemical validity is evaluated with RDKit; no. of test data = 5,000

[Figure: DNN vs. DFT parity plots for HOMO (R=0.945), LUMO (R=0.955), and S1 (R=0.973), all in eV]

No. of training data | DNN accuracy※1 (R, MAE): S1 | HOMO | LUMO | RNN decoding success rate※2
① 50,000 | 0.973, 0.198 | 0.945, 0.172 | 0.955, 0.209 | 86.7%
② 30,000 | 0.930, 0.228 | 0.934, 0.191 | 0.945, 0.224 | 85.3%
③ 10,000 | 0.913, 0.278 | 0.885, 0.244 | 0.917, 0.287 | 83.2%
Deep Learning-Based Evolutionary Design

Evolution toward the increase and decrease of S1 (eV)
• Seed: 50 randomly selected molecules (3.8 < S1 < 4.2)
• Number of training data = 10k, 30k, 50k

[Figure: average rate of S1 change (%) vs. generation (0–500), from -60% to +60%, for training-data sizes 10,000 / 30,000 / 50,000, toward both the increase and the decrease of S1; the S1 distribution in the training data (50k) peaks near 4.0 eV; a second panel compares runs with and without constraints for both directions]
Deep Learning-Based Evolutionary Design

Evolution under the constraint of HOMO and LUMO (eV)
• Seed: 50 randomly selected molecules (3.8 < S1 < 4.2)
• Number of training data = 50k
• Constraint: -7.0 < HOMO < -5.0, LUMO < 0.0

[Figure: average rate of S1 change (%) vs. generation toward the increase and decrease of S1; HOMO and LUMO distributions in the training data (50k), with the constraint boundaries at -7 eV, -5 eV, and 0 eV marked]
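A constrained fitness function in the spirit of this slide (push S1 up or down while keeping HOMO and LUMO inside the window) can be sketched as follows. The `predict` stub is a hypothetical stand-in for the DNN property predictor, and the candidate tuples are made-up values for illustration.

```python
def predict(molecule):
    """Stub: returns (S1, HOMO, LUMO) in eV for a toy 'molecule'.
    Here a molecule is simply represented by its property tuple."""
    return molecule

def fitness(molecule, direction=+1):
    """S1 in the chosen direction, with hard HOMO/LUMO constraints."""
    s1, homo, lumo = predict(molecule)
    # Reject candidates outside the constraint window from the slide.
    if not (-7.0 < homo < -5.0) or not (lumo < 0.0):
        return float("-inf")
    return direction * s1   # direction=+1 to raise S1, -1 to lower it

candidates = [(4.1, -6.2, -1.0), (5.0, -4.5, -0.5), (4.6, -6.8, -0.2)]
best = max(candidates, key=fitness)
print(best)  # (4.6, -6.8, -0.2): highest S1 among constraint-satisfying candidates
```

Returning negative infinity for constraint violators means selection simply never propagates them, which is one straightforward way a GA can honor hard property windows.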
Deep Learning-Based Evolutionary Design

Examples of evolved molecules (no. of training data = 50k)
Constraint (eV): -7.0 < HOMO < -5.0; LUMO < 0.0
[Figure: structures of the evolved molecules]
Conclusions

A fully data-driven evolutionary molecular design based on deep-learning models (DNN & RNN) was proposed; it automatically evolved seed molecules toward targets without any pre-defined chemical rules.

Unlike HTCS, the closed-loop evolutionary workflow guided by deep learning automatically derived target molecules and found rational design paths by elucidating the relationship between structural features and their effect on molecular properties.
Deep-Learning-based Inverse Design

npj Comput. Mater., 4, 67, 2018
Deep-Learning-Based Inverse Design

Paradigm shift of ML in computer-aided materials discovery

Passive role (screening): database → candidate pool → predict materials properties
• Efficient screening based on property prediction
• Highly depends on the explicit knowledge of chemists

Active role (artificial intelligence for materials design): target properties → propose target materials
• Proposes candidates via automated design
• Provides implicit knowledge extracted from data
Deep-Learning-Based Inverse Design

Implementation of the inverse-design model
a. Workflow: molecular database (molecular structures + properties) → deep learning → extraction of design knowledge → generation of new molecules → molecular design [target properties]
b. Model:
• z = e(x): DNN encoding function; x is the molecular descriptor (ECFP format)
• f(z): property prediction function → molecular property (t)
• d(z): RNN decoding function → molecular structure identifier (y; SMILES format)
• z: encoded vector of the molecular descriptor, a fixed-length hidden factor shared between encoder and decoder (input → encoder → hidden factor → decoder → output)
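The e(·)/f(·)/d(·) composition above can be illustrated numerically with toy stand-ins: encode the descriptor once into z, then read off both the predicted property f(z) and the decoded structure d(z) from the same code. All three functions below are hypothetical stubs, not the paper's networks.

```python
def e(x):
    """Encoder stub: map a bit-vector descriptor to a fixed-length code z."""
    return [sum(x) / len(x), x[0]]

def f(z):
    """Property-head stub: predict a scalar property t from z."""
    return 2.0 * z[0] + z[1]

def d(z):
    """Decoder stub: map z back to a structure identifier (SMILES-like)."""
    return "C" * max(1, round(4 * z[0]))

x = [1, 0, 1, 1]          # toy ECFP-style descriptor
z = e(x)                  # encode once...
t, y = f(z), d(z)         # ...then read off both property and structure
print(z, t, y)
```

Because property prediction and structure generation share the one latent vector z, steering z toward a target value of f(z) and then decoding with d(z) is what makes the design "inverse".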
Deep-Learning-Based Inverse Design

Inverse design of light-absorbing organic molecules (1/2)
• Training DB
‒ 50k molecules sampled from PubChem (M.W. 200~600)
‒ DFT calculations for S1

[Figure: distribution of λmax (nm, 100–600) of the inverse-designed molecules (percentage of generated molecules, 0–50%) for three targets: λmax = 200–300 nm (hit rate 82.6%), 300–400 nm (64.8%), and 400–500 nm (45.6%)]
*Simulation values for the 500 molecules in each target
※ About 10% of the designed molecules were found in PubChem even though they were not included in the randomly selected training library.
Deep-Learning-Based Inverse Design

Inverse design of light-absorbing organic molecules (2/2)
Examples of inverse-designed molecules which share moieties with well-known dye materials:
a. Anthraquinone derivative (λmax=433.4 nm)
b. Azobenzene derivative (λmax=527.5 nm)
c. Isoindoline derivative (λmax=434.4 nm)
d. Squaraine derivative (λmax=503.5 nm)
Deep-Learning-Based Inverse Design

Inverse design of hosts for blue phosphorescent OLED (1/3)
• Target: T1 ≥ 3.00 eV
• Training DB
‒ In-house library of 6,000 molecules built by combinatorial enumeration (nine linker (L) and fifty-seven terminal (R) fragments frequently employed in OLED hosts; symmetric R-L-R & R-R type enumeration)
‒ Property labeling with DFT calculations

[Figure: distributions of simulated T1 (eV) for a. the training library (mean=2.94, std=0.15), b. targeted inverse design, T1 ≥ 3.00 eV, 3,205 generated molecules (mean=3.02, std=0.10), and c. untargeted inverse design, 3,497 molecules (mean=2.92, std=0.13)]

Fractions of hosts satisfying the target (T1 ≥ 3.00 eV): 36.2% for a, 58.7% for b, 26.9% for c.
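The symmetric R-L-R / R-R enumeration that built the training library can be sketched with unordered fragment pairs. The counts below are the raw combinatorial totals for 9 linkers and 57 terminals with placeholder names; the actual 6,000-molecule library is smaller because chemical filters and linker-site details prune the raw space.

```python
from itertools import combinations_with_replacement

# Placeholder fragment names (hypothetical).
linkers = [f"L{i}" for i in range(9)]        # nine linker fragments
terminals = [f"R{i}" for i in range(57)]     # fifty-seven terminal fragments

# R-R: unordered pairs of terminals (covers both R1-R1 and R1-R2 types).
rr = list(combinations_with_replacement(terminals, 2))

# R-L-R: a linker with an unordered pair of terminals on its two sites
# (treating the two attachment sites as equivalent, i.e. symmetric).
rlr = [(a, l, b) for l in linkers
       for a, b in combinations_with_replacement(terminals, 2)]

print(len(rr), len(rlr))  # 1653 14877
```

Using unordered pairs (combinations with replacement) rather than ordered products is what encodes the "symmetric" part of the enumeration: R1-L-R2 and R2-L-R1 count once.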
Deep-Learning-Based Inverse Design

Inverse design of hosts for blue phosphorescent OLED (2/3)

Examples of inverse-designed host materials:
a. Asymmetric molecules with the given fragments in the training library (a1–a3)
b. Symmetric molecules where new fragments were introduced (b1–b3)
c. Asymmetric molecules where new fragments were introduced (c1–c3)
[Figure: the nine structures with ML-predicted vs. DFT-computed T1 values, e.g. ML 3.13 eV / DFT 3.16 eV, ML 3.05 eV / DFT 3.06 eV, ML 3.03 eV / DFT 3.12 eV; agreement within ~0.1 eV throughout]

Experiment (eV):
      HOMO    LUMO    S1     T1     ΔEST
a1    -5.98   -2.43   3.56   3.06   0.55
b1    -5.96   -2.14   3.64   2.93   1.01
c1    -6.07   -2.65   3.38   2.97   0.46
Deep-Learning-Based Inverse Design

Inverse design of hosts for blue phosphorescent OLED (3/3)

The connection rules of the inverse-designed molecules (total host molecules: 3,205):
• R-L-R (3,010)
  ‒ R-Lsym-R (1,931): R1-Lsym-R1 (403), R1-Lsym-R2 (1,528)
  ‒ R-Lasym-R (1,079): R1-Lasym-R1 (636), R1-Lasym-R2 (443)
• R-R (190): R1-R1 (16), R1-R2 (174)
• L-(R1,R2,R3) ★ (4): a linker carrying three terminals (terminal1, terminal2, terminal3)
• R (1)

L: linker fragment; R: terminal fragment; Lsym: symmetric linker; Lasym: asymmetric linker
Conclusions

A fully data-driven inverse design method successfully extracted the latent materials design rules and proposed target molecular structures without any external intervention.

The inverse design model successfully proposed new candidates by modifying the assembly rules and creating new fragments.
Efficacy of Computer-Aided Materials Discovery

Simulation-based screening (full search; HTCS for a pre-defined chemical space):
• 1st trial: 1M candidates; QC simulations take 1.5 years; fail to find the target structure
• 2nd trial: 1M candidates; QC simulations take 1.5 years; fail to find the target structure
• 3rd trial: 1M candidates; QC simulations take 1.5 years; succeed in finding the right structure
→ Total TAT took 4.5 years

Inverse design (50K training molecules → 5M design space):
• [Step 1] Building the training dataset: QC simulations for only 50k molecules (27 days)
• [Step 2] Deep-learning model training with GPU (3 days)
• [Step 3] QC simulations for the proposed molecules (1 day)
→ Total TAT takes 1 month: more than 50X speed-up (4.5 years vs. 1 month)

* QC simulation tool: TURBOMOLE. Total computational resources = 10,000 CPUs; with 10 CPUs per molecule, the simulation requires about 13 hrs.
"The inverse design learns by itself the molecular design rules inherent in the libraries and can reduce the effort of researchers and the total time to reach the goal."
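The "more than 50X" claim follows directly from the quoted figures; a quick arithmetic check:

```python
# Turnaround-time (TAT) comparison using the numbers quoted on the slide.
full_search_days = 3 * 1.5 * 365        # three trials of 1.5 years each
inverse_design_days = 27 + 3 + 1        # dataset build + training + final QC runs

speedup = full_search_days / inverse_design_days
print(round(speedup))  # 53, i.e. "more than 50X"
```

The dominant term is the dataset-building step, so the speed-up is essentially the ratio of molecules simulated (3M vs. ~50k) times the per-trial overhead of the trial-and-error loop.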
Prospects for AI-based Materials Development

A closed loop of design → synthesis → analytical chemistry → simulation, accelerated by:
• Neural-network MD potentials: energies learned from DFT simulation (<100 atoms) extended to meso-scale simulation (~10^4 atoms) and electronic properties
• Self-driving laboratories, e.g. ChemOS: robot synthesis ↔ characterization ↔ ML algorithm, linked by a database and queuing system (design → synthesis → analysis → training loop)
• Artificial intelligence for materials design: target properties + database → propose target materials
Database