TRANSCRIPT
Contents
• Trend in Computer-Aided Materials Discovery
• High-Throughput Computational Screening & Exhaustive Enumeration
• Deep-Learning-based Evolutionary Design
• Deep-Learning-based Inverse Design
• Efficacy of Computer-Aided Materials Discovery
Trend in Computer-Aided Materials Discovery

[Figure: evolution of materials discovery approaches]
• Conventional: trial-and-error (high cost) – iterative experiments
• 1st Gen.: simulation (low throughput) – pre-validation
• 2nd Gen.: virtual screening (low hit-rate) – high throughput
• 3rd Gen.: targeted design (high hit-rate) – right solutions with minimum effort

For accelerated materials discovery: first-principles quantum chemistry (rationalization), high-performance computing (efficiency), and machine learning (intelligence).

Prediction of materials properties based on machine learning
– Build-up of a materials-vs.-property DB → Materials Informatics
Trend in Computer-Aided Materials Discovery

History of machine learning in chemistry:
• QSAR* ('62, Hansch & Fujita) – cheminformatics
• ANN** in chemistry ('71) – introduction stage of machine learning
• SMILES*** ('87, Weininger)
• Graph kernels ('05 @ UC Irvine) – kernel methods
• Bayesian modeling ('09 @ MIT) – Bayesian approaches
• Deep learning ('16 @ Stanford; '18 @ Harvard) – development stage

Process of Machine Learning @ Materials Research: Descriptor → Training → Analysis
• Molecules are encoded as descriptor vectors, e.g.:
  SMILES: CC(C)NCC(O)COC1=CC(CC2=CC=CC=C2)=C(CC(N)=O)C=C1
  Fingerprint: 011100011111101010010100100000101010001001010…
• Other input representations: graphs, images

* QSAR: Quantitative Structure–Activity Relationship
** ANN: Artificial Neural Network
*** SMILES: Simplified Molecular-Input Line-Entry System
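As a minimal sketch of the descriptor step above (SMILES string → fixed-length bit string), the following toy function hashes character substrings of a SMILES string into a bit vector. It is only an illustration: real ECFP hashing operates on atom neighborhoods of the molecular graph, not raw substrings, and the function name and parameters here are invented for the example.

```python
import hashlib

def toy_fingerprint(smiles, n_bits=64, radius=3):
    """Toy substring-hash fingerprint (illustrative stand-in for ECFP)."""
    bits = [0] * n_bits
    for size in range(1, radius + 1):              # substrings of length 1..radius
        for i in range(len(smiles) - size + 1):
            fragment = smiles[i:i + size]
            h = int(hashlib.md5(fragment.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1                   # set the hashed bit position
    return "".join(map(str, bits))

fp = toy_fingerprint("CC(C)NCC(O)COC1=CC(CC2=CC=CC=C2)=C(CC(N)=O)C=C1")
print(len(fp))  # 64
```

The resulting bit string can be fed to any vector-based learner, which is exactly the role the fingerprint plays in the slide's descriptor → training → analysis pipeline.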
Trend in Computer-Aided Materials Discovery

Materials design based on machine learning
– Inverse QSAR → Inverse Design

Timeline of autonomous molecular generation ([In] targets → [Out] molecules):
• Inverse QSAR (late '80s~)
• Genetic algorithms ('92 @ Purdue) – evolutionary
• Exhaustive generation ('12 @ Tokyo) – combinatorial
• SMILES autoencoder ('16 @ Harvard) – autoencoder
• Inverse design ('16 @ SAIT)
• GAN* for molecules ('17 @ Harvard) – deep learning / generative models
*GAN: Generative Adversarial Network

Materials discovery methodologies and their elemental technologies:
• HTCS (High-Throughput Computational Screening) + molecular enumeration
• Evolutionary design
• Inverse design
built on automated simulation, databases, machine learning, and materials informatics.
Trend in Computer-Aided Materials Discovery

In-silico technologies for materials discovery → target molecules
High-Throughput Computational Screening & Exhaustive Enumeration

"Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)-complexes predicted by a graph-based enumeration and deep learning", GI01.02.02, 2018 MRS Fall Meeting
High-Throughput Computational Screening

Property prediction with high-performance computing for large-scale exploration of materials candidates:
seed fragments → combination → candidate pool (large amounts of candidates) → simulation & verification against the database → target materials
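The combination step above can be sketched as a Cartesian product over fragment sets. The fragment names below are placeholders for illustration, not the fragments actually used in the talk.

```python
from itertools import product

# Hypothetical seed-fragment sets (placeholder names).
cores = ["carbazole", "fluorene", "dibenzofuran"]
linkers = ["phenyl", "biphenyl"]
substituents = ["H", "CN", "OMe", "tBu"]

# Candidate pool = every combination of core, linker, and two substituents.
candidate_pool = [
    f"{s1}-{core}-{linker}-{s2}"
    for core, linker, s1, s2 in product(cores, linkers, substituents, substituents)
]
print(len(candidate_pool))  # 3 * 2 * 4 * 4 = 96
```

Even a handful of fragments multiplies into a large pool, which is why the subsequent simulation/verification stage dominates the cost of HTCS.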
High-Throughput Computational Screening

ML (machine learning)-assisted HTCS for higher efficiency:
(1) Simulation + ML
(2) Prioritizing calculations based on active learning
The pipeline is the same (seed fragments → combination → candidate pool → target materials, with database and verification steps), but ML replaces part of the simulation and active learning decides which candidates to simulate first.
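One common way to "prioritize calculations based on active learning", as in point (2), is to rank candidates by the disagreement of a cheap surrogate ensemble and simulate the most uncertain ones first. The slide does not specify the acquisition function, so the ensemble-variance criterion below is an assumption; the linear models are toy stand-ins for retrained DNNs.

```python
import random
from statistics import pstdev

random.seed(0)

def cheap_model(x, w):
    """One member of a surrogate ensemble (a toy linear model)."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Ensemble of slightly different models, standing in for retrained DNNs.
ensemble = [[random.gauss(1.0, 0.2) for _ in range(4)] for _ in range(5)]

# Unlabeled candidate pool (toy 4-dimensional descriptors).
candidates = [[random.random() for _ in range(4)] for _ in range(100)]

def uncertainty(x):
    """Disagreement across the ensemble, used as the acquisition score."""
    return pstdev(cheap_model(x, w) for w in ensemble)

# Send the most uncertain candidates to the expensive simulator first.
queue = sorted(candidates, key=uncertainty, reverse=True)[:10]
print(len(queue))  # 10
```

Each round of simulation then yields new labels, the ensemble is retrained, and the queue is re-ranked, closing the active-learning loop.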
High-Throughput Computational Screening

Exhaustive enumeration based on graph theory
– "Graphs"
• Mathematical structures used to model pairwise relations between objects.
• Made up of nodes and edges.
• In chemistry, a graph is used to model a molecule, where nodes represent atoms and edges represent bonds.
※ Exhaustive enumeration: systematic enumeration of all possible molecules for optimal-solution search
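The enumeration of non-isomorphic connected graphs used in the following slides can be reproduced by brute force for small node counts: generate every edge set, keep the connected ones, and deduplicate by a canonical form taken over all node relabelings. This is a sketch, not the authors' implementation; a permutation-based canonical form is only feasible for a handful of nodes.

```python
from itertools import combinations, permutations

def connected(n, edges):
    """Check connectivity of an n-node graph given as a set of edges."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w); stack.append(w)
    return len(seen) == n

def canonical(n, edges):
    """Canonical form: lexicographically smallest edge set over relabelings."""
    best = None
    for p in permutations(range(n)):
        relabeled = tuple(sorted(tuple(sorted((p[a], p[b]))) for a, b in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best

def count_connected_graphs(n):
    """Count connected non-isomorphic simple graphs on n nodes."""
    all_edges = list(combinations(range(n), 2))
    seen = set()
    for k in range(n - 1, len(all_edges) + 1):   # need >= n-1 edges to connect
        for edges in combinations(all_edges, k):
            if connected(n, edges):
                seen.add(canonical(n, edges))
    return len(seen)

# Connected non-isomorphic simple graphs on 3, 4, 5 nodes: 2, 6, 21.
print([count_connected_graphs(n) for n in (3, 4, 5)])  # [2, 6, 21]
```

The counts 2 + 6 + 21 = 29 match the "total 29 graphs" for 3–5 rings quoted on the Approach slide.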
High-Throughput Computational Screening

Complete list of non-isomorphic graphs (http://www.cadaeic.net/graphpics.htm)
[Figure: each graph labeled with an ID, its number of edges, and the number of edges at each node]
High-Throughput Computational Screening

Landscape of phosphorescent light-emitting energies of homoleptic Ir(III)-complex core structures
– Ir(III)-complexes
• Widely used as phosphorescent OLED dopants.
• Figuring out the full landscape of emission color is important for discovering high-performing molecules in target color regions.

References: New J. Chem., 39, 246 (2015); ACS Appl. Mater. Interfaces, 10, 1888–1896 (2018); Organic Electronics, 63, 244–249 (2018)
High-Throughput Computational Screening

Approach
– Consider the nodes in a graph as rings and the edges as ring-connections.
– Limit the total number of rings to between 3 and 5.
– Exclude the non-planar type (5-21) and structures invalid as dopants.
→ Only 11 of the total 29 graphs are valid.
High-Throughput Computational Screening

Enumeration
1. Graphs → 2. Skeletons → 3. Set iridium positions → 4. Substitute some carbon atoms with nitrogen atoms
– Build skeletons from 5- and 6-membered rings (405 skeletons in total).
– Substitute some carbons of each molecule with nitrogen atoms (max. five).
→ 9,919,469 (~10M) core structures in total
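The nitrogen-substitution step in point 4 is a combinatorial choice: pick up to five carbon positions to replace. The sketch below counts and generates those choices; the 12-position example is hypothetical, and a real enumeration would additionally deduplicate symmetry-equivalent substitutions.

```python
from itertools import combinations
from math import comb

def n_substitutions(n_carbons, max_n=5):
    """Number of ways to replace up to max_n carbons with nitrogen
    (including the unsubstituted molecule), ignoring symmetry."""
    return sum(comb(n_carbons, k) for k in range(max_n + 1))

def substituted_variants(positions, max_n=5):
    """Yield each choice of carbon positions to turn into nitrogen."""
    for k in range(max_n + 1):
        yield from combinations(positions, k)

# e.g. a skeleton with 12 substitutable carbon positions:
print(n_substitutions(12))  # 1 + 12 + 66 + 220 + 495 + 792 = 1586
```

Multiplying per-skeleton counts of this size across 405 skeletons and the iridium-position choices is how the space grows to the ~10M core structures quoted above.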
High-Throughput Computational Screening

Property prediction
– Trained a deep-neural-network model with simulated T1 data
• Input: ECFP (Extended-Connectivity FingerPrints) of molecular structures
• Output: T1 energy (phosphorescent light-emitting wavelength)

[Figure: mean absolute error of the DNN's T1 prediction (eV) vs. size of the training dataset, 10K–80K; with 80k training data, the average prediction error was less than 0.1 eV]

80k / 10M = 0.8%
By simulating the properties of only 0.8% of the molecules, we can fully scan the chemical space of 10M!
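The simulate-a-fraction, predict-the-rest workflow can be sketched end to end with stand-ins: the `simulate_T1` stub replaces the quantum-chemistry calculation, and a 1-nearest-neighbour model on Hamming distance replaces the DNN. All names and the 5% labeling fraction here are illustrative (the slide's actual ratio is 0.8%).

```python
import random

random.seed(0)

def simulate_T1(fp):
    """Stub for an expensive quantum-chemistry calculation: here, a
    cheap deterministic function of the fingerprint bits."""
    return 1.5 + sum(fp) / len(fp)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# 1. Enumerate a toy chemical space of bit-string fingerprints.
space = [tuple(random.randint(0, 1) for _ in range(32)) for _ in range(1000)]

# 2. Run the expensive "simulation" on only a small fraction.
labeled = {fp: simulate_T1(fp) for fp in space[:50]}

# 3. Surrogate model: 1-nearest-neighbour on Hamming distance
#    (a stand-in for the DNN in the talk).
def predict_T1(fp):
    nearest = min(labeled, key=lambda ref: hamming(fp, ref))
    return labeled[nearest]

# 4. Scan the full space with the cheap surrogate.
t1_map = {fp: predict_T1(fp) for fp in space}
print(len(t1_map))  # 1000
```

The point is structural: the full space gets a predicted property at the cost of simulating only the labeled subset, which is exactly the 0.8% argument above.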
High-Throughput Computational Screening

Results
– Distribution of T1 values
– Blue-color-emitting materials are rare compared with red and green

[Figure: histogram of the number of molecules (×100,000) vs. predicted T1 (eV) over 0.05–2.95 eV; fractions in the target color regions: Blue 0.4%, Green 4.3%, Red 18.4%]
Conclusions

In materials discovery, deep-learning-based HTCS is a good alternative to the conventional trial-and-error approach. Moreover, exhaustive enumeration makes it possible to systematically explore the whole chemical space.

With the proposed exhaustive enumeration method based on graph theory and deep learning, the whole landscape of 10M phosphorescent Ir-dopants could be scanned at just 0.8% of the computational cost of a pure simulation-based approach.
Deep-Learning-based Evolutionary Design

"Evolutionary design of organic molecules based on deep learning and genetic algorithm", COMP, ACS Fall 2018 National Meeting
Evolutionary Design

A generic population-based metaheuristic optimization technique that uses bio-inspired operators to reach near-optimal solutions; in the case of a genetic algorithm: mutation, crossover, and selection.

[Figure: fitness landscape (https://en.wikipedia.org/wiki/Fitness_landscape); average fitness rises with generation]

GA loop: initial population → calculate fitness → selection → crossover / mutation → new population → satisfy constraints? (yes: done; no: iterate)
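The GA loop above can be sketched in a few lines on bit strings. The fitness function here is a toy (counting 1-bits); in the talk's setting it would be a DNN property prediction. Genome length, population size, and operator rates are illustrative choices.

```python
import random

random.seed(1)

GENOME_LEN, POP, GENS = 16, 30, 40

def fitness(bits):
    """Toy fitness: count of 1-bits. A real run would call a
    property-prediction model here."""
    return sum(bits)

def select(pop):
    """Tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return max(a, b, key=fitness)

def crossover(p1, p2):
    """Single-point crossover of two parent bit strings."""
    cut = random.randrange(1, GENOME_LEN)
    return p1[:cut] + p2[cut:]

def mutate(bits, rate=0.05):
    """Flip each bit independently with the given probability."""
    return [b ^ (random.random() < rate) for b in bits]

# Initial population → repeated selection, crossover, mutation.
pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP)]
for _ in range(GENS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best))
```

Under selection pressure the population's average fitness climbs generation by generation, which is the behavior the fitness-landscape figure illustrates.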
Deep-Learning-Based Evolutionary Design

Proposed approach (conventional → proposed, with expectations):
• Molecular descriptor: graph or ASCII string → bit string (ECFP*)
• Molecular evolution: heuristic → random bit operations + RNN decoding to SMILES (prevents heuristic bias; secures chemical validity)
• Fitness evaluation: simple assessment → DNN (versatile evaluation is possible)

Workflow: seed molecule (ECFP) → mutation (n=50) → decoding to SMILES (RNN) → inspection of chemical validity → fitness evaluation (DNN) → selection → evolution (crossover → mutation) → decoding to SMILES (RNN) → inspection of chemical validity → fitness evaluation (DNN) → iteration → best-fit molecule (results stored in a DB)
[Figure: bit-string crossover and mutation of parent fingerprints]

*ECFP (Extended Connectivity FingerPrint), DNN (Deep Neural Network), RNN (Recurrent Neural Network), SMILES (Simplified Molecular-Input Line-Entry System)
Deep Learning-Based Evolutionary Design

Deep learning models
• [DNN] Input: ECFP* → Output: properties; 3 hidden layers, 500 hidden units in each layer
• [RNN] Input: ECFP* → Output: SMILES; 3 hidden layers, 500 long short-term memory units
  – Decoding proceeds token by token: <start> → y1='CCC' (t=1) → y2='CCC' (t=2) → y3='CC(' (t=3) → … → yT=')=O' (t=T) → <end> (t=T+1)
  – y = ('CCC', 'CCC', 'CC(', …, ')=O') → 'CCCC(N)=O'

*ECFP (dimension=5,000, neighbor size=6)
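The token-by-token decoding described above can be sketched with a greedy loop: starting from <start>, repeatedly feed the previous token (plus the fixed-length input vector) to the model until <end> is produced. The `step` function below is a hypothetical lookup-table stub standing in for the 3-layer LSTM; a real model would output a probability distribution over the token vocabulary at each step.

```python
def step(context, prev_token):
    """Stub transition: returns the next token from a fixed lookup table.
    A real RNN would condition on `context` (the encoded ECFP) and all
    previously emitted tokens."""
    table = {"<start>": "C", "C": "C(", "C(": "N)", "N)": "<end>"}
    return table[prev_token]

def decode(context, max_len=20):
    """Greedy autoregressive decoding until <end> or max_len tokens."""
    tokens, prev = [], "<start>"
    for _ in range(max_len):
        prev = step(context, prev)
        if prev == "<end>":
            break
        tokens.append(prev)
    return "".join(tokens)

print(decode(context=[0.1, 0.9]))  # CC(N)
```

The concatenation of emitted tokens yields the SMILES string, mirroring the y = ('CCC', 'CCC', 'CC(', …) → string mapping on the slide.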
Deep Learning-Based Evolutionary Design

Validation test
• Design target: change the S1 (light-absorbing wavelength) of seed molecules
• Training data: M.W. 200~600 g/mol from PubChem (10,000~50,000 molecules)
※1. No. of test data = no. of training data / 10
※2. Chemical validity is evaluated with RDKit; no. of test data = 5,000

[Figure: DNN vs. DFT parity plots for HOMO (R=0.945), LUMO (R=0.955), and S1 (R=0.973), all in eV]

No. of training data | DNN accuracy※1 (R, MAE): S1 | HOMO | LUMO | RNN decoding success rate※2
① 50,000 | 0.973, 0.198 | 0.945, 0.172 | 0.955, 0.209 | 86.7%
② 30,000 | 0.930, 0.228 | 0.934, 0.191 | 0.945, 0.224 | 85.3%
③ 10,000 | 0.913, 0.278 | 0.885, 0.244 | 0.917, 0.287 | 83.2%
Deep Learning-Based Evolutionary Design

Evolution toward the increase and decrease of S1 (eV)
• Seed: 50 randomly selected molecules (3.8 < S1 < 4.2)
• Number of training data = 10k, 30k, 50k

[Figure: average rate of S1 change (%) vs. generation (0–500), from -60% to +60%, for training-data sizes 10,000 / 30,000 / 50,000, toward both the increase and the decrease of S1; the S1 distribution in the training data (50k) peaks near 4.0 eV; a second panel compares runs with and without constraints for both directions]
Deep Learning-Based Evolutionary Design

Evolution under the constraint of HOMO and LUMO (eV)
• Seed: 50 randomly selected molecules (3.8 < S1 < 4.2)
• Number of training data = 50k
• Constraint: -7.0 < HOMO < -5.0, LUMO < 0.0

[Figure: average rate of S1 change (%) vs. generation toward the increase and decrease of S1; HOMO and LUMO distributions in the training data (50k), with the constraint boundaries at -7 eV, -5 eV, and 0 eV marked]
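A constrained fitness function in the spirit of this slide (push S1 up or down while keeping HOMO and LUMO inside the window) can be sketched as follows. The `predict` stub is a hypothetical stand-in for the DNN property predictor, and the candidate tuples are made-up values for illustration.

```python
def predict(molecule):
    """Stub: returns (S1, HOMO, LUMO) in eV for a toy 'molecule'.
    Here a molecule is simply represented by its property tuple."""
    return molecule

def fitness(molecule, direction=+1):
    """S1 in the chosen direction, with hard HOMO/LUMO constraints."""
    s1, homo, lumo = predict(molecule)
    # Reject candidates outside the constraint window from the slide.
    if not (-7.0 < homo < -5.0) or not (lumo < 0.0):
        return float("-inf")
    return direction * s1   # direction=+1 to raise S1, -1 to lower it

candidates = [(4.1, -6.2, -1.0), (5.0, -4.5, -0.5), (4.6, -6.8, -0.2)]
best = max(candidates, key=fitness)
print(best)  # (4.6, -6.8, -0.2): highest S1 among constraint-satisfying candidates
```

Returning negative infinity for constraint violators means selection simply never propagates them, which is one straightforward way a GA can honor hard property windows.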
Deep Learning-Based Evolutionary Design

Examples of evolved molecules (no. of training data = 50k)
Constraint (eV): -7.0 < HOMO < -5.0; LUMO < 0.0
[Figure: structures of the evolved molecules]
Conclusions

A fully data-driven evolutionary molecular design based on deep-learning models (DNN & RNN) was proposed; it automatically evolved seed molecules toward targets without any pre-defined chemical rules.

Unlike HTCS, the closed-loop evolutionary workflow guided by deep learning automatically derived target molecules and found rational design paths by elucidating the relationship between structural features and their effect on molecular properties.
Deep-Learning-based Inverse Design

npj Comput. Mater., 4, 67, 2018
Deep-Learning-Based Inverse Design

Paradigm shift of ML in computer-aided materials discovery

Passive role (screening): database → candidate pool → predict materials properties
• Efficient screening based on property prediction
• Highly depends on the explicit knowledge of chemists

Active role (artificial intelligence for materials design): target properties → propose target materials
• Proposes candidates via automated design
• Provides implicit knowledge extracted from data
Deep-Learning-Based Inverse Design

Implementation of the inverse-design model
a. Workflow: molecular database (molecular structures + properties) → deep learning → extraction of design knowledge → generation of new molecules → molecular design [target properties]
b. Model:
• z = e(x): DNN encoding function; x is the molecular descriptor (ECFP format)
• f(z): property prediction function → molecular property (t)
• d(z): RNN decoding function → molecular structure identifier (y; SMILES format)
• z: encoded vector of the molecular descriptor, a fixed-length hidden factor shared between encoder and decoder (input → encoder → hidden factor → decoder → output)
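The e(·)/f(·)/d(·) composition above can be illustrated numerically with toy stand-ins: encode the descriptor once into z, then read off both the predicted property f(z) and the decoded structure d(z) from the same code. All three functions below are hypothetical stubs, not the paper's networks.

```python
def e(x):
    """Encoder stub: map a bit-vector descriptor to a fixed-length code z."""
    return [sum(x) / len(x), x[0]]

def f(z):
    """Property-head stub: predict a scalar property t from z."""
    return 2.0 * z[0] + z[1]

def d(z):
    """Decoder stub: map z back to a structure identifier (SMILES-like)."""
    return "C" * max(1, round(4 * z[0]))

x = [1, 0, 1, 1]          # toy ECFP-style descriptor
z = e(x)                  # encode once...
t, y = f(z), d(z)         # ...then read off both property and structure
print(z, t, y)
```

Because property prediction and structure generation share the one latent vector z, steering z toward a target value of f(z) and then decoding with d(z) is what makes the design "inverse".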
Deep-Learning-Based Inverse Design

Inverse design of light-absorbing organic molecules (1/2)
• Training DB
‒ 50k molecules sampled from PubChem (M.W. 200~600)
‒ DFT calculations for S1

[Figure: distribution of λmax (nm, 100–600) of the inverse-designed molecules (percentage of generated molecules, 0–50%) for three targets: λmax = 200–300 nm (hit rate 82.6%), 300–400 nm (64.8%), and 400–500 nm (45.6%)]
*Simulation values for the 500 molecules in each target
※ About 10% of the designed molecules were found in PubChem even though they were not included in the randomly selected training library.
Deep-Learning-Based Inverse Design

Inverse design of light-absorbing organic molecules (2/2)
Examples of inverse-designed molecules which share moieties with well-known dye materials:
a. Anthraquinone derivative (λmax=433.4 nm)
b. Azobenzene derivative (λmax=527.5 nm)
c. Isoindoline derivative (λmax=434.4 nm)
d. Squaraine derivative (λmax=503.5 nm)
Deep-Learning-Based Inverse Design

Inverse design of hosts for blue phosphorescent OLED (1/3)
• Target: T1 ≥ 3.00 eV
• Training DB
‒ In-house library of 6,000 molecules built by combinatorial enumeration (nine linker (L) and fifty-seven terminal (R) fragments frequently employed in OLED hosts; symmetric R-L-R & R-R type enumeration)
‒ Property labeling with DFT calculations

[Figure: distributions of simulated T1 (eV) for a. the training library (mean=2.94, std=0.15), b. targeted inverse design, T1 ≥ 3.00 eV, 3,205 generated molecules (mean=3.02, std=0.10), and c. untargeted inverse design, 3,497 molecules (mean=2.92, std=0.13)]

Fractions of hosts satisfying the target (T1 ≥ 3.00 eV): 36.2% for a, 58.7% for b, 26.9% for c.
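The symmetric R-L-R / R-R enumeration that built the training library can be sketched with unordered fragment pairs. The counts below are the raw combinatorial totals for 9 linkers and 57 terminals with placeholder names; the actual 6,000-molecule library is smaller because chemical filters and linker-site details prune the raw space.

```python
from itertools import combinations_with_replacement

# Placeholder fragment names (hypothetical).
linkers = [f"L{i}" for i in range(9)]        # nine linker fragments
terminals = [f"R{i}" for i in range(57)]     # fifty-seven terminal fragments

# R-R: unordered pairs of terminals (covers both R1-R1 and R1-R2 types).
rr = list(combinations_with_replacement(terminals, 2))

# R-L-R: a linker with an unordered pair of terminals on its two sites
# (treating the two attachment sites as equivalent, i.e. symmetric).
rlr = [(a, l, b) for l in linkers
       for a, b in combinations_with_replacement(terminals, 2)]

print(len(rr), len(rlr))  # 1653 14877
```

Using unordered pairs (combinations with replacement) rather than ordered products is what encodes the "symmetric" part of the enumeration: R1-L-R2 and R2-L-R1 count once.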
Deep-Learning-Based Inverse Design

Inverse design of hosts for blue phosphorescent OLED (2/3)

Examples of inverse-designed host materials:
a. Asymmetric molecules with the given fragments in the training library (a1–a3)
b. Symmetric molecules where new fragments were introduced (b1–b3)
c. Asymmetric molecules where new fragments were introduced (c1–c3)
[Figure: the nine structures with ML-predicted vs. DFT-computed T1 values, e.g. ML 3.13 eV / DFT 3.16 eV, ML 3.05 eV / DFT 3.06 eV, ML 3.03 eV / DFT 3.12 eV; agreement within ~0.1 eV throughout]

Experiment (eV):
      HOMO    LUMO    S1     T1     ΔEST
a1    -5.98   -2.43   3.56   3.06   0.55
b1    -5.96   -2.14   3.64   2.93   1.01
c1    -6.07   -2.65   3.38   2.97   0.46
Deep-Learning-Based Inverse Design

Inverse design of hosts for blue phosphorescent OLED (3/3)

The connection rules of the inverse-designed molecules (total host molecules: 3,205):
• R-L-R (3,010)
  ‒ R-Lsym-R (1,931): R1-Lsym-R1 (403), R1-Lsym-R2 (1,528)
  ‒ R-Lasym-R (1,079): R1-Lasym-R1 (636), R1-Lasym-R2 (443)
• R-R (190): R1-R1 (16), R1-R2 (174)
• L-(R1,R2,R3) ★ (4): a linker carrying three terminals (terminal1, terminal2, terminal3)
• R (1)

L: linker fragment; R: terminal fragment; Lsym: symmetric linker; Lasym: asymmetric linker
Conclusions

A fully data-driven inverse design method successfully extracted the latent materials design rules and proposed target molecular structures without any external intervention.

The inverse design model successfully proposed new candidates by modifying the assembly rules and creating new fragments.
Efficacy of Computer-Aided Materials Discovery

Simulation-based screening (full search; HTCS for a pre-defined chemical space):
• 1st trial: 1M candidates; QC simulations take 1.5 years; fail to find the target structure
• 2nd trial: 1M candidates; QC simulations take 1.5 years; fail to find the target structure
• 3rd trial: 1M candidates; QC simulations take 1.5 years; succeed in finding the right structure
→ Total TAT took 4.5 years

Inverse design (50K training molecules → 5M design space):
• [Step 1] Building the training dataset: QC simulations for only 50k molecules (27 days)
• [Step 2] Deep-learning model training with GPU (3 days)
• [Step 3] QC simulations for the proposed molecules (1 day)
→ Total TAT takes 1 month: more than 50X speed-up (4.5 years vs. 1 month)

* QC simulation tool: TURBOMOLE. Total computational resources = 10,000 CPUs; with 10 CPUs per molecule, the simulation requires about 13 hrs.
"The inverse design learns by itself the molecular design rules inherent in the libraries and can reduce the effort of researchers and the total time to reach the goal."
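The "more than 50X" claim follows directly from the quoted figures; a quick arithmetic check:

```python
# Turnaround-time (TAT) comparison using the numbers quoted on the slide.
full_search_days = 3 * 1.5 * 365        # three trials of 1.5 years each
inverse_design_days = 27 + 3 + 1        # dataset build + training + final QC runs

speedup = full_search_days / inverse_design_days
print(round(speedup))  # 53, i.e. "more than 50X"
```

The dominant term is the dataset-building step, so the speed-up is essentially the ratio of molecules simulated (3M vs. ~50k) times the per-trial overhead of the trial-and-error loop.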
Prospects for AI-based Materials Development

A closed loop of design → synthesis → analytical chemistry → simulation, accelerated by:
• Neural-network MD potentials: energies learned from DFT simulation (<100 atoms) extended to meso-scale simulation (~10^4 atoms) and electronic properties
• Self-driving laboratories, e.g. ChemOS: robot synthesis ↔ characterization ↔ ML algorithm, linked by a database and queuing system (design → synthesis → analysis → training loop)
• Artificial intelligence for materials design: target properties + database → propose target materials
Database