Title: Data Mining Approaches in Atomistic Modeling
1 Data Mining Approaches in Atomistic Modeling
H. Aourag, URMER, University of Tlemcen
2 Outline
- Introduction
- Ex 1: Intergranular Embrittlement of Fe
- Ex 2: Catalytic Activity - Hydrogenation
- Ex 3: Stainless Steel CrxNiyFe(1-x-y)
- Ex 4: Conductivity T7 7xxx Al Alloys
- Ex 5: Boiling Points
- Ex 6: Crystal Structure Prediction (open questions)
3 Predicting Properties with Atomistic Modeling
4 Power of Data Mining
Use known data to establish R
Use R to predict new data
- Does not require complete and accurate multiscale theories
- New physics in relationships R
- Quick, cheap screening for desired properties, errors, etc. (can be qualitative)
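A minimal sketch of the two steps above (establish R from known data, then use R to predict new data), assuming a single descriptor and a linear R. The function names and all numbers are invented for illustration and do not come from any of the examples in this talk.

```python
def fit_linear(descriptors, properties):
    """Least-squares fit of property ~ a * descriptor + b; returns (a, b)."""
    n = len(descriptors)
    mean_d = sum(descriptors) / n
    mean_p = sum(properties) / n
    sxx = sum((d - mean_d) ** 2 for d in descriptors)
    sxy = sum((d - mean_d) * (p - mean_p)
              for d, p in zip(descriptors, properties))
    a = sxy / sxx
    b = mean_p - a * mean_d
    return a, b


def make_relationship(a, b):
    """Wrap the fitted constants as the relationship R(descriptor)."""
    return lambda d: a * d + b


# Step 1: use known data to establish R (made-up values).
known_d = [0.0, 1.0, 2.0, 3.0]
known_p = [1.0, 3.0, 5.0, 7.0]
a, b = fit_linear(known_d, known_p)
R = make_relationship(a, b)

# Step 2: use R to predict the property for a new descriptor value.
prediction = R(4.0)
```

In practice R is rarely this simple; the same fit-then-predict pattern applies to any model that maps descriptors to a property.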
5 Key Issues
- Descriptors accessible to modeling
- Descriptors optimally chosen
  - Use known relationships/physics
  - Optimize from large set of possibilities
- Descriptors → Property relationship is robust
  - Sensible choice of methods
  - Tested with cross validation, test sets
- Data
  - Large enough
  - Clean enough
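One way to test robustness with cross validation is sketched here, assuming a one-parameter model (property ~ a * descriptor) and synthetic data. The helper names, the fold count, and the noise level are illustrative assumptions.

```python
import random


def fit_slope(ds, ps):
    """Least-squares slope for the one-parameter model p ~ a * d."""
    return sum(d * p for d, p in zip(ds, ps)) / sum(d * d for d in ds)


def rms(errors):
    """Root-mean-square of a list of residuals."""
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


def k_fold_rms(ds, ps, k=5, seed=0):
    """RMS prediction error when each fold is held out of the fit in turn."""
    idx = list(range(len(ds)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a = fit_slope([ds[i] for i in train], [ps[i] for i in train])
        errors.extend(ps[i] - a * ds[i] for i in held_out)
    return rms(errors)


# Synthetic data: p = 2 d plus small Gaussian noise (invented, not measured).
rng = random.Random(1)
ds = [0.5 * i for i in range(1, 21)]
ps = [2.0 * d + rng.gauss(0.0, 0.1) for d in ds]

a_all = fit_slope(ds, ps)
fit_rms = rms([p - a_all * d for d, p in zip(ds, ps)])  # error on the fitted data
cv_rms = k_fold_rms(ds, ps)                             # error on held-out data
```

A cross-validation error much larger than the fit error is a warning sign that the relationship is not robust.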
6 Ex 1: Intergranular Embrittlement of Fe
- Property: Fe embrittlement
- Descriptors → Property relationship: Embrittlement ∝ grain boundary segregation energy - free surface segregation energy (E_GB - E_FS) (Rice 89)
- Descriptors: (E_GB - E_FS) (calculated ab initio)
- Data: Embrittling potency for B, C, P, S.
7 Ex 1: Intergranular Embrittlement of Fe
(Wu, et al., Phys. Rev. B, 96)
Also correctly predicts effect of Mn and Mo on P embrittlement!
(Zhong, et al., Phys. Rev. B, 97; Geng, et al., Solid State Comm., 01)
8 Ex 2: Catalytic Activity - Hydrogenation
- Property: Reaction rates (hydrogenation of ethene, benzene on 3d transition metal M)
- Descriptors → Property relationship:
  - Adapted Brønsted-Evans-Polanyi free energy
  - Langmuir-Hinshelwood rate equations
  - ⇒ Rate = R(E_MC, 12 fitting constants independent of M)
- Descriptors:
  - E_MC: M-C bond strength in bulk NaCl structure (calculated ab initio)
  - 12 fitting constants (fit to experimental data for each reaction)
- Data: 10-20 reaction rates for each of ethene and benzene
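To illustrate how a single bond-strength descriptor can set a reaction rate, here is a simplified sketch that combines a BEP-type linear barrier with a two-species Langmuir-Hinshelwood rate. It is not the 12-constant model of the cited work: every constant, the temperature, and the assumption that both adsorption constants grow with bond strength are hypothetical choices made for illustration.

```python
import math

K_B_T = 0.0345  # eV, roughly 400 K (assumed temperature)


def rate(e_mc, e0=1.0, ea0=0.5, alpha=0.5, prefactor=1.0, p_a=1.0, p_b=1.0):
    """Hypothetical Langmuir-Hinshelwood rate vs. bond-strength descriptor (eV)."""
    # Assumption: adsorption equilibrium constants of both reactants
    # increase exponentially with the bond strength.
    k_ads = math.exp((e_mc - e0) / K_B_T)
    # BEP-type linear relation: stronger binding raises the barrier
    # of the surface reaction step.
    e_act = ea0 + alpha * e_mc
    k_surf = prefactor * math.exp(-e_act / K_B_T)
    # Langmuir-Hinshelwood: rate = k * theta_A * theta_B.
    denom = (1.0 + k_ads * p_a + k_ads * p_b) ** 2
    return k_surf * (k_ads * p_a) * (k_ads * p_b) / denom


grid = [0.05 * i for i in range(41)]  # descriptor values from 0.0 to 2.0 eV
rates = [rate(e) for e in grid]
best = max(range(len(grid)), key=lambda i: rates[i])
best_descriptor = grid[best]
```

With these assumptions the rate peaks at an intermediate bond strength: weak binding leaves the surface empty, strong binding makes the surface step too slow.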
9 Ex 2: Catalytic Activity - Hydrogenation
(Toulhoat, et al., 02)
10 Ex 3: Stainless Steel CrxNiyFe(1-x-y)
- Property: High hardness and ductility
- Descriptors → Property relationship:
  - Hardness ∝ shear modulus G
  - Ductility ∝ bulk modulus/shear modulus B/G
- Descriptors: B, G (from ab initio)
- Data: Not clearly defined
11 Hardness vs. Shear Modulus
(Teter, MRS Bulletin, 98)
12 Ex 3: Stainless Steel CrxNiyFe(1-x-y)
(Vitos, et al., Nature Materials, 02)
- Optimal at Cr18Ni24Fe58 (multiple patents)
- Predict improved mechanical properties for Ir, Os
doping
High G (hard) vs. high B/G (ductile): conflict!
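Because high G and high B/G pull in opposite directions, one simple way to screen candidates is to keep only those not beaten on both criteria at once. The sketch below assumes that approach; the alloy labels and moduli are invented placeholders, not computed values.

```python
def pareto_front(alloys):
    """Names of alloys that no other alloy beats in both G and B/G."""
    front = []
    for name, (b, g) in alloys.items():
        ratio = b / g
        dominated = any(
            (g2 >= g and b2 / g2 >= ratio) and (g2 > g or b2 / g2 > ratio)
            for other, (b2, g2) in alloys.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (B, G) pairs in GPa, invented for illustration only.
alloys = {
    "A": (160.0, 80.0),  # hardest (highest G), lowest B/G
    "B": (180.0, 60.0),  # most ductile (highest B/G), lowest G
    "C": (150.0, 70.0),  # beaten by D on both criteria
    "D": (200.0, 75.0),  # compromise candidate
}
front = pareto_front(alloys)
```

The surviving candidates are the trade-off set; choosing among them still requires a judgment about how much hardness to give up for ductility.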
13 Ex 4: Conductivity T7 7xxx Al Alloys
- Property: Electrical conductivity σ
- Descriptors → Property relationship:
  - Linear: σ = V·d (requires only fitting)
  - Neurofuzzy: σ = NF(d) (requires only fitting)
  - Physical: σ = P(d) (requires thermodynamic models of relevant phases, Rayleigh-Maxwell equation for resistivity with dispersed particles, Starink-Zahra equation for precipitation, 1D diffusion equation, Matthiessen's rule for resistivity with dissolved elements)
- Descriptors: Concentrations, ageing time → d = (xZn, xMg, xCu, xZr, xFe, xSi, t)
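As a small illustration of one ingredient of the physical model, Matthiessen's rule treats the resistivity contributions of dissolved elements as additive. The sketch below assumes that form only; the per-element coefficients are invented placeholders, and the pure-Al resistivity is an approximate room-temperature value.

```python
def conductivity(solute_at_pct, rho_matrix, rho_per_at_pct):
    """Conductivity (S/m) with additive solute resistivity (Matthiessen's rule)."""
    rho = rho_matrix + sum(
        rho_per_at_pct[element] * x for element, x in solute_at_pct.items()
    )
    return 1.0 / rho


RHO_AL = 2.65e-8  # ohm*m, approximate value for pure Al near room temperature

# Hypothetical resistivity increase per at.% of dissolved element (ohm*m).
RHO_PER_AT_PCT = {"Zn": 0.2e-8, "Mg": 0.5e-8, "Cu": 0.8e-8}

sigma_pure = conductivity({}, RHO_AL, RHO_PER_AT_PCT)
sigma_alloy = conductivity({"Zn": 2.0, "Mg": 1.0}, RHO_AL, RHO_PER_AT_PCT)
```

Elements that precipitate out of solution during ageing stop contributing to this sum, which is why the physical model also needs precipitation and diffusion equations.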
14 Ex 4: Conductivity T7 7xxx Al Alloys
σ measured for 36 concentration/ageing time samples
R-Model      Fitting Params   RMS Error ()   Cross Validation ()
Linear       7                4.75           5.25
Neurofuzzy   5                1.35           1.525
Physical     6                0.97           1.05
(Starink, et al., 00)
15 Ex 5: Boiling Points
(Quantitative Structure-Property Relationships, QSPR)
- Property: Boiling point TB
- Descriptors → Property relationship: Neural network (10-18-1, sigmoid, backpropagation)
- Descriptors: Electrostatic and structural properties (calculated with semiempirical VAMP AM1)
- Data: TB for 6629 molecules containing elements H, B, C, N, O, F, Al, Si, P, S, Cl, Zn, Ge, Br, Sn, I, Hg
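For readers unfamiliar with this kind of model, here is a minimal sketch of a feed-forward network with one sigmoid hidden layer, trained by backpropagation. It is not the network of the cited study: the layer sizes, learning rate, and synthetic target are arbitrary assumptions chosen to keep the example small.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class FeedForwardNet:
    """One sigmoid hidden layer, linear output, trained by backpropagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def train_step(self, x, target, lr):
        """One gradient step on 0.5 * (output - target)^2; returns that loss."""
        hidden, output = self.forward(x)
        err = output - target
        for j, h in enumerate(hidden):
            # Error propagated back through the sigmoid of hidden unit j
            # (uses the output weight before it is updated).
            grad_h = err * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * err * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
            self.b1[j] -= lr * grad_h
        self.b2 -= lr * err
        return 0.5 * err * err


# Synthetic training data: two descriptors, invented smooth target.
samples = [((a / 3.0, b / 3.0), 1.0 + 0.5 * a / 3.0 + 0.25 * b / 3.0)
           for a in range(4) for b in range(4)]

net = FeedForwardNet(n_in=2, n_hidden=3)
history = []
for epoch in range(300):
    total = sum(net.train_step(x, t, lr=0.1) for x, t in samples)
    history.append(total / len(samples))  # mean loss per epoch
```

A network with more inputs and hidden units is built the same way, for example `FeedForwardNet(n_in=10, n_hidden=18)`; real applications would also hold out a test set, as on the later boiling point slide.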
16 Data Mining: Descriptors → Property Relationships
- Many general approaches
  - Graphical
  - Linear regressions (normal least squares, principal component regression, partial least squares, ...)
  - Neural networks (perceptrons, feed-forward, radial-basis, ...)
  - Clustering (k-means, nearest-neighbor, ...)
- Many choices in each approach
  - Neural networks
    - Number of neurons/layers: 3-4-1
    - Transfer functions: step, sigmoid, tansig, etc.
    - Training method: backpropagation algorithms
- Thousands of possible approaches!
  - Many yield similar results
  - Appropriate for different situations
  - Problem dependent - much art!!
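As one concrete instance of the clustering family listed above, here is a bare-bones k-means sketch (plain Lloyd iterations from user-supplied starting centers). The points are made up, and a practical implementation would add a convergence check and better initialization.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centers, iterations=10):
    """Alternate between assigning points and recomputing cluster centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(cl) if cl else old
                   for cl, old in zip(clusters, centers)]
    return centers, clusters


# Two well-separated groups of invented points in a 2D descriptor space.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```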
17 Descriptors
Charged partial surface area descriptors, Accelrys QSAR module
1. Partial positive surface area (sum of the surface area of positive atoms)
2. Partial negative surface area (sum of the surface area of negative atoms)
3. Total charge weighted positive surface area (descriptor 1 multiplied by the total positive charge)
4. Total charge weighted negative surface area (descriptor 2 multiplied by the total negative charge)
5. Atomic charge weighted positive surface area (sum of SASA × charge for all positive atoms)
6. Atomic charge weighted negative surface area (sum of SASA × charge for all negative atoms)
7. Difference in charged surface areas (descriptor 1 - descriptor 2)
8. Difference in total charge weighted surface areas (descriptor 3 - descriptor 4)
9. Difference in atomic charge weighted surface areas (descriptor 5 - descriptor 6)
10-15. Fractional charged partial surface areas (6 descriptors divided by total surface area)
16-21. Surface weighted charged partial surface areas (6 descriptors multiplied by total surface area)
(http://www.accelrys.com/cerius2/descriptor.html#list)
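The descriptor definitions on this slide translate directly into a few sums. The sketch below follows the wording of the slide only (it may differ in scaling from the actual Accelrys implementation), assumes that the fractional and surface-weighted groups are built from descriptors 1 to 6, and uses an invented three-atom input.

```python
def cpsa_descriptors(atoms):
    """atoms: list of (surface_area, partial_charge) pairs, one per atom."""
    positive = [(sa, q) for sa, q in atoms if q > 0]
    negative = [(sa, q) for sa, q in atoms if q < 0]
    total_sa = sum(sa for sa, _ in atoms)

    d1 = sum(sa for sa, _ in positive)       # partial positive surface area
    d2 = sum(sa for sa, _ in negative)       # partial negative surface area
    d3 = d1 * sum(q for _, q in positive)    # total charge weighted, positive
    d4 = d2 * sum(q for _, q in negative)    # total charge weighted, negative
    d5 = sum(sa * q for sa, q in positive)   # atomic charge weighted, positive
    d6 = sum(sa * q for sa, q in negative)   # atomic charge weighted, negative
    first_six = [d1, d2, d3, d4, d5, d6]

    return {
        "base": first_six,
        "differences": [d1 - d2, d3 - d4, d5 - d6],
        "fractional": [d / total_sa for d in first_six],
        "surface_weighted": [d * total_sa for d in first_six],
    }


# Hypothetical atoms: (surface area, partial charge), invented values.
atoms = [(10.0, 0.2), (20.0, -0.4), (5.0, 0.2)]
result = cpsa_descriptors(atoms)
```

In a real workflow the surface areas and charges would come from the atomistic calculation, for example the semiempirical method used in Ex 5.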
18 Descriptors
- Many broad categories: composition, topological, electronic, physical-chemical properties, ...
- Thousands of possible descriptors
  - Use physical knowledge to choose relevant ones (e.g., QSAR principle)
  - Use numerical methods to choose important descriptors
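One of the simplest numerical ways to pick important descriptors is to rank them by the strength of their linear correlation with the property. The sketch below assumes that approach; the descriptor names and values are invented, and more careful methods would also account for redundancy between descriptors.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def rank_descriptors(table, prop):
    """Descriptor names sorted from strongest to weakest |correlation|."""
    return sorted(table, key=lambda name: -abs(pearson(table[name], prop)))


# Hypothetical descriptor table for five samples (invented numbers).
table = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "shape": [5.0, 1.0, 4.0, 2.0, 3.0],
    "flag": [1.0, 0.0, 1.0, 0.0, 1.0],
}
prop = [2.0, 4.0, 6.0, 8.0, 10.5]
ranking = rank_descriptors(table, prop)
```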
19 Ex 5: Boiling Point Descriptors
(Chalk, et al., J. Chem. Inf. Comput. Sci., 01)
20 Ex 5: Atomistic Modeling Methods
- Use VAMP AM1 and PM3 Hamiltonians
- Semi-empirical molecular orbital based
- Quantum mechanical, but matrix elements are fit to experimental data
- Can calculate optimized geometries, electronic structure (charge properties)
- Fairly accurate (known failings) and fast
21 Ex 5: Boiling Points
Training set (6000): ±17° (max -119°)
Test set (629): ±19° (max -94°)
(Chalk, et al., J. Chem. Inf. Comput. Sci., 01)
- Large errors often due to
  - Incorrect experimental measurements of TB (low pressure)
  - Incorrect experimental structures (tautomer misidentification)
  - Failure of atomistic modeling method (approximation errors)
22 Ex 6: Crystal Structure Prediction
- Property: Stable crystal structure
- Descriptors → Property relationship: Neighbor clustering algorithm (Euclidean metric)
- Descriptors: Chemical scale (empirically assigned value for each element) (Pettifor, J. Phys. C, 86)
- Data: All intermetallic binary alloys (thousands)
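The neighbor-based prediction can be sketched as a nearest-neighbor lookup: each binary compound is a point whose coordinates are the chemical-scale values of its two elements, and a new compound takes the structure of the closest known point. The coordinates below are invented placeholders, not actual chemical-scale values.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def predict_structure(query, known):
    """Structure label of the known compound nearest to the query point."""
    nearest = min(known, key=lambda item: euclidean(query, item[0]))
    return nearest[1]


# Hypothetical map: (scale value of A, scale value of B) -> structure type.
known = [
    ((0.2, 0.9), "NaCl"),
    ((0.3, 0.8), "NaCl"),
    ((0.7, 0.6), "CsCl"),
    ((0.8, 0.7), "CsCl"),
]

guess_1 = predict_structure((0.75, 0.65), known)
guess_2 = predict_structure((0.25, 0.85), known)
```

The prediction is only as good as the coverage of known points near the query, which is the limitation for multicomponent systems.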
23 Structure Maps
CsCl
NaCl
(Rodgers, CRYSTMET, 03)
24 Ex 6: Crystal Structure Prediction
- Powerful: structure maps can give 90-95% predictive accuracy
- Many descriptors: 50 have been tried, based on size, atomic number, cohesive energy, electrochemistry, valence electrons
- Can't be extended: accurate maps require 40% of the possible systems to be known (80% of binaries known, 0.1% of quaternaries)
- Can atomistic modeling help?
  - Fill in data for multicomponent systems
  - Provide optimal descriptors
(Villars, Intermetallic Compounds, 94)
25 Conclusions
- Atomistic modeling and data mining can provide valuable predictive ability when physical theories are incomplete
- Key issues are data quality, descriptors, and descriptor → property relationship
- Dangers of overfitting and tuning
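The danger of tuning can be made concrete with a small simulation: if enough candidate descriptors are tried against a small data set, one of them will correlate well with the property purely by chance. The sketch below assumes random Gaussian descriptors and a random property with no real relationship between them; the sample sizes are arbitrary.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def best_chance_correlation(n_samples, n_descriptors, seed=0):
    """Largest |correlation| found among purely random descriptors."""
    rng = random.Random(seed)
    prop = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(descriptor, prop)))
    return best


# Same random property, 10 samples: try 1 descriptor vs. 1000 descriptors.
one_try = best_chance_correlation(10, 1)
many_tries = best_chance_correlation(10, 1000)
```

The best of many random descriptors looks far more predictive than any single one, which is why held-out test sets and cross validation are needed after descriptor selection.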
26 Bible Code
Are these words closer than by chance? Can the Bible predict future events?
Some say yes (Witztum, et al., Stat. Sci., 94)
Some say no (McKay, et al., Stat. Sci., 99)
- Many articles
- >60 books on Bible Codes on Amazon
- 1 major motion picture (Omega Code)
Be careful with your statistics!
27 The First and Greatest Example of Atomic Level Data Mining
28 END