Machine Learning in Material Science

Shreya Gupta
PhD Candidate, Graduate Program in Operations Research and Industrial Engineering,
Department of Mechanical Engineering, University of Texas at Austin

A little over a month ago I was connected with Dr. Yuanyue Liu, an Assistant Professor in the Department of Mechanical Engineering at The University of Texas at Austin. His research focuses on fundamental and technological problems in material science related to electronics, optoelectronics, energy conversion, and energy storage (e.g., transistors, solar cells, batteries/supercapacitors, electro/photoelectro-catalysis), as well as emerging materials like 2D materials and topological materials. His most recognized work involves the development of a one-step, scalable approach for producing and patterning porous graphene films with three-dimensional networks from commercial polymer films using a CO2 infrared laser (Lin et al. 2014), and the discovery that oxygen on the Cu surface substantially decreases graphene nucleation density by passivating Cu surface active sites (Hao et al. 2013). We were connected because Dr. Liu needed some assistance with feature ranking and engineering, i.e., ranking predictor variables by their impact on the output variable (feature ranking) and creating new variables from the existing ones that could better predict the output variable (feature engineering). Dr. Liu shared many papers with me on how machine learning (ML) is being employed in material science, and I was encouraged to dedicate a short article to this new area of application for ML. So here it goes!

Almost all the papers I read described empirical testing as very costly and time consuming (Faber et al. 2016; Li et al. 2017), even though it is the reason behind the discovery of all industrial catalysts known today (Nørskov et al. 2011). Thus, ML algorithms for predicting molecular properties are being increasingly explored and are helping material science progress at a faster rate than in the past. For example, Ramprasad et al. (2017) cite pioneering applications of machine learning in predicting the band gap of insulators, classifying elements into sp-block and transition metals, and identifying correlations and analytical relationships between the breakdown field and other easily accessible material properties such as the band gap and the phonon cutoff frequency. And the list goes on. Faber et al. (2016) highlight that even first-principles methods such as density functional theory (DFT) struggle to computationally predict the existence and basic properties of crystals composed of only the main-group elements (columns I to VIII of the periodic table), as these elements alone lead to approximately 2 × 10^6 possible elpasolite crystals that could potentially be made. However, ML models are being developed with accuracies close to those of DFT that take only milliseconds to compute (Montavon et al. 2013; Rupp et al. 2012). Of course, the datasets representing material properties are also small because it is hard to harness data in this field (Ramprasad et al. 2017). Additionally, ML models overcome the trade-off between the versatility of quantum mechanical models and the relative simplicity of semi-empirical force fields. Quantum mechanical models can in theory be used to study any material, as they are governed by analytical differential equations, but these equations are very complex; in contrast, semi-empirical force fields, based on a combination of experimental data and electronic structure calculations on small molecules (Shell 2012), are several orders of magnitude faster but not as versatile: they do not perform well on materials for which they were not originally parameterized. As ML models are both fast and transferable (Ramprasad et al. 2017), they are gaining popularity amongst material scientists (Faber et al. 2016).
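To make the surrogate-model idea concrete, here is a minimal sketch of kernel ridge regression standing in for an expensive first-principles calculation. Everything below is a toy: the descriptors and the "energy" target are synthetic placeholders, not the Coulomb-matrix representation of Rupp et al. (2012).

```python
# Toy sketch: kernel ridge regression as a fast surrogate for an
# expensive first-principles (e.g., DFT) energy calculation.
# Descriptors and target are synthetic placeholders.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))            # 500 "materials", 30 descriptor values each
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] ** 2  # stand-in for a computed formation energy

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An RBF kernel captures non-linear descriptor-property relationships;
# alpha controls the ridge regularization strength.
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
model.fit(X_train, y_train)

# Once trained, predictions take milliseconds per structure.
print("test MAE:", np.mean(np.abs(model.predict(X_test) - y_test)))
```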

Domain experts and their ML collaborators are also focusing on feature engineering. There is an increasing emphasis on the billions of linear and non-linear compound descriptors that could be engineered using algebraic combinations and mathematical functions (Ramprasad et al. 2017). This immediately takes us into the space of feature ranking. I saw the least absolute shrinkage and selection operator (LASSO) and kernel ridge regression being used widely; in fact, kernel ridge regression appeared in many of the models I read about because it works well for capturing non-linear relationships.
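As a toy illustration of that workflow (my own sketch, not code from any of the cited papers), one can expand a few primary features into algebraic combinations and let LASSO rank the candidates; the feature names below are hypothetical.

```python
# Toy sketch: expand primary features into algebraic combinations,
# then use LASSO to rank the candidate descriptors. Feature names
# ("radius", etc.) are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.uniform(0.5, 2.0, size=(200, 3)),
                  columns=["radius", "electronegativity", "valence"])
# Synthetic target that truly depends on a ratio of two features.
y = 2.0 * df["radius"] / df["electronegativity"] + rng.normal(scale=0.1, size=200)

# Engineer candidate descriptors: pairwise products, ratios, and logs.
candidates = df.copy()
cols = list(df.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        candidates[f"{a}*{b}"] = df[a] * df[b]
        candidates[f"{a}/{b}"] = df[a] / df[b]
for a in cols:
    candidates[f"log({a})"] = np.log(df[a])

# LASSO shrinks uninformative coefficients to zero; the survivors
# (largest absolute coefficients) are the top-ranked descriptors.
X = StandardScaler().fit_transform(candidates)
lasso = LassoCV(cv=5).fit(X, y)
ranking = sorted(zip(candidates.columns, np.abs(lasso.coef_)),
                 key=lambda t: -t[1])
for name, weight in ranking[:5]:
    print(f"{name:30s} |coef| = {weight:.3f}")
```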

An important aspect of feature selection and engineering that Ramprasad et al. (2017) discuss is the need for features to be invariant to certain transformations (e.g., spatial rotations and rigid translations). One model I found interesting was the kernel ridge regression-based ML model developed by Faber et al. (2016) for modeling "the energy difference between the crystal energy and the sum of static, atom-type dependent, averaged atomic energy contributions, obtained through the fitting of each atomic species in all main group elements up to" bismuth (Bi). They built and employed ML models of formation energies to investigate all possible elpasolites made up of main-group elements. In their paper they present numerical results for approximately 2 × 10^6 formation energies, which, as discussed earlier, would certainly have been very challenging with first-principles methods like DFT. Li et al. (2017) discuss exploiting a priori estimates of the chemical reactivity of surface metal atoms, given the hierarchical complexities of catalyst design. They build an artificial neural network-based chemisorption model (see Figure 1) that captures complex, non-linear adsorbate/substrate interactions and thus facilitates exploration of a large number of catalytic materials.

Figure 1. Figure adapted from Li et al. (2017). The authors present a schematic of their neural network "accelerated catalyst design approach."
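A minimal sketch of what such a chemisorption model might look like follows, using scikit-learn's MLPRegressor on synthetic data; the descriptors named in the comments (e.g., d-band center) are my illustrative assumptions, not the exact inputs Li et al. (2017) used.

```python
# Toy sketch of a neural-network chemisorption model: map descriptors
# of a metal surface site to an adsorption energy. The descriptors
# and data are synthetic, not those of Li et al. (2017).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
# Hypothetical inputs: e.g., d-band center, d-band width, coordination
# number, electronegativity of the site.
X = rng.normal(size=(300, 4))
y = -1.5 + 0.8 * X[:, 0] - 0.3 * X[:, 1] * X[:, 2] + rng.normal(scale=0.05, size=300)

# A single small hidden layer is enough to capture the non-linear
# adsorbate/substrate interaction in this toy setting.
net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   max_iter=5000, random_state=0)
scores = cross_val_score(net, X, y, cv=5, scoring="neg_mean_absolute_error")
print("5-fold CV MAE:", -scores.mean())
```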

Sendek et al. (2016) discuss the difficulty of creating features; eventually, with about 20 features in hand, they built a logistic regression (LR) model to classify superionic materials based on ionic conductivity. They were careful to include negative examples (i.e., they purposely added many poor conductors to the data used for training and testing), as suggested by Raccuglia et al. (2016). In addition, since they had a very small dataset and a simple LR model, they ran LR models with all possible combinations of features and selected the best among them. This quickly led to C(20,1) + C(20,2) + … + C(20,20) = 2^20 − 1 = 1,048,575 models being tested. Finally, they chose the model with the lowest misclassification rate, considering both the training misclassification rate between predicted and observed labels and the cross-validated misclassification rate under leave-one-out cross-validation (LOOCV).
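The exhaustive search is straightforward to sketch. The toy version below uses 5 synthetic features (2^5 − 1 = 31 subsets) so it runs in seconds, but the loop is structurally the same as the authors' 20-feature search; the data and labels are placeholders.

```python
# Toy sketch of the exhaustive feature-subset search: fit a logistic
# regression for every non-empty subset of features and keep the one
# with the lowest LOOCV misclassification rate. Data are synthetic.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))                   # 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # stand-in "superionic" label

best_err, best_subset = 1.0, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        acc = cross_val_score(LogisticRegression(), X[:, list(subset)], y,
                              cv=LeaveOneOut()).mean()
        if 1.0 - acc < best_err:
            best_err, best_subset = 1.0 - acc, subset

print(f"best subset: {best_subset}, LOOCV misclassification rate: {best_err:.3f}")
```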

In conclusion, I noticed that kernel ridge regression and LOOCV were popular modeling approaches (the latter owing to the scarcity of data). Ramprasad et al. (2017) provide a survey of many more applications of classification, clustering, regression, etc., in the material science community. They also highlight that future work can focus on building adaptive models that can ingest new data points and update themselves easily, while still producing strong predictions for cases that differ from all prior information (this is the canonical bias-variance trade-off in machine learning). They also note the need for uncertainty quantification, the importance of elucidating uncertainty in predictions, and the scope for inverse modeling. Zhang and Ling (2018) also nicely discuss many ways of employing ML in material science, exploring multiple methods such as LASSO and gradient boosted trees (see Figures 2 and 3, and the toy sketch after them). One of the main hurdles facing the material science community when employing ML is insufficient data (Ramprasad et al. 2017; Zhang and Ling 2018): generating data is time consuming and expensive, and material science datasets are often very wide (many features relative to the number of samples). Nevertheless, there is a lot of scope for building ML models that can help material science research progress at a faster and cheaper rate.

Figure 2. LASSO root mean square error results from 5-fold cross-validation by Zhang and Ling (2018)
Figure 3. Root mean square errors from a gradient boosted model implemented by Zhang and Ling (2018)
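For readers who want to experiment, here is a small sketch in the spirit of Figures 2 and 3, comparing the 5-fold cross-validated RMSE of LASSO against gradient boosted trees; the data are synthetic stand-ins, not Zhang and Ling's dataset.

```python
# Toy sketch in the spirit of Figures 2 and 3: compare 5-fold
# cross-validated RMSE of LASSO and gradient boosted trees on a
# small, wide synthetic dataset (not Zhang and Ling's data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 40))  # 150 samples, 40 features: small and wide
y = X[:, 0] ** 2 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=150)

for name, model in [("LASSO", LassoCV(cv=5)),
                    ("gradient boosting", GradientBoostingRegressor(random_state=0))]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name:18s} 5-fold CV RMSE = {np.sqrt(mse).mean():.3f}")
```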

References

 Faber, F.A., Lindmaa, A., von Lilienfeld, O.A., & Armiento, R. (2016). Machine learning energies of 2 million elpasolite (ABC2D6) crystals. Phys. Rev. Lett., 117(13), 135502, https://doi.org/10.1103/PhysRevLett.117.135502
 Hao, Y., Bharathi, M.S., Wang, L., Liu, Y., Chen, H., Nie, S., Wang, X., Chou, H., Tan, C., Fallahazad, B., Ramanarayan, H., Magnuson, C.W., Tutuc, E., Yakobson, B.I., McCarty, K.F., Zhang, Y.-W., Kim, P., Hone, J., Colombo, L., & Ruoff, R.S. (2013). The role of surface oxygen in the growth of large single-crystal graphene on copper. Science, 342(6159), 720-723, https://doi.org/10.1126/science.1243879
 Li, Z., Ma, X., & Xin, H. (2017). Feature engineering of machine-learning chemisorption models for catalyst design. Catalysis Today, 280:232-238, https://doi.org/10.1016/j.cattod.2016.04.013
 Lin, J., Peng, Z., Liu, Y., Ruiz-Zepeda, F., Ye, R., Samuel, E.L.G., Yacaman, M.J., Yakobson, B.I., & Tour, J.M. (2014). Laser-induced porous graphene films from commercial polymers. Nature Communications, 5, 5714, https://doi.org/10.1038/ncomms6714
 Montavon, G., Rupp, M., Gobre, V., Vazquez-Mayagoitia, A., Hansen, K., Tkatchenko, A., Müller, K.-R., & von Lilienfeld, O.A. (2013). Machine learning of molecular electronic properties in chemical compound space. New Journal of Physics, 15(9), 095003, https://doi.org/10.1088/1367-2630/15/9/095003
 Nørskov, J., Abild-Pedersen, F., Studt, F., & Bligaard, T. (2011). Density functional theory in surface chemistry and catalysis. Proceedings of the National Academy of Sciences of the United States of America, 108(3), 937-943, https://doi.org/10.1073/pnas.1006652108
 Raccuglia, P., Elbert, K.C., Adler, P.D.F., Falk, C., Wenny, M.B., Mollo, A., Zeller, M., Friedler, S.A., Schrier, J., & Norquist, A.J. (2016). Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601), 73-76, https://doi.org/10.1038/nature17439
 Ramprasad, R., Batra, R., Pilania, G., Mannodi-Kanakkithodi, A., & Kim, C. (2017). Machine learning in materials informatics: recent applications and prospects. npj Computational Materials, 3(54), https://doi.org/10.1038/s41524-017-0056-5
 Rupp, M., Tkatchenko, A., Müller, K.-R., & von Lilienfeld, O.A. (2012). Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108(5), 058301, https://doi.org/10.1103/PhysRevLett.108.058301
 Sendek, A.D., Yang, Q., Cubuk, E.D., Duerloo, K.-A.N., Cui, Y., & Reed, E.J. (2016). Holistic computational structure screening of more than 12 000 candidates for solid lithium-ion conductor materials. Energy Environ. Sci., 10(1), 306-320, https://doi.org/10.1039/c6ee02697d
 Shell, M.S. (2012). Classical semi-empirical force fields. Lecture notes for Principles of modern molecular simulation methods, University of California Santa Barbara. https://engineering.ucsb.edu/~shell/che210d/Classical_force_fields.pdf
 Zhang, Y. & Ling, C. (2018). A strategy to apply machine learning to small datasets in materials science. npj Computational Materials, 4(25), https://doi.org/10.1038/s41524-018-0081-z