July 7, 2023

Researchers from MIT and the MIT-IBM Watson AI Lab have developed a unified machine learning framework that can both predict molecular properties and generate new molecules. The approach needs only a small amount of training data, which could streamline the slow, expensive trial-and-error process of discovering new materials and drugs.

Machine learning models that predict molecular properties, and so help narrow down which molecules to synthesize and test in the lab, are conventionally trained on millions of labeled molecular structures. Training datasets of that size are difficult and costly to obtain, which limits how well these approaches work in practice. In contrast, the system developed by the MIT researchers can make accurate predictions from only a small amount of data.

The system works because it has an underlying understanding of the rules that govern how building blocks combine to form valid molecules. These rules capture the similarities between molecular structures, which helps the system generate new molecules and predict their properties efficiently. The approach outperformed other machine learning methods and remained accurate even with datasets containing fewer than 100 samples.

Lead author Minghao Guo, an EECS graduate student, says the project aims to speed up the discovery of new molecules through data-driven methods, reducing the need for costly experiments. The research team includes members of the MIT-IBM Watson AI Lab and MIT graduates, along with senior author Wojciech Matusik, a professor of electrical engineering and computer science.

To achieve the best results, machine learning models usually need large training datasets of molecules whose properties are similar to those of the molecules researchers hope to discover. In reality, domain-specific datasets are often small. The MIT team took a different approach: they built a machine learning system that learns the “language” of molecules, known as a molecular grammar, from a small, domain-specific dataset. The system then uses this grammar to construct viable molecules and predict their properties. Like the grammar of a language, a molecular grammar is a set of production rules, in this case rules that dictate how atoms and substructures can be combined.
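
To make the idea of production rules concrete, here is a minimal sketch of a grammar that rewrites symbols until only atoms remain. It is a toy illustration, not the researchers' grammar: the rule set `RULES`, the symbol names, and the functions `expand` and `generate` are all hypothetical, and the output is a SMILES-like string with no claim that it describes sensible chemistry.

```python
import random

# Hypothetical toy production rules (not the grammar from the research).
# Keys are nonterminals; each value lists the alternative expansions.
RULES = {
    "<chain>": [["<unit>"], ["<unit>", "<chain>"]],
    "<unit>": [["C"], ["O"], ["C", "(", "<branch>", ")"]],
    "<branch>": [["C"], ["N"]],
}


def expand(symbol, rng, depth=0, max_depth=6):
    """Recursively rewrite a symbol until only terminal characters remain."""
    if symbol not in RULES:
        return symbol  # terminal: emit as-is
    options = RULES[symbol]
    if depth >= max_depth:
        # Past the depth limit, take the shortest expansion so recursion ends.
        options = [min(options, key=len)]
    chosen = rng.choice(options)
    return "".join(expand(s, rng, depth + 1, max_depth) for s in chosen)


def generate(seed):
    """Derive one string from the start symbol using a seeded generator."""
    return expand("<chain>", random.Random(seed))


if __name__ == "__main__":
    for seed in range(5):
        print(generate(seed))
```

Because every string is derived from the rules, each one has balanced branches by construction. That is the sense in which a grammar only produces structurally valid candidates.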

Because there are countless ways to combine atoms and substructures, learning a complete grammar from scratch would be too computationally expensive. To speed up the process, the researchers divided the molecular grammar into two parts. The first part, a general metagrammar, is designed manually and given to the system at the outset. The system then only has to learn a much smaller, molecule-specific grammar from the domain dataset. This hierarchical approach significantly speeds up learning.
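
The article does not describe how the smaller grammar is learned, so the sketch below is only a stand-in that shows why starting from a fixed rule pool shrinks the problem. It assumes a hand-written pool of string fragments (`META_RULES`) in place of a metagrammar, and it uses greedy longest-match tokenization instead of the actual learning procedure. The names `tokenize` and `domain_rules` and the example dataset are hypothetical.

```python
# Hypothetical hand-written pool of candidate building blocks, standing in
# for a manually designed metagrammar. Real rules would act on graphs.
META_RULES = {"C", "O", "N", "S", "C(C)", "C(N)", "C(O)", "c1ccccc1"}


def tokenize(molecule, rules):
    """Split a string into rule fragments by greedy longest match."""
    ordered = sorted(rules, key=len, reverse=True)
    tokens, pos = [], 0
    while pos < len(molecule):
        for rule in ordered:
            if molecule.startswith(rule, pos):
                tokens.append(rule)
                pos += len(rule)
                break
        else:
            # No fragment in the pool covers this position.
            raise ValueError(f"no rule matches {molecule!r} at index {pos}")
    return tokens


def domain_rules(dataset, meta_rules):
    """Keep only the pool entries that the small dataset actually uses."""
    used = set()
    for molecule in dataset:
        used.update(tokenize(molecule, meta_rules))
    return used


if __name__ == "__main__":
    small_dataset = ["CC(O)C", "OCC(O)", "CCO"]
    print(sorted(domain_rules(small_dataset, META_RULES)))
```

In this toy setup, three short examples are enough to narrow the eight-entry pool to the three fragments the domain uses. The search is limited to a fixed set of candidates, not the open-ended space of all possible rules.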

In experiments, the researchers’ system outperformed several popular machine learning approaches, even when given domain-specific datasets with only a few hundred samples. It was particularly effective at predicting physical properties of polymers, such as the glass transition temperature, which is typically expensive to measure experimentally.

The researchers plan to extend their approach by incorporating the 3D geometry of molecules and polymers, and to develop an interface that lets users give feedback to improve the system’s accuracy. They also aim to explore applications of their grammar-based representation beyond chemistry and materials science.

The research findings will be presented at the International Conference on Machine Learning. The system marks a significant step toward faster discovery of new molecules and materials, with potential uses across scientific and industrial applications.





