In drug discovery, artificial intelligence (AI) and machine learning (ML) technologies have significantly changed the way researchers analyze and interpret chemical data. From molecular modeling to predicting drug responses, AI-driven approaches offer opportunities to expedite the drug development process. To unlock the full potential of AI in drug discovery, however, it is imperative to lay a strong foundation of properly structured, high-quality chemical data. In this blog post, we explore key strategies for preparing AI-ready chemical data and maximizing its utility in drug discovery.

Ensuring Data Quality and Organization

Quality is paramount when leveraging chemical data for AI applications. Data accuracy, consistency, and completeness are the cornerstones of effective data preparation. By implementing robust quality control measures, such as data validation and cleaning, researchers can reduce the noise and errors that could compromise the reliability of AI models. Compound depictions may come from a variety of sources, so their quality must be rigorously validated. Automated structure-checking procedures are commonly used to ensure the chemical accuracy of a depiction. Because a depiction is a 2D representation of a complex physical arrangement, small-molecule depictions may contain inherent redundancy or inaccuracy, and rules for standardized representation are needed in such cases. Additionally, organizing data systematically improves accessibility and usability, enabling researchers to efficiently locate and use relevant information for their analyses.
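As a rough illustration of the validation step described above, here is a minimal sketch of record-level checks for completeness, consistency, and accuracy. The field names (compound_id, smiles, activity_nm) and the sample records are hypothetical, and the sketch does not check chemical structures themselves; that would be the job of a dedicated structure checker.

```python
from collections import Counter

# Hypothetical schema; adapt the field names to your own data set.
REQUIRED_FIELDS = ("compound_id", "smiles", "activity_nm")


def is_blank(value):
    """True if a field is missing (None) or contains only whitespace."""
    return value is None or not str(value).strip()


def validate_records(records):
    """Return (record index, problem) tuples for a list of dict records."""
    problems = []
    id_counts = Counter(rec.get("compound_id") for rec in records)
    for i, rec in enumerate(records):
        # Completeness: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if is_blank(rec.get(field)):
                problems.append((i, f"missing {field}"))
        # Consistency: an identifier should point to exactly one record.
        cid = rec.get("compound_id")
        if not is_blank(cid) and id_counts[cid] > 1:
            problems.append((i, f"duplicate compound_id {cid}"))
        # Accuracy: the measured value must parse as a positive number.
        raw = rec.get("activity_nm")
        if not is_blank(raw):
            try:
                if float(raw) <= 0:
                    problems.append((i, "non-positive activity_nm"))
            except (TypeError, ValueError):
                problems.append((i, "non-numeric activity_nm"))
    return problems


# Made-up example records, for illustration only.
records = [
    {"compound_id": "CPD-1", "smiles": "CCO", "activity_nm": "120"},
    {"compound_id": "CPD-2", "smiles": "", "activity_nm": "35.5"},
    {"compound_id": "CPD-1", "smiles": "OCC", "activity_nm": "n/a"},
]
problems = validate_records(records)
for index, problem in problems:
    print(f"record {index}: {problem}")
```

Reporting problems rather than silently dropping records keeps the cleaning step auditable, which matters when the same data set feeds several models.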

FAIR Data and the Representation of Chemical Compounds toward the Preparation of Data for Machine Learning Applications

The requirements of data quality and organization are met if the data meet the FAIR criteria: Findability, Accessibility, Interoperability, and Reusability. (To read more about FAIR, see, for example, worldfair-project.eu and go-fair.org/fair-principles/.) For chemical structures to meet the FAIR criteria, they need formats that are well defined and interoperable between different data storage and analysis systems. The two main formats used for ML applications are the simplified molecular input line-entry system (SMILES) and the IUPAC International Chemical Identifier (InChI). The SMILES format is commonly used in large language models (LLMs) and plays a central role in the related prediction systems. The interoperability of SMILES is limited, however, because a single compound can be represented by multiple SMILES strings, depending on the atom at which the string starts and the order in which the structure is traversed. Canonical SMILES are intended to deliver a unique representation, but the result depends on the underlying canonicalization algorithm. Because these algorithms differ from vendor to vendor, canonical SMILES are vendor-dependent. To overcome this limitation, InChIs can be used instead of SMILES. An InChI is a unique representation of a compound that remains the same no matter how the original chemical structure is drawn. Besides identifying chemical compounds, InChIs are applied to uniquely identify reactions (Reaction-InChI/RInChI), mixtures (Mixture-InChI/MInChI), and nanomaterials (Nano-InChI/NInChI). RInChIs, MInChIs, and NInChIs support the extension of ML/AI into these areas.
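To make the SMILES-versus-InChI point concrete, the following minimal sketch merges records from two hypothetical sources. It assumes the InChI strings were generated beforehand by a cheminformatics toolkit; the source names and record layout are made up for illustration. Ethanol and acetic acid each appear under several valid SMILES spellings, but each has a single InChI that can serve as the grouping key.

```python
from collections import defaultdict

ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETIC_ACID_INCHI = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"

# Hypothetical records; the InChI column is assumed to be precomputed.
records = [
    {"source": "vendor_a", "smiles": "CCO", "inchi": ETHANOL_INCHI},
    {"source": "vendor_b", "smiles": "OCC", "inchi": ETHANOL_INCHI},
    {"source": "vendor_b", "smiles": "C(C)O", "inchi": ETHANOL_INCHI},
    {"source": "vendor_a", "smiles": "CC(=O)O", "inchi": ACETIC_ACID_INCHI},
    {"source": "vendor_b", "smiles": "OC(C)=O", "inchi": ACETIC_ACID_INCHI},
]


def group_by_inchi(records):
    """Map each InChI to the list of SMILES spellings seen for it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["inchi"]].append(rec["smiles"])
    return dict(groups)


groups = group_by_inchi(records)
for inchi, smiles_list in groups.items():
    print(inchi, "<-", smiles_list)
```

Comparing the raw SMILES strings would treat these five records as five different compounds; keying on the InChI collapses them to two.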

The graphic below describes the FAIR data principles and special considerations that apply to working with chemical data.

Findable
General practice: Data and metadata should be easy to find for both humans and computers.
Chemical data considerations: Determine the scope of the data (public and private data sources) and how to integrate them.

Accessible
General practice: Once located, there should be clear rules on how to access the data (authorization, etc.).
Chemical data considerations: Use or develop the right APIs to access the data.

Interoperable
General practice: Data should work seamlessly with other data and tools. This involves standardized formats.
Chemical data considerations: Simplified molecular input line-entry system (SMILES) and IUPAC International Chemical Identifiers (InChIs) are well-suited formats.

Reusable
General practice: Data and metadata must be well described so that they can be used again in future studies or applications.
Chemical data considerations: Strive for standard formats and uniform enhanced stereochemistry rules across the data set.

Figure 1: Tips for managing FAIR chemical data.

Facilitating AI Model Training and Prediction Accuracy

The success of AI-driven drug discovery hinges on the ability to develop accurate and reliable predictive models. Feature engineering plays a pivotal role in this process: it involves identifying informative features in structured chemical data that are suitable for model training. Common features extracted from small-molecule structures include the presence or count of structural features as well as predicted physicochemical properties. By combining domain knowledge with feature selection techniques, researchers can extract meaningful insights that enhance the predictive capabilities of their models. Moreover, model development and optimization entail tuning algorithms and architectures to maximize prediction accuracy while mitigating overfitting. One product for automated model building in ML-driven drug discovery is Trainer Engine by Chemaxon.

Cross-Validation and Evaluation

Validating the performance of AI models is essential to ensure their generalizability and robustness. Cross-validation techniques allow researchers to assess how well their models perform on unseen data, thereby providing insights into their ability to make accurate predictions in real-world scenarios. Utilizing appropriate evaluation metrics, such as accuracy, precision, recall, and F1-score, enables researchers to quantitatively measure the effectiveness of their models and identify areas for improvement.
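The idea can be sketched in plain Python. The example below is a minimal illustration rather than a recommended pipeline: it assumes a binary active/inactive label, a single made-up numeric feature, and a deliberately trivial threshold "model" that stands in for a real learner. All names and data are hypothetical. Each fold is predicted by a model that never saw it, and the pooled out-of-fold predictions are scored with precision, recall, and F1.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def fit_threshold(xs, ys):
    """Toy stand-in for a model: midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up data: one feature value and one active (1) / inactive (0) label each.
x = [0.9, 0.2, 0.8, 0.3, 0.7, 0.1, 0.4, 0.6, 0.85, 0.15, 0.65, 0.45]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

y_pred = [None] * len(x)
for train, test in k_fold_indices(len(x), k=3):
    # Fit on the training fold only, then predict the held-out fold.
    threshold = fit_threshold([x[i] for i in train], [y[i] for i in train])
    for i in test:
        y_pred[i] = 1 if x[i] > threshold else 0

precision, recall, f1 = precision_recall_f1(y, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Contiguous folds are used here only to keep the sketch short. For chemical data, the way compounds are assigned to folds (for example, keeping close analogs together) can strongly affect how realistic the estimate is.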

Transfer Learning and Pre-trained Models

Transfer learning offers a powerful approach to leverage existing knowledge and accelerate model training in drug discovery applications. By fine-tuning pre-trained models on chemical data tasks, researchers can capitalize on transferable knowledge from related domains and expedite the development of predictive models. This approach not only reduces the need for large, annotated datasets but also enhances the adaptability of models to specific prediction objectives.

Continuous Learning and Model Updating

In the dynamic landscape of drug discovery, continuous learning is essential to keep AI models current with the latest testing results and domain insights. Mechanisms for updating models and adapting their training configuration enable researchers to incorporate new data and refine model predictions over time. By monitoring model performance in real-world applications and retraining models periodically, researchers can ensure that their AI systems remain accurate and reliable. It is also essential to preserve the context in which a model was trained and to version models, so that results meet the rigorous reproducibility that science demands.
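One lightweight way to preserve training context is to store a small manifest next to every trained model. The sketch below is an assumption about how this could look, not a description of any particular product: the manifest fields, model name, and version strings are hypothetical. It fingerprints the training data and the configuration with SHA-256 so that a retrained model can be traced back to exactly what it was trained on.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 hex digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_manifest(model_name, version, training_rows, config):
    """Describe one training run; all field names are hypothetical."""
    return {
        "model_name": model_name,
        "version": version,
        "n_training_rows": len(training_rows),
        # Note: the digest is sensitive to row order as well as row content.
        "data_sha256": fingerprint(training_rows),
        "config_sha256": fingerprint(config),
        "config": config,
    }


# Made-up training data and configuration.
config = {"algorithm": "random_forest", "n_trees": 200, "seed": 42}
rows_v1 = [
    {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "active": 0},
    {"inchi": "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)", "active": 1},
]
# New test results arrive; the model is retrained on the extended set.
rows_v2 = rows_v1 + [{"inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", "active": 0}]

manifest_v1 = build_manifest("activity_clf", "1.0.0", rows_v1, config)
manifest_v2 = build_manifest("activity_clf", "1.1.0", rows_v2, config)

print(json.dumps(manifest_v2, indent=2))
```

Comparing the two manifests shows at a glance that the configuration is unchanged while the training data differ, which is the kind of question that comes up when a retrained model starts predicting differently.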

Harnessing the power of AI in drug discovery requires careful attention to data preparation and model development. By ensuring data quality and organization, facilitating AI model training and prediction accuracy, and embracing continuous learning, researchers can unlock the full potential of AI-ready chemical data and accelerate the discovery of novel therapeutics. One collaborative effort in this direction is the MELLODDY project, which was set up to share machine learning algorithms across actively managed commercial databases so that participants can cost-effectively discover potential new therapeutic molecules while keeping proprietary information private.

By adhering to these strategies, researchers can position themselves at the forefront of AI-driven drug discovery, paving the way for advancements in the field.

References

Gerd Blanke, et al., “Reaction InChI (RInChI): Present and Future.” Poster presented at the International Conference on Computational Science.

Kalleid, Inc., “Strategies for Creating AI Ready Chemical Data.” Talk broadcast on 29 April 2021. youtube.com/watch?v=2fJ12P2_0Y8

About the Authors

Gerd Blanke
Gerd Blanke is the founder of StructurePendium Technologies GmbH. StructurePendium Technologies GmbH offers consulting services in the area of chem- and bioinformatics with a major focus on standardization and normalization of chemical structures and reactions for registration and retrieval processes. These services are provided in the context of database mergers, data transfers between different vendors, and data analytics. For more information, visit StructurePendium Technologies GmbH.
Jan Christopherson
Jan Christopherson is a Senior Application Scientist at Chemaxon, a cheminformatics software and solutions provider with products across the small-molecule preclinical space.
Jan provides cheminformatics expertise across the portfolio, assisting users in discovering and mapping applications and best practices across their R&D informatics landscape. For more information, visit chemaxon.com.
Jay Martin
Jay Martin is a principal technical content specialist at Kalleid Consulting. Before he became a technical writer, he supported R&D research in radiochemistry, signal transduction, and protein synthesis. As a technical writer, he has focused on developing useful documentation for genomics scientists. Jay, his wife, dog, and cat live in San Francisco.

About Kalleid

Kalleid, Inc. is a boutique IT consulting firm that has served the scientific community since 2014. We work across the value chain in R&D, clinical and quality areas to deliver support services for software implementations in highly complex, multi-site organizations.

At Kalleid, we understand that people are at the center of any successful business transformation. Providing high quality technical documentation services to support our clients is therefore one of the key aspects of our integrated approach to IT projects. Kalleid has a team of experienced technical and content writers, editors, and instructional designers who can help you develop content (GxP compliant when required) to support your products, processes, and software. If you are interested in exploring how Kalleid documentation services can benefit your organization, please don’t hesitate to contact us today.