In drug discovery, artificial intelligence (AI) and machine learning (ML) technologies have significantly changed the way researchers analyze and interpret chemical data. From molecular modeling to predicting drug responses, AI-driven approaches offer opportunities to expedite the drug development process. To unlock the full potential of AI in drug discovery, however, it is imperative to lay a strong foundation of properly structured, high-quality chemical data. In this blog post, we explore key strategies for preparing AI-ready chemical data and maximizing its utility in drug discovery.
Ensuring Data Quality and Organization
Quality is paramount when leveraging chemical data for AI applications. Data accuracy, consistency, and completeness are the cornerstones of effective data preparation. By implementing robust quality control measures, such as data validation and cleaning, researchers can reduce the noise and errors that could compromise the reliability of AI models. Compound depictions may come from a variety of sources, so their quality must be rigorously validated. Automated structure checking is commonly used to ensure that a depiction is chemically accurate. Because a small molecule depiction is a 2D representation of a complex physical arrangement, the same compound can be drawn in more than one way, and a drawing can be ambiguous or inaccurate. Rules for standardized representations are needed in such cases. Additionally, organizing data systematically makes it accessible and usable, so researchers can efficiently locate the information relevant to their analyses.
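As a minimal sketch of the validation and cleaning step described above, the following Python example filters a small set of compound records using only the standard library. The record layout (`compound_id`, `smiles`, `pic50`), the plausibility range, and the sample records are illustrative assumptions. The bracket check is only a crude syntactic pre-filter; it is not a substitute for the automated structure checkers that a cheminformatics toolkit provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompoundRecord:
    # Hypothetical record layout; real datasets will carry more fields.
    compound_id: str
    smiles: str
    pic50: Optional[float]


def brackets_balanced(smiles: str) -> bool:
    """Crude syntactic check: every '(' and '[' must be closed in order."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def clean_records(records):
    """Split records into (kept, rejected), with a reason for each rejection."""
    kept, rejected, seen_ids = [], [], set()
    for rec in records:
        if not rec.compound_id or not rec.smiles.strip():
            rejected.append((rec, "missing identifier or structure"))
        elif rec.compound_id in seen_ids:
            rejected.append((rec, "duplicate identifier"))
        elif not brackets_balanced(rec.smiles):
            rejected.append((rec, "unbalanced brackets in SMILES"))
        elif rec.pic50 is None or not (0.0 <= rec.pic50 <= 15.0):
            # The 0-15 range is an assumed plausibility window, not a standard.
            rejected.append((rec, "missing or implausible activity value"))
        else:
            seen_ids.add(rec.compound_id)
            kept.append(rec)
    return kept, rejected


# Made-up sample records that exercise each rule.
raw = [
    CompoundRecord("C-001", "CC(=O)Oc1ccccc1C(=O)O", 5.2),
    CompoundRecord("C-001", "CC(=O)Oc1ccccc1C(=O)O", 5.2),  # duplicate
    CompoundRecord("C-002", "CC(=O Oc1ccccc1", 6.1),         # broken SMILES
    CompoundRecord("C-003", "CCO", None),                    # no measurement
    CompoundRecord("C-004", "c1ccccc1", 4.8),
]
kept, rejected = clean_records(raw)
for rec, reason in rejected:
    print(f"{rec.compound_id}: {reason}")
```

Recording a reason for every rejected record, rather than silently dropping it, keeps the cleaning step auditable and makes it easier to trace a quality problem back to its source.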
FAIR Data and the Representation of Chemical Compounds toward the Preparation of Data for Machine Learning Applications
The requirements of data quality and organization are met if the data meet the FAIR criteria: Findability, Accessibility, Interoperability, and Reusability. (To read more about FAIR, see, for example, worldfair-project.eu and go-fair.org/fair-principles/.) For chemical structures to meet FAIR criteria, they must be stored in formats that are well defined and interoperable between different data storage and analysis systems. The two main formats used for ML applications are the simplified molecular input line-entry system (SMILES) and the IUPAC International Chemical Identifier (InChI). The SMILES format is commonly used in large language models (LLMs) and plays a central role in the related prediction systems. The interoperability of SMILES is limited by the fact that each compound may be represented by multiple SMILES strings, depending on the atom at which the string starts and the order in which the structure is traversed. Canonical SMILES are intended to deliver a unique representation, but the result depends on the underlying canonicalization algorithm. These algorithms differ from vendor to vendor, resulting in vendor-dependent canonical SMILES. To overcome this limitation, InChIs can be used instead of SMILES. An InChI is a unique representation of a compound that remains the same no matter how the original chemical structure is drawn. Besides identifying chemical compounds, InChIs are applied to uniquely identify reactions (Reaction-InChI/RInChI), mixtures (Mixture-InChI/MInChI), and nanomaterials (Nano-InChI/NInChI). RInChIs, MInChIs, and NInChIs ensure the extensibility of ML/AI into these areas.
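The difference between the two identifiers can be illustrated with a short sketch. It assumes the open-source RDKit toolkit built with InChI support; the choice of toolkit and of ethanol as the example molecule are assumptions made for illustration, not something this post prescribes.

```python
from rdkit import Chem

# Three valid SMILES strings for the same molecule (ethanol), written from
# different starting atoms.
variants = ["CCO", "OCC", "C(O)C"]
mols = [Chem.MolFromSmiles(s) for s in variants]

# Canonical SMILES: unique within RDKit, but another toolkit may emit a
# different canonical string for the same molecule.
canonical = {Chem.MolToSmiles(m) for m in mols}

# The standard InChI and its hashed InChIKey do not depend on how the
# structure was written.
inchis = {Chem.MolToInchi(m) for m in mols}
inchikeys = {Chem.MolToInchiKey(m) for m in mols}

print(canonical)
print(inchis)
print(inchikeys)
```

Each of the three sets collapses to a single element, which is what makes either identifier usable as a deduplication key inside one system. For exchanging data between systems that use different toolkits, the InChI or InChIKey is the safer key.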
The graphic below describes the FAIR data principles and special considerations that apply to working with chemical data.
Findable
General Practice: Data and metadata should be easy to find for both humans and computers.
Chemical Data Considerations: Determine the scope of the data (public and private data sources) and how to integrate them.
Accessible
General Practice: Once located, there should be clear rules on how to access the data (authorization, etc.).
Chemical Data Considerations: Use or develop the right APIs to access the data.
Interoperable
General Practice: Data should work seamlessly with other data and tools. This involves standardized formats.
Chemical Data Considerations: Simplified molecular input line-entry system (SMILES) and IUPAC International Chemical Identifiers (InChIs) are well-suited formats.
Reusable
General Practice: Data and metadata must be well described so that they can be used again in future studies or applications.
Chemical Data Considerations: Strive for standard formats and uniform enhanced stereochemistry rules across the data set.
Facilitating AI Model Training and Prediction Accuracy
The success of AI-driven drug discovery hinges on the ability to develop accurate and reliable predictive models. Feature engineering plays a pivotal role in this process, as it involves identifying informative features from structured chemical data that are conducive to model training. Common features extracted from small molecule structures include structural feature presence or appearance count as well as predicted physicochemical properties. By leveraging domain knowledge and advanced feature selection techniques, researchers can extract meaningful insights that enhance the predictive capabilities of their models. Moreover, model development and optimization entail fine-tuning algorithms and architectures to maximize prediction accuracy while mitigating overfitting. One product for automated model building in ML-driven drug discovery is Trainer Engine by Chemaxon.
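A minimal sketch of the two feature families mentioned above (structural feature counts and predicted physicochemical properties) might look like the following. It assumes the open-source RDKit toolkit; the SMARTS patterns, the feature names, and the choice of aspirin as input are illustrative assumptions, and this is not a description of how Trainer Engine computes its features.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors

# Hypothetical substructure patterns (SMARTS), chosen only for illustration.
PATTERNS = {
    "carbonyl_count": Chem.MolFromSmarts("[CX3]=[OX1]"),
    "hydroxyl_count": Chem.MolFromSmarts("[OX2H]"),
}


def featurize(smiles: str) -> dict:
    """Return structural counts and estimated properties for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Structural features: how often each pattern occurs in the molecule.
    features = {
        name: len(mol.GetSubstructMatches(pattern))
        for name, pattern in PATTERNS.items()
    }
    features["aromatic_rings"] = rdMolDescriptors.CalcNumAromaticRings(mol)
    # Physicochemical properties: calculated or estimated, not measured.
    features["mol_weight"] = Descriptors.MolWt(mol)
    features["logp_estimate"] = Crippen.MolLogP(mol)
    features["tpsa"] = rdMolDescriptors.CalcTPSA(mol)
    return features


aspirin = featurize("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin)
```

Returning a plain dictionary per molecule keeps the features easy to inspect and to assemble into a table for whichever modeling library is used downstream.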
Cross-Validation and Evaluation
Validating the performance of AI models is essential to ensure their generalizability and robustness. Cross-validation techniques allow researchers to assess how well their models perform on unseen data, thereby providing insights into their ability to make accurate predictions in real-world scenarios. Utilizing appropriate evaluation metrics, such as accuracy, precision, recall, and F1-score, enables researchers to quantitatively measure the effectiveness of their models and identify areas for improvement.
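The following sketch shows k-fold cross-validation and the four metrics named above, using only the Python standard library. The labels are made up, and the "model" is a stand-in that predicts the majority class of each training fold; both are assumptions chosen to keep the example self-contained, so the printed score says nothing about a real model.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def classification_metrics(y_true, y_pred) -> dict:
    """Accuracy, precision, recall, and F1 for binary labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up activity labels (1 = active, 0 = inactive).
labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
fold_accuracy = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    # Stand-in "model": predict the majority class seen in the training fold.
    train_labels = [labels[i] for i in train_idx]
    majority = 1 if sum(train_labels) * 2 >= len(train_labels) else 0
    truth = [labels[i] for i in test_idx]
    preds = [majority] * len(test_idx)
    fold_accuracy.append(classification_metrics(truth, preds)["accuracy"])

print(f"mean accuracy over folds: {sum(fold_accuracy) / len(fold_accuracy):.2f}")
```

Because every sample lands in exactly one test fold, the averaged score reflects performance on data the model did not see during training for that fold.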
Transfer Learning and Pre-trained Models
Transfer learning offers a powerful approach to leverage existing knowledge and accelerate model training in drug discovery applications. By fine-tuning pre-trained models on chemical data tasks, researchers can capitalize on transferable knowledge from related domains and expedite the development of predictive models. This approach not only reduces the need for large, annotated datasets but also enhances the adaptability of models to specific prediction objectives.
Continuous Learning and Model Updating
In the dynamic landscape of drug discovery, continuous learning is essential to keep AI models abreast of the latest testing results and domain insights. Implementing mechanisms to update models and adapt their training configuration enables researchers to incorporate new data and refine model predictions over time. By monitoring model performance in real-world applications and retraining models periodically, researchers can ensure that their AI systems remain accurate and reliable. It is also essential to preserve the context in which a model was trained and to version models, so that results can be reproduced with the rigor that science demands.
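One way to preserve training context is to store a small, immutable record alongside each trained model. The sketch below uses only the Python standard library; the `ModelVersion` fields, the model name, and the hyperparameter and metric values are hypothetical placeholders, and a production setup would more likely rely on a dedicated model registry.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def fingerprint_training_data(rows) -> str:
    """Order-independent SHA-256 digest of the training rows."""
    digest = hashlib.sha256()
    for row in sorted(json.dumps(r, sort_keys=True) for r in rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


@dataclass(frozen=True)
class ModelVersion:
    # All field names are hypothetical; adapt them to your own registry.
    model_name: str
    version: int
    data_sha256: str
    hyperparameters: dict
    metrics: dict
    trained_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder data and values, for illustration only.
rows_v1 = [{"smiles": "CCO", "pic50": 4.1}, {"smiles": "c1ccccc1", "pic50": 4.8}]
rows_v2 = rows_v1 + [{"smiles": "CC(=O)O", "pic50": 3.9}]

v1 = ModelVersion(
    "activity-demo", 1, fingerprint_training_data(rows_v1),
    {"n_estimators": 200}, {"f1": 0.71},
)
v2 = ModelVersion(
    "activity-demo", 2, fingerprint_training_data(rows_v2),
    {"n_estimators": 200}, {"f1": 0.74},
)

# A changed digest shows that the retrained model saw different data.
print(v1.data_sha256 != v2.data_sha256)
print(v2.to_json())
```

Hashing the training rows, rather than storing only a file name, makes it possible to confirm later that a model is being compared or reproduced against exactly the data it was trained on.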
Harnessing the power of AI in drug discovery requires careful attention to data preparation and model development. By ensuring data quality and organization, facilitating AI model training and prediction accuracy, and embracing continuous learning principles, researchers can unlock the full potential of AI-ready chemical data in accelerating the discovery of novel therapeutics. In a collaborative effort to keep models freshly trained, a project is underway to share machine learning algorithms across actively managed commercial databases, in order to cost-effectively discover potential new therapeutic molecules while keeping proprietary information private. See the MELLODDY project.
By adhering to these strategies, researchers can position themselves at the forefront of AI-driven drug discovery, paving the way for advancements in the field.
References
Gerd Blanke, et al., “Reaction InChI (RInChI): Present and Future.” Poster presented at the International Conference on Computational Science.
Kalleid, Inc., “Strategies for Creating AI Ready Chemical Data.” Talk broadcast on 29 April 2021. youtube.com/watch?v=2fJ12P2_0Y8
About Kalleid
Kalleid, Inc. is a boutique IT consulting firm that has served the scientific community since 2014. We work across the value chain in R&D, clinical and quality areas to deliver support services for software implementations in highly complex, multi-site organizations.
At Kalleid, we understand that people are at the center of any successful business transformation. Providing high quality technical documentation services to support our clients is therefore one of the key aspects of our integrated approach to IT projects. Kalleid has a team of experienced technical and content writers, editors, and instructional designers who can help you develop content (GxP compliant when required) to support your products, processes, and software. If you are interested in exploring how Kalleid documentation services can benefit your organization, please don’t hesitate to contact us today.