Using machine learning to differentiate between enzymatic and non-enzymatic metalloprotein sites
Feehan, Ryan
Abstract
Enzymes are biological catalysts with exceptional product specificity and the potential for unmatched speed-up of reaction rates. Although current wet-lab techniques are capable of detecting enzyme activity for over 8,000 biochemical reactions, the experimental discovery of new enzymes continues to be time-consuming and expensive. Computational tools can offer affordable options that enable the discovery of enzymes capable of advancing healthcare, environmental remediation, and industrial processes.

Herein, I describe the development of a computational tool for predicting enzyme activity, including catalytic mechanisms that have not yet been discovered by evolution. I start with a recent literature review to identify the most successful machine learning (ML) methods for enzymatic applications. Using the most popular techniques, decision tree ensemble algorithms and structure-based features, I developed an ML-based tool that makes predictions on metalloprotein sites, a choice that limited differences between positive and negative data that were unrelated to catalysis (Chapter 2). To train and evaluate ML models, I made the largest set of non-redundant, catalytically labeled metalloprotein sites to date. The developed model, MAHOMES, outperformed alternative methods when evaluated on the same metalloproteins, with 90% recall and 92% precision. The most important features for making predictions were related to electrostatics and pocket lining. Further benchmarking found that MAHOMES required sub-angstrom precision.

To enable the same performance on computationally generated structures, improvements targeting the required sub-angstrom precision were made to create MAHOMES II (Chapter 3). The updated model had 92% recall, 94% precision, and increased predictive convergence when given different structural inputs of the same site. On a test of computationally generated sites with no evolutionary relationship to any previously used metalloproteins, MAHOMES II's performance was 90 to 97.5%, depending on the generated site's confidence. The correctly predicted enzyme sites included 39 biochemical reactions that were not included in the training data. By deploying MAHOMES II on a webserver, I anticipate that this work will catalyze the community's ability to find novel enzymes and improve enzyme design success rates.
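
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how recall and precision are computed for a binary enzymatic versus non-enzymatic classifier. The function name `precision_recall` and the toy labels are hypothetical illustrations, not code or data from the dissertation.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 = enzymatic site."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # enzymatic, predicted enzymatic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-enzymatic, predicted enzymatic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # enzymatic, predicted non-enzymatic
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels (assumption for illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Recall is the fraction of true enzymatic sites the model finds, while precision is the fraction of predicted enzymatic sites that are truly enzymatic.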
Date
2023-05-31
Publisher
University of Kansas
Keywords
Bioinformatics, Computational chemistry, Enzymes, Machine Learning, Metalloenzymes, Metalloproteins