TopLapGBT: Bridging Gaps in Protein Solubility Prediction with Cutting-Edge Fusion

cover
17 Feb 2024

Authors:

(1) JunJie Wee, Department of Mathematics, Michigan State University;

(2) Jiahui Chen, Department of Mathematical Sciences, University of Arkansas;

(3) Kelin Xia, Division of Mathematical Sciences, School of Physical and Mathematical Sciences Nanyang Technological University & xiakelin@ntu.edu.sg;

(4)Guo-Wei Wei, 1Department of Mathematics, Michigan State University, Department of Biochemistry and Molecular Biology, Michigan State University, Department of Electrical and Computer Engineering, Michigan State University & xiakelin@ntu.edu.sg.

Abstract & Introduction

Results

Discussion

Conclusion

Materials and Methods

Software and resources, Code and Data Availability

Supporting Information, Acknowledgments & References

4 Conclusion

In the multifaceted quest to understand mutation-induced solubility changes, various scientific domains including quantum mechanics, molecular mechanics, biochemistry, biophysics, and molecular biology have made significant contributions. Despite these concerted efforts, state of-art models have limitations, as evidenced by their normalized CPR value of 0.656 even after employing feature selection methods. Persistent homology (PH) has emerged as a powerful tool for capturing the complexity of biomolecular structures and has achieved noteworthy success in drug discovery applications. However, its inability to capture the nuances of homotopic shape evolution, crucial for delineating molecular interactions in proteins, presents a critical shortcoming.

Our study introduces TopLapGBT, a novel model that integrates persistent Laplacian (PL) features with pretrained transformer features, thereby bridging the gap in capturing both topology and homotopic shape evolution. This innovative fusion leads to significant advancements in classification performance. Specifically, TopLapGBT achieves normalized CPR and GC2 scores of 0.688 and 0.361, respectively, marking improvements of 4.88% and 15.71% over the state-of-the-art PON-Sol2. These findings are further corroborated by an independent blind test, where TopLapGBT continues to outperform existing models.

In summary, our proposed TopLapGBT model not only achieves superior performance over existing state-of-the-art methods but also introduces a more nuanced approach for the classification of protein solubility changes upon mutation. These results underscore the transformative potential of integrating geometric and topological features with machine learning in advancing the field of molecular biology.

This paper is available on arxiv under CC 4.0 license.