Date of Publication
8-2024
Document Type
Dissertation/Thesis
Degree Name
Bachelor of Science (Honors) in Computer Science and Master of Science in Computer Science
College
College of Computer Studies
Department/Unit
Software Technology
Thesis Advisor
Anish M.S. Shrestha
Defense Panel Chair
Charibeth K. Cheng
Defense Panel Member
Charibeth K. Cheng
Llewelyn S. Moron-Espiritu
Ann Franchesca B. Laguna
Abstract (English)
The increased application of phages in biotechnological settings has driven the development of computational approaches for predicting phage-host interaction. Most existing models consider entire proteomes and rely on manual feature engineering, which makes it difficult to select the most informative sequence properties to serve as input to the model. In our previous work, we sought to address this by exploring the use of sequence-only protein language models to produce embeddings of phages' receptor-binding proteins. While this approach improved on handcrafted sequence properties, sequence-only embeddings do not directly capture protein structure information or structure-informed signals related to host specificity. In this study, we extend our previous work and present PHIStruct, a multilayer perceptron that takes as input structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and predicts the host from among the ESKAPEE genera. Our experiments show that PHIStruct is able to make high-confidence predictions without a significant precision-recall trade-off. It also outperforms state-of-the-art tools that take in sequence-only protein embeddings and handcrafted sequence properties, as well as BLASTp, especially as the sequence similarity between the training and test set entries decreases. When the sequence similarity drops below 40% and the confidence threshold is set above 50%, PHIStruct shows a 7% to 9% increase in F1 over machine learning tools that do not directly incorporate structure information and a 5% to 6% increase over BLASTp. These results highlight PHIStruct's utility in use cases where phages of interest have receptor-binding proteins with low sequence similarity to those of known phages.
The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/phage-host-prediction and at https://github.com/bioinfodlsu/PHIStruct.
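As an illustration of the prediction step described in the abstract, the following is a minimal sketch, not the PHIStruct implementation, of how a multilayer perceptron's class probabilities over the ESKAPEE genera could be combined with a confidence threshold so that low-confidence inputs receive no prediction. The function names, the layer layout, and the toy weights are hypothetical; the embedding is assumed to have been computed elsewhere (for example, by SaProt), and the 0.5 default only mirrors the "above 50%" threshold mentioned above.

```python
import math

# ESKAPEE genera: the label set named in the abstract.
GENERA = [
    "Enterococcus", "Staphylococcus", "Klebsiella", "Acinetobacter",
    "Pseudomonas", "Enterobacter", "Escherichia",
]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def mlp_probabilities(embedding, layers):
    """Forward pass: ReLU on hidden layers, softmax on the output layer.

    `layers` is a hypothetical list of (weights, bias) pairs; a real model
    would load trained parameters instead.
    """
    hidden = embedding
    for weights, bias in layers[:-1]:
        hidden = [max(0.0, v) for v in dense(hidden, weights, bias)]
    weights, bias = layers[-1]
    return softmax(dense(hidden, weights, bias))


def predict_genus(probs, threshold=0.5):
    """Return the top genus only if its probability exceeds the threshold.

    Returns None (abstain) when the model is not confident enough.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    return GENERA[best] if probs[best] > threshold else None


# Toy usage: a 2-dimensional stand-in embedding and hand-picked weights.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # hidden layer (identity)
    ([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [0.0, 0.0],
      [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0] * 7),  # output layer
]
toy_probs = mlp_probabilities([2.0, 0.0], toy_layers)
toy_prediction = predict_genus(toy_probs)
```

Raising the threshold trades coverage for precision: fewer receptor-binding proteins receive a host label, but the labels that are emitted carry higher model confidence.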
Language
English
Recommended Citation
Gonzales, M. (2024). Exploring protein language models for phage-host interaction prediction. Retrieved from https://animorepository.dlsu.edu.ph/etdm_softtech/15
Title Page, Approval Sheet, Animo Repository Consent Form
2024_Gonzales_PreliminaryPages.pdf (266 kB)
Preliminary Pages
2024_Gonzales_Chapter1.pdf (156 kB)
Chapter 1
2024_Gonzales_Chapter2.pdf (1973 kB)
Chapter 2
2024_Gonzales_Chapter3.pdf (3335 kB)
Chapter 3
2024_Gonzales_Chapter4.pdf (3676 kB)
Chapter 4
2024_Gonzales_Chapter5.pdf (667 kB)
Chapter 5
2024_Gonzales_AppendixA_WithSignature.pdf (747 kB)
Appendix A
2024_Gonzales_AppendixB.pdf (103 kB)
Appendix B
2024_Gonzales_AppendixC.pdf (1704 kB)
Appendix C
2024_Gonzales_AppendixD.pdf (4131 kB)
Appendix D
2024_Gonzales_References.pdf (359 kB)
References
Embargo Period
8-14-2024