Research led by Department of Chemical and Biological Engineering assistant professor Ratul Chowdhury has landed firmly on the cutting edge of new computational technology in protein research – and on the cover of a recent issue of Nature Biotechnology, one of the highest-ranked journals in this field.
The research, entitled “Single-sequence protein structure prediction using a language model and deep learning,” focuses on a key advancement in structural protein biology: a new language-model-based method called the recurrent geometric network (RGN2). This opens the prospect of accurately predicting changes to protein structures that result from as little as a single amino acid difference, making the method relevant for disease surveillance, intervention, and protein design.
Recent work by other groups in this area produced two powerful deep-learning methods for protein structure determination, AlphaFold2 and RoseTTAFold. While both of these models were trained for the same task as RGN2, they first cluster all similar proteins together and thus map similar protein sequences to similar structures. RGN2, on the other hand, is trained to “fill in the blank” by predicting the most probable amino acid for a given blank (or “masked token”) in a sentence (protein sequence). This reduces bias toward known similar structures and makes the predicted structure sensitive to any single amino acid change (point mutation).
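To make the “fill in the blank” idea concrete, the short Python sketch below hides a few residues of a protein sequence and records which amino acids a model would be asked to recover. It is only an illustration, not the RGN2 training code: the function names, the `?` mask symbol, the 15% masking fraction, and the example sequence are all assumptions chosen for this example.

```python
import random

MASK = "?"  # hypothetical mask symbol; not taken from the RGN2 publication


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues and return (masked_sequence, targets).

    targets maps each masked position to the residue that was hidden,
    i.e. the "blank" a language model would be trained to fill in.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    residues = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]
        residues[pos] = MASK
    return "".join(residues), targets


def point_mutation_sites(wild_type, mutant):
    """Return the positions at which two equal-length sequences differ."""
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b]


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
    masked, targets = mask_sequence(seq)
    print(masked)
    print(targets)
    # A point mutation changes exactly one position in the sequence.
    print(point_mutation_sites("MKTAYIAK", "MKTAYIAR"))
```

Because every residue is treated as its own token, two sequences that differ by a point mutation are distinct inputs, which is what allows a model trained this way to respond to a single amino acid change.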
Additionally, both the AlphaFold2 and RoseTTAFold algorithms consume large amounts of computing resources and, because they rely on similar known proteins, are less efficient at predicting the all-important “orphan” proteins, which bear no homology with any other known protein and have been estimated to account for as much as 40% of gene products. Orphan proteins are key for novel peptide discovery owing to their unique properties, which find applications in antimicrobial and therapeutic formulations.
Chowdhury’s research employs orders of magnitude less computing power while outperforming both of the other methods in speed and accuracy when predicting the structures of orphan proteins. Chowdhury is currently using RGN2 and other language models for protein design, the study of drug mechanisms, and metabolic engineering.
See Chowdhury’s full research publication.