Protein-protein interaction networks and machine learning

TL;DR: preprint here and on bioRxiv.

Some background about proteins and protein interactions

“Clearly, the proteome of the cell is the most complex and functionally most relevant level of cellular regulation and function.”
- Hein et al., Handbook of Systems Biology

Proteins are the functional units of a cell, responsible for numerous operations necessary to survival such as catalysing (i.e. accelerating) chemical reactions, enabling gene expression and facilitating information flows. Because of their central role in an organism’s metabolism, malfunctioning proteins are almost always responsible for diseases, and are therefore a target of choice for therapeutic treatments. As the discovery of the role of Htt and GIT1 in Huntington’s disease illustrates, a complete mapping of proteins’ roles would enable unprecedented progress in the treatment of diseases.

However, such a mapping is made difficult by the complex relationship between proteins and biological functions. The proteome is constantly changing, and by combining with different proteins at different times, a single protein can be involved in numerous functions. These protein-protein interactions (PPIs) are the cornerstone of biological processes, and understanding them better is instrumental to successfully mapping proteins to functions. The diversity of PPIs, from permanent complexes to transient associations, makes the task difficult. Another challenge is the size of the PPI network. Its exact size is still unknown: estimates for humans range from about 20,000 protein-coding genes to several million proteoforms. Even when only the 20,000 verified proteins are included, this still represents about 200,000,000 potential pairwise interactions, or roughly 2.7 x 10¹⁹ possible 5-way combinations.
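For readers who want to check these orders of magnitude, the counts are plain binomial coefficients. The short sketch below assumes that interactions are unordered, that a protein is not paired with itself, and that 20,000 is the number of proteins considered; the variable names are illustrative only.

```python
from math import comb

# Number of proteins considered (the figure quoted in the text above).
n_proteins = 20_000

# Unordered pairs of distinct proteins: n * (n - 1) / 2.
pairs = comb(n_proteins, 2)

# Unordered groups of five distinct proteins.
five_way = comb(n_proteins, 5)

print(f"potential pairwise interactions: {pairs:,}")
print(f"potential 5-way combinations:    {five_way:.2e}")
```

The pairwise count comes out at 199,990,000, which is the "200,000,000" quoted above once rounded, and the 5-way count is of the order of 10¹⁹. Testing every candidate experimentally is clearly out of reach at that scale.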

The role of computational methods, and their limitations

Computational methods can address the issues of scalability and experimental bias. Given a pair of proteins and some characteristics of each one, machine learning models can learn to predict the likelihood of interaction. Numerous methods have been developed for this, using the full range of machine learning models, from early work on Saccharomyces cerevisiae to algorithms dedicated to human PPIs. Yet, despite a wealth of tools, the mechanics and consequences of the underlying inference are still poorly understood, and it is unclear why models with similar performance make vastly different predictions. Reported performance scores often cannot be compared or replicated because of proprietary data and inconsistent or flawed assessment methods. As a consequence, in silico PPI prediction faces multiple issues: it is unclear what the state of the art is, analyses are difficult to reconcile, the development of new models is inefficient, follow-up mechanistic studies are likely undermined and, ultimately, there are different versions of the underlying molecular networks that describe protein function.

A unified framework for PPI inference would improve the development and reliable assessment of new models, and would facilitate the overdue widespread adoption of PPI predictions for downstream analysis. Replicable, trustworthy and generalisable high-performing models can capture more causal biology and enhance many aspects of biological research such as experimental designs and drug development.

Our work

We designed a robust and standardised approach to in silico PPI prediction that accounts for both biological and statistical pitfalls and leverages the strength of large, open-source and professionally curated databases. We made benchmarking standards for human and yeast PPIs publicly available to accelerate future discoveries and lay the foundations for similar datasets for other organisms. Within this framework, we studied and compared the main approaches to PPI prediction in humans, based either on functional genomic (FG) information or on amino acid sequences alone. We highlighted why both perspectives are still relevant today and how each adapts to the PPI network’s topology. In particular, we showed that the presence of highly connected proteins in the networks has a drastic impact on prediction models and is an area where FG and sequence models diverge. We also replicated these results in both human and yeast (S. cerevisiae) and showed which tools are most suitable for cross-species predictions. This work provides robust foundations for future developments in PPI prediction models, and also gives critical insight into which models can and should be used in different situations.
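To make the notion of "highly connected proteins" concrete, here is a minimal sketch that counts each protein's interaction partners (its degree) in a small edge list and measures how many interactions involve a hub. It is an illustration only, not the method used in the preprint: the edge list is a hypothetical toy network with placeholder protein names, and the hub cutoff is an arbitrary assumption.

```python
from collections import Counter

# Hypothetical toy network: placeholder names, not real interaction data.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("E", "F"),
]

# Degree = number of interactions each protein takes part in.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Arbitrary cutoff chosen for this example; real analyses would justify it.
hub_threshold = 3
hubs = {protein for protein, d in degree.items() if d >= hub_threshold}

# Interactions that involve at least one hub protein.
hub_edges = [(u, v) for u, v in edges if u in hubs or v in hubs]
hub_share = len(hub_edges) / len(edges)

print("hubs:", sorted(hubs))
print(f"share of interactions involving a hub: {hub_share:.2f}")
```

In this toy network a single protein accounts for four of the six interactions, which gives a sense of how a handful of hubs can dominate the positive examples a model sees.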

Interested? You can find out more in the preprint (also available on bioRxiv).