Predicting a Protein’s Stability under a Million Mutations: Experimental Setting & Additional Result

cover
12 Mar 2024

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Jeffrey Ouyang-Zhang, UT Austin

(2) Daniel J. Diaz, UT Austin

(3) Adam R. Klivans, UT Austin

(4) Philipp Krähenbühl, UT Austin

A Experimental Setting

We elaborate on our experimental protocol, including the datasets used for training and testing, the evaluation metrics, and the baselines that calibrate our models.

A.1 Datasets

A fundamental principle in machine learning is to partition data so that the model is not evaluated on the same data it was trained on. This is essential to assess the model’s ability to generalize beyond its training data (in our case, to new proteins). Protein sequences are regarded as the same protein when they have large sequence similarity (e.g. > 50%), even when they are not exactly the same string. The impact of training on protein sequences similar to those in the test set has received considerable attention [18, 45, 54]. We list the datasets we use below and report the maximum sequence similarity of our training set with each test set. This ensures a clear separation between our training and test sets.

cDNA proteolysis [72] is a large scale dataset containing mutant proteins with ∆∆G measurements. The dataset is generated by cDNA display proteolysis, a massively parallel technique that measures thermodynamic folding stability for millions of protein variants [72]. The cDNA proteins average 56.1 amino acids in length with a maximum length of 72 and a minimum length of 30 amino acids. The mean MSA depth is 7797 with a standard deviation of 6282. The maximum depth is 23525 and the minimum depth is 5. The mutations contain either one or two substitutions on a small protein. We follow the train-val split introduced in [18] and additionally filter out proteins in our training set that are similar to those in our evaluation datasets. Specifically, we train on 116 proteins with 213,000 total mutations, of which 97,000 are double mutants and 117,000 are single mutants. We hold out cDNA2, a validation set of 18 mini-proteins with 22,000 total double mutations. The proteins in our training set have at most 36% sequence similarity to those in cDNA2. A comprehensive analysis of the dataset, experimental assay, and filtering criteria are found in their paper [1].

ProTherm [48] contains experimentally measured thermodynamic data ∆∆G for protein mutants. We use PTMul, a split containing only higher-order mutations curated by DDGun [44]. There are 88 proteins with 846 total mutations in PTMul. The proteins in our training set have at most 35% sequence similarity to those in PTMul.

S669 [54] is a manually curated dataset of mutant proteins and their ∆∆G measurements. S669 contains mutations with one substitution. There are 94 proteins with 669 total single mutations in S669. The proteins in our training set have at most 30% sequence similarity to those in S669.

ProteinGym [49] is a curated set of proteins in which mutations are assessed for some fitness phenotype. These fitness phenotypes vary drastically, from cell growth to expression to fluorescence. The fitness values are always normalized so a higher value indicates stronger fitness. For classification metrics, the fitness values are binarized by some threshold either defined manually or arbitrarily at the median value. We also consider a subset of the proteins in ProteinGym in which fitness correlates highly with thermodynamic stability. These phenotypes include thermostability, expression, etc. There are 87 proteins totaling over 1.6 million (potentially higher-order) mutations. The proteins in our training set have at most 45% sequence similarity to those in ProteinGym.

T2837 [18] is an aggregate dataset with mutant proteins and their ∆∆G measurements. T2837 contains mutations with one substitution. There are 129 proteins with 2837 total single mutations in T2837. The proteins in our training set have at most 34% sequence similarity to those in T2837.

A.2 Metrics

We evaluate our model’s ability to assess mutations in ProTherm, S669, and ProteinGym. Additionally, we evaluate our model’s ability to detect stabilizing mutations in cDNA2.

A.2.1 Mutation Assessment

The model predicts the ∆∆G measurement under all mutations with experimental data. The models are evaluated by the following regression and classification metrics. The classification metrics are useful in applications where it is unnecessary to provide exact ∆∆G and a binary stabilizing or de-stabilizing label (or third label, neutral) is sufficient.

Following the literature, we additionally evaluate reverse mutations (in contrast to the standard “forward” mutations) where the “to” and “from” amino acids are flipped (e.g. P1K to K1P). The model takes the mutant sequence as input and is expected to output the negation of the original ∆∆G value when mutating to the original native amino acid.

Spearman correlation coefficient (rs) evaluates the model’s ability to rank the mutations by their ∆∆G values. This metric considers only the relative ordering between mutations and disregards the predicted values. Spearman has its limitations because our datasets contain predominantly destabilizing mutations (∆∆G > 0). Spearman applied on these datasets overwhelmingly measures the model’s ability to rank destabilizing mutations which does not directly translate to identifying the most stabilizing mutations [8, 58].

Root Mean Square Error (RMSE) provides a measure of how closely the predicted measurements align with the true measurements.

Area Under the receiver operating characteristic Curve (AUC) is a popular classification metric that focuses on positives. The receiver operating characteristic curve is created by plotting precision against recall at various classification thresholds. Precision is calculated as the ratio of true positives to the sum of true positives and false positives. Recall is calculated as the ratio of true positives to the sum of true positives and false negatives.

Matthew correlation coefficient (MCC) evaluates a model’s classification performance under imbalanced datasets. It takes into account true positive, true negative, false positive, and false negative predictions to measure the overall performance of a model.

A.2.2 Mutation Detection

The regression metrics for mutation assessment are not directly suitable for protein engineering where the goal is to find stabilizing mutations. In addition to the classification metrics, we evaluate our models using per-protein metrics which evaluate the model’s ability to detect a few stabilizing mutations from a large number of possible mutations. These metrics are applied to heavily annotated proteins (∼ 1000 ∆∆G measurements) with randomly sampled mutations.

Stabilizing Spearman correlation coefficient rs is the Spearman coefficient calculated on the subset of experimentally stabilizing mutations. This mimics the case in protein engineering where there are too many stabilizing mutations to test in vitro and the goal is to prioritize the most stabilizing ones. This metric is used in [18, 65].

Normalized Discounted Cumulative Gain (nDCG) [30] quantifies the quality of a ranking among mutations by considering both the experimental ∆∆G and predicted rank of each mutation. Specifically, nDCG computes the sum of experimentally stabilizing ∆∆Gs weighted inversely by their ranks. This metric rewards a model that ranks the best mutations first. We truncate the ranking at K = 30 mutations to approximate the number of in-vitro experiments that can typically be run in parallel. This metric is found in [59].

Detection Precision (DetPr) measures the fraction of experimentally stabilizing mutations among the top K = 30 predicted mutations by the model. This simulates the success rate a protein engineer might expect when deploying this model. K is set to approximate the number of in-vitro experiments that can typically be run in parallel.

A.3 Baselines

Several baselines calibrate performance across datasets. A subset of the critical baselines is introduced briefly (see [54] for more).

Mean is a simple baseline in which the mean measurement is predicted for all mutations. This baseline is not conditioned on the protein or mutation and instead uses only the dataset statistic.

Multiple Sequence Alignment (MSA) leverages the statistics from a set of similar protein sequences to rank mutations. Similar protein sequences are fetched, aligned, and indexed at a mutation position.

This yields an amino acid frequency count normalized into a probability distribution. The score is defined as the likelihood ratio between the mutated and native amino acid types. While calibrated for a given position and protein, it may not be well-calibrated across proteins.

ESM [38] also computes the likelihood ratio between a mutated and native amino acid type [42]. Unlike MSA, the probability distribution is decoded from ESM2 [38], a model pre-trained on masked amino acid prediction. Like MSA, ESM may not be well-calibrated across proteins.

DDGun [44] uses a linear combination of scores derived from evolutionary and structural features to compute ∆∆G.

FoldX [67] leverages an empirical force field to estimate the change in stability. It models the Van Der Waals interactions, solvation and hydrogen bonding interactions, changes in entropy, and other physical values.

PROSTATA [74] is a sequence model for stability prediction built on the ESM2 model. It takes native and mutant protein sequences as input and produces a ∆∆G value.

Stability Oracle [18] is a structure-based graph-transformer model for stability prediction.

Tranception [49] is a specialized autoregressive transformer that uses grouped attention to encourage specialization across heads.

B Additional Results

B.1 Higher-order Mutation Results on PTMul Breakdown

ProTherm Multiple (PTMul) contains 846 mutations, of which 536 are double mutations and 310 mutations with more than two mutations. We breakdown the performance on PTMul into double mutations and higher-order mutations (greater than 2) in Table 6. While our model performs similarly against the additive baseline on double mutations, Mutate Everything outperforms the additive baseline on higher-order mutations.

Table 6: Results on PTMul splits containing double mutations and more than 2 mutations. Mutate Everything outperforms existing methods, especially when there are multiple mutations.

Table 7: Comparisons on Single Mutation Thermodynamic Stability prediction in T2837. We report regression and classification metrics on the forward and reverse datasets. rs refers to Spearman correlation coefficient. Our training proteins have at most 34% sequence similarity with those in T2837. Mutate Everything is state-of-the-art on T2837 forward.

Table 8: Mutate Everything regression and classification metrics on T2837 and its reverse datasets. TR refers to Thermodynamic Reversibility. rp is Pearson correlation coefficient. rs is Spearman correlation coefficient. Mutations that failed our data pipeline are excluded.

B.2 Single Mutation Results on T2837

We include our results on the newly proposed T2837 dataset in Table 7 [18]. Mutate Everything achieves a Spearman correlation of 0.65 compared to 0.62 of Stability Oracle, outperforming existing methods on the forward dataset. PROSTATA-IFML [18] fine-tunes PROSTATA [74] on the cDNA proteolysis dataset for more direct comparison. Mutate Everything performs competitively with other sequence models on the reverse dataset even without training on reverse mutations. We show our performance on additional metrics for T2837 in Table 8.