plasma: Partial LeAst Squares for Multi-omics Analysis

Kevin R. Coombes

Introduction

Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find multiple review articles summarizing progress in the field [Subramanian-2020; Graw-2021; Heo-2021; Picard-2021; Reel-2021; Vlachavas-2021; Adossa-2021]. As with other applications of machine learning, the problems addressed by these algorithms fall into two categories: unsupervised (e.g., clustering or class discovery) and supervised (including class comparison and class prediction). Advances in unsupervised learning have been broader and deeper than advances in supervised learning [CITE].

One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [Argelaguet-2018; Argelaguet-2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.

MOFA starts with a standard method, Latent Factor Analysis, that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that it is approximately equivalent to a two-step process:

  1. Use principal components analysis to identify initial latent factors in each individual omics data set.
  2. For each pair of omics data sets, use overlapping samples to train and extend models of each factor to the union of assayed samples (see the sketch below).
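
To make this interpretation concrete, here is a rough sketch in R of the two-step reading; it is not the actual MOFA algorithm. Here omicsA and omicsB are hypothetical matrices (features in rows, samples in columns) whose sample sets only partly overlap.

## Step 1: initial latent factors within each data set (here via PCA)
facA <- prcomp(t(omicsA), rank. = 2)$x     # samples of A  x  factors
facB <- prcomp(t(omicsB), rank. = 2)$x     # samples of B  x  factors
## Step 2: on the shared samples, learn to express set A's factors in terms
## of set B's factors, then apply that linear map to every sample assayed on B
shared <- intersect(rownames(facA), rownames(facB))
fit <- lm(facA[shared, ] ~ facB[shared, ])
facAonB <- cbind(1, facB) %*% coef(fit)    # set A's factors extended across B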

That interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call “plasma”, to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).

Methods

Our computational method is implemented and the data are available in the plasma package.

suppressWarnings( library(plasma) )
packageVersion("plasma")
## [1] '0.5.12'

Data

We downloaded the entire esophageal cancer Level 3 data set from the TCGA on 6 August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.

loadESCAdata()
sapply(assemble, dim)
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         185          185 184     185    166     157  126
  1. From TCGA, we obtained 162 columns of clinical, demographic, and laboratory data on 185 patients. We removed any columns that always contained the same value. We also removed any columns whose values were missing in more than 25% of the patients. We converted categorical variables into sets of binary variables using one-hot-encoding. We then separated the clinical data into three parts:
    1. Outcome (overall survival)
    2. Binary covariates (53 columns)
    3. Continuous covariates (6 columns)
  2. Exome sequencing data for 184 patients with esophageal cancer was obtained as Mutation Annotation Format (MAF) files. We removed any gene that was mutated in fewer than 3% of the samples. The resulting data set contained 566 mutated genes.
  3. Methylation data for 185 ESCA patients was obtained as beta values computed by the TCGA from Illumina Methylation 450K microarrays. We removed any CpG site for which the standard deviation of the beta values was less than 0.3. The resulting data set contained 1,454 highly variable CpG’s.
  4. Already normalized sequencing data on 2,566 microRNAs (miRs) was obtained for 185 patients. We removed any miR for which the standard deviation of normalized expression was less than 0.05, which left 926 miRs in the final data set.
  5. Already normalized sequencing data on 20,531 mRNAs was obtained for 184 patients. We removed any mRNA whose mean normalized expression was less than 6 or whose standard deviation was less than 1.2. The final data set included 2,520 mRNAs.
  6. Normalized expression data from reverse phase protein arrays (RPPA) was obtained from antibodies targeting 192 proteins in 126 patients. All data were retained for further analysis.

Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients).
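
The feature filters described above are simple frequency or variability thresholds. As an illustration, the methylation filter amounts to something like the sketch below, where betaValues is a hypothetical matrix of beta values with CpG sites in rows and patients in columns.

## Keep only the CpG sites whose beta values vary appreciably across patients
## (standard deviation of at least 0.3), as described in item 3 above
sds <- apply(betaValues, 1, sd, na.rm = TRUE)
meth450 <- betaValues[sds >= 0.3, ]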

Here is a summary of the outcome data.

summary(Outcome)
##    patient_id  vital_status days_to_death days_to_last_followup      Days       
##  a3i8   :  1   alive:108    180    :  2   375    : 4            Min.   :   4.0  
##  a3ql   :  1   dead : 77    283    :  2   383    : 3            1st Qu.: 232.0  
##  a3y9   :  1                104    :  1   366    : 2            Median : 400.0  
##  a3ya   :  1                112    :  1   391    : 2            Mean   : 538.9  
##  a3yb   :  1                118    :  1   401    : 2            3rd Qu.: 681.0  
##  a3yc   :  1                (Other): 70   (Other):95            Max.   :3714.0  
##  (Other):179                NA's   :108   NA's   :77

Computational Approach

The plasma algorithm is based on Partial Least Squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [Bastien-2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (plsRcox Version 1.7.6 [Bertand-2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (pls Version 2.8.0 [Liland-2021]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends (shown in pastel red, green, and blue, resp.) each of the original models, in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to get a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.

SF <- system.file("Figure/methods.png", package = "plasma")
knitr::include_graphics(SF)

Figure 1 : Workflow schematic for plasma algorithm with three omics data sets.

rm(SF)
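
To make the second step of the workflow concrete, the sketch below (which is not the internal plasma code) uses the pls package directly to "re-interpret" the Cox components of one omics data set in terms of another. Here compA is a hypothetical matrix of PLS Cox component scores from omics data set A (samples in rows) and omicsB is a hypothetical matrix from data set B (features in rows, samples in columns); the choice of ncomp = 3 is arbitrary.

library(pls)
## samples assayed on both platforms
shared <- intersect(rownames(compA), colnames(omicsB))
## PLS linear regression from the features of B to the Cox components of A,
## trained only on the overlapping samples
train <- data.frame(Y = I(compA[shared, , drop = FALSE]),
                    X = I(t(omicsB[, shared, drop = FALSE])))
extender <- plsr(Y ~ X, ncomp = 3, data = train)
## extend ("re-interpret") A's components for every sample assayed on B
extendedA <- predict(extender, ncomp = 3,
                     newdata = data.frame(X = I(t(omicsB))))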

Terminology

Because of the layered nature of the plasma algorithm, we use the following terminology to clarify the later discussion.

  1. The input data contains a list of omics data sets.
  2. Each omics data set contains measurements of multiple features.
  3. The first step in the algorithm uses PLS Cox regression to find a set of components. Each component is a linear combination of features. The components are used as predictors in a Cox proportional hazards model, which predicts the log hazard ratio as a linear combination of components (see the sketch after this list).
  4. The second step in the algorithm creates a secondary layer of components. We do not give these components a separate name; they are not of particular interest in themselves. We view them as a way to extend the first-level components to more samples by “re-interpreting” them in other omics data sets.
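
This layered structure can be summarized in a short sketch, assuming a hypothetical feature matrix X (features in rows, samples in columns), a weight matrix W (one column of feature weights per component), and a vector beta of Cox regression coefficients, one per component.

## Each component is a linear combination of features ...
components <- t(X) %*% W            # samples x components
## ... and the Cox model predicts the log hazard ratio as a linear
## combination of the components
logHR <- drop(components %*% beta)  # one linear predictor per sample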

Preparing the Data

To be consistent with the MOFA2 R package, all of the data sets are arranged so that patient samples are columns and assay features are rows. Our first task is to pad each data set with appropriate NA’s to ensure that each set includes the same patient samples in the same order, where that order matches the outcome data frame.

MO <- prepareMultiOmics(assemble, Outcome)
summary(MO)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         185          185 185     185    185     185  185
## Outcomes:
##    patient_id  vital_status days_to_death days_to_last_followup      Days       
##  a3i8   :  1   alive:108    180    :  2   375    : 4            Min.   :   4.0  
##  a3ql   :  1   dead : 77    283    :  2   383    : 3            1st Qu.: 232.0  
##  a3y9   :  1                104    :  1   366    : 2            Median : 400.0  
##  a3ya   :  1                112    :  1   391    : 2            Mean   : 538.9  
##  a3yb   :  1                118    :  1   401    : 2            3rd Qu.: 681.0  
##  a3yc   :  1                (Other): 70   (Other):95            Max.   :3714.0  
##  (Other):179                NA's   :108   NA's   :77

We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.

Split Into Training and Test

As indicated above, we want to separate the data set into training and test samples. We randomly assign roughly 60% of the patients to the training set and the remaining 40% to the test set.

set.seed(54321)
#set.seed(34672)
splitVec <- rbinom(nrow(Outcome), 1, 0.6)

Figure 2 presents a graphical overview of the number of samples (N) and the number of features (D) in each omics component of the training and test sets.

trainD <- MO[, splitVec == 1]
summary(trainD)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         108          108 108     108    108     108  108
## Outcomes:
##    patient_id  vital_status days_to_death days_to_last_followup      Days       
##  a3i8   :  1   alive:66     180    : 2    383    : 3            Min.   :  11.0  
##  a3ya   :  1   dead :42     112    : 1    366    : 2            1st Qu.: 231.8  
##  a3yc   :  1                118    : 1    375    : 2            Median : 392.0  
##  a43c   :  1                128    : 1    401    : 2            Mean   : 502.2  
##  a43e   :  1                131    : 1    1007   : 1            3rd Qu.: 684.2  
##  a43i   :  1                (Other):36    (Other):56            Max.   :1837.0  
##  (Other):102                NA's   :66    NA's   :42
testD <- MO[, splitVec == 0]
summary(testD)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]          77           77  77      77     77      77   77
## Outcomes:
##    patient_id vital_status days_to_death days_to_last_followup      Days       
##  a3ql   : 1   alive:42     104    : 1    375    : 2            Min.   :   4.0  
##  a3y9   : 1   dead :35     1263   : 1    408    : 2            1st Qu.: 234.0  
##  a3yb   : 1                136    : 1    1071   : 1            Median : 403.0  
##  a43h   : 1                1402   : 1    1094   : 1            Mean   : 590.4  
##  a43m   : 1                1405   : 1    1168   : 1            3rd Qu.: 650.0  
##  a49l   : 1                (Other):30    (Other):35            Max.   :3714.0  
##  (Other):71                NA's   :42    NA's   :35
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1,2))
plot(trainD, main = "Train")
plot(testD, main = "Test")

Figure 2 : Overview of training and test data.

par(opar)

Results

Individual PLS Cox Regression Models

The first step of the plasma algorithm is to fit PLS Cox models on each omics data set using the function fitCoxModels. The returned object of class MultiplePLSCoxModels contains a list of SingleModel objects, one for each assay; each of these in turn contains three regression models:

suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days",
                          eventvar = "vital_status", eventvalue = "dead") )
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
summary(firstPass)
## An object containing MultiplePLSCoxModels based on:
## [1] "ClinicalBin"  "ClinicalCont" "MAF"          "Meth450"      "miRSeq"       "mRNASeq"     
## [7] "RPPA"
plot(firstPass, legloc = "bottomleft")

Figure 3 : Kaplan-Meier plots of overall survival on the training set from separate PLS Cox omics models

On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).

Extend Model Components Across Omics Data Sets

The second step of the algorithm is to extend the individual omics-based models across the other omics data sets. This step is performed by the plasma function, which takes as input the previously created objects of class multiOmics and MultiplePLSCoxModels. The function operates iteratively, so in our case there are seven different sets of predictions of the PLS components. These predictions are averaged and saved internally as a data frame called meanPredictions. The structure of the models created and stored in the plasma object is the same as for the separate individual omics models. Figure 4 shows the Kaplan-Meier plot using the predicted risk, split at the median value, on the training data set.

pl <- plasma(MO, firstPass)
plot(pl, legloc = "topright", main = "Training Data")

Figure 4 : Kaplan-Meier plot of overall survival on the training set using the unified plasma Cox model.

Independent Test Set

Now we want to see how well the final composite model generalizes to our test set. Figure 5 uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0324) separation of outcomes between the high and low risk patients.

testpred <- predict(pl, testD)
plot(testpred, main="Testing", xlab = "Time (Days)")

Figure 5 : Kaplan-Meier plot of overall survival on the test set.
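
A p-value of this kind is typically obtained from a log-rank comparison of the two risk groups. For readers who want to reproduce such a check outside the package, here is a minimal sketch using the survival package; riskTrain and riskTest are hypothetical numeric vectors of predicted risks, and testOutcome is a hypothetical data frame holding the outcome columns for the test samples.

library(survival)
## split the test samples at the median predicted risk of the training set
testOutcome$riskGroup <- ifelse(riskTest > median(riskTrain, na.rm = TRUE),
                                "high", "low")
## log-rank test for a difference in overall survival between the two groups
survdiff(Surv(Days, vital_status == "dead") ~ riskGroup, data = testOutcome)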

Interpretation

At this point, our model appears to be a fairly complex black box. We have constructed a matrix of components, based on linear combinations of actual features in different omics data sets. These components can then be combined in yet another linear model that predicts the time-to-event outcome through Cox regression. In this section, we want to explore how the individual features from different omics data sets contribute to different model components.

The first step toward opening the black box is to note that not all of the components discovered from the individual omics data sets survived into the final composite model. Some components were eliminated because they appeared to be nearly linearly related to components found in other omics data sets. So, we can examine the final composite model more closely.

pl@fullModel
## Call:
## coxph(formula = Surv(Days, vital_status == "dead") ~ MAF1 + MAF2 + 
##     MAF3 + Meth4502 + miRSeq2 + miRSeq3 + mRNASeq1 + RPPA1 + 
##     RPPA2 + RPPA3, data = riskDF)
## 
##                coef  exp(coef)   se(coef)      z        p
## MAF1      1.434e+00  4.195e+00  3.620e-01  3.961 7.47e-05
## MAF2     -1.921e-01  8.252e-01  1.241e-01 -1.548 0.121607
## MAF3      5.677e+00  2.920e+02  1.439e+00  3.946 7.94e-05
## Meth4502  1.178e+00  3.246e+00  4.358e-01  2.702 0.006890
## miRSeq2   2.276e+00  9.742e+00  5.443e-01  4.183 2.88e-05
## miRSeq3   9.415e+00  1.227e+04  2.746e+00  3.429 0.000607
## mRNASeq1  1.155e-01  1.122e+00  3.817e-02  3.025 0.002485
## RPPA1     6.119e-01  1.844e+00  2.346e-01  2.608 0.009099
## RPPA2     8.806e-01  2.412e+00  2.326e-01  3.786 0.000153
## RPPA3     1.614e+00  5.021e+00  3.843e-01  4.200 2.67e-05
## 
## Likelihood ratio test=56.25  on 10 df, p=1.843e-08
## n= 185, number of events= 77
temp <- terms(pl@fullModel)
mainterms <- attr(temp, "term.labels")
rm(temp)
mainterms
##  [1] "MAF1"     "MAF2"     "MAF3"     "Meth4502" "miRSeq2"  "miRSeq3"  "mRNASeq1" "RPPA1"   
##  [9] "RPPA2"    "RPPA3"

We see that at least one component discovered from each of the five “true” omics data sets survived in the final model. However, neither the binary clinical nor continuous clinical data sets contributed independent components.

Our interest now turns to understanding how the features from individual omics data sets contribute to the components that are used in the final model. As mentioned earlier, these contributions are mediated through two levels of linear regression models when extending a model from data set A to data set B. A linear combination of features from set B is used to define the secondary level of components; then a linear combination of these components is used to predict the components of the single Cox model that had been built from set A. These weights can be combined and extracted using the getAllWeights function, and can then be explored.
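
Conceptually, collapsing the two regression layers amounts to composing two linear maps, which is roughly the kind of matrix that getAllWeights extracts. In the hypothetical sketch below, W holds the weights mapping the features of data set B to the secondary components, and V holds the weights mapping those secondary components to the Cox components learned from data set A.

## Composing the two linear layers into a single weight matrix:
##   W : features of data set B  ->  secondary components     (features x K)
##   V : secondary components    ->  Cox components of set A  (K x components)
combinedWeights <- W %*% V   # features of data set B -> Cox components of set A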

Clinical Binary Data

We use the binary clinical data set to illustrate our methods for interpreting the components.

if (requireNamespace("oompaBase")) {
  HG <- oompaBase::blueyellow(64)
} else {
  HG <- heat.colors(64)
}
## Loading required namespace: oompaBase
cbin <- getAllWeights(pl, "ClinicalBin")
heat(cbin, cexCol = 0.9, cexRow = 0.5, col = HG)

Figure 6 : Unscaled heatmap of the contributions of binary clinical features to all components.

Figure 6 shows the raw weights for each clinical binary feature in all of the original omics components. Figure 7 only includes the omics components that were retained in the final composite model. Structurally, these two figures look similar. There is a central set of components where all of the weights look weak, and a central set of features that appear to have weak contributions to all components.

heat(cbin[, mainterms], cexCol = 0.9, cexRow = 0.5, col = HG)

Figure 7 : Unscaled heatmap of the contributions of binary clinical features to important components.

We hypothesize that some components intrinsically have a wider spread of weights, and that it might be more important to scale the components consistently to look at the relative contributions (Figure 8). By standardizing the weights component by component, we see a better balance of contributions per component. However, we can still identify a set of features that seem to make no contributions to any of the components. We can filter the features by only retaining those that make a large contribution to at least one omics component (Figure 9).

dbin <- cbin[, mainterms]
dbin@contrib <- scale(dbin@contrib)
heat(dbin, cexCol = 0.9, cexRow = 0.5, col = HG)

Figure 8 : Scaled heatmap of the contributions of binary clinical features to important components.

shrink <- function(dset, N) {
  # standardize the weights within each component
  dset@contrib <- scale(dset@contrib)
  # pool the top N features reported by getTop for each component
  feat <- unique(unlist(as.list(getTop(dset, N))))
  # restrict to those features and to the components kept in the final model
  dset@contrib <- dset@contrib[feat, mainterms]
  dset
}
xbin <- shrink(cbin, 4)
heat(xbin, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 9 : Scaled heatmap of the contributions of filtered binary clinical features to important components.

In Figure 9, we can identify strong contrasts between several pairs of variables. For example, one set of components is enriched with white, never smokers, who still have evidence of tumors, at stage T3 and grade 3 in the lower third of the esophagus (ICD-10 code C15.5), while another group is enriched for Asian, current smokers, who are tumor-free, with stage N0, T2 tumors from the middle third of the esophagus (ICD-10 code C15.4).

Clinical Continuous Data

ccont <- getAllWeights(pl, "ClinicalCont")
xcont <- shrink(ccont, 2)
heat(xcont, cexCol = 0.8, cexRow = 1.2, col = HG)

Figure 10 : Scaled heatmap of the contributions of filtered continuous clinical features to important components

The most important continuous clinical feature is the patient’s weight, which is positively correlated (yellow) with four components and negatively correlated (blue) with six others.

Mutation Data

cmaf <- getAllWeights(pl, "MAF")
xmaf <- shrink(cmaf, 4)
heat(xmaf, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 11 : Scaled heatmap of the contributions of filtered mutation features to important components

Methylation Data

meth <- getAllWeights(pl, "Meth450")
xmeth <- shrink(meth, 4)
heat(xmeth, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 12 : Scaled heatmap of the contributions of filtered methylation features to important components

Proteomics Data

rppa <- getAllWeights(pl, "RPPA")
xrppa <- shrink(rppa, 5)
# shorten the row labels by dropping the first dot-separated token of each RPPA feature name
tmp <- rownames(xrppa@contrib)
rownames(xrppa@contrib) <- sapply(strsplit(tmp, "\\."), function(x) paste(x[-1], collapse = "."))
hc <- heat(xrppa, cexCol = 0.9, cexRow = 0.9, col = HG, keep.dendro = TRUE)

Figure 13 : Scaled heatmap of the contributions of filtered RPPA features to important components

mRNA-Sequencing Data

mrna <- getAllWeights(pl, "mRNASeq")
xmrna <- shrink(mrna, 7)
# keep only the first dot-separated token of each mRNA row label
tmp <- rownames(xmrna@contrib)
rownames(xmrna@contrib) <- sapply(strsplit(tmp, "\\."), function(x) x[1])
heat(xmrna, cexCol = 0.9, cexRow = 0.6, col = HG)

Figure 14 : Scaled heatmap of the contributions of filtered mRNA features to important components

miR-Sequencing Data

mir <- getAllWeights(pl, "miRSeq")
xmir <- shrink(mir, 5)
# shorten the row labels by dropping the last dot-separated token of each miR name
tmp <- strsplit(rownames(xmir@contrib), "\\.")
nc <- sapply(tmp, length)
rownames(xmir@contrib) <- sapply(1:length(tmp), function(dex) 
  paste(tmp[[dex]][1:(nc[dex]-1)], collapse = "."))
heat(xmir, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 15 : Scaled heatmap of the contributions of filtered miR features to important components

Conclusions

We have identified a method analogous to MOFA that allows us to combine different omics data sets without the need for prior imputation of missing values. A major difference is that while the MOFA model learns “factors” that are composites of the variables in an unsupervised fashion, the plasma model learns “components” that are composites of the variables in a supervised fashion, using the outcomes “event” and “time-to-event” as response variables.

Although the factors from MOFA are defined such that the first factor, Factor 1, accounts for the greatest variance in the model, the factors may or may not be significantly associated with the outcome, and a post-hoc survival analysis would be needed to assess this. It may also be the case that some factors, although significantly associated with the outcome, account for very little variance in the final MOFA model, which hinders interpretability. This was the case with the TCGA-ESCA data set: when 10 factors were learned by the MOFA model, only Factor 10 was significantly associated with survival, while accounting for [number] variance in the model [CITE SUPPLEMENTARY RESULTS?]. On the other hand, the components for plasma are created in a way that maximizes the covariance between the predictors and the response, so these components will automatically be associated, to some degree, with the outcome. This could be advantageous in that dissecting the weights associated with the components yields a list of variables from different omics data sets that contribute the most to predicting the outcome, and any additional analyses could be refined by looking at these highly weighted variables more closely.

Appendix: MOFA2

suppressWarnings( library(MOFA2) )