Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find multiple review articles summarizing progress in the field [Subramanian-2020; Graw-2021; Heo-2021; Picard-2021; Reel-2021; Vlachavas-2021; Adossa-2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms are divided into two categories: unsupervised (e.g., clustering or class discovery) or supervised (including class comparison and class prediction). Advances in the area of unsupervised learning have been broader and deeper than advances on supervised learning [CITE].
One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [Argelaguet-2018; Argelaguet-2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.
MOFA starts with a standard method – Latent Factor Analysis – that is known to work well on single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that it is approximately equivalent to a two-step process:
That interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call “plasma”, to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).
Our computational method is implemented and the data are available in the plasma
package.
## [1] '0.5.12'
We downloaded the entire esophageal cancer Level 3 data set from the TCGA on 6 August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.
## ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,] 53 6 566 1454 926 2520 192
## [2,] 185 185 184 185 166 157 126
Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients).
Here is a summary of the outcome data.
## patient_id vital_status days_to_death days_to_last_followup Days
## a3i8 : 1 alive:108 180 : 2 375 : 4 Min. : 4.0
## a3ql : 1 dead : 77 283 : 2 383 : 3 1st Qu.: 232.0
## a3y9 : 1 104 : 1 366 : 2 Median : 400.0
## a3ya : 1 112 : 1 391 : 2 Mean : 538.9
## a3yb : 1 118 : 1 401 : 2 3rd Qu.: 681.0
## a3yc : 1 (Other): 70 (Other):95 Max. :3714.0
## (Other):179 NA's :108 NA's :77
The plasma
algorithm is based on Partial Least Squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [Bastien-2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (plsRcox
Version 1.7.6 [Bertand-2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (pls
Version 2.8.0 [Liland-2021]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends (shown in pastel red, green, and blue, resp.) each of the original models, in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to get a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.
Because of the layered nature of the plasma algorithm, we intend to use the following terminology to help clarify the later discussions.
To be consistent with the MOFA2
R package, all of the data sets are arranged so that patient samples are columns and assay features are rows. Our first task is to pad each data set with appropriate NA
’s to ensure that each set includes the same patient samples in the same order, where that order matches the outcome data frame.
## Datasets:
## ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,] 53 6 566 1454 926 2520 192
## [2,] 185 185 185 185 185 185 185
## Outcomes:
## patient_id vital_status days_to_death days_to_last_followup Days
## a3i8 : 1 alive:108 180 : 2 375 : 4 Min. : 4.0
## a3ql : 1 dead : 77 283 : 2 383 : 3 1st Qu.: 232.0
## a3y9 : 1 104 : 1 366 : 2 Median : 400.0
## a3ya : 1 112 : 1 391 : 2 Mean : 538.9
## a3yb : 1 118 : 1 401 : 2 3rd Qu.: 681.0
## a3yc : 1 (Other): 70 (Other):95 Max. :3714.0
## (Other):179 NA's :108 NA's :77
We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.
As indicated above, we want to separate the data set into training and test samples. We will use 60% for training and 40% for testing.
Figure 2 presents a graphical overview of the number of samples (N
) and the number of features (D
) in each omics component of the training and test sets.
## Datasets:
## ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,] 53 6 566 1454 926 2520 192
## [2,] 108 108 108 108 108 108 108
## Outcomes:
## patient_id vital_status days_to_death days_to_last_followup Days
## a3i8 : 1 alive:66 180 : 2 383 : 3 Min. : 11.0
## a3ya : 1 dead :42 112 : 1 366 : 2 1st Qu.: 231.8
## a3yc : 1 118 : 1 375 : 2 Median : 392.0
## a43c : 1 128 : 1 401 : 2 Mean : 502.2
## a43e : 1 131 : 1 1007 : 1 3rd Qu.: 684.2
## a43i : 1 (Other):36 (Other):56 Max. :1837.0
## (Other):102 NA's :66 NA's :42
## Datasets:
## ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,] 53 6 566 1454 926 2520 192
## [2,] 77 77 77 77 77 77 77
## Outcomes:
## patient_id vital_status days_to_death days_to_last_followup Days
## a3ql : 1 alive:42 104 : 1 375 : 2 Min. : 4.0
## a3y9 : 1 dead :35 1263 : 1 408 : 2 1st Qu.: 234.0
## a3yb : 1 136 : 1 1071 : 1 Median : 403.0
## a43h : 1 1402 : 1 1094 : 1 Mean : 590.4
## a43m : 1 1405 : 1 1168 : 1 3rd Qu.: 650.0
## a49l : 1 (Other):30 (Other):35 Max. :3714.0
## (Other):71 NA's :42 NA's :35
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1,2))
plot(trainD, main = "Train")
plot(testD, main = "Test")
The first step of the plasma
algorithm is to fit PLS Cox models on each omics data set using the function fitCoxModels
. The returned object of class MultiplePLSCoxModels
contains a list of SingleModel
objects, one for each assay, and within each there are three regression models:
plsRcoxmodel
contains the coefficients of the components learned by PLS Cox regression. The number of components is determined automatically as a function of the logarithm of the number of features in the omics data set. The output of this model is a continuous prediction of “risk” for the time-to-event outcome of interest.riskModel
is a coxph
model using continuous predicted risk as a single predictor.splitModel
is a coxph
model using a binary split of the risk (at the median) as the predictor.suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days",
eventvar = "vital_status", eventvalue = "dead") )
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
##
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
##
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
##
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
##
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
##
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
##
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## An object containing MultiplePLSCoxModels based on:
## [1] "ClinicalBin" "ClinicalCont" "MAF" "Meth450" "miRSeq" "mRNASeq"
## [7] "RPPA"
On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).
The second step of the algorithm is to extend the individual omics-based models across other omics data sets. This step is performed using the plasma
function, which takes in the previously created objects of class multiOmics
and MultiplePLSCoxModels
. The function operates iteratively, so in our case there are seven different sets of predictions of the PLS components. These different predictions are averaged and saved internally as a data frame called meanPredictions
. The structure of models created and stored in the plasma
object is the same as for the separate, individual, omics models. Figure 4 shows the Kaplan-Meier plot using the predicted risk, split at the median value, on the training data set.
Now we want to see how well the final composite model generalizes to our test set. Figure 5 uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0324) separation of outcomes between the high and low risk patients.
At this point, our model appears to be a fairly complex black box. We have constructed a matrix of components, based on linear combinations of actual features in different omics data sets. These components can then be combined in yet another linear model that predicts the time-to-event outcome through Cox regression. In this section, we want to explore how the individual features from different omics data sets contribute to different model components.
Our first act toward opening the black box is to realize that not all of the components discovered from the individual omics data sets survived into the final composite model. Some components were eliminated because they appeared to be nearly linearly related to components found in other omics data sets. So, we can examine the final composite model more closely.
## Call:
## coxph(formula = Surv(Days, vital_status == "dead") ~ MAF1 + MAF2 +
## MAF3 + Meth4502 + miRSeq2 + miRSeq3 + mRNASeq1 + RPPA1 +
## RPPA2 + RPPA3, data = riskDF)
##
## coef exp(coef) se(coef) z p
## MAF1 1.434e+00 4.195e+00 3.620e-01 3.961 7.47e-05
## MAF2 -1.921e-01 8.252e-01 1.241e-01 -1.548 0.121607
## MAF3 5.677e+00 2.920e+02 1.439e+00 3.946 7.94e-05
## Meth4502 1.178e+00 3.246e+00 4.358e-01 2.702 0.006890
## miRSeq2 2.276e+00 9.742e+00 5.443e-01 4.183 2.88e-05
## miRSeq3 9.415e+00 1.227e+04 2.746e+00 3.429 0.000607
## mRNASeq1 1.155e-01 1.122e+00 3.817e-02 3.025 0.002485
## RPPA1 6.119e-01 1.844e+00 2.346e-01 2.608 0.009099
## RPPA2 8.806e-01 2.412e+00 2.326e-01 3.786 0.000153
## RPPA3 1.614e+00 5.021e+00 3.843e-01 4.200 2.67e-05
##
## Likelihood ratio test=56.25 on 10 df, p=1.843e-08
## n= 185, number of events= 77
## [1] "MAF1" "MAF2" "MAF3" "Meth4502" "miRSeq2" "miRSeq3" "mRNASeq1" "RPPA1"
## [9] "RPPA2" "RPPA3"
We see that at least one component discovered from each of the five “true” omics data sets survived in the final model. However, neither the binary clinical nor continuous clinical data sets contributed independent components.
Our interest now turns to understanding how the features from individual omics data sets contribute to the components that are used in the final model. As mentioned earlier, these contributions are mediated through two levels of linear regression models when extending a model from data set A to data set B. A linear combination of features from set B is used to define the secondary level of components; then a linear combination of these components is used to predict the components of the single Cox model built that had been from set A. These weights can be combined and extracted using the getAllWeights
function, and can then be explored.
We use the binary clinical data set to illustrate our methods for interpreting the components.
if (requireNamespace("oompaBase")) {
HG <- oompaBase::blueyellow(64)
} else {
HG <- heat.colors(64)
}
## Loading required namespace: oompaBase
Figure 6 shows the raw weights for each clinical binary feature in all of the original omics components. Figure 7 only includes the omics components that were retained in the final composite model. Structurally, these two figures look similar. There is a central set of components where all of the weights look weak, and a central set of features that appear to have weak contributions to all components.
We hypothesize that some components intrinsically have a wider spread of weights, and that it might be more important to scale the components consistently to look at the relative contributions (Figure 8). By standardizing the weights component by component, we see a better balance of contributions per component. However, we can still identify a set of features that seem to make no contributions to any of the components. We can filter the features by only retaining those that make a large contribution to at least one omics component (Figure 9).
dbin <- cbin[, mainterms]
dbin@contrib <- scale(dbin@contrib)
heat(dbin, cexCol = 0.9, cexRow = 0.5, col = HG)
shrink <- function(dset, N) {
dset@contrib <- scale(dset@contrib)
feat <- unique(unlist(as.list(getTop(dset, N))))
dset@contrib <- dset@contrib[feat, mainterms]
dset
}
xbin <- shrink(cbin, 4)
heat(xbin, cexCol = 0.9, cexRow = 0.9, col = HG)
In Figure 9, we can identify strong contrasts between several pairs of variables. For example, one set of components is enriched with white, never smokers, who still have evidence of tumors, at stage T3 and grade 3 in the lower third of the esophagus (ICD-10 code C15.5), while another group is enriched for Asian, current smokers, who are tumor-free, with stage N0, T2 tumors from the lower third of the esophagus (ICD-10 code C15.4).
ccont <- getAllWeights(pl, "ClinicalCont")
xcont <- shrink(ccont, 2)
heat(xcont, cexCol = 0.8, cexRow = 1.2, col = HG)
The most important continuous clinical feature is the patient’s weight, which is positively correlated (yellow) with four components and negatively correlated (blue) with six others.
cmaf <- getAllWeights(pl, "MAF")
xmaf <- shrink(cmaf, 4)
heat(xmaf, cexCol = 0.9, cexRow = 0.9, col = HG)
meth <- getAllWeights(pl, "Meth450")
xmeth <- shrink(meth, 4)
heat(xmeth, cexCol = 0.9, cexRow = 0.9, col = HG)
rppa <- getAllWeights(pl, "RPPA")
xrppa <- shrink(rppa, 5)
tmp <- rownames(xrppa@contrib)
rownames(xrppa@contrib) <- sapply(strsplit(tmp, "\\."), function(x) paste(x[-1], collapse = "."))
hc <- heat(xrppa, cexCol = 0.9, cexRow = 0.9, col = HG, keep.dendro = TRUE)
mrna <- getAllWeights(pl, "mRNASeq")
xmrna <- shrink(mrna, 7)
tmp <- rownames(xmrna@contrib)
rownames(xmrna@contrib) <- sapply(strsplit(tmp, "\\."), function(x) x[1])
heat(xmrna, cexCol = 0.9, cexRow = 0.6, col = HG)
mir <- getAllWeights(pl, "miRSeq")
xmir <- shrink(mir, 5)
tmp <- strsplit(rownames(xmir@contrib), "\\.")
nc <- sapply(tmp, length)
rownames(xmir@contrib) <- sapply(1:length(tmp), function(dex)
paste(tmp[[dex]][1:(nc[dex]-1)], collapse = "."))
heat(xmir, cexCol = 0.9, cexRow = 0.9, col = HG)
We have identified a method analogous to that of MOFA
that allows us to combine different omics data without the need for prior imputation of missing values. A major difference is that while MOFA
model learns “factors” that are composites of the variables in an unsupervised fashion, the plasma
model learns “components” that are composites of the variables in an supervised fashion, using the outcomes “event” and “time-to-event” as response variables.
Although the factors from MOFA
are defined such that the efirst factor, Factor 1, accounts for the greatest variance in the model, the factors’ may or may not be significantly associated with the outcome, and a post-hoc survival analysis would need to be done to assess this. It may be the case that some factors, although they are significantly associated with outcome, account for very small variance in the final MOFA
model, which hinders interpretability. This was the case with the TCGA-ESCA dataset, in which, when 10 factors were learned from the MOFA
model, only Factor 10 was significantly associated with survival, while accounting for [number] variance in the model [CITE SUPPLEMENTARY RESULTS?]. On the other hand, the components for plasma
are created in a way that maximizes the covariance in the predictors and the response, and therefore these components will be automatically associated to some degree with the outcome. This could be advantageous in that dissecting the weights associated with the components would yield a list of variables from different omics datasets that contribute the most to defining the outcome, and any additional analyses could be refined by looking at these high-weighted variables most closely.