plasma: Partial LeAst Squares for Multi-omics Analysis

Kevin R. Coombes

Introduction

Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find multiple review articles summarizing progress in the field [Subramanian-2020; Graw-2021; Heo-2021; Picard-2021; Reel-2021; Vlachavas-2021; Adossa-2021]. As with other applications of machine learning, the problems addressed by these algorithms fall into two categories: unsupervised (e.g., clustering or class discovery) and supervised (including class comparison and class prediction). Advances in unsupervised learning have been broader and deeper than advances in supervised learning [CITE].

One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [Argelaguet-2018; Argelaguet-2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.

MOFA starts with a standard method, Latent Factor Analysis, that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that it is approximately equivalent to a two-step process:

  1. Use principal components analysis to identify initial latent factors in each individual omics data set.
  2. For each pair of omics data sets, use overlapping samples to train and extend models of each factor to the union of assayed samples (see the sketch below).
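
To make this interpretation concrete, here is a rough sketch in R of the two-step reading; it is not the actual MOFA algorithm. Here omicsA and omicsB are hypothetical matrices (features in rows, samples in columns) whose sample sets only partly overlap.

## Step 1: initial latent factors within each data set (here via PCA)
facA <- prcomp(t(omicsA), rank. = 2)$x     # samples of A  x  factors
facB <- prcomp(t(omicsB), rank. = 2)$x     # samples of B  x  factors
## Step 2: on the shared samples, learn to express set A's factors in terms
## of set B's factors, then apply that linear map to every sample assayed on B
shared <- intersect(rownames(facA), rownames(facB))
fit <- lm(facA[shared, ] ~ facB[shared, ])
facAonB <- cbind(1, facB) %*% coef(fit)    # set A's factors extended across B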

That interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call “plasma”, to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).

Methods

Our computational method is implemented and the data are available in the plasma package.

suppressWarnings( library(plasma) )
packageVersion("plasma")
## [1] '0.5.12'

Data

We downloaded the entire esophageal cancer Level 3 data set from the TCGA on 6 August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.

loadESCAdata()
sapply(assemble, dim)
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         185          185 184     185    166     157  126
  1. From TCGA, we obtained 162 columns of clinical, demographic, and laboratory data on 185 patients. We removed any columns that always contained the same value. We also removed any columns whose values were missing in more than 25% of the patients. We converted categorical variables into sets of binary variables using one-hot-encoding. We then separated the clinical data into three parts:
    1. Outcome (overall survival)
    2. Binary covariates (53 columns)
    3. Continuous covariates (6 columns)
  2. Exome sequencing data for 184 patients with esophageal cancer was obtained as Mutation Annotation Format (MAF) files. We removed any gene that was mutated in fewer than 3% of the samples. The resulting data set contained 566 mutated genes.
  3. Methylation data for 185 ESCA patients was obtained as beta values computed by the TCGA from Illumina Methylation 450K microarrays. We removed any CpG site for which the standard deviation of the beta values was less than 0.3. The resulting data set contained 1,454 highly variable CpG’s.
  4. Already normalized sequencing data on 2,566 microRNAs (miRs) was obtained for 185 patients. We removed any miR for which the standard deviation of normalized expression was less than 0.05, which left 926 miRs in the final data set.
  5. Already normalized sequencing data on 20,531 mRNAs was obtained for 184 patients. We removed any mRNA whose mean normalized expression was less than 6 or whose standard deviation was less than 1.2. The final data set included 2,520 mRNAs.
  6. Normalized expression data from reverse phase protein arrays (RPPA) was obtained from antibodies targeting 192 proteins in 126 patients. All data were retained for further analysis.

Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients).
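
The feature filters described above are simple frequency or variability thresholds. As an illustration, the methylation filter amounts to something like the sketch below, where betaValues is a hypothetical matrix of beta values with CpG sites in rows and patients in columns.

## Keep only the CpG sites whose beta values vary appreciably across patients
## (standard deviation of at least 0.3), as described in item 3 above
sds <- apply(betaValues, 1, sd, na.rm = TRUE)
meth450 <- betaValues[sds >= 0.3, ]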

Here is a summary of the outcome data.

summary(Outcome)
##    patient_id  vital_status days_to_death days_to_last_followup      Days       
##  a3i8   :  1   alive:108    180    :  2   375    : 4            Min.   :   4.0  
##  a3ql   :  1   dead : 77    283    :  2   383    : 3            1st Qu.: 232.0  
##  a3y9   :  1                104    :  1   366    : 2            Median : 400.0  
##  a3ya   :  1                112    :  1   391    : 2            Mean   : 538.9  
##  a3yb   :  1                118    :  1   401    : 2            3rd Qu.: 681.0  
##  a3yc   :  1                (Other): 70   (Other):95            Max.   :3714.0  
##  (Other):179                NA's   :108   NA's   :77

Computational Approach

The plasma algorithm is based on Partial Least Squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [Bastien-2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (plsRcox Version 1.7.6 [Bertand-2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (pls Version 2.8.0 [Liland-2021]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends (shown in pastel red, green, and blue, resp.) each of the original models, in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to get a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.

SF <- system.file("Figure/methods.png", package = "plasma")
knitr::include_graphics(SF)

Figure 1 : Workflow schematic for plasma algorithm with three omics data sets.

rm(SF)
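
To make the second step of the workflow concrete, the sketch below (which is not the internal plasma code) uses the pls package directly to "re-interpret" the Cox components of one omics data set in terms of another. Here compA is a hypothetical matrix of PLS Cox component scores from omics data set A (samples in rows) and omicsB is a hypothetical matrix from data set B (features in rows, samples in columns); the choice of ncomp = 3 is arbitrary.

library(pls)
## samples assayed on both platforms
shared <- intersect(rownames(compA), colnames(omicsB))
## PLS linear regression from the features of B to the Cox components of A,
## trained only on the overlapping samples
train <- data.frame(Y = I(compA[shared, , drop = FALSE]),
                    X = I(t(omicsB[, shared, drop = FALSE])))
extender <- plsr(Y ~ X, ncomp = 3, data = train)
## extend ("re-interpret") A's components for every sample assayed on B
extendedA <- predict(extender, ncomp = 3,
                     newdata = data.frame(X = I(t(omicsB))))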

Terminology

Because of the layered nature of the plasma algorithm, we use the following terminology to clarify the later discussion.

  1. The input data contains a list of omics data sets.
  2. Each omics data set contains measurements of multiple features.
  3. The first step in the algorithm uses PLS Cox regression to find a set of components. Each component is a linear combination of features. The components are used as predictors in a Cox proportional hazards model, which predicts the log hazard ratio as a linear combination of components (see the sketch after this list).
  4. The second step in the algorithm creates a secondary layer of components. We do not give these components a separate name; they are not of particular interest in themselves. We view them as a way to extend the first-level components to more samples by “re-interpreting” them in other omics data sets.
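
This layered structure can be summarized in a short sketch, assuming a hypothetical feature matrix X (features in rows, samples in columns), a weight matrix W (one column of feature weights per component), and a vector beta of Cox regression coefficients, one per component.

## Each component is a linear combination of features ...
components <- t(X) %*% W            # samples x components
## ... and the Cox model predicts the log hazard ratio as a linear
## combination of the components
logHR <- drop(components %*% beta)  # one linear predictor per sample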

Preparing the Data

To be consistent with the MOFA2 R package, all of the data sets are arranged so that patient samples are columns and assay features are rows. Our first task is to pad each data set with appropriate NA’s to ensure that each set includes the same patient samples in the same order, where that order matches the outcome data frame.

MO <- prepareMultiOmics(assemble, Outcome)
summary(MO)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         185          185 185     185    185     185  185
## Outcomes:
##    patient_id  vital_status days_to_death days_to_last_followup      Days       
##  a3i8   :  1   alive:108    180    :  2   375    : 4            Min.   :   4.0  
##  a3ql   :  1   dead : 77    283    :  2   383    : 3            1st Qu.: 232.0  
##  a3y9   :  1                104    :  1   366    : 2            Median : 400.0  
##  a3ya   :  1                112    :  1   391    : 2            Mean   : 538.9  
##  a3yb   :  1                118    :  1   401    : 2            3rd Qu.: 681.0  
##  a3yc   :  1                (Other): 70   (Other):95            Max.   :3714.0  
##  (Other):179                NA's   :108   NA's   :77

We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.

Split Into Training and Test

As indicated above, we want to separate the data set into training and test samples. We randomly assign roughly 60% of the patients to the training set and the remaining 40% to the test set.

set.seed(54321)
#set.seed(34672)
splitVec <- rbinom(nrow(Outcome), 1, 0.6)

Figure 2 presents a graphical overview of the number of samples (N) and the number of features (D) in each omics component of the training and test sets.

trainD <- MO[, splitVec == 1]
summary(trainD)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]         108          108 108     108    108     108  108
## Outcomes:
##    patient_id  vital_status days_to_death days_to_last_followup      Days       
##  a3i8   :  1   alive:66     180    : 2    383    : 3            Min.   :  11.0  
##  a3ya   :  1   dead :42     112    : 1    366    : 2            1st Qu.: 231.8  
##  a3yc   :  1                118    : 1    375    : 2            Median : 392.0  
##  a43c   :  1                128    : 1    401    : 2            Mean   : 502.2  
##  a43e   :  1                131    : 1    1007   : 1            3rd Qu.: 684.2  
##  a43i   :  1                (Other):36    (Other):56            Max.   :1837.0  
##  (Other):102                NA's   :66    NA's   :42
testD <- MO[, splitVec == 0]
summary(testD)
## Datasets:
##      ClinicalBin ClinicalCont MAF Meth450 miRSeq mRNASeq RPPA
## [1,]          53            6 566    1454    926    2520  192
## [2,]          77           77  77      77     77      77   77
## Outcomes:
##    patient_id vital_status days_to_death days_to_last_followup      Days       
##  a3ql   : 1   alive:42     104    : 1    375    : 2            Min.   :   4.0  
##  a3y9   : 1   dead :35     1263   : 1    408    : 2            1st Qu.: 234.0  
##  a3yb   : 1                136    : 1    1071   : 1            Median : 403.0  
##  a43h   : 1                1402   : 1    1094   : 1            Mean   : 590.4  
##  a43m   : 1                1405   : 1    1168   : 1            3rd Qu.: 650.0  
##  a49l   : 1                (Other):30    (Other):35            Max.   :3714.0  
##  (Other):71                NA's   :42    NA's   :35
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1,2))
plot(trainD, main = "Train")
plot(testD, main = "Test")

Figure 2 : Overview of training and test data.

par(opar)

Results

Individual PLS Cox Regression Models

The first step of the plasma algorithm is to fit PLS Cox models on each omics data set using the function fitCoxModels. The returned object of class MultiplePLSCoxModels contains a list of SingleModel objects, one for each assay; each of these in turn contains three regression models:

suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days",
                          eventvar = "vital_status", eventvalue = "dead") )
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____There are some NAs in X but not in Y____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X with NA in X and not in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Component____ 4 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
## 
## ____************************************************____
## ____Component____ 1 ____
## ____Component____ 2 ____
## ____Component____ 3 ____
## ____Predicting X without NA neither in X nor in Y____
## ****________________________________________________****
summary(firstPass)
## An object containing MultiplePLSCoxModels based on:
## [1] "ClinicalBin"  "ClinicalCont" "MAF"          "Meth450"      "miRSeq"       "mRNASeq"     
## [7] "RPPA"
plot(firstPass, legloc = "bottomleft")

Figure 3 : Kaplan-Meier plots of overall survival on the training set from separate PLS Cox omics models

On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).

Extend Model Components Across Omics Data Sets

The second step of the algorithm is to extend the individual omics-based models across the other omics data sets. This step is performed by the plasma function, which takes as input the previously created objects of class multiOmics and MultiplePLSCoxModels. The function operates iteratively, so in our case there are seven different sets of predictions of the PLS components. These predictions are averaged and saved internally as a data frame called meanPredictions. The structure of the models created and stored in the plasma object is the same as for the separate individual omics models. Figure 4 shows the Kaplan-Meier plot using the predicted risk, split at the median value, on the training data set.

pl <- plasma(MO, firstPass)
plot(pl, legloc = "topright", main = "Training Data")

Figure 4 : Kaplan-Meier plot of overall survival on the training set using the unified plasma Cox model.

Independent Test Set

Now we want to see how well the final composite model generalizes to our test set. Figure 5 uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0324) separation of outcomes between the high and low risk patients.

testpred <- predict(pl, testD)
plot(testpred, main="Testing", xlab = "Time (Days)")

Figure 5 : Kaplan-Meier plot of overall survival on the test set.
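
A p-value of this kind is typically obtained from a log-rank comparison of the two risk groups. For readers who want to reproduce such a check outside the package, here is a minimal sketch using the survival package; riskTrain and riskTest are hypothetical numeric vectors of predicted risks, and testOutcome is a hypothetical data frame holding the outcome columns for the test samples.

library(survival)
## split the test samples at the median predicted risk of the training set
testOutcome$riskGroup <- ifelse(riskTest > median(riskTrain, na.rm = TRUE),
                                "high", "low")
## log-rank test for a difference in overall survival between the two groups
survdiff(Surv(Days, vital_status == "dead") ~ riskGroup, data = testOutcome)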

Interpretation

At this point, our model appears to be a fairly complex black box. We have constructed a matrix of components, based on linear combinations of actual features in different omics data sets. These components can then be combined in yet another linear model that predicts the time-to-event outcome through Cox regression. In this section, we want to explore how the individual features from different omics data sets contribute to different model components.

The first step toward opening the black box is to note that not all of the components discovered from the individual omics data sets survived into the final composite model. Some components were eliminated because they appeared to be nearly linearly related to components found in other omics data sets. So, we can examine the final composite model more closely.

pl@fullModel
## Call:
## coxph(formula = Surv(Days, vital_status == "dead") ~ MAF1 + MAF2 + 
##     MAF3 + Meth4502 + miRSeq2 + miRSeq3 + mRNASeq1 + RPPA1 + 
##     RPPA2 + RPPA3, data = riskDF)
## 
##                coef  exp(coef)   se(coef)      z        p
## MAF1      1.434e+00  4.195e+00  3.620e-01  3.961 7.47e-05
## MAF2     -1.921e-01  8.252e-01  1.241e-01 -1.548 0.121607
## MAF3      5.677e+00  2.920e+02  1.439e+00  3.946 7.94e-05
## Meth4502  1.178e+00  3.246e+00  4.358e-01  2.702 0.006890
## miRSeq2   2.276e+00  9.742e+00  5.443e-01  4.183 2.88e-05
## miRSeq3   9.415e+00  1.227e+04  2.746e+00  3.429 0.000607
## mRNASeq1  1.155e-01  1.122e+00  3.817e-02  3.025 0.002485
## RPPA1     6.119e-01  1.844e+00  2.346e-01  2.608 0.009099
## RPPA2     8.806e-01  2.412e+00  2.326e-01  3.786 0.000153
## RPPA3     1.614e+00  5.021e+00  3.843e-01  4.200 2.67e-05
## 
## Likelihood ratio test=56.25  on 10 df, p=1.843e-08
## n= 185, number of events= 77
temp <- terms(pl@fullModel)
mainterms <- attr(temp, "term.labels")
rm(temp)
mainterms
##  [1] "MAF1"     "MAF2"     "MAF3"     "Meth4502" "miRSeq2"  "miRSeq3"  "mRNASeq1" "RPPA1"   
##  [9] "RPPA2"    "RPPA3"

We see that at least one component discovered from each of the five “true” omics data sets survived in the final model. However, neither the binary clinical nor continuous clinical data sets contributed independent components.

Our interest now turns to understanding how the features from individual omics data sets contribute to the components that are used in the final model. As mentioned earlier, these contributions are mediated through two levels of linear regression models when extending a model from data set A to data set B. A linear combination of features from set B is used to define the secondary level of components; then a linear combination of these components is used to predict the components of the single Cox model that had been built from set A. These weights can be combined and extracted using the getAllWeights function, and can then be explored.
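
Conceptually, collapsing the two regression layers amounts to composing two linear maps, which is roughly the kind of matrix that getAllWeights extracts. In the hypothetical sketch below, W holds the weights mapping the features of data set B to the secondary components, and V holds the weights mapping those secondary components to the Cox components learned from data set A.

## Composing the two linear layers into a single weight matrix:
##   W : features of data set B  ->  secondary components     (features x K)
##   V : secondary components    ->  Cox components of set A  (K x components)
combinedWeights <- W %*% V   # features of data set B -> Cox components of set A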

Clinical Binary Data

We use the binary clinical data set to illustrate our methods for interpreting the components.

if (requireNamespace("oompaBase")) {
  HG <- oompaBase::blueyellow(64)
} else {
  HG <- heat.colors(64)
}
## Loading required namespace: oompaBase
cbin <- getAllWeights(pl, "ClinicalBin")
heat(cbin, cexCol = 0.9, cexRow = 0.5, col = HG)

Figure 6 : Unscaled heatmap of the contributions of binary clinical features to all components.

Figure 6 shows the raw weights for each clinical binary feature in all of the original omics components. Figure 7 only includes the omics components that were retained in the final composite model. Structurally, these two figures look similar. There is a central set of components where all of the weights look weak, and a central set of features that appear to have weak contributions to all components.

heat(cbin[, mainterms], cexCol = 0.9, cexRow = 0.5, col = HG)

Figure 7 : Unscaled heatmap of the contributions of binary clinical features to important components.

We hypothesize that some components intrinsically have a wider spread of weights, and that it might be more important to scale the components consistently to look at the relative contributions (Figure 8). By standardizing the weights component by component, we see a better balance of contributions per component. However, we can still identify a set of features that seem to make no contributions to any of the components. We can filter the features by only retaining those that make a large contribution to at least one omics component (Figure 9).

dbin <- cbin[, mainterms]
dbin@contrib <- scale(dbin@contrib)
heat(dbin, cexCol = 0.9, cexRow = 0.5, col = HG)

Figure 8 : Scaled heatmap of the contributions of binary clinical features to important components.

shrink <- function(dset, N) {
  # standardize the weights within each component
  dset@contrib <- scale(dset@contrib)
  # pool the top N features reported by getTop for each component
  feat <- unique(unlist(as.list(getTop(dset, N))))
  # restrict to those features and to the components kept in the final model
  dset@contrib <- dset@contrib[feat, mainterms]
  dset
}
xbin <- shrink(cbin, 4)
heat(xbin, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 9 : Scaled heatmap of the contributions of filtered binary clinical features to important components.

In Figure 9, we can identify strong contrasts between several pairs of variables. For example, one set of components is enriched with white, never smokers, who still have evidence of tumors, at stage T3 and grade 3 in the lower third of the esophagus (ICD-10 code C15.5), while another group is enriched for Asian, current smokers, who are tumor-free, with stage N0, T2 tumors from the middle third of the esophagus (ICD-10 code C15.4).

Clinical Continuous Data

ccont <- getAllWeights(pl, "ClinicalCont")
xcont <- shrink(ccont, 2)
heat(xcont, cexCol = 0.8, cexRow = 1.2, col = HG)

Figure 10 : Scaled heatmap of the contributions of filtered continuous clinical features to important components

The most important continuous clinical feature is the patient’s weight, which is positively correlated (yellow) with four components and negatively correlated (blue) with six others.

Mutation Data

cmaf <- getAllWeights(pl, "MAF")
xmaf <- shrink(cmaf, 4)
heat(xmaf, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 11 : Scaled heatmap of the contributions of filtered mutation features to important components

Methylation Data

meth <- getAllWeights(pl, "Meth450")
xmeth <- shrink(meth, 4)
heat(xmeth, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 12 : Scaled heatmap of the contributions of filtered methylation features to important components

Proteomics Data

rppa <- getAllWeights(pl, "RPPA")
xrppa <- shrink(rppa, 5)
# shorten the row labels by dropping the first dot-separated token of each RPPA feature name
tmp <- rownames(xrppa@contrib)
rownames(xrppa@contrib) <- sapply(strsplit(tmp, "\\."), function(x) paste(x[-1], collapse = "."))
hc <- heat(xrppa, cexCol = 0.9, cexRow = 0.9, col = HG, keep.dendro = TRUE)

Figure 13 : Scaled heatmap of the contributions of filtered RPPA features to important components

mRNA-Sequencing Data

mrna <- getAllWeights(pl, "mRNASeq")
xmrna <- shrink(mrna, 7)
# keep only the first dot-separated token of each mRNA row label
tmp <- rownames(xmrna@contrib)
rownames(xmrna@contrib) <- sapply(strsplit(tmp, "\\."), function(x) x[1])
heat(xmrna, cexCol = 0.9, cexRow = 0.6, col = HG)

Figure 14 : Scaled heatmap of the contributions of filtered mRNA features to important components

miR-Sequencing Data

mir <- getAllWeights(pl, "miRSeq")
xmir <- shrink(mir, 5)
# shorten the row labels by dropping the last dot-separated token of each miR name
tmp <- strsplit(rownames(xmir@contrib), "\\.")
nc <- sapply(tmp, length)
rownames(xmir@contrib) <- sapply(1:length(tmp), function(dex) 
  paste(tmp[[dex]][1:(nc[dex]-1)], collapse = "."))
heat(xmir, cexCol = 0.9, cexRow = 0.9, col = HG)

Figure 15 : Scaled heatmap of the contributions of filtered miR features to important components

Conclusions

We have identified a method analogous to MOFA that allows us to combine different omics data sets without the need for prior imputation of missing values. A major difference is that while the MOFA model learns “factors” that are composites of the variables in an unsupervised fashion, the plasma model learns “components” that are composites of the variables in a supervised fashion, using the outcomes “event” and “time-to-event” as response variables.

Although the factors from MOFA are defined such that the first factor, Factor 1, accounts for the greatest variance in the model, the factors may or may not be significantly associated with the outcome, and a post-hoc survival analysis would be needed to assess this. It may also be the case that some factors, although significantly associated with the outcome, account for very little variance in the final MOFA model, which hinders interpretability. This was the case with the TCGA-ESCA data set: when 10 factors were learned by the MOFA model, only Factor 10 was significantly associated with survival, while accounting for [number] variance in the model [CITE SUPPLEMENTARY RESULTS?]. On the other hand, the components for plasma are created in a way that maximizes the covariance between the predictors and the response, so these components will automatically be associated, to some degree, with the outcome. This could be advantageous in that dissecting the weights associated with the components yields a list of variables from different omics data sets that contribute the most to predicting the outcome, and any additional analyses could be refined by looking at these highly weighted variables more closely.

Appendix: MOFA2

suppressWarnings( library(MOFA2) )