Introducing mRNA bias into simulations with scaling factors
Alexander Dietrich
11/29/2021
Source:vignettes/simulator_scaling_factors.Rmd
simulator_scaling_factors.Rmd
Using scaling factors
This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell-types contain different amounts of mRNA (Dendritic cells for examples contain much more than Neutrophils); this bias can be added into simulations artificially in different ways.
The scaling factors will always be applied on the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells.
#Example data
counts <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::t(1e6*Matrix::t(tpm)/Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_",rep(1:300))
colnames(tpm) <- paste0("cell_",rep(1:300))
rownames(counts) <- paste0("gene_",rep(1:1000))
rownames(tpm) <- paste0("gene_",rep(1:1000))
annotation <- data.frame("ID"=paste0("cell_",rep(1:300)),
"cell_type" = c(rep("T cells CD4",150), rep("T cells CD8",150)),
'spikes' = runif(300),
'add_1' = runif(300),
'add_2' = runif(300))
ds <- SimBu::dataset(annotation = annotation,
count_matrix = counts,
name = "test_dataset")
#> Filtering genes...
#> Created dataset.
Pre-defined scaling factors
Some studies have proposed scaling factors for immune cells, such as EPIC (Racle et al. 2017) or quanTIseq (Finotello et al. 2019), deconvolution tools which correct for the mRNA bias internally using these values:
epic <- data.frame(
type = c("B cells",
"Macrophages",
"Monocytes",
"Neutrophils",
"NK cells",
"T cells",
"T cells CD4",
"T cells CD8",
"T helper cells",
"T regulatory cells",
"otherCells",
"default"),
mRNA = c(0.4016,
1.4196,
1.4196,
0.1300,
0.4396,
0.3952,
0.3952,
0.3952,
0.3952,
0.3952,
0.4000,
0.4000)
)
epic
#> type mRNA
#> 1 B cells 0.4016
#> 2 Macrophages 1.4196
#> 3 Monocytes 1.4196
#> 4 Neutrophils 0.1300
#> 5 NK cells 0.4396
#> 6 T cells 0.3952
#> 7 T cells CD4 0.3952
#> 8 T cells CD8 0.3952
#> 9 T helper cells 0.3952
#> 10 T regulatory cells 0.3952
#> 11 otherCells 0.4000
#> 12 default 0.4000
quantiseq <- data.frame(
type = c("B cells",
"Macrophages",
"MacrophagesM2",
"Monocytes",
"Neutrophils",
"NK cells",
"T cells CD4",
"T cells CD8",
"T regulatory cells",
"Dendritic cells",
"T cells"),
mRNA = c( 65.66148,
138.11520,
119.35447,
130.65455,
27.73634,
117.71584,
63.87200,
70.25659,
72.55110,
140.76091,
68.89323)
)
quantiseq
#> type mRNA
#> 1 B cells 65.66148
#> 2 Macrophages 138.11520
#> 3 MacrophagesM2 119.35447
#> 4 Monocytes 130.65455
#> 5 Neutrophils 27.73634
#> 6 NK cells 117.71584
#> 7 T cells CD4 63.87200
#> 8 T cells CD8 70.25659
#> 9 T regulatory cells 72.55110
#> 10 Dendritic cells 140.76091
#> 11 T cells 68.89323
When you want to apply one of these scaling factors into your simulation (therefore in-/decreasing the expression signals for the cell-types), we can use the scaling_factor
parameter. Note, that these pre-defined scaling factors only offer values for a certain number of cell types, and your annotation in the provided dataset has to match these names 1:1. All cell types from your dataset which are not present in this scaling factor remain unscaled and a warning message will appear.
sim_epic <- SimBu::simulate_bulk(data=ds,
scenario = "random",
scaling_factor="epic",
nsamples = 10,
ncells=100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
#> Using parallel generation of simulations.
#> Using EPIC scaling factors.
#> Finished simulation.
We can also try out some custom scaling factors, for example increasing the expression levels for a single cell-type (T cells CD8) by 10-fold compared to the rest. All cell-types which are not mentioned in the named list given to custom_scaling_vector
will be transformed with a scaling factor of 1, meaning nothing changes for them.
sim_extreme_b <- SimBu::simulate_bulk(data = ds,
scenario = "random",
scaling_factor = "custom",
custom_scaling_vector = c("T cells CD8"=10),
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
#> Using parallel generation of simulations.
#> Using custom scaling factors.
#> Warning in merge_scaling_factor(data = data, scaling_factor_values = custom_scaling_vector, : For some cell type(s) in the dataset, no
#> scaling factor is available when using custom: T cells CD4. This cell type will not be re-scaled.
#> Finished simulation.
Important: Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type.
Dataset specific scaling factors
You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive.
Reads and genes
Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation.
sim_reads <- SimBu::simulate_bulk(data=ds,
scenario = "random",
scaling_factor="read_number", # use number of reads as scaling factor
nsamples = 10,
ncells=100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
#> Using parallel generation of simulations.
#> Finished simulation.
sim_genes <- SimBu::simulate_bulk(data = ds,
scenario = "random",
scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
#> Using parallel generation of simulations.
#> Finished simulation.
These options would also allow you to use other numerical measurements you have for the single cells as scaling factors, such as weight or size for example. Lets pretend, add_1
and add_2
are such measurements. With the additional_cols
parameter, they can be added to the SimBu dataset and we can use them as scaling factor as well:
head(annotation)
#> ID cell_type spikes add_1 add_2
#> 1 cell_1 T cells CD4 0.09136481 0.4811858 0.435582586
#> 2 cell_2 T cells CD4 0.61761242 0.6596912 0.358060187
#> 3 cell_3 T cells CD4 0.28130156 0.1950135 0.350385112
#> 4 cell_4 T cells CD4 0.44271955 0.6874199 0.008552276
#> 5 cell_5 T cells CD4 0.69949660 0.2325767 0.595156309
#> 6 cell_6 T cells CD4 0.88835517 0.3133121 0.328329093
ds <- SimBu::dataset(annotation = annotation,
count_matrix = counts,
name = "test_dataset",
additional_cols = c('add_1','add_2')) # add columns to dataset
#> Filtering genes...
#> Created dataset.
sim_genes <- SimBu::simulate_bulk(data = ds,
scenario = "random",
scaling_factor = "add_1", # use add_1 column as scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
#> Using parallel generation of simulations.
#> Scaling by add_1-column in annotation table; if no scaling is wished instead, use 'NONE'.
#> Finished simulation.
Spike-ins
One other numerical measurement can be spike-ins. Usually the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, they can be stored in the dataset annotation using the spike_in_col
parameter, where you indicate the name of the column from the annotation
dataframe in which the spike-in information is stored. To calculate a scaling factor from this, the number of reads are also necessary, so we will add this information as well (as above using the read_number_col
parameter).
The scaling factor with spike-ins is calculated as the “% of reads NOT mapped to spike-in reads,” or: (n_reads - n_spike_in)/n_reads
for each cell. We apply it like this:
ds <- SimBu::dataset(annotation = annotation,
count_matrix = counts,
name = "test_dataset",
spike_in_col = 'spikes') # give the name in the annotation file, that contains spike-in information
sim_spike <- SimBu::simulate_bulk(data = ds,
scenario = "random",
scaling_factor = "spike_in", # use spike-in scaling factor
nsamples = 10,
ncells = 100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
Census - estimate mRNA counts per cell
Census is an approach which tries to convert TPM counts into relative transcript counts. This basically means, you get the mRNA counts per cell, which can differ between cell-types.
(Qiu et al. 2017) state in their paper, that it should only be applied to TPM/FPKM normalized data, but I tried it out with raw expression counts as well, which worked as well.
Census calculates a vector with a scaling value for each cell in a sample. You can switch this feature on, by setting the scaling_factor
parameter to census
.
ds <- SimBu::dataset(annotation = annotation,
count_matrix = counts,
tpm_matrix = tpm,
name = "test_dataset")
#> Filtering genes...
#> Created dataset.
sim_census <- SimBu::simulate_bulk(data = ds,
scenario = "random",
scaling_factor = "census", # use census scaling factor
nsamples = 10,
ncells=100,
BPPARAM = BiocParallel::MulticoreParam(workers = 4),
run_parallel = TRUE)
#> Using parallel generation of simulations.
#> Finished simulation.
In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor.
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.6 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] curl_4.3.2 SeuratObject_4.0.4 Seurat_4.1.0 SimBu_0.99.0
#> [5] SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
#> [9] IRanges_2.26.0 S4Vectors_0.30.2 BiocGenerics_0.38.0 MatrixGenerics_1.4.3
#> [13] matrixStats_0.61.0 Matrix_1.3-4
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.2 reticulate_1.24 RUnit_0.4.32 tidyselect_1.1.2 htmlwidgets_1.5.4
#> [6] grid_4.1.0 BiocParallel_1.26.2 Rtsne_0.15 munsell_0.5.0 codetools_0.2-18
#> [11] ica_1.0-2 future_1.24.0 miniUI_0.1.1.1 withr_2.5.0 spatstat.random_2.1-0
#> [16] colorspace_2.0-3 filelock_1.0.2 highr_0.9 knitr_1.33 rstudioapi_0.13
#> [21] ROCR_1.0-11 tensor_1.5 listenv_0.8.0 labeling_0.4.2 optparse_1.7.1
#> [26] GenomeInfoDbData_1.2.6 polyclip_1.10-0 farver_2.1.0 rprojroot_2.0.2 basilisk_1.4.0
#> [31] parallelly_1.30.0 vctrs_0.3.8 generics_0.1.2 xfun_0.30 R6_2.5.1
#> [36] bitops_1.0-7 spatstat.utils_2.3-0 cachem_1.0.6 DelayedArray_0.18.0 assertthat_0.2.1
#> [41] promises_1.2.0.1 scales_1.1.1 gtable_0.3.0 downlit_0.4.0 biocViews_1.60.0
#> [46] globals_0.14.0 processx_3.5.3 goftest_1.2-3 rlang_1.0.2 splines_4.1.0
#> [51] lazyeval_0.2.2 spatstat.geom_2.3-2 BiocManager_1.30.16 yaml_2.3.5 reshape2_1.4.4
#> [56] abind_1.4-5 httpuv_1.6.5 RBGL_1.68.0 tools_4.1.0 ggplot2_3.3.5
#> [61] ellipsis_0.3.2 spatstat.core_2.4-0 RColorBrewer_1.1-2 ggridges_0.5.3 Rcpp_1.0.8.3
#> [66] plyr_1.8.7 sparseMatrixStats_1.4.2 zlibbioc_1.38.0 purrr_0.3.4 RCurl_1.98-1.6
#> [71] basilisk.utils_1.4.0 ps_1.6.0 rpart_4.1-15 deldir_1.0-6 pbapply_1.5-0
#> [76] cowplot_1.1.1 zoo_1.8-9 ggrepel_0.9.1 cluster_2.1.2 fs_1.5.2
#> [81] here_1.0.1 magrittr_2.0.1 data.table_1.14.2 scattermore_0.8 lmtest_0.9-40
#> [86] RANN_2.6.1 fitdistrplus_1.1-8 patchwork_1.1.1 mime_0.10 evaluate_0.14
#> [91] xtable_1.8-4 XML_4.0-0 gridExtra_2.3 testthat_3.1.2 compiler_4.1.0
#> [96] tibble_3.1.6 KernSmooth_2.23-20 crayon_1.5.1 htmltools_0.5.2 proxyC_0.2.4
#> [101] mgcv_1.8-36 later_1.3.0 tidyr_1.2.0 RcppParallel_5.1.5 DBI_1.1.2
#> [106] MASS_7.3-54 getopt_1.20.3 brio_1.1.3 cli_3.2.0 igraph_1.2.11
#> [111] pkgconfig_2.0.3 pkgdown_2.0.2 dir.expiry_1.0.0 plotly_4.10.0 spatstat.sparse_2.1-0
#> [116] xml2_1.3.3 stringdist_0.9.8 XVector_0.32.0 BiocCheck_1.28.0 stringr_1.4.0
#> [121] callr_3.7.0 digest_0.6.27 sctransform_0.3.3 RcppAnnoy_0.0.19 graph_1.70.0
#> [126] spatstat.data_2.1-2 rmarkdown_2.13 leiden_0.3.9 uwot_0.1.11 shiny_1.7.1
#> [131] commonmark_1.8.0 lifecycle_1.0.1 nlme_3.1-152 jsonlite_1.7.2 desc_1.4.1
#> [136] viridisLite_0.4.0 fansi_1.0.3 pillar_1.7.0 lattice_0.20-44 fastmap_1.1.0
#> [141] httr_1.4.2 survival_3.2-11 glue_1.6.2 png_0.1-7 stringi_1.6.1
#> [146] memoise_2.0.1 dplyr_1.0.8 irlba_2.3.5 future.apply_1.8.1