Inputs & Outputs
Alexander Dietrich
Source:vignettes/simulator_input_output.Rmd
simulator_input_output.Rmd
library(Seurat)
#> Attaching SeuratObject
#>
#> Attaching package: 'Seurat'
#> The following object is masked from 'package:SummarizedExperiment':
#>
#> Assays
library(curl)
#> Warning: package 'curl' was built under R version 4.1.2
#> Using libcurl 7.58.0 with GnuTLS/3.5.18
library(SimBu)
This chapter covers the different input and output options of the package in detail.
Input
The input for your simulations is always a SummarizedExperiment
object. You can create this object with different constructing functions, which will explained below. It is also possible to merge multiple datasets objects into one.
Sfaira is not covered in this vignette, but in “Public Data Integration”.
Custom data
Using existing count matrices and annotations is already covered in the “Getting started” vignette; this section will explain some minor details.
When generating a dataset with your own data, you need to provide the count_matrix
parameter of dataset()
; additionally you can provide a TPM matrix with the tpm_matrix
. This will then lead to two simulations, one based on counts and one based on TPMs. For either of them, genes are located in the rows, cells in the columns.
Additionally, an annotation table is needed, with the cell-type annotations. It needs to consist of at least out of 2 columns: ID
and cell_type
, where ID
has to be identical to the column names of the provides matrix/matrices. If not all cells appear in the annotation or matrix, the intersection of both is used to generate the dataset.
Here is some example data:
counts <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::t(1e6*Matrix::t(tpm)/Matrix::colSums(tpm))
colnames(counts) <- paste0("cell-",rep(1:300))
colnames(tpm) <- paste0("cell-",rep(1:300))
rownames(counts) <- paste0("gene-",rep(1:1000))
rownames(tpm) <- paste0("gene-",rep(1:1000))
annotation <- data.frame("ID"=paste0("cell-",rep(1:300)),
"cell_type"=c(rep("T cells CD4",50),
rep("T cells CD8",50),
rep("Macrophages",100),
rep("NK cells",10),
rep("B cells",70),
rep("Monocytes",20)),
row.names = paste0("cell-",rep(1:300)))
seurat_obj <- Seurat::CreateSeuratObject(counts = counts, assay = 'counts', meta.data = annotation)
tpm_assay <- Seurat::CreateAssayObject(counts = tpm)
seurat_obj[['tpm']] <- tpm_assay
seurat_obj
#> An object of class Seurat
#> 2000 features across 300 samples within 2 assays
#> Active assay: counts (1000 features, 0 variable features)
#> 1 other assay present: tpm
Seurat
It is possible to use a Seurat object to build a dataset; give the name of the assay containing count data in the counts
slot, the name of the column in the meta table with the unique cell IDs and the name of the column in the meta table with the cell type identifier. Additionally you may give the name of the assay containing TPM data in the counts
slot.
ds_seurat <- SimBu::dataset_seurat(seurat_obj = seurat_obj,
count_assay = "counts",
cell_id_col = 'ID',
cell_type_col = 'cell_type',
tpm_assay = 'tpm',
name = "seurat_dataset")
#> Filtering genes...
#> Created dataset.
h5ad files
It is possible to use an h5ad file directly, a file format which stores AnnData objects. As h5ad files can store cell specific information in the obs
layer, no additional annotation input for SimBu is needed.
Note: if you want both counts and tpm data as input, you will have to provide two files; the cell annotation has to match between these two files. As SimBu expects the cells to be in the columns and genes/features in the rows of the input matrix, but this is not necessarily the case for anndata objects, SimBu can handle h5ad files with cells in the obs
or var
layer. If your cells are in obs
, use cells_in_obs=TRUE
and FALSE
otherwise. This will also automatically transpose the matrix. To know, which columns in the cell annotation layer correspond to the cell identifiers and cell type labels, use the cell_id_col
and cell_type_col
parameters, respectively.
As this function uses the SimBu
python environment to read the h5ad files and extract the data, it may take some more time to initalize the conda environment at the first usage only.
# example h5ad file, where cell type info is stored in `obs` layer
h5 <- system.file('extdata', 'anndata.h5ad', package='SimBu')
ds_h5ad <- SimBu::dataset_h5ad(h5ad_file_counts = h5,
name = "h5ad_dataset",
cell_id_col = 'ID', # this will use the 'ID' column of the metadata as cell identifiers
cell_type_col = 'cell_type', # this will use the 'cell_type' column of the metadata as cell type info
cells_in_obs = TRUE) # in case your cell information is stored in the var layer, switch to FALSE
#> Filtering genes...
#> Created dataset.
Merging datasets
You are able to merge multiple datasets by using the dataset_merge
function:
ds <- SimBu::dataset(annotation = annotation,
count_matrix = counts,
tpm_matrix = tpm,
name = "test_dataset")
#> Filtering genes...
#> Created dataset.
ds_multiple <- SimBu::dataset_merge(dataset_list = list(ds_seurat, ds),
name = "ds_multiple")
#> Filtering genes...
#> Created dataset.
Output
The simulation
object contains three named entries:
-
bulk
: a SummarizedExperiment object with the pseudo-bulk dataset(s) stored in theassays
. They can be accessed like this:
simulation <- SimBu::simulate_bulk(data = ds_multiple,
scenario = "random",
scaling_factor = "NONE",
nsamples = 10, ncells = 100)
#> Finished simulation.
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_counts"]])
#> [1] 1000 10
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_tpm"]])
#> [1] 1000 10
If only the count matrix was given to the dataset initially, only the bulk_counts
assay is filled.
cell_fractions
: a table where rows represent the simulated samples and columns represent the different simulated cell-types. The entries in the table store the specific cell-type fraction per sample.scaling_vector
: a named list, with the used scaling value for each cell from the single cell dataset.
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.6 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] curl_4.3.2 SeuratObject_4.0.4 Seurat_4.1.0 SimBu_0.99.0
#> [5] SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
#> [9] IRanges_2.26.0 S4Vectors_0.30.2 BiocGenerics_0.38.0 MatrixGenerics_1.4.3
#> [13] matrixStats_0.61.0 Matrix_1.3-4
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.2 reticulate_1.24 RUnit_0.4.32 tidyselect_1.1.2 htmlwidgets_1.5.4
#> [6] grid_4.1.0 BiocParallel_1.26.2 Rtsne_0.15 munsell_0.5.0 codetools_0.2-18
#> [11] ica_1.0-2 future_1.24.0 miniUI_0.1.1.1 withr_2.5.0 spatstat.random_2.1-0
#> [16] colorspace_2.0-3 filelock_1.0.2 highr_0.9 knitr_1.33 rstudioapi_0.13
#> [21] ROCR_1.0-11 tensor_1.5 listenv_0.8.0 labeling_0.4.2 optparse_1.7.1
#> [26] GenomeInfoDbData_1.2.6 polyclip_1.10-0 farver_2.1.0 rprojroot_2.0.2 basilisk_1.4.0
#> [31] parallelly_1.30.0 vctrs_0.3.8 generics_0.1.2 xfun_0.30 R6_2.5.1
#> [36] bitops_1.0-7 spatstat.utils_2.3-0 cachem_1.0.6 DelayedArray_0.18.0 assertthat_0.2.1
#> [41] promises_1.2.0.1 scales_1.1.1 gtable_0.3.0 downlit_0.4.0 biocViews_1.60.0
#> [46] globals_0.14.0 processx_3.5.3 goftest_1.2-3 rlang_1.0.2 splines_4.1.0
#> [51] lazyeval_0.2.2 spatstat.geom_2.3-2 BiocManager_1.30.16 yaml_2.3.5 reshape2_1.4.4
#> [56] abind_1.4-5 httpuv_1.6.5 RBGL_1.68.0 tools_4.1.0 ggplot2_3.3.5
#> [61] ellipsis_0.3.2 spatstat.core_2.4-0 RColorBrewer_1.1-2 ggridges_0.5.3 Rcpp_1.0.8.3
#> [66] plyr_1.8.7 sparseMatrixStats_1.4.2 zlibbioc_1.38.0 purrr_0.3.4 RCurl_1.98-1.6
#> [71] basilisk.utils_1.4.0 ps_1.6.0 rpart_4.1-15 deldir_1.0-6 pbapply_1.5-0
#> [76] cowplot_1.1.1 zoo_1.8-9 ggrepel_0.9.1 cluster_2.1.2 fs_1.5.2
#> [81] here_1.0.1 magrittr_2.0.1 data.table_1.14.2 scattermore_0.8 lmtest_0.9-40
#> [86] RANN_2.6.1 fitdistrplus_1.1-8 patchwork_1.1.1 mime_0.10 evaluate_0.14
#> [91] xtable_1.8-4 XML_4.0-0 gridExtra_2.3 testthat_3.1.2 compiler_4.1.0
#> [96] tibble_3.1.6 KernSmooth_2.23-20 crayon_1.5.1 htmltools_0.5.2 proxyC_0.2.4
#> [101] mgcv_1.8-36 later_1.3.0 tidyr_1.2.0 RcppParallel_5.1.5 DBI_1.1.2
#> [106] MASS_7.3-54 getopt_1.20.3 brio_1.1.3 cli_3.2.0 igraph_1.2.11
#> [111] pkgconfig_2.0.3 pkgdown_2.0.2 dir.expiry_1.0.0 plotly_4.10.0 spatstat.sparse_2.1-0
#> [116] xml2_1.3.3 stringdist_0.9.8 XVector_0.32.0 BiocCheck_1.28.0 stringr_1.4.0
#> [121] callr_3.7.0 digest_0.6.27 sctransform_0.3.3 RcppAnnoy_0.0.19 graph_1.70.0
#> [126] spatstat.data_2.1-2 rmarkdown_2.13 leiden_0.3.9 uwot_0.1.11 shiny_1.7.1
#> [131] commonmark_1.8.0 lifecycle_1.0.1 nlme_3.1-152 jsonlite_1.7.2 desc_1.4.1
#> [136] viridisLite_0.4.0 fansi_1.0.3 pillar_1.7.0 lattice_0.20-44 fastmap_1.1.0
#> [141] httr_1.4.2 survival_3.2-11 glue_1.6.2 png_0.1-7 stringi_1.6.1
#> [146] memoise_2.0.1 dplyr_1.0.8 irlba_2.3.5 future.apply_1.8.1