simulate whole pseudo-bulk RNAseq dataset — simulate

This function allows you to create a full pseudo-bulk RNAseq dataset. You need to provide a SummarizedExperiment from which the cells will be sampled for the simulation. Also a scenario has to be selected, where you can choose how the cells will be sampled and a scaling_factor on how the read counts will be transformed proir to the simulation.

Usage

simulate_bulk(
  data,
  scenario = c("even", "random", "mirror_db", "weighted", "pure", "custom"),
  scaling_factor = c("NONE", "census", "spike_in", "custom", "read_number",
    "expressed_genes", "annotation_column", "epic", "abis", "quantiseq"),
  scaling_factor_single_cell = TRUE,
  weighted_cell_type = NULL,
  weighted_amount = NULL,
  pure_cell_type = NULL,
  custom_scenario_data = NULL,
  custom_scaling_vector = NULL,
  balance_even_mirror_scenario = 0.01,
  remove_bias_in_counts = TRUE,
  remove_bias_in_counts_method = "read-number",
  norm_counts = FALSE,
  nsamples = 100,
  ncells = 1000,
  total_read_counts = NULL,
  whitelist = NULL,
  blacklist = NULL,
  BPPARAM = BiocParallel::bpparam(),
  run_parallel = FALSE
)

Arguments

data: (mandatory) SummarizedExperiment object
scenario: (mandatory) select on of the pre-defined cell-type fraction scenarios; possible are: even,random,mirror_db,pure,weighted; you can also use the custom scenario, where you need to set the custom_scenario_data parameter.
scaling_factor: (mandatory) name of scaling factor; possible are: census, spike_in, read_number, expressed_genes, custom, epic, abis, quantiseq or NONE for no scaling factor
scaling_factor_single_cell: boolean: decide if a scaling value for each single cell is calculated (default) or the median of all scaling values for each cell type is calculated
weighted_cell_type: name of cell-type used for weighted scenario
weighted_amount: fraction of cell-type used for weighted scenario; must be between 0 and 0.99
pure_cell_type: name of cell-type for pure scenario
custom_scenario_data: dataframe; needs to be of size nsamples x number_of_cell_types, where each sample is a row and each entry is the cell-type fraction. Rows need to sum up to 1.
custom_scaling_vector: named vector with custom scaling values for cell-types. Cell-types that do not occur in this vector but are present in the dataset will be set to 1; mandatory for custom scaling factor
balance_even_mirror_scenario: balancing value for the uniform and mirror_db scenarios: increasing it will result in more diverse simulated fractions. To get the same fractions in each sample, set to 0. Default is 0.01.
remove_bias_in_counts: boolean; if TRUE (default) the internal mRNA bias that is present in count data will be removed using the number of reads mapped to each cell
remove_bias_in_counts_method: 'read-number' (default) or 'gene-number'; method with which the mRNA bias in counts will be removed
norm_counts: boolean; if TRUE the samples simulated with counts will be normalized to CPMs, default is FALSE
nsamples: numeric; number of samples in pseudo-bulk RNAseq dataset (default = 100)
ncells: numeric; number of cells in each dataset (default = 1000)
total_read_counts: numeric; sets the total read count value for each sample
whitelist: list; give a list of cell-types you want to keep for the simulation; if NULL, all are used
blacklist: list; give a list of cell-types you want to remove for the simulation; if NULL, all are used; is applied after whitelist
BPPARAM: BiocParallel::bpparam() by default; if specific number of threads x want to be used, insert: BiocParallel::MulticoreParam(workers = x)
run_parallel: boolean, decide if multi-threaded calculation will be run. FALSE by default

Value

named list; bulk a SummarizedExperiment object, where the assays store the simulated bulk RNAseq datasets. Can hold either one or two assays, depending on how many matrices were present in the dataset cell-fractions is a dataframe with the simulated cell-fractions per sample; scaling_vector scaling value for each cell in dataset

Examples

#generate sample single-cell data to work with:

counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::t(1e6*Matrix::t(tpm)/Matrix::colSums(tpm))

colnames(counts) <- paste0("cell_",rep(1:300))
colnames(tpm) <- paste0("cell_",rep(1:300))
rownames(counts) <- paste0("gene_",rep(1:1000))
rownames(tpm) <- paste0("gene_",rep(1:1000))

annotation <- data.frame("ID"=paste0("cell_",rep(1:300)),
                         "cell_type"=c(rep("T cells CD4",50),
                                       rep("T cells CD8",50),
                                       rep("Macrophages",100),
                                       rep("NK cells",10),
                                       rep("B cells",70),
                                       rep("Monocytes",20)))

dataset <- SimBu::dataset(annotation = annotation,
                          count_matrix = counts,
                          tpm_matrix = tpm,
                          name = "test_dataset")
#> Filtering genes...
#> Created dataset.

# this creates a basic pseudo-bulk dataset with uniform cell-type distribution
# and no additional transformation of the data with 10 samples and 2000 cells each

s<-SimBu::simulate_bulk(dataset,
              scenario="even",
              scaling_factor="NONE",
              nsamples=10,
              ncells=100)
#> Finished simulation.

# use a blacklist to exclude certain cell-types for the simulation
s<-SimBu::simulate_bulk(dataset,
              scenario="even",
              scaling_factor="NONE",
              nsamples=10,
              ncells=2000,
              blacklist=c("Monocytes","Macrophages"))
#> Finished simulation.


# use the pure scenario to only have B cells
s<-SimBu::simulate_bulk(dataset,
              scenario="pure",
              scaling_factor="NONE",
              nsamples=10,
              ncells=100,
              pure_cell_type="B cells")
#> Finished simulation.

# simulate a dataset with custom cell-type fraction for each of the 3 samples
fractions <- data.frame("B cells"=c(0.2,0.4,0.2),
                        "T cells CD4"=c(0.4,0.2,0.1),
                        "Macrophages"=c(0.4,0.4,0.7), check.names = FALSE)
s<-SimBu::simulate_bulk(dataset,
              scenario="custom",
              scaling_factor="NONE",
              nsamples=3,
              ncells=2000,
              custom_scenario_data=fractions)
#> Finished simulation.