This function allows you to create a full pseudo-bulk RNAseq dataset. You need to provide a SummarizedExperiment from which the cells
will be sampled for the simulation. Also a scenario
has to be selected, where you can choose how the cells will be sampled and a
scaling_factor
on how the read counts will be transformed proir to the simulation.
Usage
simulate_bulk(
data,
scenario = c("even", "random", "mirror_db", "weighted", "pure", "custom"),
scaling_factor = c("NONE", "census", "spike_in", "custom", "read_number",
"expressed_genes", "annotation_column", "epic", "abis", "quantiseq"),
scaling_factor_single_cell = TRUE,
weighted_cell_type = NULL,
weighted_amount = NULL,
pure_cell_type = NULL,
custom_scenario_data = NULL,
custom_scaling_vector = NULL,
balance_even_mirror_scenario = 0.01,
remove_bias_in_counts = TRUE,
remove_bias_in_counts_method = "read-number",
norm_counts = FALSE,
nsamples = 100,
ncells = 1000,
total_read_counts = NULL,
whitelist = NULL,
blacklist = NULL,
BPPARAM = BiocParallel::bpparam(),
run_parallel = FALSE
)
Arguments
- data
(mandatory) SummarizedExperiment object
- scenario
(mandatory) select on of the pre-defined cell-type fraction scenarios; possible are:
even
,random
,mirror_db
,pure
,weighted
; you can also use thecustom
scenario, where you need to set thecustom_scenario_data
parameter.- scaling_factor
(mandatory) name of scaling factor; possible are:
census
,spike_in
,read_number
,expressed_genes
,custom
,epic
,abis
,quantiseq
orNONE
for no scaling factor- scaling_factor_single_cell
boolean: decide if a scaling value for each single cell is calculated (default) or the median of all scaling values for each cell type is calculated
- weighted_cell_type
name of cell-type used for
weighted
scenario- weighted_amount
fraction of cell-type used for
weighted
scenario; must be between0
and0.99
- pure_cell_type
name of cell-type for
pure
scenario- custom_scenario_data
dataframe; needs to be of size
nsamples
x number_of_cell_types, where each sample is a row and each entry is the cell-type fraction. Rows need to sum up to 1.- custom_scaling_vector
named vector with custom scaling values for cell-types. Cell-types that do not occur in this vector but are present in the dataset will be set to 1; mandatory for
custom
scaling factor- balance_even_mirror_scenario
balancing value for the
uniform
andmirror_db
scenarios: increasing it will result in more diverse simulated fractions. To get the same fractions in each sample, set to 0. Default is 0.01.- remove_bias_in_counts
boolean; if TRUE (default) the internal mRNA bias that is present in count data will be removed using the number of reads mapped to each cell
- remove_bias_in_counts_method
'read-number' (default) or 'gene-number'; method with which the mRNA bias in counts will be removed
- norm_counts
boolean; if TRUE the samples simulated with counts will be normalized to CPMs, default is FALSE
- nsamples
numeric; number of samples in pseudo-bulk RNAseq dataset (default = 100)
- ncells
numeric; number of cells in each dataset (default = 1000)
- total_read_counts
numeric; sets the total read count value for each sample
- whitelist
list; give a list of cell-types you want to keep for the simulation; if NULL, all are used
- blacklist
list; give a list of cell-types you want to remove for the simulation; if NULL, all are used; is applied after whitelist
- BPPARAM
BiocParallel::bpparam() by default; if specific number of threads x want to be used, insert: BiocParallel::MulticoreParam(workers = x)
- run_parallel
boolean, decide if multi-threaded calculation will be run. FALSE by default
Value
named list; bulk
a SummarizedExperiment object, where the assays store the simulated bulk RNAseq datasets. Can hold either one or two assays, depending on how many matrices were present in the dataset
cell-fractions
is a dataframe with the simulated cell-fractions per sample;
scaling_vector
scaling value for each cell in dataset
Examples
#generate sample single-cell data to work with:
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol=300), sparse = TRUE)
tpm <- Matrix::t(1e6*Matrix::t(tpm)/Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_",rep(1:300))
colnames(tpm) <- paste0("cell_",rep(1:300))
rownames(counts) <- paste0("gene_",rep(1:1000))
rownames(tpm) <- paste0("gene_",rep(1:1000))
annotation <- data.frame("ID"=paste0("cell_",rep(1:300)),
"cell_type"=c(rep("T cells CD4",50),
rep("T cells CD8",50),
rep("Macrophages",100),
rep("NK cells",10),
rep("B cells",70),
rep("Monocytes",20)))
dataset <- SimBu::dataset(annotation = annotation,
count_matrix = counts,
tpm_matrix = tpm,
name = "test_dataset")
#> Filtering genes...
#> Created dataset.
# this creates a basic pseudo-bulk dataset with uniform cell-type distribution
# and no additional transformation of the data with 10 samples and 2000 cells each
s<-SimBu::simulate_bulk(dataset,
scenario="even",
scaling_factor="NONE",
nsamples=10,
ncells=100)
#> Finished simulation.
# use a blacklist to exclude certain cell-types for the simulation
s<-SimBu::simulate_bulk(dataset,
scenario="even",
scaling_factor="NONE",
nsamples=10,
ncells=2000,
blacklist=c("Monocytes","Macrophages"))
#> Finished simulation.
# use the pure scenario to only have B cells
s<-SimBu::simulate_bulk(dataset,
scenario="pure",
scaling_factor="NONE",
nsamples=10,
ncells=100,
pure_cell_type="B cells")
#> Finished simulation.
# simulate a dataset with custom cell-type fraction for each of the 3 samples
fractions <- data.frame("B cells"=c(0.2,0.4,0.2),
"T cells CD4"=c(0.4,0.2,0.1),
"Macrophages"=c(0.4,0.4,0.7), check.names = FALSE)
s<-SimBu::simulate_bulk(dataset,
scenario="custom",
scaling_factor="NONE",
nsamples=3,
ncells=2000,
custom_scenario_data=fractions)
#> Finished simulation.