
Sfaira Integration

This vignette covers the integration of the public database sfaira.

Setup

SimBu uses sfaira (Fischer et al. 2020) as its public database. sfaira is a dataset and model repository for single-cell RNA-sequencing data that gives access to about 233 datasets from human and mouse, with more than 3 million cells in total. You can browse them interactively here: https://theislab.github.io/sfaira-portal/Datasets. Note that only annotated datasets will be downloaded!
To use this database, we first need to install it. This is done by running the setup_sfaira() function for the first time. In the background, we use the basilisk package to set up a conda environment that has all sfaira dependencies installed. The installation is performed only once, even if you close your R session and call setup_sfaira() again. The directory passed as basedir serves as the storage for all future downloaded datasets from sfaira:

setup_list <- SimBu::setup_sfaira(basedir = tempdir())
#> [1] "Sucessfully loaded sfaira."

Creating a dataset

The simulator package works with an internally defined data structure, the dataset.
We will now create a dataset of samples from human pancreas using the organisms and tissues parameters. You can provide a single word (as we do here) or a vector of tissues you want to download, for example c("pancreas","lung"). The additional assays parameter lets you subset the database further, so that only datasets from certain sequencing assays (for example Smart-seq2) are downloaded.
The name parameter is used later on to give each sample (cell) a unique name.

ds_pancreas <- SimBu::dataset_sfaira_multiple(
  sfaira_setup = setup_list,
  organisms = "Homo sapiens",
  tissues = "pancreas",
  name = "human_pancreas"
)

Currently, sfaira contains three human pancreas datasets that have cell-type annotation. The package downloads them automatically and merges them into a single expression matrix and a streamlined annotation table, which we can use for our simulation.
Some datasets from sfaira may not (yet) be ready for automatic download. In that case, an error message appears in R, telling you which file to download and where to put it.
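
The tissues and assays parameters described above can also be combined in one call. The following is a minimal sketch, not a tested recipe: it assumes that assays accepts sfaira assay names such as "Smart-seq2" in the same way tissues accepts tissue names, and the dataset name "human_pancreas_lung" is made up for illustration. Whether any dataset matches depends on the current contents of sfaira. The download only runs when SimBu is installed and setup_list from setup_sfaira() exists in the session.

```r
# Query arguments; the vector form of `tissues` follows the text above.
query <- list(
  organisms = "Homo sapiens",
  tissues   = c("pancreas", "lung"),
  assays    = "Smart-seq2",          # assumption: matched against sfaira assay names
  name      = "human_pancreas_lung"  # hypothetical name, used to label the cells
)

# Only attempt the download if SimBu is available and setup_sfaira() has
# already been run in this session (so that `setup_list` exists).
if (requireNamespace("SimBu", quietly = TRUE) && exists("setup_list")) {
  ds_multi <- do.call(
    SimBu::dataset_sfaira_multiple,
    c(list(sfaira_setup = setup_list), query)
  )
}
```

Keeping the query in a list makes it easy to log or reuse the exact filter that produced a merged dataset.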

If you wish to see all datasets that are included in sfaira, you can use the following command:

all_datasets <- SimBu::sfaira_overview(setup_list = setup_list)
head(all_datasets)
#>                                                                                                id    author                       doi
#> 1:                            homosapiens_liver_2019_10x3v2_popescu_001_10.1038/s41586-019-1652-y   Popescu 10.1038/s41586-019-1652-y
#> 2:                 homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x Madissoon 10.1186/s13059-019-1906-x
#> 3:                      homosapiens_esophagus_2019_10x3v2_madissoon_002_10.1186/s13059-019-1906-x Madissoon 10.1186/s13059-019-1906-x
#> 4:                         homosapiens_spleen_2019_10x3v2_madissoon_003_10.1186/s13059-019-1906-x Madissoon 10.1186/s13059-019-1906-x
#> 5:                              homosapiens_lung_2019_dropseq_braga_001_10.1038/s41591-019-0468-5     Braga 10.1038/s41591-019-0468-5
#> 6: homosapiens_lungparenchyma_2019_10x3transcriptionprofiling_braga_001_10.1038/s41591-019-0468-5     Braga 10.1038/s41591-019-0468-5
#>    annotated                          assay           organ     organism
#> 1:      TRUE                      10x 3' v2           liver Homo sapiens
#> 2:      TRUE                      10x 3' v2 lung parenchyma Homo sapiens
#> 3:      TRUE                      10x 3' v2       esophagus Homo sapiens
#> 4:      TRUE                      10x 3' v2          spleen Homo sapiens
#> 5:      TRUE                       Drop-seq            lung Homo sapiens
#> 6:      TRUE 10x 3' transcription profiling lung parenchyma Homo sapiens
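
Because the overview is a table with one row per dataset, it can be filtered with ordinary subsetting to find the IDs you are interested in. The sketch below uses a small stand-in data frame copied from three rows of the output above, so that it runs without sfaira; the column names are those shown in the output, and the same subsetting expression should work on all_datasets itself.

```r
# Stand-in for `all_datasets`, copied from three rows of the printed overview.
overview <- data.frame(
  id = c(
    "homosapiens_liver_2019_10x3v2_popescu_001_10.1038/s41586-019-1652-y",
    "homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x",
    "homosapiens_lung_2019_dropseq_braga_001_10.1038/s41591-019-0468-5"
  ),
  annotated = c(TRUE, TRUE, TRUE),
  assay     = c("10x 3' v2", "10x 3' v2", "Drop-seq"),
  organ     = c("liver", "lung parenchyma", "lung"),
  organism  = "Homo sapiens",
  stringsAsFactors = FALSE
)

# Keep annotated datasets whose organ mentions "lung" and that used 10x 3' v2.
hits <- overview[
  overview$annotated &
    grepl("lung", overview$organ, fixed = TRUE) &
    overview$assay == "10x 3' v2",
]

lung_ids <- hits$id
print(lung_ids)
```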

This allows you to find the IDs of specific datasets, which you can then download directly:

SimBu::dataset_sfaira(sfaira_id = 'homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x',
                      sfaira_setup = setup_list,
                      name = "dataset_by_id")
#> [1] "Starting to download dataset from Sfaria with id: homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x"
#> [1] "Downloading datasets..."
#> [1] "Streamlining features & meta-data..."
#> Using rownames for cell-IDs.
#> Filtering genes...
#> Created dataset.
#> class: SummarizedExperiment 
#> dim: 20195 57020 
#> metadata(0):
#> assays(1): counts
#> rownames(20195): TSPAN6 TNMD ... LINC02498 MGC4859
#> rowData names(0):
#> colnames(57020): dataset_by_id_1 dataset_by_id_2 ... dataset_by_id_57019 dataset_by_id_57020
#> colData names(5): cell_ID cell_ID.old cell_type nReads_SimBu nGenes_SimBu
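
The printed object is a SummarizedExperiment with a counts assay and per-cell annotation in colData, so the standard SummarizedExperiment accessors can be used to inspect it. The sketch below builds a tiny toy object with the same layout instead of downloading anything; the count values and cell types are invented for illustration, and only two of the five colData columns are reproduced.

```r
library(SummarizedExperiment)

# Toy counts: 2 genes x 3 cells, named like the rows and columns printed above.
counts <- matrix(
  c(5, 0, 2, 1, 3, 0),
  nrow = 2,
  dimnames = list(
    c("TSPAN6", "TNMD"),
    c("dataset_by_id_1", "dataset_by_id_2", "dataset_by_id_3")
  )
)

# Toy annotation; the cell types here are made up.
annotation <- DataFrame(
  cell_ID   = colnames(counts),
  cell_type = c("T cell", "B cell", "T cell")
)

toy <- SummarizedExperiment(
  assays  = list(counts = counts),
  colData = annotation
)

# Expression matrix (genes in rows, cells in columns).
expr <- assay(toy, "counts")

# Number of cells per annotated cell type.
cells_per_type <- table(colData(toy)$cell_type)
print(cells_per_type)
```

On a real dataset, the same two accessors give the merged expression matrix and the streamlined annotation table.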
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.6 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] SimBu_0.99.0                SummarizedExperiment_1.22.0 Biobase_2.52.0              GenomicRanges_1.44.0       
#>  [5] GenomeInfoDb_1.28.4         IRanges_2.26.0              S4Vectors_0.30.2            BiocGenerics_0.38.0        
#>  [9] MatrixGenerics_1.4.3        matrixStats_0.61.0          Matrix_1.3-4               
#> 
#> loaded via a namespace (and not attached):
#>   [1] plyr_1.8.7              igraph_1.2.11           lazyeval_0.2.2          proxyC_0.2.4            splines_4.1.0          
#>   [6] listenv_0.8.0           scattermore_0.8         ggplot2_3.3.5           digest_0.6.27           htmltools_0.5.2        
#>  [11] fansi_1.0.3             magrittr_2.0.1          memoise_2.0.1           tensor_1.5              cluster_2.1.2          
#>  [16] ROCR_1.0-11             globals_0.14.0          RcppParallel_5.1.5      pkgdown_2.0.2           spatstat.sparse_2.1-0  
#>  [21] colorspace_2.0-3        ggrepel_0.9.1           xfun_0.30               dplyr_1.0.8             callr_3.7.0            
#>  [26] crayon_1.5.1            RCurl_1.98-1.6          jsonlite_1.7.2          graph_1.70.0            spatstat.data_2.1-2    
#>  [31] survival_3.2-11         zoo_1.8-9               glue_1.6.2              polyclip_1.10-0         gtable_0.3.0           
#>  [36] zlibbioc_1.38.0         XVector_0.32.0          leiden_0.3.9            DelayedArray_0.18.0     future.apply_1.8.1     
#>  [41] abind_1.4-5             scales_1.1.1            DBI_1.1.2               spatstat.random_2.1-0   miniUI_0.1.1.1         
#>  [46] Rcpp_1.0.8.3            viridisLite_0.4.0       xtable_1.8-4            reticulate_1.24         spatstat.core_2.4-0    
#>  [51] getopt_1.20.3           htmlwidgets_1.5.4       httr_1.4.2              dir.expiry_1.0.0        RColorBrewer_1.1-2     
#>  [56] ellipsis_0.3.2          Seurat_4.1.0            ica_1.0-2               XML_4.0-0               pkgconfig_2.0.3        
#>  [61] uwot_0.1.11             deldir_1.0-6            here_1.0.1              utf8_1.2.2              tidyselect_1.1.2       
#>  [66] rlang_1.0.2             reshape2_1.4.4          later_1.3.0             biocViews_1.60.0        munsell_0.5.0          
#>  [71] tools_4.1.0             cachem_1.0.6            cli_3.2.0               generics_0.1.2          ggridges_0.5.3         
#>  [76] evaluate_0.14           stringr_1.4.0           fastmap_1.1.0           yaml_2.3.5              goftest_1.2-3          
#>  [81] processx_3.5.3          knitr_1.33              fs_1.5.2                fitdistrplus_1.1-8      purrr_0.3.4            
#>  [86] RANN_2.6.1              sparseMatrixStats_1.4.2 RBGL_1.68.0             pbapply_1.5-0           future_1.24.0          
#>  [91] nlme_3.1-152            mime_0.10               xml2_1.3.3              rstudioapi_0.13         brio_1.1.3             
#>  [96] compiler_4.1.0          curl_4.3.2              plotly_4.10.0           filelock_1.0.2          png_0.1-7              
#> [101] testthat_3.1.2          spatstat.utils_2.3-0    tibble_3.1.6            stringi_1.6.1           ps_1.6.0               
#> [106] basilisk.utils_1.4.0    desc_1.4.1              lattice_0.20-44         commonmark_1.8.0        vctrs_0.3.8            
#> [111] stringdist_0.9.8        BiocCheck_1.28.0        pillar_1.7.0            lifecycle_1.0.1         RUnit_0.4.32           
#> [116] BiocManager_1.30.16     optparse_1.7.1          spatstat.geom_2.3-2     lmtest_0.9-40           RcppAnnoy_0.0.19       
#> [121] data.table_1.14.2       cowplot_1.1.1           bitops_1.0-7            irlba_2.3.5             httpuv_1.6.5           
#> [126] patchwork_1.1.1         R6_2.5.1                promises_1.2.0.1        KernSmooth_2.23-20      gridExtra_2.3          
#> [131] parallelly_1.30.0       codetools_0.2-18        MASS_7.3-54             assertthat_0.2.1        rprojroot_2.0.2        
#> [136] withr_2.5.0             SeuratObject_4.0.4      sctransform_0.3.3       GenomeInfoDbData_1.2.6  mgcv_1.8-36            
#> [141] grid_4.1.0              rpart_4.1-15            tidyr_1.2.0             basilisk_1.4.0          rmarkdown_2.13         
#> [146] downlit_0.4.0           Rtsne_0.15              shiny_1.7.1
Fischer, David S., Leander Dony, Martin König, Abdul Moeed, Luke Zappia, Sophie Tritschler, Olle Holmberg, Hananeh Aliee, and Fabian J. Theis. 2020. “Sfaira Accelerates Data and Model Reuse in Single Cell Genomics.” bioRxiv. https://doi.org/10.1101/2020.12.16.419036.