Here, we load essential libraries and configure the reticulate package to use a specific Conda environment. This is crucial for running Python-based tools within the R environment. The exact conda environment used in this run can be installed using this yml file using conda env create -f single_cell.yml
We use Seurat V2 (Hao et al. 2023) We define the main output directory where all results and intermediate files will be saved. We then read the sc_samples_manifest.xlsx from the data directory, containing metadata for each sample, which will be used to annotate the cells.
We load CellBender-processed HDF5 count data for each sample using Read_CellBender_h5_Mat from scCustomize R package, iterating through all sample IDs to read the data, create Seurat objects, and append metadata such as sample ID, patient ID, and tissue type from the manifest. Cell barcodes are prefixed with their sample IDs to ensure unique identifiers across samples, and all resulting Seurat objects are then merged into a single comprehensive object named ref.
We calculate the percentage of mitochondrial genes for each cell (percent.mt), a key quality control metric for cell quality. We also perform an initial filtering step to remove cells and genes that have zero counts across the entire dataset.
We run Scrublet (Wolock, Lopez, and Klein 2019), a Python-based tool, to identify potential doublets (two or more cells captured as a single droplet). We run it on each sample separately, then the results are transferred back to R and added as metadata to the main Seurat object.
We apply standard quality control filters, removing cells with high mitochondrial content (percent.mt < 15) or a low number of detected features (nFeature_RNA > 200).
Before integration, we perform standard Seurat pre-processing steps. The RNA assay is split by sample to prepare for integration, the data is log-normalized, highly variable features are identified, the data is scaled, and Principal Component Analysis (PCA) is run.
Code
ref[["RNA"]] <-split(ref[["RNA"]], f = ref$sample_id)ref <-NormalizeData(ref)ref <-FindVariableFeatures(ref)ref <-ScaleData(ref)ref <-RunPCA(ref)
4.2 Integration with scVI
To correct for batch effects between different samples, we perform data integration using scVI (Lopez et al. 2018), A deep learning-based method implemented via SeuratWrappers. A new dimensional reduction slots (integrated.scvi) is created in the Seurat object.
Using the scVI-based reduction, we generate graph and perform graph-based clustering. A range of resolutions is tested (from 0.05 to 1.0) to explore different clustering granularities. The clustree (Zappia and Oshlack 2018) R package helps visualize how cells move between clusters at different resolutions, aiding in the selection of an optimal resolution.
A final clustering resolution of 0.2 is chosen. We then find markers for these clusters and load a pre-defined annotation table from an Excel file scvi_cluster_annotation.xlsx in outputs/scRNAseq_Analysis/ directory, created to assign cell type labels to the clusters.
6 Finalize Cell Type Annotations and Generate Plots
6.1 Renaming Clusters
The CellTypes metadata column is converted to a factor with a specific level order. This ensures that cell types appear in a consistent and logical order in all subsequent plots. A list of canonical marker genes is defined for visualization.
The fully processed, integrated, and annotated Seurat object is saved to a file using qsave. This final object can be easily loaded for any further downstream analyses.
Code
qsave(ref, "outputs/scRNAseq_Analysis/scRef.qs")
The final processed file is deposited in Zenodo and can be downloaded from here
8 Session Info
Code
sessionInfo()
References
Hao, Yuhan, Tim Stuart, Madeline H Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, et al. 2023. “Dictionary Learning for Integrative, Multimodal and Scalable Single-Cell Analysis.”Nature Biotechnology. https://doi.org/10.1038/s41587-023-01767-y.
Lopez, Romain, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. 2018. “Deep Generative Modeling for Single-Cell Transcriptomics.”Nature Methods 15 (12): 1053–58.
Wolock, Samuel L., Romain Lopez, and Allon M. Klein. 2019. “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.”Cell Systems 8 (4): 281–291.e9. https://doi.org/https://doi.org/10.1016/j.cels.2018.11.005.
Zappia, Luke, and Alicia Oshlack. 2018. “Clustering Trees: A Visualization for Evaluating Clusterings at Multiple Resolutions.”Gigascience 7 (7): giy083.