diff --git a/docs/source/PBMC_1K_ATAC.md b/docs/source/PBMC_1K_ATAC.md
index 2133a8f..6549605 100644
--- a/docs/source/PBMC_1K_ATAC.md
+++ b/docs/source/PBMC_1K_ATAC.md
@@ -116,6 +116,8 @@ cellranger_atac:
   threads: 10
   mem_gb: 64
   runtime_minutes: 720 # max SLURM job runtime in minutes (default: 720 = 12 hours)
+  anndata_threads: 1
+  anndata_mem_gb: 16
 directories:
   LOGS_DIR: 00_LOGS
 doublet_detection:
@@ -199,19 +201,13 @@ sc-preprocess run --config-file pipeline_config.yaml --cores 1 --dag | dot -Tpng

 Here we will break down the meaning of each rule so you can keep track of what's going on. If you want more detail please refer to the [Pipeline Rules Reference](pipeline_rules.md).

-**cellranger_atac_count**: Runs the command [cellranger-atac count](https://www.10xgenomics.com/support/software/cell-ranger-atac/latest/analysis/running-pipelines/command-line-arguments#count) per capture, aligning ATAC reads to the reference genome and producing a peak-barcode matrix.
-
-**create_atac_anndata**: Converts data from the Cell Ranger ATAC output to a per-capture [AnnData object](https://anndata.readthedocs.io/en/latest/) (`.h5ad`), adding traceability metadata (`batch_id`, `capture_id`, `cell_id`).
-
-**cellranger_atac_aggr**: Runs [cellranger-atac aggr](https://www.10xgenomics.com/support/software/cell-ranger-atac/latest/analysis/running-pipelines/command-line-arguments#aggr) which aggregates all per-capture Cell Ranger ATAC outputs within a batch into a single normalized count matrix.
-
-**aggregate_atac_batch**: Merges all per-capture AnnData objects into a single batch-level `.h5ad` file, verifying `cell_id` uniqueness across captures.
-
-**run_scrublet**: Runs Scrublet doublet detection on each per-capture AnnData object, adding doublet scores and predictions to cell metadata.
-
-**enrich_atac_metadata**: Joins all downstream preprocessing metadata from doublet detection into the batch-level AnnData object.
-
-**all**: Final Snakemake rule that collects all expected outputs to ensure the full workflow is completed.
+* **cellranger_atac_count**: Runs the command [cellranger-atac count](https://www.10xgenomics.com/support/software/cell-ranger-atac/latest/analysis/running-pipelines/command-line-arguments#count) per capture, aligning ATAC reads to the reference genome and producing a peak-barcode matrix.
+* **create_atac_anndata**: Converts data from the Cell Ranger ATAC output to a per-capture [AnnData object](https://anndata.readthedocs.io/en/latest/) (`.h5ad`), adding traceability metadata (`batch_id`, `capture_id`, `cell_id`).
+* **cellranger_atac_aggr**: Runs [cellranger-atac aggr](https://www.10xgenomics.com/support/software/cell-ranger-atac/latest/analysis/running-pipelines/command-line-arguments#aggr) which aggregates all per-capture Cell Ranger ATAC outputs within a batch into a single normalized count matrix.
+* **aggregate_atac_batch**: Merges all per-capture AnnData objects into a single batch-level `.h5ad` file, verifying `cell_id` uniqueness across captures.
+* **run_scrublet**: Runs Scrublet doublet detection on each per-capture AnnData object, adding doublet scores and predictions to cell metadata.
+* **enrich_atac_metadata**: Joins all downstream preprocessing metadata from doublet detection into the batch-level AnnData object.
+* **all**: Final Snakemake rule that collects all expected outputs to ensure the full workflow is completed.

 ### Local Execution

@@ -479,27 +475,27 @@ grep -R "error" 1K_PBMC_ATAC_PROCESSED/00_LOGS

 The `.done` files are an internal checklist to keep track of a subset of rules that finished (don't worry about it unless you are a developer and want to contribute to the code base).

-`01_CELLRANGERATAC_COUNT/`
+* `01_CELLRANGERATAC_COUNT/`

 Here you will find all of the `Cell Ranger ATAC count` outputs for each individual capture.

-`02_CELLRANGERATAC_AGGR/`
+* `02_CELLRANGERATAC_AGGR/`

 This will be the aggregated count matrices across batches. In this tutorial there is only one capture so you won't find any processed data here.

-`03_ANNDATA/`
+* `03_ANNDATA/`

 Here you will find an `AnnData` object for every capture.

-`04_BATCH_OBJECTS/`
+* `04_BATCH_OBJECTS/`

 Batch-level `AnnData` object created by merging all per-capture objects from `03_ANNDATA/`. This is the aggregated, pre-metadata-enriched object — all cells from all captures in the batch are present, and `cell_id` uniqueness is verified. It does not yet contain doublet scores or demultiplexing results.

-`06_DOUBLET_DETECTION/`
+* `06_DOUBLET_DETECTION/`

 Doublet detection outputs from `Scrublet`.

-`07_FINAL/`
+* `07_FINAL/`

 The final enriched `AnnData` object with all preprocessing metadata joined in, ready for downstream analysis.

diff --git a/docs/source/PBMC_3k_multiome.md b/docs/source/PBMC_3k_multiome.md
index 046096e..fa163a7 100644
--- a/docs/source/PBMC_3k_multiome.md
+++ b/docs/source/PBMC_3k_multiome.md
@@ -94,7 +94,7 @@ The [3-column library CSV file](https://www.10xgenomics.com/support/software/cel

 ```bash
 FASTQ_DIR=$(realpath pbmc_unsorted_3k)
-sed -i "s|pbmc_unsorted_3k/gex|${FASTQ_DIR}/gex|; s|pbmc_unsorted_3k/atac|${FASTQ_DIR}/atac|" pbmc_unsorted_3k_library.csv
+sed -i "s|/path/to/fastqs/pbmc_unsorted_3k/gex|${FASTQ_DIR}/gex|; s|/path/to/fastqs/pbmc_unsorted_3k/atac|${FASTQ_DIR}/atac|" pbmc_unsorted_3k_library.csv
 ```

 Finish filling out the `pipeline_config.yaml` with paths to necessary files e.g. `libraries_list.tsv` and reference genome path:
@@ -112,6 +112,8 @@ cellranger_arc:
   reference: /path/to/refdata-cellranger-arc-GRCh38-2024-A # <- add the correct path!
   libraries: libraries_list.tsv
   normalize: none
+  anndata_threads: 1
+  anndata_mem_gb: 16
 directories:
   LOGS_DIR: 00_LOGS
   threads: 10
@@ -199,19 +201,13 @@ sc-preprocess run --config-file pipeline_config.yaml --cores 1 --dag | dot -Tpng

 Here we will break down the meaning of each rule so you can keep track of what's going on. If you want more detail please refer to the [Pipeline Rules Reference](pipeline_rules.md).

-**cellranger_arc_count**: Runs the command [cellranger-arc count](https://www.10xgenomics.com/support/software/cell-ranger-arc/latest/analysis/running-pipelines/command-line-arguments#count) per capture, aligning GEX and ATAC reads to the reference genome and producing a joint feature-barcode matrix.
-
-**create_arc_mudata**: Converts data from the Cell Ranger ARC output to per-capture [MuData object](https://mudata.readthedocs.io/stable/) (`.h5mu`) using the command [mu.read_10x_mtx()](https://muon.readthedocs.io/en/latest/api/generated/muon.read_10x_mtx.html), adding traceability metadata (`batch_id`, `capture_id`, `cell_id`).
-
-**cellranger_arc_aggr**: Runs [cellranger-arc aggr](https://www.10xgenomics.com/support/software/cell-ranger-arc/latest/analysis/running-pipelines/command-line-arguments#aggr) which aggregates all per-capture Cell Ranger ARC outputs within a batch into a single normalized count matrix.
-
-**aggregate_arc_batch**: Merges all per-capture MuData objects into a single batch-level `.h5mu` file, verifying `cell_id` uniqueness across captures.
-
-**run_scrublet**: Runs Scrublet doublet detection on the GEX modality of each per-capture MuData object, adding doublet scores and predictions to cell metadata.
-
-**enrich_arc_metadata**: Joins all downstream preprocessing metadata from demultiplexing and doublet detection into the batch-level MuData object.
-
-**all**: Final Snakemake rule that collects all expected outputs to ensure the full workflow is completed.
+* **cellranger_arc_count**: Runs the command [cellranger-arc count](https://www.10xgenomics.com/support/software/cell-ranger-arc/latest/analysis/running-pipelines/command-line-arguments#count) per capture, aligning GEX and ATAC reads to the reference genome and producing a joint feature-barcode matrix.
+* **create_arc_mudata**: Converts data from the Cell Ranger ARC output to per-capture [MuData object](https://mudata.readthedocs.io/stable/) (`.h5mu`) using the command [mu.read_10x_mtx()](https://muon.readthedocs.io/en/latest/api/generated/muon.read_10x_mtx.html), adding traceability metadata (`batch_id`, `capture_id`, `cell_id`).
+* **cellranger_arc_aggr**: Runs [cellranger-arc aggr](https://www.10xgenomics.com/support/software/cell-ranger-arc/latest/analysis/running-pipelines/command-line-arguments#aggr) which aggregates all per-capture Cell Ranger ARC outputs within a batch into a single normalized count matrix.
+* **aggregate_arc_batch**: Merges all per-capture MuData objects into a single batch-level `.h5mu` file, verifying `cell_id` uniqueness across captures.
+* **run_scrublet**: Runs Scrublet doublet detection on the GEX modality of each per-capture MuData object, adding doublet scores and predictions to cell metadata.
+* **enrich_arc_metadata**: Joins all downstream preprocessing metadata from demultiplexing and doublet detection into the batch-level MuData object.
+* **all**: Final Snakemake rule that collects all expected outputs to ensure the full workflow is completed.

 ### Local Execution

@@ -504,27 +500,27 @@ grep -R "error" 3K_PBMC_MULTIOME_PROCESSED/00_LOGS

 The `.done` files are an internal checklist to keep track of a subset of rules that finished (don't worry about it unless you are a developer and want to contribute to the code base).

-`01_CELLRANGERARC_COUNT/`
+* `01_CELLRANGERARC_COUNT/`

 Here you will find all of the `Cell Ranger count` outputs for each individual capture.

-`02_CELLRANGERARC_AGGR/`
+* `02_CELLRANGERARC_AGGR/`

 This will be the aggregated count matrices across batches. In this tutorial there is only one capture so you won't find any processed data here.

-`03_ANNDATA/`
+* `03_ANNDATA/`

 Here you will find `MuData` objects for every capture. In this case it will be Muon because multiome.

-`04_BATCH_OBJECTS/`
+* `04_BATCH_OBJECTS/`

 Batch-level `MuData` object created by merging all per-capture objects from `03_ANNDATA/`. This is the aggregated, pre-metadata-enriched object — all cells from all captures in the batch are present, and `cell_id` uniqueness is verified. It does not yet contain doublet scores or demultiplexing results.

-`06_DOUBLET_DETECTION/`
+* `06_DOUBLET_DETECTION/`

 Doublet detection outputs from `Scrublet`

-`07_FINAL/`
+* `07_FINAL/`

 Next, we print the Snakemake command running under the hood for convenient debugging. The `--snakefile` path will reflect where `sc-preprocess` is installed in your environment — this is expected and you don't need to use this path directly.

diff --git a/docs/source/PBMC_GEX.md b/docs/source/PBMC_GEX.md
index 01bead0..204f78b 100644
--- a/docs/source/PBMC_GEX.md
+++ b/docs/source/PBMC_GEX.md
@@ -117,6 +117,8 @@ cellranger_gex:
   create-bam: true
   threads: 10
   mem_gb: 64
+  anndata_threads: 1
+  anndata_mem_gb: 32
 directories:
   LOGS_DIR: 00_LOGS
 doublet_detection:
@@ -201,19 +203,13 @@ sc-preprocess run --config-file pipeline_config.yaml --cores 1 --dag | dot -Tpng

 Here we will break down the meaning of each rule so you can keep track of what's going on. If you want more detail please refer to the [Pipeline Rules Reference](pipeline_rules.md).

-**cellranger_gex_count**: Runs the command [cellranger count](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/running-pipelines/command-line-arguments#count) per capture, aligning GEX reads to the reference genome and producing a gene-barcode matrix.
-
-**create_gex_anndata**: Converts data from the Cell Ranger GEX output to a per-capture [AnnData object](https://anndata.readthedocs.io/en/latest/) (`.h5ad`) using `sc.read_10x_h5()`, adding traceability metadata (`batch_id`, `capture_id`, `cell_id`).
-
-**cellranger_gex_aggr**: Runs [cellranger aggr](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/running-pipelines/command-line-arguments#aggr) which aggregates all per-capture Cell Ranger GEX outputs within a batch into a single normalized count matrix.
-
-**aggregate_gex_batch**: Merges all per-capture AnnData objects into a single batch-level `.h5ad` file, verifying `cell_id` uniqueness across captures.
-
-**run_scrublet**: Runs Scrublet doublet detection on each per-capture AnnData object, adding doublet scores and predictions to cell metadata.
-
-**enrich_gex_metadata**: Joins all downstream preprocessing metadata from demultiplexing and doublet detection into the batch-level AnnData object.
-
-**all**: Final Snakemake rule that collects all expected outputs to ensure the full workflow is completed.
+* **cellranger_gex_count**: Runs the command [cellranger count](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/running-pipelines/command-line-arguments#count) per capture, aligning GEX reads to the reference genome and producing a gene-barcode matrix.
+* **create_gex_anndata**: Converts data from the Cell Ranger GEX output to a per-capture [AnnData object](https://anndata.readthedocs.io/en/latest/) (`.h5ad`) using `sc.read_10x_h5()`, adding traceability metadata (`batch_id`, `capture_id`, `cell_id`).
+* **cellranger_gex_aggr**: Runs [cellranger aggr](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/running-pipelines/command-line-arguments#aggr) which aggregates all per-capture Cell Ranger GEX outputs within a batch into a single normalized count matrix.
+* **aggregate_gex_batch**: Merges all per-capture AnnData objects into a single batch-level `.h5ad` file, verifying `cell_id` uniqueness across captures.
+* **run_scrublet**: Runs Scrublet doublet detection on each per-capture AnnData object, adding doublet scores and predictions to cell metadata.
+* **enrich_gex_metadata**: Joins all downstream preprocessing metadata from demultiplexing and doublet detection into the batch-level AnnData object.
+* **all**: Final Snakemake rule that collects all expected outputs to ensure the full workflow is completed.

 ### Local Execution

@@ -494,27 +490,27 @@ grep -R "error" 1K_PBMC_GEX_PROCESSED/00_LOGS

 The `.done` files are an internal checklist to keep track of a subset of rules that finished (don't worry about it unless you are a developer and want to contribute to the code base).

-`01_CELLRANGERGEX_COUNT/`
+* `01_CELLRANGERGEX_COUNT/`

 Here you will find all of the `Cell Ranger count` outputs for each individual capture.

-`02_CELLRANGERGEX_AGGR/`
+* `02_CELLRANGERGEX_AGGR/`

 This will be the aggregated count matrices across batches. In this tutorial there is only one capture so you won't find any processed data here.

-`03_ANNDATA/`
+* `03_ANNDATA/`

 Here you will find an `AnnData` object for every capture.

-`04_BATCH_OBJECTS/`
+* `04_BATCH_OBJECTS/`

 Batch-level `AnnData` object created by merging all per-capture objects from `03_ANNDATA/`. This is the aggregated, pre-metadata-enriched object — all cells from all captures in the batch are present, and `cell_id` uniqueness is verified. It does not yet contain doublet scores or demultiplexing results.

-`06_DOUBLET_DETECTION/`
+* `06_DOUBLET_DETECTION/`

 Doublet detection outputs from `Scrublet`.

-`07_FINAL/`
+* `07_FINAL/`

 The final enriched `AnnData` object with all preprocessing metadata joined in, ready for downstream analysis.

@@ -577,3 +573,32 @@ Immediately visualize QC metrics:
 ```python
 sc.pl.violin(adata, ['total_counts', 'n_genes_by_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)
 ```
+
+### Seurat
+
+The easiest way to load the final AnnData object in `07_FINAL/` into R for analysis with Seurat is to use the SeuratDisk package:
+
+```R
+library(Seurat)
+library(SeuratDisk)
+
+# Convert .h5ad to .h5seurat format
+Convert("1K_PBMC_GEX_PROCESSED/07_FINAL/1_gex.h5ad", dest = "h5seurat", overwrite = TRUE)
+
+# Load the converted file into a Seurat object
+seurat_obj <- LoadH5Seurat("1K_PBMC_GEX_PROCESSED/07_FINAL/1_gex.h5seurat")
+```
+
+A second option, which loads the AnnData file directly without creating an intermediate file, uses the zellkonverter and SingleCellExperiment packages. For example:
+
+```R
+library(Seurat)
+library(zellkonverter)
+library(SingleCellExperiment)
+
+# Read the .h5ad file as a SingleCellExperiment object
+sce <- readH5AD("1K_PBMC_GEX_PROCESSED/07_FINAL/1_gex.h5ad")
+
+# Convert to Seurat object
+seurat_obj <- as.Seurat(sce, counts = "X", data = NULL)
+```
\ No newline at end of file