This project focuses on the single-cell RNA sequencing (scRNA-seq) analysis of samples related to Neurog2 expression at different stages, together with a control. The goal is to study how MG cells develop into other cell types.
| Sample Name | Description |
|---|---|
| 5 weeks Neurog2_9SA | TH1_GFP_mScarlet3 |
| 2 months control | TH2_GFP_mScarlet3 |
| 2 months Neurog2_9SA | TH3_GFP_mScarlet3 |
The analysis was performed using Scanpy, a scalable toolkit for analyzing single-cell gene expression data. The workflow included:
- **Merge Multiple Samples**: Multiple `AnnData` objects are combined into one using their sample names as labels. This enables joint analysis while preserving sample identity.
- **Identify Mitochondrial Genes**: Genes that start with `"mt-"` are flagged as mitochondrial genes, which are important indicators of cell stress or damage.
- **Calculate Quality Control (QC) Metrics**: Standard QC metrics are computed for each cell:
  - `n_genes_by_counts`: Number of genes detected
  - `total_counts`: The total number of UMIs observed per cell
  - `pct_counts_mt`: Percent of transcripts from mitochondrial genes
- **Visualize QC Metrics (Before Filtering)**: Violin plots are used to visualize the distribution of these metrics to help identify low-quality cells.
- **Filter Out Low-Quality Cells**: Cells are removed if they have:
  - Too few or too many detected genes (e.g. <800 or >8000)
  - Extremely low or high total transcript counts
  - High mitochondrial content (e.g. >25%), indicating cell stress
- **Further Filtering**
  - Cells with fewer than 100 genes are removed
  - Genes found in fewer than 3 cells are excluded
- **Visualize QC Metrics (After Filtering)**: Another set of violin plots is generated to assess the impact of filtering on the dataset.
- **Save the Processed Data**: The cleaned and filtered data is saved as an `.h5ad` file for downstream analysis.

UMAP plot colored by sample, showing clustering and distribution of single cells from different conditions.
- `n_genes_by_counts`
  - Definition: Number of genes with non-zero counts in each cell.
  - Use: Helps filter out cells with too few expressed genes (often poor quality or empty droplets).
- `total_counts`
  - Definition: Total number of counts (UMIs or reads) in a cell.
  - Use: Indicates cell complexity or sequencing depth. Very low values may indicate damaged cells or low capture.
- `pct_counts_mt`
  - Definition: Percentage of counts from mitochondrial genes (e.g., genes starting with `mt-` in mouse or `MT-` in human).
  - Use: High percentages may indicate cell stress or apoptosis; often used to filter out low-quality cells.
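
To make the three definitions above concrete, here is a minimal sketch that computes them by hand with NumPy on a toy count matrix. The gene names and counts are invented for illustration, and this is not the code path the workflow itself uses.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns); all values are invented.
gene_names = ["mt-Nd1", "Rho", "Glul", "mt-Co1"]
counts = np.array([
    [5, 0, 10, 5],
    [0, 0, 3, 1],
    [2, 8, 0, 0],
])

# Flag mitochondrial genes by the mouse "mt-" prefix.
is_mt = np.array([name.startswith("mt-") for name in gene_names])

n_genes_by_counts = (counts > 0).sum(axis=1)  # genes with non-zero counts per cell
total_counts = counts.sum(axis=1)             # total counts per cell
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

print(n_genes_by_counts)  # [3 2 2]
print(total_counts)       # [20  4 10]
print(pct_counts_mt)      # [50. 25. 20.]
```

With the 25% mitochondrial cutoff used in this project, the first two toy cells would be dropped (the cutoff is strict, so exactly 25% does not pass).
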

Violin plots displaying quality control metrics such as number of genes detected per cell, total counts, and percentage of mitochondrial gene expression.
Quality filtering was applied to remove low-quality cells and potential doublets. Cells were retained only if they met all the following criteria:
- Number of genes detected per cell between 800 and 8000
- Total counts per cell between 1200 and 30000
- Percentage of mitochondrial gene counts less than 25%
This filtering step ensures removal of dead or dying cells and technical artifacts to improve downstream analysis quality.
| Sample | Cell Count |
|---|---|
| Neurog2_9SA_5weeks | 27,732 |
| Neurog2_9SA_2mo | 11,486 |
| control_2mo | 9,701 |
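
As a rough illustration of how the criteria combine and how per-sample counts like those in the table can be tallied, the sketch below applies the same thresholds to invented per-cell values. The arrays are hypothetical stand-ins for the corresponding `adata.obs` columns.

```python
from collections import Counter

import numpy as np

# Invented per-cell QC values for five cells from two of the samples.
sample = np.array([
    "control_2mo", "control_2mo",
    "Neurog2_9SA_2mo", "Neurog2_9SA_2mo", "Neurog2_9SA_2mo",
])
n_genes = np.array([2500, 600, 4000, 9000, 3000])
total = np.array([6000, 1500, 12000, 28000, 5000])
pct_mt = np.array([5.0, 3.0, 10.0, 8.0, 30.0])

# A cell is kept only if every criterion holds at the same time.
keep = (
    (n_genes > 800) & (n_genes < 8000)
    & (total > 1200) & (total < 30000)
    & (pct_mt < 25)
)

# Count the retained cells per sample.
cells_per_sample = Counter(sample[keep].tolist())
print(cells_per_sample)
```

Here the second cell fails on gene count, the fourth on the upper gene limit, and the fifth on mitochondrial content, so one cell per sample remains.
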
- **Load the Data**: A preprocessed `AnnData` object is loaded from disk.
- **Normalize and Transform**
  - Normalize gene expression values.
  - Apply a logarithmic transformation to stabilize variance across genes.
- **Feature Selection**
  - Identify the top 2,000 highly variable genes using the Seurat method. These are the most informative genes for downstream analysis.
- **Scale the Data**
  - Standardize the expression values (mean = 0, variance = 1).
  - Clip extreme values to a maximum of 10 to reduce the impact of outliers.
- **Dimensionality Reduction (PCA)**
  - Perform Principal Component Analysis to reduce data dimensionality and denoise the dataset.
- **Construct the Neighborhood Graph**
  - Build a k-nearest neighbors graph based on PCA to capture the local structure of the data.
- **UMAP Embedding**
  - Compute a 2D UMAP embedding for visualization of the dataset’s structure.
- **Visualize UMAP by Sample**
  - Generate a UMAP plot where cells are colored by their sample origin.
  - Count how many cells belong to each sample.
- **Per-Sample UMAP Plots**
  - Loop through each sample and generate a separate UMAP plot showing only the cells from that sample.
- **Visualize Predicted Doublets**
  - Plot a UMAP colored by predicted doublet labels and doublet scores to inspect doublet detection results.
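
The numeric core of the steps above (normalize, log-transform, scale, clip, PCA) can be sketched by hand with NumPy. This is only an illustration on synthetic counts, not the Scanpy calls the workflow runs: the target sum of 10,000 and the choice of 5 components are assumptions, and highly variable gene selection, the neighbor graph and UMAP are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(50, 20)).astype(float)  # 50 cells x 20 genes (synthetic)

# Normalize each cell to the same total (10,000 is an assumed target), then log1p.
target_sum = 1e4
totals = counts.sum(axis=1, keepdims=True)
log_norm = np.log1p(counts / totals * target_sum)

# Scale each gene to mean 0 and variance 1, then clip values above 10.
mean = log_norm.mean(axis=0)
std = log_norm.std(axis=0, ddof=1)
std[std == 0] = 1.0  # avoid division by zero for constant genes
scaled = np.minimum((log_norm - mean) / std, 10)

# PCA via SVD of the re-centered matrix; keep the first 5 components.
centered = scaled - scaled.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
pcs = u[:, :5] * s[:5]
print(pcs.shape)  # (50, 5)
```

The resulting `pcs` array plays the role of the PCA representation on which the neighborhood graph and UMAP embedding would then be built.
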
Below are the UMAP visualizations of marker gene expression across clusters. These are auto-generated from your data and saved in the figures/ directory.
| ID | Cell Type |
|---|---|
| 7 | Bad Cells |
| 8 | Microglia |
| 11 | Bad Cells |
| 20 | Microglia |
| 28 | Monocyte |
| 33 | RPE/Pax2 |
| 34 | SMC |
We then reclustered the data and replotted the marker genes, as shown below:
| Sample | Cell Count |
|---|---|
| Neurog2_9SA_5weeks | 23,370 |
| Neurog2_9SA_2mo | 10,115 |
| control_2mo | 8,674 |
A doublet is an artifact where two cells are captured and sequenced together, but incorrectly treated as one. Unlike Scrublet, which can operate effectively on clustered or preprocessed AnnData objects, the DoubletDetection tool is more sensitive to data structure and expects the original, unclustered AnnData object. Running it on a processed or subsetted object may yield suboptimal or misleading results.
In the workflow, we applied DoubletDetection to the original data (adata) to ensure it captures the full transcriptomic diversity and avoids artifacts introduced during clustering.
After running DoubletDetection, predicted doublets and doublet scores were stored in adata.obs under the keys:
- `predicted_doublet`: Boolean flag indicating whether each cell is a predicted doublet.
- `doublet_score`: Confidence score associated with doublet prediction.
The results were visualized using UMAP, colored by both prediction and score:
- Input is a raw (or filtered) gene expression matrix.
- May optionally normalize, filter, and log-transform the data.
- Creates artificial doublets by randomly pairing real cells.
- Averages their gene expression profiles to simulate doublets.
- Combines real and synthetic cells.
- Performs dimensionality reduction (typically PCA).
- Applies unsupervised clustering (usually Phenograph, a graph-based algorithm).
- Repeats the clustering multiple times (default: 50 runs).
- Tracks how often each real cell clusters with synthetic doublets.
- Cells that frequently cluster with synthetic doublets are flagged as potential doublets.
- Assigns a doublet probability score to each cell.
- Applies a threshold (user-defined or default) to classify each cell as a doublet or singlet.
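
The doublet simulation steps in the list above (random pairing, averaging, combining with the real cells) can be sketched as follows. This is a simplified illustration on synthetic counts, not the DoubletDetection implementation; the number of simulated doublets is an assumption, and the repeated clustering and voting steps are left out.

```python
import numpy as np

rng = np.random.default_rng(42)
real = rng.poisson(lam=3.0, size=(100, 30)).astype(float)  # 100 real cells x 30 genes (synthetic)

# Randomly pair distinct real cells and average each pair's profile.
n_synthetic = 25  # assumed number of simulated doublets
parents = np.array([
    rng.choice(real.shape[0], size=2, replace=False)
    for _ in range(n_synthetic)
])
synthetic = (real[parents[:, 0]] + real[parents[:, 1]]) / 2

# Combine real and synthetic cells; the flag marks which rows are simulated.
combined = np.vstack([real, synthetic])
is_synthetic = np.concatenate([
    np.zeros(len(real), dtype=bool),
    np.ones(n_synthetic, dtype=bool),
])
print(combined.shape)  # (125, 30)
```

In the full procedure, `combined` would be reduced with PCA and clustered repeatedly, and real cells that often share a cluster with rows flagged in `is_synthetic` would receive a high doublet score.
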
The `doublet_score` typically ranges from 0 to 1.

Your filter in the code:

```python
combined_adata = combined_adata[combined_adata.obs['doublet_score'] <= threshold]
```

This means you're keeping cells with `doublet_score <= threshold`.

- Higher threshold (e.g., `0.9`)
  - You keep more cells
  - Fewer doublets are removed
- Lower threshold (e.g., `0.4`)
  - You keep fewer cells
  - More potential doublets are removed
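
A small sketch of the effect of the threshold, using invented scores in place of `adata.obs['doublet_score']`:

```python
import numpy as np

doublet_score = np.array([0.05, 0.20, 0.45, 0.70, 0.95])  # invented scores

# Number of cells kept (score <= threshold) for a low and a high threshold.
kept_counts = {t: int((doublet_score <= t).sum()) for t in (0.4, 0.9)}
print(kept_counts)  # {0.4: 2, 0.9: 4}
```

The lower threshold keeps 2 of the 5 toy cells, while the higher one keeps 4.
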
```python
adata = adata[
    (adata.obs['n_genes_by_counts'] > 800) &
    (adata.obs['n_genes_by_counts'] < 8000) &
    (adata.obs['total_counts'] > 1200) &
    (adata.obs['total_counts'] < 30000) &
    (adata.obs['pct_counts_mt'] < 25),
    :
]
```

| ID | Cell Type |
|---|---|
| 7 | Bad rod |
| 8 | Microglia |
| 18 | Microglia |
| 27 | Monocyte |
| 32 | Astrocyte |
| 33 | Smooth muscle cells |
| 15 | Likely doublets? |
| 24 | Likely doublets? |
| Sample | Cell Count |
|---|---|
| Neurog2_9SA_5weeks | 20,691 |
| Neurog2_9SA_2mo | 10,255 |
| control_2mo | 8,818 |
The following code performs differential expression analysis per cell type (or cluster) in adata, emulating Seurat's FindMarkers function.
The heatmap is sorted by the z-score of the expression values.
No log fold changes are calculated.
For more details, see the script asSeurat.py.
```python
import scanpy as sc

# Rank genes per cell type vs all other cells
sc.tl.rank_genes_groups(
    adata,
    groupby=groupby_col,  # column in adata.obs with cell type annotations
    method='wilcoxon',    # Wilcoxon rank-sum test, Seurat default
    use_raw=False,        # use processed log1p data (like Seurat log-normalized counts)
    pts=True,             # include fraction of cells expressing each gene per group
)
```

Then filter without `min_fold_change` as follows:

```python
sc.tl.filter_rank_genes_groups(
    adata,
    min_in_group_fraction=0.1,
    max_out_group_fraction=1.0,
    key='rank_genes_groups',
    key_added='filtered_rank_genes_groups',
    # min_fold_change is not passed, so scanpy's default for that parameter applies
)
```
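
To illustrate what an in-group fraction filter such as `min_in_group_fraction=0.1` acts on, the sketch below computes the fraction of expressing cells per gene by hand on invented data. The strict "greater than" comparison is an assumption of this sketch, not a statement about Scanpy's internals.

```python
import numpy as np

# Invented log-normalized expression: 6 cells x 3 genes, with one cluster label per cell.
expr = np.array([
    [1.2, 0.0, 0.0],
    [0.8, 0.0, 0.5],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.7],
    [0.0, 0.0, 0.9],
    [0.3, 0.0, 1.1],
])
labels = np.array(["A", "A", "A", "B", "B", "B"])
in_group = labels == "A"

# Fraction of cells with non-zero expression inside and outside cluster A.
frac_in = (expr[in_group] > 0).mean(axis=0)
frac_out = (expr[~in_group] > 0).mean(axis=0)

# Keep genes expressed in more than 10% of the cells in cluster A.
min_in_group_fraction = 0.1
passes = frac_in > min_in_group_fraction
print(passes)  # [ True False  True]
```

The second gene is not expressed in any cell of cluster A, so it is the only one dropped by this criterion.
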
Gene expression using a Seurat-like method.
For a dry run to check everything before the actual run:

```bash
snakemake -j1 -p --configfile config.yaml -n
```

For the actual run:

```bash
snakemake -j1 -p --configfile config.yaml
```
- Scanpy: Wolf, F. A., Angerer, P., & Theis, F. J. (2018). Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 15. https://doi.org/10.1186/s13059-017-1382-0
- Scrublet: Wolock, S. L., Lopez, R., & Klein, A. M. (2019). Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Systems, 8(4), 281–291.e9. https://doi.org/10.1016/j.cels.2018.11.005
- DoubletDetection: Gayoso, A., Shor, J., Carr, A. J., & Yosef, N. (2019). DoubletDetection: Computational doublet detection in single-cell RNA sequencing data using boosting algorithms. GitHub Repository. (No peer-reviewed publication; software citation based on GitHub authorship.)