Skip to content

SherineAwad/Neurog2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

268 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Neurog2 Project - scRNA-seq Analysis

This project focuses on the single-cell RNA sequencing (scRNA-seq) analysis of samples related to Neurog2 expression at different stages and control. The goal is to study how MG cells develope to other cells.

Samples

Sample Name Description
5 weeks Neurog2_9SA TH1_GFP_mScarlet3
2 months control TH2_GFP_mScarlet3
2 months Neurog2_9SA TH3_GFP_mScarlet3

Analysis Workflow

The analysis was performed using Scanpy, a scalable toolkit for analyzing single-cell gene expression data. The workflow included:

Preprocessing

  1. Merge Multiple Samples Multiple AnnData objects are combined into one using their sample names as labels. This enables joint analysis while preserving sample identity.

  2. Identify Mitochondrial Genes Genes that start with "mt-" are flagged as mitochondrial genes, which are important indicators of cell stress or damage.

  3. Calculate Quality Control (QC) Metrics Standard QC metrics are computed for each cell:

    • n_genes_by_counts: Number of genes detected
    • total_counts: The total number of UMIs observed per cell
    • pct_counts_mt: Percent of transcripts from mitochondrial genes
  4. Visualize QC Metrics (Before Filtering) Violin plots are used to visualize the distribution of these metrics to help identify low-quality cells.

  5. Filter Out Low-Quality Cells Cells are removed if they have:

    • Too few or too many detected genes (e.g. <800 or >8000)
    • Extremely low or high total transcript counts
    • High mitochondrial content (e.g. >25%), indicating cell stress
  6. Further Filtering

    • Cells with fewer than 100 genes are removed
    • Genes found in fewer than 3 cells are excluded
  7. Visualize QC Metrics (After Filtering) Another set of violin plots is generated to assess the impact of filtering on the dataset.

  8. Save the Processed Data The cleaned and filtered data is saved as an .h5ad file for downstream analysis.


Figures

1. UMAP Visualization

UMAP of Neurog2 samples
UMAP plot colored by sample, showing clustering and distribution of single cells from different conditions.

Per sample UMAP

Control 2mo Neurog2_9SA_5weeks Neurog2_9SA_2mo

Scanpy QC Metrics — Quick Overview

🔹 n_genes_by_counts

  • Definition: Number of genes with non-zero counts in each cell.
  • Use: Helps filter out cells with too few expressed genes (often poor quality or empty droplets).

🔹 total_counts

  • Definition: Total number of counts (UMIs or reads) in a cell.
  • Use: Indicates cell complexity or sequencing depth. Very low values may indicate damaged cells or low capture.

🔹 pct_counts_mt

  • Definition: Percentage of counts from mitochondrial genes (e.g., genes starting with mt- in mouse or MT- in human).
  • Use: High percentages may indicate cell stress or apoptosis; often used to filter out low-quality cells.

Before Filtering QC metrics
Violin plots displaying quality control metrics such as number of genes detected per cell, total counts, and percentage of mitochondrial gene expression.

Filtering Criteria

Quality filtering was applied to remove low-quality cells and potential doublets. Cells were retained only if they met all the following criteria:

  • Number of genes detected per cell between 800 and 8000
  • Total counts per cell between 1200 and 30000
  • Percentage of mitochondrial gene counts less than 25%

This filtering step ensures removal of dead or dying cells and technical artifacts to improve downstream analysis quality.

Additional Analysis Figure

After Filtering QC metrics

Number of cells per sample

Sample Cell Count
Neurog2_9SA_5weeks 27,732
Neurog2_9SA_2mo 11,486
control_2mo 9,701

Clustering

  1. Load the Data A preprocessed AnnData object is loaded from disk.

  2. Normalize and Transform

    • Normalize gene expression values
    • Apply a logarithmic transformation to stabilize variance across genes.
  3. Feature Selection

    • Identify the top 2,000 highly variable genes using the Seurat method. These are the most informative genes for downstream analysis.
  4. Scale the Data

    • Standardize the expression values (mean = 0, variance = 1).
    • Clip extreme values to a maximum of 10 to reduce the impact of outliers.
  5. Dimensionality Reduction (PCA)

    • Perform Principal Component Analysis to reduce data dimensionality and denoise the dataset.
  6. Construct the Neighborhood Graph

    • Build a k-nearest neighbors graph based on PCA to capture the local structure of the data.
  7. UMAP Embedding

    • Compute a 2D UMAP embedding for visualization of the dataset’s structure.
  8. Visualize UMAP by Sample

    • Generate a UMAP plot where cells are colored by their sample origin.
    • Count how many cells belong to each sample.
  9. Per-Sample UMAP Plots

    • Loop through each sample and generate a separate UMAP plot showing only the cells from that sample.
  10. Visualize Predicted Doublets

  • Plot a UMAP colored by predicted doublet labels and doublet scores to inspect doublet detection results.

Marker Gene UMAP Plots

Below are the UMAP visualizations of marker gene expression across clusters. These are auto-generated from your data and saved in the figures/ directory.

Initial Clustering

UMAP CLUSTERS

Marker Gene UMAP Plots

Below are the UMAP visualizations of marker gene expression across clusters. These are auto-generated from your data and saved in the figures/ directory.

Malat1mt-Atp6Sox9

GlulLhx2Rlbp1

Rbfox3Csf1rCalb2

Elavl4Calb1Sebox

Gad1Elavl3Cabp5

Isl1Slc6a9Ascl1

Olig2Foxn4Chat

Prdm1Otx2Insm1

Sox11Atoh7Hes5

Emx1mScarlet3GFP

Neurog2Tfap2aBsn

QC per Clsuter

Remove clusters

ID Cell Type
7 Bad Cells
8 Microglia
11 Bad Cells
20 Microglia
28 Monocyte
33 RPE/Pax2
34 SMC

then we reclustered and replot the marker genes as below:

UMAP

UMAP RE CLUSTERS

Per sample UMAP

Control 2mo Neurog2_9SA_5weeks Neurog2_9SA_2mo

Ccr2Pax2Rpe65 Lhx1Kcnj8Tie1 Acta2RhoNrl Arr3Malat1mt-Atp6 Sox9GlulLhx2 Rlbp1Rbfox3Csf1r Calb2Elavl4Calb1 SeboxGad1Elavl3 Cabp5Isl1Elavl4 Slc6a9Ascl1Olig2 Foxn4ChatPrdm1 Olig2Otx2Insm1 Sox11Atoh7Hes5 Emx1mScarlet3GFP Neurog2Tfap2aBsn Slc17a7Slc6a9Lhx4

Number of cells per sample

Sample Cell Count
Neurog2_9SA_5weeks 23,370
Neurog2_9SA_2mo 10,115
control_2mo 8,674

Doublet Detection using DoubletDetection

A doublet is an artifact where two cells are captured and sequenced together, but incorrectly treated as one. Unlike Scrublet, which can operate effectively on clustered or preprocessed AnnData objects, the DoubletDetection tool is more sensitive to data structure and expects the original, unclustered AnnData object. Running it on a processed or subsetted object may yield suboptimal or misleading results.

In the workflow, we applied DoubletDetection to the original data (adata) to ensure it captures the full transcriptomic diversity and avoids artifacts introduced during clustering.

After running DoubletDetection, predicted doublets and doublet scores were stored in adata.obs under the keys:

  • predicted_doublet: Boolean flag indicating whether each cell is a predicted doublet.
  • doublet_score: Confidence score associated with doublet prediction.

The results were visualized using UMAP, colored by both prediction and score:

DoubletDetection Workflow

Data Preprocessing

  • Input is a raw (or filtered) gene expression matrix.
  • May optionally normalize, filter, and log-transform the data.

Synthetic Doublet Generation

  • Creates artificial doublets by randomly pairing real cells.
  • Averages their gene expression profiles to simulate doublets.

Clustering with Real + Synthetic Data

  • Combines real and synthetic cells.
  • Performs dimensionality reduction (typically PCA).
  • Applies unsupervised clustering (usually Phenograph, a graph-based algorithm).

Voting Mechanism via Multiple Runs (Ensemble)

  • Repeats the clustering multiple times (default: 50 runs).
  • Tracks how often each real cell clusters with synthetic doublets.
  • Cells that frequently cluster with synthetic doublets are flagged as potential doublets.

Thresholding & Output

  • Assigns a doublet probability score to each cell.
  • Applies a threshold (user-defined or default) to classify each cell as a doublet or singlet.

Doublet Scores and Conversion

UMAP Doublets Thresholds Doublets Conversion

Understanding doublet_score Thresholds

The doublet_score typically ranges from 0 to 1

Your filter in the code:

combined_adata = combined_adata[combined_adata.obs['doublet_score'] <= threshold]

This means you're keeping cells with doublet_score <= threshold.

Interpretation of Threshold:

  • Higher threshold (e.g., 0.9) 🔹 You keep more cells 🔹 Less doublets are removed

  • Lower threshold (e.g., 0.4) 🔹 You keep fewer cells 🔹 More potential doublets are removed

Doublet Removal at Threshold 0.9

UMAP after clustering

Doublet Detection

UMAP after doublet removal at threshold 0.9

Doublet Removal

Marker Genes UMAP after doublet removal at threshold 0.9

Sox9Hes5Cabp5 Prdx6mScarlet3Vim Aqp4Rlbp1Slc6a9 Abca8aSlc1a3GFP GfapSlc17a7Rbfox3 Neurog2Elavl3Otx2 GlulElavl4Sox11 ApoeSeboxAtoh7 Lhx2Prdm1Bsn Rpe65Csf1rTfap2a Gad1Olig2Calb2 ChatPax6Insm1 Calb1Acta2Lhx4 Notch1RhoAscl1 Emx1Kcnj8Arr3 Pax2Foxn4Isl1 Ccr2Hes1Nrl Lhx1Tie1mt-Atp6 Malat1

Doublet removal using 0.8 threshold

UMAP after clustering

Doublet Detection

UMAP after doublet removal at threshold 0.8

Doublet Removal

Marker Genes UMAP after doublet removal at threshold 0.8

Hes5Sox9Cabp5 mScarlet3VimPrdx6 Aqp4Slc6a9Abca8a Rlbp1Slc1a3Gfap GFPSlc17a7Neurog2 Rbfox3Elavl3Sox11 GlulElavl4Lhx2 Otx2Atoh7Apoe SeboxTfap2aPrdm1 Pax6Gad1Rpe65 BsnOlig2Calb2 Ascl1ChatNotch1 Csf1rCalb1Acta2 Insm1Lhx4Emx1 Kcnj8RhoPax2 Arr3Foxn4Ccr2 Isl1Hes1Lhx1 NrlTie1mt-Atp6 Malat1

Pre filtering QC

Pre QC

How we filtered

adata = adata[
    (adata.obs['n_genes_by_counts'] > 800) &
    (adata.obs['n_genes_by_counts'] < 8000) &
    (adata.obs['total_counts'] > 1200) &
    (adata.obs['total_counts'] < 30000) &
    (adata.obs['pct_counts_mt'] < 25),
    :
]

Post filtering QC

Post QC

QC per Cluster

Doublet QC

UMAP per Sample

Control 2mo Neurog2_9SA_5weeks Neurog2_9SA_2mo

Remove clusters

ID Cell Type
7 Bad rod
8 Microglia
18 Microglia
27 Monocyte
32 Astrocyte
33 Smooth muscle cells
15 Likely doublets?
24 Likely doublets?

Then reCluster

Resolution = 2.0

UMAP resolution 2.0

UMAP of marker genes

Abca8aGuca1bPou4f1 Acta2Hes1Pou4f3 ApoeHes5Prdm1 Aqp4Igf2Prdx6 Arr3Insm1Prkca Ascl1Isl1Prox1 Atoh7Isl2Ptprc Bhlhe23Kcnj8Rbfox3 BsnLhx1Rbpms Cabp5Lhx2Rho Calb1Lhx4Rlbp1 Calb2Malat1Rom1 Cbln4mScarlet3Rpe65 Ccr2mt-Atp6Sall1 ChatNeflSebox CrxNefmSfrp2 Csf1rNeurog2Slc17a7 Cx3cr1Notch1Slc1a3 Ebf3Nr2e3Slc6a9 Elavl3NrlSncg Elavl4Olig2Sox11 Emx1Onecut1Sox2 Fgf15Onecut2Sox9 Foxn4Opn1mwTfap2a Gad1Opn1swTfap2b Gad2Otx2Thrb GfapPax2Thy1 GFPPax6Tie1 Gli1Pcp4Trpm1 GlulPdgfraVim Gnat2Pecam1Vsx1 Grm6Vsx2Slc18a3

Dot plot for major celltypes and marker genes

Dot Plot

Annotations

legend_annotations ON_annotations

Dot plot marker genes

UMAP per sample

Control Neurog2_9SA_5weeks Neurog2_9S1_2mo UMAP_All

Cell Ratio

CellRatio Reversed

Cell counts per sample

Cell Counts per Sample

Sample Cell Count
Neurog2_9SA_5weeks 20,691
Neurog2_9SA_2mo 10,255
control_2mo 8,818

Gene Expression Analysis

Using similar way as Seurat in scanpy to calculate DGE

The following code performs differential expression analysis per cell type (or cluster) in adata, emulating Seurat's FindMarkers function. The heatmap is sorted by the z-score of the expression values. No logfold changes is calculated ..
For more details, see the script asSeurat.py.

import scanpy as sc

# Rank genes per cell type vs all other cells
sc.tl.rank_genes_groups(
    adata,
    groupby=groupby_col,  # column in adata.obs with cell type annotations
    method='wilcoxon',    # Wilcoxon rank-sum test, Seurat default
    use_raw=False,        # use processed log1p data (like Seurat log-normalized counts)
    pts=True              # include fraction of cells expressing each gene per group
)

and filter without min_fold_change as follows:

sc.tl.filter_rank_genes_groups(
    adata,
    min_in_group_fraction=0.1,
    max_out_group_fraction=1.0,
    key='rank_genes_groups',
    key_added='filtered_rank_genes_groups'
)

Using min_in_group_fraction =0.1 (10%)

seuratlikedge and Gene Expression using seurat like method

How to run Snakemake

For dry run to check everything before actual run:

snakemake -j1 -p --configfile config.yaml -n

For Actual run:

snakemake -j1 -p --configfile config.yaml

References

  • Scanpy
    Wolf, F. A., Angerer, P., & Theis, F. J. (2018).
    Scanpy: large-scale single-cell gene expression data analysis. Genome Biology, 19(1), 15.
    https://doi.org/10.1186/s13059-017-1382-0

  • Scrublet
    Wolock, S. L., Lopez, R., & Klein, A. M. (2019).
    Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Systems, 8(4), 281–291.e9.
    https://doi.org/10.1016/j.cels.2018.11.005

  • DoubletDetection
    Gayoso, A., Shor, J., Carr, A. J., & Yosef, N. (2019).
    DoubletDetection: Computational doublet detection in single-cell RNA sequencing data using boosting algorithms.
    GitHub Repository
    (No peer-reviewed publication; software citation based on GitHub authorship.)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors