From 98ba4b862da91c5e1138b19a01218999382dbb26 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 19 Feb 2026 14:18:36 -0500 Subject: [PATCH 01/12] snATAC-seq preprocessing pipeline notebook --- .../QC/snatacseq_preprocessing.ipynb | 1454 +++++++++++++++++ 1 file changed, 1454 insertions(+) create mode 100644 code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb new file mode 100644 index 000000000..6b98233ee --- /dev/null +++ b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb @@ -0,0 +1,1454 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Single-nucleus ATAC-seq Preprocessing Pipeline\n", + "\n", + "## Overview\n", + "\n", + "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) pseudobulk peak count data\n", + "for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies.\n", + "\n", + "**Goals:**\n", + "- Transform raw pseudobulk peak counts into analysis-ready formats\n", + "- Remove technical confounders while optionally preserving biological covariates\n", + "- Generate QTL-ready phenotype files or region-specific datasets\n", + "\n", + "## Pipeline Structure\n", + "```\n", + "Step 0: Sample ID Mapping\n", + "↓\n", + "Step 1: Pseudobulk QC\n", + "├── Option A: BIOvar (regress out technical + biological covariates)\n", + "└── Option B: noBIOvar (regress out technical covariates only)\n", + "↓ (optional)\n", + "Batch Correction (ComBat-seq or limma::removeBatchEffect)\n", + "↓\n", + "Step 2: Format Output\n", + "├── Format A: Phenotype Reformatting → BED (genome-wide caQTL mapping)\n", + "└── Format B: Region Peak Filtering → TSV (locus-specific analysis)\n", + "\n", + "```\n", + "\n", + "## Input Files\n", + "\n", + "All input files required to run this pipeline can be downloaded\n", + "[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n", + "\n", + "| File | Used in |\n", + "|------|---------|\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | Step 0, Step 1 |\n", + "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", + "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", + "| `rosmap_cov.txt` | Step 1 |\n", + "| `hg38-blacklist.v2.bed.gz` | Step 1 |\n", + "| `SampleSheet.csv` | Step 1 (batch correction only) |\n", + "| `sampleSheetAfterQc.csv` | Step 1 (batch correction only) |\n", + "\n", + "\n", + "## Minimal Working Example" + ] + }, + { + "cell_type": "markdown", + "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 0: Sample ID Mapping\n", + "\n", + "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", + "across metadata and count matrix files.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", + "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Per-cell-type peak count matrices |\n", + "\n", + "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n", + "\n", + "### Process\n", + "\n", + "**Part 1 — Metadata files**\n", + "\n", + "For each `metadata_{celltype}.csv`:\n", + "1. 
Look up each `individualID` in the mapping reference\n", + "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", + "3. Insert `sampleid` as the first column\n", + "4. Save updated file\n", + "\n", + "**Part 2 — Count matrix files**\n", + "\n", + "For each `pseudobulk_peaks_counts_{celltype}.csv.gz`:\n", + "1. Extract the header row (column names only)\n", + "2. Keep `peak_id` (first column) unchanged\n", + "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n", + " otherwise keep original\n", + "4. Write new header and stream data rows unchanged\n", + "5. Recompress with gzip\n", + "\n", + "### Output\n", + "\n", + "Output directory: `output/1_files_with_sampleid/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Count matrices with mapped column headers |\n", + "\n", + "**Timing:** < 1 min\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", + " --cwd output/atac_seq/1_files_with_sampleid \\\n", + " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", + " --input_dir data/atac_seq/1_files_with_sampleid_xiong \\\n", + " --output_dir output/atac_seq/1_files_with_sampleid \\\n", + " --celltype Ast Ex In Microglia Oligo OPC\n", + "\n", + "\n", + "# For MIT input data\n", + "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", + " --cwd output/atac_seq/1_files_with_sampleid \\\n", + " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", + " --input_dir data/atac_seq/1_files_with_sampleid_MIT \\\n", + " --output_dir output/atac_seq/1_files_with_sampleid \\\n", + " --celltype Astro Exc Inh Mic Oligo OPC \\\n", + " --suffix _50nuc" + ] + }, + { + "cell_type": "markdown", + "id": "5540a4da-843a-4789-8123-47911cf519c5", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1: Pseudobulk QC\n", + "\n", + "Two approaches are available depending on whether biological covariates should be regressed out.\n", + "Both options support an **optional batch correction** step after filtering and normalization.\n", + "\n", + "\n", + "### Option A: With Biological Covariates (BIOvar)\n", + "\n", + "Use when residuals should be adjusted for all technical **and** biological covariates (sex, age, PMI).\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | `1_files_with_sampleid/` |\n", + "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n", + "| `rosmap_cov.txt` | `data/` |\n", + "| `hg38-blacklist.v2.bed.gz` | `data/` |\n", + "| `SampleSheet.csv` *(batch correction only)* | `data/` |\n", + "| `sampleSheetAfterQc.csv` *(batch correction only)* | `data/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Load pseudobulk peak count matrix and metadata per cell type\n", + "2. Filter samples with fewer than 20 nuclei\n", + "3. 
Calculate technical QC metrics per sample:\n",
+    "   - `log_n_nuclei`: log-transformed nuclei count\n",
+    "   - `med_nucleosome_signal`: median nucleosome signal\n",
+    "   - `med_tss_enrich`: median TSS enrichment score\n",
+    "   - `log_med_n_tot_fragment`: log-transformed median total fragments\n",
+    "   - `log_total_unique_peaks`: log-transformed unique peak count\n",
+    "4. Filter blacklisted genomic regions\n",
+    "5. Merge with demographic covariates (`msex`, `age_death`, `pmi`, `study`)\n",
+    "6. Apply expression filtering (`filterByExpr`):\n",
+    "   - `min_count = 5`: minimum reads in at least one sample\n",
+    "   - `min_total_count = 15`: minimum total reads across all samples\n",
+    "   - `min_prop = 0.1`: peak expressed in ≥10% of samples\n",
+    "7. TMM normalization\n",
+    "8. *(Optional)* Batch correction — see [Batch Correction](#batch-correction-optional) below\n",
+    "9. Fit linear model (`voom` + `lmFit`):\n",
+    "\n",
+    "   ```\n",
+    "   ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch + msex + age_death + pmi + study\n",
+    "   ```\n",
+    "\n",
+    "   > If batch correction was applied, `sequencingBatch` is removed from the model.\n",
+    "10. Compute residuals adjusted for all covariates\n",
+    "11. Compute final adjusted values: `offset + residuals`\n",
+    "    - `offset`: predicted expression at median/reference covariate values\n",
+    "    - `residuals`: unexplained variation after removing all covariate effects\n",
+    "\n",
+    "**Output:** `output/2_residuals/{celltype}/`\n",
+    "\n",
+    "| File | Description |\n",
+    "|------|-------------|\n",
+    "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n",
+    "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n",
+    "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n",
+    "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n",
+    "\n",
+    "**Covariates regressed out:**\n",
+    "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n",
+    "- Biological: sex (`msex`), age at death (`age_death`), post-mortem interval (`pmi`), study cohort\n",
+    "\n",
+    "### Option B: Without Biological Covariates (noBIOvar)\n",
+    "\n",
+    "Use when biological variation should be preserved (e.g., age/sex comparisons, region-specific analyses).\n",
+    "\n",
+    "**Input:** Same as Option A.\n",
+    "\n",
+    "**Process:**\n",
+    "\n",
+    "Steps 1–8 are identical to Option A. 
Key differences at the modelling stage:\n", + "- `msex` and `age_death` are **excluded** from the model\n", + "- `med_peakwidth` (weighted median peak width per sample) is added as a technical covariate\n", + "\n", + "**Model formula:**\n", + "```\n", + "Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch + pmi + study\n", + "```\n", + "\n", + "**Output:** `output/2_residuals/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n", + "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n", + "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", + "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n", + "\n", + "**Variables deliberately NOT regressed out:**\n", + "- Sex (`msex`)\n", + "- Age at death (`age_death`)\n", + "\n", + "**Timing:** <5 min per celltype" + ] + }, + { + "cell_type": "markdown", + "id": "21f80085-6d2c-4e1c-af35-454382d94de1", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC with BIOVar" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8569d816-d292-4512-85b6-fcd3ea1c9ba7", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", + " --output_dir output/atac_seq/2_residuals \\\n", + " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", + " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", + " --include_bio TRUE \\\n", + " --batch_correction FALSE \\\n", + " --min_count 5 \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC noBIOvar " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", + " --output_dir output/atac_seq/2_residuals \\\n", + " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", + " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", + " --include_bio FALSE \\\n", + " --batch_correction FALSE \\\n", + " --min_count 5 \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "25e96ad2-1b75-43d0-978e-0757bc11f135", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Batch Correction (Optional)\n", + "\n", + "Applies to both Option A and Option B. 
Runs between TMM normalization and model fitting.\n",
+    "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n",
+    "\n",
+    "> When batch correction is applied, `sequencingBatch` is **removed** from the model formula\n",
+    "> since batch variance has already been removed from the counts.\n",
+    "\n",
+    "**Method comparison:**\n",
+    "\n",
+    "| | ComBat-seq | limma `removeBatchEffect` |\n",
+    "|---|---|---|\n",
+    "| **Operates on** | Raw integer counts | log-CPM values |\n",
+    "| **Mean-variance modelling** | Yes | No |\n",
+    "| **Best for** | Large, balanced batches | Small or fragmented batches |\n",
+    "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n",
+    "\n",
+    "**ComBat-seq:**\n",
+    "```r\n",
+    "adjusted_counts <- ComBat_seq(counts = dge$counts, batch = batches)\n",
+    "```\n",
+    "\n",
+    "**limma `removeBatchEffect`:**\n",
+    "```r\n",
+    "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n",
+    "adj_logCPM <- removeBatchEffect(logCPM, batch = batches, design = model.matrix(~1, data = dge$samples))\n",
+    "adjusted_counts <- round(pmax(2^adj_logCPM * mean(dge$samples$lib.size) / 1e6, 0))\n",
+    "```\n",
+    "\n",
+    "**Additional filtering applied before correction:**\n",
+    "- Singleton batches (only 1 sample) are removed\n",
+    "- Samples absent from the batch sheet are dropped\n",
+    "\n",
+    "**Additional output when batch correction is enabled:**\n",
+    "\n",
+    "| File | Description |\n",
+    "|------|-------------|\n",
+    "| `{celltype}_results.rds` | Includes `batch_adjusted_counts` and `batch_method` fields |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4d582c85-2265-46ee-8080-0ec5d8423a1d",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Pseudobulk QC with BIOvar & with batch correction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d3676870-496d-4379-8d6b-acec08f1c0d7",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+    "    --cwd output/atac_seq \\\n",
+    "    --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+    "    --output_dir output/atac_seq/2_residuals \\\n",
+    "    --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+    "    --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+    "    --include_bio TRUE \\\n",
+    "    --batch_correction TRUE \\\n",
+    "    --batch_method limma \\\n",
+    "    --min_count 2 \\\n",
+    "    --celltype Ast Ex In Microglia Oligo OPC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9bad900d-768d-45ee-815a-6847e8eba32e",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Pseudobulk QC noBIOvar & with batch correction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+    "    --cwd output/atac_seq \\\n",
+    "    --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+    "    --output_dir output/atac_seq/2_residuals \\\n",
+    "    --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+    "    --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+    "    --include_bio FALSE \\\n",
+    "    --batch_correction TRUE \\\n",
+    
" --batch_method limma \\\n", + " --min_count 5\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "096f2b32-e80d-472b-9af8-5f3d4ebb9bf2", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "**Note**\n", + "For MIT data, add these parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee860bb3-d628-4255-b222-f62b3c03a91a", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "--celltype Astro Exc Inh Mic Oligo OPC \\\n", + "--suffix _50nuc \\\n", + "--input_dir output/1_files_with_sampleid_MIT" + ] + }, + { + "cell_type": "markdown", + "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", + "metadata": {}, + "source": [ + "For additional parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", + "metadata": {}, + "outputs": [], + "source": [ + "--min_count 5\n", + "--min_total_count 15\n", + "--min_prop 0.1\n", + "--min_nuclei 20" + ] + }, + { + "cell_type": "markdown", + "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 2: Format Output\n", + "### Phenotype Reformatting\n", + "\n", + "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `{celltype}_residuals.txt` | `output/2_residuals/{celltype}/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Read residuals file with proper handling of peak IDs and sample columns\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Convert to midpoint coordinates (standard for QTLtools):\n", + "```\n", + " start = floor((peak_start + peak_end) / 2)\n", + " end = start + 1\n", + "```\n", + "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample expression values\n", + "5. Sort by chromosome and position\n", + "6. Compress with `bgzip` and index with `tabix`\n", + "\n", + "**Output:** `output/3_phenotype_processing/phenotype/{celltype}_snatac_phenotype.bed.gz`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_snatac_phenotype.bed.gz` | bgzip-compressed BED with peak midpoint coordinates |\n", + "| `{celltype}_snatac_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", + "\n", + "**Use case:** Standard caQTL mapping to identify genetic variants affecting chromatin\n", + "accessibility independent of demographic factors. Compatible with FastQTL, TensorQTL, and QTLtools.\n", + "\n", + "**Timing:** <1 min" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb phenotype_formatting \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/2_residuals \\\n", + " --output_dir output/atac_seq/3_pheno_reformat \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Region Peak Filtering\n", + "\n", + "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `{celltype}_filtered_raw_counts.txt` | `output/2_residuals/{celltype}/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. 
Read filtered raw counts per cell type\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Calculate per-peak metrics:\n", + " - `peakwidth`: `end - start`\n", + " - `midpoint`: `(start + end) / 2`\n", + "4. Filter peaks overlapping target regions (includes peaks that start, end, or span boundaries):\n", + "\n", + " | Region | Coordinates | Size |\n", + " |--------|-------------|------|\n", + " | Chr7 | 28,000,000 – 28,300,000 bp | 300 kb |\n", + " | Chr11 | 85,050,000 – 86,200,000 bp | 1.15 Mb |\n", + "\n", + "5. Calculate summary statistics per peak:\n", + " - `total_count`: sum of counts across all samples\n", + " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n", + "\n", + "**Output:** `output/3_format_output/regions/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_filtered_regions.txt` | Full count matrix for peaks in target regions |\n", + "| `{celltype}_filtered_regions_summary.txt` | Peak metadata with coordinates and count statistics |\n", + "\n", + "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n", + "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n", + "\n", + "**Timing:** <1 min" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f944afdd-fffc-4b56-863f-eee89408cfa1", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/2_residuals \\\n", + " --output_dir output/atac_seq/3_region_filter \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10440301-99c6-4f0e-b6ce-efe5ac9281fb", + "metadata": {}, + "outputs": [], + "source": [ + "# Custom regions\n", + "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/2_residuals \\\n", + " --output_dir output/atac_seq \\\n", + " --celltype Ast Ex In Microglia Oligo OPC \\\n", + " --regions \"chr1:1000000-2000000,chr5:50000000-51000000\"" + ] + }, + { + "cell_type": "markdown", + "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "id": "0e17a301-cca9-49a1-843b-4248546f1f79", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Setup and global parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "# Output directory\n", + "parameter: cwd = path(\"output\")\n", + "# For cluster jobs, number of commands to run per job\n", + "parameter: job_size = 1\n", + "# Wall clock time expected\n", + "parameter: walltime = \"5h\"\n", + "# Memory expected\n", + "parameter: mem = \"16G\"\n", + "# Number of threads\n", + "parameter: numThreads = 8\n", + "# Software container\n", + "parameter: container = \"\"\n", + "\n", + "import re\n", + "parameter: entrypoint = (\n", + " 'micromamba run -a \"\" -n' + ' ' +\n", + " 
re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", + ") if container else \"\"\n", + "\n", + "from sos.utils import expand_size\n", + "cwd = path(f'{cwd:a}')" + ] + }, + { + "cell_type": "markdown", + "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `sampleid_mapping`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[sampleid_mapping]\n", + "parameter: map_file = str\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n", + "parameter: suffix = '' # e.g. '' for Xiong, '_50nuc' for Kellis\n", + "\n", + "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "output: [f'{output_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "\n", + "python: expand = \"${ }\"\n", + "\n", + " import pandas as pd\n", + " import gzip\n", + " import os\n", + " import subprocess\n", + " import csv\n", + " import numpy as np\n", + "\n", + " map_df = pd.read_csv(\"${map_file}\")\n", + " id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", + "\n", + " celltype = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}/1_files_with_sampleid\"\n", + " suffix = \"${suffix}\"\n", + "\n", + " os.makedirs(output_dir, exist_ok=True)\n", + "\n", + " def map_id(ind_id):\n", + " return id_map.get(ind_id, ind_id)\n", + " \n", + " def format_value(val):\n", + " \"\"\"Format numeric values: remove .0 from integers, keep decimals\"\"\"\n", + " if pd.isna(val):\n", + " return ''\n", + " if isinstance(val, (int, np.integer)):\n", + " return str(val)\n", + " if isinstance(val, (float, np.floating)):\n", + " if val == int(val): # Check if it's a whole number\n", + " return str(int(val))\n", + " else:\n", + " return str(val)\n", + " return str(val)\n", + "\n", + " # ── Process metadata CSV files ────────────────────────────────────────────\n", + " for ct in celltype:\n", + " fname = f\"metadata_{ct}{suffix}.csv\"\n", + " in_path = os.path.join(input_dir, fname)\n", + " out_path = os.path.join(output_dir, fname)\n", + "\n", + " if not os.path.exists(in_path):\n", + " print(f\"Warning: Metadata file not found: {in_path}\")\n", + " continue\n", + "\n", + " meta = pd.read_csv(in_path)\n", + "\n", + " if \"individualID\" not in meta.columns:\n", + " print(f\"Warning: individualID column not found in {fname}\")\n", + " continue\n", + "\n", + " # Create or update sampleid column\n", + " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", + " \n", + " # Always reorder: sampleid FIRST, then individualID, then rest\n", + " cols = meta.columns.tolist()\n", + " cols.remove(\"sampleid\")\n", + " cols.remove(\"individualID\")\n", + " new_cols = [\"sampleid\", \"individualID\"] + cols\n", + " meta = meta[new_cols]\n", + "\n", + " # Write CSV with custom formatting\n", + " with open(out_path, 'w', newline='') as f:\n", + " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", + " # Write header\n", + " writer.writerow(meta.columns)\n", + " # Write data rows with custom formatting\n", + " for _, row in meta.iterrows():\n", + " writer.writerow([format_value(val) for val in row])\n", + " \n", + " print(f\"Processed metadata: {fname}\")\n", + "\n", + " # ── Process count matrix .csv.gz files ───────────────────────────────────\n", + " for ct in 
celltype:\n", + " # Try both naming patterns: with and without underscore\n", + " patterns = [\n", + " f\"pseudobulk_peaks_counts_{ct}{suffix}.csv.gz\", # Xiong pattern\n", + " f\"pseudobulk_peaks_counts{ct}{suffix}.csv.gz\" # Kellis pattern\n", + " ]\n", + " \n", + " in_path = None\n", + " for pattern in patterns:\n", + " test_path = os.path.join(input_dir, pattern)\n", + " if os.path.exists(test_path):\n", + " in_path = test_path\n", + " fname = pattern\n", + " break\n", + " \n", + " if in_path is None:\n", + " print(f\"Warning: Count file not found for celltype {ct}\")\n", + " continue\n", + " \n", + " out_path = os.path.join(output_dir, fname)\n", + "\n", + " with gzip.open(in_path, \"rt\") as fh:\n", + " header_line = fh.readline().rstrip(\"\\n\")\n", + "\n", + " col_names = header_line.split(\",\")\n", + " peak_id_col = col_names[0]\n", + " sample_cols = col_names[1:]\n", + " new_sample_cols = [map_id(s) for s in sample_cols]\n", + " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", + "\n", + " import tempfile\n", + " temp_header = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", + " temp_header.write(new_header + \"\\n\")\n", + " temp_header.close()\n", + " \n", + " cmd = f\"zcat {in_path} | tail -n +2 | cat {temp_header.name} - | gzip -6 > {out_path}\"\n", + " subprocess.run(cmd, shell=True, check=True)\n", + " \n", + " os.unlink(temp_header.name)\n", + " print(f\"Processed counts: {fname}\")\n", + "\n", + " print(\"\\nSample ID mapping completed!\")" + ] + }, + { + "cell_type": "markdown", + "id": "f0884ae7-a851-425a-86dd-b606768a012e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `pseudobulk_qc`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[pseudobulk_qc]\n", + "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: covariates_file = str\n", + "parameter: blacklist_file = ''\n", + "parameter: include_bio = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", + "parameter: batch_correction = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", + "parameter: batch_method = \"limma\" # \"limma\" or \"combat\"\n", + "parameter: min_count = 5\n", + "parameter: min_total_count = 15\n", + "parameter: min_prop = 0.1\n", + "parameter: min_nuclei = 20\n", + "parameter: suffix = ''\n", + "\n", + "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype], \\\n", + " [f'{input_dir}/pseudobulk_peaks_counts_{ct}{suffix}.csv.gz' for ct in celltype]\n", + "output: [f'{output_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", + "\n", + "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", + "\n", + " library(edgeR)\n", + " library(limma)\n", + " library(data.table)\n", + " library(GenomicRanges)\n", + " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", + "\n", + " # ── Helper: standardize metadata column names ─────────────────────────────\n", + " rename_if_found <- function(dt, target, candidates) {\n", + " found <- intersect(candidates, colnames(dt))[1]\n", + " if (!is.na(found) && found != target) setnames(dt, found, target)\n", + " }\n", + "\n", + " standardize_meta <- function(meta) {\n", + " rename_if_found(meta, \"n_nuclei\", 
c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n", + " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n", + " rename_if_found(meta, \"med_tss_enrich\", c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n", + " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n", + " return(meta)\n", + " }\n", + "\n", + " # ── Helper: blacklist filtering ───────────────────────────────────────────\n", + " filter_blacklist <- function(mat, bed) {\n", + " peaks <- data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " bl <- fread(bed)[, 1:3]\n", + " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", + " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", + " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", + " if (length(blacklisted) > 0) {\n", + " message(\"Blacklisted peaks removed: \", length(blacklisted))\n", + " return(mat[-blacklisted, , drop=FALSE])\n", + " }\n", + " return(mat)\n", + " }\n", + "\n", + " # ── Helper: predictOffset ─────────────────────────────────────────────────\n", + " predictOffset <- function(fit) {\n", + " D <- fit$design\n", + " Dm <- D\n", + " for (col in colnames(D)) {\n", + " if (col == \"(Intercept)\") next\n", + " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", + " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", + " else\n", + " Dm[, col] <- 0\n", + " }\n", + " B <- fit$coefficients\n", + " B[is.na(B)] <- 0\n", + " B %*% t(Dm)\n", + " }\n", + "\n", + " # ── Main loop ─────────────────────────────────────────────────────────────\n", + " cts <- c(${', '.join([f\"'{x}'\" for x in celltype])})\n", + "\n", + " for (ct in cts) {\n", + " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", + " message(\"Processing: \", ct)\n", + " message(\"Mode: \", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"))\n", + " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", + " message(paste(rep(\"=\", 40), collapse=\"\"))\n", + "\n", + " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", + " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", + "\n", + " # ── 1. Load data ───────────────────────────────────────────────────\n", + " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n", + " counts_raw <- fread(sprintf(\"${input_dir}/pseudobulk_peaks_counts_%s${suffix}.csv.gz\", ct))\n", + "\n", + " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", + " rownames(counts) <- counts_raw[[1]]\n", + " rm(counts_raw)\n", + " n_original <- nrow(counts)\n", + " message(\"Loaded: \", n_original, \" peaks x \", ncol(counts), \" samples\")\n", + "\n", + " # ── 2. Standardize metadata columns ───────────────────────────────\n", + " meta <- standardize_meta(meta)\n", + "\n", + " # ── 3. Identify sample ID column ──────────────────────────────────\n", + " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", + " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", + "\n", + " # ── 4. 
Nuclei filter ──────────────────────────────────────────────\n", + " if (\"n_nuclei\" %in% colnames(meta)) {\n", + " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n", + " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", + " }\n", + " n_after_nuclei <- nrow(meta)\n", + "\n", + " # ── 5. Align samples ───────────────────────────────────────────────\n", + " common <- intersect(meta[[idcol]], colnames(counts))\n", + " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", + " meta <- meta[match(common, meta[[idcol]])]\n", + " counts <- counts[, common, drop=FALSE]\n", + " message(\"Samples after alignment: \", length(common))\n", + "\n", + " # ── 6. Blacklist filtering ─────────────────────────────────────────\n", + " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", + " counts <- filter_blacklist(counts, \"${blacklist_file}\")\n", + " message(\"Peaks after blacklist filter: \", nrow(counts))\n", + " } else {\n", + " message(\"No blacklist file provided - skipping blacklist filtering.\")\n", + " }\n", + " n_after_blacklist <- nrow(counts)\n", + "\n", + " # ── 7. Load and merge covariates ───────────────────────────────────\n", + " covs <- fread(\"${covariates_file}\")\n", + " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n", + " bio_cols <- if (as.logical(\"${include_bio}\")) c(\"msex\",\"age_death\",\"pmi\",\"study\") else c(\"pmi\",\"study\")\n", + " keep_cols <- c(id2, intersect(bio_cols, colnames(covs)))\n", + " covs <- covs[, ..keep_cols]\n", + " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n", + "\n", + " # ── CRITICAL: re-order meta back to common sample order ────────────\n", + " meta <- meta[match(common, meta[[idcol]])]\n", + "\n", + " # ── 8. Impute missing covariate values ─────────────────────────────\n", + " for (col in intersect(c(\"pmi\",\"age_death\"), colnames(meta))) {\n", + " if (any(is.na(meta[[col]]))) {\n", + " message(\"Imputing missing values for: \", col)\n", + " meta[[col]][is.na(meta[[col]])] <- median(meta[[col]], na.rm=TRUE)\n", + " }\n", + " }\n", + "\n", + " # ── 9. Compute technical metrics ──────────────────────────────────\n", + " meta$log_n_nuclei <- log1p(meta$n_nuclei)\n", + " meta$log_med_n_tot_fragment <- log1p(meta$med_n_tot_fragment)\n", + " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n", + "\n", + " # ── 10. Select model variables ────────────────────────────────────\n", + " tech_vars <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", + " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\",\"pmi\",\"study\")\n", + " bio_vars <- c(\"msex\",\"age_death\")\n", + " all_vars <- if (as.logical(\"${include_bio}\")) c(tech_vars, bio_vars) else tech_vars\n", + " all_vars <- intersect(all_vars, colnames(meta))\n", + " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n", + "\n", + " # ── 11. Drop samples with NA in model variables ────────────────────\n", + " keep_rows <- complete.cases(meta[, ..all_vars])\n", + " meta <- meta[keep_rows]\n", + " counts <- counts[, meta[[idcol]], drop=FALSE]\n", + " message(\"Valid samples for modelling: \", nrow(meta))\n", + "\n", + " # ── 12. 
Expression filtering ───────────────────────────────────────\n", + " dge <- DGEList(counts=counts, samples=meta)\n", + " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", + " message(\"Peaks before expression filter: \", nrow(dge))\n", + "\n", + " keep <- filterByExpr(dge, group=dge$samples$group,\n", + " min.count=${min_count},\n", + " min.total.count=${min_total_count},\n", + " min.prop=${min_prop})\n", + " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", + " n_after_expr <- nrow(dge)\n", + " message(\"Peaks after expression filter: \", n_after_expr)\n", + "\n", + " # Save filtered raw counts\n", + " write.table(dge$counts,\n", + " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " # ── 13. TMM normalization ──────────────────────────────────────────\n", + " dge <- calcNormFactors(dge, method=\"TMM\")\n", + "\n", + " # ── 14. Optional batch correction ─────────────────────────────────\n", + " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", + " batches <- dge$samples$sequencingBatch\n", + " batch_counts <- table(batches)\n", + " valid_batches <- names(batch_counts[batch_counts > 1])\n", + " keep_bc <- batches %in% valid_batches\n", + " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", + " batches <- batches[keep_bc]\n", + " message(\"Samples after singleton batch removal: \", ncol(dge))\n", + "\n", + " if (\"${batch_method}\" == \"combat\") {\n", + " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n", + " message(\"ComBat-seq batch correction applied.\")\n", + " } else {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"limma removeBatchEffect applied.\")\n", + " }\n", + " }\n", + "\n", + " # ── 15. Add sequencingBatch and Library to model if multi-level ───\n", + " # Insert after technical vars but before pmi/study to match original order\n", + " tech_only <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", + " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\")\n", + " other_vars <- setdiff(all_vars, tech_only) # pmi, study, msex, age_death\n", + "\n", + " batch_vars <- c()\n", + " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$sequencingBatch)) > 1) {\n", + " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", + " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", + " }\n", + "\n", + " if (\"Library\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$Library)) > 1) {\n", + " dge$samples$Library_factor <- factor(dge$samples$Library)\n", + " batch_vars <- c(batch_vars, \"Library_factor\")\n", + " }\n", + "\n", + " # Final order: technical + batch + other (pmi, study, bio)\n", + " all_vars <- c(tech_only, batch_vars, other_vars)\n", + " all_vars <- intersect(all_vars, c(colnames(dge$samples), colnames(meta)))\n", + "\n", + " # ── 16. 
Build design matrix ────────────────────────────────────────\n", + " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n", + " design <- model.matrix(form, data=dge$samples)\n", + " message(\"Formula: \", deparse(form))\n", + "\n", + " if (!is.fullrank(design)) {\n", + " message(\"Design not full rank - trimming.\")\n", + " qr_d <- qr(design)\n", + " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", + " }\n", + " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", + "\n", + " # ── 17. Voom + lmFit + eBayes ─────────────────────────────────────\n", + " v <- voom(dge, design, plot=FALSE)\n", + " fit <- lmFit(v, design)\n", + " fit <- eBayes(fit)\n", + "\n", + " # ── 18. Offset + residuals ─────────────────────────────────────────\n", + " off <- predictOffset(fit)\n", + " res <- residuals(fit, v)\n", + " final <- off + res\n", + "\n", + " # ── 19. Save outputs ───────────────────────────────────────────────\n", + " write.table(final,\n", + " file.path(outdir, paste0(ct, \"_residuals.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " saveRDS(list(\n", + " dge = dge,\n", + " offset = off,\n", + " residuals = res,\n", + " final_data = final,\n", + " valid_samples = colnames(dge),\n", + " design = design,\n", + " fit = fit,\n", + " model = form,\n", + " mode = ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"),\n", + " batch_correction = as.logical(\"${batch_correction}\"),\n", + " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\")\n", + " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", + "\n", + " # ── 20. Summary report ─────────────────────────────────────────────\n", + " sink(file.path(outdir, paste0(ct, \"_summary.txt\")))\n", + " cat(\"*** Processing Summary for\", ct, \"***\\n\\n\")\n", + "\n", + " cat(\"=== Analysis Mode ===\\n\")\n", + " cat(\"Mode:\", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"), \"\\n\")\n", + " cat(\"Batch correction:\", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"), \"\\n\")\n", + " cat(\"Model formula:\", deparse(form), \"\\n\\n\")\n", + "\n", + " cat(\"=== Filtering Parameters ===\\n\")\n", + " cat(\"Nuclei cutoff: >\", ${min_nuclei}, \"\\n\")\n", + " cat(\"Blacklist filtering:\", ifelse(\"${blacklist_file}\" != \"\", \"TRUE\", \"FALSE\"), \"\\n\")\n", + " if (\"${blacklist_file}\" != \"\") cat(\"Blacklist file:\", \"${blacklist_file}\", \"\\n\")\n", + " cat(\"min_count:\", ${min_count}, \"\\n\")\n", + " cat(\"min_total_count:\", ${min_total_count}, \"\\n\")\n", + " cat(\"min_prop:\", ${min_prop}, \"\\n\\n\")\n", + "\n", + " cat(\"=== Peak Counts ===\\n\")\n", + " cat(\"Original peak count:\", n_original, \"\\n\")\n", + " cat(\"Peaks after blacklist filtering:\", n_after_blacklist, \"\\n\")\n", + " cat(\"Peaks after expression filtering:\", n_after_expr, \"\\n\\n\")\n", + "\n", + " cat(\"=== Sample Counts ===\\n\")\n", + " cat(\"Number of samples after nuclei (>\", ${min_nuclei}, \") filtering:\", n_after_nuclei, \"\\n\")\n", + " cat(\"Number of samples in final model:\", ncol(final), \"\\n\\n\")\n", + "\n", + " cat(\"=== Technical Variables Used ===\\n\")\n", + " for (v in intersect(c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", + " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\"), all_vars))\n", + " cat(\"-\", v, \"\\n\")\n", + " if (\"sequencingBatch_factor\" %in% all_vars) cat(\"- sequencingBatch: Sequencing batch ID\\n\")\n", + " if 
(\"Library_factor\" %in% all_vars) cat(\"- Library: Library ID\\n\")\n", + "\n", + " if (as.logical(\"${include_bio}\")) {\n", + " cat(\"\\n=== Biological Variables Used ===\\n\")\n", + " for (v in intersect(c(\"msex\",\"age_death\"), all_vars))\n", + " cat(\"-\", v, \"\\n\")\n", + " } else {\n", + " cat(\"\\n=== Biological Variables Used ===\\n\")\n", + " cat(\"None (noBIOvar mode - biological variation preserved)\\n\")\n", + " }\n", + "\n", + " cat(\"\\n=== Other Variables Used ===\\n\")\n", + " if (\"pmi\" %in% all_vars) cat(\"- pmi: Post-mortem interval\\n\")\n", + " if (\"study\" %in% all_vars) cat(\"- study: Study cohort\\n\")\n", + " sink()\n", + "\n", + " # ── 21. Variable explanation report ───────────────────────────────\n", + " sink(file.path(outdir, paste0(ct, \"_variable_explanation.txt\")))\n", + " cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n", + " cat(\"## Why Log Transformation?\\n\")\n", + " cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", + " cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", + " cat(\"2. To stabilize variance across the range of values\\n\")\n", + " cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", + " cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", + " cat(\"## Variables and Their Meanings\\n\\n\")\n", + " cat(\"### Technical Variables\\n\")\n", + " cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", + " cat(\" * Filtered to include only samples with >\", ${min_nuclei}, \"nuclei\\n\")\n", + " cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", + " cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", + " cat(\" * Represents sequencing depth\\n\")\n", + " cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", + " cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", + " cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", + " cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", + " cat(\" * Measures the degree of nucleosome positioning\\n\")\n", + " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", + " cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", + " cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", + " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", + " if (\"sequencingBatch_factor\" %in% all_vars)\n", + " cat(\"- sequencingBatch: Sequencing batch ID\\n * Treated as a factor to account for batch effects\\n\\n\")\n", + " if (\"Library_factor\" %in% all_vars)\n", + " cat(\"- Library: Library preparation batch ID\\n * Treated as a factor to account for library preparation effects\\n\\n\")\n", + " if (as.logical(\"${include_bio}\")) {\n", + " cat(\"### Biological Variables\\n\")\n", + " cat(\"- msex: Sex (male=1, female=0)\\n\")\n", + " cat(\"- age_death: Age at death\\n\\n\")\n", + " }\n", + " cat(\"### Other Variables\\n\")\n", + " cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", + " cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", + " cat(\"## Relationship to voom Transformation\\n\")\n", + " cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", + " 
cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", + " cat(\"covariates, we ensure they are on a similar scale to the transformed expression data, \")\n", + " cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", + " sink()\n", + "\n", + " message(\"Completed: \", ct, \" -> \", outdir)\n", + " message(\" Peaks: \", nrow(final), \" | Samples: \", ncol(final))\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `phenotype_reformatting`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[phenotype_formatting]\n", + "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "\n", + "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", + "output: [f'{output_dir}/{ct}_snatac_phenotype.bed.gz' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + "\n", + " import os\n", + " import subprocess\n", + " import pandas as pd\n", + "\n", + " celltypes = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def read_residuals(path):\n", + " first_line = open(path).readline().rstrip(\"\\n\")\n", + " col_names = first_line.split(\"\\t\")\n", + " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", + " if df.shape[1] > len(col_names):\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names\n", + " else:\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names[1:]\n", + " return peak_ids, df\n", + "\n", + " def to_midpoint_bed(peak_ids, residuals):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " chrs = parts[0].values\n", + " starts = parts[1].astype(int).values\n", + " ends = parts[2].astype(int).values\n", + " mids = ((starts + ends) // 2).astype(int)\n", + " bed = pd.DataFrame({\n", + " \"#chr\": chrs,\n", + " \"start\": mids,\n", + " \"end\": mids + 1,\n", + " \"ID\": peak_ids\n", + " })\n", + " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", + " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", + "\n", + " def run_cmd(cmd, label):\n", + " r = subprocess.run(cmd, capture_output=True)\n", + " if r.returncode != 0:\n", + " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", + " else:\n", + " print(f\"{label}: OK\")\n", + "\n", + " for ct in celltypes:\n", + " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", + "\n", + " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", + " os.makedirs(out_dir, exist_ok=True)\n", + "\n", + " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n", + " if not os.path.exists(res_path):\n", + " print(f\"WARNING: {res_path} not found, skipping.\")\n", + " continue\n", + "\n", + " peak_ids, residuals = read_residuals(res_path)\n", + " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", + "\n", + " bed = to_midpoint_bed(peak_ids, residuals)\n", + " out_bed = os.path.join(out_dir, 
f\"{ct}_snatac_phenotype.bed\")\n", + " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", + " print(f\"Written: {out_bed}\")\n", + "\n", + " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", + " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", + "\n", + " print(f\"Completed: {ct} -> {out_dir}\")" + ] + }, + { + "cell_type": "markdown", + "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `region_filtering`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[region_filtering]\n", + "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: regions = \"chr7:28000000-28300000,chr11:85050000-86200000\"\n", + "\n", + "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in celltype]\n", + "output: [f'{output_dir}/{ct}_filtered_regions_of_interest.txt' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + "\n", + " import os\n", + " import pandas as pd\n", + "\n", + " celltypes = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def parse_regions(region_str):\n", + " result = []\n", + " for r in region_str.split(\",\"):\n", + " chrom, coords = r.strip().split(\":\")\n", + " start, end = coords.split(\"-\")\n", + " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n", + " return result\n", + "\n", + " regions = parse_regions(\"${regions}\")\n", + "\n", + " def parse_peak_ids(peak_ids):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " return pd.DataFrame({\n", + " \"chr\": parts[0].values,\n", + " \"start\": parts[1].astype(int).values,\n", + " \"end\": parts[2].astype(int).values\n", + " })\n", + "\n", + " def overlaps_region(chr_col, start_col, end_col, reg):\n", + " return (\n", + " (chr_col == reg[\"chr\"]) &\n", + " (start_col < reg[\"end\"]) &\n", + " (end_col > reg[\"start\"])\n", + " )\n", + "\n", + " for ct in celltypes:\n", + " print(f\"\\n{'='*40}\\nRegion Filtering: {ct}\\n{'='*40}\")\n", + "\n", + " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n", + " os.makedirs(reg_dir, exist_ok=True)\n", + "\n", + " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n", + " if not os.path.exists(counts_path):\n", + " print(f\"WARNING: {counts_path} not found, skipping.\")\n", + " continue\n", + "\n", + " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n", + " df.index.name = \"peak_id\"\n", + " df = df.reset_index()\n", + "\n", + " coords = parse_peak_ids(df[\"peak_id\"].values)\n", + " df[\"chr\"] = coords[\"chr\"].values\n", + " df[\"start\"] = coords[\"start\"].values\n", + " df[\"end\"] = coords[\"end\"].values\n", + " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n", + " df[\"midpoint\"] = ((df[\"start\"] + df[\"end\"]) / 2).astype(int)\n", + "\n", + " # Filter to regions of interest\n", + " mask = pd.Series(False, index=df.index)\n", + " for reg in regions:\n", + " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n", + "\n", + " region_df = df[mask].copy()\n", + " print(f\"Peaks in regions of interest: 
{len(region_df)}\")\n", + "\n", + " # Save full filtered data\n", + " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n", + " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n", + " print(f\"Saved: {full_out}\")\n", + "\n", + " # Save summary\n", + " meta_cols = [\"peak_id\",\"chr\",\"start\",\"end\",\"peakwidth\",\"midpoint\"]\n", + " count_cols = [c for c in region_df.columns if c not in meta_cols]\n", + " count_mat = region_df[count_cols].apply(pd.to_numeric, errors=\"coerce\")\n", + "\n", + " summary = region_df[meta_cols].copy()\n", + " summary[\"total_count\"] = count_mat.sum(axis=1).values\n", + " summary[\"weighted_count\"] = (summary[\"total_count\"] / summary[\"peakwidth\"]).values\n", + "\n", + " summary_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest_summary.txt\")\n", + " summary.to_csv(summary_out, sep=\"\\t\", index=False)\n", + " print(f\"Saved: {summary_out}\")\n", + "\n", + " print(f\"Completed: {ct} -> {reg_dir}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.4.3" + }, + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "sos", + "", + "" + ] + ], + "version": "" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 8a7fa497c24f982c26134e203df1f961a1a66b67 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 19 Feb 2026 14:20:48 -0500 Subject: [PATCH 02/12] Delete code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb Made some changes --- .../QC/snatacseq_preprocessing.ipynb | 1454 ----------------- 1 file changed, 1454 deletions(-) delete mode 100644 code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb deleted file mode 100644 index 6b98233ee..000000000 --- a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb +++ /dev/null @@ -1,1454 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Single-nucleus ATAC-seq Preprocessing Pipeline\n", - "\n", - "## Overview\n", - "\n", - "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) pseudobulk peak count data\n", - "for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies.\n", - "\n", - "**Goals:**\n", - "- Transform raw pseudobulk peak counts into analysis-ready formats\n", - "- Remove technical confounders while optionally preserving biological covariates\n", - "- Generate QTL-ready phenotype files or region-specific datasets\n", - "\n", - "## Pipeline Structure\n", - "```\n", - "Step 0: Sample ID Mapping\n", - "↓\n", - "Step 1: Pseudobulk QC\n", - "├── Option A: BIOvar (regress out technical + biological covariates)\n", - "└── Option B: noBIOvar (regress out technical covariates only)\n", - "↓ (optional)\n", - "Batch Correction (ComBat-seq or limma::removeBatchEffect)\n", - "↓\n", - "Step 2: Format Output\n", - "├── Format A: Phenotype Reformatting → BED (genome-wide caQTL mapping)\n", - "└── Format B: Region Peak Filtering → TSV (locus-specific analysis)\n", - "\n", - "```\n", - "\n", - "## Input Files\n", - "\n", - "All input files required to run this pipeline can be downloaded\n", - 
"[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n", - "\n", - "| File | Used in |\n", - "|------|---------|\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | Step 0, Step 1 |\n", - "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", - "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", - "| `rosmap_cov.txt` | Step 1 |\n", - "| `hg38-blacklist.v2.bed.gz` | Step 1 |\n", - "| `SampleSheet.csv` | Step 1 (batch correction only) |\n", - "| `sampleSheetAfterQc.csv` | Step 1 (batch correction only) |\n", - "\n", - "\n", - "## Minimal Working Example" - ] - }, - { - "cell_type": "markdown", - "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 0: Sample ID Mapping\n", - "\n", - "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", - "across metadata and count matrix files.\n", - "\n", - "### Input\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", - "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Per-cell-type peak count matrices |\n", - "\n", - "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n", - "\n", - "### Process\n", - "\n", - "**Part 1 — Metadata files**\n", - "\n", - "For each `metadata_{celltype}.csv`:\n", - "1. Look up each `individualID` in the mapping reference\n", - "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", - "3. Insert `sampleid` as the first column\n", - "4. Save updated file\n", - "\n", - "**Part 2 — Count matrix files**\n", - "\n", - "For each `pseudobulk_peaks_counts_{celltype}.csv.gz`:\n", - "1. Extract the header row (column names only)\n", - "2. Keep `peak_id` (first column) unchanged\n", - "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n", - " otherwise keep original\n", - "4. Write new header and stream data rows unchanged\n", - "5. 
Recompress with gzip\n", - "\n", - "### Output\n", - "\n", - "Output directory: `output/1_files_with_sampleid/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Count matrices with mapped column headers |\n", - "\n", - "**Timing:** < 1 min\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", - " --cwd output/atac_seq/1_files_with_sampleid \\\n", - " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", - " --input_dir data/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/1_files_with_sampleid \\\n", - " --celltype Ast Ex In Microglia Oligo OPC\n", - "\n", - "\n", - "# For MIT input data\n", - "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", - " --cwd output/atac_seq/1_files_with_sampleid \\\n", - " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", - " --input_dir data/atac_seq/1_files_with_sampleid_MIT \\\n", - " --output_dir output/atac_seq/1_files_with_sampleid \\\n", - " --celltype Astro Exc Inh Mic Oligo OPC \\\n", - " --suffix _50nuc" - ] - }, - { - "cell_type": "markdown", - "id": "5540a4da-843a-4789-8123-47911cf519c5", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 1: Pseudobulk QC\n", - "\n", - "Two approaches are available depending on whether biological covariates should be regressed out.\n", - "Both options support an **optional batch correction** step after filtering and normalization.\n", - "\n", - "\n", - "### Option A: With Biological Covariates (BIOvar)\n", - "\n", - "Use when residuals should be adjusted for all technical **and** biological covariates (sex, age, PMI).\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | `1_files_with_sampleid/` |\n", - "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n", - "| `rosmap_cov.txt` | `data/` |\n", - "| `hg38-blacklist.v2.bed.gz` | `data/` |\n", - "| `SampleSheet.csv` *(batch correction only)* | `data/` |\n", - "| `sampleSheetAfterQc.csv` *(batch correction only)* | `data/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Load pseudobulk peak count matrix and metadata per cell type\n", - "2. Filter samples with fewer than 20 nuclei\n", - "3. Calculate technical QC metrics per sample:\n", - " - `log_n_nuclei`: log-transformed nuclei count\n", - " - `med_nucleosome_signal`: median nucleosome signal\n", - " - `med_tss_enrich`: median TSS enrichment score\n", - " - `log_med_n_tot_fragment`: log-transformed median total fragments\n", - " - `log_total_unique_peaks`: log-transformed unique peak count\n", - "4. Filter blacklisted genomic regions\n", - "5. Merge with demographic covariates (`msex`, `age_death`, `pmi`, `study`)\n", - "6. Apply expression filtering (`filterByExpr`):\n", - " - `min_count = 5`: minimum reads in at least one sample\n", - " - `min_total_count = 15`: minimum total reads across all samples\n", - " - `min_prop = 0.1`: peak expressed in ≥10% of samples\n", - "7. TMM normalization\n", - "8. *(Optional)* Batch correction — see [Batch Correction](#batch-correction-optional) below\n", - "9. 
Fit linear model (`voom` + `lmFit`):~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich\n", - "\n", - "log_med_n_tot_fragment + log_total_unique_peaks\n", - "sequencingBatch + msex + age_death + pmi + study\n", - "\n", - " > If batch correction was applied, `sequencingBatch` is removed from the model.\n", - "10. Compute residuals adjusted for all covariates\n", - "11. Compute final adjusted values: `offset + residuals`\n", - " - `offset`: predicted expression at median/reference covariate values\n", - " - `residuals`: unexplained variation after removing all covariate effects\n", - "\n", - "**Output:** `output/2_residuals/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n", - "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n", - "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", - "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n", - "\n", - "**Covariates regressed out:**\n", - "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n", - "- Biological: sex (`msex`), age at death (`age_death`), post-mortem interval (`pmi`), study cohort\n", - "\n", - "### Option B: Without Biological Covariates (noBIOvar)\n", - "\n", - "Use when biological variation should be preserved (e.g., age/sex comparisons, region-specific analyses).\n", - "\n", - "**Input:** Same as Option A.\n", - "\n", - "**Process:**\n", - "\n", - "Steps 1–8 are identical to Option A. Key differences at the modelling stage:\n", - "- `msex` and `age_death` are **excluded** from the model\n", - "- `med_peakwidth` (weighted median peak width per sample) is added as a technical covariate\n", - "\n", - "**Model formula:**\n", - "```\n", - "Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch + pmi + study\n", - "```\n", - "\n", - "**Output:** `output/2_residuals/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n", - "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n", - "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", - "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n", - "\n", - "**Variables deliberately NOT regressed out:**\n", - "- Sex (`msex`)\n", - "- Age at death (`age_death`)\n", - "\n", - "**Timing:** <5 min per celltype" - ] - }, - { - "cell_type": "markdown", - "id": "21f80085-6d2c-4e1c-af35-454382d94de1", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC with BIOVar" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8569d816-d292-4512-85b6-fcd3ea1c9ba7", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio TRUE \\\n", - " --batch_correction FALSE \\\n", - " --min_count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - 
"cell_type": "markdown", - "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC noBIOvar " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio FALSE \\\n", - " --batch_correction FALSE \\\n", - " --min_count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "25e96ad2-1b75-43d0-978e-0757bc11f135", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Batch Correction (Optional)\n", - "\n", - "Applies to both Option A and Option B. Runs between TMM normalization and model fitting.\n", - "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n", - "\n", - "> When batch correction is applied, `sequencingBatch` is **removed** from the model formula\n", - "> since batch variance has already been removed from the counts.\n", - "\n", - "**Method comparison:**\n", - "\n", - "| | ComBat-seq | limma `removeBatchEffect` |\n", - "|---|---|---|\n", - "| **Operates on** | Raw integer counts | log-CPM values |\n", - "| **Mean-variance modelling** | Yes | No |\n", - "| **Best for** | Large, balanced batches | Small or fragmented batches |\n", - "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n", - "\n", - "**ComBat-seq:**\n", - "```r\n", - "adjusted_counts <- ComBat_seq(counts = dge$counts, batch = batches)\n", - "```\n", - "\n", - "**limma `removeBatchEffect`:**\n", - "```r\n", - "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n", - "adj_logCPM <- removeBatchEffect(logCPM, batch = batches, design = model.matrix(~1, data = dge$samples))\n", - "adjusted_counts <- round(pmax(2^adj_logCPM * mean(dge$samples$lib.size) / 1e6, 0))\n", - "```\n", - "\n", - "**Additional filtering applied before correction:**\n", - "- Singleton batches (only 1 sample) are removed\n", - "- Samples absent from the batch sheet are dropped\n", - "\n", - "**Additional output when batch correction is enabled:**\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_results.rds` | Includes `batch_adjusted_counts` and `batch_method` fields |\n" - ] - }, - { - "cell_type": "markdown", - "id": "4d582c85-2265-46ee-8080-0ec5d8423a1d", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC with BIOvar & with batch correction" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "d3676870-496d-4379-8d6b-acec08f1c0d7", - "metadata": { - "kernel": "SoS" - }, - "outputs": [ - { - "ename": "ERROR", - "evalue": "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\n", - "output_type": "error", - "traceback": [ - "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\nTraceback:\n" - ] - } - ], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file 
data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio TRUE \\\n", - " --batch_correction TRUE \\\n", - " --batch_method limma \\\n", - " --min_count 2\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "9bad900d-768d-45ee-815a-6847e8eba32e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC noBIOvar & with batch correction" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio FALSE \\\n", - " --batch_correction TRUE \\\n", - " --batch_method limma \\\n", - " --min_count 5\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "096f2b32-e80d-472b-9af8-5f3d4ebb9bf2", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "**Note**\n", - "For MIT data, add these parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ee860bb3-d628-4255-b222-f62b3c03a91a", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "--celltype Astro Exc Inh Mic Oligo OPC \\\n", - "--suffix _50nuc \\\n", - "--input_dir output/1_files_with_sampleid_MIT" - ] - }, - { - "cell_type": "markdown", - "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", - "metadata": {}, - "source": [ - "For additional parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", - "metadata": {}, - "outputs": [], - "source": [ - "--min_count 5\n", - "--min_total_count 15\n", - "--min_prop 0.1\n", - "--min_nuclei 20" - ] - }, - { - "cell_type": "markdown", - "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 2: Format Output\n", - "### Phenotype Reformatting\n", - "\n", - "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `{celltype}_residuals.txt` | `output/2_residuals/{celltype}/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Read residuals file with proper handling of peak IDs and sample columns\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Convert to midpoint coordinates (standard for QTLtools):\n", - "```\n", - " start = floor((peak_start + peak_end) / 2)\n", - " end = start + 1\n", - "```\n", - "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample expression values\n", - "5. Sort by chromosome and position\n", - "6. 
Compress with `bgzip` and index with `tabix`\n", - "\n", - "**Output:** `output/3_phenotype_processing/phenotype/{celltype}_snatac_phenotype.bed.gz`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_snatac_phenotype.bed.gz` | bgzip-compressed BED with peak midpoint coordinates |\n", - "| `{celltype}_snatac_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", - "\n", - "**Use case:** Standard caQTL mapping to identify genetic variants affecting chromatin\n", - "accessibility independent of demographic factors. Compatible with FastQTL, TensorQTL, and QTLtools.\n", - "\n", - "**Timing:** <1 min" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb phenotype_formatting \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/2_residuals \\\n", - " --output_dir output/atac_seq/3_pheno_reformat \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Region Peak Filtering\n", - "\n", - "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `{celltype}_filtered_raw_counts.txt` | `output/2_residuals/{celltype}/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Read filtered raw counts per cell type\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Calculate per-peak metrics:\n", - " - `peakwidth`: `end - start`\n", - " - `midpoint`: `(start + end) / 2`\n", - "4. Filter peaks overlapping target regions (includes peaks that start, end, or span boundaries):\n", - "\n", - " | Region | Coordinates | Size |\n", - " |--------|-------------|------|\n", - " | Chr7 | 28,000,000 – 28,300,000 bp | 300 kb |\n", - " | Chr11 | 85,050,000 – 86,200,000 bp | 1.15 Mb |\n", - "\n", - "5. 
Calculate summary statistics per peak:\n", - " - `total_count`: sum of counts across all samples\n", - " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n", - "\n", - "**Output:** `output/3_format_output/regions/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_filtered_regions.txt` | Full count matrix for peaks in target regions |\n", - "| `{celltype}_filtered_regions_summary.txt` | Peak metadata with coordinates and count statistics |\n", - "\n", - "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n", - "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n", - "\n", - "**Timing:** <1 min" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f944afdd-fffc-4b56-863f-eee89408cfa1", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/2_residuals \\\n", - " --output_dir output/atac_seq/3_region_filter \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "10440301-99c6-4f0e-b6ce-efe5ac9281fb", - "metadata": {}, - "outputs": [], - "source": [ - "# Custom regions\n", - "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/2_residuals \\\n", - " --output_dir output/atac_seq \\\n", - " --celltype Ast Ex In Microglia Oligo OPC \\\n", - " --regions \"chr1:1000000-2000000,chr5:50000000-51000000\"" - ] - }, - { - "cell_type": "markdown", - "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "id": "0e17a301-cca9-49a1-843b-4248546f1f79", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Setup and global parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "# Output directory\n", - "parameter: cwd = path(\"output\")\n", - "# For cluster jobs, number of commands to run per job\n", - "parameter: job_size = 1\n", - "# Wall clock time expected\n", - "parameter: walltime = \"5h\"\n", - "# Memory expected\n", - "parameter: mem = \"16G\"\n", - "# Number of threads\n", - "parameter: numThreads = 8\n", - "# Software container\n", - "parameter: container = \"\"\n", - "\n", - "import re\n", - "parameter: entrypoint = (\n", - " 'micromamba run -a \"\" -n' + ' ' +\n", - " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", - ") if container else \"\"\n", - "\n", - "from sos.utils import expand_size\n", - "cwd = path(f'{cwd:a}')" - ] - }, - { - "cell_type": "markdown", - "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `sampleid_mapping`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - 
"[sampleid_mapping]\n", - "parameter: map_file = str\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n", - "parameter: suffix = '' # e.g. '' for Xiong, '_50nuc' for Kellis\n", - "\n", - "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "output: [f'{output_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "\n", - "python: expand = \"${ }\"\n", - "\n", - " import pandas as pd\n", - " import gzip\n", - " import os\n", - " import subprocess\n", - " import csv\n", - " import numpy as np\n", - "\n", - " map_df = pd.read_csv(\"${map_file}\")\n", - " id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", - "\n", - " celltype = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}/1_files_with_sampleid\"\n", - " suffix = \"${suffix}\"\n", - "\n", - " os.makedirs(output_dir, exist_ok=True)\n", - "\n", - " def map_id(ind_id):\n", - " return id_map.get(ind_id, ind_id)\n", - " \n", - " def format_value(val):\n", - " \"\"\"Format numeric values: remove .0 from integers, keep decimals\"\"\"\n", - " if pd.isna(val):\n", - " return ''\n", - " if isinstance(val, (int, np.integer)):\n", - " return str(val)\n", - " if isinstance(val, (float, np.floating)):\n", - " if val == int(val): # Check if it's a whole number\n", - " return str(int(val))\n", - " else:\n", - " return str(val)\n", - " return str(val)\n", - "\n", - " # ── Process metadata CSV files ────────────────────────────────────────────\n", - " for ct in celltype:\n", - " fname = f\"metadata_{ct}{suffix}.csv\"\n", - " in_path = os.path.join(input_dir, fname)\n", - " out_path = os.path.join(output_dir, fname)\n", - "\n", - " if not os.path.exists(in_path):\n", - " print(f\"Warning: Metadata file not found: {in_path}\")\n", - " continue\n", - "\n", - " meta = pd.read_csv(in_path)\n", - "\n", - " if \"individualID\" not in meta.columns:\n", - " print(f\"Warning: individualID column not found in {fname}\")\n", - " continue\n", - "\n", - " # Create or update sampleid column\n", - " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", - " \n", - " # Always reorder: sampleid FIRST, then individualID, then rest\n", - " cols = meta.columns.tolist()\n", - " cols.remove(\"sampleid\")\n", - " cols.remove(\"individualID\")\n", - " new_cols = [\"sampleid\", \"individualID\"] + cols\n", - " meta = meta[new_cols]\n", - "\n", - " # Write CSV with custom formatting\n", - " with open(out_path, 'w', newline='') as f:\n", - " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", - " # Write header\n", - " writer.writerow(meta.columns)\n", - " # Write data rows with custom formatting\n", - " for _, row in meta.iterrows():\n", - " writer.writerow([format_value(val) for val in row])\n", - " \n", - " print(f\"Processed metadata: {fname}\")\n", - "\n", - " # ── Process count matrix .csv.gz files ───────────────────────────────────\n", - " for ct in celltype:\n", - " # Try both naming patterns: with and without underscore\n", - " patterns = [\n", - " f\"pseudobulk_peaks_counts_{ct}{suffix}.csv.gz\", # Xiong pattern\n", - " f\"pseudobulk_peaks_counts{ct}{suffix}.csv.gz\" # Kellis pattern\n", - " ]\n", - " \n", - " in_path = None\n", - " for pattern in patterns:\n", - " test_path = os.path.join(input_dir, pattern)\n", - " if os.path.exists(test_path):\n", - " in_path = test_path\n", - " fname = pattern\n", - " break\n", - " \n", - " if in_path is None:\n", - " print(f\"Warning: Count file not 
found for celltype {ct}\")\n", - " continue\n", - " \n", - " out_path = os.path.join(output_dir, fname)\n", - "\n", - " with gzip.open(in_path, \"rt\") as fh:\n", - " header_line = fh.readline().rstrip(\"\\n\")\n", - "\n", - " col_names = header_line.split(\",\")\n", - " peak_id_col = col_names[0]\n", - " sample_cols = col_names[1:]\n", - " new_sample_cols = [map_id(s) for s in sample_cols]\n", - " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", - "\n", - " import tempfile\n", - " temp_header = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", - " temp_header.write(new_header + \"\\n\")\n", - " temp_header.close()\n", - " \n", - " cmd = f\"zcat {in_path} | tail -n +2 | cat {temp_header.name} - | gzip -6 > {out_path}\"\n", - " subprocess.run(cmd, shell=True, check=True)\n", - " \n", - " os.unlink(temp_header.name)\n", - " print(f\"Processed counts: {fname}\")\n", - "\n", - " print(\"\\nSample ID mapping completed!\")" - ] - }, - { - "cell_type": "markdown", - "id": "f0884ae7-a851-425a-86dd-b606768a012e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `pseudobulk_qc`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[pseudobulk_qc]\n", - "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: covariates_file = str\n", - "parameter: blacklist_file = ''\n", - "parameter: include_bio = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", - "parameter: batch_correction = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", - "parameter: batch_method = \"limma\" # \"limma\" or \"combat\"\n", - "parameter: min_count = 5\n", - "parameter: min_total_count = 15\n", - "parameter: min_prop = 0.1\n", - "parameter: min_nuclei = 20\n", - "parameter: suffix = ''\n", - "\n", - "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype], \\\n", - " [f'{input_dir}/pseudobulk_peaks_counts_{ct}{suffix}.csv.gz' for ct in celltype]\n", - "output: [f'{output_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", - "\n", - "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", - "\n", - " library(edgeR)\n", - " library(limma)\n", - " library(data.table)\n", - " library(GenomicRanges)\n", - " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", - "\n", - " # ── Helper: standardize metadata column names ─────────────────────────────\n", - " rename_if_found <- function(dt, target, candidates) {\n", - " found <- intersect(candidates, colnames(dt))[1]\n", - " if (!is.na(found) && found != target) setnames(dt, found, target)\n", - " }\n", - "\n", - " standardize_meta <- function(meta) {\n", - " rename_if_found(meta, \"n_nuclei\", c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n", - " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n", - " rename_if_found(meta, \"med_tss_enrich\", c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n", - " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n", - " return(meta)\n", - " }\n", - "\n", - " # ── Helper: blacklist filtering ───────────────────────────────────────────\n", - " filter_blacklist <- 
function(mat, bed) {\n", - " peaks <- data.table(id = rownames(mat))\n", - " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", - " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " bl <- fread(bed)[, 1:3]\n", - " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", - " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", - " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", - " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", - " if (length(blacklisted) > 0) {\n", - " message(\"Blacklisted peaks removed: \", length(blacklisted))\n", - " return(mat[-blacklisted, , drop=FALSE])\n", - " }\n", - " return(mat)\n", - " }\n", - "\n", - " # ── Helper: predictOffset ─────────────────────────────────────────────────\n", - " predictOffset <- function(fit) {\n", - " D <- fit$design\n", - " Dm <- D\n", - " for (col in colnames(D)) {\n", - " if (col == \"(Intercept)\") next\n", - " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", - " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", - " else\n", - " Dm[, col] <- 0\n", - " }\n", - " B <- fit$coefficients\n", - " B[is.na(B)] <- 0\n", - " B %*% t(Dm)\n", - " }\n", - "\n", - " # ── Main loop ─────────────────────────────────────────────────────────────\n", - " cts <- c(${', '.join([f\"'{x}'\" for x in celltype])})\n", - "\n", - " for (ct in cts) {\n", - " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", - " message(\"Processing: \", ct)\n", - " message(\"Mode: \", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"))\n", - " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", - " message(paste(rep(\"=\", 40), collapse=\"\"))\n", - "\n", - " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", - " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", - "\n", - " # ── 1. Load data ───────────────────────────────────────────────────\n", - " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n", - " counts_raw <- fread(sprintf(\"${input_dir}/pseudobulk_peaks_counts_%s${suffix}.csv.gz\", ct))\n", - "\n", - " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", - " rownames(counts) <- counts_raw[[1]]\n", - " rm(counts_raw)\n", - " n_original <- nrow(counts)\n", - " message(\"Loaded: \", n_original, \" peaks x \", ncol(counts), \" samples\")\n", - "\n", - " # ── 2. Standardize metadata columns ───────────────────────────────\n", - " meta <- standardize_meta(meta)\n", - "\n", - " # ── 3. Identify sample ID column ──────────────────────────────────\n", - " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", - " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", - "\n", - " # ── 4. Nuclei filter ──────────────────────────────────────────────\n", - " if (\"n_nuclei\" %in% colnames(meta)) {\n", - " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n", - " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", - " }\n", - " n_after_nuclei <- nrow(meta)\n", - "\n", - " # ── 5. 
Align samples ───────────────────────────────────────────────\n", - " common <- intersect(meta[[idcol]], colnames(counts))\n", - " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", - " meta <- meta[match(common, meta[[idcol]])]\n", - " counts <- counts[, common, drop=FALSE]\n", - " message(\"Samples after alignment: \", length(common))\n", - "\n", - " # ── 6. Blacklist filtering ─────────────────────────────────────────\n", - " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", - " counts <- filter_blacklist(counts, \"${blacklist_file}\")\n", - " message(\"Peaks after blacklist filter: \", nrow(counts))\n", - " } else {\n", - " message(\"No blacklist file provided - skipping blacklist filtering.\")\n", - " }\n", - " n_after_blacklist <- nrow(counts)\n", - "\n", - " # ── 7. Load and merge covariates ───────────────────────────────────\n", - " covs <- fread(\"${covariates_file}\")\n", - " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n", - " bio_cols <- if (as.logical(\"${include_bio}\")) c(\"msex\",\"age_death\",\"pmi\",\"study\") else c(\"pmi\",\"study\")\n", - " keep_cols <- c(id2, intersect(bio_cols, colnames(covs)))\n", - " covs <- covs[, ..keep_cols]\n", - " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n", - "\n", - " # ── CRITICAL: re-order meta back to common sample order ────────────\n", - " meta <- meta[match(common, meta[[idcol]])]\n", - "\n", - " # ── 8. Impute missing covariate values ─────────────────────────────\n", - " for (col in intersect(c(\"pmi\",\"age_death\"), colnames(meta))) {\n", - " if (any(is.na(meta[[col]]))) {\n", - " message(\"Imputing missing values for: \", col)\n", - " meta[[col]][is.na(meta[[col]])] <- median(meta[[col]], na.rm=TRUE)\n", - " }\n", - " }\n", - "\n", - " # ── 9. Compute technical metrics ──────────────────────────────────\n", - " meta$log_n_nuclei <- log1p(meta$n_nuclei)\n", - " meta$log_med_n_tot_fragment <- log1p(meta$med_n_tot_fragment)\n", - " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n", - "\n", - " # ── 10. Select model variables ────────────────────────────────────\n", - " tech_vars <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", - " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\",\"pmi\",\"study\")\n", - " bio_vars <- c(\"msex\",\"age_death\")\n", - " all_vars <- if (as.logical(\"${include_bio}\")) c(tech_vars, bio_vars) else tech_vars\n", - " all_vars <- intersect(all_vars, colnames(meta))\n", - " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n", - "\n", - " # ── 11. Drop samples with NA in model variables ────────────────────\n", - " keep_rows <- complete.cases(meta[, ..all_vars])\n", - " meta <- meta[keep_rows]\n", - " counts <- counts[, meta[[idcol]], drop=FALSE]\n", - " message(\"Valid samples for modelling: \", nrow(meta))\n", - "\n", - " # ── 12. 
Expression filtering ───────────────────────────────────────\n", - " dge <- DGEList(counts=counts, samples=meta)\n", - " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", - " message(\"Peaks before expression filter: \", nrow(dge))\n", - "\n", - " keep <- filterByExpr(dge, group=dge$samples$group,\n", - " min.count=${min_count},\n", - " min.total.count=${min_total_count},\n", - " min.prop=${min_prop})\n", - " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", - " n_after_expr <- nrow(dge)\n", - " message(\"Peaks after expression filter: \", n_after_expr)\n", - "\n", - " # Save filtered raw counts\n", - " write.table(dge$counts,\n", - " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " # ── 13. TMM normalization ──────────────────────────────────────────\n", - " dge <- calcNormFactors(dge, method=\"TMM\")\n", - "\n", - " # ── 14. Optional batch correction ─────────────────────────────────\n", - " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", - " batches <- dge$samples$sequencingBatch\n", - " batch_counts <- table(batches)\n", - " valid_batches <- names(batch_counts[batch_counts > 1])\n", - " keep_bc <- batches %in% valid_batches\n", - " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", - " batches <- batches[keep_bc]\n", - " message(\"Samples after singleton batch removal: \", ncol(dge))\n", - "\n", - " if (\"${batch_method}\" == \"combat\") {\n", - " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n", - " message(\"ComBat-seq batch correction applied.\")\n", - " } else {\n", - " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", - " dge$counts <- round(pmax(2^logCPM, 0))\n", - " message(\"limma removeBatchEffect applied.\")\n", - " }\n", - " }\n", - "\n", - " # ── 15. Add sequencingBatch and Library to model if multi-level ───\n", - " # Insert after technical vars but before pmi/study to match original order\n", - " tech_only <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", - " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\")\n", - " other_vars <- setdiff(all_vars, tech_only) # pmi, study, msex, age_death\n", - "\n", - " batch_vars <- c()\n", - " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$sequencingBatch)) > 1) {\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", - " }\n", - "\n", - " if (\"Library\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$Library)) > 1) {\n", - " dge$samples$Library_factor <- factor(dge$samples$Library)\n", - " batch_vars <- c(batch_vars, \"Library_factor\")\n", - " }\n", - "\n", - " # Final order: technical + batch + other (pmi, study, bio)\n", - " all_vars <- c(tech_only, batch_vars, other_vars)\n", - " all_vars <- intersect(all_vars, c(colnames(dge$samples), colnames(meta)))\n", - "\n", - " # ── 16. 
Build design matrix ────────────────────────────────────────\n", - " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n", - " design <- model.matrix(form, data=dge$samples)\n", - " message(\"Formula: \", deparse(form))\n", - "\n", - " if (!is.fullrank(design)) {\n", - " message(\"Design not full rank - trimming.\")\n", - " qr_d <- qr(design)\n", - " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", - " }\n", - " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", - "\n", - " # ── 17. Voom + lmFit + eBayes ─────────────────────────────────────\n", - " v <- voom(dge, design, plot=FALSE)\n", - " fit <- lmFit(v, design)\n", - " fit <- eBayes(fit)\n", - "\n", - " # ── 18. Offset + residuals ─────────────────────────────────────────\n", - " off <- predictOffset(fit)\n", - " res <- residuals(fit, v)\n", - " final <- off + res\n", - "\n", - " # ── 19. Save outputs ───────────────────────────────────────────────\n", - " write.table(final,\n", - " file.path(outdir, paste0(ct, \"_residuals.txt\")),\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " saveRDS(list(\n", - " dge = dge,\n", - " offset = off,\n", - " residuals = res,\n", - " final_data = final,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = form,\n", - " mode = ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"),\n", - " batch_correction = as.logical(\"${batch_correction}\"),\n", - " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\")\n", - " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", - "\n", - " # ── 20. Summary report ─────────────────────────────────────────────\n", - " sink(file.path(outdir, paste0(ct, \"_summary.txt\")))\n", - " cat(\"*** Processing Summary for\", ct, \"***\\n\\n\")\n", - "\n", - " cat(\"=== Analysis Mode ===\\n\")\n", - " cat(\"Mode:\", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"), \"\\n\")\n", - " cat(\"Batch correction:\", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"), \"\\n\")\n", - " cat(\"Model formula:\", deparse(form), \"\\n\\n\")\n", - "\n", - " cat(\"=== Filtering Parameters ===\\n\")\n", - " cat(\"Nuclei cutoff: >\", ${min_nuclei}, \"\\n\")\n", - " cat(\"Blacklist filtering:\", ifelse(\"${blacklist_file}\" != \"\", \"TRUE\", \"FALSE\"), \"\\n\")\n", - " if (\"${blacklist_file}\" != \"\") cat(\"Blacklist file:\", \"${blacklist_file}\", \"\\n\")\n", - " cat(\"min_count:\", ${min_count}, \"\\n\")\n", - " cat(\"min_total_count:\", ${min_total_count}, \"\\n\")\n", - " cat(\"min_prop:\", ${min_prop}, \"\\n\\n\")\n", - "\n", - " cat(\"=== Peak Counts ===\\n\")\n", - " cat(\"Original peak count:\", n_original, \"\\n\")\n", - " cat(\"Peaks after blacklist filtering:\", n_after_blacklist, \"\\n\")\n", - " cat(\"Peaks after expression filtering:\", n_after_expr, \"\\n\\n\")\n", - "\n", - " cat(\"=== Sample Counts ===\\n\")\n", - " cat(\"Number of samples after nuclei (>\", ${min_nuclei}, \") filtering:\", n_after_nuclei, \"\\n\")\n", - " cat(\"Number of samples in final model:\", ncol(final), \"\\n\\n\")\n", - "\n", - " cat(\"=== Technical Variables Used ===\\n\")\n", - " for (v in intersect(c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", - " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\"), all_vars))\n", - " cat(\"-\", v, \"\\n\")\n", - " if (\"sequencingBatch_factor\" %in% all_vars) cat(\"- sequencingBatch: Sequencing batch ID\\n\")\n", - " if 
(\"Library_factor\" %in% all_vars) cat(\"- Library: Library ID\\n\")\n", - "\n", - " if (as.logical(\"${include_bio}\")) {\n", - " cat(\"\\n=== Biological Variables Used ===\\n\")\n", - " for (v in intersect(c(\"msex\",\"age_death\"), all_vars))\n", - " cat(\"-\", v, \"\\n\")\n", - " } else {\n", - " cat(\"\\n=== Biological Variables Used ===\\n\")\n", - " cat(\"None (noBIOvar mode - biological variation preserved)\\n\")\n", - " }\n", - "\n", - " cat(\"\\n=== Other Variables Used ===\\n\")\n", - " if (\"pmi\" %in% all_vars) cat(\"- pmi: Post-mortem interval\\n\")\n", - " if (\"study\" %in% all_vars) cat(\"- study: Study cohort\\n\")\n", - " sink()\n", - "\n", - " # ── 21. Variable explanation report ───────────────────────────────\n", - " sink(file.path(outdir, paste0(ct, \"_variable_explanation.txt\")))\n", - " cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n", - " cat(\"## Why Log Transformation?\\n\")\n", - " cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", - " cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", - " cat(\"2. To stabilize variance across the range of values\\n\")\n", - " cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", - " cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", - " cat(\"## Variables and Their Meanings\\n\\n\")\n", - " cat(\"### Technical Variables\\n\")\n", - " cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", - " cat(\" * Filtered to include only samples with >\", ${min_nuclei}, \"nuclei\\n\")\n", - " cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", - " cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", - " cat(\" * Represents sequencing depth\\n\")\n", - " cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", - " cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", - " cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", - " cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - " cat(\" * Measures the degree of nucleosome positioning\\n\")\n", - " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", - " cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", - " cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", - " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", - " if (\"sequencingBatch_factor\" %in% all_vars)\n", - " cat(\"- sequencingBatch: Sequencing batch ID\\n * Treated as a factor to account for batch effects\\n\\n\")\n", - " if (\"Library_factor\" %in% all_vars)\n", - " cat(\"- Library: Library preparation batch ID\\n * Treated as a factor to account for library preparation effects\\n\\n\")\n", - " if (as.logical(\"${include_bio}\")) {\n", - " cat(\"### Biological Variables\\n\")\n", - " cat(\"- msex: Sex (male=1, female=0)\\n\")\n", - " cat(\"- age_death: Age at death\\n\\n\")\n", - " }\n", - " cat(\"### Other Variables\\n\")\n", - " cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", - " cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", - " cat(\"## Relationship to voom Transformation\\n\")\n", - " cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", - " 
cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", - " cat(\"covariates, we ensure they are on a similar scale to the transformed expression data, \")\n", - " cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", - " sink()\n", - "\n", - " message(\"Completed: \", ct, \" -> \", outdir)\n", - " message(\" Peaks: \", nrow(final), \" | Samples: \", ncol(final))\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `phenotype_reformatting`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[phenotype_formatting]\n", - "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "\n", - "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", - "output: [f'{output_dir}/{ct}_snatac_phenotype.bed.gz' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - "\n", - " import os\n", - " import subprocess\n", - " import pandas as pd\n", - "\n", - " celltypes = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}\"\n", - "\n", - " def read_residuals(path):\n", - " first_line = open(path).readline().rstrip(\"\\n\")\n", - " col_names = first_line.split(\"\\t\")\n", - " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", - " if df.shape[1] > len(col_names):\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names\n", - " else:\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names[1:]\n", - " return peak_ids, df\n", - "\n", - " def to_midpoint_bed(peak_ids, residuals):\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " chrs = parts[0].values\n", - " starts = parts[1].astype(int).values\n", - " ends = parts[2].astype(int).values\n", - " mids = ((starts + ends) // 2).astype(int)\n", - " bed = pd.DataFrame({\n", - " \"#chr\": chrs,\n", - " \"start\": mids,\n", - " \"end\": mids + 1,\n", - " \"ID\": peak_ids\n", - " })\n", - " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", - " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", - "\n", - " def run_cmd(cmd, label):\n", - " r = subprocess.run(cmd, capture_output=True)\n", - " if r.returncode != 0:\n", - " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", - " else:\n", - " print(f\"{label}: OK\")\n", - "\n", - " for ct in celltypes:\n", - " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", - "\n", - " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", - " os.makedirs(out_dir, exist_ok=True)\n", - "\n", - " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n", - " if not os.path.exists(res_path):\n", - " print(f\"WARNING: {res_path} not found, skipping.\")\n", - " continue\n", - "\n", - " peak_ids, residuals = read_residuals(res_path)\n", - " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", - "\n", - " bed = to_midpoint_bed(peak_ids, residuals)\n", - " out_bed = os.path.join(out_dir, 
f\"{ct}_snatac_phenotype.bed\")\n", - " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", - " print(f\"Written: {out_bed}\")\n", - "\n", - " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", - " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", - "\n", - " print(f\"Completed: {ct} -> {out_dir}\")" - ] - }, - { - "cell_type": "markdown", - "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `region_filtering`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[region_filtering]\n", - "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: regions = \"chr7:28000000-28300000,chr11:85050000-86200000\"\n", - "\n", - "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in celltype]\n", - "output: [f'{output_dir}/{ct}_filtered_regions_of_interest.txt' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - "\n", - " import os\n", - " import pandas as pd\n", - "\n", - " celltypes = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}\"\n", - "\n", - " def parse_regions(region_str):\n", - " result = []\n", - " for r in region_str.split(\",\"):\n", - " chrom, coords = r.strip().split(\":\")\n", - " start, end = coords.split(\"-\")\n", - " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n", - " return result\n", - "\n", - " regions = parse_regions(\"${regions}\")\n", - "\n", - " def parse_peak_ids(peak_ids):\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " return pd.DataFrame({\n", - " \"chr\": parts[0].values,\n", - " \"start\": parts[1].astype(int).values,\n", - " \"end\": parts[2].astype(int).values\n", - " })\n", - "\n", - " def overlaps_region(chr_col, start_col, end_col, reg):\n", - " return (\n", - " (chr_col == reg[\"chr\"]) &\n", - " (start_col < reg[\"end\"]) &\n", - " (end_col > reg[\"start\"])\n", - " )\n", - "\n", - " for ct in celltypes:\n", - " print(f\"\\n{'='*40}\\nRegion Filtering: {ct}\\n{'='*40}\")\n", - "\n", - " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n", - " os.makedirs(reg_dir, exist_ok=True)\n", - "\n", - " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n", - " if not os.path.exists(counts_path):\n", - " print(f\"WARNING: {counts_path} not found, skipping.\")\n", - " continue\n", - "\n", - " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n", - " df.index.name = \"peak_id\"\n", - " df = df.reset_index()\n", - "\n", - " coords = parse_peak_ids(df[\"peak_id\"].values)\n", - " df[\"chr\"] = coords[\"chr\"].values\n", - " df[\"start\"] = coords[\"start\"].values\n", - " df[\"end\"] = coords[\"end\"].values\n", - " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n", - " df[\"midpoint\"] = ((df[\"start\"] + df[\"end\"]) / 2).astype(int)\n", - "\n", - " # Filter to regions of interest\n", - " mask = pd.Series(False, index=df.index)\n", - " for reg in regions:\n", - " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n", - "\n", - " region_df = df[mask].copy()\n", - " print(f\"Peaks in regions of interest: 
{len(region_df)}\")\n", - "\n", - " # Save full filtered data\n", - " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n", - " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n", - " print(f\"Saved: {full_out}\")\n", - "\n", - " # Save summary\n", - " meta_cols = [\"peak_id\",\"chr\",\"start\",\"end\",\"peakwidth\",\"midpoint\"]\n", - " count_cols = [c for c in region_df.columns if c not in meta_cols]\n", - " count_mat = region_df[count_cols].apply(pd.to_numeric, errors=\"coerce\")\n", - "\n", - " summary = region_df[meta_cols].copy()\n", - " summary[\"total_count\"] = count_mat.sum(axis=1).values\n", - " summary[\"weighted_count\"] = (summary[\"total_count\"] / summary[\"peakwidth\"]).values\n", - "\n", - " summary_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest_summary.txt\")\n", - " summary.to_csv(summary_out, sep=\"\\t\", index=False)\n", - " print(f\"Saved: {summary_out}\")\n", - "\n", - " print(f\"Completed: {ct} -> {reg_dir}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.4.3" - }, - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "sos", - "", - "" - ] - ], - "version": "" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 7fbc1551b88693ee41e74c0eea0190b223b85b8b Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 19 Feb 2026 14:21:09 -0500 Subject: [PATCH 03/12] snATAC-seq preprocessing pipeline notebook --- .../QC/snatacseq_preprocessing.ipynb | 1453 +++++++++++++++++ 1 file changed, 1453 insertions(+) create mode 100644 code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb new file mode 100644 index 000000000..b2b5acb6a --- /dev/null +++ b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb @@ -0,0 +1,1453 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Single-nucleus ATAC-seq Preprocessing Pipeline\n", + "\n", + "## Overview\n", + "\n", + "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) pseudobulk peak count data\n", + "for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies.\n", + "\n", + "**Goals:**\n", + "- Transform raw pseudobulk peak counts into analysis-ready formats\n", + "- Remove technical confounders while optionally preserving biological covariates\n", + "- Generate QTL-ready phenotype files or region-specific datasets\n", + "\n", + "## Pipeline Structure\n", + "```\n", + "Step 0: Sample ID Mapping\n", + "↓\n", + "Step 1: Pseudobulk QC\n", + "├── Option A: BIOvar (regress out technical + biological covariates)\n", + "└── Option B: noBIOvar (regress out technical covariates only)\n", + "↓ (optional)\n", + "Batch Correction (ComBat-seq or limma::removeBatchEffect)\n", + "↓\n", + "Step 2: Format Output\n", + "├── Format A: Phenotype Reformatting → BED (genome-wide caQTL mapping)\n", + "└── Format B: Region Peak Filtering → TSV (locus-specific analysis)\n", + "\n", + "```\n", + "\n", + "## Input Files\n", + "\n", + "All input files required to run this pipeline can be downloaded\n", + 
"[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n", + "\n", + "| File | Used in |\n", + "|------|---------|\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | Step 0, Step 1 |\n", + "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", + "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", + "| `rosmap_cov.txt` | Step 1 |\n", + "| `hg38-blacklist.v2.bed.gz` | Step 1 |\n", + "| `SampleSheet.csv` | Step 1 (batch correction only) |\n", + "| `sampleSheetAfterQc.csv` | Step 1 (batch correction only) |\n", + "\n", + "\n", + "## Minimal Working Example" + ] + }, + { + "cell_type": "markdown", + "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 0: Sample ID Mapping\n", + "\n", + "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", + "across metadata and count matrix files.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", + "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Per-cell-type peak count matrices |\n", + "\n", + "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n", + "\n", + "### Process\n", + "\n", + "**Part 1 — Metadata files**\n", + "\n", + "For each `metadata_{celltype}.csv`:\n", + "1. Look up each `individualID` in the mapping reference\n", + "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", + "3. Insert `sampleid` as the first column\n", + "4. Save updated file\n", + "\n", + "**Part 2 — Count matrix files**\n", + "\n", + "For each `pseudobulk_peaks_counts_{celltype}.csv.gz`:\n", + "1. Extract the header row (column names only)\n", + "2. Keep `peak_id` (first column) unchanged\n", + "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n", + " otherwise keep original\n", + "4. Write new header and stream data rows unchanged\n", + "5. 
Recompress with gzip\n", + "\n", + "### Output\n", + "\n", + "Output directory: `output/1_files_with_sampleid/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Count matrices with mapped column headers |\n", + "\n", + "**Timing:** < 1 min\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", + " --cwd output/atac_seq/1_files_with_sampleid \\\n", + " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", + " --input_dir data/atac_seq/1_files_with_sampleid_xiong \\\n", + " --output_dir output/atac_seq/1_files_with_sampleid \\\n", + " --celltype Ast Ex In Microglia Oligo OPC\n", + "\n", + "\n", + "# For MIT input data\n", + "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", + " --cwd output/atac_seq/1_files_with_sampleid \\\n", + " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", + " --input_dir data/atac_seq/1_files_with_sampleid_MIT \\\n", + " --output_dir output/atac_seq/1_files_with_sampleid \\\n", + " --celltype Astro Exc Inh Mic Oligo OPC \\\n", + " --suffix _50nuc" + ] + }, + { + "cell_type": "markdown", + "id": "5540a4da-843a-4789-8123-47911cf519c5", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1: Pseudobulk QC\n", + "\n", + "Two approaches are available depending on whether biological covariates should be regressed out.\n", + "Both options support an **optional batch correction** step after filtering and normalization.\n", + "\n", + "\n", + "### Option A: With Biological Covariates (BIOvar)\n", + "\n", + "Use when residuals should be adjusted for all technical **and** biological covariates (sex, age, PMI).\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | `1_files_with_sampleid/` |\n", + "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n", + "| `rosmap_cov.txt` | `data/` |\n", + "| `hg38-blacklist.v2.bed.gz` | `data/` |\n", + "| `SampleSheet.csv` *(batch correction only)* | `data/` |\n", + "| `sampleSheetAfterQc.csv` *(batch correction only)* | `data/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Load pseudobulk peak count matrix and metadata per cell type\n", + "2. Filter samples with fewer than 20 nuclei\n", + "3. Calculate technical QC metrics per sample:\n", + " - `log_n_nuclei`: log-transformed nuclei count\n", + " - `med_nucleosome_signal`: median nucleosome signal\n", + " - `med_tss_enrich`: median TSS enrichment score\n", + " - `log_med_n_tot_fragment`: log-transformed median total fragments\n", + " - `log_total_unique_peaks`: log-transformed unique peak count\n", + "4. Filter blacklisted genomic regions\n", + "5. Merge with demographic covariates (`msex`, `age_death`, `pmi`, `study`)\n", + "6. Apply expression filtering (`filterByExpr`):\n", + " - `min_count = 5`: minimum reads in at least one sample\n", + " - `min_total_count = 15`: minimum total reads across all samples\n", + " - `min_prop = 0.1`: peak expressed in ≥10% of samples\n", + "7. TMM normalization\n", + "8. *(Optional)* Batch correction — see [Batch Correction](#batch-correction-optional) below\n", + "9. 
Fit linear model (`voom` + `lmFit`):\n",
+    "\n",
+    "    `~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch + msex + age_death + pmi + study`\n",
+    "\n",
+    "    > If batch correction was applied, `sequencingBatch` is removed from the model.\n",
+    "10. Compute residuals adjusted for all covariates\n",
+    "11. Compute final adjusted values: `offset + residuals`\n",
+    "    - `offset`: predicted expression at median/reference covariate values\n",
+    "    - `residuals`: unexplained variation after removing all covariate effects\n",
+    "\n",
+    "**Output:** `output/2_residuals/{celltype}/`\n",
+    "\n",
+    "| File | Description |\n",
+    "|------|-------------|\n",
+    "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n",
+    "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n",
+    "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n",
+    "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n",
+    "\n",
+    "**Covariates regressed out:**\n",
+    "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n",
+    "- Biological: sex (`msex`), age at death (`age_death`), post-mortem interval (`pmi`), study cohort\n",
+    "\n",
+    "**Timing:** <5 min per celltype"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "21f80085-6d2c-4e1c-af35-454382d94de1",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Pseudobulk QC with BIOvar"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8569d816-d292-4512-85b6-fcd3ea1c9ba7",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+    " --cwd output/atac_seq \\\n",
+    " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+    " --output_dir output/atac_seq/2_residuals \\\n",
+    " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+    " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+    " --include_bio TRUE \\\n",
+    " --batch_correction FALSE \\\n",
+    " --min_count 5 \\\n",
+    " --celltype Ast Ex In Microglia Oligo OPC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d8270ee1-1f9b-439c-969b-ac20af6fadee",
+   "metadata": {},
+   "source": [
+    "### Option B: Without Biological Covariates (noBIOvar)\n",
+    "\n",
+    "Use when biological variation should be preserved (e.g., age/sex comparisons, region-specific analyses).\n",
+    "\n",
+    "**Input:** Same as Option A.\n",
+    "\n",
+    "**Process:**\n",
+    "\n",
+    "Steps 1–8 are identical to Option A. 
Key differences at the modelling stage:\n", + "- `msex` and `age_death` are **excluded** from the model\n", + "- `med_peakwidth` (weighted median peak width per sample) is added as a technical covariate\n", + "\n", + "**Model formula:**\n", + "```\n", + "Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch + pmi + study\n", + "```\n", + "\n", + "**Output:** `output/2_residuals/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n", + "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n", + "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", + "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n", + "\n", + "**Variables deliberately NOT regressed out:**\n", + "- Sex (`msex`)\n", + "- Age at death (`age_death`)\n", + "\n", + "**Timing:** <5 min per celltype" + ] + }, + { + "cell_type": "markdown", + "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC noBIOvar " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", + " --output_dir output/atac_seq/2_residuals \\\n", + " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", + " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", + " --include_bio FALSE \\\n", + " --batch_correction FALSE \\\n", + " --min_count 5 \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "25e96ad2-1b75-43d0-978e-0757bc11f135", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Batch Correction (Optional)\n", + "\n", + "Applies to both Option A and Option B. 
Runs between TMM normalization and model fitting.\n",
+    "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n",
+    "\n",
+    "> When batch correction is applied, `sequencingBatch` is **removed** from the model formula\n",
+    "> since batch variance has already been removed from the counts.\n",
+    "\n",
+    "**Method comparison:**\n",
+    "\n",
+    "| | ComBat-seq | limma `removeBatchEffect` |\n",
+    "|---|---|---|\n",
+    "| **Operates on** | Raw integer counts | log-CPM values |\n",
+    "| **Mean-variance modelling** | Yes | No |\n",
+    "| **Best for** | Large, balanced batches | Small or fragmented batches |\n",
+    "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n",
+    "\n",
+    "**ComBat-seq:**\n",
+    "```r\n",
+    "adjusted_counts <- ComBat_seq(counts = dge$counts, batch = batches)\n",
+    "```\n",
+    "\n",
+    "**limma `removeBatchEffect`:**\n",
+    "```r\n",
+    "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n",
+    "adj_logCPM <- removeBatchEffect(logCPM, batch = batches, design = model.matrix(~1, data = dge$samples))\n",
+    "adjusted_counts <- round(pmax(2^adj_logCPM * mean(dge$samples$lib.size) / 1e6, 0))\n",
+    "```\n",
+    "\n",
+    "**Additional filtering applied before correction:**\n",
+    "- Singleton batches (only 1 sample) are removed\n",
+    "- Samples absent from the batch sheet are dropped\n",
+    "\n",
+    "**Additional output when batch correction is enabled:**\n",
+    "\n",
+    "| File | Description |\n",
+    "|------|-------------|\n",
+    "| `{celltype}_results.rds` | Includes `batch_adjusted_counts` and `batch_method` fields |\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4d582c85-2265-46ee-8080-0ec5d8423a1d",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Pseudobulk QC with BIOvar & with batch correction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d3676870-496d-4379-8d6b-acec08f1c0d7",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+    " --cwd output/atac_seq \\\n",
+    " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+    " --output_dir output/atac_seq/2_residuals \\\n",
+    " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+    " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+    " --include_bio TRUE \\\n",
+    " --batch_correction TRUE \\\n",
+    " --batch_method limma \\\n",
+    " --min_count 2 \\\n",
+    " --celltype Ast Ex In Microglia Oligo OPC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9bad900d-768d-45ee-815a-6847e8eba32e",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "### Pseudobulk QC noBIOvar & with batch correction"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+    " --cwd output/atac_seq \\\n",
+    " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+    " --output_dir output/atac_seq/2_residuals \\\n",
+    " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+    " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+    " --include_bio FALSE \\\n",
+    " --batch_correction TRUE \\\n",
+    " --batch_method limma \\\n",
+    " --min_count 5 \\\n",
+    " --celltype Ast Ex In Microglia Oligo OPC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "096f2b32-e80d-472b-9af8-5f3d4ebb9bf2",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    
"**Note**\n", + "For MIT data, add these parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ee860bb3-d628-4255-b222-f62b3c03a91a", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "--celltype Astro Exc Inh Mic Oligo OPC \\\n", + "--suffix _50nuc \\\n", + "--input_dir output/1_files_with_sampleid_MIT" + ] + }, + { + "cell_type": "markdown", + "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", + "metadata": {}, + "source": [ + "For additional parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", + "metadata": {}, + "outputs": [], + "source": [ + "--min_count 5\n", + "--min_total_count 15\n", + "--min_prop 0.1\n", + "--min_nuclei 20" + ] + }, + { + "cell_type": "markdown", + "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 2: Format Output\n", + "### Phenotype Reformatting\n", + "\n", + "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `{celltype}_residuals.txt` | `output/2_residuals/{celltype}/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Read residuals file with proper handling of peak IDs and sample columns\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Convert to midpoint coordinates (standard for QTLtools):\n", + "```\n", + " start = floor((peak_start + peak_end) / 2)\n", + " end = start + 1\n", + "```\n", + "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample expression values\n", + "5. Sort by chromosome and position\n", + "6. Compress with `bgzip` and index with `tabix`\n", + "\n", + "**Output:** `output/3_phenotype_processing/phenotype/{celltype}_snatac_phenotype.bed.gz`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_snatac_phenotype.bed.gz` | bgzip-compressed BED with peak midpoint coordinates |\n", + "| `{celltype}_snatac_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", + "\n", + "**Use case:** Standard caQTL mapping to identify genetic variants affecting chromatin\n", + "accessibility independent of demographic factors. Compatible with FastQTL, TensorQTL, and QTLtools.\n", + "\n", + "**Timing:** <1 min" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb phenotype_formatting \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/2_residuals \\\n", + " --output_dir output/atac_seq/3_pheno_reformat \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Region Peak Filtering\n", + "\n", + "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `{celltype}_filtered_raw_counts.txt` | `output/2_residuals/{celltype}/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Read filtered raw counts per cell type\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Calculate per-peak metrics:\n", + " - `peakwidth`: `end - start`\n", + " - `midpoint`: `(start + end) / 2`\n", + "4. 
Filter peaks overlapping target regions (includes peaks that start, end, or span boundaries):\n", + "\n", + " | Region | Coordinates | Size |\n", + " |--------|-------------|------|\n", + " | Chr7 | 28,000,000 – 28,300,000 bp | 300 kb |\n", + " | Chr11 | 85,050,000 – 86,200,000 bp | 1.15 Mb |\n", + "\n", + "5. Calculate summary statistics per peak:\n", + " - `total_count`: sum of counts across all samples\n", + " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n", + "\n", + "**Output:** `output/3_format_output/regions/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_filtered_regions.txt` | Full count matrix for peaks in target regions |\n", + "| `{celltype}_filtered_regions_summary.txt` | Peak metadata with coordinates and count statistics |\n", + "\n", + "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n", + "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n", + "\n", + "**Timing:** <1 min" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f944afdd-fffc-4b56-863f-eee89408cfa1", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/2_residuals \\\n", + " --output_dir output/atac_seq/3_region_filter \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10440301-99c6-4f0e-b6ce-efe5ac9281fb", + "metadata": {}, + "outputs": [], + "source": [ + "# Custom regions\n", + "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", + " --cwd output/atac_seq \\\n", + " --input_dir output/atac_seq/2_residuals \\\n", + " --output_dir output/atac_seq \\\n", + " --celltype Ast Ex In Microglia Oligo OPC \\\n", + " --regions \"chr1:1000000-2000000,chr5:50000000-51000000\"" + ] + }, + { + "cell_type": "markdown", + "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/snatacseq_preprocessing.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "id": "0e17a301-cca9-49a1-843b-4248546f1f79", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Setup and global parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "# Output directory\n", + "parameter: cwd = path(\"output\")\n", + "# For cluster jobs, number of commands to run per job\n", + "parameter: job_size = 1\n", + "# Wall clock time expected\n", + "parameter: walltime = \"5h\"\n", + "# Memory expected\n", + "parameter: mem = \"16G\"\n", + "# Number of threads\n", + "parameter: numThreads = 8\n", + "# Software container\n", + "parameter: container = \"\"\n", + "\n", + "import re\n", + "parameter: entrypoint = (\n", + " 'micromamba run -a \"\" -n' + ' ' +\n", + " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", + ") if container else \"\"\n", + "\n", + "from sos.utils import expand_size\n", + "cwd = path(f'{cwd:a}')" + ] + }, + { + "cell_type": "markdown", + "id": 
"cb6024cd-28be-4fb0-994e-0460e3a3beae", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `sampleid_mapping`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[sampleid_mapping]\n", + "parameter: map_file = str\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n", + "parameter: suffix = '' # e.g. '' for Xiong, '_50nuc' for Kellis\n", + "\n", + "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "output: [f'{output_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "\n", + "python: expand = \"${ }\"\n", + "\n", + " import pandas as pd\n", + " import gzip\n", + " import os\n", + " import subprocess\n", + " import csv\n", + " import numpy as np\n", + "\n", + " map_df = pd.read_csv(\"${map_file}\")\n", + " id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", + "\n", + " celltype = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}/1_files_with_sampleid\"\n", + " suffix = \"${suffix}\"\n", + "\n", + " os.makedirs(output_dir, exist_ok=True)\n", + "\n", + " def map_id(ind_id):\n", + " return id_map.get(ind_id, ind_id)\n", + " \n", + " def format_value(val):\n", + " \"\"\"Format numeric values: remove .0 from integers, keep decimals\"\"\"\n", + " if pd.isna(val):\n", + " return ''\n", + " if isinstance(val, (int, np.integer)):\n", + " return str(val)\n", + " if isinstance(val, (float, np.floating)):\n", + " if val == int(val): # Check if it's a whole number\n", + " return str(int(val))\n", + " else:\n", + " return str(val)\n", + " return str(val)\n", + "\n", + " # ── Process metadata CSV files ────────────────────────────────────────────\n", + " for ct in celltype:\n", + " fname = f\"metadata_{ct}{suffix}.csv\"\n", + " in_path = os.path.join(input_dir, fname)\n", + " out_path = os.path.join(output_dir, fname)\n", + "\n", + " if not os.path.exists(in_path):\n", + " print(f\"Warning: Metadata file not found: {in_path}\")\n", + " continue\n", + "\n", + " meta = pd.read_csv(in_path)\n", + "\n", + " if \"individualID\" not in meta.columns:\n", + " print(f\"Warning: individualID column not found in {fname}\")\n", + " continue\n", + "\n", + " # Create or update sampleid column\n", + " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", + " \n", + " # Always reorder: sampleid FIRST, then individualID, then rest\n", + " cols = meta.columns.tolist()\n", + " cols.remove(\"sampleid\")\n", + " cols.remove(\"individualID\")\n", + " new_cols = [\"sampleid\", \"individualID\"] + cols\n", + " meta = meta[new_cols]\n", + "\n", + " # Write CSV with custom formatting\n", + " with open(out_path, 'w', newline='') as f:\n", + " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", + " # Write header\n", + " writer.writerow(meta.columns)\n", + " # Write data rows with custom formatting\n", + " for _, row in meta.iterrows():\n", + " writer.writerow([format_value(val) for val in row])\n", + " \n", + " print(f\"Processed metadata: {fname}\")\n", + "\n", + " # ── Process count matrix .csv.gz files ───────────────────────────────────\n", + " for ct in celltype:\n", + " # Try both naming patterns: with and without underscore\n", + " patterns = [\n", + " f\"pseudobulk_peaks_counts_{ct}{suffix}.csv.gz\", # Xiong pattern\n", + " f\"pseudobulk_peaks_counts{ct}{suffix}.csv.gz\" # Kellis pattern\n", + " 
]\n", + " \n", + " in_path = None\n", + " for pattern in patterns:\n", + " test_path = os.path.join(input_dir, pattern)\n", + " if os.path.exists(test_path):\n", + " in_path = test_path\n", + " fname = pattern\n", + " break\n", + " \n", + " if in_path is None:\n", + " print(f\"Warning: Count file not found for celltype {ct}\")\n", + " continue\n", + " \n", + " out_path = os.path.join(output_dir, fname)\n", + "\n", + " with gzip.open(in_path, \"rt\") as fh:\n", + " header_line = fh.readline().rstrip(\"\\n\")\n", + "\n", + " col_names = header_line.split(\",\")\n", + " peak_id_col = col_names[0]\n", + " sample_cols = col_names[1:]\n", + " new_sample_cols = [map_id(s) for s in sample_cols]\n", + " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", + "\n", + " import tempfile\n", + " temp_header = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", + " temp_header.write(new_header + \"\\n\")\n", + " temp_header.close()\n", + " \n", + " cmd = f\"zcat {in_path} | tail -n +2 | cat {temp_header.name} - | gzip -6 > {out_path}\"\n", + " subprocess.run(cmd, shell=True, check=True)\n", + " \n", + " os.unlink(temp_header.name)\n", + " print(f\"Processed counts: {fname}\")\n", + "\n", + " print(\"\\nSample ID mapping completed!\")" + ] + }, + { + "cell_type": "markdown", + "id": "f0884ae7-a851-425a-86dd-b606768a012e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `pseudobulk_qc`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[pseudobulk_qc]\n", + "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: covariates_file = str\n", + "parameter: blacklist_file = ''\n", + "parameter: include_bio = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", + "parameter: batch_correction = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", + "parameter: batch_method = \"limma\" # \"limma\" or \"combat\"\n", + "parameter: min_count = 5\n", + "parameter: min_total_count = 15\n", + "parameter: min_prop = 0.1\n", + "parameter: min_nuclei = 20\n", + "parameter: suffix = ''\n", + "\n", + "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype], \\\n", + " [f'{input_dir}/pseudobulk_peaks_counts_{ct}{suffix}.csv.gz' for ct in celltype]\n", + "output: [f'{output_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", + "\n", + "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", + "\n", + " library(edgeR)\n", + " library(limma)\n", + " library(data.table)\n", + " library(GenomicRanges)\n", + " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", + "\n", + " # ── Helper: standardize metadata column names ─────────────────────────────\n", + " rename_if_found <- function(dt, target, candidates) {\n", + " found <- intersect(candidates, colnames(dt))[1]\n", + " if (!is.na(found) && found != target) setnames(dt, found, target)\n", + " }\n", + "\n", + " standardize_meta <- function(meta) {\n", + " rename_if_found(meta, \"n_nuclei\", c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n", + " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n", + " rename_if_found(meta, \"med_tss_enrich\", 
c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n", + " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n", + " return(meta)\n", + " }\n", + "\n", + " # ── Helper: blacklist filtering ───────────────────────────────────────────\n", + " filter_blacklist <- function(mat, bed) {\n", + " peaks <- data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " bl <- fread(bed)[, 1:3]\n", + " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", + " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", + " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", + " if (length(blacklisted) > 0) {\n", + " message(\"Blacklisted peaks removed: \", length(blacklisted))\n", + " return(mat[-blacklisted, , drop=FALSE])\n", + " }\n", + " return(mat)\n", + " }\n", + "\n", + " # ── Helper: predictOffset ─────────────────────────────────────────────────\n", + " predictOffset <- function(fit) {\n", + " D <- fit$design\n", + " Dm <- D\n", + " for (col in colnames(D)) {\n", + " if (col == \"(Intercept)\") next\n", + " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", + " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", + " else\n", + " Dm[, col] <- 0\n", + " }\n", + " B <- fit$coefficients\n", + " B[is.na(B)] <- 0\n", + " B %*% t(Dm)\n", + " }\n", + "\n", + " # ── Main loop ─────────────────────────────────────────────────────────────\n", + " cts <- c(${', '.join([f\"'{x}'\" for x in celltype])})\n", + "\n", + " for (ct in cts) {\n", + " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", + " message(\"Processing: \", ct)\n", + " message(\"Mode: \", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"))\n", + " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", + " message(paste(rep(\"=\", 40), collapse=\"\"))\n", + "\n", + " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", + " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", + "\n", + " # ── 1. Load data ───────────────────────────────────────────────────\n", + " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n", + " counts_raw <- fread(sprintf(\"${input_dir}/pseudobulk_peaks_counts_%s${suffix}.csv.gz\", ct))\n", + "\n", + " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", + " rownames(counts) <- counts_raw[[1]]\n", + " rm(counts_raw)\n", + " n_original <- nrow(counts)\n", + " message(\"Loaded: \", n_original, \" peaks x \", ncol(counts), \" samples\")\n", + "\n", + " # ── 2. Standardize metadata columns ───────────────────────────────\n", + " meta <- standardize_meta(meta)\n", + "\n", + " # ── 3. Identify sample ID column ──────────────────────────────────\n", + " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", + " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", + "\n", + " # ── 4. Nuclei filter ──────────────────────────────────────────────\n", + " if (\"n_nuclei\" %in% colnames(meta)) {\n", + " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n", + " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", + " }\n", + " n_after_nuclei <- nrow(meta)\n", + "\n", + " # ── 5. 
Align samples ───────────────────────────────────────────────\n", + " common <- intersect(meta[[idcol]], colnames(counts))\n", + " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", + " meta <- meta[match(common, meta[[idcol]])]\n", + " counts <- counts[, common, drop=FALSE]\n", + " message(\"Samples after alignment: \", length(common))\n", + "\n", + " # ── 6. Blacklist filtering ─────────────────────────────────────────\n", + " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", + " counts <- filter_blacklist(counts, \"${blacklist_file}\")\n", + " message(\"Peaks after blacklist filter: \", nrow(counts))\n", + " } else {\n", + " message(\"No blacklist file provided - skipping blacklist filtering.\")\n", + " }\n", + " n_after_blacklist <- nrow(counts)\n", + "\n", + " # ── 7. Load and merge covariates ───────────────────────────────────\n", + " covs <- fread(\"${covariates_file}\")\n", + " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n", + " bio_cols <- if (as.logical(\"${include_bio}\")) c(\"msex\",\"age_death\",\"pmi\",\"study\") else c(\"pmi\",\"study\")\n", + " keep_cols <- c(id2, intersect(bio_cols, colnames(covs)))\n", + " covs <- covs[, ..keep_cols]\n", + " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n", + "\n", + " # ── CRITICAL: re-order meta back to common sample order ────────────\n", + " meta <- meta[match(common, meta[[idcol]])]\n", + "\n", + " # ── 8. Impute missing covariate values ─────────────────────────────\n", + " for (col in intersect(c(\"pmi\",\"age_death\"), colnames(meta))) {\n", + " if (any(is.na(meta[[col]]))) {\n", + " message(\"Imputing missing values for: \", col)\n", + " meta[[col]][is.na(meta[[col]])] <- median(meta[[col]], na.rm=TRUE)\n", + " }\n", + " }\n", + "\n", + " # ── 9. Compute technical metrics ──────────────────────────────────\n", + " meta$log_n_nuclei <- log1p(meta$n_nuclei)\n", + " meta$log_med_n_tot_fragment <- log1p(meta$med_n_tot_fragment)\n", + " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n", + "\n", + " # ── 10. Select model variables ────────────────────────────────────\n", + " tech_vars <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", + " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\",\"pmi\",\"study\")\n", + " bio_vars <- c(\"msex\",\"age_death\")\n", + " all_vars <- if (as.logical(\"${include_bio}\")) c(tech_vars, bio_vars) else tech_vars\n", + " all_vars <- intersect(all_vars, colnames(meta))\n", + " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n", + "\n", + " # ── 11. Drop samples with NA in model variables ────────────────────\n", + " keep_rows <- complete.cases(meta[, ..all_vars])\n", + " meta <- meta[keep_rows]\n", + " counts <- counts[, meta[[idcol]], drop=FALSE]\n", + " message(\"Valid samples for modelling: \", nrow(meta))\n", + "\n", + " # ── 12. 
Expression filtering ───────────────────────────────────────\n", + " dge <- DGEList(counts=counts, samples=meta)\n", + " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", + " message(\"Peaks before expression filter: \", nrow(dge))\n", + "\n", + " keep <- filterByExpr(dge, group=dge$samples$group,\n", + " min.count=${min_count},\n", + " min.total.count=${min_total_count},\n", + " min.prop=${min_prop})\n", + " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", + " n_after_expr <- nrow(dge)\n", + " message(\"Peaks after expression filter: \", n_after_expr)\n", + "\n", + " # Save filtered raw counts\n", + " write.table(dge$counts,\n", + " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " # ── 13. TMM normalization ──────────────────────────────────────────\n", + " dge <- calcNormFactors(dge, method=\"TMM\")\n", + "\n", + " # ── 14. Optional batch correction ─────────────────────────────────\n", + " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", + " batches <- dge$samples$sequencingBatch\n", + " batch_counts <- table(batches)\n", + " valid_batches <- names(batch_counts[batch_counts > 1])\n", + " keep_bc <- batches %in% valid_batches\n", + " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", + " batches <- batches[keep_bc]\n", + " message(\"Samples after singleton batch removal: \", ncol(dge))\n", + "\n", + " if (\"${batch_method}\" == \"combat\") {\n", + " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n", + " message(\"ComBat-seq batch correction applied.\")\n", + " } else {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"limma removeBatchEffect applied.\")\n", + " }\n", + " }\n", + "\n", + " # ── 15. Add sequencingBatch and Library to model if multi-level ───\n", + " # Insert after technical vars but before pmi/study to match original order\n", + " tech_only <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", + " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\")\n", + " other_vars <- setdiff(all_vars, tech_only) # pmi, study, msex, age_death\n", + "\n", + " batch_vars <- c()\n", + " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$sequencingBatch)) > 1) {\n", + " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", + " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", + " }\n", + "\n", + " if (\"Library\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$Library)) > 1) {\n", + " dge$samples$Library_factor <- factor(dge$samples$Library)\n", + " batch_vars <- c(batch_vars, \"Library_factor\")\n", + " }\n", + "\n", + " # Final order: technical + batch + other (pmi, study, bio)\n", + " all_vars <- c(tech_only, batch_vars, other_vars)\n", + " all_vars <- intersect(all_vars, c(colnames(dge$samples), colnames(meta)))\n", + "\n", + " # ── 16. 
Build design matrix ────────────────────────────────────────\n", + " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n", + " design <- model.matrix(form, data=dge$samples)\n", + " message(\"Formula: \", deparse(form))\n", + "\n", + " if (!is.fullrank(design)) {\n", + " message(\"Design not full rank - trimming.\")\n", + " qr_d <- qr(design)\n", + " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", + " }\n", + " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", + "\n", + " # ── 17. Voom + lmFit + eBayes ─────────────────────────────────────\n", + " v <- voom(dge, design, plot=FALSE)\n", + " fit <- lmFit(v, design)\n", + " fit <- eBayes(fit)\n", + "\n", + " # ── 18. Offset + residuals ─────────────────────────────────────────\n", + " off <- predictOffset(fit)\n", + " res <- residuals(fit, v)\n", + " final <- off + res\n", + "\n", + " # ── 19. Save outputs ───────────────────────────────────────────────\n", + " write.table(final,\n", + " file.path(outdir, paste0(ct, \"_residuals.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " saveRDS(list(\n", + " dge = dge,\n", + " offset = off,\n", + " residuals = res,\n", + " final_data = final,\n", + " valid_samples = colnames(dge),\n", + " design = design,\n", + " fit = fit,\n", + " model = form,\n", + " mode = ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"),\n", + " batch_correction = as.logical(\"${batch_correction}\"),\n", + " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\")\n", + " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", + "\n", + " # ── 20. Summary report ─────────────────────────────────────────────\n", + " sink(file.path(outdir, paste0(ct, \"_summary.txt\")))\n", + " cat(\"*** Processing Summary for\", ct, \"***\\n\\n\")\n", + "\n", + " cat(\"=== Analysis Mode ===\\n\")\n", + " cat(\"Mode:\", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"), \"\\n\")\n", + " cat(\"Batch correction:\", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"), \"\\n\")\n", + " cat(\"Model formula:\", deparse(form), \"\\n\\n\")\n", + "\n", + " cat(\"=== Filtering Parameters ===\\n\")\n", + " cat(\"Nuclei cutoff: >\", ${min_nuclei}, \"\\n\")\n", + " cat(\"Blacklist filtering:\", ifelse(\"${blacklist_file}\" != \"\", \"TRUE\", \"FALSE\"), \"\\n\")\n", + " if (\"${blacklist_file}\" != \"\") cat(\"Blacklist file:\", \"${blacklist_file}\", \"\\n\")\n", + " cat(\"min_count:\", ${min_count}, \"\\n\")\n", + " cat(\"min_total_count:\", ${min_total_count}, \"\\n\")\n", + " cat(\"min_prop:\", ${min_prop}, \"\\n\\n\")\n", + "\n", + " cat(\"=== Peak Counts ===\\n\")\n", + " cat(\"Original peak count:\", n_original, \"\\n\")\n", + " cat(\"Peaks after blacklist filtering:\", n_after_blacklist, \"\\n\")\n", + " cat(\"Peaks after expression filtering:\", n_after_expr, \"\\n\\n\")\n", + "\n", + " cat(\"=== Sample Counts ===\\n\")\n", + " cat(\"Number of samples after nuclei (>\", ${min_nuclei}, \") filtering:\", n_after_nuclei, \"\\n\")\n", + " cat(\"Number of samples in final model:\", ncol(final), \"\\n\\n\")\n", + "\n", + " cat(\"=== Technical Variables Used ===\\n\")\n", + " for (v in intersect(c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", + " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\"), all_vars))\n", + " cat(\"-\", v, \"\\n\")\n", + " if (\"sequencingBatch_factor\" %in% all_vars) cat(\"- sequencingBatch: Sequencing batch ID\\n\")\n", + " if 
(\"Library_factor\" %in% all_vars) cat(\"- Library: Library ID\\n\")\n", + "\n", + " if (as.logical(\"${include_bio}\")) {\n", + " cat(\"\\n=== Biological Variables Used ===\\n\")\n", + " for (v in intersect(c(\"msex\",\"age_death\"), all_vars))\n", + " cat(\"-\", v, \"\\n\")\n", + " } else {\n", + " cat(\"\\n=== Biological Variables Used ===\\n\")\n", + " cat(\"None (noBIOvar mode - biological variation preserved)\\n\")\n", + " }\n", + "\n", + " cat(\"\\n=== Other Variables Used ===\\n\")\n", + " if (\"pmi\" %in% all_vars) cat(\"- pmi: Post-mortem interval\\n\")\n", + " if (\"study\" %in% all_vars) cat(\"- study: Study cohort\\n\")\n", + " sink()\n", + "\n", + " # ── 21. Variable explanation report ───────────────────────────────\n", + " sink(file.path(outdir, paste0(ct, \"_variable_explanation.txt\")))\n", + " cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n", + " cat(\"## Why Log Transformation?\\n\")\n", + " cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", + " cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", + " cat(\"2. To stabilize variance across the range of values\\n\")\n", + " cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", + " cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", + " cat(\"## Variables and Their Meanings\\n\\n\")\n", + " cat(\"### Technical Variables\\n\")\n", + " cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", + " cat(\" * Filtered to include only samples with >\", ${min_nuclei}, \"nuclei\\n\")\n", + " cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", + " cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", + " cat(\" * Represents sequencing depth\\n\")\n", + " cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", + " cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", + " cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", + " cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", + " cat(\" * Measures the degree of nucleosome positioning\\n\")\n", + " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", + " cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", + " cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", + " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", + " if (\"sequencingBatch_factor\" %in% all_vars)\n", + " cat(\"- sequencingBatch: Sequencing batch ID\\n * Treated as a factor to account for batch effects\\n\\n\")\n", + " if (\"Library_factor\" %in% all_vars)\n", + " cat(\"- Library: Library preparation batch ID\\n * Treated as a factor to account for library preparation effects\\n\\n\")\n", + " if (as.logical(\"${include_bio}\")) {\n", + " cat(\"### Biological Variables\\n\")\n", + " cat(\"- msex: Sex (male=1, female=0)\\n\")\n", + " cat(\"- age_death: Age at death\\n\\n\")\n", + " }\n", + " cat(\"### Other Variables\\n\")\n", + " cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", + " cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", + " cat(\"## Relationship to voom Transformation\\n\")\n", + " cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", + " 
cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", + " cat(\"covariates, we ensure they are on a similar scale to the transformed expression data, \")\n", + " cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", + " sink()\n", + "\n", + " message(\"Completed: \", ct, \" -> \", outdir)\n", + " message(\" Peaks: \", nrow(final), \" | Samples: \", ncol(final))\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `phenotype_reformatting`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[phenotype_formatting]\n", + "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "\n", + "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", + "output: [f'{output_dir}/{ct}_snatac_phenotype.bed.gz' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + "\n", + " import os\n", + " import subprocess\n", + " import pandas as pd\n", + "\n", + " celltypes = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def read_residuals(path):\n", + " first_line = open(path).readline().rstrip(\"\\n\")\n", + " col_names = first_line.split(\"\\t\")\n", + " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", + " if df.shape[1] > len(col_names):\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names\n", + " else:\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names[1:]\n", + " return peak_ids, df\n", + "\n", + " def to_midpoint_bed(peak_ids, residuals):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " chrs = parts[0].values\n", + " starts = parts[1].astype(int).values\n", + " ends = parts[2].astype(int).values\n", + " mids = ((starts + ends) // 2).astype(int)\n", + " bed = pd.DataFrame({\n", + " \"#chr\": chrs,\n", + " \"start\": mids,\n", + " \"end\": mids + 1,\n", + " \"ID\": peak_ids\n", + " })\n", + " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", + " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", + "\n", + " def run_cmd(cmd, label):\n", + " r = subprocess.run(cmd, capture_output=True)\n", + " if r.returncode != 0:\n", + " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", + " else:\n", + " print(f\"{label}: OK\")\n", + "\n", + " for ct in celltypes:\n", + " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", + "\n", + " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", + " os.makedirs(out_dir, exist_ok=True)\n", + "\n", + " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n", + " if not os.path.exists(res_path):\n", + " print(f\"WARNING: {res_path} not found, skipping.\")\n", + " continue\n", + "\n", + " peak_ids, residuals = read_residuals(res_path)\n", + " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", + "\n", + " bed = to_midpoint_bed(peak_ids, residuals)\n", + " out_bed = os.path.join(out_dir, 
f\"{ct}_snatac_phenotype.bed\")\n", + " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", + " print(f\"Written: {out_bed}\")\n", + "\n", + " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", + " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", + "\n", + " print(f\"Completed: {ct} -> {out_dir}\")" + ] + }, + { + "cell_type": "markdown", + "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `region_filtering`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[region_filtering]\n", + "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: regions = \"chr7:28000000-28300000,chr11:85050000-86200000\"\n", + "\n", + "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in celltype]\n", + "output: [f'{output_dir}/{ct}_filtered_regions_of_interest.txt' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + "\n", + " import os\n", + " import pandas as pd\n", + "\n", + " celltypes = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def parse_regions(region_str):\n", + " result = []\n", + " for r in region_str.split(\",\"):\n", + " chrom, coords = r.strip().split(\":\")\n", + " start, end = coords.split(\"-\")\n", + " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n", + " return result\n", + "\n", + " regions = parse_regions(\"${regions}\")\n", + "\n", + " def parse_peak_ids(peak_ids):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " return pd.DataFrame({\n", + " \"chr\": parts[0].values,\n", + " \"start\": parts[1].astype(int).values,\n", + " \"end\": parts[2].astype(int).values\n", + " })\n", + "\n", + " def overlaps_region(chr_col, start_col, end_col, reg):\n", + " return (\n", + " (chr_col == reg[\"chr\"]) &\n", + " (start_col < reg[\"end\"]) &\n", + " (end_col > reg[\"start\"])\n", + " )\n", + "\n", + " for ct in celltypes:\n", + " print(f\"\\n{'='*40}\\nRegion Filtering: {ct}\\n{'='*40}\")\n", + "\n", + " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n", + " os.makedirs(reg_dir, exist_ok=True)\n", + "\n", + " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n", + " if not os.path.exists(counts_path):\n", + " print(f\"WARNING: {counts_path} not found, skipping.\")\n", + " continue\n", + "\n", + " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n", + " df.index.name = \"peak_id\"\n", + " df = df.reset_index()\n", + "\n", + " coords = parse_peak_ids(df[\"peak_id\"].values)\n", + " df[\"chr\"] = coords[\"chr\"].values\n", + " df[\"start\"] = coords[\"start\"].values\n", + " df[\"end\"] = coords[\"end\"].values\n", + " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n", + " df[\"midpoint\"] = ((df[\"start\"] + df[\"end\"]) / 2).astype(int)\n", + "\n", + " # Filter to regions of interest\n", + " mask = pd.Series(False, index=df.index)\n", + " for reg in regions:\n", + " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n", + "\n", + " region_df = df[mask].copy()\n", + " print(f\"Peaks in regions of interest: 
{len(region_df)}\")\n", + "\n", + " # Save full filtered data\n", + " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n", + " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n", + " print(f\"Saved: {full_out}\")\n", + "\n", + " # Save summary\n", + " meta_cols = [\"peak_id\",\"chr\",\"start\",\"end\",\"peakwidth\",\"midpoint\"]\n", + " count_cols = [c for c in region_df.columns if c not in meta_cols]\n", + " count_mat = region_df[count_cols].apply(pd.to_numeric, errors=\"coerce\")\n", + "\n", + " summary = region_df[meta_cols].copy()\n", + " summary[\"total_count\"] = count_mat.sum(axis=1).values\n", + " summary[\"weighted_count\"] = (summary[\"total_count\"] / summary[\"peakwidth\"]).values\n", + "\n", + " summary_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest_summary.txt\")\n", + " summary.to_csv(summary_out, sep=\"\\t\", index=False)\n", + " print(f\"Saved: {summary_out}\")\n", + "\n", + " print(f\"Completed: {ct} -> {reg_dir}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.4.3" + }, + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "sos", + "", + "" + ] + ], + "version": "" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 77156c6ddc7bcea163f71489f0d4ffbf82d7d666 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 19 Feb 2026 14:21:48 -0500 Subject: [PATCH 04/12] Delete code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb Replace to sos --- .../QC/xiong_atacseq_preprocessing.ipynb | 1828 ----------------- 1 file changed, 1828 deletions(-) delete mode 100644 code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb deleted file mode 100644 index 78e93a169..000000000 --- a/code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb +++ /dev/null @@ -1,1828 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "55783e3b-582f-4eb9-8d1a-f3647fef7c73", - "metadata": {}, - "source": [ - "# Xiong Lab Single-nuclei ATAC-seq Preprocessing Pipeline\n", - "---\n", - "## Overview\n", - "\n", - "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) data from the Kellis lab (Xiong et al.) for downstream chromatin accessibility QTL (caQTL) analysis. 
It processes pseudobulk peak count data across six major brain cell types.\n", - "\n", - "**Pipeline Purpose:**\n", - "- Transform raw pseudobulk peak counts into analysis-ready formats\n", - "- Remove technical confounders while preserving biological variation\n", - "- Generate QTL-ready phenotype files for genome-wide caQTL mapping\n", - "\n", - "**Supported Cell Types:**\n", - "- **Mic** - Microglia\n", - "- **Astro** - Astrocytes\n", - "- **Oligo** - Oligodendrocytes\n", - "- **Ex** - Excitatory neurons\n", - "- **In** - Inhibitory neurons\n", - "- **OPC** - Oligodendrocyte precursor cells\n", - "\n", - "---\n", - "\n", - "## Workflow Structure\n", - "\n", - "This pipeline consists of **three sequential steps**:\n", - "\n", - "#### Step 0: Sample ID Mapping\n", - "\n", - "**Input:**\n", - "- Sample mapping file: `rosmap_sample_mapping_data.csv`\n", - "- Original metadata files: `metadata_{celltype}.csv`\n", - "- Original count files: `pseudobulk_peaks_counts_{celltype}.csv.gz`\n", - "\n", - "**Process:**\n", - "1. Loads sample ID mapping between individualID and sampleid\n", - "2. Processes metadata files:\n", - " - Adds `sampleid` column after `individualID`\n", - " - Maps individualID to sampleid where mapping exists\n", - " - Keeps original individualID for unmapped samples\n", - "3. Processes count matrix files:\n", - " - Renames column headers from individualID to sampleid\n", - " - Maintains count data integrity\n", - "\n", - "#### Step 1: Pseudobulk QC & Calculate Residuals with biological variation\n", - "\n", - "**Input:**\n", - "- Mapped metadata: `metadata_{celltype}.csv` (from Step 0)\n", - "- Mapped peak counts: `pseudobulk_peaks_counts_{celltype}.csv.gz` (from Step 0)\n", - "- Sample covariates: `rosmap_cov.txt`\n", - "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n", - "\n", - "**Process:**\n", - "1. Loads pseudobulk peak count matrix and metadata\n", - "2. **Filters samples with n_nuclei > 20**\n", - "3. Calculates technical QC metrics per sample:\n", - " - `log_n_nuclei`: Log-transformed number of nuclei\n", - " - `med_nucleosome_signal`: Median nucleosome signal\n", - " - `med_tss_enrich`: Median TSS enrichment score\n", - " - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)\n", - " - `log_total_unique_peaks`: Log-transformed count of unique peaks detected\n", - "4. Filters blacklisted genomic regions using `foverlaps()`\n", - "5. Merges with covariates (pmi, study) - **excludes msex and age_death**\n", - "6. Applies expression filtering with `filterByExpr()`:\n", - " - `min.count = 5`: Minimum 5 reads in at least one sample\n", - " - `min.total.count = 15`: Minimum 15 total reads across all samples\n", - " - `min.prop = 0.1`: Peak must be expressed in ≥10% of samples\n", - "7. TMM normalization with `calcNormFactors()`\n", - "8. Saves **filtered raw counts** (used for region-specific analysis if needed)\n", - "9. Handles sequencingBatch and Library as covariates\n", - "10. Fits linear model using `voom()` and `lmFit()`:\n", - " ```r\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + \n", - " log_med_n_tot_fragment + log_total_unique_peaks + \n", - " sequencingBatch_factor + Library_factor + pmi + study\n", - " ```\n", - "11. 
Calculates residuals using `predictOffset()`: `offset + residuals`\n", - " - **Preserves biological variation** (sex, age)\n", - " - Removes technical variation and study effects\n", - "\n", - "**Key Variables Regressed Out:**\n", - "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch, library\n", - "- Study effects: pmi, study cohort\n", - "\n", - "**Key Variables Preserved:**\n", - "- Sex (msex)\n", - "- Age at death (age_death)\n", - "\n", - "\n", - "#### Step 2: Phenotype Reformatting\n", - "\n", - "**Input:**\n", - "- `{celltype}_residuals.txt` from Step 1 (in `2_residuals/{celltype}/`)\n", - "\n", - "**Process:**\n", - "1. Reads residuals file with proper handling of peak IDs and sample columns\n", - "2. Parses peak coordinates from peak IDs (format: `chr-start-end`)\n", - "3. Converts peaks to **midpoint coordinates**:\n", - "\n", - "Use for:\n", - "Genome-wide caQTL mapping with FastQTL, TensorQTL, or MatrixEQTL\n", - "Analysis that accounts for or investigates sex/age effects\n", - "\n", - "---\n", - "\n", - "### Pipeline Outputs\n", - "\n", - "**From Step 0:**\n", - "`metadata_{celltype}.csv`: Metadata with mapped sampleid\n", - "`pseudobulk_peaks_counts_{celltype}.csv.gz`: Counts with mapped sampleid headers\n", - "\n", - "**From Step 1:**\n", - "`{celltype}_residuals.txt`: Covariate-adjusted residuals (log2-CPM scale)\n", - "`{celltype}_filtered_raw_counts.txt`: TMM-normalized counts\n", - "`{celltype}_results.rds`: Complete analysis results\n", - "`{celltype}_summary.txt`: QC and filtering statistics\n", - "`{celltype}_variable_explanation.txt`: Variable documentation\n", - "\n", - "**From Step 2:**\n", - "`{celltype}_kellis_xiong_snatac_phenotype.bed.gz`: Genome-wide QTL-ready BED file\n", - "\n", - "---\n", - "\n", - "**Input files** needed to run this pipeline can be downloaded [here](https://drive.google.com/drive/folders/1l1RJx5toqg_WOlWW3gy-ynkrodi8oqXv?usp=drive_link)." - ] - }, - { - "cell_type": "markdown", - "id": "c58392fc-da3a-4032-9cc3-6f58fdf6c99b", - "metadata": {}, - "source": [ - "#### Before you start, let's set your working paths." - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "3509701b-e03e-49a7-944f-14539b6a46a3", - "metadata": {}, - "outputs": [], - "source": [ - "input_dir <- \" \" # insert your input dir\n", - "output_dir <- \" \" #insert your output dir" - ] - }, - { - "cell_type": "markdown", - "id": "6c6d5f6a-259b-4fa0-ba06-2a83fa19577e", - "metadata": {}, - "source": [ - "## Step 0: Check sample ID\n", - "\n", - "**Purpose:** Maps original sample identifiers (individualID) to standardized sample IDs (sampleid) across metadata and count matrix files.\n", - "\n", - "---\n", - "\n", - "#### Input:\n", - "\n", - "**Sample Mapping Reference:**\n", - "- `rosmap_sample_mapping_data.csv`: Contains mapping between individualID and sampleid\n", - "\n", - "**Metadata Files (per cell type):**\n", - "- `metadata_Ast.csv`\n", - "- `metadata_Ex.csv`\n", - "- `metadata_In.csv`\n", - "- `metadata_Microglia.csv`\n", - "- `metadata_Oligo.csv`\n", - "- `metadata_OPC.csv`\n", - "\n", - "**Count Matrix Files (per cell type):**\n", - "- `pseudobulk_peaks_counts_Ast.csv.gz`\n", - "- `pseudobulk_peaks_counts_Ex.csv.gz`\n", - "- `pseudobulk_peaks_counts_In.csv.gz`\n", - "- `pseudobulk_peaks_counts_Microglia.csv.gz`\n", - "- `pseudobulk_peaks_counts_Oligo.csv.gz`\n", - "- `pseudobulk_peaks_counts_OPC.csv.gz`\n", - "\n", - "\n", - "#### Process:\n", - "\n", - "**Part 1: Process Metadata Files**\n", - "\n", - "1. 
Loads sample mapping dictionary from `rosmap_sample_mapping_data.csv`\n", - "2. Creates a keyed data.table for fast lookups: `individualID → sampleid`\n", - "3. For each metadata file:\n", - " - Reads the CSV file\n", - " - Finds the position of the `individualID` column\n", - " - Creates a new `sampleid` column\n", - " - For each sample:\n", - " - If mapping exists: uses the mapped sampleid\n", - " - If no mapping: uses the original individualID (preserves unmapped samples)\n", - " - Inserts `sampleid` column immediately after `individualID` column\n", - " - Saves updated metadata file\n", - "\n", - "**Part 2: Process Count Matrix Files**\n", - "\n", - "1. For each count matrix file (gzipped):\n", - " - Extracts header line (first row with column names)\n", - " - First column is `peak_id` (kept as-is)\n", - " - Remaining columns are sample IDs (individualID format)\n", - " - Maps sample IDs to sampleid where mapping exists\n", - " - Creates new header with mapped IDs\n", - " - Replaces original header with new header\n", - " - Recompresses with gzip\n", - "\n", - "#### Output:\n", - "Output Directory: `output/1_files_with_sampleid/`\n", - "\n", - "Metadata Files (with sampleid):\n", - "- `metadata_Ast.csv`\n", - "- `metadata_Ex.csv`\n", - "- `metadata_In.csv`\n", - "- `metadata_Microglia.csv`\n", - "- `metadata_Oligo.csv`\n", - "- `metadata_OPC.csv`\n", - "\n", - "Count Matrix Files (with sampleid headers):\n", - "- `pseudobulk_peaks_counts_Ast.csv.gz`\n", - "- `pseudobulk_peaks_counts_Ex.csv.gz`\n", - "- `pseudobulk_peaks_counts_In.csv.gz`\n", - "- `pseudobulk_peaks_counts_Microglia.csv.gz`\n", - "- `pseudobulk_peaks_counts_Oligo.csv.gz`\n", - "- `pseudobulk_peaks_counts_OPC.csv.gz`\n" - ] - }, - { - "cell_type": "markdown", - "id": "b258d410-4973-4c25-b53e-9a2c3399ce28", - "metadata": {}, - "source": [ - "#### Load libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "23eb41f4-134f-48dc-b8ce-b347fda8af48", - "metadata": {}, - "outputs": [], - "source": [ - "library(data.table)" - ] - }, - { - "cell_type": "markdown", - "id": "70a2a878-1fd0-4be2-ab1a-bcee12e9ebc1", - "metadata": {}, - "source": [ - "#### Load input" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "f415cf24-b424-405b-9032-d225f0ed0310", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Read mapping file, rows: 1200 \n" - ] - } - ], - "source": [ - "# 3. Read mapping data\n", - "map_file <- file.path(input_dir, \"data/rosmap_sample_mapping_data.csv\")\n", - "map <- fread(map_file)\n", - "cat(\"Read mapping file, rows:\", nrow(map), \"\\n\")\n", - "\n", - "# 4. 
Create mapping dictionary\n", - "id_map <- map[, .(individualID, sampleid)]\n", - "setkey(id_map, individualID)\n", - "\n", - "# Define cell types and paths\n", - "celltype <- c(\"Ast\", \"Ex\", \"In\", \"Microglia\", \"Oligo\", \"OPC\")\n", - "\n", - "# Your specific metadata file paths\n", - "metadata_files <- file.path(input_dir, paste0(\"1_files_with_sampleid/metadata_\", celltype, \".csv\"))\n", - "\n", - "\n", - "for (ct in celltype) {\n", - " specific_dir <- file.path(output_dir, \"1_files_with_sampleid\")\n", - " if (!dir.exists(specific_dir)) {\n", - " dir.create(specific_dir, recursive = TRUE)\n", - " cat(\"Created directory:\", specific_dir, \"\\n\")\n", - " }\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "599dbf62-9db1-4e0b-a689-71eb2f27c98d", - "metadata": {}, - "source": [ - "### Process metadata files" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "28e8a4ec-45f7-4678-8e43-d7d9246850bf", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Processing metadata file: metadata_Ast.csv \n", - "Original rows: 93 columns: 10 \n", - "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ast.csv \n", - "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ast.csv \n", - "Converted rows: 93 columns: 10 \n", - "Mapped IDs: 84 Unmapped IDs: 9 \n", - "\n", - "Processing metadata file: metadata_Ex.csv \n", - "Original rows: 92 columns: 10 \n", - "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ex.csv \n", - "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ex.csv \n", - "Converted rows: 92 columns: 10 \n", - "Mapped IDs: 83 Unmapped IDs: 9 \n", - "\n", - "Processing metadata file: metadata_In.csv \n", - "Original rows: 93 columns: 10 \n", - "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_In.csv \n", - "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_In.csv \n", - "Converted rows: 93 columns: 10 \n", - "Mapped IDs: 84 Unmapped IDs: 9 \n", - "\n", - "Processing metadata file: metadata_Microglia.csv \n", - "Original rows: 93 columns: 10 \n", - "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Microglia.csv \n", - "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Microglia.csv \n", - "Converted rows: 93 columns: 10 \n", - "Mapped IDs: 84 Unmapped IDs: 9 \n", - "\n", - "Processing metadata file: metadata_Oligo.csv \n", - "Original rows: 93 columns: 10 \n", - "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Oligo.csv \n", - "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Oligo.csv \n", - "Converted rows: 93 columns: 10 \n", - "Mapped IDs: 84 Unmapped IDs: 9 \n", - "\n", - "Processing metadata file: metadata_OPC.csv \n", - "Original rows: 93 columns: 10 \n", - "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_OPC.csv \n", - "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_OPC.csv \n", - "Converted rows: 
93 columns: 10 \n", - "Mapped IDs: 84 Unmapped IDs: 9 \n", - "\n", - "Metadata file processing summary:\n", - " file mapped_ids unmapped_ids total_ids\n", - " \n", - "1: metadata_Ast.csv 84 9 93\n", - "2: metadata_Ex.csv 83 9 92\n", - "3: metadata_In.csv 84 9 93\n", - "4: metadata_Microglia.csv 84 9 93\n", - "5: metadata_Oligo.csv 84 9 93\n", - "6: metadata_OPC.csv 84 9 93\n" - ] - } - ], - "source": [ - "# Function to process metadata files - adds sampleid and uses individualID for unmapped cases\n", - "process_metadata <- function(file_path, celltype_name) {\n", - " cat(\"\\nProcessing metadata file:\", basename(file_path), \"\\n\")\n", - " \n", - " # Read data\n", - " meta <- fread(file_path)\n", - " cat(\"Original rows:\", nrow(meta), \"columns:\", ncol(meta), \"\\n\")\n", - " \n", - " # Find the position of individualID column\n", - " id_col_index <- which(colnames(meta) == \"individualID\")\n", - " if (length(id_col_index) == 0) {\n", - " cat(\"Warning: individualID column not found\\n\")\n", - " return(NULL)\n", - " }\n", - " \n", - " # Find the mapped sampleids for each individualID\n", - " meta$sampleid <- character(nrow(meta)) # Initialize with empty strings\n", - " \n", - " for (i in 1:nrow(meta)) {\n", - " ind_id <- meta$individualID[i]\n", - " mapped_id <- id_map[ind_id, sampleid]\n", - " \n", - " # If mapping found, use it; otherwise use the original individualID\n", - " if (length(mapped_id) > 0 && !is.na(mapped_id)) {\n", - " meta$sampleid[i] <- mapped_id\n", - " } else {\n", - " # Use the original individualID instead of NA\n", - " meta$sampleid[i] <- ind_id\n", - " }\n", - " }\n", - " \n", - " # Move sampleid column to the front\n", - " setcolorder(meta, c(\"sampleid\", setdiff(names(meta), \"sampleid\")))\n", - " \n", - " # Save results\n", - " output_file <- file.path(output_dir, \"1_files_with_sampleid\",basename(file_path))\n", - " cat(\"Output file will be saved to:\", output_file, \"\\n\")\n", - " fwrite(meta, output_file)\n", - " \n", - " # Count mapped and unmapped IDs\n", - " mapped_count <- sum(meta$sampleid != meta$individualID)\n", - " unmapped_count <- sum(meta$sampleid == meta$individualID)\n", - " \n", - " cat(\"Saved to:\", output_file, \"\\n\")\n", - " cat(\"Converted rows:\", nrow(meta), \"columns:\", ncol(meta), \"\\n\")\n", - " cat(\"Mapped IDs:\", mapped_count, \"Unmapped IDs:\", unmapped_count, \"\\n\")\n", - " \n", - " # Return processing summary\n", - " list(\n", - " file = basename(file_path),\n", - " mapped_ids = mapped_count,\n", - " unmapped_ids = unmapped_count,\n", - " total_ids = nrow(meta)\n", - " )\n", - "}\n", - "\n", - "# Process all metadata files\n", - "meta_results <- mapply(process_metadata, metadata_files, celltype, SIMPLIFY = FALSE)\n", - "meta_summary <- do.call(rbind, lapply(meta_results, as.data.table))\n", - "\n", - "cat(\"\\nMetadata file processing summary:\\n\")\n", - "print(meta_summary)" - ] - }, - { - "cell_type": "markdown", - "id": "4752c617-693c-4fd3-b9b4-a21c9326bec8", - "metadata": {}, - "source": [ - "### Process count matrix files" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "b8c39e1c-5913-411d-bb14-749371fe5368", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Processing count matrix file: pseudobulk_peaks_counts_Ast.csv.gz \n", - "Original columns: 93 \n", - "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz | tail -n +2 | cat 
/scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a5b135eff - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n", - "Input file size: -rw-r--r-- 1 jaempawi xqtl 22M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n", - "Output file size: -rw-r--r-- 1 jaempawi xqtl 22M Feb 12 15:32 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n", - "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n", - "\n", - "Processing count matrix file: pseudobulk_peaks_counts_Ex.csv.gz \n", - "Original columns: 92 \n", - "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a1b4f71a - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n", - "Input file size: -rw-r--r-- 1 jaempawi xqtl 24M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n", - "Output file size: -rw-r--r-- 1 jaempawi xqtl 24M Feb 12 15:32 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n", - "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n", - "\n", - "Processing count matrix file: pseudobulk_peaks_counts_In.csv.gz \n", - "Original columns: 93 \n", - "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a24fc9c54 - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n", - "Input file size: -rw-r--r-- 1 jaempawi xqtl 24M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n", - "Output file size: -rw-r--r-- 1 jaempawi xqtl 24M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n", - "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n", - "\n", - "Processing count matrix file: pseudobulk_peaks_counts_Microglia.csv.gz \n", - "Original columns: 93 \n", - "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a5e37a1a8 - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n", - "Input file size: -rw-r--r-- 1 jaempawi xqtl 16M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n", - "Output file size: -rw-r--r-- 1 jaempawi xqtl 16M Feb 12 15:33 
/restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n", - "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n", - "\n", - "Processing count matrix file: pseudobulk_peaks_counts_Oligo.csv.gz \n", - "Original columns: 93 \n", - "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a522197c8 - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n", - "Input file size: -rw-r--r-- 1 jaempawi xqtl 28M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n", - "Output file size: -rw-r--r-- 1 jaempawi xqtl 28M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n", - "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n", - "\n", - "Processing count matrix file: pseudobulk_peaks_counts_OPC.csv.gz \n", - "Original columns: 93 \n", - "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a68ad457e - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n", - "Input file size: -rw-r--r-- 1 jaempawi xqtl 17M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n", - "Output file size: -rw-r--r-- 1 jaempawi xqtl 17M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n", - "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n", - "\n", - "Count matrix file processing summary:\n", - " file total_columns mapped_columns\n", - " \n", - "1: pseudobulk_peaks_counts_Ast.csv.gz 93 0\n", - "2: pseudobulk_peaks_counts_Ex.csv.gz 92 0\n", - "3: pseudobulk_peaks_counts_In.csv.gz 93 0\n", - "4: pseudobulk_peaks_counts_Microglia.csv.gz 93 0\n", - "5: pseudobulk_peaks_counts_Oligo.csv.gz 93 0\n", - "6: pseudobulk_peaks_counts_OPC.csv.gz 93 0\n", - " unmapped_columns\n", - " \n", - "1: 92\n", - "2: 91\n", - "3: 92\n", - "4: 92\n", - "5: 92\n", - "6: 92\n", - "\n", - "All files processed!\n" - ] - } - ], - "source": [ - "# Your specific metadata file paths\n", - "count_files <- file.path(input_dir, paste0(\"1_files_with_sampleid/pseudobulk_peaks_counts_\", celltype, \".csv.gz\"))\n", - "\n", - "\n", - "# Direct column renaming for count matrix files\n", - "process_counts_simple <- function(file_path) {\n", - " cat(\"\\nProcessing count matrix file:\", basename(file_path), \"\\n\")\n", - " \n", - " # Get header line only\n", - " header_command <- paste0(\"zcat \", file_path, \" | head -n 1\")\n", - " header_line <- system(header_command, intern = TRUE)\n", - " \n", - " # Parse column names\n", - " col_names <- unlist(strsplit(header_line, \",\"))\n", - " 
cat(\"Original columns:\", length(col_names), \"\\n\")\n", - " \n", - " # First column is peak_id, remaining columns are sample IDs\n", - " peak_id_col <- col_names[1]\n", - " sample_cols <- col_names[-1]\n", - " \n", - " # Map sample IDs\n", - " new_sample_cols <- character(length(sample_cols))\n", - " mapped_count <- 0\n", - " \n", - " for (i in seq_along(sample_cols)) {\n", - " ind_id <- sample_cols[i]\n", - " mapped_id <- id_map[ind_id, sampleid]\n", - " \n", - " if (length(mapped_id) > 0 && !is.na(mapped_id)) {\n", - " new_sample_cols[i] <- mapped_id\n", - " mapped_count <- mapped_count + 1\n", - " } else {\n", - " # Keep original individualID if no mapping found\n", - " new_sample_cols[i] <- ind_id\n", - " }\n", - " }\n", - " \n", - " # Create new header\n", - " new_col_names <- c(peak_id_col, new_sample_cols)\n", - " \n", - " # Create temporary header file\n", - " temp_header <- tempfile()\n", - " writeLines(paste(new_col_names, collapse = \",\"), temp_header)\n", - " \n", - " # Output file path\n", - " output_file <- file.path(output_dir, \"1_files_with_sampleid\", basename(file_path))\n", - " \n", - " # Use system command to process the file without chunking\n", - " # This extracts the data (excluding header), prepends new header, and compresses\n", - " cmd <- paste0(\n", - " \"zcat \", file_path, \" | tail -n +2 | cat \", temp_header, \" - | gzip > \", output_file\n", - " )\n", - " \n", - " cat(\"Executing command:\", cmd, \"\\n\")\n", - " system_result <- system(cmd)\n", - " \n", - " # Check if command succeeded\n", - " if (system_result != 0) {\n", - " cat(\"ERROR: Command failed with exit code\", system_result, \"\\n\")\n", - " cat(\"Attempting backup method...\\n\")\n", - " \n", - " # Backup method using R's built-in file handling\n", - " tryCatch({\n", - " # Create a named vector for mapping\n", - " id_mapping <- setNames(new_sample_cols, sample_cols)\n", - " \n", - " # Open connections\n", - " in_conn <- gzfile(file_path, \"r\")\n", - " out_conn <- gzfile(output_file, \"w\")\n", - " \n", - " # Read and discard the header line\n", - " readLines(in_conn, n = 1)\n", - " \n", - " # Write the new header\n", - " writeLines(paste(new_col_names, collapse = \",\"), out_conn)\n", - " \n", - " # Copy the rest of the file line by line\n", - " while (length(line <- readLines(in_conn, n = 1)) > 0) {\n", - " writeLines(line, out_conn)\n", - " }\n", - " \n", - " # Close connections\n", - " close(in_conn)\n", - " close(out_conn)\n", - " \n", - " cat(\"Backup method successful\\n\")\n", - " }, error = function(e) {\n", - " cat(\"Backup method also failed:\", e$message, \"\\n\")\n", - " })\n", - " } else {\n", - " # Check file sizes to verify completion\n", - " input_size <- system(paste(\"ls -lh\", file_path), intern = TRUE)\n", - " output_size <- system(paste(\"ls -lh\", output_file), intern = TRUE)\n", - " cat(\"Input file size: \", input_size, \"\\n\")\n", - " cat(\"Output file size:\", output_size, \"\\n\")\n", - " }\n", - " \n", - " # Delete temporary file\n", - " file.remove(temp_header)\n", - " \n", - " cat(\"File processing completed and saved to:\", output_file, \"\\n\")\n", - " \n", - " # Return processing summary\n", - " list(\n", - " file = basename(file_path),\n", - " total_columns = length(col_names),\n", - " mapped_columns = mapped_count,\n", - " unmapped_columns = length(sample_cols) - mapped_count\n", - " )\n", - "}\n", - "\n", - "# Process all count files\n", - "count_results <- lapply(count_files, process_counts_simple)\n", - "\n", - "# Summarize results\n", - 
"count_summary <- do.call(rbind, lapply(count_results, as.data.table))\n", - "cat(\"\\nCount matrix file processing summary:\\n\")\n", - "print(count_summary)\n", - "\n", - "cat(\"\\nAll files processed!\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "9d97c736-2394-46c3-a76c-2d5bb82a1098", - "metadata": {}, - "source": [ - "## Step 1: Pseudobulk QC noBIOvar\n", - "**Purpose:** Performs quality control on pseudobulk ATAC-seq data, filters low-quality samples and peaks, normalizes data, and calculates covariate-adjusted residuals while preserving biological variation (sex, age).\n", - "\n", - "---\n", - "\n", - "#### Input:\n", - "\n", - "**From Step 0 (required):**\n", - "- `metadata_{celltype}.csv` (in `output/1_files_with_sampleid/`)\n", - "- `pseudobulk_peaks_counts_{celltype}.csv.gz` (in `output/1_files_with_sampleid/`)\n", - "\n", - "**Reference Files:**\n", - "- `rosmap_cov.txt`: Sample covariates (pmi, study)\n", - "- `hg38-blacklist.v2.bed.gz`: ENCODE blacklist regions\n", - "\n", - "**Cell Types:**\n", - "- `Mic` (Microglia)\n", - "- `Astro` (Astrocytes)\n", - "- `Oligo` (Oligodendrocytes)\n", - "- `Ex` (Excitatory neurons)\n", - "- `In` (Inhibitory neurons)\n", - "- `OPC` (Oligodendrocyte precursor cells)\n", - "\n", - "#### Process:\n", - "\n", - "1. Load Data\n", - "2. Sample Quality Filtering\n", - "3. Calculate Technical QC Metrics\n", - "4. Process Peak Coordinates\n", - "5. Filter Blacklisted Regions\n", - "6. Merge Covariates\n", - "7. Create DGE Object\n", - "8. Expression Filtering\n", - "9. Save Filtered Raw Counts\n", - "10. TMM Normalization\n", - "11. Handle Batch and Library Variables\n", - "12. Build Linear Model\n", - "13. Voom Transformation & Model Fitting\n", - "14. Calculate Offsets and Residuals\n", - "\n", - "#### Output:\n", - "Output Directory: `output/2_residuals/{celltype}/`\n", - "\n", - "1. Residuals File: `{celltype}_residuals.txt`\n", - "2. Results Object: `{celltype}_results.rds`\n", - "3. Summary Report: `{celltype}_summary.txt`\n", - "4. Variable Explanation: `{celltype}_variable_explanation.txt`\n", - "5. 
Filtered Raw Counts: `{celltype}_filtered_raw_counts.txt`" - ] - }, - { - "cell_type": "markdown", - "id": "f4ef8b2d-64b4-4d49-9845-93a6ee4b8895", - "metadata": {}, - "source": [ - "#### Load libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "e85cd96e-2357-41c8-90ab-bd61e14cf22e", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "Attaching package: ‘dplyr’\n", - "\n", - "\n", - "The following objects are masked from ‘package:data.table’:\n", - "\n", - " between, first, last\n", - "\n", - "\n", - "The following objects are masked from ‘package:stats’:\n", - "\n", - " filter, lag\n", - "\n", - "\n", - "The following objects are masked from ‘package:base’:\n", - "\n", - " intersect, setdiff, setequal, union\n", - "\n", - "\n", - "Loading required package: limma\n", - "\n" - ] - } - ], - "source": [ - "library(data.table)\n", - "library(stringr)\n", - "library(dplyr)\n", - "library(edgeR)" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "6d194542-2660-46cd-84ab-362cf147a4d9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Processing celltype: Oligo \n" - ] - } - ], - "source": [ - "# Set cell type and create output directory\n", - "#args <- commandArgs(trailingOnly = TRUE)\n", - "\n", - "celltype <- \"Oligo\"\n", - "cat(\"Processing celltype:\", celltype, \"\\n\")\n", - "\n", - "# Create individual directories for each cell type\n", - "for (ct in celltype) {\n", - " specific_dir <- file.path(output_dir, \"2_residuals\",celltype)\n", - " if (!dir.exists(specific_dir)) {\n", - " dir.create(specific_dir, recursive = TRUE)\n", - " cat(\"Created directory:\", specific_dir, \"\\n\")\n", - " }\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "65bcf867-e949-41a3-8afe-b66f71217ca7", - "metadata": {}, - "source": [ - "#### Create predictOffset function" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "2a63f3d9-c699-4e97-85f1-456f684e8b2e", - "metadata": {}, - "outputs": [], - "source": [ - "predictOffset <- function(fit) {\n", - " # Define which variables are factors and which are continuous\n", - " usedFactors <- c(\"sequencingBatch\", \"Library\", \"study\") \n", - " usedContinuous <- c(\"log_n_nuclei\", \"med_nucleosome_signal\", \"med_tss_enrich\", \"log_med_n_tot_fragment\",\n", - " \"log_total_unique_peaks\", \"pmi\")\n", - " \n", - " # Filter to only use variables actually in the design matrix\n", - " usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n", - " usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n", - " \n", - " # Get indices for factor and continuous variables\n", - " facInd <- unlist(lapply(as.list(usedFactors), \n", - " function(f) {return(grep(paste0(\"^\", f), \n", - " colnames(fit$design)))}))\n", - " contInd <- unlist(lapply(as.list(usedContinuous), \n", - " function(f) {return(grep(paste0(\"^\", f), \n", - " colnames(fit$design)))}))\n", - " \n", - " # Add the intercept\n", - " all_indices <- c(1, facInd, contInd)\n", - " \n", - " # Verify design matrix structure (using sorted indices to avoid duplication warning)\n", - " all_indices_sorted <- sort(unique(all_indices))\n", - " stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))\n", - " \n", - " # Create new design matrix with median values\n", - " D <- fit$design\n", - " D[, facInd] <- 0 # Set all factor 
levels to reference level\n", - " \n", - " # For continuous variables, set to median value\n", - " if (length(contInd) > 0) {\n", - " medContVals <- apply(D[, contInd, drop=FALSE], 2, median)\n", - " for (i in 1:length(medContVals)) {\n", - " D[, names(medContVals)[i]] <- medContVals[i]\n", - " }\n", - " }\n", - " \n", - " # Calculate offsets\n", - " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n", - " offsets <- apply(coefficients(fit), 1, function(c) {\n", - " return(D %*% c)\n", - " })\n", - " offsets <- t(offsets)\n", - " colnames(offsets) <- rownames(fit$design)\n", - " \n", - " return(offsets)\n", - "}\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "35928e75-ca48-49ae-be1d-d1429c3171c3", - "metadata": {}, - "source": [ - "#### Load data" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "4f3ffb5c-a52d-4f7d-ba1f-14241612be1d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded metadata with 93 samples\n", - "Filtered to 92 samples with > 20 nuclei\n", - "Loaded peak data with 363775 peaks\n", - "Valid samples after nuclei filtering: 92 \n", - "Valid samples present in peak data: 90 \n", - "Original peak data dimensions: 363775 × 92 \n", - "Filtered peak data dimensions: 363775 × 90 \n", - "Final metadata samples after filtering: 90 \n" - ] - } - ], - "source": [ - "celltype <- \"Oligo\"\n", - "meta_path <- paste0(output_dir, \"/1_files_with_sampleid/metadata_\", celltype, \".csv\")\n", - "peak_path <- paste0(output_dir, \"/1_files_with_sampleid/pseudobulk_peaks_counts_\", celltype, \".csv.gz\")\n", - "\n", - "# Blacklist and Covariates are in the source 'data_dir'\n", - "blacklist_file <- file.path(input_dir, \"data/hg38-blacklist.v2.bed.gz\")\n", - "covariates_file <- file.path(input_dir, \"data/rosmap_cov.txt\")\n", - "\n", - "# Load metadata\n", - "meta <- fread(meta_path)\n", - "cat(\"Loaded metadata with\", nrow(meta), \"samples\\n\")\n", - "\n", - "# Filter samples with n_nuclei > 20\n", - "meta_filtered <- meta[n.nuclei > 20]\n", - "cat(\"Filtered to\", nrow(meta_filtered), \"samples with > 20 nuclei\\n\")\n", - "\n", - "# Load peak data\n", - "peak_data <- fread(peak_path)\n", - "cat(\"Loaded peak data with\", nrow(peak_data), \"peaks\\n\")\n", - "\n", - "# Extract peak_id and set as rownames\n", - "peak_id <- peak_data$peak_id\n", - "peak_data <- peak_data[, -1, with = FALSE] # Remove peak_id column\n", - "\n", - "# Filter peak data to keep only samples with >20 nuclei\n", - "valid_samples <- meta_filtered$sampleid\n", - "cat(\"Valid samples after nuclei filtering:\", length(valid_samples), \"\\n\")\n", - "\n", - "# Find which valid samples actually exist in the peak data\n", - "available_samples <- intersect(valid_samples, colnames(peak_data))\n", - "cat(\"Valid samples present in peak data:\", length(available_samples), \"\\n\")\n", - "\n", - "# Create filtered peak matrix\n", - "peak_data_filtered <- peak_data[, ..available_samples, with=FALSE]\n", - "cat(\"Original peak data dimensions:\", nrow(peak_data), \"×\", ncol(peak_data), \"\\n\")\n", - "cat(\"Filtered peak data dimensions:\", nrow(peak_data_filtered), \"×\", ncol(peak_data_filtered), \"\\n\")\n", - "\n", - "# Convert to matrix for downstream analysis\n", - "peak_matrix <- as.matrix(peak_data_filtered)\n", - "rownames(peak_matrix) <- peak_id\n", - "\n", - "# Update metadata to match filtered samples\n", - "meta_filtered <- meta_filtered[sampleid %in% available_samples]\n", - "cat(\"Final metadata samples after 
filtering:\", nrow(meta_filtered), \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "8228d5e0-b459-421c-a3aa-e5e8a3a0f992", - "metadata": {}, - "source": [ - "#### Process technical variables from meta data" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "92bd6ab0-27e2-4218-8a35-8f826227d9fd", - "metadata": {}, - "outputs": [], - "source": [ - "# Column name normalization (for easier handling)\n", - "meta_clean <- meta_filtered %>%\n", - " rename(\n", - " med_nucleosome_signal = med.nucleosome_signal.ct,\n", - " med_tss_enrich = med.tss.enrich.ct,\n", - " med_n_tot_fragment = med.n_tot_fragment.ct,\n", - " n_nuclei = n.nuclei\n", - " )\n", - "\n", - "# Calculate peak metrics - total unique peaks per sample and median peak width\n", - "peak_metrics <- data.frame(\n", - " sampleid = colnames(peak_matrix),\n", - " total_unique_peaks = colSums(peak_matrix > 0)\n", - ") %>%\n", - " mutate(log_total_unique_peaks = log(total_unique_peaks + 1))\n", - "\n", - "# Calculate median peak width for each sample using count as weight\n", - "calculate_median_peakwidth <- function(peak_matrix, peak_info) {\n", - " # Create a data frame with peak widths\n", - " peak_widths <- peak_info$end - peak_info$start\n", - " \n", - " # Initialize a vector to store median peak widths\n", - " median_peak_widths <- numeric(ncol(peak_matrix))\n", - " names(median_peak_widths) <- colnames(peak_matrix)\n", - " \n", - " # For each sample, calculate the weighted median peak width\n", - " for (i in 1:ncol(peak_matrix)) {\n", - " sample_counts <- peak_matrix[, i]\n", - " # Only consider peaks with counts > 0\n", - " idx <- which(sample_counts > 0)\n", - " \n", - " if (length(idx) > 0) {\n", - " # Method 1: Use counts as weights\n", - " weights <- sample_counts[idx]\n", - " # Repeat each peak width by its count for weighted calculation\n", - " all_widths <- rep(peak_widths[idx], times=weights)\n", - " median_peak_widths[i] <- median(all_widths)\n", - " } else {\n", - " median_peak_widths[i] <- NA\n", - " }\n", - " }\n", - " \n", - " return(median_peak_widths)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "3ab3869e-c624-4666-930f-97c6976c74da", - "metadata": {}, - "source": [ - "#### Process peaks" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "b71aebba-d9c9-4836-8432-c3bba27e9864", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Sample of peak coordinates:\n", - " peak_name chr start end\n", - " \n", - "1: chr1-817077-817577 chr1 817077 817577\n", - "2: chr1-827285-827785 chr1 827285 827785\n", - "3: chr1-850237-850737 chr1 850237 850737\n", - "4: chr1-869660-870160 chr1 869660 870160\n", - "5: chr1-903662-904162 chr1 903662 904162\n", - "6: chr1-904504-905004 chr1 904504 905004\n", - "Number of blacklisted peaks: 29 \n", - "Number of peaks after blacklist filtering: 363746 \n" - ] - } - ], - "source": [ - "# Process peak coordinates\n", - "peak_df <- data.table(\n", - " peak_name = peak_id,\n", - " chr = sapply(strsplit(peak_id, \"-\"), `[`, 1),\n", - " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n", - " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3)),\n", - " stringsAsFactors = FALSE\n", - ")\n", - "\n", - "# Verify peak coordinates were extracted correctly\n", - "cat(\"Sample of peak coordinates:\\n\")\n", - "print(head(peak_df))\n", - "\n", - "if (file.exists(blacklist_file)) {\n", - " blacklist_df <- fread(blacklist_file)\n", - " if (ncol(blacklist_df) >= 4) {\n", - " 
colnames(blacklist_df)[1:4] <- c(\"chr\", \"start\", \"end\", \"label\")\n", - " } else {\n", - " colnames(blacklist_df)[1:3] <- c(\"chr\", \"start\", \"end\")\n", - " }\n", - " \n", - " # Filter blacklisted peaks\n", - " setkey(blacklist_df, chr, start, end)\n", - " setkey(peak_df, chr, start, end)\n", - " overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n", - " blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n", - " cat(\"Number of blacklisted peaks:\", length(blacklisted_peaks), \"\\n\")\n", - " \n", - " filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)\n", - " filtered_peak <- peak_matrix[filtered_peak_idx, ]\n", - " cat(\"Number of peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n", - "} else {\n", - " cat(\"Warning: Blacklist file not found at\", blacklist_file, \"\\n\")\n", - " cat(\"Proceeding without blacklist filtering\\n\")\n", - " filtered_peak <- peak_matrix\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "fcabaa13-d809-4538-806d-d3aea0a37858", - "metadata": {}, - "source": [ - "#### Load covariates" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "d5caf744-bf0a-4aa9-95bc-601341111872", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Variable statistics before and after log transformation:\n", - "n_nuclei: min=39.00, median=849.00, max=4394.00, SD=1080.03\n", - "log_n_nuclei: min=3.66, median=6.74, max=8.39, SD=1.05\n", - "med_n_tot_fragment: min=1308.50, median=7521.00, max=30629.00, SD=5373.50\n", - "log_med_n_tot_fragment: min=7.18, median=8.93, max=10.33, SD=0.69\n", - "Number of samples after joining: 83 \n", - "Sample IDs: SM-CTECR, SM-CJK5G, SM-CJEKQ, SM-CJGGY, SM-CJK3S, SM-CTEGU ...\n", - "Available covariates: sampleid, individualID, sequencingBatch, Library, Celltype4, n_nuclei, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, total_unique_peaks, log_total_unique_peaks, pmi, study, log_n_nuclei, log_med_n_tot_fragment \n" - ] - } - ], - "source": [ - "covariates_file <- file.path(input_dir,'data/rosmap_cov.txt')\n", - "\n", - "if (file.exists(covariates_file)) {\n", - " covariates <- fread(covariates_file)\n", - " # Check column names and adjust if needed\n", - " if ('#id' %in% colnames(covariates)) {\n", - " id_col <- '#id'\n", - " } else if ('individualID' %in% colnames(covariates)) {\n", - " id_col <- 'individualID'\n", - " } else {\n", - " cat(\"Warning: Could not identify ID column in covariates file. 
Available columns:\", \n", - " paste(colnames(covariates), collapse=\", \"), \"\\n\")\n", - " id_col <- colnames(covariates)[1]\n", - " cat(\"Using\", id_col, \"as ID column\\n\")\n", - " }\n", - " \n", - " # Select relevant columns - excluding msex and age_death\n", - " cov_cols <- intersect(c(id_col, 'pmi', 'study'), colnames(covariates))\n", - " covariates <- covariates[, ..cov_cols]\n", - " \n", - " # Merge with metadata\n", - " meta_with_ind <- meta_clean %>%\n", - " select(sampleid, everything())\n", - " \n", - " all_covs <- meta_with_ind %>%\n", - " inner_join(peak_metrics, by = \"sampleid\") %>%\n", - " inner_join(covariates, by = setNames(id_col, \"sampleid\"))\n", - " \n", - " # Impute missing values\n", - " for (col in c(\"pmi\")) {\n", - " if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {\n", - " cat(\"Imputing missing values for\", col, \"\\n\")\n", - " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n", - " }\n", - " }\n", - "} else {\n", - " cat(\"Warning: Covariates file\", covariates_file, \"not found.\\n\")\n", - " cat(\"Proceeding with only technical variables.\\n\")\n", - " all_covs <- meta_clean %>%\n", - " inner_join(peak_metrics, by = \"sampleid\")\n", - "}\n", - "\n", - "\n", - "# Perform log transformations on necessary variables\n", - "# Add a small constant to avoid log(0)\n", - "epsilon <- 1e-6\n", - "\n", - "all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)\n", - "all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)\n", - "\n", - "# Show distribution of original and log-transformed variables\n", - "cat(\"\\nVariable statistics before and after log transformation:\\n\")\n", - "for (var in c(\"n_nuclei\", \"med_n_tot_fragment\")) {\n", - " orig_var <- all_covs[[var]]\n", - " log_var <- all_covs[[paste0(\"log_\", var)]]\n", - " \n", - " cat(sprintf(\"%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n", - " var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))\n", - " cat(sprintf(\"log_%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n", - " var, min(log_var), median(log_var), max(log_var), sd(log_var)))\n", - "}\n", - "\n", - "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n", - "cat(\"Sample IDs:\", paste(head(all_covs$sampleid), collapse=\", \"), \"...\\n\")\n", - "cat(\"Available covariates:\", paste(colnames(all_covs), collapse=\", \"), \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "af1a0588-5d0d-471e-857f-754b69836303", - "metadata": {}, - "source": [ - "#### Create DGE object" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "ccdc7318-b28e-4037-ac3a-c7794d4a72ba", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of valid samples: 83 \n" - ] - } - ], - "source": [ - "valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)\n", - "cat(\"Number of valid samples:\", length(valid_samples), \"\\n\")\n", - "\n", - "all_covs_filtered <- all_covs[all_covs$sampleid %in% valid_samples, ]\n", - "filtered_peak_filtered <- filtered_peak[, valid_samples]\n", - "\n", - "dge <- DGEList(\n", - " counts = filtered_peak_filtered,\n", - " samples = all_covs_filtered\n", - ")\n", - "rownames(dge$samples) <- dge$samples$sampleid" - ] - }, - { - "cell_type": "markdown", - "id": "bd4a6650-bedd-4a93-ba36-fd4f091cbb99", - "metadata": {}, - "source": [ - "#### Filter low counts and normalize" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": 
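The `filterByExpr()` call in the next cell passes no `group` or `design` argument, which is why edgeR prints the "All samples appear to belong to the same group" warning visible in its output. Below is a minimal sketch of a batch-aware alternative, assuming the `dge` object created above; using `sequencingBatch` as the grouping variable is purely illustrative and is not what the original pipeline does.

```r
# Hypothetical variant: give filterByExpr() an explicit grouping factor so that
# min.prop is evaluated within groups rather than across all samples at once.
keep_grouped <- filterByExpr(dge,
                             group = dge$samples$sequencingBatch,
                             min.count = 5,
                             min.total.count = 15,
                             min.prop = 0.1)
cat("Peaks kept with batch-aware filtering:", sum(keep_grouped), "\n")
```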
"630dc838-d78b-445c-846d-91f1fc0bc56f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks before filtering: 176039 \n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning message in filterByExpr.DGEList(dge, min.count = 5, min.total.count = 15, :\n", - "“All samples appear to belong to the same group.”\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks after filtering: 176039 \n", - "Saved filtered raw counts to /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/Oligo_filtered_raw_counts.txt \n" - ] - } - ], - "source": [ - "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n", - "keep <- filterByExpr(dge, \n", - " min.count = 5, # for one sample, min reads \n", - " min.total.count = 15, # min reads overall\n", - " min.prop = 0.1) \n", - "\n", - "dge <- dge[keep, , keep.lib.sizes=FALSE]\n", - "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\") #66154 in OPC\n", - "\n", - "# Save filtered raw count data\n", - "filtered_raw_counts <- dge$counts\n", - "write.table(filtered_raw_counts,\n", - " file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_filtered_raw_counts.txt\"), \n", - " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n", - "cat(\"Saved filtered raw counts to\", paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_filtered_raw_counts.txt\"), \"\\n\")\n", - "\n", - "dge <- calcNormFactors(dge, method=\"TMM\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "533471bd-7b96-4bd7-b3b2-00694b69507b", - "metadata": {}, - "source": [ - "#### Handle batch and library as technical variables" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "152eaa2c-8856-4436-a27f-6064bd93dd93", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Handling sequencingBatch and Library as technical variables\n", - "Found 2 unique sequencing batches\n", - "Batch sizes:\n", - "batches\n", - "190820Kel 191203Kel \n", - " 7 76 \n", - "Found 7 unique libraries\n", - "Library sizes:\n", - "libraries\n", - "Library10 Library11 Library2 Library4 Library5 Library7 Library9 \n", - " 26 6 7 6 7 23 8 \n" - ] - } - ], - "source": [ - "# We'll handle batch and library as technical variables rather than doing batch adjustment\n", - "cat(\"Handling sequencingBatch and Library as technical variables\\n\")\n", - "\n", - "# Check batch information\n", - "batches <- dge$samples$sequencingBatch\n", - "cat(\"Found\", length(unique(batches)), \"unique sequencing batches\\n\")\n", - "\n", - "# Check batch size\n", - "batch_counts <- table(batches)\n", - "cat(\"Batch sizes:\\n\")\n", - "print(batch_counts)\n", - "\n", - "# Convert sequencingBatch to factor with at least 2 levels\n", - "if (length(unique(batches)) < 2) {\n", - " cat(\"Only one sequencing batch found. 
Adding dummy batch for model compatibility.\\n\")\n", - " # Create a dummy batch factor to avoid model errors\n", - " dge$samples$sequencingBatch_factor <- factor(rep(\"batch1\", ncol(dge)))\n", - "} else {\n", - " # Use the existing batch information\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - "}\n", - "\n", - "# Check library information\n", - "libraries <- dge$samples$Library\n", - "cat(\"Found\", length(unique(libraries)), \"unique libraries\\n\")\n", - "\n", - "# Check library size\n", - "library_counts <- table(libraries)\n", - "cat(\"Library sizes:\\n\")\n", - "print(library_counts)\n", - "\n", - "# Convert Library to factor with at least 2 levels\n", - "if (length(unique(libraries)) < 2) {\n", - " cat(\"Only one library found. Adding dummy library for model compatibility.\\n\")\n", - " # Create a dummy library factor to avoid model errors\n", - " dge$samples$Library_factor <- factor(rep(\"lib1\", ncol(dge)))\n", - "} else {\n", - " # Use the existing library information\n", - " dge$samples$Library_factor <- factor(dge$samples$Library)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "9bc8dda3-89ae-47ad-8785-e393695061dd", - "metadata": {}, - "source": [ - "#### Create model and run voom" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "c6e7f374-b7a5-4666-ac99-191807b7e8e2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Using model with technical covariates plus pmi and study\n", - "Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch_factor + Library_factor + pmi + study \n", - "Warning: Factor variable group has only one level. Converting to character.\n", - "Successfully created design matrix with 15 columns\n", - "Design matrix is not full rank. 
Adjusting...\n", - "Adjusted design matrix columns: 14 \n", - "Calculating offsets and residuals...\n" - ] - } - ], - "source": [ - "# Define the model based on available covariates - using log-transformed variables\n", - "# Removed msex and age_death from the model\n", - "if (\"study\" %in% colnames(dge$samples) && \"pmi\" %in% colnames(dge$samples)) {\n", - " # Technical model with pmi and study\n", - " cat(\"Using model with technical covariates plus pmi and study\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + sequencingBatch_factor + Library_factor + pmi + study\n", - "} else if (\"pmi\" %in% colnames(dge$samples)) {\n", - " # Technical model with pmi only\n", - " cat(\"Using model with technical covariates and pmi\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + sequencingBatch_factor + Library_factor + pmi\n", - "} else {\n", - " # Technical variables only model\n", - " cat(\"Using model with technical covariates only\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + sequencingBatch_factor + Library_factor\n", - "}\n", - "\n", - "# Print the model formula\n", - "cat(\"Model formula:\", deparse(model), \"\\n\")\n", - "\n", - "# Check for factor variables with only one level\n", - "for (col in colnames(dge$samples)) {\n", - " if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {\n", - " cat(\"Warning: Factor variable\", col, \"has only one level. Converting to character.\\n\")\n", - " dge$samples[[col]] <- as.character(dge$samples[[col]])\n", - " }\n", - "}\n", - "\n", - "# Create design matrix with error checking\n", - "tryCatch({\n", - " design <- model.matrix(model, data=dge$samples)\n", - " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n", - "}, error = function(e) {\n", - " cat(\"Error in creating design matrix:\", e$message, \"\\n\")\n", - " cat(\"Attempting to fix model formula...\\n\")\n", - " \n", - " # Check each term in the model\n", - " all_terms <- all.vars(model)\n", - " valid_terms <- character(0)\n", - " \n", - " for (term in all_terms) {\n", - " if (term %in% colnames(dge$samples)) {\n", - " # Check if it's a factor with at least 2 levels\n", - " if (is.factor(dge$samples[[term]])) {\n", - " if (nlevels(dge$samples[[term]]) >= 2) {\n", - " valid_terms <- c(valid_terms, term)\n", - " } else {\n", - " cat(\"Skipping factor\", term, \"with only\", nlevels(dge$samples[[term]]), \"level\\n\")\n", - " }\n", - " } else {\n", - " # Non-factor variables are fine\n", - " valid_terms <- c(valid_terms, term)\n", - " }\n", - " } else {\n", - " cat(\"Variable\", term, \"not found in sample data\\n\")\n", - " }\n", - " }\n", - " \n", - " # Create a simplified model with valid terms\n", - " if (length(valid_terms) > 0) {\n", - " model_str <- paste(\"~\", paste(valid_terms, collapse = \" + \"))\n", - " model <- as.formula(model_str)\n", - " cat(\"New model formula:\", model_str, \"\\n\")\n", - " design <- model.matrix(model, data=dge$samples)\n", - " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n", - " } else {\n", - " stop(\"Could not create a valid model with the available variables\")\n", - " }\n", - "})\n", - "\n", - "# Check if the design matrix is full rank\n", - "if (!is.fullrank(design)) {\n", - " cat(\"Design matrix is not full 
rank. Adjusting...\\n\")\n", - " # Find and remove the problematic columns\n", - " qr_res <- qr(design)\n", - " design <- design[, qr_res$pivot[1:qr_res$rank]]\n", - " cat(\"Adjusted design matrix columns:\", ncol(design), \"\\n\")\n", - "}\n", - "\n", - "# Run voom and fit model\n", - "v <- voom(dge, design, plot=FALSE) #logCPM\n", - "fit <- lmFit(v, design)\n", - "fit <- eBayes(fit)\n", - "\n", - "# Calculate offset and residuals\n", - "cat(\"Calculating offsets and residuals...\\n\")\n", - "offset <- predictOffset(fit)\n", - "resids <- residuals(fit, y=v)\n", - "\n", - "# Verify dimensions\n", - "stopifnot(all(rownames(offset) == rownames(resids)) &\n", - " all(colnames(offset) == colnames(resids)))\n", - "\n", - "# Final adjusted data\n", - "stopifnot(all(dim(offset) == dim(resids)))\n", - "stopifnot(all(colnames(offset) == colnames(resids)))\n", - "\n", - "final_data <- offset + resids" - ] - }, - { - "cell_type": "markdown", - "id": "fac57ec8-7559-4c60-94f4-73d190a2f11a", - "metadata": {}, - "source": [ - "#### Save results" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": "6f3dbbf0-acdf-4257-8dad-529727dac1d2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Processing completed. Results and documentation saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/ \n" - ] - } - ], - "source": [ - "# Save results\n", - "saveRDS(list(\n", - " dge = dge,\n", - " offset = offset,\n", - " residuals = resids,\n", - " final_data = final_data,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = model\n", - "), file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_results.rds\"))\n", - "\n", - "# Write final residual data to file\n", - "write.table(final_data,\n", - " file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_residuals.txt\"), \n", - " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n", - "\n", - "# Write summary statistics\n", - "sink(file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_summary.txt\"))\n", - "cat(\"*** Processing Summary for\", celltype, \"***\\n\\n\")\n", - "cat(\"Original peak count:\", length(peak_id), \"\\n\")\n", - "cat(\"Peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n", - "cat(\"Peaks after expression filtering:\", nrow(dge), \"\\n\\n\")\n", - "cat(\"Number of samples:\", ncol(dge), \"\\n\")\n", - "cat(\"Number of samples after nuclei (>20) filtering:\", ncol(peak_matrix), \"\\n\")\n", - "cat(\"\\nTechnical Variables Used:\\n\")\n", - "cat(\"- log_n_nuclei: Log-transformed number of nuclei per sample\\n\")\n", - "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - "cat(\"- med_tss_enrich: Median TSS enrichment\\n\")\n", - "cat(\"- log_med_n_tot_fragment: Log-transformed median number of total fragments\\n\")\n", - "cat(\"- log_total_unique_peaks: Log-transformed count of unique peaks per sample\\n\")\n", - "cat(\"- sequencingBatch_factor: Sequencing batch ID\\n\")\n", - "cat(\"- Library_factor: Library ID\\n\")\n", - "cat(\"\\nOther Variables Used:\\n\")\n", - "cat(\"- pmi: Post-mortem interval\\n\")\n", - "cat(\"- study: Study cohort\\n\")\n", - "sink()\n", - "\n", - "# Write an additional explanation file about the variables and log transformation\n", - "sink(file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_variable_explanation.txt\"))\n", - "cat(\"# ATAC-seq Technical Variables 
Explanation\\n\\n\")\n", - "\n", - "\n", - "cat(\"## Why Log Transformation?\\n\")\n", - "cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", - "cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", - "cat(\"2. To stabilize variance across the range of values\\n\")\n", - "cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", - "cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", - "\n", - "cat(\"## Variables and Their Meanings\\n\\n\")\n", - "\n", - "cat(\"### Technical Variables\\n\")\n", - "cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", - "cat(\" * Filtered to include only samples with >20 nuclei\\n\")\n", - "cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", - "\n", - "cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", - "cat(\" * Represents sequencing depth\\n\")\n", - "cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", - "\n", - "cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", - "cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", - "\n", - "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - "cat(\" * Measures the degree of nucleosome positioning\\n\")\n", - "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n", - "\n", - "cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", - "cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", - "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n", - "\n", - "\n", - "cat(\"- sequencingBatch: Batch ID for the sequencing run\\n\")\n", - "cat(\" * Treated as a factor to account for batch effects\\n\\n\")\n", - "\n", - "cat(\"- Library: Library preparation batch ID\\n\")\n", - "cat(\" * Treated as a factor to account for library preparation effects\\n\\n\")\n", - "\n", - "cat(\"### Other Variables\\n\")\n", - "cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", - "cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", - "\n", - "cat(\"## Relationship to voom Transformation\\n\")\n", - "cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", - "cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", - "cat(\"covariates, we ensure they're on a similar scale to the transformed expression data, \")\n", - "cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", - "sink()\n", - "\n", - "cat(\"Processing completed. 
Results and documentation saved to:\", paste0(output_dir, \"/2_residuals/\", celltype, \"/\"), \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "177f20f0-2d2e-4674-9894-15434681504d", - "metadata": {}, - "source": [ - "## Step 2: Phenotype Reformat\n", - "**Purpose:** Converts covariate-adjusted residuals from Step 1 into genome-wide BED format suitable for QTL mapping tools (FastQTL, TensorQTL, MatrixEQTL).\n", - "\n", - "---\n", - "\n", - "#### Input:\n", - "\n", - "**From Step 1 (required):**\n", - "- `{celltype}_residuals.txt` (in `output/2_residuals/{celltype}/`)\n", - "\n", - "**Cell Types:**\n", - "- `Mic` (Microglia)\n", - "- `Astro` (Astrocytes)\n", - "- `Oligo` (Oligodendrocytes)\n", - "- `Ex` (Excitatory neurons)\n", - "- `In` (Inhibitory neurons)\n", - "- `OPC` (Oligodendrocyte precursor cells)\n", - "\n", - "\n", - "#### Process:\n", - "\n", - "1. Set Cell Type and Paths\n", - "2. Load residuals file\n", - "3. Extract and parse peak IDs\n", - "4. Convert to Midpoint Coordinates\n", - "5. Create BED format\n", - "6. Sort by genomic position\n", - "7. Write BED file\n", - "8. Compress with bgzip\n", - "\n", - "#### Output:\n", - "Output Directory: `output/3_phenotype_reformatting/{celltype}/`\n", - "\n", - "Output File: `{celltype}_kellis_xiong_snatac_phenotype.bed.gz`" - ] - }, - { - "cell_type": "markdown", - "id": "55814e99-5baf-4a24-b185-7ecfd2327ed8", - "metadata": {}, - "source": [ - "#### Load libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "858a632a-3ba8-4791-a2d5-b92110dc8ce3", - "metadata": {}, - "outputs": [], - "source": [ - "library(data.table)\n", - "library(stringr)" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "id": "bf1b630b-20f7-43f2-973a-28dbee1acc61", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column names from first line: SM-CTECR, SM-CJK5G, SM-CJEKQ, SM-CJGGY, SM-CJK3S, SM-CTEGU ...\n" - ] - } - ], - "source": [ - "#!/usr/bin/env Rscript\n", - "\n", - "# Script to reformat ATAC-seq residuals into BED format and compress with bgzip\n", - "# Usage: Rscript reformat_residuals.R [celltype]\n", - "\n", - "# Get command line arguments\n", - "#args <- commandArgs(trailingOnly = TRUE)\n", - "#if (length(args) < 1) {\n", - "# celltype <- \"Ex\" # Default cell type\n", - "# cat(\"No cell type specified, using default:\", celltype, \"\\n\")\n", - "#} else {\n", - "# celltype <- args[1]\n", - "# cat(\"Processing cell type:\", celltype, \"\\n\")\n", - "#}\n", - "\n", - "# Define input and output paths\n", - "#input_dir <- \"/home/al4225/project/kellis_snatac/output/xiong/2_residuals\"\n", - "#output_dir <- \"/home/al4225/project/kellis_snatac/output/3_phenotype_processing\"\n", - "pheno_reformat_output_dir <- paste0(output_dir, \"/3_phenotype_reformatting/\", celltype)\n", - "\n", - "# Create output directory if it doesn't exist\n", - "dir.create(pheno_reformat_output_dir, recursive = TRUE, showWarnings = FALSE)\n", - "\n", - "# Check if input directory exists\n", - "celltype_dir <- paste0(output_dir,\"/2_residuals/\", celltype)\n", - "if (!dir.exists(celltype_dir)) {\n", - " cat(\"Cell type directory not found:\", celltype_dir, \"\\n\")\n", - " cat(\"Using backup directory...\\n\")\n", - " celltype_dir <- file.path(output_dir,paste0(\"2_residuals/backup/\", celltype))\n", - " if (!dir.exists(celltype_dir)) {\n", - " dir.create(celltype_dir, recursive = TRUE)\n", - " stop(\"Backup directory not found either: \", celltype_dir)\n", - " }\n", - "}\n", - "\n", 
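- "# (Illustration only -- an addition, not part of the original script.) The cells\n", - "# below turn each peak ID of the form chr-start-end into a 1 bp midpoint interval,\n", - "# e.g. for the peak chr1-817077-817577 seen earlier in this notebook:\n", - "example_peak <- \"chr1-817077-817577\"\n", - "coords <- as.numeric(strsplit(example_peak, \"-\")[[1]][2:3])\n", - "mid <- as.integer(mean(coords))               # (817077 + 817577) / 2 = 817327\n", - "c(chr = \"chr1\", start = mid, end = mid + 1)   # BED row: chr1  817327  817328\n", - 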
- "input_file <- file.path(celltype_dir, paste0(celltype, \"_residuals.txt\"))\n", - "output_bed <- file.path(output_dir, paste0(\"3_phenotype_reformatting/\",celltype ,\"/\", celltype,\"_kellis_xiong_snatac_phenotype.bed\"))\n", - "\n", - "# Check if input file exists\n", - "if (!file.exists(input_file)) {\n", - " stop(\"Input file not found: \", input_file)\n", - "}\n", - "\n", - "# Read the first line manually to get the column names\n", - "first_line <- readLines(input_file, n = 1)\n", - "col_names <- unlist(strsplit(first_line, split = \"\\t\"))\n", - "cat(\"Column names from first line:\", paste(head(col_names), collapse = \", \"), \"...\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "9282f3a2-650f-4a61-abd1-5038b324cfea", - "metadata": {}, - "source": [ - "#### Load input" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "95e4ee18-4411-4bd6-9b33-6bc426d9742b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Reading residuals file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/Oligo_residuals.txt \n" - ] - } - ], - "source": [ - "cat(\"Reading residuals file:\", input_file, \"\\n\")\n", - "first_line <- readLines(input_file, n = 1)\n", - "col_names <- unlist(strsplit(first_line, split = \"\\t\"))\n", - "\n", - "residuals <- fread(input_file, header = FALSE, skip = 1)\n", - "\n", - "# Logic to handle row names/peak IDs\n", - "if (ncol(residuals) > length(col_names)) {\n", - " peak_ids <- residuals[[1]]\n", - " residuals <- residuals[, -1, with = FALSE]\n", - " setnames(residuals, col_names)\n", - "} else {\n", - " peak_ids <- residuals[[1]]\n", - " residuals <- residuals[, -1, with = FALSE]\n", - " setnames(residuals, col_names[-1]) # Adjusting for leading empty/ID column\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "5fe44c41-8fde-4107-80fa-de5823e3f0ab", - "metadata": {}, - "source": [ - "#### Coordinate Parsing (BED format)" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "34037cc5-ad0e-48c7-b528-f67ecbc0bec7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Parsing peak IDs into BED format with midpoint coordinates\n" - ] - } - ], - "source": [ - "cat(\"Parsing peak IDs into BED format with midpoint coordinates\\n\")\n", - "parts <- strsplit(peak_ids, \"-\")\n", - "chrs <- sapply(parts, `[`, 1)\n", - "starts_raw <- as.numeric(sapply(parts, `[`, 2))\n", - "ends_raw <- as.numeric(sapply(parts, `[`, 3))\n", - "\n", - "# Calculate midpoints for a 1bp window (Standard for QTLtools)\n", - "# This centers the peak signal on a single genomic coordinate\n", - "mids <- as.integer((starts_raw + ends_raw) / 2)\n", - "\n", - "parsed_peaks <- data.table(\n", - " '#chr' = chrs,\n", - " start = mids,\n", - " end = mids + 1,\n", - " ID = peak_ids\n", - ")\n", - "\n", - "# Combine and Sort\n", - "bed_data <- cbind(parsed_peaks, residuals)\n", - "setorder(bed_data, '#chr', start)\n" - ] - }, - { - "cell_type": "markdown", - "id": "39221488-f744-402e-97c6-ef6f98c310e6", - "metadata": {}, - "source": [ - "#### Save and compress " - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "e09883c1-9d6e-4447-ae27-e3d668c33ef2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Writing BED file to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/3_phenotype_reformatting/Oligo/Oligo_kellis_xiong_snatac_phenotype.bed \n", - "Compressing with 
bgzip...\n", - "Process completed for Oligo \n" - ] - } - ], - "source": [ - "cat(\"Writing BED file to:\", output_bed, \"\\n\")\n", - "fwrite(bed_data, output_bed, sep = \"\\t\", col.names = TRUE, quote = FALSE)\n", - "\n", - "cat(\"Compressing with bgzip...\\n\")\n", - "system(paste(\"bgzip -f\", output_bed))\n", - "\n", - "# Highly recommended: Index for tabix\n", - "system(paste(\"tabix -p bed\", paste0(output_bed, \".gz\")))\n", - "\n", - "cat(\"Process completed for\", celltype, \"\\n\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.4.3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From c97561db09ebe50618779500a851c4daed453c00 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 19 Feb 2026 14:22:11 -0500 Subject: [PATCH 05/12] Delete code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb Replace with sos --- .../QC/kellis_atacseq_preprocessing.ipynb | 3188 ----------------- 1 file changed, 3188 deletions(-) delete mode 100644 code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb deleted file mode 100644 index 3aa205482..000000000 --- a/code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb +++ /dev/null @@ -1,3188 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", - "metadata": {}, - "source": [ - "# Kellis Lab Single-nuclei ATAC-seq Preprocessing Pipeline\n", - "\n", - "---\n", - "\n", - "### Overview\n", - "\n", - "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) data from the Kellis lab for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies. 
It processes pseudobulk peak count data across six major brain cell types with flexible workflow options depending on your analysis goals.\n", - "\n", - "**Pipeline Purpose:**\n", - "- Transform raw pseudobulk peak counts into analysis-ready formats\n", - "- Remove technical confounders and optionally biological covariates\n", - "- Generate QTL-ready phenotype files or region-specific datasets\n", - "\n", - "**Supported Cell Types:**\n", - "- **Mic** - Microglia\n", - "- **Astro** - Astrocytes\n", - "- **Oligo** - Oligodendrocytes\n", - "- **Exc** - Excitatory neurons\n", - "- **Inh** - Inhibitory neurons\n", - "- **OPC** - Oligodendrocyte precursor cells\n", - "\n", - "---\n", - "\n", - "### Workflow Structure\n", - "\n", - "This pipeline consists of two main sequential steps, plus a complete pipeline for severe batch effects.\n", - "\n", - "#### Step 1: Pseudobulk QC with batch as covariates\n", - "\n", - "**Option A: Remove Biological Covariates**\n", - "- Regresses out demographic variables (msex, age_death, pmi, study)\n", - "- **Use when:** You want to identify genetic effects independent of sex/age\n", - "- **Model includes:** technical covariates + sequencingBatch + msex + age_death + pmi + study\n", - "\n", - "**Option B: Preserve Biological Covariates**\n", - "- Regresses out only non-demographic variables (pmi, study)\n", - "- **Use when:** You want to study sex/age effects or preserve biological heterogeneity\n", - "- **Model includes:** technical covariates + sequencingBatch + pmi + study (NO msex, age_death)\n", - "\n", - "#### Step 2: Format Output\n", - "\n", - "**Format A: Phenotype Reformatting**\n", - "- Converts residuals to genome-wide BED format\n", - "- **Input:** `{celltype}_residuals.txt` (from Step 1 Option A or B)\n", - "- **Use for:** FastQTL, TensorQTL, MatrixEQTL (genome-wide caQTL mapping)\n", - "\n", - "**Format B: Region Peak Filtering**\n", - "- Filters to specific genomic regions (chr7: 28-28.3 Mb, chr11: 85.05-86.2 Mb)\n", - "- **Input:** `{celltype}_filtered_raw_counts.txt` (only from Step 1 Option B)\n", - "- **Use for:** Hypothesis-driven locus analysis, region-specific comparisons\n", - "\n", - "#### Alternative Pseudobulk Pipeline: Explicit Batch Correction (Multiome Dataset)\n", - "- Complete standalone pipeline with explicit batch correction using limma's `removeBatchEffect` or ComBat-seq\n", - "- **Input:** Qc'ed Seurat object`{celltype}_qced.rds` and pseudobulk peak counts `{celltype}.rds`\n", - "- **Use when:** Strong batch effects visible in PCA/t-SNE, many small fragmented batches, batch confounds with biology\n", - "- **Note:** From different dataset (multiome) but demonstrates alternative batch correction approach\n", - "\n", - "---\n", - "\n", - "### Key Features:\n", - "- Blacklist region filtering (ENCODE hg38)\n", - "- Technical QC covariate adjustment (TSS enrichment, nucleosome signal, sequencing depth)\n", - "- TMM normalization and expression filtering\n", - "- Log-transformation of count-based covariates\n", - "- Flexible batch handling (covariate vs explicit correction)\n", - "\n", - "#### Pipeline Outputs:\n", - "\n", - "**From Step 1:**\n", - "- `{celltype}_residuals.txt`: Covariate-adjusted residuals (log2-CPM scale)\n", - "- `{celltype}_results.rds`: Complete analysis results\n", - "- `{celltype}_summary.txt`: QC summary and filtering statistics\n", - "- `{celltype}_variable_explanation.txt`: Covariate documentation (Option A only)\n", - "- `{celltype}_filtered_raw_counts.txt`: TMM-normalized counts (Option B only)\n", - "\n", 
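The Step 1 files above are what the later steps consume; below is a minimal sketch of loading them for a quick sanity check, assuming the default `output/2_residuals/{celltype}/` layout and using Astro as the example cell type:

```r
# Minimal sketch (paths assumed, relative to the configured output_dir)
res <- readRDS("output/2_residuals/Astro/Astro_results.rds")
str(res, max.level = 1)   # named list: dge, offset, residuals, final_data, valid_samples, design, fit, model

resid <- read.table("output/2_residuals/Astro/Astro_residuals.txt",
                    header = TRUE, sep = "\t", check.names = FALSE)
dim(resid)                # peaks x samples, covariate-adjusted log2-CPM values
```

`{celltype}_residuals.txt` feeds Format A below, while `{celltype}_filtered_raw_counts.txt` feeds Format B.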
- "**From Step 2, Format A:**\n", - "- `{celltype}_kellis_snatac_phenotype.bed.gz`: Genome-wide QTL-ready BED file\n", - "\n", - "**From Step 2, Format B:**\n", - "- `{celltype}_filtered_regions_of_interest.txt`: Region-specific count data (chr7, chr11)\n", - "- `{celltype}_filtered_regions_of_interest_summary.txt`: Peak metadata and statistics\n", - "\n", - "**From Alternative Pseudobulk Pipeline: Multiome with Batch Correction:**\n", - "- `{celltype}_residuals.txt`: Batch-corrected residuals (log2-CPM scale)\n", - "- `{celltype}_results.rds`: Complete results with batch_adjusted_counts\n", - "\n", - "---\n", - "\n", - "### Input Files\n", - "Input files needed to run this pipeline can be downloaded [here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link)." - ] - }, - { - "cell_type": "markdown", - "id": "5476354a-a9b1-45c4-bd41-010551ca96f1", - "metadata": {}, - "source": [ - "#### Before you start, let's set up your working path." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "955bda26-9f91-41bb-adb7-c09fbf361c5e", - "metadata": {}, - "outputs": [], - "source": [ - "input_dir <- \"/restricted/projectnb/xqtl/jaempawi/atac_seq/kellis_data\" #set your input directory\n", - "output_dir <- \"/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis\" #set your output directory" - ] - }, - { - "cell_type": "markdown", - "id": "5540a4da-843a-4789-8123-47911cf519c5", - "metadata": {}, - "source": [ - "## Step 1: Pseudobulk QC with batch as covariates\n", - "\n", - "This preprocessing workflow offers **two approaches** depending on whether you want to regress out biological covariates:\n", - "\n", - "---\n", - "### Option A: Pseudobulk QC WITH Biological Variation(Standard QTL Analysis)\n", - "\n", - "Use this option when you want residuals adjusted for all technical AND biological covariates (sex, age, PMI).\n", - "\n", - "**Input:**\n", - "- Pseudobulk peak counts (in 1_files_with_sampleid folder): `pseudobulk_peaks_counts{celltype}_50nuc.csv.gz`\n", - "- Cell metadata (in 1_files_with_sampleid folder): `metadata_{celltype}_50nuc.csv`\n", - "- Sample covariates: `rosmap_cov.txt`\n", - "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n", - "\n", - "**Process:**\n", - "1. Loads pseudobulk peak count matrix and metadata per cell type\n", - "2. Calculates technical QC metrics per sample:\n", - " - `log_n_nuclei`: Log-transformed number of nuclei\n", - " - `med_nucleosome_signal`: Median nucleosome signal\n", - " - `med_tss_enrich`: Median TSS enrichment score\n", - " - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)\n", - " - `log_total_unique_peaks`: Log-transformed count of unique peaks detected\n", - "3. Filters blacklisted genomic regions using `foverlaps()`\n", - "4. Merges with demographic covariates (msex, age_death, pmi, study)\n", - "5. Applies expression filtering with `filterByExpr()`:\n", - " - `min.count = 2`: Minimum 2 reads in at least one sample\n", - " - `min.total.count = 15`: Minimum 15 total reads across all samples\n", - " - `min.prop = 0.1`: Peak must be expressed in ≥10% of samples\n", - "6. TMM normalization with `calcNormFactors()`\n", - "7. Handles sequencingBatch as a covariate (not batch-corrected)\n", - "8. 
Fits linear model using `voom()` and `lmFit()`:\n", - "\n", - " ```r\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch_factor + msex + age_death + pmi + study \n", - " ```\n", - "\n", - "9. Calculates residuals adjusted for ALL covariates (technical + biological)\n", - "10. Computes final adjusted data using predictOffset(): offset + residuals\n", - "- `offset`: Predicted expression at median/reference covariate values\n", - "- `residuals`: Unexplained variation after removing all covariate effects\n", - "\n", - "**Output:** `output/2_residuals/{celltype}/`\n", - "\n", - "- `{celltype}_residuals.txt`: Final covariate-adjusted peak accessibility (log2-CPM scale)\n", - "- `{celltype}_results.rds`: Complete analysis results (DGEList, fit object, design matrix)\n", - "- `{celltype}_summary.txt`: Filtering statistics and QC summary\n", - "- `{celltype}_variable_explanation.txt`: Detailed covariate documentation\n", - "\n", - "**Key Variables Regressed Out**:\n", - "\n", - "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n", - "- Biological: sex (msex), age at death (age_death), post-mortem interval (pmi), study cohort\n" - ] - }, - { - "cell_type": "markdown", - "id": "a58dfe97-3e57-4ce9-b8bb-009aec26b1a5", - "metadata": {}, - "source": [ - "#### Load libaries" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "77deb405-f916-42e5-a74a-c3569d587cbf", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\n", - "Attaching package: ‘dplyr’\n", - "\n", - "\n", - "The following objects are masked from ‘package:data.table’:\n", - "\n", - " between, first, last\n", - "\n", - "\n", - "The following objects are masked from ‘package:stats’:\n", - "\n", - " filter, lag\n", - "\n", - "\n", - "The following objects are masked from ‘package:base’:\n", - "\n", - " intersect, setdiff, setequal, union\n", - "\n", - "\n", - "Loading required package: limma\n", - "\n" - ] - } - ], - "source": [ - "library(data.table)\n", - "library(stringr)\n", - "library(dplyr)\n", - "library(edgeR)" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "5f5d8a77-91c8-4808-94cf-bc576378556c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Processing celltype: Astro \n" - ] - } - ], - "source": [ - "# Set cell type and create output directory\n", - "celltype <- \"Astro\" # Change this for different cell types eg. 
Exc, Inh, Mic, Oligo, OPC\n", - "cat(\"Processing celltype:\", celltype, \"\\n\")\n", - "\n", - "out_dir <- paste0(file.path(output_dir,\"2_residuals/\", celltype))\n", - "dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)\n" - ] - }, - { - "cell_type": "markdown", - "id": "3ed15afb-f621-4dd3-be00-c15dd736835b", - "metadata": {}, - "source": [ - "#### Create predictOffset function " - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "823abb05-f105-4f40-918a-5c470a04ffb9", - "metadata": {}, - "outputs": [], - "source": [ - "predictOffset <- function(fit) {\n", - " # Define which variables are factors and which are continuous\n", - " usedFactors <- c(\"sequencingBatch\", \"study\") \n", - " usedContinuous <- c(\"log_n_nuclei\", \"med_nucleosome_signal\", \"med_tss_enrich\", \"log_med_n_tot_fragment\",\n", - " \"log_total_unique_peaks\", \"msex\", \"age_death\", \"pmi\")\n", - " \n", - " # Filter to only use variables actually in the design matrix\n", - " usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n", - " usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n", - " \n", - " # Get indices for factor and continuous variables\n", - " facInd <- unlist(lapply(as.list(usedFactors), \n", - " function(f) {return(grep(paste0(\"^\", f), \n", - " colnames(fit$design)))}))\n", - " contInd <- unlist(lapply(as.list(usedContinuous), \n", - " function(f) {return(grep(paste0(\"^\", f), \n", - " colnames(fit$design)))}))\n", - " \n", - " # Add the intercept\n", - " all_indices <- c(1, facInd, contInd)\n", - " \n", - " # Verify design matrix structure (using sorted indices to avoid duplication warning)\n", - " all_indices_sorted <- sort(unique(all_indices))\n", - " stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))\n", - " \n", - " # Create new design matrix with median values\n", - " D <- fit$design\n", - " D[, facInd] <- 0 # Set all factor levels to reference level\n", - " \n", - " # For continuous variables, set to median value\n", - " if (length(contInd) > 0) {\n", - " medContVals <- apply(D[, contInd, drop=FALSE], 2, median)\n", - " for (i in 1:length(medContVals)) {\n", - " D[, names(medContVals)[i]] <- medContVals[i]\n", - " }\n", - " }\n", - " \n", - " # Calculate offsets\n", - " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n", - " offsets <- apply(coefficients(fit), 1, function(c) {\n", - " return(D %*% c)\n", - " })\n", - " offsets <- t(offsets)\n", - " colnames(offsets) <- rownames(fit$design)\n", - " \n", - " return(offsets)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "5d79ceae-e255-4a39-a288-12626481b0ac", - "metadata": {}, - "source": [ - "#### Load input" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "46927164-2761-490f-afc2-86181e917a49", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded metadata with 82 samples and peak data with 531489 peaks\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\t\n", - "\t\n", - "\n", - "\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\n", - "
\n" - ], - "text/latex": [ - "A data.table: 6 × 9\n", - "\\begin{tabular}{lllllllll}\n", - " individualID & sampleid & sequencingBatch & main\\_cell\\_type & avg.pct.read.in.peak.ct & med.nucleosome\\_signal.ct & med.n\\_tot\\_fragment.ct & med.tss.enrich.ct & n.nuclei\\\\\n", - " & & & & & & & & \\\\\n", - "\\hline\n", - "\t R1042011 & SM-CJK5G & 191203Kel & Astro & 0.3939189 & 0.7894187 & 10923.00 & 0.3762771 & 409\\\\\n", - "\t R1154454 & SM-CTDQN & 191203Kel & Astro & 0.2557693 & 0.7786428 & 23144.00 & 0.2516681 & 144\\\\\n", - "\t R1213305 & SM-CJEIE & 191203Kel & Astro & 0.3277831 & 0.8077042 & 16094.78 & 0.2896403 & 630\\\\\n", - "\t R1407047 & SM-CTEM5 & 191203Kel & Astro & 0.3361316 & 0.8275109 & 59451.00 & 0.3266785 & 189\\\\\n", - "\t R1609849 & SM-CJJ27 & 191203Kel & Astro & 0.2857020 & 0.7868788 & 7522.00 & 0.2688059 & 186\\\\\n", - "\t R1617674 & SM-CJIWT & 191203Kel & Astro & 0.1934420 & 0.7879911 & 33724.00 & 0.1702281 & 141\\\\\n", - "\\end{tabular}\n" - ], - "text/markdown": [ - "\n", - "A data.table: 6 × 9\n", - "\n", - "| individualID <chr> | sampleid <chr> | sequencingBatch <chr> | main_cell_type <chr> | avg.pct.read.in.peak.ct <dbl> | med.nucleosome_signal.ct <dbl> | med.n_tot_fragment.ct <dbl> | med.tss.enrich.ct <dbl> | n.nuclei <int> |\n", - "|---|---|---|---|---|---|---|---|---|\n", - "| R1042011 | SM-CJK5G | 191203Kel | Astro | 0.3939189 | 0.7894187 | 10923.00 | 0.3762771 | 409 |\n", - "| R1154454 | SM-CTDQN | 191203Kel | Astro | 0.2557693 | 0.7786428 | 23144.00 | 0.2516681 | 144 |\n", - "| R1213305 | SM-CJEIE | 191203Kel | Astro | 0.3277831 | 0.8077042 | 16094.78 | 0.2896403 | 630 |\n", - "| R1407047 | SM-CTEM5 | 191203Kel | Astro | 0.3361316 | 0.8275109 | 59451.00 | 0.3266785 | 189 |\n", - "| R1609849 | SM-CJJ27 | 191203Kel | Astro | 0.2857020 | 0.7868788 | 7522.00 | 0.2688059 | 186 |\n", - "| R1617674 | SM-CJIWT | 191203Kel | Astro | 0.1934420 | 0.7879911 | 33724.00 | 0.1702281 | 141 |\n", - "\n" - ], - "text/plain": [ - " individualID sampleid sequencingBatch main_cell_type avg.pct.read.in.peak.ct\n", - "1 R1042011 SM-CJK5G 191203Kel Astro 0.3939189 \n", - "2 R1154454 SM-CTDQN 191203Kel Astro 0.2557693 \n", - "3 R1213305 SM-CJEIE 191203Kel Astro 0.3277831 \n", - "4 R1407047 SM-CTEM5 191203Kel Astro 0.3361316 \n", - "5 R1609849 SM-CJJ27 191203Kel Astro 0.2857020 \n", - "6 R1617674 SM-CJIWT 191203Kel Astro 0.1934420 \n", - " med.nucleosome_signal.ct med.n_tot_fragment.ct med.tss.enrich.ct n.nuclei\n", - "1 0.7894187 10923.00 0.3762771 409 \n", - "2 0.7786428 23144.00 0.2516681 144 \n", - "3 0.8077042 16094.78 0.2896403 630 \n", - "4 0.8275109 59451.00 0.3266785 189 \n", - "5 0.7868788 7522.00 0.2688059 186 \n", - "6 0.7879911 33724.00 0.1702281 141 " - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\t\n", - "\t\n", - "\n", - "\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\n", - "
\n" - ], - "text/latex": [ - "A data.table: 6 × 82\n", - "\\begin{tabular}{lllllllllllllllllllll}\n", - " SM-CJK5G & SM-CTDQN & SM-CJEIE & SM-CTEM5 & SM-CJJ27 & SM-CJIWT & SM-CTEEG & ROS11430815 & SM-CJGLG & SM-CJIXU & ⋯ & R9395022 & SM-CJIX5 & SM-CJEGU & SM-CJIYH & SM-CJGMS & SM-CTEGU & SM-CTEFJ & SM-CJEJU & SM-CTEGT & SM-CJIZE\\\\\n", - " & & & & & & & & & & ⋯ & & & & & & & & & & \\\\\n", - "\\hline\n", - "\t 4 & 0 & 0 & 2 & 0 & 0 & 0 & 2 & 2 & 0 & ⋯ & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0\\\\\n", - "\t 20 & 12 & 45 & 36 & 1 & 7 & 9 & 30 & 16 & 5 & ⋯ & 13 & 5 & 7 & 10 & 3 & 6 & 11 & 10 & 18 & 5\\\\\n", - "\t 8 & 1 & 3 & 6 & 0 & 6 & 11 & 1 & 1 & 3 & ⋯ & 5 & 2 & 0 & 1 & 3 & 0 & 2 & 3 & 5 & 4\\\\\n", - "\t 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & ⋯ & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\\n", - "\t 15 & 4 & 15 & 9 & 2 & 3 & 16 & 8 & 3 & 5 & ⋯ & 5 & 6 & 5 & 5 & 5 & 2 & 6 & 12 & 7 & 7\\\\\n", - "\t 33 & 4 & 55 & 70 & 4 & 10 & 26 & 21 & 22 & 5 & ⋯ & 30 & 15 & 8 & 21 & 5 & 20 & 35 & 26 & 48 & 6\\\\\n", - "\\end{tabular}\n" - ], - "text/markdown": [ - "\n", - "A data.table: 6 × 82\n", - "\n", - "| SM-CJK5G <int> | SM-CTDQN <int> | SM-CJEIE <int> | SM-CTEM5 <int> | SM-CJJ27 <int> | SM-CJIWT <int> | SM-CTEEG <int> | ROS11430815 <int> | SM-CJGLG <int> | SM-CJIXU <int> | ⋯ ⋯ | R9395022 <int> | SM-CJIX5 <int> | SM-CJEGU <int> | SM-CJIYH <int> | SM-CJGMS <int> | SM-CTEGU <int> | SM-CTEFJ <int> | SM-CJEJU <int> | SM-CTEGT <int> | SM-CJIZE <int> |\n", - "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", - "| 4 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 2 | 0 | ⋯ | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |\n", - "| 20 | 12 | 45 | 36 | 1 | 7 | 9 | 30 | 16 | 5 | ⋯ | 13 | 5 | 7 | 10 | 3 | 6 | 11 | 10 | 18 | 5 |\n", - "| 8 | 1 | 3 | 6 | 0 | 6 | 11 | 1 | 1 | 3 | ⋯ | 5 | 2 | 0 | 1 | 3 | 0 | 2 | 3 | 5 | 4 |\n", - "| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n", - "| 15 | 4 | 15 | 9 | 2 | 3 | 16 | 8 | 3 | 5 | ⋯ | 5 | 6 | 5 | 5 | 5 | 2 | 6 | 12 | 7 | 7 |\n", - "| 33 | 4 | 55 | 70 | 4 | 10 | 26 | 21 | 22 | 5 | ⋯ | 30 | 15 | 8 | 21 | 5 | 20 | 35 | 26 | 48 | 6 |\n", - "\n" - ], - "text/plain": [ - " SM-CJK5G SM-CTDQN SM-CJEIE SM-CTEM5 SM-CJJ27 SM-CJIWT SM-CTEEG ROS11430815\n", - "1 4 0 0 2 0 0 0 2 \n", - "2 20 12 45 36 1 7 9 30 \n", - "3 8 1 3 6 0 6 11 1 \n", - "4 0 0 0 0 0 0 1 0 \n", - "5 15 4 15 9 2 3 16 8 \n", - "6 33 4 55 70 4 10 26 21 \n", - " SM-CJGLG SM-CJIXU ⋯ R9395022 SM-CJIX5 SM-CJEGU SM-CJIYH SM-CJGMS SM-CTEGU\n", - "1 2 0 ⋯ 1 0 0 0 1 0 \n", - "2 16 5 ⋯ 13 5 7 10 3 6 \n", - "3 1 3 ⋯ 5 2 0 1 3 0 \n", - "4 0 0 ⋯ 0 0 0 0 0 0 \n", - "5 3 5 ⋯ 5 6 5 5 5 2 \n", - "6 22 5 ⋯ 30 15 8 21 5 20 \n", - " SM-CTEFJ SM-CJEJU SM-CTEGT SM-CJIZE\n", - "1 0 1 0 0 \n", - "2 11 10 18 5 \n", - "3 2 3 5 4 \n", - "4 0 0 0 0 \n", - "5 6 12 7 7 \n", - "6 35 26 48 6 " - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "meta <- fread(file.path(input_dir, \"1_files_with_sampleid\", paste0(\"metadata_\", celltype, \"_50nuc.csv\")))\n", - "peak_data <- fread(file.path(input_dir, \"1_files_with_sampleid\", paste0(\"pseudobulk_peaks_counts\", celltype, \"_50nuc.csv.gz\")))\n", - "\n", - "cat(\"Loaded metadata with\", nrow(meta), \"samples and peak data with\", nrow(peak_data), \"peaks\\n\")\n", - "\n", - "# Extract peak_id and set as rownames\n", - "peak_id <- peak_data$peak_id\n", - "peak_data <- peak_data[, -1, with = FALSE] # Remove peak_id column\n", - "peak_matrix <- as.matrix(peak_data)\n", - "rownames(peak_matrix) <- peak_id\n", - 
"\n", - "head(meta)\n", - "head(peak_data)" - ] - }, - { - "cell_type": "markdown", - "id": "785bc2c7-8940-47c8-8dd4-769ab2c29f27", - "metadata": {}, - "source": [ - "#### Process technical variables from meta data\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "a6714741-5c18-47ed-a0f5-c6472120ea3a", - "metadata": {}, - "outputs": [], - "source": [ - "# Column name normalization (for easier handling)\n", - "meta_clean <- meta %>%\n", - " rename(\n", - " med_nucleosome_signal = med.nucleosome_signal.ct,\n", - " med_tss_enrich = med.tss.enrich.ct,\n", - " med_n_tot_fragment = med.n_tot_fragment.ct,\n", - " n_nuclei = n.nuclei\n", - " )\n", - "\n", - "# Calculate peak metrics - total unique peaks per sample\n", - "peak_metrics <- data.frame(\n", - " sampleid = colnames(peak_matrix),\n", - " total_unique_peaks = colSums(peak_matrix > 0)\n", - ") %>%\n", - " mutate(log_total_unique_peaks = log(total_unique_peaks + 1))" - ] - }, - { - "cell_type": "markdown", - "id": "15031ec1-8106-45ce-9056-7ae771f2468e", - "metadata": {}, - "source": [ - "#### Process peaks" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "06ee1c4e-7b39-4ba6-ab07-f7395de638dd", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Sample of peak coordinates:\n", - " peak_name chr start end\n", - " \n", - "1: chr1-181293-181565 chr1 181293 181565\n", - "2: chr1-190726-191626 chr1 190726 191626\n", - "3: chr1-629712-630662 chr1 629712 630662\n", - "4: chr1-631261-631470 chr1 631261 631470\n", - "5: chr1-633891-634506 chr1 633891 634506\n", - "6: chr1-777873-779958 chr1 777873 779958\n", - "Number of blacklisted peaks: 2354 \n", - "Number of peaks after blacklist filtering: 529135 \n" - ] - } - ], - "source": [ - "# Process peak coordinates\n", - "peak_df <- data.table(\n", - " peak_name = peak_id,\n", - " chr = sapply(strsplit(peak_id, \"-\"), `[`, 1),\n", - " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n", - " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3)),\n", - " stringsAsFactors = FALSE\n", - ")\n", - "\n", - "# Verify peak coordinates were extracted correctly\n", - "cat(\"Sample of peak coordinates:\\n\")\n", - "print(head(peak_df))\n", - "\n", - "# Load blacklist\n", - "blacklist_file <- file.path(input_dir,\"hg38-blacklist.v2.bed.gz\")\n", - "if (file.exists(blacklist_file)) {\n", - " blacklist_df <- fread(blacklist_file)\n", - " if (ncol(blacklist_df) >= 4) {\n", - " colnames(blacklist_df)[1:4] <- c(\"chr\", \"start\", \"end\", \"label\")\n", - " } else {\n", - " colnames(blacklist_df)[1:3] <- c(\"chr\", \"start\", \"end\")\n", - " }\n", - " \n", - " # Filter blacklisted peaks\n", - " setkey(blacklist_df, chr, start, end)\n", - " setkey(peak_df, chr, start, end)\n", - " overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n", - " blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n", - " cat(\"Number of blacklisted peaks:\", length(blacklisted_peaks), \"\\n\")\n", - " \n", - " filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)\n", - " filtered_peak <- peak_matrix[filtered_peak_idx, ]\n", - " cat(\"Number of peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n", - "} else {\n", - " cat(\"Warning: Blacklist file not found at\", blacklist_file, \"\\n\")\n", - " cat(\"Proceeding without blacklist filtering\\n\")\n", - " filtered_peak <- peak_matrix\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "14144ad5-10bf-4475-9e60-370b48550fd1", - "metadata": {}, - 
"source": [ - "#### Load covariates" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "1421c90d-6b16-40ff-a0c0-7b7c60a20d0c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Variable statistics before and after log transformation:\n", - "n_nuclei: min=56.00, median=227.00, max=1293.00, SD=193.79\n", - "log_n_nuclei: min=4.03, median=5.42, max=7.16, SD=0.64\n", - "med_n_tot_fragment: min=2890.00, median=20306.00, max=73185.00, SD=15906.37\n", - "log_med_n_tot_fragment: min=7.97, median=9.92, max=11.20, SD=0.66\n", - "Number of samples after joining: 76 \n", - "Sample IDs: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT ...\n", - "Available covariates: sampleid, individualID, sequencingBatch, main_cell_type, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, n_nuclei, total_unique_peaks, log_total_unique_peaks, msex, age_death, pmi, study, log_n_nuclei, log_med_n_tot_fragment \n" - ] - } - ], - "source": [ - "covariates_file <- file.path(input_dir,\"rosmap_cov.txt\")\n", - "if (file.exists(covariates_file)) {\n", - " covariates <- fread(covariates_file)\n", - " # Check column names and adjust if needed\n", - " if ('#id' %in% colnames(covariates)) {\n", - " id_col <- '#id'\n", - " } else if ('individualID' %in% colnames(covariates)) {\n", - " id_col <- 'individualID'\n", - " } else {\n", - " cat(\"Warning: Could not identify ID column in covariates file. Available columns:\", \n", - " paste(colnames(covariates), collapse=\", \"), \"\\n\")\n", - " id_col <- colnames(covariates)[1]\n", - " cat(\"Using\", id_col, \"as ID column\\n\")\n", - " }\n", - " \n", - " # Select relevant columns\n", - " cov_cols <- intersect(c(id_col, 'msex', 'age_death', 'pmi', 'study'), colnames(covariates))\n", - " covariates <- covariates[, ..cov_cols]\n", - " \n", - " # Merge with metadata\n", - " meta_with_ind <- meta_clean %>%\n", - " select(sampleid, everything())\n", - " \n", - " all_covs <- meta_with_ind %>%\n", - " inner_join(peak_metrics, by = \"sampleid\") %>%\n", - " inner_join(covariates, by = setNames(id_col, \"sampleid\"))\n", - " \n", - " # Impute missing values\n", - " for (col in c(\"pmi\", \"age_death\")) {\n", - " if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {\n", - " cat(\"Imputing missing values for\", col, \"\\n\")\n", - " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n", - " }\n", - " }\n", - "} else {\n", - " cat(\"Warning: Covariates file\", covariates_file, \"not found.\\n\")\n", - " cat(\"Proceeding with only technical variables.\\n\")\n", - " all_covs <- meta_clean %>%\n", - " inner_join(peak_metrics, by = \"sampleid\")\n", - "}\n", - "\n", - "\n", - "# Perform log transformations on necessary variables\n", - "# Add a small constant to avoid log(0)\n", - "epsilon <- 1e-6\n", - "\n", - "all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)\n", - "all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)\n", - "\n", - "# Show distribution of original and log-transformed variables\n", - "cat(\"\\nVariable statistics before and after log transformation:\\n\")\n", - "for (var in c(\"n_nuclei\", \"med_n_tot_fragment\")) {\n", - " orig_var <- all_covs[[var]]\n", - " log_var <- all_covs[[paste0(\"log_\", var)]]\n", - " \n", - " cat(sprintf(\"%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n", - " var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))\n", - " cat(sprintf(\"log_%s: 
min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n", - " var, min(log_var), median(log_var), max(log_var), sd(log_var)))\n", - "}\n", - "\n", - "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n", - "cat(\"Sample IDs:\", paste(head(all_covs$sampleid), collapse=\", \"), \"...\\n\")\n", - "cat(\"Available covariates:\", paste(colnames(all_covs), collapse=\", \"), \"\\n\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "33e8ab2a-87bb-46be-9c44-5e605b4cc179", - "metadata": {}, - "source": [ - "#### Create DGE object" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "8146b2c5-56b5-449b-b86f-cb64deed05e5", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of valid samples: 76 \n" - ] - } - ], - "source": [ - "valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)\n", - "cat(\"Number of valid samples:\", length(valid_samples), \"\\n\")\n", - "\n", - "all_covs_filtered <- all_covs[all_covs$sampleid %in% valid_samples, ]\n", - "filtered_peak_filtered <- filtered_peak[, valid_samples]\n", - "\n", - "dge <- DGEList(\n", - " counts = filtered_peak_filtered,\n", - " samples = all_covs_filtered\n", - ")\n", - "rownames(dge$samples) <- dge$samples$sampleid" - ] - }, - { - "cell_type": "markdown", - "id": "55bb8d6b-3e61-4f2b-9c29-c20d0f38663a", - "metadata": {}, - "source": [ - "#### Filter low counts and normalize" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "6862b6b6-0dfd-45f8-9d6c-c6dfca5247de", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks before filtering: 529135 \n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning message in filterByExpr.DGEList(dge, min.count = 2, min.total.count = 15, :\n", - "“All samples appear to belong to the same group.”\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks after filtering: 323638 \n" - ] - } - ], - "source": [ - "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n", - "keep <- filterByExpr(dge, \n", - " min.count = 2, # for one sample, min reads \n", - " min.total.count = 15, # min reads overall\n", - " min.prop = 0.1) \n", - "\n", - "dge <- dge[keep, , keep.lib.sizes=FALSE]\n", - "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\") #1368 in mic,2491 in Ast\n", - "dge <- calcNormFactors(dge, method=\"TMM\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "2b4d4f64-1e91-4edd-ad87-813db4f2547b", - "metadata": {}, - "source": [ - "#### Handle batch as technical variable" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "0389e3c4-75fc-4195-b775-032da343b664", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Handling sequencingBatch as a technical variable\n", - "Found 2 unique batches\n", - "Batch sizes:\n", - "batches\n", - "190820Kel 191203Kel \n", - " 4 72 \n" - ] - } - ], - "source": [ - "# We'll handle batch as a technical variable rather than doing batch adjustment\n", - "cat(\"Handling sequencingBatch as a technical variable\\n\")\n", - "\n", - "# Check batch information\n", - "batches <- dge$samples$sequencingBatch\n", - "cat(\"Found\", length(unique(batches)), \"unique batches\\n\")\n", - "\n", - "# Check batch size\n", - "batch_counts <- table(batches)\n", - "cat(\"Batch sizes:\\n\")\n", - "print(batch_counts)\n", - "\n", - "# Convert sequencingBatch to factor with at least 2 
levels\n", - "if (length(unique(batches)) < 2) {\n", - " cat(\"Only one batch found. Adding dummy batch for model compatibility.\\n\")\n", - " # Create a dummy batch factor to avoid model errors\n", - " dge$samples$sequencingBatch_factor <- factor(rep(\"batch1\", ncol(dge)))\n", - "} else {\n", - " # Use the existing batch information\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - "}\n" - ] - }, - { - "cell_type": "markdown", - "id": "1b23595b-8bd0-471b-8c11-cb0819e9055e", - "metadata": {}, - "source": [ - "#### Create model and run voom" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "efba3973-0cfc-4afd-9dcc-5842190a9995", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Using full model with demographic and technical covariates\n", - "Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch_factor + msex + age_death + pmi + study \n", - "Warning: Factor variable group has only one level. Converting to character.\n", - "Successfully created design matrix with 11 columns\n", - "Calculating offsets and residuals...\n" - ] - } - ], - "source": [ - "# Define the model based on available covariates - using log-transformed variables\n", - "if (all(c(\"msex\", \"age_death\", \"pmi\", \"study\") %in% colnames(dge$samples))) {\n", - " # Full model with all covariates\n", - " cat(\"Using full model with demographic and technical covariates\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + sequencingBatch_factor + \n", - " msex + age_death + pmi + study\n", - "} else {\n", - " # Technical variables only model\n", - " cat(\"Using model with technical covariates only\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + sequencingBatch_factor\n", - "}\n", - "\n", - "# Print the model formula\n", - "cat(\"Model formula:\", deparse(model), \"\\n\")\n", - "\n", - "# Check for factor variables with only one level\n", - "for (col in colnames(dge$samples)) {\n", - " if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {\n", - " cat(\"Warning: Factor variable\", col, \"has only one level. 
Converting to character.\\n\")\n", - " dge$samples[[col]] <- as.character(dge$samples[[col]])\n", - " }\n", - "}\n", - "\n", - "# Create design matrix with error checking\n", - "tryCatch({\n", - " design <- model.matrix(model, data=dge$samples)\n", - " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n", - "}, error = function(e) {\n", - " cat(\"Error in creating design matrix:\", e$message, \"\\n\")\n", - " cat(\"Attempting to fix model formula...\\n\")\n", - " \n", - " # Check each term in the model\n", - " all_terms <- all.vars(model)\n", - " valid_terms <- character(0)\n", - " \n", - " for (term in all_terms) {\n", - " if (term %in% colnames(dge$samples)) {\n", - " # Check if it's a factor with at least 2 levels\n", - " if (is.factor(dge$samples[[term]])) {\n", - " if (nlevels(dge$samples[[term]]) >= 2) {\n", - " valid_terms <- c(valid_terms, term)\n", - " } else {\n", - " cat(\"Skipping factor\", term, \"with only\", nlevels(dge$samples[[term]]), \"level\\n\")\n", - " }\n", - " } else {\n", - " # Non-factor variables are fine\n", - " valid_terms <- c(valid_terms, term)\n", - " }\n", - " } else {\n", - " cat(\"Variable\", term, \"not found in sample data\\n\")\n", - " }\n", - " }\n", - " \n", - " # Create a simplified model with valid terms\n", - " if (length(valid_terms) > 0) {\n", - " model_str <- paste(\"~\", paste(valid_terms, collapse = \" + \"))\n", - " model <- as.formula(model_str)\n", - " cat(\"New model formula:\", model_str, \"\\n\")\n", - " design <- model.matrix(model, data=dge$samples)\n", - " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n", - " } else {\n", - " stop(\"Could not create a valid model with the available variables\")\n", - " }\n", - "})\n", - "\n", - "# Check if the design matrix is full rank\n", - "if (!is.fullrank(design)) {\n", - " cat(\"Design matrix is not full rank. Adjusting...\\n\")\n", - " # Find and remove the problematic columns\n", - " qr_res <- qr(design)\n", - " design <- design[, qr_res$pivot[1:qr_res$rank]]\n", - " cat(\"Adjusted design matrix columns:\", ncol(design), \"\\n\")\n", - "}\n", - "\n", - "# Run voom and fit model\n", - "v <- voom(dge, design, plot=FALSE) #logCPM\n", - "fit <- lmFit(v, design)\n", - "fit <- eBayes(fit)\n", - "\n", - "# Calculate offset and residuals\n", - "cat(\"Calculating offsets and residuals...\\n\")\n", - "offset <- predictOffset(fit)\n", - "resids <- residuals(fit, y=v)\n", - "\n", - "# Verify dimensions\n", - "stopifnot(all(rownames(offset) == rownames(resids)) &\n", - " all(colnames(offset) == colnames(resids)))\n", - "\n", - "# Final adjusted data\n", - "stopifnot(all(dim(offset) == dim(resids)))\n", - "stopifnot(all(colnames(offset) == colnames(resids)))\n", - "\n", - "final_data <- offset + resids" - ] - }, - { - "cell_type": "markdown", - "id": "cbc2d0da-33f0-4d51-ae43-4de228d57873", - "metadata": {}, - "source": [ - "#### Save results" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "c0c15fea-c4d6-41a2-aa92-795b4fd0b9b7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Processing completed. 
Results and documentation saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals//Astro \n" - ] - } - ], - "source": [ - "# Save results\n", - "saveRDS(list(\n", - " dge = dge,\n", - " offset = offset,\n", - " residuals = resids,\n", - " final_data = final_data,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = model\n", - "), file = file.path(out_dir, paste0(celltype,\"_results.rds\")))\n", - "\n", - "# Write final residual data to file\n", - "write.table(final_data,\n", - " file = file.path(out_dir, paste0(celltype,\"_residuals.txt\")), \n", - " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n", - "\n", - "# Write summary statistics\n", - "sink(file = file.path(out_dir, paste0(celltype, \"_summary.txt\")))\n", - "cat(\"*** Processing Summary for\", celltype, \"***\\n\\n\")\n", - "cat(\"Original peak count:\", length(peak_id), \"\\n\")\n", - "cat(\"Peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n", - "cat(\"Peaks after expression filtering:\", nrow(dge), \"\\n\\n\")\n", - "cat(\"Number of samples:\", ncol(dge), \"\\n\")\n", - "cat(\"\\nTechnical Variables Used:\\n\")\n", - "cat(\"- log_n_nuclei: Log-transformed number of nuclei per sample\\n\")\n", - "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - "cat(\"- med_tss_enrich: Median TSS enrichment\\n\")\n", - "cat(\"- log_med_n_tot_fragment: Log-transformed median number of total fragments\\n\")\n", - "cat(\"- log_total_unique_peaks: Log-transformed count of unique peaks per sample\\n\")\n", - "cat(\"\\nDemographic Variables Used:\\n\")\n", - "cat(\"- msex: Sex (male=1, female=0)\\n\")\n", - "cat(\"- age_death: Age at death\\n\")\n", - "cat(\"- pmi: Post-mortem interval\\n\")\n", - "cat(\"- study: Study cohort\\n\")\n", - "sink()\n", - "\n", - "# Write an additional explanation file about the variables and log transformation\n", - "sink(file = file.path(out_dir, paste0(celltype,\"_variable_explanation.txt\")))\n", - "cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n", - "\n", - "cat(\"## Why Log Transformation?\\n\")\n", - "cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", - "cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", - "cat(\"2. To stabilize variance across the range of values\\n\")\n", - "cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", - "cat(\"4. 
To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", - "\n", - "cat(\"## Variables and Their Meanings\\n\\n\")\n", - "\n", - "cat(\"### Technical Variables\\n\")\n", - "cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", - "cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", - "\n", - "cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", - "cat(\" * Represents sequencing depth\\n\")\n", - "cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", - "\n", - "cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", - "cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", - "\n", - "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - "cat(\" * Measures the degree of nucleosome positioning\\n\")\n", - "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n", - "\n", - "cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", - "cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", - "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n", - "\n", - "cat(\"### Demographic Variables\\n\")\n", - "cat(\"- msex: Sex (male=1, female=0)\\n\")\n", - "cat(\"- age_death: Age at death\\n\")\n", - "cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", - "cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", - "\n", - "cat(\"## Relationship to voom Transformation\\n\")\n", - "cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", - "cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", - "cat(\"covariates, we ensure they're on a similar scale to the transformed expression data, \")\n", - "cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", - "sink()\n", - "\n", - "cat(\"Processing completed. Results and documentation saved to:\", out_dir, \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "b28beaf3-804a-4b88-9e7e-156a5d4ee3d0", - "metadata": {}, - "source": [ - "### Option B: Pseudobulk QC WITHOUT Biological Variation (noBIOvar)\n", - "Use this option when you want to preserve biological variation (e.g., for comparing across ages/sexes or region-specific analyses).\n", - "\n", - "**Input:** (Same as Option A)\n", - "- Pseudobulk peak counts (in `1_files_with_sampleid` folder): `pseudobulk_peaks_counts{celltype}_50nuc.csv.gz`\n", - "- Cell metadata (in `1_files_with_sampleid` folder): `metadata_{celltype}_50nuc.csv`\n", - "- Sample covariates: `rosmap_cov.txt`\n", - "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n", - "\n", - "**Process:**\n", - "1. Loads pseudobulk peak count matrix and metadata per cell type\n", - "2. Calculates technical QC metrics per sample:\n", - " - `log_n_nuclei`: Log-transformed number of nuclei\n", - " - `med_nucleosome_signal`: Median nucleosome signal\n", - " - `med_tss_enrich`: Median TSS enrichment score\n", - " - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)\n", - " - `log_total_unique_peaks`: Log-transformed count of unique peaks detected\n", - "3. Filters blacklisted genomic regions using `foverlaps()`\n", - "4. Merges with demographic covariates (msex, age_death, pmi, study)\n", - "5. 
Applies expression filtering with `filterByExpr()`:\n", - " - `min.count = 2`: Minimum 2 reads in at least one sample\n", - " - `min.total.count = 15`: Minimum 15 total reads across all samples\n", - " - `min.prop = 0.1`: Peak must be expressed in ≥10% of samples\n", - "6. TMM normalization with `calcNormFactors()`\n", - "7. Saves **filtered raw counts** without covariate adjustment\n", - "\n", - "**Key Difference:** \n", - "- Does NOT regress out msex or age_death\n", - "- No residual calculation performed (voom/lmFit section commented out)\n", - "- Only saves TMM-normalized, filtered count matrix\n", - "\n", - "**Model formula (if residuals were computed):**\n", - "```r\n", - "model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi + study\n", - "\n", - "```\n", - "Note: The voom/residual calculation section is commented out; only filtered counts are saved\n", - "\n", - "**Output:** `output/2_residuals/{celltype}/`\n", - "\n", - "`{celltype}_filtered_raw_counts.txt`: TMM-normalized, filtered peak counts without biological covariate adjustment\n", - "\n", - "**Key Variables NOT Regressed:**\n", - "- Sex (msex)\n", - "- Age at death (age_death)" - ] - }, - { - "cell_type": "markdown", - "id": "8ee626cb-8aa6-4464-8066-4f501b5d6eaf", - "metadata": {}, - "source": [ - "#### Load libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "0bfb521c-fdc2-4029-b8e6-9c3459ee8872", - "metadata": {}, - "outputs": [], - "source": [ - "library(data.table)\n", - "library(stringr)\n", - "library(dplyr)\n", - "library(edgeR)" - ] - }, - { - "cell_type": "markdown", - "id": "e346a569-892d-43b7-974d-f55ca725d83b", - "metadata": {}, - "source": [ - "#### Load input" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "id": "55b554b2-722b-48d8-aa25-bbdae074963f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Processing celltype: Exc \n" - ] - } - ], - "source": [ - "# Set cell type and create output directory\n", - "#args <- commandArgs(trailingOnly = TRUE)\n", - "#celltype <- args[1] # First argument is the cell type\n", - "celltype <- \"Exc\" # Change this for different cell types\n", - "cat(\"Processing celltype:\", celltype, \"\\n\")\n", - "\n", - "out_dir <- paste0(file.path(output_dir,\"2_residuals/\", celltype))\n", - "dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)" - ] - }, - { - "cell_type": "markdown", - "id": "6d336458-99cd-4ff0-838b-4423d6bf2e9a", - "metadata": {}, - "source": [ - "#### Create predictOffset function " - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "id": "e01e3c57-8abc-4b09-94fb-1ec7fc55b0ac", - "metadata": {}, - "outputs": [], - "source": [ - "predictOffset <- function(fit) {\n", - " # Define which variables are factors and which are continuous\n", - " usedFactors <- c(\"sequencingBatch\", \"study\") \n", - " usedContinuous <- c(\"log_n_nuclei\", \"med_nucleosome_signal\", \"med_tss_enrich\", \"log_med_n_tot_fragment\",\n", - " \"log_total_unique_peaks\", \"med_peakwidth\", \"pmi\")\n", - " \n", - " # Filter to only use variables actually in the design matrix\n", - " usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n", - " usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n", - " \n", - " # Get indices for factor and continuous 
variables\n", - " facInd <- unlist(lapply(as.list(usedFactors), \n", - " function(f) {return(grep(paste0(\"^\", f), \n", - " colnames(fit$design)))}))\n", - " contInd <- unlist(lapply(as.list(usedContinuous), \n", - " function(f) {return(grep(paste0(\"^\", f), \n", - " colnames(fit$design)))}))\n", - " \n", - " # Add the intercept\n", - " all_indices <- c(1, facInd, contInd)\n", - " \n", - " # Verify design matrix structure (using sorted indices to avoid duplication warning)\n", - " all_indices_sorted <- sort(unique(all_indices))\n", - " stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))\n", - " \n", - " # Create new design matrix with median values\n", - " D <- fit$design\n", - " D[, facInd] <- 0 # Set all factor levels to reference level\n", - " \n", - " # For continuous variables, set to median value\n", - " if (length(contInd) > 0) {\n", - " medContVals <- apply(D[, contInd, drop=FALSE], 2, median)\n", - " for (i in 1:length(medContVals)) {\n", - " D[, names(medContVals)[i]] <- medContVals[i]\n", - " }\n", - " }\n", - " \n", - " # Calculate offsets\n", - " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n", - " offsets <- apply(coefficients(fit), 1, function(c) {\n", - " return(D %*% c)\n", - " })\n", - " offsets <- t(offsets)\n", - " colnames(offsets) <- rownames(fit$design)\n", - " \n", - " return(offsets)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "ab4a5c28-5295-47e4-a5f5-26d6cbb995ca", - "metadata": {}, - "source": [ - "#### Load input" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "id": "8ea62e42-0bbf-4166-8c51-ded8318a6463", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loaded metadata with 90 samples and peak data with 531489 peaks\n" - ] - } - ], - "source": [ - "meta <- fread(paste0(file.path(input_dir, \"1_files_with_sampleid/metadata_\"), celltype, \"_50nuc.csv\"))\n", - "peak_data <- fread(file.path(input_dir,\"1_files_with_sampleid\", paste0(\"pseudobulk_peaks_counts\", celltype, \"_50nuc.csv.gz\")))\n", - "\n", - "cat(\"Loaded metadata with\", nrow(meta), \"samples and peak data with\", nrow(peak_data), \"peaks\\n\")\n", - "\n", - "# Extract peak_id and set as rownames\n", - "peak_id <- peak_data$peak_id\n", - "peak_data <- peak_data[, -1, with = FALSE] # Remove peak_id column\n", - "peak_matrix <- as.matrix(peak_data)\n", - "rownames(peak_matrix) <- peak_id" - ] - }, - { - "cell_type": "markdown", - "id": "b3b9c011-775e-41bc-b581-4269628592eb", - "metadata": {}, - "source": [ - "#### Process technical variables from meta data" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "id": "866e1e87-0c20-4a71-9d83-450c49a3e647", - "metadata": {}, - "outputs": [], - "source": [ - "# Column name normalization (for easier handling)\n", - "meta_clean <- meta %>%\n", - " rename(\n", - " med_nucleosome_signal = med.nucleosome_signal.ct,\n", - " med_tss_enrich = med.tss.enrich.ct,\n", - " med_n_tot_fragment = med.n_tot_fragment.ct,\n", - " n_nuclei = n.nuclei\n", - " )\n", - "\n", - "# Calculate peak metrics - total unique peaks per sample and median peak width\n", - "peak_metrics <- data.frame(\n", - " sampleid = colnames(peak_matrix),\n", - " total_unique_peaks = colSums(peak_matrix > 0)\n", - ") %>%\n", - " mutate(log_total_unique_peaks = log(total_unique_peaks + 1))\n", - "\n", - "# Calculate median peak width for each sample using count as weight\n", - "calculate_median_peakwidth <- function(peak_matrix, peak_info) {\n", - " # Create a data frame with peak 
widths\n", - " peak_widths <- peak_info$end - peak_info$start\n", - " \n", - " # Initialize a vector to store median peak widths\n", - " median_peak_widths <- numeric(ncol(peak_matrix))\n", - " names(median_peak_widths) <- colnames(peak_matrix)\n", - " \n", - " # For each sample, calculate the weighted median peak width\n", - " for (i in 1:ncol(peak_matrix)) {\n", - " sample_counts <- peak_matrix[, i]\n", - " # Only consider peaks with counts > 0\n", - " idx <- which(sample_counts > 0)\n", - " \n", - " if (length(idx) > 0) {\n", - " # Method 1: Use counts as weights\n", - " weights <- sample_counts[idx]\n", - " # Repeat each peak width by its count for weighted calculation\n", - " all_widths <- rep(peak_widths[idx], times=weights)\n", - " median_peak_widths[i] <- median(all_widths)\n", - " } else {\n", - " median_peak_widths[i] <- NA\n", - " }\n", - " }\n", - " \n", - " return(median_peak_widths)\n", - "}\n", - "\n", - "# Calculate median peak width for each sample\n", - "# Note: Using the peak_df that was created earlier for blacklist filtering\n", - "median_peakwidths <- calculate_median_peakwidth(peak_matrix, data.frame(\n", - " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n", - " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3))\n", - "))\n", - "\n", - "# Add median peak width to peak metrics\n", - "peak_metrics$med_peakwidth <- median_peakwidths" - ] - }, - { - "cell_type": "markdown", - "id": "0f7eee8d-91f2-48a8-b3df-f5f6fbd6ac9b", - "metadata": {}, - "source": [ - "#### Process peaks" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "id": "0fc2bc63-131f-425d-bbcd-66d6eba93076", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Sample of peak coordinates:\n", - " peak_name chr start end\n", - " \n", - "1: chr1-181293-181565 chr1 181293 181565\n", - "2: chr1-190726-191626 chr1 190726 191626\n", - "3: chr1-629712-630662 chr1 629712 630662\n", - "4: chr1-631261-631470 chr1 631261 631470\n", - "5: chr1-633891-634506 chr1 633891 634506\n", - "6: chr1-777873-779958 chr1 777873 779958\n", - "Number of blacklisted peaks: 2354 \n", - "Number of peaks after blacklist filtering: 529135 \n" - ] - } - ], - "source": [ - "# Process peak coordinates\n", - "peak_df <- data.table(\n", - " peak_name = peak_id,\n", - " chr = sapply(strsplit(peak_id, \"-\"), `[`, 1),\n", - " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n", - " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3)),\n", - " stringsAsFactors = FALSE\n", - ")\n", - "\n", - "# Verify peak coordinates were extracted correctly\n", - "cat(\"Sample of peak coordinates:\\n\")\n", - "print(head(peak_df))\n", - "\n", - "# Load blacklist\n", - "blacklist_file <- file.path(input_dir,\"hg38-blacklist.v2.bed.gz\")\n", - "if (file.exists(blacklist_file)) {\n", - " blacklist_df <- fread(blacklist_file)\n", - " if (ncol(blacklist_df) >= 4) {\n", - " colnames(blacklist_df)[1:4] <- c(\"chr\", \"start\", \"end\", \"label\")\n", - " } else {\n", - " colnames(blacklist_df)[1:3] <- c(\"chr\", \"start\", \"end\")\n", - " }\n", - " \n", - " # Filter blacklisted peaks\n", - " setkey(blacklist_df, chr, start, end)\n", - " setkey(peak_df, chr, start, end)\n", - " overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n", - " blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n", - " cat(\"Number of blacklisted peaks:\", length(blacklisted_peaks), \"\\n\")\n", - " \n", - " filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)\n", - " 
filtered_peak <- peak_matrix[filtered_peak_idx, ]\n", - " cat(\"Number of peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n", - "} else {\n", - " cat(\"Warning: Blacklist file not found at\", blacklist_file, \"\\n\")\n", - " cat(\"Proceeding without blacklist filtering\\n\")\n", - " filtered_peak <- peak_matrix\n", - "}\n" - ] - }, - { - "cell_type": "markdown", - "id": "d764e632-2fca-401e-9457-8174ff204000", - "metadata": {}, - "source": [ - "#### Load covariates" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "id": "8aaabbbf-70e3-421c-863d-1c8c08c0fc24", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Variable statistics before and after log transformation:\n", - "n_nuclei: min=77.00, median=1762.00, max=7024.00, SD=1275.90\n", - "log_n_nuclei: min=4.34, median=7.47, max=8.86, SD=0.88\n", - "med_n_tot_fragment: min=3234.00, median=21072.00, max=133932.50, SD=20162.62\n", - "log_med_n_tot_fragment: min=8.08, median=9.96, max=11.81, SD=0.73\n", - "Number of samples after joining: 83 \n", - "Sample IDs: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT ...\n", - "Available covariates: sampleid, individualID, sequencingBatch, main_cell_type, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, n_nuclei, total_unique_peaks, log_total_unique_peaks, med_peakwidth, pmi, study, log_n_nuclei, log_med_n_tot_fragment \n" - ] - } - ], - "source": [ - "covariates_file <- file.path(input_dir,'rosmap_cov.txt')\n", - "if (file.exists(covariates_file)) {\n", - " covariates <- fread(covariates_file)\n", - " # Check column names and adjust if needed\n", - " if ('#id' %in% colnames(covariates)) {\n", - " id_col <- '#id'\n", - " } else if ('individualID' %in% colnames(covariates)) {\n", - " id_col <- 'individualID'\n", - " } else {\n", - " cat(\"Warning: Could not identify ID column in covariates file. 
Available columns:\", \n", - " paste(colnames(covariates), collapse=\", \"), \"\\n\")\n", - " id_col <- colnames(covariates)[1]\n", - " cat(\"Using\", id_col, \"as ID column\\n\")\n", - " }\n", - " \n", - " # Select relevant columns - excluding msex and age_death\n", - " cov_cols <- intersect(c(id_col, 'pmi', 'study'), colnames(covariates))\n", - " covariates <- covariates[, ..cov_cols]\n", - " \n", - " # Merge with metadata\n", - " meta_with_ind <- meta_clean %>%\n", - " select(sampleid, everything())\n", - " \n", - " all_covs <- meta_with_ind %>%\n", - " inner_join(peak_metrics, by = \"sampleid\") %>%\n", - " inner_join(covariates, by = setNames(id_col, \"sampleid\"))\n", - " \n", - " # Impute missing values\n", - " for (col in c(\"pmi\")) {\n", - " if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {\n", - " cat(\"Imputing missing values for\", col, \"\\n\")\n", - " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n", - " }\n", - " }\n", - "} else {\n", - " cat(\"Warning: Covariates file\", covariates_file, \"not found.\\n\")\n", - " cat(\"Proceeding with only technical variables.\\n\")\n", - " all_covs <- meta_clean %>%\n", - " inner_join(peak_metrics, by = \"sampleid\")\n", - "}\n", - "\n", - "\n", - "# Perform log transformations on necessary variables\n", - "# Add a small constant to avoid log(0)\n", - "epsilon <- 1e-6\n", - "\n", - "all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)\n", - "all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)\n", - "\n", - "# Show distribution of original and log-transformed variables\n", - "cat(\"\\nVariable statistics before and after log transformation:\\n\")\n", - "for (var in c(\"n_nuclei\", \"med_n_tot_fragment\")) {\n", - " orig_var <- all_covs[[var]]\n", - " log_var <- all_covs[[paste0(\"log_\", var)]]\n", - " \n", - " cat(sprintf(\"%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n", - " var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))\n", - " cat(sprintf(\"log_%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n", - " var, min(log_var), median(log_var), max(log_var), sd(log_var)))\n", - "}\n", - "\n", - "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n", - "cat(\"Sample IDs:\", paste(head(all_covs$sampleid), collapse=\", \"), \"...\\n\")\n", - "cat(\"Available covariates:\", paste(colnames(all_covs), collapse=\", \"), \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "6bbc0158-7095-48db-a80c-020fad7bd4ec", - "metadata": {}, - "source": [ - "#### Create DGE object" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "id": "13ebe29a-6598-4d9d-b9ff-223ae3a98656", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of valid samples: 83 \n" - ] - } - ], - "source": [ - "valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)\n", - "cat(\"Number of valid samples:\", length(valid_samples), \"\\n\")\n", - "\n", - "all_covs_filtered <- all_covs[all_covs$sampleid %in% valid_samples, ]\n", - "filtered_peak_filtered <- filtered_peak[, valid_samples]\n", - "\n", - "dge <- DGEList(\n", - " counts = filtered_peak_filtered,\n", - " samples = all_covs_filtered\n", - ")\n", - "rownames(dge$samples) <- dge$samples$sampleid" - ] - }, - { - "cell_type": "markdown", - "id": "6962730b-7cdd-41e4-ba2c-51cd08d16013", - "metadata": {}, - "source": [ - "#### Filter low counts and normalize" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "id": 
"9f0f07a2-0b66-4031-acf9-cc0db9e8af4f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks before filtering: 529135 \n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning message in filterByExpr.DGEList(dge, min.count = 5, min.total.count = 15, :\n", - "“All samples appear to belong to the same group.”\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks after filtering: 521515 \n", - "Saved filtered raw counts to /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals//Exc/Exc_filtered_raw_counts.txt \n" - ] - } - ], - "source": [ - "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n", - "keep <- filterByExpr(dge, \n", - " min.count = 5, # for one sample, min reads \n", - " min.total.count = 15, # min reads overall\n", - " min.prop = 0.1) \n", - "\n", - "dge <- dge[keep, , keep.lib.sizes=FALSE]\n", - "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\") #1368 in mic,2491 in Ast\n", - "\n", - "# Save filtered raw count data\n", - "filtered_raw_counts <- dge$counts\n", - "write.table(filtered_raw_counts,\n", - " file = file.path(out_dir, paste0(celltype, \"_filtered_raw_counts.txt\")), \n", - " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n", - "cat(\"Saved filtered raw counts to\", file.path(out_dir, paste0(celltype, \"_filtered_raw_counts.txt\")), \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "3a8ac9ec-e713-411e-940a-3e0e7eff0c27", - "metadata": {}, - "source": [ - "#### Handle batch as technical variable" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "id": "82da4179-feae-47f0-a566-a04127beacc7", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Handling sequencingBatch as a technical variable\n", - "Found 2 unique batches\n", - "Batch sizes:\n", - "batches\n", - "190820Kel 191203Kel \n", - " 6 77 \n" - ] - } - ], - "source": [ - "dge <- calcNormFactors(dge, method=\"TMM\")\n", - "# We'll handle batch as a technical variable rather than doing batch adjustment\n", - "cat(\"Handling sequencingBatch as a technical variable\\n\")\n", - "\n", - "# Check batch information\n", - "batches <- dge$samples$sequencingBatch\n", - "cat(\"Found\", length(unique(batches)), \"unique batches\\n\")\n", - "\n", - "# Check batch size\n", - "batch_counts <- table(batches)\n", - "cat(\"Batch sizes:\\n\")\n", - "print(batch_counts)\n", - "\n", - "# Convert sequencingBatch to factor with at least 2 levels\n", - "if (length(unique(batches)) < 2) {\n", - " cat(\"Only one batch found. 
Adding dummy batch for model compatibility.\\n\")\n", - " # Create a dummy batch factor to avoid model errors\n", - " dge$samples$sequencingBatch_factor <- factor(rep(\"batch1\", ncol(dge)))\n", - "} else {\n", - " # Use the existing batch information\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "6593bcae-cf9b-46d4-8411-37aa7b0d2f7a", - "metadata": {}, - "source": [ - "#### Create model and run voom" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "id": "d078474a-45e4-4762-adb2-06925885ff88", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Using model with technical covariates plus pmi and study\n", - "Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi + study \n", - "Warning: Factor variable group has only one level. Converting to character.\n", - "Successfully created design matrix with 10 columns\n", - "Calculating offsets and residuals...\n" - ] - } - ], - "source": [ - "# Define the model based on available covariates - using log-transformed variables\n", - "# Removed msex and age_death from the model\n", - "if (\"study\" %in% colnames(dge$samples) && \"pmi\" %in% colnames(dge$samples)) {\n", - " # Technical model with pmi and study\n", - " cat(\"Using model with technical covariates plus pmi and study\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi + study\n", - "} else if (\"pmi\" %in% colnames(dge$samples)) {\n", - " # Technical model with pmi only\n", - " cat(\"Using model with technical covariates and pmi\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi\n", - "} else {\n", - " # Technical variables only model\n", - " cat(\"Using model with technical covariates only\\n\")\n", - " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n", - " log_total_unique_peaks + med_peakwidth + sequencingBatch_factor\n", - "}\n", - "\n", - "# Print the model formula\n", - "cat(\"Model formula:\", deparse(model), \"\\n\")\n", - "\n", - "# Check for factor variables with only one level\n", - "for (col in colnames(dge$samples)) {\n", - " if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {\n", - " cat(\"Warning: Factor variable\", col, \"has only one level. 
Converting to character.\\n\")\n", - " dge$samples[[col]] <- as.character(dge$samples[[col]])\n", - " }\n", - "}\n", - "\n", - "# Create design matrix with error checking\n", - "tryCatch({\n", - " design <- model.matrix(model, data=dge$samples)\n", - " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n", - "}, error = function(e) {\n", - " cat(\"Error in creating design matrix:\", e$message, \"\\n\")\n", - " cat(\"Attempting to fix model formula...\\n\")\n", - " \n", - " # Check each term in the model\n", - " all_terms <- all.vars(model)\n", - " valid_terms <- character(0)\n", - " \n", - " for (term in all_terms) {\n", - " if (term %in% colnames(dge$samples)) {\n", - " # Check if it's a factor with at least 2 levels\n", - " if (is.factor(dge$samples[[term]])) {\n", - " if (nlevels(dge$samples[[term]]) >= 2) {\n", - " valid_terms <- c(valid_terms, term)\n", - " } else {\n", - " cat(\"Skipping factor\", term, \"with only\", nlevels(dge$samples[[term]]), \"level\\n\")\n", - " }\n", - " } else {\n", - " # Non-factor variables are fine\n", - " valid_terms <- c(valid_terms, term)\n", - " }\n", - " } else {\n", - " cat(\"Variable\", term, \"not found in sample data\\n\")\n", - " }\n", - " }\n", - " \n", - " # Create a simplified model with valid terms\n", - " if (length(valid_terms) > 0) {\n", - " model_str <- paste(\"~\", paste(valid_terms, collapse = \" + \"))\n", - " model <- as.formula(model_str)\n", - " cat(\"New model formula:\", model_str, \"\\n\")\n", - " design <- model.matrix(model, data=dge$samples)\n", - " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n", - " } else {\n", - " stop(\"Could not create a valid model with the available variables\")\n", - " }\n", - "})\n", - "\n", - "# Check if the design matrix is full rank\n", - "if (!is.fullrank(design)) {\n", - " cat(\"Design matrix is not full rank. 
Adjusting...\\n\")\n", - " # Find and remove the problematic columns\n", - " qr_res <- qr(design)\n", - " design <- design[, qr_res$pivot[1:qr_res$rank]]\n", - " cat(\"Adjusted design matrix columns:\", ncol(design), \"\\n\")\n", - "}\n", - "\n", - "# Run voom and fit model\n", - "v <- voom(dge, design, plot=FALSE) #logCPM\n", - "fit <- lmFit(v, design)\n", - "fit <- eBayes(fit)\n", - "\n", - "# Calculate offset and residuals\n", - "cat(\"Calculating offsets and residuals...\\n\")\n", - "offset <- predictOffset(fit)\n", - "resids <- residuals(fit, y=v)\n", - "\n", - "# Verify dimensions\n", - "stopifnot(all(rownames(offset) == rownames(resids)) & all(colnames(offset) == colnames(resids)))\n", - "\n", - "# Final adjusted data\n", - "stopifnot(all(dim(offset) == dim(resids)))\n", - "stopifnot(all(colnames(offset) == colnames(resids)))\n", - "\n", - "final_data <- offset + resids" - ] - }, - { - "cell_type": "markdown", - "id": "ef172cf6-555f-49e7-834f-b0f706b4b3bf", - "metadata": {}, - "source": [ - "#### Save results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f002e9d7-d994-4cdf-9ecc-7a0f18210b58", - "metadata": {}, - "outputs": [], - "source": [ - "saveRDS(list(\n", - " dge = dge,\n", - " offset = offset,\n", - " residuals = resids,\n", - " final_data = final_data,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = model\n", - "), file = file.path(out_dir, paste0(celltype, \"_results.rds\")))\n", - "\n", - "# Write final residual data to file\n", - "write.table(final_data,\n", - " file = file.path(out_dir, paste0(celltype, \"_residuals.txt\")), \n", - " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n", - "\n", - "# Write summary statistics\n", - "sink(file = file.path(out_dir, paste0(celltype, \"_summary.txt\")))\n", - "cat(\"*** Processing Summary for\", celltype, \"***\\n\\n\")\n", - "cat(\"Original peak count:\", length(peak_id), \"\\n\")\n", - "cat(\"Peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n", - "cat(\"Peaks after expression filtering:\", nrow(dge), \"\\n\\n\")\n", - "cat(\"Number of samples:\", ncol(dge), \"\\n\")\n", - "cat(\"\\nTechnical Variables Used:\\n\")\n", - "cat(\"- log_n_nuclei: Log-transformed number of nuclei per sample\\n\")\n", - "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - "cat(\"- med_tss_enrich: Median TSS enrichment\\n\")\n", - "cat(\"- log_med_n_tot_fragment: Log-transformed median number of total fragments\\n\")\n", - "cat(\"- log_total_unique_peaks: Log-transformed count of unique peaks per sample\\n\")\n", - "cat(\"\\nOther Variables Used:\\n\")\n", - "cat(\"- pmi: Post-mortem interval\\n\")\n", - "cat(\"- study: Study cohort\\n\")\n", - "sink()\n", - "\n", - "# Write an additional explanation file about the variables and log transformation\n", - "sink(file = file.path(out_dir, paste0(celltype, \"_variable_explanation.txt\")))\n", - "cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n", - "\n", - "cat(\"## Why Log Transformation?\\n\")\n", - "cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", - "cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", - "cat(\"2. To stabilize variance across the range of values\\n\")\n", - "cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", - "cat(\"4. 
To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", - "\n", - "cat(\"## Variables and Their Meanings\\n\\n\")\n", - "\n", - "cat(\"### Technical Variables\\n\")\n", - "cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", - "cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", - "\n", - "cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", - "cat(\" * Represents sequencing depth\\n\")\n", - "cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", - "\n", - "cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", - "cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", - "\n", - "cat(\"- med_peakwidth: Median width of peaks in each sample (weighted by counts)\\n\")\n", - "cat(\" * Represents the typical size of accessible regions\\n\\n\")\n", - "\n", - "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - "cat(\" * Measures the degree of nucleosome positioning\\n\")\n", - "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n", - "\n", - "cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", - "cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", - "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n", - "\n", - "cat(\"### Other Variables\\n\")\n", - "cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", - "cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", - "\n", - "cat(\"## Relationship to voom Transformation\\n\")\n", - "cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", - "cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", - "cat(\"covariates, we ensure they're on a similar scale to the transformed expression data, \")\n", - "cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", - "sink()\n", - "\n", - "cat(\"Processing completed. Results and documentation saved to:\", out_dir, \"\\n\")" - ] - }, - { - "cell_type": "markdown", - "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", - "metadata": {}, - "source": [ - "## Step 2: Format Output\n", - "### Format A: Phenotype Reformatting \n", - "\n", - "**Input:**\n", - "- `{celltype}_residuals.txt` from Step 1 Option A (in `2_residuals/{celltype}/`)\n", - "\n", - "**Process:**\n", - "1. Reads residuals file with proper handling of peak IDs and sample columns\n", - "2. Parses peak coordinates from peak IDs (format: `chr-start-end`)\n", - "3. Converts peaks to **midpoint coordinates**:\n", - " ```r\n", - " midpoint = (start + end) / 2\n", - " start = midpoint\n", - " end = midpoint + 1\n", - "4. Creates BED format: `#chr`, `start`, `end`, `ID` (peak_id), followed by sample expression values\n", - "5. Sorts by chromosome and genomic position using `setorder(bed_data, '#chr', start, end)`\n", - "6. Writes BED file with headers\n", - "7. 
Compresses with `bgzip -f`\n", - "\n", - "**Output:** `output/3_phenotype_processing/{celltype}`\n", - "\n", - "- `{celltype}_kellis_snatac_phenotype.bed.gz`: QTL-ready BED file with peak midpoint coordinates and bgzip-compressed format\n", - "\n", - "**Use Case:**\n", - "Standard caQTL (chromatin accessibility QTL) mapping where you want to identify genetic variants affecting chromatin accessibility independent of demographic factors. Ready for FastQTL, TensorQTL, or QTLtools.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "4ed50732-7daf-409e-af3a-b3014808cb46", - "metadata": {}, - "source": [ - "#### Load libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "id": "0f4dbe14-2acf-4e8b-b63b-47c67f5f68e5", - "metadata": {}, - "outputs": [], - "source": [ - "library(data.table)\n", - "library(stringr)" - ] - }, - { - "cell_type": "markdown", - "id": "ff83118f-c112-4b03-9256-0a5e98322422", - "metadata": {}, - "source": [ - "#### Load input" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "id": "2f840ed2-3c7c-4a75-8a28-8c90c6f43d2e", - "metadata": {}, - "outputs": [], - "source": [ - "# Get command line arguments\n", - "#args <- commandArgs(trailingOnly = TRUE)\n", - "#if (length(args) < 1) {\n", - "# celltype <- \"Astro\" # Default cell type\n", - "# cat(\"No cell type specified, using default:\", celltype, \"\\n\")\n", - "#} else {\n", - "# celltype <- args[1]\n", - "# cat(\"Processing cell type:\", celltype, \"\\n\")\n", - "#}\n", - "\n", - "celltype <- \"Astro\"\n", - "\n", - "# Define input and output paths\n", - "reformat_input_dir <- file.path(output_dir,\"2_residuals\")\n", - "#output_dir <- \"/home/al4225/project/kellis_snatac/output/3_phenotype_processing\"\n", - "reformat_output_dir <- paste0(output_dir,\"/3_phenotype_processing/\", celltype)\n", - "\n", - "# Create output directory if it doesn't exist\n", - "dir.create(reformat_output_dir, recursive = TRUE, showWarnings = FALSE)\n", - "\n", - "# Check if input directory exists\n", - "celltype_dir <- file.path(reformat_input_dir, celltype)\n", - "if (!dir.exists(reformat_input_dir)) {\n", - " cat(\"Cell type directory not found:\", celltype_dir, \"\\n\")\n", - " cat(\"Using backup directory...\\n\")\n", - " celltype_dir <- file.path(reformat_input_dir, \"backup\", celltype)\n", - " if (!dir.exists(celltype_dir)) {\n", - " stop(\"Backup directory not found either: \", celltype_dir)\n", - " }\n", - "}\n", - "\n", - "input_file <- file.path(celltype_dir, paste0(celltype, \"_residuals.txt\"))\n", - "output_bed <- file.path(reformat_output_dir, paste0(celltype, \"_kellis_snatac_phenotype.bed\"))\n", - "\n", - "# Check if input file exists\n", - "if (!file.exists(input_file)) {\n", - " stop(\"Input file not found: \", input_file)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "53d4672d-41e7-4533-b2e3-2eccf8c3b4d4", - "metadata": {}, - "source": [ - "#### Processing data" - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "id": "90e6f9a9-8c97-4890-ae20-be758d8c7f1e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column names from first line: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT ...\n", - "Reading residuals file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Astro/Astro_residuals.txt \n", - "File has more data columns than header columns. 
Assuming first column is peak IDs.\n", - "First few peak IDs: chr1-816945-817430, chr1-817852-818227, chr1-818626-819158, chr1-826625-827679, chr1-869475-870473, chr1-903568-904912 \n", - "First few column names: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT \n" - ] - } - ], - "source": [ - "# Read the first line manually to get the column names\n", - "first_line <- readLines(input_file, n = 1)\n", - "col_names <- unlist(strsplit(first_line, split = \"\\t\"))\n", - "cat(\"Column names from first line:\", paste(head(col_names), collapse = \", \"), \"...\\n\")\n", - "\n", - "# Read the residuals file using fread but skip the header\n", - "cat(\"Reading residuals file:\", input_file, \"\\n\")\n", - "residuals <- fread(input_file, header = FALSE, skip = 1)\n", - "\n", - "# If we have an extra column compared to the header line (often happens with rownames)\n", - "if (ncol(residuals) > length(col_names)) {\n", - " cat(\"File has more data columns than header columns. Assuming first column is peak IDs.\\n\")\n", - " peak_ids <- residuals[[1]]\n", - " residuals <- residuals[, -1, with = FALSE]\n", - " # Set proper column names excluding the first one which was for peak IDs\n", - " if (length(col_names) >= 2) {\n", - " setnames(residuals, col_names)\n", - " }\n", - "} else {\n", - " # Normal case - columns match\n", - " setnames(residuals, col_names)\n", - " peak_ids <- residuals[[1]]\n", - " residuals <- residuals[, -1, with = FALSE]\n", - "}\n", - "\n", - "# Check that peak IDs and column names were properly extracted\n", - "cat(\"First few peak IDs:\", paste(head(peak_ids), collapse = \", \"), \"\\n\")\n", - "cat(\"First few column names:\", paste(head(colnames(residuals)), collapse = \", \"), \"\\n\")\n", - "\n", - "# Parse peak IDs to get chromosome, start, and end\n", - "# cat(\"Parsing peak IDs into BED format\\n\")\n", - "# parsed_peaks <- data.table(\n", - "# '#chr' = sapply(strsplit(peak_ids, \"-\"), `[`, 1),\n", - "# start = as.integer(sapply(strsplit(peak_ids, \"-\"), `[`, 2)),\n", - "# end = as.integer(sapply(strsplit(peak_ids, \"-\"), `[`, 3)),\n", - "# ID = peak_ids # Use peak_id as the ID column (4th column in BED)\n", - "# )\n" - ] - }, - { - "cell_type": "markdown", - "id": "7cfa7749-21b4-4db8-b0d1-a82d7d3b3994", - "metadata": {}, - "source": [ - "#### Parse peak ID" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "id": "cb6e8c9c-66f4-452e-b5ba-c88f9ca9de17", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Parsing peak IDs into BED format with midpoint coordinates\n" - ] - } - ], - "source": [ - "# Parse peak IDs to get chromosome, start, and end\n", - "cat(\"Parsing peak IDs into BED format with midpoint coordinates\\n\")\n", - "\n", - "parsed_peaks <- data.table(\n", - " '#chr' = sapply(strsplit(peak_ids, \"-\"), `[`, 1),\n", - " start = as.integer((as.integer(sapply(strsplit(peak_ids, \"-\"), `[`, 2)) + \n", - " as.integer(sapply(strsplit(peak_ids, \"-\"), `[`, 3))) / 2),\n", - " end = as.integer(((as.integer(sapply(strsplit(peak_ids, \"-\"), `[`, 2)) + \n", - " as.integer(sapply(strsplit(peak_ids, \"-\"), `[`, 3))) / 2) + 1), \n", - " ID = peak_ids # Use peak_id as the ID column (4th column in BED)\n", - ")\n", - "\n", - "\n", - "# Add validation to ensure end > start\n", - "if (any(parsed_peaks$end <= parsed_peaks$start)) {\n", - " cat(\"Warning: Found records where end <= start. 
Fixing...\\n\")\n", - " parsed_peaks[end <= start, end := start + 1]\n", - "}\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "22eb8753-a035-4b01-906d-d552abf522d5", - "metadata": {}, - "source": [ - "#### Create BED" - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "id": "45df0f55-ce77-4e36-8af6-09511031d650", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Writing BED file to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_phenotype_processing/Astro/Astro_kellis_snatac_phenotype.bed \n", - "Compressing BED file with bgzip...\n", - "Process completed.\n", - "Output file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_phenotype_processing/Astro/Astro_kellis_snatac_phenotype.bed.gz \n" - ] - } - ], - "source": [ - "# Create BED format with all data columns\n", - "# BED format: chr, start, end, ID, followed by phenotype values with sample IDs as column names\n", - "bed_data <- cbind(parsed_peaks, residuals)\n", - "\n", - "# Sort by chromosome and position\n", - "setorder(bed_data, '#chr', start, end)\n", - "\n", - "# Write BED file with headers\n", - "cat(\"Writing BED file to:\", output_bed, \"\\n\")\n", - "fwrite(bed_data, output_bed, sep = \"\\t\", col.names = TRUE, quote = FALSE)\n", - "\n", - "# Compress the BED file with bgzip\n", - "cat(\"Compressing BED file with bgzip...\\n\")\n", - "bgzip_cmd <- paste(\"bgzip -f\", output_bed)\n", - "system(bgzip_cmd)\n", - "\n", - "cat(\"Process completed.\\n\")\n", - "cat(\"Output file:\", paste0(output_bed, \".gz\"), \"\\n\")\n" - ] - }, - { - "cell_type": "markdown", - "id": "daba0745-5ce6-4a32-8ad0-3e0647f15052", - "metadata": {}, - "source": [ - "### Format B: Regions Peak Filtering\n", - "**Input:**\n", - "- `{celltype}_filtered_raw_counts.txt` from Step 1 Option B (in `2_residuals/{celltype}/`)\n", - "\n", - "**Process:**\n", - "1. Reads filtered raw counts for each cell type\n", - "2. Parses peak coordinates from peak IDs (format: `chr-start-end`)\n", - "3. Calculates peak metrics:\n", - " - `peakwidth`: End - Start\n", - " - `midpoint`: (Start + End) / 2\n", - "4. Filters for **specific genomic regions of interest**:\n", - " - **Chr7:** 28,000,000 - 28,300,000 bp (300kb region)\n", - " - **Chr11:** 85,050,000 - 86,200,000 bp (1.15Mb region)\n", - "5. Includes peaks that overlap these regions (start, end, or span the boundaries)\n", - "6. Calculates summary statistics:\n", - " - `total_count`: Sum of counts across all samples per peak\n", - " - `weighted_count`: total_count / peakwidth (normalizes for peak size)\n", - "\n", - "**Output:** `output/4_regions/{celltype}/`\n", - "- `filtered_regions_of_interest.txt`: Full count data for peaks in target regions (all samples × selected peaks)\n", - "- `filtered_regions_of_interest_summary.txt`: Peak metadata with coordinates and count statistics\n", - "\n", - "**Use Case:** \n", - "Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci like APOE region, TREM2 locus) where biological variation should be preserved for interpretation." 
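The region filter in step 4 boils down to the standard interval-overlap test: a peak `[start, end]` overlaps a window `[win_start, win_end]` exactly when `start <= win_end & end >= win_start`, which is equivalent to the three-clause condition used in the code below. A minimal, self-contained sketch of that test — `regions`, `filter_region_peaks()`, and `toy_peaks` are illustrative names only, not part of the pipeline, and the `chr`/`start`/`end` columns are assumed to have been parsed from the peak IDs as described above:

```r
library(data.table)

# The two target windows described in step 4
regions <- data.table(
  chr   = c("chr7", "chr11"),
  start = c(28000000L, 85050000L),
  end   = c(28300000L, 86200000L)
)

# Keep peaks overlapping any window: start <= win_end AND end >= win_start
filter_region_peaks <- function(peaks, regions) {
  hit <- lapply(seq_len(nrow(regions)), function(i) {
    peaks$chr == regions$chr[i] &
      peaks$start <= regions$end[i] &
      peaks$end >= regions$start[i]
  })
  peaks[Reduce(`|`, hit)]
}

# Toy check: the first peak lies inside the chr7 window, the second does not
toy_peaks <- data.table(
  peak_id = c("chr7-28010000-28010500", "chr7-30000000-30000500"),
  chr     = "chr7",
  start   = c(28010000L, 30000000L),
  end     = c(28010500L, 30000500L)
)
filter_region_peaks(toy_peaks, regions)  # returns only the first peak
```

For many regions, `data.table::foverlaps()` (already used for blacklist filtering in Step 1) gives the same result more efficiently.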
- ] - }, - { - "cell_type": "markdown", - "id": "4b87b48b-c7ec-4799-bba1-4574d4d660fe", - "metadata": {}, - "source": [ - "#### Filter and save data for a specific cell type" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "id": "891a02b1-8f7c-4b97-b151-371b65ec52a3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Mic/Mic_filtered_raw_counts.txt \n", - "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Astro/Astro_filtered_raw_counts.txt \n", - "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Oligo/Oligo_filtered_raw_counts.txt \n", - "Processing Exc data from: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Exc/Exc_filtered_raw_counts.txt \n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning message in fread(input_file, check.names = TRUE):\n", - "“Detected 83 column names but the data has 84 columns (i.e. invalid file). Added an extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.”\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Found 276 regions of interest for Exc \n", - "Saved filtered data to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_regions/Exc/Exc_filtered_regions_of_interest.txt \n", - "Saved summary data to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_regions/Exc/Exc_filtered_regions_of_interest_summary.txt \n", - "\n", - "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Inh/Inh_filtered_raw_counts.txt \n", - "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/OPC/OPC_filtered_raw_counts.txt \n" - ] - } - ], - "source": [ - "# Function to filter and save data for a specific cell type with additional summary information\n", - "filter_and_save_by_celltype <- function(celltype) {\n", - " # Create output directory\n", - " peak_output_dir <- file.path(output_dir,\"3_regions\", celltype)\n", - " dir.create(peak_output_dir, recursive = TRUE, showWarnings = FALSE)\n", - " \n", - " # Load filtered raw counts for the cell type\n", - " input_file <- file.path(output_dir,\"2_residuals\", celltype, paste0(celltype, \"_filtered_raw_counts.txt\"))\n", - " \n", - " # Check if file exists before reading\n", - " if (!file.exists(input_file)) {\n", - " cat(\"File not found:\", input_file, \"\\n\")\n", - " return(FALSE)\n", - " }\n", - " \n", - " cat(\"Processing\", celltype, \"data from:\", input_file, \"\\n\")\n", - " \n", - " # Read data - handling the row names issue\n", - " cell_data <- fread(input_file, check.names = TRUE)\n", - " \n", - " # If the first column has no name (it's row names), give it a proper name\n", - " if (names(cell_data)[1] == \"V1\") {\n", - " setnames(cell_data, \"V1\", \"peak_id\")\n", - " }\n", - " \n", - " # Parse coordinates from peak IDs\n", - " cell_data$chr <- gsub(\"^(chr[^-]+)-.*$\", \"\\\\1\", cell_data$peak_id)\n", - " cell_data$start <- as.numeric(gsub(\"^chr[^-]+-([0-9]+)-.*$\", \"\\\\1\", cell_data$peak_id))\n", - " cell_data$end <- as.numeric(gsub(\"^chr[^-]+-[0-9]+-([0-9]+)$\", \"\\\\1\", cell_data$peak_id))\n", - " \n", - " # Calculate additional metrics\n", - " 
cell_data$peakwidth <- cell_data$end - cell_data$start\n", - " cell_data$midpoint <- (cell_data$start + cell_data$end) / 2\n", - " \n", - " # Filter for chr7 and chr11\n", - " chr_filtered <- cell_data[cell_data$chr %in% c(\"chr7\", \"chr11\"), ]\n", - " \n", - " # Filter for the specific regions\n", - " region_filtered <- chr_filtered[\n", - " # Chr7: 28,000kb-28,300kb\n", - " (chr_filtered$chr == \"chr7\" & \n", - " ((chr_filtered$start >= 28000000 & chr_filtered$start <= 28300000) | \n", - " (chr_filtered$end >= 28000000 & chr_filtered$end <= 28300000) |\n", - " (chr_filtered$start <= 28000000 & chr_filtered$end >= 28300000))) |\n", - " # Chr11: 85,050kb-86,200kb\n", - " (chr_filtered$chr == \"chr11\" & \n", - " ((chr_filtered$start >= 85050000 & chr_filtered$start <= 86200000) | \n", - " (chr_filtered$end >= 85050000 & chr_filtered$end <= 86200000) |\n", - " (chr_filtered$start <= 85050000 & chr_filtered$end >= 86200000))),\n", - " ]\n", - " \n", - " # Report results\n", - " cat(\"Found\", nrow(region_filtered), \"regions of interest for\", celltype, \"\\n\")\n", - " \n", - " # Save the original filtered data (with all columns)\n", - " output_file <- file.path(peak_output_dir, paste0(celltype,\"_filtered_regions_of_interest.txt\"))\n", - " write.table(region_filtered, output_file, sep=\"\\t\", quote=FALSE, row.names=FALSE)\n", - " cat(\"Saved filtered data to:\", output_file, \"\\n\")\n", - " \n", - " # Calculate total count for each peak (sum across all samples)\n", - " # Get only the numeric columns (exclude the metadata columns we added)\n", - " meta_cols <- c(\"peak_id\", \"chr\", \"start\", \"end\", \"peakwidth\", \"midpoint\")\n", - " count_cols <- setdiff(names(region_filtered), meta_cols)\n", - " \n", - " # Ensure all count columns are numeric\n", - " region_filtered_counts <- region_filtered[, ..count_cols]\n", - " region_filtered_counts <- as.data.frame(apply(region_filtered_counts, 2, as.numeric))\n", - " \n", - " # Calculate total count\n", - " region_filtered$total_count <- rowSums(region_filtered_counts)\n", - " \n", - " # Calculate weighted count (total count / peakwidth)\n", - " region_filtered$weighted_count <- region_filtered$total_count / region_filtered$peakwidth\n", - " \n", - " # Create a summary data frame with just the metadata columns\n", - " summary_df <- data.table(\n", - " peak_id = region_filtered$peak_id,\n", - " chr = region_filtered$chr,\n", - " start = region_filtered$start,\n", - " end = region_filtered$end,\n", - " midpoint = region_filtered$midpoint,\n", - " peakwidth = region_filtered$peakwidth,\n", - " total_count = region_filtered$total_count,\n", - " weighted_count = region_filtered$weighted_count\n", - " )\n", - " \n", - " # Save the summary data\n", - " summary_file <- file.path(peak_output_dir, paste0(celltype,\"_filtered_regions_of_interest_summary.txt\"))\n", - " write.table(summary_df, summary_file, sep=\"\\t\", quote=FALSE, row.names=FALSE)\n", - " cat(\"Saved summary data to:\", summary_file, \"\\n\\n\")\n", - " \n", - " return(TRUE)\n", - "}\n", - "\n", - "# List of cell types to process\n", - "celltypes <- c(\"Mic\", \"Astro\", \"Oligo\", \"Exc\", \"Inh\", \"OPC\")\n", - "\n", - "\n", - "# Process each cell type\n", - "for (ct in celltypes) {\n", - " filter_and_save_by_celltype(ct)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "84c7cc6f-5080-45c1-8204-3b77623a557e", - "metadata": {}, - "source": [ - "## Alternative Pseudobulk Pipeline with Batch Correction\n", - "\n", - "This is an alternative preprocessing approach using 
ComBat-seq for explicit batch correction. It is from a different dataset (multiome) but demonstrates an alternative strategy when batch effects are severe.\n", - "\n", - "---\n", - "\n", - "#### When to Use This Approach:\n", - "- Strong batch effects that need active correction (not just covariate adjustment)\n", - "- Data from multiple sequencing runs with substantial technical artifacts\n", - "- When batch confounds with biological variables of interest\n", - "- Visible batch clusters in PCA/t-SNE plots\n", - "\n", - "---\n", - "\n", - "**Input:**\n", - "- QC'd Seurat object with metadata: `{celltype}_qced.rds`\n", - "- Pseudobulk peak counts: `{celltype}.rds`\n", - "- Sample covariates: `rosmap_cov.txt`\n", - "- Batch information: `SampleSheet.csv` and `sampleSheetAfterQc.csv`\n", - "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n", - "\n", - "**Process:**\n", - "1. Loads Seurat object and extracts metadata\n", - "2. Loads pseudobulk peak count matrix\n", - "3. Calculates technical QC metrics per sample:\n", - " - `TSSEnrichment`: Median TSS enrichment\n", - " - `NucleosomeRatio`: Median nucleosome ratio\n", - " - `LogPercMt`: Log-transformed percent mitochondrial reads\n", - " - `LogUniqueFrags`: Log-transformed unique fragments per sample\n", - "4. Filters blacklisted genomic regions using `foverlaps()`\n", - "5. Calculates peak metrics:\n", - " - `LogTotalUniquePeaks`: Log-transformed count of unique peaks detected\n", - "6. Merges with demographic covariates (msex, age_death, pmi, study)\n", - "7. Creates DGEList object\n", - "8. Applies expression filtering with `filterByExpr()`:\n", - " - `min.count = 5`: Minimum 5 reads in at least one sample\n", - " - `min.total.count = 7`: Minimum 7 total reads across all samples\n", - " - `min.prop = 0.7`: Peak must be expressed in ≥70% of samples\n", - "9. TMM normalization with `calcNormFactors()`\n", - "10. **Batch processing:**\n", - " - Loads sequencing batch information from sample sheets\n", - " - Filters singleton batches (batches with only 1 sample)\n", - " - Filters samples with low library sizes (< 5000 recommended)\n", - "11. **ComBat-seq batch correction:**\n", - " ```r\n", - " adjusted_counts <- ComBat_seq(\n", - " counts = dge$counts, \n", - " batch = batches\n", - " )\n", - " ```\n", - "12. Fits linear model on batch-corrected counts using `voom()` and `lmFit()`:\n", - " ```r\n", - " model <- ~ pmi + msex + age_death + \n", - " TSSEnrichment + NucleosomeRatio + LogPercMt +\n", - " LogUniqueFrags + LogTotalUniquePeaks + \n", - " study\n", - " ```\n", - " Note: Batch is NOT in the model because it was corrected by ComBat-seq\n", - "13. 
Calculates residuals using `predictOffset()`: `offset + residuals`\n", - " - `offset`: Predicted expression at median/reference covariate values\n", - " - `residuals`: Unexplained variation after removing covariate effects\n", - "\n", - "However, ComBat-seq encountered persistent errors with this dataset:\n", - "```\n", - "Error in .compressOffsets(y, lib.size = lib.size, offset = offset):\n", - "offsets must be finite values\n", - "```\n", - "\n", - "**Issues with ComBat-seq for this data:**\n", - "- Dataset had 232 samples across 60 batches (many small batches)\n", - "- Error persisted even after:\n", - " - Filtering samples with low library sizes (< 5000)\n", - " - Removing singleton batches\n", - " - Ensuring all counts and library sizes were finite\n", - " - Verifying no zero-sum peaks\n", - "- Likely due to internal ComBat-seq edge case with highly fragmented batch structure\n", - "\n", - "**Solution:** Use limma's `removeBatchEffect` which operates on log-CPM values and is more robust to small batch sizes.\n", - "\n", - "**Process:**\n", - "1. Loads Seurat object and extracts metadata\n", - "2. Loads pseudobulk peak count matrix\n", - "3. Calculates technical QC metrics per sample:\n", - " - `TSSEnrichment`: Median TSS enrichment\n", - " - `NucleosomeRatio`: Median nucleosome ratio\n", - " - `LogPercMt`: Log-transformed percent mitochondrial reads\n", - " - `LogUniqueFrags`: Log-transformed unique fragments per sample\n", - "4. Filters blacklisted genomic regions using `foverlaps()`\n", - "5. Calculates peak metrics:\n", - " - `LogTotalUniquePeaks`: Log-transformed count of unique peaks detected\n", - "6. Merges with demographic covariates (msex, age_death, pmi, study)\n", - "7. Creates DGEList object\n", - "8. Applies expression filtering with `filterByExpr()`:\n", - " - `min.count = 5`: Minimum 5 reads in at least one sample\n", - " - `min.total.count = 7`: Minimum 7 total reads across all samples\n", - " - `min.prop = 0.7`: Peak must be expressed in ≥70% of samples\n", - "9. TMM normalization with `calcNormFactors()`\n", - "10. **Batch processing:**\n", - " - Loads sequencing batch information from sample sheets\n", - " - Filters singleton batches (batches with only 1 sample)\n", - "11. **Batch correction using limma's removeBatchEffect:**\n", - " ```r\n", - " # Get log-CPM values\n", - " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - " \n", - " # Remove batch effects\n", - " adjusted_logCPM <- removeBatchEffect(\n", - " logCPM,\n", - " batch = batches,\n", - " design = model.matrix(~1, data=dge$samples)\n", - " )\n", - " \n", - " # Convert back to counts scale (approximate)\n", - " adjusted_counts <- 2^adjusted_logCPM * mean(dge$$samples$$lib.size) / 1e6\n", - " adjusted_counts <- round(adjusted_counts)\n", - " adjusted_counts[adjusted_counts < 0] <- 0\n", - " ```\n", - "12. Updates sample alignment:\n", - " - Ensures valid_samples match current filtered data\n", - " - Aligns covariates with sample order\n", - " - Converts tibble to data.frame and sets rownames\n", - "13. Fits linear model on batch-corrected counts using `voom()` and `lmFit()`:\n", - " ```r\n", - " model <- ~ pmi + msex + age_death + TSSEnrichment + NucleosomeRatio + LogPercMt + LogUniqueFrags + LogTotalUniquePeaks + study\n", - " ```\n", - " Note: Batch is NOT in the model because it was corrected by removeBatchEffect\n", - "14. Creates new DGEList with batch-corrected counts\n", - "15. Recalculates library sizes and TMM normalization factors\n", - "16. 
Calculates residuals using `predictOffset()`: `offset + residuals`\n", - " - `offset`: Predicted expression at median/reference covariate values\n", - " - `residuals`: Unexplained variation after removing covariate effects\n", - "\n", - "\n", - "**Output:** `output/3_calculateResiduals/{celltype})`\n", - "- `{celltype}_results.rds`: Complete results object containing:\n", - " - `dge`: Batch-corrected DGEList\n", - " - `offset`: Predicted offset values\n", - " - `residuals`: Model residuals\n", - " - `batch_adjusted_counts`: removeBatchEffect corrected counts\n", - " - `final_data`: Final adjusted expression (offset + residuals)\n", - " - `valid_samples`: Sample IDs after filtering\n", - " - `design`: Design matrix\n", - " - `fit`: Linear model fit object\n", - "- `{celltype}_residuals.txt`: Final covariate-adjusted peak accessibility (log2-CPM scale)\n", - "\n", - "\n", - "**Key Differences from ComBat-seq:**\n", - "- Operates on log-CPM values (not integer counts)\n", - "- More robust to small/unbalanced batch sizes\n", - "- Does not model mean-variance relationship (simpler correction)\n", - "- Approximate back-transformation to count scale\n" - ] - }, - { - "cell_type": "markdown", - "id": "fb2d3390-250e-43dc-848d-a02bcea6bbee", - "metadata": {}, - "source": [ - "#### Load libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "id": "780bafdb-0057-4b18-b5a7-7c7ea3450926", - "metadata": {}, - "outputs": [], - "source": [ - "library(data.table)\n", - "library(stringr)\n", - "library(Seurat)\n", - "library(dplyr)\n", - "library(sva)\n", - "library(edgeR)\n", - "library(limma)" - ] - }, - { - "cell_type": "markdown", - "id": "5c26a146-535e-4123-aca0-f41e6f3f5a0b", - "metadata": {}, - "source": [ - "#### Create output directory" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "id": "52e0c2f0-73da-4cd5-b43d-b744e4b0d726", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Processing celltype: Astro \n" - ] - }, - { - "data": { - "text/html": [ - "'/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro'" - ], - "text/latex": [ - "'/restricted/projectnb/xqtl/jaempawi/atac\\_seq/output/kellis/2\\_residuals\\_batch\\_corrected/Astro'" - ], - "text/markdown": [ - "'/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro'" - ], - "text/plain": [ - "[1] \"/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro\"" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "cat(\"Processing celltype:\", celltype, \"\\n\")\n", - "\n", - "residual_out_dir <- file.path(output_dir,\"2_residuals_batch_corrected\", celltype)\n", - "dir.create(residual_out_dir, recursive = TRUE, showWarnings = FALSE)\n", - "residual_out_dir" - ] - }, - { - "cell_type": "markdown", - "id": "9990a617-310c-47aa-9895-712db99b766f", - "metadata": {}, - "source": [ - "#### Create predictOffset function " - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "id": "f0a1ed02-4744-422e-9830-90886cb9ec04", - "metadata": {}, - "outputs": [], - "source": [ - "predictOffset <- function(fit) {\n", - " # Define which variables are factors and which are continuous\n", - " usedFactors <- c(\"study\") \n", - " usedContinuous <- c(\"pmi\", \"msex\", \"age_death\", \n", - " \"TSSEnrichment\", \"NucleosomeRatio\", \"LogPercMt\",\n", - " \"LogUniqueFrags\", \"LogTotalUniquePeaks\")\n", - " \n", - " # Get indices for factor 
and continuous variables\n", - " facInd <- unlist(lapply(as.list(usedFactors), \n", - " function(f) {return(grep(paste(\"^\", f, sep=\"\"), \n", - " colnames(fit$design)))}))\n", - " contInd <- unlist(lapply(as.list(usedContinuous), \n", - " function(f) {return(grep(paste(\"^\", f, sep=\"\"), \n", - " colnames(fit$design)))}))\n", - " \n", - " # Verify design matrix structure\n", - " stopifnot(!any(duplicated(c(1, facInd, contInd))))\n", - " stopifnot(all(c(1, facInd, contInd) %in% 1:ncol(fit$design)))\n", - " stopifnot(1:ncol(fit$design) %in% c(1, facInd, contInd))\n", - " \n", - " # Create new design matrix with median values\n", - " D <- fit$design\n", - " D[, facInd] <- 0\n", - " medContVals <- apply(D[, contInd], 2, median)\n", - " for (i in 1:length(medContVals)) {\n", - " D[, names(medContVals)[i]] <- medContVals[i]\n", - " }\n", - " \n", - " # Calculate offsets\n", - " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n", - " offsets <- apply(coefficients(fit), 1, function(c) {\n", - " return(D %*% c)\n", - " })\n", - " offsets <- t(offsets)\n", - " colnames(offsets) <- rownames(fit$design)\n", - " \n", - " return(offsets)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "af303e3f-b918-4cc4-8155-1f7974b48cde", - "metadata": {}, - "source": [ - "#### Load input data" - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "id": "e00e96b8-9e1f-4387-8a3c-fcca2ae2d342", - "metadata": {}, - "outputs": [], - "source": [ - "meta_data = readRDS(file.path(input_dir,\"Endothelial_qced.rds\"))\n", - "meta = meta_data@meta.data\n", - "peak <- readRDS(file.path(input_dir,'Endothelial.rds'))" - ] - }, - { - "cell_type": "markdown", - "id": "d746195e-ca0a-4695-a42c-64bf9c677c85", - "metadata": {}, - "source": [ - "#### Process technical variables" - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "id": "bb915f93-9088-4db6-b663-dc893e25fe85", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - "\n", - "\t\n", - "\t\n", - "\n", - "\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\t\n", - "\n", - "
\n" - ], - "text/latex": [ - "A tibble: 6 × 7\n", - "\\begin{tabular}{lllllll}\n", - " demuxlet\\_SNG.BEST.GUESS & TSSEnrichment & NucleosomeRatio & PercMt & UniqueFrags & LogPercMt & LogUniqueFrags\\\\\n", - " & & & & & & \\\\\n", - "\\hline\n", - "\t MAP26637867 & 7.315 & 1.4334204 & 0.5755396 & 87 & -0.5524456 & 4.465908\\\\\n", - "\t MAP50106992 & 6.237 & 1.5703523 & 1.7753647 & 19 & 0.5740064 & 2.944439\\\\\n", - "\t MAP61344957 & 14.587 & 0.7494390 & 0.2781486 & 7 & -1.2795960 & 1.945910\\\\\n", - "\t ROS11430815 & 6.606 & 1.4644619 & 0.2029770 & 9 & -1.5946577 & 2.197225\\\\\n", - "\t ROS15738428 & 12.620 & 0.9908817 & 0.1889933 & 32 & -1.6660383 & 3.465736\\\\\n", - "\t ROS20945666 & 7.609 & 1.6842417 & 0.3885004 & 65 & -0.9454585 & 4.174387\\\\\n", - "\\end{tabular}\n" - ], - "text/markdown": [ - "\n", - "A tibble: 6 × 7\n", - "\n", - "| demuxlet_SNG.BEST.GUESS <chr> | TSSEnrichment <dbl> | NucleosomeRatio <dbl> | PercMt <dbl> | UniqueFrags <int> | LogPercMt <dbl> | LogUniqueFrags <dbl> |\n", - "|---|---|---|---|---|---|---|\n", - "| MAP26637867 | 7.315 | 1.4334204 | 0.5755396 | 87 | -0.5524456 | 4.465908 |\n", - "| MAP50106992 | 6.237 | 1.5703523 | 1.7753647 | 19 | 0.5740064 | 2.944439 |\n", - "| MAP61344957 | 14.587 | 0.7494390 | 0.2781486 | 7 | -1.2795960 | 1.945910 |\n", - "| ROS11430815 | 6.606 | 1.4644619 | 0.2029770 | 9 | -1.5946577 | 2.197225 |\n", - "| ROS15738428 | 12.620 | 0.9908817 | 0.1889933 | 32 | -1.6660383 | 3.465736 |\n", - "| ROS20945666 | 7.609 | 1.6842417 | 0.3885004 | 65 | -0.9454585 | 4.174387 |\n", - "\n" - ], - "text/plain": [ - " demuxlet_SNG.BEST.GUESS TSSEnrichment NucleosomeRatio PercMt UniqueFrags\n", - "1 MAP26637867 7.315 1.4334204 0.5755396 87 \n", - "2 MAP50106992 6.237 1.5703523 1.7753647 19 \n", - "3 MAP61344957 14.587 0.7494390 0.2781486 7 \n", - "4 ROS11430815 6.606 1.4644619 0.2029770 9 \n", - "5 ROS15738428 12.620 0.9908817 0.1889933 32 \n", - "6 ROS20945666 7.609 1.6842417 0.3885004 65 \n", - " LogPercMt LogUniqueFrags\n", - "1 -0.5524456 4.465908 \n", - "2 0.5740064 2.944439 \n", - "3 -1.2795960 1.945910 \n", - "4 -1.5946577 2.197225 \n", - "5 -1.6660383 3.465736 \n", - "6 -0.9454585 4.174387 " - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "tech_vars <- meta %>%\n", - " group_by(demuxlet_SNG.BEST.GUESS) %>%\n", - " summarise(\n", - " TSSEnrichment = median(TSSEnrichment),\n", - " NucleosomeRatio = median(NucleosomeRatio),\n", - " PercMt = median(percent.mt),\n", - " UniqueFrags = n_distinct(demuxlet_BARCODE)\n", - " ) %>%\n", - " mutate(\n", - " LogPercMt = log(PercMt + 1e-6),\n", - " LogUniqueFrags = log(UniqueFrags + 1e-6)\n", - " )\n", - "head(tech_vars)" - ] - }, - { - "cell_type": "markdown", - "id": "bdc81763-f662-4f49-a154-7002da0953dd", - "metadata": {}, - "source": [ - "#### Process peaks " - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "id": "1ba88004-7088-4b2d-b169-3fc6f424a540", - "metadata": {}, - "outputs": [], - "source": [ - "# Load blacklist\n", - "blacklist_df <- fread(file.path(input_dir,\"hg38-blacklist.v2.bed.gz\"))\n", - "colnames(blacklist_df) <- c(\"chr\", \"start\", \"end\", \"label\")\n", - "\n", - "# Process peak coordinates\n", - "peak_df <- data.table(\n", - " peak_name = rownames(peak),\n", - " chr = str_extract(rownames(peak), \"chr[0-9XY]+\"),\n", - " start = as.integer(str_extract(rownames(peak), \"(?<=:)[0-9]+\")),\n", - " end = as.integer(str_extract(rownames(peak), \"(?<=-)[0-9]+\"))\n", - ")\n", - "\n", - "# Filter blacklisted peaks\n", - 
"setkey(blacklist_df, chr, start, end)\n", - "setkey(peak_df, chr, start, end)\n", - "overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n", - "blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n", - "filtered_peak <- peak[!rownames(peak) %in% blacklisted_peaks,]\n", - "\n", - "# Calculate peak metrics\n", - "peak_metrics <- data.frame(\n", - " sample = colnames(filtered_peak),\n", - " TotalUniquePeaks = colSums(filtered_peak > 0)\n", - ") %>%\n", - " mutate(LogTotalUniquePeaks = log(TotalUniquePeaks))" - ] - }, - { - "cell_type": "markdown", - "id": "087e03aa-3830-4b0b-a291-2e0447b36d99", - "metadata": {}, - "source": [ - "#### Load and merge covariates" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "id": "970879a5-7964-468d-9b53-7fbcc88ac24c", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of samples after joining: 233 \n", - "Sample IDs: MAP26637867, MAP50106992, MAP61344957, ROS11430815, ROS15738428, ROS20945666 ...\n" - ] - } - ], - "source": [ - "covariates <- fread(file.path(input_dir,'rosmap_cov.txt')) %>%\n", - " select('#id', msex, age_death, pmi, study)\n", - "\n", - "all_covs <- tech_vars %>%\n", - " inner_join(peak_metrics, by = c(\"demuxlet_SNG.BEST.GUESS\" = \"sample\")) %>%\n", - " inner_join(covariates, by = c(\"demuxlet_SNG.BEST.GUESS\" = \"#id\"))\n", - "\n", - "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n", - "cat(\"Sample IDs:\", paste(head(all_covs$demuxlet_SNG.BEST.GUESS), collapse=\", \"), \"...\\n\")\n", - "\n", - "# Impute missing values\n", - "for(col in c(\"pmi\", \"age_death\")) {\n", - " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n", - "}" - ] - }, - { - "cell_type": "markdown", - "id": "44c51fa4-c3e4-4e1c-84f4-599d8d1bf156", - "metadata": {}, - "source": [ - "#### Create DGE object" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "id": "e6608fc5-a07c-47aa-8996-96996f9297ce", - "metadata": {}, - "outputs": [], - "source": [ - "dge <- DGEList(\n", - " counts = filtered_peak,\n", - " samples = all_covs\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "ec4756f2-9410-4529-81f4-bca4e5ae510c", - "metadata": {}, - "source": [ - "#### Filter low counts and normalize" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "id": "d9ba397d-049e-4d80-bf6f-984ee8b50724", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks before filtering: 130930 \n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Warning message in filterByExpr.DGEList(dge, min.count = 5, min.total.count = 15, :\n", - "“All samples appear to belong to the same group.”\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Number of peaks after filtering: 21197 \n" - ] - } - ], - "source": [ - "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n", - "\n", - "# keep <- filterByExpr(dge) #only 2 peaks left in mic\n", - "# default paramter:\n", - "# keep <- filterByExpr(y, \n", - "# min.count = 10, # for one sample, min reads \n", - "# min.total.count = 15, # min reads overall\n", - "# min.prop = 0.7) \n", - "\n", - "keep <- filterByExpr(dge, \n", - " min.count = 5, # for one sample, min reads \n", - " min.total.count = 15, # min reads overall\n", - " min.prop = 0.1,\n", - " group = NULL) \n", - "\n", - "dge <- dge[keep, , keep.lib.sizes=TRUE] #mic: from 130930 to 2\n", - "cat(\"Number of peaks after 
filtering:\", nrow(dge), \"\\n\")\n", - "dge <- calcNormFactors(dge, method=\"TMM\")" - ] - }, - { - "cell_type": "markdown", - "id": "ce856212-efb8-4a52-b16f-fb178d39b47b", - "metadata": {}, - "source": [ - "#### Load batch information" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "id": "7a2bcb0a-9e56-4cab-9439-259cb880a66f", - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\u001b[1m\u001b[22mJoining with `by = join_by(ProjID)`\n" - ] - } - ], - "source": [ - "sample_file <- file.path(input_dir,\"SampleSheet.csv\")\n", - "wgs_qc_file <- file.path(input_dir,\"sampleSheetAfterQc.csv\")\n", - "\n", - "sample <- fread(sample_file, colClasses = \"character\")\n", - "wgs_qc <- fread(wgs_qc_file, colClasses = \"character\")\n", - "sample <- sample %>%\n", - " inner_join(wgs_qc) %>%\n", - " select(SequencingID, SampleID)\n", - "\n", - "# Extract batch information\n", - "batches <- sample$SequencingID\n", - "names(batches) <- sample$SampleID\n", - "\n", - "valid_samples <- colnames(dge$counts)\n", - "batches <- batches[valid_samples]" - ] - }, - { - "cell_type": "markdown", - "id": "42e9ccbe-c1c5-4f29-9f7e-bd9fa967e367", - "metadata": {}, - "source": [ - "#### Run ComBat-seq" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "id": "d278d1bf-b8d9-47af-8cc3-8d0fb7d8eebe", - "metadata": {}, - "outputs": [ - { - "ename": "ERROR", - "evalue": "Error: all(colnames(dge$counts) == names(batches)) is not TRUE\n", - "output_type": "error", - "traceback": [ - "Error: all(colnames(dge$counts) == names(batches)) is not TRUE\nTraceback:\n", - "1. stop(simpleError(msg, call = if (p <- sys.parent(1L)) sys.call(p)))" - ] - } - ], - "source": [ - "# Filter batches with only one sample\n", - "batch_counts <- table(batches)\n", - "valid_batches <- names(batch_counts[batch_counts > 1])\n", - "batches <- batches[batches %in% valid_batches]\n", - "valid_samples <- names(batches)\n", - "\n", - "keep <- colnames(dge$counts) %in% names(batches)\n", - "dge <- dge[keep, , keep.lib.sizes=TRUE]\n", - "batches <- batches[colnames(dge$counts)]\n", - "stopifnot(all(colnames(dge$counts) == names(batches)))\n", - "\n", - "cat(\"Number of samples after batch filtering:\", length(valid_samples), \"\\n\")\n", - "cat(\"Number of batches:\", length(unique(batches)), \"\\n\")\n", - "\n", - "# Run ComBat-seq\n", - "adjusted_counts <- ComBat_seq(\n", - " counts = dge$counts, \n", - " batch = batches\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "0e219247-72d4-44e5-bd5c-4dffe90537e9", - "metadata": {}, - "source": [ - "#### Create model and run voom" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "id": "fef7c108-0ca0-42ca-b6b8-4ca3410bb507", - "metadata": {}, - "outputs": [ - { - "ename": "ERROR", - "evalue": "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels\n", - "output_type": "error", - "traceback": [ - "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels\nTraceback:\n", - "1. model.matrix.default(model, data = all_covs[valid_samples, ])", - "2. `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])", - "3. stop(\"contrasts can be applied only to factors with 2 or more levels\")", - "4. .handleSimpleError(function (cnd) \n . {\n . watcher$capture_plot_and_output()\n . cnd <- sanitize_call(cnd)\n . watcher$push(cnd)\n . 
switch(on_error, continue = invokeRestart(\"eval_continue\"), \n . stop = invokeRestart(\"eval_stop\"), error = NULL)\n . }, \"contrasts can be applied only to factors with 2 or more levels\", \n . base::quote(`contrasts<-`(`*tmp*`, value = contr.funs[1 + \n . isOF[nn]])))" - ] - } - ], - "source": [ - "model <- ~ pmi + msex + age_death + \n", - " TSSEnrichment + NucleosomeRatio + LogPercMt +\n", - " LogUniqueFrags + LogTotalUniquePeaks + \n", - " study\n", - "\n", - "# Update design matrix for remaining samples\n", - "design <- model.matrix(model, data=all_covs[valid_samples,])\n", - "stopifnot(is.fullrank(design))\n", - "\n", - "dge_adjusted <- dge[, valid_samples] \n", - "dge_adjusted$counts <- adjusted_counts[, valid_samples] \n", - "\n", - "# Run voom and fit model\n", - "v <- voom(dge_adjusted[, valid_samples], design, plot=FALSE)\n", - "fit <- lmFit(v, design)\n", - "fit <- eBayes(fit)\n", - "\n", - "# Calculate offset and residuals\n", - "offset <- predictOffset(fit)\n", - "resids <- residuals(fit, y=v)\n", - "\n", - "# Verify dimensions\n", - "stopifnot(all(rownames(offset) == rownames(resids)) &\n", - " all(colnames(offset) == colnames(resids)))\n", - "\n", - "# Final adjusted data\n", - "stopifnot(all(dim(offset) == dim(resids)))\n", - "stopifnot(all(colnames(offset) == colnames(resids)))\n", - "\n", - "final_data <- offset + resids" - ] - }, - { - "cell_type": "markdown", - "id": "7ba1b486-bfcf-4488-90b8-252d94d256a2", - "metadata": {}, - "source": [ - "#### Run LIMMA as Combat-seq alternative " - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "id": "dbbe0cd9-9375-40d4-8d44-1d6556d12dfa", - "metadata": {}, - "outputs": [], - "source": [ - "# Alternative: Use limma's removeBatchEffect instead\n", - "# Get log-CPM values\n", - "logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - "\n", - "# Remove batch effects\n", - "adjusted_logCPM <- removeBatchEffect(\n", - " logCPM,\n", - " batch = batches,\n", - " design = model.matrix(~1, data=dge$samples)\n", - ")\n", - "\n", - "# Convert back to counts scale (approximate)\n", - "adjusted_counts <- 2^adjusted_logCPM * mean(dge$samples$lib.size) / 1e6\n", - "adjusted_counts <- round(adjusted_counts)\n", - "adjusted_counts[adjusted_counts < 0] <- 0" - ] - }, - { - "cell_type": "markdown", - "id": "0a8e252b-bf0d-467f-9990-3fc4b0b2f99c", - "metadata": {}, - "source": [ - "#### Create model and run voom" - ] - }, - { - "cell_type": "code", - "execution_count": 57, - "id": "56c20501-f7db-43b2-bd7d-47d32cc43d6c", - "metadata": {}, - "outputs": [], - "source": [ - "# Update valid_samples to match current data\n", - "valid_samples <- colnames(dge)\n", - "\n", - "# Get aligned covariates\n", - "filtered_covs <- all_covs[match(valid_samples, all_covs$demuxlet_SNG.BEST.GUESS), ]\n", - "filtered_covs <- as.data.frame(filtered_covs) # Convert from tibble\n", - "rownames(filtered_covs) <- valid_samples\n", - "\n", - "\n", - "# Build model formula\n", - "model_formula <- ~ pmi + msex + age_death + \n", - " TSSEnrichment + NucleosomeRatio + LogPercMt +\n", - " LogUniqueFrags + LogTotalUniquePeaks + \n", - " study\n", - "\n", - "# Create design matrix\n", - "design <- model.matrix(model_formula, data=filtered_covs)\n", - "rownames(design) <- valid_samples\n", - "\n", - "stopifnot(is.fullrank(design))\n", - "stopifnot(all(rownames(design) == colnames(dge)))\n", - "\n", - "# Create properly formatted DGEList with adjusted counts\n", - "dge_adjusted <- DGEList(\n", - " counts = adjusted_counts,\n", - " samples = filtered_covs\n", - ")\n", - "\n", - 
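"# Note: adjusted_counts were back-transformed from batch-corrected log-CPM above, so they\n", - "# are approximate counts; library sizes and TMM factors are therefore recomputed below.\n", - 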
"# Recalculate library sizes and normalization factors\n", - "dge_adjusted$samples$lib.size <- colSums(dge_adjusted$counts)\n", - "dge_adjusted <- calcNormFactors(dge_adjusted, method=\"TMM\")\n", - "\n", - "stopifnot(all(rownames(design) == colnames(dge_adjusted)))\n", - "\n", - "# Run voom and fit model\n", - "v <- voom(dge_adjusted, design, plot=FALSE)\n", - "fit <- lmFit(v, design)\n", - "fit <- eBayes(fit)\n", - "\n", - "# Calculate offset and residuals\n", - "offset <- predictOffset(fit)\n", - "resids <- residuals(fit, y=v)\n", - "\n", - "# Final adjusted data\n", - "final_data <- offset + resids\n" - ] - }, - { - "cell_type": "markdown", - "id": "c0093455-7ff7-47cb-9e31-baf913bbb4cd", - "metadata": {}, - "source": [ - "#### Save results" - ] - }, - { - "cell_type": "code", - "execution_count": 58, - "id": "10cce409-a132-41a6-b294-877e55e136c5", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Results saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro \n" - ] - } - ], - "source": [ - "saveRDS(list(\n", - " dge = dge_adjusted,\n", - " offset = offset,\n", - " residuals = resids,\n", - " batch_adjusted_counts = adjusted_counts,\n", - " final_data = final_data,\n", - " valid_samples = valid_samples,\n", - " design = design,\n", - " fit = fit\n", - "), file = file.path(residual_out_dir, paste0(celltype, \"_results.rds\")))\n", - "\n", - "# Write final residual data to file\n", - "write.table(final_data,\n", - " file = file.path(residual_out_dir, paste0(celltype, \"_residuals.txt\")), \n", - " quote=FALSE)\n", - "\n", - "cat(\"Results saved to:\", residual_out_dir, \"\\n\") " - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.4.3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 361690ed6fe202a829036f96f2fbdabd5fbfd61a Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 24 Feb 2026 18:40:46 -0500 Subject: [PATCH 06/12] Delete code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb --- .../QC/snatacseq_preprocessing.ipynb | 1453 ----------------- 1 file changed, 1453 deletions(-) delete mode 100644 code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb deleted file mode 100644 index b2b5acb6a..000000000 --- a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb +++ /dev/null @@ -1,1453 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Single-nucleus ATAC-seq Preprocessing Pipeline\n", - "\n", - "## Overview\n", - "\n", - "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) pseudobulk peak count data\n", - "for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies.\n", - "\n", - "**Goals:**\n", - "- Transform raw pseudobulk peak counts into analysis-ready formats\n", - "- Remove technical confounders while optionally preserving biological covariates\n", - "- Generate QTL-ready phenotype files or region-specific datasets\n", - "\n", - "## Pipeline Structure\n", - "```\n", - "Step 0: Sample ID Mapping\n", - "↓\n", - "Step 1: 
Pseudobulk QC\n", - "├── Option A: BIOvar (regress out technical + biological covariates)\n", - "└── Option B: noBIOvar (regress out technical covariates only)\n", - "↓ (optional)\n", - "Batch Correction (ComBat-seq or limma::removeBatchEffect)\n", - "↓\n", - "Step 2: Format Output\n", - "├── Format A: Phenotype Reformatting → BED (genome-wide caQTL mapping)\n", - "└── Format B: Region Peak Filtering → TSV (locus-specific analysis)\n", - "\n", - "```\n", - "\n", - "## Input Files\n", - "\n", - "All input files required to run this pipeline can be downloaded\n", - "[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n", - "\n", - "| File | Used in |\n", - "|------|---------|\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | Step 0, Step 1 |\n", - "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", - "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", - "| `rosmap_cov.txt` | Step 1 |\n", - "| `hg38-blacklist.v2.bed.gz` | Step 1 |\n", - "| `SampleSheet.csv` | Step 1 (batch correction only) |\n", - "| `sampleSheetAfterQc.csv` | Step 1 (batch correction only) |\n", - "\n", - "\n", - "## Minimal Working Example" - ] - }, - { - "cell_type": "markdown", - "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 0: Sample ID Mapping\n", - "\n", - "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", - "across metadata and count matrix files.\n", - "\n", - "### Input\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", - "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Per-cell-type peak count matrices |\n", - "\n", - "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n", - "\n", - "### Process\n", - "\n", - "**Part 1 — Metadata files**\n", - "\n", - "For each `metadata_{celltype}.csv`:\n", - "1. Look up each `individualID` in the mapping reference\n", - "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", - "3. Insert `sampleid` as the first column\n", - "4. Save updated file\n", - "\n", - "**Part 2 — Count matrix files**\n", - "\n", - "For each `pseudobulk_peaks_counts_{celltype}.csv.gz`:\n", - "1. Extract the header row (column names only)\n", - "2. Keep `peak_id` (first column) unchanged\n", - "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n", - " otherwise keep original\n", - "4. Write new header and stream data rows unchanged\n", - "5. 
Recompress with gzip\n", - "\n", - "### Output\n", - "\n", - "Output directory: `output/1_files_with_sampleid/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Count matrices with mapped column headers |\n", - "\n", - "**Timing:** < 1 min\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", - " --cwd output/atac_seq/1_files_with_sampleid \\\n", - " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", - " --input_dir data/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/1_files_with_sampleid \\\n", - " --celltype Ast Ex In Microglia Oligo OPC\n", - "\n", - "\n", - "# For MIT input data\n", - "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n", - " --cwd output/atac_seq/1_files_with_sampleid \\\n", - " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", - " --input_dir data/atac_seq/1_files_with_sampleid_MIT \\\n", - " --output_dir output/atac_seq/1_files_with_sampleid \\\n", - " --celltype Astro Exc Inh Mic Oligo OPC \\\n", - " --suffix _50nuc" - ] - }, - { - "cell_type": "markdown", - "id": "5540a4da-843a-4789-8123-47911cf519c5", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 1: Pseudobulk QC\n", - "\n", - "Two approaches are available depending on whether biological covariates should be regressed out.\n", - "Both options support an **optional batch correction** step after filtering and normalization.\n", - "\n", - "\n", - "### Option A: With Biological Covariates (BIOvar)\n", - "\n", - "Use when residuals should be adjusted for all technical **and** biological covariates (sex, age, PMI).\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | `1_files_with_sampleid/` |\n", - "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n", - "| `rosmap_cov.txt` | `data/` |\n", - "| `hg38-blacklist.v2.bed.gz` | `data/` |\n", - "| `SampleSheet.csv` *(batch correction only)* | `data/` |\n", - "| `sampleSheetAfterQc.csv` *(batch correction only)* | `data/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Load pseudobulk peak count matrix and metadata per cell type\n", - "2. Filter samples with fewer than 20 nuclei\n", - "3. Calculate technical QC metrics per sample:\n", - " - `log_n_nuclei`: log-transformed nuclei count\n", - " - `med_nucleosome_signal`: median nucleosome signal\n", - " - `med_tss_enrich`: median TSS enrichment score\n", - " - `log_med_n_tot_fragment`: log-transformed median total fragments\n", - " - `log_total_unique_peaks`: log-transformed unique peak count\n", - "4. Filter blacklisted genomic regions\n", - "5. Merge with demographic covariates (`msex`, `age_death`, `pmi`, `study`)\n", - "6. Apply expression filtering (`filterByExpr`):\n", - " - `min_count = 5`: minimum reads in at least one sample\n", - " - `min_total_count = 15`: minimum total reads across all samples\n", - " - `min_prop = 0.1`: peak expressed in ≥10% of samples\n", - "7. TMM normalization\n", - "8. *(Optional)* Batch correction — see [Batch Correction](#batch-correction-optional) below\n", - "9. 
Fit linear model (`voom` + `lmFit`):\n", - "\n", - "```\n", - "Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch + msex + age_death + pmi + study\n", - "```\n", - "\n", - " > If batch correction was applied, `sequencingBatch` is removed from the model.\n", - "10. Compute residuals adjusted for all covariates\n", - "11. Compute final adjusted values: `offset + residuals`\n", - " - `offset`: predicted expression at median/reference covariate values\n", - " - `residuals`: unexplained variation after removing all covariate effects\n", - "\n", - "**Output:** `output/2_residuals/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n", - "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n", - "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", - "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n", - "\n", - "**Covariates regressed out:**\n", - "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n", - "- Biological: sex (`msex`), age at death (`age_death`), post-mortem interval (`pmi`), study cohort\n", - "\n", - "**Timing:** <5 min per celltype" - ] - }, - { - "cell_type": "markdown", - "id": "21f80085-6d2c-4e1c-af35-454382d94de1", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC with BIOvar" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8569d816-d292-4512-85b6-fcd3ea1c9ba7", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio TRUE \\\n", - " --batch_correction FALSE \\\n", - " --min_count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "d8270ee1-1f9b-439c-969b-ac20af6fadee", - "metadata": {}, - "source": [ - "### Option B: Without Biological Covariates (noBIOvar)\n", - "\n", - "Use when biological variation should be preserved (e.g., age/sex comparisons, region-specific analyses).\n", - "\n", - "**Input:** Same as Option A.\n", - "\n", - "**Process:**\n", - "\n", - "Steps 1–8 are identical to Option A. 
Key differences at the modelling stage:\n", - "- `msex` and `age_death` are **excluded** from the model\n", - "- `med_peakwidth` (weighted median peak width per sample) is added as a technical covariate\n", - "\n", - "**Model formula:**\n", - "```\n", - "Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch + pmi + study\n", - "```\n", - "\n", - "**Output:** `output/2_residuals/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n", - "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n", - "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", - "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n", - "\n", - "**Variables deliberately NOT regressed out:**\n", - "- Sex (`msex`)\n", - "- Age at death (`age_death`)\n", - "\n", - "**Timing:** <5 min per celltype" - ] - }, - { - "cell_type": "markdown", - "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC noBIOvar " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio FALSE \\\n", - " --batch_correction FALSE \\\n", - " --min_count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "25e96ad2-1b75-43d0-978e-0757bc11f135", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Batch Correction (Optional)\n", - "\n", - "Applies to both Option A and Option B. 
Runs between TMM normalization and model fitting.\n", - "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n", - "\n", - "> When batch correction is applied, `sequencingBatch` is **removed** from the model formula\n", - "> since batch variance has already been removed from the counts.\n", - "\n", - "**Method comparison:**\n", - "\n", - "| | ComBat-seq | limma `removeBatchEffect` |\n", - "|---|---|---|\n", - "| **Operates on** | Raw integer counts | log-CPM values |\n", - "| **Mean-variance modelling** | Yes | No |\n", - "| **Best for** | Large, balanced batches | Small or fragmented batches |\n", - "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n", - "\n", - "**ComBat-seq:**\n", - "```r\n", - "adjusted_counts <- ComBat_seq(counts = dge$counts, batch = batches)\n", - "```\n", - "\n", - "**limma `removeBatchEffect`:**\n", - "```r\n", - "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n", - "adj_logCPM <- removeBatchEffect(logCPM, batch = batches, design = model.matrix(~1, data = dge$samples))\n", - "adjusted_counts <- round(pmax(2^adj_logCPM * mean(dge$samples$lib.size) / 1e6, 0))\n", - "```\n", - "\n", - "**Additional filtering applied before correction:**\n", - "- Singleton batches (only 1 sample) are removed\n", - "- Samples absent from the batch sheet are dropped\n", - "\n", - "**Additional output when batch correction is enabled:**\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_results.rds` | Includes `batch_adjusted_counts` and `batch_method` fields |\n"\n", - ] - }, - { - "cell_type": "markdown", - "id": "4d582c85-2265-46ee-8080-0ec5d8423a1d", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC with BIOvar & with batch correction" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d3676870-496d-4379-8d6b-acec08f1c0d7", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio TRUE \\\n", - " --batch_correction TRUE \\\n", - " --batch_method limma \\\n", - " --min_count 2 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "9bad900d-768d-45ee-815a-6847e8eba32e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC noBIOvar & with batch correction" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n", - " --output_dir output/atac_seq/2_residuals \\\n", - " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates_file data/atac_seq/rosmap_cov.txt \\\n", - " --include_bio FALSE \\\n", - " --batch_correction TRUE \\\n", - " --batch_method limma \\\n", - " --min_count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "096f2b32-e80d-472b-9af8-5f3d4ebb9bf2", - "metadata": { - "kernel": "SoS" - }, - "source": [ - 
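"If batch correction was enabled, a quick PCA comparison is a useful sanity check that batch clusters shrink after correction. This is a minimal optional sketch, not part of the pipeline commands above; it assumes the objects `logCPM`, `adjusted_logCPM`, and `batches` created in the interactive batch-correction cells earlier in this notebook:\n", - "\n", - "```r\n", - "# Hypothetical check: compare sample clustering before and after limma::removeBatchEffect\n", - "pca_raw <- prcomp(t(logCPM))\n", - "pca_adj <- prcomp(t(adjusted_logCPM))\n", - "par(mfrow = c(1, 2))\n", - "plot(pca_raw$x[, 1:2], col = factor(batches), pch = 19, main = \"Before correction\")\n", - "plot(pca_adj$x[, 1:2], col = factor(batches), pch = 19, main = \"After correction\")\n", - "```\n", - "\n", - 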
"**Note**\n", - "For MIT data, add these parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ee860bb3-d628-4255-b222-f62b3c03a91a", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "--celltype Astro Exc Inh Mic Oligo OPC \\\n", - "--suffix _50nuc \\\n", - "--input_dir output/1_files_with_sampleid_MIT" - ] - }, - { - "cell_type": "markdown", - "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", - "metadata": {}, - "source": [ - "For additional parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", - "metadata": {}, - "outputs": [], - "source": [ - "--min_count 5\n", - "--min_total_count 15\n", - "--min_prop 0.1\n", - "--min_nuclei 20" - ] - }, - { - "cell_type": "markdown", - "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 2: Format Output\n", - "### Phenotype Reformatting\n", - "\n", - "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `{celltype}_residuals.txt` | `output/2_residuals/{celltype}/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Read residuals file with proper handling of peak IDs and sample columns\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Convert to midpoint coordinates (standard for QTLtools):\n", - "```\n", - " start = floor((peak_start + peak_end) / 2)\n", - " end = start + 1\n", - "```\n", - "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample expression values\n", - "5. Sort by chromosome and position\n", - "6. Compress with `bgzip` and index with `tabix`\n", - "\n", - "**Output:** `output/3_phenotype_processing/phenotype/{celltype}_snatac_phenotype.bed.gz`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_snatac_phenotype.bed.gz` | bgzip-compressed BED with peak midpoint coordinates |\n", - "| `{celltype}_snatac_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", - "\n", - "**Use case:** Standard caQTL mapping to identify genetic variants affecting chromatin\n", - "accessibility independent of demographic factors. Compatible with FastQTL, TensorQTL, and QTLtools.\n", - "\n", - "**Timing:** <1 min" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb phenotype_formatting \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/2_residuals \\\n", - " --output_dir output/atac_seq/3_pheno_reformat \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Region Peak Filtering\n", - "\n", - "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `{celltype}_filtered_raw_counts.txt` | `output/2_residuals/{celltype}/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Read filtered raw counts per cell type\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Calculate per-peak metrics:\n", - " - `peakwidth`: `end - start`\n", - " - `midpoint`: `(start + end) / 2`\n", - "4. 
Filter peaks overlapping target regions (includes peaks that start, end, or span boundaries):\n", - "\n", - " | Region | Coordinates | Size |\n", - " |--------|-------------|------|\n", - " | Chr7 | 28,000,000 – 28,300,000 bp | 300 kb |\n", - " | Chr11 | 85,050,000 – 86,200,000 bp | 1.15 Mb |\n", - "\n", - "5. Calculate summary statistics per peak:\n", - " - `total_count`: sum of counts across all samples\n", - " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n", - "\n", - "**Output:** `output/3_format_output/regions/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_filtered_regions.txt` | Full count matrix for peaks in target regions |\n", - "| `{celltype}_filtered_regions_summary.txt` | Peak metadata with coordinates and count statistics |\n", - "\n", - "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n", - "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n", - "\n", - "**Timing:** <1 min" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f944afdd-fffc-4b56-863f-eee89408cfa1", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/2_residuals \\\n", - " --output_dir output/atac_seq/3_region_filter \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "10440301-99c6-4f0e-b6ce-efe5ac9281fb", - "metadata": {}, - "outputs": [], - "source": [ - "# Custom regions\n", - "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n", - " --cwd output/atac_seq \\\n", - " --input_dir output/atac_seq/2_residuals \\\n", - " --output_dir output/atac_seq \\\n", - " --celltype Ast Ex In Microglia Oligo OPC \\\n", - " --regions \"chr1:1000000-2000000,chr5:50000000-51000000\"" - ] - }, - { - "cell_type": "markdown", - "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/snatacseq_preprocessing.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "id": "0e17a301-cca9-49a1-843b-4248546f1f79", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Setup and global parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "# Output directory\n", - "parameter: cwd = path(\"output\")\n", - "# For cluster jobs, number of commands to run per job\n", - "parameter: job_size = 1\n", - "# Wall clock time expected\n", - "parameter: walltime = \"5h\"\n", - "# Memory expected\n", - "parameter: mem = \"16G\"\n", - "# Number of threads\n", - "parameter: numThreads = 8\n", - "# Software container\n", - "parameter: container = \"\"\n", - "\n", - "import re\n", - "parameter: entrypoint = (\n", - " 'micromamba run -a \"\" -n' + ' ' +\n", - " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", - ") if container else \"\"\n", - "\n", - "from sos.utils import expand_size\n", - "cwd = path(f'{cwd:a}')" - ] - }, - { - "cell_type": "markdown", - "id": 
"cb6024cd-28be-4fb0-994e-0460e3a3beae", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `sampleid_mapping`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[sampleid_mapping]\n", - "parameter: map_file = str\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n", - "parameter: suffix = '' # e.g. '' for Xiong, '_50nuc' for Kellis\n", - "\n", - "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "output: [f'{output_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "\n", - "python: expand = \"${ }\"\n", - "\n", - " import pandas as pd\n", - " import gzip\n", - " import os\n", - " import subprocess\n", - " import csv\n", - " import numpy as np\n", - "\n", - " map_df = pd.read_csv(\"${map_file}\")\n", - " id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", - "\n", - " celltype = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}/1_files_with_sampleid\"\n", - " suffix = \"${suffix}\"\n", - "\n", - " os.makedirs(output_dir, exist_ok=True)\n", - "\n", - " def map_id(ind_id):\n", - " return id_map.get(ind_id, ind_id)\n", - " \n", - " def format_value(val):\n", - " \"\"\"Format numeric values: remove .0 from integers, keep decimals\"\"\"\n", - " if pd.isna(val):\n", - " return ''\n", - " if isinstance(val, (int, np.integer)):\n", - " return str(val)\n", - " if isinstance(val, (float, np.floating)):\n", - " if val == int(val): # Check if it's a whole number\n", - " return str(int(val))\n", - " else:\n", - " return str(val)\n", - " return str(val)\n", - "\n", - " # ── Process metadata CSV files ────────────────────────────────────────────\n", - " for ct in celltype:\n", - " fname = f\"metadata_{ct}{suffix}.csv\"\n", - " in_path = os.path.join(input_dir, fname)\n", - " out_path = os.path.join(output_dir, fname)\n", - "\n", - " if not os.path.exists(in_path):\n", - " print(f\"Warning: Metadata file not found: {in_path}\")\n", - " continue\n", - "\n", - " meta = pd.read_csv(in_path)\n", - "\n", - " if \"individualID\" not in meta.columns:\n", - " print(f\"Warning: individualID column not found in {fname}\")\n", - " continue\n", - "\n", - " # Create or update sampleid column\n", - " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", - " \n", - " # Always reorder: sampleid FIRST, then individualID, then rest\n", - " cols = meta.columns.tolist()\n", - " cols.remove(\"sampleid\")\n", - " cols.remove(\"individualID\")\n", - " new_cols = [\"sampleid\", \"individualID\"] + cols\n", - " meta = meta[new_cols]\n", - "\n", - " # Write CSV with custom formatting\n", - " with open(out_path, 'w', newline='') as f:\n", - " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", - " # Write header\n", - " writer.writerow(meta.columns)\n", - " # Write data rows with custom formatting\n", - " for _, row in meta.iterrows():\n", - " writer.writerow([format_value(val) for val in row])\n", - " \n", - " print(f\"Processed metadata: {fname}\")\n", - "\n", - " # ── Process count matrix .csv.gz files ───────────────────────────────────\n", - " for ct in celltype:\n", - " # Try both naming patterns: with and without underscore\n", - " patterns = [\n", - " f\"pseudobulk_peaks_counts_{ct}{suffix}.csv.gz\", # Xiong pattern\n", - " f\"pseudobulk_peaks_counts{ct}{suffix}.csv.gz\" # Kellis pattern\n", - " 
]\n", - " \n", - " in_path = None\n", - " for pattern in patterns:\n", - " test_path = os.path.join(input_dir, pattern)\n", - " if os.path.exists(test_path):\n", - " in_path = test_path\n", - " fname = pattern\n", - " break\n", - " \n", - " if in_path is None:\n", - " print(f\"Warning: Count file not found for celltype {ct}\")\n", - " continue\n", - " \n", - " out_path = os.path.join(output_dir, fname)\n", - "\n", - " with gzip.open(in_path, \"rt\") as fh:\n", - " header_line = fh.readline().rstrip(\"\\n\")\n", - "\n", - " col_names = header_line.split(\",\")\n", - " peak_id_col = col_names[0]\n", - " sample_cols = col_names[1:]\n", - " new_sample_cols = [map_id(s) for s in sample_cols]\n", - " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", - "\n", - " import tempfile\n", - " temp_header = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", - " temp_header.write(new_header + \"\\n\")\n", - " temp_header.close()\n", - " \n", - " cmd = f\"zcat {in_path} | tail -n +2 | cat {temp_header.name} - | gzip -6 > {out_path}\"\n", - " subprocess.run(cmd, shell=True, check=True)\n", - " \n", - " os.unlink(temp_header.name)\n", - " print(f\"Processed counts: {fname}\")\n", - "\n", - " print(\"\\nSample ID mapping completed!\")" - ] - }, - { - "cell_type": "markdown", - "id": "f0884ae7-a851-425a-86dd-b606768a012e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `pseudobulk_qc`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[pseudobulk_qc]\n", - "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: covariates_file = str\n", - "parameter: blacklist_file = ''\n", - "parameter: include_bio = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", - "parameter: batch_correction = \"FALSE\" # \"TRUE\" or \"FALSE\"\n", - "parameter: batch_method = \"limma\" # \"limma\" or \"combat\"\n", - "parameter: min_count = 5\n", - "parameter: min_total_count = 15\n", - "parameter: min_prop = 0.1\n", - "parameter: min_nuclei = 20\n", - "parameter: suffix = ''\n", - "\n", - "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype], \\\n", - " [f'{input_dir}/pseudobulk_peaks_counts_{ct}{suffix}.csv.gz' for ct in celltype]\n", - "output: [f'{output_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", - "\n", - "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", - "\n", - " library(edgeR)\n", - " library(limma)\n", - " library(data.table)\n", - " library(GenomicRanges)\n", - " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", - "\n", - " # ── Helper: standardize metadata column names ─────────────────────────────\n", - " rename_if_found <- function(dt, target, candidates) {\n", - " found <- intersect(candidates, colnames(dt))[1]\n", - " if (!is.na(found) && found != target) setnames(dt, found, target)\n", - " }\n", - "\n", - " standardize_meta <- function(meta) {\n", - " rename_if_found(meta, \"n_nuclei\", c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n", - " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n", - " rename_if_found(meta, \"med_tss_enrich\", 
c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n", - " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n", - " return(meta)\n", - " }\n", - "\n", - " # ── Helper: blacklist filtering ───────────────────────────────────────────\n", - " filter_blacklist <- function(mat, bed) {\n", - " peaks <- data.table(id = rownames(mat))\n", - " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", - " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " bl <- fread(bed)[, 1:3]\n", - " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", - " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", - " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", - " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", - " if (length(blacklisted) > 0) {\n", - " message(\"Blacklisted peaks removed: \", length(blacklisted))\n", - " return(mat[-blacklisted, , drop=FALSE])\n", - " }\n", - " return(mat)\n", - " }\n", - "\n", - " # ── Helper: predictOffset ─────────────────────────────────────────────────\n", - " predictOffset <- function(fit) {\n", - " D <- fit$design\n", - " Dm <- D\n", - " for (col in colnames(D)) {\n", - " if (col == \"(Intercept)\") next\n", - " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", - " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", - " else\n", - " Dm[, col] <- 0\n", - " }\n", - " B <- fit$coefficients\n", - " B[is.na(B)] <- 0\n", - " B %*% t(Dm)\n", - " }\n", - "\n", - " # ── Main loop ─────────────────────────────────────────────────────────────\n", - " cts <- c(${', '.join([f\"'{x}'\" for x in celltype])})\n", - "\n", - " for (ct in cts) {\n", - " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", - " message(\"Processing: \", ct)\n", - " message(\"Mode: \", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"))\n", - " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", - " message(paste(rep(\"=\", 40), collapse=\"\"))\n", - "\n", - " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", - " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", - "\n", - " # ── 1. Load data ───────────────────────────────────────────────────\n", - " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n", - " counts_raw <- fread(sprintf(\"${input_dir}/pseudobulk_peaks_counts_%s${suffix}.csv.gz\", ct))\n", - "\n", - " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", - " rownames(counts) <- counts_raw[[1]]\n", - " rm(counts_raw)\n", - " n_original <- nrow(counts)\n", - " message(\"Loaded: \", n_original, \" peaks x \", ncol(counts), \" samples\")\n", - "\n", - " # ── 2. Standardize metadata columns ───────────────────────────────\n", - " meta <- standardize_meta(meta)\n", - "\n", - " # ── 3. Identify sample ID column ──────────────────────────────────\n", - " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", - " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", - "\n", - " # ── 4. Nuclei filter ──────────────────────────────────────────────\n", - " if (\"n_nuclei\" %in% colnames(meta)) {\n", - " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n", - " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", - " }\n", - " n_after_nuclei <- nrow(meta)\n", - "\n", - " # ── 5. 
Align samples ───────────────────────────────────────────────\n", - " common <- intersect(meta[[idcol]], colnames(counts))\n", - " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", - " meta <- meta[match(common, meta[[idcol]])]\n", - " counts <- counts[, common, drop=FALSE]\n", - " message(\"Samples after alignment: \", length(common))\n", - "\n", - " # ── 6. Blacklist filtering ─────────────────────────────────────────\n", - " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", - " counts <- filter_blacklist(counts, \"${blacklist_file}\")\n", - " message(\"Peaks after blacklist filter: \", nrow(counts))\n", - " } else {\n", - " message(\"No blacklist file provided - skipping blacklist filtering.\")\n", - " }\n", - " n_after_blacklist <- nrow(counts)\n", - "\n", - " # ── 7. Load and merge covariates ───────────────────────────────────\n", - " covs <- fread(\"${covariates_file}\")\n", - " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n", - " bio_cols <- if (as.logical(\"${include_bio}\")) c(\"msex\",\"age_death\",\"pmi\",\"study\") else c(\"pmi\",\"study\")\n", - " keep_cols <- c(id2, intersect(bio_cols, colnames(covs)))\n", - " covs <- covs[, ..keep_cols]\n", - " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n", - "\n", - " # ── CRITICAL: re-order meta back to common sample order ────────────\n", - " meta <- meta[match(common, meta[[idcol]])]\n", - "\n", - " # ── 8. Impute missing covariate values ─────────────────────────────\n", - " for (col in intersect(c(\"pmi\",\"age_death\"), colnames(meta))) {\n", - " if (any(is.na(meta[[col]]))) {\n", - " message(\"Imputing missing values for: \", col)\n", - " meta[[col]][is.na(meta[[col]])] <- median(meta[[col]], na.rm=TRUE)\n", - " }\n", - " }\n", - "\n", - " # ── 9. Compute technical metrics ──────────────────────────────────\n", - " meta$log_n_nuclei <- log1p(meta$n_nuclei)\n", - " meta$log_med_n_tot_fragment <- log1p(meta$med_n_tot_fragment)\n", - " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n", - "\n", - " # ── 10. Select model variables ────────────────────────────────────\n", - " tech_vars <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", - " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\",\"pmi\",\"study\")\n", - " bio_vars <- c(\"msex\",\"age_death\")\n", - " all_vars <- if (as.logical(\"${include_bio}\")) c(tech_vars, bio_vars) else tech_vars\n", - " all_vars <- intersect(all_vars, colnames(meta))\n", - " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n", - "\n", - " # ── 11. Drop samples with NA in model variables ────────────────────\n", - " keep_rows <- complete.cases(meta[, ..all_vars])\n", - " meta <- meta[keep_rows]\n", - " counts <- counts[, meta[[idcol]], drop=FALSE]\n", - " message(\"Valid samples for modelling: \", nrow(meta))\n", - "\n", - " # ── 12. 
Expression filtering ───────────────────────────────────────\n", - " dge <- DGEList(counts=counts, samples=meta)\n", - " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", - " message(\"Peaks before expression filter: \", nrow(dge))\n", - "\n", - " keep <- filterByExpr(dge, group=dge$samples$group,\n", - " min.count=${min_count},\n", - " min.total.count=${min_total_count},\n", - " min.prop=${min_prop})\n", - " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", - " n_after_expr <- nrow(dge)\n", - " message(\"Peaks after expression filter: \", n_after_expr)\n", - "\n", - " # Save filtered raw counts\n", - " write.table(dge$counts,\n", - " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " # ── 13. TMM normalization ──────────────────────────────────────────\n", - " dge <- calcNormFactors(dge, method=\"TMM\")\n", - "\n", - " # ── 14. Optional batch correction ─────────────────────────────────\n", - " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", - " batches <- dge$samples$sequencingBatch\n", - " batch_counts <- table(batches)\n", - " valid_batches <- names(batch_counts[batch_counts > 1])\n", - " keep_bc <- batches %in% valid_batches\n", - " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", - " batches <- batches[keep_bc]\n", - " message(\"Samples after singleton batch removal: \", ncol(dge))\n", - "\n", - " if (\"${batch_method}\" == \"combat\") {\n", - " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n", - " message(\"ComBat-seq batch correction applied.\")\n", - " } else {\n", - " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", - " dge$counts <- round(pmax(2^logCPM, 0))\n", - " message(\"limma removeBatchEffect applied.\")\n", - " }\n", - " }\n", - "\n", - " # ── 15. Add sequencingBatch and Library to model if multi-level ───\n", - " # Insert after technical vars but before pmi/study to match original order\n", - " tech_only <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", - " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\")\n", - " other_vars <- setdiff(all_vars, tech_only) # pmi, study, msex, age_death\n", - "\n", - " batch_vars <- c()\n", - " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$sequencingBatch)) > 1) {\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", - " }\n", - "\n", - " if (\"Library\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$Library)) > 1) {\n", - " dge$samples$Library_factor <- factor(dge$samples$Library)\n", - " batch_vars <- c(batch_vars, \"Library_factor\")\n", - " }\n", - "\n", - " # Final order: technical + batch + other (pmi, study, bio)\n", - " all_vars <- c(tech_only, batch_vars, other_vars)\n", - " all_vars <- intersect(all_vars, c(colnames(dge$samples), colnames(meta)))\n", - "\n", - " # ── 16. 
Build design matrix ────────────────────────────────────────\n", - " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n", - " design <- model.matrix(form, data=dge$samples)\n", - " message(\"Formula: \", deparse(form))\n", - "\n", - " if (!is.fullrank(design)) {\n", - " message(\"Design not full rank - trimming.\")\n", - " qr_d <- qr(design)\n", - " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", - " }\n", - " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", - "\n", - " # ── 17. Voom + lmFit + eBayes ─────────────────────────────────────\n", - " v <- voom(dge, design, plot=FALSE)\n", - " fit <- lmFit(v, design)\n", - " fit <- eBayes(fit)\n", - "\n", - " # ── 18. Offset + residuals ─────────────────────────────────────────\n", - " off <- predictOffset(fit)\n", - " res <- residuals(fit, v)\n", - " final <- off + res\n", - "\n", - " # ── 19. Save outputs ───────────────────────────────────────────────\n", - " write.table(final,\n", - " file.path(outdir, paste0(ct, \"_residuals.txt\")),\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " saveRDS(list(\n", - " dge = dge,\n", - " offset = off,\n", - " residuals = res,\n", - " final_data = final,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = form,\n", - " mode = ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"),\n", - " batch_correction = as.logical(\"${batch_correction}\"),\n", - " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\")\n", - " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", - "\n", - " # ── 20. Summary report ─────────────────────────────────────────────\n", - " sink(file.path(outdir, paste0(ct, \"_summary.txt\")))\n", - " cat(\"*** Processing Summary for\", ct, \"***\\n\\n\")\n", - "\n", - " cat(\"=== Analysis Mode ===\\n\")\n", - " cat(\"Mode:\", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"), \"\\n\")\n", - " cat(\"Batch correction:\", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"), \"\\n\")\n", - " cat(\"Model formula:\", deparse(form), \"\\n\\n\")\n", - "\n", - " cat(\"=== Filtering Parameters ===\\n\")\n", - " cat(\"Nuclei cutoff: >\", ${min_nuclei}, \"\\n\")\n", - " cat(\"Blacklist filtering:\", ifelse(\"${blacklist_file}\" != \"\", \"TRUE\", \"FALSE\"), \"\\n\")\n", - " if (\"${blacklist_file}\" != \"\") cat(\"Blacklist file:\", \"${blacklist_file}\", \"\\n\")\n", - " cat(\"min_count:\", ${min_count}, \"\\n\")\n", - " cat(\"min_total_count:\", ${min_total_count}, \"\\n\")\n", - " cat(\"min_prop:\", ${min_prop}, \"\\n\\n\")\n", - "\n", - " cat(\"=== Peak Counts ===\\n\")\n", - " cat(\"Original peak count:\", n_original, \"\\n\")\n", - " cat(\"Peaks after blacklist filtering:\", n_after_blacklist, \"\\n\")\n", - " cat(\"Peaks after expression filtering:\", n_after_expr, \"\\n\\n\")\n", - "\n", - " cat(\"=== Sample Counts ===\\n\")\n", - " cat(\"Number of samples after nuclei (>\", ${min_nuclei}, \") filtering:\", n_after_nuclei, \"\\n\")\n", - " cat(\"Number of samples in final model:\", ncol(final), \"\\n\\n\")\n", - "\n", - " cat(\"=== Technical Variables Used ===\\n\")\n", - " for (v in intersect(c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n", - " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\"), all_vars))\n", - " cat(\"-\", v, \"\\n\")\n", - " if (\"sequencingBatch_factor\" %in% all_vars) cat(\"- sequencingBatch: Sequencing batch ID\\n\")\n", - " if 
(\"Library_factor\" %in% all_vars) cat(\"- Library: Library ID\\n\")\n", - "\n", - " if (as.logical(\"${include_bio}\")) {\n", - " cat(\"\\n=== Biological Variables Used ===\\n\")\n", - " for (v in intersect(c(\"msex\",\"age_death\"), all_vars))\n", - " cat(\"-\", v, \"\\n\")\n", - " } else {\n", - " cat(\"\\n=== Biological Variables Used ===\\n\")\n", - " cat(\"None (noBIOvar mode - biological variation preserved)\\n\")\n", - " }\n", - "\n", - " cat(\"\\n=== Other Variables Used ===\\n\")\n", - " if (\"pmi\" %in% all_vars) cat(\"- pmi: Post-mortem interval\\n\")\n", - " if (\"study\" %in% all_vars) cat(\"- study: Study cohort\\n\")\n", - " sink()\n", - "\n", - " # ── 21. Variable explanation report ───────────────────────────────\n", - " sink(file.path(outdir, paste0(ct, \"_variable_explanation.txt\")))\n", - " cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n", - " cat(\"## Why Log Transformation?\\n\")\n", - " cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n", - " cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n", - " cat(\"2. To stabilize variance across the range of values\\n\")\n", - " cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n", - " cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n", - " cat(\"## Variables and Their Meanings\\n\\n\")\n", - " cat(\"### Technical Variables\\n\")\n", - " cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n", - " cat(\" * Filtered to include only samples with >\", ${min_nuclei}, \"nuclei\\n\")\n", - " cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n", - " cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n", - " cat(\" * Represents sequencing depth\\n\")\n", - " cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n", - " cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n", - " cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n", - " cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n", - " cat(\" * Measures the degree of nucleosome positioning\\n\")\n", - " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", - " cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n", - " cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n", - " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n", - " if (\"sequencingBatch_factor\" %in% all_vars)\n", - " cat(\"- sequencingBatch: Sequencing batch ID\\n * Treated as a factor to account for batch effects\\n\\n\")\n", - " if (\"Library_factor\" %in% all_vars)\n", - " cat(\"- Library: Library preparation batch ID\\n * Treated as a factor to account for library preparation effects\\n\\n\")\n", - " if (as.logical(\"${include_bio}\")) {\n", - " cat(\"### Biological Variables\\n\")\n", - " cat(\"- msex: Sex (male=1, female=0)\\n\")\n", - " cat(\"- age_death: Age at death\\n\\n\")\n", - " }\n", - " cat(\"### Other Variables\\n\")\n", - " cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n", - " cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n", - " cat(\"## Relationship to voom Transformation\\n\")\n", - " cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n", - " 
cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n", - " cat(\"covariates, we ensure they are on a similar scale to the transformed expression data, \")\n", - " cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n", - " sink()\n", - "\n", - " message(\"Completed: \", ct, \" -> \", outdir)\n", - " message(\" Peaks: \", nrow(final), \" | Samples: \", ncol(final))\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `phenotype_reformatting`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[phenotype_formatting]\n", - "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "\n", - "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", - "output: [f'{output_dir}/{ct}_snatac_phenotype.bed.gz' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - "\n", - " import os\n", - " import subprocess\n", - " import pandas as pd\n", - "\n", - " celltypes = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}\"\n", - "\n", - " def read_residuals(path):\n", - " first_line = open(path).readline().rstrip(\"\\n\")\n", - " col_names = first_line.split(\"\\t\")\n", - " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", - " if df.shape[1] > len(col_names):\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names\n", - " else:\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names[1:]\n", - " return peak_ids, df\n", - "\n", - " def to_midpoint_bed(peak_ids, residuals):\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " chrs = parts[0].values\n", - " starts = parts[1].astype(int).values\n", - " ends = parts[2].astype(int).values\n", - " mids = ((starts + ends) // 2).astype(int)\n", - " bed = pd.DataFrame({\n", - " \"#chr\": chrs,\n", - " \"start\": mids,\n", - " \"end\": mids + 1,\n", - " \"ID\": peak_ids\n", - " })\n", - " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", - " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", - "\n", - " def run_cmd(cmd, label):\n", - " r = subprocess.run(cmd, capture_output=True)\n", - " if r.returncode != 0:\n", - " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", - " else:\n", - " print(f\"{label}: OK\")\n", - "\n", - " for ct in celltypes:\n", - " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", - "\n", - " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", - " os.makedirs(out_dir, exist_ok=True)\n", - "\n", - " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n", - " if not os.path.exists(res_path):\n", - " print(f\"WARNING: {res_path} not found, skipping.\")\n", - " continue\n", - "\n", - " peak_ids, residuals = read_residuals(res_path)\n", - " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", - "\n", - " bed = to_midpoint_bed(peak_ids, residuals)\n", - " out_bed = os.path.join(out_dir, 
f\"{ct}_snatac_phenotype.bed\")\n", - " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", - " print(f\"Written: {out_bed}\")\n", - "\n", - " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", - " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", - "\n", - " print(f\"Completed: {ct} -> {out_dir}\")" - ] - }, - { - "cell_type": "markdown", - "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `region_filtering`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[region_filtering]\n", - "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: regions = \"chr7:28000000-28300000,chr11:85050000-86200000\"\n", - "\n", - "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in celltype]\n", - "output: [f'{output_dir}/{ct}_filtered_regions_of_interest.txt' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - "\n", - " import os\n", - " import pandas as pd\n", - "\n", - " celltypes = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}\"\n", - "\n", - " def parse_regions(region_str):\n", - " result = []\n", - " for r in region_str.split(\",\"):\n", - " chrom, coords = r.strip().split(\":\")\n", - " start, end = coords.split(\"-\")\n", - " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n", - " return result\n", - "\n", - " regions = parse_regions(\"${regions}\")\n", - "\n", - " def parse_peak_ids(peak_ids):\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " return pd.DataFrame({\n", - " \"chr\": parts[0].values,\n", - " \"start\": parts[1].astype(int).values,\n", - " \"end\": parts[2].astype(int).values\n", - " })\n", - "\n", - " def overlaps_region(chr_col, start_col, end_col, reg):\n", - " return (\n", - " (chr_col == reg[\"chr\"]) &\n", - " (start_col < reg[\"end\"]) &\n", - " (end_col > reg[\"start\"])\n", - " )\n", - "\n", - " for ct in celltypes:\n", - " print(f\"\\n{'='*40}\\nRegion Filtering: {ct}\\n{'='*40}\")\n", - "\n", - " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n", - " os.makedirs(reg_dir, exist_ok=True)\n", - "\n", - " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n", - " if not os.path.exists(counts_path):\n", - " print(f\"WARNING: {counts_path} not found, skipping.\")\n", - " continue\n", - "\n", - " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n", - " df.index.name = \"peak_id\"\n", - " df = df.reset_index()\n", - "\n", - " coords = parse_peak_ids(df[\"peak_id\"].values)\n", - " df[\"chr\"] = coords[\"chr\"].values\n", - " df[\"start\"] = coords[\"start\"].values\n", - " df[\"end\"] = coords[\"end\"].values\n", - " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n", - " df[\"midpoint\"] = ((df[\"start\"] + df[\"end\"]) / 2).astype(int)\n", - "\n", - " # Filter to regions of interest\n", - " mask = pd.Series(False, index=df.index)\n", - " for reg in regions:\n", - " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n", - "\n", - " region_df = df[mask].copy()\n", - " print(f\"Peaks in regions of interest: 
{len(region_df)}\")\n", - "\n", - " # Save full filtered data\n", - " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n", - " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n", - " print(f\"Saved: {full_out}\")\n", - "\n", - " # Save summary\n", - " meta_cols = [\"peak_id\",\"chr\",\"start\",\"end\",\"peakwidth\",\"midpoint\"]\n", - " count_cols = [c for c in region_df.columns if c not in meta_cols]\n", - " count_mat = region_df[count_cols].apply(pd.to_numeric, errors=\"coerce\")\n", - "\n", - " summary = region_df[meta_cols].copy()\n", - " summary[\"total_count\"] = count_mat.sum(axis=1).values\n", - " summary[\"weighted_count\"] = (summary[\"total_count\"] / summary[\"peakwidth\"]).values\n", - "\n", - " summary_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest_summary.txt\")\n", - " summary.to_csv(summary_out, sep=\"\\t\", index=False)\n", - " print(f\"Saved: {summary_out}\")\n", - "\n", - " print(f\"Completed: {ct} -> {reg_dir}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.4.3" - }, - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "sos", - "", - "" - ] - ], - "version": "" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From d0169c5fed12d8a71f81ed28f5a6271ec59e902e Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Tue, 24 Feb 2026 18:41:41 -0500 Subject: [PATCH 07/12] pseudobulk count data preprocessing both snATAC-seq & snRNA-seq --- .../QC/pseudobulk_preprocessing.ipynb | 1442 +++++++++++++++++ 1 file changed, 1442 insertions(+) create mode 100644 code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb new file mode 100644 index 000000000..9a7a8a59f --- /dev/null +++ b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb @@ -0,0 +1,1442 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Single-nuclei Pseudobulk Preprocessing (RNA-seq and ATAC-seq) Pipeline\n", + "\n", + "## Overview\n", + "\n", + "This pipeline preprocesses single-nuclei pseudobulk count data (snATAC-seq or snRNA-seq)\n", + "for downstream QTL analysis and region-specific studies.\n", + "\n", + "**Goals:**\n", + "- Transform raw pseudobulk counts into analysis-ready formats\n", + "- Remove technical confounders while preserving biological covariates (sex, age)\n", + "- Generate QTL-ready phenotype files or region-specific datasets\n", + "\n", + "## Pipeline Structure\n", + "\n", + "```\n", + "Step 0: Sample ID Mapping [sampleid_mapping]\n", + " ↓\n", + "Step 1: Pseudobulk QC [pseudobulk_qc]\n", + " noBIOvar: regress out technical covariates only\n", + " (msex and age_death deliberately preserved)\n", + " ↓ (optional)\n", + " Batch Correction (ComBat-seq or limma::removeBatchEffect)\n", + " ↓ (optional)\n", + " Quantile Normalization\n", + " ↓\n", + "Step 2: Format Output\n", + " ├── Phenotype Reformatting → BED [phenotype_formatting] (genome-wide QTL mapping, snATAC-seq only, locus-specific)\n", + " └── Region Peak Filtering → TSV [region_filtering] (gene filtering for snRNA-seq)\n", + "```\n", + "\n", + "## Modality Support\n", + "\n", + "| Feature 
| snATAC-seq | snRNA-seq |\n",
+    "|---------|-----------|\n",
+    "| Count file auto-detected | ✓ | ✓ |\n",
+    "| Default `tech_vars` | `log_n_nuclei`, `med_nucleosome_signal`, `med_tss_enrich`, `log_med_n_tot_fragment`, `log_total_unique_peaks` | custom via `--tech_vars` |\n",
+    "| Blacklist filtering | ✓ | — |\n",
+    "| `region_filtering` step | ✓ | ✓ (gene filtering via `--gene_list`) |\n",
+    "| `phenotype_formatting` step | ✓ | — (see the note in Step 2) |\n",
+    "\n",
+    "For snRNA-seq, override `tech_vars` to match your metadata columns, e.g.:\n",
+    "```bash\n",
+    "--tech_vars log_n_nuclei percent_mito log_n_genes\n",
+    "```\n",
+    "\n",
+    "Any `tech_var` starting with `log_` is automatically derived via `log1p()` from the\n",
+    "raw column of the same name with `log_` stripped. No code changes needed across modalities.\n",
+    "\n",
+    "## Input Files\n",
+    "\n",
+    "All input files required to run this pipeline can be downloaded\n",
+    "[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n",
+    "\n",
+    "| File | Used in |\n",
+    "|------|---------|\n",
+    "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Step 0, Step 1 |\n",
+    "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Step 0, Step 1 |\n",
+    "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n",
+    "| `rosmap_sample_mapping_data.csv` | Step 0 |\n",
+    "| `rosmap_cov.txt` | Step 1 |\n",
+    "| `hg38-blacklist.v2.bed.gz` | Step 1 (snATAC-seq only) |\n",
+    "\n",
+    "Count files are **auto-detected** from `input_dir` — no prefix parameter needed.\n",
+    "\n",
+    "## Parameters\n",
+    "\n",
+    "### `sampleid_mapping`\n",
+    "| Parameter | Default | Description |\n",
+    "|-----------|---------|-------------|\n",
+    "| `map_file` | *required* | CSV with `individualID` → `sampleid` mapping |\n",
+    "| `input_dir` | *required* | Directory with raw metadata and count files |\n",
+    "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/1_files_with_sampleid/` |\n",
+    "| `celltype` | `['Ast','Ex','In','Microglia','Oligo','OPC']` | Cell types to process |\n",
+    "| `suffix` | `''` | Optional filename suffix (e.g. 
`_50nuc`) |\n", + "\n", + "### `pseudobulk_qc`\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `input_dir` | *required* | Directory with remapped metadata and count files |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/2_residuals/{ct}/` |\n", + "| `covariates_file` | *required* | Covariate file with `pmi` and `study` columns |\n", + "| `blacklist_file` | `''` | Genomic blacklist BED file (snATAC-seq only) |\n", + "| `sample_list` | `''` | Optional file with one sample ID per line to subset |\n", + "| `tech_vars` | `['log_n_nuclei','med_nucleosome_signal','med_tss_enrich','log_med_n_tot_fragment','log_total_unique_peaks']` | Technical covariates for the model |\n", + "| `batch_correction` | `FALSE` | Apply batch correction (`TRUE`/`FALSE`) |\n", + "| `batch_method` | `limma` | Batch correction method (`limma` or `combat`) |\n", + "| `quant_norm` | `FALSE` | Apply quantile normalization after residuals |\n", + "| `min_count` | `5` | Min reads in at least one sample |\n", + "| `min_total_count` | `15` | Min total reads across all samples |\n", + "| `min_prop` | `0.1` | Min proportion of samples with expression |\n", + "| `min_nuclei` | `20` | Min nuclei per sample |\n", + "| `celltype` | `['Ast','Ex','In','Microglia','Oligo','OPC']` | Cell types to process |\n", + "| `suffix` | `''` | Optional filename suffix |\n", + "\n", + "### `phenotype_formatting`\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `input_dir` | *required* | Directory containing `{ct}/{ct}_residuals.txt` |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_pheno_reformat/` |\n", + "| `modality` | `snatac` | Modality label used in output filename (`snatac` or `snrna`) |\n", + "| `celltype` | `['Ast','Ex','In','Mic','Oligo','OPC']` | Cell types to process |\n", + "\n", + "### `region_filtering` *(snATAC-seq only)*\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `input_dir` | *required* | Directory containing `{ct}/{ct}_filtered_raw_counts.txt` |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_region_filter/` |\n", + "| `regions` | `chr7:28000000-28300000,...` | Comma-separated genomic regions of interest |\n", + "| `celltype` | `['Ast','Ex','In','Mic','Oligo','OPC']` | Cell types to process |\n", + "\n", + "## Minimal Working Example" + ] + }, + { + "cell_type": "markdown", + "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 0: Sample ID Mapping\n", + "\n", + "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", + "across metadata and count matrix files.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", + "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 *(snATAC-seq)* | Per-cell-type peak count matrices |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` × 6 *(snRNA-seq)* | Per-cell-type gene count matrices |\n", + "\n", + "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n", + "\n", + "Count files are **auto-detected** from `input_dir` — any `.csv.gz` file ending with\n", + "`{celltype}{suffix}` will be found regardless of prefix 
(`pseudobulk_peaks_counts_`,\n", + "`pseudobulk_counts_`, etc.).\n", + "\n", + "### Process\n", + "\n", + "**Part 1 — Metadata files**\n", + "\n", + "For each `metadata_{celltype}.csv`:\n", + "1. Look up each `individualID` in the mapping reference\n", + "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", + "3. Reorder columns: `sampleid` first, then `individualID`, then the rest\n", + "4. Save updated file\n", + "\n", + "**Part 2 — Count matrix files**\n", + "\n", + "For each count file detected in `input_dir`:\n", + "1. Auto-detect filename by scanning for `.csv.gz` files matching `{celltype}{suffix}`\n", + "2. Extract the header row (column names only)\n", + "3. Keep the first column (peak or gene IDs) unchanged\n", + "4. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n", + " otherwise keep original\n", + "5. Write new header and stream data rows unchanged\n", + "6. Recompress with gzip\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/1_files_with_sampleid/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n", + "| `{detected_count_filename}` × 6 | Count matrices with mapped column headers |\n", + "\n", + "**Timing:** < 1 min" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb sampleid_mapping \\\n", + " --map-file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", + " --input-dir data/atac_seq/1_files_with_sampleid \\\n", + " --output-dir output/atac_seq \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "5540a4da-843a-4789-8123-47911cf519c5", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1: Pseudobulk QC\n", + "\n", + "Regresses out technical covariates while preserving biological variation (sex, age) for\n", + "downstream QTL analysis. Works for both snATAC-seq and snRNA-seq.\n", + "\n", + "### Input\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `pseudobulk_*counts_{celltype}.csv.gz` *(auto-detected)* | `1_files_with_sampleid/` |\n", + "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n", + "| `rosmap_cov.txt` | `data/` |\n", + "| `hg38-blacklist.v2.bed.gz` *(snATAC-seq, optional)* | `data/` |\n", + "\n", + "### Process\n", + "\n", + "1. Load metadata per cell type; auto-detect and load count matrix from `input_dir`\n", + "2. Standardize metadata column names across datasets\n", + "3. Filter samples with fewer than `min_nuclei` nuclei (default: 20)\n", + "4. *(Optional)* Subset to samples listed in `sample_list` file\n", + "5. Align samples between metadata and count matrix\n", + "6. *(Optional)* Filter blacklisted genomic regions (`blacklist_file`)\n", + "7. Merge with demographic covariates (`pmi`, `study`) from `covariates_file`\n", + "8. Impute missing `pmi` values with median\n", + "9. Load `tech_vars` from parameter — any variable prefixed with `log_` is automatically\n", + " derived via `log1p()` from the raw column of the same name:\n", + " - e.g. `log_n_nuclei` ← `log1p(n_nuclei)`\n", + " - e.g. `log_total_unique_peaks` ← `log1p(colSums(counts > 0))`\n", + " - Works for both snATAC-seq and snRNA-seq without code changes\n", + "10. 
Build model variable list — `msex` and `age_death` are **deliberately excluded**\n", + "11. Drop samples with NA in any model variable\n", + "12. Apply expression filtering (`filterByExpr`):\n", + " - `min_count = 5`: minimum reads in at least one sample\n", + " - `min_total_count = 15`: minimum total reads across all samples\n", + " - `min_prop = 0.1`: feature expressed in ≥10% of samples\n", + "13. TMM normalization\n", + "14. *(Optional)* Batch correction (`sequencingBatch` and/or `Library`):\n", + " - `limma::removeBatchEffect` (default)\n", + " - `ComBat-seq`\n", + "15. Add `sequencingBatch` and `Library` to model if multi-level\n", + "16. Fit linear model (`voom` + `lmFit` + `eBayes`)\n", + "\n", + "**Model formula (default snATAC-seq):**\n", + "```\n", + "~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich +\n", + " log_med_n_tot_fragment + log_total_unique_peaks +\n", + " [sequencingBatch] + [Library] + pmi + study\n", + "```\n", + "\n", + "> `sequencingBatch` and `Library` are included only if present in metadata and have\n", + "> more than one level. If batch correction was applied, they are removed from the model.\n", + "\n", + "17. Compute `offset + residuals` as final adjusted values:\n", + " - `offset`: predicted value at median/reference covariate levels\n", + " - `residuals`: unexplained variation after removing all covariate effects\n", + "18. *(Optional)* Quantile normalization of final values\n", + "19. Save outputs\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/2_residuals/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_residuals.txt` | Covariate-adjusted values (log2-CPM) |\n", + "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design, parameters |\n", + "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", + "\n", + "**Variables deliberately NOT regressed out:**\n", + "- Sex (`msex`)\n", + "- Age at death (`age_death`)\n", + "\n", + "**Timing:** < 5 min per cell type" + ] + }, + { + "cell_type": "markdown", + "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "# snATAC-seq\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --input-dir output/atac_seq/1_files_with_sampleid \\\n", + " --output-dir output/atac_seq \\\n", + " --blacklist-file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", + " --covariates-file data/atac_seq/rosmap_cov.txt \\\n", + " --batch-correction FALSE \\\n", + " --min-count 5 \\\n", + " --celltype Ast Ex In Microglia Oligo OPC\n", + "\n", + "# snRNA-seq\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --input-dir output/snrna_seq/1_files_with_sampleid \\\n", + " --output-dir output/snrna_seq \\\n", + " --covariates-file data/snrna_seq/covariates.txt \\\n", + " --min-count 5 \\\n", + " --batch-correction FALSE \\\n", + " --quant-norm TRUE \\ # add this if you want quantile normalized output\n", + " --celltype Ast Ex In Microglia Oligo OPC\n" + ] + }, + { + "cell_type": "markdown", + "id": "25e96ad2-1b75-43d0-978e-0757bc11f135", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Batch Correction (Optional)\n", + "\n", + "Runs between TMM normalization (step 15) and model fitting (step 
18).\n", + "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n", + "\n", + "> When batch correction is applied, `sequencingBatch` and `Library` are **removed** from\n", + "> the model formula since their variance has already been removed from the counts.\n", + "\n", + "**Method comparison:**\n", + "\n", + "| | ComBat-seq | limma `removeBatchEffect` |\n", + "|---|---|---|\n", + "| **Operates on** | Raw integer counts | log-CPM values |\n", + "| **Mean-variance modelling** | Yes | No |\n", + "| **Best for** | Large, balanced batches | Small or fragmented batches |\n", + "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n", + "\n", + "**ComBat-seq:**\n", + "```r\n", + "dge$counts <- ComBat_seq(as.matrix(dge$counts), batch = batches)\n", + "```\n", + "\n", + "**limma `removeBatchEffect`:**\n", + "```r\n", + "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n", + "logCPM <- removeBatchEffect(logCPM, batch = factor(batches))\n", + "dge$counts <- round(pmax(2^logCPM, 0))\n", + "```\n", + "\n", + "**Additional filtering applied before correction:**\n", + "- Singleton batches (only 1 sample in a batch) are removed prior to correction\n", + "\n", + "**Parameters:**\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `batch_correction` | `FALSE` | Enable batch correction |\n", + "| `batch_method` | `limma` | Method to use (`limma` or `combat`) |\n", + "\n", + "**Command:**\n", + "```bash\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " ... \\\n", + " --batch_correction TRUE \\\n", + " --batch_method limma\n", + "```\n", + "\n", + "**Effect on RDS output:**\n", + "\n", + "The `{celltype}_results.rds` file will include:\n", + "- `batch_correction = TRUE`\n", + "- `batch_method = \"limma\"` or `\"combat\"`" + ] + }, + { + "cell_type": "markdown", + "id": "9bad900d-768d-45ee-815a-6847e8eba32e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC with batch correction\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --input-dir output/atac_seq/1_files_with_sampleid \\\n", + " --output-dir output/atac_seq \\\n", + " --blacklist-file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", + " --covariates-file data/atac_seq/rosmap_cov.txt \\\n", + " --batch-correction TRUE \\\n", + " --batch-method limma \\\n", + " --min-count 5 \\\n", + " --celltype Ast Ex In Microglia Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", + "metadata": {}, + "source": [ + "### Additional parameters\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", + "metadata": {}, + "outputs": [], + "source": [ + "# All available pseudobulk_qc parameters with defaults\n", + "--min-count 5\n", + "--min-total-count 15\n", + "--min-prop 0.1\n", + "--min-nuclei 20\n", + "--sample-list '' # path to file with one sample ID per line\n", + "--tech-vars log_n_nuclei med_nucleosome_signal med_tss_enrich log_med_n_tot_fragment log_total_unique_peaks# snATAC-seq defaults; for snRNA-seq use e.g.: log_n_nuclei percent_mito log_n_genes" + ] + }, + { + "cell_type": "markdown", + "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", + "metadata": { + "kernel": 
"SoS" + }, + "source": [ + "## Step 2: Format Output\n", + "\n", + "### Phenotype Reformatting (exclusively for snATAC-seq)\n", + "\n", + "Converts residuals into a QTL-ready BED format for genome-wide QTL mapping.\n", + "Works for both snATAC-seq and snRNA-seq.\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `{celltype}_residuals.txt` | `{output_dir}/2_residuals/{celltype}/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Read residuals file with proper handling of feature IDs and sample columns\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Convert to midpoint coordinates (standard for QTLtools):\n", + "```\n", + "start = floor((peak_start + peak_end) / 2)\n", + "end = start + 1\n", + "```\n", + "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample values\n", + "5. Sort by chromosome and position\n", + "6. Compress with `bgzip` and index with `tabix`\n", + "\n", + "**Output:** `{output_dir}/3_pheno_reformat/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_{modality}_phenotype.bed.gz` | bgzip-compressed BED with midpoint coordinates |\n", + "| `{celltype}_{modality}_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", + "\n", + "**Use case:** Standard QTL mapping to identify genetic variants affecting chromatin\n", + "accessibility (caQTL) or gene expression (eQTL), with biological variation preserved.\n", + "Compatible with FastQTL, TensorQTL, and QTLtools.\n", + "\n", + "**Timing:** < 1 min per cell type\n", + "\n", + "**Note** For snRNA-seq, please follow this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb phenotype_formatting \\\n", + " --input-dir output/atac_seq/2_residuals \\\n", + " --output-dir output/atac_seq \\\n", + " --celltype Ast Ex In Mic Oligo OPC" + ] + }, + { + "cell_type": "markdown", + "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Region Filtering\n", + "\n", + "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n", + "\n", + "**Input:**\n", + "\n", + "| File | Location |\n", + "|------|----------|\n", + "| `{celltype}_filtered_raw_counts.txt` | `{output_dir}/2_residuals/{celltype}/` |\n", + "\n", + "**Process:**\n", + "\n", + "1. Read filtered raw counts per cell type\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Calculate per-peak metrics:\n", + " - `peakwidth`: `end - start`\n", + " - `midpoint`: `(start + end) / 2`\n", + "4. Filter peaks overlapping any target region — includes peaks that start, end, or span region boundaries\n", + "5. 
Calculate summary statistics per peak:\n", + " - `total_count`: sum of counts across all samples\n", + " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n", + "\n", + "**Output:** `{output_dir}/3_region_filter/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_filtered_regions_of_interest.txt` | Full count matrix for peaks in target regions |\n", + "| `{celltype}_filtered_regions_of_interest_summary.txt` | Peak metadata with coordinates and count statistics |\n", + "\n", + "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n", + "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n", + "\n", + "**Timing:** < 1 min per cell type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f944afdd-fffc-4b56-863f-eee89408cfa1", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "#snATAC-seq \n", + "sos run pipeline/pseudobulk_preprocessing.ipynb region_filtering \\\n", + " --input-dir output/atac_seq/2_residuals \\\n", + " --output-dir output/atac_seq \\\n", + " --celltype Ast Ex In Mic Oligo OPC \\\n", + " --regions \"chr7:28000000-28300000,chr11:85050000-86200000\"\n", + "\n", + "#snRNA-seq\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb region_filtering \\\n", + " --input-dir output/snrna_seq/2_residuals \\\n", + " --output-dir output/snrna_seq \\\n", + " --celltype MIC \\\n", + " --gene-list \"ENSG00000000010\"" + ] + }, + { + "cell_type": "markdown", + "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "id": "0e17a301-cca9-49a1-843b-4248546f1f79", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Setup and global parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path(\"output\")\n", + "parameter: job_size = 1\n", + "parameter: walltime = \"5h\"\n", + "parameter: mem = \"16G\"\n", + "parameter: numThreads = 8\n", + "parameter: container = \"\"\n", + "\n", + "import re\n", + "from sos.utils import expand_size\n", + "\n", + "entrypoint = (\n", + " 'micromamba run -a \"\" -n' + ' ' +\n", + " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", + ") if container else \"\"\n", + "\n", + "cwd = path(f'{cwd:a}')" + ] + }, + { + "cell_type": "markdown", + "id": "eee58015-c8e2-4697-bdae-58d7e494640d", + "metadata": {}, + "source": [ + "```\n", + " usage: sos run pipeline/pseudobulk_preprocessing.ipynb\n", + " [workflow_name | -t targets] [options] [workflow_options]\n", + " workflow_name: Single or combined workflows defined in this script\n", + " targets: One or more targets to generate\n", + " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", + " workflow_options: Double-hyphen workflow-specific parameters\n", + "Workflows:\n", + " sampleid_mapping\n", + " pseudobulk_qc\n", + " phenotype_formatting\n", + " region_filtering\n", + "Global Workflow Options:\n", + " --cwd output (as path)\n", + " --job-size 1 (as 
int)\n", + " --walltime 5h\n", + " --mem 16G\n", + " --numThreads 8 (as int)\n", + " --container ''\n", + "Sections\n", + " sampleid_mapping:\n", + " Workflow Options:\n", + " --map-file VAL (as str, required)\n", + " --input-dir VAL (as str, required)\n", + " --output-dir VAL (as str, required)\n", + " --celltype Ast Ex In Microglia Oligo OPC (as list)\n", + " --suffix ''\n", + " pseudobulk_qc:\n", + " Workflow Options:\n", + " --celltype Ast Ex In Microglia Oligo OPC (as list)\n", + " --input-dir VAL (as str, required)\n", + " --output-dir VAL (as str, required)\n", + " --covariates-file VAL (as str, required)\n", + " --blacklist-file ''\n", + " --sample-list ''\n", + " --tech-vars log_n_nuclei med_nucleosome_signal med_tss_enrich log_med_n_tot_fragment log_total_unique_peaks (as list)\n", + " --batch-correction FALSE\n", + " --batch-method limma\n", + " --quant-norm FALSE\n", + " --min-count 5 (as int)\n", + " --min-total-count 15 (as int)\n", + " --min-prop 0.1 (as float)\n", + " --min-nuclei 20 (as int)\n", + " --suffix ''\n", + " phenotype_formatting:\n", + " Workflow Options:\n", + " --celltype Ast Ex In Mic Oligo OPC (as list)\n", + " --input-dir VAL (as str, required)\n", + " --output-dir VAL (as str, required)\n", + " region_filtering:\n", + " Workflow Options:\n", + " --celltype Ast Ex In Mic Oligo OPC (as list)\n", + " Parameters\n", + " --input-dir VAL (as str, required)\n", + " --output-dir VAL (as str, required)\n", + " --regions ''\n", + " --gene-list ''\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `sampleid_mapping`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[sampleid_mapping]\n", + "parameter: map_file = str\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n", + "parameter: suffix = ''\n", + "\n", + "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "output: [f'{output_dir}/1_files_with_sampleid/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "\n", + "python: expand = \"${ }\"\n", + "\n", + "import pandas as pd\n", + "import gzip\n", + "import os\n", + "import subprocess\n", + "import csv\n", + "import numpy as np\n", + "import tempfile\n", + "\n", + "map_df = pd.read_csv(\"${map_file}\")\n", + "id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", + "\n", + "celltype = ${celltype}\n", + "input_dir = \"${input_dir}\"\n", + "output_dir = \"${output_dir}/1_files_with_sampleid\"\n", + "suffix = \"${suffix}\"\n", + "\n", + "os.makedirs(output_dir, exist_ok=True)\n", + "\n", + "def map_id(ind_id):\n", + " return id_map.get(ind_id, ind_id)\n", + "\n", + "def format_value(val):\n", + " if pd.isna(val):\n", + " return ''\n", + " if isinstance(val, (int, np.integer)):\n", + " return str(val)\n", + " if isinstance(val, (float, np.floating)):\n", + " if val == int(val):\n", + " return str(int(val))\n", + " else:\n", + " return str(val)\n", + " return str(val)\n", + "\n", + "def find_count_file(input_dir, ct, suffix):\n", + " candidates = [\n", + " f for f in os.listdir(input_dir)\n", + " if f.endswith(f\"{ct}{suffix}.csv.gz\") or f.endswith(f\"_{ct}{suffix}.csv.gz\")\n", + " ]\n", + " if not candidates:\n", + " return None, None\n", + " preferred = [f for f in candidates if 
f.endswith(f\"_{ct}{suffix}.csv.gz\")]\n", + " fname = preferred[0] if preferred else candidates[0]\n", + " return os.path.join(input_dir, fname), fname\n", + "\n", + "# ── Process metadata ───────────────────────────────────────────────────────\n", + "for ct in celltype:\n", + " fname = f\"metadata_{ct}{suffix}.csv\"\n", + " in_path = os.path.join(input_dir, fname)\n", + " out_path = os.path.join(output_dir, fname)\n", + "\n", + " if not os.path.exists(in_path):\n", + " print(f\"Skipping metadata (not found): {fname}\")\n", + " continue\n", + "\n", + " meta = pd.read_csv(in_path)\n", + "\n", + " if \"individualID\" not in meta.columns:\n", + " print(f\"Warning: individualID column not found in {fname}\")\n", + " continue\n", + "\n", + " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", + " cols = meta.columns.tolist()\n", + " cols.remove(\"sampleid\")\n", + " cols.remove(\"individualID\")\n", + " meta = meta[[\"sampleid\", \"individualID\"] + cols]\n", + "\n", + " with open(out_path, 'w', newline='') as f:\n", + " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", + " writer.writerow(meta.columns)\n", + " for _, row in meta.iterrows():\n", + " writer.writerow([format_value(val) for val in row])\n", + "\n", + " print(f\"Processed metadata: {fname}\")\n", + "\n", + "# ── Process count files ────────────────────────────────────────────────────\n", + "for ct in celltype:\n", + " in_path, fname = find_count_file(input_dir, ct, suffix)\n", + "\n", + " if in_path is None:\n", + " print(f\"Skipping counts (not found) for celltype: {ct}\")\n", + " continue\n", + "\n", + " print(f\"Detected count file: {fname}\")\n", + " out_path = os.path.join(output_dir, fname)\n", + "\n", + " with gzip.open(in_path, \"rt\") as fh:\n", + " header_line = fh.readline().rstrip(\"\\n\")\n", + "\n", + " col_names = header_line.split(\",\")\n", + " peak_id_col = col_names[0]\n", + " new_sample_cols = [map_id(s) for s in col_names[1:]]\n", + " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", + "\n", + " tmp = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", + " tmp.write(new_header + \"\\n\")\n", + " tmp.close()\n", + "\n", + " cmd = f\"zcat {in_path} | tail -n +2 | cat {tmp.name} - | gzip -6 > {out_path}\"\n", + " subprocess.run(cmd, shell=True, check=True)\n", + " os.unlink(tmp.name)\n", + "\n", + " print(f\"Processed counts: {fname}\")\n", + "\n", + "print(\"\\nSample ID mapping completed!\")" + ] + }, + { + "cell_type": "markdown", + "id": "f0884ae7-a851-425a-86dd-b606768a012e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `pseudobulk_qc`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[pseudobulk_qc]\n", + "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: covariates_file = str\n", + "parameter: blacklist_file = ''\n", + "parameter: sample_list = ''\n", + "parameter: tech_vars = ['log_n_nuclei','med_nucleosome_signal','med_tss_enrich','log_med_n_tot_fragment','log_total_unique_peaks']\n", + "parameter: batch_correction = \"FALSE\"\n", + "parameter: batch_method = \"limma\"\n", + "parameter: quant_norm = \"FALSE\"\n", + "parameter: min_count = 5\n", + "parameter: min_total_count = 15\n", + "parameter: min_prop = 0.1\n", + "parameter: min_nuclei = 20\n", + "parameter: suffix = ''\n", + "\n", + "input: 
[f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", + "output: [f'{output_dir}/2_residuals/{ct}/{ct}_residuals.txt' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", + "\n", + "cts_str = \"c(\" + \", \".join([f\"'{x}'\" for x in celltype]) + \")\"\n", + "tvs_str = \"c(\" + \", \".join([f\"'{x}'\" for x in tech_vars]) + \")\"\n", + "\n", + "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", + "\n", + " library(edgeR)\n", + " library(limma)\n", + " library(data.table)\n", + " library(GenomicRanges)\n", + " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", + "\n", + " rename_if_found <- function(dt, target, candidates) {\n", + " found <- intersect(candidates, colnames(dt))[1]\n", + " if (!is.na(found) && found != target) setnames(dt, found, target)\n", + " }\n", + "\n", + " standardize_meta <- function(meta) {\n", + " rename_if_found(meta, \"n_nuclei\", c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n", + " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n", + " rename_if_found(meta, \"med_tss_enrich\", c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n", + " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n", + " return(meta)\n", + " }\n", + "\n", + " find_count_file <- function(input_dir, ct, suffix) {\n", + " all_files <- list.files(input_dir, pattern=\"\\\\.csv\\\\.gz$\", full.names=FALSE)\n", + " pattern <- paste0(ct, suffix, \"\\\\.csv\\\\.gz$\")\n", + " candidates <- all_files[grepl(pattern, all_files)]\n", + " if (length(candidates) == 0) return(NULL)\n", + " preferred <- candidates[grepl(paste0(\"_\", ct, suffix, \"\\\\.csv\\\\.gz$\"), candidates)]\n", + " if (length(preferred) > 0) return(file.path(input_dir, preferred[1]))\n", + " return(file.path(input_dir, candidates[1]))\n", + " }\n", + "\n", + " filter_blacklist <- function(mat, bed, feat_label) {\n", + " peaks <- data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " bl <- fread(bed)[, 1:3]\n", + " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", + " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", + " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", + " if (length(blacklisted) > 0) {\n", + " message(\"Blacklisted \", feat_label, \" removed: \", length(blacklisted))\n", + " return(mat[-blacklisted, , drop=FALSE])\n", + " }\n", + " return(mat)\n", + " }\n", + "\n", + " predictOffset <- function(fit) {\n", + " D <- fit$design\n", + " Dm <- D\n", + " for (col in colnames(D)) {\n", + " if (col == \"(Intercept)\") next\n", + " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", + " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", + " else\n", + " Dm[, col] <- 0\n", + " }\n", + " B <- fit$coefficients\n", + " B[is.na(B)] <- 0\n", + " B %*% t(Dm)\n", + " }\n", + "\n", + " cts <- ${cts_str}\n", + " tech_vars <- ${tvs_str}\n", + "\n", + " for (ct in cts) {\n", + " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", + " message(\"Processing: \", ct)\n", + " message(\"Batch correction: \", 
ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", + " message(\"Quantile normalization: \", ifelse(as.logical(\"${quant_norm}\"), \"TRUE\", \"FALSE\"))\n", + " message(paste(rep(\"=\", 40), collapse=\"\"))\n", + "\n", + " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", + " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", + "\n", + " # ── 1. Load data ───────────────────────────────────────────────────\n", + " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n", + "\n", + " counts_file <- find_count_file(\"${input_dir}\", ct, \"${suffix}\")\n", + " if (is.null(counts_file)) stop(\"No count file found for celltype: \", ct)\n", + " message(\"Detected count file: \", basename(counts_file))\n", + "\n", + " counts_raw <- fread(counts_file)\n", + " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", + " rownames(counts) <- counts_raw[[1]]\n", + " rm(counts_raw)\n", + "\n", + " # ── Auto-detect modality ───────────────────────────────────────────\n", + " is_atac <- grepl(\"^chr.*-[0-9]+-[0-9]+$\", rownames(counts)[1])\n", + " feat_label <- ifelse(is_atac, \"peaks\", \"genes\")\n", + " message(\"Detected modality: \", ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\"))\n", + " message(\"Loaded: \", nrow(counts), \" \", feat_label, \" x \", ncol(counts), \" samples\")\n", + "\n", + " # ── 2. Standardize metadata ────────────────────────────────────────\n", + " meta <- standardize_meta(meta)\n", + "\n", + " # ── 3. Sample ID column ───────────────────────────────────────────\n", + " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", + " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", + "\n", + " # ── 4. Nuclei filter ──────────────────────────────────────────────\n", + " if (\"n_nuclei\" %in% colnames(meta)) {\n", + " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n", + " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", + " }\n", + "\n", + " # ── 5. Optional sample list filter ────────────────────────────────\n", + " if (\"${sample_list}\" != \"\" && file.exists(\"${sample_list}\")) {\n", + " keep_ids <- fread(\"${sample_list}\", header=FALSE)[[1]]\n", + " meta <- meta[meta[[idcol]] %in% keep_ids]\n", + " message(\"Samples after sample_list filter: \", nrow(meta))\n", + " }\n", + "\n", + " # ── 6. Align samples ──────────────────────────────────────────────\n", + " common <- intersect(meta[[idcol]], colnames(counts))\n", + " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", + " meta <- meta[match(common, meta[[idcol]])]\n", + " counts <- counts[, common, drop=FALSE]\n", + " message(\"Samples after alignment: \", length(common))\n", + "\n", + " # ── 7. Blacklist filtering ─────────────────────────────────────────\n", + " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", + " counts <- filter_blacklist(counts, \"${blacklist_file}\", feat_label)\n", + " message(feat_label, \" after blacklist filter: \", nrow(counts))\n", + " } else {\n", + " message(\"No blacklist file provided - skipping.\")\n", + " }\n", + "\n", + " # ── 8. 
Load and merge covariates ──────────────────────────────────\n", + " covs <- fread(\"${covariates_file}\")\n", + " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n", + " keep_cols <- c(id2, intersect(c(\"pmi\",\"study\"), colnames(covs)))\n", + " covs <- covs[, ..keep_cols]\n", + " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n", + " meta <- meta[match(common, meta[[idcol]])]\n", + "\n", + " # ── 9. Impute missing PMI ─────────────────────────────────────────\n", + " if (\"pmi\" %in% colnames(meta) && any(is.na(meta$pmi))) {\n", + " message(\"Imputing missing values for: pmi\")\n", + " meta$pmi[is.na(meta$pmi)] <- median(meta$pmi, na.rm=TRUE)\n", + " }\n", + "\n", + " # ── 10. Tech vars ─────────────────────────────────────────────────\n", + " message(\"Tech vars: \", paste(tech_vars, collapse=\", \"))\n", + "\n", + " # ── 11. Compute derived log metrics ───────────────────────────────\n", + " for (tv in tech_vars[startsWith(tech_vars, \"log_\")]) {\n", + " if (tv %in% colnames(meta)) next\n", + " if (tv == \"log_total_unique_peaks\") {\n", + " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n", + " } else {\n", + " raw_col <- sub(\"^log_\", \"\", tv)\n", + " if (raw_col %in% colnames(meta)) {\n", + " meta[[tv]] <- log1p(meta[[raw_col]])\n", + " } else {\n", + " message(\"Warning: cannot compute \", tv, \" - '\", raw_col, \"' not in metadata\")\n", + " }\n", + " }\n", + " }\n", + "\n", + " # ── 12. Select model variables ────────────────────────────────────\n", + " all_vars <- c(intersect(tech_vars, colnames(meta)), \"pmi\", \"study\")\n", + " all_vars <- intersect(all_vars, colnames(meta))\n", + " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n", + "\n", + " # ── 13. Drop samples with NA in model variables ───────────────────\n", + " keep_rows <- complete.cases(meta[, ..all_vars])\n", + " meta <- meta[keep_rows]\n", + " counts <- counts[, meta[[idcol]], drop=FALSE]\n", + " message(\"Valid samples for modelling: \", nrow(meta))\n", + "\n", + " # ── 14. Expression filtering ──────────────────────────────────────\n", + " dge <- DGEList(counts=counts, samples=meta)\n", + " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", + " message(feat_label, \" before expression filter: \", nrow(dge))\n", + "\n", + " keep <- filterByExpr(dge, group=dge$samples$group,\n", + " min.count=${min_count},\n", + " min.total.count=${min_total_count},\n", + " min.prop=${min_prop})\n", + " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", + " message(feat_label, \" after expression filter: \", nrow(dge))\n", + "\n", + " # ── Save filtered raw counts ──────────────────────────────────────\n", + " write.table(dge$counts,\n", + " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " # ── 15. TMM normalization ─────────────────────────────────────────\n", + " dge <- calcNormFactors(dge, method=\"TMM\")\n", + "\n", + " # ── 16. 
Optional batch correction ─────────────────────────────────\n", + " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", + " batches <- dge$samples$sequencingBatch\n", + " batch_counts <- table(batches)\n", + " valid_batches <- names(batch_counts[batch_counts > 1])\n", + " keep_bc <- batches %in% valid_batches\n", + " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", + " batches <- batches[keep_bc]\n", + " message(\"Samples after singleton batch removal: \", ncol(dge))\n", + "\n", + " if (\"${batch_method}\" == \"combat\") {\n", + " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n", + " message(\"ComBat-seq batch correction applied.\")\n", + " } else {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"limma removeBatchEffect applied.\")\n", + " }\n", + " }\n", + "\n", + " # ── 17. Add batch vars to model if multi-level ────────────────────\n", + " other_vars <- setdiff(all_vars, tech_vars)\n", + " batch_vars <- c()\n", + " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$sequencingBatch)) > 1) {\n", + " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", + " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", + " }\n", + " if (\"Library\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$Library)) > 1) {\n", + " dge$samples$Library_factor <- factor(dge$samples$Library)\n", + " batch_vars <- c(batch_vars, \"Library_factor\")\n", + " }\n", + " all_vars <- intersect(c(tech_vars, batch_vars, other_vars),\n", + " c(colnames(dge$samples), colnames(meta)))\n", + "\n", + " # ── 18. Build design matrix ───────────────────────────────────────\n", + " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n", + " design <- model.matrix(form, data=dge$samples)\n", + " message(\"Formula: \", deparse(form))\n", + "\n", + " if (!is.fullrank(design)) {\n", + " message(\"Design not full rank - trimming.\")\n", + " qr_d <- qr(design)\n", + " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", + " }\n", + " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", + "\n", + " # ── 19. Voom + lmFit + eBayes ────────────────────────────────────\n", + " v <- voom(dge, design, plot=FALSE)\n", + " fit <- lmFit(v, design)\n", + " fit <- eBayes(fit)\n", + "\n", + " # ── 20. Offset + residuals ────────────────────────────────────────\n", + " off <- predictOffset(fit)\n", + " res <- residuals(fit, v$E)\n", + " final <- off + res\n", + "\n", + " # ── 21. Save residuals ────────────────────────────────────────────\n", + " out_file <- file.path(outdir, paste0(ct, \"_residuals.txt\"))\n", + "\n", + " write.table(final,\n", + " out_file,\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " feat_label <- if (is_atac) \"Peaks\" else \"Genes\"\n", + "\n", + " message(\"Saved: \", out_file)\n", + " message(\" \", feat_label, \": \", nrow(final), \" | Samples: \", ncol(final))\n", + "\n", + " # ── 22. 
Optional Quantile Normalization ───────────────────────────\n", + " if (as.logical(\"${quant_norm}\")) {\n", + " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", + " message(\"Applying quantile normalization...\")\n", + " message(paste(rep(\"=\", 40), collapse=\"\"))\n", + " \n", + " final_qn <- t(apply(final, 1, rank, ties.method = \"average\"))\n", + " final_qn <- stats::qnorm(final_qn / (ncol(final_qn) + 1))\n", + " \n", + " qn_file <- file.path(outdir, paste0(ct, \"_residuals_qn.txt\"))\n", + " write.table(final_qn,\n", + " qn_file,\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + " \n", + " message(\"Saved QN: \", qn_file)\n", + " message(\" \", feat_label, \": \", nrow(final_qn), \" | Samples: \", ncol(final_qn))\n", + " \n", + " # Save RDS with QN\n", + " saveRDS(list(\n", + " dge = dge,\n", + " offset = off,\n", + " residuals = res,\n", + " final_data = final,\n", + " final_data_qn = final_qn,\n", + " valid_samples = colnames(dge),\n", + " design = design,\n", + " fit = fit,\n", + " model = form,\n", + " mode = \"noBIOvar\",\n", + " batch_correction = as.logical(\"${batch_correction}\"),\n", + " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", + " quant_norm = TRUE,\n", + " modality = ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", + " ), file.path(outdir, paste0(ct, \"_results_qn.rds\")))\n", + " } else {\n", + " # Save RDS without QN\n", + " saveRDS(list(\n", + " dge = dge,\n", + " offset = off,\n", + " residuals = res,\n", + " final_data = final,\n", + " valid_samples = colnames(dge),\n", + " design = design,\n", + " fit = fit,\n", + " model = form,\n", + " mode = \"noBIOvar\",\n", + " batch_correction = as.logical(\"${batch_correction}\"),\n", + " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", + " quant_norm = FALSE,\n", + " modality = ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", + " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", + " }\n", + "\n", + " message(\"Completed: \", ct, \" -> \", outdir)\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `phenotype_reformatting`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[phenotype_formatting]\n", + "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "\n", + "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", + "output: [f'{output_dir}/3_pheno_reformat/{ct}_phenotype.bed.gz' for ct in celltype]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + "\n", + " import os\n", + " import subprocess\n", + " import pandas as pd\n", + "\n", + " celltypes = ${celltype}\n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def read_residuals(path):\n", + " first_line = open(path).readline().rstrip(\"\\n\")\n", + " col_names = first_line.split(\"\\t\")\n", + " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", + " if df.shape[1] > len(col_names):\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names\n", + " else:\n", + 
" peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names[1:]\n", + " return peak_ids, df\n", + "\n", + " def to_midpoint_bed(peak_ids, residuals):\n", + " \"\"\"Convert snATAC-seq peak IDs (chr-start-end) to midpoint BED format.\"\"\"\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " chrs = parts[0].values\n", + " starts = parts[1].astype(int).values\n", + " ends = parts[2].astype(int).values\n", + " mids = ((starts + ends) // 2).astype(int)\n", + " bed = pd.DataFrame({\n", + " \"#chr\": chrs,\n", + " \"start\": mids,\n", + " \"end\": mids + 1,\n", + " \"ID\": peak_ids\n", + " })\n", + " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", + " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", + "\n", + " def run_cmd(cmd, label):\n", + " r = subprocess.run(cmd, capture_output=True)\n", + " if r.returncode != 0:\n", + " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", + " else:\n", + " print(f\"{label}: OK\")\n", + "\n", + " for ct in celltypes:\n", + " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", + "\n", + " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", + " os.makedirs(out_dir, exist_ok=True)\n", + "\n", + " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n", + " if not os.path.exists(res_path):\n", + " print(f\"WARNING: {res_path} not found, skipping.\")\n", + " continue\n", + "\n", + " peak_ids, residuals = read_residuals(res_path)\n", + " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", + "\n", + " bed = to_midpoint_bed(peak_ids, residuals)\n", + " out_bed = os.path.join(out_dir, f\"{ct}_phenotype.bed\")\n", + " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", + " print(f\"Written: {out_bed}\")\n", + "\n", + " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", + " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", + " print(f\"Completed: {ct} -> {out_dir}\")" + ] + }, + { + "cell_type": "markdown", + "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `region_filtering`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[region_filtering]\n", + "# Parameters\n", + "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", + "parameter: input_dir = str\n", + "parameter: output_dir = str\n", + "parameter: regions = \"\"\n", + "parameter: gene_list = \"\" # Note: Use --gene_list in command line\n", + "\n", + "# SoS Input/Output logic\n", + "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in (celltype if isinstance(celltype, list) else [celltype])]\n", + "output: [f'{output_dir}/3_region_filter/{ct}_filtered_regions_of_interest.txt' for ct in (celltype if isinstance(celltype, list) else [celltype])]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + " import os\n", + " import pandas as pd\n", + "\n", + " # Handle SoS passing single strings vs lists\n", + " raw_ct = ${celltype!r}\n", + " celltypes = [raw_ct] if isinstance(raw_ct, str) else raw_ct\n", + " \n", + " input_dir = \"${input_dir}\"\n", + " output_dir = \"${output_dir}\"\n", + " regions_str = \"${regions}\"\n", + " gene_list_str 
= \"${gene_list}\"\n", + "\n", + " def parse_regions(region_str):\n", + " if not region_str or region_str.strip() == \"\":\n", + " return []\n", + " result = []\n", + " for r in region_str.split(\",\"):\n", + " chrom, coords = r.strip().split(\":\")\n", + " start, end = coords.split(\"-\")\n", + " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n", + " return result\n", + "\n", + " def parse_peak_ids(peak_ids):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " return pd.DataFrame({\n", + " \"chr\": parts[0].values,\n", + " \"start\": parts[1].astype(int).values,\n", + " \"end\": parts[2].astype(int).values\n", + " })\n", + "\n", + " def overlaps_region(chr_col, start_col, end_col, reg):\n", + " return (\n", + " (chr_col == reg[\"chr\"]) &\n", + " (start_col < reg[\"end\"]) &\n", + " (end_col > reg[\"start\"])\n", + " )\n", + "\n", + " regions = parse_regions(regions_str)\n", + " \n", + " genes_to_filter = None\n", + " if gene_list_str and gene_list_str.strip():\n", + " genes_to_filter = set([g.strip() for g in gene_list_str.split(\",\")])\n", + "\n", + " for ct in celltypes:\n", + " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n", + " os.makedirs(reg_dir, exist_ok=True)\n", + "\n", + " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n", + " if not os.path.exists(counts_path):\n", + " continue\n", + "\n", + " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n", + " first_id = df.index[0]\n", + " is_atac = \"-\" in str(first_id) and str(first_id).count(\"-\") >= 2\n", + " \n", + " # Consistent output name to match SoS 'output' definition\n", + " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n", + "\n", + " if is_atac:\n", + " if not regions: continue\n", + " df.index.name = \"peak_id\"\n", + " df = df.reset_index()\n", + " coords = parse_peak_ids(df[\"peak_id\"].values)\n", + " df[\"chr\"], df[\"start\"], df[\"end\"] = coords[\"chr\"], coords[\"start\"], coords[\"end\"]\n", + " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n", + " \n", + " mask = pd.Series(False, index=df.index)\n", + " for reg in regions:\n", + " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n", + "\n", + " region_df = df[mask].copy()\n", + " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n", + " \n", + " else:\n", + " if not genes_to_filter: continue\n", + " df.index.name = \"gene_name\"\n", + " genes_present = set(df.index) & genes_to_filter\n", + " if not genes_present: continue\n", + " \n", + " region_df = df.loc[list(genes_present)].copy()\n", + " # FIX: Use the same filename as defined in the SoS 'output' block\n", + " region_df.to_csv(full_out, sep=\"\\t\")\n", + "\n", + " print(f\"Completed: {ct}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.4.3" + }, + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "sos", + "", + "" + ] + ], + "version": "" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 6452ce6089541c7a3efb8aa822255f7be572aba8 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 26 Feb 2026 17:48:17 -0500 Subject: [PATCH 08/12] Delete code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb --- .../QC/pseudobulk_preprocessing.ipynb | 1442 ----------------- 1 file changed, 
1442 deletions(-) delete mode 100644 code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb deleted file mode 100644 index 9a7a8a59f..000000000 --- a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb +++ /dev/null @@ -1,1442 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Single-nuclei Pseudobulk Preprocessing (RNA-seq and ATAC-seq) Pipeline\n", - "\n", - "## Overview\n", - "\n", - "This pipeline preprocesses single-nuclei pseudobulk count data (snATAC-seq or snRNA-seq)\n", - "for downstream QTL analysis and region-specific studies.\n", - "\n", - "**Goals:**\n", - "- Transform raw pseudobulk counts into analysis-ready formats\n", - "- Remove technical confounders while preserving biological covariates (sex, age)\n", - "- Generate QTL-ready phenotype files or region-specific datasets\n", - "\n", - "## Pipeline Structure\n", - "\n", - "```\n", - "Step 0: Sample ID Mapping [sampleid_mapping]\n", - " ↓\n", - "Step 1: Pseudobulk QC [pseudobulk_qc]\n", - " noBIOvar: regress out technical covariates only\n", - " (msex and age_death deliberately preserved)\n", - " ↓ (optional)\n", - " Batch Correction (ComBat-seq or limma::removeBatchEffect)\n", - " ↓ (optional)\n", - " Quantile Normalization\n", - " ↓\n", - "Step 2: Format Output\n", - " ├── Phenotype Reformatting → BED [phenotype_formatting] (genome-wide QTL mapping, snATAC-seq only, locus-specific)\n", - " └── Region Peak Filtering → TSV [region_filtering] (gene filtering for snRNA-seq)\n", - "```\n", - "\n", - "## Modality Support\n", - "\n", - "| Feature | snATAC-seq | snRNA-seq |\n", - "|---------|-----------|-----------|\n", - "| Count file auto-detected | ✓ | ✓ |\n", - "| Default `tech_vars` | `log_n_nuclei`, `med_nucleosome_signal`, `med_tss_enrich`, `log_med_n_tot_fragment`, `log_total_unique_peaks` | custom via `--tech_vars` |\n", - "| Blacklist filtering | ✓ | — |\n", - "| `region_filtering` step | ✓ | — |\n", - "| `phenotype_formatting` step | ✓ | ✓ |\n", - "\n", - "For snRNA-seq, override `tech_vars` to match your metadata columns, e.g.:\n", - "```bash\n", - "--tech_vars log_n_nuclei percent_mito log_n_genes\n", - "```\n", - "\n", - "Any `tech_var` starting with `log_` is automatically derived via `log1p()` from the\n", - "raw column of the same name with `log_` stripped. 
No code changes needed across modalities.\n", - "\n", - "## Input Files\n", - "\n", - "All input files required to run this pipeline can be downloaded\n", - "[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n", - "\n", - "| File | Used in |\n", - "|------|---------|\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Step 0, Step 1 |\n", - "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Step 0, Step 1 |\n", - "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", - "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", - "| `rosmap_cov.txt` | Step 1 |\n", - "| `hg38-blacklist.v2.bed.gz` | Step 1 (snATAC-seq only) |\n", - "\n", - "Count files are **auto-detected** from `input_dir` — no prefix parameter needed.\n", - "\n", - "## Parameters\n", - "\n", - "### `sampleid_mapping`\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `map_file` | *required* | CSV with `individualID` → `sampleid` mapping |\n", - "| `input_dir` | *required* | Directory with raw metadata and count files |\n", - "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/1_files_with_sampleid/` |\n", - "| `celltype` | `['Ast','Ex','In','Microglia','Oligo','OPC']` | Cell types to process |\n", - "| `suffix` | `''` | Optional filename suffix (e.g. `_50nuc`) |\n", - "\n", - "### `pseudobulk_qc`\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `input_dir` | *required* | Directory with remapped metadata and count files |\n", - "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/2_residuals/{ct}/` |\n", - "| `covariates_file` | *required* | Covariate file with `pmi` and `study` columns |\n", - "| `blacklist_file` | `''` | Genomic blacklist BED file (snATAC-seq only) |\n", - "| `sample_list` | `''` | Optional file with one sample ID per line to subset |\n", - "| `tech_vars` | `['log_n_nuclei','med_nucleosome_signal','med_tss_enrich','log_med_n_tot_fragment','log_total_unique_peaks']` | Technical covariates for the model |\n", - "| `batch_correction` | `FALSE` | Apply batch correction (`TRUE`/`FALSE`) |\n", - "| `batch_method` | `limma` | Batch correction method (`limma` or `combat`) |\n", - "| `quant_norm` | `FALSE` | Apply quantile normalization after residuals |\n", - "| `min_count` | `5` | Min reads in at least one sample |\n", - "| `min_total_count` | `15` | Min total reads across all samples |\n", - "| `min_prop` | `0.1` | Min proportion of samples with expression |\n", - "| `min_nuclei` | `20` | Min nuclei per sample |\n", - "| `celltype` | `['Ast','Ex','In','Microglia','Oligo','OPC']` | Cell types to process |\n", - "| `suffix` | `''` | Optional filename suffix |\n", - "\n", - "### `phenotype_formatting`\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `input_dir` | *required* | Directory containing `{ct}/{ct}_residuals.txt` |\n", - "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_pheno_reformat/` |\n", - "| `modality` | `snatac` | Modality label used in output filename (`snatac` or `snrna`) |\n", - "| `celltype` | `['Ast','Ex','In','Mic','Oligo','OPC']` | Cell types to process |\n", - "\n", - "### `region_filtering` *(snATAC-seq only)*\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `input_dir` | *required* | Directory containing `{ct}/{ct}_filtered_raw_counts.txt` |\n", - "| 
`output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_region_filter/` |\n", - "| `regions` | `chr7:28000000-28300000,...` | Comma-separated genomic regions of interest |\n", - "| `celltype` | `['Ast','Ex','In','Mic','Oligo','OPC']` | Cell types to process |\n", - "\n", - "## Minimal Working Example" - ] - }, - { - "cell_type": "markdown", - "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 0: Sample ID Mapping\n", - "\n", - "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", - "across metadata and count matrix files.\n", - "\n", - "### Input\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", - "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 *(snATAC-seq)* | Per-cell-type peak count matrices |\n", - "| `pseudobulk_counts_{celltype}.csv.gz` × 6 *(snRNA-seq)* | Per-cell-type gene count matrices |\n", - "\n", - "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n", - "\n", - "Count files are **auto-detected** from `input_dir` — any `.csv.gz` file ending with\n", - "`{celltype}{suffix}` will be found regardless of prefix (`pseudobulk_peaks_counts_`,\n", - "`pseudobulk_counts_`, etc.).\n", - "\n", - "### Process\n", - "\n", - "**Part 1 — Metadata files**\n", - "\n", - "For each `metadata_{celltype}.csv`:\n", - "1. Look up each `individualID` in the mapping reference\n", - "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", - "3. Reorder columns: `sampleid` first, then `individualID`, then the rest\n", - "4. Save updated file\n", - "\n", - "**Part 2 — Count matrix files**\n", - "\n", - "For each count file detected in `input_dir`:\n", - "1. Auto-detect filename by scanning for `.csv.gz` files matching `{celltype}{suffix}`\n", - "2. Extract the header row (column names only)\n", - "3. Keep the first column (peak or gene IDs) unchanged\n", - "4. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n", - " otherwise keep original\n", - "5. Write new header and stream data rows unchanged\n", - "6. Recompress with gzip\n", - "\n", - "### Output\n", - "\n", - "Output directory: `{output_dir}/1_files_with_sampleid/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n", - "| `{detected_count_filename}` × 6 | Count matrices with mapped column headers |\n", - "\n", - "**Timing:** < 1 min" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb sampleid_mapping \\\n", - " --map-file data/atac_seq/rosmap_sample_mapping_data.csv \\\n", - " --input-dir data/atac_seq/1_files_with_sampleid \\\n", - " --output-dir output/atac_seq \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "5540a4da-843a-4789-8123-47911cf519c5", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 1: Pseudobulk QC\n", - "\n", - "Regresses out technical covariates while preserving biological variation (sex, age) for\n", - "downstream QTL analysis. 
Works for both snATAC-seq and snRNA-seq.\n", - "\n", - "### Input\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `pseudobulk_*counts_{celltype}.csv.gz` *(auto-detected)* | `1_files_with_sampleid/` |\n", - "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n", - "| `rosmap_cov.txt` | `data/` |\n", - "| `hg38-blacklist.v2.bed.gz` *(snATAC-seq, optional)* | `data/` |\n", - "\n", - "### Process\n", - "\n", - "1. Load metadata per cell type; auto-detect and load count matrix from `input_dir`\n", - "2. Standardize metadata column names across datasets\n", - "3. Filter samples with fewer than `min_nuclei` nuclei (default: 20)\n", - "4. *(Optional)* Subset to samples listed in `sample_list` file\n", - "5. Align samples between metadata and count matrix\n", - "6. *(Optional)* Filter blacklisted genomic regions (`blacklist_file`)\n", - "7. Merge with demographic covariates (`pmi`, `study`) from `covariates_file`\n", - "8. Impute missing `pmi` values with median\n", - "9. Load `tech_vars` from parameter — any variable prefixed with `log_` is automatically\n", - " derived via `log1p()` from the raw column of the same name:\n", - " - e.g. `log_n_nuclei` ← `log1p(n_nuclei)`\n", - " - e.g. `log_total_unique_peaks` ← `log1p(colSums(counts > 0))`\n", - " - Works for both snATAC-seq and snRNA-seq without code changes\n", - "10. Build model variable list — `msex` and `age_death` are **deliberately excluded**\n", - "11. Drop samples with NA in any model variable\n", - "12. Apply expression filtering (`filterByExpr`):\n", - " - `min_count = 5`: minimum reads in at least one sample\n", - " - `min_total_count = 15`: minimum total reads across all samples\n", - " - `min_prop = 0.1`: feature expressed in ≥10% of samples\n", - "13. TMM normalization\n", - "14. *(Optional)* Batch correction (`sequencingBatch` and/or `Library`):\n", - " - `limma::removeBatchEffect` (default)\n", - " - `ComBat-seq`\n", - "15. Add `sequencingBatch` and `Library` to model if multi-level\n", - "16. Fit linear model (`voom` + `lmFit` + `eBayes`)\n", - "\n", - "**Model formula (default snATAC-seq):**\n", - "```\n", - "~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich +\n", - " log_med_n_tot_fragment + log_total_unique_peaks +\n", - " [sequencingBatch] + [Library] + pmi + study\n", - "```\n", - "\n", - "> `sequencingBatch` and `Library` are included only if present in metadata and have\n", - "> more than one level. If batch correction was applied, they are removed from the model.\n", - "\n", - "17. Compute `offset + residuals` as final adjusted values:\n", - " - `offset`: predicted value at median/reference covariate levels\n", - " - `residuals`: unexplained variation after removing all covariate effects\n", - "18. *(Optional)* Quantile normalization of final values\n", - "19. 
Save outputs\n", - "\n", - "### Output\n", - "\n", - "Output directory: `{output_dir}/2_residuals/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Covariate-adjusted values (log2-CPM) |\n", - "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design, parameters |\n", - "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", - "\n", - "**Variables deliberately NOT regressed out:**\n", - "- Sex (`msex`)\n", - "- Age at death (`age_death`)\n", - "\n", - "**Timing:** < 5 min per cell type" - ] - }, - { - "cell_type": "markdown", - "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "# snATAC-seq\n", - "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", - " --input-dir output/atac_seq/1_files_with_sampleid \\\n", - " --output-dir output/atac_seq \\\n", - " --blacklist-file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates-file data/atac_seq/rosmap_cov.txt \\\n", - " --batch-correction FALSE \\\n", - " --min-count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC\n", - "\n", - "# snRNA-seq\n", - "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", - " --input-dir output/snrna_seq/1_files_with_sampleid \\\n", - " --output-dir output/snrna_seq \\\n", - " --covariates-file data/snrna_seq/covariates.txt \\\n", - " --min-count 5 \\\n", - " --batch-correction FALSE \\\n", - " --quant-norm TRUE \\ # add this if you want quantile normalized output\n", - " --celltype Ast Ex In Microglia Oligo OPC\n" - ] - }, - { - "cell_type": "markdown", - "id": "25e96ad2-1b75-43d0-978e-0757bc11f135", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Batch Correction (Optional)\n", - "\n", - "Runs between TMM normalization (step 15) and model fitting (step 18).\n", - "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n", - "\n", - "> When batch correction is applied, `sequencingBatch` and `Library` are **removed** from\n", - "> the model formula since their variance has already been removed from the counts.\n", - "\n", - "**Method comparison:**\n", - "\n", - "| | ComBat-seq | limma `removeBatchEffect` |\n", - "|---|---|---|\n", - "| **Operates on** | Raw integer counts | log-CPM values |\n", - "| **Mean-variance modelling** | Yes | No |\n", - "| **Best for** | Large, balanced batches | Small or fragmented batches |\n", - "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n", - "\n", - "**ComBat-seq:**\n", - "```r\n", - "dge$counts <- ComBat_seq(as.matrix(dge$counts), batch = batches)\n", - "```\n", - "\n", - "**limma `removeBatchEffect`:**\n", - "```r\n", - "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n", - "logCPM <- removeBatchEffect(logCPM, batch = factor(batches))\n", - "dge$counts <- round(pmax(2^logCPM, 0))\n", - "```\n", - "\n", - "**Additional filtering applied before correction:**\n", - "- Singleton batches (only 1 sample in a batch) are removed prior to correction\n", - "\n", - "**Parameters:**\n", - "\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `batch_correction` | `FALSE` | Enable batch correction |\n", - "| `batch_method` | 
`limma` | Method to use (`limma` or `combat`) |\n", - "\n", - "**Command:**\n", - "```bash\n", - "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", - " ... \\\n", - " --batch_correction TRUE \\\n", - " --batch_method limma\n", - "```\n", - "\n", - "**Effect on RDS output:**\n", - "\n", - "The `{celltype}_results.rds` file will include:\n", - "- `batch_correction = TRUE`\n", - "- `batch_method = \"limma\"` or `\"combat\"`" - ] - }, - { - "cell_type": "markdown", - "id": "9bad900d-768d-45ee-815a-6847e8eba32e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC with batch correction\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", - " --input-dir output/atac_seq/1_files_with_sampleid \\\n", - " --output-dir output/atac_seq \\\n", - " --blacklist-file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n", - " --covariates-file data/atac_seq/rosmap_cov.txt \\\n", - " --batch-correction TRUE \\\n", - " --batch-method limma \\\n", - " --min-count 5 \\\n", - " --celltype Ast Ex In Microglia Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", - "metadata": {}, - "source": [ - "### Additional parameters\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", - "metadata": {}, - "outputs": [], - "source": [ - "# All available pseudobulk_qc parameters with defaults\n", - "--min-count 5\n", - "--min-total-count 15\n", - "--min-prop 0.1\n", - "--min-nuclei 20\n", - "--sample-list '' # path to file with one sample ID per line\n", - "--tech-vars log_n_nuclei med_nucleosome_signal med_tss_enrich log_med_n_tot_fragment log_total_unique_peaks# snATAC-seq defaults; for snRNA-seq use e.g.: log_n_nuclei percent_mito log_n_genes" - ] - }, - { - "cell_type": "markdown", - "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 2: Format Output\n", - "\n", - "### Phenotype Reformatting (exclusively for snATAC-seq)\n", - "\n", - "Converts residuals into a QTL-ready BED format for genome-wide QTL mapping.\n", - "Works for both snATAC-seq and snRNA-seq.\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `{celltype}_residuals.txt` | `{output_dir}/2_residuals/{celltype}/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Read residuals file with proper handling of feature IDs and sample columns\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Convert to midpoint coordinates (standard for QTLtools):\n", - "```\n", - "start = floor((peak_start + peak_end) / 2)\n", - "end = start + 1\n", - "```\n", - "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample values\n", - "5. Sort by chromosome and position\n", - "6. 
Compress with `bgzip` and index with `tabix`\n", - "\n", - "**Output:** `{output_dir}/3_pheno_reformat/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_{modality}_phenotype.bed.gz` | bgzip-compressed BED with midpoint coordinates |\n", - "| `{celltype}_{modality}_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", - "\n", - "**Use case:** Standard QTL mapping to identify genetic variants affecting chromatin\n", - "accessibility (caQTL) or gene expression (eQTL), with biological variation preserved.\n", - "Compatible with FastQTL, TensorQTL, and QTLtools.\n", - "\n", - "**Timing:** < 1 min per cell type\n", - "\n", - "**Note** For snRNA-seq, please follow this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb phenotype_formatting \\\n", - " --input-dir output/atac_seq/2_residuals \\\n", - " --output-dir output/atac_seq \\\n", - " --celltype Ast Ex In Mic Oligo OPC" - ] - }, - { - "cell_type": "markdown", - "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Region Filtering\n", - "\n", - "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n", - "\n", - "**Input:**\n", - "\n", - "| File | Location |\n", - "|------|----------|\n", - "| `{celltype}_filtered_raw_counts.txt` | `{output_dir}/2_residuals/{celltype}/` |\n", - "\n", - "**Process:**\n", - "\n", - "1. Read filtered raw counts per cell type\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Calculate per-peak metrics:\n", - " - `peakwidth`: `end - start`\n", - " - `midpoint`: `(start + end) / 2`\n", - "4. Filter peaks overlapping any target region — includes peaks that start, end, or span region boundaries\n", - "5. 
Calculate summary statistics per peak:\n", - " - `total_count`: sum of counts across all samples\n", - " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n", - "\n", - "**Output:** `{output_dir}/3_region_filter/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_filtered_regions_of_interest.txt` | Full count matrix for peaks in target regions |\n", - "| `{celltype}_filtered_regions_of_interest_summary.txt` | Peak metadata with coordinates and count statistics |\n", - "\n", - "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n", - "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n", - "\n", - "**Timing:** < 1 min per cell type" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f944afdd-fffc-4b56-863f-eee89408cfa1", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "#snATAC-seq \n", - "sos run pipeline/pseudobulk_preprocessing.ipynb region_filtering \\\n", - " --input-dir output/atac_seq/2_residuals \\\n", - " --output-dir output/atac_seq \\\n", - " --celltype Ast Ex In Mic Oligo OPC \\\n", - " --regions \"chr7:28000000-28300000,chr11:85050000-86200000\"\n", - "\n", - "#snRNA-seq\n", - "sos run pipeline/pseudobulk_preprocessing.ipynb region_filtering \\\n", - " --input-dir output/snrna_seq/2_residuals \\\n", - " --output-dir output/snrna_seq \\\n", - " --celltype MIC \\\n", - " --gene-list \"ENSG00000000010\"" - ] - }, - { - "cell_type": "markdown", - "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "id": "0e17a301-cca9-49a1-843b-4248546f1f79", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Setup and global parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "parameter: cwd = path(\"output\")\n", - "parameter: job_size = 1\n", - "parameter: walltime = \"5h\"\n", - "parameter: mem = \"16G\"\n", - "parameter: numThreads = 8\n", - "parameter: container = \"\"\n", - "\n", - "import re\n", - "from sos.utils import expand_size\n", - "\n", - "entrypoint = (\n", - " 'micromamba run -a \"\" -n' + ' ' +\n", - " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", - ") if container else \"\"\n", - "\n", - "cwd = path(f'{cwd:a}')" - ] - }, - { - "cell_type": "markdown", - "id": "eee58015-c8e2-4697-bdae-58d7e494640d", - "metadata": {}, - "source": [ - "```\n", - " usage: sos run pipeline/pseudobulk_preprocessing.ipynb\n", - " [workflow_name | -t targets] [options] [workflow_options]\n", - " workflow_name: Single or combined workflows defined in this script\n", - " targets: One or more targets to generate\n", - " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", - " workflow_options: Double-hyphen workflow-specific parameters\n", - "Workflows:\n", - " sampleid_mapping\n", - " pseudobulk_qc\n", - " phenotype_formatting\n", - " region_filtering\n", - "Global Workflow Options:\n", - " --cwd output (as path)\n", - " --job-size 1 (as 
int)\n", - " --walltime 5h\n", - " --mem 16G\n", - " --numThreads 8 (as int)\n", - " --container ''\n", - "Sections\n", - " sampleid_mapping:\n", - " Workflow Options:\n", - " --map-file VAL (as str, required)\n", - " --input-dir VAL (as str, required)\n", - " --output-dir VAL (as str, required)\n", - " --celltype Ast Ex In Microglia Oligo OPC (as list)\n", - " --suffix ''\n", - " pseudobulk_qc:\n", - " Workflow Options:\n", - " --celltype Ast Ex In Microglia Oligo OPC (as list)\n", - " --input-dir VAL (as str, required)\n", - " --output-dir VAL (as str, required)\n", - " --covariates-file VAL (as str, required)\n", - " --blacklist-file ''\n", - " --sample-list ''\n", - " --tech-vars log_n_nuclei med_nucleosome_signal med_tss_enrich log_med_n_tot_fragment log_total_unique_peaks (as list)\n", - " --batch-correction FALSE\n", - " --batch-method limma\n", - " --quant-norm FALSE\n", - " --min-count 5 (as int)\n", - " --min-total-count 15 (as int)\n", - " --min-prop 0.1 (as float)\n", - " --min-nuclei 20 (as int)\n", - " --suffix ''\n", - " phenotype_formatting:\n", - " Workflow Options:\n", - " --celltype Ast Ex In Mic Oligo OPC (as list)\n", - " --input-dir VAL (as str, required)\n", - " --output-dir VAL (as str, required)\n", - " region_filtering:\n", - " Workflow Options:\n", - " --celltype Ast Ex In Mic Oligo OPC (as list)\n", - " Parameters\n", - " --input-dir VAL (as str, required)\n", - " --output-dir VAL (as str, required)\n", - " --regions ''\n", - " --gene-list ''\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `sampleid_mapping`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[sampleid_mapping]\n", - "parameter: map_file = str\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n", - "parameter: suffix = ''\n", - "\n", - "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "output: [f'{output_dir}/1_files_with_sampleid/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "\n", - "python: expand = \"${ }\"\n", - "\n", - "import pandas as pd\n", - "import gzip\n", - "import os\n", - "import subprocess\n", - "import csv\n", - "import numpy as np\n", - "import tempfile\n", - "\n", - "map_df = pd.read_csv(\"${map_file}\")\n", - "id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", - "\n", - "celltype = ${celltype}\n", - "input_dir = \"${input_dir}\"\n", - "output_dir = \"${output_dir}/1_files_with_sampleid\"\n", - "suffix = \"${suffix}\"\n", - "\n", - "os.makedirs(output_dir, exist_ok=True)\n", - "\n", - "def map_id(ind_id):\n", - " return id_map.get(ind_id, ind_id)\n", - "\n", - "def format_value(val):\n", - " if pd.isna(val):\n", - " return ''\n", - " if isinstance(val, (int, np.integer)):\n", - " return str(val)\n", - " if isinstance(val, (float, np.floating)):\n", - " if val == int(val):\n", - " return str(int(val))\n", - " else:\n", - " return str(val)\n", - " return str(val)\n", - "\n", - "def find_count_file(input_dir, ct, suffix):\n", - " candidates = [\n", - " f for f in os.listdir(input_dir)\n", - " if f.endswith(f\"{ct}{suffix}.csv.gz\") or f.endswith(f\"_{ct}{suffix}.csv.gz\")\n", - " ]\n", - " if not candidates:\n", - " return None, None\n", - " preferred = [f for f in candidates if 
f.endswith(f\"_{ct}{suffix}.csv.gz\")]\n", - " fname = preferred[0] if preferred else candidates[0]\n", - " return os.path.join(input_dir, fname), fname\n", - "\n", - "# ── Process metadata ───────────────────────────────────────────────────────\n", - "for ct in celltype:\n", - " fname = f\"metadata_{ct}{suffix}.csv\"\n", - " in_path = os.path.join(input_dir, fname)\n", - " out_path = os.path.join(output_dir, fname)\n", - "\n", - " if not os.path.exists(in_path):\n", - " print(f\"Skipping metadata (not found): {fname}\")\n", - " continue\n", - "\n", - " meta = pd.read_csv(in_path)\n", - "\n", - " if \"individualID\" not in meta.columns:\n", - " print(f\"Warning: individualID column not found in {fname}\")\n", - " continue\n", - "\n", - " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", - " cols = meta.columns.tolist()\n", - " cols.remove(\"sampleid\")\n", - " cols.remove(\"individualID\")\n", - " meta = meta[[\"sampleid\", \"individualID\"] + cols]\n", - "\n", - " with open(out_path, 'w', newline='') as f:\n", - " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", - " writer.writerow(meta.columns)\n", - " for _, row in meta.iterrows():\n", - " writer.writerow([format_value(val) for val in row])\n", - "\n", - " print(f\"Processed metadata: {fname}\")\n", - "\n", - "# ── Process count files ────────────────────────────────────────────────────\n", - "for ct in celltype:\n", - " in_path, fname = find_count_file(input_dir, ct, suffix)\n", - "\n", - " if in_path is None:\n", - " print(f\"Skipping counts (not found) for celltype: {ct}\")\n", - " continue\n", - "\n", - " print(f\"Detected count file: {fname}\")\n", - " out_path = os.path.join(output_dir, fname)\n", - "\n", - " with gzip.open(in_path, \"rt\") as fh:\n", - " header_line = fh.readline().rstrip(\"\\n\")\n", - "\n", - " col_names = header_line.split(\",\")\n", - " peak_id_col = col_names[0]\n", - " new_sample_cols = [map_id(s) for s in col_names[1:]]\n", - " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", - "\n", - " tmp = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", - " tmp.write(new_header + \"\\n\")\n", - " tmp.close()\n", - "\n", - " cmd = f\"zcat {in_path} | tail -n +2 | cat {tmp.name} - | gzip -6 > {out_path}\"\n", - " subprocess.run(cmd, shell=True, check=True)\n", - " os.unlink(tmp.name)\n", - "\n", - " print(f\"Processed counts: {fname}\")\n", - "\n", - "print(\"\\nSample ID mapping completed!\")" - ] - }, - { - "cell_type": "markdown", - "id": "f0884ae7-a851-425a-86dd-b606768a012e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `pseudobulk_qc`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[pseudobulk_qc]\n", - "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: covariates_file = str\n", - "parameter: blacklist_file = ''\n", - "parameter: sample_list = ''\n", - "parameter: tech_vars = ['log_n_nuclei','med_nucleosome_signal','med_tss_enrich','log_med_n_tot_fragment','log_total_unique_peaks']\n", - "parameter: batch_correction = \"FALSE\"\n", - "parameter: batch_method = \"limma\"\n", - "parameter: quant_norm = \"FALSE\"\n", - "parameter: min_count = 5\n", - "parameter: min_total_count = 15\n", - "parameter: min_prop = 0.1\n", - "parameter: min_nuclei = 20\n", - "parameter: suffix = ''\n", - "\n", - "input: 
[f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n", - "output: [f'{output_dir}/2_residuals/{ct}/{ct}_residuals.txt' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", - "\n", - "cts_str = \"c(\" + \", \".join([f\"'{x}'\" for x in celltype]) + \")\"\n", - "tvs_str = \"c(\" + \", \".join([f\"'{x}'\" for x in tech_vars]) + \")\"\n", - "\n", - "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", - "\n", - " library(edgeR)\n", - " library(limma)\n", - " library(data.table)\n", - " library(GenomicRanges)\n", - " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", - "\n", - " rename_if_found <- function(dt, target, candidates) {\n", - " found <- intersect(candidates, colnames(dt))[1]\n", - " if (!is.na(found) && found != target) setnames(dt, found, target)\n", - " }\n", - "\n", - " standardize_meta <- function(meta) {\n", - " rename_if_found(meta, \"n_nuclei\", c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n", - " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n", - " rename_if_found(meta, \"med_tss_enrich\", c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n", - " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n", - " return(meta)\n", - " }\n", - "\n", - " find_count_file <- function(input_dir, ct, suffix) {\n", - " all_files <- list.files(input_dir, pattern=\"\\\\.csv\\\\.gz$\", full.names=FALSE)\n", - " pattern <- paste0(ct, suffix, \"\\\\.csv\\\\.gz$\")\n", - " candidates <- all_files[grepl(pattern, all_files)]\n", - " if (length(candidates) == 0) return(NULL)\n", - " preferred <- candidates[grepl(paste0(\"_\", ct, suffix, \"\\\\.csv\\\\.gz$\"), candidates)]\n", - " if (length(preferred) > 0) return(file.path(input_dir, preferred[1]))\n", - " return(file.path(input_dir, candidates[1]))\n", - " }\n", - "\n", - " filter_blacklist <- function(mat, bed, feat_label) {\n", - " peaks <- data.table(id = rownames(mat))\n", - " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", - " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " bl <- fread(bed)[, 1:3]\n", - " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", - " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", - " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", - " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", - " if (length(blacklisted) > 0) {\n", - " message(\"Blacklisted \", feat_label, \" removed: \", length(blacklisted))\n", - " return(mat[-blacklisted, , drop=FALSE])\n", - " }\n", - " return(mat)\n", - " }\n", - "\n", - " predictOffset <- function(fit) {\n", - " D <- fit$design\n", - " Dm <- D\n", - " for (col in colnames(D)) {\n", - " if (col == \"(Intercept)\") next\n", - " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", - " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", - " else\n", - " Dm[, col] <- 0\n", - " }\n", - " B <- fit$coefficients\n", - " B[is.na(B)] <- 0\n", - " B %*% t(Dm)\n", - " }\n", - "\n", - " cts <- ${cts_str}\n", - " tech_vars <- ${tvs_str}\n", - "\n", - " for (ct in cts) {\n", - " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", - " message(\"Processing: \", ct)\n", - " message(\"Batch correction: \", 
ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", - " message(\"Quantile normalization: \", ifelse(as.logical(\"${quant_norm}\"), \"TRUE\", \"FALSE\"))\n", - " message(paste(rep(\"=\", 40), collapse=\"\"))\n", - "\n", - " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", - " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", - "\n", - " # ── 1. Load data ───────────────────────────────────────────────────\n", - " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n", - "\n", - " counts_file <- find_count_file(\"${input_dir}\", ct, \"${suffix}\")\n", - " if (is.null(counts_file)) stop(\"No count file found for celltype: \", ct)\n", - " message(\"Detected count file: \", basename(counts_file))\n", - "\n", - " counts_raw <- fread(counts_file)\n", - " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", - " rownames(counts) <- counts_raw[[1]]\n", - " rm(counts_raw)\n", - "\n", - " # ── Auto-detect modality ───────────────────────────────────────────\n", - " is_atac <- grepl(\"^chr.*-[0-9]+-[0-9]+$\", rownames(counts)[1])\n", - " feat_label <- ifelse(is_atac, \"peaks\", \"genes\")\n", - " message(\"Detected modality: \", ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\"))\n", - " message(\"Loaded: \", nrow(counts), \" \", feat_label, \" x \", ncol(counts), \" samples\")\n", - "\n", - " # ── 2. Standardize metadata ────────────────────────────────────────\n", - " meta <- standardize_meta(meta)\n", - "\n", - " # ── 3. Sample ID column ───────────────────────────────────────────\n", - " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", - " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", - "\n", - " # ── 4. Nuclei filter ──────────────────────────────────────────────\n", - " if (\"n_nuclei\" %in% colnames(meta)) {\n", - " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n", - " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", - " }\n", - "\n", - " # ── 5. Optional sample list filter ────────────────────────────────\n", - " if (\"${sample_list}\" != \"\" && file.exists(\"${sample_list}\")) {\n", - " keep_ids <- fread(\"${sample_list}\", header=FALSE)[[1]]\n", - " meta <- meta[meta[[idcol]] %in% keep_ids]\n", - " message(\"Samples after sample_list filter: \", nrow(meta))\n", - " }\n", - "\n", - " # ── 6. Align samples ──────────────────────────────────────────────\n", - " common <- intersect(meta[[idcol]], colnames(counts))\n", - " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", - " meta <- meta[match(common, meta[[idcol]])]\n", - " counts <- counts[, common, drop=FALSE]\n", - " message(\"Samples after alignment: \", length(common))\n", - "\n", - " # ── 7. Blacklist filtering ─────────────────────────────────────────\n", - " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", - " counts <- filter_blacklist(counts, \"${blacklist_file}\", feat_label)\n", - " message(feat_label, \" after blacklist filter: \", nrow(counts))\n", - " } else {\n", - " message(\"No blacklist file provided - skipping.\")\n", - " }\n", - "\n", - " # ── 8. 
Load and merge covariates ──────────────────────────────────\n", - " covs <- fread(\"${covariates_file}\")\n", - " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n", - " keep_cols <- c(id2, intersect(c(\"pmi\",\"study\"), colnames(covs)))\n", - " covs <- covs[, ..keep_cols]\n", - " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n", - " meta <- meta[match(common, meta[[idcol]])]\n", - "\n", - " # ── 9. Impute missing PMI ─────────────────────────────────────────\n", - " if (\"pmi\" %in% colnames(meta) && any(is.na(meta$pmi))) {\n", - " message(\"Imputing missing values for: pmi\")\n", - " meta$pmi[is.na(meta$pmi)] <- median(meta$pmi, na.rm=TRUE)\n", - " }\n", - "\n", - " # ── 10. Tech vars ─────────────────────────────────────────────────\n", - " message(\"Tech vars: \", paste(tech_vars, collapse=\", \"))\n", - "\n", - " # ── 11. Compute derived log metrics ───────────────────────────────\n", - " for (tv in tech_vars[startsWith(tech_vars, \"log_\")]) {\n", - " if (tv %in% colnames(meta)) next\n", - " if (tv == \"log_total_unique_peaks\") {\n", - " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n", - " } else {\n", - " raw_col <- sub(\"^log_\", \"\", tv)\n", - " if (raw_col %in% colnames(meta)) {\n", - " meta[[tv]] <- log1p(meta[[raw_col]])\n", - " } else {\n", - " message(\"Warning: cannot compute \", tv, \" - '\", raw_col, \"' not in metadata\")\n", - " }\n", - " }\n", - " }\n", - "\n", - " # ── 12. Select model variables ────────────────────────────────────\n", - " all_vars <- c(intersect(tech_vars, colnames(meta)), \"pmi\", \"study\")\n", - " all_vars <- intersect(all_vars, colnames(meta))\n", - " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n", - "\n", - " # ── 13. Drop samples with NA in model variables ───────────────────\n", - " keep_rows <- complete.cases(meta[, ..all_vars])\n", - " meta <- meta[keep_rows]\n", - " counts <- counts[, meta[[idcol]], drop=FALSE]\n", - " message(\"Valid samples for modelling: \", nrow(meta))\n", - "\n", - " # ── 14. Expression filtering ──────────────────────────────────────\n", - " dge <- DGEList(counts=counts, samples=meta)\n", - " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", - " message(feat_label, \" before expression filter: \", nrow(dge))\n", - "\n", - " keep <- filterByExpr(dge, group=dge$samples$group,\n", - " min.count=${min_count},\n", - " min.total.count=${min_total_count},\n", - " min.prop=${min_prop})\n", - " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", - " message(feat_label, \" after expression filter: \", nrow(dge))\n", - "\n", - " # ── Save filtered raw counts ──────────────────────────────────────\n", - " write.table(dge$counts,\n", - " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " # ── 15. TMM normalization ─────────────────────────────────────────\n", - " dge <- calcNormFactors(dge, method=\"TMM\")\n", - "\n", - " # ── 16. 
Optional batch correction ─────────────────────────────────\n", - " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", - " batches <- dge$samples$sequencingBatch\n", - " batch_counts <- table(batches)\n", - " valid_batches <- names(batch_counts[batch_counts > 1])\n", - " keep_bc <- batches %in% valid_batches\n", - " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", - " batches <- batches[keep_bc]\n", - " message(\"Samples after singleton batch removal: \", ncol(dge))\n", - "\n", - " if (\"${batch_method}\" == \"combat\") {\n", - " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n", - " message(\"ComBat-seq batch correction applied.\")\n", - " } else {\n", - " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", - " dge$counts <- round(pmax(2^logCPM, 0))\n", - " message(\"limma removeBatchEffect applied.\")\n", - " }\n", - " }\n", - "\n", - " # ── 17. Add batch vars to model if multi-level ────────────────────\n", - " other_vars <- setdiff(all_vars, tech_vars)\n", - " batch_vars <- c()\n", - " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$sequencingBatch)) > 1) {\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", - " }\n", - " if (\"Library\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$Library)) > 1) {\n", - " dge$samples$Library_factor <- factor(dge$samples$Library)\n", - " batch_vars <- c(batch_vars, \"Library_factor\")\n", - " }\n", - " all_vars <- intersect(c(tech_vars, batch_vars, other_vars),\n", - " c(colnames(dge$samples), colnames(meta)))\n", - "\n", - " # ── 18. Build design matrix ───────────────────────────────────────\n", - " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n", - " design <- model.matrix(form, data=dge$samples)\n", - " message(\"Formula: \", deparse(form))\n", - "\n", - " if (!is.fullrank(design)) {\n", - " message(\"Design not full rank - trimming.\")\n", - " qr_d <- qr(design)\n", - " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", - " }\n", - " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", - "\n", - " # ── 19. Voom + lmFit + eBayes ────────────────────────────────────\n", - " v <- voom(dge, design, plot=FALSE)\n", - " fit <- lmFit(v, design)\n", - " fit <- eBayes(fit)\n", - "\n", - " # ── 20. Offset + residuals ────────────────────────────────────────\n", - " off <- predictOffset(fit)\n", - " res <- residuals(fit, v$E)\n", - " final <- off + res\n", - "\n", - " # ── 21. Save residuals ────────────────────────────────────────────\n", - " out_file <- file.path(outdir, paste0(ct, \"_residuals.txt\"))\n", - "\n", - " write.table(final,\n", - " out_file,\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " feat_label <- if (is_atac) \"Peaks\" else \"Genes\"\n", - "\n", - " message(\"Saved: \", out_file)\n", - " message(\" \", feat_label, \": \", nrow(final), \" | Samples: \", ncol(final))\n", - "\n", - " # ── 22. 
Optional Quantile Normalization ───────────────────────────\n", - " if (as.logical(\"${quant_norm}\")) {\n", - " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", - " message(\"Applying quantile normalization...\")\n", - " message(paste(rep(\"=\", 40), collapse=\"\"))\n", - " \n", - " final_qn <- t(apply(final, 1, rank, ties.method = \"average\"))\n", - " final_qn <- stats::qnorm(final_qn / (ncol(final_qn) + 1))\n", - " \n", - " qn_file <- file.path(outdir, paste0(ct, \"_residuals_qn.txt\"))\n", - " write.table(final_qn,\n", - " qn_file,\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - " \n", - " message(\"Saved QN: \", qn_file)\n", - " message(\" \", feat_label, \": \", nrow(final_qn), \" | Samples: \", ncol(final_qn))\n", - " \n", - " # Save RDS with QN\n", - " saveRDS(list(\n", - " dge = dge,\n", - " offset = off,\n", - " residuals = res,\n", - " final_data = final,\n", - " final_data_qn = final_qn,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = form,\n", - " mode = \"noBIOvar\",\n", - " batch_correction = as.logical(\"${batch_correction}\"),\n", - " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", - " quant_norm = TRUE,\n", - " modality = ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", - " ), file.path(outdir, paste0(ct, \"_results_qn.rds\")))\n", - " } else {\n", - " # Save RDS without QN\n", - " saveRDS(list(\n", - " dge = dge,\n", - " offset = off,\n", - " residuals = res,\n", - " final_data = final,\n", - " valid_samples = colnames(dge),\n", - " design = design,\n", - " fit = fit,\n", - " model = form,\n", - " mode = \"noBIOvar\",\n", - " batch_correction = as.logical(\"${batch_correction}\"),\n", - " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", - " quant_norm = FALSE,\n", - " modality = ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", - " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", - " }\n", - "\n", - " message(\"Completed: \", ct, \" -> \", outdir)\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `phenotype_reformatting`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[phenotype_formatting]\n", - "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "\n", - "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n", - "output: [f'{output_dir}/3_pheno_reformat/{ct}_phenotype.bed.gz' for ct in celltype]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - "\n", - " import os\n", - " import subprocess\n", - " import pandas as pd\n", - "\n", - " celltypes = ${celltype}\n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}\"\n", - "\n", - " def read_residuals(path):\n", - " first_line = open(path).readline().rstrip(\"\\n\")\n", - " col_names = first_line.split(\"\\t\")\n", - " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", - " if df.shape[1] > len(col_names):\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names\n", - " else:\n", - 
" peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names[1:]\n", - " return peak_ids, df\n", - "\n", - " def to_midpoint_bed(peak_ids, residuals):\n", - " \"\"\"Convert snATAC-seq peak IDs (chr-start-end) to midpoint BED format.\"\"\"\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " chrs = parts[0].values\n", - " starts = parts[1].astype(int).values\n", - " ends = parts[2].astype(int).values\n", - " mids = ((starts + ends) // 2).astype(int)\n", - " bed = pd.DataFrame({\n", - " \"#chr\": chrs,\n", - " \"start\": mids,\n", - " \"end\": mids + 1,\n", - " \"ID\": peak_ids\n", - " })\n", - " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", - " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", - "\n", - " def run_cmd(cmd, label):\n", - " r = subprocess.run(cmd, capture_output=True)\n", - " if r.returncode != 0:\n", - " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", - " else:\n", - " print(f\"{label}: OK\")\n", - "\n", - " for ct in celltypes:\n", - " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", - "\n", - " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", - " os.makedirs(out_dir, exist_ok=True)\n", - "\n", - " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n", - " if not os.path.exists(res_path):\n", - " print(f\"WARNING: {res_path} not found, skipping.\")\n", - " continue\n", - "\n", - " peak_ids, residuals = read_residuals(res_path)\n", - " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", - "\n", - " bed = to_midpoint_bed(peak_ids, residuals)\n", - " out_bed = os.path.join(out_dir, f\"{ct}_phenotype.bed\")\n", - " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", - " print(f\"Written: {out_bed}\")\n", - "\n", - " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", - " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", - " print(f\"Completed: {ct} -> {out_dir}\")" - ] - }, - { - "cell_type": "markdown", - "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `region_filtering`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[region_filtering]\n", - "# Parameters\n", - "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n", - "parameter: input_dir = str\n", - "parameter: output_dir = str\n", - "parameter: regions = \"\"\n", - "parameter: gene_list = \"\" # Note: Use --gene_list in command line\n", - "\n", - "# SoS Input/Output logic\n", - "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in (celltype if isinstance(celltype, list) else [celltype])]\n", - "output: [f'{output_dir}/3_region_filter/{ct}_filtered_regions_of_interest.txt' for ct in (celltype if isinstance(celltype, list) else [celltype])]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - " import os\n", - " import pandas as pd\n", - "\n", - " # Handle SoS passing single strings vs lists\n", - " raw_ct = ${celltype!r}\n", - " celltypes = [raw_ct] if isinstance(raw_ct, str) else raw_ct\n", - " \n", - " input_dir = \"${input_dir}\"\n", - " output_dir = \"${output_dir}\"\n", - " regions_str = \"${regions}\"\n", - " gene_list_str 
= \"${gene_list}\"\n", - "\n", - " def parse_regions(region_str):\n", - " if not region_str or region_str.strip() == \"\":\n", - " return []\n", - " result = []\n", - " for r in region_str.split(\",\"):\n", - " chrom, coords = r.strip().split(\":\")\n", - " start, end = coords.split(\"-\")\n", - " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n", - " return result\n", - "\n", - " def parse_peak_ids(peak_ids):\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " return pd.DataFrame({\n", - " \"chr\": parts[0].values,\n", - " \"start\": parts[1].astype(int).values,\n", - " \"end\": parts[2].astype(int).values\n", - " })\n", - "\n", - " def overlaps_region(chr_col, start_col, end_col, reg):\n", - " return (\n", - " (chr_col == reg[\"chr\"]) &\n", - " (start_col < reg[\"end\"]) &\n", - " (end_col > reg[\"start\"])\n", - " )\n", - "\n", - " regions = parse_regions(regions_str)\n", - " \n", - " genes_to_filter = None\n", - " if gene_list_str and gene_list_str.strip():\n", - " genes_to_filter = set([g.strip() for g in gene_list_str.split(\",\")])\n", - "\n", - " for ct in celltypes:\n", - " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n", - " os.makedirs(reg_dir, exist_ok=True)\n", - "\n", - " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n", - " if not os.path.exists(counts_path):\n", - " continue\n", - "\n", - " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n", - " first_id = df.index[0]\n", - " is_atac = \"-\" in str(first_id) and str(first_id).count(\"-\") >= 2\n", - " \n", - " # Consistent output name to match SoS 'output' definition\n", - " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n", - "\n", - " if is_atac:\n", - " if not regions: continue\n", - " df.index.name = \"peak_id\"\n", - " df = df.reset_index()\n", - " coords = parse_peak_ids(df[\"peak_id\"].values)\n", - " df[\"chr\"], df[\"start\"], df[\"end\"] = coords[\"chr\"], coords[\"start\"], coords[\"end\"]\n", - " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n", - " \n", - " mask = pd.Series(False, index=df.index)\n", - " for reg in regions:\n", - " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n", - "\n", - " region_df = df[mask].copy()\n", - " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n", - " \n", - " else:\n", - " if not genes_to_filter: continue\n", - " df.index.name = \"gene_name\"\n", - " genes_present = set(df.index) & genes_to_filter\n", - " if not genes_present: continue\n", - " \n", - " region_df = df.loc[list(genes_present)].copy()\n", - " # FIX: Use the same filename as defined in the SoS 'output' block\n", - " region_df.to_csv(full_out, sep=\"\\t\")\n", - "\n", - " print(f\"Completed: {ct}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.4.3" - }, - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "sos", - "", - "" - ] - ], - "version": "" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 82ffceffec115f912838fd0cb8aea2e56dbcd3da Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Thu, 26 Feb 2026 17:49:09 -0500 Subject: [PATCH 09/12] Modify some codes --- .../QC/pseudobulk_preprocessing.ipynb | 1045 +++++++++++++++++ 1 file changed, 1045 insertions(+) create mode 100644 
code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb new file mode 100644 index 000000000..9beaa42f7 --- /dev/null +++ b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb @@ -0,0 +1,1045 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Single-nuclei Pseudobulk Preprocessing (RNA-seq and ATAC-seq) Pipeline\n", + "\n", + "## Overview\n", + "\n", + "This pipeline preprocesses single-nuclei pseudobulk **count** data (snATAC-seq or snRNA-seq) for downstream QTL analysis and region-specific studies.\n", + "\n", + "**Goals:**\n", + "- Transform raw pseudobulk counts into analysis-ready formats\n", + "- Remove technical confounders\n", + "- Generate QTL-ready phenotype files or region-specific datasets\n", + "\n", + "## Pipeline Structure\n", + "\n", + "```\n", + "Step 0: Sample ID Mapping [sampleid_mapping]\n", + " ↓\n", + "Step 1: Pseudobulk QC [pseudobulk_qc]\n", + " (optional) Region Peak/Gene Filtering \n", + " (optional) Batch Correction (ComBat or limma)\n", + " (optional) Quantile Normalization\n", + " ↓\n", + "Step 2: Phenotype Reformatting → BED [phenotype_formatting]\n", + " (genome-wide QTL mapping, snATAC-seq only) \n", + "```\n", + "\n", + "## Modality Support\n", + "\n", + "| Feature | snATAC-seq | snRNA-seq |\n", + "|---------|-----------|-----------|\n", + "| Sample ID mapping | ✓ | ✓ |\n", + "| Region/gene filtering | ✓ (`--regions`) | ✓ (`--gene-list`) |\n", + "| Blacklist filtering | ✓ | — |\n", + "| `pseudobulk_qc` step | ✓ | ✓ |\n", + "| `phenotype_formatting` step | ✓ | — (refer to this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb)) |\n", + "\n", + "## Input Files\n", + "\n", + "All toy input files required to run this pipeline can be downloaded\n", + "[here](https://drive.google.com/drive/folders/13ORslmqWTpICMIufhj_mrdL1KxQsG4lH?usp=drive_link).\n", + "\n", + "| File | Used in |\n", + "|------|---------|\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Step 0, Step 1 |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Step 0, Step 1 |\n", + "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", + "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", + "| `tech_vars_{celltype}.csv` | Step 1 |\n", + "| `hg38-blacklist.v2.bed.gz` | Step 1 (snATAC-seq only) |\n", + "\n", + "\n", + "## Minimal Working Example" + ] + }, + { + "cell_type": "markdown", + "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 0: Sample ID Mapping\n", + "\n", + "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", + "across metadata and count matrix files.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", + "| `metadata_{celltype}.csv` | Per-cell-type sample metadata |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Per-cell-type peak count matrices |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Per-cell-type gene count matrices |\n", + "\n", + "### Process\n", + "\n", + "**Part 1 — Metadata files**\n", + "\n", + "For each metadata file:\n", + "1. 
Look up each `individualID` in the mapping reference\n", + "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", + "3. Reorder columns: `sampleid` first, then `individualID`, then the rest\n", + "4. Save updated file\n", + "\n", + "**Part 2 — Count matrix files**\n", + "\n", + "For each count file:\n", + "1. Extract the header row (column names only)\n", + "2. Keep the first column (peak or gene IDs) unchanged\n", + "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists, otherwise keep original\n", + "4. Write new header and stream data rows unchanged\n", + "5. Recompress with gzip\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `map_file` | *required* | CSV with `individualID` → `sampleid` mapping |\n", + "| `meta_files` | *required* | Metadata CSV files to remap |\n", + "| `count_files` | *required* | Count CSV.gz files to remap |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/1_files_with_sampleid/` |\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/1_files_with_sampleid/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` | Metadata with `sampleid` column prepended |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Count matrices with mapped column headers |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Count matrices with mapped column headers |\n", + "\n", + "\n", + "**Timing:** < 1 min" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [ + { + "ename": "ERROR", + "evalue": "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\n", + "output_type": "error", + "traceback": [ + "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\nTraceback:\n" + ] + } + ], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb sampleid_mapping \\\n", + " --output-dir output/snatac_seq \\\n", + " --map-file data/rosmap_sample_mapping_data.csv \\\n", + " --meta-files data/snatac_seq/metadata_Mic_50nuc.csv \\\n", + " --count-files data/snatac_seq/pseudobulk_peaks_counts_Mic_50nuc.csv.gz\n" + ] + }, + { + "cell_type": "markdown", + "id": "5540a4da-843a-4789-8123-47911cf519c5", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1: Pseudobulk QC\n", + "\n", + "Regresses out technical covariates for downstream QTL analysis. Works for both snATAC-seq and snRNA-seq.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` | Sample-level metadata (nuclei counts, batch info) |\n", + "| `pseudobulk_*counts_{celltype}.csv.gz` | Pseudobulk count matrix |\n", + "| `tech_vars.csv` | Technical covariates (sampleid + tech var columns, pre-processed) |\n", + "| `hg38-blacklist.v2.bed.gz` *(snATAC-seq, optional)* | Blacklisted genomic regions |\n", + "\n", + "### Process\n", + "\n", + "1. Load count matrix and auto-detect modality (snATAC-seq vs snRNA-seq)\n", + "2. ***(Optional)*** Filter to specific genomic regions (snATAC-seq) or gene list (snRNA-seq)\n", + "3. Load metadata; filter samples with fewer than `min_nuclei` nuclei (default: 20)\n", + "4. Align samples between metadata and count matrix\n", + "5. ***(Optional)*** Filter blacklisted genomic regions (snATAC-seq only)\n", + "6. 
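Step 1 infers the modality from the shape of the feature IDs in the count matrix. Below is a minimal R sketch of that check; the helper name `detect_modality` is illustrative and not part of the pipeline, but the regular expression mirrors the one used in the `pseudobulk_qc` code later in this notebook.

```r
# Peak IDs from snATAC-seq look like "chr1-100500-101000"; anything that does
# not match that pattern is treated as a gene ID, i.e. snRNA-seq input.
detect_modality <- function(feature_ids) {
  is_atac <- grepl("^chr.*-[0-9]+-[0-9]+$", feature_ids[1])
  if (is_atac) "snATAC-seq" else "snRNA-seq"
}

detect_modality("chr1-100500-101000")  # "snATAC-seq"
detect_modality("ENSG00000000010")     # "snRNA-seq"
```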
Merge tech vars from `tech_vars_file` by `sampleid` \n", + "7. Drop samples with NA in any tech var\n", + "8. Apply expression filtering (`filterByExpr`):\n", + " - `min_count = 5`: minimum reads in at least one sample\n", + " - `min_total_count = 15`: minimum total reads across all samples\n", + " - `min_prop = 0.1`: feature expressed in ≥10% of samples\n", + "9. TMM normalization\n", + "10. ***(Optional)*** Batch correction on `sequencingBatch`:\n", + " - `limma::removeBatchEffect` (default)\n", + " - `ComBat` (on log-CPM)\n", + "11. Add `sequencingBatch` and `Library` to model if present and multi-level\n", + "12. Fit linear model (`voom` + `lmFit` + `eBayes`) with **tech vars + batch vars only** \n", + "13. Compute `offset + residuals` as final adjusted values:\n", + " - `offset`: intercept + batch effects at reference level\n", + " - `residuals`: variation after removing technical effects; biological signal retained\n", + "14. ***(Optional)*** Quantile normalization of final values\n", + "\n", + "**Model formula:**\n", + "```\n", + "~ {tech_vars} + [sequencingBatch] + [Library]\n", + "```\n", + "> `sequencingBatch` and `Library` included only if present and have more than one level.\n", + "> Biological variables (`pmi`, `study`, `msex`, `age_death` etc.) are **not** included — they should not be regressed out as they may be associated with genotype.\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `meta_files` | *required* | Metadata CSV files (one per cell type) |\n", + "| `count_files` | *required* | Count CSV.gz files (one per cell type, same order as `meta_files`) |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/2_residuals/{ct}/` |\n", + "| `tech_vars_file` | *required* | CSV with `sampleid` + tech var columns |\n", + "| `blacklist_file` | `''` | Genomic blacklist BED file (snATAC-seq only) |\n", + "| `regions` | `''` | Comma-separated genomic regions e.g. `chr7:28000000-28300000` (snATAC-seq) |\n", + "| `gene_list` | `''` | Comma-separated gene IDs e.g. 
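The adjustment in steps 12-13 can be sketched in a few lines of limma/edgeR code. The toy counts and covariates below are illustrative only, and the offset is simplified to the per-feature intercept; the pipeline's `predictOffset` additionally holds continuous technical covariates at their median rather than dropping them.

```r
library(edgeR)
library(limma)

# Toy data: 200 features x 20 samples with two made-up technical covariates.
set.seed(1)
counts <- matrix(rpois(200 * 20, lambda = 50), nrow = 200,
                 dimnames = list(paste0("peak", 1:200), paste0("S", 1:20)))
tech <- data.frame(frip = runif(20, 0.4, 0.8), tss_enrich = rnorm(20, 8, 1))

dge    <- calcNormFactors(DGEList(counts), method = "TMM")
design <- model.matrix(~ frip + tss_enrich, data = tech)  # technical covariates only
v      <- voom(dge, design)
fit    <- eBayes(lmFit(v, design))

# Residuals are on the log2-CPM scale; adding the per-feature intercept back
# keeps the values on an interpretable scale instead of centering them at zero.
res   <- residuals(fit, v$E)
final <- res + fit$coefficients[, "(Intercept)"]
```

Because only technical covariates enter the design, genotype-correlated biology (sex, age, PMI, disease status) remains in the adjusted values for downstream QTL mapping.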
`ENSG00000000010` (snRNA-seq) |\n", + "| `batch_correction` | `FALSE` | Apply batch correction (`TRUE`/`FALSE`) |\n", + "| `batch_method` | `limma` | Batch correction method (`limma` or `combat`) |\n", + "| `quant_norm` | `FALSE` | Apply quantile normalization after residuals |\n", + "| `min_count` | `5` | Min reads in at least one sample |\n", + "| `min_total_count` | `15` | Min total reads across all samples |\n", + "| `min_prop` | `0.1` | Min proportion of samples with expression |\n", + "| `min_nuclei` | `20` | Min nuclei per sample |\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/2_residuals/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_residuals.txt` | Tech-covariate-adjusted values (log2-CPM) |\n", + "| `{celltype}_residuals_qn.txt` | Quantile-normalized adjusted values *(if `quant_norm=TRUE`)* |\n", + "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design, parameters |\n", + "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", + "\n", + "**Timing:** < 5 min per cell type" + ] + }, + { + "cell_type": "markdown", + "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --meta-files output/snatac_seq/1_files_with_sampleid/metadata_Mic_50nuc.csv \\\n", + " --count-files output/snatac_seq/1_files_with_sampleid/pseudobulk_peaks_counts_Mic_50nuc.csv.gz \\\n", + " --output-dir output/snatac_seq \\\n", + " --tech-vars-file data/snatac_seq/tech_vars_MIC.csv \\\n", + " --blacklist-file data/hg38-blacklist.v2.bed.gz #only for snATAC-seq\n", + "\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --meta-files output/snrna_seq/1_files_with_sampleid/metadata_MIC.csv \\\n", + " --count-files output/snrna_seq/1_files_with_sampleid/pseudobulk_counts_MIC.csv.gz \\\n", + " --output-dir output/snrna_seq \\\n", + " --tech-vars-file data/snrna_seq/tech_vars_MIC.csv \\\n", + " --gene-list ENSG00000000010,ENSG00000000020 " + ] + }, + { + "cell_type": "markdown", + "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", + "metadata": {}, + "source": [ + "### Additional parameters\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", + "metadata": {}, + "outputs": [], + "source": [ + "--min-count 5\n", + "--min-total-count 15\n", + "--min-prop 0.1\n", + "--min-nuclei 20\n", + "--quant-norm TRUE\n", + "--batch-correction TRUE \n", + "--batch-method combat # or limma\n", + "--gene-list ENSG00000000010,ENSG00000000020 # for snRNA-seq\n", + "--regions chr7:28000000-28300000 # for snATAC-seq" + ] + }, + { + "cell_type": "markdown", + "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 2: Phenotype Reformatting (snATAC-seq only)\n", + "\n", + "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", + "\n", + "> For snRNA-seq, please follow this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb).\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| 
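The two `--batch-method` options above differ only in how the log-CPM matrix is adjusted before refitting. A minimal sketch with toy data follows (object names are illustrative; the pipeline then rounds `2^logCPM` back into pseudo-counts before the voom fit).

```r
library(edgeR)
library(limma)
library(sva)

set.seed(1)
counts <- matrix(rpois(100 * 12, lambda = 40), nrow = 100)
batch  <- factor(rep(c("b1", "b2"), each = 6))

dge    <- calcNormFactors(DGEList(counts), method = "TMM")
logCPM <- cpm(dge, log = TRUE, prior.count = 1)

# --batch-method limma: subtracts per-batch means from the log-CPM values
corrected_limma  <- removeBatchEffect(logCPM, batch = batch)

# --batch-method combat: empirical-Bayes adjustment of per-batch location and scale
corrected_combat <- ComBat(dat = logCPM, batch = batch)
```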
`{celltype}_residuals.txt` | Residuals from `pseudobulk_qc` |\n", + "\n", + "### Process\n", + "\n", + "1. Read residuals file with proper handling of feature IDs and sample columns\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Convert to midpoint coordinates (standard for QTLtools):\n", + "```\n", + "start = floor((peak_start + peak_end) / 2)\n", + "end = start + 1\n", + "```\n", + "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample values\n", + "5. Sort by chromosome and position\n", + "6. Compress with `bgzip` and index with `tabix`\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `residual_files` | *required* | Residual txt files from `pseudobulk_qc` |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_pheno_reformat/` |\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/3_pheno_reformat/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_phenotype.bed.gz` | bgzip-compressed BED with midpoint coordinates |\n", + "| `{celltype}_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", + "\n", + "Compatible with FastQTL, TensorQTL, and QTLtools.\n", + "\n", + "**Timing:** < 1 min per cell type" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", + "metadata": { + "kernel": "SoS" + }, + "outputs": [ + { + "ename": "ERROR", + "evalue": "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\n", + "output_type": "error", + "traceback": [ + "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\nTraceback:\n" + ] + } + ], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb phenotype_formatting \\\n", + " --residual-files output/snatac_seq/2_residuals/Mic_50nuc/Mic_50nuc_residuals.txt \\\n", + " --output-dir output/snatac_seq" + ] + }, + { + "cell_type": "markdown", + "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "id": "0e17a301-cca9-49a1-843b-4248546f1f79", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Setup and global parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path(\"output\")\n", + "parameter: job_size = 1\n", + "parameter: walltime = \"5h\"\n", + "parameter: mem = \"16G\"\n", + "parameter: numThreads = 8\n", + "parameter: container = \"\"\n", + "\n", + "import re\n", + "from sos.utils import expand_size\n", + "\n", + "entrypoint = (\n", + " 'micromamba run -a \"\" -n' + ' ' +\n", + " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", + ") if container else \"\"\n", + "\n", + "cwd = path(f'{cwd:a}')" + ] + }, + { + "cell_type": "markdown", + "id": "eee58015-c8e2-4697-bdae-58d7e494640d", + "metadata": {}, + "source": [ + "```\n", + "usage: sos run pipeline/pseudobulk_preprocessing.ipynb\n", + " [workflow_name | -t targets] [options] [workflow_options]\n", + " 
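# Illustrative sketch (not part of the workflow): the midpoint conversion used
# by `phenotype_formatting`, rewritten in R. Peak IDs are "chr-start-end"; the
# emitted BED interval is the 1-bp midpoint of each peak (the real output
# names the first column "#chr").
peak_ids <- c("chr7-28001000-28001500", "chr7-28100200-28100900")
parts  <- do.call(rbind, strsplit(peak_ids, "-"))
starts <- as.integer(parts[, 2]); ends <- as.integer(parts[, 3])
mid    <- (starts + ends) %/% 2
bed    <- data.frame(chr = parts[, 1], start = mid, end = mid + 1, ID = peak_ids)
# The pipeline then compresses and indexes the result:
#   bgzip -f <celltype>_phenotype.bed && tabix -p bed <celltype>_phenotype.bed.gz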
workflow_name: Single or combined workflows defined in this script\n", + " targets: One or more targets to generate\n", + " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", + " workflow_options: Double-hyphen workflow-specific parameters\n", + "Workflows:\n", + " sampleid_mapping\n", + " pseudobulk_qc\n", + " phenotype_formatting\n", + "Global Workflow Options:\n", + " --cwd output (as path)\n", + " --job-size 1 (as int)\n", + " --walltime 5h\n", + " --mem 16G\n", + " --numThreads 8 (as int)\n", + " --container ''\n", + "Sections\n", + " sampleid_mapping:\n", + " Workflow Options:\n", + " --map-file VAL (as str, required)\n", + " --output-dir VAL (as str, required)\n", + " --meta-files (as list)\n", + " --count-files (as list)\n", + " pseudobulk_qc:\n", + " Workflow Options:\n", + " --meta-files (as list)\n", + " --count-files (as list)\n", + " --output-dir VAL (as str, required)\n", + " --tech-vars-file VAL (as str, required)\n", + " --blacklist-file ''\n", + " --batch-correction FALSE\n", + " --batch-method limma\n", + " --quant-norm FALSE\n", + " --min-count 5 (as int)\n", + " --min-total-count 15 (as int)\n", + " --min-prop 0.1 (as float)\n", + " --min-nuclei 20 (as int)\n", + " --regions ''\n", + " --gene-list ''\n", + " phenotype_formatting:\n", + " Workflow Options:\n", + " --residual-files (as list)\n", + " --output-dir VAL (as str, required)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `sampleid_mapping`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[sampleid_mapping]\n", + "parameter: map_file = str\n", + "parameter: output_dir = str\n", + "parameter: meta_files = []\n", + "parameter: count_files = []\n", + "\n", + "import os\n", + "\n", + "input: meta_files + count_files\n", + "output: [f'{output_dir}/1_files_with_sampleid/{os.path.basename(f)}' for f in meta_files + count_files]\n", + " \n", + "python: expand = \"${ }\"\n", + "import pandas as pd\n", + "import gzip\n", + "import os\n", + "import subprocess\n", + "import csv\n", + "import numpy as np\n", + "import tempfile\n", + "\n", + "map_df = pd.read_csv(\"${map_file}\")\n", + "id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", + "output_dir = \"${output_dir}/1_files_with_sampleid\"\n", + "meta_files = ${meta_files}\n", + "count_files = ${count_files}\n", + "\n", + "os.makedirs(output_dir, exist_ok=True)\n", + "\n", + "def map_id(ind_id):\n", + " return id_map.get(ind_id, ind_id)\n", + "\n", + "def format_value(val):\n", + " if pd.isna(val):\n", + " return ''\n", + " if isinstance(val, (int, np.integer)):\n", + " return str(val)\n", + " if isinstance(val, (float, np.floating)):\n", + " if val == int(val):\n", + " return str(int(val))\n", + " else:\n", + " return str(val)\n", + " return str(val)\n", + "\n", + "# ── Process metadata ───────────────────────────────────────────────────────\n", + "for in_path in meta_files:\n", + " fname = os.path.basename(in_path)\n", + " out_path = os.path.join(output_dir, fname)\n", + " meta = pd.read_csv(in_path)\n", + " if \"individualID\" not in meta.columns:\n", + " print(f\"Warning: individualID column not found in {fname}, skipping.\")\n", + " continue\n", + " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", + " cols = meta.columns.tolist()\n", + " cols.remove(\"sampleid\")\n", + " 
cols.remove(\"individualID\")\n", + " meta = meta[[\"sampleid\", \"individualID\"] + cols]\n", + " with open(out_path, 'w', newline='') as f:\n", + " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", + " writer.writerow(meta.columns)\n", + " for _, row in meta.iterrows():\n", + " writer.writerow([format_value(val) for val in row])\n", + "\n", + "# ── Process count files ────────────────────────────────────────────────────\n", + "for in_path in count_files:\n", + " fname = os.path.basename(in_path)\n", + " out_path = os.path.join(output_dir, fname)\n", + " with gzip.open(in_path, \"rt\") as fh:\n", + " header_line = fh.readline().rstrip(\"\\n\")\n", + " col_names = header_line.split(\",\")\n", + " peak_id_col = col_names[0]\n", + " new_sample_cols = [map_id(s) for s in col_names[1:]]\n", + " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", + " tmp = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", + " tmp.write(new_header + \"\\n\")\n", + " tmp.close()\n", + " cmd = f\"zcat {in_path} | tail -n +2 | cat {tmp.name} - | gzip -6 > {out_path}\"\n", + " subprocess.run(cmd, shell=True, check=True)\n", + " os.unlink(tmp.name)" + ] + }, + { + "cell_type": "markdown", + "id": "f0884ae7-a851-425a-86dd-b606768a012e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `pseudobulk_qc`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[pseudobulk_qc]\n", + "parameter: meta_files = []\n", + "parameter: count_files = []\n", + "parameter: output_dir = str\n", + "parameter: tech_vars_file = str\n", + "parameter: blacklist_file = ''\n", + "parameter: batch_correction = \"FALSE\"\n", + "parameter: batch_method = \"limma\"\n", + "parameter: quant_norm = \"FALSE\"\n", + "parameter: min_count = 5\n", + "parameter: min_total_count = 15\n", + "parameter: min_prop = 0.1\n", + "parameter: min_nuclei = 20\n", + "parameter: regions = ''\n", + "parameter: gene_list = ''\n", + "\n", + "import os\n", + "\n", + "_cts = [os.path.basename(f).replace('metadata_','').replace('.csv','') for f in meta_files]\n", + "\n", + "input: meta_files + count_files\n", + "output: [f'{output_dir}/2_residuals/{ct}/{ct}_residuals.txt' for ct in _cts]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", + "\n", + "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", + "\n", + " library(edgeR)\n", + " library(limma)\n", + " library(data.table)\n", + " library(GenomicRanges)\n", + " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", + "\n", + " # ── predictOffset ──────────────────────────────────────────────────────\n", + " predictOffset <- function(fit, tech_vars) {\n", + " D <- fit$design\n", + " Dm <- D\n", + " for (col in colnames(D)) {\n", + " if (col == \"(Intercept)\") next\n", + " is_tech <- any(sapply(tech_vars, function(v) grepl(paste0(\"^\", v), col)))\n", + " if (is_tech) {\n", + " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", + " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", + " else\n", + " Dm[, col] <- 0\n", + " } else {\n", + " Dm[, col] <- 0\n", + " }\n", + " }\n", + " B <- fit$coefficients\n", + " B[is.na(B)] <- 0\n", + " off <- B %*% t(Dm)\n", + " colnames(off) <- rownames(fit$design)\n", + " return(off)\n", + " }\n", + "\n", + " filter_blacklist <- function(mat, bed, feat_label) {\n", + " peaks <- 
data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " bl <- fread(bed)[, 1:3]\n", + " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", + " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", + " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", + " if (length(blacklisted) > 0) {\n", + " message(\"Blacklisted \", feat_label, \" removed: \", length(blacklisted))\n", + " return(mat[-blacklisted, , drop=FALSE])\n", + " }\n", + " return(mat)\n", + " }\n", + "\n", + " parse_regions <- function(region_str) {\n", + " if (is.null(region_str) || region_str == \"\") return(NULL)\n", + " lapply(strsplit(region_str, \",\")[[1]], function(r) {\n", + " parts <- strsplit(trimws(r), \":|−|-\")[[1]]\n", + " list(chr=parts[1], start=as.integer(parts[2]), end=as.integer(parts[3]))\n", + " })\n", + " }\n", + "\n", + " filter_regions <- function(mat, regions) {\n", + " peaks <- data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.integer(start), end = as.integer(end))]\n", + " gr_peaks <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr_regions <- GRanges(\n", + " sapply(regions, `[[`, \"chr\"),\n", + " IRanges(sapply(regions, `[[`, \"start\"), sapply(regions, `[[`, \"end\"))\n", + " )\n", + " keep <- unique(queryHits(findOverlaps(gr_peaks, gr_regions)))\n", + " if (length(keep) == 0) stop(\"No peaks overlap the specified regions.\")\n", + " message(\"Peaks after region filter: \", length(keep))\n", + " mat[keep, , drop=FALSE]\n", + " }\n", + "\n", + " meta_files <- c(${','.join([f'\"{f}\"' for f in meta_files])})\n", + " count_files <- c(${','.join([f'\"{f}\"' for f in count_files])})\n", + "\n", + " if (length(meta_files) != length(count_files))\n", + " stop(\"meta_files and count_files must have the same length and order.\")\n", + "\n", + " # ── Load tech vars from file ───────────────────────────────────────────\n", + " tech_df <- fread(\"${tech_vars_file}\")\n", + " tech_vars <- setdiff(colnames(tech_df), \"sampleid\")\n", + " message(\"Tech vars: \", paste(tech_vars, collapse=\", \"))\n", + "\n", + " regions <- parse_regions(\"${regions}\")\n", + " gene_list <- trimws(strsplit(\"${gene_list}\", \",\")[[1]])\n", + " gene_list <- gene_list[gene_list != \"\"]\n", + "\n", + " for (i in seq_along(meta_files)) {\n", + " meta_file <- meta_files[i]\n", + " counts_file <- count_files[i]\n", + " ct <- sub(\"\\\\.csv$\", \"\", sub(\"^metadata_\", \"\", basename(meta_file)))\n", + "\n", + " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", + " message(\"Processing: \", ct)\n", + " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", + " message(\"Quantile normalization: \", as.logical(\"${quant_norm}\"))\n", + " message(paste(rep(\"=\", 40), collapse=\"\"))\n", + "\n", + " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", + " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", + "\n", + " # ── 1. 
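    # NOTE: `ct` is derived from the metadata filename (metadata_<ct>.csv), so
    # --meta-files and --count-files must be supplied in matching order.
    # `regions` takes comma-separated "chr:start-end" strings and keeps any peak
    # overlapping at least one region; `gene_list` is a comma-separated set of
    # gene IDs applied only to snRNA-seq input.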
Load counts ─────────────────────────────────────────────────\n", + " counts_raw <- fread(counts_file)\n", + " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", + " rownames(counts) <- counts_raw[[1]]\n", + " rm(counts_raw)\n", + "\n", + " # ── Auto-detect modality ───────────────────────────────────────────\n", + " is_atac <- grepl(\"^chr.*-[0-9]+-[0-9]+$\", rownames(counts)[1])\n", + " feat_label <- ifelse(is_atac, \"peaks\", \"genes\")\n", + " message(\"Modality: \", ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\"))\n", + " message(\"Loaded: \", nrow(counts), \" \", feat_label, \" x \", ncol(counts), \" samples\")\n", + "\n", + " # ── 1b. Region/gene filtering (optional) ──────────────────────────\n", + " if (is_atac && !is.null(regions)) {\n", + " message(\"Filtering peaks to specified regions...\")\n", + " counts <- filter_regions(counts, regions)\n", + " } else if (!is_atac && length(gene_list) > 0) {\n", + " genes_present <- intersect(rownames(counts), gene_list)\n", + " if (length(genes_present) == 0) stop(\"No matching genes found in count matrix.\")\n", + " message(\"Genes after gene_list filter: \", length(genes_present))\n", + " counts <- counts[genes_present, , drop=FALSE]\n", + " }\n", + "\n", + " # ── 2. Load metadata ───────────────────────────────────────────────\n", + " meta <- fread(meta_file)\n", + " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", + " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", + "\n", + " # ── 3. Nuclei filter ──────────────────────────────────────────────\n", + " n_nuclei_col <- intersect(c(\"n_nuclei\",\"n.nuclei\",\"nNuclei\",\"nuclei_count\"), colnames(meta))[1]\n", + " if (!is.na(n_nuclei_col)) {\n", + " meta <- meta[meta[[n_nuclei_col]] > ${min_nuclei}]\n", + " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", + " }\n", + "\n", + " # ── 4. Align samples ──────────────────────────────────────────────\n", + " common <- intersect(meta[[idcol]], colnames(counts))\n", + " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", + " counts <- counts[, common, drop=FALSE]\n", + " message(\"Samples after alignment: \", length(common))\n", + "\n", + " # ── 5. Blacklist filtering ─────────────────────────────────────────\n", + " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", + " counts <- filter_blacklist(counts, \"${blacklist_file}\", feat_label)\n", + " message(feat_label, \" after blacklist filter: \", nrow(counts))\n", + " } else {\n", + " message(\"No blacklist file - skipping.\")\n", + " }\n", + "\n", + " # ── 6. Merge tech vars by sampleid ────────────────────────────────\n", + " tech_sub <- tech_df[tech_df$sampleid %in% common]\n", + " tech_sub <- tech_sub[match(common, tech_sub$sampleid)]\n", + "\n", + " # ── 7. Drop samples with NA in tech vars ──────────────────────────\n", + " keep_rows <- complete.cases(tech_sub[, ..tech_vars])\n", + " tech_sub <- tech_sub[keep_rows]\n", + " counts <- counts[, tech_sub$sampleid, drop=FALSE]\n", + " message(\"Valid samples for modelling: \", nrow(tech_sub))\n", + "\n", + " # ── 8. 
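    # NOTE: filterByExpr keeps features whose CPM clears a cutoff derived from
    # min.count in enough samples (roughly min.prop of the single "all" group)
    # and whose total count reaches min.total.count; library sizes are then
    # recomputed (keep.lib.sizes=FALSE) before TMM normalization.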
Expression filtering ────────────────────────────────────────\n", + " dge <- DGEList(counts=counts, samples=tech_sub)\n", + " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", + " message(feat_label, \" before filter: \", nrow(dge))\n", + "\n", + " keep <- filterByExpr(dge, group=dge$samples$group,\n", + " min.count=${min_count},\n", + " min.total.count=${min_total_count},\n", + " min.prop=${min_prop})\n", + " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", + " message(feat_label, \" after filter: \", nrow(dge))\n", + "\n", + " write.table(dge$counts,\n", + " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " # ── 9. TMM normalization ───────────────────────────────────────────\n", + " dge <- calcNormFactors(dge, method=\"TMM\")\n", + "\n", + " # ── 10. Optional batch correction ──────────────────────────────────\n", + " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", + " batches <- dge$samples$sequencingBatch\n", + " batch_counts <- table(batches)\n", + " valid_batches <- names(batch_counts[batch_counts > 1])\n", + " keep_bc <- batches %in% valid_batches\n", + " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", + " batches <- batches[keep_bc]\n", + " message(\"Samples after singleton batch removal: \", ncol(dge))\n", + "\n", + " if (\"${batch_method}\" == \"combat\") {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- ComBat(dat=logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"ComBat applied on log-CPM.\")\n", + " } else {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"limma removeBatchEffect applied.\")\n", + " }\n", + " }\n", + "\n", + " # ── 11. Add batch vars to model if multi-level ────────────────────\n", + " batch_vars <- c()\n", + " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$sequencingBatch)) > 1) {\n", + " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", + " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", + " }\n", + " if (\"Library\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$Library)) > 1) {\n", + " dge$samples$Library_factor <- factor(dge$samples$Library)\n", + " batch_vars <- c(batch_vars, \"Library_factor\")\n", + " }\n", + "\n", + " # ── 12. Build design matrix ────────────────────────────────────────\n", + " all_model_vars <- intersect(c(tech_vars, batch_vars), colnames(dge$samples))\n", + " form <- as.formula(paste(\"~\", paste(all_model_vars, collapse=\" + \")))\n", + " design <- model.matrix(form, data=dge$samples)\n", + " message(\"Formula: \", deparse(form))\n", + "\n", + " if (!is.fullrank(design)) {\n", + " message(\"Design not full rank - trimming.\")\n", + " qr_d <- qr(design)\n", + " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", + " }\n", + " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", + "\n", + " # ── 13. Voom + lmFit + eBayes ─────────────────────────────────────\n", + " v <- voom(dge, design, plot=FALSE)\n", + " fit <- lmFit(v, design)\n", + " fit <- eBayes(fit)\n", + "\n", + " # ── 14. Offset + residuals ─────────────────────────────────────────\n", + " off <- predictOffset(fit, tech_vars=tech_vars)\n", + " res <- residuals(fit, v$E)\n", + " final <- off + res\n", + "\n", + " # ── 15. 
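    # NOTE: the matrix written below is offset + residuals on the log2-CPM
    # scale. predictOffset() evaluates the fit with continuous technical
    # covariates held at their median and factor/batch terms at the reference
    # level, while residuals() removes everything the design explains. The
    # optional quantile normalization that follows rank-transforms each feature
    # and maps the ranks through qnorm(rank / (n_samples + 1)).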
Save residuals ─────────────────────────────────────────────\n", + " out_file <- file.path(outdir, paste0(ct, \"_residuals.txt\"))\n", + " write.table(final, out_file, sep=\"\\t\", quote=FALSE, col.names=NA)\n", + " message(\"Saved: \", out_file)\n", + " message(\" \", ifelse(is_atac,\"Peaks\",\"Genes\"), \": \", nrow(final), \" | Samples: \", ncol(final))\n", + "\n", + " # ── 16. Optional quantile normalization ───────────────────────────\n", + " if (as.logical(\"${quant_norm}\")) {\n", + " final_qn <- t(apply(final, 1, rank, ties.method=\"average\"))\n", + " final_qn <- stats::qnorm(final_qn / (ncol(final_qn) + 1))\n", + " qn_file <- file.path(outdir, paste0(ct, \"_residuals_qn.txt\"))\n", + " write.table(final_qn, qn_file, sep=\"\\t\", quote=FALSE, col.names=NA)\n", + " message(\"Saved QN: \", qn_file)\n", + "\n", + " saveRDS(list(\n", + " dge=dge, offset=off, residuals=res,\n", + " final_data=final, final_data_qn=final_qn,\n", + " valid_samples=colnames(dge), design=design, fit=fit, model=form,\n", + " tech_vars=tech_vars, batch_vars=batch_vars,\n", + " batch_correction=as.logical(\"${batch_correction}\"),\n", + " batch_method=ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", + " quant_norm=TRUE,\n", + " modality=ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", + " ), file.path(outdir, paste0(ct, \"_results_qn.rds\")))\n", + " } else {\n", + " saveRDS(list(\n", + " dge=dge, offset=off, residuals=res,\n", + " final_data=final,\n", + " valid_samples=colnames(dge), design=design, fit=fit, model=form,\n", + " tech_vars=tech_vars, batch_vars=batch_vars,\n", + " batch_correction=as.logical(\"${batch_correction}\"),\n", + " batch_method=ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", + " quant_norm=FALSE,\n", + " modality=ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", + " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", + " }\n", + "\n", + " message(\"Completed: \", ct, \" -> \", outdir)\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `phenotype_reformatting`" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", + "metadata": { + "kernel": "SoS" + }, + "outputs": [ + { + "ename": "ERROR", + "evalue": "Error in parse(text = input): :1:1: unexpected '['\n1: [\n ^\n", + "output_type": "error", + "traceback": [ + "Error in parse(text = input): :1:1: unexpected '['\n1: [\n ^\nTraceback:\n" + ] + } + ], + "source": [ + "[phenotype_formatting]\n", + "parameter: residual_files = []\n", + "parameter: output_dir = str\n", + "\n", + "import os\n", + "\n", + "_cts = [os.path.basename(os.path.dirname(f)) for f in residual_files]\n", + "\n", + "input: residual_files\n", + "output: [f'{output_dir}/3_pheno_reformat/{ct}_phenotype.bed.gz' for ct in _cts]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + " import os\n", + " import subprocess\n", + " import pandas as pd\n", + "\n", + " residual_files = ${residual_files}\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def read_residuals(path):\n", + " first_line = open(path).readline().rstrip(\"\\n\")\n", + " col_names = first_line.split(\"\\t\")\n", + " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", + " if df.shape[1] > 
len(col_names):\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names\n", + " else:\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names[1:]\n", + " return peak_ids, df\n", + "\n", + " def to_midpoint_bed(peak_ids, residuals):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " chrs = parts[0].values\n", + " starts = parts[1].astype(int).values\n", + " ends = parts[2].astype(int).values\n", + " mids = ((starts + ends) // 2).astype(int)\n", + " bed = pd.DataFrame({\n", + " \"#chr\": chrs,\n", + " \"start\": mids,\n", + " \"end\": mids + 1,\n", + " \"ID\": peak_ids\n", + " })\n", + " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", + " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", + "\n", + " def run_cmd(cmd, label):\n", + " r = subprocess.run(cmd, capture_output=True)\n", + " if r.returncode != 0:\n", + " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", + " else:\n", + " print(f\"{label}: OK\")\n", + "\n", + " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", + " os.makedirs(out_dir, exist_ok=True)\n", + "\n", + " for res_path in residual_files:\n", + " ct = os.path.basename(os.path.dirname(res_path))\n", + "\n", + " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", + "\n", + " if not os.path.exists(res_path):\n", + " print(f\"WARNING: {res_path} not found, skipping.\")\n", + " continue\n", + "\n", + " peak_ids, residuals = read_residuals(res_path)\n", + " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", + "\n", + " bed = to_midpoint_bed(peak_ids, residuals)\n", + " out_bed = os.path.join(out_dir, f\"{ct}_phenotype.bed\")\n", + " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", + " print(f\"Written: {out_bed}\")\n", + "\n", + " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", + " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", + " print(f\"Completed: {ct} -> {out_dir}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.4.3" + }, + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "sos", + "", + "" + ] + ], + "version": "" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 6f3f2d1a7d9f133568b493b400bd6d16cbc0fcfb Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Fri, 27 Feb 2026 11:36:05 -0500 Subject: [PATCH 10/12] Delete code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb --- .../QC/pseudobulk_preprocessing.ipynb | 1045 ----------------- 1 file changed, 1045 deletions(-) delete mode 100644 code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb deleted file mode 100644 index 9beaa42f7..000000000 --- a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb +++ /dev/null @@ -1,1045 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "# Single-nuclei Pseudobulk Preprocessing (RNA-seq and ATAC-seq) Pipeline\n", - "\n", - "## Overview\n", - "\n", - "This pipeline preprocesses single-nuclei pseudobulk 
**count** data (snATAC-seq or snRNA-seq) for downstream QTL analysis and region-specific studies.\n", - "\n", - "**Goals:**\n", - "- Transform raw pseudobulk counts into analysis-ready formats\n", - "- Remove technical confounders\n", - "- Generate QTL-ready phenotype files or region-specific datasets\n", - "\n", - "## Pipeline Structure\n", - "\n", - "```\n", - "Step 0: Sample ID Mapping [sampleid_mapping]\n", - " ↓\n", - "Step 1: Pseudobulk QC [pseudobulk_qc]\n", - " (optional) Region Peak/Gene Filtering \n", - " (optional) Batch Correction (ComBat or limma)\n", - " (optional) Quantile Normalization\n", - " ↓\n", - "Step 2: Phenotype Reformatting → BED [phenotype_formatting]\n", - " (genome-wide QTL mapping, snATAC-seq only) \n", - "```\n", - "\n", - "## Modality Support\n", - "\n", - "| Feature | snATAC-seq | snRNA-seq |\n", - "|---------|-----------|-----------|\n", - "| Sample ID mapping | ✓ | ✓ |\n", - "| Region/gene filtering | ✓ (`--regions`) | ✓ (`--gene-list`) |\n", - "| Blacklist filtering | ✓ | — |\n", - "| `pseudobulk_qc` step | ✓ | ✓ |\n", - "| `phenotype_formatting` step | ✓ | — (refer to this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb)) |\n", - "\n", - "## Input Files\n", - "\n", - "All toy input files required to run this pipeline can be downloaded\n", - "[here](https://drive.google.com/drive/folders/13ORslmqWTpICMIufhj_mrdL1KxQsG4lH?usp=drive_link).\n", - "\n", - "| File | Used in |\n", - "|------|---------|\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Step 0, Step 1 |\n", - "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Step 0, Step 1 |\n", - "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n", - "| `rosmap_sample_mapping_data.csv` | Step 0 |\n", - "| `tech_vars_{celltype}.csv` | Step 1 |\n", - "| `hg38-blacklist.v2.bed.gz` | Step 1 (snATAC-seq only) |\n", - "\n", - "\n", - "## Minimal Working Example" - ] - }, - { - "cell_type": "markdown", - "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 0: Sample ID Mapping\n", - "\n", - "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", - "across metadata and count matrix files.\n", - "\n", - "### Input\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", - "| `metadata_{celltype}.csv` | Per-cell-type sample metadata |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Per-cell-type peak count matrices |\n", - "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Per-cell-type gene count matrices |\n", - "\n", - "### Process\n", - "\n", - "**Part 1 — Metadata files**\n", - "\n", - "For each metadata file:\n", - "1. Look up each `individualID` in the mapping reference\n", - "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", - "3. Reorder columns: `sampleid` first, then `individualID`, then the rest\n", - "4. Save updated file\n", - "\n", - "**Part 2 — Count matrix files**\n", - "\n", - "For each count file:\n", - "1. Extract the header row (column names only)\n", - "2. Keep the first column (peak or gene IDs) unchanged\n", - "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists, otherwise keep original\n", - "4. Write new header and stream data rows unchanged\n", - "5. 
Recompress with gzip\n", - "\n", - "### Parameters\n", - "\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `map_file` | *required* | CSV with `individualID` → `sampleid` mapping |\n", - "| `meta_files` | *required* | Metadata CSV files to remap |\n", - "| `count_files` | *required* | Count CSV.gz files to remap |\n", - "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/1_files_with_sampleid/` |\n", - "\n", - "### Output\n", - "\n", - "Output directory: `{output_dir}/1_files_with_sampleid/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `metadata_{celltype}.csv` | Metadata with `sampleid` column prepended |\n", - "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Count matrices with mapped column headers |\n", - "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Count matrices with mapped column headers |\n", - "\n", - "\n", - "**Timing:** < 1 min" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [ - { - "ename": "ERROR", - "evalue": "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\n", - "output_type": "error", - "traceback": [ - "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\nTraceback:\n" - ] - } - ], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb sampleid_mapping \\\n", - " --output-dir output/snatac_seq \\\n", - " --map-file data/rosmap_sample_mapping_data.csv \\\n", - " --meta-files data/snatac_seq/metadata_Mic_50nuc.csv \\\n", - " --count-files data/snatac_seq/pseudobulk_peaks_counts_Mic_50nuc.csv.gz\n" - ] - }, - { - "cell_type": "markdown", - "id": "5540a4da-843a-4789-8123-47911cf519c5", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 1: Pseudobulk QC\n", - "\n", - "Regresses out technical covariates for downstream QTL analysis. Works for both snATAC-seq and snRNA-seq.\n", - "\n", - "### Input\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `metadata_{celltype}.csv` | Sample-level metadata (nuclei counts, batch info) |\n", - "| `pseudobulk_*counts_{celltype}.csv.gz` | Pseudobulk count matrix |\n", - "| `tech_vars.csv` | Technical covariates (sampleid + tech var columns, pre-processed) |\n", - "| `hg38-blacklist.v2.bed.gz` *(snATAC-seq, optional)* | Blacklisted genomic regions |\n", - "\n", - "### Process\n", - "\n", - "1. Load count matrix and auto-detect modality (snATAC-seq vs snRNA-seq)\n", - "2. ***(Optional)*** Filter to specific genomic regions (snATAC-seq) or gene list (snRNA-seq)\n", - "3. Load metadata; filter samples with fewer than `min_nuclei` nuclei (default: 20)\n", - "4. Align samples between metadata and count matrix\n", - "5. ***(Optional)*** Filter blacklisted genomic regions (snATAC-seq only)\n", - "6. Merge tech vars from `tech_vars_file` by `sampleid` \n", - "7. Drop samples with NA in any tech var\n", - "8. Apply expression filtering (`filterByExpr`):\n", - " - `min_count = 5`: minimum reads in at least one sample\n", - " - `min_total_count = 15`: minimum total reads across all samples\n", - " - `min_prop = 0.1`: feature expressed in ≥10% of samples\n", - "9. TMM normalization\n", - "10. ***(Optional)*** Batch correction on `sequencingBatch`:\n", - " - `limma::removeBatchEffect` (default)\n", - " - `ComBat` (on log-CPM)\n", - "11. Add `sequencingBatch` and `Library` to model if present and multi-level\n", - "12. 
Fit linear model (`voom` + `lmFit` + `eBayes`) with **tech vars + batch vars only** \n", - "13. Compute `offset + residuals` as final adjusted values:\n", - " - `offset`: intercept + batch effects at reference level\n", - " - `residuals`: variation after removing technical effects; biological signal retained\n", - "14. ***(Optional)*** Quantile normalization of final values\n", - "\n", - "**Model formula:**\n", - "```\n", - "~ {tech_vars} + [sequencingBatch] + [Library]\n", - "```\n", - "> `sequencingBatch` and `Library` included only if present and have more than one level.\n", - "> Biological variables (`pmi`, `study`, `msex`, `age_death` etc.) are **not** included — they should not be regressed out as they may be associated with genotype.\n", - "\n", - "### Parameters\n", - "\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `meta_files` | *required* | Metadata CSV files (one per cell type) |\n", - "| `count_files` | *required* | Count CSV.gz files (one per cell type, same order as `meta_files`) |\n", - "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/2_residuals/{ct}/` |\n", - "| `tech_vars_file` | *required* | CSV with `sampleid` + tech var columns |\n", - "| `blacklist_file` | `''` | Genomic blacklist BED file (snATAC-seq only) |\n", - "| `regions` | `''` | Comma-separated genomic regions e.g. `chr7:28000000-28300000` (snATAC-seq) |\n", - "| `gene_list` | `''` | Comma-separated gene IDs e.g. `ENSG00000000010` (snRNA-seq) |\n", - "| `batch_correction` | `FALSE` | Apply batch correction (`TRUE`/`FALSE`) |\n", - "| `batch_method` | `limma` | Batch correction method (`limma` or `combat`) |\n", - "| `quant_norm` | `FALSE` | Apply quantile normalization after residuals |\n", - "| `min_count` | `5` | Min reads in at least one sample |\n", - "| `min_total_count` | `15` | Min total reads across all samples |\n", - "| `min_prop` | `0.1` | Min proportion of samples with expression |\n", - "| `min_nuclei` | `20` | Min nuclei per sample |\n", - "\n", - "### Output\n", - "\n", - "Output directory: `{output_dir}/2_residuals/{celltype}/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Tech-covariate-adjusted values (log2-CPM) |\n", - "| `{celltype}_residuals_qn.txt` | Quantile-normalized adjusted values *(if `quant_norm=TRUE`)* |\n", - "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design, parameters |\n", - "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", - "\n", - "**Timing:** < 5 min per cell type" - ] - }, - { - "cell_type": "markdown", - "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "### Pseudobulk QC\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", - " --meta-files output/snatac_seq/1_files_with_sampleid/metadata_Mic_50nuc.csv \\\n", - " --count-files output/snatac_seq/1_files_with_sampleid/pseudobulk_peaks_counts_Mic_50nuc.csv.gz \\\n", - " --output-dir output/snatac_seq \\\n", - " --tech-vars-file data/snatac_seq/tech_vars_MIC.csv \\\n", - " --blacklist-file data/hg38-blacklist.v2.bed.gz #only for snATAC-seq\n", - "\n", - "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", - " --meta-files 
output/snrna_seq/1_files_with_sampleid/metadata_MIC.csv \\\n", - " --count-files output/snrna_seq/1_files_with_sampleid/pseudobulk_counts_MIC.csv.gz \\\n", - " --output-dir output/snrna_seq \\\n", - " --tech-vars-file data/snrna_seq/tech_vars_MIC.csv \\\n", - " --gene-list ENSG00000000010,ENSG00000000020 " - ] - }, - { - "cell_type": "markdown", - "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", - "metadata": {}, - "source": [ - "### Additional parameters\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", - "metadata": {}, - "outputs": [], - "source": [ - "--min-count 5\n", - "--min-total-count 15\n", - "--min-prop 0.1\n", - "--min-nuclei 20\n", - "--quant-norm TRUE\n", - "--batch-correction TRUE \n", - "--batch-method combat # or limma\n", - "--gene-list ENSG00000000010,ENSG00000000020 # for snRNA-seq\n", - "--regions chr7:28000000-28300000 # for snATAC-seq" - ] - }, - { - "cell_type": "markdown", - "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Step 2: Phenotype Reformatting (snATAC-seq only)\n", - "\n", - "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", - "\n", - "> For snRNA-seq, please follow this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb).\n", - "\n", - "### Input\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_residuals.txt` | Residuals from `pseudobulk_qc` |\n", - "\n", - "### Process\n", - "\n", - "1. Read residuals file with proper handling of feature IDs and sample columns\n", - "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", - "3. Convert to midpoint coordinates (standard for QTLtools):\n", - "```\n", - "start = floor((peak_start + peak_end) / 2)\n", - "end = start + 1\n", - "```\n", - "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample values\n", - "5. Sort by chromosome and position\n", - "6. 
Compress with `bgzip` and index with `tabix`\n", - "\n", - "### Parameters\n", - "\n", - "| Parameter | Default | Description |\n", - "|-----------|---------|-------------|\n", - "| `residual_files` | *required* | Residual txt files from `pseudobulk_qc` |\n", - "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_pheno_reformat/` |\n", - "\n", - "### Output\n", - "\n", - "Output directory: `{output_dir}/3_pheno_reformat/`\n", - "\n", - "| File | Description |\n", - "|------|-------------|\n", - "| `{celltype}_phenotype.bed.gz` | bgzip-compressed BED with midpoint coordinates |\n", - "| `{celltype}_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", - "\n", - "Compatible with FastQTL, TensorQTL, and QTLtools.\n", - "\n", - "**Timing:** < 1 min per cell type" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", - "metadata": { - "kernel": "SoS" - }, - "outputs": [ - { - "ename": "ERROR", - "evalue": "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\n", - "output_type": "error", - "traceback": [ - "Error in parse(text = input): :1:5: unexpected symbol\n1: sos run\n ^\nTraceback:\n" - ] - } - ], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb phenotype_formatting \\\n", - " --residual-files output/snatac_seq/2_residuals/Mic_50nuc/Mic_50nuc_residuals.txt \\\n", - " --output-dir output/snatac_seq" - ] - }, - { - "cell_type": "markdown", - "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Command interface" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "sos run pipeline/pseudobulk_preprocessing.ipynb -h" - ] - }, - { - "cell_type": "markdown", - "id": "0e17a301-cca9-49a1-843b-4248546f1f79", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## Setup and global parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[global]\n", - "parameter: cwd = path(\"output\")\n", - "parameter: job_size = 1\n", - "parameter: walltime = \"5h\"\n", - "parameter: mem = \"16G\"\n", - "parameter: numThreads = 8\n", - "parameter: container = \"\"\n", - "\n", - "import re\n", - "from sos.utils import expand_size\n", - "\n", - "entrypoint = (\n", - " 'micromamba run -a \"\" -n' + ' ' +\n", - " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", - ") if container else \"\"\n", - "\n", - "cwd = path(f'{cwd:a}')" - ] - }, - { - "cell_type": "markdown", - "id": "eee58015-c8e2-4697-bdae-58d7e494640d", - "metadata": {}, - "source": [ - "```\n", - "usage: sos run pipeline/pseudobulk_preprocessing.ipynb\n", - " [workflow_name | -t targets] [options] [workflow_options]\n", - " workflow_name: Single or combined workflows defined in this script\n", - " targets: One or more targets to generate\n", - " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", - " workflow_options: Double-hyphen workflow-specific parameters\n", - "Workflows:\n", - " sampleid_mapping\n", - " pseudobulk_qc\n", - " phenotype_formatting\n", - "Global Workflow Options:\n", - " --cwd output (as path)\n", - " --job-size 1 (as int)\n", - " --walltime 5h\n", - " --mem 16G\n", - " --numThreads 8 (as int)\n", - " --container ''\n", - 
"Sections\n", - " sampleid_mapping:\n", - " Workflow Options:\n", - " --map-file VAL (as str, required)\n", - " --output-dir VAL (as str, required)\n", - " --meta-files (as list)\n", - " --count-files (as list)\n", - " pseudobulk_qc:\n", - " Workflow Options:\n", - " --meta-files (as list)\n", - " --count-files (as list)\n", - " --output-dir VAL (as str, required)\n", - " --tech-vars-file VAL (as str, required)\n", - " --blacklist-file ''\n", - " --batch-correction FALSE\n", - " --batch-method limma\n", - " --quant-norm FALSE\n", - " --min-count 5 (as int)\n", - " --min-total-count 15 (as int)\n", - " --min-prop 0.1 (as float)\n", - " --min-nuclei 20 (as int)\n", - " --regions ''\n", - " --gene-list ''\n", - " phenotype_formatting:\n", - " Workflow Options:\n", - " --residual-files (as list)\n", - " --output-dir VAL (as str, required)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `sampleid_mapping`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[sampleid_mapping]\n", - "parameter: map_file = str\n", - "parameter: output_dir = str\n", - "parameter: meta_files = []\n", - "parameter: count_files = []\n", - "\n", - "import os\n", - "\n", - "input: meta_files + count_files\n", - "output: [f'{output_dir}/1_files_with_sampleid/{os.path.basename(f)}' for f in meta_files + count_files]\n", - " \n", - "python: expand = \"${ }\"\n", - "import pandas as pd\n", - "import gzip\n", - "import os\n", - "import subprocess\n", - "import csv\n", - "import numpy as np\n", - "import tempfile\n", - "\n", - "map_df = pd.read_csv(\"${map_file}\")\n", - "id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", - "output_dir = \"${output_dir}/1_files_with_sampleid\"\n", - "meta_files = ${meta_files}\n", - "count_files = ${count_files}\n", - "\n", - "os.makedirs(output_dir, exist_ok=True)\n", - "\n", - "def map_id(ind_id):\n", - " return id_map.get(ind_id, ind_id)\n", - "\n", - "def format_value(val):\n", - " if pd.isna(val):\n", - " return ''\n", - " if isinstance(val, (int, np.integer)):\n", - " return str(val)\n", - " if isinstance(val, (float, np.floating)):\n", - " if val == int(val):\n", - " return str(int(val))\n", - " else:\n", - " return str(val)\n", - " return str(val)\n", - "\n", - "# ── Process metadata ───────────────────────────────────────────────────────\n", - "for in_path in meta_files:\n", - " fname = os.path.basename(in_path)\n", - " out_path = os.path.join(output_dir, fname)\n", - " meta = pd.read_csv(in_path)\n", - " if \"individualID\" not in meta.columns:\n", - " print(f\"Warning: individualID column not found in {fname}, skipping.\")\n", - " continue\n", - " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", - " cols = meta.columns.tolist()\n", - " cols.remove(\"sampleid\")\n", - " cols.remove(\"individualID\")\n", - " meta = meta[[\"sampleid\", \"individualID\"] + cols]\n", - " with open(out_path, 'w', newline='') as f:\n", - " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", - " writer.writerow(meta.columns)\n", - " for _, row in meta.iterrows():\n", - " writer.writerow([format_value(val) for val in row])\n", - "\n", - "# ── Process count files ────────────────────────────────────────────────────\n", - "for in_path in count_files:\n", - " fname = os.path.basename(in_path)\n", - " out_path = os.path.join(output_dir, 
fname)\n", - " with gzip.open(in_path, \"rt\") as fh:\n", - " header_line = fh.readline().rstrip(\"\\n\")\n", - " col_names = header_line.split(\",\")\n", - " peak_id_col = col_names[0]\n", - " new_sample_cols = [map_id(s) for s in col_names[1:]]\n", - " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", - " tmp = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", - " tmp.write(new_header + \"\\n\")\n", - " tmp.close()\n", - " cmd = f\"zcat {in_path} | tail -n +2 | cat {tmp.name} - | gzip -6 > {out_path}\"\n", - " subprocess.run(cmd, shell=True, check=True)\n", - " os.unlink(tmp.name)" - ] - }, - { - "cell_type": "markdown", - "id": "f0884ae7-a851-425a-86dd-b606768a012e", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `pseudobulk_qc`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", - "metadata": { - "kernel": "SoS" - }, - "outputs": [], - "source": [ - "[pseudobulk_qc]\n", - "parameter: meta_files = []\n", - "parameter: count_files = []\n", - "parameter: output_dir = str\n", - "parameter: tech_vars_file = str\n", - "parameter: blacklist_file = ''\n", - "parameter: batch_correction = \"FALSE\"\n", - "parameter: batch_method = \"limma\"\n", - "parameter: quant_norm = \"FALSE\"\n", - "parameter: min_count = 5\n", - "parameter: min_total_count = 15\n", - "parameter: min_prop = 0.1\n", - "parameter: min_nuclei = 20\n", - "parameter: regions = ''\n", - "parameter: gene_list = ''\n", - "\n", - "import os\n", - "\n", - "_cts = [os.path.basename(f).replace('metadata_','').replace('.csv','') for f in meta_files]\n", - "\n", - "input: meta_files + count_files\n", - "output: [f'{output_dir}/2_residuals/{ct}/{ct}_residuals.txt' for ct in _cts]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", - "\n", - "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", - "\n", - " library(edgeR)\n", - " library(limma)\n", - " library(data.table)\n", - " library(GenomicRanges)\n", - " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", - "\n", - " # ── predictOffset ──────────────────────────────────────────────────────\n", - " predictOffset <- function(fit, tech_vars) {\n", - " D <- fit$design\n", - " Dm <- D\n", - " for (col in colnames(D)) {\n", - " if (col == \"(Intercept)\") next\n", - " is_tech <- any(sapply(tech_vars, function(v) grepl(paste0(\"^\", v), col)))\n", - " if (is_tech) {\n", - " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", - " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", - " else\n", - " Dm[, col] <- 0\n", - " } else {\n", - " Dm[, col] <- 0\n", - " }\n", - " }\n", - " B <- fit$coefficients\n", - " B[is.na(B)] <- 0\n", - " off <- B %*% t(Dm)\n", - " colnames(off) <- rownames(fit$design)\n", - " return(off)\n", - " }\n", - "\n", - " filter_blacklist <- function(mat, bed, feat_label) {\n", - " peaks <- data.table(id = rownames(mat))\n", - " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", - " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " bl <- fread(bed)[, 1:3]\n", - " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", - " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", - " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", - " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", - " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", - " if 
(length(blacklisted) > 0) {\n", - " message(\"Blacklisted \", feat_label, \" removed: \", length(blacklisted))\n", - " return(mat[-blacklisted, , drop=FALSE])\n", - " }\n", - " return(mat)\n", - " }\n", - "\n", - " parse_regions <- function(region_str) {\n", - " if (is.null(region_str) || region_str == \"\") return(NULL)\n", - " lapply(strsplit(region_str, \",\")[[1]], function(r) {\n", - " parts <- strsplit(trimws(r), \":|−|-\")[[1]]\n", - " list(chr=parts[1], start=as.integer(parts[2]), end=as.integer(parts[3]))\n", - " })\n", - " }\n", - "\n", - " filter_regions <- function(mat, regions) {\n", - " peaks <- data.table(id = rownames(mat))\n", - " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", - " peaks[, `:=`(start = as.integer(start), end = as.integer(end))]\n", - " gr_peaks <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", - " gr_regions <- GRanges(\n", - " sapply(regions, `[[`, \"chr\"),\n", - " IRanges(sapply(regions, `[[`, \"start\"), sapply(regions, `[[`, \"end\"))\n", - " )\n", - " keep <- unique(queryHits(findOverlaps(gr_peaks, gr_regions)))\n", - " if (length(keep) == 0) stop(\"No peaks overlap the specified regions.\")\n", - " message(\"Peaks after region filter: \", length(keep))\n", - " mat[keep, , drop=FALSE]\n", - " }\n", - "\n", - " meta_files <- c(${','.join([f'\"{f}\"' for f in meta_files])})\n", - " count_files <- c(${','.join([f'\"{f}\"' for f in count_files])})\n", - "\n", - " if (length(meta_files) != length(count_files))\n", - " stop(\"meta_files and count_files must have the same length and order.\")\n", - "\n", - " # ── Load tech vars from file ───────────────────────────────────────────\n", - " tech_df <- fread(\"${tech_vars_file}\")\n", - " tech_vars <- setdiff(colnames(tech_df), \"sampleid\")\n", - " message(\"Tech vars: \", paste(tech_vars, collapse=\", \"))\n", - "\n", - " regions <- parse_regions(\"${regions}\")\n", - " gene_list <- trimws(strsplit(\"${gene_list}\", \",\")[[1]])\n", - " gene_list <- gene_list[gene_list != \"\"]\n", - "\n", - " for (i in seq_along(meta_files)) {\n", - " meta_file <- meta_files[i]\n", - " counts_file <- count_files[i]\n", - " ct <- sub(\"\\\\.csv$\", \"\", sub(\"^metadata_\", \"\", basename(meta_file)))\n", - "\n", - " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", - " message(\"Processing: \", ct)\n", - " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", - " message(\"Quantile normalization: \", as.logical(\"${quant_norm}\"))\n", - " message(paste(rep(\"=\", 40), collapse=\"\"))\n", - "\n", - " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", - " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", - "\n", - " # ── 1. Load counts ─────────────────────────────────────────────────\n", - " counts_raw <- fread(counts_file)\n", - " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", - " rownames(counts) <- counts_raw[[1]]\n", - " rm(counts_raw)\n", - "\n", - " # ── Auto-detect modality ───────────────────────────────────────────\n", - " is_atac <- grepl(\"^chr.*-[0-9]+-[0-9]+$\", rownames(counts)[1])\n", - " feat_label <- ifelse(is_atac, \"peaks\", \"genes\")\n", - " message(\"Modality: \", ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\"))\n", - " message(\"Loaded: \", nrow(counts), \" \", feat_label, \" x \", ncol(counts), \" samples\")\n", - "\n", - " # ── 1b. 
Region/gene filtering (optional) ──────────────────────────\n", - " if (is_atac && !is.null(regions)) {\n", - " message(\"Filtering peaks to specified regions...\")\n", - " counts <- filter_regions(counts, regions)\n", - " } else if (!is_atac && length(gene_list) > 0) {\n", - " genes_present <- intersect(rownames(counts), gene_list)\n", - " if (length(genes_present) == 0) stop(\"No matching genes found in count matrix.\")\n", - " message(\"Genes after gene_list filter: \", length(genes_present))\n", - " counts <- counts[genes_present, , drop=FALSE]\n", - " }\n", - "\n", - " # ── 2. Load metadata ───────────────────────────────────────────────\n", - " meta <- fread(meta_file)\n", - " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", - " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", - "\n", - " # ── 3. Nuclei filter ──────────────────────────────────────────────\n", - " n_nuclei_col <- intersect(c(\"n_nuclei\",\"n.nuclei\",\"nNuclei\",\"nuclei_count\"), colnames(meta))[1]\n", - " if (!is.na(n_nuclei_col)) {\n", - " meta <- meta[meta[[n_nuclei_col]] > ${min_nuclei}]\n", - " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", - " }\n", - "\n", - " # ── 4. Align samples ──────────────────────────────────────────────\n", - " common <- intersect(meta[[idcol]], colnames(counts))\n", - " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", - " counts <- counts[, common, drop=FALSE]\n", - " message(\"Samples after alignment: \", length(common))\n", - "\n", - " # ── 5. Blacklist filtering ─────────────────────────────────────────\n", - " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", - " counts <- filter_blacklist(counts, \"${blacklist_file}\", feat_label)\n", - " message(feat_label, \" after blacklist filter: \", nrow(counts))\n", - " } else {\n", - " message(\"No blacklist file - skipping.\")\n", - " }\n", - "\n", - " # ── 6. Merge tech vars by sampleid ────────────────────────────────\n", - " tech_sub <- tech_df[tech_df$sampleid %in% common]\n", - " tech_sub <- tech_sub[match(common, tech_sub$sampleid)]\n", - "\n", - " # ── 7. Drop samples with NA in tech vars ──────────────────────────\n", - " keep_rows <- complete.cases(tech_sub[, ..tech_vars])\n", - " tech_sub <- tech_sub[keep_rows]\n", - " counts <- counts[, tech_sub$sampleid, drop=FALSE]\n", - " message(\"Valid samples for modelling: \", nrow(tech_sub))\n", - "\n", - " # ── 8. Expression filtering ────────────────────────────────────────\n", - " dge <- DGEList(counts=counts, samples=tech_sub)\n", - " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", - " message(feat_label, \" before filter: \", nrow(dge))\n", - "\n", - " keep <- filterByExpr(dge, group=dge$samples$group,\n", - " min.count=${min_count},\n", - " min.total.count=${min_total_count},\n", - " min.prop=${min_prop})\n", - " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", - " message(feat_label, \" after filter: \", nrow(dge))\n", - "\n", - " write.table(dge$counts,\n", - " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", - " sep=\"\\t\", quote=FALSE, col.names=NA)\n", - "\n", - " # ── 9. TMM normalization ───────────────────────────────────────────\n", - " dge <- calcNormFactors(dge, method=\"TMM\")\n", - "\n", - " # ── 10. 
Optional batch correction ──────────────────────────────────\n", - " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", - " batches <- dge$samples$sequencingBatch\n", - " batch_counts <- table(batches)\n", - " valid_batches <- names(batch_counts[batch_counts > 1])\n", - " keep_bc <- batches %in% valid_batches\n", - " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", - " batches <- batches[keep_bc]\n", - " message(\"Samples after singleton batch removal: \", ncol(dge))\n", - "\n", - " if (\"${batch_method}\" == \"combat\") {\n", - " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - " logCPM <- ComBat(dat=logCPM, batch=factor(batches))\n", - " dge$counts <- round(pmax(2^logCPM, 0))\n", - " message(\"ComBat applied on log-CPM.\")\n", - " } else {\n", - " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", - " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", - " dge$counts <- round(pmax(2^logCPM, 0))\n", - " message(\"limma removeBatchEffect applied.\")\n", - " }\n", - " }\n", - "\n", - " # ── 11. Add batch vars to model if multi-level ────────────────────\n", - " batch_vars <- c()\n", - " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$sequencingBatch)) > 1) {\n", - " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", - " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", - " }\n", - " if (\"Library\" %in% colnames(dge$samples) &&\n", - " length(unique(dge$samples$Library)) > 1) {\n", - " dge$samples$Library_factor <- factor(dge$samples$Library)\n", - " batch_vars <- c(batch_vars, \"Library_factor\")\n", - " }\n", - "\n", - " # ── 12. Build design matrix ────────────────────────────────────────\n", - " all_model_vars <- intersect(c(tech_vars, batch_vars), colnames(dge$samples))\n", - " form <- as.formula(paste(\"~\", paste(all_model_vars, collapse=\" + \")))\n", - " design <- model.matrix(form, data=dge$samples)\n", - " message(\"Formula: \", deparse(form))\n", - "\n", - " if (!is.fullrank(design)) {\n", - " message(\"Design not full rank - trimming.\")\n", - " qr_d <- qr(design)\n", - " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", - " }\n", - " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", - "\n", - " # ── 13. Voom + lmFit + eBayes ─────────────────────────────────────\n", - " v <- voom(dge, design, plot=FALSE)\n", - " fit <- lmFit(v, design)\n", - " fit <- eBayes(fit)\n", - "\n", - " # ── 14. Offset + residuals ─────────────────────────────────────────\n", - " off <- predictOffset(fit, tech_vars=tech_vars)\n", - " res <- residuals(fit, v$E)\n", - " final <- off + res\n", - "\n", - " # ── 15. Save residuals ─────────────────────────────────────────────\n", - " out_file <- file.path(outdir, paste0(ct, \"_residuals.txt\"))\n", - " write.table(final, out_file, sep=\"\\t\", quote=FALSE, col.names=NA)\n", - " message(\"Saved: \", out_file)\n", - " message(\" \", ifelse(is_atac,\"Peaks\",\"Genes\"), \": \", nrow(final), \" | Samples: \", ncol(final))\n", - "\n", - " # ── 16. 
Optional quantile normalization ───────────────────────────\n", - " if (as.logical(\"${quant_norm}\")) {\n", - " final_qn <- t(apply(final, 1, rank, ties.method=\"average\"))\n", - " final_qn <- stats::qnorm(final_qn / (ncol(final_qn) + 1))\n", - " qn_file <- file.path(outdir, paste0(ct, \"_residuals_qn.txt\"))\n", - " write.table(final_qn, qn_file, sep=\"\\t\", quote=FALSE, col.names=NA)\n", - " message(\"Saved QN: \", qn_file)\n", - "\n", - " saveRDS(list(\n", - " dge=dge, offset=off, residuals=res,\n", - " final_data=final, final_data_qn=final_qn,\n", - " valid_samples=colnames(dge), design=design, fit=fit, model=form,\n", - " tech_vars=tech_vars, batch_vars=batch_vars,\n", - " batch_correction=as.logical(\"${batch_correction}\"),\n", - " batch_method=ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", - " quant_norm=TRUE,\n", - " modality=ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", - " ), file.path(outdir, paste0(ct, \"_results_qn.rds\")))\n", - " } else {\n", - " saveRDS(list(\n", - " dge=dge, offset=off, residuals=res,\n", - " final_data=final,\n", - " valid_samples=colnames(dge), design=design, fit=fit, model=form,\n", - " tech_vars=tech_vars, batch_vars=batch_vars,\n", - " batch_correction=as.logical(\"${batch_correction}\"),\n", - " batch_method=ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", - " quant_norm=FALSE,\n", - " modality=ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", - " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", - " }\n", - "\n", - " message(\"Completed: \", ct, \" -> \", outdir)\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", - "metadata": { - "kernel": "SoS" - }, - "source": [ - "## `phenotype_reformatting`" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", - "metadata": { - "kernel": "SoS" - }, - "outputs": [ - { - "ename": "ERROR", - "evalue": "Error in parse(text = input): :1:1: unexpected '['\n1: [\n ^\n", - "output_type": "error", - "traceback": [ - "Error in parse(text = input): :1:1: unexpected '['\n1: [\n ^\nTraceback:\n" - ] - } - ], - "source": [ - "[phenotype_formatting]\n", - "parameter: residual_files = []\n", - "parameter: output_dir = str\n", - "\n", - "import os\n", - "\n", - "_cts = [os.path.basename(os.path.dirname(f)) for f in residual_files]\n", - "\n", - "input: residual_files\n", - "output: [f'{output_dir}/3_pheno_reformat/{ct}_phenotype.bed.gz' for ct in _cts]\n", - "\n", - "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", - "\n", - "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", - " import os\n", - " import subprocess\n", - " import pandas as pd\n", - "\n", - " residual_files = ${residual_files}\n", - " output_dir = \"${output_dir}\"\n", - "\n", - " def read_residuals(path):\n", - " first_line = open(path).readline().rstrip(\"\\n\")\n", - " col_names = first_line.split(\"\\t\")\n", - " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", - " if df.shape[1] > len(col_names):\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names\n", - " else:\n", - " peak_ids = df.iloc[:, 0].values\n", - " df = df.iloc[:, 1:]\n", - " df.columns = col_names[1:]\n", - " return peak_ids, df\n", - "\n", - " def to_midpoint_bed(peak_ids, residuals):\n", - " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", - " chrs = 
parts[0].values\n", - " starts = parts[1].astype(int).values\n", - " ends = parts[2].astype(int).values\n", - " mids = ((starts + ends) // 2).astype(int)\n", - " bed = pd.DataFrame({\n", - " \"#chr\": chrs,\n", - " \"start\": mids,\n", - " \"end\": mids + 1,\n", - " \"ID\": peak_ids\n", - " })\n", - " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", - " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", - "\n", - " def run_cmd(cmd, label):\n", - " r = subprocess.run(cmd, capture_output=True)\n", - " if r.returncode != 0:\n", - " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", - " else:\n", - " print(f\"{label}: OK\")\n", - "\n", - " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", - " os.makedirs(out_dir, exist_ok=True)\n", - "\n", - " for res_path in residual_files:\n", - " ct = os.path.basename(os.path.dirname(res_path))\n", - "\n", - " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", - "\n", - " if not os.path.exists(res_path):\n", - " print(f\"WARNING: {res_path} not found, skipping.\")\n", - " continue\n", - "\n", - " peak_ids, residuals = read_residuals(res_path)\n", - " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", - "\n", - " bed = to_midpoint_bed(peak_ids, residuals)\n", - " out_bed = os.path.join(out_dir, f\"{ct}_phenotype.bed\")\n", - " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", - " print(f\"Written: {out_bed}\")\n", - "\n", - " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", - " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", - " print(f\"Completed: {ct} -> {out_dir}\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "R", - "language": "R", - "name": "ir" - }, - "language_info": { - "codemirror_mode": "r", - "file_extension": ".r", - "mimetype": "text/x-r-source", - "name": "R", - "pygments_lexer": "r", - "version": "4.4.3" - }, - "sos": { - "kernels": [ - [ - "SoS", - "sos", - "sos", - "", - "" - ] - ], - "version": "" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 57de58f9bdc5fbe704efa5c499b7dcc157515f20 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Fri, 27 Feb 2026 11:36:58 -0500 Subject: [PATCH 11/12] Added cell count to pseudobulk count step --- .../QC/pseudobulk_preprocessing.ipynb | 1171 +++++++++++++++++ 1 file changed, 1171 insertions(+) create mode 100644 code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb diff --git a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb new file mode 100644 index 000000000..f790d444d --- /dev/null +++ b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb @@ -0,0 +1,1171 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Single-nuclei Pseudobulk Preprocessing (RNA-seq and ATAC-seq) Pipeline\n", + "\n", + "## Overview\n", + "\n", + "This pipeline preprocesses single-nuclei pseudobulk **count** data (snATAC-seq or snRNA-seq) for downstream QTL analysis and region-specific studies.\n", + "\n", + "**Goals:**\n", + "- Aggregate single-cell counts into pseudobulk count matrices\n", + "- Transform raw pseudobulk counts into analysis-ready formats\n", + "- Remove technical confounders\n", + "- Generate QTL-ready phenotype files or region-specific datasets\n", + "\n", + "## Pipeline Structure\n", + "\n", + 
"```\n", + "Step 1: Pseudobulk count matrix generation [pseudobulk_count]\n", + " ↓\n", + "Step 2: Sample ID Mapping [sampleid_mapping]\n", + " ↓\n", + "Step 3: Pseudobulk QC [pseudobulk_qc]\n", + " (optional) Region Peak/Gene Filtering \n", + " (optional) Batch Correction (ComBat or limma)\n", + " (optional) Quantile Normalization\n", + " ↓\n", + "Step 4: Phenotype Reformatting → BED [phenotype_formatting]\n", + " (genome-wide QTL mapping, snATAC-seq only) \n", + "```\n", + "\n", + "## Modality Support\n", + "\n", + "| Feature | snATAC-seq | snRNA-seq |\n", + "|---------|-----------|-----------|\n", + " Pseudobulk count generation | TBD | ✓ |\n", + "| Sample ID mapping | ✓ | ✓ |\n", + "| Region/gene filtering | ✓ (`--regions`) | ✓ (`--gene-list`) |\n", + "| Blacklist filtering | ✓ | — |\n", + "| `pseudobulk_qc` step | ✓ | ✓ |\n", + "| `phenotype_formatting` step | ✓ | — (refer to this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb)) |\n", + "\n", + "## Input Files\n", + "\n", + "All toy input files required to run this pipeline can be downloaded\n", + "[here](https://drive.google.com/drive/folders/13ORslmqWTpICMIufhj_mrdL1KxQsG4lH?usp=drive_link).\n", + "\n", + "| File | Used in |\n", + "|------|---------|\n", + "| `celltyped_seuratobj{i}.rds` | Step 1 |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Step 2, Step 3 |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Step 2, Step 3 |\n", + "| `metadata_{celltype}.csv` | Step 2, Step 3 |\n", + "| `rosmap_sample_mapping_data.csv` | Step 2 |\n", + "| `tech_vars_{celltype}.csv` | Step 2 |\n", + "| `hg38-blacklist.v2.bed.gz` | Step 2 (snATAC-seq only) |\n", + "\n", + "\n", + "## Minimal Working Example" + ] + }, + { + "cell_type": "markdown", + "id": "8d6e0ae1-75b0-4445-a579-080164bbde26", + "metadata": {}, + "source": [ + "## Step 1: Pseudobulk Count Matrix Generation\n", + "\n", + "Aggregates single-nuclei counts into pseudobulk count matrices per cell type from Seurat objects.\n", + "\n", + "> This step is upstream of `sampleid_mapping` and `pseudobulk_qc`. Output feeds directly into the existing preprocessing pipeline.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `celltyped_seuratobj{i}.rds` | Seurat objects with `celltype` and `sample` annotations in `meta.data` |\n", + "\n", + "### Process\n", + "\n", + "1. Load each Seurat object and subset to target cell type — skips objects where cell type is not present\n", + "2. Merge all subsets across objects and join layers\n", + "3. Aggregate raw counts by sample (`AggregateExpression`)\n", + "4. Filter out samples with fewer than `min_cells` cells (default: 10)\n", + "5. Strip Ensembl version suffixes from gene IDs (`ENSG00000000010.1` → `ENSG00000000010`)\n", + "6. 
Save as `pseudobulk_counts_{celltype}.csv.gz` — raw counts only, normalization handled downstream in `pseudobulk_qc`\n", + "\n", + "> **GLU cell type**: Due to its large size, process in two batches (files 1–6 and 7–11) and pass separately.\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `seurat_files` | *required* | One or more Seurat `.rds` files |\n", + "| `output_dir` | *required* | Output directory for count matrix |\n", + "| `celltype` | `MIC` | Cell type to extract (must match `celltype` column in `meta.data`) |\n", + "| `min_cells` | `10` | Minimum number of cells per sample to retain |\n", + "\n", + "### Output\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `pseudobulk_counts_{celltype}.csv.gz` | Raw pseudobulk count matrix (genes × samples) |\n", + "\n", + "\n", + "**Timing:** 10–30 min per cell type depending on object size" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5e1154b-bb27-428b-bf5b-a4967fa43377", + "metadata": {}, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_counts \\\n", + " --seurat-files /restricted/projectnb/xqtl/ROSMAP_snRNAseq_newreference/singleR_results/celltyped_seuratobj11.rds \\\n", + " --output-dir output/snrna_seq \\\n", + " --celltype MIC " + ] + }, + { + "cell_type": "markdown", + "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 1: Sample ID Mapping\n", + "\n", + "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n", + "across metadata and count matrix files.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n", + "| `metadata_{celltype}.csv` | Per-cell-type sample metadata |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Per-cell-type peak count matrices |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Per-cell-type gene count matrices |\n", + "\n", + "### Process\n", + "\n", + "**Part 1 — Metadata files**\n", + "\n", + "For each metadata file:\n", + "1. Look up each `individualID` in the mapping reference\n", + "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n", + "3. Reorder columns: `sampleid` first, then `individualID`, then the rest\n", + "4. Save updated file\n", + "\n", + "**Part 2 — Count matrix files**\n", + "\n", + "For each count file:\n", + "1. Extract the header row (column names only)\n", + "2. Keep the first column (peak or gene IDs) unchanged\n", + "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists, otherwise keep original\n", + "4. Write new header and stream data rows unchanged\n", + "5. 
Recompress with gzip\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `map_file` | *required* | CSV with `individualID` → `sampleid` mapping |\n", + "| `meta_files` | *required* | Metadata CSV files to remap |\n", + "| `count_files` | *required* | Count CSV.gz files to remap |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/1_files_with_sampleid/` |\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/1_files_with_sampleid/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` | Metadata with `sampleid` column prepended |\n", + "| `pseudobulk_peaks_counts_{celltype}.csv.gz` *(snATAC-seq)* | Count matrices with mapped column headers |\n", + "| `pseudobulk_counts_{celltype}.csv.gz` *(snRNA-seq)* | Count matrices with mapped column headers |\n", + "\n", + "\n", + "**Timing:** < 1 min" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb sampleid_mapping \\\n", + " --output-dir output/snatac_seq \\\n", + " --map-file data/rosmap_sample_mapping_data.csv \\\n", + " --meta-files data/snatac_seq/metadata_Mic_50nuc.csv \\\n", + " --count-files data/snatac_seq/0_pseudobulk_count/pseudobulk_peaks_counts_Mic_50nuc.csv.gz" + ] + }, + { + "cell_type": "markdown", + "id": "5540a4da-843a-4789-8123-47911cf519c5", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 2: Pseudobulk QC\n", + "\n", + "Regresses out technical covariates for downstream QTL analysis. Works for both snATAC-seq and snRNA-seq.\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `metadata_{celltype}.csv` | Sample-level metadata (nuclei counts, batch info) |\n", + "| `pseudobulk_*counts_{celltype}.csv.gz` | Pseudobulk count matrix |\n", + "| `tech_vars.csv` | Technical covariates (sampleid + tech var columns, pre-processed) |\n", + "| `hg38-blacklist.v2.bed.gz` *(snATAC-seq, optional)* | Blacklisted genomic regions |\n", + "\n", + "### Process\n", + "\n", + "1. Load count matrix and auto-detect modality (snATAC-seq vs snRNA-seq)\n", + "2. ***(Optional)*** Filter to specific genomic regions (snATAC-seq) or gene list (snRNA-seq)\n", + "3. Load metadata; filter samples with fewer than `min_nuclei` nuclei (default: 20)\n", + "4. Align samples between metadata and count matrix\n", + "5. ***(Optional)*** Filter blacklisted genomic regions (snATAC-seq only)\n", + "6. Merge tech vars from `tech_vars_file` by `sampleid` \n", + "7. Drop samples with NA in any tech var\n", + "8. Apply expression filtering (`filterByExpr`):\n", + " - `min_count = 5`: minimum reads in at least one sample\n", + " - `min_total_count = 15`: minimum total reads across all samples\n", + " - `min_prop = 0.1`: feature expressed in ≥10% of samples\n", + "9. TMM normalization\n", + "10. ***(Optional)*** Batch correction on `sequencingBatch`:\n", + " - `limma::removeBatchEffect` (default)\n", + " - `ComBat` (on log-CPM)\n", + "11. Add `sequencingBatch` and `Library` to model if present and multi-level\n", + "12. Fit linear model (`voom` + `lmFit` + `eBayes`) with **tech vars + batch vars only** \n", + "13. 
Compute `offset + residuals` as final adjusted values:\n", + " - `offset`: intercept + batch effects at reference level\n", + " - `residuals`: variation after removing technical effects; biological signal retained\n", + "14. ***(Optional)*** Quantile normalization of final values\n", + "\n", + "**Model formula:**\n", + "```\n", + "~ {tech_vars} + [sequencingBatch] + [Library]\n", + "```\n", + "> `sequencingBatch` and `Library` included only if present and have more than one level.\n", + "> Biological variables (`pmi`, `study`, `msex`, `age_death` etc.) are **not** included — they should not be regressed out as they may be associated with genotype.\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `meta_files` | *required* | Metadata CSV files (one per cell type) |\n", + "| `count_files` | *required* | Count CSV.gz files (one per cell type, same order as `meta_files`) |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/2_residuals/{ct}/` |\n", + "| `tech_vars_file` | *required* | CSV with `sampleid` + tech var columns |\n", + "| `blacklist_file` | `''` | Genomic blacklist BED file (snATAC-seq only) |\n", + "| `regions` | `''` | Comma-separated genomic regions e.g. `chr7:28000000-28300000` (snATAC-seq) |\n", + "| `gene_list` | `''` | Comma-separated gene IDs e.g. `ENSG00000000010` (snRNA-seq) |\n", + "| `batch_correction` | `FALSE` | Apply batch correction (`TRUE`/`FALSE`) |\n", + "| `batch_method` | `limma` | Batch correction method (`limma` or `combat`) |\n", + "| `quant_norm` | `FALSE` | Apply quantile normalization after residuals |\n", + "| `min_count` | `5` | Min reads in at least one sample |\n", + "| `min_total_count` | `15` | Min total reads across all samples |\n", + "| `min_prop` | `0.1` | Min proportion of samples with expression |\n", + "| `min_nuclei` | `20` | Min nuclei per sample |\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/2_residuals/{celltype}/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_residuals.txt` | Tech-covariate-adjusted values (log2-CPM) |\n", + "| `{celltype}_residuals_qn.txt` | Quantile-normalized adjusted values *(if `quant_norm=TRUE`)* |\n", + "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design, parameters |\n", + "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n", + "\n", + "**Timing:** < 5 min per cell type" + ] + }, + { + "cell_type": "markdown", + "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Pseudobulk QC\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "#snATAC-seq\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --meta-files output/snatac_seq/1_files_with_sampleid/metadata_Mic_50nuc.csv \\\n", + " --count-files output/snatac_seq/1_files_with_sampleid/pseudobulk_peaks_counts_Mic_50nuc.csv.gz \\\n", + " --output-dir output/snatac_seq \\\n", + " --tech-vars-file data/snatac_seq/tech_vars_MIC.csv \\\n", + " --blacklist-file data/hg38-blacklist.v2.bed.gz #only for snATAC-seq\n", + "\n", + "#snRNA-seq\n", + "sos run pipeline/pseudobulk_preprocessing.ipynb pseudobulk_qc \\\n", + " --meta-files output/snrna_seq/1_files_with_sampleid/metadata_MIC.csv \\\n", + " --count-files 
output/snrna_seq/1_files_with_sampleid/pseudobulk_counts_MIC.csv.gz \\\n", + " --output-dir output/snrna_seq \\\n", + " --tech-vars-file data/snrna_seq/tech_vars_MIC.csv \\\n", + " --gene-list ENSG00000000010,ENSG00000000020 " + ] + }, + { + "cell_type": "markdown", + "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed", + "metadata": {}, + "source": [ + "### Additional parameters\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb", + "metadata": {}, + "outputs": [], + "source": [ + "--min-count 5\n", + "--min-total-count 15\n", + "--min-prop 0.1\n", + "--min-nuclei 20\n", + "--quant-norm TRUE\n", + "--batch-correction TRUE \n", + "--batch-method combat # or limma\n", + "--gene-list ENSG00000000010,ENSG00000000020 # for snRNA-seq\n", + "--regions chr7:28000000-28300000 # for snATAC-seq" + ] + }, + { + "cell_type": "markdown", + "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Step 3: Phenotype Reformatting (snATAC-seq only)\n", + "\n", + "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n", + "\n", + "> For snRNA-seq, please follow this [pipeline](https://github.com/StatFunGen/xqtl-protocol/blob/main/code/data_preprocessing/phenotype/phenotype_formatting.ipynb).\n", + "\n", + "### Input\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_residuals.txt` | Residuals from `pseudobulk_qc` |\n", + "\n", + "### Process\n", + "\n", + "1. Read residuals file with proper handling of feature IDs and sample columns\n", + "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n", + "3. Convert to midpoint coordinates (standard for QTLtools):\n", + "```\n", + "start = floor((peak_start + peak_end) / 2)\n", + "end = start + 1\n", + "```\n", + "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample values\n", + "5. Sort by chromosome and position\n", + "6. 
Compress with `bgzip` and index with `tabix`\n", + "\n", + "### Parameters\n", + "\n", + "| Parameter | Default | Description |\n", + "|-----------|---------|-------------|\n", + "| `residual_files` | *required* | Residual txt files from `pseudobulk_qc` |\n", + "| `output_dir` | *required* | Parent output directory; writes to `{output_dir}/3_pheno_reformat/` |\n", + "\n", + "### Output\n", + "\n", + "Output directory: `{output_dir}/3_pheno_reformat/`\n", + "\n", + "| File | Description |\n", + "|------|-------------|\n", + "| `{celltype}_phenotype.bed.gz` | bgzip-compressed BED with midpoint coordinates |\n", + "| `{celltype}_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n", + "\n", + "Compatible with FastQTL, TensorQTL, and QTLtools.\n", + "\n", + "**Timing:** < 1 min per cell type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb phenotype_formatting \\\n", + " --residual-files output/snatac_seq/2_residuals/Mic_50nuc/Mic_50nuc_residuals.txt \\\n", + " --output-dir output/snatac_seq" + ] + }, + { + "cell_type": "markdown", + "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/pseudobulk_preprocessing.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "id": "0e17a301-cca9-49a1-843b-4248546f1f79", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Setup and global parameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "parameter: cwd = path(\"output\")\n", + "parameter: job_size = 1\n", + "parameter: walltime = \"5h\"\n", + "parameter: mem = \"16G\"\n", + "parameter: numThreads = 8\n", + "parameter: container = \"\"\n", + "\n", + "import re\n", + "from sos.utils import expand_size\n", + "\n", + "entrypoint = (\n", + " 'micromamba run -a \"\" -n' + ' ' +\n", + " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n", + ") if container else \"\"\n", + "\n", + "cwd = path(f'{cwd:a}')" + ] + }, + { + "cell_type": "markdown", + "id": "eee58015-c8e2-4697-bdae-58d7e494640d", + "metadata": {}, + "source": [ + "```\n", + "usage: sos run pipeline/pseudobulk_preprocessing.ipynb\n", + " [workflow_name | -t targets] [options] [workflow_options]\n", + " workflow_name: Single or combined workflows defined in this script\n", + " targets: One or more targets to generate\n", + " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", + " workflow_options: Double-hyphen workflow-specific parameters\n", + "Workflows:\n", + " sampleid_mapping\n", + " pseudobulk_qc\n", + " phenotype_formatting\n", + "Global Workflow Options:\n", + " --cwd output (as path)\n", + " --job-size 1 (as int)\n", + " --walltime 5h\n", + " --mem 16G\n", + " --numThreads 8 (as int)\n", + " --container ''\n", + "Sections\n", + " sampleid_mapping:\n", + " Workflow Options:\n", + " --map-file VAL (as str, required)\n", + " --output-dir VAL (as str, required)\n", + " --meta-files (as list)\n", + " --count-files (as list)\n", + " pseudobulk_qc:\n", + " Workflow 
Options:\n", + " --meta-files (as list)\n", + " --count-files (as list)\n", + " --output-dir VAL (as str, required)\n", + " --tech-vars-file VAL (as str, required)\n", + " --blacklist-file ''\n", + " --batch-correction FALSE\n", + " --batch-method limma\n", + " --quant-norm FALSE\n", + " --min-count 5 (as int)\n", + " --min-total-count 15 (as int)\n", + " --min-prop 0.1 (as float)\n", + " --min-nuclei 20 (as int)\n", + " --regions ''\n", + " --gene-list ''\n", + " phenotype_formatting:\n", + " Workflow Options:\n", + " --residual-files (as list)\n", + " --output-dir VAL (as str, required)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "2ff778ed-2c0a-458d-889a-9c3a6d4a99f0", + "metadata": {}, + "source": [ + "## `pseudobulk_counts`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cea5ab18-f475-44f4-a950-12d9b6321178", + "metadata": {}, + "outputs": [], + "source": [ + "[pseudobulk_counts]\n", + "parameter: seurat_files = []\n", + "parameter: output_dir = str\n", + "parameter: celltype = 'MIC'\n", + "parameter: min_cells = 10\n", + "\n", + "import os\n", + "\n", + "input: seurat_files\n", + "output: f'{output_dir}/0_pseudobulk_counts/pseudobulk_counts_{celltype}.csv.gz'\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '4:00:00', mem = '64G', cores = 4\n", + "\n", + "R: expand = \"${ }\", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'\n", + "\n", + " library(Seurat)\n", + " library(data.table)\n", + "\n", + " seurat_files <- c(${','.join([f'\"{f}\"' for f in seurat_files])})\n", + " celltype <- \"${celltype}\"\n", + " min_cells <- ${min_cells}\n", + " output_dir <- \"${output_dir}\"\n", + " out_file <- \"${_output}\"\n", + "\n", + " message(\"Loading and subsetting Seurat objects for: \", celltype)\n", + "\n", + " # ── 1. Load and subset each Seurat object ─────────────────────\n", + " subsets <- list()\n", + " for (f in seurat_files) {\n", + " message(\"Loading: \", basename(f))\n", + " obj <- readRDS(f)\n", + " if (celltype %in% obj$celltype) {\n", + " subsets[[length(subsets) + 1]] <- subset(obj, subset = celltype == celltype)\n", + " } else {\n", + " message(\" Skipping - celltype not found in: \", basename(f))\n", + " }\n", + " rm(obj)\n", + " gc()\n", + " }\n", + "\n", + " if (length(subsets) == 0) stop(\"No Seurat objects contain celltype: \", celltype)\n", + " message(\"Found \", celltype, \" in \", length(subsets), \" objects\")\n", + "\n", + " # ── 2. Merge and aggregate ────────────────────────────────────\n", + " merged <- Reduce(merge, subsets)\n", + " rm(subsets)\n", + " gc()\n", + "\n", + " merged <- JoinLayers(merged)\n", + " merged <- SetIdent(merged, value = \"sample\")\n", + "\n", + " cell_counts <- table(merged@meta.data$sample)\n", + " expr <- AggregateExpression(merged, group.by = \"sample\", slot = \"counts\")$RNA\n", + " rm(merged)\n", + " gc()\n", + "\n", + " # ── 3. Filter samples with < min_cells ────────────────────────\n", + " valid_samples <- names(cell_counts[cell_counts >= min_cells])\n", + " expr <- expr[, valid_samples]\n", + " message(\"Samples after min_cells (>= \", min_cells, \") filter: \", ncol(expr))\n", + "\n", + " # ── 4. Strip Ensembl version suffixes ─────────────────────────\n", + " rownames(expr) <- gsub(\"\\\\..*$\", \"\", rownames(expr))\n", + "\n", + " # ── 5. 
Save as csv.gz ─────────────────────────────────────────\n", + " message(\"Genes: \", nrow(expr), \" | Samples: \", ncol(expr))\n", + " dt <- data.table(gene_id = rownames(expr), as.data.frame(expr))\n", + " fwrite(dt, out_file, compress = \"gzip\")\n", + " message(\"Saved: \", out_file)" + ] + }, + { + "cell_type": "markdown", + "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `sampleid_mapping`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[sampleid_mapping]\n", + "parameter: map_file = str\n", + "parameter: output_dir = str\n", + "parameter: meta_files = []\n", + "parameter: count_files = []\n", + "\n", + "import os\n", + "\n", + "input: meta_files + count_files\n", + "output: [f'{output_dir}/1_files_with_sampleid/{os.path.basename(f)}' for f in meta_files + count_files]\n", + " \n", + "python: expand = \"${ }\"\n", + "import pandas as pd\n", + "import gzip\n", + "import os\n", + "import subprocess\n", + "import csv\n", + "import numpy as np\n", + "import tempfile\n", + "\n", + "map_df = pd.read_csv(\"${map_file}\")\n", + "id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n", + "output_dir = \"${output_dir}/1_files_with_sampleid\"\n", + "meta_files = ${meta_files}\n", + "count_files = ${count_files}\n", + "\n", + "os.makedirs(output_dir, exist_ok=True)\n", + "\n", + "def map_id(ind_id):\n", + " return id_map.get(ind_id, ind_id)\n", + "\n", + "def format_value(val):\n", + " if pd.isna(val):\n", + " return ''\n", + " if isinstance(val, (int, np.integer)):\n", + " return str(val)\n", + " if isinstance(val, (float, np.floating)):\n", + " if val == int(val):\n", + " return str(int(val))\n", + " else:\n", + " return str(val)\n", + " return str(val)\n", + "\n", + "# ── Process metadata ───────────────────────────────────────────────────────\n", + "for in_path in meta_files:\n", + " fname = os.path.basename(in_path)\n", + " out_path = os.path.join(output_dir, fname)\n", + " meta = pd.read_csv(in_path)\n", + " if \"individualID\" not in meta.columns:\n", + " print(f\"Warning: individualID column not found in {fname}, skipping.\")\n", + " continue\n", + " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n", + " cols = meta.columns.tolist()\n", + " cols.remove(\"sampleid\")\n", + " cols.remove(\"individualID\")\n", + " meta = meta[[\"sampleid\", \"individualID\"] + cols]\n", + " with open(out_path, 'w', newline='') as f:\n", + " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n", + " writer.writerow(meta.columns)\n", + " for _, row in meta.iterrows():\n", + " writer.writerow([format_value(val) for val in row])\n", + "\n", + "# ── Process count files ────────────────────────────────────────────────────\n", + "for in_path in count_files:\n", + " fname = os.path.basename(in_path)\n", + " out_path = os.path.join(output_dir, fname)\n", + " with gzip.open(in_path, \"rt\") as fh:\n", + " header_line = fh.readline().rstrip(\"\\n\")\n", + " col_names = header_line.split(\",\")\n", + " peak_id_col = col_names[0]\n", + " new_sample_cols = [map_id(s) for s in col_names[1:]]\n", + " new_header = \",\".join([peak_id_col] + new_sample_cols)\n", + " tmp = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n", + " tmp.write(new_header + \"\\n\")\n", + " tmp.close()\n", + " cmd = f\"zcat {in_path} | tail -n +2 | cat {tmp.name} - | gzip -6 > {out_path}\"\n", + " subprocess.run(cmd, 
shell=True, check=True)\n", + " os.unlink(tmp.name)" + ] + }, + { + "cell_type": "markdown", + "id": "f0884ae7-a851-425a-86dd-b606768a012e", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `pseudobulk_qc`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c46328b-c3d8-46f8-8c71-bad27820438e", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[pseudobulk_qc]\n", + "parameter: meta_files = []\n", + "parameter: count_files = []\n", + "parameter: output_dir = str\n", + "parameter: tech_vars_file = str\n", + "parameter: blacklist_file = ''\n", + "parameter: batch_correction = \"FALSE\"\n", + "parameter: batch_method = \"limma\"\n", + "parameter: quant_norm = \"FALSE\"\n", + "parameter: min_count = 5\n", + "parameter: min_total_count = 15\n", + "parameter: min_prop = 0.1\n", + "parameter: min_nuclei = 20\n", + "parameter: regions = ''\n", + "parameter: gene_list = ''\n", + "\n", + "import os\n", + "\n", + "_cts = [os.path.basename(f).replace('metadata_','').replace('.csv','') for f in meta_files]\n", + "\n", + "input: meta_files + count_files\n", + "output: [f'{output_dir}/2_residuals/{ct}/{ct}_residuals.txt' for ct in _cts]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n", + "\n", + "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n", + "\n", + " library(edgeR)\n", + " library(limma)\n", + " library(data.table)\n", + " library(GenomicRanges)\n", + " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n", + "\n", + " # ── predictOffset ──────────────────────────────────────────────────────\n", + " predictOffset <- function(fit, tech_vars) {\n", + " D <- fit$design\n", + " Dm <- D\n", + " for (col in colnames(D)) {\n", + " if (col == \"(Intercept)\") next\n", + " is_tech <- any(sapply(tech_vars, function(v) grepl(paste0(\"^\", v), col)))\n", + " if (is_tech) {\n", + " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n", + " Dm[, col] <- median(D[, col], na.rm=TRUE)\n", + " else\n", + " Dm[, col] <- 0\n", + " } else {\n", + " Dm[, col] <- 0\n", + " }\n", + " }\n", + " B <- fit$coefficients\n", + " B[is.na(B)] <- 0\n", + " off <- B %*% t(Dm)\n", + " colnames(off) <- rownames(fit$design)\n", + " return(off)\n", + " }\n", + "\n", + " filter_blacklist <- function(mat, bed, feat_label) {\n", + " peaks <- data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " bl <- fread(bed)[, 1:3]\n", + " setnames(bl, c(\"chr\",\"start\",\"end\"))\n", + " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n", + " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n", + " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n", + " if (length(blacklisted) > 0) {\n", + " message(\"Blacklisted \", feat_label, \" removed: \", length(blacklisted))\n", + " return(mat[-blacklisted, , drop=FALSE])\n", + " }\n", + " return(mat)\n", + " }\n", + "\n", + " parse_regions <- function(region_str) {\n", + " if (is.null(region_str) || region_str == \"\") return(NULL)\n", + " lapply(strsplit(region_str, \",\")[[1]], function(r) {\n", + " parts <- strsplit(trimws(r), \":|−|-\")[[1]]\n", + " list(chr=parts[1], start=as.integer(parts[2]), end=as.integer(parts[3]))\n", + " })\n", + " }\n", + "\n", + " filter_regions <- 
function(mat, regions) {\n", + " peaks <- data.table(id = rownames(mat))\n", + " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n", + " peaks[, `:=`(start = as.integer(start), end = as.integer(end))]\n", + " gr_peaks <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n", + " gr_regions <- GRanges(\n", + " sapply(regions, `[[`, \"chr\"),\n", + " IRanges(sapply(regions, `[[`, \"start\"), sapply(regions, `[[`, \"end\"))\n", + " )\n", + " keep <- unique(queryHits(findOverlaps(gr_peaks, gr_regions)))\n", + " if (length(keep) == 0) stop(\"No peaks overlap the specified regions.\")\n", + " message(\"Peaks after region filter: \", length(keep))\n", + " mat[keep, , drop=FALSE]\n", + " }\n", + "\n", + " meta_files <- c(${','.join([f'\"{f}\"' for f in meta_files])})\n", + " count_files <- c(${','.join([f'\"{f}\"' for f in count_files])})\n", + "\n", + " if (length(meta_files) != length(count_files))\n", + " stop(\"meta_files and count_files must have the same length and order.\")\n", + "\n", + " # ── Load tech vars from file ───────────────────────────────────────────\n", + " tech_df <- fread(\"${tech_vars_file}\")\n", + " tech_vars <- setdiff(colnames(tech_df), \"sampleid\")\n", + " message(\"Tech vars: \", paste(tech_vars, collapse=\", \"))\n", + "\n", + " regions <- parse_regions(\"${regions}\")\n", + " gene_list <- trimws(strsplit(\"${gene_list}\", \",\")[[1]])\n", + " gene_list <- gene_list[gene_list != \"\"]\n", + "\n", + " for (i in seq_along(meta_files)) {\n", + " meta_file <- meta_files[i]\n", + " counts_file <- count_files[i]\n", + " ct <- sub(\"\\\\.csv$\", \"\", sub(\"^metadata_\", \"\", basename(meta_file)))\n", + "\n", + " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n", + " message(\"Processing: \", ct)\n", + " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n", + " message(\"Quantile normalization: \", as.logical(\"${quant_norm}\"))\n", + " message(paste(rep(\"=\", 40), collapse=\"\"))\n", + "\n", + " outdir <- file.path(\"${output_dir}/2_residuals\", ct)\n", + " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n", + "\n", + " # ── 1. Load counts ─────────────────────────────────────────────────\n", + " counts_raw <- fread(counts_file)\n", + " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n", + " rownames(counts) <- counts_raw[[1]]\n", + " rm(counts_raw)\n", + "\n", + " # ── Auto-detect modality ───────────────────────────────────────────\n", + " is_atac <- grepl(\"^chr.*-[0-9]+-[0-9]+$\", rownames(counts)[1])\n", + " feat_label <- ifelse(is_atac, \"peaks\", \"genes\")\n", + " message(\"Modality: \", ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\"))\n", + " message(\"Loaded: \", nrow(counts), \" \", feat_label, \" x \", ncol(counts), \" samples\")\n", + "\n", + " # ── 1b. Region/gene filtering (optional) ──────────────────────────\n", + " if (is_atac && !is.null(regions)) {\n", + " message(\"Filtering peaks to specified regions...\")\n", + " counts <- filter_regions(counts, regions)\n", + " } else if (!is_atac && length(gene_list) > 0) {\n", + " genes_present <- intersect(rownames(counts), gene_list)\n", + " if (length(genes_present) == 0) stop(\"No matching genes found in count matrix.\")\n", + " message(\"Genes after gene_list filter: \", length(genes_present))\n", + " counts <- counts[genes_present, , drop=FALSE]\n", + " }\n", + "\n", + " # ── 2. 
Load metadata ───────────────────────────────────────────────\n", + " meta <- fread(meta_file)\n", + " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n", + " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n", + "\n", + " # ── 3. Nuclei filter ──────────────────────────────────────────────\n", + " n_nuclei_col <- intersect(c(\"n_nuclei\",\"n.nuclei\",\"nNuclei\",\"nuclei_count\"), colnames(meta))[1]\n", + " if (!is.na(n_nuclei_col)) {\n", + " meta <- meta[meta[[n_nuclei_col]] > ${min_nuclei}]\n", + " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n", + " }\n", + "\n", + " # ── 4. Align samples ──────────────────────────────────────────────\n", + " common <- intersect(meta[[idcol]], colnames(counts))\n", + " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n", + " counts <- counts[, common, drop=FALSE]\n", + " message(\"Samples after alignment: \", length(common))\n", + "\n", + " # ── 5. Blacklist filtering ─────────────────────────────────────────\n", + " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n", + " counts <- filter_blacklist(counts, \"${blacklist_file}\", feat_label)\n", + " message(feat_label, \" after blacklist filter: \", nrow(counts))\n", + " } else {\n", + " message(\"No blacklist file - skipping.\")\n", + " }\n", + "\n", + " # ── 6. Merge tech vars by sampleid ────────────────────────────────\n", + " tech_sub <- tech_df[tech_df$sampleid %in% common]\n", + " tech_sub <- tech_sub[match(common, tech_sub$sampleid)]\n", + "\n", + " # ── 7. Drop samples with NA in tech vars ──────────────────────────\n", + " keep_rows <- complete.cases(tech_sub[, ..tech_vars])\n", + " tech_sub <- tech_sub[keep_rows]\n", + " counts <- counts[, tech_sub$sampleid, drop=FALSE]\n", + " message(\"Valid samples for modelling: \", nrow(tech_sub))\n", + "\n", + " # ── 8. Expression filtering ────────────────────────────────────────\n", + " dge <- DGEList(counts=counts, samples=tech_sub)\n", + " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n", + " message(feat_label, \" before filter: \", nrow(dge))\n", + "\n", + " keep <- filterByExpr(dge, group=dge$samples$group,\n", + " min.count=${min_count},\n", + " min.total.count=${min_total_count},\n", + " min.prop=${min_prop})\n", + " dge <- dge[keep,, keep.lib.sizes=FALSE]\n", + " message(feat_label, \" after filter: \", nrow(dge))\n", + "\n", + " write.table(dge$counts,\n", + " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n", + " sep=\"\\t\", quote=FALSE, col.names=NA)\n", + "\n", + " # ── 9. TMM normalization ───────────────────────────────────────────\n", + " dge <- calcNormFactors(dge, method=\"TMM\")\n", + "\n", + " # ── 10. 
Optional batch correction ──────────────────────────────────\n", + " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n", + " batches <- dge$samples$sequencingBatch\n", + " batch_counts <- table(batches)\n", + " valid_batches <- names(batch_counts[batch_counts > 1])\n", + " keep_bc <- batches %in% valid_batches\n", + " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n", + " batches <- batches[keep_bc]\n", + " message(\"Samples after singleton batch removal: \", ncol(dge))\n", + "\n", + " if (\"${batch_method}\" == \"combat\") {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- ComBat(dat=logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"ComBat applied on log-CPM.\")\n", + " } else {\n", + " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n", + " logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n", + " dge$counts <- round(pmax(2^logCPM, 0))\n", + " message(\"limma removeBatchEffect applied.\")\n", + " }\n", + " }\n", + "\n", + " # ── 11. Add batch vars to model if multi-level ────────────────────\n", + " batch_vars <- c()\n", + " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$sequencingBatch)) > 1) {\n", + " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n", + " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n", + " }\n", + " if (\"Library\" %in% colnames(dge$samples) &&\n", + " length(unique(dge$samples$Library)) > 1) {\n", + " dge$samples$Library_factor <- factor(dge$samples$Library)\n", + " batch_vars <- c(batch_vars, \"Library_factor\")\n", + " }\n", + "\n", + " # ── 12. Build design matrix ────────────────────────────────────────\n", + " all_model_vars <- intersect(c(tech_vars, batch_vars), colnames(dge$samples))\n", + " form <- as.formula(paste(\"~\", paste(all_model_vars, collapse=\" + \")))\n", + " design <- model.matrix(form, data=dge$samples)\n", + " message(\"Formula: \", deparse(form))\n", + "\n", + " if (!is.fullrank(design)) {\n", + " message(\"Design not full rank - trimming.\")\n", + " qr_d <- qr(design)\n", + " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n", + " }\n", + " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n", + "\n", + " # ── 13. Voom + lmFit + eBayes ─────────────────────────────────────\n", + " v <- voom(dge, design, plot=FALSE)\n", + " fit <- lmFit(v, design)\n", + " fit <- eBayes(fit)\n", + "\n", + " # ── 14. Offset + residuals ─────────────────────────────────────────\n", + " off <- predictOffset(fit, tech_vars=tech_vars)\n", + " res <- residuals(fit, v$E)\n", + " final <- off + res\n", + "\n", + " # ── 15. Save residuals ─────────────────────────────────────────────\n", + " out_file <- file.path(outdir, paste0(ct, \"_residuals.txt\"))\n", + " write.table(final, out_file, sep=\"\\t\", quote=FALSE, col.names=NA)\n", + " message(\"Saved: \", out_file)\n", + " message(\" \", ifelse(is_atac,\"Peaks\",\"Genes\"), \": \", nrow(final), \" | Samples: \", ncol(final))\n", + "\n", + " # ── 16. 
Optional quantile normalization ───────────────────────────\n", + " if (as.logical(\"${quant_norm}\")) {\n", + " final_qn <- t(apply(final, 1, rank, ties.method=\"average\"))\n", + " final_qn <- stats::qnorm(final_qn / (ncol(final_qn) + 1))\n", + " qn_file <- file.path(outdir, paste0(ct, \"_residuals_qn.txt\"))\n", + " write.table(final_qn, qn_file, sep=\"\\t\", quote=FALSE, col.names=NA)\n", + " message(\"Saved QN: \", qn_file)\n", + "\n", + " saveRDS(list(\n", + " dge=dge, offset=off, residuals=res,\n", + " final_data=final, final_data_qn=final_qn,\n", + " valid_samples=colnames(dge), design=design, fit=fit, model=form,\n", + " tech_vars=tech_vars, batch_vars=batch_vars,\n", + " batch_correction=as.logical(\"${batch_correction}\"),\n", + " batch_method=ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", + " quant_norm=TRUE,\n", + " modality=ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", + " ), file.path(outdir, paste0(ct, \"_results_qn.rds\")))\n", + " } else {\n", + " saveRDS(list(\n", + " dge=dge, offset=off, residuals=res,\n", + " final_data=final,\n", + " valid_samples=colnames(dge), design=design, fit=fit, model=form,\n", + " tech_vars=tech_vars, batch_vars=batch_vars,\n", + " batch_correction=as.logical(\"${batch_correction}\"),\n", + " batch_method=ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"),\n", + " quant_norm=FALSE,\n", + " modality=ifelse(is_atac, \"snATAC-seq\", \"snRNA-seq\")\n", + " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n", + " }\n", + "\n", + " message(\"Completed: \", ct, \" -> \", outdir)\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## `phenotype_reformatting`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73", + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[phenotype_formatting]\n", + "parameter: residual_files = []\n", + "parameter: output_dir = str\n", + "\n", + "import os\n", + "\n", + "_cts = [os.path.basename(os.path.dirname(f)) for f in residual_files]\n", + "\n", + "input: residual_files\n", + "output: [f'{output_dir}/3_pheno_reformat/{ct}_phenotype.bed.gz' for ct in _cts]\n", + "\n", + "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n", + "\n", + "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n", + " import os\n", + " import subprocess\n", + " import pandas as pd\n", + "\n", + " residual_files = ${residual_files}\n", + " output_dir = \"${output_dir}\"\n", + "\n", + " def read_residuals(path):\n", + " first_line = open(path).readline().rstrip(\"\\n\")\n", + " col_names = first_line.split(\"\\t\")\n", + " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n", + " if df.shape[1] > len(col_names):\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names\n", + " else:\n", + " peak_ids = df.iloc[:, 0].values\n", + " df = df.iloc[:, 1:]\n", + " df.columns = col_names[1:]\n", + " return peak_ids, df\n", + "\n", + " def to_midpoint_bed(peak_ids, residuals):\n", + " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n", + " chrs = parts[0].values\n", + " starts = parts[1].astype(int).values\n", + " ends = parts[2].astype(int).values\n", + " mids = ((starts + ends) // 2).astype(int)\n", + " bed = pd.DataFrame({\n", + " \"#chr\": chrs,\n", + " \"start\": 
mids,\n", + " \"end\": mids + 1,\n", + " \"ID\": peak_ids\n", + " })\n", + " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n", + " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n", + "\n", + " def run_cmd(cmd, label):\n", + " r = subprocess.run(cmd, capture_output=True)\n", + " if r.returncode != 0:\n", + " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n", + " else:\n", + " print(f\"{label}: OK\")\n", + "\n", + " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n", + " os.makedirs(out_dir, exist_ok=True)\n", + "\n", + " for res_path in residual_files:\n", + " ct = os.path.basename(os.path.dirname(res_path))\n", + "\n", + " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n", + "\n", + " if not os.path.exists(res_path):\n", + " print(f\"WARNING: {res_path} not found, skipping.\")\n", + " continue\n", + "\n", + " peak_ids, residuals = read_residuals(res_path)\n", + " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n", + "\n", + " bed = to_midpoint_bed(peak_ids, residuals)\n", + " out_bed = os.path.join(out_dir, f\"{ct}_phenotype.bed\")\n", + " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n", + " print(f\"Written: {out_bed}\")\n", + "\n", + " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n", + " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n", + " print(f\"Completed: {ct} -> {out_dir}\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "4.4.3" + }, + "sos": { + "kernels": [ + [ + "SoS", + "sos", + "sos", + "", + "" + ] + ], + "version": "" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 102600512d0b5046f6c83baadae9129a296b4766 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Fri, 27 Feb 2026 11:42:56 -0500 Subject: [PATCH 12/12] Fix formatting in pseudobulk_preprocessing notebook --- code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb index f790d444d..cbac42ea5 100644 --- a/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb +++ b/code/molecular_phenotypes/QC/pseudobulk_preprocessing.ipynb @@ -39,7 +39,7 @@ "\n", "| Feature | snATAC-seq | snRNA-seq |\n", "|---------|-----------|-----------|\n", - " Pseudobulk count generation | TBD | ✓ |\n", + "| Pseudobulk count generation | TBD | ✓ |\n", "| Sample ID mapping | ✓ | ✓ |\n", "| Region/gene filtering | ✓ (`--regions`) | ✓ (`--gene-list`) |\n", "| Blacklist filtering | ✓ | — |\n",
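
The Step 1 description (subset each Seurat object to the target cell type, merge, then aggregate raw counts by sample) can be tried on a single object in isolation. The sketch below is a minimal, illustrative version under the same assumptions as the pipeline (Seurat v5 objects with `celltype` and `sample` columns in `meta.data`); `target_ct` and the input file name are placeholders, not pipeline parameters. Cells are selected by name so the comparison is unambiguously against the target string rather than against the `meta.data` column that is also called `celltype`.

```
library(Seurat)

target_ct <- "MIC"                                   # hypothetical cell type label
obj <- readRDS("celltyped_seuratobj1.rds")           # one of the Step 1 inputs

# Select cells by barcode so the comparison is against the string `target_ct`,
# not against the meta.data column of the same name.
keep_cells <- colnames(obj)[obj$celltype == target_ct]
sub <- subset(obj, cells = keep_cells)

# Pseudobulk: sum raw counts within each sample (mirrors the pipeline call).
pb <- AggregateExpression(sub, group.by = "sample", slot = "counts")$RNA
dim(pb)                                              # genes x samples
```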
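
The `pseudobulk_qc` step reports `offset + residuals` as the adjusted phenotype. A compact way to see what that means: fit per-feature linear models on the technical covariates, keep the residuals (technical variation removed), and add back the fitted values obtained with the technical covariates held at a reference (median) level, so the result stays on the log2-CPM scale. The sketch below is a simplified illustration on simulated data (plain limma, no voom weights); covariate names such as `frip` and `tss_enrich` are invented for the example.

```
library(limma)

set.seed(1)
n_feat <- 100; n_samp <- 12
logcpm <- matrix(rnorm(n_feat * n_samp, mean = 5), nrow = n_feat,
                 dimnames = list(paste0("peak", 1:n_feat), paste0("s", 1:n_samp)))
tech <- data.frame(frip = runif(n_samp, 0.2, 0.6),       # illustrative tech covariates
                   tss_enrich = rnorm(n_samp, 8, 1))

design <- model.matrix(~ frip + tss_enrich, data = tech)
fit <- lmFit(logcpm, design)

# Residuals remove all modelled (technical) variation ...
res <- residuals(fit, logcpm)

# ... and the offset re-centres them: intercept plus technical terms held at
# their median, keeping adjusted values interpretable as log2-CPM.
design_ref <- design
design_ref[, "frip"]       <- median(design[, "frip"])
design_ref[, "tss_enrich"] <- median(design[, "tss_enrich"])
offset <- fit$coefficients %*% t(design_ref)

adjusted <- offset + res     # analogous to {celltype}_residuals.txt
```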
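
The optional `--quant-norm TRUE` output is a per-feature rank-based inverse normal transform of the adjusted values, mirroring the `rank`/`qnorm` calls in the `pseudobulk_qc` code. A one-feature illustration:

```
x  <- c(2.1, 5.0, 3.3, 4.2, 2.8)              # adjusted values for one feature
r  <- rank(x, ties.method = "average")        # ranks 1..n, ties averaged
qn <- qnorm(r / (length(x) + 1))              # map ranks to N(0,1) quantiles
round(qn, 3)

# A whole matrix is transformed row by row, as in the pipeline:
# t(apply(adjusted, 1, function(v)
#   qnorm(rank(v, ties.method = "average") / (length(v) + 1))))
```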
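
The midpoint BED conversion in `phenotype_formatting` is implemented in Python in the pipeline; the small R sketch below only illustrates the coordinate arithmetic (`start = floor((peak_start + peak_end) / 2)`, `end = start + 1`) and the expected column layout, using toy peak IDs in the `chr-start-end` format the pipeline expects.

```
peak_ids <- c("chr1-10500-11100", "chr1-2000-2600")   # toy peak IDs (chr-start-end)

parts  <- do.call(rbind, strsplit(peak_ids, "-"))
starts <- as.integer(parts[, 2])
ends   <- as.integer(parts[, 3])
mid    <- (starts + ends) %/% 2                       # floor((start + end) / 2)

bed <- data.frame(chr = parts[, 1], start = mid, end = mid + 1, ID = peak_ids)
bed[order(bed$chr, bed$start), ]

# In the pipeline, per-sample adjusted values are appended after the ID column,
# then the file is bgzip-compressed and tabix-indexed for QTL tools.
```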