This is a Snakemake pipeline for automated preprocessing of spatial lineage tracing data from SpatioDARLIN. This repository is a fork of snakemake_DARLIN, modified to support spatial lineage tracing data generated with DARLIN mice on the BMKMANU S3000 platform.
The preprocessing pipeline includes:
- Lineage barcode identification and quality control
- Spatial barcode parsing
- Allele annotation
- Grouping spots into segmented cells
- Generating final clone-by-spots and clone-by-cells matrices
Requirements:
- Conda (for environment management)
- MATLAB (must be available on the command line)
- Python 3.9
- Snakemake 7.24.0
- BSTMatrix (quantification pipeline for BMKMANU S3000)
Update: darlinpy is now provided as an alternative for allele annotation when MATLAB is not accessible. See darlinpy for details.
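The requirements above can be checked before installation with a short script. This is a minimal sketch, not part of the repository: it only looks for the `conda`, `snakemake`, and `matlab` executables on PATH, and the function names `check_tools` and `missing_required` are hypothetical. BSTMatrix is not checked because its executable name is not given here.

```python
import shutil

# Tools the README lists as requirements. MATLAB is treated as optional
# because darlinpy can replace it for allele annotation.
REQUIRED = ("conda", "snakemake")
OPTIONAL = ("matlab",)


def check_tools(which=shutil.which):
    """Return {tool: path or None} for required and optional executables.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return {tool: which(tool) for tool in REQUIRED + OPTIONAL}


def missing_required(found):
    """List required tools that were not found on PATH."""
    return [tool for tool in REQUIRED if found.get(tool) is None]


if __name__ == "__main__":
    found = check_tools()
    for tool, path in found.items():
        print(f"{tool}: {path or 'not found'}")
    if found["matlab"] is None:
        print("MATLAB not found; the darlinpy route may be the better choice.")
```

Running the script after the environment has been created and activated shows at a glance whether the MATLAB or the darlinpy route applies.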
cd /path/to/tools
# wget http://www.bmkmanu.com/wp-content/uploads/2024/07/BSTMatrix_v2.4.f.1.zip
## download latest version
wget http://www.bmkmanu.com/wp-content/uploads/2025/09/BSTMatrix_v2.4.f.4_release_20250902.zip -O BSTMatrix.zip
unzip BSTMatrix.zip
## conda env for BSTMatrix
cd BSTMatrix
conda env create -n BST-env -f environment.yaml
export PATH=/path/to/tools/BSTMatrix:$PATH
kernel_name='spatio_darlin'
conda create -n $kernel_name python=3.9 --yes
conda activate $kernel_name
conda install -c conda-forge -c bioconda snakemake=7.24.0 --yes
pip install jupyterlab umi_tools seaborn papermill biopython cutadapt
pip install numpy==1.24.4
python -m ipykernel install --user --name=$kernel_name
code_directory='.' # change it to the directory where you want to put the packages
cd $code_directory
# Install darlin (this repository)
# If you haven't already cloned this repo, run:
git clone https://github.com/JarningGau/spatio_DARLIN --depth=1
cd spatio_DARLIN # or navigate to where you cloned this repository
python setup.py develop
cd ..
Install the following dependencies in your desired code directory:
# Download MATLAB code Custom_CARLIN for allele annotation
mkdir -p CARLIN_pipeline
cd CARLIN_pipeline
git clone https://github.com/ShouWenWang-Lab/Custom_CARLIN --depth=1
cd ..
Note:
- Ensure MATLAB is installed and available in your command line interface (accessible via the matlab command).
- If MATLAB is not accessible, darlinpy is an alternative for allele calling.
pip install git+https://github.com/JarningGau/darlinpy.git
To test the pipeline with example data:
conda activate $kernel_name
cd test
bash download_bmk.sh
This will download the test data. After downloading, you can run the test pipeline:
# if matlab is accessible
bash test_bmk_matlab.sh
# else
bash test_bmk.sh
The pipeline expects the following input data structure:
data/BMKS3000/
├── fastq/                              # Sequencing reads
│   ├── <sample>_<locus>_R1.fastq.gz
│   └── <sample>_<locus>_R2.fastq.gz
├── images/                             # Image files for BSTMatrix pipeline
│   ├── <sample>_FL.tif                 # ssDNA, not necessary when segmentation results are provided
│   ├── <sample>_HE.tif                 # HE
│   └── <sample>_HE.txt                 # Encoding positions of spatial barcodes
└── segmentation/                       # Cell segmentation results from BSTMatrix
    └── <sample>/
        ├── all_barcode_num.txt         # Spots -> cellbin relationship, obtained when preprocessing the spatial mRNA-seq data
        └── barcodes_pos.tsv.gz         # Spatial barcode positions
Input file descriptions:
- FASTQ files: Paired-end sequencing reads. Naming convention: <sample>_<locus>_R1.fastq.gz and <sample>_<locus>_R2.fastq.gz, where <locus> can be CA, RA, or TA.
- Image files: Required for BSTMatrix pipeline
  - <sample>_FL.tif: Fluorescence image (ssDNA)
  - <sample>_HE.tif: H&E stained image
  - <sample>_HE.txt: Image metadata
- Segmentation files: Generated from BSTMatrix on mRNA data
  - all_barcode_num.txt: Maps spots to cell bins
  - barcodes_pos.tsv.gz: Spatial coordinates of barcodes
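Before launching a run, it can help to confirm that the files described above are in place. The following is a minimal sketch under stated assumptions: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the check assumes segmentation results are provided, so the optional <sample>_FL.tif image is not required.

```python
from pathlib import Path

VALID_LOCI = {"CA", "RA", "TA"}


def expected_inputs(root, sample, locus):
    """Paths implied by the README's layout for one sample and locus.

    Assumes segmentation results are provided, so <sample>_FL.tif is skipped.
    """
    root = Path(root)
    seg = root / "segmentation" / sample
    return [
        root / "fastq" / f"{sample}_{locus}_R1.fastq.gz",
        root / "fastq" / f"{sample}_{locus}_R2.fastq.gz",
        root / "images" / f"{sample}_HE.tif",
        root / "images" / f"{sample}_HE.txt",
        seg / "all_barcode_num.txt",
        seg / "barcodes_pos.tsv.gz",
    ]


def missing_inputs(root, sample, locus):
    """Return the expected input paths that do not exist on disk."""
    if locus not in VALID_LOCI:
        raise ValueError(f"locus must be one of {sorted(VALID_LOCI)}, got {locus!r}")
    return [p for p in expected_inputs(root, sample, locus) if not p.exists()]


if __name__ == "__main__":
    # Sample name taken from the example configuration in this README.
    for path in missing_inputs("data/BMKS3000", "L0927_Brain", "CA"):
        print(f"missing: {path}")
```

An empty result means every file the layout calls for was found; it does not validate file contents.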
Each analysis requires a YAML configuration file. The test directory contains example configs:
test_BMKS3000/
├── config-CA.yaml # Configuration for CA locus
├── config-RA.yaml # Configuration for RA locus
└── config-TA.yaml # Configuration for TA locus
Below is an example configuration file with explanations:
# Sample list to process
SampleList: ['L0927_Brain']
# Template type: 'Tigre_2022_v2' (TA), 'Rosa_v2' (RA), or 'cCARLIN' (CA)
template: 'cCARLIN'
# Directory paths (relative to the config file location)
raw_fastq_dir: '../data/BMKS3000/fastq'
image_dir: '../data/BMKS3000/images'
segmentation_dir: '../data/BMKS3000/segmentation'
# Cutadapt parameters
cutadapt:
  base_quality_cutoff: 10
  threads: 8
# BSTMatrix parameters
BSTMatrix:
  threads: 8
# QC parameters
QC:
  ## Step1. Correct sequencing errors (erroneous nucleotides)
  LB_error_rate: 0.02
  ## Step2. Remove amplification artifacts (chimeric molecules)
  major_fraction_threshold_molecule: 0.8
  ## Step3. Remove capture-oligo carryover artifacts (fake spots)
  ## (SR) spots with k = reads/UMIs >= this value
  slope_cutoff: 10
  ## (SR+UR+LR) molecules with supported reads >= this value
  reads_cutoff: 10
After a successful run (using the bundled test configs or your own), the workspace will resemble:
test_BMKS3000/
├── BST_config/ # BSTMatrix configuration files
├── BST_output/ # Outputs from BSTMatrix
├── config-*.yaml # Input configs (CA/RA/TA)
├── cutadapt/ # Primer-trimmed FASTQs: reads1, spatial barcode + UMI; reads2, lineage barcode
├── DARLIN/ # Intermediate DARLIN pipeline products
├── outs/ # Aggregated results
└── slim_fastq/ # FASTQs for allele annotation
The final results live in test_BMKS3000/outs/:
test_BMKS3000/outs/
└── L0927_Brain_CA/
    ├── all.done
    ├── cellbin/     # Cell-bin level matrices
    ├── level_1      # spots-bin, level 1 matrices (3μm)
    ├── ...
    └── level_18     # spots-bin, level 18 matrices (99μm)
| Level | 18 | 9 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Resolution (μm) | 99 | 48 | 37 | 31 | 25 | 20 | 14 | 8 | 3 |
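When choosing which output directory to load, the level-to-resolution table can be kept as a small lookup. This is a sketch only: the dictionary copies the values from the table, the helper names `coarsest_level_within` and `level_dir` are hypothetical, and levels not listed in the table are not covered.

```python
from pathlib import Path

# Bin level -> resolution in micrometres, copied from the table in this README.
RESOLUTION_UM = {18: 99, 9: 48, 7: 37, 6: 31, 5: 25, 4: 20, 3: 14, 2: 8, 1: 3}


def coarsest_level_within(max_um):
    """Return the listed level with the largest resolution not exceeding max_um."""
    candidates = [level for level, um in RESOLUTION_UM.items() if um <= max_um]
    if not candidates:
        raise ValueError(f"no listed level has a resolution <= {max_um} um")
    # Resolution grows with level, so the highest level is the coarsest.
    return max(candidates)


def level_dir(outs_dir, sample, locus, level):
    """Build the output path <outs>/<sample>_<locus>/level_<n> shown in the tree."""
    return Path(outs_dir) / f"{sample}_{locus}" / f"level_{level}"


if __name__ == "__main__":
    level = coarsest_level_within(50)
    print(level, level_dir("test_BMKS3000/outs", "L0927_Brain", "CA", level))
```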
To run the pipeline with your own data:
- Create a configuration file following the example above
- Ensure your input data follows the expected structure
- Run Snakemake:
conda activate $kernel_name
## When MATLAB is available
snakemake --snakefile snakefiles/BMKS3000_matlab.smk --configfile <your_config.yaml> -c <cores>
## Otherwise
snakemake --snakefile snakefiles/BMKS3000.smk --configfile <your_config.yaml> -c <cores>
Replace <your_config.yaml> with the path to your configuration file and <cores> with the number of CPU cores to use.
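To process several loci in one go, the two invocations can be wrapped in a small driver. The sketch below is an illustration, not part of the repository: `snakemake_command` and `run_loci` are hypothetical helpers, the config-<locus>.yaml naming follows the bundled test configs, and it assumes the command is run from the repository root, where snakefiles/ is located.

```python
import subprocess

# Snakefile paths as given in this README, keyed by whether MATLAB is used.
SNAKEFILES = {
    True: "snakefiles/BMKS3000_matlab.smk",
    False: "snakefiles/BMKS3000.smk",
}


def snakemake_command(config, cores, use_matlab):
    """Build the argument list for one pipeline run."""
    return [
        "snakemake",
        "--snakefile", SNAKEFILES[bool(use_matlab)],
        "--configfile", str(config),
        "-c", str(cores),
    ]


def run_loci(config_dir, loci=("CA", "RA", "TA"), cores=8,
             use_matlab=False, dry_run=True):
    """Run (or, with dry_run=True, only print) one command per locus."""
    for locus in loci:
        cmd = snakemake_command(f"{config_dir}/config-{locus}.yaml", cores, use_matlab)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first failing locus.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_loci("test_BMKS3000")
```

With `dry_run=True` the driver only prints the commands, which makes it easy to review them before committing compute time.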
For upstream analysis of BMKMANU S3000 spatial transcriptomics data, see the upstream analysis documentation.