Merged

V1.1 #10

2 changes: 1 addition & 1 deletion .github/workflows/build-docker-chromap.yml
@@ -2,7 +2,7 @@ name: IGVF Chromap CI

 on:
   push:
-    branches: [ "dev", "main" ]
+    branches: [ "dev", "main", "v1" ]
     paths:
       - 'modules/igvf-chromap/**'
   pull_request:
2 changes: 1 addition & 1 deletion .github/workflows/build-docker-kallisto-bustools.yml
@@ -2,7 +2,7 @@ name: IGVF kallisto-bustools CI

 on:
   push:
-    branches: [ "dev", "main" ]
+    branches: [ "dev", "main", "v1", "v1.1" ]
     paths:
       - 'modules/igvf-kallisto-bustools/**'
   pull_request:
3 changes: 2 additions & 1 deletion modules/igvf-kallisto-bustools/docker_builder.dockerfile
@@ -8,13 +8,14 @@ FROM python:3.10.16-slim

 LABEL maintainer="Eugenio Mattei"
 LABEL software="IGVF single-cell pipeline"
-LABEL software.version="1"
+LABEL software.version="1.1"
 LABEL software.organization="IGVF consortium"
 LABEL software.version.is-production="Yes"
 LABEL software.task="run-kallisto-bustools-module"
 LABEL software.description="Run the kallisto-bustools module of the IGVF single-cell pipeline"

 RUN mkdir /software
+COPY docs/README.md /software
 COPY run_kallisto.py /software
 COPY pyproject.toml /software
 RUN cd /software && pip install --upgrade pip && pip install --editable .
57 changes: 57 additions & 0 deletions modules/igvf-kallisto-bustools/docs/README.md
@@ -0,0 +1,57 @@
# Processed Single-Cell Gene Expression Data (kallisto-bustools)

This archive contains processed single-cell gene expression data generated using **kallisto-bustools**.

## File Descriptions

All output files are located in the `counts_unfiltered` directory unless otherwise noted.

### Count Matrices

| File | Description |
|------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `cells_x_genes.mature.mtx` | Matrix of RNA counts from exonic reads spanning splice junctions (spliced/mature transcripts). Corresponds to the `"mature"` layer in the AnnData `.h5ad` file. |
| `cells_x_genes.ambiguous.mtx` | Matrix of RNA counts from ambiguous exonic reads (not spanning junctions). Included in the `"ambiguous"` layer of the AnnData file. |
| `cells_x_genes.cell.mtx` | Sum of mature and ambiguous counts (all exonic reads). |
| `cells_x_genes.nascent.mtx` | Matrix of RNA counts from intronic reads (unspliced/nascent transcripts). Included in the `"nascent"` layer of the AnnData file. |
| `cells_x_genes.nucleus.mtx` | Combination of nascent and ambiguous counts (intronic + ambiguous exonic reads). |
| `cells_x_genes.total.mtx` | Matrix containing all reads: mature, nascent, and ambiguous. This is the default matrix in the `.X` attribute of the AnnData object. |
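
The relationships between the matrices in the table above can be sketched in a few lines of Python. This is a minimal illustration, assuming the `.mtx` files use the Matrix Market coordinate format; the toy parser, the helper names (`parse_mtx`, `add_layers`), and the counts are made up for the example and are not part of the pipeline. In practice a library reader such as `scipy.io.mmread` would likely replace the toy parser.

```python
from collections import defaultdict


def parse_mtx(text):
    """Parse Matrix Market coordinate text into ((n_rows, n_cols), {(row, col): value})."""
    # Drop blank lines and '%' comment/header lines.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, _nnz = (int(x) for x in lines[0].split())
    entries = {}
    for ln in lines[1:]:
        row, col, value = ln.split()
        entries[(int(row), int(col))] = float(value)
    return (n_rows, n_cols), entries


def add_layers(*layers):
    """Element-wise sum of sparse layers stored as {(row, col): value} dicts."""
    total = defaultdict(float)
    for layer in layers:
        for key, value in layer.items():
            total[key] += value
    return dict(total)


# Hypothetical 2 cells x 3 genes layers (rows = cells, columns = genes, 1-based indices).
MATURE = "%%MatrixMarket matrix coordinate real general\n2 3 2\n1 1 4\n2 3 1\n"
NASCENT = "%%MatrixMarket matrix coordinate real general\n2 3 2\n1 1 2\n1 2 3\n"
AMBIGUOUS = "%%MatrixMarket matrix coordinate real general\n2 3 2\n1 1 1\n2 2 5\n"

shape, mature = parse_mtx(MATURE)
_, nascent = parse_mtx(NASCENT)
_, ambiguous = parse_mtx(AMBIGUOUS)

cell = add_layers(mature, ambiguous)            # cells_x_genes.cell.mtx
nucleus = add_layers(nascent, ambiguous)        # cells_x_genes.nucleus.mtx
total = add_layers(mature, nascent, ambiguous)  # cells_x_genes.total.mtx
```

Storing only the three base layers and deriving the rest this way is one possible consistency check on a downloaded archive.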

### Annotation Files

| File | Description |
|------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `cells_x_genes.barcodes.txt` | List of cell barcodes (one per line), including sample-specific suffixes. Used as the observation index in AnnData and as row indices in all matrices. |
| `cells_x_genes.genes.txt` | List of unique ENSEMBL gene IDs (one per line). Used as the variable index in AnnData and as column indices in all matrices. |
| `cells_x_genes.genes.names.txt` | List of human-readable gene names corresponding to the ENSEMBL IDs in `genes.txt`. Names are not guaranteed to be unique. |
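
Because gene names are not guaranteed to be unique, code that labels matrix columns by name may need to disambiguate them. The sketch below shows one way to do that, assuming `genes.txt` and `genes.names.txt` correspond line by line; the helper names and the placeholder IDs are hypothetical.

```python
from collections import Counter


def read_lines(text):
    """Return the non-empty, stripped lines of a text blob."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def unique_labels(gene_ids, gene_names):
    """Return one label per gene; names that occur more than once get the gene ID appended."""
    if len(gene_ids) != len(gene_names):
        raise ValueError("genes.txt and genes.names.txt must have the same number of lines")
    name_counts = Counter(gene_names)
    return [
        name if name_counts[name] == 1 else f"{name}_{gene_id}"
        for gene_id, name in zip(gene_ids, gene_names)
    ]


# Placeholder contents standing in for genes.txt and genes.names.txt.
GENES_TXT = "ENSG_EXAMPLE_1\nENSG_EXAMPLE_2\nENSG_EXAMPLE_3\n"
NAMES_TXT = "GeneA\nGeneB\nGeneA\n"

labels = unique_labels(read_lines(GENES_TXT), read_lines(NAMES_TXT))
```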

### Additional Output Files

| File | Description |
|------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `inspect.json` | Summary statistics: total reads, barcodes, UMIs, mean/median reads and UMIs per barcode, and percentages matching known barcode lists (if provided). |
| `kb_info.json` | Metadata for the `kb count` run: software versions, runtime, and command used. |
| `run_info.json` | Summary of kallisto pseudoalignment: processed reads, mapping percentages, reference targets, k-mer length, and kallisto version. |
| `transcripts.txt` | List of transcripts present in the data, in the same order as the transcriptome FASTA file. |
| `matrix.ec` | Two-column file: (1) equivalence class index (0-based), (2) set of transcript indices (0-based, matching order in `transcripts.txt`) in each equivalence class. |
| `output.bus` | Initial BUS file after pseudoalignment: uncorrected, unsorted barcode-UMI-equivalence class records. |
| `output.unfiltered.bus` | Processed and sorted BUS file for counting: corrected barcodes, no UMI filtering. |
| `output_modified.unfiltered.bus` | BUS file with barcodes modified using a replacement list (e.g., for merging oligo-dT and random hexamer barcodes in Parse Biosciences data). Present only if barcode collapsing was performed. |

### Barcode Collapsing (Parse Biosciences)

Some platforms (e.g., Parse Biosciences) assign two barcodes per cell (oligo-dT and random hexamer). These are typically collapsed into a single barcode during quantification using a replacement list, retaining the oligo-dT barcode.

- Collapsed output is saved in `counts_unfiltered_modified/` (same file structure as above) and is used to generate the primary AnnData `.h5ad` file.
- The original, uncollapsed output is preserved in `counts_unfiltered/adata.h5ad` for reproducibility and transparency.
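
The idea behind the replacement list can be illustrated with a small sketch. It assumes the list maps each random hexamer barcode to its oligo-dT partner, and it operates on already-counted values for simplicity. The pipeline itself applies the replacement to BUS records before counting, so results can differ (for example in how UMIs are deduplicated). All barcode and gene names here are invented.

```python
from collections import Counter


def collapse_barcodes(counts, replacement):
    """Sum counts of barcodes that map to the same target barcode.

    counts: {(barcode, gene): count}
    replacement: {random_hexamer_barcode: oligo_dt_barcode}
    Barcodes absent from the replacement map are kept unchanged.
    """
    collapsed = Counter()
    for (barcode, gene), count in counts.items():
        collapsed[(replacement.get(barcode, barcode), gene)] += count
    return dict(collapsed)


# Hypothetical replacement list: one random hexamer barcode and its oligo-dT partner.
replacement = {"RANDHEX_BC1": "OLIGODT_BC1"}

counts = {
    ("OLIGODT_BC1", "geneA"): 3,
    ("RANDHEX_BC1", "geneA"): 2,
    ("RANDHEX_BC1", "geneB"): 1,
    ("OLIGODT_BC2", "geneA"): 4,
}

collapsed = collapse_barcodes(counts, replacement)
```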

---

## References

- Kallisto files:
  [PMC10690192](https://pmc.ncbi.nlm.nih.gov/articles/PMC10690192/),
  [kb_species_mixing tutorial](https://pachterlab.github.io/kallistobustools/tutorials/kb_species_mixing/R/kb_mixed_species_10x_v2/)
- Parse Bio barcoding:
  [Parse Biosciences blog](https://www.parsebiosciences.com/blog/getting-started-with-scrna-seq-post-sequencing-data-analysis/),
  [Genome Biology article](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02505-w)
2 changes: 1 addition & 1 deletion modules/igvf-kallisto-bustools/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "igvf-kallisto-bustools"
-version = "1.0.0"
+version = "1.1.0"
 description = "Align scRNA using kallisto-bustools"
 requires-python = "==3.10.16"
 dependencies = [
2 changes: 1 addition & 1 deletion modules/igvf-kallisto-bustools/run_kallisto.py
@@ -195,7 +195,7 @@ def quantify_nac(temp_dir, index_dir, read_format, output_dir, strand, subpool,
         logging.error(f"Command failed with error: {e.stderr}")

     # Archive the directory
-    archive_cmd = f"tar -kzcvf {output_dir}.tar.gz {output_dir}"
+    archive_cmd = f"tar -zcvf {output_dir}.tar.gz {output_dir} -C /software README.md "
     logging.info(f"Running archive command: {archive_cmd}")
     try:
         result = subprocess.run(archive_cmd, shell=True, capture_output=True, text=True, check=True)
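
The change above relies on `tar`'s `-C` option, which switches directory before adding the members that follow, so `README.md` is taken from `/software` and stored at the archive root. As a rough alternative sketch, not the PR's implementation, the same layout could be produced with Python's standard `tarfile` module, which avoids building a shell string. The function name `archive_with_readme` and the temporary paths are hypothetical. Unlike the shell command, this version stores the output directory under its base name only.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_with_readme(output_dir, readme_path, archive_path):
    """Create a gzip tarball holding output_dir plus the README at the archive root."""
    output_dir = Path(output_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the directory under its base name rather than its full path.
        tar.add(output_dir, arcname=output_dir.name)
        # Place the README next to it, at the top level of the archive.
        tar.add(readme_path, arcname="README.md")
    return archive_path


# Demonstration with throwaway files standing in for real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    out = tmp / "kb_output"
    (out / "counts_unfiltered").mkdir(parents=True)
    (out / "counts_unfiltered" / "cells_x_genes.genes.txt").write_text("gene\n")
    readme = tmp / "README.md"
    readme.write_text("# placeholder\n")

    archive = archive_with_readme(out, readme, tmp / "kb_output.tar.gz")
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
```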