Summary
Log.final.out reports two categories of unmapped reads — Number of reads unmapped: too short and Number of reads unmapped: other — using STAR's exact field labels. rustar folds both buckets into too short and always reports other = 0. Total unmapped count is preserved, but a downstream tool that parses the two fields separately (MultiQC's STAR module, custom QC dashboards) gets an inflated too short percentage and a zero other percentage.
On WT_REP2 (nf-core/rnaseq test profile):
| Field |
STAR 2.7.11b |
rustar 0.1.0 |
Number of reads unmapped: too many mismatches |
0 |
0 |
Number of reads unmapped: too short |
1 540 (3.11 %) |
4 193 (8.46 %) |
Number of reads unmapped: other |
2 656 (5.36 %) |
0 (0.00 %) |
| Total unmapped |
4 196 |
4 193 |
The totals match within RNG / tie-breaking tolerance — rustar isn't dropping reads. It's the bucketing that's wrong: every read STAR would label other (no seeds / no clusters / failed-stitch reason) is being mislabelled as too short (post-stitch length / score / match filter).
STAR reference behaviour
STAR's bucketing (per source/ReadAlign_assignAlignToWindow.cpp and the nUnmapped arrays in source/Stats.cpp):
too short: a transcript was generated but filtered out by score / match-count / read-length-coverage filters (outFilterScoreMin*, outFilterMatchNmin*).
other: no seeds, no clusters, no transcript — read couldn't be aligned at all.
too many mismatches: filtered by outFilterMismatchN*.
rustar's src/stats.rs::UnmappedReason enum has all four variants:
pub enum UnmappedReason {
Other,
TooShort,
TooManyMismatches,
TooManyLoci,
}
So the categorization exists in the code. The bug is at the classification call sites — wherever an unmapped read is recorded, the code is calling record_unmapped_reason(UnmappedReason::TooShort) instead of UnmappedReason::Other for reads that have no alignments at all.
Reproducer
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-unmapped && cd /tmp/rustar-mre-unmapped
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
echo "=== rustar unmapped buckets ==="
grep -E 'unmapped' RUS./Log.final.out
echo
echo "=== STAR unmapped buckets ==="
grep -E 'unmapped' STAR.Log.final.out
Suggested investigation
Find every site that records an unmapped read in rustar — likely a record_unmapped_reason(UnmappedReason::...) call. Trace which one fires for reads that fail at different pipeline stages:
- No seeds found → should be
Other
- Seeds found but no clusters / no extension → should be
Other
- Transcripts generated but filtered by
outFilterScoreMin* / outFilterMatchNmin* → should be TooShort
- Filtered by
outFilterMismatchN* → should be TooManyMismatches
The fact that rustar's Other is always 0 strongly suggests record_unmapped_reason is never being called with UnmappedReason::Other — either the no-seeds / no-clusters path is silently falling through to the TooShort bucket, or the Other arm of the enum is unreachable in the current code.
Severity
Low-to-medium. Doesn't affect mapping quality or alignment correctness — totals match STAR. Affects downstream QC reading (MultiQC's STAR module pulls both fields; the inflated too short bar and zero other bar would look anomalous to a user comparing samples processed by different aligners). Anything that summarises "fraction of reads unmappable for non-trivial reasons" (STAR's other bucket captures harder-to-rescue reads) loses that signal entirely.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
Log.final.outreports two categories of unmapped reads —Number of reads unmapped: too shortandNumber of reads unmapped: other— using STAR's exact field labels. rustar folds both buckets intotoo shortand always reportsother = 0. Total unmapped count is preserved, but a downstream tool that parses the two fields separately (MultiQC's STAR module, custom QC dashboards) gets an inflatedtoo shortpercentage and a zerootherpercentage.On
WT_REP2(nf-core/rnaseq test profile):Number of reads unmapped: too many mismatchesNumber of reads unmapped: too shortNumber of reads unmapped: otherThe totals match within RNG / tie-breaking tolerance — rustar isn't dropping reads. It's the bucketing that's wrong: every read STAR would label
other(no seeds / no clusters / failed-stitch reason) is being mislabelled astoo short(post-stitch length / score / match filter).STAR reference behaviour
STAR's bucketing (per
source/ReadAlign_assignAlignToWindow.cppand thenUnmappedarrays insource/Stats.cpp):too short: a transcript was generated but filtered out by score / match-count / read-length-coverage filters (outFilterScoreMin*,outFilterMatchNmin*).other: no seeds, no clusters, no transcript — read couldn't be aligned at all.too many mismatches: filtered byoutFilterMismatchN*.rustar's
src/stats.rs::UnmappedReasonenum has all four variants:So the categorization exists in the code. The bug is at the classification call sites — wherever an unmapped read is recorded, the code is calling
record_unmapped_reason(UnmappedReason::TooShort)instead ofUnmappedReason::Otherfor reads that have no alignments at all.Reproducer
Suggested investigation
Find every site that records an unmapped read in rustar — likely a
record_unmapped_reason(UnmappedReason::...)call. Trace which one fires for reads that fail at different pipeline stages:OtherOtheroutFilterScoreMin*/outFilterMatchNmin*→ should beTooShortoutFilterMismatchN*→ should beTooManyMismatchesThe fact that rustar's
Otheris always 0 strongly suggestsrecord_unmapped_reasonis never being called withUnmappedReason::Other— either the no-seeds / no-clusters path is silently falling through to theTooShortbucket, or theOtherarm of the enum is unreachable in the current code.Severity
Low-to-medium. Doesn't affect mapping quality or alignment correctness — totals match STAR. Affects downstream QC reading (MultiQC's STAR module pulls both fields; the inflated
too shortbar and zerootherbar would look anomalous to a user comparing samples processed by different aligners). Anything that summarises "fraction of reads unmappable for non-trivial reasons" (STAR'sotherbucket captures harder-to-rescue reads) loses that signal entirely.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.