Skip to content

--outSAMstrandField intronMotif does not produce any XS:A: tags #30

@pinin4fjords

Description

@pinin4fjords

Summary

--outSAMstrandField intronMotif is supposed to trigger emission of the XS:A:+ / XS:A:- tag on spliced reads, indicating the inferred strand from the intron motif. rustar-aligner does not emit XS:A: on any record, even with --outSAMstrandField intronMotif set and spliced reads in the output.

Breaks every downstream tool that needs strand information for stranded transcript assembly or library-protocol detection:

  • StringTie / Cufflinks need XS to assemble strand-aware transcripts.
  • rseqc's infer_experiment.py reads XS to detect library stranding.
  • HTSeq-count / featureCounts with -s rely on XS for ambiguity resolution.

STAR reference behaviour

STAR has two equivalent paths into XS emission:

  1. User passes XS in --outSAMattributes -> STAR auto-enables --outSAMstrandField intronMotif, Parameters_samAttributes.cpp:172-179:

    } else if (vAttr1.at(ii)== "XS") {
        outSAMattrOrder.push_back(ATTR_XS);
        outSAMattrPresent.XS=true;
        if (outSAMstrandField.type!=1) {
            inOut->logMain << "WARNING --outSAMattributes contains XS, therefore STAR will use --outSAMstrandField intronMotif" <<endl;
            outSAMstrandField.type=1;
        };
    };
  2. User passes --outSAMstrandField intronMotif -> STAR auto-adds ATTR_XS to the output order, Parameters_samAttributes.cpp:213-216:

    if (outSAMstrandField.type==1 && !outSAMattrPresent.XS) {
        outSAMattrOrder.push_back(ATTR_XS);
        inOut->logMain << "WARNING --outSAMstrandField=intronMotif, therefore STAR will output XS attribute" <<endl;
    };

Actual emission at ReadAlign_outputTranscriptSAM.cpp:298-302, driven by the per-read intron-motif inference earlier in the same function:

case ATTR_XS:
    if (...) *outStream<<"\tXS:A:+";
    else if (...) *outStream<<"\tXS:A:-";

+ if the canonical GT/AG donor/acceptor reads as forward-strand, - if reverse, no tag if the read is unspliced or the splice motif is non-canonical.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-30 && cd /tmp/rustar-mre-30

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMstrandField intronMotif
        --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

RBAM=RUS./Aligned.out.bam
SBAM=STAR.Aligned.out.bam

for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=$RBAM; else BAM=$SBAM; fi
    xs=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'XS:A:')
    splices=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | awk '$6 ~ /N/' | wc -l)
    echo "$label: XS:A: tags = $xs, spliced records (CIGAR N) = $splices"
done

Observed (verified, fresh paired run)

rustar: XS:A: tags = 0, spliced records (CIGAR N) = 358
STAR  : XS:A: tags = 1738, spliced records (CIGAR N) = ~715

Despite 358 spliced records and --outSAMstrandField intronMotif set, zero XS:A: tags are emitted by rustar. STAR on the same data emits 1 738 tags. The same is true when XS is explicitly added to --outSAMattributes: still zero tags emitted, where STAR auto-promotes the flag.

Suggested fix

In rustar's attribute order builder, add the two coupling paths from Parameters_samAttributes.cpp:

  1. If XS is in --outSAMattributes, force-enable outSAMstrandField = intronMotif and add ATTR_XS to the writer order.
  2. If outSAMstrandField == intronMotif and XS isn't already in the order list, add ATTR_XS to the writer order.

Then implement the per-record XS write in the genome-BAM record builder: walk the spliced CIGAR, look up donor/acceptor dinucleotides in the genome, and stamp + or - based on the motif (canonical GT/AG -> +, canonical CT/AC -> -, non-canonical -> omit).

rustar already computes junction_motifs (see src/align/transcript.rs::Transcript::junction_motifs and src/align/score::SpliceMotif), so the underlying motif-detection infrastructure is in place. This should be a write-time change, not an algorithmic one.

Test plan

  1. Run rustar with --outSAMstrandField intronMotif on PE data with annotated splices.
  2. Assert samtools view ... | grep -c 'XS:A:' > 0.
  3. Assert XS strand matches the genome-relative strand of the GT/AG donor on a spot-check read.
  4. (Bonus) cross-check against STAR's XS calls on the same reads — expect agreement on at least 95 % of spliced records.

Why this matters

infer_experiment.py runs on every nf-core/rnaseq sample to detect library stranding. With zero XS tags it can't infer anything, and the MultiQC report's RSeQC section reads "stranding could not be inferred" for every rustar-aligned sample. Stringtie's stranded assembly mode also fails silently (assembles all reads as if unstranded). Downstream impact is wider than NM/nM (#29) because XS is consumed during analysis, not just QC.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions