Skip to content

Init demuxalot #11

@mschecht

Description

@mschecht

Add demuxalot demultiplexing to workflow

This issue tracks the addition of demuxalot to the workflow. This will be the first extension of the snakemake pipeline beyond Cell Ranger into downstream single-cell preprocessing steps - expect some bumps along the way!

Background

Preprint: Rogozhnikov et al., 2021

Implementation Notes

  1. Two-stage approach see issue #21]:

    • Start with boilerplate demultiplexing (use existing SNPs)
    • Later: add SNP identification workflow
  2. Test strategy:

    • Implement boilerplate workflow first
    • Validate on test dataset before expanding

Helper Script: Parse demuxalot Output

def process_demuxalot_results(base_dir, confidence_threshold=0.9):
    """
    Parse demuxalot posterior probabilities into assignment dataframe.
    
    Args:
        base_dir: Directory containing demuxalot output subdirectories
        confidence_threshold: Minimum probability for confident assignment (default: 0.9)
    
    Returns:
        DataFrame with columns: BARCODE, assignment, prob_max, assignment_status_demuxalot, sample
    """
    all_results = []
    
    for subdir in sorted(os.listdir(base_dir)):
        subdir_path = os.path.join(base_dir, subdir)
        if not os.path.isdir(subdir_path):
            continue
        
        prob_path = os.path.join(subdir_path, "posterior_probabilities.tsv.gz")
        if not os.path.exists(prob_path):
            continue
            
        # Load posterior probabilities
        prob_df = pd.read_csv(prob_path, sep="\t").set_index("BARCODE")
        
        # Assign max genotype and compute confidence
        assignments = prob_df.idxmax(axis=1)
        max_probs = prob_df.max(axis=1)
        
        results_df = pd.DataFrame({
            "BARCODE": prob_df.index,
            "assignment": assignments,
            "prob_max": max_probs,
        })
        
        # Filter by confidence threshold
        results_df.loc[results_df["prob_max"] < confidence_threshold, "assignment"] = "unassigned"
        
        # Classify assignment type
        def classify(x):
            if x == "unassigned":
                return "unassigned"
            elif "+" in x.lower() or x == "doublet":
                return "doublet"
            else:
                return x
        
        results_df["assignment_status_demuxalot"] = results_df["assignment"].apply(classify)
        
        # Add sample identifier
        results_df["sample"] = subdir.replace("_doublet_prior", "")
        all_results.append(results_df)
    
    # Concatenate all results
    final_df = pd.concat(all_results, ignore_index=True)
    return final_df

# Example usage
demuxalot_results_df = process_demuxalot_results(
    os.path.join(WD, "04_DEMUX_genotypes_lists_demuxalot")
)

TODO

  • Add demuxalot to environment.yml
  • Expand test dataset workflow to this step!
    • Need a BAM file from cellranger to compute genotype VCF files
  • Include demuxalot commands in config generator
  • Create boilerplate demultiplexing rule
  • Integrate parsing script into workflow
  • Add SNP identification workflow (future)
  • Documentation

Metadata

Metadata

Labels

enhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions