Init demuxalot

## Add demuxalot demultiplexing to workflow

This issue tracks the addition of [demuxalot](https://github.com/arogozhnikov/demuxalot) to the workflow. This will be the first extension of the snakemake pipeline beyond `Cell Ranger` into downstream single-cell preprocessing steps - expect some bumps along the way!

### Background

**Preprint:** [Rogozhnikov et al., 2021](https://www.biorxiv.org/content/10.1101/2021.05.22.443646v2)

### Implementation Notes

1. **Two-stage approach** see [issue #21](https://github.com/arogozhnikov/demuxalot/issues/21#issuecomment-1893000273)]:
   - Start with boilerplate demultiplexing (use existing SNPs)
   - Later: add SNP identification workflow

2. **Test strategy:**
   - Implement boilerplate workflow first
   - Validate on test dataset before expanding

### Helper Script: Parse demuxalot Output

```python
def process_demuxalot_results(base_dir, confidence_threshold=0.9):
    """
    Parse demuxalot posterior probabilities into assignment dataframe.
    
    Args:
        base_dir: Directory containing demuxalot output subdirectories
        confidence_threshold: Minimum probability for confident assignment (default: 0.9)
    
    Returns:
        DataFrame with columns: BARCODE, assignment, prob_max, assignment_status_demuxalot, sample
    """
    all_results = []
    
    for subdir in sorted(os.listdir(base_dir)):
        subdir_path = os.path.join(base_dir, subdir)
        if not os.path.isdir(subdir_path):
            continue
        
        prob_path = os.path.join(subdir_path, "posterior_probabilities.tsv.gz")
        if not os.path.exists(prob_path):
            continue
            
        # Load posterior probabilities
        prob_df = pd.read_csv(prob_path, sep="\t").set_index("BARCODE")
        
        # Assign max genotype and compute confidence
        assignments = prob_df.idxmax(axis=1)
        max_probs = prob_df.max(axis=1)
        
        results_df = pd.DataFrame({
            "BARCODE": prob_df.index,
            "assignment": assignments,
            "prob_max": max_probs,
        })
        
        # Filter by confidence threshold
        results_df.loc[results_df["prob_max"] < confidence_threshold, "assignment"] = "unassigned"
        
        # Classify assignment type
        def classify(x):
            if x == "unassigned":
                return "unassigned"
            elif "+" in x.lower() or x == "doublet":
                return "doublet"
            else:
                return x
        
        results_df["assignment_status_demuxalot"] = results_df["assignment"].apply(classify)
        
        # Add sample identifier
        results_df["sample"] = subdir.replace("_doublet_prior", "")
        all_results.append(results_df)
    
    # Concatenate all results
    final_df = pd.concat(all_results, ignore_index=True)
    return final_df

# Example usage
demuxalot_results_df = process_demuxalot_results(
    os.path.join(WD, "04_DEMUX_genotypes_lists_demuxalot")
)
```

### TODO

- [x] Add demuxalot to environment.yml
- [x] Expand test dataset workflow to this step!
    - [x] Need a BAM file from cellranger to compute genotype VCF files 
- [x] Include demuxalot commands in config generator
- [x] Create boilerplate demultiplexing rule
- [ ] Integrate parsing script into workflow
- [ ] Add SNP identification workflow (future)
- [ ] Documentation




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Init demuxalot #11

Add demuxalot demultiplexing to workflow

Background

Implementation Notes

Helper Script: Parse demuxalot Output

TODO

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Init demuxalot #11

Description

Add demuxalot demultiplexing to workflow

Background

Implementation Notes

Helper Script: Parse demuxalot Output

TODO

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions