Add demuxalot demultiplexing to workflow
This issue tracks the addition of demuxalot to the workflow. This will be the first extension of the snakemake pipeline beyond Cell Ranger into downstream single-cell preprocessing steps - expect some bumps along the way!
Background
Preprint: Rogozhnikov et al., 2021
Implementation Notes
-
Two-stage approach see issue #21]:
- Start with boilerplate demultiplexing (use existing SNPs)
- Later: add SNP identification workflow
-
Test strategy:
- Implement boilerplate workflow first
- Validate on test dataset before expanding
Helper Script: Parse demuxalot Output
def process_demuxalot_results(base_dir, confidence_threshold=0.9):
"""
Parse demuxalot posterior probabilities into assignment dataframe.
Args:
base_dir: Directory containing demuxalot output subdirectories
confidence_threshold: Minimum probability for confident assignment (default: 0.9)
Returns:
DataFrame with columns: BARCODE, assignment, prob_max, assignment_status_demuxalot, sample
"""
all_results = []
for subdir in sorted(os.listdir(base_dir)):
subdir_path = os.path.join(base_dir, subdir)
if not os.path.isdir(subdir_path):
continue
prob_path = os.path.join(subdir_path, "posterior_probabilities.tsv.gz")
if not os.path.exists(prob_path):
continue
# Load posterior probabilities
prob_df = pd.read_csv(prob_path, sep="\t").set_index("BARCODE")
# Assign max genotype and compute confidence
assignments = prob_df.idxmax(axis=1)
max_probs = prob_df.max(axis=1)
results_df = pd.DataFrame({
"BARCODE": prob_df.index,
"assignment": assignments,
"prob_max": max_probs,
})
# Filter by confidence threshold
results_df.loc[results_df["prob_max"] < confidence_threshold, "assignment"] = "unassigned"
# Classify assignment type
def classify(x):
if x == "unassigned":
return "unassigned"
elif "+" in x.lower() or x == "doublet":
return "doublet"
else:
return x
results_df["assignment_status_demuxalot"] = results_df["assignment"].apply(classify)
# Add sample identifier
results_df["sample"] = subdir.replace("_doublet_prior", "")
all_results.append(results_df)
# Concatenate all results
final_df = pd.concat(all_results, ignore_index=True)
return final_df
# Example usage
demuxalot_results_df = process_demuxalot_results(
os.path.join(WD, "04_DEMUX_genotypes_lists_demuxalot")
)
TODO
Add demuxalot demultiplexing to workflow
This issue tracks the addition of demuxalot to the workflow. This will be the first extension of the snakemake pipeline beyond
Cell Rangerinto downstream single-cell preprocessing steps - expect some bumps along the way!Background
Preprint: Rogozhnikov et al., 2021
Implementation Notes
Two-stage approach see issue #21]:
Test strategy:
Helper Script: Parse demuxalot Output
TODO