Single command haplotype sampling #4849

Open
adamnovak wants to merge 20 commits into master from single-command-hap-sampling
@adamnovak
Member

Changelog Entry

To be copied to the draft changelog by merger:

  • Added vg giraffe --haplotype-sampling, which automatically counts kmers, haplotype-indexes the graph, and haplotype-samples the graph. This requires kmc to be installed. Providing either --kff-name or --haplotype-name now also triggers generation of the other. To do one-reference sampling, continue to use --set-reference. To do non-diploid sampling with a given number of haplotypes, use --no-diploid-sampling and --num-haplotypes.

Description

This allows Giraffe to be used as a single-command haplotype sampling workflow, without needing separate commands to count kmers, haplotype-index the graph, haplotype-sample the graph, and map.

I introduced a notion of "scopes" for indexes in the registry, so that I can attach the sample name as a scope to the FASTQ or KFF and have it propagate to, and qualify the names of, all indexes that depend on them. This works OK. But to keep the "right" filenames we had been using for the sampled GBZ (<prefix>.<sample>.gbz) when it doubles as a "Giraffe GBZ", I had to let an index have multiple possible extensions, plus a bunch of logic to work out which extension we should actually use so that we don't conflict with older things in the plan.

I think I might be able to replace all that with a notion of "weak" aliases, so the "Giraffe GBZ" can come from the "Haplotype-Sampled GBZ" but use it at the "Haplotype-Sampled GBZ"'s name instead of its own.
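
As an illustration of the scope idea described above, here is a minimal Python sketch. It is not the vg implementation (vg is written in C++); the dependency graph, the index names other than "Haplotype-Sampled GBZ", the sample name HG002, and the helpers scopes_for and qualified_name are all hypothetical. It only shows how a scope attached to an input could propagate to everything built from it and end up in the filename.

```python
def scopes_for(index, depends_on, input_scopes):
    """Collect the scopes of an index: its own plus those of everything it depends on."""
    scopes = set(input_scopes.get(index, ()))
    for dep in depends_on.get(index, ()):
        scopes |= scopes_for(dep, depends_on, input_scopes)
    return scopes


def qualified_name(prefix, index, extension, depends_on, input_scopes):
    """Build <prefix>.<scope...>.<extension>, omitting the scope part when there is none."""
    scopes = sorted(scopes_for(index, depends_on, input_scopes))
    return ".".join([prefix] + scopes + [extension])


# Hypothetical dependency graph: which index is built from which others.
depends_on = {
    "KFF": ["FASTQ"],
    "Haplotype Index": ["GBZ"],
    "Haplotype-Sampled GBZ": ["GBZ", "Haplotype Index", "KFF"],
}
# Only the reads carry a scope (the sample name, assumed here to be HG002).
input_scopes = {"FASTQ": {"HG002"}}

print(qualified_name("graph", "Haplotype-Sampled GBZ", "gbz", depends_on, input_scopes))
print(qualified_name("graph", "Haplotype Index", "hapl", depends_on, input_scopes))
```

Under these assumptions the sampled GBZ, which depends on the reads through the KFF, picks up the sample in its name, while the haplotype index, which depends only on the graph, does not.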

@adamnovak
Member Author

I don't think the "weak" alias approach would work unless the haplotype-sampled GBZ knew it was always sample-scoped and always had {sample}.gbz as its suffix, with some notion of a wildcard. Otherwise we would think it was trying to fight the plain GBZ for the gbz suffix.

@adamnovak
Member Author

I haven't done weak aliases, but I've made the scopes keyed (so we know we're dealing with a scope for "sample"), and I've introduced Snakemake-style {} wildcards in the extensions. We now pick the first extension where we have all the wildcards available as scopes, so we use {sample}.gbz for the Giraffe GBZ when we can and giraffe.gbz otherwise.
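
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in this PR: pick_extension is a hypothetical helper, scopes are modeled as a plain dict keyed by scope name, and HG002 is a made-up sample name.

```python
import re


def pick_extension(extensions, scopes):
    """Return the first extension whose {wildcards} are all available as scope keys,
    with the wildcards filled in, or None if no extension qualifies."""
    for ext in extensions:
        wildcards = re.findall(r"\{(\w+)\}", ext)
        if all(name in scopes for name in wildcards):
            return ext.format(**scopes)
    return None


# Candidate extensions for the Giraffe GBZ, in priority order (as in the comment above).
giraffe_gbz_extensions = ["{sample}.gbz", "giraffe.gbz"]

print(pick_extension(giraffe_gbz_extensions, {"sample": "HG002"}))  # HG002.gbz
print(pick_extension(giraffe_gbz_extensions, {}))                   # giraffe.gbz
```

Putting the wildcard extension first means the sample-qualified name wins whenever a "sample" scope exists, and the plain giraffe.gbz name is used only as a fallback.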