Conversation
modules/local/annotate_uniprot.nf
Outdated
| def spp = "${meta.id}" | ||
| def is_uniprot = "${meta.uniprot}" | ||
| def project_dir = "${projectDir}" | ||
| def spp_prot_delim = "${params.sppid_protid_delim}"" |
There was a problem hiding this comment.
would it be confusing later on to have these arg names be slightly different?
There was a problem hiding this comment.
good call - probably better to be consistent throughout!
| "sppid_protid_delim": { | ||
| "type": "string", | ||
| "default": "_", | ||
| "description": "Option specifying the character that delimits the unique species and protein IDs. This delimiter MUST NOT occur within either the species ID or any protein ID." |
There was a problem hiding this comment.
is it worth throwing an error if the delimiter is already in the string (e.g., if we find it twice in the string)?
There was a problem hiding this comment.
Hmmm, it could be - alternatively, would it be worth just internally rename things in some consistent way if we catch these exceptions?
…ap/sppid_protid_delim
…, fail and print error if not
|
Okay, so I've made a number of changes, and this now works as anticipated.
I've not yet added in a check at the onset of the workflow to make sure that the sequence headers are named properly, though I have included a check to make sure it's actually in the sequence IDs, and stop the workflow if it's not, printing a useful error message to output in this case. I think we can make these checks a fair bit more extensive, but doing something like this could be part of a larger effort to build in checks throughout the workflow. |
…ap/sppid_protid_delim
Signed-off-by: Austin Patton <austin.patton@arcadiascience.com>
Okay, so as we briefly discussed, this is a (relatively) simple change to use an updated naming convention for protein IDs, made to be consistent with the snakemake preprocessing workflow.
Old convention was:
Genus_species:proteinIDThe colon got replaced by an underscore by orthofinder, which made splitting the species and protein ID more challenging.
Now, the convention is:
Genus-species_proteinIDThe changes I implemented here basically just parameterize the delimiter, making
_the default, but splitting the two identifiers using the parameter value within the annotation module.I haven't actually tested it yet (hence the draft PR), but will make an updated version of the test dataset that follows this convention so that I can do so.