-
Notifications
You must be signed in to change notification settings - Fork 176
Description
Hi,
thanks so much for seqkit - it's one of the tools I use almost on a daily basis and I love it!
There is one feature though, that I feel is lacking a bit. I regularly use seqkit subseq to extract features based on annotations files. Ideally, I would like to have more control over the ID that is given to the extracted fasta sequences and possibly the comment as well. Usually, my annotations come in gff3 (which is not supported), so I convert to bed, and then I use a perl-one-liner to turn the fasta comment into the fasta ID. It works, but since I use it so often, and I guess others might too, it feels like it would be very useful if seqkit natively provided that functionality.
This issue has been raised before: #154, #89
# foo.fna
>foo
GTTTTGTTGACCAGACGATA
# foo.gff
foo . . 3 10 . + . ID=foo_1;Name=gene-foo;product="bla bla bla";
# This is how most my calls look - it works but it feels a bit clunky :)
seqkit subseq --bed <(gff2bed foo.gff) foo.fna | perl -pe 's/>\S+\s/>/'
[INFO] read BED file ...
Processing foo.gff
[INFO] 1 BED features loaded
[INFO] create FASTA index for foo.fna
>foo_1
TTTGTTGA
So what I would which for in terms of enhancements for seqkit would be:
-
gff3 support: gff3 seems so similar to gtf that it feels like adding support for it would not be a big issue, the parsing of tags would have to be modified slightly. But I think it would be very useful to many users.
-
ID tag: A flag like
--id-tagcould be added, similar to--gtf-tag. If set for gtf (/gff) it would tell which tag to use as ID instead of the default <seq_id:from-to> pattern. For bed, it could just accept a number indicating the column that should be used for IDs... -
In a perfect world there would also be an option to add additional tags/columns into the comment section of the fasta header, i.e. something multiple
--gtf-tags gene_id,transcript_id,product.
Obviously, those aren't crucial issues, but I think they would help to further improve seqkit API.
Cheers
Thomas