Fix BLAST results protein to taxonomic accession assignment

Currently, the documentation instructs users to download `prot.accession2taxid.gz`, and the code downloads the same file, but it does not contain all of the `nr` accessions.
Proteins that are not found in `prot.accession2taxid.gz` are assigned to root, which results in contigs becoming unclassified.
For now, this is ameliorated by using `prot.accession2taxid.FULL.gz` instead of `prot.accession2taxid.gz`, as shown below. However, the code still needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned to `None` and then dropped before being handed over to LCA.

![image](https://user-images.githubusercontent.com/18691127/228906285-67776eb5-94a8-4eed-adf2-f252c86830c7.png)

Assignment to root that needs to be changed:
https://github.com/KwanLab/Autometa/blob/baf61c04dddf5b33bb825dba2841de1e38dffefe/autometa/taxonomy/ncbi.py#L453-L457



	# If we still have missing taxids, we will set the sseqid value to the root taxid
	# fill missing taxids with root_taxid
	sseqid_to_taxid_df["cleaned_taxid"] = sseqid_to_taxid_df.merged_taxid.fillna(
	    root_taxid
	)
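
A minimal sketch of the proposed behavior (assign `None` to accessions that cannot be resolved, then drop those hits before LCA) might look like the following. It uses plain Python containers rather than the pandas DataFrame in `ncbi.py`, purely for illustration. The function names `assign_taxids` and `drop_unassigned`, the accession strings, and the toy hit tuples are all hypothetical and are not part of Autometa.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, str]  # (qseqid, sseqid)
AssignedHit = Tuple[str, str, Optional[int]]  # (qseqid, sseqid, taxid or None)


def assign_taxids(hits: List[Hit], accession_to_taxid: Dict[str, int]) -> List[AssignedHit]:
    """Attach a taxid to each hit, using None when the accession is not in the mapping.

    This replaces the "fill with root taxid" step: an unknown accession stays
    unknown instead of being treated as a hit to root.
    """
    return [
        (qseqid, sseqid, accession_to_taxid.get(sseqid))
        for qseqid, sseqid in hits
    ]


def drop_unassigned(assigned: List[AssignedHit]) -> List[AssignedHit]:
    """Remove hits whose taxid is None so they never reach the LCA step."""
    return [hit for hit in assigned if hit[2] is not None]


# Toy data (hypothetical accessions): one accession is absent from the mapping.
hits = [
    ("contig_1_1", "ACC_A"),
    ("contig_1_2", "ACC_MISSING"),
    ("contig_2_1", "ACC_B"),
]
accession_to_taxid = {"ACC_A": 562, "ACC_B": 1280}

assigned = assign_taxids(hits, accession_to_taxid)
kept = drop_unassigned(assigned)
print(kept)
```

With this approach, a hit to an unresolved accession is ignored rather than counted as a hit to root, so it should no longer pull the LCA of its contig up to root. In the pandas code, the equivalent would presumably be to skip the `fillna(root_taxid)` call and drop the rows with missing taxids instead.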

Fix BLAST results protein to taxonomic accession assignment #317
