Currently the documentation points to, and the code downloads, `prot.accession2taxid.gz`, which does not contain all of the `nr` accessions.
Proteins that are not found in `prot.accession2taxid.gz` are assigned to root, which results in contigs becoming unclassified.
For now this can be worked around by using `prot.accession2taxid.FULL.gz` instead of `prot.accession2taxid.gz`, as shown below. However, the code still needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned `None` and then dropped before being handed over to LCA.

Assignment to root that needs to be changed:

    # If we still have missing taxids, we will set the sseqid value to the root taxid
    # fill missing taxids with root_taxid
    sseqid_to_taxid_df["cleaned_taxid"] = sseqid_to_taxid_df.merged_taxid.fillna(
        root_taxid
    )
Autometa/autometa/taxonomy/ncbi.py, lines 453 to 457 in baf61c0
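
One possible shape for the proposed change, as a minimal sketch rather than the actual fix: instead of filling missing taxids with `root_taxid`, keep them missing and drop those rows before the LCA step. The column names `sseqid` and `merged_taxid` are taken from the snippet above; the function name `drop_missing_taxids` and the toy accessions and taxids are hypothetical and only there to make the example runnable.

```python
import pandas as pd

# Hypothetical toy data: the middle hit has no taxid in the accession2taxid file.
sseqid_to_taxid_df = pd.DataFrame(
    {
        "sseqid": ["WP_000001.1", "WP_000002.1", "WP_000003.1"],
        "merged_taxid": [562.0, None, 1280.0],
    }
)


def drop_missing_taxids(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that have a taxid (hypothetical helper).

    Rows whose merged_taxid is missing are dropped instead of being
    assigned to root, so they cannot pull the LCA up to root.
    """
    missing = df["merged_taxid"].isna()
    cleaned = df.loc[~missing].copy()
    # Taxids are floats here only because NaN forced a float column.
    cleaned["cleaned_taxid"] = cleaned["merged_taxid"].astype(int)
    return cleaned


cleaned_df = drop_missing_taxids(sseqid_to_taxid_df)
n_dropped = len(sseqid_to_taxid_df) - len(cleaned_df)
print(f"Dropped {n_dropped} of {len(sseqid_to_taxid_df)} hits with no taxid")
```

Reporting how many hits were dropped may also help spot cases where the wrong accession2taxid file was downloaded, since a large count would suggest many `nr` accessions are missing from the mapping.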