fixes to ingest all HGNC genes for uniprot mappings#290
Conversation
bsantan
commented
Feb 25, 2025
- HGNC gene IDs were only being mapped for those that exist in the MONDO ontology
- Now ingests the HGNC OWL file and maps all human UniProt IDs to HGNC IDs
Pull Request Overview
This PR transitions from using the MONDO ontology for HGNC gene mappings to directly using the Node Normalizer API to map all human UniProt proteins to their corresponding HGNC gene IDs. The change addresses the limitation where only genes present in MONDO were being mapped, enabling complete HGNC gene coverage for human UniProt entries.
Key changes:
- Replaced MONDO-based gene mapping with Node Normalizer API calls
- Added HGNC ontology processing to support comprehensive gene mappings
- Updated Node Normalizer API endpoint to version 1.5
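The mapping step described above can be sketched as follows. This is an illustration only, not code from the PR: the payload shape mirrors the public Node Normalizer `get_normalized_nodes` response, and the names `HGNC_PREFIX`, `extract_hgnc_id`, and the sample identifiers are hypothetical.

```python
from typing import Optional

# Illustrative constant; the PR uses its own HGNC prefix constant.
HGNC_PREFIX = "HGNC:"

def extract_hgnc_id(response_json: dict, node_curie: str) -> Optional[str]:
    """Return the HGNC curie equivalent to node_curie, or None if absent."""
    entry = response_json.get(node_curie)
    if not entry:  # the API returns null for unknown curies
        return None
    for iden in entry.get("equivalent_identifiers", []):
        if iden["identifier"].startswith(HGNC_PREFIX):
            return iden["identifier"]
    return None

# Abridged example payload in the shape of a Node Normalizer response
sample = {
    "UniProtKB:P04637": {
        "id": {"identifier": "NCBIGene:7157", "label": "TP53"},
        "equivalent_identifiers": [
            {"identifier": "NCBIGene:7157"},
            {"identifier": "HGNC:11998"},
            {"identifier": "UniProtKB:P04637"},
        ],
    }
}

print(extract_hgnc_id(sample, "UniProtKB:P04637"))  # HGNC:11998
```

Keeping the parsing separate from the HTTP call, as above, also makes the mapping logic testable without network access.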
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| kg_microbe/utils/uniprot_utils.py | Implements Node Normalizer API function and updates gene parsing logic |
| kg_microbe/transform_utils/wallen_etal/wallen_etal.py | Adds debugging statement |
| kg_microbe/transform_utils/uniprot_human/uniprot_human.py | Removes mondo_gene_dict parameter from function calls |
| kg_microbe/transform_utils/ontologies/ontologies_transform.py | Adds HGNC ontology processing and filters out gene property nodes |
| kg_microbe/transform_utils/constants.py | Adds HGNC-related prefixes and updates Node Normalizer URL |
```python
wallen_etal_df = pd.read_excel(input_file, skiprows=3, sheet_name=WALLEN_ETAL_TAB_NAME)
wallen_etal_df[FDR_COLUMN] = pd.to_numeric(wallen_etal_df[FDR_COLUMN], errors="coerce")

import pdb;pdb.set_trace()
```

Debug statement with `pdb.set_trace()` should not be committed to production code. This will cause the program to halt execution and wait for debugger input.

Suggested change: delete the line

```python
import pdb;pdb.set_trace()
```
```python
CHEBI_REGEX = re.compile(r'/ligand_id="ChEBI:(.*?)";')
GO_REGEX = re.compile(r"\[(.*?)\]")

# Takes cure in the form PREFIX:ID
```

Typo in comment: 'cure' should be 'curie'.

Suggested change:

```python
# Takes curie in the form PREFIX:ID
```
```python
except TypeError:
    return None
else:
```

The function lacks proper error handling for HTTP requests. If the API is unavailable or returns an error status, the function will raise an exception without providing a meaningful error message to help with debugging.

Suggested change:

```python
else:
    url = NODE_NORMALIZER_URL + node_curie
    try:
        # Make the HTTP request to NodeNormalizer
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        # Write response to file if it contains data
        entries = response.json()[node_curie]
        if len(entries) > 1:  # .strip().split("\n")
            for iden in entries["equivalent_identifiers"]:
                if iden["identifier"].split(":")[0] + ":" == HGNC_NEW_PREFIX:
                    norm_node = iden["identifier"]
                    return norm_node
        else:
            return None
    except requests.exceptions.RequestException as e:
        logging.error(f"HTTP request failed for node_curie '{node_curie}' at URL '{url}': {e}")
        return None
    except (ValueError, KeyError, TypeError) as e:
        logging.error(f"Error processing response for node_curie '{node_curie}' at URL '{url}': {e}")
```
```python
response.raise_for_status()

# Write response to file if it contains data
entries = response.json()[node_curie]
```

This line assumes the response JSON contains the `node_curie` key, but this could fail if the API response structure is different or if the key doesn't exist, potentially causing a `KeyError`.

Suggested change:

```python
entries = response.json().get(node_curie)
if entries is None:
    return None
```
```python
# Write response to file if it contains data
entries = response.json()[node_curie]
try:
    if len(entries) > 1:  # .strip().split("\n")
```

The condition `len(entries) > 1` doesn't make logical sense here. The code is checking if `entries` has more than one element, but then accesses `entries["equivalent_identifiers"]`, suggesting `entries` is a dictionary. The logic should verify `entries` is not None and contains the expected structure.

Suggested change:

```python
if entries and "equivalent_identifiers" in entries and entries["equivalent_identifiers"]:
```
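To see why the original condition is fragile: on a dict, `len()` counts top-level keys, so `len(entries) > 1` passes or fails for reasons unrelated to whether `equivalent_identifiers` is present. A minimal illustration (the payloads and `has_equivalents` helper are made up for this example):

```python
# Two hypothetical Node Normalizer entries, one with equivalents and one without
entry_with_ids = {
    "id": {"identifier": "NCBIGene:7157"},
    "equivalent_identifiers": [{"identifier": "HGNC:11998"}],
}
entry_without_ids = {"id": {"identifier": "NCBIGene:7157"}}

# len() counts dict keys, not equivalent identifiers
print(len(entry_with_ids))     # 2
print(len(entry_without_ids))  # 1

# The suggested guard checks the structure explicitly instead
def has_equivalents(entry):
    return bool(entry and "equivalent_identifiers" in entry
                and entry["equivalent_identifiers"])

print(has_equivalents(entry_with_ids))     # True
print(has_equivalents(entry_without_ids))  # False
```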
```python
# Rewrite edges file
with open(edges_file, "w") as new_ef:
    new_ef.write("\t".join(self.edge_header) + "\n")
    # new_ef.write("\t".join(self.edge_header) + "\n")
```

Commented-out code should be removed rather than left in the codebase. If this header writing logic is needed, it should be uncommented and used; otherwise it should be deleted.

Suggested change: delete the line

```python
# new_ef.write("\t".join(self.edge_header) + "\n")
```