Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes issues in the BacDive media processing pipeline by refactoring data accumulation and edge creation logic. The changes address data structure inconsistencies and improve the handling of metabolite utilization and enzyme activity data.
- Removes unused
NCBI_TO_ENZYME_EDGEimport and refactors enzyme activity handling - Changes data structures from dictionaries to tuples for consistent processing
- Adds accumulation logic to collect data per NCBITaxon before writing edges
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| ) | ||
| for assay_id in info["assays"]: | ||
| # Unpacking the assay information stored as tuples | ||
| assay_curie, assay_value, utilization_type = assay_id |
There was a problem hiding this comment.
The tuple unpacking assumes all assay items have exactly 3 elements, but enzyme activities are stored as 2-element tuples (lines 749, 758) while metabolite utilizations are stored as 3-element tuples. This will cause a ValueError when processing enzyme data.
| assay_curie, assay_value, utilization_type = assay_id | |
| # Unpacking the assay information stored as tuples (handle both 2- and 3-element tuples) | |
| if len(assay_id) == 3: | |
| assay_curie, assay_value, utilization_type = assay_id | |
| elif len(assay_id) == 2: | |
| assay_curie, assay_value = assay_id | |
| utilization_type = None | |
| else: | |
| raise ValueError(f"Unexpected assay tuple length: {len(assay_id)} for {assay_id}") |
| BACDIVE_PREFIX + key, | ||
| ] | ||
| edge_writer.writerow(meta_util_edges_to_write) | ||
| for k, _, _ in positive_chebi_activity: |
There was a problem hiding this comment.
This line is incorrectly indented and creates a syntax error. It should be aligned with the previous if statement or properly nested within it.
| writer_2.writerow(phys_and_meta_data) | ||
|
|
||
| if ncbitaxon_id: | ||
| if ncbitaxon_id not in self.ncbitaxon_info: |
There was a problem hiding this comment.
This data accumulation logic is duplicated later in the code (lines 604-656 and 767-814). The duplicate code should be consolidated into a single location or extracted into a helper method to improve maintainability.
| edge_writer.writerow( | ||
| [ | ||
| ncbitaxon_id, | ||
| NCBI_TO_METABOLITE_UTILIZATION_EDGE, |
There was a problem hiding this comment.
All assay data is being written with NCBI_TO_METABOLITE_UTILIZATION_EDGE edge type, but enzyme activities should use a different edge type since they represent different biological relationships.
| NCBI_TO_METABOLITE_UTILIZATION_EDGE, | |
| # Select edge type based on utilization_type | |
| if utilization_type == "enzyme_activity": | |
| edge_type = ENZYME_TO_ASSAY_EDGE | |
| else: | |
| edge_type = NCBI_TO_METABOLITE_UTILIZATION_EDGE | |
| edge_writer.writerow( | |
| [ | |
| ncbitaxon_id, | |
| edge_type, |
No description provided.