Conversation
This transform works on NERSC and produces correct edges and nodes. However, there is a critical data gap: we did not find a way to obtain the membership of a taxon's UniProt proteins in the UniRef90 clusters. We can therefore merge the UniRef90 transform, but for now it will be unlinked from taxa and proteins.
Pull Request Overview
This PR adds a new UniRef90 transformation module to process UniRef protein cluster data. The changes integrate UniRef data into the knowledge graph by creating nodes for NCBI taxonomy entities and protein clusters, along with edges representing taxonomic occurrence relationships.
- Implements UnirefTransform class to process UniRef90 API subset data
- Adds necessary constants and configurations for UniRef data integration
- Updates merge configuration to include UniRef data sources
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| merge.yaml | Adds UniRef data source configuration to the merge pipeline |
| kg_microbe/transform_utils/uniref/uniref.py | Core transformation logic for processing UniRef90 data |
| kg_microbe/transform_utils/uniref/__init__.py | Package initialization file for the UniRef transform module |
| kg_microbe/transform_utils/constants.py | Adds UniRef-specific constants and prefixes |
| kg_microbe/transform.py | Registers UnirefTransform in the available data sources |
| ncbitaxon_ids, ncbi_labels, strict=False |
| ) |
| ] |
| # nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name]) |
Remove commented-out code. The cluster node creation is handled in lines 92-94, making this commented line redundant.
| # nodes_data_to_write.append([cluster_id, CLUSTER_CATEGORY, cluster_name]) |
| progress.set_description(f"Processing Cluster: {cluster_id}") |
| # After each iteration, call the update method to advance the progress bar. |
| progress.update(2000) |
The magic number 2000 for progress updates is unclear. Consider making it a named constant or adding a comment explaining why this specific value is used.
| progress.update(2000) |
| progress.update(PROGRESS_UPDATE_INTERVAL) |
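The suggestion above can be sketched in isolation. This is illustrative only: `PROGRESS_UPDATE_INTERVAL` comes from the suggested change, while `process_rows` and `on_progress` are hypothetical names standing in for the PR's actual tqdm-driven loop.

```python
# Sketch of the reviewer's suggestion: replace the magic number 2000 with a
# named constant so the progress-bar step size is self-documenting.
# `process_rows` and `on_progress` are illustrative, not the PR's code; in
# the PR the callback would be a tqdm progress bar's update method.
PROGRESS_UPDATE_INTERVAL = 2000  # rows consumed per progress update

def process_rows(rows, on_progress):
    """Walk rows in fixed-size chunks, reporting progress once per chunk."""
    for start in range(0, len(rows), PROGRESS_UPDATE_INTERVAL):
        chunk = rows[start : start + PROGRESS_UPDATE_INTERVAL]
        # ... transform each row in `chunk` here ...
        on_progress(len(chunk))  # e.g. progress.update(len(chunk)) with tqdm

steps = []
process_rows(list(range(5000)), steps.append)
print(steps)  # three chunks: 2000 + 2000 + 1000 rows
```

Reporting `len(chunk)` rather than the constant also keeps the bar accurate on the final, shorter chunk.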
| for sublist in nodes_data_to_write |
| ] |
| node_writer.writerows(nodes_data_to_write) |
| gc.collect() |
Manual garbage collection after processing each row is likely to hurt performance more than it helps. Consider removing this call or invoking it less often (e.g., every 1000 rows).
| gc.collect() |
| # Call gc.collect() every 1000 rows |
| if row_count % 1000 == 0: |
|     gc.collect() |
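The batched-collection suggestion can be sketched as a self-contained loop. This is a sketch under stated assumptions: `GC_INTERVAL`, `write_nodes`, and `row_count` are illustrative names mirroring the suggestion, not the PR's actual code.

```python
# Sketch of the reviewer's suggestion: rather than calling gc.collect()
# after every row, trigger it once every GC_INTERVAL rows. Names here
# (GC_INTERVAL, write_nodes) are illustrative, not the PR's actual code.
import gc

GC_INTERVAL = 1000  # rows between explicit garbage-collection passes

def write_nodes(rows, writer):
    """Write rows one at a time, collecting garbage only periodically."""
    collections = 0
    for row_count, row in enumerate(rows, start=1):
        writer(row)
        # Per-row collection mostly adds overhead; collect in batches instead.
        if row_count % GC_INTERVAL == 0:
            gc.collect()
            collections += 1
    return collections

written = []
print(write_nodes(range(2500), written.append))  # 2 passes for 2500 rows
```

In CPython, explicit `gc.collect()` calls are rarely needed at all, since reference counting frees most objects immediately; batching (or deleting the call) avoids paying the cyclic-collector cost on every row.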
| for ncbitaxon_id in ncbitaxon_ids |
| ] |
| edge_writer.writerows(edges_data_to_write) |
| gc.collect() |
Manual garbage collection after processing each row is likely to hurt performance more than it helps. Consider removing this call or invoking it less often (e.g., every 1000 rows), as noted for the node-writing loop above.
| gc.collect() |