[RTI][SUB] Write the Ingestion Script for RTIs#5
[RTI][SUB] Write the Ingestion Script for RTIs#5ChanukaUOJ wants to merge 5 commits intoLDFLK:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request delivers the foundational ingestion script for the RTI Tracker project, written in Go. Its primary purpose is to automate the population of RTI data into the OpenGIN graph database. The script intelligently processes structured CSV data, creates new RTI entities as nodes, and dynamically links them to existing organizational entities, thereby building a comprehensive relationship graph. This initial step significantly streamlines the data onboarding process for the RTI system. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces an ingestion script for RTI data, which is a great step forward. The code is generally well-structured, separating concerns into different packages. However, I've identified several critical issues related to error handling, resource management, and logic that could lead to runtime panics, silent failures, or data duplication. Key areas for improvement include properly handling errors from file I/O and API calls, ensuring resource cleanup (like closing files), fixing a logical flaw in parent entity lookup, and making the data ingestion process idempotent. Please see my detailed comments for specific suggestions.
|
|
||
| // 1. Insert the RTI Entity to Graph | ||
| // Create a deterministic UUID so rerunning the script on the same data doesn't duplicate nodes | ||
| hashInput := fmt.Sprintf("%s_%s", entity.Created, entity.Index) |
There was a problem hiding this comment.
If a user tries to create a new rti request with the same created date and accidentally passes an index that already exists, this will give the same id and will overwrite the existing entity.
There was a problem hiding this comment.
Index is passed by user and its uniqueness is not enforced anywhere
There was a problem hiding this comment.
In the script the Index retrieves from the folder structure. Lets say in one particular date if multiple RTIs send, it should be stored in separate folders, like 1 , 2, 3, 4... like wise. So the use can't create multiple directories with the same name inside one date folder.
| var parentID string | ||
| if len(filteredSearchResult) > 0 { | ||
| sort.Slice(filteredSearchResult, func(i, j int) bool { | ||
| // Sort in descending order by created date | ||
| timeI, errI := time.Parse(time.RFC3339, filteredSearchResult[i].Created) | ||
| timeJ, errJ := time.Parse(time.RFC3339, filteredSearchResult[j].Created) | ||
| if errI != nil || errJ != nil { | ||
| return filteredSearchResult[i].Created > filteredSearchResult[j].Created | ||
| } | ||
| return timeI.After(timeJ) | ||
| }) | ||
|
|
||
| entityCreatedTime, err := time.Parse(time.RFC3339, entity.Created) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed time parsing") | ||
| } | ||
|
|
||
| for _, result := range filteredSearchResult { | ||
| resultTime, err := time.Parse(time.RFC3339, result.Created) | ||
| if err == nil && !resultTime.After(entityCreatedTime) { | ||
| parentID = result.ID | ||
| break | ||
| } | ||
| } | ||
|
|
||
| // Fallback: if no floor date is found or parse failed, pick the first one | ||
| if parentID == "" { | ||
| return nil, fmt.Errorf("skipping relation update (receiver not found for the given date): %s", entity.Created) | ||
| } | ||
| } |
There was a problem hiding this comment.
When it comes to ministers and departments we can't rely on the created time of the node, we need to check the created time of the relationship. For each filtered node, you can retrieve all AS_DEPARTMENT/AS_MINISTER (depending on the node's minor kind) relationships that are incoming and then check their active date for which node has a relationship that was active during the rti's created date. If there are multiple nodes with the same name and a relationship active on that date then return an error.
This PR includes the Ingestion Script to insert
RTIdata into the databases throughOpenGIN. As the first step this covers the changes to insert Nodes and create the relations in theGraphdb.