Conversation
…t duplicate rows into big query for the same snapshot_date
Pull request overview
Adds a pre-load BigQuery “snapshot already processed” check to avoid inserting duplicate rows for the same snapshot_date when running the GitHub ETL.
Changes:
- Add `snapshot_exists()` helper that queries BigQuery for an existing `(target_repository, snapshot_date)` row (using `pull_requests` as a sentinel table).
- Compute `snapshot_date` once in `_main()` and skip processing repositories that already have data for that date.
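The check described above could look roughly like the sketch below. The helper name, sentinel table, and column names come from this summary; `run_query` is a hypothetical stand-in for a real BigQuery client call (e.g. a thin wrapper around `google.cloud.bigquery.Client.query`) so the logic can be exercised without a live connection:

```python
# Hedged sketch of the snapshot_exists() pre-load check; `run_query`
# is a stand-in for a parameterized BigQuery query that yields rows.
SENTINEL_TABLE = "pull_requests"

def snapshot_exists(run_query, dataset, target_repository, snapshot_date):
    """Return True if (target_repository, snapshot_date) already has rows."""
    sql = (
        f"SELECT 1 FROM `{dataset}.{SENTINEL_TABLE}` "
        "WHERE target_repository = @repo AND snapshot_date = @snapshot_date "
        "LIMIT 1"
    )
    params = {"repo": target_repository, "snapshot_date": snapshot_date}
    # Any row at all means this snapshot was already loaded.
    return any(True for _ in run_query(sql, params))
```

`_main()` would then compute `snapshot_date` once up front and `continue` past any repository for which this returns `True`.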
Pull request overview
Adds BigQuery snapshot idempotency support to prevent duplicate rows for the same (repo, snapshot_date) by detecting existing snapshots and cleaning them up before reloading.
Changes:
- Added `snapshot_exists()` to detect whether a repo/date snapshot is already present (using `pull_requests` as a sentinel).
- Added `delete_existing_snapshot()` to delete prior rows for the repo/date across all ETL tables before reloading.
- Updated `load_data()` and `_main()` to compute `snapshot_date` once and pass it through to all inserts to avoid date-boundary skew.
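The cleanup step this summary describes might be sketched as follows. The table list is hypothetical (the real ETL would iterate its own table registry), and `run_query` again stands in for a BigQuery client call:

```python
# Hedged sketch of delete_existing_snapshot(): remove prior rows for
# (target_repository, snapshot_date) from every ETL table before
# reloading. ETL_TABLES is an assumed, illustrative list of names.
ETL_TABLES = ["pull_requests", "reviews", "review_comments"]

def delete_existing_snapshot(run_query, dataset, target_repository, snapshot_date):
    """Delete any previously loaded rows for this repo/date snapshot."""
    params = {"repo": target_repository, "snapshot_date": snapshot_date}
    for table in ETL_TABLES:
        run_query(
            f"DELETE FROM `{dataset}.{table}` "
            "WHERE target_repository = @repo AND snapshot_date = @snapshot_date",
            params,
        )
```

Computing `snapshot_date` once and threading it through every insert is what prevents the date-boundary skew: a run that straddles midnight would otherwise stamp early tables with one date and late tables with the next.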
shtrom
left a comment
LGTM, assuming the `pull_requests` table is the first one that gets any write.
cgsheeh
left a comment
For the Phab-ETL I recently implemented a change which solves the duplicate rows issue in a different fashion. Instead of deleting entries when we detect duplicates, we simply allow duplicate entries to be loaded into BQ and then de-duplicate at the end of the ETL. The Phab-ETL uses a "staging" table where we dump data before MERGE-ing it into the main table, but I think the same approach would work here after some modifications. The advantage is we can re-start the ETL without worrying about duplicate data, which is especially useful as the Phab-ETL takes several days to complete when run from the beginning.
Nothing wrong with this approach and it's a strict improvement, so this PR gets an r+ from me, but I wanted to call that out for your consideration in case we run into similar issues with the Github-ETL in the future. :)
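The staging-then-MERGE pattern described above could be sketched like this; all dataset, table, and column names here are illustrative, not taken from either ETL. Each run appends into `<table>_staging` (duplicates allowed), and a final MERGE inserts only rows not already present in the main table:

```python
# Hedged sketch of the end-of-ETL de-duplication MERGE. Duplicates in
# staging are collapsed with ROW_NUMBER(), and only rows missing from
# the main table are inserted, so restarts cannot create duplicates.
def build_dedup_merge_sql(dataset, table, key_columns):
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_columns)
    keys = ", ".join(key_columns)
    return (
        f"MERGE `{dataset}.{table}` T USING (\n"
        f"  SELECT * FROM `{dataset}.{table}_staging`\n"
        "  WHERE TRUE  -- BigQuery wants WHERE/GROUP BY/HAVING alongside QUALIFY\n"
        f"  QUALIFY ROW_NUMBER() OVER (PARTITION BY {keys}) = 1\n"
        ") S\n"
        f"ON {on_clause}\n"
        "WHEN NOT MATCHED THEN INSERT ROW"
    )
```

Because the MERGE only inserts unmatched rows, the ETL can be re-run from any point without special cleanup, which is the restartability advantage called out above.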