Skip to content

Bug 2024636 - Add a duplicate check to make sure that we do not insert duplicate rows into big query for the same snapshot date#13

Merged
dklawren merged 2 commits intomainfrom
duplication-check
Mar 20, 2026
Merged

Bug 2024636 - Add a duplicate check to make sure that we do not insert duplicate rows into big query for the same snapshot date#13
dklawren merged 2 commits intomainfrom
duplication-check

Conversation

@dklawren
Copy link
Contributor

No description provided.

…t duplicate rows into big query for the same snapshot_date
@dklawren dklawren requested a review from Copilot March 19, 2026 14:01
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a pre-load BigQuery “snapshot already processed” check to avoid inserting duplicate rows for the same snapshot_date when running the GitHub ETL.

Changes:

  • Add snapshot_exists() helper that queries BigQuery for an existing (target_repository, snapshot_date) row (using pull_requests as a sentinel table).
  • Compute snapshot_date once in _main() and skip processing repositories that already have data for that date.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@dklawren dklawren changed the title Bug 2024636 - Add a duplicate check to make sure that we do not insert duplicate rows into big query for the same snapshot_date Bug 2024636 - Add a duplicate check to make sure that we do not insert duplicate rows into big query for the same snapshot date Mar 19, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds BigQuery snapshot idempotency support to prevent duplicate rows for the same (repo, snapshot_date) by detecting existing snapshots and cleaning them up before reloading.

Changes:

  • Added snapshot_exists() to detect whether a repo/date snapshot is already present (using pull_requests as a sentinel).
  • Added delete_existing_snapshot() to delete prior rows for the repo/date across all ETL tables before reloading.
  • Updated load_data() and _main() to compute snapshot_date once and pass it through to all inserts to avoid date-boundary skew.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@dklawren dklawren requested review from cgsheeh, shtrom and zzzeid March 19, 2026 22:32
Copy link
Member

@shtrom shtrom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming the pr table is the first one that gets any write.

Copy link
Member

@cgsheeh cgsheeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Phab-ETL I recently implemented a change which solves the duplicate rows issue in a different fashion. Instead of deleting entries when we detect duplicates, we simply allow duplicate entries to be loaded into BQ and then de-duplicate at the end of the ETL. The Phab-ETL uses a "staging" table where we dump data before MERGE-ing it into the main table, but I think the same approach would work here after some modifications. The advantage is we can re-start the ETL without worrying about duplicate data, which is especially useful as the Phab-ETL takes several days to complete when run from the beginning.

Nothing wrong with this approach and it's a strict improvement, so this PR gets an r+ from me, but I wanted to call that out for your consideration in case we run into similar issues with the Github-ETL in the future. :)

@dklawren dklawren merged commit 407a47d into main Mar 20, 2026
8 checks passed
@dklawren dklawren deleted the duplication-check branch March 20, 2026 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants