Conversation
…t duplicate rows into big query for the same snapshot_date
Pull request overview
Adds a pre-load BigQuery “snapshot already processed” check to avoid inserting duplicate rows for the same snapshot_date when running the GitHub ETL.
Changes:
- Add `snapshot_exists()` helper that queries BigQuery for an existing `(target_repository, snapshot_date)` row (using `pull_requests` as a sentinel table).
- Compute `snapshot_date` once in `_main()` and skip processing repositories that already have data for that date.
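The check described above could look roughly like the sketch below. The helper name, sentinel table, and column names come from this summary; `run_query` is a hypothetical stand-in for a real BigQuery client call (e.g. a thin wrapper around `google.cloud.bigquery.Client.query`) so the logic can be exercised without a live connection:

```python
# Hedged sketch of the snapshot_exists() pre-load check; `run_query`
# is a stand-in for a parameterized BigQuery query that yields rows.
SENTINEL_TABLE = "pull_requests"

def snapshot_exists(run_query, dataset, target_repository, snapshot_date):
    """Return True if (target_repository, snapshot_date) already has rows."""
    sql = (
        f"SELECT 1 FROM `{dataset}.{SENTINEL_TABLE}` "
        "WHERE target_repository = @repo AND snapshot_date = @snapshot_date "
        "LIMIT 1"
    )
    params = {"repo": target_repository, "snapshot_date": snapshot_date}
    # Any row at all means this snapshot was already loaded.
    return any(True for _ in run_query(sql, params))
```

`_main()` would then compute `snapshot_date` once up front and `continue` past any repository for which this returns `True`.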
Pull request overview
Adds BigQuery snapshot idempotency support to prevent duplicate rows for the same (repo, snapshot_date) by detecting existing snapshots and cleaning them up before reloading.
Changes:
- Added `snapshot_exists()` to detect whether a repo/date snapshot is already present (using `pull_requests` as a sentinel).
- Added `delete_existing_snapshot()` to delete prior rows for the repo/date across all ETL tables before reloading.
- Updated `load_data()` and `_main()` to compute `snapshot_date` once and pass it through to all inserts to avoid date-boundary skew.
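The cleanup step this summary describes might be sketched as follows. The table list is hypothetical (the real ETL would iterate its own table registry), and `run_query` again stands in for a BigQuery client call:

```python
# Hedged sketch of delete_existing_snapshot(): remove prior rows for
# (target_repository, snapshot_date) from every ETL table before
# reloading. ETL_TABLES is an assumed, illustrative list of names.
ETL_TABLES = ["pull_requests", "reviews", "review_comments"]

def delete_existing_snapshot(run_query, dataset, target_repository, snapshot_date):
    """Delete any previously loaded rows for this repo/date snapshot."""
    params = {"repo": target_repository, "snapshot_date": snapshot_date}
    for table in ETL_TABLES:
        run_query(
            f"DELETE FROM `{dataset}.{table}` "
            "WHERE target_repository = @repo AND snapshot_date = @snapshot_date",
            params,
        )
```

Computing `snapshot_date` once and threading it through every insert is what prevents the date-boundary skew: a run that straddles midnight would otherwise stamp early tables with one date and late tables with the next.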
shtrom
left a comment
LGTM, assuming the `pull_requests` table is the first one that gets any write.
cgsheeh
left a comment
For the Phab-ETL I recently implemented a change which solves the duplicate rows issue in a different fashion. Instead of deleting entries when we detect duplicates, we simply allow duplicate entries to be loaded into BQ and then de-duplicate at the end of the ETL. The Phab-ETL uses a "staging" table where we dump data before MERGE-ing it into the main table, but I think the same approach would work here after some modifications. The advantage is we can re-start the ETL without worrying about duplicate data, which is especially useful as the Phab-ETL takes several days to complete when run from the beginning.
Nothing wrong with this approach and it's a strict improvement, so this PR gets an r+ from me, but I wanted to call that out for your consideration in case we run into similar issues with the Github-ETL in the future. :)
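The staging-then-MERGE pattern described above could be sketched like this; all dataset, table, and column names here are illustrative, not taken from either ETL. Each run appends into `<table>_staging` (duplicates allowed), and a final MERGE inserts only rows not already present in the main table:

```python
# Hedged sketch of the end-of-ETL de-duplication MERGE. Duplicates in
# staging are collapsed with ROW_NUMBER(), and only rows missing from
# the main table are inserted, so restarts cannot create duplicates.
def build_dedup_merge_sql(dataset, table, key_columns):
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_columns)
    keys = ", ".join(key_columns)
    return (
        f"MERGE `{dataset}.{table}` T USING (\n"
        f"  SELECT * FROM `{dataset}.{table}_staging`\n"
        "  WHERE TRUE  -- BigQuery wants WHERE/GROUP BY/HAVING alongside QUALIFY\n"
        f"  QUALIFY ROW_NUMBER() OVER (PARTITION BY {keys}) = 1\n"
        ") S\n"
        f"ON {on_clause}\n"
        "WHEN NOT MATCHED THEN INSERT ROW"
    )
```

Because the MERGE only inserts unmatched rows, the ETL can be re-run from any point without special cleanup, which is the restartability advantage called out above.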