Skip to content

[extras] add merge_safe as an alternative to merge_auto#14

Closed
pabloyoyoista wants to merge 3 commits intomasterfrom
merge-safe
Closed

[extras] add merge_safe as an alternative to merge_auto#14
pabloyoyoista wants to merge 3 commits intomasterfrom
merge-safe

Conversation

@pabloyoyoista
Copy link
Copy Markdown
Collaborator

Sofia has recurrently expressed dissatisfaction with merge_auto not working super well at reproducing data due to the update in some data types not being fit for data science. For example, while Counter's update does what one would expect, dict's update might simply override previous values from the iterator without giving out any feedback. To allow data scientists to avoid this issue, we add a new API that makes sure that previous values don't exist.

This is probably missing tests and some further discussion (like, should we consider the Counter type safe?), so marking it as draft

@berkeman
Copy link
Copy Markdown
Member

Yes, a silent override is not optimal. I wonder if we need a new function for this, or if it is sufficient to use the new "safe"-argument? Also, talking to Carl, we should probably profile this to find the fastest way to check for existing keys (maybe using set.union()?).

@pabloyoyoista
Copy link
Copy Markdown
Collaborator Author

Sorry it took so long to come back to this

I wonder if we need a new function for this, or if it is sufficient to use the new "safe"-argument?

I thought it would be easier to use with a new merge_safe function than always requiring using merge_auto(safe=True). In fairness, if this gets in, I'm not sure why would consumers that are not experts would ever want merge_auto() (so with safe=False as default). Seems more like a way to let people shoot themselves in the foot.

Also, talking to Carl, we should probably profile this to find the fastest way to check for existing keys (maybe using set.union()?)

Do you have any existing profiling tests for this? Otherwise I'm happy to try some performance tests with a recent Python version

@SofiaIngrid
Copy link
Copy Markdown

We have really found this safe merge useful and have found no use cases where we want to drop data silently. So I guess the the only drawback to ponder on is how to make it efficient performance wise. The biggest effect on changing to the "safer merge" have been that it have exposed so many bugs in our projects which have been both scary and humbling.

@pabloyoyoista pabloyoyoista force-pushed the merge-safe branch 4 times, most recently from cd433ec to 005c2f5 Compare October 2, 2025 08:32
@pabloyoyoista pabloyoyoista marked this pull request as ready for review October 2, 2025 08:33
Sofia has recurrently expressed dissatisfaction with merge_auto not
working super well at reproducing data due to the update in some data
types not being fit for data science. For example, while Counter's
update does what one would expect, dict's update might simply override
previous values from the iterator without giving out any feedback. To
allow data scientists to avoid this issue, we add a new API that makes
sure that previous values don't exist.
@pabloyoyoista
Copy link
Copy Markdown
Collaborator Author

We've met today and agreed that we want the new behavior to be the default, after some small testing showed that the new behavior only introduces an overhead of 1 second per 31 million rows. Of course, this breaks tests. So I'll fix them and we have a meeting queued in 2 weeks

We are basically returning a set, so we don't really care if the key
already exists
@pabloyoyoista pabloyoyoista deleted the merge-safe branch November 20, 2025 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants