[extras] add merge_safe as an alternative to merge_auto by pabloyoyoista · Pull Request #14 · exaxorg/accelerator

pabloyoyoista · 2024-10-15T08:34:24Z

Sofia has repeatedly expressed dissatisfaction with merge_auto not reproducing data reliably, because the update method of some data types is not a good fit for data science. For example, while Counter's update does what one would expect, dict's update may silently override previous values from the iterator without giving any feedback. To let data scientists avoid this issue, we add a new API that makes sure no previous value exists for a key before merging.
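To make the difference concrete, here is a minimal standalone sketch using only the standard library. It does not call merge_auto; the two dicts are made-up stand-ins for per-slice results that happen to share a key.

```python
from collections import Counter

# Two made-up per-slice results that share the key "b".
slice_a = {"a": 1, "b": 2}
slice_b = {"b": 30, "c": 4}

# dict.update keeps only the last value seen for "b" and says nothing.
merged_dict = dict(slice_a)
merged_dict.update(slice_b)

# Counter.update adds the counts for "b" instead of replacing them.
merged_counter = Counter(slice_a)
merged_counter.update(slice_b)

print(merged_dict)     # the value 2 for "b" is gone
print(merged_counter)  # "b" holds 2 + 30
```

With plain dicts, the value from the first slice disappears without any error, which is the silent data loss described above.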

This is probably missing tests and needs some further discussion (for example, should we consider the Counter type safe?), so I'm marking it as a draft.

berkeman · 2024-10-25T12:44:15Z

Yes, a silent override is not optimal. I wonder if we need a new function for this, or if it is sufficient to use the new "safe"-argument? Also, talking to Carl, we should probably profile this to find the fastest way to check for existing keys (maybe using set.union()?).
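One way such profiling could be set up is sketched below with timeit. The three candidate checks and the dict sizes are assumptions chosen for illustration, not measurements from this PR. Note that a union alone does not reveal an overlap; the union-based variant has to compare the size of the union with the sum of the input sizes.

```python
import timeit

def overlap_isdisjoint(a, b):
    # Key views support isdisjoint, which can stop at the first shared key.
    return not a.keys().isdisjoint(b)

def overlap_intersection(a, b):
    # Builds the full set of shared keys, useful for error messages.
    return bool(a.keys() & b.keys())

def overlap_union_len(a, b):
    # A union smaller than the combined sizes means some key is shared.
    return len(a.keys() | b.keys()) < len(a) + len(b)

a = {i: i for i in range(50_000)}
b = {i: i for i in range(50_000, 100_000)}  # no keys shared with a
c = {i: i for i in range(49_999, 99_999)}   # shares the key 49_999 with a

timings = {}
for check in (overlap_isdisjoint, overlap_intersection, overlap_union_len):
    assert check(a, b) is False
    assert check(a, c) is True
    # Time the no-overlap case, where every key has to be examined.
    timings[check.__name__] = timeit.timeit(lambda: check(a, b), number=20)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 20 runs")
```

Absolute numbers will depend on the Python version and on the key types, so the comparison is only meaningful when run on data resembling real analysis results.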

pabloyoyoista · 2025-07-22T13:34:20Z

Sorry it took so long to come back to this

> I wonder if we need a new function for this, or if it is sufficient to use the new "safe"-argument?

I thought a new merge_safe function would be easier to use than always having to call merge_auto(safe=True). In fairness, if this gets in, I'm not sure why consumers who are not experts would ever want merge_auto() with safe=False as the default. It seems more like a way to let people shoot themselves in the foot.
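The relationship between the two entry points could look roughly like the sketch below. It is a standalone illustration for plain dicts: merge_dicts is a hypothetical stand-in, and nothing here reflects the actual merge_auto implementation or signature.

```python
def merge_dicts(parts, safe=False):
    """Merge an iterable of dicts; with safe=True, refuse duplicate keys.

    Hypothetical stand-in for illustration, not the accelerator API.
    """
    merged = {}
    for part in parts:
        if safe:
            duplicates = merged.keys() & part.keys()
            if duplicates:
                raise ValueError(f"keys already present: {sorted(duplicates)}")
        merged.update(part)
    return merged

def merge_safe(parts):
    # Thin convenience wrapper so callers do not have to remember safe=True.
    return merge_dicts(parts, safe=True)

print(merge_safe([{"a": 1}, {"b": 2}]))

try:
    merge_safe([{"a": 1}, {"a": 2}])
except ValueError as exc:
    print("refused:", exc)
```

A wrapper like this costs nothing at runtime beyond one extra call, so the choice between a new function and a changed default is mostly about which behavior callers get when they do not think about it.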

> Also, talking to Carl, we should probably profile this to find the fastest way to check for existing keys (maybe using set.union()?)

Do you have any existing profiling tests for this? Otherwise I'm happy to try some performance tests with a recent Python version

SofiaIngrid · 2025-08-21T09:41:06Z

We have found this safe merge really useful and have found no use cases where we want to drop data silently. So I guess the only drawback to ponder is how to make it efficient performance-wise. The biggest effect of changing to the "safer merge" has been that it has exposed so many bugs in our projects, which has been both scary and humbling.

pabloyoyoista · 2025-10-02T09:05:14Z

We met today and agreed that we want the new behavior to be the default, after a small test showed that it only introduces an overhead of 1 second per 31 million rows. Of course, this breaks tests, so I'll fix them. We have a meeting queued in 2 weeks.
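For scale, the reported figure can be converted to a per-row cost. This is only arithmetic on the two numbers quoted above, not a new measurement.

```python
# Figures quoted in the comment above.
overhead_seconds = 1.0
rows = 31_000_000

# Convert to nanoseconds per row.
ns_per_row = overhead_seconds / rows * 1e9
print(f"{ns_per_row:.1f} ns per row")  # roughly 32 ns
```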

pabloyoyoista force-pushed the merge-safe branch from 1249164 to 9112272 Compare July 22, 2025 13:28

pabloyoyoista force-pushed the merge-safe branch 4 times, most recently from cd433ec to 005c2f5 Compare October 2, 2025 08:32

pabloyoyoista marked this pull request as ready for review October 2, 2025 08:33

pabloyoyoista force-pushed the merge-safe branch from 005c2f5 to 439c5d8 Compare October 2, 2025 08:59

pabloyoyoista added 2 commits October 2, 2025 18:27

[test_dataset_fanout] set hashlabel of previous dataset

b86dd2d

[dataset_fanout_collect] allow_overwrite in merge_auto

e35b848

We are basically returning a set, so we don't really care if the key already exists
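The reasoning in that commit message can be illustrated with a small standalone sketch. It assumes the merged result is only ever consumed as a set of keys; the sample data is made up and is not taken from dataset_fanout_collect.

```python
# Made-up per-slice results; only the keys matter to the consumer.
per_slice = [
    {"red": 1, "green": 1},
    {"green": 2, "blue": 1},
]

merged = {}
for part in per_slice:
    merged.update(part)  # "green" is overwritten here

# The final key set is the same as a plain union of the per-slice keys,
# so the overwrite cannot change what is returned.
result = set(merged)
print(sorted(result))
```

When values do matter, the same overwrite would lose data, which is why the exception is only reasonable for this kind of set-like result.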

pabloyoyoista closed this Nov 20, 2025

pabloyoyoista deleted the merge-safe branch November 20, 2025 13:12

pabloyoyoista wants to merge 3 commits into master from merge-safe
